Neural Architecture Search — Deep Dive
DARTS: Bi-Level Optimization
DARTS formulates NAS as a bi-level optimization problem. Let $\alpha$ be architecture parameters, $w$ be network weights, $\mathcal{L}{train}$ be training loss, $\mathcal{L}{val}$ be validation loss:
$$\min_\alpha \mathcal{L}{val}(w^(\alpha), \alpha)$$ $$\text{s.t. } w^(\alpha) = \arg\min_w \mathcal{L}{train}(w, \alpha)$$
Exactly solving the inner optimization to get $w^*(\alpha)$ is too expensive. DARTS approximates by:
- Update $w$ by descending on $\nabla_w \mathcal{L}_{train}(w, \alpha)$ (inner step)
- Update $\alpha$ by descending on $\nabla_\alpha \mathcal{L}{val}(w - \xi \nabla_w \mathcal{L}{train}, \alpha)$ (outer step)
Where $\xi$ is the inner learning rate. The approximation assumes $w$ is close to $w^*(\alpha)$ after each inner step.
The outer gradient requires second-order terms $\nabla^2_{w,\alpha} \mathcal{L}_{train}$. DARTS approximates this with a finite-difference approximation:
$$\nabla^2_{w,\alpha} \mathcal{L}{train}(w, \alpha) \approx \frac{\nabla\alpha \mathcal{L}{train}(w^+, \alpha) - \nabla\alpha \mathcal{L}_{train}(w^-, \alpha)}{2\epsilon}$$
Where $w^{\pm} = w \pm \epsilon \nabla_w \mathcal{L}_{val}$. This “first-order DARTS” avoids expensive Hessian computation.
Instability analysis (Zela et al., 2020): DARTS tends to degenerate — architecture parameters collapse to preferring skip connections and parameter-free operations. This happens because:
- Skip connections have near-zero loss during the architecture search (the unweighted skip is essentially a scaled residual)
- Gradient flow through skip connections is stronger (less attenuation)
- The loss decreases faster with skip connections regardless of generalization
Solutions: regularization on $\alpha$ norms, early stopping of architecture search, restricting the operation set.
Weight Entanglement in Supernets
One-shot NAS uses weight sharing — all candidate architectures share the same weights in the supernet. The theoretical concern: weights that work well for one architecture may not work well for another (weight entanglement).
Formally: the optimal weights for architecture $a_1$ may be $w_1^$, for $a_2$ are $w_2^$. The shared weight $w$ satisfies $\hat{a}^* = \arg\max_{a} \text{acc}(a, w)$, but $\hat{a}^$ may not equal $a^ = \arg\max_a \text{acc}(a, w_a^*)$.
Empirical evidence (Yu et al., 2020): Ranking correlation between supernet performance (using shared weights) and standalone performance (trained from scratch) is often 0.6–0.9 — decent but not perfect. Architectures that rank highly in the supernet generally rank highly standalone, but with errors.
Path dropout: Randomly drop some operations during supernet training, forcing other operations to compensate. Reduces weight entanglement by preventing any path from dominating.
Batch normalization statistics: Each architecture in the supernet has different path-specific statistics. During evaluation, need to collect fresh BN statistics (forward pass on the calibration data) for each candidate architecture — expensive but necessary.
Bayesian Optimization for NAS
Bayesian optimization (BO) is an efficient black-box optimization strategy that builds a probabilistic surrogate model of the true objective function and uses it to select the next evaluation point.
For NAS, the architecture is encoded as a vector, and BO fits a Gaussian Process (GP) surrogate:
$$f(a) \sim \mathcal{GP}(\mu(a), k(a, a’))$$
Where $k$ is a kernel measuring architectural similarity. New candidates are selected using acquisition functions like Expected Improvement (EI):
$$a_{next} = \arg\max_a \text{EI}(a) = \mathbb{E}[\max(f(a) - f^*, 0)]$$
Architecture kernels: Measuring “similarity” between architectures is non-trivial. Approaches:
- Edit distance: number of operations that differ
- Walk kernel: similar to graph kernel, counts matching paths in computation graphs
- Hamming distance on one-hot encoded operation choices
BO is most effective when the per-evaluation cost is high (full training) but the number of evaluations is small (< 1000 for typical BO). For cheap evaluations (proxy tasks), random search or evolutionary algorithms often match BO.
Multi-Objective NAS: Pareto Frontiers
In practice, NAS optimizes multiple objectives: accuracy, latency, memory, energy consumption. These are often in conflict — higher accuracy usually requires more compute.
Multi-objective NAS seeks the Pareto frontier: the set of architectures where no other architecture is better on all objectives simultaneously.
Formally, architecture $a_1$ dominates $a_2$ ($a_1 \succ a_2$) if $a_1$ is at least as good on all objectives and strictly better on at least one. The Pareto frontier is the set of non-dominated architectures.
NSGA-II for NAS: Apply the NSGA-II evolutionary algorithm (Non-dominated Sorting Genetic Algorithm) with architectural mutation and crossover operators. Each generation selects the Pareto-frontier architectures for reproduction.
MNasNet (Tan et al., 2019): RL-based NAS with a latency-aware reward: $$\text{Reward}(m) = \text{ACC}(m) \times [\text{LAT}(m) / T]^w$$
Where $T$ is target latency and $w$ is weight. $w < 0$ penalizes high latency. Different $w$ values shift the accuracy-latency tradeoff curve.
Once-For-All (OFA) Pareto extraction: OFA trains one supernet, then for each (accuracy, latency) point on the Pareto frontier, finds the optimal sub-network via evolutionary search. A single 40 GPU-hour supernet training supports efficient search for any device.
NAS for Large Language Models
NAS at LLM scale is emerging:
GPT-NAS: Search over depth (number of layers), width (hidden dimension), number of attention heads, FFN ratio, and head dimensions — while constraining total parameter count. The search space is much smaller than vision NAS (fewer structural choices) but the evaluation cost is much higher.
Layer-dropping / depth NAS: SandwichNets train supernets that can drop any set of transformer layers. At inference time, select the number of layers based on available compute. This is essentially hardware-adaptive NAS for transformers.
Mixture of Experts as NAS: MoE architectures (used in GPT-4, Mixtral, DeepSeek) can be viewed as learned NAS — the gating network selects which expert (which architectural variant) processes each token. The “search” happens implicitly during training.
MatMul-free architectures (2024): Exploration of transformer architectures that eliminate matrix multiplications, replacing with bitwise operations. Hardware-aware NAS is essential here — the optimal architecture depends heavily on hardware support for specific operations.
One thing to remember: NAS’s long-term significance isn’t in any specific algorithm — it’s in establishing that architecture design is a solvable optimization problem rather than pure art, which means future models will increasingly be designed by optimization processes guided by hardware constraints and efficiency objectives rather than human architectural intuitions.
See Also
- Activation Functions Why neural networks need these tiny mathematical functions — and how ReLU's simplicity accidentally made deep learning possible.
- Ai Agents Architecture How AI systems go from answering questions to actually doing things — the design patterns that turn language models into autonomous agents that browse, code, and plan.
- Ai Agents ChatGPT answers questions. AI agents actually do things — browse the web, write code, send emails, and keep going until the job is done. Here's the difference.
- Ai Ethics Why building AI fairly is harder than it sounds — bias, accountability, privacy, and who gets to decide what AI is allowed to do.
- Ai Hallucinations ChatGPT sometimes makes up facts with total confidence. Here's the weird reason why — and why it's not as simple as 'the AI lied.'