Neural Architecture Search — Deep Dive

DARTS bi-level optimization, weight-sharing supernet training, Bayesian NAS, multi-objective NAS on Pareto frontiers, and NAS applied to LLM depth/width optimization.

DARTS: Bi-Level Optimization

DARTS formulates NAS as a bi-level optimization problem. Let $\alpha$ be architecture parameters, $w$ be network weights, $\mathcal{L}{train}$ be training loss, $\mathcal{L}{val}$ be validation loss:

$$\min_\alpha \mathcal{L}{val}(w^(\alpha), \alpha)$$ $$\text{s.t. } w^(\alpha) = \arg\min_w \mathcal{L}{train}(w, \alpha)$$

Exactly solving the inner optimization to get $w^*(\alpha)$ is too expensive. DARTS approximates by:

Update $w$ by descending on $\nabla_w \mathcal{L}_{train}(w, \alpha)$ (inner step)
Update $\alpha$ by descending on $\nabla_\alpha \mathcal{L}{val}(w - \xi \nabla_w \mathcal{L}{train}, \alpha)$ (outer step)

Where $\xi$ is the inner learning rate. The approximation assumes $w$ is close to $w^*(\alpha)$ after each inner step.

The outer gradient requires second-order terms $\nabla^2_{w,\alpha} \mathcal{L}_{train}$. DARTS approximates this with a finite-difference approximation:

$$\nabla^2_{w,\alpha} \mathcal{L}{train}(w, \alpha) \approx \frac{\nabla\alpha \mathcal{L}{train}(w^+, \alpha) - \nabla\alpha \mathcal{L}_{train}(w^-, \alpha)}{2\epsilon}$$

Where $w^{\pm} = w \pm \epsilon \nabla_w \mathcal{L}_{val}$. This “first-order DARTS” avoids expensive Hessian computation.

Instability analysis (Zela et al., 2020): DARTS tends to degenerate — architecture parameters collapse to preferring skip connections and parameter-free operations. This happens because:

Skip connections have near-zero loss during the architecture search (the unweighted skip is essentially a scaled residual)
Gradient flow through skip connections is stronger (less attenuation)
The loss decreases faster with skip connections regardless of generalization

Solutions: regularization on $\alpha$ norms, early stopping of architecture search, restricting the operation set.

Weight Entanglement in Supernets

One-shot NAS uses weight sharing — all candidate architectures share the same weights in the supernet. The theoretical concern: weights that work well for one architecture may not work well for another (weight entanglement).

Formally: the optimal weights for architecture $a_1$ may be $w_1^$, for $a_2$ are $w_2^$. The shared weight $w$ satisfies $\hat{a}^* = \arg\max_{a} \text{acc}(a, w)$, but $\hat{a}^$ may not equal $a^ = \arg\max_a \text{acc}(a, w_a^*)$.

Empirical evidence (Yu et al., 2020): Ranking correlation between supernet performance (using shared weights) and standalone performance (trained from scratch) is often 0.6–0.9 — decent but not perfect. Architectures that rank highly in the supernet generally rank highly standalone, but with errors.

Path dropout: Randomly drop some operations during supernet training, forcing other operations to compensate. Reduces weight entanglement by preventing any path from dominating.

Batch normalization statistics: Each architecture in the supernet has different path-specific statistics. During evaluation, need to collect fresh BN statistics (forward pass on the calibration data) for each candidate architecture — expensive but necessary.

Bayesian Optimization for NAS

Bayesian optimization (BO) is an efficient black-box optimization strategy that builds a probabilistic surrogate model of the true objective function and uses it to select the next evaluation point.

For NAS, the architecture is encoded as a vector, and BO fits a Gaussian Process (GP) surrogate:

$$f(a) \sim \mathcal{GP}(\mu(a), k(a, a’))$$

Where $k$ is a kernel measuring architectural similarity. New candidates are selected using acquisition functions like Expected Improvement (EI):

$$a_{next} = \arg\max_a \text{EI}(a) = \mathbb{E}[\max(f(a) - f^*, 0)]$$

Architecture kernels: Measuring “similarity” between architectures is non-trivial. Approaches:

Edit distance: number of operations that differ
Walk kernel: similar to graph kernel, counts matching paths in computation graphs
Hamming distance on one-hot encoded operation choices

BO is most effective when the per-evaluation cost is high (full training) but the number of evaluations is small (< 1000 for typical BO). For cheap evaluations (proxy tasks), random search or evolutionary algorithms often match BO.

Multi-Objective NAS: Pareto Frontiers

In practice, NAS optimizes multiple objectives: accuracy, latency, memory, energy consumption. These are often in conflict — higher accuracy usually requires more compute.

Multi-objective NAS seeks the Pareto frontier: the set of architectures where no other architecture is better on all objectives simultaneously.

Formally, architecture $a_1$ dominates $a_2$ ($a_1 \succ a_2$) if $a_1$ is at least as good on all objectives and strictly better on at least one. The Pareto frontier is the set of non-dominated architectures.

NSGA-II for NAS: Apply the NSGA-II evolutionary algorithm (Non-dominated Sorting Genetic Algorithm) with architectural mutation and crossover operators. Each generation selects the Pareto-frontier architectures for reproduction.

MNasNet (Tan et al., 2019): RL-based NAS with a latency-aware reward: $$\text{Reward}(m) = \text{ACC}(m) \times [\text{LAT}(m) / T]^w$$

Where $T$ is target latency and $w$ is weight. $w < 0$ penalizes high latency. Different $w$ values shift the accuracy-latency tradeoff curve.

Once-For-All (OFA) Pareto extraction: OFA trains one supernet, then for each (accuracy, latency) point on the Pareto frontier, finds the optimal sub-network via evolutionary search. A single 40 GPU-hour supernet training supports efficient search for any device.

NAS for Large Language Models

NAS at LLM scale is emerging:

GPT-NAS: Search over depth (number of layers), width (hidden dimension), number of attention heads, FFN ratio, and head dimensions — while constraining total parameter count. The search space is much smaller than vision NAS (fewer structural choices) but the evaluation cost is much higher.

Layer-dropping / depth NAS: SandwichNets train supernets that can drop any set of transformer layers. At inference time, select the number of layers based on available compute. This is essentially hardware-adaptive NAS for transformers.

Mixture of Experts as NAS: MoE architectures (used in GPT-4, Mixtral, DeepSeek) can be viewed as learned NAS — the gating network selects which expert (which architectural variant) processes each token. The “search” happens implicitly during training.

MatMul-free architectures (2024): Exploration of transformer architectures that eliminate matrix multiplications, replacing with bitwise operations. Hardware-aware NAS is essential here — the optimal architecture depends heavily on hardware support for specific operations.

One thing to remember: NAS’s long-term significance isn’t in any specific algorithm — it’s in establishing that architecture design is a solvable optimization problem rather than pure art, which means future models will increasingly be designed by optimization processes guided by hardware constraints and efficiency objectives rather than human architectural intuitions.

neural-architecture-searchdartsbi-level-optimizationweight-entanglementpareto-nasllm-nas

Neural Architecture Search — Deep Dive

DARTS: Bi-Level Optimization

Weight Entanglement in Supernets

Bayesian Optimization for NAS

Multi-Objective NAS: Pareto Frontiers

NAS for Large Language Models

See Also

Related Topics