Neural Architecture Search — Core Concepts
The Architecture Design Problem
Choosing a neural network architecture involves thousands of decisions: number of layers, width at each layer, connection patterns, type of operations (convolution, attention, pooling), skip connections, normalization placement. The space of possible architectures is exponentially large.
Human designers constrain this space with intuition and experience — but human biases also limit exploration. The AlexNet-to-VGG-to-ResNet progression shows how architecture design evolved iteratively over years. NAS asks: can we automate this search?
Three Generations of NAS
1. Reinforcement Learning-Based NAS (2017)
Zoph & Le (Google Brain, 2017) framed NAS as a sequence generation problem. A recurrent neural network (the “controller”) generates architecture descriptions token by token: “layer 1: conv 3×3, layer 2: skip connection to layer 1…”
The controller is trained with reinforcement learning:
- Sample architecture $a$ from controller
- Train architecture $a$ on CIFAR-10 for N steps
- Measure validation accuracy as reward
- Update controller to make higher-reward architectures more likely
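The loop above can be sketched with a REINFORCE-style update. This is a minimal illustration, not the controller from the paper: the RNN is replaced by one independent categorical distribution per layer, and `toy_reward` is a hypothetical stand-in for "train on CIFAR-10 and measure validation accuracy". The names `OPS`, `NUM_LAYERS`, and the learning rate are assumptions made for the example.

```python
import math
import random

OPS = ["conv3x3", "conv5x5", "maxpool", "skip"]  # hypothetical operation vocabulary
NUM_LAYERS = 4                                   # hypothetical architecture length


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architecture(logits, rng):
    """Sample one operation index per layer from the controller's distributions."""
    return [rng.choices(range(len(OPS)), weights=softmax(row))[0] for row in logits]


def toy_reward(arch):
    """Stand-in for training the child network and reading validation accuracy.
    Hypothetical rule: the reward is the fraction of layers that chose conv3x3."""
    return sum(1.0 for op in arch if op == 0) / len(arch)


def reinforce_step(logits, rng, baseline, lr=0.5):
    """Sample an architecture, score it, and nudge the controller toward it
    in proportion to (reward - baseline)."""
    arch = sample_architecture(logits, rng)
    reward = toy_reward(arch)
    advantage = reward - baseline
    for layer, chosen in enumerate(arch):
        probs = softmax(logits[layer])
        for k in range(len(OPS)):
            # d log p(chosen) / d logit_k = 1[k == chosen] - p_k
            grad = (1.0 if k == chosen else 0.0) - probs[k]
            logits[layer][k] += lr * advantage * grad
    return arch, reward


def run_search(steps=2000, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * len(OPS) for _ in range(NUM_LAYERS)]
    baseline = 0.0
    for _ in range(steps):
        _, reward = reinforce_step(logits, rng, baseline)
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
    return logits
```

In a real search, almost all of the cost sits inside the reward computation, since every sampled architecture has to be trained before it can be scored.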
The original search used 800 GPUs for 28 days, and the discovered architecture rivaled the best handcrafted models on CIFAR-10. But the cost was prohibitive.
Block structure: Rather than searching full architectures, the follow-up work (Zoph et al., 2018) searched for a repeatable “cell” (a small directed acyclic graph of operations) that could be stacked multiple times, producing NASNet. This reduced the search space from ~$10^{13}$ to ~$10^7$ configurations.
2. DARTS: Differentiable Architecture Search (2019)
Liu et al. (2019) made NAS dramatically cheaper by relaxing the discrete search space to a continuous one.
Instead of choosing one operation per edge, maintain a softmax distribution over all possible operations: $$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})} o(x)$$
Where $\alpha_o^{(i,j)}$ are learnable architecture parameters.
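As a small sketch of the mixed operation above, the snippet below computes the softmax-weighted sum for a single edge. The candidate operations are hypothetical scalar functions standing in for convolutions, pooling, and skip connections, and `CANDIDATE_OPS` and `mixed_op` are names invented for this example.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations on a scalar input.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alpha):
    """Softmax-weighted sum of every candidate operation on one edge (i, j).
    `alpha` maps each operation name to its architecture parameter."""
    weights = softmax([alpha[name] for name in CANDIDATE_OPS])
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


# Equal alphas give each operation weight 1/3.
uniform = mixed_op(3.0, {"identity": 0.0, "double": 0.0, "zero": 0.0})
# A dominant alpha makes the mixture behave almost like that single operation.
peaked = mixed_op(3.0, {"identity": -20.0, "double": 20.0, "zero": -20.0})
```

Because the output is a smooth function of `alpha`, gradients of a loss with respect to the architecture parameters are well defined.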
Now the architecture parameters $\alpha$ and the network weights $w$ can be jointly optimized via gradient descent:
- Inner optimization: update $w$ on training data (standard SGD)
- Outer optimization: update $\alpha$ on validation data (gradient through the mixed operations)
After training, discretize by selecting the highest-weight operation at each edge.
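The alternating updates and the final discretization can be sketched on a toy problem. This assumes the first-order approximation (the $\alpha$ gradient treats the current $w$ as fixed), a single edge with three hypothetical scalar operations, and hand-derived gradients for a squared-error loss. It illustrates the update pattern only and is not the DARTS reference implementation.

```python
import math

OP_NAMES = ["identity", "scale", "zero"]  # hypothetical candidate operations


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w, alpha):
    """Mixed output on one edge; only the 'scale' op has a learnable weight w."""
    probs = softmax(alpha)
    outs = [x, w * x, 0.0]
    return sum(p * o for p, o in zip(probs, outs)), probs, outs


def search(train, val, steps=200, lr_w=0.1, lr_alpha=0.1):
    w, alpha = 1.0, [0.0, 0.0, 0.0]
    for _ in range(steps):
        # Inner step: update the weight w on training data, alpha held fixed.
        grad_w = 0.0
        for x, y in train:
            out, probs, _ = forward(x, w, alpha)
            grad_w += (out - y) * probs[1] * x
        w -= lr_w * grad_w / len(train)

        # Outer step: update alpha on validation data, w held fixed.
        grad_alpha = [0.0, 0.0, 0.0]
        for x, y in val:
            out, probs, outs = forward(x, w, alpha)
            for k in range(3):
                # d out / d alpha_k = p_k * (o_k(x) - out)
                grad_alpha[k] += (out - y) * probs[k] * (outs[k] - out)
        alpha = [a - lr_alpha * g / len(val) for a, g in zip(alpha, grad_alpha)]
    return w, alpha


def discretize(alpha):
    """Keep only the operation with the largest architecture parameter."""
    return OP_NAMES[max(range(len(alpha)), key=lambda k: alpha[k])]


# Toy data: the target is y = 2x, which only the 'scale' op can fit on its own.
train = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
val = [(x, 2.0 * x) for x in (1.5, 2.5)]
w, alpha = search(train, val)
chosen = discretize(alpha)
```

Splitting the data this way means $\alpha$ is selected for how well the mixture generalizes rather than for how well it fits the training set.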
DARTS cost: ~4 GPU-days, versus 800 GPUs × 28 days = 22,400 GPU-days for RL-based NAS, a reduction of roughly 5,600x.
DARTS problems: sensitivity to regularization and hyperparameters, an unstable search that needs careful tuning, and a tendency to favor skip connections over convolutions. A common explanation for the last point is that skip connections are parameter-free and easy to optimize, so their mixing weights grow early in the search. DARTS+ and numerous other variants address these issues.
3. One-Shot / Supernet Methods
One-Shot NAS: Train a single “supernet” that contains all possible architectures simultaneously, with weight sharing. Then search within this supernet using evolutionary algorithms or random search — no retraining needed.
SPOS (Single Path One-Shot, 2020): Train the supernet by uniformly sampling one active path per step. After training, an evolutionary search scores candidate architectures by evaluating each path with the inherited supernet weights, which is far cheaper than training every candidate. The best architecture found is then retrained from scratch and evaluated fully.
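The two phases described above, uniform single-path training followed by a search that reuses the shared weights, can be sketched as follows. Everything here is a toy assumption: each (layer, choice) pair holds a single scalar weight, `HIDDEN_QUALITY` stands in for whatever real training would learn, and plain random search replaces the evolutionary search.

```python
import random

NUM_LAYERS = 4
CHOICES_PER_LAYER = 3

# Hypothetical "true quality" of each choice, used only to drive the toy training.
HIDDEN_QUALITY = [
    [0.2, 0.9, 0.5],
    [0.7, 0.1, 0.3],
    [0.4, 0.6, 0.8],
    [0.9, 0.2, 0.1],
]


def sample_path(rng):
    """Uniformly pick one candidate operation per layer."""
    return tuple(rng.randrange(CHOICES_PER_LAYER) for _ in range(NUM_LAYERS))


def train_step(weights, path, lr=0.1):
    """Update only the weights that lie on the sampled path."""
    for layer, choice in enumerate(path):
        target = HIDDEN_QUALITY[layer][choice]
        weights[layer][choice] += lr * (target - weights[layer][choice])


def train_supernet(steps, rng):
    weights = [[0.0] * CHOICES_PER_LAYER for _ in range(NUM_LAYERS)]
    for _ in range(steps):
        train_step(weights, sample_path(rng))
    return weights


def evaluate(weights, path):
    """Score a path using the shared supernet weights, with no retraining."""
    return sum(weights[layer][choice] for layer, choice in enumerate(path))


def search(weights, num_candidates, rng):
    """Random search over sampled paths, ranked with inherited weights."""
    candidates = sorted({sample_path(rng) for _ in range(num_candidates)})
    return max(candidates, key=lambda p: evaluate(weights, p))
```

A typical call would be `search(train_supernet(300, rng), 50, rng)` with `rng = random.Random(0)`. The key property is that `evaluate` never trains anything, so ranking a candidate costs one forward evaluation.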
HireNAS, NASViT: Apply NAS to transformer architectures, searching attention head counts, MLP ratios, and embedding dimensions across layers simultaneously.
EfficientNet: Compound Scaling
Tan & Le (2019) used NAS to find an optimal baseline architecture (EfficientNet-B0), then studied how to scale it.
Three scaling dimensions:
- Depth: More layers → larger receptive field and more capacity, but diminishing returns and harder training (vanishing gradients)
- Width: More channels → finer-grained features, but accuracy plateaus quickly
- Resolution: Higher input resolution → finer details, but compute and activation memory grow quadratically with resolution
Manually scaling any single dimension has diminishing returns. The insight: these dimensions are interdependent. Doubling depth without increasing resolution leaves the model looking at too many large abstractions without fine detail. The “compound coefficient” $\phi$ scales all three together:
$$d = \alpha^\phi, \quad w = \beta^\phi, \quad r = \gamma^\phi$$
$$\text{subject to: } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \quad \alpha \geq 1, \ \beta \geq 1, \ \gamma \geq 1$$
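FLOPs scale linearly with depth and quadratically with width and resolution, so the constraint ensures that total FLOPs grow by roughly $2^\phi$, doubling with each unit increase in $\phi$. $\alpha, \beta, \gamma$ are found by a small grid search with $\phi = 1$ fixed.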
Empirically: $\alpha = 1.2, \beta = 1.1, \gamma = 1.15$ for EfficientNet.
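As a quick check of the arithmetic, the sketch below plugs the coefficients above into the scaling rule and computes the resulting FLOPs multiplier. The function name `compound_scale` is an assumption for illustration. The multiplier uses the same proportionality as the constraint (depth × width² × resolution²).

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # coefficients quoted in the text


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for a given phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth, quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these values the per-step FLOPs factor is $1.2 \times 1.1^2 \times 1.15^2 \approx 1.92$, which is why the constraint is written with $\approx 2$ rather than as an equality.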
Results: EfficientNet-B7 (84.4% ImageNet top-1) was 8.4x smaller and 6.1x faster at inference than GPipe, the best previous ConvNet, while matching or slightly exceeding its accuracy. The compound scaling insight applies beyond EfficientNet: EfficientDet uses it for detection, and EfficientNetV2 combines it with training-aware NAS and scaling.
Hardware-Aware NAS
Accuracy isn’t the only objective. On a mobile phone, you care about latency on the specific hardware (Snapdragon DSP, Apple Neural Engine). On a server, you care about throughput on an A100 GPU.
MobileNets: Designed for mobile via depthwise separable convolutions. MobileNetV2 added hand-designed inverted residual blocks, and MobileNetV3 used hardware-aware NAS to tune the architecture.
Once-for-All (OFA) (Cai et al., MIT Han Lab, 2020): Train a single supernet that can be deployed at any latency target by selecting a different sub-network. No retraining is needed when the hardware changes; you search for a different sub-network in the same supernet. The supernet supports more than $10^{19}$ sub-network configurations.
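The deployment step can be pictured as a constrained selection: among all sub-networks, keep the most accurate one that fits the latency budget. The sketch below is a deliberately simplified assumption. It uses one global (depth, expand ratio, kernel size) setting instead of per-layer choices, and both predictor functions are hypothetical formulas rather than measured lookup tables or learned predictors.

```python
from itertools import product

DEPTHS = [2, 3, 4]          # elastic depth options
EXPAND_RATIOS = [3, 4, 6]   # elastic width (expansion ratio) options
KERNEL_SIZES = [3, 5, 7]    # elastic kernel size options


def predicted_latency_ms(depth, expand, kernel):
    """Hypothetical latency model for one target device."""
    return 0.5 * depth * expand * (kernel / 3)


def predicted_accuracy(depth, expand, kernel):
    """Hypothetical accuracy predictor; larger sub-networks score higher."""
    return 60.0 + 3.0 * depth + 1.5 * expand + 0.5 * kernel


def select_subnet(budget_ms):
    """Return the most accurate configuration within the latency budget,
    or None if no configuration fits."""
    feasible = [
        cfg
        for cfg in product(DEPTHS, EXPAND_RATIOS, KERNEL_SIZES)
        if predicted_latency_ms(*cfg) <= budget_ms
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda cfg: predicted_accuracy(*cfg))
```

Changing hardware then amounts to swapping the latency model and calling `select_subnet` again. The supernet weights are untouched.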
MCUNet (Lin et al., MIT Han Lab, 2020): NAS specifically for microcontrollers (256KB RAM, 1MB flash). The discovered architectures allow inference with models of around 100KB, enabling on-device AI in the smallest IoT devices.
One thing to remember: NAS is most valuable not as “AI designing AI” but as a systematic way to explore the architecture-efficiency frontier — finding the optimal accuracy-cost tradeoff for a specific hardware target that would take years of human experimentation to discover.