Neural Scaling Laws — Deep Dive
Deriving Scaling Laws: IsoFLOP Analysis
Hoffmann et al.’s Chinchilla paper used an elegant experimental design to discover compute-optimal scaling.
IsoFLOP experimental design: For a fixed compute budget $C$ (measured in FLOPs), train multiple models with different combinations of parameter count $N$ and training tokens $D$, each consuming exactly $C$ FLOPs. Since training costs approximately $6N$ FLOPs per token (forward plus backward pass):
$$C = 6 N D$$
For $C = 10^{21}$ FLOPs and varying $N$:
- $N = 10^9, D = \frac{10^{21}}{6 \times 10^9} = 1.67 \times 10^{11}$ tokens
- $N = 10^{10}, D = 1.67 \times 10^{10}$ tokens
- $N = 10^{11}, D = 1.67 \times 10^9$ tokens
Train all three for exactly the same total FLOPs and compare final loss. The configuration with the lowest loss identifies the compute-optimal $(N^*, D^*)$ for budget $C$.
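The grid above can be reproduced with a few lines of Python. This is a minimal sketch that assumes only the $C = 6ND$ relation; the function names are illustrative and not taken from the Chinchilla paper.

```python
def isoflop_tokens(compute_budget: float, n_params: float) -> float:
    """Training tokens D that exhaust the budget under C = 6 * N * D."""
    return compute_budget / (6.0 * n_params)


def isoflop_grid(compute_budget: float, param_counts: list) -> list:
    """Return (N, D) pairs that all cost the same number of training FLOPs."""
    return [(n, isoflop_tokens(compute_budget, n)) for n in param_counts]


if __name__ == "__main__":
    # Same budget and model sizes as the worked example in the text.
    for n, d in isoflop_grid(1e21, [1e9, 1e10, 1e11]):
        print(f"N = {n:.0e}, D = {d:.2e} tokens")
```

Every pair in the grid satisfies $6ND = C$, so any difference in final loss comes from how the budget is split rather than from how much compute was spent.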
Repeating across many compute budgets produces a set of $(C, N^*(C), D^*(C))$ tuples. Fitting power laws to these:
$$N^*(C) = G \cdot C^a, \quad D^*(C) = G^{-1} \cdot C^{1-a}$$
Chinchilla found $a \approx 0.49$, so the optimal model size and the optimal token count both grow roughly as $\sqrt{C}$, justifying the “equal scaling” recommendation.
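Fitting such a power law is a linear regression in log-log space. The sketch below uses synthetic points generated from an assumed exponent of 0.49 and an arbitrary, hypothetical prefactor of 0.1; it shows the fitting step only and does not reproduce any measured data.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = G * x**a in log-log space; returns (G, a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    a = sxy / sxx                # slope of the log-log line = exponent
    G = math.exp(my - a * mx)    # intercept of the log-log line = log(G)
    return G, a


# Synthetic (C, N*) pairs: hypothetical values, not experimental results.
budgets = [1e19, 1e20, 1e21, 1e22]
n_star = [0.1 * c ** 0.49 for c in budgets]

G_hat, a_hat = fit_power_law(budgets, n_star)
print(f"fitted exponent a = {a_hat:.3f}, prefactor G = {G_hat:.3f}")
```

With real IsoFLOP minima the points are noisy, so the recovered exponent carries an uncertainty that this noiseless example does not show.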
The 6ND Approximation: Where It Comes From
The “$C \approx 6ND$ FLOPs” approximation requires justification.
For a transformer layer with embedding dimension $d_{model}$ and MLP expansion factor 4:
- Attention (Q, K, V, and output projections, $4 d_{model}^2$ parameters): $\sim 8 d_{model}^2 N_{seq}$ FLOPs per forward pass
- MLP ($8 d_{model}^2$ parameters): $\sim 16 d_{model}^2 N_{seq}$ FLOPs per forward pass
- Total per layer: $\sim 24 d_{model}^2 N_{seq}$
For $L$ layers with $N_{params} \approx 12 L d_{model}^2$ (parameters in attention + MLP):
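- Forward pass: $\approx 2N_{params}$ FLOPs per token (one multiply and one add per parameter)
- Forward + backward: $\approx 6N_{params}$ FLOPs per token, since the backward pass costs about twice the forward pass
Hence $C \approx 6ND$. This approximation ignores the attention-score computation, whose cost grows with $N_{seq}$, so it holds for large, standard transformers with $N_{seq} \ll d_{model}$ (short sequences relative to model width).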
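The accounting can be checked numerically. This sketch counts 2 FLOPs per weight per token for the projection and MLP matrices and, as an added assumption, $4 N_{seq} d_{model}$ FLOPs per token per layer for the attention scores and the weighted sum over values. The configuration values are hypothetical.

```python
def param_count(d_model: int, n_layers: int) -> int:
    """Attention (4 d^2) plus MLP with expansion factor 4 (8 d^2) per layer."""
    return 12 * n_layers * d_model ** 2


def forward_flops_per_token(d_model: int, n_layers: int, n_seq: int = 0) -> int:
    """Forward-pass FLOPs per token, at 2 FLOPs per multiply-add.

    Pass n_seq=0 to drop the attention-score term and recover 2 * N_params.
    """
    projections = 8 * d_model ** 2      # Q, K, V and output projections
    mlp = 16 * d_model ** 2             # two matrices of size d x 4d
    scores = 4 * n_seq * d_model        # QK^T plus weighted sum over values
    return n_layers * (projections + mlp + scores)


# Hypothetical configuration, chosen only for illustration.
d_model, n_layers, n_seq = 4096, 32, 2048
n_params = param_count(d_model, n_layers)
ratio = forward_flops_per_token(d_model, n_layers, n_seq) / (2 * n_params)
print(f"forward FLOPs per token / 2N = {ratio:.3f}")
```

Under these assumptions the ratio is $1 + N_{seq} / (6 d_{model})$, which shows how the $2N$ estimate degrades as sequences get long relative to model width.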
Data-Constrained Scaling
The Chinchilla recommendation (equal model and data scaling) assumed essentially unlimited data. What happens when data runs out?
Muennighoff et al. (2023) “Scaling Data-Constrained Language Models” studied what happens when you train multiple times over the same data (epochs > 1).
Key findings:
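- 1 epoch (unique data): best loss per training token
- Up to about 4 epochs: negligible change in loss compared with training on the same number of unique tokens
- Beyond 4 epochs: each additional pass contributes less, and the value of extra compute decays
- 16 epochs and more: returns diminish sharply, and further repetition is rarely worth the compute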
With $k$ epochs of data repetition and compute budget $C$, the compute-optimal strategy shifts toward larger models (since data is the constraint, not compute). The optimal model size with $k$ data repetitions:
$$N_{opt}(k) \approx N_{opt}(1) \cdot k^{0.3}$$
For $k = 4$ (four passes over the data), the optimal model is $4^{0.3} \approx 1.5\times$ larger than the 1-epoch optimum for the same compute.
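The multiplier is easy to tabulate for other repetition counts. This sketch simply evaluates the $k^{0.3}$ rule of thumb stated above; the exponent is taken from this section as given, and the function name is hypothetical.

```python
def model_size_multiplier(k_epochs: float, exponent: float = 0.3) -> float:
    """Ratio N_opt(k) / N_opt(1) under the k**0.3 rule of thumb from the text."""
    return k_epochs ** exponent


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(f"k = {k:2d}: optimal model ~{model_size_multiplier(k):.2f}x the 1-epoch optimum")
```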
The internet data problem: By 2024, frontier models (GPT-4, Gemini, Claude) were estimated to have consumed most high-quality English internet text. Future models face a shortage of such text and must draw on other sources:
- Multi-lingual data
- Synthetic data (see synthetic-data topic)
- High-quality curated data (books, scientific papers, code)
- Private/proprietary data
This constraint shifts the compute-optimal frontier and is a key driver of synthetic data research.
Inference-Time Compute Scaling
Snell et al. (2024) identified a new scaling dimension: inference-time compute.
Training-time scaling law: For a fixed training budget $C_{train}$, minimize loss $L$, which follows $L \propto C_{train}^{-0.050}$.
Inference-time scaling: Given a trained model, spend $C_{inf}$ compute at inference (more sampling, longer reasoning chains) and measure performance as a function of $C_{inf}$.
Empirical finding: on reasoning-heavy tasks, spending extra compute at inference can be more compute-efficient than spending it on training:
- Marginal FLOPs during training: diminishing returns at large scale
- Marginal FLOPs during inference: can substitute for training FLOPs for specific tasks
The crossover point depends on task difficulty relative to the base model's ability. Snell et al. report that on problems where the base model already has a non-trivial success rate, extra inference-time compute (more samples, search against a verifier, or revision) can match a much larger model in a FLOPs-matched comparison. On the hardest problems, where the base model rarely succeeds, they find little benefit from additional inference compute, and more pretraining remains the more effective investment.
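A FLOPs-matched comparison can be sketched with the forward-pass estimate of about $2N$ FLOPs per token from earlier. The model sizes and sample length below are hypothetical, and the sketch ignores attention and KV-cache costs, so treat it as a rough budget calculation rather than a measurement.

```python
def inference_flops(n_params: float, tokens_per_sample: int, n_samples: int = 1) -> float:
    """Approximate decoding cost: 2 * N FLOPs per generated token per sample."""
    return 2.0 * n_params * tokens_per_sample * n_samples


def flops_matched_samples(n_small: float, n_large: float) -> int:
    """Samples a small model can draw for the cost of one equally long
    sample from a larger model."""
    return int(n_large // n_small)


# Hypothetical sizes: a 1B-parameter model versus a 14B-parameter model.
small, large = 1e9, 1.4e10
k = flops_matched_samples(small, large)
print(f"{k} samples from the small model cost about one sample from the large one")
```

Whether those extra samples help depends on having a way to pick the best one, such as a verifier or majority vote, which this sketch does not model.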
Economic Analysis of Compute-Optimal Training
Taking the Chinchilla scaling laws at face value, what does compute-optimal training cost?
For 2024’s best frontier models estimated at ~$10^{25}$ FLOPs:
- Optimal model size: $N^* \approx 3 \times 10^{11}$ parameters (about 290B)
- Optimal training tokens: $D^* \approx 6 \times 10^{12}$ tokens (about 6T tokens)
These values follow from $C = 6ND$ combined with Chinchilla's ratio of roughly 20 training tokens per parameter, which gives $N^* = \sqrt{C/120}$.
Training cost at \$3 per H100-hour on 1000 H100s, assuming about $10^{15}$ FLOP/s peak per GPU at 40% utilization (roughly $1.4 \times 10^{18}$ FLOPs per GPU-hour):
$$\text{GPU-hours} = \frac{10^{25}}{1.4 \times 10^{18}} \approx 7 \times 10^{6} \text{ GPU-hours} \approx 290 \text{ days on } 1000 \text{ GPUs}$$
At \$3/hour: $\sim$\$21M in raw GPU time for a compute-optimal model at this budget. (In practice, GPT-4 scale training costs are estimated at \$50–100M due to iterative runs, infrastructure, and researchers, not just raw GPU time.)
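The budget arithmetic can be sketched as follows. The ratio of 20 tokens per parameter, the peak throughput of about $10^{15}$ FLOP/s per GPU, the 40% utilization, and the price of 3 USD per GPU-hour are all assumptions of this sketch, so the outputs are order-of-magnitude estimates only.

```python
import math


def compute_optimal_split(total_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N; returns (N, D)."""
    n_params = math.sqrt(total_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


def training_cost(total_flops: float,
                  peak_flops_per_s: float = 1e15,   # assumed per-GPU peak
                  utilization: float = 0.4,          # assumed fraction of peak achieved
                  usd_per_gpu_hour: float = 3.0):    # assumed rental price
    """Return (gpu_hours, usd) for a training run of the given size."""
    flops_per_gpu_hour = peak_flops_per_s * utilization * 3600.0
    gpu_hours = total_flops / flops_per_gpu_hour
    return gpu_hours, gpu_hours * usd_per_gpu_hour


n_opt, d_opt = compute_optimal_split(1e25)
gpu_hours, usd = training_cost(1e25)
days_on_1000_gpus = gpu_hours / 1000 / 24

print(f"N* = {n_opt:.2e} params, D* = {d_opt:.2e} tokens")
print(f"{gpu_hours:.2e} GPU-hours, about {usd / 1e6:.0f}M USD, "
      f"{days_on_1000_gpus:.0f} days on 1000 GPUs")
```

Changing the utilization or the price shifts the result proportionally, which is why published cost estimates for the same compute budget vary widely.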
The doubling cost: Each capability step that requires ~1000x more compute costs ~1000x more at a constant price per FLOP, and under $L \propto C^{-0.050}$ a 1000x increase in compute lowers loss by only about 29%. If a frontier training program costs ~\$500M today, the next such step costs ~\$500B, which economically constrains further scaling.
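The diminishing returns implied by a small exponent can be made concrete. This sketch assumes the training-time exponent of 0.050 quoted earlier and a pure power law with no irreducible loss term, which is a simplification.

```python
def loss_ratio(compute_multiplier: float, exponent: float = 0.050) -> float:
    """New loss divided by old loss after scaling compute, for L ~ C**(-exponent)."""
    return compute_multiplier ** (-exponent)


def compute_multiplier_for(target_loss_ratio: float, exponent: float = 0.050) -> float:
    """Compute scale-up needed to reach a given new/old loss ratio."""
    return target_loss_ratio ** (-1.0 / exponent)


if __name__ == "__main__":
    print(f"1000x compute -> loss x {loss_ratio(1000):.3f}")
    print(f"halving the loss needs ~{compute_multiplier_for(0.5):.2e}x compute")
```

Under this simplified law, halving the loss takes about a million times more compute, which is why cost grows so much faster than measured improvement.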
One thing to remember: Scaling laws are one of AI’s most powerful planning tools, but they describe how performance scales with compute, not the cost of turning that performance into useful capabilities — and the gap between “better language model loss” and “better at things humans care about” determines whether the economics of scaling work out.