Neural Scaling Laws — Deep Dive

Deriving Scaling Laws: IsoFLOP Analysis

Hoffmann et al.’s Chinchilla paper used an elegant experimental design to discover compute-optimal scaling.

IsoFLOP experimental design: For a fixed compute budget $C$ (measured in FLOPs), train multiple models with different $(N, D)$ combinations that each consume exactly $C$ FLOPs. Since a single training step requires approximately $6N$ FLOPs per token:

$$C = 6 N D$$

For $C = 10^{21}$ FLOPs and varying $N$:

  • $N = 10^9, D = \frac{10^{21}}{6 \times 10^9} = 1.67 \times 10^{11}$ tokens
  • $N = 10^{10}, D = 1.67 \times 10^{10}$ tokens
  • $N = 10^{11}, D = 1.67 \times 10^9$ tokens

Train all three for exactly the same total FLOPs and compare the final loss. The minimum identifies the compute-optimal $(N^*, D^*)$ for budget $C$.
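The grid above can be reproduced with a few lines of arithmetic. This is only a sketch of the bookkeeping, assuming the $C = 6ND$ relation from the text; the function name `isoflop_tokens` is made up for illustration.

```python
def isoflop_tokens(compute_flops: float, n_params: float) -> float:
    """Training tokens D that exhaust the budget under C = 6 * N * D."""
    return compute_flops / (6.0 * n_params)


budget = 1e21  # FLOPs, the budget used in the text
grid = [(n, isoflop_tokens(budget, n)) for n in (1e9, 1e10, 1e11)]

for n, d in grid:
    # Every (N, D) pair on this grid costs the same total compute.
    print(f"N = {n:.0e}  D = {d:.2e} tokens  tokens/param = {d / n:.3g}")
```

Each row spends the same compute but trades model size against data, which is what makes the final losses directly comparable.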

Repeating across many compute budgets produces a set of $(C, N^*(C), D^*(C))$ tuples. Fitting power laws to these:

$$N^*(C) = G \cdot (C/6)^a, \quad D^*(C) = G^{-1} \cdot (C/6)^{1-a}$$

The factor of 6 keeps the fit consistent with $C = 6ND$, since $N^* D^* = C/6$.

Chinchilla found $a \approx 0.49$, so model size and training tokens should grow at nearly the same rate as compute increases, justifying the “equal scaling” recommendation.
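Fitting such a power law is a straight-line fit in log-log space. The sketch below assumes a generic form $N^* = k \cdot C^a$ with any constant factors absorbed into the prefactor $k$. The data points are synthetic, generated from a known exponent purely to show that the fit recovers it; they are not measurements from the paper, and `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**a in log-log space; returns (k, a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    a = sxy / sxx                        # slope = exponent
    k = math.exp(mean_y - a * mean_x)    # intercept = log(prefactor)
    return k, a


# Synthetic "IsoFLOP minima": illustration only, not real results.
budgets = [1e18, 1e19, 1e20, 1e21, 1e22]
n_star = [0.1 * c ** 0.49 for c in budgets]

k, a = fit_power_law(budgets, n_star)
print(f"fitted exponent a = {a:.3f}, prefactor k = {k:.3g}")
```

With real IsoFLOP minima the points would scatter around the line, and the uncertainty on the exponent matters as much as its value.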

The 6ND Approximation: Where It Comes From

The “$C \approx 6ND$ FLOPs” approximation requires justification.

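For a transformer layer with embedding dimension $d_{model}$ and MLP expansion factor 4, processing $N_{seq}$ tokens and counting each multiply-add as 2 FLOPs:

  • Attention projections (Q, K, V, output): $4 d_{model}^2$ parameters, so $\sim 8 d_{model}^2 N_{seq}$ FLOPs per forward pass
  • MLP: $8 d_{model}^2$ parameters, so $\sim 16 d_{model}^2 N_{seq}$ FLOPs per forward pass
  • Total per layer: $\sim 24 d_{model}^2 N_{seq}$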

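For $L$ layers with $N_{params} \approx 12 L d_{model}^2$ (parameters in attention + MLP, ignoring embeddings):

  • Forward pass: $\approx 2N_{params}$ FLOPs per token (one multiply and one add per parameter)
  • Backward pass: $\approx 4N_{params}$ FLOPs per token (gradients with respect to both activations and weights, about twice the forward cost)
  • Forward + backward: $\approx 6N_{params}$ FLOPs per token

Hence $C \approx 6ND$. The count omits the attention-score computation ($QK^\top$ and the weighted sum over values), which adds roughly $4 N_{seq} d_{model}$ FLOPs per token per layer. The approximation therefore holds for large, standard transformers where $N_{seq}$ is small relative to $d_{model}$ (short sequences relative to model width).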
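To see when the sequence-length condition matters, one can compare the weight-matrix FLOPs with the attention-score FLOPs that the $6ND$ rule leaves out. This is a rough sketch, assuming $12 d_{model}^2$ parameters per layer (a $4 d_{model}^2$ attention plus $8 d_{model}^2$ MLP split) and about $4 N_{seq} d_{model}$ score FLOPs per token per layer. The function names and example shapes are illustrative choices.

```python
def forward_flops_per_token_per_layer(d_model: int, n_seq: int) -> dict:
    """Split one layer's forward FLOPs per token into weight and score terms."""
    params_per_layer = 12 * d_model ** 2   # assumed: 4 d^2 attention + 8 d^2 MLP
    weight_flops = 2 * params_per_layer    # one multiply-add per parameter
    score_flops = 4 * n_seq * d_model      # QK^T plus weighted sum over values
    return {
        "weights": weight_flops,
        "scores": score_flops,
        "score_to_weight": score_flops / weight_flops,
    }


def training_flops(n_params: float, n_tokens: float) -> float:
    """The C = 6 * N * D estimate (forward + backward, weights only)."""
    return 6.0 * n_params * n_tokens


# Example shapes chosen for illustration, not taken from any specific model.
for d_model, n_seq in [(4096, 512), (4096, 2048), (4096, 32768)]:
    r = forward_flops_per_token_per_layer(d_model, n_seq)
    print(f"d_model={d_model} n_seq={n_seq} "
          f"score/weight ratio = {r['score_to_weight']:.3f}")
```

The ratio works out to $N_{seq} / (6\, d_{model})$, so the omitted term stays small for short contexts and becomes significant for very long ones.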

Data-Constrained Scaling

The Chinchilla recommendation (equal model and data scaling) assumed essentially unlimited data. What happens when data runs out?

Muennighoff et al. (2023) “Scaling Data-Constrained Language Models” studied what happens when you train multiple times over the same data (epochs > 1).

Key findings:

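  • 1 epoch (unique data): the baseline; each token is most valuable the first time it is seen
  • Up to ~4 epochs: negligible change in loss compared with training on the same number of unique tokens
  • Beyond ~4 epochs: the value of each repeated token decays quickly
  • Around 16 epochs and more: additional compute buys little further improvement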

With $k$ epochs of data repetition and compute budget $C$, the compute-optimal strategy shifts toward larger models (since data is the constraint, not compute). The optimal model size with $k$ data repetitions:

$$N_{opt}(k) \approx N_{opt}(1) \cdot k^{0.3}$$

For $k = 4$ (four passes over the data): the optimal model is $4^{0.3} \approx 1.5\times$ larger than the 1-epoch optimum for the same compute.
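The rule of thumb above is easy to tabulate. The sketch below simply evaluates the $k^{0.3}$ factor as stated in the text, taking the exponent as given rather than deriving it; the function name is hypothetical.

```python
def repeated_data_model_scale(k: float, exponent: float = 0.3) -> float:
    """Multiplier on the 1-epoch optimal model size after k passes over the data."""
    return k ** exponent


for k in (1, 2, 4, 8, 16):
    print(f"k = {k:>2} epochs -> model size x{repeated_data_model_scale(k):.2f}")
```

Because the exponent is well below 1, the suggested model size grows much more slowly than the number of passes.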

The internet data problem: By 2024, frontier models (GPT-4, Gemini, Claude) were estimated to have consumed most high-quality English internet text. Future models face data scarcity and will have to draw on other sources:

  • Multi-lingual data
  • Synthetic data (see synthetic-data topic)
  • High-quality curated data (books, scientific papers, code)
  • Private/proprietary data

This constraint shifts the compute-optimal frontier and is a key driver of synthetic data research.

Inference-Time Compute Scaling

Snell et al. (2024) identified a new scaling dimension: inference-time compute.

Training-time scaling law: For a fixed training budget $C_{train}$, minimize loss $L$. Empirically, loss falls as a power law in compute, $L \propto C_{train}^{-0.050}$.

Inference-time scaling: Given a trained model, spend $C_{inf}$ compute at inference (more samples, longer reasoning chains) and measure performance as a function of $C_{inf}$.
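To get a feel for how shallow the training-time exponent is, the sketch below evaluates the power law $L \propto C_{train}^{-0.050}$ for a few compute multipliers and inverts it to ask how much compute a target loss reduction would need. It assumes a pure power law with no irreducible-loss term, which is a simplification; both function names are illustrative.

```python
def loss_ratio(compute_multiplier: float, exponent: float = 0.050) -> float:
    """New loss divided by old loss after scaling training compute."""
    return compute_multiplier ** (-exponent)


def compute_multiplier_for(target_loss_ratio: float, exponent: float = 0.050) -> float:
    """Compute multiplier needed to reach new_loss / old_loss = target_loss_ratio."""
    return target_loss_ratio ** (-1.0 / exponent)


for mult in (10, 100, 1000):
    print(f"{mult:>5}x compute -> loss x{loss_ratio(mult):.3f}")

print(f"halving the loss needs ~{compute_multiplier_for(0.5):.3g}x compute")
```

Under these assumptions even a 1000x increase in training compute lowers loss by well under half, which is why an alternative way to spend compute is attractive.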

Empirical finding: on reasoning-heavy tasks, spending extra compute at inference can be more compute-efficient than spending it on training:

  • Marginal FLOPs during training: diminishing returns at large scale
  • Marginal FLOPs during inference: can substitute for training FLOPs for specific tasks

The crossover point depends on task difficulty. For easy tasks, training dominates. For very hard tasks (competition math, novel research), inference-time compute (via extended reasoning) can provide improvements that would require orders of magnitude more training compute.

Economic Analysis of Compute-Optimal Training

Taking the Chinchilla scaling laws at face value, what does compute-optimal training cost?

For 2024’s best frontier models, estimated at ~$10^{25}$ FLOPs, combining $C = 6ND$ with the Chinchilla rule of thumb of roughly 20 training tokens per parameter gives:

  • Optimal model size: $N^* \approx \sqrt{C/120} \approx 3 \times 10^{11}$ parameters (~300B)
  • Optimal training tokens: $D^* \approx 20 N^* \approx 6 \times 10^{12}$ tokens (~6T tokens)

Training cost at \$3 per H100-hour on 1,000 H100s, assuming ~$10^{15}$ FLOP/s peak dense BF16 throughput per GPU and 40% utilization (about $1.4 \times 10^{18}$ FLOPs per GPU-hour):

$$\text{GPU-hours} = \frac{10^{25}}{1.4 \times 10^{18}} \approx 7 \times 10^{6} \text{ GPU-hours}$$

At \$3/hour: about \$21M of raw GPU time for a compute-optimal run, or roughly 290 days of wall-clock time on 1,000 GPUs. (In practice, GPT-4 scale training costs are estimated at \$50–100M due to iterative runs, infrastructure, and researchers, not just raw GPU time.)
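The cost arithmetic can be sketched as a small calculator. Every input below is an assumption stated for illustration: roughly 20 tokens per parameter for the compute-optimal split, about $10^{15}$ FLOP/s peak throughput per GPU, 40% hardware utilization, a price of 3 USD per GPU-hour, and a 1,000-GPU cluster. The function names are hypothetical, and changing any input changes the answer proportionally.

```python
import math


def compute_optimal_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N; returns (N, D)."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


def training_cost(compute_flops: float,
                  peak_flops_per_s: float = 1e15,   # assumed per-GPU peak
                  utilization: float = 0.4,         # assumed fraction of peak achieved
                  usd_per_gpu_hour: float = 3.0,    # assumed rental price
                  n_gpus: int = 1000) -> dict:
    """Raw GPU-hours, dollar cost, and wall-clock days for one training run."""
    flops_per_gpu_hour = peak_flops_per_s * utilization * 3600.0
    gpu_hours = compute_flops / flops_per_gpu_hour
    return {
        "gpu_hours": gpu_hours,
        "usd": gpu_hours * usd_per_gpu_hour,
        "wall_clock_days": gpu_hours / n_gpus / 24.0,
    }


n_opt, d_opt = compute_optimal_allocation(1e25)
cost = training_cost(1e25)
print(f"N* = {n_opt:.2e} params, D* = {d_opt:.2e} tokens")
print(f"{cost['gpu_hours']:.2e} GPU-hours, ${cost['usd']:.2e}, "
      f"{cost['wall_clock_days']:.0f} days on 1000 GPUs")
```

The estimate covers a single successful run only; failed runs, ablations, storage, networking, and staff are outside its scope.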

The doubling cost: Suppose each meaningful jump in capability takes ~1000x more compute. Under $L \propto C^{-0.05}$ that buys only about a 29% reduction in loss ($1000^{-0.05} \approx 0.71$), yet it costs roughly 1000x more. If a frontier training effort costs ~\$500M all-in, the next such jump costs ~\$500B, which economically constrains further scaling.

One thing to remember: Scaling laws are one of AI’s most powerful planning tools, but they describe how performance scales with compute, not the cost of turning that performance into useful capabilities — and the gap between “better language model loss” and “better at things humans care about” determines whether the economics of scaling work out.
