Neural Scaling Laws — Core Concepts

The Original OpenAI Scaling Laws

Kaplan et al. (OpenAI, 2020) “Scaling Laws for Neural Language Models” characterized power-law relationships between model performance and scale.

For language model cross-entropy loss $L$ as a function of:

  • Model size $N$ (non-embedding parameters): $L(N) \propto N^{-\alpha_N}$ with $\alpha_N \approx 0.076$
  • Dataset size $D$ (tokens): $L(D) \propto D^{-\alpha_D}$ with $\alpha_D \approx 0.095$
  • Compute $C$ (FLOPs): $L(C) \propto C^{-\alpha_C}$ with $\alpha_C \approx 0.050$
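Exponents like these are typically estimated as the slope of a straight line in log-log space. The sketch below shows the idea with an ordinary least-squares fit. The data is synthetic, generated from an exact power law using the $\alpha_N$ value quoted above, and the helper name `fit_power_law` and the prefactor 10.0 are made up for illustration. It is not the fitting procedure used in the paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) in log-log space.

    Returns (a, alpha). All inputs must be positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - alpha * log x, so alpha is the negated slope.
    return math.exp(intercept), -slope

# Synthetic (not measured) losses from an exact power law with alpha = 0.076.
sizes = [1e6, 1e7, 1e8, 1e9, 1e10]
losses = [10.0 * n ** -0.076 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"fitted prefactor a = {a:.3f}, exponent alpha = {alpha:.4f}")
```

On real training runs the points scatter around the line, so the fitted exponent carries uncertainty that this noiseless example hides.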

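The joint scaling law combines these: $$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$

Where $N_c$ and $D_c$ are fitted constants. As $D \to \infty$ this reduces to $L(N) = (N_c/N)^{\alpha_N}$, and as $N \to \infty$ to $L(D) = (D_c/D)^{\alpha_D}$, so it agrees with the single-variable laws above.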
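A minimal sketch of evaluating a joint law of this kind, using the bracketed form in which the model term is raised to $\alpha_N/\alpha_D$ and the whole sum to $\alpha_D$. The exponents are the ones quoted above. The values of `N_C` and `D_C` are not given in this article; they are what I recall Kaplan et al. reporting and should be treated as illustrative assumptions.

```python
import math

ALPHA_N = 0.076
ALPHA_D = 0.095
N_C = 8.8e13  # critical model size in non-embedding parameters (assumed value)
D_C = 5.4e13  # critical dataset size in tokens (assumed value)

def joint_loss(n_params, n_tokens):
    """L(N, D) = [ (N_c / N)^(alpha_N / alpha_D) + D_c / D ]^alpha_D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D

# Hold the model at 1B parameters and grow the dataset: the loss approaches
# the model-limited floor (N_c / N)^alpha_N and stops improving.
floor = (N_C / 1e9) ** ALPHA_N
for tokens in (1e9, 1e10, 1e11, 1e12):
    print(f"D = {tokens:.0e} tokens -> L = {joint_loss(1e9, tokens):.3f}")
print(f"model-limited floor at N = 1e9: {floor:.3f}")
```

The loop illustrates the bottleneck behaviour the formula encodes: once data is plentiful, only a larger model lowers the loss further.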

Key findings:

  1. Performance improves predictably with scale — no fundamental wall observed
  2. Larger models are more sample efficient (achieve the same loss with fewer tokens)
  3. For fixed compute, the optimal model is much larger than previously trained

Implication: models of the time were too small for the compute spent training them. Under this analysis, most additional compute should go to model size rather than training duration.

Chinchilla: Correcting the Scaling Allocation

Hoffmann et al. (DeepMind, 2022) “Training Compute-Optimal Large Language Models” challenged Kaplan et al.’s practical recommendations.

Kaplan et al. had trained many models for different durations on a fixed dataset, concluding that for a compute budget $C$: $N_{opt} \propto C^{0.73}$, $D_{opt} \propto C^{0.27}$ — dedicate most compute to model size.

Hoffmann et al. instead varied both model size AND number of training tokens simultaneously, finding:

$$N_{opt} \propto C^{0.49}, \quad D_{opt} \propto C^{0.51}$$

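The compute-optimal allocation is roughly equal between model size and data: the two should grow in proportion, so every doubling of model size calls for a doubling of training tokens. Because compute scales with their product, doubling compute grows each by about $\sqrt{2} \approx 1.4\times$, not $2\times$.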
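The two allocation rules can be compared numerically. The sketch below uses the exponents quoted above. The `allocate` helper additionally assumes the common approximation $C \approx 6ND$ training FLOPs and a fixed ratio of 20 tokens per parameter (the ratio implied by a 70B-parameter model trained on 1.4T tokens). Neither assumption is stated in this article, and a fixed ratio corresponds to exponents of exactly 0.5 each.

```python
def growth_factors(compute_multiplier, n_exponent, d_exponent):
    """Return (model growth, token growth) when compute grows by compute_multiplier."""
    return compute_multiplier ** n_exponent, compute_multiplier ** d_exponent

def allocate(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens).

    Assumes C ~= 6 * N * D and D = tokens_per_param * N,
    so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

rules = [("Kaplan et al.", 0.73, 0.27), ("Hoffmann et al.", 0.49, 0.51)]
for label, n_exp, d_exp in rules:
    for mult in (2, 10):
        n_growth, d_growth = growth_factors(mult, n_exp, d_exp)
        print(f"{label}: {mult}x compute -> params x{n_growth:.2f}, tokens x{d_growth:.2f}")

n_params, n_tokens = allocate(1e23)
print(f"1e23 FLOPs -> about {n_params:.2e} params and {n_tokens:.2e} tokens")
```

Under either rule the two growth factors multiply back to the compute multiplier, which is why equal exponents give a square-root split.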

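This meant GPT-3 (175B params, 300B tokens) was significantly undertrained: it saw fewer than 2 tokens per parameter, whereas the compute-optimal “Chinchilla” configuration (~70B params, ~1.4T tokens) uses about 20. A smaller model trained on far more tokens at the same compute would be expected to perform better.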

Result: Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) on the same compute budget by substantial margins. The field shifted toward more data relative to model size.
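A rough check of these budgets, using the parameter and token counts quoted above together with the common approximation that training costs about $6ND$ FLOPs. That approximation is an assumption not stated in this article, so the outputs are order-of-magnitude estimates only.

```python
# (parameters, training tokens) as quoted in the text
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

def approx_train_flops(n_params, n_tokens):
    """Approximate training compute as 6 * N * D (assumed rule of thumb)."""
    return 6.0 * n_params * n_tokens

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

for name, (n_params, n_tokens) in MODELS.items():
    flops = approx_train_flops(n_params, n_tokens)
    ratio = tokens_per_param(n_params, n_tokens)
    print(f"{name:<10} ~{flops:.2e} FLOPs, {ratio:.1f} tokens per parameter")
```

By this estimate Gopher and Chinchilla land within about 20% of each other in compute, while differing by roughly a factor of 19 in tokens per parameter.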

Why Exponents Matter: Practical Implications

With power-law scaling:

  • Doubling model size: loss is multiplied by $2^{-0.076} \approx 0.949$, a decrease of about 5.1%
  • Doubling training data: loss is multiplied by $2^{-0.095} \approx 0.936$, a decrease of about 6.4%
  • Doubling compute: loss is multiplied by $2^{-0.050} \approx 0.966$, a decrease of about 3.4% (if compute-optimally allocated)

Small exponents mean you need large multipliers for significant improvements. Halving the loss via compute alone requires $2^{1/0.050} = 2^{20} \approx 10^6 \times$ more compute.
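The arithmetic follows directly from the power law: scaling a resource by a factor $k$ multiplies the loss by $k^{-\alpha}$, so reaching a target loss ratio $r$ needs $k = r^{-1/\alpha}$. A small sketch using the exponents quoted earlier (the function names are illustrative, and a pure power law with no irreducible loss term is assumed):

```python
ALPHAS = {"model size": 0.076, "data": 0.095, "compute": 0.050}

def loss_ratio(scale_multiplier, alpha):
    """Factor by which loss changes when a resource grows by scale_multiplier."""
    return scale_multiplier ** -alpha

def multiplier_for_ratio(target_ratio, alpha):
    """Resource multiplier needed to bring loss down to target_ratio of its value."""
    return target_ratio ** (-1.0 / alpha)

for name, alpha in ALPHAS.items():
    ratio = loss_ratio(2.0, alpha)
    needed = multiplier_for_ratio(0.5, alpha)
    print(f"{name}: 2x -> loss x{ratio:.3f} ({(1 - ratio) * 100:.1f}% lower); "
          f"halving loss needs {needed:.3g}x")
```

The multiplier grows exponentially in $1/\alpha$, which is why small differences between exponents translate into very different costs.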

This explains why AI progress feels both fast and slow:

  • Fast: Each new model generation uses 5–10x more compute → measurable quality improvement
  • Slow: Large relative gains are expensive. At $\alpha_C \approx 0.050$, reducing loss by 50% requires roughly a million times more compute

The “bitter lesson” (Rich Sutton): scaling wins over clever engineering. General methods applied at scale consistently outperform specialized architectures with limited scale.

Emergent Capabilities: The Scaling Debate

Wei et al. (2022) “Emergent Abilities of Large Language Models” documented that some capabilities appear abruptly at certain scale thresholds rather than gradually.

Examples:

  • Arithmetic: negligible performance below ~8B parameters, then jumps sharply
  • Multi-step reasoning: absent at smaller scales, emerges around 100B+
  • Theory of mind tasks: step-change around 1B parameters

This “emergence” pattern seemed to contradict smooth power-law scaling.

The counter-argument (Schaeffer et al., 2023 “Are Emergent Abilities of Large Language Models a Mirage?”): Emergence is an artifact of discontinuous metrics. If you use continuous metrics (log probability), capabilities improve smoothly. But if you use pass/fail thresholds (is the answer exactly correct?), the smooth underlying improvement translates to apparent emergence when the model crosses the threshold.
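A toy illustration of this argument, not a reproduction of the paper's experiments. It assumes a hypothetical per-token accuracy that rises smoothly (a logistic curve in log10 of parameter count, chosen arbitrarily) and an exact-match metric that requires every token of a 10-token answer to be right, with tokens treated as independent.

```python
import math

def per_token_accuracy(log10_params):
    """Hypothetical smooth curve: logistic in log10(parameter count)."""
    return 1.0 / (1.0 + math.exp(-(log10_params - 9.0)))

def exact_match(token_accuracy, answer_len):
    """Probability that all answer_len tokens are correct (independence assumed)."""
    return token_accuracy ** answer_len

ANSWER_LEN = 10
for log10_params in (7, 8, 9, 10, 11):
    p = per_token_accuracy(log10_params)
    em = exact_match(p, ANSWER_LEN)
    print(f"1e{log10_params} params: per-token {p:.3f}, exact match {em:.6f}")
```

The per-token column climbs gradually across the whole range, while the exact-match column stays near zero for the smaller models and then rises steeply. The same underlying improvement looks smooth or abrupt depending only on the metric.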

The debate matters for AI forecasting: smooth power laws allow prediction; genuine emergence would mean “surprises” that can’t be forecasted.

The current evidence: Most capabilities show smooth improvement under fine-grained metrics. But some behaviors (like complex multi-step reasoning) may have genuinely nonlinear properties due to the compositional nature of what they require.

One thing to remember: Scaling laws revealed that AI capability improvement is systematic and predictable — but the debate about emergence shows that translating raw model performance into specific human-valuable capabilities remains less predictable.

