Speculative Decoding — Explain Like I'm 5
The Fast Typist and the Careful Editor
Imagine you’re writing an important document. You have a fast but sometimes sloppy typist and a slow but extremely careful editor.
Instead of the editor typing every word themselves (slow), you have the typist draft a paragraph rapidly. The editor then quickly reads through it — which is much faster than writing it from scratch — and either approves each sentence or makes a correction where needed.
If the typist is accurate most of the time, this is dramatically faster than the editor writing alone. You get the editor’s quality with near-typist speed.
That’s speculative decoding.
How It Works With AI
Generating text with a large language model is slow because it has to run the entire giant model once for every word it produces (strictly, once per token, which is a word or a piece of a word). A 70-billion-parameter model might need 100 milliseconds per word.
Speculative decoding uses a tiny “draft” model (maybe 7 billion parameters, 10x faster) to rapidly guess the next several words. Then the large model checks all those guesses simultaneously in a single forward pass.
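If the large model agrees with the guesses (“yes, these are the words I would have chosen”), you accept them all. That’s 5 words in roughly the time the large model would have needed for 1.
If the large model disagrees somewhere, you keep the guesses that came before the disagreement, swap in the large model’s own choice for the first wrong word, throw away the guesses after it, and let the draft model start guessing again from there.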
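The draft-then-check loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any real serving system is written: the two "models" are hypothetical toy functions that count upward, the names speculative_step and generate are made up for this example, and the check is written as a plain loop where a real large model would score all the guesses in one forward pass. It also assumes greedy decoding, where each model always picks its single favorite next token.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def toy_target_next(seq: List[int]) -> int:
    """Hypothetical 'large' model: always counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft_next(seq: List[int]) -> int:
    """Hypothetical 'draft' model: same rule, but wrong whenever the answer is 7."""
    nxt = (seq[-1] + 1) % 10
    return 0 if nxt == 7 else nxt


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """One round: draft k tokens, then keep the part the target agrees with."""
    # 1. The draft model guesses k tokens, one after another.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # 2. The target model checks every guess. A real system would do this
    #    in a single batched forward pass; the loop only simulates that.
    accepted: List[int] = []
    for i in range(k):
        expected = target_next(prefix + drafted[:i])
        if drafted[i] != expected:
            accepted.append(expected)  # use the target's own choice
            return accepted            # discard the remaining guesses
        accepted.append(drafted[i])

    # 3. Every guess matched, so the same pass also yields one extra token.
    accepted.append(target_next(prefix + drafted))
    return accepted


def generate(prefix: List[int], draft_next: NextToken, target_next: NextToken,
             n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens; also return how many target passes were needed."""
    out = list(prefix)
    passes = 0
    while len(out) - len(prefix) < n_tokens:
        out += speculative_step(out, draft_next, target_next, k)
        passes += 1
    return out[:len(prefix) + n_tokens], passes
```

Calling generate([0], toy_draft_next, toy_target_next, n_tokens=12, k=4) produces the same 12 tokens the toy large model would have produced by itself, but with 3 checking passes instead of 12. The one wrong guess costs only the guesses that came after it in that round.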
The Math of Why It Works
When the draft model gets the right words even 80% of the time, you still get a dramatic speedup. The large model produces 5 words in about 2 passes instead of 5 passes, so even after paying for the draft model and the occasional correction, it’s roughly 2–3x faster.
The key insight: running the large model in “verification mode” (checking a sequence of tokens) is nearly as fast as running it once for a single token — the GPU can do it in parallel.
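To see where numbers like "2–3x" come from, here is a rough back-of-the-envelope calculator. It rests on simplifying assumptions that are not from the article: each guess is accepted independently with the same probability, a checking pass costs the same as a normal single-word pass, and the draft model's cost is given as a fraction of one large-model pass. The function names are made up for this sketch.

```python
def expected_tokens_per_pass(accept_rate: float, k: int) -> float:
    """Average tokens gained per large-model pass with k drafted tokens.

    Assumes each guess is accepted independently with probability
    accept_rate. Every pass yields at least 1 token (a correction or a
    bonus token), plus one more for each guess accepted in a row:
    1 + a + a^2 + ... + a^k.
    """
    return sum(accept_rate ** i for i in range(k + 1))


def estimated_speedup(accept_rate: float, k: int, draft_cost: float) -> float:
    """Rough speedup over running the large model alone.

    draft_cost is the cost of one draft-model step measured in
    large-model passes (0.1 means the draft model is 10x faster).
    """
    round_cost = 1.0 + k * draft_cost  # one checking pass + k draft steps
    return expected_tokens_per_pass(accept_rate, k) / round_cost


if __name__ == "__main__":
    print(expected_tokens_per_pass(0.8, 5))
    print(estimated_speedup(0.8, 5, 0.1))
```

With an 80% acceptance rate, 5 guesses per round, and a draft model 10x faster than the large one, this gives about 3.7 words per large-model pass and a speedup of about 2.5x, which lines up with the 2–3x figure above. Real speedups depend on the models, the text, and the hardware.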
This technique is used in production by Anthropic, Google, and Meta to serve their LLM APIs at higher throughput and lower cost.
One thing to remember: Speculative decoding works because verifying is faster than generating. A small model drafts quickly, a large model verifies quickly, and together they deliver the large model’s quality faster than the large model could working alone.
See Also
- Knowledge Distillation: How AI companies shrink massive models down to phone-sized ones without losing much intelligence, using the teacher-student trick that powers on-device AI.
- Model Pruning: How AI models lose weight without losing intelligence, by removing the neurons that don't actually do anything useful to make models faster and smaller.
- Model Quantization: How AI models get shrunk to run on your phone, using the precision-tradeoff trick that makes 70-billion-parameter models fit on consumer hardware.
- Activation Functions: Why neural networks need these tiny mathematical functions, and how ReLU's simplicity accidentally made deep learning possible.
- AI Agents Architecture: How AI systems go from answering questions to actually doing things, with the design patterns that turn language models into autonomous agents that browse, code, and plan.