Speculative Decoding — Explain Like I'm 5

The clever trick that makes large AI models generate text 2-4x faster — using a small 'draft' model to guess tokens that a big model then quickly verifies.

The Fast Typist and the Careful Editor

Imagine you’re writing an important document. You have a fast but sometimes sloppy typist and a slow but extremely careful editor.

Instead of the editor typing every word themselves (slow), you have the typist draft a paragraph rapidly. The editor then quickly reads through it — which is much faster than writing it from scratch — and either approves each sentence or makes a correction where needed.

If the typist is accurate most of the time, this is dramatically faster than the editor writing alone. You get the editor’s quality with near-typist speed.

That’s speculative decoding.

How It Works With AI

Generating text with a large language model is slow because it needs to run the entire giant model once for each word. A 70-billion parameter model might need 100 milliseconds per word.

Speculative decoding uses a tiny “draft” model (maybe 7 billion parameters, 10x faster) to rapidly guess the next several words. Then the large model checks all those guesses simultaneously in a single forward pass.

If the large model agrees with the guesses — “yes, these are the words I would have chosen” — you accept them all. That’s 5 words in the time it would have taken 1 word.

If the large model disagrees somewhere, you use the large model’s correct choice and try again.

The Math of Why It Works

When the draft model gets the right words even 80% of the time, you still get a dramatic speedup. You’re doing 5 words in ~2 passes instead of 5 passes — even with occasional corrections, it’s 2–3x faster.

The key insight: running the large model in “verification mode” (checking a sequence of tokens) is nearly as fast as running it once for a single token — the GPU can do it in parallel.

This technique is used in production by Anthropic, Google, and Meta to serve their LLM APIs at higher throughput and lower cost.

One thing to remember: Speculative decoding works because verifying is faster than generating — a small model drafts quickly, a large model verifies quickly, and together they’re faster than either working alone.

speculative-decodinginference-optimizationllmlatencyefficiency

Speculative Decoding — Explain Like I'm 5

The Fast Typist and the Careful Editor

How It Works With AI

The Math of Why It Works

See Also

Related Topics