Speculative Decoding — Explain Like I'm 5
The Key Insight: Parallel Verification
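LLM generation is sequential: each new token requires its own forward pass, conditioned on all previous tokens (whose keys and values are read from the KV-cache). This sequential dependency limits decoding speed.
Leviathan et al. (Google, 2022), in “Fast Inference from Transformers via Speculative Decoding”, observed that verification is cheaper than generation: checking several proposed tokens in parallel costs the target model about as much as generating a single token.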
Given $k$ draft tokens $\tilde{x}_{n+1}, \dots, \tilde{x}_{n+k}$, the target model can verify all $k$ simultaneously in a single forward pass. Processing the $k$ tokens in parallel, it computes:
- $P_{target}(x_{n+1} \mid x_{\le n})$ through $P_{target}(x_{n+k} \mid x_{\le n}, \tilde{x}_{n+1}, \dots, \tilde{x}_{n+k-1})$
All of these come from one parallel operation. Each is then compared against the draft model’s probability at the same position, and rejection sampling decides which draft tokens to keep.
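As a rough illustration of why one pass is enough, the sketch below uses a hypothetical toy "target model" (a random bigram table, not a real transformer) whose forward pass returns a next-token distribution for every input position. In a real transformer, causal masking plays the same role. The names `target_forward` and `BIGRAM_LOGITS` are made up for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

VOCAB = 5
rng = np.random.default_rng(0)
# Hypothetical toy target model: a bigram table, so the next-token
# distribution depends only on the most recent token.
BIGRAM_LOGITS = rng.normal(size=(VOCAB, VOCAB))

def target_forward(tokens):
    """One 'forward pass': row i is P_target(next token | tokens[: i + 1])."""
    return softmax(BIGRAM_LOGITS[np.asarray(tokens)])

context = [0, 3, 1]      # n = 3 tokens already generated
draft = [2, 2, 4, 0]     # k = 4 draft tokens from some draft model

# A single call over context + draft yields every distribution needed:
# the last k + 1 rows score draft[0], ..., draft[k - 1], plus one extra
# distribution for the position after the final draft token.
all_probs = target_forward(context + draft)
verify_probs = all_probs[len(context) - 1:]   # shape (k + 1, VOCAB)

# The same distributions computed the slow way, one call per position.
sequential = np.stack([
    target_forward(context + draft[:i])[-1] for i in range(len(draft) + 1)
])
print(np.allclose(verify_probs, sequential))
```

The parallel and sequential results agree, but the parallel version needs one call instead of five.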
The Rejection Sampling Algorithm
Let $q$ = draft model probability distribution, $p$ = target model probability distribution.
For each draft token $\tilde{x}_i$:
- Accept with probability $\min(1, p(\tilde{x}_i) / q(\tilde{x}_i))$
- If rejected: sample from the adjusted distribution $p'(x) \propto \max(0, p(x) - q(x))$ and stop
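The accept/reject rule above can be sketched in a few lines of NumPy. This is a minimal sketch, not a production implementation: the function name `speculative_accept` and its argument layout are assumptions made for illustration, and it omits the extra token that full implementations sample when every draft token is accepted.

```python
import numpy as np

def speculative_accept(p_rows, q_rows, draft_tokens, rng):
    """Apply the accept/reject rule to k draft tokens.

    p_rows[i] and q_rows[i] are the target and draft distributions at
    draft position i. Returns the emitted tokens: the accepted prefix,
    plus one resampled token if a rejection occurred.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_rows[i], q_rows[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(int(tok))
            continue
        # Rejected: resample from the normalized residual max(0, p - q),
        # then discard the remaining draft tokens.
        residual = np.maximum(p - q, 0.0)
        residual = residual / residual.sum()
        out.append(int(rng.choice(len(p), p=residual)))
        break
    return out

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # made-up target distribution
q = np.array([0.2, 0.2, 0.6])   # made-up draft distribution
tok = rng.choice(3, p=q)        # the draft model proposes a token
print(speculative_accept([p], [q], [tok], rng))
```

Because the draft token is sampled from `q`, `q[tok]` is never zero, and a rejection can only happen when `p` and `q` differ, so the residual always has positive mass.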
The key theorem: This procedure produces tokens with exactly the target model’s distribution. The output quality is identical to generating with the target model alone — speculative decoding is lossless.
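Proof sketch: Let $A$ be the event “the draft token is accepted”. The output is $x$ either because $x$ was drafted and accepted, or because the draft was rejected and $x$ was resampled: $$P(\text{output token is } x) = q(x) \cdot \min\left(1, \frac{p(x)}{q(x)}\right) + P(\neg A) \cdot p'(x)$$
The first term equals $\min(p(x), q(x))$. Since $P(\neg A) = 1 - \sum_{x'} \min(p(x'), q(x')) = \sum_{x'} \max(0, p(x') - q(x'))$ is exactly the normalizer of $p'$, the second term equals $\max(0, p(x) - q(x))$. The two terms sum to $p(x)$, so the acceptance sampling preserves the exact target distribution.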
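The losslessness claim can also be checked exactly, without sampling. The probability of emitting $x$ is the mass accepted at $x$ plus the rejection probability times the residual distribution. The distributions below are made-up numbers chosen only for illustration.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # target distribution (made-up numbers)
q = np.array([0.2, 0.2, 0.6])   # draft distribution (made-up numbers)

# Mass that is drafted and accepted at each token:
# q(x) * min(1, p(x) / q(x)) = min(p(x), q(x)).
accept_mass = np.minimum(p, q)

# Probability that the draft token is rejected.
reject_prob = 1.0 - accept_mass.sum()

# Residual distribution used after a rejection.
residual = np.maximum(p - q, 0.0)
residual = residual / residual.sum()

output_dist = accept_mass + reject_prob * residual
print(output_dist)                  # should match p
print(np.allclose(output_dist, p))
```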
Acceptance Rate and Speedup
The acceptance rate $\alpha$ depends on how well the draft model approximates the target model:
$$\alpha = \sum_x \min(p(x), q(x)) = 1 - \text{TV}(p, q)$$
Where TV is total variation distance. When $p = q$ (draft and target are identical): $\alpha = 1$ (always accept), maximum speedup. When $p$ and $q$ are very different: low $\alpha$, many rejections, little speedup.
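A small numerical check of the identity $\alpha = 1 - \text{TV}(p, q)$, using made-up distributions. The helper names `acceptance_rate` and `total_variation` are introduced only for this example.

```python
import numpy as np

def acceptance_rate(p, q):
    """alpha = sum_x min(p(x), q(x))."""
    return float(np.minimum(p, q).sum())

def total_variation(p, q):
    """TV(p, q) = 0.5 * sum_x |p(x) - q(x)|."""
    return 0.5 * float(np.abs(p - q).sum())

p = np.array([0.5, 0.3, 0.2])   # made-up target distribution
q = np.array([0.2, 0.2, 0.6])   # made-up draft distribution

print(acceptance_rate(p, q), 1.0 - total_variation(p, q))
print(acceptance_rate(p, p))    # identical models: always accept
```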
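Expected tokens per call: With $k$ draft tokens and acceptance rate $\alpha$ (assumed constant and independent across positions), the expected number of tokens produced per target model call is:
$$E[\text{tokens}] = \frac{1 - \alpha^{k+1}}{1 - \alpha}$$
This count includes the one token the target model itself contributes on every call: the resampled token after a rejection, or an extra token when all $k$ drafts are accepted. For $k=4$, $\alpha=0.85$: about 3.71 tokens per target model call. Since the target model would have produced 1 token per call without speculation: speedup ≈ 3.7x (minus overhead).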
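The closed form is a geometric sum, $1 + \alpha + \dots + \alpha^k$, which is easy to evaluate directly. The function name below is an assumption made for this sketch, and the formula relies on treating $\alpha$ as constant across positions.

```python
def expected_tokens_per_call(alpha, k):
    """Expected tokens produced per target call with k draft tokens."""
    if alpha == 1.0:
        return float(k + 1)   # limit of the closed form as alpha -> 1
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

alpha, k = 0.85, 4
closed_form = expected_tokens_per_call(alpha, k)
geometric_sum = sum(alpha ** i for i in range(k + 1))
print(f"{closed_form:.4f} {geometric_sum:.4f}")
```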
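Optimal k: Increasing $k$ raises the potential speedup, but with diminishing returns: the $i$-th draft token survives only if it and every draft token before it are accepted, which happens with probability roughly $\alpha^i$, while drafting cost grows linearly in $k$. The optimal $k$ depends on the target-draft model pair and hardware. In practice, $k \in [4, 8]$ works well.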
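To see how the trade-off plays out, here is a sketch under a simple cost model: one target pass costs 1 unit and each draft step costs a fraction `c` of that. The parameter `c`, the cap `k_max`, and the function names are assumptions introduced for illustration. Real systems have additional overheads that this model ignores.

```python
def expected_speedup(alpha, k, c):
    """Tokens per unit of compute, relative to plain decoding.

    Assumes 0 <= alpha < 1, constant across positions, and that one
    speculative step costs k draft passes (c each) plus one target pass.
    """
    tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    cost = k * c + 1.0
    return tokens / cost

def best_k(alpha, c, k_max=16):
    """Draft length in 1..k_max with the highest modeled speedup."""
    return max(range(1, k_max + 1), key=lambda k: expected_speedup(alpha, k, c))

for alpha in (0.6, 0.85, 0.9):
    k = best_k(alpha, c=0.05)
    print(alpha, k, round(expected_speedup(alpha, k, c=0.05), 2))
```

Under this model, a better-matched draft model (higher `alpha`) or a cheaper one (lower `c`) pushes the best `k` upward.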
Self-Speculative Methods
Finding a good draft model is challenging. The draft model must:
- Be significantly faster than the target model
- Have high token acceptance rate (similar distribution to target)
- Be available (often requires fine-tuning)
Self-speculative decoding eliminates the need for a separate draft model:
LLMA (Yang et al., 2023): When generating text that overlaps with text already in the context (e.g., in RAG, where retrieved passages may appear verbatim in the output), copy tokens from the context as draft tokens. The acceptance rate is high because the model often reproduces retrieved text.
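The copy-from-context idea can be sketched as a simple n-gram lookup. This is a generic illustration rather than LLMA's actual matching procedure: the function `draft_from_context` and its parameters `ngram` and `k` are hypothetical.

```python
def draft_from_context(tokens, ngram=2, k=4):
    """Propose up to k draft tokens by copying what followed the most
    recent earlier occurrence of the trailing n-gram.

    Returns an empty list when the trailing n-gram has no earlier match.
    """
    if len(tokens) <= ngram:
        return []
    tail = tokens[-ngram:]
    # Scan right to left, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return list(tokens[start + ngram:start + ngram + k])
    return []

tokens = "the cat sat on the mat . the cat".split()
print(draft_from_context(tokens))   # copies the words that followed "the cat"
```

The proposed tokens would then go through the same verification step as drafts from any other source, so a bad match costs a rejected draft rather than a wrong output.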
Medusa (Cai et al., 2023): Attach multiple “draft heads” to the target model — extra prediction heads that generate speculative tokens at different future positions simultaneously. The draft heads are small and share the target model’s representations. Single forward pass generates both the current token and multiple drafts.
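A rough sketch of the draft-head idea, assuming each head is a small residual block followed by a vocabulary projection applied to the target model’s last hidden state. The weights here are random placeholders (real heads are trained), the sizes are arbitrary, and greedy argmax stands in for the candidate-tree construction that full implementations use.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

HIDDEN, VOCAB, NUM_HEADS = 8, 11, 3   # arbitrary toy sizes
rng = np.random.default_rng(0)

# Hypothetical random weights; in practice each head is trained.
heads = [
    {"w1": rng.normal(scale=0.1, size=(HIDDEN, HIDDEN)),
     "w2": rng.normal(scale=0.1, size=(VOCAB, HIDDEN))}
    for _ in range(NUM_HEADS)
]

def medusa_drafts(hidden_state):
    """Each head reads the same hidden state and guesses a token one
    position further into the future than the previous head."""
    drafts = []
    for head in heads:
        h = hidden_state + silu(head["w1"] @ hidden_state)   # residual block
        probs = softmax(head["w2"] @ h)                      # per-head vocab distribution
        drafts.append(int(np.argmax(probs)))
    return drafts

last_hidden = rng.normal(size=HIDDEN)   # stand-in for the target model's output
draft_tokens = medusa_drafts(last_hidden)
print(draft_tokens)
```

All heads share one hidden state, which is why the drafts come almost for free once the target model’s forward pass has run.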
EAGLE (Li et al., 2024): Feature-level speculative decoding. The draft model operates on the target model’s second-to-top-layer features (the hidden states just before the LM head) rather than on tokens alone. The target model provides these features for free during its forward pass; the draft model uses them to predict the next features, which are then decoded to tokens.
EAGLE achieves acceptance rates of 0.85–0.95 (comparable to a separately trained draft model) because it directly uses the target model’s features as input.
Production Deployment
Speculative decoding is reported to be used in production by:
- Anthropic: For Claude models
- Google: For Gemini API serving
- Meta: For Llama inference at scale
The speedup in production is typically 1.5–2.5x (vs. theoretical 3–4x) because:
- Draft model adds overhead
- Memory bandwidth bottleneck remains
- Batching efficiency changes with variable accepted lengths
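For streaming APIs, speculative decoding also improves user-perceived latency: once generation starts, tokens arrive in bursts because several consecutive draft tokens can be accepted in a single target model step, which lowers the average time per output token. Time to first token is largely unaffected, since it is dominated by prompt processing.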
One thing to remember: Speculative decoding’s mathematical elegance is that it provides lossless speedup — the output distribution is identical to greedy or sampled generation from the target model alone, making it a transparent optimization that any LLM serving system can apply.