Speculative Decoding — Explain Like I'm 5

The rejection sampling algorithm behind speculative decoding, acceptance rates, speedup analysis, self-speculative methods, and why speculative decoding preserves exact output distribution.

The Key Insight: Parallel Verification

LLM generation is sequential: each new token depends on all previous tokens, so the model runs one forward pass per token, reusing the KV-cache for earlier positions. This sequential bottleneck limits generation speed.

Leviathan et al. (Google, 2022) “Fast Inference from Transformers via Speculative Decoding” observed: verification is faster than generation.

Given $k$ draft tokens, the target model can verify all $k$ in a single forward pass. Processing the draft positions in parallel, it computes

$P_{\text{target}}(x_{n+1} \mid x_{1:n})$ through $P_{\text{target}}(x_{n+k} \mid x_{1:n+k-1})$

in one operation. These probabilities are then compared against the draft model’s, and rejection sampling decides which draft tokens to keep.
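A minimal sketch of the idea, assuming a toy stand-in for the target model: `toy_target`, `VOCAB`, and `verify_in_one_pass` are hypothetical names, not part of any library. The toy model is causal (row `i` depends only on the first `i+1` tokens), so a single call over prefix plus draft returns the same distributions as one call per position. Note that the single pass also yields one extra distribution, for the position after the last draft token.

```python
import numpy as np

VOCAB = 5  # hypothetical toy vocabulary size

def toy_target(tokens):
    """Stand-in for a causal LM: row i is the next-token distribution
    given tokens[:i+1]. Not a real model, only a causal function."""
    tokens = np.asarray(tokens)
    # A running sum makes row i depend only on tokens[0..i].
    state = np.cumsum(tokens + 1)
    logits = np.sin(state[:, None] * (np.arange(VOCAB) + 1))
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def verify_in_one_pass(prefix, draft):
    """Return the k+1 target distributions needed to check k draft tokens."""
    k = len(draft)
    all_rows = toy_target(list(prefix) + list(draft))  # single call
    # Row len(prefix)-1 predicts draft[0]; the last row predicts the
    # token that would follow the final draft token.
    return all_rows[len(prefix) - 1 : len(prefix) + k]

prefix, draft = [1, 3, 2], [4, 0, 2]
parallel = verify_in_one_pass(prefix, draft)

# Sequential equivalent: one call per position.
sequential = np.stack(
    [toy_target(prefix + draft[:i])[-1] for i in range(len(draft) + 1)]
)
print(np.allclose(parallel, sequential))
```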

The Rejection Sampling Algorithm

Let $q$ = draft model probability distribution, $p$ = target model probability distribution.

For each draft token $\tilde{x}_i$, in order:

Accept with probability $\min(1, p(\tilde{x}_i) / q(\tilde{x}_i))$
If rejected: sample a replacement from the adjusted distribution $p'(x) \propto \max(0, p(x) - q(x))$ and stop, discarding the remaining draft tokens
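The two rules above can be sketched for one verification round as follows. This is an illustrative sketch, not a reference implementation: `speculative_step` and its argument layout are assumptions, and the distributions are small hand-written arrays rather than model outputs. It also assumes the target supplies one extra distribution, used to emit a token when every draft is accepted.

```python
import numpy as np

def speculative_step(p_rows, q_rows, draft_tokens, rng):
    """One verification round.

    p_rows: (k+1, V) target distributions, one per draft position plus one extra.
    q_rows: (k, V) draft distributions the draft tokens were sampled from.
    Returns the list of tokens emitted this round.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_rows[i], q_rows[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(int(tok))
            continue
        # Rejected: resample from the normalized residual max(0, p - q), then stop.
        residual = np.maximum(p - q, 0.0)
        out.append(int(rng.choice(len(p), p=residual / residual.sum())))
        return out
    # Every draft token was accepted: sample one more token from the extra row.
    out.append(int(rng.choice(p_rows.shape[1], p=p_rows[-1])))
    return out

rng = np.random.default_rng(0)
p_rows = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]])
q_rows = np.array([[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]])
draft = [int(rng.choice(3, p=q)) for q in q_rows]
tokens = speculative_step(p_rows, q_rows, draft, rng)
print(draft, tokens)
```

Each round emits between 1 and $k+1$ tokens, so the target model always makes progress even when the first draft token is rejected.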

The key theorem: This procedure produces tokens with exactly the target model’s distribution. The output quality is identical to generating with the target model alone — speculative decoding is lossless.

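Proof sketch: A single position outputs $x$ either because the draft proposed $x$ and it was accepted, or because the draft token was rejected and $x$ was resampled from $p'$. Writing $\alpha = \sum_{x'} \min(p(x'), q(x'))$ for the total acceptance probability: $$P(\text{output token is } x) = q(x) \cdot \min\left(1, \frac{p(x)}{q(x)}\right) + (1 - \alpha) \cdot p'(x)$$

The first term equals $\min(p(x), q(x))$. Since $\sum_{x'} \max(0, p(x') - q(x')) = 1 - \alpha$, the normalized adjusted distribution is $p'(x) = \max(0, p(x) - q(x)) / (1 - \alpha)$, so the second term equals $\max(0, p(x) - q(x))$. The two terms sum to $p(x)$: the procedure preserves the exact target distribution.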
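The claim can be checked numerically on a small example. The distributions below are illustrative values chosen by hand, not taken from any model: the probability mass delivered through acceptance, plus the mass delivered through resampling, should reproduce the target distribution exactly.

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])  # target distribution (illustrative)
q = np.array([0.2, 0.5, 0.3])  # draft distribution (illustrative)

# Mass arriving via acceptance: q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
accept_mass = np.minimum(p, q)
alpha = accept_mass.sum()  # overall acceptance probability

# Adjusted distribution used after a rejection.
residual = np.maximum(p - q, 0.0)
p_adjusted = residual / residual.sum()

# Total output distribution: accepted mass plus resampled mass.
output = accept_mass + (1.0 - alpha) * p_adjusted
print(output, np.allclose(output, p))
```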

Acceptance Rate and Speedup

The acceptance rate $\alpha$ depends on how well the draft model approximates the target model:

$$\alpha = \sum_x \min(p(x), q(x)) = 1 - \text{TV}(p, q)$$

Here $\text{TV}(p, q) = \frac{1}{2} \sum_x |p(x) - q(x)|$ is the total variation distance. When $p = q$ (draft and target are identical): $\alpha = 1$ (always accept), maximum speedup. When $p$ and $q$ are very different: low $\alpha$, many rejections, little speedup.
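As a quick sanity check of this identity, the sketch below computes the acceptance rate both ways and compares it with a simulated acceptance frequency. The helper names and the two distributions are assumptions made for illustration.

```python
import numpy as np

def acceptance_rate(p, q):
    """Sum over tokens of min(p, q)."""
    return float(np.minimum(p, q).sum())

def total_variation(p, q):
    """Half the L1 distance between two distributions."""
    return 0.5 * float(np.abs(p - q).sum())

p = np.array([0.6, 0.3, 0.1])  # target (illustrative)
q = np.array([0.2, 0.5, 0.3])  # draft (illustrative)
alpha = acceptance_rate(p, q)

# Simulate: draw tokens from the draft, accept each with min(1, p/q).
rng = np.random.default_rng(0)
draws = rng.choice(len(q), size=100_000, p=q)
accepted = rng.random(draws.size) < np.minimum(1.0, p[draws] / q[draws])
empirical = accepted.mean()

print(alpha, 1.0 - total_variation(p, q), empirical)
```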

Expected tokens per call: With $k$ draft tokens and acceptance rate $\alpha$ (assuming each draft token is accepted independently), the expected number of tokens produced per target model call, counting the one token the target model itself contributes after a rejection or a full acceptance, is:

$$E[\text{tokens}] = \frac{1 - \alpha^{k+1}}{1 - \alpha}$$

For $k=4$, $\alpha=0.85$: $(1 - 0.85^5) / 0.15 \approx 3.71$ tokens per target model call. Since the target model would have produced 1 token per call without speculation: speedup ≈ 3.7x (minus overhead).
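The closed form above can be evaluated directly and cross-checked against the geometric sum it comes from. A round emits at least $j+1$ tokens exactly when the first $j$ draft tokens are all accepted, which happens with probability $\alpha^j$ under the independence assumption. The function name is hypothetical.

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per target call, assuming each draft token
    is accepted independently with probability alpha."""
    if alpha == 1.0:
        return float(k + 1)  # limit of the closed form as alpha -> 1
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

k, alpha = 4, 0.85
closed_form = expected_tokens(alpha, k)
# Same quantity as a sum of survival probabilities: 1 + alpha + ... + alpha^k.
geometric_sum = sum(alpha ** j for j in range(k + 1))
print(closed_form, geometric_sum)
```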

Optimal k: Increasing $k$ raises the potential speedup, but with diminishing returns: the $i$-th draft token only counts if every earlier one was accepted (probability $\alpha^i$), while each extra draft token adds drafting cost. The optimal $k$ depends on the target-draft model pair and hardware. In practice, $k \in [4, 8]$ works well.
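One way to make this trade-off concrete is the cost model from Leviathan et al., where $c$ is the cost of one draft step relative to one target step and the expected speedup is $\frac{1 - \alpha^{k+1}}{(1 - \alpha)(kc + 1)}$. The sketch below searches over $k$ under that model. The function names and the value $c = 0.05$ are assumptions for illustration, and real systems have overheads this model ignores.

```python
def speedup(alpha, k, c):
    """Expected speedup with k draft tokens, acceptance rate alpha < 1,
    and draft-to-target cost ratio c (simplified cost model)."""
    tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    return tokens / (k * c + 1.0)

def best_k(alpha, c, k_max=16):
    """Draft length in 1..k_max that maximizes the modeled speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c))

for alpha in (0.6, 0.85, 0.9):
    k = best_k(alpha, c=0.05)
    print(alpha, k, round(speedup(alpha, k, 0.05), 2))
```

Under this model a better-aligned draft (higher $\alpha$) or a cheaper draft (lower $c$) pushes the best $k$ upward.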

Self-Speculative Methods

Finding a good draft model is challenging. The draft model must:

Be significantly faster than the target model
Have high token acceptance rate (similar distribution to target)
Be available (often requires fine-tuning)

Self-speculative decoding eliminates the need for a separate draft model:

LLMA (Yang et al., 2023): When the output is likely to repeat text already in the context (e.g., in RAG, where retrieved passages may appear verbatim in the output), copy token spans from that context as draft tokens. The acceptance rate is high because the model often copies retrieved text.
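The copy-from-context idea can be sketched as a suffix match: find where the most recent output tokens occur in the reference text and propose the tokens that follow. This is a simplified illustration, not the paper's exact procedure. `draft_from_reference`, `match_len`, and the word-level tokens are all assumptions.

```python
def draft_from_reference(generated, reference, match_len=2, k=4):
    """Return up to k draft tokens copied from `reference`, or [] if the
    last `match_len` generated tokens do not occur in it."""
    if len(generated) < match_len:
        return []
    suffix = generated[-match_len:]
    for start in range(len(reference) - match_len + 1):
        if reference[start:start + match_len] == suffix:
            follow = start + match_len
            return reference[follow:follow + k]  # tokens after the match
    return []

reference = "the cat sat on the mat today".split()
generated = "I think the cat".split()
print(draft_from_reference(generated, reference))
```

The proposed tokens are then verified by the target model like any other draft, so a bad match costs speed but not correctness.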

Medusa (Cai et al., 2023): Attach multiple “draft heads” to the target model: extra prediction heads that generate speculative tokens at different future positions simultaneously. The draft heads are small and share the target model’s representations. A single forward pass generates both the current token and multiple drafts.
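A toy sketch of the multi-head idea, under heavy simplification: here each extra head is a single random linear layer applied to the same final hidden state, and tokens are picked greedily. Actual Medusa heads are trained and structured differently, and all names and sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, NUM_HEADS = 8, 10, 3  # hypothetical toy sizes

lm_head = rng.normal(size=(HIDDEN, VOCAB))                 # base next-token head
draft_heads = rng.normal(size=(NUM_HEADS, HIDDEN, VOCAB))  # one matrix per extra head

def propose(hidden):
    """From one final hidden state, pick the next token and one draft
    token per extra head (head i targets the position i+2 steps ahead)."""
    next_token = int(np.argmax(hidden @ lm_head))
    drafts = [int(np.argmax(hidden @ w)) for w in draft_heads]
    return next_token, drafts

hidden = rng.normal(size=HIDDEN)  # stands in for the model's last hidden state
next_token, drafts = propose(hidden)
print(next_token, drafts)
```

The point of the design is that all heads read the same hidden state, so drafting adds only a few small matrix multiplications on top of the forward pass already being run.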

EAGLE (Li et al., 2024): Feature-level speculative decoding. The draft model operates on feature representations rather than tokens, taking them from the target model’s second-to-top layer (the hidden state just before the LM head). The target model provides these features for free during its forward pass; the draft model uses them to predict the next features, which are then decoded to tokens.

EAGLE achieves acceptance rates of 0.85–0.95 (comparable to a separately trained draft model) because it directly uses the target model’s features as input.

Production Deployment

Speculative decoding is used in production by:

Anthropic: For Claude models
Google: For Gemini API serving
Meta: For Llama inference at scale

The speedup in production is typically 1.5–2.5x (vs. theoretical 3–4x) because:

Draft model adds overhead
Memory bandwidth bottleneck remains
Batching efficiency changes with variable accepted lengths

For streaming APIs, speculative decoding also improves user-perceived latency: when consecutive draft tokens are accepted, several tokens arrive per target model step, lowering the average time between tokens.

One thing to remember: Speculative decoding’s mathematical elegance is that it provides lossless speedup — the output distribution is identical to greedy or sampled generation from the target model alone, making it a transparent optimization that any LLM serving system can apply.
