Decode is bandwidth-bound — exploit that
During autoregressive generation, each token requires reading the entire model from HBM — ~6 GB for a 3B model in FP16. The actual compute per token is tiny; the GPU spends most of its time waiting for weight loads. That's why decode throughput is capped far below peak FLOPs.
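The bandwidth ceiling is easy to see with back-of-envelope arithmetic. A minimal sketch, assuming illustrative numbers (6 GB of FP16 weights, ~2 TB/s of HBM bandwidth — a plausible datacenter-GPU figure, not a measurement):

```python
# Roofline arithmetic for memory-bound decode (illustrative numbers).
model_bytes = 3e9 * 2        # 3B params in FP16 -> ~6 GB of weights
hbm_bandwidth = 2e12         # ~2 TB/s HBM, an assumed GPU-class figure

# Each decoded token must stream every weight from HBM at least once,
# so bandwidth alone caps tokens/second regardless of available FLOPs.
min_time_per_token = model_bytes / hbm_bandwidth   # seconds per token
max_tokens_per_sec = 1 / min_time_per_token

print(f"{min_time_per_token * 1e3:.1f} ms/token "
      f"-> {max_tokens_per_sec:.0f} tok/s ceiling")  # 3.0 ms -> 333 tok/s
```

The per-token FLOPs (~2 × params ≈ 6 GFLOPs here) would take microseconds on a modern GPU; the milliseconds are spent on the weight stream.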
Speculative decoding exploits this. What if a small draft model proposed several tokens cheaply, and the big target model verified them all in a single forward pass? The verification pass reads the model once and checks tokens in parallel — amortizing the memory load across multiple tokens.
The math — why it's lossless
Speculative decoding uses rejection sampling so that the final output distribution is exactly what the target model would have produced alone. For each draft token x, with draft probability q(x) and target probability p(x), accept x with probability min(1, p(x)/q(x)).
If you accept, move on. If you reject, you sample a new token from the corrected distribution norm(max(0, p − q)). The whole procedure is mathematically equivalent to sampling from p — no quality drop, only speed.
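The accept/reject rule can be checked empirically on toy distributions. A minimal sketch (the function name `speculative_step` and the 3-token vocabulary are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q, x):
    """One accept/reject step for a drafted token x.

    p, q: target and draft distributions over the vocab (1-D arrays).
    Accept x with prob min(1, p[x]/q[x]); on rejection, resample from
    the corrected distribution norm(max(0, p - q)).
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = np.maximum(p - q, 0.0)
    return rng.choice(len(p), p=residual / residual.sum())

# Empirical check of losslessness: even with a badly mismatched draft,
# the emitted tokens follow the target distribution p exactly.
p = np.array([0.6, 0.3, 0.1])   # target
q = np.array([0.3, 0.5, 0.2])   # draft
counts = np.zeros(3)
for _ in range(200_000):
    x = rng.choice(3, p=q)               # draft proposes
    counts[speculative_step(p, q, x)] += 1
print(counts / counts.sum())             # ≈ [0.6, 0.3, 0.1]
```

Note the draft only affects *how often* rejection happens (i.e. speed), never *what* distribution comes out.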
If the draft matches the target a fraction α of the time (per token, independently), and you propose γ tokens per round, the expected number of accepted tokens per round is (1 − α^(γ+1)) / (1 − α) — the geometric series over "first i drafts all accepted", plus one bonus token sampled from the target itself each round.
Where it comes from — and where it breaks
The idea has two independent origins: Stern, Shazeer & Uszkoreit (NeurIPS 2018, "Blockwise Parallel Decoding") proposed the parallel-verify shape; Leviathan, Kalman & Matias (Google, 2022, "Fast Inference from Transformers via Speculative Decoding") and Chen et al. (DeepMind, 2023) gave the rejection-sampling proof that makes it lossless. The practical win depends entirely on the acceptance rate α: a 7B draft for a 70B target on general chat reaches high acceptance and gives ~1.8× wall-clock latency improvement. On narrow distributions (code, math, regex, JSON schemas) acceptance collapses because a generic draft produces syntactically wrong proposals that the target immediately rejects — you end up paying the draft cost for no tokens. The production fix is to distill the draft model from the target's outputs on the serving distribution, or to use EAGLE-style draft heads that see the target's own hidden states and therefore drift less.
Who drafts?
- Separate small model. A 1B-class model drafting for a 70B. Simple but requires loading two models.
- Medusa heads (Cai et al. 2024). Add extra prediction heads to the big model itself; head i predicts the token at position t+i (t+1, t+2, t+3). No separate model to load.
- EAGLE / EAGLE-3. A lightweight autoregressive head attached to the big model's internal features, trained to predict future tokens. Higher acceptance rates than Medusa.
Notice where SLMs fit: speculative decoding for SLMs isn't that useful because SLMs are already cheap to decode. But SLMs are great as draft models for bigger siblings. A 1B Qwen drafting for a 32B Qwen is a common production setup — the 1B generates fast, the 32B verifies, and the effective decode speed is 2–3× the 32B's native rate.