The KV cache is the real constraint
In the multi-head lesson we discovered that each head has its own learned projections. At inference time, during autoregressive generation, we cache the keys and values of every token we've already processed — otherwise every new token would require recomputing attention over the whole sequence from scratch. This is called the KV cache.
Per token, the cache costs 2 · n_layers · n_kv · d_head · b bytes — the leading factor of 2 because we store both K and V. n_layers = number of transformer layers. n_kv = number of KV heads. d_head = per-head dimension. b = bytes per element (2 for FP16).
For a Llama-2 7B with n_layers = 32, n_kv = 32 (full MHA), d_head = 128, FP16, that's 2 · 32 · 32 · 128 · 2 bytes ≈ 0.5 MB per token. At 4096 tokens it's 2 GB of KV cache for one sequence — roughly the FP16 weights of a 1B model. Serve ten concurrent users at 4k each and your KV cache is 20 GB, before you've even allocated the model.
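These numbers are easy to sanity-check by plugging the formula into a few lines of Python (function name is mine; the shapes are the Llama-2 7B ones from above):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    # Factor 2: we store both K and V at every layer.
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem

# Llama-2 7B: 32 layers, 32 KV heads (full MHA), head dim 128, FP16.
per_token = kv_bytes_per_token(32, 32, 128)
print(per_token)                  # 524288 bytes = 0.5 MB per token
print(per_token * 4096 / 2**30)  # 2.0 -> exactly 2 GiB at 4096 tokens
```

Note that sequence length never appears in the per-token cost — the cache simply grows linearly with every token generated.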
The fix — divide queries into groups
Grouped-query attention (Ainslie et al. 2023) found the sweet spot. Divide the h query heads into g groups, each group sharing one K/V pair:
- g = h → full MHA (no compression)
- g = 1 → MQA (maximum compression, quality hit)
- 1 < g < h → GQA (tunable)
Modern SLMs live in the sweet spot with h/g = 3 or h/g = 4 — Phi-4-mini, Llama 3.2-3B, Qwen3-4B, SmolLM3-3B all sit there.
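The grouping itself is just integer arithmetic: with consecutive query heads forming each group, query head i reads from KV head i // (h // g). A minimal sketch (function name is mine):

```python
def kv_head_for_query(q_head, n_q_heads, n_kv_heads):
    # Consecutive query heads form groups; each group shares one KV head.
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# A 24-query-head, 8-KV-head layout -> groups of 3.
mapping = [kv_head_for_query(i, 24, 8) for i in range(24)]
print(mapping)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, ..., 7, 7, 7]
```

The two extremes fall out of the same formula: with n_kv_heads = 1 every query head maps to KV head 0 (MQA), and with n_kv_heads = n_q_heads each maps to itself (MHA).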
MQA (g = 1) — maximum compression (h× reduction)
All query heads share a single K/V pair
Quality loss: 1–3 points on SuperGLUE, visible on summarisation
Used by: PaLM, some early Google models
GQA (1 < g < h) — tunable compression (h/g× reduction)
Query heads in a group share one K/V pair
Quality loss: < 0.5 points at h/g ≈ 4
Used by: Llama 2 70B+, Phi-4-mini, Qwen3, Gemma 3, SmolLM3 — essentially every 2024+ SLM
Why it works — mechanistically
Empirically, attention heads cluster: many heads learn nearly-identical retrieval patterns ("attend to the previous token", "attend to the first token", "attend to the subject"). Forcing every query head to maintain its own K/V pair is redundant — those near-identical patterns don't need independent K/V storage. GQA lets a small set of K/V pairs be shared across similar query directions, preserving head diversity via the independent query projections while consolidating the K/V storage.
Concrete current choices:
- Phi-4-mini: h = 24, g = 8 (3× compression)
- Llama 3.2-3B: h = 24, g = 8 (3×)
- Qwen3-4B: h = 32, g = 8 (4×)
- SmolLM3-3B: h = 16, g = 4 (4×)
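Since the KV cache formula scales linearly with n_kv, the compression factor for each of these models is simply h/g, independent of layer count or head dimension — a quick check using the head counts listed above:

```python
# (query heads h, KV heads g) as listed above.
models = {
    "Phi-4-mini":   (24, 8),
    "Llama 3.2-3B": (24, 8),
    "Qwen3-4B":     (32, 8),
    "SmolLM3-3B":   (16, 4),
}
for name, (h, g) in models.items():
    # KV cache shrinks by exactly h/g relative to full MHA.
    print(f"{name}: {h // g}x smaller KV cache than full MHA")
```

Applied to the Llama-2 7B arithmetic from the top of this lesson, a 4× ratio would turn the 2 GB per-sequence cache into 0.5 GB — which is why every recent SLM ships with GQA by default.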