Microscale
Act VIII · Serving the Model
lesson kv-cache · 10 min · 50 xp

The KV cache

What it is, why it's the binding constraint

The cache you didn't know dominated everything

During autoregressive generation, every step requires attention over every previous token. Recomputing K and V for all prior tokens on every step would be quadratically wasteful — each of the $L$ new tokens would recompute $O(L)$ attention, for total $O(L^2)$ work per generation.

The fix is the KV cache: store $K$ and $V$ from every past token so you only compute them once. Generation per token becomes $O(L)$ attention over cached keys — much better.
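A toy numpy sketch of the idea (single head, no batching; helper names like `decode_step` are hypothetical, not any library's real API). Each step projects K and V for the new token only, appends them to the cache, and attends over everything stored so far:

```python
import numpy as np

def attention(q, K, V):
    # q: (d,), K/V: (t, d) — softmax-weighted sum over cached keys
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def decode_step(x, Wq, Wk, Wv, K_cache, V_cache):
    # K and V are computed for the NEW token only, then cached.
    # Attention over the cache is O(L) per step, not O(L^2).
    K_cache.append(x @ Wk)
    V_cache.append(x @ Wv)
    return attention(x @ Wq, np.stack(K_cache), np.stack(V_cache))

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
K_cache, V_cache = [], []
for step in range(5):            # five decode steps
    x = rng.standard_normal(d)   # embedding of the newest token
    out = decode_step(x, Wq, Wk, Wv, K_cache, V_cache)
print(len(K_cache))  # 5 — each token's K was computed exactly once
```

The cache lists here are exactly the memory the rest of this lesson is counting: one K and one V vector per token, per layer, per KV head.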

$$\text{KV bytes per token} = 2 \cdot L_{\text{layers}} \cdot h_{\text{kv}} \cdot d_h \cdot \text{bytes}$$

For a 3B-class model (32 layers, 8 KV heads, $d_h = 128$, FP16): that's 128 KB per token. Sounds small until you multiply by sequence length and concurrent users.
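Plugging the numbers in (a quick sanity check of the formula; the values are the ones quoted above, not read from any real model's config file):

```python
layers = 32        # L_layers
kv_heads = 8       # h_kv (GQA)
head_dim = 128     # d_h
dtype_bytes = 2    # FP16

# The factor of 2 covers both K and V.
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(kv_per_token // 1024, "KB")  # 128 KB
```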

[Interactive calculator, sliders at 4096 tokens × 8 concurrent sequences: KV / tok 128 KB, KV / seq 512 MB, KV total 4.0 GB, weights + KV 10.0 GB. Memory breakdown chart: weights vs. KV cache.]
At 4096 tokens × 8 concurrent sequences, KV cache is 40% of total memory. Scaling concurrency is a KV problem, not a weights problem.

Why this is the binding constraint in production

Once KV cache exceeds weights, every additional concurrent user directly eats into your available slots. A 24 GB GPU with a 6 GB model can host 18 GB of KV cache. On paper that's roughly 36 concurrent 4k-context sequences (18 GB / 512 MB) — and that's the best-case ceiling, before fragmentation and activation memory take their cut, without any paged attention, continuous batching, or FP8 KV tricks. Those techniques all exist specifically to raise that ceiling.
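The slot arithmetic in miniature (an idealized ceiling that ignores fragmentation and activation memory; the figures are the ones from this section, not a specific deployment):

```python
GPU_GB = 24.0
WEIGHTS_GB = 6.0
KV_PER_SEQ_GB = 0.5   # 128 KB/token * 4096 tokens = 512 MB

free_for_kv = GPU_GB - WEIGHTS_GB         # 18 GB left once weights are resident
max_seqs = int(free_for_kv // KV_PER_SEQ_GB)
print(max_seqs)  # 36 concurrent 4k-context sequences, best case
```

In a real allocator the usable number is well below this, which is exactly the gap paged attention closes.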

And memory is only half the story — every decode step has to read the accumulated KV back from HBM to compute attention. At sequence length 4,096, each forward pass pulls 512 MB per sequence just for KV reads, on top of the ~6 GB weight read. On an H100's 3.35 TB/s HBM3 that KV read alone costs about 160 µs per step — and it grows linearly with sequence length, which is why long-context decode gets progressively slower. It also explains why GQA exists: the $h_{\text{kv}} = 8$ in the formula above used to be $h_{\text{kv}} = 32$ (full MHA), and shrinking it 4× is a straight 4× cut on both KV memory and KV bandwidth. Llama-2-70B ships 8 KV heads for exactly this reason — Shazeer's 2019 multi-query paper called the shot a full four years before it became production-standard.
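To see the linear growth, a back-of-envelope script (bandwidth and per-token KV size are the figures quoted above; a rough model that ignores weight reads and any compute/IO overlap):

```python
HBM_BYTES_PER_S = 3.35e12        # H100 HBM3, from the text above
KV_BYTES_PER_TOKEN = 128 * 1024  # 3B-class model, from the formula

for seq_len in (1024, 4096, 32768):
    kv_read = seq_len * KV_BYTES_PER_TOKEN   # bytes pulled from HBM per decode step
    t_us = kv_read / HBM_BYTES_PER_S * 1e6
    print(f"{seq_len:6d} tokens: {kv_read / 2**20:5.0f} MB KV read -> {t_us:7.1f} us/step")
```

At 4k context the KV read is a rounding error next to the weight read; at 32k it is already milliseconds per step, per sequence — and cutting $h_{\text{kv}}$ by 4× (GQA) scales every line down by the same factor.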

The next lesson shows PagedAttention, which turns the KV cache from a contiguous block into a paged virtual memory system. The fragmentation savings alone yield 2–4× more concurrent sessions on the same hardware.