Microscale · Act VIII: Serving the Model
lesson paged-attention · 12 min · 60 xp

PagedAttention

Virtual memory for transformers

A naive KV allocator gives each sequence a contiguous block sized for the maximum expected context length. Short sequences waste most of that block; finished sequences leave fragmented holes. Measured waste in production: 60–80% of KV VRAM sits idle.

vLLM's PagedAttention steals the idea from OS virtual memory. Divide the KV cache into small fixed-size physical blocks (typically 16 tokens each). Each sequence has a block table mapping logical positions to physical blocks. Sequences grow one block at a time and hand their blocks back when they finish. The attention kernel walks the block table instead of a contiguous region.

The block table is the whole trick. Pre-vLLM serving engines (FasterTransformer, the original HF TGI path) reserved one contiguous KV slab per request sized to max_seq_len, because attention kernels needed contiguous K and V tensors to stride through. For a batch of 32 requests with max_seq_len=2048 but typical length ~300, roughly 85% of the reserved KV sat empty — you couldn't give those bytes to anyone else because the owning request might still grow into them. Kwon et al. (SOSP 2023, "Efficient Memory Management for LLM Serving with PagedAttention") swapped that for a per-sequence block table — an indirection map from logical position i to physical block block_table[i / 16], offset i % 16. A sequence at 300 tokens now owns exactly ⌈300/16⌉ = 19 physical blocks, and every block it doesn't need becomes available to the next request that walks in. Same VRAM, 2–4× more in-flight requests.
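The indirection is small enough to sketch in full. Here is a toy allocator (illustrative only — class and method names are mine, not vLLM's; the real block manager also handles swapping and sharing):

```python
BLOCK = 16  # tokens per physical block, vLLM's typical default

class BlockTable:
    """Toy per-sequence block table: logical token position -> physical block."""

    def __init__(self, free_pool):
        self.free = free_pool   # shared list of free physical block ids
        self.table = []         # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a fresh physical block only when the last one fills up.
        if self.num_tokens % BLOCK == 0:
            self.table.append(self.free.pop())
        self.num_tokens += 1

    def lookup(self, i):
        # The map from the text: block_table[i // 16], offset i % 16.
        return self.table[i // BLOCK], i % BLOCK

    def release(self):
        # A finished sequence hands every block back to the shared pool.
        self.free.extend(self.table)
        self.table, self.num_tokens = [], 0

pool = list(range(32))   # a pool of 32 physical blocks, shared across sequences
seq = BlockTable(pool)
for _ in range(300):     # the 300-token sequence from the example above
    seq.append_token()
print(len(seq.table))    # 19 — exactly ⌈300/16⌉ blocks
```

Note that `release` returns blocks to the same list every sequence allocates from — that shared pool is what lets one request's freed memory immediately serve the next.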

physical KV pool — 16 blocks × 16 tokens each

  block:   1    2    3    4    5    6    7–16
  tokens: 16   16    1   16   16    1   free

  blocks used: 6/16 · tokens stored: 66 · capacity: 256 · waste: ~4%
Paged: blocks allocated on demand. Only the last block of each sequence is partially used. Waste drops to under 5%, usually ~4%.
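The waste numbers are easy to verify. Using the figures from this section (max_seq_len=2048, typical length ~300, 16-token blocks):

```python
import math

MAX_LEN, ACTUAL, BLOCK = 2048, 300, 16

# Naive: one contiguous slab of MAX_LEN slots per request, mostly empty.
naive_waste = (MAX_LEN - ACTUAL) / MAX_LEN
print(f"naive: {naive_waste:.0%}")   # 85%, matching the text

# Paged: only the final block of a sequence can be partially used,
# so per-sequence waste is bounded by BLOCK - 1 tokens.
blocks = math.ceil(ACTUAL / BLOCK)   # 19
paged_waste = (blocks * BLOCK - ACTUAL) / (blocks * BLOCK)
print(f"paged: {paged_waste:.1%}")   # 1.3% for this particular sequence
```

Per-sequence waste is always under one block; aggregate figures like the ~4% above depend on the mix of sequence lengths in flight.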

Why it took until 2023 to invent

PagedAttention seems obvious in hindsight. The delay was implementing a working CUDA kernel that could tolerate non-contiguous KV memory. Attention kernels historically assume contiguous arrays so that memory access patterns are cache-friendly. Following block table pointers adds indirection and can wreck cache locality.

vLLM's contribution was showing that you can write a block-aware attention kernel that doesn't sacrifice throughput. With that in place, the virtual memory abstraction becomes a free speedup for serving.
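As a reference for what such a kernel has to compute, here is a deliberately slow NumPy sketch of single-query attention that walks a block table instead of striding a contiguous region (a correctness reference only — the real vLLM kernel is fused CUDA, and the names here are mine):

```python
import numpy as np

BLOCK = 16      # tokens per physical block
HEAD_DIM = 64   # illustrative head dimension

def paged_attention(q, k_pool, v_pool, block_table, seq_len):
    """Single-query attention over paged KV.

    k_pool, v_pool: (num_physical_blocks, BLOCK, HEAD_DIM) shared pools.
    """
    ks, vs = [], []
    for i in range(seq_len):
        blk, off = block_table[i // BLOCK], i % BLOCK  # the indirection
        ks.append(k_pool[blk, off])
        vs.append(v_pool[blk, off])
    scores = np.array(ks) @ q / np.sqrt(HEAD_DIM)
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w /= w.sum()
    return w @ np.array(vs)
```

Numerically this matches ordinary attention over the same tokens; the contribution was making the gather fast on a GPU, not making it correct.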

The consequences

  • 2–4× more concurrent sessions on the same hardware (just from fragmentation savings)
  • Shared-prefix KV reuse — beam search, many-sample decoding, shared system prompts all become efficient
  • Copy-on-write — forked beams or agent rollouts only pay for divergence, not duplication
  • Clean abstraction — every 2024+ serving engine (vLLM, SGLang, TGI, TensorRT-LLM) now has a variant of this
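The copy-on-write point deserves a sketch. A toy refcounting scheme (illustrative, not vLLM's code — names are mine): a forked beam shares every physical block with its parent, and pays for a new block only when it writes into one it shares.

```python
class CowPool:
    """Toy copy-on-write block pool with refcounts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refs = [0] * num_blocks

    def alloc(self):
        b = self.free.pop()
        self.refs[b] = 1
        return b

    def fork(self, table):
        # Forking shares every block: the cost is refcount bumps, not copies.
        for b in table:
            self.refs[b] += 1
        return list(table)

    def write(self, table, idx):
        # Copy-on-write: duplicate a block only if someone else still holds it.
        b = table[idx]
        if self.refs[b] > 1:
            self.refs[b] -= 1
            table[idx] = self.alloc()  # divergence costs exactly one new block
        return table[idx]
```

In vLLM this bookkeeping lives in the block manager; the sketch only shows why forked beams and agent rollouts pay for divergence rather than duplication.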