Shared prefixes are everywhere — stop prefilling them twice
Real-world LLM workloads have huge amounts of shared prefix: every request shares a system prompt; agent trajectories branch late from a common prefix; few-shot prompts repeat across many queries. Standard engines re-prefill those shared tokens on every request. Wasteful.
RadixAttention (SGLang, LMSYS 2024) stores the KV cache in a radix tree indexed by token prefix. When a new request arrives, walk down the tree matching tokens; for every match, reuse the cached KV block. Unmatched suffixes become new tree nodes. LRU eviction keeps the cache bounded.
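The walk-match-insert-evict cycle can be sketched in a few dozen lines. This is a toy per-token trie, not SGLang's actual compressed radix tree over GPU KV blocks — each node here just stands in for one token's cached KV entry, and a logical counter stands in for real LRU bookkeeping:

```python
from itertools import count

_clock = count()  # logical LRU clock: strictly increasing, no wall time needed

class Node:
    __slots__ = ("children", "last_used")
    def __init__(self):
        self.children = {}             # token id -> Node
        self.last_used = next(_clock)

class PrefixCache:
    """Toy per-token trie version of RadixAttention's lookup (assumption-laden
    sketch; the real structure compresses runs of tokens into single edges)."""

    def __init__(self, capacity=1 << 12):
        self.root = Node()
        self.size = 0                  # cached token count, a stand-in for KV memory
        self.capacity = capacity

    def match(self, tokens):
        """Walk down the trie; return how many leading tokens hit the cache."""
        node, hits = self.root, 0
        for t in tokens:
            node = node.children.get(t)
            if node is None:
                break
            node.last_used = next(_clock)  # touching a node refreshes it for LRU
            hits += 1
        return hits

    def insert(self, tokens):
        """Add the unmatched suffix as new nodes, evicting LRU leaves if needed."""
        needed = len(tokens) - self.match(tokens)  # match() also refreshes the path
        while self.size + needed > self.capacity and self._evict_lru_leaf():
            pass
        node = self.root
        for t in tokens:
            if t not in node.children:
                node.children[t] = Node()
                self.size += 1
            node = node.children[t]

    def _evict_lru_leaf(self):
        """Drop the least-recently-used leaf. Leaves only, so every surviving
        node still has its full prefix above it. Returns False when empty."""
        best, stack = None, [(self.root, None, None)]
        while stack:
            node, parent, tok = stack.pop()
            if parent is not None and not node.children:
                if best is None or node.last_used < best[0].last_used:
                    best = (node, parent, tok)
            stack.extend((c, node, t) for t, c in node.children.items())
        if best is None:
            return False
        del best[1].children[best[2]]
        self.size -= 1
        return True
```

A request with a cached system prompt then reduces to `cache.match(prompt_tokens)`: everything up to the returned index is free, and only the suffix pays for prefill. Evicting leaves only is the important invariant — an interior node's KV is meaningless without its prefix, so the tree can only shrink from the bottom up.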
When this wins big
- Agent workloads — many rollouts from the same prompt for tree search or best-of-N
- Chat products — every user shares a system prompt
- Few-shot serving — same few-shot example bank across thousands of queries
- Document-grounded assistants — same retrieved context shared across a session's turns
SGLang reports 5–10× speedups over vLLM on these workloads. For independent-request workloads with no prefix sharing, RadixAttention offers no benefit. Match the engine to the workload pattern.
Why this is really a storage trick — with scheduling strings attached
RadixAttention doesn't make attention faster. It lets you skip the prefill you've already paid for. That's why the latency win shows up in TTFT (time-to-first-token), not in throughput per decode step: once a prefix is cached, a new request with that prefix starts decoding almost immediately instead of paying the full prefill. For a 2048-token system prompt on a 7B model, that's roughly 400–600 ms erased from TTFT on an H100.
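The back-of-envelope arithmetic behind that figure: TTFT savings are just cached-prefix length divided by effective prefill throughput. The per-request rates below are assumptions for illustration — real numbers depend on batch size, kernels, and sequence length:

```python
def ttft_saved_ms(prefix_tokens: int, prefill_tok_per_s: float) -> float:
    """Rough TTFT saved by skipping prefill of an already-cached prefix."""
    return 1000.0 * prefix_tokens / prefill_tok_per_s

# Assumed effective per-request prefill rates for a 7B model on one H100.
for rate in (3500, 4500, 5500):
    print(f"{rate} tok/s -> {ttft_saved_ms(2048, rate):.0f} ms saved")
```

At those assumed rates, a 2048-token cached prefix saves roughly 370–590 ms, which brackets the 400–600 ms figure above.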
The catch is that the scheduler has to know about the cache. Zheng et al. (SGLang, 2024) showed that a naive FIFO scheduler destroys hit rate: LRU eviction can toss a hot prefix right before a burst of requests that would have reused it. SGLang's scheduler is prefix-aware: it reorders the request queue so sequences sharing a prefix are served together, pinning their common nodes in cache for the duration of the burst. That scheduler coupling is why you can't just "add RadixAttention to vLLM" as a library; prefix caching and scheduling are the same design decision. vLLM's own automatic prefix caching (APC) is a simpler hash-keyed reuse of block tables, which gets most of the storage win but not the scheduling one.
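The simplest static approximation of prefix-aware scheduling is to sort pending requests lexicographically by their token sequences, which clusters shared prefixes so their cached nodes stay hot between consecutive servings. SGLang's real policy is dynamic and cache-aware; this one-liner is only a sketch of the idea:

```python
def prefix_aware_order(requests):
    """Toy prefix-aware scheduling: lexicographic sort puts requests that
    share a prefix next to each other in the serving order, so their common
    KV entries are reused before LRU eviction can reclaim them."""
    return sorted(requests, key=tuple)

# Two requests branching from [1, 2, 3] get served back to back,
# instead of being interleaved with the unrelated [7, 8, 9].
queue = [[7, 8, 9], [1, 2, 3, 4], [1, 2, 3, 5]]
print(prefix_aware_order(queue))
```

The trade-off a real scheduler has to manage, and this sketch ignores, is fairness: strict prefix grouping can starve requests whose prefixes are cold.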