Serving the Model

The hardware-level act: KV cache, paging, speculation

Watch the KV cache grow. See PagedAttention turn the GPU into a tiny operating system. See speculative decoding dance: drafts proposed, then verified. This is where SLM serving becomes economical.

badge · KV Cache Master

1. Inside the GPU (15 min · 60 xp)
   The memory pyramid, the roofline, and why decode is bandwidth-bound (the prerequisite)

2. The KV cache (10 min · 50 xp)
   What it is, why it's the binding constraint

3. PagedAttention (12 min · 60 xp)
   Virtual memory for transformers

4. Continuous batching (9 min · 45 xp)
   Requests stream through the GPU

5. FlashAttention (10 min · 50 xp)
   Never materialize the L×L matrix

6. Speculative decoding (11 min · 55 xp)
   Draft · verify · accept

7. RadixAttention (10 min · 50 xp)
   Prefix trees for shared-prompt workloads