Microscale
Act VIII · Region 08

Serving the Model

The hardware-level act: KV cache, paging, speculation

Watch the KV cache grow token by token. See PagedAttention turn the GPU into a tiny operating system. See speculative decoding dance: drafts proposed, then verified. This is where SLM serving becomes economical.

badge · KV Cache Master
  1. Inside the GPU
     The memory pyramid, the roofline, and why decode is bandwidth-bound (the prerequisite)
  2. The KV cache
     What it is, why it's the binding constraint
  3. PagedAttention
     Virtual memory for transformers
  4. Continuous batching
     Requests stream through the GPU
  5. FlashAttention
     Never materialize the L×L matrix
  6. Speculative decoding
     Draft · verify · accept
  7. RadixAttention
     Prefix trees for shared-prompt workloads