Microscale
Act VIII · Serving the Model
lesson continuous-batching · 9 min · 45 xp

Continuous batching

Requests stream through the GPU

Static batching wastes the GPU

Classical batched inference: wait for N requests to arrive, forward them together, wait for all to finish, return them all, start another batch. Sounds efficient — and it is, until you notice that one slow sequence holds up all the rest. Long outputs waste GPU cycles on padding for the short ones.
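The waste is easy to quantify in a toy model (the function names here are illustrative, not from any serving engine): a static batch occupies the GPU until its longest sequence finishes, and every shorter sequence pads out the difference.

```python
# Toy static batching: a batch of N requests runs until the LONGEST
# sequence finishes; shorter sequences spend the remaining steps as padding.

def static_batch_steps(output_lens):
    """Decode steps consumed by one static batch: gated by the longest output."""
    return max(output_lens)

def static_wasted_slots(output_lens):
    """Slot-steps spent padding for sequences that already finished."""
    longest = max(output_lens)
    return sum(longest - n for n in output_lens)

lens = [3, 5, 20, 4]              # output lengths of four requests
print(static_batch_steps(lens))   # 20: the one slow request gates everyone
print(static_wasted_slots(lens))  # 48 of the 80 slot-steps are padding
```

One long output turns most of the batch's slot-steps into padding, which is exactly the waste the next section's streaming model removes.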

Continuous batching (also called in-flight batching) replaces the static batch with a streaming model. Every GPU forward pass processes whatever is currently running. Finished sequences drop out; new requests join as slots free up. No waiting, no padding, no idle cycles.
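The streaming model can be sketched as a loop over decode steps. This is a minimal toy simulation (function name, slot limit, and known-in-advance output lengths are simplifying assumptions, not how a real engine works): each forward pass decodes one token for every live sequence, finished sequences leave, and queued requests join as slots free up.

```python
from collections import deque

def continuous_batch_trace(output_lens, max_slots):
    """Simulate iteration-level scheduling. Returns the active batch
    size at each decode step."""
    waiting = deque(output_lens)      # queued requests (tokens left to decode)
    running = {}                      # request id -> tokens still to decode
    next_id = 0
    trace = []
    while waiting or running:
        # Admit queued requests into free slots before this forward pass.
        while waiting and len(running) < max_slots:
            running[next_id] = waiting.popleft()
            next_id += 1
        # One decode step for every live sequence; finished ones drop out.
        running = {rid: left - 1 for rid, left in running.items()}
        trace.append(len(running))
        running = {rid: left for rid, left in running.items() if left > 0}
    return trace

trace = continuous_batch_trace([3, 5, 20, 4], max_slots=3)
print(len(trace))   # 20 decode steps total
print(sum(trace))   # 32 slot-steps = sum of output lengths: zero padding
```

Every occupied slot-step decodes a real token, so the total slot-step count equals the total number of tokens generated, with no padding overhead at all.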

[Animation: active requests R1–R5 over time steps 0..17; requests stream through, entering and leaving independently]

Why it actually wins — amortized HBM reads

The marketing answer is "no padding, no waiting," but that undersells it. The real win is that decode is bandwidth-bound on the weights. For a 7B fp16 model, each decode step reads all 14 GB of weights from HBM once. On an H100 at 3.35 TB/s, that is a 4.2 ms floor whether you decode one token or thirty-two. Static batching wastes that read: the batch is only full at the start, and as sequences finish, the GPU keeps paying the full weight-read cost to decode fewer and fewer live requests. Continuous batching keeps the batch topped up, so every forward pass amortizes that 14 GB HBM read across whichever 30+ sequences happen to be mid-decode right now. Yu et al., "Orca" (OSDI 2022), introduced this as iteration-level scheduling: the scheduler runs at the granularity of a single decode step, not a whole request.
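The arithmetic behind those numbers, using only the figures from the paragraph above (7B parameters, fp16, 3.35 TB/s HBM bandwidth; the batch sizes are illustrative):

```python
# Back-of-envelope: per-step weight-read floor and amortized per-token cost.
params = 7e9
bytes_per_param = 2                   # fp16
hbm_bw = 3.35e12                      # H100 HBM bandwidth, bytes/s

weight_bytes = params * bytes_per_param       # 14 GB read per decode step
step_floor_ms = weight_bytes / hbm_bw * 1e3   # ~4.2 ms, regardless of batch

for batch in (1, 8, 32):
    # The same 14 GB read is shared by every sequence in the batch.
    print(f"batch {batch:2d}: {step_floor_ms / batch:.2f} ms/token amortized")
```

At batch size 1 you pay the full 4.2 ms per token; at 32 the same read costs about 0.13 ms per token, which is why keeping the batch full matters more than avoiding padding.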

Continuous batching is usually a 2–5× throughput improvement on typical LLM serving workloads. Combined with PagedAttention (which removes memory waste) and chunked prefill (which interleaves prompt-processing with decode steps), you get compounding gains.

Every 2024+ production serving engine implements continuous batching: vLLM, SGLang, TGI, TensorRT-LLM, Ollama (to a limited extent via llama.cpp's parallel mode). The details differ — chunked prefill, prefill-decode disaggregation, cache-aware schedulers — but the core idea is the same.