Microscale
Act IX · Ship It
lesson vllm · 9 min · 45 xp

vLLM in production

Build a config for any workload

vLLM — the production serving king

vLLM is the reference implementation of everything you learned in Act VIII: PagedAttention, continuous batching, chunked prefill, FlashAttention-3, AWQ/GPTQ/FP8 quantization, speculative decoding, multi-LoRA serving, XGrammar-based constrained decoding. All of it in one serving engine with an OpenAI-compatible HTTP API.

It's the answer to “I need to serve a lot of concurrent users on a GPU efficiently.” Pick your goal below and read the corresponding config.

vLLM's throughput edge over naive HuggingFace generate() — reported as up to 24× in the original vLLM work (Kwon et al., 2023) — is three stacked wins, not one. (1) PagedAttention splits the KV cache into 16-token blocks with per-sequence block tables, eliminating the internal fragmentation that a pre-allocated [max_len, n_heads, d] buffer wastes; in practice that buys 2–4× more concurrent sequences on the same VRAM. (2) Continuous batching refills each slot the instant a request finishes decoding instead of waiting for the whole batch to drain, pinning GPU utilization at 85–95% instead of the 20–40% you get from static-batch HF pipelines. (3) Chunked prefill interleaves prefill chunks with ongoing decode steps, so one 32k-context prompt can't head-of-line-block every other request behind it. Remove any one of these and the advantage collapses — which is why Ollama and TGI, both missing pieces of this stack, fall off the cliff under real concurrency.
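The block-table bookkeeping behind win (1) is simple enough to sketch. This is a toy illustration of the idea, not vLLM's actual data structures: each sequence maps logical token positions onto 16-token physical blocks drawn from a shared free-list, so waste is bounded by one partially filled block per sequence instead of a whole pre-allocated max-length buffer.

```python
# Toy sketch of a PagedAttention-style block table (illustrative only,
# not vLLM's implementation). Logical KV positions map to 16-token
# physical blocks allocated on demand from a shared free-list.
BLOCK_SIZE = 16

class BlockTable:
    def __init__(self, free_list):
        self.free_list = free_list
        self.blocks = []          # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        # Grab a new physical block only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_list.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # logical position -> (physical block id, offset within block)
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

free = list(range(1024))          # pool of physical KV blocks
seq = BlockTable(free)
for _ in range(40):               # a 40-token sequence
    seq.append_token()

print(len(seq.blocks))            # 3 — 40 tokens waste at most 8 slots
print(seq.physical_slot(37))     # third block, offset 5
```

The free-list is what makes sharing across sequences trivial: a finished request just returns its blocks to the pool, and the next request picks them up mid-batch.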

suggested config
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct \
  --quantization awq \
  --kv-cache-dtype fp8 \
  --speculative-model Qwen/Qwen3-0.6B-Instruct \
  --num-speculative-tokens 5 \
  --max-model-len 8192 \
  --dtype half
why each flag
  • AWQ quantization keeps quality close to FP16 while cutting memory.
  • FP8 KV cache halves per-sequence memory — more concurrent slots for the same GPU.
  • Speculative decoding with a 0.6B drafter can give ~2–3× decode speedup when draft tokens are accepted at a high rate — the cost is holding a second model in VRAM.
  • Short-ish context length (8k) keeps prefill fast — most chat requests don't need more.
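The FP8 KV-cache claim is easy to sanity-check with back-of-envelope arithmetic. The model dimensions below (36 layers, 8 KV heads, head_dim 128) are illustrative placeholders, not measured Qwen3-4B values:

```python
# Back-of-envelope KV-cache budget. Dimensions are assumed for
# illustration, not taken from any model's actual config.
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elt):
    # Factor of 2: one K and one V vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt

fp16 = kv_bytes_per_token(36, 8, 128, 2)   # 2 bytes/element
fp8  = kv_bytes_per_token(36, 8, 128, 1)   # 1 byte/element
ctx  = 8192                                # --max-model-len

print(fp16 * ctx / 2**20)   # 1152.0 MiB per full-length sequence at FP16
print(fp8  * ctx / 2**20)   # 576.0 MiB at FP8 — twice the concurrent slots
```

Same GPU, same context length: halving the per-element width doubles how many full-length sequences fit in the KV pool, which is exactly where continuous batching gets its extra slots.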

Picking a runtime — the 30-second decision tree

Ollama: if you're the only user, you want one command to stand up a model, and “good enough” is ~60 tok/s on your laptop. Dev loops, internal tools, CLI agents, demos at a coffee shop.

MLX / MLX-LM: if you're on Apple Silicon and the model is bigger than what fits in any consumer NVIDIA card — a 70B Q4 on an M2 Ultra is unmatched price-per-GB-of-addressable-weights, because the alternative is a multi-GPU rig with NVLink. Also the right answer if you need to LoRA-fine-tune on a MacBook without spinning up a cloud box.

vLLM: the moment you serve more than a handful of concurrent users, need multi-LoRA fan-out, or care about P99 latency under load. The threshold in practice: if you have a single NVIDIA GPU with ≥ 16 GB VRAM and expect any parallel traffic, vLLM's PagedAttention pays for the extra ops complexity within the first hour of traffic. Below that threshold, the overhead of installing CUDA-matched wheels isn't worth it.
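The same decision tree, condensed into code. The thresholds here (a "handful" taken as 5 concurrent users, 16 GB VRAM) are the rules of thumb from the text, not hard limits:

```python
# Toy encoding of the 30-second decision tree above. Cutoffs are
# rules of thumb, not benchmarked boundaries.
def pick_runtime(concurrent_users: int,
                 apple_silicon: bool,
                 nvidia_vram_gb: int) -> str:
    if concurrent_users > 5 and nvidia_vram_gb >= 16:
        return "vllm"      # real parallel traffic on a capable NVIDIA GPU
    if apple_silicon:
        return "mlx"       # big models in unified memory, local LoRA tuning
    return "ollama"        # single user, one command, good enough

print(pick_runtime(50, False, 24))   # vllm
print(pick_runtime(1, True, 0))      # mlx
print(pick_runtime(1, False, 8))     # ollama
```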

You now have the whole picture

You've walked the whole map:

  • Act 0 — what a language model is.
  • Act I — why small models exist.
  • Act II — every component of a 2026 transformer block.
  • Act III — the industry bestiary and head-to-head comparisons.
  • Act IV — scaling laws, textbook data, distillation.
  • Act V — the honest limits: emergence, reasoning, hallucination, lost-in-middle.
  • Act VI — LoRA, QLoRA, DPO, GRPO, recipes.
  • Act VII — formats, K-quants, BitNet ternary.
  • Act VIII — KV cache, PagedAttention, continuous batching, FlashAttention, speculative decoding, RadixAttention.
  • Act IX — Ollama for dev, MLX for Mac, vLLM for production.

You can now look at a modern SLM release, read its technical report, and understand every architectural decision. You can fine-tune one for a specialized task. You can pick a quantization level and serving engine for any workload. You can explain why the choices are what they are, not just that they work.

Go build something specialized. The smallest thing that works is always the right answer.