the aha moment
Turn the KV-cache formula from the lesson into a working calculator that takes a model config, batch size, and sequence length and predicts the exact number of bytes. Serve 1 / 4 / 16 / 64 concurrent requests on a real model and plot predicted vs. actual memory. The gap is everything the formula doesn't cover: activations, CUDA context, framework overhead. Now you know the correction factor for your stack.
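The core of the calculator fits in a few lines: two tensors (K and V) per layer, each shaped `[batch, kv_heads, seq_len, head_dim]`. Here's a minimal sketch; the config numbers (32 layers, 8 KV heads, head dim 128, fp16) are illustrative assumptions for an 8B-class model with grouped-query attention, not the exact values the lab uses:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Predicted KV-cache size in bytes.

    2 tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim],
    bytes_per_elem=2 assumes fp16/bf16 cache dtype.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative config (assumption): 32 layers, 8 KV heads (GQA), head_dim 128.
for batch in (1, 4, 16, 64):
    gib = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=batch) / 2**30
    print(f"batch {batch:2d}: {gib:5.1f} GiB")
```

Doubling the batch doubles the cache, which is why the 1 / 4 / 16 / 64 sweep makes the linear growth (and the constant overhead the formula misses) easy to read off the chart.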
the facts
- Time: 60 min
- Hardware: CPU · GPU · Mac · Colab
- Act: VIII · Serving the Model
- Status: Live
- Artifact: A KV-cache calculator script + a predicted-vs-actual memory chart.
run it locally
Clone the labs repo and run this lab as a script or open it as a notebook:
```shell
git clone https://github.com/iqbal-sk/Microscale-labs.git
cd Microscale-labs
just setup-auto   # auto-detects CPU / CUDA / Mac
just run 11       # or: jupyter lab labs/11-kv-cache-calculator/lab.py
```
Full install options (uv, pip, or the platform-specific CUDA paths) are in the labs README.
read alongside