the aha moment
Turn the KV-cache formula from the lesson into a working calculator that takes a model config, batch size, and sequence length and predicts the exact number of bytes. Serve 1 / 4 / 16 / 64 concurrent requests on a real model and plot predicted vs. actual memory. The gap is everything the formula doesn't cover: activations, CUDA context, framework overhead. Now you know the correction factor for your stack.
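The core of the calculator fits in a few lines: two tensors (K and V) per layer, each shaped `[batch, kv_heads, seq_len, head_dim]`. Here's a minimal sketch; the config numbers (32 layers, 8 KV heads, head dim 128, fp16) are illustrative assumptions for an 8B-class model with grouped-query attention, not the exact values the lab uses:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Predicted KV-cache size in bytes.

    2 tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim],
    bytes_per_elem=2 assumes fp16/bf16 cache dtype.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative config (assumption): 32 layers, 8 KV heads (GQA), head_dim 128.
for batch in (1, 4, 16, 64):
    gib = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=batch) / 2**30
    print(f"batch {batch:2d}: {gib:5.1f} GiB")
```

Doubling the batch doubles the cache, which is why the 1 / 4 / 16 / 64 sweep makes the linear growth (and the constant overhead the formula misses) easy to read off the chart.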
the facts
- Time: 60 min
- Hardware: CPU · GPU · Mac · Colab
- Act: VIII · Serving the Model
- Status: Live
- Artifact: A KV-cache calculator script + a predicted-vs-actual memory chart.
run it locally
Clone the labs repo and run this lab as a script or open it as a notebook:
```shell
git clone https://github.com/iqbal-sk/Microscale-labs.git
cd Microscale-labs
just setup-auto   # auto-detects CPU / CUDA / Mac
just run 11       # or: jupyter lab labs/11-kv-cache-calculator/lab.py
```
Full install options (uv, pip, or the platform-specific CUDA paths) are in the labs README.
read alongside