hands-on · microscale academy
Labs.
A different way to learn sits next to the reading path. Twelve specimens wait on the workbench — all 448 attention heads of a 600M model classifying themselves into previous-token and induction patterns, a 10M transformer descending from noise to coherent English in twenty minutes of consumer GPU time, a 2 MB LoRA adapter that reshapes a model's voice on twenty cooking examples, your own GPU's bandwidth plotted on a roofline against your own model's arithmetic intensity.
Every one produces a number or a file you keep. None of them require a datacentre.
12 live today · 1 coming soon
Lab 01 · 30 min · CPU · Colab
Load four real BPE tokenizers — o200k, cl100k, p50k, gpt2 — feed them the same sentence in five languages, and watch the tokens-per-word ratio climb from 1× in English to 4–5× in Hindi on GPT-2. The fairness gap stops being theory and becomes a number you measured.
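The metric behind that chart is tiny. A minimal sketch — `tokens_per_word` works with any tokenizer callable (in the lab, a real BPE encoder's `encode`; here, a toy 3-character chunker of my own standing in for how BPE fragments under-represented scripts):

```python
def tokens_per_word(tokenize, text):
    """Fertility metric: tokens emitted per whitespace-separated word."""
    words = text.split()
    return len(tokenize(text)) / len(words)

# Toy stand-in tokenizer (hypothetical): splits every word into 3-character
# chunks, mimicking BPE fragmentation on scripts it saw little of in training.
def chunk3(text):
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

ratio = tokens_per_word(chunk3, "internationalisation matters")  # 10 tokens / 2 words
```

A whitespace "tokenizer" (`str.split`) scores a flat 1× on the same helper — the gap between the two is the shape of the fairness gap you'll measure with the real encoders.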
Lab 02 · 60–90 min · CPU · Mac · GPU · Colab
Load Qwen3-0.6B, register forward hooks on every attention layer, and extract all 448 head-patterns as heatmaps. Find the previous-token head, find the induction head, find the uniform global-attention head. Zero one and measure the perplexity hit. Some heads matter, most don't — and now you can prove it.
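The capture mechanism is one closure. A sketch on a toy layer — module name and hook bookkeeping are illustrative; the lab registers the same hook on every attention module of Qwen3-0.6B:

```python
import torch
import torch.nn as nn

captured = {}

def make_hook(name):
    # A forward hook receives (module, inputs, output); stash the output
    # under the layer's name so the patterns can be plotted later.
    def hook(module, inputs, output):
        captured[name] = output
    return hook

# Toy stand-in for one attention layer of the real model.
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
handle = attn.register_forward_hook(make_hook("layer0.attn"))

x = torch.randn(1, 5, 16)
out, weights = attn(x, x, x, need_weights=True, average_attn_weights=False)
handle.remove()  # always remove hooks once the captures are in
```

With `average_attn_weights=False` you get one (seq × seq) pattern per head — the heatmaps the lab sorts into previous-token, induction, and global buckets.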
Lab 03 · 90–120 min · CPU · Mac · Colab
Implement RMSNorm, RoPE, Grouped-Query Attention, and SwiGLU from scratch in PyTorch — no `nn.TransformerEncoderLayer`, no HuggingFace. Load the real weights from Qwen3-0.6B's layer 0 into your version, and wait for `torch.allclose(yours, theirs, atol=1e-5)` to return True. The hardest lab. The most satisfying lab.
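RMSNorm is the gentlest of the four pieces. A minimal PyTorch sketch of the math the lab checks against Qwen3's real layer-0 weights — not the lab's full notebook:

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm scales by the reciprocal root-mean-square of the last
    # dimension; unlike LayerNorm there is no mean subtraction and no bias.
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

x = torch.randn(2, 4, 8)
y = rms_norm(x, torch.ones(8))
```

With a unit weight vector, every output row has mean-square ≈ 1 — a quick sanity check before you move on to RoPE and GQA.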
Lab 04 · 45–60 min · CPU · Colab
Parse the safetensors header of SmolLM2-360M, Qwen3-0.6B, SmolLM3-3B, and Phi-4-mini without downloading a single weight. Detect each model's GQA group size, tied-vs-untied embeddings, SwiGLU hidden ratio, and vocab size from tensor names alone. See Phi-4-mini spend 31% of its params on vocabulary where SmolLM3 spends 18%.
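The trick is that safetensors puts everything you need in the first few kilobytes: 8 bytes of little-endian uint64 header length, then that many bytes of JSON mapping tensor names to dtype, shape, and offsets. A sketch that builds and parses a toy file so nothing gets downloaded (the tensor name and shape are illustrative, not read from a real checkpoint):

```python
import json
import os
import struct
import tempfile

def read_safetensors_header(path):
    """Read only the JSON header of a .safetensors file; the weights that
    follow it are never touched."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))      # header length, LE uint64
        return json.loads(f.read(n).decode("utf-8"))

# Hand-build a minimal file with one (illustrative) tensor entry.
header = {"model.embed_tokens.weight":
          {"dtype": "BF16", "shape": [151936, 1024], "data_offsets": [0, 0]}}
blob = json.dumps(header).encode("utf-8")
path = os.path.join(tempfile.mkdtemp(), "tiny.safetensors")
with open(path, "wb") as f:
    f.write(struct.pack("<Q", len(blob)) + blob)

parsed = read_safetensors_header(path)
```

Everything the lab detects — GQA group size, tied embeddings, SwiGLU ratio, vocab share — falls out of pattern-matching on those names and shapes.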
Lab 05 · 90–120 min · GPU · Mac · Colab · CPU
Train a 10M-parameter GPT-2 from scratch on TinyStories for ~20 minutes, watch the loss curve descend from random noise to coherent English, then train a second copy on corrupted data and see the textbook hypothesis as a measured gap. Compute cost: well under $1.
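The shape of that run, shrunk to a toy: one parameter, plain SGD, loss logged every step. Purely illustrative — the lab's loop swaps in a 10M-parameter GPT-2, cross-entropy, and AdamW, but the descent curve you watch is the same object:

```python
def train(steps=50, lr=0.1, target=3.0):
    # Minimise (w - target)^2 by gradient descent, logging the loss curve.
    w, losses = 0.0, []
    for _ in range(steps):
        loss = (w - target) ** 2   # toy stand-in for cross-entropy
        grad = 2 * (w - target)    # d(loss)/dw
        w -= lr * grad             # SGD update
        losses.append(loss)
    return w, losses

w, losses = train()
```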
Lab 06 · 60–90 min · CPU · Colab
Ask a small model 50 factual questions split into common-knowledge and long-tail buckets, score the answers with multi-alias matching, and plot the hallucination rate against the Kalai–Vempala theoretical lower bound. Measure ~5% on common questions, ~40–60% on long-tail — exactly where the proof says you should land.
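The scorer is the part people get wrong, so here it is in miniature — a sketch of multi-alias matching (normalise, then substring-match any accepted alias), plus the rate it feeds:

```python
def normalise(s):
    # Lowercase and collapse whitespace so "W.  A. Mozart" ≈ "w. a. mozart".
    return " ".join(s.lower().split())

def is_correct(answer, aliases):
    """An answer counts as correct if any accepted alias appears as a
    substring after normalisation."""
    a = normalise(answer)
    return any(normalise(alias) in a for alias in aliases)

def hallucination_rate(answers, alias_lists):
    wrong = sum(not is_correct(ans, al) for ans, al in zip(answers, alias_lists))
    return wrong / len(answers)
```

Run it once per bucket and the common-vs-long-tail gap is two numbers, ready to plot against the bound.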
Lab 07-A · 60–90 min · GPU · Mac · Colab · CPU
Implement LoRA from scratch — the A×B low-rank decomposition, the α/r scaling, the zero-init — attach it to Qwen3-0.6B's query projection, and fine-tune on 20 cooking-instruction examples. 24,576 trainable params. A 2 MB adapter. A noticeably shifted voice after 200 steps. Merge, verify, keep the file.
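The three ingredients — A×B decomposition, α/r scaling, zero-init — fit in one small module. A sketch (the 1024→2048 shape is my assumption about the target projection, chosen so the count matches the lab's quoted 24,576 trainable params):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank delta (alpha/r)·B@A.
    B starts at zero, so at step 0 the layer is exactly the base layer."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze the base
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

base = nn.Linear(1024, 2048, bias=False)  # assumed q_proj shape
lora = LoRALinear(base, r=8)
x = torch.randn(3, 1024)
```

Zero-init is the point: before training, `lora(x)` equals `base(x)` exactly, so the adapter can only move the model away from where it already is.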
Lab 07-B · 60–90 min · GPU · Mac · Colab · CPU
Same LoRA from 07-A, higher rank (r=16), two target projections (q_proj + v_proj), and training examples formatted with Qwen3's native tool-calling template. Teach the model to emit valid JSON for a 6-function kitchen-assistant API. Evaluate deterministically: does the JSON parse? Does the function name match? Does it generalise to held-out prompts?
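The deterministic part of that evaluation is a dozen lines of stdlib. A sketch — the six function names are hypothetical stand-ins for the lab's kitchen-assistant API:

```python
import json

# Hypothetical 6-function kitchen-assistant API (names are illustrative).
KNOWN_FUNCTIONS = {"set_timer", "convert_units", "find_recipe",
                   "scale_recipe", "list_pantry", "add_to_shopping_list"}

def score_tool_call(raw, expected_name):
    """Three deterministic checks: does the output parse as JSON, is the
    function name in the API, does it match the expected call?"""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return {"parses": False, "valid_name": False, "matches": False}
    name = call.get("name") if isinstance(call, dict) else None
    return {"parses": True,
            "valid_name": name in KNOWN_FUNCTIONS,
            "matches": name == expected_name}
```

No judge model, no fuzzy grading — each held-out prompt yields three booleans you can aggregate.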
Lab 08 · 90 min · GPU · Mac · Colab
Build 20 preference pairs for a narrow task (chosen vs rejected responses), run TRL's DPOTrainer on Qwen3-0.6B for 100 steps, and watch the model's behaviour shift from generic-chatbot to specifically-matches-your-chosen-examples. Alignment stops being abstract and becomes a trained adapter you can A/B test.
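What DPOTrainer optimises, per preference pair, is one line of math: push the policy's chosen-minus-rejected log-prob margin above the reference model's. A sketch of that objective (TRL handles the batching, tokenisation, and adapters for you):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective for one pair: -log(sigmoid(beta * margin)), where the
    margin is the policy's (chosen - rejected) log-prob gap minus the
    reference model's gap for the same pair."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy has learned nothing beyond the reference, the margin is 0 and the loss sits at log 2; preferring your chosen examples drives it down.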
Lab 09 · 90 min · CPU · Colab
Take a single 3072×3072 weight tensor from Qwen3-0.6B and implement three quantisation schemes from scratch — naive 4-bit uniform, NF4 quantile-binned, K-quant Q4_K_M with sub-block scales. Measure L2 error for each. Watch naive lose 3× to NF4, and NF4 lose 2× to Q4_K_M. The hierarchy of quantisation tricks is now a chart you built.
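The naive baseline — the one that loses 3× to NF4 — is short enough to show whole. A sketch on an 8-value toy "tensor" (the lab runs this over all 9.4M values of the real matrix):

```python
def quantise_4bit(values):
    """Naive symmetric 4-bit uniform quantisation: one absmax scale for the
    whole tensor, integer levels clamped to [-7, 7]."""
    scale = max(abs(v) for v in values) / 7
    q = [max(-7, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantise(q, scale):
    return [qi * scale for qi in q]

def l2_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

w = [0.02, -0.15, 0.4, -0.07, 0.31, 0.0, -0.42, 0.11]  # toy weights
q, s = quantise_4bit(w)
err = l2_error(w, dequantise(q, s))
```

NF4 replaces the uniform grid with quantile-spaced bins; Q4_K_M shrinks the blast radius of each scale with sub-blocks. Same `l2_error` call, three points on your chart.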
Lab 10 · 60–90 min · GPU · Mac · Colab
Measure your GPU's actual sustained bandwidth (not spec-sheet) with a memcpy microbenchmark, measure sustained compute with a matmul microbench, and plot YOUR hardware's roofline. Overlay your model's arithmetic intensity at batch=1 (decode) and batch=32 (prefill-like). See decode sitting deep in the bandwidth-bound region on your actual GPU.
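The overlay point comes from a back-of-envelope model. A sketch, under loud assumptions — ~2 FLOPs per parameter per token, one full weight read per decode step, activations ignored, peak numbers hypothetical:

```python
def arithmetic_intensity(params, batch, bytes_per_param=2):
    """FLOPs per byte for one decode step: the whole batch shares a single
    read of the weights, so intensity grows linearly with batch size."""
    flops = 2 * params * batch            # ~2 FLOPs/param/token
    bytes_moved = params * bytes_per_param  # fp16/bf16 weight traffic
    return flops / bytes_moved

def bound(ai, peak_flops, peak_bandwidth):
    # The ridge point peak_flops / peak_bandwidth splits the roofline:
    # below it, memory bandwidth limits you; above it, compute does.
    ridge = peak_flops / peak_bandwidth
    return "bandwidth-bound" if ai < ridge else "compute-bound"
```

At batch=1 the intensity is ~1 FLOP/byte while typical GPU ridge points sit in the tens to hundreds — which is why decode lands so deep in the bandwidth-bound region.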
Lab 11 · 60 min · CPU · GPU · Mac · Colab
Turn the KV-cache formula from the lesson into a working calculator that takes model config + batch size + sequence length and predicts exact bytes. Serve 1 / 4 / 16 / 64 concurrent requests on a real model and plot predicted vs actual memory. The gap is everything the formula doesn't cover — activations, CUDA context, framework overhead. Now you know the correction factor for your stack.
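The calculator's core is one multiplication chain. A sketch — the example config is illustrative, roughly Qwen3-0.6B-shaped, not copied from the checkpoint:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elt=2):
    """KV-cache footprint: K and V (the leading 2), one slot per layer,
    per KV head, per head dimension, per position, per sequence, at
    fp16/bf16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elt

# Illustrative config: 28 layers, 8 KV heads (GQA), head_dim 128,
# 4096-token context, one request.
predicted = kv_cache_bytes(28, 8, 128, 4096, 1)
```

Everything the formula predicts scales linearly in batch and sequence length — the measured curve's offset from that line is your stack's correction factor.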
Lab 12 · 45–60 min · CPU · GPU · Mac · hot pick · coming soon
Take one model (Qwen3-0.6B-Q4_K_M) and serve it through Ollama, llama.cpp-server, and (depending on your hardware) vLLM or MLX-LM. Measure cold-start time, TTFT, tok/s at batch=1, tok/s at batch=8, and peak memory for each. Find the crossover point where vLLM's batching advantage overtakes Ollama's simplicity — on YOUR hardware, not someone's benchmark blog.
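The harness is backend-agnostic by design. A sketch — `generate` is any iterator of tokens (in the lab, an HTTP token stream from each server; here, a fake stand-in so the timing logic can run anywhere):

```python
import time

def benchmark(generate, prompt):
    """TTFT is the wait for the first token; tok/s is measured over the
    rest of the stream, so the two numbers don't contaminate each other."""
    start = time.perf_counter()
    ttft, count = None, 0
    for _ in generate(prompt):
        count += 1
        if ttft is None:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    if count > 1 and total > ttft:
        tok_per_s = (count - 1) / (total - ttft)
    else:
        tok_per_s = float("nan")
    return ttft, tok_per_s

def fake_stream(prompt, n=10, delay=0.001):
    # Hypothetical stand-in for a real backend's streaming response.
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps = benchmark(fake_stream, "hello")
```

Point the same function at each backend at batch 1 and batch 8 and the crossover table writes itself.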