hands-on · microscale academy
Labs.
A different way to learn sits next to the reading path. Twelve specimens wait on the workbench — all 448 attention heads of a 600M model classifying themselves into previous-token and induction patterns, a 10M transformer descending from noise to coherent English in twenty minutes of consumer GPU time, a 2 MB LoRA adapter that reshapes a model's voice on twenty cooking examples, your own GPU's bandwidth plotted on a roofline against your own model's arithmetic intensity.
Every one produces a number or a file you keep. None of them require a datacentre.
12 live today · 1 coming soon
Lab 01 · 30 min · CPU · Colab
Load four real BPE tokenizers — o200k, cl100k, p50k, gpt2 — feed them the same sentence in five languages, and watch the tokens-per-word ratio climb from 1× in English to 4–5× in Hindi on GPT-2. The fairness gap stops being theory and becomes a number you measured.
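The metric behind that chart is tiny. A minimal sketch — `tokens_per_word` works with any tokenizer callable (in the lab, a real BPE encoder's `encode`; here, a toy 3-character chunker of my own standing in for how BPE fragments under-represented scripts):

```python
def tokens_per_word(tokenize, text):
    """Fertility metric: tokens emitted per whitespace-separated word."""
    words = text.split()
    return len(tokenize(text)) / len(words)

# Toy stand-in tokenizer (hypothetical): splits every word into 3-character
# chunks, mimicking BPE fragmentation on scripts it saw little of in training.
def chunk3(text):
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

ratio = tokens_per_word(chunk3, "internationalisation matters")  # 10 tokens / 2 words
```

A whitespace "tokenizer" (`str.split`) scores a flat 1× on the same helper — the gap between the two is the shape of the fairness gap you'll measure with the real encoders.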
Lab 02 · 60–90 min · CPU · Mac · GPU · Colab
Load Qwen3-0.6B, register forward hooks on every attention layer, and extract all 448 head-patterns as heatmaps. Find the previous-token head, find the induction head, find the uniform global-attention head. Zero one and measure the perplexity hit. Some heads matter, most don't — and now you can prove it.
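The capture mechanism is one closure. A sketch on a toy layer — module name and hook bookkeeping are illustrative; the lab registers the same hook on every attention module of Qwen3-0.6B:

```python
import torch
import torch.nn as nn

captured = {}

def make_hook(name):
    # A forward hook receives (module, inputs, output); stash the output
    # under the layer's name so the patterns can be plotted later.
    def hook(module, inputs, output):
        captured[name] = output
    return hook

# Toy stand-in for one attention layer of the real model.
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
handle = attn.register_forward_hook(make_hook("layer0.attn"))

x = torch.randn(1, 5, 16)
out, weights = attn(x, x, x, need_weights=True, average_attn_weights=False)
handle.remove()  # always remove hooks once the captures are in
```

With `average_attn_weights=False` you get one (seq × seq) pattern per head — the heatmaps the lab sorts into previous-token, induction, and global buckets.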
Lab 03 · 90–120 min · CPU · Mac · Colab
Implement RMSNorm, RoPE, Grouped-Query Attention, and SwiGLU from scratch in PyTorch — no `nn.TransformerEncoderLayer`, no HuggingFace. Load the real weights from Qwen3-0.6B's layer 0 into your version, and wait for `torch.allclose(yours, theirs, atol=1e-5)` to return True. The hardest lab. The most satisfying lab.
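RMSNorm is the gentlest of the four pieces. A minimal PyTorch sketch of the math the lab checks against Qwen3's real layer-0 weights — not the lab's full notebook:

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm scales by the reciprocal root-mean-square of the last
    # dimension; unlike LayerNorm there is no mean subtraction and no bias.
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

x = torch.randn(2, 4, 8)
y = rms_norm(x, torch.ones(8))
```

With a unit weight vector, every output row has mean-square ≈ 1 — a quick sanity check before you move on to RoPE and GQA.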
Lab 04 · 45–60 min · CPU · Colab
Parse the safetensors header of SmolLM2-360M, Qwen3-0.6B, SmolLM3-3B, and Phi-4-mini without downloading a single weight. Detect each model's GQA group size, tied-vs-untied embeddings, SwiGLU hidden ratio, and vocab size from tensor names alone. See Phi-4-mini spend 31% of its params on vocabulary where SmolLM3 spends 18%.
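The trick is that safetensors puts everything you need in the first few kilobytes: 8 bytes of little-endian uint64 header length, then that many bytes of JSON mapping tensor names to dtype, shape, and offsets. A sketch that builds and parses a toy file so nothing gets downloaded (the tensor name and shape are illustrative, not read from a real checkpoint):

```python
import json
import os
import struct
import tempfile

def read_safetensors_header(path):
    """Read only the JSON header of a .safetensors file; the weights that
    follow it are never touched."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))      # header length, LE uint64
        return json.loads(f.read(n).decode("utf-8"))

# Hand-build a minimal file with one (illustrative) tensor entry.
header = {"model.embed_tokens.weight":
          {"dtype": "BF16", "shape": [151936, 1024], "data_offsets": [0, 0]}}
blob = json.dumps(header).encode("utf-8")
path = os.path.join(tempfile.mkdtemp(), "tiny.safetensors")
with open(path, "wb") as f:
    f.write(struct.pack("<Q", len(blob)) + blob)

parsed = read_safetensors_header(path)
```

Everything the lab detects — GQA group size, tied embeddings, SwiGLU ratio, vocab share — falls out of pattern-matching on those names and shapes.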
Lab 05 · 90–120 min · GPU · Mac · Colab · CPU
Train a 10M-parameter GPT-2 from scratch on TinyStories for ~20 minutes, watch the loss curve descend from random noise to coherent English, then train a second copy on corrupted data and see the textbook hypothesis as a measured gap. Compute cost: well under $1.
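The shape of that run, shrunk to a toy: one parameter, plain SGD, loss logged every step. Purely illustrative — the lab's loop swaps in a 10M-parameter GPT-2, cross-entropy, and AdamW, but the descent curve you watch is the same object:

```python
def train(steps=50, lr=0.1, target=3.0):
    # Minimise (w - target)^2 by gradient descent, logging the loss curve.
    w, losses = 0.0, []
    for _ in range(steps):
        loss = (w - target) ** 2   # toy stand-in for cross-entropy
        grad = 2 * (w - target)    # d(loss)/dw
        w -= lr * grad             # SGD update
        losses.append(loss)
    return w, losses

w, losses = train()
```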
Lab 06 · 60–90 min · CPU · Colab
Ask a small model 50 factual questions split into common-knowledge and long-tail buckets, score the answers with multi-alias matching, and plot the hallucination rate against the Kalai–Vempala theoretical lower bound. Measure ~5% on common questions, ~40–60% on long-tail — exactly where the proof says you should land.
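The scorer is the part people get wrong, so here it is in miniature — a sketch of multi-alias matching (normalise, then substring-match any accepted alias), plus the rate it feeds:

```python
def normalise(s):
    # Lowercase and collapse whitespace so "W.  A. Mozart" ≈ "w. a. mozart".
    return " ".join(s.lower().split())

def is_correct(answer, aliases):
    """An answer counts as correct if any accepted alias appears as a
    substring after normalisation."""
    a = normalise(answer)
    return any(normalise(alias) in a for alias in aliases)

def hallucination_rate(answers, alias_lists):
    wrong = sum(not is_correct(ans, al) for ans, al in zip(answers, alias_lists))
    return wrong / len(answers)
```

Run it once per bucket and the common-vs-long-tail gap is two numbers, ready to plot against the bound.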
Lab 07-A · 60–90 min · GPU · Mac · Colab · CPU
Implement LoRA from scratch — the A×B low-rank decomposition, the α/r scaling, the zero-init — attach it to Qwen3-0.6B's query projection, and fine-tune on 20 cooking-instruction examples. 24,576 trainable params. A 2 MB adapter. A noticeably shifted voice after 200 steps. Merge, verify, keep the file.
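The three ingredients — A×B decomposition, α/r scaling, zero-init — fit in one small module. A sketch (the 1024→2048 shape is my assumption about the target projection, chosen so the count matches the lab's quoted 24,576 trainable params):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank delta (alpha/r)·B@A.
    B starts at zero, so at step 0 the layer is exactly the base layer."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze the base
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

base = nn.Linear(1024, 2048, bias=False)  # assumed q_proj shape
lora = LoRALinear(base, r=8)
x = torch.randn(3, 1024)
```

Zero-init is the point: before training, `lora(x)` equals `base(x)` exactly, so the adapter can only move the model away from where it already is.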
Lab 07-B · 60–90 min · GPU · Mac · Colab · CPU
Same LoRA from 07-A, higher rank (r=16), two target projections (q_proj + v_proj), and training examples formatted with Qwen3's native tool-calling template. Teach the model to emit valid JSON for a 6-function kitchen-assistant API. Evaluate deterministically: does the JSON parse? Does the function name match? Does it generalise to held-out prompts?
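The deterministic part of that evaluation is a dozen lines of stdlib. A sketch — the six function names are hypothetical stand-ins for the lab's kitchen-assistant API:

```python
import json

# Hypothetical 6-function kitchen-assistant API (names are illustrative).
KNOWN_FUNCTIONS = {"set_timer", "convert_units", "find_recipe",
                   "scale_recipe", "list_pantry", "add_to_shopping_list"}

def score_tool_call(raw, expected_name):
    """Three deterministic checks: does the output parse as JSON, is the
    function name in the API, does it match the expected call?"""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return {"parses": False, "valid_name": False, "matches": False}
    name = call.get("name") if isinstance(call, dict) else None
    return {"parses": True,
            "valid_name": name in KNOWN_FUNCTIONS,
            "matches": name == expected_name}
```

No judge model, no fuzzy grading — each held-out prompt yields three booleans you can aggregate.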
Lab 08 · 90 min · GPU · Mac · Colab
Build 20 preference pairs for a narrow task (chosen vs rejected responses), run TRL's DPOTrainer on Qwen3-0.6B for 100 steps, and watch the model's behaviour shift from generic-chatbot to specifically-matches-your-chosen-examples. Alignment stops being abstract and becomes a trained adapter you can A/B test.
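What DPOTrainer optimises, per preference pair, is one line of math: push the policy's chosen-minus-rejected log-prob margin above the reference model's. A sketch of that objective (TRL handles the batching, tokenisation, and adapters for you):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective for one pair: -log(sigmoid(beta * margin)), where the
    margin is the policy's (chosen - rejected) log-prob gap minus the
    reference model's gap for the same pair."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy has learned nothing beyond the reference, the margin is 0 and the loss sits at log 2; preferring your chosen examples drives it down.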
Lab 09 · 90 min · CPU · Colab
Take a single 3072×3072 weight tensor from Qwen3-0.6B and implement three quantisation schemes from scratch — naive 4-bit uniform, NF4 quantile-binned, K-quant Q4_K_M with sub-block scales. Measure L2 error for each. Watch naive lose 3× to NF4, and NF4 lose 2× to Q4_K_M. The hierarchy of quantisation tricks is now a chart you built.
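The naive baseline — the one that loses 3× to NF4 — is short enough to show whole. A sketch on an 8-value toy "tensor" (the lab runs this over all 9.4M values of the real matrix):

```python
def quantise_4bit(values):
    """Naive symmetric 4-bit uniform quantisation: one absmax scale for the
    whole tensor, integer levels clamped to [-7, 7]."""
    scale = max(abs(v) for v in values) / 7
    q = [max(-7, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantise(q, scale):
    return [qi * scale for qi in q]

def l2_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

w = [0.02, -0.15, 0.4, -0.07, 0.31, 0.0, -0.42, 0.11]  # toy weights
q, s = quantise_4bit(w)
err = l2_error(w, dequantise(q, s))
```

NF4 replaces the uniform grid with quantile-spaced bins; Q4_K_M shrinks the blast radius of each scale with sub-blocks. Same `l2_error` call, three points on your chart.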
Lab 10 · 60–90 min · GPU · Mac · Colab
Measure your GPU's actual sustained bandwidth (not spec-sheet) with a memcpy microbenchmark, measure sustained compute with a matmul microbench, and plot YOUR hardware's roofline. Overlay your model's arithmetic intensity at batch=1 (decode) and batch=32 (prefill-like). See decode sitting deep in the bandwidth-bound region on your actual GPU.
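The overlay point comes from a back-of-envelope model. A sketch, under loud assumptions — ~2 FLOPs per parameter per token, one full weight read per decode step, activations ignored, peak numbers hypothetical:

```python
def arithmetic_intensity(params, batch, bytes_per_param=2):
    """FLOPs per byte for one decode step: the whole batch shares a single
    read of the weights, so intensity grows linearly with batch size."""
    flops = 2 * params * batch            # ~2 FLOPs/param/token
    bytes_moved = params * bytes_per_param  # fp16/bf16 weight traffic
    return flops / bytes_moved

def bound(ai, peak_flops, peak_bandwidth):
    # The ridge point peak_flops / peak_bandwidth splits the roofline:
    # below it, memory bandwidth limits you; above it, compute does.
    ridge = peak_flops / peak_bandwidth
    return "bandwidth-bound" if ai < ridge else "compute-bound"
```

At batch=1 the intensity is ~1 FLOP/byte while typical GPU ridge points sit in the tens to hundreds — which is why decode lands so deep in the bandwidth-bound region.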
Lab 11 · 60 min · CPU · GPU · Mac · Colab
Turn the KV-cache formula from the lesson into a working calculator that takes model config + batch size + sequence length and predicts exact bytes. Serve 1 / 4 / 16 / 64 concurrent requests on a real model and plot predicted vs actual memory. The gap is everything the formula doesn't cover — activations, CUDA context, framework overhead. Now you know the correction factor for your stack.
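The calculator's core is one multiplication chain. A sketch — the example config is illustrative, roughly Qwen3-0.6B-shaped, not copied from the checkpoint:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elt=2):
    """KV-cache footprint: K and V (the leading 2), one slot per layer,
    per KV head, per head dimension, per position, per sequence, at
    fp16/bf16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elt

# Illustrative config: 28 layers, 8 KV heads (GQA), head_dim 128,
# 4096-token context, one request.
predicted = kv_cache_bytes(28, 8, 128, 4096, 1)
```

Everything the formula predicts scales linearly in batch and sequence length — the measured curve's offset from that line is your stack's correction factor.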
Lab 12 · 45–60 min · CPU · GPU · Mac · hot pick · coming soon
Take one model (Qwen3-0.6B-Q4_K_M) and serve it through Ollama, llama.cpp-server, and (depending on your hardware) vLLM or MLX-LM. Measure cold-start time, TTFT, tok/s at batch=1, tok/s at batch=8, and peak memory for each. Find the crossover point where vLLM's batching advantage overtakes Ollama's simplicity — on YOUR hardware, not someone's benchmark blog.
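The harness is backend-agnostic by design. A sketch — `generate` is any iterator of tokens (in the lab, an HTTP token stream from each server; here, a fake stand-in so the timing logic can run anywhere):

```python
import time

def benchmark(generate, prompt):
    """TTFT is the wait for the first token; tok/s is measured over the
    rest of the stream, so the two numbers don't contaminate each other."""
    start = time.perf_counter()
    ttft, count = None, 0
    for _ in generate(prompt):
        count += 1
        if ttft is None:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    if count > 1 and total > ttft:
        tok_per_s = (count - 1) / (total - ttft)
    else:
        tok_per_s = float("nan")
    return ttft, tok_per_s

def fake_stream(prompt, n=10, delay=0.001):
    # Hypothetical stand-in for a real backend's streaming response.
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps = benchmark(fake_stream, "hello")
```

Point the same function at each backend at batch 1 and batch 8 and the crossover table writes itself.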