MicroscaleLabs
hands-on · microscale academy

Labs.

A different way to learn sits next to the reading path. Twelve specimens wait on the workbench: all 448 attention heads of a 600M model classifying themselves into previous-token and induction patterns; a 10M transformer descending from noise to coherent English in twenty minutes of consumer-GPU time; a 2 MB LoRA adapter that reshapes a model's voice on twenty cooking examples; your own GPU's bandwidth plotted on a roofline against your own model's arithmetic intensity.

Every one produces a number or a file you keep. None of them require a datacentre.

13 labs · 12 live today · 1 coming soon · 11 cpu-friendly
Act I · Region 01

The Landscape

→ read the act
Lab 01 · 30 min · CPU · Colab

The Token Tax

Load four real BPE tokenizers — o200k, cl100k, p50k, gpt2 — feed them the same sentence in five languages, and watch the token-per-word ratio climb from 1× in English to 4-5× in Hindi on GPT-2. The fairness gap stops being theory and becomes a number you measured.
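A minimal sketch of the measurement, assuming tiktoken's registered encoding names (`o200k_base`, `cl100k_base`, `p50k_base`, `gpt2`) and a placeholder English sentence — the lab supplies the five-language set:

```python
def tokens_per_word(n_tokens: int, n_words: int) -> float:
    """Fertility: tokens the tokenizer spends per whitespace-separated word."""
    return n_tokens / max(n_words, 1)

if __name__ == "__main__":
    # Needs `pip install tiktoken` plus network access for the BPE files,
    # so the measurement loop is guarded.
    try:
        import tiktoken
        text = "The cat sat on the mat."   # placeholder; use parallel translations
        for name in ("o200k_base", "cl100k_base", "p50k_base", "gpt2"):
            enc = tiktoken.get_encoding(name)
            ratio = tokens_per_word(len(enc.encode(text)), len(text.split()))
            print(f"{name:12s} en: {ratio:.2f} tokens/word")
    except Exception:
        pass   # tiktoken missing or offline: skip the live measurement
```

Run the same loop per language and the fairness gap falls out as a table of ratios.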

Act II · Region 02

Inside the Machine

→ read the act
Lab 02 · 60–90 min · CPU · Mac · GPU · Colab

Attention Under the Microscope

Load Qwen3-0.6B, register forward hooks on every attention layer, and extract all 448 head-patterns as heatmaps. Find the previous-token head, find the induction head, find the uniform global-attention head. Zero one and measure the perplexity hit. Some heads matter, most don't — and now you can prove it.
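The hooking itself needs the model downloaded; the classification step doesn't. A sketch of the scoring, with a synthetic 4×4 pattern standing in for one hooked head:

```python
import math

def prev_token_score(attn):
    """Mean attention mass on the immediately-previous position.
    `attn` is a T×T row-stochastic pattern (rows = query positions)."""
    T = len(attn)
    return sum(attn[i][i - 1] for i in range(1, T)) / (T - 1)

def mean_entropy(attn):
    """Mean row entropy; a value near log(T) flags a diffuse, global head."""
    return sum(-sum(p * math.log(p) for p in row if p > 0)
               for row in attn) / len(attn)

# Synthetic stand-in for a hooked head: a hard previous-token pattern.
prev_head = [
    [1.0, 0.0, 0.0, 0.0],   # position 0 can only see itself (causal mask)
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
```

Score all 448 heatmaps this way, sort, and the previous-token and global heads surface by themselves.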

Lab 03 · 90–120 min · CPU · Mac · Colab

Build a Transformer Block from Raw Ops

Implement RMSNorm, RoPE, Grouped-Query Attention, and SwiGLU from scratch in PyTorch — no `nn.TransformerEncoderLayer`, no HuggingFace. Load the real weights from Qwen3-0.6B's layer 0 into your version, and wait for `torch.allclose(yours, theirs, atol=1e-5)` to return True. The hardest lab. The most satisfying lab.
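One of the four ops as a plain-Python sketch — the lab's PyTorch version is the same arithmetic, and the eps-inside-the-sqrt placement follows the common convention:

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of x, then apply an
    elementwise learned gain. No mean subtraction -- that is the whole
    difference from LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]
```

After normalization (with unit gain) the output's mean square is ~1, which is the invariant your `torch.allclose` check ultimately rests on.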

Act III · Region 03

The Current Champions

→ read the act
Lab 04 · 45–60 min · CPU · Colab

Model Autopsy

Parse the safetensors header of SmolLM2-360M, Qwen3-0.6B, SmolLM3-3B, and Phi-4-mini without downloading a single weight. Detect each model's GQA group size, tied-vs-untied embeddings, SwiGLU hidden ratio, and vocab size from tensor names alone. See Phi-4-mini spend 31% of its params on vocabulary where SmolLM3 spends 18%.
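The header format is small enough to parse by hand: an 8-byte little-endian length prefix, then that many bytes of JSON. A sketch, with the GQA inference assuming Llama/Qwen-style tensor names:

```python
import json
import struct

def read_safetensors_header(path):
    """safetensors layout: u64 little-endian length N, then N bytes of JSON
    mapping tensor name -> {dtype, shape, data_offsets}. The weights follow,
    and we never read them."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(n))
    header.pop("__metadata__", None)   # optional free-form metadata entry
    return header

def gqa_group_size(header, layer=0):
    """Infer GQA grouping from shapes alone: q_proj rows / k_proj rows."""
    q = header[f"model.layers.{layer}.self_attn.q_proj.weight"]["shape"][0]
    k = header[f"model.layers.{layer}.self_attn.k_proj.weight"]["shape"][0]
    return q // k
```

Tied embeddings show up the same way: one shared tensor name instead of separate `embed_tokens` and `lm_head` entries.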

Act IV · Region 04

How They Learn

→ read the act
Lab 05 · 90–120 min · GPU · Mac · Colab · CPU

The $1 Pretraining Run

Train a 10M-parameter GPT-2 from scratch on TinyStories for ~20 minutes, watch the loss curve descend from random noise to coherent English, then train a second copy on corrupted data and see the textbook hypothesis as a measured gap. Compute cost: well under $1.
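The cost claim is checkable on the back of an envelope with the standard 6·N·D training-FLOPs estimate. Every number below is an assumption, not a measurement:

```python
def pretrain_cost(n_params, n_tokens, sustained_flop_s, usd_per_hour):
    """Back-of-envelope: training FLOPs ~ 6 * params * tokens."""
    hours = 6 * n_params * n_tokens / sustained_flop_s / 3600
    return hours, hours * usd_per_hour

# Assumed: 10M params, a 200M-token slice of TinyStories, 10 TFLOP/s
# sustained on a consumer GPU, $0.40/hr to rent one.
hours, usd = pretrain_cost(10e6, 200e6, 1e13, 0.40)
```

With these assumptions the run lands at ~20 minutes and pennies of compute — comfortably under the $1 in the lab's name.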

Act V · Region 05

Where They Break

→ read the act
Lab 06 · 60–90 min · CPU · Colab

The Hallucination Probe

Ask a small model 50 factual questions split into common-knowledge and long-tail buckets, score the answers with multi-alias matching, and plot the hallucination rate against the Kalai-Vempala theoretical lower bound. Measure ~5% on common questions, ~40-60% on long-tail — exactly where the proof says you should land.
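The scoring step, sketched — normalization plus multi-alias substring matching. The question set and the Kalai-Vempala bound come from the lesson:

```python
import re

def normalize(s):
    """Lowercase, strip punctuation, collapse to alphanumerics and spaces."""
    return re.sub(r"[^a-z0-9 ]+", "", s.lower()).strip()

def is_correct(answer, aliases):
    """Multi-alias matching: the answer counts as correct if any accepted
    alias survives inside the normalized model output."""
    a = normalize(answer)
    return any(normalize(alias) in a for alias in aliases)
```

The alias list is what keeps "Ada Lovelace", "Lovelace", and "Countess of Lovelace" from being scored as three different facts.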

Act VI · Region 06

Making It Yours

→ read the act
Lab 07-A · 60–90 min · GPU · Mac · Colab · CPU

LoRA for Behavioral Fine-Tuning

Implement LoRA from scratch — the A×B low-rank decomposition, the α/r scaling, the zero-init — attach it to Qwen3-0.6B's query projection, and fine-tune on 20 cooking-instruction examples. 24,576 trainable params. A 2 MB adapter. A noticeably shifted voice after 200 steps. Merge, verify, keep the file.
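The adapter math in a few lines of NumPy — the q_proj shapes (1024 in, 2048 out) and α=16 are assumptions about the Qwen3-0.6B config, but they reproduce the 24,576-param count:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Frozen base plus low-rank update: y = x W^T + (alpha/r) * x A^T B^T."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Assumed Qwen3-0.6B q_proj shapes: hidden 1024 in, 16 heads x 128 dims out.
d_in, d_out, r, alpha = 1024, 2048, 8, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)       # frozen
A = (0.01 * rng.standard_normal((r, d_in))).astype(np.float32)  # small init
B = np.zeros((d_out, r), dtype=np.float32)   # zero-init: no-op at step 0

trainable = A.size + B.size   # r * (d_in + d_out) = 24,576
```

Because B starts at zero, the model's output is bit-identical to the base until the first gradient step — that's the whole point of the zero-init.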

Lab 07-B · 60–90 min · GPU · Mac · Colab · CPU

LoRA for Tool Calling

Same LoRA from 07-A, higher rank (r=16), two target projections (q_proj + v_proj), and training examples formatted with Qwen3's native tool-calling template. Teach the model to emit valid JSON for a 6-function kitchen-assistant API. Evaluate deterministically: does the JSON parse? Does the function name match? Does it generalise to held-out prompts?
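The deterministic evaluator is a dozen lines. The six function names below are hypothetical stand-ins for the lab's kitchen API:

```python
import json

KITCHEN_API = {  # hypothetical 6-function schema for illustration
    "set_timer": {"minutes"},
    "convert_units": {"amount", "from_unit", "to_unit"},
    "find_recipe": {"query"},
    "scale_recipe": {"recipe_id", "factor"},
    "add_to_list": {"item"},
    "nutrition_info": {"ingredient"},
}

def score_tool_call(raw):
    """Three deterministic checks: does it parse, is the function known,
    are all arguments allowed by that function's schema?"""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return {"parses": False, "known_fn": False, "valid_args": False}
    if not isinstance(call, dict):
        return {"parses": True, "known_fn": False, "valid_args": False}
    known = call.get("name") in KITCHEN_API
    valid = known and set(call.get("arguments", {})) <= KITCHEN_API[call["name"]]
    return {"parses": True, "known_fn": known, "valid_args": valid}
```

No LLM judge, no fuzziness: a held-out prompt either yields a parseable, schema-valid call or it doesn't.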

Lab 08 · 90 min · GPU · Mac · Colab

Your First DPO Alignment

Build 20 preference pairs for a narrow task (chosen vs rejected responses), run TRL's DPOTrainer on Qwen3-0.6B for 100 steps, and watch the model's behaviour shift from generic-chatbot to specifically-matches-your-chosen-examples. Alignment stops being abstract and becomes a trained adapter you can A/B test.
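The loss DPOTrainer optimizes, sketched for a single preference pair — TRL computes the sequence log-probs for you; this is just the formula it feeds them into:

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO on one pair: push the policy's chosen-minus-rejected log-prob
    margin above the frozen reference model's margin, scaled by beta.
    Loss = -log sigmoid(beta * margin)."""
    margin = (pi_w - ref_w) - (pi_l - ref_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At a margin of zero the loss sits at log 2; every step that widens the chosen-vs-rejected gap relative to the reference pulls it down.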

Act VII · Region 07

Packing for Travel

→ read the act
Lab 09 · 90 min · CPU · Colab

Quantize It Yourself

Take a single 3072×3072 weight tensor from Qwen3-0.6B and implement three quantisation schemes from scratch — naive 4-bit uniform, NF4 quantile-binned, K-quant Q4_K_M with sub-block scales. Measure L2 error for each. Watch naive lose 3× to NF4, and NF4 lose 2× to Q4_K_M. The hierarchy of quantisation tricks is now a chart you built.
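The naive baseline of the three, sketched with one absmax scale per block — NF4's quantile bins and Q4_K_M's sub-block scales refine exactly this:

```python
def quantize_4bit_uniform(xs):
    """Naive symmetric 4-bit: one absmax scale for the whole block,
    uniform integer levels in [-7, 7]."""
    scale = max(abs(v) for v in xs) / 7 or 1.0   # avoid div-by-zero on all-zeros
    q = [max(-7, min(7, round(v / scale))) for v in xs]
    deq = [qi * scale for qi in q]
    l2 = sum((a - b) ** 2 for a, b in zip(xs, deq)) ** 0.5
    return q, deq, l2
```

Run it per row of the 3072×3072 tensor, sum the L2 errors, and you have the first bar of the hierarchy chart.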

Act VIII · Region 08

Serving the Model

→ read the act
Lab 10 · 60–90 min · GPU · Mac · Colab

The Roofline Lab

Measure your GPU's actual sustained bandwidth (not spec-sheet) with a memcpy microbenchmark, measure sustained compute with a matmul microbench, and plot YOUR hardware's roofline. Overlay your model's arithmetic intensity at batch=1 (decode) and batch=32 (prefill-like). See decode sitting deep in the bandwidth-bound region on your actual GPU.
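The plotting is the lab's job; the model behind the plot is two one-liners. The decode AI estimate assumes weight traffic dominates memory movement:

```python
def attainable(ai, peak_flop_s, bytes_per_s):
    """Roofline model: you get the lesser of the compute roof and the
    memory roof (arithmetic intensity x bandwidth)."""
    return min(peak_flop_s, ai * bytes_per_s)

def decode_intensity(batch, bytes_per_param=2):
    """Decode does ~2 FLOPs per parameter per token and streams every
    parameter once per step, so AI ~ 2 * batch / bytes_per_param."""
    return 2 * batch / bytes_per_param
```

With hypothetical 200 TFLOP/s and 1 TB/s roofs, the ridge sits at AI = 200 — and fp16 decode at batch=1 (AI ≈ 1) is two orders of magnitude below it, which is what "deep in the bandwidth-bound region" means numerically.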

Lab 11 · 60 min · CPU · GPU · Mac · Colab

KV Cache Budget Calculator

Turn the KV-cache formula from the lesson into a working calculator that takes model config + batch size + sequence length and predicts exact bytes. Serve 1 / 4 / 16 / 64 concurrent requests on a real model and plot predicted vs actual memory. The gap is everything the formula doesn't cover — activations, CUDA context, framework overhead. Now you know the correction factor for your stack.
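The formula itself, as a sketch — the example config (28 layers, 8 KV heads, head_dim 128, fp16) is an assumption about Qwen3-0.6B, so check it against the real config.json before trusting the numbers:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """K and V each store [batch, n_kv_heads, seq_len, head_dim] per layer,
    hence the leading factor of 2."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed Qwen3-0.6B-style config: 28 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(28, 8, 128, seq_len=1, batch=1)   # 114,688 bytes
```

Scale `seq_len` and `batch` up and the predicted curve is yours; the measured curve minus this is your stack's correction factor.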

Act IX · Region 09

Ship It

→ read the act
Lab 12 · 45–60 min · CPU · GPU · Mac · hot pick · coming soon

The Inference Showdown

Take one model (Qwen3-0.6B-Q4_K_M) and serve it through Ollama, llama.cpp-server, and (depending on your hardware) vLLM or MLX-LM. Measure cold-start time, TTFT, tok/s at batch=1, tok/s at batch=8, and peak memory for each. Find the crossover point where vLLM's batching advantage overtakes Ollama's simplicity — on YOUR hardware, not someone's benchmark blog.
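The measurement harness doesn't need to wait for the lab — a sketch that times any token stream, whichever server produced it:

```python
import time

def measure_stream(token_iter):
    """TTFT and steady-state tok/s for any streaming client -- Ollama,
    llama.cpp-server, vLLM, and MLX-LM all expose a token stream.
    TTFT = time to first token; tok/s is computed over the rest."""
    t0 = time.perf_counter()
    ttft, count = None, 0
    for _ in token_iter:
        if ttft is None:
            ttft = time.perf_counter() - t0
        count += 1
    total = time.perf_counter() - t0
    rate = (count - 1) / (total - ttft) if count > 1 and total > ttft else 0.0
    return {"ttft_s": ttft, "tok_per_s": rate, "n_tokens": count}
```

Point it at each server's stream at batch=1 and batch=8 and the crossover table builds itself.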