Build your own transformer block
You've spent the last eight lessons meeting every piece of the modern transformer layer — attention, its multi-head variant, GQA, SwiGLU, RoPE and NoPE, the local–global hybrid, tied embeddings. Now pretend you're the engineer on call whose job is to assemble one from scratch. A colleague walks over with a slot form and says: “pick one component for each slot. We're shipping this on Tuesday.”
Below are the five slots of the canonical 2026 layer: pre-attention norm, attention mechanism, position encoding, pre-FFN norm, and the FFN itself; the sketch after this paragraph shows how they wire together. For each slot you have four choices: some are the modern answer, some are historical baggage, some are valid only at other scales, and some just shouldn't be there at all.
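Here is a minimal sketch of how the five slots sit in a standard pre-norm residual block, before any choices are made. Every name in it is a hypothetical placeholder rather than any model's actual API, and it assumes the position-encoding slot acts inside attention, as RoPE-style designs do.

```python
# Hypothetical skeleton of the block: each slot is passed in as a callable so
# you can plug in whichever component you pick for it.
def transformer_block(x, attn_norm, attention, ffn_norm, ffn):
    # Slots 1-2: normalise, then attend. Slot 3 (position encoding) is not a
    # separate call in RoPE-style designs; it acts on queries and keys inside
    # `attention`, which is why it has no argument of its own here.
    x = x + attention(attn_norm(x))
    # Slots 4-5: normalise, then run the feed-forward network.
    return x + ffn(ffn_norm(x))
```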
This isn't a trick quiz. Every correct answer below is literally the same choice made by Phi-4-mini, Llama 3.2, Qwen3, Gemma 3, and SmolLM3 as of April 2026. If you've been reading carefully, you'll recognise each right answer when you see it; the wrong ones are the options whose drawbacks we spent earlier lessons explaining.
You have now built a 2026 transformer block
The assembled layer is what Phi-4-mini, Llama 3.2-3B, Qwen3, Gemma 3, and SmolLM3 all stack 28–32 times to build their full models. Everything else — how those blocks are trained, why they work, where they fail — is now a question about what the weights do given this fixed architecture.
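As a concrete point of reference, here is one way the assembled block might look in PyTorch: RMSNorm before each sub-layer, grouped-query attention with RoPE applied to queries and keys, and a SwiGLU feed-forward, all wired through residual connections. The dimensions, head counts, and class names are illustrative defaults for this sketch, not the configuration of any of the models named above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Slots 1 and 4: RMS normalisation with a learned scale."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

def rope(x, base=10000.0):
    """Slot 3: rotary position embedding on a (batch, heads, seq, head_dim) tensor."""
    _, _, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = torch.arange(t, device=x.device, dtype=x.dtype)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GQAttention(nn.Module):
    """Slot 2: grouped-query attention; query heads share a smaller set of KV heads."""
    def __init__(self, dim, n_heads, n_kv_heads):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)                   # slot 3: RoPE on queries and keys
        rep = self.n_heads // self.n_kv_heads     # query heads per KV head
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

class SwiGLU(nn.Module):
    """Slot 5: gated feed-forward network with a SiLU gate."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Block(nn.Module):
    """The assembled layer: pre-norm GQA with RoPE, then pre-norm SwiGLU."""
    def __init__(self, dim=2048, n_heads=16, n_kv_heads=4, ffn_hidden=8192):
        super().__init__()
        self.attn_norm = RMSNorm(dim)                      # slot 1
        self.attn = GQAttention(dim, n_heads, n_kv_heads)  # slots 2-3
        self.ffn_norm = RMSNorm(dim)                       # slot 4
        self.ffn = SwiGLU(dim, ffn_hidden)                 # slot 5

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))   # attention sub-layer + residual
        return x + self.ffn(self.ffn_norm(x))  # FFN sub-layer + residual
```

Stacking this block a few dozen times between a token embedding and a (possibly tied) output projection gives the overall shape described above; as a quick smoke test, `nn.Sequential(*[Block() for _ in range(32)])` applied to a `(batch, seq, 2048)` tensor returns a tensor of the same shape.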
Act III surveys the zoo of current SLMs, Act IV explains how they're trained, and Act V confronts their limitations. The block you just assembled is the chassis under all of that.