Eight current specimens. These are the models I would actually consider for an SLM specialization project in April 2026. Every entry has a full technical report, open weights, and demonstrated production use.
specimen dossier
Phi-4-mini · Microsoft · 3.8B · MIT
Headline: Synthetic textbook data + 200k vocab + tied embeddings.
Training: ~5T tokens, heavy synthetic data (50 types, 400B tokens). Post-training: SFT + DPO; the reasoning variant adds RLVR.
Best for: Structured output, reasoning, tool calling.
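To see why the headline pairs a 200k vocabulary with tied embeddings, a quick back-of-envelope sketch helps. The hidden size below is an illustrative assumption, not a quoted Phi-4-mini spec:

```python
# Back-of-envelope: why a 200k vocab pushes a small model toward tied embeddings.
# hidden_size = 3072 is an assumption for illustration, not a published figure.
vocab_size = 200_000
hidden_size = 3072

embedding_params = vocab_size * hidden_size   # input embedding matrix
lm_head_params = vocab_size * hidden_size     # output projection, if untied

untied = embedding_params + lm_head_params
tied = embedding_params                       # one matrix reused for both ends

saved = untied - tied
print(f"untied: {untied/1e6:.0f}M params, tied: {tied/1e6:.0f}M, saved: {saved/1e6:.0f}M")
```

At these numbers, tying the input and output matrices saves on the order of 600M parameters, a meaningful fraction of a 3.8B budget.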
Pattern recognition
Look across the dossiers and you'll notice every model makes the same small number of choices: SwiGLU, GQA, RoPE, RMSNorm. Where they differ:
Data strategy — textbook synthetic (Phi), distillation from a bigger sibling (Llama 3.2), pure quality curriculum (SmolLM3), multilingual-heavy (Qwen3)
Post-training — SFT + DPO is the minimum; reasoning variants add RLVR
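As a concrete reference for that shared architectural core, here is a minimal NumPy sketch of a SwiGLU feed-forward block with RMSNorm in front. Dimensions and init scales are illustrative, not any model's real config:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256

# Illustrative random weights; real models learn these.
W_gate = rng.standard_normal((d_model, d_ff)) * 0.02
W_up   = rng.standard_normal((d_model, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d_model)) * 0.02

def rms_norm(x, eps=1e-6):
    # RMSNorm: rescale by root-mean-square; no mean subtraction or bias,
    # which makes it cheaper than LayerNorm.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU (swish) activation

def swiglu_ffn(x):
    # Gated feed-forward: SiLU-gated projection, elementwise product,
    # then project back down to d_model.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

x = rng.standard_normal((4, d_model))   # a batch of 4 token vectors
y = swiglu_ffn(rms_norm(x))
print(y.shape)                          # (4, 64)
```

GQA and RoPE live in the attention layers and are omitted here; the point is that this block, give or take dimensions, appears verbatim in every dossier above.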
The family trees matter more than the individual dossiers.
Phi — Phi-1 (1.3B, 2023) → Phi-1.5 → Phi-2 (2.7B) → Phi-3-mini (3.8B) → Phi-3.5-mini → Phi-4-mini. Every generation doubles down on Microsoft's “textbooks are all you need” thesis; the SFT recipe is more valuable than the weights.
Llama — Llama-1 (Feb 2023, research-only) → Llama-2 (commercial, GQA introduced at 34B) → Llama-3 (tied embeddings + 128k vocab) → Llama-3.1 (405B teacher) → Llama-3.2 (1B / 3B distilled from 3.1-8B with logit KD).
Qwen — Qwen 1 → 1.5 → 2 → 2.5 → 3, picking up YaRN long context at Qwen2 and the thinking-mode toggle at Qwen3.
When a “new” SLM drops, the first question is: which lineage does it extend, and what did the parent already know? Most of what the child can do, it inherited.
historical note
Feb 2023 → Apr 2026 · three years of open SLMs
Feb 2023: LLaMA-1 7B leaks onto 4chan, kicking off the open era.
Sep 2023: Mistral-7B beats Llama-2-13B and proves European labs can ship.
Apr 2024: Phi-3-mini lands, the first <4B model anyone takes seriously.
Jul 2024: Llama-3.1 405B ships as a teacher for distillation.
Sep 2024: Llama-3.2-1B/3B are the distilled offspring of that 3.1 generation.
2025: thinking-mode toggles (Qwen3, Gemma 3), ternary weights (BitNet), and hybrid attention (Gemma 3's 5:1 local:global ratio) become standard ideas.
The field went from “can a 7B be useful?” to “which 1.5B reasoning distillate should I fine-tune?” in thirty-six months.
In the next lesson we run head-to-head comparisons between some of these on specific benchmarks — and learn why most of those numbers lie a little.