Act V · Where They Break
lesson emergence · 10 min · 50 xp

The emergence cliff

Scrub the parameter slider, see capabilities pop

Some abilities appear to turn on all at once

Wei et al. 2022 coined “emergent abilities” for capabilities that are absent below a certain model scale and present above it — with a sharp, almost discontinuous-looking transition. Arithmetic, multi-step reasoning, chain-of-thought following, format compliance: all of these appear to pop into existence around specific parameter counts.

Schaeffer, Miranda, and Koyejo (2023, “Are Emergent Abilities of Large Language Models a Mirage?”) pushed back: emergence is partly a metric effect. If you measure exact-match accuracy (a 0/1 judgment), tiny improvements compound into a sudden apparent jump. If you measure log-probability of the correct answer — or even token-level edit distance — the same capability grows smoothly. Their clinching move: they took the BIG-Bench tasks Wei flagged as emergent, re-scored them under cross-entropy instead of exact match, and 92% of the apparent cliffs disappeared, leaving ordinary power-law scaling.
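You can reproduce the metric effect in a few lines. The sketch below is illustrative, not the paper's code: it assumes per-token accuracy improves smoothly with parameter count (the saturating curve and its constants are made up for the demo), then scores a 10-token answer two ways. Exact match requires every token to be right, so it behaves like p^T and looks cliff-like; total log-probability is T·log p and climbs smoothly.

```python
import math

# Assumption: per-token accuracy rises smoothly with scale N.
# The functional form and constants here are illustrative only.
def per_token_p(n_params, scale=1e10, floor=0.02):
    return floor + (1 - floor) * (1 - math.exp(-n_params / scale))

T = 10  # answer length in tokens (assumed)
for n in [1e8, 1e9, 1e10, 1e11]:
    p = per_token_p(n)
    exact = p ** T              # 0/1-style metric: all T tokens must be correct
    logp = T * math.log(p)      # smooth metric: total log-probability
    print(f"N={n:.0e}  per-token p={p:.3f}  exact={exact:.4f}  log-p={logp:.1f}")
```

Running this, exact-match accuracy is effectively zero through 1B and 10B and then leaps toward 1 by 100B, while log-probability improves at every step — the same smooth capability, two very different-looking curves.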

The underlying mechanism is simple once you see it. Exact match only flips to 1 when the correct token's log-probability exceeds the max log-probability of every distractor. A model can be steadily climbing from p=0.001 to p=0.49 on the right answer and score zero the entire way; the moment it overtakes the distractor — crossing p=0.5, if a single dominant distractor holds the rest of the mass — the metric snaps to 1. The smoothness is in the capability; the discontinuity is in the yardstick. Both stories are partly right — a handful of abilities (modular arithmetic, multi-step symbolic manipulation) do look threshold-like even under log-prob, but most of the famous “emergent” BIG-Bench tasks were metric artifacts.
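The snap itself fits in three lines. This is a toy illustration of the single-dominant-distractor case described above (the fixed 0.50 distractor mass is an assumption for the demo): as p(correct) climbs smoothly, exact match stays pinned at 0 and then flips to 1 at the crossover, while log p(correct) rises the whole way.

```python
import math

distractor_p = 0.50  # assumed: one dominant wrong answer holding half the mass
for p_correct in [0.01, 0.10, 0.30, 0.49, 0.51, 0.70]:
    exact = int(p_correct > distractor_p)  # 0/1 snap at the crossover
    print(f"p={p_correct:.2f}  exact={exact}  log-p={math.log(p_correct):+.2f}")
```

Every row before the crossover scores 0 on exact match despite a 49-fold improvement in probability; the row after scores 1 on a 4% improvement.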

[Interactive widget: parameter-count slider, shown at 10^9.0 = 1.0B]

capability              exact-match   log-prob
3-digit addition        14%           21%
chain-of-thought        0%            0%
reliable tool calling   0%            0%

[Chart 1: exact-match metric — looks like a cliff. x-axis: log₁₀ N, 100M to 100B; y-axis: exact-match, 0 to 1.]
[Chart 2: log-probability metric — smooth growth. x-axis: log₁₀ N, 100M to 100B; y-axis: log-p(correct). Traces: 3-digit addition, chain-of-thought, reliable tool calling.]

Why the threshold moves over time

The 2022 CoT threshold was ~62B params. Today's 3B Phi-4-mini reasons better than that 62B did. The threshold didn't disappear — it moved. Better data (the Phi series' textbook-style synthetic data), distillation from larger teachers (Llama 3.2), and curriculum pretraining (SmolLM3) all push the threshold down. Emergence is a function of training quality, not just parameter count.

But the shape of the phenomenon survives. Today's 3B can do what 2022's 62B could; today's 3B cannot do what today's 70B can. There is always a next tier of capability that only the next tier of scale reliably unlocks.

comprehension check
comprehension · 1 / 1

Why does the same capability look “emergent” on exact-match metrics but “smooth” on log-probability?