Act V · Where They Break
lesson emergence · 10 min · 50 xp

The emergence cliff

Scrub the parameter slider, see capabilities pop

Some abilities appear to turn on all at once

Wei et al. 2022 coined “emergent abilities” for capabilities that are absent below a certain model scale and present above it — with a sharp, almost discontinuous-looking transition. Arithmetic, multi-step reasoning, chain-of-thought following, format compliance: all of these appear to pop into existence around specific parameter counts.

Schaeffer, Miranda, and Koyejo (2023, “Are Emergent Abilities of Large Language Models a Mirage?”) pushed back: emergence is partly a metric effect. If you measure exact-match accuracy (a 0/1 judgment), tiny improvements compound into a sudden apparent jump. If you measure log-probability of the correct answer — or even token-level edit distance — the same capability grows smoothly. Their clinching move: they took the BIG-Bench tasks Wei flagged as emergent, re-scored them under cross-entropy instead of exact match, and 92% of the apparent cliffs disappeared, leaving ordinary power-law scaling.
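You can reproduce the metric effect in a few lines. The sketch below is illustrative, not the paper's code: it assumes per-token accuracy improves smoothly with parameter count (the saturating curve and its constants are made up for the demo), then scores a 10-token answer two ways. Exact match requires every token to be right, so it behaves like p^T and looks cliff-like; total log-probability is T·log p and climbs smoothly.

```python
import math

# Assumption: per-token accuracy rises smoothly with scale N.
# The functional form and constants here are illustrative only.
def per_token_p(n_params, scale=1e10, floor=0.02):
    return floor + (1 - floor) * (1 - math.exp(-n_params / scale))

T = 10  # answer length in tokens (assumed)
for n in [1e8, 1e9, 1e10, 1e11]:
    p = per_token_p(n)
    exact = p ** T              # 0/1-style metric: all T tokens must be correct
    logp = T * math.log(p)      # smooth metric: total log-probability
    print(f"N={n:.0e}  per-token p={p:.3f}  exact={exact:.4f}  log-p={logp:.1f}")
```

Running this, exact-match accuracy is effectively zero through 1B and 10B and then leaps toward 1 by 100B, while log-probability improves at every step — the same smooth capability, two very different-looking curves.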

The underlying mechanism is simple once you see it. Exact match only flips to 1 when the correct token's log-probability exceeds the max log-probability of every distractor. A model can be steadily climbing from p=0.001 to p=0.49 on the right answer and score zero the entire way; the moment it overtakes the distractor — crossing p=0.5, if a single dominant distractor holds the rest of the mass — the metric snaps to 1. The smoothness is in the capability; the discontinuity is in the yardstick. Both stories are partly right — a handful of abilities (modular arithmetic, multi-step symbolic manipulation) do look threshold-like even under log-prob, but most of the famous “emergent” BIG-Bench tasks were metric artifacts.
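The snap itself fits in three lines. This is a toy illustration of the single-dominant-distractor case described above (the fixed 0.50 distractor mass is an assumption for the demo): as p(correct) climbs smoothly, exact match stays pinned at 0 and then flips to 1 at the crossover, while log p(correct) rises the whole way.

```python
import math

distractor_p = 0.50  # assumed: one dominant wrong answer holding half the mass
for p_correct in [0.01, 0.10, 0.30, 0.49, 0.51, 0.70]:
    exact = int(p_correct > distractor_p)  # 0/1 snap at the crossover
    print(f"p={p_correct:.2f}  exact={exact}  log-p={math.log(p_correct):+.2f}")
```

Every row before the crossover scores 0 on exact match despite a 49-fold improvement in probability; the row after scores 1 on a 4% improvement.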

[Interactive widget: parameter-count slider, shown at 10^9.0 = 1.0B]

capability              exact-match   log-prob
3-digit addition        14%           21%
chain-of-thought        0%            0%
reliable tool calling   0%            0%

[Chart 1: exact-match metric — looks like a cliff. x-axis: log₁₀ N, 100M to 100B; y-axis: exact-match, 0 to 1.]
[Chart 2: log-probability metric — smooth growth. x-axis: log₁₀ N, 100M to 100B; y-axis: log-p(correct). Traces: 3-digit addition, chain-of-thought, reliable tool calling.]

Why the threshold moves over time

The 2022 CoT threshold was ~62B params. Today's 3B Phi-4-mini reasons better than that 62B did. The threshold didn't disappear — it moved. Better data (the Phi series' textbook-style synthetic data), distillation from larger teachers (Llama 3.2), and curriculum pretraining (SmolLM3) all push the threshold down. Emergence is a function of training quality, not just parameter count.

But the shape of the phenomenon survives. Today's 3B can do what 2022's 62B could; today's 3B cannot do what today's 70B can. There is always a next tier of capability that only the next tier of scale reliably unlocks.

comprehension check
comprehension · 1 / 1

Why does the same capability look “emergent” on exact-match metrics but “smooth” on log-probability?