Teaching a model to think longer
In 2024, OpenAI's o1 showed that a model could trade inference tokens for accuracy. DeepSeek-R1 made the recipe public. s1 showed it was almost embarrassingly simple. Here is the whole apparatus — sampling, verification, and the compute allocator that ties them together.
The pre-reasoning paradigm — one forward pass, one answer
Until mid-2024, every production LLM answered the same way. You sent a prompt, the model autoregressed token-by-token, and whatever fell out of the sampler was the answer. There was no separate “thinking” phase. The only meaningful lever at inference was temperature, and the only meaningful lever for quality was training a bigger model — the story we told in hello, model and refined in distillation.
In that world, test-time compute was a constant. A 7B model took the same ~2 GPU-seconds to answer a hard integral as it did a greeting. Doubling user-visible quality meant doubling parameters, or training on 10× the data. Scaling laws were a one-dimensional problem.
Chain-of-thought, the behaviour that predates the training method
Chain-of-thought is older than o1 by two years. Wei et al. (2022) showed that prompting with a few worked, step-by-step examples pulled dramatically better arithmetic and commonsense reasoning out of models that had no explicit reasoning training whatsoever, and Kojima et al. (2022) got much of the same effect zero-shot just by appending “Let's think step by step”. The model wasn't learning anything new — it was narrating what it already knew how to do, and the narration happened to be useful.
This is the first place the field confused a behaviour with a training method. CoT prompting worked because pretraining corpora contained huge amounts of step-by-step mathematical text — proofs, worked examples, Khan-Academy-style tutorials — and the model could be nudged into that distribution with the right prefix. No gradient update required.
So CoT established that more tokens at test time could buy more accuracy, at least above some capability threshold. What it did not establish: how to make that behaviour reliable, how to train for it, or how to budget it. Those questions waited until 2024.
Best-of-N — the first real test-time compute lever
Once you accept that the model is drawing from a distribution of candidate solutions, the next move is obvious: draw more than one. Best-of-N sampling takes independent completions from the policy, scores each with a verifier, and returns the highest-scoring one. The scale of improvement is often dramatic — pass@1 of 40% can become pass@8 of 78% on hard math, because correctness is a rare event but a well-calibrated verifier finds it when it happens.
Draw N completions from the policy, score each with the verifier, return the argmax of the verifier's score. The verifier is the load-bearing piece — without it, you're just hoping more samples means the right one landed somewhere in the stack.
The accuracy curve flattens with N — doubling samples from 1→2 helps more than 32→64 — which is exactly the diminishing-returns shape the hook viz shows. Real o1-class systems live on this curve: the first few doublings of compute buy the single-digit accuracy points that cross meaningful capability thresholds (MATH → AIME → Olympiad), and further doublings are wasted on returns that diminish hard.
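The sample-score-argmax loop is small enough to sketch. Here `sample` and `verify` are stand-ins for the policy's sampler and the verifier's scoring call, not any real library's API:

```python
from typing import Callable

def best_of_n(sample: Callable[[str], str],
              verify: Callable[[str, str], float],
              prompt: str, n: int = 8) -> str:
    """Draw n independent completions, return the verifier's argmax."""
    candidates = [sample(prompt) for _ in range(n)]
    scores = [verify(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```

Everything interesting hides inside `verify` — the loop itself never changes, whether the verifier is an ORM, a PRM, or a bank of unit tests.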
ORM vs PRM — where the verifier looks
A verifier can grade two things. An outcome reward model (ORM) looks at the final answer and says “yes” or “no”. A process reward model (PRM) looks at each intermediate step and scores them individually. ORM data is cheap — you just need ground-truth answers. PRM data is savagely expensive because humans have to label every step.
ORM: one label per chain, right or wrong. Cheap to collect — just check the final answer against ground truth. The weakness: a chain can be right for the wrong reasons (an arithmetic error that luckily cancels itself) and still score 1. The training signal says nothing about the process.
PRM: one label per step. Expensive to collect — PRM800K (OpenAI, 2023) paid raters to grade 800K individual math steps. It gives much stronger signal because the model learns why a chain failed, not just that it did. A broken step in the middle of a correct-final-answer chain is flagged and penalised.
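The two signals differ only in granularity, which a few lines make concrete. The min-over-steps aggregation for the PRM is one common choice (a chain is only as sound as its weakest step), not the only one:

```python
def orm_score(final_answer: str, ground_truth: str) -> float:
    # outcome reward: one label for the whole chain
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0

def prm_score(step_scores: list[float]) -> float:
    # process reward: one score per step; aggregating with min means a
    # single broken step sinks the chain, even if the final answer is right
    return min(step_scores)
```

A chain with steps scored `[0.9, 0.1, 0.95]` that happens to land on the right answer gets 1.0 from the ORM but only 0.1 from the PRM — exactly the lucky-cancellation case the ORM can't see.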
The open question of 2024 was whether you could get PRM-quality signal without PRM-quality labelling. The answer — from several labs independently in 2025–2026 — is yes, if you're clever about the synthetic data loop.
DeepSeek-R1 — the recipe, the aha moment, the open weights
January 2025. DeepSeek publishes R1, a reasoning model that matches o1 on AIME and MATH, and — crucially — publishes the training recipe. The recipe is stunningly short.
- Start from a strong base model (DeepSeek-V3).
- Apply GRPO, from the grpo lesson, directly on verifiable tasks — math with ground-truth answers, code with unit tests. No critic network, no preference pairs.
- The reward is binary: did the final answer match? Did the unit tests pass?
- Iterate until the reasoning traces get noticeably longer on their own, and accuracy keeps rising.
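The binary reward in step 3 is the entire reward model. A minimal sketch of the math side — real pipelines parse `\boxed{}` answers and run sandboxed unit tests for the code side; the last-number regex here is an illustrative assumption:

```python
import re

def math_reward(completion: str, ground_truth: str) -> float:
    # binary outcome reward: compare the last number in the completion
    # against the ground-truth answer
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0
```

The point is how little machinery this is: no learned reward model, no step labels — just string matching against an answer key.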
That's it. No PRM. No MCTS. No human-curated reasoning demonstrations. The model teaches itself to hedge (“wait, let me check that”), backtrack (“actually, that's wrong, let me redo”), and verify (“plugging back in: 60 × 3 = 180 ✓”). DeepSeek calls this the aha moment — an emergent inflection during R1-Zero's RL training where the model suddenly starts producing much longer, much more self-corrective chains, and accuracy follows. (The production R1 uses a small SFT warm-up before the same GRPO loop; the aha moment itself is documented on R1-Zero.)
For each problem, sample a group of rollouts, check each with the verifier, standardise inside the group. Rollouts that solved it get positive advantage; rollouts that didn't get negative. The policy gradient pushes the positive ones harder and suppresses the negative ones. No value head, no reward model. The grpo lesson covers the full objective — here we're focused on what it produces, not the math.
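The group-standardisation step is one line per rollout. A sketch, assuming binary verifier rewards for a single group:

```python
def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # standardise rewards within one group of rollouts: (r - mean) / std.
    # solved rollouts come out positive, failed ones negative
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Note the degenerate case: a group that is all-correct or all-wrong standardises to zero advantage everywhere — no gradient signal — which is one reason group size and problem difficulty have to be matched.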
There is a pedagogical trap here worth naming. The R1 recipe is not “RL makes reasoning emerge.” It is “RL on a domain where correctness is cheap to check, applied to a base model that already has the latent ability, elicits longer chains that use that ability better.” R1-Zero confirms the first two pieces. Phi-4-Mini-Reasoning and related distillation work confirm the third — the behaviour can be imitated directly from R1's output traces, without ever running the RL. Which brings us to SLMs.
s1 and budget-forcing — the uncomfortable result
February 2025. Stanford's s1 paper lands with a provocative finding: they fine-tune Qwen2.5-32B on a tiny curated dataset of 1,000 reasoning traces, and get o1-class performance on AIME and MATH. What's more striking is the inference trick they introduce: budget-forcing.
At decode time, when the model tries to stop thinking and emit a final answer, they intercept the end-of-thought token and force-append a single word: "Wait". The model obliges — it starts a new reasoning segment, often catches an error, and arrives at a better final answer. AIME24 accuracy rises roughly 50% → 57% just from forcing the model to think longer (sometimes by appending "Wait" more than once), with no training change at all.
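Mechanically, budget-forcing is a decode-loop intercept. In this sketch, `generate` is a stand-in for “continue decoding until the model emits its end-of-thought token”, and `</think>` is an assumed delimiter — the actual token depends on the model's chat template:

```python
from typing import Callable

def budget_forced_decode(generate: Callable[[str], str], prompt: str,
                         max_forces: int = 2,
                         end_of_thought: str = "</think>") -> str:
    """Re-open the thinking segment up to max_forces times."""
    text, forces = prompt, 0
    while True:
        text = generate(text)
        if text.endswith(end_of_thought) and forces < max_forces:
            # intercept the end-of-thought token and force more reasoning
            text = text[: -len(end_of_thought)] + " Wait,"
            forces += 1
        else:
            return text
```

No weights change; the only intervention is string surgery on the decode stream, which is what makes the s1 result so uncomfortable.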
The uncomfortable implication: some non-trivial fraction of “reasoning ability” is not a trained skill at all. It is just more inference compute applied in a particular shape. The model already knew how to backtrack; it just needed a reason to. That's not to say training doesn't matter — the base model has to have the latent capability, which R1's RL helps sharpen. But the gap between “reasoning model” and “non-reasoning model” is partly a gap in how much compute you're willing to spend per answer.
GRPO on verifiable rewards (R1) or SFT on 1K curated traces (s1). Both produce a model whose default behaviour is to generate longer, self-corrective chains. The prior is baked in; no prompt engineering required to get it to narrate its work.
The model now burns hundreds-to-thousands of tokens before emitting its final answer. Best-of-N can be stacked on top, and budget-forcing can extend the chains further mid-decode. Inference cost per query is an order of magnitude higher than a non-reasoning model's.