Teaching a model to think longer
In 2024, OpenAI's o1 showed that a model could trade inference tokens for accuracy. DeepSeek-R1 made the recipe public. s1 showed it was almost embarrassingly simple. Here is the whole apparatus — sampling, verification, and the compute allocator that ties them together.
The pre-reasoning paradigm — one forward pass, one answer
Until mid-2024, every production LLM answered the same way. You sent a prompt, the model autoregressed token-by-token, and whatever fell out of the sampler was the answer. There was no separate “thinking” phase. The only meaningful lever at inference was temperature, and the only meaningful lever for quality was training a bigger model — the story we told in hello, model and refined in distillation.
In that world, test-time compute was a constant. A 7B model took the same ~2 GPU-seconds to answer a hard integral as it did a greeting. Doubling user-visible quality meant doubling parameters, or training on 10× the data. Scaling laws were a one-dimensional problem.
Chain-of-thought, the behaviour that predates the training method
Chain-of-thought is older than o1 by two years. Wei et al. (2022) showed that prompting with a few worked, step-by-step examples pulled dramatically better arithmetic and commonsense reasoning out of models that had no explicit reasoning training whatsoever, and Kojima et al. (2022) got much of the same effect zero-shot just by appending “Let's think step by step”. The model wasn't learning anything new — it was narrating what it already knew how to do, and the narration happened to be useful.
This is the first place the field confused a behaviour with a training method. CoT prompting worked because pretraining corpora contained huge amounts of step-by-step mathematical text — proofs, worked examples, Khan-Academy-style tutorials — and the model could be nudged into that distribution with the right prefix. No gradient update required.
So CoT established that more tokens at test time could buy more accuracy, at least above some capability threshold. What it did not establish: how to make that behaviour reliable, how to train for it, or how to budget it. Those questions waited until 2024.
Best-of-N — the first real test-time compute lever
Once you accept that the model is drawing from a distribution of candidate solutions, the next move is obvious: draw more than one. Best-of-N sampling takes independent completions from the policy, scores each with a verifier, and returns the highest-scoring one. The scale of improvement is often dramatic — pass@1 of 40% can become pass@8 of 78% on hard math, because correctness is a rare event but a well-calibrated verifier finds it when it happens.
Draw N completions from the policy, score each with the verifier, return the argmax of the verifier's score. The verifier is the load-bearing piece — without it, you're just hoping more samples means the right one landed somewhere in the stack.
The accuracy curve flattens with N — doubling samples from 1→2 helps more than 32→64 — which is exactly the diminishing-returns shape the hook viz shows. Real o1-class systems live on this curve: the first few doublings of compute buy the single-digit accuracy points that cross meaningful capability thresholds (MATH → AIME → Olympiad), and further doublings are wasted on returns that diminish hard.
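The sample-score-argmax loop is small enough to sketch. Here `sample` and `verify` are stand-ins for the policy's sampler and the verifier's scoring call, not any real library's API:

```python
from typing import Callable

def best_of_n(sample: Callable[[str], str],
              verify: Callable[[str, str], float],
              prompt: str, n: int = 8) -> str:
    """Draw n independent completions, return the verifier's argmax."""
    candidates = [sample(prompt) for _ in range(n)]
    scores = [verify(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```

Everything interesting hides inside `verify` — the loop itself never changes, whether the verifier is an ORM, a PRM, or a bank of unit tests.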
ORM vs PRM — where the verifier looks
A verifier can grade two things. An outcome reward model (ORM) looks at the final answer and says “yes” or “no”. A process reward model (PRM) looks at each intermediate step and scores them individually. ORM data is cheap — you just need ground-truth answers. PRM data is savagely expensive because humans have to label every step.
ORM: one label per chain, right or wrong. Cheap to collect — just check the final answer against ground truth. The weakness: a chain can be right for the wrong reasons (an arithmetic error that luckily cancels itself) and still score 1. The training signal says nothing about the process.
PRM: one label per step. Expensive to collect — PRM800K (OpenAI, 2023) paid raters to grade 800K individual math steps. It gives much stronger signal because the model learns why a chain failed, not just that it did. A broken step in the middle of a correct-final-answer chain is flagged and penalised.
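The two signals differ only in granularity, which a few lines make concrete. The min-over-steps aggregation for the PRM is one common choice (a chain is only as sound as its weakest step), not the only one:

```python
def orm_score(final_answer: str, ground_truth: str) -> float:
    # outcome reward: one label for the whole chain
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0

def prm_score(step_scores: list[float]) -> float:
    # process reward: one score per step; aggregating with min means a
    # single broken step sinks the chain, even if the final answer is right
    return min(step_scores)
```

A chain with steps scored `[0.9, 0.1, 0.95]` that happens to land on the right answer gets 1.0 from the ORM but only 0.1 from the PRM — exactly the lucky-cancellation case the ORM can't see.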
The open question of 2024 was whether you could get PRM-quality signal without PRM-quality labelling. The answer — from several labs independently in 2025–2026 — is yes, if you're clever about the synthetic data loop.
DeepSeek-R1 — the recipe, the aha moment, the open weights
January 2025. DeepSeek publishes R1, a reasoning model that matches o1 on AIME and MATH, and — crucially — publishes the training recipe. The recipe is stunningly short.
- Start from a strong base model (DeepSeek-V3).
- Apply GRPO, from the grpo lesson, directly on verifiable tasks — math with ground-truth answers, code with unit tests. No critic network, no preference pairs.
- The reward is binary: did the final answer match? Did the unit tests pass?
- Iterate until the reasoning traces get noticeably longer on their own, and accuracy keeps rising.
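The binary reward in step 3 is the entire reward model. A minimal sketch of the math side — real pipelines parse `\boxed{}` answers and run sandboxed unit tests for the code side; the last-number regex here is an illustrative assumption:

```python
import re

def math_reward(completion: str, ground_truth: str) -> float:
    # binary outcome reward: compare the last number in the completion
    # against the ground-truth answer
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0
```

The point is how little machinery this is: no learned reward model, no step labels — just string matching against an answer key.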
That's it. No PRM. No MCTS. No human-curated reasoning demonstrations. The model teaches itself to hedge (“wait, let me check that”), backtrack (“actually, that's wrong, let me redo”), and verify (“plugging back in: 60 × 3 = 180 ✓”). DeepSeek calls this the aha moment — an emergent inflection during R1-Zero's RL training where the model suddenly starts producing much longer, much more self-corrective chains, and accuracy follows. (The production R1 uses a small SFT warm-up before the same GRPO loop; the aha moment itself is documented on R1-Zero.)
For each problem, sample a group of rollouts, check each with the verifier, standardise inside the group. Rollouts that solved it get positive advantage; rollouts that didn't get negative. The policy gradient pushes the positive ones harder and suppresses the negative ones. No value head, no reward model. The grpo lesson covers the full objective — here we're focused on what it produces, not the math.
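The group-standardisation step is one line per rollout. A sketch, assuming binary verifier rewards for a single group:

```python
def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # standardise rewards within one group of rollouts: (r - mean) / std.
    # solved rollouts come out positive, failed ones negative
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Note the degenerate case: a group that is all-correct or all-wrong standardises to zero advantage everywhere — no gradient signal — which is one reason group size and problem difficulty have to be matched.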
There is a pedagogical trap here worth naming. The R1 recipe is not “RL makes reasoning emerge.” It is “RL on a domain where correctness is cheap to check, applied to a base model that already has the latent ability, elicits longer chains that use that ability better.” R1-Zero confirms the first two pieces. Phi-4-Mini-Reasoning and related distillation work confirm the third — the behaviour can be imitated directly from R1's output traces, without ever running the RL. Which brings us to SLMs.
s1 and budget-forcing — the uncomfortable result
February 2025. Stanford's s1 paper lands with a provocative finding: they fine-tune Qwen2.5-32B on a tiny curated dataset of 1,000 reasoning traces, and get o1-class performance on AIME and MATH. What's more striking is the inference trick they introduce: budget-forcing.
At decode time, when the model tries to stop thinking and emit a final answer, they intercept the end-of-thought token and force-append a single word: "Wait". The model obliges — it starts a new reasoning segment, often catches an error, and arrives at a better final answer. AIME24 accuracy rises roughly 50% → 57% just from forcing the model to think longer (sometimes by appending "Wait" more than once), with no training change at all.
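Mechanically, budget-forcing is a decode-loop intercept. In this sketch, `generate` is a stand-in for “continue decoding until the model emits its end-of-thought token”, and `</think>` is an assumed delimiter — the actual token depends on the model's chat template:

```python
from typing import Callable

def budget_forced_decode(generate: Callable[[str], str], prompt: str,
                         max_forces: int = 2,
                         end_of_thought: str = "</think>") -> str:
    """Re-open the thinking segment up to max_forces times."""
    text, forces = prompt, 0
    while True:
        text = generate(text)
        if text.endswith(end_of_thought) and forces < max_forces:
            # intercept the end-of-thought token and force more reasoning
            text = text[: -len(end_of_thought)] + " Wait,"
            forces += 1
        else:
            return text
```

No weights change; the only intervention is string surgery on the decode stream, which is what makes the s1 result so uncomfortable.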
The uncomfortable implication: some non-trivial fraction of “reasoning ability” is not a trained skill at all. It is just more inference compute applied in a particular shape. The model already knew how to backtrack; it just needed a reason to. That's not to say training doesn't matter — the base model has to have the latent capability, which R1's RL helps sharpen. But the gap between “reasoning model” and “non-reasoning model” is partly a gap in how much compute you're willing to spend per answer.
GRPO on verifiable rewards (R1) or SFT on 1K curated traces (s1). Both produce a model whose default behaviour is to generate longer, self-corrective chains. The prior is baked in; no prompt engineering required to get it to narrate its work.
The model now burns hundreds-to-thousands of tokens before emitting its final answer. Best-of-N can be stacked on top, and budget-forcing can extend the chains further mid-decode. Inference cost per query is an order of magnitude higher than a non-reasoning model's.