GRPO — PPO without the critic, RLVR-friendly
DPO needs a preference dataset — pairs labelled by humans. That's expensive and slow. For tasks with automated, verifiable rewards (math, code, JSON schema validity), you can skip humans entirely: sample a group of completions from the current policy, check which ones are correct, and use correctness itself as the training signal.
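A verifiable reward is nothing more than a function from a completion to a score. As a minimal sketch (the `\boxed{...}` answer convention and the function name are illustrative assumptions, not from the source):

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 if the completion's final boxed answer
    matches the ground truth exactly, else 0.0. No reward model needed."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == ground_truth.strip() else 0.0
```

Because the checker is deterministic, the same prompt can be graded thousands of times at zero labelling cost.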
GRPO (introduced by DeepSeek in the DeepSeekMath paper, 2024, and made famous by the R1 paper) is the lightweight RL algorithm that makes this practical. It's a descendant of PPO with one big simplification: no critic network. Instead of estimating the advantage with a learned value function, it standardises rewards within a sampled group of rollouts.
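The group standardisation is the whole trick, and it fits in a few lines. A sketch (whether implementations use population or sample standard deviation varies; population std is assumed here):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO advantage: standardise each reward within its group of G
    rollouts. Replaces PPO's learned value baseline with the group mean."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard: all-equal groups give std 0
    return [(r - mu) / sigma for r in rewards]
```

Note the degenerate case: if every rollout in the group gets the same reward (all right or all wrong), every advantage is zero and the group contributes no gradient — which is why prompt difficulty matters for GRPO.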
The full GRPO objective
The picture above simplified the math. The full GRPO loss is a clipped PPO-style objective with an added KL penalty to a reference model. Here it is:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\min\!\Big(\rho_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(\rho_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\Big) - \beta\,\mathbb{D}_{\text{KL}}\big(\pi_\theta\,\|\,\pi_{\text{ref}}\big)\right)\right]
$$

where $\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,\,o_{i,<t})}$ is the importance ratio between the current policy and the policy that generated the rollouts, and $\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}$ is the standardised advantage from the group.
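In code, the per-token loss is short. A sketch in plain Python (the k3-style KL estimator shown here is the one the GRPO paper uses; hyperparameter values are illustrative):

```python
import math

def grpo_token_loss(logp_new: float, logp_old: float, logp_ref: float,
                    advantage: float, eps: float = 0.2, beta: float = 0.04) -> float:
    """Per-token GRPO loss (averaged over tokens and the group by the caller).
    Clipped PPO surrogate minus a KL penalty to the reference policy."""
    ratio = math.exp(logp_new - logp_old)            # importance ratio
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip(ratio, 1-eps, 1+eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Unbiased KL estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    kl = math.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(surrogate - beta * kl)                  # negate: we minimise the loss
```

With the ratio at 1 and the policy still at the reference, the loss reduces to minus the advantage — the gradient simply pushes up tokens from above-average rollouts.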
Why RLVR is the trick that made DeepSeek-R1 possible
DeepSeek-R1 used pure rule-based rewards:
- Accuracy reward — does the final answer match ground truth?
- Format reward — is the output wrapped in `<think>...</think><answer>...</answer>`?
No humans in the loop. No reward model. Just regex + a math checker. With GRPO running thousands of rollout-grade-update cycles, the model learned to produce long correct reasoning traces from scratch. The AIME 2024 pass@1 score went from 15.6% to 71.0% — entirely from RLVR on verifiable problems.
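The format reward really is just a regex. A plausible sketch of such a checker (the exact pattern R1 used is not public; this is an assumption matching the tag convention above):

```python
import re

# Whole output must be a <think> block followed by an <answer> block.
_FORMAT = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    """R1-style rule-based format reward: 1.0 if the completion matches
    the required <think>/<answer> structure, else 0.0."""
    return 1.0 if _FORMAT.match(completion.strip()) else 0.0
```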
The 2025 surprise: GRPO and DPO are the same thing
Chen et al. 2025 (“It Takes Two: Your GRPO Is Secretly DPO”) showed that GRPO and DPO implement the same contrastive mechanism. Both push up the log-probability of good responses relative to bad ones. The only difference is where the pairs come from:
- DPO: pairs from a static human preference dataset.
- GRPO: pairs from on-policy rollouts, grouped and standardised.
The unifying view: preference optimisation is contrastive learning over log-probability ratios. When you have preference data, use DPO. When you have rollouts and a verifier, use GRPO. The math is the same.
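The equivalence is easiest to see side by side. A sketch of both losses for a single pair (beta values are illustrative; the GRPO view assumes binary rewards standardised to ±1 advantages and an on-policy ratio of 1):

```python
import math

def dpo_pair_loss(lp_w: float, lp_l: float,
                  lp_ref_w: float, lp_ref_l: float, beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair:
    -log sigmoid(beta * difference of reference-relative log-prob margins)."""
    margin = (lp_w - lp_ref_w) - (lp_l - lp_ref_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def grpo_pair_loss(lp_w: float, lp_l: float) -> float:
    """GRPO with one correct (+1 advantage) and one incorrect (-1 advantage)
    rollout, on-policy: the surrogate collapses to a log-prob-ratio contrast."""
    return -(1.0 * lp_w + (-1.0) * lp_l)  # = -(lp_w - lp_l)
```

Both losses decrease exactly when `lp_w - lp_l` grows — the same contrastive push, with DPO squashing the margin through a sigmoid and GRPO applying it linearly.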