GRPO — PPO without the critic, RLVR-friendly
DPO needs a preference dataset — pairs labelled by humans. That's expensive and slow. For tasks with automated, verifiable rewards (math, code, JSON schema validity), you can skip humans entirely: sample a group of completions from the current policy, check which ones are correct, and use correctness itself as the training signal.
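A verifiable reward is nothing more than a function from a completion to a score. As a minimal sketch (the `\boxed{...}` answer convention and the function name are illustrative assumptions, not from the source):

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 if the completion's final boxed answer
    matches the ground truth exactly, else 0.0. No reward model needed."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == ground_truth.strip() else 0.0
```

Because the checker is deterministic, the same prompt can be graded thousands of times at zero labelling cost.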
GRPO (introduced by DeepSeek in the DeepSeekMath paper, 2024, and made famous by the R1 paper) is the lightweight RL algorithm that makes this practical. It's a descendant of PPO with one big simplification: no critic network. Instead of estimating the advantage with a learned value function, it standardises rewards within a sampled group of rollouts.
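The group standardisation is the whole trick, and it fits in a few lines. A sketch (whether implementations use population or sample standard deviation varies; population std is assumed here):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO advantage: standardise each reward within its group of G
    rollouts. Replaces PPO's learned value baseline with the group mean."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard: all-equal groups give std 0
    return [(r - mu) / sigma for r in rewards]
```

Note the degenerate case: if every rollout in the group gets the same reward (all right or all wrong), every advantage is zero and the group contributes no gradient — which is why prompt difficulty matters for GRPO.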
The full GRPO objective
The picture above simplified the math. The full GRPO loss is a clipped PPO-style objective with an added KL penalty to a reference model. Here it is:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\min\!\Big(\rho_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(\rho_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\Big) - \beta\,\mathbb{D}_{\text{KL}}\big(\pi_\theta\,\|\,\pi_{\text{ref}}\big)\right)\right]
$$

where $\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,\,o_{i,<t})}$ is the importance ratio between the current policy and the policy that generated the rollouts, and $\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}$ is the standardised advantage from the group.
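In code, the per-token loss is short. A sketch in plain Python (the k3-style KL estimator shown here is the one the GRPO paper uses; hyperparameter values are illustrative):

```python
import math

def grpo_token_loss(logp_new: float, logp_old: float, logp_ref: float,
                    advantage: float, eps: float = 0.2, beta: float = 0.04) -> float:
    """Per-token GRPO loss (averaged over tokens and the group by the caller).
    Clipped PPO surrogate minus a KL penalty to the reference policy."""
    ratio = math.exp(logp_new - logp_old)            # importance ratio
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip(ratio, 1-eps, 1+eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Unbiased KL estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    kl = math.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(surrogate - beta * kl)                  # negate: we minimise the loss
```

With the ratio at 1 and the policy still at the reference, the loss reduces to minus the advantage — the gradient simply pushes up tokens from above-average rollouts.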
Why RLVR is the trick that made DeepSeek-R1 possible
DeepSeek-R1 used pure rule-based rewards:
- Accuracy reward — does the final answer match ground truth?
- Format reward — is the output wrapped in `<think>...</think><answer>...</answer>`?
No humans in the loop. No reward model. Just regex + a math checker. With GRPO running thousands of rollout-grade-update cycles, the model learned to produce long correct reasoning traces from scratch. The AIME 2024 pass@1 score went from 15.6% to 71.0% — entirely from RLVR on verifiable problems.
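The format reward really is just a regex. A plausible sketch of such a checker (the exact pattern R1 used is not public; this is an assumption matching the tag convention above):

```python
import re

# Whole output must be a <think> block followed by an <answer> block.
_FORMAT = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    """R1-style rule-based format reward: 1.0 if the completion matches
    the required <think>/<answer> structure, else 0.0."""
    return 1.0 if _FORMAT.match(completion.strip()) else 0.0
```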
The 2025 surprise: GRPO and DPO are the same thing
Chen et al. 2025 (“It Takes Two: Your GRPO Is Secretly DPO”) showed that GRPO and DPO implement the same contrastive mechanism. Both push up the log-probability of good responses relative to bad ones. The only difference is where the pairs come from:
- DPO: pairs from a static human preference dataset.
- GRPO: pairs from on-policy rollouts, grouped and standardised.
The unifying view: preference optimisation is contrastive learning over log-probability ratios. When you have preference data, use DPO. When you have rollouts and a verifier, use GRPO. The math is the same.
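The equivalence is easiest to see side by side. A sketch of both losses for a single pair (beta values are illustrative; the GRPO view assumes binary rewards standardised to ±1 advantages and an on-policy ratio of 1):

```python
import math

def dpo_pair_loss(lp_w: float, lp_l: float,
                  lp_ref_w: float, lp_ref_l: float, beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair:
    -log sigmoid(beta * difference of reference-relative log-prob margins)."""
    margin = (lp_w - lp_ref_w) - (lp_l - lp_ref_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def grpo_pair_loss(lp_w: float, lp_l: float) -> float:
    """GRPO with one correct (+1 advantage) and one incorrect (-1 advantage)
    rollout, on-policy: the surrogate collapses to a log-prob-ratio contrast."""
    return -(1.0 * lp_w + (-1.0) * lp_l)  # = -(lp_w - lp_l)
```

Both losses decrease exactly when `lp_w - lp_l` grows — the same contrastive push, with DPO squashing the margin through a sigmoid and GRPO applying it linearly.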