Microscale
Act VI · Making It Yours
lesson dpo · 12 min · 60 xp

DPO as KL-constrained optimum

A Manim-style derivation of the loss

RLHF without the RL (or the reward model)

The classical RLHF pipeline needs three things in memory simultaneously: the policy, a trained reward model, and a frozen reference. Plus a PPO optimization loop with rollouts, advantage estimation, and notorious instability. It works at scale, but it is miserable to run — most teams who tried it for their own fine-tuning projects abandoned it.

historical note
2023 · Rafailov et al., Stanford
Rafailov and team asked the question on everyone's mind: why do we need a separate reward model at all? The reward model is a neural network fit to match human preferences. The policy is a neural network we're training to maximise that reward. Could we collapse these into a single training step? They showed the answer is yes, and the result — Direct Preference Optimization — now dominates the preference-tuning ecosystem. Every major SLM post-training pipeline in 2024–2026 uses DPO or one of its descendants.
◆ paper
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafailov, Sharma, Mitchell, Manning, Ermon, Finn · 2023 · NeurIPS 2023
arxiv:2305.18290
The paper that collapsed two years of RLHF engineering into a single classification loss. The title is the punchline: if you invert the relationship between reward and policy, the policy itself can be used as the reward model, and you never need to train a separate one.

Click through the derivation on the right. Then read the deep dive below — the partition-function cancellation is the single most beautiful algebraic move in modern alignment research, and most tutorials skip it.

This is the step that actually makes DPO work. Start from the closed-form solution of the KL-constrained RL objective:

\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_\text{ref}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)

where Z(x) is a normalising constant (the partition function) that depends only on x, not on y. Take logs and solve for r:

r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta \log Z(x)

Plug this into the Bradley-Terry preference model:

P(y_w \succ y_l \mid x) = \sigma(r(x, y_w) - r(x, y_l))

Now compute the difference r(x, y_w) − r(x, y_l):

= \beta \log\frac{\pi^*(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} + \beta \log Z(x) - \beta \log\frac{\pi^*(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} - \beta \log Z(x)

The β log Z(x) terms cancel exactly. This is the magic: the partition function depends only on x, so it appears identically in both r(x, y_w) and r(x, y_l), and the preference model only sees their difference.

What you're left with:

r(x, y_w) - r(x, y_l) = \beta \log\frac{\pi^*(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log\frac{\pi^*(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}

Notice: this difference contains only the policy π* and the reference π_ref. There is no reward model left. You can train the policy directly by optimising the Bradley-Terry log-likelihood: no reward network, no PPO loop, no advantage estimation. That's DPO in one insight.
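The loss this derivation yields can be sketched in a few lines of plain Python. This is a toy per-pair version (the numbers are made up; in practice each log-prob is the summed token log-probabilities of a full response under the policy or the frozen reference):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_*     : log pi(y | x) under the policy being trained
    ref_logp_* : log pi_ref(y | x) under the frozen reference
    """
    # Implicit reward margin: beta * (chosen log-ratio - rejected log-ratio)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Bradley-Terry negative log-likelihood: -log sigma(margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Chosen gained 1 nat of log-ratio vs. reference, rejected lost 1 nat:
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0), 4))  # → 0.5981

# Policy identical to reference: margin 0, loss ln 2 (maximal uncertainty):
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # → 0.6931
```

Note that only log-ratio *differences* enter the loss, which is exactly the partition-function cancellation above at work.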

step 1 of 5
\max_\pi \mathbb{E}_{x, y \sim \pi}[r(x, y)] - \beta\, \text{KL}(\pi \,\|\, \pi_\text{ref})
Start with the RLHF objective. You want to maximise the reward r(x, y) while not drifting too far from the reference policy π_ref (the SFT model before preference tuning). The KL coefficient β controls how willing you are to move.

What the loss actually does, in plain English

The DPO loss is contrastive. For each preference pair it says:

  • Push up the log-probability of the chosen response ywy_w relative to the reference.
  • Push down the log-probability of the rejected response yly_l relative to the reference.
  • Balance these so the total log-ratio difference matches the human preference probabilities, via the sigmoid link function.

β controls how far the policy can drift from the reference. Small β = aggressive updates (can produce stronger improvements but risks degeneration); large β = conservative updates (safer but less effective). A typical default is β = 0.1.
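One way to see the β trade-off numerically (a small illustration, not from the lesson): hold the log-ratio gap between chosen and rejected fixed and vary β. The loss only saturates once the gap reaches roughly 1/β, so a small β keeps pushing the policy long after a large β would have stopped:

```python
import math

def pair_loss(delta_w, delta_l, beta):
    """DPO loss given log-ratios log pi/pi_ref for chosen and rejected."""
    margin = beta * (delta_w - delta_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical pair: chosen gained 1 nat vs. reference, rejected lost 1 nat.
# Loss stays near ln 2 ≈ 0.693 under small beta (still pushing the policy),
# but is nearly saturated under beta = 1 (little pressure left to drift):
for beta in (0.01, 0.1, 1.0):
    print(beta, round(pair_loss(1.0, -1.0, beta), 4))
```

This is why small β values permit larger drift from the reference, for better or worse.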

When DPO wins, and when it doesn't

DPO is the right default for preference tuning of SLMs. It's stable, it's memory-efficient (only two models in VRAM: policy and reference), it works with LoRA, it has a clean theoretical foundation.

It has weaknesses. DPO has no online exploration — it bakes in whatever biases exist in the preference dataset. It's famous for the length exploit: chosen responses in public datasets tend to be longer, so DPO-tuned models become gratuitously verbose. It also has no “second chance” to catch generations that fall outside the preference dataset's distribution.

For verifiable tasks (math, code, JSON), GRPO + RLVR (coming up in the next lesson) often outperforms DPO — because it can use automated rewards and online rollouts. For open-ended tasks, DPO is usually the right call.