Microscale
Act VI · Making It Yours
lesson dpo · 12 min · 60 xp

DPO as KL-constrained optimum

A Manim-style derivation of the loss

RLHF without the RL (or the reward model)

The classical RLHF pipeline needs three things in memory simultaneously: the policy, a trained reward model, and a frozen reference. Plus a PPO optimization loop with rollouts, advantage estimation, and notorious instability. It works at scale, but it is miserable to run — most teams who tried it for their own fine-tuning projects abandoned it.

historical note
2023 · Rafailov et al., Stanford
Rafailov and team asked the question on everyone's mind: why do we need a separate reward model at all? The reward model is a neural network fit to match human preferences. The policy is a neural network we're training to maximise that reward. Could we collapse these into a single training step? They showed the answer is yes, and the result — Direct Preference Optimization — now dominates the preference-tuning ecosystem. Every major SLM post-training pipeline in 2024–2026 uses DPO or one of its descendants.
◆ paper
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafailov, Sharma, Mitchell, Manning, Ermon, Finn · 2023 · NeurIPS 2023
arxiv:2305.18290
The paper that collapsed two years of RLHF engineering into a single classification loss. The title is the punchline: if you invert the relationship between reward and policy, the policy itself can be used as the reward model, and you never need to train a separate one.

Click through the derivation on the right. Then read the deep dive below — the partition-function cancellation is the single most beautiful algebraic move in modern alignment research, and most tutorials skip it.

This is the step that actually makes DPO work. Start from the closed-form solution of the KL-constrained RL objective:

\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_\text{ref}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)

where Z(x) is a normalising constant (the partition function) that depends only on x, not on y. Take logs and solve for r:

r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta \log Z(x)

Plug this into the Bradley-Terry preference model:

P(y_w \succ y_l \mid x) = \sigma(r(x, y_w) - r(x, y_l))

Now compute the difference r(x, y_w) − r(x, y_l):

= \beta \log\frac{\pi^*(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} + \beta \log Z(x) - \beta \log\frac{\pi^*(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} - \beta \log Z(x)

The β log Z(x) terms cancel exactly. This is the magic: the partition function depends only on x, so it appears identically in both r(x, y_w) and r(x, y_l), and the preference model only sees their difference.

What you're left with:

r(x, y_w) - r(x, y_l) = \beta \log\frac{\pi^*(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log\frac{\pi^*(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}

Notice: this difference contains only the policy π* and the reference π_ref. There is no reward model left. You can train the policy directly by optimising the Bradley-Terry log-likelihood: no reward network, no PPO loop, no advantage estimation. That's DPO in one insight.
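The loss this derivation yields can be sketched in a few lines of plain Python. This is a toy per-pair version (the numbers are made up; in practice each log-prob is the summed token log-probabilities of a full response under the policy or the frozen reference):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_*     : log pi(y | x) under the policy being trained
    ref_logp_* : log pi_ref(y | x) under the frozen reference
    """
    # Implicit reward margin: beta * (chosen log-ratio - rejected log-ratio)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Bradley-Terry negative log-likelihood: -log sigma(margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Chosen gained 1 nat of log-ratio vs. reference, rejected lost 1 nat:
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0), 4))  # → 0.5981

# Policy identical to reference: margin 0, loss ln 2 (maximal uncertainty):
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # → 0.6931
```

Note that only log-ratio *differences* enter the loss, which is exactly the partition-function cancellation above at work.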

step 1 of 5
\max_\pi \mathbb{E}_{x, y \sim \pi}[r(x, y)] - \beta\, \text{KL}(\pi \,\|\, \pi_\text{ref})
Start with the RLHF objective. You want to maximise the reward r(x, y) while not drifting too far from the reference policy π_ref (the SFT model before preference tuning). The KL coefficient β controls how willing you are to move.

What the loss actually does, in plain English

The DPO loss is contrastive. For each preference pair it says:

  • Push up the log-probability of the chosen response ywy_w relative to the reference.
  • Push down the log-probability of the rejected response yly_l relative to the reference.
  • Balance these so the total log-ratio difference matches the human preference probabilities, via the sigmoid link function.

β controls how far the policy can drift from the reference. Small β = aggressive updates (can produce stronger improvements but risks degeneration); large β = conservative updates (safer but less effective). A typical default is β = 0.1.
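One way to see the β trade-off numerically (a small illustration, not from the lesson): hold the log-ratio gap between chosen and rejected fixed and vary β. The loss only saturates once the gap reaches roughly 1/β, so a small β keeps pushing the policy long after a large β would have stopped:

```python
import math

def pair_loss(delta_w, delta_l, beta):
    """DPO loss given log-ratios log pi/pi_ref for chosen and rejected."""
    margin = beta * (delta_w - delta_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical pair: chosen gained 1 nat vs. reference, rejected lost 1 nat.
# Loss stays near ln 2 ≈ 0.693 under small beta (still pushing the policy),
# but is nearly saturated under beta = 1 (little pressure left to drift):
for beta in (0.01, 0.1, 1.0):
    print(beta, round(pair_loss(1.0, -1.0, beta), 4))
```

This is why small β values permit larger drift from the reference, for better or worse.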

When DPO wins, and when it doesn't

DPO is the right default for preference tuning of SLMs. It's stable, it's memory-efficient (only two models in VRAM: policy and reference), it works with LoRA, it has a clean theoretical foundation.

It has weaknesses. DPO has no online exploration — it bakes in whatever biases exist in the preference dataset. It's famous for the length exploit: chosen responses in public datasets tend to be longer, so DPO-tuned models become gratuitously verbose. It also has no “second chance” to catch generations that fall outside the preference dataset's distribution.

For verifiable tasks (math, code, JSON), GRPO + RLVR (coming up in the next lesson) often outperforms DPO — because it can use automated rewards and online rollouts. For open-ended tasks, DPO is usually the right call.