RLHF without the RL — or the HF
The classical RLHF pipeline needs three things in memory simultaneously: the policy, a trained reward model, and a frozen reference. On top of that comes a PPO optimization loop with rollouts, advantage estimation, and notorious instability. It works at scale, but it is miserable to run: most teams that tried it for their own fine-tuning projects abandoned it.
Work through the derivation step by step, then read the deep dive below. The partition-function cancellation is the single most beautiful algebraic move in modern alignment research, and most tutorials skip it.
This is the step that actually makes DPO work. Start from the closed-form solution of the KL-constrained RL objective:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,\exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$
where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x)\,\exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$ is a normalising constant (the partition function) that depends only on $x$, not on $y$. Take logs and solve for $r(x, y)$:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
Plug this into the Bradley-Terry preference model:

$$p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$
Now compute the difference $r(x, y_w) - r(x, y_l)$:

$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta \log Z(x)$$
The $\beta \log Z(x)$ terms cancel exactly. This is the magic: the partition function depends only on $x$, so it appears identically in both $r(x, y_w)$ and $r(x, y_l)$, and the preference model only sees their difference.
What you're left with:

$$p(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$
Notice: this difference contains only the policy $\pi^*$ and the reference $\pi_{\text{ref}}$. There is no reward model left. You can train the policy directly by optimising the Bradley-Terry log-likelihood: no reward network, no PPO loop, no advantage estimation. That's DPO in one insight.
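The cancellation is easy to verify numerically. A pure-Python sketch with made-up log-probabilities (all values here are toy numbers, not from any real model): the preference probability computed from the explicit rewards, partition term included, matches the one computed from the log-ratio difference alone.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy log-probabilities for one prompt x and a (chosen, rejected) pair.
beta = 0.1
logp_chosen, logp_ref_chosen = -12.0, -14.0      # log pi(y_w|x), log pi_ref(y_w|x)
logp_rejected, logp_ref_rejected = -15.0, -13.5  # log pi(y_l|x), log pi_ref(y_l|x)
log_Z = 3.7  # arbitrary: depends only on x, so it is shared by both responses

# Implied rewards from the closed-form solution, including the beta*log Z term.
r_chosen = beta * (logp_chosen - logp_ref_chosen) + beta * log_Z
r_rejected = beta * (logp_rejected - logp_ref_rejected) + beta * log_Z

# Bradley-Terry preference probability via explicit rewards...
p_with_Z = sigmoid(r_chosen - r_rejected)

# ...and via the log-ratio difference alone, with Z nowhere in sight.
p_without_Z = sigmoid(beta * ((logp_chosen - logp_ref_chosen)
                              - (logp_rejected - logp_ref_rejected)))

assert abs(p_with_Z - p_without_Z) < 1e-12  # Z(x) cancelled exactly
```

Change `log_Z` to any other value and the assertion still holds, which is exactly why no partition function ever needs to be estimated.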
What the loss actually does, in plain English
The DPO loss is contrastive. For each preference pair it says:
- Push up the log-probability of the chosen response relative to the reference.
- Push down the log-probability of the rejected response relative to the reference.
- Balance these so the total log-ratio difference matches the human preference probabilities, via the sigmoid link function.
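The three bullets above compress into a few lines of code. This is a minimal sketch of the per-pair loss (the function name and toy log-probabilities are mine, not from any particular library):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from summed token log-probs.

    logp_* come from the policy pi_theta; ref_logp_* from the frozen reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin; log1p(exp(-m)) is the stable form.
    return math.log1p(math.exp(-margin))

# Policy already prefers the chosen response relative to the reference: low loss.
low = dpo_loss(-12.0, -15.0, -14.0, -13.5)
# Policy prefers the rejected response: higher loss, stronger gradient.
high = dpo_loss(-15.0, -12.0, -13.5, -14.0)
assert 0.0 < low < high
```

Minimising this loss widens the chosen-versus-rejected log-ratio gap, which is the "push up / push down" behaviour described above.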
The $\beta$ hyperparameter controls how far the policy can drift from the reference. Small $\beta$ = aggressive updates (can produce stronger improvements but risks degeneration); large $\beta$ = conservative updates (safer but less effective). A typical default is $\beta = 0.1$.
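You can see the effect of $\beta$ directly in the sigmoid: for the same fixed log-ratio gap (an arbitrary toy value below), a small $\beta$ keeps the preference probability near 0.5, so the loss gradient stays large and keeps pushing the policy; a large $\beta$ saturates the sigmoid at the same gap, so updates stop early.

```python
import math

def pref_prob(logratio_gap, beta):
    # sigma(beta * (log-ratio difference)) from the DPO objective
    return 1.0 / (1.0 + math.exp(-beta * logratio_gap))

gap = 3.5  # fixed chosen-minus-rejected log-ratio gap (made up)

# Small beta: probability barely above 0.5 -> gradient keeps pushing
# -> aggressive drift from the reference.
p_small = pref_prob(gap, beta=0.05)
# Large beta: the same gap already nearly saturates the sigmoid
# -> conservative updates.
p_large = pref_prob(gap, beta=1.0)

assert 0.5 < p_small < p_large < 1.0
```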
When DPO wins, and when it doesn't
DPO is the right default for preference tuning of SLMs. It's stable, it's memory-efficient (only two models in VRAM: policy and reference), it works with LoRA, it has a clean theoretical foundation.
It has weaknesses. DPO has no online exploration — it bakes in whatever biases exist in the preference dataset. It's famous for the length exploit: chosen responses in public datasets tend to be longer, so DPO-tuned models become gratuitously verbose. It also has no “second chance” to catch generations that fall outside the preference dataset's distribution.
For verifiable tasks (math, code, JSON), GRPO + RLVR (coming up in the next lesson) often outperforms DPO — because it can use automated rewards and online rollouts. For open-ended tasks, DPO is usually the right call.