Week 13: Direct Preference Optimization and GRPO

✦Learning Outcomes
  • Compare Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) and understand their trade-offs
  • Implement preference optimization for large language model (LLM) alignment
  • Analyze why preference optimization is more efficient than RLHF (Reinforcement Learning from Human Feedback) based on PPO (Proximal Policy Optimization)
  • Connect DPO/GRPO to real-world LLM training pipelines
◆Prerequisites
  • Week 12: RLHF pipeline, reward modeling, PPO
  • Week 7: Policy gradient

Recommended: Review Week 12 before proceeding.

Purpose of this lecture

The RLHF pipeline from Week 12 works: it produces well-aligned language models that score highly on human preference evaluations. It also carries a significant overhead. PPO-based RLHF requires four large neural networks loaded simultaneously: the active policy, a frozen reference model, a reward model, and a value critic. For a 70B-parameter model in BF16 (2 bytes per parameter, so about 140 GB per copy), four same-sized networks represent roughly 560 GB of GPU memory for weights alone, before accounting for optimizer state and activations. The result is that full-scale RLHF is accessible to only a small number of organizations, and iteration cycles are slow even for those who can afford it.
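The arithmetic behind the 560 GB figure can be sketched as follows. This is a back-of-the-envelope estimate that assumes all four networks are 70B-parameter copies stored in BF16; the function name `weight_memory_gb` is illustrative, not from any library.

```python
# Weight-only memory estimate for PPO-based RLHF (no optimizer state, no activations).
BYTES_PER_PARAM_BF16 = 2   # BF16 stores each parameter in 2 bytes
NUM_PARAMS = 70e9          # 70B parameters, as in the text
NETWORKS = ["policy", "reference", "reward_model", "value_critic"]

def weight_memory_gb(num_params, bytes_per_param, num_networks):
    """Total weight memory in GB, using 1 GB = 1e9 bytes."""
    return num_params * bytes_per_param * num_networks / 1e9

per_model_gb = weight_memory_gb(NUM_PARAMS, BYTES_PER_PARAM_BF16, 1)
total_gb = weight_memory_gb(NUM_PARAMS, BYTES_PER_PARAM_BF16, len(NETWORKS))
print(f"per model: {per_model_gb:.0f} GB, all four: {total_gb:.0f} GB")
```

Dropping networks from the list shows the weight-memory effect of removing the reward model and critic (DPO) or the critic alone (GRPO).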

This lecture develops two modern approaches that reduce this overhead without abandoning the theoretical grounding of the RLHF framework. Direct Preference Optimization (DPO) rederives the RLHF objective to show that the reward model is implicit in the policy ratio and can be eliminated entirely: the alignment problem reduces to binary classification on preference pairs. Group Relative Policy Optimization (GRPO) takes the opposite simplification: keep the RL loop but replace the learned value critic with an empirical group baseline, eliminating the critic network while preserving token-level credit assignment.

Both methods trace directly to the closed-form KL-regularized optimal policy derived in Week 12. Understanding the derivations — not just the final loss functions — is what separates principled use of these methods from treating them as black-box recipes.


DPO: the derivation and landmark result

Direct Preference Optimization (Rafailov et al., 2023) demonstrated a surprising insight: the RLHF alignment objective has a closed-form optimal policy, and substituting that policy into the Bradley-Terry preference model yields a loss function that depends only on policy ratios, never on an explicit reward model. The paper's central contribution was empirical: DPO-trained models reached InstructGPT-level instruction-following quality with a single epoch of offline training, no reward model fitting, and no RL rollouts — roughly 10x less compute than PPO-RLHF. The community's initial reception was enthusiastic: DPO's simplicity, stability, and efficiency made RLHF-scale alignment accessible to academic labs. However, subsequent work exposed limitations: Dubois et al. (2024) showed DPO underperforms PPO on mathematical and code reasoning tasks where step-level verification is important. More recently, studies of DPO's implicit reward model revealed overoptimization pathologies — the policy can diverge sharply from the reference model on out-of-distribution prompts, and the implicit reward is often uninterpretable. The field has responded with a proliferation of DPO variants (SimPO, IPO, ORPO, CPO), each addressing a specific failure mode, suggesting that no single offline preference optimization formulation dominates all tasks.

Starting point

Recall from Week 12 that the KL-regularized RLHF objective:

$$\max_\pi \; \mathbb{E}_{y \sim \pi}\!\left[r(x,y)\right] - \beta\, D_{\text{KL}}(\pi \,\|\, \pi_{\text{ref}})$$

has the closed-form optimal policy:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\cdot \exp\!\left(\frac{r(x,y)}{\beta}\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x)\exp(r(x,y)/\beta)$ is the intractable partition function.
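For intuition, the closed-form optimum can be evaluated exactly when the response set is tiny. The sketch below uses a hypothetical three-response set with made-up reference probabilities and rewards; for real language models the sum defining the partition function runs over all sequences and cannot be computed this way.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Closed-form KL-regularized optimum over a finite response set.

    Returns (pi_star, Z), where Z is the partition function.
    """
    unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    Z = sum(unnorm)  # tractable only because the response set is tiny
    return [u / Z for u in unnorm], Z

pi_ref = [0.5, 0.3, 0.2]    # hypothetical reference probabilities
rewards = [0.0, 1.0, 2.0]   # hypothetical rewards
pi_star, Z = optimal_policy(pi_ref, rewards, beta=1.0)
print([round(p, 3) for p in pi_star])

# A very large beta keeps the optimum close to the reference policy.
pi_conservative, _ = optimal_policy(pi_ref, rewards, beta=1e6)
```

The optimum reweights the reference distribution toward high-reward responses, and the strength of the reweighting is set by the ratio of reward to beta.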

Inverting for the implicit reward

DPO's key move is to rearrange this expression to express $r(x,y)$ in terms of the policy ratio, rather than expressing $\pi^*$ in terms of $r$:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

The term $\beta \log Z(x)$ depends only on $x$, not on $y$. Since the Bradley-Terry preference model from Week 12 involves the difference of rewards:

$$P(y^w \succ y^l \mid x) = \sigma\!\left(r(x, y^w) - r(x, y^l)\right)$$

the $\beta \log Z(x)$ terms cancel exactly when the reward difference is taken:

$$r(x, y^w) - r(x, y^l) = \beta \log \frac{\pi^*(y^w \mid x)}{\pi_{\text{ref}}(y^w \mid x)} - \beta \log \frac{\pi^*(y^l \mid x)}{\pi_{\text{ref}}(y^l \mid x)}$$
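The cancellation can be checked numerically on a toy problem. The sketch below builds the optimal policy for a hypothetical three-response set (all numbers are illustrative), then confirms that the reward difference equals the difference of scaled log-ratios, while each individual log-ratio is off from the true reward by the same constant, beta times log Z.

```python
import math

pi_ref = [0.5, 0.3, 0.2]    # hypothetical reference probabilities
rewards = [0.0, 1.0, 2.0]   # hypothetical true rewards
beta = 0.5

# Closed-form optimal policy over the toy response set.
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]

def log_ratio_reward(i):
    """beta * log(pi_star / pi_ref): the reward with the beta*log Z term dropped."""
    return beta * math.log(pi_star[i] / pi_ref[i])

w, l = 2, 0  # indices of the preferred and dispreferred responses
true_diff = rewards[w] - rewards[l]
ratio_diff = log_ratio_reward(w) - log_ratio_reward(l)
offsets = [rewards[i] - log_ratio_reward(i) for i in range(3)]

print(true_diff, ratio_diff)            # equal up to floating-point error
print(offsets, beta * math.log(Z))      # every offset equals beta * log Z
```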

The DPO loss

Substituting into the Bradley-Terry log-likelihood, and replacing the optimal policy $\pi^*$ with the parameterized policy $\pi_\theta$ we are training:

$$\boxed{ \mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,y^w,y^l) \sim \mathcal{D}}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y^w \mid x)}{\pi_{\text{ref}}(y^w \mid x)} - \beta \log \frac{\pi_\theta(y^l \mid x)}{\pi_{\text{ref}}(y^l \mid x)} \right) \right] }$$

This is a binary cross-entropy loss where the "logit" is the difference in log-probability ratios (policy log-probability minus reference log-probability) between the preferred and dispreferred responses, scaled by $\beta$. No reward model appears. The reward model has been algebraically eliminated: it is implicit in the ratio $\pi_\theta / \pi_{\text{ref}}$.
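A minimal sketch of the loss for a single preference pair, written in plain Python so the arithmetic is visible. It assumes the four summed sequence log-probabilities have already been computed by the policy and the frozen reference; the function names and the default `beta=0.1` are illustrative choices, not values from the lecture.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair.

    Inputs are summed sequence log-probabilities under the policy
    (logp_*) and under the frozen reference (ref_logp_*).
    """
    logit = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(logit)

# At initialization the policy equals the reference, so the logit is 0
# and the loss is log(2), the chance-level binary cross-entropy.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the preferred response's log-probability lowers the loss.
loss_after = dpo_loss(-9.0, -12.0, -10.0, -12.0)
print(loss_at_init, loss_after)
```

In a real training loop the same expression would be computed on batched tensors with automatic differentiation; only `logp_w` and `logp_l` carry gradients, since the reference is frozen.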

What DPO is actually optimizing

Taking the gradient of $\mathcal{L}_{\text{DPO}}$ with respect to $\theta$ and examining the update direction reveals the mechanism:

$$-\nabla_\theta \mathcal{L}_{\text{DPO}} \propto \beta \left[ \underbrace{\nabla_\theta \log \pi_\theta(y^w \mid x)}_{\text{increase } \pi_\theta(y^w)} - \underbrace{\nabla_\theta \log \pi_\theta(y^l \mid x)}_{\text{decrease } \pi_\theta(y^l)} \right] \cdot \hat{\sigma}$$

where $\hat{\sigma} = 1 - \sigma(\Delta)$ and $\Delta$ is the logit inside the DPO loss (the $\beta$-scaled difference of log-ratios). The weight $\hat{\sigma}$ is large when the model currently ranks the pair incorrectly or with little confidence ($\Delta$ negative or near zero) and small when the model already ranks it correctly with a wide margin ($\Delta \gg 0$). DPO therefore applies larger gradients to pairs where the model is currently confused, concentrating learning effort where it is most needed, analogous to hard example mining in metric learning.
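The behavior of this weighting factor is easy to tabulate. The sketch below evaluates one minus the sigmoid of the logit at a few hypothetical logit values; the function name is illustrative.

```python
import math

def dpo_grad_weight(delta):
    """Per-pair gradient weight: 1 - sigmoid(delta) = sigmoid(-delta)."""
    return 1.0 / (1.0 + math.exp(delta))

# Negative delta: the pair is currently ranked the wrong way round.
# Zero delta: the model is indifferent. Positive delta: ranked correctly.
for delta in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"delta={delta:+.1f}  weight={dpo_grad_weight(delta):.3f}")
```

The weight decreases monotonically in the logit, so pairs the model already separates well contribute little to the update.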

The $\beta$ parameter controls the KL penalty strength: large $\beta$ prevents $\pi_\theta$ from deviating much from $\pi_{\text{ref}}$ (conservative, low risk of mode collapse); small $\beta$ allows large policy shifts (aggressive, risk of forgetting the reference model's qualities).

Practical implications

DPO reduces the RLHF training setup from four networks to two: the trainable policy $\pi_\theta$ and the frozen reference $\pi_{\text{ref}}$. Training is stable supervised learning: no reward signal variance, no PPO clipping, no advantage estimation. The preference dataset $\mathcal{D}$ can be collected once offline and reused, making DPO a fully offline alignment method in the sense of Week 11. A single epoch of DPO on a well-curated preference dataset can match the alignment quality of PPO-RLHF with a fraction of the compute.

⚠Critical Lens: DPO's Implicit Reward and Overoptimization

DPO's elimination of the reward model is mathematically elegant, but introduces failure modes that are invisible from the loss function alone.

The implicit reward is uninterpretable. In PPO-RLHF, the explicit reward model $r_\phi(x, y)$ can be inspected: we can score any response and ask whether the RM assigns higher values to responses we consider better. In DPO, the "reward" is $\beta \log\!\left(\pi_\theta(y \mid x) / \pi_{\text{ref}}(y \mid x)\right)$, a policy ratio that exists only for responses the model can actually generate. There is no way to evaluate whether the implicit reward aligns with human preference on held-out responses without running a separate human evaluation, which defeats the purpose of eliminating the RM.

Reference model dependence is total. Every update in DPO is relative to $\pi_{\text{ref}}$. If $\pi_{\text{ref}}$ has biases (for example, it assigns artificially low probability to certain valid response styles), DPO will learn to amplify those probability differences rather than correct them. The $\beta$ parameter can slow this amplification but cannot reverse it. In practice, DPO finds it harder to raise the probability of a response that $\pi_{\text{ref}}$ places near the edge of its support than to lower the probability of a response that $\pi_{\text{ref}}$ already favors, an asymmetry inherited from the KL-regularized framework.

DPO overoptimizes on out-of-distribution prompts (Dubois et al., 2024). Because DPO is offline, the policy never encounters its own generations during training. On prompts within the preference dataset's distribution, this works adequately. On OOD prompts, the policy can assign high probability to responses that $\pi_{\text{ref}}$ would never have produced, responses for which no preference data exists, and DPO's implicit reward provides no signal about their quality. This is the same distributional shift problem as offline RL (Week 11), and DPO inherits it without the KL penalty's online enforcement.

The proliferation of DPO variants (SimPO, IPO, ORPO, CPO) suggests that no single offline preference loss dominates all tasks. Each variant corrects a specific failure mode of the original DPO: SimPO handles length bias and removes the reference model, IPO enforces a stricter KL constraint, ORPO combines SFT and alignment into one stage. The fragmentation is itself evidence that the reward-model-free approach trades explicit reward interpretability and trainability for a loss function that is task-sensitive in ways not yet theoretically characterized.


DPO variants and limitations

Sequence-level credit assignment

DPO evaluates the log-probability of the entire sequence $y$:

$$\log \pi_\theta(y \mid x) = \sum_t \log \pi_\theta(y_t \mid x, y_{<t})$$

This means that a response containing 499 excellent tokens followed by one harmful token is treated the same way as a response that is consistently poor: the sequence is labeled as preferred or dispreferred as a whole. There is no mechanism to attribute the preference to specific tokens within the sequence. This sparse credit assignment is DPO's principal limitation relative to PPO: it cannot identify and correct specific failure modes within a response.
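To make the point concrete, the sketch below sums per-token log-probabilities for two hypothetical four-token responses: one uniformly mediocre, one mostly confident with a single very unlikely token. The token values are invented for illustration. Both collapse to the same sequence-level number, which is all a sequence-level loss ever sees.

```python
import math

def sequence_logprob(token_logprobs):
    """log pi(y|x) as the sum of per-token log-probabilities."""
    return sum(token_logprobs)

# Hypothetical per-token log-probabilities for two 4-token responses.
uniformly_mediocre = [-0.5, -0.5, -0.5, -0.5]
one_unlikely_token = [-0.1, -0.1, -0.1, -1.7]

total_a = sequence_logprob(uniformly_mediocre)
total_b = sequence_logprob(one_unlikely_token)
print(total_a, total_b)  # same total despite different token profiles

# The sum of log-probabilities is the log of the product of token probabilities.
product_a = math.prod(math.exp(lp) for lp in uniformly_mediocre)
```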

SimPO

Simple Preference Optimization (SimPO, Meng et al., 2024) modifies DPO by removing the reference model and normalizing log-probabilities by sequence length:

$$\mathcal{L}_{\text{SimPO}}(\theta) = -\mathbb{E}\!\left[ \log \sigma\!\left( \frac{\beta}{|y^w|} \log \pi_\theta(y^w \mid x) - \frac{\beta}{|y^l|} \log \pi_\theta(y^l \mid x) - \gamma \right) \right]$$

where $\gamma > 0$ is a margin. SimPO addresses two empirical issues with DPO. The first is the implicit reference model dependence: models trained with DPO exhibit unexpected behaviors when the reference model is distant or misaligned. The second is length bias: because the sequence log-probability is a sum over tokens, its magnitude grows with the number of tokens regardless of quality, so the implicit reward can be shifted simply by changing response length. SimPO's length normalization divides by sequence length, and the margin $\gamma$ introduces an explicit preference calibration. Removal of the reference model reduces memory overhead. SimPO achieves competitive results with DPO on benchmark preference datasets with 15–20% lower memory cost; however, it sacrifices the theoretical connection to the KL-regularized RLHF objective, making it harder to reason about how much the policy has shifted from the training distribution. In practice, SimPO works well when preference data is high-quality and homogeneous in task structure, but offers less robustness on diverse or distribution-shifted preferences.
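A minimal single-pair sketch of the SimPO loss in plain Python. The defaults `beta=2.0` and `gamma=0.5` are placeholder values chosen for illustration, not recommendations from the lecture, and the function names are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO loss for one pair: length-normalized log-probs, margin, no reference."""
    logit = beta * logp_w / len_w - beta * logp_l / len_l - gamma
    return -log_sigmoid(logit)

# Two responses with the same average log-probability per token (-1.0)
# but different lengths: the normalized terms cancel, leaving only the margin.
same_quality = simpo_loss(logp_w=-10.0, len_w=10, logp_l=-20.0, len_l=20)

# A preferred response with better per-token log-probability gets a lower loss.
better_winner = simpo_loss(logp_w=-5.0, len_w=10, logp_l=-20.0, len_l=20)
print(same_quality, better_winner)
```

With unnormalized sums, the 20-token response would look twice as unlikely as the 10-token one purely because of its length; dividing by length removes that effect.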


GRPO: eliminating the critic

Group Relative Policy Optimization (GRPO; Shao et al., 2024, introduced in DeepSeekMath) takes the opposite simplification from DPO: retain the RL loop and an explicit reward signal, but eliminate the learned value critic by replacing it with an empirical group baseline computed from multiple sampled completions. Shao et al. demonstrated that GRPO matches PPO performance on mathematical reasoning with significantly lower memory overhead (one critic network eliminated), enabling training of large reasoning models. The landmark result came with DeepSeek-R1 (2025): applied at scale with a rule-based verifier (correctness checking), GRPO elicited emergent chain-of-thought reasoning. Models learned to generate extensive self-correcting solution traces without explicit supervision of the reasoning process itself, only outcome-level verification. This finding reignited the debate about process versus outcome reward models: if outcome-only rewards can implicitly elicit step-level reasoning behaviors, do process reward models (Lightman et al., 2023) provide additional value, or are they an unnecessary intermediate representation?

The critic's role and cost

In PPO (Weeks 7–8), the advantage estimate $A_t = Q(s_t, a_t) - V(s_t)$ requires a value function $V_\phi$ that predicts the expected return from each state. For a language model, $V_\phi$ is typically another copy of the language model with an added scalar head. This doubles the number of active parameters during training and requires a separate optimization loop to keep $V_\phi$ accurate.

The group baseline

GRPO replaces the learned critic with an empirical baseline computed from a group of sampled completions. For each prompt $x$ in the training batch:

  1. Sample $G$ completions $\{y_1, \ldots, y_G\}$ from the current policy $\pi_\theta$.
  2. Score each with the reward model (or a verifier): $\{r_1, \ldots, r_G\}$.
  3. Compute the group-normalized advantage for each completion:

$$A_i = \frac{r_i - \mu_{\mathbf{r}}}{\sigma_{\mathbf{r}}}, \qquad \mu_{\mathbf{r}} = \frac{1}{G}\sum_{i=1}^G r_i, \quad \sigma_{\mathbf{r}} = \sqrt{\frac{1}{G}\sum_{i=1}^G (r_i - \mu_{\mathbf{r}})^2}$$

The standardized score $A_i$ is positive for completions that outperform the group average and negative for those that underperform. It is a relative, not absolute, measure of quality: a reward of 0.7 is a strong positive advantage if all other completions scored 0.3, but a negative advantage if others scored 0.9.
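The three steps above reduce to a few lines once the rewards are in hand. The sketch below implements the group-normalized advantage with the population standard deviation from the formula; the small constant `eps` added to the denominator is an assumption made here to avoid division by zero when all rewards are equal, and is not part of the formula as stated.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages A_i = (r_i - mean) / (std + eps).

    Uses the population standard deviation (divide by G), matching the
    formula in the text. `eps` is an added safeguard, not part of it.
    """
    G = len(rewards)
    mu = sum(rewards) / G
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / G)
    return [(r - mu) / (sigma + eps) for r in rewards]

# The same reward of 0.7 in two different groups:
among_weak = group_advantages([0.7, 0.3, 0.3, 0.3])
among_strong = group_advantages([0.7, 0.9, 0.9, 0.9])
print(among_weak[0], among_strong[0])  # positive in the first, negative in the second

# Identical rewards give zero advantage for every completion.
no_signal = group_advantages([1.0, 1.0, 1.0, 1.0])
```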

The GRPO objective

The policy update applies the PPO clipped surrogate objective using $A_i$ in place of the GAE advantage, adding a per-token KL penalty against the reference model:

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\!\left[ \sum_{t} \min\!\left( \rho_{i,t} A_i,\; \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\, A_i \right) - \beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) \right]$$

where $\rho_{i,t} = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) / \pi_{\text{old}}(y_{i,t} \mid x, y_{i,<t})$ is the per-token importance ratio. The KL penalty is computed token-by-token rather than at the sequence level, providing denser regularization.
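A plain-Python sketch of the clipped term and the resulting per-completion loss. It assumes the caller supplies per-token log-probabilities under the current and old policies plus a per-token KL estimate against the reference; how that KL estimate is obtained is left open here, and the defaults `eps_clip=0.2` and `beta=0.04` are illustrative.

```python
import math

def grpo_token_objective(logp_new, logp_old, advantage, eps_clip=0.2):
    """Clipped surrogate for one token: min(rho * A, clip(rho) * A)."""
    rho = math.exp(logp_new - logp_old)  # per-token importance ratio
    rho_clipped = max(1.0 - eps_clip, min(rho, 1.0 + eps_clip))
    return min(rho * advantage, rho_clipped * advantage)

def grpo_completion_loss(logps_new, logps_old, advantage, kl_per_token,
                         beta=0.04, eps_clip=0.2):
    """Negative objective for one completion, summed over its tokens.

    The same group advantage is applied at every token; `kl_per_token`
    holds caller-supplied per-token KL estimates against the reference.
    """
    total = 0.0
    for lp_new, lp_old, kl in zip(logps_new, logps_old, kl_per_token):
        total += grpo_token_objective(lp_new, lp_old, advantage, eps_clip) - beta * kl
    return -total

# Ratio 1.5 with a positive advantage is clipped at 1 + eps = 1.2.
pos_clipped = grpo_token_objective(math.log(1.5), 0.0, advantage=1.0)
# Ratio 0.5 with a negative advantage is clipped at 1 - eps = 0.8.
neg_clipped = grpo_token_objective(math.log(0.5), 0.0, advantage=-1.0)
print(pos_clipped, neg_clipped)
```

In both clipped cases the objective no longer depends on the ratio, so those tokens contribute no gradient; this is the mechanism that bounds the size of each policy update.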

Interactive: Understanding GRPO Advantage

The core of GRPO is the relative scoring within a group. Consider a batch with $G = 4$ completions for the same prompt:

| Completion | Reward $r_i$ | $A_i = \frac{r_i - \mu}{\sigma}$ |
|---|---|---|
| A | 0.9 | $\frac{0.9 - 0.55}{0.269} \approx +1.30$ |
| B | 0.7 | $\frac{0.7 - 0.55}{0.269} \approx +0.56$ |
| C | 0.4 | $\frac{0.4 - 0.55}{0.269} \approx -0.56$ |
| D | 0.2 | $\frac{0.2 - 0.55}{0.269} \approx -1.30$ |

$\mu = 0.55$, $\sigma \approx 0.269$ (population standard deviation, dividing by $G$). Completions A and B are reinforced; C and D are suppressed. Now suppose we change only completion D's reward from 0.2 to 0.8 while keeping the other three rewards fixed:

| Completion | Reward $r_i$ | $A_i$ |
|---|---|---|
| A | 0.9 | $\frac{0.9 - 0.70}{0.187} \approx +1.07$ |
| B | 0.7 | $\frac{0.7 - 0.70}{0.187} = 0$ |
| C | 0.4 | $\frac{0.4 - 0.70}{0.187} \approx -1.60$ |
| D | 0.8 | $\frac{0.8 - 0.70}{0.187} \approx +0.53$ |

Now $\mu = 0.70$ and $\sigma \approx 0.187$.

Key insight: Completion A's reward stayed at 0.9, but its advantage fell from +1.30 to +1.07 because the group mean rose from 0.55 to 0.70. Completion B went from a positive advantage to exactly zero, and completion C, whose reward also did not change, dropped from −0.56 to −1.60. The advantage is a relative, not absolute, measure: it depends on all completions in the group, not just the individual reward. This relative nature is why GRPO creates strong contrastive gradients for reasoning tasks: correct solutions look better when directly compared with incorrect solutions from the same prompt.

Why GRPO works for reasoning tasks

The group sampling structure makes GRPO particularly effective for tasks with verifiable correct answers: mathematics, formal proofs, code execution, and structured reasoning. In these settings:

  • The reward signal is binary or near-binary (correct / incorrect), making reward models unnecessary: a rule-based verifier suffices.
  • Multiple completions per prompt are natural: the model generates several candidate solutions, and the verifier checks each. In a group containing both outcomes, correct solutions have $A_i > 0$ and incorrect ones have $A_i < 0$.
  • The contrastive signal between correct and incorrect solutions in the same group is a strong learning signal: the model simultaneously reinforces correct reasoning paths and suppresses incorrect ones from the same prompt.

This is the structure that enabled DeepSeek-R1's reasoning capabilities: by training on mathematical problems with verifiable answers using GRPO, the model learned to generate extended, self-correcting chain-of-thought reasoning, a behavior that emerges from the reinforcement signal rather than being explicitly supervised.

⚠Critical Lens: GRPO Group Baseline Limitations

GRPO's group baseline is computationally attractive, since eliminating the critic removes one of the large networks from memory, but it shifts the failure modes from critic inaccuracy to baseline variance.

Group size $G$ controls the variance-budget tradeoff. The group baseline uses $\sigma_{\mathbf{r}}$ as the normalizer: when $\sigma_{\mathbf{r}}$ is small (all completions score similarly), the normalized advantage amplifies small reward differences into large gradient updates. With a small group such as $G = 4$, the variance of the baseline is high, and a single outlier completion (for example, a reward-model-hacked response that scores 0.98 when all others score 0.2) dominates the gradient for that prompt. The fix is larger $G$, but each additional completion costs a full generation pass through the policy. This creates a FLOPs multiplier: GRPO with $G = 8$ requires $8\times$ the generation compute of PPO with a single rollout per prompt. The critic's memory cost is replaced by a generation cost that scales linearly in $G$.

Correlated completions produce zero signal. If all $G$ completions suffer from the same failure mode (say, all hallucinate the same incorrect fact or all produce the same flawed reasoning step), then every reward is close to the group mean, so $r_i - \mu_{\mathbf{r}} \approx 0$ and $\sigma_{\mathbf{r}} \approx 0$. When the rewards are exactly equal the ratio is $0/0$; implementations typically add a small constant to the denominator, which gives $A_i = 0$ for every completion. The gradient vanishes, and the model receives no corrective signal. This is the correlated failure mode problem: the baseline cannot distinguish between "all completions are good" and "all completions share the same flaw," so GRPO is blind to systematic errors that affect every sample in a group.

The clipping-advantage interaction is uncharacterized. PPO's clipped objective was designed with GAE advantages in mind, where $A_t$ is a discounted sum of temporal-difference errors with bounded scale. In GRPO, $A_i$ is a z-score whose magnitude is bounded by $\sqrt{G-1}$ under the population standard deviation: it cannot exceed about 1.73 for $G = 4$, but it can exceed $\pm 2$ once $G \geq 6$ and reach $\pm 3$ at $G = 10$ when a single completion is an outlier. The clipping threshold $\epsilon$ (typically 0.2) interacts with the raw $A_i$ value: a token with a large positive $A_i = +3$ and a ratio $\rho_{i,t} > 1.2$ is clipped and contributes no gradient, discarding signal from the strongest learning example. There is no published study of the joint optimal setting of $(G, \epsilon)$ for GRPO; practitioners currently tune them independently, which is likely suboptimal.
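As a check on the scale of group z-scores, the sketch below computes the advantage of a single outlier in an otherwise uniform group, using the population standard deviation from the advantage formula. In this extreme case the z-score equals the square root of G minus one, which is also the largest value any group of that size can produce. The function name and the 1.0 versus 0.0 rewards are illustrative.

```python
import math

def outlier_z_score(G):
    """z-score of one completion with reward 1.0 when the other G-1 score 0.0."""
    rewards = [1.0] + [0.0] * (G - 1)
    mu = sum(rewards) / G
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / G)
    return (rewards[0] - mu) / sigma

for G in (4, 6, 10, 64):
    print(G, round(outlier_z_score(G), 3), round(math.sqrt(G - 1), 3))
```

Larger groups lower the variance of the baseline but also permit larger individual advantages, which is one reason the group size and the clipping threshold interact.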

Rule-based verifiers prevent reward hacking but restrict task scope. GRPO's success on mathematical reasoning (DeepSeek-R1) relies on binary verifiability: an answer is either correct or incorrect, with no ambiguity. For open-ended generation (creative writing, dialogue, summarization), no rule exists to verify quality. Using a learned RM with GRPO reintroduces the overoptimization problem that the group baseline was meant to mitigate — now the RM score can be gamed by completions that exploit proxy-reward gaps, and the group baseline amplifies the resulting advantage differences rather than correcting them.


Credit assignment: sequence-level vs token-level

The sequence-level vs token-level distinction is a fundamental axis along which alignment methods differ, and it has direct implications for what behaviors the model can and cannot learn.

Sequence-level (DPO)

The preference label is applied to the entire response. The gradient signal is a uniform rescaling of token log-probabilities throughout the response by the same factor. Every token in a preferred response is equally reinforced; every token in a dispreferred response is equally suppressed. This is appropriate when the preference is a global property of the response (style, tone, factual accuracy uniformly distributed) but breaks down when the preference is driven by a specific failure — a hallucinated fact in token 50, an unsafe statement in token 120 — embedded in an otherwise acceptable response.

Token-level (PPO, GRPO)

The advantage $A_t$ can vary across tokens within a single response. A critic-based advantage (PPO) differs from token to token. In GRPO with an outcome-level reward, the group advantage $A_i$ is shared by every token of completion $i$ and applied per token, so position-dependent signal enters only through the per-token importance ratio, clipping, and KL penalty. In practice, PPO's token-level credit assignment enables it to learn to avoid specific failure modes within responses in ways that DPO cannot, at the cost of requiring either a learned critic (PPO) or a group sampling overhead (GRPO).


Synthesis: choosing an alignment method

◆Diagram: Alignment Method Comparison

PPO (RLHF): four networks (policy, critic, reference, RM) with online rollouts and KL penalty; highest compute, most flexible. DPO: two networks (policy, reference), offline preference dataset, no RM; half the memory, but implicit reward is uninterpretable. GRPO: policy + verifier, $G$ online rollouts per prompt, group-normalized advantages; critic-free but generation cost scales with $G$. All three share the KL-regularized objective; they differ in how the advantage signal is estimated.

| | PPO (RLHF) | DPO | GRPO |
|---|---|---|---|
| Reward model needed | Yes | No (implicit) | Yes / rule-based |
| Value critic needed | Yes | No | No |
| Networks in training | 4 | 2 | 2–3 |
| Credit assignment | Token-level | Sequence-level | Token-level |
| Stability | Moderate | High | Moderate |
| Best setting | General alignment | Low-compute alignment | Verifiable reasoning |


Exercise: Comparing DPO and GRPO

  1. Analytical: Derive the SimPO loss from the DPO loss by dropping the reference model, applying sequence-length normalization, and adding the margin $\gamma$.
  2. Implementation: Given the GRPO advantage formula $A_i = (r_i - \mu)/\sigma$ for a group of $G$ completions, compute the advantages for the following group: $\{r\} = [0.95, 0.88, 0.45, 0.42, 0.15]$. Now add a KL penalty term: modify the advantage to $A_i^{\text{KL}} = (r_i - \mu)/\sigma - \beta \cdot \text{KL}_i$ where $\text{KL}_i$ is the per-completion KL divergence from the reference model. If $\beta = 0.1$ and the KL values are $[0.05, 0.12, 0.03, 0.02, 0.08]$, recompute the advantages. Which completions change sign, and why?
  3. Reasoning: Explain why GRPO is inherently more robust to reward model hallucination than PPO when using a rule-based verifier.

The practical choice among these methods depends primarily on two factors: compute budget and task structure. DPO is the default choice when the preference dataset is high-quality and the alignment target is a global property of responses (helpfulness, harmlessness). GRPO is preferred when the task has verifiable rewards and the goal is to train reasoning capabilities that require credit assignment at the reasoning-step level. PPO remains the most principled option when the reward model is reliable, the task requires fine-grained credit assignment, and compute is available.


◆Open Problems

When to use online versus offline preference optimization. The field lacks a principled criterion for choosing between offline methods (DPO, SimPO) and online methods (PPO-RLHF, GRPO). DPO is computationally cheaper but suffers from overoptimization on distribution shift. PPO is more robust to preference variation but requires four networks and live RL interaction. Recent work suggests the choice depends on preference data quality and task distribution diversity, but these properties are hard to characterize a priori. What metrics predict which method will perform better?

The proliferation of DPO variants and algorithmic convergence. Since Rafailov et al., the field has proposed SimPO, IPO, ORPO, CPO, and other variants, each motivated by specific empirical failures. Do these variants represent genuine algorithmic progress, or are they examples of overfitting to benchmark suites? More fundamentally, is there a unified theory of offline preference optimization, or is DPO a specific point in a high-dimensional loss landscape where different tasks require different points?

Synthetic preference data at scale. The emerging paradigm is to use AI feedback (language model-generated preferences) rather than human annotations to scale preference data. How does alignment trained on synthetic preferences compare to human-annotated preferences? Early results are mixed: AI feedback can match human-annotated performance on factual tasks but degrades on subjective preference (style, tone, creativity). The deeper question: if the reference model generates preferences, and we train the policy to match them, are we learning human alignment or just reproducing the reference model's biases?

Preference definition for open-ended tasks. Weeks 12–13 assume preference data on well-defined tasks (instruction following, helpfulness, safety). But for creative writing, scientific hypothesis generation, or autonomous research agents, what does "preference" even mean? Preferences are often intransitive, context-dependent, and evolving. How do you specify or learn reward signals for tasks where the goal itself is under-defined?


Key takeaways

DPO rederives the RLHF alignment problem by inverting the closed-form KL-regularized optimal policy to express the reward as a scaled log-ratio of policy to reference probabilities. Substituting this into the Bradley-Terry preference model yields a binary cross-entropy loss on preference pairs that requires no reward model, no RL loop, and only two networks. The $\beta$ parameter plays the role of the KL penalty in RLHF. DPO's limitation is sequence-level credit assignment: it cannot attribute preference outcomes to specific tokens within a response. GRPO retains the RL loop and explicit reward signal but replaces the learned value critic with a group-normalized empirical baseline, eliminating the critic network. The contrastive structure of group sampling makes GRPO particularly effective for reasoning tasks with verifiable answers. SimPO extends DPO by removing the reference model and adding length normalization. The sequence-level vs token-level credit assignment axis is the principal theoretical distinction among these methods, with practical implications for which failure modes each can address.


Conceptual questions

  1. Derive the DPO loss function from scratch: start from the KL-regularized RLHF objective, write down its closed-form optimal policy, invert to express $r(x,y)$ in terms of the policy ratio, and substitute into the Bradley-Terry preference model. At what step do the intractable partition functions $Z(x)$ cancel, and why is this cancellation exact? What property of the Bradley-Terry model enables it?

  2. The DPO gradient scales each preference pair update by a factor $\hat{\sigma} = 1 - \sigma(\Delta)$, where $\Delta$ is the log-ratio difference. For a pair where the model already correctly predicts the preference with high confidence ($\Delta \gg 0$), $\hat{\sigma} \approx 0$. Explain why this weighting is beneficial (analogous to hard example mining) and identify a failure mode it could cause if the training dataset contains mislabeled preference pairs.

  3. GRPO samples $G$ completions per prompt and standardizes their rewards to compute advantages. If $G = 1$, what does the advantage equal, and what does the GRPO gradient become? If all $G$ completions receive identical rewards (e.g., all correct or all incorrect), what happens to the gradient? What does this imply about the minimum dataset and sampling conditions required for GRPO to produce a useful signal?

  4. A model trained with DPO on a preference dataset where the preferred response is always longer than the dispreferred response will develop a length bias. Explain mechanistically why this happens using the sequence-level log-probability sum, and show how SimPO's length normalization resolves it. Does PPO-RLHF suffer from the same bias? Why or why not?

  5. DeepSeek-R1 uses GRPO with a rule-based verifier (checking mathematical correctness) rather than a learned reward model. Compare this to the standard RLHF reward model approach from Week 12. Under what conditions does a rule-based verifier provide a more reliable training signal than a learned RM, and what category of tasks is structurally excluded from rule-based verification? For a task that cannot use rule-based verification, describe how you would design the reward signal and why GRPO may still be preferable to PPO.

✦Solutions
  1. Start from $\max_\pi \mathbb{E}[r(x,y)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$; its closed-form optimum is $\pi^*(y\mid x) = \frac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\exp(r(x,y)/\beta)$. Inverting gives $r(x,y) = \beta\log\frac{\pi^*(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\log Z(x)$; substituting into Bradley-Terry $\sigma(r(x,y_w) - r(x,y_l))$ cancels $\beta\log Z(x)$ because both completions share the prompt $x$ and the model depends only on the reward difference. The cancellation is exact because Bradley-Terry is invariant to adding any function of $x$ alone to the reward.
  2. For confidently correct pairs ($\Delta \gg 0$), $\hat\sigma = 1 - \sigma(\Delta) \approx 0$, so updates concentrate on pairs the model gets wrong or is unsure about, which acts as hard-example mining. The failure mode: a mislabeled pair looks high-loss, receives a large weight, and aggressively drags the model toward the wrong label, amplifying label noise.
  3. With $G = 1$ the advantage is the reward minus its own mean, which is $0$, so the policy-gradient term vanishes (the group standard deviation is also $0$, so the ratio is only defined if a small constant is added to the denominator). Likewise, if all $G$ completions get identical rewards, the standardized advantages are all $0$ and there is no gradient. So GRPO needs $G \ge 2$ and within-group reward variance: prompts where every sample succeeds or every sample fails contribute no signal.
  4. DPO's implicit reward is a sum of per-token log-probability terms over the whole sequence, so its magnitude grows with the number of tokens. When the preferred response is always longer, the objective can be reduced partly by favoring length itself, which produces a length bias. SimPO normalizes by length (average per-token log-prob), removing the scaling. PPO-RLHF avoids this at the objective level (its reward is a sequence-level scalar, not a log-prob sum) but can still inherit length bias if the reward model itself prefers longer outputs.
  5. A rule-based verifier is more reliable when correctness is programmatically checkable (math, code, formal tasks): no learned-RM exploitation or drift. It structurally excludes open-ended/subjective tasks (creative writing, helpfulness, tone) with no programmatic ground truth. For those, use a learned reward model or LLM-as-judge from preference data; GRPO can still beat PPO because it drops the separate value network (lower memory/compute) and uses a group-relative baseline that scales conveniently to LLM training.
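The degenerate cases in solution 3 can be checked numerically. The sketch below standardizes rewards within one group using the Python standard library. The function name `group_advantages`, the choice of population standard deviation, and the stabilizing constant `eps` are assumptions made for illustration, and actual implementations may differ on each of these points.

```python
import statistics


def group_advantages(rewards, eps: float = 1e-8):
    """Standardize rewards within one prompt's group of sampled completions.

    Assumes the population standard deviation, plus a small eps in the
    denominator so that zero-variance groups yield 0 instead of 0/0.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# G = 1: the single reward equals the group mean, so the advantage is 0.
single = group_advantages([1.0])

# Every completion correct (or every one wrong): all advantages are 0.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

# Mixed outcomes from a binary verifier: correct samples receive positive
# advantages, incorrect samples negative ones, and they sum to roughly 0.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Only the mixed group yields non-zero advantages, which is the within-group variance condition stated in the solution.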

Looking ahead

The final lecture brings the course full circle by asking how the aligned models, planning algorithms, and RL foundations developed across all thirteen weeks combine into deployed agentic systems.

Week 14: Agentic Systems. We examine how tool-using LLMs instantiate the MDP formalism from Week 1, how hierarchical and compositional task structures map onto RL sub-problems developed throughout the course, and what the remaining open problems in agentic AI are from a reinforcement learning perspective.


Further reading

  • Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS. (DPO).
  • Shao, Z., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv. (Introduced GRPO).
  • Meng, Y., et al. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward. arXiv. (SimPO).