Week 13: Direct Preference Optimization and GRPO

Purpose of this lecture#

The RLHF pipeline from Week 12 works: it produces well-aligned language models that score highly on human preference evaluations. It also carries a significant overhead. PPO-based RLHF requires four large neural networks loaded simultaneously: the active policy, a frozen reference model, a reward model, and a value critic. For a 70B parameter model in BF16 (2 bytes per parameter), with all four networks at the same scale, the weights alone occupy roughly 4 × 140 GB = 560 GB of GPU memory, before accounting for optimizer state and activations. The result is that full-scale RLHF is accessible to only a small number of organizations, and iteration cycles are slow even for those who can afford it.
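The memory figure above is simple arithmetic, and a few lines make the assumptions explicit. This is a rough sketch only: the helper name `weight_memory_gb` is hypothetical, it assumes all four networks have the same parameter count, and it counts weights only (no gradients, optimizer state, or activations).

```python
# Back-of-the-envelope weight memory for a PPO-RLHF setup.
# Assumption: every network has the same parameter count as the policy.

BYTES_PER_PARAM_BF16 = 2  # BF16 stores each parameter in 16 bits


def weight_memory_gb(n_params: float, n_networks: int,
                     bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes) for n_networks equal-size models."""
    return n_params * bytes_per_param * n_networks / 1e9


per_model_gb = weight_memory_gb(70e9, n_networks=1)
ppo_gb = weight_memory_gb(70e9, n_networks=4)  # policy, reference, reward model, critic

print(f"one 70B model in BF16: {per_model_gb:.0f} GB")
print(f"four networks (PPO):   {ppo_gb:.0f} GB")
```

Changing `n_networks` shows how the weight footprint moves as networks are removed from the training setup.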

This lecture develops two modern approaches that reduce this overhead without abandoning the theoretical grounding of the RLHF framework. Direct Preference Optimization (DPO) rederives the RLHF objective to show that the reward model is implicit in the policy ratio and can be eliminated entirely: the alignment problem reduces to binary classification on preference pairs. Group Relative Policy Optimization (GRPO) takes the opposite simplification: keep the RL loop but replace the learned value critic with an empirical group baseline, eliminating the critic network while preserving token-level credit assignment.

Both methods trace directly to the closed-form KL-regularized optimal policy derived in Week 12. Understanding the derivations — not just the final loss functions — is what separates principled use of these methods from treating them as black-box recipes.

DPO: the derivation and landmark result#

Direct Preference Optimization (Rafailov et al., 2023) demonstrated a surprising insight: the RLHF alignment objective has a closed-form optimal policy, and substituting that policy into the Bradley-Terry preference model yields a loss function that depends only on policy ratios, never on an explicit reward model. The paper's central contribution was empirical: DPO-trained models reached InstructGPT-level instruction-following quality with a single epoch of offline training, no reward model fitting, and no RL rollouts — roughly 10x less compute than PPO-RLHF. The community's initial reception was enthusiastic: DPO's simplicity, stability, and efficiency made RLHF-scale alignment accessible to academic labs. However, subsequent work exposed limitations: Dubois et al. (2024) showed DPO underperforms PPO on mathematical and code reasoning tasks where step-level verification is important. More recently, studies of DPO's implicit reward model revealed overoptimization pathologies — the policy can diverge sharply from the reference model on out-of-distribution prompts, and the implicit reward is often uninterpretable. The field has responded with a proliferation of DPO variants (SimPO, IPO, ORPO, CPO), each addressing a specific failure mode, suggesting that no single offline preference optimization formulation dominates all tasks.

Starting point#

Recall from Week 12 that the KL-regularized RLHF objective:

\max_\pi \mathbb{E}_{y \sim \pi}\!\left[r(x,y)\right] - \beta\, D_{\text{KL}}(\pi \| \pi_{\text{ref}})

has the closed-form optimal policy:

\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\cdot \exp\!\left(\frac{r(x,y)}{\beta}\right)

where $Z(x) = \sum_y \pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta)$ is the intractable partition function.
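On a toy problem where the response set is small enough to enumerate, the closed form can be checked numerically. The sketch below assumes a single prompt with three candidate responses and made-up values for $\pi_{\text{ref}}$, $r$, and $\beta$; it verifies that the closed-form policy scores higher on the KL-regularized objective than a few alternatives, and that the optimal value equals $\beta \log Z(x)$.

```python
import math


def optimal_policy(pi_ref, rewards, beta):
    """Closed-form optimum: pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # partition function; tractable only because the toy set is tiny
    return [w / z for w in weights], z


def objective(pi, pi_ref, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl


# Illustrative numbers only (not from any dataset).
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)

# Alternatives: stay at the reference, or put all mass on the best response.
stay_at_ref = objective(pi_ref, pi_ref, rewards, beta)
greedy = objective([0.0, 1.0, 0.0], pi_ref, rewards, beta)

print("pi* =", [round(p, 4) for p in pi_star])
print("objective at pi*:", round(best, 4), " beta*log Z:", round(beta * math.log(z), 4))
print("objective at pi_ref:", round(stay_at_ref, 4), " greedy:", round(greedy, 4))
```

For a language model the sum defining $Z(x)$ ranges over all token sequences, which is why the same computation is intractable at scale.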

Inverting for the implicit reward#

DPO's key move is to rearrange this expression to express $r(x,y)$ in terms of the policy ratio, rather than expressing $\pi^*$ in terms of $r$ :

r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)

The term $\beta \log Z(x)$ depends only on $x$ , not on $y$ . Since the Bradley-Terry preference model from Week 12 involves the difference of rewards:

P(y^w \succ y^l \mid x) = \sigma\!\left(r(x, y^w) - r(x, y^l)\right)

the $\beta \log Z(x)$ terms cancel exactly when the reward difference is taken:

r(x, y^w) - r(x, y^l) = \beta \log \frac{\pi^*(y^w \mid x)}{\pi_{\text{ref}}(y^w \mid x)} - \beta \log \frac{\pi^*(y^l \mid x)}{\pi_{\text{ref}}(y^l \mid x)}
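The cancellation can be confirmed on the same kind of toy example. The sketch below uses made-up values for $\pi_{\text{ref}}$, $r$, and $\beta$ (assumptions for illustration): each individual scaled log-ratio is off from the true reward by $\beta \log Z(x)$, but the difference between two responses recovers the reward difference exactly.

```python
import math

# Illustrative single-prompt setup with three candidate responses.
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]

# Closed-form optimal policy from the KL-regularized objective.
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]


def scaled_log_ratio(i: int) -> float:
    """beta * log(pi*(y_i) / pi_ref(y_i)), the reward up to the beta*log Z offset."""
    return beta * math.log(pi_star[i] / pi_ref[i])


w, l = 1, 2  # indices of the preferred and dispreferred responses
true_diff = rewards[w] - rewards[l]
implicit_diff = scaled_log_ratio(w) - scaled_log_ratio(l)
offset = rewards[w] - scaled_log_ratio(w)  # should equal beta * log Z

print("true reward difference:    ", true_diff)
print("implicit reward difference:", round(implicit_diff, 10))
print("per-response offset:", round(offset, 6), " beta*log Z:", round(beta * math.log(z), 6))
```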

The DPO loss#

Substituting into the Bradley-Terry log-likelihood, and replacing the optimal policy $\pi^*$ with the parameterized policy $\pi_\theta$ we are training:

\boxed{ \mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,y^w,y^l) \sim \mathcal{D}}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y^w \mid x)}{\pi_{\text{ref}}(y^w \mid x)} - \beta \log \frac{\pi_\theta(y^l \mid x)}{\pi_{\text{ref}}(y^l \mid x)} \right) \right] }

This is a binary cross-entropy loss where the "logit" is the difference in log-probability ratios (policy log-probability minus reference log-probability) between the preferred and dispreferred responses, scaled by $\beta$ . No reward model appears. The reward model has been algebraically eliminated — it is implicit in the ratio $\pi_\theta / \pi_{\text{ref}}$ .
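A minimal sketch of the per-pair loss, written against plain floats so the structure is visible. It assumes the four inputs are summed sequence log-probabilities that have already been computed by the policy and the frozen reference; the function names and the example numbers are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    log_ratio_w = policy_logp_w - ref_logp_w  # log pi_theta(y^w|x) - log pi_ref(y^w|x)
    log_ratio_l = policy_logp_l - ref_logp_l  # same for the dispreferred response
    margin = beta * (log_ratio_w - log_ratio_l)
    return -log_sigmoid(margin)


# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2), the loss of a coin-flip classifier.
loss_at_init = dpo_loss(-14.0, -13.0, -14.0, -13.0)

# After the policy has shifted mass toward y^w and away from y^l (made-up numbers).
loss_after_shift = dpo_loss(-12.0, -15.0, -14.0, -13.0)

print(f"loss at init:        {loss_at_init:.4f}")
print(f"loss after a shift:  {loss_after_shift:.4f}")
```

In a real training loop the same expression would be evaluated on batched tensors, with gradients flowing only through the policy log-probabilities.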

What DPO is actually optimizing#

Taking the gradient of $\mathcal{L}_{\text{DPO}}$ with respect to $\theta$ and examining the update direction reveals the mechanism:

-\nabla_\theta \mathcal{L}_{\text{DPO}} \propto \beta\!\left[ \underbrace{\nabla_\theta \log \pi_\theta(y^w|x)}_{\text{increase } \pi_\theta(y^w)} - \underbrace{\nabla_\theta \log \pi_\theta(y^l|x)}_{\text{decrease } \pi_\theta(y^l)} \right] \cdot \hat{\sigma}

where $\hat{\sigma} = 1 - \sigma(\cdots)$ is a weighting factor that is large when the model currently predicts the preference pair incorrectly (low confidence gap) and small when the model already predicts it correctly (high confidence gap). DPO therefore applies larger gradients to pairs where the model is currently confused, concentrating learning effort where it is most needed — analogous to hard example mining in metric learning.
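The behaviour of the weighting factor is easy to tabulate. In this sketch, `margin` stands for the argument of the sigmoid in the DPO loss, that is $\beta$ times the difference in log-ratios; the function name and the chosen margins are illustrative assumptions.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dpo_gradient_weight(margin: float) -> float:
    """sigma_hat = 1 - sigmoid(margin); margin = beta * (log-ratio_w - log-ratio_l)."""
    return 1.0 - sigmoid(margin)


# Negative margin: the model currently ranks the pair the wrong way round.
# Positive margin: the model already ranks it correctly.
for margin in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"margin = {margin:+.1f}   weight = {dpo_gradient_weight(margin):.3f}")
```

The weight decreases monotonically in the margin, so pairs the model already separates confidently contribute almost nothing to the update.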

The $\beta$ parameter controls the KL penalty strength: large $\beta$ prevents $\pi_\theta$ from deviating much from $\pi_{\text{ref}}$ (conservative, low risk of mode collapse); small $\beta$ allows large policy shifts (aggressive, risk of forgetting reference model's qualities).

Practical implications#

DPO reduces the RLHF training setup from four networks to two: the trainable policy $\pi_\theta$ and the frozen reference $\pi_{\text{ref}}$ . Training is stable supervised learning — no reward signal variance, no PPO clipping, no advantage estimation. The preference dataset $\mathcal{D}$ can be collected once offline and reused, making DPO a fully offline alignment method in the sense of Week 11. A single epoch of DPO on a well-curated preference dataset can match the alignment quality of PPO-RLHF with a fraction of the compute.

Critical Lens: DPO's Implicit Reward and Overoptimization

DPO's elimination of the reward model is mathematically elegant, but introduces failure modes that are invisible from the loss function alone.

The implicit reward is uninterpretable. In PPO-RLHF, the explicit reward model $r_\phi(x, y)$ can be inspected: we can score any response and ask whether the RM assigns higher values to responses we consider better. In DPO, the "reward" is $\beta \log \pi_\theta / \pi_{\text{ref}}$ — a policy ratio that exists only for responses the model can actually generate. There is no way to evaluate whether the implicit reward aligns with human preference on held-out responses without running a separate human evaluation, which defeats the purpose of eliminating the RM.

Reference model dependence is total. Every update in DPO is relative to $\pi_{\text{ref}}$. If $\pi_{\text{ref}}$ has biases (for example, it assigns artificially low probability to certain valid response styles), DPO will learn to amplify those probability differences rather than correct them. The $\beta$ parameter can slow this amplification but cannot reverse it. This means DPO is less able to raise the probability of a response that lies outside $\pi_{\text{ref}}$'s support than to lower the probability of a response that $\pi_{\text{ref}}$ already favors, a fundamental asymmetry inherited from the KL-regularized framework.

DPO overoptimizes on out-of-distribution prompts (Dubois et al., 2024). Because DPO is offline, the policy never encounters its own generations during training. On prompts within the preference dataset's distribution, this works adequately. On OOD prompts, the policy can assign high probability to responses that the $\pi_{\text{ref}}$ model would never have produced — responses for which no preference data exists — and DPO's implicit reward provides no signal about their quality. This is the same distributional shift problem as offline RL (Week 11), and DPO inherits it without the KL penalty's online enforcement.

The proliferation of DPO variants (SimPO, IPO, ORPO, CPO) suggests that no single offline preference loss dominates all tasks. Each variant corrects a specific failure mode of the original DPO: SimPO handles length bias and removes the reference model, IPO enforces a stricter KL constraint, ORPO combines SFT and alignment into one stage. The fragmentation is itself evidence that the reward-model-free approach trades explicit reward interpretability and trainability for a loss function that is task-sensitive in ways not yet theoretically characterized.

DPO variants and limitations#

Sequence-level credit assignment#

DPO evaluates the log-probability of the entire sequence $y$ :

\log \pi_\theta(y \mid x) = \sum_t \log \pi_\theta(y_t \mid x, y_{<t})

This means a response that contains 499 excellent tokens followed by one harmful token is treated the same way as a response that is poor throughout: each sequence is labeled as preferred or dispreferred as a whole. There is no mechanism to attribute the preference to specific tokens within the sequence. This coarse credit assignment is DPO's principal limitation relative to PPO: it cannot identify and correct specific failure modes within a response.

SimPO#

Simple Preference Optimization (SimPO, Meng et al., 2024) modifies DPO by removing the reference model and normalizing log-probabilities by sequence length:

\mathcal{L}_{\text{SimPO}}(\theta) = -\mathbb{E}\!\left[ \log \sigma\!\left( \frac{\beta}{|y^w|} \log \pi_\theta(y^w|x) - \frac{\beta}{|y^l|} \log \pi_\theta(y^l|x) - \gamma \right) \right]

where $\gamma > 0$ is a margin. SimPO addresses two empirical issues with DPO. The first is reference model dependence: models trained with DPO exhibit unexpected behaviors when the reference model is distant or misaligned. The second is length bias: a summed sequence log-probability grows in magnitude with the number of tokens, so the margin between two responses depends on their lengths as well as on their quality. SimPO's length normalization divides by sequence length, and the margin $\gamma$ introduces an explicit preference calibration. Removal of the reference model reduces memory overhead. SimPO achieves competitive results with DPO on benchmark preference datasets with 15–20% lower memory cost; however, it sacrifices the theoretical connection to the KL-regularized RLHF objective, making it harder to reason about how much the policy has shifted from the training distribution. In practice, SimPO works well when preference data is high-quality and homogeneous in task structure, but offers less robustness on diverse or distribution-shifted preferences.
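A minimal sketch of the SimPO loss for one pair, following the formula above. The default values of `beta` and `gamma` and the example log-probabilities are placeholders chosen for illustration, not recommended settings. The example uses two responses with the same average per-token log-probability but different lengths, to show what the normalization removes.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(policy_logp_w: float, len_w: int,
               policy_logp_l: float, len_l: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO loss for one pair: length-normalized log-probs, margin gamma, no reference."""
    margin = beta * policy_logp_w / len_w - beta * policy_logp_l / len_l - gamma
    return -log_sigmoid(margin)


# Both responses average -2.0 nats per token; the dispreferred one is twice as long.
logp_w, len_w = -20.0, 10
logp_l, len_l = -40.0, 20

summed_gap = logp_w - logp_l                      # driven entirely by length here
normalized_gap = logp_w / len_w - logp_l / len_l  # zero after normalization
loss = simpo_loss(logp_w, len_w, logp_l, len_l)

print("gap in summed log-probs:      ", summed_gap)
print("gap in per-token log-probs:   ", normalized_gap)
print("SimPO loss (margin = -gamma): ", round(loss, 4))
```

Because the normalized gap is zero, the loss reduces to $-\log\sigma(-\gamma)$: the pair is not yet separated by the required margin, regardless of length.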

GRPO: eliminating the critic#

Group Relative Policy Optimization (GRPO; Shao et al., 2024, introduced in DeepSeekMath) takes the opposite simplification from DPO: retain the RL loop and an explicit reward signal, but eliminate the learned value critic by replacing it with an empirical group baseline computed from multiple sampled completions. Shao et al. demonstrated that GRPO matches PPO performance on mathematical reasoning with significantly lower memory overhead (one critic network eliminated), enabling training of large reasoning models. The landmark result came with DeepSeek-R1 (2025): applied at scale with a rule-based verifier (correctness checking), GRPO elicited emergent chain-of-thought reasoning — models learned to generate extensive self-correcting solution traces without explicit supervision of the reasoning process itself, only outcome-level verification. This finding reignited the debate about process versus outcome reward models: if outcome-only rewards can implicitly elicit step-level reasoning behaviors, do process reward models (Lightman et al., 2023) provide additional value, or are they an unnecessary intermediate representation?

The critic's role and cost#

In PPO (Week 7–8), the advantage estimate $A_t = Q(s_t, a_t) - V(s_t)$ requires a value function $V_\phi$ that predicts the expected return from each state. For a language model, $V_\phi$ is typically another copy of the language model with an added scalar head. This doubles the number of active parameters during training and requires a separate optimization loop to keep $V_\phi$ accurate.

The group baseline#

GRPO replaces the learned critic with an empirical baseline computed from a group of sampled completions. For each prompt $x$ in the training batch:

1. Sample $G$ completions $\{y_1, \ldots, y_G\}$ from the current policy $\pi_\theta$.
2. Score each with the reward model (or a verifier): $\{r_1, \ldots, r_G\}$.
3. Compute the group-normalized advantage for each completion:

A_i = \frac{r_i - \mu_{\mathbf{r}}}{\sigma_{\mathbf{r}}}, \qquad \mu_{\mathbf{r}} = \frac{1}{G}\sum_{i=1}^G r_i, \quad \sigma_{\mathbf{r}} = \sqrt{\frac{1}{G}\sum_{i=1}^G (r_i - \mu_{\mathbf{r}})^2}

The standardized score $A_i$ is positive for completions that outperform the group average and negative for those that underperform. It is a relative, not absolute measure of quality: a reward of 0.7 is a strong positive advantage if all other completions scored 0.3, but a negative advantage if others scored 0.9.
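The group-normalized advantage can be sketched in a few lines. The function below follows the formula above (population standard deviation, dividing by $G$); the small `eps` added to the denominator is an assumption made here to avoid division by zero when all rewards in a group are equal, and is not part of the formula in the text.

```python
import math


def group_advantages(rewards, eps: float = 1e-8):
    """Standardize rewards within one group: A_i = (r_i - mean) / (std + eps)."""
    g = len(rewards)
    mu = sum(rewards) / g
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / g)  # population std
    return [(r - mu) / (sigma + eps) for r in rewards]


# The same reward of 0.7 in two different groups (first entry in each list).
among_weaker = group_advantages([0.7, 0.3, 0.3, 0.3])
among_stronger = group_advantages([0.7, 0.9, 0.9, 0.9])

print("0.7 among 0.3s:", [round(a, 3) for a in among_weaker])
print("0.7 among 0.9s:", [round(a, 3) for a in among_stronger])

# A group with identical rewards carries no signal.
print("all equal:     ", group_advantages([1.0, 1.0, 1.0, 1.0]))
```

The first entry is positive in the first group and negative in the second, even though the raw reward is identical.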

The GRPO objective#

The policy update applies the PPO clipped surrogate objective using $A_i$ in place of the GAE advantage, adding a per-token KL penalty against the reference model:

\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\!\left[ \sum_{t} \min\!\left( \rho_{i,t} A_i,\; \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) A_i \right) - \beta\, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right]

where $\rho_{i,t} = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) / \pi_{\text{old}}(y_{i,t} \mid x, y_{i,<t})$ is the per-token importance ratio. The KL penalty is computed token-by-token rather than at the sequence level, providing denser regularization.
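The clipped term is the part of this objective that is easiest to misread, so here is a small sketch of it in isolation. It computes the surrogate to be maximized (the loss is its negative) for one completion from per-token log-probabilities, and it deliberately omits the KL penalty. The function names and example numbers are hypothetical.

```python
import math


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped term for one token: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def completion_surrogate(new_logps, old_logps, advantage: float, eps: float = 0.2) -> float:
    """Sum the clipped term over tokens; the same advantage A_i is used at every token."""
    total = 0.0
    for new_lp, old_lp in zip(new_logps, old_logps):
        ratio = math.exp(new_lp - old_lp)  # rho_{i,t} = pi_theta / pi_old for this token
        total += clipped_surrogate(ratio, advantage, eps)
    return total


# Positive advantage: gains are capped once the ratio exceeds 1 + eps.
print(clipped_surrogate(1.5, +1.0))   # capped at (1 + eps) * A
print(clipped_surrogate(0.5, +1.0))   # not capped: rho * A is the smaller term

# Negative advantage: the cap applies once the ratio drops below 1 - eps.
print(clipped_surrogate(0.5, -1.0))   # capped at (1 - eps) * A
print(clipped_surrogate(1.5, -1.0))   # not capped

# With pi_theta == pi_old every ratio is 1, so the sum is (number of tokens) * A_i.
print(completion_surrogate([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5], advantage=0.8))
```

Once a token's term is capped it no longer depends on $\rho_{i,t}$, so that token contributes no gradient until the ratio moves back inside the clipping range.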

Interactive: Understanding GRPO Advantage#

The core of GRPO is the relative scoring within a group. Consider a batch with $G = 4$ completions for the same prompt:

| Completion | Reward $r_i$ | $A_i = \frac{r_i - \mu}{\sigma}$ |
|---|---|---|
| A | 0.9 | $\frac{0.9 - 0.55}{0.269} \approx +1.30$ |
| B | 0.7 | $\frac{0.7 - 0.55}{0.269} \approx +0.56$ |
| C | 0.4 | $\frac{0.4 - 0.55}{0.269} \approx -0.56$ |
| D | 0.2 | $\frac{0.2 - 0.55}{0.269} \approx -1.30$ |

$\mu = 0.55$, $\sigma = \sqrt{0.0725} \approx 0.269$ (population standard deviation, as defined above). Completions A and B are reinforced; C and D are suppressed. Now suppose we change only completion D's reward from 0.2 to 0.8 while keeping A's reward at 0.9:

| Completion | Reward $r_i$ | $A_i$ |
|---|---|---|
| A | 0.9 | $\frac{0.9 - 0.70}{0.187} \approx +1.07$ |
| B | 0.7 | $\frac{0.7 - 0.70}{0.187} = 0$ |
| C | 0.4 | $\frac{0.4 - 0.70}{0.187} \approx -1.60$ |
| D | 0.8 | $\frac{0.8 - 0.70}{0.187} \approx +0.53$ |

Here $\mu = 0.70$ and $\sigma = \sqrt{0.035} \approx 0.187$. Key insight: Completion A's reward stayed at 0.9, but its advantage fell from +1.30 to +1.07 because the group mean rose from 0.55 to 0.70, so A stands out less once D also scores well. Completion C's reward was also unchanged, yet its advantage moved from -0.56 to -1.60 because it is now the only low-scoring completion in a tighter group. The advantage is a relative, not absolute, measure: it depends on all completions in the group, not just the individual reward. This relative nature is why GRPO creates strong contrastive gradients for reasoning tasks: correct solutions look better when directly compared with incorrect solutions from the same prompt.

Why GRPO works for reasoning tasks#

The group sampling structure makes GRPO particularly effective for tasks with verifiable correct answers: mathematics, formal proofs, code execution, and structured reasoning. In these settings:

The reward signal is binary or near-binary (correct / incorrect), making reward models unnecessary — a rule-based verifier suffices.
Multiple completions per prompt are natural: the model generates several candidate solutions, and the verifier checks each. Correct solutions have $A_i > 0$ ; incorrect ones have $A_i < 0$ .
The contrastive signal between correct and incorrect solutions in the same group is a strong learning signal: the model simultaneously reinforces correct reasoning paths and suppresses incorrect ones from the same prompt.

This is the structure that enabled DeepSeek-R1's reasoning capabilities: by training on mathematical problems with verifiable answers using GRPO, the model learned to generate extended, self-correcting chain-of-thought reasoning — a behavior that emerges from the reinforcement signal rather than being explicitly supervised.
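The verifier-plus-group structure described in this section can be sketched end to end for a single prompt. Everything concrete here is a stand-in: `verify` is a hypothetical exact-match checker, the completions are hard-coded strings rather than model samples, and the `eps` in the denominator is an added stabilizer.

```python
import math


def verify(answer: str, reference: str) -> float:
    """Hypothetical rule-based verifier: 1.0 on exact match after stripping whitespace."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_advantages(rewards, eps: float = 1e-8):
    """A_i = (r_i - mean) / (population std + eps) within one group."""
    g = len(rewards)
    mu = sum(rewards) / g
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / g)
    return [(r - mu) / (sigma + eps) for r in rewards]


# G = 8 stand-in completions for one prompt whose reference answer is "42".
completions = ["42", "41", " 42 ", "7", "40", "42", "24", "43"]
rewards = [verify(c, "42") for c in completions]
advantages = group_advantages(rewards)

for c, r, a in zip(completions, rewards, advantages):
    print(f"{c!r:>7}  reward={r:.0f}  advantage={a:+.3f}")

# If every completion is correct (or every one is wrong), the group carries no signal.
print(group_advantages([1.0] * 8))
```

With binary rewards and a fraction $p$ of correct completions, the formula gives $A_i = \sqrt{(1-p)/p}$ for correct samples and $-\sqrt{p/(1-p)}$ for incorrect ones, so rare successes on hard prompts receive the largest positive advantages.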

Critical Lens: GRPO Group Baseline Limitations

GRPO's group baseline is computationally attractive, since eliminating the critic removes one of PPO's four networks along with its optimizer state, but it shifts the failure modes from critic inaccuracy to baseline variance.

Group size $G$ controls the variance-budget tradeoff. The group baseline uses $\sigma_{\mathbf{r}}$ as the normalizer: when $\sigma_{\mathbf{r}}$ is small (all completions score similarly), the normalized advantage amplifies small reward differences into large gradient updates. With $G = 4$ (a common setting), the variance of the baseline is high, and a single outlier completion — e.g., a reward-model-hacked response that scores 0.98 when all others score 0.2 — dominates the gradient for that batch. The fix is larger $G$ , but each additional completion costs a full generation pass through the policy. This creates a FLOPs multiplier: GRPO with $G = 8$ requires $8\times$ the generation compute of PPO with a single rollout per prompt. The critic's memory cost is replaced by a generation cost that scales linearly in $G$ .

Correlated completions produce zero signal. If all $G$ completions suffer from the same failure mode (say, all hallucinate the same incorrect fact or all produce the same flawed reasoning step), they receive nearly identical rewards, so every deviation $r_i - \mu_{\mathbf{r}}$ is close to zero along with $\sigma_{\mathbf{r}}$. In the exact-tie case the z-score is $0/0$; implementations typically add a small constant to the denominator, which makes every $A_i = 0$. The gradient vanishes, and the model receives no corrective signal. This is the correlated failure mode problem: the baseline cannot distinguish between "all completions are good" and "all completions share the same flaw," so GRPO is blind to systematic errors that affect every sample in a group.

The clipping-advantage interaction is uncharacterized. PPO's clipped objective was designed with GAE advantages in mind, where $A_t$ is a discounted sum of temporal-difference errors with bounded scale. In GRPO, $A_i$ is a z-score. With the population standard deviation defined above, its magnitude is bounded by $\sqrt{G-1}$ (about 1.73 for $G = 4$, and 3 for $G = 10$), and it reaches that bound when a single completion is an outlier against an otherwise uniform group. The clipping threshold $\epsilon$ (typically 0.2) interacts with the raw $A_i$ value: for a large positive advantage such as $A_i = +3$, any token whose ratio $\rho_{i,t}$ exceeds $1.2$ is clipped and contributes no gradient, so the strongest learning example stops pushing the policy after a small change in probability. There is no published study of the joint optimal setting of $(G, \epsilon)$ for GRPO; practitioners currently tune them independently, which is likely suboptimal.

Rule-based verifiers prevent reward hacking but restrict task scope. GRPO's success on mathematical reasoning (DeepSeek-R1) relies on binary verifiability: an answer is either correct or incorrect, with no ambiguity. For open-ended generation (creative writing, dialogue, summarization), no rule exists to verify quality. Using a learned RM with GRPO reintroduces the overoptimization problem that the group baseline was meant to mitigate — now the RM score can be gamed by completions that exploit proxy-reward gaps, and the group baseline amplifies the resulting advantage differences rather than correcting them.

Credit assignment: sequence-level vs token-level#

The sequence-level vs token-level distinction is a fundamental axis along which alignment methods differ, and it has direct implications for what behaviors the model can and cannot learn.

Sequence-level (DPO)#

The preference label is applied to the entire response. The gradient signal is a uniform rescaling of token log-probabilities throughout the response by the same factor. Every token in a preferred response is equally reinforced; every token in a dispreferred response is equally suppressed. This is appropriate when the preference is a global property of the response (style, tone, factual accuracy uniformly distributed) but breaks down when the preference is driven by a specific failure — a hallucinated fact in token 50, an unsafe statement in token 120 — embedded in an otherwise acceptable response.

Token-level (PPO, GRPO)#

In PPO, the advantage $A_t$ can vary across tokens within a single response, because the critic supplies a separate value estimate at every position. GRPO's update is also applied per token, but with outcome-level rewards the same group advantage $A_i$ is broadcast to every token of completion $i$; position-dependent variation enters only through the per-token importance ratio $\rho_{i,t}$, the clipping, and the per-token KL penalty. In practice, PPO's token-level credit assignment enables it to learn to avoid specific failure modes within responses in ways that DPO cannot, at the cost of a learned critic, while GRPO trades that critic for group sampling overhead and a coarser advantage signal.

Synthesis: choosing an alignment method#

| | PPO (RLHF) | DPO | GRPO |
|---|---|---|---|
| Reward model needed | Yes | No (implicit) | Yes / rule-based |
| Value critic needed | Yes | No | No |
| Networks in training | 4 | 2 | 2–3 |
| Credit assignment | Token-level | Sequence-level | Token-level |
| Stability | Moderate | High | Moderate |
| Best setting | General alignment | Low-compute alignment | Verifiable reasoning |

Exercise: Comparing DPO and GRPO#

Analytical: Starting from the DPO loss, derive the SimPO loss by dropping the reference-model terms, normalizing each log-probability by its sequence length, and adding the margin $\gamma$.
Implementation: Given the GRPO advantage formula $A_i = (r_i - \mu)/\sigma$ for a group of $G$ completions, compute the advantages for the following group: $\{r\} = [0.95, 0.88, 0.45, 0.42, 0.15]$ . Now add a KL penalty term: modify the advantage to $A_i^{\text{KL}} = (r_i - \mu)/\sigma - \beta \cdot \text{KL}_i$ where $\text{KL}_i$ is the per-completion KL divergence from the reference model. If $\beta = 0.1$ and the KL values are $[0.05, 0.12, 0.03, 0.02, 0.08]$ , recompute the advantages. Which completions change sign, and why?
Reasoning: Explain why GRPO is inherently more robust to reward model hallucination than PPO when using a rule-based verifier.

The practical choice among these methods depends primarily on two factors: compute budget and task structure. DPO is the default choice when the preference dataset is high-quality and the alignment target is a global property of responses (helpfulness, harmlessness). GRPO is preferred when the task has verifiable rewards and the goal is to train reasoning capabilities that require credit assignment at the reasoning-step level. PPO remains the most principled option when the reward model is reliable, the task requires fine-grained credit assignment, and compute is available.

Open Problems

When to use online versus offline preference optimization. The field lacks a principled criterion for choosing between offline methods (DPO, SimPO) and online methods (PPO-RLHF, GRPO). DPO is computationally cheaper but suffers from overoptimization on distribution shift. PPO is more robust to preference variation but requires four networks and live RL interaction. Recent work suggests the choice depends on preference data quality and task distribution diversity, but these properties are hard to characterize a priori. What metrics predict which method will perform better?

The proliferation of DPO variants and algorithmic convergence. Since Rafailov et al., the field has proposed SimPO, IPO, ORPO, CPO, and other variants, each motivated by specific empirical failures. Do these variants represent genuine algorithmic progress, or are they examples of overfitting to benchmark suites? More fundamentally, is there a unified theory of offline preference optimization, or is DPO a specific point in a high-dimensional loss landscape where different tasks require different points?

Synthetic preference data at scale. The emerging paradigm is to use AI feedback (language model-generated preferences) rather than human annotations to scale preference data. How does alignment trained on synthetic preferences compare to human-annotated preferences? Early results are mixed: AI feedback can match human-annotated performance on factual tasks but degrades on subjective preference (style, tone, creativity). The deeper question: if the reference model generates preferences, and we train the policy to match them, are we learning human alignment or just reproducing the reference model's biases?

Preference definition for open-ended tasks. Weeks 12–13 assume preference data on well-defined tasks (instruction following, helpfulness, safety). But for creative writing, scientific hypothesis generation, or autonomous research agents, what does "preference" even mean? Preferences are often intransitive, context-dependent, and evolving. How do you specify or learn reward signals for tasks where the goal itself is under-defined?

Key takeaways#

- DPO rederives the RLHF alignment problem by inverting the closed-form KL-regularized optimal policy to express the reward as $\beta$ times the log-ratio of policy to reference probabilities, up to a prompt-dependent constant.
- Substituting this into the Bradley-Terry preference model yields a binary cross-entropy loss on preference pairs that requires no reward model, no RL loop, and only two networks. The $\beta$ parameter plays the role of the KL penalty in RLHF.
- DPO's limitation is sequence-level credit assignment: it cannot attribute preference outcomes to specific tokens within a response.
- GRPO retains the RL loop and explicit reward signal but replaces the learned value critic with a group-normalized empirical baseline, eliminating the critic network.
- The contrastive structure of group sampling makes GRPO particularly effective for reasoning tasks with verifiable answers.
- SimPO extends DPO by removing the reference model and adding length normalization.
- The sequence-level vs token-level credit assignment axis is the principal theoretical distinction among these methods, with practical implications for which failure modes each can address.

Conceptual questions#

Derive the DPO loss function from scratch: start from the KL-regularized RLHF objective, write down its closed-form optimal policy, invert to express $r(x,y)$ in terms of the policy ratio, and substitute into the Bradley-Terry preference model. At what step do the intractable partition functions $Z(x)$ cancel, and why is this cancellation exact? What property of the Bradley-Terry model enables it?
The DPO gradient scales each preference pair update by a factor $\hat{\sigma} = 1 - \sigma(\Delta)$ where $\Delta$ is the log-ratio difference. For a pair where the model already correctly predicts the preference with high confidence ( $\Delta \gg 0$ ), $\hat{\sigma} \approx 0$ . Explain why this weighting is beneficial (analogous to hard example mining) and identify a failure mode it could cause if the training dataset contains mislabeled preference pairs.
GRPO samples $G$ completions per prompt and standardizes their rewards to compute advantages. If $G = 1$ , what does the advantage equal, and what does the GRPO gradient become? If all $G$ completions receive identical rewards (e.g., all correct or all incorrect), what happens to the gradient? What does this imply about the minimum dataset and sampling conditions required for GRPO to produce a useful signal?
A model trained with DPO on a preference dataset where the preferred response is always longer than the dispreferred response will develop a length bias. Explain mechanistically why this happens using the sequence-level log-probability sum, and show how SimPO's length normalization resolves it. Does PPO-RLHF suffer from the same bias? Why or why not?
DeepSeek-R1 uses GRPO with a rule-based verifier (checking mathematical correctness) rather than a learned reward model. Compare this to the standard RLHF reward model approach from Week 12. Under what conditions does a rule-based verifier provide a more reliable training signal than a learned RM, and what category of tasks is structurally excluded from rule-based verification? For a task that cannot use rule-based verification, describe how you would design the reward signal and why GRPO may still be preferable to PPO.

Solutions

Start from $\max_\pi \mathbb{E}[r(x,y)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$; its closed-form optimum is $\pi^*(y\mid x) = \frac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\exp(r(x,y)/\beta)$. Inverting gives $r(x,y) = \beta\log\frac{\pi^*(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\log Z(x)$; substituting into Bradley-Terry $\sigma(r(x,y^w) - r(x,y^l))$ cancels $\beta\log Z(x)$ because both completions share the prompt $x$ and the model depends only on the reward difference. The cancellation is exact thanks to that shift-invariance.
For confidently-correct pairs ( $\Delta \gg 0$ ), $\hat\sigma = 1 - \sigma(\Delta) \approx 0$ , so updates concentrate on pairs the model gets wrong or is unsure about — effective hard-example mining. The failure mode: a mislabeled pair looks high-loss, receives a large weight, and aggressively drags the model toward the wrong label, amplifying label noise.
With $G = 1$ the reward equals its own group mean, so the numerator $r_1 - \mu_{\mathbf{r}}$ is $0$ and $\sigma_{\mathbf{r}} = 0$ as well; the z-score is $0/0$, which implementations resolve to $0$ with a small constant in the denominator, so the gradient from the advantage term vanishes. Likewise, if all $G$ completions get identical rewards, the standardized advantages are all $0$ and there is no gradient. So GRPO needs $G \ge 2$ and within-group reward variance: prompts where every sample succeeds or every sample fails contribute no signal.
DPO's implicit reward is a sum of per-token log-ratios, so its magnitude scales with the number of tokens: when preferred responses are systematically longer, the margin can be widened through length-correlated terms rather than through quality, which produces a length bias. SimPO normalizes by length (average per-token log-prob), removing the scaling. PPO-RLHF avoids this at the objective level (its reward is a sequence-level scalar, not a log-prob sum) but can still inherit length bias if the reward model itself prefers longer outputs.
A rule-based verifier is more reliable when correctness is programmatically checkable (math, code, formal tasks): no learned-RM exploitation or drift. It structurally excludes open-ended/subjective tasks (creative writing, helpfulness, tone) with no programmatic ground truth. For those, use a learned reward model or LLM-as-judge from preference data; GRPO can still beat PPO because it drops the separate value network (lower memory/compute) and uses a group-relative baseline that scales conveniently to LLM training.

Looking ahead#

The final lecture brings the course full circle by asking how the aligned models, planning algorithms, and RL foundations developed across all thirteen weeks combine into deployed agentic systems.

Week 14: Agentic Systems. We examine how tool-using LLMs instantiate the MDP formalism from Week 1, how hierarchical and compositional task structures map onto RL sub-problems developed throughout the course, and what the remaining open problems in agentic AI are from a reinforcement learning perspective.

Purpose of this lecture#

DPO: the derivation and landmark result#

Starting point#

Recall from Week 12 that the KL-regularized RLHF objective:

\max_\pi \mathbb{E}_{y \sim \pi}\!\left[r(x,y)\right] - \beta\, D_{\text{KL}}(\pi \| \pi_{\text{ref}})

has the closed-form optimal policy:

\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\cdot \exp\!\left(\frac{r(x,y)}{\beta}\right)

where $Z(x) = \sum_y \pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta)$ is the intractable partition function.

Inverting for the implicit reward#

DPO's key move is to rearrange this expression to express $r(x,y)$ in terms of the policy ratio, rather than expressing $\pi^*$ in terms of $r$ :

r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)

The term $\beta \log Z(x)$ depends only on $x$, not on $y$. The Bradley-Terry preference model from Week 12 involves only the difference of rewards:

P(y^w \succ y^l \mid x) = \sigma\!\left(r(x, y^w) - r(x, y^l)\right)

so the $\beta \log Z(x)$ terms cancel exactly when the reward difference is taken:

r(x, y^w) - r(x, y^l) = \beta \log \frac{\pi^*(y^w \mid x)}{\pi_{\text{ref}}(y^w \mid x)} - \beta \log \frac{\pi^*(y^l \mid x)}{\pi_{\text{ref}}(y^l \mid x)}
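
The cancellation can be checked numerically on a toy problem. The sketch below assumes a prompt with only three candidate responses and made-up values for `pi_ref`, `reward`, and `beta` (none of these numbers come from the lecture); it builds $\pi^*$ from the closed form and confirms that reward differences are recovered from policy ratios alone, without $Z(x)$.

```python
import math

# Toy setting (hypothetical numbers): three candidate responses for one prompt.
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]    # reference policy probabilities, sum to 1
reward = [1.0, 0.2, -0.5]   # r(x, y) for each response

# Closed-form optimum: pi*(y|x) = pi_ref(y|x) * exp(r(x,y)/beta) / Z(x)
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]

def implicit_reward_no_Z(i):
    """beta * log(pi*/pi_ref): the reward up to the additive term beta*log Z(x)."""
    return beta * math.log(pi_star[i] / pi_ref[i])

# Every response is off from its true reward by the same constant beta*log Z(x) ...
offsets = [reward[i] - implicit_reward_no_Z(i) for i in range(3)]

# ... so the difference between any two responses matches the true reward gap.
gap_true = reward[0] - reward[2]
gap_implicit = implicit_reward_no_Z(0) - implicit_reward_no_Z(2)
print(offsets, beta * math.log(Z))
print(gap_true, gap_implicit)
```

In a real language model the sum defining $Z(x)$ ranges over all possible responses and cannot be enumerated, which is why it matters that it drops out of the difference.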

The DPO loss#

Substituting into the Bradley-Terry log-likelihood, and replacing the optimal policy $\pi^*$ with the parameterized policy $\pi_\theta$ we are training:

\boxed{ \mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,y^w,y^l) \sim \mathcal{D}}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y^w \mid x)}{\pi_{\text{ref}}(y^w \mid x)} - \beta \log \frac{\pi_\theta(y^l \mid x)}{\pi_{\text{ref}}(y^l \mid x)} \right) \right] }
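
As a rough illustration of how the boxed loss is evaluated, the sketch below computes it from summed sequence log-probabilities. The function names, the tuple layout, the default `beta`, and the example log-probabilities are all assumptions made for this example; a real training loop would obtain the log-probabilities from the policy and the frozen reference model.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(batch, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each item holds summed sequence log-probabilities:
    (logp_theta_w, logp_ref_w, logp_theta_l, logp_ref_l).
    """
    total = 0.0
    for lp_w, ref_w, lp_l, ref_l in batch:
        # beta * log-ratio of the preferred response minus that of the dispreferred one
        delta = beta * (lp_w - ref_w) - beta * (lp_l - ref_l)
        total += -log_sigmoid(delta)
    return total / len(batch)

# Hypothetical log-probabilities, for illustration only.
at_init = [(-12.0, -12.0, -15.0, -15.0)]       # policy equals the reference
after_update = [(-10.0, -12.0, -18.0, -15.0)]  # y^w more likely, y^l less likely

loss_init = dpo_loss(at_init)
loss_after = dpo_loss(after_update)
print(loss_init, loss_after)
```

When the policy equals the reference model both log-ratios are zero, so the loss starts at $\log 2$ regardless of the data; it falls as the policy raises the ratio for $y^w$ relative to $y^l$.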

What DPO is actually optimizing#

Taking the gradient of $\mathcal{L}_{\text{DPO}}$ with respect to $\theta$ and examining the update direction reveals the mechanism:

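-\nabla_\theta \mathcal{L}_{\text{DPO}} \propto \beta\!\left[ \underbrace{\nabla_\theta \log \pi_\theta(y^w|x)}_{\text{increase } \pi_\theta(y^w)} - \underbrace{\nabla_\theta \log \pi_\theta(y^l|x)}_{\text{decrease } \pi_\theta(y^l)} \right] \cdot \hat{\sigma}

Here $\hat{\sigma} = 1 - \sigma(\Delta)$, where $\Delta$ is the argument of the sigmoid in $\mathcal{L}_{\text{DPO}}$ (the $\beta$-scaled log-ratio difference). The update raises the log-probability of $y^w$ and lowers that of $y^l$, weighted by how strongly the current implicit rewards misorder the pair.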
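
The per-pair weight $\hat{\sigma} = 1 - \sigma(\Delta)$ is easy to tabulate. The sketch below is illustrative only: the helper names and the chosen values of $\Delta$ are assumptions. It also includes a finite-difference check that this weight is the magnitude of the derivative of the per-pair loss $-\log\sigma(\Delta)$ with respect to $\Delta$.

```python
import math

def dpo_pair_weight(delta):
    """sigma_hat = 1 - sigmoid(delta), which equals sigmoid(-delta)."""
    return 1.0 / (1.0 + math.exp(delta))

def pair_loss(delta):
    """Per-pair DPO loss as a function of delta: -log(sigmoid(delta))."""
    return math.log1p(math.exp(-delta))

# Misordered pair (delta < 0), undecided pair (delta = 0), confident pair (delta > 0).
weights = {delta: dpo_pair_weight(delta) for delta in (-4.0, 0.0, 4.0)}
print(weights)

# Central finite difference of the loss at delta = 1; should be close to -sigma_hat(1).
h = 1e-6
numeric_slope = (pair_loss(1.0 + h) - pair_loss(1.0 - h)) / (2 * h)
print(numeric_slope, -dpo_pair_weight(1.0))
```

Pairs the model already orders correctly with a large margin receive a weight near zero, while misordered pairs receive a weight near one.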

Practical implications#

Critical Lens: DPO's Implicit Reward and Overoptimization

DPO's elimination of the reward model is mathematically elegant, but introduces failure modes that are invisible from the loss function alone.

DPO variants and limitations#

Sequence-level credit assignment#

DPO evaluates the log-probability of the entire sequence $y$:

\log \pi_\theta(y \mid x) = \sum_t \log \pi_\theta(y_t \mid x, y_{<t})

SimPO#

Simple Preference Optimization (SimPO, Meng et al., 2024) modifies DPO by removing the reference model and normalizing log-probabilities by sequence length:

\mathcal{L}_{\text{SimPO}}(\theta) = -\mathbb{E}\!\left[ \log \sigma\!\left( \frac{\beta}{|y^w|} \log \pi_\theta(y^w|x) - \frac{\beta}{|y^l|} \log \pi_\theta(y^l|x) - \gamma \right) \right]
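
A minimal sketch of the SimPO loss for a single pair follows, to show what length normalization does. The function name, the default values of `beta` and `gamma`, and the example log-probabilities are assumptions chosen for illustration, not values from the paper or the lecture.

```python
import math

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO loss for one pair, from summed log-probs and token counts."""
    margin = beta * logp_w / len_w - beta * logp_l / len_l - gamma
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Hypothetical pairs in which every response has the same average
# per-token log-prob (-0.5); only the length of the preferred response differs.
loss_short = simpo_loss(logp_w=-5.0, len_w=10, logp_l=-5.0, len_l=10)
loss_long = simpo_loss(logp_w=-20.0, len_w=40, logp_l=-5.0, len_l=10)
print(loss_short, loss_long)
```

The two losses are identical: with per-token averaging, making the preferred response four times longer at the same per-token quality does not change the objective.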

GRPO: eliminating the critic#

The critic's role and cost#

The group baseline#

GRPO replaces the learned critic with an empirical baseline computed from a group of sampled completions. For each prompt $x$ in the training batch:

Sample $G$ completions $\{y_1, \ldots, y_G\}$ from the current policy $\pi_\theta$.
Score each with the reward model (or a verifier): $\{r_1, \ldots, r_G\}$.
Compute the group-normalized advantage for each completion:

A_i = \frac{r_i - \mu_{\mathbf{r}}}{\sigma_{\mathbf{r}}}, \qquad \mu_{\mathbf{r}} = \frac{1}{G}\sum_{i=1}^G r_i, \quad \sigma_{\mathbf{r}} = \sqrt{\frac{1}{G}\sum_{i=1}^G (r_i - \mu_{\mathbf{r}})^2}
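
The three steps above reduce to a few lines of arithmetic once the rewards are available. The sketch below assumes a hypothetical group of four rewards; the small constant `eps` in the denominator is an addition for numerical safety that does not appear in the formula.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages A_i = (r_i - mean) / std for one prompt.

    Uses the population standard deviation (divide by G), as in the formula.
    eps is an assumption: it guards against division by zero when all rewards match.
    """
    G = len(rewards)
    mu = sum(rewards) / G
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / G)
    return [(r - mu) / (sigma + eps) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]  # hypothetical rewards for G = 4 completions
adv = group_advantages(rewards)
print(adv)
```

By construction the advantages sum to zero within a group, so completions are only ever scored relative to their siblings for the same prompt.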

The GRPO objective#

The policy update applies the PPO clipped surrogate objective using $A_i$ in place of the GAE advantage, adding a per-token KL penalty against the reference model:

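\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\!\left[ \sum_{t} \min\!\left( \rho_{i,t} A_i,\; \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) A_i \right) - \beta\, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right]

where $\rho_{i,t} = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \,/\, \pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})$ is the PPO probability ratio for token $t$ of completion $i$, $\epsilon$ is the clipping range, and every token of completion $i$ shares the same advantage $A_i$.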
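
To make the clipping behavior concrete, here is a sketch of the loss contribution of a single completion. It assumes the per-token ratios are taken against the policy that generated the samples (as in PPO) and that a scalar KL estimate is supplied by the caller; the function names, the defaults for `eps` and `beta`, and the example ratios are hypothetical.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))

def grpo_completion_loss(ratios, advantage, kl_estimate, eps=0.2, beta=0.04):
    """Negative clipped surrogate for one completion, plus the KL penalty.

    ratios:      per-token probability ratios rho_{i,t}
    advantage:   group-normalized advantage A_i, shared by every token
    kl_estimate: scalar estimate of KL(pi_theta || pi_ref), assumed given
    """
    surrogate = sum(
        min(rho * advantage, clip(rho, 1.0 - eps, 1.0 + eps) * advantage)
        for rho in ratios
    )
    return -(surrogate - beta * kl_estimate)

ratios = [1.0, 1.5, 0.9]  # hypothetical per-token ratios
loss_pos = grpo_completion_loss(ratios, advantage=1.0, kl_estimate=0.0)
loss_neg = grpo_completion_loss(ratios, advantage=-1.0, kl_estimate=0.0)
print(loss_pos, loss_neg)
```

With a positive advantage the ratio of 1.5 is clipped to 1.2, which caps the gain from pushing that token further. With a negative advantage the `min` keeps the unclipped term, so an already over-weighted token in a bad completion is still penalized in full.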

Interactive: Understanding GRPO Advantage#

The core of GRPO is the relative scoring within a group. Consider a batch with $G = 4$ completions for the same prompt:

$\mu = 0.55$, $\sigma \approx 0.33$. Completions A and B are reinforced; C and D are suppressed. Now suppose we change only completion D's reward from 0.2 to 0.8 while keeping A's reward at 0.9:

Why GRPO works for reasoning tasks#

The group sampling structure makes GRPO particularly effective for tasks with verifiable correct answers: mathematics, formal proofs, code execution, and structured reasoning. In these settings:

The reward signal is binary or near-binary (correct / incorrect), making reward models unnecessary — a rule-based verifier suffices.
Multiple completions per prompt are natural: the model generates several candidate solutions, and the verifier checks each. Correct solutions have $A_i > 0$; incorrect ones have $A_i < 0$.
The contrastive signal between correct and incorrect solutions in the same group is a strong learning signal: the model simultaneously reinforces correct reasoning paths and suppresses incorrect ones from the same prompt.
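
The points above can be sketched with a toy verifier. Everything here is hypothetical: `verify` does an exact string match on a final answer, the sampled answers are made up, and `eps` is a numerical safeguard that is not part of the advantage formula.

```python
import math

def verify(answer, expected):
    """Hypothetical rule-based verifier: exact match on the final answer."""
    return 1.0 if answer.strip() == expected.strip() else 0.0

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages (population std; eps avoids division by zero)."""
    G = len(rewards)
    mu = sum(rewards) / G
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / G)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt whose correct answer is "42".
samples = ["42", "41", "42 ", "7"]
rewards = [verify(s, "42") for s in samples]
mixed = group_advantages(rewards)

# A prompt the model always solves: identical rewards, no contrast.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])
print(rewards, mixed, all_correct)
```

The mixed group yields positive advantages for correct answers and negative ones for incorrect answers. The all-correct group yields zeros, so a prompt that is uniformly solved (or uniformly failed) contributes nothing to the clipped surrogate.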

Critical Lens: GRPO Group Baseline Limitations

GRPO's group baseline is computationally attractive — eliminating the critic halves memory — but shifts the failure modes from critic inaccuracy to baseline variance.

Credit assignment: sequence-level vs token-level#

The sequence-level vs token-level distinction is a fundamental axis along which alignment methods differ, and it has direct implications for what behaviors the model can and cannot learn.

Sequence-level (DPO)#

Token-level (PPO, GRPO)#

Synthesis: choosing an alignment method#

Exercise: Comparing DPO and GRPO#

Analytical: Derive the SimPO loss from the DPO loss by removing the reference model, applying sequence-length normalization, and adding the margin term $\gamma$.
Implementation: Given the GRPO advantage formula $A_i = (r_i - \mu)/\sigma$ for a group of $G$ completions, compute the advantages for the following group: $\{r\} = [0.95, 0.88, 0.45, 0.42, 0.15]$. Now add a KL penalty term: modify the advantage to $A_i^{\text{KL}} = (r_i - \mu)/\sigma - \beta \cdot \text{KL}_i$ where $\text{KL}_i$ is the per-completion KL divergence from the reference model. If $\beta = 0.1$ and the KL values are $[0.05, 0.12, 0.03, 0.02, 0.08]$, recompute the advantages. Which completions change sign, and why?
Reasoning: Explain why GRPO is inherently more robust to reward model hallucination than PPO when using a rule-based verifier.

Open Problems

Key takeaways#

Conceptual questions#

Derive the DPO loss function from scratch: start from the KL-regularized RLHF objective, write down its closed-form optimal policy, invert to express $r(x,y)$ in terms of the policy ratio, and substitute into the Bradley-Terry preference model. At what step do the intractable partition functions $Z(x)$ cancel, and why is this cancellation exact? What property of the Bradley-Terry model enables it?
The DPO gradient scales each preference pair update by a factor $\hat{\sigma} = 1 - \sigma(\Delta)$, where $\Delta$ is the $\beta$-scaled log-ratio difference inside the sigmoid of the DPO loss. For a pair where the model already correctly predicts the preference with high confidence ($\Delta \gg 0$), $\hat{\sigma} \approx 0$. Explain why this weighting is beneficial (analogous to hard example mining) and identify a failure mode it could cause if the training dataset contains mislabeled preference pairs.
GRPO samples $G$ completions per prompt and standardizes their rewards to compute advantages. If $G = 1$, what does the advantage equal, and what does the GRPO gradient become? If all $G$ completions receive identical rewards (e.g., all correct or all incorrect), what happens to the gradient? What does this imply about the minimum dataset and sampling conditions required for GRPO to produce a useful signal?
A model trained with DPO on a preference dataset where the preferred response is always longer than the dispreferred response will develop a length bias. Explain mechanistically why this happens using the sequence-level log-probability sum, and show how SimPO's length normalization resolves it. Does PPO-RLHF suffer from the same bias? Why or why not?
DeepSeek-R1 uses GRPO with a rule-based verifier (checking mathematical correctness) rather than a learned reward model. Compare this to the standard RLHF reward model approach from Week 12. Under what conditions does a rule-based verifier provide a more reliable training signal than a learned RM, and what category of tasks is structurally excluded from rule-based verification? For a task that cannot use rule-based verification, describe how you would design the reward signal and why GRPO may still be preferable to PPO.

Solutions

Start from $\max_\pi \mathbb{E}[r(x,y)] - \beta\,D_{\text{KL}}(\pi \,\|\, \pi_{\text{ref}})$; its closed-form optimum is $\pi^*(y\mid x) = \frac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\exp(r(x,y)/\beta)$. Inverting gives $r(x,y) = \beta\log\frac{\pi^*(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\log Z(x)$; substituting into Bradley-Terry $\sigma(r(x,y^w) - r(x,y^l))$ cancels $\beta\log Z(x)$ because both completions share the prompt $x$ and the model depends only on the reward difference. The cancellation is exact thanks to that shift-invariance.
For confidently-correct pairs ($\Delta \gg 0$), $\hat\sigma = 1 - \sigma(\Delta) \approx 0$, so updates concentrate on pairs the model gets wrong or is unsure about, which acts as hard-example mining. The failure mode: a mislabeled pair looks high-loss, receives a large weight, and aggressively drags the model toward the wrong label, amplifying label noise.
With $G = 1$ the numerator $r_1 - \mu_{\mathbf{r}}$ is zero (and so is $\sigma_{\mathbf{r}}$, which is why implementations typically add a small constant to the denominator), so the advantage is $0$ and the policy-gradient term vanishes. Likewise, if all $G$ completions get identical rewards, the standardized advantages are all $0$ and the clipped surrogate contributes no gradient. So GRPO needs $G \ge 2$ and within-group reward variance: prompts where every sample succeeds or every sample fails contribute no learning signal.
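DPO's implicit reward $\beta\sum_t \log\frac{\pi_\theta(y_t\mid x,y_{<t})}{\pi_{\text{ref}}(y_t\mid x,y_{<t})}$ is a sum over tokens, so its magnitude grows with the number of tokens: a longer response has more terms through which the policy can raise its log-ratio. When preferred responses are systematically longer, the loss can be reduced by favoring length itself, which is a length bias. SimPO divides the summed log-probability by $|y|$ (an average per-token log-prob), so the reward no longer scales with length. PPO-RLHF avoids this at the objective level (its reward is a sequence-level scalar, not a log-prob sum) but can still inherit length bias if the reward model itself prefers longer outputs.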
A rule-based verifier is more reliable when correctness is programmatically checkable (math, code, formal tasks): no learned-RM exploitation or drift. It structurally excludes open-ended/subjective tasks (creative writing, helpfulness, tone) with no programmatic ground truth. For those, use a learned reward model or LLM-as-judge from preference data; GRPO can still beat PPO because it drops the separate value network (lower memory/compute) and uses a group-relative baseline that scales conveniently to LLM training.
