Purpose of this lecture
The RLHF pipeline from Week 12 works: it produces well-aligned language models that score highly on human preference evaluations. It also carries a significant overhead. PPO-based RLHF requires four large neural networks loaded simultaneously: the active policy, a frozen reference model, a reward model, and a value critic. For a 70B-parameter model in BF16 (2 bytes per parameter, about 140 GB per network), this comes to roughly 560 GB of GPU memory for the weights alone, before accounting for optimizer state and activations. The result is that full-scale RLHF is accessible to only a small number of organizations, and iteration cycles are slow even for those who can afford it.
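The 560 GB figure is simple arithmetic, and a minimal sketch makes the assumptions explicit. It assumes all four networks are the same size as the policy and counts weights only; the variable names are illustrative.

```python
# Back-of-the-envelope weight memory for PPO-based RLHF.
# Assumption: all four networks have the same parameter count as the policy.
BYTES_PER_PARAM_BF16 = 2      # bfloat16 stores each parameter in 2 bytes
N_PARAMS = 70e9               # 70B-parameter model
NETWORKS = ["policy", "reference", "reward model", "value critic"]

per_network_gb = N_PARAMS * BYTES_PER_PARAM_BF16 / 1e9   # decimal gigabytes
total_gb = per_network_gb * len(NETWORKS)

print(f"per network: {per_network_gb:.0f} GB")
print(f"all four:    {total_gb:.0f} GB")
```

Optimizer state for the trainable networks and activation memory come on top of this total.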
This lecture develops two modern approaches that reduce this overhead without abandoning the theoretical grounding of the RLHF framework. Direct Preference Optimization (DPO) rederives the RLHF objective to show that the reward model is implicit in the policy ratio and can be eliminated entirely: the alignment problem reduces to binary classification on preference pairs. Group Relative Policy Optimization (GRPO) takes the opposite simplification: keep the RL loop but replace the learned value critic with an empirical group baseline, eliminating the critic network while preserving token-level credit assignment.
Both methods trace directly to the closed-form KL-regularized optimal policy derived in Week 12. Understanding the derivations — not just the final loss functions — is what separates principled use of these methods from treating them as black-box recipes.
DPO: the derivation and landmark result
Direct Preference Optimization (Rafailov et al., 2023) demonstrated a surprising insight: the RLHF alignment objective has a closed-form optimal policy, and substituting that policy into the Bradley-Terry preference model yields a loss function that depends only on policy ratios, never on an explicit reward model. The paper's central contribution was empirical: DPO-trained models reached InstructGPT-level instruction-following quality with a single epoch of offline training, no reward model fitting, and no RL rollouts — roughly 10x less compute than PPO-RLHF. The community's initial reception was enthusiastic: DPO's simplicity, stability, and efficiency made RLHF-scale alignment accessible to academic labs. However, subsequent work exposed limitations: Dubois et al. (2024) showed DPO underperforms PPO on mathematical and code reasoning tasks where step-level verification is important. More recently, studies of DPO's implicit reward model revealed overoptimization pathologies — the policy can diverge sharply from the reference model on out-of-distribution prompts, and the implicit reward is often uninterpretable. The field has responded with a proliferation of DPO variants (SimPO, IPO, ORPO, CPO), each addressing a specific failure mode, suggesting that no single offline preference optimization formulation dominates all tasks.
Starting point
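Recall from Week 12 that the KL-regularized RLHF objective:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta\, D_{\mathrm{KL}}\big[\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big]$$

has the closed-form optimal policy:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

where $Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\big(r(x, y)/\beta\big)$ is the intractable partition function.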
Inverting for the implicit reward
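DPO's key move is to rearrange this expression to express $r$ in terms of the policy ratio, rather than expressing $\pi^*$ in terms of $r$:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$

The term $\beta \log Z(x)$ depends only on $x$, not on $y$. Since the Bradley-Terry preference model from Week 12 involves the difference of rewards:

$$p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$

the $\beta \log Z(x)$ terms cancel exactly when the reward difference is taken:

$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}$$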
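The cancellation can be checked numerically on a toy problem. The sketch below assumes a single prompt with three possible responses and made-up values for the reference policy, the reward, and $\beta$; it builds the closed-form optimal policy and confirms that the log-ratio reward differs from the true reward by the same constant $\beta \log Z$ for every response.

```python
import math

beta = 0.5
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}    # toy reference policy (hypothetical)
reward = {"a": 1.0, "b": 0.0, "c": -1.0}   # toy "true" rewards (hypothetical)

# Closed-form optimum: pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z
unnorm = {y: pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref}
Z = sum(unnorm.values())
pi_star = {y: v / Z for y, v in unnorm.items()}

# Implicit reward without the partition term: beta * log(pi* / pi_ref)
implicit = {y: beta * math.log(pi_star[y] / pi_ref[y]) for y in pi_ref}

# Every implicit reward is off by the same constant, beta * log Z ...
offsets = {y: reward[y] - implicit[y] for y in pi_ref}
print(offsets, beta * math.log(Z))

# ... so pairwise differences agree with the true reward differences.
diff_true = reward["a"] - reward["c"]
diff_implicit = implicit["a"] - implicit["c"]
print(diff_true, diff_implicit)
```

Because the offset does not depend on the response, any model that only uses reward differences, such as Bradley-Terry, never needs $Z$.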
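The DPO loss
Substituting into the Bradley-Terry log-likelihood, and replacing the optimal policy $\pi^*$ with the parameterized policy $\pi_\theta$ we are training:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

This is a binary cross-entropy loss where the "logit" is the difference in log-probability ratios (policy log-probability minus reference log-probability) between the preferred and dispreferred responses, scaled by $\beta$. No reward model appears. The reward model has been algebraically eliminated: it is implicit in the ratio $\pi_\theta / \pi_{\mathrm{ref}}$.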
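As a minimal sketch of the per-pair loss, the function below takes summed sequence log-probabilities under the policy and the reference as plain floats. The function names, argument names, and the default $\beta = 0.1$ are illustrative assumptions, not part of the lecture; a real implementation would batch this over tensors.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l: summed log-probabilities of the preferred / dispreferred
    response under the trainable policy; ref_* are the same under the frozen
    reference. beta=0.1 is an illustrative default.
    """
    logit = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(logit)

# At initialization the policy equals the reference, so the logit is 0
# and the loss is log(2), the loss of an uninformed binary classifier.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the preferred response's log-probability lowers the loss.
after_update = dpo_loss(-8.0, -12.0, -10.0, -12.0)
print(at_init, after_update)
```

Only the two log-ratios enter the loss, which is why training needs the policy and the reference but no reward model.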
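What DPO is actually optimizing
Taking the gradient of $\mathcal{L}_{\mathrm{DPO}}$ with respect to $\theta$ and examining the update direction reveals the mechanism:

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\, w \,\big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big)\Big], \qquad w = \sigma(-u)$$

where $u = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}$ is the implicit reward margin and $w$ is a weighting factor that is large when the model currently predicts the preference pair incorrectly (negative or small margin) and small when the model already predicts it correctly (large positive margin). DPO therefore applies larger gradients to pairs where the model is currently confused, concentrating learning effort where it is most needed, analogous to hard example mining in metric learning.
The parameter $\beta$ controls the KL penalty strength: large $\beta$ prevents $\pi_\theta$ from deviating much from $\pi_{\mathrm{ref}}$ (conservative, low risk of mode collapse); small $\beta$ allows large policy shifts (aggressive, with a risk of forgetting the reference model's qualities).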
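To see the weighting in numbers, the sketch below evaluates the factor $\sigma(-u)$ from the standard DPO gradient, where $u$ is the $\beta$-scaled difference in log-ratios between the preferred and dispreferred responses. The function names and the sample margins are illustrative assumptions.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def implicit_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """u: implicit reward of the preferred minus the dispreferred response."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

def dpo_weight(margin: float) -> float:
    """Gradient weight sigma(-u): near 1 for misranked pairs, near 0 for
    pairs the policy already ranks correctly with a wide margin."""
    return sigmoid(-margin)

# Sweep from a badly misranked pair (u = -4) to a confidently correct one.
for u in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"margin {u:+.1f} -> weight {dpo_weight(u):.3f}")
```

One consequence worth keeping in mind: a mislabeled pair looks exactly like a badly misranked one, so it receives a weight near 1.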
Practical implications
DPO reduces the RLHF training setup from four networks to two: the trainable policy $\pi_\theta$ and the frozen reference $\pi_{\mathrm{ref}}$. Training is stable supervised learning: no reward signal variance, no PPO clipping, no advantage estimation. The preference dataset can be collected once offline and reused, making DPO a fully offline alignment method in the sense of Week 11. A single epoch of DPO on a well-curated preference dataset can match the alignment quality of PPO-RLHF with a fraction of the compute.
DPO variants and limitations
Sequence-level credit assignment
DPO evaluates the log-probability of the entire sequence $y = (y_1, \dots, y_T)$:

$$\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})$$

This means that a response containing 499 excellent tokens followed by one harmful token is treated the same way as a response that is poor throughout: each sequence is labeled as preferred or dispreferred as a whole. There is no mechanism to attribute the preference to specific tokens within the sequence. This sparse credit assignment is DPO's principal limitation relative to PPO: it cannot identify and correct specific failure modes within a response.
SimPO
Simple Preference Optimization (SimPO, Meng et al., 2024) modifies DPO by removing the reference model and normalizing log-probabilities by sequence length:

$$\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma\right)\right]$$

where $\gamma$ is a margin. SimPO addresses two empirical issues with DPO. The first is the dependence on the reference model: models trained with DPO exhibit unexpected behaviors when the reference model is distant or misaligned. The second is length bias: because a sequence log-probability is a sum over tokens, its magnitude grows with response length regardless of quality, so length becomes confounded with the preference signal. SimPO's length normalization divides by sequence length, and the margin introduces an explicit preference calibration. Removal of the reference model reduces memory overhead. SimPO achieves competitive results with DPO on benchmark preference datasets with 15–20% lower memory cost; however, it sacrifices the theoretical connection to the KL-regularized RLHF objective, making it harder to reason about how much the policy has shifted from the training distribution. In practice, SimPO works well when preference data is high-quality and homogeneous in task structure, but offers less robustness on diverse or distribution-shifted preferences.
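A minimal sketch of the SimPO per-pair loss follows, using the length-normalized, reference-free form with a margin. The function names and the values $\beta = 2.0$ and $\gamma = 0.5$ are illustrative assumptions rather than recommended settings.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO loss for one pair: average per-token log-probabilities,
    no reference model, and a target margin gamma."""
    logit = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -log_sigmoid(logit)

# Two responses with identical per-token quality (-0.5 nats per token)
# but different lengths: 10 tokens versus 100 tokens.
logp_short, len_short = -5.0, 10
logp_long, len_long = -50.0, 100

summed_gap = logp_short - logp_long          # 45 nats, purely from length
normalized_gap = logp_short / len_short - logp_long / len_long   # 0.0

loss = simpo_loss(logp_short, len_short, logp_long, len_long)
print(summed_gap, normalized_gap, loss)
```

With the per-token average, the two responses are scored identically and only the margin $\gamma$ remains in the logit; the summed log-probabilities would instead differ by 45 nats for reasons unrelated to quality.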
GRPO: eliminating the critic
Group Relative Policy Optimization (GRPO; Shao et al., 2024, introduced in DeepSeekMath) takes the opposite simplification from DPO: retain the RL loop and an explicit reward signal, but eliminate the learned value critic by replacing it with an empirical group baseline computed from multiple sampled completions. Shao et al. demonstrated that GRPO matches PPO performance on mathematical reasoning with significantly lower memory overhead (one critic network eliminated), enabling training of large reasoning models. The landmark result came with DeepSeek-R1 (2025): applied at scale with a rule-based verifier (correctness checking), GRPO elicited emergent chain-of-thought reasoning. Models learned to generate extensive self-correcting solution traces without explicit supervision of the reasoning process itself, using only outcome-level verification. This finding reignited the debate about process versus outcome reward models: if outcome-only rewards can implicitly elicit step-level reasoning behaviors, do process reward models (Lightman et al., 2023) provide additional value, or are they an unnecessary intermediate representation?
The critic's role and cost
In PPO (Weeks 7–8), the advantage estimate $\hat{A}_t$ requires a value function $V_\phi(s_t)$ that predicts the expected return from each state. For a language model, $V_\phi$ is typically another copy of the language model with an added scalar head. This doubles the number of active parameters during training and requires a separate optimization loop to keep $V_\phi$ accurate.
The group baseline
GRPO replaces the learned critic with an empirical baseline computed from a group of $G$ sampled completions. For each prompt $x$ in the training batch:
- Sample $G$ completions $\{y_1, \dots, y_G\}$ from the current policy snapshot $\pi_{\theta_{\mathrm{old}}}$.
- Score each with the reward model (or a verifier): $r_i = r(x, y_i)$.
- Compute the group-normalized advantage for each completion: $\hat{A}_i = \dfrac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$
The standardized score is positive for completions that outperform the group average and negative for those that underperform. It is a relative, not absolute measure of quality: a reward of 0.7 is a strong positive advantage if all other completions scored 0.3, but a negative advantage if others scored 0.9.
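The 0.7 example can be sketched in a few lines. The code assumes a population standard deviation and a small `eps` in the denominator to guard against zero variance; both are implementation choices, not something the lecture specifies, and `group_advantages` is a hypothetical helper name.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions.

    Assumptions: population standard deviation, and eps added to the
    denominator so an all-equal group yields zeros instead of dividing by 0.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# The same reward of 0.7 in two different groups.
weak_group = [0.7, 0.3, 0.3, 0.3]     # 0.7 is the best in the group
strong_group = [0.7, 0.9, 0.9, 0.9]   # 0.7 is the worst in the group

adv_weak = group_advantages(weak_group)
adv_strong = group_advantages(strong_group)
print(adv_weak[0], adv_strong[0])     # positive, then negative
```

No value network is involved: the group mean plays the role of the baseline that the critic would otherwise have to learn.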
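The GRPO objective
The policy update applies the PPO clipped surrogate objective using $\hat{A}_i$ (the group-normalized advantage of completion $i$) in place of the GAE advantage, adding a per-token KL penalty against the reference model:

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \Big( \min\big(\rho_{i,t}\, \hat{A}_i,\; \operatorname{clip}(\rho_{i,t},\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}_i\big) - \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \Big)\right]$$

where $\rho_{i,t} = \dfrac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}$ is the per-token importance ratio. The KL penalty is computed token-by-token rather than at the sequence level, providing denser regularization.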
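The objective can be sketched on scalar log-probabilities. The code below is a simplified illustration under stated assumptions: the KL term uses the non-negative estimator $\pi_{\mathrm{ref}}/\pi_\theta - \log(\pi_{\mathrm{ref}}/\pi_\theta) - 1$ described in the DeepSeekMath paper, the values `clip_eps=0.2` and `beta=0.04` are illustrative, and the function names are hypothetical.

```python
import math

def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_eps=0.2, beta=0.04):
    """Clipped surrogate minus KL penalty for a single token."""
    ratio = math.exp(logp_new - logp_old)               # importance ratio
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    surrogate = min(ratio * advantage, clipped * advantage)
    # Per-token KL estimate against the reference model (always >= 0).
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
    return surrogate - beta * kl

def grpo_objective(group, clip_eps=0.2, beta=0.04):
    """group: list of (tokens, advantage) pairs, one per completion, where
    tokens is a list of (logp_new, logp_old, logp_ref) triples.
    Averages over tokens within each completion, then over the group."""
    per_completion = []
    for tokens, advantage in group:
        vals = [grpo_token_objective(n, o, r, advantage, clip_eps, beta)
                for n, o, r in tokens]
        per_completion.append(sum(vals) / len(vals))
    return sum(per_completion) / len(per_completion)

# A token whose probability rose by a factor of e under a positive advantage:
# the surrogate is capped at (1 + clip_eps) * advantage.
capped = grpo_token_objective(-1.0, -2.0, -2.0, advantage=1.0)
print(capped)
```

The objective is maximized, so a training loop would minimize its negative. Note that every token in a completion shares the same advantage; only the ratio and the KL term differ by position.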
Interactive: Understanding GRPO Advantage
The core of GRPO is the relative scoring within a group. Consider a batch with $G = 4$ completions for the same prompt:

| Completion | Reward $r_i$ | Advantage $\hat{A}_i$ |
|---|---|---|
| A | 0.9 | +1.30 |
| B | 0.7 | +0.56 |
| C | 0.4 | -0.56 |
| D | 0.2 | -1.30 |

The group mean is 0.55 and the (population) standard deviation is about 0.27. Completions A and B are reinforced; C and D are suppressed. Now suppose we change only completion D's reward from 0.2 to 0.8 while keeping A's reward at 0.9:

| Completion | Reward $r_i$ | Advantage $\hat{A}_i$ |
|---|---|---|
| A | 0.9 | +1.07 |
| B | 0.7 | 0 |
| C | 0.4 | -1.60 |
| D | 0.8 | +0.53 |

The group mean is now 0.70 and the standard deviation is about 0.19.

Key insight: Completion A's reward stayed at 0.9, but its advantage fell from +1.30 to +1.07 because the group mean rose from 0.55 to 0.70, shrinking A's lead by more than the tighter spread amplifies it. Completion B, whose reward now equals the group mean, went from reinforced to neutral without its reward changing at all. The advantage is a relative, not absolute, measure: it depends on all completions in the group, not just the individual reward. This relative nature is why GRPO creates strong contrastive gradients for reasoning tasks: correct solutions look better when directly compared with incorrect solutions from the same prompt.
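The two groups can be recomputed directly. This sketch assumes the population standard deviation; with the sample standard deviation the numbers shift but the signs and ordering stay the same. The dictionary layout and helper name are illustrative.

```python
import statistics

def group_advantages(rewards):
    """Standardize a {name: reward} dict with the population std (assumed)."""
    values = list(rewards.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return {name: (r - mean) / std for name, r in rewards.items()}

before = {"A": 0.9, "B": 0.7, "C": 0.4, "D": 0.2}
after = dict(before, D=0.8)          # only D's reward changes

adv_before = group_advantages(before)
adv_after = group_advantages(after)

for name in before:
    print(f"{name}: {adv_before[name]:.2f} -> {adv_after[name]:.2f}")
```

Every completion's advantage moves even though only D's reward was edited, which is the sense in which the signal is relative to the group.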
Why GRPO works for reasoning tasks
The group sampling structure makes GRPO particularly effective for tasks with verifiable correct answers: mathematics, formal proofs, code execution, and structured reasoning. In these settings:
- The reward signal is binary or near-binary (correct / incorrect), making reward models unnecessary — a rule-based verifier suffices.
- Multiple completions per prompt are natural: the model generates several candidate solutions, and the verifier checks each. In a group that contains both outcomes, correct solutions have $\hat{A}_i > 0$ and incorrect ones have $\hat{A}_i < 0$.
- The contrastive signal between correct and incorrect solutions in the same group is a strong learning signal: the model simultaneously reinforces correct reasoning paths and suppresses incorrect ones from the same prompt.
This is the structure that enabled DeepSeek-R1's reasoning capabilities: by training on mathematical problems with verifiable answers using GRPO, the model learned to generate extended, self-correcting chain-of-thought reasoning, a behavior that emerges from the reinforcement signal rather than being explicitly supervised.
Credit assignment: sequence-level vs token-level
The sequence-level vs token-level distinction is a fundamental axis along which alignment methods differ, and it has direct implications for what behaviors the model can and cannot learn.
Sequence-level (DPO)
The preference label is applied to the entire response. The gradient signal is a uniform rescaling of token log-probabilities throughout the response by the same factor. Every token in a preferred response is equally reinforced; every token in a dispreferred response is equally suppressed. This is appropriate when the preference is a global property of the response (style, tone, factual accuracy uniformly distributed) but breaks down when the preference is driven by a specific failure — a hallucinated fact in token 50, an unsafe statement in token 120 — embedded in an otherwise acceptable response.
Token-level (PPO, GRPO)
The update is computed per token, so the gradient signal can differ across positions within a single response. PPO's critic produces an advantage that varies from token to token. GRPO applies the group-normalized advantage of a response to each of its tokens, so with outcome-only rewards the position-specific signal comes from the per-token importance ratio, clipping, and KL penalty rather than from the advantage itself. In practice, PPO's token-level credit assignment enables it to learn to avoid specific failure modes within responses in ways that DPO cannot, at the cost of requiring either a learned critic (PPO) or a group sampling overhead (GRPO).
Synthesis: choosing an alignment method
| | PPO (RLHF) | DPO | GRPO |
|---|---|---|---|
| Reward model needed | Yes | No (implicit) | Yes / rule-based |
| Value critic needed | Yes | No | No |
| Networks in training | 4 | 2 | 2–3 |
| Credit assignment | Token-level | Sequence-level | Token-level |
| Stability | Moderate | High | Moderate |
| Best setting | General alignment | Low-compute alignment | Verifiable reasoning |
Exercise: Comparing DPO and GRPO
- Analytical: Derive the SimPO loss from the DPO loss by applying sequence-length normalization.
- Implementation: Given the GRPO advantage formula for a group of completions, compute the advantages for the following group: . Now add a KL penalty term: modify the advantage to where is the per-completion KL divergence from the reference model. If and the KL values are , recompute the advantages. Which completions change sign, and why?
- Reasoning: Explain why GRPO is inherently more robust to reward model hallucination than PPO when using a rule-based verifier.
The practical choice among these methods depends primarily on two factors: compute budget and task structure. DPO is the default choice when the preference dataset is high-quality and the alignment target is a global property of responses (helpfulness, harmlessness). GRPO is preferred when the task has verifiable rewards and the goal is to train reasoning capabilities that require credit assignment at the reasoning-step level. PPO remains the most principled option when the reward model is reliable, the task requires fine-grained credit assignment, and compute is available.
Key takeaways
DPO rederives the RLHF alignment problem by inverting the closed-form KL-regularized optimal policy to express the reward as a log-ratio of policy to reference probabilities. Substituting this into the Bradley-Terry preference model yields a binary cross-entropy loss on preference pairs that requires no reward model, no RL loop, and only two networks. The parameter $\beta$ plays the role of the KL penalty in RLHF. DPO's limitation is sequence-level credit assignment: it cannot attribute preference outcomes to specific tokens within a response. GRPO retains the RL loop and explicit reward signal but replaces the learned value critic with a group-normalized empirical baseline, eliminating the critic network. The contrastive structure of group sampling makes GRPO particularly effective for reasoning tasks with verifiable answers. SimPO extends DPO by removing the reference model and adding length normalization. The sequence-level vs token-level credit assignment axis is the principal theoretical distinction among these methods, with practical implications for which failure modes each can address.
Conceptual questions
- Derive the DPO loss function from scratch: start from the KL-regularized RLHF objective, write down its closed-form optimal policy, invert to express $r(x, y)$ in terms of the policy ratio, and substitute into the Bradley-Terry preference model. At what step do the intractable partition functions cancel, and why is this cancellation exact? What property of the Bradley-Terry model enables it?
- The DPO gradient scales each preference pair update by a factor $\sigma(-u)$, where $u$ is the $\beta$-scaled log-ratio difference between the preferred and dispreferred responses. For a pair where the model already correctly predicts the preference with high confidence ($u \gg 0$), $\sigma(-u) \approx 0$. Explain why this weighting is beneficial (analogous to hard example mining) and identify a failure mode it could cause if the training dataset contains mislabeled preference pairs.
- GRPO samples $G$ completions per prompt and standardizes their rewards to compute advantages. If $G = 1$, what does the advantage equal, and what does the GRPO gradient become? If all completions receive identical rewards (e.g., all correct or all incorrect), what happens to the gradient? What does this imply about the minimum dataset and sampling conditions required for GRPO to produce a useful signal?
- A model trained with DPO on a preference dataset where the preferred response is always longer than the dispreferred response will develop a length bias. Explain mechanistically why this happens using the sequence-level log-probability sum, and show how SimPO's length normalization resolves it. Does PPO-RLHF suffer from the same bias? Why or why not?
- DeepSeek-R1 uses GRPO with a rule-based verifier (checking mathematical correctness) rather than a learned reward model. Compare this to the standard RLHF reward model approach from Week 12. Under what conditions does a rule-based verifier provide a more reliable training signal than a learned RM, and what category of tasks is structurally excluded from rule-based verification? For a task that cannot use rule-based verification, describe how you would design the reward signal and why GRPO may still be preferable to PPO.
Looking ahead
The final lecture brings the course full circle by asking how the aligned models, planning algorithms, and RL foundations developed across all thirteen weeks combine into deployed agentic systems.
Week 14: Agentic Systems. We examine how tool-using LLMs instantiate the MDP formalism from Week 1, how hierarchical and compositional task structures map onto RL sub-problems developed throughout the course, and what the remaining open problems in agentic AI are from a reinforcement learning perspective.
Further reading
- Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS. (DPO).
- Shao, Z., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv. (Introduced GRPO).
- Meng, Y., et al. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward. arXiv. (SimPO).