Purpose of this lecture
The entire theoretical apparatus developed in Weeks 1–11 (MDPs, value functions, policy gradients, actor-critics, PPO, offline RL) was built around the assumption that the reward function is given. An agent playing Atari receives a score from the game engine; a robot receives a distance-to-goal signal from its simulator; a Q-learning agent on a gridworld receives a positive reward for reaching the goal.
For language models, this assumption fails. There is no function that a programmer can write to evaluate whether a response to a prompt is helpful, honest, and harmless. The property being optimized is a human preference — a subjective, context-dependent judgment that cannot be reduced to a closed-form expression. Next-token prediction (cross-entropy on a text corpus) produces a fluent language model, but fluency and alignment are distinct: a model can be highly fluent while being sycophantic, misleading, harmful, or evasive.
Reinforcement Learning from Human Feedback (RLHF) is the approach that bridges this gap: learn a reward model from human preference data, then optimize the language model against that reward model using RL. This lecture develops the full pipeline (SFT, reward modeling, KL-regularized PPO) and connects each component to the RL theory developed throughout the course.
The three-stage RLHF pipeline
Stage 1: Supervised fine-tuning
A base language model pretrained on web-scale text has learned to predict the statistical distribution of text on the internet, including low-quality, unhelpful, and unsafe text. Before applying RL, the model must be shaped into a format suitable for the alignment task.
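Supervised Fine-Tuning (SFT) trains the base model on a curated dataset $\mathcal{D}_{\text{SFT}}$ of high-quality prompt–response pairs $(x, y)$ written or approved by human annotators:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{SFT}}} \left[ \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$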
SFT is standard next-token prediction on the curated dataset. Its purpose is not alignment but format shaping: after SFT, the model produces responses that look like helpful answers to prompts, rather than continuations of arbitrary web text. This provides a reasonable starting point for reward optimization and defines the reference distribution for the KL penalty in Stage 3.
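As a minimal sketch of the next-token objective described above, the snippet below computes the mean negative log-likelihood of a response given a prompt. The `toy_model` function, its three-token vocabulary, and the convention of scoring only response tokens (not prompt tokens) are illustrative assumptions, not details taken from the lecture.

```python
import math

def toy_model(context):
    """Hypothetical stand-in for a language model: returns a next-token
    distribution that depends only on the last token of the context."""
    table = {
        "Q": {"A": 0.7, "B": 0.2, "<eos>": 0.1},
        "A": {"A": 0.1, "B": 0.3, "<eos>": 0.6},
        "B": {"A": 0.3, "B": 0.1, "<eos>": 0.6},
    }
    return table[context[-1]]

def sft_loss(model, prompt, response):
    """Mean negative log-likelihood of the response tokens.
    Prompt tokens are conditioned on but not scored (an assumed convention)."""
    if not response:
        raise ValueError("response must contain at least one token")
    context = list(prompt)
    nll = 0.0
    for token in response:
        probs = model(context)          # distribution over the next token
        nll -= math.log(probs[token])   # cross-entropy against the target token
        context.append(token)           # teacher forcing: feed the target back in
    return nll / len(response)

loss = sft_loss(toy_model, prompt=["Q"], response=["A", "<eos>"])
print(loss)
```

In a real pipeline the per-token probabilities would come from a neural network and the loss would be averaged over a batch, but the quantity being minimized has the same shape.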
The SFT model is also the foundation for the reward model: rather than training a reward model from scratch, the RM is initialized from the SFT model $\pi_{\text{SFT}}$, with the final token prediction head replaced by a scalar regression head.
Stage 2: Reward modeling
The core challenge of RLHF is learning $r_\phi(x, y)$: a function mapping a prompt–response pair to a scalar representing human preference. Two design choices make this tractable.
Pairwise comparisons over absolute scores
Human annotators are poor at assigning absolute quality scores (one annotator's 7 is another's 9), but are reliable at relative ranking: given two responses $y_1$ and $y_2$ to the same prompt $x$, annotators can consistently identify which is better. This shifts the learning problem from regression on absolute quality to classification on pairwise preferences.
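The preference dataset takes the form:

$$\mathcal{D} = \left\{ \left( x^{(i)}, y_w^{(i)}, y_l^{(i)} \right) \right\}_{i=1}^{N}$$

where each triple records that the human preferred $y_w$ (the chosen response) over $y_l$ (the rejected response) for prompt $x$, written $y_w \succ y_l$.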
The Bradley-Terry preference model
The probability that a human prefers $y_w$ over $y_l$ is modeled as:

$$P(y_w \succ y_l \mid x) = \sigma\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right)$$

where $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic sigmoid. This is the Bradley-Terry model (Bradley and Terry, 1952), originally developed for analyzing paired-comparison experiments and widely used for ranking competitors such as sports teams. Its key property is that preferences are governed entirely by the difference in latent rewards, so the absolute reward scale is unidentified. This matches the human annotation setting: a human judges relative quality, and the model only needs to produce an ordering, not calibrated absolute scores.
The Bradley-Terry model's application to human preference learning assumes that human judgments form a total order (transitivity: if A > B and B > C, then A > C) that is consistent and independent of context. In practice, human preferences on complex reasoning tasks (math, code, scientific writing) are often intransitive, context-dependent, and irreproducible across annotators. The model's popularity in RLHF stems not from empirical validation of its assumptions but from mathematical convenience: it leads to a tractable maximum likelihood objective and has been validated post-hoc on simpler preference tasks (summarization quality, toxicity avoidance). For alignment tasks (honesty, helpfulness, safety), the gap between the model's assumptions and human judgment is substantial and remains underexplored.
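The reward model is trained by maximum likelihood on the preference dataset:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]$$

This is binary cross-entropy where the positive label is $y_w \succ y_l$ and the negative label is $y_l \succ y_w$, with logits given by the reward difference. The loss approaches its minimum as $r_\phi(x, y_w) \gg r_\phi(x, y_l)$ for all preference pairs in the dataset.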
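The pairwise objective can be sketched in a few lines. This is a minimal illustration that takes precomputed scalar rewards for the chosen and rejected responses as plain floats; the numeric values are made up, and a real reward model would produce them from a network. It also checks the shift invariance noted above: adding a constant to every reward leaves the loss unchanged.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def preference_prob(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return sigmoid(r_chosen - r_rejected)

def pairwise_loss(pairs):
    """Mean negative log-likelihood over (r_chosen, r_rejected) reward pairs."""
    total = sum(math.log(preference_prob(rw, rl)) for rw, rl in pairs)
    return -total / len(pairs)

# Hypothetical reward-model outputs for three preference pairs.
pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
shifted = [(rw + 5.0, rl + 5.0) for rw, rl in pairs]

loss = pairwise_loss(pairs)
loss_shifted = pairwise_loss(shifted)
print(loss, loss_shifted)
```

Because only reward differences enter the loss, the two printed values agree up to floating-point error, which is the identifiability property in concrete form.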
Reward model architecture
In practice, the RM is initialized from $\pi_{\text{SFT}}$ with the language model head replaced by a linear layer that outputs a scalar. This leverages the SFT model's understanding of language quality as a strong initialization. The RM receives the full prompt–response pair as a single sequence and returns a scalar at the final token position, trained via the pairwise loss above.
Stage 3: Policy optimization with PPO
With a reward model in hand, LLM optimization becomes a standard RL problem. The MDP is defined as:
| MDP component | RLHF interpretation |
|---|---|
| State $s_t$ | Prompt + all tokens generated so far |
| Action $a_t$ | Next token selected from vocabulary |
| Transition | Appending selected token (deterministic) |
| Reward | $r_\phi(x, y)$ at end of generation; $0$ at intermediate steps |
| Policy $\pi_\theta$ | The language model (token distribution) |
This is an episodic MDP with a delayed terminal reward: the agent generates a full response token by token, and the reward model evaluates the complete response. The episode length is the response length $T$, which is variable.
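The token-level MDP can be made concrete with a toy rollout. In this sketch, `toy_policy` (uniform over a three-token vocabulary) and `toy_reward_model` (a simple token count) are hypothetical stand-ins for the language model and the learned reward model; only the structure of states, actions, transitions, and the delayed reward follows the description above.

```python
import random

VOCAB = ["good", "bad", "<eos>"]

def toy_policy(state, rng):
    """Hypothetical policy: uniform over the vocabulary, ignoring the state."""
    return rng.choice(VOCAB)

def toy_reward_model(prompt, response):
    """Hypothetical scalar score: count of 'good' minus count of 'bad'."""
    return float(response.count("good") - response.count("bad"))

def rollout(prompt, policy, reward_model, max_len=8, seed=0):
    rng = random.Random(seed)
    state = tuple(prompt)               # state: prompt + tokens generated so far
    response, rewards = [], []
    for _ in range(max_len):
        action = policy(state, rng)     # action: the next token
        state = state + (action,)       # deterministic transition: append token
        response.append(action)
        done = action == "<eos>" or len(response) == max_len
        # Zero reward at intermediate steps; reward-model score at the last step.
        rewards.append(reward_model(prompt, response) if done else 0.0)
        if done:
            break
    return response, rewards

prompt = ["How", "are", "you", "?"]
response, rewards = rollout(prompt, toy_policy, toy_reward_model)
print(response, rewards)
```

The episode length varies with when the end-of-sequence token is sampled, and every entry of `rewards` except the last is zero.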
The KL-regularized RLHF objective
Optimizing directly against $r_\phi$ without constraint produces reward hacking: the policy finds responses that score highly under $r_\phi$ but are not actually preferred by humans. This is the extrapolation error from Week 11: the RM was trained only on responses from the SFT model's distribution, so its predictions are unreliable far from that distribution. A model optimizing unconstrained against the RM discovers out-of-distribution (OOD) text patterns that exploit the RM's failure modes.
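The fix is a KL divergence penalty anchoring the optimized policy to the SFT reference model:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) \right] - \beta \, \mathbb{E}_{x \sim \mathcal{D}} \left[ D_{\text{KL}}\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \right) \right]$$

where $\beta > 0$ controls the penalty strength. The KL term is computed token by token and summed over the response:

$$D_{\text{KL}}\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \right) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{T} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{SFT}}(y_t \mid x, y_{<t})} \right]$$

This is precisely behavior regularization (Week 11) with the behavior policy set to $\pi_{\text{SFT}}$: it prevents the policy from straying into the OOD region where the reward model provides unreliable signal. The connection to offline RL is exact: RLHF is offline RL on a preference dataset, with $\pi_{\text{SFT}}$ as the behavior policy.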
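The two ways of computing the KL term can be sketched side by side: the exact divergence between two next-token distributions at a single position, and the summed log-ratio over the tokens of one sampled response, which estimates the sequence-level KL when averaged over samples from the policy. All probabilities below are made-up illustrative values, and `pi_ref` in the comments denotes the SFT reference model.

```python
import math

def token_kl(p, q):
    """Exact KL(p || q) between two next-token distributions given as dicts."""
    return sum(p[tok] * math.log(p[tok] / q[tok]) for tok in p if p[tok] > 0.0)

def sampled_log_ratio(logp_policy, logp_ref):
    """Sum over the tokens of one sampled response of
    log pi_theta(y_t | ...) - log pi_ref(y_t | ...)."""
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob lists must have the same length")
    return sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))

# Hypothetical next-token distributions at a single position.
policy_dist = {"yes": 0.6, "no": 0.3, "maybe": 0.1}
ref_dist = {"yes": 0.4, "no": 0.4, "maybe": 0.2}
kl_one_step = token_kl(policy_dist, ref_dist)

# Hypothetical log-probs of the tokens actually sampled in a 3-token response.
logp_policy = [math.log(0.5), math.log(0.4), math.log(0.9)]
logp_ref = [math.log(0.25), math.log(0.4), math.log(0.6)]
kl_sample = sampled_log_ratio(logp_policy, logp_ref)

print(kl_one_step, kl_sample)
```

A single sampled log-ratio can be negative even though the expectation is nonnegative, which is one reason practical implementations sometimes use alternative estimators.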
KL-regularized objective: the closed-form solution
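The KL-regularized objective has an analytically tractable optimal policy. For any fixed reward function $r(x, y)$, the solution to:

$$\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)} \left[ r(x, y) \right] - \beta \, D_{\text{KL}}\left( \pi(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \right)$$

is the Boltzmann distribution:

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{SFT}}(y \mid x) \exp\left( \frac{r(x, y)}{\beta} \right)$$

where $Z(x) = \sum_{y} \pi_{\text{SFT}}(y \mid x) \exp\left( r(x, y) / \beta \right)$ is the partition function. This is the same Boltzmann optimal policy as in maximum entropy RL (Week 8, SAC), with $\beta$ playing the role of temperature $\alpha$. The optimal policy concentrates on high-reward responses while maintaining support near the reference distribution. This closed-form solution is the starting point for deriving Direct Preference Optimization (DPO, Week 13), which bypasses the reward model entirely by solving for $r$ in terms of $\pi^*$ and $\pi_{\text{SFT}}$.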
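For a finite set of candidate responses, the closed-form solution can be evaluated directly. The sketch below assumes three candidate responses with made-up reference probabilities and rewards, computes the reweighted policy (reference probability times the exponentiated reward divided by the penalty coefficient, then normalized), and then inverts the relation to recover reward differences from the two policies, which is the step the DPO derivation builds on.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta) over a finite set."""
    # Subtract the largest exponent for numerical stability; it cancels on normalizing.
    m = max(r / beta for r in rewards)
    weights = [p * math.exp(r / beta - m) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical reference probabilities and rewards for three candidate responses.
ref = [0.7, 0.2, 0.1]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star = optimal_policy(ref, rewards, beta)

# Inverting the closed form: beta * log(pi*/pi_ref) equals r minus a constant
# (beta * log Z), so differences between entries recover reward differences.
implied = [beta * math.log(p / q) for p, q in zip(pi_star, ref)]
gap = implied[2] - implied[0]

print(pi_star, gap)
```

The recovered gap matches the reward difference between the third and first responses, while the additive constant involving the partition function cancels.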
PPO in the RLHF context
In practice, the KL-regularized objective is optimized with PPO (Weeks 7–8). The token-level reward signal for PPO is constructed as:

$$r_t = \begin{cases} -\beta \log \dfrac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{SFT}}(y_t \mid x, y_{<t})} & t < T \\[2ex] r_\phi(x, y) - \beta \log \dfrac{\pi_\theta(y_T \mid x, y_{<T})}{\pi_{\text{SFT}}(y_T \mid x, y_{<T})} & t = T \end{cases}$$

The terminal reward is the RM score minus the KL penalty at the final token; all intermediate steps receive only the per-token KL penalty. PPO's clipped surrogate objective and GAE are then applied to this reward signal, with the language model serving as the actor and a separate value head (or a copy of the language model with an added scalar head) serving as the critic.

This formulation was popularized by Ouyang et al. (2022) in the InstructGPT paper, which demonstrated that supervised fine-tuning followed by reward modeling and PPO optimization produces significant improvements in human preference ratings over the base SFT model. The empirical gains, especially in reducing harmful outputs and improving instruction-following, validated the RLHF pipeline as a practical approach to alignment, though the paper's ablations did not isolate the contribution of each stage, leading to continued debate about the necessity of the RM stage versus simpler preference optimization.
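The reward construction described here (a per-token KL penalty at every step, with the reward-model score added at the final token) can be sketched as follows. The log-probabilities, the reward-model score, and the penalty coefficient are made-up illustrative values.

```python
def token_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token rewards for PPO: -beta * log-ratio at every step,
    with the reward-model score added at the final token."""
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("need two non-empty log-prob lists of equal length")
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

# Hypothetical per-token log-probs under the policy and the reference model.
logp_policy = [-0.5, -1.0, -0.2]
logp_ref = [-0.7, -1.0, -0.9]

rewards = token_rewards(rm_score=2.0, logp_policy=logp_policy,
                        logp_ref=logp_ref, beta=0.1)
print(rewards)
```

Summing the per-token rewards gives the reward-model score minus the penalty coefficient times the summed log-ratio, so the undiscounted return of the episode equals the sequence-level penalized reward for that sample.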
The full RLHF training loop therefore requires four networks in memory simultaneously: the policy $\pi_\theta$ (actor), the value function $V_\psi$ (critic), the SFT reference model $\pi_{\text{SFT}}$ (for KL computation), and the reward model $r_\phi$. The memory and compute demands of this four-network setup motivate the simpler preference optimization approaches in Week 13.
Limitations of vanilla RLHF
Overoptimization and Goodhart's Law
The reward model is a proxy for true human preference, not the true preference itself. As PPO pushes $\pi_\theta$ to maximize $r_\phi$, it eventually finds policies that score highly on the proxy while performing poorly on the true objective. This is Goodhart's Law: a measure used as a target ceases to be a good measure.
In practice, overoptimization produces characteristic failure modes: responses become verbosely padded to cover all angles (the RM rewards completeness), excessively hedged (the RM rewards safety language), or sycophantically agreeable (the RM rewards responses that match perceived user preferences). The gap between the proxy reward and true preference is often visualized by plotting RM score (increasing) against human evaluation score (increasing then decreasing) as a function of KL divergence from $\pi_{\text{SFT}}$: the classic overoptimization curve (Gao et al., 2023). This empirical observation of preference-reward divergence has been validated in multiple settings (summarization, instruction-following, mathematical reasoning) and represents one of the sharpest critiques of vanilla RLHF. The scaling laws for overoptimization (how quickly the gap emerges as a function of model size, dataset size, and KL penalty) remain only partially understood.
Reward model ensembles partially mitigate overoptimization: train $K$ independent reward models and optimize against their minimum (pessimistic ensemble), average, or a penalized version that accounts for ensemble disagreement. The minimum-of-ensemble approach has the same structure as TD3's clipped double Q-learning (Week 8) and CQL's conservative Q-function (Week 11): all are applications of pessimism under uncertainty.
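The three aggregation rules mentioned above can be sketched for a single response scored by several reward models. The scores are made-up values, and the specific penalized form used here (mean minus a multiple of the population standard deviation) is one plausible choice assumed for illustration, not a form prescribed by the lecture.

```python
import statistics

def aggregate(scores, mode="min", penalty=1.0):
    """Combine the scores of several reward models for one response."""
    if mode == "min":
        return min(scores)                    # pessimistic ensemble
    if mode == "mean":
        return statistics.fmean(scores)       # plain average
    if mode == "mean_minus_std":
        # Assumed penalized form: the average, reduced by ensemble disagreement.
        return statistics.fmean(scores) - penalty * statistics.pstdev(scores)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical scores from three reward models for two responses.
agree = [1.0, 1.1, 0.9]        # the models agree: likely in-distribution
disagree = [3.0, 0.2, -0.5]    # the models disagree: possibly exploiting one RM

for mode in ("min", "mean", "mean_minus_std"):
    print(mode, aggregate(agree, mode), aggregate(disagree, mode))
```

The two responses have similar average scores, but both pessimistic rules rank the high-disagreement response well below the one the ensemble agrees on.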
Distributional mismatch in preference data
The RM is trained on preference pairs from the SFT model's distribution. After PPO fine-tuning, the policy's distribution shifts, potentially producing responses that fall outside the RM's training distribution. The RM's predictions become less reliable precisely in the region the policy most wants to exploit. This is the exact offline RL distributional shift problem (Week 11) applied in the RLHF context.
One mitigation is iterative RLHF: alternate between (a) collecting new preference data from the current $\pi_\theta$ and (b) updating the reward model on the expanded dataset. Each iteration keeps the RM's training distribution close to the current policy's output distribution, reducing distributional mismatch.
The alignment tax
RLHF often degrades performance on tasks that were not in the preference dataset. A model fine-tuned for conversational helpfulness may show reduced performance on mathematical reasoning, code generation, or factual recall benchmarks. This tradeoff is the alignment tax: alignment-focused RLHF specializes the model toward preference-annotated behaviors at the cost of general capability.
The alignment tax is partially a consequence of the SFT stage: fine-tuning on a narrow curated dataset risks forgetting the breadth of pretraining knowledge. Approaches including PPO with a cross-entropy auxiliary loss (preventing forgetting), careful learning rate scheduling, and mixing SFT and RL gradient updates have reduced but not eliminated the alignment tax in practice.
Variants and extensions
Constitutional AI (CAI) (Bai et al., 2022) replaces human preference annotators with the model itself: the model critiques its own responses according to a set of principles (the "constitution") and generates revised responses. A reward model is then trained on these model-generated preference pairs rather than human-annotated ones. CAI reduces the cost of preference data collection and enables alignment at scales where human annotation is impractical. The core contribution of CAI is demonstrating that AI-generated preference pairs can approximate human preferences at a fraction of the cost, shifting the alignment bottleneck from annotation to constitution design. However, the approach replaces the question "what do humans prefer?" with "what does our constitution imply?", shifting rather than solving the preference specification problem.
RLAIF (RL from AI Feedback) generalizes CAI: the evaluator model can be a separate, stronger LLM. Preferences are generated by the evaluator rather than humans, then used to train the reward model for the policy. RLAIF can be applied iteratively: the policy improves, the evaluator generates harder preference pairs, the RM improves, the policy improves further. This self-improvement loop has been reported to reach human-competitive alignment at reduced annotation cost. As of 2025–2026, AI-feedback pipelines have been scaled to large models, and iterative training on synthetic preference data has been reported to rival or exceed human-annotated RLHF on some reasoning tasks. DeepSeek-R1 is often cited in this context, although its reasoning-stage rewards are largely rule-based rather than produced by an AI evaluator.
Key takeaways
The RLHF pipeline maps the alignment problem onto standard RL components. SFT shapes the base model into a format suitable for preference optimization and defines the reference distribution. Reward modeling translates pairwise human preferences into a scalar via the Bradley-Terry model and maximum likelihood estimation, avoiding the noise and inconsistency of absolute human scoring. KL-regularized PPO optimizes the policy against the RM while constraining it to the SFT distribution, connecting RLHF directly to offline RL behavior regularization. The closed-form KL-regularized optimal policy, a Boltzmann distribution over the reference model weighted by reward, is the theoretical centerpiece that links RLHF to maximum entropy RL and enables the DPO derivation. Overoptimization, distributional mismatch, and the alignment tax are the principal failure modes of vanilla RLHF, each with mitigation strategies rooted in the RL theory developed throughout the course.
Conceptual questions
- The Bradley-Terry model assumes $P(y_w \succ y_l \mid x) = \sigma\left( r(x, y_w) - r(x, y_l) \right)$. This model has an identifiability property: adding a constant to all rewards leaves all preference probabilities unchanged. Explain why this means that RLHF cannot learn the absolute scale of the reward function, only the relative ordering. Does this matter for policy optimization? What additional assumption is required to compare the reward of responses to different prompts?
- The KL-regularized RLHF objective has the closed-form optimal policy $\pi^*(y \mid x) \propto \pi_{\text{SFT}}(y \mid x) \exp\left( r(x, y) / \beta \right)$. Show what happens to this policy as $\beta \to 0$ and $\beta \to \infty$. For a fixed RM $r_\phi$, describe the qualitative behavior of the optimized policy at each extreme. Why does large $\beta$ reduce overoptimization, and what is the cost?
- The PPO implementation of RLHF requires four networks (policy, critic, reference model, reward model). For a 7B parameter model, estimate the minimum GPU memory required if all four are loaded simultaneously in FP16. Then explain why this memory constraint motivates approaches that eliminate the reward model (DPO) or the separate critic (GRPO). What architectural compromise does each method make?
- Overoptimization experiments (Gao et al., 2023) use a gold reward model (a separate, held-out RM trained on additional human data) to evaluate whether the proxy RM score correlates with true preference as KL divergence from $\pi_{\text{SFT}}$ increases. The proxy RM score increases monotonically while the gold RM score peaks and then decreases. Explain this divergence in terms of the distributional shift between the RM's training distribution and the policy's output distribution during PPO. What does the peak of the gold RM curve represent, and how would you detect this peak during training without access to a gold RM?
- Constitutional AI uses model-generated preference pairs rather than human annotations. Identify two ways this approach could fail to produce a well-aligned model even if the constitution is well-specified: one failure mode related to the quality of the evaluator model, and one related to the diversity of the generated preference pairs. Propose a modification to the CAI pipeline that addresses each failure mode.
Looking ahead
The RLHF pipeline is powerful but expensive: four networks, human annotation, and a PPO training loop that requires careful hyperparameter tuning.
Week 13: Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). We derive DPO's reparameterization of the reward model in terms of the optimal policy, which eliminates the reward model entirely, and study GRPO's removal of the critic, tracing what each simplification gives up and what it gains.
Further reading
- Christiano, P. F., et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS. (The conceptual origin of modern RLHF).
- Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS. (InstructGPT / OpenAI's RLHF paper).
- Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv. (Anthropic's RLAIF framework).
- Gao, L., et al. (2023). Scaling Laws for Reward Model Overoptimization. ICML. (Empirical study of Goodhart's law in RLHF).