
Week 12: Reinforcement Learning from Human Feedback

✦Learning Outcomes
  • Describe the full RLHF (Reinforcement Learning from Human Feedback) pipeline: SFT, reward modeling, PPO
  • Analyze KL-regularized objectives and their role in RLHF
  • Understand reward hacking and over-optimization issues
  • Connect RLHF to modern LLM alignment techniques
◆Prerequisites
  • Week 11: Offline RL, distributional shift
  • Week 7: Policy gradient, PPO
  • Week 2: Bandits (for preference learning connections)

Recommended: Review Week 11 and Week 7 before proceeding.

Purpose of this lecture

The entire theoretical apparatus developed in Weeks 1–11 (MDPs, value functions, policy gradients, actor-critics, PPO, offline RL) was built around the assumption that the reward function is given. An agent playing Atari receives a score from the game engine; a robot receives a distance-to-goal signal from its simulator; a Q-learning agent on a gridworld receives $+1$ for reaching the goal.

For language models, this assumption fails. There is no function $r(x, y)$ that a programmer can write to evaluate whether a response $y$ to a prompt $x$ is helpful, honest, and harmless. The property being optimized is a human preference: a subjective, context-dependent judgment that cannot be reduced to a closed-form expression. Next-token prediction (cross-entropy on a text corpus) produces a fluent language model, but fluency and alignment are distinct: a model can be highly fluent while being sycophantic, misleading, harmful, or evasive.

Reinforcement Learning from Human Feedback (RLHF) is the approach that bridges this gap: learn a reward model from human preference data, then optimize the language model against that reward model using RL. This lecture develops the full pipeline (SFT, reward modeling, KL-regularized PPO) and connects each component to the RL theory developed throughout the course.


The three-stage RLHF pipeline

Stage 1: Supervised fine-tuning

A base language model pretrained on web-scale text has learned to predict the statistical distribution of text on the internet, including low-quality, unhelpful, and unsafe text. Before applying RL, the model must be shaped into a format suitable for the alignment task.

Supervised Fine-Tuning (SFT) trains the base model on a curated dataset of high-quality prompt–response pairs written or approved by human annotators:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{SFT}}}\!\left[\sum_t \log \pi_\theta(y_t \mid x, y_{<t})\right]$$

SFT is standard next-token prediction on the curated dataset. Its purpose is not alignment but format shaping: after SFT, the model $\pi_{\text{SFT}}$ produces responses that look like helpful answers to prompts, rather than continuations of arbitrary web text. This provides a reasonable starting point for reward optimization and defines the reference distribution for the KL penalty in Stage 3.
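As a minimal sketch of the SFT objective, the loss can be computed directly from the probabilities a model assigns to the reference tokens. The function name `sft_loss` and the toy probabilities are assumptions made for illustration; a real implementation would work on logits from a neural network.

```python
import math

def sft_loss(token_probs_per_example):
    """Average negative log-likelihood over a batch of responses.

    Each inner list holds pi_theta(y_t | x, y_<t) for the reference tokens
    of one curated response (hypothetical toy values, not model output).
    """
    total = 0.0
    for token_probs in token_probs_per_example:
        # Sum of token log-probabilities for one response, negated.
        total += -sum(math.log(p) for p in token_probs)
    return total / len(token_probs_per_example)

# Two toy responses with 3 and 2 tokens respectively.
batch = [[0.9, 0.8, 0.95], [0.5, 0.6]]
loss = sft_loss(batch)
print(f"SFT loss: {loss:.4f}")
```

The loss is zero only when every reference token receives probability 1, and it grows as the model assigns less mass to the curated responses.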

The SFT model is also the foundation for the reward model: rather than training a reward model from scratch, the RM is initialized from $\pi_{\text{SFT}}$ with the final token prediction head replaced by a scalar regression head.

Stage 2: Reward modeling

The core challenge of RLHF is learning $r_\phi(x, y)$: a function mapping a prompt–response pair to a scalar representing human preference. Two design choices make this tractable.

Pairwise comparisons over absolute scores

Human annotators are poor at assigning absolute quality scores (one annotator's 7 is another's 9), but are reliable at relative ranking: given two responses $y_1$ and $y_2$ to the same prompt $x$, annotators can consistently identify which is better. This shifts the learning problem from regression on absolute quality to classification on pairwise preferences.

The preference dataset takes the form:

$$\mathcal{D}_{\text{RM}} = \{(x_i, y^w_i, y^l_i)\}_{i=1}^N$$

where $y^w_i \succ y^l_i$ indicates that the human preferred $y^w_i$ over $y^l_i$ for prompt $x_i$.

The Bradley-Terry preference model

The probability that a human prefers $y^w$ over $y^l$ is modeled as:

$$P(y^w \succ y^l \mid x) = \sigma\!\left(r_\phi(x, y^w) - r_\phi(x, y^l)\right)$$

where $\sigma$ is the logistic sigmoid. This is the Bradley-Terry model (Bradley and Terry, 1952), originally developed for the statistical analysis of paired comparisons and widely used for ranking, for example in sports and games. Its key property is that preferences are governed entirely by the difference in latent rewards: adding a constant to every reward for a given prompt leaves all preference probabilities unchanged, so the absolute reward level is unidentified. This matches the human annotation setting: a human judges relative quality, and the model only needs to produce an ordering, not calibrated absolute scores.

The Bradley-Terry model's application to human preference learning assumes that human judgments form a total order (transitivity: if A > B and B > C, then A > C) that is consistent and independent of context. In practice, human preferences on complex reasoning tasks (math, code, scientific writing) are often intransitive, context-dependent, and irreproducible across annotators. The model's popularity in RLHF stems not from empirical validation of its assumptions but from mathematical convenience: it leads to a tractable maximum likelihood objective and has been validated post-hoc on simpler preference tasks (summarization quality, toxicity avoidance). For alignment tasks (honesty, helpfulness, safety), the gap between the model's assumptions and human judgment is substantial and remains underexplored.

The reward model is trained by maximum likelihood on the preference dataset:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y^w, y^l) \sim \mathcal{D}_{\text{RM}}}\!\left[ \log \sigma\!\left(r_\phi(x, y^w) - r_\phi(x, y^l)\right) \right]$$

This is binary cross-entropy on the event "$y^w$ is preferred", with label 1 and logit equal to the reward difference $r_\phi(x, y^w) - r_\phi(x, y^l)$. The loss decreases monotonically as that margin grows, so training pushes $r_\phi(x, y^w)$ above $r_\phi(x, y^l)$ for every preference pair in the dataset.
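The pairwise loss is short enough to sketch in plain Python. The helper names below (`bt_preference_prob`, `rm_loss`) and the example scores are assumptions for illustration; in practice the scores would come from the reward model's scalar head.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bt_preference_prob(r_w, r_l):
    """Bradley-Terry probability that the response scored r_w is preferred."""
    return sigmoid(r_w - r_l)

def rm_loss(score_pairs):
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) score pairs."""
    def neg_log_sigmoid(d):
        # -log sigmoid(d) = log(1 + exp(-d)), written as a stable softplus.
        return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
    return sum(neg_log_sigmoid(rw - rl) for rw, rl in score_pairs) / len(score_pairs)

# Hypothetical RM scores for (preferred, rejected) responses.
pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
print(f"RM loss: {rm_loss(pairs):.4f}")
print(f"P(w > l) for first pair: {bt_preference_prob(*pairs[0]):.4f}")
```

Shifting both scores of a pair by the same constant leaves `bt_preference_prob` unchanged, which is the identifiability property discussed above.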

Reward model architecture

In practice, the RM is initialized from $\pi_{\text{SFT}}$ with the language model head replaced by a linear layer that outputs a scalar. This leverages the SFT model's understanding of language quality as a strong initialization. The RM receives the full prompt–response pair as a single sequence and returns a scalar $r_\phi \in \mathbb{R}$ at the final token position, trained via the pairwise loss above.

Stage 3: Policy optimization with PPO

With a reward model in hand, the LLM optimization becomes a standard RL problem. The MDP is defined as:

| MDP component | RLHF interpretation |
|---|---|
| State $s_t$ | Prompt $x$ + all tokens generated so far |
| Action $a_t$ | Next token selected from vocabulary |
| Transition $P$ | Appending selected token (deterministic) |
| Reward | $r_\phi(x, y)$ at end of generation; $0$ at intermediate steps |
| Policy $\pi_\theta$ | The language model (token distribution) |

This is an episodic MDP with a delayed terminal reward: the agent generates a full response $y$ token by token, and the reward model evaluates the complete response. The episode length is the response length $|y|$, which is variable.
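The token-level MDP can be made concrete with a toy rollout. Everything named here (`VOCAB`, `toy_policy`, `toy_reward_model`, `rollout`) is a hypothetical stand-in: the policy samples uniformly and the reward model just counts words, so the sketch shows only the episode structure, not a real language model.

```python
import random

VOCAB = ["good", "bad", "ok", "<eos>"]  # hypothetical toy vocabulary

def toy_policy(state, rng):
    """Stand-in for pi_theta: samples the next token uniformly at random."""
    return rng.choice(VOCAB)

def toy_reward_model(prompt, response):
    """Stand-in for r_phi(x, y): count of 'good' minus count of 'bad'."""
    return float(response.count("good") - response.count("bad"))

def rollout(prompt, rng, max_len=8):
    state = list(prompt)                  # s_0 is the prompt
    response, rewards = [], []
    for _ in range(max_len):
        action = toy_policy(state, rng)   # a_t: next token
        state = state + [action]          # deterministic transition: append
        response.append(action)
        rewards.append(0.0)               # zero reward at intermediate steps
        if action == "<eos>":
            break
    # The reward model scores the complete response once, at the last step.
    rewards[-1] = toy_reward_model(prompt, response)
    return response, rewards

response, rewards = rollout(["Explain", "RLHF"], random.Random(0))
print(response)
print(rewards)
```

The episode length varies from run to run, and all reward mass sits on the final token, which is what makes credit assignment hard for the critic.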

The KL-regularized RLHF objective

Optimizing $\pi_\theta$ directly against $r_\phi$ without constraint produces reward hacking: the policy finds responses that score highly under $r_\phi$ but are not actually preferred by humans. This is the extrapolation error from Week 11: the RM was trained only on responses from the SFT model's distribution, so its predictions are unreliable far from that distribution. A model optimizing unconstrained against the RM discovers OOD text patterns that exploit the RM's failure modes.

The fix is a KL divergence penalty anchoring the optimized policy to the SFT reference model:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) - \beta\, D_{\text{KL}}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x)\right) \right]$$

where $\beta > 0$ controls the penalty strength. The KL term is computed token by token and summed over the response:

$$D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}}) = \sum_t \sum_v \pi_\theta(v \mid x, y_{<t}) \log \frac{\pi_\theta(v \mid x, y_{<t})}{\pi_{\text{SFT}}(v \mid x, y_{<t})}$$

This is the behavior regularization of Week 11 with $\pi_\beta = \pi_{\text{SFT}}$: it prevents the policy from straying into the OOD region where the reward model provides unreliable signal. The analogy to offline RL is close rather than exact. The reward model is fit to a fixed preference dataset whose responses come from $\pi_{\text{SFT}}$, which plays the role of the behavior policy, but the PPO stage still samples fresh responses online from the current policy.
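The summed token-level KL can be sketched with next-token distributions stored as dictionaries. The names `token_kl` and `sequence_kl` and the two-token vocabulary are assumptions for illustration; the sketch also assumes the reference assigns nonzero probability wherever the policy does, and it evaluates the sum along one sampled prefix path.

```python
import math

def token_kl(p, q):
    """KL(p || q) between two next-token distributions (token -> probability)."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p if p[v] > 0.0)

def sequence_kl(policy_dists, ref_dists):
    """Sum of per-position KL terms along one generated response."""
    return sum(token_kl(p, q) for p, q in zip(policy_dists, ref_dists))

# Hypothetical distributions at two positions of a response.
policy_dists = [{"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}]
ref_dists = [{"a": 0.25, "b": 0.75}, {"a": 0.9, "b": 0.1}]

kl_total = sequence_kl(policy_dists, ref_dists)
print(f"summed token KL: {kl_total:.4f}")
```

Only the first position contributes here, because the policy and the reference agree exactly at the second position.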

KL-regularized objective: the closed-form solution

The KL-regularized objective has an analytically tractable optimal policy. For any fixed reward function $r$, the solution to:

$$\max_\pi\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[r(x,y)\right] - \beta\, D_{\text{KL}}(\pi \,\|\, \pi_{\text{ref}})$$

is the Boltzmann distribution:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \cdot \exp\!\left(\frac{r(x,y)}{\beta}\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp(r(x,y)/\beta)$ is the partition function. This is the same Boltzmann optimal policy as in maximum entropy RL (Week 8, SAC), with $\beta$ playing the role of temperature $\alpha$. The optimal policy concentrates on high-reward responses while maintaining support near the reference distribution. This closed-form solution is the starting point for deriving DPO (Week 13), which bypasses the reward model entirely by solving for $r$ in terms of $\pi^*$ and $\pi_{\text{ref}}$.
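For a response space small enough to enumerate, the closed-form policy can be checked numerically. The three responses, their rewards, and the function names below are invented for illustration; the sketch verifies that the Boltzmann policy attains objective value $\beta \log Z(x)$ and beats two alternative policies.

```python
import math

def optimal_policy(ref, reward, beta):
    """pi*(y) proportional to ref(y) * exp(reward(y) / beta)."""
    # Subtract the largest exponent for numerical stability.
    m = max(reward[y] / beta for y in ref)
    weights = {y: ref[y] * math.exp(reward[y] / beta - m) for y in ref}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def objective(pi, ref, reward, beta):
    """Expected reward minus beta * KL(pi || ref)."""
    expected_reward = sum(pi[y] * reward[y] for y in pi)
    kl = sum(pi[y] * math.log(pi[y] / ref[y]) for y in pi if pi[y] > 0.0)
    return expected_reward - beta * kl

# Hypothetical reference policy and rewards over three candidate responses.
ref = {"a": 0.5, "b": 0.3, "c": 0.2}
reward = {"a": 0.0, "b": 1.0, "c": 2.0}
beta = 1.0

pi_star = optimal_policy(ref, reward, beta)
log_z = math.log(sum(ref[y] * math.exp(reward[y] / beta) for y in ref))
other = {"a": 0.1, "b": 0.1, "c": 0.8}

print(pi_star)
print(objective(pi_star, ref, reward, beta), beta * log_z)
```

Calling `optimal_policy` with a very small `beta` puts nearly all mass on the highest-reward response, and a very large `beta` returns a policy close to `ref`.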

PPO in the RLHF context

In practice, the KL-regularized objective is optimized with PPO (Weeks 7–8). The token-level reward signal for PPO is constructed as:

$$\tilde{r}_t = \begin{cases} r_\phi(x, y) - \beta \log \dfrac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{SFT}}(y_t \mid x, y_{<t})} & t = |y| \\[2ex] -\beta \log \dfrac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{SFT}}(y_t \mid x, y_{<t})} & t < |y| \end{cases}$$

The terminal reward is the RM score minus the KL penalty at the final token; all intermediate steps receive only the per-token KL penalty. PPO's clipped surrogate objective and GAE are then applied to this reward signal, with the language model serving as the actor and a separate value head (or a copy of the language model with an added scalar head) serving as the critic. This recipe was popularized at scale by Ouyang et al. (2022) in the InstructGPT paper, which demonstrated that supervised fine-tuning followed by reward modeling and PPO optimization produces significant improvements in human preference ratings over the base SFT model. The empirical gains, especially in reducing harmful outputs and improving instruction-following, validated the RLHF pipeline as a practical approach to alignment, though the paper's ablations did not isolate the contribution of each stage, leading to continued debate about the necessity of the RM stage versus simpler preference optimization.
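The reward construction and the advantage estimate it feeds can be sketched as follows. The function names, the log-probability values, and the zero-valued critic are assumptions for illustration; `gae` is a plain implementation of generalized advantage estimation that treats the value after the final token as zero.

```python
def shaped_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token reward: -beta * log-ratio, plus the RM score on the last token."""
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("need equal-length, non-empty log-prob lists")
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation; values[t] is the critic's V(s_t)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # The episode ends after the last token, so its successor value is 0.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical log-probs of the sampled tokens under policy and reference.
logp_policy = [-1.0, -0.5]
logp_ref = [-1.0, -1.0]
rewards = shaped_rewards(rm_score=2.0, logp_policy=logp_policy,
                         logp_ref=logp_ref, beta=0.1)
advantages = gae(rewards, values=[0.0, 0.0], gamma=1.0, lam=1.0)
print(rewards)
print(advantages)
```

With an uninformative critic and `gamma = lam = 1`, each advantage reduces to the reward-to-go, so every token is credited with the terminal RM score minus the remaining KL penalties.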

The full RLHF training loop therefore requires four networks in memory simultaneously: the policy $\pi_\theta$ (actor), the value function $V_\psi$ (critic), the SFT reference model $\pi_{\text{SFT}}$ (for KL computation), and the reward model $r_\phi$. The memory and compute demands of this four-network setup motivate the simpler preference optimization approaches in Week 13.

⚠Critical Lens: PPO in RLHF

The four-network setup described above is the standard InstructGPT recipe, but it conceals several engineering challenges that make RLHF fragile in practice.

KL coefficient tuning is non-transferable. The penalty strength $\beta$ must be swept independently for each model scale, dataset, and task. Too small and the policy overfits the RM; too large and the policy collapses to $\pi_{\text{SFT}}$, wasting the RL step. There is no principled way to set $\beta$ before training; it is currently determined by expensive trial and error.

PPO requires online rollouts from the current policy, making the training loop inherently sequential: generate responses → score with RM → compute KL penalty → PPO update → repeat. Generating each rollout takes one forward pass through the full policy model (often 7B–70B parameters) per generated token, so wall-clock time is typically dominated by generation, not gradient computation. This is fundamentally different from pretraining and SFT, where every token of a static batch is processed in a single parallel pass.

The critic is a single-point bottleneck. If the critic's value function estimates advantages poorly, the PPO surrogate objective optimizes against the wrong signal. Training a critic on open-ended text generation, where episode lengths vary from 10 to 10,000 tokens and the reward signal is zero everywhere except the final token, is substantially harder than standard RL domains with dense rewards.

Memory dominates computation. The weights of four models in FP16 (policy + critic + reference + RM) for a 7B model take about 56 GB (7B × 2 bytes × 4 models), before counting optimizer states, KV-cache for generation, and activations. PPO training at this scale on consumer hardware is generally infeasible; even on 8×A100 nodes, the memory budget constrains batch size to small values, increasing gradient noise. These practical limitations are the direct motivation for the reward-model-free (DPO) and critic-free (GRPO) approaches in Week 13.
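The memory estimate is simple arithmetic, sketched below. The function names are hypothetical, and the optimizer figure rests on a stated assumption (Adam with two FP32 moment buffers for the two trainable networks, policy and critic); real setups differ with sharding, mixed-precision master weights, and offloading.

```python
def weights_memory_gb(n_params, bytes_per_param=2, n_models=4):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param * n_models / 1e9

def adam_state_memory_gb(n_params, n_trainable_models=2, bytes_per_moment=4):
    """Assumed optimizer cost: two FP32 Adam moments per trainable parameter."""
    return n_params * 2 * bytes_per_moment * n_trainable_models / 1e9

n_params = 7e9  # 7B-parameter model
weights_gb = weights_memory_gb(n_params)
adam_gb = adam_state_memory_gb(n_params)
print(f"weights (4 models, FP16): {weights_gb:.0f} GB")
print(f"Adam moments (policy + critic, assumed FP32): {adam_gb:.0f} GB")
```

Under these assumptions the optimizer state is twice the size of the weights, before any KV-cache or activation memory is counted.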


Limitations of vanilla RLHF

Overoptimization and Goodhart's Law

The reward model $r_\phi$ is a proxy for true human preference, not the true preference itself. As PPO pushes $\pi_\theta$ to maximize $r_\phi$, it eventually finds policies that score highly on the proxy while performing poorly on the true objective. This is Goodhart's Law: a measure used as a target ceases to be a good measure.

In practice, overoptimization produces characteristic failure modes: responses become verbosely padded to cover all angles (the RM rewards completeness), excessively hedged (the RM rewards safety language), or sycophantically agreeable (the RM rewards responses that match perceived user preferences). The gap between the proxy reward and true preference is often visualized by plotting RM score (increasing) against human evaluation score (increasing then decreasing) as a function of KL divergence from $\pi_{\text{SFT}}$: the classic overoptimization curve (Gao et al., 2023). This empirical observation of preference-reward divergence has been validated in multiple settings (summarization, instruction-following, mathematical reasoning) and represents one of the sharpest critiques of vanilla RLHF. The scaling laws for overoptimization (how quickly the gap emerges as a function of model size, dataset size, and KL penalty) remain only partially understood.

Reward model ensembles partially mitigate overoptimization: train $M$ independent reward models and optimize against their minimum (pessimistic ensemble), average, or a penalized version that accounts for ensemble disagreement. The minimum-of-ensemble approach has the same structure as TD3's clipped double Q-learning (Week 8) and CQL's conservative Q-function (Week 11); all are applications of pessimism under uncertainty.
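The three aggregation rules can be sketched in a few lines. The function name `ensemble_reward`, the `penalty` coefficient, and the use of the population standard deviation as the disagreement measure are assumptions for illustration; other disagreement measures are possible.

```python
import statistics

def ensemble_reward(scores, mode="min", penalty=1.0):
    """Combine the scores of M reward models for one (prompt, response)."""
    if mode == "min":
        # Pessimistic ensemble: trust the least optimistic model.
        return min(scores)
    if mode == "mean":
        return statistics.fmean(scores)
    if mode == "penalized":
        # Mean score minus a disagreement penalty (population std. dev.).
        return statistics.fmean(scores) - penalty * statistics.pstdev(scores)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical scores from three reward models on the same response.
scores = [1.0, 2.0, 3.0]
for mode in ("min", "mean", "penalized"):
    print(mode, round(ensemble_reward(scores, mode), 4))
```

When all models agree, the three rules coincide; the more they disagree, the further the pessimistic and penalized scores fall below the mean.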

⚠Critical Lens: Overoptimization and Goodhart's Law

The overoptimization problem has a deeper structural cause than the proxy-target mismatch described above. Consider the two objectives:

$$\begin{aligned} J_{\text{proxy}}(\theta) &= \mathbb{E}_{x,\, y\sim\pi_\theta}\!\left[r_\phi(x,y)\right] - \beta D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}}) \\ J_{\text{true}}(\theta) &= \mathbb{E}_{x,\, y\sim\pi_\theta}\!\left[r^*(x,y)\right] - \beta D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}}) \end{aligned}$$

where $r^*$ is the (unobservable) true human preference. The RM $r_\phi$ is an approximation of $r^*$ trained on a finite dataset. As $\pi_\theta$ moves away from $\pi_{\text{SFT}}$ through PPO updates, the policy can discover a region $S \subset \mathcal{Y}$ where $r_\phi$ and $r^*$ disagree systematically. This region exists because:

  1. The RM training set covers a tiny fraction of the response space
  2. Human preferences are inconsistent — two annotators may disagree on the same pair
  3. The Bradley-Terry likelihood assumes preference transitivity, which is violated for complex reasoning tasks

The overoptimization curve (Gao et al., 2023) shows that gold RM score peaks and declines while proxy RM score continues to rise; this is the empirical signature of the policy entering $S$. Importantly, the KL penalty only slows the rate of entry into $S$; it does not prevent it. Once the policy enters $S$, the proxy RM provides anti-helpful signal, and continued training degrades alignment.
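The shape of this curve can be reproduced on a toy problem. All numbers below are invented: three candidate responses, a proxy reward that overrates a rare "exploit" response, and a "true" reward that penalizes it. The sketch uses the closed-form KL-regularized optimum under the proxy for a decreasing sequence of $\beta$ values; it illustrates the mechanism and does not reproduce any published experiment.

```python
import math

# Invented toy numbers: reference probabilities, proxy scores, true rewards.
REF = {"plain": 0.60, "helpful": 0.39, "exploit": 0.01}
PROXY = {"plain": 0.0, "helpful": 1.0, "exploit": 2.0}
TRUE = {"plain": 0.0, "helpful": 1.0, "exploit": -1.0}

def tilted_policy(beta):
    """Closed-form optimum: pi(y) proportional to REF(y) * exp(PROXY(y) / beta)."""
    logits = {y: math.log(REF[y]) + PROXY[y] / beta for y in REF}
    m = max(logits.values())
    weights = {y: math.exp(l - m) for y, l in logits.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def summarize(beta):
    pi = tilted_policy(beta)
    kl = sum(p * math.log(p / REF[y]) for y, p in pi.items() if p > 0.0)
    proxy = sum(pi[y] * PROXY[y] for y in pi)
    true = sum(pi[y] * TRUE[y] for y in pi)
    return kl, proxy, true

BETAS = [100.0, 1.0, 0.5, 0.3, 0.1]  # weaker penalty from left to right
curve = [summarize(b) for b in BETAS]
kls = [c[0] for c in curve]
proxies = [c[1] for c in curve]
trues = [c[2] for c in curve]

for beta, (kl, proxy, true) in zip(BETAS, curve):
    print(f"beta={beta:>6}: KL={kl:.3f} proxy={proxy:.3f} true={true:.3f}")
```

As $\beta$ shrinks, KL and the expected proxy reward both rise steadily, while the expected true reward first improves and then falls once the policy shifts mass onto the exploit response.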

Reward model ensembles address the wrong problem. Using a pessimistic ensemble (optimizing against $\min_k r_{\phi_k}$) protects against regions where a single RM is overconfident, but it does not protect against regions where all RMs are collectively wrong. If all RMs share the same architecture, initialization scale, and training distribution, their errors will be correlated. Ensemble diversity, not just ensemble size, is what determines protection, and architectural diversity increases the already-substantial memory cost.

Distributional mismatch in preference data

The RM is trained on preference pairs from the SFT model's distribution. After PPO fine-tuning, the policy's distribution shifts, potentially producing responses that fall outside the RM's training distribution. The RM's predictions become less reliable precisely in the region the policy most wants to exploit. This is the offline RL distributional shift problem (Week 11) applied in the RLHF context.

One mitigation is iterative RLHF: alternate between (a) collecting new preference data from the current $\pi_\theta$ and (b) updating the reward model on the expanded dataset. Each iteration keeps the RM's training distribution close to the current policy's output distribution, reducing distributional mismatch.

The alignment tax

RLHF often degrades performance on tasks that were not in the preference dataset. A model fine-tuned for conversational helpfulness may show reduced performance on mathematical reasoning, code generation, or factual recall benchmarks. This tradeoff is the alignment tax: alignment-focused RLHF specializes the model toward preference-annotated behaviors at the cost of general capability.

The alignment tax is partially a consequence of the SFT stage: fine-tuning on a narrow curated dataset risks forgetting the breadth of pretraining knowledge. Approaches including PPO with a cross-entropy auxiliary loss (preventing forgetting), careful learning rate scheduling, and mixing SFT and RL gradient updates have reduced but not eliminated the alignment tax in practice.


Variants and extensions

Constitutional AI (CAI) (Bai et al., 2022) replaces human preference annotators for harmlessness with the model itself. In a supervised phase, the model critiques its own responses according to a set of principles (the "constitution") and generates revised responses. In the RL phase, the model compares pairs of responses against the constitution, and a reward model is trained on these model-generated preference pairs rather than human-annotated ones. CAI reduces the cost of preference data collection and enables alignment at scales where human annotation is impractical. The core contribution of CAI is demonstrating that AI-generated preference pairs can approximate human preferences at a fraction of the cost, shifting the alignment bottleneck from annotation to constitution design. However, the approach replaces the question "what do humans prefer?" with "what does our constitution imply?", shifting rather than solving the preference specification problem.

RLAIF (RL from AI Feedback) generalizes CAI: the evaluator model can be a separate, stronger LLM. Preferences are generated by the evaluator rather than humans, then used to train the reward model for the policy. RLAIF can be applied iteratively: the policy improves, the evaluator generates harder preference pairs, the RM improves, the policy improves further. This self-improvement loop has been reported to approach the quality of human-annotated RLHF at reduced annotation cost. As of 2025–2026, training on automated rather than human-labeled feedback has been scaled to large models. DeepSeek-R1, for example, relies largely on rule-based, automatically checkable rewards for reasoning tasks, which is related to but distinct from LLM-judged RLAIF. On reasoning tasks, such synthetic feedback has been reported to rival or exceed human-annotated RLHF.

⚠Critical Lens: CAI and Preference Specification

CAI's core insight — replacing human annotators with a constitution-guided model — is elegant but creates a new class of failure modes:

The constitution becomes the alignment bottleneck. Writing a complete, non-contradictory set of principles that covers every preference scenario is at least as hard as specifying the reward function directly. A constitution that says "be helpful" and "be harmless" without resolving their conflict (e.g., a user asking for instructions to circumvent safety filters) leaves the RM to arbitrate an undefined tradeoff. The RLHF pipeline then optimizes against an RM that was itself trained on preference pairs generated under an ambiguous set of rules.

Self-critique introduces a competence ceiling. If the policy $\pi_\theta$ is used both as the critique generator (rewriting its own responses) and as the evaluator (judging which rewrite is better), the preference data reflects only the information available within $\pi_\theta$, not what a stronger evaluator could observe. A model cannot reliably detect flaws it lacks the knowledge to recognize. RLAIF partially addresses this by using a stronger evaluator (e.g., GPT-4 judging GPT-3.5's responses), but the evaluator's own biases propagate into the RM.

Synthetic preference diversity collapses over iterations. In iterative RLAIF, the evaluator generates preference pairs, the RM trains on them, the policy improves against the RM, and the evaluator generates new pairs from the improved policy's distribution. If the evaluator's preference model is insufficiently diverse (e.g., always preferring verbose but superficial completions), the RM overfits to that narrow signal, the policy specializes to it, and subsequent preference pairs become increasingly homogeneous — an alignment collapse that is hard to detect without human evaluation checkpoints.


◆Open Problems
  • Multi-dimensional alignment. Human preferences involve conflicting axes (helpfulness, honesty, harmlessness, conciseness, creativity) that cannot be reduced to a scalar without loss. Methods for multi-objective RLHF — including Pareto-optimal preference aggregation and user-conditional reward models — are active research areas.
  • Preference model assumptions. The Bradley-Terry model assumes transitivity and context-independence, both violated in practice. Thurstonian models (separating preference noise from reward variance) and embedding-based preference models are alternatives with different failure modes.
  • Interpretable reward decomposition. Current RMs are black-box networks whose scalar output provides no signal about why a response was preferred. Factorization into interpretable dimensions (factuality score, helpfulness score, safety score) could enable debugging and targeted improvement.
  • Alignment tax elimination. RLHF methods reported to date typically degrade at least some capability metric. Whether this tradeoff is fundamental (specializing toward preference-annotated behavior necessarily costs performance elsewhere) or an artifact of current methods remains unresolved.
  • Detection of reward hacking. There is no reliable, automated method for determining whether an RM score increase represents genuine alignment improvement or reward hacking. Gold RM evaluation is the current standard but requires additional human annotation — precisely the bottleneck RLHF was designed to bypass.

Key takeaways

The RLHF pipeline maps the alignment problem onto standard RL components. SFT shapes the base model into a format suitable for preference optimization and defines the reference distribution. Reward modeling translates pairwise human preferences into a scalar via the Bradley-Terry model and maximum likelihood estimation, avoiding the noise and inconsistency of absolute human scoring. KL-regularized PPO optimizes the policy against the RM while constraining it to the SFT distribution, connecting RLHF directly to offline RL behavior regularization. The closed-form KL-regularized optimal policy, a Boltzmann distribution over the reference model weighted by reward, is the theoretical centerpiece that links RLHF to maximum entropy RL and enables the DPO derivation. Overoptimization, distributional mismatch, and the alignment tax are the principal failure modes of vanilla RLHF, each with mitigation strategies rooted in the RL theory developed throughout the course.


Conceptual questions

  1. The Bradley-Terry model assumes $P(y^w \succ y^l) = \sigma(r(x, y^w) - r(x, y^l))$. This model has an identifiability property: adding a constant to all rewards leaves all preference probabilities unchanged. Explain why this means that RLHF cannot learn the absolute level of the reward function, only relative rewards. Does this matter for policy optimization? What additional assumption is required to compare the reward of responses to different prompts?

  2. The KL-regularized RLHF objective has the closed-form optimal policy $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x) \exp(r(x,y)/\beta)$. Show what happens to this policy as $\beta \to 0$ and $\beta \to \infty$. For a fixed RM $r_\phi$, describe the qualitative behavior of the optimized policy at each extreme. Why does large $\beta$ reduce overoptimization, and what is the cost?

  3. The PPO implementation of RLHF requires four networks (policy, critic, reference model, reward model). For a 7B parameter model, estimate the minimum GPU memory required if all four are loaded simultaneously in FP16. Then explain why this memory constraint motivates approaches that eliminate the reward model (DPO) or the separate critic (GRPO). What architectural compromise does each method make?

  4. Overoptimization experiments (Gao et al., 2023) use a gold reward model (a separate, held-out RM trained on additional human data) to evaluate whether the proxy RM score correlates with true preference as KL divergence from $\pi_{\text{SFT}}$ increases. The proxy RM score increases monotonically while the gold RM score peaks and then decreases. Explain this divergence in terms of the distributional shift between the RM's training distribution and the policy's output distribution during PPO. What does the peak of the gold RM curve represent, and how would you detect this peak during training without access to a gold RM?

  5. Constitutional AI uses model-generated preference pairs rather than human annotations. Identify two ways this approach could fail to produce a well-aligned model even if the constitution is well-specified: one failure mode related to the quality of the evaluator model, and one related to the diversity of the generated preference pairs. Propose a modification to the CAI pipeline that addresses each failure mode.


✦Solutions
  1. Bradley-Terry scale. Preferences depend only on $r(x,y^w)-r(x,y^l)$, so an additive constant cancels: only relative rewards are identifiable, not the absolute level. It does not matter for policy optimization: the optimal $\pi^*\propto\pi_\text{ref}\exp(r/\beta)$ is unchanged by a constant shift (it only rescales the normalizer). To compare rewards across different prompts you need an extra anchoring assumption (a shared per-prompt reference/zero point), since BT only orders responses within one prompt.
  2. KL-regularized optimum. As $\beta\to0$, $\pi^*\propto\pi_\text{ref}\exp(r/\beta)$ becomes sharply peaked on the highest-reward response in the support of $\pi_\text{ref}$: greedy, and otherwise ignoring $\pi_\text{ref}$. As $\beta\to\infty$ it returns to $\pi_\text{ref}$, ignoring reward. Large $\beta$ reduces overoptimization by keeping the policy near $\pi_\text{ref}$ where the RM is in-distribution and reliable, at the cost of capturing less reward improvement (more conservative alignment).
  3. Four-network memory. At FP16, 7B params ≈ 14 GB each; policy, critic, reference, and reward models ≈ $4\times14 = 56$ GB minimum for weights alone (more with optimizer states/activations). This motivates DPO, which removes the reward model and the RL loop by optimizing preferences directly through a closed-form implicit reward, and GRPO, which removes the separate critic by estimating advantages from group-relative sample rewards. The compromise: DPO assumes the BT model and optimizes offline (no online reward-model exploration); GRPO trades a learned value baseline for a higher-variance group statistic needing several samples per prompt.
  4. Overoptimization. As PPO drives the policy away from $\pi_\text{SFT}$ (rising KL), its outputs drift out of the proxy RM's training distribution, so the proxy, evaluated OOD, returns inflated scores while true (gold) preference peaks then falls (Goodhart). The gold peak marks the optimal KL distance where real preference is maximized before overoptimization dominates. Detect it without a gold RM by monitoring KL from $\pi_\text{SFT}$ and watching held-out validation metrics or RM-ensemble disagreement, early-stopping where they start degrading.
  5. Constitutional AI failures. (a) Evaluator quality: if the teacher LLM misjudges the constitution or carries its own biases, the generated preferences are systematically wrong and the model aligns to flawed labels — fix with a stronger/ensemble evaluator calibrated against some human-checked examples. (b) Pair diversity: if generated pairs cover a narrow distribution, the model is unaligned on uncovered inputs — fix by diversifying prompts and responses (higher temperature, varied generators, red-team/adversarial prompts) to broaden coverage.

Looking ahead

The RLHF pipeline is powerful but expensive: four networks, human annotation, and a PPO training loop that requires careful hyperparameter tuning.

Week 13: Direct Preference Optimization and GRPO. We derive DPO's reparameterization of the reward model in terms of the optimal policy, which eliminates the reward model entirely, and study GRPO's removal of the critic, tracing what each simplification gives up and what it gains.


Further reading

  • Christiano, P. F., et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS. (The conceptual origin of modern RLHF).
  • Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS. (InstructGPT / OpenAI's RLHF paper).
  • Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv. (Anthropic's RLAIF framework).
  • Gao, L., et al. (2023). Scaling Laws for Reward Model Overoptimization. ICML. (Empirical study of Goodhart's law in RLHF).