Week 12: Reinforcement Learning from Human Feedback

Purpose of this lecture

The entire theoretical apparatus developed in Weeks 1–11 — MDPs, value functions, policy gradients, actor-critics, PPO, offline RL — was built around the assumption that the reward function is given. An agent playing Atari receives a score from the game engine; a robot receives a distance-to-goal signal from its simulator; a Q-learning agent on a gridworld receives $+1$ for reaching the goal.

For language models, this assumption fails. There is no function $r(x, y)$ that a programmer can write to evaluate whether a response $y$ to a prompt $x$ is helpful, honest, and harmless. The property being optimized is a human preference — a subjective, context-dependent judgment that cannot be reduced to a closed-form expression. Next-token prediction (cross-entropy on a text corpus) produces a fluent language model, but fluency and alignment are distinct: a model can be highly fluent while being sycophantic, misleading, harmful, or evasive.

Reinforcement Learning from Human Feedback (RLHF) is the approach that bridges this gap: learn a reward model from human preference data, then optimize the language model against that reward model using RL. This lecture develops the full pipeline — SFT, reward modeling, KL-regularized PPO — and connects each component to the RL theory developed throughout the course.

The three-stage RLHF pipeline

Stage 1: Supervised fine-tuning

A base language model pretrained on web-scale text has learned to predict the statistical distribution of text on the internet — including low-quality, unhelpful, and unsafe text. Before applying RL, the model must be shaped into a format suitable for the alignment task.

Supervised Fine-Tuning (SFT) trains the base model on a curated dataset of high-quality prompt–response pairs written or approved by human annotators:

\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{SFT}}}\!\left[\sum_t \log \pi_\theta(y_t \mid x, y_{<t})\right]

SFT is standard next-token prediction on the curated dataset. Its purpose is not alignment but format shaping: after SFT, the model $\pi_{\text{SFT}}$ produces responses that look like helpful answers to prompts, rather than continuations of arbitrary web text. This provides a reasonable starting point for reward optimization and defines the reference distribution for the KL penalty in Stage 3.
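As a minimal sketch of the SFT loss above, the snippet below computes the summed negative log-likelihood of the reference tokens and averages it over examples. The per-token probabilities are made-up numbers standing in for $\pi_\theta(y_t \mid x, y_{<t})$, and the function name `sft_loss` is hypothetical.

```python
import math

def sft_loss(token_probs_per_example):
    """Mean over examples of the summed negative log-likelihood of response tokens.

    token_probs_per_example[i][t] is the probability the model assigned to the
    reference token at step t of example i, i.e. pi_theta(y_t | x, y_<t).
    """
    total = 0.0
    for token_probs in token_probs_per_example:
        total += -sum(math.log(p) for p in token_probs)
    return total / len(token_probs_per_example)

# Two hypothetical curated responses with made-up per-token probabilities.
batch = [
    [0.9, 0.8, 0.95],  # every reference token is predicted confidently
    [0.5, 0.1],        # one poorly predicted token dominates the loss
]
loss = sft_loss(batch)
```

A response whose reference tokens all receive probability 1 contributes zero loss, which is why SFT pushes the model toward reproducing the curated responses token by token.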

The SFT model is also the foundation for the reward model: rather than training a reward model from scratch, the RM is initialized from $\pi_{\text{SFT}}$ with the final token prediction head replaced by a scalar regression head.

Stage 2: Reward modeling

The core challenge of RLHF is learning $r_\phi(x, y)$ : a function mapping a prompt–response pair to a scalar representing human preference. Two design choices make this tractable.

Pairwise comparisons over absolute scores

Human annotators are poor at assigning absolute quality scores (one annotator's 7 is another's 9), but are reliable at relative ranking: given two responses $y_1$ and $y_2$ to the same prompt $x$ , annotators can consistently identify which is better. This shifts the learning problem from regression on absolute quality to classification on pairwise preferences.

The preference dataset takes the form:

\mathcal{D}_{\text{RM}} = \{(x_i, y^w_i, y^l_i)\}_{i=1}^N

where $y^w_i \succ y^l_i$ indicates that the human preferred $y^w_i$ over $y^l_i$ for prompt $x_i$ .

The Bradley-Terry preference model

The probability that a human prefers $y^w$ over $y^l$ is modeled as:

P(y^w \succ y^l \mid x) = \sigma\!\left(r_\phi(x, y^w) - r_\phi(x, y^l)\right)

where $\sigma$ is the logistic sigmoid. This is the Bradley-Terry model (Bradley and Terry, 1952), a classical model for paired comparisons that has long been used for ranking sports teams. Its key property is that preferences are governed entirely by the difference in latent rewards, so the absolute reward level is unidentified: adding any constant (or any function of the prompt alone) to $r_\phi(x, \cdot)$ leaves every preference probability unchanged. This matches the human annotation setting: a human judges relative quality, and the model only needs to produce an ordering, not calibrated absolute scores.

The Bradley-Terry model's application to human preference learning assumes that human judgments form a total order (transitivity: if A > B and B > C, then A > C) that is consistent and independent of context. In practice, human preferences on complex reasoning tasks (math, code, scientific writing) are often intransitive, context-dependent, and irreproducible across annotators. The model's popularity in RLHF stems not from empirical validation of its assumptions but from mathematical convenience: it leads to a tractable maximum likelihood objective and has been validated post-hoc on simpler preference tasks (summarization quality, toxicity avoidance). For alignment tasks (honesty, helpfulness, safety), the gap between the model's assumptions and human judgment is substantial and remains underexplored.

The reward model is trained by maximum likelihood on the preference dataset:

\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y^w, y^l) \sim \mathcal{D}_{\text{RM}}}\!\left[ \log \sigma\!\left(r_\phi(x, y^w) - r_\phi(x, y^l)\right) \right]

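This is binary cross-entropy with logit equal to the reward difference $r_\phi(x, y^w) - r_\phi(x, y^l)$ and target label 1 (the event that $y^w$ is preferred). The loss for a pair falls below $\log 2$ once $r_\phi(x, y^w) > r_\phi(x, y^l)$ and keeps decreasing as the margin grows, approaching zero only as the margin tends to infinity. On noisy or conflicting labels the maximum-likelihood margins stay finite.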
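The pairwise loss can be sketched in a few lines. This is an illustrative implementation on scalar rewards, not a training loop: the reward values are made up, and the helper names (`log_sigmoid`, `bt_loss`, `bt_grad_wrt_margin`) are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bt_loss(pairs):
    """Mean Bradley-Terry negative log-likelihood.

    pairs is a list of (r_w, r_l): scalar rewards assigned to the preferred
    and dispreferred response for the same prompt.
    """
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

def bt_grad_wrt_margin(r_w, r_l):
    """Derivative of the single-pair loss with respect to the margin r_w - r_l."""
    margin = r_w - r_l
    return -(1.0 - math.exp(log_sigmoid(margin)))

pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]            # made-up reward values
shifted = [(r_w + 5.0, r_l + 5.0) for r_w, r_l in pairs]  # same margins, shifted level
```

Evaluating `bt_loss(pairs)` and `bt_loss(shifted)` gives the same value up to rounding, since only margins enter the loss. The gradient magnitude is largest for misordered pairs such as `(0.1, 0.4)` and shrinks toward zero as the margin grows, so training effort concentrates on pairs the model currently ranks incorrectly.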

Reward model architecture

In practice, the RM is initialized from $\pi_{\text{SFT}}$ with the language model head replaced by a linear layer that outputs a scalar. This leverages the SFT model's understanding of language quality as a strong initialization. The RM receives the full prompt–response pair as a single sequence and returns a scalar $r_\phi \in \mathbb{R}$ at the final token position, trained via the pairwise loss above.

Stage 3: Policy optimization with PPO

With a reward model in hand, the LLM optimization becomes a standard RL problem. The MDP is defined as:

| MDP component | RLHF interpretation |
|---|---|
| State $s_t$ | Prompt $x$ + all tokens generated so far |
| Action $a_t$ | Next token selected from vocabulary |
| Transition $P$ | Appending selected token (deterministic) |
| Reward | $r_\phi(x, y)$ at end of generation; $0$ at intermediate steps |
| Policy $\pi_\theta$ | The language model (token distribution) |

This is an episodic MDP with a delayed terminal reward: the agent generates a full response $y$ token by token, and the reward model evaluates the complete response. The episode length is the response length $|y|$ , which is variable.
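The MDP above can be illustrated with a toy rollout. Everything here is a stand-in: `toy_policy` and `toy_reward_model` are hypothetical placeholders for the language model and the reward model, and the vocabulary has three tokens.

```python
import random

EOS = "<eos>"

def rollout(prompt_tokens, policy, reward_model, max_len=20, rng=None):
    """Generate one episode and return (response, per-step rewards).

    State  = prompt tokens + tokens generated so far.
    Action = next token drawn from policy(state), a dict token -> probability.
    Transition is deterministic: append the chosen token.
    Reward is 0 at every step except the last, where reward_model scores
    the complete (prompt, response) pair.
    """
    rng = rng or random.Random(0)
    state = list(prompt_tokens)
    response, rewards = [], []
    for _ in range(max_len):
        dist = policy(state)
        tokens, probs = zip(*dist.items())
        action = rng.choices(tokens, weights=probs, k=1)[0]
        state.append(action)
        response.append(action)
        rewards.append(0.0)
        if action == EOS:
            break
    rewards[-1] = reward_model(prompt_tokens, response)
    return response, rewards

# Hypothetical stand-ins for the language model and the reward model.
def toy_policy(state):
    return {"yes": 0.4, "no": 0.3, EOS: 0.3}

def toy_reward_model(prompt, response):
    return float(response.count("yes"))

response, rewards = rollout(["is", "it", "safe", "?"], toy_policy, toy_reward_model)
```

The reward list has the same length as the response and is zero everywhere except its last entry, which is the delayed terminal reward described above.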

The KL-regularized RLHF objective

Optimizing $\pi_\theta$ directly against $r_\phi$ without constraint produces reward hacking: the policy finds responses that score highly under $r_\phi$ but are not actually preferred by humans. This is the extrapolation error from Week 11 — the RM was trained only on responses from the SFT model's distribution, so its predictions are unreliable far from that distribution. A model optimizing unconstrained against the RM discovers OOD text patterns that exploit the RM's failure modes.

The fix is a KL divergence penalty anchoring the optimized policy to the SFT reference model:

\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\!\left[ r_\phi(x, y) - \beta\, D_{\text{KL}}\!\left(\pi_\theta(\cdot|x) \,\|\, \pi_{\text{SFT}}(\cdot|x)\right) \right]

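where $\beta > 0$ controls the penalty strength. Because the policy is autoregressive, the KL term decomposes over tokens: for a sampled response $y$ it is computed at each prefix $y_{<t}$ and summed over the response, and the expectation over $y \sim \pi_\theta$ in the objective supplies the average over prefixes: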

D_{\text{KL}}(\pi_\theta \| \pi_{\text{SFT}}) = \sum_t \sum_v \pi_\theta(v \mid x, y_{<t}) \log \frac{\pi_\theta(v \mid x, y_{<t})}{\pi_{\text{SFT}}(v \mid x, y_{<t})}

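This is behavior regularization (Week 11) with the behavior policy $\pi_\beta$ set to $\pi_{\text{SFT}}$ (the subscript in $\pi_\beta$ denotes "behavior" and is unrelated to the coefficient $\beta$): it prevents the policy from straying into the OOD region where the reward model provides unreliable signal. The analogy to offline RL is close but not exact. The reward model is fit to a fixed preference dataset, as in offline RL, with $\pi_{\text{SFT}}$ playing the role of the behavior policy, while the PPO stage still samples fresh responses from the current policy.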
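A small sketch of the per-token KL computation follows. The distributions are made-up three-token examples, and the function names are hypothetical. The last helper shows the cheaper single-sample estimate based on log-probabilities of the sampled tokens only.

```python
import math

def token_kl(p, q):
    """KL(p || q) between two next-token distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def response_kl(policy_dists, ref_dists):
    """Sum of per-step KL terms along one sampled response.

    policy_dists[t] and ref_dists[t] are the full next-token distributions of
    pi_theta and pi_SFT at step t, both conditioned on the same prefix.
    Averaging this over responses sampled from pi_theta estimates the
    sequence-level KL.
    """
    return sum(token_kl(p, q) for p, q in zip(policy_dists, ref_dists))

def sampled_log_ratio(policy_logprobs, ref_logprobs):
    """Single-sample estimate: sum over t of log pi_theta(y_t) - log pi_SFT(y_t)."""
    return sum(lp - lq for lp, lq in zip(policy_logprobs, ref_logprobs))

# Made-up distributions over a 3-token vocabulary for a 2-token response.
policy = [[0.7, 0.2, 0.1], [0.5, 0.25, 0.25]]
ref = [[0.5, 0.3, 0.2], [0.5, 0.25, 0.25]]
kl_value = response_kl(policy, ref)
```

Only the first step contributes here, because the two models agree exactly on the second step. The full-distribution form needs the whole vocabulary at every step, whereas the sampled log-ratio needs only the log-probabilities of the tokens that were actually generated.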

KL-regularized objective: the closed-form solution

The KL-regularized objective has an analytically tractable optimal policy. For any fixed reward function $r$ , the solution to:

\max_\pi \mathbb{E}_y\!\left[r(x,y)\right] - \beta\, D_{\text{KL}}(\pi \| \pi_{\text{ref}})

is the Boltzmann distribution:

\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \cdot \exp\!\left(\frac{r(x,y)}{\beta}\right)

where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp(r(x,y)/\beta)$ is the partition function. This is the same Boltzmann optimal policy as in maximum entropy RL (Week 8, SAC), with $\beta$ playing the role of temperature $\alpha$ . The optimal policy concentrates on high-reward responses while maintaining support near the reference distribution. This closed-form solution is the starting point for deriving DPO (Week 13), which bypasses the reward model entirely by solving for $r$ in terms of $\pi^*$ and $\pi_{\text{ref}}$ .
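On a finite response set the closed form can be checked numerically. The sketch below uses three hypothetical responses with made-up reference probabilities and rewards. It is an illustration of the formula, not something that scales to real response spaces, where $Z(x)$ is intractable.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y)/beta) on a finite response set."""
    m = max(rewards)  # subtract the max reward for numerical stability
    weights = [p * math.exp((r - m) / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def objective(policy, ref_probs, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref)."""
    expected_r = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0.0)
    return expected_r - beta * kl

ref = [0.6, 0.3, 0.1]  # hypothetical reference probabilities for 3 responses
r = [0.0, 1.0, 2.0]    # hypothetical rewards
pi_star = optimal_policy(ref, r, beta=1.0)
pi_small_beta = optimal_policy(ref, r, beta=0.01)  # nearly greedy on reward
pi_large_beta = optimal_policy(ref, r, beta=1e6)   # nearly equal to ref
```

At the optimum the objective equals $\beta \log Z(x)$, which follows from substituting $\pi^*$ back into the objective. A small $\beta$ concentrates mass on the highest-reward response, and a large $\beta$ returns the reference distribution.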

PPO in the RLHF context

In practice, the KL-regularized objective is optimized with PPO (Weeks 7–8). The token-level reward signal for PPO is constructed as:

\tilde{r}_t = \begin{cases} r_\phi(x, y) - \beta \log \frac{\pi_\theta(y_t | x, y_{<t})}{\pi_{\text{SFT}}(y_t | x, y_{<t})} & t = |y| \\ - \beta \log \frac{\pi_\theta(y_t | x, y_{<t})}{\pi_{\text{SFT}}(y_t | x, y_{<t})} & t < |y| \end{cases}

The terminal reward is the RM score minus the KL penalty at the final token; all intermediate steps receive only the per-token KL penalty. The log-ratio at the sampled token $y_t$ is a single-sample estimate of the per-token KL term. PPO's clipped surrogate objective and GAE are then applied to this reward signal, with the language model serving as the actor and a separate value head (or a copy of the language model with an added scalar head) serving as the critic. This formulation was introduced for language models in earlier work (Ziegler et al., 2019; Stiennon et al., 2020) and popularized by Ouyang et al. (2022) in the InstructGPT paper, which demonstrated that supervised fine-tuning followed by reward modeling and PPO optimization produces significant improvements in human preference ratings over the SFT baseline. The empirical gains, especially in reducing harmful outputs and improving instruction-following, validated the RLHF pipeline as a practical approach to alignment, though the paper's ablations did not isolate the contribution of each stage, leading to continued debate about the necessity of the RM stage versus simpler preference optimization.
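The token-level reward construction can be sketched as follows. The log-probabilities, RM score, and $\beta$ are made-up values, and the function names are hypothetical. The undiscounted return-to-go is included only to show what the critic would be regressing toward in the simplest case ($\gamma = 1$, no GAE).

```python
def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta):
    """Per-token PPO rewards for one non-empty response.

    Every step receives -beta * (log pi_theta(y_t) - log pi_SFT(y_t));
    the final step additionally receives the reward model score.
    """
    rewards = [-beta * (lp - lq) for lp, lq in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards

def returns_to_go(rewards):
    """Undiscounted sum of future rewards from each step (gamma = 1)."""
    out, running = [], 0.0
    for reward in reversed(rewards):
        running += reward
        out.append(running)
    return out[::-1]

# Made-up log-probabilities of the sampled tokens under each model.
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.2, -0.5, -1.0]
rewards = shaped_rewards(rm_score=0.8, policy_logprobs=policy_lp,
                         ref_logprobs=ref_lp, beta=0.1)
returns = returns_to_go(rewards)
```

Summing the shaped rewards recovers the RM score minus $\beta$ times the summed log-ratio, so the per-token construction is a redistribution of the sequence-level objective across time steps rather than a different objective.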

The full RLHF training loop therefore requires four networks in memory simultaneously: the policy $\pi_\theta$ (actor), the value function $V_\psi$ (critic), the SFT reference model $\pi_{\text{SFT}}$ (for KL computation), and the reward model $r_\phi$. The memory and compute demands of this four-network setup motivate the simpler preference optimization approaches in Week 13.

Critical Lens: PPO in RLHF

The four-network setup described above is the standard InstructGPT recipe, but it conceals several engineering challenges that make RLHF fragile in practice.

KL coefficient tuning is non-transferable. The penalty strength $\beta$ must be swept independently for each model scale, dataset, and task. Too small and the policy overfits the RM; too large and the policy collapses to $\pi_{\text{SFT}}$ , wasting the RL step. There is no principled way to set $\beta$ before training — it is currently determined by expensive trial and error.

PPO requires online rollouts from the current policy, making the training loop inherently sequential: generate responses → score with RM → compute KL penalty → PPO update → repeat. Generating a response takes one forward pass through the full policy model (often 7B–70B parameters) per generated token, so wall-clock time is often dominated by generation rather than gradient computation. This differs from pretraining and SFT, which train on a static dataset and process all token positions of a sequence in parallel.

The critic is a single point of failure. If the value function estimates are poor, the advantages computed from them are wrong, and the PPO surrogate objective optimizes against the wrong signal. Training a critic on open-ended text generation, where episode lengths vary from 10 to 10,000 tokens and the RM score arrives only at the final token, is substantially harder than in standard RL domains with dense rewards.

Memory dominates computation. Four models in FP16 (policy + critic + reference + RM) at 7B parameters each need about 56 GB for weights alone (7B × 2 bytes × 4 models), excluding gradients, optimizer states, KV-cache for generation, and activations. PPO training on consumer hardware is impractical; even on 8×A100 nodes, the memory budget constrains batch size to small values, increasing gradient noise. These practical limitations are the direct motivation for the reward-model-free (DPO) and critic-free (GRPO) approaches in Week 13.
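The memory arithmetic can be made explicit. The weights-only figure follows directly from the text. The training-state estimate is an added assumption, not something the lecture specifies: it supposes that only the policy and critic are trained, with FP16 gradients (2 bytes per parameter) and mixed-precision Adam state (12 bytes per parameter for an FP32 master copy plus two moment buffers).

```python
GB = 1e9  # decimal gigabytes, matching the 7B x 2 bytes x 4 estimate

def weights_gb(n_params, bytes_per_param=2.0, n_models=4):
    """Memory for model weights only."""
    return n_params * bytes_per_param * n_models / GB

def training_state_gb(n_params, n_trainable=2, grad_bytes=2.0, optimizer_bytes=12.0):
    """Gradients plus optimizer state for the trainable models.

    ASSUMPTION: FP16 gradients and mixed-precision Adam (FP32 master weights
    plus two FP32 moment buffers); real setups vary.
    """
    return n_params * n_trainable * (grad_bytes + optimizer_bytes) / GB

weights = weights_gb(7e9)               # four 7B models in FP16
train_state = training_state_gb(7e9)    # policy + critic only
total_before_activations = weights + train_state
```

Under these assumptions the training state is several times larger than the weights themselves, before counting activations or the KV-cache, which is why the batch-size constraint mentioned above appears even on multi-GPU nodes.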

Limitations of vanilla RLHF

Overoptimization and Goodhart's Law

The reward model $r_\phi$ is a proxy for true human preference, not the true preference itself. As PPO pushes $\pi_\theta$ to maximize $r_\phi$ , it eventually finds policies that score highly on the proxy while performing poorly on the true objective. This is Goodhart's Law: a measure used as a target ceases to be a good measure.

In practice, overoptimization produces characteristic failure modes: responses become verbosely padded to cover all angles (the RM rewards completeness), excessively hedged (the RM rewards safety language), or sycophantically agreeable (the RM rewards responses that match perceived user preferences). The gap between the proxy reward and true preference is often visualized by plotting both against KL divergence from $\pi_{\text{SFT}}$: the RM score keeps increasing, while the human evaluation score increases and then decreases. This is the classic overoptimization curve (Gao et al., 2023). Similar divergence between proxy reward and human preference has been reported in several settings (summarization, instruction-following, mathematical reasoning) and represents one of the sharpest critiques of vanilla RLHF. The scaling laws for overoptimization, that is, how quickly the gap emerges as a function of model size, dataset size, and KL penalty, remain only partially understood.

Reward model ensembles partially mitigate overoptimization: train $M$ independent reward models and optimize against their minimum (pessimistic ensemble), average, or a penalized version that accounts for ensemble disagreement. The minimum-of-ensemble approach has the same structure as TD3's clipped double Q-learning (Week 8) and CQL's conservative Q-function (Week 11) — all are applications of pessimism under uncertainty.
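The three aggregation rules can be sketched on a list of ensemble scores. The scores and the penalty weight are made-up values, and `ensemble_reward` is a hypothetical helper. The penalized variant here uses the population standard deviation as the disagreement measure, which is one possible choice among several.

```python
import statistics

def ensemble_reward(scores, mode="min", penalty=1.0):
    """Combine M reward model scores for one (prompt, response) pair.

    mode="min":       pessimistic ensemble, the lowest score wins.
    mode="mean":      plain average.
    mode="penalized": average minus penalty * standard deviation, so
                      disagreement among the reward models lowers the reward.
    """
    if mode == "min":
        return min(scores)
    if mode == "mean":
        return statistics.fmean(scores)
    if mode == "penalized":
        return statistics.fmean(scores) - penalty * statistics.pstdev(scores)
    raise ValueError(f"unknown mode: {mode}")

agree = [1.0, 1.0, 1.0]      # reward models agree: likely in-distribution
disagree = [0.2, 1.0, 3.0]   # reward models disagree: likely out-of-distribution
```

When the models agree, all three rules return the same value. When they disagree, the minimum and the penalized mean both fall below the plain mean, so the policy gains less from responses on which the ensemble is uncertain.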

Critical Lens: Overoptimization and Goodhart's Law

The overoptimization problem has a deeper structural cause than the proxy-target mismatch described above. Consider the two objectives:

\begin{aligned} J_{\text{proxy}}(\theta) &= \mathbb{E}_{x,y\sim\pi_\theta}\!\left[r_\phi(x,y)\right] - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{SFT}}) \\ J_{\text{true}}(\theta) &= \mathbb{E}_{x,y\sim\pi_\theta}\!\left[r^*(x,y)\right] - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{SFT}}) \end{aligned}

where $r^*$ is the (unobservable) true human preference. The RM $r_\phi$ is an approximation of $r^*$ trained on a finite dataset. As $\pi_\theta$ moves away from $\pi_{\text{SFT}}$ through PPO updates, the policy can discover a region $S \subset \mathcal{Y}$ where $r_\phi$ and $r^*$ disagree systematically. This region exists because:

The RM training set covers a tiny fraction of the response space
Human preferences are inconsistent — two annotators may disagree on the same pair
The Bradley-Terry likelihood assumes preference transitivity, which is violated for complex reasoning tasks

The overoptimization curve (Gao et al., 2023) shows that gold RM score peaks and declines while proxy RM score continues to rise — this is the empirical signature of the policy entering $S$ . Importantly, the KL penalty only slows the rate of entry into $S$ ; it does not prevent it. Once the policy enters $S$ , the proxy RM provides anti-helpful signal, and continued training degrades alignment.
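A toy version of this curve can be reproduced with the closed-form KL-regularized policy on three hypothetical responses. The numbers are invented for illustration: the third response plays the role of the region $S$, with a high proxy reward and a negative true reward. Sweeping $\beta$ downward traces out increasing KL from the reference.

```python
import math

def boltzmann(ref_probs, rewards, beta):
    """KL-regularized optimum: proportional to pi_ref * exp(reward / beta)."""
    m = max(rewards)
    w = [p * math.exp((r - m) / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(w)
    return [x / z for x in w]

def expected(policy, values):
    return sum(p * v for p, v in zip(policy, values))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

ref = [0.50, 0.45, 0.05]     # hypothetical reference policy over 3 responses
true_r = [0.0, 1.0, -1.0]    # unobservable true preference
proxy_r = [0.0, 1.0, 2.0]    # reward model: wrong on the rare third response

def sweep(betas):
    """Return (KL from ref, expected proxy reward, expected true reward) per beta."""
    rows = []
    for beta in betas:
        pi = boltzmann(ref, proxy_r, beta)
        rows.append((kl(pi, ref), expected(pi, proxy_r), expected(pi, true_r)))
    return rows

rows = sweep([1e6, 2.0, 1.0, 0.5, 0.2])  # decreasing beta = weaker KL penalty
```

In this toy setting the expected proxy reward rises with KL throughout the sweep, while the expected true reward first improves slightly over the reference and then collapses as mass shifts onto the mis-scored response. The example idealizes optimization (it uses the exact optimum for each $\beta$ rather than PPO), so it shows the shape of the effect, not its magnitude in practice.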

Reward model ensembles address the wrong problem. Using a pessimistic ensemble (optimizing against $\min_k r_{\phi_k}$ ) protects against regions where a single RM is overconfident, but it does not protect against regions where all RMs are collectively wrong. If all RMs share the same architecture, initialization scale, and training distribution, their errors will be correlated. Ensemble diversity — not just ensemble size — is what determines protection, and architectural diversity increases the already-substantial memory cost.

Distributional mismatch in preference data

The RM is trained on preference pairs from the SFT model's distribution. After PPO fine-tuning, the policy's distribution shifts, potentially producing responses that fall outside the RM's training distribution. The RM's predictions become less reliable precisely in the region the policy most wants to exploit. This is the exact offline RL distributional shift problem (Week 11) applied in the RLHF context.

One mitigation is iterative RLHF: alternate between (a) collecting new preference data from the current $\pi_\theta$ and (b) updating the reward model on the expanded dataset. Each iteration keeps the RM's training distribution close to the current policy's output distribution, reducing distributional mismatch.

The alignment tax

RLHF often degrades performance on tasks that were not in the preference dataset. A model fine-tuned for conversational helpfulness may show reduced performance on mathematical reasoning, code generation, or factual recall benchmarks. This tradeoff is the alignment tax: alignment-focused RLHF specializes the model toward preference-annotated behaviors at the cost of general capability.

The alignment tax is partially a consequence of the SFT stage: fine-tuning on a narrow curated dataset risks forgetting the breadth of pretraining knowledge. Approaches including PPO with a cross-entropy auxiliary loss (preventing forgetting), careful learning rate scheduling, and mixing SFT and RL gradient updates have reduced but not eliminated the alignment tax in practice.

Variants and extensions

Constitutional AI (CAI) (Bai et al., 2022) replaces human preference annotators with the model itself: the model critiques its own responses according to a set of principles (the "constitution") and generates revised responses. A reward model is then trained on these model-generated preference pairs rather than human-annotated ones. CAI reduces the cost of preference data collection and enables alignment at scales where human annotation is impractical. The core contribution of CAI is demonstrating that AI-generated preference pairs can approximate human preferences at a fraction of the cost, shifting the alignment bottleneck from annotation to constitution design. However, the approach replaces the question "what do humans prefer?" with "what does our constitution imply?", shifting rather than solving the preference specification problem.

RLAIF (RL from AI Feedback) generalizes CAI: the evaluator model can be a separate, stronger LLM. Preferences are generated by the evaluator rather than humans, then used to train the reward model for the policy. RLAIF can be applied iteratively: the policy improves, the evaluator generates harder preference pairs, the RM improves, the policy improves further. This loop has been reported to approach the quality of human-annotated RLHF at reduced annotation cost. As of 2025–2026, pipelines built on synthetic or automated feedback have been scaled to large models. DeepSeek-R1 is often cited as a related example, though its reasoning-focused RL relies mainly on rule-based verifiable rewards rather than AI preference labels, so it is better viewed as a neighboring approach than as RLAIF proper.

Critical Lens: CAI and Preference Specification

CAI's core insight — replacing human annotators with a constitution-guided model — is elegant but creates a new class of failure modes:

The constitution becomes the alignment bottleneck. Writing a complete, non-contradictory set of principles that covers every preference scenario is at least as hard as specifying the reward function directly. A constitution that says "be helpful" and "be harmless" without resolving their conflict (e.g., a user asking for instructions to circumvent safety filters) leaves the RM to arbitrate an undefined tradeoff. The RLHF pipeline then optimizes against an RM that was itself trained on preference pairs generated under an ambiguous set of rules.

Self-critique introduces a competence ceiling. If the policy $\pi_\theta$ is used both as the critique generator (rewriting its own responses) and as the evaluator (judging which rewrite is better), the preference data reflects only the information available within $\pi_\theta$ — not what a stronger evaluator could observe. A model cannot reliably detect flaws it lacks the knowledge to recognize. RLAIF partially addresses this by using a stronger evaluator (e.g., GPT-4 judging GPT-3.5's responses), but the evaluator's own biases propagate into the RM.

Synthetic preference diversity collapses over iterations. In iterative RLAIF, the evaluator generates preference pairs, the RM trains on them, the policy improves against the RM, and the evaluator generates new pairs from the improved policy's distribution. If the evaluator's preference model is insufficiently diverse (e.g., always preferring verbose but superficial completions), the RM overfits to that narrow signal, the policy specializes to it, and subsequent preference pairs become increasingly homogeneous — an alignment collapse that is hard to detect without human evaluation checkpoints.

Open Problems

Multi-dimensional alignment. Human preferences involve conflicting axes (helpfulness, honesty, harmlessness, conciseness, creativity) that cannot be reduced to a scalar without loss. Methods for multi-objective RLHF — including Pareto-optimal preference aggregation and user-conditional reward models — are active research areas.
Preference model assumptions. The Bradley-Terry model assumes transitivity and context-independence, both violated in practice. Thurstonian models (separating preference noise from reward variance) and embedding-based preference models are alternatives with different failure modes.
Interpretable reward decomposition. Current RMs are black-box networks whose scalar output provides no signal about why a response was preferred. Factorization into interpretable dimensions (factuality score, helpfulness score, safety score) could enable debugging and targeted improvement.
Alignment tax elimination. Every RLHF method studied to date degrades some capability metric. Understanding whether this tradeoff is fundamental (a consequence of the no-free-lunch theorem applied to alignment) or an artifact of current methods remains unresolved.
Detection of reward hacking. There is no reliable, automated method for determining whether an RM score increase represents genuine alignment improvement or reward hacking. Gold RM evaluation is the current standard but requires additional human annotation — precisely the bottleneck RLHF was designed to bypass.

Key takeaways

The RLHF pipeline maps the alignment problem onto standard RL components. SFT shapes the base model into a format suitable for preference optimization and defines the reference distribution. Reward modeling translates pairwise human preferences into a scalar via the Bradley-Terry model and maximum likelihood estimation — avoiding the noise and inconsistency of absolute human scoring. KL-regularized PPO optimizes the policy against the RM while constraining it to the SFT distribution, connecting RLHF directly to offline RL behavior regularization. The closed-form KL-regularized optimal policy — a Boltzmann distribution over the reference model weighted by reward — is the theoretical centerpiece that links RLHF to maximum entropy RL and enables the DPO derivation. Overoptimization, distributional mismatch, and the alignment tax are the principal failure modes of vanilla RLHF, each with mitigation strategies rooted in the RL theory developed throughout the course.

Conceptual questions

The Bradley-Terry model assumes $P(y^w \succ y^l) = \sigma(r(x, y^w) - r(x, y^l))$. This model has a non-identifiability property: adding a constant to all rewards leaves all preference probabilities unchanged. Explain why this means that RLHF cannot learn the absolute level of the reward function, only reward differences between responses to the same prompt. Does this matter for policy optimization? What additional assumption is required to compare the reward of responses to different prompts?
The KL-regularized RLHF objective has the closed-form optimal policy $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)$ . Show what happens to this policy as $\beta \to 0$ and $\beta \to \infty$ . For a fixed RM $r_\phi$ , describe the qualitative behavior of the optimized policy at each extreme. Why does large $\beta$ reduce overoptimization, and what is the cost?
The PPO implementation of RLHF requires four networks (policy, critic, reference model, reward model). For a 7B parameter model, estimate the minimum GPU memory required if all four are loaded simultaneously in FP16. Then explain why this memory constraint motivates approaches that eliminate the reward model (DPO) or the separate critic (GRPO). What architectural compromise does each method make?
Overoptimization experiments (Gao et al., 2023) use a gold reward model — a separate, held-out RM trained on additional human data — to evaluate whether the proxy RM score correlates with true preference as KL divergence from $\pi_{\text{SFT}}$ increases. The proxy RM score increases monotonically while the gold RM score peaks and then decreases. Explain this divergence in terms of the distributional shift between the RM's training distribution and the policy's output distribution during PPO. What does the peak of the gold RM curve represent, and how would you detect this peak during training without access to a gold RM?
Constitutional AI uses model-generated preference pairs rather than human annotations. Identify two ways this approach could fail to produce a well-aligned model even if the constitution is well-specified: one failure mode related to the quality of the evaluator model, and one related to the diversity of the generated preference pairs. Propose a modification to the CAI pipeline that addresses each failure mode.

Solutions

Bradley-Terry scale. Preferences depend only on $r(x,y^w)-r(x,y^l)$, so an additive constant cancels: only reward differences are identifiable, not the absolute level. It does not matter for policy optimization: the optimal $\pi^*\propto\pi_\text{ref}\exp(r/\beta)$ is unchanged by a constant shift (it only rescales the normalizer). To compare rewards across different prompts you need an extra anchoring assumption (a shared per-prompt reference/zero point), since BT only orders responses within one prompt.
KL-regularized optimum. As $\beta\to0$ , $\pi^*\propto\pi_\text{ref}\exp(r/\beta)$ becomes sharply peaked on the highest-reward response — greedy, ignoring $\pi_\text{ref}$ . As $\beta\to\infty$ it returns to $\pi_\text{ref}$ — ignoring reward. Large $\beta$ reduces overoptimization by keeping the policy near $\pi_\text{ref}$ where the RM is in-distribution and reliable, at the cost of capturing less reward improvement (more conservative alignment).
Four-network memory. At FP16, 7B params ≈ 14 GB each; policy, critic, reference, and reward models ≈ $4\times14 = 56$ GB minimum for weights alone (more with optimizer states/activations). This motivates DPO, which removes the reward model and the RL loop by optimizing preferences directly through a closed-form implicit reward, and GRPO, which removes the separate critic by estimating advantages from group-relative sample rewards. The compromise: DPO assumes the BT model and optimizes offline (no online reward-model exploration); GRPO trades a learned value baseline for a higher-variance group statistic needing several samples per prompt.
Overoptimization. As PPO drives the policy away from $\pi_\text{SFT}$ (rising KL), its outputs drift out of the proxy RM's training distribution, so the proxy — evaluated OOD — returns inflated scores while true (gold) preference peaks then falls (Goodhart). The gold peak marks the optimal KL distance where real preference is maximized before overoptimization dominates. Detect it without a gold RM by monitoring KL from $\pi_\text{SFT}$ and watching held-out validation metrics or RM-ensemble disagreement, early-stopping where they start degrading.
Constitutional AI failures. (a) Evaluator quality: if the teacher LLM misjudges the constitution or carries its own biases, the generated preferences are systematically wrong and the model aligns to flawed labels — fix with a stronger/ensemble evaluator calibrated against some human-checked examples. (b) Pair diversity: if generated pairs cover a narrow distribution, the model is unaligned on uncovered inputs — fix by diversifying prompts and responses (higher temperature, varied generators, red-team/adversarial prompts) to broaden coverage.

Looking ahead

The RLHF pipeline is powerful but expensive: four networks, human annotation, and a PPO training loop that requires careful hyperparameter tuning.

Week 13: Direct Preference Optimization and GRPO. We derive DPO's reparameterization of the reward model in terms of the optimal policy — eliminating the reward model entirely — and study GRPO's removal of the critic, tracing what each simplification gives up and what it gains.

Purpose of this lecture#

The three-stage RLHF pipeline#

Stage 1: Supervised fine-tuning#

Supervised Fine-Tuning (SFT) trains the base model on a curated dataset of high-quality prompt–response pairs written or approved by human annotators:

\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{SFT}}}\!\left[\sum_t \log \pi_\theta(y_t \mid x, y_{<t})\right]

Stage 2: Reward modeling#

The core challenge of RLHF is learning $r_\phi(x, y)$ : a function mapping a prompt–response pair to a scalar representing human preference. Two design choices make this tractable.

Pairwise comparisons over absolute scores

The preference dataset takes the form:

\mathcal{D}_{\text{RM}} = \{(x_i, y^w_i, y^l_i)\}_{i=1}^N

where $y^w_i \succ y^l_i$ indicates that the human preferred $y^w_i$ over $y^l_i$ for prompt $x_i$ .

The Bradley-Terry preference model

The probability that a human prefers $y^w$ over $y^l$ is modeled as:

P(y^w \succ y^l \mid x) = \sigma\!\left(r_\phi(x, y^w) - r_\phi(x, y^l)\right)

The reward model is trained by maximum likelihood on the preference dataset:

\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y^w, y^l) \sim \mathcal{D}_{\text{RM}}}\!\left[ \log \sigma\!\left(r_\phi(x, y^w) - r_\phi(x, y^l)\right) \right]

Reward model architecture

Stage 3: Policy optimization with PPO#

With a reward model in hand, the LLM optimization becomes a standard RL problem. The MDP is defined as:

The KL-regularized RLHF objective

The fix is a KL divergence penalty anchoring the optimized policy to the SFT reference model:

\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\!\left[ r_\phi(x, y) - \beta\, D_{\text{KL}}\!\left(\pi_\theta(\cdot|x) \,\|\, \pi_{\text{SFT}}(\cdot|x)\right) \right]

where $\beta > 0$ controls the penalty strength. The KL term is computed token by token and summed over the response:

D_{\text{KL}}(\pi_\theta \| \pi_{\text{SFT}}) = \sum_t \sum_v \pi_\theta(v \mid x, y_{<t}) \log \frac{\pi_\theta(v \mid x, y_{<t})}{\pi_{\text{SFT}}(v \mid x, y_{<t})}

KL-regularized objective: the closed-form solution

The KL-regularized objective has an analytically tractable optimal policy. For any fixed reward function $r$ , the solution to:

\max_\pi \mathbb{E}_y\!\left[r(x,y)\right] - \beta\, D_{\text{KL}}(\pi \| \pi_{\text{ref}})

is the Boltzmann distribution:

\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \cdot \exp\!\left(\frac{r(x,y)}{\beta}\right)

PPO in the RLHF context

In practice, the KL-regularized objective is optimized with PPO (Week 7–8). The token-level reward signal for PPO is constructed as:

\tilde{r}_t = \begin{cases} r_\phi(x, y) - \beta \log \frac{\pi_\theta(y_t | x, y_{<t})}{\pi_{\text{SFT}}(y_t | x, y_{<t})} & t = |y| \\ - \beta \log \frac{\pi_\theta(y_t | x, y_{<t})}{\pi_{\text{SFT}}(y_t | x, y_{<t})} & t < |y| \end{cases}
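
A minimal sketch of this construction, assuming the per-token log-probabilities of the sampled response under both models have already been computed. The function name `token_rewards` and the example values are hypothetical.

```python
def token_rewards(rm_score: float, policy_logprobs, ref_logprobs, beta: float):
    """Per-token rewards for one sampled response y.

    policy_logprobs[t] = log pi_theta(y_t | x, y_<t)
    ref_logprobs[t]    = log pi_SFT(y_t | x, y_<t)

    Every token receives the KL penalty -beta * (log-ratio); the scalar
    reward-model score is added only at the final token. Assumes a
    non-empty response.
    """
    rewards = [
        -beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += rm_score
    return rewards


# Hypothetical two-token response.
example = token_rewards(
    rm_score=2.0,
    policy_logprobs=[-1.0, -2.0],
    ref_logprobs=[-1.5, -1.0],
    beta=0.1,
)
```

Summing the per-token rewards recovers the reward-model score minus $\beta$ times the summed log-ratio, which is a single-sample estimate of the KL penalty in the sequence-level objective.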

Critical Lens: PPO in RLHF

The four-network setup described above is the standard InstructGPT recipe, but it conceals several engineering challenges that make RLHF fragile in practice.

Limitations of vanilla RLHF#

Overoptimization and Goodhart's Law#

Critical Lens: Overoptimization and Goodhart's Law

The overoptimization problem has a deeper structural cause than the proxy-target mismatch described above. Consider the two objectives:

\begin{aligned} J_{\text{proxy}}(\theta) &= \mathbb{E}_{x,y\sim\pi_\theta}\!\left[r_\phi(x,y)\right] - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{SFT}}) \\ J_{\text{true}}(\theta) &= \mathbb{E}_{x,y\sim\pi_\theta}\!\left[r^*(x,y)\right] - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{SFT}}) \end{aligned}

The two objectives differ only in the reward term, and the proxy $r_\phi$ is an imperfect estimate of $r^*$ because:

The RM training set covers a tiny fraction of the response space
Human preferences are inconsistent: two annotators may disagree on the same pair
The Bradley-Terry likelihood assumes preference transitivity, which is violated for complex reasoning tasks

Distributional mismatch in preference data#

The alignment tax#

Variants and extensions#

Critical Lens: CAI and Preference Specification

CAI's core insight — replacing human annotators with a constitution-guided model — is elegant but creates a new class of failure modes:

Open Problems

Multi-dimensional alignment. Human preferences involve conflicting axes (helpfulness, honesty, harmlessness, conciseness, creativity) that cannot be reduced to a scalar without loss. Methods for multi-objective RLHF — including Pareto-optimal preference aggregation and user-conditional reward models — are active research areas.
Preference model assumptions. The Bradley-Terry model assumes transitivity and context-independence, both violated in practice. Thurstonian models (separating preference noise from reward variance) and embedding-based preference models are alternatives with different failure modes.
Interpretable reward decomposition. Current RMs are black-box networks whose scalar output provides no signal about why a response was preferred. Factorization into interpretable dimensions (factuality score, helpfulness score, safety score) could enable debugging and targeted improvement.
Alignment tax elimination. Every RLHF method studied to date degrades some capability metric. Understanding whether this tradeoff is fundamental (a consequence of the no-free-lunch theorem applied to alignment) or an artifact of current methods remains unresolved.
Detection of reward hacking. There is no reliable, automated method for determining whether an RM score increase represents genuine alignment improvement or reward hacking. Gold RM evaluation is the current standard but requires additional human annotation — precisely the bottleneck RLHF was designed to bypass.

Key takeaways#

Conceptual questions#

The Bradley-Terry model assumes $P(y^w \succ y^l \mid x) = \sigma(r(x, y^w) - r(x, y^l))$. This model is identifiable only up to an additive shift: adding a constant to all rewards leaves all preference probabilities unchanged. Explain why this means that RLHF cannot learn the absolute level of the reward function, only reward differences between responses to the same prompt. Does this matter for policy optimization? What additional assumption is required to compare the rewards of responses to different prompts?
The KL-regularized RLHF objective has the closed-form optimal policy $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)$. Show what happens to this policy as $\beta \to 0$ and $\beta \to \infty$. For a fixed RM $r_\phi$, describe the qualitative behavior of the optimized policy at each extreme. Why does large $\beta$ reduce overoptimization, and what is the cost?
The PPO implementation of RLHF requires four networks (policy, critic, reference model, reward model). For a 7B parameter model, estimate the minimum GPU memory required if all four are loaded simultaneously in FP16. Then explain why this memory constraint motivates approaches that eliminate the reward model (DPO) or the separate critic (GRPO). What architectural compromise does each method make?
Overoptimization experiments (Gao et al., 2023) use a gold reward model — a separate, held-out RM trained on additional human data — to evaluate whether the proxy RM score correlates with true preference as KL divergence from $\pi_{\text{SFT}}$ increases. The proxy RM score increases monotonically while the gold RM score peaks and then decreases. Explain this divergence in terms of the distributional shift between the RM's training distribution and the policy's output distribution during PPO. What does the peak of the gold RM curve represent, and how would you detect this peak during training without access to a gold RM?
Constitutional AI uses model-generated preference pairs rather than human annotations. Identify two ways this approach could fail to produce a well-aligned model even if the constitution is well-specified: one failure mode related to the quality of the evaluator model, and one related to the diversity of the generated preference pairs. Propose a modification to the CAI pipeline that addresses each failure mode.

Solutions

Bradley-Terry scale. Preferences depend only on $r(x,y^w)-r(x,y^l)$, so an additive constant cancels: only reward differences are identifiable, not the absolute level. This does not matter for policy optimization, because the optimal $\pi^*\propto\pi_\text{ref}\exp(r/\beta)$ is unchanged by a constant shift (the shift only rescales the normalizer $Z(x)$). To compare rewards across different prompts you need an extra anchoring assumption (a shared per-prompt reference/zero point), since BT only compares responses within one prompt.
KL-regularized optimum. As $\beta\to0$, $\pi^*\propto\pi_\text{ref}\exp(r/\beta)$ becomes sharply peaked on the highest-reward response: greedy, ignoring $\pi_\text{ref}$. As $\beta\to\infty$ it returns to $\pi_\text{ref}$, ignoring the reward. Large $\beta$ reduces overoptimization by keeping the policy near $\pi_\text{ref}$, where the RM is in-distribution and reliable, at the cost of capturing less reward improvement (more conservative alignment).
Four-network memory. At FP16, 7B params ≈ 14 GB each; policy, critic, reference, and reward models ≈ $4\times14 = 56$ GB minimum for weights alone (more with optimizer states/activations). This motivates DPO, which removes the reward model and the RL loop by optimizing preferences directly through a closed-form implicit reward, and GRPO, which removes the separate critic by estimating advantages from group-relative sample rewards. The compromise: DPO assumes the BT model and optimizes offline (no online reward-model exploration); GRPO trades a learned value baseline for a higher-variance group statistic needing several samples per prompt.
Overoptimization. As PPO drives the policy away from $\pi_\text{SFT}$ (rising KL), its outputs drift out of the proxy RM's training distribution, so the proxy — evaluated OOD — returns inflated scores while true (gold) preference peaks then falls (Goodhart). The gold peak marks the optimal KL distance where real preference is maximized before overoptimization dominates. Detect it without a gold RM by monitoring KL from $\pi_\text{SFT}$ and watching held-out validation metrics or RM-ensemble disagreement, early-stopping where they start degrading.
Constitutional AI failures. (a) Evaluator quality: if the teacher LLM misjudges the constitution or carries its own biases, the generated preferences are systematically wrong and the model aligns to flawed labels — fix with a stronger/ensemble evaluator calibrated against some human-checked examples. (b) Pair diversity: if generated pairs cover a narrow distribution, the model is unaligned on uncovered inputs — fix by diversifying prompts and responses (higher temperature, varied generators, red-team/adversarial prompts) to broaden coverage.
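
The arithmetic in the four-network memory solution can be checked with a short sketch. It counts weights only, takes 1 GB as $10^9$ bytes, and assumes all networks have the same parameter count. The function name `weights_gb` is hypothetical, and the network counts per method follow the descriptions in the solution above (DPO keeps the policy and the reference model; GRPO drops only the critic).

```python
def weights_gb(n_params: float, bytes_per_param: int = 2, n_models: int = 4) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes).

    Ignores optimizer states, gradients, activations, and KV caches,
    so the result is a lower bound rather than a realistic budget.
    """
    return n_params * bytes_per_param * n_models / 1e9


# 7B-parameter networks in FP16 (2 bytes per parameter).
ppo_gb = weights_gb(7e9, n_models=4)   # policy, critic, reference, reward model
grpo_gb = weights_gb(7e9, n_models=3)  # policy, reference, reward model
dpo_gb = weights_gb(7e9, n_models=2)   # policy, reference
```

Under these assumptions the three settings need 56 GB, 42 GB, and 28 GB of weight memory respectively; trainable networks need considerably more once gradients and optimizer states are included.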

Looking ahead#

The RLHF pipeline is powerful but expensive: four networks, human annotation, and a PPO training loop that requires careful hyperparameter tuning.
