Week 7: Policy Gradient and Actor–Critic Methods

Purpose of this lecture#

In Week 6, we saw how Deep Q-Learning stabilizes value-based RL through architectural constraints. However, value-based methods fundamentally rely on a max operator over actions, which limits them to discrete action spaces and introduces overestimation bias.

This lecture introduces policy gradient methods, which take a different approach:

Instead of learning how good actions are, we directly learn how likely actions should be.

Policy gradient methods naturally handle continuous action spaces, stochastic policies, and large structured action spaces such as token vocabularies in language models. They form the conceptual foundation for actor–critic methods, GAE, PPO, TRPO, and modern RLHF pipelines.

The lecture follows a logical arc: derive the policy gradient theorem from first principles, identify the variance problem in REINFORCE, develop the baseline and advantage machinery that reduces variance, introduce actor-critic as the key architectural leap that enables bootstrapping, derive GAE as the continuous interpolation, and then develop PPO as the practical stabilization that makes policy gradients deployable at scale.

Policies as parameterized distributions#

A policy is a parameterized probability distribution over actions:

\pi_\theta(a \mid s)

Discrete actions: categorical distribution — $\pi_\theta(a|s) = \text{softmax}(f_\theta(s))_a$
Continuous actions: Gaussian distribution — $\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s),\, \sigma_\theta(s)^2)$
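The two policy classes above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the logits, mean, and standard deviation are hard-coded stand-ins for the network outputs $f_\theta(s)$, $\mu_\theta(s)$, and $\sigma_\theta(s)$, and all function names are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_log_prob(logits, action):
    """log pi(a|s) for a softmax policy whose logits stand in for f_theta(s)."""
    return math.log(softmax(logits)[action])

def gaussian_log_prob(action, mean, std):
    """log pi(a|s) for a one-dimensional Gaussian policy N(mean, std^2)."""
    return (-((action - mean) ** 2) / (2.0 * std ** 2)
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

rng = random.Random(0)

# Discrete case: sample an action index from the categorical distribution.
logits = [2.0, 0.5, -1.0]                 # assumed stand-in for f_theta(s)
probs = softmax(logits)
a_discrete = rng.choices(range(len(probs)), weights=probs)[0]

# Continuous case: sample a real-valued action from the Gaussian.
mean, std = 0.3, 0.5                      # assumed stand-ins for mu_theta(s), sigma_theta(s)
a_continuous = rng.gauss(mean, std)

print(a_discrete, categorical_log_prob(logits, a_discrete))
print(a_continuous, gaussian_log_prob(a_continuous, mean, std))
```

In both cases the policy exposes the same two operations, sampling an action and evaluating its log-probability, which is all the policy gradient machinery below requires.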

The objective is to find $\theta$ that maximizes expected return:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right], \quad R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}

Unlike value-based methods, there is no max over actions — the policy itself is optimized directly via gradient ascent on $J(\theta)$ .

The policy gradient theorem: derivation#

The central question is: how do we compute $\nabla_\theta J(\theta)$ ? The environment transition $P(s_{t+1}|s_t, a_t)$ is not differentiable with respect to $\theta$ , so we cannot simply backpropagate through the trajectory.

Step 1: write J as an integral#

Derivation: The Policy Gradient Theorem

J(\theta) = \int p_\theta(\tau) R(\tau)\, d\tau

where $p_\theta(\tau)$ is the probability of trajectory $\tau$ under policy $\pi_\theta$ .

Taking the gradient:

\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau

Step 2: apply the log-derivative trick#

For any differentiable $p_\theta(\tau) > 0$ :

\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)

Substituting:

\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right]
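As a sanity check on the log-derivative trick, the identity can be verified on a problem small enough to enumerate. The sketch below assumes a hypothetical single-step problem (a three-armed bandit with a softmax policy and fixed rewards), so a "trajectory" is just one action. It compares the score-function form of the gradient, computed exactly by summing over actions, against a finite-difference gradient of the expected reward.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(theta, rewards):
    """J(theta) = sum_a pi_theta(a) r(a) for a single-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def score_function_gradient(theta, rewards):
    """E_a[ grad log pi(a) * r(a) ], computed exactly by enumerating actions."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for i in range(len(theta)):
            # For a softmax policy, d log pi(a) / d theta_i = 1[i == a] - pi(i).
            score_i = (1.0 if i == a else 0.0) - probs[i]
            grad[i] += p_a * score_i * r_a
    return grad

def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central-difference approximation of grad J(theta)."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((expected_reward(up, rewards) - expected_reward(down, rewards)) / (2 * eps))
    return grad

theta = [0.2, -0.4, 1.0]       # arbitrary policy parameters (assumed)
rewards = [1.0, 3.0, -2.0]     # arbitrary per-action rewards (assumed)
g_score = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
print(g_score)
print(g_fd)
```

The two gradients agree up to finite-difference error. In a real problem the sum over actions is replaced by sampling, which is where the variance discussed later comes from.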

Step 3: factorize the trajectory log-probability#

The trajectory probability factorizes as:

p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)

Taking the log:

\log p_\theta(\tau) = \underbrace{\log p(s_0)}_{\text{no } \theta} + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \underbrace{\log P(s_{t+1} \mid s_t, a_t)}_{\text{no } \theta}

The initial state distribution and environment transition terms have zero gradient with respect to $\theta$ . Only the policy terms remain:

\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)

Result: the policy gradient theorem#

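Substituting this sum into the expectation from Step 2 gives $\mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$. Because the action at time $t$ cannot influence rewards received before $t$, those earlier rewards contribute zero in expectation and only the rewards from $t$ onward remain, which sum to $\gamma^t G_t$ with the reward-to-go $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}$. Dropping the factor $\gamma^t$, as is common in practice, gives the form used in the rest of this lecture:

\boxed{ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \; G_t \right] }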

Three things this derivation reveals:

No environment model needed: the environment terms drop out entirely. The gradient depends only on the policy's log-probabilities and the observed returns — both available from experience without knowing $P$ .
The log-derivative trick converts a density gradient into an expectation: this is why policy gradient methods are also called likelihood ratio methods or score function estimators — $\nabla_\theta \log \pi_\theta(a|s)$ is the score function of the policy.
The result holds for any differentiable $\pi_\theta$ : discrete (softmax), continuous (Gaussian), autoregressive (language models). The derivation makes no assumptions about the action space structure. (This generality is why policy gradient is the foundation for both robot learning in Course 2 — where actions are continuous joint torques — and language model alignment in Course 4 — where actions are token selections from a vocabulary.)

Continuous action spaces: the Gaussian policy#

For robotics and continuous control, the canonical policy class is:

\pi_\theta(a \mid s) = \mathcal{N}(\mu_\theta(s),\, \sigma_\theta^2(s))

where $\mu_\theta(s)$ and $\sigma_\theta(s)$ are neural network outputs. The log-probability:

\log \pi_\theta(a \mid s) = -\frac{(a - \mu_\theta(s))^2}{2\sigma_\theta^2(s)} - \log \sigma_\theta(s) - \frac{1}{2}\log 2\pi

The gradient with respect to the mean parameters $\theta_\mu$ :

\nabla_{\theta_\mu} \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma_\theta^2(s)} \nabla_{\theta_\mu} \mu_\theta(s)

Interpretation: when action $a$ produced a high return, the gradient pushes $\mu_\theta(s)$ toward $a$ — the policy increases the probability of actions that worked well. When $a < \mu_\theta(s)$ (the action was below the current mean), the gradient is negative and pushes $\mu$ down. The magnitude scales inversely with variance: in high-uncertainty states (large $\sigma$ ), updates are smaller, providing natural exploration-exploitation coupling.
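The score with respect to the mean can be checked numerically. The sketch below assumes a one-dimensional action and treats $\mu$ and $\sigma$ as plain numbers rather than network outputs (so the factor $\nabla_{\theta_\mu} \mu_\theta(s)$ is omitted); the function names are illustrative.

```python
import math

def gaussian_log_prob(a, mu, sigma):
    """log N(a; mu, sigma^2), matching the log-probability formula above."""
    return (-((a - mu) ** 2) / (2.0 * sigma ** 2)
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))

def score_wrt_mean(a, mu, sigma):
    """Analytic derivative of log pi(a|s) with respect to the mean mu."""
    return (a - mu) / sigma ** 2

a, mu = 1.0, 0.0
for sigma in (0.5, 2.0):
    eps = 1e-6
    numeric = (gaussian_log_prob(a, mu + eps, sigma)
               - gaussian_log_prob(a, mu - eps, sigma)) / (2 * eps)
    # The analytic and finite-difference values should agree closely.
    print(sigma, score_wrt_mean(a, mu, sigma), numeric)

# An action below the mean gives a negative score: mu is pushed down
# when that action is paired with a positive return.
print(score_wrt_mean(-1.0, 0.0, 1.0))
```

With the same action and mean, the score is 16 times larger at $\sigma = 0.5$ than at $\sigma = 2$, which is the inverse-variance scaling described above.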

The standard deviation $\sigma_\theta(s)$ can also be learned: larger $\sigma$ increases entropy and exploration, while smaller $\sigma$ focuses the policy on a narrow action range. In locomotion RL, a common architecture outputs $\mu_\theta(s)$ from the policy network and treats $\sigma$ (usually parameterized as $\log \sigma$) as a learned state-independent parameter. (This Gaussian policy structure is the standard starting point in Course 2's robot learning: the mean $\mu_\theta(s)$ outputs the desired joint torques, and the standard deviation controls exploration-exploitation during skill learning.)

REINFORCE#

The simplest policy gradient algorithm is REINFORCE (Williams, 1992). At the end of each episode, update:

\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \; G_t

where $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}$ is the actual observed return.
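The returns $G_t$ for all timesteps of an episode can be computed in a single backward pass using the recursion $G_t = r_{t+1} + \gamma G_{t+1}$. A minimal sketch, assuming the rewards are stored so that `rewards[t]` holds $r_{t+1}$ (the function name is illustrative):

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] where rewards[t] holds r_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0                      # G_T = 0 after the terminal step
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))   # [1.75, 1.5, 1.0]
```

The backward pass costs $O(T)$, compared with $O(T^2)$ for evaluating the sum separately at every timestep.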

Why REINFORCE has high variance#

The REINFORCE gradient estimator is unbiased — $\mathbb{E}[\hat{g}] = \nabla_\theta J(\theta)$ — but its variance is large. The source is the return $G_t$ : it is a sum of $T - t$ random reward terms:

G_t = r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^{T-t-1} r_T

Each term is random, and their sum accumulates variance. In long-horizon problems, $G_t$ can vary enormously across episodes for the same starting state $s_t$ — a stochastic environment or exploratory policy can produce very different trajectories. The variance of $G_t$ grows with episode length, which is why REINFORCE fails for long episodes even when individual rewards are low-variance.

The consequence is that REINFORCE requires many episodes to average out this variance before gradient estimates become reliable — making it impractically slow for all but the simplest problems. This motivates variance reduction.

REINFORCE algorithm#

```text
# REINFORCE: Monte Carlo Policy Gradient
# Input: differentiable policy π_θ, learning rate α
for episode = 1..M:
    generate trajectory τ = (s₀, a₀, r₁, ..., s_T) using π_θ
    for t = 0..T-1:
        G_t = Σ_{k=0}^{T-t-1} γᵏ r_{t+k+1}           # Monte Carlo return
        θ ← θ + α γᵗ ∇_θ log π_θ(a_t | s_t) G_t       # policy gradient update
```

Notable properties:

Unbiased: the gradient estimate converges to the true policy gradient in expectation.
High variance: $G_t$ is the sum of $T-t$ random reward terms, so variance grows with episode length.
Episode-level updates: weight changes only happen after each complete trajectory, not mid-episode.
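The $\gamma^t$ factor in the update comes from differentiating the discounted objective: it is required for an unbiased gradient of $J(\theta)$ when $\gamma < 1$. Many practical implementations drop it, which matches the episode-level update $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$ given earlier and is exact in the undiscounted case $\gamma = 1$.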
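The pseudocode can be turned into a small runnable sketch. Everything about the environment here is an assumption made for illustration: a hypothetical four-state corridor (start at state 0, terminal at state 3, reward of -1 per step, episodes capped at 20 steps) with a tabular softmax policy. It is not CartPole and is not tuned.

```python
import math
import random

N_STATES = 4            # states 0..3; state 3 is terminal (hypothetical toy corridor)
LEFT, RIGHT = 0, 1
MAX_STEPS = 20          # cap on episode length

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step(state, action):
    """Move left or right; every step costs -1 until the terminal state is reached."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    return next_state, -1.0, next_state == N_STATES - 1

def run_episode(theta, rng):
    """Generate one trajectory with the current stochastic policy."""
    states, actions, rewards = [], [], []
    state = 0
    for _ in range(MAX_STEPS):
        action = rng.choices([LEFT, RIGHT], weights=softmax(theta[state]))[0]
        next_state, reward, done = step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    return states, actions, rewards

def reinforce_update(theta, states, actions, rewards, alpha, gamma):
    """Apply the per-step REINFORCE update from the pseudocode, in place."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):          # Monte Carlo returns G_t
        running = rewards[t] + gamma * running
        returns[t] = running
    for t, (s, a) in enumerate(zip(states, actions)):
        probs = softmax(theta[s])
        for i in range(len(probs)):
            score_i = (1.0 if i == a else 0.0) - probs[i]   # grad of log softmax
            theta[s][i] += alpha * (gamma ** t) * score_i * returns[t]

def train(episodes=2000, alpha=0.02, gamma=0.99, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]    # one logit pair per state
    for _ in range(episodes):
        states, actions, rewards = run_episode(theta, rng)
        reinforce_update(theta, states, actions, rewards, alpha, gamma)
    return theta

theta = train()
print([round(softmax(row)[RIGHT], 3) for row in theta[:N_STATES - 1]])
```

Since moving right is the better action in every non-terminal state, one would expect the printed probabilities of `RIGHT` to rise above 0.5 as training proceeds, though the speed depends on the seed and step size. Note that all returns are negative here, so without a baseline every sampled action has its probability pushed down, just by different amounts.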

Variance reduction: the REINFORCE identity and baselines#

The REINFORCE identity#

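The key identity underlying all baseline subtraction is that the score function has zero mean under the policy:

\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0

For continuous actions, the sum is replaced by an integral.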

Baseline subtraction#

For any function $b(s_t)$ that does not depend on $a_t$ :

\mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = b(s_t)\, \underbrace{\mathbb{E}_{a_t}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]}_{= 0} = 0

Therefore we can subtract $b(s_t)$ from $G_t$ without changing the expected gradient:

\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(G_t - b(s_t)\right) \right]

The gradient remains unbiased for any choice of $b(s_t)$ . The variance is reduced because $G_t - b(s_t)$ has smaller magnitude than $G_t$ when $b(s_t)$ is a good predictor of $G_t$ .
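Both claims (unchanged mean, reduced variance) can be checked exactly on a toy case. The sketch below assumes a hypothetical single-state problem with two actions, a uniform softmax policy, and rewards of 10 and 12. It computes the exact mean and variance of one component of the gradient estimator, with and without a baseline equal to the expected reward.

```python
def estimator_stats(probs, rewards, baseline):
    """Exact mean and variance of g = d log pi(a) / d theta_0 * (r(a) - b)."""
    samples = []
    for a, r in enumerate(rewards):
        score_0 = (1.0 if a == 0 else 0.0) - probs[0]   # softmax score for logit 0
        samples.append(score_0 * (r - baseline))
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

probs = [0.5, 0.5]
rewards = [10.0, 12.0]
value = sum(p * r for p, r in zip(probs, rewards))      # expected reward, 11.0

mean_raw, var_raw = estimator_stats(probs, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(probs, rewards, baseline=value)
print(mean_raw, var_raw)      # -0.5 30.25
print(mean_base, var_base)    # -0.5 0.0
```

The expected gradient is identical in both cases, but the raw estimator swings between +5 and -6 while the baselined one is constant. The variance drops to exactly zero only because this example has two actions and deterministic rewards; in general the baseline reduces variance without eliminating it.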

The advantage function#

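A natural and widely used baseline is the state value, $b(s_t) = V^\pi(s_t)$. It is not exactly variance-minimizing (the optimal baseline also weights returns by the squared norm of the score), but it is close in practice and has a clean interpretation: since $\mathbb{E}[G_t \mid s_t, a_t] = Q^\pi(s_t, a_t)$, the centered return $G_t - V^\pi(s_t)$ is an unbiased estimate of the advantage function: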

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

Intuitively: $A^\pi(s,a) > 0$ means action $a$ is better than average in state $s$ ; $A^\pi(s,a) < 0$ means it is worse than average. Using advantage rather than raw return centers the updates around zero, dramatically reducing variance.

Three advantage estimators#

This distinction is critical and frequently confused:

| Estimator | Formula | Bias | Variance |
|---|---|---|---|
| True advantage | $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ | Zero | N/A (not an estimator) |
| MC estimate | $\hat{A}_t^{MC} = G_t - V^\pi(s_t)$ | Zero | High (sum of $T-t$ random terms) |
| TD(0) estimate | $\hat{A}_t^{\text{TD}} = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ | Nonzero | Low |

REINFORCE with baseline uses $\hat{A}_t^{MC}$: it is unbiased but still depends on the full Monte Carlo return $G_t$, so it does not bootstrap and its variance remains high. Actor-critic methods use $\hat{A}_t^{\text{TD}}$: they bootstrap via $V_\phi(s_{t+1})$, introducing bias but dramatically reducing variance. GAE interpolates between them.

Entropy regularization#

Policy gradients can collapse to deterministic policies prematurely — once the policy assigns high probability to a single action in each state, the gradient signal for other actions vanishes and exploration stops.

To prevent premature collapse, add an entropy bonus:

J_{\text{entropy}}(\theta) = J(\theta) + \beta\, \mathbb{E}_{s \sim d^{\pi_\theta}}\!\left[\mathcal{H}\!\left(\pi_\theta(\cdot \mid s)\right)\right]

where $\mathcal{H}(\pi) = -\sum_a \pi(a|s) \log \pi(a|s)$ is the entropy of the policy. The coefficient $\beta$ controls the exploration-exploitation tradeoff.
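To make the entropy term concrete, the sketch below evaluates $\mathcal{H}$ for a few categorical distributions. It covers only the discrete formula given above; the function name is illustrative.

```python
import math

def categorical_entropy(probs):
    """H(pi) = -sum_a pi(a) log pi(a), with the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
deterministic = [1.0, 0.0, 0.0, 0.0]

for name, probs in [("uniform", uniform), ("peaked", peaked), ("deterministic", deterministic)]:
    print(name, categorical_entropy(probs))
```

Entropy is largest for the uniform policy ($\log 4$ for four actions) and zero for a deterministic one, so the bonus $\beta\,\mathcal{H}$ rewards policies that keep probability mass spread over several actions.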

Entropy regularization appears in most modern actor-critic implementations (for example PPO and SAC; RLHF pipelines typically rely on a closely related KL penalty) and is particularly important in large action spaces (language vocabularies, continuous joint spaces) where naive policy gradient would collapse to narrow modes.

Actor–critic methods#

REINFORCE with advantage still requires full-episode Monte Carlo returns — updates are only possible at episode end, and variance remains high for long episodes. Actor-critic methods replace the Monte Carlo return with a bootstrapped TD estimate, enabling step-by-step updates.

Architecture#

Actor: policy $\pi_\theta(a \mid s)$ — decides what action to take.
Critic: value function $V_\phi(s)$ — estimates the expected return from state $s$ .

The critic is not just a baseline. In REINFORCE with baseline, $V^\pi(s_t)$ is used to center returns but the actual return $G_t$ still propagates the gradient. In actor-critic, the critic enables a TD advantage estimate that does not require waiting for episode termination:

\hat{A}_t^{TD} = \underbrace{r_{t+1} + \gamma V_\phi(s_{t+1})}_{\text{TD target}} - V_\phi(s_t) = \delta_t

This is the one-step TD error — the same $\delta_t$ from Week 4. The actor update uses this estimate in place of $G_t - V^\pi(s_t)$ :

\theta \leftarrow \theta + \alpha_\theta\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t

The critic training objective#

The critic is trained to minimize the TD prediction error:

\mathcal{L}_{\text{critic}}(\phi) = \mathbb{E}_t\left[\left(r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\right)^2\right] = \mathbb{E}_t\left[\delta_t^2\right]
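A single online actor-critic update can be sketched with tabular parameters. The sketch assumes a softmax actor with one logit per state-action pair and a critic stored as one value per state; the critic step is a semi-gradient TD(0) update, meaning the TD target is treated as a constant. Function and argument names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, values, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.5, gamma=0.99):
    """One online update; returns the TD error used as the advantage estimate."""
    bootstrap = 0.0 if done else values[s_next]      # V(terminal) = 0
    delta = r + gamma * bootstrap - values[s]        # one-step TD error
    # Critic: move V(s) toward the TD target (semi-gradient step on delta^2).
    values[s] += alpha_critic * delta
    # Actor: policy gradient step with delta in place of G_t - V(s_t).
    probs = softmax(theta[s])
    for i in range(len(probs)):
        score_i = (1.0 if i == a else 0.0) - probs[i]
        theta[s][i] += alpha_actor * score_i * delta
    return delta

theta = [[0.0, 0.0], [0.0, 0.0]]     # two states, two actions
values = [0.0, 1.0]
delta = actor_critic_step(theta, values, s=0, a=1, r=1.0, s_next=1,
                          done=False, gamma=0.5)
print(delta, values, theta)
```

Unlike REINFORCE, the update uses only the single transition $(s_t, a_t, r_{t+1}, s_{t+1})$, so it can be applied at every step without waiting for the episode to end.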

In practice, the actor and critic often share a common feature extraction network (the trunk) with separate output heads. (This shared-trunk, dual-head architecture is foundational in Course 2: the trunk learns robot state representations from visual observations or proprioceptive data, the actor head outputs desired actions, and the critic head estimates value, enabling efficient learning in the high-dimensional, continuous action setting of robotic control.) The two heads are then trained together with a single joint loss:

\mathcal{L}(\theta, \phi) = \underbrace{-\mathbb{E}_t\left[\hat{A}_t\, \log \pi_\theta(a_t \mid s_t)\right]}_{\text{actor loss}} + c_1\, \underbrace{\mathbb{E}_t\left[\delta_t^2\right]}_{\text{critic loss}} - c_2\, \underbrace{\mathbb{E}_t\left[\mathcal{H}(\pi_\theta(\cdot \mid s_t))\right]}_{\text{entropy bonus}}

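where $c_1, c_2$ are tunable coefficients and $\hat{A}_t$ is treated as a constant in the actor term (no gradient flows through it into the critic). This is the joint loss of A2C-style actor-critic implementations. PPO, including its use in RLHF fine-tuning, keeps the critic and entropy terms but replaces the actor term with a clipped surrogate objective.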

A2C and A3C#

Advantage Actor-Critic (A2C) applies the actor-critic update synchronously: all parallel workers complete their rollouts, gradients are aggregated, and a single synchronized update is applied. The synchronization ensures consistent gradient estimates and is more stable than asynchronous updates.

Asynchronous Advantage Actor-Critic (A3C) runs workers asynchronously: each worker pushes gradient updates to a shared parameter server without waiting for others. The asynchrony decorrelates experience across workers (similar to a replay buffer) without requiring off-policy corrections, since each worker acts with a recent copy of the shared policy. A2C is generally preferred in practice because synchronous updates produce more stable learning: the additional variance from asynchronous gradient staleness typically outweighs the decorrelation benefit.

Generalized Advantage Estimation (GAE)#

The TD(0) advantage estimate $\delta_t$ has low variance but nonzero bias (because $V_\phi \neq V^\pi$ ). The Monte Carlo estimate $G_t - V_\phi(s_t)$ has zero bias but high variance. GAE provides a principled interpolation.

Derivation from n-step advantages#

The $n$ -step advantage estimate:

\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n V_\phi(s_{t+n}) - V_\phi(s_t) = \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}

where the second equality rewrites the n-step return as a telescoping sum of TD errors. This is the same structure as the n-step TD return from Week 4.

GAE takes the exponentially weighted average over all $n$ -step estimates:

A_t^{\text{GAE}(\lambda)} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}_t^{(n)} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}

Swapping the order of summation (each $\delta_{t+k}$ appears in all $\hat{A}_t^{(n)}$ for $n > k$ ):

A_t^{\text{GAE}(\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}

This is a geometric series of TD errors, decaying at rate $\gamma\lambda$ . Recent TD errors (small $l$ ) receive high weight; distant ones receive exponentially less credit.
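For completeness, the interchange of sums can be written out step by step, assuming $0 \le \lambda < 1$ and bounded TD errors so that the series converge. Collecting the coefficient of each $\delta_{t+k}$:

(1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \sum_{k=0}^{n-1} \gamma^k \delta_{t+k} = (1-\lambda)\sum_{k=0}^{\infty} \gamma^k \delta_{t+k} \sum_{n=k+1}^{\infty} \lambda^{n-1}

The inner sum is a geometric series:

\sum_{n=k+1}^{\infty} \lambda^{n-1} = \frac{\lambda^k}{1-\lambda}

so the factor $(1-\lambda)$ cancels and

A_t^{\text{GAE}(\lambda)} = \sum_{k=0}^{\infty} \gamma^k \lambda^k\, \delta_{t+k} = \sum_{k=0}^{\infty} (\gamma\lambda)^k\, \delta_{t+k}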

Endpoint special cases:

$\lambda = 0$ : only $\delta_t$ survives → TD(0) advantage, minimum variance, maximum bias.
$\lambda = 1$ : the weights reduce to $\gamma^l$ , so the TD errors telescope → $\hat{A}_t^{(\infty)} = G_t - V_\phi(s_t)$ , the Monte Carlo advantage, with zero bias and maximum variance.

The effective bias-variance tradeoff as a function of $\lambda$ :

| $\lambda$ | Bias | Variance | Effective horizon |
|---|---|---|---|
| 0.0 | High (only 1 TD step) | Lowest | 1 step |
| 0.95 | Low | Moderate | ~20 steps ( $1/(1-0.95)$ ) |
| 1.0 | Zero | Highest | Full episode |

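$\lambda = 0.95$ is standard in PPO implementations. Counting only the $\lambda$ decay, this corresponds to an effective advantage horizon of about $1/(1-\lambda) = 20$ TD steps; including discounting, the weights decay at rate $\gamma\lambda$ (about 0.94 for $\gamma = 0.99$, or roughly 17 steps). Either way, this is enough to capture meaningful return structure while controlling variance.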
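In code, GAE is usually computed with the backward recursion $A_t = \delta_t + \gamma\lambda A_{t+1}$, which follows directly from the geometric-series form. The sketch below assumes a single rollout with no episode boundary inside it; `last_value` is the bootstrap value $V_\phi(s_T)$ for the state after the final step (0 if that state is terminal). The names are illustrative.

```python
def compute_gae(rewards, values, last_value, gamma, lam):
    """GAE advantages for one rollout.

    rewards[t] holds r_{t+1}, values[t] holds V(s_t), and last_value is
    V(s_T) for the state following the final transition.
    """
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        next_value = values[t + 1] if t + 1 < horizon else last_value
        delta = rewards[t] + gamma * next_value - values[t]   # TD error
        running = delta + gamma * lam * running               # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]

td0 = compute_gae(rewards, values, last_value=0.0, gamma=0.5, lam=0.0)
mc = compute_gae(rewards, values, last_value=0.0, gamma=0.5, lam=1.0)
print(td0)   # [0.75, 0.75, 0.5]  -> the TD errors themselves
print(mc)    # [1.25, 1.0, 0.5]   -> G_t - V(s_t)
```

The two endpoint cases match the special cases above: with $\lambda = 0$ the output is the list of one-step TD errors, and with $\lambda = 1$ it equals the Monte Carlo returns $[1.75, 1.5, 1.0]$ minus the value estimates.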

Connection to TD(λ) from Week 4#

GAE is exactly the policy gradient analog of TD(λ): both take exponentially weighted combinations of $n$ -step estimates, with $\lambda$ as the decay parameter. The difference is that TD(λ) applies the weighting to value estimates for policy evaluation, while GAE applies the same weighting to advantage estimates for policy improvement. The derivation is identical.

Trust regions and PPO#

Actor-critic with GAE still suffers from destructive policy updates. A large gradient step changes $\pi_\theta$ substantially, which invalidates the advantage estimates computed under the old policy. The new policy collects different data, leading to further degraded estimates — a catastrophic feedback loop that can permanently destabilize training.

The fundamental problem#

In supervised learning, a large gradient step may hurt performance but the loss function remains valid — we can detect the mistake and recover. In RL, the policy generates its own training data. A bad policy update changes what data is collected, which changes the gradient, which may push the policy further in the wrong direction. The feedback loop has no self-correcting mechanism.

TRPO: trust region policy optimization#

TRPO (Schulman et al., 2015) constrains each update to stay within a trust region defined by the KL divergence between old and new policies:

\max_\theta\; L(\theta_{\text{old}}, \theta) \quad \text{subject to} \quad \mathbb{E}_s\left[D_{KL}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right)\right] \leq \delta

where $L(\theta_{\text{old}}, \theta) = \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} A_t\right]$ is the importance-weighted policy objective.

TRPO comes with a monotone improvement guarantee in theory: each constrained update is guaranteed not to decrease $J(\theta)$ (up to approximation error). The cost is computational complexity. Each step solves a constrained optimization problem using the Fisher information matrix (the Hessian of the KL with respect to $\theta$), in practice through conjugate gradient on Fisher-vector products followed by a backtracking line search. This is expensive for large neural networks.
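The two quantities in the TRPO problem, the importance-weighted surrogate and the mean KL constraint, are simple to evaluate even though optimizing under the constraint is not. The sketch below only evaluates them for categorical policies on a hypothetical batch of two states; it does not implement the conjugate-gradient solver, and it assumes every probability under the new policy is strictly positive.

```python
import math

def categorical_kl(p, q):
    """D_KL(p || q) for categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def surrogate(ratios, advantages):
    """Sample mean of rho_t * A_t, the importance-weighted objective."""
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

def within_trust_region(old_policies, new_policies, delta):
    """Check the mean-KL constraint over a batch of visited states."""
    kls = [categorical_kl(p_old, p_new)
           for p_old, p_new in zip(old_policies, new_policies)]
    return sum(kls) / len(kls) <= delta

old = [[0.5, 0.5], [0.9, 0.1]]                 # pi_old(.|s) at two states
small_step = [[0.55, 0.45], [0.88, 0.12]]      # candidate close to pi_old
large_step = [[0.95, 0.05], [0.3, 0.7]]        # candidate far from pi_old

print(within_trust_region(old, small_step, delta=0.01))   # True
print(within_trust_region(old, large_step, delta=0.01))   # False
```

At $\theta = \theta_{\text{old}}$ every ratio equals 1, so the surrogate reduces to the mean advantage and the KL is zero; the constraint only starts to bind as the candidate policy moves away.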

PPO: proximal policy optimization#

PPO (Schulman et al., 2017) approximates the TRPO constraint with a clipped surrogate objective that is simple to implement and computationally cheap:

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( \rho_t(\theta)\, A_t,\;\; \text{clip}\!\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t \right) \right]

where the importance ratio:

\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

measures how much the current policy differs from the policy that collected the data.

Understanding the PPO objective#

The clipped objective has four cases depending on the sign of $A_t$ and whether $\rho_t$ is inside or outside $[1-\epsilon, 1+\epsilon]$ :

| Advantage | Ratio | Effect |
|---|---|---|
| $A_t > 0$ (good action) | $\rho_t \leq 1+\epsilon$ | Normal update: increase action probability |
| $A_t > 0$ | $\rho_t > 1+\epsilon$ | Clipped: don't increase probability further |
| $A_t < 0$ (bad action) | $\rho_t \geq 1-\epsilon$ | Normal update: decrease action probability |
| $A_t < 0$ | $\rho_t < 1-\epsilon$ | Clipped: don't decrease probability further |

The $\min$ ensures the objective is never more optimistic than the clipped version: updates that would push $\rho_t$ far from 1 in the direction of improvement are capped, preventing overconfident large steps. Updates that push $\rho_t$ far in the direction of degradation are not capped — they are always allowed to reduce the objective.
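The case analysis can be reproduced with a per-sample version of the clipped objective. This is a minimal sketch for a single $(\rho_t, A_t)$ pair, with $\epsilon = 0.2$ chosen for illustration; `gradient_is_active` is a hypothetical helper that reports whether the unclipped term is the one selected by the $\min$, which is when the sample still contributes a gradient.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO objective min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    rho = math.exp(logp_new - logp_old)
    clipped_rho = max(1.0 - eps, min(rho, 1.0 + eps))
    return min(rho * advantage, clipped_rho * advantage)

def gradient_is_active(logp_new, logp_old, advantage, eps=0.2):
    """True when the min selects the unclipped term rho * A."""
    rho = math.exp(logp_new - logp_old)
    clipped_rho = max(1.0 - eps, min(rho, 1.0 + eps))
    return rho * advantage <= clipped_rho * advantage

cases = [(1.1, 1.0), (1.5, 1.0), (0.9, -1.0), (0.5, -1.0), (1.5, -1.0), (0.5, 1.0)]
for rho, adv in cases:
    # logp_old = 0, so logp_new = log(rho) reproduces the desired ratio.
    value = clipped_surrogate(math.log(rho), 0.0, adv)
    active = gradient_is_active(math.log(rho), 0.0, adv)
    print(f"rho={rho:.1f} A={adv:+.0f} objective={value:+.2f} active={active}")
```

The first four cases correspond to the rows of the table. The last two are the degrading direction: a bad action whose probability has gone up ($\rho_t = 1.5$, $A_t < 0$) and a good action whose probability has gone down ($\rho_t = 0.5$, $A_t > 0$). Both remain unclipped, so the gradient can still pull the policy back.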

Connection to Week 4 importance sampling#

The ratio $\rho_t = \pi_\theta / \pi_{\theta_{\text{old}}}$ is the importance sampling ratio from Week 4, applied to the policy gradient update. PPO collects data under $\pi_{\theta_{\text{old}}}$ (the behavior policy) and updates $\pi_\theta$ (the target policy) over several epochs on the same batch. After the first gradient step the two policies differ, so the update is mildly off-policy even though PPO is usually classed as an on-policy method. The $\rho_t$ ratio corrects for the distribution mismatch, and clipping it prevents the variance explosion from large importance weights that was identified in Week 4. PPO's clipping is the policy gradient analog of the importance weight clipping in off-policy TD.

The full PPO training loss#

In practice, PPO optimizes the joint actor-critic objective:

\mathcal{L}^{\text{PPO}}(\theta, \phi) = -L^{\text{CLIP}}(\theta) + c_1\, \mathbb{E}_t\!\left[\left(V_\phi(s_t) - V_t^{\text{target}}\right)^2\right] - c_2\, \mathbb{E}_t\!\left[\mathcal{H}(\pi_\theta(\cdot \mid s_t))\right]

where $V_t^{\text{target}}$ is the return target computed from GAE (typically the GAE advantage plus the value estimate recorded at rollout time) and $c_1 \approx 0.5$ , $c_2 \approx 0.01$ are standard coefficients. This is the actual loss function used in PPO implementations and in RLHF fine-tuning of language models.

PPO in RLHF: the full picture#

RLHF fine-tunes a language model $\pi_\theta$ using PPO with a learned reward model $r_\phi$ . Prompts $x$ are drawn from a prompt dataset $\mathcal{D}$ and responses $y$ are sampled from the policy. The complete objective adds a KL penalty against the reference model $\pi_{\text{ref}}$ (the SFT checkpoint):

J^{\text{RLHF}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) - \beta\, D_{KL}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right) \right]

Every component of this objective maps directly onto the policy gradient framework:

| RLHF component | Policy gradient analog |
|---|---|
| $\pi_\theta$ (language model) | Actor: $\pi_\theta(a \mid s)$ |
| $r_\phi(x, y)$ (reward model) | Return signal: $R(\tau)$ |
| PPO surrogate $L^{\text{CLIP}}$ | Policy gradient with clipped importance ratio |
| GAE advantage estimates | Variance reduction for policy gradient |
| KL penalty from $\pi_{\text{ref}}$ | Entropy/trust region regularization |
| Value head $V_\phi$ | Critic: $V_\phi(s)$ |

The KL penalty serves the same role as TRPO's trust region constraint: it prevents the policy from moving too far from a stable reference point ( $\pi_{\text{ref}}$ ), which in the language setting prevents reward hacking (exploiting the reward model far outside the distribution it was trained on). (This RLHF framework is the direct application of Week 7's policy gradient machinery in Course 4 (Week 12) — the language model becomes the actor, the reward model becomes the return signal, and PPO with KL penalties becomes the optimization algorithm. Understanding this week is prerequisite to understanding language model alignment.)
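One common way to fold the KL penalty into PPO, sketched here under stated assumptions rather than as the definitive recipe, is to estimate the KL from the sampled tokens and convert it into a per-token reward: each generated token receives $-\beta\,(\log \pi_\theta - \log \pi_{\text{ref}})$, and the reward model score is added at the final token. The function name and the toy log-probabilities below are hypothetical.

```python
def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token rewards for one sampled response.

    logp_policy[t] and logp_ref[t] are the log-probabilities that the
    current policy and the reference model assign to the t-th generated
    token. The reward model score is credited to the last token.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

rm_score = 2.0                      # assumed scalar score r_phi(x, y)
logp_policy = [-1.0, -0.5]          # assumed token log-probs under pi_theta
logp_ref = [-1.5, -0.5]             # assumed token log-probs under pi_ref

rewards = kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.1)
print(rewards)        # per-token rewards
print(sum(rewards))   # reward model score minus beta times the summed log-ratio
```

The summed log-ratio over a sampled response is a single-sample estimate of the sequence-level KL in the objective. With $\beta = 0$ the shaped rewards reduce to the raw reward model score, which removes the pull toward $\pi_{\text{ref}}$.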

Key takeaways#

The lecture develops a logical progression from first principles to the state of the art. The policy gradient theorem is derived via the log-derivative trick: the environment terms drop out because only $\pi_\theta$ depends on $\theta$ , giving a gradient that requires no model. REINFORCE is the direct implementation: unbiased but high-variance because $G_t$ is a sum of $T-t$ random terms. The REINFORCE identity proves that any state-dependent baseline can be subtracted without bias, and choosing $V^\pi$ as the baseline turns the centered return into an estimate of the advantage function. The three advantage estimators (true, MC, TD) form a bias-variance spectrum, and the actor-critic architecture enables the TD estimate via a learned critic that supports step-by-step bootstrapped updates.

GAE derives the exponentially weighted combination of n-step advantages, recovering TD(0) at $\lambda=0$ and Monte Carlo at $\lambda=1$ , with $\lambda=0.95$ as the standard practical choice. The continuous action space Gaussian policy makes the abstract theorem concrete for robotics applications. TRPO solves the destructive update problem via a KL-constrained trust region with monotone improvement guarantees. PPO approximates TRPO with a clipped importance ratio, connecting directly to the Week 4 importance sampling framework — clipping prevents variance explosion from large $\rho_t$ . The full PPO loss combines clipped policy objective, critic regression, and entropy bonus in a single joint objective. And the RLHF formulation maps every component of the policy gradient framework onto the language model alignment setting.

Conceptual questions#

Derive the policy gradient theorem for a two-step episodic MDP: $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2$ (terminal), with rewards $r_1$ and $r_2$ . Write out $p_\theta(\tau)$ explicitly, apply the log-derivative trick, and show that the environment transition terms $\log P(s_1|s_0,a_0)$ and $\log P(s_2|s_1,a_1)$ cancel. What does the final expression for $\nabla_\theta J(\theta)$ look like?
REINFORCE with baseline subtracts $V^\pi(s_t)$ from $G_t$ . Prove using the REINFORCE identity that this subtraction does not change the expected gradient. Then explain intuitively why it reduces variance: what property of $V^\pi(s_t)$ makes it a good baseline compared to, say, a constant $b = 0$ ?
An actor-critic uses the TD(0) advantage estimate $\hat{A}_t = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ . When $V_\phi \neq V^\pi$ (i.e., the critic is not perfectly trained), this estimate is biased. Describe the direction of the bias when the critic systematically underestimates $V^\pi$ , and explain how GAE with $\lambda$ close to 1 reduces this bias at the cost of increased variance.
PPO clips the importance ratio $\rho_t = \pi_\theta / \pi_{\theta_{\text{old}}}$ to $[1-\epsilon, 1+\epsilon]$ . The clipping is asymmetric: for either sign of the advantage, it removes the incentive to move $\rho_t$ further in the improving direction, but it leaves the objective unclipped when $\rho_t$ has moved in the degrading direction (for example $A_t < 0$ with $\rho_t > 1+\epsilon$ ). Explain this asymmetry using the $\min$ operator in the PPO objective. Why would clipping the degrading side as well be harmful?
An RLHF pipeline fine-tunes a language model using PPO with $\beta = 0.0$ (no KL penalty against the reference model). After 10,000 updates, the model achieves very high reward model scores but produces outputs that no longer resemble natural language. Diagnose this failure in terms of the trust region / KL constraint, reward hacking, and the importance sampling distribution mismatch. What value of $\beta$ should be used, and what does it control geometrically in the policy space?

Solutions

Two-step policy gradient. $p_\theta(\tau)=p(s_0)\,\pi_\theta(a_0|s_0)\,P(s_1|s_0,a_0)\,\pi_\theta(a_1|s_1)\,P(s_2|s_1,a_1)$ . Taking $\log$ and $\nabla_\theta$ , the initial-state and transition terms have no $\theta$ and vanish, leaving $\nabla_\theta\log p_\theta(\tau)=\nabla_\theta\log\pi_\theta(a_0|s_0)+\nabla_\theta\log\pi_\theta(a_1|s_1)$ . Thus $\nabla_\theta J = \mathbb{E}\big[(r_1+r_2)\,(\nabla\log\pi(a_0|s_0)+\nabla\log\pi(a_1|s_1))\big]$ — the environment dynamics cancel.
Baseline. $\mathbb{E}[\nabla\log\pi(a|s)\,b(s)] = b(s)\sum_a\pi(a|s)\nabla\log\pi(a|s) = b(s)\nabla\sum_a\pi(a|s) = b(s)\nabla 1 = 0$ , so subtracting $V^\pi(s)$ leaves the expected gradient unchanged. It reduces variance because $V^\pi(s)$ is the expected return from $s$ , so the advantage $G_t-V^\pi(s)$ is centered near zero — only deviations from the state's average matter, unlike $b=0$ which leaves the large state-dependent return magnitudes in the estimator.
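TD(0) advantage bias. Write $e(s) = V_\phi(s) - V^\pi(s)$ for the critic error. Taking the expectation over $s'$ gives $\mathbb{E}[\hat A_t \mid s, a] = A^\pi(s,a) + \gamma\,\mathbb{E}[e(s')] - e(s)$ , so the bias is $\gamma\,\mathbb{E}[e(s')] - e(s)$ . If the critic underestimates by a uniform amount $c > 0$ ( $e \equiv -c$ ), the bias is $(1-\gamma)c \geq 0$ : advantages are slightly overestimated, and by the full $c$ on transitions into a terminal state, where the bootstrap value is exactly 0. If the underestimation is uneven, the sign depends on whether $s$ or $s'$ is underestimated more. GAE with $\lambda\to1$ weights longer, more Monte-Carlo-like returns that rely less on the biased critic, reducing bias at the cost of higher variance (long noisy returns); $\lambda\to0$ is the opposite low-variance, high-bias extreme.
PPO clip asymmetry. The objective $\min(\rho_t\hat A,\ \mathrm{clip}(\rho_t,1-\epsilon,1+\epsilon)\hat A)$ always takes the more pessimistic of the two terms. Movement in the improving direction ( $\rho_t > 1+\epsilon$ with $\hat A > 0$ , or $\rho_t < 1-\epsilon$ with $\hat A < 0$ ) selects the clipped term, which is constant in $\theta$ , so there is no gradient and no incentive to move further. Movement in the degrading direction (for example $\rho_t > 1+\epsilon$ with $\hat A < 0$ , where the policy has raised the probability of a bad action) selects the unclipped term, so the gradient still pushes the probability back down. Clipping that side too would zero the gradient exactly where a correction is most needed, letting harmful changes persist, which is why only the improving side is clipped.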
RLHF with $\beta=0$ . No KL penalty removes the trust region against $\pi_\text{ref}$ , so the policy drifts arbitrarily to chase RM score (reward hacking) and leaves the natural-language manifold, while the importance ratios blow up as $\pi_\theta$ diverges (distribution mismatch, high-variance/invalid estimates). Use $\beta>0$ (e.g. ~0.01–0.1): geometrically it confines the policy to a KL ball around $\pi_\text{ref}$ , keeping outputs language-like and the RM in-distribution.

Coding exercises#

Exercise 1: REINFORCE with baseline on CartPole#

Implement REINFORCE with a learned baseline on the CartPole-v1 environment:

Define a policy network $\pi_\theta$ with two output heads: action logits (for the categorical distribution) and a scalar $V_\phi$ (the baseline).
Generate complete episodes using the current stochastic policy — sample actions from the categorical distribution.
After each episode, compute $G_t$ for every timestep and the advantage $\hat{A}_t = G_t - V_\phi(s_t)$ .
Update the policy head via gradient ascent and the value head via gradient descent on the MSE of $(G_t - V_\phi(s_t))^2$ .
Plot episode return vs. episode number. Compare against a version that uses a constant baseline $b = 0$ (i.e., plain REINFORCE) — how much does the learned baseline reduce variance?

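Expected outcome: REINFORCE with baseline should reach a mean return of ~200 within 2,000–5,000 episodes. Note that 200 is the episode cap of CartPole-v0; CartPole-v1 episodes run for up to 500 steps, so returns can keep rising beyond 200.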

Exercise 2: One-step actor-critic with bootstrapping#

Modify Exercise 1 to use one-step TD bootstrap instead of Monte Carlo returns:

Replace $\hat{A}_t = G_t - V_\phi(s_t)$ with $\hat{A}_t = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ (where $V_\phi(s_T) = 0$ for terminal states).
Update both actor and critic after every step rather than at episode end.
Add an entropy bonus to the actor loss: $\mathcal{L}_{\text{actor}} = -(\hat{A}_t \log \pi_\theta(a_t|s_t) + \beta \mathcal{H}(\pi_\theta(\cdot|s_t)))$ with $\beta = 0.01$ .
Compare learning speed (steps to reach a mean return of 195, the CartPole-v0 solve threshold; CartPole-v1 uses 475) between MC and TD variants. Which converges faster, and why?

Exercise 3: GAE with varying $\lambda$ #

Extend Exercise 2 to use GAE advantage estimates:

Compute $\delta_t = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ for all timesteps in a short rollout (e.g., 128 steps).
Compute GAE advantages: $A_t^{\text{GAE}(\lambda)} = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \delta_{t+l}$ .
Run experiments with $\lambda \in \{0.0, 0.5, 0.95, 1.0\}$ . Plot average episode return vs. wall-clock time for each.
Explain which $\lambda$ gives the best learning speed and why this matches the bias-variance tradeoff described in the lecture.

Extension prompts#

Continuous control extension: Replace the categorical policy in Exercise 2 with a Gaussian policy (mean $\mu_\theta(s)$ and learnable log-standard-deviation $\log \sigma$ ). Implement this on Pendulum-v1 or LunarLanderContinuous-v2. How does the entropy bonus formula change for a Gaussian policy?
PPO clipping from scratch: Starting from your GAE actor-critic (Exercise 3), replace the standard policy loss $-\hat{A}_t \log \pi_\theta(a_t|s_t)$ with the PPO clipped surrogate objective. Implement the ratio $\rho_t = \pi_\theta/\pi_{\theta_{\text{old}}}$ , compute the clipped objective, and run multiple epochs over the same rollout data. Plot the KL divergence $D_{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta)$ over epochs — does clipping keep it bounded?
RLHF intuition experiment: Use a small transformer (e.g., GPT-2 124M) with a frozen reward model. Implement the KL-penalized PPO objective from the RLHF section. Vary $\beta \in \{0.0, 0.01, 0.1\}$ and observe the generated text after fine-tuning. Show that $\beta = 0$ leads to reward hacking (high reward, degraded language), while moderate $\beta$ maintains fluency. This directly demonstrates the trust-region principle from the lecture.

Looking ahead#

The next lecture studies PPO and RLHF in depth, examining how the full RLHF pipeline — reward model training, KL-penalized PPO, and preference optimization — connects to the policy gradient foundations developed here, and where it diverges from classical RL theory.