
Week 7: Policy Gradient and Actor–Critic Methods

✦Learning Outcomes
  • Implement REINFORCE and understand its variance problem
  • Apply baselines and advantage functions to reduce variance
  • Explain actor-critic architecture and Generalized Advantage Estimation (GAE)
  • Connect policy gradient methods to modern RLHF (Reinforcement Learning from Human Feedback) pipelines
◆Prerequisites
  • Week 6: DQN (Deep Q-Network), experience replay, target networks
  • Week 5: Function approximation, neural networks in RL (Reinforcement Learning)
  • Week 4: TD (Temporal Difference) learning, value functions

Recommended: Review Week 6 sections on "Deep Q-Networks" before proceeding.

◆Grounded In
  • Continuous control: Policy gradient methods (PPO, SAC) are the go-to approach for robot locomotion and manipulation — action spaces are continuous joint torques, where value-based max-operators cannot apply. This week directly equips you for Course 2's robot learning content.
  • LLM alignment: The entire RLHF pipeline (ChatGPT, Claude, Gemini) rests on PPO. Every piece of this lecture — the policy gradient theorem, advantage estimation, GAE, trust regions, clipped importance ratios — reappears identically in Course 4's language model alignment weeks.
  • Generative RL: Recent work on diffusion policies, autoregressive action generation, and generative world models all build on the policy-as-parameterized-distribution viewpoint established here.

Purpose of this lecture

In Week 6, we saw how Deep Q-Learning stabilizes value-based RL through architectural constraints. However, value-based methods fundamentally rely on a max operator over actions, which limits them to discrete action spaces and introduces overestimation bias.

This lecture introduces policy gradient methods, which take a different approach:

Instead of learning how good actions are, we directly learn how likely actions should be.

Policy gradient methods naturally handle continuous action spaces, stochastic policies, and large structured action spaces such as token vocabularies in language models. They form the conceptual foundation for actor–critic methods, GAE, PPO (Proximal Policy Optimisation), TRPO (Trust Region Policy Optimisation), and modern RLHF pipelines.

The lecture follows a logical arc: derive the policy gradient theorem from first principles, identify the variance problem in REINFORCE, develop the baseline and advantage machinery that reduces variance, introduce actor-critic as the key architectural leap that enables bootstrapping, derive GAE as the continuous interpolation, and then develop PPO as the practical stabilization that makes policy gradients deployable at scale.

✦By the end of this week

You will have implemented a complete REINFORCE agent with baseline subtraction, an actor-critic learner with GAE-advantage estimates, and understood how every component of the PPO+RLHF pipeline maps to the policy gradient framework.


Policies as parameterized distributions

A policy is a parameterized probability distribution over actions:

$$\pi_\theta(a \mid s)$$
  • Discrete actions: categorical distribution, $\pi_\theta(a \mid s) = \text{softmax}(f_\theta(s))_a$
  • Continuous actions: Gaussian distribution, $\pi_\theta(a \mid s) = \mathcal{N}(\mu_\theta(s),\, \sigma_\theta(s)^2)$
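
The two parameterizations above can be sketched in a few lines. This is a minimal illustration, assuming a one-dimensional action for the Gaussian case; the function names are hypothetical, and the network outputs ($f_\theta(s)$, $\mu_\theta(s)$, $\sigma_\theta(s)$) are passed in as plain numbers rather than computed by a model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits f_theta(s)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_log_prob(logits, action):
    """log pi(a|s) for a discrete policy whose network output is `logits`."""
    return math.log(softmax(logits)[action])

def gaussian_log_prob(mu, sigma, action):
    """log pi(a|s) for a 1-D Gaussian policy with mean mu and std sigma."""
    return (-((action - mu) ** 2) / (2.0 * sigma ** 2)
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))

# Illustrative values only: three discrete actions, one continuous action.
probs = softmax([2.0, 1.0, 0.1])
log_p_discrete = categorical_log_prob([2.0, 1.0, 0.1], action=0)
log_p_continuous = gaussian_log_prob(mu=0.0, sigma=1.0, action=0.0)
```

In both cases the quantity the learning algorithm needs is the log-probability of the action actually taken, which is what every later update rule differentiates.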

The objective is to find $\theta$ that maximizes expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right], \quad R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$$

Unlike value-based methods, there is no max over actions: the policy itself is optimized directly via gradient ascent on $J(\theta)$.


The policy gradient theorem: derivation

The central question is: how do we compute $\nabla_\theta J(\theta)$? The environment transition $P(s_{t+1} \mid s_t, a_t)$ is unknown and generally not differentiable with respect to the actions, so we cannot simply backpropagate through the trajectory.

Step 1: write J as an integral

◆Derivation: The Policy Gradient Theorem
$$J(\theta) = \int p_\theta(\tau)\, R(\tau)\, d\tau$$

where $p_\theta(\tau)$ is the probability of trajectory $\tau$ under policy $\pi_\theta$.

Taking the gradient:

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau$$

Step 2: apply the log-derivative trick

For any differentiable $p_\theta(\tau) > 0$:

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$$

Substituting:

$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right]$$

Step 3: factorize the trajectory log-probability

The trajectory probability factorizes as:

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$

Taking the log:

$$\log p_\theta(\tau) = \underbrace{\log p(s_0)}_{\text{no } \theta} + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \underbrace{\log P(s_{t+1} \mid s_t, a_t)}_{\text{no } \theta}$$

The initial state distribution and environment transition terms have zero gradient with respect to $\theta$. Only the policy terms remain:

$$\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Result: the policy gradient theorem

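Substituting into the expectation from Step 2 gives $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$. Because an action taken at time $t$ cannot influence rewards received before $t$, those earlier rewards contribute zero in expectation, and $R(\tau)$ can be replaced by the reward-to-go $G_t$ (up to a factor $\gamma^t$ that is usually dropped in practice):

$$\boxed{ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \; G_t \right] }$$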

Three things this derivation reveals:

  1. No environment model needed: the environment terms drop out entirely. The gradient depends only on the policy's log-probabilities and the observed returns, both available from experience without knowing $P$.

  2. The log-derivative trick converts a density gradient into an expectation: this is why policy gradient methods are also called likelihood ratio methods or score function estimators. $\nabla_\theta \log \pi_\theta(a \mid s)$ is the score function of the policy.

  3. The result holds for any differentiable $\pi_\theta$: discrete (softmax), continuous (Gaussian), autoregressive (language models). The derivation makes no assumptions about the action space structure. (This generality is why policy gradient is the foundation for both robot learning in Course 2, where actions are continuous joint torques, and language model alignment in Course 4, where actions are token selections from a vocabulary.)
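
As a sanity check on the derivation, the score-function form of the gradient can be compared against a finite-difference gradient of $J(\theta)$. The sketch below assumes the simplest possible setting, a one-step problem with a softmax policy over three actions and fixed rewards (all values are illustrative), so that the expectation can be computed exactly by summing over actions rather than by sampling.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_return(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * r(a) for a one-step episode."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def score_function_gradient(theta, rewards):
    """E_a[ grad_theta log pi_theta(a) * r(a) ], summed exactly over actions."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for i in range(len(theta)):
            # For a softmax policy: d/dtheta_i log pi(a) = 1[i == a] - pi(i)
            score_i = (1.0 if i == a else 0.0) - probs[i]
            grad[i] += p_a * score_i * r_a
    return grad

def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central-difference gradient of J, used only as a reference."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((expected_return(up, rewards)
                     - expected_return(down, rewards)) / (2.0 * eps))
    return grad

theta = [0.2, -0.5, 1.0]
rewards = [1.0, 0.0, 2.0]
pg = score_function_gradient(theta, rewards)
fd = finite_difference_gradient(theta, rewards)
```

The two gradients agree to numerical precision even though `score_function_gradient` never differentiates the reward, which is the practical content of point 1 above.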


Continuous action spaces: the Gaussian policy

For robotics and continuous control, the canonical policy class is:

$$\pi_\theta(a \mid s) = \mathcal{N}(\mu_\theta(s),\, \sigma_\theta^2(s))$$

where $\mu_\theta(s)$ and $\sigma_\theta(s)$ are neural network outputs. The log-probability:

$$\log \pi_\theta(a \mid s) = -\frac{(a - \mu_\theta(s))^2}{2\sigma_\theta^2(s)} - \log \sigma_\theta(s) - \frac{1}{2}\log 2\pi$$

The gradient with respect to the mean parameters $\theta_\mu$:

$$\nabla_{\theta_\mu} \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma_\theta^2(s)} \nabla_{\theta_\mu} \mu_\theta(s)$$

Interpretation: the score points from $\mu_\theta(s)$ toward the sampled action $a$. When $a$ produced a high return, the update pushes $\mu_\theta(s)$ toward $a$, so the policy increases the probability of actions that worked well. When $a < \mu_\theta(s)$ (the action was below the current mean), the score is negative, so a positive return moves $\mu$ down toward $a$. The magnitude scales inversely with the variance: in high-uncertainty states (large $\sigma$), updates are smaller, which couples exploration and exploitation.

The standard deviation $\sigma_\theta(s)$ can also be learned: larger $\sigma$ increases entropy and exploration; smaller $\sigma$ focuses the policy on a narrow action range. In locomotion RL, a common architecture outputs $\mu_\theta(s)$ from the policy network and treats $\sigma$ as a learned state-independent parameter. (This Gaussian policy structure is the standard starting point in Course 2's robot learning: the mean $\mu_\theta(s)$ outputs the desired joint torques, and the standard deviation controls exploration-exploitation during skill learning.)
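
The score formulas can be checked numerically. The sketch below assumes a scalar action and treats $\mu$ and $\sigma$ as free numbers (so $\nabla_{\theta_\mu}\mu_\theta(s) = 1$). The derivative with respect to $\sigma$ is not given in the text; it follows from differentiating the same log-probability and is included here as an addition.

```python
import math

def gaussian_log_prob(mu, sigma, action):
    return (-((action - mu) ** 2) / (2.0 * sigma ** 2)
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))

def score_mu(mu, sigma, action):
    """d/dmu log pi(a|s) = (a - mu) / sigma^2."""
    return (action - mu) / sigma ** 2

def score_sigma(mu, sigma, action):
    """d/dsigma log pi(a|s) = ((a - mu)^2 - sigma^2) / sigma^3."""
    return ((action - mu) ** 2 - sigma ** 2) / sigma ** 3

# Illustrative values; eps is the finite-difference step.
mu, sigma, action, eps = 0.5, 0.8, 1.3, 1e-6
fd_mu = (gaussian_log_prob(mu + eps, sigma, action)
         - gaussian_log_prob(mu - eps, sigma, action)) / (2.0 * eps)
fd_sigma = (gaussian_log_prob(mu, sigma + eps, action)
            - gaussian_log_prob(mu, sigma - eps, action)) / (2.0 * eps)
```

With these values the action lies above the mean, so `score_mu` is positive; doubling `sigma` shrinks it by a factor of four, matching the inverse-variance scaling described above.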


REINFORCE

The simplest policy gradient algorithm is REINFORCE (Williams, 1992). At the end of each episode, update:

$$\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \; G_t$$

where $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}$ is the actual observed return.

Why REINFORCE has high variance

The REINFORCE gradient estimator is unbiased, $\mathbb{E}[\hat{g}] = \nabla_\theta J(\theta)$, but its variance is large. The source is the return $G_t$, which is a sum of $T - t$ random reward terms:

$$G_t = r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^{T-t-1} r_T$$

Each term is random, and their sum accumulates variance. In long-horizon problems, $G_t$ can vary enormously across episodes for the same starting state $s_t$: a stochastic environment or exploratory policy can produce very different trajectories. The variance of $G_t$ typically grows with episode length, which is why REINFORCE struggles on long episodes even when individual rewards are low-variance.

The consequence is that REINFORCE requires many episodes to average out this variance before gradient estimates become reliable — making it impractically slow for all but the simplest problems. This motivates variance reduction.

REINFORCE algorithm

# REINFORCE: Monte Carlo Policy Gradient
# Input: differentiable policy π_θ, learning rate α
for episode = 1..M:
    generate trajectory τ = (s₀, a₀, r₁, ..., s_T) using π_θ
    for t = 0..T-1:
        G_t = Σ_{k=0}^{T-t-1} γᵏ r_{t+k+1}           # monte carlo return
        θ ← θ + α γᵗ ∇_θ log π_θ(a_t | s_t) G_t       # policy gradient update
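
The pseudocode above can be turned into a small runnable sketch. Everything about the task is an assumption made for illustration: a single-state episodic problem with two actions, where action 1 pays reward 1 and action 0 pays 0, and a tabular softmax policy with one logit per action. Unlike the per-step update in the pseudocode, the gradient here is accumulated over the episode and applied once, which matches the summed update rule.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discounted_returns(rewards, gamma):
    """G_t for every t, computed backward: G_t = r_{t+1} + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def run_reinforce(episodes=2000, horizon=3, alpha=0.05, gamma=0.99, seed=0):
    """REINFORCE on a hypothetical toy task; returns the final action probabilities."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]                          # one logit per action
    for _ in range(episodes):
        probs = softmax(theta)
        actions, rewards = [], []
        for _ in range(horizon):                # generate a trajectory with pi_theta
            a = 1 if rng.random() < probs[1] else 0
            actions.append(a)
            rewards.append(float(a))            # action 1 pays 1, action 0 pays 0
        returns = discounted_returns(rewards, gamma)
        grad = [0.0, 0.0]
        for t, (a, g) in enumerate(zip(actions, returns)):
            for i in range(2):
                score = (1.0 if i == a else 0.0) - probs[i]   # d/dtheta_i log pi(a)
                grad[i] += (gamma ** t) * score * g
        theta = [th + alpha * gr for th, gr in zip(theta, grad)]
    return softmax(theta)

final_probs = run_reinforce()
```

Computing the returns backward takes one pass over the episode, instead of re-summing the rewards for every $t$ as the pseudocode does literally.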

Notable properties:

  • Unbiased: the gradient estimate converges to the true policy gradient in expectation.
  • High variance: $G_t$ is the sum of $T-t$ random reward terms, so variance grows with episode length.
  • Episode-level updates: weight changes only happen after each complete trajectory, not mid-episode.
  • The $\gamma^t$ factor in the update comes from the discounted objective $J(\theta)$. Many practical implementations drop it, as the boxed theorem and the update rule above do, and the undiscounted case ($\gamma = 1$) has no such factor.

Variance reduction: the REINFORCE identity and baselines

The REINFORCE identity

The key identity underlying all baseline subtraction:

◆Proof: The REINFORCE Identity
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$$

Proof: any normalized distribution satisfies $\sum_a \pi_\theta(a \mid s) = 1$, so the gradient of the sum is zero:

$$\nabla_\theta \sum_a \pi_\theta(a \mid s) = \sum_a \nabla_\theta \pi_\theta(a \mid s) = \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = 0$$

The expected score function is identically zero. $\square$

Baseline subtraction

For any function $b(s_t)$ that does not depend on $a_t$:

$$\mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = b(s_t)\, \underbrace{\mathbb{E}_{a_t}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]}_{= 0} = 0$$

Therefore we can subtract $b(s_t)$ from $G_t$ without changing the expected gradient:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(G_t - b(s_t)\right) \right]$$

The gradient remains unbiased for any choice of $b(s_t)$. The variance is reduced because $G_t - b(s_t)$ has smaller magnitude than $G_t$ when $b(s_t)$ is a good predictor of $G_t$.
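
Both claims, unchanged mean and reduced variance, can be verified exactly on a small example. The sketch assumes a one-step problem with two equally likely actions and rewards 10 and 12 (illustrative numbers), and looks at the single-sample estimate of $\partial J / \partial \theta_0$ for a softmax policy.

```python
def estimator_mean_and_variance(probs, rewards, baseline):
    """Exact mean and variance of the one-sample estimate score_0(a) * (r(a) - b)."""
    samples = []
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        score_0 = (1.0 if a == 0 else 0.0) - probs[0]   # d/dtheta_0 log pi(a)
        samples.append((p_a, score_0 * (r_a - baseline)))
    mean = sum(p * g for p, g in samples)
    variance = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, variance

probs = [0.5, 0.5]
rewards = [10.0, 12.0]
value = sum(p * r for p, r in zip(probs, rewards))      # V = 11, the expected reward

mean_raw, var_raw = estimator_mean_and_variance(probs, rewards, baseline=0.0)
mean_base, var_base = estimator_mean_and_variance(probs, rewards, baseline=value)
```

Without a baseline the two possible samples are $5$ and $-6$ (mean $-0.5$, variance $30.25$). With $b = 11$ both samples equal $-0.5$, so the mean is unchanged and the variance drops to zero in this particular example.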

The advantage function

A standard, near-optimal baseline is $b(s_t) = V^\pi(s_t)$ (the exact variance-minimizing baseline also weights returns by the squared norm of the score, but is rarely used in practice). Since $\mathbb{E}[G_t \mid s_t, a_t] = Q^\pi(s_t, a_t)$, the centered return $G_t - V^\pi(s_t)$ is an estimate of the advantage function:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

Intuitively: $A^\pi(s,a) > 0$ means action $a$ is better than average in state $s$; $A^\pi(s,a) < 0$ means it is worse than average. Using advantage rather than raw return centers the updates around zero, which typically reduces variance substantially.

Three advantage estimators

This distinction is critical and frequently confused:

| Estimator | Formula | Bias | Variance |
|---|---|---|---|
| True advantage | $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ | Zero | N/A (not an estimator) |
| MC estimate | $\hat{A}_t^{\text{MC}} = G_t - V^\pi(s_t)$ | Zero | High (sum of $T-t$ random terms) |
| TD(0) estimate | $\hat{A}_t^{\text{TD}} = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ | Nonzero | Low |

REINFORCE with baseline uses $\hat{A}_t^{\text{MC}}$: it is unbiased but still relies on the full Monte Carlo return $G_t$, with no bootstrapping. Actor-critic methods use $\hat{A}_t^{\text{TD}}$: they bootstrap via $V_\phi(s_{t+1})$, introducing bias but dramatically reducing variance. GAE interpolates between them.
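
To make the difference between the two estimators concrete, the sketch below computes both on one short trajectory. The rewards and critic values are made-up numbers, the critic values stand in for $V_\phi$ (a learned, imperfect critic), and the value of the terminal state is assumed to be zero.

```python
def mc_advantages(rewards, values, gamma):
    """A_t^MC = G_t - V(s_t), using the full observed return."""
    advantages, g = [], 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        g = r + gamma * g
        advantages.append(g - v)
    return advantages[::-1]

def td_advantages(rewards, values, gamma):
    """A_t^TD = r_{t+1} + gamma * V(s_{t+1}) - V(s_t), with V(terminal) = 0."""
    next_values = values[1:] + [0.0]
    return [r + gamma * nv - v
            for r, nv, v in zip(rewards, next_values, values)]

rewards = [0.0, 0.0, 1.0]        # reward arrives only at the end
values = [0.5, 0.6, 0.9]         # hypothetical critic estimates V(s_0), V(s_1), V(s_2)
mc = mc_advantages(rewards, values, gamma=1.0)
td = td_advantages(rewards, values, gamma=1.0)
```

The MC estimate at $t = 0$ depends on every later reward, while each TD estimate depends on a single reward and the critic. With $\gamma = 1$ the TD errors sum to the MC advantage at $t = 0$, which is the telescoping property that GAE builds on.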


Entropy regularization

Policy gradients can collapse to deterministic policies prematurely — once the policy assigns high probability to a single action in each state, the gradient signal for other actions vanishes and exploration stops.

To prevent premature collapse, add an entropy bonus:

$$J_{\text{entropy}}(\theta) = J(\theta) + \beta\, \mathbb{E}_{s \sim d^{\pi_\theta}}\!\left[\mathcal{H}\!\left(\pi_\theta(\cdot \mid s)\right)\right]$$

where $\mathcal{H}(\pi) = -\sum_a \pi(a \mid s) \log \pi(a \mid s)$ is the entropy of the policy. The coefficient $\beta$ controls the exploration-exploitation tradeoff.

Entropy regularization appears in most modern actor-critic implementations (for example PPO and SAC, Soft Actor-Critic) and is particularly important in large action spaces (language vocabularies, continuous joint spaces) where naive policy gradient would collapse to narrow modes.
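
A short sketch of the entropy term for a discrete policy, using illustrative probabilities: a uniform policy has the maximum entropy $\log |\mathcal{A}|$, and the bonus shrinks toward zero as the policy approaches a deterministic one.

```python
import math

def entropy(probs):
    """H(pi) = -sum_a pi(a) log pi(a); zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])     # equals log(4), the maximum for 4 actions
peaked = entropy([0.97, 0.01, 0.01, 0.01])      # close to deterministic, low entropy
deterministic = entropy([1.0, 0.0, 0.0, 0.0])   # no exploration left
```

Adding $\beta \mathcal{H}$ to the objective therefore rewards the policy for staying closer to `uniform` than to `deterministic`, with $\beta$ setting how strongly.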


Actor–critic methods

REINFORCE with advantage still requires full-episode Monte Carlo returns: updates are only possible at episode end, and variance remains high for long episodes. Actor-critic methods replace the Monte Carlo return with a bootstrapped TD estimate, enabling step-by-step updates.

Architecture

  • Actor: policy $\pi_\theta(a \mid s)$, which decides what action to take.
  • Critic: value function $V_\phi(s)$, which estimates the expected return from state $s$.

The critic is not just a baseline. In REINFORCE with baseline, $V^\pi(s_t)$ is used to center returns but the actual return $G_t$ still propagates the gradient. In actor-critic, the critic enables a TD advantage estimate that does not require waiting for episode termination:

$$\hat{A}_t^{\text{TD}} = \underbrace{r_{t+1} + \gamma V_\phi(s_{t+1})}_{\text{TD target}} - V_\phi(s_t) = \delta_t$$

This is the one-step TD error, the same $\delta_t$ from Week 4. The actor update uses this estimate in place of $G_t - V^\pi(s_t)$:

$$\theta \leftarrow \theta + \alpha_\theta\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t$$

The critic training objective

The critic is trained to minimize the TD prediction error (in practice the TD target is treated as a constant, so gradients flow only through $V_\phi(s_t)$):

$$\mathcal{L}_{\text{critic}}(\phi) = \mathbb{E}_t\left[\left(r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\right)^2\right] = \mathbb{E}_t\left[\delta_t^2\right]$$

In practice, the actor and critic often share a common feature extraction network (the trunk) with separate output heads, trained with a single joint loss:

$$\mathcal{L}(\theta, \phi) = \underbrace{-\mathbb{E}_t\left[\hat{A}_t\, \log \pi_\theta(a_t \mid s_t)\right]}_{\text{actor loss}} + c_1\, \underbrace{\mathbb{E}_t\left[\delta_t^2\right]}_{\text{critic loss}} - c_2\, \underbrace{\mathbb{E}_t\left[\mathcal{H}(\pi_\theta(\cdot \mid s_t))\right]}_{\text{entropy bonus}}$$

where $c_1, c_2$ are tunable coefficients and $\hat{A}_t$ is treated as a constant in the actor term. This three-term structure (policy term, value regression, entropy bonus) carries over to the PPO loss later in this lecture, where the actor term is replaced by the clipped surrogate, and from there to RLHF fine-tuning.

(This shared-trunk, dual-head architecture is foundational in Course 2: the trunk learns robot state representations from visual observations or proprioceptive data, the actor head outputs desired actions, and the critic head estimates value, enabling efficient learning in the high-dimensional, continuous action setting of robotic control.)

A2C and A3C

Advantage Actor-Critic (A2C) applies the actor-critic update synchronously: all parallel workers complete their rollouts, gradients are aggregated, and a single synchronized update is applied. The synchronization ensures consistent gradient estimates and is more stable than asynchronous updates.

Asynchronous Advantage Actor-Critic (A3C) runs workers asynchronously: each worker pushes gradient updates to a shared parameter server without waiting for others. The asynchrony decorrelates experience across workers (similar to a replay buffer) but without requiring off-policy corrections, since each worker runs a recent copy of the policy. A2C is generally preferred in practice because synchronous updates produce more stable learning: the additional variance from asynchronous gradient staleness typically outweighs the decorrelation benefit.


Generalized Advantage Estimation (GAE)

The TD(0) advantage estimate $\delta_t$ has low variance but nonzero bias (because $V_\phi \neq V^\pi$). The Monte Carlo estimate $G_t - V_\phi(s_t)$ has zero bias but high variance. GAE provides a principled interpolation.

Derivation from n-step advantages

The $n$-step advantage estimate:

$$\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n V_\phi(s_{t+n}) - V_\phi(s_t) = \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}$$

where the second equality rewrites the $n$-step return as a telescoping sum of TD errors. This is the same structure as the $n$-step TD return from Week 4.

GAE takes the exponentially weighted average over all $n$-step estimates:

$$A_t^{\text{GAE}(\lambda)} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}_t^{(n)} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}$$

Swapping the order of summation (each $\delta_{t+k}$ appears in all $\hat{A}_t^{(n)}$ with $n > k$, so its total weight is $(1-\lambda)\,\gamma^k \sum_{n>k} \lambda^{n-1} = (\gamma\lambda)^k$):

$$A_t^{\text{GAE}(\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

This is a geometrically weighted sum of TD errors, decaying at rate $\gamma\lambda$. Recent TD errors (small $l$) receive high weight; distant ones receive exponentially less credit.

Endpoint special cases:

  • $\lambda = 0$: only $\delta_t$ survives, giving the TD(0) advantage: minimum variance, maximum bias.
  • $\lambda = 1$: the TD errors are discounted by $\gamma$ alone and telescope to $\hat{A}_t^{(\infty)} = G_t - V_\phi(s_t)$, the Monte Carlo advantage.

The effective bias-variance tradeoff as a function of $\lambda$:

| $\lambda$ | Bias | Variance | Effective horizon |
|---|---|---|---|
| 0.0 | High (only 1 TD step) | Lowest | 1 step |
| 0.95 | Low | Moderate | ~20 steps ($1/(1-\lambda)$ when $\gamma \approx 1$) |
| 1.0 | Zero | Highest | Full episode |

$\lambda = 0.95$ is standard in PPO implementations. Because the weights decay at rate $\gamma\lambda$, the effective advantage horizon is $1/(1-\gamma\lambda)$: about 20 TD steps when $\gamma \approx 1$ and about 17 with $\gamma = 0.99$. That is enough to capture meaningful return structure while controlling variance.
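
In code, GAE is usually computed with a backward recursion, $A_t = \delta_t + \gamma\lambda A_{t+1}$, which is equivalent to the weighted sum of TD errors truncated at the end of the episode. The sketch below assumes a finite episode whose terminal state has value zero; the rewards and critic values are made-up numbers, and `values` carries one extra entry for the terminal state.

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t); `values` has T + 1 entries."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]

def gae(rewards, values, gamma, lam):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    deltas = td_errors(rewards, values, gamma)
    advantages, running = [0.0] * len(deltas), 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.9, 0.0]    # hypothetical critic outputs; last entry is the terminal state
gamma = 0.9

# Reference Monte Carlo advantage G_t - V(s_t), for checking the lambda = 1 endpoint.
mc_reference, g = [], 0.0
for r, v in zip(reversed(rewards), reversed(values[:-1])):
    g = r + gamma * g
    mc_reference.append(g - v)
mc_reference.reverse()

adv_td = gae(rewards, values, gamma, lam=0.0)    # reduces to the TD(0) errors
adv_mc = gae(rewards, values, gamma, lam=1.0)    # reduces to the Monte Carlo advantage
adv_mid = gae(rewards, values, gamma, lam=0.95)  # the usual in-between setting
```

The recursion costs one pass over the trajectory regardless of $\lambda$, which is one reason this form is preferred over summing the $n$-step estimates directly.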

Connection to TD(λ) from Week 4

GAE is the policy gradient analog of TD(λ): both take exponentially weighted combinations of $n$-step estimates, with $\lambda$ as the decay parameter. The difference is that TD(λ) applies the weighting to value estimates for policy evaluation, while GAE applies the same weighting to advantage estimates for policy improvement. The derivation has the same structure.


Trust regions and PPO

Actor-critic with GAE still suffers from destructive policy updates. A large gradient step changes $\pi_\theta$ substantially, which invalidates the advantage estimates computed under the old policy. The new policy collects different data, leading to further degraded estimates: a catastrophic feedback loop that can permanently destabilize training.

The fundamental problem

In supervised learning, a large gradient step may hurt performance but the loss function remains valid, so we can detect the mistake and recover. In RL, the policy generates its own training data. A bad policy update changes what data is collected, which changes the gradient, which may push the policy further in the wrong direction. The feedback loop has no self-correcting mechanism.

TRPO: trust region policy optimization

TRPO (Schulman et al., 2015) constrains each update to stay within a trust region defined by the KL divergence between old and new policies:

$$\max_\theta\; L(\theta_{\text{old}}, \theta) \quad \text{subject to} \quad \mathbb{E}_s\left[D_{KL}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right)\right] \leq \delta$$

where $L(\theta_{\text{old}}, \theta) = \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} A_t\right]$ is the importance-weighted policy objective. (Here $\delta$ is the trust region radius, not the TD error.)

TRPO provides a monotone improvement guarantee: in theory, each constrained update does not decrease $J(\theta)$ (up to approximation error). The cost is computational complexity: each step solves a constrained optimization problem using the Fisher information matrix (the Hessian of the KL divergence with respect to $\theta$), typically via conjugate gradient on Fisher-vector products followed by a line search. This is expensive for large neural networks.

PPO: proximal policy optimization

PPO (Schulman et al., 2017) approximates the TRPO constraint with a clipped surrogate objective that is simple to implement and computationally cheap:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( \rho_t(\theta)\, A_t,\;\; \text{clip}\!\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t \right) \right]$$

where the importance ratio:

$$\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

measures how much the current policy differs from the policy that collected the data.

Understanding the PPO objective

The clipped objective has four cases depending on the sign of $A_t$ and whether $\rho_t$ is inside or outside $[1-\epsilon, 1+\epsilon]$:

| Advantage | Ratio | Effect |
|---|---|---|
| $A_t > 0$ (good action) | $\rho_t \leq 1+\epsilon$ | Normal update: increase action probability |
| $A_t > 0$ | $\rho_t > 1+\epsilon$ | Clipped: don't increase probability further |
| $A_t < 0$ (bad action) | $\rho_t \geq 1-\epsilon$ | Normal update: decrease action probability |
| $A_t < 0$ | $\rho_t < 1-\epsilon$ | Clipped: don't decrease probability further |

The $\min$ ensures the objective is never more optimistic than the clipped version: updates that would push $\rho_t$ far from 1 in the direction of improvement are capped, preventing overconfident large steps. When $\rho_t$ has instead moved far in the wrong direction (for example $\rho_t < 1-\epsilon$ with $A_t > 0$), the unclipped term is the smaller one and is selected, so the gradient stays active and the policy can recover.
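
The per-sample clipped term is short enough to evaluate by hand. The sketch below uses $\epsilon = 0.2$, a commonly used value that is assumed here rather than taken from the text, and walks through the four table cases plus the two "wrong direction" cases.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

cases = {
    "good action, inside range":  ppo_clip_term(1.1, +1.0),   # 1.1  (unclipped)
    "good action, ratio too big": ppo_clip_term(1.5, +1.0),   # 1.2  (capped at 1 + eps)
    "bad action, inside range":   ppo_clip_term(0.9, -1.0),   # -0.9 (unclipped)
    "bad action, ratio too low":  ppo_clip_term(0.5, -1.0),   # -0.8 (capped at 1 - eps)
    "good action, wrong way":     ppo_clip_term(0.5, +1.0),   # 0.5  (unclipped, gradient active)
    "bad action, wrong way":      ppo_clip_term(1.5, -1.0),   # -1.5 (unclipped, gradient active)
}
```

In the two capped cases the term no longer depends on `ratio`, so its gradient with respect to $\theta$ is zero and that sample stops pushing the policy further.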

Connection to Week 4 importance sampling

The ratio $\rho_t = \pi_\theta / \pi_{\theta_{\text{old}}}$ is the importance sampling ratio from Week 4, applied to the policy gradient update. PPO collects data under $\pi_{\theta_{\text{old}}}$ (the behavior policy) and then takes several gradient steps on $\pi_\theta$ (the target policy), so after the first step the update is mildly off-policy. The $\rho_t$ ratio corrects for the distribution mismatch, and clipping it prevents the variance explosion from large importance weights that was identified in Week 4. PPO's clipping is the policy gradient analog of the importance weight clipping in off-policy TD.

The full PPO training loss

In practice, PPO optimizes the joint actor-critic objective:

$$\mathcal{L}^{\text{PPO}}(\theta, \phi) = -L^{\text{CLIP}}(\theta) + c_1\, \mathbb{E}_t\!\left[\left(V_\phi(s_t) - V_t^{\text{target}}\right)^2\right] - c_2\, \mathbb{E}_t\!\left[\mathcal{H}(\pi_\theta(\cdot \mid s_t))\right]$$

where $V_t^{\text{target}}$ is the return target computed from GAE (commonly the GAE advantage plus the old value estimate of $s_t$) and $c_1 \approx 0.5$, $c_2 \approx 0.01$ are typical coefficients. This is the loss function used in PPO implementations and in RLHF fine-tuning of language models.


PPO in RLHF: the full picture

RLHF fine-tunes a language model $\pi_\theta$ using PPO with a learned reward model $r_\phi$. The complete objective adds a KL penalty against the reference model $\pi_{\text{ref}}$ (the SFT checkpoint):

$$J^{\text{RLHF}}(\theta) = \mathbb{E}_{(x, y) \sim \pi_\theta}\!\left[ r_\phi(x, y) - \beta\, D_{KL}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right) \right]$$

Every component of this objective maps directly onto the policy gradient framework:

| RLHF component | Policy gradient analog |
|---|---|
| $\pi_\theta$ (language model) | Actor: $\pi_\theta(a \mid s)$ |
| $r_\phi(x, y)$ (reward model) | Return signal: $R(\tau)$ |
| PPO surrogate $L^{\text{CLIP}}$ | Policy gradient with clipped importance ratio |
| GAE advantage estimates | Variance reduction for policy gradient |
| KL penalty from $\pi_{\text{ref}}$ | Entropy/trust region regularization |
| Value head $V_\phi$ | Critic: $V_\phi(s)$ |

The KL penalty serves the same role as TRPO's trust region constraint: it prevents the policy from moving too far from a stable reference point ($\pi_{\text{ref}}$), which in the language setting prevents reward hacking (exploiting the reward model far outside the distribution it was trained on). (This RLHF framework is the direct application of Week 7's policy gradient machinery in Course 4 (Week 12): the language model becomes the actor, the reward model becomes the return signal, and PPO with KL penalties becomes the optimization algorithm. Understanding this week is prerequisite to understanding language model alignment.)
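
One common way to realize the KL penalty is to estimate it from the sampled response itself, as the sum over tokens of $\log \pi_\theta - \log \pi_{\text{ref}}$, and subtract it from the reward model score. The sketch below is an assumption about how this is typically wired up, not a description of any particular library; the function name, the log-probabilities, and $\beta = 0.1$ are all hypothetical.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta):
    """r(x, y) - beta * sum_t [log pi_theta(y_t | ...) - log pi_ref(y_t | ...)].

    The sum is a single-sample estimate of the sequence-level KL term,
    evaluated on the tokens the policy actually generated.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

# Hypothetical two-token response scored 1.0 by the reward model.
unchanged = kl_penalized_reward(1.0, [-0.5, -0.7], [-0.5, -0.7], beta=0.1)
drifted = kl_penalized_reward(1.0, [-0.1, -0.2], [-0.5, -0.7], beta=0.1)
```

When the policy still matches the reference the reward passes through untouched; once the policy assigns its own samples much higher probability than the reference does, the same reward model score is worth less.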


Key takeaways

The lecture develops a logical progression from first principles to the state of the art. The policy gradient theorem is derived via the log-derivative trick — the environment terms drop out because only πθ\pi_\thetaπθ​ depends on θ\thetaθ, giving a gradient that requires no model. REINFORCE is the direct implementation: unbiased but high-variance because GtG_tGt​ is a sum of T−tT-tT−t random terms. The REINFORCE identity proves that any state-dependent baseline can be subtracted without bias, and the advantage function is the variance-minimizing baseline. The three advantage estimators — true, MC, TDTemporal Difference — form a bias-variance spectrum, and the actor-critic architecture enables the TDTemporal Difference estimate via a learned critic that supports step-by-step bootstrapped updates.

GAE forms an exponentially weighted combination of n-step advantages, recovering TD(0) at $\lambda=0$ and Monte Carlo at $\lambda=1$, with $\lambda=0.95$ as the standard practical choice. The Gaussian policy for continuous action spaces makes the abstract theorem concrete for robotics applications. TRPO solves the destructive update problem via a KL-constrained trust region with monotone improvement guarantees. PPO approximates TRPO with a clipped importance ratio, connecting directly to the Week 4 importance sampling framework: clipping prevents variance explosion from large $\rho_t$. The full PPO loss combines the clipped policy objective, critic regression, and an entropy bonus in a single joint objective. Finally, the RLHF formulation maps every component of the policy gradient framework onto the language model alignment setting.
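The claim that a state-dependent baseline leaves the expected gradient unchanged can be checked numerically. The sketch below assumes a softmax policy over three actions in a single state and computes the expectation exactly by summing over actions. The logits, action values, and baseline are hypothetical numbers chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(probs, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def expected_score_times(probs, weights):
    """Exact E_a[ grad log pi(a) * weights[a] ], summed over all actions."""
    out = [0.0] * len(probs)
    for a, p in enumerate(probs):
        g = grad_log_softmax(probs, a)
        for k in range(len(probs)):
            out[k] += p * g[k] * weights[a]
    return out


probs = softmax([0.5, -1.0, 2.0])
q_values = [1.0, 4.0, 2.5]   # hypothetical action values in this state
baseline = 3.0               # any value that does not depend on the action

baseline_term = expected_score_times(probs, [baseline] * 3)
grad_plain = expected_score_times(probs, q_values)
grad_with_baseline = expected_score_times(probs, [q - baseline for q in q_values])
print(baseline_term, grad_plain, grad_with_baseline)
```

Here `baseline_term` is zero up to floating-point error, so `grad_plain` and `grad_with_baseline` agree. The baseline changes only the spread of individual samples, not their mean.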


Conceptual questions

  1. Derive the policy gradient theorem for a two-step episodic MDP: $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2$ (terminal), with rewards $r_1$ and $r_2$. Write out $p_\theta(\tau)$ explicitly, apply the log-derivative trick, and show that the environment transition terms $\log P(s_1 \mid s_0, a_0)$ and $\log P(s_2 \mid s_1, a_1)$ drop out of the gradient. What does the final expression for $\nabla_\theta J(\theta)$ look like?

  2. REINFORCE with baseline subtracts $V^\pi(s_t)$ from $G_t$. Prove using the REINFORCE identity that this subtraction does not change the expected gradient. Then explain intuitively why it reduces variance: what property of $V^\pi(s_t)$ makes it a good baseline compared to, say, a constant $b = 0$?

  3. An actor-critic uses the TD(0) advantage estimate $\hat{A}_t = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$. When $V_\phi \neq V^\pi$ (i.e., the critic is not perfectly trained), this estimate is biased. Describe the direction of the bias when the critic systematically underestimates $V^\pi$, and explain how GAE with $\lambda$ close to 1 reduces this bias at the cost of increased variance.

  4. PPO clips the importance ratio $\rho_t = \pi_\theta / \pi_{\theta_{\text{old}}}$ to $[1-\epsilon, 1+\epsilon]$. The clipping applies to both positive-advantage and negative-advantage cases but asymmetrically: updates that degrade the surrogate objective are not clipped. Explain this asymmetry using the $\min$ operator in the PPO objective. Why would clipping the degrading side as well be harmful?

  5. An RLHF pipeline fine-tunes a language model using PPO with $\beta = 0.0$ (no KL penalty against the reference model). After 10,000 updates, the model achieves very high reward model scores but produces outputs that no longer resemble natural language. Diagnose this failure in terms of the trust region / KL constraint, reward hacking, and the importance sampling distribution mismatch. What value of $\beta$ should be used, and what does it control geometrically in the policy space?


✦Solutions
  1. Two-step policy gradient. $p_\theta(\tau) = p(s_0)\,\pi_\theta(a_0 \mid s_0)\,P(s_1 \mid s_0, a_0)\,\pi_\theta(a_1 \mid s_1)\,P(s_2 \mid s_1, a_1)$. Taking $\log$ and then $\nabla_\theta$, the initial-state and transition terms do not depend on $\theta$ and vanish, leaving $\nabla_\theta \log p_\theta(\tau) = \nabla_\theta \log \pi_\theta(a_0 \mid s_0) + \nabla_\theta \log \pi_\theta(a_1 \mid s_1)$. Thus $\nabla_\theta J = \mathbb{E}\big[(r_1 + r_2)\,(\nabla_\theta \log \pi_\theta(a_0 \mid s_0) + \nabla_\theta \log \pi_\theta(a_1 \mid s_1))\big]$: the environment dynamics drop out.
  2. Baseline. $\mathbb{E}[\nabla \log \pi(a \mid s)\,b(s)] = b(s)\sum_a \pi(a \mid s)\,\nabla \log \pi(a \mid s) = b(s)\sum_a \nabla \pi(a \mid s) = b(s)\,\nabla \sum_a \pi(a \mid s) = b(s)\,\nabla 1 = 0$, so subtracting $V^\pi(s)$ leaves the expected gradient unchanged. It reduces variance because $V^\pi(s)$ is the expected return from $s$, so the difference $G_t - V^\pi(s)$ is centered near zero: only deviations from the state's average matter, unlike $b = 0$, which leaves the large state-dependent return magnitudes in the estimator.
  3. TD(0) advantage bias. With $V_\phi \neq V^\pi$ the estimate $\hat{A}_t = r + \gamma V_\phi(s') - V_\phi(s)$ is biased. Writing the critic error as $e(s) = V_\phi(s) - V^\pi(s)$, its expectation is $A^\pi(s, a) + \gamma\,\mathbb{E}[e(s')] - e(s)$. The $-e(s)$ term depends only on the state, so it acts as a baseline shift; the bias in the policy gradient enters through the bootstrapped $V_\phi(s')$ term, where systematic underestimation ($e < 0$) makes actions that lead to underestimated states look worse than they are. GAE with $\lambda \to 1$ weights longer, more Monte-Carlo-like returns that rely less on the biased critic, reducing bias at the cost of higher variance (long noisy returns); $\lambda \to 0$ is the opposite low-variance, high-bias extreme.
  4. PPO clip asymmetry. The objective $\min(\rho_t \hat{A}_t,\ \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\,\hat{A}_t)$ caps the gain from moving $\rho_t$ outside the trust region but still penalizes updates that worsen the surrogate. With $\hat{A}_t > 0$ the gain stops growing once $\rho_t > 1+\epsilon$, and with $\hat{A}_t < 0$ it stops growing once $\rho_t < 1-\epsilon$. In the opposite cases ($\hat{A}_t < 0$ with $\rho_t > 1+\epsilon$, or $\hat{A}_t > 0$ with $\rho_t < 1-\epsilon$) the unclipped term is the smaller one, so the $\min$ selects it and the gradient keeps pushing the policy back. Clipping that side too would zero the gradient exactly when a bad action has gained probability (or a good one has lost it), leaving no signal to correct the mistake, which is why only the improving side is clipped.
  5. RLHF with $\beta = 0$. No KL penalty removes the trust region against $\pi_{\text{ref}}$, so the policy drifts arbitrarily to chase the reward model (RM) score (reward hacking) and leaves the natural-language manifold, while the importance ratios blow up as $\pi_\theta$ diverges (distribution mismatch, high-variance/invalid estimates). Use $\beta > 0$ (e.g. ~0.01–0.1): geometrically it softly confines the policy to a KL ball around $\pi_{\text{ref}}$, with larger $\beta$ giving a tighter ball, keeping outputs language-like and the RM in-distribution.
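The asymmetry in solution 4 can be made concrete with a small sketch of the per-sample clipped term and its slope with respect to $\rho_t$. The function names and the four test cases are hypothetical, and $\epsilon = 0.2$ is assumed purely for illustration.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample surrogate: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def ppo_clip_slope(ratio, advantage, eps=0.2):
    """Slope of the term above w.r.t. rho (boundary points ignored).

    Returns A when the min selects the unclipped branch, and 0 when it
    selects the clipped branch, which is flat in rho outside the range.
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    if ratio * advantage <= clipped * advantage:
        return advantage
    return 0.0


# (rho, A): improving side first, then degrading side, for each sign of A.
cases = [(1.5, 1.0), (0.5, 1.0), (0.5, -1.0), (1.5, -1.0)]
for rho, adv in cases:
    print(rho, adv, ppo_clip_term(rho, adv), ppo_clip_slope(rho, adv))
```

The slope is zero for `(1.5, 1.0)` and `(0.5, -1.0)`, where the policy has already moved far in the improving direction. It stays nonzero for `(0.5, 1.0)` and `(1.5, -1.0)`, where the gradient is still needed to undo a harmful move.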

Coding exercises

Exercise 1: REINFORCE with baseline on CartPole

Implement REINFORCE with a learned baseline on the CartPole-v1 environment:

  1. Define a policy network $\pi_\theta$ with two output heads: action logits (for the categorical distribution) and a scalar $V_\phi$ (the baseline).
  2. Generate complete episodes using the current stochastic policy — sample actions from the categorical distribution.
  3. After each episode, compute $G_t$ for every timestep and the advantage $\hat{A}_t = G_t - V_\phi(s_t)$.
  4. Update the policy head via gradient ascent on $\hat{A}_t \log \pi_\theta(a_t \mid s_t)$ (treating $\hat{A}_t$ as a constant) and the value head via gradient descent on the squared error $(G_t - V_\phi(s_t))^2$.
  5. Plot episode return vs. episode number. Compare against a version that uses a constant baseline $b = 0$ (i.e., plain REINFORCE): how much does the learned baseline reduce variance?

Expected outcome: REINFORCE with baseline should converge to a mean return of ~200 within 2,000–5,000 episodes.
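Step 3 can be sketched in plain Python as a backward pass over one episode. This is a minimal sketch, assuming `rewards[t]` stores $r_{t+1}$ and `values[t]` stores the baseline prediction $V_\phi(s_t)$; both helper names and the example numbers are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """G_t = r_{t+1} + gamma * G_{t+1}, computed backwards with G_T = 0."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_advantages(returns, values):
    """A_t = G_t - V(s_t) for every timestep of the episode."""
    return [g - v for g, v in zip(returns, values)]


# Hypothetical 3-step episode with reward 1 per step (as in CartPole).
rewards = [1.0, 1.0, 1.0]
values = [1.5, 1.5, 0.5]          # made-up baseline predictions
returns = discounted_returns(rewards, gamma=0.5)
advantages = mc_advantages(returns, values)
print(returns, advantages)
```

The backward recursion costs one pass over the episode, whereas recomputing each $G_t$ from scratch would be quadratic in episode length.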

Exercise 2: One-step actor-critic with bootstrapping

Modify Exercise 1 to use one-step TD bootstrap instead of Monte Carlo returns:

  1. Replace $\hat{A}_t = G_t - V_\phi(s_t)$ with $\hat{A}_t = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ (where $V_\phi(s_T) = 0$ for terminal states).
  2. Update both actor and critic after every step rather than at episode end.
  3. Add an entropy bonus to the actor loss: $\mathcal{L}_{\text{actor}} = -\big(\hat{A}_t \log \pi_\theta(a_t \mid s_t) + \beta\,\mathcal{H}(\pi_\theta(\cdot \mid s_t))\big)$ with $\beta = 0.01$.
  4. Compare learning speed (steps to reach an episode return of 195) between MC and TD variants. Which converges faster, and why?
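The per-step quantities in steps 1 and 3 can be sketched as scalar functions before wiring them into a network. This is a sketch under assumptions: the function names are hypothetical, probabilities are passed as a plain list, and the advantage is treated as a constant (no gradient flows through it).

```python
import math


def td_advantage(reward, v_s, v_next, gamma, terminal):
    """delta = r + gamma * V(s') - V(s), using V(s') = 0 at terminal states."""
    bootstrap = 0.0 if terminal else gamma * v_next
    return reward + bootstrap - v_s


def entropy(probs):
    """Shannon entropy of a categorical distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_loss(probs, action, advantage, beta=0.01):
    """-(A * log pi(a|s) + beta * H(pi(.|s))) for a single transition."""
    return -(advantage * math.log(probs[action]) + beta * entropy(probs))


# Hypothetical transition: two actions, critic values made up for illustration.
adv = td_advantage(reward=1.0, v_s=0.5, v_next=2.0, gamma=0.9, terminal=False)
loss = actor_loss(probs=[0.7, 0.3], action=0, advantage=adv)
print(adv, loss)
```

Because the entropy term enters with a minus sign in the loss, minimizing the loss increases entropy, which discourages the policy from collapsing to a deterministic choice too early.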

Exercise 3: GAE with varying $\lambda$

Extend Exercise 2 to use GAE advantage estimates:

  1. Compute $\delta_t = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ for all timesteps in a short rollout (e.g., 128 steps).
  2. Compute GAE advantages: $A_t^{\text{GAE}(\lambda)} = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \delta_{t+l}$.
  3. Run experiments with $\lambda \in \{0.0, 0.5, 0.95, 1.0\}$. Plot average episode return vs. wall-clock time for each.
  4. Explain which $\lambda$ gives the best learning speed and why this matches the bias-variance tradeoff described in the lecture.
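One possible starting point for steps 1 and 2 is the backward recursion $A_t = \delta_t + \gamma\lambda A_{t+1}$, which is equivalent to the sum in step 2. The sketch below assumes a rollout with no episode boundary inside it, and a `last_value` argument holding $V_\phi(s_T)$ (zero if $s_T$ is terminal). The function name and example numbers are hypothetical.

```python
def gae_advantages(rewards, values, last_value, gamma, lam):
    """GAE via the backward recursion A_t = delta_t + gamma * lam * A_{t+1}.

    rewards[t] is r_{t+1}, values[t] is V(s_t), and last_value is V(s_T).
    Assumes no terminal state occurs inside the rollout.
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        v_next = values[t + 1] if t + 1 < T else last_value
        delta = rewards[t] + gamma * v_next - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


# Hypothetical 3-step rollout ending in a terminal state.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 0.2]
td_only = gae_advantages(rewards, values, last_value=0.0, gamma=0.9, lam=0.0)
mc_like = gae_advantages(rewards, values, last_value=0.0, gamma=0.9, lam=1.0)
print(td_only, mc_like)
```

With `lam=0.0` each entry is just the one-step TD error. With `lam=1.0` each entry equals the discounted return from that step minus `values[t]`, matching the two limits named in the lecture.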

Extension prompts

  1. Continuous control extension: Replace the categorical policy in Exercise 2 with a Gaussian policy (mean $\mu_\theta(s)$ and learnable log-standard-deviation $\log \sigma$). Implement this on Pendulum-v1 or LunarLanderContinuous-v2. How does the entropy bonus formula change for a Gaussian policy?

  2. PPO clipping from scratch: Starting from your GAE actor-critic (Exercise 3), replace the standard policy loss $-\hat{A}_t \log \pi_\theta(a_t \mid s_t)$ with the PPO clipped surrogate objective. Implement the ratio $\rho_t = \pi_\theta / \pi_{\theta_{\text{old}}}$, compute the clipped objective, and run multiple epochs over the same rollout data. Plot the KL divergence $D_{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta)$ over epochs: does clipping keep it bounded?

  3. RLHF intuition experiment: Use a small transformer (e.g., GPT-2 124M) with a frozen reward model. Implement the KL-penalized PPO objective from the RLHF section. Vary $\beta \in \{0.0, 0.01, 0.1\}$ and observe the generated text after fine-tuning. Show that $\beta = 0$ leads to reward hacking (high reward, degraded language), while moderate $\beta$ maintains fluency. This directly demonstrates the trust-region principle from the lecture.
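For the continuous control prompt, the pieces a Gaussian policy needs are sampling, the log-probability of an action, and the entropy. The sketch below assumes a diagonal Gaussian with one `log_std` entry per action dimension; the function names are hypothetical, and a real implementation would compute these inside the autodiff framework used for the earlier exercises.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)


def sample_action(mean, log_std, rng):
    """Draw a ~ N(mean, diag(exp(log_std))^2), one dimension at a time."""
    return [rng.gauss(m, math.exp(ls)) for m, ls in zip(mean, log_std)]


def gaussian_log_prob(action, mean, log_std):
    """Sum over dimensions of log N(a_i; mean_i, exp(log_std_i)^2)."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        z = (a - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * LOG_2PI
    return total


def gaussian_entropy(log_std):
    """Differential entropy of a diagonal Gaussian: sum(log_std + 0.5*log(2*pi*e))."""
    return sum(ls + 0.5 * (1.0 + LOG_2PI) for ls in log_std)


rng = random.Random(0)                 # fixed seed for a repeatable sample
mean, log_std = [0.0, 1.0], [0.0, -1.0]
action = sample_action(mean, log_std, rng)
print(action, gaussian_log_prob(action, mean, log_std), gaussian_entropy(log_std))
```

The entropy depends only on `log_std`, not on the mean, so with a state-independent `log_std` the entropy bonus acts directly on the exploration noise scale.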


Looking ahead

The next lecture studies PPO and RLHF in depth, examining how the full RLHF pipeline (reward model training, KL-penalized PPO, and preference optimization) connects to the policy gradient foundations developed here, and where it diverges from classical RL theory.


Further reading

  • Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning. (Introduced REINFORCE).
  • Sutton, R. S., et al. (2000). Policy gradient methods for reinforcement learning with function approximation. NeurIPS. (The Policy Gradient Theorem).
  • Schulman, J., et al. (2015a). High-Dimensional Continuous Control Using Generalized Advantage Estimation (GAE). ICLR.
  • Schulman, J., et al. (2015b). Trust Region Policy Optimization (TRPO). ICML.
  • Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms (PPO). arXiv. (The foundation of modern RLHF).