Purpose of this lecture
In Week 6, we saw how Deep Q-Learning stabilizes value-based reinforcement learning (RL) through architectural constraints. However, value-based methods fundamentally rely on a max operator over actions, which limits them to discrete action spaces and introduces overestimation bias.
This lecture introduces policy gradient methods, which take a different approach:
Instead of learning how good actions are, we directly learn how likely actions should be.
Policy gradient methods naturally handle continuous action spaces, stochastic policies, and large structured action spaces such as token vocabularies in language models. They form the conceptual foundation for actor–critic methods, Generalized Advantage Estimation (GAE), Proximal Policy Optimisation (PPO), Trust Region Policy Optimisation (TRPO), and modern Reinforcement Learning from Human Feedback (RLHF) pipelines.
The lecture follows a logical arc: derive the policy gradient theorem from first principles, identify the variance problem in REINFORCE, develop the baseline and advantage machinery that reduces variance, introduce actor-critic as the key architectural leap that enables bootstrapping, derive GAE as the continuous interpolation between Monte Carlo and bootstrapped advantage estimates, and then develop PPO as the practical stabilization that makes policy gradients deployable at scale.
Policies as parameterized distributions
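A policy is a parameterized probability distribution over actions, written $\pi_\theta(a \mid s)$:
- Discrete actions: categorical distribution, $\pi_\theta(a \mid s) = \mathrm{softmax}(f_\theta(s))_a$, where $f_\theta(s)$ is a vector of action logits
- Continuous actions: Gaussian distribution, $\pi_\theta(a \mid s) = \mathcal{N}\big(a;\, \mu_\theta(s),\, \sigma_\theta^2(s)\big)$
The objective is to find $\theta$ that maximizes expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \gamma^t r_{t+1} \right]$$
Unlike value-based methods, there is no max over actions: the policy itself is optimized directly via gradient ascent on $J(\theta)$.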
The policy gradient theorem: derivation
The central question is: how do we compute $\nabla_\theta J(\theta)$? The environment transition $p(s_{t+1} \mid s_t, a_t)$ is unknown and is not a differentiable function we have access to, so we cannot simply backpropagate through the trajectory.
Step 1: write J as an integral
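The steps of the derivation can be sketched as follows. The trajectory density $p_\theta(\tau)$ and the total discounted return $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$ are notation assumed here for illustration; this is the standard likelihood-ratio argument, not necessarily the exact presentation used in the lecture.

```latex
\begin{align*}
% Step 1: write J as an integral over trajectories
J(\theta) &= \int p_\theta(\tau)\, R(\tau)\, d\tau,
\qquad p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \\
% Step 2: log-derivative trick, \nabla p = p \nabla \log p
\nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
 = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \\
% Step 3: the environment terms do not depend on \theta
\nabla_\theta \log p_\theta(\tau) &= \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
% Step 4: rewrite as an expectation over sampled trajectories
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
\end{align*}
```

Because the action at time $t$ cannot influence rewards received before $t$, $R(\tau)$ can be replaced by $\gamma^t G_t$ inside the sum without changing the expectation, which gives the form used by REINFORCE.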
Three things this derivation reveals:
- No environment model needed: the environment terms drop out entirely. The gradient depends only on the policy's log-probabilities and the observed returns, both available from experience without knowing $p(s_{t+1} \mid s_t, a_t)$.
- The log-derivative trick converts a density gradient into an expectation: this is why policy gradient methods are also called likelihood ratio methods or score function estimators, since $\nabla_\theta \log \pi_\theta(a \mid s)$ is the score function of the policy.
- The result holds for any differentiable $\pi_\theta$: discrete (softmax), continuous (Gaussian), autoregressive (language models). The derivation makes no assumptions about the action space structure. (This generality is why policy gradient is the foundation for both robot learning in Course 2, where actions are continuous joint torques, and language model alignment in Course 4, where actions are token selections from a vocabulary.)
Continuous action spaces: the Gaussian policy
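For robotics and continuous control, the canonical policy class is:
$$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\, \mu_\theta(s),\, \sigma_\theta^2(s)\big)$$
where $\mu_\theta(s)$ and $\sigma_\theta(s)$ are neural network outputs. The log-probability (for a scalar action):
$$\log \pi_\theta(a \mid s) = -\frac{(a - \mu_\theta(s))^2}{2\sigma_\theta^2(s)} - \log \sigma_\theta(s) - \tfrac{1}{2}\log 2\pi$$
The gradient with respect to the mean parameters $\theta_\mu$:
$$\nabla_{\theta_\mu} \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma_\theta^2(s)}\, \nabla_{\theta_\mu} \mu_\theta(s)$$
Interpretation: when action $a$ produced a high return, the gradient pushes $\mu_\theta(s)$ toward $a$, so the policy increases the probability of actions that worked well. When $a < \mu_\theta(s)$ (the action was below the current mean), the factor $a - \mu_\theta(s)$ is negative and pushes $\mu_\theta(s)$ down. The magnitude scales inversely with variance: in high-uncertainty states (large $\sigma_\theta^2(s)$), updates are smaller, providing natural exploration-exploitation coupling.
The variance parameter $\sigma$ can also be learned: larger $\sigma$ increases entropy and exploration; smaller $\sigma$ focuses the policy on a narrow action range. In locomotion RL, the standard architecture outputs $\mu_\theta(s)$ from the policy network and treats $\sigma$ as a learned state-independent parameter. (This Gaussian policy structure is the standard starting point in Course 2's robot learning: the mean outputs the desired joint torques, and the variance controls exploration-exploitation during skill learning.)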
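The mean-gradient formula can be checked numerically. The sketch below assumes a scalar action and treats the mean and standard deviation as plain numbers rather than network outputs; the function names are illustrative, not part of any library.

```python
import numpy as np

def gaussian_log_prob(a, mu, sigma):
    """log N(a; mu, sigma^2) for a scalar action."""
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

def grad_log_prob_wrt_mu(a, mu, sigma):
    """Analytic derivative of the log-probability with respect to the mean."""
    return (a - mu) / sigma ** 2

a, mu, sigma = 0.3, 1.0, 0.5      # the action lies below the current mean
eps = 1e-6

# Central finite difference in mu as an independent check of the formula.
numeric = (gaussian_log_prob(a, mu + eps, sigma)
           - gaussian_log_prob(a, mu - eps, sigma)) / (2 * eps)
analytic = grad_log_prob_wrt_mu(a, mu, sigma)

print("analytic:", analytic, "numeric:", numeric)
# Same action and mean, larger sigma: the gradient magnitude shrinks.
print("with sigma = 1.0:", grad_log_prob_wrt_mu(a, mu, 1.0))
```

The gradient is negative here because the sampled action is below the mean, and doubling the standard deviation reduces its magnitude by a factor of four.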
REINFORCE
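The simplest policy gradient algorithm is REINFORCE (Williams, 1992). At the end of each episode, update:
$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} \gamma^t\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
where $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}$ is the actual observed return from time $t$.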
Why REINFORCE has high variance
The REINFORCE gradient estimator $\hat g$ is unbiased, $\mathbb{E}[\hat g] = \nabla_\theta J(\theta)$, but its variance is large. The source is the return $G_t$: it is a sum of random reward terms:
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T$$
Each term is random, and their sum accumulates variance. In long-horizon problems, $G_t$ can vary enormously across episodes for the same starting state: a stochastic environment or exploratory policy can produce very different trajectories. The variance of $G_t$ grows with episode length, which is why REINFORCE struggles on long episodes even when individual rewards are low-variance.
The consequence is that REINFORCE requires many episodes to average out this variance before gradient estimates become reliable — making it impractically slow for all but the simplest problems. This motivates variance reduction.
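The growth of return variance with horizon can be illustrated under a simplifying assumption that is not part of the lecture: rewards are independent with a common variance, in which case $\mathrm{Var}(G) = \sigma_r^2 \sum_k \gamma^{2k}$. Real MDP rewards are correlated, so this is only a sketch of the trend; `return_variance` is a hypothetical helper.

```python
import numpy as np

def return_variance(reward_var, gamma, horizon):
    """Var(G) = reward_var * sum_k gamma^(2k), valid for independent rewards."""
    k = np.arange(horizon)
    return reward_var * np.sum(gamma ** (2 * k))

rng = np.random.default_rng(0)
gamma, horizon, n_episodes = 0.99, 50, 20000

# Simulated i.i.d. unit-variance rewards (an assumption made for illustration).
rewards = rng.normal(loc=1.0, scale=1.0, size=(n_episodes, horizon))
discounts = gamma ** np.arange(horizon)
returns = rewards @ discounts                 # one discounted return per episode

empirical = returns.var()
closed_form = return_variance(1.0, gamma, horizon)
print("empirical:", empirical, "closed form:", closed_form)

# Variance of the return for increasing episode lengths.
for T in (1, 10, 100):
    print(T, return_variance(1.0, gamma, T))
```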
REINFORCE algorithm
# REINFORCE: Monte Carlo policy gradient
# Input: differentiable policy π_θ, learning rate α, discount γ
for episode = 1..M:
    generate trajectory τ = (s₀, a₀, r₁, ..., s_T) using π_θ
    for t = 0..T-1:
        G_t = Σ_{k=0}^{T-t-1} γᵏ r_{t+k+1}            # Monte Carlo return
        θ ← θ + α γᵗ G_t ∇_θ log π_θ(a_t | s_t)      # policy gradient update
Notable properties:
- Unbiased: the gradient estimate equals the true policy gradient in expectation.
- High variance: $G_t$ is the sum of random reward terms, so variance grows with episode length.
- Episode-level updates: weight changes only happen after each complete trajectory, not mid-episode.
- The $\gamma^t$ term in the update is an artifact of the discounted formulation; the undiscounted variant omits it.
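A minimal runnable sketch of the update, assuming a tabular softmax policy with parameters `theta[s, a]` and a trajectory that has already been collected. The helper names are hypothetical. Unlike the pseudocode, which applies the update step by step, this version accumulates the gradient over the whole episode at fixed parameters and applies it once.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def discounted_returns(rewards, gamma):
    """G_t = sum_k gamma^k r_{t+k+1}, computed backwards; rewards[t] holds r_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_update(theta, states, actions, rewards, alpha=0.1, gamma=0.99):
    """One episode-level REINFORCE update for a tabular softmax policy."""
    returns = discounted_returns(rewards, gamma)
    grad = np.zeros_like(theta)
    for t, (s, a) in enumerate(zip(states, actions)):
        probs = softmax(theta[s])
        grad_log_pi = -probs              # d log pi(a|s) / d theta[s, :] = onehot(a) - probs
        grad_log_pi[a] += 1.0
        grad[s] += (gamma ** t) * returns[t] * grad_log_pi
    return theta + alpha * grad

# Two-state, two-action toy trajectory: s0 -a1-> s1 -a0-> terminal, rewards r_1 = 0, r_2 = 1.
theta = np.zeros((2, 2))
new_theta = reinforce_update(theta, states=[0, 1], actions=[1, 0],
                             rewards=[0.0, 1.0], alpha=0.5, gamma=0.9)
print(softmax(new_theta[0]), softmax(new_theta[1]))
```

Both actions on the rewarded trajectory become more probable, even though only the second one immediately preceded the reward.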
Variance reduction: the REINFORCE identity and baselines
The REINFORCE identity
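The key identity underlying all baseline subtraction is that the score function has zero mean under the policy:
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] = \sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$$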
Baseline subtraction
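For any function $b(s)$ that does not depend on $a$:
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = b(s)\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] = 0$$
Therefore we can subtract $b(s_t)$ from $G_t$ without changing the expected gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)\right]$$
The gradient remains unbiased for any choice of $b(s)$. The variance is reduced because $G_t - b(s_t)$ has smaller magnitude than $G_t$ when $b(s_t)$ is a good predictor of $G_t$.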
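Both claims can be verified exactly on a small example. The sketch below assumes a single-state problem (a three-armed bandit) with a softmax policy and deterministic rewards, so the mean and variance of the gradient estimator can be computed by enumerating actions instead of sampling. The reward values and function names are illustrative.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def score(probs, a):
    """Gradient of log pi(a) with respect to the logits: onehot(a) - probs."""
    g = -probs.copy()
    g[a] += 1.0
    return g

def estimator_moments(logits, rewards, baseline):
    """Exact mean and total variance of score(a) * (R(a) - b) under a ~ pi."""
    probs = softmax(logits)
    samples = np.array([score(probs, a) * (rewards[a] - baseline)
                        for a in range(len(probs))])
    mean = probs @ samples
    second_moment = probs @ samples ** 2
    total_var = (second_moment - mean ** 2).sum()
    return mean, total_var

logits = np.array([0.2, -0.1, 0.4])
rewards = np.array([10.0, 11.0, 12.0])       # large common offset, small differences

mean_raw, var_raw = estimator_moments(logits, rewards, baseline=0.0)
value = softmax(logits) @ rewards            # expected reward, used as the baseline
mean_b, var_b = estimator_moments(logits, rewards, baseline=value)

print("means equal:", np.allclose(mean_raw, mean_b))
print("variance without baseline:", var_raw, "with baseline:", var_b)
```

The expected gradient is unchanged, while the variance drops sharply because the baseline removes the large reward offset shared by all actions.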
The advantage function
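A standard, near-optimal choice of baseline is the state value $b(s) = V^\pi(s)$ (the exactly variance-minimizing baseline additionally weights returns by the squared norm of the score function). This choice gives the advantage function:
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
Intuitively: $A^\pi(s, a) > 0$ means action $a$ is better than average in state $s$; $A^\pi(s, a) < 0$ means it is worse than average. Using advantage rather than raw return centers the updates around zero, which substantially reduces variance.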
Three advantage estimators
This distinction is critical and frequently confused:
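| Estimator | Formula | Bias | Variance |
|---|---|---|---|
| True advantage | $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$ | Zero | N/A (not an estimator) |
| MC estimate | $\hat A_t^{\text{MC}} = G_t - V_\phi(s_t)$ | Zero | High (sum of random terms) |
| TD(0) estimate | $\hat A_t^{\text{TD}} = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ | Nonzero | Low |

Here $V_\phi$ denotes the learned value estimate. REINFORCE with baseline uses $\hat A_t^{\text{MC}}$: it is unbiased but still depends on the full Monte Carlo return $G_t$. Actor-critic methods use $\hat A_t^{\text{TD}}$: they bootstrap via $V_\phi(s_{t+1})$, introducing bias but greatly reducing variance. GAE interpolates between them.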
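The two practical estimators can be compared on a hand-made trajectory. The rewards and critic values below are made-up numbers chosen for illustration, and the function names are hypothetical; `values` holds the critic's estimate at each visited state, with 0 for the terminal state.

```python
import numpy as np

def mc_advantages(rewards, values, gamma):
    """G_t - V(s_t): uses the full Monte Carlo return, no bootstrapping."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values[:-1]

def td0_advantages(rewards, values, gamma):
    """r_{t+1} + gamma * V(s_{t+1}) - V(s_t): bootstraps on the critic."""
    return rewards + gamma * values[1:] - values[:-1]

rewards = np.array([1.0, 0.0, 2.0])        # r_1, r_2, r_3
values = np.array([2.0, 1.0, 1.5, 0.0])    # critic at s_0..s_3 (s_3 terminal)
gamma = 0.9

adv_mc = mc_advantages(rewards, values, gamma)
adv_td = td0_advantages(rewards, values, gamma)
print("MC:   ", adv_mc)
print("TD(0):", adv_td)
```

The two estimates agree on the final step, where no bootstrapping is involved, and differ earlier in the episode wherever the critic's values are inconsistent with the observed rewards.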
Entropy regularization
Policy gradients can collapse to deterministic policies prematurely — once the policy assigns high probability to a single action in each state, the gradient signal for other actions vanishes and exploration stops.
To prevent premature collapse, add an entropy bonus:
$$J_H(\theta) = J(\theta) + c_H\, \mathbb{E}_{s}\Big[ H\big(\pi_\theta(\cdot \mid s)\big) \Big]$$
where $H\big(\pi_\theta(\cdot \mid s)\big) = -\sum_a \pi_\theta(a \mid s) \log \pi_\theta(a \mid s)$ is the entropy of the policy. The coefficient $c_H$ controls the exploration-exploitation tradeoff.
Entropy regularization appears in most modern actor-critic implementations (PPO, SAC, RLHF) and is particularly important in large action spaces (language vocabularies, continuous joint spaces) where naive policy gradient tends to collapse to narrow modes.
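A small sketch of the entropy term for a categorical policy. The function names and the coefficient value 0.01 are assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def categorical_entropy(probs):
    """H(pi) = -sum_a pi(a) log pi(a); zero-probability actions contribute 0."""
    p = probs[probs > 0]
    return -np.sum(p * np.log(p))

def regularized_objective(expected_return, probs, coef):
    """Expected return plus an entropy bonus weighted by coef."""
    return expected_return + coef * categorical_entropy(probs)

uniform = softmax(np.zeros(4))                      # maximally exploratory
peaked = softmax(np.array([10.0, 0.0, 0.0, 0.0]))   # nearly deterministic

print("entropy (uniform):", categorical_entropy(uniform))
print("entropy (peaked): ", categorical_entropy(peaked))
# At equal expected return, the bonus favours the higher-entropy policy.
print(regularized_objective(1.0, uniform, 0.01) > regularized_objective(1.0, peaked, 0.01))
```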
Actor–critic methods
REINFORCE with advantage still requires full-episode Monte Carlo returns: updates are only possible at episode end, and variance remains high for long episodes. Actor-critic methods replace the Monte Carlo return with a bootstrapped TD estimate, enabling step-by-step updates.
Architecture
- Actor: policy $\pi_\theta(a \mid s)$, which decides what action to take.
- Critic: value function $V_\phi(s)$, which estimates the expected return from state $s$.
The critic is not just a baseline. In REINFORCE with baseline, $V_\phi(s_t)$ is used to center returns but the actual return $G_t$ still drives the gradient. In actor-critic, the critic enables a TD advantage estimate that does not require waiting for episode termination:
$$\delta_t = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
This is the one-step TD error, the same $\delta_t$ from Week 4. The actor update uses this estimate in place of the Monte Carlo advantage $G_t - V_\phi(s_t)$:
$$\theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
The critic training objective
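The critic is trained to minimize the squared TD prediction error:
$$L_{\text{critic}}(\phi) = \mathbb{E}\Big[\big(r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\big)^2\Big]$$
where the bootstrapped target $r_{t+1} + \gamma V_\phi(s_{t+1})$ is treated as a constant when differentiating (a semi-gradient update).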
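The actor and critic updates can be combined into a single online step. The sketch below assumes tabular parameters (a logit table `theta[s, a]` and a value table `v[s]`) in place of neural networks; the function name and learning rates are illustrative.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.5):
    """One online update from a single transition (s, a, r, s_next)."""
    bootstrap = 0.0 if done else v[s_next]       # terminal states have value 0
    delta = r + gamma * bootstrap - v[s]         # one-step TD error
    theta = theta.copy()
    v = v.copy()
    v[s] += lr_critic * delta                    # semi-gradient TD(0) critic update
    grad_log_pi = -softmax(theta[s])             # onehot(a) - probs
    grad_log_pi[a] += 1.0
    theta[s] += lr_actor * delta * grad_log_pi   # actor update weighted by delta
    return theta, v, delta

theta0 = np.zeros((2, 2))
v0 = np.zeros(2)
theta1, v1, delta = actor_critic_step(theta0, v0, s=0, a=1, r=1.0, s_next=1, done=False)
print("TD error:", delta)
print("critic:", v1, "policy at s=0:", softmax(theta1[0]))
```

The update uses only one transition, so it can be applied mid-episode, which is the practical difference from REINFORCE.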
In practice, the actor and critic often share a common feature extraction network (the trunk) with separate output heads, making the joint training loss:
$$L(\theta, \phi) = L_{\text{actor}}(\theta) + c_V\, L_{\text{critic}}(\phi) - c_H\, H(\pi_\theta)$$
where $L_{\text{actor}}(\theta) = -\mathbb{E}\big[\hat A_t \log \pi_\theta(a_t \mid s_t)\big]$ is the policy gradient loss and $c_V$, $c_H$ are tunable coefficients. With the clipped surrogate as the actor term, this is the form of the PPO loss used in practice and in RLHF fine-tuning. (This shared-trunk, dual-head architecture is foundational in Course 2: the trunk learns robot state representations from visual observations or proprioceptive data, the actor head outputs desired actions, and the critic head estimates value, enabling efficient learning in the high-dimensional, continuous action setting of robotic control.)
A2C and A3C
Advantage Actor-Critic (A2C) applies the actor-critic update synchronously: all parallel workers complete their rollouts, gradients are aggregated, and a single synchronized update is applied. The synchronization ensures consistent gradient estimates and is more stable than asynchronous updates.
Asynchronous Advantage Actor-Critic (A3C) runs workers asynchronously: each worker pushes gradient updates to a shared parameter server without waiting for others. The asynchrony decorrelates experience across workers (similar to a replay buffer) but without requiring off-policy corrections, since each worker runs a recent copy of the current policy. A2C is generally preferred in practice because synchronous updates produce more stable learning: the additional variance from asynchronous gradient staleness typically outweighs the decorrelation benefit.
Generalized Advantage Estimation (GAE)
The TD(0) advantage estimate has low variance but nonzero bias (because $V_\phi \neq V^\pi$ in general). The Monte Carlo estimate has zero bias but high variance. GAE provides a principled interpolation.
Derivation from n-step advantages
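The $n$-step advantage estimate:
$$\hat A_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n V_\phi(s_{t+n}) - V_\phi(s_t) = \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}$$
where $\delta_t = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ is the one-step TD error and the second equality rewrites the n-step return as a telescoping sum of TD errors. This is the same structure as the n-step TD return from Week 4.
GAE takes the exponentially weighted average over all $n$-step estimates:
$$\hat A_t^{\text{GAE}(\gamma,\lambda)} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, \hat A_t^{(n)}$$
Swapping the order of summation (each $\delta_{t+k}$ appears in all $\hat A_t^{(n)}$ for $n > k$, with total weight $(1-\lambda)\sum_{n>k} \lambda^{n-1} = \lambda^k$):
$$\hat A_t^{\text{GAE}(\gamma,\lambda)} = \sum_{k=0}^{\infty} (\gamma\lambda)^k\, \delta_{t+k}$$
This is a geometric series of TD errors, decaying at rate $\gamma\lambda$. Recent TD errors (small $k$) receive high weight; distant ones receive exponentially less credit.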
Endpoint special cases:
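- $\lambda = 0$: only $\delta_t$ survives → TD(0) advantage, minimum variance, maximum bias.
- $\lambda = 1$: every TD error is weighted only by its discount $\gamma^k$ → $\sum_k \gamma^k \delta_{t+k} = G_t - V_\phi(s_t)$, the Monte Carlo advantage.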
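The geometric sum is usually computed with the backward recursion $\hat A_t = \delta_t + \gamma\lambda\, \hat A_{t+1}$. The sketch below assumes a finite episode whose final state is terminal (value 0), so the infinite sum truncates at the episode end. The trajectory numbers are made up and `gae_advantages` is a hypothetical name.

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """GAE via the backward recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD errors
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 0.0, 2.0])        # r_1, r_2, r_3
values = np.array([2.0, 1.0, 1.5, 0.0])    # critic at s_0..s_3 (s_3 terminal)
gamma = 0.9

adv_td = gae_advantages(rewards, values, gamma, lam=0.0)    # TD(0) endpoint
adv_mc = gae_advantages(rewards, values, gamma, lam=1.0)    # Monte Carlo endpoint
adv_mid = gae_advantages(rewards, values, gamma, lam=0.95)  # in between
print(adv_td, adv_mc, adv_mid)
```

Setting `lam=0.0` returns the raw TD errors and `lam=1.0` returns the discounted return minus the critic's value, matching the two endpoint cases.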
The effective bias-variance tradeoff as a function of $\lambda$:
| $\lambda$ | Bias | Variance | Effective horizon |
|---|---|---|---|
| 0.0 | High (only 1 TD step) | Lowest | 1 step |
| 0.95 | Low | Moderate | ~20 steps ($1/(1-\lambda)$) |
| 1.0 | Zero | Highest | Full episode |

$\lambda = 0.95$ is standard in PPO implementations and corresponds to an effective advantage horizon of about 20 TD steps, enough to capture meaningful return structure while controlling variance.
Connection to TD(λ) from Week 4
GAE is the policy gradient analog of TD(λ): both take exponentially weighted combinations of $n$-step estimates, with $\lambda$ as the decay parameter. The difference is that TD(λ) applies the weighting to value estimates for policy evaluation, while GAE applies the same weighting to advantage estimates for policy improvement. The derivation has the same structure.
Trust regions and PPO
Actor-critic with GAE still suffers from destructive policy updates. A large gradient step changes $\pi_\theta$ substantially, which invalidates the advantage estimates computed under the old policy. The new policy collects different data, leading to further degraded estimates: a catastrophic feedback loop that can permanently destabilize training.
The fundamental problem
In supervised learning, a large gradient step may hurt performance but the loss function remains valid: we can detect the mistake and recover. In RL, the policy generates its own training data. A bad policy update changes what data is collected, which changes the gradient, which may push the policy further in the wrong direction. The feedback loop has no self-correcting mechanism.
TRPO: trust region policy optimization
TRPO (Schulman et al., 2015b) constrains each update to stay within a trust region defined by the KL divergence between old and new policies:
$$\max_\theta\; L_{\theta_{\text{old}}}(\theta) \quad \text{subject to} \quad \mathbb{E}_{s}\Big[ D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big) \Big] \le \delta_{\mathrm{KL}}$$
where $L_{\theta_{\text{old}}}(\theta) = \mathbb{E}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat A(s, a)\right]$ is the importance-weighted policy objective and $\delta_{\mathrm{KL}}$ is the trust region radius.
TRPO provides a monotone improvement guarantee: each constrained update is guaranteed not to decrease $J(\theta)$ (up to approximation error). The cost is computational complexity: TRPO requires the Fisher information matrix (the Hessian of the KL divergence with respect to $\theta$) and solves a constrained optimization problem via conjugate gradient at each step. This is expensive for large neural networks.
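The trust region constraint itself is simple to evaluate for categorical policies, even though TRPO's optimization step is not. The sketch below only checks whether a candidate update stays inside the region; the function names and the radius 0.01 are assumptions made for illustration, and the conjugate gradient step is not shown.

```python
import numpy as np

def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for strictly positive categorical distributions."""
    return float(np.sum(p_old * (np.log(p_old) - np.log(p_new))))

def within_trust_region(p_old, p_new, max_kl):
    """True if the candidate policy stays within the KL radius."""
    return categorical_kl(p_old, p_new) <= max_kl

p_old = np.array([0.5, 0.5])
small_step = np.array([0.52, 0.48])   # mild change in action probabilities
large_step = np.array([0.9, 0.1])     # drastic change

print(categorical_kl(p_old, small_step), within_trust_region(p_old, small_step, 0.01))
print(categorical_kl(p_old, large_step), within_trust_region(p_old, large_step, 0.01))
```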
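PPO: proximal policy optimization
PPO (Schulman et al., 2017) approximates the TRPO constraint with a clipped surrogate objective that is simple to implement and computationally cheap:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat A_t,\ \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat A_t\big)\Big]$$
where the importance ratio (written $r_t(\theta)$ in the PPO paper; $\rho_t$ is used here to avoid confusion with rewards):
$$\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
measures how much the current policy differs from the policy that collected the data.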
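Understanding the PPO objective
The clipped objective has four cases depending on the sign of the advantage $\hat A_t$ and whether the importance ratio $\rho_t(\theta)$ is inside or outside $[1-\epsilon,\, 1+\epsilon]$:
| Advantage | Ratio | Effect |
|---|---|---|
| $\hat A_t > 0$ (good action) | $\rho_t \le 1+\epsilon$ | Normal update: increase action probability |
| $\hat A_t > 0$ (good action) | $\rho_t > 1+\epsilon$ | Clipped: don't increase probability further |
| $\hat A_t < 0$ (bad action) | $\rho_t \ge 1-\epsilon$ | Normal update: decrease action probability |
| $\hat A_t < 0$ (bad action) | $\rho_t < 1-\epsilon$ | Clipped: don't decrease probability further |

The $\min$ ensures the objective is never more optimistic than the clipped version: updates that would push $\rho_t$ far from 1 in the direction of improvement are capped, preventing overconfident large steps. When $\rho_t$ has moved far from 1 in the direction that makes the objective worse (for example $\rho_t > 1+\epsilon$ with $\hat A_t < 0$), the unclipped term is selected, so the penalty is not capped and the gradient can still correct the mistake.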
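The case analysis can be reproduced directly. The sketch below evaluates the per-sample clipped surrogate for hand-picked ratio and advantage pairs; `ppo_clip_objective` is a hypothetical name and the clip range 0.2 is an assumed value.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# Four table cases, then two cases where the ratio moved the "wrong" way.
ratios = np.array([1.1, 1.5, 0.9, 0.5, 1.5, 0.5])
advantages = np.array([1.0, 1.0, -1.0, -1.0, -1.0, 1.0])

objective = ppo_clip_objective(ratios, advantages)
for rho, adv, obj in zip(ratios, advantages, objective):
    print(f"ratio={rho:.1f}  advantage={adv:+.0f}  objective={obj:+.2f}")
```

In the last two cases the unclipped term is selected, so the objective keeps its full dependence on the ratio and a gradient step can still undo the harmful change.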
Connection to Week 4 importance sampling
The ratio $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the importance sampling ratio from Week 4, applied to the policy gradient update. PPO collects data under $\pi_{\theta_{\text{old}}}$ (the behavior policy) and updates $\pi_\theta$ (the target policy), so after the first gradient step on a batch this is (mildly) off-policy policy optimization. The ratio corrects for the distribution mismatch, and clipping it prevents the variance explosion from large importance weights that was identified in Week 4. PPO's clipping is the policy gradient analog of the importance weight clipping in off-policy TD.
The full PPO training loss
In practice, PPO optimizes the joint actor-critic objective:
$$L(\theta, \phi) = \mathbb{E}_t\Big[ -L_t^{\text{CLIP}}(\theta) + c_V \big(V_\phi(s_t) - \hat R_t\big)^2 - c_H\, H\big(\pi_\theta(\cdot \mid s_t)\big) \Big]$$
where $L_t^{\text{CLIP}}$ is the per-timestep clipped surrogate, $\hat R_t = \hat A_t^{\text{GAE}} + V_{\phi_{\text{old}}}(s_t)$ is the GAE-computed return target, and $c_V$, $c_H$ are the value-loss and entropy coefficients. This is the loss function used in PPO implementations and in RLHF fine-tuning of language models.
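A sketch of how the three terms combine into one scalar to minimize, assuming the per-sample ratios, advantages, value predictions, return targets, and entropies have already been computed. The function name and the default coefficient values are illustrative assumptions, not values taken from the lecture.

```python
import numpy as np

def ppo_loss(ratio, advantage, value_pred, return_target, entropy,
             eps=0.2, c_value=0.5, c_entropy=0.01):
    """Joint loss: -clipped surrogate + c_value * value MSE - c_entropy * entropy."""
    surrogate = np.minimum(ratio * advantage,
                           np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
    policy_loss = -surrogate.mean()                          # maximize the surrogate
    value_loss = ((value_pred - return_target) ** 2).mean()  # critic regression
    entropy_bonus = entropy.mean()                           # encourages exploration
    return policy_loss + c_value * value_loss - c_entropy * entropy_bonus

loss = ppo_loss(ratio=np.array([1.0, 1.0]),
                advantage=np.array([1.0, -1.0]),
                value_pred=np.array([1.0, 2.0]),
                return_target=np.array([0.0, 2.0]),
                entropy=np.array([0.5, 0.5]))
print(loss)
```

In an automatic differentiation framework the same expression would be built from tensors so that gradients flow to both the policy and value parameters.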
PPO in RLHF: the full picture
RLHF fine-tunes a language model using PPO with a learned reward model $r_\psi(x, y)$, which scores a response $y$ to a prompt $x$. The complete objective adds a KL penalty against the reference model $\pi_{\text{ref}}$ (the SFT checkpoint):
$$\max_\theta\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r_\psi(x, y) \big] - \beta\, \mathbb{E}_{x \sim \mathcal{D}}\Big[ D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big) \Big]$$
Every component of this objective maps directly onto the policy gradient framework:
| RLHF component | Policy gradient analog |
|---|---|
| $\pi_\theta$ (language model) | Actor: $\pi_\theta(a \mid s)$ |
| $r_\psi$ (reward model) | Return signal: $G_t$ |
| PPO surrogate | Policy gradient with clipped importance ratio |
| GAE advantage estimates | Variance reduction for policy gradient |
| KL penalty from $\pi_{\text{ref}}$ | Entropy/trust region regularization |
| Value head | Critic: $V_\phi(s)$ |

The KL penalty serves the same role as TRPO's trust region constraint: it prevents the policy from moving too far from a stable reference point ($\pi_{\text{ref}}$), which in the language setting prevents reward hacking (exploiting the reward model far outside the distribution it was trained on). (This RLHF framework is the direct application of Week 7's policy gradient machinery in Course 4, Week 12: the language model becomes the actor, the reward model becomes the return signal, and PPO with KL penalties becomes the optimization algorithm. Understanding this week is prerequisite to understanding language model alignment.)
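One way the KL penalty can enter the PPO reward signal is sketched below. It assumes a common implementation choice that the lecture does not specify: the KL term is estimated per token from the log-probability ratio of the sampled tokens, and the reward model score is added on the final token. The function name and all numbers are illustrative.

```python
import numpy as np

def kl_penalized_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref), plus the
    reward-model score added on the final token of the response."""
    rewards = -beta * (logp_policy - logp_ref)
    rewards[-1] += rm_score
    return rewards

# Log-probabilities of the sampled response tokens under each model (made-up numbers).
logp_policy = np.array([-1.0, -0.5, -2.0])
logp_ref = np.array([-1.5, -0.5, -1.0])

with_penalty = kl_penalized_rewards(2.0, logp_policy, logp_ref, beta=0.1)
no_penalty = kl_penalized_rewards(2.0, logp_policy, logp_ref, beta=0.0)
print(with_penalty, with_penalty.sum())
print(no_penalty, no_penalty.sum())
```

With `beta=0.0` the reference model has no influence on the rewards, so nothing in the objective holds the policy near $\pi_{\text{ref}}$.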
Key takeaways
The lecture develops a logical progression from first principles to the state of the art. The policy gradient theorem is derived via the log-derivative trick: the environment terms drop out because only $\pi_\theta$ depends on $\theta$, giving a gradient that requires no model. REINFORCE is the direct implementation: unbiased but high-variance because $G_t$ is a sum of random terms. The REINFORCE identity proves that any state-dependent baseline can be subtracted without bias, and using $V^\pi(s)$ as the baseline yields the advantage function, the standard low-variance choice. The three advantage estimators (true, MC, TD) form a bias-variance spectrum, and the actor-critic architecture enables the TD estimate via a learned critic that supports step-by-step bootstrapped updates.
GAE derives the exponentially weighted combination of n-step advantages, recovering TD(0) at $\lambda = 0$ and Monte Carlo at $\lambda = 1$, with $\lambda = 0.95$ as the standard practical choice. The continuous action space Gaussian policy makes the abstract theorem concrete for robotics applications. TRPO solves the destructive update problem via a KL-constrained trust region with monotone improvement guarantees. PPO approximates TRPO with a clipped importance ratio, connecting directly to the Week 4 importance sampling framework: clipping prevents variance explosion from large $\rho_t$. The full PPO loss combines clipped policy objective, critic regression, and entropy bonus in a single joint objective. And the RLHF formulation maps every component of the policy gradient framework onto the language model alignment setting.
Conceptual questions
- Derive the policy gradient theorem for a two-step episodic Markov decision process (MDP): $s_0 \to s_1 \to s_2$ (terminal), with rewards $r_1$ and $r_2$. Write out $J(\theta)$ explicitly, apply the log-derivative trick, and show that the environment transition terms $p(s_1 \mid s_0, a_0)$ and $p(s_2 \mid s_1, a_1)$ drop out. What does the final expression for $\nabla_\theta J(\theta)$ look like?
- REINFORCE with baseline subtracts $V(s_t)$ from $G_t$. Prove using the REINFORCE identity that this subtraction does not change the expected gradient. Then explain intuitively why it reduces variance: what property of $V(s_t)$ makes it a good baseline compared to, say, a constant $b$?
- An actor-critic uses the TD(0) advantage estimate $\delta_t = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$. When $V_\phi \neq V^\pi$ (i.e., the critic is not perfectly trained), this estimate is biased. Describe the direction of the bias when the critic systematically underestimates $V^\pi(s_{t+1})$, and explain how GAE with $\lambda$ close to 1 reduces this bias at the cost of increased variance.
- PPO clips the importance ratio $\rho_t(\theta)$ to $[1-\epsilon,\, 1+\epsilon]$. The clipping applies to both positive-advantage and negative-advantage cases but asymmetrically: updates that degrade performance are not clipped. Explain this asymmetry using the $\min$ operator in the PPO objective. Why would clipping negative-advantage updates as well be harmful?
- An RLHF pipeline fine-tunes a language model using PPO with $\beta = 0$ (no KL penalty against the reference model). After 10,000 updates, the model achieves very high reward model scores but produces outputs that no longer resemble natural language. Diagnose this failure in terms of the trust region / KL constraint, reward hacking, and the importance sampling distribution mismatch. What value of $\beta$ should be used, and what does it control geometrically in the policy space?
Coding exercises
Exercise 1: REINFORCE with baseline on CartPole
Implement REINFORCE with a learned baseline on the CartPole-v1 environment:
- Define a policy network with two output heads: action logits (for the categorical distribution) and a scalar $V(s)$ (the baseline).
- Generate complete episodes using the current stochastic policy: sample actions from the categorical distribution.
- After each episode, compute $G_t$ for every timestep and the advantage $\hat A_t = G_t - V(s_t)$.
- Update the policy head via gradient ascent and the value head via gradient descent on the MSE of $V(s_t) - G_t$.
- Plot episode return vs. episode number. Compare against a version that uses a constant baseline (i.e., plain REINFORCE) — how much does the learned baseline reduce variance?
Expected outcome: REINFORCE with baseline should converge to a mean return of ~200 within 2,000–5,000 episodes.
Exercise 2: One-step actor-critic with bootstrapping
Modify Exercise 1 to use one-step TD bootstrap instead of Monte Carlo returns:
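- Replace $G_t$ with $r_{t+1} + \gamma V(s_{t+1})$ (where $V(s_{t+1}) = 0$ for terminal states).
- Update both actor and critic after every step rather than at episode end.
- Add an entropy bonus to the actor loss: $-c_H\, H\big(\pi_\theta(\cdot \mid s_t)\big)$ with a small coefficient $c_H$.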
- Compare learning speed (steps to reach reward 195) between MC and TD variants. Which converges faster, and why?
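Exercise 3: GAE with varying $\lambda$
Extend Exercise 2 to use GAE advantage estimates:
- Compute $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ for all timesteps in a short rollout (e.g., 128 steps).
- Compute GAE advantages: $\hat A_t = \sum_{k \ge 0} (\gamma\lambda)^k\, \delta_{t+k}$.
- Run experiments with several values of $\lambda$ between 0 and 1, including both endpoints. Plot average episode return vs. wall-clock time for each.
- Explain which $\lambda$ gives the best learning speed and why this matches the bias-variance tradeoff described in the lecture.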
Extension prompts
- Continuous control extension: Replace the categorical policy in Exercise 2 with a Gaussian policy (mean $\mu_\theta(s)$ and learnable log-standard-deviation $\log \sigma$). Implement this on Pendulum-v1 or LunarLanderContinuous-v2. How does the entropy bonus formula change for a Gaussian policy?
- PPO clipping from scratch: Starting from your GAE actor-critic (Exercise 3), replace the standard policy loss with the PPO clipped surrogate objective. Implement the ratio $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$, compute the clipped objective, and run multiple epochs over the same rollout data. Plot the KL divergence between the old and current policies over epochs. Does clipping keep it bounded?
- RLHF intuition experiment: Use a small transformer (e.g., GPT-2 124M) with a frozen reward model. Implement the KL-penalized PPO objective from the RLHF section. Vary $\beta$ and observe the generated text after fine-tuning. Show that $\beta = 0$ leads to reward hacking (high reward, degraded language), while moderate $\beta$ maintains fluency. This directly demonstrates the trust-region principle from the lecture.
Looking ahead
The next lecture studies PPO and RLHF in depth, examining how the full RLHF pipeline (reward model training, KL-penalized PPO, and preference optimization) connects to the policy gradient foundations developed here, and where it diverges from classical RL theory.
Further reading
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning. (Introduced REINFORCE.)
- Sutton, R. S., et al. (2000). Policy gradient methods for reinforcement learning with function approximation. NeurIPS. (The policy gradient theorem.)
- Schulman, J., et al. (2015a). High-dimensional continuous control using generalized advantage estimation. ICLR. (GAE.)
- Schulman, J., et al. (2015b). Trust region policy optimization. ICML. (TRPO.)
- Schulman, J., et al. (2017). Proximal policy optimization algorithms. arXiv. (PPO; the foundation of modern RLHF.)