Week 8: Modern Deep Reinforcement Learning Algorithms

✦Learning Outcomes
  • Explain the deterministic policy gradient and its advantages
  • Implement DDPG (Deep Deterministic Policy Gradient), TD3, and SAC (Soft Actor-Critic) and understand their differences
  • Analyze when to use each algorithm based on application context
  • Connect modern deep RL (reinforcement learning) algorithms to robotics applications
◆Prerequisites
  • Week 7: Policy gradient theorem, actor-critic, GAE, PPO (Proximal Policy Optimisation)/TRPO (Trust Region Policy Optimisation)
  • Week 6: DQN (Deep Q-Network), function approximation

Recommended: Review Week 7 sections on "Actor-critic" and "PPO" before proceeding.

◆Grounded In
  • Robotics: DDPG, TD3, and SAC are the dominant algorithms for continuous control in real-robot systems; SAC is used for Franka arm manipulation, Unitree Go1 locomotion, and dexterous in-hand manipulation. Each interaction with physical hardware is expensive (seconds per step vs milliseconds in simulation), making off-policy sample efficiency essential.
  • GenAI: PPO powers RLHF (Reinforcement Learning from Human Feedback) for language model alignment (ChatGPT, Claude); the maximum-entropy framework under SAC parallels temperature sampling in LLM decoding; DPO (Direct Preference Optimization) and GRPO (Group Relative Policy Optimisation), covered in Week 13, reframe alignment as preference optimization without a critic.

Purpose of this lecture

In Week 7, we derived the policy gradient theorem, developed actor-critic methods with GAE, and fully derived PPO and TRPO. This lecture completes the modern deep RL toolkit by studying the off-policy actor-critic family (DDPG, TD3, and SAC), which trades on-policy stability for sample efficiency and dominates physical robotics applications.

The lecture is organized in three tiers. First, a brief recap of TRPO/PPO establishes the on-policy baseline. Second, the off-policy family is developed as a progression: DDPG introduces the deterministic policy gradient theorem; TD3 fixes DDPG's three core pathologies; SAC replaces deterministic policies with the maximum entropy framework and adds automatic entropy tuning. Third, a synthesis section maps the algorithm landscape onto practical deployment contexts, particularly the PPO vs SAC choice in legged locomotion vs real-robot manipulation.


On-policy recap: TRPO and PPO

TRPO and PPO were derived in full in Week 7. The key results:

TRPO constrains each update to a trust region defined by KL divergence:

$$\max_\theta\; \mathbb{E}_t\!\left[\rho_t A_t\right] \quad\text{subject to}\quad \mathbb{E}_s\!\left[D_{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta)\right] \leq \delta$$

providing monotone improvement guarantees at the cost of second-order optimization. Here $\rho_t$ is the probability ratio between the new and old policies, written out below.

PPO approximates the trust region with a clipped surrogate:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t A_t,\; \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) A_t\right)\right], \quad \rho_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

The $\min$ caps the objective once $\rho_t$ moves more than $\epsilon$ away from 1 in the direction that improves it, so there is no incentive to push the ratio further. When $\rho_t$ moves in the direction that lowers the objective, the unclipped term is the smaller one and the penalty is kept in full. The full PPO loss adds a critic regression term and entropy bonus (see Week 7 for the complete derivation and RLHF mapping).
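To make the capping behaviour concrete, here is a minimal sketch that evaluates the clipped surrogate for a single sample. The function name `clipped_surrogate` and the default `eps=0.2` are illustrative assumptions, not values taken from the lecture.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO-clip term: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage, ratio above 1 + eps: the gain is capped at (1 + eps) * A.
print(clipped_surrogate(1.5, 2.0))
# Negative advantage, ratio above 1 + eps: the penalty rho * A is kept in full.
print(clipped_surrogate(1.5, -2.0))
# Negative advantage, ratio below 1 - eps: the gain is capped at (1 - eps) * A.
print(clipped_surrogate(0.5, -2.0))
```

In a real implementation the same expression is applied to a batch of ratios and advantages and averaged, and the negated mean is used as the loss.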

Why on-policy methods are sample-inefficient: both TRPO and PPO discard data after each policy update. Every transition in the rollout batch is used for a few gradient steps, then thrown away. For physical robots, where each transition costs real hardware time, this is often prohibitive. Off-policy methods, which reuse past transitions via a replay buffer, are far better suited to data-scarce settings. (This is the core motivation for SAC and TD3 in Course 2: on a Unitree Go1 or a manipulation robot, each second of real interaction is expensive; SAC can often learn a strong policy from roughly 100K transitions, where PPO would typically require millions.)
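The replay buffer that makes this reuse possible can be sketched in a few lines. The class below is a hypothetical minimal version: the name `ReplayBuffer`, the fixed capacity, and uniform sampling are assumptions for illustration, and practical implementations usually store batched arrays rather than Python tuples.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next) transitions with uniform sampling."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement; old transitions can be drawn many times
        # over the course of training, which is where the sample efficiency comes from.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=1000)
for step in range(10):
    buffer.add(s=float(step), a=0.0, r=1.0, s_next=float(step + 1))
print(len(buffer), buffer.sample(batch_size=4))
```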


The off-policy actor-critic family: overview

The three off-policy algorithms in this lecture form a clear progression:

| Algorithm | Policy type | Exploration | Key problem solved |
|---|---|---|---|
| DDPG | Deterministic | External noise | Continuous actions with DPG theorem |
| TD3 | Deterministic | External noise (targets smoothed) | Overestimation + instability in DDPG |
| SAC | Stochastic (MaxEnt) | Intrinsic entropy | Exploration + robustness + auto-tuning |

Each algorithm inherits from its predecessor and adds targeted fixes, much as the DQN → Double DQN → Dueling → Rainbow progression did in Week 6.

All three build on the DQN engineering foundation from Week 6: a replay buffer $\mathcal{D}$ and target networks, here updated via Polyak averaging $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$, together with separate actor and critic networks.
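As a small illustration of the Polyak rule, the sketch below applies it to flat lists of weights. The function name `polyak_update` and the value `tau=0.005` are assumptions chosen for the example.

```python
def polyak_update(target, online, tau=0.005):
    """Return tau * online + (1 - tau) * target, element-wise over flat weight lists."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


online = [1.0, -2.0]   # stand-in for the current network weights
target = [0.0, 0.0]    # stand-in for the target network weights
for _ in range(1000):
    target = polyak_update(target, online)
print(target)  # the target has drifted most of the way toward the online weights
```

With a small `tau` the target network lags the online network by many steps, which is what keeps the bootstrapped TD target slowly moving.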


Deterministic Policy Gradient (DDPG)

From stochastic to deterministic policies

Recall the stochastic policy gradient theorem from Week 7:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\right]$$

This requires the policy to be stochastic: the score function $\nabla_\theta \log \pi_\theta(a \mid s)$ is well-defined only where $\pi_\theta(a \mid s) > 0$, and a deterministic policy, which puts all its mass on a single action, has no such density. For continuous action spaces, we may want to learn a deterministic policy $a = \mu_\theta(s)$ that directly outputs an action, and for such a policy the stochastic policy gradient theorem does not apply.

The deterministic policy gradient theorem

DPG theorem (Silver et al., 2014): For a deterministic policy $a = \mu_\theta(s)$, the gradient of the expected return is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\cdot \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \right]$$

Derivation (sketch). The expected return under the deterministic policy can be written as:

$$J(\theta) = \int_{\mathcal{S}} \rho^\mu(s)\, r(s, \mu_\theta(s))\, ds = \mathbb{E}_{s_0 \sim p_0}\!\left[Q^\mu(s_0, \mu_\theta(s_0))\right]$$

where $\rho^\mu(s)$ is the (discounted) state visitation distribution induced by $\mu_\theta$ and $p_0$ is the initial-state distribution. Differentiating $V^\mu(s) = Q^\mu(s, \mu_\theta(s))$ with respect to $\theta$, applying the chain rule, and unrolling the resulting recursion over time steps gives:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\cdot \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \right]$$

As in the stochastic case, no derivative of the environment dynamics $P(s' \mid s, a)$ or of $\rho^\mu$ appears: unrolling the recursion absorbs the dependence of the state distribution on $\theta$ into the expectation over $s \sim \rho^\mu$, so only the gradient through $\mu_\theta(s)$ has to be computed.
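The per-state chain-rule term inside the expectation can be checked numerically on a toy problem. The sketch below assumes a scalar linear actor $\mu_\theta(s) = \theta s$ and a fixed quadratic critic $Q(s,a) = -(a - 2s)^2$; both are hypothetical stand-ins chosen so the derivatives are available in closed form. It verifies only the product $\nabla_\theta \mu_\theta(s) \cdot \nabla_a Q(s,a)$ for a fixed critic, not the full theorem.

```python
def actor(theta, s):
    return theta * s                      # mu_theta(s), a linear toy actor


def critic(s, a):
    return -(a - 2.0 * s) ** 2            # toy Q(s, a), peaked at a = 2s


def dpg_term(theta, s):
    dmu_dtheta = s                                     # grad_theta mu_theta(s)
    dq_da = -2.0 * (actor(theta, s) - 2.0 * s)         # grad_a Q(s, a) at a = mu_theta(s)
    return dmu_dtheta * dq_da


def finite_difference(theta, s, h=1e-6):
    # Central difference of theta -> Q(s, mu_theta(s)) with the critic held fixed.
    up = critic(s, actor(theta + h, s))
    down = critic(s, actor(theta - h, s))
    return (up - down) / (2.0 * h)


theta, s = 0.5, 1.5
print(dpg_term(theta, s), finite_difference(theta, s))
```

The two numbers agree up to floating-point error, which is the chain rule that an automatic-differentiation framework applies when the actor loss is written as $-Q_\phi(s, \mu_\theta(s))$.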

⚠Why not just add noise and use the stochastic PG theorem?

A natural first thought: if we need exploration, why not make the policy stochastic by adding Gaussian noise to $\mu_\theta(s)$ and applying the standard stochastic policy gradient theorem? This naive approach has two problems. First, the score function estimator tends to have high variance, because it weights $\nabla_\theta \log \pi_\theta$ by sampled returns or advantages and never uses the gradient of the critic; the DPG theorem avoids this by having the gradient flow directly through $\nabla_a Q$. Second, and more fundamentally, a noise-augmented deterministic policy cannot reuse off-policy data from a replay buffer without importance-sampling corrections (because the behavior policy $\beta \neq \pi$). The DPG theorem provides a gradient estimator that works naturally with off-policy data (no score function, no importance weights, no variance correction), which is the key insight that makes DDPG work on real-robot data.

Key difference from stochastic PG:

| | Stochastic PG | Deterministic PG |
|---|---|---|
| Gradient signal | Score function $\nabla_\theta \log \pi_\theta(a \mid s)$ | Critic action-gradient $\nabla_a Q^\mu$ |
| Exploration | Policy stochasticity | External noise added to $\mu_\theta(s)$ |
| Critic type | $V_\phi(s)$ or $Q_\phi(s,a)$ | $Q_\phi(s,a)$, must be differentiable in $a$ |
| Data | On or off-policy | Off-policy (replay buffer) |

The gradient flows through the Q-function with respect to the action, then through the actor with respect to $\theta$. This requires $Q_\phi(s,a)$ to be differentiable in $a$, which is satisfied for neural network critics with continuous action inputs. (This differentiability requirement is why DDPG is the default choice for continuous control in Course 2 robot learning: joint torques are continuous, and the gradient $\nabla_a Q$ with respect to those torque commands is well-defined and efficient to compute.)

DDPG architecture and exploration

DDPG (Lillicrap et al., 2015) combines the DPG theorem with DQN's engineering:

Initialize actor μ_θ, critic Q_φ, target networks μ_{θ⁻}, Q_{φ⁻}
Initialize replay buffer D

For each step t:
  # Action selection with exploration noise
  aₜ = μ_θ(sₜ) + εₜ,   εₜ ~ N(0, σ²)

  # Environment step, store transition
  Execute aₜ, observe rₜ₊₁, sₜ₊₁
  Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D

  # Sample mini-batch
  Sample {(sⱼ, aⱼ, rⱼ, s'ⱼ)} from D

  # Critic update (DQN-style TD target)
  yⱼ = rⱼ + γ Q_{φ⁻}(s'ⱼ, μ_{θ⁻}(s'ⱼ))
  Minimize Σⱼ (yⱼ - Q_φ(sⱼ, aⱼ))²  over φ

  # Actor update (DPG theorem)
  Maximize Σⱼ Q_φ(sⱼ, μ_θ(sⱼ))  over θ
  (gradient: ∇_θ μ_θ(sⱼ) · ∇_a Q_φ(sⱼ, a)|_{a=μ_θ(sⱼ)})

  # Soft target updates
  θ⁻ ← τθ + (1-τ)θ⁻
  φ⁻ ← τφ + (1-τ)φ⁻
  1. Action selection with noise (aₜ = μ_θ(sₜ) + εₜ): Because the policy is deterministic, we must explicitly add noise to the output action to explore the environment.
  2. TD target with target networks (yⱼ = rⱼ + γ Q_{φ⁻}(s'ⱼ, μ_{θ⁻}(s'ⱼ))): The target value uses both target networks. The target actor selects the next action, and the target critic evaluates it.
  3. Critic regression (Minimize Σⱼ (yⱼ - Q_φ(sⱼ, aⱼ))²): Updated to minimize the mean squared Bellman error, just like DQN.
  4. Actor gradient (Maximize Σⱼ Q_φ(sⱼ, μ_θ(sⱼ))): The actor is updated to maximize the critic's output. The gradient flows backwards from the critic to the actor via $\nabla_\theta \mu_\theta(s) \cdot \nabla_a Q_\phi(s,a)$.
  5. Polyak averaging (θ⁻ ← τθ + (1-τ)θ⁻): Soft target updates smoothly blend the current weights into the target weights, providing much more stable training than hard resets.

Exploration in DDPG is entirely external: the policy $\mu_\theta(s)$ is deterministic, so exploration requires adding noise at action selection time. The original paper uses Ornstein-Uhlenbeck noise (temporally correlated, mean-reverting) to simulate momentum-like exploration in physical systems. In practice, uncorrelated Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ performs comparably and is simpler. The behavior policy $\beta(a \mid s) = \delta(a - \mu_\theta(s) - \epsilon)$ differs from the target policy $\pi(a \mid s) = \delta(a - \mu_\theta(s))$, so DDPG is inherently off-policy, which is why the replay buffer's off-policy data is justified.
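The difference between the two noise processes is easy to see in a short simulation. The sketch below is illustrative only: the helper names, the unit time step in the Ornstein-Uhlenbeck update, and the parameter values `theta=0.15` and `sigma=0.2` are assumptions made for the example.

```python
import random


def ou_noise(n_steps, theta=0.15, sigma=0.2, mu=0.0, rng=None):
    """Discretised Ornstein-Uhlenbeck process with unit time step (an assumption)."""
    rng = rng or random.Random(0)
    x, samples = 0.0, []
    for _ in range(n_steps):
        # Mean-reverting pull toward mu plus a fresh Gaussian kick.
        x += theta * (mu - x) + sigma * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def gaussian_noise(n_steps, sigma=0.2, rng=None):
    """Independent Gaussian noise, one draw per step."""
    rng = rng or random.Random(0)
    return [rng.gauss(0.0, sigma) for _ in range(n_steps)]


def lag1_autocorr(xs):
    """Sample correlation between consecutive noise values."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs[:-1], xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


print("OU lag-1 autocorrelation:      ", lag1_autocorr(ou_noise(5000)))
print("Gaussian lag-1 autocorrelation:", lag1_autocorr(gaussian_noise(5000)))
```

The OU samples are strongly correlated from one step to the next, while the Gaussian samples are not; either sequence would be added to $\mu_\theta(s_t)$ at action selection time.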

DDPG limitations

Despite its elegance, DDPG has three pathologies in practice:

  1. Overestimation bias: the critic target $y = r + \gamma Q_{\phi^-}(s', \mu_{\theta^-}(s'))$ evaluates the target policy's action exactly. Like the single-network DQN target, this is susceptible to overestimation: the critic learns to overvalue actions the actor has been encouraged to take, creating a feedback loop.

  2. Coupling between actor and critic errors: the actor gradient $\nabla_a Q_\phi$ depends on the quality of the critic. An inaccurate critic provides misleading gradients that update the actor in the wrong direction, degrading the policy, which in turn produces worse training data for the critic. This coupled instability is severe in early training when both networks are randomly initialized.

  3. Sensitivity to hyperparameters: the learning rates for the actor and critic must be carefully balanced. If the critic learns too slowly, actor gradients are noisy; if the actor updates too quickly relative to the critic, the policy diverges.

TD3 addresses all three.


Twin Delayed Deep Deterministic Policy Gradient (TD3)

TD3 (Fujimoto et al., 2018) introduces three targeted fixes to DDPG's pathologies. Each fix corresponds to a precisely identified failure mode.

Fix 1: Clipped double Q-learning

Maintain two independent critic networks $Q_{\phi_1}$ and $Q_{\phi_2}$. Compute TD (temporal-difference) targets using the minimum of the two:

$$y = r + \gamma \min_{i=1,2} Q_{\phi_i^-}(s',\, \tilde{a}')$$

where $\tilde{a}'$ is the target action (defined under Fix 3; without smoothing it is simply $\mu_{\theta^-}(s')$).

Why the minimum corrects overestimation. The two critics start from different random initializations and follow different gradient trajectories, so their errors are partially decorrelated. The maximum of two noisy estimates is biased upward: by Jensen's inequality, $\mathbb{E}[\max(\hat{Q}_1, \hat{Q}_2)] \geq \max(\mathbb{E}[\hat{Q}_1], \mathbb{E}[\hat{Q}_2])$, so even unbiased critics overestimate on average once a maximum is taken. The minimum is biased downward. Slightly underestimating is far less harmful than overestimating, because underestimation does not produce the positive feedback loop (overestimated values → actor selects overvalued actions → critic reinforces overvaluation).

This is the continuous-action analog of Double DQN from Week 6: decoupling the networks that select and evaluate actions to reduce the max-over-noise bias. (The clipped double Q-learning mechanism is particularly important in Course 2's offline robot learning, where the replay buffer contains demonstrations from suboptimal policies; overestimation would lead the policy to favour actions whose value the data does not support.)
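The direction of the two biases can be reproduced with a small Monte Carlo experiment. The sketch below assumes two critics whose estimates are the true value plus independent zero-mean Gaussian noise; this independence is an idealisation (real critics share data and targets), and the function name and parameter values are hypothetical.

```python
import random


def bias_of_max_and_min(true_q=1.0, noise_std=0.5, n_trials=20000, seed=0):
    """Average error of max(Q1, Q2) and min(Q1, Q2) for two unbiased noisy estimates."""
    rng = random.Random(seed)
    max_sum = min_sum = 0.0
    for _ in range(n_trials):
        q1 = true_q + rng.gauss(0.0, noise_std)   # critic 1: unbiased but noisy
        q2 = true_q + rng.gauss(0.0, noise_std)   # critic 2: independent noise
        max_sum += max(q1, q2)
        min_sum += min(q1, q2)
    return max_sum / n_trials - true_q, min_sum / n_trials - true_q


max_bias, min_bias = bias_of_max_and_min()
print("bias of max:", max_bias)   # positive: taking the max overestimates
print("bias of min:", min_bias)   # negative: taking the min underestimates
```

For two independent Gaussian errors with standard deviation $\sigma$, the expected bias is $\pm\sigma/\sqrt{\pi}$, so the underestimate introduced by the minimum is bounded by the critics' noise level rather than amplified by the actor.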

Fix 2: Delayed policy updates

Update the actor less frequently than the critics — typically once every two critic gradient steps:

For each step t:
  Update Q_φ₁ and Q_φ₂  (every step)
  If t mod d == 0:
    Update μ_θ using ∇_a Q_φ₁    (every d steps, d=2 typically)
    Update all target networks

Why delayed updates stabilize training. The actor gradient $\nabla_\theta \mu_\theta(s) \cdot \nabla_a Q_{\phi_1}(s,a)\big|_{a=\mu_\theta(s)}$ is only meaningful when the critic $Q_{\phi_1}$ is a reliable estimate. In early training, the critic is inaccurate, and updating the actor from a bad critic pushes the policy in the wrong direction. Delaying actor updates gives the critics more time to converge on each policy before the policy is updated. Fewer, more reliable actor updates outperform frequent, noisy ones.

Fix 3: Target policy smoothing

Add clipped noise to the target policy's action when computing the TD target:

$$\tilde{a}' = \mu_{\theta^-}(s') + \text{clip}(\xi,\, -c,\, c),\quad \xi \sim \mathcal{N}(0, \tilde{\sigma}^2)$$

$$y = r + \gamma \min_{i=1,2} Q_{\phi_i^-}(s',\, \tilde{a}')$$

Here $\tilde{\sigma}$ is the smoothing noise scale, which is set separately from the exploration noise scale $\sigma$ used at action selection time.

Why this regularizes the critic. A deterministic policy evaluates the critic at exactly one action per state. If $Q_\phi(s,a)$ develops a sharp peak at $a = \mu_\theta(s)$ (which the actor is incentivized to exploit), the critic will overfit to these exact actions and generalize poorly. Adding noise smooths the target: instead of fitting $Q$ to the value at a single point, the critic must fit the value in a neighborhood of actions around $\mu_{\theta^-}(s')$. The clipping to $[-c, c]$ ensures the smoothed action stays near the target policy and does not drift to irrelevant regions of the action space.
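Fixes 1 and 3 combine into a single target computation, sketched below for a scalar action. Everything here is an illustrative assumption rather than the reference implementation: the toy target actor and critics, the parameter values, and the extra clip of the smoothed action to a valid range `[-max_action, max_action]`. Terminal-state handling is omitted, as in the equations above.

```python
import random


def clip(x, lo, hi):
    return max(lo, min(x, hi))


def td3_target(r, s_next, target_actor, target_q1, target_q2, gamma=0.99,
               smooth_std=0.2, noise_clip=0.5, max_action=1.0, rng=random):
    # Fix 3: clipped Gaussian noise around the target actor's action.
    noise = clip(rng.gauss(0.0, smooth_std), -noise_clip, noise_clip)
    a_next = clip(target_actor(s_next) + noise, -max_action, max_action)
    # Fix 1: bootstrap from the smaller of the two target critics.
    return r + gamma * min(target_q1(s_next, a_next), target_q2(s_next, a_next))


# Toy target networks (hypothetical stand-ins for neural networks).
def target_actor(s):
    return 0.5 * s


def target_q1(s, a):
    return -(a - s) ** 2


def target_q2(s, a):
    return -(a - s) ** 2 - 0.1


rng = random.Random(0)
print(td3_target(r=1.0, s_next=1.0, target_actor=target_actor,
                 target_q1=target_q1, target_q2=target_q2, rng=rng))
```

Both critics are then regressed toward this one shared value of `y`.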

TD3 pseudocode summary

Initialize μ_θ, Q_φ₁, Q_φ₂ and target networks μ_{θ⁻}, Q_{φ₁⁻}, Q_{φ₂⁻}
Initialize replay buffer D

For each step t:
  aₜ = μ_θ(sₜ) + ε,   ε ~ N(0, σ²)
  Execute aₜ, observe rₜ₊₁, sₜ₊₁
  Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D
  Sample mini-batch from D

  # Smoothed target action
  ã' = μ_{θ⁻}(s') + clip(N(0,σ̃²), -c, c)

  # Critic targets (clipped double Q)
  y = r + γ · min(Q_{φ₁⁻}(s', ã'), Q_{φ₂⁻}(s', ã'))

  # Update both critics
  Minimize (y - Q_φᵢ(s, a))² for i = 1, 2

  # Delayed actor update
  If t mod d == 0:
    Maximize Q_φ₁(s, μ_θ(s))  over θ
    Update target networks via Polyak averaging
  1. Target Policy Smoothing (ã' = μ_{θ⁻}(s') + clip(N(0,σ̃²), -c, c)): We add clipped noise to the target action, preventing the critic from overfitting to the deterministic actor output and smoothing the value estimate.
  2. Clipped Double Q-Learning (y = r + γ · min(Q_{φ₁⁻}(s', ã'), Q_{φ₂⁻}(s', ã'))): We evaluate the noisy target action with both target critics and take the minimum, drastically reducing the overestimation bias endemic to Q-learning.
  3. Critic updates (Minimize (y - Q_φᵢ(s, a))² for i = 1, 2): Both critics are updated towards the same conservative target.
  4. Delayed Policy Updates (If t mod d == 0): We only update the actor and target networks once every d critic updates, ensuring the actor optimizes against a stable and accurate critic.
  5. Actor update (Maximize Q_φ₁(s, μ_θ(s))): The actor is updated by maximizing $Q_{\phi_1}$. Only one critic is needed to provide the gradient for the actor.

Soft Actor–Critic (SAC)

SAC (Haarnoja et al., 2018) replaces the deterministic policy with a maximum entropy framework, changing the problem the actor-critic is solving at a fundamental level.

The maximum entropy objective

Standard RL maximizes expected return:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t r(s_t, a_t)\right]$$

Maximum entropy RL augments this with a policy entropy bonus at every timestep:

$$J_{\text{MaxEnt}}(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\right]$$

where $\mathcal{H}(\pi(\cdot \mid s)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s)]$ and $\alpha > 0$ is the temperature parameter controlling the entropy-reward tradeoff. This is not just reward shaping; it changes the definition of optimality.

The soft Bellman equation

The maximum entropy objective admits a modified Bellman equation. Define the soft Q-function $Q_{\text{soft}}^*(s,a)$ as the solution to:

$$Q_{\text{soft}}^*(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[V_{\text{soft}}^*(s')\right]$$

where the soft value function integrates over both reward and entropy:

$$V_{\text{soft}}^*(s) = \mathbb{E}_{a \sim \pi^*}\!\left[Q_{\text{soft}}^*(s,a) - \alpha \log \pi^*(a \mid s)\right] = \alpha \log \int \exp\!\left(\frac{Q_{\text{soft}}^*(s,a)}{\alpha}\right) da$$

The last equality (the log-sum-exp / soft maximum) follows from the closed-form optimal policy:

$$\boxed{\pi^*(a \mid s) = \frac{\exp(Q_{\text{soft}}^*(s,a)/\alpha)}{\int \exp(Q_{\text{soft}}^*(s,a')/\alpha)\, da'}}$$

The optimal maximum entropy policy is a Boltzmann distribution over the Q-function, with temperature $\alpha$. As $\alpha \to 0$, this collapses to a deterministic policy selecting the greedy action. As $\alpha \to \infty$, the policy approaches uniform, with all actions equally likely. At intermediate $\alpha$, the policy concentrates on high-value actions while maintaining uncertainty in lower-value regions.

This is a principled closed-form result, not a heuristic: maximum entropy RL has a well-defined optimal policy, and SAC approximates it.
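The Boltzmann policy and the soft value can be checked numerically if the integral over actions is replaced by a sum. The sketch below assumes a discrete set of three actions with made-up Q-values; the function names are hypothetical. It confirms that $\mathbb{E}_{a \sim \pi^*}[Q - \alpha \log \pi^*]$ equals the log-sum-exp form, and shows the two temperature limits.

```python
import math


def boltzmann_policy(q_values, alpha):
    """pi(a) proportional to exp(Q(a) / alpha), computed stably."""
    m = max(q_values)                                   # subtract the max before exponentiating
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]


def soft_value(q_values, alpha):
    """alpha * log sum_a exp(Q(a) / alpha), the discrete soft maximum."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


q = [1.0, 2.0, 0.5]   # hypothetical soft Q-values for three actions in one state
for alpha in (0.05, 1.0, 100.0):
    pi = boltzmann_policy(q, alpha)
    expected = sum(p * (qi - alpha * math.log(p)) for p, qi in zip(pi, q))
    print(alpha, [round(p, 3) for p in pi], expected, soft_value(q, alpha))
```

At the smallest temperature nearly all probability sits on the greedy action, and at the largest the policy is close to uniform, matching the limits described above.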

The reparameterization trick

PPO estimates the policy gradient via the score function (Week 7):

$$\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[f(a)] = \mathbb{E}_{a \sim \pi_\theta}[f(a)\, \nabla_\theta \log \pi_\theta(a)]$$

This estimator is typically high-variance: it only uses the value of $f$ at sampled actions, scaled by $\nabla_\theta \log \pi_\theta$, and never the gradient of $f$ itself. SAC uses the reparameterization trick: instead of sampling $a \sim \pi_\theta(\cdot \mid s)$ directly, write:

$$a = f_\theta(s, \xi), \quad \xi \sim \mathcal{N}(0, I)$$

where $f_\theta(s,\xi) = \mu_\theta(s) + \sigma_\theta(s) \odot \xi$ (for a diagonal Gaussian, typically with tanh squashing for bounded actions). Then:

$$\nabla_\theta \mathbb{E}_{\xi}\!\left[Q(s, f_\theta(s,\xi)) - \alpha \log \pi_\theta(f_\theta(s,\xi) \mid s)\right] = \mathbb{E}_{\xi}\!\left[\nabla_\theta \left(Q(s, f_\theta(s,\xi)) - \alpha \log \pi_\theta(f_\theta(s,\xi) \mid s)\right)\right]$$

The gradient now flows directly through $Q(s,a)$ with respect to $a = f_\theta(s,\xi)$, then through $f_\theta$ with respect to $\theta$. This is typically lower-variance than the score function estimator because the backpropagation path is deterministic (the noise $\xi$ is fixed when differentiating). The reparameterization trick is what enables SAC to use a stochastic policy while keeping the low-variance, critic-gradient update of the DPG theorem.
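The variance gap can be seen on a one-dimensional example where the true gradient is known. The sketch below assumes $a \sim \mathcal{N}(\mu, \sigma^2)$ and $f(a) = a^2$, so that $\nabla_\mu \mathbb{E}[f(a)] = 2\mu$; the test function and all names are illustrative choices, not part of SAC itself.

```python
import random


def gradient_estimates(mu=1.0, sigma=1.0, n_samples=20000, seed=0):
    """Per-sample estimates of d/d_mu E[a^2] for a ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    score, reparam = [], []
    for _ in range(n_samples):
        xi = rng.gauss(0.0, 1.0)
        a = mu + sigma * xi                              # reparameterised sample
        score.append(a ** 2 * (a - mu) / sigma ** 2)     # f(a) * d/d_mu log N(a; mu, sigma^2)
        reparam.append(2.0 * a)                          # f'(a) * da/d_mu, with da/d_mu = 1
    return score, reparam


def mean_and_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)


score, reparam = gradient_estimates()
print("score function: mean, variance =", mean_and_var(score))
print("reparameterised: mean, variance =", mean_and_var(reparam))
```

Both estimators average to roughly the true gradient $2\mu = 2$, but the per-sample variance of the score-function estimate is several times larger in this setting.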

Automatic entropy tuning

The temperature $\alpha$ controls the exploration-exploitation tradeoff: high $\alpha$ encourages diverse actions; low $\alpha$ focuses the policy. Manually tuning $\alpha$ as a hyperparameter requires task-specific knowledge. SAC automatically adjusts $\alpha$ by formulating it as a constrained optimization: maintain a target entropy $\mathcal{H}_{\text{target}}$ and find the $\alpha$ that achieves it.

Applying dual gradient descent (Lagrangian duality) to the constraint $\mathbb{E}[\mathcal{H}(\pi)] \geq \mathcal{H}_{\text{target}}$:

$$\alpha^* = \arg\min_{\alpha \geq 0}\; \mathbb{E}_{a \sim \pi_\theta}\!\left[-\alpha \log \pi_\theta(a \mid s) - \alpha \mathcal{H}_{\text{target}}\right]$$

In practice, $\alpha$ is updated alongside the actor and critic via a separate gradient step:

$$\alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha\!\left[\alpha\left(-\log \pi_\theta(a \mid s) - \mathcal{H}_{\text{target}}\right)\right]$$

If the current policy entropy is above $\mathcal{H}_{\text{target}}$, this decreases $\alpha$ (shrinking the entropy bonus); if below, it increases $\alpha$ (strengthening the push toward exploration). The system self-regulates. A common choice is $\mathcal{H}_{\text{target}} = -\dim(\mathcal{A})$ (negative of the action dimension), which works well across continuous control tasks without manual tuning.

This is the dual gradient descent scheme that Haarnoja et al. (2018) use to adjust SAC's temperature, and it is analogous to the automatic KL coefficient tuning in some RLHF implementations.
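A single temperature step can be sketched for a one-dimensional Gaussian policy, where the entropy is available in closed form. The sketch below uses the exact entropy in place of the sampled $-\log \pi_\theta(a \mid s)$ and projects $\alpha$ back onto $\alpha \geq 0$; the function names, learning rate, and starting value are assumptions, and many implementations optimise $\log \alpha$ instead of projecting.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def alpha_step(alpha, policy_entropy, target_entropy, lr=0.01):
    # Gradient of alpha * (H(pi) - H_target) with respect to alpha.
    grad = policy_entropy - target_entropy
    # Gradient descent step, projected back onto alpha >= 0.
    return max(0.0, alpha - lr * grad)


target_entropy = -1.0    # -dim(A) for a one-dimensional action space
alpha = 0.2
wide = alpha_step(alpha, gaussian_entropy(1.0), target_entropy)     # entropy above target
narrow = alpha_step(alpha, gaussian_entropy(0.05), target_entropy)  # entropy below target
print(wide, narrow)
```

The wide policy yields a smaller temperature than the starting value and the narrow policy a larger one, which is the self-regulating behaviour described above.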

SAC pseudocode

Initialize actor π_θ (stochastic: outputs μ_θ(s), σ_θ(s))
Initialize critics Q_φ₁, Q_φ₂ and target networks Q_{φ₁⁻}, Q_{φ₂⁻}
Initialize temperature α (a learnable scalar when auto-tuning)
Initialize replay buffer D

For each step t:
  # Stochastic action via reparameterization
  aₜ = μ_θ(sₜ) + σ_θ(sₜ) ⊙ ξ,   ξ ~ N(0, I)
  Execute aₜ, observe rₜ₊₁, sₜ₊₁
  Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D

  Sample mini-batch from D
  Sample â' = μ_θ(s') + σ_θ(s') ⊙ ξ',  ξ' ~ N(0, I)

  # Critic targets (soft Bellman, clipped double Q)
  y = r + γ · (min(Q_{φ₁⁻}(s', â'), Q_{φ₂⁻}(s', â')) - α log π_θ(â'|s'))

  # Update both critics
  Minimize (y - Q_φᵢ(s, a))²  for i = 1, 2

  # Actor update (reparameterization trick)
  â = μ_θ(s) + σ_θ(s) ⊙ ξ
  Maximize  min(Q_φ₁(s, â), Q_φ₂(s, â)) - α log π_θ(â|s)  over θ

  # Temperature update (auto-tuning)
  Minimize  α(-log π_θ(â|s) - H_target)  over α

  # Soft target updates
  φᵢ⁻ ← τφᵢ + (1-τ)φᵢ⁻  for i = 1, 2

Note three connections to prior lectures: the clipped double Q target (Week 6: Double DQN), the soft Bellman entropy term $-\alpha \log \pi_\theta(\hat{a}' \mid s')$ in the critic target (the maximum entropy modification), and the reparameterization gradient in the actor update (as opposed to the score function estimator in PPO).
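The critic target from the pseudocode can be sketched for a scalar action, including the tanh squashing mentioned earlier for bounded actions. With squashing, $\log \pi_\theta$ needs a change-of-variables correction $-\log(1 - \tanh^2(u))$. The function names, the small constant `1e-6` added for numerical safety, and the example numbers are assumptions for illustration.

```python
import math
import random


def squashed_action_and_log_prob(mu, sigma, xi):
    """Reparameterised action a = tanh(mu + sigma * xi) and its log-density."""
    u = mu + sigma * xi                                   # pre-squash Gaussian sample
    a = math.tanh(u)                                      # bounded action in (-1, 1)
    log_gauss = -0.5 * xi ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    log_prob = log_gauss - math.log(1.0 - a ** 2 + 1e-6)  # change-of-variables term
    return a, log_prob


def soft_td_target(r, q1_next, q2_next, log_prob_next, alpha=0.2, gamma=0.99):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    return r + gamma * (min(q1_next, q2_next) - alpha * log_prob_next)


rng = random.Random(0)
a_next, log_prob_next = squashed_action_and_log_prob(mu=0.3, sigma=0.5,
                                                     xi=rng.gauss(0.0, 1.0))
# q1_next and q2_next stand in for the target critics evaluated at (s', a_next).
y = soft_td_target(r=1.0, q1_next=2.0, q2_next=3.0, log_prob_next=log_prob_next)
print(a_next, log_prob_next, y)
```

The next action is sampled from the current policy rather than a target actor, and the entropy term raises the target when the policy is uncertain at $s'$.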


Algorithm synthesis

Expanded comparison table

| Algorithm | Policy | Exploration | Critic | Target network | Sample efficiency |
|---|---|---|---|---|---|
| PPO | Stochastic | Policy entropy | $V_\phi$ | None | Low (on-policy) |
| DDPG | Deterministic | External noise | $Q_\phi$ | Polyak | High |
| TD3 | Deterministic | External noise (targets smoothed) | $Q_{\phi_1}, Q_{\phi_2}$ | Polyak | High |
| SAC | Stochastic (MaxEnt) | Intrinsic entropy | $Q_{\phi_1}, Q_{\phi_2}$ | Polyak | Highest |

PPO vs SAC: the robotics decision

In practice, the choice between PPO and SAC depends on the data collection regime:

Choose PPO when:

  • Fast parallel simulation is available (IsaacLab, MuJoCo with 4096 parallel envs). On-policy inefficiency matters little when transitions can be collected at very high throughput in simulation. PPO with GAE achieves excellent results in legged locomotion (ANYmal, Unitree, Spot) precisely because massive parallelism compensates for sample inefficiency.
  • Reward shaping is complex. On-policy data always reflects the current policy, making reward shaping and curriculum learning easier to reason about. Off-policy replay buffers contain stale data from old policies, which can interact badly with changing reward functions.
  • Implementation simplicity matters. PPO has fewer moving parts than SAC: no automatic entropy tuning, no reparameterization, no double critics. RSL-RL's PPO implementation for legged locomotion is ~500 lines; a full SAC implementation is substantially more complex.

Choose SAC when:

  • Real-robot data collection is the bottleneck. SAC's replay buffer reuses every transition, typically achieving 3–10× better sample efficiency than PPO. For Franka arm manipulation or other real-hardware setups, this difference is decisive.
  • Exploration quality matters more than speed. SAC's entropy-driven exploration produces smoother, more diverse trajectories than $\epsilon$-greedy or additive noise, which is particularly valuable for contact-rich manipulation tasks where the agent must discover narrow-margin contact modes.
  • Hyperparameter robustness is needed. Automatic entropy tuning makes SAC largely self-configuring on the temperature hyperparameter, reducing the tuning burden compared to PPO's entropy coefficient and GAE $\lambda$.

The hybrid regime: many modern systems use PPO for initial training in simulation (fast, stable, handles complex reward shaping), then fine-tune on real hardware with SAC (sample efficient, smooth exploration). This sim-to-real workflow combines the strengths of both.


GenAI context: PPOProximal Policy Optimisation for RLHFReinforcement Learning from Human Feedback and SACSoft Actor-Critic for structured generation

PPOProximal Policy Optimisation in RLHFReinforcement Learning from Human Feedback (recap from Week 7): the language model πθ\pi_\thetaπθ​ is the actor, the reward model rϕr_\phirϕ​ provides the return signal, and the KL penalty against the SFT reference model prevents reward hacking. PPOProximal Policy Optimisation's on-policy sampling is acceptable in this setting because the language model generates its own rollouts (completions), and on-policy data ensures the distribution of completions matches the current policy — important for RLHFReinforcement Learning from Human Feedback where the reward model may be poorly calibrated outside the SFT distribution.

SACSoft Actor-Critic-inspired ideas in GenAI: maximum entropy RLReinforcement Learning has influenced several GenAI directions:

  • Temperature sampling in language model decoding is the inference-time analog of the Boltzmann optimal policy π∗(a∣s)∝exp⁡(Q/α)\pi^*(a|s) \propto \exp(Q/\alpha)π∗(a∣s)∝exp(Q/α) — high temperature encourages diverse completions; low temperature is greedy.
  • Energy-based models for text generation model the distribution over sequences as p(x)∝exp⁡(r(x)/α)p(x) \propto \exp(r(x)/\alpha)p(x)∝exp(r(x)/α), the same Boltzmann form as the SACSoft Actor-Critic optimal policy.
  • DPO and GRPO (next lecture) both drop the learned critic. DPO starts from the KL-regularized reward-maximization objective, whose optimal policy has the same Boltzmann form relative to the reference model, and turns RLHF into a supervised loss directly on preference data, removing the need for PPO. GRPO keeps a PPO-style clipped policy update but replaces the critic with a group-relative baseline computed from multiple sampled completions.
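
The link between temperature and the Boltzmann policy can be made concrete with a small sketch over a discrete action set. The Q-values and the temperatures below are hypothetical; the point is only that the same values give a near-greedy distribution at low $\alpha$ and a near-uniform one at high $\alpha$.

```python
import math


def boltzmann_policy(q_values, alpha):
    """Return pi(a) proportional to exp(Q(a) / alpha) over a discrete action set."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


q = [1.0, 2.0, 4.0]  # hypothetical Q-values for three actions
for alpha in (0.05, 1.0, 100.0):
    pi = boltzmann_policy(q, alpha)
    print(alpha, [round(p, 3) for p in pi], round(entropy(pi), 3))
```

In language model decoding the logits play the role of $Q$ and the sampling temperature plays the role of $\alpha$.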

Key takeaways

The lecture develops the off-policy actor-critic family as a principled progression. DDPG combines the deterministic policy gradient theorem (the gradient flows through $\nabla_a Q$ rather than $\nabla_\theta \log \pi$) with DQN's replay buffer and target networks, but suffers from overestimation, coupled instability, and sensitivity to hyperparameters. TD3 applies three targeted fixes: clipped double Q-learning reduces overestimation via the minimum of two independent critics (the continuous-action counterpart of Double DQN); delayed policy updates decouple critic convergence from actor updates; target policy smoothing prevents the critic from overfitting to sharp peaks at deterministic actions. SAC replaces the deterministic policy with the maximum entropy framework, whose optimal policy is provably a Boltzmann distribution over the soft Q-function. The reparameterization trick provides lower-variance gradients than score function estimation. Automatic entropy tuning via dual gradient descent replaces the hand-tuned temperature with a target entropy.

The PPO vs SAC choice maps cleanly onto deployment context: PPO with fast parallel simulation for legged locomotion, SAC for real-robot data-scarce settings, with hybrid workflows bridging the two.


✦Looking Forward

Next lecture: GRPO and Direct Preference Optimization (DPO), methods that eliminate the critic entirely, reformulate RLHF as preference optimization, and scale more naturally to very large language models. Understanding PPO and SAC deeply (as developed here) is essential for appreciating what GRPO and DPO give up and what they gain.


Conceptual questions

  1. The deterministic policy gradient theorem gives $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \mu_\theta(s) \cdot \nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}]$. Compare this with the stochastic policy gradient theorem from Week 7. In the stochastic case, where does the gradient come from? In the deterministic case, what must be true about the critic $Q_\phi(s,a)$ for this gradient to be well-defined? Why does DDPG use a Q-critic rather than a V-critic?

  2. TD3 uses the minimum of two Q-estimates as the critic target: $y = r + \gamma \min_{i=1,2} Q_{\phi_i^-}(s', \tilde{a}')$. Using the minimum rather than the mean introduces a downward bias. Explain why downward bias is less harmful than upward bias in the actor-critic setting by tracing the feedback loop that upward bias (overestimation) creates between the actor and critic.

  3. SAC's optimal policy is $\pi^*(a|s) \propto \exp(Q_{\text{soft}}^*(s,a)/\alpha)$. Show what happens to this policy as $\alpha \to 0$ and as $\alpha \to \infty$. Then explain: if $\alpha$ is too large during training, what behavior do you expect from the agent? If $\alpha$ is too small, what problem arises? How does automatic entropy tuning resolve this without manual intervention?

  4. Compare exploration mechanisms across the three off-policy algorithms: DDPG adds external Gaussian noise to the deterministic action; TD3 adds smoothed noise to the target policy (not the behavior policy); SAC's stochastic policy provides intrinsic exploration. For a manipulation task requiring precise contact (narrow-margin task), which exploration mechanism is most appropriate and why? What happens to DDPG-style exploration when the task reward requires very precise actions?

  5. You are training a quadruped locomotion policy. You have two setups: (A) IsaacLab with 4096 parallel simulation environments, each running at 100 Hz; (B) a single physical Unitree Go1 robot collecting ~1 Hz data due to safety stops. For each setup, argue which algorithm, PPO or SAC, is more appropriate, citing sample efficiency, exploration quality, and implementation complexity. For setup B, what modifications to the standard SAC algorithm would you make to handle the safety constraints of real hardware?
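
To get a feel for the data-rate gap in question 5, a quick back-of-the-envelope calculation helps. The sketch below assumes the simulator steps in real time at the stated 100 Hz (in practice it may run faster or slower) and uses a hypothetical budget of one million transitions.

```python
num_envs = 4096      # setup A: parallel simulated environments
sim_hz = 100         # setup A: steps per second per environment
robot_hz = 1         # setup B: roughly one transition per second
budget = 1_000_000   # hypothetical transition budget for the comparison

sim_steps_per_second = num_envs * sim_hz
sim_seconds = budget / sim_steps_per_second
robot_days = budget / robot_hz / 86_400  # 86,400 seconds per day

print(f"A: {sim_steps_per_second} steps/s, {sim_seconds:.2f} s for {budget} steps")
print(f"B: {robot_days:.1f} days for {budget} steps")
```

Under these assumptions the same budget costs a few seconds in setup A and more than eleven days of continuous operation in setup B, which is the gap that makes replay-based reuse of data attractive on hardware.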


✦Solutions
  1. DPG vs SPG. The stochastic gradient comes from the score function $\nabla_\theta\log\pi(a|s)\,Q$, integrating over the action distribution; the deterministic gradient flows through the critic by the chain rule, $\nabla_\theta\mu_\theta(s)\,\nabla_a Q^\mu(s,a)$. For DPG to be well-defined, $Q_\phi(s,a)$ must be differentiable in $a$. DDPG uses a Q-critic, not a V-critic, because the actor update needs $\nabla_a Q(s,a)$, the action-gradient of value, which $V(s)$ cannot provide.
  2. TD3 minimum. Overestimation is self-amplifying: the actor maximizes $Q$, steering toward actions the critic overrates, which the critic then bootstraps on, inflating targets further (a positive feedback loop). Underestimation is benign because the actor simply avoids under-rated actions, so the loop is self-limiting. Taking the min of two critics deliberately biases downward to break the overestimation loop.
  3. SAC temperature. As $\alpha\to0$, $\pi^*\propto\exp(Q/\alpha)$ becomes the greedy $\arg\max$ (pure exploitation); as $\alpha\to\infty$ it becomes uniform (pure exploration, ignoring $Q$). Too-large $\alpha$ makes the agent act randomly and never exploit reward; too-small $\alpha$ under-explores and converges prematurely. Automatic entropy tuning adjusts $\alpha$ by dual gradient descent toward a target entropy, balancing the two without a manual schedule.
  4. Exploration mechanisms. DDPG adds external Gaussian action noise; TD3 adds smoothing noise to the target action (critic regularization, not behavior exploration); SAC explores intrinsically through its stochastic policy. For precise, narrow-margin contact, SAC is best because its learned, state-dependent stochasticity can shrink where precision is needed — whereas DDPG's fixed external noise perturbs even precise actions, knocking the gripper off the narrow target and degrading the very precision the task demands.
  5. Quadruped setups. (A) 4096 parallel envs yield abundant cheap on-policy data, favoring PPO — on-policy, stable, scales with parallelism, simple. (B) a single robot at ~1 Hz is severely sample-limited, favoring SAC — off-policy and replay-efficient. For (B), add safety: a CBF/shield on actions, Lagrangian constrained RL for the safety stops, a lower entropy target near joint limits, action smoothing, and offline pretraining to avoid hardware-damaging exploration.
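
The bias argument in solution 2 can be checked with a toy Monte Carlo experiment. This is a sketch under strong assumptions, not the lecture's derivation: the true Q-value is zero for every action, each critic's estimate is the truth plus independent unit Gaussian noise, and the "actor" simply picks the action that critic 1 rates highest.

```python
import random


def estimate_bias(num_actions=10, noise_std=1.0, trials=20_000, seed=0):
    """True Q is 0 for every action, so any nonzero mean target is pure bias."""
    rng = random.Random(seed)
    single_total, clipped_total = 0.0, 0.0
    for _ in range(trials):
        q1 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        q2 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # The actor picks the action that critic 1 rates highest.
        best = max(range(num_actions), key=lambda a: q1[a])
        single_total += q1[best]                  # single-critic target
        clipped_total += min(q1[best], q2[best])  # clipped double-Q target
    return single_total / trials, clipped_total / trials


single_bias, clipped_bias = estimate_bias()
print(f"single critic bias: {single_bias:+.3f}")
print(f"min of two critics: {clipped_bias:+.3f}")
```

Maximizing over noisy estimates selects the noise as well as the value, so the single-critic target is biased upward, while the minimum over an independent second critic removes most of that bias at the cost of a small downward one.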

Implementation exercises

Exercise 1: DDPG core loop

Implement the DDPG training loop in Python using a deep learning framework of your choice. Start from the pseudocode above. Key implementation details:

  • Critic architecture: The critic takes both state and action as input and outputs a scalar Q-value. Concatenate action after a hidden layer (not at the input) for numerical stability.
  • Exploration noise: Start with simple Gaussian noise $\epsilon \sim \mathcal{N}(0, 0.1)$; it performs comparably to Ornstein-Uhlenbeck and is simpler to debug.
  • Polyak averaging: Use $\tau = 0.005$ for target network updates. Verify that target_actor.parameters() track actor.parameters() with a small lag.
  • Training ratio: One gradient step per environment step. Use a replay buffer of size $10^6$ and batch size 64.
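
The Polyak averaging bullet above can be sketched without any framework. In the snippet below, plain Python lists stand in for network weights, and the name `polyak_update` is a hypothetical helper; in a deep learning framework the same rule would be applied tensor by tensor.

```python
def polyak_update(target_params, online_params, tau=0.005):
    """In-place soft update: target <- (1 - tau) * target + tau * online."""
    for i, (t, o) in enumerate(zip(target_params, online_params)):
        target_params[i] = (1.0 - tau) * t + tau * o


# Hypothetical flattened parameter vectors standing in for network weights.
online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]

for step in range(1, 1001):
    polyak_update(target, online)
    if step in (1, 100, 1000):
        gap = max(abs(t - o) for t, o in zip(target, online))
        print(step, round(gap, 4))
```

With the online parameters held fixed, the gap shrinks by a factor of $(1 - \tau)$ per update, which is the "small lag" the exercise asks you to verify.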

Test on Pendulum-v1: The agent should solve it (average return > −200 over 100 episodes) within 100 episodes.

Exercise 2: Upgrade DDPG to TD3

Starting from your DDPG implementation, add the three TD3 fixes one at a time, testing after each:

  1. Add a second critic network ($Q_{\phi_2}$). Use $\min(Q_1, Q_2)$ for target computation. Track the Q-value gap $\max(Q_1, Q_2) - \min(Q_1, Q_2)$ during training; it should shrink as both critics converge.
  2. Delay the actor update to every 2 critic steps ($d = 2$). A deterministic policy has no entropy to track, so plot the actor objective $Q_{\phi_1}(s, \mu_\theta(s))$ over time instead; a stable TD3 policy should show less oscillation than DDPG.
  3. Add target policy smoothing with $\sigma = 0.2$, $c = 0.5$. Compare critic gradient variance with and without smoothing.
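
Fixes 1 and 3 both act on the critic target, so they can be sketched in one function. The version below is a minimal scalar-action sketch: the stand-in target networks, the `done` mask, the discount `gamma=0.99`, and the action bound `max_action` are assumptions for illustration, while `sigma` and `c` follow the values in the exercise.

```python
import random


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def td3_target(r, s_next, done, target_actor, target_q1, target_q2,
               gamma=0.99, sigma=0.2, c=0.5, max_action=1.0, rng=random):
    """Clipped double-Q target with target policy smoothing (scalar action)."""
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = clip(rng.gauss(0.0, sigma), -c, c)
    a_next = clip(target_actor(s_next) + noise, -max_action, max_action)
    # Clipped double Q-learning: bootstrap from the smaller of the two critics.
    q_min = min(target_q1(s_next, a_next), target_q2(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min


# Hypothetical stand-ins for the target actor and the two target critics.
def target_actor(s):
    return 0.5 * s


def target_q1(s, a):
    return s + a


def target_q2(s, a):
    return s + a + 0.3  # critic 2 is more optimistic in this toy setup


y = td3_target(r=1.0, s_next=0.4, done=0.0, target_actor=target_actor,
               target_q1=target_q1, target_q2=target_q2, rng=random.Random(0))
print(round(y, 3))
```

Fix 2 (delayed policy updates) lives in the training loop rather than in the target: the actor and the target networks are updated only on every $d$-th critic step.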

Test on HalfCheetah-v4 (MuJoCo): TD3 should outperform DDPG by approximately 30% in final return.

Exercise 3: SAC with automatic entropy tuning

Implement SAC by extending your TD3 codebase. Key additions:

  • Stochastic actor: Outputs $\mu_\theta(s)$ and $\log \sigma_\theta(s)$ for a diagonal Gaussian. Apply tanh squashing for bounded actions (add log-probability correction for the Jacobian of tanh).
  • Reparameterization trick: Sample $a = \tanh(\mu_\theta(s) + \sigma_\theta(s) \odot \xi)$ with $\xi \sim \mathcal{N}(0, I)$. The gradient flows through $Q(s, a)$ via autodiff, not through a score function estimator.
  • Soft Bellman target: $y = r + \gamma(\min_i Q_{\phi_i^-}(s', a') - \alpha \log \pi_\theta(a'|s'))$.
  • Automatic entropy tuning: Parameterize and optimize $\log \alpha$ (not $\alpha$) to ensure positivity. Target entropy $\mathcal{H}_{\text{target}} = -\dim(\mathcal{A})$. Update $\alpha$ with a separate optimizer (learning rate $3 \cdot 10^{-4}$).
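
The stochastic actor and reparameterization bullets above combine into a single sampling routine. The sketch below works on plain Python floats to keep the arithmetic visible; the function name, the example inputs, and the `1e-6` guard inside the logarithm are assumptions, and a real implementation would do the same computation on framework tensors so that gradients reach $\mu_\theta$ and $\log \sigma_\theta$.

```python
import math
import random


def sample_squashed_gaussian(mu, log_std, rng=random):
    """Reparameterized sample a = tanh(mu + sigma * xi) and its log-probability."""
    actions, log_prob = [], 0.0
    for m, ls in zip(mu, log_std):
        std = math.exp(ls)
        xi = rng.gauss(0.0, 1.0)  # noise drawn independently of the parameters
        u = m + std * xi          # pre-squash Gaussian sample
        a = math.tanh(u)
        # Diagonal Gaussian log-density of u, written in terms of xi.
        log_prob += -0.5 * xi * xi - ls - 0.5 * math.log(2.0 * math.pi)
        # Change of variables for tanh: subtract log(1 - tanh(u)^2).
        log_prob -= math.log(1.0 - a * a + 1e-6)
        actions.append(a)
    return actions, log_prob


# Hypothetical actor outputs for a 2-dimensional action.
action, logp = sample_squashed_gaussian(mu=[0.3, -1.2], log_std=[-0.5, 0.1],
                                        rng=random.Random(0))
print([round(a, 3) for a in action], round(logp, 3))
```

Because the randomness enters only through `xi`, the action is a deterministic, differentiable function of the actor outputs, which is what lets the gradient flow through $Q(s, a)$.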

Test on Ant-v4 (MuJoCo): SAC should achieve approximately 6000 return in 1M steps, demonstrating 3–5× better sample efficiency than TD3.
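
The automatic entropy tuning step from the list above can be sketched as one gradient-descent update on $\log \alpha$. The snippet assumes the commonly used temperature loss $J(\alpha) = \mathbb{E}[-\alpha(\log \pi_\theta(a|s) + \mathcal{H}_{\text{target}})]$ and plain gradient descent instead of a separate adaptive optimizer; the batch of log-probabilities and the initial value are made up.

```python
import math


def alpha_step(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on J(alpha) = mean(-alpha * (log_pi + H_target))."""
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    # dJ/d(log_alpha), using d(alpha)/d(log_alpha) = alpha.
    grad = -alpha * mean_term
    return log_alpha - lr * grad


action_dim = 2
target_entropy = -float(action_dim)  # H_target = -dim(A), as in the exercise
log_alpha = 0.0                      # alpha starts at 1.0 (an assumption)

# Hypothetical batch: the policy entropy (about 3.0) is above the target of
# -2.0, so the update should lower alpha.
log_probs = [-3.1, -2.9, -3.0]
new_log_alpha = alpha_step(log_alpha, log_probs, target_entropy)
print(math.exp(log_alpha), math.exp(new_log_alpha))
```

When the policy entropy falls below the target, the sign of the gradient flips and the same step raises $\alpha$, pushing the actor back toward more exploration.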


Extension prompts

  1. Continuous-action Rainbow: Combine TD3's clipped double Q with SAC's maximum-entropy framework and automatic temperature tuning. How does this hybrid perform on Humanoid-v4 compared to either algorithm alone? What does the entropy curve reveal about the benefit of stochastic exploration in high-dimensional action spaces?

  2. Offline RL via CQL: Collect 1M transitions from a partially trained SAC agent. Train a new policy from this fixed dataset using Conservative Q-Learning (CQL), which penalizes overestimated Q-values for out-of-distribution actions. Compare with standard SAC trained on the same offline data. Which shows less extrapolation error?

  3. Sim-to-real gap: Train a policy in MuJoCo, deploy it in a different physics engine (e.g., PyBullet or Isaac Sim), and measure the performance gap. Explore domain randomization (varying mass, friction, joint damping ranges) as a mitigation — how much randomization is needed before the policy transfers without fine-tuning?


Further reading

  • Silver, D., et al. (2014). Deterministic Policy Gradient Algorithms. ICML. (DPG Theorem).
  • Lillicrap, T. P., et al. (2015). Continuous control with deep reinforcement learning. ICLR. (DDPG).
  • Fujimoto, S., et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML. (TD3).
  • Haarnoja, T., et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML. (SAC).
  • Haarnoja, T., et al. (2018b). Soft Actor-Critic Algorithms and Applications. arXiv. (Automatic entropy tuning and practical SAC implementation details).
  • Duan, Y., et al. (2016). Benchmarking Deep Reinforcement Learning for Continuous Control. ICML. (Standardized evaluation benchmarks for DDPG, TRPO, and other continuous-control algorithms).