Purpose of this lecture
In Week 7, we derived the policy gradient theorem, developed actor-critic methods with GAE, and fully derived PPO (Proximal Policy Optimisation) and TRPO (Trust Region Policy Optimisation). This lecture completes the modern deep RL (Reinforcement Learning) toolkit by studying the off-policy actor-critic family, namely DDPG (Deep Deterministic Policy Gradient), TD3, and SAC (Soft Actor-Critic), which trade on-policy stability for sample efficiency and dominate physical robotics applications.
The lecture is organized in three tiers. First, a brief recap of TRPO/PPO establishes the on-policy baseline. Second, the off-policy family is developed as a progression: DDPG introduces the deterministic policy gradient theorem; TD3 fixes DDPG's three core pathologies; SAC replaces deterministic policies with the maximum entropy framework and adds automatic entropy tuning. Third, a synthesis section maps the algorithm landscape onto practical deployment contexts, particularly the PPO vs SAC choice in legged locomotion vs real-robot manipulation.
On-policy recap: TRPO and PPO
TRPO and PPO were derived in full in Week 7. The key results:
TRPO constrains each update to a trust region defined by KL divergence:
$$\max_\theta \; \mathbb{E}_t\!\left[ r_t(\theta)\, \hat{A}_t \right] \quad \text{subject to} \quad \mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta, \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$
providing monotone improvement guarantees at the cost of second-order optimization.
PPO approximates the trust region with a clipped surrogate:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\!\left( r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_t \right) \right]$$
The $\min$ ensures updates that push $r_t(\theta)$ far from 1 in the direction of improvement are capped; degrading updates are never capped. The full PPO loss adds a critic regression term and entropy bonus (see Week 7 for the complete derivation and the RLHF, Reinforcement Learning from Human Feedback, mapping).
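The asymmetry of the clipped surrogate is easiest to see on single samples. The sketch below evaluates the standard per-sample objective min(r·A, clip(r, 1−ε, 1+ε)·A) in plain Python; the function name `clipped_surrogate` and the default `eps=0.2` are illustrative assumptions, not values taken from Week 7.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO-style clipped objective (a quantity to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Pessimistic bound: keep the smaller of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # (ratio, advantage) pairs covering the four sign combinations.
    for ratio, adv in [(1.5, 2.0), (0.5, 2.0), (1.5, -2.0), (0.5, -2.0)]:
        print(ratio, adv, clipped_surrogate(ratio, adv))
```

With a positive advantage, a ratio of 1.5 earns only the capped value (1 + ε)·A, while a ratio of 0.5 is scored at its full, unclipped value. With a negative advantage the roles swap: lowering the probability of a bad action is capped, raising it is penalized in full.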
Why on-policy methods are sample-inefficient: both TRPO and PPO discard data after each policy update. Every transition in the rollout batch is used for a few gradient steps, then thrown away. For physical robots, where each transition costs real hardware time, this is unacceptable. Off-policy methods, which reuse past transitions via a replay buffer, are essential for data-scarce settings. (This is the core motivation for SAC and TD3 in Course 2: on a Unitree Go1 or manipulation robot, each second of real interaction is expensive; SAC can often learn a strong policy from roughly 100K transitions, while PPO would typically require millions.)
The off-policy actor-critic family: overview
The three off-policy algorithms in this lecture form a clear progression:
| Algorithm | Policy type | Exploration | Key problem solved |
|---|---|---|---|
| DDPG | Deterministic | External noise | Continuous actions with DPG theorem |
| TD3 | Deterministic | Smoothed noise | Overestimation + instability in DDPG |
| SAC | Stochastic (MaxEnt) | Intrinsic entropy | Exploration + robustness + auto-tuning |

Each algorithm inherits from its predecessor and adds targeted fixes, exactly as the DQN (Deep Q-Network) → Double DQN → Dueling → Rainbow progression did in Week 6.
All three share the DQN engineering foundation from Week 6: a replay buffer $\mathcal{D}$, target networks updated via Polyak averaging $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$, and separate actor and critic networks.
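As a rough illustration of the two shared components, here is a minimal sketch of a fixed-capacity replay buffer and a Polyak (soft) target update in plain Python. The class and function names, and the representation of network weights as flat lists of floats, are assumptions made for this example only; a real implementation would operate on framework tensors.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest entries are evicted first

    def add(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def polyak_update(target, online, tau):
    """Return new target weights: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]
```

A small `tau` makes the target weights trail the online weights slowly, which is what keeps the bootstrapped TD targets from chasing a rapidly moving critic.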
Deterministic Policy Gradient (DDPG)
From stochastic to deterministic policies
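Recall the stochastic policy gradient theorem from Week 7:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right]$$
This requires the policy to be stochastic: the score function $\nabla_\theta \log \pi_\theta(a \mid s)$ is well-defined only when $\pi_\theta(a \mid s) > 0$ for all $a$. For continuous action spaces, we may want to learn a deterministic policy $a = \mu_\theta(s)$ that directly outputs an action. The stochastic policy gradient theorem does not apply to such a policy.
The deterministic policy gradient theorem
DPG theorem (Silver et al., 2014): For a deterministic policy $\mu_\theta : \mathcal{S} \to \mathcal{A}$, the gradient of the expected return is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a) \big|_{a = \mu_\theta(s)} \right]$$
Derivation. The expected return under the deterministic policy:
$$J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ Q^\mu\!\left(s, \mu_\theta(s)\right) \right]$$
where $\rho^\mu$ is the state visitation distribution induced by $\mu_\theta$. Differentiating with respect to $\theta$ and applying the chain rule:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a) \big|_{a = \mu_\theta(s)} \right]$$
The environment dynamics drop out (as in the stochastic case) because $\rho^\mu$ is treated as fixed for the purposes of differentiating through $\mu_\theta$.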
Key difference from stochastic PG:
| | Stochastic PG | Deterministic PG |
|---|---|---|
| Gradient signal | Score function $\nabla_\theta \log \pi_\theta(a \mid s)$ | Critic action-gradient $\nabla_a Q(s, a)$ |
| Exploration | Policy stochasticity | External noise added to $\mu_\theta(s)$ |
| Critic type | $V(s)$ or $Q(s, a)$ | $Q(s, a)$, which must be differentiable in $a$ |
| Data | On- or off-policy | Off-policy (replay buffer) |

The gradient flows through the Q-function with respect to the action, then through the actor with respect to $\theta$. This requires $Q$ to be differentiable in $a$, which is satisfied for neural network critics with continuous action inputs. (This differentiability requirement is easy to meet in continuous control, which is why DDPG is the default starting point in Course 2 robot learning: joint torques are continuous, so the gradient of $Q$ with respect to the commanded action is well-defined and cheap to compute.)
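The chain rule can be checked numerically on a toy problem. The sketch below assumes a one-dimensional linear actor μ_θ(s) = θ·s and a hand-written quadratic critic Q(s, a) = −(a − 2s)², both invented for this illustration, and compares the deterministic policy gradient ∇_θ μ · ∇_a Q against a finite-difference estimate over a fixed batch of states (so the state distribution is held fixed, as in the derivation).

```python
def actor(theta, s):
    return theta * s                 # mu_theta(s), toy linear actor


def critic(s, a):
    return -(a - 2.0 * s) ** 2       # toy critic, maximized at a = 2s


def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)      # analytic action-gradient of the critic


def dpg_gradient(theta, states):
    """Mean over states of d(mu)/d(theta) * dQ/da evaluated at a = mu_theta(s)."""
    total = 0.0
    for s in states:
        a = actor(theta, s)
        total += s * dq_da(s, a)     # d(mu)/d(theta) = s for the linear actor
    return total / len(states)


def finite_difference_gradient(theta, states, h=1e-5):
    def objective(th):
        return sum(critic(s, actor(th, s)) for s in states) / len(states)
    return (objective(theta + h) - objective(theta - h)) / (2.0 * h)


states = [-1.0, 0.5, 2.0]
theta = 0.5
analytic = dpg_gradient(theta, states)
numeric = finite_difference_gradient(theta, states)

# Gradient ascent on theta using only the critic's action-gradient.
for _ in range(200):
    theta += 0.1 * dpg_gradient(theta, states)
```

The two gradient estimates agree, and ascent drives θ toward 2, the value at which the actor outputs the critic's preferred action in every state. No log-probability appears anywhere, which is the practical difference from the score-function estimator.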
DDPG architecture and exploration
DDPG (Lillicrap et al., 2015) combines the DPG theorem with DQN's engineering:
    Initialize actor μ_θ, critic Q_φ, target networks μ_{θ⁻}, Q_{φ⁻}
    Initialize replay buffer D
    For each step t:
        # Action selection with exploration noise
        aₜ = μ_θ(sₜ) + εₜ,  εₜ ~ N(0, σ²)
        # Environment step, store transition
        Execute aₜ, observe rₜ₊₁, sₜ₊₁
        Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D
        # Sample mini-batch
        Sample {(sⱼ, aⱼ, rⱼ, s'ⱼ)} from D
        # Critic update (DQN-style TD target)
        yⱼ = rⱼ + γ Q_{φ⁻}(s'ⱼ, μ_{θ⁻}(s'ⱼ))
        Minimize Σⱼ (yⱼ - Q_φ(sⱼ, aⱼ))² over φ
        # Actor update (DPG theorem)
        Maximize Σⱼ Q_φ(sⱼ, μ_θ(sⱼ)) over θ
            (gradient: ∇_θ μ_θ(sⱼ) · ∇_a Q_φ(sⱼ, a)|_{a=μ_θ(sⱼ)})
        # Soft target updates
        θ⁻ ← τθ + (1-τ)θ⁻
        φ⁻ ← τφ + (1-τ)φ⁻
- Action selection with noise (`aₜ = μ_θ(sₜ) + εₜ`): Because the policy is deterministic, we must explicitly add noise to the output action to explore the environment.
- TD target with target networks (`yⱼ = rⱼ + γ Q_{φ⁻}(s'ⱼ, μ_{θ⁻}(s'ⱼ))`): The target value uses both target networks. The target actor selects the next action, and the target critic evaluates it.
- Critic regression (`Minimize Σⱼ (yⱼ - Q_φ(sⱼ, aⱼ))²`): The critic is updated to minimize the mean squared Bellman error, just like DQN.
- Actor gradient (`Maximize Σⱼ Q_φ(sⱼ, μ_θ(sⱼ))`): The actor is updated to maximize the critic's output. The gradient flows backwards from the critic to the actor via $\nabla_a Q_\phi$.
- Polyak averaging (`θ⁻ ← τθ + (1-τ)θ⁻`): Soft target updates smoothly blend the current weights into the target weights, providing much more stable training than hard resets.
Exploration in DDPG is entirely external: the policy is deterministic, so exploration requires adding noise at action selection time. The original paper uses Ornstein-Uhlenbeck noise (temporally correlated, mean-reverting) to simulate momentum-like exploration in physical systems. In practice, uncorrelated Gaussian noise performs comparably and is simpler. Because the noisy behavior policy differs from the deterministic target policy, DDPG is inherently off-policy, which is what justifies training from replay-buffer data.
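To make the difference between the two noise types concrete, the sketch below implements a discretized Ornstein-Uhlenbeck process and compares its lag-1 autocorrelation with that of independent Gaussian noise. The parameter names (`rate`, `sigma`, `dt`), their default values, and the helper `lag1_autocorrelation` are illustrative assumptions, not settings prescribed by the lecture.

```python
import math
import random


class OrnsteinUhlenbeckNoise:
    """Discretized OU process: x <- x - rate * x * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, rate=0.15, sigma=0.2, dt=1.0, seed=0):
        self.rate, self.sigma, self.dt = rate, sigma, dt
        self.rng = random.Random(seed)
        self.x = 0.0

    def sample(self):
        drift = -self.rate * self.x * self.dt          # pull back toward zero
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def lag1_autocorrelation(xs):
    """Sample correlation between consecutive elements of xs."""
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs[:-1], xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den


ou = OrnsteinUhlenbeckNoise()
ou_samples = [ou.sample() for _ in range(5000)]

gauss_rng = random.Random(1)
gauss_samples = [gauss_rng.gauss(0.0, 0.2) for _ in range(5000)]

ou_corr = lag1_autocorrelation(ou_samples)
gauss_corr = lag1_autocorrelation(gauss_samples)
```

Consecutive OU samples are strongly correlated, so the perturbation pushes the action in the same direction for several steps, whereas the Gaussian samples are essentially uncorrelated from one step to the next.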
DDPG limitations
Despite its elegance, DDPG has three pathologies in practice:
- Overestimation bias: the critic target evaluates the target policy's action exactly. Like the single-network DQN target, this is susceptible to overestimation: the critic learns to overvalue actions the actor has been encouraged to take, creating a feedback loop.
- Coupling between actor and critic errors: the actor gradient depends on the quality of the critic. An inaccurate critic provides misleading gradients that update the actor in the wrong direction, degrading the policy, which in turn produces worse training data for the critic. This coupled instability is severe in early training when both networks are randomly initialized.
- Sensitivity to hyperparameters: the learning rates for the actor and critic must be carefully balanced. If the critic learns too slowly, actor gradients are noisy; if the actor updates too quickly relative to the critic, the policy diverges.
TD3 addresses all three.
Twin Delayed Deep Deterministic Policy Gradient (TD3)
TD3 (Fujimoto et al., 2018) introduces three targeted fixes to DDPG's pathologies. Each fix corresponds to a precisely identified failure mode.
Fix 1: Clipped double Q-learning
Maintain two independent critic networks $Q_{\phi_1}$ and $Q_{\phi_2}$. Compute TD (temporal difference) targets using the minimum of the two:
$$y = r + \gamma \min_{i = 1, 2} Q_{\phi_i^-}\!\left(s', \mu_{\theta^-}(s')\right)$$
Why the minimum corrects overestimation. Each critic has independent random initialization and independent gradient noise, so their errors are partially decorrelated. The maximum of two estimates is biased upward: by Jensen's inequality, $\mathbb{E}[\max(Q_1, Q_2)] \ge \max(\mathbb{E}[Q_1], \mathbb{E}[Q_2])$, so with zero-mean errors the expected maximum is at least the true value. By the same argument the minimum is biased downward. Slightly underestimating is far less harmful than overestimating, because underestimation does not produce the positive feedback loop (overestimated values → actor selects overvalued actions → critic reinforces overvaluation).
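A short Monte Carlo experiment illustrates the direction of the two biases. The sketch assumes two critics whose estimates equal a common true value plus independent zero-mean Gaussian errors; the function name and all numerical settings are illustrative.

```python
import random


def bias_of_max_and_min(true_q=1.0, noise_std=0.5, n_trials=20000, seed=0):
    """Estimate E[max(Q1, Q2)] - Q and E[min(Q1, Q2)] - Q by simulation."""
    rng = random.Random(seed)
    max_sum = 0.0
    min_sum = 0.0
    for _ in range(n_trials):
        q1 = true_q + rng.gauss(0.0, noise_std)   # critic 1: unbiased but noisy
        q2 = true_q + rng.gauss(0.0, noise_std)   # critic 2: independent noise
        max_sum += max(q1, q2)
        min_sum += min(q1, q2)
    return max_sum / n_trials - true_q, min_sum / n_trials - true_q


max_bias, min_bias = bias_of_max_and_min()
print("bias of max:", max_bias)
print("bias of min:", min_bias)
```

Under these assumptions the maximum overshoots the true value and the minimum undershoots it by roughly the same amount, even though each critic is individually unbiased. TD3 deliberately accepts the undershoot.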
This is the continuous-action analog of Double DQN from Week 6: decoupling the networks that select and evaluate actions to reduce the max-over-noise bias. (The clipped double Q-learning mechanism is particularly important in Course 2's offline robot learning, where the replay buffer contains demonstrations from suboptimal policies; overestimation would lead the policy to favor actions whose high values are artifacts of critic error rather than of anything supported by the data.)
Fix 2: Delayed policy updates
Update the actor less frequently than the critics — typically once every two critic gradient steps:
    For each step t:
        Update Q_φ₁ and Q_φ₂                 (every step)
        If t mod d == 0:
            Update μ_θ using ∇_a Q_φ₁        (every d steps, d=2 typically)
            Update all target networks
Why delayed updates stabilize training. The actor gradient is only meaningful when the critic is a reliable estimate. In early training, the critic is inaccurate, and updating the actor from a bad critic pushes the policy in the wrong direction. Delaying actor updates gives the critics more time to converge on each policy before the policy is updated. Fewer, more reliable actor updates outperform frequent, noisy ones.
Fix 3: Target policy smoothing
Add clipped noise to the target policy's action when computing the TD target:
$$\tilde{a}' = \mu_{\theta^-}(s') + \operatorname{clip}(\epsilon, -c, c), \qquad \epsilon \sim \mathcal{N}(0, \tilde{\sigma}^2)$$
Why this regularizes the critic. A deterministic policy evaluates the critic at exactly one action per state. If $Q_\phi$ develops a sharp peak at $\mu_\theta(s)$ (which the actor is incentivized to exploit), the critic will overfit to these exact actions and generalize poorly. Adding noise smooths the target: instead of fitting to the value at a single point, the critic must fit the value in a neighborhood of actions around $\mu_{\theta^-}(s')$. The clipping ensures the smoothed action stays near the target policy and does not drift to irrelevant regions of the action space.
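The smoothing step itself is only a few lines. The sketch below perturbs each component of a target action with Gaussian noise clipped to [−c, c]. The function name, the default values, and the additional clip to the action bounds are assumptions made for illustration; the action-bound clip in particular is a common implementation detail rather than something stated above.

```python
import random


def clip(x, low, high):
    return max(low, min(x, high))


def smoothed_target_action(target_action, noise_std=0.2, noise_clip=0.5,
                           action_low=-1.0, action_high=1.0, rng=random):
    """Return target_action with clipped Gaussian noise added per dimension."""
    smoothed = []
    for a in target_action:
        # Noise is clipped first, so the perturbation never exceeds noise_clip.
        noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
        # Assumed extra step: keep the result inside the valid action range.
        smoothed.append(clip(a + noise, action_low, action_high))
    return smoothed
```

Because the noise is clipped before it is added, the critic target is always evaluated within a bounded neighborhood of the target policy's action, which is the property the argument above relies on.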
TD3 pseudocode summary
    Initialize μ_θ, Q_φ₁, Q_φ₂ and target networks μ_{θ⁻}, Q_{φ₁⁻}, Q_{φ₂⁻}
    Initialize replay buffer D
    For each step t:
        aₜ = μ_θ(sₜ) + ε,  ε ~ N(0, σ²)
        Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D
        Sample mini-batch from D
        # Smoothed target action
        ã' = μ_{θ⁻}(s') + clip(N(0,σ̃²), -c, c)
        # Critic targets (clipped double Q)
        y = r + γ · min(Q_{φ₁⁻}(s', ã'), Q_{φ₂⁻}(s', ã'))
        # Update both critics
        Minimize (y - Q_φᵢ(s, a))² for i = 1, 2
        # Delayed actor update
        If t mod d == 0:
            Maximize Q_φ₁(s, μ_θ(s)) over θ
            Update target networks via Polyak averaging
- Target policy smoothing (`ã' = μ_{θ⁻}(s') + clip(N(0,σ̃²), -c, c)`): We add clipped noise to the target action, preventing the critic from overfitting to the deterministic actor output and smoothing the value estimate.
- Clipped double Q-learning (`y = r + γ · min(Q_{φ₁⁻}(s', ã'), Q_{φ₂⁻}(s', ã'))`): We evaluate the noisy target action with both target critics and take the minimum, which substantially reduces the overestimation bias endemic to Q-learning.
- Critic updates (`Minimize (y - Q_φᵢ(s, a))² for i = 1, 2`): Both critics are updated towards the same conservative target.
- Delayed policy updates (`If t mod d == 0`): We only update the actor and target networks once every `d` critic updates, so the actor optimizes against a more stable and accurate critic.
- Actor update (`Maximize Q_φ₁(s, μ_θ(s))`): The actor is updated by maximizing $Q_{\phi_1}(s, \mu_\theta(s))$. Only one critic is needed to provide the gradient for the actor.
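The clipped double-Q target and the delayed update schedule can be sketched with scalar states and actions. Everything named here (`td3_critic_target`, `actor_update_due`, the two toy target critics) is hypothetical and exists only for this illustration; the target omits any terminal-state handling, matching the pseudocode above.

```python
def td3_critic_target(r, s_next, a_next_smoothed, target_q1, target_q2, gamma=0.99):
    """y = r + gamma * min(Q1'(s', a~'), Q2'(s', a~')) for one transition."""
    q_min = min(target_q1(s_next, a_next_smoothed),
                target_q2(s_next, a_next_smoothed))
    return r + gamma * q_min


def actor_update_due(step, policy_delay=2):
    """True on the steps where the actor and target networks are updated."""
    return step % policy_delay == 0


def toy_target_q1(s, a):
    return s + a            # stand-in for the first target critic


def toy_target_q2(s, a):
    return s + a + 1.0      # stand-in for a more optimistic second critic


y = td3_critic_target(1.0, 2.0, 1.0, toy_target_q1, toy_target_q2)

critic_updates = 0
actor_updates = 0
for step in range(1, 11):
    critic_updates += 1                 # critics are updated every step
    if actor_update_due(step):
        actor_updates += 1              # actor only every policy_delay steps
```

The target follows the more pessimistic critic, and over ten steps the critics receive ten updates while the actor receives five.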
Soft Actor-Critic (SAC)
SAC (Haarnoja et al., 2018) replaces the deterministic policy with a maximum entropy framework, changing the problem the actor-critic is solving at a fundamental level.
The maximum entropy objective
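Standard RL maximizes expected return:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_t \gamma^t\, r(s_t, a_t) \right]$$
Maximum entropy RL augments this with a policy entropy bonus at every timestep:
$$J_{\text{MaxEnt}}(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_t \gamma^t \left( r(s_t, a_t) + \alpha\, \mathcal{H}\!\left( \pi(\cdot \mid s_t) \right) \right) \right]$$
where $\mathcal{H}(\pi(\cdot \mid s)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s)]$ and $\alpha > 0$ is the temperature parameter controlling the entropy-reward tradeoff. This is not just reward shaping: it changes the definition of optimality.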
The soft Bellman equation
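The maximum entropy objective admits a modified Bellman equation. Define the soft Q-function as the solution to:
$$Q_{\text{soft}}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s'}\!\left[ V_{\text{soft}}(s') \right]$$
where the soft value function integrates over both reward and entropy:
$$V_{\text{soft}}(s) = \mathbb{E}_{a \sim \pi}\!\left[ Q_{\text{soft}}(s, a) - \alpha \log \pi(a \mid s) \right] = \alpha \log \int \exp\!\left( Q_{\text{soft}}(s, a) / \alpha \right) da$$
The last equality (the log-sum-exp / soft maximum) holds at the optimum and follows from the closed-form optimal policy:
$$\pi^*(a \mid s) = \frac{\exp\!\left( Q_{\text{soft}}(s, a) / \alpha \right)}{\int \exp\!\left( Q_{\text{soft}}(s, a') / \alpha \right) da'} \;\propto\; \exp\!\left( Q_{\text{soft}}(s, a) / \alpha \right)$$
The optimal maximum entropy policy is a Boltzmann distribution over the Q-function, with temperature $\alpha$. As $\alpha \to 0$, this collapses to a deterministic policy selecting the greedy action. As $\alpha \to \infty$, the policy approaches uniform, so all actions are equally likely. At intermediate $\alpha$, the policy concentrates on high-value actions while maintaining uncertainty in lower-value regions.
This is a principled closed-form result, not a heuristic: maximum entropy RL has a well-defined optimal policy, and SAC approximates it.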
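These limits can be checked numerically in a simplified setting. The sketch below assumes a finite set of actions, so the integral becomes a sum, and uses three made-up Q-values. It computes the Boltzmann policy, the log-sum-exp soft value, and the expectation of Q − α log π under that policy; all function names are illustrative.

```python
import math


def boltzmann_policy(q_values, alpha):
    """pi(a) proportional to exp(Q(a) / alpha), computed stably."""
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]


def soft_value(q_values, alpha):
    """alpha * log sum_a exp(Q(a) / alpha), the soft maximum of the Q-values."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def expected_q_plus_entropy(q_values, alpha):
    """E_{a ~ pi}[Q(a) - alpha * log pi(a)] under the Boltzmann policy."""
    pi = boltzmann_policy(q_values, alpha)
    return sum(p * (q - alpha * math.log(p)) for p, q in zip(pi, q_values))


q = [1.0, 2.0, 3.0]                       # made-up Q-values for three actions
near_greedy = boltzmann_policy(q, 0.01)   # small temperature
near_uniform = boltzmann_policy(q, 1000.0)  # large temperature
v_logsumexp = soft_value(q, 1.0)
v_expectation = expected_q_plus_entropy(q, 1.0)
```

At the small temperature nearly all probability sits on the highest-valued action, at the large temperature the three probabilities are nearly equal, and at α = 1 the two expressions for the soft value coincide, which is the identity used in the soft Bellman equation.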
The reparameterization trick
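PPO estimates the policy gradient via the score function (Week 7):
$$\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \hat{A}(s, a) \right]$$
This is high-variance because the score $\nabla_\theta \log \pi_\theta(a \mid s)$ and the advantage estimate $\hat{A}(s, a)$ both depend on the sampled action $a$, and the estimator never uses the gradient of the return with respect to $a$. SAC uses the reparameterization trick: instead of sampling $a \sim \pi_\theta(\cdot \mid s)$ directly, write:
$$a = f_\theta(s, \xi) = \mu_\theta(s) + \sigma_\theta(s) \odot \xi$$
where $\xi \sim \mathcal{N}(0, I)$ (for diagonal Gaussian, typically with tanh squashing for bounded actions). Then:
$$\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta}\!\left[ Q(s, a) \right] = \mathbb{E}_{\xi \sim \mathcal{N}(0, I)}\!\left[ \nabla_\theta f_\theta(s, \xi)\, \nabla_a Q(s, a) \big|_{a = f_\theta(s, \xi)} \right]$$
The gradient now flows directly through $Q$ with respect to $a$, then through $f_\theta$ with respect to $\theta$. This is lower-variance than the score function estimator because the backpropagation path is deterministic (noise is fixed when differentiating). The reparameterization trick is what enables SAC to use a stochastic policy while maintaining the low-variance gradient of the DPG theorem.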
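The variance gap can be seen on a one-dimensional toy problem. The sketch assumes a ~ N(μ, σ²) and a stand-in objective f(a) = a² in place of the critic, so the true gradient of E[f(a)] with respect to μ is 2μ. It draws per-sample estimates from both estimators using the same noise; the names and constants are illustrative.

```python
import random


def gradient_samples(mu=1.0, sigma=1.0, n=20000, seed=0):
    """Per-sample estimates of d/dmu E[a^2] from both estimators."""
    rng = random.Random(seed)
    score_fn, reparam = [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)
        a = mu + sigma * xi                              # reparameterized sample
        # Score function: f(a) * d/dmu log N(a; mu, sigma^2)
        score_fn.append((a * a) * (a - mu) / sigma ** 2)
        # Reparameterization: f'(a) * da/dmu, with da/dmu = 1
        reparam.append(2.0 * a)
    return score_fn, reparam


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


score_fn, reparam = gradient_samples()
print("score-function mean/var:", mean(score_fn), variance(score_fn))
print("reparameterized mean/var:", mean(reparam), variance(reparam))
```

Both estimators average to approximately the true gradient, but in this toy setting the per-sample variance of the score-function estimator is several times larger than that of the reparameterized one.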
Automatic entropy tuning
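The temperature $\alpha$ controls the exploration-exploitation tradeoff: high $\alpha$ encourages diverse actions; low $\alpha$ focuses the policy. Manually tuning $\alpha$ as a hyperparameter requires task-specific knowledge. SAC automatically adjusts $\alpha$ by formulating it as a constrained optimization: maintain a target entropy $\mathcal{H}_{\text{target}}$ and find the $\alpha$ that achieves it.
Applying dual gradient descent (Lagrangian duality) to the constraint $\mathbb{E}_{a \sim \pi_\theta}[-\log \pi_\theta(a \mid s)] \ge \mathcal{H}_{\text{target}}$:
$$\alpha^* = \arg\min_{\alpha \ge 0}\; \mathbb{E}_{a \sim \pi_\theta}\!\left[ -\alpha \log \pi_\theta(a \mid s) - \alpha\, \mathcal{H}_{\text{target}} \right]$$
In practice, $\alpha$ is updated alongside the actor and critic via a separate gradient step:
$$\alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha J(\alpha), \qquad J(\alpha) = \mathbb{E}_{a \sim \pi_\theta}\!\left[ \alpha \left( -\log \pi_\theta(a \mid s) - \mathcal{H}_{\text{target}} \right) \right]$$
If the current policy entropy is above $\mathcal{H}_{\text{target}}$, this decreases $\alpha$ (weakening the entropy bonus); if below, it increases $\alpha$ (strengthening the entropy bonus and hence exploration). The system self-regulates. A common choice is $\mathcal{H}_{\text{target}} = -\dim(\mathcal{A})$ (negative of action dimension), which works well across continuous control tasks without manual tuning.
This is the Lagrangian dual formulation of the temperature derived in Haarnoja et al. (2018b), and it is analogous to the automatic KL coefficient tuning in some RLHF implementations.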
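A single temperature update can be sketched as follows, under the assumption that the loss is J(α) = α(−log π − H_target) averaged over a batch and that the step is taken on log α so that α stays positive. The function name, the learning rate, and the example log-probabilities are all made up for illustration.

```python
import math


def temperature_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on J(alpha), parameterized by log_alpha."""
    alpha = math.exp(log_alpha)
    # Monte Carlo entropy estimate: H is approximately -mean(log pi(a|s)).
    entropy_estimate = -sum(log_probs) / len(log_probs)
    # J(alpha) = alpha * (entropy_estimate - target_entropy), so
    # dJ/d(log_alpha) = alpha * (entropy_estimate - target_entropy).
    grad = alpha * (entropy_estimate - target_entropy)
    return log_alpha - lr * grad


# Entropy estimate 2.0 is above the target 1.0, so the temperature should drop.
lower = temperature_step(0.0, [-2.0, -2.0], target_entropy=1.0)
# Entropy estimate -0.5 is below the target 1.0, so the temperature should rise.
higher = temperature_step(0.0, [0.5, 0.5], target_entropy=1.0)
```

The sign of the update depends only on whether the estimated entropy is above or below the target, which is the self-regulating behaviour described above.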
SAC pseudocode
    Initialize actor π_θ (stochastic: outputs μ_θ(s), σ_θ(s))
    Initialize critics Q_φ₁, Q_φ₂ and target networks Q_{φ₁⁻}, Q_{φ₂⁻}
    Initialize temperature α (or α-network for auto-tuning)
    Initialize replay buffer D
    For each step t:
        # Stochastic action via reparameterization
        aₜ = μ_θ(sₜ) + σ_θ(sₜ) ⊙ ξ,  ξ ~ N(0, I)
        Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D
        Sample mini-batch from D
        Sample â' = μ_θ(s') + σ_θ(s') ⊙ ξ',  ξ' ~ N(0, I)
        # Critic targets (soft Bellman, clipped double Q)
        y = r + γ · (min(Q_{φ₁⁻}(s', â'), Q_{φ₂⁻}(s', â')) - α log π_θ(â'|s'))
        # Update both critics
        Minimize (y - Q_φᵢ(s, a))² for i = 1, 2
        # Actor update (reparameterization trick)
        â = μ_θ(s) + σ_θ(s) ⊙ ξ
        Maximize min(Q_φ₁(s, â), Q_φ₂(s, â)) - α log π_θ(â|s) over θ
        # Temperature update (auto-tuning)
        Minimize α(-log π_θ(â|s) - H_target) over α
        # Soft target updates
        φᵢ⁻ ← τφᵢ + (1-τ)φᵢ⁻ for i = 1, 2
Note three connections to prior lectures: the clipped double Q target (Week 6: Double DQN), the soft Bellman entropy term in the critic target (the maximum entropy modification), and the reparameterization gradient in the actor update (as opposed to the score function estimator in PPO).
Algorithm synthesis
Expanded comparison table
| Algorithm | Policy | Exploration | Critic | Target network | Sample efficiency |
|---|---|---|---|---|---|
| PPO | Stochastic | Policy entropy | $V_\phi(s)$ | None | Low (on-policy) |
| DDPG | Deterministic | External noise | $Q_\phi(s, a)$ | Polyak | High |
| TD3 | Deterministic | Smoothed noise | $\min(Q_{\phi_1}, Q_{\phi_2})$ | Polyak | High |
| SAC | Stochastic (MaxEnt) | Intrinsic entropy | Soft $\min(Q_{\phi_1}, Q_{\phi_2})$ | Polyak | Highest |
PPO vs SAC: the robotics decision
In practice, the choice between PPO and SAC depends on the data collection regime:
Choose PPO when:
- Fast parallel simulation is available (IsaacLab, MuJoCo with 4096 parallel envs). On-policy inefficiency is irrelevant when you can collect millions of transitions per second in simulation. PPO with GAE achieves excellent results in legged locomotion (ANYmal, Unitree, Spot) precisely because massive parallelism compensates for sample inefficiency.
- Reward shaping is complex. On-policy data always reflects the current policy, making reward shaping and curriculum learning easier to reason about. Off-policy replay buffers contain stale data from old policies, which can interact badly with changing reward functions.
- Implementation simplicity matters. PPO has fewer moving parts than SAC: no automatic entropy tuning, no reparameterization, no double critics. RSL-RL's PPO implementation for legged locomotion is ~500 lines; a full SAC implementation is substantially more complex.
Choose SAC when:
- Real-robot data collection is the bottleneck. SAC's replay buffer reuses every transition, typically achieving 3–10× better sample efficiency than PPO. For Franka arm manipulation or other real-hardware setups, this difference is decisive.
- Exploration quality matters more than speed. SAC's entropy-driven exploration produces smoother, more diverse trajectories than $\epsilon$-greedy or additive noise, which is particularly valuable for contact-rich manipulation tasks where the agent must discover narrow-margin contact modes.
- Hyperparameter robustness is needed. Automatic entropy tuning makes SAC largely self-configuring on the temperature hyperparameter, reducing the tuning burden compared to PPO's entropy coefficient and GAE $\lambda$.
The hybrid regime: many modern systems use PPO for initial training in simulation (fast, stable, handles complex reward shaping), then fine-tune on real hardware with SAC (sample efficient, smooth exploration). This sim-to-real workflow combines the strengths of both.
GenAI context: PPO for RLHF and SAC for structured generation
PPO in RLHF (recap from Week 7): the language model is the actor, the reward model provides the return signal, and the KL penalty against the SFT reference model prevents reward hacking. PPO's on-policy sampling is acceptable in this setting because the language model generates its own rollouts (completions), and on-policy data ensures the distribution of completions matches the current policy. This matters for RLHF, where the reward model may be poorly calibrated outside the SFT distribution.
SAC-inspired ideas in GenAI: maximum entropy RL has influenced several GenAI directions:
- Temperature sampling in language model decoding is the inference-time analog of the Boltzmann optimal policy: high temperature encourages diverse completions; low temperature is greedy.
- Energy-based models for text generation model the distribution over sequences as $p(x) \propto \exp(-E(x))$, the same Boltzmann form as the SAC optimal policy.
- DPO (Direct Preference Optimization) and GRPO (Group Relative Policy Optimisation), covered in the next lecture, both eliminate the learned critic. DPO goes further and reformulates KL-regularized RLHF, whose optimal policy has the same Boltzmann form, as a supervised objective directly on preference data, removing the PPO loop entirely.
Key takeaways
The lecture develops the off-policy actor-critic family as a principled progression. DDPG combines the deterministic policy gradient theorem, in which the gradient flows through $\nabla_a Q$ rather than $\nabla_\theta \log \pi_\theta$, with DQN's replay buffer and target networks, but suffers from overestimation, coupled instability, and sensitivity to hyperparameters. TD3 applies three targeted fixes: clipped double Q-learning reduces overestimation via the minimum of two independent critics (continuous-action Double DQN); delayed policy updates decouple critic convergence from actor updates; target policy smoothing prevents the critic from overfitting to sharp peaks at deterministic actions. SAC replaces the deterministic policy with the maximum entropy framework, whose optimal policy is provably a Boltzmann distribution over the soft Q-function. The reparameterization trick provides lower-variance gradients than score function estimation. Automatic entropy tuning via dual gradient descent replaces the hand-tuned temperature with a target entropy.
The PPO vs SAC choice maps cleanly onto deployment context: PPO with fast parallel simulation for legged locomotion, SAC for real-robot data-scarce settings, with hybrid workflows bridging the two.
Conceptual questions
- The deterministic policy gradient theorem gives $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a) \big|_{a = \mu_\theta(s)} \right]$. Compare this with the stochastic policy gradient theorem from Week 7. In the stochastic case, where does the gradient come from? In the deterministic case, what must be true about the critic for this gradient to be well-defined? Why does DDPG use a Q-critic rather than a V-critic?
- TD3 uses the minimum of two Q-estimates as the critic target: $y = r + \gamma \min\!\left( Q_{\phi_1^-}(s', \tilde{a}'),\, Q_{\phi_2^-}(s', \tilde{a}') \right)$. Using the minimum rather than the mean introduces a downward bias. Explain why downward bias is less harmful than upward bias in the actor-critic setting by tracing the feedback loop that upward bias (overestimation) creates between the actor and critic.
- SAC's optimal policy is $\pi^*(a \mid s) \propto \exp\!\left( Q_{\text{soft}}(s, a) / \alpha \right)$. Show what happens to this policy as $\alpha \to 0$ and as $\alpha \to \infty$. Then explain: if $\alpha$ is too large during training, what behavior do you expect from the agent? If $\alpha$ is too small, what problem arises? How does automatic entropy tuning resolve this without manual intervention?
- Compare exploration mechanisms across the three off-policy algorithms: DDPG adds external Gaussian noise to the deterministic action; TD3 adds smoothed noise to the target policy (not the behavior policy); SAC's stochastic policy provides intrinsic exploration. For a manipulation task requiring precise contact (narrow-margin task), which exploration mechanism is most appropriate and why? What happens to DDPG-style exploration when the task reward requires very precise actions?
- You are training a quadruped locomotion policy. You have two setups: (A) IsaacLab with 4096 parallel simulation environments, each running at 100 Hz; (B) a single physical Unitree Go1 robot collecting ~1 Hz data due to safety stops. For each setup, argue which algorithm, PPO or SAC, is more appropriate, citing sample efficiency, exploration quality, and implementation complexity. For setup B, what modifications to the standard SAC algorithm would you make to handle the safety constraints of real hardware?
Implementation exercises
Exercise 1: DDPG core loop
Implement the DDPG training loop in Python using a deep learning framework of your choice. Start from the pseudocode above. Key implementation details:
- Critic architecture: The critic takes both state and action as input and outputs a scalar Q-value. Concatenate action after a hidden layer (not at the input) for numerical stability.
- Exploration noise: Start with simple Gaussian noise, which performs comparably to Ornstein-Uhlenbeck and is simpler to debug.
- Polyak averaging: Use a small $\tau$ for target network updates. Verify that `target_actor.parameters()` track `actor.parameters()` with a small lag.
- Training ratio: One gradient step per environment step. Use a fixed-capacity replay buffer and batch size 64.
Test on Pendulum-v1: The agent should solve it (average return > −200 over 100 episodes) within 100 episodes.
Exercise 2: Upgrade DDPG to TD3
Starting from your DDPG implementation, add the three TD3 fixes one at a time, testing after each:
- Add a second critic network ($Q_{\phi_2}$). Use $\min(Q_{\phi_1^-}, Q_{\phi_2^-})$ for target computation. Track the Q-value gap $|Q_{\phi_1} - Q_{\phi_2}|$ during training; it should shrink as both critics converge.
- Delay the actor update to every 2 critic steps ($d = 2$). Plot policy entropy over time; a stable TD3 policy should show less oscillation than DDPG.
- Add target policy smoothing with noise scale $\tilde{\sigma}$ and clip bound $c$ (Fujimoto et al. use $\tilde{\sigma} = 0.2$, $c = 0.5$). Compare critic gradient variance with and without smoothing.
Test on HalfCheetah-v4 (MuJoCo): TD3 should outperform DDPG by approximately 30% in final return.
Exercise 3: SAC with automatic entropy tuning
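Implement SAC extending your TD3 codebase. Key additions:
- Stochastic actor: Outputs $\mu_\theta(s)$ and $\sigma_\theta(s)$ for a diagonal Gaussian. Apply `tanh` squashing for bounded actions (add log-probability correction for the Jacobian of tanh).
- Reparameterization trick: Sample $a = \tanh\!\left( \mu_\theta(s) + \sigma_\theta(s) \odot \xi \right)$ with $\xi \sim \mathcal{N}(0, I)$. The gradient flows through $a$ via autodiff, not through a score function estimator.
- Soft Bellman target: $y = r + \gamma \left( \min_{i = 1, 2} Q_{\phi_i^-}(s', \hat{a}') - \alpha \log \pi_\theta(\hat{a}' \mid s') \right)$.
- Automatic entropy tuning: Parameterize $\log \alpha$ (not $\alpha$) to ensure positivity. Target entropy $\mathcal{H}_{\text{target}} = -\dim(\mathcal{A})$. Update with a separate optimizer that has its own learning rate.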
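The tanh correction in the first bullet is a common source of bugs, so a scalar sketch may help. It assumes a one-dimensional Gaussian pre-activation u = μ + σξ and action a = tanh(u), and uses the change-of-variables rule log π(a) = log N(u; μ, σ²) − log(1 − tanh²(u)). The function names and the numerically stable rewrite of the Jacobian term are choices made for this illustration, not requirements of the exercise.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def tanh_log_det_jacobian(u):
    """log|d tanh(u)/du| = log(1 - tanh(u)^2), written in a stable form."""
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


def sample_squashed_action(mu, log_std, rng=random):
    """Reparameterized sample a = tanh(mu + std * xi) and log pi(a) (1-D)."""
    std = math.exp(log_std)
    xi = rng.gauss(0.0, 1.0)
    u = mu + std * xi
    # Log-density of u under N(mu, std^2), expressed through xi.
    gaussian_log_prob = -0.5 * xi * xi - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change of variables: subtract the log-Jacobian of the tanh squashing.
    log_prob = gaussian_log_prob - tanh_log_det_jacobian(u)
    return math.tanh(u), log_prob
```

The stable form avoids evaluating log(1 − tanh²(u)) directly, which underflows to log(0) once tanh(u) rounds to ±1 for large |u|. In a multi-dimensional action space the per-dimension corrections would be summed.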
Test on Ant-v4 (MuJoCo): SAC should achieve approximately 6000 return in 1M steps, demonstrating 3–5× better sample efficiency than TD3.
Extension prompts
- Continuous-action Rainbow: Combine TD3's clipped double Q with SAC's maximum-entropy framework and automatic temperature tuning. How does this hybrid perform on `Humanoid-v4` compared to either algorithm alone? What does the entropy curve reveal about the benefit of stochastic exploration in high-dimensional action spaces?
- Offline RL via CQL: Collect 1M transitions from a partially trained SAC agent. Train a new policy from this fixed dataset using Conservative Q-Learning (CQL), which penalizes overestimated Q-values for out-of-distribution actions. Compare with standard SAC trained on the same offline data. Which shows less extrapolation error?
- Sim-to-real gap: Train a policy in MuJoCo, deploy it in a different physics engine (e.g., PyBullet or Isaac Sim), and measure the performance gap. Explore domain randomization (varying mass, friction, joint damping ranges) as a mitigation. How much randomization is needed before the policy transfers without fine-tuning?
Further reading
- Silver, D., et al. (2014). Deterministic Policy Gradient Algorithms. ICML. (DPG Theorem).
- Lillicrap, T. P., et al. (2015). Continuous control with deep reinforcement learning. ICLR. (DDPG).
- Fujimoto, S., et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML. (TD3).
- Haarnoja, T., et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML. (SAC).
- Haarnoja, T., et al. (2018b). Soft Actor-Critic Algorithms and Applications. arXiv. (Automatic entropy tuning and practical SAC implementation details).
- Duan, Y., et al. (2016). Benchmarking Deep Reinforcement Learning for Continuous Control. ICML. (Standardized evaluation benchmarks for DDPG, TRPO, and other continuous-control algorithms).