Week 8: Modern Deep Reinforcement Learning Algorithms

Grounded In

Robotics: DDPG, TD3, and SAC are the dominant algorithms for continuous control in real-robot systems — SAC is used for Franka arm manipulation, Unitree Go1 locomotion, and dexterous in-hand manipulation. Each interaction with physical hardware is expensive (seconds per step vs milliseconds in simulation), making off-policy sample efficiency essential.
GenAI: PPO powers RLHF for language model alignment (ChatGPT, Claude); the maximum-entropy framework behind SAC parallels temperature sampling in LLM decoding; DPO and GRPO (next lecture) remove the learned critic from the alignment pipeline.

Purpose of this lecture

In Week 7, we derived the policy gradient theorem, developed actor-critic methods with GAE, and fully derived PPO and TRPO. This lecture completes the modern deep RL toolkit by studying the off-policy actor-critic family — DDPG, TD3, and SAC — which trade on-policy stability for sample efficiency and dominate physical robotics applications.

The lecture is organized in three tiers. First, a brief recap of TRPO/PPO establishes the on-policy baseline. Second, the off-policy family is developed as a progression: DDPG introduces the deterministic policy gradient theorem; TD3 fixes DDPG's three core pathologies; SAC replaces deterministic policies with the maximum entropy framework and adds automatic entropy tuning. Third, a synthesis section maps the algorithm landscape onto practical deployment contexts — particularly the PPO vs SAC choice in legged locomotion vs real-robot manipulation.

On-policy recap: TRPO and PPO

TRPO and PPO were derived in full in Week 7. The key results:

TRPO constrains each update to a trust region defined by KL divergence:

\max_\theta\; \mathbb{E}_t\!\left[\rho_t A_t\right] \quad\text{subject to}\quad \mathbb{E}_s\!\left[D_{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta)\right] \leq \delta

which yields (approximately) monotonic policy improvement at the cost of second-order optimization.

PPO approximates the trust region with a clipped surrogate:

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t A_t,\; \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) A_t\right)\right], \quad \rho_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

The $\min$ makes the objective a pessimistic bound: once $\rho_t$ leaves $[1-\epsilon, 1+\epsilon]$ in the direction that improves the objective, the gain is capped and the gradient vanishes, while changes that make the objective worse are never clipped. The full PPO loss adds a critic regression term and entropy bonus (see Week 7 for the complete derivation and RLHF mapping).
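As a quick numerical check of the clipping behaviour, the per-sample surrogate term can be sketched in a few lines of Python. The function name `ppo_clip_term` and the default `eps=0.2` are illustrative choices, not taken from Week 7.

```python
def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO surrogate: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Two cases per advantage sign: ratio above and below the clip range.
    for ratio, adv in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
        print(f"rho={ratio}, A={adv:+}: {ppo_clip_term(ratio, adv):+.2f}")
```

With a positive advantage the term stops growing once the ratio exceeds `1 + eps`; with a negative advantage a ratio above `1 + eps` is penalized in full, which is the asymmetry described above.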

Why on-policy methods are sample-inefficient: both TRPO and PPO discard data after each policy update. Every transition in the rollout batch is used for a few gradient steps, then thrown away. For physical robots, where each transition costs real hardware time, this is often unacceptable. Off-policy methods, which reuse past transitions via a replay buffer, are essential for data-scarce settings. (This is the core motivation for SAC and TD3 in Course 2: on a Unitree Go1 or a manipulation robot, each second of real interaction is expensive; SAC can often learn a usable policy from on the order of 100K transitions, where PPO typically needs millions.)

The off-policy actor-critic family: overview

The three off-policy algorithms in this lecture form a clear progression:

| Algorithm | Policy type | Exploration | Key problem solved |
|---|---|---|---|
| DDPG | Deterministic | External noise | Continuous actions with DPG theorem |
| TD3 | Deterministic | External noise (target actions smoothed) | Overestimation + instability in DDPG |
| SAC | Stochastic (MaxEnt) | Intrinsic entropy | Exploration + robustness + auto-tuning |

Each algorithm inherits from its predecessor and adds targeted fixes, exactly as the DQN → Double DQN → Dueling → Rainbow progression did in Week 6.

All three share the engineering foundation built up from DQN in Week 6: a replay buffer $\mathcal{D}$, target networks (here updated via Polyak averaging, $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$, rather than DQN's periodic hard copies), and separate actor and critic networks.
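The Polyak update is simple enough to check by hand. The sketch below treats a network as a flat list of floats, which is a simplification for illustration; the helper name `polyak_update` is hypothetical, and $\tau = 0.005$ is the value suggested in the exercises.

```python
from typing import List


def polyak_update(target: List[float], online: List[float], tau: float = 0.005) -> List[float]:
    """Return tau * online + (1 - tau) * target, element-wise."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


if __name__ == "__main__":
    target, online = [0.0, 0.0], [1.0, -2.0]
    for _ in range(1000):
        target = polyak_update(target, online)
    # After k updates toward fixed online weights, the remaining gap is (1 - tau)^k.
    print(target)
```

With fixed online weights the target closes the gap geometrically, so the target network behaves like an exponential moving average of recent online weights.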

Deep Deterministic Policy Gradient (DDPG)

From stochastic to deterministic policies

Recall the stochastic policy gradient theorem from Week 7:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\right]

This requires a stochastic policy: the score function $\nabla_\theta \log \pi_\theta(a|s)$ is defined only where the policy has a density $\pi_\theta(a|s) > 0$. For continuous action spaces, we may instead want a deterministic policy $a = \mu_\theta(s)$ that directly outputs an action. Such a policy puts all its mass on a single action, so $\log \pi_\theta$ is undefined and the stochastic policy gradient theorem does not apply.

The deterministic policy gradient theorem

DPG theorem (Silver et al., 2014): For a deterministic policy $a = \mu_\theta(s)$ , the gradient of the expected return is:

\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\cdot \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \right]

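Derivation sketch. Write the objective as the value of the start state under the deterministic policy:

J(\theta) = \mathbb{E}_{s_0 \sim p_0}\!\left[V^\mu(s_0)\right], \qquad V^\mu(s) = Q^\mu(s, \mu_\theta(s))

Differentiating $V^\mu$ with respect to $\theta$ and applying the chain rule gives a recursion:

\nabla_\theta V^\mu(s) = \nabla_\theta \mu_\theta(s)\cdot \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, \mu_\theta(s))}\!\left[\nabla_\theta V^\mu(s')\right]

Unrolling this recursion over time and collecting the discount-weighted visitation of each state into the (discounted) state visitation distribution $\rho^\mu(s)$ induced by $\mu_\theta$ gives:

\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\cdot \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \right]

No gradient of $\rho^\mu$ or of the environment dynamics $P(s'|s,a)$ has to be estimated. As in the stochastic case, the dependence of the state distribution on $\theta$ is already accounted for by the unrolling, so the on-policy result is exact. In the off-policy setting used by DDPG, states are drawn from the replay buffer's distribution instead of $\rho^\mu$, which is an approximation.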

Why not just add noise and use the stochastic PG theorem?

A natural first thought: if we need exploration, why not make the policy stochastic by adding Gaussian noise to $\mu_\theta(s)$ and applying the standard stochastic policy gradient theorem? This naive approach has two problems. First, the score function estimator $\nabla_\theta \log \pi_\theta$ has high variance: it uses only the scalar return of each sampled action, and for Gaussian noise of scale $\sigma$ the score grows like $1/\sigma$ as the noise shrinks. The DPG theorem avoids this by letting the gradient flow directly through $\nabla_a Q$. Second, and more fundamentally, the score function estimator cannot reuse off-policy data from a replay buffer without importance-sampling corrections (because the behavior policy $\beta \neq \pi$). The DPG actor update needs no score function and no importance weights over actions, because the action is recomputed as $\mu_\theta(s)$ for each sampled state; this is the key insight that lets DDPG train from replayed real-robot data.

Key difference from stochastic PG:

| | Stochastic PG | Deterministic PG |
|---|---|---|
| Gradient signal | Score function $\nabla_\theta \log \pi_\theta(a\|s)$ | Critic action-gradient $\nabla_a Q^\mu$ |
| Exploration | Policy stochasticity | External noise added to $\mu_\theta(s)$ |
| Critic type | $V_\phi(s)$ or $Q_\phi(s,a)$ | $Q_\phi(s,a)$, must be differentiable in $a$ |
| Data | On or off-policy | Off-policy (replay buffer) |

The gradient flows through the Q-function with respect to the action, then through the actor with respect to $\theta$. This requires $Q_\phi(s,a)$ to be differentiable in $a$, which holds for neural network critics with continuous action inputs. (This differentiability requirement is why DDPG-style methods fit continuous control in Course 2 robot learning: joint torque commands are continuous, so the gradient $\nabla_a Q$ with respect to the commanded action is well-defined and cheap to compute by backpropagation.)
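To make the chain rule concrete, here is a minimal one-dimensional sketch. Everything in it is an assumption chosen for illustration: the critic $Q(s,a) = -(a - 2s)^2$ is given in closed form rather than learned, the actor is linear, $\mu_\theta(s) = \theta s$, and states come from a fixed batch with no dynamics. It shows only the actor-update direction $\nabla_\theta \mu_\theta(s) \cdot \nabla_a Q(s,a)$, not a full DDPG agent.

```python
import random


def q_value(s: float, a: float) -> float:
    """Toy critic, assumed known in closed form: the best action is a = 2s."""
    return -(a - 2.0 * s) ** 2


def dq_da(s: float, a: float) -> float:
    """Action-gradient of the toy critic."""
    return -2.0 * (a - 2.0 * s)


def actor(theta: float, s: float) -> float:
    """Linear deterministic policy mu_theta(s) = theta * s."""
    return theta * s


def dpg_gradient(theta: float, states: list) -> float:
    """Batch average of grad_theta mu_theta(s) * grad_a Q(s, a) at a = mu_theta(s)."""
    total = 0.0
    for s in states:
        dmu_dtheta = s  # derivative of theta * s with respect to theta
        total += dmu_dtheta * dq_da(s, actor(theta, s))
    return total / len(states)


if __name__ == "__main__":
    rng = random.Random(0)
    states = [rng.uniform(-1.0, 1.0) for _ in range(256)]
    theta = 0.0
    for _ in range(200):
        theta += 0.5 * dpg_gradient(theta, states)  # gradient ascent on Q
    print("theta after ascent:", round(theta, 4))  # approaches the optimum theta = 2
```

No log-probabilities or importance weights appear anywhere: the update only needs the critic's slope at the actor's own action.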

DDPG architecture and exploration

DDPG (Lillicrap et al., 2015) combines the DPG theorem with DQN's engineering:


Initialize actor μ_θ, critic Q_φ, target networks μ_{θ⁻}, Q_{φ⁻}
Initialize replay buffer D

For each step t:
  # Action selection with exploration noise
  aₜ = μ_θ(sₜ) + εₜ,   εₜ ~ N(0, σ²)

  # Environment step, store transition
  Execute aₜ, observe rₜ₊₁, sₜ₊₁
  Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D

  # Sample mini-batch
  Sample {(sⱼ, aⱼ, rⱼ, s'ⱼ)} from D

  # Critic update (DQN-style TD target)
  yⱼ = rⱼ + γ Q_{φ⁻}(s'ⱼ, μ_{θ⁻}(s'ⱼ))
  Minimize Σⱼ (yⱼ - Q_φ(sⱼ, aⱼ))²  over φ

  # Actor update (DPG theorem)
  Maximize Σⱼ Q_φ(sⱼ, μ_θ(sⱼ))  over θ
  (gradient: ∇_θ μ_θ(sⱼ) · ∇_a Q_φ(sⱼ, a)|_{a=μ_θ(sⱼ)})

  # Soft target updates
  θ⁻ ← τθ + (1-τ)θ⁻
  φ⁻ ← τφ + (1-τ)φ⁻

Action selection with noise (aₜ = μ_θ(sₜ) + εₜ): Because the policy is deterministic, we must explicitly add noise to the output action to explore the environment.
TD target with target networks (yⱼ = rⱼ + γ Q_{φ⁻}(s'ⱼ, μ_{θ⁻}(s'ⱼ))): The target value uses both target networks — the target actor selects the next action, and the target critic evaluates it.
Critic regression (Minimize Σⱼ (yⱼ - Q_φ(sⱼ, aⱼ))²): The critic is updated to minimize the mean squared Bellman error, just like DQN.
Actor gradient (Maximize Σⱼ Q_φ(sⱼ, μ_θ(sⱼ))): The actor is updated to maximize the critic's output. The gradient flows backwards from the critic to the actor via $\nabla_\theta \mu_\theta(s) \cdot \nabla_a Q_\phi(s,a)$ .
Polyak averaging (θ⁻ ← τθ + (1-τ)θ⁻): Soft target updates smoothly blend the current weights into the target weights, providing much more stable training than hard resets.

Exploration in DDPG is entirely external: the policy $\mu_\theta(s)$ is deterministic, so exploration requires adding noise at action selection time. The original paper uses Ornstein-Uhlenbeck noise (temporally correlated, mean-reverting) to simulate momentum-like exploration in physical systems. In practice, uncorrelated Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ performs comparably and is simpler. With Gaussian noise the behavior policy is $\beta(\cdot|s) = \mathcal{N}(\mu_\theta(s), \sigma^2 I)$, which differs from the target policy $\pi(a|s) = \delta(a - \mu_\theta(s))$. DDPG is therefore inherently off-policy, and the DPG actor update is what justifies training on the replay buffer's off-policy data.
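The difference between the two noise processes is easiest to see in their temporal correlation. The sketch below is a scalar, discretized Ornstein-Uhlenbeck process; the class name `OUNoise`, the step size `dt`, and the default `rate` and `sigma` values are illustrative assumptions rather than settings prescribed by this lecture.

```python
import math
import random


class OUNoise:
    """Discretized OU process: x += rate * (mean - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, rate=0.15, sigma=0.2, mean=0.0, dt=1e-2, seed=0):
        self.rate, self.sigma, self.mean, self.dt = rate, sigma, mean, dt
        self.rng = random.Random(seed)
        self.x = mean

    def sample(self) -> float:
        drift = self.rate * (self.mean - self.x) * self.dt  # pull back toward the mean
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def lag1_autocorrelation(xs):
    """Sample correlation between consecutive values of a sequence."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs[:-1], xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


if __name__ == "__main__":
    ou = OUNoise()
    ou_samples = [ou.sample() for _ in range(5000)]
    rng = random.Random(0)
    gaussian_samples = [rng.gauss(0.0, 0.2) for _ in range(5000)]
    print("OU lag-1 autocorrelation:      ", round(lag1_autocorrelation(ou_samples), 3))
    print("Gaussian lag-1 autocorrelation:", round(lag1_autocorrelation(gaussian_samples), 3))
```

Consecutive OU samples are strongly correlated, so the perturbation pushes in one direction for many steps, while independent Gaussian samples show essentially no correlation from step to step.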

DDPG limitations

Despite its elegance, DDPG has three pathologies in practice:

Overestimation bias: the critic target $y = r + \gamma Q_{\phi^-}(s', \mu_{\theta^-}(s'))$ evaluates the target policy's action exactly. Like the single-network DQN target, this is susceptible to overestimation — the critic learns to overvalue actions the actor has been encouraged to take, creating a feedback loop.
Coupling between actor and critic errors: the actor gradient $\nabla_a Q_\phi$ depends on the quality of the critic. An inaccurate critic provides misleading gradients that update the actor in the wrong direction, degrading the policy, which in turn produces worse training data for the critic. This coupled instability is severe in early training when both networks are randomly initialized.
Sensitivity to hyperparameters: the learning rates for the actor and critic must be carefully balanced. If the critic learns too slowly, actor gradients are noisy; if the actor updates too quickly relative to the critic, the policy diverges.

TD3 addresses all three.

Twin Delayed Deep Deterministic Policy Gradient (TD3)

TD3 (Fujimoto et al., 2018) introduces three targeted fixes to DDPG's pathologies. Each fix corresponds to a precisely identified failure mode.

Fix 1: Clipped double Q-learning

Maintain two independent critic networks $Q_{\phi_1}$ and $Q_{\phi_2}$ . Compute TD targets using the minimum of the two:

y = r + \gamma \min_{i=1,2} Q_{\phi_i^-}(s',\, \tilde{a}')

Why the minimum corrects overestimation. Each critic has independent random initialization and independent gradient noise, so their errors are partially decorrelated. For two estimates $\hat{Q}_i = Q + e_i$ with zero-mean errors, the maximum is biased upward and the minimum is biased downward: $\mathbb{E}[\max(\hat{Q}_1, \hat{Q}_2)] \geq Q \geq \mathbb{E}[\min(\hat{Q}_1, \hat{Q}_2)]$ (Jensen's inequality, since $\max$ is convex and $\min$ is concave). Slightly underestimating is far less harmful than overestimating, because underestimation does not produce the positive feedback loop (overestimated values → actor selects overvalued actions → critic reinforces overvaluation).

This is the continuous-action analog of Double DQN from Week 6: a second value estimate is used to counteract the upward bias that comes from maximizing over noisy estimates. (The clipped double Q-learning mechanism is particularly important in Course 2's offline robot learning, where the replay buffer contains demonstrations from suboptimal policies; overestimated values for actions the data covers poorly would pull the policy toward exactly those actions.)
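The direction of the two biases can be checked with a small Monte Carlo experiment. The sketch assumes two unbiased estimates with fully independent Gaussian errors, which is an idealization (TD3's critics are only partially decorrelated); the function name and parameters are hypothetical.

```python
import random


def mean_max_and_min(true_q=0.0, noise_std=1.0, n=20000, seed=0):
    """Average max and min of two noisy, unbiased estimates of the same Q-value."""
    rng = random.Random(seed)
    total_max = total_min = 0.0
    for _ in range(n):
        q1 = true_q + rng.gauss(0.0, noise_std)
        q2 = true_q + rng.gauss(0.0, noise_std)
        total_max += max(q1, q2)
        total_min += min(q1, q2)
    return total_max / n, total_min / n


if __name__ == "__main__":
    avg_max, avg_min = mean_max_and_min()
    print(f"true Q = 0.0, mean of max = {avg_max:+.3f}, mean of min = {avg_min:+.3f}")
```

The average of the maximum lands above the true value and the average of the minimum lands below it by the same margin, so taking the minimum trades an upward bias for a downward one of similar size.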

Fix 2: Delayed policy updates

Update the actor less frequently than the critics — typically once every two critic gradient steps:


For each step t:
  Update Q_φ₁ and Q_φ₂  (every step)
  If t mod d == 0:
    Update μ_θ using ∇_a Q_φ₁    (every d steps, d=2 typically)
    Update all target networks

Why delayed updates stabilize training. The actor gradient $\nabla_\theta \mu_\theta(s) \cdot \nabla_a Q_{\phi_1}(s,a)|_{a=\mu_\theta(s)}$ is only meaningful when the critic $Q_{\phi_1}$ is a reliable estimate. In early training, the critic is inaccurate, and updating the actor from a bad critic pushes the policy in the wrong direction. Delaying actor updates gives the critics more time to converge on each policy before the policy is updated. Fewer, more reliable actor updates outperform frequent, noisy ones.

Fix 3: Target policy smoothing

Add clipped noise to the target policy's action when computing the TD target:

\tilde{a}' = \mu_{\theta^-}(s') + \text{clip}(\xi,\, -c,\, c),\quad \xi \sim \mathcal{N}(0, \sigma^2)

y = r + \gamma \min_{i=1,2} Q_{\phi_i^-}(s',\, \tilde{a}')

Why this regularizes the critic. A deterministic policy evaluates the critic at exactly one action per state. If $Q_\phi(s,a)$ develops a sharp peak at $a = \mu_\theta(s)$ (which the actor is incentivized to exploit), the critic will overfit to these exact actions and generalize poorly. Adding noise smooths the target: instead of fitting $Q$ to the value at a single point, the critic must fit the value in a neighborhood of actions around $\mu_{\theta^-}(s')$ . The clipping $[-c, c]$ ensures the smoothed action stays near the target policy and does not drift to irrelevant regions of the action space.
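Fixes 1 and 3 meet in the target computation, which can be sketched for a scalar action as follows. The target actor and critics are passed in as plain Python callables, and the `done` mask, the `max_action` clamp, and the default `gamma` are assumptions added for the sketch; they are common implementation details but do not appear in the equations above. The defaults `sigma=0.2` and `noise_clip=0.5` match the values suggested in Exercise 2.

```python
import random


def td3_target(reward, next_state, done, target_actor, target_q1, target_q2, rng,
               gamma=0.99, sigma=0.2, noise_clip=0.5, max_action=1.0):
    """y = r + gamma * (1 - done) * min_i Q_i(s', a~'), with a~' a smoothed target action."""
    # Target policy smoothing: clipped Gaussian noise around the target actor's action.
    noise = max(-noise_clip, min(rng.gauss(0.0, sigma), noise_clip))
    smoothed = target_actor(next_state) + noise
    # Assumed detail: keep the smoothed action inside the valid action range.
    smoothed = max(-max_action, min(smoothed, max_action))
    # Clipped double Q-learning: take the smaller of the two target critics.
    q_min = min(target_q1(next_state, smoothed), target_q2(next_state, smoothed))
    return reward + gamma * (1.0 - done) * q_min


if __name__ == "__main__":
    rng = random.Random(0)
    y = td3_target(
        reward=1.0, next_state=1.0, done=0.0,
        target_actor=lambda s: 0.5 * s,
        target_q1=lambda s, a: s + a,
        target_q2=lambda s, a: s + a + 1.0,
        rng=rng,
    )
    print("TD3 target:", round(y, 4))
```

Both critics regress toward this single target, so the pessimism of the minimum is shared between them.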

TD3 pseudocode summary


Initialize μ_θ, Q_φ₁, Q_φ₂ and target networks μ_{θ⁻}, Q_{φ₁⁻}, Q_{φ₂⁻}
Initialize replay buffer D

For each step t:
  aₜ = μ_θ(sₜ) + ε,   ε ~ N(0, σ²)
  Execute aₜ, observe rₜ₊₁, sₜ₊₁
  Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D
  Sample mini-batch from D

  # Smoothed target action
  ã' = μ_{θ⁻}(s') + clip(N(0,σ̃²), -c, c)

  # Critic targets (clipped double Q)
  y = r + γ · min(Q_{φ₁⁻}(s', ã'), Q_{φ₂⁻}(s', ã'))

  # Update both critics
  Minimize (y - Q_φᵢ(s, a))² for i = 1, 2

  # Delayed actor update
  If t mod d == 0:
    Maximize Q_φ₁(s, μ_θ(s))  over θ
    Update target networks via Polyak averaging

Target Policy Smoothing (ã' = μ_{θ⁻}(s') + clip(N(0,σ̃²), -c, c)): We add clipped noise to the target action, preventing the critic from overfitting to the deterministic actor output and smoothing the value estimate.
Clipped Double Q-Learning (y = r + γ · min(Q_{φ₁⁻}(s', ã'), Q_{φ₂⁻}(s', ã'))): We evaluate the noisy target action with both target critics and take the minimum, drastically reducing the overestimation bias endemic to Q-learning.
Critic updates (Minimize (y - Q_φᵢ(s, a))² for i = 1, 2): Both critics are updated towards the same conservative target.
Delayed Policy Updates (If t mod d == 0): We only update the actor and target networks once every d critic updates, ensuring the actor optimizes against a stable and accurate critic.
Actor update (Maximize Q_φ₁(s, μ_θ(s))): The actor is updated by maximizing $Q_1$ . Only one critic is needed to provide the gradient for the actor.

Soft Actor–Critic (SAC)

SAC (Haarnoja et al., 2018) replaces the deterministic policy with a maximum entropy framework, changing the problem the actor-critic is solving at a fundamental level.

The maximum entropy objective

Standard RL maximizes expected return:

J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t r(s_t, a_t)\right]

Maximum entropy RL augments this with a policy entropy bonus at every timestep:

J_{\text{MaxEnt}}(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \Big(r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\Big)\right]

where $\mathcal{H}(\pi(\cdot|s)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s)]$ and $\alpha > 0$ is the temperature parameter controlling the entropy-reward tradeoff. This is not just reward shaping — it changes the definition of optimality.

The soft Bellman equation

The maximum entropy objective admits a modified Bellman equation. Define the soft Q-function $Q_{\text{soft}}^*(s,a)$ as the solution to:

Q_{\text{soft}}^*(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[V_{\text{soft}}^*(s')\right]

where the soft value function integrates over both reward and entropy:

V_{\text{soft}}^*(s) = \mathbb{E}_{a \sim \pi^*}\!\left[Q_{\text{soft}}^*(s,a) - \alpha \log \pi^*(a \mid s)\right] = \alpha \log \int \exp\!\left(\frac{Q_{\text{soft}}^*(s,a)}{\alpha}\right) da

The last equality (the log-sum-exp / soft maximum) follows from the closed-form optimal policy:

\boxed{\pi^*(a \mid s) = \frac{\exp(Q_{\text{soft}}^*(s,a)/\alpha)}{\int \exp(Q_{\text{soft}}^*(s,a')/\alpha)\, da'}}

The optimal maximum entropy policy is a Boltzmann distribution over the Q-function, with temperature $\alpha$. As $\alpha \to 0$, this collapses to a deterministic policy selecting the greedy action. As $\alpha \to \infty$, the policy approaches uniform over a bounded action space: all actions are equally likely. At intermediate $\alpha$, the policy concentrates on high-value actions while keeping nonzero probability on lower-value ones.

This is a principled closed-form result, not a heuristic: maximum entropy RL has a well-defined optimal policy, and SAC approximates it.
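The closed form is easy to verify numerically when the action space is finite, in which case the integral becomes a sum. The sketch below makes that discrete-action assumption purely for illustration (SAC itself targets continuous actions), and all function names are hypothetical.

```python
import math


def boltzmann_policy(q_values, alpha):
    """pi(a) proportional to exp(Q(a) / alpha), with a max-shift for numerical stability."""
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]


def soft_value(q_values, alpha):
    """alpha * log sum_a exp(Q(a) / alpha), the soft maximum of the Q-values."""
    q_max = max(q_values)
    return q_max + alpha * math.log(sum(math.exp((q - q_max) / alpha) for q in q_values))


def expected_q_plus_entropy(q_values, alpha):
    """E_{a ~ pi*}[Q(a) - alpha * log pi*(a)] under the Boltzmann policy."""
    pi = boltzmann_policy(q_values, alpha)
    return sum(p * (q - alpha * math.log(p)) for p, q in zip(pi, q_values) if p > 0.0)


if __name__ == "__main__":
    q = [1.0, 2.0, 0.5]
    for alpha in (1e-3, 0.7, 1e6):
        pi = [round(p, 4) for p in boltzmann_policy(q, alpha)]
        print(f"alpha={alpha:g}: pi={pi}, V_soft={soft_value(q, alpha):.4f}")
```

At small `alpha` nearly all probability sits on the highest-valued action, at large `alpha` the policy is close to uniform, and for any `alpha` the two expressions for the soft value agree.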

The reparameterization trick

PPO estimates the policy gradient via the score function (Week 7):

\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[f(a)] = \mathbb{E}_{a \sim \pi_\theta}[f(a)\, \nabla_\theta \log \pi_\theta(a)]

This estimator is high-variance because it weights each sampled action by the scalar $f(a)$ only and never uses the gradient of $f$ itself. SAC uses the reparameterization trick: instead of sampling $a \sim \pi_\theta(\cdot|s)$ directly, write:

a = f_\theta(s, \xi), \quad \xi \sim \mathcal{N}(0, I)

where $f_\theta(s,\xi) = \mu_\theta(s) + \sigma_\theta(s) \odot \xi$ (for diagonal Gaussian, typically with tanh squashing for bounded actions). Then:

\nabla_\theta \mathbb{E}_{\xi}\!\left[Q(s, f_\theta(s,\xi)) - \alpha \log \pi_\theta(f_\theta(s,\xi) \mid s)\right] = \mathbb{E}_{\xi}\!\left[\nabla_\theta \left(Q(s, f_\theta(s,\xi)) - \alpha \log \pi_\theta(f_\theta(s,\xi) \mid s)\right)\right]

The gradient now flows directly through $Q(s,a)$ with respect to $a = f_\theta(s,\xi)$, then through $f_\theta$ with respect to $\theta$. This is typically lower-variance than the score function estimator because, with the noise $\xi$ held fixed, the sample is a differentiable function of $\theta$ and the estimator uses $\nabla_a Q$ directly. The reparameterization trick is what lets SAC use a stochastic policy while keeping a pathwise gradient of the same form as the DPG theorem.
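The variance gap can be seen on a toy problem. As an illustrative stand-in for the SAC objective, take $f(a) = a^2$ with $a \sim \mathcal{N}(\mu, \sigma^2)$, so that $\mathbb{E}[f(a)] = \mu^2 + \sigma^2$ and the exact gradient with respect to $\mu$ is $2\mu$. Both estimators below are unbiased for that gradient; the function name and defaults are hypothetical.

```python
import random
import statistics


def gradient_estimates(mu=1.0, sigma=1.0, n=20000, seed=0):
    """Per-sample estimates of d/dmu E[a^2] for a ~ N(mu, sigma^2); exact value is 2 * mu."""
    rng = random.Random(seed)
    score, reparam = [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)
        a = mu + sigma * xi                          # reparameterized sample
        score.append(a * a * (a - mu) / sigma ** 2)  # f(a) * d log pi(a) / d mu
        reparam.append(2.0 * a)                      # f'(a) * da/dmu, with da/dmu = 1
    return score, reparam


if __name__ == "__main__":
    score, reparam = gradient_estimates()
    print("exact gradient:", 2.0)
    print("score function: mean", round(statistics.mean(score), 3),
          "variance", round(statistics.variance(score), 3))
    print("reparameterized: mean", round(statistics.mean(reparam), 3),
          "variance", round(statistics.variance(reparam), 3))
```

Both sample means sit near the exact gradient, but the per-sample variance of the score function estimator is several times larger in this example.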

Automatic entropy tuning

The temperature $\alpha$ controls the exploration-exploitation tradeoff: high $\alpha$ encourages diverse actions; low $\alpha$ focuses the policy. Manually tuning $\alpha$ as a hyperparameter requires task-specific knowledge. SAC automatically adjusts $\alpha$ by formulating it as a constrained optimization: maintain a target entropy $\mathcal{H}_{\text{target}}$ and find the $\alpha$ that achieves it.

Applying dual gradient descent (Lagrangian duality) to the constraint $\mathbb{E}[\mathcal{H}(\pi)] \geq \mathcal{H}_{\text{target}}$ :

\alpha^* = \arg\min_{\alpha \geq 0}\; \mathbb{E}_{a \sim \pi_\theta}\!\left[-\alpha \log \pi_\theta(a \mid s) - \alpha \mathcal{H}_{\text{target}}\right]

In practice, $\alpha$ is updated alongside the actor and critic via a separate gradient step:

\alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha\!\left[\alpha(-\log \pi_\theta(a \mid s) - \mathcal{H}_{\text{target}})\right]

If the current policy entropy is above $\mathcal{H}_{\text{target}}$, this decreases $\alpha$ (shrinking the entropy bonus); if below, it increases $\alpha$ (rewarding more exploration). The system self-regulates. A common choice is $\mathcal{H}_{\text{target}} = -\dim(\mathcal{A})$ (negative of action dimension), which works well across continuous control tasks without manual tuning.

This is the same Lagrangian dual structure from which Haarnoja et al. (2018) derive the SAC temperature update, and it is analogous to the automatic KL coefficient tuning in some RLHF implementations.
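A single temperature step can be sketched without any deep learning framework. The sketch optimizes $\log \alpha$ rather than $\alpha$ so that the temperature stays positive (the parameterization suggested in Exercise 3), estimates the entropy from a batch of sampled log-probabilities, and uses a plain gradient step; the function name and the plain-SGD update are assumptions made for illustration.

```python
import math


def temperature_step(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient step on J(alpha) = alpha * (entropy_estimate - target_entropy)."""
    alpha = math.exp(log_alpha)
    entropy_estimate = -sum(log_probs) / len(log_probs)  # batch estimate of H(pi)
    # Chain rule through alpha = exp(log_alpha): dJ/d(log_alpha) = alpha * (H - H_target).
    grad = alpha * (entropy_estimate - target_entropy)
    return log_alpha - lr * grad


if __name__ == "__main__":
    target = -1.0  # -dim(A) for a one-dimensional action space
    too_random = temperature_step(0.0, log_probs=[0.0, 0.0], target_entropy=target)
    too_peaked = temperature_step(0.0, log_probs=[3.0, 3.0], target_entropy=target)
    print("entropy above target, new log_alpha:", too_random)
    print("entropy below target, new log_alpha:", too_peaked)
```

When the estimated entropy exceeds the target the step lowers `log_alpha`, and when it falls short the step raises it, matching the self-regulating behaviour described above.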

SAC pseudocode


Initialize actor π_θ (stochastic: outputs μ_θ(s), σ_θ(s))
Initialize critics Q_φ₁, Q_φ₂ and target networks Q_{φ₁⁻}, Q_{φ₂⁻}
Initialize temperature α (or α-network for auto-tuning)
Initialize replay buffer D

For each step t:
  # Stochastic action via reparameterization
  aₜ = μ_θ(sₜ) + σ_θ(sₜ) ⊙ ξ,   ξ ~ N(0, I)
  Execute aₜ, observe rₜ₊₁, sₜ₊₁
  Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D

  Sample mini-batch from D
  Sample â' = μ_θ(s') + σ_θ(s') ⊙ ξ',  ξ' ~ N(0, I)

  # Critic targets (soft Bellman, clipped double Q)
  y = r + γ · (min(Q_{φ₁⁻}(s', â'), Q_{φ₂⁻}(s', â')) - α log π_θ(â'|s'))

  # Update both critics
  Minimize (y - Q_φᵢ(s, a))²  for i = 1, 2

  # Actor update (reparameterization trick)
  â = μ_θ(s) + σ_θ(s) ⊙ ξ
  Maximize  min(Q_φ₁(s, â), Q_φ₂(s, â)) - α log π_θ(â|s)  over θ

  # Temperature update (auto-tuning)
  Minimize  α(-log π_θ(â|s) - H_target)  over α

  # Soft target updates
  φᵢ⁻ ← τφᵢ + (1-τ)φᵢ⁻  for i = 1, 2

Note three connections to prior lectures: the clipped double Q target (Week 6: Double DQN), the soft Bellman entropy term $-\alpha \log \pi_\theta(\hat{a}'|s')$ in the critic target (the maximum entropy modification), and the reparameterization gradient in the actor update (as opposed to the score function estimator in PPO).

Algorithm synthesis

Expanded comparison table

| Algorithm | Policy | Exploration | Critic | Target network | Sample efficiency |
|---|---|---|---|---|---|
| PPO | Stochastic | Policy entropy | $V_\phi$ | None | Low (on-policy) |
| DDPG | Deterministic | External noise | $Q_\phi$ | Polyak | High |
| TD3 | Deterministic | External noise (target actions smoothed) | $Q_{\phi_1}, Q_{\phi_2}$ | Polyak | High |
| SAC | Stochastic (MaxEnt) | Intrinsic entropy | $Q_{\phi_1}, Q_{\phi_2}$ | Polyak | Highest |

PPO vs SAC: the robotics decision

In practice, the choice between PPO and SAC depends on the data collection regime:

Choose PPO when:

Fast parallel simulation is available (IsaacLab, MuJoCo with 4096 parallel envs). On-policy inefficiency matters little when simulation produces transitions orders of magnitude faster than real time. PPO with GAE achieves excellent results in legged locomotion (ANYmal, Unitree, Spot) precisely because massive parallelism compensates for sample inefficiency.
Reward shaping is complex. On-policy data always reflects the current policy, making reward shaping and curriculum learning easier to reason about. Off-policy replay buffers contain stale data from old policies, which can interact badly with changing reward functions.
Implementation simplicity matters. PPO has fewer moving parts than SAC — no automatic entropy tuning, no reparameterization, no double critics. RSL-RL's PPO implementation for legged locomotion is ~500 lines; a full SAC implementation is substantially more complex.

Choose SAC when:

Real-robot data collection is the bottleneck. SAC's replay buffer reuses every transition, typically achieving 3–10× better sample efficiency than PPO. For Franka arm manipulation or other real-hardware setups, this difference is decisive.
Exploration quality matters more than speed. SAC's entropy-driven exploration produces smoother, more diverse trajectories than $\epsilon$ -greedy or additive noise, which is particularly valuable for contact-rich manipulation tasks where the agent must discover narrow-margin contact modes.
Hyperparameter robustness is needed. Automatic entropy tuning makes SAC largely self-configuring on the temperature hyperparameter, reducing the tuning burden compared to PPO's entropy coefficient and GAE $\lambda$ .

The hybrid regime: many modern systems use PPO for initial training in simulation (fast, stable, handles complex reward shaping), then fine-tune on real hardware with SAC (sample efficient, smooth exploration). This sim-to-real workflow combines the strengths of both.

GenAI context: PPO for RLHF and SAC for structured generation

PPO in RLHF (recap from Week 7): the language model $\pi_\theta$ is the actor, the reward model $r_\phi$ provides the return signal, and the KL penalty against the SFT reference model prevents reward hacking. PPO's on-policy sampling is acceptable in this setting because the language model generates its own rollouts (completions), and on-policy data ensures the distribution of completions matches the current policy — important for RLHF where the reward model may be poorly calibrated outside the SFT distribution.

SAC-inspired ideas in GenAI: maximum entropy RL has influenced several GenAI directions:

Temperature sampling in language model decoding is the inference-time analog of the Boltzmann optimal policy $\pi^*(a|s) \propto \exp(Q/\alpha)$ — high temperature encourages diverse completions; low temperature is greedy.
Energy-based models for text generation model the distribution over sequences as $p(x) \propto \exp(r(x)/\alpha)$ , the same Boltzmann form as the SAC optimal policy.
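DPO and GRPO (next lecture) both remove the learned critic. DPO uses the closed-form Boltzmann solution of the KL-regularized objective to train directly on preference data, with no RL loop; GRPO keeps a PPO-style update but replaces the critic with a group-relative baseline.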

Key takeaways

The lecture develops the off-policy actor-critic family as a principled progression. DDPG combines the deterministic policy gradient theorem (gradient flows through $\nabla_a Q$ rather than $\nabla_\theta \log \pi$) with DQN's replay buffer and target networks, but suffers from overestimation, coupled instability, and sensitivity to hyperparameters. TD3 applies three targeted fixes: clipped double Q-learning reduces overestimation via the minimum of two independent critics (continuous-action Double DQN); delayed policy updates give the critics time to settle between actor updates; target policy smoothing prevents the critic from overfitting to sharp peaks at deterministic actions. SAC replaces the deterministic policy with the maximum entropy framework, whose optimal policy is provably a Boltzmann distribution over the soft Q-function. The reparameterization trick provides lower-variance gradients than score function estimation. Automatic entropy tuning via dual gradient descent replaces the hand-tuned temperature with a target entropy, for which $-\dim(\mathcal{A})$ is a common default.

The PPO vs SAC choice maps cleanly onto deployment context: PPO with fast parallel simulation for legged locomotion, SAC for real-robot data-scarce settings, with hybrid workflows bridging the two.

Conceptual questions

The deterministic policy gradient theorem gives $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \mu_\theta(s) \cdot \nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}]$ . Compare this with the stochastic policy gradient theorem from Week 7. In the stochastic case, where does the gradient come from? In the deterministic case, what must be true about the critic $Q_\phi(s,a)$ for this gradient to be well-defined? Why does DDPG use a Q-critic rather than a V-critic?
TD3 uses the minimum of two Q-estimates as the critic target: $y = r + \gamma \min_{i=1,2} Q_{\phi_i^-}(s', \tilde{a}')$ . Using the minimum rather than the mean introduces a downward bias. Explain why downward bias is less harmful than upward bias in the actor-critic setting by tracing the feedback loop that upward bias (overestimation) creates between the actor and critic.
SAC's optimal policy is $\pi^*(a|s) \propto \exp(Q_{\text{soft}}^*(s,a)/\alpha)$ . Show what happens to this policy as $\alpha \to 0$ and as $\alpha \to \infty$ . Then explain: if $\alpha$ is too large during training, what behavior do you expect from the agent? If $\alpha$ is too small, what problem arises? How does automatic entropy tuning resolve this without manual intervention?
Compare exploration mechanisms across the three off-policy algorithms: DDPG adds external Gaussian noise to the deterministic action; TD3 adds smoothed noise to the target policy (not the behavior policy); SAC's stochastic policy provides intrinsic exploration. For a manipulation task requiring precise contact (narrow-margin task), which exploration mechanism is most appropriate and why? What happens to DDPG-style exploration when the task reward requires very precise actions?
You are training a quadruped locomotion policy. You have two setups: (A) IsaacLab with 4096 parallel simulation environments, each running at 100 Hz; (B) a single physical Unitree Go1 robot collecting ~1 Hz data due to safety stops. For each setup, argue which algorithm — PPO or SAC — is more appropriate, citing sample efficiency, exploration quality, and implementation complexity. For setup B, what modifications to the standard SAC algorithm would you make to handle the safety constraints of real hardware?

Solutions

DPG vs SPG. The stochastic gradient comes from the score function $\nabla_\theta\log\pi(a|s)\,Q$ , integrating over the action distribution; the deterministic gradient flows through the critic by the chain rule, $\nabla_\theta\mu_\theta(s)\,\nabla_a Q^\mu(s,a)$ . For DPG to be well-defined $Q_\phi(s,a)$ must be differentiable in $a$ . DDPG uses a Q-critic, not a V-critic, because the actor update needs $\nabla_a Q(s,a)$ — the action-gradient of value — which $V(s)$ cannot provide.
TD3 minimum. Overestimation is self-amplifying: the actor maximizes $Q$ , steering toward actions the critic overrates, which the critic then bootstraps on, inflating targets further — a positive feedback loop. Underestimation is benign because the actor simply avoids under-rated actions, so the loop is self-limiting. Taking the min of two critics deliberately biases downward to break the overestimation loop.
SAC temperature. As $\alpha\to0$ , $\pi^*\propto\exp(Q/\alpha)$ becomes the greedy $\arg\max$ (pure exploitation); as $\alpha\to\infty$ it becomes uniform (pure exploration, ignoring $Q$ ). Too-large $\alpha$ makes the agent act randomly and never exploit reward; too-small $\alpha$ under-explores and converges prematurely. Automatic entropy tuning adjusts $\alpha$ by dual gradient descent toward a target entropy, balancing the two without a manual schedule.
Exploration mechanisms. DDPG adds external Gaussian action noise; TD3 adds smoothing noise to the target action (critic regularization, not behavior exploration); SAC explores intrinsically through its stochastic policy. For precise, narrow-margin contact, SAC is best because its learned, state-dependent stochasticity can shrink where precision is needed — whereas DDPG's fixed external noise perturbs even precise actions, knocking the gripper off the narrow target and degrading the very precision the task demands.
Quadruped setups. (A) 4096 parallel envs yield abundant cheap on-policy data, favoring PPO — on-policy, stable, scales with parallelism, simple. (B) a single robot at ~1 Hz is severely sample-limited, favoring SAC — off-policy and replay-efficient. For (B), add safety: a CBF/shield on actions, Lagrangian constrained RL for the safety stops, a lower entropy target near joint limits, action smoothing, and offline pretraining to avoid hardware-damaging exploration.

Implementation exercises

Exercise 1: DDPG core loop

Implement the DDPG training loop in Python using a deep learning framework of your choice. Start from the pseudocode above. Key implementation details:

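Critic architecture: The critic takes both state and action as input and outputs a scalar Q-value. The original DDPG paper feeds the action in after the first hidden layer rather than at the input; concatenating state and action at the input is a simpler alternative that many later implementations use.
Exploration noise: Start with simple Gaussian noise with standard deviation 0.1, $\epsilon \sim \mathcal{N}(0, 0.1^2)$, scaled to the action range. It performs comparably to Ornstein-Uhlenbeck noise and is simpler to debug.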
Polyak averaging: Use $\tau = 0.005$ for target network updates. Verify that target_actor.parameters() track actor.parameters() with a small lag.
Training ratio: One gradient step per environment step. Use a replay buffer of size $10^6$ and batch size 64.
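The Polyak update and the exploration noise from the list above can be sketched without committing to a framework. This is a minimal illustration that uses flat Python lists as stand-ins for network weights; `polyak_update`, `noisy_action`, and the action bounds of $[-1, 1]$ are assumptions for the example. In a real implementation the same update would run over the framework's parameter tensors.

```python
import random

def polyak_update(target_params, online_params, tau=0.005):
    """Soft update in place: target <- tau * online + (1 - tau) * target."""
    for i, (t, o) in enumerate(zip(target_params, online_params)):
        target_params[i] = tau * o + (1.0 - tau) * t

def noisy_action(action, sigma=0.1, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise (std sigma) per dimension, then clip to bounds."""
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in action]

# Hypothetical flat parameter vectors standing in for network weights.
actor = [1.0, -2.0, 0.5]
target_actor = [0.0, 0.0, 0.0]
for _ in range(1000):
    polyak_update(target_actor, actor)

# With the online weights held fixed, the target closes the gap geometrically.
print(target_actor)
print(noisy_action([0.95, -0.2]))
```

With $\tau = 0.005$ the remaining gap shrinks by a factor of $0.995$ per step, which is the "small lag" the verification step should reveal.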

Test on Pendulum-v1: The agent should solve it within about 100 training episodes, where "solved" means an average return above −200 measured over 100 evaluation episodes.

Exercise 2: Upgrade DDPG to TD3#

Starting from your DDPG implementation, add the three TD3 fixes one at a time, testing after each:

Add a second critic network ( $Q_{\phi_2}$ ). Use $\min(Q_1, Q_2)$ for target computation. Track the Q-value gap $\max(Q_1, Q_2) - \min(Q_1, Q_2)$ during training — it should shrink as both critics converge.
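Delay the actor update to every 2 critic steps ( $d = 2$ ). Because the TD3 and DDPG actors are deterministic, policy entropy is not defined for them; instead plot the actor loss, or the mean critic value $Q_{\phi_1}(s, \mu_\theta(s))$ on a fixed batch of states, over time. A stable TD3 run should show less oscillation than DDPG.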
Add target policy smoothing with $\sigma = 0.2$ , $c = 0.5$ . Compare critic gradient variance with and without smoothing.
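Two of the three fixes, clipped double Q-learning and target policy smoothing, meet in the target computation. The sketch below shows that target for a single transition. The networks are passed in as plain callables, and the discount $\gamma = 0.99$ , the `done` flag, and the action bounds are assumptions not fixed by the exercise text.

```python
import random

def clip(x, low, high):
    return min(high, max(low, x))

def td3_target(reward, next_state, done, target_actor, target_q1, target_q2,
               gamma=0.99, sigma=0.2, noise_clip=0.5,
               act_low=-1.0, act_high=1.0, rng=random):
    """TD3 regression target for one transition (illustrative sketch)."""
    # Target policy smoothing: clipped Gaussian noise on the target action.
    next_action = []
    for a in target_actor(next_state):
        eps = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
        next_action.append(clip(a + eps, act_low, act_high))
    # Clipped double Q-learning: use the smaller of the two target critics.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    # Assumed convention: no bootstrapping past terminal states.
    return reward + gamma * (1.0 - float(done)) * q_min

# Toy stand-ins (hypothetical) for the target networks.
toy_actor = lambda s: [0.0]
toy_q1 = lambda s, a: 10.0
toy_q2 = lambda s, a: 8.0
y = td3_target(1.0, [0.0], False, toy_actor, toy_q1, toy_q2)
print(y)
```

Both critics regress toward the same target `y`; the delayed actor update is a scheduling choice in the training loop and does not appear here.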

Test on HalfCheetah-v4 (MuJoCo): TD3 should outperform DDPG by approximately 30% in final return.

Exercise 3: SAC with automatic entropy tuning#

Implement SAC extending your TD3 codebase. Key additions:

Stochastic actor: Outputs $\mu_\theta(s)$ and $\log \sigma_\theta(s)$ for a diagonal Gaussian. Apply tanh squashing for bounded actions (add log-probability correction for the Jacobian of tanh).
Reparameterization trick: Sample $a = \tanh(\mu_\theta(s) + \sigma_\theta(s) \odot \xi)$ with $\xi \sim \mathcal{N}(0, I)$ . The gradient flows through $Q(s, a)$ via autodiff — not through a score function estimator.
Soft Bellman target: $y = r + \gamma(\min_i Q_{\phi_i^-}(s', a') - \alpha \log \pi_\theta(a'|s'))$ , where $a' \sim \pi_\theta(\cdot|s')$ is a fresh sample from the current policy, not the action stored in the replay buffer.
Automatic entropy tuning: Optimize $\log \alpha$ rather than $\alpha$ so that $\alpha = \exp(\log \alpha)$ stays positive. Target entropy $\mathcal{H}_{\text{target}} = -\dim(\mathcal{A})$ . Update $\alpha$ with a separate optimizer (learning rate $3 \cdot 10^{-4}$ ).
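The tanh log-probability correction and the temperature update are the two pieces most often implemented incorrectly, so a framework-free sketch may help. It assumes a diagonal Gaussian given by lists `mu` and `log_std`, uses a plain gradient step where the exercise asks for a separate optimizer, and takes the temperature loss as $J(\alpha) = -\alpha\,\mathbb{E}[\log \pi + \mathcal{H}_{\text{target}}]$ , which is one common variant. All function names are illustrative.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sample_squashed_gaussian(mu, log_std, rng=random):
    """Sample a = tanh(mu + sigma * xi); return (a, log pi(a|s))."""
    action, log_prob = [], 0.0
    for m, ls in zip(mu, log_std):
        xi = rng.gauss(0.0, 1.0)
        u = m + math.exp(ls) * xi  # pre-squash sample (reparameterized)
        # Gaussian log-density of u, using (u - m) / sigma = xi.
        log_prob += -0.5 * xi * xi - ls - 0.5 * math.log(2.0 * math.pi)
        # Change of variables: subtract log(1 - tanh(u)^2), written in the
        # stable form 2 * (log 2 - u - softplus(-2u)).
        log_prob -= 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
        action.append(math.tanh(u))
    return action, log_prob

def alpha_step(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on J(alpha), taken with respect to log_alpha."""
    alpha = math.exp(log_alpha)
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_gap  # dJ / d(log_alpha)
    return log_alpha - lr * grad

# Hypothetical 2-D action space, so the target entropy is -dim(A) = -2.
batch = [sample_squashed_gaussian([0.0, 0.5], [-1.0, -1.0]) for _ in range(64)]
log_alpha = alpha_step(0.0, [lp for _, lp in batch], target_entropy=-2.0)
print(math.exp(log_alpha))
```

When the sampled entropy estimate $-\mathbb{E}[\log \pi]$ falls below the target, the step raises $\alpha$ ; when it exceeds the target, the step lowers it.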

Test on Ant-v4 (MuJoCo): SAC should achieve approximately 6000 return in 1M steps, demonstrating 3–5× better sample efficiency than TD3.

Extension prompts#

Continuous-action Rainbow: Combine TD3's clipped double Q with SAC's maximum-entropy framework and automatic temperature tuning. How does this hybrid perform on Humanoid-v4 compared to either algorithm alone? What does the entropy curve reveal about the benefit of stochastic exploration in high-dimensional action spaces?
Offline RL via CQL: Collect 1M transitions from a partially trained SAC agent. Train a new policy from this fixed dataset using Conservative Q-Learning (CQL), which penalizes overestimated Q-values for out-of-distribution actions. Compare with standard SAC trained on the same offline data. Which shows less extrapolation error?
Sim-to-real gap: Train a policy in MuJoCo, deploy it in a different physics engine (e.g., PyBullet or Isaac Sim), and measure the performance gap. Explore domain randomization (varying mass, friction, joint damping ranges) as a mitigation — how much randomization is needed before the policy transfers without fine-tuning?
