Week 6: Deep Q-Learning and Variants

Purpose of this lecture#

In Week 5, we crossed the critical boundary from tabular RL to function approximation and saw that naïvely combining Q-learning with neural networks leads to instability, divergence, and pathological behavior due to the deadly triad.

This lecture introduces Deep Q-Networks (DQN) — the first algorithm to successfully stabilize value-based RL with neural networks at scale. DQN does not introduce a new theoretical framework. It succeeds through a bundle of stabilizing engineering constraints, each of which is a direct and identifiable response to a specific failure mode from Week 5.

Understanding DQN means understanding the mapping from failure → fix. The extensions that follow — Double DQN, Dueling Networks, Prioritized Replay, Distributional RL — each address a residual failure that the base DQN bundle does not fully eliminate.

From Q-learning to Deep Q-learning#

Recall the Q-learning update:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

In DQN, $Q(s,a)$ is replaced by a neural network $Q_\theta(s,a)$ . The training objective minimizes the squared TD error:

\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[ \Bigl( \underbrace{r + \gamma \max_{a'} Q_{\theta^-}(s',a')}_{\text{target } y} - Q_\theta(s,a) \Bigr)^2 \right]

where $\mathcal{D}$ is a replay buffer and $\theta^-$ is a separate target network.

DQN's historical breakthrough was learning to play Atari 2600 games directly from raw pixels using convolutional neural networks, achieving superhuman performance on many games from a single algorithm and hyperparameter set. This was the first demonstration that deep neural networks and RL could be combined successfully at scale.

Without additional structure, however, this combination is unstable for the reasons established in Week 5. We now examine each stabilizing mechanism in detail.

The complete DQN algorithm#

Before examining each component, it helps to see the full algorithm:

code

Initialize Q_θ with random weights θ
Initialize target network Q_{θ⁻} with weights θ⁻ ← θ
Initialize replay buffer D with capacity N

For each episode:
  Initialize state s₀

  For each step t:
    # Action selection (ε-greedy)
    With probability ε: select random action aₜ
    Otherwise:          select aₜ = argmax_a Q_θ(sₜ, a)

    # Environment step
    Execute aₜ, observe rₜ₊₁, sₜ₊₁

    # Store transition
    Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D

    # Sample and train (once |D| ≥ batch_size)
    Sample random mini-batch {(sⱼ, aⱼ, rⱼ, s'ⱼ)} from D

    Compute targets:
      yⱼ = rⱼ + γ max_{a'} Q_{θ⁻}(s'ⱼ, a')   if s'ⱼ not terminal
      yⱼ = rⱼ                                   if s'ⱼ terminal

    Update θ by SGD on Σⱼ (yⱼ - Q_θ(sⱼ, aⱼ))²

    # Periodic target network update
    Every C steps: θ⁻ ← θ

Target network initialization (θ⁻ ← θ): An exact copy of the main network — provides stable Bellman backup targets that don't shift every gradient step.
Replay buffer (capacity N): Stores raw transitions up to massive capacity (e.g., 1 million frames), enabling offline-like sampling that decorrelates sequential experience.
Deferred storage (Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D): Transitions are stored without immediately learning from them, breaking the sequential coupling that destabilizes online Q-learning.
Random mini-batch sampling: Decorrelates the training data and stabilizes stochastic gradient descent — each update sees a mix of old and recent experience.
Frozen target values (Q_{θ⁻}): Targets $y_j$ use frozen parameters $\theta^-$ , not the current $\theta$ being updated — removing the feedback loop that causes divergence.
Periodic target update (Every C steps: θ⁻ ← θ): The target network is refreshed only every $C$ steps, ensuring $y_j$ stays stationary long enough for the critic to fit against it.

Every non-obvious line in this algorithm corresponds to a deliberate design choice that addresses a specific failure mode. The rest of the lecture explains those choices.

Experience replay#

The failure mode it addresses#

Sequential RL transitions are highly temporally correlated. The transition $(s_t, a_t, r_{t+1}, s_{t+1})$ and the next transition $(s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2})$ share a state. In an Atari game, consecutive frames differ by a single timestep and are nearly identical observations. SGD assumes mini-batches are i.i.d. samples from a fixed distribution — training on consecutive transitions violates both assumptions simultaneously.

The effect is analogous to training a classifier by showing it the same image 50 times before showing any other image: gradient estimates are low-variance but systematically biased toward recent experience, and the network rapidly overfits to recent transitions while forgetting earlier ones. In RL, this manifests as oscillating value estimates and policy instability.

A second, related problem is non-stationarity of the training distribution: as the policy improves, the distribution of visited states shifts. Training exclusively on on-policy data means the network adapts to the current policy's narrow distribution and may catastrophically forget how to handle states it no longer frequently visits.

The fix: replay buffer#

Store transitions in a fixed-capacity buffer $\mathcal{D}$ and train on randomly sampled mini-batches:

(s_j, a_j, r_j, s'_j) \sim \text{Uniform}(\mathcal{D})

Random sampling breaks temporal correlations: the mini-batch contains transitions from many different timepoints and behavioral contexts, restoring approximate i.i.d. structure. The buffer also provides a broader, more stationary training distribution that includes past experience, reducing catastrophic forgetting.

The off-policy justification#

Sampling from a replay buffer containing transitions from older policy versions makes DQN off-policy: the data was collected under different (older) policies than the current one. The theoretical justification for this is Q-learning's off-policy convergence property from Week 4 — Q-learning converges to $Q^*$ regardless of the behavior policy, provided every state-action pair is covered. Transitions from old policies satisfy this coverage requirement as long as the buffer is large enough and diverse enough. (This replay buffer idea is the foundational principle behind offline robot learning in Course 2, where collected demonstrations from a teacher policy are replayed to learn robotic skills — the same insight that experience is reusable, independent of when it was collected.)

The capacity tradeoff#

Buffer capacity is a practical hyperparameter with real consequences:

Too small: the buffer retains mostly recent, correlated transitions — similar to on-policy learning with weaker correlation reduction.
Too large: the buffer retains very old transitions collected under a much earlier, worse policy. High-priority states from the current policy are diluted by irrelevant old data, slowing learning.

DQN uses a buffer of 1 million transitions — large enough to cover many episodes but small enough that recent data dominates. Prioritized experience replay (discussed below) addresses the inefficiency of uniform sampling from a large buffer.

Target networks#

The failure mode it addresses#

With a standard neural network, the training loss is:

\mathcal{L}(\theta) = \mathbb{E}\left[\bigl(r + \gamma\max_{a'} Q_\theta(s',a') - Q_\theta(s,a)\bigr)^2\right]

Both the target $y(\theta) = r + \gamma\max_{a'} Q_\theta(s',a')$ and the prediction $Q_\theta(s,a)$ are functions of the same $\theta$ . When $\theta$ is updated to reduce $Q_\theta(s,a)$ toward $y(\theta)$ , the target $y(\theta)$ itself shifts — you are chasing a moving target that moves every time you take a step toward it.

This instability is not merely inconvenient. In the worst case it is divergent: the update can move $Q_\theta(s,a)$ toward a target that has already moved further away, producing a feedback loop that amplifies errors. This is the moving bootstrap target problem identified in Week 5.

The fix: frozen target network#

Maintain two networks:

Online network $Q_\theta$ : updated at every training step via gradient descent.
Target network $Q_{\theta^-}$ : updated periodically by copying $\theta^- \leftarrow \theta$ every $C$ steps (typically $C \approx 10{,}000$ ).

Targets are computed using the frozen network:

y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')

Between target network updates, $\theta^-$ is fixed. The loss $\mathcal{L}(\theta) = \mathbb{E}[(y - Q_\theta(s,a))^2]$ is then a standard supervised regression loss with fixed labels $y$ — this is stable by the same theory as standard supervised learning. The moving target problem is converted into a sequence of stable supervised regression problems, each running for $C$ steps.

Hard update vs Polyak averaging#

DQN uses a hard update: copy $\theta \to \theta^-$ every $C$ steps. An alternative is Polyak averaging (soft update):

\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-, \quad \tau \ll 1

applied at every step. Soft updates produce a target network that tracks the online network slowly and smoothly, avoiding the discontinuous jumps of hard updates. Soft updates are standard in continuous control algorithms (DDPG, TD3, SAC) where the smoother target dynamics improve stability. Hard updates remain common in discrete-action DQN variants.

Connection to the deadly triad#

Freezing the target network effectively removes bootstrapping from the instability triangle for $C$ steps: the target is treated as a fixed constant, so the update behaves like supervised regression on fixed labels rather than self-referential bootstrapping. This is the precise mechanism by which the target network addresses one leg of the deadly triad — it does not eliminate bootstrapping, but it converts continuous bootstrapping into windowed supervised regression. (The same target network mechanism appears in Course 2's actor-critic robot controllers, where the frozen reference policy prevents unsafe policy drift during learning from demonstrations.)

Reward clipping#

The problem it solves#

Large or task-varying reward magnitudes cause unstable gradients and make it impossible to use the same hyperparameters across tasks with different reward scales.

The fix and its costs#

DQN clips rewards to $[-1, +1]$ :

r \leftarrow \text{clip}(r, -1, 1)

This stabilizes optimization across Atari games with different scoring systems — the gradient magnitude is bounded regardless of whether a game awards 1 point or 10,000 points per event.

However, reward clipping has a serious cost that deserves explicit treatment:

It distorts the optimal policy. After clipping, a reward of $+100$ and a reward of $+1$ are treated identically. A strategy that earns occasional rewards of $+100$ (clipped to $+1$ ) may be objectively superior to a strategy that earns many rewards of $+1$ — but after clipping, the agent cannot distinguish them. The optimal policy under clipped rewards may differ from the optimal policy under true rewards.

Reward normalization is generally preferable. Rather than clipping, maintain a running estimate of the reward mean and variance and normalize:

r_{\text{normalized}} = \frac{r - \hat{\mu}_r}{\hat{\sigma}_r + \epsilon}

Normalization preserves the relative ordering of rewards (and thus the true optimal policy) while bounding gradient magnitudes. It is the standard approach in modern implementations and should be preferred over clipping when reward statistics can be estimated online.

Reward clipping is retained in the DQN literature primarily for reproducibility across the Atari benchmark, not because it is theoretically sound.

Failure modes of base DQN#

Despite the three stabilizing mechanisms, base DQN still suffers from:

Overestimation bias: the $\max$ operator over noisy Q-values systematically inflates targets (derived in Week 5).
Uniform replay inefficiency: all transitions are sampled equally regardless of their learning value.
Value-advantage conflation: the Q-function must jointly represent state value and action advantage, making learning inefficient in states where actions are similar.
Mean-only estimation: learning only $\mathbb{E}[G]$ discards distributional information that could improve stability and enable risk-sensitive behavior.

Each residual failure motivates one of the extensions below.

Double DQN#

The failure mode#

The Q-learning target $r + \gamma\max_{a'} Q_{\theta^-}(s',a')$ selects and evaluates the best next action using the same network $Q_{\theta^-}$ . As derived in Week 5, this produces systematic overestimation: the network that selects the best action is the same network whose noise inflated that action's value, so the selected action is biased upward.

The fix#

Decouple action selection from action evaluation using different networks:

y = r + \gamma\, Q_{\theta^-}\!\left(s',\; \underbrace{\arg\max_{a'} Q_\theta(s', a')}_{\text{online selects}}\right)

The online network $Q_\theta$ selects which action appears best. The target network $Q_{\theta^-}$ evaluates how good that action actually is. Since $\theta$ and $\theta^-$ are updated at different rates, their noise is partially decorrelated — the action inflated by $\theta$ 's noise is unlikely to also be inflated by $\theta^-$ 's noise. The upward bias is reduced without adding any new parameters or computational cost beyond what DQN already requires.

In DQN, the target network is already maintained for stability. Double DQN is simply a change to which network computes the $\arg\max$ in the target — a one-line modification with meaningful empirical benefits.

Dueling networks#

The insight#

Not all states require distinguishing between actions. In a racing game on a straight section, all steering actions have similar value — what matters most is estimating how good the overall state is, not which specific action to take. In a dangerous intersection, the action choice is critical. Standard Q-learning conflates these two sources of value and must learn them jointly from the same update signal.

The decomposition#

Dueling networks decompose the Q-function as:

Q(s, a;\, \theta, \alpha, \beta) = V(s;\, \theta, \beta) + \left[A(s, a;\, \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a';\, \theta, \alpha)\right]

where:

$V(s;\, \theta, \beta)$ is the state-value stream — a scalar measuring how good state $s$ is regardless of action.
$A(s, a;\, \theta, \alpha)$ is the advantage stream — how much better action $a$ is relative to the average action in state $s$ .
The mean subtraction enforces identifiability: without it, $V$ and $A$ are underdetermined (adding a constant to $V$ and subtracting it from all $A$ values leaves $Q$ unchanged). Subtracting the mean advantage forces $\sum_{a'} A(s,a') = 0$ , making the decomposition unique.

Why this helps#

The value stream $V(s)$ can be updated from any experience in state $s$ , regardless of which action was taken. The advantage stream $A(s,a)$ is only updated for the specific action taken. In states where actions have similar values (small variance in $A$ ), the value stream dominates and receives more effective updates — the network learns faster because each transition provides useful information about $V$ even when the specific action choice was uninformative.

The two streams share a common feature extraction trunk (the convolutional layers in Atari) and split only in the final layers, adding negligible computational cost.

Dueling network architecture diagram: shared trunk splits to V and A streams

Prioritized experience replay#

The failure mode#

Uniform random sampling from the replay buffer treats all transitions equally. A transition where the agent already has an accurate estimate (small TD error $|\delta_i|$ ) contributes little learning signal. A transition where the estimate is badly wrong (large $|\delta_i|$ ) carries substantial new information. Sampling them with equal probability wastes compute on low-information transitions.

The fix: priority-proportional sampling#

Sample transition $i$ with probability proportional to its TD error:

P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}, \quad p_i = |\delta_i| + \epsilon

where $\alpha \in [0,1]$ controls the degree of prioritization ( $\alpha = 0$ is uniform sampling; $\alpha = 1$ is fully greedy) and $\epsilon > 0$ prevents transitions with zero TD error from never being sampled.

The importance sampling correction#

Prioritized sampling changes the distribution over transitions from uniform to $P(i)$ . This introduces a bias: transitions with low TD error are undersampled and under-represented in gradient estimates. To correct this and restore unbiased gradient estimates, each transition is weighted by its importance sampling weight:

w_i = \left(\frac{1}{N \cdot P(i)}\right)^\beta

where $N$ is the buffer size and $\beta \in [0,1]$ controls the correction strength. The update for transition $i$ becomes:

\theta \leftarrow \theta + \alpha\, w_i\, \delta_i\, \nabla_\theta Q_\theta(s_i, a_i)

In practice, $\beta$ is annealed from a small initial value (e.g., 0.4) to 1 over the course of training. Early in training, when estimates are noisy and TD errors are large, full importance sampling correction is unnecessary and adds variance. As training stabilizes and the distribution of priorities narrows, full correction ( $\beta = 1$ ) restores unbiasedness.

This is the same importance sampling ratio from Week 4 applied to the replay buffer — the priority distribution $P(i)$ plays the role of the behavior policy, and the uniform distribution plays the role of the target policy. Correcting for the mismatch requires reweighting by $1/P(i)$ , exactly as in off-policy importance sampling.

Distributional reinforcement learning#

The insight#

Standard Q-learning learns the expected return $\mathbb{E}[G]$ . But the return $G$ is a random variable, and its full distribution can carry information that the mean discards.

Consider two state-action pairs with the same expected return of 5: one always returns exactly 5 (zero variance), another returns 0 or 10 with equal probability (high variance). A risk-neutral agent is indifferent; a risk-averse agent prefers the first. The mean alone cannot distinguish them, but the return distribution can.

C51#

C51 (Bellemare et al., 2017) represents the return distribution as a categorical distribution over a fixed support $\{z_1, z_2, \ldots, z_K\}$ (e.g., 51 atoms evenly spaced between $V_{\min}$ and $V_{\max}$ ):

Z(s,a) \approx \sum_{k=1}^K p_k(s,a;\theta)\, \delta_{z_k}

where $p_k(s,a;\theta)$ is the predicted probability that the return falls in bin $k$ . The distributional Bellman operator projects the shifted distribution $r + \gamma Z(s',a')$ onto the fixed support and trains via cross-entropy loss, providing a richer gradient signal than scalar TD error.

The key empirical finding: distributional methods significantly outperform mean-only methods on the Atari benchmark, even for tasks where risk sensitivity seems irrelevant. The richer gradient signal from matching the full distribution — rather than just the mean — stabilizes training and improves the learned representations. (This distributional view is essential for Course 4's RLHF pipelines, where reward model uncertainty directly translates to policy conservatism: in regions where the reward model is uncertain, the agent learns to be cautious.)

QR-DQN (Dabney et al., 2018) extends this by representing the distribution via quantile regression, removing the need for a fixed support and improving approximation quality.

Rainbow: the full combination#

Rainbow DQN (Hessel et al., 2017) combines six improvements into a single agent: Double DQN, Prioritized Replay, Dueling Networks, Multi-step returns, Distributional RL, and Noisy Networks (learned exploration noise replacing $\epsilon$ -greedy).

The ablation study in the Rainbow paper provides a direct empirical answer to which components matter most:

| Component removed | Performance drop | |---|---| | Prioritized replay | Largest drop | | Multi-step returns | Second largest drop | | Distributional RL | Third | | Double DQN | Moderate | | Dueling networks | Moderate | | Noisy Networks | Smallest |

Key finding: prioritized replay and multi-step returns are the most important individual components. The components are largely complementary — removing any one degrades performance, and their combination exceeds what any subset achieves. Rainbow held the state-of-the-art on Atari for over a year and remains the standard reference for what a well-engineered DQN variant can achieve.

Failure modes and limits of DQN#

Even with all extensions, the DQN family has fundamental limitations:

Discrete actions only: the $\max_{a'} Q(s',a')$ operation requires enumerating all actions. DQN cannot be directly applied to continuous action spaces (robot joint torques, steering angles), which motivates policy gradient methods (Week 7).
Sample inefficiency: DQN requires tens of millions of environment steps for Atari. Model-based methods (Week 9) and off-policy actor-critics (Week 8) can achieve comparable performance with orders of magnitude less data.
Representational collapse: neural network Q-functions can suffer from degradation of the learned features over long training runs — a phenomenon distinct from the instabilities addressed by target networks and replay buffers.

GenAI context: DQN ideas in RLHF#

DQN itself is rarely applied to language models, but its stabilizing ideas appear throughout RLHF pipelines:

Replay buffers → offline RL and dataset reuse: storing and resampling experience from past policy versions is the foundation of offline RL (Week 10) and is how RLHF pipelines reuse preference data collected under earlier model checkpoints.
Target networks → frozen reference models: RLHF freezes a reference model $\pi_{\text{ref}}$ and penalizes KL divergence from it. This serves the same stabilizing role as the target network — it prevents the policy from moving too far from a stable anchor.
Overestimation bias → reward model conservatism: the same max-over-noisy-estimates problem appears in reward models trained on limited preference data. Overestimated reward scores produce reward hacking; techniques like reward model ensembles and conservative reward penalties address the same underlying bias.
Distributional views → uncertainty estimation: distributional RL's representation of return variance informs uncertainty-aware reward modeling in RLHF, enabling models to be more conservative in high-uncertainty regions.

Key takeaways#

DQN's success follows a clear logic: identify the failure modes precisely (from Week 5), then engineer targeted fixes. Experience replay breaks temporal correlations and provides a stable, broad training distribution, justified by Q-learning's off-policy convergence property. Target networks convert continuous bootstrapping into windowed supervised regression, directly addressing the moving target leg of the deadly triad. Reward clipping stabilizes gradients but distorts the optimal policy — normalization is preferable when feasible.

The extensions address residual failures. Double DQN decouples selection from evaluation to reduce overestimation bias. Dueling networks separate state value from action advantage, accelerating learning in states where action choice is uninformative, with mean advantage subtraction enforcing identifiability. Prioritized replay focuses learning on high-error transitions using priority-proportional sampling, corrected by importance sampling weights to prevent bias. Distributional RL (C51, QR-DQN) learns the full return distribution rather than the mean, providing richer gradient signal. Rainbow combines all six and shows through ablation that prioritized replay and multi-step returns contribute most.

The fundamental limitation — discrete action spaces — is what motivates the shift to policy gradient methods in Week 7.

Conceptual questions#

Without a replay buffer, DQN trains on consecutive transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ , $(s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2})$ , etc. Explain precisely why this violates the assumptions of SGD, why the resulting gradient estimates are biased, and how replay buffer sampling restores approximate i.i.d. structure. Why does the off-policy nature of replay buffer sampling not invalidate Q-learning's convergence guarantees?
The target network is updated every $C = 10{,}000$ steps by hard copying $\theta^- \leftarrow \theta$ . In between updates, the target $y = r + \gamma \max_{a'} Q_{\theta^-}(s',a')$ is fixed. Explain why this converts bootstrapped TD learning into a sequence of supervised regression problems. What is the tradeoff between small $C$ (frequent updates) and large $C$ (infrequent updates)?
A dueling network decomposes $Q(s,a) = V(s) + A(s,a)$ . Without the mean-subtraction normalization $A(s,a) \leftarrow A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')$ , the decomposition is not uniquely identified. Give a concrete example showing two different $(V, A)$ pairs that produce the same $Q$ values, and explain why the mean subtraction removes this ambiguity.
Prioritized experience replay samples transition $i$ with probability $P(i) \propto |\delta_i|^\alpha$ . Without importance sampling corrections, this introduces a bias in the gradient estimates. Write out the importance sampling weight $w_i$ and explain what distribution it is correcting from and to. Why is $\beta$ annealed from less than 1 to 1 over training rather than set to 1 from the start?
You are applying a DQN variant to an RLHF pipeline where the language model policy is fine-tuned using Q-values over the token vocabulary ( $|\mathcal{A}| \approx 50{,}000$ ). Identify two specific failure modes from this lecture that are likely to be most severe at this action-space scale. For each, state which DQN extension addresses it and whether that extension is directly applicable to the language setting or requires modification.

Solutions

Replay buffer. Consecutive transitions are strongly temporally correlated and non-stationary, violating SGD's i.i.d. assumption, so gradient estimates are biased and updates become unstable. Random replay sampling breaks the correlation, restoring approximate i.i.d. structure. Off-policy sampling does not invalidate Q-learning because the update bootstraps toward $\max_{a'}Q(s',a')$ regardless of which behavior policy generated the data — Q-learning is off-policy by construction.
Target network. Freezing $\theta^-$ for $C$ steps makes $y = r + \gamma\max_{a'}Q_{\theta^-}(s',a')$ a fixed target, turning the update into a stationary supervised regression for that window instead of chasing a moving bootstrapped target. Small $C$ keeps targets current but reintroduces moving-target oscillation/instability; large $C$ gives stable targets but stale value estimates, slowing learning.
Dueling identifiability. $Q=V+A$ is invariant under $(V+c,\,A-c)$ . For example $V=5,\,A=[0,2]$ and $V=6,\,A=[-1,1]$ both give $Q=[5,7]$ . Mean subtraction forces $\sum_a A(s,a)=0$ , pinning the gauge so $V$ is uniquely the mean $Q$ and $A$ the centered advantages — removing the additive ambiguity.
PER importance sampling. $w_i = (N\,P(i))^{-\beta}$ (normalized by $\max_j w_j$ ). It corrects from the priority-proportional sampling distribution $P(i)\propto|\delta_i|^\alpha$ back to the uniform distribution the loss expectation assumes. $\beta$ is annealed from below 1 to 1 because early training is noisy and non-stationary, where full correction adds variance; partial correction stabilizes early learning, and $\beta\to1$ ensures the final estimates are unbiased.
DQN at vocabulary scale. (a) Max-operator overestimation is amplified across ~50,000 noisy action-values; Double DQN addresses it and transfers directly (decouple action selection from evaluation). (b) Exploration/coverage and computing $\max_a Q$ over 50,000 tokens are intractable, so value-based control scales poorly here — dueling/distributional improve estimation but the action-space scale ultimately favors policy-gradient methods, so this extension requires substantial modification (RLHF normally uses PPO instead).

Coding exercise: Mini DQN on CartPole#

Implement a minimal DQN agent for OpenAI Gym's CartPole-v1 environment using PyTorch. The goal is to reproduce the three core stabilizing mechanisms from this lecture in code.

Setup (provided):

python · runs in browser

import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

env = gym.make("CartPole-v1")
state_dim  = env.observation_space.shape[0]   # 4
action_dim = env.action_space.n               # 2

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
        )
    def forward(self, x):
        return self.net(x)

Part 1 — Replay buffer. Implement a ReplayBuffer class with push(state, action, reward, next_state, done) and sample(batch_size) methods. Store transitions in a deque with maxlen. Return numpy arrays (or torch tensors) for states, actions, rewards, next_states, and dones.

Part 2 — Target network. After training the online network $Q_\theta$ for $C = 1{,}000$ steps, hard-copy its weights to the target network $Q_{\theta^-}$ . Verify by comparing torch.equal(online.state_dict()['net.0.weight'], target.state_dict()['net.0.weight']) before and after the copy.

Part 3 — ε-greedy action selection and training loop. Implement the full DQN training loop:

Select actions $a = \arg\max_a Q_\theta(s, a)$ with probability $1 - \varepsilon$ , else random.
Compute target values $y_j = r_j + \gamma \max_{a'} Q_{\theta^-}(s'_j, a')$ using the frozen target network (or $y_j = r_j$ if terminal).
Update $\theta$ via MSE loss $\frac{1}{B}\sum_j (y_j - Q_\theta(s_j, a_j))^2$ .
Anneal $\varepsilon$ from 1.0 to 0.01 over the first 10% of episodes.

Expected result: With $\gamma = 0.99$ , batch size 64, learning rate $10^{-3}$ , buffer capacity 10,000, and $C = 1{,}000$ , the agent should solve CartPole (average return ≥ 195 over 100 consecutive episodes) within approximately 500 episodes.

Extension: Replace the hard target network update with Polyak (soft) averaging — $\theta^- \leftarrow \tau \theta + (1 - \tau) \theta^-$ with $\tau = 0.005$ , applied every step. Compare convergence speed and stability against the hard update baseline.

Looking ahead#

DQN and its variants require discrete, enumerable action spaces. The next lecture introduces policy gradient methods, which optimize policies directly and naturally extend to continuous action spaces — unlocking robotics, continuous control, and the modern RLHF alignment pipeline. The key insight is a shift from learning $Q^*$ and deriving the policy implicitly, to learning $\pi_\theta$ directly by following the gradient of expected return.

Purpose of this lecture#

From Q-learning to Deep Q-learning#

Recall the Q-learning update:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

In DQN, $Q(s,a)$ is replaced by a neural network $Q_\theta(s,a)$ . The training objective minimizes the squared TD error:

\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[ \Bigl( \underbrace{r + \gamma \max_{a'} Q_{\theta^-}(s',a')}_{\text{target } y} - Q_\theta(s,a) \Bigr)^2 \right]

where $\mathcal{D}$ is a replay buffer and $\theta^-$ is a separate target network.

Without additional structure, however, this combination is unstable for the reasons established in Week 5. We now examine each stabilizing mechanism in detail.

The complete DQN algorithm#

Before examining each component, it helps to see the full algorithm:

code

Initialize Q_θ with random weights θ
Initialize target network Q_{θ⁻} with weights θ⁻ ← θ
Initialize replay buffer D with capacity N

For each episode:
  Initialize state s₀

  For each step t:
    # Action selection (ε-greedy)
    With probability ε: select random action aₜ
    Otherwise:          select aₜ = argmax_a Q_θ(sₜ, a)

    # Environment step
    Execute aₜ, observe rₜ₊₁, sₜ₊₁

    # Store transition
    Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D

    # Sample and train (once |D| ≥ batch_size)
    Sample random mini-batch {(sⱼ, aⱼ, rⱼ, s'ⱼ)} from D

    Compute targets:
      yⱼ = rⱼ + γ max_{a'} Q_{θ⁻}(s'ⱼ, a')   if s'ⱼ not terminal
      yⱼ = rⱼ                                   if s'ⱼ terminal

    Update θ by SGD on Σⱼ (yⱼ - Q_θ(sⱼ, aⱼ))²

    # Periodic target network update
    Every C steps: θ⁻ ← θ

Target network initialization (θ⁻ ← θ): An exact copy of the main network — provides stable Bellman backup targets that don't shift every gradient step.
Replay buffer (capacity N): Stores raw transitions up to massive capacity (e.g., 1 million frames), enabling offline-like sampling that decorrelates sequential experience.
Deferred storage (Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D): Transitions are stored without immediately learning from them, breaking the sequential coupling that destabilizes online Q-learning.
Random mini-batch sampling: Decorrelates the training data and stabilizes stochastic gradient descent — each update sees a mix of old and recent experience.
Frozen target values (Q_{θ⁻}): Targets $y_j$ use frozen parameters $\theta^-$ , not the current $\theta$ being updated — removing the feedback loop that causes divergence.
Periodic target update (Every C steps: θ⁻ ← θ): The target network is refreshed only every $C$ steps, ensuring $y_j$ stays stationary long enough for the critic to fit against it.

Every non-obvious line in this algorithm corresponds to a deliberate design choice that addresses a specific failure mode. The rest of the lecture explains those choices.

Experience replay#

The failure mode it addresses#

The fix: replay buffer#

Store transitions in a fixed-capacity buffer $\mathcal{D}$ and train on randomly sampled mini-batches:

(s_j, a_j, r_j, s'_j) \sim \text{Uniform}(\mathcal{D})

The off-policy justification#

The capacity tradeoff#

Buffer capacity is a practical hyperparameter with real consequences:

Too small: the buffer retains mostly recent, correlated transitions — similar to on-policy learning with weaker correlation reduction.
Too large: the buffer retains very old transitions collected under a much earlier, worse policy. High-priority states from the current policy are diluted by irrelevant old data, slowing learning.

Target networks#

The failure mode it addresses#

With a standard neural network, the training loss is:

\mathcal{L}(\theta) = \mathbb{E}\left[\bigl(r + \gamma\max_{a'} Q_\theta(s',a') - Q_\theta(s,a)\bigr)^2\right]

The fix: frozen target network#

Maintain two networks:

Online network $Q_\theta$ : updated at every training step via gradient descent.
Target network $Q_{\theta^-}$ : updated periodically by copying $\theta^- \leftarrow \theta$ every $C$ steps (typically $C \approx 10{,}000$ ).

Targets are computed using the frozen network:

y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')

Hard update vs Polyak averaging#

DQN uses a hard update: copy $\theta \to \theta^-$ every $C$ steps. An alternative is Polyak averaging (soft update):

\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-, \quad \tau \ll 1

Connection to the deadly triad#

Reward clipping#

The problem it solves#

Large or task-varying reward magnitudes cause unstable gradients and make it impossible to use the same hyperparameters across tasks with different reward scales.

The fix and its costs#

DQN clips rewards to $[-1, +1]$ :

r \leftarrow \text{clip}(r, -1, 1)

This stabilizes optimization across Atari games with different scoring systems — the gradient magnitude is bounded regardless of whether a game awards 1 point or 10,000 points per event.

However, reward clipping has a serious cost that deserves explicit treatment:

Reward normalization is generally preferable. Rather than clipping, maintain a running estimate of the reward mean and variance and normalize:

r_{\text{normalized}} = \frac{r - \hat{\mu}_r}{\hat{\sigma}_r + \epsilon}

Reward clipping is retained in the DQN literature primarily for reproducibility across the Atari benchmark, not because it is theoretically sound.

Failure modes of base DQN#

Despite the three stabilizing mechanisms, base DQN still suffers from:

Overestimation bias: the $\max$ operator over noisy Q-values systematically inflates targets (derived in Week 5).
Uniform replay inefficiency: all transitions are sampled equally regardless of their learning value.
Value-advantage conflation: the Q-function must jointly represent state value and action advantage, making learning inefficient in states where actions are similar.
Mean-only estimation: learning only $\mathbb{E}[G]$ discards distributional information that could improve stability and enable risk-sensitive behavior.

Each residual failure motivates one of the extensions below.

Double DQN#

The failure mode#

The fix#

Decouple action selection from action evaluation using different networks:

y = r + \gamma\, Q_{\theta^-}\!\left(s',\; \underbrace{\arg\max_{a'} Q_\theta(s', a')}_{\text{online selects}}\right)

Dueling networks#

The insight#

The decomposition#

Dueling networks decompose the Q-function as:

Q(s, a;\, \theta, \alpha, \beta) = V(s;\, \theta, \beta) + \left[A(s, a;\, \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a';\, \theta, \alpha)\right]

where:

$V(s;\, \theta, \beta)$ is the state-value stream — a scalar measuring how good state $s$ is regardless of action.
$A(s, a;\, \theta, \alpha)$ is the advantage stream — how much better action $a$ is relative to the average action in state $s$ .
The mean subtraction enforces identifiability: without it, $V$ and $A$ are underdetermined (adding a constant to $V$ and subtracting it from all $A$ values leaves $Q$ unchanged). Subtracting the mean advantage forces $\sum_{a'} A(s,a') = 0$ , making the decomposition unique.

Why this helps#

The two streams share a common feature extraction trunk (the convolutional layers in Atari) and split only in the final layers, adding negligible computational cost.

Prioritized experience replay#

The failure mode#

The fix: priority-proportional sampling#

Sample transition $i$ with probability proportional to its TD error:

P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}, \quad p_i = |\delta_i| + \epsilon

The importance sampling correction#

w_i = \left(\frac{1}{N \cdot P(i)}\right)^\beta

where $N$ is the buffer size and $\beta \in [0,1]$ controls the correction strength. The update for transition $i$ becomes:

\theta \leftarrow \theta + \alpha\, w_i\, \delta_i\, \nabla_\theta Q_\theta(s_i, a_i)

Distributional reinforcement learning#

The insight#

Standard Q-learning learns the expected return $\mathbb{E}[G]$ . But the return $G$ is a random variable, and its full distribution can carry information that the mean discards.

C51#

Z(s,a) \approx \sum_{k=1}^K p_k(s,a;\theta)\, \delta_{z_k}

QR-DQN (Dabney et al., 2018) extends this by representing the distribution via quantile regression, removing the need for a fixed support and improving approximation quality.

Rainbow: the full combination#

The ablation study in the Rainbow paper provides a direct empirical answer to which components matter most:

Failure modes and limits of DQN#

Even with all extensions, the DQN family has fundamental limitations:

Discrete actions only: the $\max_{a'} Q(s',a')$ operation requires enumerating all actions. DQN cannot be directly applied to continuous action spaces (robot joint torques, steering angles), which motivates policy gradient methods (Week 7).
Sample inefficiency: DQN requires tens of millions of environment steps for Atari. Model-based methods (Week 9) and off-policy actor-critics (Week 8) can achieve comparable performance with orders of magnitude less data.
Representational collapse: neural network Q-functions can suffer from degradation of the learned features over long training runs — a phenomenon distinct from the instabilities addressed by target networks and replay buffers.

GenAI context: DQN ideas in RLHF#

DQN itself is rarely applied to language models, but its stabilizing ideas appear throughout RLHF pipelines:

Replay buffers → offline RL and dataset reuse: storing and resampling experience from past policy versions is the foundation of offline RL (Week 10) and is how RLHF pipelines reuse preference data collected under earlier model checkpoints.
Target networks → frozen reference models: RLHF freezes a reference model $\pi_{\text{ref}}$ and penalizes KL divergence from it. This serves the same stabilizing role as the target network — it prevents the policy from moving too far from a stable anchor.
Overestimation bias → reward model conservatism: the same max-over-noisy-estimates problem appears in reward models trained on limited preference data. Overestimated reward scores produce reward hacking; techniques like reward model ensembles and conservative reward penalties address the same underlying bias.
Distributional views → uncertainty estimation: distributional RL's representation of return variance informs uncertainty-aware reward modeling in RLHF, enabling models to be more conservative in high-uncertainty regions.

Key takeaways#

The fundamental limitation — discrete action spaces — is what motivates the shift to policy gradient methods in Week 7.

Conceptual questions#

Without a replay buffer, DQN trains on consecutive transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ , $(s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2})$ , etc. Explain precisely why this violates the assumptions of SGD, why the resulting gradient estimates are biased, and how replay buffer sampling restores approximate i.i.d. structure. Why does the off-policy nature of replay buffer sampling not invalidate Q-learning's convergence guarantees?
The target network is updated every $C = 10{,}000$ steps by hard copying $\theta^- \leftarrow \theta$ . In between updates, the target $y = r + \gamma \max_{a'} Q_{\theta^-}(s',a')$ is fixed. Explain why this converts bootstrapped TD learning into a sequence of supervised regression problems. What is the tradeoff between small $C$ (frequent updates) and large $C$ (infrequent updates)?
A dueling network decomposes $Q(s,a) = V(s) + A(s,a)$ . Without the mean-subtraction normalization $A(s,a) \leftarrow A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')$ , the decomposition is not uniquely identified. Give a concrete example showing two different $(V, A)$ pairs that produce the same $Q$ values, and explain why the mean subtraction removes this ambiguity.
Prioritized experience replay samples transition $i$ with probability $P(i) \propto |\delta_i|^\alpha$ . Without importance sampling corrections, this introduces a bias in the gradient estimates. Write out the importance sampling weight $w_i$ and explain what distribution it is correcting from and to. Why is $\beta$ annealed from less than 1 to 1 over training rather than set to 1 from the start?
You are applying a DQN variant to an RLHF pipeline where the language model policy is fine-tuned using Q-values over the token vocabulary ( $|\mathcal{A}| \approx 50{,}000$ ). Identify two specific failure modes from this lecture that are likely to be most severe at this action-space scale. For each, state which DQN extension addresses it and whether that extension is directly applicable to the language setting or requires modification.

Solutions

Replay buffer. Consecutive transitions are strongly temporally correlated and non-stationary, violating SGD's i.i.d. assumption, so gradient estimates are biased and updates become unstable. Random replay sampling breaks the correlation, restoring approximate i.i.d. structure. Off-policy sampling does not invalidate Q-learning because the update bootstraps toward $\max_{a'}Q(s',a')$ regardless of which behavior policy generated the data — Q-learning is off-policy by construction.
Target network. Freezing $\theta^-$ for $C$ steps makes $y = r + \gamma\max_{a'}Q_{\theta^-}(s',a')$ a fixed target, turning the update into a stationary supervised regression for that window instead of chasing a moving bootstrapped target. Small $C$ keeps targets current but reintroduces moving-target oscillation/instability; large $C$ gives stable targets but stale value estimates, slowing learning.
Dueling identifiability. $Q=V+A$ is invariant under $(V+c,\,A-c)$ . For example $V=5,\,A=[0,2]$ and $V=6,\,A=[-1,1]$ both give $Q=[5,7]$ . Mean subtraction forces $\sum_a A(s,a)=0$ , pinning the gauge so $V$ is uniquely the mean $Q$ and $A$ the centered advantages — removing the additive ambiguity.
PER importance sampling. $w_i = (N\,P(i))^{-\beta}$ (normalized by $\max_j w_j$ ). It corrects from the priority-proportional sampling distribution $P(i)\propto|\delta_i|^\alpha$ back to the uniform distribution the loss expectation assumes. $\beta$ is annealed from below 1 to 1 because early training is noisy and non-stationary, where full correction adds variance; partial correction stabilizes early learning, and $\beta\to1$ ensures the final estimates are unbiased.
DQN at vocabulary scale. (a) Max-operator overestimation is amplified across ~50,000 noisy action-values; Double DQN addresses it and transfers directly (decouple action selection from evaluation). (b) Exploration/coverage and computing $\max_a Q$ over 50,000 tokens are intractable, so value-based control scales poorly here — dueling/distributional improve estimation but the action-space scale ultimately favors policy-gradient methods, so this extension requires substantial modification (RLHF normally uses PPO instead).

Coding exercise: Mini DQN on CartPole#

Implement a minimal DQN agent for OpenAI Gym's CartPole-v1 environment using PyTorch. The goal is to reproduce the three core stabilizing mechanisms from this lecture in code.

Setup (provided):

python · runs in browser

import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

env = gym.make("CartPole-v1")
state_dim  = env.observation_space.shape[0]   # 4
action_dim = env.action_space.n               # 2

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
        )
    def forward(self, x):
        return self.net(x)

Part 3 — ε-greedy action selection and training loop. Implement the full DQN training loop:

Select actions $a = \arg\max_a Q_\theta(s, a)$ with probability $1 - \varepsilon$ , else random.
Compute target values $y_j = r_j + \gamma \max_{a'} Q_{\theta^-}(s'_j, a')$ using the frozen target network (or $y_j = r_j$ if terminal).
Update $\theta$ via MSE loss $\frac{1}{B}\sum_j (y_j - Q_\theta(s_j, a_j))^2$ .
Anneal $\varepsilon$ from 1.0 to 0.01 over the first 10% of episodes.

Purpose of this lecture#

From Q-learning to Deep Q-learning#

The complete DQNDeep Q-Network algorithm#

Experience replay#

The failure mode it addresses#

The fix: replay buffer#

The off-policy justification#

The capacity tradeoff#

Target networks#

The failure mode it addresses#

The fix: frozen target network#

Hard update vs Polyak averaging#

Connection to the deadly triad#

Reward clipping#

The problem it solves#

The fix and its costs#

Failure modes of base DQNDeep Q-Network#

Double DQNDeep Q-Network#

The failure mode#

The fix#

Dueling networks#

The insight#

The decomposition#

Why this helps#

Prioritized experience replay#

The failure mode#

The fix: priority-proportional sampling#

The importance sampling correction#

Distributional reinforcement learning#

The insight#

C51#

Rainbow: the full combination#

Failure modes and limits of DQNDeep Q-Network#

GenAI context: DQNDeep Q-Network ideas in RLHFReinforcement Learning from Human Feedback#

Key takeaways#

Conceptual questions#

Coding exercise: Mini DQN on CartPole#

Looking ahead#

Further reading#

Week 6: Deep Q-Learning and Variants

Purpose of this lecture#

From Q-learning to Deep Q-learning#

The complete DQNDeep Q-Network algorithm#

Experience replay#

The failure mode it addresses#

The fix: replay buffer#

The off-policy justification#

The capacity tradeoff#

Target networks#

The failure mode it addresses#

The fix: frozen target network#

Hard update vs Polyak averaging#

Connection to the deadly triad#

Reward clipping#

The problem it solves#

The fix and its costs#

Failure modes of base DQNDeep Q-Network#

Double DQNDeep Q-Network#

The failure mode#

The fix#

Dueling networks#

The insight#

The decomposition#

Why this helps#

Prioritized experience replay#

The failure mode#

The fix: priority-proportional sampling#

The importance sampling correction#

Distributional reinforcement learning#

The insight#

C51#

Rainbow: the full combination#

Failure modes and limits of DQNDeep Q-Network#

GenAI context: DQNDeep Q-Network ideas in RLHFReinforcement Learning from Human Feedback#

Key takeaways#

Conceptual questions#

Coding exercise: Mini DQN on CartPole#

Looking ahead#

Further reading#

The complete DQN algorithm#

Failure modes of base DQN#

Double DQN#

Failure modes and limits of DQN#

GenAI context: DQN ideas in RLHF#

The complete DQN algorithm#

Failure modes of base DQN#

Double DQN#

Failure modes and limits of DQN#

GenAI context: DQN ideas in RLHF#