Skip to main content
illumin8
Courses
Week 6: Deep Q-Learning and Variants
Reinforcement Learning
01Week 1: Reinforcement Learning Problem Formulation
02Week 2: Multi-Armed Bandits
03Week 3: Dynamic Programming for Finite MDPs
04Week 4: Monte Carlo and Temporal-Difference Learning
05Week 5: Function Approximation in Reinforcement Learning
06Week 6: Deep Q-Learning and Variants
07Week 7: Policy Gradient and Actor–Critic Methods
08Week 8: Modern Deep Reinforcement Learning Algorithms
09Week 9: Exploration, Partial Observability, and Multi-Agent Reinforcement Learning
10Week 10: Model-Based Reinforcement Learning and Planning
11Week 11: Offline Reinforcement Learning
12Week 12: Reinforcement Learning from Human Feedback
13Week 13: Direct Preference Optimization and GRPO
14Week 14: Agentic Systems and Course Capstone
Week 6

Week 6: Deep Q-Learning and Variants

✦Learning Outcomes
  • Understand experience replay and target networks as responses to the deadly triad
  • Compare Double DQNDeep Q-Network, Dueling DQNDeep Q-Network, Prioritized Replay, and Distributional RLReinforcement Learning
  • Analyze the limitations of value-based methods and how Rainbow combines complementary improvements
  • Connect DQNDeep Q-Network to modern value-based RLHFReinforcement Learning from Human Feedback approaches
◆Prerequisites
  • Week 5: Function approximation, deadly triad, overestimation bias
  • Week 4: TDTemporal Difference learning, Q-learning

Recommended: Review Week 5 sections on "The deadly triad" and "Why DQNDeep Q-Network works" before proceeding.

◆Grounded In
  • Week 5 (rl-w5): The deadly triad — function approximation, bootstrapping, and off-policy learning — and why naively combining Q-learning with neural networks diverges. This is the precise motivation for DQN's stabilizing mechanisms.
  • Week 4 (rl-w4): Q-learning's off-policy convergence property — Q-learning converges to (Q^*) regardless of the behavior policy — which justifies experience replay sampling from old policies.
  • Week 3 (rl-w3): Markov Decision Processes and the Bellman optimality equation, which provides the recursive target structure DQN approximates.

Purpose of this lecture

In Week 5, we crossed the critical boundary from tabular RLReinforcement Learning to function approximation and saw that naïvely combining Q-learning with neural networks leads to instability, divergence, and pathological behavior due to the deadly triad.

This lecture introduces Deep Q-Networks (DQNDeep Q-Network) — the first algorithm to successfully stabilize value-based RLReinforcement Learning with neural networks at scale. DQNDeep Q-Network does not introduce a new theoretical framework. It succeeds through a bundle of stabilizing engineering constraints, each of which is a direct and identifiable response to a specific failure mode from Week 5.

Understanding DQNDeep Q-Network means understanding the mapping from failure → fix. The extensions that follow — Double DQNDeep Q-Network, Dueling Networks, Prioritized Replay, Distributional RLReinforcement Learning — each address a residual failure that the base DQNDeep Q-Network bundle does not fully eliminate.


From Q-learning to Deep Q-learning

Recall the Q-learning update:

Q(st,at)←Q(st,at)+α[rt+1+γmax⁡a′Q(st+1,a′)−Q(st,at)]Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]Q(st​,at​)←Q(st​,at​)+α[rt+1​+γa′max​Q(st+1​,a′)−Q(st​,at​)]

In DQNDeep Q-Network, Q(s,a)Q(s,a)Q(s,a) is replaced by a neural network Qθ(s,a)Q_\theta(s,a)Qθ​(s,a). The training objective minimizes the squared TDTemporal Difference error:

L(θ)=E(s,a,r,s′)∼D[(r+γmax⁡a′Qθ−(s′,a′)⏟target y−Qθ(s,a))2]\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[ \Bigl( \underbrace{r + \gamma \max_{a'} Q_{\theta^-}(s',a')}_{\text{target } y} - Q_\theta(s,a) \Bigr)^2 \right]L(θ)=E(s,a,r,s′)∼D​​(target yr+γa′max​Qθ−​(s′,a′)​​−Qθ​(s,a))2​

where D\mathcal{D}D is a replay buffer and θ−\theta^-θ− is a separate target network.

DQNDeep Q-Network's historical breakthrough was learning to play Atari 2600 games directly from raw pixels using convolutional neural networks, achieving superhuman performance on many games from a single algorithm and hyperparameter set. This was the first demonstration that deep neural networks and RLReinforcement Learning could be combined successfully at scale.

Without additional structure, however, this combination is unstable for the reasons established in Week 5. We now examine each stabilizing mechanism in detail.


The complete DQNDeep Q-Network algorithm

Before examining each component, it helps to see the full algorithm:

Initialize Q_θ with random weights θ
Initialize target network Q_{θ⁻} with weights θ⁻ ← θ
Initialize replay buffer D with capacity N

For each episode:
  Initialize state s₀

  For each step t:
    # Action selection (ε-greedy)
    With probability ε: select random action aₜ
    Otherwise:          select aₜ = argmax_a Q_θ(sₜ, a)

    # Environment step
    Execute aₜ, observe rₜ₊₁, sₜ₊₁

    # Store transition
    Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D

    # Sample and train (once |D| ≥ batch_size)
    Sample random mini-batch {(sⱼ, aⱼ, rⱼ, s'ⱼ)} from D

    Compute targets:
      yⱼ = rⱼ + γ max_{a'} Q_{θ⁻}(s'ⱼ, a')   if s'ⱼ not terminal
      yⱼ = rⱼ                                   if s'ⱼ terminal

    Update θ by SGD on Σⱼ (yⱼ - Q_θ(sⱼ, aⱼ))²

    # Periodic target network update
    Every C steps: θ⁻ ← θ
  1. Target network initialization (θ⁻ ← θ): An exact copy of the main network — provides stable Bellman backup targets that don't shift every gradient step.
  2. Replay buffer (capacity N): Stores raw transitions up to massive capacity (e.g., 1 million frames), enabling offline-like sampling that decorrelates sequential experience.
  3. Deferred storage (Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D): Transitions are stored without immediately learning from them, breaking the sequential coupling that destabilizes online Q-learning.
  4. Random mini-batch sampling: Decorrelates the training data and stabilizes stochastic gradient descent — each update sees a mix of old and recent experience.
  5. Frozen target values (Q_{θ⁻}): Targets yjy_jyj​ use frozen parameters θ−\theta^-θ−, not the current θ\thetaθ being updated — removing the feedback loop that causes divergence.
  6. Periodic target update (Every C steps: θ⁻ ← θ): The target network is refreshed only every CCC steps, ensuring yjy_jyj​ stays stationary long enough for the critic to fit against it.

Every non-obvious line in this algorithm corresponds to a deliberate design choice that addresses a specific failure mode. The rest of the lecture explains those choices.


Experience replay

The failure mode it addresses

Sequential RLReinforcement Learning transitions are highly temporally correlated. The transition (st,at,rt+1,st+1)(s_t, a_t, r_{t+1}, s_{t+1})(st​,at​,rt+1​,st+1​) and the next transition (st+1,at+1,rt+2,st+2)(s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2})(st+1​,at+1​,rt+2​,st+2​) share a state. In an Atari game, consecutive frames differ by a single timestep and are nearly identical observations. SGD assumes mini-batches are i.i.d. samples from a fixed distribution — training on consecutive transitions violates both assumptions simultaneously.

The effect is analogous to training a classifier by showing it the same image 50 times before showing any other image: gradient estimates are low-variance but systematically biased toward recent experience, and the network rapidly overfits to recent transitions while forgetting earlier ones. In RLReinforcement Learning, this manifests as oscillating value estimates and policy instability.

A second, related problem is non-stationarity of the training distribution: as the policy improves, the distribution of visited states shifts. Training exclusively on on-policy data means the network adapts to the current policy's narrow distribution and may catastrophically forget how to handle states it no longer frequently visits.

The fix: replay buffer

Store transitions in a fixed-capacity buffer D\mathcal{D}D and train on randomly sampled mini-batches:

(sj,aj,rj,sj′)∼Uniform(D)(s_j, a_j, r_j, s'_j) \sim \text{Uniform}(\mathcal{D})(sj​,aj​,rj​,sj′​)∼Uniform(D)

Random sampling breaks temporal correlations: the mini-batch contains transitions from many different timepoints and behavioral contexts, restoring approximate i.i.d. structure. The buffer also provides a broader, more stationary training distribution that includes past experience, reducing catastrophic forgetting.

The off-policy justification

Sampling from a replay buffer containing transitions from older policy versions makes DQNDeep Q-Network off-policy: the data was collected under different (older) policies than the current one. The theoretical justification for this is Q-learning's off-policy convergence property from Week 4 — Q-learning converges to Q∗Q^*Q∗ regardless of the behavior policy, provided every state-action pair is covered. Transitions from old policies satisfy this coverage requirement as long as the buffer is large enough and diverse enough. (This replay buffer idea is the foundational principle behind offline robot learning in Course 2, where collected demonstrations from a teacher policy are replayed to learn robotic skills — the same insight that experience is reusable, independent of when it was collected.)

The capacity tradeoff

Buffer capacity is a practical hyperparameter with real consequences:

  • Too small: the buffer retains mostly recent, correlated transitions — similar to on-policy learning with weaker correlation reduction.
  • Too large: the buffer retains very old transitions collected under a much earlier, worse policy. High-priority states from the current policy are diluted by irrelevant old data, slowing learning.

DQNDeep Q-Network uses a buffer of 1 million transitions — large enough to cover many episodes but small enough that recent data dominates. Prioritized experience replay (discussed below) addresses the inefficiency of uniform sampling from a large buffer.


Target networks

The failure mode it addresses

With a standard neural network, the training loss is:

L(θ)=E[(r+γmax⁡a′Qθ(s′,a′)−Qθ(s,a))2]\mathcal{L}(\theta) = \mathbb{E}\left[\bigl(r + \gamma\max_{a'} Q_\theta(s',a') - Q_\theta(s,a)\bigr)^2\right]L(θ)=E[(r+γa′max​Qθ​(s′,a′)−Qθ​(s,a))2]

Both the target y(θ)=r+γmax⁡a′Qθ(s′,a′)y(\theta) = r + \gamma\max_{a'} Q_\theta(s',a')y(θ)=r+γmaxa′​Qθ​(s′,a′) and the prediction Qθ(s,a)Q_\theta(s,a)Qθ​(s,a) are functions of the same θ\thetaθ. When θ\thetaθ is updated to reduce Qθ(s,a)Q_\theta(s,a)Qθ​(s,a) toward y(θ)y(\theta)y(θ), the target y(θ)y(\theta)y(θ) itself shifts — you are chasing a moving target that moves every time you take a step toward it.

This instability is not merely inconvenient. In the worst case it is divergent: the update can move Qθ(s,a)Q_\theta(s,a)Qθ​(s,a) toward a target that has already moved further away, producing a feedback loop that amplifies errors. This is the moving bootstrap target problem identified in Week 5.

The fix: frozen target network

Maintain two networks:

  • Online network QθQ_\thetaQθ​: updated at every training step via gradient descent.
  • Target network Qθ−Q_{\theta^-}Qθ−​: updated periodically by copying θ−←θ\theta^- \leftarrow \thetaθ−←θ every CCC steps (typically C≈10,000C \approx 10{,}000C≈10,000).

Targets are computed using the frozen network:

y=r+γmax⁡a′Qθ−(s′,a′)y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')y=r+γa′max​Qθ−​(s′,a′)

Between target network updates, θ−\theta^-θ− is fixed. The loss L(θ)=E[(y−Qθ(s,a))2]\mathcal{L}(\theta) = \mathbb{E}[(y - Q_\theta(s,a))^2]L(θ)=E[(y−Qθ​(s,a))2] is then a standard supervised regression loss with fixed labels yyy — this is stable by the same theory as standard supervised learning. The moving target problem is converted into a sequence of stable supervised regression problems, each running for CCC steps.

Hard update vs Polyak averaging

DQNDeep Q-Network uses a hard update: copy θ→θ−\theta \to \theta^-θ→θ− every CCC steps. An alternative is Polyak averaging (soft update):

θ−←τθ+(1−τ)θ−,τ≪1\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-, \quad \tau \ll 1θ−←τθ+(1−τ)θ−,τ≪1

applied at every step. Soft updates produce a target network that tracks the online network slowly and smoothly, avoiding the discontinuous jumps of hard updates. Soft updates are standard in continuous control algorithms (DDPGDeep Deterministic Policy Gradient, TD3, SACSoft Actor-Critic) where the smoother target dynamics improve stability. Hard updates remain common in discrete-action DQNDeep Q-Network variants.

Connection to the deadly triad

Freezing the target network effectively removes bootstrapping from the instability triangle for CCC steps: the target is treated as a fixed constant, so the update behaves like supervised regression on fixed labels rather than self-referential bootstrapping. This is the precise mechanism by which the target network addresses one leg of the deadly triad — it does not eliminate bootstrapping, but it converts continuous bootstrapping into windowed supervised regression. (The same target network mechanism appears in Course 2's actor-critic robot controllers, where the frozen reference policy prevents unsafe policy drift during learning from demonstrations.)


Reward clipping

The problem it solves

Large or task-varying reward magnitudes cause unstable gradients and make it impossible to use the same hyperparameters across tasks with different reward scales.

The fix and its costs

DQNDeep Q-Network clips rewards to [−1,+1][-1, +1][−1,+1]:

r←clip(r,−1,1)r \leftarrow \text{clip}(r, -1, 1)r←clip(r,−1,1)

This stabilizes optimization across Atari games with different scoring systems — the gradient magnitude is bounded regardless of whether a game awards 1 point or 10,000 points per event.

However, reward clipping has a serious cost that deserves explicit treatment:

It distorts the optimal policy. After clipping, a reward of +100+100+100 and a reward of +1+1+1 are treated identically. A strategy that earns occasional rewards of +100+100+100 (clipped to +1+1+1) may be objectively superior to a strategy that earns many rewards of +1+1+1 — but after clipping, the agent cannot distinguish them. The optimal policy under clipped rewards may differ from the optimal policy under true rewards.

✕What Breaks Here

Reward clipping discards the scale of rewards: +100+100+100 and +1+1+1 become indistinguishable. In environments where rewards vary naturally by magnitude — e.g., collecting a rare bonus object worth 1,000 points vs. a common coin worth 1 point — the agent cannot learn to prioritize the high-value objective. The optimal policy under clipped rewards systematically undervalues sparse, large-reward strategies. Reward normalization (subtract mean, divide by standard deviation) preserves relative ordering and is the preferred modern approach whenever running reward statistics can be tracked online. Clipping persists in the Atari benchmark literature for reproducibility, not because it is the correct solution.

Reward normalization is generally preferable. Rather than clipping, maintain a running estimate of the reward mean and variance and normalize:

rnormalized=r−μ^rσ^r+ϵr_{\text{normalized}} = \frac{r - \hat{\mu}_r}{\hat{\sigma}_r + \epsilon}rnormalized​=σ^r​+ϵr−μ^​r​​

Normalization preserves the relative ordering of rewards (and thus the true optimal policy) while bounding gradient magnitudes. It is the standard approach in modern implementations and should be preferred over clipping when reward statistics can be estimated online.

Reward clipping is retained in the DQNDeep Q-Network literature primarily for reproducibility across the Atari benchmark, not because it is theoretically sound.


Failure modes of base DQNDeep Q-Network

Despite the three stabilizing mechanisms, base DQNDeep Q-Network still suffers from:

  • Overestimation bias: the max⁡\maxmax operator over noisy Q-values systematically inflates targets (derived in Week 5).
  • Uniform replay inefficiency: all transitions are sampled equally regardless of their learning value.
  • Value-advantage conflation: the Q-function must jointly represent state value and action advantage, making learning inefficient in states where actions are similar.
  • Mean-only estimation: learning only E[G]\mathbb{E}[G]E[G] discards distributional information that could improve stability and enable risk-sensitive behavior.

Each residual failure motivates one of the extensions below.


Double DQNDeep Q-Network

The failure mode

The Q-learning target r+γmax⁡a′Qθ−(s′,a′)r + \gamma\max_{a'} Q_{\theta^-}(s',a')r+γmaxa′​Qθ−​(s′,a′) selects and evaluates the best next action using the same network Qθ−Q_{\theta^-}Qθ−​. As derived in Week 5, this produces systematic overestimation: the network that selects the best action is the same network whose noise inflated that action's value, so the selected action is biased upward.

The fix

Decouple action selection from action evaluation using different networks:

y=r+γ Qθ− ⁣(s′,  arg⁡max⁡a′Qθ(s′,a′)⏟online selects)y = r + \gamma\, Q_{\theta^-}\!\left(s',\; \underbrace{\arg\max_{a'} Q_\theta(s', a')}_{\text{online selects}}\right)y=r+γQθ−​​s′,online selectsarga′max​Qθ​(s′,a′)​​​

The online network QθQ_\thetaQθ​ selects which action appears best. The target network Qθ−Q_{\theta^-}Qθ−​ evaluates how good that action actually is. Since θ\thetaθ and θ−\theta^-θ− are updated at different rates, their noise is partially decorrelated — the action inflated by θ\thetaθ's noise is unlikely to also be inflated by θ−\theta^-θ−'s noise. The upward bias is reduced without adding any new parameters or computational cost beyond what DQNDeep Q-Network already requires.

In DQNDeep Q-Network, the target network is already maintained for stability. Double DQNDeep Q-Network is simply a change to which network computes the arg⁡max⁡\arg\maxargmax in the target — a one-line modification with meaningful empirical benefits.


Dueling networks

The insight

Not all states require distinguishing between actions. In a racing game on a straight section, all steering actions have similar value — what matters most is estimating how good the overall state is, not which specific action to take. In a dangerous intersection, the action choice is critical. Standard Q-learning conflates these two sources of value and must learn them jointly from the same update signal.

The decomposition

Dueling networks decompose the Q-function as:

Q(s,a; θ,α,β)=V(s; θ,β)+[A(s,a; θ,α)−1∣A∣∑a′A(s,a′; θ,α)]Q(s, a;\, \theta, \alpha, \beta) = V(s;\, \theta, \beta) + \left[A(s, a;\, \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a';\, \theta, \alpha)\right]Q(s,a;θ,α,β)=V(s;θ,β)+[A(s,a;θ,α)−∣A∣1​a′∑​A(s,a′;θ,α)]

where:

  • V(s; θ,β)V(s;\, \theta, \beta)V(s;θ,β) is the state-value stream — a scalar measuring how good state sss is regardless of action.
  • A(s,a; θ,α)A(s, a;\, \theta, \alpha)A(s,a;θ,α) is the advantage stream — how much better action aaa is relative to the average action in state sss.
  • The mean subtraction enforces identifiability: without it, VVV and AAA are underdetermined (adding a constant to VVV and subtracting it from all AAA values leaves QQQ unchanged). Subtracting the mean advantage forces ∑a′A(s,a′)=0\sum_{a'} A(s,a') = 0∑a′​A(s,a′)=0, making the decomposition unique.

Why this helps

The value stream V(s)V(s)V(s) can be updated from any experience in state sss, regardless of which action was taken. The advantage stream A(s,a)A(s,a)A(s,a) is only updated for the specific action taken. In states where actions have similar values (small variance in AAA), the value stream dominates and receives more effective updates — the network learns faster because each transition provides useful information about VVV even when the specific action choice was uninformative.

The two streams share a common feature extraction trunk (the convolutional layers in Atari) and split only in the final layers, adding negligible computational cost.

Dueling network architecture diagram: shared trunk splits to V and A streams

Dueling architecture: shared CNN trunk splits into V(s) and A(s,a) streams, combined via Q(s,a) = V(s) + A(s,a) − mean(A).


Prioritized experience replay

The failure mode

Uniform random sampling from the replay buffer treats all transitions equally. A transition where the agent already has an accurate estimate (small TDTemporal Difference error ∣δi∣|\delta_i|∣δi​∣) contributes little learning signal. A transition where the estimate is badly wrong (large ∣δi∣|\delta_i|∣δi​∣) carries substantial new information. Sampling them with equal probability wastes compute on low-information transitions.

The fix: priority-proportional sampling

Sample transition iii with probability proportional to its TDTemporal Difference error:

P(i)=piα∑kpkα,pi=∣δi∣+ϵP(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}, \quad p_i = |\delta_i| + \epsilonP(i)=∑k​pkα​piα​​,pi​=∣δi​∣+ϵ

where α∈[0,1]\alpha \in [0,1]α∈[0,1] controls the degree of prioritization (α=0\alpha = 0α=0 is uniform sampling; α=1\alpha = 1α=1 is fully greedy) and ϵ>0\epsilon > 0ϵ>0 prevents transitions with zero TDTemporal Difference error from never being sampled.

The importance sampling correction

Prioritized sampling changes the distribution over transitions from uniform to P(i)P(i)P(i). This introduces a bias: transitions with low TDTemporal Difference error are undersampled and under-represented in gradient estimates. To correct this and restore unbiased gradient estimates, each transition is weighted by its importance sampling weight:

wi=(1N⋅P(i))βw_i = \left(\frac{1}{N \cdot P(i)}\right)^\betawi​=(N⋅P(i)1​)β

where NNN is the buffer size and β∈[0,1]\beta \in [0,1]β∈[0,1] controls the correction strength. The update for transition iii becomes:

θ←θ+α wi δi ∇θQθ(si,ai)\theta \leftarrow \theta + \alpha\, w_i\, \delta_i\, \nabla_\theta Q_\theta(s_i, a_i)θ←θ+αwi​δi​∇θ​Qθ​(si​,ai​)

In practice, β\betaβ is annealed from a small initial value (e.g., 0.4) to 1 over the course of training. Early in training, when estimates are noisy and TDTemporal Difference errors are large, full importance sampling correction is unnecessary and adds variance. As training stabilizes and the distribution of priorities narrows, full correction (β=1\beta = 1β=1) restores unbiasedness.

This is the same importance sampling ratio from Week 4 applied to the replay buffer — the priority distribution P(i)P(i)P(i) plays the role of the behavior policy, and the uniform distribution plays the role of the target policy. Correcting for the mismatch requires reweighting by 1/P(i)1/P(i)1/P(i), exactly as in off-policy importance sampling.


Distributional reinforcement learning

The insight

Standard Q-learning learns the expected return E[G]\mathbb{E}[G]E[G]. But the return GGG is a random variable, and its full distribution can carry information that the mean discards.

Consider two state-action pairs with the same expected return of 5: one always returns exactly 5 (zero variance), another returns 0 or 10 with equal probability (high variance). A risk-neutral agent is indifferent; a risk-averse agent prefers the first. The mean alone cannot distinguish them, but the return distribution can.

C51

C51 (Bellemare et al., 2017) represents the return distribution as a categorical distribution over a fixed support {z1,z2,…,zK}\{z_1, z_2, \ldots, z_K\}{z1​,z2​,…,zK​} (e.g., 51 atoms evenly spaced between Vmin⁡V_{\min}Vmin​ and Vmax⁡V_{\max}Vmax​):

Z(s,a)≈∑k=1Kpk(s,a;θ) δzkZ(s,a) \approx \sum_{k=1}^K p_k(s,a;\theta)\, \delta_{z_k}Z(s,a)≈k=1∑K​pk​(s,a;θ)δzk​​

where pk(s,a;θ)p_k(s,a;\theta)pk​(s,a;θ) is the predicted probability that the return falls in bin kkk. The distributional Bellman operator projects the shifted distribution r+γZ(s′,a′)r + \gamma Z(s',a')r+γZ(s′,a′) onto the fixed support and trains via cross-entropy loss, providing a richer gradient signal than scalar TDTemporal Difference error.

The key empirical finding: distributional methods significantly outperform mean-only methods on the Atari benchmark, even for tasks where risk sensitivity seems irrelevant. The richer gradient signal from matching the full distribution — rather than just the mean — stabilizes training and improves the learned representations. (This distributional view is essential for Course 4's RLHF pipelines, where reward model uncertainty directly translates to policy conservatism: in regions where the reward model is uncertain, the agent learns to be cautious.)

QR-DQNDeep Q-Network (Dabney et al., 2018) extends this by representing the distribution via quantile regression, removing the need for a fixed support and improving approximation quality.


Rainbow: the full combination

Rainbow DQNDeep Q-Network (Hessel et al., 2017) combines six improvements into a single agent: Double DQNDeep Q-Network, Prioritized Replay, Dueling Networks, Multi-step returns, Distributional RLReinforcement Learning, and Noisy Networks (learned exploration noise replacing ϵ\epsilonϵ-greedy).

The ablation study in the Rainbow paper provides a direct empirical answer to which components matter most:

| Component removed | Performance drop | |---|---| | Prioritized replay | Largest drop | | Multi-step returns | Second largest drop | | Distributional RLReinforcement Learning | Third | | Double DQNDeep Q-Network | Moderate | | Dueling networks | Moderate | | Noisy Networks | Smallest |

Key finding: prioritized replay and multi-step returns are the most important individual components. The components are largely complementary — removing any one degrades performance, and their combination exceeds what any subset achieves. Rainbow held the state-of-the-art on Atari for over a year and remains the standard reference for what a well-engineered DQNDeep Q-Network variant can achieve.


Failure modes and limits of DQNDeep Q-Network

Even with all extensions, the DQNDeep Q-Network family has fundamental limitations:

  • Discrete actions only: the max⁡a′Q(s′,a′)\max_{a'} Q(s',a')maxa′​Q(s′,a′) operation requires enumerating all actions. DQNDeep Q-Network cannot be directly applied to continuous action spaces (robot joint torques, steering angles), which motivates policy gradient methods (Week 7).
  • Sample inefficiency: DQNDeep Q-Network requires tens of millions of environment steps for Atari. Model-based methods (Week 9) and off-policy actor-critics (Week 8) can achieve comparable performance with orders of magnitude less data.
  • Representational collapse: neural network Q-functions can suffer from degradation of the learned features over long training runs — a phenomenon distinct from the instabilities addressed by target networks and replay buffers.

GenAI context: DQNDeep Q-Network ideas in RLHFReinforcement Learning from Human Feedback

DQNDeep Q-Network itself is rarely applied to language models, but its stabilizing ideas appear throughout RLHFReinforcement Learning from Human Feedback pipelines:

  • Replay buffers → offline RLReinforcement Learning and dataset reuse: storing and resampling experience from past policy versions is the foundation of offline RLReinforcement Learning (Week 10) and is how RLHFReinforcement Learning from Human Feedback pipelines reuse preference data collected under earlier model checkpoints.
  • Target networks → frozen reference models: RLHFReinforcement Learning from Human Feedback freezes a reference model πref\pi_{\text{ref}}πref​ and penalizes KL divergence from it. This serves the same stabilizing role as the target network — it prevents the policy from moving too far from a stable anchor.
  • Overestimation bias → reward model conservatism: the same max-over-noisy-estimates problem appears in reward models trained on limited preference data. Overestimated reward scores produce reward hacking; techniques like reward model ensembles and conservative reward penalties address the same underlying bias.
  • Distributional views → uncertainty estimation: distributional RLReinforcement Learning's representation of return variance informs uncertainty-aware reward modeling in RLHFReinforcement Learning from Human Feedback, enabling models to be more conservative in high-uncertainty regions.

Key takeaways

DQNDeep Q-Network's success follows a clear logic: identify the failure modes precisely (from Week 5), then engineer targeted fixes. Experience replay breaks temporal correlations and provides a stable, broad training distribution, justified by Q-learning's off-policy convergence property. Target networks convert continuous bootstrapping into windowed supervised regression, directly addressing the moving target leg of the deadly triad. Reward clipping stabilizes gradients but distorts the optimal policy — normalization is preferable when feasible.

The extensions address residual failures. Double DQNDeep Q-Network decouples selection from evaluation to reduce overestimation bias. Dueling networks separate state value from action advantage, accelerating learning in states where action choice is uninformative, with mean advantage subtraction enforcing identifiability. Prioritized replay focuses learning on high-error transitions using priority-proportional sampling, corrected by importance sampling weights to prevent bias. Distributional RLReinforcement Learning (C51, QR-DQNDeep Q-Network) learns the full return distribution rather than the mean, providing richer gradient signal. Rainbow combines all six and shows through ablation that prioritized replay and multi-step returns contribute most.

The fundamental limitation — discrete action spaces — is what motivates the shift to policy gradient methods in Week 7.


✦Looking Forward
  • Week 7: Policy gradient methods — shift from learning Q∗Q^*Q∗ to learning πθ\pi_\thetaπθ​ directly, enabling continuous action spaces
  • Week 8: Actor-critic architectures (PPO, SAC) — the modern RL standard
  • Course 4: PPO in RLHF pipelines — how the same ideas align LLMs

Conceptual questions

  1. Without a replay buffer, DQNDeep Q-Network trains on consecutive transitions (st,at,rt+1,st+1)(s_t, a_t, r_{t+1}, s_{t+1})(st​,at​,rt+1​,st+1​), (st+1,at+1,rt+2,st+2)(s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2})(st+1​,at+1​,rt+2​,st+2​), etc. Explain precisely why this violates the assumptions of SGD, why the resulting gradient estimates are biased, and how replay buffer sampling restores approximate i.i.d. structure. Why does the off-policy nature of replay buffer sampling not invalidate Q-learning's convergence guarantees?

  2. The target network is updated every C=10,000C = 10{,}000C=10,000 steps by hard copying θ−←θ\theta^- \leftarrow \thetaθ−←θ. In between updates, the target y=r+γmax⁡a′Qθ−(s′,a′)y = r + \gamma \max_{a'} Q_{\theta^-}(s',a')y=r+γmaxa′​Qθ−​(s′,a′) is fixed. Explain why this converts bootstrapped TDTemporal Difference learning into a sequence of supervised regression problems. What is the tradeoff between small CCC (frequent updates) and large CCC (infrequent updates)?

  3. A dueling network decomposes Q(s,a)=V(s)+A(s,a)Q(s,a) = V(s) + A(s,a)Q(s,a)=V(s)+A(s,a). Without the mean-subtraction normalization A(s,a)←A(s,a)−1∣A∣∑a′A(s,a′)A(s,a) \leftarrow A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')A(s,a)←A(s,a)−∣A∣1​∑a′​A(s,a′), the decomposition is not uniquely identified. Give a concrete example showing two different (V,A)(V, A)(V,A) pairs that produce the same QQQ values, and explain why the mean subtraction removes this ambiguity.

  4. Prioritized experience replay samples transition iii with probability P(i)∝∣δi∣αP(i) \propto |\delta_i|^\alphaP(i)∝∣δi​∣α. Without importance sampling corrections, this introduces a bias in the gradient estimates. Write out the importance sampling weight wiw_iwi​ and explain what distribution it is correcting from and to. Why is β\betaβ annealed from less than 1 to 1 over training rather than set to 1 from the start?

  5. You are applying a DQNDeep Q-Network variant to an RLHFReinforcement Learning from Human Feedback pipeline where the language model policy is fine-tuned using Q-values over the token vocabulary (∣A∣≈50,000|\mathcal{A}| \approx 50{,}000∣A∣≈50,000). Identify two specific failure modes from this lecture that are likely to be most severe at this action-space scale. For each, state which DQNDeep Q-Network extension addresses it and whether that extension is directly applicable to the language setting or requires modification.


✦Solutions
  1. Replay buffer. Consecutive transitions are strongly temporally correlated and non-stationary, violating SGD's i.i.d. assumption, so gradient estimates are biased and updates become unstable. Random replay sampling breaks the correlation, restoring approximate i.i.d. structure. Off-policy sampling does not invalidate Q-learning because the update bootstraps toward max⁡a′Q(s′,a′)\max_{a'}Q(s',a')maxa′​Q(s′,a′) regardless of which behavior policy generated the data — Q-learning is off-policy by construction.
  2. Target network. Freezing θ−\theta^-θ− for CCC steps makes y=r+γmax⁡a′Qθ−(s′,a′)y = r + \gamma\max_{a'}Q_{\theta^-}(s',a')y=r+γmaxa′​Qθ−​(s′,a′) a fixed target, turning the update into a stationary supervised regression for that window instead of chasing a moving bootstrapped target. Small CCC keeps targets current but reintroduces moving-target oscillation/instability; large CCC gives stable targets but stale value estimates, slowing learning.
  3. Dueling identifiability. Q=V+AQ=V+AQ=V+A is invariant under (V+c, A−c)(V+c,\,A-c)(V+c,A−c). For example V=5, A=[0,2]V=5,\,A=[0,2]V=5,A=[0,2] and V=6, A=[−1,1]V=6,\,A=[-1,1]V=6,A=[−1,1] both give Q=[5,7]Q=[5,7]Q=[5,7]. Mean subtraction forces ∑aA(s,a)=0\sum_a A(s,a)=0∑a​A(s,a)=0, pinning the gauge so VVV is uniquely the mean QQQ and AAA the centered advantages — removing the additive ambiguity.
  4. PER importance sampling. wi=(N P(i))−βw_i = (N\,P(i))^{-\beta}wi​=(NP(i))−β (normalized by max⁡jwj\max_j w_jmaxj​wj​). It corrects from the priority-proportional sampling distribution P(i)∝∣δi∣αP(i)\propto|\delta_i|^\alphaP(i)∝∣δi​∣α back to the uniform distribution the loss expectation assumes. β\betaβ is annealed from below 1 to 1 because early training is noisy and non-stationary, where full correction adds variance; partial correction stabilizes early learning, and β→1\beta\to1β→1 ensures the final estimates are unbiased.
  5. DQN at vocabulary scale. (a) Max-operator overestimation is amplified across ~50,000 noisy action-values; Double DQN addresses it and transfers directly (decouple action selection from evaluation). (b) Exploration/coverage and computing max⁡aQ\max_a Qmaxa​Q over 50,000 tokens are intractable, so value-based control scales poorly here — dueling/distributional improve estimation but the action-space scale ultimately favors policy-gradient methods, so this extension requires substantial modification (RLHF normally uses PPO instead).

Coding exercise: Mini DQN on CartPole

Implement a minimal DQN agent for OpenAI Gym's CartPole-v1 environment using PyTorch. The goal is to reproduce the three core stabilizing mechanisms from this lecture in code.

Setup (provided):

import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

env = gym.make("CartPole-v1")
state_dim  = env.observation_space.shape[0]   # 4
action_dim = env.action_space.n               # 2

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
        )
    def forward(self, x):
        return self.net(x)

Part 1 — Replay buffer. Implement a ReplayBuffer class with push(state, action, reward, next_state, done) and sample(batch_size) methods. Store transitions in a deque with maxlen. Return numpy arrays (or torch tensors) for states, actions, rewards, next_states, and dones.

Part 2 — Target network. After training the online network QθQ_\thetaQθ​ for C=1,000C = 1{,}000C=1,000 steps, hard-copy its weights to the target network Qθ−Q_{\theta^-}Qθ−​. Verify by comparing torch.equal(online.state_dict()['net.0.weight'], target.state_dict()['net.0.weight']) before and after the copy.

Part 3 — ε-greedy action selection and training loop. Implement the full DQN training loop:

  • Select actions a=arg⁡max⁡aQθ(s,a)a = \arg\max_a Q_\theta(s, a)a=argmaxa​Qθ​(s,a) with probability 1−ε1 - \varepsilon1−ε, else random.
  • Compute target values yj=rj+γmax⁡a′Qθ−(sj′,a′)y_j = r_j + \gamma \max_{a'} Q_{\theta^-}(s'_j, a')yj​=rj​+γmaxa′​Qθ−​(sj′​,a′) using the frozen target network (or yj=rjy_j = r_jyj​=rj​ if terminal).
  • Update θ\thetaθ via MSE loss 1B∑j(yj−Qθ(sj,aj))2\frac{1}{B}\sum_j (y_j - Q_\theta(s_j, a_j))^2B1​∑j​(yj​−Qθ​(sj​,aj​))2.
  • Anneal ε\varepsilonε from 1.0 to 0.01 over the first 10% of episodes.

Expected result: With γ=0.99\gamma = 0.99γ=0.99, batch size 64, learning rate 10−310^{-3}10−3, buffer capacity 10,000, and C=1,000C = 1{,}000C=1,000, the agent should solve CartPole (average return ≥ 195 over 100 consecutive episodes) within approximately 500 episodes.

Extension: Replace the hard target network update with Polyak (soft) averaging — θ−←τθ+(1−τ)θ−\theta^- \leftarrow \tau \theta + (1 - \tau) \theta^-θ−←τθ+(1−τ)θ− with τ=0.005\tau = 0.005τ=0.005, applied every step. Compare convergence speed and stability against the hard update baseline.


Looking ahead

DQNDeep Q-Network and its variants require discrete, enumerable action spaces. The next lecture introduces policy gradient methods, which optimize policies directly and naturally extend to continuous action spaces — unlocking robotics, continuous control, and the modern RLHFReinforcement Learning from Human Feedback alignment pipeline. The key insight is a shift from learning Q∗Q^*Q∗ and deriving the policy implicitly, to learning πθ\pi_\thetaπθ​ directly by following the gradient of expected return.


Further reading

  • Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature. (The seminal DQNDeep Q-Network paper).
  • Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI. (Double DQNDeep Q-Network).
  • Wang, Z., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. ICML. (Dueling DQNDeep Q-Network).
  • Schaul, T., et al. (2015). Prioritized Experience Replay. ICLR.
  • Bellemare, M. G., et al. (2017). A Distributional Perspective on Reinforcement Learning. ICML. (C51 / Distributional RLReinforcement Learning).
  • Hessel, M., et al. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI.
← Previous
Week 5: Function Approximation in Reinforcement Learning
Next →
Week 7: Policy Gradient and Actor–Critic Methods
On this page
  • Purpose of this lecture
  • From Q-learning to Deep Q-learning
  • The complete DQN algorithm
  • Experience replay
  • The failure mode it addresses
  • The fix: replay buffer
  • The off-policy justification
  • The capacity tradeoff
  • Target networks
  • The failure mode it addresses
  • The fix: frozen target network
  • Hard update vs Polyak averaging
  • Connection to the deadly triad
  • Reward clipping
  • The problem it solves
  • The fix and its costs
  • Failure modes of base DQN
  • Double DQN
  • The failure mode
  • The fix
  • Dueling networks
  • The insight
  • The decomposition
  • Why this helps
  • Prioritized experience replay
  • The failure mode
  • The fix: priority-proportional sampling
  • The importance sampling correction
  • Distributional reinforcement learning
  • The insight
  • C51
  • Rainbow: the full combination
  • Failure modes and limits of DQN
  • GenAI context: DQN ideas in RLHF
  • Key takeaways
  • Conceptual questions
  • Coding exercise: Mini DQN on CartPole
  • Looking ahead
  • Further reading