Purpose of this lecture
In Week 5, we crossed the critical boundary from tabular RL (reinforcement learning) to function approximation and saw that naïvely combining Q-learning with neural networks leads to instability, divergence, and pathological behavior due to the deadly triad.
This lecture introduces Deep Q-Networks (DQN), the first algorithm to successfully stabilize value-based RL with neural networks at scale. DQN does not introduce a new theoretical framework. It succeeds through a bundle of stabilizing engineering constraints, each of which is a direct and identifiable response to a specific failure mode from Week 5.
Understanding DQN means understanding the mapping from failure → fix. The extensions that follow (Double DQN, Dueling Networks, Prioritized Replay, Distributional RL) each address a residual failure that the base DQN bundle does not fully eliminate.
From Q-learning to Deep Q-learning
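Recall the Q-learning update:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$
In DQN, $Q$ is replaced by a neural network $Q_\theta$. The training objective minimizes the squared temporal-difference (TD) error:
$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2 \right]$$
where $\mathcal{D}$ is a replay buffer and $Q_{\theta^-}$ is a separate target network.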
DQN's historical breakthrough was learning to play Atari 2600 games directly from raw pixels using convolutional neural networks, achieving superhuman performance on many games from a single algorithm and hyperparameter set. This was the first demonstration that deep neural networks and RL could be combined successfully at scale.
Without additional structure, however, this combination is unstable for the reasons established in Week 5. We now examine each stabilizing mechanism in detail.
The complete DQN algorithm
Before examining each component, it helps to see the full algorithm:

    Initialize Q_θ with random weights θ
    Initialize target network Q_{θ⁻} with weights θ⁻ ← θ
    Initialize replay buffer D with capacity N
    For each episode:
        Initialize state s₀
        For each step t:
            # Action selection (ε-greedy)
            With probability ε: select random action aₜ
            Otherwise: select aₜ = argmax_a Q_θ(sₜ, a)
            # Environment step
            Execute aₜ, observe rₜ₊₁, sₜ₊₁
            # Store transition
            Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D
            # Sample and train (once |D| ≥ batch_size)
            Sample random mini-batch {(sⱼ, aⱼ, rⱼ, s'ⱼ)} from D
            Compute targets:
                yⱼ = rⱼ + γ max_{a'} Q_{θ⁻}(s'ⱼ, a')   if s'ⱼ not terminal
                yⱼ = rⱼ                                if s'ⱼ terminal
            Update θ by SGD on Σⱼ (yⱼ - Q_θ(sⱼ, aⱼ))²
            # Periodic target network update
            Every C steps: θ⁻ ← θ

- Target network initialization (θ⁻ ← θ): An exact copy of the online network. It provides stable Bellman backup targets that do not shift with every gradient step.
- Replay buffer (capacity N): Stores raw transitions up to a large capacity (e.g., 1 million transitions), enabling offline-like sampling that decorrelates sequential experience.
- Deferred storage (Store (sₜ, aₜ, rₜ₊₁, sₜ₊₁) in D): Transitions are stored without immediately learning from them, breaking the sequential coupling that destabilizes online Q-learning.
- Random mini-batch sampling: Decorrelates the training data and stabilizes stochastic gradient descent, because each update sees a mix of old and recent experience.
- Frozen target values (Q_{θ⁻}): Targets use the frozen parameters θ⁻, not the current θ being updated, which dampens the feedback loop that can cause divergence.
- Periodic target update (Every C steps: θ⁻ ← θ): The target network is refreshed only every C steps, so the targets yⱼ stay stationary long enough for the online network to fit them.
Every non-obvious line in this algorithm corresponds to a deliberate design choice that addresses a specific failure mode. The rest of the lecture explains those choices.
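The action-selection step of the algorithm above can be sketched in a few lines of plain Python. This is a minimal illustration only: the function name `epsilon_greedy`, the list-based Q-values, and the optional `rng` argument are assumptions made for the example, not part of the lecture's reference code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values is a plain list standing in for Q_theta(s, .); this is a
    hypothetical interface chosen for the sketch.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q-value (ties go to the lowest index)
    return max(range(len(q_values)), key=lambda a: q_values[a])


q = [0.1, 0.7, 0.3]  # made-up Q-values for three actions

# epsilon = 0 never explores, so the greedy action (index 1) is returned.
greedy_action = epsilon_greedy(q, epsilon=0.0)

# epsilon = 1 always explores, so every action shows up over many draws.
rng = random.Random(0)
explore_actions = {epsilon_greedy(q, epsilon=1.0, rng=rng) for _ in range(200)}
```

Passing a seeded `random.Random` instance keeps exploration reproducible, which helps when comparing runs.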
Experience replay
The failure mode it addresses
Sequential RL transitions are highly temporally correlated. The transition $(s_t, a_t, r_{t+1}, s_{t+1})$ and the next transition $(s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2})$ share the state $s_{t+1}$. In an Atari game, consecutive frames differ by a single timestep and are nearly identical observations. SGD assumes mini-batches are i.i.d. samples from a fixed distribution; training on consecutive transitions violates both assumptions simultaneously.
The effect is analogous to training a classifier by showing it the same image 50 times before showing any other image: gradient estimates are low-variance but systematically biased toward recent experience, and the network rapidly overfits to recent transitions while forgetting earlier ones. In RL, this manifests as oscillating value estimates and policy instability.
A second, related problem is non-stationarity of the training distribution: as the policy improves, the distribution of visited states shifts. Training exclusively on on-policy data means the network adapts to the current policy's narrow distribution and may catastrophically forget how to handle states it no longer frequently visits.
The fix: replay buffer
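Store transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ in a fixed-capacity buffer $\mathcal{D}$ and train on randomly sampled mini-batches of size $B$:
$$\{(s_j, a_j, r_j, s'_j)\}_{j=1}^{B} \sim \mathrm{Uniform}(\mathcal{D})$$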
Random sampling breaks temporal correlations: the mini-batch contains transitions from many different timepoints and behavioral contexts, restoring approximate i.i.d. structure. The buffer also provides a broader, more stationary training distribution that includes past experience, reducing catastrophic forgetting.
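As a rough sketch of the mechanism just described, a fixed-capacity buffer with uniform sampling can be built on a `deque`. The class and method names here are illustrative assumptions, and integer states are used only to make the eviction and sampling behaviour easy to see.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling (a sketch)."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) silently evicts the oldest transition when full
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal ordering.
        batch = self.rng.sample(list(self.storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=5, seed=0)
for t in range(8):  # push 8 transitions into a buffer that holds only 5
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

states, actions, rewards, next_states, dones = buffer.sample(3)
```

After eight pushes only the five most recent transitions (states 3 through 7) remain, and a sampled batch generally mixes non-consecutive timesteps.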
The off-policy justification
Sampling from a replay buffer containing transitions from older policy versions makes DQN off-policy: the data was collected under different (older) policies than the current one. The theoretical justification for this is Q-learning's off-policy convergence property from Week 4: in the tabular setting, Q-learning converges to $Q^*$ regardless of the behavior policy, provided every state-action pair continues to be visited. Transitions from old policies help satisfy this coverage requirement as long as the buffer is large enough and diverse enough. (This replay buffer idea is the foundational principle behind offline robot learning in Course 2, where collected demonstrations from a teacher policy are replayed to learn robotic skills. It is the same insight: experience is reusable, independent of when it was collected.)
The capacity tradeoff
Buffer capacity is a practical hyperparameter with real consequences:
- Too small: the buffer retains mostly recent, correlated transitions — similar to on-policy learning with weaker correlation reduction.
- Too large: the buffer retains very old transitions collected under a much earlier, worse policy. High-priority states from the current policy are diluted by irrelevant old data, slowing learning.
DQN uses a buffer of 1 million transitions, large enough to cover many episodes but small enough that recent data dominates. Prioritized experience replay (discussed below) addresses the inefficiency of uniform sampling from a large buffer.
Target networks
The failure mode it addresses
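With a standard neural network (a single set of parameters $\theta$ and no separate target network), the training loss on a transition $(s, a, r, s')$ is:
$$L(\theta) = \left( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right)^2$$
Both the target $y = r + \gamma \max_{a'} Q_\theta(s', a')$ and the prediction $Q_\theta(s, a)$ are functions of the same $\theta$. When $\theta$ is updated to move $Q_\theta(s, a)$ toward $y$, the target itself shifts: you are chasing a moving target that moves every time you take a step toward it.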
This instability is not merely inconvenient. In the worst case it is divergent: the update can move toward a target that has already moved further away, producing a feedback loop that amplifies errors. This is the moving bootstrap target problem identified in Week 5.
The fix: frozen target network
Maintain two networks:
- Online network $Q_\theta$: updated at every training step via gradient descent.
- Target network $Q_{\theta^-}$: updated periodically by copying $\theta^- \leftarrow \theta$ every $C$ steps (for example, $C = 10{,}000$ parameter updates in the Atari DQN of Mnih et al., 2015).
Targets are computed using the frozen network:
$$y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$$
Between target network updates, $\theta^-$ is fixed. The loss is then a standard supervised regression loss with fixed labels, which is stable for the same reasons that standard supervised learning is. The moving target problem is converted into a sequence of stable supervised regression problems, each running for $C$ steps.
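To make the frozen-target computation concrete, here is a small sketch that builds the regression labels for a mini-batch, including the terminal-state case from the algorithm listing. The helper name `td_targets` and the nested-list layout for target-network outputs are assumptions for illustration; a real implementation would operate on tensors.

```python
def td_targets(rewards, next_q_target, dones, gamma):
    """Compute y_j for each transition in a mini-batch.

    next_q_target[j] holds the frozen target network's Q-values for every
    action in s'_j (a hypothetical list-of-lists layout for this sketch).
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_target, dones):
        # Terminal transitions do not bootstrap: the target is just r.
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


rewards = [1.0, 0.0, 1.0]
next_q_target = [[0.5, 2.0], [1.0, 0.0], [3.0, 4.0]]  # made-up Q_{theta^-}(s', .)
dones = [False, False, True]

# Approximately [2.8, 0.9, 1.0]; the last entry ignores next_q_target
# because that transition is terminal.
targets = td_targets(rewards, next_q_target, dones, gamma=0.9)
```

Because `next_q_target` comes from parameters that are not being trained, these labels stay fixed while the online network is fitted against them.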
Hard update vs Polyak averaging
DQN uses a hard update: copy $\theta^- \leftarrow \theta$ every $C$ steps. An alternative is Polyak averaging (soft update):
$$\theta^- \leftarrow \tau\,\theta + (1 - \tau)\,\theta^-, \qquad 0 < \tau \ll 1,$$
applied at every step. Soft updates produce a target network that tracks the online network slowly and smoothly, avoiding the discontinuous jumps of hard updates. Soft updates are standard in continuous control algorithms (DDPG, TD3, SAC) where the smoother target dynamics improve stability. Hard updates remain common in discrete-action DQN variants.
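The difference between the two update rules can be seen on a toy parameter vector. The sketch below treats parameters as a flat list of floats, which is an assumption made for readability; the function names are likewise illustrative.

```python
def hard_update(target_params, online_params):
    """Copy the online parameters into the target in one jump."""
    target_params[:] = online_params


def polyak_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward the online value."""
    for i, (t, o) in enumerate(zip(target_params, online_params)):
        target_params[i] = tau * o + (1.0 - tau) * t


online = [1.0, -2.0]  # made-up online parameters, held fixed for the demo

soft_target = [0.0, 0.0]
for _ in range(100):
    polyak_update(soft_target, online, tau=0.05)

hard_target = [0.0, 0.0]
hard_update(hard_target, online)
```

With the online parameters held fixed, the soft target closes the gap geometrically: after $k$ steps it has covered a fraction $1 - (1 - \tau)^k$ of the distance, whereas the hard update covers all of it at once.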
Connection to the deadly triad
Freezing the target network effectively removes bootstrapping from the instability triangle for $C$ steps at a time: the target is treated as a fixed constant, so the update behaves like supervised regression on fixed labels rather than self-referential bootstrapping. This is the precise mechanism by which the target network addresses one leg of the deadly triad. It does not eliminate bootstrapping, but it converts continuous bootstrapping into windowed supervised regression. (The same target network mechanism appears in Course 2's actor-critic robot controllers, where the frozen reference policy prevents unsafe policy drift during learning from demonstrations.)
Reward clipping
The problem it solves
Large or task-varying reward magnitudes cause unstable gradients and make it impossible to use the same hyperparameters across tasks with different reward scales.
The fix and its costs
DQN clips rewards to $[-1, 1]$:
$$r_{\text{clip}} = \max(-1, \min(1, r))$$
This stabilizes optimization across Atari games with different scoring systems — the gradient magnitude is bounded regardless of whether a game awards 1 point or 10,000 points per event.
However, reward clipping has a serious cost that deserves explicit treatment:
It distorts the optimal policy. After clipping, a reward of $+1$ and any larger reward $R > 1$ are treated identically. A strategy that earns occasional rewards of $R \gg 1$ (clipped to $+1$) may be objectively superior to a strategy that earns many rewards of $+1$, but after clipping the agent cannot tell the individual rewards apart and will favor the strategy that collects them more often. The optimal policy under clipped rewards may differ from the optimal policy under true rewards.
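A tiny numeric example shows how clipping can reverse a preference between two strategies. The reward sequences below are invented for illustration, and `clip_reward` is a hypothetical helper name.

```python
def clip_reward(r, low=-1.0, high=1.0):
    """Clip a scalar reward into [low, high]."""
    return max(low, min(high, r))


strategy_a = [100.0, 0.0, 0.0, 0.0]  # made-up: one rare, large reward
strategy_b = [1.0, 1.0, 1.0, 0.0]    # made-up: several small rewards

true_a, true_b = sum(strategy_a), sum(strategy_b)        # 100.0 vs 3.0
clipped_a = sum(clip_reward(r) for r in strategy_a)      # 1.0
clipped_b = sum(clip_reward(r) for r in strategy_b)      # 3.0
```

Under the true rewards strategy A is far better, but under clipped rewards strategy B looks better, so an agent trained on clipped rewards would prefer B.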
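Reward normalization is generally preferable. Rather than clipping, maintain a running estimate of the reward mean $\mu_r$ and variance $\sigma_r^2$ and normalize:
$$\tilde{r} = \frac{r - \mu_r}{\sqrt{\sigma_r^2 + \epsilon}}$$
where $\epsilon$ is a small constant that guards against division by zero.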
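Normalization is a positive affine transformation, so it preserves the relative ordering of rewards while keeping gradient magnitudes in a controlled range. It therefore distorts the optimal policy far less than clipping does, although subtracting the mean can still shift preferences over episode length in episodic tasks. It is a common approach in modern implementations and should be preferred over clipping when reward statistics can be estimated online.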
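One way to maintain the running statistics is Welford's online algorithm, sketched below. The class name, the population-variance convention, and the sample rewards are assumptions for this example rather than a prescribed implementation.

```python
import math


class RunningRewardNormalizer:
    """Track running mean and variance of rewards (Welford's algorithm)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, r):
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r - self.mean)

    @property
    def std(self):
        # Population standard deviation; 0.0 before any reward is seen.
        return math.sqrt(self.m2 / self.count) if self.count > 0 else 0.0

    def normalize(self, r):
        return (r - self.mean) / (self.std + self.eps)


norm = RunningRewardNormalizer()
for r in [0.0, 10.0, 200.0, 1000.0]:  # made-up rewards on very different scales
    norm.update(r)

scaled = [norm.normalize(r) for r in [10.0, 200.0, 1000.0]]
```

Unlike clipping, the normalized values keep their order and relative spacing, so a reward of 1000 still looks much larger than a reward of 10.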
Reward clipping is retained in the DQN literature primarily for reproducibility across the Atari benchmark, not because it is theoretically sound.
Failure modes of base DQN
Despite the three stabilizing mechanisms, base DQN still suffers from:
- Overestimation bias: the $\max$ operator over noisy Q-values systematically inflates targets (derived in Week 5).
- Uniform replay inefficiency: all transitions are sampled equally regardless of their learning value.
- Value-advantage conflation: the Q-function must jointly represent state value and action advantage, making learning inefficient in states where actions are similar.
- Mean-only estimation: learning only the expected return $\mathbb{E}[G_t]$ discards distributional information that could improve stability and enable risk-sensitive behavior.
Each residual failure motivates one of the extensions below.
Double DQN
The failure mode
The Q-learning target $r + \gamma \max_{a'} Q_{\theta^-}(s', a')$ selects and evaluates the best next action using the same network $Q_{\theta^-}$. As derived in Week 5, this produces systematic overestimation: the network that selects the best action is the same network whose noise inflated that action's value, so the selected action is biased upward.
The fix
Decouple action selection from action evaluation using different networks:
$$y = r + \gamma\, Q_{\theta^-}\!\left(s', \arg\max_{a'} Q_\theta(s', a')\right)$$
The online network selects which action appears best. The target network evaluates how good that action actually is. Since $\theta$ and $\theta^-$ are updated at different rates, their noise is partially decorrelated: the action inflated by $Q_\theta$'s noise is less likely to also be inflated by $Q_{\theta^-}$'s noise. The upward bias is reduced without adding any new parameters, at the cost of only one extra forward pass of the online network on $s'$.
In DQN, the target network is already maintained for stability. Double DQN is simply a change to which network computes the $\arg\max$ in the target: a one-line modification with meaningful empirical benefits.
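The one-line difference between the two targets can be sketched side by side. The function names and the made-up Q-values are assumptions for the example; both functions handle a single non-terminal transition.

```python
def argmax(values):
    """Index of the largest entry (ties go to the lowest index)."""
    return max(range(len(values)), key=lambda a: values[a])


def dqn_target(r, q_target_next, gamma):
    # Standard DQN: the target network both selects and evaluates.
    return r + gamma * max(q_target_next)


def double_dqn_target(r, q_online_next, q_target_next, gamma):
    # Double DQN: the online network selects, the target network evaluates.
    a_star = argmax(q_online_next)
    return r + gamma * q_target_next[a_star]


q_online_next = [1.0, 3.0, 2.0]  # made-up Q_theta(s', .)
q_target_next = [1.5, 2.0, 4.0]  # made-up Q_{theta^-}(s', .)

y_dqn = dqn_target(0.0, q_target_next, gamma=1.0)
y_double = double_dqn_target(0.0, q_online_next, q_target_next, gamma=1.0)
```

Here the target network's largest value (4.0 for action 2) may be a noisy outlier. The online network prefers action 1, so Double DQN evaluates that action instead and gets the smaller target of 2.0. The Double DQN target can never exceed the standard one, since it evaluates a single action under the same network whose maximum the standard target takes.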
Dueling networks
The insight
Not all states require distinguishing between actions. In a racing game on a straight section, all steering actions have similar value — what matters most is estimating how good the overall state is, not which specific action to take. In a dangerous intersection, the action choice is critical. Standard Q-learning conflates these two sources of value and must learn them jointly from the same update signal.
The decomposition
Dueling networks decompose the Q-function as:
$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$$
where:
- $V(s)$ is the state-value stream: a scalar measuring how good state $s$ is regardless of action.
- $A(s, a)$ is the advantage stream: how much better action $a$ is relative to the average action in state $s$.
- The mean subtraction enforces identifiability: without it, $V$ and $A$ are underdetermined (adding a constant to $V$ and subtracting it from all $A$ values leaves $Q$ unchanged). Subtracting the mean advantage forces the advantage term to average to zero over actions, so that $V(s) = \frac{1}{|\mathcal{A}|} \sum_{a} Q(s, a)$, making the decomposition unique.
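The identifiability argument in the list above can be checked numerically. In this sketch, `naive_q` and `dueling_q` are hypothetical helper names, and the stream outputs are invented numbers for a single state with three actions.

```python
def naive_q(value, advantages):
    """Q = V + A with no normalization (not identifiable)."""
    return [value + a for a in advantages]


def dueling_q(value, advantages):
    """Q = V + A - mean(A), the mean-subtracted dueling aggregation."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# Two different (V, A) pairs that the naive sum cannot tell apart:
naive_1 = naive_q(value=5.0, advantages=[1.0, -1.0, 0.0])
naive_2 = naive_q(value=-5.0, advantages=[11.0, 9.0, 10.0])

# With mean subtraction, shifting every advantage by a constant has no
# effect, so the value stream alone determines the average Q-value.
q_1 = dueling_q(value=5.0, advantages=[1.0, -1.0, 0.0])
q_2 = dueling_q(value=5.0, advantages=[11.0, 9.0, 10.0])
mean_q = sum(q_1) / len(q_1)
```

Both naive pairs give Q-values `[6.0, 4.0, 5.0]`, so the network has no way to pin down `V`. With mean subtraction, the average Q-value always equals the value-stream output (5.0 here), which removes that freedom.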
Why this helps
The value stream $V(s)$ can be updated from any experience in state $s$, regardless of which action was taken. The advantage stream is only updated for the specific action taken. In states where actions have similar values (small variance in $A(s, \cdot)$), the value stream dominates and receives more effective updates: the network learns faster because each transition provides useful information about $V(s)$ even when the specific action choice was uninformative.
The two streams share a common feature extraction trunk (the convolutional layers in Atari) and split only in the final layers, adding negligible computational cost.
Dueling architecture: shared CNN trunk splits into V(s) and A(s,a) streams, combined via Q(s,a) = V(s) + A(s,a) − mean(A).
Prioritized experience replay
The failure mode
Uniform random sampling from the replay buffer treats all transitions equally. A transition where the agent already has an accurate estimate (small TD error $|\delta|$) contributes little learning signal. A transition where the estimate is badly wrong (large $|\delta|$) carries substantial new information. Sampling them with equal probability wastes compute on low-information transitions.
The fix: priority-proportional sampling
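Sample transition $i$ with a probability that increases with the magnitude of its TD error $\delta_i$:
$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad p_i = |\delta_i| + \epsilon$$
where $\alpha$ controls the degree of prioritization ($\alpha = 0$ is uniform sampling; $\alpha = 1$ samples in direct proportion to priority) and $\epsilon > 0$ prevents transitions with zero TD error from never being sampled.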
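The sampling distribution can be sketched directly from the formula. The helper name `priority_probs` and the three made-up TD errors are assumptions for illustration; practical implementations typically use a sum-tree for efficiency, which this sketch omits.

```python
import random


def priority_probs(td_errors, alpha, eps=1e-6):
    """P(i) = p_i^alpha / sum_k p_k^alpha with p_i = |delta_i| + eps."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


td_errors = [0.0, 0.5, 2.0]  # made-up TD errors for three stored transitions

uniform = priority_probs(td_errors, alpha=0.0)      # every entry is 1/3
prioritized = priority_probs(td_errors, alpha=1.0)  # favors the large error

# Draw a mini-batch of indices according to the prioritized distribution.
rng = random.Random(0)
sampled = rng.choices(range(len(td_errors)), weights=prioritized, k=4)
```

The transition with zero TD error keeps a small but nonzero probability because of `eps`, so it can still be revisited if its error later turns out to be stale.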
The importance sampling correction
Prioritized sampling changes the distribution over transitions from uniform to $P(i)$. This introduces a bias: transitions with low TD error are undersampled and under-represented in gradient estimates. To correct this and restore unbiased gradient estimates, each transition is weighted by its importance sampling weight:
$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}$$
where $N$ is the buffer size and $\beta$ controls the correction strength. The update for transition $i$ becomes:
$$\theta \leftarrow \theta + \eta\, w_i\, \delta_i\, \nabla_\theta Q_\theta(s_i, a_i)$$
where $\eta$ is the learning rate and $\delta_i = y_i - Q_\theta(s_i, a_i)$ is the TD error.
In practice, $\beta$ is annealed from a small initial value (e.g., 0.4) to 1 over the course of training. Early in training, when estimates are noisy and TD errors are large, full importance sampling correction is unnecessary and adds variance. As training stabilizes and the distribution of priorities narrows, full correction ($\beta = 1$) restores unbiasedness.
This is the same importance sampling ratio from Week 4 applied to the replay buffer: the priority distribution plays the role of the behavior policy, and the uniform distribution plays the role of the target policy. Correcting for the mismatch requires reweighting by $\frac{1}{N P(i)}$, exactly as in off-policy importance sampling.
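A short numeric check illustrates why full correction removes the bias. The sampling probabilities and per-transition values below are invented, and `importance_weights` is a hypothetical helper that returns the raw weights without any further normalization.

```python
def importance_weights(probs, beta):
    """w_i = (1 / (N * P(i)))^beta for a buffer of N = len(probs) items."""
    n = len(probs)
    return [(1.0 / (n * p)) ** beta for p in probs]


probs = [0.1, 0.3, 0.6]   # made-up prioritized sampling probabilities
values = [4.0, 2.0, 1.0]  # stand-ins for per-transition gradient terms

# Expected update under prioritized sampling with full correction (beta = 1):
w_full = importance_weights(probs, beta=1.0)
weighted_mean = sum(p * w * v for p, w, v in zip(probs, w_full, values))

# Expected update under plain uniform sampling:
uniform_mean = sum(values) / len(values)

# With beta = 0 every weight is 1, so no correction is applied.
w_none = importance_weights(probs, beta=0.0)
```

With `beta=1.0` the prioritized expectation matches the uniform one, which is the sense in which the correction is unbiased. Intermediate values of `beta` trade some of that bias for lower variance.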
Distributional reinforcement learning
The insight
Standard Q-learning learns the expected return $Q(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a]$. But the return $G_t$ is a random variable, and its full distribution can carry information that the mean discards.
Consider two state-action pairs with the same expected return of 5: one always returns exactly 5 (zero variance), another returns 0 or 10 with equal probability (high variance). A risk-neutral agent is indifferent; a risk-averse agent prefers the first. The mean alone cannot distinguish them, but the return distribution can.
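The two return distributions in this example can be written out and compared directly. The helper `mean_and_variance` is a name chosen for this sketch.

```python
def mean_and_variance(atoms, probs):
    """Mean and variance of a discrete return distribution."""
    mean = sum(z * p for z, p in zip(atoms, probs))
    var = sum(p * (z - mean) ** 2 for z, p in zip(atoms, probs))
    return mean, var


# Always returns exactly 5.
safe = mean_and_variance(atoms=[5.0], probs=[1.0])

# Returns 0 or 10 with equal probability.
risky = mean_and_variance(atoms=[0.0, 10.0], probs=[0.5, 0.5])
```

Both have mean 5.0, but the variances are 0.0 and 25.0. A learner that stores only the mean sees the two pairs as identical.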
C51
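C51 (Bellemare et al., 2017) represents the return distribution as a categorical distribution over a fixed support (e.g., 51 atoms $z_1, \dots, z_{51}$ evenly spaced between $V_{\min}$ and $V_{\max}$):
$$Z_\theta(s, a) = \sum_{i=1}^{51} p_i(s, a)\, \delta_{z_i}$$
where $p_i(s, a)$ is the predicted probability that the return falls in bin $i$ and $\delta_{z_i}$ denotes a point mass at atom $z_i$ (not a TD error). The distributional Bellman update shifts and scales each atom to $r + \gamma z_i$, projects the result back onto the fixed support, and trains the network with a cross-entropy loss against that projected target, providing a richer gradient signal than scalar TD error.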
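The projection step is the least obvious part of C51, so a small sketch may help. It follows the general shape of the categorical projection described by Bellemare et al. (2017), but the function name, the three-atom support, and the list-based layout are assumptions for this example, not a faithful reproduction of any reference implementation.

```python
def project_distribution(next_probs, reward, gamma, v_min, v_max, done=False):
    """Project the shifted atoms r + gamma * z_i back onto the fixed support."""
    n = len(next_probs)
    delta_z = (v_max - v_min) / (n - 1)          # spacing between atoms
    atoms = [v_min + i * delta_z for i in range(n)]
    projected = [0.0] * n
    for z, p in zip(atoms, next_probs):
        tz = reward if done else reward + gamma * z
        tz = min(v_max, max(v_min, tz))          # clamp to the support
        b = (tz - v_min) / delta_z               # fractional atom index
        lower = min(int(b), n - 1)
        upper = min(lower + 1, n - 1)
        frac = b - lower
        # Split the probability mass between the two neighbouring atoms.
        projected[lower] += p * (1.0 - frac)
        projected[upper] += p * frac
    return atoms, projected


# Support {0, 5, 10}; the next-state distribution puts half its mass on
# 0 and half on 10 (made-up numbers).
atoms, target = project_distribution(
    [0.5, 0.0, 0.5], reward=1.0, gamma=0.5, v_min=0.0, v_max=10.0
)
target_mean = sum(z * p for z, p in zip(atoms, target))

# A terminal transition puts all mass at the (clamped) reward.
_, terminal = project_distribution(
    [0.5, 0.0, 0.5], reward=10.0, gamma=0.5, v_min=0.0, v_max=10.0, done=True
)
```

In this example the projected target is approximately `[0.4, 0.5, 0.1]`, whose mean of 3.5 equals $r + \gamma\,\mathbb{E}[Z] = 1 + 0.5 \times 5$. The projection preserves the mean whenever no atom is clamped at the edge of the support.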
The key empirical finding: distributional methods significantly outperform mean-only methods on the Atari benchmark, even for tasks where risk sensitivity seems irrelevant. The richer gradient signal from matching the full distribution — rather than just the mean — stabilizes training and improves the learned representations. (This distributional view is essential for Course 4's RLHF pipelines, where reward model uncertainty directly translates to policy conservatism: in regions where the reward model is uncertain, the agent learns to be cautious.)
QR-DQN (Dabney et al., 2018) extends this by representing the distribution via quantile regression, removing the need for a fixed support and improving approximation quality.
Rainbow: the full combination
Rainbow DQN (Hessel et al., 2018) combines six improvements into a single agent: Double DQN, Prioritized Replay, Dueling Networks, Multi-step returns, Distributional RL, and Noisy Networks (learned exploration noise replacing $\epsilon$-greedy).
The ablation study in the Rainbow paper provides a direct empirical answer to which components matter most:
| Component removed | Drop in median performance |
|---|---|
| Prioritized replay | Large |
| Multi-step returns | Large |
| Distributional RL | Moderate |
| Noisy Networks | Moderate in aggregate, varies by game |
| Dueling networks | Small |
| Double DQN | Small |

Key finding: prioritized replay and multi-step returns are the most important individual components, with distributional RL next. The components are largely complementary: removing most of them degrades median performance, removing dueling networks or Double DQN changes the aggregate score only slightly, and the full combination exceeds what any ablated variant achieves. Rainbow set the state of the art on Atari at the time of publication and remains the standard reference for what a well-engineered DQN variant can achieve.
Failure modes and limits of DQN
Even with all extensions, the DQN family has fundamental limitations:
- Discrete actions only: the $\max_a$ operation requires enumerating all actions. DQN cannot be directly applied to continuous action spaces (robot joint torques, steering angles), which motivates policy gradient methods (Week 7).
- Sample inefficiency: DQN requires tens of millions of environment steps for Atari. Model-based methods (Week 9) and off-policy actor-critics (Week 8) can achieve comparable performance with orders of magnitude less data.
- Representational collapse: neural network Q-functions can suffer from degradation of the learned features over long training runs — a phenomenon distinct from the instabilities addressed by target networks and replay buffers.
GenAI context: DQN ideas in RLHF
DQN itself is rarely applied to language models, but its stabilizing ideas appear throughout RLHF (reinforcement learning from human feedback) pipelines:
- Replay buffers → offline RL and dataset reuse: storing and resampling experience from past policy versions is the foundation of offline RL (Week 10) and is how RLHF pipelines reuse preference data collected under earlier model checkpoints.
- Target networks → frozen reference models: RLHF freezes a reference model and penalizes KL divergence from it. This serves a similar stabilizing role to the target network: it prevents the policy from moving too far from a stable anchor.
- Overestimation bias → reward model conservatism: the same max-over-noisy-estimates problem appears in reward models trained on limited preference data. Overestimated reward scores produce reward hacking; techniques like reward model ensembles and conservative reward penalties address the same underlying bias.
- Distributional views → uncertainty estimation: distributional RL's representation of return variance informs uncertainty-aware reward modeling in RLHF, enabling models to be more conservative in high-uncertainty regions.
Key takeaways
DQN's success follows a clear logic: identify the failure modes precisely (from Week 5), then engineer targeted fixes. Experience replay breaks temporal correlations and provides a stable, broad training distribution, justified by Q-learning's off-policy convergence property. Target networks convert continuous bootstrapping into windowed supervised regression, directly addressing the moving target leg of the deadly triad. Reward clipping stabilizes gradients but distorts the optimal policy, so normalization is preferable when feasible.
The extensions address residual failures. Double DQN decouples selection from evaluation to reduce overestimation bias. Dueling networks separate state value from action advantage, accelerating learning in states where action choice is uninformative, with mean advantage subtraction enforcing identifiability. Prioritized replay focuses learning on high-error transitions using priority-proportional sampling, corrected by importance sampling weights to prevent bias. Distributional RL (C51, QR-DQN) learns the full return distribution rather than the mean, providing richer gradient signal. Rainbow combines all six and shows through ablation that prioritized replay and multi-step returns contribute most.
The fundamental limitation — discrete action spaces — is what motivates the shift to policy gradient methods in Week 7.
Conceptual questions
- Without a replay buffer, DQN trains on consecutive transitions $(s_t, a_t, r_{t+1}, s_{t+1})$, $(s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2})$, etc. Explain precisely why this violates the assumptions of SGD, why the resulting gradient estimates are biased, and how replay buffer sampling restores approximate i.i.d. structure. Why does the off-policy nature of replay buffer sampling not invalidate Q-learning's convergence guarantees?
- The target network is updated every $C$ steps by hard copying $\theta^- \leftarrow \theta$. In between updates, the target is fixed. Explain why this converts bootstrapped TD learning into a sequence of supervised regression problems. What is the tradeoff between small $C$ (frequent updates) and large $C$ (infrequent updates)?
- A dueling network decomposes $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$. Without the mean-subtraction term $\frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$, the decomposition is not uniquely identified. Give a concrete example showing two different $(V, A)$ pairs that produce the same $Q$ values, and explain why the mean subtraction removes this ambiguity.
- Prioritized experience replay samples transition $i$ with probability $P(i) \propto p_i^{\alpha}$. Without importance sampling corrections, this introduces a bias in the gradient estimates. Write out the importance sampling weight $w_i$ and explain what distribution it is correcting from and to. Why is $\beta$ annealed from less than 1 to 1 over training rather than set to 1 from the start?
- You are applying a DQN variant to an RLHF pipeline where the language model policy is fine-tuned using Q-values over the token vocabulary (so $|\mathcal{A}|$ is the vocabulary size, typically tens of thousands of tokens). Identify two specific failure modes from this lecture that are likely to be most severe at this action-space scale. For each, state which DQN extension addresses it and whether that extension is directly applicable to the language setting or requires modification.
Coding exercise: Mini DQN on CartPole
Implement a minimal DQN agent for the CartPole-v1 environment (from Gymnasium, the maintained successor to OpenAI Gym) using PyTorch. The goal is to reproduce the three core stabilizing mechanisms from this lecture in code.
Setup (provided):

    import gymnasium as gym
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import numpy as np
    from collections import deque
    import random

    env = gym.make("CartPole-v1")
    state_dim = env.observation_space.shape[0]  # 4
    action_dim = env.action_space.n             # 2


    class QNetwork(nn.Module):
        """Small MLP mapping a state vector to one Q-value per action."""

        def __init__(self, state_dim, action_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128),
                nn.ReLU(),
                nn.Linear(128, 128),
                nn.ReLU(),
                nn.Linear(128, action_dim),
            )

        def forward(self, x):
            return self.net(x)

Part 1: Replay buffer. Implement a ReplayBuffer class with push(state, action, reward, next_state, done) and sample(batch_size) methods. Store transitions in a deque with maxlen. Return numpy arrays (or torch tensors) for states, actions, rewards, next_states, and dones.
Part 2: Target network. After training the online network $Q_\theta$ for $C$ steps, hard-copy its weights to the target network $Q_{\theta^-}$. Verify by comparing torch.equal(online.state_dict()['net.0.weight'], target.state_dict()['net.0.weight']) before and after the copy.
Part 3: ε-greedy action selection and training loop. Implement the full DQN training loop:
- Select actions $\arg\max_a Q_\theta(s, a)$ with probability $1 - \epsilon$, else random.
- Compute target values $y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$ using the frozen target network (or $y = r$ if terminal).
- Update $\theta$ via MSE loss $(y - Q_\theta(s, a))^2$.
- Anneal $\epsilon$ from 1.0 to 0.01 over the first 10% of episodes.
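The annealing step in the list above does not say which schedule to use. A linear decay is one reasonable choice and is sketched here as an assumption; the function name and the `decay_fraction` parameter are hypothetical.

```python
def linear_epsilon(episode, total_episodes, eps_start=1.0, eps_end=0.01,
                   decay_fraction=0.1):
    """Linearly decay epsilon over the first decay_fraction of training.

    The linear shape is an assumption; the exercise only says "anneal".
    """
    decay_episodes = max(1, int(decay_fraction * total_episodes))
    progress = min(1.0, episode / decay_episodes)  # 0 at start, 1 when done
    return eps_start + progress * (eps_end - eps_start)


# With 500 episodes, epsilon reaches its floor at episode 50 and stays there.
eps_at_start = linear_epsilon(0, total_episodes=500)
eps_halfway = linear_epsilon(25, total_episodes=500)
eps_at_floor = linear_epsilon(50, total_episodes=500)
eps_late = linear_epsilon(400, total_episodes=500)
```

An exponential decay per step is an equally common alternative. The important property is that exploration stays high while the buffer fills and then drops toward a small constant.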
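Expected result: With batch size 64, buffer capacity 10,000, and suitably chosen discount factor $\gamma$, learning rate, and target update period $C$, the agent should reach an average return of at least 195 over 100 consecutive episodes within approximately 500 episodes. (195 is the CartPole-v0 solve threshold; the registered reward threshold for CartPole-v1 is 475.)
Extension: Replace the hard target network update with Polyak (soft) averaging, $\theta^- \leftarrow \tau\,\theta + (1 - \tau)\,\theta^-$ with a small $\tau$, applied every step. Compare convergence speed and stability against the hard update baseline.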
Looking ahead
DQN and its variants require discrete, enumerable action spaces. The next lecture introduces policy gradient methods, which optimize policies directly and naturally extend to continuous action spaces, unlocking robotics, continuous control, and the modern RLHF alignment pipeline. The key insight is a shift from learning $Q(s, a)$ and deriving the policy implicitly, to learning $\pi_\theta(a \mid s)$ directly by following the gradient of expected return.
Further reading
- Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature. (The seminal DQN paper).
- Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI. (Double DQN).
- Wang, Z., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. ICML. (Dueling DQN).
- Schaul, T., et al. (2016). Prioritized Experience Replay. ICLR.
- Bellemare, M. G., et al. (2017). A Distributional Perspective on Reinforcement Learning. ICML. (C51 / Distributional RL).
- Hessel, M., et al. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI.