Week 5: Function Approximation in Reinforcement Learning

Purpose of this lecture#

Here is the central puzzle of this lecture: Why does adding a neural network break a stable algorithm?

Up to this point, we assumed value functions could be represented exactly in tables. This assumption collapses the moment the state space becomes large, continuous, or combinatorial — as is the case in nearly all real-world problems and all GenAI systems.

Function approximation is not optional: it is essential. A DQN agent for Atari learns from $84 \times 84 \times 4$ pixel observations — a state space astronomically larger than any tabular representation could handle. An LLM learns value functions over token sequences of length 4,000, with vocabulary size 50,000 — a computational impossibility for tables.

But here is the problem: the TD learning algorithms from Week 4 converge in the tabular case and can diverge with function approximation. Not because of implementation bugs. Not because of bad hyperparameters. Structurally and provably.

This lecture develops the precise failure modes and connects them to algorithmic solutions. We will show:

What the learning objective is (MSVE) and why the distribution $\mu$ matters
How linear approximation introduces the approximation/estimation error tradeoff
Why semi-gradient TD is not true gradient descent, and why this matters off-policy
How the deadly triad causes divergence: Baird's concrete counterexample
Why gradient TD methods are sound in theory but rarely used in practice
Why neural networks amplify instability and introduce the moving target problem
How overestimation bias compounds and what Double DQN does about it

The lecture follows a logical arc: define what we are optimizing, develop the simplest (linear) approximation class, identify where and why convergence fails, show a provable divergence example, extend to neural networks, and connect the theoretical failures to the algorithmic fixes introduced in Week 6.

Why function approximation is necessary#

Tabular methods require memory proportional to $|\mathcal{S}|$ or $|\mathcal{S}||\mathcal{A}|$ .

A robot with 12 joint angles discretized to 100 positions has $100^{12} = 10^{24}$ states.
An LLM context window of 4,000 tokens over a vocabulary of 50,000 has $50{,}000^{4000}$ states.
Continuous control problems have uncountably infinite state spaces.
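To get a feel for these magnitudes, the short sketch below computes the table sizes in powers of ten. The 4-bytes-per-entry figure is an assumption made only for illustration.

```python
import math

# Robot arm: 12 joints, 100 discrete positions each.
robot_states = 100 ** 12
print(f"robot states: 10^{math.log10(robot_states):.0f}")

# LLM context: 4,000 positions, 50,000 possible tokens at each position.
# The count is too large to print usefully, so work with its base-10 logarithm.
llm_log10_states = 4000 * math.log10(50_000)
print(f"LLM states: about 10^{llm_log10_states:.0f}")

# Memory for a tabular V over the robot state space, assuming 4 bytes per entry.
bytes_needed = robot_states * 4
print(f"robot table: {bytes_needed / 1e12:.1e} terabytes")
```

Even the smaller of the two tables is far beyond any storage system, which is the point of the comparison.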

Enumerating, storing, or updating a table over these spaces is not an engineering challenge — it is a mathematical impossibility. Function approximation replaces the table with a parameterized function:

\hat{V}(s;\, \theta) \quad \text{or} \quad \hat{Q}(s, a;\, \theta)

where $\theta \in \mathbb{R}^d$ are learned parameters, $d \ll |\mathcal{S}|$ . The function approximator must generalize: updating $\theta$ based on experience at state $s$ should improve estimates at nearby or similar states, not just at $s$ itself. This generalization is both the benefit and the source of instability.

The Mean Squared Value Error objective#

Before developing algorithms, we need to define what we are trying to minimize. The standard objective for value function approximation is the Mean Squared Value Error:

\overline{VE}(\theta) = \sum_{s \in \mathcal{S}} \mu(s)\left[V^\pi(s) - \hat{V}(s;\theta)\right]^2

where $\mu(s) \geq 0$ is a weighting over states with $\sum_s \mu(s) = 1$ .

The natural choice for $\mu$ in on-policy learning is the on-policy state visitation distribution: the fraction of time spent in state $s$ when following policy $\pi$ . States visited frequently under $\pi$ receive high weight and must be estimated accurately; rarely visited states can tolerate larger error.

This choice has an immediate and important consequence for off-policy learning: when data is collected under a behavior policy $b \neq \pi$ , the distribution of observed states is the behavior policy's visitation distribution $\mu_b$ , not $\mu_\pi$ . Minimizing $\overline{VE}$ under $\mu_b$ does not minimize $\overline{VE}$ under $\mu_\pi$ : the approximator is being shaped by a distribution that does not match the target. This is the distributional mismatch at the heart of the deadly triad, made precise.
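The effect of the weighting can be seen on a toy problem. The sketch below evaluates the same approximation under two state distributions. The three-state values and both distributions are made-up numbers, and `msve` is a hypothetical helper name.

```python
import numpy as np

def msve(v_true, v_hat, mu):
    """Mean squared value error: sum_s mu(s) * (V_pi(s) - V_hat(s))^2."""
    mu = np.asarray(mu, dtype=float)
    assert np.isclose(mu.sum(), 1.0), "mu must be a probability distribution"
    return float(np.sum(mu * (np.asarray(v_true) - np.asarray(v_hat)) ** 2))

# Hypothetical 3-state problem (all numbers are made up for illustration).
v_true = np.array([1.0, 2.0, 3.0])
v_hat = np.array([1.0, 2.0, 5.0])     # accurate on states 0 and 1, poor on state 2

mu_pi = np.array([0.1, 0.1, 0.8])     # target policy mostly visits state 2
mu_b = np.array([0.45, 0.45, 0.1])    # behavior policy rarely visits state 2

ve_on = msve(v_true, v_hat, mu_pi)    # error as the target policy experiences it
ve_off = msve(v_true, v_hat, mu_b)    # error as the behavior data reports it
print(ve_on, ve_off)
```

The same approximation looks eight times better under the behavior weighting than under the target weighting, because the behavior data rarely visits the one state where the estimate is poor.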

Linear value function approximation#

The simplest approximation class is linear:

\hat{V}(s;\, \theta) = \phi(s)^\top \theta

where $\phi(s) \in \mathbb{R}^d$ is a fixed feature vector and $\theta \in \mathbb{R}^d$ are learned weights. The expressiveness of the approximation is entirely determined by the feature map $\phi$ .

Feature design#

Tile coding: partition the continuous state space into overlapping grids. Each tile is a region of state space; the feature vector has a 1 in the component corresponding to each tile the current state falls into, and 0 elsewhere. Overlapping tilings ensure that nearby states share features and receive similar value estimates — the overlap encodes a smoothness prior over state space. Tile coding is the canonical example of domain knowledge about state structure built directly into the representation.

Radial basis functions (RBFs): each feature $\phi_i(s) = \exp\bigl(-\|s - c_i\|^2 / (2\sigma_i^2)\bigr)$ is a Gaussian centered at $c_i$ . States near center $c_i$ have high feature $i$ activation; distant states have near-zero activation. RBFs provide smooth, localized generalization and are natural for continuous state spaces where the value function is expected to be smooth.

Both tile coding and RBFs embed assumptions about what "nearby" means in state space. When these assumptions are correct, the approximation is efficient. When they are wrong — for instance, applying distance-based features to a high-dimensional space with non-Euclidean structure — the approximation can be poor regardless of how many parameters are used.
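As a concrete illustration of how overlapping tilings encode "nearby", here is a minimal one-dimensional tile coder. The function name `tile_features`, the number of tilings, and the offset scheme are assumptions chosen for simplicity, not a reference implementation.

```python
import numpy as np

def tile_features(s, n_tilings=4, tiles_per_tiling=8, low=0.0, high=1.0):
    """Binary tile-coding features for a scalar state s in [low, high]."""
    width = (high - low) / tiles_per_tiling
    # One extra tile per tiling so that the shifted grids still cover [low, high].
    n_tiles = tiles_per_tiling + 1
    phi = np.zeros(n_tilings * n_tiles)
    for k in range(n_tilings):
        offset = k * width / n_tilings           # each tiling is shifted slightly
        idx = int((s - low + offset) / width)    # tile index within tiling k
        idx = min(idx, n_tiles - 1)
        phi[k * n_tiles + idx] = 1.0             # exactly one active tile per tiling
    return phi

# The inner product counts how many active tiles two states share.
print(tile_features(0.50) @ tile_features(0.52))
print(tile_features(0.50) @ tile_features(0.60))
print(tile_features(0.50) @ tile_features(0.90))
```

With these settings, 0.50 and 0.52 share all four active tiles, 0.50 and 0.60 share one, and 0.50 and 0.90 share none, so an update at 0.50 generalizes strongly to 0.52, weakly to 0.60, and not at all to 0.90.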

A worked example#

Consider a 1D continuous state space $s \in [0,1]$ with two RBF features centered at $c_1 = 0.25$ and $c_2 = 0.75$ . The true value function is $V^\pi(s) = s$ .

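Feature vectors for $\sigma = 0.25$ : $\phi(0.0) \approx [0.61, 0.01]^\top$ , $\phi(0.5) \approx [0.61, 0.61]^\top$ , $\phi(1.0) \approx [0.01, 0.61]^\top$ . Because $s = 0$ and $s = 0.5$ are equally far from $c_1$ , the first feature takes the same value at both states for any $\sigma$ ; the same holds for the second feature at $s = 0.5$ and $s = 1$ .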

The best linear fit $\hat{V}(s;\theta) = \phi(s)^\top\theta$ must find $\theta$ such that the weighted approximation error is minimized. With only two basis functions, the approximation cannot represent $V^\pi(s) = s$ exactly — it will be accurate near the centers and less accurate in between. This irreducible error is the approximation error, determined by the feature class, not the learning algorithm. No algorithm can reduce it below $\min_\theta \overline{VE}(\theta)$ .

This distinction — approximation error (determined by features) vs estimation error (reduced by more data and better algorithms) — is fundamental in statistical learning theory and carries over directly to deep RL, where the network architecture determines the approximation error floor.
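The approximation error floor in the worked example can be computed directly by least squares. This is a sketch under stated assumptions: $\sigma = 0.25$ for both features and a uniform weighting $\mu$ over a grid of 101 states.

```python
import numpy as np

centers = np.array([0.25, 0.75])
sigma = 0.25   # assumed width for both features

def rbf(s):
    """Two Gaussian RBF features for scalar state(s) s; returns shape (n, 2)."""
    s = np.atleast_1d(s)[:, None]
    return np.exp(-(s - centers) ** 2 / (2 * sigma ** 2))

states = np.linspace(0.0, 1.0, 101)   # uniform weighting mu over a grid (assumption)
Phi = rbf(states)                     # feature matrix, shape (101, 2)
v_true = states                       # V_pi(s) = s

# Best linear fit: minimizes the uniformly weighted squared error over theta.
theta, *_ = np.linalg.lstsq(Phi, v_true, rcond=None)
v_hat = Phi @ theta
approx_error = np.mean((v_true - v_hat) ** 2)
print("phi(0) =", rbf(0.0)[0])
print("theta =", theta, " min VE =", approx_error)
```

The residual stays strictly positive no matter how much data is used, because no weighted sum of two Gaussian bumps equals a straight line on $[0,1]$ .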

Semi-gradient TD: what it is and why it is not SGD#

With linear function approximation, the TD(0) update becomes:

\theta \leftarrow \theta + \alpha\, \delta_t\, \phi(s_t)

where the TD error is:

\delta_t = r_{t+1} + \gamma\, \phi(s_{t+1})^\top\theta - \phi(s_t)^\top\theta

A common and important misconception is that this is stochastic gradient descent on the squared TD error $\frac{1}{2}\delta_t^2$ . It is not. The true gradient of $\frac{1}{2}\delta_t^2$ with respect to $\theta$ is:

\nabla_\theta \frac{1}{2}\delta_t^2 = \delta_t \nabla_\theta \delta_t = -\delta_t \bigl(\phi(s_t) - \gamma\phi(s_{t+1})\bigr)

The TD update uses only $\phi(s_t)$ — it omits the $-\gamma\phi(s_{t+1})$ term. This is because the bootstrap target $r_{t+1} + \gamma\phi(s_{t+1})^\top\theta$ is treated as a fixed constant when differentiating, as if it did not depend on $\theta$ . This is the defining characteristic of a semi-gradient method.
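The difference between the two update directions can be checked numerically. In the sketch below the feature vectors, reward, and parameters are arbitrary illustrative values; the finite-difference check confirms which expression is the actual gradient of $\frac{1}{2}\delta_t^2$ .

```python
import numpy as np

gamma, alpha = 0.9, 0.1
theta = np.array([0.5, -0.2])
phi_s = np.array([1.0, 0.0])      # phi(s_t), hypothetical features
phi_next = np.array([0.5, 1.0])   # phi(s_{t+1})
r = 1.0

def td_error(th):
    return r + gamma * phi_next @ th - phi_s @ th

delta = td_error(theta)

# Semi-gradient step: the bootstrap target is held constant while differentiating.
semi_grad_step = alpha * delta * phi_s

# Full gradient of 0.5 * delta^2: the target is differentiated as well.
full_grad = -delta * (phi_s - gamma * phi_next)
full_grad_step = -alpha * full_grad

# Central finite differences: numerical gradient of 0.5 * delta^2.
eps = 1e-6
numeric = np.array([
    (0.5 * td_error(theta + eps * e) ** 2 - 0.5 * td_error(theta - eps * e) ** 2) / (2 * eps)
    for e in np.eye(2)
])
print("semi-gradient step:", semi_grad_step)
print("full-gradient step:", full_grad_step)
print("numeric gradient matches full gradient:", np.allclose(numeric, full_grad, atol=1e-6))
```

The two steps differ in both magnitude and direction: the semi-gradient step moves only along $\phi(s_t)$ , while the full-gradient step also moves against $\gamma\phi(s_{t+1})$ .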

Why the semi-gradient approximation is made#

The full gradient would require differentiating through the bootstrap target, which introduces the term $\gamma\phi(s_{t+1})$ . In a stochastic environment, an unbiased gradient of the expected squared Bellman error then needs two independent samples of the next state (the double-sampling problem) or a model. Semi-gradient TD avoids this by simply not differentiating through the target, making each update $O(d)$ and implementable from a single transition.

Consequences of semi-gradient#

The semi-gradient approximation has two consequences:

On-policy with linear approximation: convergence is preserved. Linear semi-gradient TD(0) converges to the TD fixed point (the projected Bellman equation solution) under on-policy sampling. The proof relies on the fact that the on-policy distribution makes the expected update a contraction in the appropriate norm.
Off-policy: convergence is not guaranteed. With off-policy sampling, the expected semi-gradient update is no longer a contraction. Because the update omits the $\gamma\phi(s_{t+1})$ term, it is not the gradient of any fixed objective, so nothing forces each step to reduce an error measure. The update can point in a direction that increases rather than decreases the Bellman error, leading to divergence.

This is the mathematical reason why the semi-gradient approximation, while computationally convenient, is not theoretically robust — and why off-policy learning with function approximation requires additional care.

What Breaks Here: Off-Policy with Semi-Gradient

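Consider a 2-state MDP with states A and B and all rewards zero, so the true values are zero. The target policy $\pi$ always moves to B; the behavior policy samples the transition from A to B far more often than the target policy's visitation distribution would. A linear approximator with a single feature $\phi(A) = 1, \phi(B) = 2$ shares a single weight $\theta$ , so $\hat{V}(A) = \theta$ and $\hat{V}(B) = 2\theta$ . On a transition from A to B the TD error is $\delta = 2\gamma\theta - \theta = (2\gamma - 1)\theta$ , and the semi-gradient update is $\theta \leftarrow \theta + \alpha(2\gamma - 1)\theta$ . For $\gamma > 0.5$ every such update moves $\theta$ further from zero: raising $\hat{V}(A)$ toward its target also raises the target $\gamma\hat{V}(B)$ , and by more. Under on-policy sampling, updates at B occur often enough to pull $\theta$ back. When the behavior distribution starves those updates, there is no restoring force and $\theta$ grows without bound. An update that kept the $\gamma\phi(s_{t+1})$ term would step along $\phi(A) - \gamma\phi(B) = 1 - 2\gamma$ instead and would shrink $\theta$ here.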
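A minimal numerical sketch of this kind of failure uses the classic one-weight setup in which only the transition from A to B is ever sampled, an extreme off-policy distribution. The features $\phi(A) = 1$ and $\phi(B) = 2$ , the zero reward, and the values of $\gamma$ and $\alpha$ are assumptions chosen for illustration.

```python
gamma, alpha = 0.99, 0.1
phi_A, phi_B = 1.0, 2.0   # assumed features: V_hat(A) = theta, V_hat(B) = 2 * theta

def run(update, steps=50, theta=1.0):
    """Apply the same A -> B update repeatedly and return the final weight."""
    for _ in range(steps):
        theta = update(theta)
    return theta

def semi_gradient(theta):
    delta = 0.0 + gamma * phi_B * theta - phi_A * theta   # reward is zero
    return theta + alpha * delta * phi_A                  # target treated as constant

def residual_gradient(theta):
    delta = 0.0 + gamma * phi_B * theta - phi_A * theta
    return theta + alpha * delta * (phi_A - gamma * phi_B)  # keeps the gamma * phi(s') term

theta_semi = run(semi_gradient)
theta_resid = run(residual_gradient)
print("semi-gradient:", theta_semi)
print("residual gradient:", theta_resid)
```

Each semi-gradient step multiplies $\theta$ by $1 + \alpha(2\gamma - 1) > 1$ , so the weight grows geometrically even though the true values are zero. The residual-gradient step multiplies it by $1 - \alpha(2\gamma - 1)^2 < 1$ , so it decays.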

The projected Bellman equation and convergence bound#

Because $\hat{V}(\cdot;\theta)$ lies in a restricted subspace $\mathcal{F}$ of all value functions, applying the Bellman operator $T^\pi$ to a function in $\mathcal{F}$ generally moves it outside $\mathcal{F}$ . We cannot satisfy the Bellman equation exactly — instead, we project back.

Definition#

Let $\Pi$ be the orthogonal projection onto $\mathcal{F}$ under the $\mu$ -weighted norm. The projected Bellman fixed point satisfies:

\hat{V} = \Pi T^\pi \hat{V}

Linear semi-gradient TD(0) under on-policy sampling converges to exactly this fixed point.

The convergence bound#

Theorem (TD fixed-point bound): Let $\hat{V}_{\text{TD}}$ be the linear TD fixed point. Then:

\left\|V^\pi - \hat{V}_{\text{TD}}\right\|_\mu \leq \frac{1}{\sqrt{1-\gamma}}\left\|V^\pi - \Pi V^\pi\right\|_\mu

where $\|V^\pi - \Pi V^\pi\|_\mu$ is the best possible approximation error achievable by any $\theta$ in the feature class.

Interpretation: the TD solution is at most $1/\sqrt{1-\gamma}$ times worse than the best approximation under the given features. This bound has two direct implications:

If the feature class represents $V^\pi$ well (small $\|V^\pi - \Pi V^\pi\|_\mu$ ), the TD solution is also good.
The factor $1/\sqrt{1-\gamma}$ grows as $\gamma \to 1$ . For long-horizon problems (large $\gamma$ ), even a good feature class can produce a TD solution that is significantly worse than the best possible approximation. High-discount problems are harder — not just computationally, but in terms of approximation quality.
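The bound can be checked numerically on a small problem where everything is computable in closed form. The sketch below uses a randomly generated 6-state Markov chain and random 2-dimensional features, both of which are assumptions for illustration. It solves for the TD fixed point from the linear system $\Phi^\top D(\Phi - \gamma P\Phi)\theta = \Phi^\top D r$ .

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 6, 2, 0.9

# Hypothetical random MDP under a fixed policy: transition matrix P and rewards r.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n)
Phi = rng.standard_normal((n, d))   # random features (assumption)

# On-policy weighting: stationary distribution of P (left eigenvector for eigenvalue 1).
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()
D = np.diag(mu)

v_pi = np.linalg.solve(np.eye(n) - gamma * P, r)   # exact V_pi

# Best approximation: mu-weighted projection of V_pi onto span(Phi).
theta_best = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ v_pi)

# TD fixed point: solves Phi^T D (Phi - gamma P Phi) theta = Phi^T D r.
A = Phi.T @ D @ (Phi - gamma * P @ Phi)
theta_td = np.linalg.solve(A, Phi.T @ D @ r)

def mu_norm(x):
    """mu-weighted Euclidean norm."""
    return float(np.sqrt(x @ D @ x))

err_best = mu_norm(v_pi - Phi @ theta_best)
err_td = mu_norm(v_pi - Phi @ theta_td)
bound = err_best / np.sqrt(1 - gamma)
print(f"best: {err_best:.4f}  TD: {err_td:.4f}  bound: {bound:.4f}")
```

The TD error always sits between the best achievable error and the bound. Rerunning with $\gamma$ closer to 1 loosens the bound, as the interpretation above describes.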

Why function approximation makes the deadly triad dangerous#

The deadly triad was introduced in Week 4: the combination of function approximation, bootstrapping, and off-policy learning can cause divergence. Now that we have the machinery of function approximation, we can explain the mechanism precisely.

In the tabular case, off-policy updates change the value of specific states. The update to $V(s_t)$ affects only $s_t$ — there is no coupling between states. Off-policy sampling affects which states get updated and how often, but individual updates are still correct Bellman updates. The distribution mismatch is manageable.

With function approximation, all states share the parameter vector $\theta$ . Updating $\theta$ based on a transition at $s_t$ changes $\hat{V}(s;\theta)$ for every state $s$ — not just $s_t$ . The update direction is $\delta_t \phi(s_t)$ , which moves the value estimates of all states in directions determined by the feature similarities to $s_t$ . If the off-policy distribution concentrates updates on states with features that are negatively correlated with the on-policy states that matter for $V^\pi$ , the approximator can be pushed systematically away from $V^\pi$ — with no restoring force, because the updates never reflect the on-policy distribution.

This is the coupling that makes the deadly triad dangerous with approximation but not with tables: the function approximator ties all states together, so off-policy updates at one state corrupt the estimates at others.

Intuition: Parameter Sharing as a Double-Edged Sword

In the tabular setting, each state has its own independent entry $V(s)$ . Updating $V(s_1)$ leaves $V(s_2)$ completely unchanged — states are decoupled. With function approximation, updating $\theta$ using data from $s_1$ changes $\hat{V}(s_2;\theta)$ by an amount proportional to the feature similarity $\phi(s_2)^\top\phi(s_1)$ . This is desirable when states genuinely share structure — it is how we get generalization. But it is dangerous when the off-policy distribution feeds updates from states whose optimal values are negatively correlated with the states the target policy actually visits. The parameter sharing that enables generalization also enables contamination.

Baird's counterexample#

Baird (1995) constructed an example showing that linear semi-gradient TD with off-policy sampling diverges even in a tiny MDP with a linear function approximator — not due to noise or numerical issues, but structurally.

The setup#

Consider a 7-state MDP with a star topology: states $\{1, 2, 3, 4, 5, 6\}$ are peripheral; state $7$ is central.

Behavior policy $b$ : with probability $6/7$ , transition to a peripheral state chosen uniformly at random; with probability $1/7$ , transition to the central state.
Target policy $\pi$ : always transition to the central state.
All rewards are zero.
Discount $\gamma = 0.99$ .

The linear feature representation for each state $s_i$ is:

\phi(s_i) = \begin{cases} 2e_i + e_7 & i \in \{1,\ldots,6\} \\ 2e_7 & i = 7 \end{cases}

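where $e_i$ is the $i$ th standard basis vector in $\mathbb{R}^7$ , giving $\theta \in \mathbb{R}^7$ . (Textbook presentations of Baird's example usually add an eighth weight; the seven-weight variant used here has the same kind of unstable expected dynamics.)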

Why it diverges#

Under the behavior policy, updates concentrate on peripheral states. The feature representation ties the peripheral state components $e_i$ to the shared component $e_7$ . Under off-policy bootstrapping, each peripheral-state update moves the shared weight $\theta_7$ in a direction inconsistent with the target policy's visitation distribution. The peripheral weights $\theta_1, \ldots, \theta_6$ and the shared weight $\theta_7$ amplify each other through the bootstrap target, and the weights grow without bound.

Running TD(0) on this example with any constant step size $\alpha > 0$ causes $\|\theta\| \to \infty$ — all parameter components diverge to $\pm\infty$ , even though the true value function under $\pi$ is identically zero (since all rewards are zero).

Takeaway: divergence under the deadly triad is not a pathology of bad initialization or poor hyperparameter tuning. It is structural and provable. The fixes are either heuristic stabilizers that work well empirically (target networks, replay buffers) or algorithms with convergence guarantees (gradient TD methods).

What Breaks Here: The Coupling Problem in Miniature

Walk through the weight dynamics: consider $\theta_7$ , the shared weight connected to the central state. Under the target policy $\pi$ , the agent is always at state 7 — $\hat{V}(7;\theta) = 2\theta_7$ . Under the behavior policy, the agent spends most time at peripheral states $i$ , where $\hat{V}(i;\theta) = 2\theta_i + \theta_7$ . Each peripheral update changes $\theta_i$ (peripheral weight) and $\theta_7$ (shared weight). But the direction of $\theta_7$ 's update from peripheral data does not align with the direction needed for the target policy's valuation of state 7. The bootstrap target couples $\theta_i$ and $\theta_7$ through the feature vector: an error at state $i$ drives $\theta_i$ and $\theta_7$ in ways that amplify each other through the $\gamma\phi(s_{t+1})^\top\theta$ term. The result is a positive feedback loop — weights grow without bound even though the true values are all zero.
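The feedback loop can be made visible without any sampling noise by iterating the expected update. The sketch below uses the seven-weight features given above and makes two assumptions: states are weighted uniformly (the behavior policy's visitation distribution), and the bootstrap target follows the target policy, as it would under importance-weighted updates. The starting weights are arbitrary.

```python
import numpy as np

gamma, alpha = 0.99, 0.01
n = 7

# Features from the lecture: phi(s_i) = 2 e_i + e_7 (i = 1..6), phi(s_7) = 2 e_7.
Phi = np.zeros((n, n))
for i in range(6):
    Phi[i, i] = 2.0
    Phi[i, 6] = 1.0
Phi[6, 6] = 2.0

P_target = np.zeros((n, n))
P_target[:, 6] = 1.0     # target policy always moves to state 7
D = np.eye(n) / n        # behavior policy visits all seven states uniformly

# Expected semi-gradient update with zero rewards: theta <- theta - alpha * A @ theta.
A = Phi.T @ D @ (Phi - gamma * P_target @ Phi)
eigs = np.linalg.eigvals(A)
print("smallest real part of eig(A):", eigs.real.min())

theta = np.ones(n)
theta[6] = 10.0          # arbitrary nonzero start (assumption)
norms = []
for _ in range(2000):
    theta = theta - alpha * A @ theta
    norms.append(np.linalg.norm(theta))
print("||theta|| after 1 step:", norms[0], " after 2000 steps:", norms[-1])
```

Stable expected dynamics would require every eigenvalue of `A` to have a positive real part. Here two of them are negative, so the corresponding components of $\theta$ grow geometrically for any positive step size.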

Gradient TD methods: the theoretical resolution#

Semi-gradient TD is not true gradient descent, and this is the root cause of its off-policy instability. The natural fix is to perform true gradient descent on a well-defined scalar objective — specifically, the projected Bellman error (PBE):

\overline{PBE}(\theta) = \left\|\Pi T^\pi \hat{V}_\theta - \hat{V}_\theta\right\|_\mu^2

Gradient TD methods (GTD, GTD2, TDC; Sutton et al., 2009) compute the true gradient of $\overline{PBE}(\theta)$ with respect to $\theta$ , using a second set of parameters to estimate the gradient correction term online. Unlike semi-gradient TD, gradient TD methods with linear function approximation converge even when all three deadly triad conditions hold at once: function approximation, bootstrapping, and off-policy sampling.
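To make the role of the second parameter set concrete, the sketch below runs TDC in its usual textbook form next to semi-gradient TD(0) on a single repeatedly sampled transition with zero reward. The one-dimensional features ( $\phi = 1$ , $\phi' = 2$ ), the step sizes, and the variable names are assumptions chosen for illustration.

```python
gamma = 0.99
alpha, beta = 0.005, 0.05   # assumed step sizes; the helper weight w learns faster
phi, phi_next = 1.0, 2.0    # hypothetical one-feature transition, reward is zero

theta_semi = 1.0
theta_tdc, w = 1.0, 0.0
for _ in range(3000):
    # Semi-gradient TD(0): grows on this transition whenever gamma > 0.5.
    delta = gamma * phi_next * theta_semi - phi * theta_semi
    theta_semi += alpha * delta * phi

    # TDC: the semi-gradient term plus a correction built from the helper weight w.
    delta = gamma * phi_next * theta_tdc - phi * theta_tdc
    theta_tdc, w = (
        theta_tdc + alpha * (delta * phi - gamma * phi_next * (phi * w)),
        w + beta * (delta - phi * w) * phi,   # w tracks the expected TD error per feature
    )

print("semi-gradient theta:", theta_semi)
print("TDC theta:", theta_tdc)
```

The helper weight `w` learns to predict the TD error from the current features, and the correction term subtracts the part of the step that would otherwise chase the moving target. On this example the TDC weight decays toward the true value of zero while the semi-gradient weight blows up.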

Why gradient TD is not universally used#

Despite their theoretical advantages, gradient TD methods have not displaced semi-gradient TD in practice for two reasons:

Slower convergence: the gradient correction term introduces additional variance and slows convergence relative to semi-gradient TD with stabilization tricks.
Engineering solutions are sufficient in practice: target networks and replay buffers (introduced in Week 6) stabilize semi-gradient TD empirically in most settings where gradient TD would be theoretically required. The gap between theory and practice is bridged by engineering rather than by using the theoretically correct algorithm.

This is a recurring pattern in deep RL: theoretically correct algorithms often underperform practically motivated heuristics that violate the theory but work well empirically. Knowing the theory is what allows you to reason about when the heuristics will fail.

Why Not Always Use Gradient TD?

Gradient TD methods maintain a second set of parameters (a "correction" vector $w \in \mathbb{R}^d$ ) that is learned online and used to form the gradient correction involving $\gamma\phi(s_{t+1})$ . This doubles the memory and per-step computation, and the auxiliary parameter introduces additional variance that slows convergence, so GTD2 and TDC can need substantially more samples than semi-gradient TD with experience replay on the same problem. In practice, the combination of a replay buffer (which decorrelates consecutive samples and smooths the training distribution, although it does not make the data on-policy) and a target network (which stabilizes the bootstrap target) recovers much of the stability that gradient TD guarantees, without the extra variance. Gradient TD is the theoretically correct tool, but the engineering heuristics work better empirically until they don't, at which point the theory tells you why.

From linear to neural network approximation#

Neural networks generalize linear approximation:

\hat{Q}(s, a;\, \theta) = \text{NN}_\theta(s, a)

Advantages:

Features are learned from data rather than hand-designed.
Deep networks can represent complex, non-local value functions.
Scales to images (convolutional), text (transformer), and multimodal inputs.

Amplified instabilities: every failure mode of linear approximation is present in neural networks, and nonlinearity introduces additional ones:

Non-convex loss landscape: gradient updates can follow directions that increase loss.
Changing representations: the implicit features $\phi(s;\theta)$ (the penultimate layer) change as $\theta$ is updated, making the effective feature class non-stationary.
Moving bootstrap targets: the bootstrap target $r + \gamma\max_{a'}\hat{Q}(s',a';\theta)$ changes every time $\theta$ is updated, so the "target" the network is regressing toward is itself a function of the parameters being optimized. Supervised learning has no such problem because its labels are fixed. The target already depends on $\theta$ in the linear case, but with a nonlinear network a single update can shift the targets of many unrelated states in ways that are hard to predict, which makes this a much stronger source of instability.

The moving bootstrap target problem motivates the target network architecture in DQN: freeze a copy of $\theta$ to compute bootstrap targets, updating it periodically rather than every step. This decouples the target from the current parameters and is the primary stabilization mechanism in DQN.
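The mechanics of a target network fit in a few lines. The sketch below uses a linear Q-function in place of a neural network to stay self-contained; the class name `LinearQWithTarget`, the hard-sync schedule, and all hyperparameters are assumptions for illustration rather than the DQN implementation.

```python
import numpy as np

class LinearQWithTarget:
    """Linear Q(s, a) = theta[a] @ phi(s) with a periodically synced target copy."""

    def __init__(self, n_actions, n_features, sync_every=100, alpha=0.01, gamma=0.99):
        self.theta = np.zeros((n_actions, n_features))
        self.theta_target = self.theta.copy()   # frozen copy used for bootstrap targets
        self.sync_every, self.alpha, self.gamma = sync_every, alpha, gamma
        self.steps = 0

    def update(self, phi, a, r, phi_next, done):
        # The bootstrap target uses the frozen parameters, not the ones being trained.
        bootstrap = 0.0 if done else self.gamma * np.max(self.theta_target @ phi_next)
        delta = r + bootstrap - self.theta[a] @ phi
        self.theta[a] += self.alpha * delta * phi   # semi-gradient step on the online weights
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.theta_target = self.theta.copy()   # hard sync every sync_every steps

# Usage: between syncs the regression target stays fixed while theta moves.
q = LinearQWithTarget(n_actions=2, n_features=3, sync_every=2)
phi = np.array([1.0, 0.0, 0.0])
q.update(phi, a=0, r=1.0, phi_next=phi, done=False)
print(q.theta[0], q.theta_target[0])
```

Between syncs, each update is an ordinary regression step toward a fixed target, which is the decoupling described above.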

Overestimation bias#

Q-learning uses:

\max_{a'} Q(s_{t+1}, a')

as the bootstrap target. When $Q$ -value estimates are noisy, taking a maximum introduces a systematic positive bias.

Derivation#

Let $Q(s,a) = Q^*(s,a) + \epsilon_a$ where $\{\epsilon_a\}_{a=1}^K$ are zero-mean estimation errors, not necessarily independent. Then:

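\mathbb{E}\left[\max_a Q(s,a)\right] = \mathbb{E}\left[\max_a (Q^*(s,a) + \epsilon_a)\right] \geq \max_a \mathbb{E}\left[Q^*(s,a) + \epsilon_a\right] = \max_a Q^*(s,a)

The inequality is Jensen's inequality applied to the convex $\max$ function: the expectation of a maximum is at least the maximum of the expectations. The bias is easiest to see when all actions share the same true value $Q^*(s,a) = q$ : then $\mathbb{E}[\max_a Q(s,a)] = q + \mathbb{E}[\max_a \epsilon_a]$ , and $\mathbb{E}[\max_a \epsilon_a] > 0$ whenever the $\epsilon_a$ are not all identical, which they never are in practice. When the true values differ, $\max_a Q^*(s,a) + \mathbb{E}[\max_a \epsilon_a]$ is an upper bound on the expected maximum rather than a lower bound, because $\max_a (x_a + y_a) \leq \max_a x_a + \max_a y_a$ .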

How bias grows#

The overestimation bias grows with:

Number of actions $|\mathcal{A}|$ : more actions means more opportunities for an upward noise spike to be selected as the maximum. With $K$ i.i.d. Gaussian errors $\epsilon_a \sim \mathcal{N}(0,\sigma^2)$ , the expected maximum scales asymptotically as $\sigma\sqrt{2\log K}$ , so it grows only like $\sqrt{\log K}$ in $K$ but is positive for any $K > 1$ .
Estimation variance $\sigma^2$ : noisier estimates produce larger max bias. Early in training, when estimates are highly uncertain, bias is worst.

Compounding effects#

Overestimation in $Q(s_{t+1}, \cdot)$ propagates into $Q(s_t, \cdot)$ through the Bellman update. Over many updates, overestimates compound backward through the value function, producing systematically inflated $Q$ -values that cause the policy to prefer actions that look good due to noise rather than true value.

Double Q-learning: the fix#

Double Q-learning (van Hasselt, 2010) decouples action selection from action evaluation:

r + \gamma Q(s', \arg\max_{a'} Q(s',a';\theta);\, \theta^-)

where $\theta$ selects the action and $\theta^-$ (a separate or lagged parameter vector) evaluates it. Since the same noise that inflates $Q(s',a^*;\theta)$ is unlikely to also inflate $Q(s',a^*;\theta^-)$ if $\theta^-$ is independent, the upward bias is reduced. In DQN, the target network $\theta^-$ serves naturally as the evaluation network in Double DQN, integrating the fix without additional overhead. Because the target network is a lagged copy of $\theta$ rather than an independent estimate, the decorrelation is only partial, so the bias is reduced rather than eliminated.
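The effect of decoupling selection from evaluation can be simulated directly. The sketch below assumes two fully independent noisy estimators with true values of zero, which is a stronger assumption than what a lagged target network provides. The values of $K$ , $\sigma$ , and the trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma, trials = 8, 0.5, 200_000

# True Q* is zero for every action, so any nonzero mean below is pure bias.
q_a = rng.normal(0.0, sigma, size=(trials, K))   # estimator used to select the action
q_b = rng.normal(0.0, sigma, size=(trials, K))   # independent estimator used to evaluate it

single = q_a.max(axis=1)                               # select and evaluate with q_a
double = q_b[np.arange(trials), q_a.argmax(axis=1)]    # select with q_a, evaluate with q_b

single_bias = single.mean()
double_bias = double.mean()
print(f"single estimator bias: {single_bias:.3f}")
print(f"double estimator bias: {double_bias:.3f}")
```

The single estimator is biased upward by roughly $0.7$ here, while the double estimator's bias is statistically indistinguishable from zero, because the noise in the evaluating estimate is independent of which action was selected.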

What Breaks Here: The Bias from the Max Operator

Concrete example: suppose all rewards are zero, so $Q^*(s,a) = 0$ for all actions. With $\sigma = 0.5$ estimation noise and $K = 4$ actions, Gaussian order statistics tell us $\mathbb{E}[\max_a \epsilon_a] \approx 0.5 \cdot 1.03 \approx 0.51$ . This bias compounds through bootstrapping: each backup along a chain of states adds roughly 0.51 of spurious value before discounting, so after 100 backups the estimates can inflate toward 50 in the undiscounted limit (with discounting, the accumulated bias is capped near $0.51/(1-\gamma)$ ), despite true values of zero. With $K = 50{,}000$ (language model vocabulary), the expected maximum of $K$ standard normals is roughly 4.2, so $\mathbb{E}[\max_a \epsilon_a] \approx 0.5 \cdot 4.2 = 2.1$ , about 4x larger from the action count alone. This is why Double DQN, which decouples action selection from action evaluation, matters most for problems with large action spaces.

GenAI context: why this matters for language models#

In language modeling:

States are token histories — space of size $50{,}000^{4000}$
Actions are next-token choices — $|\mathcal{A}| \approx 50{,}000$
Value functions must be approximated by neural networks

All three deadly triad conditions are present in RLHF pipelines:

Function approximation: neural network value head on the language model
Bootstrapping: GAE (Generalized Advantage Estimation) uses TD-style returns
Off-policy data: rollouts are generated by old policy checkpoints

The overestimation bias is particularly acute: with $|\mathcal{A}| = 50{,}000$ tokens, the expected max bias over the next-token Q-values is substantial. This is one reason RLHF typically uses actor-critic with a value baseline rather than Q-learning directly — the value function $V(s)$ (over states, not state-action pairs) avoids the explicit max over the full vocabulary.

Without careful algorithm design — clipped importance ratios (PPO), frozen reference models, KL divergence penalties, careful advantage normalization — the instabilities described in this lecture manifest as degenerate outputs, mode collapse, and reward hacking. The stabilization techniques in RLHF are not arbitrary engineering choices. They are direct responses to the theoretical failure modes introduced here.

Intuition: Why RLHF Uses V(s) Not Q(s,a)

With a Q-function over 50,000 actions, every Bellman backup requires computing $\max_{a' \in \mathcal{A}} Q(s',a';\theta)$ : a maximum over 50,000 noisy outputs, or 50,000 forward passes if the network takes the action as an input. A value function $V(s)$ has a single scalar output and needs no maximization at all. Beyond the computational advantage, $V(s)$ also avoids the overestimation bias from the max operator entirely: there is no max to take. RLHF pipelines use Generalized Advantage Estimation (GAE), which estimates $A(s,a) = Q(s,a) - V(s)$ using $V(s)$ as the baseline. This requires only $V(s)$ , not the full $Q(s,a)$ , while the advantage $A(s,a)$ is estimated from the rollout returns. The choice of $V(s)$ over $Q(s,a)$ in RLHF is driven by both the overestimation problem and computational constraints at vocabulary scale.
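For reference, here is a minimal sketch of the GAE recursion for a single trajectory, showing that only $V(s)$ is needed. The trade-off parameter $\lambda$ (here `lam`) is GAE's standard second hyperparameter, which this lecture does not introduce; the function name and default values are assumptions.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), with V(s_T) = 0 if s_T is terminal
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error, uses only V
        running = delta + gamma * lam * running                  # discounted sum of TD errors
        adv[t] = running
    return adv

# With lam = 0 the advantage is the one-step TD error; with lam = 1 it is the
# Monte Carlo return minus the baseline V(s_t).
print(gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

No maximum over actions appears anywhere in the recursion, which is why this estimator sidesteps the max-induced bias discussed above.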

Key takeaways#

The lecture traces a progression from the ideal to the practically achievable:

The MSVE objective defines what we optimize, with the on-policy weighting $\mu$ tying the objective to the states the target policy actually visits.
Linear approximation introduces the approximation/estimation error decomposition and the feature design choices that determine the approximation error floor.
Semi-gradient TD is not SGD: omitting the gradient through the bootstrap target is what makes it computationally cheap, and what removes its convergence guarantees off-policy.
The projected Bellman fixed point is where on-policy linear TD converges, with the $1/\sqrt{1-\gamma}$ bound quantifying the cost of approximation.
The deadly triad is dangerous with function approximation because parameter sharing couples all states, so off-policy updates at one state corrupt estimates elsewhere.
Baird's counterexample proves that this divergence is structural rather than a pathology of tuning or initialization.
Gradient TD methods provide the theoretically correct fix, true gradient descent on the projected Bellman error, but are superseded in practice by engineering stabilization (target networks, replay buffers).
Neural networks amplify all linear instabilities and make the moving bootstrap target problem much worse, directly motivating the DQN target network.
Overestimation bias arises from the max operator over noisy estimates and compounds backward through the value function; Double DQN decouples selection from evaluation to reduce it.

Conceptual questions#

The TD(0) update for linear function approximation is $\theta \leftarrow \theta + \alpha\delta_t\phi(s_t)$ . Write out the true gradient of $\frac{1}{2}\delta_t^2$ with respect to $\theta$ and identify the term that the semi-gradient update omits. Under what sampling distribution is the semi-gradient update nevertheless guaranteed to converge, and to what point? Why does off-policy sampling break that guarantee?
The TD fixed-point bound gives $\|V^\pi - \hat{V}_{\text{TD}}\|_\mu \leq \frac{1}{\sqrt{1-\gamma}}\|V^\pi - \Pi V^\pi\|_\mu$ . Suppose $\gamma = 0.99$ and the best linear approximation error is 0.1. What is the worst-case TD error? What does this imply about using high-discount TD for long-horizon robotics tasks?
In the tabular setting, off-policy Q-learning converges to $Q^*$ . In the linear function approximation setting with off-policy sampling, TD can diverge (Baird's example). Identify the precise structural difference between the tabular and approximation cases that explains why off-policy sampling is benign in one case and catastrophic in the other.
You are designing a Q-learning agent for a robotic manipulation task with 8 discrete actions. After 100k training steps, you notice the Q-values are systematically much larger than the true returns. Diagnose this as overestimation bias: derive the expected magnitude of the bias as a function of estimation variance $\sigma^2$ and number of actions $K$ , and describe the Double Q-learning fix at the level of the update equation.
An RLHF pipeline trains a value head on top of a language model using TD(0) with data collected from old policy checkpoints. Map this setting onto the deadly triad: identify which component each ingredient corresponds to, explain which specific failure mode is most likely to manifest first, and propose one concrete mitigation for each component of the triad.

Coding exercise#

Simulate the overestimation bias in the max operator to verify the $\sigma\sqrt{2\log K}$ approximate scaling.

Use NumPy to implement the following:

import numpy as np

def simulate_max_bias(K, sigma, num_trials=10000):
    """
    Simulate the expected overestimation bias of max over K actions.

    Args:
        K (int): number of actions
        sigma (float): standard deviation of estimation noise
        num_trials (int): number of independent trials

    Returns:
        float: estimated E[max_a epsilon_a]
    """
    noise = np.random.randn(num_trials, K) * sigma
    max_noise = noise.max(axis=1)  # max over actions for each trial
    return float(max_noise.mean())

# Verify the approximate scaling
for K in [2, 4, 8, 16, 32]:
    estimated_bias = simulate_max_bias(K, sigma=0.5)
    predicted = 0.5 * np.sqrt(2 * np.log(K))
    print(f"K={K:3d}:  estimated bias={estimated_bias:.3f},  predicted≈{predicted:.3f}")

Expected output (your numbers will vary slightly due to sampling):

K=  2:  estimated bias=0.282,  predicted≈0.589
K=  4:  estimated bias=0.515,  predicted≈0.833
K=  8:  estimated bias=0.712,  predicted≈1.020
K= 16:  estimated bias=0.883,  predicted≈1.177
K= 32:  estimated bias=1.035,  predicted≈1.316

Explain why the estimated bias is consistently lower than the $\sigma\sqrt{2\log K}$ prediction — what assumption does the approximation make that is violated in the simulation?

Extension prompts#

Implement Baird's counterexample. Build the 7-state star MDP in Python (or your language of choice) with the exact feature representation from the lecture. Run linear semi-gradient TD(0) with off-policy sampling and $\alpha = 0.01$ . Watch $\|\theta\|$ diverge. Then switch to on-policy sampling and verify convergence. How many steps until divergence is visually obvious?
Compare semi-gradient vs gradient TD. Using OpenAI Gym's CartPole-v1, train a linear Q-function (RBF features over the 4D state space) with both semi-gradient TD(0) and gradient TD (GTD2). Plot the parameter norm over training steps. Where does semi-gradient diverge?
Design a feature map that avoids Baird's problem. Propose a different linear feature representation for Baird's 7-state MDP such that off-policy semi-gradient TD(0) converges. What property of your feature map ensures contractivity?

Looking ahead#

The next lecture introduces Deep Q-Networks (DQN) — the first algorithm to successfully stabilize Q-learning with neural networks. Each of DQN's two core innovations maps directly onto a failure mode from this lecture:

Experience replay breaks temporal correlations in training data and smooths the distribution of training samples, bringing the updates closer to the i.i.d. sampling that stochastic gradient methods assume.
Target networks freeze the bootstrap target, directly addressing the moving target problem introduced by neural network function approximation with bootstrapping.

Understanding the failures introduced here is what makes DQN's design choices interpretable as principled engineering responses rather than empirical tricks.