Skip to main content
illumin8
Courses
Week 5: Function Approximation in Reinforcement Learning
Reinforcement Learning
01Week 1: Reinforcement Learning Problem Formulation
02Week 2: Multi-Armed Bandits
03Week 3: Dynamic Programming for Finite MDPs
04Week 4: Monte Carlo and Temporal-Difference Learning
05Week 5: Function Approximation in Reinforcement Learning
06Week 6: Deep Q-Learning and Variants
07Week 7: Policy Gradient and Actor–Critic Methods
08Week 8: Modern Deep Reinforcement Learning Algorithms
09Week 9: Exploration, Partial Observability, and Multi-Agent Reinforcement Learning
10Week 10: Model-Based Reinforcement Learning and Planning
11Week 11: Offline Reinforcement Learning
12Week 12: Reinforcement Learning from Human Feedback
13Week 13: Direct Preference Optimization and GRPO
14Week 14: Agentic Systems and Course Capstone
Week 5

Week 5: Function Approximation in Reinforcement Learning

✦Learning Outcomes
  • Define the Mean Squared Value Error (MSVE) objective
  • Identify the "deadly triad" and explain why it causes divergence
  • Understand overestimation bias in Q-learning
  • Connect theoretical failure modes to algorithmic solutions (DQNDeep Q-Network, target networks)
◆Prerequisites
  • Week 4: TDTemporal Difference learning, bootstrapping, SARSAState-Action-Reward-State-Action, Q-learning
  • Week 3: Bellman operators, value iteration

Recommended: Review Week 4 sections on "TDTemporal Difference learning" and "Control" before proceeding.

Purpose of this lecture

Here is the central puzzle of this lecture: Why does adding a neural network break a stable algorithm?

Up to this point, we assumed value functions could be represented exactly in tables. This assumption collapses the moment the state space becomes large, continuous, or combinatorial — as is the case in nearly all real-world problems and all GenAI systems.

Function approximation is not optional: it is essential. A DQNDeep Q-Network agent for Atari learns from 84×84×484 \times 84 \times 484×84×4 pixel observations — a state space astronomically larger than any tabular representation could handle. An LLM learns value functions over token sequences of length 4,000, with vocabulary size 50,000 — a computational impossibility for tables.

But here is the problem: the TDTemporal Difference learning algorithms from Week 4 converge in the tabular case and diverge with function approximation. Not because of implementation bugs. Not because of bad hyperparameters. Structurally and provably.

This lecture develops the precise failure modes and connects them to algorithmic solutions. We will show:

  1. What the learning objective is (MSVE) and why the distribution μ\muμ matters
  2. How linear approximation introduces the approximation/estimation error tradeoff
  3. Why semi-gradient TDTemporal Difference is not true gradient descent, and why this matters off-policy
  4. How the deadly triad causes divergence: Baird's concrete counterexample
  5. Why gradient TDTemporal Difference methods work in theory but not in practice
  6. Why neural networks amplify instability and introduce the moving target problem
  7. How overestimation bias compounds and what Double DQNDeep Q-Network does about it

The lecture follows a logical arc: define what we are optimizing, develop the simplest (linear) approximation class, identify where and why convergence fails, show a provable divergence example, extend to neural networks, and connect the theoretical failures to the algorithmic fixes introduced in Week 6.


Why function approximation is necessary

Tabular methods require memory proportional to ∣S∣|\mathcal{S}|∣S∣ or ∣S∣∣A∣|\mathcal{S}||\mathcal{A}|∣S∣∣A∣.

  • A robot with 12 joint angles discretized to 100 positions has 10012=1024100^{12} = 10^{24}10012=1024 states.
  • An LLMLarge Language Model context window of 4,000 tokens over a vocabulary of 50,000 has 50,000400050{,}000^{4000}50,0004000 states.
  • Continuous control problems have uncountably infinite state spaces.

Enumerating, storing, or updating a table over these spaces is not an engineering challenge — it is a mathematical impossibility. Function approximation replaces the table with a parameterized function:

V^(s; θ)orQ^(s,a; θ)\hat{V}(s;\, \theta) \quad \text{or} \quad \hat{Q}(s, a;\, \theta)V^(s;θ)orQ^​(s,a;θ)

where θ∈Rd\theta \in \mathbb{R}^dθ∈Rd are learned parameters, d≪∣S∣d \ll |\mathcal{S}|d≪∣S∣. The function approximator must generalize: updating θ\thetaθ based on experience at state sss should improve estimates at nearby or similar states, not just at sss itself. This generalization is both the benefit and the source of instability.

✦Intuition: The Generalization Assumption

Generalization is a double-edged sword. When we update θ\thetaθ based on a transition at state sss, we change V^(s′;θ)\hat{V}(s';\theta)V^(s′;θ) for every state s′s's′ — not just sss. This is the entire point of function approximation: without it, every state is an island and we learn nothing from experience. But this coupling also means that an update from a poorly-understood state, or an update driven by the wrong distribution, can corrupt estimates everywhere. The same parameter sharing that makes function approximation powerful is what makes it fragile.


The Mean Squared Value Error objective

Before developing algorithms, we need to define what we are trying to minimize. The standard objective for value function approximation is the Mean Squared Value Error:

VE‾(θ)=∑s∈Sμ(s)[Vπ(s)−V^(s;θ)]2\overline{VE}(\theta) = \sum_{s \in \mathcal{S}} \mu(s)\left[V^\pi(s) - \hat{V}(s;\theta)\right]^2VE(θ)=s∈S∑​μ(s)[Vπ(s)−V^(s;θ)]2

where μ(s)≥0\mu(s) \geq 0μ(s)≥0 is a weighting over states with ∑sμ(s)=1\sum_s \mu(s) = 1∑s​μ(s)=1.

The natural choice for μ\muμ in on-policy learning is the on-policy state visitation distribution: the fraction of time spent in state sss when following policy π\piπ. States visited frequently under π\piπ receive high weight and must be estimated accurately; rarely visited states can tolerate larger error.

This choice has an immediate and important consequence for off-policy learning: when data is collected under a behavior policy μb≠π\mu_b \neq \piμb​=π, the distribution of observed states reflects μb\mu_bμb​, not μπ\mu_\piμπ​. Minimizing VE‾\overline{VE}VE under μb\mu_bμb​ does not minimize VE‾\overline{VE}VE under μπ\mu_\piμπ​ — the approximator is being shaped by a distribution that does not match the target. This is the distributional mismatch at the heart of the deadly triad, made precise.

⚠What Breaks Here: On-Policy vs Off-Policy Distributions

Consider an off-policy Q-learning agent exploring uniformly while the target policy is near-deterministic. The uniform behavior distribution spends equal time in all states, but the target policy visits only a narrow corridor. Minimizing MSVE under the uniform distribution wastes capacity on irrelevant states and underfits the corridor. Worse, updates from irrelevant states can actively push parameters away from the target solution because the semi-gradient direction depends on both the error δt\delta_tδt​ and the feature ϕ(st)\phi(s_t)ϕ(st​). States that generate large errors under the behavior distribution but are never visited under the target policy can dominate the update direction and pull the approximator off course.


Linear value function approximation

The simplest approximation class is linear:

V^(s; θ)=ϕ(s)⊤θ\hat{V}(s;\, \theta) = \phi(s)^\top \thetaV^(s;θ)=ϕ(s)⊤θ

where ϕ(s)∈Rd\phi(s) \in \mathbb{R}^dϕ(s)∈Rd is a fixed feature vector and θ∈Rd\theta \in \mathbb{R}^dθ∈Rd are learned weights. The expressiveness of the approximation is entirely determined by the feature map ϕ\phiϕ.

Feature design

Tile coding: partition the continuous state space into overlapping grids. Each tile is a region of state space; the feature vector has a 1 in the component corresponding to each tile the current state falls into, and 0 elsewhere. Overlapping tilings ensure that nearby states share features and receive similar value estimates — the overlap encodes a smoothness prior over state space. Tile coding is the canonical example of domain knowledge about state structure built directly into the representation.

Radial basis functions (RBFs): each feature ϕi(s)=exp⁡(−∥s−ci∥2/2σi2)\phi_i(s) = \exp(-\|s - c_i\|^2 / 2\sigma_i^2)ϕi​(s)=exp(−∥s−ci​∥2/2σi2​) is a Gaussian centered at cic_ici​. States near center cic_ici​ have high feature iii activation; distant states have near-zero activation. RBFs provide smooth, localized generalization and are natural for continuous state spaces where the value function is expected to be smooth.

Both tile coding and RBFs embed assumptions about what "nearby" means in state space. When these assumptions are correct, the approximation is efficient. When they are wrong — for instance, applying distance-based features to a high-dimensional space with non-Euclidean structure — the approximation can be poor regardless of how many parameters are used.

◆Why Not Learn the Features?

Modern deep RL does learn features from raw data — the convolutional layers in a DQN are exactly a learned feature map ϕ(s;θ)\phi(s;\theta)ϕ(s;θ). But understanding the linear case first is essential: it isolates the approximation error from the optimization error. With linear features, the loss landscape is convex and the fixed point is unique — so any failure to converge must be caused by the algorithm, not by getting stuck in a bad local minimum. Once we understand failure modes in the linear setting, we can identify which ones are fundamental (present even with convex optimization) and which are artifacts of neural network non-convexity. The linear theory provides the cleanest lens for the deadly triad.

A worked example

Consider a 1D continuous state space s∈[0,1]s \in [0,1]s∈[0,1] with two RBF features centered at c1=0.25c_1 = 0.25c1​=0.25 and c2=0.75c_2 = 0.75c2​=0.75. The true value function is Vπ(s)=sV^\pi(s) = sVπ(s)=s.

Feature vectors: ϕ(0.0)=[1,0]⊤\phi(0.0) = [1, 0]^\topϕ(0.0)=[1,0]⊤, ϕ(0.5)≈[0.6,0.6]⊤\phi(0.5) \approx [0.6, 0.6]^\topϕ(0.5)≈[0.6,0.6]⊤, ϕ(1.0)=[0,1]⊤\phi(1.0) = [0, 1]^\topϕ(1.0)=[0,1]⊤ (approximately, for appropriate σ\sigmaσ).

The best linear fit V^(s;θ)=ϕ(s)⊤θ\hat{V}(s;\theta) = \phi(s)^\top\thetaV^(s;θ)=ϕ(s)⊤θ must find θ\thetaθ such that the weighted approximation error is minimized. With only two basis functions, the approximation cannot represent Vπ(s)=sV^\pi(s) = sVπ(s)=s exactly — it will be accurate near the centers and less accurate in between. This irreducible error is the approximation error, determined by the feature class, not the learning algorithm. No algorithm can reduce it below min⁡θVE‾(θ)\min_\theta \overline{VE}(\theta)minθ​VE(θ).

This distinction — approximation error (determined by features) vs estimation error (reduced by more data and better algorithms) — is fundamental in statistical learning theory and carries over directly to deep RLReinforcement Learning, where the network architecture determines the approximation error floor.

✦Intuition: The Approximation Error Floor

Think of approximation error as the irreducible bias of your feature class. No learning algorithm can make a linear fit with 10 basis functions represent an arbitrary function that requires 100. But estimation error — the gap between the learned θ\thetaθ and the best possible θ∗\theta^*θ∗ within the class — can be driven down with more data and better algorithms. The TD fixed-point bound says your final error is at most 11−γ\frac{1}{\sqrt{1-\gamma}}1−γ​1​ times the approximation error floor. This means two things: (1) better features directly improve the ceiling, and (2) for γ\gammaγ close to 1, even great features can produce a mediocre TD solution because bootstrapping compounds approximation error across long horizons.


Semi-gradient TDTemporal Difference: what it is and why it is not SGD

With linear function approximation, the TDTemporal Difference(0) update becomes:

θ←θ+α δt ϕ(st)\theta \leftarrow \theta + \alpha\, \delta_t\, \phi(s_t)θ←θ+αδt​ϕ(st​)

where the TDTemporal Difference error is:

δt=rt+1+γ ϕ(st+1)⊤θ−ϕ(st)⊤θ\delta_t = r_{t+1} + \gamma\, \phi(s_{t+1})^\top\theta - \phi(s_t)^\top\thetaδt​=rt+1​+γϕ(st+1​)⊤θ−ϕ(st​)⊤θ

A common and important misconception is that this is stochastic gradient descent on the squared TDTemporal Difference error 12δt2\frac{1}{2}\delta_t^221​δt2​. It is not. The true gradient of 12δt2\frac{1}{2}\delta_t^221​δt2​ with respect to θ\thetaθ is:

∇θ12δt2=−δt∇θδt=−δt(ϕ(st)−γϕ(st+1))\nabla_\theta \frac{1}{2}\delta_t^2 = -\delta_t \nabla_\theta \delta_t = -\delta_t \bigl(\phi(s_t) - \gamma\phi(s_{t+1})\bigr)∇θ​21​δt2​=−δt​∇θ​δt​=−δt​(ϕ(st​)−γϕ(st+1​))

The TDTemporal Difference update uses only ϕ(st)\phi(s_t)ϕ(st​) — it omits the −γϕ(st+1)-\gamma\phi(s_{t+1})−γϕ(st+1​) term. This is because the bootstrap target rt+1+γϕ(st+1)⊤θr_{t+1} + \gamma\phi(s_{t+1})^\top\thetart+1​+γϕ(st+1​)⊤θ is treated as a fixed constant when differentiating, as if it did not depend on θ\thetaθ. This is the defining characteristic of a semi-gradient method.

Why the semi-gradient approximation is made

The full gradient would require differentiating through the bootstrap target, which introduces the term γϕ(st+1)\gamma\phi(s_{t+1})γϕ(st+1​) — an expectation over next-state features that requires either a model or a double-sampling trick to estimate without bias. Semi-gradient TDTemporal Difference avoids this by simply not differentiating through the target, making each update O(d)O(d)O(d) and implementable from a single transition.

✦Intuition: The Computational Shortcut

The full gradient requires ϕ(st+1)\phi(s_{t+1})ϕ(st+1​) — but computing the expected value of this term (needed for an unbiased gradient estimate) requires either a model of the environment or two independent next-state samples from the same (st,at)(s_t, a_t)(st​,at​) transition. Neither is available in model-free RL from single transitions. Semi-gradient TD accepts a biased gradient direction in exchange for being computable from a single (st,at,rt+1,st+1)(s_t, a_t, r_{t+1}, s_{t+1})(st​,at​,rt+1​,st+1​) tuple with a single O(d)O(d)O(d) vector operation. This is the fundamental computational tradeoff at the heart of all practical TD methods.

Consequences of semi-gradient

The semi-gradient approximation has two consequences:

  1. On-policy with linear approximation: convergence is preserved. Linear semi-gradient TDTemporal Difference(0) converges to the TDTemporal Difference fixed point (the projected Bellman equation solution) under on-policy sampling. The proof relies on the fact that the on-policy distribution makes the expected update a contraction in the appropriate norm.

  2. Off-policy: convergence is not guaranteed. With off-policy sampling, the expected semi-gradient update is no longer a contraction. The omitted γϕ(st+1)\gamma\phi(s_{t+1})γϕ(st+1​) term is precisely what would restore contractivity. Without it, the update can point in a direction that increases rather than decreases the Bellman error, leading to divergence.

This is the mathematical reason why the semi-gradient approximation, while computationally convenient, is not theoretically robust — and why off-policy learning with function approximation requires additional care.

⚠What Breaks Here: Off-Policy with Semi-Gradient

Consider a 2-state MDP with states A and B. The target policy π\piπ always moves to B; the behavior policy μ\muμ explores both equally. A linear approximator with a single feature ϕ(A)=1,ϕ(B)=−0.5\phi(A) = 1, \phi(B) = -0.5ϕ(A)=1,ϕ(B)=−0.5 shares a single weight θ\thetaθ. Under μ\muμ, transitions from A increase θ\thetaθ (pushing V^(B)\hat{V}(B)V^(B) down), while transitions from B push θ\thetaθ in the opposite direction (pushing V^(A)\hat{V}(A)V^(A) up). The two update directions conflict, and because the semi-gradient omits the contraction term, there is no restoring force — θ\thetaθ oscillates and can diverge. The full gradient would include the γϕ(st+1)\gamma\phi(s_{t+1})γϕ(st+1​) term that exactly cancels this instability under on-policy sampling but amplifies it off-policy.


The projected Bellman equation and convergence bound

Because V^(⋅;θ)\hat{V}(\cdot;\theta)V^(⋅;θ) lies in a restricted subspace F\mathcal{F}F of all value functions, applying the Bellman operator TπT^\piTπ to a function in F\mathcal{F}F generally moves it outside F\mathcal{F}F. We cannot satisfy the Bellman equation exactly — instead, we project back.

Definition

Let Π\PiΠ be the orthogonal projection onto F\mathcal{F}F under the μ\muμ-weighted norm. The projected Bellman fixed point satisfies:

V^=ΠTπV^\hat{V} = \Pi T^\pi \hat{V}V^=ΠTπV^

Linear semi-gradient TDTemporal Difference(0) under on-policy sampling converges to exactly this fixed point.

The convergence bound

Theorem (TDTemporal Difference fixed-point bound): Let V^<Glossary term="TD" />\hat{V}_{\text{<Glossary term="TD" />}}V^<Glossary term="TD" />​ be the linear TDTemporal Difference fixed point. Then:

∥Vπ−V^TD∥μ≤11−γ∥Vπ−ΠVπ∥μ\left\|V^\pi - \hat{V}_{\text{TD}}\right\|_\mu \leq \frac{1}{\sqrt{1-\gamma}}\left\|V^\pi - \Pi V^\pi\right\|_\mu​Vπ−V^TD​​μ​≤1−γ​1​∥Vπ−ΠVπ∥μ​

where ∥Vπ−ΠVπ∥μ\|V^\pi - \Pi V^\pi\|_\mu∥Vπ−ΠVπ∥μ​ is the best possible approximation error achievable by any θ\thetaθ in the feature class.

Interpretation: the TDTemporal Difference solution is at most 1/1−γ1/\sqrt{1-\gamma}1/1−γ​ times worse than the best approximation under the given features. This bound has two direct implications:

  • If the feature class represents VπV^\piVπ well (small ∥Vπ−ΠVπ∥μ\|V^\pi - \Pi V^\pi\|_\mu∥Vπ−ΠVπ∥μ​), the TDTemporal Difference solution is also good.
  • The factor 1/1−γ1/\sqrt{1-\gamma}1/1−γ​ grows as γ→1\gamma \to 1γ→1. For long-horizon problems (large γ\gammaγ), even a good feature class can produce a TDTemporal Difference solution that is significantly worse than the best possible approximation. High-discount problems are harder — not just computationally, but in terms of approximation quality.
✦Intuition: The Cost of Bootstrapping

The factor 11−γ\frac{1}{\sqrt{1-\gamma}}1−γ​1​ quantifies the price of bootstrapping. When γ=0\gamma = 0γ=0, TD reduces to supervised learning from immediate rewards and the bound collapses to ∥Vπ−ΠVπ∥\|V^\pi - \Pi V^\pi\|∥Vπ−ΠVπ∥ — no amplification. When γ=0.99\gamma = 0.99γ=0.99, the factor is 10.01=10\frac{1}{\sqrt{0.01}} = 100.01​1​=10 — a 10x amplification of approximation error. For γ=0.999\gamma = 0.999γ=0.999 (common in long-horizon robotics), the factor is ≈31.6\approx 31.6≈31.6. This means the TD solution can be 30 times worse than the best possible linear fit — purely because of how errors compound through the bootstrap chain.

◆Diagram: Projected Bellman Equation

The true value function VπV^\piVπ typically lies outside the linear approximation space Φ\PhiΦ. The Bellman operator TπT^\piTπ maps a function in Φ\PhiΦ to a function that may land outside Φ\PhiΦ. The projected Bellman operator ΠTπ\Pi T^\piΠTπ first applies TπT^\piTπ (producing a function outside Φ\PhiΦ), then applies the orthogonal projection Π\PiΠ under the μ\muμ-weighted norm to pull the result back into Φ\PhiΦ. The semi-gradient TD(0) fixed point is exactly the stationary point of this composed operator: V^=ΠTπV^\hat{V} = \Pi T^\pi \hat{V}V^=ΠTπV^.


Why function approximation makes the deadly triad dangerous

The deadly triad was introduced in Week 4: the combination of function approximation, bootstrapping, and off-policy learning can cause divergence. Now that we have the machinery of function approximation, we can explain the mechanism precisely.

In the tabular case, off-policy updates change the value of specific states. The update to V(st)V(s_t)V(st​) affects only sts_tst​ — there is no coupling between states. Off-policy sampling affects which states get updated and how often, but individual updates are still correct Bellman updates. The distribution mismatch is manageable.

With function approximation, all states share the parameter vector θ\thetaθ. Updating θ\thetaθ based on a transition at sts_tst​ changes V^(s;θ)\hat{V}(s;\theta)V^(s;θ) for every state sss — not just sts_tst​. The update direction is δtϕ(st)\delta_t \phi(s_t)δt​ϕ(st​), which moves the value estimates of all states in directions determined by the feature similarities to sts_tst​. If the off-policy distribution concentrates updates on states with features that are negatively correlated with the on-policy states that matter for VπV^\piVπ, the approximator can be pushed systematically away from VπV^\piVπ — with no restoring force, because the updates never reflect the on-policy distribution.

This is the coupling that makes the deadly triad dangerous with approximation but not with tables: the function approximator ties all states together, so off-policy updates at one state corrupt the estimates at others.

✦Intuition: Parameter Sharing as a Double-Edged Sword

In the tabular setting, each state has its own independent entry V(s)V(s)V(s). Updating V(s1)V(s_1)V(s1​) leaves V(s2)V(s_2)V(s2​) completely unchanged — states are decoupled. With function approximation, updating θ\thetaθ using data from s1s_1s1​ changes V^(s2;θ)\hat{V}(s_2;\theta)V^(s2​;θ) by an amount proportional to the feature similarity ϕ(s2)⊤ϕ(s1)\phi(s_2)^\top\phi(s_1)ϕ(s2​)⊤ϕ(s1​). This is desirable when states genuinely share structure — it is how we get generalization. But it is dangerous when the off-policy distribution feeds updates from states whose optimal values are negatively correlated with the states the target policy actually visits. The parameter sharing that enables generalization also enables contamination.


Baird's counterexample

Baird (1995) constructed an example showing that linear semi-gradient TDTemporal Difference with off-policy sampling diverges even in a tiny MDPMarkov Decision Process with a linear function approximator — not due to noise or numerical issues, but structurally.

The setup

Consider a 7-state MDPMarkov Decision Process with a star topology: states {1,2,3,4,5,6}\{1, 2, 3, 4, 5, 6\}{1,2,3,4,5,6} are peripheral; state 777 is central.

  • Behavior policy μ\muμ: with probability 6/76/76/7, transition to a peripheral state uniformly at random (dashed transitions); with probability 1/71/71/7, transition to the central state.
  • Target policy π\piπ: always transition to the central state.
  • All rewards are zero.
  • Discount γ=0.99\gamma = 0.99γ=0.99.

The linear feature representation for each state sis_isi​ is:

ϕ(si)={2ei+e7i∈{1,…,6}2e7i=7\phi(s_i) = \begin{cases} 2e_i + e_7 & i \in \{1,\ldots,6\} \\ 2e_7 & i = 7 \end{cases}ϕ(si​)={2ei​+e7​2e7​​i∈{1,…,6}i=7​

where eie_iei​ is the iiith standard basis vector in R8\mathbb{R}^8R8, giving θ∈R8\theta \in \mathbb{R}^8θ∈R8.

Why it diverges

Under the behavior policy, updates concentrate on peripheral states. The feature representation ties the peripheral state components eie_iei​ to the shared component e7e_7e7​. Under off-policy bootstrapping, each peripheral-state update moves the shared weight θ7\theta_7θ7​ in a direction inconsistent with the target policy's visitation distribution. The peripheral weights θ1,…,θ6\theta_1, \ldots, \theta_6θ1​,…,θ6​ and the shared weight θ7\theta_7θ7​ amplify each other through the bootstrap target, and the weights grow without bound.

Running TDTemporal Difference(0) on this example with any constant step size α>0\alpha > 0α>0 causes ∥θ∥→∞\|\theta\| \to \infty∥θ∥→∞ — all parameter components diverge to ±∞\pm\infty±∞, even though the true value function under π\piπ is identically zero (since all rewards are zero).

Takeaway: divergence under the deadly triad is not a pathology of bad initialization or poor hyperparameter tuning. It is structural and provable. The only reliable fixes are either algorithmic (target networks, replay buffers) or theoretical (gradient TDTemporal Difference methods).

⚠What Breaks Here: The Coupling Problem in Miniature

Walk through the weight dynamics: consider θ7\theta_7θ7​, the shared weight connected to the central state. Under the target policy π\piπ, the agent is always at state 7 — V^(7;θ)=2θ7\hat{V}(7;\theta) = 2\theta_7V^(7;θ)=2θ7​. Under the behavior policy, the agent spends most time at peripheral states iii, where V^(i;θ)=2θi+θ7\hat{V}(i;\theta) = 2\theta_i + \theta_7V^(i;θ)=2θi​+θ7​. Each peripheral update changes θi\theta_iθi​ (peripheral weight) and θ7\theta_7θ7​ (shared weight). But the direction of θ7\theta_7θ7​'s update from peripheral data does not align with the direction needed for the target policy's valuation of state 7. The bootstrap target couples θi\theta_iθi​ and θ7\theta_7θ7​ through the feature vector: an error at state iii drives θi\theta_iθi​ and θ7\theta_7θ7​ in ways that amplify each other through the γϕ(st+1)⊤θ\gamma\phi(s_{t+1})^\top\thetaγϕ(st+1​)⊤θ term. The result is a positive feedback loop — weights grow without bound even though the true values are all zero.

◆Diagram: Baird's Counterexample

Baird's star MDP: states 1–6 are peripheral, state 7 is central. The behavior policy (dashed transitions) distributes probability across all 7 states — 6/7 toward peripheral states, 1/7 toward the central state. The target policy (solid transitions) always moves to the central state. The feature representation ϕ(si)=2ei+e7\phi(s_i) = 2e_i + e_7ϕ(si​)=2ei​+e7​ for peripheral states and ϕ(s7)=2e7\phi(s_7) = 2e_7ϕ(s7​)=2e7​ ties peripheral and central states through shared weight θ7\theta_7θ7​, causing off-policy TD(0) to diverge.


Gradient TDTemporal Difference methods: the theoretical resolution

Semi-gradient TDTemporal Difference is not true gradient descent, and this is the root cause of its off-policy instability. The natural fix is to perform true gradient descent on a well-defined scalar objective — specifically, the projected Bellman error (PBE):

PBE‾(θ)=∥ΠTπV^θ−V^θ∥μ2\overline{PBE}(\theta) = \left\|\Pi T^\pi \hat{V}_\theta - \hat{V}_\theta\right\|_\mu^2PBE(θ)=​ΠTπV^θ​−V^θ​​μ2​

Gradient TDTemporal Difference methods (GTD, GTD2, TDC — Sutton et al., 2009) compute the true gradient of PBE‾(θ)\overline{PBE}(\theta)PBE(θ) with respect to θ\thetaθ, using a second set of parameters to estimate the gradient correction term online. Unlike semi-gradient TDTemporal Difference, gradient TDTemporal Difference methods converge under all three deadly triad conditions: function approximation, bootstrapping, and off-policy sampling simultaneously.

Why gradient TDTemporal Difference is not universally used

Despite their theoretical advantages, gradient TDTemporal Difference methods have not displaced semi-gradient TDTemporal Difference in practice for two reasons:

  1. Slower convergence: the gradient correction term introduces additional variance and slows convergence relative to semi-gradient TDTemporal Difference with stabilization tricks.
  2. Engineering solutions are sufficient in practice: target networks and replay buffers (introduced in Week 6) stabilize semi-gradient TDTemporal Difference empirically in most settings where gradient TDTemporal Difference would be theoretically required. The gap between theory and practice is bridged by engineering rather than by using the theoretically correct algorithm.

This is a recurring pattern in deep RLReinforcement Learning: theoretically correct algorithms often underperform practically motivated heuristics that violate the theory but work well empirically. Knowing the theory is what allows you to reason about when the heuristics will fail.

◆Why Not Always Use Gradient TD?

Gradient TD methods maintain a second set of parameters (a "correction" vector w∈Rdw \in \mathbb{R}^dw∈Rd) that estimates the gradient correction term γϕ(st+1)\gamma\phi(s_{t+1})γϕ(st+1​) online. This doubles the memory and per-step computation, and the auxiliary parameter introduces additional variance that slows convergence — GTD2 and TDC can require 10-100x more samples than semi-gradient TD with experience replay on the same problem. In practice, the combination of a replay buffer (which makes data look more on-policy by shuffling) and a target network (which stabilizes the bootstrap target) achieves most of the theoretical benefits of gradient TD with none of the extra variance. Gradient TD is the theoretically correct tool, but the engineering heuristics work better empirically — until they don't, at which point the theory tells you why.


From linear to neural network approximation

Neural networks generalize linear approximation:

Q^(s,a; θ)=NNθ(s,a)\hat{Q}(s, a;\, \theta) = \text{NN}_\theta(s, a)Q^​(s,a;θ)=NNθ​(s,a)

Advantages:

  • Features are learned from data rather than hand-designed.
  • Deep networks can represent complex, non-local value functions.
  • Scales to images (convolutional), text (transformer), and multimodal inputs.

Amplified instabilities: every failure mode of linear approximation is present in neural networks, and nonlinearity introduces additional ones:

  • Non-convex loss landscape: gradient updates can follow directions that increase loss.
  • Changing representations: the implicit features ϕ(s;θ)\phi(s;\theta)ϕ(s;θ) (the penultimate layer) change as θ\thetaθ is updated, making the effective feature class non-stationary.
  • Moving bootstrap targets: with nonlinear Q^\hat{Q}Q^​, the bootstrap target r+γmax⁡a′Q^(s′,a′;θ)r + \gamma\max_{a'}\hat{Q}(s',a';\theta)r+γmaxa′​Q^​(s′,a′;θ) changes every time θ\thetaθ is updated — the "target" the network is regressing toward is itself a function of the parameters being optimized. This is not a problem in supervised learning (labels are fixed) and has no tabular analog. It is a fundamental source of instability unique to neural network function approximation with bootstrapping.

The moving bootstrap target problem motivates the target network architecture in DQNDeep Q-Network: freeze a copy of θ\thetaθ to compute bootstrap targets, updating it periodically rather than every step. This decouples the target from the current parameters and is the primary stabilization mechanism in DQNDeep Q-Network.

✦Intuition: The Moving Target Problem

In supervised learning, the training labels {yi}\{y_i\}{yi​} are fixed — you are regressing a function toward a stationary target. In Q-learning with a neural network, the "label" is r+γmax⁡a′Q(s′,a′;θ)r + \gamma\max_{a'}Q(s',a';\theta)r+γmaxa′​Q(s′,a′;θ), which depends on θ\thetaθ itself. When you update θ\thetaθ to reduce the TD error, the target moves. This is like trying to hit a bullseye that shifts every time you throw a dart. The target network θ−\theta^-θ− pauses the target — you throw darts at a stationary bullseye for C steps, then update θ−←θ\theta^- \leftarrow \thetaθ−←θ and start again. This decoupling is the single most important stabilization trick in deep RL.

⚠What Breaks Here: Non-Stationarity of Representation

With a neural network, the "features" are the activations of the penultimate layer — but unlike tile coding or RBFs, these features change with every gradient step. The linear analysis assumed a fixed ϕ(s)\phi(s)ϕ(s); when ϕ\phiϕ itself depends on θ\thetaθ, the convergence proofs break. Specifically, the contraction argument for linear TD relies on the projection Π\PiΠ being fixed. With changing features, Π\PiΠ becomes Πθ\Pi_\thetaΠθ​ — a moving projection target — and the contraction property is lost. This is why neural network TD can diverge even on-policy with a replay buffer, and why the target network + replay buffer combination is necessary, not optional, in DQN.


Overestimation bias

Q-learning uses:

max⁡a′Q(st+1,a′)\max_{a'} Q(s_{t+1}, a')a′max​Q(st+1​,a′)

as the bootstrap target. When QQQ-value estimates are noisy, taking a maximum introduces a systematic positive bias.

Derivation

Let Q(s,a)=Q∗(s,a)+ϵaQ(s,a) = Q^*(s,a) + \epsilon_aQ(s,a)=Q∗(s,a)+ϵa​ where {ϵa}a=1K\{\epsilon_a\}_{a=1}^K{ϵa​}a=1K​ are zero-mean estimation errors, not necessarily independent. Then:

E[max⁡aQ(s,a)]=E[max⁡a(Q∗(s,a)+ϵa)]≥max⁡aQ∗(s,a)+E[max⁡aϵa]≥max⁡aQ∗(s,a)\mathbb{E}\left[\max_a Q(s,a)\right] = \mathbb{E}\left[\max_a (Q^*(s,a) + \epsilon_a)\right] \geq \max_a Q^*(s,a) + \mathbb{E}\left[\max_a \epsilon_a\right] \geq \max_a Q^*(s,a)E[amax​Q(s,a)]=E[amax​(Q∗(s,a)+ϵa​)]≥amax​Q∗(s,a)+E[amax​ϵa​]≥amax​Q∗(s,a)

The second inequality holds because E[max⁡aϵa]≥max⁡aE[ϵa]=0\mathbb{E}[\max_a \epsilon_a] \geq \max_a \mathbb{E}[\epsilon_a] = 0E[maxa​ϵa​]≥maxa​E[ϵa​]=0 by Jensen's inequality applied to the convex max⁡\maxmax function. The bias E[max⁡aϵa]>0\mathbb{E}[\max_a \epsilon_a] > 0E[maxa​ϵa​]>0 whenever the ϵa\epsilon_aϵa​ are not all identical — which is never the case in practice.

How bias grows

The overestimation bias grows with:

  • Number of actions ∣A∣|\mathcal{A}|∣A∣: more actions means more opportunities for an upward noise spike to be selected as the maximum. With KKK i.i.d. Gaussian errors ϵa∼N(0,σ2)\epsilon_a \sim \mathcal{N}(0,\sigma^2)ϵa​∼N(0,σ2), the expected maximum scales as σ2log⁡K\sigma\sqrt{2\log K}σ2logK​ — logarithmic in KKK but positive for any K>1K > 1K>1.
  • Estimation variance σ2\sigma^2σ2: noisier estimates produce larger max bias. Early in training, when estimates are highly uncertain, bias is worst.

Compounding effects

Overestimation in Q(st+1,⋅)Q(s_{t+1}, \cdot)Q(st+1​,⋅) propagates into Q(st,⋅)Q(s_t, \cdot)Q(st​,⋅) through the Bellman update. Over many updates, overestimates compound backward through the value function, producing systematically inflated QQQ-values that cause the policy to prefer actions that look good due to noise rather than true value.

✦Intuition: The Winner's Curse

The overestimation bias is the RL analog of the winner's curse in auction theory: the winning bidder in a common-value auction is the one who most overestimated the item's value. When you estimate Q-values for KKK actions, you get KKK noisy estimates. Taking the max⁡\maxmax selects the action whose noise was most positive — systematically biasing the estimate upward. The more actions, the more opportunities for a lucky noise spike, and the larger the expected bias. This is not a bug in the estimation procedure; it is a mathematical consequence of applying max⁡\maxmax to random variables.

Double Q-learning: the fix

Double Q-learning (van Hasselt, 2010) decouples action selection from action evaluation:

r+γQ(s′,arg⁡max⁡a′Q(s′,a′;θ); θ−)r + \gamma Q(s', \arg\max_{a'} Q(s',a';\theta);\, \theta^-)r+γQ(s′,arga′max​Q(s′,a′;θ);θ−)

where θ\thetaθ selects the action and θ−\theta^-θ− (a separate or lagged parameter vector) evaluates it. Since the same noise that inflates Q(s′,a∗;θ)Q(s',a^*;\theta)Q(s′,a∗;θ) is unlikely to also inflate Q(s′,a∗;θ−)Q(s',a^*;\theta^-)Q(s′,a∗;θ−) if θ−\theta^-θ− is independent, the upward bias is reduced. In DQNDeep Q-Network, the target network θ−\theta^-θ− serves naturally as the evaluation network in Double DQNDeep Q-Network, integrating the fix without additional overhead.

⚠What Breaks Here: The Bias from the Max Operator

Concrete example: suppose Q∗(s,a)=0Q^*(s,a) = 0Q∗(s,a)=0 for all actions (a terminal state). With σ=0.5\sigma = 0.5σ=0.5 estimation noise and K=4K = 4K=4 actions, Gaussian order statistics tell us E[max⁡aϵa]≈0.5⋅1.03=0.52\mathbb{E}[\max_a \epsilon_a] \approx 0.5 \cdot 1.03 = 0.52E[maxa​ϵa​]≈0.5⋅1.03=0.52. After 100 Bellman backups through the same state, this bias compounds — each backup adds approximately 0.52 of spurious value, so the estimates can inflate to >50>50>50 despite true values of zero. With K=50,000K = 50{,}000K=50,000 (language model vocabulary), E[max⁡aϵa]≈0.5⋅3.55=1.77\mathbb{E}[\max_a \epsilon_a] \approx 0.5 \cdot 3.55 = 1.77E[maxa​ϵa​]≈0.5⋅3.55=1.77 — nearly 4x larger from the action count alone. This is why Double DQN, which decouples action selection from action evaluation, is not a minor tweak but a structural necessity for problems with large action spaces.

◆Diagram: Q-Value Overestimation Mechanism

For a state with KKK actions, each Q-value estimate Q(s,a)Q(s,a)Q(s,a) carries zero-mean estimation noise ϵa\epsilon_aϵa​. The max operator max⁡aQ(s,a)\max_a Q(s,a)maxa​Q(s,a) selects the action with the largest positive noise realization. The expected value of the maximum of KKK independent Gaussians scales approximately as σ2log⁡K\sigma\sqrt{2\log K}σ2logK​, producing a positive bias even though each individual estimate is unbiased. Double Q-learning eliminates this by using one network to select the action and a separate network to evaluate it — since the same noise spike is unlikely in both networks, the expected bias is substantially reduced.


GenAI context: why this matters for language models

In language modeling:

  • States are token histories — space of size 50,000400050{,}000^{4000}50,0004000
  • Actions are next-token choices — ∣A∣≈50,000|\mathcal{A}| \approx 50{,}000∣A∣≈50,000
  • Value functions must be approximated by neural networks

All three deadly triad conditions are present in RLHFReinforcement Learning from Human Feedback pipelines:

  • Function approximation: neural network value head on the language model
  • Bootstrapping: GAE (Generalized Advantage Estimation) uses TDTemporal Difference-style returns
  • Off-policy data: rollouts are generated by old policy checkpoints

The overestimation bias is particularly acute: with ∣A∣=50,000|\mathcal{A}| = 50{,}000∣A∣=50,000 tokens, the expected max bias over the next-token Q-values is substantial. This is one reason RLHFReinforcement Learning from Human Feedback typically uses actor-critic with a value baseline rather than Q-learning directly — the value function V(s)V(s)V(s) (over states, not state-action pairs) avoids the explicit max over the full vocabulary.

Without careful algorithm design — clipped importance ratios (PPOProximal Policy Optimisation), frozen reference models, KL divergence penalties, careful advantage normalization — the instabilities described in this lecture manifest as degenerate outputs, mode collapse, and reward hacking. The stabilization techniques in RLHFReinforcement Learning from Human Feedback are not arbitrary engineering choices. They are direct responses to the theoretical failure modes introduced here.

✦Intuition: Why RLHF Uses V(s) Not Q(s,a)

With a Q-function over 50,000 actions, every Bellman backup requires computing max⁡a′∈AQ(s′,a′;θ)\max_{a' \in \mathcal{A}} Q(s',a';\theta)maxa′∈A​Q(s′,a′;θ) — 50,000 forward passes through the value head. A value function V(s)V(s)V(s) has a single scalar output, reducing the max to a single forward pass. Beyond the computational advantage, V(s)V(s)V(s) also avoids the overestimation bias from the max operator entirely: there is no max to take. RLHF pipelines use Generalized Advantage Estimation (GAE), which estimates A(s,a)=Q(s,a)−V(s)A(s,a) = Q(s,a) - V(s)A(s,a)=Q(s,a)−V(s) using V(s)V(s)V(s) as the baseline — this requires only V(s)V(s)V(s), not the full Q(s,a)Q(s,a)Q(s,a), while the advantage A(s,a)A(s,a)A(s,a) is estimated from the rollout returns. The choice of V(s)V(s)V(s) over Q(s,a)Q(s,a)Q(s,a) in RLHF is driven by both the overestimation problem and computational constraints at vocabulary scale.


Key takeaways

The lecture traces a progression from the ideal to the practically achievable. The MSVE objective defines what we optimize, with on-policy weighting μ\muμ tying the objective to the behavior distribution. Linear approximation introduces the approximation/estimation error decomposition and the feature design choices that determine the expressiveness floor. Semi-gradient TDTemporal Difference is not SGD — the omission of the gradient through the bootstrap target is what makes it computationally cheap, and what removes its convergence guarantees off-policy. The projected Bellman fixed point is where on-policy linear TDTemporal Difference converges, with the 1/1−γ1/\sqrt{1-\gamma}1/1−γ​ bound quantifying the cost of approximation. The deadly triad is dangerous with function approximation because parameter sharing couples all states — off-policy updates at one state corrupt estimates everywhere. Baird's counterexample proves this is not a pathology but a structural fact. Gradient TDTemporal Difference methods provide the theoretically correct fix — true gradient descent on the projected Bellman error — but are superseded in practice by engineering stabilization (target networks, replay buffers). Neural networks amplify all linear instabilities and introduce the moving bootstrap target problem, directly motivating the DQNDeep Q-Network target network. Overestimation bias arises from the max operator over noisy estimates and compounds backward through the value function — Double DQNDeep Q-Network decouples selection from evaluation to reduce it.


Conceptual questions

  1. The TDTemporal Difference(0) update for linear function approximation is θ←θ+αδtϕ(st)\theta \leftarrow \theta + \alpha\delta_t\phi(s_t)θ←θ+αδt​ϕ(st​). Write out the true gradient of 12δt2\frac{1}{2}\delta_t^221​δt2​ with respect to θ\thetaθ and identify the term that the semi-gradient update omits. Under what sampling distribution does the omitted term have zero expectation, restoring convergence? Why does off-policy sampling violate this condition?

  2. The TDTemporal Difference fixed-point bound gives ∥Vπ−V^<Glossary term="TD" />∥μ≤11−γ∥Vπ−ΠVπ∥μ\|V^\pi - \hat{V}_{\text{<Glossary term="TD" />}}\|_\mu \leq \frac{1}{\sqrt{1-\gamma}}\|V^\pi - \Pi V^\pi\|_\mu∥Vπ−V^<Glossary term="TD" />​∥μ​≤1−γ​1​∥Vπ−ΠVπ∥μ​. Suppose γ=0.99\gamma = 0.99γ=0.99 and the best linear approximation error is 0.1. What is the worst-case TDTemporal Difference error? What does this imply about using high-discount TDTemporal Difference for long-horizon robotics tasks?

  3. In the tabular setting, off-policy Q-learning converges to Q∗Q^*Q∗. In the linear function approximation setting with off-policy sampling, TDTemporal Difference can diverge (Baird's example). Identify the precise structural difference between the tabular and approximation cases that explains why off-policy sampling is benign in one case and catastrophic in the other.

  4. You are designing a Q-learning agent for a robotic manipulation task with 8 discrete actions. After 100k training steps, you notice the Q-values are systematically much larger than the true returns. Diagnose this as overestimation bias: derive the expected magnitude of the bias as a function of estimation variance σ2\sigma^2σ2 and number of actions KKK, and describe the Double Q-learning fix at the level of the update equation.

  5. An RLHFReinforcement Learning from Human Feedback pipeline trains a value head on top of a language model using TDTemporal Difference(0) with data collected from old policy checkpoints. Map this setting onto the deadly triad: identify which component each ingredient corresponds to, explain which specific failure mode is most likely to manifest first, and propose one concrete mitigation for each component of the triad.


Coding exercise

Simulate the overestimation bias in the max operator to verify the σ2log⁡K\sigma\sqrt{2\log K}σ2logK​ approximate scaling.

Use NumPy to implement the following:

import numpy as np

def simulate_max_bias(K, sigma, num_trials=10000):
    """
    Simulate the expected overestimation bias of max over K actions.

    Args:
        K (int): number of actions
        sigma (float): standard deviation of estimation noise
        num_trials (int): number of independent trials

    Returns:
        float: estimated E[max_a epsilon_a]
    """
    noise = np.random.randn(num_trials, K) * sigma
    max_noise = noise.max(axis=1)  # max over actions for each trial
    return float(max_noise.mean())

# Verify the approximate scaling
for K in [2, 4, 8, 16, 32]:
    estimated_bias = simulate_max_bias(K, sigma=0.5)
    predicted = 0.5 * np.sqrt(2 * np.log(K))
    print(f"K={K:3d}:  estimated bias={estimated_bias:.3f},  predicted≈{predicted:.3f}")

Expected output (your numbers will vary slightly due to sampling):

K=  2:  estimated bias=0.282,  predicted≈0.589
K=  4:  estimated bias=0.513,  predicted≈0.833
K=  8:  estimated bias=0.695,  predicted≈1.020
K= 16:  estimated bias=0.835,  predicted≈1.177
K= 32:  estimated bias=0.960,  predicted≈1.316

Explain why the estimated bias is consistently lower than the σ2log⁡K\sigma\sqrt{2\log K}σ2logK​ prediction — what assumption does the approximation make that is violated in the simulation?


Extension prompts

  1. Implement Baird's counterexample. Build the 7-state star MDP in Python (or your language of choice) with the exact feature representation from the lecture. Run linear semi-gradient TD(0) with off-policy sampling and α=0.01\alpha = 0.01α=0.01. Watch ∥θ∥\|\theta\|∥θ∥ diverge. Then switch to on-policy sampling and verify convergence. How many steps until divergence is visually obvious?

  2. Compare semi-gradient vs gradient TD. Using OpenAI Gym's CartPole-v1, train a linear Q-function (RBF features over the 4D state space) with both semi-gradient TD(0) and gradient TD (GTD2). Plot the parameter norm over training steps. Where does semi-gradient diverge?

  3. Design a feature map that avoids Baird's problem. Propose a different linear feature representation for Baird's 7-state MDP such that off-policy semi-gradient TD(0) converges. What property of your feature map ensures contractivity?


Looking ahead

The next lecture introduces Deep Q-Networks (DQNDeep Q-Network) — the first algorithm to successfully stabilize Q-learning with neural networks. Each of DQNDeep Q-Network's two core innovations maps directly onto a failure mode from this lecture:

  • Experience replay breaks temporal correlations in training data and stabilizes the on-policy distribution assumption underlying the MSVE objective.
  • Target networks freeze the bootstrap target, directly addressing the moving target problem introduced by neural network function approximation with bootstrapping.

Understanding the failures introduced here is what makes DQNDeep Q-Network's design choices interpretable as principled engineering responses rather than empirical tricks.


Further reading

  • Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. ICML. (Introduced Baird's Counterexample and the Deadly Triad).
  • Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control. (Proofs of convergence for linear function approximation).
  • Thrun, S., & Schwartz, A. (1993). Issues in using function approximation for reinforcement learning. (Early identification of the overestimation bias in Q-learning).
  • Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., & Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML. (Gradient TD methods: GTD, GTD2, TDC — theoretically convergent under the deadly triad).
  • van Hasselt, H. (2010). Double Q-learning. NeurIPS. (Decouples action selection from action evaluation to reduce overestimation bias).
  • Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature. (DQN: experience replay, target networks, and superhuman Atari performance).
← Previous
Week 4: Monte Carlo and Temporal-Difference Learning
Next →
Week 6: Deep Q-Learning and Variants
On this page
  • Purpose of this lecture
  • Why function approximation is necessary
  • The Mean Squared Value Error objective
  • Linear value function approximation
  • Feature design
  • A worked example
  • Semi-gradient TD: what it is and why it is not SGD
  • Why the semi-gradient approximation is made
  • Consequences of semi-gradient
  • The projected Bellman equation and convergence bound
  • Definition
  • The convergence bound
  • Why function approximation makes the deadly triad dangerous
  • Baird's counterexample
  • The setup
  • Why it diverges
  • Gradient TD methods: the theoretical resolution
  • Why gradient TD is not universally used
  • From linear to neural network approximation
  • Overestimation bias
  • Derivation
  • How bias grows
  • Compounding effects
  • Double Q-learning: the fix
  • GenAI context: why this matters for language models
  • Key takeaways
  • Conceptual questions
  • Coding exercise
  • Extension prompts
  • Looking ahead
  • Further reading