Purpose of this lecture
Here is the central puzzle of this lecture: Why does adding a neural network break a stable algorithm?
Up to this point, we assumed value functions could be represented exactly in tables. This assumption collapses the moment the state space becomes large, continuous, or combinatorial — as is the case in nearly all real-world problems and all GenAI systems.
Function approximation is not optional: it is essential. A DQNDeep Q-Network agent for Atari learns from pixel observations — a state space astronomically larger than any tabular representation could handle. An LLM learns value functions over token sequences of length 4,000, with vocabulary size 50,000 — a computational impossibility for tables.
But here is the problem: the TDTemporal Difference learning algorithms from Week 4 converge in the tabular case and diverge with function approximation. Not because of implementation bugs. Not because of bad hyperparameters. Structurally and provably.
This lecture develops the precise failure modes and connects them to algorithmic solutions. We will show:
- What the learning objective is (MSVE) and why the distribution matters
- How linear approximation introduces the approximation/estimation error tradeoff
- Why semi-gradient TDTemporal Difference is not true gradient descent, and why this matters off-policy
- How the deadly triad causes divergence: Baird's concrete counterexample
- Why gradient TDTemporal Difference methods work in theory but not in practice
- Why neural networks amplify instability and introduce the moving target problem
- How overestimation bias compounds and what Double DQNDeep Q-Network does about it
The lecture follows a logical arc: define what we are optimizing, develop the simplest (linear) approximation class, identify where and why convergence fails, show a provable divergence example, extend to neural networks, and connect the theoretical failures to the algorithmic fixes introduced in Week 6.
Why function approximation is necessary
Tabular methods require memory proportional to or .
- A robot with 12 joint angles discretized to 100 positions has states.
- An LLMLarge Language Model context window of 4,000 tokens over a vocabulary of 50,000 has states.
- Continuous control problems have uncountably infinite state spaces.
Enumerating, storing, or updating a table over these spaces is not an engineering challenge — it is a mathematical impossibility. Function approximation replaces the table with a parameterized function:
where are learned parameters, . The function approximator must generalize: updating based on experience at state should improve estimates at nearby or similar states, not just at itself. This generalization is both the benefit and the source of instability.
The Mean Squared Value Error objective
Before developing algorithms, we need to define what we are trying to minimize. The standard objective for value function approximation is the Mean Squared Value Error:
where is a weighting over states with .
The natural choice for in on-policy learning is the on-policy state visitation distribution: the fraction of time spent in state when following policy . States visited frequently under receive high weight and must be estimated accurately; rarely visited states can tolerate larger error.
This choice has an immediate and important consequence for off-policy learning: when data is collected under a behavior policy , the distribution of observed states reflects , not . Minimizing under does not minimize under — the approximator is being shaped by a distribution that does not match the target. This is the distributional mismatch at the heart of the deadly triad, made precise.
Linear value function approximation
The simplest approximation class is linear:
where is a fixed feature vector and are learned weights. The expressiveness of the approximation is entirely determined by the feature map .
Feature design
Tile coding: partition the continuous state space into overlapping grids. Each tile is a region of state space; the feature vector has a 1 in the component corresponding to each tile the current state falls into, and 0 elsewhere. Overlapping tilings ensure that nearby states share features and receive similar value estimates — the overlap encodes a smoothness prior over state space. Tile coding is the canonical example of domain knowledge about state structure built directly into the representation.
Radial basis functions (RBFs): each feature is a Gaussian centered at . States near center have high feature activation; distant states have near-zero activation. RBFs provide smooth, localized generalization and are natural for continuous state spaces where the value function is expected to be smooth.
Both tile coding and RBFs embed assumptions about what "nearby" means in state space. When these assumptions are correct, the approximation is efficient. When they are wrong — for instance, applying distance-based features to a high-dimensional space with non-Euclidean structure — the approximation can be poor regardless of how many parameters are used.
A worked example
Consider a 1D continuous state space with two RBF features centered at and . The true value function is .
Feature vectors: , , (approximately, for appropriate ).
The best linear fit must find such that the weighted approximation error is minimized. With only two basis functions, the approximation cannot represent exactly — it will be accurate near the centers and less accurate in between. This irreducible error is the approximation error, determined by the feature class, not the learning algorithm. No algorithm can reduce it below .
This distinction — approximation error (determined by features) vs estimation error (reduced by more data and better algorithms) — is fundamental in statistical learning theory and carries over directly to deep RLReinforcement Learning, where the network architecture determines the approximation error floor.
Semi-gradient TDTemporal Difference: what it is and why it is not SGD
With linear function approximation, the TDTemporal Difference(0) update becomes:
where the TDTemporal Difference error is:
A common and important misconception is that this is stochastic gradient descent on the squared TDTemporal Difference error . It is not. The true gradient of with respect to is:
The TDTemporal Difference update uses only — it omits the term. This is because the bootstrap target is treated as a fixed constant when differentiating, as if it did not depend on . This is the defining characteristic of a semi-gradient method.
Why the semi-gradient approximation is made
The full gradient would require differentiating through the bootstrap target, which introduces the term — an expectation over next-state features that requires either a model or a double-sampling trick to estimate without bias. Semi-gradient TDTemporal Difference avoids this by simply not differentiating through the target, making each update and implementable from a single transition.
Consequences of semi-gradient
The semi-gradient approximation has two consequences:
-
On-policy with linear approximation: convergence is preserved. Linear semi-gradient TDTemporal Difference(0) converges to the TDTemporal Difference fixed point (the projected Bellman equation solution) under on-policy sampling. The proof relies on the fact that the on-policy distribution makes the expected update a contraction in the appropriate norm.
-
Off-policy: convergence is not guaranteed. With off-policy sampling, the expected semi-gradient update is no longer a contraction. The omitted term is precisely what would restore contractivity. Without it, the update can point in a direction that increases rather than decreases the Bellman error, leading to divergence.
This is the mathematical reason why the semi-gradient approximation, while computationally convenient, is not theoretically robust — and why off-policy learning with function approximation requires additional care.
The projected Bellman equation and convergence bound
Because lies in a restricted subspace of all value functions, applying the Bellman operator to a function in generally moves it outside . We cannot satisfy the Bellman equation exactly — instead, we project back.
Definition
Let be the orthogonal projection onto under the -weighted norm. The projected Bellman fixed point satisfies:
Linear semi-gradient TDTemporal Difference(0) under on-policy sampling converges to exactly this fixed point.
The convergence bound
Theorem (TDTemporal Difference fixed-point bound): Let be the linear TDTemporal Difference fixed point. Then:
where is the best possible approximation error achievable by any in the feature class.
Interpretation: the TDTemporal Difference solution is at most times worse than the best approximation under the given features. This bound has two direct implications:
- If the feature class represents well (small ), the TDTemporal Difference solution is also good.
- The factor grows as . For long-horizon problems (large ), even a good feature class can produce a TDTemporal Difference solution that is significantly worse than the best possible approximation. High-discount problems are harder — not just computationally, but in terms of approximation quality.
Why function approximation makes the deadly triad dangerous
The deadly triad was introduced in Week 4: the combination of function approximation, bootstrapping, and off-policy learning can cause divergence. Now that we have the machinery of function approximation, we can explain the mechanism precisely.
In the tabular case, off-policy updates change the value of specific states. The update to affects only — there is no coupling between states. Off-policy sampling affects which states get updated and how often, but individual updates are still correct Bellman updates. The distribution mismatch is manageable.
With function approximation, all states share the parameter vector . Updating based on a transition at changes for every state — not just . The update direction is , which moves the value estimates of all states in directions determined by the feature similarities to . If the off-policy distribution concentrates updates on states with features that are negatively correlated with the on-policy states that matter for , the approximator can be pushed systematically away from — with no restoring force, because the updates never reflect the on-policy distribution.
This is the coupling that makes the deadly triad dangerous with approximation but not with tables: the function approximator ties all states together, so off-policy updates at one state corrupt the estimates at others.
Baird's counterexample
Baird (1995) constructed an example showing that linear semi-gradient TDTemporal Difference with off-policy sampling diverges even in a tiny MDPMarkov Decision Process with a linear function approximator — not due to noise or numerical issues, but structurally.
The setup
Consider a 7-state MDPMarkov Decision Process with a star topology: states are peripheral; state is central.
- Behavior policy : with probability , transition to a peripheral state uniformly at random (dashed transitions); with probability , transition to the central state.
- Target policy : always transition to the central state.
- All rewards are zero.
- Discount .
The linear feature representation for each state is:
where is the th standard basis vector in , giving .
Why it diverges
Under the behavior policy, updates concentrate on peripheral states. The feature representation ties the peripheral state components to the shared component . Under off-policy bootstrapping, each peripheral-state update moves the shared weight in a direction inconsistent with the target policy's visitation distribution. The peripheral weights and the shared weight amplify each other through the bootstrap target, and the weights grow without bound.
Running TDTemporal Difference(0) on this example with any constant step size causes — all parameter components diverge to , even though the true value function under is identically zero (since all rewards are zero).
Takeaway: divergence under the deadly triad is not a pathology of bad initialization or poor hyperparameter tuning. It is structural and provable. The only reliable fixes are either algorithmic (target networks, replay buffers) or theoretical (gradient TDTemporal Difference methods).
Gradient TDTemporal Difference methods: the theoretical resolution
Semi-gradient TDTemporal Difference is not true gradient descent, and this is the root cause of its off-policy instability. The natural fix is to perform true gradient descent on a well-defined scalar objective — specifically, the projected Bellman error (PBE):
Gradient TDTemporal Difference methods (GTD, GTD2, TDC — Sutton et al., 2009) compute the true gradient of with respect to , using a second set of parameters to estimate the gradient correction term online. Unlike semi-gradient TDTemporal Difference, gradient TDTemporal Difference methods converge under all three deadly triad conditions: function approximation, bootstrapping, and off-policy sampling simultaneously.
Why gradient TDTemporal Difference is not universally used
Despite their theoretical advantages, gradient TDTemporal Difference methods have not displaced semi-gradient TDTemporal Difference in practice for two reasons:
- Slower convergence: the gradient correction term introduces additional variance and slows convergence relative to semi-gradient TDTemporal Difference with stabilization tricks.
- Engineering solutions are sufficient in practice: target networks and replay buffers (introduced in Week 6) stabilize semi-gradient TDTemporal Difference empirically in most settings where gradient TDTemporal Difference would be theoretically required. The gap between theory and practice is bridged by engineering rather than by using the theoretically correct algorithm.
This is a recurring pattern in deep RLReinforcement Learning: theoretically correct algorithms often underperform practically motivated heuristics that violate the theory but work well empirically. Knowing the theory is what allows you to reason about when the heuristics will fail.
From linear to neural network approximation
Neural networks generalize linear approximation:
Advantages:
- Features are learned from data rather than hand-designed.
- Deep networks can represent complex, non-local value functions.
- Scales to images (convolutional), text (transformer), and multimodal inputs.
Amplified instabilities: every failure mode of linear approximation is present in neural networks, and nonlinearity introduces additional ones:
- Non-convex loss landscape: gradient updates can follow directions that increase loss.
- Changing representations: the implicit features (the penultimate layer) change as is updated, making the effective feature class non-stationary.
- Moving bootstrap targets: with nonlinear , the bootstrap target changes every time is updated — the "target" the network is regressing toward is itself a function of the parameters being optimized. This is not a problem in supervised learning (labels are fixed) and has no tabular analog. It is a fundamental source of instability unique to neural network function approximation with bootstrapping.
The moving bootstrap target problem motivates the target network architecture in DQNDeep Q-Network: freeze a copy of to compute bootstrap targets, updating it periodically rather than every step. This decouples the target from the current parameters and is the primary stabilization mechanism in DQNDeep Q-Network.
Overestimation bias
Q-learning uses:
as the bootstrap target. When -value estimates are noisy, taking a maximum introduces a systematic positive bias.
Derivation
Let where are zero-mean estimation errors, not necessarily independent. Then:
The second inequality holds because by Jensen's inequality applied to the convex function. The bias whenever the are not all identical — which is never the case in practice.
How bias grows
The overestimation bias grows with:
- Number of actions : more actions means more opportunities for an upward noise spike to be selected as the maximum. With i.i.d. Gaussian errors , the expected maximum scales as — logarithmic in but positive for any .
- Estimation variance : noisier estimates produce larger max bias. Early in training, when estimates are highly uncertain, bias is worst.
Compounding effects
Overestimation in propagates into through the Bellman update. Over many updates, overestimates compound backward through the value function, producing systematically inflated -values that cause the policy to prefer actions that look good due to noise rather than true value.
Double Q-learning: the fix
Double Q-learning (van Hasselt, 2010) decouples action selection from action evaluation:
where selects the action and (a separate or lagged parameter vector) evaluates it. Since the same noise that inflates is unlikely to also inflate if is independent, the upward bias is reduced. In DQNDeep Q-Network, the target network serves naturally as the evaluation network in Double DQNDeep Q-Network, integrating the fix without additional overhead.
GenAI context: why this matters for language models
In language modeling:
- States are token histories — space of size
- Actions are next-token choices —
- Value functions must be approximated by neural networks
All three deadly triad conditions are present in RLHFReinforcement Learning from Human Feedback pipelines:
- Function approximation: neural network value head on the language model
- Bootstrapping: GAE (Generalized Advantage Estimation) uses TDTemporal Difference-style returns
- Off-policy data: rollouts are generated by old policy checkpoints
The overestimation bias is particularly acute: with tokens, the expected max bias over the next-token Q-values is substantial. This is one reason RLHFReinforcement Learning from Human Feedback typically uses actor-critic with a value baseline rather than Q-learning directly — the value function (over states, not state-action pairs) avoids the explicit max over the full vocabulary.
Without careful algorithm design — clipped importance ratios (PPOProximal Policy Optimisation), frozen reference models, KL divergence penalties, careful advantage normalization — the instabilities described in this lecture manifest as degenerate outputs, mode collapse, and reward hacking. The stabilization techniques in RLHFReinforcement Learning from Human Feedback are not arbitrary engineering choices. They are direct responses to the theoretical failure modes introduced here.
Key takeaways
The lecture traces a progression from the ideal to the practically achievable. The MSVE objective defines what we optimize, with on-policy weighting tying the objective to the behavior distribution. Linear approximation introduces the approximation/estimation error decomposition and the feature design choices that determine the expressiveness floor. Semi-gradient TDTemporal Difference is not SGD — the omission of the gradient through the bootstrap target is what makes it computationally cheap, and what removes its convergence guarantees off-policy. The projected Bellman fixed point is where on-policy linear TDTemporal Difference converges, with the bound quantifying the cost of approximation. The deadly triad is dangerous with function approximation because parameter sharing couples all states — off-policy updates at one state corrupt estimates everywhere. Baird's counterexample proves this is not a pathology but a structural fact. Gradient TDTemporal Difference methods provide the theoretically correct fix — true gradient descent on the projected Bellman error — but are superseded in practice by engineering stabilization (target networks, replay buffers). Neural networks amplify all linear instabilities and introduce the moving bootstrap target problem, directly motivating the DQNDeep Q-Network target network. Overestimation bias arises from the max operator over noisy estimates and compounds backward through the value function — Double DQNDeep Q-Network decouples selection from evaluation to reduce it.
Conceptual questions
-
The TDTemporal Difference(0) update for linear function approximation is . Write out the true gradient of with respect to and identify the term that the semi-gradient update omits. Under what sampling distribution does the omitted term have zero expectation, restoring convergence? Why does off-policy sampling violate this condition?
-
The TDTemporal Difference fixed-point bound gives . Suppose and the best linear approximation error is 0.1. What is the worst-case TDTemporal Difference error? What does this imply about using high-discount TDTemporal Difference for long-horizon robotics tasks?
-
In the tabular setting, off-policy Q-learning converges to . In the linear function approximation setting with off-policy sampling, TDTemporal Difference can diverge (Baird's example). Identify the precise structural difference between the tabular and approximation cases that explains why off-policy sampling is benign in one case and catastrophic in the other.
-
You are designing a Q-learning agent for a robotic manipulation task with 8 discrete actions. After 100k training steps, you notice the Q-values are systematically much larger than the true returns. Diagnose this as overestimation bias: derive the expected magnitude of the bias as a function of estimation variance and number of actions , and describe the Double Q-learning fix at the level of the update equation.
-
An RLHFReinforcement Learning from Human Feedback pipeline trains a value head on top of a language model using TDTemporal Difference(0) with data collected from old policy checkpoints. Map this setting onto the deadly triad: identify which component each ingredient corresponds to, explain which specific failure mode is most likely to manifest first, and propose one concrete mitigation for each component of the triad.
Coding exercise
Simulate the overestimation bias in the max operator to verify the approximate scaling.
Use NumPy to implement the following:
import numpy as np
def simulate_max_bias(K, sigma, num_trials=10000):
"""
Simulate the expected overestimation bias of max over K actions.
Args:
K (int): number of actions
sigma (float): standard deviation of estimation noise
num_trials (int): number of independent trials
Returns:
float: estimated E[max_a epsilon_a]
"""
noise = np.random.randn(num_trials, K) * sigma
max_noise = noise.max(axis=1) # max over actions for each trial
return float(max_noise.mean())
# Verify the approximate scaling
for K in [2, 4, 8, 16, 32]:
estimated_bias = simulate_max_bias(K, sigma=0.5)
predicted = 0.5 * np.sqrt(2 * np.log(K))
print(f"K={K:3d}: estimated bias={estimated_bias:.3f}, predicted≈{predicted:.3f}")
Expected output (your numbers will vary slightly due to sampling):
K= 2: estimated bias=0.282, predicted≈0.589
K= 4: estimated bias=0.513, predicted≈0.833
K= 8: estimated bias=0.695, predicted≈1.020
K= 16: estimated bias=0.835, predicted≈1.177
K= 32: estimated bias=0.960, predicted≈1.316
Explain why the estimated bias is consistently lower than the prediction — what assumption does the approximation make that is violated in the simulation?
Extension prompts
-
Implement Baird's counterexample. Build the 7-state star MDP in Python (or your language of choice) with the exact feature representation from the lecture. Run linear semi-gradient TD(0) with off-policy sampling and . Watch diverge. Then switch to on-policy sampling and verify convergence. How many steps until divergence is visually obvious?
-
Compare semi-gradient vs gradient TD. Using OpenAI Gym's
CartPole-v1, train a linear Q-function (RBF features over the 4D state space) with both semi-gradient TD(0) and gradient TD (GTD2). Plot the parameter norm over training steps. Where does semi-gradient diverge? -
Design a feature map that avoids Baird's problem. Propose a different linear feature representation for Baird's 7-state MDP such that off-policy semi-gradient TD(0) converges. What property of your feature map ensures contractivity?
Looking ahead
The next lecture introduces Deep Q-Networks (DQNDeep Q-Network) — the first algorithm to successfully stabilize Q-learning with neural networks. Each of DQNDeep Q-Network's two core innovations maps directly onto a failure mode from this lecture:
- Experience replay breaks temporal correlations in training data and stabilizes the on-policy distribution assumption underlying the MSVE objective.
- Target networks freeze the bootstrap target, directly addressing the moving target problem introduced by neural network function approximation with bootstrapping.
Understanding the failures introduced here is what makes DQNDeep Q-Network's design choices interpretable as principled engineering responses rather than empirical tricks.
Further reading
- Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. ICML. (Introduced Baird's Counterexample and the Deadly Triad).
- Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control. (Proofs of convergence for linear function approximation).
- Thrun, S., & Schwartz, A. (1993). Issues in using function approximation for reinforcement learning. (Early identification of the overestimation bias in Q-learning).
- Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., & Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML. (Gradient TD methods: GTD, GTD2, TDC — theoretically convergent under the deadly triad).
- van Hasselt, H. (2010). Double Q-learning. NeurIPS. (Decouples action selection from action evaluation to reduce overestimation bias).
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature. (DQN: experience replay, target networks, and superhuman Atari performance).