Purpose of this lecture
The foundational question underlying this lecture is: How can we learn the value of a state without ever seeing the end of an episode?
Dynamic programming (Week 3) showed how to solve finite MDPs exactly — but only when the environment's transition dynamics are fully known. In practice, agents must learn from experience: observing transitions, collecting rewards, and updating value estimates without access to or .
On the surface, this seems impossible. The Bellman equation says , which requires knowledge of next-state values — a future value that may depend on a distant reward. How can we improve estimates without that information?
The answer splits into two families of methods with radically different philosophies:
- Monte Carlo (MC) methods take the direct route: wait until the episode ends, observe the true return, and use it as an unbiased target. This guarantees correctness but suffers from high variance and cannot handle continuing tasks.
- Temporal-Difference (TDTemporal Difference) methods take a radical shortcut: bootstrap from current estimates. Use your own (wrong) value predictions to improve themselves. Philosophically strange — how can a wrong estimate help? — but it works, with lower variance than MC at the cost of bias.
These two approaches embody a fundamental tension in statistical learning: the bias-variance tradeoff. Every deep RLReinforcement Learning algorithm — DQNDeep Q-Network, PPOProximal Policy Optimisation, SACSoft Actor-Critic, actor-critic — is an extension of the ideas developed here.
Learning from experience
Assume the agent interacts with the environment and observes trajectories:
We aim to estimate value functions using sampled returns, rather than expectations over known transition probabilities. The Bellman equations from Week 3 remain the target — we are now computing them from data rather than from a model.
Three key design questions structure the space of methods:
- Do we wait until the episode ends before updating, or update incrementally?
- Do we use the true return or bootstrap from current estimates?
- How do we handle data collected under a different policy than the one we want to learn?
A worked example
Before developing the theory, we trace both MC and TDTemporal Difference(0) through a single episode on the 4-state chain from Week 3. Suppose the current value estimates are:
and we observe the episode:
with and learning rate .
Monte Carlo update for :
The actual return from :
MC update:
TDTemporal Difference(0) update for (using the one-step target ):
The MC update is larger because it uses the actual observed return, which is higher than the current estimate implies. The TDTemporal Difference update is smaller and biased by the inaccuracy in . As improves toward , the TDTemporal Difference updates will converge to the same target as MC — but they do so earlier, one step at a time.
Monte Carlo methods
Monte Carlo methods estimate value functions by averaging observed returns from complete episodes.
Return definition
For time step in an episode of length :
is the actual discounted return from step to episode end. It is a direct sample of the quantity .
First-visit vs every-visit Monte Carlo
- First-visit MC: update the value estimate for using only the return from the first visit to in each episode. Samples across episodes are independent, making the estimator a standard i.i.d. average. Converges to by the law of large numbers.
- Every-visit MC: update using the return from every visit to within an episode. Within-episode visits are correlated (the state was reached from the same history), so the samples are not independent. Despite this, every-visit MC is also consistent — it converges to — but the convergence analysis is less clean. First-visit MC is the standard choice when independence matters for theoretical guarantees.
Monte Carlo policy evaluation
For a fixed policy :
where is the observed return following a visit to . This is an incremental mean update — over many episodes, converges to .
Key properties:
- Unbiased: is an unbiased sample of ; no approximation of future values enters the target.
- High variance: is a sum of up to random variables. Variance scales with episode length and reward stochasticity. In long or high-variance episodes, single-sample MC estimates can be far from the true mean.
- Episode-terminal: updates are only possible after the episode ends. Cannot be used for continuing tasks.
Monte Carlo: waits until the episode ends, then updates using the full return .
Temporal-Difference learning
Temporal-Difference learning combines:
- MC's ability to learn from experience (no model required)
- DP's ability to update before episode termination (bootstrapping)
Rather than waiting for the full return, TDTemporal Difference uses the current value estimate of the next state as a proxy for the remaining return.
TDTemporal Difference(0): the one-step update
The TDTemporal Difference error is the discrepancy between:
- the one-step bootstrap target , and
- the current estimate .
If , then — the Bellman equation is satisfied and there is no average error. TDTemporal Difference learning drives toward zero.
TD(0): one-step bootstrapping — uses the immediate reward plus the next state's value estimate.
TDTemporal Difference(0) convergence
Theorem: Under a fixed policy , tabular TDTemporal Difference(0) converges to with probability 1, provided the step sizes satisfy the Robbins-Monro conditions:
Why it converges — the contraction argument: TDTemporal Difference(0) is a stochastic approximation of the fixed-point equation , where is the Bellman expectation operator from Week 3. Each TDTemporal Difference update is a noisy step in the direction of . Because is a -contraction (as established in Week 3), the fixed point is unique and the stochastic iterates converge to it — the contraction ensures the systematic signal toward dominates the noise over time.
The Robbins-Monro conditions encode two requirements: step sizes must be large enough to overcome noise (first condition: the series diverges) but must decay fast enough that the algorithm eventually settles (second condition: the sum of squares converges). A constant step size satisfies neither in the limit but is used in practice for non-stationary problems.
Contrast with MC convergence: MC converges because it computes a sample mean of an unbiased estimator — no Bellman operator, no contraction required. The theoretical justification is simpler (law of large numbers) but the practical performance is worse due to high variance. TDTemporal Difference's convergence is more subtle because the target itself depends on the current estimate, but the contraction of is what ensures this circular dependency converges rather than diverges.
Bias-variance tradeoff: MC vs TDTemporal Difference vs n-step TDTemporal Difference
The core distinction between MC and TDTemporal Difference is a bias-variance tradeoff that can be made precise.
Monte Carlo: unbiased, high variance
The MC target satisfies — it is an unbiased estimator of . But is a sum of random variables. By the variance of a sum:
and in practice variance grows with the number of stochastic steps between and termination. In long or high-variance episodes, single-sample MC estimates can be far from the true mean.
TDTemporal Difference(0): biased, low variance
The TDTemporal Difference target uses the current estimate , which equals only if the value function has converged. Before convergence, this introduces bootstrapping bias: the target is wrong by . However, the target depends on only one random variable (, since is determined by and ), giving much lower variance than MC.
As , bootstrapping bias vanishes. TDTemporal Difference is asymptotically unbiased — in the limit, TDTemporal Difference and MC converge to the same value function.
n-step TDTemporal Difference: interpolating the tradeoff
The -step return uses actual rewards before bootstrapping:
- Bias: — decays exponentially in because the bootstrap contribution is discounted over steps. Increasing reduces bias.
- Variance: scales with the sum of reward terms — increasing increases variance.
The update rule:
Special cases: gives TDTemporal Difference(0); gives Monte Carlo. The optimal depends on the environment's reward variance and current estimation accuracy — both unknown in practice, which motivates TDTemporal Difference(λ).
TDTemporal Difference(λ) and eligibility traces
TDTemporal Difference(λ) generalizes n-step TDTemporal Difference by combining all n-step returns simultaneously, with weights that decay exponentially in :
The λ-return
The factor normalizes the weights so they sum to 1. The λ-return is a geometric mixture: each n-step return contributes, with longer returns weighted by . For : only the 1-step return contributes → TDTemporal Difference(0). For : all returns contribute equally → equivalent to Monte Carlo.
The λ-return requires the full episode to compute, since it involves all . Eligibility traces provide an online, incremental implementation.
Eligibility traces: the backward view
Rather than computing the λ-return forward in time, eligibility traces distribute the TDTemporal Difference error backward through time to recently visited states.
The accumulating eligibility trace for state :
- At each step, all traces decay by .
- The trace for the current state is incremented by 1.
- States visited recently have high traces; states not visited recently have traces near zero.
The value update applies the current TDTemporal Difference error to all states, scaled by their trace:
Intuition: the trace is a decaying memory of which states were recently visited. When a reward is received, credit () is assigned backward to all recent states in proportion to how recently they were visited — states visited long ago receive little credit; states visited just before the reward receive substantial credit. This is temporal credit assignment implemented efficiently online.
The forward view (λ-return) and backward view (eligibility traces) are equivalent in expectation for linear function approximation — this equivalence is what justifies using the online trace algorithm as an efficient implementation of TDTemporal Difference(λ).
Eligibility traces: each visit increments e(s) by 1, then decays by γλ per step. TD error flows backward through traces.
On-policy vs off-policy learning: formal definitions
A distinction that must be made precise before discussing control methods.
- On-policy learning: the behavior policy (used to generate experience) equals the target policy (the policy being evaluated or improved). The agent learns about the policy it is currently following.
- Off-policy learning: the behavior policy . The agent follows one policy to generate experience while learning about a different (typically better) policy.
This distinction has concrete consequences:
- On-policy methods are simpler and more stable, but the learned value function reflects the behavior policy, which may be exploratory and suboptimal.
- Off-policy methods can learn about the optimal policy while following a safe or exploratory behavior policy. They can also learn from data generated by any source — old policies, human demonstrations, replay buffers — making them far more data-efficient in practice. The cost is additional complexity and, in some cases, instability.
Control with TDTemporal Difference methods
Control extends value estimation to policy improvement: learning optimal policies directly from experience.
On-policy control: SARSAState-Action-Reward-State-Action
SARSAState-Action-Reward-State-Action updates the action-value function using the action actually taken under the current policy:
The name SARSAState-Action-Reward-State-Action records the five quantities involved in the update: .
Since is sampled from the current policy, SARSAState-Action-Reward-State-Action evaluates the behavior policy. Under GLIE conditions (Greedy in the Limit with Infinite Exploration — the policy must explore all state-action pairs infinitely often and eventually become greedy), SARSAState-Action-Reward-State-Action converges to .
Practical consequence: SARSAState-Action-Reward-State-Action is conservative. In stochastic environments with dangerous actions, SARSAState-Action-Reward-State-Action learns to avoid them because it accounts for the possibility of taking them under the exploratory policy. This makes it suitable when safety during training matters.
On-policy (SARSA) vs off-policy (Q-learning): SARSA uses the actual next action from the policy; Q-learning uses the max over all actions.
Off-policy control: Q-learning
Q-learning updates using the greedy action in the next state, regardless of which action the behavior policy actually takes:
The target corresponds to the Bellman optimality equation for — Q-learning is directly applying the model-free, sampled version of the value iteration update from Week 3.
Convergence: Q-learning converges to with probability 1 under the Robbins-Monro conditions and the assumption that every state-action pair is visited infinitely often (coverage). Crucially, convergence holds regardless of the behavior policy — the operator in the target decouples learning about the optimal policy from the policy used to collect data. This is what makes Q-learning off-policy.
Practical consequence: Q-learning can learn from data collected by any exploratory policy, including -greedy, random, or even human demonstrations. This generalizes directly to DQNDeep Q-Network, which stores transitions in a replay buffer and samples them uniformly — the replay buffer data was collected by old versions of the policy, but Q-learning's off-policy property means convergence is still theoretically supported.
SARSAState-Action-Reward-State-Action vs Q-learning: a summary
| Property | SARSAState-Action-Reward-State-Action | Q-learning | |---|---|---| | Policy type | On-policy | Off-policy | | Update target | — action taken | — greedy action | | Convergence target | (behavior policy) | (optimal policy) | | Data requirement | Must follow target policy | Any behavior policy with coverage | | Stability | More stable in stochastic settings | Can overestimate (maximization bias) | | Deep RLReinforcement Learning extension | Expected SARSAState-Action-Reward-State-Action, soft actor-critic | DQNDeep Q-Network, double DQNDeep Q-Network |
Exploration in control
Control requires both exploitation (taking actions that maximize estimated ) and exploration (trying actions whose values are uncertain). Without exploration, the agent may never discover high-reward actions it has not yet tried; without exploitation, it never uses what it has learned.
ε-greedy policies
The simplest exploration strategy is ε-greedy: with probability take the greedy action ; with probability take a uniformly random action.
Properties:
- Every action has probability , so coverage is guaranteed — every state-action pair is eventually visited.
- is fully greedy (no exploration); is uniform random (no exploitation).
- In practice, is annealed over training: high early (to explore broadly), low late (to exploit learned values).
GLIE: the theoretical requirement for convergence
Greedy in the Limit with Infinite Exploration (GLIE) states two conditions:
- Every state-action pair is visited infinitely often: for all .
- The policy converges to greedy: .
An ε-greedy policy satisfies GLIE if as and (e.g., ). Under GLIE, SARSA converges to with probability 1.
The exploration problem in practice
GLIE provides convergence guarantees but offers little guidance on how fast to anneal . In practice:
- Too fast: the agent commits to a suboptimal policy before discovering better options. Common in environments with sparse rewards where the high-reward region is rarely reached by random exploration.
- Too slow: the agent spends excessive time on random actions after the value function is already well-estimated. Sample efficiency suffers.
For deep RL, these issues are amplified: neural network approximators are initially poorly calibrated, and large action spaces make uniform random exploration impractical. Week 8 covers structured exploration strategies (UCB, Thompson sampling, curiosity-driven exploration) that address these limitations.
Importance sampling for off-policy learning
When the behavior policy differs from the target policy , updates must correct for the distributional mismatch.
The importance sampling ratio
For off-policy TDTemporal Difference(0), the corrected update is:
When , the ratio is 1 and we recover the standard TDTemporal Difference update. When assigns higher probability to than did, the update is amplified; when lower, it is dampened.
The variance explosion problem
For multi-step returns, the importance sampling correction requires the product of ratios over all steps:
If and differ at each step, each ratio may be greater than 1, and their product grows exponentially in . For full MC returns (), the variance of the importance-weighted estimator can be so large as to make learning impractical — a single trajectory with a large product weight can dominate the entire estimate.
This variance explosion is a fundamental obstacle to off-policy learning with long trajectories, not an implementation detail.
PPOProximal Policy Optimisation as a solution to variance explosion
Proximal Policy Optimization (PPOProximal Policy Optimisation) — the algorithm used to fine-tune language models in RLHFReinforcement Learning from Human Feedback — addresses this directly by clipping the importance ratio:
where is the ratio between the current and previous policy, and is the advantage estimate. Clipping the ratio to prevents any single transition from contributing an outsized update — it controls the variance explosion at the cost of introducing a small bias. This is the direct application of importance sampling theory to policy optimization: the ratio appears in PPOProximal Policy Optimisation for exactly the same reason it appears in off-policy TDTemporal Difference — to correct for the fact that data was collected under a different (older) policy.
The deadly triad
Week 5 introduces function approximation — replacing tables with neural networks. Before doing so, it is important to understand why combining the techniques of this lecture with function approximation can cause instability.
The deadly triad (Sutton & Barto) refers to the combination of three elements that, when present simultaneously, can cause divergence in value function learning:
- Function approximation (representing or with a neural network)
- Bootstrapping (using current value estimates as targets — i.e., TDTemporal Difference methods)
- Off-policy learning (learning about a target policy from data generated by a different behavior policy)
Each element alone is manageable: supervised regression with function approximation converges; tabular TDTemporal Difference(0) converges; tabular off-policy Q-learning converges. But the three together remove the convergence guarantees. The problem is that the combination of bootstrapping and function approximation creates a moving target (the target depends on the current parameters), and the off-policy distribution mismatch means the function approximator is trained on a distribution that does not match the on-policy distribution of the target policy — creating systematic bias that compounds.
DQNDeep Q-Network (Week 5) directly addresses the deadly triad through two mechanisms:
- Replay buffers decorrelate updates and stabilize the training distribution.
- Target networks freeze the bootstrap target temporarily, preventing the moving target problem from destabilizing training.
Understanding the deadly triad explains why these engineering choices are necessary — not as tricks, but as principled responses to a known theoretical obstacle.
Key takeaways
The lecture develops a unified view of learning from experience through the lens of backup depth and policy alignment. Monte Carlo uses full-trajectory backups: unbiased but high variance, requiring complete episodes. TDTemporal Difference(0) uses one-step bootstrapped backups: lower variance but biased, converging by the contraction argument on the Bellman expectation operator. n-step TDTemporal Difference interpolates via a bias-variance tradeoff that decays bias exponentially in . TDTemporal Difference(λ) resolves the tradeoff adaptively by combining all n-step returns, implemented online via eligibility traces that assign temporal credit backward through recent state visits.
SARSAState-Action-Reward-State-Action is on-policy TDTemporal Difference for control: it evaluates the behavior policy and converges to only if the policy becomes greedy in the limit (GLIE). Q-learning is off-policy TDTemporal Difference: it targets directly via the Bellman optimality operator and converges regardless of the behavior policy. Exploration is the mechanism that satisfies GLIE — -greedy policies guarantee coverage but introduce a tradeoff between exploration speed and exploitation quality. Importance sampling corrects for distribution mismatch in off-policy learning but suffers from exponential variance growth over long trajectories — the exact problem that PPOProximal Policy Optimisation's clipped ratio objective addresses in RLHFReinforcement Learning from Human Feedback. And the deadly triad of function approximation, bootstrapping, and off-policy learning identifies why combining these ingredients requires careful engineering, previewing the DQNDeep Q-Network stabilization techniques in Week 5.
Conceptual questions
-
Trace through the worked example at the start of the lecture but for state in the same episode. Compute both the MC update and the TDTemporal Difference(0) update for . Which update moves further from its current value, and why?
-
TDTemporal Difference(0) converges to while MC also converges to . If they converge to the same target, why would you ever prefer TDTemporal Difference over MC? Give two distinct reasons, one theoretical and one practical.
-
An agent is learning in an environment where episodes are very long (thousands of steps) and rewards are sparse (most , with occasional large rewards). Should you use MC, TDTemporal Difference(0), or n-step TDTemporal Difference? Justify your answer in terms of the bias-variance tradeoff and the practical effect of episode length on each method.
-
Q-learning converges to even when the behavior policy is -greedy with (i.e., random half the time). SARSAState-Action-Reward-State-Action with the same behavior policy converges to something other than . Explain precisely what SARSAState-Action-Reward-State-Action converges to and why the operator in Q-learning's target makes the difference.
-
A language model is fine-tuned using PPOProximal Policy Optimisation. The old policy (used to collect rollouts) and the new policy (being updated) begin to diverge significantly during training. Explain why this causes the importance ratio to become large, why large is harmful in the context of the variance explosion problem, and how PPOProximal Policy Optimisation's clipping mechanism addresses this. What does the clipping introduce in return?
-
(Extension) The eligibility trace update is . Show that when , the update with eligibility traces reduces exactly to the standard update. Then show that when and , the update is equivalent to Monte Carlo.
-
(Extension) Double Q-learning maintains two independent Q-value estimates and . When updating , the greedy action is selected using but evaluated using : the target is . Explain why standard Q-learning suffers from maximization bias, and how the decoupling in Double Q-learning addresses it.
Coding exercises
Exercise 1: First-visit Monte Carlo prediction
Implement first-visit MC policy evaluation for a simple chain MDP. The environment has states where state 4 is terminal with reward and all other transitions give reward .
import numpy as np
from collections import defaultdict
def first_visit_mc(policy, n_episodes=5000, gamma=0.9, alpha=0.05):
"""
First-visit Monte Carlo policy evaluation.
policy: callable (state -> action), the policy to evaluate
Returns: V, a dict mapping state -> estimated value
"""
V = defaultdict(float)
for _ in range(n_episodes):
# --- generate an episode ---
episode = [] # list of (state, reward) tuples
state = 0
while state != 4:
action = policy(state)
# TODO: implement the chain transition:
# action=1 moves right (state+1), reward=+1 if reaching state 4 else 0
# action=0 stays in place, reward=0
next_state = ...
reward = ...
episode.append((state, reward))
state = next_state
# --- first-visit MC update ---
visited = set()
G = 0.0
for t in reversed(range(len(episode))):
state, reward = episode[t]
G = reward + gamma * G
if state not in visited:
visited.add(state)
# TODO: apply the incremental mean update
V[state] += ...
return V
# Test: uniform random policy
policy = lambda s: np.random.choice([0, 1])
V = first_visit_mc(policy)
print({s: round(v, 3) for s, v in sorted(V.items())})
# Expected: values should increase from state 0 to state 3
Exercise 2: TD(0) prediction
Implement TD(0) on the same chain MDP and compare convergence with MC.
def td_zero(policy, n_episodes=5000, gamma=0.9, alpha=0.05):
"""
TD(0) policy evaluation.
Returns: V, a dict mapping state -> estimated value
"""
V = defaultdict(float)
for _ in range(n_episodes):
state = 0
while state != 4:
action = policy(state)
next_state = state + 1 if action == 1 else state
reward = 1.0 if next_state == 4 else 0.0
# TODO: compute the TD error delta_t
delta = ...
# TODO: apply the TD(0) update
V[state] += ...
state = next_state
return V
V_td = td_zero(lambda s: np.random.choice([0, 1]))
print({s: round(v, 3) for s, v in sorted(V_td.items())})
After implementing both, run both with the same number of episodes and compare their estimates. Which converges faster to the true values? Why?
Exercise 3: Q-learning vs SARSA
Implement both Q-learning and SARSA on a simple grid world and observe the policy difference in stochastic settings.
def q_learning(n_episodes=2000, gamma=0.9, alpha=0.1, epsilon=0.1):
"""Off-policy Q-learning."""
# States: 0-7 (linear chain), state 7 is terminal (+10 reward)
# State 3 is a "cliff" (-5 reward if visited with random action)
n_states, n_actions = 8, 2 # 0=left, 1=right
Q = np.zeros((n_states, n_actions))
for _ in range(n_episodes):
state = 0
while state != 7:
# epsilon-greedy action selection
if np.random.random() < epsilon:
action = np.random.randint(n_actions)
else:
action = np.argmax(Q[state])
next_state = min(state + 1, 7) if action == 1 else max(state - 1, 0)
reward = 10.0 if next_state == 7 else (-5.0 if next_state == 3 else 0.0)
# TODO: Q-learning update (off-policy: use max over next actions)
next_value = 0.0 # TODO: replace with np.max(Q[next_state])
Q[state, action] += alpha * (reward + gamma * next_value - Q[state, action])
state = next_state
return Q
def sarsa(n_episodes=2000, gamma=0.9, alpha=0.1, epsilon=0.1):
"""On-policy SARSA."""
n_states, n_actions = 8, 2
Q = np.zeros((n_states, n_actions))
for _ in range(n_episodes):
state = 0
action = np.argmax(Q[state]) if np.random.random() > epsilon else np.random.randint(n_actions)
while state != 7:
next_state = min(state + 1, 7) if action == 1 else max(state - 1, 0)
reward = 10.0 if next_state == 7 else (-5.0 if next_state == 3 else 0.0)
next_action = np.argmax(Q[next_state]) if np.random.random() > epsilon else np.random.randint(n_actions)
# TODO: SARSA update (on-policy: use next_action, not max)
next_value = 0.0 # TODO: replace with Q[next_state, next_action]
Q[state, action] += alpha * (reward + gamma * next_value - Q[state, action])
state, action = next_state, next_action
return Q
Q_ql = q_learning()
Q_sarsa = sarsa()
print("Q-learning greedy policy:", [np.argmax(Q_ql[s]) for s in range(7)])
print("SARSA greedy policy: ", [np.argmax(Q_sarsa[s]) for s in range(7)])
# Do the learned policies differ near state 3? Why?
Looking ahead
The next lecture introduces function approximation: replacing lookup tables with parametric models, and specifically neural networks. We will see that the deadly triad makes naive deep TDTemporal Difference learning unstable, and study how Deep Q-Networks (DQNDeep Q-Network) addresses this through target networks and experience replay — each of which can be understood as a direct engineering response to one component of the triad.
Further reading
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.), MIT Press. Chapters 5–7 cover MC methods, TD learning, and n-step TD in depth. Freely available online.
- Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning. (The paper that introduced Q-learning).
- Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. (The technical report that introduced SARSAState-Action-Reward-State-Action).
- Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning. (The foundation of TDTemporal Difference learning).
Knowledge Check
Test your understanding of Monte Carlo, TD, and eligibility traces.
The TD error is defined as δ_t = r_{t+1} + γV(s_{t+1}) − ___
On each time step, an eligibility trace e(s) for a state not visited is multiplied by ___
Which method requires waiting until the end of an episode before updating value estimates?