Week 4: Monte Carlo and Temporal-Difference Learning

Purpose of this lecture#

The foundational question underlying this lecture is: How can we learn the value of a state without ever seeing the end of an episode?

Dynamic programming (Week 3) showed how to solve finite MDPs exactly — but only when the environment's transition dynamics are fully known. In practice, agents must learn from experience: observing transitions, collecting rewards, and updating value estimates without access to $P$ or $R$ .

On the surface, this seems impossible. The Bellman equation says $V(s) = \mathbb{E}[r + \gamma V(s')]$ , which requires knowledge of next-state values — a future value that may depend on a distant reward. How can we improve estimates without that information?

The answer splits into two families of methods with radically different philosophies:

Monte Carlo (MC) methods take the direct route: wait until the episode ends, observe the actual return, and use it as an unbiased target. The target is correct in expectation, but it has high variance and is unavailable in continuing tasks that never terminate.
Temporal-Difference (TD) methods take a radical shortcut: bootstrap from current estimates, using the agent's own (initially wrong) value predictions as targets for improving those same predictions. This is philosophically strange (how can a wrong estimate help?), but it works, with lower variance than MC at the cost of bias.

These two approaches embody a fundamental tension in statistical learning: the bias-variance tradeoff. Widely used deep RL algorithms such as DQN, PPO, SAC, and actor-critic methods all build on the ideas developed here.

Learning from experience#

Assume the agent interacts with the environment and observes trajectories:

(s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_T)

We aim to estimate value functions using sampled returns, rather than expectations over known transition probabilities. The Bellman equations from Week 3 remain the target — we are now computing them from data rather than from a model.

Three key design questions structure the space of methods:

Do we wait until the episode ends before updating, or update incrementally?
Do we use the true return or bootstrap from current estimates?
How do we handle data collected under a different policy than the one we want to learn?

A worked example#

Before developing the theory, we trace both MC and TD(0) through a single episode on the 4-state chain from Week 3. Suppose the current value estimates are:

$V = [V(s_1), V(s_2), V(s_3), V(s_4)] = [-0.5,\; 0.0,\; 0.3,\; 1.0]$

and we observe the episode: $s_2 \xrightarrow{R, r=0} s_3 \xrightarrow{R, r=0} s_4 \xrightarrow{\text{terminal}, r=+1}$

with $\gamma = 0.9$ and learning rate $\alpha = 0.1$ .

Monte Carlo update for $s_2$ :

The actual return from $s_2$ : $G = 0 + 0.9 \cdot 0 + 0.9^2 \cdot 1 = 0.81$

MC update: $V(s_2) \leftarrow 0.0 + 0.1(0.81 - 0.0) = 0.081$

TD(0) update for $s_2$ (using the one-step target $r + \gamma V(s_3)$ ): $\delta = 0 + 0.9 \cdot 0.3 - 0.0 = 0.27$ $V(s_2) \leftarrow 0.0 + 0.1 \cdot 0.27 = 0.027$

The MC update is larger because its target is the actual observed return ($0.81$), whereas the TD target ($0.27$) inherits the low current estimate $V(s_3) = 0.3$. The TD update is therefore smaller and biased by the inaccuracy in $V(s_3)$. As $V(s_3)$ improves toward $V^\pi(s_3)$, the TD target approaches the same expected value as the MC target, and each TD update is available immediately after a single transition instead of at the end of the episode.
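The arithmetic of the worked example can be checked with a few lines of Python. This is a minimal sketch; the variable names (`V`, `rewards`, `mc_update`, `td_update`) are illustrative and not part of the lecture's own code.

```python
# Worked example: one episode s2 -> s3 -> s4 -> terminal on the 4-state chain.
gamma, alpha = 0.9, 0.1
V = {"s1": -0.5, "s2": 0.0, "s3": 0.3, "s4": 1.0}  # current estimates
rewards = [0.0, 0.0, 1.0]  # rewards received on leaving s2, s3, s4

# Monte Carlo target: the full discounted return observed from s2.
G = sum(gamma**k * r for k, r in enumerate(rewards))
mc_update = V["s2"] + alpha * (G - V["s2"])

# TD(0) target: first reward plus the discounted current estimate of s3.
delta = rewards[0] + gamma * V["s3"] - V["s2"]
td_update = V["s2"] + alpha * delta

print(f"G = {G:.2f}, MC: V(s2) = {mc_update:.3f}")
print(f"delta = {delta:.2f}, TD(0): V(s2) = {td_update:.3f}")
```

Changing `V["s3"]` to a value closer to the return actually observed from $s_3$ moves the TD(0) result toward the MC result, which is the convergence behavior described above.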

Monte Carlo methods#

Monte Carlo methods estimate value functions by averaging observed returns from complete episodes.

Return definition#

For time step $t$ in an episode of length $T$ :

G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}

$G_t$ is the actual discounted return from step $t$ to episode end. It is a direct sample of the quantity $V^\pi(s_t) = \mathbb{E}_\pi[G_t \mid s_t]$ .
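As a sketch of how the return is computed in practice, the sum can be evaluated for every time step in one backward pass using the recursion $G_t = r_{t+1} + \gamma G_{t+1}$ with $G_T = 0$. The function name `discounted_returns` and the list-based reward layout are assumptions made for this illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for rewards [r_1, ..., r_T]."""
    returns = [0.0] * len(rewards)
    G = 0.0
    # Work backward using G_t = r_{t+1} + gamma * G_{t+1}, starting from G_T = 0.
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

# Rewards from the worked example episode (two zeros, then +1 at termination).
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))
```

The backward pass costs $O(T)$ for the whole episode, whereas evaluating the sum separately for each $t$ would cost $O(T^2)$.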

First-visit vs every-visit Monte Carlo#

First-visit MC: update the value estimate for $s$ using only the return from the first visit to $s$ in each episode. Samples across episodes are independent, making the estimator a standard i.i.d. average. Converges to $V^\pi(s)$ by the law of large numbers.
Every-visit MC: update using the return from every visit to $s$ within an episode. Within-episode visits are correlated (the state was reached from the same history), so the samples are not independent. Despite this, every-visit MC is also consistent — it converges to $V^\pi$ — but the convergence analysis is less clean. First-visit MC is the standard choice when independence matters for theoretical guarantees.
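The difference between the two variants only shows up when a state occurs more than once in an episode. The sketch below collects the return samples each variant would use; the helper name `visit_returns` and the toy episode are hypothetical.

```python
from collections import defaultdict

def visit_returns(states, rewards, gamma, first_visit=True):
    """Collect return samples per state from one episode.

    states[t] is s_t and rewards[t] is r_{t+1}, the reward received on leaving s_t.
    """
    # Compute G_t for every step with a backward pass.
    G, returns = 0.0, [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G

    samples = defaultdict(list)
    seen = set()
    for t, s in enumerate(states):  # forward pass, so `seen` marks earlier visits
        if first_visit and s in seen:
            continue
        seen.add(s)
        samples[s].append(returns[t])
    return dict(samples)

# Hypothetical episode that visits state "A" twice before terminating.
states = ["A", "B", "A"]
rewards = [0.0, 0.0, 1.0]
print(visit_returns(states, rewards, gamma=0.5, first_visit=True))
print(visit_returns(states, rewards, gamma=0.5, first_visit=False))
```

With first-visit MC, state "A" contributes one sample per episode; with every-visit MC it contributes two samples that share the same trailing rewards, which is the within-episode correlation mentioned above.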

Monte Carlo policy evaluation#

For a fixed policy $\pi$ :

V(s) \leftarrow V(s) + \alpha\bigl[G_t - V(s)\bigr]

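where $G_t$ is the observed return following a visit to $s$. With $\alpha = 1/N(s)$, where $N(s)$ counts the visits to $s$, this is exactly an incremental sample mean; with a constant $\alpha$ it is an exponentially weighted average that favors recent returns. Under suitably decaying step sizes, $V(s)$ converges to $\mathbb{E}[G_t \mid s_t = s] = V^\pi(s)$.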

Key properties:

Unbiased: $G_t$ is an unbiased sample of $V^\pi(s_t)$ ; no approximation of future values enters the target.
High variance: $G_t$ is a sum of up to $T$ random variables. Variance scales with episode length and reward stochasticity. In long or high-variance episodes, single-sample MC estimates can be far from the true mean.
Episode-terminal: updates are only possible after the episode ends. Cannot be used for continuing tasks.

Monte Carlo backup diagram: full return from episode end

Temporal-Difference learning#

Temporal-Difference learning combines:

MC's ability to learn from experience (no model required)
DP's ability to update before episode termination (bootstrapping)

Rather than waiting for the full return, TD uses the current value estimate of the next state as a proxy for the remaining return.

TD(0): the one-step update#

V(s_t) \leftarrow V(s_t) + \alpha \underbrace{\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]}_{\delta_t \;=\; \text{TD error}}

The TD error $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the discrepancy between:

the one-step bootstrap target $r_{t+1} + \gamma V(s_{t+1})$ , and
the current estimate $V(s_t)$ .

If $V = V^\pi$, then $\mathbb{E}_\pi[\delta_t \mid s_t] = 0$: the Bellman equation is satisfied and there is no average error. Individual TD errors remain noisy, but TD learning drives their expectation toward zero.
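A single tabular TD(0) update can be sketched as a small function. The name `td0_update` and the `terminal` flag (which sets the bootstrap value of a terminal state to zero) are assumptions of this sketch, not notation from the lecture.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one tabular TD(0) update to V in place and return the TD error."""
    # The value of a terminal state is defined to be zero (assumed convention).
    bootstrap = 0.0 if terminal else V[s_next]
    delta = r + gamma * bootstrap - V[s]
    V[s] += alpha * delta
    return delta

# First transition of the worked example: s2 -> s3 with reward 0.
V = {"s2": 0.0, "s3": 0.3}
delta = td0_update(V, "s2", r=0.0, s_next="s3", alpha=0.1, gamma=0.9)
print(f"delta = {delta:.2f}, V(s2) = {V['s2']:.3f}")
```

Because the update needs only the tuple $(s_t, r_{t+1}, s_{t+1})$, it can be applied as soon as a transition is observed.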

TD(0) backup diagram: one-step bootstrapping

TD(0) convergence#

Theorem: Under a fixed policy $\pi$, tabular TD(0) converges to $V^\pi$ with probability 1, provided every state is visited infinitely often and the step sizes satisfy the Robbins-Monro conditions:

\sum_{t=0}^\infty \alpha_t = \infty \qquad \text{and} \qquad \sum_{t=0}^\infty \alpha_t^2 < \infty

Why it converges — the contraction argument: TD(0) is a stochastic approximation of the fixed-point equation $V = T^\pi V$ , where $T^\pi$ is the Bellman expectation operator from Week 3. Each TD update is a noisy step in the direction of $T^\pi V - V$ . Because $T^\pi$ is a $\gamma$ -contraction (as established in Week 3), the fixed point $V^\pi$ is unique and the stochastic iterates converge to it — the contraction ensures the systematic signal toward $V^\pi$ dominates the noise over time.

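The Robbins-Monro conditions encode two requirements: step sizes must stay large enough to overcome noise and reach the fixed point from any starting estimate (first condition: the series diverges) but must decay fast enough that the algorithm eventually settles (second condition: the sum of squares converges). A constant step size $\alpha$ satisfies the first condition but violates the second, so the estimates keep fluctuating around $V^\pi$ instead of converging to it; it is nevertheless used in practice, especially for non-stationary problems where continued adaptation is desirable.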
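The two sums can be inspected numerically for a given schedule. The sketch below compares the schedule $\alpha_t = 1/t$ with a constant step size of $0.1$; the helper name `partial_sums` and the specific schedules are choices made for illustration, and finite partial sums only suggest, not prove, what happens in the limit.

```python
def partial_sums(step_size, n_terms):
    """Return (sum of alpha_t, sum of alpha_t^2) for t = 1..n_terms."""
    total, total_sq = 0.0, 0.0
    for t in range(1, n_terms + 1):
        a = step_size(t)
        total += a
        total_sq += a * a
    return total, total_sq

for n in (10**2, 10**3, 10**5):
    harmonic = partial_sums(lambda t: 1.0 / t, n)
    constant = partial_sums(lambda t: 0.1, n)
    print(f"n={n}: 1/t -> {harmonic}, constant 0.1 -> {constant}")
```

For $\alpha_t = 1/t$ the first sum keeps growing (logarithmically) while the sum of squares levels off below $\pi^2/6$. For the constant schedule both sums grow linearly in $n$, so the sum of squares is unbounded.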

Contrast with MC convergence: MC converges because it computes a sample mean of an unbiased estimator, with no Bellman operator or contraction required. The theoretical justification is simpler (law of large numbers), but practical performance is often worse due to high variance. TD's convergence is more subtle because the target itself depends on the current estimate; the contraction property of $T^\pi$ is what ensures this circular dependency converges rather than diverges.

Bias-variance tradeoff: MC vs TD vs n-step TD#

The core distinction between MC and TD is a bias-variance tradeoff that can be made precise.

Monte Carlo: unbiased, high variance#

The MC target $G_t$ satisfies $\mathbb{E}[G_t \mid s_t] = V^\pi(s_t)$, so it is an unbiased estimator of $V^\pi$. But $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}$ is a sum of $T - t$ random variables. If every reward satisfies $|r| \leq R_{\max}$, then $|G_t| \leq R_{\max}/(1-\gamma)$, which gives the worst-case bound:

\text{Var}(G_t) \leq \frac{R_{\max}^2}{(1-\gamma)^2}

and in practice variance grows with the number of stochastic steps between $s_t$ and termination. In long or high-variance episodes, single-sample MC estimates can be far from the true mean.

TD(0): biased, low variance#

The TD target $r_{t+1} + \gamma V(s_{t+1})$ uses the current estimate $V(s_{t+1})$, which equals $V^\pi(s_{t+1})$ only if the value function has converged. Before convergence, this introduces bootstrapping bias: conditional on $s_{t+1}$, the target is wrong by $\gamma(V(s_{t+1}) - V^\pi(s_{t+1}))$. However, the target depends on the randomness of only a single transition (the reward $r_{t+1}$ and the next state $s_{t+1}$) rather than on every reward until termination, which typically gives much lower variance than MC.

As $V \to V^\pi$ , bootstrapping bias vanishes. TD is asymptotically unbiased — in the limit, TD and MC converge to the same value function.

n-step TD: interpolating the tradeoff#

The $n$ -step return uses $n$ actual rewards before bootstrapping:

G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n V(s_{t+n})

Bias: $\gamma^n (V(s_{t+n}) - V^\pi(s_{t+n}))$ — decays exponentially in $n$ because the bootstrap contribution is discounted over $n$ steps. Increasing $n$ reduces bias.
Variance: scales with the sum of $n$ reward terms — increasing $n$ increases variance.

The update rule:

V(s_t) \leftarrow V(s_t) + \alpha\bigl[G_t^{(n)} - V(s_t)\bigr]

Special cases: $n = 1$ gives TD(0); $n \to \infty$ gives Monte Carlo. The optimal $n$ depends on the environment's reward variance and current estimation accuracy — both unknown in practice, which motivates TD(λ).
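The $n$-step return can be sketched directly from its definition. In the sketch below, the function name `n_step_return`, the list layout (`rewards[k]` holds $r_{k+1}$, `values[k]` holds $V(s_k)$), and the convention that the terminal state has value zero are all assumptions of the illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_t^(n) for an episode of length T = len(rewards).

    rewards[k] is r_{k+1}; values[k] is the current estimate V(s_k),
    with values[T] = 0 for the terminal state.
    """
    T = len(rewards)
    steps = min(n, T - t)  # truncate at the end of the episode
    G = sum(gamma**k * rewards[t + k] for k in range(steps))
    # Bootstrap from the estimate at the cut-off state; this term is zero
    # whenever the episode terminated within the n steps.
    G += gamma**steps * values[t + steps]
    return G

# Worked-example episode starting at s2.
rewards = [0.0, 0.0, 1.0]
values = [0.0, 0.3, 1.0, 0.0]  # V(s2), V(s3), V(s4), terminal
for n in (1, 2, 3):
    print(n, round(n_step_return(rewards, values, t=0, n=n, gamma=0.9), 4))
```

With `n=1` the function returns the TD(0) target, and for any `n` at least as large as the remaining episode length it returns the Monte Carlo return.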

Intuition: The Exponential Decay

At $n$ steps, the bootstrap error is $\gamma^n \cdot (V(s_{t+n}) - V^\pi(s_{t+n}))$ . The $\gamma^n$ factor shrinks this contribution exponentially — by step 20 with $\gamma = 0.9$ , the bootstrap contributes only $0.9^{20} \approx 0.12$ of the original error. Meanwhile, variance grows with the number of stochastic reward terms included.

This creates a crossover point: early in training, when $V$ is badly wrong, bias from the bootstrap target dominates, so larger $n$ wins because it relies less on the inaccurate estimate. Late in training, as $V \to V^\pi$, the bootstrap is nearly exact and variance becomes the bottleneck, so smaller $n$ wins because its target contains fewer noisy reward terms. The optimal $n$ is a moving target, which is why $\text{TD}(\lambda)$ (which combines all $n$) often outperforms any fixed $n$.

TD(λ) and eligibility traces#

TD(λ) generalizes n-step TD by combining all n-step returns simultaneously, with weights that decay exponentially in $n$ :

The λ-return#

G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

The factor $(1-\lambda)$ normalizes the weights so they sum to 1. The λ-return is a geometric mixture: the $n$-step return receives weight $(1-\lambda)\lambda^{n-1}$, so longer returns are weighted less. For $\lambda = 0$: only the 1-step return contributes → TD(0). For $\lambda \to 1$: the weight on every bootstrapped return vanishes and, in an episodic task, all of the weight falls on the full return $G_t$ (since $G_t^{(n)} = G_t$ for all $n \geq T - t$) → equivalent to Monte Carlo.

The λ-return requires the full episode to compute, since it involves all $G_t^{(n)}$ . Eligibility traces provide an online, incremental implementation.
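For an episodic task the infinite sum can be evaluated exactly: every $n$-step return with $n \geq T - t$ equals the full return $G_t$, so the tail of the geometric series collapses into a single residual weight $\lambda^{T-t-1}$ on $G_t$. The sketch below uses that form; the function name `lambda_return` and the list layout are assumptions of the illustration.

```python
def lambda_return(rewards, values, t, gamma, lam):
    """Episodic lambda-return G_t^lambda.

    rewards[k] is r_{k+1}; values[k] is V(s_k), with values[T] = 0 (terminal).
    """
    T = len(rewards)

    def n_step(n):
        # n-step return from time t; callers guarantee n <= T - t.
        G = sum(gamma**k * rewards[t + k] for k in range(n))
        return G + gamma**n * values[t + n]

    horizon = T - t  # n-step returns with n >= horizon all equal the full return
    G_lam = sum((1 - lam) * lam**(n - 1) * n_step(n) for n in range(1, horizon))
    # The remaining geometric weight, lam**(horizon - 1), falls on the full return.
    G_lam += lam**(horizon - 1) * n_step(horizon)
    return G_lam

rewards = [0.0, 0.0, 1.0]
values = [0.0, 0.3, 1.0, 0.0]  # V(s2), V(s3), V(s4), terminal
for lam in (0.0, 0.5, 1.0):
    print(lam, round(lambda_return(rewards, values, t=0, gamma=0.9, lam=lam), 4))
```

Setting `lam=0.0` reproduces the one-step TD target and `lam=1.0` reproduces the Monte Carlo return, matching the two limiting cases of the λ-return.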

Eligibility traces: the backward view#

Rather than computing the λ-return forward in time, eligibility traces distribute the TD error backward through time to recently visited states.

The accumulating eligibility trace for state $s$ :

e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf{1}[s_t = s]

At each step, all traces decay by $\gamma\lambda$ .
The trace for the current state $s_t$ is incremented by 1.
States visited recently have high traces; states not visited recently have traces near zero.

The value update applies the current TD error $\delta_t$ to all states, scaled by their trace:

V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s) \quad \forall s

Intuition: the trace is a decaying memory of which states were recently visited. When a reward is received, credit ( $\delta_t$ ) is assigned backward to all recent states in proportion to how recently they were visited — states visited long ago receive little credit; states visited just before the reward receive substantial credit. This is temporal credit assignment implemented efficiently online.

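The forward view (λ-return) and backward view (eligibility traces) are equivalent for offline updating: if the per-step increments are accumulated and applied at the end of the episode, both views produce the same total update to $V$ in the tabular and linear function approximation settings. With online updates the two agree only approximately, with the gap shrinking as $\alpha$ becomes small. This equivalence is what justifies using the online trace algorithm as an efficient implementation of TD(λ).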
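The trace and value updates above can be combined into a short online loop. This is a minimal sketch of tabular TD(λ) with accumulating traces; the function name `td_lambda_episode` and the `(s, r, s_next, done)` transition format are assumptions of the illustration.

```python
from collections import defaultdict

def td_lambda_episode(V, transitions, alpha, gamma, lam):
    """Run online tabular TD(lambda) with accumulating traces over one episode.

    transitions: list of (s, r, s_next, done) tuples. V is updated in place.
    """
    e = defaultdict(float)  # eligibility traces, all zero at episode start
    for s, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]
        # Decay every trace, then increment the trace of the current state.
        for state in list(e):
            e[state] *= gamma * lam
        e[s] += 1.0
        # Apply the same TD error to every traced state, scaled by its trace.
        for state, trace in e.items():
            V[state] += alpha * delta * trace
    return V

# Worked-example episode: s2 -> s3 -> s4 -> terminal.
episode = [("s2", 0.0, "s3", False), ("s3", 0.0, "s4", False), ("s4", 1.0, None, True)]
V = defaultdict(float, {"s2": 0.0, "s3": 0.3, "s4": 1.0})
td_lambda_episode(V, episode, alpha=0.1, gamma=0.9, lam=0.8)
print({s: round(v, 4) for s, v in V.items()})
```

With `lam=0.0` each trace is wiped out after one step, so only the current state is updated and the loop reduces to TD(0). With `lam=0.8`, the TD error observed on leaving $s_3$ also adjusts $V(s_2)$ through its decayed trace.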

Intuition: Backward Credit Assignment

Imagine a sequence of events leading to a large reward: $s_1 \to s_2 \to s_3 \to s_4$ (reward arrives). When the reward appears at $s_4$ , which earlier states deserve credit? The eligibility trace answers by recency: $s_3$ was visited just before the reward, so its trace is high and it receives substantial credit. $s_1$ 's trace has decayed by $(\gamma\lambda)^3$ , so it receives little.

Without eligibility traces, computing multi-step credit requires storing the full trajectory and working backward after the episode ends, which costs $O(T)$ memory and delays every update until termination. Eligibility traces achieve the same backward credit assignment in $O(|S|)$ memory, independent of episode length, updated online at each step. This is the practical payoff of the forward/backward view equivalence.

Eligibility trace diagram: decays by γλ per step

On-policy vs off-policy learning: formal definitions#

A distinction that must be made precise before discussing control methods.

On-policy learning: the behavior policy $\mu$ (used to generate experience) equals the target policy $\pi$ (the policy being evaluated or improved). The agent learns about the policy it is currently following.
Off-policy learning: the behavior policy $\mu \neq \pi$ . The agent follows one policy to generate experience while learning about a different (typically better) policy.

This distinction has concrete consequences:

On-policy methods are simpler and more stable, but the learned value function reflects the behavior policy, which may be exploratory and suboptimal.
Off-policy methods can learn about the optimal policy while following a safe or exploratory behavior policy. They can also learn from data generated by any source — old policies, human demonstrations, replay buffers — making them far more data-efficient in practice. The cost is additional complexity and, in some cases, instability.

Control with TD methods#

Control extends value estimation to policy improvement: learning optimal policies directly from experience.

On-policy control: SARSA#

SARSA updates the action-value function $Q(s,a)$ using the action actually taken under the current policy:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

The name SARSA records the five quantities involved in the update: $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ .

Since $a_{t+1}$ is sampled from the current policy, SARSA evaluates the behavior policy. Under GLIE conditions (Greedy in the Limit with Infinite Exploration: the policy must explore all state-action pairs infinitely often and eventually become greedy) and Robbins-Monro step sizes, SARSA converges to $Q^*$.

Practical consequence: SARSA is conservative. In stochastic environments with dangerous actions, SARSA learns to avoid them because it accounts for the possibility of taking them under the exploratory policy. This makes it suitable when safety during training matters.

SARSA vs Q-learning backup diagram: on-policy uses actual action, off-policy uses max

Off-policy control: Q-learning#

Q-learning updates using the greedy action in the next state, regardless of which action the behavior policy actually takes:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

The target $r_{t+1} + \gamma\max_{a'} Q(s_{t+1}, a')$ corresponds to the Bellman optimality equation for $Q^*$ — Q-learning is directly applying the model-free, sampled version of the value iteration update from Week 3.

Convergence: Q-learning converges to $Q^*$ with probability 1 under the Robbins-Monro conditions and the assumption that every state-action pair is visited infinitely often (coverage). Crucially, convergence holds regardless of the behavior policy — the $\max$ operator in the target decouples learning about the optimal policy from the policy used to collect data. This is what makes Q-learning off-policy.

Practical consequence: Q-learning can learn $Q^*$ from data collected by any behavior policy that provides sufficient coverage, including $\epsilon$-greedy, uniformly random, or human demonstration data. The same idea carries over to DQN, which stores transitions in a replay buffer and samples them uniformly: the buffer was filled by older versions of the policy, and Q-learning's off-policy target is what makes reusing that data legitimate. The tabular convergence guarantee itself does not transfer once a neural network replaces the table.

SARSA vs Q-learning: a summary#

| Property | SARSA | Q-learning |
|---|---|---|
| Policy type | On-policy | Off-policy |
| Update target | $Q(s_{t+1}, a_{t+1})$ (action taken) | $\max_{a'} Q(s_{t+1}, a')$ (greedy action) |
| Convergence target | $Q^\pi$ (behavior policy) | $Q^*$ (optimal policy) |
| Data requirement | Must follow target policy | Any behavior policy with coverage |
| Stability | More stable in stochastic settings | Can overestimate $Q^*$ (maximization bias) |
| Deep RL extension | Expected SARSA, soft actor-critic | DQN, double DQN |
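The difference in update targets can be made concrete with two small functions that differ only in how they bootstrap. This is a sketch under assumed names (`sarsa_update`, `q_learning_update`) and an assumed `done` flag for terminal transitions; the two-state table is a hypothetical example.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """On-policy: bootstrap from the action a_next actually chosen in s_next."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma, done=False):
    """Off-policy: bootstrap from the greedy action in s_next."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Hypothetical 2-state, 2-action table where the next action taken (a_next=0)
# is not the greedy one (action 1 has the higher value in state 1).
Q_sarsa = np.array([[0.0, 0.0], [1.0, 5.0]])
Q_qlearn = Q_sarsa.copy()
sarsa_update(Q_sarsa, s=0, a=1, r=0.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
q_learning_update(Q_qlearn, s=0, a=1, r=0.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q_sarsa[0, 1], Q_qlearn[0, 1])
```

When the behavior policy happens to take an exploratory action in the next state, SARSA's target reflects that action's lower value, while Q-learning's target ignores it and uses the maximum.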

Exploration in control#

Control requires both exploitation (taking actions that maximize estimated $Q$ ) and exploration (trying actions whose values are uncertain). Without exploration, the agent may never discover high-reward actions it has not yet tried; without exploitation, it never uses what it has learned.

ε-greedy policies#

The simplest exploration strategy is ε-greedy: with probability $1 - \varepsilon$ take the greedy action $\arg\max_a Q(s, a)$ ; with probability $\varepsilon$ take a uniformly random action.

\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\[6pt] \dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}

Properties:

For $\varepsilon > 0$, every action has probability $\geq \varepsilon / |\mathcal{A}| > 0$ in every state, so no action is ever excluded: every state-action pair that is reachable under the environment dynamics keeps being tried.
$\varepsilon = 0$ is fully greedy (no exploration); $\varepsilon = 1$ is uniform random (no exploitation).
In practice, $\varepsilon$ is annealed over training: high early (to explore broadly), low late (to exploit learned values).
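The ε-greedy distribution defined above can be sketched as follows. The function names and the tie-breaking rule (ties go to the lowest action index, which is what `np.argmax` does) are assumptions of this illustration.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_values)] += 1.0 - epsilon  # ties go to the lowest index
    return probs

def epsilon_greedy_action(q_values, epsilon, rng):
    """Sample an action index from the epsilon-greedy distribution."""
    return rng.choice(len(q_values), p=epsilon_greedy_probs(q_values, epsilon))

rng = np.random.default_rng(0)
q = np.array([0.1, 0.5, 0.2, 0.0])
probs = epsilon_greedy_probs(q, epsilon=0.2)
action = epsilon_greedy_action(q, 0.2, rng)
print(probs, action)
```

Note that the greedy action can also be selected during the random branch, which is why its total probability is $1 - \varepsilon + \varepsilon/|\mathcal{A}|$ and not $1 - \varepsilon$.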

GLIE: the theoretical requirement for convergence#

Greedy in the Limit with Infinite Exploration (GLIE) states two conditions:

Every state-action pair is visited infinitely often: $\lim_{t \to \infty} N_t(s,a) = \infty$ for all $(s, a)$ .
The policy converges to greedy: $\lim_{t \to \infty} \pi_t(a \mid s) = \mathbf{1}[a = \arg\max_{a'} Q(s, a')]$ .

An ε-greedy policy satisfies GLIE if $\varepsilon_t \to 0$ as $t \to \infty$ and $\sum_t \varepsilon_t = \infty$ (e.g., $\varepsilon_t = 1/t$ ). Under GLIE, SARSA converges to $Q^*$ with probability 1.

The exploration problem in practice#

GLIE provides convergence guarantees but offers little guidance on how fast to anneal $\varepsilon$ . In practice:

Too fast: the agent commits to a suboptimal policy before discovering better options. Common in environments with sparse rewards where the high-reward region is rarely reached by random exploration.
Too slow: the agent spends excessive time on random actions after the value function is already well-estimated. Sample efficiency suffers.

For deep RL, these issues are amplified: neural network approximators are initially poorly calibrated, and large action spaces make uniform random exploration impractical. Week 8 covers structured exploration strategies (UCB, Thompson sampling, curiosity-driven exploration) that address these limitations.

Importance sampling for off-policy learning#

When the behavior policy $\mu$ differs from the target policy $\pi$ , updates must correct for the distributional mismatch.

The importance sampling ratio#

\rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}

For off-policy TD(0), the corrected update is:

V(s_t) \leftarrow V(s_t) + \alpha\, \rho_t\, \delta_t

When $\pi(a_t|s_t) = \mu(a_t|s_t)$ , the ratio is 1 and we recover the standard TD update. When $\pi$ assigns higher probability to $a_t$ than $\mu$ did, the update is amplified; when lower, it is dampened.
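A single importance-weighted TD(0) update might look like the sketch below. The function name `off_policy_td0_update`, the dictionary-of-lists representation of the policies, and the two-action example are assumptions made for illustration.

```python
def off_policy_td0_update(V, s, a, r, s_next, pi, mu, alpha, gamma):
    """One importance-weighted TD(0) update on V in place; returns rho_t.

    pi[s][a] and mu[s][a] are the target and behavior action probabilities.
    mu[s][a] must be positive for every action the behavior policy can take.
    """
    rho = pi[s][a] / mu[s][a]
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rho * delta
    return rho

# Hypothetical state with two actions: the target policy prefers action 0,
# while the behavior policy picks uniformly at random.
pi = {"s": [0.9, 0.1]}
mu = {"s": [0.5, 0.5]}
V = {"s": 0.0, "s_next": 1.0}
rho = off_policy_td0_update(V, "s", 0, 0.0, "s_next", pi, mu, alpha=0.1, gamma=0.9)
print(rho, round(V["s"], 4))
```

Here the sampled action is one the target policy favors, so the ratio exceeds 1 and the update is amplified; had action 1 been sampled, the ratio would be $0.2$ and the update dampened.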

The variance explosion problem#

For multi-step returns, the importance sampling correction requires the product of ratios over all steps:

\rho_{t:t+n} = \prod_{k=0}^{n-1} \frac{\pi(a_{t+k} \mid s_{t+k})}{\mu(a_{t+k} \mid s_{t+k})}

If $\pi$ and $\mu$ differ at each step, the individual ratios deviate from 1 and their product can grow or shrink exponentially in $n$. For full MC returns ($n = T$), the variance of the importance-weighted estimator can be so large as to make learning impractical: a single trajectory with a large product weight can dominate the entire estimate.

This variance explosion is a fundamental obstacle to off-policy learning with long trajectories, not an implementation detail.

What Breaks Here: Distribution Mismatch at Scale

Consider a 50-step trajectory where the target policy $\pi$ is twice as likely as $\mu$ to take the chosen action at each step. The product ratio is $\rho = 2^{50} \approx 10^{15}$ — this single trajectory would receive $10^{15}$ times the normal update weight, completely destabilizing learning.

Even modest divergence compounds catastrophically: if $\rho_t = 1.1$ at each step, then $1.1^{50} \approx 117$ . This is not fixable through better implementations or numerical precision — it is a structural property of importance-weighted estimators over long horizons. The only solutions are to constrain policy divergence (PPO), truncate the horizon (n-step with small $n$ ), or use a learning rule that doesn't require importance sampling (Q-learning's max operator).
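The compounding can be reproduced numerically. The sketch below recomputes the two products quoted above and adds one hypothetical case: state-independent policies with two actions, where actions are drawn i.i.d. from $\mu$. In that simplified setting the product has mean exactly 1, yet its variance still grows geometrically with the horizon. The function names and the example probabilities are assumptions of the illustration.

```python
import numpy as np

def cumulative_ratio(per_step_ratios):
    """Product of per-step importance ratios along one trajectory."""
    return float(np.prod(per_step_ratios))

def ratio_product_variance(pi_probs, mu_probs, n):
    """Variance of the n-step ratio product when actions are i.i.d. draws from mu.

    Assumes state-independent policies, so per-step ratios are independent and
    each has mean 1 under mu; the variance is E[rho^2]^n - 1.
    """
    second_moment = sum(m * (p / m) ** 2 for p, m in zip(pi_probs, mu_probs))
    return second_moment**n - 1.0

print(f"{cumulative_ratio([2.0] * 50):.3e}")  # ratio 2 at each of 50 steps
print(f"{cumulative_ratio([1.1] * 50):.1f}")  # ratio 1.1 at each of 50 steps
for n in (1, 10, 50, 200):
    print(n, ratio_product_variance([0.6, 0.4], [0.5, 0.5], n))
```

Even with per-step ratios of only $1.2$ or $0.8$, the variance of the product keeps multiplying by the same factor at every additional step, which is the structural problem that no implementation trick removes.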

PPO as a solution to variance explosion#

Proximal Policy Optimization (PPO), an algorithm widely used to fine-tune language models in RLHF, addresses this directly by clipping the importance ratio:

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( \rho_t(\theta) A_t,\; \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]

where $\rho_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$ is the ratio between the current and previous policy, and $A_t$ is the advantage estimate. Clipping the ratio to $[1-\epsilon, 1+\epsilon]$ prevents any single transition from contributing an outsized update — it controls the variance explosion at the cost of introducing a small bias. This is the direct application of importance sampling theory to policy optimization: the ratio $\rho_t$ appears in PPO for exactly the same reason it appears in off-policy TD — to correct for the fact that data was collected under a different (older) policy.
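The clipped objective can be evaluated per transition with a few NumPy calls. This sketch only computes the surrogate value for given ratios and advantages; it does not include gradient computation or the rest of a PPO training loop, and the function name and sample numbers are hypothetical.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# Hypothetical batch covering large and small ratios with both advantage signs.
ratios = np.array([1.0, 10.0, 10.0, 0.1, 0.1])
advantages = np.array([1.0, 1.0, -1.0, 1.0, -1.0])
per_sample = ppo_clip_objective(ratios, advantages)
print(per_sample)         # one surrogate value per transition
print(per_sample.mean())  # batch estimate of L^CLIP
```

The `min` makes the objective a pessimistic bound: the clip limits the gain when the ratio is large and the advantage is positive, or when the ratio is small and the advantage is negative. In the other two cases the unclipped term is the smaller one and is kept as is.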

Intuition: Clipping Prevents Tail Risk

PPO's clip bounds the contribution any single transition can make to the gradient. Without clipping, a transition whose action is 10× more probable under the new policy than under the old one receives weight 10, potentially dominating the entire update. With $\epsilon = 0.2$ and a positive advantage, the effective weight is capped at $1.2$, a more than 8-fold reduction in the worst-case tail contribution.

The consequence of clipping: if the new policy has diverged significantly from where the data was collected, the gradient is deliberately under-estimated rather than allowing a destabilizing large step. This is conservative learning, not accurate learning — and in sequential decision problems where instability compounds across updates, conservative beats accurate.

The deadly triad#

Week 5 introduces function approximation — replacing tables with neural networks. Before doing so, it is important to understand why combining the techniques of this lecture with function approximation can cause instability.

The deadly triad (Sutton & Barto) refers to the combination of three elements that, when present simultaneously, can cause divergence in value function learning:

Function approximation (representing $V$ or $Q$ with a neural network)
Bootstrapping (using current value estimates as targets — i.e., TD methods)
Off-policy learning (learning about a target policy from data generated by a different behavior policy)

Each element alone is manageable: supervised regression with function approximation converges; tabular TD(0) converges; tabular off-policy Q-learning converges. But the three together remove the convergence guarantees. The problem is that the combination of bootstrapping and function approximation creates a moving target (the target depends on the current parameters), and the off-policy distribution mismatch means the function approximator is trained on a distribution that does not match the on-policy distribution of the target policy — creating systematic bias that compounds.

DQN (Week 5) directly addresses the deadly triad through two mechanisms:

Replay buffers decorrelate updates and stabilize the training distribution.
Target networks freeze the bootstrap target temporarily, preventing the moving target problem from destabilizing training.

Understanding the deadly triad explains why these engineering choices are necessary — not as tricks, but as principled responses to a known theoretical obstacle.

Why Not Avoid the Deadly Triad Entirely?

Each element of the triad is load-bearing for practical RL:

Function approximation is unavoidable: the game of Go alone has roughly $10^{170}$ legal board positions, and the raw pixel observations of Atari games span a far larger space; a lookup table is physically impossible.
Bootstrapping is unavoidable — MC requires complete episodes, which are often prohibitively long or nonexistent for continuing tasks.
Off-policy learning is unavoidable for data efficiency — without it, every transition must come from the current policy, making replay buffers and pre-collected datasets unusable.

The deadly triad names a real problem, not a design choice to avoid. DQN's target networks and replay buffer are not workarounds — they are principled engineering responses to a forced tradeoff that every practical deep RL system must confront.

Key takeaways#

The lecture develops a unified view of learning from experience through the lens of backup depth and policy alignment. Monte Carlo uses full-trajectory backups: unbiased but high variance, requiring complete episodes. TD(0) uses one-step bootstrapped backups: lower variance but biased, converging by the contraction argument on the Bellman expectation operator. n-step TD interpolates via a bias-variance tradeoff that decays bias exponentially in $n$ . TD(λ) resolves the tradeoff adaptively by combining all n-step returns, implemented online via eligibility traces that assign temporal credit backward through recent state visits.

SARSA is on-policy TD for control: it evaluates the behavior policy and converges to $Q^*$ only if the policy becomes greedy in the limit (GLIE). Q-learning is off-policy TD: it targets $Q^*$ directly via the Bellman optimality operator and converges regardless of the behavior policy. Exploration is the mechanism that satisfies GLIE — $\varepsilon$ -greedy policies guarantee coverage but introduce a tradeoff between exploration speed and exploitation quality. Importance sampling corrects for distribution mismatch in off-policy learning but suffers from exponential variance growth over long trajectories — the exact problem that PPO's clipped ratio objective addresses in RLHF. And the deadly triad of function approximation, bootstrapping, and off-policy learning identifies why combining these ingredients requires careful engineering, previewing the DQN stabilization techniques in Week 5.

Conceptual questions#

Trace through the worked example at the start of the lecture but for state $s_3$ in the same episode. Compute both the MC update and the TD(0) update for $V(s_3)$ . Which update moves $V(s_3)$ further from its current value, and why?
TD(0) converges to $V^\pi$ while MC also converges to $V^\pi$ . If they converge to the same target, why would you ever prefer TD over MC? Give two distinct reasons, one theoretical and one practical.
An agent is learning in an environment where episodes are very long (thousands of steps) and rewards are sparse (most $r_t = 0$ , with occasional large rewards). Should you use MC, TD(0), or n-step TD? Justify your answer in terms of the bias-variance tradeoff and the practical effect of episode length on each method.
Q-learning converges to $Q^*$ even when the behavior policy is $\epsilon$ -greedy with $\epsilon = 0.5$ (i.e., random half the time). SARSA with the same behavior policy converges to something other than $Q^*$ . Explain precisely what SARSA converges to and why the $\max$ operator in Q-learning's target makes the difference.
A language model is fine-tuned using PPO. The old policy (used to collect rollouts) and the new policy (being updated) begin to diverge significantly during training. Explain why this causes the importance ratio $\rho_t$ to become large, why large $\rho_t$ is harmful in the context of the variance explosion problem, and how PPO's clipping mechanism addresses this. What does the clipping introduce in return?
(Extension) The eligibility trace update is $e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf{1}[s_t = s]$ . Show that when $\lambda = 0$ , the $\text{TD}(\lambda)$ update with eligibility traces reduces exactly to the standard $\text{TD}(0)$ update. Then show that when $\lambda = 1$ and $\gamma = 1$ , the update is equivalent to Monte Carlo.
(Extension) Double Q-learning maintains two independent Q-value estimates $Q_A$ and $Q_B$ . When updating $Q_A$ , the greedy action is selected using $Q_A$ but evaluated using $Q_B$ : the target is $r + \gamma Q_B(s', \arg\max_{a'} Q_A(s', a'))$ . Explain why standard Q-learning suffers from maximization bias, and how the decoupling in Double Q-learning addresses it.


Coding exercises#

Exercise 1: First-visit Monte Carlo prediction#

Implement first-visit MC policy evaluation for a simple chain MDP. The environment has states $\{0, 1, 2, 3, 4\}$ where state 4 is terminal with reward $+1$ and all other transitions give reward $0$ .


import numpy as np
from collections import defaultdict

def first_visit_mc(policy, n_episodes=5000, gamma=0.9, alpha=0.05):
    """
    First-visit Monte Carlo policy evaluation.

    policy: callable (state -> action), the policy to evaluate
    Returns: V, a dict mapping state -> estimated value
    """
    V = defaultdict(float)

    for _ in range(n_episodes):
        # --- generate an episode ---
        episode = []  # list of (state, reward) tuples
        state = 0
        while state != 4:
            action = policy(state)
            # TODO: implement the chain transition:
            #   action=1 moves right (state+1), reward=+1 if reaching state 4 else 0
            #   action=0 stays in place, reward=0
            next_state = ...
            reward = ...
            episode.append((state, reward))
            state = next_state
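        # --- first-visit MC update ---
        # Record the earliest time step at which each state occurs, so that the
        # backward pass below updates only on the first visit (a "seen" set in a
        # backward pass would select the last visit instead).
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            if first_visit[state] == t:
                # TODO: apply the incremental mean update
                V[state] += ...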

    return V

# Test: uniform random policy
policy = lambda s: np.random.choice([0, 1])
V = first_visit_mc(policy)
print({s: round(v, 3) for s, v in sorted(V.items())})
# Expected: values should increase from state 0 to state 3

Exercise 2: TD(0) prediction#

Implement TD(0) on the same chain MDP and compare convergence with MC.


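# Imports repeated so this snippet runs on its own.
import numpy as np
from collections import defaultdict

def td_zero(policy, n_episodes=5000, gamma=0.9, alpha=0.05):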
    """
    TD(0) policy evaluation.
    Returns: V, a dict mapping state -> estimated value
    """
    V = defaultdict(float)

    for _ in range(n_episodes):
        state = 0
        while state != 4:
            action = policy(state)
            next_state = state + 1 if action == 1 else state
            reward = 1.0 if next_state == 4 else 0.0

            # TODO: compute the TD error delta_t
            delta = ...

            # TODO: apply the TD(0) update
            V[state] += ...

            state = next_state

    return V

V_td = td_zero(lambda s: np.random.choice([0, 1]))
print({s: round(v, 3) for s, v in sorted(V_td.items())})

Once both functions are implemented, run them with the same number of episodes and compare their estimates. Which converges faster to the true values? Why?
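
To make the comparison concrete, the true values can be computed exactly, since this chain is small enough to solve the Bellman equations as a linear system $(I - \gamma P)V = r$ . The sketch below assumes the transition rule used in the exercises (action 1 moves right, action 0 stays in place) and a uniform random policy; `chain_true_values` and `p_right` are names introduced here for illustration.

```python
import numpy as np

def chain_true_values(gamma=0.9, p_right=0.5, n_states=5):
    """Exact V for the chain under a policy that moves right with prob. p_right."""
    n = n_states - 1              # non-terminal states 0..3
    P = np.zeros((n, n))          # transitions among non-terminal states
    r = np.zeros(n)               # expected one-step reward per state
    for s in range(n):
        P[s, s] += 1.0 - p_right  # action 0: stay in place, reward 0
        if s + 1 < n:
            P[s, s + 1] += p_right  # action 1: move right, reward 0
        else:
            r[s] = p_right * 1.0    # action 1 from state 3 reaches the terminal state, reward +1
    # The terminal state has value 0, so it drops out of the system.
    return np.linalg.solve(np.eye(n) - gamma * P, r)

V_true = chain_true_values()
print(np.round(V_true, 3))  # approximately [0.498 0.609 0.744 0.909]
```

With these reference values, the error of each method can be tracked as a function of the number of episodes rather than judged by eye.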

Exercise 3: Q-learning vs SARSA#

Implement both Q-learning and SARSA on a simple chain world with a penalty state, and observe how ε-greedy exploration affects the policy each method learns.

import numpy as np

def q_learning(n_episodes=2000, gamma=0.9, alpha=0.1, epsilon=0.1):
    """Off-policy Q-learning."""
    # States: 0-7 (linear chain); state 7 is terminal (+10 reward on arrival)
    # State 3 is a "cliff": entering it gives -5 reward, whichever action led there
    n_states, n_actions = 8, 2  # actions: 0=left, 1=right
    Q = np.zeros((n_states, n_actions))

    for _ in range(n_episodes):
        state = 0
        while state != 7:
            # epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = np.argmax(Q[state])

            next_state = min(state + 1, 7) if action == 1 else max(state - 1, 0)
            reward = 10.0 if next_state == 7 else (-5.0 if next_state == 3 else 0.0)

            # TODO: Q-learning update (off-policy: use max over next actions)
            next_value = 0.0  # TODO: replace with np.max(Q[next_state])
            Q[state, action] += alpha * (reward + gamma * next_value - Q[state, action])
            state = next_state

    return Q

def sarsa(n_episodes=2000, gamma=0.9, alpha=0.1, epsilon=0.1):
    """On-policy SARSA."""
    n_states, n_actions = 8, 2
    Q = np.zeros((n_states, n_actions))

    for _ in range(n_episodes):
        state = 0
        action = np.argmax(Q[state]) if np.random.random() > epsilon else np.random.randint(n_actions)

        while state != 7:
            next_state = min(state + 1, 7) if action == 1 else max(state - 1, 0)
            reward = 10.0 if next_state == 7 else (-5.0 if next_state == 3 else 0.0)

            next_action = np.argmax(Q[next_state]) if np.random.random() > epsilon else np.random.randint(n_actions)

            # TODO: SARSA update (on-policy: use next_action, not max)
            next_value = 0.0  # TODO: replace with Q[next_state, next_action]
            Q[state, action] += alpha * (reward + gamma * next_value - Q[state, action])
            state, action = next_state, next_action

    return Q

Q_ql = q_learning()
Q_sarsa = sarsa()
print("Q-learning greedy policy:", [np.argmax(Q_ql[s]) for s in range(7)])
print("SARSA greedy policy:     ", [np.argmax(Q_sarsa[s]) for s in range(7)])
# Do the learned policies differ near state 3? Why?
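
The difference between the two updates comes down to the bootstrap target on a single transition. The following worked example uses made-up numbers (the `Q_next` row, the reward, and the sampled `next_action` are all assumptions) to show the Q-learning target, the SARSA target for one sampled next action, and the SARSA target averaged over the ε-greedy policy.

```python
import numpy as np

gamma, epsilon = 0.9, 0.1
reward = 0.0
Q_next = np.array([1.0, 4.0])  # hypothetical Q[next_state] for (left, right)
next_action = 0                # suppose epsilon-greedy happened to explore "left"

# Q-learning bootstraps from the best next action, whatever is actually taken.
q_learning_target = reward + gamma * np.max(Q_next)

# SARSA bootstraps from the action the behaviour policy actually selected.
sarsa_target = reward + gamma * Q_next[next_action]

# Averaging the SARSA target over the epsilon-greedy action distribution.
probs = np.full(len(Q_next), epsilon / len(Q_next))
probs[np.argmax(Q_next)] += 1.0 - epsilon
expected_sarsa_target = reward + gamma * probs @ Q_next

print(q_learning_target, sarsa_target, expected_sarsa_target)
```

Here the Q-learning target is 3.6, the sampled SARSA target is 0.9, and the SARSA target averages to 3.465 under the ε-greedy policy. SARSA's estimates therefore account for the cost of exploratory actions, while Q-learning's do not.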

Looking ahead#

The next lecture introduces function approximation: replacing lookup tables with parametric models, and specifically neural networks. We will see that the deadly triad makes naive deep TD learning unstable, and study how the Deep Q-Network (DQN) addresses this through target networks and experience replay, each of which can be understood as a direct engineering response to one component of the triad.

Knowledge Check#

Test your understanding of Monte Carlo, TD, and eligibility traces.

Exercise · Fill in the blank

The TD error is defined as δ_t = r_{t+1} + γV(s_{t+1}) − ___

Exercise · Multiple choice

On each time step, an eligibility trace e(s) for a state not visited is multiplied by ___

γλ

γ only

λ only

1 − α

Which method requires waiting until the end of an episode before updating value estimates?

Temporal-Difference (TD)

Monte Carlo (MC)

Dynamic Programming (DP)

TD(lambda) with lambda=0.5
