Skip to main content
illumin8
Courses
Week 4: Monte Carlo and Temporal-Difference Learning
Reinforcement Learning
01Week 1: Reinforcement Learning Problem Formulation
02Week 2: Multi-Armed Bandits
03Week 3: Dynamic Programming for Finite MDPs
04Week 4: Monte Carlo and Temporal-Difference Learning
05Week 5: Function Approximation in Reinforcement Learning
06Week 6: Deep Q-Learning and Variants
07Week 7: Policy Gradient and Actor–Critic Methods
08Week 8: Modern Deep Reinforcement Learning Algorithms
09Week 9: Exploration, Partial Observability, and Multi-Agent Reinforcement Learning
10Week 10: Model-Based Reinforcement Learning and Planning
11Week 11: Offline Reinforcement Learning
12Week 12: Reinforcement Learning from Human Feedback
13Week 13: Direct Preference Optimization and GRPO
14Week 14: Agentic Systems and Course Capstone
Week 4

Week 4: Monte Carlo and Temporal-Difference Learning

✦Learning Outcomes
  • Implement MC prediction, TDTemporal Difference(0), and n-step TDTemporal Difference algorithms
  • Understand bias-variance trade-offs in return estimation
  • Compare on-policy (SARSAState-Action-Reward-State-Action) and off-policy (Q-learning) methods
  • Analyze the exploration problem in control settings
◆Prerequisites
  • Week 3: Policy evaluation, Bellman operators, value iteration
  • Week 2: Exploration-exploitation trade-off

Recommended: Review Week 3 on "Bellman operators and contraction mappings" before proceeding.

Purpose of this lecture

The foundational question underlying this lecture is: How can we learn the value of a state without ever seeing the end of an episode?

Dynamic programming (Week 3) showed how to solve finite MDPs exactly — but only when the environment's transition dynamics are fully known. In practice, agents must learn from experience: observing transitions, collecting rewards, and updating value estimates without access to PPP or RRR.

On the surface, this seems impossible. The Bellman equation says V(s)=E[r+γV(s′)]V(s) = \mathbb{E}[r + \gamma V(s')]V(s)=E[r+γV(s′)], which requires knowledge of next-state values — a future value that may depend on a distant reward. How can we improve estimates without that information?

The answer splits into two families of methods with radically different philosophies:

  • Monte Carlo (MC) methods take the direct route: wait until the episode ends, observe the true return, and use it as an unbiased target. This guarantees correctness but suffers from high variance and cannot handle continuing tasks.
  • Temporal-Difference (TDTemporal Difference) methods take a radical shortcut: bootstrap from current estimates. Use your own (wrong) value predictions to improve themselves. Philosophically strange — how can a wrong estimate help? — but it works, with lower variance than MC at the cost of bias.

These two approaches embody a fundamental tension in statistical learning: the bias-variance tradeoff. Every deep RLReinforcement Learning algorithm — DQNDeep Q-Network, PPOProximal Policy Optimisation, SACSoft Actor-Critic, actor-critic — is an extension of the ideas developed here.


Learning from experience

Assume the agent interacts with the environment and observes trajectories:

(s0,a0,r1,s1,a1,r2,…,sT)(s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_T)(s0​,a0​,r1​,s1​,a1​,r2​,…,sT​)

We aim to estimate value functions using sampled returns, rather than expectations over known transition probabilities. The Bellman equations from Week 3 remain the target — we are now computing them from data rather than from a model.

Three key design questions structure the space of methods:

  • Do we wait until the episode ends before updating, or update incrementally?
  • Do we use the true return or bootstrap from current estimates?
  • How do we handle data collected under a different policy than the one we want to learn?

A worked example

Before developing the theory, we trace both MC and TDTemporal Difference(0) through a single episode on the 4-state chain from Week 3. Suppose the current value estimates are:

V=[V(s1),V(s2),V(s3),V(s4)]=[−0.5,  0.0,  0.3,  1.0]V = [V(s_1), V(s_2), V(s_3), V(s_4)] = [-0.5,\; 0.0,\; 0.3,\; 1.0]V=[V(s1​),V(s2​),V(s3​),V(s4​)]=[−0.5,0.0,0.3,1.0]

and we observe the episode: s2→R,r=0s3→R,r=0s4→terminal,r=+1s_2 \xrightarrow{R, r=0} s_3 \xrightarrow{R, r=0} s_4 \xrightarrow{\text{terminal}, r=+1}s2​R,r=0​s3​R,r=0​s4​terminal,r=+1​

with γ=0.9\gamma = 0.9γ=0.9 and learning rate α=0.1\alpha = 0.1α=0.1.

Monte Carlo update for s2s_2s2​:

The actual return from s2s_2s2​: G=0+0.9⋅0+0.92⋅1=0.81G = 0 + 0.9 \cdot 0 + 0.9^2 \cdot 1 = 0.81G=0+0.9⋅0+0.92⋅1=0.81

MC update: V(s2)←0.0+0.1(0.81−0.0)=0.081V(s_2) \leftarrow 0.0 + 0.1(0.81 - 0.0) = 0.081V(s2​)←0.0+0.1(0.81−0.0)=0.081

TDTemporal Difference(0) update for s2s_2s2​ (using the one-step target r+γV(s3)r + \gamma V(s_3)r+γV(s3​)): δ=0+0.9⋅0.3−0.0=0.27\delta = 0 + 0.9 \cdot 0.3 - 0.0 = 0.27δ=0+0.9⋅0.3−0.0=0.27 V(s2)←0.0+0.1⋅0.27=0.027V(s_2) \leftarrow 0.0 + 0.1 \cdot 0.27 = 0.027V(s2​)←0.0+0.1⋅0.27=0.027

The MC update is larger because it uses the actual observed return, which is higher than the current estimate V(s3)=0.3V(s_3) = 0.3V(s3​)=0.3 implies. The TDTemporal Difference update is smaller and biased by the inaccuracy in V(s3)V(s_3)V(s3​). As V(s3)V(s_3)V(s3​) improves toward Vπ(s3)V^\pi(s_3)Vπ(s3​), the TDTemporal Difference updates will converge to the same target as MC — but they do so earlier, one step at a time.


Monte Carlo methods

Monte Carlo methods estimate value functions by averaging observed returns from complete episodes.

Return definition

For time step ttt in an episode of length TTT:

Gt=∑k=0T−t−1γkrt+k+1G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}Gt​=k=0∑T−t−1​γkrt+k+1​

GtG_tGt​ is the actual discounted return from step ttt to episode end. It is a direct sample of the quantity Vπ(st)=Eπ[Gt∣st]V^\pi(s_t) = \mathbb{E}_\pi[G_t \mid s_t]Vπ(st​)=Eπ​[Gt​∣st​].

First-visit vs every-visit Monte Carlo

  • First-visit MC: update the value estimate for sss using only the return from the first visit to sss in each episode. Samples across episodes are independent, making the estimator a standard i.i.d. average. Converges to Vπ(s)V^\pi(s)Vπ(s) by the law of large numbers.
  • Every-visit MC: update using the return from every visit to sss within an episode. Within-episode visits are correlated (the state was reached from the same history), so the samples are not independent. Despite this, every-visit MC is also consistent — it converges to VπV^\piVπ — but the convergence analysis is less clean. First-visit MC is the standard choice when independence matters for theoretical guarantees.

Monte Carlo policy evaluation

For a fixed policy π\piπ:

V(s)←V(s)+α[Gt−V(s)]V(s) \leftarrow V(s) + \alpha\bigl[G_t - V(s)\bigr]V(s)←V(s)+α[Gt​−V(s)]

where GtG_tGt​ is the observed return following a visit to sss. This is an incremental mean update — over many episodes, V(s)V(s)V(s) converges to E[Gt∣st=s]=Vπ(s)\mathbb{E}[G_t \mid s_t = s] = V^\pi(s)E[Gt​∣st​=s]=Vπ(s).

Key properties:

  • Unbiased: GtG_tGt​ is an unbiased sample of Vπ(st)V^\pi(s_t)Vπ(st​); no approximation of future values enters the target.
  • High variance: GtG_tGt​ is a sum of up to TTT random variables. Variance scales with episode length and reward stochasticity. In long or high-variance episodes, single-sample MC estimates can be far from the true mean.
  • Episode-terminal: updates are only possible after the episode ends. Cannot be used for continuing tasks.
✦Intuition: MC as a Sample Mean

Think of collecting exam scores from a class. After seeing 100 scores, your running average is close to the true mean — each new score is an unbiased sample. MC works identically: each episode produces one return GtG_tGt​, which is an unbiased sample of Vπ(st)V^\pi(s_t)Vπ(st​). The incremental update V(s)←V(s)+α[Gt−V(s)]V(s) \leftarrow V(s) + \alpha[G_t - V(s)]V(s)←V(s)+α[Gt​−V(s)] is exactly the standard running-mean formula — α\alphaα controls how quickly old estimates are discounted.

Unlike DP, which needs the true expectation E[r+γV(s′)]\mathbb{E}[r + \gamma V(s')]E[r+γV(s′)] over all possible next states, MC only needs samples. The law of large numbers does the rest.

Monte Carlo backup diagram: full return from episode end

Monte Carlo: waits until the episode ends, then updates using the full return G0=r1+γr2+…+γT−1rTG_0 = r_1 + \gamma r_2 + \ldots + \gamma^{T-1}r_TG0​=r1​+γr2​+…+γT−1rT​.


Temporal-Difference learning

Temporal-Difference learning combines:

  • MC's ability to learn from experience (no model required)
  • DP's ability to update before episode termination (bootstrapping)

Rather than waiting for the full return, TDTemporal Difference uses the current value estimate of the next state as a proxy for the remaining return.

TDTemporal Difference(0): the one-step update

V(st)←V(st)+α[rt+1+γV(st+1)−V(st)]⏟δt  =  TD errorV(s_t) \leftarrow V(s_t) + \alpha \underbrace{\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]}_{\delta_t \;=\; \text{TD error}}V(st​)←V(st​)+αδt​=TD error[rt+1​+γV(st+1​)−V(st​)]​​

The TDTemporal Difference error δt=rt+1+γV(st+1)−V(st)\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)δt​=rt+1​+γV(st+1​)−V(st​) is the discrepancy between:

  • the one-step bootstrap target rt+1+γV(st+1)r_{t+1} + \gamma V(s_{t+1})rt+1​+γV(st+1​), and
  • the current estimate V(st)V(s_t)V(st​).

If V=VπV = V^\piV=Vπ, then E[δt]=0\mathbb{E}[\delta_t] = 0E[δt​]=0 — the Bellman equation is satisfied and there is no average error. TDTemporal Difference learning drives δt\delta_tδt​ toward zero.

✦Intuition: The Bootstrapping Idea

The philosophical strangeness dissolves once you see that the bias only matters for correctness, not for the direction of improvement. If V(s′)=0V(s') = 0V(s′)=0 but Vπ(s′)=0.8V^\pi(s') = 0.8Vπ(s′)=0.8, the TD target r+γ⋅0r + \gamma \cdot 0r+γ⋅0 is wrong — but it still pulls V(s)V(s)V(s) in the right direction when r>V(s)r > V(s)r>V(s). More importantly: as training progresses and V(s′)V(s')V(s′) improves toward 0.80.80.8, the bias shrinks automatically.

TD bootstraps from an imperfect estimate, but the estimate is always improving, so the bias is always decreasing. The contraction of TπT^\piTπ (Week 3) is what guarantees this self-correcting property converges rather than spiraling.

TD(0) backup diagram: one-step bootstrapping

TD(0): one-step bootstrapping — uses the immediate reward plus the next state's value estimate.

TDTemporal Difference(0) convergence

Theorem: Under a fixed policy π\piπ, tabular TDTemporal Difference(0) converges to VπV^\piVπ with probability 1, provided the step sizes satisfy the Robbins-Monro conditions:

∑t=0∞αt=∞and∑t=0∞αt2<∞\sum_{t=0}^\infty \alpha_t = \infty \qquad \text{and} \qquad \sum_{t=0}^\infty \alpha_t^2 < \inftyt=0∑∞​αt​=∞andt=0∑∞​αt2​<∞

Why it converges — the contraction argument: TDTemporal Difference(0) is a stochastic approximation of the fixed-point equation V=TπVV = T^\pi VV=TπV, where TπT^\piTπ is the Bellman expectation operator from Week 3. Each TDTemporal Difference update is a noisy step in the direction of TπV−VT^\pi V - VTπV−V. Because TπT^\piTπ is a γ\gammaγ-contraction (as established in Week 3), the fixed point VπV^\piVπ is unique and the stochastic iterates converge to it — the contraction ensures the systematic signal toward VπV^\piVπ dominates the noise over time.

The Robbins-Monro conditions encode two requirements: step sizes must be large enough to overcome noise (first condition: the series diverges) but must decay fast enough that the algorithm eventually settles (second condition: the sum of squares converges). A constant step size α\alphaα satisfies neither in the limit but is used in practice for non-stationary problems.

Contrast with MC convergence: MC converges because it computes a sample mean of an unbiased estimator — no Bellman operator, no contraction required. The theoretical justification is simpler (law of large numbers) but the practical performance is worse due to high variance. TDTemporal Difference's convergence is more subtle because the target itself depends on the current estimate, but the contraction of TπT^\piTπ is what ensures this circular dependency converges rather than diverges.

⚠What Breaks Here: The Markov Assumption

TD(0)'s convergence proof relies on the environment being Markovian — the next state depends only on the current state, not the full history. In partially observable environments (POMDPs), the observed "state" is incomplete: the same observation can arise from different underlying states with different true values. TD updates will average over these distinct situations, converging to neither true value.

This is one reason deep RL agents sometimes show unstable or inconsistent behavior even when tabular theory predicts clean convergence. The neural network input (e.g., a single game frame) may not contain enough information to determine the true state — and no algorithm can fix a fundamentally non-Markov input representation.


Bias-variance tradeoff: MC vs TDTemporal Difference vs n-step TDTemporal Difference

The core distinction between MC and TDTemporal Difference is a bias-variance tradeoff that can be made precise.

Monte Carlo: unbiased, high variance

The MC target GtG_tGt​ satisfies E[Gt∣st]=Vπ(st)\mathbb{E}[G_t \mid s_t] = V^\pi(s_t)E[Gt​∣st​]=Vπ(st​) — it is an unbiased estimator of VπV^\piVπ. But Gt=∑k=0T−t−1γkrt+k+1G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}Gt​=∑k=0T−t−1​γkrt+k+1​ is a sum of T−tT - tT−t random variables. By the variance of a sum:

Var(Gt)≤Rmax⁡2(1−γ)2\text{Var}(G_t) \leq \frac{R_{\max}^2}{(1-\gamma)^2}Var(Gt​)≤(1−γ)2Rmax2​​

and in practice variance grows with the number of stochastic steps between sts_tst​ and termination. In long or high-variance episodes, single-sample MC estimates can be far from the true mean.

TDTemporal Difference(0): biased, low variance

The TDTemporal Difference target rt+1+γV(st+1)r_{t+1} + \gamma V(s_{t+1})rt+1​+γV(st+1​) uses the current estimate V(st+1)V(s_{t+1})V(st+1​), which equals Vπ(st+1)V^\pi(s_{t+1})Vπ(st+1​) only if the value function has converged. Before convergence, this introduces bootstrapping bias: the target is wrong by γ(V(st+1)−Vπ(st+1))\gamma(V(s_{t+1}) - V^\pi(s_{t+1}))γ(V(st+1​)−Vπ(st+1​)). However, the target depends on only one random variable (rt+1r_{t+1}rt+1​, since st+1s_{t+1}st+1​ is determined by sts_tst​ and ata_tat​), giving much lower variance than MC.

As V→VπV \to V^\piV→Vπ, bootstrapping bias vanishes. TDTemporal Difference is asymptotically unbiased — in the limit, TDTemporal Difference and MC converge to the same value function.

◆Why Not Just Use MC?

Two practical blockers:

  1. Continuing tasks: many real environments (trading systems, robotics controllers, language model inference) never terminate. MC requires an episode boundary to compute GtG_tGt​; TD can update after every step. For any non-episodic problem, MC simply cannot be applied.

  2. Long-horizon variance: with thousands of steps between sts_tst​ and episode end, a single MC estimate can be orders of magnitude away from the true mean — the learning signal is effectively noise. TD's one-step target depends on only two random variables (rt+1r_{t+1}rt+1​ and st+1s_{t+1}st+1​), keeping variance low even in very long episodes.

The bias TD introduces is temporary; the variance MC suffers from is structural.

n-step TDTemporal Difference: interpolating the tradeoff

The nnn-step return uses nnn actual rewards before bootstrapping:

Gt(n)=∑k=0n−1γkrt+k+1+γnV(st+n)G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n V(s_{t+n})Gt(n)​=k=0∑n−1​γkrt+k+1​+γnV(st+n​)
  • Bias: γn(V(st+n)−Vπ(st+n))\gamma^n (V(s_{t+n}) - V^\pi(s_{t+n}))γn(V(st+n​)−Vπ(st+n​)) — decays exponentially in nnn because the bootstrap contribution is discounted over nnn steps. Increasing nnn reduces bias.
  • Variance: scales with the sum of nnn reward terms — increasing nnn increases variance.

The update rule:

V(st)←V(st)+α[Gt(n)−V(st)]V(s_t) \leftarrow V(s_t) + \alpha\bigl[G_t^{(n)} - V(s_t)\bigr]V(st​)←V(st​)+α[Gt(n)​−V(st​)]

Special cases: n=1n = 1n=1 gives TDTemporal Difference(0); n→∞n \to \inftyn→∞ gives Monte Carlo. The optimal nnn depends on the environment's reward variance and current estimation accuracy — both unknown in practice, which motivates TDTemporal Difference(λ).

✦Intuition: The Exponential Decay

At nnn steps, the bootstrap error is γn⋅(V(st+n)−Vπ(st+n))\gamma^n \cdot (V(s_{t+n}) - V^\pi(s_{t+n}))γn⋅(V(st+n​)−Vπ(st+n​)). The γn\gamma^nγn factor shrinks this contribution exponentially — by step 20 with γ=0.9\gamma = 0.9γ=0.9, the bootstrap contributes only 0.920≈0.120.9^{20} \approx 0.120.920≈0.12 of the original error. Meanwhile, variance grows with the number of stochastic reward terms included.

This creates a crossover point: early in training, when VVV is badly wrong, bias from the bootstrap target dominates — small nnn wins because it relies less on the inaccurate estimate. Late in training, as V→VπV \to V^\piV→Vπ, variance becomes the bottleneck — larger nnn extracts more signal per episode. The optimal nnn is a moving target, which is why TD(λ)\text{TD}(\lambda)TD(λ) (which combines all nnn) often outperforms any fixed nnn.


TDTemporal Difference(λ) and eligibility traces

TDTemporal Difference(λ) generalizes n-step TDTemporal Difference by combining all n-step returns simultaneously, with weights that decay exponentially in nnn:

The λ-return

Gtλ=(1−λ)∑n=1∞λn−1Gt(n)G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}Gtλ​=(1−λ)n=1∑∞​λn−1Gt(n)​

The factor (1−λ)(1-\lambda)(1−λ) normalizes the weights so they sum to 1. The λ-return is a geometric mixture: each n-step return contributes, with longer returns weighted by λn−1\lambda^{n-1}λn−1. For λ=0\lambda = 0λ=0: only the 1-step return contributes → TDTemporal Difference(0). For λ=1\lambda = 1λ=1: all returns contribute equally → equivalent to Monte Carlo.

The λ-return requires the full episode to compute, since it involves all Gt(n)G_t^{(n)}Gt(n)​. Eligibility traces provide an online, incremental implementation.

Eligibility traces: the backward view

Rather than computing the λ-return forward in time, eligibility traces distribute the TDTemporal Difference error backward through time to recently visited states.

The accumulating eligibility trace for state sss:

et(s)=γλ et−1(s)+1[st=s]e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf{1}[s_t = s]et​(s)=γλet−1​(s)+1[st​=s]
  • At each step, all traces decay by γλ\gamma\lambdaγλ.
  • The trace for the current state sts_tst​ is incremented by 1.
  • States visited recently have high traces; states not visited recently have traces near zero.

The value update applies the current TDTemporal Difference error δt\delta_tδt​ to all states, scaled by their trace:

V(s)←V(s)+α δt et(s)∀sV(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s) \quad \forall sV(s)←V(s)+αδt​et​(s)∀s

Intuition: the trace is a decaying memory of which states were recently visited. When a reward is received, credit (δt\delta_tδt​) is assigned backward to all recent states in proportion to how recently they were visited — states visited long ago receive little credit; states visited just before the reward receive substantial credit. This is temporal credit assignment implemented efficiently online.

The forward view (λ-return) and backward view (eligibility traces) are equivalent in expectation for linear function approximation — this equivalence is what justifies using the online trace algorithm as an efficient implementation of TDTemporal Difference(λ).

✦Intuition: Backward Credit Assignment

Imagine a sequence of events leading to a large reward: s1→s2→s3→s4s_1 \to s_2 \to s_3 \to s_4s1​→s2​→s3​→s4​ (reward arrives). When the reward appears at s4s_4s4​, which earlier states deserve credit? The eligibility trace answers by recency: s3s_3s3​ was visited just before the reward, so its trace is high and it receives substantial credit. s1s_1s1​'s trace has decayed by (γλ)3(\gamma\lambda)^3(γλ)3, so it receives little.

Without eligibility traces, computing multi-step credit requires storing the full trajectory and working backward — O(T⋅∣S∣)O(T \cdot |S|)O(T⋅∣S∣) memory per episode. Eligibility traces achieve the same backward credit assignment in O(∣S∣)O(|S|)O(∣S∣) memory, updated online at each step. This is the practical payoff of the forward/backward view equivalence.

◆Accumulating vs replacing traces

The accumulating trace increments et(st)e_t(s_t)et​(st​) by 1 at each visit. A second variant, the replacing trace, instead sets et(st)=1e_t(s_t) = 1et​(st​)=1 (clamping rather than accumulating). Replacing traces prevent a state from receiving disproportionately large credit when it is revisited many times within a single episode — this matters in deterministic environments where the same state may be visited frequently on the optimal path. Most modern implementations default to replacing traces for tabular methods.

Eligibility trace diagram: decays by γλ per step

Eligibility traces: each visit increments e(s) by 1, then decays by γλ per step. TD error flows backward through traces.


On-policy vs off-policy learning: formal definitions

A distinction that must be made precise before discussing control methods.

  • On-policy learning: the behavior policy μ\muμ (used to generate experience) equals the target policy π\piπ (the policy being evaluated or improved). The agent learns about the policy it is currently following.
  • Off-policy learning: the behavior policy μ≠π\mu \neq \piμ=π. The agent follows one policy to generate experience while learning about a different (typically better) policy.

This distinction has concrete consequences:

  • On-policy methods are simpler and more stable, but the learned value function reflects the behavior policy, which may be exploratory and suboptimal.
  • Off-policy methods can learn about the optimal policy while following a safe or exploratory behavior policy. They can also learn from data generated by any source — old policies, human demonstrations, replay buffers — making them far more data-efficient in practice. The cost is additional complexity and, in some cases, instability.
◆Why On-Policy First?

Off-policy learning requires correcting for distribution mismatch — the data was collected under μ\muμ, but we want to learn about π\piπ. The correction (importance sampling, introduced below) adds variance and implementation complexity. On-policy SARSA sidesteps this entirely: the behavior policy is the target policy, so no correction is needed. Convergence analysis is simpler, implementation is cleaner, and the algorithm is more stable.

Understanding SARSA's on-policy dynamics builds the intuition needed to see why Q-learning's off-policy correction (the max operator) works — and why it sometimes causes maximization bias.


Control with TDTemporal Difference methods

Control extends value estimation to policy improvement: learning optimal policies directly from experience.

On-policy control: SARSAState-Action-Reward-State-Action

SARSAState-Action-Reward-State-Action updates the action-value function Q(s,a)Q(s,a)Q(s,a) using the action actually taken under the current policy:

Q(st,at)←Q(st,at)+α[rt+1+γQ(st+1,at+1)−Q(st,at)]Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]Q(st​,at​)←Q(st​,at​)+α[rt+1​+γQ(st+1​,at+1​)−Q(st​,at​)]

The name SARSAState-Action-Reward-State-Action records the five quantities involved in the update: (st,at,rt+1,st+1,at+1)(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})(st​,at​,rt+1​,st+1​,at+1​).

Since at+1a_{t+1}at+1​ is sampled from the current policy, SARSAState-Action-Reward-State-Action evaluates the behavior policy. Under GLIE conditions (Greedy in the Limit with Infinite Exploration — the policy must explore all state-action pairs infinitely often and eventually become greedy), SARSAState-Action-Reward-State-Action converges to Q∗Q^*Q∗.

Practical consequence: SARSAState-Action-Reward-State-Action is conservative. In stochastic environments with dangerous actions, SARSAState-Action-Reward-State-Action learns to avoid them because it accounts for the possibility of taking them under the exploratory policy. This makes it suitable when safety during training matters.

✦Intuition: On-Policy Means Conservative

The classic illustration is cliff-walking: a grid world where the optimal path runs along the edge of a cliff, and falling gives −100-100−100. With ε\varepsilonε-greedy exploration (ε>0\varepsilon > 0ε>0), the agent occasionally takes random actions near the cliff edge.

Q-learning ignores this in its target — it always assumes the greedy action will be taken — so it learns the cliff-edge path as "optimal." SARSA accounts for the actual ε\varepsilonε-greedy policy: it knows a random action near the cliff might cause a fall, so it learns the safer path farther from the edge. Q-learning is optimal under perfect execution; SARSA is safer when exploration is ongoing.

SARSA vs Q-learning backup diagram: on-policy uses actual action, off-policy uses max

On-policy (SARSA) vs off-policy (Q-learning): SARSA uses the actual next action from the policy; Q-learning uses the max over all actions.

Off-policy control: Q-learning

Q-learning updates using the greedy action in the next state, regardless of which action the behavior policy actually takes:

Q(st,at)←Q(st,at)+α[rt+1+γmax⁡a′Q(st+1,a′)−Q(st,at)]Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]Q(st​,at​)←Q(st​,at​)+α[rt+1​+γa′max​Q(st+1​,a′)−Q(st​,at​)]

The target rt+1+γmax⁡a′Q(st+1,a′)r_{t+1} + \gamma\max_{a'} Q(s_{t+1}, a')rt+1​+γmaxa′​Q(st+1​,a′) corresponds to the Bellman optimality equation for Q∗Q^*Q∗ — Q-learning is directly applying the model-free, sampled version of the value iteration update from Week 3.

Convergence: Q-learning converges to Q∗Q^*Q∗ with probability 1 under the Robbins-Monro conditions and the assumption that every state-action pair is visited infinitely often (coverage). Crucially, convergence holds regardless of the behavior policy — the max⁡\maxmax operator in the target decouples learning about the optimal policy from the policy used to collect data. This is what makes Q-learning off-policy.

Practical consequence: Q-learning can learn Q∗Q^*Q∗ from data collected by any exploratory policy, including ϵ\epsilonϵ-greedy, random, or even human demonstrations. This generalizes directly to DQNDeep Q-Network, which stores transitions in a replay buffer and samples them uniformly — the replay buffer data was collected by old versions of the policy, but Q-learning's off-policy property means convergence is still theoretically supported.

✦Intuition: The Decoupling Power of the Max

The max in Q-learning's target says: "assume the next action will be optimal, regardless of what I actually do." This means the data-collection policy and the learning target are fully decoupled. Whether the agent was exploring randomly, following an old policy, or a human was demonstrating gameplay, the max target remains well-defined and correct.

This is the same principle that makes DQN's replay buffer valid: transitions stored from old (suboptimal) policies are off-policy data, but Q-learning's off-policy property means those transitions still provide a correct gradient signal for learning Q∗Q^*Q∗. The replay buffer would be theoretically incoherent for SARSA; it is theoretically sound for Q-learning.

SARSAState-Action-Reward-State-Action vs Q-learning: a summary

| Property | SARSAState-Action-Reward-State-Action | Q-learning | |---|---|---| | Policy type | On-policy | Off-policy | | Update target | Q(st+1,at+1)Q(s_{t+1}, a_{t+1})Q(st+1​,at+1​) — action taken | max⁡a′Q(st+1,a′)\max_{a'} Q(s_{t+1}, a')maxa′​Q(st+1​,a′) — greedy action | | Convergence target | QπQ^\piQπ (behavior policy) | Q∗Q^*Q∗ (optimal policy) | | Data requirement | Must follow target policy | Any behavior policy with coverage | | Stability | More stable in stochastic settings | Can overestimate Q∗Q^*Q∗ (maximization bias) | | Deep RLReinforcement Learning extension | Expected SARSAState-Action-Reward-State-Action, soft actor-critic | DQNDeep Q-Network, double DQNDeep Q-Network |


Exploration in control

Control requires both exploitation (taking actions that maximize estimated QQQ) and exploration (trying actions whose values are uncertain). Without exploration, the agent may never discover high-reward actions it has not yet tried; without exploitation, it never uses what it has learned.

ε-greedy policies

The simplest exploration strategy is ε-greedy: with probability 1−ε1 - \varepsilon1−ε take the greedy action arg⁡max⁡aQ(s,a)\arg\max_a Q(s, a)argmaxa​Q(s,a); with probability ε\varepsilonε take a uniformly random action.

π(a∣s)={1−ε+ε∣A∣if a=arg⁡max⁡a′Q(s,a′)ε∣A∣otherwise\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\[6pt] \dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}π(a∣s)=⎩⎨⎧​1−ε+∣A∣ε​∣A∣ε​​if a=argmaxa′​Q(s,a′)otherwise​

Properties:

  • Every action has probability ≥ε/∣A∣>0\geq \varepsilon / |\mathcal{A}| > 0≥ε/∣A∣>0, so coverage is guaranteed — every state-action pair is eventually visited.
  • ε=0\varepsilon = 0ε=0 is fully greedy (no exploration); ε=1\varepsilon = 1ε=1 is uniform random (no exploitation).
  • In practice, ε\varepsilonε is annealed over training: high early (to explore broadly), low late (to exploit learned values).

GLIE: the theoretical requirement for convergence

Greedy in the Limit with Infinite Exploration (GLIE) states two conditions:

  1. Every state-action pair is visited infinitely often: lim⁡t→∞Nt(s,a)=∞\lim_{t \to \infty} N_t(s,a) = \inftylimt→∞​Nt​(s,a)=∞ for all (s,a)(s, a)(s,a).
  2. The policy converges to greedy: lim⁡t→∞πt(a∣s)=1[a=arg⁡max⁡a′Q(s,a′)]\lim_{t \to \infty} \pi_t(a \mid s) = \mathbf{1}[a = \arg\max_{a'} Q(s, a')]limt→∞​πt​(a∣s)=1[a=argmaxa′​Q(s,a′)].

An ε-greedy policy satisfies GLIE if εt→0\varepsilon_t \to 0εt​→0 as t→∞t \to \inftyt→∞ and ∑tεt=∞\sum_t \varepsilon_t = \infty∑t​εt​=∞ (e.g., εt=1/t\varepsilon_t = 1/tεt​=1/t). Under GLIE, SARSA converges to Q∗Q^*Q∗ with probability 1.

The exploration problem in practice

GLIE provides convergence guarantees but offers little guidance on how fast to anneal ε\varepsilonε. In practice:

  • Too fast: the agent commits to a suboptimal policy before discovering better options. Common in environments with sparse rewards where the high-reward region is rarely reached by random exploration.
  • Too slow: the agent spends excessive time on random actions after the value function is already well-estimated. Sample efficiency suffers.

For deep RL, these issues are amplified: neural network approximators are initially poorly calibrated, and large action spaces make uniform random exploration impractical. Week 8 covers structured exploration strategies (UCB, Thompson sampling, curiosity-driven exploration) that address these limitations.

◆Exploration vs exploitation is harder in control than bandit settings

In the bandit setting (Week 2), exploration and exploitation are symmetric: every arm can be pulled at any time. In control, the state visited at time t+1t+1t+1 depends on the action taken at ttt — exploration choices now have structural consequences. An exploratory action that moves the agent into a low-value region makes it harder to reach high-value regions later. This temporal coupling between exploration and future state distribution is what makes the exploration problem in control fundamentally harder than in bandits.


Importance sampling for off-policy learning

When the behavior policy μ\muμ differs from the target policy π\piπ, updates must correct for the distributional mismatch.

The importance sampling ratio

ρt=π(at∣st)μ(at∣st)\rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}ρt​=μ(at​∣st​)π(at​∣st​)​

For off-policy TDTemporal Difference(0), the corrected update is:

V(st)←V(st)+α ρt δtV(s_t) \leftarrow V(s_t) + \alpha\, \rho_t\, \delta_tV(st​)←V(st​)+αρt​δt​

When π(at∣st)=μ(at∣st)\pi(a_t|s_t) = \mu(a_t|s_t)π(at​∣st​)=μ(at​∣st​), the ratio is 1 and we recover the standard TDTemporal Difference update. When π\piπ assigns higher probability to ata_tat​ than μ\muμ did, the update is amplified; when lower, it is dampened.

The variance explosion problem

For multi-step returns, the importance sampling correction requires the product of ratios over all steps:

ρt:t+n=∏k=0n−1π(at+k∣st+k)μ(at+k∣st+k)\rho_{t:t+n} = \prod_{k=0}^{n-1} \frac{\pi(a_{t+k} \mid s_{t+k})}{\mu(a_{t+k} \mid s_{t+k})}ρt:t+n​=k=0∏n−1​μ(at+k​∣st+k​)π(at+k​∣st+k​)​

If π\piπ and μ\muμ differ at each step, each ratio may be greater than 1, and their product grows exponentially in nnn. For full MC returns (n=Tn = Tn=T), the variance of the importance-weighted estimator can be so large as to make learning impractical — a single trajectory with a large product weight can dominate the entire estimate.

This variance explosion is a fundamental obstacle to off-policy learning with long trajectories, not an implementation detail.

⚠What Breaks Here: Distribution Mismatch at Scale

Consider a 50-step trajectory where the target policy π\piπ is twice as likely as μ\muμ to take the chosen action at each step. The product ratio is ρ=250≈1015\rho = 2^{50} \approx 10^{15}ρ=250≈1015 — this single trajectory would receive 101510^{15}1015 times the normal update weight, completely destabilizing learning.

Even modest divergence compounds catastrophically: if ρt=1.1\rho_t = 1.1ρt​=1.1 at each step, then 1.150≈1171.1^{50} \approx 1171.150≈117. This is not fixable through better implementations or numerical precision — it is a structural property of importance-weighted estimators over long horizons. The only solutions are to constrain policy divergence (PPO), truncate the horizon (n-step with small nnn), or use a learning rule that doesn't require importance sampling (Q-learning's max operator).

PPOProximal Policy Optimisation as a solution to variance explosion

Proximal Policy Optimization (PPOProximal Policy Optimisation) — the algorithm used to fine-tune language models in RLHFReinforcement Learning from Human Feedback — addresses this directly by clipping the importance ratio:

LCLIP(θ)=Et[min⁡(ρt(θ)At,  clip(ρt(θ),1−ϵ,1+ϵ)At)]L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( \rho_t(\theta) A_t,\; \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]LCLIP(θ)=Et​[min(ρt​(θ)At​,clip(ρt​(θ),1−ϵ,1+ϵ)At​)]

where ρt(θ)=πθ(at∣st)/πθold(at∣st)\rho_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)ρt​(θ)=πθ​(at​∣st​)/πθold​​(at​∣st​) is the ratio between the current and previous policy, and AtA_tAt​ is the advantage estimate. Clipping the ratio to [1−ϵ,1+ϵ][1-\epsilon, 1+\epsilon][1−ϵ,1+ϵ] prevents any single transition from contributing an outsized update — it controls the variance explosion at the cost of introducing a small bias. This is the direct application of importance sampling theory to policy optimization: the ratio ρt\rho_tρt​ appears in PPOProximal Policy Optimisation for exactly the same reason it appears in off-policy TDTemporal Difference — to correct for the fact that data was collected under a different (older) policy.

✦Intuition: Clipping Prevents Tail Risk

PPO's clip sets a maximum contribution any single trajectory can make to the gradient. Without clipping, a trajectory where the new policy is 10× more probable than the old policy receives weight 10, potentially dominating the entire update. With ε=0.2\varepsilon = 0.2ε=0.2, the effective weight is capped at 1.21.21.2 — a more than 8-fold reduction in the worst-case tail contribution.

The consequence of clipping: if the new policy has diverged significantly from where the data was collected, the gradient is deliberately under-estimated rather than allowing a destabilizing large step. This is conservative learning, not accurate learning — and in sequential decision problems where instability compounds across updates, conservative beats accurate.


The deadly triad

Week 5 introduces function approximation — replacing tables with neural networks. Before doing so, it is important to understand why combining the techniques of this lecture with function approximation can cause instability.

The deadly triad (Sutton & Barto) refers to the combination of three elements that, when present simultaneously, can cause divergence in value function learning:

  1. Function approximation (representing VVV or QQQ with a neural network)
  2. Bootstrapping (using current value estimates as targets — i.e., TDTemporal Difference methods)
  3. Off-policy learning (learning about a target policy from data generated by a different behavior policy)

Each element alone is manageable: supervised regression with function approximation converges; tabular TDTemporal Difference(0) converges; tabular off-policy Q-learning converges. But the three together remove the convergence guarantees. The problem is that the combination of bootstrapping and function approximation creates a moving target (the target depends on the current parameters), and the off-policy distribution mismatch means the function approximator is trained on a distribution that does not match the on-policy distribution of the target policy — creating systematic bias that compounds.

DQNDeep Q-Network (Week 5) directly addresses the deadly triad through two mechanisms:

  • Replay buffers decorrelate updates and stabilize the training distribution.
  • Target networks freeze the bootstrap target temporarily, preventing the moving target problem from destabilizing training.

Understanding the deadly triad explains why these engineering choices are necessary — not as tricks, but as principled responses to a known theoretical obstacle.

◆Why Not Avoid the Deadly Triad Entirely?

Each element of the triad is load-bearing for practical RL:

  • Function approximation is unavoidable — Atari has roughly 1017010^{170}10170 possible screen states; a lookup table is physically impossible.
  • Bootstrapping is unavoidable — MC requires complete episodes, which are often prohibitively long or nonexistent for continuing tasks.
  • Off-policy learning is unavoidable for data efficiency — without it, every transition must come from the current policy, making replay buffers and pre-collected datasets unusable.

The deadly triad names a real problem, not a design choice to avoid. DQN's target networks and replay buffer are not workarounds — they are principled engineering responses to a forced tradeoff that every practical deep RL system must confront.


Key takeaways

The lecture develops a unified view of learning from experience through the lens of backup depth and policy alignment. Monte Carlo uses full-trajectory backups: unbiased but high variance, requiring complete episodes. TDTemporal Difference(0) uses one-step bootstrapped backups: lower variance but biased, converging by the contraction argument on the Bellman expectation operator. n-step TDTemporal Difference interpolates via a bias-variance tradeoff that decays bias exponentially in nnn. TDTemporal Difference(λ) resolves the tradeoff adaptively by combining all n-step returns, implemented online via eligibility traces that assign temporal credit backward through recent state visits.

SARSAState-Action-Reward-State-Action is on-policy TDTemporal Difference for control: it evaluates the behavior policy and converges to Q∗Q^*Q∗ only if the policy becomes greedy in the limit (GLIE). Q-learning is off-policy TDTemporal Difference: it targets Q∗Q^*Q∗ directly via the Bellman optimality operator and converges regardless of the behavior policy. Exploration is the mechanism that satisfies GLIE — ε\varepsilonε-greedy policies guarantee coverage but introduce a tradeoff between exploration speed and exploitation quality. Importance sampling corrects for distribution mismatch in off-policy learning but suffers from exponential variance growth over long trajectories — the exact problem that PPOProximal Policy Optimisation's clipped ratio objective addresses in RLHFReinforcement Learning from Human Feedback. And the deadly triad of function approximation, bootstrapping, and off-policy learning identifies why combining these ingredients requires careful engineering, previewing the DQNDeep Q-Network stabilization techniques in Week 5.


Conceptual questions

  1. Trace through the worked example at the start of the lecture but for state s3s_3s3​ in the same episode. Compute both the MC update and the TDTemporal Difference(0) update for V(s3)V(s_3)V(s3​). Which update moves V(s3)V(s_3)V(s3​) further from its current value, and why?

  2. TDTemporal Difference(0) converges to VπV^\piVπ while MC also converges to VπV^\piVπ. If they converge to the same target, why would you ever prefer TDTemporal Difference over MC? Give two distinct reasons, one theoretical and one practical.

  3. An agent is learning in an environment where episodes are very long (thousands of steps) and rewards are sparse (most rt=0r_t = 0rt​=0, with occasional large rewards). Should you use MC, TDTemporal Difference(0), or n-step TDTemporal Difference? Justify your answer in terms of the bias-variance tradeoff and the practical effect of episode length on each method.

  4. Q-learning converges to Q∗Q^*Q∗ even when the behavior policy is ϵ\epsilonϵ-greedy with ϵ=0.5\epsilon = 0.5ϵ=0.5 (i.e., random half the time). SARSAState-Action-Reward-State-Action with the same behavior policy converges to something other than Q∗Q^*Q∗. Explain precisely what SARSAState-Action-Reward-State-Action converges to and why the max⁡\maxmax operator in Q-learning's target makes the difference.

  5. A language model is fine-tuned using PPOProximal Policy Optimisation. The old policy (used to collect rollouts) and the new policy (being updated) begin to diverge significantly during training. Explain why this causes the importance ratio ρt\rho_tρt​ to become large, why large ρt\rho_tρt​ is harmful in the context of the variance explosion problem, and how PPOProximal Policy Optimisation's clipping mechanism addresses this. What does the clipping introduce in return?

  6. (Extension) The eligibility trace update is et(s)=γλ et−1(s)+1[st=s]e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf{1}[s_t = s]et​(s)=γλet−1​(s)+1[st​=s]. Show that when λ=0\lambda = 0λ=0, the TD(λ)\text{TD}(\lambda)TD(λ) update with eligibility traces reduces exactly to the standard TD(0)\text{TD}(0)TD(0) update. Then show that when λ=1\lambda = 1λ=1 and γ=1\gamma = 1γ=1, the update is equivalent to Monte Carlo.

  7. (Extension) Double Q-learning maintains two independent Q-value estimates QAQ_AQA​ and QBQ_BQB​. When updating QAQ_AQA​, the greedy action is selected using QAQ_AQA​ but evaluated using QBQ_BQB​: the target is r+γQB(s′,arg⁡max⁡a′QA(s′,a′))r + \gamma Q_B(s', \arg\max_{a'} Q_A(s', a'))r+γQB​(s′,argmaxa′​QA​(s′,a′)). Explain why standard Q-learning suffers from maximization bias, and how the decoupling in Double Q-learning addresses it.


Coding exercises

Exercise 1: First-visit Monte Carlo prediction

Implement first-visit MC policy evaluation for a simple chain MDP. The environment has states {0,1,2,3,4}\{0, 1, 2, 3, 4\}{0,1,2,3,4} where state 4 is terminal with reward +1+1+1 and all other transitions give reward 000.

import numpy as np
from collections import defaultdict

def first_visit_mc(policy, n_episodes=5000, gamma=0.9, alpha=0.05):
    """
    First-visit Monte Carlo policy evaluation.

    policy: callable (state -> action), the policy to evaluate
    Returns: V, a dict mapping state -> estimated value
    """
    V = defaultdict(float)

    for _ in range(n_episodes):
        # --- generate an episode ---
        episode = []  # list of (state, reward) tuples
        state = 0
        while state != 4:
            action = policy(state)
            # TODO: implement the chain transition:
            #   action=1 moves right (state+1), reward=+1 if reaching state 4 else 0
            #   action=0 stays in place, reward=0
            next_state = ...
            reward = ...
            episode.append((state, reward))
            state = next_state
        # --- first-visit MC update ---
        visited = set()
        G = 0.0
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            if state not in visited:
                visited.add(state)
                # TODO: apply the incremental mean update
                V[state] += ...

    return V

# Test: uniform random policy
policy = lambda s: np.random.choice([0, 1])
V = first_visit_mc(policy)
print({s: round(v, 3) for s, v in sorted(V.items())})
# Expected: values should increase from state 0 to state 3

Exercise 2: TD(0) prediction

Implement TD(0) on the same chain MDP and compare convergence with MC.

def td_zero(policy, n_episodes=5000, gamma=0.9, alpha=0.05):
    """
    TD(0) policy evaluation.
    Returns: V, a dict mapping state -> estimated value
    """
    V = defaultdict(float)

    for _ in range(n_episodes):
        state = 0
        while state != 4:
            action = policy(state)
            next_state = state + 1 if action == 1 else state
            reward = 1.0 if next_state == 4 else 0.0

            # TODO: compute the TD error delta_t
            delta = ...

            # TODO: apply the TD(0) update
            V[state] += ...

            state = next_state

    return V

V_td = td_zero(lambda s: np.random.choice([0, 1]))
print({s: round(v, 3) for s, v in sorted(V_td.items())})

After implementing both, run both with the same number of episodes and compare their estimates. Which converges faster to the true values? Why?

Exercise 3: Q-learning vs SARSA

Implement both Q-learning and SARSA on a simple grid world and observe the policy difference in stochastic settings.

def q_learning(n_episodes=2000, gamma=0.9, alpha=0.1, epsilon=0.1):
    """Off-policy Q-learning."""
    # States: 0-7 (linear chain), state 7 is terminal (+10 reward)
    # State 3 is a "cliff" (-5 reward if visited with random action)
    n_states, n_actions = 8, 2  # 0=left, 1=right
    Q = np.zeros((n_states, n_actions))

    for _ in range(n_episodes):
        state = 0
        while state != 7:
            # epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = np.argmax(Q[state])

            next_state = min(state + 1, 7) if action == 1 else max(state - 1, 0)
            reward = 10.0 if next_state == 7 else (-5.0 if next_state == 3 else 0.0)

            # TODO: Q-learning update (off-policy: use max over next actions)
            next_value = 0.0  # TODO: replace with np.max(Q[next_state])
            Q[state, action] += alpha * (reward + gamma * next_value - Q[state, action])
            state = next_state

    return Q

def sarsa(n_episodes=2000, gamma=0.9, alpha=0.1, epsilon=0.1):
    """On-policy SARSA."""
    n_states, n_actions = 8, 2
    Q = np.zeros((n_states, n_actions))

    for _ in range(n_episodes):
        state = 0
        action = np.argmax(Q[state]) if np.random.random() > epsilon else np.random.randint(n_actions)

        while state != 7:
            next_state = min(state + 1, 7) if action == 1 else max(state - 1, 0)
            reward = 10.0 if next_state == 7 else (-5.0 if next_state == 3 else 0.0)

            next_action = np.argmax(Q[next_state]) if np.random.random() > epsilon else np.random.randint(n_actions)

            # TODO: SARSA update (on-policy: use next_action, not max)
            next_value = 0.0  # TODO: replace with Q[next_state, next_action]
            Q[state, action] += alpha * (reward + gamma * next_value - Q[state, action])
            state, action = next_state, next_action

    return Q

Q_ql = q_learning()
Q_sarsa = sarsa()
print("Q-learning greedy policy:", [np.argmax(Q_ql[s]) for s in range(7)])
print("SARSA greedy policy:     ", [np.argmax(Q_sarsa[s]) for s in range(7)])
# Do the learned policies differ near state 3? Why?

Looking ahead

The next lecture introduces function approximation: replacing lookup tables with parametric models, and specifically neural networks. We will see that the deadly triad makes naive deep TDTemporal Difference learning unstable, and study how Deep Q-Networks (DQNDeep Q-Network) addresses this through target networks and experience replay — each of which can be understood as a direct engineering response to one component of the triad.


Further reading

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.), MIT Press. Chapters 5–7 cover MC methods, TD learning, and n-step TD in depth. Freely available online.
  • Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning. (The paper that introduced Q-learning).
  • Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. (The technical report that introduced SARSAState-Action-Reward-State-Action).
  • Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning. (The foundation of TDTemporal Difference learning).

Knowledge Check

Test your understanding of Monte Carlo, TD, and eligibility traces.

Exercise · Fill in the blank

The TD error is defined as δ_t = r_{t+1} + γV(s_{t+1}) − ___

Exercise · Multiple choice

On each time step, an eligibility trace e(s) for a state not visited is multiplied by ___

γλ
γ only
λ only
1 − α
Question 1 of 3

Which method requires waiting until the end of an episode before updating value estimates?

Temporal-Difference (TD)
Monte Carlo (MC)
Dynamic Programming (DP)
TD(lambda) with lambda=0.5
← Previous
Week 3: Dynamic Programming for Finite MDPs
Next →
Week 5: Function Approximation in Reinforcement Learning
On this page
  • Purpose of this lecture
  • Learning from experience
  • A worked example
  • Monte Carlo methods
  • Return definition
  • First-visit vs every-visit Monte Carlo
  • Monte Carlo policy evaluation
  • Temporal-Difference learning
  • TD(0): the one-step update
  • TD(0) convergence
  • Bias-variance tradeoff: MC vs TD vs n-step TD
  • Monte Carlo: unbiased, high variance
  • TD(0): biased, low variance
  • n-step TD: interpolating the tradeoff
  • TD(λ) and eligibility traces
  • The λ-return
  • Eligibility traces: the backward view
  • On-policy vs off-policy learning: formal definitions
  • Control with TD methods
  • On-policy control: SARSA
  • Off-policy control: Q-learning
  • SARSA vs Q-learning: a summary
  • Exploration in control
  • ε-greedy policies
  • GLIE: the theoretical requirement for convergence
  • The exploration problem in practice
  • Importance sampling for off-policy learning
  • The importance sampling ratio
  • The variance explosion problem
  • PPO as a solution to variance explosion
  • The deadly triad
  • Key takeaways
  • Conceptual questions
  • Coding exercises
  • Exercise 1: First-visit Monte Carlo prediction
  • Exercise 2: TD(0) prediction
  • Exercise 3: Q-learning vs SARSA
  • Looking ahead
  • Further reading
  • Knowledge Check