Skip to main content
illumin8
Courses
Week 3: Dynamic Programming for Finite MDPs
Reinforcement Learning
01Week 1: Reinforcement Learning Problem Formulation
02Week 2: Multi-Armed Bandits
03Week 3: Dynamic Programming for Finite MDPs
04Week 4: Monte Carlo and Temporal-Difference Learning
05Week 5: Function Approximation in Reinforcement Learning
06Week 6: Deep Q-Learning and Variants
07Week 7: Policy Gradient and Actor–Critic Methods
08Week 8: Modern Deep Reinforcement Learning Algorithms
09Week 9: Exploration, Partial Observability, and Multi-Agent Reinforcement Learning
10Week 10: Model-Based Reinforcement Learning and Planning
11Week 11: Offline Reinforcement Learning
12Week 12: Reinforcement Learning from Human Feedback
13Week 13: Direct Preference Optimization and GRPO
14Week 14: Agentic Systems and Course Capstone
Week 3

Week 3: Dynamic Programming for Finite MDPs

✦Learning Outcomes
  • Implement policy evaluation, policy iteration, and value iteration algorithms
  • Derive and explain the contraction mapping property of Bellman operators
  • Analyze convergence properties of DP algorithms
  • Connect DP algorithms to model-free RLReinforcement Learning methods
◆Prerequisites
  • Week 1: MDPMarkov Decision Process formalization, Bellman equations, value functions, discount factor
  • Week 2: Regret decomposition, exploration-exploitation trade-off

Recommended: Review Week 1 sections on "Bellman equations" and "Value functions" before proceeding.

Purpose of this lecture

Here is the central question: If we knew everything about the environment, what is the fastest way to compute an optimal policy?

This is the domain of dynamic programming (DP). We will assume perfect knowledge—the transition dynamics PPP and reward function RRR are given. Under this idealized assumption, the problem has a clean solution: iterate the Bellman equations to convergence.

The payoff of studying DP is not that it solves real problems (it doesn't—real environments are unknown). Instead:

  1. DP defines the gold standard. DP algorithms are the exact solutions to the MDP problem. Every algorithm you study later is either a DP algorithm applied to an approximate model, or a sampled/approximate version of a DP algorithm.

  2. DP teaches algorithm design fundamentals. The progression from policy evaluation (full convergence) to policy iteration (two-step cycles) to value iteration (one-step updates) shows how to trade off computational cost against convergence guarantees.

  3. DP reveals what must change in model-free RL. The moment you remove the assumption of known PPP and RRR, everything else carries over. Understanding what changes and what doesn't is the key to understanding modern RL.

In Week 4, we will relax exactly one assumption: known dynamics. We will keep the Bellman equations, the value functions, the contraction property. We will only replace the model-based expectation ∑s′P(s′∣s,a)V(s′)\sum_{s'} P(s'|s,a)V(s')∑s′​P(s′∣s,a)V(s′) with samples from experience. This is the bridge from DP to model-free RL.


Setting and assumptions

Dynamic programming applies under the following assumptions:

  • Finite state space S\mathcal{S}S, finite action space A\mathcal{A}A
  • Known transition probabilities P(s′∣s,a)P(s' \mid s, a)P(s′∣s,a)
  • Known reward function R(s,a)R(s, a)R(s,a)
  • Discount factor γ∈[0,1)\gamma \in [0,1)γ∈[0,1)

Under these conditions, the Bellman equations from Week 1 become explicit computational algorithms. The key assumption distinguishing DP from everything that follows is the second and third: we know PPP and RRR. Model-free RLReinforcement Learning relaxes exactly this.

✦Intuition: Why knowing $P$ and $R$ is so powerful

Without PPP and RRR, computing the expected next-state value requires interacting with the environment — one sample at a time. With both in hand, you compute the exact expectation ∑s′P(s′∣s,a)V(s′)\sum_{s'} P(s'|s,a) V(s')∑s′​P(s′∣s,a)V(s′) analytically. This turns the Bellman equation from an estimation problem into an algebraic one: solve a system of ∣S∣|\mathcal{S}|∣S∣ linear equations and you have VπV^\piVπ exactly, with zero sampling error. The same computation that would require millions of environment interactions becomes a single matrix operation.


A worked example: the 4-state chain

Before developing the algorithms in full generality, we ground everything in a small concrete MDPMarkov Decision Process. Consider a linear chain with four states {s1,s2,s3,s4}\{s_1, s_2, s_3, s_4\}{s1​,s2​,s3​,s4​} and two actions {L,R}\{L, R\}{L,R} (move left, move right):

  • From any non-boundary state, action RRR moves right with probability 0.9 and stays with probability 0.1; action LLL moves left with probability 0.9 and stays with probability 0.1.
  • s1s_1s1​ and s4s_4s4​ are boundary states: all actions keep the agent in place.
  • Reward: R(s4,⋅)=+1R(s_4, \cdot) = +1R(s4​,⋅)=+1, R(s1,⋅)=−1R(s_1, \cdot) = -1R(s1​,⋅)=−1, all other transitions yield 000.
  • γ=0.9\gamma = 0.9γ=0.9.

We will use this MDPMarkov Decision Process to trace through policy evaluation, policy iteration, and value iteration concretely. The optimal policy is clearly "always go right" for s2s_2s2​ and s3s_3s3​, and the optimal values should reflect proximity to s4s_4s4​.

Two iterations of value iteration from V0=0V_0 = \mathbf{0}V0​=0:

Iteration 1 (V1V_1V1​): For s3s_3s3​ with action RRR:

V1(s3)=max⁡a[R(s3,a)+0.9∑s′P(s′∣s3,a)V0(s′)]=max⁡[0+0,  0+0]=0V_1(s_3) = \max_a \left[R(s_3, a) + 0.9\sum_{s'}P(s'|s_3,a)V_0(s')\right] = \max\left[0 + 0,\; 0 + 0\right] = 0V1​(s3​)=amax​[R(s3​,a)+0.9s′∑​P(s′∣s3​,a)V0​(s′)]=max[0+0,0+0]=0

All states get V1=0V_1 = 0V1​=0 since V0=0V_0 = 0V0​=0 and no immediate reward is earned except at boundaries.

Iteration 2 (V2V_2V2​): For s3s_3s3​ with action RRR:

V2(s3)=0+0.9[0.9⋅V1(s4)+0.1⋅V1(s3)]V_2(s_3) = 0 + 0.9\left[0.9 \cdot V_1(s_4) + 0.1 \cdot V_1(s_3)\right]V2​(s3​)=0+0.9[0.9⋅V1​(s4​)+0.1⋅V1​(s3​)]

Since V1(s4)V_1(s_4)V1​(s4​) requires revisiting: at s4s_4s4​, action RRR gives reward +1+1+1 and stays, so V1(s4)=1+0.9⋅0=1V_1(s_4) = 1 + 0.9 \cdot 0 = 1V1​(s4​)=1+0.9⋅0=1. Substituting:

V2(s3)=0.9[0.9⋅1+0.1⋅0]=0.81V_2(s_3) = 0.9\left[0.9 \cdot 1 + 0.1 \cdot 0\right] = 0.81V2​(s3​)=0.9[0.9⋅1+0.1⋅0]=0.81

For s2s_2s2​ with action RRR:

V2(s2)=0.9[0.9⋅V1(s3)+0.1⋅V1(s2)]=0.9[0+0]=0V_2(s_2) = 0.9\left[0.9 \cdot V_1(s_3) + 0.1 \cdot V_1(s_2)\right] = 0.9[0 + 0] = 0V2​(s2​)=0.9[0.9⋅V1​(s3​)+0.1⋅V1​(s2​)]=0.9[0+0]=0

To determine the greedy action at s2s_2s2​ under V2V_2V2​, compare both actions:

Q2(s2,R)=0+0.9[0.9⋅V1(s3)+0.1⋅V1(s2)]=0Q_2(s_2, R) = 0 + 0.9\left[0.9 \cdot V_1(s_3) + 0.1 \cdot V_1(s_2)\right] = 0Q2​(s2​,R)=0+0.9[0.9⋅V1​(s3​)+0.1⋅V1​(s2​)]=0 Q2(s2,L)=0+0.9[0.9⋅V1(s1)+0.1⋅V1(s2)]=0.9[0.9⋅(−1)+0]=−0.81Q_2(s_2, L) = 0 + 0.9\left[0.9 \cdot V_1(s_1) + 0.1 \cdot V_1(s_2)\right] = 0.9\left[0.9 \cdot (-1) + 0\right] = -0.81Q2​(s2​,L)=0+0.9[0.9⋅V1​(s1​)+0.1⋅V1​(s2​)]=0.9[0.9⋅(−1)+0]=−0.81

So π2greedy(s2)=R\pi_2^{\text{greedy}}(s_2) = Rπ2greedy​(s2​)=R (since 0>−0.810 > {-0.81}0>−0.81). And from above, Q2(s3,R)=0.81>Q2(s3,L)=0Q_2(s_3, R) = 0.81 > Q_2(s_3, L) = 0Q2​(s3​,R)=0.81>Q2​(s3​,L)=0, so π2greedy(s3)=R\pi_2^{\text{greedy}}(s_3) = Rπ2greedy​(s3​)=R. The optimal policy "always go right" is first recovered at iteration 2.

After two iterations, the value has propagated one step from the boundary. After kkk iterations, the value has propagated kkk steps. This is the backup structure: value information flows backward from high-reward states through the graph.

✦Intuition: The backup as a wave

Think of value information as a wave that starts at high-reward states and propagates backward. At k=0k=0k=0, the wave hasn't started — all values are zero. At k=1k=1k=1, the wave front has reached states one step from a reward. At k=2k=2k=2, it reaches states two steps away. After kkk iterations, the value has propagated exactly kkk steps through the graph. This is why convergence is geometric: each iteration carries the wave one step further, and the γk\gamma^kγk factor ensures that distant future rewards contribute exponentially less — the wave attenuates as it travels.


Policy evaluation

Given a fixed policy π\piπ, policy evaluation computes its state-value function VπV^\piVπ — the expected discounted return from each state under π\piπ.

Bellman expectation equation

For all s∈Ss \in \mathcal{S}s∈S:

Vπ(s)=∑aπ(a∣s)[R(s,a)+γ∑s′P(s′∣s,a) Vπ(s′)]V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s') \right]Vπ(s)=a∑​π(a∣s)[R(s,a)+γs′∑​P(s′∣s,a)Vπ(s′)]

This is a system of ∣S∣|\mathcal{S}|∣S∣ linear equations in ∣S∣|\mathcal{S}|∣S∣ unknowns. For small state spaces, it can be solved directly by matrix inversion. For large state spaces, iterative methods are used.

The Bellman expectation operator

Define the Bellman expectation operator TπT^\piTπ:

(TπV)(s)=∑aπ(a∣s)[R(s,a)+γ∑s′P(s′∣s,a)V(s′)](T^\pi V)(s) = \sum_a \pi(a \mid s)\left[R(s,a) + \gamma\sum_{s'} P(s'|s,a)V(s')\right](TπV)(s)=a∑​π(a∣s)[R(s,a)+γs′∑​P(s′∣s,a)V(s′)]

TπT^\piTπ is a γ\gammaγ-contraction in ℓ∞\ell_\inftyℓ∞​: for any two value functions V,WV, WV,W:

∥TπV−TπW∥∞≤γ∥V−W∥∞\|T^\pi V - T^\pi W\|_\infty \leq \gamma \|V - W\|_\infty∥TπV−TπW∥∞​≤γ∥V−W∥∞​

The proof is identical to the optimality operator case from Week 1: the γ\gammaγ factor in front of the expectation over next states is what gives the contraction mapping property. By the Banach fixed-point theorem, VπV^\piVπ is the unique fixed point of TπT^\piTπ, and iterative application from any initialization converges to it geometrically.

✦Intuition: Why contraction guarantees convergence

A contraction mapping shrinks distances: if you start with two value functions VVV and WWW far apart, one application of TπT^\piTπ brings them within γ\gammaγ of each other in sup-norm. Apply it again, and they're within γ2\gamma^2γ2. The only point two values can agree on after infinite squeezing is the unique fixed point VπV^\piVπ. This is why the initial guess V0V_0V0​ doesn't matter — no matter how wrong your starting estimate, the contraction drags every iterate toward the same answer at rate γ\gammaγ per step. For γ=0.9\gamma = 0.9γ=0.9, you gain one decimal digit of accuracy roughly every 22 iterations (0.922≈0.10.9^{22} \approx 0.10.922≈0.1).

Iterative policy evaluation

Vk+1(s)←∑aπ(a∣s)[R(s,a)+γ∑s′P(s′∣s,a) Vk(s′)]V_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_k(s') \right]Vk+1​(s)←a∑​π(a∣s)[R(s,a)+γs′∑​P(s′∣s,a)Vk​(s′)]

Starting from any V0V_0V0​ (e.g., V0=0V_0 = \mathbf{0}V0​=0), repeated application converges to VπV^\piVπ because TπT^\piTπ is a γ\gammaγ-contraction. The error after kkk iterations satisfies ∥Vk−Vπ∥∞≤γk∥V0−Vπ∥∞\|V_k - V^\pi\|_\infty \leq \gamma^k \|V_0 - V^\pi\|_\infty∥Vk​−Vπ∥∞​≤γk∥V0​−Vπ∥∞​ — geometric convergence at rate γ\gammaγ.

Backup diagram: at each update, the value of state sss is computed by looking one step ahead — summing over all actions under π\piπ, then over all next states under PPP, and backing up the discounted next-state values. Information flows from future states into the current estimate.

◆Backup diagram: Policy evaluation

At state sss, the update fans out over all actions under π\piπ, then over all next states under PPP, and collapses back to a single discounted value. The root (top) is sss; the branches below are action nodes weighted by π(a∣s)\pi(a|s)π(a∣s); the leaves are next states s′s's′ weighted by P(s′∣s,a)P(s'|s,a)P(s′∣s,a). Information flows up (from future leaves to current root): this is the expectation backup, a weighted average over all reachable branches. Compare with the value iteration backup below, which replaces the weighted average with a max.


Q-functions in dynamic programming

The lecture so far has developed VπV^\piVπ and V∗V^*V∗. It is equally important to develop the action-value function (Q-function) in the DP setting, because Q-learning — the most widely used model-free algorithm — is value iteration applied to Q∗Q^*Q∗.

Bellman equations for QπQ^\piQπ and Q∗Q^*Q∗

The Bellman expectation equation for QπQ^\piQπ:

Qπ(s,a)=R(s,a)+γ∑s′P(s′∣s,a)∑a′π(a′∣s′) Qπ(s′,a′)Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s')\, Q^\pi(s',a')Qπ(s,a)=R(s,a)+γs′∑​P(s′∣s,a)a′∑​π(a′∣s′)Qπ(s′,a′)

The Bellman optimality equation for Q∗Q^*Q∗:

Q∗(s,a)=R(s,a)+γ∑s′P(s′∣s,a)max⁡a′Q∗(s′,a′)Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')Q∗(s,a)=R(s,a)+γs′∑​P(s′∣s,a)a′max​Q∗(s′,a′)

Relationship between V and Q

Vπ(s)=∑aπ(a∣s) Qπ(s,a)V∗(s)=max⁡aQ∗(s,a)V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s,a) \qquad V^*(s) = \max_a Q^*(s,a)Vπ(s)=a∑​π(a∣s)Qπ(s,a)V∗(s)=amax​Q∗(s,a) Qπ(s,a)=R(s,a)+γ∑s′P(s′∣s,a) Vπ(s′)Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^\pi(s')Qπ(s,a)=R(s,a)+γs′∑​P(s′∣s,a)Vπ(s′)

Why Q∗Q^*Q∗ is the target for model-free RLReinforcement Learning

If you know V∗V^*V∗, extracting the optimal policy still requires knowing PPP:

π∗(s)=arg⁡max⁡a[R(s,a)+γ∑s′P(s′∣s,a)V∗(s′)]\pi^*(s) = \arg\max_a \left[R(s,a) + \gamma\sum_{s'} P(s'|s,a) V^*(s')\right]π∗(s)=argamax​[R(s,a)+γs′∑​P(s′∣s,a)V∗(s′)]

If you know Q∗Q^*Q∗, you do not need PPP:

π∗(s)=arg⁡max⁡aQ∗(s,a)\pi^*(s) = \arg\max_a Q^*(s,a)π∗(s)=argamax​Q∗(s,a)
◆Why Not target $V^*$ in model-free settings?

Knowing V∗V^*V∗ is not enough to act. Extracting the optimal action still requires arg⁡max⁡a[R(s,a)+γ∑s′P(s′∣s,a) V∗(s′)]\arg\max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s')]argmaxa​[R(s,a)+γ∑s′​P(s′∣s,a)V∗(s′)] — a one-step lookahead that depends on PPP. Without PPP, you cannot perform this computation. Q∗Q^*Q∗ sidesteps this entirely: the optimal action is arg⁡max⁡aQ∗(s,a)\arg\max_a Q^*(s,a)argmaxa​Q∗(s,a), a simple table lookup that requires no model. This is the core reason Q-learning, DQN, and SAC all target QQQ rather than VVV: Q∗Q^*Q∗ encodes the model-free action choice directly in the function itself.

This is the key reason Q-learning targets Q∗Q^*Q∗ rather than V∗V^*V∗: in the model-free setting where PPP is unknown, Q∗Q^*Q∗ gives you the optimal policy directly from experience. The Bellman optimality equation for Q∗Q^*Q∗ becomes the Q-learning update rule once we replace the exact model-based expectation with a sampled transition.


Policy improvement

Once we know VπV^\piVπ, we can improve the policy.

Definition

Given VπV^\piVπ, define the improved policy:

π′(s)=arg⁡max⁡a[R(s,a)+γ∑s′P(s′∣s,a) Vπ(s′)]=arg⁡max⁡a Qπ(s,a)\pi'(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s') \right] = \arg\max_a\, Q^\pi(s,a)π′(s)=argamax​[R(s,a)+γs′∑​P(s′∣s,a)Vπ(s′)]=argamax​Qπ(s,a)

π′\pi'π′ is the greedy policy with respect to VπV^\piVπ: at each state, take the action whose one-step lookahead value is highest.

✦Intuition: Greedy improvement as a one-step lookahead

The greedy policy doesn't re-plan from scratch — it asks a single question at each state: "Given that I already know the value of every next state under π\piπ, which action leads to the best expected outcome?" This is a one-step calculation that uses VπV^\piVπ as a read-only lookup table. The policy improvement theorem says that this single step of lookahead is always sufficient to improve (or maintain) performance. You don't need to simulate entire trajectories; the value function has already summarized all future consequences.

Policy improvement theorem: proof

Theorem: Vπ′(s)≥Vπ(s)V^{\pi'}(s) \geq V^\pi(s)Vπ′(s)≥Vπ(s) for all s∈Ss \in \mathcal{S}s∈S.

Proof: By the definition of π′\pi'π′ as greedy with respect to VπV^\piVπ:

Vπ(s)≤Qπ(s,π′(s))=R(s,π′(s))+γ∑s′P(s′∣s,π′(s)) Vπ(s′)V^\pi(s) \leq Q^\pi(s, \pi'(s)) = R(s,\pi'(s)) + \gamma\sum_{s'} P(s'|s,\pi'(s))\, V^\pi(s')Vπ(s)≤Qπ(s,π′(s))=R(s,π′(s))+γs′∑​P(s′∣s,π′(s))Vπ(s′)

The inequality holds because π′(s)\pi'(s)π′(s) maximizes the right-hand side over all actions, so it is at least as large as Vπ(s)V^\pi(s)Vπ(s) (which is the π\piπ-weighted average, not the maximum). Applying the same argument to Vπ(s′)V^\pi(s')Vπ(s′):

Vπ(s)≤R(s,π′(s))+γ∑s′P(s′∣s,π′(s))[R(s′,π′(s′))+γ∑s′′P(s′′∣s′,π′(s′)) Vπ(s′′)]V^\pi(s) \leq R(s,\pi'(s)) + \gamma\sum_{s'} P(s'|s,\pi'(s)) \left[R(s',\pi'(s')) + \gamma\sum_{s''} P(s''|s',\pi'(s'))\, V^\pi(s'')\right]Vπ(s)≤R(s,π′(s))+γs′∑​P(s′∣s,π′(s))[R(s′,π′(s′))+γs′′∑​P(s′′∣s′,π′(s′))Vπ(s′′)]

Unrolling this recursion over all future timesteps under π′\pi'π′ gives:

Vπ(s)≤Eπ′[∑t=0∞γtrt+1  |  s0=s]=Vπ′(s)V^\pi(s) \leq \mathbb{E}_{\pi'}\left[\sum_{t=0}^\infty \gamma^t r_{t+1} \;\middle|\; s_0 = s\right] = V^{\pi'}(s)Vπ(s)≤Eπ′​[t=0∑∞​γtrt+1​​s0​=s]=Vπ′(s)

□\square□

When does policy iteration terminate?

At termination, π′=π\pi' = \piπ′=π, meaning π\piπ is already greedy with respect to its own value function:

Vπ(s)=max⁡a[R(s,a)+γ∑s′P(s′∣s,a) Vπ(s′)]∀sV^\pi(s) = \max_a \left[R(s,a) + \gamma\sum_{s'} P(s'|s,a)\, V^\pi(s')\right] \quad \forall sVπ(s)=amax​[R(s,a)+γs′∑​P(s′∣s,a)Vπ(s′)]∀s

This is exactly the Bellman optimality equation. Therefore Vπ=V∗V^\pi = V^*Vπ=V∗ and π=π∗\pi = \pi^*π=π∗. Policy iteration terminates at and only at the optimal policy.


Policy iteration

Policy iteration alternates policy evaluation and policy improvement until convergence.

Algorithm

  1. Initialize policy π0\pi_0π0​ arbitrarily (e.g., uniform random)
  2. For k=0,1,2,…k = 0, 1, 2, \ldotsk=0,1,2,…:
    • Evaluate: compute VπkV^{\pi_k}Vπk​ by iterative policy evaluation (or direct solve)
    • Improve: set πk+1(s)=arg⁡max⁡aQπk(s,a)\pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s,a)πk+1​(s)=argmaxa​Qπk​(s,a) for all sss
  3. Stop when πk+1=πk\pi_{k+1} = \pi_kπk+1​=πk​

Convergence argument

The sequence {Vπk}\{V^{\pi_k}\}{Vπk​} is monotonically non-decreasing by the policy improvement theorem. The number of deterministic policies in a finite MDPMarkov Decision Process is ∣A∣∣S∣|\mathcal{A}|^{|\mathcal{S}|}∣A∣∣S∣ — finite. Therefore the sequence of policies must eventually cycle, but since it is non-decreasing in value, it cannot cycle through distinct policies. It must reach a fixed point, which we showed above is π∗\pi^*π∗.

In practice, policy iteration typically converges in far fewer iterations than ∣A∣∣S∣|\mathcal{A}|^{|\mathcal{S}|}∣A∣∣S∣ — often fewer than 10–20 iterations even for moderately large MDPs — because each improvement step makes substantial progress.

◆Diagram: Policy iteration cycle

Policy iteration alternates between two phases. Evaluate: iterate the Bellman expectation operator until VπkV^{\pi_k}Vπk​ converges (inner loop). Improve: set πk+1(s)=arg⁡max⁡aQπk(s,a)\pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s,a)πk+1​(s)=argmaxa​Qπk​(s,a) for all states (one pass). The improved policy enters the next evaluation phase, and so on. Convergence occurs when the improve step produces no change — meaning π\piπ is already greedy with respect to its own value function, which is the Bellman optimality equation. The cycle always terminates because values are non-decreasing and the policy space is finite.

Properties

  • Convergence to π∗\pi^*π∗ is guaranteed
  • Typically converges in very few outer iterations
  • Each iteration requires solving (or approximately solving) a linear system of size ∣S∣|\mathcal{S}|∣S∣
  • For large state spaces, the inner evaluation is the computational bottleneck

Implementation

import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8, max_iter=1000):
    """
    Policy Iteration for finite MDPs.

    Args:
        P: Transition probabilities P[s][a] = [(prob, next_state, reward), ...]
           For deterministic transitions: P[s][a] = [(1.0, s', r)]
        R: Reward function R[s][a] = expected_reward
        gamma: Discount factor (0 < gamma < 1)
        theta: Convergence threshold for value function
        max_iter: Maximum number of outer iterations

    Returns:
        V: Optimal value function V[s]
        pi: Optimal policy pi[s] = best action
        history: List of (V, pi) tuples for visualization
    """
    n_states = len(P)
    n_actions = len(P[0])

    # Step 1: Initialize policy arbitrarily (uniform random)
    pi = np.ones((n_states, n_actions)) / n_actions
    history = [(np.zeros(n_states), pi.copy())]

    for iteration in range(max_iter):
        # === POLICY EVALUATION ===
        # Solve V^pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V^pi(s')]
        V = np.zeros(n_states)

        for _ in range(max_iter):  # Inner loop for iterative evaluation
            V_new = np.zeros(n_states)
            for s in range(n_states):
                q_values = np.zeros(n_actions)
                for a in range(n_actions):
                    # Q(s, a) = sum_{s'} P(s'|s,a) * [R + gamma * V(s')]
                    q_sum = 0
                    for prob, next_s, reward in P[s][a]:
                        q_sum += prob * (reward + gamma * V[next_s])
                    q_values[a] = q_sum

                # V(s) = sum_a pi(a|s) * Q(s, a)
                V_new[s] = np.sum(pi[s] * q_values)

            # Check for convergence (delta = max |V_new - V|)
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < theta:
                break

        # === POLICY IMPROVEMENT ===
        # For each state, find the action with highest Q-value
        q_table = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                q_sum = 0
                for prob, next_s, reward in P[s][a]:
                    q_sum += prob * (reward + gamma * V[next_s])
                q_table[s, a] = q_sum

        # Greedy policy: pi_new[s] = argmax_a Q(s, a)
        new_pi = np.zeros((n_states, n_actions))
        best_actions = np.argmax(q_table, axis=1)
        for s in range(n_states):
            new_pi[s, best_actions[s]] = 1.0

        # Check for convergence (policy unchanged)
        history.append((V.copy(), new_pi.copy()))
        if np.array_equal(new_pi, pi):
            print(f"Converged after {iteration + 1} iterations")
            break

        pi = new_pi

    return V, pi, history

# Example: 4-state chain MDP
# States: 0=start, 1=mid-left, 2=mid-right, 3=goal
# Actions: 0=left, 1=right
# Terminal states: 0 and 3 have self-loops with terminal rewards

P = {  # Deterministic transitions
    0: {0: [(1.0, 0, -1)], 1: [(1.0, 0, -1)]},  # Terminal: -1 reward
    1: {0: [(1.0, 0, 0)], 1: [(1.0, 2, 0)]},      # Left->0, Right->2
    2: {0: [(1.0, 1, 0)], 1: [(1.0, 3, 1)]},      # Left->1, Right->3 (goal!)
    3: {0: [(1.0, 3, 1)], 1: [(1.0, 3, 1)]},      # Terminal: +1 reward
}

V, pi, history = policy_iteration(P, None, gamma=0.9)
print(f"Optimal Value Function: {V}")
print(f"Optimal Policy: {pi.argmax(axis=1)}")  # 0=left, 1=right

Key implementation details:

  1. State representation: We enumerate states as integers 0, 1, 2, ..., n-1
  2. Transition format: P[s][a] is a list of (prob, next_state, reward) tuples
  3. Policy representation: pi[s] is a probability distribution over actions
  4. Convergence check: Policy iteration converges when πk+1=πk\pi_{k+1} = \pi_kπk+1​=πk​

Value iteration

Value iteration collapses the evaluation-improvement cycle into a single Bellman optimality update applied repeatedly.

Update rule

Vk+1(s)=max⁡a[R(s,a)+γ∑s′P(s′∣s,a) Vk(s′)]=(TVk)(s)V_{k+1}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_k(s') \right] = (TV_k)(s)Vk+1​(s)=amax​[R(s,a)+γs′∑​P(s′∣s,a)Vk​(s′)]=(TVk​)(s)

where TTT is the Bellman optimality operator. Unlike policy evaluation, which averages over the policy's action distribution, value iteration takes the maximum over all actions at every update.

✦Intuition: Why max makes VI faster than PI

Policy iteration evaluates π\piπ to full convergence — often hundreds of inner iterations — before taking a single improvement step. Value iteration skips this entirely: it applies the max-action Bellman operator immediately at every update, never committing to an explicit policy between iterations. Each VI sweep does in one pass what PI does in two nested loops. The tradeoff is that VI doesn't track a policy at intermediate steps, so you can't inspect which policy is being implicitly followed — but since the values converge to V∗V^*V∗, the greedy extraction at the end yields π∗\pi^*π∗.

Convergence

TTT is a γ\gammaγ-contraction:

∥TV−TW∥∞≤γ∥V−W∥∞\|TV - TW\|_\infty \leq \gamma\|V - W\|_\infty∥TV−TW∥∞​≤γ∥V−W∥∞​

By the Banach fixed-point theorem, Vk→V∗V_k \to V^*Vk​→V∗ from any initialization, with error ∥Vk−V∗∥∞≤γk∥V0−V∗∥∞\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty∥Vk​−V∗∥∞​≤γk∥V0​−V∗∥∞​. Once VkV_kVk​ has converged to within ϵ\epsilonϵ of V∗V^*V∗, the greedy policy πk(s)=arg⁡max⁡aQk(s,a)\pi_k(s) = \arg\max_a Q_k(s,a)πk​(s)=argmaxa​Qk​(s,a) is ϵ/(1−γ)\epsilon/(1-\gamma)ϵ/(1−γ)-optimal.

Implementation

import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8, max_iter=1000):
    """
    Value Iteration for finite MDPs.

    Args:
        P: Transition probabilities P[s][a] = [(prob, next_state, reward), ...]
        R: Reward function R[s][a] = expected_reward
        gamma: Discount factor (0 < gamma < 1)
        theta: Convergence threshold
        max_iter: Maximum iterations

    Returns:
        V: Optimal value function
        pi: Optimal policy (extracted greedily from V)
        history: List of V values for visualization
    """
    n_states = len(P)
    n_actions = len(P[0])

    # Initialize value function arbitrarily
    V = np.zeros(n_states)
    history = [V.copy()]

    for k in range(max_iter):
        delta = 0  # Track maximum change

        # For each state, apply the Bellman optimality operator
        V_new = np.zeros(n_states)
        for s in range(n_states):
            q_values = np.zeros(n_actions)
            for a in range(n_actions):
                # Q(s, a) = sum_{s'} P(s'|s,a) * [R + gamma * V(s')]
                q_sum = 0
                for prob, next_s, reward in P[s][a]:
                    q_sum += prob * (reward + gamma * V[next_s])
                q_values[a] = q_sum

            # V(s) = max_a Q(s, a)  <-- Key difference from policy evaluation!
            V_new[s] = np.max(q_values)
            delta = max(delta, abs(V_new[s] - V[s]))

        V = V_new
        history.append(V.copy())

        # Check convergence: ||V_k+1 - V_k||_inf < theta
        if delta < theta:
            print(f"Converged after {k + 1} iterations (delta={delta:.2e})")
            break

    # Extract greedy policy from converged value function
    pi = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q_values = np.zeros(n_actions)
        for a in range(n_actions):
            q_sum = 0
            for prob, next_s, reward in P[s][a]:
                q_sum += prob * (reward + gamma * V[next_s])
            q_values[a] = q_sum
        best_action = np.argmax(q_values)
        pi[s, best_action] = 1.0

    return V, pi, history

# Example: FrozenLake-like 4x4 grid
# States: 0-15 (row-major order)
# Actions: 0=up, 1=down, 2=left, 3=right

def create_grid_mdp(size=4, goal_state=15, hole_states=None, slip_prob=0.0):
    """Create a gridworld MDP."""
    if hole_states is None:
        hole_states = [5, 7, 11]  # Classic holes

    n_states = size * size
    n_actions = 4

    # Initialize transition dictionary
    P = {s: {a: [] for a in range(n_actions)} for s in range(n_states)}

    for s in range(n_states):
        for a in range(n_actions):
            if s == goal_state or s in hole_states:
                # Terminal/hole states: stay in place
                reward = 1 if s == goal_state else 0
                P[s][a] = [(1.0, s, reward)]
            else:
                # Compute next state based on action
                row, col = s // size, s % size
                if a == 0: row = max(0, row - 1)      # up
                elif a == 1: row = min(size - 1, row + 1)  # down
                elif a == 2: col = max(0, col - 1)      # left
                else: col = min(size - 1, col + 1)     # right

                next_s = row * size + col
                reward = 1 if next_s == goal_state else 0
                P[s][a] = [(1.0, next_s, reward)]

    return P

# Run value iteration
P = create_grid_mdp(size=4)
V, pi, history = value_iteration(P, None, gamma=0.9)

print("Value Function (4x4 grid):")
for i in range(4):
    print([f"{V[i*4+j]:.2f}" for j in range(4)])

print("\nOptimal Policy (0=up, 1=down, 2=left, 3=right):")
for i in range(4):
    print([pi[i*4+j].argmax() for j in range(4)])

Key insight: Value iteration applies max_a at every update (line 22), while policy evaluation computes expected Q-values under the current policy. This single change is what makes VI converge in one sweep per iteration.

Computational complexity:

  • Policy iteration: O( iterations × |S|³ ) for direct solve, O( iterations × |S|² × |A| ) for iterative
  • Value iteration: O( iterations × |S|² × |A| )

For large state spaces, value iteration is often preferred because each iteration is cheaper, even though it may need more iterations.


Relationship to policy iteration

Value iteration is policy iteration with one-step policy evaluation before each improvement. Equivalently, it is applying TTT — which implicitly takes the greedy action — without first computing VπV^\piVπ for the current greedy policy. This means value iteration does not maintain an explicit policy between iterations; it simply drives VkV_kVk​ toward V∗V^*V∗ and extracts the policy at the end.

◆Backup diagram: Value iteration

The value iteration backup has the same tree structure as the policy evaluation backup (action nodes, then next-state nodes), but at the action level it takes a maximum instead of a π\piπ-weighted average. There is no policy weighting at the top: the root value is max⁡a\max_amaxa​ over all branches. This means no policy is needed to compute the update — the max implicitly selects the best action at each step. The backup still sums over next states under PPP (the stochastic transition), but the root is max⁡a\max_amaxa​ rather than Eπ\mathbb{E}_\piEπ​.


Generalized policy iteration

Both policy iteration and value iteration are special cases of a unifying framework: Generalized Policy Iteration (GPIGeneralised Policy Iteration).

GPI describes any algorithm that maintains both a value function VVV and a policy π\piπ, and alternates between:

  • Evaluation steps: make VVV more consistent with π\piπ (partial or full)
  • Improvement steps: make π\piπ more greedy with respect to VVV

The key theorem is that any interleaving of partial evaluation and greedy improvement converges to V∗V^*V∗ and π∗\pi^*π∗, as long as both processes run and neither is stopped prematurely. The special cases are:

| Algorithm | Evaluation steps per cycle | Improvement steps per cycle | |---|---|---| | Policy iteration | Full convergence of VπkV^{\pi_k}Vπk​ | One full greedy pass | | Value iteration | One Bellman update | One full greedy pass (implicit in max⁡\maxmax) | | Actor-critic (deep RLReinforcement Learning) | nnn TDTemporal Difference steps (partial) | One gradient step (partial) |

✦Intuition: GPI unifies DP and modern RL

GPI says evaluation and improvement don't need to be exact or alternating — they just both need to run. Policy iteration uses full evaluation and full improvement. Value iteration uses one-Bellman-update evaluation and implicit improvement via max. Actor-critic uses one-TD-step evaluation and one-gradient-step improvement at every transition. The remarkable result is that all these schedules converge to the same V∗V^*V∗ and π∗\pi^*π∗ — the particular ratio of evaluation to improvement steps changes the convergence speed but not the fixed point. Understanding this unification means you can reason about any RL algorithm by asking: "What is its evaluation step? What is its improvement step? How partial is each?"

GPI is the conceptual framework that connects DP to deep RLReinforcement Learning. An actor-critic algorithm is GPI with function approximation: the critic performs partial policy evaluation using TDTemporal Difference learning (Week 4), and the actor performs partial policy improvement using a policy gradient step. The evaluation and improvement are no longer exact — they are stochastic, approximate, and interleaved at every timestep — but the GPI structure is preserved.


Asynchronous and prioritized DP

Standard (synchronous) DP sweeps through all states in every iteration. This is impractical when ∣S∣|\mathcal{S}|∣S∣ is large. Two important extensions relax this:

Asynchronous DP

Asynchronous DP updates states in any order, not necessarily all at once. The only requirement is that every state is updated infinitely often (in the limit). Asynchronous updates converge to V∗V^*V∗ under the same contraction argument — each update moves the updated state's value closer to V∗V^*V∗, and the contraction ensures the rest of the value function follows.

In practice, asynchronous DP enables:

  • focusing computation on states that are frequently visited,
  • real-time DP (updating states as they are visited by the agent),
  • and efficient implementation on non-uniform state spaces.

Prioritized sweeping

prioritized sweeping selects states for update based on the magnitude of their Bellman error:

δ(s)=∣V(s)−max⁡a[R(s,a)+γ∑s′P(s′∣s,a)V(s′)]∣\delta(s) = \left| V(s) - \max_a\left[R(s,a) + \gamma\sum_{s'} P(s'|s,a) V(s')\right]\right|δ(s)=​V(s)−amax​[R(s,a)+γs′∑​P(s′∣s,a)V(s′)]​

States with large Bellman error have the most inaccurate value estimates and benefit most from being updated. Prioritized sweeping maintains a priority queue ordered by δ(s)\delta(s)δ(s) and updates the highest-priority state at each step, propagating updates backward through predecessor states.

Prioritized sweeping is the direct precursor to prioritized experience replay (PER) in deep RLReinforcement Learning: instead of sampling uniformly from a replay buffer, PER samples transitions with probability proportional to their TDTemporal Difference error — the model-free analog of δ(s)\delta(s)δ(s). The same intuition applies: transitions where the current value estimate is most wrong carry the most learning signal.


Why dynamic programming does not scale

Dynamic programming requires enumerating all states, all actions, and knowing exact transition dynamics. The resulting complexity is O(∣S∣2∣A∣)O(|\mathcal{S}|^2 |\mathcal{A}|)O(∣S∣2∣A∣) per sweep — quadratic in the state space due to the sum over next states s′s's′.

⚠What Breaks Here: The curse of dimensionality

The state space grows exponentially in the number of state variables. A robot with 12 joint angles discretized to 100 positions each has 10012=1024100^{12} = 10^{24}10012=1024 states — DP cannot enumerate them. The complexity per sweep is O(∣S∣2∣A∣)O(|\mathcal{S}|^2 |\mathcal{A}|)O(∣S∣2∣A∣); for 102410^{24}1024 states, a single full sweep is physically impossible.

This is not a limitation of the algorithm designer; it is a fundamental barrier. Exact DP requires enumerating all states and transitions, which is only tractable for small, discrete MDPs. For real-world problems — continuous control, vision-based tasks, language generation — an entirely different computational approach is required. This is why the rest of the course is not "DP for bigger problems."

GenAI context: the LLMLarge Language Model state space

Treating language generation as an MDPMarkov Decision Process, the state is the full context window. For an LLMLarge Language Model with vocabulary size ≈50,000\approx 50{,}000≈50,000 and context length 4,0004{,}0004,000 tokens:

∣S∣≈50,0004000|\mathcal{S}| \approx 50{,}000^{4000}∣S∣≈50,0004000

This number exceeds the number of atoms in the observable universe by an incomprehensible margin. Exact dynamic programming is fundamentally impossible for language models. This is not an engineering limitation — it is a mathematical one. The only viable approaches are:

  • Function approximation: represent VVV or QQQ as a neural network that generalizes across states, rather than a lookup table.
  • Sampling-based methods: estimate Bellman targets from experience rather than exact model-based sums.
  • Policy-based methods: represent and optimize π\piπ directly without computing VVV.

Every algorithm from Week 4 onward is a response to this impossibility. DP defines the ideal target; the rest of the course develops tractable approximations.


From DP to deep RLReinforcement Learning: the precise mappings

The "conceptual importance" claim — that DP underlies modern RLReinforcement Learning — can be made precise:

Q-learning is sampled value iteration on Q∗Q^*Q∗

Value iteration on Q∗Q^*Q∗:

Qk+1(s,a)←R(s,a)+γ∑s′P(s′∣s,a)max⁡a′Qk(s′,a′)Q_{k+1}(s,a) \leftarrow R(s,a) + \gamma\sum_{s'} P(s'|s,a) \max_{a'} Q_k(s',a')Qk+1​(s,a)←R(s,a)+γs′∑​P(s′∣s,a)a′max​Qk​(s′,a′)

Q-learning replaces the exact model-based expectation with a single sampled transition (s,a,r,s′)(s, a, r, s')(s,a,r,s′):

Q(s,a)←Q(s,a)+α[r+γmax⁡a′Q(s′,a′)−Q(s,a)]Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma\max_{a'} Q(s',a') - Q(s,a)\right]Q(s,a)←Q(s,a)+α[r+γa′max​Q(s′,a′)−Q(s,a)]

The bracketed term is the TDTemporal Difference error — the discrepancy between the current estimate and the one-step Bellman target. Q-learning is value iteration where the model-based expectation ∑s′P(s′∣s,a)(⋅)\sum_{s'} P(s'|s,a)(\cdot)∑s′​P(s′∣s,a)(⋅) is replaced by a Monte Carlo sample of a single next state s′s's′. Deep Q-Networks (DQNDeep Q-Network) further replace the tabular QQQ with a neural network.

Actor-critic is GPI with approximate evaluation and improvement

Policy iteration runs full policy evaluation before each improvement. Actor-critic methods run one TDTemporal Difference step of evaluation (the critic update) interleaved with one gradient step of improvement (the actor update) at every timestep. This is GPI with both steps truncated to a single stochastic gradient update. The critic approximates VπV^\piVπ or QπQ^\piQπ using TDTemporal Difference learning; the actor uses the critic's output to estimate the policy gradient and improve π\piπ.

Update rate balance matters: if the critic learns much faster than the actor, it converges to VπθV^{\pi_\theta}Vπθ​ for a nearly fixed policy — but must then re-track a new target every time the actor updates, causing value estimates to oscillate. If the actor learns much faster, policy gradients are computed from a stale critic, accumulating systematic bias that can push the policy in the wrong direction. Stable actor-critic training typically uses similar learning rates for both, or a frozen target network for the critic updated on a slower schedule.

RLHFReinforcement Learning from Human Feedback uses Bellman bootstrapping in the reward model training loop

RLHFReinforcement Learning from Human Feedback trains a reward model on preference data and then fine-tunes the language model using PPOProximal Policy Optimisation. PPOProximal Policy Optimisation is an actor-critic algorithm and uses TDTemporal Difference-style bootstrapped value estimates (the critic's VVV function) to compute advantage estimates. The Bellman structure — the critic is trained to be consistent with the Bellman expectation equation — is present even in the RLHFReinforcement Learning from Human Feedback setting, though the reward signal comes from a learned reward model rather than an analytic function.


Key takeaways

The lecture develops a progression from the ideal to the tractable. Policy evaluation solves for VπV^\piVπ exactly by iterating the Bellman expectation operator to its fixed point. Policy improvement replaces π\piπ with the greedy policy with respect to VπV^\piVπ, and the improvement theorem — proved by unrolling the greedy recursion — guarantees monotone improvement. Policy iteration alternates these two steps and converges to π∗\pi^*π∗ because the policy sequence is non-decreasing in value and the policy space is finite. Value iteration merges the two steps into a single Bellman optimality update, converging to V∗V^*V∗ by the contraction mapping property. Generalized policy iteration unifies both as special cases of a partial evaluation / partial improvement cycle — the same structure that underlies actor-critic methods in deep RLReinforcement Learning.

Q-functions are developed in parallel with V-functions and are the natural target for model-free RLReinforcement Learning because they encode the optimal policy without requiring PPP. Asynchronous DP and prioritized sweeping extend the framework to practical non-uniform computation and are the direct precursors of experience replay and PER. And the LLMLarge Language Model state-space calculation makes concrete why function approximation and sampling are not optional refinements but mathematical necessities.


Conceptual questions

  1. Apply two full iterations of value iteration to the 4-state chain example with γ=0.9\gamma = 0.9γ=0.9, starting from V0=0V_0 = \mathbf{0}V0​=0. Compute V2(s2)V_2(s_2)V2​(s2​) and V2(s3)V_2(s_3)V2​(s3​) explicitly. At which iteration does the greedy policy first become the optimal policy "always go right"?

  2. Prove that policy iteration terminates in finite time for a finite MDPMarkov Decision Process. Your proof should use the policy improvement theorem and a counting argument on the number of deterministic policies. Why does the same argument fail if the action space is continuous?

  3. Value iteration converges to V∗V^*V∗ but not necessarily to any policy's value function at intermediate iterates. Explain why the greedy policy πk(s)=arg⁡max⁡aQk(s,a)\pi_k(s) = \arg\max_a Q_k(s,a)πk​(s)=argmaxa​Qk​(s,a) may still be suboptimal even when ∥Vk−V∗∥∞\|V_k - V^*\|_\infty∥Vk​−V∗∥∞​ is small. What bound guarantees that πk\pi_kπk​ is near-optimal?

  4. Write out the Bellman optimality equation for Q∗(s,a)Q^*(s,a)Q∗(s,a) and identify exactly which term changes when moving from model-based value iteration to the model-free Q-learning update. What statistical assumption justifies replacing the expectation with a single sample?

  5. An actor-critic algorithm trains a critic Vϕ(s)V_\phi(s)Vϕ​(s) using TDTemporal Difference(0) and an actor πθ(a∣s)\pi_\theta(a|s)πθ​(a∣s) using policy gradient. Map each component onto the GPI framework: which step is the evaluation step, which is the improvement step, and in what sense are both steps "partial"? What goes wrong if the critic is updated much faster than the actor, or vice versa?

✦Solutions (conceptual questions)
  1. Both iterations start from V0=0V_0 = \mathbf{0}V0​=0, so V1=0V_1 = \mathbf{0}V1​=0 everywhere. Then V2(s3)=0.9(0.9⋅1+0.1⋅0)=0.81V_2(s_3) = 0.9(0.9\cdot 1 + 0.1\cdot 0) = 0.81V2​(s3​)=0.9(0.9⋅1+0.1⋅0)=0.81 and V2(s2)=0.9(0.9⋅0+0.1⋅0)=0V_2(s_2) = 0.9(0.9\cdot 0 + 0.1\cdot 0) = 0V2​(s2​)=0.9(0.9⋅0+0.1⋅0)=0. Comparing greedy actions, Q2(s2,R)=0>Q2(s2,L)=−0.81Q_2(s_2, R) = 0 > Q_2(s_2, L) = -0.81Q2​(s2​,R)=0>Q2​(s2​,L)=−0.81 and Q2(s3,R)=0.81>0Q_2(s_3, R) = 0.81 > 0Q2​(s3​,R)=0.81>0, so the optimal policy "always go right" is first recovered at iteration 2 (value propagates one step per iteration).
  2. A finite MDP has finitely many deterministic policies (∣A∣∣S∣|A|^{|S|}∣A∣∣S∣). The policy improvement theorem guarantees each step yields a policy that is no worse, and strictly better unless already optimal, so values increase monotonically and no policy repeats — a strictly improving sequence over a finite set must terminate. With continuous actions the counting argument fails: there are uncountably many policies, so no finite bound exists.
  3. Acting greedily w.r.t. an approximate VkV_kVk​ can select the wrong action wherever competing QQQ-values are close, so πk\pi_kπk​ may be suboptimal even when ∥Vk−V∗∥∞\|V_k - V^*\|_\infty∥Vk​−V∗∥∞​ is small. The guarantee is ∥Vπk−V∗∥∞≤2γ1−γ∥Vk−V∗∥∞\|V^{\pi_k} - V^*\|_\infty \le \frac{2\gamma}{1-\gamma}\|V_k - V^*\|_\infty∥Vπk​−V∗∥∞​≤1−γ2γ​∥Vk​−V∗∥∞​, so πk\pi_kπk​ is near-optimal once the value error is small relative to (1−γ)(1-\gamma)(1−γ).
  4. Q∗(s,a)=R(s,a)+γ∑s′P(s′∣s,a)max⁡a′Q∗(s′,a′)Q^*(s,a) = R(s,a) + \gamma\sum_{s'} P(s'\mid s,a)\max_{a'} Q^*(s',a')Q∗(s,a)=R(s,a)+γ∑s′​P(s′∣s,a)maxa′​Q∗(s′,a′). Going to Q-learning, the expectation ∑s′P(⋅)max⁡a′Q\sum_{s'} P(\cdot)\max_{a'} Q∑s′​P(⋅)maxa′​Q is replaced by a single sampled transition: target r+γmax⁡a′Q(s′,a′)r + \gamma\max_{a'} Q(s',a')r+γmaxa′​Q(s′,a′). This is justified by stochastic approximation — the sample is an unbiased estimate of the expectation, valid given sufficient exploration and a decaying step size.
  5. The TD(0) critic is the policy-evaluation step; the policy-gradient actor is the policy-improvement step. Both are "partial": the critic does one bootstrapped step rather than a full evaluation, and the actor takes a gradient nudge rather than a full greedy max. If the critic is far faster than the actor it tracks the policy well (stable but slow); if the actor is faster it improves against a stale, inaccurate value estimate, which destabilizes training — hence the two-timescale rule that the critic should adapt faster than the actor.

Coding exercises

Exercise 6: Asynchronous and prioritized variants. Implement two variants of value iteration for the 4-state chain MDP defined in the worked example:

  1. Asynchronous VI: on each sweep, update states in a randomly shuffled order instead of the fixed order s1,s2,s3,s4s_1, s_2, s_3, s_4s1​,s2​,s3​,s4​.
  2. Prioritized sweeping: maintain a max-heap ordered by Bellman error ∣δ(s)∣=∣V(s)−max⁡a[R(s,a)+γ∑s′P(s′∣s,a)V(s′)]∣|\delta(s)| = |V(s) - \max_a[R(s,a) + \gamma\sum_{s'} P(s'|s,a) V(s')]|∣δ(s)∣=∣V(s)−maxa​[R(s,a)+γ∑s′​P(s′∣s,a)V(s′)]∣. At each step, pop the highest-error state, update it, then recompute the Bellman error of its predecessor states and push them back into the heap.

For each variant, plot ∥Vk−V∗∥∞\|V_k - V^*\|_\infty∥Vk​−V∗∥∞​ versus total number of individual state updates (not full sweeps). Which variant converges fastest in terms of state updates? Does the advantage of prioritized sweeping grow or shrink as the chain length increases?

Extension prompt. Implement a discount annealing schedule: start with γ0=0.5\gamma_0 = 0.5γ0​=0.5 and linearly increase toward γ=0.9\gamma = 0.9γ=0.9 over the first 50 iterations of value iteration. Compare the convergence curve against standard value iteration with fixed γ=0.9\gamma = 0.9γ=0.9 on the 4-state chain. Does a low initial discount accelerate early convergence? Why might this trick be useful for problems with sparse or delayed rewards?


Looking ahead

The next lecture removes the assumption of known dynamics. We will study Monte Carlo and Temporal-Difference (TDTemporal Difference) learning, which estimate value functions directly from sampled experience. The key insight is that the Bellman equations can be used as regression targets rather than exact update rules — replacing the model-based expectation ∑s′P(s′∣s,a)V(s′)\sum_{s'} P(s'|s,a)V(s')∑s′​P(s′∣s,a)V(s′) with samples of V(s′)V(s')V(s′) from observed transitions. This replacement is what makes RLReinforcement Learning scale to problems where PPP is unknown or intractable.

Further reading

  • Sutton & Barto (2018). Reinforcement Learning: An Introduction, 2nd ed. Chapter 4: Dynamic Programming. (Standard reference — freely available at incompleteideas.net)
  • Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley. (The definitive mathematical treatment; Chapters 6–7 cover policy evaluation, policy iteration, and value iteration with full proofs.)
  • Bertsekas, D. P. (2012). Dynamic Programming and Optimal Control, Vol. I. Athena Scientific. (Advanced treatment covering asynchronous DP, convergence theory, and connections to approximate methods.)
  • Mnih, V. et al. (2015). "Human-level control through deep reinforcement learning." Nature, 518, 529–533. (The DQN paper — implements sampled value iteration on Q∗Q^*Q∗ with a neural network, directly realizing the DP-to-RL mapping developed in this lesson.)
  • Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press. (Pioneered Policy Iteration.)
← Previous
Week 2: Multi-Armed Bandits
Next →
Week 4: Monte Carlo and Temporal-Difference Learning
On this page
  • Purpose of this lecture
  • Setting and assumptions
  • A worked example: the 4-state chain
  • Policy evaluation
  • Bellman expectation equation
  • The Bellman expectation operator
  • Iterative policy evaluation
  • Q-functions in dynamic programming
  • Bellman equations for Q^\pi and Q^*
  • Relationship between V and Q
  • Why Q^* is the target for model-free RL
  • Policy improvement
  • Definition
  • Policy improvement theorem: proof
  • When does policy iteration terminate?
  • Policy iteration
  • Algorithm
  • Convergence argument
  • Properties
  • Implementation
  • Value iteration
  • Update rule
  • Convergence
  • Implementation
  • Relationship to policy iteration
  • Generalized policy iteration
  • Asynchronous and prioritized DP
  • Asynchronous DP
  • Prioritized sweeping
  • Why dynamic programming does not scale
  • GenAI context: the LLM state space
  • From DP to deep RL: the precise mappings
  • Q-learning is sampled value iteration on Q^*
  • Actor-critic is GPI with approximate evaluation and improvement
  • RLHF uses Bellman bootstrapping in the reward model training loop
  • Key takeaways
  • Conceptual questions
  • Coding exercises
  • Looking ahead
  • Further reading