Purpose of this lecture
Here is the central question: If we knew everything about the environment, what is the fastest way to compute an optimal policy?
This is the domain of dynamic programming (DP). We will assume perfect knowledge—the transition dynamics and reward function are given. Under this idealized assumption, the problem has a clean solution: iterate the Bellman equations to convergence.
The payoff of studying DP is not that it solves real problems (it doesn't—real environments are unknown). Instead:
-
DP defines the gold standard. DP algorithms are the exact solutions to the MDP problem. Every algorithm you study later is either a DP algorithm applied to an approximate model, or a sampled/approximate version of a DP algorithm.
-
DP teaches algorithm design fundamentals. The progression from policy evaluation (full convergence) to policy iteration (two-step cycles) to value iteration (one-step updates) shows how to trade off computational cost against convergence guarantees.
-
DP reveals what must change in model-free RL. The moment you remove the assumption of known and , everything else carries over. Understanding what changes and what doesn't is the key to understanding modern RL.
In Week 4, we will relax exactly one assumption: known dynamics. We will keep the Bellman equations, the value functions, the contraction property. We will only replace the model-based expectation with samples from experience. This is the bridge from DP to model-free RL.
Setting and assumptions
Dynamic programming applies under the following assumptions:
- Finite state space , finite action space
- Known transition probabilities
- Known reward function
- Discount factor
Under these conditions, the Bellman equations from Week 1 become explicit computational algorithms. The key assumption distinguishing DP from everything that follows is the second and third: we know and . Model-free RLReinforcement Learning relaxes exactly this.
A worked example: the 4-state chain
Before developing the algorithms in full generality, we ground everything in a small concrete MDPMarkov Decision Process. Consider a linear chain with four states and two actions (move left, move right):
- From any non-boundary state, action moves right with probability 0.9 and stays with probability 0.1; action moves left with probability 0.9 and stays with probability 0.1.
- and are boundary states: all actions keep the agent in place.
- Reward: , , all other transitions yield .
- .
We will use this MDPMarkov Decision Process to trace through policy evaluation, policy iteration, and value iteration concretely. The optimal policy is clearly "always go right" for and , and the optimal values should reflect proximity to .
Two iterations of value iteration from :
Iteration 1 (): For with action :
All states get since and no immediate reward is earned except at boundaries.
Iteration 2 (): For with action :
Since requires revisiting: at , action gives reward and stays, so . Substituting:
For with action :
To determine the greedy action at under , compare both actions:
So (since ). And from above, , so . The optimal policy "always go right" is first recovered at iteration 2.
After two iterations, the value has propagated one step from the boundary. After iterations, the value has propagated steps. This is the backup structure: value information flows backward from high-reward states through the graph.
Policy evaluation
Given a fixed policy , policy evaluation computes its state-value function — the expected discounted return from each state under .
Bellman expectation equation
For all :
This is a system of linear equations in unknowns. For small state spaces, it can be solved directly by matrix inversion. For large state spaces, iterative methods are used.
The Bellman expectation operator
Define the Bellman expectation operator :
is a -contraction in : for any two value functions :
The proof is identical to the optimality operator case from Week 1: the factor in front of the expectation over next states is what gives the contraction mapping property. By the Banach fixed-point theorem, is the unique fixed point of , and iterative application from any initialization converges to it geometrically.
Iterative policy evaluation
Starting from any (e.g., ), repeated application converges to because is a -contraction. The error after iterations satisfies — geometric convergence at rate .
Backup diagram: at each update, the value of state is computed by looking one step ahead — summing over all actions under , then over all next states under , and backing up the discounted next-state values. Information flows from future states into the current estimate.
Q-functions in dynamic programming
The lecture so far has developed and . It is equally important to develop the action-value function (Q-function) in the DP setting, because Q-learning — the most widely used model-free algorithm — is value iteration applied to .
Bellman equations for and
The Bellman expectation equation for :
The Bellman optimality equation for :
Relationship between V and Q
Why is the target for model-free RLReinforcement Learning
If you know , extracting the optimal policy still requires knowing :
If you know , you do not need :
This is the key reason Q-learning targets rather than : in the model-free setting where is unknown, gives you the optimal policy directly from experience. The Bellman optimality equation for becomes the Q-learning update rule once we replace the exact model-based expectation with a sampled transition.
Policy improvement
Once we know , we can improve the policy.
Definition
Given , define the improved policy:
is the greedy policy with respect to : at each state, take the action whose one-step lookahead value is highest.
Policy improvement theorem: proof
Theorem: for all .
Proof: By the definition of as greedy with respect to :
The inequality holds because maximizes the right-hand side over all actions, so it is at least as large as (which is the -weighted average, not the maximum). Applying the same argument to :
Unrolling this recursion over all future timesteps under gives:
When does policy iteration terminate?
At termination, , meaning is already greedy with respect to its own value function:
This is exactly the Bellman optimality equation. Therefore and . Policy iteration terminates at and only at the optimal policy.
Policy iteration
Policy iteration alternates policy evaluation and policy improvement until convergence.
Algorithm
- Initialize policy arbitrarily (e.g., uniform random)
- For :
- Evaluate: compute by iterative policy evaluation (or direct solve)
- Improve: set for all
- Stop when
Convergence argument
The sequence is monotonically non-decreasing by the policy improvement theorem. The number of deterministic policies in a finite MDPMarkov Decision Process is — finite. Therefore the sequence of policies must eventually cycle, but since it is non-decreasing in value, it cannot cycle through distinct policies. It must reach a fixed point, which we showed above is .
In practice, policy iteration typically converges in far fewer iterations than — often fewer than 10–20 iterations even for moderately large MDPs — because each improvement step makes substantial progress.
Properties
- Convergence to is guaranteed
- Typically converges in very few outer iterations
- Each iteration requires solving (or approximately solving) a linear system of size
- For large state spaces, the inner evaluation is the computational bottleneck
Implementation
import numpy as np
def policy_iteration(P, R, gamma=0.9, theta=1e-8, max_iter=1000):
"""
Policy Iteration for finite MDPs.
Args:
P: Transition probabilities P[s][a] = [(prob, next_state, reward), ...]
For deterministic transitions: P[s][a] = [(1.0, s', r)]
R: Reward function R[s][a] = expected_reward
gamma: Discount factor (0 < gamma < 1)
theta: Convergence threshold for value function
max_iter: Maximum number of outer iterations
Returns:
V: Optimal value function V[s]
pi: Optimal policy pi[s] = best action
history: List of (V, pi) tuples for visualization
"""
n_states = len(P)
n_actions = len(P[0])
# Step 1: Initialize policy arbitrarily (uniform random)
pi = np.ones((n_states, n_actions)) / n_actions
history = [(np.zeros(n_states), pi.copy())]
for iteration in range(max_iter):
# === POLICY EVALUATION ===
# Solve V^pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V^pi(s')]
V = np.zeros(n_states)
for _ in range(max_iter): # Inner loop for iterative evaluation
V_new = np.zeros(n_states)
for s in range(n_states):
q_values = np.zeros(n_actions)
for a in range(n_actions):
# Q(s, a) = sum_{s'} P(s'|s,a) * [R + gamma * V(s')]
q_sum = 0
for prob, next_s, reward in P[s][a]:
q_sum += prob * (reward + gamma * V[next_s])
q_values[a] = q_sum
# V(s) = sum_a pi(a|s) * Q(s, a)
V_new[s] = np.sum(pi[s] * q_values)
# Check for convergence (delta = max |V_new - V|)
delta = np.max(np.abs(V_new - V))
V = V_new
if delta < theta:
break
# === POLICY IMPROVEMENT ===
# For each state, find the action with highest Q-value
q_table = np.zeros((n_states, n_actions))
for s in range(n_states):
for a in range(n_actions):
q_sum = 0
for prob, next_s, reward in P[s][a]:
q_sum += prob * (reward + gamma * V[next_s])
q_table[s, a] = q_sum
# Greedy policy: pi_new[s] = argmax_a Q(s, a)
new_pi = np.zeros((n_states, n_actions))
best_actions = np.argmax(q_table, axis=1)
for s in range(n_states):
new_pi[s, best_actions[s]] = 1.0
# Check for convergence (policy unchanged)
history.append((V.copy(), new_pi.copy()))
if np.array_equal(new_pi, pi):
print(f"Converged after {iteration + 1} iterations")
break
pi = new_pi
return V, pi, history
# Example: 4-state chain MDP
# States: 0=start, 1=mid-left, 2=mid-right, 3=goal
# Actions: 0=left, 1=right
# Terminal states: 0 and 3 have self-loops with terminal rewards
P = { # Deterministic transitions
0: {0: [(1.0, 0, -1)], 1: [(1.0, 0, -1)]}, # Terminal: -1 reward
1: {0: [(1.0, 0, 0)], 1: [(1.0, 2, 0)]}, # Left->0, Right->2
2: {0: [(1.0, 1, 0)], 1: [(1.0, 3, 1)]}, # Left->1, Right->3 (goal!)
3: {0: [(1.0, 3, 1)], 1: [(1.0, 3, 1)]}, # Terminal: +1 reward
}
V, pi, history = policy_iteration(P, None, gamma=0.9)
print(f"Optimal Value Function: {V}")
print(f"Optimal Policy: {pi.argmax(axis=1)}") # 0=left, 1=right
Key implementation details:
- State representation: We enumerate states as integers 0, 1, 2, ..., n-1
- Transition format:
P[s][a]is a list of(prob, next_state, reward)tuples - Policy representation:
pi[s]is a probability distribution over actions - Convergence check: Policy iteration converges when
Value iteration
Value iteration collapses the evaluation-improvement cycle into a single Bellman optimality update applied repeatedly.
Update rule
where is the Bellman optimality operator. Unlike policy evaluation, which averages over the policy's action distribution, value iteration takes the maximum over all actions at every update.
Convergence
is a -contraction:
By the Banach fixed-point theorem, from any initialization, with error . Once has converged to within of , the greedy policy is -optimal.
Implementation
import numpy as np
def value_iteration(P, R, gamma=0.9, theta=1e-8, max_iter=1000):
"""
Value Iteration for finite MDPs.
Args:
P: Transition probabilities P[s][a] = [(prob, next_state, reward), ...]
R: Reward function R[s][a] = expected_reward
gamma: Discount factor (0 < gamma < 1)
theta: Convergence threshold
max_iter: Maximum iterations
Returns:
V: Optimal value function
pi: Optimal policy (extracted greedily from V)
history: List of V values for visualization
"""
n_states = len(P)
n_actions = len(P[0])
# Initialize value function arbitrarily
V = np.zeros(n_states)
history = [V.copy()]
for k in range(max_iter):
delta = 0 # Track maximum change
# For each state, apply the Bellman optimality operator
V_new = np.zeros(n_states)
for s in range(n_states):
q_values = np.zeros(n_actions)
for a in range(n_actions):
# Q(s, a) = sum_{s'} P(s'|s,a) * [R + gamma * V(s')]
q_sum = 0
for prob, next_s, reward in P[s][a]:
q_sum += prob * (reward + gamma * V[next_s])
q_values[a] = q_sum
# V(s) = max_a Q(s, a) <-- Key difference from policy evaluation!
V_new[s] = np.max(q_values)
delta = max(delta, abs(V_new[s] - V[s]))
V = V_new
history.append(V.copy())
# Check convergence: ||V_k+1 - V_k||_inf < theta
if delta < theta:
print(f"Converged after {k + 1} iterations (delta={delta:.2e})")
break
# Extract greedy policy from converged value function
pi = np.zeros((n_states, n_actions))
for s in range(n_states):
q_values = np.zeros(n_actions)
for a in range(n_actions):
q_sum = 0
for prob, next_s, reward in P[s][a]:
q_sum += prob * (reward + gamma * V[next_s])
q_values[a] = q_sum
best_action = np.argmax(q_values)
pi[s, best_action] = 1.0
return V, pi, history
# Example: FrozenLake-like 4x4 grid
# States: 0-15 (row-major order)
# Actions: 0=up, 1=down, 2=left, 3=right
def create_grid_mdp(size=4, goal_state=15, hole_states=None, slip_prob=0.0):
"""Create a gridworld MDP."""
if hole_states is None:
hole_states = [5, 7, 11] # Classic holes
n_states = size * size
n_actions = 4
# Initialize transition dictionary
P = {s: {a: [] for a in range(n_actions)} for s in range(n_states)}
for s in range(n_states):
for a in range(n_actions):
if s == goal_state or s in hole_states:
# Terminal/hole states: stay in place
reward = 1 if s == goal_state else 0
P[s][a] = [(1.0, s, reward)]
else:
# Compute next state based on action
row, col = s // size, s % size
if a == 0: row = max(0, row - 1) # up
elif a == 1: row = min(size - 1, row + 1) # down
elif a == 2: col = max(0, col - 1) # left
else: col = min(size - 1, col + 1) # right
next_s = row * size + col
reward = 1 if next_s == goal_state else 0
P[s][a] = [(1.0, next_s, reward)]
return P
# Run value iteration
P = create_grid_mdp(size=4)
V, pi, history = value_iteration(P, None, gamma=0.9)
print("Value Function (4x4 grid):")
for i in range(4):
print([f"{V[i*4+j]:.2f}" for j in range(4)])
print("\nOptimal Policy (0=up, 1=down, 2=left, 3=right):")
for i in range(4):
print([pi[i*4+j].argmax() for j in range(4)])
Key insight: Value iteration applies max_a at every update (line 22), while policy evaluation computes expected Q-values under the current policy. This single change is what makes VI converge in one sweep per iteration.
Computational complexity:
- Policy iteration: O( iterations × |S|³ ) for direct solve, O( iterations × |S|² × |A| ) for iterative
- Value iteration: O( iterations × |S|² × |A| )
For large state spaces, value iteration is often preferred because each iteration is cheaper, even though it may need more iterations.
Relationship to policy iteration
Value iteration is policy iteration with one-step policy evaluation before each improvement. Equivalently, it is applying — which implicitly takes the greedy action — without first computing for the current greedy policy. This means value iteration does not maintain an explicit policy between iterations; it simply drives toward and extracts the policy at the end.
Generalized policy iteration
Both policy iteration and value iteration are special cases of a unifying framework: Generalized Policy Iteration (GPIGeneralised Policy Iteration).
GPI describes any algorithm that maintains both a value function and a policy , and alternates between:
- Evaluation steps: make more consistent with (partial or full)
- Improvement steps: make more greedy with respect to
The key theorem is that any interleaving of partial evaluation and greedy improvement converges to and , as long as both processes run and neither is stopped prematurely. The special cases are:
| Algorithm | Evaluation steps per cycle | Improvement steps per cycle | |---|---|---| | Policy iteration | Full convergence of | One full greedy pass | | Value iteration | One Bellman update | One full greedy pass (implicit in ) | | Actor-critic (deep RLReinforcement Learning) | TDTemporal Difference steps (partial) | One gradient step (partial) |
GPI is the conceptual framework that connects DP to deep RLReinforcement Learning. An actor-critic algorithm is GPI with function approximation: the critic performs partial policy evaluation using TDTemporal Difference learning (Week 4), and the actor performs partial policy improvement using a policy gradient step. The evaluation and improvement are no longer exact — they are stochastic, approximate, and interleaved at every timestep — but the GPI structure is preserved.
Asynchronous and prioritized DP
Standard (synchronous) DP sweeps through all states in every iteration. This is impractical when is large. Two important extensions relax this:
Asynchronous DP
Asynchronous DP updates states in any order, not necessarily all at once. The only requirement is that every state is updated infinitely often (in the limit). Asynchronous updates converge to under the same contraction argument — each update moves the updated state's value closer to , and the contraction ensures the rest of the value function follows.
In practice, asynchronous DP enables:
- focusing computation on states that are frequently visited,
- real-time DP (updating states as they are visited by the agent),
- and efficient implementation on non-uniform state spaces.
Prioritized sweeping
prioritized sweeping selects states for update based on the magnitude of their Bellman error:
States with large Bellman error have the most inaccurate value estimates and benefit most from being updated. Prioritized sweeping maintains a priority queue ordered by and updates the highest-priority state at each step, propagating updates backward through predecessor states.
Prioritized sweeping is the direct precursor to prioritized experience replay (PER) in deep RLReinforcement Learning: instead of sampling uniformly from a replay buffer, PER samples transitions with probability proportional to their TDTemporal Difference error — the model-free analog of . The same intuition applies: transitions where the current value estimate is most wrong carry the most learning signal.
Why dynamic programming does not scale
Dynamic programming requires enumerating all states, all actions, and knowing exact transition dynamics. The resulting complexity is per sweep — quadratic in the state space due to the sum over next states .
GenAI context: the LLMLarge Language Model state space
Treating language generation as an MDPMarkov Decision Process, the state is the full context window. For an LLMLarge Language Model with vocabulary size and context length tokens:
This number exceeds the number of atoms in the observable universe by an incomprehensible margin. Exact dynamic programming is fundamentally impossible for language models. This is not an engineering limitation — it is a mathematical one. The only viable approaches are:
- Function approximation: represent or as a neural network that generalizes across states, rather than a lookup table.
- Sampling-based methods: estimate Bellman targets from experience rather than exact model-based sums.
- Policy-based methods: represent and optimize directly without computing .
Every algorithm from Week 4 onward is a response to this impossibility. DP defines the ideal target; the rest of the course develops tractable approximations.
From DP to deep RLReinforcement Learning: the precise mappings
The "conceptual importance" claim — that DP underlies modern RLReinforcement Learning — can be made precise:
Q-learning is sampled value iteration on
Value iteration on :
Q-learning replaces the exact model-based expectation with a single sampled transition :
The bracketed term is the TDTemporal Difference error — the discrepancy between the current estimate and the one-step Bellman target. Q-learning is value iteration where the model-based expectation is replaced by a Monte Carlo sample of a single next state . Deep Q-Networks (DQNDeep Q-Network) further replace the tabular with a neural network.
Actor-critic is GPI with approximate evaluation and improvement
Policy iteration runs full policy evaluation before each improvement. Actor-critic methods run one TDTemporal Difference step of evaluation (the critic update) interleaved with one gradient step of improvement (the actor update) at every timestep. This is GPI with both steps truncated to a single stochastic gradient update. The critic approximates or using TDTemporal Difference learning; the actor uses the critic's output to estimate the policy gradient and improve .
Update rate balance matters: if the critic learns much faster than the actor, it converges to for a nearly fixed policy — but must then re-track a new target every time the actor updates, causing value estimates to oscillate. If the actor learns much faster, policy gradients are computed from a stale critic, accumulating systematic bias that can push the policy in the wrong direction. Stable actor-critic training typically uses similar learning rates for both, or a frozen target network for the critic updated on a slower schedule.
RLHFReinforcement Learning from Human Feedback uses Bellman bootstrapping in the reward model training loop
RLHFReinforcement Learning from Human Feedback trains a reward model on preference data and then fine-tunes the language model using PPOProximal Policy Optimisation. PPOProximal Policy Optimisation is an actor-critic algorithm and uses TDTemporal Difference-style bootstrapped value estimates (the critic's function) to compute advantage estimates. The Bellman structure — the critic is trained to be consistent with the Bellman expectation equation — is present even in the RLHFReinforcement Learning from Human Feedback setting, though the reward signal comes from a learned reward model rather than an analytic function.
Key takeaways
The lecture develops a progression from the ideal to the tractable. Policy evaluation solves for exactly by iterating the Bellman expectation operator to its fixed point. Policy improvement replaces with the greedy policy with respect to , and the improvement theorem — proved by unrolling the greedy recursion — guarantees monotone improvement. Policy iteration alternates these two steps and converges to because the policy sequence is non-decreasing in value and the policy space is finite. Value iteration merges the two steps into a single Bellman optimality update, converging to by the contraction mapping property. Generalized policy iteration unifies both as special cases of a partial evaluation / partial improvement cycle — the same structure that underlies actor-critic methods in deep RLReinforcement Learning.
Q-functions are developed in parallel with V-functions and are the natural target for model-free RLReinforcement Learning because they encode the optimal policy without requiring . Asynchronous DP and prioritized sweeping extend the framework to practical non-uniform computation and are the direct precursors of experience replay and PER. And the LLMLarge Language Model state-space calculation makes concrete why function approximation and sampling are not optional refinements but mathematical necessities.
Conceptual questions
-
Apply two full iterations of value iteration to the 4-state chain example with , starting from . Compute and explicitly. At which iteration does the greedy policy first become the optimal policy "always go right"?
-
Prove that policy iteration terminates in finite time for a finite MDPMarkov Decision Process. Your proof should use the policy improvement theorem and a counting argument on the number of deterministic policies. Why does the same argument fail if the action space is continuous?
-
Value iteration converges to but not necessarily to any policy's value function at intermediate iterates. Explain why the greedy policy may still be suboptimal even when is small. What bound guarantees that is near-optimal?
-
Write out the Bellman optimality equation for and identify exactly which term changes when moving from model-based value iteration to the model-free Q-learning update. What statistical assumption justifies replacing the expectation with a single sample?
-
An actor-critic algorithm trains a critic using TDTemporal Difference(0) and an actor using policy gradient. Map each component onto the GPI framework: which step is the evaluation step, which is the improvement step, and in what sense are both steps "partial"? What goes wrong if the critic is updated much faster than the actor, or vice versa?
Coding exercises
Exercise 6: Asynchronous and prioritized variants. Implement two variants of value iteration for the 4-state chain MDP defined in the worked example:
- Asynchronous VI: on each sweep, update states in a randomly shuffled order instead of the fixed order .
- Prioritized sweeping: maintain a max-heap ordered by Bellman error . At each step, pop the highest-error state, update it, then recompute the Bellman error of its predecessor states and push them back into the heap.
For each variant, plot versus total number of individual state updates (not full sweeps). Which variant converges fastest in terms of state updates? Does the advantage of prioritized sweeping grow or shrink as the chain length increases?
Extension prompt. Implement a discount annealing schedule: start with and linearly increase toward over the first 50 iterations of value iteration. Compare the convergence curve against standard value iteration with fixed on the 4-state chain. Does a low initial discount accelerate early convergence? Why might this trick be useful for problems with sparse or delayed rewards?
Looking ahead
The next lecture removes the assumption of known dynamics. We will study Monte Carlo and Temporal-Difference (TDTemporal Difference) learning, which estimate value functions directly from sampled experience. The key insight is that the Bellman equations can be used as regression targets rather than exact update rules — replacing the model-based expectation with samples of from observed transitions. This replacement is what makes RLReinforcement Learning scale to problems where is unknown or intractable.
Further reading
- Sutton & Barto (2018). Reinforcement Learning: An Introduction, 2nd ed. Chapter 4: Dynamic Programming. (Standard reference — freely available at incompleteideas.net)
- Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley. (The definitive mathematical treatment; Chapters 6–7 cover policy evaluation, policy iteration, and value iteration with full proofs.)
- Bertsekas, D. P. (2012). Dynamic Programming and Optimal Control, Vol. I. Athena Scientific. (Advanced treatment covering asynchronous DP, convergence theory, and connections to approximate methods.)
- Mnih, V. et al. (2015). "Human-level control through deep reinforcement learning." Nature, 518, 529–533. (The DQN paper — implements sampled value iteration on with a neural network, directly realizing the DP-to-RL mapping developed in this lesson.)
- Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press. (Pioneered Policy Iteration.)