Week 3: Dynamic Programming for Finite MDPs

Purpose of this lecture#

Here is the central question: If we knew everything about the environment, what is the fastest way to compute an optimal policy?

This is the domain of dynamic programming (DP). We will assume perfect knowledge—the transition dynamics $P$ and reward function $R$ are given. Under this idealized assumption, the problem has a clean solution: iterate the Bellman equations to convergence.

The payoff of studying DP is not that it solves real problems (it doesn't—real environments are unknown). Instead:

DP defines the gold standard. DP algorithms are the exact solutions to the MDP problem. Every algorithm you study later is either a DP algorithm applied to an approximate model, or a sampled/approximate version of a DP algorithm.
DP teaches algorithm design fundamentals. The progression from policy evaluation (full convergence) to policy iteration (two-step cycles) to value iteration (one-step updates) shows how to trade off computational cost against convergence guarantees.
DP reveals what must change in model-free RL. When you remove the assumption of known $P$ and $R$ , only the way expectations are computed changes; the Bellman equations, the value functions, and the contraction arguments carry over. Understanding what changes and what doesn't is the key to understanding modern RL.

In Week 4, we will relax exactly one assumption: known dynamics. We will keep the Bellman equations, the value functions, the contraction property. We will only replace the model-based expectation $\sum_{s'} P(s'|s,a)V(s')$ with samples from experience. This is the bridge from DP to model-free RL.

Setting and assumptions#

Dynamic programming applies under the following assumptions:

Finite state space $\mathcal{S}$ , finite action space $\mathcal{A}$
Known transition probabilities $P(s' \mid s, a)$
Known reward function $R(s, a)$
Discount factor $\gamma \in [0,1)$

Under these conditions, the Bellman equations from Week 1 become explicit computational algorithms. The assumptions that distinguish DP from everything that follows are the second and third: we know $P$ and $R$ . Model-free RL relaxes exactly these.

A worked example: the 4-state chain#

Before developing the algorithms in full generality, we ground everything in a small concrete MDP. Consider a linear chain with four states $\{s_1, s_2, s_3, s_4\}$ and two actions $\{L, R\}$ (move left, move right):

From any non-boundary state, action $R$ moves right with probability 0.9 and stays with probability 0.1; action $L$ moves left with probability 0.9 and stays with probability 0.1.
$s_1$ and $s_4$ are boundary states: all actions keep the agent in place.
Reward: $R(s_4, \cdot) = +1$ , $R(s_1, \cdot) = -1$ , all other transitions yield $0$ .
$\gamma = 0.9$ .

We will use this MDP to trace through policy evaluation, policy iteration, and value iteration concretely. The optimal policy is clearly "always go right" for $s_2$ and $s_3$ , and the optimal values should reflect proximity to $s_4$ .

Two iterations of value iteration from $V_0 = \mathbf{0}$ :

Iteration 1 ( $V_1$ ): For $s_3$ , both actions give the same backup:

V_1(s_3) = \max_a \left[R(s_3, a) + 0.9\sum_{s'}P(s'|s_3,a)V_0(s')\right] = \max\left[0 + 0,\; 0 + 0\right] = 0

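The interior states get $V_1(s_2) = V_1(s_3) = 0$ , since $V_0 = \mathbf{0}$ and no immediate reward is earned away from the boundaries. The boundary states collect their immediate rewards: $V_1(s_1) = -1$ and $V_1(s_4) = +1$ .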

Iteration 2 ( $V_2$ ): For $s_3$ with action $R$ :

V_2(s_3) = 0 + 0.9\left[0.9 \cdot V_1(s_4) + 0.1 \cdot V_1(s_3)\right]

This needs $V_1(s_4)$ : at $s_4$ , every action gives reward $+1$ and stays, so $V_1(s_4) = 1 + 0.9 \cdot V_0(s_4) = 1$ . Substituting:

V_2(s_3) = 0.9\left[0.9 \cdot 1 + 0.1 \cdot 0\right] = 0.81

For $s_2$ with action $R$ :

V_2(s_2) = 0.9\left[0.9 \cdot V_1(s_3) + 0.1 \cdot V_1(s_2)\right] = 0.9[0 + 0] = 0

To determine the greedy action at $s_2$ under $V_2$ , compare both actions:

Q_2(s_2, R) = 0 + 0.9\left[0.9 \cdot V_1(s_3) + 0.1 \cdot V_1(s_2)\right] = 0

Q_2(s_2, L) = 0 + 0.9\left[0.9 \cdot V_1(s_1) + 0.1 \cdot V_1(s_2)\right] = 0.9\left[0.9 \cdot (-1) + 0\right] = -0.81

So $\pi_2^{\text{greedy}}(s_2) = R$ (since $0 > {-0.81}$ ). And from above, $Q_2(s_3, R) = 0.81 > Q_2(s_3, L) = 0$ , so $\pi_2^{\text{greedy}}(s_3) = R$ . The optimal policy "always go right" is first recovered at iteration 2.

After two iterations, the boundary reward has propagated one step into the interior; after $k$ iterations it has propagated $k-1$ steps. This is the backup structure: value information flows backward from high-reward states through the graph.
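The hand computation above can be reproduced in a few lines. This is a minimal sketch, assuming the states $s_1, \ldots, s_4$ are indexed 0 to 3 and the actions $L, R$ are indexed 0 and 1; the names `q_backup` and `sweep` are illustrative. Note that the boundary states already pick up their immediate rewards in the first sweep.

```python
GAMMA = 0.9

# Chain from the worked example: states 0..3 stand for s_1..s_4, actions 0 = L, 1 = R.
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 0, -1.0)]},
    1: {0: [(0.9, 0, 0.0), (0.1, 1, 0.0)], 1: [(0.9, 2, 0.0), (0.1, 1, 0.0)]},
    2: {0: [(0.9, 1, 0.0), (0.1, 2, 0.0)], 1: [(0.9, 3, 0.0), (0.1, 2, 0.0)]},
    3: {0: [(1.0, 3, 1.0)], 1: [(1.0, 3, 1.0)]},
}

def q_backup(V, s, a):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def sweep(V):
    """One synchronous Bellman optimality update over all four states."""
    return [max(q_backup(V, s, a) for a in (0, 1)) for s in range(4)]

V0 = [0.0] * 4
V1 = sweep(V0)
V2 = sweep(V1)
# Q_2(s, a) is the lookahead on V_1, as in the text.
Q2 = [[q_backup(V1, s, a) for a in (0, 1)] for s in range(4)]

print("V1 =", V1)
print("V2 =", [round(v, 4) for v in V2])
print("Q2(s_2, L), Q2(s_2, R) =", [round(q, 4) for q in Q2[1]])
```

Calling `sweep` a third time shows the positive value reaching $s_2$ , one state further from $s_4$ .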

Policy evaluation#

Given a fixed policy $\pi$ , policy evaluation computes its state-value function $V^\pi$ — the expected discounted return from each state under $\pi$ .

Bellman expectation equation#

For all $s \in \mathcal{S}$ :

V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s') \right]

This is a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns. For small state spaces, it can be solved directly by matrix inversion. For large state spaces, iterative methods are used.
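For the 4-state chain, the direct solve is a single linear-algebra call. The sketch below assumes NumPy, the fixed policy "always go right", and a dense matrix form $V = R_\pi + \gamma P_\pi V$ ; the names `P_pi` and `R_pi` are illustrative.

```python
import numpy as np

gamma = 0.9

# P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a) for the policy "always go right".
# Rows are s_1..s_4; the boundary states are absorbing.
P_pi = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.0],
    [0.0, 0.0, 0.1, 0.9],
    [0.0, 0.0, 0.0, 1.0],
])
# R_pi[s] = sum_a pi(a|s) R(s,a): -1 at s_1, +1 at s_4, 0 in the interior.
R_pi = np.array([-1.0, 0.0, 0.0, 1.0])

# Rearranged Bellman expectation equation: (I - gamma * P_pi) V = R_pi
V_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, R_pi)
print(np.round(V_pi, 4))
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the usual numerical choice; the cost is still cubic in $|\mathcal{S}|$ , which is why iteration takes over for larger problems.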

The Bellman expectation operator#

Define the Bellman expectation operator $T^\pi$ :

(T^\pi V)(s) = \sum_a \pi(a \mid s)\left[R(s,a) + \gamma\sum_{s'} P(s'|s,a)V(s')\right]

$T^\pi$ is a $\gamma$ -contraction in $\ell_\infty$ : for any two value functions $V, W$ :

\|T^\pi V - T^\pi W\|_\infty \leq \gamma \|V - W\|_\infty

The proof is identical to the optimality operator case from Week 1: the $\gamma$ factor in front of the expectation over next states is what gives the contraction mapping property. By the Banach fixed-point theorem, $V^\pi$ is the unique fixed point of $T^\pi$ , and iterative application from any initialization converges to it geometrically.

Iterative policy evaluation#

V_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_k(s') \right]

Starting from any $V_0$ (e.g., $V_0 = \mathbf{0}$ ), repeated application converges to $V^\pi$ because $T^\pi$ is a $\gamma$ -contraction. The error after $k$ iterations satisfies $\|V_k - V^\pi\|_\infty \leq \gamma^k \|V_0 - V^\pi\|_\infty$ — geometric convergence at rate $\gamma$ .

Backup diagram: at each update, the value of state $s$ is computed by looking one step ahead — summing over all actions under $\pi$ , then over all next states under $P$ , and backing up the discounted next-state values. Information flows from future states into the current estimate.
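The geometric error bound can be checked numerically. The following sketch assumes NumPy and the uniform random policy on the 4-state chain; it uses a direct solve only to obtain a reference $V^\pi$ , and the names `V_ref` and `errors` are illustrative.

```python
import numpy as np

gamma = 0.9

# Uniform random policy on the chain: each interior row averages the L and R dynamics.
P_pi = np.array([
    [1.00, 0.00, 0.00, 0.00],
    [0.45, 0.10, 0.45, 0.00],
    [0.00, 0.45, 0.10, 0.45],
    [0.00, 0.00, 0.00, 1.00],
])
R_pi = np.array([-1.0, 0.0, 0.0, 1.0])

# Reference fixed point of T^pi, used only to measure the error.
V_ref = np.linalg.solve(np.eye(4) - gamma * P_pi, R_pi)

V = np.zeros(4)
e0 = np.max(np.abs(V - V_ref))
errors = []
for k in range(1, 51):
    V = R_pi + gamma * P_pi @ V                 # V_k = T^pi V_{k-1}
    errors.append(np.max(np.abs(V - V_ref)))

# Check ||V_k - V^pi||_inf <= gamma^k ||V_0 - V^pi||_inf (small tolerance for rounding).
bound_ok = all(err <= gamma**k * e0 + 1e-9 for k, err in enumerate(errors, start=1))
print("bound holds for k = 1..50:", bound_ok)
```

On this MDP the bound is tight at the boundary states, where the error shrinks by exactly a factor of $\gamma$ per sweep.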

Q-functions in dynamic programming#

The lecture so far has developed $V^\pi$ and $V^*$ . It is equally important to develop the action-value function (Q-function) in the DP setting, because Q-learning, one of the most widely used model-free algorithms, is a sampled form of value iteration on $Q^*$ .

Bellman equations for $Q^\pi$ and $Q^*$ #

The Bellman expectation equation for $Q^\pi$ :

Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s')\, Q^\pi(s',a')

The Bellman optimality equation for $Q^*$ :

Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')

Relationship between V and Q#

V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s,a) \qquad V^*(s) = \max_a Q^*(s,a)

Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^\pi(s')

Why $Q^*$ is the target for model-free RL#

If you know $V^*$ , extracting the optimal policy still requires knowing $P$ :

\pi^*(s) = \arg\max_a \left[R(s,a) + \gamma\sum_{s'} P(s'|s,a) V^*(s')\right]

If you know $Q^*$ , you do not need $P$ :

\pi^*(s) = \arg\max_a Q^*(s,a)

This is the key reason Q-learning targets $Q^*$ rather than $V^*$ : in the model-free setting where $P$ is unknown, $Q^*$ gives you the optimal policy directly from experience. The Bellman optimality equation for $Q^*$ becomes the Q-learning update rule once we replace the exact model-based expectation with a sampled transition.
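The difference between the two extraction rules shows up directly in code. The sketch below assumes the 4-state chain with states indexed 0 to 3 and actions 0 = L, 1 = R. It computes $Q^*$ with the model (this is still DP), then extracts the policy twice: once from $Q^*$ alone and once from $V^*$ , which needs `P` again. The function name `q_value_iteration` is illustrative.

```python
GAMMA = 0.9

# P[s][a] = [(probability, next_state, reward), ...] for the 4-state chain.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 0, -1.0)]},
    1: {0: [(0.9, 0, 0.0), (0.1, 1, 0.0)], 1: [(0.9, 2, 0.0), (0.1, 1, 0.0)]},
    2: {0: [(0.9, 1, 0.0), (0.1, 2, 0.0)], 1: [(0.9, 3, 0.0), (0.1, 2, 0.0)]},
    3: {0: [(1.0, 3, 1.0)], 1: [(1.0, 3, 1.0)]},
}

def q_value_iteration(P, n_sweeps=500):
    """Iterate the Bellman optimality equation for Q with the known model."""
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    for _ in range(n_sweeps):
        Q_new = {}
        for s in P:
            Q_new[s] = {}
            for a in P[s]:
                Q_new[s][a] = sum(
                    p * (r + GAMMA * max(Q[s2].values())) for p, s2, r in P[s][a]
                )
        Q = Q_new
    return Q

Q_star = q_value_iteration(P)
V_star = {s: max(Q_star[s].values()) for s in Q_star}

# From Q*: a table lookup per state; the transition model is not touched.
pi_from_q = {s: max(Q_star[s], key=Q_star[s].get) for s in Q_star}

# From V*: the one-step lookahead needs P(s'|s,a) and the rewards again.
pi_from_v = {
    s: max(P[s], key=lambda a: sum(p * (r + GAMMA * V_star[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(pi_from_q, pi_from_v)
```

Both rules return the same policy here; only the second would be unavailable to a learner that has no access to `P`.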

Policy improvement#

Once we know $V^\pi$ , we can improve the policy.

Definition#

Given $V^\pi$ , define the improved policy:

\pi'(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s') \right] = \arg\max_a\, Q^\pi(s,a)

$\pi'$ is the greedy policy with respect to $V^\pi$ : at each state, take the action whose one-step lookahead value is highest.

Policy improvement theorem: proof#

Theorem: $V^{\pi'}(s) \geq V^\pi(s)$ for all $s \in \mathcal{S}$ .

Proof: By the definition of $\pi'$ as greedy with respect to $V^\pi$ :

V^\pi(s) \leq Q^\pi(s, \pi'(s)) = R(s,\pi'(s)) + \gamma\sum_{s'} P(s'|s,\pi'(s))\, V^\pi(s')

The inequality holds because $\pi'(s)$ maximizes the right-hand side over all actions, so it is at least as large as $V^\pi(s)$ (which is the $\pi$ -weighted average, not the maximum). Applying the same argument to $V^\pi(s')$ :

V^\pi(s) \leq R(s,\pi'(s)) + \gamma\sum_{s'} P(s'|s,\pi'(s)) \left[R(s',\pi'(s')) + \gamma\sum_{s''} P(s''|s',\pi'(s'))\, V^\pi(s'')\right]

Unrolling this recursion over all future timesteps under $\pi'$ gives:

V^\pi(s) \leq \mathbb{E}_{\pi'}\left[\sum_{t=0}^\infty \gamma^t r_{t+1} \;\middle|\; s_0 = s\right] = V^{\pi'}(s)

$\square$
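The theorem can be sanity-checked numerically on the 4-state chain. This sketch assumes NumPy, a dense model `P[a, s, s']` and `R[s, a]`, and starts from the uniform random policy; the helper names `evaluate` and `q_from_v` are illustrative. It is a check on one example, not a substitute for the proof.

```python
import numpy as np

gamma = 0.9
n_s, n_a = 4, 2

# Dense chain model: P[a, s, s'] and R[s, a]; actions 0 = L, 1 = R.
P = np.zeros((n_a, n_s, n_s))
P[:, 0, 0] = 1.0                      # s_1 is absorbing
P[:, 3, 3] = 1.0                      # s_4 is absorbing
for s in (1, 2):
    P[0, s, s - 1], P[0, s, s] = 0.9, 0.1
    P[1, s, s + 1], P[1, s, s] = 0.9, 0.1
R = np.zeros((n_s, n_a))
R[0, :], R[3, :] = -1.0, 1.0

def evaluate(pi):
    """Solve V = R_pi + gamma * P_pi V for a stochastic policy pi[s, a]."""
    P_pi = np.einsum("sa,ast->st", pi, P)
    R_pi = np.sum(pi * R, axis=1)
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)

def q_from_v(V):
    """Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')."""
    return R + gamma * np.einsum("ast,t->sa", P, V)

pi = np.full((n_s, n_a), 0.5)               # uniform random policy
V_pi = evaluate(pi)
greedy = np.argmax(q_from_v(V_pi), axis=1)  # pi'(s) = argmax_a Q^pi(s, a)
pi_new = np.eye(n_a)[greedy]                # one-hot encoding of pi'
V_new = evaluate(pi_new)

print("V^pi  =", np.round(V_pi, 3))
print("V^pi' =", np.round(V_new, 3))
```

The improved values should dominate the original ones in every state, with equality only where the policy change has no effect (here, the absorbing boundary states).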

When does policy iteration terminate?#

At termination, $\pi' = \pi$ , meaning $\pi$ is already greedy with respect to its own value function:

V^\pi(s) = \max_a \left[R(s,a) + \gamma\sum_{s'} P(s'|s,a)\, V^\pi(s')\right] \quad \forall s

This is exactly the Bellman optimality equation. Therefore $V^\pi = V^*$ and $\pi = \pi^*$ . Policy iteration terminates at and only at the optimal policy.

Policy iteration#

Policy iteration alternates policy evaluation and policy improvement until convergence.

Algorithm#

Initialize policy $\pi_0$ arbitrarily (e.g., uniform random)
For $k = 0, 1, 2, \ldots$ :
- Evaluate: compute $V^{\pi_k}$ by iterative policy evaluation (or direct solve)
- Improve: set $\pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s,a)$ for all $s$
Stop when $\pi_{k+1} = \pi_k$

Convergence argument#

The sequence $\{V^{\pi_k}\}$ is monotonically non-decreasing by the policy improvement theorem, and the improvement is strict in at least one state unless $V^{\pi_k}$ already satisfies the Bellman optimality equation. The number of deterministic policies in a finite MDP is $|\mathcal{A}|^{|\mathcal{S}|}$ , which is finite. A policy cannot reappear once its value has been strictly improved upon, so the sequence cannot cycle through distinct policies. It must reach a fixed point, which we showed above is $\pi^*$ .

In practice, policy iteration typically converges in far fewer iterations than $|\mathcal{A}|^{|\mathcal{S}|}$ — often fewer than 10–20 iterations even for moderately large MDPs — because each improvement step makes substantial progress.

Properties#

Convergence to $\pi^*$ is guaranteed
Typically converges in very few outer iterations
Each iteration requires solving (or approximately solving) a linear system of size $|\mathcal{S}|$
For large state spaces, the inner evaluation is the computational bottleneck

Implementation#

python · runs in browser

import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8, max_iter=1000):
    """
    Policy Iteration for finite MDPs.

    Args:
        P: Transition probabilities P[s][a] = [(prob, next_state, reward), ...]
           For deterministic transitions: P[s][a] = [(1.0, s', r)]
        R: Unused; rewards are read from the tuples in P (pass None)
        gamma: Discount factor (0 <= gamma < 1)
        theta: Convergence threshold for value function
        max_iter: Maximum number of outer iterations

    Returns:
        V: Optimal value function V[s]
        pi: Optimal policy pi[s] = best action
        history: List of (V, pi) tuples for visualization
    """
    n_states = len(P)
    n_actions = len(P[0])

    # Step 1: Initialize policy arbitrarily (uniform random)
    pi = np.ones((n_states, n_actions)) / n_actions
    history = [(np.zeros(n_states), pi.copy())]

    for iteration in range(max_iter):
        # === POLICY EVALUATION ===
        # Solve V^pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V^pi(s')]
        V = np.zeros(n_states)

        for _ in range(max_iter):  # Inner loop for iterative evaluation
            V_new = np.zeros(n_states)
            for s in range(n_states):
                q_values = np.zeros(n_actions)
                for a in range(n_actions):
                    # Q(s, a) = sum_{s'} P(s'|s,a) * [R + gamma * V(s')]
                    q_sum = 0
                    for prob, next_s, reward in P[s][a]:
                        q_sum += prob * (reward + gamma * V[next_s])
                    q_values[a] = q_sum

                # V(s) = sum_a pi(a|s) * Q(s, a)
                V_new[s] = np.sum(pi[s] * q_values)

            # Check for convergence (delta = max |V_new - V|)
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < theta:
                break

        # === POLICY IMPROVEMENT ===
        # For each state, find the action with highest Q-value
        q_table = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                q_sum = 0
                for prob, next_s, reward in P[s][a]:
                    q_sum += prob * (reward + gamma * V[next_s])
                q_table[s, a] = q_sum

        # Greedy policy: pi_new[s] = argmax_a Q(s, a)
        new_pi = np.zeros((n_states, n_actions))
        best_actions = np.argmax(q_table, axis=1)
        for s in range(n_states):
            new_pi[s, best_actions[s]] = 1.0

        # Check for convergence (policy unchanged)
        history.append((V.copy(), new_pi.copy()))
        if np.array_equal(new_pi, pi):
            print(f"Converged after {iteration + 1} iterations")
            break

        pi = new_pi

    return V, pi, history

# Example: the 4-state chain MDP from the worked example
# States: 0=s_1 (left boundary), 1=s_2, 2=s_3, 3=s_4 (right boundary)
# Actions: 0=left, 1=right
# Boundary states 0 and 3 are absorbing self-loops with rewards -1 and +1

P = {  # Interior moves succeed with probability 0.9; otherwise the agent stays put
    0: {0: [(1.0, 0, -1)], 1: [(1.0, 0, -1)]},                          # Boundary: -1 reward
    1: {0: [(0.9, 0, 0), (0.1, 1, 0)], 1: [(0.9, 2, 0), (0.1, 1, 0)]},  # Left->0, Right->2
    2: {0: [(0.9, 1, 0), (0.1, 2, 0)], 1: [(0.9, 3, 0), (0.1, 2, 0)]},  # Left->1, Right->3
    3: {0: [(1.0, 3, 1)], 1: [(1.0, 3, 1)]},                            # Boundary: +1 reward
}

V, pi, history = policy_iteration(P, None, gamma=0.9)
print(f"Optimal Value Function: {V}")
print(f"Optimal Policy: {pi.argmax(axis=1)}")  # 0=left, 1=right

Key implementation details:

State representation: We enumerate states as integers 0, 1, 2, ..., n-1
Transition format: P[s][a] is a list of (prob, next_state, reward) tuples
Policy representation: pi[s] is a probability distribution over actions
Convergence check: Policy iteration converges when $\pi_{k+1} = \pi_k$

Value iteration#

Value iteration collapses the evaluation-improvement cycle into a single Bellman optimality update applied repeatedly.

Update rule#

V_{k+1}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_k(s') \right] = (TV_k)(s)

where $T$ is the Bellman optimality operator. Unlike policy evaluation, which averages over the policy's action distribution, value iteration takes the maximum over all actions at every update.

Convergence#

$T$ is a $\gamma$ -contraction:

\|TV - TW\|_\infty \leq \gamma\|V - W\|_\infty

By the Banach fixed-point theorem, $V_k \to V^*$ from any initialization, with error $\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$ . Once $V_k$ is within $\epsilon$ of $V^*$ in $\ell_\infty$ , the greedy policy $\pi_k(s) = \arg\max_a Q_k(s,a)$ , where $Q_k(s,a) = R(s,a) + \gamma\sum_{s'} P(s'|s,a)\, V_k(s')$ , satisfies $\|V^{\pi_k} - V^*\|_\infty \leq 2\gamma\epsilon/(1-\gamma)$ .
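The geometric rate translates into a sweep count. Assuming $V_0 = \mathbf{0}$ and rewards bounded by $R_{\max}$ in absolute value (so that $\|V_0 - V^*\|_\infty \leq R_{\max}/(1-\gamma)$ ), it suffices to take $k$ with $\gamma^k R_{\max}/(1-\gamma) \leq \epsilon$ . The sketch below computes that $k$ ; the function name `sweeps_needed` is illustrative, and the result is a sufficient count rather than the exact number any particular run needs.

```python
import math

def sweeps_needed(epsilon, gamma, r_max):
    """Smallest k with gamma**k * r_max / (1 - gamma) <= epsilon (assumes V_0 = 0)."""
    k = math.log(epsilon * (1.0 - gamma) / r_max) / math.log(gamma)
    return max(0, math.ceil(k))

# 4-state chain settings: gamma = 0.9, |R| <= 1, target accuracy 1e-3 on V.
k = sweeps_needed(epsilon=1e-3, gamma=0.9, r_max=1.0)
print("sufficient sweeps:", k)
print("bound after k sweeps:", 0.9 ** k / (1 - 0.9))
```

Because the count scales like $\log(1/\epsilon)/\log(1/\gamma)$ , pushing $\gamma$ toward 1 is far more expensive than tightening $\epsilon$ .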

Implementation#

python · runs in browser

import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8, max_iter=1000):
    """
    Value Iteration for finite MDPs.

    Args:
        P: Transition probabilities P[s][a] = [(prob, next_state, reward), ...]
        R: Unused; rewards are read from the tuples in P (pass None)
        gamma: Discount factor (0 <= gamma < 1)
        theta: Convergence threshold
        max_iter: Maximum iterations

    Returns:
        V: Optimal value function
        pi: Optimal policy (extracted greedily from V)
        history: List of V values for visualization
    """
    n_states = len(P)
    n_actions = len(P[0])

    # Initialize value function arbitrarily
    V = np.zeros(n_states)
    history = [V.copy()]

    for k in range(max_iter):
        delta = 0  # Track maximum change

        # For each state, apply the Bellman optimality operator
        V_new = np.zeros(n_states)
        for s in range(n_states):
            q_values = np.zeros(n_actions)
            for a in range(n_actions):
                # Q(s, a) = sum_{s'} P(s'|s,a) * [R + gamma * V(s')]
                q_sum = 0
                for prob, next_s, reward in P[s][a]:
                    q_sum += prob * (reward + gamma * V[next_s])
                q_values[a] = q_sum

            # V(s) = max_a Q(s, a)  <-- Key difference from policy evaluation!
            V_new[s] = np.max(q_values)
            delta = max(delta, abs(V_new[s] - V[s]))

        V = V_new
        history.append(V.copy())

        # Check convergence: ||V_k+1 - V_k||_inf < theta
        if delta < theta:
            print(f"Converged after {k + 1} iterations (delta={delta:.2e})")
            break

    # Extract greedy policy from converged value function
    pi = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q_values = np.zeros(n_actions)
        for a in range(n_actions):
            q_sum = 0
            for prob, next_s, reward in P[s][a]:
                q_sum += prob * (reward + gamma * V[next_s])
            q_values[a] = q_sum
        best_action = np.argmax(q_values)
        pi[s, best_action] = 1.0

    return V, pi, history

# Example: FrozenLake-like 4x4 grid
# States: 0-15 (row-major order)
# Actions: 0=up, 1=down, 2=left, 3=right

def create_grid_mdp(size=4, goal_state=15, hole_states=None, slip_prob=0.0):
    """Create a deterministic gridworld MDP (slip_prob is accepted but not used here)."""
    if hole_states is None:
        hole_states = [5, 7, 11, 12]  # Holes of the classic 4x4 FrozenLake map

    n_states = size * size
    n_actions = 4

    # Initialize transition dictionary
    P = {s: {a: [] for a in range(n_actions)} for s in range(n_states)}

    for s in range(n_states):
        for a in range(n_actions):
            if s == goal_state or s in hole_states:
                # Terminal/hole states: stay in place
                reward = 1 if s == goal_state else 0
                P[s][a] = [(1.0, s, reward)]
            else:
                # Compute next state based on action
                row, col = s // size, s % size
                if a == 0: row = max(0, row - 1)      # up
                elif a == 1: row = min(size - 1, row + 1)  # down
                elif a == 2: col = max(0, col - 1)      # left
                else: col = min(size - 1, col + 1)     # right

                next_s = row * size + col
                reward = 1 if next_s == goal_state else 0
                P[s][a] = [(1.0, next_s, reward)]

    return P

# Run value iteration
P = create_grid_mdp(size=4)
V, pi, history = value_iteration(P, None, gamma=0.9)

print("Value Function (4x4 grid):")
for i in range(4):
    print([f"{V[i*4+j]:.2f}" for j in range(4)])

print("\nOptimal Policy (0=up, 1=down, 2=left, 3=right):")
for i in range(4):
    print([pi[i*4+j].argmax() for j in range(4)])

Key insight: Value iteration applies max_a at every update (the V_new[s] = np.max(q_values) line), while policy evaluation computes expected Q-values under the current policy. This single change removes the inner evaluation loop: each iteration of VI is exactly one sweep over the states.

Computational complexity:

Policy iteration: O( iterations × ( |S|³ + |S|² × |A| ) ) with a direct solve for evaluation, O( iterations × inner sweeps × |S|² × |A| ) with iterative evaluation
Value iteration: O( iterations × |S|² × |A| )

For large state spaces, value iteration is often preferred because each iteration is cheaper, even though it may need more iterations.

Relationship to policy iteration#

Value iteration is policy iteration with one-step policy evaluation before each improvement. Equivalently, it is applying $T$ — which implicitly takes the greedy action — without first computing $V^\pi$ for the current greedy policy. This means value iteration does not maintain an explicit policy between iterations; it simply drives $V_k$ toward $V^*$ and extracts the policy at the end.
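The relationship can be made concrete with a single routine that takes the number of evaluation sweeps `m` as a parameter. This is a sketch under stated assumptions: NumPy, a dense model `P[a, s, s']` and `R[s, a]` for the 4-state chain, and the illustrative name `modified_policy_iteration`. With `m = 1` each outer step is exactly one Bellman optimality update; with large `m` it approaches policy iteration with iterative evaluation.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, m, n_outer):
    """Greedy improvement followed by m sweeps of T^pi, repeated n_outer times."""
    n_s = R.shape[0]
    V = np.zeros(n_s)
    pi = np.zeros(n_s, dtype=int)
    for _ in range(n_outer):
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        pi = np.argmax(Q, axis=1)                 # improvement: greedy w.r.t. current V
        for _ in range(m):                        # partial evaluation of pi
            Q = R + gamma * np.einsum("ast,t->sa", P, V)
            V = Q[np.arange(n_s), pi]             # (T^pi V)(s) = Q(s, pi(s))
    return V, pi

# Dense chain model; actions 0 = L, 1 = R.
P = np.zeros((2, 4, 4))
P[:, 0, 0] = 1.0
P[:, 3, 3] = 1.0
for s in (1, 2):
    P[0, s, s - 1], P[0, s, s] = 0.9, 0.1
    P[1, s, s + 1], P[1, s, s] = 0.9, 0.1
R = np.zeros((4, 2))
R[0, :], R[3, :] = -1.0, 1.0

V_m1, _ = modified_policy_iteration(P, R, 0.9, m=1, n_outer=2)         # two VI sweeps
V_m200, pi_m200 = modified_policy_iteration(P, R, 0.9, m=200, n_outer=5)
print("m=1, 2 outer steps:  ", np.round(V_m1, 4))
print("m=200, 5 outer steps:", np.round(V_m200, 4), pi_m200)
```

Intermediate values of `m` trade cheaper outer steps against more of them, which is the same trade-off discussed in the complexity comparison above.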

Generalized policy iteration#

Both policy iteration and value iteration are special cases of a unifying framework: Generalized Policy Iteration (GPI).

GPI describes any algorithm that maintains both a value function $V$ and a policy $\pi$ , and alternates between:

Evaluation steps: make $V$ more consistent with $\pi$ (partial or full)
Improvement steps: make $\pi$ more greedy with respect to $V$

The key result is that, in the tabular setting, interleaving partial evaluation with greedy improvement still converges to $V^*$ and $\pi^*$ , as long as both processes keep running and every state continues to be updated. The special cases are:

| Algorithm | Evaluation steps per cycle | Improvement steps per cycle |
|---|---|---|
| Policy iteration | Full convergence of $V^{\pi_k}$ | One full greedy pass |
| Value iteration | One Bellman update | One full greedy pass (implicit in $\max$ ) |
| Actor-critic (deep RL) | $n$ TD steps (partial) | One gradient step (partial) |

GPI is the conceptual framework that connects DP to deep RL. An actor-critic algorithm is GPI with function approximation: the critic performs partial policy evaluation using TD learning (Week 4), and the actor performs partial policy improvement using a policy gradient step. The evaluation and improvement are no longer exact — they are stochastic, approximate, and interleaved at every timestep — but the GPI structure is preserved.

Asynchronous and prioritized DP#

Standard (synchronous) DP sweeps through all states in every iteration. This is impractical when $|\mathcal{S}|$ is large. Two important extensions relax this:

Asynchronous DP#

Asynchronous DP updates states in any order, not necessarily all at once. The only requirement is that every state is updated infinitely often (in the limit). Asynchronous updates converge to $V^*$ under the same contraction argument — each update moves the updated state's value closer to $V^*$ , and the contraction ensures the rest of the value function follows.

In practice, asynchronous DP enables:

focusing computation on states that are frequently visited,
real-time DP (updating states as they are visited by the agent),
and efficient implementation on non-uniform state spaces.
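An in-place, one-state-at-a-time update is the simplest asynchronous scheme. The sketch below assumes the 4-state chain and picks the state to update uniformly at random with a fixed seed; the selection rule and the helper name `backup` are illustrative choices, and any rule that keeps selecting every state would do.

```python
import random

GAMMA = 0.9

# P[s][a] = [(probability, next_state, reward), ...] for the 4-state chain.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 0, -1.0)]},
    1: {0: [(0.9, 0, 0.0), (0.1, 1, 0.0)], 1: [(0.9, 2, 0.0), (0.1, 1, 0.0)]},
    2: {0: [(0.9, 1, 0.0), (0.1, 2, 0.0)], 1: [(0.9, 3, 0.0), (0.1, 2, 0.0)]},
    3: {0: [(1.0, 3, 1.0)], 1: [(1.0, 3, 1.0)]},
}

def backup(V, s):
    """Bellman optimality backup for a single state."""
    return max(
        sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]) for a in P[s]
    )

rng = random.Random(0)
V = [0.0] * 4
for _ in range(2000):
    s = rng.randrange(4)   # no fixed sweep order; every state keeps being selected
    V[s] = backup(V, s)    # in place: later updates see this new value immediately

print([round(v, 4) for v in V])
```

There is no separate copy of the value function here, in contrast to the synchronous implementation, which builds `V_new` from the old `V` before swapping.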

Prioritized sweeping#

Prioritized sweeping selects states for update based on the magnitude of their Bellman error:

\delta(s) = \left| V(s) - \max_a\left[R(s,a) + \gamma\sum_{s'} P(s'|s,a) V(s')\right]\right|

States with large Bellman error have the most inaccurate value estimates and benefit most from being updated. Prioritized sweeping maintains a priority queue ordered by $\delta(s)$ and updates the highest-priority state at each step, propagating updates backward through predecessor states.

Prioritized sweeping is the direct precursor to prioritized experience replay (PER) in deep RL: instead of sampling uniformly from a replay buffer, PER samples transitions with probability proportional to their TD error — the model-free analog of $\delta(s)$ . The same intuition applies: transitions where the current value estimate is most wrong carry the most learning signal.
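The selection rule can be sketched without the queue machinery. The code below is a simplification, not full prioritized sweeping: it assumes the 4-state chain and replaces the priority queue and predecessor bookkeeping with a linear scan for the state of largest Bellman error, which is affordable only because there are four states. The names `backup` and `bellman_error` are illustrative.

```python
GAMMA = 0.9
TOL = 1e-10

# P[s][a] = [(probability, next_state, reward), ...] for the 4-state chain.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 0, -1.0)]},
    1: {0: [(0.9, 0, 0.0), (0.1, 1, 0.0)], 1: [(0.9, 2, 0.0), (0.1, 1, 0.0)]},
    2: {0: [(0.9, 1, 0.0), (0.1, 2, 0.0)], 1: [(0.9, 3, 0.0), (0.1, 2, 0.0)]},
    3: {0: [(1.0, 3, 1.0)], 1: [(1.0, 3, 1.0)]},
}

def backup(V, s):
    """Bellman optimality backup for a single state."""
    return max(
        sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]) for a in P[s]
    )

def bellman_error(V, s):
    """delta(s) = |V(s) - max_a [R + gamma * E V(s')]|."""
    return abs(V[s] - backup(V, s))

V = [0.0] * 4
n_updates = 0
while n_updates < 10_000:
    # Linear scan stands in for popping the top of a priority queue.
    s = max(range(4), key=lambda i: bellman_error(V, i))
    if bellman_error(V, s) < TOL:
        break                      # every state is consistent up to TOL
    V[s] = backup(V, s)
    n_updates += 1

print(n_updates, [round(v, 4) for v in V])
```

A real implementation would keep the errors in a heap and, after each update, refresh only the predecessors of the updated state instead of rescanning everything.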

Why dynamic programming does not scale#

Dynamic programming requires enumerating all states, all actions, and knowing exact transition dynamics. The resulting complexity is $O(|\mathcal{S}|^2 |\mathcal{A}|)$ per sweep — quadratic in the state space due to the sum over next states $s'$ .

What Breaks Here: The curse of dimensionality

The state space grows exponentially in the number of state variables. A robot with 12 joint angles discretized to 100 positions each has $100^{12} = 10^{24}$ states — DP cannot enumerate them. The complexity per sweep is $O(|\mathcal{S}|^2 |\mathcal{A}|)$ ; for $10^{24}$ states, a single full sweep is physically impossible.

This is not a limitation of the algorithm designer; it is a fundamental barrier. Exact DP requires enumerating all states and transitions, which is only tractable for small, discrete MDPs. For real-world problems — continuous control, vision-based tasks, language generation — an entirely different computational approach is required. This is why the rest of the course is not "DP for bigger problems."

GenAI context: the LLM state space#

Treating language generation as an MDP, the state is the full context window. For an LLM with vocabulary size $\approx 50{,}000$ and context length $4{,}000$ tokens:

|\mathcal{S}| \approx 50{,}000^{4000}

This number is roughly $10^{18{,}796}$ , whereas the number of atoms in the observable universe is commonly estimated at around $10^{80}$ . Exact dynamic programming is fundamentally impossible for language models. This is not an engineering limitation; it is a mathematical one. The only viable approaches are:

Function approximation: represent $V$ or $Q$ as a neural network that generalizes across states, rather than a lookup table.
Sampling-based methods: estimate Bellman targets from experience rather than exact model-based sums.
Policy-based methods: represent and optimize $\pi$ directly without computing $V$ .

Every algorithm from Week 4 onward is a response to this impossibility. DP defines the ideal target; the rest of the course develops tractable approximations.
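The two state-space sizes quoted in this section are easy to check by working in logarithms. This is a small sketch using only the standard library; the function name `log10_states` is illustrative, and the figure of about $10^{80}$ atoms is the commonly quoted order-of-magnitude estimate, used here only for comparison.

```python
import math

def log10_states(values_per_variable, n_variables):
    """log10 of values_per_variable ** n_variables, without forming the huge integer."""
    return n_variables * math.log10(values_per_variable)

robot_log10 = log10_states(100, 12)         # 12 joints, 100 positions each
llm_log10 = log10_states(50_000, 4_000)     # vocabulary 50,000, context 4,000 tokens
ATOMS_LOG10 = 80                            # assumed order of magnitude, for comparison

print(f"robot: about 10^{robot_log10:.0f} states")
print(f"LLM:   about 10^{llm_log10:.0f} states")
print(f"LLM exceeds the atom count by about 10^{llm_log10 - ATOMS_LOG10:.0f}")
```

Since a synchronous sweep costs on the order of $|\mathcal{S}|^2 |\mathcal{A}|$ operations, the exponent for one sweep is roughly double the exponent printed for the state count.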

From DP to deep RL: the precise mappings#

The "conceptual importance" claim — that DP underlies modern RL — can be made precise:

Q-learning is sampled value iteration on $Q^*$ #

Value iteration on $Q^*$ :

Q_{k+1}(s,a) \leftarrow R(s,a) + \gamma\sum_{s'} P(s'|s,a) \max_{a'} Q_k(s',a')

Q-learning replaces the exact model-based expectation with a single sampled transition $(s, a, r, s')$ :

Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma\max_{a'} Q(s',a') - Q(s,a)\right]

The bracketed term is the TD error — the discrepancy between the current estimate and the one-step Bellman target. Q-learning is value iteration where the model-based expectation $\sum_{s'} P(s'|s,a)(\cdot)$ is replaced by a Monte Carlo sample of a single next state $s'$ . Deep Q-Networks (DQN) further replace the tabular $Q$ with a neural network.
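The replacement of the expectation by a sample can be seen in a tabular sketch on the 4-state chain. The assumptions here are illustrative: state-action pairs are drawn uniformly at random (a stand-in for exploration, since the boundary states are absorbing), the step size `ALPHA` is a constant 0.1, and `sample_transition` plays the role of the environment. The learner never reads `P` directly; it only sees sampled $(r, s')$ pairs.

```python
import random

GAMMA, ALPHA = 0.9, 0.1

# The model lives inside the "environment"; the update rule below never indexes it.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 0, -1.0)]},
    1: {0: [(0.9, 0, 0.0), (0.1, 1, 0.0)], 1: [(0.9, 2, 0.0), (0.1, 1, 0.0)]},
    2: {0: [(0.9, 1, 0.0), (0.1, 2, 0.0)], 1: [(0.9, 3, 0.0), (0.1, 2, 0.0)]},
    3: {0: [(1.0, 3, 1.0)], 1: [(1.0, 3, 1.0)]},
}

def sample_transition(rng, s, a):
    """Environment step: returns one sampled (reward, next_state) pair."""
    outcomes = P[s][a]
    _, s_next, r = rng.choices(outcomes, weights=[o[0] for o in outcomes], k=1)[0]
    return r, s_next

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(4)]
for _ in range(50_000):
    s, a = rng.randrange(4), rng.randrange(2)            # uniform exploration
    r, s_next = sample_transition(rng, s, a)
    td_error = r + GAMMA * max(Q[s_next]) - Q[s][a]      # sampled Bellman target minus estimate
    Q[s][a] += ALPHA * td_error

greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(4)]
print("greedy actions (0=L, 1=R):", greedy)
```

With a constant step size the estimates keep fluctuating around $Q^*$ rather than converging exactly; a decaying step size is the standard remedy.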

Actor-critic is GPI with approximate evaluation and improvement#

Policy iteration runs full policy evaluation before each improvement. Actor-critic methods run one TD step of evaluation (the critic update) interleaved with one gradient step of improvement (the actor update) at every timestep. This is GPI with both steps truncated to a single stochastic gradient update. The critic approximates $V^\pi$ or $Q^\pi$ using TD learning; the actor uses the critic's output to estimate the policy gradient and improve $\pi$ .

Update rate balance matters: if the actor learns much faster than the critic, policy gradients are computed from a stale critic, accumulating systematic bias that can push the policy in the wrong direction. If the critic learns much faster than the actor, it tracks $V^{\pi_\theta}$ for a nearly fixed policy, which is stable but slow, and it must re-track a new target every time the actor moves. Stable actor-critic training typically uses similar learning rates or a somewhat faster critic (the two-timescale condition), sometimes with a frozen target network for the critic updated on a slower schedule.

RLHF uses Bellman bootstrapping in the PPO fine-tuning loop#

RLHF trains a reward model on preference data and then fine-tunes the language model using PPO. PPO is an actor-critic algorithm and uses TD-style bootstrapped value estimates (the critic's $V$ function) to compute advantage estimates. The Bellman structure — the critic is trained to be consistent with the Bellman expectation equation — is present even in the RLHF setting, though the reward signal comes from a learned reward model rather than an analytic function.
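The bootstrapped advantage can be written in its simplest one-step form, $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ . The sketch below is illustrative only: the function name `td_advantages` and the numbers are made up for the example, and PPO implementations often use a multi-step generalization of this quantity rather than the one-step version shown.

```python
def td_advantages(rewards, values, gamma):
    """One-step TD advantages A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    values must have len(rewards) + 1 entries: the critic's estimate for every
    visited state, with the final entry for the state after the last reward
    (0.0 if the trajectory ended there).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs one more entry than rewards")
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]

# Hypothetical 3-step trajectory: reward only at the end, critic values per state.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]
advantages = td_advantages(rewards, values, gamma=0.9)
print([round(a, 4) for a in advantages])
```

A positive entry means the observed step turned out better than the critic predicted, so the actor update raises the probability of the action taken there.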

Key takeaways#

The lecture develops a progression from the ideal to the tractable. Policy evaluation solves for $V^\pi$ exactly by iterating the Bellman expectation operator to its fixed point. Policy improvement replaces $\pi$ with the greedy policy with respect to $V^\pi$ , and the improvement theorem — proved by unrolling the greedy recursion — guarantees monotone improvement. Policy iteration alternates these two steps and converges to $\pi^*$ because the policy sequence is non-decreasing in value and the policy space is finite. Value iteration merges the two steps into a single Bellman optimality update, converging to $V^*$ by the contraction mapping property. Generalized policy iteration unifies both as special cases of a partial evaluation / partial improvement cycle — the same structure that underlies actor-critic methods in deep RL.

Q-functions are developed in parallel with V-functions and are the natural target for model-free RL because they encode the optimal policy without requiring $P$ . Asynchronous DP and prioritized sweeping extend the framework to practical non-uniform computation and are the direct precursors of experience replay and PER. And the LLM state-space calculation makes concrete why function approximation and sampling are not optional refinements but mathematical necessities.

Conceptual questions#

Apply two full iterations of value iteration to the 4-state chain example with $\gamma = 0.9$ , starting from $V_0 = \mathbf{0}$ . Compute $V_2(s_2)$ and $V_2(s_3)$ explicitly. At which iteration does the greedy policy first become the optimal policy "always go right"?
Prove that policy iteration terminates in finite time for a finite MDP. Your proof should use the policy improvement theorem and a counting argument on the number of deterministic policies. Why does the same argument fail if the action space is continuous?
Value iteration converges to $V^*$ but not necessarily to any policy's value function at intermediate iterates. Explain why the greedy policy $\pi_k(s) = \arg\max_a Q_k(s,a)$ may still be suboptimal even when $\|V_k - V^*\|_\infty$ is small. What bound guarantees that $\pi_k$ is near-optimal?
Write out the Bellman optimality equation for $Q^*(s,a)$ and identify exactly which term changes when moving from model-based value iteration to the model-free Q-learning update. What statistical assumption justifies replacing the expectation with a single sample?
An actor-critic algorithm trains a critic $V_\phi(s)$ using TD(0) and an actor $\pi_\theta(a|s)$ using policy gradient. Map each component onto the GPI framework: which step is the evaluation step, which is the improvement step, and in what sense are both steps "partial"? What goes wrong if the critic is updated much faster than the actor, or vice versa?

Solutions (conceptual questions)

Starting from $V_0 = \mathbf{0}$ , the first sweep gives $V_1(s_1) = -1$ , $V_1(s_4) = 1$ , and $V_1(s_2) = V_1(s_3) = 0$ . Then $V_2(s_3) = 0.9(0.9\cdot 1 + 0.1\cdot 0) = 0.81$ and $V_2(s_2) = 0.9(0.9\cdot 0 + 0.1\cdot 0) = 0$ . Comparing greedy actions, $Q_2(s_2, R) = 0 > Q_2(s_2, L) = -0.81$ and $Q_2(s_3, R) = 0.81 > 0$ , so the optimal policy "always go right" is first recovered at iteration 2 (the boundary reward then moves one state further into the interior with each iteration).
A finite MDP has finitely many deterministic policies ( $|A|^{|S|}$ ). The policy improvement theorem guarantees each step yields a policy that is no worse, and strictly better unless already optimal, so values increase monotonically and no policy repeats — a strictly improving sequence over a finite set must terminate. With continuous actions the counting argument fails: there are uncountably many policies, so no finite bound exists.
Acting greedily w.r.t. an approximate $V_k$ can select the wrong action wherever competing $Q$ -values are close, so $\pi_k$ may be suboptimal even when $\|V_k - V^*\|_\infty$ is small. The guarantee is $\|V^{\pi_k} - V^*\|_\infty \le \frac{2\gamma}{1-\gamma}\|V_k - V^*\|_\infty$ , so $\pi_k$ is near-optimal once the value error is small relative to $(1-\gamma)$ .
$Q^*(s,a) = R(s,a) + \gamma\sum_{s'} P(s'\mid s,a)\max_{a'} Q^*(s',a')$ . Going to Q-learning, the expectation $\sum_{s'} P(\cdot)\max_{a'} Q$ is replaced by a single sampled transition: target $r + \gamma\max_{a'} Q(s',a')$ . This is justified by stochastic approximation — the sample is an unbiased estimate of the expectation, valid given sufficient exploration and a decaying step size.
The TD(0) critic is the policy-evaluation step; the policy-gradient actor is the policy-improvement step. Both are "partial": the critic does one bootstrapped step rather than a full evaluation, and the actor takes a gradient nudge rather than a full greedy max. If the critic is far faster than the actor it tracks the policy well (stable but slow); if the actor is faster it improves against a stale, inaccurate value estimate, which destabilizes training — hence the two-timescale rule that the critic should adapt faster than the actor.

Coding exercises#

Exercise 6: Asynchronous and prioritized variants. Implement two variants of value iteration for the 4-state chain MDP defined in the worked example:

Asynchronous VI: on each sweep, update states in a randomly shuffled order instead of the fixed order $s_1, s_2, s_3, s_4$ .
Prioritized sweeping: maintain a max-heap ordered by Bellman error $|\delta(s)| = |V(s) - \max_a[R(s,a) + \gamma\sum_{s'} P(s'|s,a) V(s')]|$ . At each step, pop the highest-error state, update it, then recompute the Bellman error of its predecessor states and push them back into the heap.

For each variant, plot $\|V_k - V^*\|_\infty$ versus total number of individual state updates (not full sweeps). Which variant converges fastest in terms of state updates? Does the advantage of prioritized sweeping grow or shrink as the chain length increases?

Extension prompt. Implement a discount annealing schedule: start with $\gamma_0 = 0.5$ and linearly increase toward $\gamma = 0.9$ over the first 50 iterations of value iteration. Compare the convergence curve against standard value iteration with fixed $\gamma = 0.9$ on the 4-state chain. Does a low initial discount accelerate early convergence? Why might this trick be useful for problems with sparse or delayed rewards?

Looking ahead#

The next lecture removes the assumption of known dynamics. We will study Monte Carlo and Temporal-Difference (TD) learning, which estimate value functions directly from sampled experience. The key insight is that the Bellman equations can be used as regression targets rather than exact update rules — replacing the model-based expectation $\sum_{s'} P(s'|s,a)V(s')$ with samples of $V(s')$ from observed transitions. This replacement is what makes RL scale to problems where $P$ is unknown or intractable.
