Skip to main content
illumin8
Courses
Week 1: Reinforcement Learning Problem Formulation
Reinforcement Learning
01Week 1: Reinforcement Learning Problem Formulation
02Week 2: Multi-Armed Bandits
03Week 3: Dynamic Programming for Finite MDPs
04Week 4: Monte Carlo and Temporal-Difference Learning
05Week 5: Function Approximation in Reinforcement Learning
06Week 6: Deep Q-Learning and Variants
07Week 7: Policy Gradient and Actor–Critic Methods
08Week 8: Modern Deep Reinforcement Learning Algorithms
09Week 9: Exploration, Partial Observability, and Multi-Agent Reinforcement Learning
10Week 10: Model-Based Reinforcement Learning and Planning
11Week 11: Offline Reinforcement Learning
12Week 12: Reinforcement Learning from Human Feedback
13Week 13: Direct Preference Optimization and GRPO
14Week 14: Agentic Systems and Course Capstone
Week 1

Week 1: Reinforcement Learning Problem Formulation

✦Learning Outcomes
  • Formulate sequential decision problems as Markov Decision Processes (MDPs)
  • Define and compute returns with discount factors
  • Explain the reward hypothesis and its implications for goal-directed behavior
  • Derive Bellman expectation and optimality equations
  • Distinguish between episodic and continuing tasks
◆Prerequisites

Background knowledge assumed:

  • Basic probability theory (probability distributions, expectation)
  • Linear algebra (vectors, matrices, matrix operations)
  • Python programming fundamentals

Recommended: Review the Course 1 Index for an overview of the learning path.

Reinforcement learning studies sequential decision making under uncertainty. Unlike supervised learning, where data is static, an RLReinforcement Learning agent's actions actively influence the future—creating a defining feedback loop of perception and action.

How do we formalize goal-directed behavior? We build toward this by asking what kind of signal the agent should maximize, and how that signal relates to the underlying environment dynamics.

Purpose of this lecture

Here is the central puzzle: How can we reduce the dazzling complexity of goal-directed behavior—a robot learning to walk, a language model generating coherent text, a game-playing agent discovering a novel strategy—into a single clean principle?

Reinforcement learning answers this by reducing everything to one idea: maximize expected cumulative reward. This sounds simple. It is not. The moment you accept this reduction, you confront three immediate tensions:

  1. How do we specify what matters? Reward design is brutally hard. Write the wrong reward, and the agent optimizes the letter of your goal while violating its spirit. This is not a minor engineering problem—it is a foundational gap that the entire field grapples with.

  2. How do we handle uncertainty about the future? The agent doesn't know what will happen when it acts. How do we reason about long-term consequences when the world is stochastic and partially hidden?

  3. How do we make this tractable? Maximizing a sum over infinitely many future time steps is not computable as stated. What mathematical structure lets us solve it?

This lecture answers all three by introducing the MDPMarkov Decision Process formalism, value functions and the Bellman equation. These concepts are not approximations or conveniences—they are the exact mathematical encoding of sequential decision making. Every algorithm studied later—tabular methods, deep RLReinforcement Learning, RLHFReinforcement Learning from Human Feedback, GRPOGroup Relative Policy Optimisation, and agentic systems—is either an instantiation or an extension of the theory introduced here.

By the end of this lecture, you will understand not just what the Bellman equation is, but why it is the inevitable consequence of formalizing sequential decision making. You will also understand where it breaks and what engineering is required to make it work in the real world.


Sequential decision making

In reinforcement learning, an agent interacts with an environment over discrete time steps. This interaction forms a closed feedback loop:

  • At time step ttt, the agent observes a state sts_tst​
  • It selects an action ata_tat​
  • The environment transitions to a new state st+1s_{t+1}st+1​
  • The agent receives a scalar reward rt+1r_{t+1}rt+1​

The agent's objective is not to maximize immediate reward, but to act so as to maximize cumulative reward over time.

✦Intuition: Why not just optimize immediate reward?

Consider a chess player who values only the next move's position, ignoring everything downstream. They would always capture any available piece — even at the cost of a decisive positional disadvantage later. Greedily maximizing immediate reward produces the same failure: an agent rewarded for cleaning a room might learn to disable its own sensors so it can no longer detect dirt, achieving maximum reward by eliminating the ability to perceive problems. The temporal coupling of actions and consequences is not a complication to be engineered away — it is the defining structure of the problem.

This temporal coupling distinguishes reinforcement learning from:

  • Supervised learning, where data is fixed and independent of the learner's actions
  • Standard optimization, where decisions are made once rather than sequentially

The notation rt+1r_{t+1}rt+1​ for the reward received after taking action ata_tat​ follows the Sutton & Barto convention: the reward is indexed by the time at which it is received, not the time at which the action that caused it was taken. This is a purely notational choice but it recurs throughout the literature and in any codebase following S&B conventions.

To formalize "maximize cumulative reward over time" we need two things: a precise description of how the world works (an MDPMarkov Decision Process), and a precise notion of what cumulative reward means (the return). We build toward both by first asking what kind of signal the agent should be trying to maximize.


The reward hypothesis

A foundational idea in reinforcement learning is the Reward Hypothesis:

All goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (reward).

This hypothesis suggests that even the most sophisticated behaviors—mastering a game of chess, the fluid locomotion of a humanoid robot, or the nuanced generation of human-like text—can be reduced to maximizing expected cumulative reward. It is a claim, not a mathematical truth, and the course can be read as a sustained examination of how far it holds and where it breaks down.

A concrete starting point

Before examining where the hypothesis succeeds and fails, consider the simplest case where it clearly works.

Gridworld navigation. An agent moves through a 4×44 \times 44×4 grid to reach a goal cell in the lower-right corner. The reward function assigns +1+1+1 upon reaching the goal and 000 at every other step. This is unambiguous and easy to specify. An agent maximizing expected cumulative reward will learn to navigate efficiently to the goal. The reward hypothesis applies cleanly.

Atari Pong. Reward is +1+1+1 when the agent scores a point and −1-1−1 when the opponent scores. Again unambiguous. A policy maximizing this reward will beat the opponent. DeepMind's DQNDeep Q-Network (2013) demonstrated that a single algorithm, with no game-specific engineering, can learn superhuman performance from this signal alone — a striking validation of the hypothesis.

These cases share a key property: the reward function fully captures the intended goal. The difficulty arises when they diverge.

Why reward design matters

The hypothesis is silent on how to specify reward. In practice, reward design is one of the hardest problems in RLReinforcement Learning:

Productive uses — in modern AI systems, the reward hypothesis underpins:

  • RLHFReinforcement Learning from Human Feedback (Reinforcement Learning from Human Feedback)
  • Preference optimization (GRPOGroup Relative Policy Optimisation, DPODirect Preference Optimization, PPOProximal Policy Optimisation)
  • Agentic systems optimized via success signals

Design challenges — reward design is one of the hardest problems in RLReinforcement Learning:

  • Sparse rewards: signal arrives only at task completion, making learning slow
  • Dense rewards: accelerate learning but risk reward hacking
  • Reward misspecification: the gap between the written reward and the actual goal

Reward design is revisited throughout the course. The tension between what is easy to specify and what is safe to optimize is one of the running themes.

With the reward hypothesis accepted as a working assumption, the next step is to formalize the environment in which the agent operates — the mechanism by which actions produce transitions and rewards.


Markov Decision Processes (MDPs)

The canonical mathematical model for reinforcement learning is the Markov Decision Process (MDPMarkov Decision Process).

An MDPMarkov Decision Process is defined by the tuple:

(S, A, P, R, γ)(\mathcal{S},\, \mathcal{A},\, P,\, R,\, \gamma)(S,A,P,R,γ)

where:

  • S\mathcal{S}S is the state space,
  • A\mathcal{A}A is the action space,
  • P(s′∣s,a)P(s' \mid s, a)P(s′∣s,a) is the transition probability kernel,
  • R(s,a)R(s, a)R(s,a) is the expected reward function,
  • γ∈[0,1)\gamma \in [0,1)γ∈[0,1) is the discount factor.

The Markov property

The defining assumption of an MDPMarkov Decision Process is the Markov property:

P(st+1∣st,at,st−1,at−1,…)=P(st+1∣st,at)P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)P(st+1​∣st​,at​,st−1​,at−1​,…)=P(st+1​∣st​,at​)

The future is conditionally independent of the past given the current state and action. The current state sts_tst​ contains all information relevant to predicting what happens next.

✦Intuition: Why the Markov property matters

The Markov property is a compression claim: it says that no matter how long the history, the current state contains all predictive power about the future. This is what makes sequential decision making tractable. Without it, the value of a state would depend on an unbounded past, and no finite representation of "value" would exist. The price of this compression is that you must engineer your state to be sufficient for prediction.

Markovianity is engineered, not given

In practice, raw observations from the environment are almost never Markov. This is not a minor implementation detail — it is one of the most important points in the lecture.

Consider a robot that observes only joint positions but not joint velocities. The next state depends on the current velocity, which is not in the observation. Two robots at the same position but different velocities will transition to different next states, so the transition is non-Markov in the observed state. A language model that receives only the most recent token cannot distinguish "The bank charges fees" from "The bank slopes gently" — the same token ("bank") with entirely different meaning depending on history. In both cases, the true state is not captured in the current observation.

Systems where the agent receives observations oto_tot​ that are functions of an underlying hidden state xtx_txt​ are formally Partially Observable MDPs (POMDPs):

ot=O(xt)+noiseo_t = \mathcal{O}(x_t) + \text{noise}ot​=O(xt​)+noise

POMDPs are the general case; MDPs are the special case where ot=xto_t = x_tot​=xt​. Most real-world problems are POMDPs, and this gap is a primary source of failure in deployed RLReinforcement Learning systems.

⚠What Breaks Here

If you apply MDP algorithms (Q-learning, policy gradient, value iteration) to a POMDP without state engineering:

  • The value function Vπ(st)V^\pi(s_t)Vπ(st​) is ill-defined: the same observation can produce vastly different returns depending on hidden history.
  • Learning instability: the algorithm updates V(st)V(s_t)V(st​) based on experienced returns, but the same sts_tst​ in a later time step may come from a different history and produce a different return.
  • Deployment failure: a policy trained on stochastic transitions will not generalize to new instances of the same observation because the hidden state differs.

This is why every practical RLReinforcement Learning system begins with state engineering, not by hoping the observations are Markov.

The standard engineering responses to non-Markovianity are:

  • State augmentation: add velocity to a position-only observation; add contact forces to a proprioceptive observation. Extend sts_tst​ until the Markov property holds approximately.
  • Observation stacking: concatenate the last kkk observations into a single input (e.g., frame stacking in Atari, observation history windows in locomotion RLReinforcement Learning). This approximates a sufficient statistic for the hidden state.
  • Recurrent architectures: use an LSTM or transformer to maintain a compressed representation of history, effectively learning a belief state. This is the learned analog of the Kalman filter discussed in the dynamics course.
◆Why Not just ignore history?

One might ask: why not just learn a stochastic policy π(a∣ot)\pi(a|o_t)π(a∣ot​) that maps observations to actions without pretending the observations are Markov? The answer is that value functions—the core of RLReinforcement Learning algorithms—require deterministic transitions for convergence. A value function Vπ(s)V^\pi(s)Vπ(s) is only well-defined when the same state always produces the same distribution of returns. In a POMDP, this breaks. Modern approaches like transformer-based recurrent models effectively solve this by letting the model learn a belief state, but that belief state—not the raw observation—becomes the effective state.

GenAI context: MDPs for language models

To ground this abstraction, consider RLReinforcement Learning applied to large language models, as in RLHFReinforcement Learning from Human Feedback:

| MDPMarkov Decision Process component | LLMLarge Language Model interpretation | |---|---| | State sts_tst​ | Prompt + all tokens generated so far | | Action ata_tat​ | Next token selected from vocabulary | | Transition PPP | Appending selected token to context | | Reward rt+1r_{t+1}rt+1​ | Scalar score from a reward model | | Discount γ\gammaγ | Relative value of earlier vs later tokens |

From this perspective, language generation is a sequential decision process, and alignment techniques are specialized RLReinforcement Learning algorithms operating in this MDPMarkov Decision Process. The state space is the full context window — this is what makes the MDPMarkov Decision Process Markov for LLMs: the complete token history is available as the state.

With the environment formalized as an MDPMarkov Decision Process, the agent's behavior must be specified. A policy defines how the agent maps states to actions — and its properties determine whether the agent can learn effectively at all.


Policies

A policy specifies the agent's behavior as a mapping from states to actions:

π(a∣s)=P(at=a∣st=s)\pi(a \mid s) = P(a_t = a \mid s_t = s)π(a∣s)=P(at​=a∣st​=s)

Policies may be:

  • Deterministic: π(s)=a\pi(s) = aπ(s)=a, selecting a single action per state. Common in deployment and in control-theoretic settings.
  • Stochastic: sampling actions from a probability distribution. Essential for exploration during training, and fundamental to policy gradient methods.
✦Intuition: Why stochasticity matters for learning

In a deterministic policy π(s)=a\pi(s) = aπ(s)=a, the agent will visit the same states and actions repeatedly. It learns nothing about the consequences of alternative actions. A stochastic policy π(a∣s)\pi(a|s)π(a∣s) allows the agent to explore: with some probability it takes a suboptimal action, observes its consequences, and refines its estimate of value. The exploration-exploitation tradeoff is fundamentally enabled by policy stochasticity.

In LLMs, the policy is the language model itself: a stochastic policy over the vocabulary at each token position, with temperature controlling the degree of stochasticity.

Given a policy, we need to quantify how good it is — not just at the current step, but over the entire future trajectory it will generate. This leads to the return.


Returns and discounting

The agent's objective is formalized via the return, the discounted cumulative reward from time step ttt onward:

Gt=∑k=0∞γkrt+k+1G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}Gt​=k=0∑∞​γkrt+k+1​

This single equation encodes three distinct ideas. Understanding all three is critical because they come apart in practice.

Three interpretations of the discount factor γ\gammaγ

1. Convergence guarantee. For continuing (non-terminating) tasks, the infinite sum ∑k=0∞rt+k+1\sum_{k=0}^\infty r_{t+k+1}∑k=0∞​rt+k+1​ may diverge if rewards are bounded away from zero. Multiplying by γk<1\gamma^k < 1γk<1 ensures Gt≤Rmax⁡/(1−γ)G_t \leq R_{\max} / (1 - \gamma)Gt​≤Rmax​/(1−γ), guaranteeing a finite return. This is a mathematical necessity, not a modeling choice. If γ=1\gamma = 1γ=1, returns are unbounded.

2. Time preference. γ\gammaγ encodes how much the agent discounts future rewards relative to immediate ones. γ→1\gamma \to 1γ→1 approaches equal weighting of all future rewards; γ→0\gamma \to 0γ→0 makes the agent myopic, caring only about immediate reward. The effective planning horizon is approximately 1/(1−γ)1 / (1 - \gamma)1/(1−γ) steps.

3. Survival probability. γ\gammaγ can be interpreted as the probability that the episode continues at each step: with probability 1−γ1 - \gamma1−γ, the episode ends and no further reward is received. Under this interpretation, discounting is not a modeling choice but a consequence of stochastic episode length. This connects episodic and continuing task formulations in a unified framework.

◆Why Not set $\gamma = 1$?

In a continuing task with bounded positive rewards, setting γ=1\gamma = 1γ=1 makes the return diverge: Gt=∑k=0∞rt+k+1→∞G_t = \sum_{k=0}^\infty r_{t+k+1} \to \inftyGt​=∑k=0∞​rt+k+1​→∞. This breaks all algorithms. You might think the solution is simple: cap γ<1\gamma < 1γ<1. But there is a deeper issue. When γ=1\gamma = 1γ=1, the Bellman operator T\mathcal{T}T loses its contraction property (discussed below). Without contraction, value iteration doesn't converge, and the entire dynamic programming machinery collapses. Discount factors are not optional for continuing tasks; they are structural.

For episodic tasks with a natural terminal state, you can set γ=1\gamma = 1γ=1 if you treat the terminal state as absorbing (zero reward, self-loops). The episode ends, so the sum is finite regardless.

A warning on γ\gammaγ and policy gradient bias

In policy gradient methods (studied in a later lecture), using γ<1\gamma < 1γ<1 introduces a subtle theoretical issue: the discounted policy gradient does not exactly optimize the true undiscounted return. This matters in continuing tasks where the intended objective is undiscounted cumulative reward. The bias is well-understood theoretically but frequently overlooked in practice. We return to this when we derive policy gradient algorithms.

Episodic vs continuing tasks

  • Episodic tasks terminate at a terminal state (games, robotic manipulation episodes, single LLMLarge Language Model responses). The return is a finite sum.
  • Continuing tasks run indefinitely (control systems, long-running agents). The return requires discounting for finiteness.

For episodic tasks, terminal states are often treated as absorbing states with zero reward and self-loops: once reached, the agent stays there and receives zero reward forever. This unifies the episodic and continuing formalisms and is the standard implementation in most RLReinforcement Learning libraries.

The return GtG_tGt​ is a random variable: it depends on the stochastic policy, the stochastic transitions, and the full future trajectory. Rather than work with this random variable directly, we take its expectation — yielding the value function, the central object of RLReinforcement Learning theory.


Value functions

Value functions quantify how good it is to be in a state or take an action, under a given policy π\piπ. They are not directly observable; they must be estimated from experience or derived from the environment model.

State-value function

Vπ(s)=Eπ[Gt∣st=s]=Eπ[∑k=0∞γkrt+k+1  |  st=s]V^\pi(s) = \mathbb{E}_\pi \left[ G_t \mid s_t = s \right] = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s \right]Vπ(s)=Eπ​[Gt​∣st​=s]=Eπ​[k=0∑∞​γkrt+k+1​​st​=s]

This is the expected cumulative discounted reward, starting from state sss and following policy π\piπ thereafter.

Action-value function (Q-function)

Qπ(s,a)=Eπ[Gt∣st=s,at=a]Q^\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid s_t = s, a_t = a \right]Qπ(s,a)=Eπ​[Gt​∣st​=s,at​=a]

This conditions on both state and action: the expected return from taking action aaa in state sss, then following π\piπ.

✦Intuition: State-value vs action-value

The state-value Vπ(s)V^\pi(s)Vπ(s) is like asking "how good is this situation?" The action-value Qπ(s,a)Q^\pi(s,a)Qπ(s,a) is like asking "how good is this situation if I first take action aaa?" The Q-function is more informative: it directly tells you which action to prefer. The state-value is a summary—it's the average Q-value weighted by the policy's action probabilities. This is why control algorithms prefer Q-functions: they directly encode the action preference.

The relationship between them:

Vπ(s)=∑aπ(a∣s) Qπ(s,a)V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)Vπ(s)=a∑​π(a∣s)Qπ(s,a)

Value functions convert a sequential decision problem into an estimation problem over expected returns. The optimal value functions V∗V^*V∗ and Q∗Q^*Q∗ encode the solution to the RLReinforcement Learning problem: if you know Q∗(s,a)Q^*(s, a)Q∗(s,a), the optimal policy is simply π∗(s)=arg⁡max⁡aQ∗(s,a)\pi^*(s) = \arg\max_a Q^*(s,a)π∗(s)=argmaxa​Q∗(s,a).

Value functions defined as infinite sums are not directly computable. What makes them tractable is a recursive structure — the Bellman equations — which express the value of a state in terms of the values of its immediate successors.


Bellman equations

The Bellman equations express value functions recursively, relating the value of the current state to the values of successor states. They are the mathematical spine of almost every RLReinforcement Learning algorithm.

Bellman expectation equation

Vπ(s)=∑aπ(a∣s)[R(s,a)+γ∑s′P(s′∣s,a) Vπ(s′)]V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s') \right]Vπ(s)=a∑​π(a∣s)[R(s,a)+γs′∑​P(s′∣s,a)Vπ(s′)]

This is a system of linear equations in VπV^\piVπ. Given a fixed policy π\piπ, the transition model PPP, and reward function RRR, solving for VπV^\piVπ is a linear algebra problem of size ∣S∣|\mathcal{S}|∣S∣. This is policy evaluation — and its tractability (for small state spaces) is the foundation of dynamic programming.

✦Intuition: Fixed-point iteration

This equation says: the value of a state equals the immediate reward, plus the discounted value of the next state. Notice the circularity: Vπ(s)V^\pi(s)Vπ(s) depends on Vπ(s′)V^\pi(s')Vπ(s′), and by induction, on values of all future states. Yet somehow, this system has a unique solution. The reason is the discount factor γ<1\gamma < 1γ<1: each step into the future is geometrically weighted less. The infinite recursion converges because we're summing a geometric series. Computationally, you can solve this by treating it as a linear system (direct inversion), or by iterating the equation until convergence (value iteration). Both give the same answer.

Bellman optimality equation

V∗(s)=max⁡a[R(s,a)+γ∑s′P(s′∣s,a) V∗(s′)]V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s') \right]V∗(s)=amax​[R(s,a)+γs′∑​P(s′∣s,a)V∗(s′)]

This is a nonlinear fixed-point equation in V∗V^*V∗. Unlike the expectation equation, it cannot be solved by a single linear solve — the max⁡\maxmax makes it nonlinear. Yet it has a unique solution, and iterative methods converge to it geometrically.

✦Intuition: Optimality via max

The optimality equation says: the best value at a state is the maximum over actions of (immediate reward plus discounted future value). This is a principle of optimality: an optimal policy chooses the action that maximizes immediate reward plus best-achievable future value. This is intuitive but powerful: it lets you turn an exponential-time problem (searching all policies) into a polynomial problem (iterating value function updates).

The contraction mapping property

Define the Bellman optimality operator T\mathcal{T}T:

(TV)(s)=max⁡a[R(s,a)+γ∑s′P(s′∣s,a) V(s′)](\mathcal{T} V)(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \right](TV)(s)=amax​[R(s,a)+γs′∑​P(s′∣s,a)V(s′)]

T\mathcal{T}T is a γ\gammaγ-contraction in the ℓ∞\ell_\inftyℓ∞​ norm:

∥TV−TV′∥∞≤γ∥V−V′∥∞\|\mathcal{T} V - \mathcal{T} V'\|_\infty \leq \gamma \|V - V'\|_\infty∥TV−TV′∥∞​≤γ∥V−V′∥∞​

By the Banach fixed-point theorem, repeated application of T\mathcal{T}T converges geometrically to the unique fixed point V∗V^*V∗, at rate γ\gammaγ per iteration. This is value iteration, and the contraction property is why it works. It also explains why γ<1\gamma < 1γ<1 is not just a modeling choice: γ=1\gamma = 1γ=1 destroys the contraction and convergence is no longer guaranteed. The discount factor is both a modeling parameter and an algorithmic one.

⚠What Breaks Here

If γ=1\gamma = 1γ=1:

  • The ℓ∞\ell_\inftyℓ∞​ distance between TV\mathcal{T}VTV and TV′\mathcal{T}V'TV′ is not contracted; in fact, it can stay the same or grow.
  • Value iteration enters a cycle or diverges rather than converging to V∗V^*V∗.
  • More fundamentally, V∗V^*V∗ may not be unique or may not exist. The fixed-point theorem requires contraction; without it, no guarantee.

This is why discount factors are not optional in continuing tasks. The mathematics requires them for tractability, not as a convenience.

Connection to the LQR Riccati equation

Students of the robotics control lectures will recognize the Bellman optimality equation. The discrete-time Algebraic Riccati Equation (DARE):

P=Q+A⊤PA−A⊤PB(R+B⊤PB)−1B⊤PAP = Q + A^\top P A - A^\top P B (R + B^\top P B)^{-1} B^\top P AP=Q+A⊤PA−A⊤PB(R+B⊤PB)−1B⊤PA

is the Bellman optimality equation for the special case of linear dynamics and quadratic cost. PPP is the optimal value function; the Riccati equation is its fixed-point characterization. LQR solves the RLReinforcement Learning problem in closed form under linearity and quadratic cost. The Bellman framework and the LQR framework are the same framework at different levels of generality.

Worked example: Bellman backup on a two-state MDPMarkov Decision Process

Setup. Consider the following MDPMarkov Decision Process:

  • States: S={s1, s2}\mathcal{S} = \{s_1,\, s_2\}S={s1​,s2​}
  • Actions: A={stay, switch}\mathcal{A} = \{\text{stay},\, \text{switch}\}A={stay,switch}
  • Transitions: deterministic — stay keeps the agent in the current state, switch moves it to the other state.
  • Rewards: R(s1,⋅)=1R(s_1, \cdot) = 1R(s1​,⋅)=1, R(s2,⋅)=0R(s_2, \cdot) = 0R(s2​,⋅)=0 (reward depends only on the state).
  • Discount: γ=0.9\gamma = 0.9γ=0.9.

Policy π1\pi_1π1​: always stay. Write the Bellman expectation equations:

Vπ1(s1)=1+0.9 Vπ1(s1)Vπ1(s2)=0+0.9 Vπ1(s2)V^{\pi_1}(s_1) = 1 + 0.9\, V^{\pi_1}(s_1) \qquad V^{\pi_1}(s_2) = 0 + 0.9\, V^{\pi_1}(s_2)Vπ1​(s1​)=1+0.9Vπ1​(s1​)Vπ1​(s2​)=0+0.9Vπ1​(s2​)

Solving the linear system:

Vπ1(s1)=11−0.9=10Vπ1(s2)=01−0.9=0V^{\pi_1}(s_1) = \frac{1}{1 - 0.9} = 10 \qquad V^{\pi_1}(s_2) = \frac{0}{1 - 0.9} = 0Vπ1​(s1​)=1−0.91​=10Vπ1​(s2​)=1−0.90​=0

Interpretation. Starting in s1s_1s1​, the agent collects +1+1+1 at every step forever. The discounted sum ∑k=0∞0.9k=11−0.9=10\sum_{k=0}^\infty 0.9^k = \frac{1}{1-0.9} = 10∑k=0∞​0.9k=1−0.91​=10 matches. Starting in s2s_2s2​, the agent collects 000 forever.

Policy π2\pi_2π2​: always switch. The agent alternates s1→s2→s1→⋯s_1 \to s_2 \to s_1 \to \cdotss1​→s2​→s1​→⋯, collecting rewards 1,0,1,0,…1, 0, 1, 0, \ldots1,0,1,0,…. The Bellman equations couple the two states:

Vπ2(s1)=1+0.9 Vπ2(s2)Vπ2(s2)=0+0.9 Vπ2(s1)V^{\pi_2}(s_1) = 1 + 0.9\, V^{\pi_2}(s_2) \qquad V^{\pi_2}(s_2) = 0 + 0.9\, V^{\pi_2}(s_1)Vπ2​(s1​)=1+0.9Vπ2​(s2​)Vπ2​(s2​)=0+0.9Vπ2​(s1​)

Substituting the second into the first: Vπ2(s1)=1+0.81 Vπ2(s1)V^{\pi_2}(s_1) = 1 + 0.81\, V^{\pi_2}(s_1)Vπ2​(s1​)=1+0.81Vπ2​(s1​), giving Vπ2(s1)=10.19≈5.26V^{\pi_2}(s_1) = \frac{1}{0.19} \approx 5.26Vπ2​(s1​)=0.191​≈5.26.

Conclusion. π1\pi_1π1​ dominates: Vπ1(s1)=10>5.26=Vπ2(s1)V^{\pi_1}(s_1) = 10 > 5.26 = V^{\pi_2}(s_1)Vπ1​(s1​)=10>5.26=Vπ2​(s1​). The optimal policy is to stay in s1s_1s1​ and exploit the high-reward state — which the Bellman optimality equation confirms by selecting max⁡a\max_amaxa​ at s1s_1s1​.

Mini coding exercise: iterative policy evaluation

Implement the policy evaluation above using the iterative (not direct) method, and verify convergence to the analytical solution.

import numpy as np

# MDP definition
gamma = 0.9

# States: 0 = s1, 1 = s2. Actions: 0 = stay, 1 = switch.
# P[s, a, s'] = P(s' | s, a)
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # from s1: stay->s1, switch->s2
    [[0.0, 1.0], [1.0, 0.0]],   # from s2: stay->s2, switch->s1
])
R = np.array([1.0, 0.0])        # R(s1) = 1, R(s2) = 0

# Policy pi1: always stay (action 0 in both states)
pi = np.array([0, 0])

# Iterative policy evaluation
# Initial guess of zeros — the contraction property guarantees convergence
# regardless of the starting estimate.
V = np.zeros(2)
for i in range(1000):
    # Bellman expectation backup: R(s) + γ Σ_{s'} P(s'|s,π(s)) V(s')
    # np.dot computes the expected future value over successor states.
    V_new = np.array([
        R[s] + gamma * np.dot(P[s, pi[s]], V)
        for s in range(2)
    ])
    # Convergence check using the infinity norm (max absolute difference).
    # A small value means we've reached the fixed point of the Bellman operator.
    if np.max(np.abs(V_new - V)) < 1e-8:
        print(f"Converged after {i+1} iterations")
        break
    # Synchronous (Jacobi-style) update: all states updated using the old V.
    V = V_new

print(f"V(s1) = {V[0]:.4f}")   # Expected: 10.0
print(f"V(s2) = {V[1]:.4f}")   # Expected: 0.0

Extension: Change pi = np.array([1, 1]) (always switch) and verify V(s1)≈5.26V(s_1) \approx 5.26V(s1​)≈5.26. Then implement full policy iteration: alternate between policy evaluation and greedy policy improvement (pi[s] = argmax_a ...) until the policy stops changing. Verify it converges to pi = [0, 0].


Taxonomy of reinforcement learning problems

Understanding where RLReinforcement Learning problems sit in this taxonomy tells you which algorithms are applicable and what assumptions you can rely on.

Bandits vs full MDPs

  • Multi-armed bandits: no state transitions; actions yield immediate reward. The agent must balance exploration (trying uncertain actions) and exploitation (taking the best-known action).
  • Full MDPs: actions affect future states and long-term outcomes. Bandit reasoning applies locally at each state, but value functions must account for downstream consequences.

Bandits reappear throughout the course: in recommendation systems, A/B testing, and as the core mechanism in RLHFReinforcement Learning from Human Feedback (where human feedback on response pairs is a bandit signal).

Online vs offline reinforcement learning

  • Online RLReinforcement Learning: the agent interacts with the environment during learning. Experience is generated by the current (or recent) policy. Standard algorithms — Q-learning, PPOProximal Policy Optimisation, SACSoft Actor-Critic — are online.
  • Offline RLReinforcement Learning: learning occurs from a fixed dataset collected by some behavior policy, without further environment interaction. This is critical in safety-sensitive domains (robotics, healthcare, alignment) where online exploration is expensive or dangerous.

The key technical challenge in offline RLReinforcement Learning is distributional shift: the learned policy may assign high value to state-action pairs that are rare or absent in the dataset. Without the ability to collect new experience, these value estimates cannot be corrected, leading to overoptimistic Q-values and unstable policy improvement. This problem has no analog in supervised learning — a classifier trained on a fixed dataset does not synthesize new inputs from regions of low data density.

◆Why Not just use supervised learning for offline RL?

One might naively apply behavioral cloning (BC): train a supervised classifier to imitate the actions in the offline dataset. The classifier will match the behavior policy's performance, but cannot improve beyond it. RLReinforcement Learning offers the possibility of improvement — finding policies better than those in the dataset. But this improvement requires careful reasoning about distributional shift. If the dataset rarely shows action aaa in state sss, and the learned policy tries to execute aaa in sss, the value estimate will be unreliable and the policy improvement step will fail. Modern offline RLReinforcement Learning methods (CQL, IQL, AWR) explicitly constrain or penalize extrapolation to avoid this trap. The sophistication lies in knowing when you can and cannot trust the data.

Model-free vs model-based RLReinforcement Learning

  • Model-free RLReinforcement Learning: learns policies or value functions directly from experience, without explicitly representing or learning the transition dynamics PPP.
  • Model-based RLReinforcement Learning: learns or uses a model of the environment's dynamics for planning. Enables higher sample efficiency by generating synthetic experience from the model, at the cost of bias when the model is imperfect.

These categories exist on a spectrum rather than as a clean binary. Dyna-style algorithms interleave real experience with simulated rollouts from a learned model — they are simultaneously model-free and model-based. Dreamer and MuZero learn abstract latent-space world models that are not interpretable as physical dynamics models but are used for planning. The binary framing is a useful starting point but breaks down for most modern methods.


Relationship to supervised learning and control theory

Supervised learning

Supervised learning optimizes a fixed loss over a static dataset. The learner's outputs do not influence what data it sees next. RLReinforcement Learning breaks this assumption: the policy determines what states are visited, which determines what experience is collected, which determines what the policy learns — a feedback loop that makes the learning problem fundamentally non-stationary and requires explicit management of exploration.

◆Why Not treat RL as supervised learning on historical data?

One might imagine treating RLReinforcement Learning as a supervised problem: collect a dataset of (s,a)(s, a)(s,a) pairs, and learn a classifier. This is behavioral cloning, and it only reproduces the data distribution. To improve beyond the data requires reasoning about consequences—which is sequential, causal, and non-stationary. Supervised learning has no machinery for this. The Bellman equation is the missing piece: it lets you reason about what would happen under counterfactual policies, not just what did happen.

Control theory

Classical control theory solves related problems but under different assumptions:

  • Known dynamics: control assumes PPP is given analytically; RLReinforcement Learning treats it as unknown or learns it.
  • Stability guarantees: control provides formal guarantees (Lyapunov stability, ISS) that RLReinforcement Learning algorithms generally do not.
  • Function approximation: classical control works analytically in low-to-moderate dimensional spaces; RLReinforcement Learning scales to high-dimensional spaces by approximating VVV and π\piπ with neural networks, trading convergence guarantees for scalability.

The connection is exact at the boundary: LQR is RLReinforcement Learning with known linear dynamics and quadratic cost, solved analytically. Moving from LQR to RLReinforcement Learning is moving from known to unknown dynamics, from analytical to learned value functions, and from guaranteed convergence to empirical stability. Understanding this gradient — rather than treating control and RLReinforcement Learning as separate disciplines — is one of the goals of this course.


Connections to modern methods

The formalism introduced in this lecture is not only foundational — it is directly instantiated in state-of-the-art algorithms. The following examples show how each element of the MDPMarkov Decision Process tuple maps to a concrete design decision in recent systems.

DreamerV3: Bellman backups in latent space

DreamerV3 (Hafner et al., 2023) achieves strong performance across diverse domains — robotics, Atari, NetHack — from pixels alone. Its architecture maps directly onto the MDPMarkov Decision Process formalism:

| MDPMarkov Decision Process concept | DreamerV3 implementation | |---|---| | State sts_tst​ | Compact latent vector from a recurrent state-space model (RSSM) | | Transition PPP | Learned dynamics model in latent space | | Reward RRR | Learned reward predictor from latent state | | Value V∗V^*V∗ | Neural network trained via Bellman backup over imagined rollouts | | POMDPPartially Observable Markov Decision Process gap | RSSM maintains a belief state over raw observations — directly addressing the Markovianity problem |

The key insight: once a world model is learned, Bellman backups can be applied to imagined trajectories without interacting with the real environment. Every imagined step is a Bellman backup; every imagined rollout is value iteration in latent space. DreamerV3 is model-based RLReinforcement Learning at scale — the same equations, applied inside a learned model.

TDTemporal Difference-MPC: value-weighted planning

TDTemporal Difference-MPC (Hansen et al., 2022) combines model predictive control (MPC) with Bellman-based value learning. The key design choices map to the formalism as follows:

  • A learned latent world model provides short-horizon rollouts (the transition PPP).
  • A learned value function Q∗(s,a)Q^*(s, a)Q∗(s,a), trained via the Bellman optimality equation, estimates cumulative reward beyond the planning horizon.
  • At decision time, action sequences are sampled and evaluated by combining simulated returns (within the planning horizon) with Q-value bootstrapping (beyond it).

The core insight: the Bellman Q-function acts as a surrogate for long-horizon returns, allowing the planner to terminate rollouts early without losing information about the future. This is exactly why value functions are useful — they compress an infinite sum into a single scalar that can be queried at any state.

Understanding these methods requires only the formalism from this lecture. The algorithms add function approximation, learned dynamics, and planning; the underlying equations are unchanged.


Key takeaways

The concepts introduced here form the foundation for every algorithm in the course. The chain runs as follows: the reward hypothesis motivates reducing goal-directed behavior to scalar maximization, while immediately surfacing the challenge of reward specification. MDPs formalize the sequential interaction, with the Markov property as a structural assumption that must be engineered in practice. Policies define behavior; value functions quantify its long-term consequences. The Bellman equations express value functions recursively — linearly for policy evaluation, as a nonlinear fixed point for optimality — and the contraction mapping property is what makes iterative solution tractable. The discount factor simultaneously ensures convergence, encodes time preference, and introduces a policy gradient bias that resurfaces later. And the LQR/Riccati connection unifies this formalism with the control theory developed in preceding lectures: RLReinforcement Learning is generalized optimal control.


Common pitfalls in RLReinforcement Learning problem formulation

These mistakes appear repeatedly in both research and practice. Recognizing them early prevents compounding errors in implementation and design.

1. Non-Markovian state representation. Using raw sensor observations as states without verifying the Markov property. Symptom: the value function fails to converge or oscillates despite a correct implementation. Diagnosis: place the agent in the same observation from different histories and check whether subsequent transitions differ. Fix: state augmentation (add velocity, contact forces) or observation stacking.

2. Reward hacking. Optimizing a specified reward that does not fully capture the intended goal. Symptom: high reward but poor task performance; the agent found a policy that satisfies the letter of the reward, not the spirit. Prevention: before training, adversarially ask "what behavior maximizes this reward in an unintended way?" A robot rewarded purely for joint speed will fall forward. An LLMLarge Language Model rewarded for approval ratings will learn to flatter.

3. Setting γ=1\gamma = 1γ=1 in a continuing task. Returns diverge; the Bellman operator loses its contraction property; value iteration does not converge. Symptom: numerical overflow or NaN values during training. Fix: use γ<1\gamma < 1γ<1, or reformulate the task as episodic with a meaningful terminal condition.

4. Confusing episodic and continuing formulations. A robot manipulation task with a fixed episode length is episodic; a thermostat control task running indefinitely is continuing. Applying the wrong formulation produces an ill-defined objective. Fix: determine whether terminal states exist. If not, the task is continuing and requires discounting for a finite return.

5. Ignoring distributional shift in offline RLReinforcement Learning. In offline settings, the dataset was collected by a behavior policy; the policy being learned is the target policy. Ignoring this distinction leads to overestimated Q-values in regions not covered by the dataset, causing catastrophic failure outside the data distribution. This has no direct analog in supervised learning — a classifier does not synthesize inputs from data-sparse regions.


◆Matrix form of the Bellman expectation equation

For a fixed policy π\piπ over a finite state space, the Bellman expectation equation is a system of ∣S∣|\mathcal{S}|∣S∣ linear equations. It can be written in matrix form as:

Vπ=Rπ+γPπVπ⟹(I−γPπ) Vπ=RπV^\pi = R^\pi + \gamma P^\pi V^\pi \quad\Longrightarrow\quad (I - \gamma P^\pi)\, V^\pi = R^\piVπ=Rπ+γPπVπ⟹(I−γPπ)Vπ=Rπ

where Rsπ=∑aπ(a∣s) R(s,a)R^\pi_s = \sum_a \pi(a|s)\, R(s,a)Rsπ​=∑a​π(a∣s)R(s,a) and Pss′π=∑aπ(a∣s) P(s′∣s,a)P^\pi_{ss'} = \sum_a \pi(a|s)\, P(s'|s,a)Pss′π​=∑a​π(a∣s)P(s′∣s,a).

This is the standard Ax=bAx = bAx=b linear system with A=I−γPπA = I - \gamma P^\piA=I−γPπ, x=Vπx = V^\pix=Vπ, b=Rπb = R^\pib=Rπ. The matrix AAA is always invertible when γ<1\gamma < 1γ<1: since PπP^\piPπ is a stochastic matrix its eigenvalues lie in [−1,1][-1, 1][−1,1], so the eigenvalues of I−γPπI - \gamma P^\piI−γPπ are all in (1−γ,1+γ](1-\gamma, 1+\gamma](1−γ,1+γ], none zero. The unique solution is Vπ=(I−γPπ)−1RπV^\pi = (I - \gamma P^\pi)^{-1} R^\piVπ=(I−γPπ)−1Rπ.

Conceptual questions

  1. A robot observes only joint positions, not velocities. Describe precisely why this violates the Markov property. Propose two different state engineering strategies that restore approximate Markovianity, and explain what each one assumes about the system.

  2. You set γ=1.0\gamma = 1.0γ=1.0 in a value iteration algorithm on a continuing task with bounded positive rewards. What happens to GtG_tGt​, and why does the Bellman optimality operator lose its contraction property? What is the minimum change to the problem setup that restores convergence?

  3. An RLHFReinforcement Learning from Human Feedback system trains a language model to receive high scores from a reward model trained on human preference data. After deployment, users report that the model is more persuasive but less accurate. Explain this failure in terms of the reward hypothesis and reward misspecification. What would you change in the training setup?

  4. Compare the Bellman expectation equation for policy evaluation with the solution of a system of linear equations Ax=bAx = bAx=b. Identify what plays the role of AAA, xxx, and bbb, and explain why the system is always solvable (i.e., why AAA is always invertible for γ<1\gamma < 1γ<1).

  5. An offline RLReinforcement Learning algorithm is trained on a dataset of robot manipulation demonstrations. After training, the policy performs well on demonstrations similar to the training data but fails catastrophically on slightly different configurations. Explain this failure in terms of distributional shift and overestimated Q-values. How does this differ from the analogous failure mode in behavioral cloning?


✦Solutions
  1. Missing velocities. Without velocity, the future is not determined by the current observation — it depends on history — so the Markov property fails. (a) Stack a window of recent positions: assumes velocity is recoverable by finite differencing (the system is approximately finite-order). (b) Append a filter/observer estimate of velocity: assumes a known, identifiable dynamics model. Both restore approximate Markovianity by re-injecting the missing dynamical information.
  2. γ=1\gamma = 1γ=1 on a continuing task. Gt=∑trtG_t = \sum_t r_tGt​=∑t​rt​ diverges for bounded positive rewards over an infinite horizon, and the Bellman optimality operator is only a γ\gammaγ-contraction in sup-norm — with modulus γ=1\gamma = 1γ=1 it is no longer a contraction, so there is no unique fixed point. Minimum fix: set γ<1\gamma < 1γ<1, make the task episodic with an absorbing terminal state, or switch to an average-reward formulation.
  3. RLHF persuasive but inaccurate. The reward model is a proxy, and the reward hypothesis only delivers the desired behavior if reward is correctly specified. Here the RM rewarded human approval/persuasiveness rather than truth — reward misspecification — so the policy hacks the proxy, producing convincing but wrong outputs. Fix by adding factuality/groundedness signals to the reward, collecting preference data that penalizes confident falsehoods, and keeping a KL anchor to the reference model.
  4. Bellman as Ax=bAx=bAx=b. Vπ=Rπ+γPπVπV^\pi = R^\pi + \gamma P^\pi V^\piVπ=Rπ+γPπVπ rearranges to (I−γPπ)Vπ=Rπ(I - \gamma P^\pi)V^\pi = R^\pi(I−γPπ)Vπ=Rπ, so A=I−γPπA = I-\gamma P^\piA=I−γPπ, x=Vπx = V^\pix=Vπ, b=Rπb = R^\pib=Rπ. Since PπP^\piPπ is stochastic its spectral radius is 1, so γPπ\gamma P^\piγPπ has spectral radius ≤γ<1\le \gamma < 1≤γ<1 and I−γPπI-\gamma P^\piI−γPπ has all eigenvalues bounded away from zero — always invertible, giving a unique VπV^\piVπ.
  5. Offline distributional shift. The policy queries QQQ at out-of-distribution actions where it is overestimated (no data to correct it), so it confidently selects bad actions and fails off-distribution. This differs from behavioral cloning, whose failure is covariate shift / compounding error — BC drifts into unseen states by imitation but does not actively exploit overestimated values; offline RL's failure is value extrapolation specific to bootstrapped learning driving the policy toward unsupported actions.

Looking ahead

The next lecture studies the simplest nontrivial RLReinforcement Learning setting: multi-armed bandits. This isolates the exploration-exploitation tradeoff — the core challenge of RLReinforcement Learning — without the additional complexity of state transitions. Bandit algorithms and regret analysis provide the theoretical tools that extend, with modification, to full MDPs and to the RLHFReinforcement Learning from Human Feedback feedback mechanisms studied later in the course.

Further reading

  • Bellman, R. (1957). Dynamic Programming. Princeton University Press. The foundational text introducing the Bellman equation and the principle of optimality.
  • Sutton & Barto (2018). Reinforcement Learning: An Introduction (2nd ed.), Chapter 3: Finite Markov Decision Processes. The standard textbook reference; freely available online.
  • Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley. The comprehensive mathematical treatment; essential for rigorous proofs of convergence and the contraction mapping results used here.
  • Silver, D. (2015). UCL Course on RL, Lecture 2: Markov Decision Processes. Lecture slides and video cover the same material with worked gridworld examples. Freely available online.
  • Bertsekas, D. P. (2019). Reinforcement Learning and Optimal Control. Athena Scientific. Bridges the gap between classical control theory (including LQR/Riccati) and modern RL; directly relevant to the control theory connections made in this lecture.
Next →
Week 2: Multi-Armed Bandits
On this page
  • Purpose of this lecture
  • Sequential decision making
  • The reward hypothesis
  • A concrete starting point
  • Why reward design matters
  • Markov Decision Processes (MDPs)
  • The Markov property
  • Markovianity is engineered, not given
  • GenAI context: MDPs for language models
  • Policies
  • Returns and discounting
  • Three interpretations of the discount factor \gamma
  • A warning on \gamma and policy gradient bias
  • Episodic vs continuing tasks
  • Value functions
  • State-value function
  • Action-value function (Q-function)
  • Bellman equations
  • Bellman expectation equation
  • Bellman optimality equation
  • The contraction mapping property
  • Connection to the LQR Riccati equation
  • Worked example: Bellman backup on a two-state MDP
  • Mini coding exercise: iterative policy evaluation
  • Taxonomy of reinforcement learning problems
  • Bandits vs full MDPs
  • Online vs offline reinforcement learning
  • Model-free vs model-based RL
  • Relationship to supervised learning and control theory
  • Supervised learning
  • Control theory
  • Connections to modern methods
  • DreamerV3: Bellman backups in latent space
  • TD-MPC: value-weighted planning
  • Key takeaways
  • Common pitfalls in RL problem formulation
  • Conceptual questions
  • Looking ahead
  • Further reading