Week 1: Reinforcement Learning Problem Formulation

Reinforcement learning studies sequential decision making under uncertainty. Unlike supervised learning, where data is static, an RL agent's actions actively influence the future—creating a defining feedback loop of perception and action.

Purpose of this lecture#

Here is the central puzzle: How can we reduce the dazzling complexity of goal-directed behavior—a robot learning to walk, a language model generating coherent text, a game-playing agent discovering a novel strategy—into a single clean principle?

Reinforcement learning answers this by reducing everything to one idea: maximize expected cumulative reward. This sounds simple. It is not. The moment you accept this reduction, you confront three immediate tensions:

How do we specify what matters? Reward design is brutally hard. Write the wrong reward, and the agent optimizes the letter of your goal while violating its spirit. This is not a minor engineering problem—it is a foundational gap that the entire field grapples with.
How do we handle uncertainty about the future? The agent doesn't know what will happen when it acts. How do we reason about long-term consequences when the world is stochastic and partially hidden?
How do we make this tractable? Maximizing a sum over infinitely many future time steps is not computable as stated. What mathematical structure lets us solve it?

This lecture answers all three by introducing the MDP formalism, value functions and the Bellman equation. These concepts are not approximations or conveniences—they are the exact mathematical encoding of sequential decision making. Every algorithm studied later—tabular methods, deep RL, RLHF, GRPO, and agentic systems—is either an instantiation or an extension of the theory introduced here.

By the end of this lecture, you will understand not just what the Bellman equation is, but why it is the inevitable consequence of formalizing sequential decision making. You will also understand where it breaks and what engineering is required to make it work in the real world.

Sequential decision making#

In reinforcement learning, an agent interacts with an environment over discrete time steps. This interaction forms a closed feedback loop:

At time step $t$ , the agent observes a state $s_t$
It selects an action $a_t$
The environment transitions to a new state $s_{t+1}$
The agent receives a scalar reward $r_{t+1}$

The agent's objective is not to maximize immediate reward, but to act so as to maximize cumulative reward over time.

This temporal coupling distinguishes reinforcement learning from:

Supervised learning, where data is fixed and independent of the learner's actions
Standard optimization, where decisions are made once rather than sequentially

The notation $r_{t+1}$ for the reward received after taking action $a_t$ follows the Sutton & Barto convention: the reward is indexed by the time at which it is received, not the time at which the action that caused it was taken. This is a purely notational choice but it recurs throughout the literature and in any codebase following S&B conventions.
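The interaction loop and the reward-indexing convention can be made concrete with a minimal sketch. The `Corridor` environment, its `reset`/`step` interface, and the uniformly random policy are all hypothetical stand-ins chosen for illustration; they are not part of any particular library.

```python
import numpy as np

class Corridor:
    """Hypothetical toy environment: cells 0..4, start at 0, goal at the last cell."""

    def __init__(self, n_cells=5):
        self.n_cells = n_cells
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 moves left; walls clip the move.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_cells - 1)
        done = self.state == self.n_cells - 1
        reward = 1.0 if done else 0.0      # this is r_{t+1}, received after a_t
        return self.state, reward, done

rng = np.random.default_rng(0)
env = Corridor()
s = env.reset()
trajectory = []                            # entries: (t, s_t, a_t, r_{t+1}, s_{t+1})
for t in range(1000):
    a = int(rng.integers(2))               # uniformly random policy (an assumption)
    s_next, r_next, done = env.step(a)
    trajectory.append((t, s, a, r_next, s_next))
    s = s_next
    if done:
        break

t, s_t, a_t, r_next, s_next = trajectory[-1]
print(f"last step: t={t}, s_t={s_t}, a_t={a_t}, r_(t+1)={r_next}, s_(t+1)={s_next}")
```

Each stored tuple pairs the action taken at time $t$ with the reward and state that arrive at time $t+1$, which is exactly the indexing described above.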

To formalize "maximize cumulative reward over time" we need two things: a precise description of how the world works (an MDP), and a precise notion of what cumulative reward means (the return). We build toward both by first asking what kind of signal the agent should be trying to maximize.

The reward hypothesis#

A foundational idea in reinforcement learning is the Reward Hypothesis:

All goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (reward).

This hypothesis suggests that even the most sophisticated behaviors—mastering a game of chess, the fluid locomotion of a humanoid robot, or the nuanced generation of human-like text—can be reduced to maximizing expected cumulative reward. It is a claim, not a mathematical truth, and the course can be read as a sustained examination of how far it holds and where it breaks down.

A concrete starting point#

Before examining where the hypothesis succeeds and fails, consider the simplest case where it clearly works.

Gridworld navigation. An agent moves through a $4 \times 4$ grid to reach a goal cell in the lower-right corner. The reward function assigns $+1$ upon reaching the goal and $0$ at every other step. This is unambiguous and easy to specify. With a discount factor $\gamma < 1$ (introduced below), a goal reached sooner is worth more, so an agent maximizing expected cumulative reward will learn to navigate efficiently to the goal. The reward hypothesis applies cleanly.

Atari Pong. Reward is $+1$ when the agent scores a point and $-1$ when the opponent scores. Again unambiguous. A policy maximizing this reward will beat the opponent. DeepMind's DQN (2013) demonstrated that a single algorithm, with no game-specific engineering, can learn superhuman performance from this signal alone — a striking validation of the hypothesis.

These cases share a key property: the reward function fully captures the intended goal. The difficulty arises when they diverge.

Why reward design matters#

The hypothesis is silent on how to specify reward. In practice this cuts both ways:

Productive uses — in modern AI systems, the reward hypothesis underpins:

RLHF (Reinforcement Learning from Human Feedback)
Policy and preference optimization (PPO, GRPO, DPO)
Agentic systems optimized via success signals

Design challenges — reward design is one of the hardest problems in RL:

Sparse rewards: signal arrives only at task completion, making learning slow
Dense rewards: accelerate learning but risk reward hacking
Reward misspecification: the gap between the written reward and the actual goal

Reward design is revisited throughout the course. The tension between what is easy to specify and what is safe to optimize is one of the running themes.

With the reward hypothesis accepted as a working assumption, the next step is to formalize the environment in which the agent operates — the mechanism by which actions produce transitions and rewards.

Markov Decision Processes (MDPs)#

The canonical mathematical model for reinforcement learning is the Markov Decision Process (MDP).

An MDP is defined by the tuple:

(\mathcal{S},\, \mathcal{A},\, P,\, R,\, \gamma)

where:

$\mathcal{S}$ is the state space,
$\mathcal{A}$ is the action space,
$P(s' \mid s, a)$ is the transition probability kernel,
$R(s, a)$ is the expected reward function,
$\gamma \in [0,1)$ is the discount factor.

The Markov property#

The defining assumption of an MDP is the Markov property:

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)

The future is conditionally independent of the past given the current state and action. The current state $s_t$ contains all information relevant to predicting what happens next.

Markovianity is engineered, not given#

In practice, raw observations from the environment are almost never Markov. This is not a minor implementation detail — it is one of the most important points in the lecture.

Consider a robot that observes only joint positions but not joint velocities. The next state depends on the current velocity, which is not in the observation. Two robots at the same position but different velocities will transition to different next states, so the transition is non-Markov in the observed state. A language model that receives only the most recent token cannot distinguish "The bank charges fees" from "The bank slopes gently" — the same token ("bank") with entirely different meaning depending on history. In both cases, the true state is not captured in the current observation.

Systems where the agent receives observations $o_t$ that are functions of an underlying hidden state $x_t$ are formally Partially Observable MDPs (POMDPs):

o_t = \mathcal{O}(x_t) + \text{noise}

POMDPs are the general case; MDPs are the special case where $o_t = x_t$ . Most real-world problems are POMDPs, and this gap is a primary source of failure in deployed RL systems.

The standard engineering responses to non-Markovianity are:

State augmentation: add velocity to a position-only observation; add contact forces to a proprioceptive observation. Extend $s_t$ until the Markov property holds approximately.
Observation stacking: concatenate the last $k$ observations into a single input (e.g., frame stacking in Atari, observation history windows in locomotion RL). This approximates a sufficient statistic for the hidden state.
Recurrent architectures: use an LSTM or transformer to maintain a compressed representation of history, effectively learning a belief state. This is the learned analog of the Kalman filter discussed in the dynamics course.
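Observation stacking, the second response above, can be sketched in a few lines. The `ObservationStack` class and the `final_state` helper are hypothetical names introduced only for this illustration, and the position sequences are made-up numbers.

```python
from collections import deque
import numpy as np

class ObservationStack:
    """Keeps the last k observations and exposes them as one flat vector."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # Fill the window with copies of the first observation.
        obs = np.atleast_1d(np.asarray(obs, dtype=float))
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(obs)
        return self.state()

    def push(self, obs):
        # Appending to a full deque drops the oldest observation.
        self.frames.append(np.atleast_1d(np.asarray(obs, dtype=float)))
        return self.state()

    def state(self):
        return np.concatenate(list(self.frames))

def final_state(positions, k=2):
    stack = ObservationStack(k)
    state = stack.reset(positions[0])
    for p in positions[1:]:
        state = stack.push(p)
    return state

# Two histories that end at the same position but move in opposite directions.
state_a = final_state([0.0, 1.0, 2.0])   # moving right
state_b = final_state([4.0, 3.0, 2.0])   # moving left

# A finite difference over the stacked window recovers the hidden velocity.
velocity_a = state_a[-1] - state_a[-2]
velocity_b = state_b[-1] - state_b[-2]
print(state_a, velocity_a)
print(state_b, velocity_b)
```

The raw observation (position 2.0) is identical in both cases, but the stacked states differ, so a policy or value function reading the stack can tell the two situations apart. This relies on the assumption that a window of length $k$ is enough to recover the hidden quantity.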

GenAI context: MDPs for language models#

To ground this abstraction, consider RL applied to large language models, as in RLHF:

| MDP component | LLM interpretation |
|---|---|
| State $s_t$ | Prompt + all tokens generated so far |
| Action $a_t$ | Next token selected from vocabulary |
| Transition $P$ | Appending selected token to context |
| Reward $r_{t+1}$ | Scalar score from a reward model |
| Discount $\gamma$ | Relative value of earlier vs later tokens |

From this perspective, language generation is a sequential decision process, and alignment techniques are specialized RL algorithms operating in this MDP. The state space is the full context window — this is what makes the MDP Markov for LLMs: the complete token history is available as the state.
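As a rough illustration of this mapping, the sketch below treats the state as a tuple of token ids and the transition as deterministic concatenation. The `EOS` id, the `toy_reward_model` function, and the choice to give reward only at the end of the response are assumptions made for the example; a real reward model is a learned network, not a one-line heuristic.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]   # prompt plus all tokens generated so far
EOS = 0                   # hypothetical end-of-sequence token id

def transition(state: State, action: int) -> State:
    """Deterministic transition: append the chosen token to the context."""
    return state + (action,)

def toy_reward_model(state: State) -> float:
    """Placeholder score (fraction of distinct tokens); stands in for a learned model."""
    return len(set(state)) / len(state)

def rollout(prompt: State, actions: List[int],
            reward_model: Callable[[State], float]) -> Tuple[State, List[float]]:
    state = prompt
    rewards = []
    for a in actions:
        state = transition(state, a)
        done = a == EOS
        # Assumed setup: zero reward per token, one scalar score at the end.
        rewards.append(reward_model(state) if done else 0.0)
        if done:
            break
    return state, rewards

final_state, rewards = rollout((5, 7), [2, 2, EOS], toy_reward_model)
print(final_state, rewards)
```

Because the state carries the full token history, the next state depends only on the current state and the chosen action, which is the Markov property in this setting.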

With the environment formalized as an MDP, the agent's behavior must be specified. A policy defines how the agent maps states to actions — and its properties determine whether the agent can learn effectively at all.

Policies#

A policy specifies the agent's behavior as a mapping from states to actions:

\pi(a \mid s) = P(a_t = a \mid s_t = s)

Policies may be:

Deterministic: $\pi(s) = a$ , selecting a single action per state. Common in deployment and in control-theoretic settings.
Stochastic: sampling actions from a probability distribution. Essential for exploration during training, and fundamental to policy gradient methods.

In LLMs, the policy is the language model itself: a stochastic policy over the vocabulary at each token position, with temperature controlling the degree of stochasticity.
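The effect of temperature on a stochastic policy can be seen in a small sketch. The logits below are made-up numbers, and `softmax_policy` is a hypothetical helper, not a function from any specific framework.

```python
import numpy as np

def softmax_policy(logits, temperature=1.0):
    """Return action probabilities pi(a | s) from unnormalized logits."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                 # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.0])  # illustrative scores for three actions

p_hot = softmax_policy(logits, temperature=5.0)    # flatter, more exploratory
p_cold = softmax_policy(logits, temperature=0.1)   # close to deterministic

rng = np.random.default_rng(0)
action = int(rng.choice(len(logits), p=p_cold))    # sample a_t ~ pi(. | s_t)

print("T=5.0:", np.round(p_hot, 3))
print("T=0.1:", np.round(p_cold, 5))
print("sampled action:", action)
```

As the temperature approaches zero the distribution concentrates on the highest-scoring action, recovering a deterministic policy; higher temperatures spread probability mass across actions.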

Given a policy, we need to quantify how good it is — not just at the current step, but over the entire future trajectory it will generate. This leads to the return.

Returns and discounting#

The agent's objective is formalized via the return, the discounted cumulative reward from time step $t$ onward:

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

This single equation encodes three distinct ideas. Understanding all three is critical because they come apart in practice.

Three interpretations of the discount factor $\gamma$ #

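1. Convergence guarantee. For continuing (non-terminating) tasks, the undiscounted sum $\sum_{k=0}^\infty r_{t+k+1}$ can diverge, for example when rewards are bounded away from zero. Weighting the $k$-th reward by $\gamma^k$ with $\gamma < 1$ turns the sum into a geometric series: if $|r| \leq R_{\max}$ for every reward, then $|G_t| \leq R_{\max} \sum_{k=0}^\infty \gamma^k = R_{\max} / (1 - \gamma)$ , guaranteeing a finite return. For continuing tasks this is a mathematical necessity, not a modeling choice: with $\gamma = 1$ the return can be unbounded.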

2. Time preference. $\gamma$ encodes how much the agent discounts future rewards relative to immediate ones. $\gamma \to 1$ approaches equal weighting of all future rewards; $\gamma \to 0$ makes the agent myopic, caring only about immediate reward. The effective planning horizon is approximately $1 / (1 - \gamma)$ steps.

3. Survival probability. $\gamma$ can be interpreted as the probability that the episode continues at each step: with probability $1 - \gamma$ , the episode ends and no further reward is received. Under this interpretation, discounting is not a modeling choice but a consequence of stochastic episode length. This connects episodic and continuing task formulations in a unified framework.
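The bound and the effective horizon can be checked numerically. The sketch below assumes a constant reward of $1$ per step and truncates the infinite sum at 500 steps, which is far beyond the point where $\gamma^k$ is numerically negligible for $\gamma = 0.9$.

```python
def discounted_return(rewards, gamma):
    """Compute G_t via the backward recursion G_t = r_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

gamma = 0.9
rewards = [1.0] * 500                      # constant reward, long truncated horizon

g0 = discounted_return(rewards, gamma)
bound = 1.0 / (1.0 - gamma)                # R_max / (1 - gamma) with R_max = 1
horizon = round(1.0 / (1.0 - gamma))       # effective planning horizon in steps

# Share of the total return contributed by the first `horizon` steps.
share = discounted_return(rewards[:horizon], gamma) / g0

print(f"G_0 = {g0:.6f}, bound = {bound:.6f}")
print(f"effective horizon = {horizon} steps, share of return = {share:.3f}")
```

With $\gamma = 0.9$ the first ten steps account for roughly $1 - 0.9^{10} \approx 65\%$ of the return, which is the sense in which $1/(1-\gamma)$ is an approximate horizon rather than a hard cutoff.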

Why not set $\gamma = 1$?

In a continuing task with bounded positive rewards, setting $\gamma = 1$ makes the return diverge: $G_t = \sum_{k=0}^\infty r_{t+k+1} \to \infty$ . Any algorithm that estimates or compares returns then has no finite quantity to work with. You might think the fix is simply to require $\gamma < 1$ so that the sum is finite. But there is a deeper issue. When $\gamma = 1$ , the Bellman operator $\mathcal{T}$ loses its contraction property (discussed below). Without contraction, value iteration is no longer guaranteed to converge, and the dynamic programming machinery loses its foundation. Discount factors are not optional for continuing tasks; they are structural.

For episodic tasks with a natural terminal state, you can set $\gamma = 1$ if you treat the terminal state as absorbing (zero reward, self-loops). The episode ends, so the sum is finite regardless.

A warning on $\gamma$ and policy gradient bias#

In policy gradient methods (studied in a later lecture), using $\gamma < 1$ introduces a subtle theoretical issue: the discounted policy gradient does not exactly optimize the true undiscounted return. This matters in continuing tasks where the intended objective is undiscounted cumulative reward. The bias is well-understood theoretically but frequently overlooked in practice. We return to this when we derive policy gradient algorithms.

Episodic vs continuing tasks#

Episodic tasks terminate at a terminal state (games, robotic manipulation episodes, single LLM responses). The return is a finite sum.
Continuing tasks run indefinitely (control systems, long-running agents). The return requires discounting for finiteness.

For episodic tasks, terminal states are often treated as absorbing states with zero reward and self-loops: once reached, the agent stays there and receives zero reward forever. This unifies the episodic and continuing formalisms and is the standard implementation in most RL libraries.

The return $G_t$ is a random variable: it depends on the stochastic policy, the stochastic transitions, and the full future trajectory. Rather than work with this random variable directly, we take its expectation — yielding the value function, the central object of RL theory.

Value functions#

Value functions quantify how good it is to be in a state or take an action, under a given policy $\pi$ . They are not directly observable; they must be estimated from experience or derived from the environment model.

State-value function#

V^\pi(s) = \mathbb{E}_\pi \left[ G_t \mid s_t = s \right] = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s \right]

This is the expected cumulative discounted reward, starting from state $s$ and following policy $\pi$ thereafter.

Action-value function (Q-function)#

Q^\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid s_t = s, a_t = a \right]

This conditions on both state and action: the expected return from taking action $a$ in state $s$ , then following $\pi$ .

The relationship between them:

V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)

Value functions convert a sequential decision problem into an estimation problem over expected returns. The optimal value functions $V^*$ and $Q^*$ encode the solution to the RL problem: if you know $Q^*(s, a)$ , the optimal policy is simply $\pi^*(s) = \arg\max_a Q^*(s,a)$ .
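The two relationships above, $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)$ and greedy action selection from a Q-table, are each one line of array code. The Q-values and policy probabilities below are made-up numbers used only to show the mechanics.

```python
import numpy as np

# Hypothetical tabular values, indexed as Q[s, a] for 2 states and 2 actions.
Q = np.array([
    [1.0, 3.0],
    [2.0, 0.5],
])

# A stochastic policy, indexed as pi[s, a]; each row sums to 1.
pi = np.array([
    [0.5, 0.5],
    [0.9, 0.1],
])

# V(s) = sum_a pi(a | s) Q(s, a): a policy-weighted average over actions.
V = (pi * Q).sum(axis=1)

# Greedy action per state: argmax_a Q(s, a).
greedy = Q.argmax(axis=1)

print("V =", V)
print("greedy actions =", greedy)
```

If the table held $Q^*$, the greedy actions would define an optimal policy. If it holds $Q^\pi$ for some other policy, the same argmax is a policy improvement step rather than a final answer.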

Value functions defined as infinite sums are not directly computable. What makes them tractable is a recursive structure — the Bellman equations — which express the value of a state in terms of the values of its immediate successors.

Bellman equations#

The Bellman equations express value functions recursively, relating the value of the current state to the values of successor states. They are the mathematical spine of almost every RL algorithm.

Bellman expectation equation#

V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s') \right]

This is a system of linear equations in $V^\pi$ . Given a fixed policy $\pi$ , the transition model $P$ , and reward function $R$ , solving for $V^\pi$ is a linear algebra problem of size $|\mathcal{S}|$ . This is policy evaluation — and its tractability (for small state spaces) is the foundation of dynamic programming.

Bellman optimality equation#

V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s') \right]

This is a nonlinear fixed-point equation in $V^*$ . Unlike the expectation equation, it cannot be solved by a single linear solve — the $\max$ makes it nonlinear. Yet it has a unique solution, and iterative methods converge to it geometrically.

The contraction mapping property#

Define the Bellman optimality operator $\mathcal{T}$ :

(\mathcal{T} V)(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \right]

$\mathcal{T}$ is a $\gamma$ -contraction in the $\ell_\infty$ norm:

\|\mathcal{T} V - \mathcal{T} V'\|_\infty \leq \gamma \|V - V'\|_\infty

By the Banach fixed-point theorem, repeated application of $\mathcal{T}$ converges geometrically to the unique fixed point $V^*$ , at rate $\gamma$ per iteration. This is value iteration, and the contraction property is why it works. It also explains why $\gamma < 1$ is not just a modeling choice: $\gamma = 1$ destroys the contraction and convergence is no longer guaranteed. The discount factor is both a modeling parameter and an algorithmic one.
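Both claims, the contraction inequality and the convergence of value iteration, can be checked numerically. The sketch below uses a small randomly generated MDP (an arbitrary choice for illustration; any valid transition kernel and bounded reward would do) with $\gamma = 0.9$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Random transition kernel P[s, a, s'] with rows normalized to sum to 1.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))      # expected reward R[s, a]

def bellman_optimality(V):
    """(T V)(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]."""
    return (R + gamma * (P @ V)).max(axis=1)

# Contraction check on two arbitrary value vectors.
V1 = rng.normal(size=n_states)
V2 = rng.normal(size=n_states)
lhs = np.max(np.abs(bellman_optimality(V1) - bellman_optimality(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(f"||TV1 - TV2|| = {lhs:.4f} <= gamma * ||V1 - V2|| = {rhs:.4f}")

# Value iteration: apply T repeatedly until successive iterates agree.
V = np.zeros(n_states)
for k in range(10_000):
    V_next = bellman_optimality(V)
    gap = np.max(np.abs(V_next - V))
    V = V_next
    if gap < 1e-10:
        break

residual = np.max(np.abs(bellman_optimality(V) - V))
print(f"stopped after {k + 1} iterations, fixed-point residual = {residual:.2e}")
```

The gap between successive iterates shrinks by at least a factor of $\gamma$ per sweep, so the number of iterations needed grows roughly like $\log(1/\varepsilon) / \log(1/\gamma)$ for tolerance $\varepsilon$.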

Connection to the LQR Riccati equation#

Students of the robotics control lectures will recognize the Bellman optimality equation. The discrete-time Algebraic Riccati Equation (DARE):

P = Q + A^\top P A - A^\top P B (R + B^\top P B)^{-1} B^\top P A

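is the Bellman optimality equation for the special case of linear dynamics $x_{t+1} = A x_t + B u_t$ and quadratic cost $x^\top Q x + u^\top R u$ (undiscounted, with cost minimized rather than reward maximized). Here $P$ , $Q$ , and $R$ follow control-theory convention and are unrelated to the transition kernel, the Q-function, and the reward function used elsewhere in this lecture. The matrix $P$ parameterizes the optimal cost-to-go, $J^*(x) = x^\top P x$ , which is the cost-based counterpart of $V^*$ ; the Riccati equation is its fixed-point characterization. LQR solves the RL problem in closed form under linearity and quadratic cost. The Bellman framework and the LQR framework are the same framework at different levels of generality.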
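The fixed-point reading of the Riccati equation can be illustrated by iterating its right-hand side, in the same spirit as value iteration. The sketch below assumes a scalar system with $A = B = Q = R = 1$ (control-theory symbols, chosen purely for illustration), for which the fixed point satisfies $P^2 = P + 1$, so $P = (1 + \sqrt{5})/2$.

```python
import numpy as np

# Control-theory notation: A, B are dynamics matrices; Q, R are cost weights.
# These are NOT the RL Q-function or reward function.
A = np.array([[1.0]])
B = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])

P = np.zeros((1, 1))                       # start from zero cost-to-go
for i in range(200):
    # Gain term (R + B^T P B)^{-1} B^T P A, computed with a linear solve.
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = Q + A.T @ P @ A - A.T @ P @ B @ K
    converged = np.max(np.abs(P_next - P)) < 1e-12
    P = P_next
    if converged:
        break

golden = (1.0 + np.sqrt(5.0)) / 2.0
print(f"P after {i + 1} iterations: {P[0, 0]:.10f} (closed form: {golden:.10f})")
```

Each iteration is a Bellman backup restricted to quadratic value functions, which is why the recursion stays within the family $x^\top P x$ and only the matrix $P$ needs to be tracked.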

Worked example: Bellman backup on a two-state MDP#

Setup. Consider the following MDP:

States: $\mathcal{S} = \{s_1,\, s_2\}$
Actions: $\mathcal{A} = \{\text{stay},\, \text{switch}\}$
Transitions: deterministic — stay keeps the agent in the current state, switch moves it to the other state.
Rewards: $R(s_1, \cdot) = 1$ , $R(s_2, \cdot) = 0$ (reward depends only on the state).
Discount: $\gamma = 0.9$ .

Policy $\pi_1$ : always stay. Write the Bellman expectation equations:

V^{\pi_1}(s_1) = 1 + 0.9\, V^{\pi_1}(s_1) \qquad V^{\pi_1}(s_2) = 0 + 0.9\, V^{\pi_1}(s_2)

Solving the linear system:

V^{\pi_1}(s_1) = \frac{1}{1 - 0.9} = 10 \qquad V^{\pi_1}(s_2) = \frac{0}{1 - 0.9} = 0

Interpretation. Starting in $s_1$ , the agent collects $+1$ at every step forever. The discounted sum $\sum_{k=0}^\infty 0.9^k = \frac{1}{1-0.9} = 10$ matches. Starting in $s_2$ , the agent collects $0$ forever.

Policy $\pi_2$ : always switch. The agent alternates $s_1 \to s_2 \to s_1 \to \cdots$ , collecting rewards $1, 0, 1, 0, \ldots$ . The Bellman equations couple the two states:

V^{\pi_2}(s_1) = 1 + 0.9\, V^{\pi_2}(s_2) \qquad V^{\pi_2}(s_2) = 0 + 0.9\, V^{\pi_2}(s_1)

Substituting the second into the first: $V^{\pi_2}(s_1) = 1 + 0.81\, V^{\pi_2}(s_1)$ , giving $V^{\pi_2}(s_1) = \frac{1}{0.19} \approx 5.26$ .

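Conclusion. At $s_1$ , $\pi_1$ is better: $V^{\pi_1}(s_1) = 10 > 5.26 \approx V^{\pi_2}(s_1)$ . At $s_2$ the ordering reverses: $V^{\pi_1}(s_2) = 0$ , while $V^{\pi_2}(s_2) = 0.9 \times 5.26 \approx 4.74$ , so neither policy dominates the other. The optimal policy combines them: stay in $s_1$ to exploit the high-reward state, and switch out of $s_2$ to reach it. This gives $V^*(s_1) = 10$ and $V^*(s_2) = 0 + 0.9 \times 10 = 9$ , which is what the Bellman optimality equation selects by taking $\max_a$ in each state.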

Mini coding exercise: iterative policy evaluation#

Implement the policy evaluation above using the iterative (not direct) method, and verify convergence to the analytical solution.

import numpy as np

# MDP definition
gamma = 0.9

# States: 0 = s1, 1 = s2. Actions: 0 = stay, 1 = switch.
# P[s, a, s'] = P(s' | s, a)
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # from s1: stay->s1, switch->s2
    [[0.0, 1.0], [1.0, 0.0]],   # from s2: stay->s2, switch->s1
])
R = np.array([1.0, 0.0])        # R(s1) = 1, R(s2) = 0

# Policy pi1: always stay (action 0 in both states)
pi = np.array([0, 0])

# Iterative policy evaluation
# Initial guess of zeros — the contraction property guarantees convergence
# regardless of the starting estimate.
V = np.zeros(2)
for i in range(1000):
    # Bellman expectation backup: R(s) + γ Σ_{s'} P(s'|s,π(s)) V(s')
    # np.dot computes the expected future value over successor states.
    V_new = np.array([
        R[s] + gamma * np.dot(P[s, pi[s]], V)
        for s in range(2)
    ])
    # Convergence check using the infinity norm (max absolute difference).
    # A small value means we've reached the fixed point of the Bellman operator.
    if np.max(np.abs(V_new - V)) < 1e-8:
        print(f"Converged after {i+1} iterations")
        break
    # Synchronous (Jacobi-style) update: all states updated using the old V.
    V = V_new

print(f"V(s1) = {V[0]:.4f}")   # Expected: 10.0
print(f"V(s2) = {V[1]:.4f}")   # Expected: 0.0

Extension: Change pi = np.array([1, 1]) (always switch) and verify $V(s_1) \approx 5.26$ . Then implement full policy iteration: alternate between policy evaluation and greedy policy improvement (pi[s] = argmax_a ...) until the policy stops changing. Verify it converges to pi = [0, 1] (stay in $s_1$ , switch in $s_2$ ), with $V = [10, 9]$ .
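One possible sketch of the policy iteration extension is shown below, for readers who want to check their own attempt. It uses a direct linear solve for the evaluation step instead of the iterative loop above, which is a design choice made here for brevity; the helper names `evaluate` and `improve` are hypothetical.

```python
import numpy as np

gamma = 0.9
# P[s, a, s'] for the two-state MDP: action 0 = stay, action 1 = switch.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
])
R = np.array([1.0, 0.0])                   # reward depends only on the state

def evaluate(pi):
    """Policy evaluation: solve (I - gamma * P_pi) V = R exactly."""
    P_pi = P[np.arange(2), pi]             # row s is P(. | s, pi[s])
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R)

def improve(V):
    """Greedy improvement: pi(s) = argmax_a [ R(s) + gamma * sum_s' P V ]."""
    Q = R[:, None] + gamma * (P @ V)       # shape (states, actions)
    return Q.argmax(axis=1)

pi = np.array([1, 1])                      # start from "always switch"
for sweep in range(100):
    V = evaluate(pi)
    new_pi = improve(V)
    if np.array_equal(new_pi, pi):         # policy is stable: stop
        break
    pi = new_pi

print("policy:", pi, "values:", V)
```

Starting from "always switch", a single improvement step already yields the policy that stays in $s_1$ and switches in $s_2$; the second sweep only confirms that no further change is possible.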

Taxonomy of reinforcement learning problems#

Understanding where RL problems sit in this taxonomy tells you which algorithms are applicable and what assumptions you can rely on.

Bandits vs full MDPs#

Multi-armed bandits: no state transitions; actions yield immediate reward. The agent must balance exploration (trying uncertain actions) and exploitation (taking the best-known action).
Full MDPs: actions affect future states and long-term outcomes. Bandit reasoning applies locally at each state, but value functions must account for downstream consequences.

Bandits reappear throughout the course: in recommendation systems, A/B testing, and as the core mechanism in RLHF (where human feedback on response pairs is a bandit signal).
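The exploration-exploitation balance in a bandit can be sketched with an epsilon-greedy agent. The arm means, noise level, and $\epsilon = 0.1$ below are arbitrary illustrative choices, and epsilon-greedy is only one of several standard strategies.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])     # hidden expected reward of each arm
n_arms = len(true_means)
epsilon, n_steps = 0.1, 2000

q_est = np.zeros(n_arms)                   # running estimate of each arm's mean
counts = np.zeros(n_arms, dtype=int)       # number of pulls per arm

for _ in range(n_steps):
    if rng.random() < epsilon:
        a = int(rng.integers(n_arms))      # explore: uniformly random arm
    else:
        a = int(np.argmax(q_est))          # exploit: best arm so far
    r = rng.normal(true_means[a], 0.05)    # noisy reward, no state transition
    counts[a] += 1
    q_est[a] += (r - q_est[a]) / counts[a] # incremental sample-mean update

print("estimates:", np.round(q_est, 3))
print("pull counts:", counts)
```

There is no next state in this loop: each reward depends only on the arm just pulled, which is what separates a bandit from a full MDP.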

Online vs offline reinforcement learning#

Online RL: the agent interacts with the environment during learning. Experience is generated by the current (or recent) policy. Standard algorithms — Q-learning, PPO, SAC — are online.
Offline RL: learning occurs from a fixed dataset collected by some behavior policy, without further environment interaction. This is critical in safety-sensitive domains (robotics, healthcare, alignment) where online exploration is expensive or dangerous.

The key technical challenge in offline RL is distributional shift: the learned policy may assign high value to state-action pairs that are rare or absent in the dataset. Without the ability to collect new experience, these value estimates cannot be corrected, leading to overoptimistic Q-values and unstable policy improvement. This problem has no analog in supervised learning — a classifier trained on a fixed dataset does not synthesize new inputs from regions of low data density.

Why not just use supervised learning for offline RL?

One might naively apply behavioral cloning (BC): train a supervised classifier to imitate the actions in the offline dataset. The classifier can at best approximate the behavior policy's performance; it has no mechanism for improving beyond it. RL offers the possibility of improvement: finding policies better than those in the dataset. But this improvement requires careful reasoning about distributional shift. If the dataset rarely shows action $a$ in state $s$ , and the learned policy tries to execute $a$ in $s$ , the value estimate will be unreliable and the policy improvement step can fail. Modern offline RL methods (CQL, IQL, AWR) explicitly constrain or penalize extrapolation to avoid this trap. The sophistication lies in knowing when you can and cannot trust the data.
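The extrapolation problem can be reproduced in a deliberately tiny tabular sketch. The one-state dataset, the arbitrary initial value of 50 standing in for an unreliable estimate, and the "only bootstrap from seen actions" rule are all assumptions for illustration. The rule is a crude stand-in for the idea of constraining extrapolation; it is not an implementation of CQL, IQL, or AWR.

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 1, 2

# Fixed dataset of (s, a, r, s') tuples: the behavior policy only ever took action 0.
dataset = [(0, 0, 1.0, 0)] * 20

# Coverage mask: which (s, a) pairs appear in the data at all.
seen = np.zeros((n_states, n_actions), dtype=bool)
for s, a, _, _ in dataset:
    seen[s, a] = True

def fitted_q(constrain, n_sweeps=500):
    # Arbitrary initial values stand in for an unreliable function approximator.
    Q = np.full((n_states, n_actions), 50.0)
    for _ in range(n_sweeps):
        for s, a, r, s_next in dataset:
            next_q = Q[s_next]
            if constrain:
                # Bootstrap only from actions supported by the data.
                # (Assumes every s' in the dataset has at least one seen action.)
                next_q = next_q[seen[s_next]]
            Q[s, a] = r + gamma * next_q.max()
    return Q

Q_naive = fitted_q(constrain=False)
Q_constrained = fitted_q(constrain=True)
print("naive:      ", Q_naive[0])
print("constrained:", Q_constrained[0])
```

In the naive version the never-observed action keeps its arbitrary value, the max operator bootstraps from it, and the estimate for the observed action is inflated far above its true value of $1/(1-\gamma) = 10$. Restricting the backup to covered actions removes the inflation, at the price of never considering actions the data cannot vouch for.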

Model-free vs model-based RL#

Model-free RL: learns policies or value functions directly from experience, without explicitly representing or learning the transition dynamics $P$ .
Model-based RL: learns or uses a model of the environment's dynamics for planning. Enables higher sample efficiency by generating synthetic experience from the model, at the cost of bias when the model is imperfect.

These categories exist on a spectrum rather than as a clean binary. Dyna-style algorithms interleave real experience with simulated rollouts from a learned model — they are simultaneously model-free and model-based. Dreamer and MuZero learn abstract latent-space world models that are not interpretable as physical dynamics models but are used for planning. The binary framing is a useful starting point but breaks down for most modern methods.

Relationship to supervised learning and control theory#

Supervised learning#

Supervised learning optimizes a fixed loss over a static dataset. The learner's outputs do not influence what data it sees next. RL breaks this assumption: the policy determines what states are visited, which determines what experience is collected, which determines what the policy learns — a feedback loop that makes the learning problem fundamentally non-stationary and requires explicit management of exploration.

Control theory#

Classical control theory solves related problems but under different assumptions:

Known dynamics: control assumes $P$ is given analytically; RL treats it as unknown or learns it.
Stability guarantees: control provides formal guarantees (Lyapunov stability, ISS) that RL algorithms generally do not.
Function approximation: classical control works analytically in low-to-moderate dimensional spaces; RL scales to high-dimensional spaces by approximating $V$ and $\pi$ with neural networks, trading convergence guarantees for scalability.

The connection is exact at the boundary: LQR is RL with known linear dynamics and quadratic cost, solved analytically. Moving from LQR to RL is moving from known to unknown dynamics, from analytical to learned value functions, and from guaranteed convergence to empirical stability. Understanding this gradient — rather than treating control and RL as separate disciplines — is one of the goals of this course.

Connections to modern methods#

The formalism introduced in this lecture is not only foundational — it is directly instantiated in state-of-the-art algorithms. The following examples show how each element of the MDP tuple maps to a concrete design decision in recent systems.

DreamerV3: Bellman backups in latent space#

DreamerV3 (Hafner et al., 2023) achieves strong performance across diverse domains (continuous control, Atari, Minecraft), in many cases from pixels alone. Its architecture maps directly onto the MDP formalism:

| MDP concept | DreamerV3 implementation |
|---|---|
| State $s_t$ | Compact latent vector from a recurrent state-space model (RSSM) |
| Transition $P$ | Learned dynamics model in latent space |
| Reward $R$ | Learned reward predictor from latent state |
| Value $V^*$ | Neural network trained via Bellman backup over imagined rollouts |
| POMDP gap | RSSM maintains a belief state over raw observations, directly addressing the Markovianity problem |

The key insight: once a world model is learned, Bellman backups can be applied to imagined trajectories without interacting with the real environment. Imagined rollouts supply the bootstrapped targets on which the value network is trained, so value learning becomes approximate dynamic programming in latent space rather than exact value iteration. DreamerV3 is model-based RL at scale: the same equations, applied inside a learned model.

TD-MPC: value-weighted planning#

TD-MPC (Hansen et al., 2022) combines model predictive control (MPC) with Bellman-based value learning. The key design choices map to the formalism as follows:

A learned latent world model provides short-horizon rollouts (the transition $P$ ).
A learned value function $Q_\theta(s, a) \approx Q^*(s, a)$ , trained via the Bellman optimality equation, estimates cumulative reward beyond the planning horizon.
At decision time, action sequences are sampled and evaluated by combining simulated returns (within the planning horizon) with Q-value bootstrapping (beyond it).

The core insight: the Bellman Q-function acts as a surrogate for long-horizon returns, allowing the planner to terminate rollouts early without losing information about the future. This is exactly why value functions are useful — they compress an infinite sum into a single scalar that can be queried at any state.

Understanding these methods requires only the formalism from this lecture. The algorithms add function approximation, learned dynamics, and planning; the underlying equations are unchanged.

Key takeaways#

The concepts introduced here form the foundation for every algorithm in the course. The chain runs as follows: the reward hypothesis motivates reducing goal-directed behavior to scalar maximization, while immediately surfacing the challenge of reward specification. MDPs formalize the sequential interaction, with the Markov property as a structural assumption that must be engineered in practice. Policies define behavior; value functions quantify its long-term consequences. The Bellman equations express value functions recursively — linearly for policy evaluation, as a nonlinear fixed point for optimality — and the contraction mapping property is what makes iterative solution tractable. The discount factor simultaneously ensures convergence, encodes time preference, and introduces a policy gradient bias that resurfaces later. And the LQR/Riccati connection unifies this formalism with the control theory developed in preceding lectures: RL is generalized optimal control.

Common pitfalls in RL problem formulation#

These mistakes appear repeatedly in both research and practice. Recognizing them early prevents compounding errors in implementation and design.

1. Non-Markovian state representation. Using raw sensor observations as states without verifying the Markov property. Symptom: the value function fails to converge or oscillates despite a correct implementation. Diagnosis: place the agent in the same observation from different histories and check whether subsequent transitions differ. Fix: state augmentation (add velocity, contact forces) or observation stacking.

2. Reward hacking. Optimizing a specified reward that does not fully capture the intended goal. Symptom: high reward but poor task performance; the agent found a policy that satisfies the letter of the reward, not the spirit. Prevention: before training, adversarially ask "what behavior maximizes this reward in an unintended way?" A robot rewarded purely for joint speed will fall forward. An LLM rewarded for approval ratings will learn to flatter.

3. Setting $\gamma = 1$ in a continuing task. Returns diverge; the Bellman operator loses its contraction property; value iteration does not converge. Symptom: numerical overflow or NaN values during training. Fix: use $\gamma < 1$ , or reformulate the task as episodic with a meaningful terminal condition.

4. Confusing episodic and continuing formulations. A robot manipulation task with a fixed episode length is episodic; a thermostat control task running indefinitely is continuing. Applying the wrong formulation produces an ill-defined objective. Fix: determine whether terminal states exist. If not, the task is continuing and requires discounting for a finite return.

5. Ignoring distributional shift in offline RL. In offline settings, the dataset was collected by a behavior policy; the policy being learned is the target policy. Ignoring this distinction leads to overestimated Q-values in regions not covered by the dataset, causing catastrophic failure outside the data distribution. This has no direct analog in supervised learning — a classifier does not synthesize inputs from data-sparse regions.

Matrix form of the Bellman expectation equation

For a fixed policy $\pi$ over a finite state space, the Bellman expectation equation is a system of $|\mathcal{S}|$ linear equations. It can be written in matrix form as:

V^\pi = R^\pi + \gamma P^\pi V^\pi \quad\Longrightarrow\quad (I - \gamma P^\pi)\, V^\pi = R^\pi

where $R^\pi_s = \sum_a \pi(a|s)\, R(s,a)$ and $P^\pi_{ss'} = \sum_a \pi(a|s)\, P(s'|s,a)$ .

This is the standard $Ax = b$ linear system with $A = I - \gamma P^\pi$ , $x = V^\pi$ , $b = R^\pi$ . The matrix $A$ is always invertible when $\gamma < 1$ : since $P^\pi$ is a row-stochastic matrix, every eigenvalue $\lambda$ of $P^\pi$ (possibly complex) satisfies $|\lambda| \leq 1$ . The eigenvalues of $I - \gamma P^\pi$ are $1 - \gamma\lambda$ , and $|1 - \gamma\lambda| \geq 1 - \gamma|\lambda| \geq 1 - \gamma > 0$ , so none is zero. The unique solution is $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$ .
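The matrix form translates directly into a linear solve. The sketch below applies it to the two-state MDP from the worked example under the "always switch" policy, written as a stochastic matrix `pi[s, a]`; the variable names are choices made for this example.

```python
import numpy as np

gamma = 0.9
# P[s, a, s'] and R[s, a] for the two-state MDP (action 0 = stay, 1 = switch).
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
])
R = np.array([
    [1.0, 1.0],
    [0.0, 0.0],
])

# "Always switch", written as probabilities pi[s, a].
pi = np.array([
    [0.0, 1.0],
    [0.0, 1.0],
])

R_pi = (pi * R).sum(axis=1)                # R^pi_s = sum_a pi(a|s) R(s,a)
P_pi = np.einsum("sa,sat->st", pi, P)      # P^pi_{ss'} = sum_a pi(a|s) P(s'|s,a)

A = np.eye(2) - gamma * P_pi
V = np.linalg.solve(A, R_pi)               # solves (I - gamma P^pi) V = R^pi

eig_moduli = np.abs(np.linalg.eigvals(A))
print("V^pi =", np.round(V, 4))
print("|eigenvalues of A| =", np.round(eig_moduli, 4), ">= 1 - gamma =", round(1 - gamma, 4))
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the usual numerical practice. The cost grows cubically with the number of states, which is why the iterative backup is preferred once the state space is large.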

Conceptual questions#

A robot observes only joint positions, not velocities. Describe precisely why this violates the Markov property. Propose two different state engineering strategies that restore approximate Markovianity, and explain what each one assumes about the system.
You set $\gamma = 1.0$ in a value iteration algorithm on a continuing task with bounded positive rewards. What happens to $G_t$ , and why does the Bellman optimality operator lose its contraction property? What is the minimum change to the problem setup that restores convergence?
An RLHF system trains a language model to receive high scores from a reward model trained on human preference data. After deployment, users report that the model is more persuasive but less accurate. Explain this failure in terms of the reward hypothesis and reward misspecification. What would you change in the training setup?
Compare the Bellman expectation equation for policy evaluation with the solution of a system of linear equations $Ax = b$ . Identify what plays the role of $A$ , $x$ , and $b$ , and explain why the system is always solvable (i.e., why $A$ is always invertible for $\gamma < 1$ ).
An offline RL algorithm is trained on a dataset of robot manipulation demonstrations. After training, the policy performs well on demonstrations similar to the training data but fails catastrophically on slightly different configurations. Explain this failure in terms of distributional shift and overestimated Q-values. How does this differ from the analogous failure mode in behavioral cloning?

Solutions

Missing velocities. Without velocity, the future is not determined by the current observation — it depends on history — so the Markov property fails. (a) Stack a window of recent positions: assumes velocity is recoverable by finite differencing (the system is approximately finite-order). (b) Append a filter/observer estimate of velocity: assumes a known, identifiable dynamics model. Both restore approximate Markovianity by re-injecting the missing dynamical information.
$\gamma = 1$ on a continuing task. With $\gamma = 1$ the return is $G_t = \sum_{k=0}^{\infty} r_{t+k+1}$, which diverges over an infinite horizon whenever rewards are bounded below by some $r_{\min} > 0$. The Bellman optimality operator is a sup-norm contraction only with modulus $\gamma < 1$; at $\gamma = 1$ it is merely non-expansive, so the contraction argument no longer guarantees a unique fixed point or convergence of value iteration. Minimum fix: set $\gamma < 1$. Alternatives are to make the task episodic with an absorbing terminal state or to switch to an average-reward formulation.
RLHF persuasive but inaccurate. The reward model is a proxy, and the reward hypothesis only delivers the desired behavior if reward is correctly specified. Here the RM rewarded human approval/persuasiveness rather than truth — reward misspecification — so the policy hacks the proxy, producing convincing but wrong outputs. Fix by adding factuality/groundedness signals to the reward, collecting preference data that penalizes confident falsehoods, and keeping a KL anchor to the reference model.
Bellman as $Ax=b$ . $V^\pi = R^\pi + \gamma P^\pi V^\pi$ rearranges to $(I - \gamma P^\pi)V^\pi = R^\pi$ , so $A = I-\gamma P^\pi$ , $x = V^\pi$ , $b = R^\pi$ . Since $P^\pi$ is stochastic its spectral radius is 1, so $\gamma P^\pi$ has spectral radius $\le \gamma < 1$ and $I-\gamma P^\pi$ has all eigenvalues bounded away from zero — always invertible, giving a unique $V^\pi$ .
Offline distributional shift. The policy queries $Q$ at out-of-distribution actions where it is overestimated (no data to correct it), so it confidently selects bad actions and fails off-distribution. This differs from behavioral cloning, whose failure is covariate shift / compounding error — BC drifts into unseen states by imitation but does not actively exploit overestimated values; offline RL's failure is value extrapolation specific to bootstrapped learning driving the policy toward unsupported actions.

Looking ahead#

The next lecture studies the simplest nontrivial RL setting: multi-armed bandits. This isolates the exploration-exploitation tradeoff — the core challenge of RL — without the additional complexity of state transitions. Bandit algorithms and regret analysis provide the theoretical tools that extend, with modification, to full MDPs and to the RLHF feedback mechanisms studied later in the course.

Sequential decision making#

In reinforcement learning, an agent interacts with an environment over discrete time steps. This interaction forms a closed feedback loop:

At time step $t$ , the agent observes a state $s_t$
It selects an action $a_t$
The environment transitions to a new state $s_{t+1}$
The agent receives a scalar reward $r_{t+1}$

The agent's objective is not to maximize immediate reward, but to act so as to maximize cumulative reward over time.

This temporal coupling distinguishes reinforcement learning from:

Supervised learning, where data is fixed and independent of the learner's actions
Standard optimization, where decisions are made once rather than sequentially
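
The feedback loop described in this section can be sketched as a short program. This is a minimal illustration under stated assumptions: `ToggleEnv` and `random_policy` are hypothetical names, and the two-state dynamics are invented for the example. The point is only the order of events: observe $s_t$, choose $a_t$, receive $r_{t+1}$ and $s_{t+1}$, repeat.

```python
import random
from dataclasses import dataclass


@dataclass
class ToggleEnv:
    """Hypothetical two-state environment: reward 1 for each step taken in state 0."""
    state: int = 0

    def step(self, action: int) -> tuple[int, float]:
        # Reward depends on the state in which the action is taken.
        reward = 1.0 if self.state == 0 else 0.0
        # action 0 = stay, action 1 = switch to the other state
        if action == 1:
            self.state = 1 - self.state
        return self.state, reward


def random_policy(state: int, rng: random.Random) -> int:
    """Placeholder agent: ignores the state and picks an action uniformly."""
    return rng.choice([0, 1])


rng = random.Random(0)
env = ToggleEnv()
s_t = env.state
trajectory = []
for t in range(5):
    a_t = random_policy(s_t, rng)        # agent selects a_t given s_t
    s_next, r_next = env.step(a_t)       # environment returns s_{t+1}, r_{t+1}
    trajectory.append((s_t, a_t, r_next, s_next))
    s_t = s_next                         # the chosen action shapes the next input

for step in trajectory:
    print(step)
```

Unlike a supervised dataset, each row of `trajectory` depends on the actions taken in earlier rows.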

The reward hypothesis#

A foundational idea in reinforcement learning is the Reward Hypothesis:

All goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (reward).

A concrete starting point#

Before examining where the hypothesis succeeds and fails, consider the simplest case where it clearly works.

These cases share a key property: the reward function fully captures the intended goal. The difficulty arises when the reward and the intended goal diverge.

Why reward design matters#

The hypothesis is silent on how to specify reward, which is where most of the practical difficulty lies.

Productive uses. In modern AI systems, the reward hypothesis underpins:

RLHF (Reinforcement Learning from Human Feedback)
Policy and preference optimization (PPO, GRPO, DPO)
Agentic systems optimized via success signals

Design challenges. Reward design is one of the hardest problems in RL:

Sparse rewards: signal arrives only at task completion, making learning slow
Dense rewards: accelerate learning but risk reward hacking
Reward misspecification: the gap between the written reward and the actual goal

Reward design is revisited throughout the course. The tension between what is easy to specify and what is safe to optimize is one of the running themes.
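
A toy calculation makes the misspecification gap concrete. The sketch below assumes a hypothetical five-cell corridor whose intended goal is to reach cell 4, and a hand-written proxy reward of +1 for every step on which the agent moves; both the task and the function names are invented for illustration. Under the proxy, pacing back and forth scores higher than solving the task.

```python
def proxy_return(positions: list[int]) -> float:
    """Misspecified reward (assumed): +1 for every step on which the agent moves."""
    return float(sum(1 for a, b in zip(positions, positions[1:]) if a != b))


def reached_goal(positions: list[int], goal: int) -> bool:
    """The intended objective: did the agent ever arrive at the goal cell?"""
    return goal in positions


GOAL = 4
HORIZON = 20

# Policy A walks straight to the goal; the episode ends on arrival.
direct = [0, 1, 2, 3, 4]

# Policy B oscillates between cells 0 and 1 for the whole horizon.
oscillate = [t % 2 for t in range(HORIZON + 1)]

for name, path in [("direct", direct), ("oscillate", oscillate)]:
    print(f"{name:10s} proxy return = {proxy_return(path):4.1f}  "
          f"reached goal = {reached_goal(path, GOAL)}")
```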

Markov Decision Processes (MDPs)#

The canonical mathematical model for reinforcement learning is the Markov Decision Process (MDP).

An MDP is defined by the tuple:

(\mathcal{S},\, \mathcal{A},\, P,\, R,\, \gamma)

where:

$\mathcal{S}$ is the state space,
$\mathcal{A}$ is the action space,
$P(s' \mid s, a)$ is the transition probability kernel,
$R(s, a)$ is the expected reward function,
$\gamma \in [0,1)$ is the discount factor.
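
For a finite problem, the tuple can be stored as a pair of arrays plus a scalar. The container below is one possible sketch, not a standard API: the class name `TabularMDP`, the array layout, and the example numbers are all assumptions made for illustration. The validation step checks the two properties the definition relies on, namely that each $P(\cdot \mid s, a)$ is a probability distribution and that $\gamma \in [0, 1)$.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class TabularMDP:
    """Finite MDP (S, A, P, R, gamma) stored as dense arrays."""
    P: np.ndarray      # shape (|S|, |A|, |S|), P[s, a, s'] = P(s' | s, a)
    R: np.ndarray      # shape (|S|, |A|),      R[s, a] = expected reward
    gamma: float       # discount factor in [0, 1)

    def __post_init__(self) -> None:
        n_states, n_actions, n_next = self.P.shape
        if n_next != n_states:
            raise ValueError("P must have shape (|S|, |A|, |S|)")
        if self.R.shape != (n_states, n_actions):
            raise ValueError("R must have shape (|S|, |A|)")
        if np.any(self.P < 0) or not np.allclose(self.P.sum(axis=-1), 1.0):
            raise ValueError("each P[s, a, :] must be a probability distribution")
        if not 0.0 <= self.gamma < 1.0:
            raise ValueError("gamma must lie in [0, 1)")

    @property
    def n_states(self) -> int:
        return self.P.shape[0]

    @property
    def n_actions(self) -> int:
        return self.P.shape[1]


# Illustrative two-state, two-action instance (assumed values).
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # from state 0: action 0 stays, action 1 switches
    [[0.0, 1.0], [1.0, 0.0]],   # from state 1: action 0 stays, action 1 switches
])
R = np.array([[1.0, 1.0],
              [0.0, 0.0]])
mdp = TabularMDP(P=P, R=R, gamma=0.9)
print(mdp.n_states, mdp.n_actions, mdp.gamma)
```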

The Markov property#

The defining assumption of an MDP is the Markov property:

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)

The future is conditionally independent of the past given the current state and action. The current state $s_t$ contains all information relevant to predicting what happens next.

Markovianity is engineered, not given#

In practice, raw observations from the environment are almost never Markov. This is not a minor implementation detail — it is one of the most important points in the lecture.

Systems where the agent receives observations $o_t$ that are functions of an underlying hidden state $x_t$ are formally Partially Observable MDPs (POMDPs):

o_t = \mathcal{O}(x_t) + \text{noise}

POMDPs are the general case; MDPs are the special case where $o_t = x_t$ . Most real-world problems are POMDPs, and this gap is a primary source of failure in deployed RL systems.

The standard engineering responses to non-Markovianity are:

State augmentation: add velocity to a position-only observation; add contact forces to a proprioceptive observation. Extend $s_t$ until the Markov property holds approximately.
Observation stacking: concatenate the last $k$ observations into a single input (e.g., frame stacking in Atari, observation history windows in locomotion RL). This approximates a sufficient statistic for the hidden state.
Recurrent architectures: use an LSTM or transformer to maintain a compressed representation of history, effectively learning a belief state. This is the learned analog of the Kalman filter discussed in the dynamics course.
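
Observation stacking, the second response above, can be sketched in a few lines. The example assumes a hypothetical one-dimensional system observed through position only, sampled every `DT` seconds; the class name `StackedObservation` is illustrative. Two trajectories end at the same position, so the latest observation alone cannot distinguish them, while a stack of two observations recovers the velocity by finite differencing.

```python
from collections import deque

import numpy as np

DT = 0.1  # assumed sampling interval in seconds


class StackedObservation:
    """Keeps the last k observations and exposes them as one state vector."""

    def __init__(self, k: int, initial_obs: float) -> None:
        # Pad with the first observation so the stack is full from step 0.
        self.frames = deque([initial_obs] * k, maxlen=k)

    def push(self, obs: float) -> np.ndarray:
        self.frames.append(obs)
        return np.array(self.frames)


# Two trajectories pass through the same position 1.0 with opposite velocities.
rising = [0.8, 0.9, 1.0]
falling = [1.2, 1.1, 1.0]

estimates = {}
for name, positions in [("rising", rising), ("falling", falling)]:
    stack = StackedObservation(k=2, initial_obs=positions[0])
    for p in positions[1:]:
        state = stack.push(p)
    velocity = (state[1] - state[0]) / DT   # finite-difference estimate
    estimates[name] = velocity
    print(f"{name:8s} last obs = {positions[-1]:.1f}  stacked = {state}  "
          f"velocity ~ {velocity:+.1f}")
```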

GenAI context: MDPs for language models#

To ground this abstraction, consider RL applied to large language models, as in RLHF:

Policies#

A policy specifies the agent's behavior as a mapping from states to actions:

\pi(a \mid s) = P(a_t = a \mid s_t = s)

Policies may be:

Deterministic: $\pi(s) = a$ , selecting a single action per state. Common in deployment and in control-theoretic settings.
Stochastic: sampling actions from a probability distribution. Essential for exploration during training, and fundamental to policy gradient methods.

In LLMs, the policy is the language model itself: a stochastic policy over the vocabulary at each token position, with temperature controlling the degree of stochasticity.
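
The effect of temperature on a stochastic policy can be illustrated with a softmax over action scores. This is a minimal sketch: the function name `softmax_policy` and the three logits are assumptions, standing in for the scores a language model would assign to candidate tokens.

```python
import numpy as np


def softmax_policy(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Return pi(a | s) for one state, given unnormalized action scores."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = logits / temperature
    scaled = scaled - scaled.max()          # subtract max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()


logits = np.array([2.0, 1.0, 0.0])          # hypothetical scores for three actions
rng = np.random.default_rng(0)

for temperature in (0.1, 1.0, 10.0):
    probs = softmax_policy(logits, temperature)
    action = rng.choice(len(logits), p=probs)    # stochastic policy: sample
    print(f"T = {temperature:4.1f}  pi = {np.round(probs, 3)}  sampled a = {action}")

greedy_action = int(np.argmax(logits))           # deterministic policy: argmax
print("greedy action:", greedy_action)
```

Low temperatures concentrate probability on the highest-scoring action, approaching the deterministic argmax policy; high temperatures approach a uniform distribution.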

Given a policy, we need to quantify how good it is — not just at the current step, but over the entire future trajectory it will generate. This leads to the return.

Returns and discounting#

The agent's objective is formalized via the return, the discounted cumulative reward from time step $t$ onward:

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

This single equation encodes three distinct ideas. Understanding all three is critical because they come apart in practice.
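
For a finite reward sequence, the return defined above can be computed directly. The sketch below uses a hypothetical reward list and an illustrative helper name, `discounted_return`; it accumulates from the last reward backwards, which applies the recursion $G_t = r_{t+1} + \gamma G_{t+1}$ one step at a time.

```python
def discounted_return(rewards: list[float], gamma: float) -> float:
    """G_t = sum_k gamma^k * r_{t+k+1}, for a finite list of future rewards."""
    g = 0.0
    # Work backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 1.0, 0.0, 1.0]   # hypothetical rewards r_{t+1}, ..., r_{t+5}

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(f"gamma = {gamma:.1f}  G_t = {discounted_return(rewards, gamma):.4f}")
```

With $\gamma = 0$ only the immediate reward counts; with $\gamma = 1$ the return is the plain sum.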

Three interpretations of the discount factor $\gamma$ #

Why not set $\gamma = 1$?

For episodic tasks with a natural terminal state, you can set $\gamma = 1$ if you treat the terminal state as absorbing (zero reward, self-loops). The episode ends, so the sum is finite regardless.

A warning on $\gamma$ and policy gradient bias#

Episodic vs continuing tasks#

Episodic tasks terminate at a terminal state (games, robotic manipulation episodes, single LLM responses). The return is a finite sum.
Continuing tasks run indefinitely (control systems, long-running agents). The return requires discounting for finiteness.

Value functions#

State-value function#

V^\pi(s) = \mathbb{E}_\pi \left[ G_t \mid s_t = s \right] = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s \right]

This is the expected cumulative discounted reward, starting from state $s$ and following policy $\pi$ thereafter.

Action-value function (Q-function)#

Q^\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid s_t = s, a_t = a \right]

This conditions on both state and action: the expected return from taking action $a$ in state $s$ , then following $\pi$ .

The relationship between them:

V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)
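
In tabular form this relationship is a row-wise weighted average. The arrays below are hypothetical values chosen for illustration, with `Q[s, a]` and `pi[s, a]` indexed by state and action.

```python
import numpy as np

# Hypothetical tables for 2 states and 3 actions.
Q = np.array([[1.0, 2.0, 4.0],
              [0.5, 0.0, 3.0]])          # Q[s, a]
pi = np.array([[0.2, 0.3, 0.5],
               [1.0, 0.0, 0.0]])         # pi[s, a] = pi(a | s), rows sum to 1

# V(s) = sum_a pi(a | s) * Q(s, a): a policy-weighted average over actions.
V = np.sum(pi * Q, axis=1)
print("V =", V)

# Because V(s) is an average of Q-values, it cannot exceed max_a Q(s, a).
print("max_a Q(s, a) =", Q.max(axis=1))
```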

Bellman equations#

The Bellman equations express value functions recursively, relating the value of the current state to the values of successor states. They are the mathematical spine of almost every RL algorithm.

Bellman expectation equation#

V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s') \right]
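
The recursion follows in a few steps from the definition of the return. A short derivation, using only the definitions above (the third line is where the Markov property enters, since the expected return from $s_{t+1}$ onward depends on the past only through $s_{t+1}$):

```latex
\begin{aligned}
G_t &= r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} = r_{t+1} + \gamma\, G_{t+1} \\
V^\pi(s) &= \mathbb{E}_\pi\left[ r_{t+1} + \gamma\, G_{t+1} \mid s_t = s \right] \\
&= \sum_a \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \mathbb{E}_\pi\left[ G_{t+1} \mid s_{t+1} = s' \right] \right] \\
&= \sum_a \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s') \right]
\end{aligned}
```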

Bellman optimality equation#

V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s') \right]
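
Applying the right-hand side repeatedly as an update rule gives value iteration. The following is a minimal sketch on an assumed two-state toy MDP (deterministic stay/switch actions, reward 1 in state 0); the arrays and the stopping tolerance are illustrative choices.

```python
import numpy as np

gamma = 0.9

# Assumed toy MDP: 2 states, actions 0 = stay and 1 = switch, deterministic moves.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
])                                   # P[s, a, s']
R = np.array([[1.0, 1.0],
              [0.0, 0.0]])           # R[s, a]: reward 1 in state 0, 0 in state 1

V = np.zeros(2)
for sweep in range(10_000):
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s')
    Q = R + gamma * P @ V
    V_new = Q.max(axis=1)            # Bellman optimality backup: max over actions
    done = np.max(np.abs(V_new - V)) < 1e-10
    V = V_new
    if done:
        break

greedy_policy = Q.argmax(axis=1)     # act greedily with respect to the final Q
print("V* ~", np.round(V, 4))
print("greedy policy (0 = stay, 1 = switch):", greedy_policy)
```

For this toy problem the fixed point can be checked by hand: staying in state 0 forever gives $1/(1-0.9) = 10$, and switching from state 1 gives $0 + 0.9 \times 10 = 9$.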

The contraction mapping property#

Define the Bellman optimality operator $\mathcal{T}$ :

(\mathcal{T} V)(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \right]

$\mathcal{T}$ is a $\gamma$ -contraction in the $\ell_\infty$ norm:

\|\mathcal{T} V - \mathcal{T} V'\|_\infty \leq \gamma \|V - V'\|_\infty
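
By the Banach fixed-point theorem, this inequality implies that $\mathcal{T}$ has a unique fixed point $V^*$ and that repeated application of $\mathcal{T}$ converges to it geometrically from any starting point. The inequality can also be checked numerically. The sketch below draws a random MDP (an assumption made purely for illustration) and confirms that the observed shrinkage ratio never exceeds $\gamma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP (assumed for illustration): normalize rows into distributions.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.normal(size=(n_states, n_actions))


def bellman_optimality(V: np.ndarray) -> np.ndarray:
    """(T V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s') ]."""
    return (R + gamma * P @ V).max(axis=1)


worst_ratio = 0.0
for _ in range(1000):
    V1 = rng.normal(scale=10.0, size=n_states)
    V2 = rng.normal(scale=10.0, size=n_states)
    before = np.max(np.abs(V1 - V2))
    after = np.max(np.abs(bellman_optimality(V1) - bellman_optimality(V2)))
    worst_ratio = max(worst_ratio, after / before)

print(f"largest observed ratio: {worst_ratio:.4f} (bound: gamma = {gamma})")
```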

Connection to the LQR Riccati equation#

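Students of the robotics control lectures will recognize the Bellman optimality equation in the discrete-time Algebraic Riccati Equation (DARE), which is its specialization to linear dynamics $x_{t+1} = A x_t + B u_t$ and quadratic cost $x^\top Q x + u^\top R u$, with quadratic optimal cost-to-go $x^\top P x$. Note that $P$, $Q$, and $R$ below are the LQR matrices, not the MDP's transition kernel, action-value function, and reward: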

P = Q + A^\top P A - A^\top P B (R + B^\top P B)^{-1} B^\top P A
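
Iterating the right-hand side of the DARE is value iteration restricted to quadratic cost-to-go functions. The sketch below assumes an undiscounted cost-minimization problem on a double integrator, with $Q$ and $R$ set to identity; the system and the helper name `riccati_backup` are illustrative choices, not taken from the control lectures.

```python
import numpy as np

# Assumed example system: a discrete-time double integrator.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)            # state cost
R = np.array([[1.0]])    # control cost


def riccati_backup(P: np.ndarray) -> np.ndarray:
    """One Bellman backup on the quadratic cost-to-go x^T P x."""
    gain_term = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return Q + A.T @ P @ A - A.T @ P @ B @ gain_term


P = np.zeros((2, 2))     # zero terminal cost
for k in range(1000):
    P_next = riccati_backup(P)
    converged = np.max(np.abs(P_next - P)) < 1e-12
    P = P_next
    if converged:
        break

residual = np.max(np.abs(riccati_backup(P) - P))
print("P =\n", np.round(P, 4))
print("DARE residual:", residual)
```

The fixed point of `riccati_backup` is the DARE solution, just as the fixed point of the Bellman optimality operator is $V^*$.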

Worked example: Bellman backup on a two-state MDP#

Setup. Consider the following MDP:

States: $\mathcal{S} = \{s_1,\, s_2\}$
Actions: $\mathcal{A} = \{\text{stay},\, \text{switch}\}$
Transitions: deterministic — stay keeps the agent in the current state, switch moves it to the other state.
Rewards: $R(s_1, \cdot) = 1$ , $R(s_2, \cdot) = 0$ (reward depends only on the state).
Discount: $\gamma = 0.9$ .

Policy $\pi_1$ : always stay. Write the Bellman expectation equations:

V^{\pi_1}(s_1) = 1 + 0.9\, V^{\pi_1}(s_1) \qquad V^{\pi_1}(s_2) = 0 + 0.9\, V^{\pi_1}(s_2)

Solving the linear system:

V^{\pi_1}(s_1) = \frac{1}{1 - 0.9} = 10 \qquad V^{\pi_1}(s_2) = \frac{0}{1 - 0.9} = 0

Policy $\pi_2$ : always switch. The agent alternates $s_1 \to s_2 \to s_1 \to \cdots$ , collecting rewards $1, 0, 1, 0, \ldots$ . The Bellman equations couple the two states:

V^{\pi_2}(s_1) = 1 + 0.9\, V^{\pi_2}(s_2) \qquad V^{\pi_2}(s_2) = 0 + 0.9\, V^{\pi_2}(s_1)

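Substituting the second into the first: $V^{\pi_2}(s_1) = 1 + 0.81\, V^{\pi_2}(s_1)$, giving $V^{\pi_2}(s_1) = \frac{1}{0.19} \approx 5.26$ and $V^{\pi_2}(s_2) = 0.9\, V^{\pi_2}(s_1) \approx 4.74$. Always staying is better from $s_1$ ($10 > 5.26$), while always switching is better from $s_2$ ($4.74 > 0$).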

Mini coding exercise: iterative policy evaluation#

Implement the policy evaluation above using the iterative (not direct) method, and verify convergence to the analytical solution.

```python
import numpy as np

# MDP definition
gamma = 0.9

# States: 0 = s1, 1 = s2. Actions: 0 = stay, 1 = switch.
# P[s, a, s'] = P(s' | s, a)
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # from s1: stay->s1, switch->s2
    [[0.0, 1.0], [1.0, 0.0]],   # from s2: stay->s2, switch->s1
])
R = np.array([1.0, 0.0])        # R(s1) = 1, R(s2) = 0

# Policy pi1: always stay (action 0 in both states)
pi = np.array([0, 0])

# Iterative policy evaluation
# Initial guess of zeros; the contraction property guarantees convergence
# regardless of the starting estimate.
V = np.zeros(2)
for i in range(1000):
    # Bellman expectation backup: R(s) + γ Σ_{s'} P(s'|s,π(s)) V(s')
    # np.dot computes the expected future value over successor states.
    V_new = np.array([
        R[s] + gamma * np.dot(P[s, pi[s]], V)
        for s in range(2)
    ])
    # Convergence check using the infinity norm (max absolute difference).
    # A small change means V is close to the fixed point of the Bellman operator.
    delta = np.max(np.abs(V_new - V))
    # Synchronous (Jacobi-style) update: all states updated using the old V.
    # Assign before the check so the final V is the most recent estimate.
    V = V_new
    if delta < 1e-8:
        print(f"Converged after {i+1} iterations")
        break

print(f"V(s1) = {V[0]:.4f}")   # Expected: 10.0
print(f"V(s2) = {V[1]:.4f}")   # Expected: 0.0
```

Taxonomy of reinforcement learning problems#

Understanding where RL problems sit in this taxonomy tells you which algorithms are applicable and what assumptions you can rely on.

Bandits vs full MDPs#

Multi-armed bandits: no state transitions; actions yield immediate reward. The agent must balance exploration (trying uncertain actions) and exploitation (taking the best-known action).
Full MDPs: actions affect future states and long-term outcomes. Bandit reasoning applies locally at each state, but value functions must account for downstream consequences.

Bandits reappear throughout the course: in recommendation systems, in A/B testing, and in RLHF, where scoring a complete response to a prompt is commonly modeled as a contextual bandit problem (the prompt is the context, the response is the action).
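
The exploration-exploitation balance can be sketched with an epsilon-greedy rule, one simple strategy among many. The arm means, the exploration rate, and the step count below are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hypothetical Bernoulli arms
epsilon = 0.1                            # assumed exploration rate

q = np.zeros(3)                          # running estimate of each arm's mean reward
counts = np.zeros(3, dtype=int)

for t in range(5000):
    if rng.random() < epsilon:
        arm = int(rng.integers(3))       # explore: uniform random arm
    else:
        arm = int(np.argmax(q))          # exploit: best estimate so far
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    q[arm] += (reward - q[arm]) / counts[arm]   # incremental sample mean

print("estimates:", np.round(q, 3))
print("pull counts:", counts)
```

There is no state here: each pull is an independent decision, which is exactly what separates a bandit from a full MDP.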

Online vs offline reinforcement learning#

Online RL: the agent interacts with the environment during learning. Experience is generated by the current (or recent) policy. Standard algorithms — Q-learning, PPO, SAC — are online.
Offline RL: learning occurs from a fixed dataset collected by some behavior policy, without further environment interaction. This is critical in safety-sensitive domains (robotics, healthcare, alignment) where online exploration is expensive or dangerous.

Why not just use supervised learning for offline RL?

Model-free vs model-based RL#

Model-free RL: learns policies or value functions directly from experience, without explicitly representing or learning the transition dynamics $P$ .
Model-based RL: learns or uses a model of the environment's dynamics for planning. Enables higher sample efficiency by generating synthetic experience from the model, at the cost of bias when the model is imperfect.
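
The "learns a model" step of model-based RL can be illustrated in the tabular case, where the maximum-likelihood estimate of $P$ is a table of normalized transition counts. This is a sketch under assumptions: the ground-truth dynamics `P_true` and the uniformly random data collection are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2

# Assumed ground-truth dynamics, hidden from the learner.
P_true = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
])

# Collect transitions (s, a, s') with uniformly random states and actions.
counts = np.zeros((n_states, n_actions, n_states))
for _ in range(20_000):
    s = rng.integers(n_states)
    a = rng.integers(n_actions)
    s_next = rng.choice(n_states, p=P_true[s, a])
    counts[s, a, s_next] += 1

# Maximum-likelihood model: normalized transition counts.
P_hat = counts / counts.sum(axis=-1, keepdims=True)
max_error = np.max(np.abs(P_hat - P_true))

print("P_hat =\n", np.round(P_hat, 3))
print("max |P_hat - P_true| =", round(float(max_error), 4))
```

A planner can then run Bellman backups on `P_hat` in place of the true dynamics; any estimation error in `P_hat` carries over into the resulting values, which is the bias mentioned above.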

Relationship to supervised learning and control theory#

Supervised learning#

Control theory#

Classical control theory solves related problems but under different assumptions:

Known dynamics: control assumes $P$ is given analytically; RL treats it as unknown or learns it.
Stability guarantees: control provides formal guarantees (Lyapunov stability, ISS) that RL algorithms generally do not.
Function approximation: classical control works analytically in low-to-moderate dimensional spaces; RL scales to high-dimensional spaces by approximating $V$ and $\pi$ with neural networks, trading convergence guarantees for scalability.

Connections to modern methods#

DreamerV3: Bellman backups in latent space#

DreamerV3 (Hafner et al., 2023) achieves strong performance across diverse domains, including continuous control, Atari, and Minecraft, using a single fixed set of hyperparameters. Its architecture maps directly onto the MDP formalism:

TD-MPC: value-weighted planning#

TD-MPC (Hansen et al., 2022) combines model predictive control (MPC) with Bellman-based value learning. The key design choices map to the formalism as follows:

A learned latent world model provides short-horizon rollouts (the transition $P$ ).
A learned value function $Q^*(s, a)$ , trained via the Bellman optimality equation, estimates cumulative reward beyond the planning horizon.
At decision time, action sequences are sampled and evaluated by combining simulated returns (within the planning horizon) with Q-value bootstrapping (beyond it).
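
The scoring rule in the last step, simulated return inside the horizon plus a bootstrapped value beyond it, can be sketched on a toy problem. This is not TD-MPC itself: the sketch enumerates every action sequence instead of sampling, uses a hand-written deterministic model instead of a learned latent one, and plugs in the exact optimal state values as the terminal estimate. All names and numbers are assumptions.

```python
from itertools import product

import numpy as np

gamma = 0.9
# Stand-in "world model" (assumed): deterministic two-state dynamics and rewards.
next_state = np.array([[0, 1],      # next_state[s, a]; a = 0 stay, a = 1 switch
                       [1, 0]])
reward = np.array([[1.0, 1.0],
                   [0.0, 0.0]])     # reward[s, a]
# Stand-in terminal value (assumed exact here): V*(s) for this toy problem.
terminal_value = np.array([10.0, 9.0])


def plan(state: int, horizon: int) -> tuple[int, float]:
    """Score every action sequence of length `horizon`; return best first action."""
    best_action, best_score = 0, -np.inf
    for actions in product(range(2), repeat=horizon):
        s, score = state, 0.0
        for t, a in enumerate(actions):
            score += gamma**t * reward[s, a]          # simulated return in horizon
            s = next_state[s, a]
        score += gamma**horizon * terminal_value[s]   # bootstrap beyond horizon
        if score > best_score:
            best_action, best_score = actions[0], score
    return best_action, best_score


for s in (0, 1):
    for horizon in (1, 3):
        a, score = plan(s, horizon)
        print(f"state {s}, horizon {horizon}: first action = {a}, score = {score:.4f}")
```

With an exact terminal value, a one-step horizon already selects the optimal action; a longer horizon matters when the value estimate is inaccurate.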

Understanding these methods requires only the formalism from this lecture. The algorithms add function approximation, learned dynamics, and planning; the underlying equations are unchanged.

Key takeaways#

Common pitfalls in RL problem formulation#

These mistakes appear repeatedly in both research and practice. Recognizing them early prevents compounding errors in implementation and design.
