Reinforcement learning studies sequential decision making under uncertainty. Unlike supervised learning, where data is static, a reinforcement learning (RL) agent's actions actively influence the future, creating a defining feedback loop of perception and action.
Purpose of this lecture
Here is the central puzzle: How can we reduce the dazzling complexity of goal-directed behavior—a robot learning to walk, a language model generating coherent text, a game-playing agent discovering a novel strategy—into a single clean principle?
Reinforcement learning answers this by reducing everything to one idea: maximize expected cumulative reward. This sounds simple. It is not. The moment you accept this reduction, you confront three immediate tensions:
- How do we specify what matters? Reward design is brutally hard. Write the wrong reward, and the agent optimizes the letter of your goal while violating its spirit. This is not a minor engineering problem; it is a foundational gap that the entire field grapples with.
- How do we handle uncertainty about the future? The agent doesn't know what will happen when it acts. How do we reason about long-term consequences when the world is stochastic and partially hidden?
- How do we make this tractable? Maximizing a sum over infinitely many future time steps is not computable as stated. What mathematical structure lets us solve it?
This lecture answers all three by introducing the Markov Decision Process (MDP) formalism, value functions, and the Bellman equation. These concepts are not approximations or conveniences: they are the exact mathematical encoding of sequential decision making. Every algorithm studied later (tabular methods, deep RL, RLHF, GRPO, and agentic systems) is either an instantiation or an extension of the theory introduced here.
By the end of this lecture, you will understand not just what the Bellman equation is, but why it is the inevitable consequence of formalizing sequential decision making. You will also understand where it breaks and what engineering is required to make it work in the real world.
Sequential decision making
In reinforcement learning, an agent interacts with an environment over discrete time steps. This interaction forms a closed feedback loop:
- At time step $t$, the agent observes a state $s_t \in \mathcal{S}$
- It selects an action $a_t \in \mathcal{A}$
- The environment transitions to a new state $s_{t+1}$
- The agent receives a scalar reward $r_{t+1}$
The agent's objective is not to maximize immediate reward, but to act so as to maximize cumulative reward over time.
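The interaction loop can be sketched in a few lines of Python. The `ToyEnv` class, its `reset` and `step` methods, the corridor layout, and the reward values are all hypothetical choices made for illustration; they are not taken from any particular library.

```python
import random


class ToyEnv:
    """Hypothetical 1-D corridor: states 0..4, goal at state 4."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = left, 1 = right (movement is clipped at the walls)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0  # assumed reward: +1 at the goal, else 0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the list of (s_t, a_t, r_{t+1}) triples."""
    trajectory = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)                  # agent selects a_t given s_t
        s_next, r, done = env.step(a)  # environment returns s_{t+1} and r_{t+1}
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory


random.seed(0)
random_policy = lambda s: random.choice([0, 1])
traj = run_episode(ToyEnv(), random_policy)
print(len(traj), "steps, total reward", sum(r for _, _, r in traj))
```

Note that the reward stored alongside $(s_t, a_t)$ is the one received after the action, which is the indexing convention used throughout these notes.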
This temporal coupling distinguishes reinforcement learning from:
- Supervised learning, where data is fixed and independent of the learner's actions
- Standard optimization, where decisions are made once rather than sequentially
The notation $r_{t+1}$ for the reward received after taking action $a_t$ follows the Sutton & Barto convention: the reward is indexed by the time at which it is received, not the time at which the action that caused it was taken. This is a purely notational choice, but it recurs throughout the literature and in any codebase following S&B conventions.
To formalize "maximize cumulative reward over time" we need two things: a precise description of how the world works (an MDP), and a precise notion of what cumulative reward means (the return). We build toward both by first asking what kind of signal the agent should be trying to maximize.
The reward hypothesis
A foundational idea in reinforcement learning is the Reward Hypothesis:
All goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (reward).
This hypothesis suggests that even the most sophisticated behaviors—mastering a game of chess, the fluid locomotion of a humanoid robot, or the nuanced generation of human-like text—can be reduced to maximizing expected cumulative reward. It is a claim, not a mathematical truth, and the course can be read as a sustained examination of how far it holds and where it breaks down.
A concrete starting point
Before examining where the hypothesis succeeds and fails, consider the simplest case where it clearly works.
Gridworld navigation. An agent moves through a grid to reach a goal cell in the lower-right corner. The reward function assigns one fixed reward upon reaching the goal and a different fixed reward at every other step. This is unambiguous and easy to specify. An agent maximizing expected cumulative reward will learn to navigate efficiently to the goal. The reward hypothesis applies cleanly.
Atari Pong. Reward is $+1$ when the agent scores a point and $-1$ when the opponent scores. Again unambiguous. A policy maximizing this reward will beat the opponent. DeepMind's DQN (2013) demonstrated that a single algorithm, with no game-specific engineering, can learn superhuman performance from this signal alone, a striking validation of the hypothesis.
These cases share a key property: the reward function fully captures the intended goal. The difficulty arises when the specified reward and the intended goal diverge.
Why reward design matters
The hypothesis is silent on how to specify reward. That silence cuts both ways: the same reduction that powers modern systems also makes reward design one of the hardest problems in RL.
Productive uses. In modern AI systems, the reward hypothesis underpins:
- Reinforcement Learning from Human Feedback (RLHF)
- Preference and policy optimization (GRPO, DPO, PPO)
- Agentic systems optimized via success signals
Design challenges. Reward specification fails in characteristic ways:
- Sparse rewards: signal arrives only at task completion, making learning slow
- Dense rewards: accelerate learning but risk reward hacking
- Reward misspecification: the gap between the written reward and the actual goal
Reward design is revisited throughout the course. The tension between what is easy to specify and what is safe to optimize is one of the running themes.
With the reward hypothesis accepted as a working assumption, the next step is to formalize the environment in which the agent operates — the mechanism by which actions produce transitions and rewards.
Markov Decision Processes (MDPs)
The canonical mathematical model for reinforcement learning is the Markov Decision Process (MDP).
An MDP is defined by the tuple:

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$

where:
- $\mathcal{S}$ is the state space,
- $\mathcal{A}$ is the action space,
- $P(s' \mid s, a)$ is the transition probability kernel,
- $R(s, a)$ is the expected reward function,
- $\gamma \in [0, 1]$ is the discount factor.
The Markov property
The defining assumption of an MDP is the Markov property:

$$\Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = \Pr(s_{t+1} \mid s_t, a_t)$$

The future is conditionally independent of the past given the current state and action. The current state contains all information relevant to predicting what happens next.
Markovianity is engineered, not given
In practice, raw observations from the environment are almost never Markov. This is not a minor implementation detail — it is one of the most important points in the lecture.
Consider a robot that observes only joint positions but not joint velocities. The next state depends on the current velocity, which is not in the observation. Two robots at the same position but different velocities will transition to different next states, so the transition is non-Markov in the observed state. A language model that receives only the most recent token cannot distinguish "The bank charges fees" from "The bank slopes gently" — the same token ("bank") with entirely different meaning depending on history. In both cases, the true state is not captured in the current observation.
Systems where the agent receives observations that are functions of an underlying hidden state are formally Partially Observable MDPs (POMDPs):

$$o_t \sim O(\cdot \mid s_t)$$

POMDPs are the general case; MDPs are the special case where $o_t = s_t$. Most real-world problems are POMDPs, and this gap is a primary source of failure in deployed RL systems.
The standard engineering responses to non-Markovianity are:
- State augmentation: add velocity to a position-only observation; add contact forces to a proprioceptive observation. Extend until the Markov property holds approximately.
- Observation stacking: concatenate the last $k$ observations into a single input (e.g., frame stacking in Atari, observation history windows in locomotion RL). This approximates a sufficient statistic for the hidden state.
- Recurrent architectures: use an LSTM or transformer to maintain a compressed representation of history, effectively learning a belief state. This is the learned analog of the Kalman filter discussed in the dynamics course.
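As a minimal sketch of observation stacking, the snippet below keeps the last `k` observations in a fixed-length buffer. The `ObservationStack` class and its method names are hypothetical, and the example assumes a point moving at constant velocity whose position is the only thing observed.

```python
from collections import deque


class ObservationStack:
    """Keep the last k observations and expose them as one tuple."""

    def __init__(self, k):
        self.k = k
        self.buffer = deque(maxlen=k)

    def reset(self, first_obs):
        # Pad with copies of the first observation so the stacked input
        # always has length k, even at the start of an episode.
        self.buffer.clear()
        for _ in range(self.k):
            self.buffer.append(first_obs)
        return tuple(self.buffer)

    def push(self, obs):
        self.buffer.append(obs)  # the oldest observation is dropped automatically
        return tuple(self.buffer)


# Position-only observations of a point moving at constant velocity.
stack = ObservationStack(k=2)
state = stack.reset(0.0)
state = stack.push(0.5)
state = stack.push(1.0)
velocity_estimate = state[-1] - state[-2]  # finite difference recovers velocity
print(state, velocity_estimate)
```

With two stacked positions the velocity becomes recoverable by a finite difference, which is why stacking can restore approximate Markovianity for this kind of system. Higher-order hidden quantities would need a longer window.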
GenAI context: MDPs for language models
To ground this abstraction, consider RL applied to large language models (LLMs), as in RLHF:

| MDP component | LLM interpretation |
|---|---|
| State | Prompt + all tokens generated so far |
| Action | Next token selected from vocabulary |
| Transition | Appending selected token to context |
| Reward | Scalar score from a reward model |
| Discount | Relative value of earlier vs later tokens |

From this perspective, language generation is a sequential decision process, and alignment techniques are specialized RL algorithms operating in this MDP. The state is the full context so far, which is what makes the MDP Markov for LLMs: the complete token history is available as the state.
With the environment formalized as an MDP, the agent's behavior must be specified. A policy defines how the agent maps states to actions, and its properties determine whether the agent can learn effectively at all.
Policies
A policy $\pi$ specifies the agent's behavior as a mapping from states to actions (or to distributions over actions):

$$\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)$$

Policies may be:
- Deterministic: $a_t = \pi(s_t)$, selecting a single action per state. Common in deployment and in control-theoretic settings.
- Stochastic: $a_t \sim \pi(\cdot \mid s_t)$, sampling actions from a probability distribution. Essential for exploration during training, and fundamental to policy gradient methods.
In LLMs, the policy is the language model itself: a stochastic policy over the vocabulary at each token position, with temperature controlling the degree of stochasticity.
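To make the role of temperature concrete, here is a small sketch of a softmax policy over a handful of actions. The function names and the logit values are hypothetical; a real language model produces logits over a full vocabulary, but the arithmetic is the same.

```python
import math
import random


def softmax_policy(logits, temperature=1.0):
    """Return action probabilities pi(a | s) from per-action logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw a ~ pi(. | s) from a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three actions
for T in (0.1, 1.0, 10.0):
    probs = softmax_policy(logits, T)
    print(T, [round(p, 3) for p in probs])
```

Low temperature concentrates probability on the highest-scoring action (approaching a deterministic policy), while high temperature flattens the distribution toward uniform.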
Given a policy, we need to quantify how good it is — not just at the current step, but over the entire future trajectory it will generate. This leads to the return.
Returns and discounting
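The agent's objective is formalized via the return $G_t$, the discounted cumulative reward from time step $t$ onward:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
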
This single equation encodes three distinct ideas. Understanding all three is critical because they come apart in practice.
Three interpretations of the discount factor
1. Convergence guarantee. For continuing (non-terminating) tasks, the infinite sum may diverge if rewards are bounded away from zero. Multiplying by $\gamma^k$ with $\gamma < 1$ ensures $|G_t| \le r_{\max} / (1 - \gamma)$ whenever $|r_t| \le r_{\max}$, guaranteeing a finite return. This is a mathematical necessity, not a modeling choice. If $\gamma = 1$, returns can be unbounded.
2. Time preference. $\gamma$ encodes how much the agent discounts future rewards relative to immediate ones. $\gamma \to 1$ approaches equal weighting of all future rewards; $\gamma \to 0$ makes the agent myopic, caring only about immediate reward. The effective planning horizon is approximately $1/(1-\gamma)$ steps.
3. Survival probability. $\gamma$ can be interpreted as the probability that the episode continues at each step: with probability $1 - \gamma$, the episode ends and no further reward is received. Under this interpretation, discounting is not a modeling choice but a consequence of stochastic episode length. This connects episodic and continuing task formulations in a unified framework.
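The first two interpretations can be checked numerically. The sketch below computes a discounted return by backward recursion and compares it with the geometric-series bound $r_{\max}/(1-\gamma)$; the function name and the constant-reward sequence are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """G_t for a finite reward sequence r_{t+1}, r_{t+2}, ... (backward recursion)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_{t+1} + gamma * G_{t+1}
    return g


gamma = 0.9
rewards = [1.0] * 200            # constant reward of 1, truncated at 200 steps
g = discounted_return(rewards, gamma)
bound = 1.0 / (1.0 - gamma)      # r_max / (1 - gamma) with r_max = 1
horizon = 1.0 / (1.0 - gamma)    # rule-of-thumb effective horizon in steps
print(g, bound, horizon)
```

For a constant reward the truncated return approaches the bound from below, and the rule-of-thumb horizon $1/(1-\gamma)$ indicates roughly how many steps carry most of the weight.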
A warning on $\gamma < 1$ and policy gradient bias
In policy gradient methods (studied in a later lecture), using $\gamma < 1$ introduces a subtle theoretical issue: the discounted policy gradient does not exactly optimize the true undiscounted return. This matters in continuing tasks where the intended objective is undiscounted cumulative reward. The bias is well-understood theoretically but frequently overlooked in practice. We return to this when we derive policy gradient algorithms.
Episodic vs continuing tasks
- Episodic tasks terminate at a terminal state (games, robotic manipulation episodes, single LLM responses). The return is a finite sum.
- Continuing tasks run indefinitely (control systems, long-running agents). The return requires discounting for finiteness.
For episodic tasks, terminal states are often treated as absorbing states with zero reward and self-loops: once reached, the agent stays there and receives zero reward forever. This unifies the episodic and continuing formalisms; in code, the same effect is usually achieved by zeroing the bootstrapped value at terminal transitions.
The return is a random variable: it depends on the stochastic policy, the stochastic transitions, and the full future trajectory. Rather than work with this random variable directly, we take its expectation, yielding the value function, the central object of RL theory.
Value functions
Value functions quantify how good it is to be in a state or take an action, under a given policy $\pi$. They are not directly observable; they must be estimated from experience or derived from the environment model.
State-value function

$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid s_t = s \right]$$

This is the expected cumulative discounted reward, starting from state $s$ and following policy $\pi$ thereafter.
Action-value function (Q-function)

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s, a_t = a \right]$$

This conditions on both state and action: the expected return from taking action $a$ in state $s$, then following $\pi$.
The relationship between them:

$$V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$$

Value functions convert a sequential decision problem into an estimation problem over expected returns. The optimal value functions $V^*$ and $Q^*$ encode the solution to the RL problem: if you know $Q^*$, the optimal policy is simply $\pi^*(s) = \arg\max_a Q^*(s, a)$.
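In tabular form, both the relationship between the state-value and action-value functions and the greedy policy extraction are one line each. The sketch below uses a hypothetical Q-table and policy with made-up numbers; the greedy policy it extracts is optimal only if the table holds the optimal action values.

```python
import numpy as np

# Hypothetical Q-table for 3 states and 2 actions (values chosen for illustration).
Q = np.array([
    [1.0, 2.0],
    [0.5, 0.0],
    [3.0, 3.5],
])

# A stochastic policy pi(a | s): one row of action probabilities per state.
pi = np.array([
    [0.5, 0.5],
    [0.9, 0.1],
    [0.0, 1.0],
])

V_pi = (pi * Q).sum(axis=1)  # V(s) = sum_a pi(a|s) Q(s, a)
greedy = Q.argmax(axis=1)    # greedy action per state: argmax_a Q(s, a)
print(V_pi, greedy)
```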
Value functions defined as infinite sums are not directly computable. What makes them tractable is a recursive structure — the Bellman equations — which express the value of a state in terms of the values of its immediate successors.
Bellman equations
The Bellman equations express value functions recursively, relating the value of the current state to the values of successor states. They are the mathematical spine of almost every RL algorithm.
Bellman expectation equation

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \right]$$

This is a system of linear equations in $V^\pi$. Given a fixed policy $\pi$, the transition model $P$, and reward function $R$, solving for $V^\pi$ is a linear algebra problem of size $|\mathcal{S}|$. This is policy evaluation, and its tractability (for small state spaces) is the foundation of dynamic programming.
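Policy evaluation as a linear solve can be sketched directly. The example assumes the policy has already been folded into a state-to-state matrix `P_pi` and a reward vector `R_pi`; both names and the 3-state chain are hypothetical.

```python
import numpy as np

gamma = 0.9

# Hypothetical policy-induced quantities for a 3-state chain:
# P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a),  R_pi[s] = sum_a pi(a|s) R(s,a)
P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
R_pi = np.array([0.0, 0.0, 1.0])

# Bellman expectation equation in matrix form: V = R_pi + gamma * P_pi @ V,
# which rearranges to (I - gamma * P_pi) V = R_pi.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# The solution should satisfy the Bellman equation up to rounding error.
residual = np.max(np.abs(V - (R_pi + gamma * P_pi @ V)))
print(V, residual)
```

The direct solve costs on the order of the cube of the number of states, which is why it is only practical for small state spaces and why iterative methods are used otherwise.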
Bellman optimality equation

$$V^*(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]$$

This is a nonlinear fixed-point equation in $V^*$. Unlike the expectation equation, it cannot be solved by a single linear solve: the $\max$ makes it nonlinear. Yet for $\gamma < 1$ it has a unique solution, and iterative methods converge to it geometrically.
The contraction mapping property
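Define the Bellman optimality operator $\mathcal{T}$:

$$(\mathcal{T}V)(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \right]$$

$\mathcal{T}$ is a $\gamma$-contraction in the $\ell_\infty$ norm:

$$\| \mathcal{T}V - \mathcal{T}U \|_\infty \le \gamma \, \| V - U \|_\infty$$

By the Banach fixed-point theorem, repeated application of $\mathcal{T}$ converges geometrically to the unique fixed point $V^*$, at rate $\gamma$ per iteration. This is value iteration, and the contraction property is why it works. It also explains why $\gamma < 1$ is not just a modeling choice: $\gamma = 1$ destroys the contraction and convergence is no longer guaranteed. The discount factor is both a modeling parameter and an algorithmic one.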
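Both the contraction inequality and the convergence of value iteration can be checked on a small example. The 2-state, 2-action MDP below is hypothetical, with transition and reward numbers chosen only for illustration.

```python
import numpy as np

gamma = 0.9

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] and R[s, a].
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])


def bellman_optimality(V):
    """Optimality backup: max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return (R + gamma * P @ V).max(axis=1)


# Check the contraction inequality on two arbitrary value vectors.
U, W = np.array([0.0, 0.0]), np.array([5.0, -3.0])
lhs = np.max(np.abs(bellman_optimality(U) - bellman_optimality(W)))
rhs = gamma * np.max(np.abs(U - W))

# Value iteration: repeated application converges to the fixed point.
V = np.zeros(2)
for _ in range(500):
    V = bellman_optimality(V)
print(lhs <= rhs, V)
```

Checking two vectors does not prove the contraction property, but it shows what the inequality claims. After enough sweeps, `V` is unchanged by a further backup, which is the fixed-point condition.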
Connection to the LQR Riccati equation
Students of the robotics control lectures will recognize the Bellman optimality equation. The discrete-time Algebraic Riccati Equation (DARE):

$$P = Q + A^\top P A - A^\top P B \left( R + B^\top P B \right)^{-1} B^\top P A$$

is the Bellman optimality equation for the special case of linear dynamics and quadratic cost. In LQR notation (where $P$, $Q$, and $R$ denote the Riccati solution and the cost matrices, not the MDP quantities above), $V^*(x) = x^\top P x$ is the optimal value function; the Riccati equation is its fixed-point characterization. LQR solves the RL problem in closed form under linearity and quadratic cost. The Bellman framework and the LQR framework are the same framework at different levels of generality.
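The fixed-point view can be illustrated in the scalar case, where the Riccati recursion reduces to ordinary arithmetic. The sketch assumes an undiscounted scalar system $x' = a x + b u$ with stage cost $q x^2 + r u^2$; the parameter values and the function name are hypothetical.

```python
def riccati_step(p, a, b, q, r):
    """One backup of the scalar discrete-time Riccati recursion."""
    return q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)


# Hypothetical scalar system x' = a x + b u with stage cost q x^2 + r u^2.
a, b, q, r = 1.0, 1.0, 1.0, 1.0

# Iterating the backup from p = 0 mirrors value iteration on V(x) = p x^2.
p = 0.0
for _ in range(100):
    p = riccati_step(p, a, b, q, r)

gain = a * b * p / (r + b * b * p)  # optimal feedback u = -gain * x
print(p, gain)
```

For these parameters the fixed point satisfies $p^2 - p - 1 = 0$, so the iteration settles on the golden ratio. Each step plays the same role as a Bellman backup applied to a quadratic value function.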
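Worked example: Bellman backup on a two-state MDP
Setup. Consider the following MDP:
- States: $\mathcal{S} = \{s_1, s_2\}$
- Actions: $\mathcal{A} = \{\text{stay}, \text{switch}\}$
- Transitions: deterministic; stay keeps the agent in the current state, switch moves it to the other state.
- Rewards: $R(s_1) = 1$, $R(s_2) = 0$ (reward depends only on the state).
- Discount: $\gamma = 0.9$.
Policy $\pi_1$: always stay. Write the Bellman expectation equations:

$$V^{\pi_1}(s_1) = 1 + 0.9\, V^{\pi_1}(s_1), \qquad V^{\pi_1}(s_2) = 0 + 0.9\, V^{\pi_1}(s_2)$$

Solving the linear system:

$$V^{\pi_1}(s_1) = \frac{1}{1 - 0.9} = 10, \qquad V^{\pi_1}(s_2) = 0$$

Interpretation. Starting in $s_1$, the agent collects $+1$ at every step forever. The discounted sum $\sum_{k=0}^{\infty} 0.9^k = 10$ matches. Starting in $s_2$, the agent collects $0$ forever.
Policy $\pi_2$: always switch. The agent alternates $s_1 \to s_2 \to s_1 \to \cdots$, collecting rewards $1, 0, 1, 0, \ldots$. The Bellman equations couple the two states:

$$V^{\pi_2}(s_1) = 1 + 0.9\, V^{\pi_2}(s_2), \qquad V^{\pi_2}(s_2) = 0 + 0.9\, V^{\pi_2}(s_1)$$

Substituting the second into the first: $V^{\pi_2}(s_1) = 1 + 0.81\, V^{\pi_2}(s_1)$, giving $V^{\pi_2}(s_1) = 1/0.19 \approx 5.263$ and $V^{\pi_2}(s_2) = 0.9 \times 5.263 \approx 4.737$.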
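Conclusion. At $s_1$, $\pi_1$ is better: $V^{\pi_1}(s_1) = 10 > V^{\pi_2}(s_1) \approx 5.26$. At $s_2$, however, $\pi_2$ is better: $V^{\pi_2}(s_2) \approx 4.74 > V^{\pi_1}(s_2) = 0$, so neither fixed policy dominates the other. The optimal policy stays in $s_1$ to exploit the high-reward state and switches out of $s_2$ to reach it, which the Bellman optimality equation confirms by selecting stay at $s_1$ and switch at $s_2$, giving $V^*(s_1) = 10$ and $V^*(s_2) = 0.9 \times 10 = 9$.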
Mini coding exercise: iterative policy evaluation
Implement the policy evaluation above using the iterative (not direct) method, and verify convergence to the analytical solution.
```python
import numpy as np

# MDP definition
gamma = 0.9

# States: 0 = s1, 1 = s2. Actions: 0 = stay, 1 = switch.
# P[s, a, s'] = P(s' | s, a)
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # from s1: stay->s1, switch->s2
    [[0.0, 1.0], [1.0, 0.0]],  # from s2: stay->s2, switch->s1
])
R = np.array([1.0, 0.0])  # R(s1) = 1, R(s2) = 0

# Policy pi1: always stay (action 0 in both states)
pi = np.array([0, 0])

# Iterative policy evaluation
# Initial guess of zeros: the contraction property guarantees convergence
# regardless of the starting estimate.
V = np.zeros(2)
for i in range(1000):
    # Bellman expectation backup: R(s) + γ Σ_{s'} P(s'|s,π(s)) V(s')
    # np.dot computes the expected future value over successor states.
    V_new = np.array([
        R[s] + gamma * np.dot(P[s, pi[s]], V)
        for s in range(2)
    ])
    # Convergence check using the infinity norm (max absolute difference).
    # A small value means we've reached the fixed point of the Bellman operator.
    if np.max(np.abs(V_new - V)) < 1e-8:
        print(f"Converged after {i+1} iterations")
        break
    # Synchronous (Jacobi-style) update: all states updated using the old V.
    V = V_new

print(f"V(s1) = {V[0]:.4f}")  # Expected: 10.0
print(f"V(s2) = {V[1]:.4f}")  # Expected: 0.0
```
Extension: Change pi = np.array([1, 1]) (always switch) and verify $V(s_1) \approx 5.263$ and $V(s_2) \approx 4.737$. Then implement full policy iteration: alternate between policy evaluation and greedy policy improvement (pi[s] = argmax_a ...) until the policy stops changing. Verify it converges to pi = [0, 1] (stay in $s_1$, switch out of $s_2$).
Taxonomy of reinforcement learning problems
Understanding where RL problems sit in this taxonomy tells you which algorithms are applicable and what assumptions you can rely on.
Bandits vs full MDPs
- Multi-armed bandits: no state transitions; actions yield immediate reward. The agent must balance exploration (trying uncertain actions) and exploitation (taking the best-known action).
- Full MDPs: actions affect future states and long-term outcomes. Bandit reasoning applies locally at each state, but value functions must account for downstream consequences.
Bandits reappear throughout the course: in recommendation systems, A/B testing, and as the core mechanism in RLHF (where human feedback on response pairs is a bandit signal).
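The exploration-exploitation balance in a bandit can be sketched with epsilon-greedy action selection. This is one simple strategy among many, and the function name, arm means, noise model, and parameter values below are all illustrative assumptions.

```python
import random


def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a Gaussian bandit; returns value estimates and pull counts."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k
    counts = [0] * k
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore
        else:
            arm = max(range(k), key=lambda i: estimates[i])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental sample-mean update of the chosen arm's value estimate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_bandit([0.0, 0.5, 2.0])
print([round(e, 2) for e in estimates], counts)
```

There is no state and no transition here: each pull is an independent one-step decision, which is exactly what separates the bandit setting from a full MDP.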
Online vs offline reinforcement learning
- Online RL: the agent interacts with the environment during learning. Experience is generated by the current (or recent) policy. Standard algorithms (Q-learning, PPO, SAC) are online.
- Offline RL: learning occurs from a fixed dataset collected by some behavior policy, without further environment interaction. This is critical in safety-sensitive domains (robotics, healthcare, alignment) where online exploration is expensive or dangerous.
The key technical challenge in offline RL is distributional shift: the learned policy may assign high value to state-action pairs that are rare or absent in the dataset. Without the ability to collect new experience, these value estimates cannot be corrected, leading to overoptimistic Q-values and unstable policy improvement. This problem has no analog in supervised learning: a classifier trained on a fixed dataset does not synthesize new inputs from regions of low data density.
Model-free vs model-based RL
- Model-free RL: learns policies or value functions directly from experience, without explicitly representing or learning the transition dynamics $P$.
- Model-based RL: learns or uses a model of the environment's dynamics for planning. Enables higher sample efficiency by generating synthetic experience from the model, at the cost of bias when the model is imperfect.
These categories exist on a spectrum rather than as a clean binary. Dyna-style algorithms interleave real experience with simulated rollouts from a learned model — they are simultaneously model-free and model-based. Dreamer and MuZero learn abstract latent-space world models that are not interpretable as physical dynamics models but are used for planning. The binary framing is a useful starting point but breaks down for most modern methods.
Relationship to supervised learning and control theory
Supervised learning
Supervised learning optimizes a fixed loss over a static dataset. The learner's outputs do not influence what data it sees next. RL breaks this assumption: the policy determines what states are visited, which determines what experience is collected, which determines what the policy learns. This feedback loop makes the learning problem fundamentally non-stationary and requires explicit management of exploration.
Control theory
Classical control theory solves related problems but under different assumptions:
- Known dynamics: control assumes the dynamics $P$ are given analytically; RL treats them as unknown or learns them.
- Stability guarantees: control provides formal guarantees (Lyapunov stability, input-to-state stability) that RL algorithms generally do not.
- Function approximation: classical control works analytically in low-to-moderate dimensional spaces; RL scales to high-dimensional spaces by approximating $V$ and $Q$ with neural networks, trading convergence guarantees for scalability.
The connection is exact at the boundary: LQR is RL with known linear dynamics and quadratic cost, solved analytically. Moving from LQR to RL is moving from known to unknown dynamics, from analytical to learned value functions, and from guaranteed convergence to empirical stability. Understanding this gradient, rather than treating control and RL as separate disciplines, is one of the goals of this course.
Connections to modern methods
The formalism introduced in this lecture is not only foundational; it is directly instantiated in state-of-the-art algorithms. The following examples show how each element of the MDP tuple maps to a concrete design decision in recent systems.
DreamerV3: Bellman backups in latent space
DreamerV3 (Hafner et al., 2023) achieves strong performance across diverse domains (robotics, Atari, Minecraft) from pixels alone. Its architecture maps directly onto the MDP formalism:

| MDP concept | DreamerV3 implementation |
|---|---|
| State | Compact latent vector from a recurrent state-space model (RSSM) |
| Transition | Learned dynamics model in latent space |
| Reward | Learned reward predictor from latent state |
| Value | Neural network trained via Bellman backup over imagined rollouts |
| POMDP gap | RSSM maintains a belief state over raw observations, directly addressing the Markovianity problem |

The key insight: once a world model is learned, Bellman backups can be applied to imagined trajectories without interacting with the real environment. Every imagined step is a Bellman backup; every imagined rollout is value iteration in latent space. DreamerV3 is model-based RL at scale: the same equations, applied inside a learned model.
TD-MPC: value-weighted planning
TD-MPC (Hansen et al., 2022) combines model predictive control (MPC) with Bellman-based value learning. The key design choices map to the formalism as follows:
- A learned latent world model provides short-horizon rollouts (the transition $P$).
- A learned value function $Q$, trained via the Bellman optimality equation, estimates cumulative reward beyond the planning horizon.
- At decision time, action sequences are sampled and evaluated by combining simulated returns (within the planning horizon) with Q-value bootstrapping (beyond it).
The core insight: the Bellman Q-function acts as a surrogate for long-horizon returns, allowing the planner to terminate rollouts early without losing information about the future. This is exactly why value functions are useful — they compress an infinite sum into a single scalar that can be queried at any state.
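The idea of truncating a rollout and bootstrapping with a value estimate can be shown with plain arithmetic. The sketch below is not TD-MPC's planner; the `bootstrapped_score` function and the constant-reward setting are hypothetical, chosen so the true value is known in closed form.

```python
def bootstrapped_score(rewards, terminal_value, gamma):
    """Score of an H-step rollout: sum_t gamma^t r_t + gamma^H * terminal value."""
    score = 0.0
    for t, r in enumerate(rewards):
        score += (gamma ** t) * r
    return score + (gamma ** len(rewards)) * terminal_value


gamma = 0.9
# A constant reward of 1 forever has discounted return 1 / (1 - gamma).
true_value = 1.0 / (1.0 - gamma)
for horizon in (1, 5, 20):
    score = bootstrapped_score([1.0] * horizon, true_value, gamma)
    print(horizon, round(score, 6))
```

When the terminal value is exact, the score is the same for every horizon, so nothing is lost by cutting the rollout short. With a learned value function the terminal term carries approximation error, discounted by $\gamma^H$, which is the trade-off a planner makes when choosing the horizon.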
Understanding these methods requires only the formalism from this lecture. The algorithms add function approximation, learned dynamics, and planning; the underlying equations are unchanged.
Key takeaways
The concepts introduced here form the foundation for every algorithm in the course. The chain runs as follows: the reward hypothesis motivates reducing goal-directed behavior to scalar maximization, while immediately surfacing the challenge of reward specification. MDPs formalize the sequential interaction, with the Markov property as a structural assumption that must be engineered in practice. Policies define behavior; value functions quantify its long-term consequences. The Bellman equations express value functions recursively (linearly for policy evaluation, as a nonlinear fixed point for optimality), and the contraction mapping property is what makes iterative solution tractable. The discount factor simultaneously ensures convergence, encodes time preference, and introduces a policy gradient bias that resurfaces later. And the LQR/Riccati connection unifies this formalism with the control theory developed in preceding lectures: RL is generalized optimal control.
Common pitfalls in RL problem formulation
These mistakes appear repeatedly in both research and practice. Recognizing them early prevents compounding errors in implementation and design.
1. Non-Markovian state representation. Using raw sensor observations as states without verifying the Markov property. Symptom: the value function fails to converge or oscillates despite a correct implementation. Diagnosis: place the agent in the same observation from different histories and check whether subsequent transitions differ. Fix: state augmentation (add velocity, contact forces) or observation stacking.
2. Reward hacking. Optimizing a specified reward that does not fully capture the intended goal. Symptom: high reward but poor task performance; the agent found a policy that satisfies the letter of the reward, not the spirit. Prevention: before training, adversarially ask "what behavior maximizes this reward in an unintended way?" A robot rewarded purely for joint speed will fall forward. An LLM rewarded for approval ratings will learn to flatter.
3. Setting $\gamma = 1$ in a continuing task. Returns can diverge; the Bellman operator loses its contraction property; value iteration does not converge. Symptom: numerical overflow or NaN values during training. Fix: use $\gamma < 1$, or reformulate the task as episodic with a meaningful terminal condition.
4. Confusing episodic and continuing formulations. A robot manipulation task with a fixed episode length is episodic; a thermostat control task running indefinitely is continuing. Applying the wrong formulation produces an ill-defined objective. Fix: determine whether terminal states exist. If not, the task is continuing and requires discounting for a finite return.
5. Ignoring distributional shift in offline RL. In offline settings, the dataset was collected by a behavior policy; the policy being learned is the target policy. Ignoring this distinction leads to overestimated Q-values in regions not covered by the dataset, causing catastrophic failure outside the data distribution. This has no direct analog in supervised learning: a classifier does not synthesize inputs from data-sparse regions.
Conceptual questions
- A robot observes only joint positions, not velocities. Describe precisely why this violates the Markov property. Propose two different state engineering strategies that restore approximate Markovianity, and explain what each one assumes about the system.
- You set $\gamma = 1$ in a value iteration algorithm on a continuing task with bounded positive rewards. What happens to $V^*$, and why does the Bellman optimality operator lose its contraction property? What is the minimum change to the problem setup that restores convergence?
- An RLHF system trains a language model to receive high scores from a reward model trained on human preference data. After deployment, users report that the model is more persuasive but less accurate. Explain this failure in terms of the reward hypothesis and reward misspecification. What would you change in the training setup?
- Compare the Bellman expectation equation for policy evaluation with the solution of a system of linear equations $Ax = b$. Identify what plays the role of $A$, $x$, and $b$, and explain why the system is always solvable (i.e., why $I - \gamma P^\pi$ is always invertible for $\gamma < 1$).
- An offline RL algorithm is trained on a dataset of robot manipulation demonstrations. After training, the policy performs well on demonstrations similar to the training data but fails catastrophically on slightly different configurations. Explain this failure in terms of distributional shift and overestimated Q-values. How does this differ from the analogous failure mode in behavioral cloning?
Looking ahead
The next lecture studies the simplest nontrivial RL setting: multi-armed bandits. This isolates the exploration-exploitation tradeoff, the core challenge of RL, without the additional complexity of state transitions. Bandit algorithms and regret analysis provide the theoretical tools that extend, with modification, to full MDPs and to the RLHF feedback mechanisms studied later in the course.
Further reading
- Bellman, R. (1957). Dynamic Programming. Princeton University Press. The foundational text introducing the Bellman equation and the principle of optimality.
- Sutton & Barto (2018). Reinforcement Learning: An Introduction (2nd ed.), Chapter 3: Finite Markov Decision Processes. The standard textbook reference; freely available online.
- Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley. The comprehensive mathematical treatment; essential for rigorous proofs of convergence and the contraction mapping results used here.
- Silver, D. (2015). UCL Course on RL, Lecture 2: Markov Decision Processes. Lecture slides and video cover the same material with worked gridworld examples. Freely available online.
- Bertsekas, D. P. (2019). Reinforcement Learning and Optimal Control. Athena Scientific. Bridges the gap between classical control theory (including LQR/Riccati) and modern RL; directly relevant to the control theory connections made in this lecture.