Purpose of this lecture
The preceding thirteen weeks built from first principles (MDPs, Bellman equations, policy gradients) through the full modern stack of deep reinforcement learning (RL) algorithms, and arrived at the alignment methods that deploy these foundations in large language models. This final lecture asks what happens when those aligned models are given the ability to act: to search, execute code, call APIs, write files, and coordinate with other agents.
Agentic AI systems are not a new abstraction. They are the Markov decision process (MDP) formalism from Week 1, instantiated in a particular way: the state is a context window, the action space includes tools, the environment is the external world, and the policy is an aligned language model. Every framework introduced across this course (POMDPs, planning, credit assignment, offline RL, multi-agent coordination) reappears in the engineering and failure modes of these systems.
The goal of this lecture is to close that loop: show precisely where each piece of RL theory lands in the architecture of a deployed agent, identify the open problems that current systems do not solve, and place the course in the broader trajectory of AI research it is part of.
Agents as MDPs
The mapping between the MDP tuple and a deployed tool-using agent is exact enough to be useful as an engineering checklist.
State: the context window as belief state
The agent's state at time $t$ is its context window: the system prompt, the user's instruction, and the full transcript of thoughts, tool calls, and observations accumulated so far. The context window is not the true state of the world; it is a belief state in the POMDP sense (Week 9). The true environment state includes all documents the agent has not yet retrieved, the current state of external systems, and any information that has scrolled out of the context window. The agent operates on a lossy, partial summary of reality, exactly as formalized by the POMDP emission model $O(o_t \mid s_t)$.
This framing has direct engineering consequences: context window management is belief state maintenance. Decisions about what to retain, summarize, or discard from the context are decisions about what sufficient statistics to keep for future optimal action. Retrieval mechanisms — fetching relevant documents at query time — are approximations to Bayesian filtering, extending the effective belief state beyond what fits in the context window.
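The retain/summarize/discard decision described above can be sketched in a few lines. This is a minimal illustration, not a production policy: the function names, the whitespace-based token count, and the `summarize` callback (which in practice would be a model call) are all assumptions made for the example.

```python
from typing import Callable, List


def n_tokens(text: str) -> int:
    # Crude proxy (assumption): whitespace-separated words stand in for tokens.
    return len(text.split())


def compress_context(messages: List[str], budget: int,
                     summarize: Callable[[List[str]], str],
                     keep_recent: int = 2) -> List[str]:
    """Keep the first message and the most recent ones; summarize the middle."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if sum(n_tokens(m) for m in messages) <= budget:
        return list(messages)  # fits: no information is discarded
    head = messages[:1]
    middle = messages[1:-keep_recent]
    tail = messages[-keep_recent:]
    if not middle:
        return list(messages)  # nothing left to compress
    # Lossy step: the middle of the transcript collapses into one summary.
    return head + [summarize(middle)] + tail


history = ["system: be careful", "a b c d e", "f g h i j", "k l", "m n"]
compressed = compress_context(
    history, budget=5,
    summarize=lambda ms: f"summary of {len(ms)} messages")
```

The design choice that matters is which messages are exempt from compression. Here the first message and the latest turns are kept verbatim, on the assumption that instructions and recent observations are the statistics most needed for the next action.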
Actions: tools as an augmented action space
In standard language model inference, the action space is the vocabulary $\mathcal{V}$: the agent selects the next token. In an agentic system, the action space is extended to include structured tool calls (API invocations, code execution requests, database queries, web searches) that interact with the external environment and produce observations appended to the context.
The tool action space has a different structure from the token vocabulary: tools are compositional (a web search tool requires a query argument; a code executor requires a code string), have variable-length outputs (a search may return hundreds of tokens; a calculator returns a single number), and have side effects (calling an API modifies the world, not just the agent's context). These properties make the agent's action space significantly richer than an LLM's token-prediction space and introduce engineering challenges (tool selection, argument generation, output parsing) that do not arise in standard language generation.
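To make the structure of a tool action concrete, here is a minimal sketch of a structured tool call and a dispatcher. The `ToolCall` and `Tool` types, the registry, and the toy `calculator` are hypothetical names invented for this illustration; they are not taken from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ToolCall:
    name: str              # tool selection
    args: Dict[str, Any]   # argument generation


@dataclass(frozen=True)
class Tool:
    fn: Callable[..., str]
    has_side_effects: bool  # True if calling it modifies the outside world


def calculator(expression: str) -> str:
    # Toy tool (assumption): only supports "a + b".
    left, right = expression.split("+")
    return str(float(left) + float(right))


REGISTRY: Dict[str, Tool] = {"calculator": Tool(calculator, has_side_effects=False)}


def execute(call: ToolCall, registry: Dict[str, Tool]) -> str:
    """Run a tool call and return the observation appended to the context."""
    tool = registry.get(call.name)
    if tool is None:
        return f"error: unknown tool {call.name!r}"
    try:
        return tool.fn(**call.args)
    except (TypeError, ValueError) as exc:
        # Malformed arguments become an observation, not a crash.
        return f"error: {exc}"


observation = execute(ToolCall("calculator", {"expression": "2 + 3"}), REGISTRY)
```

Returning errors as observations keeps the agent loop running, so the policy can see its own malformed call and correct it on the next step.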
Transitions: environments as non-Markov, stochastic systems
When the agent calls a tool, the environment transitions: the tool executes, returns an observation, and the context window updates. From the RL perspective, the transition model includes the behavior of all external systems the agent can interact with. These systems are often:
- Non-deterministic: the same web search query returns different results at different times; code execution may succeed or fail depending on external state.
- Partially observable: the agent cannot verify whether a tool call succeeded without additional queries; database state may be stale.
- Adversarial: web pages may contain prompt injection attempts — instructions embedded in retrieved content that try to redirect the agent's behavior (an instance of the multi-agent adversarial setting from Week 9).
Rewards: multi-objective and delayed
The reward signal in agentic systems is rarely a single scalar available at each step. Typical structures include:
- Terminal reward: the task either succeeded or failed at the end of the episode (binary, extremely sparse). Analogous to the sparse-reward settings from Week 9 that motivated curiosity-driven exploration.
- Process reward: intermediate steps are evaluated (did the agent retrieve the right document? did the code execute correctly?). Analogous to a dense or shaped reward that reduces credit assignment difficulty.
- Multi-objective: task success weighted against latency (number of steps), cost (token consumption, API call fees), and safety (whether the agent took any irreversible or potentially harmful actions).
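One common way to combine such objectives is a weighted sum over a trajectory $\tau$. The form below is an illustrative sketch, not necessarily the formula used in the course: the weights $w$ and the component names are assumptions introduced here.

```latex
\[
R(\tau) \;=\; w_{\text{task}}\, R_{\text{task}}(\tau)
\;-\; w_{\text{lat}}\, N_{\text{steps}}(\tau)
\;-\; w_{\text{cost}}\, C_{\text{cost}}(\tau)
\;-\; w_{\text{safe}}\, C_{\text{unsafe}}(\tau),
\qquad w_{\text{task}},\, w_{\text{lat}},\, w_{\text{cost}},\, w_{\text{safe}} \ge 0 .
\]
```

Under this form, raising $w_{\text{lat}}$ or $w_{\text{cost}}$ relative to $w_{\text{task}}$ makes every extra verification step more expensive, and raising $w_{\text{safe}}$ makes any action with a nonzero estimated risk less attractive.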
The weighting among these objectives is a reward specification problem — the exact issue raised by the reward hypothesis in Week 1. Overweighting efficiency causes the agent to skip necessary verification steps; overweighting safety causes the agent to refuse ambiguous actions that are actually harmless.
Planning architectures for agents
ReAct: interleaving reasoning and action
The ReAct framework (Yao et al., 2022) demonstrated that prompting a language model to generate an explicit thought before each action substantially improves task success on factual question-answering and interactive tasks. The pattern is straightforward:
Thought: I need to find the current CEO of NVIDIA.
I'll search for this directly.
Action: search("NVIDIA current CEO 2024")
Observation: Jensen Huang is the CEO of NVIDIA...
Thought: I have the answer. I should return it.
Action: finish("Jensen Huang")
ReAct's key insight was that language models already encode a latent world model: their next-token predictions implicitly encode beliefs about how the world responds to actions. By forcing the model to externalize this reasoning as natural language, the agent becomes more interpretable and more robust. From the RL perspective, the thought trace is the agent running a one-step world model rollout (Week 10) in natural language before committing to a tool call. The agent simulates the consequences of potential actions ("if I search for X, I expect to find Y") and selects the action that its internal model predicts will be most informative. This is model-based planning in the token space, with the language model's next-token predictions serving as the dynamics model.
What ReAct does not provide is learning: the model's policy does not improve from failure. The thought trace helps at inference time, but there is no mechanism to correct the model when its predicted consequences are wrong. A second limitation is that the thought trace is shallow: the agent reasons about the immediate next action but does not plan over multiple steps. Complex tasks requiring sequential tool use with dependencies (retrieve document A to determine what to query in database B, then synthesize the results) require planning horizons longer than in-context reasoning reliably produces.
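The Thought / Action / Observation pattern can be written as a short control loop. The sketch below is illustrative only: `react_loop`, the `policy` callable (a stand-in for the language model), and the scripted policy and fake search tool used to exercise it are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (thought, tool name, tool argument)


def react_loop(policy: Callable[[List[str]], Step],
               tools: Dict[str, Callable[[str], str]],
               task: str, max_steps: int = 5) -> Tuple[str, List[str]]:
    """Interleave thoughts, tool calls, and observations until 'finish'."""
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, tool, arg = policy(context)
        context.append(f"Thought: {thought}")
        context.append(f"Action: {tool}({arg!r})")
        if tool == "finish":
            return arg, context
        if tool in tools:
            observation = tools[tool](arg)
        else:
            observation = f"unknown tool {tool}"
        context.append(f"Observation: {observation}")
    return "", context  # step budget exhausted without an answer


def scripted_policy(context: List[str]) -> Step:
    # Stand-in for the language model: search once, then answer.
    if not any(line.startswith("Observation:") for line in context):
        return ("I need to look this up.", "search", "NVIDIA current CEO")
    return ("I have the answer.", "finish", "Jensen Huang")


fake_tools = {"search": lambda q: "Jensen Huang is the CEO of NVIDIA..."}
answer, trace = react_loop(scripted_policy, fake_tools, "Who is the CEO of NVIDIA?")
```

Note that nothing in the loop updates `policy`: the context grows, but the mapping from context to action is fixed, which is the "no learning" limitation described above.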
Reflexion: using failure as a learning signal
Reflexion (Shinn et al., 2023) extends the ReAct loop with an explicit self-critique and retry mechanism. After an episode fails (the agent does not achieve the goal), the agent generates a natural language reflection on why it failed and what it should do differently. This reflection is prepended to the context in the next episode attempt. Shinn et al. demonstrated that Reflexion improves success rates on reasoning and decision-making tasks by 10–30% across multiple benchmark domains compared to vanilla ReAct. The mechanism is in-context learning: each failure trajectory becomes part of the model's context, and the model conditions its future behavior on past failures without any parameter updates.
From the RL perspective, Reflexion is episodic learning without gradient updates: the "policy improvement" step is carried out in the context window rather than in the parameter space. The agent's memory of past failures (stored as text in context) plays the role of the experience replay buffer from Week 6: it contains evidence about which strategies failed, which the policy can condition on to avoid repeating mistakes. However, Reflexion has a critical limitation: it assumes the agent can accurately diagnose why it failed. When the agent's self-model is inaccurate or when failure is caused by factors outside its reasoning (a missing database, an incorrect tool response), the generated reflection may be misleading or directly counterproductive, teaching the agent to reinforce the wrong lessons.
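The retry-with-reflection mechanism can be sketched as an outer loop around an episode. This is a simplified illustration rather than the authors' implementation: `run_episode` and `reflect` are hypothetical callables standing in for the agent rollout and the self-critique prompt, and the toy episode below succeeds only once a relevant reflection is present.

```python
from typing import Callable, List, Tuple


def reflexion(run_episode: Callable[[List[str]], Tuple[bool, str]],
              reflect: Callable[[str], str],
              max_trials: int = 3) -> Tuple[bool, int, List[str]]:
    """Retry an episode, carrying verbal reflections forward between trials."""
    reflections: List[str] = []
    for trial in range(1, max_trials + 1):
        success, trajectory = run_episode(reflections)
        if success:
            return True, trial, reflections
        # No parameter update: the only thing that changes is the text
        # that the next attempt is conditioned on.
        reflections.append(reflect(trajectory))
    return False, max_trials, reflections


def toy_episode(reflections: List[str]) -> Tuple[bool, str]:
    ok = any("check units" in r for r in reflections)
    return ok, ("answered in feet" if ok else "answered in meters")


def toy_reflect(trajectory: str) -> str:
    return "check units before answering"


solved, trials_used, notes = reflexion(toy_episode, toy_reflect)
```

If `reflect` returns a diagnosis that does not match the real cause of failure, the loop still runs but the extra context does not help, which is the misdiagnosis risk noted above.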
Plan-and-execute: hierarchical decomposition
For complex multi-step tasks, a common architecture separates planning from execution: a planner LLM generates a high-level task decomposition, and separate executor agents carry out each subtask. This is hierarchical RL (HRL) in the agentic context: the planner operates at the level of subtasks (high-level actions), while executors operate at the level of tool calls (low-level actions).
The HRL connection identifies the key failure mode: subgoal misspecification. If the planner decomposes the task incorrectly — produces subtasks that are inconsistent, in the wrong order, or do not cover the full task — the executors will complete their assigned subtasks successfully but the high-level task will fail. This is the multi-agent credit assignment problem from Week 9 in a hierarchical form: which level of the hierarchy was responsible for the failure?
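The subgoal misspecification failure can be reproduced in a toy setting. In the sketch below, every name (`plan_and_execute`, the toy planner, the executors, the verifier) is hypothetical; the planner deliberately omits the database query, so each executor completes its own subtask while the overall check fails.

```python
from typing import Callable, Dict, List, Tuple

Plan = List[Tuple[str, str]]  # (executor name, subtask description)


def plan_and_execute(planner: Callable[[str], Plan],
                     executors: Dict[str, Callable[[str], str]],
                     task: str,
                     verify: Callable[[List[str]], bool]) -> Tuple[bool, List[str]]:
    """The planner picks subtasks; executors run them; verify checks the whole task."""
    results = [executors[name](subtask) for name, subtask in planner(task)]
    return verify(results), results


def toy_planner(task: str) -> Plan:
    # Misspecified plan: the step that queries database B is missing.
    return [("retrieve", "document A"), ("synthesize", "final report")]


toy_executors: Dict[str, Callable[[str], str]] = {
    "retrieve": lambda sub: f"retrieved {sub}",
    "query": lambda sub: f"queried {sub}",
    "synthesize": lambda sub: f"wrote {sub}",
}


def toy_verify(results: List[str]) -> bool:
    # Task-level check: a correct run must have queried the database.
    return any(r.startswith("queried") for r in results)


task_ok, step_results = plan_and_execute(
    toy_planner, toy_executors, "review", toy_verify)
```

Here `task_ok` is false even though no executor reported an error, so a per-executor success signal alone would assign the blame to nobody.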
Memory in agentic systems
Long-horizon tasks exceed the context window. Agents must maintain information across context resets, between sessions, and across multiple concurrent subagents. The RL literature distinguishes several types of memory that map directly onto agentic system components:
Working memory — the context window — holds the immediate task state. It is finite, volatile, and the primary bottleneck for long-horizon tasks. Compressing working memory (summarizing the context before it fills up) is a form of lossy belief state compression, with the same tradeoffs: too aggressive compression discards relevant information; too conservative compression causes context overflow.
Episodic memory — a vector database of past interaction trajectories — allows the agent to recall semantically similar past experiences. Retrieval-augmented generation against a memory store is episodic replay (Week 6) at inference time: the agent retrieves the most relevant past experiences and conditions its current policy on them. The retrieval mechanism (cosine similarity in embedding space) is a form of experience matching rather than exact replay.
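Retrieval by cosine similarity can be sketched without any vector database. The example below is a minimal illustration: the two-dimensional "embeddings" and memory entries are made up, and a real system would obtain vectors from an embedding model and use an approximate nearest-neighbor index.

```python
import math
from typing import List, Sequence, Tuple

Memory = List[Tuple[Sequence[float], str]]  # (embedding, stored experience)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def retrieve(query: Sequence[float], memory: Memory, k: int = 2) -> List[str]:
    """Return the k stored experiences most similar to the query embedding."""
    ranked = sorted(memory, key=lambda item: cosine(query, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


toy_memory: Memory = [
    ([1.0, 0.0], "searched docs for API limits"),
    ([0.0, 1.0], "booked a flight"),
    ([0.9, 0.1], "hit rate limit, backed off"),
]
recalled = retrieve([1.0, 0.0], toy_memory, k=2)
```

Because ranking is by similarity rather than identity, the second result is a related experience, not a copy of the current situation, which is the "experience matching rather than exact replay" distinction drawn above.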
Semantic memory — a structured knowledge base (facts, entity properties, relationships) — is the agent's equivalent of a learned world model: it encodes persistent beliefs about how the world works, updated asynchronously as new information is retrieved.
Training agentic policies
Deploying a well-aligned base model as an agent does not guarantee good agentic behavior. A model aligned for conversational helpfulness may not efficiently use tools, may fail to recognize when retrieval is necessary, or may generate malformed tool call syntax. Training specifically for agentic behavior requires RL with environment feedback.
Process reward models for agents
The sparse terminal reward structure of most agentic tasks (success or failure at the end of a long trajectory) makes policy gradient training difficult because of the long credit assignment horizon and high variance. Process Reward Models (PRMs) (Lightman et al., 2023) densify the reward signal by evaluating intermediate steps: did this tool call make sense given the task? Was this reasoning step sound? A PRM trained on human (or AI) annotations of step quality provides a shaped reward that reduces the effective credit assignment horizon. Lightman et al.'s empirical finding was striking: PRMs trained to label step-level correctness in mathematical reasoning outperform simpler outcome-only reward models (ORMs), especially on longer problems where a single early error cascades. The mechanism is clear: a PRM can detect and suppress incorrect reasoning paths before the agent commits to building on them, whereas an ORM only provides a binary signal at the end.
However, the PRM vs. ORM debate remains open. DeepSeek-R1's result — that outcome-only GRPO can implicitly elicit multi-step reasoning without explicit step-level supervision — suggests that PRMs may be an unnecessary intermediate representation for some tasks. The field lacks consensus on when process supervision is necessary versus when outcome supervision is sufficient. PRMs for agents are the exact analog of the step-level reward shaping discussed in Week 1: they trade specification effort (labeling intermediate steps) for training efficiency, at the risk of reward hacking if the PRM is inaccurate. An agent might learn to generate reasoning steps that score highly on the PRM without actually advancing toward the goal, a manifestation of Goodhart's Law (Week 12).
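The difference between outcome-only and process-shaped credit can be seen in the per-step returns each one produces. The sketch below is illustrative: the function names and the toy step scores are assumptions, and the "PRM scores" are hand-written numbers rather than the output of a trained model.

```python
from typing import List


def outcome_returns(n_steps: int, outcome: float, gamma: float = 1.0) -> List[float]:
    """Terminal reward only: each step's return is the discounted final outcome."""
    return [outcome * gamma ** (n_steps - 1 - t) for t in range(n_steps)]


def process_returns(step_scores: List[float], outcome: float,
                    gamma: float = 1.0) -> List[float]:
    """Step scores act as dense rewards; the outcome is added at the last step."""
    rewards = list(step_scores)
    rewards[-1] += outcome
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


# A failed episode (outcome 0) whose second step the scorer marked as bad.
sparse = outcome_returns(3, outcome=0.0)
dense = process_returns([1.0, -1.0, 1.0], outcome=0.0)
```

With outcome-only reward the failed trajectory gives every step the same return, so nothing distinguishes the bad step. With step scores the returns differ across steps, but they are only as trustworthy as the scorer: a trajectory whose steps all score highly yet fails the task would still receive positive returns, which is the reward hacking risk described above.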
The tool-use curriculum
Effective agentic RL training typically requires a curriculum: start with simple single-tool tasks, then gradually introduce multi-tool, multi-step tasks as the agent's competence develops. This is the same curriculum structure that self-play provides in multi-agent settings (Week 9) and that Dyna's interleaved real-imagined training approximates (Week 10): easier tasks early in training prevent the policy gradient from collapsing in the face of near-zero terminal rewards on hard tasks.
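A simple way to implement such a curriculum is to promote the agent to the next difficulty level once its recent success rate clears a threshold. The sketch below is one possible scheduler under that assumption; the class name, level names, threshold, and window size are all illustrative choices, not values from the course.

```python
from collections import deque
from typing import Deque, List


class Curriculum:
    """Advance to the next level when the rolling success rate is high enough."""

    def __init__(self, levels: List[str], threshold: float = 0.8, window: int = 5):
        self.levels = levels
        self.threshold = threshold
        self.index = 0
        self.recent: Deque[bool] = deque(maxlen=window)

    @property
    def level(self) -> str:
        return self.levels[self.index]

    def record(self, success: bool) -> None:
        self.recent.append(success)
        window_full = len(self.recent) == self.recent.maxlen
        rate = sum(self.recent) / len(self.recent)
        on_last_level = self.index == len(self.levels) - 1
        if window_full and rate >= self.threshold and not on_last_level:
            self.index += 1
            self.recent.clear()  # statistics restart on the harder level


curriculum = Curriculum(["single-tool", "two-tool", "multi-step"])
for outcome in [True, True, False, True, True, True, True]:
    curriculum.record(outcome)
```

Clearing the window after each promotion means the agent must demonstrate competence on the new level from scratch before moving on again.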
Offline-to-online for agent training
A common practical pipeline mirrors the offline RL to online fine-tuning approach from Week 11: pre-train the agent's tool-use behavior with behavior cloning on human demonstration trajectories (equivalent to SFT), then apply GRPO or PPO with environment feedback to improve beyond the demonstrations. The offline phase provides a strong initialization that prevents catastrophically bad early-training behavior; the online RL phase discovers strategies unavailable in the demonstrations.
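For the online phase, the "group relative" part of GRPO is commonly described as normalizing each sampled trajectory's reward against the other samples for the same task. The sketch below shows only that advantage computation, not the full policy update; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions of this example.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within a group of rollouts for the same task."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts of one task with binary terminal reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group gets the same reward (for example, all fail on a hard multi-step task), every advantage is zero and the group contributes no gradient signal, which is one reason a strong offline initialization matters before this phase.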
Safety and constrained optimization
Deployed agents operate in real environments with irreversible consequences. An agent that sends an incorrect email, modifies a production database, or initiates a financial transaction cannot easily undo the action. This motivates constrained RL: optimize the task reward subject to safety constraints that bound the probability or expected cost of unsafe actions.
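The standard formulation is the Constrained MDP (CMDP):
$$\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t} r_t\right] \quad \text{subject to} \quad \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t} c_t\right] \le d$$
where $c_t$ is a constraint cost (e.g., $c_t = 1$ if an irreversible action was taken at step $t$) and $d$ is the constraint budget. Lagrangian relaxation converts the CMDP to an unconstrained problem with an adaptive penalty:
$$\min_{\lambda \ge 0} \; \max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t} r_t\right] - \lambda \left( \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t} c_t\right] - d \right)$$
The dual variable $\lambda$ is increased when the constraint is violated and decreased when it is satisfied, playing the same role as the KL penalty in RLHF: it is a Lagrange multiplier on a behavior constraint, adapted by gradient steps on the dual problem.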
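The adaptive penalty can be sketched as two small functions: one that shapes the reward with the current multiplier and one that updates the multiplier from the measured constraint cost. The names below (`lam` for the dual variable, `avg_cost` for the estimated expected constraint cost, `budget` for the constraint budget, `lr` for the dual step size) are assumptions of this example, and the policy update itself is omitted.

```python
def penalized_reward(reward: float, cost: float, lam: float) -> float:
    """Reward seen by the policy optimizer under the current multiplier."""
    return reward - lam * cost


def dual_update(lam: float, avg_cost: float, budget: float, lr: float = 0.1) -> float:
    """Raise lam when cost exceeds the budget, lower it otherwise; keep lam >= 0."""
    return max(0.0, lam + lr * (avg_cost - budget))


lam = 0.5
lam_after_violation = dual_update(lam, avg_cost=0.3, budget=0.1, lr=1.0)
lam_after_slack = dual_update(0.05, avg_cost=0.0, budget=0.1, lr=1.0)
```

The projection onto non-negative values matters: once the constraint is comfortably satisfied the multiplier falls to zero and the agent optimizes task reward alone, instead of being rewarded for under-using its budget.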
In practice, most deployed agents implement safety constraints through a combination of conservative prompt engineering (the agent is instructed to confirm before irreversible actions), step-level safety classifiers (a separate model flags high-risk actions before execution), and hard-coded refusal rules for a predefined set of prohibited action types. These are engineering approximations to CMDP-style constraint satisfaction.
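These three layers can be combined into a single gate that runs before any tool call is executed. The sketch below is illustrative: the action names, the two sets, the threshold, and the idea that `risk_score` comes from a separate classifier are all assumptions of this example.

```python
from typing import FrozenSet

PROHIBITED: FrozenSet[str] = frozenset({"delete_database"})
IRREVERSIBLE: FrozenSet[str] = frozenset({"send_email", "execute_transaction"})


def gate(action_name: str, risk_score: float, confirmed: bool = False,
         risk_threshold: float = 0.5) -> str:
    """Decide what to do with a proposed tool call before it runs."""
    if action_name in PROHIBITED:
        return "refuse"            # hard-coded refusal rule
    if risk_score >= risk_threshold:
        return "block"             # step-level safety classifier
    if action_name in IRREVERSIBLE and not confirmed:
        return "ask_confirmation"  # confirm before irreversible actions
    return "allow"


decision = gate("send_email", risk_score=0.1)
```

The checks are ordered from least to most negotiable: a prohibited action is refused regardless of score or confirmation, while an irreversible but low-risk action only needs the user's approval.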
Open problems: The Research Frontier in Reinforcement Learning
This section names the seven most significant open problems in reinforcement learning as of early 2026, viewed from the perspective of what this course has built. These are not minor gaps or engineering challenges — they are fundamental, actively researched problems where significant theoretical or empirical progress would reshape the field.
Course conclusion
Reinforcement learning is the mathematics of sequential decision-making. The course began with a single agent observing a state, taking an action, receiving a reward — the minimal structure needed to define the problem. Every subsequent development was an extension of this core:
Bellman equations gave a recursive structure that makes value estimation tractable. Policy gradients connected the optimization of expected return to gradient descent on neural networks. Actor-critic methods reduced variance and enabled scaling to continuous control. PPO and SAC made deep RL reliable enough to deploy on real robots. Model-based methods connected RL to classical control theory and to modern reasoning systems. Offline RL removed the dependence on live interaction, enabling learning from historical data. RLHF applied the full framework to align language models with human preferences. DPO and GRPO simplified the alignment pipeline without abandoning its theoretical foundations. And agentic systems are the present frontier: the same MDP formalism, instantiated in a world where the state is a context window, the actions include tools, and the environment is the open-ended physical and digital world.
The algorithms will continue to evolve. The principles — managing uncertainty, balancing exploration and exploitation, assigning credit over long horizons, specifying rewards that encode genuine intent — will remain the foundational challenges of building systems that act intelligently and safely in the world. The courses that follow build on this foundation in specific domains.
Course 2 (Robot Learning) applies the MDP, POMDP, and policy gradient framework to embodied systems. The state transitions are now governed by physics — robot kinematics, contact dynamics, actuator constraints — rather than by API semantics. The observation model is determined by camera intrinsics, lidar noise, and proprioceptive drift rather than by search engine output. But the core structure is identical: an agent with an imperfect belief state selects actions from a continuous action space to maximize a reward signal, with sim-to-real transfer (Week 9 domain adaptation) bridging the gap between policy training and deployment. The constrained RL and safety frameworks from this lecture — CMDPs, Lagrangian adaptation, step-level safety classifiers — apply directly to robot safety in human environments where constraint violations are physical rather than digital.
Course 4 (Vision-Language Models and Generative AI) extends the agent's perception stack. The agentic systems in this lecture retrieve text documents; VLMs retrieve and reason over images, video, and multimodal context. The MDP state representation generalizes from a text-only context window to a multimodal embedding space. The agent's action space correspondingly grows to include visual grounding (pointing at regions, segmenting objects) and image generation as tool outputs. The PRM debate from this lecture (process versus outcome supervision) reappears in VLM training, where step-level annotation of visual reasoning is even more expensive than text annotation. The multi-agent coordination framework (Week 9) connects to multimodal multi-agent systems where agents with different perceptual capabilities (vision, language, audio) must coordinate without sharing a common observation space, making emergent communication protocols a practical necessity rather than a theoretical curiosity.
Conceptual questions
- Map a specific real-world agentic task (a research assistant that must retrieve papers, extract results, and synthesize a literature review) onto the full POMDP tuple $(\mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma)$. Identify: what constitutes the true hidden state, what information the context window fails to capture, and at least two failure modes that result directly from this partial observability. For each failure mode, describe the mitigation strategy from Week 9.
- An agent trained with GRPO on single-tool tasks (web search, calculator) is deployed on a task requiring three tools in sequence. It achieves near-zero success rate despite performing well on individual tool benchmarks. Diagnose this failure using the hierarchical credit assignment framework: which level of the hierarchy is failing, and what training data or curriculum change would you introduce to address it?
- Prompt injection is an adversarial attack where malicious instructions are embedded in content retrieved from the environment (web pages, documents) to redirect the agent's behavior. Frame this as a multi-agent RL problem (Week 9): identify the adversary's objective, the defender's objective, and the equilibrium toward which self-play between attacker and defender would converge. Why is this equilibrium more difficult to achieve than the Nash equilibrium in a zero-sum game like Go?
- An agent uses a PRM to provide dense reward for intermediate reasoning steps. After training, evaluation shows that the agent's intermediate steps score highly on the PRM but the final task success rate is lower than a baseline trained with terminal reward only. Diagnose this failure using the Goodhart's Law framework from Week 12. Describe two modifications to the PRM training procedure that would reduce this gap without reverting to sparse terminal reward.
- A CMDP formulation for a financial agent adds a constraint cost $c_t$, where $c_t = 1$ if a transaction above \$10,000 is executed. The Lagrange multiplier $\lambda$ adapts the penalty online. Describe a failure mode where the adaptive $\lambda$ correctly enforces the constraint during training but the deployed agent violates it in a novel market condition not seen during training. Connect this to the distributional shift problem from Week 11 and describe what additional mechanism would be needed to address it.
Extensions
- Design a token-level safety classifier. An agentic system uses a secondary classifier model that intercepts tool calls and flags high-risk actions before execution. Analyze the false-positive / false-negative tradeoff in terms of the constrained MDP objective: what happens to the Lagrange multiplier $\lambda$ when the classifier is too conservative (high false-positive rate) versus too permissive (high false-negative rate)? Propose a training procedure for the safety classifier that is consistent with the CMDP framework, i.e., the classifier's threshold should adapt based on the current value of $\lambda$ rather than being a fixed constant. How would you design an evaluation benchmark that measures whether the combined agent + safety-classifier system satisfies the constraint budget on a held-out distribution of tasks?
Further reading
- Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR.
- Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS.
- Lightman, H., et al. (2023). Let's Verify Step by Step. arXiv. (Process Reward Models / PRMs).