Purpose of this lecture
The algorithms studied in Weeks 3–8 share three background assumptions that are rarely satisfied in practice: the agent observes the true environment state, exploration is handled adequately by policy stochasticity, and the environment is stationary because there is only one learning agent. All three fail in real systems.
This lecture addresses each failure mode in turn. Exploration is elevated from a side effect of -greedy noise to an explicit objective with principled methods for directing agents toward genuinely novel experience. Partial observability is formalized through POMDPs, and the standard engineering responses — history stacking, recurrent architectures, and belief-state approximation — are developed in detail. Multi-agent RLReinforcement Learning examines what happens when the stationarity assumption collapses because other agents are simultaneously learning, and introduces the centralized training with decentralized execution (CTDE) paradigm that stabilizes cooperative systems.
These three topics converge in modern agentic AI: a tool-using LLMLarge Language Model operating in a partially observed environment alongside other AI agents is simultaneously a POMDPPartially Observable Markov Decision Process solver, an explorer under sparse rewards, and a participant in a multi-agent system.
Exploration as a first-class problem
In dense-reward tabular MDPs, -greedy exploration — random action with probability , greedy otherwise — is often adequate. This ceases to hold in three practically important regimes. In sparse-reward tasks, reward arrives only at task completion; random walks rarely reach the goal. In large or continuous state spaces, locally random actions are uninformative about distant novel regions. In deceptive reward landscapes, greedy exploitation of local optima prevents discovery of higher-value strategies that require temporarily accepting lower reward.
Modern RLReinforcement Learning treats exploration as an explicit design objective by constructing an exploration bonus that augments extrinsic reward:
where controls exploration strength. The challenge is designing to be meaningful in high dimensions without interfering with the ultimate objective.
Count-based exploration
In tabular MDPs, exploration bonuses follow from state visitation counts . The standard form, derived from PAC-MDPMarkov Decision Process analyses (RMAX, MBIE):
This embodies optimism under uncertainty: treat unvisited states as potentially optimal and visit them until uncertainty resolves. The bonus decays as states become familiar and drives the agent toward under-explored regions. Under this principle, provably efficient algorithms guarantee near-optimal policies once every state has been visited times.
The fundamental obstacle is the curse of dimensionality: in continuous or large discrete spaces, most states are visited at most once, and raw counts provide no signal. Scaling requires a density model that estimates and assigns pseudo-counts . Compression-based and neural density models have been explored, but scaling to high-dimensional observations without additional structure remains difficult.
Intrinsic motivation and curiosity
An influential alternative frames exploration as curiosity: reward the agent for encountering situations it cannot yet predict.
Prediction-error curiosity (ICM)
The Intrinsic Curiosity Module (Pathak et al., 2017) trains a forward model to predict the next observation embedding and uses its error as intrinsic reward:
where is the forward model's prediction and is an encoder trained via an auxiliary inverse model (predicting the action from consecutive embeddings). The inverse model discards observation features irrelevant to the agent's actions, making prediction error a more meaningful novelty proxy than raw pixel error.
The noisy TV problem
Prediction-error curiosity has a fundamental failure mode. A stochastic distractor — a screen displaying white noise — cannot be predicted regardless of how many times it is observed. The intrinsic reward on the noisy screen stays permanently high, creating an infinite reward attractor: the agent becomes "addicted" to watching the noise while ignoring task reward. This is the noisy TV problem: stochastic irreducible noise produces a curiosity signal that is unlearnable and therefore persistently high.
Random Network Distillation (RND)
RND (Burda et al., 2018) eliminates the noisy TV failure by replacing the stochastic prediction target with a fixed deterministic one:
- A target network with fixed random weights.
- A predictor network trained to match the target (a pattern repeated in robot learning, where one network encodes the learned environment and another learns to match it).
The predictor error is high for rarely visited states (cannot yet match the target) and decays as the state becomes familiar. Crucially, the target is a deterministic function of the state, so the predictor can perfectly learn it once the state has been visited sufficiently — there is no irreducible stochastic component. RND achieves strong results on hard-exploration Atari games including Montezuma's Revenge, where -greedy DQNDeep Q-Network fails to progress past the first room.
Temporally coherent exploration
Action-space noise: the incoherence problem
-greedy and Gaussian action noise resample independently at every step. The resulting trajectories are temporally incoherent: consecutive actions are independently noisy, so exploration behaves as a random walk in action space. For a robot that must execute a coordinated multi-step sequence to reach a novel region, this is particularly inefficient — each noisy step disrupts the trajectory before it can traverse the path to new experience.
Parameter-space noise
NoisyNets (Fortunato et al., 2017) add learnable noise directly to network weights, sampled once per episode:
The same perturbation is active for the full episode, so the policy behaves consistently throughout — exploring in a coherent direction in state space before resampling at episode reset. This temporal coherence enables directed exploration over long time horizons (a critical requirement for physical systems like robots that need multi-step coordinated actions to reach novel configurations).
Bootstrapped DQNDeep Q-Network
Bootstrapped DQNDeep Q-Network (Osband et al., 2016) maintains independent Q-function heads . At the start of each episode, one head is sampled and the agent follows its greedy policy for the full episode:
Each head encodes a different hypothesis about the value function; the diversity represents epistemic uncertainty. Committing to one head per episode implements Thompson sampling: the agent acts optimally under a sample from its posterior over value functions. This produces deep exploration that can traverse long corridors or multi-step puzzles where step-by-step random actions are ineffective.
Partial observability and POMDPs
In the MDPMarkov Decision Process formalism of Week 1, the agent observes the true state . In virtually every real deployment this is false.
Formal definition
A Partially Observable Markov Decision Process (POMDPPartially Observable Markov Decision Process) extends the MDPMarkov Decision Process to:
where is the observation space and is the emission distribution. The agent receives — a potentially lossy, noisy function of the true hidden state — rather than itself.
Partial observability is the generic case: a camera encodes only the visible surface, not occluded geometry. Joint sensors report position but not velocity or contact force. An LLMLarge Language Model sees only its context window, not the user's intent or external system state. In every case, the transition of the hidden state is Markov; the transition of the observed signal is not. Applying value functions to observations as if they were states causes the same instability as applying them to non-Markov states: identical observations corresponding to different hidden states with different optimal actions produce inconsistent value targets.
The belief state
The optimal solution to a POMDPPartially Observable Markov Decision Process operates on the belief state : the posterior probability distribution over hidden states given the full history:
The belief state is a sufficient statistic for the history — the optimal policy is at least as good as any policy based on the raw history. It is updated via Bayesian filtering:
This is the general-form Bayesian filter, of which the Kalman filter (linear Gaussian systems) and the HMM forward algorithm (finite state, discrete observations) are special cases. Exact belief updates are intractable for continuous state spaces; approximate methods include particle filters and point-based POMDPPartially Observable Markov Decision Process solvers.
Approximating the belief state in deep RLReinforcement Learning
History stacking concatenates the last observations as input: . If the Markov horizon is at most steps, this restores Markovianity. Frame stacking in Atari DQNDeep Q-Network () recovers velocity information unavailable in a single frame. The limitation is that must be manually specified; for tasks requiring memory over hundreds of steps, a fixed window is inadequate.
Recurrent policies replace the fixed window with a learned compressed summary:
The hidden state is a learned approximation to the belief state, trained end-to-end via BPTT. LSTM-based policies are standard in partially observed locomotion (robot state estimation from proprioception and vision), manipulation with occlusion, and multi-step reasoning. Transformers processing the full observation history via self-attention are recurrent in a generalized sense: the context window plays the role of , with attention weights encoding each past observation's relevance to the current decision.
GenAI context: LLMLarge Language Model agents as POMDPPartially Observable Markov Decision Process solvers
For an LLMLarge Language Model-based agent using tools, querying APIs, and maintaining conversation state, the true environment state includes the user's unstated intent, content not yet retrieved, and external system state. The agent observes only the current context. Its internal representation must compress this partial history into something sufficient for decision-making — exactly the belief state role. Retrieval mechanisms, memory modules, and scratchpads are engineering approximations of belief-state maintenance. The failure modes — being misled by stale tool outputs, forgetting earlier context, miscalibrating confidence in retrieved facts — are POMDPPartially Observable Markov Decision Process-theoretic: the agent's approximate belief diverges from the true hidden state.
Multi-agent reinforcement learning
When multiple agents share an environment and all are learning simultaneously, the environment from any single agent's perspective is non-stationary: the apparent transition distribution changes over time because other agents' policies are changing. The convergence guarantees of standard RLReinforcement Learning algorithms assume a stationary MDPMarkov Decision Process and do not directly apply.
Independent learners
Independent Q-learning (IQL) has each agent treat all other agents as part of the environment and applies standard RLReinforcement Learning independently. The non-stationarity violates the convergence conditions for Q-learning, producing oscillatory behavior and policy divergence. Concretely: Agent A learns that moving left yields high reward → Agent B, observing A's new behavior, adapts its own policy → A's previously learned Q-values are now wrong (the environment appears to have changed) → A relearns → B readapts, creating a cycle where neither agent's value estimates ever stabilize. In cooperative tasks with shared team reward, multi-agent credit assignment compounds the problem: naive RLReinforcement Learning attributes team reward equally to every agent's action, regardless of which agent actually drove the outcome. This produces noisy, slow policy gradient updates.
Self-play
Self-play creates an automatic curriculum by having each agent train against a copy (or recent snapshot) of itself. The opponent is always at the same competence level, so the training distribution shifts smoothly as capability improves. AlphaGo, AlphaZero, and OpenAI Five all rely on self-play as the primary learning mechanism.
In zero-sum two-player games, self-play corresponds to iteratively optimizing for the Nash equilibrium: neither player can improve by unilaterally changing strategy. Under conditions analogous to fictitious play, the iterative procedure converges to Nash. The zero-sum property () is structurally necessary here: it guarantees that improving one player's policy does not increase the other player's reward — the game is purely competitive, so self-play drives both agents toward the unique value of the game rather than chasing a moving cooperative target. In general-sum or cooperative settings, self-play without additional structure may cycle because agents can simultaneously improve each other's returns without converging. The emergent capabilities from self-play — discovering opening theory in Go, inventing tool use in hide-and-seek — arise purely from the self-play reward signal without human-designed intermediate objectives.
Centralized Training with Decentralized Execution (CTDE)
The dominant paradigm for cooperative MARL separates the training and execution regimes:
- During training: a centralized critic has access to the global state and all agents' actions . Full-information Q-function estimation stabilizes learning and enables joint credit assignment.
- During execution: each agent acts using only its local observation and local policy , as required for bandwidth-constrained and latency-sensitive deployment.
MADDPG (Lowe et al., 2017) implements CTDE by extending DDPGDeep Deterministic Policy Gradient: each agent has a separate actor and a centralized critic . At execution time, only the actors are used.
QMIX (Rashid et al., 2018) adds a monotonicity constraint to support decentralized greedy action selection:
This ensures that each agent independently maximizing its individual utility is consistent with maximizing the joint . The constraint is enforced by a mixing network with non-negative weights that combines individual values into conditioned on global state — active only at training time.
GenAI context: multi-agent AI systems
Multi-agent structures are pervasive in modern AI: orchestrator–subagent hierarchies, critic–generator pairs for self-critique, red-team–defender setups for adversarial robustness. Self-play underlies debate-based alignment (agents argue for and against positions; debate outcome is the reward signal) and adversarial red-teaming. CTDE is implicitly present in multi-agent reasoning frameworks where a coordinator plans globally and dispatches to specialized agents executing with local information. The credit assignment problem — which subagent's output caused task success or failure — is a direct instance of cooperative MARL credit assignment.
Key takeaways
Exploration must be designed as an explicit objective when rewards are sparse or environments are large. Count-based methods provide theoretical grounding in tabular MDPs; curiosity-driven methods (ICM, RND) scale to high dimensions, with RND circumventing the noisy TV failure of prediction-error curiosity by replacing stochastic prediction targets with a fixed random network. Bootstrapped DQNDeep Q-Network and NoisyNets produce temporally coherent exploration via per-episode policy diversity and parameter-space noise, respectively. Partial observability is the generic case; the POMDPPartially Observable Markov Decision Process belief state is the theoretically sufficient summary, approximated in practice by history stacking and recurrent architectures that learn implicit belief representations end-to-end. MARL introduces non-stationarity as the defining challenge; self-play produces emergent capability in competitive settings via automatic curriculum, while CTDE stabilizes cooperative training by using centralized critics during training and decentralized actors during execution.
Conceptual questions
-
An agent trained with prediction-error curiosity (ICM) is deployed in an environment containing a screen displaying Brownian motion. Explain precisely why the agent becomes addicted to watching the screen. Then explain why RND does not exhibit this failure, identifying the key structural difference between the two intrinsic reward formulations.
-
A robot has joint position sensors but no velocity sensors. Explain why this constitutes a POMDPPartially Observable Markov Decision Process and identify the true hidden state. Compare two solutions: (a) stack position observations, (b) add an LSTM to the policy. For each, state what assumption it makes about the memory horizon and describe a scenario where it would fail but the other would succeed.
-
Two independent Q-learning agents train in a cooperative game with shared team reward. Agent A receives high team reward after taking action , even though Agent B's action was actually responsible. Explain the credit assignment failure in Agent A's Q-update. Describe how CTDE with a centralized critic resolves this, and explain why the QMIX monotonicity constraint is necessary for decentralized execution.
-
Bootstrapped DQNDeep Q-Network samples one Q-head per episode rather than per step. Contrast exploration trajectories from per-step vs per-episode sampling. For a task requiring the agent to traverse a 50-step corridor to reach reward, explain why per-step sampling fails while per-episode sampling succeeds. Relate this to temporal coherence and connect it to parameter-space noise vs action-space noise.
-
AlphaZero trains via self-play in a zero-sum game. Explain why the zero-sum assumption is necessary for self-play to converge toward Nash equilibrium. Describe an alternative competitive signal that could provide a self-play-style curriculum for a cooperative task. Identify at least one new failure mode this would introduce that does not arise in zero-sum self-play.
Implementation exercises
Exercise 1: Count-based exploration in a grid world
Implement a tabular Q-learning agent on a sparse-reward grid world (e.g., MiniGrid-Empty-8x8-v0). Add a count-based exploration bonus and compare with standard -greedy. Measure:
- Number of episodes to first goal reach
- State visitation entropy across the grid
- Performance as varies across
For the bonus, store visit counts in a dictionary keyed by discretized position. Note that this is only feasible in tabular settings — Exercise 2 addresses the continuous case.
Exercise 2: RND on a continuous control task
Implement Random Network Distillation as an exploration bonus for a PPOProximal Policy Optimisation agent on MountainCarContinuous-v0 (sparse reward: agent only receives reward when reaching the flag). The setup:
- Target network: A random MLP with fixed, randomly initialized weights.
- Predictor network: trained via MSE to match on visited states.
- Intrinsic reward: , normalized by a running estimate of its standard deviation.
- Add to the PPOProximal Policy Optimisation reward.
Compare training curves (extrinsic return vs steps) for vanilla PPOProximal Policy Optimisation vs PPOProximal Policy Optimisation + RND. Report the fraction of seeds that solve the task within 200 episodes. For a bonus challenge, swap the MountainCar environment for a procedurally generated maze and observe whether RND's deterministic target prevents the noisy-TV attraction.
Exercise 3: IQL vs CTDE in a cooperative multi-agent task
Implement a simple cooperative multi-agent scenario: two agents navigating a grid to push a box into a target zone (requires both agents pushing simultaneously). Compare:
- IQL: Each agent trains independent DQNDeep Q-Networks with shared team reward.
- MADDPG-style CTDE: A centralized critic during training, with decentralized actors using only local observations at execution.
Track the standard deviation of per-episode returns across 5 seeds for each method. IQL should show high variance (credit assignment noise); CTDE should converge faster and more stably. Identify one failure mode where IQL agents converge to a suboptimal equilibrium (e.g., one agent learns to wait while the other does all the work).
Extension prompts
-
RND on Montezuma's Revenge: The original RND paper demonstrated superhuman performance on Montezuma's Revenge, a notoriously hard-exploration Atari game. Study the paper's preprocessing and normalization choices. What aspects of the RND formulation are specific to Atari (discrete actions, CNN encoder), and which generalize to continuous control? Propose a modification to RND for an environment with both visual and proprioceptive inputs.
-
Belief-state quality metric: Train an LSTM-based policy on a partially observable navigation task. Design an experiment to measure how well the LSTM's hidden state approximates the true belief state. One approach: train a separate network to predict the true hidden state from and measure its reconstruction error as a proxy for belief-state quality. How does this metric change as the LSTM hidden size varies?
-
Population-based self-play: Extend the self-play section's analysis by implementing a population-based training setup (maintain policy snapshots, sample opponents from the population rather than the latest policy). Compare with standard self-play on a simple competitive task. How does population diversity affect convergence stability? What new failure mode emerges when the population collapses to a single strategy?
Further reading
- Pathak, D., et al. (2017). Curiosity-driven Exploration by Self-supervised Prediction. ICML. (ICM).
- Burda, Y., et al. (2018). Exploration by Random Network Distillation. ICLR. (RND).
- Fortunato, M., et al. (2017). Noisy Networks for Exploration. ICLR. (NoisyNets — parameter-space noise for temporally coherent exploration).
- Osband, I., et al. (2016). Deep Exploration via Bootstrapped DQN. NeurIPS. (Bootstrapped DQNDeep Q-Network — Thompson sampling with Q-ensembles).
- Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence. (Foundational POMDPPartially Observable Markov Decision Process paper).
- Lowe, R., et al. (2017). Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NeurIPS. (MADDPG & CTDE).
- Rashid, T., et al. (2018). QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ICML.