Week 9: Exploration, Partial Observability, and Multi-Agent Reinforcement Learning

Grounded In

Agentic AI: Every LLM-based tool-using agent is simultaneously a POMDP solver (partial context window), an explorer under sparse rewards (retrieval and reasoning under uncertainty), and a participant in multi-agent systems (orchestrator–subagent hierarchies, critic–generator pairs, adversarial red-teaming). The concepts in this lecture — belief states, curiosity-driven exploration, centralized critics — are the theoretical vocabulary for analyzing agent failure modes.
Robotics: Real robots always face partial observability (occlusion, missing sensors, proprioceptive drift). Exploration bonuses and temporally coherent noise (NoisyNets) enable robots to discover contact-rich manipulation strategies. CTDE directly maps to multi-robot coordination where centralized planning during training (with full state) produces decentralized policies for bandwidth-constrained deployment.

Purpose of this lecture#

The algorithms studied in Weeks 3–8 share three background assumptions that are rarely satisfied in practice: the agent observes the true environment state, exploration is handled adequately by policy stochasticity, and the environment is stationary because there is only one learning agent. All three fail in real systems.

This lecture addresses each failure mode in turn. Exploration is elevated from a side effect of $\epsilon$ -greedy noise to an explicit objective with principled methods for directing agents toward genuinely novel experience. Partial observability is formalized through POMDPs, and the standard engineering responses — history stacking, recurrent architectures, and belief-state approximation — are developed in detail. Multi-agent RL examines what happens when the stationarity assumption collapses because other agents are simultaneously learning, and introduces the centralized training with decentralized execution (CTDE) paradigm that stabilizes cooperative systems.

These three topics converge in modern agentic AI: a tool-using LLM operating in a partially observed environment alongside other AI agents is simultaneously a POMDP solver, an explorer under sparse rewards, and a participant in a multi-agent system.

Exploration as a first-class problem#

In dense-reward tabular MDPs, $\epsilon$ -greedy exploration — random action with probability $\epsilon$ , greedy otherwise — is often adequate. This ceases to hold in three practically important regimes. In sparse-reward tasks, reward arrives only at task completion; random walks rarely reach the goal. In large or continuous state spaces, locally random actions are uninformative about distant novel regions. In deceptive reward landscapes, greedy exploitation of local optima prevents discovery of higher-value strategies that require temporarily accepting lower reward.

Modern RL treats exploration as an explicit design objective by constructing an exploration bonus $r_t^{\text{int}}$ that augments extrinsic reward:

r_t^{\text{total}} = r_t^{\text{ext}} + \beta \cdot r_t^{\text{int}}

where $\beta > 0$ controls exploration strength. The challenge is designing $r_t^{\text{int}}$ to be meaningful in high dimensions without interfering with the ultimate objective.

Count-based exploration#

In tabular MDPs, exploration bonuses follow from state visitation counts $N_t(s)$ . The standard form, derived from PAC-MDP analyses (RMAX, MBIE):

r_t^{\text{int}}(s) = \frac{\beta}{\sqrt{N_t(s)}}

This embodies optimism under uncertainty: treat unvisited states as potentially optimal and visit them until uncertainty resolves. The bonus decays as states become familiar and drives the agent toward under-explored regions. Under this principle, provably efficient algorithms guarantee near-optimal policies once every state has been visited $\Omega\!\left(\frac{1}{\epsilon^2(1-\gamma)^2}\right)$ times.

The fundamental obstacle is the curse of dimensionality: in continuous or large discrete spaces, most states are visited at most once, and raw counts provide no signal. Scaling requires a density model that estimates $p(s)$ and assigns pseudo-counts $\hat{N}(s) \propto 1/p(s)$ . Compression-based and neural density models have been explored, but scaling to high-dimensional observations without additional structure remains difficult.

Intrinsic motivation and curiosity#

An influential alternative frames exploration as curiosity: reward the agent for encountering situations it cannot yet predict.

Prediction-error curiosity (ICM)#

The Intrinsic Curiosity Module (Pathak et al., 2017) trains a forward model to predict the next observation embedding and uses its error as intrinsic reward:

r_t^{\text{int}} = \frac{\eta}{2}\,\|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\|^2

where $\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t)$ is the forward model's prediction and $\phi$ is an encoder trained via an auxiliary inverse model (predicting the action from consecutive embeddings). The inverse model discards observation features irrelevant to the agent's actions, making prediction error a more meaningful novelty proxy than raw pixel error.

The noisy TV problem#

Prediction-error curiosity has a fundamental failure mode. A stochastic distractor — a screen displaying white noise — cannot be predicted regardless of how many times it is observed. The intrinsic reward on the noisy screen stays permanently high, creating an infinite reward attractor: the agent becomes "addicted" to watching the noise while ignoring task reward. This is the noisy TV problem: stochastic irreducible noise produces a curiosity signal that is unlearnable and therefore persistently high.

Random Network Distillation (RND)#

RND (Burda et al., 2018) eliminates the noisy TV failure by replacing the stochastic prediction target with a fixed deterministic one:

A target network $f: \mathcal{S} \to \mathbb{R}^d$ with fixed random weights.
A predictor network $\hat{f}_\theta: \mathcal{S} \to \mathbb{R}^d$ trained to match the target (a pattern repeated in robot learning, where one network encodes the learned environment and another learns to match it).

r_t^{\text{int}} = \|\hat{f}_\theta(s_t) - f(s_t)\|^2

The predictor error is high for rarely visited states (cannot yet match the target) and decays as the state becomes familiar. Crucially, the target is a deterministic function of the state, so the predictor can perfectly learn it once the state has been visited sufficiently — there is no irreducible stochastic component. RND achieves strong results on hard-exploration Atari games including Montezuma's Revenge, where $\epsilon$ -greedy DQN fails to progress past the first room.

Temporally coherent exploration#

Action-space noise: the incoherence problem#

$\epsilon$ -greedy and Gaussian action noise resample independently at every step. The resulting trajectories are temporally incoherent: consecutive actions are independently noisy, so exploration behaves as a random walk in action space. For a robot that must execute a coordinated multi-step sequence to reach a novel region, this is particularly inefficient — each noisy step disrupts the trajectory before it can traverse the path to new experience.

Parameter-space noise#

NoisyNets (Fortunato et al., 2017) add learnable noise directly to network weights, sampled once per episode:

\tilde{\theta} = \theta + \sigma \odot \xi, \quad \xi \sim \mathcal{N}(0, I)

The same perturbation is active for the full episode, so the policy behaves consistently throughout — exploring in a coherent direction in state space before resampling at episode reset. This temporal coherence enables directed exploration over long time horizons (a critical requirement for physical systems like robots that need multi-step coordinated actions to reach novel configurations).

Bootstrapped DQN#

Bootstrapped DQN (Osband et al., 2016) maintains $K$ independent Q-function heads $\{Q_k\}$ . At the start of each episode, one head $k$ is sampled and the agent follows its greedy policy for the full episode:

a_t = \arg\max_a Q_k(s_t, a)

Each head encodes a different hypothesis about the value function; the diversity represents epistemic uncertainty. Committing to one head per episode implements Thompson sampling: the agent acts optimally under a sample from its posterior over value functions. This produces deep exploration that can traverse long corridors or multi-step puzzles where step-by-step random actions are ineffective.

Partial observability and POMDPs#

In the MDP formalism of Week 1, the agent observes the true state $s_t$ . In virtually every real deployment this is false.

Formal definition#

A Partially Observable Markov Decision Process (POMDP) extends the MDP to:

(\mathcal{S},\, \mathcal{A},\, \mathcal{O},\, P,\, O,\, R,\, \gamma)

where $\mathcal{O}$ is the observation space and $O(o_t \mid s_t)$ is the emission distribution. The agent receives $o_t \sim O(\cdot \mid s_t)$ — a potentially lossy, noisy function of the true hidden state $s_t$ — rather than $s_t$ itself.

Partial observability is the generic case: a camera encodes only the visible surface, not occluded geometry. Joint sensors report position but not velocity or contact force. An LLM sees only its context window, not the user's intent or external system state. In every case, the transition of the hidden state is Markov; the transition of the observed signal is not. Applying value functions to observations as if they were states causes the same instability as applying them to non-Markov states: identical observations corresponding to different hidden states with different optimal actions produce inconsistent value targets.

The belief state#

The optimal solution to a POMDP operates on the belief state $b_t$ : the posterior probability distribution over hidden states given the full history:

b_t(s) = P(s_t = s \mid o_{1:t},\, a_{1:t-1})

The belief state is a sufficient statistic for the history — the optimal policy $\pi^*(b_t)$ is at least as good as any policy based on the raw history. It is updated via Bayesian filtering:

b_{t+1}(s') \propto O(o_{t+1} \mid s') \sum_s P(s' \mid s, a_t)\, b_t(s)

This is the general-form Bayesian filter, of which the Kalman filter (linear Gaussian systems) and the HMM forward algorithm (finite state, discrete observations) are special cases. Exact belief updates are intractable for continuous state spaces; approximate methods include particle filters and point-based POMDP solvers.

Approximating the belief state in deep RL#

History stacking concatenates the last $k$ observations as input: $\tilde{s}_t = (o_{t-k+1}, \ldots, o_t)$ . If the Markov horizon is at most $k$ steps, this restores Markovianity. Frame stacking in Atari DQN ( $k = 4$ ) recovers velocity information unavailable in a single frame. The limitation is that $k$ must be manually specified; for tasks requiring memory over hundreds of steps, a fixed window is inadequate.

Recurrent policies replace the fixed window with a learned compressed summary:

h_t = \text{RNN}(h_{t-1},\, o_t,\, a_{t-1}), \quad \pi_\theta(a \mid h_t)

The hidden state $h_t$ is a learned approximation to the belief state, trained end-to-end via BPTT. LSTM-based policies are standard in partially observed locomotion (robot state estimation from proprioception and vision), manipulation with occlusion, and multi-step reasoning. Transformers processing the full observation history via self-attention are recurrent in a generalized sense: the context window plays the role of $k$ , with attention weights encoding each past observation's relevance to the current decision.

GenAI context: LLM agents as POMDP solvers#

For an LLM-based agent using tools, querying APIs, and maintaining conversation state, the true environment state includes the user's unstated intent, content not yet retrieved, and external system state. The agent observes only the current context. Its internal representation must compress this partial history into something sufficient for decision-making — exactly the belief state role. Retrieval mechanisms, memory modules, and scratchpads are engineering approximations of belief-state maintenance. The failure modes — being misled by stale tool outputs, forgetting earlier context, miscalibrating confidence in retrieved facts — are POMDP-theoretic: the agent's approximate belief diverges from the true hidden state.

Multi-agent reinforcement learning#

When multiple agents share an environment and all are learning simultaneously, the environment from any single agent's perspective is non-stationary: the apparent transition distribution $\tilde{P}(s_{t+1} \mid s_t, a_t^i)$ changes over time because other agents' policies are changing. The convergence guarantees of standard RL algorithms assume a stationary MDP and do not directly apply.

Independent learners#

Independent Q-learning (IQL) has each agent treat all other agents as part of the environment and applies standard RL independently. The non-stationarity violates the convergence conditions for Q-learning, producing oscillatory behavior and policy divergence. Concretely: Agent A learns that moving left yields high reward → Agent B, observing A's new behavior, adapts its own policy → A's previously learned Q-values are now wrong (the environment appears to have changed) → A relearns → B readapts, creating a cycle where neither agent's value estimates ever stabilize. In cooperative tasks with shared team reward, multi-agent credit assignment compounds the problem: naive RL attributes team reward equally to every agent's action, regardless of which agent actually drove the outcome. This produces noisy, slow policy gradient updates.

Self-play#

Self-play creates an automatic curriculum by having each agent train against a copy (or recent snapshot) of itself. The opponent is always at the same competence level, so the training distribution shifts smoothly as capability improves. AlphaGo, AlphaZero, and OpenAI Five all rely on self-play as the primary learning mechanism.

In zero-sum two-player games, self-play corresponds to iteratively optimizing for the Nash equilibrium: neither player can improve by unilaterally changing strategy. Under conditions analogous to fictitious play, the iterative procedure converges to Nash. The zero-sum property ( $R_1 + R_2 = 0$ ) is structurally necessary here: it guarantees that improving one player's policy does not increase the other player's reward — the game is purely competitive, so self-play drives both agents toward the unique value of the game rather than chasing a moving cooperative target. In general-sum or cooperative settings, self-play without additional structure may cycle because agents can simultaneously improve each other's returns without converging. The emergent capabilities from self-play — discovering opening theory in Go, inventing tool use in hide-and-seek — arise purely from the self-play reward signal without human-designed intermediate objectives.

Centralized Training with Decentralized Execution (CTDE)#

The dominant paradigm for cooperative MARL separates the training and execution regimes:

During training: a centralized critic has access to the global state $s$ and all agents' actions $\mathbf{a} = (a^1, \ldots, a^n)$ . Full-information Q-function estimation stabilizes learning and enables joint credit assignment.
During execution: each agent $i$ acts using only its local observation $o^i$ and local policy $\pi^i(a^i \mid o^i)$ , as required for bandwidth-constrained and latency-sensitive deployment.

MADDPG (Lowe et al., 2017) implements CTDE by extending DDPG: each agent has a separate actor $\mu^i(o^i)$ and a centralized critic $Q^i(s, a^1, \ldots, a^n)$ . At execution time, only the actors are used.

QMIX (Rashid et al., 2018) adds a monotonicity constraint to support decentralized greedy action selection:

\frac{\partial Q_{\text{tot}}(s, \mathbf{a})}{\partial Q_i(o^i, a^i)} \geq 0 \quad \forall i

This ensures that each agent independently maximizing its individual utility $Q_i$ is consistent with maximizing the joint $Q_{\text{tot}}$ . The constraint is enforced by a mixing network with non-negative weights that combines individual $Q_i$ values into $Q_{\text{tot}}$ conditioned on global state — active only at training time.

GenAI context: multi-agent AI systems#

Multi-agent structures are pervasive in modern AI: orchestrator–subagent hierarchies, critic–generator pairs for self-critique, red-team–defender setups for adversarial robustness. Self-play underlies debate-based alignment (agents argue for and against positions; debate outcome is the reward signal) and adversarial red-teaming. CTDE is implicitly present in multi-agent reasoning frameworks where a coordinator plans globally and dispatches to specialized agents executing with local information. The credit assignment problem — which subagent's output caused task success or failure — is a direct instance of cooperative MARL credit assignment.

Key takeaways#

Exploration must be designed as an explicit objective when rewards are sparse or environments are large. Count-based methods provide theoretical grounding in tabular MDPs; curiosity-driven methods (ICM, RND) scale to high dimensions, with RND circumventing the noisy TV failure of prediction-error curiosity by replacing stochastic prediction targets with a fixed random network. Bootstrapped DQN and NoisyNets produce temporally coherent exploration via per-episode policy diversity and parameter-space noise, respectively. Partial observability is the generic case; the POMDP belief state is the theoretically sufficient summary, approximated in practice by history stacking and recurrent architectures that learn implicit belief representations end-to-end. MARL introduces non-stationarity as the defining challenge; self-play produces emergent capability in competitive settings via automatic curriculum, while CTDE stabilizes cooperative training by using centralized critics during training and decentralized actors during execution.

Conceptual questions#

An agent trained with prediction-error curiosity (ICM) is deployed in an environment containing a screen displaying Brownian motion. Explain precisely why the agent becomes addicted to watching the screen. Then explain why RND does not exhibit this failure, identifying the key structural difference between the two intrinsic reward formulations.
A robot has joint position sensors but no velocity sensors. Explain why this constitutes a POMDP and identify the true hidden state. Compare two solutions: (a) stack $k=5$ position observations, (b) add an LSTM to the policy. For each, state what assumption it makes about the memory horizon and describe a scenario where it would fail but the other would succeed.
Two independent Q-learning agents train in a cooperative game with shared team reward. Agent A receives high team reward after taking action $a_A$ , even though Agent B's action was actually responsible. Explain the credit assignment failure in Agent A's Q-update. Describe how CTDE with a centralized critic resolves this, and explain why the QMIX monotonicity constraint is necessary for decentralized execution.
Bootstrapped DQN samples one Q-head per episode rather than per step. Contrast exploration trajectories from per-step vs per-episode sampling. For a task requiring the agent to traverse a 50-step corridor to reach reward, explain why per-step sampling fails while per-episode sampling succeeds. Relate this to temporal coherence and connect it to parameter-space noise vs action-space noise.
AlphaZero trains via self-play in a zero-sum game. Explain why the zero-sum assumption is necessary for self-play to converge toward Nash equilibrium. Describe an alternative competitive signal that could provide a self-play-style curriculum for a cooperative task. Identify at least one new failure mode this would introduce that does not arise in zero-sum self-play.

Solutions (conceptual questions)

ICM's reward is the error of a learned forward model predicting the next state; a screen of Brownian motion is inherently unpredictable, so the prediction error never decays and the agent gets perpetual intrinsic reward for watching it (the "noisy-TV" trap). RND instead measures how well a predictor matches a fixed random network of the state, which is reducible by visiting states more often — so it rewards novelty (low visitation) rather than transition unpredictability, and does not get stuck on stochastic observations.
With only positions, velocity (part of the true state $(\text{position}, \text{velocity})$ ) is hidden, so the Markov property fails and it is a POMDP. (a) Stacking $k = 5$ positions assumes velocity is recoverable by finite-differencing a short fixed window — it fails when the relevant memory horizon exceeds 5 steps. (b) An LSTM learns an arbitrary-horizon memory — it handles long dependencies but is harder to train and can fail with limited data, where the hard-coded stacking window would have sufficed.
Agent A's update credits its own action $a_A$ for reward actually caused by B, corrupting A's $Q$ -value with spurious credit (the environment looks non-stationary to each independent learner). A CTDE centralized critic conditioned on the joint state and actions attributes reward correctly during training. QMIX's monotonicity constraint ( $\partial Q_{\text{tot}} / \partial Q_i \ge 0$ ) ensures the joint argmax factorizes into per-agent argmaxes, so agents can still act greedily from local values at decentralized execution.
Sampling one $Q$ -head per episode gives temporally coherent ("deep") exploration: the agent commits to a single hypothesis for the whole episode and explores in a directed, multi-step way. Per-step sampling re-randomizes every step, degenerating into dithering that cannot traverse a 50-step corridor (vanishingly small chance), whereas per-episode commitment can walk the corridor to the reward. This mirrors parameter-space / posterior-sampling noise (coherent) versus action-space noise (incoherent).
Zero-sum makes self-play a minimax game with a well-defined Nash equilibrium that improving against oneself provably moves toward — your gain is exactly the opponent's loss, so relative improvement equals absolute improvement. A cooperative substitute is a competitive curriculum such as population-based or asymmetric (goal-setting) self-play. A new failure mode absent in zero-sum is miscoordination / collapse to a degenerate equilibrium, or the competitive proxy drifting away from the true cooperative objective (gaming the proxy).

Implementation exercises#

Exercise 1: Count-based exploration in a grid world#

Implement a tabular Q-learning agent on a sparse-reward grid world (e.g., MiniGrid-Empty-8x8-v0). Add a count-based exploration bonus $r_t^{\text{int}}(s) = \beta / \sqrt{N_t(s)}$ and compare with standard $\epsilon$ -greedy. Measure:

Number of episodes to first goal reach
State visitation entropy across the grid
Performance as $\beta$ varies across $\{0.1, 0.5, 1.0, 5.0\}$

For the bonus, store visit counts in a dictionary keyed by discretized $(x, y)$ position. Note that this is only feasible in tabular settings — Exercise 2 addresses the continuous case.

Exercise 2: RND on a continuous control task#

Implement Random Network Distillation as an exploration bonus for a PPO agent on MountainCarContinuous-v0 (sparse reward: agent only receives reward when reaching the flag). The setup:

Target network: A random MLP $f: \mathcal{S} \to \mathbb{R}^{64}$ with fixed, randomly initialized weights.
Predictor network: $\hat{f}_\theta$ trained via MSE to match $f(s)$ on visited states.
Intrinsic reward: $r_t^{\text{int}} = \|\hat{f}_\theta(s_t) - f(s_t)\|^2$ , normalized by a running estimate of its standard deviation.
Add $r_t^{\text{total}} = r_t^{\text{ext}} + \beta \cdot r_t^{\text{int}}$ to the PPO reward.

Compare training curves (extrinsic return vs steps) for vanilla PPO vs PPO + RND. Report the fraction of seeds that solve the task within 200 episodes. For a bonus challenge, swap the MountainCar environment for a procedurally generated maze and observe whether RND's deterministic target prevents the noisy-TV attraction.

Exercise 3: IQL vs CTDE in a cooperative multi-agent task#

Implement a simple cooperative multi-agent scenario: two agents navigating a grid to push a box into a target zone (requires both agents pushing simultaneously). Compare:

IQL: Each agent trains independent DQNs with shared team reward.
MADDPG-style CTDE: A centralized critic $Q(s, a^1, a^2)$ during training, with decentralized actors using only local observations at execution.

Track the standard deviation of per-episode returns across 5 seeds for each method. IQL should show high variance (credit assignment noise); CTDE should converge faster and more stably. Identify one failure mode where IQL agents converge to a suboptimal equilibrium (e.g., one agent learns to wait while the other does all the work).

Extension prompts#

RND on Montezuma's Revenge: The original RND paper demonstrated superhuman performance on Montezuma's Revenge, a notoriously hard-exploration Atari game. Study the paper's preprocessing and normalization choices. What aspects of the RND formulation are specific to Atari (discrete actions, CNN encoder), and which generalize to continuous control? Propose a modification to RND for an environment with both visual and proprioceptive inputs.
Belief-state quality metric: Train an LSTM-based policy on a partially observable navigation task. Design an experiment to measure how well the LSTM's hidden state $h_t$ approximates the true belief state. One approach: train a separate network to predict the true hidden state from $h_t$ and measure its reconstruction error as a proxy for belief-state quality. How does this metric change as the LSTM hidden size varies?
Population-based self-play: Extend the self-play section's analysis by implementing a population-based training setup (maintain $N$ policy snapshots, sample opponents from the population rather than the latest policy). Compare with standard self-play on a simple competitive task. How does population diversity affect convergence stability? What new failure mode emerges when the population collapses to a single strategy?

Purpose of this lecture#

Exploration as a first-class problem#

Modern RL treats exploration as an explicit design objective by constructing an exploration bonus $r_t^{\text{int}}$ that augments extrinsic reward:

r_t^{\text{total}} = r_t^{\text{ext}} + \beta \cdot r_t^{\text{int}}

where $\beta > 0$ controls exploration strength. The challenge is designing $r_t^{\text{int}}$ to be meaningful in high dimensions without interfering with the ultimate objective.

Count-based exploration#

In tabular MDPs, exploration bonuses follow from state visitation counts $N_t(s)$ . The standard form, derived from PAC-MDP analyses (RMAX, MBIE):

r_t^{\text{int}}(s) = \frac{\beta}{\sqrt{N_t(s)}}

Intrinsic motivation and curiosity#

An influential alternative frames exploration as curiosity: reward the agent for encountering situations it cannot yet predict.

Prediction-error curiosity (ICM)#

The Intrinsic Curiosity Module (Pathak et al., 2017) trains a forward model to predict the next observation embedding and uses its error as intrinsic reward:

r_t^{\text{int}} = \frac{\eta}{2}\,\|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\|^2

The noisy TV problem#

Random Network Distillation (RND)#

RND (Burda et al., 2018) eliminates the noisy TV failure by replacing the stochastic prediction target with a fixed deterministic one:

A target network $f: \mathcal{S} \to \mathbb{R}^d$ with fixed random weights.
A predictor network $\hat{f}_\theta: \mathcal{S} \to \mathbb{R}^d$ trained to match the target (a pattern repeated in robot learning, where one network encodes the learned environment and another learns to match it).

r_t^{\text{int}} = \|\hat{f}_\theta(s_t) - f(s_t)\|^2

Temporally coherent exploration#

Action-space noise: the incoherence problem#

Parameter-space noise#

NoisyNets (Fortunato et al., 2017) add learnable noise directly to network weights, sampled once per episode:

\tilde{\theta} = \theta + \sigma \odot \xi, \quad \xi \sim \mathcal{N}(0, I)

Bootstrapped DQN#

a_t = \arg\max_a Q_k(s_t, a)

Partial observability and POMDPs#

In the MDP formalism of Week 1, the agent observes the true state $s_t$ . In virtually every real deployment this is false.

Formal definition#

A Partially Observable Markov Decision Process (POMDP) extends the MDP to:

(\mathcal{S},\, \mathcal{A},\, \mathcal{O},\, P,\, O,\, R,\, \gamma)

The belief state#

The optimal solution to a POMDP operates on the belief state $b_t$ : the posterior probability distribution over hidden states given the full history:

b_t(s) = P(s_t = s \mid o_{1:t},\, a_{1:t-1})

The belief state is a sufficient statistic for the history — the optimal policy $\pi^*(b_t)$ is at least as good as any policy based on the raw history. It is updated via Bayesian filtering:

b_{t+1}(s') \propto O(o_{t+1} \mid s') \sum_s P(s' \mid s, a_t)\, b_t(s)

Approximating the belief state in deep RL#

Recurrent policies replace the fixed window with a learned compressed summary:

h_t = \text{RNN}(h_{t-1},\, o_t,\, a_{t-1}), \quad \pi_\theta(a \mid h_t)

GenAI context: LLM agents as POMDP solvers#

Multi-agent reinforcement learning#

Independent learners#

Self-play#

Centralized Training with Decentralized Execution (CTDE)#

The dominant paradigm for cooperative MARL separates the training and execution regimes:

During training: a centralized critic has access to the global state $s$ and all agents' actions $\mathbf{a} = (a^1, \ldots, a^n)$ . Full-information Q-function estimation stabilizes learning and enables joint credit assignment.
During execution: each agent $i$ acts using only its local observation $o^i$ and local policy $\pi^i(a^i \mid o^i)$ , as required for bandwidth-constrained and latency-sensitive deployment.

QMIX (Rashid et al., 2018) adds a monotonicity constraint to support decentralized greedy action selection:

\frac{\partial Q_{\text{tot}}(s, \mathbf{a})}{\partial Q_i(o^i, a^i)} \geq 0 \quad \forall i

GenAI context: multi-agent AI systems#

Key takeaways#

Conceptual questions#

An agent trained with prediction-error curiosity (ICM) is deployed in an environment containing a screen displaying Brownian motion. Explain precisely why the agent becomes addicted to watching the screen. Then explain why RND does not exhibit this failure, identifying the key structural difference between the two intrinsic reward formulations.
A robot has joint position sensors but no velocity sensors. Explain why this constitutes a POMDP and identify the true hidden state. Compare two solutions: (a) stack $k=5$ position observations, (b) add an LSTM to the policy. For each, state what assumption it makes about the memory horizon and describe a scenario where it would fail but the other would succeed.
Two independent Q-learning agents train in a cooperative game with shared team reward. Agent A receives high team reward after taking action $a_A$ , even though Agent B's action was actually responsible. Explain the credit assignment failure in Agent A's Q-update. Describe how CTDE with a centralized critic resolves this, and explain why the QMIX monotonicity constraint is necessary for decentralized execution.
Bootstrapped DQN samples one Q-head per episode rather than per step. Contrast exploration trajectories from per-step vs per-episode sampling. For a task requiring the agent to traverse a 50-step corridor to reach reward, explain why per-step sampling fails while per-episode sampling succeeds. Relate this to temporal coherence and connect it to parameter-space noise vs action-space noise.
AlphaZero trains via self-play in a zero-sum game. Explain why the zero-sum assumption is necessary for self-play to converge toward Nash equilibrium. Describe an alternative competitive signal that could provide a self-play-style curriculum for a cooperative task. Identify at least one new failure mode this would introduce that does not arise in zero-sum self-play.

Solutions (conceptual questions)

ICM's reward is the error of a learned forward model predicting the next state; a screen of Brownian motion is inherently unpredictable, so the prediction error never decays and the agent gets perpetual intrinsic reward for watching it (the "noisy-TV" trap). RND instead measures how well a predictor matches a fixed random network of the state, which is reducible by visiting states more often — so it rewards novelty (low visitation) rather than transition unpredictability, and does not get stuck on stochastic observations.
With only positions, velocity (part of the true state $(\text{position}, \text{velocity})$ ) is hidden, so the Markov property fails and it is a POMDP. (a) Stacking $k = 5$ positions assumes velocity is recoverable by finite-differencing a short fixed window — it fails when the relevant memory horizon exceeds 5 steps. (b) An LSTM learns an arbitrary-horizon memory — it handles long dependencies but is harder to train and can fail with limited data, where the hard-coded stacking window would have sufficed.
Agent A's update credits its own action $a_A$ for reward actually caused by B, corrupting A's $Q$ -value with spurious credit (the environment looks non-stationary to each independent learner). A CTDE centralized critic conditioned on the joint state and actions attributes reward correctly during training. QMIX's monotonicity constraint ( $\partial Q_{\text{tot}} / \partial Q_i \ge 0$ ) ensures the joint argmax factorizes into per-agent argmaxes, so agents can still act greedily from local values at decentralized execution.
Sampling one $Q$ -head per episode gives temporally coherent ("deep") exploration: the agent commits to a single hypothesis for the whole episode and explores in a directed, multi-step way. Per-step sampling re-randomizes every step, degenerating into dithering that cannot traverse a 50-step corridor (vanishingly small chance), whereas per-episode commitment can walk the corridor to the reward. This mirrors parameter-space / posterior-sampling noise (coherent) versus action-space noise (incoherent).
Zero-sum makes self-play a minimax game with a well-defined Nash equilibrium that improving against oneself provably moves toward — your gain is exactly the opponent's loss, so relative improvement equals absolute improvement. A cooperative substitute is a competitive curriculum such as population-based or asymmetric (goal-setting) self-play. A new failure mode absent in zero-sum is miscoordination / collapse to a degenerate equilibrium, or the competitive proxy drifting away from the true cooperative objective (gaming the proxy).

Implementation exercises#

Exercise 1: Count-based exploration in a grid world#

Number of episodes to first goal reach
State visitation entropy across the grid
Performance as $\beta$ varies across $\{0.1, 0.5, 1.0, 5.0\}$

For the bonus, store visit counts in a dictionary keyed by discretized $(x, y)$ position. Note that this is only feasible in tabular settings — Exercise 2 addresses the continuous case.

Exercise 2: RND on a continuous control task#

Implement Random Network Distillation as an exploration bonus for a PPO agent on MountainCarContinuous-v0 (sparse reward: agent only receives reward when reaching the flag). The setup:

Target network: A random MLP $f: \mathcal{S} \to \mathbb{R}^{64}$ with fixed, randomly initialized weights.
Predictor network: $\hat{f}_\theta$ trained via MSE to match $f(s)$ on visited states.
Intrinsic reward: $r_t^{\text{int}} = \|\hat{f}_\theta(s_t) - f(s_t)\|^2$ , normalized by a running estimate of its standard deviation.
Add $r_t^{\text{total}} = r_t^{\text{ext}} + \beta \cdot r_t^{\text{int}}$ to the PPO reward.

Exercise 3: IQL vs CTDE in a cooperative multi-agent task#

Implement a simple cooperative multi-agent scenario: two agents navigating a grid to push a box into a target zone (requires both agents pushing simultaneously). Compare:

IQL: Each agent trains independent DQNs with shared team reward.
MADDPG-style CTDE: A centralized critic $Q(s, a^1, a^2)$ during training, with decentralized actors using only local observations at execution.

Extension prompts#

RND on Montezuma's Revenge: The original RND paper demonstrated superhuman performance on Montezuma's Revenge, a notoriously hard-exploration Atari game. Study the paper's preprocessing and normalization choices. What aspects of the RND formulation are specific to Atari (discrete actions, CNN encoder), and which generalize to continuous control? Propose a modification to RND for an environment with both visual and proprioceptive inputs.
Belief-state quality metric: Train an LSTM-based policy on a partially observable navigation task. Design an experiment to measure how well the LSTM's hidden state $h_t$ approximates the true belief state. One approach: train a separate network to predict the true hidden state from $h_t$ and measure its reconstruction error as a proxy for belief-state quality. How does this metric change as the LSTM hidden size varies?
Population-based self-play: Extend the self-play section's analysis by implementing a population-based training setup (maintain $N$ policy snapshots, sample opponents from the population rather than the latest policy). Compare with standard self-play on a simple competitive task. How does population diversity affect convergence stability? What new failure mode emerges when the population collapses to a single strategy?

Purpose of this lecture#

Exploration as a first-class problem#

Count-based exploration#

Intrinsic motivation and curiosity#

Prediction-error curiosity (ICM)#

The noisy TV problem#

Random Network Distillation (RND)#

Temporally coherent exploration#

Action-space noise: the incoherence problem#

Parameter-space noise#

Bootstrapped DQNDeep Q-Network#

Partial observability and POMDPs#

Formal definition#

The belief state#

Approximating the belief state in deep RLReinforcement Learning#

GenAI context: LLMLarge Language Model agents as POMDPPartially Observable Markov Decision Process solvers#

Multi-agent reinforcement learning#

Independent learners#

Self-play#

Centralized Training with Decentralized Execution (CTDE)#

GenAI context: multi-agent AI systems#

Key takeaways#

Conceptual questions#

Implementation exercises#

Exercise 1: Count-based exploration in a grid world#

Exercise 2: RND on a continuous control task#

Exercise 3: IQL vs CTDE in a cooperative multi-agent task#

Extension prompts#

Further reading#

Week 9: Exploration, Partial Observability, and Multi-Agent Reinforcement Learning

Purpose of this lecture#

Exploration as a first-class problem#

Count-based exploration#

Intrinsic motivation and curiosity#

Prediction-error curiosity (ICM)#

The noisy TV problem#

Random Network Distillation (RND)#

Temporally coherent exploration#

Action-space noise: the incoherence problem#

Parameter-space noise#

Bootstrapped DQNDeep Q-Network#

Partial observability and POMDPs#

Formal definition#

The belief state#

Approximating the belief state in deep RLReinforcement Learning#

GenAI context: LLMLarge Language Model agents as POMDPPartially Observable Markov Decision Process solvers#

Multi-agent reinforcement learning#

Independent learners#

Self-play#

Centralized Training with Decentralized Execution (CTDE)#

GenAI context: multi-agent AI systems#

Key takeaways#

Conceptual questions#

Implementation exercises#

Exercise 1: Count-based exploration in a grid world#

Exercise 2: RND on a continuous control task#

Exercise 3: IQL vs CTDE in a cooperative multi-agent task#

Extension prompts#

Further reading#

Bootstrapped DQN#

Approximating the belief state in deep RL#

GenAI context: LLM agents as POMDP solvers#

Bootstrapped DQN#

Approximating the belief state in deep RL#

GenAI context: LLM agents as POMDP solvers#