Skip to main content
illumin8
Courses
Week 9: Exploration, Partial Observability, and Multi-Agent Reinforcement Learning
Reinforcement Learning
01Week 1: Reinforcement Learning Problem Formulation
02Week 2: Multi-Armed Bandits
03Week 3: Dynamic Programming for Finite MDPs
04Week 4: Monte Carlo and Temporal-Difference Learning
05Week 5: Function Approximation in Reinforcement Learning
06Week 6: Deep Q-Learning and Variants
07Week 7: Policy Gradient and Actor–Critic Methods
08Week 8: Modern Deep Reinforcement Learning Algorithms
09Week 9: Exploration, Partial Observability, and Multi-Agent Reinforcement Learning
10Week 10: Model-Based Reinforcement Learning and Planning
11Week 11: Offline Reinforcement Learning
12Week 12: Reinforcement Learning from Human Feedback
13Week 13: Direct Preference Optimization and GRPO
14Week 14: Agentic Systems and Course Capstone
Week 9

Week 9: Exploration, Partial Observability, and Multi-Agent Reinforcement Learning

✦Learning Outcomes
  • Implement count-based and curiosity-driven exploration methods
  • Define POMDPs and explain engineering responses (recurrent policies, belief states)
  • Analyze multi-agent RLReinforcement Learning challenges (non-stationarity, CTDE)
  • Connect exploration and POMDPPartially Observable Markov Decision Process concepts to modern agentic AI
◆Prerequisites
  • Week 8: Modern deep RLReinforcement Learning algorithms, exploration in practice
  • Week 7: Policy gradient methods

Recommended: Review Week 8 before proceeding.

◆Grounded In
  • Agentic AI: Every LLMLarge Language Model-based tool-using agent is simultaneously a POMDPPartially Observable Markov Decision Process solver (partial context window), an explorer under sparse rewards (retrieval and reasoning under uncertainty), and a participant in multi-agent systems (orchestrator–subagent hierarchies, critic–generator pairs, adversarial red-teaming). The concepts in this lecture — belief states, curiosity-driven exploration, centralized critics — are the theoretical vocabulary for analyzing agent failure modes.
  • Robotics: Real robots always face partial observability (occlusion, missing sensors, proprioceptive drift). Exploration bonuses and temporally coherent noise (NoisyNets) enable robots to discover contact-rich manipulation strategies. CTDE directly maps to multi-robot coordination where centralized planning during training (with full state) produces decentralized policies for bandwidth-constrained deployment.

Purpose of this lecture

The algorithms studied in Weeks 3–8 share three background assumptions that are rarely satisfied in practice: the agent observes the true environment state, exploration is handled adequately by policy stochasticity, and the environment is stationary because there is only one learning agent. All three fail in real systems.

This lecture addresses each failure mode in turn. Exploration is elevated from a side effect of ϵ\epsilonϵ-greedy noise to an explicit objective with principled methods for directing agents toward genuinely novel experience. Partial observability is formalized through POMDPs, and the standard engineering responses — history stacking, recurrent architectures, and belief-state approximation — are developed in detail. Multi-agent RLReinforcement Learning examines what happens when the stationarity assumption collapses because other agents are simultaneously learning, and introduces the centralized training with decentralized execution (CTDE) paradigm that stabilizes cooperative systems.

These three topics converge in modern agentic AI: a tool-using LLMLarge Language Model operating in a partially observed environment alongside other AI agents is simultaneously a POMDPPartially Observable Markov Decision Process solver, an explorer under sparse rewards, and a participant in a multi-agent system.


Exploration as a first-class problem

In dense-reward tabular MDPs, ϵ\epsilonϵ-greedy exploration — random action with probability ϵ\epsilonϵ, greedy otherwise — is often adequate. This ceases to hold in three practically important regimes. In sparse-reward tasks, reward arrives only at task completion; random walks rarely reach the goal. In large or continuous state spaces, locally random actions are uninformative about distant novel regions. In deceptive reward landscapes, greedy exploitation of local optima prevents discovery of higher-value strategies that require temporarily accepting lower reward.

Modern RLReinforcement Learning treats exploration as an explicit design objective by constructing an exploration bonus rtintr_t^{\text{int}}rtint​ that augments extrinsic reward:

rttotal=rtext+β⋅rtintr_t^{\text{total}} = r_t^{\text{ext}} + \beta \cdot r_t^{\text{int}}rttotal​=rtext​+β⋅rtint​

where β>0\beta > 0β>0 controls exploration strength. The challenge is designing rtintr_t^{\text{int}}rtint​ to be meaningful in high dimensions without interfering with the ultimate objective.


Count-based exploration

In tabular MDPs, exploration bonuses follow from state visitation counts Nt(s)N_t(s)Nt​(s). The standard form, derived from PAC-MDPMarkov Decision Process analyses (RMAX, MBIE):

rtint(s)=βNt(s)r_t^{\text{int}}(s) = \frac{\beta}{\sqrt{N_t(s)}}rtint​(s)=Nt​(s)​β​

This embodies optimism under uncertainty: treat unvisited states as potentially optimal and visit them until uncertainty resolves. The bonus decays as states become familiar and drives the agent toward under-explored regions. Under this principle, provably efficient algorithms guarantee near-optimal policies once every state has been visited Ω ⁣(1ϵ2(1−γ)2)\Omega\!\left(\frac{1}{\epsilon^2(1-\gamma)^2}\right)Ω(ϵ2(1−γ)21​) times.

The fundamental obstacle is the curse of dimensionality: in continuous or large discrete spaces, most states are visited at most once, and raw counts provide no signal. Scaling requires a density model that estimates p(s)p(s)p(s) and assigns pseudo-counts N^(s)∝1/p(s)\hat{N}(s) \propto 1/p(s)N^(s)∝1/p(s). Compression-based and neural density models have been explored, but scaling to high-dimensional observations without additional structure remains difficult.


Intrinsic motivation and curiosity

An influential alternative frames exploration as curiosity: reward the agent for encountering situations it cannot yet predict.

Prediction-error curiosity (ICM)

The Intrinsic Curiosity Module (Pathak et al., 2017) trains a forward model to predict the next observation embedding and uses its error as intrinsic reward:

rtint=η2 ∥ϕ^(st+1)−ϕ(st+1)∥2r_t^{\text{int}} = \frac{\eta}{2}\,\|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\|^2rtint​=2η​∥ϕ^​(st+1​)−ϕ(st+1​)∥2

where ϕ^(st+1)=f(ϕ(st),at)\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t)ϕ^​(st+1​)=f(ϕ(st​),at​) is the forward model's prediction and ϕ\phiϕ is an encoder trained via an auxiliary inverse model (predicting the action from consecutive embeddings). The inverse model discards observation features irrelevant to the agent's actions, making prediction error a more meaningful novelty proxy than raw pixel error.

The noisy TV problem

Prediction-error curiosity has a fundamental failure mode. A stochastic distractor — a screen displaying white noise — cannot be predicted regardless of how many times it is observed. The intrinsic reward on the noisy screen stays permanently high, creating an infinite reward attractor: the agent becomes "addicted" to watching the noise while ignoring task reward. This is the noisy TV problem: stochastic irreducible noise produces a curiosity signal that is unlearnable and therefore persistently high.

Random Network Distillation (RND)

RND (Burda et al., 2018) eliminates the noisy TV failure by replacing the stochastic prediction target with a fixed deterministic one:

  • A target network f:S→Rdf: \mathcal{S} \to \mathbb{R}^df:S→Rd with fixed random weights.
  • A predictor network f^θ:S→Rd\hat{f}_\theta: \mathcal{S} \to \mathbb{R}^df^​θ​:S→Rd trained to match the target (a pattern repeated in robot learning, where one network encodes the learned environment and another learns to match it).
rtint=∥f^θ(st)−f(st)∥2r_t^{\text{int}} = \|\hat{f}_\theta(s_t) - f(s_t)\|^2rtint​=∥f^​θ​(st​)−f(st​)∥2

The predictor error is high for rarely visited states (cannot yet match the target) and decays as the state becomes familiar. Crucially, the target is a deterministic function of the state, so the predictor can perfectly learn it once the state has been visited sufficiently — there is no irreducible stochastic component. RND achieves strong results on hard-exploration Atari games including Montezuma's Revenge, where ϵ\epsilonϵ-greedy DQNDeep Q-Network fails to progress past the first room.


Temporally coherent exploration

Action-space noise: the incoherence problem

ϵ\epsilonϵ-greedy and Gaussian action noise resample independently at every step. The resulting trajectories are temporally incoherent: consecutive actions are independently noisy, so exploration behaves as a random walk in action space. For a robot that must execute a coordinated multi-step sequence to reach a novel region, this is particularly inefficient — each noisy step disrupts the trajectory before it can traverse the path to new experience.

Parameter-space noise

NoisyNets (Fortunato et al., 2017) add learnable noise directly to network weights, sampled once per episode:

θ~=θ+σ⊙ξ,ξ∼N(0,I)\tilde{\theta} = \theta + \sigma \odot \xi, \quad \xi \sim \mathcal{N}(0, I)θ~=θ+σ⊙ξ,ξ∼N(0,I)

The same perturbation is active for the full episode, so the policy behaves consistently throughout — exploring in a coherent direction in state space before resampling at episode reset. This temporal coherence enables directed exploration over long time horizons (a critical requirement for physical systems like robots that need multi-step coordinated actions to reach novel configurations).

Bootstrapped DQNDeep Q-Network

Bootstrapped DQNDeep Q-Network (Osband et al., 2016) maintains KKK independent Q-function heads {Qk}\{Q_k\}{Qk​}. At the start of each episode, one head kkk is sampled and the agent follows its greedy policy for the full episode:

at=arg⁡max⁡aQk(st,a)a_t = \arg\max_a Q_k(s_t, a)at​=argamax​Qk​(st​,a)

Each head encodes a different hypothesis about the value function; the diversity represents epistemic uncertainty. Committing to one head per episode implements Thompson sampling: the agent acts optimally under a sample from its posterior over value functions. This produces deep exploration that can traverse long corridors or multi-step puzzles where step-by-step random actions are ineffective.


Partial observability and POMDPs

In the MDPMarkov Decision Process formalism of Week 1, the agent observes the true state sts_tst​. In virtually every real deployment this is false.

Formal definition

A Partially Observable Markov Decision Process (POMDPPartially Observable Markov Decision Process) extends the MDPMarkov Decision Process to:

(S, A, O, P, O, R, γ)(\mathcal{S},\, \mathcal{A},\, \mathcal{O},\, P,\, O,\, R,\, \gamma)(S,A,O,P,O,R,γ)

where O\mathcal{O}O is the observation space and O(ot∣st)O(o_t \mid s_t)O(ot​∣st​) is the emission distribution. The agent receives ot∼O(⋅∣st)o_t \sim O(\cdot \mid s_t)ot​∼O(⋅∣st​) — a potentially lossy, noisy function of the true hidden state sts_tst​ — rather than sts_tst​ itself.

Partial observability is the generic case: a camera encodes only the visible surface, not occluded geometry. Joint sensors report position but not velocity or contact force. An LLMLarge Language Model sees only its context window, not the user's intent or external system state. In every case, the transition of the hidden state is Markov; the transition of the observed signal is not. Applying value functions to observations as if they were states causes the same instability as applying them to non-Markov states: identical observations corresponding to different hidden states with different optimal actions produce inconsistent value targets.

The belief state

The optimal solution to a POMDPPartially Observable Markov Decision Process operates on the belief state btb_tbt​: the posterior probability distribution over hidden states given the full history:

bt(s)=P(st=s∣o1:t, a1:t−1)b_t(s) = P(s_t = s \mid o_{1:t},\, a_{1:t-1})bt​(s)=P(st​=s∣o1:t​,a1:t−1​)

The belief state is a sufficient statistic for the history — the optimal policy π∗(bt)\pi^*(b_t)π∗(bt​) is at least as good as any policy based on the raw history. It is updated via Bayesian filtering:

bt+1(s′)∝O(ot+1∣s′)∑sP(s′∣s,at) bt(s)b_{t+1}(s') \propto O(o_{t+1} \mid s') \sum_s P(s' \mid s, a_t)\, b_t(s)bt+1​(s′)∝O(ot+1​∣s′)s∑​P(s′∣s,at​)bt​(s)

This is the general-form Bayesian filter, of which the Kalman filter (linear Gaussian systems) and the HMM forward algorithm (finite state, discrete observations) are special cases. Exact belief updates are intractable for continuous state spaces; approximate methods include particle filters and point-based POMDPPartially Observable Markov Decision Process solvers.

Approximating the belief state in deep RLReinforcement Learning

History stacking concatenates the last kkk observations as input: s~t=(ot−k+1,…,ot)\tilde{s}_t = (o_{t-k+1}, \ldots, o_t)s~t​=(ot−k+1​,…,ot​). If the Markov horizon is at most kkk steps, this restores Markovianity. Frame stacking in Atari DQNDeep Q-Network (k=4k = 4k=4) recovers velocity information unavailable in a single frame. The limitation is that kkk must be manually specified; for tasks requiring memory over hundreds of steps, a fixed window is inadequate.

Recurrent policies replace the fixed window with a learned compressed summary:

ht=RNN(ht−1, ot, at−1),πθ(a∣ht)h_t = \text{RNN}(h_{t-1},\, o_t,\, a_{t-1}), \quad \pi_\theta(a \mid h_t)ht​=RNN(ht−1​,ot​,at−1​),πθ​(a∣ht​)

The hidden state hth_tht​ is a learned approximation to the belief state, trained end-to-end via BPTT. LSTM-based policies are standard in partially observed locomotion (robot state estimation from proprioception and vision), manipulation with occlusion, and multi-step reasoning. Transformers processing the full observation history via self-attention are recurrent in a generalized sense: the context window plays the role of kkk, with attention weights encoding each past observation's relevance to the current decision.

GenAI context: LLMLarge Language Model agents as POMDPPartially Observable Markov Decision Process solvers

For an LLMLarge Language Model-based agent using tools, querying APIs, and maintaining conversation state, the true environment state includes the user's unstated intent, content not yet retrieved, and external system state. The agent observes only the current context. Its internal representation must compress this partial history into something sufficient for decision-making — exactly the belief state role. Retrieval mechanisms, memory modules, and scratchpads are engineering approximations of belief-state maintenance. The failure modes — being misled by stale tool outputs, forgetting earlier context, miscalibrating confidence in retrieved facts — are POMDPPartially Observable Markov Decision Process-theoretic: the agent's approximate belief diverges from the true hidden state.


Multi-agent reinforcement learning

When multiple agents share an environment and all are learning simultaneously, the environment from any single agent's perspective is non-stationary: the apparent transition distribution P~(st+1∣st,ati)\tilde{P}(s_{t+1} \mid s_t, a_t^i)P~(st+1​∣st​,ati​) changes over time because other agents' policies are changing. The convergence guarantees of standard RLReinforcement Learning algorithms assume a stationary MDPMarkov Decision Process and do not directly apply.

Independent learners

Independent Q-learning (IQL) has each agent treat all other agents as part of the environment and applies standard RLReinforcement Learning independently. The non-stationarity violates the convergence conditions for Q-learning, producing oscillatory behavior and policy divergence. Concretely: Agent A learns that moving left yields high reward → Agent B, observing A's new behavior, adapts its own policy → A's previously learned Q-values are now wrong (the environment appears to have changed) → A relearns → B readapts, creating a cycle where neither agent's value estimates ever stabilize. In cooperative tasks with shared team reward, multi-agent credit assignment compounds the problem: naive RLReinforcement Learning attributes team reward equally to every agent's action, regardless of which agent actually drove the outcome. This produces noisy, slow policy gradient updates.

Self-play

Self-play creates an automatic curriculum by having each agent train against a copy (or recent snapshot) of itself. The opponent is always at the same competence level, so the training distribution shifts smoothly as capability improves. AlphaGo, AlphaZero, and OpenAI Five all rely on self-play as the primary learning mechanism.

In zero-sum two-player games, self-play corresponds to iteratively optimizing for the Nash equilibrium: neither player can improve by unilaterally changing strategy. Under conditions analogous to fictitious play, the iterative procedure converges to Nash. The zero-sum property (R1+R2=0R_1 + R_2 = 0R1​+R2​=0) is structurally necessary here: it guarantees that improving one player's policy does not increase the other player's reward — the game is purely competitive, so self-play drives both agents toward the unique value of the game rather than chasing a moving cooperative target. In general-sum or cooperative settings, self-play without additional structure may cycle because agents can simultaneously improve each other's returns without converging. The emergent capabilities from self-play — discovering opening theory in Go, inventing tool use in hide-and-seek — arise purely from the self-play reward signal without human-designed intermediate objectives.

Centralized Training with Decentralized Execution (CTDE)

The dominant paradigm for cooperative MARL separates the training and execution regimes:

  • During training: a centralized critic has access to the global state sss and all agents' actions a=(a1,…,an)\mathbf{a} = (a^1, \ldots, a^n)a=(a1,…,an). Full-information Q-function estimation stabilizes learning and enables joint credit assignment.
  • During execution: each agent iii acts using only its local observation oio^ioi and local policy πi(ai∣oi)\pi^i(a^i \mid o^i)πi(ai∣oi), as required for bandwidth-constrained and latency-sensitive deployment.

MADDPG (Lowe et al., 2017) implements CTDE by extending DDPGDeep Deterministic Policy Gradient: each agent has a separate actor μi(oi)\mu^i(o^i)μi(oi) and a centralized critic Qi(s,a1,…,an)Q^i(s, a^1, \ldots, a^n)Qi(s,a1,…,an). At execution time, only the actors are used.

QMIX (Rashid et al., 2018) adds a monotonicity constraint to support decentralized greedy action selection:

∂Qtot(s,a)∂Qi(oi,ai)≥0∀i\frac{\partial Q_{\text{tot}}(s, \mathbf{a})}{\partial Q_i(o^i, a^i)} \geq 0 \quad \forall i∂Qi​(oi,ai)∂Qtot​(s,a)​≥0∀i

This ensures that each agent independently maximizing its individual utility QiQ_iQi​ is consistent with maximizing the joint QtotQ_{\text{tot}}Qtot​. The constraint is enforced by a mixing network with non-negative weights that combines individual QiQ_iQi​ values into QtotQ_{\text{tot}}Qtot​ conditioned on global state — active only at training time.

GenAI context: multi-agent AI systems

Multi-agent structures are pervasive in modern AI: orchestrator–subagent hierarchies, critic–generator pairs for self-critique, red-team–defender setups for adversarial robustness. Self-play underlies debate-based alignment (agents argue for and against positions; debate outcome is the reward signal) and adversarial red-teaming. CTDE is implicitly present in multi-agent reasoning frameworks where a coordinator plans globally and dispatches to specialized agents executing with local information. The credit assignment problem — which subagent's output caused task success or failure — is a direct instance of cooperative MARL credit assignment.


Key takeaways

Exploration must be designed as an explicit objective when rewards are sparse or environments are large. Count-based methods provide theoretical grounding in tabular MDPs; curiosity-driven methods (ICM, RND) scale to high dimensions, with RND circumventing the noisy TV failure of prediction-error curiosity by replacing stochastic prediction targets with a fixed random network. Bootstrapped DQNDeep Q-Network and NoisyNets produce temporally coherent exploration via per-episode policy diversity and parameter-space noise, respectively. Partial observability is the generic case; the POMDPPartially Observable Markov Decision Process belief state is the theoretically sufficient summary, approximated in practice by history stacking and recurrent architectures that learn implicit belief representations end-to-end. MARL introduces non-stationarity as the defining challenge; self-play produces emergent capability in competitive settings via automatic curriculum, while CTDE stabilizes cooperative training by using centralized critics during training and decentralized actors during execution.


Conceptual questions

  1. An agent trained with prediction-error curiosity (ICM) is deployed in an environment containing a screen displaying Brownian motion. Explain precisely why the agent becomes addicted to watching the screen. Then explain why RND does not exhibit this failure, identifying the key structural difference between the two intrinsic reward formulations.

  2. A robot has joint position sensors but no velocity sensors. Explain why this constitutes a POMDPPartially Observable Markov Decision Process and identify the true hidden state. Compare two solutions: (a) stack k=5k=5k=5 position observations, (b) add an LSTM to the policy. For each, state what assumption it makes about the memory horizon and describe a scenario where it would fail but the other would succeed.

  3. Two independent Q-learning agents train in a cooperative game with shared team reward. Agent A receives high team reward after taking action aAa_AaA​, even though Agent B's action was actually responsible. Explain the credit assignment failure in Agent A's Q-update. Describe how CTDE with a centralized critic resolves this, and explain why the QMIX monotonicity constraint is necessary for decentralized execution.

  4. Bootstrapped DQNDeep Q-Network samples one Q-head per episode rather than per step. Contrast exploration trajectories from per-step vs per-episode sampling. For a task requiring the agent to traverse a 50-step corridor to reach reward, explain why per-step sampling fails while per-episode sampling succeeds. Relate this to temporal coherence and connect it to parameter-space noise vs action-space noise.

  5. AlphaZero trains via self-play in a zero-sum game. Explain why the zero-sum assumption is necessary for self-play to converge toward Nash equilibrium. Describe an alternative competitive signal that could provide a self-play-style curriculum for a cooperative task. Identify at least one new failure mode this would introduce that does not arise in zero-sum self-play.

✦Solutions (conceptual questions)
  1. ICM's reward is the error of a learned forward model predicting the next state; a screen of Brownian motion is inherently unpredictable, so the prediction error never decays and the agent gets perpetual intrinsic reward for watching it (the "noisy-TV" trap). RND instead measures how well a predictor matches a fixed random network of the state, which is reducible by visiting states more often — so it rewards novelty (low visitation) rather than transition unpredictability, and does not get stuck on stochastic observations.
  2. With only positions, velocity (part of the true state (position,velocity)(\text{position}, \text{velocity})(position,velocity)) is hidden, so the Markov property fails and it is a POMDP. (a) Stacking k=5k = 5k=5 positions assumes velocity is recoverable by finite-differencing a short fixed window — it fails when the relevant memory horizon exceeds 5 steps. (b) An LSTM learns an arbitrary-horizon memory — it handles long dependencies but is harder to train and can fail with limited data, where the hard-coded stacking window would have sufficed.
  3. Agent A's update credits its own action aAa_AaA​ for reward actually caused by B, corrupting A's QQQ-value with spurious credit (the environment looks non-stationary to each independent learner). A CTDE centralized critic conditioned on the joint state and actions attributes reward correctly during training. QMIX's monotonicity constraint (∂Qtot/∂Qi≥0\partial Q_{\text{tot}} / \partial Q_i \ge 0∂Qtot​/∂Qi​≥0) ensures the joint argmax factorizes into per-agent argmaxes, so agents can still act greedily from local values at decentralized execution.
  4. Sampling one QQQ-head per episode gives temporally coherent ("deep") exploration: the agent commits to a single hypothesis for the whole episode and explores in a directed, multi-step way. Per-step sampling re-randomizes every step, degenerating into dithering that cannot traverse a 50-step corridor (vanishingly small chance), whereas per-episode commitment can walk the corridor to the reward. This mirrors parameter-space / posterior-sampling noise (coherent) versus action-space noise (incoherent).
  5. Zero-sum makes self-play a minimax game with a well-defined Nash equilibrium that improving against oneself provably moves toward — your gain is exactly the opponent's loss, so relative improvement equals absolute improvement. A cooperative substitute is a competitive curriculum such as population-based or asymmetric (goal-setting) self-play. A new failure mode absent in zero-sum is miscoordination / collapse to a degenerate equilibrium, or the competitive proxy drifting away from the true cooperative objective (gaming the proxy).

Implementation exercises

Exercise 1: Count-based exploration in a grid world

Implement a tabular Q-learning agent on a sparse-reward grid world (e.g., MiniGrid-Empty-8x8-v0). Add a count-based exploration bonus rtint(s)=β/Nt(s)r_t^{\text{int}}(s) = \beta / \sqrt{N_t(s)}rtint​(s)=β/Nt​(s)​ and compare with standard ϵ\epsilonϵ-greedy. Measure:

  • Number of episodes to first goal reach
  • State visitation entropy across the grid
  • Performance as β\betaβ varies across {0.1,0.5,1.0,5.0}\{0.1, 0.5, 1.0, 5.0\}{0.1,0.5,1.0,5.0}

For the bonus, store visit counts in a dictionary keyed by discretized (x,y)(x, y)(x,y) position. Note that this is only feasible in tabular settings — Exercise 2 addresses the continuous case.

Exercise 2: RND on a continuous control task

Implement Random Network Distillation as an exploration bonus for a PPOProximal Policy Optimisation agent on MountainCarContinuous-v0 (sparse reward: agent only receives reward when reaching the flag). The setup:

  • Target network: A random MLP f:S→R64f: \mathcal{S} \to \mathbb{R}^{64}f:S→R64 with fixed, randomly initialized weights.
  • Predictor network: f^θ\hat{f}_\thetaf^​θ​ trained via MSE to match f(s)f(s)f(s) on visited states.
  • Intrinsic reward: rtint=∥f^θ(st)−f(st)∥2r_t^{\text{int}} = \|\hat{f}_\theta(s_t) - f(s_t)\|^2rtint​=∥f^​θ​(st​)−f(st​)∥2, normalized by a running estimate of its standard deviation.
  • Add rttotal=rtext+β⋅rtintr_t^{\text{total}} = r_t^{\text{ext}} + \beta \cdot r_t^{\text{int}}rttotal​=rtext​+β⋅rtint​ to the PPOProximal Policy Optimisation reward.

Compare training curves (extrinsic return vs steps) for vanilla PPOProximal Policy Optimisation vs PPOProximal Policy Optimisation + RND. Report the fraction of seeds that solve the task within 200 episodes. For a bonus challenge, swap the MountainCar environment for a procedurally generated maze and observe whether RND's deterministic target prevents the noisy-TV attraction.

Exercise 3: IQL vs CTDE in a cooperative multi-agent task

Implement a simple cooperative multi-agent scenario: two agents navigating a grid to push a box into a target zone (requires both agents pushing simultaneously). Compare:

  • IQL: Each agent trains independent DQNDeep Q-Networks with shared team reward.
  • MADDPG-style CTDE: A centralized critic Q(s,a1,a2)Q(s, a^1, a^2)Q(s,a1,a2) during training, with decentralized actors using only local observations at execution.

Track the standard deviation of per-episode returns across 5 seeds for each method. IQL should show high variance (credit assignment noise); CTDE should converge faster and more stably. Identify one failure mode where IQL agents converge to a suboptimal equilibrium (e.g., one agent learns to wait while the other does all the work).


Extension prompts

  1. RND on Montezuma's Revenge: The original RND paper demonstrated superhuman performance on Montezuma's Revenge, a notoriously hard-exploration Atari game. Study the paper's preprocessing and normalization choices. What aspects of the RND formulation are specific to Atari (discrete actions, CNN encoder), and which generalize to continuous control? Propose a modification to RND for an environment with both visual and proprioceptive inputs.

  2. Belief-state quality metric: Train an LSTM-based policy on a partially observable navigation task. Design an experiment to measure how well the LSTM's hidden state hth_tht​ approximates the true belief state. One approach: train a separate network to predict the true hidden state from hth_tht​ and measure its reconstruction error as a proxy for belief-state quality. How does this metric change as the LSTM hidden size varies?

  3. Population-based self-play: Extend the self-play section's analysis by implementing a population-based training setup (maintain NNN policy snapshots, sample opponents from the population rather than the latest policy). Compare with standard self-play on a simple competitive task. How does population diversity affect convergence stability? What new failure mode emerges when the population collapses to a single strategy?

✦Looking Forward

Week 10: Model-Based Reinforcement Learning and Planning. The next lecture turns from how agents explore to how they plan. We study what it means for an agent to build a model of the world, how that model enables planning over imagined futures, and why the interplay of learned dynamics, MPC, and latent-space planning connects RLReinforcement Learning directly to classical control theory and to modern reasoning-capable AI systems.


Further reading

  • Pathak, D., et al. (2017). Curiosity-driven Exploration by Self-supervised Prediction. ICML. (ICM).
  • Burda, Y., et al. (2018). Exploration by Random Network Distillation. ICLR. (RND).
  • Fortunato, M., et al. (2017). Noisy Networks for Exploration. ICLR. (NoisyNets — parameter-space noise for temporally coherent exploration).
  • Osband, I., et al. (2016). Deep Exploration via Bootstrapped DQN. NeurIPS. (Bootstrapped DQNDeep Q-Network — Thompson sampling with Q-ensembles).
  • Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence. (Foundational POMDPPartially Observable Markov Decision Process paper).
  • Lowe, R., et al. (2017). Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NeurIPS. (MADDPG & CTDE).
  • Rashid, T., et al. (2018). QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ICML.
← Previous
Week 8: Modern Deep Reinforcement Learning Algorithms
Next →
Week 10: Model-Based Reinforcement Learning and Planning
On this page
  • Purpose of this lecture
  • Exploration as a first-class problem
  • Count-based exploration
  • Intrinsic motivation and curiosity
  • Prediction-error curiosity (ICM)
  • The noisy TV problem
  • Random Network Distillation (RND)
  • Temporally coherent exploration
  • Action-space noise: the incoherence problem
  • Parameter-space noise
  • Bootstrapped DQN
  • Partial observability and POMDPs
  • Formal definition
  • The belief state
  • Approximating the belief state in deep RL
  • GenAI context: LLM agents as POMDP solvers
  • Multi-agent reinforcement learning
  • Independent learners
  • Self-play
  • Centralized Training with Decentralized Execution (CTDE)
  • GenAI context: multi-agent AI systems
  • Key takeaways
  • Conceptual questions
  • Implementation exercises
  • Exercise 1: Count-based exploration in a grid world
  • Exercise 2: RND on a continuous control task
  • Exercise 3: IQL vs CTDE in a cooperative multi-agent task
  • Extension prompts
  • Further reading