Week 10: Model-Based Reinforcement Learning and Planning

✦Learning Outcomes
  • Implement Dyna-style algorithms and understand model learning challenges
  • Connect model-based RL to Model Predictive Control (MPC)
  • Analyze latent space planning (Dreamer, MuZero)
  • Understand tree search and planning methods
◆Prerequisites
  • Week 8: Modern deep RL algorithms
  • Week 3: Dynamic programming foundations

Recommended: Review Week 8 before proceeding.

◆Grounded In
  • GenAI: The model-based RL framework offers a useful vocabulary for recent LLM reasoning advances. Chain-of-thought resembles a single rollout with no search; Tree of Thoughts is tree search in text space (with the LLM serving as both dynamics model and value estimator); best-of-N sampling with process reward models resembles sampling-based MPC (a single CEM iteration, also known as random shooting). The test-time compute scaling observed in o1-class models is consistent with the MBRL view that more planning compute improves strategic reasoning.
  • Robotics: MPC with learned dynamics is the dominant paradigm for real-robot control under model uncertainty. Receding-horizon replanning is the connection to classical LQR/iLQR from the robotics course; Dreamer's latent-space planning enables visual control from pixels on real manipulation tasks; the compounding error problem is the central constraint on model-based deployment.

Purpose of this lecture

Almost every learning algorithm studied so far is model-free: it learns a policy or value function directly from environment interaction without representing how the environment works. (The exception is the dynamic programming of Week 3, which assumed a known model.) Model-free methods are powerful but data-hungry: each real transition carries only a small amount of learning signal, and the environment must often be queried millions of times before useful behavior emerges.

Model-based RL breaks this dependence by having the agent learn, or be given, a model of the environment's dynamics. With a model, the agent can imagine consequences of actions without executing them, generating synthetic experience to supplement or replace real interaction. This is the computational analog of mentally simulating a trajectory before moving, and it is essential for physical systems where each real interaction has cost or safety consequences.

Model-based RL is not one algorithm but a design space: the model can be learned or known; planning can use it for one step, a short horizon, or to build a tree of possible futures; the model can operate in observation space or a learned latent space. This lecture organizes that design space from simple to complex, anchoring each method in its connection to classical control theory (MPC, studied in the robotics course) and to modern AI systems that plan before acting.


What is a world model?

A world model approximates the environment's transition and reward functions:

$$\hat{P}_\phi(s_{t+1} \mid s_t, a_t), \qquad \hat{R}_\phi(s_t, a_t)$$

Given a world model, the agent can unroll imagined trajectories:

$$s_0 \xrightarrow{a_0} \hat{s}_1 \xrightarrow{a_1} \hat{s}_2 \xrightarrow{a_2} \cdots$$

using the model in place of real environment steps. If the model is accurate, imagined trajectories provide the same learning signal as real ones at a fraction of the wall-clock cost. If the model is inaccurate, imagined trajectories mislead the learner: a policy optimized against a flawed model can fail arbitrarily when deployed. Managing the tension between model-leveraged efficiency and model-induced bias is the central design challenge of model-based RL.
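The unrolling described above can be sketched in a few lines. This is a minimal illustration, assuming a deterministic model that returns `(next_state, reward)`; the names `imagine_rollout`, `toy_model`, and `toy_policy` are hypothetical and stand in for a learned network and a real policy.

```python
from typing import Callable, List, Tuple

State = float
Action = float
Transition = Tuple[State, Action, float, State]

def imagine_rollout(
    model: Callable[[State, Action], Tuple[State, float]],
    policy: Callable[[State], Action],
    s0: State,
    horizon: int,
) -> List[Transition]:
    """Unroll the model for `horizon` steps without touching the real environment."""
    trajectory = []
    s = s0
    for _ in range(horizon):
        a = policy(s)
        s_next, r = model(s, a)  # the model stands in for the environment step
        trajectory.append((s, a, r, s_next))
        s = s_next               # the next prediction starts from a predicted state
    return trajectory

# Hypothetical toy model: a 1-D point moved by the action, reward = -|next state|.
def toy_model(s: State, a: Action) -> Tuple[State, float]:
    s_next = s + a
    return s_next, -abs(s_next)

# Hypothetical policy: move halfway toward the origin.
def toy_policy(s: State) -> Action:
    return -0.5 * s

rollout = imagine_rollout(toy_model, toy_policy, s0=4.0, horizon=3)
for transition in rollout:
    print(transition)
```

Note that every state after the first is itself a model prediction, which is why model error feeds forward through the rollout.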

Deterministic vs probabilistic models

A deterministic model $\hat{s}_{t+1} = f_\phi(s_t, a_t)$ predicts a single next state and is typically trained with an MSE loss on observed transitions. It is fast and simple but cannot represent uncertainty. Errors are silent: the model produces a confident prediction even in regions where the true dynamics are stochastic or poorly constrained.

A probabilistic model outputs a distribution over next states, typically $\mathcal{N}(\mu_\phi(s_t, a_t),\, \sigma^2_\phi(s_t, a_t))$. The predicted variance $\sigma^2_\phi$ mainly captures aleatoric uncertainty, that is, the inherent stochasticity of the dynamics. On its own it is not a reliable indicator of epistemic uncertainty (uncertainty caused by sparse data), because a single network can be confidently wrong far from its training data.

Ensemble models train $M$ independent models on different bootstrap resamples of the data. Disagreement across ensemble predictions provides a practical epistemic uncertainty estimate, which enables uncertainty-aware planning: action sequences through high-disagreement regions can be penalized. PETS and MBPO use probabilistic ensembles of Gaussian models with a handful of members (roughly five to seven) as a sweet spot between expressivity and training cost.
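To make ensemble disagreement concrete, here is a toy sketch, assuming a hypothetical 1-D system whose next state is `w * s` and an ensemble in which each member fits `w` by least squares on its own bootstrap resample. Real ensembles use neural networks; the function names and data here are illustrative assumptions only.

```python
import random
import statistics

def fit_bootstrap_ensemble(data, num_models, rng):
    """Fit one slope per member on a bootstrap resample of (s, s_next) pairs."""
    slopes = []
    for _ in range(num_models):
        sample = [rng.choice(data) for _ in data]  # resample with replacement
        slopes.append(sum(x * y for x, y in sample) / sum(x * x for x, _ in sample))
    return slopes

def predict(slopes, s):
    """Return the ensemble mean prediction and the spread (disagreement)."""
    preds = [w * s for w in slopes]
    return statistics.mean(preds), statistics.pstdev(preds)

rng = random.Random(0)
states = [rng.uniform(0.1, 1.0) for _ in range(50)]           # training states in [0.1, 1]
data = [(s, 2.0 * s + rng.gauss(0.0, 0.1)) for s in states]   # noisy true dynamics s' = 2s
ensemble = fit_bootstrap_ensemble(data, num_models=5, rng=rng)

mean_in, spread_in = predict(ensemble, 0.5)     # inside the training range
mean_out, spread_out = predict(ensemble, 10.0)  # far outside the training range
print(spread_in, spread_out)
```

The members agree closely where data is dense and disagree more far from it, and that spread is the quantity a planner can penalize.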

The compounding error problem

Errors compound over time: a one-step prediction error $\epsilon_1$ can produce accumulated error growing as $O(e^{H\epsilon_1})$ over $H$ model steps for nonlinear dynamics, as model states drift off the true manifold of valid states. Even an accurate model with $\epsilon_1 = 0.01$ may yield completely unreliable 50-step rollouts. This is the open-loop prediction problem studied in robust control. Standard mitigations: limit rollout length to the reliable horizon, replan frequently with closed-loop correction, or bootstrap long-horizon returns with a learned value function.
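One common way to make the compounding argument concrete is a worst-case recursion. Assume the model is off by at most $\epsilon_1$ per step and that the dynamics amplify an existing state error by at most a factor `lipschitz` per step. This amplification factor is an assumption introduced for illustration, not a quantity defined in this lecture. The sketch below evaluates the resulting bound.

```python
def compounded_error_bound(eps_one_step: float, lipschitz: float, horizon: int) -> float:
    """Worst-case rollout error under err_{h+1} = lipschitz * err_h + eps_one_step."""
    err = 0.0
    for _ in range(horizon):
        err = lipschitz * err + eps_one_step
    return err

# Same one-step error, three hypothetical amplification factors, 50-step rollout.
for factor in (1.0, 1.05, 1.1):
    print(factor, round(compounded_error_bound(0.01, factor, 50), 3))
```

With no amplification the error grows linearly to about 0.5; with factors of 1.05 and 1.1 the bound reaches roughly 2.1 and 11.6, so a small per-step error can swamp the signal over 50 steps.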


Dyna: interleaving real and imagined experience

The Dyna architecture (Sutton, 1990) is the prototypical integration of model learning and model-free value estimation.

At each real environment step:

  1. Execute $a_t$, observe $r_{t+1}, s_{t+1}$; update the model $\hat{P}_\phi$, $\hat{R}_\phi$.
  2. Update $Q(s_t, a_t)$ from the real transition (standard TD).
  3. Imagined rollouts ($k$ iterations):
    • Sample a previously visited state $\hat{s}$ and an action $\hat{a}$ previously taken in it.
    • Query the model: $\hat{s}' \sim \hat{P}_\phi(\cdot \mid \hat{s}, \hat{a})$, $\hat{r} = \hat{R}_\phi(\hat{s}, \hat{a})$.
    • Update $Q(\hat{s}, \hat{a})$ from the imagined transition.

The $k$ imagined Q-updates produce additional value updates without additional real-world interaction. When the model is accurate, each imagined update is as valid as a real one, but it adds no new information about the environment: it only propagates what the model already knows through the value function faster. Dyna with $k = 100$ therefore performs 100 extra updates per real step, which can substantially reduce the number of real steps needed, with diminishing returns as $k$ grows and at the cost of $k$ model evaluations per real step.
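The three steps above can be sketched as tabular Dyna-Q. This is a minimal sketch under stated assumptions: a deterministic environment (so the model is a lookup table), a hypothetical `env_step(s, a)` function returning `(reward, next_state, done)`, and a toy six-state corridor invented for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start_state, actions, episodes, k,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # Q-table: (state, action) -> value
    model = {}              # table model: (s, a) -> (r, s_next, done)

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])  # random tie-break

    def td_update(s, a, r, s_next, done):
        bootstrap = 0.0 if done else gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (r + bootstrap - q[(s, a)])

    real_steps = 0
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s_next, done = env_step(s, a)   # step 1: real interaction
            real_steps += 1
            model[(s, a)] = (r, s_next, done)  # step 1: update the model
            td_update(s, a, r, s_next, done)   # step 2: real TD update
            for _ in range(k):                 # step 3: k imagined updates
                ps, pa = rng.choice(list(model))
                td_update(ps, pa, *model[(ps, pa)])
            s = s_next
    return q, real_steps

def chain_step(s, a):
    """Hypothetical 6-state corridor: actions -1/+1, reward 1 on reaching state 5."""
    s_next = min(max(s + a, 0), 5)
    done = s_next == 5
    return (1.0 if done else 0.0), s_next, done

q_plan, steps_plan = dyna_q(chain_step, 0, [-1, 1], episodes=20, k=20)
q_none, steps_none = dyna_q(chain_step, 0, [-1, 1], episodes=20, k=0)
print("real steps with k=20:", steps_plan, "| with k=0:", steps_none)
```

Comparing the two step counts across seeds and values of `k` is a quick way to see how planning updates trade computation for real interaction on a given task.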

The conceptual contribution of Dyna is establishing that model-based and model-free learning are complementary, not competing: the model improves value estimates between real steps, and better value estimates guide smarter real-world data collection. This "imagination loop" (generate imagined experience, learn from it, act to collect better real data) recurs throughout model-based RL and in modern AI systems that deliberate before acting.

Model-Based Policy Optimization (MBPO)

MBPO (Janner et al., 2019) is the modern deep-learning incarnation of Dyna. Short model rollouts of length $H$ are generated starting from states in the real replay buffer; imagined transitions are added to a separate model buffer; and a SAC agent trains on the combined real + imagined data. Empirically, $H = 1$ to $5$ works well across MuJoCo tasks, achieving near-state-of-the-art sample efficiency in under 500k real environment steps, compared with the 2 to 3 million required by SAC alone.
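The data flow described above (branch short rollouts from real states, then train on a mix of real and imagined transitions) might look like the sketch below. It illustrates the buffer mechanics only and is not the MBPO implementation: the toy model and policy, the function names, and the `real_ratio` parameter are assumptions made for this example.

```python
import random

def branched_rollouts(real_buffer, model, policy, horizon, num_starts, rng):
    """Start short model rollouts from states sampled out of the real replay buffer."""
    model_buffer = []
    for _ in range(num_starts):
        s = rng.choice(real_buffer)[0]  # a state the agent really visited
        for _ in range(horizon):
            a = policy(s)
            s_next, r = model(s, a)
            model_buffer.append((s, a, r, s_next))
            s = s_next
    return model_buffer

def sample_mixed_batch(real_buffer, model_buffer, batch_size, real_ratio, rng):
    """Draw a training batch with a fixed fraction of real transitions."""
    n_real = round(batch_size * real_ratio)
    batch = [rng.choice(real_buffer) for _ in range(n_real)]
    batch += [rng.choice(model_buffer) for _ in range(batch_size - n_real)]
    return batch

# Hypothetical stand-ins for a learned model and the current policy.
def toy_model(s, a):
    return 0.9 * s + a, -abs(s)

def toy_policy(s):
    return -0.1 * s

rng = random.Random(0)
real_buffer = [(float(s), 0.0, 0.0, float(s)) for s in range(10)]  # (s, a, r, s_next)
model_buffer = branched_rollouts(real_buffer, toy_model, toy_policy,
                                 horizon=3, num_starts=4, rng=rng)
batch = sample_mixed_batch(real_buffer, model_buffer,
                           batch_size=20, real_ratio=0.05, rng=rng)
print(len(model_buffer), len(batch))
```

Because every rollout restarts from a real state, no imagined state is more than `horizon` model steps away from real data.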


Model Predictive Control

Model Predictive Control (MPC) is a classical control algorithm that fits cleanly into the model-based RL framework. Students of the robotics course have seen MPC for known dynamics; here it is placed in the context of learned models.

The receding horizon principle

At each time step $t$, MPC solves:

$$\mathbf{a}^*_{0:H-1} = \arg\max_{\mathbf{a}_{0:H-1}} \sum_{k=0}^{H-1} \gamma^k \hat{R}(\hat{s}_k, a_k)$$

subject to $\hat{s}_{k+1} = \hat{P}_\phi(\hat{s}_k, a_k)$, $\hat{s}_0 = s_t$.

Only the first action $a^*_0$ is executed. At the next step, the agent re-observes the true state $s_{t+1}$ and replans over a new horizon. This is the receding horizon structure.

Why only the first action? A learned model is only an approximation; executing the full $H$-step plan propagates errors without correction. Replanning at every step with the new true observation implements closed-loop feedback on the model, continuously correcting for model error and unexpected perturbations (called disturbances in robot control). This is why MPC is robust to model mismatch: it never commits to more than one step at a time.
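The receding-horizon loop can be sketched as follows. This example uses random shooting (score random action sequences and keep the best) as a deliberately simple planner, and a toy 1-D system in which the learned model is mismatched with the true dynamics. All functions and constants are assumptions made for illustration.

```python
import random

def plan_random_shooting(model, reward, s, horizon, num_candidates, rng):
    """Score random action sequences under the model; return the best one."""
    best_seq, best_ret = None, float("-inf")
    for _ in range(num_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        sim, ret = s, 0.0
        for a in seq:
            sim = model(sim, a)
            ret += reward(sim)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq

def run_mpc(true_step, model, reward, s0, steps, horizon, num_candidates, seed=0):
    rng = random.Random(seed)
    s, visited = s0, [s0]
    for _ in range(steps):
        plan = plan_random_shooting(model, reward, s, horizon, num_candidates, rng)
        s = true_step(s, plan[0])  # execute only the first action of the plan
        visited.append(s)          # re-observe the true state, then replan
    return visited

def learned_model(s, a):  # hypothetical (imperfect) learned dynamics
    return s + a

def true_step(s, a):      # hypothetical true dynamics: weaker actuation plus drift
    return s + 0.8 * a + 0.05

def reward(s):            # drive the state toward the origin
    return -(s ** 2)

visited = run_mpc(true_step, learned_model, reward,
                  s0=3.0, steps=15, horizon=5, num_candidates=200)
print([round(s, 2) for s in visited])
```

Because each plan starts from the re-observed true state, the model's error over a single step is corrected before the next plan is made.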

Optimizers within MPC

In continuous action spaces, gradient-based optimization through differentiable models is possible. For non-differentiable models or non-smooth rewards, sampling-based optimizers are standard:

Cross-Entropy Method (CEM): sample $N$ action sequences $\{\mathbf{a}^{(n)}_{0:H-1}\}_{n=1}^N$ from a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ over action sequences, evaluate the total return $R(\mathbf{a}^{(n)})$ for each using model rollouts, select the top-$K$ sequences (the "elite" set $\mathcal{E}$), and refit the distribution:

$$\mu \leftarrow \frac{1}{K}\sum_{n \in \mathcal{E}} \mathbf{a}^{(n)}, \qquad \Sigma \leftarrow \frac{1}{K}\sum_{n \in \mathcal{E}} (\mathbf{a}^{(n)} - \mu)(\mathbf{a}^{(n)} - \mu)^\top$$

Iterating this sample-and-refit cycle concentrates the sampling distribution on high-return regions. CEM is simple, parallelizable, and effective for horizons up to $H \approx 50$ in continuous control, though it can collapse prematurely to a local optimum if the elite fraction $K/N$ is too small.
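A compact sketch of the CEM loop follows. It departs from the formula above in one respect that should be read as an assumption: it keeps a diagonal covariance (one standard deviation per time step) instead of a full $\Sigma$, a common simplification. The toy return function and hyperparameters are illustrative.

```python
import random
import statistics

def cem_plan(rollout_return, horizon, iters=5, num_samples=64, num_elite=8, seed=0):
    rng = random.Random(seed)
    mu = [0.0] * horizon
    sigma = [1.0] * horizon  # diagonal covariance: one std per time step
    for _ in range(iters):
        samples = [[rng.gauss(mu[t], sigma[t]) for t in range(horizon)]
                   for _ in range(num_samples)]
        samples.sort(key=rollout_return, reverse=True)  # best returns first
        elites = samples[:num_elite]
        for t in range(horizon):                        # refit to the elite set
            column = [seq[t] for seq in elites]
            mu[t] = statistics.mean(column)
            sigma[t] = statistics.pstdev(column) + 1e-3  # small floor keeps sigma > 0
    return mu

def toy_return(actions, s0=2.0):
    """Hypothetical model rollout: s' = s + a, reward = -(s')^2 at each step."""
    s, total = s0, 0.0
    for a in actions:
        s = s + a
        total += -(s ** 2)
    return total

plan = cem_plan(toy_return, horizon=3)
print([round(a, 2) for a in plan], round(toy_return(plan), 3))
```

For this toy problem the best plan moves to the origin in one step and then stays put, so the first action of the returned mean should be close to -2.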

MPPI (Model Predictive Path Integral): a gradient-free update that weights sampled action trajectories by their exponentiated negative cost. For a control problem with running cost $c(s_t, a_t)$ and a nominal action sequence $a_{t:t+H-1}$, MPPI samples $N$ noise sequences $\boldsymbol{\epsilon}^{(n)} \sim \mathcal{N}(0, \Sigma)$, forms perturbed actions $a_\tau^{(n)} = a_\tau + \epsilon_\tau^{(n)}$, rolls each sequence out through the model to obtain states $\hat{s}_\tau^{(n)}$, and computes the update:

$$a_t \leftarrow a_t + \frac{\sum_{n=1}^N \exp\!\left(-\frac{1}{\lambda}\sum_{\tau=t}^{t+H-1} c(\hat{s}_\tau^{(n)}, a_\tau^{(n)})\right) \epsilon_t^{(n)}}{\sum_{n=1}^N \exp\!\left(-\frac{1}{\lambda}\sum_{\tau=t}^{t+H-1} c(\hat{s}_\tau^{(n)}, a_\tau^{(n)})\right)}$$

where $\lambda > 0$ is the temperature controlling the soft-maximum. MPPI solves a KL-regularized version of the stochastic optimal control problem: it trades off lower expected cost against divergence from the current sampling distribution. The exponential weighting ensures that trajectories with much lower cost contribute exponentially more to the update. MPPI has been applied to autonomous driving (steering and throttle control under nonlinear vehicle dynamics) and legged locomotion (real-time replanning for foot placement on uneven terrain).
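The weighted update can be sketched as below. The sketch applies the same exponential weights to every step of the nominal sequence (the formula above shows the step-$t$ component) and assumes independent Gaussian noise per step. Subtracting the minimum cost before exponentiating is a numerical-stability detail that leaves the normalized weights unchanged. The toy cost and hyperparameters are assumptions for illustration.

```python
import math
import random

def mppi_update(nominal, rollout_cost, num_samples=256, noise_std=0.5, lam=1.0, seed=0):
    rng = random.Random(seed)
    horizon = len(nominal)
    noises = [[rng.gauss(0.0, noise_std) for _ in range(horizon)]
              for _ in range(num_samples)]
    # Cost of each perturbed sequence a + eps, evaluated with model rollouts.
    costs = [rollout_cost([a + e for a, e in zip(nominal, eps)]) for eps in noises]
    min_cost = min(costs)  # shift for numerical stability
    weights = [math.exp(-(c - min_cost) / lam) for c in costs]
    total = sum(weights)
    # Each action moves by the weight-averaged noise at its time step.
    return [a + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
            for t, a in enumerate(nominal)]

def toy_cost(actions, s0=2.0):
    """Hypothetical model rollout: s' = s + a, running cost (s')^2."""
    s, total = s0, 0.0
    for a in actions:
        s = s + a
        total += s ** 2
    return total

nominal = [0.0, 0.0, 0.0]
updated = mppi_update(nominal, toy_cost)
print([round(a, 2) for a in updated], round(toy_cost(updated), 3))
```

A smaller `lam` makes the update follow the single best sample more closely; a larger `lam` averages more evenly across samples.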

Terminal value functions

Finite-horizon MPC considers only returns over horizon $H$, so it is blind to rewards that arrive after the horizon ends. The standard solution adds a terminal value function $V_\psi(\hat{s}_H)$ estimated by a learned critic:

$$\max_{\mathbf{a}_{0:H-1}} \sum_{k=0}^{H-1} \gamma^k \hat{R}(\hat{s}_k, a_k) + \gamma^H V_\psi(\hat{s}_H)$$

This is the model-based analog of an $n$-step TD return (the family of returns that TD($\lambda$) averages over): the model handles short-horizon consequences, where it is most accurate, while the value function summarizes the long-horizon return beyond step $H$. Small $H$ leans on $V_\psi$ (high bias if the critic is inaccurate); large $H$ leans on the model (high bias if model error compounds). The best $H$ balances model accuracy against value function accuracy.
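A small worked example shows how the horizon controls the influence of critic error. It assumes a constant reward of 1 per step and $\gamma = 0.9$, so the true value is $1/(1-\gamma) = 10$. These numbers are chosen purely for illustration.

```python
def h_step_value(rewards, terminal_value, gamma):
    """Discounted sum of H model rewards plus gamma^H times the terminal value."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total + (gamma ** len(rewards)) * terminal_value

gamma, true_value = 0.9, 10.0
biased_critic = true_value + 2.0  # hypothetical critic that overestimates by 2

for horizon in (1, 5, 20):
    estimate = h_step_value([1.0] * horizon, biased_critic, gamma)
    print(horizon, round(estimate - true_value, 4))  # error shrinks as gamma^H * 2
```

With a perfect model, the critic's error enters the estimate scaled by $\gamma^H$, so a longer horizon suppresses it. In practice this gain must be weighed against model error that grows with $H$.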

Connection to the robotics course

The model-based RL design space generalizes classical control methods:

| Setting | Algorithm |
|---|---|
| Known linear dynamics, quadratic cost | LQR / DARE (Bellman in closed form) |
| Known nonlinear dynamics | iLQR, Differential Dynamic Programming |
| Unknown dynamics, learned model | MBRL with MPC |
| Unknown dynamics, model-free | SAC, PPO |

The progression from LQR to MBRL is from known to unknown, analytical to data-driven, and exact to approximate. All are solving the same problem — optimal sequential decision-making under dynamics constraints — with different information assumptions.


Latent-space world models: Dreamer and RSSM

Operating models in raw observation space forces prediction of task-irrelevant features (background textures, uninformative pixels) and amplifies prediction error. A more principled approach learns a compact latent state retaining only task-relevant structure.

Recurrent State Space Model (RSSM)

The RSSM (Hafner et al., 2019), the world model used in Dreamer, combines a deterministic recurrent state with a stochastic latent state:

$$
\begin{aligned}
h_t &= f_\phi(h_{t-1},\, z_{t-1},\, a_{t-1}) && \text{(GRU; deterministic history)} \\
z_t &\sim q_\phi(z_t \mid h_t, o_t) && \text{(posterior; uses real observation)} \\
\hat{z}_t &\sim p_\phi(\hat{z}_t \mid h_t) && \text{(prior; for model rollouts)}
\end{aligned}
$$

The model is trained with a reconstruction loss, a reward prediction loss, and a KL penalty aligning prior and posterior:

$$\mathcal{L} = \mathbb{E}\!\left[-\log p_\phi(o_t \mid h_t, z_t) - \log p_\phi(r_t \mid h_t, z_t) + \beta\, D_{\text{KL}}\big(q_\phi(z_t \mid h_t, o_t) \,\|\, p_\phi(z_t \mid h_t)\big)\right]$$
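The KL term has a closed form when both posterior and prior are diagonal Gaussians. That parameterization is an assumption made here for illustration; the lecture does not fix the latent distribution. A minimal sketch of the per-step KL penalty:

```python
import math

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) ), summed over latent dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total

# Posterior (saw the observation) vs prior (predicted from history alone).
posterior_mean, posterior_std = [0.0, 1.0], [1.0, 0.5]
prior_mean, prior_std = [1.0, 1.0], [1.0, 1.0]
print(kl_diag_gaussians(posterior_mean, posterior_std, prior_mean, prior_std))
```

A small KL means the prior, which is all the model has during imagined rollouts, already predicts what the posterior infers from real observations.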

Planning happens entirely in latent space: the RSSM unrolls imagined trajectories using $(h_t, \hat{z}_t)$ transitions without decoding to observation space; the decoder is only needed for training supervision. Dreamer trains an actor and critic entirely within imagined rollouts using backpropagation through the latent model (analytic policy gradients), achieving sample efficiency that was state of the art at publication on visual continuous control from pixels (for example, robot manipulation from camera images).


MuZero and planning in learned abstract spaces

MuZero (Schrittwieser et al., 2020) represents a conceptually distinct approach: learn a model whose sole purpose is to support Monte Carlo Tree Search (MCTS) planning, without any obligation to decode back to observations.

MuZero's three learned functions:

| Function | Input → Output | Role |
|---|---|---|
| Representation $h_\theta$ | Observation history → $z_t$ | Encode real observation |
| Dynamics $g_\theta$ | $(z_t, a_t)$ → $(z_{t+1}, \hat{r}_t)$ | Latent transitions + reward |
| Prediction $f_\theta$ | $z_t$ → $(\hat{\pi}_t, \hat{v}_t)$ | Policy and value from latent |

At each real step, MCTS is run entirely within the latent space generated by repeated application of $g_\theta$. The tree is expanded using $\hat{\pi}$ to guide search and $\hat{v}$ to estimate leaf returns. The normalized action visit counts at the root form the improved policy $\pi_\text{MCTS}$, which is used both to select the executed action and as a training target for $f_\theta$.
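Turning root visit counts into the improved policy is a one-line normalization. The sketch below also includes a temperature exponent, a common way to trade exploration against greediness when sampling the executed action. The function name and example counts are hypothetical.

```python
def visit_count_policy(visit_counts, temperature=1.0):
    """pi(a) proportional to N(a)^(1/T); requires temperature > 0."""
    scaled = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(scaled.values())
    return {a: v / total for a, v in scaled.items()}

root_visits = {"left": 30, "right": 10}  # hypothetical counts after 40 simulations
print(visit_count_policy(root_visits))                   # proportional to counts
print(visit_count_policy(root_visits, temperature=0.5))  # sharper, closer to greedy
```

With temperature 1 the policy is 0.75/0.25; lowering the temperature concentrates probability on the most-visited action.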

The training signal for $g_\theta$ comes from unrolling the model along real trajectories and matching its predicted rewards, values, and policies to observed rewards, bootstrapped return targets, and MCTS visit distributions, not from observation reconstruction. MuZero's latent representation may bear no resemblance to the physical state; it only needs to support accurate reward, value, and policy prediction for MCTS. This suggests that observation-space accuracy is neither necessary nor sufficient for effective planning: a model with low reconstruction error but poor value prediction is of little use for RL, while a model with accurate value prediction on abstract latents can support excellent planning regardless of reconstruction quality.


Synthesis: the model-based RL design space

| Algorithm | Model space | Rollout | Planner | Strength |
|---|---|---|---|---|
| Dyna / MBPO | Observation | 1–5 steps | Q-learning / SAC | Sample efficiency, simplicity |
| MPC + CEM | Observation | 10–50 steps | CEM sampling | Constraints, no learned value fn |
| MPC + terminal $V$ | Observation | 5–20 steps | CEM + bootstrap | Balances model and value bias |
| Dreamer (RSSM) | Learned latent | Short imagined rollouts (about 15 steps) plus value bootstrap | Analytic actor gradient | Visual control, sample efficiency |
| MuZero | Abstract latent | MCTS tree | MCTS | Discrete games, combinatorial tasks |

All model-based methods share the fundamental tradeoff: sample efficiency vs model accuracy demands. Model-free methods require many real transitions; model-based methods require fewer but demand accurate models — requiring a different investment (architecture, training stability, validation). The right choice depends on whether real-world interaction is the bottleneck (favor model-based) or whether model learning itself is difficult or unsafe (favor model-free or conservative model-based with uncertainty penalties).


GenAI context: planning in language models

The model-based RL perspective provides useful language for recent LLM developments. An LLM generating a chain-of-thought performs a single rollout with no lookahead: predict the next token conditioned on all previous tokens, commit, and repeat. This is the degenerate case of prediction without planning.

Tree of Thoughts (ToT) introduces genuine planning: the LLM is simultaneously the dynamics model (predicting consequences of reasoning steps) and the value/policy network (evaluating which branch is promising). Tree search (breadth-first or depth-first in the original formulation, MCTS in later variants) is run in reasoning-step space, analogous to MuZero with the LLM in the model role. The state is the text context; actions are reasoning steps; value estimates come from self-critique.

Best-of-N sampling with process reward models (PRMs) resembles sampling-based MPC, specifically a single CEM iteration (random shooting): generate many candidate reasoning trajectories, score them with a learned reward model, and select the best. The MBRL framework thereby anticipates the effectiveness of test-time compute scaling: providing more computation to plan (more search iterations, more candidates, deeper trees) improves performance on tasks requiring strategic reasoning, which is consistent with the empirical findings of recent scaling studies.
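The selection step of best-of-N can be sketched with stand-in functions. Here `generate` plays the role of the LLM sampling one candidate trajectory and `score` plays the role of a reward model. Both are hypothetical toys (random digit sequences scored by how close their sum is to a target), not real model APIs.

```python
import random

def best_of_n(generate, score, prompt, n, rng):
    """Sample n candidates, score each one, and return the best along with all candidates."""
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score), candidates

def generate(prompt, rng):
    """Stand-in for an LLM sample: a random sequence of four 'reasoning steps'."""
    return [rng.randint(0, 9) for _ in range(4)]

def score(candidate, target=20):
    """Stand-in for a reward model: higher when the steps sum closer to the target."""
    return -abs(sum(candidate) - target)

rng = random.Random(0)
best, candidates = best_of_n(generate, score, prompt="toy prompt", n=16, rng=rng)
print(best, score(best))
```

Increasing `n` is the simplest form of test-time compute scaling: more candidates give the scorer more chances to find a good one, but only to the extent that the scorer is reliable.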


Key takeaways

World models approximate environment dynamics and enable imagined trajectories that substitute for real interaction. The fundamental tradeoff is sample efficiency against the demands of model accuracy; the compounding error problem limits reliable rollout length. Dyna established the principle of interleaving real and imagined experience, with MBPO scaling it to deep networks via short-horizon probabilistic rollouts. MPC implements receding-horizon planning with online re-optimization; its connection to LQR and iLQR makes it the bridge between classical control and RL. The terminal value function addresses the short-horizon limitation by bootstrapping long-horizon returns, creating a tradeoff between model bias and value-function bias that is controlled by $H$. Dreamer shows that planning in a learned latent space can be far more efficient than planning in observation space. MuZero makes the strongest claim: a model that supports MCTS on abstract latents can be highly effective without any reconstruction objective. Planning is a first-class AI capability, and the MBRL framework provides the theoretical tools to reason about when and how to use it.


Conceptual questions

  1. The compounding error problem states that one-step model error $\epsilon_1$ can produce rollout error growing as $O(e^{H\epsilon_1})$ over $H$ steps. Explain mechanistically why errors compound exponentially rather than linearly for a nonlinear model, using the concept of model state diverging from the true manifold of valid states. How does the terminal value function strategy mitigate, but not eliminate, this problem? What property must $V_\psi$ have for the mitigation to be effective?

  2. MBPO uses short rollouts ($H = 1$ to $5$) starting from real replay buffer states rather than full rollouts from the initial state distribution. Explain why this reduces compounding error. What does starting rollouts from the replay buffer assume about the on-policy vs replay state distributions, and when would this assumption fail?

  3. MuZero trains its dynamics model to support MCTS, not to reconstruct observations. Compare MuZero's latent $z_t = h(o_{1:t})$ with Dreamer's $z_t \sim q(z_t \mid h_t, o_t)$. For each, identify: (a) what training signal shapes the latent representation, (b) whether the latent must correspond to a physically interpretable quantity, and (c) what information would be lost if the decoder objective were removed. Under what task conditions would MuZero's representation be preferred?

  4. You are designing model-based RL for a robotic arm where rigid-body dynamics are known analytically but contact dynamics are highly uncertain. Describe a hybrid strategy using an analytical model for rigid-body components and a learned residual for contact dynamics. How would you modify MPC's receding-horizon optimization to incorporate both? What must the planning horizon $H$ satisfy relative to the timescale of contact events?

  5. Map Tree of Thoughts onto the MuZero architecture by identifying what corresponds to: the representation function $h$, dynamics function $g$, prediction function $f$, latent state $z_t$, and MCTS search. Identify at least one structural difference that makes ToT less principled from a planning theory perspective. How would you modify ToT to address this?


✦Solutions
  1. Compounding error. A one-step error nudges the predicted state slightly off the manifold of valid states; for nonlinear dynamics the next prediction is then made from an already-wrong state where the model is even less accurate, so errors feed back multiplicatively and grow like $e^{H\epsilon_1}$ rather than additively. A terminal value function $V_\psi$ caps the rollout at $H$ short steps and bootstraps the remainder, so error accumulates only over $H$ steps. This helps only if $V_\psi$ is accurate on the states the rollout reaches, and it does not remove the within-horizon compounding.
  2. MBPO short rollouts. Starting $H = 1$ to $5$ rollouts from real replay states keeps model queries near states it has actually seen, bounding how far error compounds. It assumes the replay (largely off-policy) state distribution matches the current on-policy distribution; this fails once the policy has shifted substantially, so rollouts begin from states the current policy rarely visits, reintroducing distribution mismatch.
  3. MuZero vs Dreamer latents. MuZero's $z = h(o_{1:t})$ is shaped purely by reward, value, and policy-prediction losses, so (a) the training signal is value-equivalence, (b) the latent need not be physically interpretable, and (c) removing a decoder loses nothing because it never reconstructs observations and keeps only task-relevant information. Dreamer's $z \sim q(z \mid h_t, o_t)$ is shaped by a reconstruction (ELBO) objective, must encode enough to rebuild observations, and would lose its main representation signal without the decoder. MuZero's representation is preferred when observations carry lots of task-irrelevant detail.
  4. Hybrid model. Use the known rigid-body model plus a learned contact residual, $\hat{s}' = f_\text{rigid}(s,a) + f_\text{res}(s,a)$, and run MPC's receding-horizon optimization over the combined model, ideally penalizing plans that pass through regions where the residual is uncertain. The time step $\Delta t$ must be fine enough to resolve contact transients so that contact dynamics are not aliased and missed, and the lookahead $H\Delta t$ must be long enough to cover the approach to and outcome of an upcoming contact event, yet short enough that residual-model error does not compound across several contacts.
  5. Tree of Thoughts as MuZero. Representation $h$ ≈ encoding the prompt into a reasoning state; dynamics $g$ ≈ the LLM generating the next thought; prediction $f$ ≈ the heuristic value/score of a thought; latent $z_t$ ≈ the current partial reasoning state; MCTS ≈ ToT's tree search. The key gap: ToT's dynamics and value are uncalibrated LLM self-evaluations with no learned value-equivalence or grounding, so the search optimizes a noisy proxy. Fix by training a learned verifier/value model (and grounding steps in tool feedback) so the search targets a calibrated objective.

Implementation exercises

Exercise 1: Dyna on a grid world

Implement a Dyna-Q agent on a deterministic grid world (e.g., FrozenLake-v1 with is_slippery=False, since the default environment is stochastic, or a custom maze). Compare the number of real environment steps required to converge as a function of planning steps $k$:

  • $k = 0$ (standard Q-learning, no imagined updates)
  • $k = 5$ (moderate planning)
  • $k = 50$ (aggressive planning)

Track the Q-value convergence error $\|Q_t - Q^*\|_\infty$ over real steps, computing $Q^*$ with value iteration as in Week 3. At what value of $k$ do diminishing returns set in? What happens when the learned model is initially inaccurate: does the error from imagined updates compound or self-correct? Deliberately introduce model errors by limiting the model table to only store transitions for certain state-action pairs; observe how partial model coverage affects Dyna performance.

Exercise 2: MPC with CEM on a continuous control task

Implement receding-horizon MPC using CEM as the optimizer for Pendulum-v1 (continuous actions) or CartPole-v1 (discrete actions, which requires sampling action sequences from a categorical rather than a Gaussian distribution). Your implementation should:

  • Train a deterministic dynamics model $\hat{s}_{t+1} = f_\phi(s_t, a_t)$ from collected transitions.
  • At each step, run CEM for $I$ iterations with $N$ candidate sequences and horizon $H$.
  • Execute only the first action, then replan.
  • Compare with a model-free baseline (SAC for continuous actions, or PPO) on sample efficiency: number of real environment steps to reach a target return.

Experiment with horizon lengths $H \in \{5, 10, 20, 50\}$. At what $H$ does compounding error cause plan quality to degrade? Add a terminal value function $V_\psi$ and observe whether shorter horizons become competitive. Does $V_\psi$ compensate for model inaccuracy?

Exercise 3: Latent imagination with a simplified Dreamer-style model

Implement a simplified Dreamer-style agent on a visual control task (CarRacing-v3 or a custom pixel environment). The key components:

  • Encoder: CNN encodes observation $o_t \to z_t$ (no recurrent state for simplicity).
  • Dynamics: A learned transition $\hat{z}_{t+1} = g_\phi(z_t, a_t)$ in latent space.
  • Decoder: Reconstructs $\hat{o}_t$ from $z_t$ for training supervision.
  • Actor/Critic: Trained entirely within imagined latent rollouts of length $H$.

Compare with a model-free baseline that operates directly on encoded observations without imagination. Measure the reduction in real environment steps needed to reach a target score. Track the reconstruction quality (MSE between $o_t$ and $\hat{o}_t$) over training. Does improved reconstruction correlate with better control performance, or does the latent representation plateau in reconstruction while planning quality continues to improve?


Extension prompts

  1. Model-ensemble uncertainty for safe planning: Extend Exercise 2's MPC by replacing the single dynamics model with an ensemble of $M = 5$ probabilistic models. Use the ensemble disagreement (variance across predicted next states) as an uncertainty penalty in the CEM cost function. Measure whether ensemble-based uncertainty penalties reduce the frequency of catastrophic planning failures (rollouts that predict high reward but produce low reward when executed).

  2. MuZero without a known simulator: Study the MuZero pseudocode and identify which components require a simulator for training (the MCTS tree expansion relies on the dynamics function $g_\theta$, which is learned, so what is the simulator's role?). Design a fully offline MuZero variant that learns from a fixed dataset with no environment interaction at all. What new failure mode emerges?

  3. MBRL for LLM reasoning: Implement a simplified Tree of Thoughts on a reasoning benchmark (e.g., GSM8K). Compare BFS-style tree expansion vs MCTS with the LLM providing both the dynamics (next-reasoning-step prediction) and the value (self-critique scoring). Measure accuracy as a function of the tree budget (number of nodes expanded). Does MCTS outperform BFS for the same compute budget? Connect your findings to the MBRL framework's prediction about test-time compute scaling.


✦Looking Forward

Week 11: Offline Reinforcement Learning. The next lecture addresses a fundamental practical constraint: what if the agent cannot interact with the environment at all during learning? We study learning from fixed datasets collected by a separate behavior policy (the dominant setting for safe deployment in robotics, healthcare, and AI alignment) and develop the conservative methods (CQL, IQL, AWAC) that address the distributional shift and Q-value overestimation problems that make standard RL fail on offline data.


Further reading

  • Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. ICML. (The Dyna Architecture).
  • Janner, M., et al. (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS. (MBPO).
  • Hafner, D., et al. (2019). Dream to Control: Learning Behaviors by Latent Imagination. ICLR. (Dreamer / RSSM).
  • Hafner, D., et al. (2023). Mastering Diverse Domains through World Models. arXiv. (DreamerV3 — unifies Dreamer across domains with fixed hyperparameters).
  • Schrittwieser, J., et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature. (MuZero).