Purpose of this lecture
Every algorithm studied through Week 8 is model-free: it learns a policy or value function directly from environment interaction without representing how the environment works. Model-free methods are powerful but data-hungry — every transition contributes one gradient step, and the environment must be queried millions of times before useful behavior emerges.
Model-based reinforcement learning (RL) breaks this dependence by having the agent learn, or be given, a model of the environment's dynamics. With a model, the agent can imagine consequences of actions without executing them, generating synthetic experience to supplement or replace real interaction. This is the computational analog of mentally simulating a trajectory before moving, and it is essential for physical systems where each real interaction has cost or safety consequences.
Model-based RL is not one algorithm but a design space: the model can be learned or known; planning can use it for one step, a short horizon, or to build a tree of possible futures; the model can operate in observation space or a learned latent space. This lecture organizes that design space from simple to complex, anchoring each method in its connection to classical control theory (MPC, studied in the robotics course) and to modern AI systems that plan before acting.
What is a world model?
A world model approximates the environment's transition and reward functions:
$$\hat{p}_\theta(s_{t+1} \mid s_t, a_t) \approx p(s_{t+1} \mid s_t, a_t), \qquad \hat{r}_\theta(s_t, a_t) \approx r(s_t, a_t).$$
Given a world model, the agent can unroll imagined trajectories:
$$a_t \sim \pi(\cdot \mid \hat{s}_t), \qquad \hat{r}_t = \hat{r}_\theta(\hat{s}_t, a_t), \qquad \hat{s}_{t+1} \sim \hat{p}_\theta(\cdot \mid \hat{s}_t, a_t),$$
using the model in place of real environment steps. If the model is accurate, imagined trajectories provide the same learning signal as real ones at a fraction of the wall-clock cost. If the model is inaccurate, imagined trajectories mislead the learner: a policy optimized against a flawed model can fail arbitrarily when deployed. Managing the tension between model-leveraged efficiency and model-induced bias is the central design challenge of model-based RL.
Deterministic vs probabilistic models
A deterministic model predicts a single next state, trained via MSE on observed transitions. It is fast and simple but cannot represent uncertainty. Errors are silent: the model produces a confident prediction even in regions where the true dynamics are stochastic or poorly constrained.
A probabilistic model outputs a distribution over next states, typically a Gaussian $\mathcal{N}\big(\mu_\theta(s_t, a_t), \Sigma_\theta(s_t, a_t)\big)$. The predicted variance signals uncertainty: high variance indicates inherently stochastic dynamics (aleatoric uncertainty) and, less reliably, regions of sparse data (epistemic uncertainty). This enables uncertainty-aware planning: action sequences through high-variance regions can be penalized.
Ensemble models train independent deterministic models on different data subsets. Disagreement across ensemble predictions provides a practical epistemic uncertainty estimate. MBPO and PETS use probabilistic ensembles of Gaussian models as a sweet spot between expressivity and training cost.
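The disagreement idea can be sketched on a toy problem. The example below is only an illustration under stated assumptions: a 1-D system, linear least-squares members fit on bootstrap resamples, and the hypothetical helper names `fit_ensemble` and `predict_with_disagreement`. PETS and MBPO use neural Gaussian models rather than linear fits.

```python
import numpy as np

def fit_ensemble(states, actions, next_states, n_models=5, seed=0):
    """Fit n_models linear models s' ~ [s, a, 1] @ w, each on a bootstrap resample."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([states, actions, np.ones_like(states)])
    weights = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # resample with replacement
        w, *_ = np.linalg.lstsq(X[idx], next_states[idx], rcond=None)
        weights.append(w)
    return np.stack(weights)                          # shape (n_models, 3)

def predict_with_disagreement(weights, s, a):
    """Return the ensemble mean prediction and the std across members."""
    preds = weights @ np.array([s, a, 1.0])
    return preds.mean(), preds.std()

# Toy data (assumed): s' = 0.9 s + 0.5 a + noise, with states only in [-1, 1].
rng = np.random.default_rng(1)
s = rng.uniform(-1.0, 1.0, 200)
a = rng.uniform(-1.0, 1.0, 200)
s_next = 0.9 * s + 0.5 * a + 0.05 * rng.normal(size=200)

ensemble = fit_ensemble(s, a, s_next)
mean_in, std_in = predict_with_disagreement(ensemble, 0.0, 0.0)    # inside the data
mean_out, std_out = predict_with_disagreement(ensemble, 10.0, 0.0)  # far outside
print(f"disagreement inside data: {std_in:.4f}, outside data: {std_out:.4f}")
```

Members agree where data is dense and drift apart when queried far from it, which is the signal a planner can penalize.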
The compounding error problem
Errors compound over time: a one-step prediction error $\epsilon$ can produce accumulated error that grows exponentially in the number of model steps $H$ for nonlinear dynamics, as model states drift off the true manifold of valid states. Even an accurate model with small $\epsilon$ may yield completely unreliable 50-step rollouts. This is the open-loop prediction problem studied in robust control. Standard mitigations: limit rollout length to the reliable horizon, replan frequently with closed-loop correction, or bootstrap long-horizon returns with a learned value function.
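A small numerical experiment makes the effect concrete. The sketch below assumes a chaotic logistic map as the "true" dynamics and a model that is wrong by a constant $10^{-6}$ per step; both are hypothetical choices picked to make the growth visible, not a claim about any particular environment.

```python
def true_step(s):
    """Assumed true dynamics: a chaotic logistic map on [0, 1]."""
    return 3.9 * s * (1.0 - s)

def model_step(s, eps=1e-6):
    """Assumed learned model: exact up to a constant one-step error eps."""
    return true_step(s) + eps

def open_loop_errors(s0, horizon):
    """Roll model and environment forward separately; the model sees only its own predictions."""
    s_true, s_model = s0, s0
    errors = []
    for _ in range(horizon):
        s_true = true_step(s_true)
        s_model = model_step(s_model)
        errors.append(abs(s_model - s_true))
    return errors

def closed_loop_errors(s0, horizon):
    """Re-seed the model from the true state every step (one-step prediction only)."""
    s_true, errors = s0, []
    for _ in range(horizon):
        prediction = model_step(s_true)
        s_true = true_step(s_true)
        errors.append(abs(prediction - s_true))
    return errors

open_err = open_loop_errors(0.3, 50)
closed_err = closed_loop_errors(0.3, 50)
print(f"open-loop:   step 1 error {open_err[0]:.1e}, worst error {max(open_err):.2f}")
print(f"closed-loop: worst error {max(closed_err):.1e}")
```

The closed-loop error stays at the one-step level, which is the numerical motivation for short rollouts and frequent replanning.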
Dyna: interleaving real and imagined experience
The Dyna architecture (Sutton, 1990) is the prototypical integration of model learning and model-free value estimation.
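At each real environment step:
- Execute $a_t$, observe $(r_t, s_{t+1})$; update the model $\hat{M}$ with the transition $(s_t, a_t, r_t, s_{t+1})$.
- Update $Q$ from the real transition (standard temporal-difference, TD, update).
- Imagined rollouts ($n$ iterations):
  - Sample a random previously visited state $\tilde{s}$ and action $\tilde{a}$.
  - Query the model: $(\tilde{r}, \tilde{s}') = \hat{M}(\tilde{s}, \tilde{a})$.
  - Update $Q$ from the imagined transition $(\tilde{s}, \tilde{a}, \tilde{r}, \tilde{s}')$.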
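The imagined Q-updates produce additional gradient steps without additional real-world interaction. For an accurate model, each imagined update is as valid as a real one, although it replays previously visited state-action pairs rather than adding new coverage. Dyna with $n = 100$ planning iterations therefore performs roughly 100 times as many value updates per real step, at the cost of $n$ model evaluations per real step.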
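The loop described above can be sketched as tabular Dyna-Q. This is a minimal illustration, assuming a deterministic 10-state chain with reward 1 at the right end; the environment `chain_step`, the table model, and all hyperparameters are hypothetical choices.

```python
import random
from collections import defaultdict

def chain_step(s, a, n_states=10):
    """Assumed toy environment: move left/right on a chain; reward 1 at the last state."""
    s2 = min(max(s + a, 0), n_states - 1)
    done = s2 == n_states - 1
    return (1.0 if done else 0.0), s2, done

def dyna_q(env_step, start, actions, n_planning, episodes=30,
           alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}          # (s, a) -> (r, s_next, done): deterministic table model
    real_steps = 0

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def td_update(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = rng.choice(actions) if rng.random() < eps else greedy(s)
            r, s2, done = env_step(s, a)           # real interaction
            real_steps += 1
            model[(s, a)] = (r, s2, done)          # update the model
            td_update(s, a, r, s2, done)           # learn from the real transition
            for _ in range(n_planning):            # imagined updates
                ps, pa = rng.choice(list(model))
                td_update(ps, pa, *model[(ps, pa)])
            s = s2
    return Q, real_steps

Q_plan, steps_plan = dyna_q(chain_step, start=0, actions=[-1, 1], n_planning=20)
Q_none, steps_none = dyna_q(chain_step, start=0, actions=[-1, 1], n_planning=0)
print(f"n=20: Q(start, right) = {Q_plan[(0, 1)]:.3f} after {steps_plan} real steps")
print(f"n=0:  Q(start, right) = {Q_none[(0, 1)]:.3f} after {steps_none} real steps")
```

Setting `n_planning=0` recovers plain Q-learning, so the same function gives the baseline for comparison.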
The conceptual contribution of Dyna is establishing that model-based and model-free learning are complementary, not competing: the model improves value estimates between real steps, and better value estimates guide smarter real-world data collection. This "imagination loop" (generate imagined experience, learn from it, act to collect better real data) recurs throughout model-based RL and in modern AI systems that deliberate before acting.
Model-Based Policy Optimization (MBPO)
MBPO (Janner et al., 2019) is the modern deep-learning incarnation of Dyna. Short model rollouts of length $k$ are generated starting from states in the real replay buffer; imagined transitions are added to a separate model buffer; and a SAC (Soft Actor-Critic) agent trains on the combined real and imagined data. Empirically, $k = 1$ to $k = 5$ works well across MuJoCo tasks, achieving near-state-of-the-art sample efficiency in under 500k real environment steps, compared to the 2–3 million required by SAC alone.
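The branched-rollout step can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: `toy_model` and `toy_policy` are hypothetical stand-ins for the learned ensemble and the SAC policy, and termination handling is omitted.

```python
import random

def branched_rollouts(real_states, model, policy, k, n_starts, rng):
    """Generate k-step model rollouts that branch from states in the real buffer."""
    model_buffer = []
    for _ in range(n_starts):
        s = rng.choice(real_states)          # start from a real, visited state
        for _ in range(k):
            a = policy(s)
            s_next, r = model(s, a)          # imagined transition
            model_buffer.append((s, a, r, s_next))
            s = s_next
    return model_buffer

def toy_model(s, a):
    """Hypothetical learned model: returns (next_state, reward)."""
    s_next = s + a
    return s_next, -abs(s_next)

def toy_policy(s):
    """Hypothetical policy: step toward the origin."""
    return -1.0 if s > 0 else 1.0

real_buffer_states = [-3.0, 0.5, 2.0]
model_buffer = branched_rollouts(real_buffer_states, toy_model, toy_policy,
                                 k=3, n_starts=4, rng=random.Random(0))
print(len(model_buffer), "imagined transitions")
```

Because every rollout restarts from a real state after at most `k` steps, model error has only `k` steps in which to accumulate.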
Model Predictive Control
Model Predictive Control (MPC) is a classical control algorithm that fits cleanly into the model-based RL framework. Students of the robotics course have seen MPC for known dynamics; here it is placed in the context of learned models.
The receding horizon principle
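At each time step $t$, MPC solves:
$$\max_{a_t, \dots, a_{t+H-1}} \; \sum_{k=0}^{H-1} \gamma^k \, \hat{r}_\theta(\hat{s}_{t+k}, a_{t+k})$$
subject to $\hat{s}_{t+k+1} = \hat{f}_\theta(\hat{s}_{t+k}, a_{t+k})$, $\hat{s}_t = s_t$, where $\hat{f}_\theta$ and $\hat{r}_\theta$ are the learned dynamics and reward models and $H$ is the planning horizon.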
Only the first action is executed. At the next step, the agent re-observes the true state and replans over a new horizon. This is the receding horizon structure.
Why only the first action? A learned model is only an approximation; executing the full $H$-step plan propagates errors without correction. Replanning at every step from the newly observed true state implements closed-loop feedback on the model, continuously correcting for model error and unexpected perturbations (disturbances, in robot control terms). This is why MPC is robust to model mismatch: it never commits to more than one step at a time.
Optimizers within MPC
In continuous action spaces, gradient-based optimization through differentiable models is possible. For non-differentiable models or non-smooth rewards, sampling-based optimizers are standard:
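Cross-Entropy Method (CEM): sample $N$ action sequences from a Gaussian distribution $\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ over action sequences, evaluate the total return for each using model rollouts, select the top-$K$ sequences (the "elite" set $\mathcal{E}$), and refit the distribution (the square is elementwise):
$$\mu \leftarrow \frac{1}{K} \sum_{i \in \mathcal{E}} a^{(i)}_{t:t+H-1}, \qquad \sigma^2 \leftarrow \frac{1}{K} \sum_{i \in \mathcal{E}} \big(a^{(i)}_{t:t+H-1} - \mu\big)^2.$$
Iterating this sample-refit cycle concentrates the sampling distribution on high-return regions. CEM is simple, parallelizable, and effective for moderate horizons (tens of steps) in continuous control, though it can collapse prematurely to a local optimum if the elite fraction is too small.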
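A compact sketch of CEM inside a receding-horizon loop follows. It assumes a 1-D point mass whose exact dynamics stand in for a learned model, a quadratic reward around a target at 2.0, and actions bounded in $[-1, 1]$; all names and hyperparameters are illustrative.

```python
import numpy as np

def cem_plan(model, reward, s0, horizon=10, n_samples=200, n_elite=20,
             n_iters=5, rng=None):
    """Return the mean action sequence after n_iters CEM refits."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu, sigma = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        # Sample candidate action sequences, clipped to the action bounds.
        seqs = np.clip(rng.normal(mu, sigma, size=(n_samples, horizon)), -1.0, 1.0)
        returns = np.zeros(n_samples)
        for i, seq in enumerate(seqs):
            s = s0
            for a in seq:                      # roll the model forward
                returns[i] += reward(s, a)
                s = model(s, a)
        elite = seqs[np.argsort(returns)[-n_elite:]]        # top-K by return
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

def model(s, a):
    """Hypothetical stand-in for a learned dynamics model."""
    return s + 0.5 * a

def reward(s, a):
    """Assumed reward: stay near the target 2.0 with small actions."""
    return -(s - 2.0) ** 2 - 0.01 * a ** 2

rng = np.random.default_rng(0)
state = 0.0
for _ in range(20):
    plan = cem_plan(model, reward, state, rng=rng)
    state = model(state, plan[0])   # execute only the first action, then replan
print(f"state after 20 MPC steps: {state:.2f}")
```

Here the "real" environment equals the model, so the loop isolates the optimizer; with a learned model the last line would call the environment instead.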
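MPPI (Model Predictive Path Integral): computes a principled gradient-free update by weighting action trajectories according to exponentiated returns (negative costs). For a control problem with running cost $c(s_t, a_t)$, MPPI samples $K$ noise sequences $\epsilon^{(k)}$ around the current nominal controls $u_{t:t+H-1}$ and computes the update:
$$u_{\tau} \leftarrow u_{\tau} + \frac{\sum_{k=1}^{K} w_k \, \epsilon^{(k)}_{\tau}}{\sum_{k=1}^{K} w_k}, \qquad w_k = \exp\!\Big(-\frac{1}{\lambda} S^{(k)}\Big), \qquad S^{(k)} = \sum_{\tau=t}^{t+H-1} c\big(\hat{s}^{(k)}_{\tau}, u_{\tau} + \epsilon^{(k)}_{\tau}\big),$$
where $\lambda$ is the temperature controlling the soft-maximum and $S^{(k)}$ is the total cost of the $k$-th perturbed rollout. MPPI solves a KL-regularized version of the stochastic optimal control problem: it seeks the control distribution closest to the previous one that achieves lower expected cost. The exponential weighting ensures that trajectories with much lower cost contribute exponentially more to the update. MPPI is widely used in autonomous driving (steering + throttle control under nonlinear vehicle dynamics) and legged locomotion (real-time replanning for foot placement on uneven terrain).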
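One MPPI update can be sketched as follows. This is a simplified form that weights sampled perturbations by exponentiated negative rollout cost and omits the control-cost correction found in full derivations; `rollout_cost` is a hypothetical 1-D point-mass cost, and subtracting the minimum cost is a standard numerical-stability step that leaves the normalized weights unchanged.

```python
import numpy as np

def mppi_update(u, rollout_cost, n_samples=256, noise_std=0.5, lam=1.0, rng=None):
    """One MPPI step: perturb the nominal controls u and average the noise
    with weights proportional to exp(-cost / lam)."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.normal(0.0, noise_std, size=(n_samples,) + u.shape)
    costs = np.array([rollout_cost(u + e) for e in eps])
    w = np.exp(-(costs - costs.min()) / lam)   # shift by the min for stability
    w /= w.sum()
    return u + np.tensordot(w, eps, axes=1), w

def rollout_cost(u, s0=0.0, target=2.0):
    """Assumed toy cost: squared distance to target along a point-mass rollout."""
    s, cost = s0, 0.0
    for a in u:
        s = s + 0.5 * a
        cost += (s - target) ** 2
    return cost

u0 = np.zeros(10)
u1, weights = mppi_update(u0, rollout_cost)
print(f"cost before: {rollout_cost(u0):.1f}, after one update: {rollout_cost(u1):.1f}")
```

Lowering `lam` concentrates the weights on the best few samples, while raising it moves the update toward a plain average of the noise.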
Terminal value functions
Finite-horizon MPC considers only returns over horizon $H$ and is blind to consequences beyond it. The standard solution adds a terminal value function $V_\phi$ estimated by a learned critic:
$$\max_{a_t, \dots, a_{t+H-1}} \; \sum_{k=0}^{H-1} \gamma^k \, \hat{r}_\theta(\hat{s}_{t+k}, a_{t+k}) + \gamma^H V_\phi(\hat{s}_{t+H}).$$
This is the model-based analog of multi-step TD methods such as TD($\lambda$): the model handles short-horizon consequences with high fidelity, while the value function handles long-horizon return with correct asymptotic behavior. Small $H$ leans on $V_\phi$ (high bias if $V_\phi$ is inaccurate); large $H$ leans on the model (high bias if it compounds error). The optimal $H$ balances model accuracy against value function accuracy.
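A short worked example shows what the terminal term buys. The sketch assumes a constant reward of 1 per step and $\gamma = 0.9$, so the true infinite-horizon return is $1 / (1 - \gamma) = 10$; the function name is hypothetical.

```python
def h_step_return(rewards, gamma, terminal_value=0.0):
    """Discounted sum of H model rewards plus a discounted terminal bootstrap."""
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    return g + gamma ** len(rewards) * terminal_value

gamma = 0.9
true_value = 1.0 / (1.0 - gamma)          # value of receiving reward 1 forever

truncated = h_step_return([1.0] * 3, gamma)                    # about 2.71
bootstrapped = h_step_return([1.0] * 3, gamma, true_value)     # about 10
biased = h_step_return([1.0] * 3, gamma, 0.5 * true_value)     # critic off by 50%
print(truncated, bootstrapped, biased)
```

With an exact critic the horizon no longer matters, and with a critic that is 50% off the error is scaled by $\gamma^H$, which is why longer horizons tolerate worse value estimates.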
Connection to the robotics course
The model-based RL design space is the generalization of classical control methods:
| Setting | Algorithm |
|---|---|
| Known linear dynamics, quadratic cost | LQR / DARE (Bellman in closed form) |
| Known nonlinear dynamics | iLQR, Differential Dynamic Programming |
| Unknown dynamics, learned model | MBRL with MPC |
| Unknown dynamics, model-free | SAC (Soft Actor-Critic), PPO (Proximal Policy Optimization) |
The progression from LQR to MBRL is from known to unknown, analytical to data-driven, and exact to approximate. All are solving the same problem — optimal sequential decision-making under dynamics constraints — with different information assumptions.
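The first row of the table can be made concrete with a scalar example. The sketch below assumes dynamics $s' = a s + b u$ and per-step cost $q s^2 + r u^2$ with illustrative numbers; iterating the Bellman backup on the quadratic value $P s^2$ converges to the solution of the discrete algebraic Riccati equation (DARE).

```python
def scalar_lqr(a, b, q, r, iters=500):
    """Value iteration on P for s' = a s + b u with cost q s^2 + r u^2.
    Returns the converged P and the feedback gain K in u = -K s."""
    p = q
    for _ in range(iters):
        # Bellman backup in closed form (the Riccati recursion).
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)
    return p, k

def dare_residual(p, a, b, q, r):
    """How far p is from satisfying the scalar DARE."""
    return abs(p - (q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)))

a, b, q, r = 1.2, 1.0, 1.0, 1.0          # open loop is unstable since |a| > 1
p, k = scalar_lqr(a, b, q, r)
print(f"P = {p:.4f}, K = {k:.4f}, closed-loop pole = {a - b * k:.4f}")
```

The same backup with a learned model and a sampled optimizer in place of the closed-form minimization is what the MBRL row of the table amounts to.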
Latent-space world models: Dreamer and RSSM
Operating models in raw observation space forces prediction of task-irrelevant features (background textures, uninformative pixels) and amplifies prediction error. A more principled approach learns a compact latent state retaining only task-relevant structure.
Recurrent State Space Model (RSSM)
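The RSSM (Hafner et al., 2019; introduced with PlaNet and reused in Dreamer) combines a deterministic recurrent state $h_t$ with a stochastic latent state $z_t$:
$$h_t = f_\theta(h_{t-1}, z_{t-1}, a_{t-1}), \qquad z_t \sim q_\theta(z_t \mid h_t, o_t) \ \text{(posterior)}, \qquad \hat{z}_t \sim p_\theta(\hat{z}_t \mid h_t) \ \text{(prior)}.$$
The model is trained with a reconstruction loss, a reward prediction loss, and a KL penalty aligning prior and posterior:
$$\mathcal{L}(\theta) = \mathbb{E}\Big[\sum_t -\ln p_\theta(o_t \mid h_t, z_t) - \ln p_\theta(r_t \mid h_t, z_t) + \beta \, \mathrm{KL}\big(q_\theta(z_t \mid h_t, o_t) \,\|\, p_\theta(z_t \mid h_t)\big)\Big].$$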
Planning happens entirely in latent space: the RSSM unrolls imagined trajectories using its prior transitions $p_\theta(z_t \mid h_t)$ without decoding to observation space; the decoder is only needed for training supervision. Dreamer trains an actor and critic entirely within imagined rollouts using backpropagation through the latent model (analytic policy gradients), achieving state-of-the-art sample efficiency on visual continuous control from pixels (robot manipulation from camera images).
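The structure of a latent rollout can be sketched without any training. The class below is a randomly initialized stand-in, not a trained RSSM: a tanh cell replaces the GRU used in Dreamer, the prior is a diagonal Gaussian, and all dimensions and names are assumptions for illustration. Note that no decoder is called anywhere in the rollout.

```python
import numpy as np

class TinyRSSM:
    """Randomly initialised stand-in for a trained RSSM (illustration only)."""

    def __init__(self, h_dim=8, z_dim=4, a_dim=2, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W_h = self.rng.normal(0, 0.3, (h_dim, h_dim + z_dim + a_dim))
        self.W_mu = self.rng.normal(0, 0.3, (z_dim, h_dim))
        self.W_logstd = self.rng.normal(0, 0.3, (z_dim, h_dim))
        self.w_r = self.rng.normal(0, 0.3, h_dim + z_dim)

    def step(self, h, z, a):
        # Deterministic path: h_t = f(h_{t-1}, z_{t-1}, a_{t-1}).
        h = np.tanh(self.W_h @ np.concatenate([h, z, a]))
        # Stochastic path: sample z_t from the prior p(z_t | h_t).
        mu, std = self.W_mu @ h, np.exp(self.W_logstd @ h)
        z = mu + std * self.rng.normal(size=mu.shape)
        # Reward head reads the latent state directly; no observation is decoded.
        r = float(self.w_r @ np.concatenate([h, z]))
        return h, z, r

def imagine(model, h0, z0, policy, horizon):
    """Unroll an imagined trajectory purely in latent space."""
    h, z, rewards = h0, z0, []
    for _ in range(horizon):
        a = policy(h, z)
        h, z, r = model.step(h, z, a)
        rewards.append(r)
    return h, z, rewards

rssm = TinyRSSM()
policy = lambda h, z: np.tanh(h[:2])          # hypothetical actor
h_T, z_T, imagined_rewards = imagine(rssm, np.zeros(8), np.zeros(4), policy, horizon=15)
print(len(imagined_rewards), "imagined rewards")
```

In a trained agent the weights come from the world-model loss, and the actor and critic are optimized on the imagined rewards.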
MuZero and planning in learned abstract spaces
MuZero (Schrittwieser et al., 2020) represents a conceptually distinct approach: learn a model whose sole purpose is to support Monte Carlo Tree Search (MCTS) planning, without any obligation to decode back to observations.
MuZero's three learned functions:
| Function | Input → Output | Role |
|---|---|---|
| Representation $h_\theta$ | Observation history → $s^0$ | Encode real observation |
| Dynamics $g_\theta$ | $(s^{k-1}, a^k)$ → $(r^k, s^k)$ | Latent transitions + reward |
| Prediction $f_\theta$ | $s^k$ → $(p^k, v^k)$ | Policy and value from latent |
At each real step, MCTS is run entirely within the latent space generated by repeated application of the dynamics function $g_\theta$. The tree is expanded using the predicted policy $p$ to guide search and the predicted value $v$ to estimate leaf returns. The resulting action visit counts form the improved policy $\pi$, which is both executed and used as a training target for $p$.
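The last step, turning visit counts into a policy, is simple enough to show directly. The sketch assumes the root visit counts are already available from a search; the temperature exponent follows the MuZero paper, while the function name and the example counts are hypothetical.

```python
def visit_count_policy(visit_counts, temperature=1.0):
    """Normalise root visit counts N(a) into a policy proportional to N(a)^(1/T)."""
    scaled = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(scaled.values())
    return {a: v / total for a, v in scaled.items()}

root_counts = {"left": 10, "right": 30, "stay": 10}   # assumed search result
pi = visit_count_policy(root_counts)
pi_sharp = visit_count_policy(root_counts, temperature=0.5)
print(pi)
print(pi_sharp)
```

A lower temperature sharpens the distribution toward the most-visited action, which trades exploration for exploitation when acting.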
The training signal for the learned model comes from temporal consistency of predictions along real trajectories, not from observation reconstruction: latents unrolled through the dynamics function must predict the rewards, values, and search policies recorded along the real trajectory. MuZero's latent representation may bear no resemblance to the physical state; it only needs to support accurate value and policy prediction for MCTS. This demonstrates that observation-space accuracy is neither necessary nor sufficient for effective planning: a model with low reconstruction error but poor value prediction is useless for RL; a model with accurate value prediction on abstract latents supports excellent planning regardless of reconstruction quality.
Synthesis: the model-based RL design space
| Algorithm | Model space | Rollout | Planner | Strength |
|---|---|---|---|---|
| Dyna / MBPO | Observation | 1–5 steps | Q-learning / SAC | Sample efficiency, simplicity |
| MPC + CEM | Observation | 10–50 steps | CEM sampling | Constraints, no learned value fn |
| MPC + terminal value | Observation | 5–20 steps | CEM + bootstrap | Balances model and value bias |
| Dreamer (RSSM) | Learned latent | Short imagined rollouts (about 15 steps) | Analytic actor gradient | Visual control, sample efficiency |
| MuZero | Abstract latent | MCTS tree | MCTS | Discrete games, combinatorial tasks |
All model-based methods share the fundamental tradeoff: sample efficiency vs model accuracy demands. Model-free methods require many real transitions; model-based methods require fewer but demand accurate models — requiring a different investment (architecture, training stability, validation). The right choice depends on whether real-world interaction is the bottleneck (favor model-based) or whether model learning itself is difficult or unsafe (favor model-free or conservative model-based with uncertainty penalties).
GenAI context: planning in language models
The model-based RL perspective provides precise language for recent large language model (LLM) developments. An LLM generating a chain-of-thought runs a one-step greedy rollout: predict the next token conditioned on all previous tokens, commit. This is the model-free degenerate case: prediction without planning.
Tree of Thoughts (ToT) introduces genuine planning: the LLM is simultaneously the dynamics model (predicting consequences of reasoning steps) and the value/policy network (evaluating which branch is promising). Tree search is run in reasoning-step space (breadth-first or depth-first in the original ToT, MCTS in later variants), analogous to MuZero with the LLM in the model role. The latent state is the text context; actions are reasoning steps; value estimates come from self-critique.
Best-of-N sampling with process reward models (PRMs) is the analog of sampling-based MPC, specifically a single CEM iteration without the refit step: generate many candidate reasoning trajectories, score them with a learned reward model, select the best. The MBRL framework thereby predicts the effectiveness of test-time compute scaling: providing more computation to plan (more MCTS iterations, more candidates, deeper trees) improves performance on tasks requiring strategic reasoning, which matches the empirical finding from recent scaling studies.
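The selection step is easy to sketch once generation and scoring are abstracted away. In the example below, `toy_generate` and `toy_score` are hypothetical stand-ins for an LLM sampler and a learned reward model (a real PRM scores intermediate steps, whereas this toy scores whole candidates).

```python
import random

def best_of_n(generate, score, prompt, n, rng):
    """Sample n candidates and return the one the scorer ranks highest."""
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

def toy_generate(prompt, rng):
    """Hypothetical sampler: a random integer 'answer'."""
    return rng.randint(0, 100)

def toy_score(candidate):
    """Hypothetical reward model: closer to 42 is better."""
    return -abs(candidate - 42)

best_1 = best_of_n(toy_generate, toy_score, "toy prompt", 1, random.Random(0))
best_64 = best_of_n(toy_generate, toy_score, "toy prompt", 64, random.Random(0))
print(f"best of 1: {best_1}, best of 64: {best_64}")
```

With the same seed the first candidate is shared, so the best-of-64 score can never be worse than best-of-1; how much it improves depends on how well the scorer tracks true quality.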
Key takeaways
- World models approximate environment dynamics and enable imagined trajectories that substitute for real interaction.
- The fundamental tradeoff is model accuracy against planning efficiency; the compounding error problem limits reliable rollout length.
- Dyna established the principle of interleaving real and imagined experience, with MBPO scaling it to deep networks via short-horizon probabilistic rollouts.
- MPC implements receding-horizon planning with online re-optimization; its connection to LQR and iLQR makes it the bridge between classical control and RL.
- The terminal value function resolves the short-horizon limitation by bootstrapping long-horizon returns, creating a precise model-vs-value bias-variance tradeoff.
- Dreamer demonstrates that planning in learned latent space is more efficient than in observation space.
- MuZero makes the strongest claim: a model that supports MCTS on abstract latents can outperform observation-space models without any reconstruction objective.
- Planning is a first-class AI capability, and the MBRL framework provides the theoretical tools to reason about when and how to use it.
Conceptual questions
- The compounding error problem states that one-step model error $\epsilon$ can produce rollout error growing exponentially over $H$ steps. Explain mechanistically why errors compound exponentially rather than linearly for a nonlinear model, using the concept of model state diverging from the true manifold of valid states. How does the terminal value function strategy mitigate, but not eliminate, this problem? What property must $V_\phi$ have for the mitigation to be effective?
- MBPO uses short rollouts ($k = 1$ to $5$) starting from real replay buffer states rather than full rollouts from the initial state distribution. Explain why this reduces compounding error. What does starting rollouts from the replay buffer assume about the on-policy vs replay state distributions, and when would this assumption fail?
- MuZero trains its dynamics model to support MCTS, not to reconstruct observations. Compare MuZero's latent $s^k$ with Dreamer's $(h_t, z_t)$. For each, identify: (a) what training signal shapes the latent representation, (b) whether the latent must correspond to a physically interpretable quantity, and (c) what information would be lost if the decoder objective were removed. Under what task conditions would MuZero's representation be preferred?
- You are designing model-based RL for a robotic arm where rigid-body dynamics are known analytically but contact dynamics are highly uncertain. Describe a hybrid strategy using an analytical model for rigid-body components and a learned residual for contact dynamics. How would you modify MPC's receding-horizon optimization to incorporate both? What must the planning horizon $H$ satisfy relative to the timescale of contact events?
- Map Tree of Thoughts onto the MuZero architecture by identifying what corresponds to: the representation function $h_\theta$, dynamics function $g_\theta$, prediction function $f_\theta$, latent state $s^k$, and MCTS search. Identify at least one structural difference that makes ToT less principled from a planning theory perspective. How would you modify ToT to address this?
Implementation exercises
Exercise 1: Dyna on a grid world
Implement a Dyna-Q agent on a deterministic grid world (e.g., FrozenLake-v1 with is_slippery=False, or a custom maze). Compare the number of real environment steps required to converge as a function of planning steps $n$:
- $n = 0$ (standard Q-learning, no imagined updates)
- a small $n$ (moderate planning)
- a large $n$ (aggressive planning)
Track the Q-value convergence error over real steps. At what value of $n$ do diminishing returns set in? What happens when the learned model is initially inaccurate: does the error from imagined updates compound or self-correct? Deliberately introduce model errors by limiting the model table to only store transitions for certain state-action pairs; observe how partial model coverage affects Dyna performance.
Exercise 2: MPC with CEM on a continuous control task
Implement receding-horizon MPC using CEM as the optimizer for Pendulum-v1 or CartPole-v1 (CartPole's actions are discrete, so sample action sequences from a categorical distribution there rather than a Gaussian). Your implementation should:
- Train a deterministic dynamics model from collected transitions.
- At each step, run CEM for a fixed number of iterations with $N$ candidate sequences and horizon $H$.
- Execute only the first action, then replan.
- Compare with a model-free baseline (SAC or PPO) on sample efficiency: number of real environment steps to reach a target return.
Experiment with several horizon lengths $H$. At what $H$ does compounding error cause plan quality to degrade? Add a terminal value function $V_\phi$ and observe whether shorter horizons become competitive: does $V_\phi$ compensate for model inaccuracy?
Exercise 3: Latent imagination with a simplified Dreamer-style model
Implement a simplified Dreamer-style agent on a visual control task (CarRacing-v3 or a custom pixel environment). The key components:
- Encoder: CNN encodes observation $o_t$ into a latent $z_t$ (no recurrent state for simplicity).
- Dynamics: A learned transition $\hat{z}_{t+1} = f_\theta(z_t, a_t)$ in latent space.
- Decoder: Reconstructs $\hat{o}_t$ from $z_t$ for training supervision.
- Actor/Critic: Trained entirely within imagined latent rollouts of length $H$.
Compare with a model-free baseline that operates directly on encoded observations without imagination. Measure the reduction in real environment steps needed to reach a target score. Track the reconstruction quality (MSE between $o_t$ and $\hat{o}_t$) over training: does improved reconstruction correlate with better control performance, or does the latent representation plateau in reconstruction while planning quality continues to improve?
Extension prompts
- Model-ensemble uncertainty for safe planning: Extend Exercise 2's MPC by replacing the single dynamics model with an ensemble of probabilistic models. Use the ensemble disagreement (variance across predicted next states) as an uncertainty penalty in the CEM cost function. Measure whether ensemble-based uncertainty penalties reduce the frequency of catastrophic planning failures (rollouts that predict high reward but produce low reward when executed).
- MuZero without a known simulator: Study the MuZero pseudocode and identify which components require a simulator for training (the MCTS tree expansion relies on the dynamics function $g_\theta$, which is learned, so what is the simulator's role?). Design a fully offline MuZero variant that learns from a fixed dataset with no environment interaction at all. What new failure mode emerges?
- MBRL for LLM reasoning: Implement a simplified Tree of Thoughts on a reasoning benchmark (e.g., GSM8K). Compare BFS-style tree expansion vs MCTS with the LLM providing both the dynamics (next-reasoning-step prediction) and the value (self-critique scoring). Measure accuracy as a function of the tree budget (number of nodes expanded). Does MCTS outperform BFS for the same compute budget? Connect your findings to the MBRL framework's prediction about test-time compute scaling.
Further reading
- Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. ICML. (The Dyna Architecture).
- Janner, M., et al. (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS. (MBPO).
- Hafner, D., et al. (2019). Dream to Control: Learning Behaviors by Latent Imagination. ICLR. (Dreamer / RSSM).
- Hafner, D., et al. (2023). Mastering Diverse Domains through World Models. arXiv. (DreamerV3 — unifies Dreamer across domains with fixed hyperparameters).
- Schrittwieser, J., et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature. (MuZero).