Week 10: Model-Based Reinforcement Learning and Planning

Grounded In

GenAI: The model-based RL framework provides a useful vocabulary for recent LLM reasoning advances. Chain-of-thought resembles a greedy rollout with no lookahead; Tree of Thoughts is tree search in text space, with the LLM serving as both dynamics model and value estimator (the original formulation uses BFS/DFS, and later variants use MCTS); best-of-N sampling with process reward models resembles a single round of sampling-based MPC. The test-time compute scaling observed in o1-class models is consistent with what the MBRL framework predicts: more planning compute improves strategic reasoning.
Robotics: MPC with learned dynamics is the dominant paradigm for real-robot control under model uncertainty. Receding-horizon replanning is the connection to classical LQR/iLQR from the robotics course; Dreamer's latent-space planning enables visual control from pixels on real manipulation tasks; the compounding error problem is the central constraint on model-based deployment.

Purpose of this lecture#

Every algorithm studied through Week 8 is model-free: it learns a policy or value function directly from environment interaction without representing how the environment works. Model-free methods are powerful but data-hungry — every transition contributes one gradient step, and the environment must be queried millions of times before useful behavior emerges.

Model-based RL breaks this dependence by having the agent learn, or be given, a model of the environment's dynamics. With a model, the agent can imagine consequences of actions without executing them, generating synthetic experience to supplement or replace real interaction. This is the computational analog of mentally simulating a trajectory before moving—and is essential for physical systems where each real interaction has cost or safety consequences.

Model-based RL is not one algorithm but a design space: the model can be learned or known; planning can use it for one step, a short horizon, or to build a tree of possible futures; the model can operate in observation space or a learned latent space. This lecture organizes that design space from simple to complex, anchoring each method in its connection to classical control theory (MPC, studied in the robotics course) and to modern AI systems that plan before acting.

What is a world model?#

A world model approximates the environment's transition and reward functions:

\hat{P}_\phi(s_{t+1} \mid s_t, a_t), \qquad \hat{R}_\phi(s_t, a_t)

Given a world model, the agent can unroll imagined trajectories:

s_0 \xrightarrow{a_0} \hat{s}_1 \xrightarrow{a_1} \hat{s}_2 \xrightarrow{a_2} \cdots

using the model in place of real environment steps. If the model is accurate, imagined trajectories provide the same learning signal as real ones at a fraction of the wall-clock cost. If the model is inaccurate, imagined trajectories mislead the learner — a policy optimized against a flawed model can fail arbitrarily when deployed. Managing the tension between model-leveraged efficiency and model-induced bias is the central design challenge of model-based RL.
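As a minimal sketch of unrolling, assume a hypothetical deterministic model `model_step(s, a)` that returns a predicted next state and reward. The toy linear dynamics, the quadratic reward, and all names below are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def model_step(s, a):
    """Hypothetical learned model: toy linear dynamics and a quadratic reward."""
    s_next = 0.9 * s + 0.1 * a
    reward = -float(np.sum(s_next ** 2))
    return s_next, reward

def imagine(s0, actions, gamma=0.99):
    """Unroll the model from s0 under a fixed action sequence (no real env steps)."""
    s, states, ret = s0, [s0], 0.0
    for k, a in enumerate(actions):
        s, r = model_step(s, a)
        states.append(s)
        ret += (gamma ** k) * r
    return states, ret

# Imagine five steps of doing nothing from a real starting state.
states, ret = imagine(np.array([1.0, -1.0]), [np.zeros(2)] * 5)
```

Every state after `states[0]` is a model prediction, so any error in `model_step` is fed back into the next prediction.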

Deterministic vs probabilistic models#

A deterministic model $\hat{s}_{t+1} = f_\phi(s_t, a_t)$ predicts a single next state, trained via MSE on observed transitions. It is fast and simple but cannot represent uncertainty. Errors are silent: the model produces a confident prediction even in regions where the true dynamics are stochastic or poorly constrained.

A probabilistic model outputs a distribution over next states, typically $\mathcal{N}(\mu_\phi(s_t, a_t),\, \sigma^2_\phi(s_t, a_t))$. The predicted variance $\sigma^2_\phi$ primarily captures aleatoric uncertainty, the inherent stochasticity of the dynamics. It is not by itself a reliable signal of epistemic uncertainty (sparse data), because a single network can be confidently wrong far from its training data. Once uncertainty is estimated, planning can be made uncertainty-aware: action sequences through high-uncertainty regions can be penalized.

Ensemble models train $M$ independent models, deterministic or probabilistic, on different bootstrap resamples of the data. Disagreement across ensemble predictions provides a practical epistemic uncertainty estimate: members agree where data is dense and diverge where it is sparse. PETS and MBPO use small probabilistic ensembles (on the order of $M=5$ Gaussian models) as a sweet spot between expressivity and training cost.
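Ensemble disagreement can be illustrated with a small sketch. The assumptions here are loud ones: least-squares fits on cubic features stand in for neural networks, the data is synthetic, and `disagreement` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):
    """Cubic features so that ensemble members can diverge when extrapolating."""
    return np.stack([x, x ** 2, x ** 3], axis=-1)

# Synthetic transitions: inputs in [-1, 1], target = 2x plus observation noise.
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 0.1 * rng.standard_normal(200)

M = 5
members = []
for _ in range(M):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample
    w, *_ = np.linalg.lstsq(features(X[idx]), y[idx], rcond=None)
    members.append(w)

def disagreement(x):
    """Standard deviation of the members' predictions at a single input x."""
    preds = np.array([features(np.array([x])) @ w for w in members])
    return float(preds.std())

in_dist = disagreement(0.5)    # inside the training range
out_dist = disagreement(5.0)   # far outside the training range
```

In this toy setup the members agree closely inside the data range and spread apart outside it, which is the signal a planner can penalize.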

The compounding error problem#

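Errors compound over time: a one-step prediction error $\epsilon_1$ can produce accumulated error that grows exponentially in the rollout length $H$ for nonlinear dynamics (summarized here as $O(e^{H\epsilon_1})$), as model states drift off the true manifold of valid states and each prediction is made from an already-wrong input. The growth rate is set by how strongly the dynamics amplify perturbations: if the model is $L$-Lipschitz with $L > 1$, the worst-case error after $H$ steps is bounded by $\epsilon_1 (L^H - 1)/(L - 1)$. Even an accurate model with $\epsilon_1 = 0.01$ may yield completely unreliable 50-step rollouts. This is the open-loop prediction problem studied in robust control. Standard mitigations: limit rollout length to the reliable horizon, replan frequently with closed-loop correction, or bootstrap long-horizon returns with a learned value function.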
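To see how fast a small one-step error can grow, here is a toy sketch (an illustrative assumption, not the lecture's derivation): linear dynamics that amplify perturbations by a factor of 1.2 per step, and a model that is off by exactly 0.01 on every step.

```python
def true_step(s):
    """Toy true dynamics: perturbations are amplified by 1.2 per step."""
    return 1.2 * s

def model_step(s):
    """Toy model: identical dynamics plus a constant one-step error of 0.01."""
    return 1.2 * s + 0.01

def rollout_error(H, s0=1.0):
    """Gap between an H-step open-loop model rollout and the true trajectory."""
    s_true, s_model = s0, s0
    for _ in range(H):
        s_true, s_model = true_step(s_true), model_step(s_model)
    return abs(s_model - s_true)

errors = {H: rollout_error(H) for H in (1, 5, 20, 50)}

# Geometric-series bound eps * (L^H - 1) / (L - 1) for L = 1.2, eps = 0.01.
bound_50 = 0.01 * (1.2 ** 50 - 1) / (1.2 - 1)
```

In this linear case the simulated error matches the geometric-series bound, and a 0.01 one-step error grows to several hundred state units by step 50. Real nonlinear systems will differ in detail, but the mechanism is the same.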

Dyna: interleaving real and imagined experience#

The Dyna architecture (Sutton, 1990) is the prototypical integration of model learning and model-free value estimation.

At each real environment step:

1. Execute $a_t$, observe $r_{t+1}, s_{t+1}$; update the model $\hat{P}_\phi$, $\hat{R}_\phi$ with the new transition.
2. Update $Q(s_t, a_t)$ from the real transition (standard TD).
3. Imagined updates ($k$ iterations):
   - Sample a previously visited state $\hat{s}$ and an action $\hat{a}$ previously taken in it.
   - Query the model: $\hat{s}' \sim \hat{P}_\phi(\cdot \mid \hat{s}, \hat{a})$, $\hat{r} = \hat{R}_\phi(\hat{s}, \hat{a})$.
   - Update $Q(\hat{s}, \hat{a})$ from the imagined transition.
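The three steps above can be sketched as tabular Dyna-Q. The six-state chain environment, the hyperparameters, and the function names are assumptions chosen to keep the example self-contained.

```python
import random

N_STATES, GOAL = 6, 5

def env_step(s, a):
    """Deterministic chain: action 1 moves right, action 0 moves left."""
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def dyna_q(k, episodes=30, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    model = {}  # table model: (s, a) -> (r, s2, done) for visited pairs

    def td_update(s, a, r, s2, done):
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action, breaking ties at random.
            if rng.random() < eps or Q[s][0] == Q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2, r, done = env_step(s, a)
            model[(s, a)] = (r, s2, done)     # step 1: update the model
            td_update(s, a, r, s2, done)      # step 2: real TD update
            for _ in range(k):                # step 3: k imagined updates
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                td_update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q

Q_plan = dyna_q(k=20)   # with planning
Q_none = dyna_q(k=0)    # plain Q-learning
```

Setting `k=0` recovers ordinary Q-learning, so the same function can be used to compare how many real steps each setting needs.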

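The $k$ imagined Q-updates produce additional value updates without additional real-world interaction. Imagined transitions add no new information about the environment; they propagate what the model already knows through the value function faster. With an accurate model, Dyna with $k = 100$ performs 100 extra updates per real step and can sharply reduce the real experience needed to converge, at the cost of $k$ model evaluations per real step. With an inaccurate model, the same updates propagate the model's errors into $Q$.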

The conceptual contribution of Dyna is establishing that model-based and model-free learning are complementary, not competing: the model improves value estimates between real steps, and better value estimates guide smarter real-world data collection. This "imagination loop" — generate imagined experience, learn from it, act to collect better real data — recurs throughout model-based RL and in modern AI systems that deliberate before acting.

Model-Based Policy Optimization (MBPO)#

MBPO (Janner et al., 2019) is the modern deep-learning incarnation of Dyna. Short model rollouts of length $H$ are generated starting from states in the real replay buffer; imagined transitions are added to a separate model buffer; and a SAC agent trains on the combined real + imagined data. Empirically, $H = 1$ to $5$ works well across MuJoCo tasks, achieving near-state-of-the-art sample efficiency in under 500k real environment steps — compared to 2–3 million required by SAC alone.
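The branched-rollout idea can be sketched as follows. This is not the authors' implementation: the toy linear `model_step`, the noisy proportional `policy` standing in for SAC, and the buffer layout are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_step(s, a):
    """Hypothetical learned model: toy linear dynamics, quadratic cost."""
    s2 = 0.95 * s + 0.1 * a
    return s2, -float(s2 @ s2)

def policy(s):
    """Stand-in for the current policy: noisy proportional controller."""
    return -0.5 * s + 0.1 * rng.standard_normal(s.shape)

def branched_rollouts(real_states, n_starts, H):
    """Start short model rollouts of length H from sampled real states."""
    model_buffer = []
    starts = real_states[rng.integers(0, len(real_states), size=n_starts)]
    for s in starts:
        for _ in range(H):
            a = policy(s)
            s2, r = model_step(s, a)
            model_buffer.append((s, a, r, s2))
            s = s2
    return model_buffer

real_states = rng.standard_normal((100, 3))   # pretend these came from the env
model_buffer = branched_rollouts(real_states, n_starts=10, H=3)
```

Because every rollout begins at a real state and runs for only `H` steps, model error has at most `H` steps in which to compound.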

Model Predictive Control#

Model Predictive Control (MPC) is a classical control algorithm that fits cleanly into the model-based RL framework. Students of the robotics course have seen MPC for known dynamics; here it is placed in the context of learned models.

The receding horizon principle#

At each time step $t$ , MPC solves:

\mathbf{a}^*_{0:H-1} = \arg\max_{\mathbf{a}_{0:H-1}} \sum_{k=0}^{H-1} \gamma^k \hat{R}(\hat{s}_k, a_k)

subject to $\hat{s}_{k+1} = \hat{P}_\phi(\hat{s}_k, a_k)$ , $\hat{s}_0 = s_t$ .

Only the first action $a^*_0$ is executed. At the next step, the agent re-observes the true state $s_{t+1}$ and replans over a new horizon. This is the receding horizon structure.

Why only the first action? A learned model is only an approximation; executing the full $H$ -step plan propagates errors without correction. Replanning at every step with the new true observation implements closed-loop feedback on the model, continuously correcting for model error and unexpected perturbations (disturbances in robot control). This is why MPC is robust to model mismatch — it never commits to more than one step at a time.
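A minimal sketch of the receding-horizon loop, under stated assumptions: a 1-D toy environment with a drift term the model does not know about, and a simple random-shooting planner (a hypothetical `plan` function) in place of a more refined optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, a):
    """Real environment: includes a drift the model does not capture."""
    return s + 0.1 * a + 0.02

def model_step(s, a):
    """Imperfect model: same dynamics without the drift."""
    return s + 0.1 * a

def plan(s, H=10, n_candidates=256):
    """Random shooting: return the best of n random action sequences."""
    seqs = rng.uniform(-1.0, 1.0, size=(n_candidates, H))
    s_hat = np.full(n_candidates, s)
    returns = np.zeros(n_candidates)
    for k in range(H):
        s_hat = model_step(s_hat, seqs[:, k])
        returns += -(s_hat ** 2)            # reward: stay near the origin
    return seqs[np.argmax(returns)]

s = 1.0
for t in range(50):
    a_seq = plan(s)                 # replan from the re-observed true state
    s = true_step(s, a_seq[0])      # execute only the first action
final_distance = abs(s)
```

The model never learns about the drift, yet the loop keeps the state near the origin because each replanning step starts from the true state rather than from the model's own prediction.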

Optimizers within MPC#

In continuous action spaces, gradient-based optimization through differentiable models is possible. For non-differentiable models or non-smooth rewards, sampling-based optimizers are standard:

Cross-Entropy Method (CEM): sample $N$ action sequences $\{\mathbf{a}^{(n)}_{0:H-1}\}_{n=1}^N$ from a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ over action sequences, evaluate the total return $R(\mathbf{a}^{(n)})$ for each using model rollouts, select the top- $K$ sequences (the "elite" set $\mathcal{E}$ ), and refit the distribution:

\mu \leftarrow \frac{1}{K}\sum_{n \in \mathcal{E}} \mathbf{a}^{(n)}, \qquad \Sigma \leftarrow \frac{1}{K}\sum_{n \in \mathcal{E}} (\mathbf{a}^{(n)} - \mu)(\mathbf{a}^{(n)} - \mu)^\top

Iterating this refit-sample cycle concentrates the sampling distribution on high-return regions. CEM is simple, parallelizable, and effective for horizons up to $H \approx 50$ in continuous control, though it can collapse prematurely to a local optimum if the elite fraction $K/N$ is too small.
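The sample, select, refit cycle can be sketched as below. One simplification is assumed: a diagonal Gaussian (per-dimension standard deviation) instead of the full covariance $\Sigma$, which is a common practical choice. The quadratic toy objective stands in for a model-rollout return.

```python
import numpy as np

def cem(objective, dim, n_iters=10, N=200, K=20, seed=0):
    """Cross-entropy method with a diagonal Gaussian sampling distribution."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(n_iters):
        samples = mu + sigma * rng.standard_normal((N, dim))
        scores = np.array([objective(x) for x in samples])
        elite = samples[np.argsort(scores)[-K:]]        # top-K by return
        mu = elite.mean(axis=0)                         # refit the mean
        sigma = elite.std(axis=0) + 1e-6                # refit the spread
    return mu

# Toy "return" peaked at a known action sequence (purely illustrative).
target = np.array([0.5, -0.3, 0.8, 0.0])
best = cem(lambda x: -np.sum((x - target) ** 2), dim=4)
```

In MPC, `objective` would roll an action sequence through the learned model and return its predicted return, and `mu[0]` (the first action of the refined mean sequence) would be executed.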

MPPI (Model Predictive Path Integral): computes a gradient-free update by weighting sampled action trajectories according to their exponentiated negative cost. For a control problem with running cost $c(s_t, a_t)$, MPPI samples $N$ noise sequences $\boldsymbol{\epsilon}^{(n)} \sim \mathcal{N}(0, \Sigma)$, rolls the perturbed actions $a_\tau^{(n)} = a_\tau + \epsilon_\tau^{(n)}$ through the model to obtain states $\hat{s}_\tau^{(n)}$, and computes the update:

a_t \leftarrow a_t + \frac{\sum_{n=1}^N \exp\!\left(-\frac{1}{\lambda}\sum_{\tau=t}^{t+H-1} c(\hat{s}_\tau^{(n)}, a_\tau^{(n)})\right) \cdot \epsilon_t^{(n)}}{\sum_{n=1}^N \exp\!\left(-\frac{1}{\lambda}\sum_{\tau=t}^{t+H-1} c(\hat{s}_\tau^{(n)}, a_\tau^{(n)})\right)}

where $\lambda > 0$ is the temperature controlling the soft-maximum. MPPI solves a KL-regularized version of the stochastic optimal control problem: it seeks the control distribution closest to the previous one that achieves lower expected cost. The exponential weighting ensures that trajectories with much lower cost contribute exponentially more to the update. MPPI is widely used in autonomous driving (steering + throttle control under nonlinear vehicle dynamics) and legged locomotion (real-time replanning for foot placement on uneven terrain).
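A single MPPI update can be sketched as follows, assuming isotropic noise with standard deviation `sigma` and a toy 1-D model inside `rollout_cost`; both are illustrative choices rather than anything prescribed by the lecture.

```python
import numpy as np

def mppi_update(a_nominal, rollout_cost, n_samples=512, sigma=0.5, lam=1.0, seed=0):
    """One MPPI update of a nominal action sequence of shape (H,)."""
    rng = np.random.default_rng(seed)
    eps = sigma * rng.standard_normal((n_samples, a_nominal.shape[0]))
    costs = np.array([rollout_cost(a_nominal + e) for e in eps])
    # Subtracting the minimum cost does not change the normalized weights
    # but keeps the exponentials numerically well behaved.
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    return a_nominal + w @ eps      # cost-weighted average of the noise

def rollout_cost(actions, s0=1.0):
    """Toy model rollout: s' = s + 0.1 a with running cost s^2 + 0.01 a^2."""
    s, total = s0, 0.0
    for a in actions:
        s = s + 0.1 * a
        total += s ** 2 + 0.01 * a ** 2
    return total

a0 = np.zeros(10)
a1 = mppi_update(a0, rollout_cost)
```

Lowering `lam` makes the weights concentrate on the few cheapest samples; raising it moves the update toward a plain average of the noise.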

Terminal value functions#

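Finite-horizon MPC considers only returns over horizon $H$, so it is blind to consequences beyond the horizon and can act myopically. The standard solution adds a terminal value function $V_\psi(\hat{s}_H)$, estimated by a learned critic, at the end of the planned trajectory: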

\max_{\mathbf{a}_{0:H-1}} \sum_{k=0}^{H-1} \gamma^k \hat{R}(\hat{s}_k, a_k) + \gamma^H V_\psi(\hat{s}_H)

This is the model-based analog of an $n$-step TD return (the building block that TD($\lambda$) averages over): the model handles short-horizon consequences, where it is most accurate, while the value function summarizes the long-horizon return. Small $H$ leans on $V_\psi$ (biased if the critic is inaccurate); large $H$ leans on the model (biased if model error compounds). The best $H$ balances model accuracy against value function accuracy.
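As a worked example of the objective with a terminal value, consider a constant reward of 1 per step, for which the infinite-horizon return is $1/(1-\gamma)$. The helper name `plan_value` is hypothetical.

```python
import numpy as np

def plan_value(rewards, terminal_value, gamma=0.99):
    """Discounted model return over H steps plus a bootstrapped tail."""
    H = len(rewards)
    discounts = gamma ** np.arange(H)
    return float(discounts @ np.asarray(rewards) + gamma ** H * terminal_value)

gamma = 0.9
true_value = 1.0 / (1.0 - gamma)          # infinite-horizon return of reward 1

# H = 5 with no terminal value: the planner sees only part of the return.
truncated = plan_value([1.0] * 5, terminal_value=0.0, gamma=gamma)

# H = 5 with an exact critic: the full return is recovered.
bootstrapped = plan_value([1.0] * 5, terminal_value=true_value, gamma=gamma)
```

With a perfect critic the 5-step plan recovers the full return; any critic error enters scaled by $\gamma^H$, which is why longer horizons reduce reliance on $V_\psi$.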

Connection to the robotics course#

The model-based RL design space is the generalization of classical control methods:

| Setting | Algorithm |
|---|---|
| Known linear dynamics, quadratic cost | LQR / DARE (Bellman in closed form) |
| Known nonlinear dynamics | iLQR, Differential Dynamic Programming |
| Unknown dynamics, learned model | MBRL with MPC |
| Unknown dynamics, model-free | SAC, PPO |

The progression from LQR to MBRL is from known to unknown, analytical to data-driven, and exact to approximate. All are solving the same problem — optimal sequential decision-making under dynamics constraints — with different information assumptions.

Latent-space world models: Dreamer and RSSM#

Operating models in raw observation space forces prediction of task-irrelevant features (background textures, uninformative pixels) and amplifies prediction error. A more principled approach learns a compact latent state retaining only task-relevant structure.

Recurrent State Space Model (RSSM)#

The RSSM (Hafner et al., 2019), introduced with PlaNet and used as the world model in Dreamer, combines a deterministic recurrent state with a stochastic latent state:

h_t = f_\phi(h_{t-1},\, z_{t-1},\, a_{t-1}) \quad \text{(GRU; deterministic history)}

z_t \sim q_\phi(z_t \mid h_t, o_t) \quad \text{(posterior; uses real observation)}

\hat{z}_t \sim p_\phi(\hat{z}_t \mid h_t) \quad \text{(prior; for model rollouts)}

The model is trained with a reconstruction loss, a reward prediction loss, and a KL penalty aligning prior and posterior:

\mathcal{L} = \mathbb{E}\!\left[-\log p(o_t \mid h_t, z_t) - \log p(r_t \mid h_t, z_t) + \beta\, D_{\text{KL}}\big(q_\phi(z_t \mid h_t, o_t) \,\|\, p_\phi(z_t \mid h_t)\big)\right]

Planning happens entirely in latent space: RSSM unrolls imagined trajectories using $(h_t, \hat{z}_t)$ transitions without decoding to observation space — the decoder is only needed for training supervision. Dreamer trains an actor and critic entirely within imagined rollouts using backpropagation through the latent model (analytic policy gradients), achieving state-of-the-art sample efficiency on visual continuous control from pixels (robot manipulation from camera images).
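The KL term that aligns prior and posterior has a closed form when both are diagonal Gaussians. That distributional choice is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions."""
    var_q, var_p = std_q ** 2, std_p ** 2
    per_dim = (np.log(std_p / std_q)
               + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
               - 0.5)
    return float(np.sum(per_dim))

# Identical prior and posterior: the prior already predicts the latent.
kl_same = kl_diag_gaussians(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3))

# Posterior mean shifted by 1 in a single unit-variance dimension.
kl_shift = kl_diag_gaussians(np.array([1.0]), np.ones(1), np.zeros(1), np.ones(1))
```

A small KL means the prior $p_\phi(\hat{z}_t \mid h_t)$ can stand in for the posterior during imagined rollouts, when no observation is available.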

MuZero and planning in learned abstract spaces#

MuZero (Schrittwieser et al., 2020) represents a conceptually distinct approach: learn a model whose sole purpose is to support Monte Carlo Tree Search (MCTS) planning, without any obligation to decode back to observations.

MuZero's three learned functions:

| Function | Input → Output | Role |
|---|---|---|
| Representation $h_\theta$ | Observation history → $z_t$ | Encode real observation |
| Dynamics $g_\theta$ | $(z_t, a_t)$ → $(z_{t+1}, \hat{r}_t)$ | Latent transitions + reward |
| Prediction $f_\theta$ | $z_t$ → $(\hat{\pi}_t, \hat{v}_t)$ | Policy and value from latent |

At each real step, MCTS is run entirely within the latent space generated by repeated application of $g_\theta$ . The tree is expanded using $\hat{\pi}$ to guide search and $\hat{v}$ to estimate leaf returns. The resulting action visit counts form the improved policy $\pi_\text{MCTS}$ , which is both executed and used as a training target for $f_\theta$ .
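Turning root visit counts into a policy can be sketched as below. The temperature parameter follows the common practice of exponentiating counts by $1/T$ before normalizing; the function name and the example counts are illustrative.

```python
import numpy as np

def visit_count_policy(visit_counts, temperature=1.0):
    """Normalize MCTS root visit counts into a probability distribution."""
    counts = np.asarray(visit_counts, dtype=float)
    scaled = counts ** (1.0 / temperature)
    return scaled / scaled.sum()

pi = visit_count_policy([10, 30, 60])                      # proportional to counts
sharper = visit_count_policy([10, 30, 60], temperature=0.5)  # more greedy
```

Lower temperatures concentrate probability on the most-visited action, trading exploration for exploitation when acting.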

The training signal for $g_\theta$ comes from unrolling the model along real trajectories and matching its predicted rewards, values, and policies to the observed rewards, bootstrapped returns, and MCTS visit distributions; there is no observation reconstruction loss. MuZero's latent representation may bear no resemblance to the physical state; it only needs to support accurate reward, value, and policy prediction for MCTS. This suggests that observation-space accuracy is neither necessary nor sufficient for effective planning: a model with low reconstruction error but poor value prediction is of little use for RL, while a model with accurate value prediction on abstract latents can support excellent planning regardless of reconstruction quality.
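The way the three functions compose during an unroll can be sketched with toy stand-ins. Random linear maps replace trained networks here, and the dimensions and names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, N_ACTIONS = 8, 4, 2

# Random weights stand in for trained networks (illustrative only).
W_h = 0.1 * rng.standard_normal((LATENT_DIM, OBS_DIM))
W_g = 0.1 * rng.standard_normal((LATENT_DIM, LATENT_DIM + N_ACTIONS))
w_r = 0.1 * rng.standard_normal(LATENT_DIM + N_ACTIONS)
W_pi = 0.1 * rng.standard_normal((N_ACTIONS, LATENT_DIM))
w_v = 0.1 * rng.standard_normal(LATENT_DIM)

def represent(obs):
    """h: observation -> latent state."""
    return np.tanh(W_h @ obs)

def dynamics(z, a):
    """g: (latent, action) -> (next latent, predicted reward)."""
    x = np.concatenate([z, np.eye(N_ACTIONS)[a]])
    return np.tanh(W_g @ x), float(w_r @ x)

def predict(z):
    """f: latent -> (policy probabilities, value)."""
    logits = W_pi @ z
    p = np.exp(logits - logits.max())
    return p / p.sum(), float(w_v @ z)

def unroll(obs, actions):
    """Encode once, then apply g repeatedly; observations are never predicted."""
    z = represent(obs)
    outputs = []
    for a in actions:
        policy, value = predict(z)
        z, reward = dynamics(z, a)
        outputs.append((policy, value, reward))
    return outputs

outputs = unroll(rng.standard_normal(OBS_DIM), actions=[0, 1, 1])
```

In training, each `(policy, value, reward)` triple would be regressed toward the MCTS visit distribution, a bootstrapped return, and the observed reward at the matching real time step.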

Synthesis: the model-based RL design space#

| Algorithm | Model space | Rollout | Planner | Strength |
|---|---|---|---|---|
| Dyna / MBPO | Observation | 1–5 steps | Q-learning / SAC | Sample efficiency, simplicity |
| MPC + CEM | Observation | 10–50 steps | CEM sampling | Constraints, no learned value fn |
| MPC + terminal $V$ | Observation | 5–20 steps | CEM + bootstrap | Balances model and value bias |
| Dreamer (RSSM) | Learned latent | Short imagined horizon (e.g., 15 steps) + value bootstrap | Analytic actor gradient | Visual control, sample efficiency |
| MuZero | Abstract latent | MCTS tree | MCTS | Discrete games, combinatorial tasks |

All model-based methods share the fundamental tradeoff: sample efficiency vs model accuracy demands. Model-free methods require many real transitions; model-based methods require fewer but demand accurate models — requiring a different investment (architecture, training stability, validation). The right choice depends on whether real-world interaction is the bottleneck (favor model-based) or whether model learning itself is difficult or unsafe (favor model-free or conservative model-based with uncertainty penalties).

GenAI context: planning in language models#

The model-based RL perspective provides precise language for recent LLM developments. An LLM generating a chain-of-thought runs a one-step greedy rollout: predict the next token conditioned on all previous tokens, commit. This is the model-free degenerate case — prediction without planning.

Tree of Thoughts (ToT) introduces genuine planning: the LLM is simultaneously the dynamics model (predicting consequences of reasoning steps) and the value/policy network (evaluating which branch is promising). A tree search is run in reasoning-step space (BFS or DFS in the original formulation, MCTS in later variants), analogous to MuZero with the LLM in the model role. The state is the text context; actions are reasoning steps; value estimates come from self-critique.

Best-of-N sampling with process reward models (PRMs) resembles a single round of sampling-based MPC (random shooting, or CEM with one iteration and an elite set of size one): generate many candidate reasoning trajectories, score them with a learned reward model, and select the best. The MBRL framework thereby anticipates the effectiveness of test-time compute scaling: providing more computation to plan (more search iterations, more candidates, deeper trees) improves performance on tasks requiring strategic reasoning, in line with the empirical findings of recent scaling studies.
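Best-of-N selection can be sketched with toy stand-ins. The `generate` and `score` callables below are hypothetical placeholders for an LLM sampler and a reward model; here they just propose numbers and prefer answers near 42.

```python
import random

def best_of_n(generate, score, n, seed=0):
    """Sample n candidates and return the one the scorer rates highest."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

def generate(rng):
    """Hypothetical generator: proposes a numeric answer."""
    return rng.uniform(0, 100)

def score(x):
    """Hypothetical reward model: prefers answers close to 42."""
    return -abs(x - 42)

# Error of the selected answer as the sampling budget grows.
errors = {n: abs(best_of_n(generate, score, n) - 42) for n in (1, 4, 16, 64)}
```

Because the seed is fixed, each larger budget contains the smaller budget's candidates, so the selected answer can only stay the same or improve as `n` grows. With a real, imperfect reward model that guarantee does not hold, which is the best-of-N counterpart of planning against a flawed model.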

Key takeaways#

- World models approximate environment dynamics and enable imagined trajectories that substitute for real interaction.
- The fundamental tradeoff is model accuracy against planning efficiency; the compounding error problem limits reliable rollout length.
- Dyna established the principle of interleaving real and imagined experience, with MBPO scaling it to deep networks via short-horizon probabilistic rollouts.
- MPC implements receding-horizon planning with online re-optimization; its connection to LQR and iLQR makes it the bridge between classical control and RL.
- The terminal value function resolves the short-horizon limitation by bootstrapping long-horizon returns, creating a precise model-vs-value bias-variance tradeoff.
- Dreamer demonstrates that planning in learned latent space is more efficient than in observation space.
- MuZero makes the strongest claim: a model that supports MCTS on abstract latents can outperform observation-space models without any reconstruction objective.
- Planning is a first-class AI capability, and the MBRL framework provides the theoretical tools to reason about when and how to use it.

Conceptual questions#

1. The compounding error problem states that one-step model error $\epsilon_1$ can produce rollout error growing as $O(e^{H\epsilon_1})$ over $H$ steps. Explain mechanistically why errors compound exponentially rather than linearly for a nonlinear model, using the concept of model state diverging from the true manifold of valid states. How does the terminal value function strategy mitigate, but not eliminate, this problem? What property must $V_\psi$ have for the mitigation to be effective?
2. MBPO uses short rollouts ($H = 1$–$5$) starting from real replay buffer states rather than full rollouts from the initial state distribution. Explain why this reduces compounding error. What does starting rollouts from the replay buffer assume about the on-policy vs replay state distributions, and when would this assumption fail?
3. MuZero trains its dynamics model to support MCTS, not to reconstruct observations. Compare MuZero's latent $z_t = h(o_{1:t})$ with Dreamer's $z_t \sim q(z_t \mid h_t, o_t)$. For each, identify: (a) what training signal shapes the latent representation, (b) whether the latent must correspond to a physically interpretable quantity, and (c) what information would be lost if the decoder objective were removed. Under what task conditions would MuZero's representation be preferred?
4. You are designing model-based RL for a robotic arm where rigid-body dynamics are known analytically but contact dynamics are highly uncertain. Describe a hybrid strategy using an analytical model for rigid-body components and a learned residual for contact dynamics. How would you modify MPC's receding-horizon optimization to incorporate both? What must the planning horizon $H$ satisfy relative to the timescale of contact events?
5. Map Tree of Thoughts onto the MuZero architecture by identifying what corresponds to: the representation function $h$, dynamics function $g$, prediction function $f$, latent state $z_t$, and MCTS search. Identify at least one structural difference that makes ToT less principled from a planning theory perspective. How would you modify ToT to address this?

Solutions

1. Compounding error. A one-step error nudges the predicted state slightly off the manifold of valid states; for nonlinear dynamics the next prediction is then made from an already-wrong state where the model is even less accurate, so errors feed back multiplicatively and grow like $e^{H\epsilon_1}$ rather than additively. A terminal value function $V_\psi$ caps the rollout at $H$ short steps and bootstraps the remainder, so error accumulates only over $H$, but only if $V_\psi$ is accurate on the reached states; it does not remove the within-horizon compounding.
2. MBPO short rollouts. Starting $H{=}1$–$5$ rollouts from real replay states keeps model queries near states it has actually seen, bounding how far error compounds. It assumes the replay (largely off-policy) state distribution matches the current on-policy distribution; this fails once the policy has shifted substantially, so rollouts begin from states the current policy rarely visits, reintroducing distribution mismatch.
3. MuZero vs Dreamer latents. MuZero's $z=h(o_{1:t})$ is shaped purely by reward/value/policy-prediction losses, so (a) the training signal is value-equivalence, (b) the latent need not be physically interpretable, and (c) removing a decoder loses nothing because it never reconstructs observations and keeps only task-relevant information. Dreamer's $z\sim q(z|h_t,o_t)$ is shaped by a reconstruction (ELBO) objective, must encode enough to rebuild observations, and would lose its main representation signal without the decoder. MuZero's representation is preferred when observations carry lots of task-irrelevant detail.
4. Hybrid model. Use the known rigid-body model plus a learned contact residual, $\hat s' = f_\text{rigid}(s,a) + f_\text{res}(s,a)$, and run MPC's receding-horizon optimization over the combined model. Two timescales matter: the model step $\Delta t$ must be fine enough to resolve contact transients (otherwise contact dynamics are aliased and missed), and the horizon $H\Delta t$ must be long enough to span an upcoming contact event so the planner can anticipate it, yet short enough that residual-model error does not compound across it.
5. Tree of Thoughts as MuZero. Representation $h$ ≈ encoding the prompt into a reasoning state; dynamics $g$ ≈ the LLM generating the next thought; prediction $f$ ≈ the heuristic value/score of a thought; latent $z_t$ ≈ the current partial reasoning state; MCTS ≈ ToT's tree search. The key gap: ToT's dynamics and value are uncalibrated LLM self-evaluations with no learned value-equivalence or grounding, so the search optimizes a noisy proxy. Fix by training a learned verifier/value model (and grounding steps in tool feedback) so the search targets a calibrated objective.

Implementation exercises#

Exercise 1: Dyna on a grid world#

Implement a Dyna-Q agent on a deterministic grid world (e.g., FrozenLake-v1 with `is_slippery=False`, since the default environment is stochastic, or a custom maze). Compare the number of real environment steps required to converge as a function of planning steps $k$:

- $k = 0$ (standard Q-learning, no imagined updates)
- $k = 5$ (moderate planning)
- $k = 50$ (aggressive planning)

Track the Q-value convergence error $\|Q_t - Q^*\|_\infty$ over real steps. At what value of $k$ do diminishing returns set in? What happens when the learned model is initially inaccurate — does the error from imagined updates compound or self-correct? Deliberately introduce model errors by limiting the model table to only store transitions for certain state-action pairs; observe how partial model coverage affects Dyna performance.

Exercise 2: MPC with CEM on a continuous control task#

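Implement receding-horizon MPC using CEM as the optimizer for Pendulum-v1 (continuous actions) or CartPole-v1 (discrete actions, so the Gaussian sampling distribution must be replaced by a categorical one over action sequences). Your implementation should: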

1. Train a deterministic dynamics model $\hat{s}_{t+1} = f_\phi(s_t, a_t)$ from collected transitions.
2. At each step, run CEM for $I$ iterations with $N$ candidate sequences, horizon $H$.
3. Execute only the first action, then replan.
4. Compare with a model-free baseline (SAC or PPO) on sample efficiency: number of real environment steps to reach a target return.

Experiment with horizon lengths $H \in \{5, 10, 20, 50\}$ . At what $H$ does compounding error cause plan quality to degrade? Add a terminal value function $V_\psi$ and observe whether shorter horizons become competitive — does $V_\psi$ compensate for model inaccuracy?

Exercise 3: Latent imagination with a simplified Dreamer-style model#

Implement a simplified Dreamer-style agent on a visual control task (CarRacing-v3 or a custom pixel environment). The key components:

- Encoder: CNN encodes observation $o_t \to z_t$ (no recurrent state for simplicity).
- Dynamics: A learned transition $\hat{z}_{t+1} = g_\phi(z_t, a_t)$ in latent space.
- Decoder: Reconstructs $\hat{o}_t$ from $z_t$ for training supervision.
- Actor/Critic: Trained entirely within imagined latent rollouts of length $H$.

Compare with a model-free baseline that operates directly on encoded observations without imagination. Measure the reduction in real environment steps needed to reach a target score. Track the reconstruction quality (MSE between $o_t$ and $\hat{o}_t$ ) over training — does improved reconstruction correlate with better control performance, or does the latent representation plateau in reconstruction while planning quality continues to improve?

Extension prompts#

- Model-ensemble uncertainty for safe planning: Extend Exercise 2's MPC by replacing the single dynamics model with an ensemble of $M = 5$ probabilistic models. Use the ensemble disagreement (variance across predicted next states) as an uncertainty penalty in the CEM cost function. Measure whether ensemble-based uncertainty penalties reduce the frequency of catastrophic planning failures (rollouts that predict high reward but produce low reward when executed).
- MuZero without a known simulator: Study the MuZero pseudocode and identify which components require a simulator for training (the MCTS tree expansion relies on the dynamics function $g_\theta$, which is learned, so what is the simulator's role?). Design a fully offline MuZero variant that learns from a fixed dataset with no environment interaction at all. What new failure mode emerges?
- MBRL for LLM reasoning: Implement a simplified Tree of Thoughts on a reasoning benchmark (e.g., GSM8K). Compare BFS-style tree expansion vs MCTS with the LLM providing both the dynamics (next-reasoning-step prediction) and the value (self-critique scoring). Measure accuracy as a function of the tree budget (number of nodes expanded). Does MCTS outperform BFS for the same compute budget? Connect your findings to the MBRL framework's prediction about test-time compute scaling.

Purpose of this lecture#

What is a world model?#

A world model approximates the environment's transition and reward functions:

\hat{P}_\phi(s_{t+1} \mid s_t, a_t), \qquad \hat{R}_\phi(s_t, a_t)

Given a world model, the agent can unroll imagined trajectories:

s_0 \xrightarrow{a_0} \hat{s}_1 \xrightarrow{a_1} \hat{s}_2 \xrightarrow{a_2} \cdots

Deterministic vs probabilistic models#

The compounding error problem#

Dyna: interleaving real and imagined experience#

The Dyna architecture (Sutton, 1990) is the prototypical integration of model learning and model-free value estimation.

At each real environment step:

Execute $a_t$ , observe $r_{t+1}, s_{t+1}$ ; update model $\hat{P}_\phi$ .
Update $Q(s_t, a_t)$ from the real transition (standard TD).
Imagined rollouts ( $k$ $k$ iterations):
- Sample a random previously visited state $\hat{s}$ and action $\hat{a}$ .
- Query model: $\hat{s}' = \hat{P}_\phi(\hat{s}, \hat{a})$ , $\hat{r} = \hat{R}_\phi(\hat{s}, \hat{a})$ .
- Update $Q(\hat{s}, \hat{a})$ from the imagined transition.

Model-Based Policy Optimization (MBPO)#

Model Predictive Control#

The receding horizon principle#

At each time step $t$ , MPC solves:

\mathbf{a}^*_{0:H-1} = \arg\max_{\mathbf{a}_{0:H-1}} \sum_{k=0}^{H-1} \gamma^k \hat{R}(\hat{s}_k, a_k)

subject to $\hat{s}_{k+1} = \hat{P}_\phi(\hat{s}_k, a_k)$ , $\hat{s}_0 = s_t$ .

Only the first action $a^*_0$ is executed. At the next step, the agent re-observes the true state $s_{t+1}$ and replans over a new horizon. This is the receding horizon structure.

Optimizers within MPC#

In continuous action spaces, gradient-based optimization through differentiable models is possible. For non-differentiable models or non-smooth rewards, sampling-based optimizers are standard:

\mu \leftarrow \frac{1}{K}\sum_{n \in \mathcal{E}} \mathbf{a}^{(n)}, \qquad \Sigma \leftarrow \frac{1}{K}\sum_{n \in \mathcal{E}} (\mathbf{a}^{(n)} - \mu)(\mathbf{a}^{(n)} - \mu)^\top

a_t \leftarrow a_t + \frac{\sum_{n=1}^N \exp\!\left(-\frac{1}{\lambda}\sum_{\tau=t}^{t+H-1} c(\hat{s}_\tau, a_\tau^{(n)})\right) \cdot \epsilon_t^{(n)}}{\sum_{n=1}^N \exp\!\left(-\frac{1}{\lambda}\sum_{\tau} c(\hat{s}_\tau, a_\tau^{(n)})\right)}

Terminal value functions#

Finite-horizon MPC considers only returns over horizon $H$ . The standard solution adds a terminal value function $V_\psi(s_H)$ estimated by a learned critic:

\max_{\mathbf{a}_{0:H-1}} \sum_{k=0}^{H-1} \gamma^k \hat{R}(\hat{s}_k, a_k) + \gamma^H V_\psi(\hat{s}_H)

Connection to the robotics course#

The model-based RL design space is the generalization of classical control methods:

Latent-space world models: Dreamer and RSSM#

Recurrent State Space Model (RSSM)#

The RSSM (Hafner et al., 2019, Dreamer) combines a deterministic recurrent state with a stochastic latent state:

h_t = f_\phi(h_{t-1},\, z_{t-1},\, a_{t-1}) \quad \text{(GRU; deterministic history)}

z_t \sim q_\phi(z_t \mid h_t, o_t) \quad \text{(posterior; uses real observation)}

\hat{z}_t \sim p_\phi(\hat{z}_t \mid h_t) \quad \text{(prior; for model rollouts)}

The model is trained with a reconstruction loss, a reward prediction loss, and a KL penalty aligning prior and posterior:

\mathcal{L} = \mathbb{E}\!\left[-\log p(\hat{o}_t \mid h_t, z_t) - \log p(\hat{r}_t \mid h_t, z_t) + \beta\, D_{\text{KL}}(q_\phi \| p_\phi)\right]

MuZero and planning in learned abstract spaces#

MuZero's three learned functions:

Synthesis: the model-based RL design space#

GenAI context: planning in language models#

Key takeaways#

Conceptual questions#

The compounding error problem states that one-step model error $\epsilon_1$ can produce rollout error growing as $O(e^{H\epsilon_1})$ over $H$ steps. Explain mechanistically why errors compound exponentially rather than linearly for a nonlinear model, using the concept of model state diverging from the true manifold of valid states. How does the terminal value function strategy mitigate — but not eliminate — this problem? What property must $V_\psi$ have for the mitigation to be effective?
MBPO uses short rollouts ( $H = 1$ – $5$ ) starting from real replay buffer states rather than full rollouts from the initial state distribution. Explain why this reduces compounding error. What does starting rollouts from the replay buffer assume about the on-policy vs replay state distributions, and when would this assumption fail?
MuZero trains its dynamics model to support MCTS, not to reconstruct observations. Compare MuZero's latent $z_t = h(o_{1:t})$ with Dreamer's $z_t \sim q(z_t \mid h_t, o_t)$ . For each, identify: (a) what training signal shapes the latent representation, (b) whether the latent must correspond to a physically interpretable quantity, and (c) what information would be lost if the decoder objective were removed. Under what task conditions would MuZero's representation be preferred?
You are designing model-based RL for a robotic arm where rigid-body dynamics are known analytically but contact dynamics are highly uncertain. Describe a hybrid strategy using an analytical model for rigid-body components and a learned residual for contact dynamics. How would you modify MPC's receding-horizon optimization to incorporate both? What must the planning horizon $H$ satisfy relative to the timescale of contact events?
Map Tree of Thoughts onto the MuZero architecture by identifying what corresponds to: the representation function $h$ , dynamics function $g$ , prediction function $f$ , latent state $z_t$ , and MCTS search. Identify at least one structural difference that makes ToT less principled from a planning theory perspective. How would you modify ToT to address this?

Solutions

Compounding error. A one-step error nudges the predicted state slightly off the manifold of valid states; for nonlinear dynamics the next prediction is then made from an already-wrong state where the model is even less accurate, so errors feed back multiplicatively and grow like $e^{H\epsilon_1}$ rather than additively. A terminal value function $V_\psi$ truncates the rollout after a short horizon $H$ and bootstraps the remaining return, so model error accumulates only over those $H$ steps. The mitigation works only if $V_\psi$ is accurate on the states the model rollout actually reaches, and it does not remove the compounding within the horizon.
MBPO short rollouts. Starting rollouts of length $H = 1$ to $5$ from real replay states keeps model queries near states the model was trained on, which bounds how far error can compound. It assumes the replay (largely off-policy) state distribution is close to the current on-policy state distribution; this fails once the policy has shifted substantially, because rollouts then begin from states the current policy rarely visits, reintroducing distribution mismatch.
MuZero vs Dreamer latents. MuZero's $z_t = h(o_{1:t})$ is shaped purely by reward, value, and policy prediction losses, so (a) the training signal is value equivalence, (b) the latent need not be physically interpretable, and (c) nothing is lost by removing a decoder, because MuZero never reconstructs observations and keeps only task-relevant information. Dreamer's $z_t \sim q(z_t \mid h_t, o_t)$ is shaped by a reconstruction (ELBO) objective, must encode enough to rebuild observations, and would lose its main representation signal without the decoder. MuZero's representation is preferred when observations carry a lot of task-irrelevant detail.
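Hybrid model. Use the known rigid-body model plus a learned contact residual, $\hat s' = f_\text{rigid}(s,a) + f_\text{res}(s,a)$, and run MPC's receding-horizon optimization over the combined model, replanning at every step so that residual error is corrected by fresh observations. The horizon must be long enough to span a contact event, with $H\Delta t$ at least on the order of the contact-event timescale, so the planner can anticipate making and breaking contact; the step $\Delta t$ must in turn be fine enough to resolve contact transients, or contact dynamics are aliased and missed. $H$ should not be much longer than that, because error in the learned residual compounds over the horizon.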
Tree of Thoughts as MuZero. Representation $h$ ≈ encoding the prompt into a reasoning state; dynamics $g$ ≈ the LLM generating the next thought; prediction $f$ ≈ the heuristic value/score of a thought; latent $z_t$ ≈ the current partial reasoning state; MCTS ≈ ToT's tree search. The key gap: ToT's value estimates are uncalibrated LLM self-evaluations, and neither its dynamics nor its value is trained for value equivalence or grounded in external feedback, so the search optimizes a noisy proxy. Fix by training a learned verifier/value model (and grounding steps in tool feedback) so the search targets a calibrated objective.
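The compounding argument in the first solution can be checked numerically. The sketch below is a toy under stated assumptions, not a result from the lecture: `true_step` is a hypothetical expanding scalar map, and the "learned" model differs from it only by a constant one-step bias `eps`. Because the model is fed its own predictions, the gap to the true trajectory grows geometrically with the horizon instead of as `H * eps`.

```python
import math

def true_step(s):
    # Hypothetical expanding nonlinear dynamics (illustrative only).
    return 1.5 * s + 0.1 * math.sin(s)

def model_step(s, eps):
    # "Learned" model: the true dynamics plus a small constant one-step error.
    return true_step(s) + eps

def rollout_errors(s0=0.1, horizon=15, eps=1e-3):
    """Return |model state - true state| after each of `horizon` steps."""
    s_true, s_model, errors = s0, s0, []
    for _ in range(horizon):
        s_true = true_step(s_true)
        s_model = model_step(s_model, eps)  # the model consumes its own prediction
        errors.append(abs(s_model - s_true))
    return errors

eps = 1e-3
errors = rollout_errors(eps=eps)
for h in (1, 5, 10, 15):
    print(f"H={h:2d}  rollout error={errors[h - 1]:.4f}  additive guess H*eps={h * eps:.4f}")
```

Truncating the rollout at a small `H` and bootstrapping with a value function corresponds to reading only the first few entries of `errors`.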

Implementation exercises#

Exercise 1: Dyna on a grid world#

$k = 0$ (standard Q-learning — no imagined updates)
$k = 5$ (moderate planning)
$k = 50$ (aggressive planning)
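As a starting point for this comparison, here is a minimal tabular Dyna-Q sketch. It assumes that $k$ is the number of imagined Q-updates performed after each real step (consistent with $k = 0$ being plain Q-learning), and it uses a hypothetical 1-D corridor in place of a full grid world; the environment, hyperparameters, and function names are illustrative choices, not part of the exercise statement.

```python
import random

def dyna_q(k, episodes=30, n_states=10, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    """Tabular Dyna-Q on a hypothetical corridor; returns (real steps, Q-table)."""
    rng = random.Random(seed)
    actions = (-1, +1)                       # move left or right
    goal = n_states - 1
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}                               # (s, a) -> (r, s_next); deterministic env
    total_steps = 0

    def env_step(s, a):
        s_next = min(max(s + a, 0), goal)
        return (1.0 if s_next == goal else 0.0), s_next

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def update(s, a, r, s_next):
        bootstrap = 0.0 if s_next == goal else gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + bootstrap - Q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.choice(actions) if rng.random() < eps else greedy(s)
            r, s_next = env_step(s, a)
            update(s, a, r, s_next)          # direct RL from the real transition
            model[(s, a)] = (r, s_next)      # model learning
            for _ in range(k):               # planning: k imagined updates
                ps, pa = rng.choice(list(model))
                pr, ps_next = model[(ps, pa)]
                update(ps, pa, pr, ps_next)
            s = s_next
            total_steps += 1
    return total_steps, Q

for k in (0, 5, 50):
    steps, _ = dyna_q(k)
    print(f"k={k:2d}: real environment steps over 30 episodes = {steps}")
```

Results from a single seed are noisy; averaging the step counts over several seeds gives a fairer comparison across the three values of $k$.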

Exercise 2: MPC with CEM on a continuous control task#

Implement receding-horizon MPC using CEM as the optimizer for CartPole-v1 or Pendulum-v1. Your implementation should:

Train a deterministic dynamics model $\hat{s}_{t+1} = f_\phi(s_t, a_t)$ from collected transitions.
At each step, run CEM for $I$ iterations with $N$ candidate sequences, horizon $H$ .
Execute only the first action, then replan.
Compare with a model-free baseline (SAC or PPO) on sample efficiency: number of real environment steps to reach a target return.
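The planning loop in the steps above can be sketched as follows. To stay self-contained, the sketch replaces the learned model $f_\phi$ with a hypothetical known point-mass model (`model_step`) and uses a hand-written quadratic cost; in the exercise both would come from the chosen environment and the trained network. All names and hyperparameters are assumptions for illustration.

```python
import numpy as np

def model_step(states, actions, dt=0.1):
    """Stand-in for f_phi: hypothetical point mass, state = [position, velocity]."""
    force = np.clip(actions, -1.0, 1.0)
    pos = states[..., 0] + dt * states[..., 1]
    vel = states[..., 1] + dt * force
    return np.stack([pos, vel], axis=-1)

def rollout_costs(state, action_seqs):
    """Cost of each candidate sequence; action_seqs has shape (N, H)."""
    n, horizon = action_seqs.shape
    states = np.tile(state, (n, 1))
    costs = np.zeros(n)
    for t in range(horizon):
        a = action_seqs[:, t]
        states = model_step(states, a)
        costs += states[:, 0] ** 2 + 0.1 * states[:, 1] ** 2 + 0.01 * a ** 2
    return costs

def cem_plan(state, horizon=15, n_candidates=64, n_elites=8, n_iters=4, rng=None):
    """Cross-entropy method over open-loop sequences; returns the first action."""
    rng = np.random.default_rng() if rng is None else rng
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        candidates = rng.normal(mean, std, size=(n_candidates, horizon))
        costs = rollout_costs(state, candidates)
        elites = candidates[np.argsort(costs)[:n_elites]]  # lowest-cost sequences
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return float(np.clip(mean[0], -1.0, 1.0))

def run_mpc(steps=60, seed=0):
    rng = np.random.default_rng(seed)
    state = np.array([1.0, 0.0])
    for _ in range(steps):
        action = cem_plan(state, rng=rng)    # replan from the current state
        state = model_step(state, action)    # execute only the first action
    return state

print("final [position, velocity]:", run_mpc())
```

Here the same function plays both the planner's model and the environment, so there is no model error; in the exercise the executed step would go to the real environment, which is where model inaccuracy shows up.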

Exercise 3: Latent imagination with a simplified Dreamer-style model#

Implement a simplified Dreamer-style agent on a visual control task (CarRacing-v3 or a custom pixel environment). The key components:

Encoder: CNN encodes observation $o_t \to z_t$ (no recurrent state for simplicity).
Dynamics: A learned transition $\hat{z}_{t+1} = g_\phi(z_t, a_t)$ in latent space.
Decoder: Reconstructs $\hat{o}_t$ from $z_t$ for training supervision.
Actor/Critic: Trained entirely within imagined latent rollouts of length $H$ .
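The imagination step used by the actor and critic can be sketched in isolation. The function below rolls a latent model forward for $H$ steps and returns a discounted imagined return bootstrapped with a critic value; the encoder, decoder, and gradient updates are omitted. The toy `policy`, `dynamics`, `reward`, and `value` functions are hypothetical stand-ins for learned networks, and the reward-after-transition convention is an assumption.

```python
import numpy as np

def imagine_return(z0, policy, dynamics, reward, value, horizon, gamma=0.99):
    """Discounted return of an imagined latent rollout plus a critic bootstrap."""
    z, total, discount = z0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(z)
        z = dynamics(z, a)               # z_next = g_phi(z, a); no decoder involved
        total += discount * reward(z)
        discount *= gamma
    return total + discount * value(z)   # critic covers everything past the horizon

# Hypothetical toy latent model (illustrative stand-ins for learned networks).
def policy(z):
    return -0.5 * z

def dynamics(z, a):
    return 0.9 * z + 0.1 * a

def reward(z):
    return -float(np.sum(z ** 2))

def value(z):
    return 0.0

z0 = np.array([1.0, -1.0])
for H in (1, 5, 15):
    print(f"H={H:2d}  imagined return = {imagine_return(z0, policy, dynamics, reward, value, H):.4f}")
```

In a full agent, `z0` would come from encoding a replayed observation, and these imagined returns would serve as training targets for the critic and as the objective for the actor.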

Extension prompts#

Model-ensemble uncertainty for safe planning: Extend Exercise 2's MPC by replacing the single dynamics model with an ensemble of $M = 5$ probabilistic models. Use the ensemble disagreement (variance across predicted next states) as an uncertainty penalty in the CEM cost function. Measure whether ensemble-based uncertainty penalties reduce the frequency of catastrophic planning failures (rollouts that predict high reward but produce low reward when executed).
MuZero without a known simulator: Study the MuZero pseudocode and identify which components require a simulator for training (the MCTS tree expansion relies on the dynamics function $g_\theta$ , which is learned — so what's the simulator's role?). Design a fully offline MuZero variant that learns from a fixed dataset with no environment interaction at all. What new failure mode emerges?
MBRL for LLM reasoning: Implement a simplified Tree of Thoughts on a reasoning benchmark (e.g., GSM8K). Compare BFS-style tree expansion vs MCTS with the LLM providing both the dynamics (next-reasoning-step prediction) and the value (self-critique scoring). Measure accuracy as a function of the tree budget (number of nodes expanded). Does MCTS outperform BFS for the same compute budget? Connect your findings to the MBRL framework's prediction about test-time compute scaling.
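For the ensemble-uncertainty extension, the disagreement penalty can be sketched as below. The sketch assumes a hypothetical ensemble of $M = 5$ scalar models that differ only in an action gain, lets each member roll out its own trajectory, and adds the variance across members to the task cost; a real implementation would use trained probabilistic networks and plug this cost into the CEM planner.

```python
import numpy as np

def ensemble_step(states, action, gains):
    """Hypothetical member i predicts s_next = s + gains[i] * action."""
    return states + gains * action

def penalized_cost(state, actions, gains, penalty_weight=1.0):
    """Mean task cost across members plus a disagreement (variance) penalty."""
    states = np.full(gains.shape, float(state))   # one trajectory per member
    task_cost, disagreement = 0.0, 0.0
    for a in actions:
        states = ensemble_step(states, a, gains)
        task_cost += np.mean(states ** 2)         # task: drive the state to zero
        disagreement += np.var(states)            # spread of the members' predictions
    return task_cost + penalty_weight * disagreement

rng = np.random.default_rng(0)
gains = 1.0 + 0.1 * rng.standard_normal(5)        # M = 5 slightly different models
plan = np.array([-0.5, -0.3, -0.2])
print("cost without penalty:", penalized_cost(1.0, plan, gains, penalty_weight=0.0))
print("cost with penalty:   ", penalized_cost(1.0, plan, gains, penalty_weight=10.0))
```

In this toy the members disagree only when the action is nonzero, so the penalty discourages aggressive plans; with learned models the disagreement would instead grow in regions of state space that the training data covers poorly.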
