Week 12: World Models and Reinforcement Learning

Purpose of this lecture

The generative models studied so far produce outputs in response to noise and conditioning signals. This lecture applies the same neural machinery to a different goal: building a world model, a learned simulator that predicts how the environment transitions in response to agent actions. Such a model lets an agent plan by imagining future states rather than by acting in the real world. World models tightly couple generative modeling with decision-making, and they provide a conceptual foundation for understanding how foundation models may be used in physical AI.

World models: architecture and role

World models (Ha & Schmidhuber, 2018) are a class of generative models that learn to simulate the environment in which an agent operates. The key insight is that the same neural machinery used to generate images, audio, or text can be applied to predict the next state of the environment given the current state and action: $p(s_{t+1} \mid s_t, a_t)$. This predictive capability enables model-based reinforcement learning, where the agent can plan by simulating many possible actions in its internal world model rather than interacting with the real environment.

The world model architecture typically consists of three components:

Encoder: maps observations $o_t$ to latent states $z_t$
Dynamics model: learns the transition function $p(z_{t+1} \mid z_t, a_t)$
Decoder: maps latent states back to observations, $\hat{o}_t \sim p(o_t \mid z_t)$

This structure enables imagination: the agent can generate sequences of imagined states by sampling from the dynamics model, then decode them to produce simulated observations. This imagination capability is crucial for planning, as the agent can evaluate many possible action sequences without actually executing them in the real environment.
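The three components and the imagination loop can be sketched in a few lines of Python. This is a toy illustration rather than an implementation from Ha & Schmidhuber: the functions `encode`, `dynamics`, `decode`, and `imagine`, the scalar latent, and the coefficient 0.9 are all assumptions chosen so the example runs without a trained network.

```python
import random

def encode(obs):
    """Encoder: map an observation o_t (a list of floats) to a scalar latent z_t.
    Toy assumption: the latent is simply the mean of the observation."""
    return sum(obs) / len(obs)

def dynamics(z, a, rng, noise_std=0.0):
    """Dynamics model: draw z_{t+1} from a toy Gaussian p(z_{t+1} | z_t, a_t)."""
    mean = 0.9 * z + a  # hypothetical stand-in for a learned transition
    return rng.gauss(mean, noise_std)

def decode(z, size=4):
    """Decoder: map a latent back to a predicted observation (toy: constant vector)."""
    return [z] * size

def imagine(obs, actions, seed=0, noise_std=0.0):
    """Roll the dynamics forward in latent space without touching the environment."""
    rng = random.Random(seed)
    z = encode(obs)
    latents = []
    for a in actions:
        z = dynamics(z, a, rng, noise_std)
        latents.append(z)
    return latents

# Imagine three steps ahead from a single real observation.
latents = imagine(obs=[1.0, 1.0, 1.0, 1.0], actions=[0.0, 0.5, 0.5])
predicted_obs = [decode(z) for z in latents]
```

Setting `noise_std` above zero makes each call to `dynamics` a genuine sample, so repeated rollouts with different seeds give different imagined futures.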

The RSSM architecture

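The recurrent state space model (RSSM; Hafner et al., 2019) is a world model architecture that adds recurrent structure to the basic world model so that it can handle sequential data. It combines a deterministic recurrent state, which carries memory of past observations and actions, with a stochastic latent state, which captures uncertainty about the current situation. Unlike simple feedforward models, RSSMs can therefore maintain a persistent memory of the past, which supports prediction and planning over longer horizons.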

The RSSM architecture consists of:

Observation encoder: maps each observation $o_t$ to a deterministic embedding $e_t$
Representation model (posterior): computes the posterior distribution over the latent state $z_t$ given the previous latent state $z_{t-1}$, the previous action $a_{t-1}$, and the current observation $o_t$
Transition model (prior): computes the prior distribution over the next latent state $z_{t+1}$ given the current latent state $z_t$ and action $a_t$, without access to the next observation
Decoder: maps the latent state $z_t$ to a reconstruction of the observation $\hat{o}_t$

The RSSM is particularly powerful because it can maintain a compact, informative representation of the environment's state that evolves over time, enabling efficient planning and control.
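One step of an RSSM-style update can be sketched as follows. The sketch follows the common description in which a deterministic recurrent state $h_t$ is updated first and the prior and posterior over $z_t$ are then computed from it. The scalar states, the hand-picked weights, and the function names `rssm_step` and `gaussian_kl` are illustrative assumptions, not the parameterization used by Hafner et al.

```python
import math

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL(q || p) between two 1-D Gaussians (closed form)."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)

def rssm_step(h_prev, z_prev, a_prev, e_t=None):
    """One toy RSSM step with scalar states and fixed (not learned) weights.

    h_t is the deterministic recurrent state. The prior predicts z_t from h_t
    alone; the posterior additionally sees the observation embedding e_t.
    """
    h_t = math.tanh(0.5 * h_prev + 0.3 * z_prev + 0.2 * a_prev)
    prior = (0.8 * h_t, 1.0)                 # (mean, std) of the prior over z_t
    if e_t is None:                          # imagination: no observation available
        return h_t, prior, None
    post_mean = 0.5 * prior[0] + 0.5 * e_t   # toy fusion of prediction and evidence
    posterior = (post_mean, 0.5)             # (mean, std) of the posterior over z_t
    return h_t, prior, posterior

h_t, prior, posterior = rssm_step(h_prev=0.0, z_prev=1.0, a_prev=1.0, e_t=2.0)
# KL(posterior || prior): the training term that keeps prior and posterior close.
kl = gaussian_kl(*posterior, *prior)
```

Calling `rssm_step` with `e_t=None` corresponds to imagination, where no observation exists and only the prior is available.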

Dreamer: a model-based reinforcement learning agent

Dreamer (Hafner et al., 2020) is a model-based reinforcement learning agent that learns its behavior inside a learned world model. The agent consists of three main components:

World model: learns to predict the next observation and reward given the current state and action
Actor: learns to select actions that maximize expected reward
Value function: estimates the expected return from each state

The key innovation of Dreamer is its use of latent imagination: starting from latent states inferred from real experience, the agent rolls out imagined trajectories with the world model's dynamics and evaluates them with the learned value function. Because behavior is learned from these imagined trajectories rather than from additional environment interaction, Dreamer is highly sample-efficient.

Imagination and planning: Dreamer samples actions from the actor policy and uses the world model to simulate the resulting latent states and rewards. The value function scores the imagined trajectories, accounting for rewards beyond the imagination horizon, and the actor is updated to increase these value estimates. At decision time the agent simply samples an action from the actor; it does not search over action sequences.

Learning: The world model is trained on real experience to reconstruct observations and predict rewards. The actor and value function are trained on imagined trajectories: the value function regresses toward estimated returns, and the actor is optimized to maximize them.
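The return estimates used as training targets on imagined trajectories are often written as λ-returns, which blend multi-step imagined rewards with value-function bootstrapping. The following is a minimal sketch, assuming the recursive form $V^\lambda_t = r_t + \gamma\,[(1-\lambda)\,v_{t+1} + \lambda\,V^\lambda_{t+1}]$ with $V^\lambda_H = v_H$. The function name and list-based interface are illustrative, and the details vary between Dreamer versions.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns for one imagined trajectory.

    rewards: imagined rewards r_0 .. r_{H-1}
    values:  value estimates v_0 .. v_H (one more entry than rewards);
             v_H bootstraps everything beyond the imagination horizon.
    """
    horizon = len(rewards)
    if len(values) != horizon + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    returns = [0.0] * horizon
    next_return = values[horizon]            # V^lambda_H = v_H
    for t in reversed(range(horizon)):
        # Mix the one-step bootstrap v_{t+1} with the longer lambda-return.
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns

one_step = lambda_returns([1.0, 1.0], [0.0, 2.0, 4.0], gamma=0.5, lam=0.0)
full = lambda_returns([1.0, 1.0], [0.0, 2.0, 4.0], gamma=0.5, lam=1.0)
```

With `lam=0.0` each target is the one-step bootstrap $r_t + \gamma v_{t+1}$; with `lam=1.0` the targets sum all imagined rewards and bootstrap only at the final state.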

Model predictive control and latent-space planning

Model predictive control (MPC) is a control strategy that uses a model to predict future states and re-optimizes the control actions at every time step. With a world model, each MPC step involves:

Prediction: use the world model to predict the states that follow from candidate action sequences, starting at the current state
Optimization: choose the action sequence that minimizes a cost function (or maximizes predicted reward) over the planning horizon
Execution: execute only the first action of the optimized sequence, observe the new state, and replan
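The predict, optimize, execute loop above can be sketched with the simplest possible optimizer, random shooting. Everything here is a toy assumption: the 1-D point-mass `model`, the quadratic `cost`, the action range, and the use of `env_step` (identical to the model) as a stand-in for the real environment.

```python
import random

def model(state, action):
    """Hypothetical learned model: a 1-D point that moves by the action."""
    return state + action

def env_step(state, action):
    """Stand-in for the real environment (here identical to the model)."""
    return state + action

def cost(state, goal):
    return (state - goal) ** 2

def plan_random_shooting(state, goal, rng, horizon=5, n_candidates=200):
    """Prediction + optimization: score random action sequences with the model."""
    best_cost, best_seq = float("inf"), None
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            s = model(s, a)          # predicted next state
            total += cost(s, goal)   # accumulate cost over the horizon
        if total < best_cost:
            best_cost, best_seq = total, seq
    return best_seq

def run_mpc(start, goal, steps=10, seed=0):
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        seq = plan_random_shooting(state, goal, rng)
        state = env_step(state, seq[0])  # execution: first action only, then replan
    return state

final_state = run_mpc(start=0.0, goal=3.0)
```

Replanning at every step is what lets MPC correct for model error: only the first action of each plan is ever executed, so later mistakes in the predicted sequence are never acted on.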

Latent-space planning: In world models, planning can be performed in the latent space rather than the observation space. This is more efficient because the latent space is typically much lower-dimensional than raw observations and is trained to retain the information needed for prediction. The agent plans by rolling out sequences of latent states and scoring them with predicted rewards or values; decoding back to observations is optional and mainly useful for inspecting what the model imagines.

Advantages of latent-space planning:

More efficient: fewer dimensions to plan in
More robust: latent representations capture the essential features of the environment
Better generalization: latent models can generalize across different observation spaces

Sample efficiency and model-based vs. model-free RL

Sample efficiency is a crucial metric in reinforcement learning, measuring how many environment interactions are needed to achieve a certain level of performance. Model-based methods can be significantly more sample-efficient than model-free methods because they can plan using imagined trajectories.

Tradeoffs between model-based and model-free RL:

Model-based: requires fewer environment interactions, but is sensitive to model error and spends extra computation on learning the model and planning with it
Model-free: has no learned model and is therefore unaffected by model error, but typically requires many more environment interactions to learn effectively

Model accuracy: The performance of model-based methods depends heavily on the accuracy of the world model. If the model is inaccurate, the agent may plan based on false information, leading to poor performance.
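A small numerical experiment illustrates why model error matters more as the planning horizon grows. This is a toy sketch under stated assumptions: the linear "true" dynamics, the 2% coefficient error in the learned model, and all names are hypothetical and chosen only to make the effect visible.

```python
def rollout(step, state, actions):
    """Apply a one-step function repeatedly and record every visited state."""
    states = []
    for a in actions:
        state = step(state, a)
        states.append(state)
    return states

def true_step(s, a):
    return 1.00 * s + a   # hypothetical real dynamics

def model_step(s, a):
    return 1.02 * s + a   # hypothetical learned model with a 2% coefficient error

actions = [0.1] * 20
true_states = rollout(true_step, 1.0, actions)
model_states = rollout(model_step, 1.0, actions)

# Prediction error at each step of the open-loop rollout.
errors = [abs(m - t) for m, t in zip(model_states, true_states)]
```

Because each prediction is fed back in as the next input, the small one-step error is amplified at every step, so `errors` grows with the horizon instead of staying near its one-step value.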

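Planning algorithms: Planning algorithms trade computational cost against solution quality. Exact planning is rarely tractable in continuous or high-dimensional problems, so most planners used with learned models are approximate, for example sampling-based methods that evaluate a limited number of candidate action sequences over a finite horizon.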
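One widely used approximate planner is the cross-entropy method (CEM), which repeatedly samples action sequences, keeps the best few, and refits the sampling distribution to them. The sketch below assumes a toy 1-D model in which the state moves by the action and a quadratic cost; the function names and hyperparameters are illustrative, not taken from any particular agent.

```python
import random
import statistics

def rollout_cost(state, actions, goal):
    """Total squared distance to the goal along a model rollout."""
    total = 0.0
    for a in actions:
        state = state + a   # hypothetical learned model: state moves by the action
        total += (state - goal) ** 2
    return total

def cem_plan(state, goal, horizon=5, n_samples=100, n_elite=10, n_iters=5, seed=0):
    """Cross-entropy method over action sequences with a diagonal Gaussian."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(n_iters):
        # Sample candidate sequences from the current distribution.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        # Keep the lowest-cost candidates (the elites).
        samples.sort(key=lambda seq: rollout_cost(state, seq, goal))
        elites = samples[:n_elite]
        # Refit mean and std per time step to the elites.
        for t in range(horizon):
            column = [seq[t] for seq in elites]
            mean[t] = statistics.fmean(column)
            std[t] = statistics.pstdev(column) + 1e-6
    return mean

plan = cem_plan(state=0.0, goal=2.0)
```

The number of samples, elites, and iterations sets the compute budget: larger values give better plans at higher cost, which is the efficiency versus accuracy tradeoff described above.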

World models in physical AI

World models are central to physical AI because they enable agents to understand and interact with the physical world. In robotics and autonomous systems, world models allow agents to:

Plan ahead: anticipate the consequences of actions before executing them
Handle uncertainty: model the uncertainty in the environment and agent behavior
Generalize: apply learned models to new situations and environments

Key challenges:

Model accuracy: ensuring the world model accurately represents the environment
Planning efficiency: making planning computationally feasible
Transfer learning: adapting models to new tasks and environments

Cross-course context: world models in generative modeling

The concept of world models appears in multiple courses in this sequence:

Course 1 (RL): Reinforcement learning agents learn to optimize rewards through interaction with environments
Course 2 (Robotics): Robots learn to control their physical bodies and interact with the physical world
Course 3 (Generative Models): Generative models learn to simulate the distribution of data
Course 4 (VLMs): Vision-language models learn to align visual and textual representations

The world model framework provides a unifying perspective: all these domains involve learning to simulate or predict the behavior of systems. In RL, the system is the environment; in robotics, it's the physical body; in generative modeling, it's the data distribution; in VLMs, it's the alignment between visual and textual representations.

Key takeaways

World models learn to simulate the environment by predicting how states evolve in response to actions.
They enable model-based reinforcement learning, where agents plan by imagining future states rather than by acting in the real environment.
The RSSM architecture extends basic world models with recurrent structure, giving the latent state a memory of past observations.
Dreamer is a model-based RL agent that learns its actor and value function through latent imagination.
World models can be more sample-efficient than model-free methods, but they require accurate models and efficient planning algorithms.
They are central to physical AI because they enable agents to understand and interact with the physical world.

Conceptual questions

A world model predicts the next observation given the current state and action. What are the advantages of this approach over direct policy learning? What are the potential disadvantages?
In the RSSM architecture, the encoder maps observations to latent states, and the decoder maps latent states back to observations. How does this structure enable more efficient planning compared to working directly in the observation space?
Dreamer uses latent imagination to plan by generating sequences of imagined states. What are the key components of this process, and how does it differ from model-free learning?
Model-based methods can be more sample-efficient than model-free methods, but they also suffer from model error. How does the Dreamer agent address this issue?
How do world models in generative modeling relate to world models in reinforcement learning? What are the key similarities and differences?

Solutions

Advantages of a predictive world model over direct policy learning: sample efficiency (plan in imagination instead of acting), reuse of one model across tasks/rewards, support for planning/MPC, and interpretability of predicted rollouts. Disadvantages: compounding model error (the planner trusts inaccurate predictions), extra compute to learn and roll out the model, model exploitation by the optimizer, and an objective mismatch — predictive accuracy is not the same as task reward.
The encoder→latent→decoder structure lets planning happen in a compact, low-dimensional latent rather than high-dimensional pixels: rollouts are far cheaper, the latent captures task-relevant dynamics while discarding pixel noise, the recurrent state provides memory for partial observability, and multi-step imagination stays in latent space without repeatedly decoding to observations.
Dreamer's latent imagination uses three components: the learned RSSM world model (rolls out latent trajectories), an actor (proposes actions), and a critic/value function (evaluates imagined returns). It backpropagates value gradients through the differentiable imagined rollout. Unlike model-free methods, learning happens on model-generated trajectories, so few real interactions are needed; unlike MPC-style planning, there is no act-time search, because the policy is amortized into the actor.
Dreamer limits model error by training on short imagination horizons (capping compounding error), continually refreshing the model with new real data, representing uncertainty with stochastic latent states and KL balancing, and using a value function that bootstraps beyond the imagination horizon (DreamerV3 adds robustness tricks such as symlog and free bits).
Similarity: both learn a generative simulator $p(\text{next}\mid \text{current}, \cdot)$ with the same latent-variable/ELBO machinery. Differences: in pure generative modeling there is no agent, action, or reward — the "world" is the data distribution; in RL the model is action-conditioned and exists to support reward-maximizing planning, so it must be accurate in action-relevant directions and is judged by downstream control performance rather than sample fidelity.

Looking ahead

With world models linking generation to decision-making, the course turns to the risks that accompany powerful generative systems.

Week 13: Safety, Misuse, and Alignment. We examine misuse vectors (deepfakes, memorization, adversarial inputs), detection and differential-privacy defenses, and the RLHF/DPO alignment techniques that steer model behavior toward human preferences.
