Purpose of this lecture
The generative models studied so far produce outputs in response to noise and conditioning signals. This lecture applies the same neural machinery to a fundamentally different goal: building a world model — a learned simulator that predicts how the environment transitions in response to agent actions, enabling an agent to plan by imagining future states rather than by acting in the real world. World models represent the deepest integration of generative modeling with decision-making, and they provide the conceptual foundation for understanding how foundation models will be used in physical AI.
World models: architecture and role
World models (Ha & Schmidhuber, 2018) are a class of generative models that learn to simulate the environment in which an agent operates. The key insight is that the same neural machinery used to generate images, audio, or text can be applied to predict the next state of the environment given the current state and action, that is, to model the transition distribution $p(s_{t+1} \mid s_t, a_t)$, where $s_t$ is the state and $a_t$ the action at time $t$. This predictive capability enables model-based reinforcement learning, where the agent can plan by simulating many possible actions in its internal world model rather than interacting with the real environment.
The world model architecture typically consists of three components:
- Encoder: maps observations to latent states
- Dynamics model: learns the transition function, predicting the next latent state from the current latent state and action
- Decoder: maps latent states back to observations
This structure enables imagination: the agent can generate sequences of imagined states by sampling from the dynamics model, then decode them to produce simulated observations. This imagination capability is crucial for planning, as the agent can evaluate many possible action sequences without actually executing them in the real environment.
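The encode, predict, decode loop described above can be sketched in a few lines. This is a toy illustration, not the architecture from Ha & Schmidhuber: the functions `encode`, `dynamics`, `decode`, and `imagine`, and the hand-picked linear maps inside them, are hypothetical stand-ins for learned networks.

```python
# Toy world model. All three components are fixed linear maps standing in
# for learned neural networks (an assumption made purely for illustration).

def encode(obs):
    """Encoder: compress a 4-D observation into a 2-D latent state."""
    return [0.5 * (obs[0] + obs[1]), 0.5 * (obs[2] + obs[3])]

def dynamics(z, action):
    """Dynamics model: predict the next latent from the current latent and action."""
    return [0.9 * z[0] + action, 0.9 * z[1] - action]

def decode(z):
    """Decoder: expand a latent state back into a 4-D observation."""
    return [z[0], z[0], z[1], z[1]]

def imagine(obs, actions):
    """Roll the dynamics model forward in latent space, without touching the
    real environment, and decode each imagined latent for inspection."""
    z = encode(obs)
    imagined = []
    for a in actions:
        z = dynamics(z, a)
        imagined.append(decode(z))
    return imagined

rollout = imagine([1.0, 1.0, 0.0, 0.0], actions=[0.1, 0.1, -0.2])
for t, obs_hat in enumerate(rollout, start=1):
    print(t, [round(x, 3) for x in obs_hat])
```

Only the first observation comes from the environment; every later state is produced by the model, which is what makes evaluating many candidate action sequences cheap.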
The RSSM architecture
The recurrent state space model (RSSM; Hafner et al., 2019) is a world model architecture whose latent state combines a deterministic recurrent component with a stochastic component. The recurrent component carries information from past observations forward in time, which a purely feedforward model cannot do, and this memory supports planning over longer horizons.
The RSSM architecture consists of:
- Observation encoder: maps each observation to a deterministic embedding
- Representation (posterior) model: computes the posterior distribution over the latent state $s_t$ given the previous latent state $s_{t-1}$ and action $a_{t-1}$, and the current observation $o_t$
- Transition (prior) model: computes the prior distribution over the next latent state $s_{t+1}$ given the current latent state $s_t$ and action $a_t$, without access to the next observation
- Decoder: maps the latent state to a reconstruction of the observation
The RSSM is particularly powerful because it can maintain a compact, informative representation of the environment's state that evolves over time, enabling efficient planning and control.
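In symbols, the components listed above can be written roughly as follows. This is a sketch in the style of Hafner et al. (2019), not notation defined in this lecture: $h_t$ (the deterministic recurrent state), $s_t$ (the stochastic latent), $a_t$ (the action), $o_t$ (the observation), and the parameter subscripts $\theta$, $\phi$ are all introduced here as assumptions.

```latex
\begin{aligned}
\text{Recurrent (deterministic) state:} \quad & h_t = f_\theta(h_{t-1}, s_{t-1}, a_{t-1}) \\
\text{Prior (no observation):} \quad & \hat{s}_t \sim p_\theta(s_t \mid h_t) \\
\text{Posterior (with observation):} \quad & s_t \sim q_\phi(s_t \mid h_t, o_t) \\
\text{Decoder:} \quad & \hat{o}_t \sim p_\theta(o_t \mid h_t, s_t)
\end{aligned}
```

Here $h_t$ summarizes the previous latent state and action, so the posterior depends on them through $h_t$. During imagination only the first two lines are used, since no observation is available to condition on.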
Dreamer: a model-based reinforcement learning agent
Dreamer (Hafner et al., 2020) is a model-based reinforcement learning agent that learns its behavior from trajectories imagined by a world model. The agent consists of three main components:
- World model: learns to predict the next observation and reward given the current state and action
- Actor: learns to select actions that maximize expected reward
- Value function: estimates the expected return from each state
The key innovation of Dreamer is its use of latent imagination: the agent generates sequences of imagined latent states by sampling from the world model's dynamics, then evaluates these sequences using predicted rewards and the learned value function. Because the actor and value function are trained on imagined rather than real trajectories, real environment interaction is needed only to collect data for the world model, which makes the agent highly sample-efficient.
Imagination and planning: Starting from latent states inferred from real experience, Dreamer samples actions from the actor policy and uses the world model to simulate the resulting states and rewards. The value function scores these imagined trajectories, accounting for returns beyond the imagination horizon, and that score is the training signal used to improve the actor. When acting in the environment, the agent samples from the actor rather than searching over action sequences.
Learning: Dreamer trains the world model on real experience by minimizing the error in its predictions of observations and rewards. The actor is trained to maximize the value estimates of imagined trajectories, and the value function is trained to predict those estimates accurately.
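One way to turn an imagined trajectory into training targets is the λ-return, which blends predicted rewards with value estimates. The sketch below uses a common recursive form; the function name `lambda_returns`, the argument layout, and the default discount and λ values are illustrative assumptions, not details given in this lecture.

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.95):
    """Compute lambda-returns along one imagined trajectory.

    rewards[t]     : predicted reward for imagined step t
    next_values[t] : value estimate of the state reached after step t
    Recursion      : G_t = r_t + gamma * ((1 - lam) * v_{t+1} + lam * G_{t+1})
    """
    if len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must have equal length")
    if not rewards:
        return []
    returns = [0.0] * len(rewards)
    # Beyond the imagination horizon, fall back on the last value estimate.
    next_return = next_values[-1]
    for t in reversed(range(len(rewards))):
        blended = (1.0 - lam) * next_values[t] + lam * next_return
        returns[t] = rewards[t] + gamma * blended
        next_return = returns[t]
    return returns

# lam = 0 gives one-step bootstrapped targets; lam = 1 gives the discounted
# sum of imagined rewards plus a bootstrap from the final value estimate.
targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
print(targets)
```

Under this sketch, the value function would be regressed toward `targets` and the actor updated to increase them.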
Model predictive control and latent-space planning
Model predictive control (MPC) is a control strategy that uses a model to predict future states and optimize control actions. In the context of world models, MPC involves:
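- Prediction: use the world model to predict the next few states given the current state and a candidate action sequence
- Optimization: search for the action sequence that minimizes a cost function (or maximizes predicted reward) over a planning horizon
- Execution: execute only the first action in the optimized sequence, then replan from the resulting state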
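The three steps above can be sketched with a random-shooting optimizer on a one-dimensional toy problem. Everything here is hypothetical: `model` stands in for a learned world model, the toy environment is assumed to match that model exactly, and the horizon and candidate counts are arbitrary.

```python
import random

def model(x, a):
    """Stand-in for a learned one-step world model (hypothetical toy dynamics)."""
    return x + 0.5 * a

def cost(x, goal):
    """Squared distance to the goal state."""
    return (x - goal) ** 2

def plan(x, goal, horizon=5, candidates=500, rng=random):
    """Random shooting: sample action sequences, score each by rolling out
    the model, and return the lowest-cost sequence."""
    best_seq, best_cost = None, float("inf")
    for _ in range(candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        x_pred, total = x, 0.0
        for a in seq:                      # Prediction
            x_pred = model(x_pred, a)
            total += cost(x_pred, goal)
        if total < best_cost:              # Optimization
            best_seq, best_cost = seq, total
    return best_seq

def run_mpc(x0, goal, steps=20, seed=0):
    """Receding-horizon loop: replan at every step, apply one action."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        seq = plan(x, goal, rng=rng)
        # Execution: apply only the first action. The toy environment is
        # assumed to behave exactly like the model, so we reuse model().
        x = model(x, seq[0])
    return x

final_x = run_mpc(x0=0.0, goal=2.0)
print(round(final_x, 3))
```

Replanning at every step is what lets MPC tolerate an imperfect model: errors in the later part of a plan are discarded before they are ever executed.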
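Latent-space planning: In world models, planning can be performed in the latent space rather than the observation space. This is more efficient because the latent space is typically much lower-dimensional than raw observations such as images, and it is trained to retain the information needed to predict future states and rewards. The agent plans by generating sequences of latent states and predicting rewards directly from them. Decoding back to observations is not required for planning; it is mainly useful for training the model and for inspecting its predictions.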
Advantages of latent-space planning:
- More efficient: fewer dimensions to plan in
- More robust: latent representations can discard observation details that do not help predict the future, such as pixel-level noise
- Potentially better generalization: a planner that operates on latent states does not depend directly on the format of the observations
Sample efficiency and model-based vs. model-free RL
Sample efficiency is a crucial metric in reinforcement learning, measuring how many environment interactions are needed to achieve a certain level of performance. Model-based methods can be significantly more sample-efficient than model-free methods because they can plan using imagined trajectories.
Tradeoffs between model-based and model-free RL:
- Model-based: requires fewer environment interactions, but is exposed to model error and pays extra computation for model training and planning
- Model-free: has no learned model and so cannot be misled by model error, but requires many environment interactions to learn effectively
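Model accuracy: The performance of model-based methods depends heavily on the accuracy of the world model. If the model is inaccurate, the agent plans on the basis of wrong predictions. Because each imagined step feeds into the next, small one-step errors can compound over long rollouts and lead to poor performance.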
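A small numerical experiment illustrates why model accuracy matters more as rollouts get longer. The dynamics below are invented for illustration: the "true" system grows by 5% per step, and the hypothetical learned model overestimates each step by 1%.

```python
TRUE_RATE = 1.05                 # hypothetical ground-truth growth per step
MODEL_RATE = TRUE_RATE * 1.01    # learned model with a 1% one-step error

def rollout(rate, x0, horizon):
    """Apply x <- rate * x repeatedly and return all visited states."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(rate * xs[-1])
    return xs

def relative_errors(x0=1.0, horizon=50):
    """Relative gap between imagined and real states at each step."""
    real = rollout(TRUE_RATE, x0, horizon)
    imagined = rollout(MODEL_RATE, x0, horizon)
    return [abs(i - r) / abs(r) for i, r in zip(imagined, real)]

errors = relative_errors()
for k in (1, 10, 50):
    print(f"step {k:2d}: relative error {errors[k]:.3f}")
```

In this toy system the relative error after k steps is 1.01^k - 1, so a model that looks accurate one step ahead can be badly wrong at the end of a long imagined trajectory. This is one reason to keep imagination horizons short or to replan frequently.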
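Planning algorithms: Different planning algorithms trade off computational cost against solution quality. Sampling-based methods such as random shooting and the cross-entropy method (used in PlaNet) search approximately over action sequences, while exact methods are generally limited to small or specially structured problems, such as linear dynamics with a quadratic cost.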
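As one example of an approximate planner, the cross-entropy method (CEM) repeatedly samples action sequences, keeps the best few, and refits the sampling distribution to them. The sketch below is a minimal version on a toy one-dimensional model; the dynamics inside `rollout_cost`, the function names, and all hyperparameters are assumptions chosen for illustration.

```python
import random
import statistics

def rollout_cost(x, actions, goal):
    """Score an action sequence under a hypothetical toy model x' = x + 0.5 * a,
    with actions clipped to [-1, 1]."""
    total = 0.0
    for a in actions:
        x = x + 0.5 * max(-1.0, min(1.0, a))
        total += (x - goal) ** 2
    return total

def cem_plan(x, goal, horizon=5, population=100, elites=10, iterations=5, seed=0):
    """Cross-entropy method over action sequences with a diagonal Gaussian."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        # Sample candidate sequences from the current distribution.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        # Keep the lowest-cost candidates (the elites).
        samples.sort(key=lambda seq: rollout_cost(x, seq, goal))
        best = samples[:elites]
        # Refit mean and spread to the elites; the small constant keeps
        # the distribution from collapsing to a point.
        mean = [statistics.fmean(seq[t] for seq in best) for t in range(horizon)]
        std = [statistics.pstdev([seq[t] for seq in best]) + 1e-3
               for t in range(horizon)]
    return mean

best_actions = cem_plan(x=0.0, goal=1.0)
print([round(a, 2) for a in best_actions])
```

Compared with sampling uniformly at random, CEM concentrates its samples where earlier rollouts scored well, which usually finds better plans for the same number of model evaluations.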
World models in physical AI
World models are central to physical AI because they enable agents to understand and interact with the physical world. In robotics and autonomous systems, world models allow agents to:
- Plan ahead: anticipate the consequences of actions before executing them
- Handle uncertainty: model the uncertainty in the environment and agent behavior
- Generalize: apply learned models to new situations and environments
Key challenges:
- Model accuracy: ensuring the world model accurately represents the environment
- Planning efficiency: making planning computationally feasible
- Transfer learning: adapting models to new tasks and environments
Cross-course context: world models in generative modeling
The concept of world models appears in multiple courses in this sequence:
- Course 1 (RL): Reinforcement learning agents learn to optimize rewards through interaction with environments
- Course 2 (Robotics): Robots learn to control their physical bodies and interact with the physical world
- Course 3 (Generative Models): Generative models learn to simulate the distribution of data
- Course 4 (VLMs): Vision-language models learn to align visual and textual representations
The world model framework provides a unifying perspective: all these domains involve learning to simulate or predict the behavior of systems. In RL, the system is the environment; in robotics, it's the physical body; in generative modeling, it's the data distribution; in VLMs, it's the alignment between visual and textual representations.
Key takeaways
World models learn to simulate the environment by predicting how states evolve in response to actions. They enable model-based reinforcement learning, where agents can plan by imagining future states rather than by acting in the real environment. The RSSM architecture extends basic world models with recurrent structure, giving the model memory of past observations and supporting longer-horizon planning. Dreamer is a model-based RL agent that uses latent imagination to learn behaviors from imagined rather than real trajectories. World models can be more sample-efficient than model-free methods, but they require accurate models and efficient planning algorithms. They are central to physical AI because they enable agents to understand and interact with the physical world.
Conceptual questions
- A world model predicts the next observation given the current state and action. What are the advantages of this approach over direct policy learning? What are the potential disadvantages?
- In the RSSM architecture, the encoder maps observations to latent states, and the decoder maps latent states back to observations. How does this structure enable more efficient planning compared to working directly in the observation space?
- Dreamer uses latent imagination to plan by generating sequences of imagined states. What are the key components of this process, and how does it differ from model-free learning?
- Model-based methods can be more sample-efficient than model-free methods, but they also suffer from model error. How does the Dreamer agent address this issue?
- How do world models in generative modeling relate to world models in reinforcement learning? What are the key similarities and differences?
Looking ahead
With world models linking generation to decision-making, the course turns to the risks that accompany powerful generative systems.
Week 13: Safety, Misuse, and Alignment. We examine misuse vectors (deepfakes, memorization, adversarial inputs), detection and differential-privacy defenses, and the RLHF/DPO alignment techniques that steer model behavior toward human preferences.
Further reading
- Ha, D., & Schmidhuber, J. (2018). World Models. arXiv:1803.10122. (NeurIPS version: Recurrent World Models Facilitate Policy Evolution.)
- Hafner, D., et al. (2019). Learning Latent Dynamics for Planning from Pixels (PlaNet, RSSM). ICML.
- Hafner, D., et al. (2020). Dream to Control: Learning Behaviors by Latent Imagination (Dreamer). ICLR.