Purpose of this lecture
Vision-Language Models become truly transformative when they are embodied—when their outputs dictate physical force in the real world. Robotics is the ultimate, most demanding application of VLMs: the system must perceive an unstructured physical environment, understand abstract natural language instructions, and execute continuous motor commands. All of this must occur under hard real-time latency constraints, partial observability, and strict safety requirements.
This lecture bridges the multimodal architectures developed in Course 4 with the physical control and safety theories established in Course 2 (Modern Robot Learning). We examine how VLMs are integrated into hierarchical control pipelines, the mathematics of discrete action tokenization (RT-2), the "System 1 vs. System 2" paradigm (GR00T), and the integration of 3D spatial representations (NeRFs/3DGS) into the VLM control loop.
Hierarchical robot control architecture
Modern robot control operates simultaneously across vastly different timescales. Running a 70B parameter VLM at the 1000 Hz frequency required to calculate joint torques is both unnecessary and computationally infeasible. Instead, embodied VLMs are integrated into a hierarchical control stack:
1. Semantic Task Level (System 2): Operating at roughly 0.5 to 2 Hz, a massive VLM (e.g., LLaVA, PaLM-E) sits in the cloud or on a heavy local server. It receives a natural language instruction ("Make a cup of tea") and a streaming visual observation (RGB video or 3D point cloud). It performs deep reasoning and outputs a high-level sequence of semantic sub-goals ("locate mug", "grasp handle", "navigate to kettle").
2. Kinematic Skill Level (System 1): Operating at 10 to 50 Hz directly on the robot's edge-compute board (e.g., Nvidia Jetson). This layer receives the semantic sub-goal from the VLM and executes a parameterized, continuous control policy. As explored in Course 2, this is frequently a Diffusion Policy or an Action Chunking Transformer (ACT) that maps the VLM's instruction to specific end-effector velocities.
3. Reactive Control Level: Operating at 1000+ Hz. Low-level operational-space controllers (like PD controllers) track the desired velocities from the skill layer, reject physical disturbances (e.g., the mug being heavier than expected), and enforce strict joint safety limits.
This hierarchy decouples deep language understanding (System 2) from real-time skill execution (System 1) and from the low-level reactive loop that stabilizes the hardware.
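The three rates above can be illustrated with a single-threaded scheduling sketch. It only shows how often each level runs relative to the 1000 Hz tick; the planner, skill policy, and tracking law are placeholders invented for illustration, and a real stack would typically run the levels asynchronously in separate processes or on separate machines.

```python
REACTIVE_HZ = 1000  # reactive control level
SKILL_HZ = 50       # kinematic skill level (System 1)
PLANNER_HZ = 1      # semantic task level (System 2)


def run_stack(duration_s):
    """Simulate the stack on a 1000 Hz tick; return call counts and final velocity."""
    skill_every = REACTIVE_HZ // SKILL_HZ      # skill layer runs every 20 ticks
    planner_every = REACTIVE_HZ // PLANNER_HZ  # planner runs every 1000 ticks
    calls = {"planner": 0, "skill": 0, "reactive": 0}
    subgoal = None
    v_desired = 0.0
    v_actual = 0.0
    for tick in range(int(duration_s * REACTIVE_HZ)):
        if tick % planner_every == 0:
            # Placeholder for the VLM producing a semantic sub-goal.
            subgoal = f"subgoal-{calls['planner']}"
            calls["planner"] += 1
        if tick % skill_every == 0:
            # Placeholder skill policy: a fixed velocity target once a sub-goal exists.
            v_desired = 0.1 if subgoal is not None else 0.0
            calls["skill"] += 1
        # Placeholder tracking law standing in for the low-level controller.
        v_actual += 0.05 * (v_desired - v_actual)
        calls["reactive"] += 1
    return calls, v_actual


calls, v_final = run_stack(2.0)
print(calls, round(v_final, 4))
```

Over a 2-second window the reactive loop runs three orders of magnitude more often than the planner, which is why the sub-goal has to stay valid between planner updates.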
Language-Conditioned Spatial Grounding
To pass instructions down the hierarchy, the VLM must translate abstract nouns into explicit physical coordinates.
2D Grounding to 3D Targets: If the VLM commands "grasp the red screwdriver," it must first identify it. Using grounding architectures (Week 4), the VLM predicts a 2D bounding box. If the robot is equipped with an RGB-D camera, the 3D position of the target in the camera frame is estimated by back-projecting the box center $(u, v)$ through the depth map, using the measured depth $d$ at that pixel and the camera's intrinsic matrix $K$:
$$\mathbf{p}_{\text{cam}} = d \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$
This coordinate is transformed into the robot's base frame using standard extrinsic calibration matrices, providing the exact spatial target for the System 1 reaching policy.
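A minimal sketch of this back-projection and frame change, assuming an ideal pinhole camera with no lens distortion. The intrinsics, the pixel, the depth reading, and the camera-to-base transform `T_base_cam` are made-up values for illustration, not calibration data from any real robot.

```python
import numpy as np


def backproject(u, v, depth_m, K):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    pixel_h = np.array([u, v, 1.0])  # homogeneous pixel coordinate
    return depth_m * (np.linalg.inv(K) @ pixel_h)


def to_base_frame(p_cam, T_base_cam):
    """Apply a 4x4 homogeneous camera-to-base transform to a 3D point."""
    p_h = np.append(p_cam, 1.0)
    return (T_base_cam @ p_h)[:3]


# Hypothetical intrinsics: fx = fy = 600 px, principal point at (320, 240).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

# Hypothetical extrinsics: camera axes aligned with the base, offset by a translation.
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.1, 0.0, 0.5]

p_cam = backproject(u=380, v=240, depth_m=0.8, K=K)
p_base = to_base_frame(p_cam, T_base_cam)
print(p_cam, p_base)
```

In practice the depth at a single pixel is noisy, so systems often take a median over a small patch inside the bounding box rather than the exact center value.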
Integrating 3DGS and NeRFs: Relying purely on 2D cameras creates occlusion problems (the robot's arm blocks the view of the screwdriver). Advanced systems increasingly query 3D Gaussian Splatting (3DGS) or Neural Radiance Fields (NeRFs) instead of flat 2D images. If the scene is represented as a Language-Embedded Radiance Field (LERF), the VLM can query the 3D coordinate space directly using the CLIP text embedding of "red screwdriver," bypassing the 2D-to-3D projection step entirely and mitigating occlusion because the field is built from multiple viewpoints.
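The idea of querying a language-embedded 3D representation can be sketched as a similarity search over per-point embeddings. This is a simplified stand-in: it uses plain cosine similarity on synthetic vectors, whereas LERF computes its own relevancy score from rendered CLIP features. The function names `query_relevancy` and `locate` and all data below are hypothetical.

```python
import numpy as np


def query_relevancy(point_embeds, text_embed):
    """Cosine similarity between each 3D point's embedding and the text embedding."""
    p = point_embeds / np.linalg.norm(point_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    return p @ t


def locate(points_xyz, point_embeds, text_embed, top_k=1):
    """Return the centroid of the top_k most relevant points, plus all scores."""
    scores = query_relevancy(point_embeds, text_embed)
    best = np.argsort(scores)[-top_k:]
    return points_xyz[best].mean(axis=0), scores


# Synthetic scene: four points, each with a toy 4-D "language" embedding.
points_xyz = np.array([[0.0, 0.0, 0.0],
                       [0.5, 0.0, 0.1],
                       [0.3, 0.2, 0.05],
                       [0.9, 0.4, 0.0]])
point_embeds = np.eye(4)
# Stand-in for the CLIP text embedding of "red screwdriver".
text_embed = np.array([0.0, 0.0, 1.0, 0.0])

target_xyz, scores = locate(points_xyz, point_embeds, text_embed, top_k=1)
print(target_xyz, scores)
```

Because the query runs against the 3D field rather than the current camera image, the result does not depend on whether the object is visible from the present viewpoint, only on whether it was observed when the field was built.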
RT-2: The VLM as an End-to-End Policy
RT-2 (Robotics Transformer 2; Brohan et al., 2023) challenges the hierarchical paradigm by asking: What if we fine-tune the entire massive VLM to directly output robot actions?
RT-2 takes a pre-trained VLM (like PaLI-X) and fine-tunes its weights by co-training on robotic trajectory data. The continuous physical actions (e.g., end-effector position and rotation commands) are mapped directly into the VLM's discrete text vocabulary.
The Mathematics of Discrete Action Tokenization
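RT-2 must predict a continuous value $a$ for each action dimension. To do this autoregressively alongside text, it discretizes each dimension into $N$ uniform bins (typically $N = 256$).
For an action dimension bounded in $[a_{\min}, a_{\max}]$, the discrete token index is computed as:
$$k = \min\left(\left\lfloor \frac{a - a_{\min}}{a_{\max} - a_{\min}} \, N \right\rfloor,\; N - 1\right)$$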
The action sequence for a 6-DOF arm plus gripper is represented as 7 discrete tokens appended to the text generation sequence. The VLM is trained using standard cross-entropy loss to predict these action tokens identically to how it predicts language.
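A minimal sketch of uniform action binning and its inverse, assuming floor-based binning with decoding to the bin center. The bounds and the 7-dimensional action below are illustrative values, and the mapping from bin indices to actual vocabulary token IDs (which is model-specific) is left out.

```python
import numpy as np


def encode(a, a_min, a_max, n_bins=256):
    """Map continuous actions to integer bin indices in [0, n_bins - 1]."""
    a = np.clip(np.asarray(a, dtype=float), a_min, a_max)
    frac = (a - a_min) / (a_max - a_min)
    k = np.floor(frac * n_bins).astype(int)
    return np.minimum(k, n_bins - 1)  # a == a_max falls into the last bin


def decode(k, a_min, a_max, n_bins=256):
    """Map bin indices back to the continuous value at each bin center."""
    width = (a_max - a_min) / n_bins
    return a_min + (np.asarray(k) + 0.5) * width


# Hypothetical normalized action: 3 translation, 3 rotation, 1 gripper value.
action = np.array([0.25, -0.10, 0.40, 0.0, 0.0, 0.5, 1.0])
lo, hi = -1.0, 1.0

tokens = encode(action, lo, hi)
recovered = decode(tokens, lo, hi)
print(tokens)
print(np.abs(recovered - action).max())
```

Decoding to the bin center keeps the round-trip error within half a bin width, which is the quantity that limits precision for a purely tokenized policy.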
Quantization Error: If the arm spans a 2.0 meter workspace, quantizing it into 256 bins results in a spatial resolution of $2.0\,\text{m} / 256 \approx 7.8$ mm per bin. While acceptable for gross reaching movements, this quantization error makes delicate, high-precision tasks (like threading a needle) impractical for a pure RT-2 model without a continuous low-level correction controller.
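The arithmetic can be written out as follows, assuming uniform bins over the full range and decoding to the bin center (the symbols $\Delta$ and $e$ are introduced here for illustration):

```latex
\begin{align*}
\Delta &= \frac{a_{\max} - a_{\min}}{N} = \frac{2.0\ \text{m}}{256} \approx 7.8\ \text{mm}
  && \text{(bin width)} \\
|e|_{\max} &= \frac{\Delta}{2} \approx 3.9\ \text{mm}
  && \text{(worst-case error when decoding to the bin center)}
\end{align*}
```

Halving the error requires doubling $N$, so reaching sub-millimeter precision by binning alone would need thousands of bins per action dimension.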
GR00T: Formalizing System 1 and System 2
Nvidia's Project GR00T and models like $\pi_0$ (Physical Intelligence) represent the modern synthesis of these approaches, explicitly formalizing the handoff between the VLM and the continuous controller.
In these architectures, the VLM (System 2) processes the complex multimodal prompt and outputs a dense goal embedding vector, rather than discrete text tokens or quantized actions.
This embedding is passed down to a high-frequency Diffusion Policy or Flow Matching Policy (System 1, from Course 2). The continuous control policy treats the VLM's output as a conditioning signal.
Because System 1 predicts continuous actions rather than quantized tokens, it avoids the quantization error of RT-2. Furthermore, because System 1 only needs to evaluate a lightweight diffusion UNet (and not a 70B parameter LLM), it can sustain the 50 Hz control rate needed for dynamic physical tasks like catching moving objects.
VLM Safety Layers and Control Barrier Functions
A VLM outputting actions in a chatbot window is safe; a VLM outputting actions that swing a 20 kg titanium robot arm is dangerous. Because VLMs are known to hallucinate, they cannot be trusted with raw physical access. They must be wrapped in mathematical safety layers.
| Semantic Filter | CBF Filter | Fallback |
| --- | --- | --- |
| Auxiliary models check the VLM's high-level plans for semantic violations (e.g., "don't move sharp objects near humans"). | A mathematical safety layer that projects the VLM's commands into a safe control set using quadratic programming. | Automatic transition to a safe, low-gain damping mode if the VLM's confidence drops or the CBF is violated. |
Semantic Safety Filters: The VLM's proposed plan is checked against a safety policy. If the VLM suggests "move the knife over the human's hand," an auxiliary model flags the semantic violation.
Control Barrier Functions (CBFs): As covered extensively in Course 2 (Week 12), the VLM's output must pass through a strict mathematical filter. Let the safe physical workspace of the robot be defined as a set $\mathcal{C} = \{\mathbf{x} : h(\mathbf{x}) \ge 0\}$.
If the VLM commands a velocity $\mathbf{u}_{\text{VLM}}$ that would drive the robot arm into a table (violating $h(\mathbf{x}) \ge 0$), the CBF intercepts the command and solves a rapid Quadratic Program (QP) to project the VLM's command to the nearest safe velocity that satisfies the barrier derivative condition with gain $\alpha > 0$:
$$\mathbf{u}^{*} = \arg\min_{\mathbf{u}} \; \|\mathbf{u} - \mathbf{u}_{\text{VLM}}\|^2 \quad \text{s.t.} \quad \dot{h}(\mathbf{x}, \mathbf{u}) \ge -\alpha \, h(\mathbf{x})$$
This ensures that, however severely the VLM hallucinates, the commanded motion cannot violate the collision constraints, provided the barrier function and robot model are accurate and the QP remains feasible.
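For a single barrier constraint the projection has a closed form, which makes the behavior easy to inspect. The sketch below assumes single-integrator dynamics (the commanded velocity is the state derivative, so the barrier derivative is the gradient of $h$ dotted with the command) and a table-plane barrier equal to the end-effector height above the table. The function name `cbf_filter` and all numbers are hypothetical; a real system with several constraints would call a QP solver instead.

```python
import numpy as np


def cbf_filter(u_des, grad_h, h, alpha):
    """Solve min ||u - u_des||^2 s.t. grad_h . u >= -alpha * h (one constraint)."""
    u_des = np.asarray(u_des, dtype=float)
    grad_h = np.asarray(grad_h, dtype=float)
    bound = -alpha * h
    slack = grad_h @ u_des - bound
    if slack >= 0.0:
        return u_des  # command already satisfies the barrier condition
    # Otherwise project onto the constraint boundary along grad_h.
    return u_des - (slack / (grad_h @ grad_h)) * grad_h


# End-effector is 5 cm above the table: h = z - z_table = 0.05, grad_h = +z.
grad_h = np.array([0.0, 0.0, 1.0])
h = 0.05
alpha = 2.0

# Hypothetical VLM-derived command that dives toward the table at 0.5 m/s.
u_vlm = np.array([0.1, 0.0, -0.5])
u_safe = cbf_filter(u_vlm, grad_h, h, alpha)
print(u_safe)
```

Only the component along the barrier gradient is modified: the lateral motion is preserved while the descent rate is capped at $\alpha h$, so the arm slows as it nears the table instead of stopping outright.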
Key takeaways
- Embodied AI requires integrating VLMs into hierarchical control stacks to balance deep semantic reasoning with high-frequency physical stabilization.
- VLMs bridge language to physics by predicting explicit 3D spatial coordinates, increasingly leveraging NeRFs and 3DGS to bypass 2D occlusion.
- Architectures like RT-2 attempt to run end-to-end by tokenizing continuous actions into the LLM vocabulary, suffering from unavoidable quantization error.
- Modern architectures (like GR00T/$\pi_0$) resolve this by using the VLM as a "System 2" semantic planner that outputs continuous embeddings to condition a "System 1" high-frequency Flow Matching or Diffusion policy.
- Finally, because VLMs are prone to hallucination, their outputs must be mathematically filtered through hard constraints like Control Barrier Functions (CBFs) to guarantee physical safety.
Conceptual questions
Question 1: RT-2 Quantization
- RT-2 Quantization Mathematics: An RT-2 style model is controlling a robot arm. The Z-axis velocity is bounded between m/s and is quantized into uniform bins. The VLM predicts bin , which corresponds to m/s. However, due to the softmax distribution over the tokens, the VLM is slightly uncertain, and the actual physical target requires a velocity of m/s. Calculate the exact mathematical quantization error range. Explain why this specific error prevents an RT-2 model from successfully inserting a 1mm peg into a 1.2mm hole without relying on passive mechanical compliance.
Question 2: System 1 vs System 2
- System 1 vs System 2 Handoff: In the GR00T architecture, the VLM (System 2) outputs a goal embedding which conditions a Diffusion Policy (System 1). Assume the VLM operates at 1 Hz and the Diffusion Policy operates at 50 Hz. Describe the exact temporal flow of data during a 2-second maneuver where a human suddenly pushes the target object 10 cm to the right at seconds. How does the Diffusion Policy react instantly (before the VLM updates at ), and what specific physical state variables must the Diffusion Policy have access to for this to work?
Question 3: Sim2Real and NeRF
- Sim2Real and NeRF Integration: A robotic manipulation policy trained purely in IsaacSim struggles to grasp objects in the real world because the real-world camera feed contains highly complex lighting, shadows, and occlusions not present in the simulator. You integrate a real-time 3D Gaussian Splatting (3DGS) pipeline. How does explicitly passing a 3D semantic point cloud (rather than a 2D RGB image) into the VLM dramatically reduce the visual Covariate Shift (Week 9) between the simulated training data and the real-world deployment?
Question 4: CBF Safety
- CBF Safety Interception: A VLM commands a robot arm to move linearly across a table to grab an apple. The VLM hallucinates the depth of the apple and outputs a velocity vector $\mathbf{u}_{\text{VLM}}$ that points directly through a glass vase. A Control Barrier Function (CBF) is monitoring the system. Mathematically describe how the CBF's Quadratic Program modifies $\mathbf{u}_{\text{VLM}}$. Does the robot stop completely, or does it dynamically curve around the vase? What is the tradeoff between setting a very high $\alpha$ value (aggressive approach) versus a low $\alpha$ value (conservative approach) in the CBF constraint $\dot{h}(\mathbf{x}, \mathbf{u}) \ge -\alpha \, h(\mathbf{x})$?
Question 5: Language Relabeling
- Language Relabeling Bias: To acquire more robotics data, a research team uses an off-the-shelf VLM to watch 10,000 hours of teleoperated robot video and automatically generate text instructions (Language Relabeling). The VLM has a known spurious correlation (from Week 9): it labels any video containing a sponge as "cleaning the table," even if the robot is just moving the sponge. Analyze the downstream effect on the RT-2 policy trained on this relabeled data. If a human operator later commands the robot to "Move the sponge to the bin," what unintended physical action is the robot highly likely to execute, and why?
Looking ahead
We have deployed the VLM into the physical world. The final hurdle is ensuring that this deployment is ethical, unbiased, and rigorously aligned with human values. The failures of physical AI systems are rarely random hardware glitches; they are often the direct manifestation of societal biases encoded deep within the massive datasets.
Week 13: Bias, Fairness, and Safety in VLMs. We examine representation bias in web-scale datasets (like LAION), stereotype amplification in generative models, the mathematical mechanisms of RLHF and RLAIF alignment for multimodal systems, and the content moderation pipelines necessary for responsible real-world deployment.
Further reading
- Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. CoRL. (End-to-end action tokenization).
- Driess, D., et al. (2023). PaLM-E: An Embodied Multimodal Language Model. ICML. (Continuous state injection).
- Kerr, J., et al. (2023). LERF: Language Embedded Radiance Fields. ICCV. (Grounding VLMs in 3D NeRF space).
- Physical Intelligence. (2024). $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control. (System 1 / System 2 formalization with Flow Matching).