Week 12: Vision-Language Models for Robotics

Purpose of this lecture#

Vision-Language Models become most consequential when they are embodied, that is, when their outputs drive physical motion in the real world. Robotics is among the most demanding applications of VLMs: the system must perceive an unstructured physical environment, understand abstract natural language instructions, and execute continuous motor commands. All of this must occur under hard real-time latency constraints, partial observability, and strict safety requirements.

This lecture bridges the multimodal architectures developed in Course 4 with the physical control and safety theories established in Course 2 (Modern Robot Learning). We examine how VLMs are integrated into hierarchical control pipelines, the mathematics of discrete action tokenization (RT-2), the "System 1 vs. System 2" paradigm (GR00T), and the integration of 3D spatial representations (NeRFs/3DGS) into the VLM control loop.

Hierarchical robot control architecture#

Modern robot control operates simultaneously across vastly different timescales. Running a 70B-parameter VLM at the 1000 Hz rate needed to compute joint torques is unnecessary, because semantic goals change far more slowly than torques, and computationally infeasible on current hardware. Instead, embodied VLMs are integrated into a hierarchical control stack:

1. Semantic Task Level (System 2): Operating at roughly 0.5 to 2 Hz, a massive VLM (e.g., LLaVA, PaLM-E) sits in the cloud or on a heavy local server. It receives a natural language instruction ("Make a cup of tea") and a streaming visual observation (RGB video or 3D point cloud). It performs deep reasoning and outputs a high-level sequence of semantic sub-goals ("locate mug", "grasp handle", "navigate to kettle").

2. Kinematic Skill Level (System 1): Operating at 10 to 50 Hz directly on the robot's edge-compute board (e.g., NVIDIA Jetson). This layer receives the semantic sub-goal from the VLM and executes a parameterized, continuous control policy. As explored in Course 2, this is frequently a Diffusion Policy or an Action Chunking Transformer (ACT) that maps the VLM's instruction to specific end-effector velocities $(v_x, v_y, v_z, \omega_x, \omega_y, \omega_z)$.

3. Reactive Control Level: Operating at 1000+ Hz. Low-level operational-space controllers (like PD controllers) track the desired velocities from the skill layer, reject physical disturbances (e.g., the mug being heavier than expected), and enforce strict joint safety limits.

This hierarchy decouples slow, deliberate language understanding (System 2) from fast closed-loop execution (System 1 and the reactive layer beneath it).
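The rate separation between the three levels can be sketched with a single simulated 1 kHz clock. This is a minimal illustration, not a real controller: the function name `run_hierarchy`, the 1 Hz and 50 Hz defaults, and the placeholder "policy" and "PD" arithmetic are all assumptions chosen to fall inside the ranges quoted above.

```python
from collections import Counter

BASE_HZ = 1000  # reactive loop rate; one tick is 1 ms


def run_hierarchy(duration_s, planner_hz=1, skill_hz=50):
    """Count how often each layer runs when all are driven by one 1 kHz clock."""
    planner_every = BASE_HZ // planner_hz  # ticks between VLM sub-goal updates
    skill_every = BASE_HZ // skill_hz      # ticks between skill-policy updates
    calls = Counter()
    subgoal, velocity_cmd = None, 0.0

    for tick in range(int(duration_s * BASE_HZ)):
        if tick % planner_every == 0:
            # Placeholder for a slow VLM call that returns a semantic sub-goal.
            subgoal = f"subgoal@{tick / BASE_HZ:.1f}s"
            calls["planner"] += 1
        if tick % skill_every == 0:
            # Placeholder for the skill policy: sub-goal -> velocity command.
            velocity_cmd = 0.1 if subgoal else 0.0
            calls["skill"] += 1
        # The reactive layer runs every tick and reuses the latest command.
        _torque = 5.0 * velocity_cmd  # placeholder for a PD tracking law
        calls["reactive"] += 1
    return dict(calls)


print(run_hierarchy(2.0))
```

In a deployed system the three layers would typically run as separate processes or threads on different hardware, with each faster layer reading the most recent output of the slower one rather than waiting for it.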

Language-Conditioned Spatial Grounding#

To pass instructions down the hierarchy, the VLM must translate abstract nouns into explicit physical coordinates.

2D Grounding to 3D Targets: If the VLM commands "grasp the red screwdriver," it must first locate the object. Using grounding architectures (Week 4), the VLM predicts a 2D bounding box. If the robot is equipped with an RGB-D camera, the target's 3D position in the camera frame is estimated by back-projecting the box center $(u, v)$ using the depth value $d(u,v)$ (measured along the optical axis) and the camera's intrinsic matrix $K$:

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \cdot d(u, v)$$

This $(X, Y, Z)$ coordinate is transformed into the robot's base frame using the camera's extrinsic calibration, providing the estimated spatial target for the System 1 reaching policy. Its accuracy is limited by depth-sensor noise and calibration error.
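The back-projection and frame change above can be written out in a few lines. This sketch assumes a zero-skew pinhole camera, so that $K^{-1}$ reduces to subtracting the principal point and dividing by the focal lengths; the intrinsics, the pixel, and the camera-to-base transform in the example are made-up values, and the function names are hypothetical.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pixel (u, v) with depth d -> 3D point in the camera frame.

    Equivalent to d * K^{-1} [u, v, 1]^T for a zero-skew intrinsic matrix K.
    depth_m must be the distance along the optical axis, not the ray length.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def to_base_frame(p_cam, T_base_cam):
    """Apply a 4x4 homogeneous camera-to-base transform (nested row-major lists)."""
    p = (*p_cam, 1.0)
    return tuple(sum(T_base_cam[r][c] * p[c] for c in range(4)) for r in range(3))


# Illustrative values only: a 640x480 camera and a pure-translation extrinsic.
p_cam = backproject(u=380, v=240, depth_m=1.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
T_base_cam = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.2],
    [0.0, 0.0, 0.0, 1.0],
]
print(p_cam, to_base_frame(p_cam, T_base_cam))
```

In practice the depth at a single pixel is noisy, so systems often take a median over a small patch inside the bounding box rather than the exact center value.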

Integrating 3DGS and NeRFs: Relying on a single 2D camera view creates occlusion problems (for example, the robot's arm blocks the view of the screwdriver). Advanced systems increasingly query a 3D scene representation, such as 3D Gaussian Splatting (3DGS) or a Neural Radiance Field (NeRF), instead of flat 2D images. If the scene is represented as a Language-Embedded Radiance Field (LERF), the system can query 3D space directly with the CLIP text embedding of "red screwdriver," skipping the 2D-to-3D projection step. Because the field is built from many viewpoints, the target can remain localizable even when the current camera view is occluded, provided the scene has not changed since it was captured.
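The idea of querying 3D space with a text embedding can be illustrated with a heavily simplified toy. The sketch below assumes the scene has already been reduced to a handful of 3D points that each carry an embedding, and it uses hand-made 3-dimensional vectors in place of real CLIP embeddings; an actual language-embedded field is continuous and its relevancy computation is more involved. The names `cosine` and `query_field` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_field(points, text_embedding):
    """Return (xyz, score) of the 3D point whose embedding best matches the text."""
    best = max(points, key=lambda p: cosine(p["embedding"], text_embedding))
    return best["xyz"], cosine(best["embedding"], text_embedding)


# Toy scene: three points with made-up embeddings (stand-ins for CLIP features).
scene = [
    {"xyz": (0.4, 0.1, 0.02), "embedding": [0.9, 0.1, 0.0]},    # "mug"
    {"xyz": (0.2, -0.3, 0.05), "embedding": [0.1, 0.9, 0.2]},   # "screwdriver"
    {"xyz": (0.6, 0.2, 0.10), "embedding": [0.0, 0.2, 0.9]},    # "kettle"
]
text = [0.0, 1.0, 0.1]  # stand-in for the text embedding of "red screwdriver"
print(query_field(scene, text))
```

The returned coordinate is already three-dimensional, which is why no depth lookup is needed in this pathway.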

RT-2: The VLM as an End-to-End Policy#

RT-2 (Robotics Transformer 2; Brohan et al., 2023) boldly challenges the hierarchical paradigm by asking: What if we fine-tune the entire massive VLM to directly output robot actions?

RT-2 starts from a pre-trained VLM (such as PaLI-X or PaLM-E) and co-fine-tunes it on robot trajectory data together with web-scale vision-language data. The continuous physical actions (e.g., end-effector position and rotation commands) are mapped directly into the VLM's discrete text vocabulary.

The Mathematics of Discrete Action Tokenization#

RT-2 must predict continuous variables $a \in \mathbb{R}^n$ . To do this autoregressively alongside text, it discretizes the continuous space into $B$ uniform bins (typically $B=256$ ).

For an action dimension $x$ strictly bounded between $[x_\text{min}, x_\text{max}]$ , the discrete token index $k \in \{0, 1, \dots, 255\}$ is computed as:

$$k = \text{round}\left( \frac{x - x_\text{min}}{x_\text{max} - x_\text{min}} \times 255 \right)$$

The action for a 6-DOF arm plus gripper is therefore represented as 7 discrete tokens (three translation, three rotation, one gripper) appended to the text generation sequence; RT-2 additionally emits a token that signals episode termination. The VLM is trained with the standard cross-entropy loss, predicting these action tokens exactly as it predicts language tokens.
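A round trip through the tokenization formula makes the scheme concrete. The sketch below implements the binning equation as written, plus the inverse mapping back to a continuous value; the function names, the clipping of out-of-range inputs, and the example bounds are assumptions for illustration and are not taken from the RT-2 codebase.

```python
def encode(x, x_min, x_max, bins=256):
    """Map a continuous value to a token index in {0, ..., bins - 1}."""
    x = min(max(x, x_min), x_max)  # clip so the index stays in range
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    return round((x - x_min) / (x_max - x_min) * (bins - 1))


def decode(k, x_min, x_max, bins=256):
    """Map a token index back to the continuous value it represents."""
    return x_min + k / (bins - 1) * (x_max - x_min)


def encode_action(action, bounds):
    """Tokenize each dimension of an action vector with its own bounds."""
    return [encode(x, lo, hi) for x, (lo, hi) in zip(action, bounds)]


# Illustrative 7-dimensional action: 3 translation, 3 rotation, 1 gripper.
bounds = [(-1.0, 1.0)] * 6 + [(0.0, 1.0)]
action = [0.005, -0.25, 0.9, 0.0, 0.1, -1.0, 1.0]
tokens = encode_action(action, bounds)
recovered = [decode(k, lo, hi) for k, (lo, hi) in zip(tokens, bounds)]
print(tokens)
print(max(abs(a - r) for a, r in zip(action, recovered)))  # round-trip error
```

The round-trip error is bounded by half the spacing between adjacent tokens, which is the quantity that limits precision for a purely token-based policy.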

Quantization Error: If an action dimension spans a 2.0 meter workspace, adjacent tokens are $\frac{200 \text{ cm}}{255} \approx 0.78 \text{ cm}$ apart, so rounding to the nearest token introduces an error of up to about $0.39 \text{ cm}$. While acceptable for gross reaching movements, this quantization error makes delicate, high-precision tasks (like threading a needle) impractical for a pure RT-2 model without a continuous low-level correction controller.

GR00T: Formalizing System 1 and System 2#

NVIDIA's Project GR00T and models like $\pi_0$ (Physical Intelligence) represent the modern synthesis of these approaches, explicitly formalizing the handoff between the VLM and the continuous controller.

In these architectures, the VLM (System 2) processes the complex multimodal prompt and outputs a dense Goal Embedding Vector $e_\text{goal}$ , rather than discrete text tokens or quantized actions.

This embedding is passed down to a high-frequency Diffusion Policy or Flow Matching Policy (System 1, from Course 2). The continuous control policy $\pi(a \mid s, e_\text{goal})$ treats the VLM's output as a strict geometric conditioning signal.

$$\text{System 2 (VLM): } e_\text{goal} = f_\text{VLM}(\text{Image}_{1:t}, \text{Language})$$

$$\text{System 1 (Diffusion): } \hat{a}_t = \text{Denoiser}(a_\text{noisy}, \text{RobotState}_t, e_\text{goal})$$

Because System 1 predicts continuous actions in $\mathbb{R}^n$ rather than quantized tokens, it avoids the quantization error of RT-2. Furthermore, because System 1 only needs to evaluate a comparatively lightweight denoising network (for example, a diffusion UNet or transformer, not a 70B-parameter LLM), it can meet the 50 Hz control rate needed for dynamic physical tasks like catching moving objects.
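The sampling side of such a System 1 policy can be sketched with a toy flow-matching sampler. Everything here is an assumption made for illustration: a trained network is replaced by an analytic velocity field that transports noise to a single target action, the goal embedding $e_\text{goal}$ is reduced to a 3-vector offset relative to the object, and the robot state is just the end-effector and object positions. The names `target_action` and `sample_action` are hypothetical.

```python
import random


def target_action(ee_pos, obj_pos, e_goal, gain=2.0, v_max=0.5):
    """Stand-in for what a trained policy encodes: move toward obj_pos + e_goal."""
    return [
        max(-v_max, min(v_max, gain * (o + g - p)))
        for p, o, g in zip(ee_pos, obj_pos, e_goal)
    ]


def sample_action(ee_pos, obj_pos, e_goal, steps=10, rng=random):
    """Euler-integrate da/dtau = (a_star - a) / (1 - tau) from tau = 0 to 1.

    This is the straight-line flow-matching velocity field for a point target;
    a real policy would replace it with a learned network.
    """
    a_star = target_action(ee_pos, obj_pos, e_goal)
    a = [rng.gauss(0.0, 1.0) for _ in a_star]  # start from noise (a_noisy)
    for k in range(steps):
        tau = k / steps
        a = [ai + (1.0 / steps) * (si - ai) / (1.0 - tau) for ai, si in zip(a, a_star)]
    return a


e_goal = [0.0, 0.0, 0.05]  # fixed between slow VLM updates: "hover 5 cm above"
ee = [0.0, 0.0, 0.2]
print(sample_action(ee, [0.3, 0.0, 0.0], e_goal))  # object at original position
print(sample_action(ee, [0.3, 0.1, 0.0], e_goal))  # object pushed 10 cm in y
```

The second call returns a different velocity even though `e_goal` is unchanged, because the fast loop reads the object position from the current state on every step. That is the mechanism that lets System 1 react between System 2 updates.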

VLM Safety Layers and Control Barrier Functions#

A VLM outputting actions in a chatbot window is safe; a VLM outputting actions that swing a 20 kg titanium robot arm is dangerous. Because VLMs are known to hallucinate, they cannot be trusted with raw physical access. They must be wrapped in mathematical safety layers.

| Semantic Filter | CBF Filter | Fallback |
| --- | --- | --- |
| Auxiliary models check the VLM's high-level plans for semantic violations (e.g., "don't move sharp objects near humans"). | A mathematical safety layer that projects the VLM's commands into a safe control set using quadratic programming. | Automatic transition to a safe, low-gain damping mode if the VLM's confidence drops or the CBF is violated. |

Semantic Safety Filters: The VLM's proposed plan is checked against a safety policy. If the VLM suggests "move the knife over the human's hand," an auxiliary model flags the semantic violation.

Control Barrier Functions (CBFs): As covered extensively in Course 2 (Week 12), the VLM's output must pass through a strict mathematical filter. Let the safe physical workspace of the robot be defined as a set $\mathcal{C} = \{x \in \mathbb{R}^n \mid h(x) \geq 0\}$ .

If the VLM commands a velocity $u_\text{VLM}$ that would drive the robot arm into a table (violating $h(x) \geq 0$), the CBF intercepts the command and solves a rapid Quadratic Program (QP) to project the VLM's command to the nearest safe velocity $u^*$ that satisfies the barrier derivative:

$$u^* = \arg\min_u \| u - u_\text{VLM} \|^2 \quad \text{s.t.} \quad \dot{h}(x, u) \geq -\gamma h(x)$$

Provided the barrier function $h$ and the dynamics model are accurate and the QP remains feasible, this keeps the robot inside $\mathcal{C}$ no matter how severely the VLM hallucinates: the guarantee depends on the filter, not on the VLM.
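With a single barrier constraint, the QP has a closed-form solution, which makes the projection easy to inspect. The sketch below assumes the simplest possible setting: velocity-controlled point dynamics $\dot{x} = u$ and one spherical obstacle with $h(x) = \|x - c\|^2 - r^2$. The function name `cbf_filter` and the numbers in the example are illustrative assumptions; a real manipulator would have joint-space dynamics, many constraints, and a numerical QP solver.

```python
def cbf_filter(x, u_des, center, radius, gamma):
    """Closed-form solution of the one-constraint CBF-QP for dynamics x_dot = u.

    h(x) = ||x - center||^2 - radius^2 >= 0 keeps x outside a sphere.
    Constraint: grad_h . u >= -gamma * h(x), with grad_h = 2 (x - center).
    Assumes x != center so that grad_h is nonzero.
    """
    diff = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in diff) - radius ** 2
    grad = [2.0 * d for d in diff]
    slack = sum(g * u for g, u in zip(grad, u_des)) + gamma * h
    if slack >= 0.0:
        return list(u_des)  # commanded velocity already satisfies the constraint
    norm_sq = sum(g * g for g in grad)
    # Project onto the constraint boundary: the smallest change that restores it.
    return [u - slack / norm_sq * g for u, g in zip(u_des, grad)]


# Robot at the origin, obstacle of radius 0.5 centered 1 m ahead along x.
u_safe = cbf_filter(x=[0.0, 0.0], u_des=[1.0, 0.2], center=[1.0, 0.0],
                    radius=0.5, gamma=1.0)
print(u_safe)
```

Only the velocity component pointing toward the obstacle is reduced; the sideways component of the command passes through unchanged, which is what produces a sliding motion around the obstacle instead of a full stop.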

Key takeaways#

Embodied AI requires integrating VLMs into hierarchical control stacks to balance deep semantic reasoning with high-frequency physical stabilization. VLMs bridge language to physics by grounding instructions in explicit 3D coordinates, increasingly leveraging NeRFs and 3DGS to mitigate 2D occlusion. Architectures like RT-2 run end-to-end by tokenizing continuous actions into the LLM vocabulary, at the cost of quantization error. Modern architectures (like GR00T/$\pi_0$) address this by using the VLM as a "System 2" semantic planner that outputs continuous embeddings to condition a "System 1" high-frequency Flow Matching or Diffusion policy. Finally, because VLMs are prone to hallucination, their outputs must be filtered through hard constraints like Control Barrier Functions (CBFs), whose safety guarantees hold as long as the barrier function and dynamics model are accurate.

Conceptual questions#

Question 1: RT-2 Quantization#

RT-2 Quantization Mathematics: An RT-2 style model is controlling a robot arm. The Z-axis velocity is bounded between $[-1.0, 1.0]$ m/s and is quantized into $B=256$ uniform bins. The VLM predicts bin $k=128$, which decodes to approximately $0.0$ m/s (about $0.0039$ m/s under the tokenization formula above). However, the actual physical target requires a velocity of $0.005$ m/s, which does not coincide with any bin. Calculate the worst-case quantization error. Explain why this error prevents an RT-2 model from successfully inserting a 1 mm peg into a 1.2 mm hole without relying on passive mechanical compliance.

Question 2: System 1 vs System 2#

System 1 vs System 2 Handoff: In the GR00T architecture, the VLM (System 2) outputs a goal embedding $e_\text{goal}$ which conditions a Diffusion Policy (System 1). Assume the VLM operates at 1 Hz and the Diffusion Policy operates at 50 Hz. Describe the exact temporal flow of data during a 2-second maneuver where a human suddenly pushes the target object 10 cm to the right at $t=0.5$ seconds. How does the Diffusion Policy react instantly (before the VLM updates $e_\text{goal}$ at $t=1.0$ ), and what specific physical state variables $s_t$ must the Diffusion Policy have access to for this to work?

Question 3: Sim2Real and NeRF#

Sim2Real and NeRF Integration: A robotic manipulation policy trained purely in IsaacSim struggles to grasp objects in the real world because the real-world camera feed contains highly complex lighting, shadows, and occlusions not present in the simulator. You integrate a real-time 3D Gaussian Splatting (3DGS) pipeline. How does explicitly passing a 3D semantic point cloud (rather than a 2D RGB image) into the VLM dramatically reduce the visual Covariate Shift (Week 9) between the simulated training data and the real-world deployment?

Question 4: CBF Safety#

CBF Safety Interception: A VLM commands a robot arm to move linearly across a table to grab an apple. The VLM hallucinates the depth of the apple and outputs a velocity vector $u_\text{VLM}$ that points directly through a glass vase. A Control Barrier Function (CBF) is monitoring the system. Mathematically describe how the CBF's Quadratic Program modifies $u_\text{VLM}$. Does the robot stop completely, or does it dynamically curve around the vase? What is the tradeoff between setting a very high $\gamma$ value (aggressive approach) versus a low $\gamma$ value (conservative approach) in the CBF constraint $\dot{h}(x, u) \geq -\gamma h(x)$?

Question 5: Language Relabeling#

Language Relabeling Bias: To acquire more robotics data, a research team uses an off-the-shelf VLM to watch 10,000 hours of teleoperated robot video and automatically generate text instructions (Language Relabeling). The VLM has a known spurious correlation (from Week 9): it labels any video containing a sponge as "cleaning the table," even if the robot is just moving the sponge. Analyze the downstream effect on the RT-2 policy trained on this relabeled data. If a human operator later commands the robot to "Move the sponge to the bin," what unintended physical action is the robot mathematically highly likely to execute, and why?

Solutions

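RT-2 quantization. The tokenization formula maps the 2.0 m/s range onto 255 steps, so adjacent tokens differ by $2.0/255 \approx 0.0078$ m/s and the worst-case rounding error is $\pm 0.0039$ m/s. The required $0.005$ m/s falls between bin 128 ($\approx 0.0039$ m/s) and bin 129 ($\approx 0.0118$ m/s), so it cannot be commanded exactly. A 1 mm peg in a 1.2 mm hole leaves only 0.1 mm of radial clearance, and a velocity error of up to 3.9 mm/s consumes that clearance in roughly 26 ms, which is far shorter than one control step of a VLM running at a few hertz. The policy therefore cannot hold sub-millimeter alignment on its own and must rely on passive mechanical compliance to absorb the residual error.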
System 1/2 handoff. The VLM sets the goal embedding at $t=0$ and will not refresh it until $t=1.0$ . When the object is pushed at $t=0.5$ , the 50 Hz Diffusion Policy reacts within ~20 ms because it is conditioned on real-time state $s_t$ — it needs the current end-effector pose and the object's position to re-plan the local action toward the moved target, even while the high-level goal stays stale until the next VLM tick.
Sim2Real and 3DGS. Passing a 3D semantic point cloud instead of raw 2D RGB strips out appearance nuisances — lighting, shadows, textures — that differ most between sim and real. Geometry and semantics are far more invariant across the domain gap, so the policy's input distribution barely shifts, sharply reducing the visual covariate shift relative to 2D pixels.
CBF interception. The CBF solves a quadratic program for the action closest to the commanded one subject to $\dot h(x,u) \ge -\gamma h(x)$, so it modifies the velocity only as much as needed to keep the constraint satisfied. In effect it reduces the velocity component pointing toward the vase and leaves the tangential component intact, so the arm curves around the vase rather than stopping. It halts only if no safe action exists, or if the command points exactly along the obstacle normal and therefore has no tangential component to preserve. A high $\gamma$ permits an aggressive approach (lets $h$ shrink fast, smaller margin); a low $\gamma$ is conservative (keeps a large margin, wider detour). The tradeoff is speed/task completion versus safety margin.
Language-relabeling bias. The VLM mislabels every sponge clip as "cleaning the table," so the training set ties the sponge — and the word "clean" — to wiping motions. Trained on this, when told "Move the sponge to the bin," the policy is biased to execute a wiping/cleaning motion (its dominant learned sponge behavior) instead of pick-and-place, because the relabeling injected a spurious sponge→clean correlation.

Looking ahead#

We have deployed the VLM into the physical world. The final hurdle is ensuring that this deployment is ethical, unbiased, and rigorously aligned with human values. The failures of physical AI systems are rarely random hardware glitches; they are often the direct manifestation of societal biases encoded deep within the massive datasets.

Week 13: Bias, Fairness, and Safety in VLMs. We examine representation bias in web-scale datasets (like LAION), stereotype amplification in generative models, the mathematical mechanisms of RLHF and RLAIF alignment for multimodal systems, and the content moderation pipelines necessary for responsible real-world deployment.
