
Week 10: Vision–Language–Action Models

✦Learning Outcomes
  • Connect vision-language pretraining to robot control
  • Analyze tokenization and action representations in VLAs
  • Implement language-guided manipulation policies
  • Connect VLMs to Course 4 concepts
◆Prerequisites
  • Week 9: Diffusion and flow matching policies
  • Week 8: ACT (Action Chunking with Transformers) foundations
  • Basic vision-language models

Recommended: Review Week 9 before proceeding.

Purpose of this lecture

The previous two lectures developed the action generation machinery: sequence modeling with ACT (Action Chunking with Transformers), and generative trajectory production with diffusion and flow matching. These methods produce high-quality manipulation behaviors but condition primarily on low-level sensory observations: RGB images and proprioceptive state. They do not reason about tasks, objects, or goals in a generalizable semantic sense. A diffusion policy trained on "pick up the red cup" demonstrations has no mechanism to generalize to "pick up the blue bottle" without separate training.

Vision–Language–Action (VLA) models close this gap by integrating pretrained vision-language representations into the action generation pipeline. By building on foundation models that already understand objects, spatial relationships, and natural language instructions, VLA models can follow novel task descriptions, generalize across objects and environments, and leverage the vast semantic knowledge encoded in large-scale vision-language pretraining. This lecture examines how VLAs are architected, trained, and scaled, with particular attention to the specific models that represent the current frontier: $\pi_0$, GR00T, and related systems.


From modular pipelines to end-to-end VLAs

Traditional robotic systems separate perception, reasoning, and control into independent modules with clean interfaces: a vision system detects objects and estimates poses, a planner determines the action sequence to achieve the goal, and a controller executes the motion. Each module is designed and tested independently, which provides interpretability and modular debugging but introduces compounding errors at module boundaries and prevents the learning of joint representations that capture task-relevant visual features.

End-to-end VLAs replace this modular stack with a unified model that maps $(\text{image}, \text{language instruction}, \text{proprioception}) \to \text{actions}$. The advantage is not merely architectural elegance: when perception and action generation are trained jointly on task-execution data, the visual representation learned by the model is shaped by what is relevant for action generation rather than by what is relevant for generic object detection. A VLA learns to attend to the parts of the image that matter for grasping, not the parts that matter for segmentation benchmarks.

The prerequisite for this to work is a large, diverse dataset of aligned (observation, instruction, action) triplets. The Open X-Embodiment dataset, DROID, the robot data used to train RT-2 and its successors, and other RLDS-format datasets provide this at scales ranging from thousands of trajectories to roughly a million. The joint training of perception and action on these datasets is what enables VLA generalization.


Architecture: multimodal tokenization and fusion

From pixels and text to tokens

The transformer architecture that underlies modern VLAs processes sequences of token embeddings. Converting multimodal inputs (images, text, proprioception) into a unified token sequence is the central architectural challenge.

Image tokenization extracts a sequence of patch embeddings from each camera frame. A ViT (Vision Transformer) backbone divides a $224 \times 224$ image into non-overlapping patches of $16 \times 16$ pixels, giving a $14 \times 14$ grid, and maps each patch to a $d$-dimensional embedding vector through a linear projection. For a standard ViT-B/16, a single image therefore produces $14 \times 14 = 196$ patch tokens; with 2 cameras, the observation at each timestep contributes 392 visual tokens. In models with longer visual contexts (multiple frames, more cameras), the visual token count dominates the total sequence length, motivating spatial and temporal compression through pooling or cross-attention summarization.
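The token counts above follow directly from the patch grid. A minimal sketch of that arithmetic, assuming square images and non-overlapping patches (the function names are illustrative, not from any library):

```python
def num_patch_tokens(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches cut from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def visual_tokens_per_step(num_cameras: int, image_size: int = 224,
                           patch_size: int = 16) -> int:
    """Visual tokens contributed by all cameras at one timestep."""
    return num_cameras * num_patch_tokens(image_size, patch_size)


tokens_one_camera = num_patch_tokens()          # 14 * 14 patches
tokens_two_cameras = visual_tokens_per_step(2)  # two cameras, same resolution
print(tokens_one_camera, tokens_two_cameras)    # 196 392
```

Doubling the image side length to 448 with the same patch size would quadruple the count, which is why resolution is usually the first thing traded away when the sequence gets too long.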

Language tokenization uses a standard BPE or SentencePiece tokenizer to convert the task instruction into a sequence of language tokens. In VLAs built on pretrained language models (LLaMA, Gemma, PaLM), these tokens are processed by the frozen or partially frozen language model to produce instruction embeddings that carry the semantic content of the task.

Visual token compression: the Perceiver Resampler

Two camera images at ViT-B/16 resolution produce $2 \times 196 = 392$ visual tokens per timestep. When the VLA's transformer processes a sequence that includes language tokens (20–50 tokens), proprioceptive tokens (1–4 tokens), and multiple timesteps of visual history, the total sequence length can easily exceed 1000 tokens. Transformer self-attention scales as $O(L^2)$ in sequence length $L$, making this computationally prohibitive for real-time inference.
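A rough token budget makes the quadratic growth concrete. The sketch below assumes 30 language tokens and 2 proprioceptive tokens per timestep (values picked from the ranges quoted above, not from any specific model), and counts query-key pairs as a proxy for attention cost:

```python
def sequence_length(history_steps: int, visual_per_step: int = 392,
                    proprio_per_step: int = 2, language_tokens: int = 30) -> int:
    """Total tokens: per-step observation tokens plus one shared instruction."""
    return history_steps * (visual_per_step + proprio_per_step) + language_tokens


def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention (the O(L^2) term)."""
    return seq_len * seq_len


lengths = [sequence_length(h) for h in (1, 2, 3)]
relative_cost = [attention_pairs(n) / attention_pairs(lengths[0]) for n in lengths]
print(lengths)  # [424, 818, 1212]
print([round(c, 2) for c in relative_cost])
```

Under these assumptions, three timesteps of history already pass 1000 tokens and cost roughly eight times the attention of a single step.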

The Perceiver Resampler (introduced in Flamingo; Alayrac et al., 2022, and adopted by Flamingo-derived robot policies) compresses the high-dimensional visual token sequence into a fixed-length set of $K \ll 196$ summary tokens. It does this through learned cross-attention: $K$ learned query vectors $Q \in \mathbb{R}^{K \times d}$ attend to the full set of patch embeddings as keys and values:

$$V_{\text{compressed}} = \text{softmax}\!\left(\frac{QE^\top}{\sqrt{d}}\right) E$$

where $E \in \mathbb{R}^{196 \times d}$ is the matrix of ViT patch embeddings. The output $V_{\text{compressed}} \in \mathbb{R}^{K \times d}$ is a set of $K$ summary tokens that aggregate information from all 196 patches through attention. Typical implementations use $K = 64$ or $K = 32$, reducing the visual token count by factors of roughly 3 to 6 while retaining the spatially distributed information through the attention mechanism. Crucially, the Perceiver Resampler is the only module that sees the full high-resolution patch embeddings; the language model backbone downstream only processes the compressed $K$-token summary, keeping its attention computation tractable.
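The compression formula can be written out directly. The following is a minimal single-head sketch in plain Python that implements exactly the expression above, with random vectors standing in for the learned queries and the ViT patch embeddings. Production resamplers typically add separate key/value projections, multiple heads, and several stacked layers, none of which is shown here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(queries, patches):
    """Compute softmax(Q E^T / sqrt(d)) E for Q (K x d) and E (N x d)."""
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    summary = []
    for q in queries:
        scores = [scale * sum(qi * ei for qi, ei in zip(q, e)) for e in patches]
        weights = softmax(scores)  # one attention weight per patch
        summary.append([sum(w * e[j] for w, e in zip(weights, patches))
                        for j in range(d)])
    return summary


rng = random.Random(0)
d, n_patches, k = 8, 196, 32  # small d to keep the example fast
E = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_patches)]
Q = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]  # stand-in for learned queries
V = resample(Q, E)
print(len(V), len(V[0]))  # 32 8
```

Each summary token is a convex combination of patch embeddings, so the output length depends only on the number of queries, not on how many patches (or cameras) feed in.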

Proprioception tokenization maps the robot's joint state vector (positions, velocities, torques) to a fixed-length embedding through a learned linear projection. Proprioceptive tokens are typically injected into the sequence at each timestep, providing the policy with accurate knowledge of its current configuration.

Cross-attention as the fusion mechanism

Cross-attention is the conceptual basis for fusing visual and language representations in VLAs. The language tokens serve as queries, and the visual patch tokens serve as keys and values: each language token attends to the image patches most relevant to its semantic content, producing a language embedding that is grounded in the specific visual context.

The inverse attention — visual tokens querying language tokens — is equally important for instruction following: each image patch attends to the parts of the instruction that are most relevant to it, sharpening the visual representation around task-relevant objects (the red cup, the hinge, the target bin). This bidirectional grounding is what distinguishes cross-attention fusion from simpler feature concatenation approaches.

In practice, most VLAs use a form of prefix attention: the language embedding is prepended to the visual token sequence as a prefix, and the transformer's self-attention over the entire sequence (language + visual) achieves cross-modal grounding through the unified attention mechanism rather than separate cross-attention layers.
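One common way to realize this is a prefix-LM attention mask. The sketch below is an assumption-laden illustration, not any particular model's code: it treats the language and visual tokens as a bidirectional prefix and adds a hypothetical causal suffix (for example, generated action tokens) that can see the whole prefix.

```python
def prefix_lm_mask(prefix_len: int, suffix_len: int) -> list[list[bool]]:
    """mask[i][j] is True when token i is allowed to attend to token j."""
    total = prefix_len + suffix_len
    mask = []
    for i in range(total):
        if i < prefix_len:
            # Prefix tokens (language + visual) attend to the whole prefix.
            row = [j < prefix_len for j in range(total)]
        else:
            # Suffix tokens attend to the prefix and to earlier suffix tokens.
            row = [j <= i for j in range(total)]
        mask.append(row)
    return mask


instruction_len, visual_len, suffix_len = 3, 4, 2  # toy sizes
mask = prefix_lm_mask(instruction_len + visual_len, suffix_len)
print(mask[0][6], mask[6][0])  # True True: language and visual tokens see each other
print(mask[0][7], mask[7][0])  # False True: prefix cannot see the suffix, suffix sees prefix
```

Because every prefix token can attend to every other prefix token, language-to-vision and vision-to-language grounding both happen inside ordinary self-attention layers.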


Action generation in VLAs

The action generation head is architecturally distinct from the rest of the VLA because it must map from the rich, high-dimensional VLM (vision-language model) features to the precise, low-dimensional robot action space. Three design approaches are in active use:

Discrete action tokenization represents each action dimension as a categorical token by binning the continuous action space into $B$ equal-width intervals. Formally, a continuous action value $a_i \in [a_i^{\min}, a_i^{\max}]$ is mapped to bin index $b_i \in \{0, 1, \ldots, B-1\}$, with the upper endpoint $a_i = a_i^{\max}$ clamped into the last bin, by:

$$b_i = \min\!\left(B - 1,\; \left\lfloor \frac{a_i - a_i^{\min}}{a_i^{\max} - a_i^{\min}} \cdot B \right\rfloor\right)$$

This allows the VLA to reuse the language model's autoregressive generation machinery directly: action tokens are simply appended to the output token sequence after the language tokens, and the model generates them using the same softmax vocabulary head. This approach was used in RT-2 (Brohan et al., 2023), where a PaLM-E model generates robot actions as discretized tokens repurposed from the least-used entries in the text vocabulary.

The quantization error from this binning, when each token is decoded to the center of its bin, is bounded by half the bin width: $|\hat{a}_i - a_i| \leq \frac{a_i^{\max} - a_i^{\min}}{2B}$. For a wrist joint with range $[-\pi, \pi]$ radians and $B = 256$ bins, the worst-case quantization error is $\frac{2\pi}{512} \approx 0.012$ rad ($\approx 0.7°$), which is sufficient for most pick-and-place tasks but potentially insufficient for fine insertion tasks requiring sub-degree precision. Increasing $B$ reduces quantization error but expands the action vocabulary, making the generation task harder and increasing the probability of out-of-vocabulary tokens (tokens that do not correspond to any valid action bin) at inference. Handling out-of-vocabulary tokens requires either clamping to the nearest valid action bin or falling back to the mean action, both of which distort the action distribution, particularly near the range boundaries. These quantization and out-of-vocabulary failure modes motivate the shift to continuous generative heads for precision-critical applications.
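The binning rule and its error bound can be checked numerically. This is a minimal sketch, assuming each token is decoded to its bin center and that a value at the upper limit is clamped into the last bin; `encode_action` and `decode_action` are illustrative names.

```python
import math


def encode_action(a: float, lo: float, hi: float, bins: int) -> int:
    """Map a continuous value in [lo, hi] to a bin index in {0, ..., bins-1}."""
    frac = (a - lo) / (hi - lo)
    return min(bins - 1, max(0, math.floor(frac * bins)))


def decode_action(b: int, lo: float, hi: float, bins: int) -> float:
    """Map a bin index back to the center of its interval."""
    width = (hi - lo) / bins
    return lo + (b + 0.5) * width


lo, hi, bins = -math.pi, math.pi, 256
bound = (hi - lo) / (2 * bins)  # half the bin width

# Sweep the joint range and record the worst round-trip error.
grid = [lo + (hi - lo) * i / 10_000 for i in range(10_001)]
worst = max(abs(decode_action(encode_action(a, lo, hi, bins), lo, hi, bins) - a)
            for a in grid)

print(round(bound, 4), round(math.degrees(bound), 2))  # 0.0123 0.7
```

The measured worst-case error never exceeds the half-bin bound; it reaches it only at bin edges, including the clamped upper limit.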

Flow matching over VLM features replaces discrete action generation with a flow matching head that generates continuous action chunks conditioned on the VLM features. The $\pi_0$ model uses this approach: the VLM produces a set of features from the image, language, and proprioceptive inputs, and a separate flow matching network generates the action chunk $a_{0:K}$ conditioned on these features. This achieves the multimodal distribution modeling advantages of flow matching while grounding the generation in the full semantic context of the VLM.
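At inference, a flow matching head turns noise into an action chunk by integrating a velocity field. The sketch below shows only that sampling loop, using forward Euler steps. The learned, feature-conditioned network is replaced by a hypothetical analytic stand-in, `toy_velocity`, which points along the straight line toward a single target chunk; the step count of 10 is an arbitrary choice for illustration.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned v_theta(x, t, features).

    For a straight-line path from noise to `target`, the velocity at time
    t < 1 is (target - x) / (1 - t). A real head would predict this from
    the VLM features instead of being handed the target.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_chunk(target, steps: int = 10, seed: int = 0):
    """Integrate from Gaussian noise (t=0) to an action chunk (t=1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # flattened noisy chunk
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # forward Euler step
    return x


target_chunk = [0.10, -0.20, 0.30, 0.05, 0.00, -0.15]  # toy flattened chunk
chunk = sample_chunk(target_chunk)
print([round(c, 3) for c in chunk])
```

All dimensions of the chunk are updated together at every step, which is what lets this kind of head represent correlated multi-axis motion rather than generating one dimension at a time.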

Cross-attention action decoding trains a lightweight action decoder that cross-attends to the VLM features to produce action predictions. The decoder attends to the VLM's hidden states at each layer, conditioning the action generation on the entire depth of the VLM's processing. SMOL-VLA and related architectures use this approach, enabling small, efficient action decoders to leverage large VLM backbones without requiring the VLM to itself generate action tokens.


Generalist robot models: $\pi_0$ and GR00T

$\pi_0$ (Black et al., 2024)

$\pi_0$ is built on the PaliGemma vision-language model and uses flow matching for action generation. The architecture has three components: (1) the PaliGemma backbone, which processes images and language through a ViT image encoder and a Gemma 2B language model; (2) a proprioception encoder that maps joint state to embeddings concatenated with the VLM features; and (3) a flow matching network that generates action chunks for control at up to 50 Hz, conditioned on the combined VLM and proprioceptive features.

The key innovation in $\pi_0$ is pre-training with action data at scale, followed by task-specific fine-tuning. The model is first pretrained on a large heterogeneous dataset spanning multiple robot embodiments and tasks, learning general motion primitives and task-conditioned behavior. It is then fine-tuned on task-specific data, optionally with LoRA-style parameter-efficient adaptation. This two-stage training mirrors the pretraining plus instruction-tuning paradigm of language model alignment.
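To make "LoRA-style" concrete: the pretrained weight matrix $W$ stays frozen and only a low-rank correction $\frac{\alpha}{r} BA$ is trained. The sketch below is a generic illustration of that idea in plain Python, not the fine-tuning code of any particular VLA; the 2048-wide layer and rank 16 are assumed example sizes.

```python
def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha: float):
    """y = W x + (alpha / r) * B (A x), with A of shape r x d_in, B of shape d_out x r."""
    r = len(A)
    base = matvec(W, x)             # frozen pretrained path
    low = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + (alpha / r) * l for b, l in zip(base, low)]


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter."""
    return r * (d_in + d_out)


# Tiny numeric check: with B initialized to zero the adapter changes nothing.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]        # r = 1
B = [[0.0], [0.0]]       # zero init, so the output equals the base output
x = [1.0, 1.0]
y = lora_forward(W, A, B, x, alpha=1.0)

full = 2048 * 2048
adapter = lora_param_count(2048, 2048, 16)
print(y, full // adapter)  # [3.0, 7.0] 64
```

For a square 2048-wide projection, a rank-16 adapter trains 64 times fewer parameters than updating the full matrix, which is the main reason this style of adaptation is attractive when only a few demonstrations are available.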

$\pi_0$ demonstrates that flow matching over VLM features achieves substantially better performance than discrete tokenization on dexterous manipulation tasks, particularly those requiring force-sensitive contact (folding fabric, zipping bags) where the precision of the generated trajectory is critical.

GR00T N1 (NVIDIA, 2025)

GR00T N1 is a generalist humanoid robot policy built on a dual-system architecture: a large "System 2" model that processes observations and instructions to produce a goal embedding, and a smaller, fast "System 1" diffusion-based action decoder that generates actions conditioned on the goal embedding at high frequency. This hierarchical design mirrors the goal-conditioned options framework from RL (a high-level planner sets goals for a fast low-level controller) and provides the computational efficiency to run the full system at real-time control rates.

The System 2 model produces a goal embedding $g \in \mathbb{R}^{d_g}$ that encodes the semantic intent of the current instruction in the context of the current visual scene. The System 1 diffusion decoder then generates actions conditioned on both the current proprioceptive state $s_t$ and the goal embedding $g$:

$$a_t \sim p_\theta\!\left(a_t \;\Big|\; s_t,\; g = f_{\phi}(I_t, \ell)\right)$$

where $f_{\phi}$ is the System 2 VLM that maps the image $I_t$ and language instruction $\ell$ to the goal embedding, and $p_\theta$ is the System 1 diffusion decoder. The System 2 model runs at a low frequency (e.g., 2 Hz) to produce goal embeddings, while System 1 runs at the full robot control rate (e.g., 50 Hz) using the most recently produced goal embedding as conditioning. This temporal decoupling is what enables real-time control despite the heavy compute of the System 2 VLM: the goal embedding $g$ is computed asynchronously and held fixed for a window of System 1 steps, updating only when System 2 completes its forward pass.
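The hold-and-update pattern can be sketched as a simple loop. This is a synchronous toy simulation using the example rates from the text (2 Hz and 50 Hz); `system2_goal` and `system1_action` are hypothetical stand-ins for the two networks, and a real deployment would run System 2 in a separate thread or process rather than inline.

```python
def system2_goal(step: int) -> str:
    """Hypothetical stand-in for the slow VLM: labels the goal by when it was computed."""
    return f"goal@{step}"


def system1_action(state: int, goal: str):
    """Hypothetical stand-in for the fast decoder: pairs the state with the held goal."""
    return (state, goal)


def run_dual_rate(num_steps: int, control_hz: int = 50, planner_hz: int = 2):
    """Run System 1 every control step; refresh the goal every control_hz // planner_hz steps."""
    steps_per_goal = control_hz // planner_hz
    goal = None
    trace = []
    for step in range(num_steps):
        if step % steps_per_goal == 0:
            goal = system2_goal(step)             # slow update
        trace.append(system1_action(step, goal))  # fast update, goal held fixed
    return trace


trace = run_dual_rate(50)  # one second of control at 50 Hz
goals_used = [goal for _, goal in trace]
print(goals_used[0], goals_used[24], goals_used[25])  # goal@0 goal@0 goal@25
```

Between refreshes, every action is conditioned on a goal that may be up to 25 control steps old in this configuration, which is the staleness window to keep in mind when the scene changes quickly.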

GR00T is trained on a combination of synthetic simulation data (from IsaacLab), motion capture human demonstrations retargeted to the robot's embodiment, and real-world teleoperation trajectories. This mixture of data sources allows the model to leverage the scale of synthetic data while grounding in the physical realism of real demonstrations.

SMOL-VLA

SMOL-VLA (HuggingFace, 2025) demonstrates that a compact VLA (around 450M total parameters) built on a small VLM backbone and a lightweight flow matching action head can match the performance of much larger models on standard manipulation benchmarks when trained with high-quality data. The architecture prioritizes inference efficiency (the entire model fits on a single consumer GPU), making it practical for real robot deployment without data center hardware.


Scaling and emergent capabilities

The scaling behavior of VLAs mirrors language model scaling in qualitative respects. Larger models trained on more diverse data demonstrate qualitatively different generalization capabilities: zero-shot execution of novel task descriptions, generalization to novel objects not seen during training (leveraging semantic VLM knowledge), and recovery from perturbations through closed-loop replanning.

The data scaling story for VLAs is more constrained than for language models. Language model training data can be collected at near-zero marginal cost by scraping web text; robot training data requires teleoperation, which has real cost per trajectory. This creates a bottleneck where compute scaling is easier to achieve than data scaling. The response in the field has been: (a) aggressive use of simulation data augmented with domain randomization (leveraging the simulation stack from Week 7), (b) cross-embodiment training that pools demonstrations across robot platforms, and (c) data augmentation of visual inputs to increase effective diversity.


GenAI context: embodied foundation models

The VLA is the embodied analog of the large language model. Both are autoregressive (or flow-matching-based) transformers trained on large-scale behavioral data to produce useful outputs conditioned on rich context. The differences are consequential: where a language model's outputs are token sequences with no physical consequences, a VLA's outputs are motor commands that move masses in the world with irreversible effects. The safety and reliability requirements (Week 12) are therefore qualitatively more stringent.

The training paradigm is identical at the highest level: large-scale pretraining on diverse data to build a generalizable representation, followed by instruction tuning and task-specific fine-tuning to align the model with specific deployment requirements. The inference paradigm is increasingly similar: chain-of-thought reasoning in language models corresponds to the VLA's use of its language model backbone to "reason" about the task before generating actions.

| VLA concept | Language model analog |
|---|---|
| Observation tokens (image + state) | Context / prompt tokens |
| Instruction embedding | System prompt |
| Action chunk | Completion / response |
| Flow matching denoising | Iterative text refinement |
| Cross-embodiment pretraining | Cross-domain language pretraining |
| LoRA fine-tuning for new robot | LoRA fine-tuning for new domain |


Key takeaways

VLA models unify perception, language understanding, and robot action generation into a single end-to-end learnable system. The dominant architectural choice is a pretrained VLM backbone (ViT + language transformer) with cross-attention fusion of visual and language inputs, combined with either discrete action tokenization or a continuous generative head (flow matching, diffusion). $\pi_0$ demonstrates flow matching over VLM features as the highest-quality approach for dexterous manipulation; GR00T uses a hierarchical dual-system architecture for real-time humanoid control; SMOL-VLA shows that compact architectures with good data can match large-scale baselines. Scaling follows language model patterns but is bottlenecked by data collection cost rather than compute. The VLA is the embodied analog of the LLM: same paradigm, higher physical stakes.


Conceptual questions

  1. A VLA built on a 7B parameter VLM backbone processes two camera images (each producing 196 ViT patch tokens) and a 20-token language instruction at each timestep. The transformer's self-attention has quadratic complexity in sequence length. Calculate the attention complexity ratio between this VLA's per-step computation and a language model processing a 500-token prompt. Propose two specific architectural modifications to reduce the VLA's visual token count by a factor of 4 without losing the spatial resolution needed for grasping precision tasks, and analyze the tradeoffs of each.

  2. The $\pi_0$ model pretrains on a heterogeneous dataset spanning 7 robot embodiments with different action spaces (different DoF, different actuator types). Explain how the flow matching head handles the fact that different embodiments have different action dimensionalities. What normalization and conditioning strategies are needed to prevent the model from learning embodiment-specific biases in the VLM backbone, and how would you diagnose whether such biases have been learned?

  3. Discrete action tokenization (as in RT-2) generates action tokens autoregressively, one per action dimension. A 6-DoF end-effector action with gripper command requires 7 tokens to generate. Compare the latency and correlation properties of this approach versus flow matching that generates all 7 dimensions simultaneously. For a task requiring precise simultaneous multi-axis movement (e.g., rotating and translating simultaneously during insertion), explain why autoregressive generation may fail despite high accuracy on each individual dimension.

  4. A VLA is fine-tuned on 50 demonstrations of a new task using LoRA with rank 16 on all attention projection matrices. After fine-tuning, the model achieves 80% success on the new task but drops from 70% to 45% success on a previously mastered task. Using the concept of catastrophic forgetting, explain the mechanism of this regression. Propose a regularization strategy or data mixing protocol that would maintain both the original and new task performance simultaneously.

  5. GR00T's hierarchical dual-system architecture separates a slow "System 2" goal-setting model from a fast "System 1" action generation model. Analyze this design using the options framework from hierarchical RL: what corresponds to the option initiation set, the intra-option policy, and the option termination condition? Identify the failure mode that arises when the System 2 goal embedding is inconsistent with the current world state (e.g., the planned grasp point is occluded), and describe what mechanism in the architecture would or would not detect and recover from this failure.


✦Solutions
  1. Attention complexity. The VLA sequence is $2 \times 196 + 20 = 412$ tokens vs the LM's 500; since attention scales as $L^2$, the ratio is $412^2/500^2 \approx 0.68$, so the VLA's per-step attention is about 0.68× the LM's. To cut visual tokens 4×: (a) token merging/pooling (perceiver resampler or adaptive pooling), which yields fewer tokens but risks blurring fine detail; (b) foveated tokenization keeping high resolution only on the workspace and coarse elsewhere, which preserves grasp-region precision at the cost of added complexity.
  2. $\pi_0$ heterogeneous actions. The flow-matching head outputs to a padded maximum action dimension with per-embodiment masking and normalization (each embodiment's actions scaled to a common range), conditioned on an embodiment ID so the shared head specializes. To keep embodiment bias out of the VLM backbone, normalize actions, condition explicitly on embodiment, and balance data; diagnose leakage by probing whether the backbone features predict embodiment, or by testing cross-embodiment transfer.
  3. Autoregressive vs simultaneous. Autoregressive tokenization needs 7 sequential passes (higher latency) and imposes an ordering, so coupled multi-axis motion (simultaneous rotate-and-translate) is captured only through conditioning on already-generated dimensions — small per-dimension errors compound and joint cross-axis correlation is hard to model, causing insertion failures despite high per-axis accuracy. Flow matching emits all dimensions jointly in one shot, preserving correlation and reducing latency.
  4. LoRA forgetting. Rank-16 LoRA on the shared attention projections nudges matrices that also serve the old task, overwriting directions important to it and dropping it 70→45%. Fix with data mixing (replay old-task demos), smaller rank or fewer target modules, separate per-task adapters, or Fisher-weighted (EWC) regularization protecting important weights.
  5. GR00T as options. The System-2 goal is the option: its initiation set is the states where that goal applies, the intra-option policy is System 1 generating actions toward it, and termination is goal achievement or a new System-2 goal. Failure: if the goal embedding is inconsistent with the world (planned grasp occluded), System 1 keeps driving toward an invalid goal. Recovery needs System 2 to re-plan on fresh observations or a feasibility check; without goal-feasibility feedback and with slow System-2 updates, the architecture cannot detect or recover.

Looking ahead

Generalist VLA models represent the current frontier of robot learning, but pretraining alone does not solve the adaptation problem. Deploying a large pretrained model on a specific robot, in a specific environment, for a specific task requires fine-tuning — and doing so efficiently, without catastrophic forgetting, and with limited data.

Week 11: Fine-Tuning and Adaptation. We examine parameter-efficient fine-tuning methods (LoRA, adapters), instruction tuning for robot policies, few-shot task adaptation, and strategies for maintaining general capabilities while specializing for deployment.


Further reading

  • Brohan, A., et al. (2022). RT-1: Robotics Transformer for Real-World Control at Scale. RSS. (The first large-scale VLA).
  • Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. CoRL. (Integrating LLM reasoning into robot control).
  • Driess, D., et al. (2023). PaLM-E: An Embodied Multimodal Language Model. ICML.
  • Padalkar, A., et al. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. ICRA.