Purpose of this lecture
A pretrained VLA model encodes broad priors about manipulation behavior, object semantics, and task structure. But pretraining is not deployment. The real world — a specific robot platform, a specific kitchen, a specific set of objects and lighting conditions — differs from the pretraining distribution in ways that matter for policy performance. A model that achieves 75% success on evaluation benchmarks during pretraining may achieve 30% on the physical robot it will actually operate on, not because the model is poor but because the deployment conditions are outside the distribution it was trained on.
Fine-tuning and adaptation bridge this gap: they specialize the pretrained model to the deployment context using limited task-specific data, while preserving the general capabilities acquired during pretraining. This lecture examines the technical methods — instruction tuning, parameter-efficient fine-tuning (PEFT), LoRA, and few-shot adaptation — with attention to why each works, what it costs, and what it risks. The risks are as important as the methods: catastrophic forgetting, overfitting on small datasets, and distribution shift between fine-tuning and test conditions are the primary failure modes that disciplined adaptation engineering must avoid.
The fine-tuning problem
Let $\pi_{\theta_0}$ be a pretrained policy with parameters $\theta_0$, and let $\mathcal{D}_{\text{task}}$ be a small dataset of task-specific demonstrations. The naive fine-tuning objective is:
$$\theta^{*} = \arg\min_{\theta} \; \mathcal{L}_{\text{task}}(\theta; \mathcal{D}_{\text{task}}), \qquad \theta \text{ initialized at } \theta_0$$
where $\mathcal{L}_{\text{task}}$ is the imitation loss on the task data. This objective has three pathologies for large pretrained models.
Catastrophic forgetting occurs because gradient descent on $\mathcal{L}_{\text{task}}$ moves $\theta$ away from $\theta_0$ toward a minimum of the task loss, and the task minimum generally has poor performance on the pretraining distribution. The model loses the general manipulation priors acquired during pretraining: the ability to handle novel objects, recover from perturbations, or follow instructions outside the task vocabulary.
Overfitting occurs because $|\mathcal{D}_{\text{task}}|$ is small. With 20–50 demonstrations and a 7B parameter model, the loss can be driven to near zero by memorizing the demonstrations, but the resulting policy will not generalize to even small variations in object placement or lighting. The model's enormous capacity allows it to fit the training set perfectly while generalizing poorly.
Distribution shift between fine-tuning and deployment occurs when the fine-tuning demonstrations are collected in conditions (lighting, camera pose, object set) that differ slightly from the actual deployment conditions. A policy fine-tuned on demonstrations with a specific overhead camera angle can degrade sharply, or fail outright, when the camera is moved by a few centimeters.
These pathologies motivate specific technical remedies: parameter-efficient methods that limit the number of degrees of freedom available to optimization (reducing catastrophic forgetting and overfitting), careful data collection protocols, and regularization techniques that anchor the fine-tuned model near the pretrained initialization.
Instruction tuning for robot policies
Instruction tuning is the process of training a policy to follow natural language task descriptions rather than (or in addition to) reproducing demonstrations of specific tasks. An instruction-tuned policy receives a language command ("pick up the red cup and place it in the blue bin") and produces actions that execute the described task, generalizing to novel combinations of objects and goals described in language.
The training procedure mirrors language model instruction tuning (the supervised fine-tuning, or SFT, stage of the RLHF, Reinforcement Learning from Human Feedback, pipeline): collect a dataset of (instruction, observation, action) triplets covering a diverse set of tasks and formulations, then train the policy to maximize the conditional log-likelihood of the actions given the instruction and observation. Concretely, let $\ell$ denote the language instruction, $o_{1:T}$ the observation sequence, and $a_{1:T}$ the demonstration action sequence. The SFT objective over the action tokens is:
$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} \log \pi_{\theta}\left(a_t \mid \ell, o_{1:t}, a_{1:t-1}\right)$$
This is the standard teacher-forcing cross-entropy: the language instruction and observation are provided as context (conditioned on but not predicted), and the loss is summed only over the action tokens $a_t$. The same formulation is used in SFT for language models, with the system prompt treated as context; the only difference is that the predicted tokens are discretized robot actions rather than text. In practice, the action tokens are appended to the language prefix in the model's input sequence, and a loss mask zeroes out the cross-entropy contribution of all language prefix positions so that gradients only propagate through the action prediction objective.
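The loss mask described above can be sketched in a few lines. This is a minimal illustration in plain Python, not any particular framework's API: the function name `masked_action_nll`, the example log-probabilities, and the five-token sequence layout are all assumptions made for the example.

```python
import math

def masked_action_nll(token_log_probs, is_action):
    """Sum the negative log-likelihood over action positions only.

    token_log_probs: log-probability the model assigned to each target token.
    is_action: parallel list of booleans; False marks prefix (context) positions.
    """
    if len(token_log_probs) != len(is_action):
        raise ValueError("log-probs and mask must have the same length")
    # Prefix positions are skipped, so they contribute nothing to the loss.
    return -sum(lp for lp, keep in zip(token_log_probs, is_action) if keep)

# Hypothetical sequence: 3 prefix tokens followed by 2 action tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.7),
             math.log(0.5), math.log(0.25)]
mask = [False, False, False, True, True]
loss = masked_action_nll(log_probs, mask)
```

Only the last two positions contribute here, so the loss equals the negative log of 0.5 × 0.25 regardless of how well the prefix tokens were predicted.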
Crucially, the same physical behavior (picking up a cup) is paired with multiple instruction formulations ("pick up the cup," "grasp the mug," "move the container to the bin") to teach the policy semantic invariance to instruction phrasing.
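One simple way to build such a dataset is to pair every recorded episode with every phrasing of its task instruction. The sketch below assumes a hypothetical record layout (dictionaries with `id` and `task` keys); real pipelines will differ.

```python
def expand_with_paraphrases(episodes, paraphrases):
    """Pair each episode with every phrasing of its task instruction."""
    samples = []
    for ep in episodes:
        for text in paraphrases[ep["task"]]:
            samples.append({"instruction": text, "episode_id": ep["id"]})
    return samples

# Hypothetical data: two recorded episodes of the same physical behavior.
episodes = [{"id": 0, "task": "cup_to_bin"}, {"id": 1, "task": "cup_to_bin"}]
paraphrases = {
    "cup_to_bin": ["pick up the cup", "grasp the mug", "move the container to the bin"],
}
samples = expand_with_paraphrases(episodes, paraphrases)
```

Each action sequence now appears under three different instructions, which is what pushes the policy toward invariance to phrasing.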
For robot policies, instruction tuning enables compositional generalization: a policy trained on "pick up the red cup" and "place the blue block in the bin" separately can execute "pick up the blue block and place it in the cup" without explicit training, because the object and action representations learned from the individual tasks compose in the policy's feature space. This compositional structure emerges from the joint encoding of language and visual observation in the VLM (vision-language model) backbone.
The limitation of instruction tuning for robots is grounding: language instructions specify goals semantically, but executing them requires grounding those semantics in the physical configuration of the environment. "Pick up the leftmost object" requires understanding spatial relationships in the image that the language model may not reliably resolve. Visual grounding — the alignment between language tokens and image patch features — is the technical challenge that VLA cross-attention aims to address and that fine-tuning on task-specific data reinforces.
Parameter-efficient fine-tuning: the landscape
PEFT methods reduce the number of trainable parameters during fine-tuning, which simultaneously reduces memory requirements, reduces overfitting risk, and limits catastrophic forgetting by preserving most pretrained weights unchanged.
Adapter modules
Adapter-based fine-tuning (Houlsby et al., 2019) inserts lightweight bottleneck modules between transformer layers. An adapter after each transformer layer applies the transformation:
$$h' = h + W_{\text{up}}\, \sigma\left(W_{\text{down}}\, h\right)$$
where $W_{\text{down}} \in \mathbb{R}^{r \times d}$ and $W_{\text{up}} \in \mathbb{R}^{d \times r}$ with bottleneck dimension $r \ll d$, and $\sigma$ is a pointwise nonlinearity. Only the adapter parameters ($2dr$ parameters per layer) are trained; the original transformer weights are frozen. For a 7B parameter model with $d = 4096$ and a bottleneck dimension $r$ in the tens, adapters add approximately 0.3% of the original parameter count while maintaining the original features as a residual pathway.
The adapter residual structure is critical: initializing $W_{\text{up}} = 0$ (or nearly so) means the adapter initially implements the identity function, leaving the pretrained model behavior unchanged at the start of fine-tuning. This warm initialization prevents the catastrophic early training dynamics that would occur if random adapter weights were applied to the pretrained features from step one.
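The identity-at-initialization property is easy to check numerically. The following NumPy sketch assumes a ReLU nonlinearity and toy dimensions (`d = 8`, `r = 2`); the names `adapter_forward`, `w_down`, and `w_up` are chosen for this example.

```python
import numpy as np

def adapter_forward(h, w_down, w_up):
    """Residual bottleneck adapter: h + W_up @ relu(W_down @ h)."""
    z = np.maximum(w_down @ h, 0.0)  # down-project to r dimensions, then ReLU
    return h + w_up @ z              # up-project and add the residual

d, r = 8, 2                          # toy sizes, assumed for illustration
rng = np.random.default_rng(0)
h = rng.normal(size=d)
w_down = rng.normal(scale=0.1, size=(r, d))
w_up = np.zeros((d, r))              # zero init: the adapter starts as identity

out = adapter_forward(h, w_down, w_up)
n_adapter_params = w_down.size + w_up.size  # 2 * d * r
```

With `w_up` at zero the output equals the input exactly, so the first gradient step starts from the pretrained model's behavior.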
LoRA: low-rank adaptation
LoRA (Hu et al., 2021) takes a different approach: rather than adding new modules, it reparameterizes the weight updates of existing linear layers as low-rank matrices. For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the fine-tuned weight is:
$$W = W_0 + \Delta W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}$$
where $r \ll \min(d, k)$ is the LoRA rank. $W_0$ is frozen; only $A$ and $B$ are trained. The initialization convention $B = 0$ (so $\Delta W = 0$ at the start of training) ensures the same warm-start property as adapter residuals.
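A minimal NumPy sketch of a LoRA-augmented linear layer follows. It uses toy dimensions, omits the scaling factor that practical implementations often apply to the update, and simulates "training" by overwriting the zero-initialized factor with random values; the class name `LoRALinear` and its attributes are assumptions for this example.

```python
import numpy as np

class LoRALinear:
    """Frozen weight w0 plus a trainable low-rank update b @ a."""

    def __init__(self, w0, rank, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w0.shape
        self.w0 = w0                                        # frozen
        self.a = rng.normal(scale=0.01, size=(rank, d_in))  # small random init
        self.b = np.zeros((d_out, rank))                    # zero init: update starts at 0

    def __call__(self, x):
        # Apply the low-rank path without ever forming the full d_out x d_in update.
        return self.w0 @ x + self.b @ (self.a @ x)

    def merged_weight(self):
        """Fold the update into a single matrix for inference."""
        return self.w0 + self.b @ self.a

rng = np.random.default_rng(1)
d_out, d_in, rank = 6, 5, 2
w0 = rng.normal(size=(d_out, d_in))
layer = LoRALinear(w0, rank)
x = rng.normal(size=d_in)

y_init = layer(x)                          # equals w0 @ x because b is zero
layer.b = rng.normal(size=(d_out, rank))   # stand-in for a trained factor
y_trained = layer(x)
```

After training, the update can either be kept as separate factors (cheap to swap) or merged into the base weight (no extra inference cost); both give the same output.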
The theoretical motivation is the intrinsic rank hypothesis: the weight updates needed to adapt a pretrained model to a new task have low intrinsic rank, because the new task is a specialization of (not a departure from) the pretraining distribution. Empirically, LoRA with rank 4–32 recovers most of the performance of full fine-tuning while modifying 0.1–1% of the total parameters.
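When a full fine-tuned weight difference is available for some layer, one possible diagnostic for the low-rank hypothesis is to look at how quickly its singular values decay. The sketch below builds a synthetic update with known singular values (5, 4, 3, 2) to show the calculation; the function name `rank_for_energy` and the energy thresholds are assumptions, and real weight updates will not be exactly low-rank.

```python
import numpy as np

def rank_for_energy(delta_w, energy=0.99):
    """Smallest rank whose top singular values capture `energy`
    of the squared Frobenius norm of delta_w."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, energy) + 1)

# Synthetic update with exactly four nonzero singular values.
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.normal(size=(32, 4)))  # orthonormal columns
v, _ = np.linalg.qr(rng.normal(size=(32, 4)))
delta_w = u @ np.diag([5.0, 4.0, 3.0, 2.0]) @ v.T

rank_99 = rank_for_energy(delta_w, energy=0.99)
rank_90 = rank_for_energy(delta_w, energy=0.90)
```

Here the top three singular values hold 50/54 of the squared norm, so a 90% threshold is met at rank 3 while a 99% threshold needs all four.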
To make this concrete: for a VLM (vision-language model) with $d = 4096$, a full attention projection matrix $W_0 \in \mathbb{R}^{4096 \times 4096}$ contains $4096^2 \approx 16.8$M parameters. A LoRA update with rank $r = 16$ requires $B \in \mathbb{R}^{4096 \times 16}$ and $A \in \mathbb{R}^{16 \times 4096}$, totaling $2 \times 4096 \times 16 = 131{,}072$ parameters, a $128\times$ reduction from the full matrix. Applying LoRA rank 16 to all four attention projections ($W_Q, W_K, W_V, W_O$) in each of 32 transformer layers adds approximately $4 \times 32 \times 131{,}072 \approx 16.8$M trainable parameters, compared to the 7B backbone: roughly 0.24% of total parameters. The fine-tuned model stores only the compact $A$ and $B$ matrices per task; switching tasks requires replacing these small factor matrices without touching the frozen backbone, enabling efficient multi-task deployment from a single base model. LoRA is now the standard fine-tuning method for large robot models because it (a) fits in the memory of deployment-scale hardware, (b) does not modify the pretrained weights (which can be restored by removing the LoRA adapters), and (c) enables task-specific adapters to be swapped at deployment time without reloading the full backbone.
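The arithmetic in this paragraph can be reproduced directly. The snippet assumes a hidden size of 4096, square attention projections, and a nominal 7 billion backbone parameters, which is the configuration consistent with the 0.24% figure above.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in the two LoRA factors for one weight matrix."""
    return rank * (d_in + d_out)

d, r = 4096, 16
n_layers, n_projections = 32, 4        # Q, K, V, O projections per layer

full_matrix = d * d                    # one full attention projection
per_matrix = lora_param_count(d, d, r) # the two low-rank factors
reduction = full_matrix // per_matrix
total_lora = per_matrix * n_projections * n_layers
fraction_of_backbone = total_lora / 7e9

print(f"{full_matrix:,} vs {per_matrix:,} per matrix ({reduction}x smaller)")
print(f"{total_lora:,} trainable parameters, {100 * fraction_of_backbone:.2f}% of 7B")
```

Note that the total across all 128 adapted matrices happens to equal the size of a single full projection matrix, because 4 × 32 = 128 is also the per-matrix reduction factor.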
QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantization of the frozen backbone, reducing memory requirements by approximately 4× while maintaining fine-tuning quality. For a 7B parameter VLA model, QLoRA enables fine-tuning in 20–30 GB of GPU memory rather than the 80–120 GB required for full fine-tuning in BF16.
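A rough back-of-envelope estimate shows where these figures come from. The byte counts below are assumptions for illustration (BF16 weights and gradients, two FP32 Adam moments per trainable parameter, 4 bits per frozen weight), and the estimate deliberately ignores activations, quantization constants, and framework overhead.

```python
def training_state_gb(n_params, bytes_weights, bytes_grads, bytes_optimizer):
    """Memory for weights, gradients, and optimizer state, in GB (1e9 bytes)."""
    return n_params * (bytes_weights + bytes_grads + bytes_optimizer) / 1e9

N_BACKBONE = 7e9
N_LORA = 16_777_216  # rank-16 LoRA on four projections in 32 layers

# Full fine-tuning: every parameter carries weights, gradients, Adam moments.
full_ft_gb = training_state_gb(N_BACKBONE, 2, 2, 8)

# QLoRA: frozen backbone at 4 bits (0.5 bytes) per weight, no gradients or
# optimizer state for it; only the LoRA factors carry training state.
frozen_4bit_gb = N_BACKBONE * 0.5 / 1e9
lora_state_gb = training_state_gb(N_LORA, 2, 2, 8)
qlora_gb = frozen_4bit_gb + lora_state_gb
```

Under these assumptions full fine-tuning needs about 84 GB before activations, while the QLoRA weight and optimizer state is under 4 GB; the rest of the 20–30 GB quoted above would be activations and other buffers not modelled here.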
Few-shot task adaptation
The extreme case of task-specific fine-tuning is few-shot adaptation: the policy must adapt to a new task from 5–50 demonstrations. This regime is common in practical robotics deployment — collecting 500 demonstrations for every new task variation is impractical, but collecting 10–20 high-quality demonstrations per session is feasible.
Few-shot adaptation works best when the pretrained representation is rich enough that the adaptation problem is low-rank: the new task requires only a small shift in the feature space covered by the pretrained model. The evidence for this in VLAs is the observation that LoRA with rank 8–16 and 20–50 demonstrations can match or exceed full fine-tuning with much larger datasets when starting from a high-quality VLA pretrained model. The pretrained model's representation provides most of the information needed; fine-tuning only needs to adjust which parts of that representation are relevant for the new task.
Practically, few-shot adaptation on physical robots uses the following pipeline: (1) collect 10–50 demonstrations via teleoperation on the target task in the deployment environment; (2) preprocess to ensure observation synchronization and action quality; (3) fine-tune with LoRA (rank 8–32) on the new demonstrations for 100–500 gradient steps; (4) evaluate on held-out demonstrations; (5) deploy. The total wall-clock time from data collection to deployment can be under an hour for simple task adaptations.
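Step (4) only means something if the held-out demonstrations are whole episodes that the fine-tuning run never saw. A minimal sketch of an episode-level split follows; the function name, the 20% hold-out fraction, and the configuration values are hypothetical choices from inside the ranges quoted above.

```python
import random

def split_episodes(episode_ids, holdout_fraction=0.2, seed=0):
    """Shuffle whole episodes and hold some out, so that no episode
    contributes frames to both the training and the held-out set."""
    ids = sorted(episode_ids)
    random.Random(seed).shuffle(ids)
    n_holdout = max(1, round(len(ids) * holdout_fraction))
    return ids[n_holdout:], ids[:n_holdout]

# Hypothetical settings for step (3).
finetune_config = {"lora_rank": 16, "gradient_steps": 300}

train_ids, heldout_ids = split_episodes(range(20))
```

Splitting at the frame level instead would leak near-duplicate observations from the same trajectory into evaluation and overstate performance.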
The risk in few-shot fine-tuning is overfitting to operator habits: with 10 demonstrations from a single operator, the policy learns to execute the task in the specific style of that operator — their approach angle, speed profile, and grasp orientation. Collecting demonstrations from multiple operators (even 2–3) substantially improves generalization by breaking operator-specific habits.
Avoiding catastrophic forgetting
Every fine-tuning step moves the model parameters away from the pretrained initialization and toward the task-specific optimum. For tasks that are qualitatively similar to the pretraining distribution, this movement is small; for tasks that are qualitatively different, it can be large enough to overwrite representations needed for pretraining-era generalization.
Elastic Weight Consolidation (EWC; Kirkpatrick et al., 2017) adds a regularization term that penalizes large changes to the parameters that were most important for the pretraining performance:
$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_{\text{task}}(\theta) + \frac{\lambda}{2} \sum_{i} F_i \left(\theta_i - \theta_{0,i}\right)^2$$
where $F_i$ is the Fisher information of parameter $\theta_i$ on the pretraining task (an estimate of how important parameter $\theta_i$ is for pretraining performance). Parameters with high Fisher information are heavily penalized for changing; parameters with low Fisher information can change freely. The coefficient $\lambda$ controls the tradeoff between task adaptation and forgetting prevention.
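The penalty and a common diagonal approximation of the Fisher information can be written in a few lines of plain Python. This sketch assumes the empirical Fisher estimate (the mean squared per-sample gradient of the log-likelihood) and uses made-up two-parameter values; the function names are chosen for this example.

```python
def fisher_diagonal(per_sample_grads):
    """Empirical diagonal Fisher: mean squared gradient per parameter."""
    n = len(per_sample_grads)
    n_params = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(n_params)]

def ewc_penalty(theta, theta0, fisher, lam):
    """Quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta0_i)^2."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for t, t0, f in zip(theta, theta0, fisher)
    )

# Hypothetical values: parameter 0 matters for pretraining, parameter 1 barely does.
fisher_example = fisher_diagonal([[1.0, 0.0], [3.0, 2.0]])
penalty = ewc_penalty(theta=[1.5, -1.0], theta0=[1.0, -2.0],
                      fisher=[4.0, 0.01], lam=10.0)
```

In the example, parameter 0 moves half as far as parameter 1 yet contributes 100 times more to the penalty, because its Fisher weight is 400 times larger.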
Data mixing is a simpler and often more effective approach: include a random sample of the pretraining data in each fine-tuning mini-batch. Formally, the data mixing objective is:
$$\mathcal{L}_{\text{mix}}(\theta) = (1 - \alpha)\, \mathcal{L}_{\text{task}}(\theta) + \alpha\, \mathcal{L}_{\text{pre}}(\theta)$$
where $\alpha \in [0, 1]$ controls the mixing ratio and $\mathcal{L}_{\text{pre}}$ is the loss on the sampled pretraining data. Setting $\alpha = 0$ recovers naive fine-tuning with full forgetting risk; setting $\alpha \to 1$ anchors the model in the pretrained loss landscape at the cost of slow task adaptation. The mixing ratio acts as a regularizer against catastrophic forgetting: the gradient of $\mathcal{L}_{\text{pre}}$ penalizes any parameter direction that reduces $\mathcal{L}_{\text{task}}$ at the expense of increasing $\mathcal{L}_{\text{pre}}$, keeping the fine-tuned parameters in a region of weight space where pretraining-era capabilities are preserved. Typical values are $\alpha = 0.1$–$0.3$ (10–30% pretraining examples per mini-batch), which in practice means each gradient step is a convex combination of task and retention signals. This is the same principle as the KL-from-reference penalty in RLHF (Reinforcement Learning from Human Feedback), expressed at the data level rather than the loss level.
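At the data-loader level, mixing amounts to drawing a fixed share of each mini-batch from the pretraining set. A minimal sketch, assuming the mixing ratio is the pretraining fraction of the batch and using tagged placeholder records in place of real trajectories:

```python
import random

def sample_mixed_batch(task_data, pretrain_data, batch_size, alpha, rng):
    """Draw a mini-batch in which a fraction `alpha` comes from pretraining data."""
    n_pre = round(alpha * batch_size)
    batch = rng.choices(pretrain_data, k=n_pre)
    batch += rng.choices(task_data, k=batch_size - n_pre)
    rng.shuffle(batch)  # avoid a fixed ordering of sources inside the batch
    return batch

# Placeholder records standing in for demonstrations and pretraining trajectories.
task_data = [("task", i) for i in range(30)]
pretrain_data = [("pre", i) for i in range(1000)]

batch = sample_mixed_batch(task_data, pretrain_data, batch_size=20,
                           alpha=0.2, rng=random.Random(0))
n_pretrain_in_batch = sum(1 for source, _ in batch if source == "pre")
```

Averaging the per-example loss over such a batch gives the weighted objective directly, with no separate regularization term to tune beyond the ratio itself.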
Task-specific adapters avoid the forgetting problem by design: if only the LoRA adapter parameters (not the backbone) are fine-tuned, the backbone weights are unchanged and pretraining-era capabilities are preserved exactly. Multiple task-specific adapter sets can be maintained and swapped at deployment time, enabling a single backbone to serve multiple task contexts without forgetting any of them.
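The swap-at-deployment idea can be sketched for a single layer as a bank of low-rank factor pairs over one read-only base weight. The class `AdapterBank`, the task names, and the random factors are hypothetical; a real system would hold one such pair per adapted matrix.

```python
import numpy as np

class AdapterBank:
    """One frozen base weight shared by several task-specific low-rank updates."""

    def __init__(self, w0):
        self.w0 = w0.copy()
        self.w0.setflags(write=False)  # make accidental backbone edits raise
        self.adapters = {}
        self.active = None

    def register(self, task, b, a):
        self.adapters[task] = (b, a)

    def activate(self, task):
        if task not in self.adapters:
            raise KeyError(f"no adapter registered for task {task!r}")
        self.active = task

    def forward(self, x):
        y = self.w0 @ x
        if self.active is not None:
            b, a = self.adapters[self.active]
            y = y + b @ (a @ x)
        return y

rng = np.random.default_rng(0)
d, r = 6, 2
w0 = rng.normal(size=(d, d))
bank = AdapterBank(w0)
for task in ("cup_stacking", "cloth_folding"):
    bank.register(task, rng.normal(size=(d, r)), rng.normal(size=(r, d)))

x = rng.normal(size=d)
y_base = bank.forward(x)
bank.activate("cup_stacking")
y_cups = bank.forward(x)
bank.activate("cloth_folding")
y_cloth = bank.forward(x)
```

Switching tasks changes only which small pair of factors is applied; the base weight is bit-for-bit identical before and after.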
GenAI context: robotics PEFT mirrors language model PEFT
The adaptation story for robot models is nearly identical to the adaptation story for language models:
| Robots | Language models |
|---|---|
| Instruction tuning on (instruction, demo) pairs | SFT on (prompt, response) pairs |
| LoRA for task-specific adaptation | LoRA for domain/style adaptation |
| Few-shot fine-tuning from demonstrations | Few-shot fine-tuning from examples |
| Catastrophic forgetting of general skills | Catastrophic forgetting of general knowledge |
| EWC / data mixing regularization | KL-from-reference regularization (RLHF) |
| Task-specific LoRA adapters | Per-application LoRA adapters |
The most important shared lesson is that representation quality determines few-shot performance more than the adaptation algorithm. A pretrained model with rich, generalizable features can adapt to new tasks from very few examples; a poorly pretrained model requires much larger task-specific datasets regardless of the fine-tuning method. Investing in pretraining quality — diverse data, well-designed tasks, appropriate scale — pays compound dividends when fine-tuning across many deployment scenarios.
Key takeaways
- Naive fine-tuning of large robot models fails due to catastrophic forgetting, overfitting, and distribution shift between fine-tuning and deployment.
- Instruction tuning aligns the policy with natural language commands and enables compositional generalization to novel task descriptions.
- LoRA is the standard PEFT approach: it parameterizes weight updates as low-rank matrices, initializes them to zero (preserving the pretrained behavior), and modifies 0.1–1% of parameters while recovering most of full fine-tuning performance.
- QLoRA combines LoRA with 4-bit quantization to reduce memory requirements for large backbones.
- Few-shot task adaptation (10–50 demonstrations) works reliably when starting from a high-quality pretrained VLA and using LoRA with appropriate rank.
- Catastrophic forgetting is mitigated by EWC regularization, data mixing during fine-tuning, or task-specific adapter designs that leave the backbone frozen.
- The fine-tuning paradigm for robot models is structurally identical to the SFT plus PEFT pipeline for language model adaptation.
Conceptual questions
- A VLA with 7B parameters is fine-tuned on 30 demonstrations of a cup-stacking task using full fine-tuning (all parameters updated) versus LoRA rank-16 on all attention matrices (approximately 17M trainable parameters). After 500 gradient steps, the full fine-tuning model achieves 90% success on the cup-stacking task but 40% success on a previously mastered box-folding task; the LoRA model achieves 80% on cup-stacking and 65% on box-folding. Analyze the tradeoff using the Fisher information framework: which parameter subsets are likely responsible for the forgetting in the full fine-tuning case, and why does LoRA preserve them?
- LoRA assumes that the intrinsic rank of the task-specific weight update is low. For a VLA being adapted from a pretraining distribution that included extensive manipulation of rigid objects to a new task involving deformable object manipulation (cloth folding), argue whether this assumption holds. What rank would you expect to be necessary, and how would you empirically determine the appropriate rank without running a full grid search?
- An instruction-tuned VLA is trained on the commands "pick up the red object" and "place the blue object in the green container." At deployment, it is given the instruction "move the yellow object to the red container." Analyze the conditions under which this zero-shot generalization succeeds and fails. Specifically: what properties of the language representation (color binding to object tokens, spatial relationship grounding) are required, and what failure mode emerges when the visual grounding of color terms is insufficient?
- A team uses data mixing during fine-tuning: 80% task-specific demonstrations and 20% randomly sampled pretraining trajectories per mini-batch. After 200 gradient steps, they observe that the loss on task-specific data is decreasing but the loss on the mixed pretraining data is decreasing faster, eventually becoming lower than the original pretrained model's loss on pretraining data. Explain why this would occur and what it implies about the model's behavior on out-of-distribution queries. Is this phenomenon beneficial or harmful for the overall deployment robustness?
- Multiple fine-tuned LoRA adapters are maintained for different tasks: adapter $A_1$ for cup-stacking, $A_2$ for cloth folding, $A_3$ for bin-picking. At deployment, the robot receives an ambiguous instruction ("put the fabric in the container") that could invoke either $A_2$ or $A_3$. Propose an adapter selection mechanism that uses the VLM's language features to route to the correct adapter. Analyze the failure mode that occurs if the routing mechanism selects the wrong adapter, and describe an ensemble or fallback strategy that degrades gracefully.
Looking ahead
Fine-tuned VLA models can execute diverse tasks with high performance under nominal conditions. The remaining challenge is ensuring that they do so safely — avoiding dangerous actions, maintaining workspace limits, and detecting when the policy is about to fail before the failure occurs.
Week 12: Safety, Constraints, and Reliability. We examine formal safety methods (Control Barrier Functions in the VLA context), covariate shift detection for policy monitoring, runtime safety filters, and the architectural choices that make robot policies certifiably safe rather than empirically reliable.
Further reading
- Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685; published at ICLR 2022. (The foundation of PEFT).
- Walke, H., et al. (2023). BridgeData V2: A Dataset for Robot Learning at Scale. CoRL. (Discusses strategies for fine-tuning on new domain data).