Robot Learning

Week 11: Fine-Tuning and Adaptation

✦Learning Outcomes
  • Implement instruction tuning for task-specific adaptation
  • Analyze catastrophic forgetting and mitigation strategies
  • Design few-shot adaptation pipelines for new tasks
  • Connect fine-tuning to Course 1 RLHF (Reinforcement Learning from Human Feedback) concepts
◆Prerequisites
  • Week 10: VLA models
  • Course 1: RLHF (Reinforcement Learning from Human Feedback) and preference optimization (concepts transfer)

Recommended: Review Week 10 before proceeding.

Purpose of this lecture

A pretrained VLA model encodes broad priors about manipulation behavior, object semantics, and task structure. But pretraining is not deployment. The real world — a specific robot platform, a specific kitchen, a specific set of objects and lighting conditions — differs from the pretraining distribution in ways that matter for policy performance. A model that achieves 75% success on evaluation benchmarks during pretraining may achieve 30% on the physical robot it will actually operate on, not because the model is poor but because the deployment conditions are outside the distribution it was trained on.

Fine-tuning and adaptation bridge this gap: they specialize the pretrained model to the deployment context using limited task-specific data, while preserving the general capabilities acquired during pretraining. This lecture examines the technical methods — instruction tuning, parameter-efficient fine-tuning (PEFT), LoRA, and few-shot adaptation — with attention to why each works, what it costs, and what it risks. The risks are as important as the methods: catastrophic forgetting, overfitting on small datasets, and distribution shift between fine-tuning and test conditions are the primary failure modes that disciplined adaptation engineering must avoid.


The fine-tuning problem

Let $\pi_\theta^{\text{pre}}$ be a pretrained policy with parameters $\theta^{\text{pre}}$, and let $\mathcal{D}_{\text{task}}$ be a small dataset of task-specific demonstrations. The naive fine-tuning objective is:

$$\theta^* = \arg\min_\theta \mathcal{L}_{\text{task}}(\theta, \mathcal{D}_{\text{task}})$$

where $\mathcal{L}_{\text{task}}$ is the imitation loss on the task data. This objective has three pathologies for large pretrained models.

Catastrophic forgetting occurs because gradient descent on $\mathcal{L}_{\text{task}}$ moves $\theta$ away from $\theta^{\text{pre}}$ toward a minimum of the task loss, and the task minimum generally has poor performance on the pretraining distribution. The model loses the general manipulation priors acquired during pretraining: the ability to handle novel objects, recover from perturbations, or follow instructions outside the task vocabulary.

Overfitting occurs because $\mathcal{D}_{\text{task}}$ is small. With 20–50 demonstrations and a 7B parameter model, the loss can be driven to near zero by memorizing the demonstrations, but the resulting policy will not generalize to even small variations in object placement or lighting. The model's enormous capacity allows it to fit the training set perfectly while generalizing poorly.

Distribution shift at fine-tuning occurs when the fine-tuning demonstrations are collected in conditions (lighting, camera pose, object set) that differ slightly from the actual deployment conditions. A policy fine-tuned on demonstrations with a specific overhead camera angle will fail when the camera is moved by a few centimeters.

These pathologies motivate specific technical remedies: parameter-efficient methods that limit the number of degrees of freedom available to optimization (reducing catastrophic forgetting and overfitting), careful data collection protocols, and regularization techniques that anchor the fine-tuned model near the pretrained initialization.


Instruction tuning for robot policies

Instruction tuning is the process of training a policy to follow natural language task descriptions rather than (or in addition to) reproducing demonstrations of specific tasks. An instruction-tuned policy receives a language command ("pick up the red cup and place it in the blue bin") and produces actions that execute the described task, generalizing to novel combinations of objects and goals described in language.

The training procedure mirrors language model instruction tuning (SFT in the RLHF, Reinforcement Learning from Human Feedback, pipeline): collect a dataset of (instruction, observation, action) triplets covering a diverse set of tasks and formulations, then train the policy to maximize the conditional log-likelihood of the actions given the instruction and observation. Concretely, let $\ell$ denote the language instruction, $o$ the observation sequence, and $a_{1:T}$ the demonstration action sequence. The SFT objective over the action tokens is:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(\ell,\, o,\, a_{1:T}) \sim \mathcal{D}_{\text{task}}} \left[\sum_{t=1}^{T} \log p_\theta(a_t \mid a_{1:t-1},\, o,\, \ell)\right]$$

This is the standard teacher-forcing cross-entropy: the language instruction $\ell$ and observation $o$ are provided as context (conditioned on but not predicted), and the loss is summed only over the action tokens $a_{1:T}$. The same formulation is used in SFT for language models, with the system prompt treated as context; the only difference is that the predicted tokens are discretized robot actions rather than text. In practice, the action tokens are appended to the language prefix in the model's input sequence, and a loss mask zeroes out the cross-entropy contribution of all language prefix positions so that gradients only propagate through the action prediction objective.
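The loss mask described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of any particular VLA codebase: the function name `masked_action_nll` and the array layout are assumptions, and the loss is averaged over action tokens (a common normalization) rather than summed as in the formula.

```python
import numpy as np

def masked_action_nll(logits, targets, action_mask):
    """Mean negative log-likelihood over action-token positions only.

    logits:      (T, V) unnormalized scores per sequence position
    targets:     (T,)   target token ids
    action_mask: (T,)   1 where the target is an action token, 0 for the prefix
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    mask = np.asarray(action_mask, dtype=np.float64)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability the model assigns to each target token.
    token_ll = log_probs[np.arange(len(targets)), targets]
    # Prefix positions are zeroed out; normalize by the number of action tokens.
    return -(token_ll * mask).sum() / mask.sum()
```

Because prefix positions are multiplied by zero, changing the logits at those positions leaves the loss (and therefore its gradient) unchanged, which is the behavior the text describes.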

Crucially, the same physical behavior (picking up a cup) is paired with multiple instruction formulations ("pick up the cup," "grasp the mug," "move the container to the bin") to teach the policy semantic invariance to instruction phrasing.

For robot policies, instruction tuning enables compositional generalization: a policy trained on "pick up the red cup" and "place the blue block in the bin" separately can execute "pick up the blue block and place it in the cup" without explicit training, because the object and action representations learned from the individual tasks compose in the policy's feature space. This compositional structure emerges from the joint encoding of language and visual observation in the VLM (Vision-Language Model) backbone.

The limitation of instruction tuning for robots is grounding: language instructions specify goals semantically, but executing them requires grounding those semantics in the physical configuration of the environment. "Pick up the leftmost object" requires understanding spatial relationships in the image that the language model may not reliably resolve. Visual grounding — the alignment between language tokens and image patch features — is the technical challenge that VLA cross-attention aims to address and that fine-tuning on task-specific data reinforces.


Parameter-efficient fine-tuning: the landscape

PEFT methods reduce the number of trainable parameters during fine-tuning, which simultaneously reduces memory requirements, reduces overfitting risk, and limits catastrophic forgetting by preserving most pretrained weights unchanged.

Adapter modules

Adapter-based fine-tuning (Houlsby et al., 2019) inserts lightweight bottleneck modules between transformer layers. An adapter after each transformer layer applies the transformation:

$$h \leftarrow h + \text{Adapter}(h), \quad \text{Adapter}(h) = W_{\text{up}} \cdot \sigma(W_{\text{down}} \cdot h)$$

where $W_{\text{down}} \in \mathbb{R}^{r \times d}$ projects the hidden state down to the bottleneck and $W_{\text{up}} \in \mathbb{R}^{d \times r}$ projects it back up, with bottleneck dimension $r \ll d$. Only the adapter parameters ($2rd$ parameters per layer) are trained; the original transformer weights are frozen. For a 7B parameter model with $d = 4096$ and $r = 64$, each adapter has $2 \times 64 \times 4096 = 524{,}288$ parameters; assuming one adapter in each of 32 layers, that is about 16.8M parameters, roughly 0.2–0.3% of the original parameter count, while the original features are maintained through the residual pathway.

The adapter residual structure is critical: initializing $W_{\text{up}} = 0$ (or nearly so) means the adapted layer initially implements the identity function, leaving the pretrained model behavior unchanged at the start of fine-tuning. This warm initialization prevents the catastrophic early training dynamics that would occur if random adapter weights were applied to the pretrained features from step one.
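The warm-start property can be checked numerically. The sketch below uses toy dimensions and ReLU as a stand-in for the nonlinearity $\sigma$ (both are assumptions for illustration); with $W_{\text{up}}$ initialized to zero, the adapter output equals its input.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # toy sizes; the text uses d = 4096, r = 64

W_down = rng.normal(scale=0.02, size=(r, d))  # projects d -> r
W_up = np.zeros((d, r))                       # zero init: adapter branch outputs 0

def adapter_forward(h, W_down, W_up):
    """Residual bottleneck adapter: h + W_up @ relu(W_down @ h)."""
    z = np.maximum(W_down @ h, 0.0)  # ReLU stands in for sigma
    return h + W_up @ z

h = rng.normal(size=d)
out_at_init = adapter_forward(h, W_down, W_up)  # identical to h at initialization
adapter_params = W_down.size + W_up.size        # 2 * r * d
```

Once training moves `W_up` away from zero, the adapter branch starts to contribute, but the pretrained features always remain available through the residual term.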

LoRA: low-rank adaptation

LoRA (Hu et al., 2021) takes a different approach: rather than adding new modules, it reparameterizes the weight updates of existing linear layers as low-rank matrices. For a pretrained weight matrix $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, the fine-tuned weight is:

$$W = W_0 + \Delta W, \quad \Delta W = B A, \quad B \in \mathbb{R}^{d_{\text{out}} \times r},\; A \in \mathbb{R}^{r \times d_{\text{in}}}$$

where $r \ll \min(d_{\text{out}}, d_{\text{in}})$ is the LoRA rank. $W_0$ is frozen; only $B$ and $A$ are trained. The initialization convention $B = 0$, with $A$ given a small random initialization (so $\Delta W = 0$ at the start of training), ensures the same warm-start property as adapter residuals.
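A minimal NumPy sketch of the LoRA forward pass follows, under stated simplifications: toy dimensions, and no $\alpha / r$ scaling factor (the original paper includes one, but the formula above does not). The "trained" value of $B$ is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 12, 10, 2  # toy sizes

W0 = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init so B @ A = 0

def lora_forward(x, W0, A, B):
    """Compute (W0 + B A) x without forming the full-size update matrix."""
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=d_in)
y_init = lora_forward(x, W0, A, B)  # equals W0 @ x because B is zero

# Hypothetical post-training value of B, for illustration only.
B_trained = rng.normal(scale=0.1, size=(d_out, r))
y_trained = lora_forward(x, W0, A, B_trained)
W_merged = W0 + B_trained @ A       # optional merged weight for inference
```

Computing `B @ (A @ x)` keeps the extra cost proportional to $r$, and merging into `W_merged` removes it entirely at inference time if adapter swapping is not needed.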

The theoretical motivation is the intrinsic rank hypothesis: the weight updates needed to adapt a pretrained model to a new task have low intrinsic rank, because the new task is a specialization of (not a departure from) the pretraining distribution. Empirically, LoRA with rank 4–32 recovers most of the performance of full fine-tuning while modifying 0.1–1% of the total parameters.
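One way to probe the intrinsic rank hypothesis on a given weight update is to look at its singular value spectrum. The sketch below is an illustration on synthetic data, not a result from any real model: `effective_rank` is a hypothetical helper, and the 99% energy threshold is an arbitrary choice.

```python
import numpy as np

def effective_rank(delta_W, energy=0.99):
    """Smallest k whose top-k singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(delta_W, compute_uv=False)  # singular values, descending
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cumulative, energy) + 1)

# Synthetic update with four dominant singular values and a small noise floor,
# standing in for W_finetuned - W_pretrained from a real full fine-tuning run.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(64, 64)))
V, _ = np.linalg.qr(rng.normal(size=(64, 64)))
s_true = np.full(64, 1e-3)
s_true[:4] = [10.0, 8.0, 6.0, 4.0]
delta_W = (U * s_true) @ V.T

k = effective_rank(delta_W)
```

If the effective rank of real updates comes out far below the matrix dimension, a LoRA rank in that neighborhood is a reasonable starting point; if the spectrum decays slowly, the low-rank assumption is under strain.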

To make this concrete: for a VLM (Vision-Language Model) with $d_{\text{model}} = 4096$, a full attention projection matrix $W_Q \in \mathbb{R}^{4096 \times 4096}$ contains $4096 \times 4096 = 16{,}777{,}216$ parameters. A LoRA update with rank $r = 16$ requires $B \in \mathbb{R}^{4096 \times 16}$ and $A \in \mathbb{R}^{16 \times 4096}$, totaling $2 \times 16 \times 4096 = 131{,}072$ parameters, a $128\times$ reduction from the full matrix. Applying LoRA rank 16 to all four attention projections ($W_Q, W_K, W_V, W_O$) in each of 32 transformer layers adds approximately $32 \times 4 \times 131{,}072 \approx 16.8\text{M}$ trainable parameters, roughly 0.24% of the 7B backbone.

The fine-tuned model stores only the compact $B$ and $A$ matrices per task; switching tasks requires replacing these small factor matrices without touching the frozen backbone, enabling efficient multi-task deployment from a single base model. LoRA is now the standard fine-tuning method for large robot models because it (a) fits in the memory of deployment-scale hardware, (b) does not modify the pretrained weights (which can be restored by removing the LoRA adapters), and (c) enables task-specific adapters to be swapped at deployment time without reloading the full backbone.
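The parameter arithmetic for the rank-16 example can be reproduced directly. The helper name `lora_param_count` is an assumption for illustration, and it assumes square $d_{\text{model}} \times d_{\text{model}}$ projections as in the example.

```python
def lora_param_count(d_model, rank, n_layers, n_proj=4):
    """Trainable parameters with LoRA on n_proj square projections per layer."""
    per_matrix = 2 * rank * d_model  # B is d_model x r, A is r x d_model
    return n_layers * n_proj * per_matrix

d_model, rank, n_layers = 4096, 16, 32
full_matrix = d_model * d_model                    # parameters in one full projection
per_matrix = 2 * rank * d_model                    # parameters in one LoRA update
total = lora_param_count(d_model, rank, n_layers)  # all LoRA parameters
reduction = full_matrix / per_matrix               # per-matrix reduction factor
fraction_of_7b = total / 7e9                       # share of a 7B backbone
```

Changing `rank` or `n_proj` shows how quickly the trainable budget grows, for example when LoRA is also applied to MLP layers.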

QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantization of the frozen backbone, reducing the memory needed to hold the backbone weights by approximately 4× relative to BF16 while maintaining fine-tuning quality. For a 7B parameter VLA model, QLoRA enables fine-tuning in 20–30 GB of GPU memory rather than the 80–120 GB required for full fine-tuning in BF16.
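A back-of-the-envelope estimate shows where these memory figures come from. The byte counts below are assumptions: BF16 weights and gradients (2 bytes each) plus two FP32 Adam moments (8 bytes) for full fine-tuning, with activations and framework overhead left out.

```python
GB = 1e9

def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes."""
    return n_params * bits_per_param / 8 / GB

n_params = 7e9
bf16_weights = weight_memory_gb(n_params, 16)  # backbone stored in BF16
nf4_weights = weight_memory_gb(n_params, 4)    # backbone stored in 4-bit

# Assumed optimizer state for full fine-tuning: 2 B weight + 2 B gradient
# + 8 B Adam moments = 12 bytes per parameter (activations not counted).
full_ft_state = n_params * 12 / GB
```

Under these assumptions the full fine-tuning state alone lands inside the 80–120 GB range quoted above, while the quantized backbone occupies a few gigabytes, leaving the remainder of the QLoRA budget for activations and the small adapter state.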


Few-shot task adaptation

The extreme case of task-specific fine-tuning is few-shot adaptation: the policy must adapt to a new task from 5–50 demonstrations. This regime is common in practical robotics deployment — collecting 500 demonstrations for every new task variation is impractical, but collecting 10–20 high-quality demonstrations per session is feasible.

Few-shot adaptation works best when the pretrained representation is rich enough that the adaptation problem is low-rank: the new task requires only a small shift in the feature space covered by the pretrained model. The evidence for this in VLAs is the observation that LoRA with rank 8–16 and 20–50 demonstrations can match or exceed full fine-tuning with much larger datasets when starting from a high-quality VLA pretrained model. The pretrained model's representation provides most of the information needed; fine-tuning only needs to adjust which parts of that representation are relevant for the new task.

Practically, few-shot adaptation on physical robots uses the following pipeline: (1) collect 10–50 demonstrations via teleoperation on the target task in the deployment environment; (2) preprocess to ensure observation synchronization and action quality; (3) fine-tune with LoRA (rank 8–32) on the new demonstrations for 100–500 gradient steps; (4) evaluate on held-out demonstrations; (5) deploy. The total wall-clock time from data collection to deployment can be under an hour for simple task adaptations.
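Step (4) only gives a meaningful signal if whole demonstrations are held out, not individual timesteps from demonstrations that also appear in training. A minimal sketch of such a split, with hypothetical hyperparameter values picked from the ranges above:

```python
import random

def split_demonstrations(demo_ids, holdout_fraction=0.2, seed=0):
    """Split whole demonstrations (not timesteps) into train and held-out sets."""
    ids = list(demo_ids)
    random.Random(seed).shuffle(ids)
    n_holdout = max(1, round(len(ids) * holdout_fraction))
    return ids[n_holdout:], ids[:n_holdout]

# Hypothetical settings chosen from the ranges in the pipeline above.
config = {"lora_rank": 16, "gradient_steps": 300, "num_demos": 30}
train_ids, holdout_ids = split_demonstrations(range(config["num_demos"]))
```

With only a handful of held-out demonstrations the estimate is noisy, so held-out loss is best treated as a sanity check before on-robot evaluation rather than a replacement for it.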

The risk in few-shot fine-tuning is overfitting to operator habits: with 10 demonstrations from a single operator, the policy learns to execute the task in the specific style of that operator — their approach angle, speed profile, and grasp orientation. Collecting demonstrations from multiple operators (even 2–3) substantially improves generalization by breaking operator-specific habits.


Avoiding catastrophic forgetting

Every fine-tuning step moves the model parameters away from the pretrained initialization and toward the task-specific optimum. For tasks that are qualitatively similar to the pretraining distribution, this movement is small; for tasks that are qualitatively different, it can be large enough to overwrite representations needed for pretraining-era generalization.

Elastic Weight Consolidation (EWC; Kirkpatrick et al., 2017) adds a regularization term that penalizes large changes to the parameters that were most important for the pretraining performance:

$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_{\text{task}}(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^{\text{pre}})^2$$

where $F_i$ is the Fisher information of parameter $i$ on the pretraining task (an estimate of how important parameter $i$ is for pretraining performance). Parameters with high Fisher information are heavily penalized for changing; parameters with low Fisher information can change freely. The $\lambda$ coefficient controls the tradeoff between task adaptation and forgetting prevention.
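The penalty term and its gradient are simple to write down. The sketch below uses a diagonal Fisher estimate from squared per-example gradients, which is the usual approximation; the function names and toy numbers are assumptions for illustration.

```python
import numpy as np

def ewc_penalty(theta, theta_pre, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_pre_i)^2."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_pre) ** 2)

def ewc_penalty_grad(theta, theta_pre, fisher, lam):
    """Gradient of the penalty with respect to theta."""
    return lam * fisher * (theta - theta_pre)

def diagonal_fisher(per_example_grads):
    """Diagonal Fisher estimate: mean of squared per-example log-likelihood gradients."""
    return np.mean(np.asarray(per_example_grads, dtype=np.float64) ** 2, axis=0)

theta_pre = np.array([1.0, -2.0, 0.5])
theta = np.array([1.5, -2.0, 1.5])
fisher = np.array([4.0, 1.0, 0.1])  # toy values: the first parameter matters most
penalty = ewc_penalty(theta, theta_pre, fisher, lam=2.0)
grad = ewc_penalty_grad(theta, theta_pre, fisher, lam=2.0)
```

In the toy example the third parameter moves twice as far as the first, yet the first receives the larger pull back toward its pretrained value because its Fisher weight is higher.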

Data mixing is a simpler and often more effective approach: include a random sample of the pretraining data in each fine-tuning mini-batch. Formally, the data mixing objective is:

$$\mathcal{L}_{\text{mix}}(\theta) = \mathcal{L}_{\text{task}}(\theta,\, \mathcal{D}_{\text{task}}) + \lambda\, \mathcal{L}_{\text{pretrain}}(\theta,\, \mathcal{D}_{\text{pre}})$$

where $\lambda \in [0, 1]$ weights the pretraining (retention) term. Setting $\lambda = 0$ recovers naive fine-tuning with full forgetting risk; setting $\lambda = 1$ weights retention equally with the task loss, anchoring the model in the pretrained loss landscape at the cost of slower task adaptation. The retention term acts as a regularizer against catastrophic forgetting: the gradient of $\mathcal{L}_{\text{mix}}$ penalizes any parameter direction that reduces $\mathcal{L}_{\text{task}}$ at the expense of increasing $\mathcal{L}_{\text{pretrain}}$, keeping the fine-tuned parameters in a region of weight space where pretraining-era capabilities are preserved. Typical values are $\lambda \approx 0.1$–$0.3$, commonly implemented by drawing roughly 10–30% of each mini-batch from pretraining data, so that each gradient step is a weighted sum of task and retention signals. This is the same principle as the KL-from-reference penalty in RLHF (Reinforcement Learning from Human Feedback), expressed at the data level rather than the loss level.
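Data mixing can be implemented either at the loss level or at the batch level. The sketch below shows both, with hypothetical helper names and placeholder datasets; sampling a fixed share of each mini-batch from pretraining data is one common way to realize the retention term.

```python
import random

def mixed_batch(task_data, pretrain_data, batch_size, pretrain_fraction, rng):
    """Sample a mini-batch with a fixed share of pretraining examples."""
    n_pre = round(batch_size * pretrain_fraction)
    batch = rng.choices(pretrain_data, k=n_pre)
    batch += rng.choices(task_data, k=batch_size - n_pre)
    rng.shuffle(batch)
    return batch

def mixed_loss(task_loss, pretrain_loss, lam):
    """L_mix = L_task + lam * L_pretrain."""
    return task_loss + lam * pretrain_loss

rng = random.Random(0)
task_data = [("task", i) for i in range(30)]        # placeholder demonstrations
pretrain_data = [("pre", i) for i in range(1000)]   # placeholder replay pool
batch = mixed_batch(task_data, pretrain_data, batch_size=20,
                    pretrain_fraction=0.2, rng=rng)
```

Note that a batch share $\rho$ of pretraining examples, averaged with the task examples, corresponds to $\lambda = \rho / (1 - \rho)$ in the additive objective, so the two parameterizations are close but not identical.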

Task-specific adapters avoid the forgetting problem by design: if only the LoRA adapter parameters (not the backbone) are fine-tuned, the backbone weights are unchanged and pretraining-era capabilities are preserved exactly. Multiple task-specific adapter sets can be maintained and swapped at deployment time, enabling a single backbone to serve multiple task contexts without forgetting any of them.
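Adapter swapping can be illustrated with a single frozen weight and a dictionary of per-task low-rank factors. The task names and random factor values are hypothetical; in practice each pair would come from a separate fine-tuning run.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
W0 = rng.normal(size=(d, d))  # frozen backbone weight, shared by all tasks
W0_snapshot = W0.copy()       # kept only to verify the backbone never changes

# Hypothetical per-task LoRA factors.
adapters = {
    task: {"B": rng.normal(scale=0.1, size=(d, r)),
           "A": rng.normal(scale=0.1, size=(r, d))}
    for task in ("cup_stacking", "cloth_folding", "bin_picking")
}

def forward(x, task=None):
    """Apply the backbone, plus the selected task adapter if one is given."""
    y = W0 @ x
    if task is not None:
        y = y + adapters[task]["B"] @ (adapters[task]["A"] @ x)
    return y

x = rng.normal(size=d)
outputs = {task: forward(x, task) for task in adapters}
base_output = forward(x)  # pretrained behavior, recovered by using no adapter
```

Switching tasks is a dictionary lookup, and calling `forward` without a task recovers the pretrained behavior exactly, which is the sense in which nothing is forgotten.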


GenAI context: robotics PEFT mirrors language model PEFT

The adaptation story for robot models is nearly identical to the adaptation story for language models:

| Robots | Language models |
|---|---|
| Instruction tuning on (instruction, demo) pairs | SFT on (prompt, response) pairs |
| LoRA for task-specific adaptation | LoRA for domain/style adaptation |
| Few-shot fine-tuning from demonstrations | Few-shot fine-tuning from examples |
| Catastrophic forgetting of general skills | Catastrophic forgetting of general knowledge |
| EWC / data mixing regularization | KL-from-reference regularization (RLHF, Reinforcement Learning from Human Feedback) |
| Task-specific LoRA adapters | Per-application LoRA adapters |

The most important shared lesson is that representation quality determines few-shot performance more than the adaptation algorithm. A pretrained model with rich, generalizable features can adapt to new tasks from very few examples; a poorly pretrained model requires much larger task-specific datasets regardless of the fine-tuning method. Investing in pretraining quality — diverse data, well-designed tasks, appropriate scale — pays compound dividends when fine-tuning across many deployment scenarios.


Key takeaways

  • Naive fine-tuning of large robot models fails due to catastrophic forgetting, overfitting, and distribution shift between fine-tuning and deployment.
  • Instruction tuning aligns the policy with natural language commands and enables compositional generalization to novel task descriptions.
  • LoRA is the standard PEFT approach: it parameterizes weight updates as low-rank matrices, initializes the update to zero (preserving the pretrained behavior), and modifies 0.1–1% of parameters while recovering most of full fine-tuning performance.
  • QLoRA combines LoRA with 4-bit quantization to reduce memory requirements for large backbones.
  • Few-shot task adaptation (10–50 demonstrations) works reliably when starting from a high-quality pretrained VLA and using LoRA with appropriate rank.
  • Catastrophic forgetting is mitigated by EWC regularization, data mixing during fine-tuning, or task-specific adapter designs that leave the backbone frozen.
  • The fine-tuning paradigm for robot models is structurally identical to the SFT plus PEFT pipeline for language model adaptation.


Conceptual questions

  1. A VLA with 7B parameters is fine-tuned on 30 demonstrations of a cup-stacking task using full fine-tuning (all parameters updated) versus LoRA rank-16 on all attention matrices (approximately 17M trainable parameters). After 500 gradient steps, the full fine-tuning model achieves 90% success on the cup-stacking task but 40% success on a previously mastered box-folding task; the LoRA model achieves 80% on cup-stacking and 65% on box-folding. Analyze the tradeoff using the Fisher information framework: which parameter subsets are likely responsible for the forgetting in the full fine-tuning case, and why does LoRA preserve them?

  2. LoRA assumes that the intrinsic rank of the task-specific weight update $\Delta W$ is low. For a VLA being adapted from a pretraining distribution that included extensive manipulation of rigid objects to a new task involving deformable object manipulation (cloth folding), argue whether this assumption holds. What rank would you expect to be necessary, and how would you empirically determine the appropriate rank without running a full grid search?

  3. An instruction-tuned VLA is trained on the commands "pick up the red object" and "place the blue object in the green container." At deployment, it is given the instruction "move the yellow object to the red container." Analyze the conditions under which this zero-shot generalization succeeds and fails. Specifically: what properties of the language representation (color binding to object tokens, spatial relationship grounding) are required, and what failure mode emerges when the visual grounding of color terms is insufficient?

  4. A team uses data mixing during fine-tuning: 80% task-specific demonstrations and 20% randomly sampled pretraining trajectories per mini-batch. After 200 gradient steps, they observe that the loss on task-specific data is decreasing but the loss on the mixed pretraining data is decreasing faster, eventually becoming lower than the original pretrained model's loss on pretraining data. Explain why this would occur and what it implies about the model's behavior on out-of-distribution queries. Is this phenomenon beneficial or harmful for the overall deployment robustness?

  5. Multiple fine-tuned LoRA adapters are maintained for different tasks: adapter $A_1$ for cup-stacking, $A_2$ for cloth folding, $A_3$ for bin-picking. At deployment, the robot receives an ambiguous instruction ("put the fabric in the container") that could invoke either $A_2$ or $A_3$. Propose an adapter selection mechanism that uses the VLM's language features to route to the correct adapter. Analyze the failure mode that occurs if the routing mechanism selects the wrong adapter, and describe an ensemble or fallback strategy that degrades gracefully.


✦Solutions
  1. Fisher information. Full fine-tuning updates all parameters, including those with high Fisher information for the previously mastered box-folding task; large shifts in those parameters are what drive box-folding success down to 40%. LoRA confines updates to a low-rank subspace and leaves the base weights untouched, so most high-Fisher directions stay intact and box-folding success holds at 65%.
  2. LoRA rank for deformables. Moving from rigid manipulation to cloth folding is a large distributional shift (new dynamics, high-DoF deformable state), so the intrinsic rank of $\Delta W$ is likely higher than for a minor tweak and the low-rank assumption is strained; expect to need a larger rank (roughly 32–64 or more). Determine it empirically by sweeping rank until validation success saturates, or by taking the SVD of a full fine-tune $\Delta W$ to read off its effective rank.
  3. Zero-shot color/relation. "Move the yellow object to the red container" succeeds when color terms bind compositionally to object tokens and the spatial relation generalizes — i.e., the language representation factorizes attribute × object × relation. It fails when the visual grounding of "yellow" is insufficient (color unseen on that object or entangled in features), so the model cannot localize the yellow object even though the language composition is correct.
  4. Mixed-data loss dropping below baseline. The replay loss falling below the original pretrained model's loss indicates the model is fitting that specific replay subset (and the smaller mixed set), i.e., specializing to the seen trajectories rather than the full pretraining distribution. It is mildly helpful for retaining seen behaviors but a sign of overfitting that can hurt out-of-distribution robustness — judge it on held-out pretraining data, not on the mixed batch.
  5. Adapter routing. Embed the instruction with the VLM's language features, compare it to each adapter's learned task prototype, and route to the argmax (or a soft mixture). A wrong selection executes the wrong skill (e.g., cloth-folding motion during a bin-pick) — task failure or unsafe motion. Degrade gracefully by soft-weighting the top-k adapters, falling back to the base no-adapter policy under low routing confidence, or requesting clarification.
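The routing idea in solution 5 can be sketched as cosine similarity against per-adapter prototypes with a confidence threshold. Everything here is illustrative: the three-dimensional embeddings, the prototype vectors, the softmax temperature, and the threshold are all assumptions, and a real system would use the VLM's instruction embedding.

```python
import numpy as np

def route_adapter(instruction_emb, prototypes, temperature=0.1, min_confidence=0.5):
    """Return (adapter name or None, routing weights); None signals fallback."""
    names = list(prototypes)
    q = instruction_emb / np.linalg.norm(instruction_emb)
    sims = np.array([q @ (prototypes[n] / np.linalg.norm(prototypes[n]))
                     for n in names])
    # Temperature-scaled softmax over cosine similarities as a rough confidence.
    scaled = sims / temperature
    weights = np.exp(scaled - scaled.max())
    weights /= weights.sum()
    best = int(np.argmax(weights))
    choice = names[best] if weights[best] >= min_confidence else None
    return choice, dict(zip(names, weights))

# Toy prototypes standing in for learned task embeddings.
prototypes = {
    "cup_stacking": np.array([1.0, 0.0, 0.0]),
    "cloth_folding": np.array([0.0, 1.0, 0.0]),
    "bin_picking": np.array([0.0, 0.0, 1.0]),
}
clear_choice, _ = route_adapter(np.array([0.1, 0.9, 0.1]), prototypes)
ambiguous_choice, ambiguous_weights = route_adapter(np.array([0.0, 1.0, 1.0]), prototypes)
```

When the instruction sits equally close to two prototypes, no adapter clears the threshold and the caller can fall back to the base policy, blend the top candidates using the returned weights, or ask for clarification.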

Looking ahead

Fine-tuned VLA models can execute diverse tasks with high performance under nominal conditions. The remaining challenge is ensuring that they do so safely — avoiding dangerous actions, maintaining workspace limits, and detecting when the policy is about to fail before the failure occurs.

Week 12: Safety, Constraints, and Reliability. We examine formal safety methods (Control Barrier Functions in the VLA context), covariate shift detection for policy monitoring, runtime safety filters, and the architectural choices that make robot policies certifiably safe rather than empirically reliable.


Further reading

  • Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR. (The foundation of PEFT).
  • Walke, H., et al. (2023). BridgeData V2: A Dataset for Robot Learning at Scale. CoRL. (Discusses strategies for fine-tuning on new domain data).