Week 8: Fine-Tuning and Parameter-Efficient Methods

✦Learning Outcomes
  • Implement QLoRA with 4-bit quantization for single-GPU fine-tuning
  • Compare full fine-tuning vs. LoRA vs. adapters for VLM adaptation
  • Apply PEFT methods for domain-specific VLM deployment
◆Prerequisites
  • Week 6: LLaVA - VLM architecture fundamentals
  • Basic understanding of deep learning optimization helpful

Purpose of this lecture

Large Vision-Language Models (VLMs) possess billions of parameters. While these massive models contain vast general knowledge, they frequently fail at proprietary, domain-specific tasks (e.g., parsing medical X-rays, reading obscure blueprints, or formatting output for a specific robotic API). Adapting them is necessary, but full fine-tuning is computationally and financially prohibitive for most practitioners. A 7B-parameter LLM at fp32 precision requires 28 GB just to store the weights, plus 56 GB for Adam optimizer states and 28 GB for gradients. That is 112 GB in total, which exceeds the memory of even an 80GB A100 GPU.

Parameter-Efficient Fine-Tuning (PEFT) methods solve this hardware bottleneck by freezing the vast majority of the network and training only a tiny, strategically placed fraction of additional parameters. This lecture derives the mathematics of Low-Rank Adaptation (LoRA) and QLoRA, analyzes the systems engineering of edge-compute deployment, and examines how these methods enable Continual Learning on physical robotics hardware.


Why full fine-tuning is impractical at scale

Full fine-tuning of a VLM with parameters $\theta$ updates every parameter using a task-specific dataset $\mathcal{D}$:

$$\theta^* = \arg\min_\theta \mathcal{L}(\theta; \mathcal{D})$$

The practical obstacles to this approach are severe:

  1. The Optimizer State Memory Wall: During training, the Adam optimizer maintains two moving averages (first moment $m_t$ and second moment $v_t$) for every single parameter. If $\theta$ has 7 billion parameters, storing $\theta$ in fp32 takes 28GB. Storing $m_t$ takes another 28GB, and $v_t$ takes another 28GB. Adding 28GB of gradients brings the total to 112GB before any activations are counted.
  2. Catastrophic Forgetting: Full fine-tuning on a narrow, task-specific dataset (like 5,000 images of chest X-rays) aggressively updates the model's weights. The VLM quickly unlearns its pre-trained general capabilities: its conversational tone, its safety guardrails, and its ability to answer questions about anything other than X-rays.
  3. Deployment Storage Overhead: If an enterprise needs 10 distinct task-specific VLMs, full fine-tuning requires saving 10 distinct 28GB checkpoints. Switching between tasks at inference time requires unloading and reloading 28GB into VRAM, inducing massive latency.

PEFT methods address all three problems by introducing a small set of task-specific parameters $\Delta\theta$ where $|\Delta\theta| \ll |\theta|$, while keeping $\theta$ frozen:

$$y = f_\theta(x) + g_{\Delta\theta}(x)$$

Because $\theta$ is frozen, the optimizer does not need to store $m_t$ or $v_t$ for 7 billion parameters, bypassing the memory wall.
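The memory arithmetic above can be reproduced with a short calculation. This is a minimal sketch that counts only weights, gradients, and Adam moments (activations are excluded); the helper names and the 20M adapter parameter count are hypothetical values chosen for illustration.

```python
GB = 1e9  # decimal gigabytes, matching the 28 GB figure for 7B fp32 weights


def full_finetune_memory_gb(n_params, bytes_per_param=4):
    """Weights + gradients + Adam first/second moments, activations excluded."""
    weights = n_params * bytes_per_param / GB
    return {
        "weights": weights,
        "gradients": weights,  # one gradient value per parameter
        "adam_m": weights,     # first moment m_t
        "adam_v": weights,     # second moment v_t
        "total": 4 * weights,
    }


def peft_memory_gb(n_frozen, n_trainable, frozen_bytes=4, train_bytes=4):
    """Frozen weights need storage only; trainable ones also need grads and moments."""
    frozen = n_frozen * frozen_bytes / GB
    trainable = n_trainable * train_bytes / GB
    return {
        "frozen_weights": frozen,
        "trainable_weights": trainable,
        "grads_and_adam": 3 * trainable,
        "total": frozen + 4 * trainable,
    }


full = full_finetune_memory_gb(7e9)
peft = peft_memory_gb(7e9, 20e6)  # 20M adapter parameters: a hypothetical figure

print(f"full fine-tuning: {full['total']:.2f} GB")
print(f"frozen base + adapters: {peft['total']:.2f} GB")
```

Under these assumptions the frozen-base budget is dominated by the 28 GB of base weights, which is the part that quantization targets later in the lecture.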


LoRA: Low-Rank Adaptation Mathematics

LoRA (Hu et al., 2021) avoids training the full weight matrices by constraining each weight update to a low-rank factorization. The core hypothesis is that while the pre-trained matrices are full-rank, the task-specific updates $\Delta W$ reside in a remarkably low-dimensional subspace (they have a low "intrinsic rank").

For a frozen weight matrix $W \in \mathbb{R}^{d_\text{in} \times d_\text{out}}$ (e.g., the Query projection matrix in an attention layer), LoRA constrains the update $\Delta W$ by decomposing it into two much smaller matrices:

$$W' = W + \Delta W = W + AB$$

where $A \in \mathbb{R}^{d_\text{in} \times r}$ and $B \in \mathbb{R}^{r \times d_\text{out}}$, and the rank $r \ll \min(d_\text{in}, d_\text{out})$.

Initialization and the Forward Pass

Matrix $A$ is initialized from a zero-mean Gaussian distribution $\mathcal{N}(0, \sigma^2)$, but matrix $B$ is explicitly initialized to all zeros. Therefore, at step $t=0$, $\Delta W = AB = 0$. This guarantees the model begins fine-tuning behaving exactly like the base pre-trained model.

During the forward pass, the input vector $x$ is multiplied by both pathways:

$$h = Wx + \frac{\alpha}{r}(ABx)$$

Here, $\alpha$ is a constant scaling factor. The ratio $\frac{\alpha}{r}$ reduces the need to retune the learning rate when the rank $r$ is changed during hyperparameter sweeps.
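The initialization and forward pass can be sketched in NumPy. This is an illustrative sketch, not a reference implementation: the class name `LoRALinear`, the value `sigma=0.02`, and the toy sizes are assumptions. The code applies the weights as `x @ W` (row-vector convention) so that the input has dimension $d_\text{in}$ under the shapes given above; the text writes the same product as $Wx$.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * A @ B."""

    def __init__(self, W, r, alpha, sigma=0.02, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = W.shape
        self.W = W                                  # frozen, never updated
        self.A = rng.normal(0.0, sigma, (d_in, r))  # zero-mean Gaussian init
        self.B = np.zeros((r, d_out))               # zero init, so A @ B = 0
        self.scale = alpha / r

    def __call__(self, x):
        # x has shape (batch, d_in).
        base = x @ self.W
        # Multiply through the rank-r bottleneck without ever forming
        # the full d_in x d_out product A @ B.
        update = (x @ self.A) @ self.B
        return base + self.scale * update


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 32))   # toy sizes, not a real model layer
x = rng.normal(size=(5, 64))
layer = LoRALinear(W, r=4, alpha=8.0)

# At step 0 the adapted layer reproduces the base layer exactly.
print(np.allclose(layer(x), x @ W))
```

Because `B` starts at zero, the adapter path contributes nothing until the first optimizer step changes `B`.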

Concrete Dimensionality Example

Consider the Attention Projection matrix $W_Q$ in a 7B LLaMA model, where $d_\text{in} = 4096$ and $d_\text{out} = 4096$.

  • Full Fine-Tuning: Updating $W_Q$ requires training $4096 \times 4096 = 16{,}777{,}216$ parameters.
  • LoRA (Rank $r=16$): $A$ has $4096 \times 16 = 65{,}536$ parameters. $B$ has $16 \times 4096 = 65{,}536$ parameters. Total trainable parameters: $131{,}072$.

This is a $128\times$ reduction in trainable parameters for this single layer. Because the frozen $W$ requires no gradients or optimizer states, training memory drops sharply. Furthermore, at inference time, the scaled product $\frac{\alpha}{r}AB$ can be precomputed and added directly to $W$, so a merged LoRA adapter adds no latency during generation.
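The parameter counts and the merge step can be checked numerically. The sketch below uses the 4096 × 4096, rank-16 case from the text for the counts, and small random matrices (an arbitrary choice) to confirm that merging the scaled update into the weight gives the same output as running the two paths separately. It assumes the row-vector convention `x @ W`.

```python
import numpy as np


def lora_param_counts(d_in, d_out, r):
    """Trainable parameters for full fine-tuning vs. a rank-r LoRA update."""
    full = d_in * d_out
    lora = d_in * r + r * d_out
    return full, lora, full / lora


full, lora, ratio = lora_param_counts(4096, 4096, 16)
print(full, lora, ratio)  # 16777216 131072 128.0

# Merge check on small random matrices.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 32, 24, 4, 8.0
W = rng.normal(size=(d_in, d_out))
A = rng.normal(size=(d_in, r))
B = rng.normal(size=(r, d_out))
x = rng.normal(size=(3, d_in))

two_path = x @ W + (alpha / r) * (x @ A) @ B  # adapter kept separate
W_merged = W + (alpha / r) * (A @ B)          # adapter folded into the weight
merged = x @ W_merged                         # a single matmul at inference

print(np.allclose(two_path, merged))  # True
```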


QLoRA: 4-Bit Quantization for Single-GPU Training

Even with LoRA, storing a frozen 70B parameter LLM in bf16 requires 140GB of VRAM just to hold the base model. QLoRA (Dettmers et al., 2023) shattered this barrier, allowing 65B-70B models to be fine-tuned on a single 48GB GPU.

QLoRA achieves this via three innovations:

  1. 4-bit NormalFloat (NF4): Instead of standard linear INT4 quantization, QLoRA uses a data type that is information-theoretically optimal for normally distributed weights. Because pre-trained neural network weights are approximately Gaussian and centered at zero, NF4 places its 16 available levels at quantiles of the normal distribution, so more levels cluster near zero (where most weights are) and fewer sit in the tails. This is intended to keep quantization error low for weights that follow this distribution.
  2. Double Quantization: Mapping the weights to NF4 levels requires one scaling constant per block of weights. QLoRA quantizes these scaling constants themselves from 32-bit floats to 8-bit floats using nested blocks, saving roughly 0.37 bits per parameter (about 3GB for a 65B model).
  3. Paged Optimizers: If the GPU runs out of VRAM during a memory spike (for example, a mini-batch with unusually long sequences), QLoRA leverages NVIDIA unified memory to automatically page the Adam optimizer states to system CPU RAM, preventing out-of-memory (OOM) crashes.
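The idea behind NF4 can be illustrated with a simplified blockwise quantizer. This is a sketch under stated assumptions, not the real data type: the codebook is built from evenly spaced quantiles of a standard normal, whereas the actual NF4 uses fixed constants that also represent zero exactly. The function names and the block size of 64 are choices made for this example.

```python
from statistics import NormalDist

import numpy as np


def normal_quantile_codebook(bits=4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n
    z = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return z / np.abs(z).max()


def quantize_blockwise(w, codebook, block_size=64):
    """Store one absmax scale per block plus a 4-bit index per weight."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    normalized = blocks / scales  # every value now lies in [-1, 1]
    # Index of the nearest codebook level for each weight.
    idx = np.abs(normalized[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), scales


def dequantize_blockwise(idx, scales, codebook, shape):
    """Look up each level and rescale by the block's absmax constant."""
    return (codebook[idx] * scales).reshape(shape)


codebook = normal_quantile_codebook()
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(128, 64))  # synthetic Gaussian weights

idx, scales = quantize_blockwise(w, codebook)
w_hat = dequantize_blockwise(idx, scales, codebook, w.shape)

gaps = np.diff(codebook)
print("gap near zero:", gaps[7], "gap at the tail:", gaps[0])
print("mean abs error:", np.abs(w - w_hat).mean())
```

The printed gaps show the property described in item 1: adjacent levels are closer together near zero than at the tails.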

During the forward and backward passes, the 4-bit frozen weights are dequantized on the fly to bf16 on the GPU to perform the matrix multiplication with $x$, and weight gradients are computed only for the high-precision bf16 LoRA adapters $A$ and $B$.
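The memory saved by Double Quantization follows from simple bookkeeping of the scaling constants. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 constants per second-level block); the function name is hypothetical.

```python
def constant_overhead_bits(block_size=64, second_block_size=256):
    """Bits of scaling-constant storage per model parameter."""
    # Without double quantization: one fp32 constant per block of weights.
    single = 32 / block_size
    # With double quantization: one 8-bit constant per block, plus one fp32
    # constant shared by every second-level block of 8-bit constants.
    double = 8 / block_size + 32 / (block_size * second_block_size)
    return single, double


single, double = constant_overhead_bits()
saved_bits = single - double
saved_gb = 65e9 * saved_bits / 8 / 1e9  # 65B parameters, bits to gigabytes

print(round(single, 3), round(double, 3), round(saved_bits, 3), round(saved_gb, 2))
# 0.5 0.127 0.373 3.03
```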


Adapter Strategies in VLMs (Where to put the parameters?)

When fine-tuning a VLM (like LLaVA), engineers must choose which modules to train or inject with LoRA adapters:

  1. Projector-Only Fine-Tuning: Freeze the ViT and the LLM. Train only the linear/MLP projector. This is extremely cheap and works when the domain shift is mild (e.g., standard natural images, but answering in a specific JSON format).
  2. LLM LoRA: Freeze the ViT and the Projector. Inject LoRA into the LLM's $W_Q, W_K, W_V, W_O$ matrices. This adapts the reasoning and text-generation style of the model.
  3. Vision (CLIP) LoRA: Freeze the LLM, inject LoRA into the ViT. This is necessary for severe visual domain shifts (e.g., satellite imagery, medical histopathology, or deep-sea robotics). If the base CLIP model has never seen a radar scan, LoRA on the LLM cannot fix it, because the visual features entering the LLM are already uninformative.
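The three strategies above amount to different selections over the model's module names. The sketch below shows one way such a selection might be expressed. All module names (`vision_tower`, `mm_projector`, `language_model`, `q_proj`, and so on) are hypothetical placeholders and will differ between model implementations.

```python
import re

# Hypothetical module names for a LLaVA-style model.
MODULES = [
    "vision_tower.layers.0.attn.q_proj",
    "vision_tower.layers.0.attn.out_proj",
    "mm_projector.0",
    "mm_projector.2",
    "language_model.layers.0.self_attn.q_proj",
    "language_model.layers.0.self_attn.o_proj",
    "language_model.layers.0.mlp.up_proj",
]

# For each strategy: which modules are trained directly, which receive LoRA.
STRATEGIES = {
    "projector_only": {"train": r"^mm_projector\.", "lora": None},
    "llm_lora": {
        "train": None,
        "lora": r"^language_model\..*\.(q_proj|k_proj|v_proj|o_proj)$",
    },
    "vision_lora": {
        "train": None,
        "lora": r"^vision_tower\..*\.(q_proj|k_proj|v_proj|out_proj)$",
    },
}


def plan_adaptation(strategy, module_names):
    """Sort modules into fully trained, LoRA-injected, and frozen groups."""
    spec = STRATEGIES[strategy]
    plan = {"train_fully": [], "lora": [], "frozen": []}
    for name in module_names:
        if spec["train"] and re.search(spec["train"], name):
            plan["train_fully"].append(name)
        elif spec["lora"] and re.search(spec["lora"], name):
            plan["lora"].append(name)
        else:
            plan["frozen"].append(name)
    return plan


for strategy in STRATEGIES:
    plan = plan_adaptation(strategy, MODULES)
    print(strategy, {k: len(v) for k, v in plan.items()})
```

In a real training script the resulting lists would be handed to whatever PEFT tooling is in use; that wiring is outside the scope of this sketch.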

Data Mixing as a Regularizer

To prevent the LoRA adapters from overfitting to the new task (causing the VLM to forget how to answer general questions), engineers use Data Mixing. The loss function is a weighted sum:

$$\mathcal{L}_\text{total} = \lambda \mathcal{L}_\text{task}(\theta_\text{frozen} + \Delta\theta; \mathcal{D}_\text{task}) + (1-\lambda) \mathcal{L}_\text{pretrain}(\theta_\text{frozen} + \Delta\theta; \mathcal{D}_\text{pretrain})$$

By keeping 5-10% of the original pretraining conversational data in the fine-tuning batch, the LoRA adapters are regularized against catastrophic forgetting.
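Data mixing can be implemented either at the batch level or at the loss level. The sketch below shows both in plain Python. The function names, the batch size of 40, and the placeholder datasets are assumptions for illustration. When each example contributes equally to an averaged loss, a 10% pretraining share in the batch corresponds roughly to $\lambda = 0.9$.

```python
import random


def mixed_batch(task_data, pretrain_data, batch_size, pretrain_fraction, rng):
    """Draw a batch in which a fixed share comes from the pretraining data."""
    n_pre = round(batch_size * pretrain_fraction)
    n_task = batch_size - n_pre
    batch = [("task", x) for x in rng.sample(task_data, n_task)]
    batch += [("pretrain", x) for x in rng.sample(pretrain_data, n_pre)]
    rng.shuffle(batch)  # avoid a fixed ordering of sources within the batch
    return batch


def total_loss(task_loss, pretrain_loss, lam):
    """Weighted sum: lam * L_task + (1 - lam) * L_pretrain."""
    return lam * task_loss + (1.0 - lam) * pretrain_loss


rng = random.Random(0)
task_data = list(range(1000))      # placeholder for task examples
pretrain_data = list(range(1000))  # placeholder for general conversation data

batch = mixed_batch(task_data, pretrain_data, batch_size=40,
                    pretrain_fraction=0.1, rng=rng)
n_pretrain = sum(1 for source, _ in batch if source == "pretrain")
print(len(batch), n_pretrain)  # 40 4
print(total_loss(2.0, 1.0, lam=0.9))
```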


Edge Compute and Continual Learning in Robotics

The ultimate promise of PEFT is deploying VLMs on physical hardware (e.g., an NVIDIA Jetson Orin board on a mobile robot). Running a 7B VLM at 50Hz for continuous motor control is currently out of reach for a mobile GPU.

The System 1 / System 2 Architectural Tradeoff

Robotics systems solve the edge-compute bottleneck via asynchronous routing:

  • System 1 (Fast/Local): A small, highly optimized continuous control policy (e.g., the Flow Matching $\pi_0$ from Course 2) runs at 50Hz directly on the robot's local Jetson board. It handles immediate reactions, balancing, and low-level motor torques.
  • System 2 (Slow/Cloud): The massive 7B or 70B VLM runs in the cloud (or on a massive local server) at 1Hz. It receives images from the robot, processes the semantic human command ("Find the missing wrench"), and sends a high-level spatial waypoint back down to System 1.
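The interaction between the two systems can be sketched as a tick-based simulation. This is a deliberately simplified, single-threaded stand-in for an asynchronous setup: the `GoalBuffer` class, the one-dimensional position, the proportional controller, and the fake waypoints are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GoalBuffer:
    """Latest waypoint written by System 2 and read by System 1."""
    waypoint: float = 0.0
    version: int = 0


def simulate(total_ticks, control_hz=50, planner_hz=1, gain=0.1):
    buffer = GoalBuffer()
    position = 0.0
    ticks_per_plan = control_hz // planner_hz
    trace = []
    for tick in range(total_ticks):
        # System 2: a new waypoint arrives only once per planning period.
        if tick % ticks_per_plan == 0:
            buffer.waypoint = float(tick // ticks_per_plan + 1)  # stand-in for VLM output
            buffer.version += 1
        # System 1: runs on every tick and tracks the most recent waypoint.
        position += gain * (buffer.waypoint - position)
        trace.append((tick, buffer.version, position))
    return trace


trace = simulate(150)  # three seconds at 50 Hz
print(trace[49])       # just before the second waypoint arrives
print(trace[-1][1])    # number of waypoints received: 3
```

System 1 never waits for System 2; it keeps acting on whatever goal is currently in the buffer.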

Continual Learning via LoRA

Robot hardware degrades. Motor friction changes, and operating environments shift. A VLM deployed in a factory in winter will see different lighting than in summer. Continual Learning attempts to update the deployed VLM on the fly.

Because full fine-tuning causes catastrophic forgetting, LoRA is a natural vehicle for continual learning. A robot can collect failure cases during the day. At night, while plugged into the charging dock, it computes a new LoRA adapter specifically optimized for the lighting conditions of that day. Because LoRA adapters are typically only a few megabytes to a few tens of megabytes, the robot can store a library of hundreds of distinct environmental LoRA weights, dynamically swapping them into VRAM based on current sensory context without ever permanently altering the foundational VLM weights.
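An adapter library of this kind can be sketched for a single layer. The class and function names below are hypothetical, the adapters are random rather than trained, and the size estimate assumes rank 16 on four 4096 × 4096 projection matrices in each of 32 layers, stored in 16-bit precision.

```python
import numpy as np


class AdapterLibrary:
    """One frozen weight shared by many swappable LoRA adapters."""

    def __init__(self, W, alpha, r):
        self.W = W
        self.scale = alpha / r
        self.adapters = {}
        self.active = None

    def add(self, name, A, B):
        self.adapters[name] = (A, B)

    def activate(self, name):
        if name not in self.adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name

    def forward(self, x):
        out = x @ self.W
        if self.active is not None:
            A, B = self.adapters[self.active]
            out = out + self.scale * (x @ A) @ B
        return out


def adapter_megabytes(d_in, d_out, r, n_matrices, bytes_per_param=2):
    """Storage for one adapter set across n_matrices weight matrices."""
    return (d_in * r + r * d_out) * n_matrices * bytes_per_param / 1e6


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
W_before = W.copy()
lib = AdapterLibrary(W, alpha=8.0, r=4)
lib.add("winter", rng.normal(size=(16, 4)), rng.normal(size=(4, 16)))
lib.add("summer", rng.normal(size=(16, 4)), rng.normal(size=(4, 16)))

x = rng.normal(size=(2, 16))
lib.activate("winter")
y_winter = lib.forward(x)
lib.activate("summer")
y_summer = lib.forward(x)

print(np.array_equal(lib.W, W_before))             # base weights untouched
print(adapter_megabytes(4096, 4096, 16, 32 * 4))   # about 33.6 MB
```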


Key takeaways

Full fine-tuning of billions of parameters is blocked by the memory wall of optimizer states and the risk of catastrophic forgetting. LoRA addresses this by freezing the network and injecting low-rank matrices $A$ and $B$, cutting trainable parameters by over $100\times$ while allowing the updates to be merged for zero-latency inference. QLoRA pushes this further via 4-bit NormalFloat quantization, allowing 70B models to be trained on a single GPU. In multimodal physical AI, PEFT is not just a cost-saving measure; it enables the systems-engineering reality of Continual Learning, allowing edge-deployed robots to dynamically swap lightweight LoRA adapters to adapt to changing environmental conditions without overwriting their foundational model weights.


Conceptual questions

  1. LoRA Initialization Mathematics: LoRA initializes matrix $A \sim \mathcal{N}(0, \sigma^2)$ and matrix $B = 0$. Consider an alternative initialization where the entries of $A$ and $B$ are both drawn from $\mathcal{N}(0, 1/r)$. Derive the expected squared Frobenius norm of $\Delta W = AB$ at initialization. Explain mathematically why this alternative initialization causes a massive spike in the cross-entropy loss on the very first training step, and why setting $B=0$ guarantees stability.
  2. Optimizer Memory Bounds: A team wants to fine-tune a 14B parameter VLM using full fp32 precision. The Adam optimizer stores a first moment $m_t$ and second moment $v_t$ for every parameter. Calculate the exact VRAM in Gigabytes required strictly to hold the model weights and the Adam states (assume 1 parameter = 4 bytes). Why is the actual VRAM required during the training loop significantly higher than this number, forcing the team to switch to QLoRA?
  3. QLoRA Gradient Flow: In QLoRA, the frozen weights $W$ are stored in 4-bit NF4, but they are dequantized to bf16 (16-bit) during the forward pass to multiply with the activations. During the backward pass, gradients must flow through these layers to reach the LoRA adapters in earlier layers. The derivative of a discrete quantization step function is zero almost everywhere, which is why quantization-aware training relies on the Straight-Through Estimator (STE). Explain why QLoRA does not need an STE for its frozen weights, and why the gradients are ultimately only applied to the high-precision LoRA matrices.
  4. Catastrophic Forgetting and Modality Gaps: A medical team fine-tunes a LLaVA-7B model purely on X-ray images using LoRA rank 64 applied to both the ViT and the LLM. They use no Data Mixing. After 10 epochs, the model reads X-rays perfectly, but when given a photo of a dog and asked "What is this?", it replies with "Severe pulmonary edema in the lower left lobe." Diagnose this failure geometrically in the latent space. How did the lack of regularizing data cause the LLM's language generation distribution to collapse into a narrow medical subspace?
  5. Edge Robotics Systems Design: You are deploying a bimanual robot to sort recycling on a fast-moving conveyor belt. You have a 7B VLM that can perfectly identify recyclable materials, but its inference latency is 1.5 seconds. The conveyor belt requires action updates at 50Hz (0.02 seconds). Design a System 1 / System 2 asynchronous architecture that utilizes the VLM for semantic reasoning while relying on a lightweight, continuous control policy (like a Diffusion Policy or ACT, Action Chunking with Transformers) for high-frequency motor control. How do the two systems communicate their state?
✦Solutions
  1. LoRA init. With the entries of $A$ and $B$ drawn from $\mathcal{N}(0, 1/r)$, each entry of $\Delta W = AB$ is a sum of $r$ independent products, each with variance $1/r^2$, so it has variance $1/r$, and the expected squared Frobenius norm is $d_\text{in} d_\text{out} / r$. $\Delta W$ is therefore non-trivial at initialization and perturbs the model's output on step 0, so the loss spikes. Setting $B=0$ makes $\Delta W = 0$ at init, so the output equals the base model (a stable start). The gradient with respect to $A$ is zero at that point because it is proportional to $B$, but $B$ receives a non-zero gradient through $A$, and $A$ starts to learn once $B$ has moved.
  2. Optimizer memory. $14\text{B} \times 4$ bytes $= 56$ GB of weights; Adam's two moments add $2 \times 56 = 112$ GB, for 168 GB just for weights and states. The real loop also stores gradients (56 GB), bringing the total to 224 GB, plus activations for backprop on top of that. This is impossible on a single GPU, which is why the team switches to QLoRA (4-bit frozen weights plus tiny trainable adapters).
  3. QLoRA gradient flow. The forward pass dequantizes NF4 to bf16 and uses the result as a fixed matrix in the matrix multiplication. Backpropagation needs only the gradient with respect to the activations, which is the upstream gradient multiplied by the transposed dequantized weights; the quantization function itself is never differentiated, so no Straight-Through Estimator is required. An STE is needed only when the quantized weights are themselves being trained. Because the base weights are frozen, they receive no update, and gradients are applied only to the high-precision LoRA matrices.
  4. Forgetting / modality collapse. LoRA on both the ViT and LLM, no data mixing, ten epochs on X-rays: the LLM's generation distribution collapses toward radiology text. Geometrically the language manifold contracts into the medical subspace, so any input — even a dog — decodes to medical phrasing. Mixing in general data, lowering the rank, or freezing the LLM regularizes this.
  5. Edge System 1/2. Run the 7B VLM asynchronously at a low rate (System 2), emitting a semantic goal/latent; a lightweight Diffusion Policy or ACT (System 1) runs the 50 Hz loop conditioned on that latest goal plus fast proprioceptive/visual state $s_t$. They communicate through a shared goal/latent buffer: System 2 writes a new goal when ready (~1.5 s), System 1 keeps tracking the most recent goal at 50 Hz using real-time feedback.
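The scaling discussed in the first solution can be checked with a small Monte Carlo experiment. This sketch draws the entries of $A$ and $B$ from $\mathcal{N}(0, 1/r)$ and compares the average squared Frobenius norm of $AB$ with $d_\text{in} d_\text{out} / r$. The matrix sizes, trial count, and function name are arbitrary choices for illustration.

```python
import numpy as np


def mean_sq_frobenius(d_in, d_out, r, trials=200, seed=0):
    """Average ||A @ B||_F^2 with all entries drawn from N(0, 1/r)."""
    rng = np.random.default_rng(seed)
    std = (1.0 / r) ** 0.5
    total = 0.0
    for _ in range(trials):
        A = rng.normal(0.0, std, (d_in, r))
        B = rng.normal(0.0, std, (r, d_out))
        total += np.sum((A @ B) ** 2)
    return total / trials


d_in, d_out, r = 64, 48, 8
estimate = mean_sq_frobenius(d_in, d_out, r)
expected = d_in * d_out / r  # 384.0

print("Monte Carlo estimate:", estimate)
print("d_in * d_out / r:", expected)

# With B = 0 the update vanishes exactly, whatever A contains.
A = np.random.default_rng(1).normal(size=(d_in, r))
zero_update = A @ np.zeros((r, d_out))
print(np.sum(zero_update ** 2))  # 0.0
```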

Looking ahead

PEFT methods enable us to efficiently adapt massive VLMs to specific physical and semantic tasks. However, training is only half the battle. How do we scientifically prove that these fine-tuned models are actually working correctly, and how vulnerable are they to domain shifts?

Week 9: Evaluation and Robustness. We examine the full evaluation landscape for VLMs: from deeply flawed historical metrics like BLEU and CIDEr, to modern reasoning benchmarks, distribution shifts, spurious correlations, and why benchmark performance frequently overestimates real-world robotic robustness.


Further reading

  • Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR. (The definitive PEFT paper).
  • Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS. (4-bit NF4 quantization).
  • Houlsby, N., et al. (2019). Parameter-Efficient Transfer Learning for NLP. ICML. (Adapter layers).