Purpose of this lecture
Large Vision-Language Models (VLMs) possess billions of parameters. While these massive models contain vast general knowledge, they frequently fail at proprietary, domain-specific tasks (e.g., parsing medical X-rays, reading obscure blueprints, or formatting output for a specific robotic API). Adapting them is necessary, but full fine-tuning is prohibitively expensive in memory and cost for most practitioners. A 7B-parameter LLM (Large Language Model) at fp32 precision requires 28 GB just to store the weights, plus 56 GB for Adam optimizer states and 28 GB for gradients, which exceeds the memory of even an 80 GB A100 GPU.
Parameter-Efficient Fine-Tuning (PEFT) methods solve this hardware bottleneck by freezing the vast majority of the network and training only a tiny, strategically placed fraction of additional parameters. This lecture derives the mathematics of Low-Rank Adaptation (LoRA) and QLoRA, analyzes the systems engineering of edge-compute deployment, and examines how these methods enable Continual Learning on physical robotics hardware.
Why full fine-tuning is impractical at scale
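Full fine-tuning of a VLM (Vision-Language Model) with parameters $\theta$ updates every parameter using a task-specific dataset $\mathcal{D}$:
$$\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \mathcal{L}\big(f_{\theta}(x), y\big) \right]$$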
The practical obstacles to this approach are severe:
- The Optimizer State Memory Wall: During training, the Adam optimizer maintains two moving averages (first moment $m$ and second moment $v$) for every single parameter. If the model's parameter vector $\theta$ has 7 billion entries, storing $\theta$ in fp32 takes 28GB. Storing $m$ takes another 28GB, and $v$ takes another 28GB. With gradients and activations, the VRAM requirement climbs past 100GB.
- Catastrophic Forgetting: Full fine-tuning on a narrow, task-specific dataset (like 5,000 images of chest X-rays) aggressively updates the model's weights. The VLM quickly unlearns its pre-trained general capabilities: it loses its conversational tone, its safety guardrails, and its ability to answer questions about anything other than X-rays.
- Deployment Storage Overhead: If an enterprise needs 10 distinct task-specific VLMs, full fine-tuning requires saving 10 distinct 28GB checkpoints. Switching between tasks at inference time requires unloading and reloading 28GB into VRAM, inducing massive latency.
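The arithmetic behind the first and third bullets can be checked in a few lines of Python. This is a back-of-the-envelope sketch, not a profiler: `full_finetune_memory_gb` is a hypothetical helper, it uses decimal gigabytes (1 GB = 10^9 bytes) to match the 28GB figure above, and it deliberately ignores activations, which come on top.

```python
GB = 1e9  # decimal gigabytes, matching the 28GB figure in the text


def full_finetune_memory_gb(n_params, bytes_per_value=4):
    """Memory for weights, gradients, and both Adam moments (activations excluded)."""
    per_copy = n_params * bytes_per_value / GB
    budget = {
        "weights": per_copy,    # theta
        "gradients": per_copy,  # one gradient value per parameter
        "adam_m": per_copy,     # first moment
        "adam_v": per_copy,     # second moment
    }
    budget["total"] = sum(budget.values())
    return budget


budget = full_finetune_memory_gb(7_000_000_000)
for name, gb in budget.items():
    print(f"{name:>10}: {gb:6.1f} GB")

# Ten fully fine-tuned task checkpoints, each a complete copy of the weights.
ten_checkpoints_gb = 10 * budget["weights"]
print(f"10 checkpoints: {ten_checkpoints_gb:.0f} GB")
```

Under these assumptions the four copies alone come to 112 GB, and ten task-specific checkpoints occupy 280 GB of storage.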
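PEFT methods address all three problems by introducing a small set of task-specific parameters $\phi$ where $|\phi| \ll |\theta|$, while keeping the pre-trained parameters $\theta$ frozen:
$$\phi^{*} = \arg\min_{\phi} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \mathcal{L}\big(f_{\theta, \phi}(x), y\big) \right]$$
Because $\theta$ is frozen, the optimizer does not need to store the Adam moments $m$ or $v$ for its 7 billion parameters; it keeps them only for $\phi$, bypassing the memory wall.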
LoRA: Low-Rank Adaptation Mathematics
LoRA (Hu et al., 2021) avoids the parameter explosion by applying a low-rank matrix decomposition to the weight updates. The core hypothesis is that while the pre-trained matrices $W_0$ are full-rank, the task-specific updates $\Delta W$ reside in a remarkably low-dimensional subspace (they have a low "intrinsic rank").
For a frozen weight matrix $W_0 \in \mathbb{R}^{d \times k}$ (e.g., the Query projection matrix $W_q$ in an attention head), LoRA constrains the update $\Delta W$ by decomposing it into two much smaller matrices:
$$W_0 + \Delta W = W_0 + BA$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$.
Initialization and the Forward Pass
Matrix $A$ is initialized with a zero-mean Gaussian distribution $\mathcal{N}(0, \sigma^2)$, but matrix $B$ is explicitly initialized to all zeros. Therefore, at step $t = 0$, $\Delta W = BA = 0$. This guarantees the model begins fine-tuning behaving exactly like the base pre-trained model.
During the forward pass, the input vector $x$ is multiplied by both pathways:
$$h = W_0 x + \frac{\alpha}{r} B A x$$
Here, $\alpha$ is a constant scaling factor. The ratio $\alpha / r$ reduces the need to retune the learning rate when the rank $r$ is changed during hyperparameter sweeps.
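The initialization and forward pass can be sketched in NumPy. The toy sizes, the standard deviation 0.01 for $A$, and the function name `lora_forward` are illustrative assumptions; a real implementation would live inside a deep-learning framework and train $A$ and $B$ by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 16, 12, 4, 8.0  # toy sizes, chosen only for illustration

W0 = rng.normal(size=(d, k))             # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))  # zero-mean Gaussian init
B = np.zeros((d, r))                     # zero init, so B @ A == 0 at step 0


def lora_forward(x, W0, A, B, alpha):
    """h = W0 x + (alpha / r) * B A x, with r read from A's shape."""
    rank = A.shape[0]
    return W0 @ x + (alpha / rank) * (B @ (A @ x))


x = rng.normal(size=k)
h_step0 = lora_forward(x, W0, A, B, alpha)
base_output = W0 @ x

# Once B moves away from zero (a random perturbation stands in for a
# gradient step here), the adapter path starts to contribute.
B_updated = B + 0.1 * rng.normal(size=B.shape)
h_later = lora_forward(x, W0, A, B_updated, alpha)

print(np.allclose(h_step0, base_output))  # adapter is a no-op at step 0
print(np.allclose(h_later, base_output))  # no longer identical afterwards
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the intermediate result at size $r$, so the full $d \times k$ update is never materialized during training.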
Concrete Dimensionality Example
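Consider the Attention Projection matrix $W_q$ in a 7B LLaMA model, where $d = 4096$ and $k = 4096$.
- Full Fine-Tuning: Updating $W_q$ requires training $4096 \times 4096 = 16{,}777{,}216$ parameters.
- LoRA (rank $r$): $A$ has $r \times 4096$ parameters and $B$ has $4096 \times r$ parameters, for a total of $8192\,r$ trainable parameters. For example, $r = 8$ gives $32{,}768 + 32{,}768 = 65{,}536$.
This is a $4096 / (2r)$-fold reduction in trainable parameters for this single layer ($256\times$ at $r = 8$). Because the frozen $W_q$ requires no optimizer states, training memory drops sharply. Furthermore, at inference time, the product $BA$ can be pre-multiplied and added directly to $W_q$, meaning LoRA adds zero latency during generation.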
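The parameter count and the merge-at-inference claim can both be checked numerically. In this sketch the rank $r = 8$ is an illustrative choice rather than a value prescribed by the lecture, `lora_param_counts` is a hypothetical helper, and the merge check uses small random matrices instead of real model weights.

```python
import numpy as np


def lora_param_counts(d, k, r):
    """Trainable parameters for a full update vs. a rank-r LoRA update."""
    full = d * k
    lora = r * k + d * r  # A is r x k, B is d x r
    return full, lora, full / lora


full, lora, ratio = lora_param_counts(4096, 4096, 8)
print(full, lora, ratio)  # 16777216 65536 256.0

# Merge check on toy shapes: folding (alpha / r) * B A into W0 gives the
# same output as running the adapter as a separate branch.
rng = np.random.default_rng(0)
d, k, r, alpha = 32, 24, 4, 16.0
W0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))
x = rng.normal(size=k)

h_adapter = W0 @ x + (alpha / r) * (B @ (A @ x))
W_merged = W0 + (alpha / r) * (B @ A)
h_merged = W_merged @ x

print(np.allclose(h_adapter, h_merged))
```

After merging, inference is a single matrix multiply of the same shape as the original layer, which is why the adapter adds no latency.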
QLoRA: 4-Bit Quantization for Single-GPU Training
Even with LoRA, storing a frozen 70B parameter LLM in bf16 requires 140GB of VRAM just to hold the base model. QLoRA (Dettmers et al., 2023) removed this barrier, allowing 65B-70B models to be fine-tuned on a single 48GB GPU.
QLoRA achieves this via three mathematical innovations:
- 4-bit NormalFloat (NF4): Instead of standard linear INT4 quantization, QLoRA uses a data type that is information-theoretically optimal for normally distributed weights. Because pre-trained neural network weights are approximately Gaussian and centered at zero, NF4 places its 16 available levels so that more of them are clustered near zero (where most weights are) and fewer sit at the tails. This reduces quantization error compared with a uniform 4-bit grid.
- Double Quantization: Block-wise quantization needs one scaling constant per block of weights. With one fp32 constant per block of 64 weights, as in the QLoRA paper, this costs 0.5 bits per parameter. QLoRA quantizes these scaling constants themselves, from 32-bit floats to 8-bit floats in nested blocks of 256, which cuts the overhead to about 0.127 bits per parameter and saves roughly 3GB on a 65B model.
- Paged Optimizers: If the GPU runs out of VRAM during a memory spike (for example, a mini-batch with unusually long sequences), QLoRA uses NVIDIA unified memory to automatically page the Adam optimizer states to system CPU RAM, preventing out-of-memory (OOM) crashes.
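The saving from Double Quantization is straightforward arithmetic on the storage cost of the scaling constants. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 constants per second-level block, 8-bit quantized constants, one fp32 constant per second-level block); the function names are hypothetical.

```python
def single_quant_overhead_bits(block=64, const_bits=32):
    """Bits per parameter spent on one fp32 scaling constant per block."""
    return const_bits / block


def double_quant_overhead_bits(block=64, block2=256, const_bits=8, const2_bits=32):
    """Bits per parameter when the constants are themselves quantized.

    Each first-level constant shrinks to `const_bits`, and every group of
    `block2` constants shares one additional `const2_bits` constant.
    """
    return const_bits / block + const2_bits / (block * block2)


single = single_quant_overhead_bits()
double = double_quant_overhead_bits()
saved_bits = single - double

n_params = 65e9
saved_gb = saved_bits * n_params / 8 / 1e9  # bits -> bytes -> decimal GB

print(f"overhead without double quantization: {single:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"saved on a 65B model:                 {saved_gb:.2f} GB")
```

Under these assumptions the overhead falls from 0.5 to about 0.127 bits per parameter, which works out to roughly 3 GB for a 65B-parameter model.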
During the forward and backward passes, the 4-bit frozen weights are dynamically dequantized back to bf16 on the GPU to perform the matrix multiplication with the input activations $x$, and the gradients are only accumulated into the high-precision bf16 LoRA adapters $A$ and $B$.
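The store-in-4-bit, compute-in-higher-precision pattern can be sketched in NumPy. This is a simplified stand-in, not the real NF4 data type: the 16 levels here are plain normal quantiles rescaled to [-1, 1] (the actual NF4 table is constructed differently and contains an exact zero), float64 stands in for bf16, and all function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def normal_quantile_levels(bits=4):
    """16 levels at evenly spaced normal quantiles, rescaled to [-1, 1]."""
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n
    q = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return q / np.abs(q).max()


def quantize(W, levels, block=64):
    """Block-wise absmax scaling, then nearest-level lookup (4-bit codes)."""
    blocks = W.reshape(-1, block)
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-12)
    codes = np.abs((blocks / scales)[..., None] - levels).argmin(axis=-1)
    return codes.astype(np.uint8), scales


def dequantize(codes, scales, levels, shape):
    """Rebuild a higher-precision copy of the weights from codes and scales."""
    return (levels[codes] * scales).reshape(shape)


rng = np.random.default_rng(0)
d, k, r, alpha = 64, 128, 4, 8.0
W0 = rng.normal(scale=0.02, size=(d, k))  # frozen base weight
levels = normal_quantile_levels()
codes, scales = quantize(W0, levels)      # this is what would sit in VRAM

A = rng.normal(scale=0.01, size=(r, k))   # high-precision, trainable
B = np.zeros((d, r))                      # high-precision, trainable


def qlora_forward(x):
    W_deq = dequantize(codes, scales, levels, (d, k))  # temporary copy
    return W_deq @ x + (alpha / r) * (B @ (A @ x))


h = qlora_forward(rng.normal(size=k))
gaps = np.diff(levels)
W_deq = dequantize(codes, scales, levels, W0.shape)
print(gaps[7] < gaps[0])           # levels are denser near zero
print(np.abs(W0 - W_deq).max())    # worst-case reconstruction error
```

Only `codes` and `scales` are stored persistently; the dequantized matrix exists just long enough for the multiply, and $A$ and $B$ are the only arrays an optimizer would update.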
Adapter Strategies in VLMs (Where to put the parameters?)
When fine-tuning a VLM (like LLaVA), engineers must choose which modules to inject with LoRA adapters:
- Projector-Only Fine-Tuning: Freeze the ViT and the LLM. Train only the linear/MLP projector. This is extremely cheap and works when the domain shift is mild (e.g., standard natural images, where only the mapping from visual features into the LLM's embedding space needs adjusting).
- LLM LoRA: Freeze the ViT and the Projector. Inject LoRA into the LLM's matrices. This adapts the reasoning and text-generation style of the model (e.g., answering in a specific JSON format).
- Vision (CLIP) LoRA: Freeze the LLM, inject LoRA into the ViT. This is necessary for severe visual domain shifts (e.g., satellite imagery, medical histopathology, or deep-sea robotics). If the base CLIP model has never seen a radar scan, the LLM LoRA cannot fix it, because the visual features entering the LLM are already uninformative.
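The three strategies amount to a per-module decision between "frozen", "train", and "lora". The sketch below encodes that decision as a lookup table; the module prefixes (`vision_tower`, `projector`, `language_model`) and parameter names are hypothetical and do not correspond to any specific library's naming. The text does not say what happens to the projector under Vision LoRA, so freezing it here is an assumption.

```python
# What to do with each top-level module under each strategy.
STRATEGIES = {
    "projector_only": {
        "vision_tower": "frozen",
        "projector": "train",
        "language_model": "frozen",
    },
    "llm_lora": {
        "vision_tower": "frozen",
        "projector": "frozen",
        "language_model": "lora",
    },
    "vision_lora": {
        "vision_tower": "lora",
        "projector": "frozen",  # assumption: not specified in the text
        "language_model": "frozen",
    },
}


def plan_for(param_name, strategy):
    """Return 'frozen', 'train', or 'lora' for a dotted parameter name."""
    top_level_module = param_name.split(".", 1)[0]
    return STRATEGIES[strategy][top_level_module]


# Hypothetical parameter names, one per top-level module.
example_params = [
    "vision_tower.blocks.0.attn.q_proj.weight",
    "projector.fc1.weight",
    "language_model.layers.0.self_attn.q_proj.weight",
]

for strategy in STRATEGIES:
    plan = {name: plan_for(name, strategy) for name in example_params}
    print(strategy, plan)
```

In a real training script the same table would drive which parameters get `requires_grad` enabled and which layers are wrapped with adapters.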
Data Mixing as a Regularizer
To prevent the LoRA adapters from aggressively overfitting to the new task (causing the VLM to forget how to answer general questions), engineers use Data Mixing. The loss function is a weighted sum of the task loss and a general-data loss, with a mixing weight $\lambda \in [0, 1]$:
$$\mathcal{L}_{\text{total}} = \lambda \, \mathcal{L}_{\text{task}} + (1 - \lambda) \, \mathcal{L}_{\text{general}}$$
By keeping a fixed fraction of the original pretraining conversational data in every fine-tuning batch, the LoRA adapters are regularized against catastrophic forgetting.
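Data Mixing can be implemented either at the loss level or at the batch level. The sketch below shows both in plain Python; the mixing weight `lam`, the 25% general-data fraction, and the placeholder datasets are illustrative assumptions, not values taken from the lecture.

```python
import random


def mixed_loss(task_loss, general_loss, lam):
    """Weighted sum: lam weights the new task, (1 - lam) the general data."""
    return lam * task_loss + (1.0 - lam) * general_loss


def build_mixed_batch(task_data, general_data, batch_size, general_fraction, rng):
    """Sample a batch in which a fixed fraction comes from the general data."""
    n_general = round(batch_size * general_fraction)
    n_task = batch_size - n_general
    batch = rng.sample(task_data, n_task) + rng.sample(general_data, n_general)
    rng.shuffle(batch)
    return batch


# Placeholder datasets: (source, index) pairs standing in for real examples.
task_data = [("xray", i) for i in range(100)]
general_data = [("chat", i) for i in range(100)]

rng = random.Random(0)
batch = build_mixed_batch(task_data, general_data, batch_size=8,
                          general_fraction=0.25, rng=rng)
n_general_in_batch = sum(1 for source, _ in batch if source == "chat")

print(batch)
print(n_general_in_batch)          # 2 of the 8 examples are general data
print(mixed_loss(2.0, 1.0, 0.75))  # 1.75
```

Batch-level mixing is often simpler in practice, since a single loss over a mixed batch implicitly weights the two sources by their share of the batch.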
Edge Compute and Continual Learning in Robotics
The ultimate promise of PEFT is deploying VLMs on physical hardware (e.g., an Nvidia Jetson Orin board on a mobile robot). Running a 7B VLM at 50Hz for continuous motor control is currently out of reach for a mobile GPU.
The System 1 / System 2 Architectural Tradeoff
Robotics systems solve the edge-compute bottleneck via asynchronous routing:
- System 1 (Fast/Local): A small, highly optimized continuous control policy (e.g., the Flow Matching policy from Course 2) runs at 50Hz directly on the robot's local Jetson board. It handles immediate reactions, balancing, and low-level motor torques.
- System 2 (Slow/Cloud): The massive 7B or 70B VLM runs in the cloud (or on a large local server) at 1Hz. It receives images from the robot, processes the semantic human command ("Find the missing wrench"), and sends a high-level spatial waypoint back down to System 1.
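The timing relationship between the two systems can be illustrated with a small single-threaded simulation. This is only a sketch of the rates involved: a real deployment would run the two loops concurrently and account for network and inference latency, and `slow_planner`, `fast_policy`, the fixed goal, and the proportional gain are all hypothetical stand-ins for the VLM and the control policy.

```python
from dataclasses import dataclass


@dataclass
class Waypoint:
    x: float
    y: float
    issued_at_tick: int


def slow_planner(tick):
    """Stand-in for System 2: returns a high-level spatial waypoint."""
    return Waypoint(x=1.0, y=0.5, issued_at_tick=tick)


def fast_policy(position, waypoint, gain=0.1):
    """Stand-in for System 1: a proportional step toward the latest waypoint."""
    return (gain * (waypoint.x - position[0]),
            gain * (waypoint.y - position[1]))


def simulate(seconds=3, control_hz=50, planner_hz=1):
    ticks_per_plan = control_hz // planner_hz
    position = (0.0, 0.0)
    waypoint = Waypoint(0.0, 0.0, issued_at_tick=0)
    n_control_steps = 0
    n_plans = 0
    for tick in range(seconds * control_hz):
        # System 2 refreshes the shared waypoint once per second.
        if tick % ticks_per_plan == 0:
            waypoint = slow_planner(tick)
            n_plans += 1
        # System 1 acts on every tick using whatever waypoint is current.
        dx, dy = fast_policy(position, waypoint)
        position = (position[0] + dx, position[1] + dy)
        n_control_steps += 1
    return n_control_steps, n_plans, position


n_control_steps, n_plans, final_position = simulate()
print(n_control_steps, n_plans, final_position)
```

The design point is that System 1 never waits for System 2: it always acts on the most recent waypoint, so a slow or delayed plan degrades goal quality but not control frequency.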
Continual Learning via LoRA
Robot hardware degrades. Motor friction changes, and operating environments shift. A VLM deployed in a factory in winter will see different lighting than in summer. Continual Learning attempts to update the deployed VLM on the fly.
Because full fine-tuning causes catastrophic forgetting, LoRA is a natural vehicle for continual learning. A robot can collect failure cases during the day. At night, while plugged into the charging dock, it computes a new LoRA adapter specifically optimized for the lighting conditions of that day. Because LoRA matrices are only a few megabytes, the robot can store a library of hundreds of distinct environmental LoRA weights, dynamically swapping them into VRAM based on current sensory context without ever permanently altering the foundational VLM weights.
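An adapter library of this kind can be sketched as a dictionary keyed by context, with the base weight shared and never modified. The context names, toy sizes, and random adapters below are illustrative assumptions; the size estimate assumes rank 8 adapters in 16-bit precision on 64 projection matrices of size $4096 \times 4096$ (for instance two projections in each of 32 layers), which is one plausible configuration rather than a figure from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 16, 16, 2, 4.0

W0 = rng.normal(size=(d, k))  # shared frozen base weight
W0_snapshot = W0.copy()       # kept only to verify W0 is never altered


def new_adapter():
    """Random stand-in for an adapter trained on one day's failure cases."""
    return {"A": rng.normal(scale=0.1, size=(r, k)),
            "B": rng.normal(scale=0.1, size=(d, r))}


library = {
    "winter_lighting": new_adapter(),
    "summer_lighting": new_adapter(),
}


def forward(x, context):
    """Apply the base weight plus whichever adapter matches the context."""
    adapter = library[context]
    return W0 @ x + (alpha / r) * (adapter["B"] @ (adapter["A"] @ x))


def adapter_megabytes(d, k, r, n_matrices, bytes_per_value=2):
    """Storage for one adapter set: r * (d + k) values per adapted matrix."""
    return n_matrices * r * (d + k) * bytes_per_value / 1e6


x = rng.normal(size=k)
h_winter = forward(x, "winter_lighting")
h_summer = forward(x, "summer_lighting")
size_mb = adapter_megabytes(4096, 4096, 8, n_matrices=64)

print(np.array_equal(W0, W0_snapshot))  # base weights untouched
print(np.allclose(h_winter, h_summer))  # adapters give different outputs
print(f"{size_mb:.2f} MB per adapter set")
```

Swapping context only changes which small pair of matrices is read, so hundreds of adapters of this size fit comfortably in a few gigabytes of storage.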
Key takeaways
Full fine-tuning of billions of parameters is blocked by the memory wall of optimizer states and the danger of catastrophic forgetting. LoRA addresses this by freezing the network and injecting low-rank matrices $A$ and $B$, cutting trainable parameters per adapted matrix by a factor of $dk / (r(d + k))$ (256× for $r = 8$ on a $4096 \times 4096$ projection) while allowing the updates to be merged for zero-latency inference. QLoRA pushes this further via 4-bit NormalFloat quantization, allowing 70B models to be trained on a single GPU. In multimodal physical AI, PEFT is not just a cost-saving measure; it makes Continual Learning practical, allowing edge-deployed robots to dynamically swap lightweight LoRA adapters to adapt to changing environmental conditions without overwriting their foundational weights.
Conceptual questions
- LoRA Initialization Mathematics: LoRA initializes matrix $A \sim \mathcal{N}(0, \sigma^2)$ and matrix $B = 0$. Consider an alternative initialization where $A$ and $B$ are both initialized from a zero-mean Gaussian with variance $1/\sqrt{r}$, so their product has unit-scale entries. Derive the expected Frobenius norm of $\Delta W = BA$ at initialization. Explain mathematically why this alternative initialization causes a massive spike in the cross-entropy loss on the very first training step, and why setting $B = 0$ guarantees stability.
- Optimizer Memory Bounds: A team wants to fine-tune a 14B parameter VLM using full fp32 precision. The Adam optimizer stores a first moment and second moment for every parameter. Calculate the exact VRAM in gigabytes required strictly to hold the model weights and the Adam states (assume 1 parameter = 4 bytes). Why is the actual VRAM required during the training loop significantly higher than this number, forcing the team to switch to QLoRA?
- QLoRA Gradient Flow: In QLoRA, the frozen weights are stored in 4-bit NF4, but they are dequantized to bf16 (16-bit) during the forward pass to multiply with the activations. During the backward pass, gradients must flow through these dequantized weights to reach the underlying layers. However, the derivative of a discrete quantization step function is zero almost everywhere. Explain why this is not an obstacle in QLoRA: which quantities actually require gradients, why the dequantized frozen weights can be treated as constants (so that no Straight-Through Estimator is needed), and why the gradients are ultimately only applied to the high-precision LoRA matrices.
- Catastrophic Forgetting and Modality Gaps: A medical team fine-tunes a LLaVA-7B model purely on X-ray images using LoRA rank 64 applied to both the ViT and the LLM. They use no Data Mixing. After 10 epochs, the model reads X-rays perfectly, but when given a photo of a dog and asked "What is this?", it replies with "Severe pulmonary edema in the lower left lobe." Diagnose this failure geometrically in the latent space. How did the lack of regularizing data cause the LLM's language generation distribution to collapse into a narrow medical subspace?
- Edge Robotics Systems Design: You are deploying a bimanual robot to sort recycling on a fast-moving conveyor belt. You have a 7B VLM that can perfectly identify recyclable materials, but its inference latency is 1.5 seconds. The conveyor belt requires action updates at 50Hz (0.02 seconds). Design a System 1 / System 2 asynchronous architecture that utilizes the VLM for semantic reasoning while relying on a lightweight, continuous control policy (like a Diffusion Policy or ACT, Action Chunking with Transformers) for high-frequency motor control. How do the two systems communicate their state?
Looking ahead
PEFT methods enable us to efficiently adapt massive VLMs to specific physical and semantic tasks. However, training is only half the battle. How do we scientifically prove that these fine-tuned models are actually working correctly, and how vulnerable are they to domain shifts?
Week 9: Evaluation and Robustness. We examine the full evaluation landscape for VLMs: from deeply flawed historical metrics like BLEU and CIDEr, to modern reasoning benchmarks, distribution shifts, spurious correlations, and why benchmark performance frequently overestimates real-world robotic robustness.
Further reading
- Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR. (The definitive PEFT paper).
- Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS. (4-bit NF4 quantization).
- Houlsby, N., et al. (2019). Parameter-Efficient Transfer Learning for NLP. ICML. (Adapter layers).