Week 8: Fine-Tuning and Parameter-Efficient Methods

Purpose of this lecture#

Large Vision-Language Models (VLMs) have billions of parameters. These models contain broad general knowledge, but they frequently fail at proprietary, domain-specific tasks (e.g., parsing medical X-rays, reading obscure blueprints, or formatting output for a specific robotic API). Adapting them is necessary, but full fine-tuning is prohibitively expensive in memory and cost for most practitioners. A 7B-parameter LLM at fp32 precision requires 28 GB just to store the weights, plus 56 GB for Adam optimizer states and 28 GB for gradients: 112 GB in total before activations, which exceeds the memory of even an 80 GB A100 GPU.

Parameter-Efficient Fine-Tuning (PEFT) methods address this hardware bottleneck by freezing the vast majority of the network and training only a small, strategically placed set of additional parameters. This lecture derives the mathematics of Low-Rank Adaptation (LoRA) and QLoRA, analyzes the systems engineering of edge-compute deployment, and examines how these methods enable continual learning on physical robotics hardware.

Why full fine-tuning is impractical at scale#

Full fine-tuning of a VLM with parameters $\theta$ updates every parameter using a task-specific dataset $\mathcal{D}$ :

\theta^* = \arg\min_\theta \mathcal{L}(\theta; \mathcal{D})

The practical obstacles to this approach are severe:

The Optimizer State Memory Wall: During training, the Adam optimizer maintains two moving averages (first moment $m_t$ and second moment $v_t$ ) for every single parameter. If $\theta$ has 7 billion parameters, storing $\theta$ in fp32 takes 28GB. Storing $m_t$ takes another 28GB, and $v_t$ takes another 28GB. With gradients and activations, the VRAM requirement explodes past 100GB.
Catastrophic Forgetting: Full fine-tuning on a narrow, task-specific dataset (like 5,000 images of chest X-rays) aggressively updates the model's weights. The VLM can quickly lose its pre-trained general capabilities: its conversational tone, its safety guardrails, and its ability to answer questions about anything other than X-rays.
Deployment Storage Overhead: If an enterprise needs 10 distinct task-specific VLMs, full fine-tuning requires saving 10 distinct 28GB checkpoints. Switching between tasks at inference time requires unloading and reloading 28GB into VRAM, inducing massive latency.

PEFT methods address all three problems by introducing a small set of task-specific parameters $\Delta\theta$ where $|\Delta\theta| \ll |\theta|$ , while keeping $\theta$ perfectly frozen:

y = f_\theta(x) + g_{\Delta\theta}(x)

Because $\theta$ is frozen, gradients and the Adam moments $m_t$ and $v_t$ are needed only for the small set $\Delta\theta$ rather than for all 7 billion parameters, which bypasses the memory wall.
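The memory arithmetic above can be reproduced in a few lines. The following is a rough sketch, not a profiler: it counts only weights, gradients, and Adam's two moments at 4 bytes each and ignores activations. The helper name `training_memory_gb` and the 35M adapter parameter count are illustrative assumptions, not figures from a specific model.

```python
def training_memory_gb(n_frozen, n_trainable, bytes_per_param=4):
    """Rough training-memory estimate in GB (1 GB = 1e9 bytes), ignoring activations."""
    weights = (n_frozen + n_trainable) * bytes_per_param
    gradients = n_trainable * bytes_per_param        # one gradient per trainable parameter
    adam_states = 2 * n_trainable * bytes_per_param  # first moment m_t and second moment v_t
    return {
        "weights_gb": weights / 1e9,
        "gradients_gb": gradients / 1e9,
        "adam_states_gb": adam_states / 1e9,
        "total_gb": (weights + gradients + adam_states) / 1e9,
    }

# Full fine-tuning: all 7B parameters are trainable.
full = training_memory_gb(n_frozen=0, n_trainable=7_000_000_000)

# PEFT: 7B frozen parameters plus an assumed 35M trainable adapter parameters.
peft = training_memory_gb(n_frozen=7_000_000_000, n_trainable=35_000_000)

print(full["total_gb"])  # 112.0
print(peft)
```

In this sketch the frozen weights still dominate the PEFT budget, which is the cost that 4-bit storage of the base model targets.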

LoRA: Low-Rank Adaptation Mathematics#

LoRA (Hu et al., 2021) avoids training full-size weight updates by constraining each update to a low-rank factorization. The core hypothesis is that while the pre-trained matrices are full-rank, the task-specific updates $\Delta W$ lie in a low-dimensional subspace (they have a low "intrinsic rank").

For a frozen weight matrix $W \in \mathbb{R}^{d_\text{in} \times d_\text{out}}$ (e.g., the Query projection matrix in an attention head), LoRA constrains the update $\Delta W$ by decomposing it into two much smaller matrices:

W' = W + \Delta W = W + A B

where $A \in \mathbb{R}^{d_\text{in} \times r}$ and $B \in \mathbb{R}^{r \times d_\text{out}}$ , and the rank $r \ll \min(d_\text{in}, d_\text{out})$ .

Initialization and the Forward Pass#

Matrix $A$ is initialized with a zero-mean Gaussian distribution $\mathcal{N}(0, \sigma^2)$ , but matrix $B$ is explicitly initialized to all zeros. Therefore, at step $t=0$ , $\Delta W = A B = 0$ . This guarantees the model begins fine-tuning behaving exactly like the base pre-trained model.

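During the forward pass, the input $x$ (treated as a row vector in $\mathbb{R}^{1 \times d_\text{in}}$ so that it matches the shapes of $W$ , $A$ , and $B$ above) flows through both pathways:

h = x W + \frac{\alpha}{r} \, x A B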

Here, $\alpha$ is a constant scaling factor. Scaling by $\frac{\alpha}{r}$ reduces the need to retune the learning rate when the rank $r$ is changed during hyperparameter sweeps.
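The initialization and forward pass can be sketched in plain Python. This is a minimal illustration on tiny matrices, assuming a row-vector input so that the shapes $d_\text{in} \times r$ and $r \times d_\text{out}$ compose; the helper names `matmul` and `lora_forward` and all numeric values are made up for the example.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """Compute h = x W + (alpha / r) * x A B for a 1 x d_in row vector x."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)  # down-project to rank r, then up-project
    scale = alpha / r
    return [[b + scale * u for b, u in zip(brow, urow)]
            for brow, urow in zip(base, update)]

rng = random.Random(0)
d_in, d_out, r, alpha = 4, 3, 2, 16

W = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]  # frozen base weight
A = [[rng.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]   # Gaussian init
B = [[0.0] * d_out for _ in range(r)]                               # zero init
x = [[rng.gauss(0, 1) for _ in range(d_in)]]

h0 = lora_forward(x, W, A, B, alpha, r)
print(h0 == matmul(x, W))  # with B = 0 the adapter contributes nothing

B[0][0] = 0.5              # pretend one training step moved a single entry of B
h1 = lora_forward(x, W, A, B, alpha, r)
print(h0[0], h1[0])        # only the first output coordinate changes
```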

Concrete Dimensionality Example#

Consider the Attention Projection matrix $W_Q$ in a 7B LLaMA model, where $d_\text{in} = 4096$ and $d_\text{out} = 4096$ .

Full Fine-Tuning: Updating $W_Q$ requires training $4096 \times 4096 = 16,777,216$ parameters.
LoRA (Rank $r=16$ ): $A$ has $4096 \times 16 = 65,536$ parameters. $B$ has $16 \times 4096 = 65,536$ parameters. Total trainable parameters: $131,072$ .

This is a $128\times$ reduction in trainable parameters for this single layer. Because the frozen $W$ requires no gradients or optimizer states, training memory drops sharply. Furthermore, at inference time, the scaled product $\frac{\alpha}{r} A B$ can be precomputed and added directly to $W$ , so a merged LoRA adapter adds no latency during generation.
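The counts above, and the merge step, can be checked directly. This is a small sketch with hypothetical helper names (`lora_param_counts`, `merge`); the 2 x 2 matrices are toy values chosen so the result is easy to verify by hand.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full fine-tuning params, LoRA params, reduction factor) for one matrix."""
    full = d_in * d_out
    lora = d_in * r + r * d_out
    return full, lora, full / lora

def merge(W, A, B, alpha, r):
    """Return W + (alpha / r) * A B as a new matrix; W itself is left untouched."""
    scale = alpha / r
    AB = [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]
    return [[w + scale * d for w, d in zip(wrow, drow)] for wrow, drow in zip(W, AB)]

print(lora_param_counts(4096, 4096, 16))  # (16777216, 131072, 128.0)

# Toy merge with rank 1: the merged matrix has the same shape as W,
# so inference uses a single matrix multiply per layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]
B = [[3.0, 4.0]]
print(merge(W, A, B, alpha=2, r=1))       # [[7.0, 8.0], [12.0, 17.0]]
```

Keeping the adapter unmerged instead makes it easy to swap adapters at runtime, at the cost of the extra low-rank multiply.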

QLoRA: 4-Bit Quantization for Single-GPU Training#

Even with LoRA, storing a frozen 70B parameter LLM in bf16 requires 140GB of VRAM just to hold the base model. QLoRA (Dettmers et al., 2023) removes this barrier by storing the frozen base model in 4 bits, which allows 65B-scale models to be fine-tuned on a single 48GB GPU.

QLoRA achieves this via three techniques:

4-bit NormalFloat (NF4): Instead of standard linear INT4 quantization, QLoRA uses a data type that is information-theoretically optimal for normally distributed weights. Because pre-trained network weights are approximately Gaussian and centered at zero, NF4 places its 16 quantization levels at quantiles of a normal distribution, so more levels are clustered near zero (where most weights are) and fewer sit in the tails. This lowers quantization error compared with evenly spaced levels.
Double Quantization: Block-wise quantization stores one scaling constant per block of weights (a 32-bit float per 64 weights in the QLoRA paper, which is 0.5 bits of overhead per parameter). QLoRA quantizes these scaling constants themselves to 8-bit floats in nested blocks of 256, cutting the overhead to roughly 0.127 bits per parameter and saving about 3GB on a 65B model.
Paged Optimizers: If the GPU runs out of VRAM during a memory spike (for example, a mini-batch with long sequences), QLoRA uses NVIDIA unified memory to automatically page the Adam optimizer states to CPU RAM and back, preventing out-of-memory (OOM) crashes.

During the forward and backward passes, the 4-bit frozen weights are dequantized to bf16 on the fly to perform the matrix multiplication with $x$ , and gradients are computed only for the bf16 LoRA adapters $A$ and $B$ ; the 4-bit base weights are never updated.
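The store-in-4-bit, compute-in-higher-precision idea can be illustrated with a simplified quantile codebook. This is a sketch under stated assumptions, not the NF4 data type itself: the codebook below is built from evenly spaced normal quantiles, whereas the real NF4 table is constructed differently (it is asymmetric and contains an exact zero). The function names and the block of random weights are hypothetical.

```python
import random
from statistics import NormalDist

def normal_quantile_codebook(levels=16):
    """Simplified stand-in for NF4: normal quantiles rescaled to [-1, 1]."""
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]

def quantize_block(block, codebook):
    """Return (absmax scale, 4-bit codes) for one block of weights."""
    scale = max(abs(w) for w in block) or 1.0
    codes = [min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / scale))
             for w in block]
    return scale, codes

def dequantize_block(scale, codes, codebook):
    """Recover approximate weights; this is what the matmul would consume."""
    return [scale * codebook[c] for c in codes]

rng = random.Random(0)
codebook = normal_quantile_codebook()
block = [rng.gauss(0, 0.02) for _ in range(64)]  # one block of 64 weights

scale, codes = quantize_block(block, codebook)
restored = dequantize_block(scale, codes, codebook)
max_err = max(abs(a - b) for a, b in zip(block, restored))

# Levels are denser near zero than in the tails.
print(codebook[8] - codebook[7], codebook[15] - codebook[14])
print(max_err / scale)
```

Only `scale` and `codes` would be stored; the dequantized values exist transiently when the layer is evaluated.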

Adapter Strategies in VLMs (Where to put the parameters?)#

When fine-tuning a VLM (like LLaVA), engineers must choose which modules to inject with LoRA adapters:

Projector-Only Fine-Tuning: Freeze the ViT and the LLM. Train only the linear/MLP projector. This is extremely cheap and works when the domain shift is mild (e.g., standard natural images, but answering in a specific JSON format).
LLM LoRA: Freeze the ViT and the Projector. Inject LoRA into the LLM's $W_Q, W_K, W_V, W_O$ matrices. This adapts the reasoning and text-generation style of the model.
Vision (CLIP) LoRA: Freeze the LLM, inject LoRA into the ViT. This is usually needed for severe visual domain shifts (e.g., satellite imagery, medical histopathology, or deep-sea robotics). If the base CLIP model has never seen a radar scan, LoRA on the LLM alone cannot fix it, because the visual features entering the LLM are already uninformative.
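The three placement strategies amount to a mapping from module names to a training mode. The sketch below shows one way to express that mapping; the module names (`vision_tower.`, `projector.`, `llm.`, `q_proj`, and so on) are hypothetical placeholders, and real checkpoints will use their own naming.

```python
# Hypothetical strategy table: which name prefixes are fully trained or get LoRA.
STRATEGIES = {
    "projector_only": {"train_fully": ["projector."], "lora": []},
    "llm_lora": {"train_fully": [], "lora": ["llm."]},
    "vision_lora": {"train_fully": [], "lora": ["vision_tower."]},
}
ATTN_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj")  # W_Q, W_K, W_V, W_O

def plan_adapters(module_names, strategy):
    """Map each module name to 'train', 'lora', or 'frozen' for the chosen strategy."""
    spec = STRATEGIES[strategy]
    plan = {}
    for name in module_names:
        if any(name.startswith(p) for p in spec["train_fully"]):
            plan[name] = "train"
        elif any(name.startswith(p) for p in spec["lora"]) and name.endswith(ATTN_SUFFIXES):
            plan[name] = "lora"
        else:
            plan[name] = "frozen"
    return plan

modules = [
    "vision_tower.layers.0.attn.q_proj",
    "vision_tower.layers.0.mlp.fc1",
    "projector.linear_1",
    "llm.layers.0.attn.q_proj",
    "llm.layers.0.attn.o_proj",
    "llm.layers.0.mlp.up_proj",
]

for strategy in STRATEGIES:
    print(strategy, plan_adapters(modules, strategy))
```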

Data Mixing as a Regularizer#

To prevent the LoRA adapters from aggressively overfitting to the new task (causing the VLM to forget how to answer general questions), engineers use Data Mixing. The loss function is a weighted sum:

\mathcal{L}_\text{total} = \lambda \mathcal{L}_\text{task}(\theta_{\text{frozen}} + \Delta\theta; \mathcal{D}_\text{task}) + (1-\lambda) \mathcal{L}_\text{pretrain}(\theta_{\text{frozen}} + \Delta\theta; \mathcal{D}_\text{pretrain})

By keeping $5-10\%$ of the original pretraining conversational data in each fine-tuning batch (roughly $\lambda \approx 0.9$ to $0.95$ when every example is weighted equally), the LoRA adapters are regularized against catastrophic forgetting.
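A minimal sketch of data mixing, assuming the datasets are plain Python lists and the per-source losses are already computed as scalars. The names `mixed_batch` and `total_loss` and the toy datasets are hypothetical.

```python
import random

def mixed_batch(task_data, pretrain_data, batch_size, pretrain_fraction=0.1, seed=0):
    """Sample one batch that reserves a fixed share of slots for pretraining data."""
    rng = random.Random(seed)
    n_pre = max(1, round(batch_size * pretrain_fraction))
    n_task = batch_size - n_pre
    batch = rng.sample(task_data, n_task) + rng.sample(pretrain_data, n_pre)
    rng.shuffle(batch)
    return batch

def total_loss(task_loss, pretrain_loss, lam):
    """Weighted sum: lam * L_task + (1 - lam) * L_pretrain."""
    return lam * task_loss + (1 - lam) * pretrain_loss

task_data = [("task", i) for i in range(100)]
pretrain_data = [("pretrain", i) for i in range(100)]

batch = mixed_batch(task_data, pretrain_data, batch_size=32, pretrain_fraction=0.1)
n_pre = sum(1 for source, _ in batch if source == "pretrain")
print(len(batch), n_pre)

print(total_loss(task_loss=2.0, pretrain_loss=1.0, lam=0.9))
```

Averaging a per-example loss over such a batch is equivalent to the weighted sum with $\lambda$ equal to the share of task examples.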

Edge Compute and Continual Learning in Robotics#

The ultimate promise of PEFT is deploying VLMs on physical hardware (e.g., an NVIDIA Jetson Orin board on a mobile robot). Running a 7B VLM at 50Hz for continuous motor control is currently impractical on a mobile GPU.

The System 1 / System 2 Architectural Tradeoff#

Robotics systems solve the edge-compute bottleneck via asynchronous routing:

System 1 (Fast/Local): A small, highly optimized continuous control policy (e.g., the Flow Matching $\pi_0$ from Course 2) runs at 50Hz directly on the robot's local Jetson board. It handles immediate reactions, balancing, and low-level motor torques.
System 2 (Slow/Cloud): The massive 7B or 70B VLM runs in the cloud (or on a massive local server) at 1Hz. It receives images from the robot, processes the semantic human command ("Find the missing wrench"), and sends a high-level spatial waypoint back down to System 1.
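The asynchronous hand-off between the two systems can be sketched as a deterministic tick simulation. This is an illustration only: the planner is replaced by a string stand-in, its latency is assumed to be exactly one second, and the function name `simulate` and the `hold_position` default are hypothetical.

```python
def simulate(total_ticks, control_hz=50, planner_latency_s=1.0):
    """System 1 acts on every tick; System 2 delivers a new goal once per latency period."""
    latency_ticks = round(control_hz * planner_latency_s)
    goal = "hold_position"   # safe default until the first plan arrives
    ready_at, plan_id = latency_ticks, 0
    log = []
    for tick in range(total_ticks):
        if tick == ready_at:
            goal = f"waypoint_{plan_id}"      # stand-in for the VLM's output
            plan_id += 1
            ready_at = tick + latency_ticks   # the next plan is already being computed
        log.append((tick, goal))              # stand-in for the fast policy's action
    return log

log = simulate(total_ticks=150)
print(log[0], log[49], log[50], log[100], log[-1])
```

The fast loop never blocks on the planner: it always reads the most recent goal, which is the property the architecture relies on.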

Continual Learning via LoRA#

Robot hardware degrades. Motor friction changes, and operating environments shift. A VLM deployed in a factory in winter will see different lighting than in summer. Continual Learning attempts to update the deployed VLM on the fly.

Because full fine-tuning risks catastrophic forgetting, LoRA is a natural fit for continual learning. A robot can collect failure cases during the day. At night, while plugged into the charging dock, it trains a new LoRA adapter on that day's data (for example, its lighting conditions). Because LoRA adapters are small (typically megabytes to tens of megabytes), the robot can store a library of hundreds of environment-specific adapters and swap them into VRAM based on the current sensory context, without ever permanently altering the foundational VLM weights.
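The adapter-library idea can be sketched as follows. This is a toy illustration in which scalars stand in for weight matrices and adapter deltas; the class name `AdapterLibrary`, the keys `w_q` and `w_v`, and the adapter names are all hypothetical.

```python
class AdapterLibrary:
    """Keeps base weights read-only and applies one named adapter at a time."""

    def __init__(self, base_weights):
        self._base = dict(base_weights)  # private copy; never modified afterwards
        self._adapters = {}
        self.active = None

    def register(self, name, delta):
        """Store an adapter as a mapping from weight name to its update."""
        self._adapters[name] = dict(delta)

    def activate(self, name):
        if name not in self._adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name

    def effective_weights(self):
        """Return base + active adapter as a new dict."""
        merged = dict(self._base)
        if self.active is not None:
            for key, delta in self._adapters[self.active].items():
                merged[key] = merged[key] + delta
        return merged

library = AdapterLibrary({"w_q": 1.0, "w_v": 2.0})
library.register("winter_lighting", {"w_q": 0.25})
library.register("summer_lighting", {"w_q": -0.5})

library.activate("winter_lighting")
print(library.effective_weights())  # {'w_q': 1.25, 'w_v': 2.0}

library.activate("summer_lighting")
print(library.effective_weights())  # {'w_q': 0.5, 'w_v': 2.0}
```

Choosing which adapter to activate from the sensory context is a separate problem that this sketch does not address.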

Key takeaways#

Full fine-tuning of billions of parameters is impractical because of the memory wall of optimizer states and the risk of catastrophic forgetting. LoRA addresses this by freezing the network and injecting low-rank matrices $A$ and $B$ , cutting trainable parameters by over $100\times$ while allowing the updates to be merged for zero-latency inference. QLoRA pushes this further via 4-bit NormalFloat quantization, allowing 65B-scale models to be trained on a single GPU. In multimodal physical AI, PEFT is not just a cost-saving measure; it makes continual learning practical, allowing edge-deployed robots to swap lightweight LoRA adapters as environmental conditions change without overwriting their foundational weights.

Conceptual questions#

LoRA Initialization Mathematics: LoRA initializes matrix $A \sim \mathcal{N}(0, \sigma^2)$ and matrix $B = 0$ . Consider an alternative initialization where the entries of $A$ and $B$ are both drawn independently from $\mathcal{N}(0, 1/r)$ . Derive the expected squared Frobenius norm of $\Delta W = AB$ at initialization. Explain why this alternative initialization causes a spike in the cross-entropy loss on the very first training step, and why setting $B=0$ guarantees a stable start.
Optimizer Memory Bounds: A team wants to fine-tune a 14B parameter VLM using full fp32 precision. The Adam optimizer stores a first moment $m_t$ and second moment $v_t$ for every parameter. Calculate the exact VRAM in Gigabytes required strictly to hold the model weights and the Adam states (assume 1 parameter = 4 bytes). Why is the actual VRAM required during the training loop significantly higher than this number, forcing the team to switch to QLoRA?
QLoRA Gradient Flow: In QLoRA, the frozen weights $W$ are stored in 4-bit NF4, but they are dequantized to bf16 (16-bit) during the forward pass to multiply with the activations. During the backward pass, gradients must flow through these dequantized weights to reach earlier layers. The derivative of a discrete quantization step function is zero almost everywhere, which is why quantization-aware training of the weights themselves relies on the Straight-Through Estimator (STE). Explain why QLoRA does not need an STE, and why the gradients are ultimately only applied to the high-precision LoRA matrices.
Catastrophic Forgetting and Modality Gaps: A medical team fine-tunes a LLaVA-7B model purely on X-ray images using LoRA rank 64 applied to both the ViT and the LLM. They use no Data Mixing. After 10 epochs, the model reads X-rays perfectly, but when given a photo of a dog and asked "What is this?", it replies with "Severe pulmonary edema in the lower left lobe." Diagnose this failure geometrically in the latent space. How did the lack of regularizing data cause the LLM's language generation distribution to collapse into a narrow medical subspace?
Edge Robotics Systems Design: You are deploying a bimanual robot to sort recycling on a fast-moving conveyor belt. You have a 7B VLM that can perfectly identify recyclable materials, but its inference latency is 1.5 seconds. The conveyor belt requires action updates at 50Hz (0.02 seconds). Design a System 1 / System 2 asynchronous architecture that utilizes the VLM for semantic reasoning while relying on a lightweight, continuous control policy (like a Diffusion Policy or ACT) for high-frequency motor control. How do the two systems communicate their state?

Solutions

LoRA init. With the entries of $A$ and $B$ drawn independently from $\mathcal{N}(0,1/r)$ , each entry of $\Delta W = AB$ is a sum of $r$ independent products, each with variance $1/r^2$ , so it has variance $1/r$ and $\mathbb{E}\|\Delta W\|_F^2 = d_\text{in} d_\text{out} / r$ . For $d_\text{in} = d_\text{out} = 4096$ and $r = 16$ this is a per-entry standard deviation of $0.25$ , which is typically large relative to pre-trained weight entries, so $\Delta W$ perturbs the model's output on step 0 and the loss spikes. Setting $B=0$ makes $\Delta W = 0$ at init, so the output equals the base model (a stable start). The gradient with respect to $A$ is zero at that point, but $B$ receives a non-zero gradient because $A \neq 0$ ; once $B$ moves away from zero, $A$ starts training too.
Optimizer memory. $14\text{B} \times 4$ bytes $= 56$ GB of weights; Adam's two moments add $2 \times 56 = 112$ GB, for 168 GB just for weights and states. The real loop also stores gradients (56 GB, bringing the total to 224 GB) and activations for backprop, pushing well past 224 GB. That is impossible on a single GPU, which is why the team switches to QLoRA (4-bit frozen weights plus tiny trainable adapters).
QLoRA gradient flow. The forward pass dequantizes NF4 to bf16 and uses the result as a constant matrix. Backpropagating to the layer input only requires multiplying the upstream gradient by this dequantized matrix; it never differentiates the quantization function itself, because no gradient with respect to the 4-bit weights is needed. An STE is only required when the quantized weights are themselves trained, which QLoRA avoids. The frozen base weights receive no update, and gradients are applied only to the high-precision LoRA matrices.
Forgetting / modality collapse. LoRA on both the ViT and LLM, no data mixing, ten epochs on X-rays: the LLM's generation distribution collapses toward radiology text. Geometrically the language manifold contracts into the medical subspace, so any input, even a dog, decodes to medical phrasing. Mixing in general data, lowering the rank, or freezing the LLM regularizes this. Because the base weights are frozen, removing the adapter restores the original behavior.
Edge System 1/2. Run the 7B VLM asynchronously at a low rate (System 2), emitting a semantic goal/latent; a lightweight Diffusion Policy or ACT (System 1) runs the 50 Hz loop conditioned on that latest goal plus fast proprioceptive/visual state $s_t$ . They communicate through a shared goal/latent buffer: System 2 writes a new goal when ready (~1.5 s), System 1 keeps tracking the most recent goal at 50 Hz using real-time feedback.

Looking ahead#

PEFT methods enable us to efficiently adapt massive VLMs to specific physical and semantic tasks. However, training is only half the battle. How do we scientifically prove that these fine-tuned models are actually working correctly, and how vulnerable are they to domain shifts?

Week 9: Evaluation and Robustness. We examine the full evaluation landscape for VLMs: from deeply flawed historical metrics like BLEU and CIDEr, to modern reasoning benchmarks, distribution shifts, spurious correlations, and why benchmark performance frequently overestimates real-world robotic robustness.

