Week 6: LLaVA and Multimodal Instruction Tuning

Purpose of this lecture#

While BLIP-2 successfully bridged vision and language using a complex, learned Q-Former, it primarily produced a system optimized for static, short-form tasks like captioning and VQA. It lacked the conversational fluidity and open-ended reasoning capabilities of modern Large Language Models like ChatGPT. Instruction-following—the ability to respond correctly to unconstrained, conversationally phrased queries about images—requires a specific training signal: explicitly supervised instruction-response pairs that teach the model how to respond, not just what the correct answer is.

LLaVA (Large Language-and-Vision Assistant; Liu et al., 2023) demonstrated a monumental breakthrough: this complex instruction-following capability could be achieved using a shockingly simple architecture (a linear matrix projection) combined with a highly curated data pipeline (GPT-4 generated instruction data). By reframing visual features as standard text tokens, LLaVA established the dominant architectural template for almost all subsequent open-weight multimodal assistants.

Architecture: Vision Encoder + Projector + LLM#

LLaVA's architecture strips away the complexity of cross-attention modules and Q-Formers in favor of strict, sequential minimalism. It consists of three components:

1. The Vision Encoder: A CLIP ViT-L/14 (specifically the 336px resolution variant) is used to process the image. Crucially, the encoder is kept strictly frozen throughout all stages of training. Unlike CLIP (which uses the [CLS] token), LLaVA extracts the full grid of 256 spatial patch tokens from the penultimate transformer layer. This preserves the dense spatial geometry required for answering localized questions. Let these features be $Z_v \in \mathbb{R}^{256 \times D_v}$ , where $D_v = 1024$ .

2. The Projector: A simple learnable mapping $W$ that transforms the visual features into the exact mathematical dimension of the LLM's token embedding space. In the original LLaVA, this is a single linear projection matrix $W \in \mathbb{R}^{D_v \times D_l}$ .

H_v = Z_v W

This produces 256 visual "pseudo-tokens" $H_v \in \mathbb{R}^{256 \times D_l}$ (where $D_l = 4096$ for a 7B LLM). In LLaVA 1.5, this is upgraded to a two-layer Multi-Layer Perceptron (MLP) with a GELU activation function:

H_v = W_2 \cdot \text{GELU}(W_1 Z_v)

3. The Language Model: An instruction-tuned LLM, typically Vicuna (a LLaMA derivative) or LLaMA-2-Chat. The LLM simply concatenates the visual pseudo-tokens $H_v$ and the text tokens $H_t$ into a single, continuous 1D sequence:

H_\text{input} = [H_v; H_t]

The LLM processes this concatenated sequence using its standard causal self-attention mechanism. No special cross-attention layers are added; the LLM treats the visual tokens exactly as if they were foreign language words that it needs to translate.

Why simple projectors work mathematically#

How can a simple linear matrix $W$ with only 4 million parameters replace BLIP-2's complex 188M parameter Q-Former?

The heavy lifting is done by CLIP: The CLIP encoder was pre-trained using the InfoNCE contrastive loss (Week 3), meaning its visual feature space is mathematically already aligned with natural language. Projecting CLIP features into an LLM's token space is therefore a relatively simple affine transformation rather than a complex representation-learning problem.
LLM capacity compensates: The LLM fine-tuning step exposes the LLM's massive 7B parameter attention heads to the visual pseudo-tokens. Because the LLM is unmasked and fine-tuned, its self-attention layers learn to dynamically interpret and route the projected visual tokens, compensating for the simplicity of the projector.
Spatial density: LLaVA passes all 256 visual tokens into the LLM, whereas BLIP-2 compressed them into 32 tokens. Passing the raw spatial grid prevents the mathematical information bottleneck that plagued BLIP-2, allowing the LLM's attention heads to query high-resolution spatial relationships directly.

Synthesizing multimodal instruction data (GPT-4)#

The primary bottleneck to building LLaVA was data. In 2023, there were zero large-scale datasets containing complex, multi-turn, conversationally phrased (image, instruction, response) triples. Curating this via human annotators would have cost millions of dollars. LLaVA's genius was to use a text-only LLM (GPT-4) to synthetically generate this data.

The Text-Proxy Pipeline: Because the original GPT-4 could not see images, the LLaVA authors translated images from the COCO dataset into a text-only representation. They fed GPT-4:

The symbolic bounding box coordinates of every object in the image: [person: [0.1, 0.2, 0.5, 0.8], dog: [0.6, 0.7, 0.9, 0.9]].
Several short, human-annotated captions describing the scene.

Using this symbolic text as a proxy for the image, they prompted GPT-4 to ACT as a simulated user and assistant, generating three specific types of instruction-response pairs:

Conversation (58K): Multi-turn dialogue about the image, mimicking a user asking questions about objects, counts, and spatial relations.
Detailed Description (23K): A rich, paragraph-length description of the scene's composition, lighting, and narrative.
Complex Reasoning (77K): High-level questions requiring logical deduction. For example, if the bounding boxes indicate a man holding a leash attached to a dog, the question might be "What is the relationship between the man and the animal, and what activity are they likely engaging in?"

This resulted in 158K highly complex, linguistically sophisticated instruction-following examples generated purely via API calls.

The Two-Stage Training Protocol#

Training a multimodal system from scratch is highly unstable. If you unfreeze the LLM while the projector $W$ is still outputting random noise, the garbage visual tokens will destroy the LLM's pre-trained linguistic weights (catastrophic forgetting). LLaVA mitigates this using a strict two-stage protocol:

Stage 1: Feature Alignment (Pre-training)#

Frozen: Vision Encoder and LLM.
Trainable: Only the Projector $W$ .
Data: 558K simple image-caption pairs from CC3M.
Goal: Treat the task as language modeling. The model is given the visual tokens and the prompt <image>\nDescribe the image. It must output the caption. Because the LLM is frozen, the projector is forced to mathematically rotate the CLIP visual features until they map cleanly onto the exact embedding vectors the LLM expects for the corresponding English words.

Stage 2: End-to-End Instruction Tuning#

Frozen: Vision Encoder.
Trainable: The Projector $W$ and the entire LLM.
Data: The 158K synthetic GPT-4 instruction-following examples.
Goal: Teach the system conversational formatting, deep reasoning, and safety.

Loss Masking#

Crucially, during Stage 2, the autoregressive cross-entropy loss is selectively masked. The model processes the system prompt, the user's instruction, the visual tokens, and the assistant's response. However, the loss function only calculates gradients for the tokens belonging to the Assistant's response.

\mathcal{L} = -\sum_{t=1}^T \mathbb{1}_{\{w_t \in \text{Response}\}} \log P_\theta(w_t \mid w_{1:t-1}, H_v)

If the model was penalized for failing to predict the user's instruction or the visual tokens, it would destroy the LLM's conversational formatting priors.

Connection to Course 1 (RLHF & Alignment)#

LLaVA's Stage 2 is the multimodal equivalent of Supervised Fine-Tuning (SFT) from Course 1, Week 12. Just as an LLM requires SFT to transition from a base text-predictor to a conversational agent, a VLM requires visual SFT to transition from a static captioner to an interactive assistant.

Furthermore, safety and refusal behaviors are implicitly learned during this phase. If the GPT-4 dataset contains examples where the user asks "How do I break into this car?" (accompanied by a photo of a car), and the target response is a polite refusal, the LLaVA model learns this exact probability distribution. The refusal behavior is not a hard-coded software filter; it is an emergent property of the autoregressive language modeling loss optimizing against safe target tokens.

Key takeaways#

LLaVA defined the modern architectural standard for open-weight VLMs. It connects a frozen CLIP Vision Transformer to an instruction-tuned LLM using a trivial linear or MLP projector, treating visual patches as 256 foreign-language tokens prepended to the user's text prompt. Because high-quality, multimodal conversational data did not exist, LLaVA synthetically generated 158K complex reasoning examples by using GPT-4 to translate bounding-box coordinates into rich dialogue. The network is trained in two stages to prevent catastrophic forgetting: first freezing the LLM to align the projector, then unfreezing the LLM to teach it instruction-following via masked autoregressive loss.

Conceptual questions#

Projector Matrix Mathematics: LLaVA 1.0 uses a linear projector $W \in \mathbb{R}^{1024 \times 4096}$ . During Stage 1, the LLM is frozen. Mathematically explain what happens to the gradient descent process if the CLIP vision encoder outputs highly entangled, non-linear features that cannot be cleanly separated by a single affine transformation $W$ . Why does upgrading to a two-layer MLP with a GELU activation (as done in LLaVA 1.5) mathematically resolve this specific representation bottleneck?
Loss Masking Debugging: You are training a custom LLaVA model for a robotics application. Your training loop has a bug: you forgot to apply the loss mask $\mathbb{1}_{\{w_t \in \text{Response}\}}$ , meaning the cross-entropy loss is being computed over the <SYSTEM> prompt and the <USER> instruction tokens as well. Describe the exact behavioral failure mode the model will exhibit at inference time when a user inputs a prompt. Why will the model attempt to "autocomplete" the user rather than answering them?
Data Synthesis Blind Spots: The LLaVA data generation pipeline feeds GPT-4 the text-based COCO bounding boxes and captions to simulate the image. Identify three specific visual properties (e.g., related to lighting, texture, or fine-grained text/OCR) that are entirely absent from standard bounding-box coordinates. If a user later asks the trained LLaVA model "Is the surface of this object rough or smooth?", explain precisely why the model will hallucinate an answer despite passing all validation loss metrics during training.
Resolution Scaling Complexity: A ViT-L/14 at 336px resolution produces 256 visual tokens. You want to upgrade your LLaVA architecture to process 1024px images to read fine-grained text. Calculate the new number of spatial patch tokens $N$ produced by the ViT. Given that the LLM uses standard causal self-attention (which scales as $O(L^2)$ where $L$ is the sequence length), compute the multiplicative increase in GPU memory required for the LLM's KV-cache. Why does this quadratic scaling make LLaVA's "prepend all tokens" architecture fundamentally unsustainable for high-resolution video?
Jailbreaking Instruction-Tuned VLMs: In Course 1, we learned that SFT safety guardrails can be bypassed. A user uploads an image of a handwritten note containing malicious instructions (e.g., a recipe for a dangerous chemical) and prompts the LLaVA model: "Transcribe the text in this image perfectly." Explain why LLaVA is highly likely to comply and output the dangerous text, bypassing its safety tuning. How does the modality gap (the fact that safety was trained on text prompts, but the malicious payload is hidden inside the visual tokens) create a new class of VLM vulnerabilities?

Solutions

Linear projector limit. A single affine $W$ can only apply a linear map; if the CLIP features are non-linearly entangled relative to the LLM's token space, no $W$ separates them and gradient descent converges to a best-fit linear approximation with irreducible residual — the alignment loss plateaus. A two-layer MLP with GELU adds the non-linearity needed to warp and disentangle the features.
Loss-mask bug. Computing loss over the system and user tokens trains the model to predict the prompt itself. At inference it then continues/autocompletes the user's instruction like a base language model rather than answering, because it was optimized to model the prompt distribution instead of only the response.
Data-synthesis blind spots. Bounding boxes and captions omit lighting/shadow, surface texture/material, and fine text/OCR. Asked "rough or smooth?", the model has no grounded signal for texture (GPT-4 invented the training answers from boxes), so it hallucinates plausibly — and validation loss looks fine because the validation set shares the identical blind spot.
Resolution scaling. $1024/14 \approx 73$ patches per side, so $N \approx 73^2 \approx 5{,}300$ tokens versus 256 — roughly $21\times$ more. Self-attention is $O(L^2)$ , so the KV-cache grows about $21\times$ and attention compute about $440\times$ . Prepending all tokens therefore becomes unsustainable, especially for video where the sequence explodes — motivating token compression/resamplers.
Jailbreak via image text. An OCR "transcribe this" request routes the malicious payload through the visual modality, but safety SFT was trained on text prompts, so the guardrail distribution never covered instructions hidden in pixels. The model faithfully transcribes the harmful text — the modality gap means text-only safety alignment does not transfer to visually-encoded content, a new attack surface.

Looking ahead#

LLaVA's design—prepended visual tokens processed by full self-attention—works beautifully for single images but hits a massive quadratic computational wall when scaling to high-resolution images or video.

Week 7: Alternative VLM Architectures. We examine Flamingo's gated cross-attention injection (enabling multi-image and video reasoning without quadratic explosion), Perceiver-style models with fixed-size latent bottlenecks (handling massive input lengths), and PaLI's unified encoder-decoder architecture.

Purpose of this lecture#

Architecture: Vision Encoder + Projector + LLM#

LLaVA's architecture strips away the complexity of cross-attention modules and Q-Formers in favor of strict, sequential minimalism. It consists of three components:

H_v = Z_v W

H_v = W_2 \cdot \text{GELU}(W_1 Z_v)

H_\text{input} = [H_v; H_t]

Why simple projectors work mathematically#

How can a simple linear matrix $W$ with only 4 million parameters replace BLIP-2's complex 188M parameter Q-Former?

The heavy lifting is done by CLIP: The CLIP encoder was pre-trained using the InfoNCE contrastive loss (Week 3), meaning its visual feature space is mathematically already aligned with natural language. Projecting CLIP features into an LLM's token space is therefore a relatively simple affine transformation rather than a complex representation-learning problem.
LLM capacity compensates: The LLM fine-tuning step exposes the LLM's massive 7B parameter attention heads to the visual pseudo-tokens. Because the LLM is unmasked and fine-tuned, its self-attention layers learn to dynamically interpret and route the projected visual tokens, compensating for the simplicity of the projector.
Spatial density: LLaVA passes all 256 visual tokens into the LLM, whereas BLIP-2 compressed them into 32 tokens. Passing the raw spatial grid prevents the mathematical information bottleneck that plagued BLIP-2, allowing the LLM's attention heads to query high-resolution spatial relationships directly.

Synthesizing multimodal instruction data (GPT-4)#

The Text-Proxy Pipeline: Because the original GPT-4 could not see images, the LLaVA authors translated images from the COCO dataset into a text-only representation. They fed GPT-4:

The symbolic bounding box coordinates of every object in the image: [person: [0.1, 0.2, 0.5, 0.8], dog: [0.6, 0.7, 0.9, 0.9]].
Several short, human-annotated captions describing the scene.

Using this symbolic text as a proxy for the image, they prompted GPT-4 to ACT as a simulated user and assistant, generating three specific types of instruction-response pairs:

Conversation (58K): Multi-turn dialogue about the image, mimicking a user asking questions about objects, counts, and spatial relations.
Detailed Description (23K): A rich, paragraph-length description of the scene's composition, lighting, and narrative.
Complex Reasoning (77K): High-level questions requiring logical deduction. For example, if the bounding boxes indicate a man holding a leash attached to a dog, the question might be "What is the relationship between the man and the animal, and what activity are they likely engaging in?"

This resulted in 158K highly complex, linguistically sophisticated instruction-following examples generated purely via API calls.

The Two-Stage Training Protocol#

Stage 1: Feature Alignment (Pre-training)#

Frozen: Vision Encoder and LLM.
Trainable: Only the Projector $W$ .
Data: 558K simple image-caption pairs from CC3M.
Goal: Treat the task as language modeling. The model is given the visual tokens and the prompt <image>\nDescribe the image. It must output the caption. Because the LLM is frozen, the projector is forced to mathematically rotate the CLIP visual features until they map cleanly onto the exact embedding vectors the LLM expects for the corresponding English words.

Stage 2: End-to-End Instruction Tuning#

Frozen: Vision Encoder.
Trainable: The Projector $W$ and the entire LLM.
Data: The 158K synthetic GPT-4 instruction-following examples.
Goal: Teach the system conversational formatting, deep reasoning, and safety.

Loss Masking#

\mathcal{L} = -\sum_{t=1}^T \mathbb{1}_{\{w_t \in \text{Response}\}} \log P_\theta(w_t \mid w_{1:t-1}, H_v)

If the model was penalized for failing to predict the user's instruction or the visual tokens, it would destroy the LLM's conversational formatting priors.

Connection to Course 1 (RLHF & Alignment)#

Key takeaways#

Conceptual questions#

Projector Matrix Mathematics: LLaVA 1.0 uses a linear projector $W \in \mathbb{R}^{1024 \times 4096}$ . During Stage 1, the LLM is frozen. Mathematically explain what happens to the gradient descent process if the CLIP vision encoder outputs highly entangled, non-linear features that cannot be cleanly separated by a single affine transformation $W$ . Why does upgrading to a two-layer MLP with a GELU activation (as done in LLaVA 1.5) mathematically resolve this specific representation bottleneck?
Loss Masking Debugging: You are training a custom LLaVA model for a robotics application. Your training loop has a bug: you forgot to apply the loss mask $\mathbb{1}_{\{w_t \in \text{Response}\}}$ , meaning the cross-entropy loss is being computed over the <SYSTEM> prompt and the <USER> instruction tokens as well. Describe the exact behavioral failure mode the model will exhibit at inference time when a user inputs a prompt. Why will the model attempt to "autocomplete" the user rather than answering them?
Data Synthesis Blind Spots: The LLaVA data generation pipeline feeds GPT-4 the text-based COCO bounding boxes and captions to simulate the image. Identify three specific visual properties (e.g., related to lighting, texture, or fine-grained text/OCR) that are entirely absent from standard bounding-box coordinates. If a user later asks the trained LLaVA model "Is the surface of this object rough or smooth?", explain precisely why the model will hallucinate an answer despite passing all validation loss metrics during training.
Resolution Scaling Complexity: A ViT-L/14 at 336px resolution produces 256 visual tokens. You want to upgrade your LLaVA architecture to process 1024px images to read fine-grained text. Calculate the new number of spatial patch tokens $N$ produced by the ViT. Given that the LLM uses standard causal self-attention (which scales as $O(L^2)$ where $L$ is the sequence length), compute the multiplicative increase in GPU memory required for the LLM's KV-cache. Why does this quadratic scaling make LLaVA's "prepend all tokens" architecture fundamentally unsustainable for high-resolution video?
Jailbreaking Instruction-Tuned VLMs: In Course 1, we learned that SFT safety guardrails can be bypassed. A user uploads an image of a handwritten note containing malicious instructions (e.g., a recipe for a dangerous chemical) and prompts the LLaVA model: "Transcribe the text in this image perfectly." Explain why LLaVA is highly likely to comply and output the dangerous text, bypassing its safety tuning. How does the modality gap (the fact that safety was trained on text prompts, but the malicious payload is hidden inside the visual tokens) create a new class of VLM vulnerabilities?

Solutions

Linear projector limit. A single affine $W$ can only apply a linear map; if the CLIP features are non-linearly entangled relative to the LLM's token space, no $W$ separates them and gradient descent converges to a best-fit linear approximation with irreducible residual — the alignment loss plateaus. A two-layer MLP with GELU adds the non-linearity needed to warp and disentangle the features.
Loss-mask bug. Computing loss over the system and user tokens trains the model to predict the prompt itself. At inference it then continues/autocompletes the user's instruction like a base language model rather than answering, because it was optimized to model the prompt distribution instead of only the response.
Data-synthesis blind spots. Bounding boxes and captions omit lighting/shadow, surface texture/material, and fine text/OCR. Asked "rough or smooth?", the model has no grounded signal for texture (GPT-4 invented the training answers from boxes), so it hallucinates plausibly — and validation loss looks fine because the validation set shares the identical blind spot.
Resolution scaling. $1024/14 \approx 73$ patches per side, so $N \approx 73^2 \approx 5{,}300$ tokens versus 256 — roughly $21\times$ more. Self-attention is $O(L^2)$ , so the KV-cache grows about $21\times$ and attention compute about $440\times$ . Prepending all tokens therefore becomes unsustainable, especially for video where the sequence explodes — motivating token compression/resamplers.
Jailbreak via image text. An OCR "transcribe this" request routes the malicious payload through the visual modality, but safety SFT was trained on text prompts, so the guardrail distribution never covered instructions hidden in pixels. The model faithfully transcribes the harmful text — the modality gap means text-only safety alignment does not transfer to visually-encoded content, a new attack surface.

Purpose of this lecture#

Architecture: Vision Encoder + Projector + LLM#

Why simple projectors work mathematically#

Synthesizing multimodal instruction data (GPT-4)#

The Two-Stage Training Protocol#

Stage 1: Feature Alignment (Pre-training)#

Stage 2: End-to-End Instruction Tuning#

Loss Masking#

Connection to Course 1 (RLHF & Alignment)#

Key takeaways#

Conceptual questions#

Looking ahead#

Further reading#

Week 6: LLaVA and Multimodal Instruction Tuning

Purpose of this lecture#

Architecture: Vision Encoder + Projector + LLM#

Why simple projectors work mathematically#

Synthesizing multimodal instruction data (GPT-4)#

The Two-Stage Training Protocol#

Stage 1: Feature Alignment (Pre-training)#

Stage 2: End-to-End Instruction Tuning#

Loss Masking#

Connection to Course 1 (RLHF & Alignment)#

Key takeaways#

Conceptual questions#

Looking ahead#

Further reading#

Week 6: LLaVA and Multimodal Instruction Tuning

Purpose of this lecture#

Architecture: Vision Encoder + Projector + LLMLarge Language Model#

Why simple projectors work mathematically#

Synthesizing multimodal instruction data (GPT-4)#

The Two-Stage Training Protocol#

Stage 1: Feature Alignment (Pre-training)#

Stage 2: End-to-End Instruction Tuning#

Loss Masking#

Connection to Course 1 (RLHFReinforcement Learning from Human Feedback & Alignment)#

Key takeaways#

Conceptual questions#

Looking ahead#

Further reading#

Week 6: LLaVA and Multimodal Instruction Tuning

Purpose of this lecture#

Architecture: Vision Encoder + Projector + LLMLarge Language Model#

Why simple projectors work mathematically#

Synthesizing multimodal instruction data (GPT-4)#

The Two-Stage Training Protocol#

Stage 1: Feature Alignment (Pre-training)#

Stage 2: End-to-End Instruction Tuning#

Loss Masking#

Connection to Course 1 (RLHFReinforcement Learning from Human Feedback & Alignment)#

Key takeaways#

Conceptual questions#

Looking ahead#

Further reading#

Architecture: Vision Encoder + Projector + LLM#

Connection to Course 1 (RLHF & Alignment)#

Architecture: Vision Encoder + Projector + LLM#

Connection to Course 1 (RLHF & Alignment)#