Week 6: LLaVA and Multimodal Instruction Tuning

✦Learning Outcomes
  • Generate synthetic instruction-following data using GPT-4
  • Apply instruction tuning to convert VLMs into conversational assistants
  • Analyze tradeoffs between frozen vs. fine-tuned vision encoders
◆Prerequisites
  • Week 5: BLIP-2 - Bridging frozen encoders
  • Understanding of transformer architecture from earlier weeks

Purpose of this lecture

While BLIP-2 successfully bridged vision and language using a complex, learned Q-Former, it primarily produced a system optimized for static, short-form tasks like captioning and VQA. It lacked the conversational fluidity and open-ended reasoning capabilities of modern Large Language Models like ChatGPT. Instruction-following—the ability to respond correctly to unconstrained, conversationally phrased queries about images—requires a specific training signal: explicitly supervised instruction-response pairs that teach the model how to respond, not just what the correct answer is.

LLaVA (Large Language-and-Vision Assistant; Liu et al., 2023) showed that this instruction-following capability can be achieved with a very simple architecture (a single linear projection) combined with a carefully constructed data pipeline (GPT-4-generated instruction data). By mapping visual features into the LLM's token embedding space and treating them like ordinary input tokens, LLaVA established the architectural template followed by most subsequent open-weight multimodal assistants.


Architecture: Vision Encoder + Projector + LLM

LLaVA's architecture strips away the complexity of cross-attention modules and Q-Formers in favor of strict, sequential minimalism. It consists of three components:

1. The Vision Encoder: A CLIP ViT-L/14 processes the image. The original LLaVA uses 224px input, which gives a $16 \times 16$ grid of 256 patch tokens; LLaVA-1.5 moves to the 336px variant, which gives $24 \times 24 = 576$ tokens. The encoder is kept frozen throughout all stages of training. Rather than using only the pooled [CLS] embedding, as CLIP does for contrastive matching, LLaVA extracts the full grid of spatial patch tokens from the penultimate transformer layer. This preserves the spatial detail needed to answer localized questions. Let these features be $Z_v \in \mathbb{R}^{256 \times D_v}$, where $D_v = 1024$.
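The token count follows directly from the input resolution and the patch size. A minimal sketch of that arithmetic, assuming square inputs and non-overlapping 14px patches (the function name `num_patch_tokens` is illustrative, not from any library):

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of spatial patch tokens for a square ViT input (the [CLS] token is excluded)."""
    per_side = image_size // patch_size  # patches along one side of the grid
    return per_side * per_side


for size in (224, 336):
    print(f"{size}px -> {num_patch_tokens(size)} patch tokens")
```

With 14px patches, a 224px input gives a 16 x 16 grid and a 336px input gives a 24 x 24 grid.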

2. The Projector: A simple learnable mapping $W$ that transforms the visual features into the same dimension as the LLM's token embedding space. In the original LLaVA, this is a single linear projection matrix $W \in \mathbb{R}^{D_v \times D_l}$.

$$H_v = Z_v W$$

This produces 256 visual "pseudo-tokens" $H_v \in \mathbb{R}^{256 \times D_l}$ (where $D_l = 4096$ for a 7B LLM). In LLaVA 1.5, the projector is upgraded to a two-layer Multi-Layer Perceptron (MLP) with a GELU activation function. Written in the same row-vector convention as the linear case, with $W_1 \in \mathbb{R}^{D_v \times D_l}$, $W_2 \in \mathbb{R}^{D_l \times D_l}$, and biases omitted:

$$H_v = \text{GELU}(Z_v W_1)\, W_2$$

3. The Language Model: An instruction-tuned LLM, typically Vicuna (a LLaMA derivative) or LLaMA-2-Chat. The visual pseudo-tokens $H_v$ and the text token embeddings $H_t$ are concatenated into a single, continuous 1D sequence that is fed to the LLM:

$$H_\text{input} = [H_v; H_t]$$

The LLM processes this concatenated sequence using its standard causal self-attention mechanism. No special cross-attention layers are added; the LLM treats the visual tokens exactly as if they were foreign language words that it needs to translate.
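The shapes involved in the three components can be traced with a small NumPy sketch. This is an illustration under stated assumptions, not the LLaVA implementation: random arrays stand in for the frozen CLIP features and the text embeddings, the prompt length `N_T = 12` is arbitrary, biases are omitted, and the MLP hidden width is assumed to equal $D_l$.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
N_V, D_V, D_L, N_T = 256, 1024, 4096, 12  # N_T is an arbitrary illustrative prompt length

# Stand-ins for real tensors: frozen CLIP patch features and LLM text embeddings.
Z_v = rng.standard_normal((N_V, D_V), dtype=np.float32)
H_t = rng.standard_normal((N_T, D_L), dtype=np.float32)


def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


# Linear projector: H_v = Z_v W
W = rng.standard_normal((D_V, D_L), dtype=np.float32) * 0.02
H_v_linear = Z_v @ W

# Two-layer MLP projector: H_v = GELU(Z_v W1) W2
W1 = rng.standard_normal((D_V, D_L), dtype=np.float32) * 0.02
W2 = rng.standard_normal((D_L, D_L), dtype=np.float32) * 0.02
H_v_mlp = gelu(Z_v @ W1) @ W2

# Visual pseudo-tokens are prepended to the text embeddings along the sequence axis.
H_input = np.concatenate([H_v_linear, H_t], axis=0)
print(H_v_linear.shape, H_v_mlp.shape, H_input.shape)
```

Both projectors return one $D_l$-dimensional vector per patch, so the LLM sees a sequence of 256 visual positions followed by the text positions.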


Why simple projectors work mathematically

How can a simple linear matrix $W$ with only 4 million parameters replace BLIP-2's complex 188M parameter Q-Former?

  1. The heavy lifting is done by CLIP: The CLIP encoder was pre-trained using the InfoNCE contrastive loss (Week 3), so its visual feature space is already broadly aligned with natural language. Projecting CLIP features into an LLM's token space is therefore closer to a simple affine transformation than to a full representation-learning problem.
  2. LLM capacity compensates: The fine-tuning step exposes the LLM's 7B parameters to the visual pseudo-tokens. Because the LLM is unfrozen and fine-tuned in Stage 2, its self-attention layers learn to interpret and route the projected visual tokens, compensating for the simplicity of the projector.
  3. Spatial density: LLaVA passes all 256 visual tokens into the LLM, whereas BLIP-2 compresses the image into 32 query tokens. Passing the full spatial grid avoids that information bottleneck and lets the LLM's attention heads attend to spatial relationships directly.
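The "4 million parameters" figure can be checked with a line of arithmetic. The sketch below counts weights only (biases omitted) and, for the MLP variant, assumes a hidden width equal to $D_l$; the 188M Q-Former figure is taken from the text above.

```python
D_V, D_L = 1024, 4096

linear_params = D_V * D_L            # single projection matrix W
mlp_params = D_V * D_L + D_L * D_L   # W1 and W2 of a two-layer MLP (assumed hidden width D_L)
q_former_params = 188_000_000        # figure quoted in the text for BLIP-2's Q-Former

print(f"linear projector: {linear_params:,}")
print(f"MLP projector:    {mlp_params:,}")
print(f"Q-Former / linear ratio: {q_former_params / linear_params:.1f}")
```

Even the MLP variant stays well over an order of magnitude below the Q-Former's parameter budget under these assumptions.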

Synthesizing multimodal instruction data (GPT-4)

The primary bottleneck to building LLaVA was data. In early 2023, there were essentially no large-scale datasets of complex, multi-turn, conversationally phrased (image, instruction, response) triples, and collecting them from human annotators would have been prohibitively expensive. LLaVA's key idea was to use a text-only LLM (GPT-4) to generate this data synthetically.

The Text-Proxy Pipeline: Because the original GPT-4 could not see images, the LLaVA authors translated images from the COCO dataset into a text-only representation. They fed GPT-4:

  1. The symbolic bounding box coordinates of every object in the image: [person: [0.1, 0.2, 0.5, 0.8], dog: [0.6, 0.7, 0.9, 0.9]].
  2. Several short, human-annotated captions describing the scene.

Using this symbolic text as a proxy for the image, they prompted GPT-4 to act as a simulated user and assistant, generating three specific types of instruction-response pairs:

  • Conversation (58K): Multi-turn dialogue about the image, mimicking a user asking questions about objects, counts, and spatial relations.
  • Detailed Description (23K): A rich, paragraph-length description of the scene's composition, lighting, and narrative.
  • Complex Reasoning (77K): High-level questions requiring logical deduction. For example, if the captions and bounding boxes indicate a man holding a leash attached to a dog, the question might be "What is the relationship between the man and the animal, and what activity are they likely engaging in?"

This resulted in 158K highly complex, linguistically sophisticated instruction-following examples generated purely via API calls.
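The text-proxy idea can be sketched as a small formatting function that turns captions and normalized boxes into a plain-text context for a text-only model. The layout, the function name `build_text_proxy`, and the example caption are hypothetical; this is not the prompt format used by the LLaVA authors.

```python
def build_text_proxy(captions, boxes):
    """Render captions and normalized (x1, y1, x2, y2) boxes as plain text."""
    lines = ["Captions:"]
    lines += [f"- {caption}" for caption in captions]
    lines.append("Objects (normalized x1, y1, x2, y2):")
    for label, (x1, y1, x2, y2) in boxes:
        lines.append(f"- {label}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]")
    return "\n".join(lines)


# Illustrative inputs: the caption is invented; the boxes reuse the example values from the text.
captions = ["A man walks a dog in a park."]
boxes = [("person", (0.1, 0.2, 0.5, 0.8)), ("dog", (0.6, 0.7, 0.9, 0.9))]

context = build_text_proxy(captions, boxes)
print(context)
```

The resulting string would be placed in the prompt alongside an instruction asking the model to write a conversation, a detailed description, or a reasoning question about the scene it describes.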


The Two-Stage Training Protocol

Training a multimodal system end to end from the start is highly unstable. If you unfreeze the LLM while the projector $W$ is still outputting essentially random vectors, the resulting gradients can damage the LLM's pre-trained linguistic weights (catastrophic forgetting). LLaVA mitigates this using a strict two-stage protocol:

Stage 1: Feature Alignment (Pre-training)

  • Frozen: Vision Encoder and LLM.
  • Trainable: Only the Projector $W$.
  • Data: Simple image-caption pairs: 595K filtered from CC3M in the original LLaVA (LLaVA-1.5 uses a 558K subset drawn from LAION, CC, and SBU).
  • Goal: Treat the task as language modeling. The model is given the visual tokens and a prompt such as <image>\nDescribe the image. and must output the caption. Because the LLM is frozen, the only way to reduce the loss is for the projector to map the CLIP visual features into regions of the embedding space that the LLM already knows how to interpret.

Stage 2: End-to-End Instruction Tuning

  • Frozen: Vision Encoder.
  • Trainable: The Projector $W$ and the entire LLM.
  • Data: The 158K synthetic GPT-4 instruction-following examples.
  • Goal: Teach the system conversational formatting, deep reasoning, and safety.
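The two stages differ only in which modules receive gradient updates. A minimal sketch of that schedule as plain data; the stage and module names are illustrative labels rather than identifiers from the LLaVA codebase:

```python
# True means the module's weights are updated in that stage.
STAGES = {
    "stage1_feature_alignment": {"vision_encoder": False, "projector": True, "llm": False},
    "stage2_instruction_tuning": {"vision_encoder": False, "projector": True, "llm": True},
}


def trainable_modules(stage: str) -> list[str]:
    """Return the names of the modules that are trained in the given stage."""
    return [name for name, trainable in STAGES[stage].items() if trainable]


for stage in STAGES:
    print(stage, "->", trainable_modules(stage))
```

In a real training script, a table like this would typically drive which parameters have gradients enabled before the optimizer is built.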

Loss Masking

Crucially, during Stage 2, the autoregressive cross-entropy loss is selectively masked. The model processes the system prompt, the user's instruction, the visual tokens, and the assistant's response, but the loss is computed only over the tokens belonging to the assistant's response, so only those positions contribute gradients.

$$\mathcal{L} = -\sum_{t=1}^T \mathbb{1}_{\{w_t \in \text{Response}\}} \log P_\theta(w_t \mid w_{1:t-1}, H_v)$$

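If the model were penalized for failing to predict the user's instruction, it would be trained to imitate the user as well as the assistant, degrading the LLM's conversational formatting priors. The visual positions are excluded for a simpler reason: the pseudo-tokens are continuous vectors with no vocabulary entry to serve as a prediction target.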
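The masking can be sketched in NumPy. Two assumptions go beyond the text: masked positions are marked with a sentinel label of -100 (a common convention, used here only for illustration), and the loss is averaged over response tokens rather than summed, which differs from the formula above by a constant factor. The logits are assumed to be already aligned with their targets.

```python
import numpy as np

IGNORE_INDEX = -100  # sentinel for "no loss at this position" (assumed convention)


def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX.

    logits: (T, V) array; row t scores the target token at position t.
    labels: (T,) integer array; IGNORE_INDEX marks prompt and visual positions.
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    keep = labels != IGNORE_INDEX

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    safe_labels = np.where(keep, labels, 0)  # any valid index; masked out below
    token_nll = -log_probs[np.arange(len(labels)), safe_labels]
    return float((token_nll * keep).sum() / max(keep.sum(), 1))


rng = np.random.default_rng(0)
logits = rng.standard_normal((6, 10))  # 6 positions, toy vocabulary of 10
labels = np.array([IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 4, 7, 1])  # last 3 are the response
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Because the first three positions are masked, changing the model's predictions there leaves the loss unchanged, which is exactly the property the indicator function in the formula encodes.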


Connection to Course 1 (RLHF & Alignment)

LLaVA's Stage 2 is the multimodal equivalent of Supervised Fine-Tuning (SFT) from Course 1, Week 12. Just as an LLM requires SFT to transition from a base text-predictor to a conversational agent, a VLM requires visual SFT to transition from a static captioner to an interactive assistant.

Furthermore, safety and refusal behaviors are implicitly learned during this phase. If the GPT-4 dataset contains examples where the user asks "How do I break into this car?" (accompanied by a photo of a car), and the target response is a polite refusal, the LLaVA model learns this exact probability distribution. The refusal behavior is not a hard-coded software filter; it is an emergent property of the autoregressive language modeling loss optimizing against safe target tokens.


Key takeaways

LLaVA defined the modern architectural standard for open-weight VLMs. It connects a frozen CLIP Vision Transformer to an instruction-tuned LLM using a simple linear or MLP projector, treating visual patches as 256 foreign-language tokens prepended to the user's text prompt. Because high-quality multimodal conversational data did not exist, LLaVA synthetically generated 158K instruction-following examples (conversation, detailed description, and complex reasoning) by having GPT-4 turn captions and bounding-box coordinates into rich dialogue. The network is trained in two stages to prevent catastrophic forgetting: first freezing the LLM to align the projector, then unfreezing the LLM to teach it instruction-following via a masked autoregressive loss.


Conceptual questions

  1. Projector Matrix Mathematics: LLaVA 1.0 uses a linear projector $W \in \mathbb{R}^{1024 \times 4096}$. During Stage 1, the LLM is frozen. Mathematically explain what happens to the gradient descent process if the CLIP vision encoder outputs highly entangled, non-linear features that cannot be cleanly separated by a single affine transformation $W$. Why does upgrading to a two-layer MLP with a GELU activation (as done in LLaVA 1.5) mathematically resolve this specific representation bottleneck?
  2. Loss Masking Debugging: You are training a custom LLaVA model for a robotics application. Your training loop has a bug: you forgot to apply the loss mask $\mathbb{1}_{\{w_t \in \text{Response}\}}$, meaning the cross-entropy loss is being computed over the <SYSTEM> prompt and the <USER> instruction tokens as well. Describe the exact behavioral failure mode the model will exhibit at inference time when a user inputs a prompt. Why will the model attempt to "autocomplete" the user rather than answering them?
  3. Data Synthesis Blind Spots: The LLaVA data generation pipeline feeds GPT-4 the text-based COCO bounding boxes and captions to simulate the image. Identify three specific visual properties (e.g., related to lighting, texture, or fine-grained text/OCR) that are entirely absent from standard bounding-box coordinates. If a user later asks the trained LLaVA model "Is the surface of this object rough or smooth?", explain precisely why the model will hallucinate an answer despite passing all validation loss metrics during training.
  4. Resolution Scaling Complexity: A ViT-L/14 at 224px resolution produces 256 visual tokens. You want to upgrade your LLaVA architecture to process 1024px images to read fine-grained text. Calculate the new number of spatial patch tokens $N$ produced by the ViT. Given that the LLM uses standard causal self-attention (whose compute scales as $O(L^2)$ in the sequence length $L$, while the KV-cache grows linearly in $L$), compute the multiplicative increase in KV-cache memory and in attention compute. Why does this scaling make LLaVA's "prepend all tokens" architecture fundamentally unsustainable for high-resolution video?
  5. Jailbreaking Instruction-Tuned VLMs: In Course 1, we learned that SFT safety guardrails can be bypassed. A user uploads an image of a handwritten note containing malicious instructions (e.g., a recipe for a dangerous chemical) and prompts the LLaVA model: "Transcribe the text in this image perfectly." Explain why LLaVA is highly likely to comply and output the dangerous text, bypassing its safety tuning. How does the modality gap (the fact that safety was trained on text prompts, but the malicious payload is hidden inside the visual tokens) create a new class of VLM vulnerabilities?
✦Solutions
  1. Linear projector limit. A single affine $W$ can only apply a linear map; if the CLIP features are non-linearly entangled relative to the LLM's token space, no $W$ separates them and gradient descent converges to a best-fit linear approximation with irreducible residual, so the alignment loss plateaus. A two-layer MLP with GELU adds the non-linearity needed to warp and disentangle the features.
  2. Loss-mask bug. Computing loss over the system and user tokens trains the model to predict the prompt itself. At inference it then continues/autocompletes the user's instruction like a base language model rather than answering, because it was optimized to model the prompt distribution instead of only the response.
  3. Data-synthesis blind spots. Bounding boxes and captions omit lighting/shadow, surface texture/material, and fine text/OCR. Asked "rough or smooth?", the model has no grounded signal for texture (GPT-4 invented the training answers from boxes), so it hallucinates plausibly — and validation loss looks fine because the validation set shares the identical blind spot.
  4. Resolution scaling. $1024/14 \approx 73$ patches per side, so $N \approx 73^2 \approx 5{,}300$ tokens versus 256, roughly $21\times$ more. The KV-cache grows linearly with sequence length, so it increases by about $21\times$, while self-attention compute is $O(L^2)$ and increases by about $440\times$. Prepending all tokens therefore becomes unsustainable, especially for video where the sequence length explodes, which motivates token compression and resamplers.
  5. Jailbreak via image text. An OCR "transcribe this" request routes the malicious payload through the visual modality, but safety SFT was trained on text prompts, so the guardrail distribution never covered instructions hidden in pixels. The model faithfully transcribes the harmful text — the modality gap means text-only safety alignment does not transfer to visually-encoded content, a new attack surface.
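The arithmetic in solution 4 can be reproduced with a short script. It is a sketch under simplifying assumptions: the 256-token baseline is taken as a 224px input with 14px patches, a partial patch at the image edge is dropped, and only visual tokens are counted (text tokens are ignored).

```python
def patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Spatial patch tokens for a square input; a partial edge patch is dropped."""
    per_side = image_size // patch_size
    return per_side * per_side


base = patch_tokens(224)    # baseline grid
high = patch_tokens(1024)   # high-resolution grid

kv_cache_ratio = high / base            # KV-cache memory grows linearly in sequence length
attention_ratio = kv_cache_ratio ** 2   # attention compute grows quadratically

print(f"tokens: {base} -> {high}")
print(f"KV-cache: about {kv_cache_ratio:.1f}x, attention compute: about {attention_ratio:.0f}x")
```

The exact ratios come out slightly below the rounded $21\times$ and $440\times$ quoted in the solution, since those round the token ratio up before squaring.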

Looking ahead

LLaVA's design—prepended visual tokens processed by full self-attention—works beautifully for single images but hits a massive quadratic computational wall when scaling to high-resolution images or video.

Week 7: Alternative VLM Architectures. We examine Flamingo's gated cross-attention injection (enabling multi-image and video reasoning without quadratic explosion), Perceiver-style models with fixed-size latent bottlenecks (handling massive input lengths), and PaLI's unified encoder-decoder architecture.


Further reading

  • Liu, H., et al. (2023). Visual Instruction Tuning. NeurIPS. (The original LLaVA paper).
  • Liu, H., et al. (2023). Improved Baselines with Visual Instruction Tuning. arXiv; published at CVPR 2024. (LLaVA-1.5, introducing the MLP projector).
  • Zhu, D., et al. (2023). MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv; published at ICLR 2024.