Week 5: BLIP, BLIP-2, and Related Models

Purpose of this lecture

As explored in previous weeks, CLIP aligns images and text globally but cannot generate text. Conversely, standard encoder-decoder models can generate captions but become prohibitively expensive to train end-to-end as language models scale to billions of parameters. This lecture introduces BLIP (Bootstrapping Language-Image Pre-training), which unifies contrastive alignment with generative captioning in a single framework. We then explore its successor, BLIP-2, which introduced the Q-Former: a small, learned bottleneck that bridges a large frozen vision encoder and a large frozen Large Language Model (LLM). Because the LLM's weights are never updated, it is not exposed to catastrophic forgetting. This "frozen-backbone" design pattern is an important precursor to many later multimodal assistants.

BLIP: Unified Multimodal Pretraining

BLIP (Li et al., 2022) trains a single transformer architecture with three complementary objectives. The network consists of a vision transformer (ViT) and a multimodal text transformer that switches its self-attention mask (bidirectional vs. causal) and enables or disables cross-attention depending on the active loss function.

1. Image-Text Contrastive (ITC) Loss

This objective has the same form as CLIP's symmetric InfoNCE loss (BLIP follows the ALBEF variant, which adds a momentum encoder and soft labels). It aligns the global image embedding (the vision [CLS] token) with the global text embedding (the text [CLS] token) in a shared embedding space. For the ITC forward pass, the text encoder uses bidirectional self-attention (every token can attend to every other token) and applies no cross-attention to the visual features. The two embeddings are therefore computed independently, so they can be precomputed and compared with a dot product for fast retrieval.
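
To make the symmetric contrastive objective concrete, here is a minimal pure-Python sketch. It assumes plain in-batch negatives and a fixed temperature of 0.07 (an illustrative value); it omits the momentum encoder and soft labels, and the function names are hypothetical.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i is the positive for row i and column i."""
    img = [l2_normalize(v) for v in image_embs]
    txt = [l2_normalize(v) for v in text_embs]
    n = len(img)
    # sim[i][j] = cosine similarity of image i and text j, divided by temperature
    sim = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
            for j in range(n)] for i in range(n)]
    # image-to-text: each row is a classification over the texts in the batch
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # text-to-image: same computation on the transposed matrix
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: correctly paired embeddings vs. the same embeddings with texts swapped.
aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each image is most similar to its own text and large when the pairing is swapped.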

2. Image-Text Matching (ITM) Loss

ITC only checks whether two global vectors are similar. ITM is a harder binary classification task: given a specific image-text pair, a linear head must output $P(\text{match}) \in [0, 1]$. For ITM, the text encoder uses bidirectional self-attention together with cross-attention into the visual features, allowing fine-grained fusion of visual and linguistic tokens. To keep the task challenging, BLIP uses hard negative mining: for each image in a batch, it samples a non-matching text from the same batch with probability increasing in its ITC similarity (and likewise a hard negative image for each text), and trains the ITM head to classify the pair as $0$. This pushes the model to attend to subtle semantic differences (e.g., "a red cube" vs. "a blue cube") that global contrastive dot products often blur.
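
A small sketch of similarity-weighted negative selection for one image, assuming negatives are drawn with probability proportional to the softmax of the ITC scores over the non-matching texts. The function name and the toy scores are hypothetical.

```python
import math
import random

def sample_hard_negative(sim_row, positive_idx, rng):
    """Pick a non-matching text index, favoring high ITC similarity."""
    negatives = [j for j in range(len(sim_row)) if j != positive_idx]
    m = max(sim_row[j] for j in negatives)  # subtract max for numerical stability
    weights = [math.exp(sim_row[j] - m) for j in negatives]
    return rng.choices(negatives, weights=weights, k=1)[0]

rng = random.Random(0)
# Image 0 vs. four texts: text 0 is the true match, text 1 is a near-miss.
sim_row = [9.0, 8.5, -3.0, -4.0]
picks = [sample_hard_negative(sim_row, positive_idx=0, rng=rng) for _ in range(200)]
```

With these scores the near-miss (index 1) is chosen almost every time, while the true match is never returned as a negative. Taking the argmax instead of sampling would be a stricter variant of the same idea.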

3. Language Modeling (LM) Loss

Finally, to give the model generative capacity, the text transformer's self-attention mask is switched to causal (autoregressive) masking. The model must predict the next text token $w_t$ conditioned on $w_{1:t-1}$ and on the visual features $F_\text{image}$ accessed via cross-attention.

$$\mathcal{L}_\text{LM} = -\sum_{t=1}^T \log P_\theta(w_t \mid w_{1:t-1}, F_\text{image})$$

By sharing most weights across these three objectives (in BLIP, the text encoder and decoder share all parameters except their self-attention layers), a single model can handle retrieval, matching, and generation.
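
The causal mask and the LM loss can be sketched in a few lines. This assumes the model has already produced one vector of vocabulary logits per time step (conditioning on the image happens inside the model and is not shown); `causal_mask` and `lm_loss` are illustrative names.

```python
import math

def causal_mask(T):
    """mask[i][j] = 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]

def lm_loss(step_logits, target_ids):
    """Sum over t of -log P(w_t | w_{1:t-1}, image), given per-step vocab logits."""
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_norm - logits[target]  # equals -log softmax(logits)[target]
    return total

mask = causal_mask(3)
# Uniform logits over a 4-token vocabulary for a 3-token caption.
uniform_loss = lm_loss([[0.0] * 4] * 3, [0, 1, 2])
```

With uniform logits every token costs $\log 4$, so the three-token loss is $3 \log 4$; a bidirectional mask would simply be all ones.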

Bootstrapping noisy web data (CapFilt)

Internet-scraped datasets (like LAION) contain a great deal of noise: HTML alt-text often bears little semantic relationship to the image. BLIP introduced a data curation loop called CapFilt (Captioning and Filtering):

Train an initial BLIP model on the raw, noisy web data, then fine-tune a captioner and a filter from it on a small human-annotated dataset (COCO in the paper).
Caption (Cap): Use the text decoder trained with the LM loss (the captioner) to synthesize a new caption for every image in the web dataset. These synthetic captions often describe the image more directly than the original HTML alt-text.
Filter (Filt): Use the ITM head (the filter) to score both the original web caption and the synthetic caption against the image. If $P(\text{match})$ is too low, meaning the filter predicts that the caption does not match the image, the caption is discarded.
Retrain: Combine the surviving original captions, the surviving synthetic captions, and the human-annotated pairs into a cleaner dataset, and pretrain a new BLIP model on it.

This bootstrapping shows empirically that a model can help curate its own training distribution: the BLIP paper reports stronger downstream results than training on the raw web data alone.
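
The caption-then-filter steps above can be sketched as a simple loop. The `captioner` and `filter_score` callables and the 0.5 threshold are hypothetical stand-ins for the captioner and ITM filter, not BLIP's actual interfaces.

```python
def capfilt(pairs, captioner, filter_score, threshold=0.5):
    """pairs: iterable of (image, web_caption). Returns the cleaned (image, caption) list."""
    cleaned = []
    for image, web_caption in pairs:
        # Score both the original web text and a freshly generated caption.
        for caption in (web_caption, captioner(image)):
            if filter_score(image, caption) >= threshold:
                cleaned.append((image, caption))
    return cleaned

# Toy stand-ins: an "image" is just a label string here.
def toy_captioner(image):
    return "a photo of " + image

def toy_filter(image, caption):
    return 1.0 if image in caption else 0.0

web_pairs = [("dog", "buy now 50% off"), ("cat", "a cat on a sofa")]
cleaned = capfilt(web_pairs, toy_captioner, toy_filter)
```

In this toy run the spam-like alt-text for the dog image is dropped, while both the web caption and the synthetic caption for the cat image survive.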

BLIP-2: Frozen Backbones and the Q-Former

As LLMs scaled from millions to billions of parameters (e.g., OPT, Flan-T5, Llama), training an entire multimodal network end-to-end became prohibitively expensive for most labs.

BLIP-2 (Li et al., 2023) addresses this with a parameter-efficient design. Both the vision encoder (e.g., a roughly 1B-parameter ViT-g/14) and the LLM (e.g., a 6.7B-parameter OPT) are frozen: their weights are never updated. The only trainable component is a lightweight bridge called the Querying Transformer (Q-Former), with roughly 188M parameters.

The Q-Former Architecture

The Q-Former maintains a fixed set of $K = 32$ learned query vectors $Q \in \mathbb{R}^{K \times D}$. These queries pass through a transformer that uses two attention mechanisms:

Self-Attention: The $K$ queries attend to each other, allowing them to exchange information.
Cross-Attention: The $K$ queries attend to the $N$ visual patch tokens produced by the frozen ViT (e.g., $N = 256$ for a $224 \times 224$ image with $14 \times 14$ patches). In BLIP-2, cross-attention is inserted in every other transformer block.

Because there are only 32 query vectors, the Q-Former always outputs a sequence of $K = 32$ visual tokens, regardless of the image resolution. This keeps the number of visual tokens injected into the LLM's context window small and constant, which saves substantial compute.
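
The fixed-size output can be checked with a toy cross-attention. This is a single-head sketch without learned projections, residuals, or feed-forward layers, and the dimension $D = 8$ is chosen only to keep the example fast.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, patches):
    """(K x D) queries attend over (N x D) patch tokens and return (K x D) outputs."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(d) for p in patches]
        weights = softmax(scores)
        # Each output is a weighted average of the patch tokens.
        outputs.append([sum(w * p[c] for w, p in zip(weights, patches)) for c in range(d)])
    return outputs

rng = random.Random(0)
K, D = 32, 8
queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
shapes = {}
for n_patches in (256, 1024):
    patches = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(n_patches)]
    out = cross_attention(queries, patches)
    shapes[n_patches] = (len(out), len(out[0]))
```

Whether the ViT emits 256 or 1024 patch tokens, the output stays at 32 vectors, which is what bounds the cost on the LLM side.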

Two-Stage Pretraining

Optimizing an untrained bridge directly against a frozen 7B LLM is unstable. BLIP-2 therefore splits pretraining into two stages:

Stage 1: Vision-Language Representation Learning. The LLM is not used. The Q-Former is trained against the frozen ViT using the BLIP-style multi-task objectives (ITC, ITM, and image-grounded text generation, the counterpart of BLIP's LM loss). In this stage, the 32 queries learn to extract the visual features that are most relevant to text from the frozen ViT.

Stage 2: Generative LLM Alignment. The Stage 1 task heads are no longer needed. The 32 query outputs are passed through a linear projection $W \in \mathbb{R}^{D_\text{Q} \times D_\text{LLM}}$ to match the hidden dimension of the frozen LLM. These 32 vectors act as "soft visual prompts" prepended to the text prompt's embeddings. The frozen LLM generates text autoregressively, and the cross-entropy loss is backpropagated through the frozen LLM (gradients flow through it, but its weights are not updated) into the linear projection and the Q-Former.
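
A shape-level sketch of the Stage 2 input construction, assuming toy dimensions ($D_\text{Q} = 4$, $D_\text{LLM} = 6$) and constant values in place of learned weights. The helper names are hypothetical.

```python
def project(vectors, W):
    """Multiply each row vector (length D_Q) by W (D_Q x D_LLM)."""
    d_in, d_out = len(W), len(W[0])
    return [[sum(v[i] * W[i][j] for i in range(d_in)) for j in range(d_out)]
            for v in vectors]

def build_llm_inputs(query_outputs, W, text_embeddings):
    """Prepend the projected visual tokens to the text token embeddings."""
    return project(query_outputs, W) + text_embeddings

K, D_Q, D_LLM, L = 32, 4, 6, 5
query_outputs = [[1.0] * D_Q for _ in range(K)]       # stand-in for Q-Former outputs
W = [[0.5] * D_LLM for _ in range(D_Q)]               # stand-in for the learned projection
text_embeddings = [[0.0] * D_LLM for _ in range(L)]   # stand-in for embedded prompt tokens
llm_inputs = build_llm_inputs(query_outputs, W, text_embeddings)
```

The LLM then sees a sequence of $K + L = 37$ vectors of width $D_\text{LLM}$; only `W` and the Q-Former would receive weight updates.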

The Information Bottleneck Tradeoff

The frozen-backbone plus bottleneck design became influential in VLM development for three reasons:

Preserved LLM Capabilities: Because the LLM weights are frozen, the LLM cannot undergo catastrophic forgetting. It retains its pre-trained linguistic reasoning, world knowledge, and safety behavior.
Modular Upgrades: If a more capable LLM is released (e.g., Llama-3 replacing Llama-2), researchers only need to rerun Stage 2 pretraining to connect the existing ViT and Q-Former to the new LLM.
Compute Efficiency: Training 188M parameters instead of 10B parameters removes the gradient and optimizer-state memory for the frozen weights and cuts training cost substantially, although forward passes (and, in Stage 2, backward passes) through the frozen backbones are still required.
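
A back-of-the-envelope calculation for the efficiency point, assuming an Adam-style optimizer that stores two fp32 moment estimates per trainable parameter. The 188M and 10B figures are the ones quoted above; everything else is an illustrative assumption.

```python
def trainable_fraction(trainable_params, total_params):
    return trainable_params / total_params

def adam_state_gib(num_params, bytes_per_value=4):
    """Memory for Adam's first and second moment estimates, in GiB."""
    return 2 * num_params * bytes_per_value / 2**30

q_former = 188e6
full_model = 10e9
fraction = trainable_fraction(q_former, full_model)
state_ratio = adam_state_gib(full_model) / adam_state_gib(q_former)
saved_gib = adam_state_gib(full_model) - adam_state_gib(q_former)
```

Under these assumptions fewer than 2% of the parameters are trained, and the optimizer state shrinks by roughly a factor of 50. Activation memory and forward-pass compute for the frozen models are not reduced by freezing.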

The Main Limitation (Loss of Spatial Detail): Compressing a grid of hundreds of spatial patch tokens (or more at higher resolutions) into exactly $32$ query tokens, which carry no explicit spatial layout, creates a severe information bottleneck. During Stage 1, the Q-Former is trained to keep what is useful for text, so it tends to discard background details and precise $(x, y)$ geometry in favor of global semantic concepts (e.g., "there is a dog," "the scene is outdoors").

Consequently, while BLIP-2 does well at global VQA ("What is the main subject of this photo?"), it tends to struggle on tasks that depend on fine spatial detail, such as grounding, OCR (reading small text), or robotic manipulation.

InstructBLIP and task-specific routing

InstructBLIP (Dai et al., 2023) modifies the BLIP-2 architecture to address the fact that different tasks require different visual information.

Instead of letting the 32 query vectors extract a generic, instruction-agnostic summary of the image, InstructBLIP feeds the user's text instruction (e.g., "Read the text on the red sign") into the Q-Former as well. The instruction tokens are concatenated with the 32 query vectors in the self-attention layers.

This lets the Q-Former act as a dynamic routing mechanism. The queries absorb the semantic content of the instruction through self-attention, and that content then guides their cross-attention over the ViT patch features. If the instruction asks about the red sign, the queries can place more attention weight on the patch features from that region and less on the rest of the image. In the InstructBLIP experiments, this instruction-aware extraction improves performance over the instruction-agnostic extraction of BLIP-2.
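
The routing effect can be illustrated with a two-dimensional toy. The sketch assumes single-head attention with no learned projections, residuals, or layer norms, and hand-picked vectors in which the instruction token happens to point in the same direction as the "sign" patch.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Single-head attention without projections; returns (outputs, weights)."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        w = softmax(scores)
        outputs.append([sum(wi * kv[c] for wi, kv in zip(w, keys_values)) for c in range(d)])
        all_weights.append(w)
    return outputs, all_weights

def patch_attention(queries, instruction_tokens, patches):
    sequence = queries + instruction_tokens    # [Q_1..Q_K, T_1..T_L]
    mixed, _ = attend(sequence, sequence)      # self-attention over the joint sequence
    conditioned = mixed[:len(queries)]         # only query positions cross-attend
    _, weights = attend(conditioned, patches)  # cross-attention to ViT patch features
    return weights

patches = [[1.0, 0.0], [0.0, 1.0]]  # patch 0 ~ "sign", patch 1 ~ "background"
queries = [[0.1, 0.1]]              # a generic query with no preference
generic = patch_attention(queries, [], patches)
guided = patch_attention(queries, [[4.0, 0.0]], patches)
```

Without an instruction the query splits its attention evenly across the two patches; with the instruction token in the sequence, most of the weight moves to patch 0.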

Key takeaways

BLIP unifies contrastive alignment, binary matching, and autoregressive captioning in a single multi-task transformer framework, and shows that a captioner and a filter derived from the model can clean up and bootstrap noisy web datasets. BLIP-2 tackles the cost of scaling by freezing large ViT and LLM backbones and connecting them through a lightweight, 32-token Querying Transformer (Q-Former). This two-stage bottleneck architecture preserves the LLM's capabilities and cuts training cost, but it sacrifices much fine-grained spatial detail. InstructBLIP partially mitigates this by feeding the text instruction into the Q-Former, so that visual feature extraction becomes task-aware.

Conceptual questions

ITM Hard Negative Mining: In BLIP, the Image-Text Matching (ITM) objective uses the Image-Text Contrastive (ITC) similarity matrix to select "hard negatives." Explain why selecting the text with the highest ITC score $S_{i,j}$ (where $i \neq j$) as a negative example provides a larger, more informative gradient for the ITM classification head than selecting a text uniformly at random. What failure mode would occur in ITM if the ITC model were severely under-trained and produced near-random similarity scores?
Compression in the Q-Former: A ViT-g/14 processes a $224 \times 224$ image into $N = 256$ spatial patches. The Q-Former compresses these into $K = 32$ query vectors. Calculate the token compression ratio. If you were forced to deploy BLIP-2 for a medical imaging task where identifying a 2-pixel tumor is the difference between life and death, explain why the $K=32$ cross-attention bottleneck will likely discard the tumor signal, and propose an architectural modification to the Q-Former to work around this.
Two-Stage Pretraining Optimization: BLIP-2 explicitly trains the Q-Former against the ViT (Stage 1) before training it against the LLM (Stage 2). If a junior engineer attempts to optimize the Q-Former against both the ViT (for ITM loss) and the frozen LLM (for generative cross-entropy loss) simultaneously in a single pass, describe the optimization challenge that arises. How do the differing magnitudes and variances of the gradients from an untrained classification head versus a large frozen LLM cause training instability?
InstructBLIP Dynamic Routing: In InstructBLIP, the text instruction is concatenated to the $K$ queries during self-attention: $\text{SelfAttn}([Q_1 \dots Q_{32}, T_1 \dots T_L])$. Given the prompt "What is the license plate number of the blue car?", trace the flow of information. How does the semantic representation of $T_\text{license}$ alter the query vectors $Q$, and how do the altered query vectors subsequently change the cross-attention weights $\alpha$ applied to the frozen ViT patch tokens?
Catastrophic Forgetting Tradeoffs: You are designing a VLM for a highly specific legal document parsing task. You have 5 million image-text pairs of legal documents. You must choose between (A) using BLIP-2 with a frozen 7B LLM and training only the 188M-parameter Q-Former, or (B) using the same architecture but unfreezing all 7B parameters of the LLM for full fine-tuning. Analyze the risk of catastrophic forgetting in option B. Specifically, what general reasoning capabilities might the LLM lose by heavily optimizing its weights solely on legal text, and why does option A prevent this?

Solutions

ITM hard negatives. Selecting the highest-ITC non-matching text gives a near-miss that is semantically close to the image, so the ITM head starts out uncertain or wrong on it; the loss is large and the gradient informative, which forces the head to learn fine discriminative features. Random negatives are mostly trivially separable, so once the head handles them the loss is near zero and the gradient tiny. If the ITC model is under-trained and near-random, the "hard" negatives are effectively random, so ITM loses this benefit and learns fine distinctions slowly.
Q-Former compression. $256 \to 32$ is an $8\times$ reduction in token count. Thirty-two global queries pooling over 256 patches will tend to average out a 2-pixel tumor: cross-attention spends its capacity on dominant features and discards tiny, rare signals. Fixes: raise $K$, make the queries spatially local or hierarchical, or add a high-resolution residual path that bypasses the bottleneck for the detection head.
Two-stage optimization. Training the Q-Former simultaneously against an untrained ITM head and a large frozen LLM mixes gradients of very different magnitude and variance, destabilizing the shared queries. Stage 1 first aligns the queries to vision, so Stage 2 begins from meaningful queries and the generative gradients are more stable.
InstructBLIP routing. Concatenating instruction tokens with the queries in self-attention lets $T_\text{license}$ reshape the query vectors (the queries attend to the instruction and absorb its semantics). The instruction-conditioned queries then drive cross-attention to the ViT, raising the attention weights on the license-plate region instead of on generic salient regions. This is instruction-aware visual extraction.
Catastrophic forgetting. Option B unfreezes all 7B parameters and optimizes only on legal text, so the LLM's output distribution narrows toward legal language and it can lose general reasoning ability and world knowledge. Option A freezes the LLM and trains only the 188M-parameter Q-Former, so the LLM's broad language capabilities are untouched and only the visual interface adapts.

Looking ahead

BLIP-2 establishes an efficient, scalable pattern for connecting frozen vision and language models. The next step is moving beyond captioning and short-form VQA, turning these perception-oriented models into conversational, interactive assistants.

Week 6: LLaVA and Multimodal Instruction Tuning. We examine how simple MLP projectors can replace more complex Q-Formers, how synthetic instruction-following data is generated at scale with GPT-4, and how instruction tuning turns a VLM into a conversational multimodal assistant.
