
Week 5: BLIP, BLIP-2, and Related Models

✦Learning Outcomes
  • Implement Q-Former as a learned bridge between frozen vision encoders and LLMs
  • Analyze data bootstrapping techniques for cleaning noisy web data
  • Compare frozen-backbone vs. end-to-end vision-language model (VLM) training approaches
◆Prerequisites
  • Week 3: CLIP - Contrastive learning
  • Week 4: Captioning - Encoder-decoder generation

Purpose of this lecture

As explored in previous weeks, CLIP aligns images and text globally but lacks generative capability. Conversely, standard encoder-decoder models can generate captions, but training them end-to-end becomes prohibitively expensive as language models scale to billions of parameters. This lecture introduces BLIP (Bootstrapping Language-Image Pre-training), which unifies contrastive alignment with generative captioning in a single framework. We then explore its successor, BLIP-2, which introduced the Q-Former: a learned, lightweight bottleneck that bridges a large frozen vision encoder and a large frozen Large Language Model (LLM) without risking catastrophic forgetting in either. This "frozen-backbone" design pattern influenced many later multimodal assistants.


BLIP: Unified Multimodal Pretraining

BLIP (Li et al., 2022) trains a single, flexible architecture with three distinct but complementary objectives. The network consists of a vision transformer (ViT) and a multimodal text transformer that operates in a different mode for each loss: bidirectional or causal self-attention, with or without cross-attention to the image.

1. Image-Text Contrastive (ITC) Loss

This objective closely follows CLIP's symmetric InfoNCE; BLIP, following ALBEF, additionally uses a momentum encoder to produce soft targets. It aligns the global image embedding (the vision [CLS] token) with the global text embedding (the text [CLS] token) in a shared embedding space. For the ITC forward pass, the text encoder uses bidirectional self-attention (every token can attend to every other token) but applies no cross-attention to the visual features. Keeping the two encoders unimodal means image and text embeddings can be computed independently and cached, which makes retrieval fast.
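The contrastive objective can be sketched in a few lines. This is a minimal, standard-library illustration of the plain CLIP-style symmetric InfoNCE only; it leaves out any momentum encoder or soft targets, and the function names and the temperature of 0.07 are assumptions chosen for the example, not values taken from the lecture.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where pair i is (image i, text i)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and text j, divided by temperature
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # image -> text: row i should put its probability mass on column i
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # text -> image: the same cross-entropy, taken over columns
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: correctly paired embeddings vs. the same embeddings with texts swapped
aligned = symmetric_info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = symmetric_info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < shuffled)
```

The loss is close to zero when every image is most similar to its own text, and large when the pairing is wrong.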

2. Image-Text Matching (ITM) Loss

ITC only checks whether global vectors are similar. ITM is a harder, binary classification task: given a specific image-text pair, a linear head must output $P(\text{match}) \in [0, 1]$. For ITM, the text encoder uses bidirectional self-attention and cross-attention into the visual features, allowing dense, fine-grained fusion of visual and linguistic tokens. To make ITM challenging, BLIP uses hard negative mining: for each image in a batch, it samples a non-matching text from the batch with a probability that grows with its ITC similarity, so near-misses are chosen most often, and trains the ITM head to classify the pair as 0. This pushes the model to attend to subtle semantic differences (e.g., "a red cube" vs. "a blue cube") that global contrastive dot products often blur.
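A minimal sketch of hard-negative selection, assuming negatives are drawn with probability proportional to a softmax over the ITC similarities (always taking the single highest-scoring mismatch would be the deterministic limit of this). The function name and the toy similarity row are hypothetical.

```python
import math
import random

def sample_hard_negative(sim_row, positive_idx, rng):
    """Pick a negative text index for one image, favoring high-similarity
    mismatches and never returning the true match."""
    candidates = [j for j in range(len(sim_row)) if j != positive_idx]
    m = max(sim_row[j] for j in candidates)
    weights = [math.exp(sim_row[j] - m) for j in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)
# Similarities of image 0 to texts 0..3; text 0 is the true match,
# text 1 is a near-miss, texts 2 and 3 are clearly unrelated.
sim_row = [9.0, 8.5, -3.0, -4.0]
picks = [sample_hard_negative(sim_row, 0, rng) for _ in range(1000)]
counts = {j: picks.count(j) for j in range(4)}
print(counts)
```

In this toy row almost every draw lands on text 1, the near-miss, which is exactly the kind of pair the ITM head cannot separate from global similarity alone.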

3. Language Modeling (LM) Loss

Finally, to give the model generative capacity, the text transformer runs with causal (autoregressive) self-attention. The model must predict the next text token $w_t$ conditioned on $w_{1:t-1}$ and the visual features accessed via cross-attention.

$$\mathcal{L}_\text{LM} = -\sum_{t=1}^{T} \log P_\theta(w_t \mid w_{1:t-1}, F_\text{image})$$
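To make the loss concrete, the sketch below builds a causal attention mask and evaluates the summed negative log-likelihood for a two-token caption. The per-step probability tables are made up for illustration; in a real model they would come from the decoder conditioned on the image features.

```python
import math

def causal_mask(T):
    # mask[t][s] is True when position t is allowed to attend to position s
    return [[s <= t for s in range(T)] for t in range(T)]

def lm_loss(step_probs, target_ids):
    # step_probs[t] is the predicted distribution over the vocabulary at step t;
    # the loss sums -log P(w_t | w_{1:t-1}, image) over the caption
    return -sum(math.log(step_probs[t][w]) for t, w in enumerate(target_ids))

mask = causal_mask(3)
# Toy 2-word vocabulary, caption = [token 0, token 1]
loss = lm_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
print(mask)
print(round(loss, 4))
```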

By sharing most weights across these three objectives (in BLIP, the text encoder and decoder share all parameters except their self-attention layers), BLIP learns a versatile representation that supports retrieval, matching, and generation at once.


Bootstrapping noisy web data (CapFilt)

Internet-scraped datasets (like LAION) contain massive amounts of noise, where HTML alt-text often bears no semantic relationship to the image. BLIP introduced a highly effective data curation loop called CapFilt (Captioning and Filtering):

  1. Pretrain an initial BLIP model on the raw, noisy web data.
  2. Caption (Cap): Fine-tune a captioner from that model with the LM objective on a small human-annotated dataset (COCO in the paper), then use it to synthesize a new caption for every image in the web dataset. Because the model has internalized visual priors, its synthetic captions are often more descriptive than the original HTML alt-text.
  3. Filter (Filt): Fine-tune a filter from the same model with the ITC and ITM objectives, then score both the original web caption and the synthetic caption against the image. A caption is discarded when the ITM head predicts that it does not match, i.e. when $P(\text{match})$ falls below a threshold.
  4. Retrain: Combine the surviving original captions, the surviving synthetic captions, and the human-annotated pairs into a cleaner dataset, and pretrain a new BLIP model from scratch.
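The caption-then-filter loop can be sketched as follows. The `captioner` and `itm_score` callables are hypothetical stand-ins for the trained captioning and matching heads, and the 0.5 threshold is an assumption for the example; the toy versions below use string matching purely so the sketch runs.

```python
def capfilt(pairs, captioner, itm_score, threshold=0.5):
    """pairs: list of (image, web_caption). Returns the filtered (image, caption) set."""
    kept = []
    for image, web_caption in pairs:
        synthetic = captioner(image)                  # Cap: propose a new caption
        for caption in (web_caption, synthetic):
            if itm_score(image, caption) >= threshold:  # Filt: keep only matches
                kept.append((image, caption))
    return kept

# Toy stand-ins: an "image" is just its label string here.
def toy_captioner(image):
    return "a photo of " + image

def toy_itm(image, caption):
    return 1.0 if image in caption else 0.0

pairs = [("dog", "IMG_0042.jpg"), ("cat", "my cat on a sofa")]
clean = capfilt(pairs, toy_captioner, toy_itm)
print(clean)
```

The uninformative alt-text for the first image is dropped and replaced by a synthetic caption, while the second image keeps both its web caption and its synthetic one.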

This bootstrapping shows empirically that a model can help curate its own training distribution, yielding a significantly stronger vision-language model (VLM) than one trained purely on the raw web data.


BLIP-2: Frozen Backbones and the Q-Former

As LLMs scaled from millions to billions of parameters (e.g., OPT, Flan-T5, Llama), training an entire multimodal network end-to-end became prohibitively expensive for most labs.

BLIP-2 (Li et al., 2023) addressed the scalability problem through parameter efficiency. Both the vision encoder (a roughly 1B-parameter ViT-g/14) and the LLM (e.g., a 6.7B-parameter OPT) are frozen: their weights are never updated. The only trainable component is a lightweight bridge called the Querying Transformer (Q-Former), with roughly 188M parameters.

The Q-Former Architecture

The Q-Former initializes a fixed set of $K = 32$ learned query vectors $Q \in \mathbb{R}^{K \times D}$. These queries are passed through a transformer architecture with two distinct attention mechanisms per layer:

  1. Self-Attention: The $K$ queries attend to each other, allowing them to route information internally.
  2. Cross-Attention: The $K$ queries attend to the $N$ frozen visual patch tokens extracted by the ViT (e.g., $N = 256$ for a $224 \times 224$ image with $14 \times 14$ patches).

Because there are only 32 query vectors, the output of the Q-Former is strictly bottlenecked to a sequence of $K = 32$ dense visual tokens, regardless of how high the original image resolution was. This provides large computational savings when these tokens are later injected into the LLM's context window.
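The fixed-size output can be checked with a stripped-down cross-attention. This is only a sketch under simplifying assumptions: a single head, no learned projection matrices, random vectors in place of real features, and a toy width of 8 rather than the Q-Former's actual hidden size. All function names are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, patches):
    """Each query returns a softmax-weighted average of the patch vectors."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(d) for p in patches]
        weights = softmax(scores)
        outputs.append([sum(w * p[k] for w, p in zip(weights, patches))
                        for k in range(d)])
    return outputs

def q_former_output_shape(n_patches, num_queries=32, dim=8, seed=0):
    rng = random.Random(seed)
    queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
    patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_patches)]
    out = cross_attention(queries, patches)
    return len(out), len(out[0])

for n in (256, 1024):
    print(n, "patch tokens ->", q_former_output_shape(n))
```

Quadrupling the number of patch tokens changes the cost of the cross-attention but not the number of tokens handed to the LLM.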

Two-Stage Pretraining

Optimizing a randomly initialized bridge against a frozen 7B LLM is highly unstable. BLIP-2 breaks pretraining into two stages:

Stage 1: Vision-Language Representation Learning. The LLM is not involved. The Q-Former is trained against the frozen ViT using BLIP-style multi-task objectives: ITC, ITM, and image-grounded text generation (ITG, the counterpart of BLIP's LM loss). In this stage, the 32 queries learn to extract general-purpose, semantically rich visual features from the frozen ViT.

Stage 2: Generative LLM Alignment. The Q-Former's multi-task heads are discarded. The 32 output queries are passed through a simple linear projection matrix $W \in \mathbb{R}^{D_\text{Q} \times D_\text{LLM}}$ to match the hidden dimension of the frozen LLM. These 32 vectors act as "soft visual prompts" prepended to the user's text prompt. The frozen LLM generates text autoregressively. The cross-entropy loss is backpropagated through the frozen LLM (gradients pass through its activations, but its weights are not updated) into the linear projector and the Q-Former.
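The Stage 2 input construction amounts to a projection followed by a concatenation. The sketch below uses tiny made-up dimensions and a fixed toy matrix in place of the learned projector; `project` and `build_llm_input` are hypothetical helper names.

```python
def project(queries, W):
    # queries: K x D_Q, W: D_Q x D_LLM  ->  K x D_LLM
    return [[sum(q[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for q in queries]

def build_llm_input(query_outputs, W, text_embeddings):
    # Soft visual prompts come first, followed by the embedded text prompt
    return project(query_outputs, W) + text_embeddings

K, D_Q, D_LLM, L = 32, 4, 6, 5   # toy sizes; real hidden sizes are far larger
query_outputs = [[0.1 * (i + j) for j in range(D_Q)] for i in range(K)]
W = [[1.0 if i == j else 0.0 for j in range(D_LLM)] for i in range(D_Q)]
text_embeddings = [[0.0] * D_LLM for _ in range(L)]

llm_input = build_llm_input(query_outputs, W, text_embeddings)
print(len(llm_input), "tokens of width", len(llm_input[0]))
```

Only `W` and the Q-Former that produced `query_outputs` would be trained; the text embeddings and everything downstream belong to the frozen LLM.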


The Information Bottleneck Tradeoff

The frozen-backbone + bottleneck design principle proved influential in VLM development for three reasons:

  1. Preserved LLM Capabilities: Because the LLM weights are frozen, the LLM cannot undergo catastrophic forgetting. It retains its pre-trained linguistic reasoning, world knowledge, and safety guardrails.
  2. Modular Upgrades: If a new, more powerful LLM is released tomorrow (e.g., Llama-3 replacing Llama-2), researchers only need to run Stage 2 pretraining to connect the existing ViT/Q-Former to the new LLM.
  3. Compute Efficiency: Training 188M parameters instead of roughly 10B removes most gradient and optimizer-state memory and cuts training cost substantially, although each Stage 2 step still runs forward and backward passes through the frozen LLM.
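A rough back-of-the-envelope for the third point, comparing the per-parameter training state of the 188M Q-Former with that of a fully unfrozen 6.7B LLM. The assumptions here are not from the lecture: fp32 values, and an Adam-style optimizer that keeps one gradient and two moment estimates per trainable parameter. Activations and the weights themselves are ignored.

```python
def training_state_gb(trainable_params, bytes_per_value=4):
    # gradient + first moment + second moment: three stored values per parameter
    return trainable_params * 3 * bytes_per_value / 1e9

q_former_gb = training_state_gb(188e6)
full_llm_gb = training_state_gb(6.7e9)
print(round(q_former_gb, 3), "GB vs", round(full_llm_gb, 1), "GB")
```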

The Fatal Flaw (Spatial Collapse): Compressing a grid of hundreds of spatial patch tokens (256 for a $224 \times 224$ input, more at higher resolutions) into exactly 32 non-spatial query tokens creates a severe information bottleneck. During Stage 1, the Q-Former learns to discard background details and specific $(x, y)$ coordinate geometry in favor of global semantic concepts (e.g., "there is a dog," "the scene is outdoors").

Consequently, while BLIP-2 excels at global VQA ("What is the main subject of this photo?"), it tends to struggle on dense prediction tasks like grounding, OCR (reading small text), or robotic manipulation, where the loss of spatial geometry is especially costly.
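The size of the bottleneck is simple arithmetic. The sketch below assumes square images, non-overlapping $14 \times 14$ patches, and no [CLS] token in the count; `num_patches` is a hypothetical helper.

```python
def num_patches(image_size, patch_size):
    return (image_size // patch_size) ** 2

K = 32  # number of Q-Former queries
for size in (224, 448):
    n = num_patches(size, 14)
    print(f"{size}x{size}: {n} patches -> {K} queries ({n // K}x compression)")
```

Doubling the input resolution quadruples the number of patches while the query count stays at 32, so the compression ratio grows with resolution.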


InstructBLIP and task-specific routing

InstructBLIP (Dai et al., 2023) modifies the BLIP-2 architecture to address the fact that different tasks require different visual information.

Instead of letting the 32 query vectors blindly extract a generic summary of the image, InstructBLIP injects the user's text instruction (e.g., "Read the text on the red sign") directly into the Q-Former. The instruction tokens are concatenated with the 32 query vectors during the self-attention phase.

This makes the Q-Former act as a dynamic routing mechanism. The queries use the semantic content of the instruction to guide their cross-attention over the ViT patch tokens. If the instruction asks about the red sign, the queries can weight the patch features from that region more heavily and down-weight the rest of the image. This instruction-aware visual extraction improves performance on complex reasoning benchmarks compared to the instruction-agnostic extraction of BLIP-2.
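The effect of concatenating instruction tokens can be seen in a toy self-attention step. As before this is a sketch under simplifying assumptions (single head, no learned projections, two-dimensional hand-picked vectors); the function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(d) for t in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[k] for w, t in zip(weights, tokens)) for k in range(d)])
    return out

def instruction_aware_queries(queries, instruction_tokens):
    mixed = self_attention(queries + instruction_tokens)
    # Only the query positions continue on to cross-attention over the image
    return mixed[:len(queries)]

queries = [[1.0, 0.0], [0.0, 1.0]]
out_a = instruction_aware_queries(queries, [[5.0, 0.0]])  # instruction "A"
out_b = instruction_aware_queries(queries, [[0.0, 5.0]])  # instruction "B"
print(out_a[0], out_b[0])
```

The same learned queries end up as different vectors depending on the instruction, so the cross-attention weights they produce over the patch tokens differ as well.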


Key takeaways

BLIP unifies contrastive alignment, binary matching, and autoregressive captioning into a single multi-task transformer framework, and shows that a model's own captioning and matching heads can be used to filter and bootstrap noisy web datasets. BLIP-2 addresses the cost of scaling by freezing large ViT and LLM backbones and connecting them via an efficient, 32-token Querying Transformer (Q-Former). While this two-stage bottleneck architecture preserves the LLM's capabilities and cuts compute costs, it discards much fine-grained spatial detail. InstructBLIP partially mitigates this by injecting the text instruction directly into the Q-Former, so that visual feature extraction becomes task-aware.


Conceptual questions

  1. ITM Hard Negative Mining: In BLIP, the Image-Text Matching (ITM) objective utilizes the Image-Text Contrastive (ITC) similarity matrix to select "hard negatives." Mathematically explain why selecting the text with the highest ITC score $S_{i,j}$ (where $i \neq j$) as a negative example provides a steeper, more informative gradient for the ITM classification head than selecting a text uniformly at random. What failure mode would occur in ITM if the ITC matrix was severely under-trained and produced near-random similarity scores?
  2. Compression Mathematics in the Q-Former: A ViT-G/14 processes a $224 \times 224$ image into $N = 256$ spatial patches. The Q-Former compresses these into $K = 32$ query vectors. Calculate the mathematical compression ratio. If you were forced to deploy BLIP-2 for a medical imaging task where identifying a 2-pixel tumor is the difference between life and death, explain precisely why the $K = 32$ cross-attention bottleneck will likely discard the tumor data, and propose an architectural modification to the Q-Former to bypass this.
  3. Two-Stage Pretraining Optimization: BLIP-2 explicitly trains the Q-Former against the ViT (Stage 1) before training it against the LLM (Stage 2). If a junior engineer attempts to optimize the Q-Former against both the ViT (for ITM loss) and the frozen LLM (for generative cross-entropy loss) simultaneously in a single pass, describe the optimization challenge that arises. How do the differing magnitudes and variances of the gradients from an untrained classification head versus a massive frozen LLM cause training instability?
  4. InstructBLIP Dynamic Routing: In InstructBLIP, the text instruction is concatenated to the $K$ queries during self-attention: $\text{SelfAttn}([Q_1 \dots Q_{32}, T_1 \dots T_L])$. Given the prompt "What is the license plate number of the blue car?", trace the mathematical flow of information. How does the semantic representation of $T_\text{license}$ alter the Query vectors $Q$, and how does that altered Query vector subsequently change the cross-attention weights $\alpha$ applied to the frozen ViT patch tokens?
  5. Catastrophic Forgetting Tradeoffs: You are designing a VLM for a highly specific legal document parsing task. You have 5 million image-text pairs of legal documents. You must choose between (A) using BLIP-2 with a frozen 7B LLM and training only the 188M parameter Q-Former, or (B) using the same architecture but unfreezing all 7B parameters of the LLM for full fine-tuning. Analyze the risk of catastrophic forgetting in option B. Specifically, what generalized reasoning capabilities might the LLM mathematically lose by heavily optimizing its weights solely on legal text, and why does option A prevent this?
✦Solutions
  1. ITM hard negatives. Selecting the highest-ITC non-matching text gives a near-miss that is semantically close, forcing the ITM head to learn fine discriminative features — a steep, informative gradient. Random negatives are trivially separable, so the loss is near zero and the gradient tiny. If the ITC matrix is under-trained and near-random, the "hard" negatives are effectively random and ITM loses its benefit, training slowly.
  2. Q-Former compression. $256 \to 32$ is an $8\times$ compression. Thirty-two global queries pooling over 256 patches will average out a 2-pixel tumor: cross-attention spends capacity on dominant features and discards tiny rare signals. Fixes: raise $K$, make the queries spatially local/hierarchical, or add a high-resolution residual path that bypasses the bottleneck for the detection head.
  3. Two-stage optimization. Training the Q-Former simultaneously against an untrained ITM head and a massive frozen LLM mixes gradients of wildly different magnitude and variance, destabilizing the shared queries. Stage 1 first aligns the queries to vision, so Stage 2 begins from meaningful queries and the generative gradients are stable.
  4. InstructBLIP routing. Concatenating instruction tokens with the queries in self-attention lets $T_\text{license}$ reshape the query vectors (queries attend to the instruction and absorb its semantics). The instruction-conditioned queries then drive cross-attention to the ViT, raising the attention weights on the license-plate region instead of generic salient regions: instruction-aware visual extraction.
  5. Catastrophic forgetting. Option B unfreezes all 7B parameters and optimizes only on legal text, so the LLM's output distribution narrows and it loses general reasoning and world knowledge (mode collapse into a legal subspace). Option A freezes the LLM and trains only the 188M Q-Former, so the broad language capabilities are mathematically untouched and only the visual interface adapts.

Looking ahead

BLIP-2 establishes a highly efficient, scalable pattern for connecting frozen vision and language models. The next step is moving beyond simple captioning and short-form VQA, transforming these perception-oriented models into conversational, interactive assistants.

Week 6: LLaVA and Multimodal Instruction Tuning. We examine how simple MLP projectors can outright replace complex Q-Formers, how synthetic instruction-following data is generated at scale from GPT-4, and how instruction tuning transforms a VLM into a conversational multimodal assistant capable of deep reasoning.


Further reading

  • Li, J., et al. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. ICML.
  • Li, J., et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML. (Introduced the Q-Former).
  • Dai, W., et al. (2023). InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. NeurIPS.