
Week 4: Beyond CLIP — Captioning and Grounding

✦Learning Outcomes
  • Derive cross-attention mechanisms for binding words to spatial features
  • Implement captioning and grounding tasks with VLMs
  • Compare dual-encoder (CLIP) vs. fusion encoder approaches
◆Prerequisites
  • Week 3: CLIP - Joint embedding spaces
  • Week 1: Vision Backbones - ViT architecture

Purpose of this lecture

While CLIP achieved unprecedented success in zero-shot classification and retrieval by aligning images and text in a joint geometric space, its purely contrastive, dual-encoder design left it fundamentally constrained. A standard CLIP model cannot describe what it sees in natural language (it lacks generative capacity), and it cannot locate where specific objects are within an image (it lacks spatial grounding).

A model that cannot generate text cannot answer open-ended questions, and a model that cannot ground language cannot support the spatial reasoning required for physical AI, robotics, and tool use. This lecture transitions from discriminative perception to generative perception. We examine the architectural patterns—specifically encoder-decoder captioning and cross-attention spatial grounding—that fill these gaps and establish the foundation for modern instruction-following VLMs.


Image captioning: the encoder-decoder framework

Image captioning is formulated as a conditional sequence-to-sequence generation problem. Given an input image $x$, the model must autoregressively generate a sequence of natural language tokens $t = (w_1, w_2, \ldots, w_T)$ that accurately describes the scene. The standard architecture for this task is the Encoder-Decoder:

1. The Vision Encoder: A pretrained vision backbone (such as a ViT-L from Week 1) processes the image. Unlike CLIP, which pools the entire image into a single global vector $z_x$, captioning models preserve the spatial geometry by extracting the sequence of raw patch tokens $F = (f_1, f_2, \ldots, f_N) \in \mathbb{R}^{N \times D}$ from the final layer of the vision encoder.

2. The Text Decoder: An autoregressive transformer decodes the text token by token. The generation of token $w_t$ is conditioned on both the previously generated words $w_{1:t-1}$ (via standard causal self-attention) and the visual patch tokens $F$ (via cross-attention).

During training, the decoder uses teacher forcing: instead of feeding the model's own potentially incorrect predictions back into itself, the ground-truth prefix $w_{1:t-1}$ is always provided. The training objective is to minimize the negative log-likelihood (cross-entropy) over the entire target sequence:

$$\mathcal{L}_\text{cap} = -\sum_{t=1}^T \log p_\theta(w_t \mid w_{1:t-1}, F)$$
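As a small worked instance of this objective, the sketch below sums the negative log-probabilities that a decoder assigns to the ground-truth tokens. The four-word vocabulary, the probability values, and the function name `caption_nll` are illustrative assumptions, not part of any particular model.

```python
import math

def caption_nll(step_probs, target_ids):
    """Sum of -log p(w_t | w_{1:t-1}, F) over the target caption.

    step_probs[t] is the decoder's distribution over the vocabulary at step t,
    computed with the ground-truth prefix as input (teacher forcing).
    """
    return -sum(math.log(probs[w]) for probs, w in zip(step_probs, target_ids))

# Hypothetical vocabulary: 0="a", 1="dog", 2="runs", 3="<eos>"
step_probs = [
    [0.70, 0.10, 0.10, 0.10],  # step 1, target "a"
    [0.10, 0.60, 0.20, 0.10],  # step 2, target "dog"
    [0.05, 0.05, 0.80, 0.10],  # step 3, target "runs"
    [0.10, 0.10, 0.10, 0.70],  # step 4, target "<eos>"
]
target_ids = [0, 1, 2, 3]

loss = caption_nll(step_probs, target_ids)
print(f"caption NLL: {loss:.4f}")
```

Because each step only reads the probability of the correct token given the correct prefix, all steps can be scored in one parallel forward pass during training.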

The Mathematics of Cross-Attention

The mechanism that makes captioning possible is cross-attention. In each layer of the text decoder, after the text tokens have attended to themselves, a cross-attention layer allows the text sequence to dynamically "query" the visual patches.

For the text token at step $t$ with hidden state $h_t \in \mathbb{R}^D$, we compute a Query vector $Q_t$. The visual patch tokens $F$ are projected into Keys $K \in \mathbb{R}^{N \times D}$ and Values $V \in \mathbb{R}^{N \times D}$:

$$Q_t = h_t W_Q, \quad K = F W_K, \quad V = F W_V$$

The attention weights $\alpha_t \in \mathbb{R}^N$ for this single word over all $N$ image patches are computed as:

$$\alpha_t = \text{softmax}\left(\frac{Q_t K^\top}{\sqrt{d_k}}\right)$$

Here $d_k$ is the dimensionality of the query and key vectors (equal to $D$ in this single-head formulation). The resulting context vector $c_t$ is the weighted sum of the visual values: $c_t = \alpha_t V$.
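The three equations above can be traced by hand with a minimal single-head sketch in plain Python. The identity projection matrices, the four toy patch vectors, and the helper names are assumptions chosen so the numbers are easy to follow. A real decoder would use learned weights and a tensor library.

```python
import math

def matvec(x, W):
    """Row vector x (length m) times matrix W (m x n)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(h_t, F, W_Q, W_K, W_V):
    """Return (alpha_t, c_t) for one text token attending over N patches."""
    q = matvec(h_t, W_Q)                 # Q_t = h_t W_Q
    K = [matvec(f, W_K) for f in F]      # K = F W_K
    V = [matvec(f, W_V) for f in F]      # V = F W_V
    d_k = len(q)
    logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
    alpha = softmax(logits)              # one weight per image patch
    c = [sum(a * v[j] for a, v in zip(alpha, V)) for j in range(len(V[0]))]
    return alpha, c

identity = [[1.0, 0.0], [0.0, 1.0]]      # assumed projections, for illustration
F = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0], [-4.0, 0.0]]  # N = 4 toy patches
h_t = [2.0, 0.0]                         # toy decoder hidden state

alpha_t, c_t = cross_attention(h_t, F, identity, identity, identity)
print([round(a, 4) for a in alpha_t], [round(x, 4) for x in c_t])
```

With these toy values the query aligns with the first patch, so almost all of the attention mass lands there and the context vector is close to that patch's value vector.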

This operation matters because the attention weights $\alpha_t$ form a spatial heatmap over the image grid. When the decoder is about to predict the word "dog," the query vector generated by the previous context (e.g., "a brown...") matches the keys of the image patches containing the dog. The network shifts its attention to specific spatial regions, token by token.


The failure of global embeddings for complex scenes

Why couldn't we just use CLIP's global embedding $\hat{z}_x$ for captioning?

If the text decoder is conditioned only on a fixed $d$-dimensional global vector, it cannot selectively attend to parts of the image. Every word generated receives the exact same visual context vector. This forces the vision encoder to heavily compress all visual information (the background, the foreground, the spatial relationships, the colors) into a single point.

While a global embedding works for short, simple captions ("a dog on a lawn"), it degrades sharply when generating detailed descriptions of complex scenes ("a man in a red shirt holding a green umbrella next to a blue car"). The spatial tokens $F$ alleviate this bottleneck by allowing the text decoder to pull fine-grained details on demand, exactly when they are needed in the sentence. The tradeoff is computational: cross-attention over $N$ patch tokens costs $O(T \cdot N)$ score computations per layer. This is linear in the number of patches, but at a fixed patch size $N$ grows quadratically with the image side length, so the cost can become a bottleneck for high-resolution images and for video, where patches accumulate across frames.
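A quick arithmetic check makes the cost concrete. The sketch below assumes a square image, a patch size of 16 pixels, and a 20-token caption. All three numbers are illustrative choices, not values taken from the lecture.

```python
def num_patches(image_size, patch_size=16):
    """Patch tokens N for a square image split into square patches."""
    side = image_size // patch_size
    return side * side

def cross_attention_scores(caption_len, n_patches):
    """Query-key dot products per cross-attention layer: T * N."""
    return caption_len * n_patches

T = 20  # hypothetical caption length in tokens
costs = {size: cross_attention_scores(T, num_patches(size)) for size in (224, 448, 896)}

for size, cost in costs.items():
    print(f"{size}x{size} px: N={num_patches(size)}, T*N={cost}")
```

Doubling the image side length quadruples $N$, and with it the number of attention scores each layer must compute.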


Visual grounding: connecting language to coordinate space

If captioning maps an image to text, grounding solves the inverse problem: mapping a text phrase to a specific mathematical coordinate region in the image. Grounding allows language to act as a spatial pointer.

Referring Expression Comprehension (REC): Given an image and an unambiguous referring expression (e.g., "the red mug on the left side of the table"), the model must output a coordinate bounding box $(x_{min}, y_{min}, x_{max}, y_{max})$ enclosing the specific object. The expression must be unambiguous: for example, "the larger of the two red mugs" explicitly disambiguates via comparison.

Phrase Grounding: Given a complete, complex sentence (e.g., "A man riding a bicycle next to a parked car"), the model must simultaneously detect and draw bounding boxes for every distinct noun phrase ("man," "bicycle," "parked car"). This requires compositional understanding, forcing the model to resolve spatial prepositions ("next to") while maintaining consistent object identities.


Architectural patterns for grounding

Historical: Two-Stage Pipelines

Early grounding systems were modular. A pre-trained Region Proposal Network (RPN), like Faster R-CNN, would first extract hundreds of candidate bounding boxes from the image, completely blind to the text prompt. A separate multimodal matching network would then compute similarity scores between the text embedding and the cropped visual features of each candidate box. This pipeline was fragile: if the RPN failed to propose a box for a small or unusual object, the matching network had zero chance of grounding it.
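The fragility of this design shows up even in a toy version. In the sketch below, the proposal list stands in for the output of a text-blind region proposer and the three-dimensional vectors stand in for learned embeddings. The boxes, features, and function names are all hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ground_two_stage(proposals, text_embedding):
    """Stage 2: pick the proposed box whose feature best matches the text.

    proposals is a list of (box, feature) pairs produced by stage 1,
    which never saw the text prompt.
    """
    if not proposals:
        return None
    return max(proposals, key=lambda p: cosine(p[1], text_embedding))[0]

# Stage 1 output (hypothetical): only a mug and a laptop were proposed.
proposals = [
    ((10, 20, 110, 220), [0.9, 0.1, 0.0]),   # mug-like region
    ((200, 40, 380, 300), [0.1, 0.9, 0.1]),  # laptop-like region
]
text_mug = [1.0, 0.0, 0.0]       # toy embedding for "the mug"
text_pipette = [0.0, 0.0, 1.0]   # toy embedding for "the pipette"

best_mug = ground_two_stage(proposals, text_mug)
best_pipette = ground_two_stage(proposals, text_pipette)
print(best_mug, best_pipette)
```

The mug query succeeds, but the pipette query can only return one of the boxes that stage 1 happened to propose, so it returns the laptop region. No amount of matching quality in stage 2 can recover an object that was never proposed.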

Modern: End-to-End Transformer Grounding (Grounding DINO)

Modern systems like MDETR and Grounding DINO discard the two-stage pipeline. Instead, they fuse the visual tokens $F$ and the text tokens early, inside a joint transformer architecture. MDETR concatenates the two into a single long sequence; Grounding DINO achieves a similar effect with cross-attention layers running in both directions between the image and text features.

Because self-attention operates across the entire concatenated sequence, cross-modal fusion happens immediately. A text token representing the word "wheel" can directly attend to the visual patches containing circular, rubber textures before any bounding boxes are even predicted.
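A minimal sketch of this idea: one text token and three visual tokens are concatenated, and a single self-attention pass computes weights across the whole sequence. The sketch assumes identity query and key projections, one head, and hand-picked two-dimensional toy vectors, none of which come from a real model.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_weights(tokens):
    """Attention matrix for one head with identity Q/K projections (assumed)."""
    d = len(tokens[0])
    weights = []
    for q in tokens:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights.append(softmax(logits))
    return weights

text_tokens = [[3.0, 0.0]]                            # toy vector for "wheel"
visual_tokens = [[3.0, 0.0], [0.0, 3.0], [0.0, 3.0]]  # patch 0 is wheel-like

sequence = text_tokens + visual_tokens                # joint sequence
A = self_attention_weights(sequence)

# Row 0 is the text token; columns after the text block are image patches.
wheel_to_patches = A[0][len(text_tokens):]
print([round(w, 4) for w in wheel_to_patches])
```

The text row of the attention matrix already concentrates on the wheel-like patch, before any box is predicted. In a two-stage pipeline this text-to-patch block of the matrix does not exist.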

The transformer outputs enriched visual tokens. A lightweight detection head then predicts bounding box coordinates $(c_x, c_y, w, h)$ directly from these enriched tokens, optimized using a combination of $L_1$ regression loss and Generalized Intersection over Union (GIoU) loss against the ground-truth boxes. Because the text conditions the visual features before any box is predicted, this joint approach grounds complex compositional queries far more reliably than a two-stage pipeline.
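The box loss can be sketched in a few lines. The version below works on a single predicted and target box in $(c_x, c_y, w, h)$ format. The loss weights default to 1.0 as placeholders, and the pixel-unit example boxes are illustrative assumptions. Real systems typically normalize coordinates and tune the weights.

```python
def cxcywh_to_xyxy(box):
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def area(b):
    """Area of an (x_min, y_min, x_max, y_max) box; 0 if it is degenerate."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def giou(a, b):
    """Generalized IoU of two corner-format boxes, in the range (-1, 1]."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    hull = area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull

def box_loss(pred, target, l1_weight=1.0, giou_weight=1.0):
    """Weighted sum of L1 and (1 - GIoU); the weights are placeholders."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    g = giou(cxcywh_to_xyxy(pred), cxcywh_to_xyxy(target))
    return l1_weight * l1 + giou_weight * (1.0 - g)

# Same 10 px center error on a tiny box and on a large box.
small_pred, small_target = (11.0, 1.0, 2.0, 2.0), (1.0, 1.0, 2.0, 2.0)
large_pred, large_target = (110.0, 100.0, 200.0, 200.0), (100.0, 100.0, 200.0, 200.0)

giou_small = giou(cxcywh_to_xyxy(small_pred), cxcywh_to_xyxy(small_target))
giou_large = giou(cxcywh_to_xyxy(large_pred), cxcywh_to_xyxy(large_target))
print(round(giou_small, 4), round(giou_large, 4))
```

The $L_1$ term is identical in both cases, while the GIoU term separates them: the small box misses its target entirely and gets a negative GIoU, whereas the large box still overlaps most of its target.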


Why grounding is mathematically difficult

Visual grounding exposes three fundamental cognitive failures in standard VLMs:

1. Attribute Binding: The prompt "a red cube and a blue sphere" requires the model to correctly bind the color "red" to the shape "cube." Standard contrastive VLMs frequently fail at this, confusing it with "a blue cube and a red sphere" because both descriptions contain the identical set of raw features (red, blue, cube, sphere). Grounding models address this by training the attention mechanism to route the "red" text token to the spatial coordinates of the cube.

2. Relational and Spatial Reasoning: "The cup to the left of the laptop" requires computing a geometric relationship between two disparate objects. A global embedding destroys relative coordinates. To solve this, grounding models rely heavily on the 2D positional embeddings added to the visual patches, allowing the transformer's attention heads to calculate spatial offsets directly.

3. Disambiguation with Negation and Comparison: "The smaller of the two cups" or "the cup that is not white" requires the network to identify all cups, compare their geometric sizes or color vectors, and perform a logical exclusion. These are compositional operations that simple dot-product similarity matching (like CLIP) mathematically cannot express.
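The attribute binding failure in item 1 can be illustrated with a deliberately simplified model of the text side. The sketch below treats a caption as an unordered bag of words, which is an assumption made for illustration and not how CLIP's text encoder actually works, and compares that with an explicit set of attribute and noun pairs. The function names and the stopword list are hypothetical.

```python
from collections import Counter

STOPWORDS = ("a", "and")

def bag_of_features(caption):
    """Order-insensitive summary: which content words appear, and how often."""
    return Counter(w for w in caption.lower().split() if w not in STOPWORDS)

def bound_pairs(caption):
    """(attribute, noun) pairs for toy captions of the form
    'a <attr> <noun> and a <attr> <noun>'."""
    words = [w for w in caption.lower().split() if w not in STOPWORDS]
    return set(zip(words[0::2], words[1::2]))

p1 = "a red cube and a blue sphere"
p2 = "a blue cube and a red sphere"

same_bag = bag_of_features(p1) == bag_of_features(p2)
same_binding = bound_pairs(p1) == bound_pairs(p2)
print(same_bag, same_binding)
```

Any representation that only records which features are present cannot tell the two prompts apart. Telling them apart requires keeping track of which attribute goes with which object, which is the information grounding supervision forces the model to preserve.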


Grounding as the bridge to physical robotics

Captioning and grounding are the absolute prerequisites for physical AI. In Course 2 (Robotics), we saw that policies require continuous coordinate actions. Grounding provides the mathematical bridge from abstract language to physical space.

If a human user tells a robot, "Hand me the Phillips-head screwdriver on the workbench," the VLM cannot just classify the image as "workbench." It must run a grounding pass to output a specific bounding box or segmentation mask for the screwdriver. This 2D pixel mask is then projected into 3D space using depth sensors (or NeRFs/3DGS, as discussed in Week 1), giving the robotic control policy the exact $(x, y, z)$ target needed to generate motor torques. Without precise visual grounding, language-conditioned robotics is impossible.
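One common way to perform the 2D-to-3D step is pinhole back-projection of the box center using a depth reading. The sketch below assumes a pinhole camera model with known intrinsics. The intrinsic values, the box, and the depth are made-up numbers, and the result is expressed in the camera frame, so a real system would still need the camera-to-robot transform.

```python
def box_center(box):
    """Center pixel (u, v) of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth (meters)
    into camera-frame coordinates (x, y, z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical grounding output and sensor values.
screwdriver_box = (380, 270, 420, 330)   # pixels
depth_at_center = 0.8                    # meters, from a depth sensor

u, v = box_center(screwdriver_box)
target_xyz = backproject(u, v, depth_at_center, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(target_xyz)
```

Using only the box center is a simplification. With a segmentation mask, one could back-project every masked pixel and take a robust average instead.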


Key takeaways

Image captioning uses encoder-decoder architectures in which cross-attention connects each generated word to the relevant spatial patches, while causal self-attention handles the text prefix. Using high-resolution spatial patch tokens is essential for fine-grained descriptions; global embeddings collapse spatial geometry and fail on complex scenes. Visual grounding solves the inverse problem, mapping noun phrases to continuous $(x_{min}, y_{min}, x_{max}, y_{max})$ bounding box coordinates. Modern end-to-end architectures like MDETR and Grounding DINO achieve this by fusing text and image tokens inside a joint transformer, allowing cross-modal attention to address attribute binding and relational reasoning before coordinate regression occurs. Together, generation and grounding provide the spatial and semantic primitives required for physical AI deployment.


Conceptual questions

  1. Cross-Attention Debugging: You are training an autoregressive image captioning model. At generation step $t$ for the word "bicycle," you visualize the cross-attention weights $\alpha_t = \text{softmax}(Q_t K^\top / \sqrt{d_k})$. Instead of localizing on the bicycle, the attention heatmap is perfectly uniform, spreading equal weight $1/N$ across all image patches. Mathematically, what specific state must the dot product $Q_t K^\top$ be in for the softmax to output a uniform distribution? Diagnose two potential architectural or initialization bugs that could cause this gradient collapse.
  2. GIoU vs L1 Loss in Grounding: Modern grounding models regress bounding boxes using a combination of $L_1$ norm loss and Generalized Intersection over Union (GIoU) loss. If a model predicts a box of size $2 \times 2$ pixels that is $10$ pixels away from a small target, versus a box of size $200 \times 200$ pixels that is $10$ pixels away from a massive target, the $L_1$ center-point error is identical. Mathematically explain why GIoU loss is scale-invariant and why relying solely on $L_1$ loss would disproportionately penalize errors on large objects while ignoring catastrophic misses on small robotics components.
  3. The Two-Stage Pipeline Bottleneck: A robotics team uses a two-stage grounding pipeline: a frozen Faster R-CNN (trained on 80 COCO classes) extracts boxes, and a CLIP model matches the text prompt to the cropped box features. The prompt is "Pick up the clear plastic pipette." The pipeline fails 100% of the time. Explain why the RPN stage is the fundamental bottleneck here, and how an end-to-end joint transformer (like Grounding DINO) bypasses this failure mode using cross-modal self-attention.
  4. Attribute Binding Matrices: Consider the prompt "a red cube and a blue sphere." In a joint text-image transformer, the self-attention mechanism computes an attention matrix across the concatenated sequence $[T_\text{red}, T_\text{cube}, T_\text{blue}, T_\text{sphere}, V_1, \dots, V_N]$. Describe the specific, high-weight connections you would expect to see in the final layer's attention matrix to prove the model has successfully solved the attribute binding problem.
  5. Zero-Shot Grounding via Captioning: You have a trained captioning model with no explicit bounding-box training. Propose a heuristic algorithm to generate a zero-shot bounding box for the phrase "the green apple" by solely analyzing the cross-attention spatial heatmaps $\alpha_t$ generated while the model is forced to decode the sequence "A photo of the green apple." What are the theoretical limitations of extracting precise bounding boxes using only the resolution of $P \times P$ patch tokens?
✦Solutions
  1. Uniform cross-attention. Softmax is uniform only when all logits are equal, i.e. $Q_t K^\top$ is constant across patches, typically because $Q_t \approx 0$ or the keys have collapsed so every dot product is identical. Two likely bugs: query/key projection (or a LayerNorm) collapsing the query to near-zero so there is no signal, or missing or broken positional encodings that make all keys interchangeable. Either way no patch is preferred and gradients to the visual stream vanish.
  2. GIoU vs L1. GIoU is built from overlap area normalized by the union, a ratio that is invariant to absolute object scale; $L_1$ on center coordinates measures absolute pixel error, so a 10 px error is trivial for a 200 px object but a total miss for a 2 px one. Relying on $L_1$ alone over-penalizes large objects and under-penalizes catastrophic small-object misses. GIoU also provides gradient when boxes do not overlap, via the enclosing-box term.
  3. Two-stage bottleneck. The frozen Faster R-CNN was trained on 80 COCO classes; "pipette" is not among them, so the RPN never proposes a box around it and CLIP never receives a relevant crop to match. The failure is upstream of CLIP. Grounding DINO fuses the text into the detector through cross-modal attention, so proposals are conditioned on the phrase and open-vocabulary objects get boxes.
  4. Attribute-binding matrix. You expect strong text→vision links from $T_\text{red}$ to the cube patches and $T_\text{blue}$ to the sphere patches, strong within-text links $T_\text{red} \leftrightarrow T_\text{cube}$ and $T_\text{blue} \leftrightarrow T_\text{sphere}$, and crucially low cross-binding (red to sphere patches). That pattern shows each attribute attended to its correct object region.
  5. Zero-shot grounding. Force-decode "A photo of the green apple," take the cross-attention heatmap at the "apple" token, normalize and threshold it, and return the bounding box of the high-attention patch region. The limitation is resolution: the box is quantized to the $P \times P$ patch grid and cannot localize sub-patch detail, and diffuse/contextual attention inflates the extent, so precise boxes are not recoverable from attention alone.
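The heuristic in solution 5 can be sketched as follows. The code assumes the attention weights for the "apple" token are already available as a flat list over a row-major patch grid. The 4×4 grid, the 16-pixel patch size, the relative threshold of 0.5, and the weight values are all illustrative assumptions.

```python
def heatmap_to_box(alpha, grid_w, patch_size, rel_threshold=0.5):
    """Bounding box (x_min, y_min, x_max, y_max) in pixels around all patches
    whose attention is at least rel_threshold times the peak weight.

    alpha is a flat, row-major list of attention weights over the patch grid.
    """
    peak = max(alpha)
    keep = [i for i, a in enumerate(alpha) if a >= rel_threshold * peak]
    rows = [i // grid_w for i in keep]
    cols = [i % grid_w for i in keep]
    return (min(cols) * patch_size, min(rows) * patch_size,
            (max(cols) + 1) * patch_size, (max(rows) + 1) * patch_size)

# Hypothetical cross-attention weights for the "apple" token on a 4x4 grid.
alpha_apple = [
    0.01, 0.01, 0.01, 0.01,
    0.01, 0.30, 0.27, 0.01,
    0.01, 0.20, 0.11, 0.01,
    0.01, 0.01, 0.01, 0.01,
]

box = heatmap_to_box(alpha_apple, grid_w=4, patch_size=16)
print(box)
```

Every coordinate of the returned box is a multiple of the patch size, which makes the quantization limit from the solution directly visible. Lowering the threshold pulls in the weakly attended patch and enlarges the box, which shows how sensitive the result is to diffuse attention.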

Looking ahead

Captioning and grounding extend the VLM from passive retrieval into active generation and spatial localization. However, training these models from scratch is immensely computationally expensive. The next evolution seeks to unify these capabilities while freezing the massive pre-trained backbones.

Week 5: BLIP, BLIP-2, and Related Models. We examine how BLIP unifies contrastive, matching, and captioning objectives, how data bootstrapping cleans noisy web data, and how BLIP-2 introduces the Q-Former as a highly efficient, learned interface to bridge frozen vision encoders with frozen Large Language Models.


Further reading

  • Anderson, P., et al. (2018). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR.
  • Kamath, A., et al. (2021). MDETR - Modulated Detection for End-to-End Multi-Modal Understanding. ICCV.
  • Liu, S., et al. (2024). Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. ECCV.