Week 4: Beyond CLIP — Captioning and Grounding — Physical AI

Purpose of this lecture#

While CLIP achieved unprecedented success in zero-shot classification and retrieval by aligning images and text in a joint geometric space, its purely contrastive, dual-encoder design left it fundamentally constrained. A standard CLIP model cannot describe what it sees in natural language (it lacks generative capacity), and it cannot locate where specific objects are within an image (it lacks spatial grounding).

A model that cannot generate text cannot answer open-ended questions, and a model that cannot ground language cannot support the spatial reasoning required for physical AI, robotics, and tool use. This lecture transitions from discriminative perception to generative perception. We examine the architectural patterns—specifically encoder-decoder captioning and cross-attention spatial grounding—that fill these gaps and establish the foundation for modern instruction-following VLMs.

Image captioning: the encoder-decoder framework#

Image captioning is formulated as a conditional sequence-to-sequence generation problem. Given an input image $x$ , the model must autoregressively generate a sequence of natural language tokens $t = (w_1, w_2, \ldots, w_T)$ that accurately describes the scene. The standard architecture for this task is the Encoder-Decoder:

1. The Vision Encoder: A pretrained vision backbone (such as a ViT-L from Week 1) processes the image. Unlike CLIP, which pools the entire image into a single global vector $z_x$ , captioning models preserve the spatial geometry by extracting the sequence of raw patch tokens $F = (f_1, f_2, \ldots, f_N) \in \mathbb{R}^{N \times D}$ from the final layer of the vision encoder.

2. The Text Decoder: An autoregressive transformer decodes the text token by token. The generation of token $w_t$ is conditioned on both the previously generated words $w_{1:t-1}$ (via standard causal self-attention) and the visual patch tokens $F$ (via cross-attention).

During training, the decoder is trained using teacher forcing: instead of feeding the model's own potentially incorrect predictions back into itself, the ground-truth prefix $w_{1:t-1}$ is always provided. The training objective is to minimize the negative log-likelihood (cross-entropy) over the entire target sequence:

$$\mathcal{L}_\text{cap} = -\sum_{t=1}^T \log p_\theta(w_t \mid w_{1:t-1}, F)$$
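As a minimal sketch of this objective, the snippet below computes $\mathcal{L}_\text{cap}$ from per-step decoder scores under teacher forcing. The `logits` array stands in for the decoder outputs, which in a real model would depend on the ground-truth prefix and on $F$; the function name, vocabulary size, and values are illustrative assumptions.

```python
import numpy as np

def caption_nll(logits, target_ids):
    """Sum of -log p(w_t | w_{1:t-1}, F) over the target sequence.

    logits: (T, vocab) array; row t holds the decoder's scores at step t,
            computed with the ground-truth prefix as input (teacher forcing).
    target_ids: length-T sequence of ground-truth token ids.
    """
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to each ground-truth token.
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return -picked.sum()

# Toy example (hypothetical values): T = 3 steps, vocabulary of 4 tokens.
toy_logits = np.zeros((3, 4))          # an untrained, uniform decoder
toy_targets = [2, 0, 1]
loss_uniform = caption_nll(toy_logits, toy_targets)   # equals 3 * log(4)
```

A uniform decoder pays $\log |\text{vocab}|$ per token, which is a convenient sanity check for the loss at initialization.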

The Mathematics of Cross-Attention#

The mechanism that makes captioning possible is cross-attention. In each layer of the text decoder, after the text tokens have attended to themselves, a cross-attention layer allows the text sequence to dynamically "query" the visual patches.

For the text token at step $t$ with hidden state $h_t \in \mathbb{R}^D$, we compute a Query vector $Q_t \in \mathbb{R}^{d_k}$. The visual patch tokens $F$ are projected into Keys $K \in \mathbb{R}^{N \times d_k}$ and Values $V \in \mathbb{R}^{N \times d_k}$, where $d_k$ is the projection dimension (equal to $D$ for a single full-width head):

$$Q_t = h_t W_Q, \quad K = F W_K, \quad V = F W_V$$

The attention weights $\alpha_t \in \mathbb{R}^N$ for this single word over all $N$ image patches are computed as:

$$\alpha_t = \text{softmax}\left(\frac{Q_t K^\top}{\sqrt{d_k}}\right)$$

The resulting context vector $c_t$ is the weighted sum of the visual values: $c_t = \alpha_t V$ .

The key property of this operation is that the attention weights $\alpha_t$ form a spatial heatmap over the image grid. When the decoder is about to predict the word "dog," the query vector computed from the preceding context (e.g., "a brown...") scores highest against the keys of the image patches containing the dog. The decoder therefore reads from different spatial regions of the image token by token.
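The cross-attention equations above can be sketched for a single head and a single text token as follows. The random weights, the 4×4 patch grid, and the dimensions are illustrative assumptions; a real decoder would use learned projections and multiple heads.

```python
import numpy as np

def cross_attention(h_t, F, W_Q, W_K, W_V):
    """Single-head cross-attention for one text token over N patch tokens."""
    Q_t = h_t @ W_Q                      # (d_k,)
    K = F @ W_K                          # (N, d_k)
    V = F @ W_V                          # (N, d_k)
    d_k = K.shape[1]
    logits = K @ Q_t / np.sqrt(d_k)      # (N,) one score per patch
    logits = logits - logits.max()       # stabilise the softmax
    alpha_t = np.exp(logits) / np.exp(logits).sum()
    c_t = alpha_t @ V                    # (d_k,) weighted sum of values
    return alpha_t, c_t

rng = np.random.default_rng(0)
N, D, d_k = 16, 8, 8                     # 4x4 patch grid; sizes are assumptions
F = rng.normal(size=(N, D))              # stand-in for vision encoder outputs
h_t = rng.normal(size=D)                 # stand-in for the decoder hidden state
W_Q, W_K, W_V = (rng.normal(size=(D, d_k)) for _ in range(3))

alpha_t, c_t = cross_attention(h_t, F, W_Q, W_K, W_V)
heatmap = alpha_t.reshape(4, 4)          # attention laid out on the patch grid
```

Reshaping `alpha_t` to the patch grid is what turns the attention weights into the spatial heatmap described above.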

The failure of global embeddings for complex scenes#

Why couldn't we just use CLIP's global embedding $\hat{z}_x$ for captioning?

If the text decoder is conditioned only on a fixed $d$-dimensional global vector, it cannot selectively attend to parts of the image. Every generated word receives the exact same visual context vector. This forces the vision encoder to compress all visual information (the background, the foreground, the spatial relationships, the colors) into a single point.

While a global embedding works for short, simple captions ("a dog on a lawn"), quality degrades sharply when generating detailed descriptions of complex scenes ("a man in a red shirt holding a green umbrella next to a blue car"). The spatial tokens $F$ alleviate this bottleneck by allowing the text decoder to pull fine-grained details on demand, exactly when they are needed in the sentence. The tradeoff is computational: cross-attention over $N$ patch tokens costs $O(T \cdot N)$ query-key products per layer. This is linear in the number of patches, so at a fixed patch size it grows quadratically with image side length, and it can become a bottleneck for high-resolution or video inputs.
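To make the cost concrete, here is a small arithmetic sketch. It assumes square images, non-overlapping square patches, and a caption of $T = 20$ tokens; all of these numbers are illustrative rather than taken from a specific model.

```python
def num_patches(image_side, patch_side):
    """Number of non-overlapping square patches in a square image."""
    return (image_side // patch_side) ** 2

def cross_attention_scores(T, image_side, patch_side):
    """Query-key dot products per cross-attention layer: T * N."""
    return T * num_patches(image_side, patch_side)

n_224 = num_patches(224, 16)                       # 14 * 14 = 196 patches
n_448 = num_patches(448, 16)                       # 28 * 28 = 784 patches
cost_224 = cross_attention_scores(20, 224, 16)     # 20 * 196
cost_448 = cross_attention_scores(20, 448, 16)     # 20 * 784
```

Doubling the image side length quadruples $N$, and with it the per-layer cross-attention cost.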

Visual grounding: connecting language to coordinate space#

If captioning maps an image to text, grounding solves the inverse problem: mapping a text phrase to a specific coordinate region in the image. Grounding allows language to act as a spatial pointer.

Referring Expression Comprehension (REC): Given an image and an unambiguous referring expression (e.g., "the red mug on the left side of the table"), the model must output a coordinate bounding box $(x_{min}, y_{min}, x_{max}, y_{max})$ enclosing the specific object. The expression must be unambiguous—"the larger of the two red mugs" explicitly disambiguates via comparison.

Phrase Grounding: Given a complete, complex sentence (e.g., "A man riding a bicycle next to a parked car"), the model must simultaneously detect and draw bounding boxes for every distinct noun phrase ("man," "bicycle," "parked car"). This requires compositional understanding, forcing the model to resolve spatial prepositions ("next to") while maintaining consistent object identities.

Architectural patterns for grounding#

Historical: Two-Stage Pipelines#

Early grounding systems were modular. A pretrained object detector such as Faster R-CNN, built around a Region Proposal Network (RPN), would first extract hundreds of candidate bounding boxes from the image, completely blind to the text prompt. A separate multimodal matching network would then compute similarity scores between the text embedding and the cropped visual features of each candidate box. This pipeline was fragile: if the detector failed to propose a box for a small or unusual object, the matching network had no chance of grounding it.
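The fragility of the two-stage design can be seen in a toy sketch of the matching stage. The boxes, the two-dimensional features, and the query embeddings below are made-up values for illustration; in a real pipeline they would come from the detector and from a text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ground_two_stage(proposals, text_embedding):
    """Stage 2 only: return the text-blind proposal that best matches the text.

    proposals: list of (box, feature) pairs produced by stage 1.
    Returns None when stage 1 proposed nothing.
    """
    if not proposals:
        return None
    best_box, _ = max(proposals, key=lambda p: cosine(p[1], text_embedding))
    return best_box

# Hypothetical stage-1 output: (x_min, y_min, x_max, y_max) boxes with toy features.
proposals = [
    ((10, 10, 60, 80), [0.9, 0.1]),     # e.g. a mug
    ((100, 20, 180, 90), [0.2, 0.8]),   # e.g. a laptop
]
mug_query = [1.0, 0.0]
unproposed_query = [0.6, 0.6]   # an object for which stage 1 produced no box

mug_box = ground_two_stage(proposals, mug_query)
wrong_box = ground_two_stage(proposals, unproposed_query)
```

Whatever the query, the matcher can only return one of the boxes it was handed, so a missed proposal in stage 1 becomes a guaranteed error in stage 2.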

Modern: End-to-End Transformer Grounding (Grounding DINO)#

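End-to-end systems such as MDETR and Grounding DINO discard the two-stage pipeline and fuse the two modalities inside a single transformer. In the simplest version of this design, used by MDETR, the visual tokens $F$ and the text tokens are concatenated into one long sequence and fed through a joint transformer encoder. Grounding DINO reaches the same early fusion with bidirectional image-to-text and text-to-image cross-attention layers rather than plain concatenation.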

Because self-attention operates across the entire concatenated sequence, cross-modal fusion happens immediately. A text token representing the word "wheel" can directly attend to the visual patches containing circular, rubber textures before any bounding boxes are even predicted.
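A minimal sketch of attention over a concatenated text-and-image sequence is shown below. The token counts, dimensions, and random weights are assumptions for illustration, and only a single head of a single layer is modelled.

```python
import numpy as np

def self_attention_weights(X, W_Q, W_K):
    """Row-wise softmax attention weights over a joint sequence X of shape (L, D)."""
    Q, K = X @ W_Q, X @ W_K
    logits = Q @ K.T / np.sqrt(K.shape[1])
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilise the softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_text, n_patches, D = 3, 9, 8                  # illustrative sizes
text_tokens = rng.normal(size=(n_text, D))
patch_tokens = rng.normal(size=(n_patches, D))
joint = np.concatenate([text_tokens, patch_tokens], axis=0)   # (12, D)

W_Q, W_K = rng.normal(size=(D, D)), rng.normal(size=(D, D))
A = self_attention_weights(joint, W_Q, W_K)                   # (12, 12)

text_to_image = A[:n_text, n_text:]    # how each word attends to each patch
image_to_text = A[n_text:, :n_text]    # how each patch attends to each word
```

The off-diagonal blocks `text_to_image` and `image_to_text` are where cross-modal fusion happens; the diagonal blocks are ordinary within-modality attention.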

The transformer outputs text-aware visual tokens. In these DETR-style models, a decoder with a set of object queries attends to the enriched tokens, and a lightweight detection head predicts bounding box coordinates $(c_x, c_y, w, h)$ for each query. Training uses a combination of $L_1$ regression loss and Generalized Intersection over Union (GIoU) loss against the ground-truth boxes. Because the text shapes the features before any box is predicted, this joint approach handles complex compositional queries considerably better than two-stage matching.
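The box regression objective can be sketched in plain Python as follows. The loss weights `w_l1` and `w_giou` and the helper names are assumptions chosen for illustration, and the code assumes non-degenerate boxes (positive width and height).

```python
def cxcywh_to_xyxy(box):
    """Convert (c_x, c_y, w, h) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def area(b):
    """Area of an (x_min, y_min, x_max, y_max) box; zero if it is empty."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def giou(a, b):
    """Generalized IoU of two (x_min, y_min, x_max, y_max) boxes, in [-1, 1]."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    hull = area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull

def box_loss(pred, target, w_l1=5.0, w_giou=2.0):
    """Weighted L1 + GIoU loss on (c_x, c_y, w, h) boxes; weights are assumptions."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    g = giou(cxcywh_to_xyxy(pred), cxcywh_to_xyxy(target))
    return w_l1 * l1 + w_giou * (1.0 - g)

# The same 10-pixel offset is a total miss for a 2x2 box but minor for a 200x200 box.
small_miss = giou((0, 0, 2, 2), (10, 0, 12, 2))
large_near = giou((0, 0, 200, 200), (10, 0, 210, 200))
```

GIoU depends only on ratios of areas, so scaling both boxes by the same factor leaves it unchanged, whereas the $L_1$ term grows with the absolute size of the error.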

Why grounding is mathematically difficult#

Visual grounding exposes three recurring failure modes of standard VLMs:

1. Attribute Binding: The prompt "a red cube and a blue sphere" requires the model to correctly bind the color "red" to the shape "cube." Standard contrastive VLMs frequently fail at this, confusing it with "a blue cube and a red sphere," because both prompts (and the images they describe) contain the identical set of raw features: red, blue, cube, sphere. Grounding models address this by training attention to route the "red" text token to the image region occupied by the cube.

2. Relational and Spatial Reasoning: "The cup to the left of the laptop" requires computing a geometric relationship between two disparate objects. A global embedding destroys relative coordinates. To solve this, grounding models rely heavily on the 2D positional embeddings added to the visual patches, allowing the transformer's attention heads to calculate spatial offsets directly.

3. Disambiguation with Negation and Comparison: "The smaller of the two cups" or "the cup that is not white" requires the network to identify all cups, compare their sizes or colors, and perform a logical exclusion. These are compositional operations that a single dot-product similarity between two global embeddings (as in CLIP) is poorly suited to express.
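The attribute binding failure can be illustrated with a toy analogy. This is not a model of how CLIP's text encoder works; it only shows that any representation which discards word order cannot tell the two prompts apart, while one that keeps attribute-noun pairs can.

```python
from collections import Counter

def bag_of_words(prompt):
    """Order-free representation: word counts only, no binding structure."""
    return Counter(prompt.lower().split())

def attribute_pairs(prompt, attributes=("red", "blue")):
    """Binding-aware representation: (attribute, next word) pairs."""
    words = prompt.lower().split()
    return {(w, nxt) for w, nxt in zip(words, words[1:]) if w in attributes}

prompt_a = "a red cube and a blue sphere"
prompt_b = "a blue cube and a red sphere"

same_bag = bag_of_words(prompt_a) == bag_of_words(prompt_b)            # True
same_pairs = attribute_pairs(prompt_a) == attribute_pairs(prompt_b)    # False
```

Both prompts contain the same words, so the bag-of-words view treats them as identical; only the paired view separates ("red", "cube") from ("red", "sphere").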

Grounding as the bridge to physical robotics#

Captioning and grounding are prerequisites for language-driven physical AI. In Course 2 (Robotics), we saw that policies require continuous coordinate actions. Grounding provides the bridge from abstract language to physical space.

If a human user tells a robot, "Hand me the Phillips-head screwdriver on the workbench," the VLM cannot just classify the image as "workbench." It must run a grounding pass to output a specific bounding box or segmentation mask for the screwdriver. This 2D pixel mask is then projected into 3D space using depth sensors (or NeRFs/3DGS, as discussed in Week 1), giving the robotic control policy the exact $(x, y, z)$ target needed to generate motor torques. Without precise visual grounding, language-conditioned robotics is impossible.
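The projection from a grounded 2D region to a 3D target can be sketched with the standard pinhole camera model. The bounding box, the depth reading, and the intrinsics (`fx`, `fy`, `cx`, `cy`) below are hypothetical values, and the result is expressed in the camera frame rather than the robot's base frame.

```python
def box_center(box):
    """Center pixel (u, v) of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at metric depth to camera-frame (x, y, z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical grounding output and camera intrinsics.
screwdriver_box = (300, 220, 340, 260)
u, v = box_center(screwdriver_box)      # center pixel of the grounded box
target_xyz = backproject(u, v, depth=0.8, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

In practice the depth would be read from the depth image at the grounded pixels, and a camera-to-robot transform would be applied before handing the target to the control policy.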

Key takeaways#

Image captioning uses encoder-decoder architectures in which cross-attention connects each generated word to the relevant image patches, while causal self-attention handles the text prefix.
Spatial patch tokens are essential for fine-grained descriptions; global embeddings collapse spatial geometry and fail on complex scenes.
Visual grounding solves the inverse problem, mapping noun phrases to continuous $(x_{min}, y_{min}, x_{max}, y_{max})$ bounding box coordinates.
End-to-end architectures like MDETR and Grounding DINO fuse text and image tokens inside a joint transformer, so cross-modal attention can resolve attribute binding and relational reasoning before coordinate regression occurs.
Together, generation and grounding provide the spatial and semantic primitives required for physical AI deployment.

Conceptual questions#

Cross-Attention Debugging: You are training an autoregressive image captioning model. At generation step $t$ for the word "bicycle," you visualize the cross-attention weights $\alpha_t = \text{softmax}(Q_t K^\top / \sqrt{d_k})$ . Instead of localizing on the bicycle, the attention heatmap is perfectly uniform, spreading equal weight $1/N$ across all image patches. Mathematically, what specific state must the dot product $Q_t K^\top$ be in for the softmax to output a uniform distribution? Diagnose two potential architectural or initialization bugs that could cause this gradient collapse.
GIoU vs L1 Loss in Grounding: Modern grounding models regress bounding boxes using a combination of $L_1$ norm loss and Generalized Intersection over Union (GIoU) loss. If a model predicts a box of size $2 \times 2$ pixels that is $10$ pixels away from a small target, versus a box of size $200 \times 200$ pixels that is $10$ pixels away from a massive target, the $L_1$ center-point error is identical. Mathematically explain why GIoU loss is scale-invariant and why relying solely on $L_1$ loss would disproportionately penalize errors on large objects while ignoring catastrophic misses on small robotics components.
The Two-Stage Pipeline Bottleneck: A robotics team uses a two-stage grounding pipeline: a frozen Faster R-CNN (trained on 80 COCO classes) extracts boxes, and a CLIP model matches the text prompt to the cropped box features. The prompt is "Pick up the clear plastic pipette." The pipeline fails 100% of the time. Explain why the RPN stage is the fundamental bottleneck here, and how an end-to-end joint transformer (like Grounding DINO) bypasses this failure mode using cross-modal self-attention.
Attribute Binding Matrices: Consider the prompt "a red cube and a blue sphere." In a joint text-image transformer, the self-attention mechanism computes an attention matrix across the concatenated sequence $[T_\text{red}, T_\text{cube}, T_\text{blue}, T_\text{sphere}, V_1, \dots, V_N]$ . Describe the specific, high-weight connections you would expect to see in the final layer's attention matrix to prove the model has successfully solved the attribute binding problem.
Zero-Shot Grounding via Captioning: You have a trained captioning model with no explicit bounding-box training. Propose a heuristic algorithm to generate a zero-shot bounding box for the phrase "the green apple" by solely analyzing the cross-attention spatial heatmaps $\alpha_t$ generated while the model is forced to decode the sequence "A photo of the green apple." What are the theoretical limitations of extracting precise bounding boxes using only the resolution of $P \times P$ patch tokens?

Solutions

Uniform cross-attention. Softmax is uniform only when all logits are equal, i.e. $Q_t K^\top$ is constant across patches, typically because $Q_t \approx 0$ or the keys have collapsed so every dot product is identical. Two likely bugs: a query or key projection that is zero-initialized (or a normalization layer that collapses the query to near zero), or a vision-side bug that makes every key identical, such as feeding a broadcast pooled vector in place of the patch tokens. In either case no patch is preferred, the context vector $c_t$ reduces to the mean of the values, and the word receives no spatially selective signal.
GIoU vs L1. GIoU is built from overlap area normalized by the union, a ratio that is invariant to absolute object scale; $L_1$ on center coordinates measures absolute pixel error, so a 10 px error is trivial for a 200 px object but a total miss for a 2 px one. Relying on $L_1$ alone over-penalizes large objects and under-penalizes catastrophic small-object misses. GIoU also provides gradient when boxes do not overlap, via the enclosing-box term.
Two-stage bottleneck. The frozen Faster R-CNN was trained on 80 COCO classes; "pipette" is not among them, so the RPN never proposes a box around it and CLIP never receives a relevant crop to match — the failure is upstream of CLIP. Grounding DINO fuses the text into the detector through cross-modal attention, so proposals are conditioned on the phrase and open-vocabulary objects get boxes.
Attribute-binding matrix. You expect strong text→vision links from $T_\text{red}$ to the cube patches and $T_\text{blue}$ to the sphere patches, strong within-text links $T_\text{red}\leftrightarrow T_\text{cube}$ and $T_\text{blue}\leftrightarrow T_\text{sphere}$ , and crucially low cross-binding (red to sphere patches). That pattern shows each attribute attended to its correct object region.
Zero-shot grounding. Force-decode "A photo of the green apple," take the cross-attention heatmap at the "apple" token, normalize and threshold it, and return the bounding box of the high-attention patch region. The limitation is resolution: the box is quantized to the $P \times P$ patch grid and cannot localize sub-patch detail, and diffuse/contextual attention inflates the extent — so precise boxes are not recoverable from attention alone.
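The heuristic in the last solution can be sketched as follows. The heatmap values, the relative threshold of 0.5, and the 16-pixel patch size are illustrative assumptions, and the function name is hypothetical.

```python
def box_from_heatmap(heatmap, patch_size, rel_threshold=0.5):
    """Pixel box (x_min, y_min, x_max, y_max) around all patches whose
    attention is at least rel_threshold times the peak attention."""
    peak = max(max(row) for row in heatmap)
    hot = [(r, c) for r, row in enumerate(heatmap)
           for c, a in enumerate(row) if a >= rel_threshold * peak]
    rows = [r for r, _ in hot]
    cols = [c for _, c in hot]
    return (min(cols) * patch_size, min(rows) * patch_size,
            (max(cols) + 1) * patch_size, (max(rows) + 1) * patch_size)

# Toy 4x4 cross-attention heatmap for the "apple" token (made-up values).
heatmap = [
    [0.01, 0.01, 0.01, 0.01],
    [0.01, 0.30, 0.25, 0.01],
    [0.01, 0.20, 0.18, 0.01],
    [0.01, 0.01, 0.01, 0.01],
]
apple_box = box_from_heatmap(heatmap, patch_size=16)
```

Every coordinate the function can return is a multiple of `patch_size`, which makes the quantization limit explicit: the box cannot be tighter than the patch grid allows.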

Looking ahead#

Captioning and grounding extend the VLM from passive retrieval into active generation and spatial localization. However, training these models from scratch is immensely computationally expensive. The next evolution seeks to unify these capabilities while freezing the massive pre-trained backbones.

Week 5: BLIP, BLIP-2, and Related Models. We examine how BLIP unifies contrastive, matching, and captioning objectives, how data bootstrapping cleans noisy web data, and how BLIP-2 introduces the Q-Former as a highly efficient, learned interface to bridge frozen vision encoders with frozen Large Language Models.
