Purpose of this lecture
While CLIP achieved unprecedented success in zero-shot classification and retrieval by aligning images and text in a joint geometric space, its purely contrastive, dual-encoder design left it fundamentally constrained. A standard CLIP model cannot describe what it sees in natural language (it lacks generative capacity), and it cannot locate where specific objects are within an image (it lacks spatial grounding).
A model that cannot generate text cannot answer open-ended questions, and a model that cannot ground language cannot support the spatial reasoning required for physical AI, robotics, and tool use. This lecture transitions from discriminative perception to generative perception. We examine the architectural patterns—specifically encoder-decoder captioning and cross-attention spatial grounding—that fill these gaps and establish the foundation for modern instruction-following VLMs.
Image captioning: the encoder-decoder framework
Image captioning is formulated as a conditional sequence-to-sequence generation problem. Given an input image $I$, the model must autoregressively generate a sequence of natural language tokens $y_1, \dots, y_T$ that accurately describes the scene. The standard architecture for this task is the Encoder-Decoder:
1. The Vision Encoder: A pretrained vision backbone (such as a ViT-L from Week 1) processes the image. Unlike CLIP, which pools the entire image into a single global vector, captioning models preserve the spatial geometry by extracting the sequence of raw patch tokens from the final layer of the vision encoder.
2. The Text Decoder: An autoregressive transformer decodes the text token by token. The generation of token $y_t$ is conditioned on both the previously generated words $y_{<t}$ (via standard causal self-attention) and the visual patch tokens (via cross-attention).
During training, the decoder is trained using teacher forcing: instead of feeding the model's own potentially incorrect predictions back into itself, the ground-truth prefix $y_{<t}$ is always provided. The training objective is to minimize the negative log-likelihood (cross-entropy) over the entire target sequence:
$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, I)$$
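As a minimal sketch of this objective, the following sums the negative log-probability of each ground-truth token over a caption. It assumes the per-step vocabulary logits were already produced by a decoder that was fed the ground-truth prefix; the names `log_softmax`, `teacher_forced_nll`, and the toy logits are illustrative, not part of any particular library.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def teacher_forced_nll(step_logits, target_ids):
    """Sum of -log p(target token | ground-truth prefix, image) over the caption.

    step_logits[t] holds the vocabulary logits at step t. They are assumed to
    come from a decoder conditioned on the ground-truth prefix (teacher
    forcing), not on the model's own sampled tokens.
    """
    assert len(step_logits) == len(target_ids)
    return -sum(log_softmax(logits)[y] for logits, y in zip(step_logits, target_ids))

# Toy example (hypothetical numbers): vocabulary of 3 tokens, caption of length 2.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 1]
loss = teacher_forced_nll(logits, targets)
```

The second step has uniform logits, so it contributes exactly log 3 to the loss; the first step contributes less because the correct token already has the largest logit.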
The Mathematics of Cross-Attention
The mechanism that makes captioning possible is cross-attention. In each layer of the text decoder, after the text tokens have attended to themselves, a cross-attention layer allows the text sequence to dynamically "query" the visual patches.
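For the text token at step $t$ with hidden state $h_t$, we compute a Query vector $q_t = W_Q h_t$. The visual patch tokens $z_1, \dots, z_N$ are projected into Keys $k_i$ and Values $v_i$:
$$k_i = W_K z_i, \qquad v_i = W_V z_i, \qquad i = 1, \dots, N$$
The attention weights for this single word over all image patches are computed as a softmax over scaled dot products, where $d_k$ is the key dimension:
$$\alpha_{t,i} = \frac{\exp\left(q_t^\top k_i / \sqrt{d_k}\right)}{\sum_{j=1}^{N} \exp\left(q_t^\top k_j / \sqrt{d_k}\right)}$$
The resulting context vector is the weighted sum of the visual values: $c_t = \sum_{i=1}^{N} \alpha_{t,i} v_i$.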
This operation is profound because the attention weights form a spatial heatmap over the image grid. When the decoder is about to predict the word "dog," the query vector generated by the previous context (e.g., "a brown...") matches the keys of the image patches containing the dog. The network dynamically focuses its computational resources on specific spatial regions, token by token.
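The mechanism can be sketched for a single text query in a few lines. This assumes standard scaled dot-product attention (dividing scores by the square root of the key dimension) and takes already-projected queries, keys, and values as plain lists; the function names and the toy 2x2 patch grid are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """Scaled dot-product attention for one text query over N image patches.

    query:  list of d_k floats (projected decoder hidden state)
    keys:   N lists of d_k floats (projected patch tokens)
    values: N lists of d_v floats (projected patch tokens)
    Returns (weights, context): N attention weights and a d_v context vector.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return weights, context

# Toy 2x2 patch grid (N = 4); only patch 2 has a key aligned with the query.
query = [4.0, 0.0]
keys = [[0.0, 1.0], [0.0, -1.0], [4.0, 0.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.0, 0.0]]
weights, context = cross_attend(query, keys, values)
heatmap = [weights[0:2], weights[2:4]]  # lay the weights back onto the 2x2 grid
```

Because one key matches the query far better than the others, almost all of the attention mass lands on that patch, and the context vector is close to that patch's value. Reshaping the weights onto the patch grid is what yields the spatial heatmap described above.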
The failure of global embeddings for complex scenes
Why couldn't we just use CLIP's global embedding for captioning?
If the text decoder is conditioned only on a fixed $d$-dimensional global vector, it cannot selectively attend to parts of the image. Every word generated receives the exact same visual context vector. This forces the vision encoder to heavily compress all visual information (the background, the foreground, the spatial relationships, the colors) into a single point.
While a global embedding works for short, simple captions ("a dog on a lawn"), it degrades sharply when generating detailed descriptions of complex scenes ("a man in a red shirt holding a green umbrella next to a blue car"). The spatial tokens alleviate this bottleneck by allowing the text decoder to pull high-resolution details on demand exactly when they are needed in the sentence structure. The tradeoff is computational: computing cross-attention between $T$ text tokens and $N$ patch tokens incurs $O(T N d)$ operations per layer, which scales linearly with the number of patches (and hence with the pixel count of the image) but can become a bottleneck for video, where $N$ also grows with the number of frames.
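To make the cost tradeoff concrete, here is a back-of-the-envelope sketch. The patch size of 16, hidden size of 768, and caption length of 32 are illustrative assumptions, and the multiply count covers only the attention scores and the weighted sum, ignoring the linear projections.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping square patches a ViT-style encoder produces."""
    return (height // patch_size) * (width // patch_size)

def cross_attention_mults(num_text_tokens, n_patches, dim):
    """Rough multiply count for one cross-attention layer.

    Computing the scores and computing the weighted sum each cost about
    T * N * d multiplies, hence the factor of 2. Projections are ignored.
    """
    return 2 * num_text_tokens * n_patches * dim

# Hypothetical settings: 16x16 patches, 32 caption tokens, hidden size 768.
n_224 = num_patches(224, 224, 16)   # 14 * 14 patches
n_448 = num_patches(448, 448, 16)   # 28 * 28 patches
cost_224 = cross_attention_mults(32, n_224, 768)
cost_448 = cross_attention_mults(32, n_448, 768)
```

Doubling the side length of the image quadruples the number of patches and therefore the cross-attention cost: linear in the patch count, but quadratic in the side length.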
Visual grounding: connecting language to coordinate space
If captioning maps an image to text, grounding solves the inverse problem: mapping a text phrase to a specific mathematical coordinate region in the image. Grounding allows language to act as a spatial pointer.
Referring Expression Comprehension (REC): Given an image and an unambiguous referring expression (e.g., "the red mug on the left side of the table"), the model must output a coordinate bounding box enclosing the specific object. The expression must be unambiguous—"the larger of the two red mugs" explicitly disambiguates via comparison.
Phrase Grounding: Given a complete, complex sentence (e.g., "A man riding a bicycle next to a parked car"), the model must simultaneously detect and draw bounding boxes for every distinct noun phrase ("man," "bicycle," "parked car"). This requires compositional understanding, forcing the model to resolve spatial prepositions ("next to") while maintaining consistent object identities.
Architectural patterns for grounding
Historical: Two-Stage Pipelines
Early grounding systems were modular. A pre-trained Region Proposal Network (RPN), like Faster R-CNN, would first extract hundreds of candidate bounding boxes from the image, completely blind to the text prompt. A separate multimodal matching network would then compute similarity scores between the text embedding and the cropped visual features of each candidate box. This pipeline was fragile: if the RPN failed to propose a box for a small or unusual object, the matching network had zero chance of grounding it.
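The matching stage of such a pipeline can be sketched as a simple arg-max over candidate boxes. This is a simplified illustration, assuming cosine similarity as the matching score and toy 2-dimensional features; `ground_two_stage` and the proposal format are hypothetical names, not a real library API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ground_two_stage(text_embedding, proposals):
    """Return the proposed box whose feature best matches the text.

    proposals: list of (box, feature) pairs from a text-blind proposal stage.
    The answer can only ever be one of the proposed boxes, so an object the
    proposal stage missed can never be grounded.
    """
    if not proposals:
        return None
    return max(proposals, key=lambda p: cosine(text_embedding, p[1]))[0]

# Toy example with hypothetical boxes (x1, y1, x2, y2) and 2-d features.
text = [1.0, 0.0]
proposals = [((10, 10, 50, 50), [0.9, 0.1]), ((60, 20, 90, 80), [0.1, 0.9])]
best = ground_two_stage(text, proposals)
```

The fragility described above is visible in the function signature: the text only selects among the proposals and has no way to influence which boxes are proposed in the first place.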
Modern: End-to-End Transformer Grounding (Grounding DINO)
Modern systems like MDETR and Grounding DINO discard the two-stage pipeline. Instead, they fuse vision and language early. In the simplest form (as in MDETR), the visual tokens and the text tokens are concatenated into a single long sequence and fed through a joint transformer; Grounding DINO reaches a comparable early fusion using cross-attention between image and text features.
Because self-attention operates across the entire concatenated sequence, cross-modal fusion happens immediately. A text token representing the word "wheel" can directly attend to the visual patches containing circular, rubber textures before any bounding boxes are even predicted.
The transformer outputs enriched visual tokens. A lightweight detection head then predicts bounding box coordinates directly from these enriched tokens, optimized using a combination of $L_1$ regression loss and Generalized Intersection over Union (GIoU) loss against the ground truth boxes. Because the text shapes the visual features before any box is predicted, this joint approach grounds complex compositional queries substantially better than the two-stage pipeline.
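For reference, GIoU and a combined box loss can be written out directly. This is a minimal sketch for a single pair of boxes in (x1, y1, x2, y2) format; the loss weights `l1_weight` and `giou_weight` are hypothetical hyperparameters, and real systems typically normalize coordinates and batch the computation.

```python
def giou(box_a, box_b):
    """Generalized IoU for boxes in (x1, y1, x2, y2) format with x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull

def box_loss(pred, target, l1_weight=1.0, giou_weight=1.0):
    """Weighted sum of L1 coordinate error and (1 - GIoU)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, target))

overlap_small = giou((0, 0, 2, 2), (1, 1, 3, 3))
overlap_large = giou((0, 0, 20, 20), (10, 10, 30, 30))  # same layout, 10x larger
disjoint = giou((0, 0, 1, 1), (2, 0, 3, 1))
```

The two overlapping pairs have the same GIoU even though one is ten times larger, while their L1 errors differ by a factor of ten. Unlike plain IoU, GIoU also stays informative for disjoint boxes, where it becomes negative instead of saturating at zero.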
Why grounding is mathematically difficult
Visual grounding exposes three fundamental cognitive failures in standard VLMs:
1. Attribute Binding: The prompt "a red cube and a blue sphere" requires the model to correctly bind the color "red" to the shape "cube." Standard contrastive VLMs frequently fail at this, confusing it with "a blue cube and a red sphere" because both images contain identical sets of raw features (red, blue, cube, sphere). Grounding models solve this by forcing the self-attention mechanism to explicitly route the "red" text token to the spatial coordinates of the cube.
2. Relational and Spatial Reasoning: "The cup to the left of the laptop" requires computing a geometric relationship between two disparate objects. A global embedding destroys relative coordinates. To solve this, grounding models rely heavily on the 2D positional embeddings added to the visual patches, allowing the transformer's attention heads to calculate spatial offsets directly.
3. Disambiguation with Negation and Comparison: "The smaller of the two cups" or "the cup that is not white" requires the network to identify all cups, compare their geometric sizes or color vectors, and perform a logical exclusion. These are compositional operations that simple dot-product similarity matching (like CLIP) mathematically cannot express.
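The attribute binding failure in item 1 can be illustrated with a deliberately extreme caricature: a representation that records only which features are present. This is not how CLIP encodes text or images; it is a toy bag-of-features model, with a hypothetical four-word vocabulary, meant only to show why a single similarity score over unbound features cannot separate the two captions.

```python
def bag_of_features(caption, vocabulary):
    """Order-insensitive count vector: one entry per vocabulary word."""
    words = caption.lower().split()
    return [words.count(term) for term in vocabulary]

def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

vocab = ["red", "blue", "cube", "sphere"]
caption_a = "a red cube and a blue sphere"
caption_b = "a blue cube and a red sphere"
vec_a = bag_of_features(caption_a, vocab)
vec_b = bag_of_features(caption_b, vocab)

# A toy image representation that only records which features are present.
image_vec = [1, 1, 1, 1]
score_a = dot(image_vec, vec_a)
score_b = dot(image_vec, vec_b)
```

Both captions map to the same vector, so any image receives the same score for each of them. Separating the captions requires a representation that ties each attribute to a specific object or region, which is the role token-level attention plays in grounding models.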
Grounding as the bridge to physical robotics
Captioning and grounding are the absolute prerequisites for physical AI. In Course 2 (Robotics), we saw that policies require continuous coordinate actions. Grounding provides the mathematical bridge from abstract language to physical space.
If a human user tells a robot, "Hand me the Phillips-head screwdriver on the workbench," the VLM cannot just classify the image as "workbench." It must run a grounding pass to output a specific bounding box or segmentation mask for the screwdriver. This 2D pixel mask is then projected into 3D space using depth sensors (or NeRFs/3DGS, as discussed in Week 1), giving the robotic control policy the exact target needed to generate motor torques. Without precise visual grounding, language-conditioned robotics is impossible.
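The 2D-to-3D step can be sketched with the standard pinhole camera model. This simplified example lifts only the center of a grounded bounding box rather than a full mask, and the intrinsics (`fx`, `fy`, `cx`, `cy`), the box, and the depth reading are hypothetical values chosen for illustration.

```python
def box_center(box):
    """Center pixel (u, v) of a box in (x1, y1, x2, y2) format."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a given depth (metres)
    into camera-frame coordinates (X, Y, Z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical grounded box, depth reading, and camera intrinsics.
box = (300.0, 220.0, 340.0, 260.0)
u, v = box_center(box)
target = backproject(u, v, depth=0.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
offset = backproject(620.0, 240.0, depth=2.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

Here the box is centered on the principal point, so the target lies straight ahead of the camera at the measured depth. A real system would additionally transform this camera-frame point into the robot's base frame and would usually aggregate depth over the whole mask rather than reading a single pixel.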
Key takeaways
- Image captioning uses encoder-decoder architectures in which cross-attention connects each generated word to the relevant visual patches, while causal self-attention conditions it on the preceding words.
- Using high-resolution spatial patch tokens is essential for fine-grained descriptions; global embeddings collapse spatial geometry and fail on complex scenes.
- Visual grounding solves the inverse problem, mapping noun phrases to continuous bounding box coordinates.
- Modern end-to-end architectures like MDETR and Grounding DINO achieve this by fusing text and image tokens in a joint transformer, allowing cross-modal attention to resolve attribute binding and relational reasoning before coordinate regression occurs.
- Together, generation and grounding provide the spatial and semantic primitives required for physical AI deployment.
Conceptual questions
- Cross-Attention Debugging: You are training an autoregressive image captioning model. At generation step $t$ for the word "bicycle," you visualize the cross-attention weights $\alpha_{t,i}$ over the image patches $i$. Instead of localizing on the bicycle, the attention heatmap is perfectly uniform, spreading equal weight across all image patches. Mathematically, what specific state must the dot products $q_t^\top k_i$ between the query and the patch keys be in for the softmax to output a uniform distribution? Diagnose two potential architectural or initialization bugs that could cause this attention collapse.
- GIoU vs L1 Loss in Grounding: Modern grounding models regress bounding boxes using a combination of $L_1$ norm loss and Generalized Intersection over Union (GIoU) loss. If a model predicts a small box that is a fixed number of pixels away from a small target, versus a large box that is the same number of pixels away from a massive target, the center-point error is identical. Mathematically explain why GIoU loss is scale-invariant and why relying solely on $L_1$ loss would disproportionately penalize errors on large objects while ignoring catastrophic misses on small robotics components.
- The Two-Stage Pipeline Bottleneck: A robotics team uses a two-stage grounding pipeline: a frozen Faster R-CNN (trained on 80 COCO classes) extracts boxes, and a CLIP model matches the text prompt to the cropped box features. The prompt is "Pick up the clear plastic pipette." The pipeline fails 100% of the time. Explain why the RPN stage is the fundamental bottleneck here, and how an end-to-end joint transformer (like Grounding DINO) bypasses this failure mode using cross-modal self-attention.
- Attribute Binding Matrices: Consider the prompt "a red cube and a blue sphere." In a joint text-image transformer, the self-attention mechanism computes an attention matrix across the concatenated sequence of text and image tokens. Describe the specific, high-weight connections you would expect to see in the final layer's attention matrix to prove the model has successfully solved the attribute binding problem.
- Zero-Shot Grounding via Captioning: You have a trained captioning model with no explicit bounding-box training. Propose a heuristic algorithm to generate a zero-shot bounding box for the phrase "the green apple" by solely analyzing the cross-attention spatial heatmaps generated while the model is forced to decode the sequence "A photo of the green apple." What are the theoretical limitations of extracting precise bounding boxes using only the resolution of patch tokens?
Looking ahead
Captioning and grounding extend the VLM from passive retrieval into active generation and spatial localization. However, training these models from scratch is immensely computationally expensive. The next evolution seeks to unify these capabilities while freezing the massive pre-trained backbones.
Week 5: BLIP, BLIP-2, and Related Models. We examine how BLIP unifies contrastive, matching, and captioning objectives, how data bootstrapping cleans noisy web data, and how BLIP-2 introduces the Q-Former as a highly efficient, learned interface to bridge frozen vision encoders with frozen Large Language Models.
Further reading
- Anderson, P., et al. (2018). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR.
- Kamath, A., et al. (2021). MDETR - Modulated Detection for End-to-End Multi-Modal Understanding. ICCV.
- Liu, S., et al. (2023). Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. ECCV.