Week 3: Contrastive Vision–Language Learning (CLIP)

Purpose of this lecture

Before 2021, the dominant paradigm in computer vision was rigidly supervised: researchers manually labeled images into a fixed ontology (e.g., 1,000 ImageNet classes) and trained a model to classify inputs exclusively into those buckets. This produced strong discriminative models whose label space had no connection to natural language. If a model was trained to recognize "dog," nothing in its output layer related that class to the phrases "a brown puppy," "canine," or "playing fetch."

CLIP (Contrastive Language–Image Pretraining; Radford et al., 2021) broke with this paradigm. By replacing human-annotated class labels with noisy, web-scraped natural language captions, CLIP learned image-language alignment at internet scale. It established a joint embedding space in which images and text are represented as vectors of the same dimension, enabling zero-shot transfer to new visual concepts without task-specific fine-tuning. Much as ImageNet pretraining became the default starting point for traditional vision, CLIP-style pretraining became a common starting point for modern multimodal systems.

This lecture derives the CLIP architecture, its symmetric contrastive objective, how zero-shot classification emerges from vector similarity, and the structural limitations that motivated later Vision-Language Models (VLMs).

Architecture: Dual encoders and the joint space

Unlike standard VLMs that fuse vision and text early via cross-attention, CLIP uses a strict dual-encoder architecture. It processes images and text entirely independently, only bringing their representations together at the very end via a dot product.

Image Encoder ( $f_\theta$ ): A Vision Transformer (ViT-B/32, ViT-L/14, etc.) or a ResNet processes the image $x$ . The global image representation is extracted from the final layer (typically the [CLS] token for ViT or an attention-pooled feature map for ResNet). This output is linearly projected to a joint dimension $d$ (e.g., $d=512$ ).
Text Encoder ( $g_\phi$ ): A standard transformer processes the text sequence $t$ . The original CLIP uses a GPT-style (causal) architecture with masked self-attention. The text is lower-cased, tokenized with Byte Pair Encoding (BPE), bracketed with [SOS] (Start of Sequence) and [EOS] (End of Sequence) tokens, and capped at a maximum sequence length of 76 tokens (the released implementation uses a context length of 77). Because attention is causal, only the [EOS] position has attended to every preceding token, so the final layer's hidden state at the [EOS] token is extracted as the representation of the entire sentence. This is also linearly projected to dimension $d$ .

Both resulting vectors are strictly $L_2$ -normalized to project them onto the surface of a unit hypersphere:

$$\hat{z}_x = \frac{f_\theta(x)}{\|f_\theta(x)\|_2}, \quad \hat{z}_t = \frac{g_\phi(t)}{\|g_\phi(t)\|_2}$$

Because they are normalized, their dot product directly computes the cosine similarity between the image and the text:

$$\text{sim}(x, t) = \hat{z}_x \cdot \hat{z}_t \in [-1, 1]$$
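As a minimal sketch of the normalization step, the snippet below uses made-up 4-dimensional vectors in place of real projected encoder outputs (CLIP uses, e.g., $d=512$ ). The names `l2_normalize`, `image_feat`, and `text_feat` are illustrative and do not come from any CLIP codebase.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical projected encoder outputs (assumed values, for illustration only).
image_feat = [3.0, 4.0, 0.0, 0.0]
text_feat = [6.0, 8.0, 0.0, 0.0]    # same direction as the image, larger magnitude
other_feat = [0.0, 0.0, 5.0, 0.0]   # orthogonal direction

z_x = l2_normalize(image_feat)
z_t = l2_normalize(text_feat)
z_o = l2_normalize(other_feat)

sim_same = dot(z_x, z_t)   # cosine similarity of parallel vectors
sim_orth = dot(z_x, z_o)   # cosine similarity of orthogonal vectors
print(sim_same, sim_orth)
```

After normalization the magnitude of the raw features no longer matters: the parallel pair scores 1 (up to floating-point rounding) and the orthogonal pair scores 0.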

This dual-encoder separation is highly computationally efficient at inference time. Text embeddings for a million concepts can be pre-computed and cached offline, allowing new images to be matched against them via a fast matrix multiplication.
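The caching pattern can be sketched as follows. The function `encode_text_stub` is a hypothetical stand-in for the text encoder (a fixed lookup table with invented values), used only so that the example runs without a trained model.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def encode_text_stub(concept):
    """Hypothetical stand-in for g_phi: a fixed lookup table of made-up features."""
    table = {
        "dog": [1.0, 0.1, 0.0],
        "cat": [0.1, 1.0, 0.0],
        "car": [0.0, 0.1, 1.0],
    }
    return l2_normalize(table[concept])

# Offline: encode every concept once and cache the result (a K x d matrix as rows).
concepts = ["dog", "cat", "car"]
text_bank = [encode_text_stub(c) for c in concepts]

def rank_concepts(image_embedding, bank, names):
    """Online: one matrix-vector product, then sort by similarity (descending)."""
    scores = [sum(a * b for a, b in zip(image_embedding, row)) for row in bank]
    order = sorted(range(len(names)), key=lambda k: scores[k], reverse=True)
    return [(names[k], scores[k]) for k in order]

# A new image embedding arrives; the text encoder is not called again.
z_x = l2_normalize([0.2, 0.9, 0.1])
ranking = rank_concepts(z_x, text_bank, concepts)
print(ranking)
```

With a real model the bank would hold one row per concept and the scoring step would be a single batched matrix multiplication; the control flow stays the same.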

The Symmetric InfoNCE Objective

The training procedure requires aligning these modalities so that a picture of a dog and the phrase "a photo of a dog" point to the exact same geometric region on the hypersphere.

Given a batch of $N$ image-text pairs $\{(x_i, t_i)\}_{i=1}^N$ , we pass all images through the vision encoder and all texts through the text encoder to obtain $N$ image embeddings and $N$ text embeddings.

We then compute an $N \times N$ similarity matrix $S$ , where each entry $S_{i,j}$ represents the scaled cosine similarity between the $i$ -th image and the $j$ -th text:

$$S_{i,j} = \frac{\hat{z}_{x_i} \cdot \hat{z}_{t_j}}{\tau}$$

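Here, $\tau$ is a learnable temperature parameter. In Radford et al. (2021) it is initialized to 0.07 and clipped so that the logit scale $1/\tau$ never exceeds 100 (i.e., $\tau \geq 0.01$ ). Dividing by a small temperature (e.g., $\tau=0.01$ ) multiplies every cosine similarity by a large factor, which exaggerates differences between similarity scores and makes the softmax distribution far more peaked.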
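To make the effect of the temperature concrete, the sketch below applies a softmax to the same four cosine similarities at several values of $\tau$ . The similarity values are invented for illustration.

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities of one image against four captions.
sims = [0.30, 0.25, 0.20, 0.10]

for tau in (1.0, 0.07, 0.01):
    probs = softmax([s / tau for s in sims])
    print(f"tau={tau}: " + ", ".join(f"{p:.3f}" for p in probs))

p_warm = softmax([s / 1.0 for s in sims])    # tau = 1: raw cosine similarities
p_cold = softmax([s / 0.01 for s in sims])   # tau = 0.01: logits scaled by 100
```

At $\tau = 1$ the distribution is close to uniform even though the first caption has the highest similarity; at $\tau = 0.01$ the same similarities put more than 99% of the probability mass on it.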

The model must learn to maximize the similarities along the diagonal (where image $x_i$ matches text $t_i$ ) and minimize the off-diagonal similarities (where image $x_i$ is mismatched with text $t_j$ ). To prevent one modality from dominating the gradient updates, CLIP uses a Symmetric InfoNCE loss:

$$\mathcal{L}_\text{CLIP} = \frac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right)$$

The Image-to-Text loss applies softmax across the rows, forcing the image to identify its correct caption among $N-1$ distractors:

$$\mathcal{L}_{I \to T} = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(S_{i,i})}{\sum_{j=1}^N \exp(S_{i,j})}$$

The Text-to-Image loss applies softmax across the columns, forcing the text to identify its correct image among $N-1$ distractors:

$$\mathcal{L}_{T \to I} = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(S_{i,i})}{\sum_{j=1}^N \exp(S_{j,i})}$$
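The two directional losses can be sketched in a few lines of plain Python. This is a toy, assuming embeddings that are already unit-normalized and a fixed rather than learned $\tau$ ; the function name `clip_loss` is illustrative and not taken from any released implementation.

```python
import math

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def clip_loss(image_emb, text_emb, tau):
    """Symmetric InfoNCE over unit-norm image/text embeddings (lists of rows)."""
    n = len(image_emb)
    # S[i][j] = cosine similarity of image i and text j, divided by tau.
    S = [[sum(a * b for a, b in zip(zi, zt)) / tau for zt in text_emb]
         for zi in image_emb]
    # Image-to-text: softmax over each row; the target is the diagonal entry.
    loss_i2t = -sum(log_softmax(S[i])[i] for i in range(n)) / n
    # Text-to-image: softmax over each column (the rows of the transpose).
    S_T = [list(col) for col in zip(*S)]
    loss_t2i = -sum(log_softmax(S_T[i])[i] for i in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch with N = 3 and d = 3, using orthonormal embeddings.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = clip_loss(eye, eye, tau=0.07)                         # matches on the diagonal
shuffled = clip_loss(eye, [eye[1], eye[2], eye[0]], tau=0.07)   # every pair mismatched
collapsed = clip_loss([eye[0]] * 3, [eye[0]] * 3, tau=0.07)     # all embeddings identical
print(aligned, shuffled, collapsed)
```

Three regimes are visible: a perfectly aligned batch gives a loss near zero, a collapsed encoder that maps everything to one point gives exactly $\ln N$ (here $\ln 3$ ), and a confidently wrong assignment is penalized by roughly $1/\tau$ .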

This loss is equivalent to an $N$ -way classification task solved once per image and once per text. Each example sees only $N-1$ negatives, so a larger batch makes it more likely that some of them are "hard negatives" (distractor texts that resemble the true text but are actually wrong). OpenAI trained CLIP with a very large batch size of $N = 32,768$ , which requires computing the similarity matrix from embeddings gathered across hundreds of GPUs.
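A short back-of-the-envelope calculation shows what this batch size implies. The only input is the batch size quoted above; the variable names are illustrative.

```python
import math

N = 32_768                       # batch size reported for CLIP

chance_loss = math.log(N)        # loss when every softmax row is uniform
num_logits = N * N               # entries in the N x N similarity matrix
num_negatives = N * (N - 1)      # off-diagonal (mismatched) pairs in that matrix

print(f"chance-level loss  : {chance_loss:.3f} nats")
print(f"similarity entries : {num_logits:,}")
print(f"negative pairs     : {num_negatives:,}")
```

An untrained model that spreads probability uniformly starts at about 10.4 nats per direction, and every step scores a little over a billion image-text pairs, almost all of them negatives.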

Web-scale data and the open-source reproduction (LAION)

The original CLIP was trained on a proprietary dataset of 400 million image-text pairs (WIT, for WebImageText), collected from a variety of publicly available sources on the internet. This data is profoundly noisy: captions are often misspelled, multilingual, or only tangentially related to the image (e.g., a photo of a dog captioned "missing my best friend").

Conventional supervised-learning wisdom held that this much label noise would badly degrade the network. CLIP's results supported the opposite view, that scale can outweigh cleanliness: with a large enough batch size and 400M pairs, much of the uncorrelated noise averages out, leaving a usable semantic training signal.

Because OpenAI did not release the training data, the open-source community built substitutes: LAION-400M and later LAION-5B (Schuhmann et al., 2022). To build these datasets, researchers parsed Common Crawl for HTML image tags with alt-text, yielding billions of candidate image-text pairs. They then used the frozen weights of the original OpenAI CLIP to score every candidate pair. If the cosine similarity $\hat{z}_x \cdot \hat{z}_t$ fell below a certain threshold (e.g., 0.28), the pair was discarded.
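The filtering rule itself is simple to sketch. The embeddings below are invented 3-dimensional vectors; in a real pipeline they would come from a frozen CLIP model, and `filter_pairs` is a hypothetical helper rather than part of any LAION tooling.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filter_pairs(pairs, threshold=0.28):
    """Keep pairs whose image/text cosine similarity is at least `threshold`.

    `pairs` holds (identifier, image_embedding, text_embedding) tuples.
    """
    return [ident for ident, z_img, z_txt in pairs if cosine(z_img, z_txt) >= threshold]

# Hypothetical candidates with made-up embeddings.
candidates = [
    ("img_001", [0.9, 0.1, 0.0], [0.8, 0.3, 0.1]),   # caption matches the image
    ("img_002", [0.9, 0.1, 0.0], [0.0, 0.2, 0.9]),   # caption unrelated to the image
    ("img_003", [0.1, 0.9, 0.2], [0.2, 0.7, 0.6]),   # loosely related caption
]
kept = filter_pairs(candidates)
print(kept)
```

Everything the scoring model rates below the threshold disappears from the dataset, whether the pair was truly mismatched or the scoring model simply failed to recognize the match.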

This created a feedback loop: an early CLIP model was used to curate a massive, high-quality open dataset, which was then used to train the OpenCLIP variants (Ilharco et al., 2021) that ultimately surpassed the original OpenAI model.

Zero-shot classification as nearest-neighbor search

The most profound downstream capability of the joint embedding space is zero-shot classification: predicting the class of an image from a set of categories the model was never explicitly trained to recognize.

Instead of outputting probabilities over a fixed 1,000-class classifier head, the classification process is recast as a similarity search:

Construct Prompts: For a new dataset with $K$ classes (e.g., dog, cat, car), wrap each class label in a natural language template to create $K$ text strings $t_k$ : "a photo of a {class}."
Encode Text: Pass all $K$ strings through the text encoder to yield $K$ frozen reference vectors $\hat{z}_{t_k}$ .
Encode Image: Pass the target image $x$ through the image encoder to yield $\hat{z}_x$ .
Predict: Compute the cosine similarity between the image and all $K$ text vectors. The predicted class is simply the $\arg\max$ :

$$\hat{k} = \arg\max_k (\hat{z}_x \cdot \hat{z}_{t_k})$$
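The four steps can be sketched end to end. Both encoders are replaced by hypothetical stand-ins (a lookup table of invented text features and a hand-written image vector), so only the control flow reflects the procedure above.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

# Hypothetical stand-in for the text encoder: prompt -> made-up feature vector.
FAKE_TEXT_FEATURES = {
    "a photo of a dog.": [0.9, 0.2, 0.1],
    "a photo of a cat.": [0.2, 0.9, 0.1],
    "a photo of a car.": [0.1, 0.1, 0.9],
}

def encode_text(prompt):
    return l2_normalize(FAKE_TEXT_FEATURES[prompt])

def zero_shot_classify(z_x, class_names, template="a photo of a {}."):
    prompts = [template.format(name) for name in class_names]      # construct prompts
    text_vectors = [encode_text(p) for p in prompts]               # encode text
    sims = [sum(a * b for a, b in zip(z_x, z_t)) for z_t in text_vectors]
    best = max(range(len(class_names)), key=lambda k: sims[k])     # argmax over classes
    return class_names[best], sims

# Pretend this vector came from the image encoder (assumed values).
z_x = l2_normalize([0.15, 0.8, 0.2])
label, sims = zero_shot_classify(z_x, ["dog", "cat", "car"])
print(label, [round(s, 3) for s in sims])
```

Changing the label set only changes the list of class names passed in; no weights are updated, which is what makes the procedure zero-shot.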

The Importance of Prompt Engineering

A raw label like "boxer" is highly ambiguous in natural language—it could refer to a dog breed, a human athlete, or someone packing cardboard. By querying the model with "a photo of a boxer, a type of dog," the text encoder generates an embedding that disambiguates the context, heavily shifting the vector away from sports and towards canines.

Furthermore, researchers use prompt ensembling: for ImageNet, Radford et al. (2021) query the text encoder with 80 different templates ("a close-up photo of a {class}", "a blurry photo of a {class}", "a sketch of a {class}"), average the resulting text embeddings, and use the averaged vector as the class reference. Averaging makes the class vector less sensitive to the wording or visual style implied by any single template, which helps when test images differ from typical photos (like sketches or night-time photos). The paper reports that ensembling improves ImageNet zero-shot accuracy by about 3.5 percentage points over a single default prompt.
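A minimal sketch of the averaging step, assuming the per-template embeddings are already available. The three vectors are invented, and renormalizing the mean back onto the unit sphere is a choice made here so that the dot product remains a cosine similarity.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def ensemble_class_embedding(prompt_embeddings):
    """Average unit-norm prompt embeddings, then renormalize to unit length."""
    d = len(prompt_embeddings[0])
    count = len(prompt_embeddings)
    mean = [sum(e[i] for e in prompt_embeddings) / count for i in range(d)]
    return l2_normalize(mean)

# Made-up embeddings for three templates of the same class.
per_prompt = [
    l2_normalize([0.9, 0.3, 0.0]),    # "a close-up photo of a {class}"
    l2_normalize([0.8, 0.0, 0.4]),    # "a blurry photo of a {class}"
    l2_normalize([0.7, -0.2, 0.3]),   # "a sketch of a {class}"
]
class_vector = ensemble_class_embedding(per_prompt)
norm = math.sqrt(sum(c * c for c in class_vector))
print(class_vector, norm)
```

Because the ensemble collapses to a single vector per class, it adds no cost at inference time: the averaged vectors are cached exactly like single-prompt embeddings.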

Structural Limitations of CLIP

While CLIP excels at global semantic matching, its dual-encoder design and contrastive objective enforce mathematical constraints that dictate its failure modes:

The "Bag of Concepts" Failure: CLIP often behaves like a "bag of words" and "bag of patches." If you prompt it with "a red cube and a blue sphere," its text embedding is often nearly identical to that of "a blue cube and a red sphere." Within a web-scale batch, the negatives rarely differ from the positive only in which attribute belongs to which object, so the InfoNCE loss puts little pressure on the model to encode such bindings, and CLIP frequently fails to bind specific attributes to specific objects.
Spatial Blindness (No Grounding): CLIP pools its entire visual feature map into a single $1 \times d$ vector to match the text. Therefore, it mathematically destroys all $(x, y)$ coordinate information. You cannot ask a standard CLIP model "where is the dog?" because it has no mechanism to map the word "dog" back to specific image patches.
Lack of Generative Capacity: CLIP can only measure the distance between an image and a provided text string. Because it lacks a causal language decoder head, it is impossible for CLIP to look at an image and generate a caption from scratch.
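The binding failure is easiest to see in a deliberately extreme toy model. The encoder below is not CLIP's text encoder: it is a pure bag-of-words model that averages random word vectors, chosen here only to show what perfect order-invariance does to the two captions from the first limitation.

```python
import random

random.seed(0)
VOCAB = ["a", "red", "blue", "cube", "sphere", "and"]
# Random made-up word vectors; a real encoder would learn these.
WORD_VEC = {w: [random.gauss(0, 1) for _ in range(8)] for w in VOCAB}

def bag_of_words_embed(caption):
    """Order-invariant pooling: average the word vectors, ignoring position."""
    words = caption.split()
    return [sum(WORD_VEC[w][i] for w in words) / len(words) for i in range(8)]

e1 = bag_of_words_embed("a red cube and a blue sphere")
e2 = bag_of_words_embed("a blue cube and a red sphere")
max_gap = max(abs(a - b) for a, b in zip(e1, e2))
print(max_gap)
```

The two captions contain the same words, so this encoder cannot separate them at all. CLIP's transformer is order-sensitive in principle, but to the extent that its embedding behaves like this pooled summary, swapped attributes become hard to distinguish.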

These exact limitations define the architecture of the next generation of VLMs (like BLIP and LLaVA), which discard the strict dual-encoder separation in favor of cross-attention and autoregressive text generation.

Key takeaways

CLIP mathematically bridges vision and language by projecting both modalities into a shared unit hypersphere. The model is trained using a Symmetric InfoNCE loss over massive batches of noisy, web-scraped image-text pairs, treating alignment as an $N$ -way classification task. This joint space enables zero-shot classification (recasting classification as a nearest-neighbor search against text embeddings) and cross-modal retrieval. While the global contrastive objective creates incredibly robust semantic representations, it fundamentally destroys spatial localization and relational binding, necessitating the development of more complex, generative Vision-Language architectures.

Conceptual questions

InfoNCE Gradient Analysis: The symmetric InfoNCE loss forces an image to match its text among $N-1$ distractors. Mathematically analyze the derivative of the cross-entropy loss with respect to the similarity scores $S_{i,j}$ . Explain how the gradient magnitude scales based on the "hardness" of the negative examples. Why would a batch consisting of 32,768 images of visually identical golden retrievers provide a stronger learning signal than a batch of 32,768 completely unrelated objects?
Temperature Dynamics: The temperature parameter $\tau$ scales the logits $S_{i,j} = (\hat{z}_x \cdot \hat{z}_t)/\tau$ . During the early epochs of CLIP training, the learned $\tau$ value typically drops dramatically (e.g., from 0.07 to 0.01). What mathematical effect does a smaller $\tau$ have on the softmax distribution? Why is this sharpening behavior critical for the network to continue learning when the dot products between normalized vectors are bounded strictly between $[-1, 1]$ ?
The LAION Curation Paradox: OpenCLIP models are trained on datasets like LAION-5B, which were curated by throwing away image-text pairs that scored poorly on the original OpenAI CLIP model. Describe the mathematical and semantic risks of this "recursive curation." If the original CLIP model was systematically blind to text printed on physical signs (OCR), what happens to the representation of text-in-images in the LAION dataset, and how does this impact the OpenCLIP model trained on it?
SigLIP vs. InfoNCE: CLIP's softmax denominator $\sum_{j=1}^N \exp(S_{i,j})$ couples every example in the batch: each row of the loss needs the similarities to all $N$ texts, so embeddings must be gathered across devices before the loss can be computed. SigLIP replaces this with a pairwise sigmoid loss, treating every pair independently: $\mathcal{L}_\text{SigLIP} = -\sum_{i,j} [y_{ij} \log \sigma(S_{ij} - b) + \dots]$ . Show how this change removes the dependency on batch-wide normalization, allowing the loss to be computed in independent chunks across smaller GPU clusters.
Architectural Debugging (Attribute Binding): You are tasked with using a frozen CLIP ViT-L/14 model to sort physical robotics components on an assembly line. The objects are "a metallic gear," "a plastic gear," "a metallic bracket," and "a plastic bracket." During testing, the model consistently confuses the metallic gear with the metallic bracket. Using your knowledge of how CLIP aggregates visual tokens and text tokens into single global vectors, diagnose exactly why this "attribute-noun binding" failure occurs, and propose a downstream architectural fix that doesn't involve retraining CLIP.

Solutions

InfoNCE gradient hardness. The gradient w.r.t. each logit is proportional to (softmax probability − target), so hard negatives — those with high similarity — receive the largest probability mass and therefore the largest repulsive gradient. A batch of near-identical golden retrievers makes every negative a hard negative (high cosine similarity), forcing fine-grained discrimination; a batch of unrelated objects gives easy negatives (similarity near zero) that contribute almost no gradient.
Temperature dynamics. A smaller $\tau$ sharpens the softmax by amplifying logit differences. Because normalized dot products are bounded in $[-1,1]$ , the raw logit range is too compressed for cross-entropy to produce confident targets or large gradients; dividing by a small $\tau$ rescales those bounded similarities into a usable dynamic range, which is why the learned temperature shrinks early in training.
LAION curation paradox. Filtering pairs by the original CLIP keeps only what CLIP already scores as aligned, so any concept CLIP is blind to — e.g. text rendered in images (OCR) — is systematically discarded. The dataset under-represents text-in-images, and OpenCLIP trained on it inherits and reinforces the same OCR blindness: the bias is self-amplifying.
SigLIP. The sigmoid loss scores each pair independently as a binary cross-entropy on $\sigma(S_{ij}-b)$ , with no softmax denominator summing over all $N$ embeddings. Removing batch-wide normalization makes the loss decompose over pairs, so it can be sharded or accumulated across smaller/fragmented GPUs without holding all embeddings in memory at once.
Attribute binding. CLIP pools all tokens into a single global vector, collapsing the scene into a bag of concepts — "metallic" and "gear" lose their binding, so two metallic objects share the dominant "metallic" feature and get confused. A fix without retraining: isolate each object first (detector crops or region proposals) and classify each crop separately, rather than relying on one global vector for a compositional scene.
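The gradient claim in the first solution can be checked numerically. The sketch below compares the analytic gradient (softmax probability minus one-hot target) with a central finite difference on one row of scaled similarities; the three logit values are invented, with one hard and one easy negative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def row_loss(logits, target):
    """Cross-entropy of one row of S against the index of the matching caption."""
    return -math.log(softmax(logits)[target])

# Index 0 is the positive, index 1 a hard negative, index 2 an easy negative.
row = [5.0, 4.5, -3.0]
target = 0

# Analytic gradient: softmax probability minus the one-hot target.
analytic = [p - (1.0 if j == target else 0.0) for j, p in enumerate(softmax(row))]

# Numerical gradient via central differences, one logit at a time.
eps = 1e-6
numeric = []
for j in range(len(row)):
    up = row[:j] + [row[j] + eps] + row[j + 1:]
    down = row[:j] + [row[j] - eps] + row[j + 1:]
    numeric.append((row_loss(up, target) - row_loss(down, target)) / (2 * eps))

print([round(g, 4) for g in analytic])
print([round(g, 4) for g in numeric])
```

The hard negative receives a gradient of roughly 0.38 while the easy negative receives about 0.0002, so nearly all of the repulsive signal in this row comes from the negative that is already close to the positive.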

Looking ahead

CLIP establishes a mathematically elegant joint embedding space but remains completely unable to generate language or localize it to specific image regions. The next lecture examines the architectural components required to move beyond perception into generation.

Week 4: Beyond CLIP — Captioning and Grounding. We introduce encoder-decoder architectures for image captioning, derive the role of cross-attention in binding specific words to spatial image features, and examine grounding tasks that reconnect language to precise $(x, y)$ coordinate bounding boxes.
