Purpose of this lecture
Modern Vision–Language Models (VLMs) almost never train their vision encoders from scratch using multimodal datasets. Because multimodal data (image-text pairs) is inherently noisy and expensive to curate, VLMs instead rely on vision encoders that have been pretrained on massive, pure-image corpora using self-supervised learning.
Self-supervised objectives construct learning signals directly from the data without requiring human annotations. The quality, geometry, and character of these pretrained representations shape what a downstream VLM can perceive: how precisely it resolves object boundaries, how robustly it identifies semantic categories, and whether it collapses spatial geometry. This lecture examines the three dominant paradigms of self-supervised visual pretraining: Contrastive Learning, Masked Autoencoding (MAE), and Self-Distillation (DINO). It analyzes how their different mathematical objectives produce representations with systematically different properties.
The imperative of self-supervised pretraining
Historically, vision backbones were pretrained using supervised learning on datasets like ImageNet-1K (1.28 million labeled images across 1,000 classes). While this produced models that generalized reasonably well, supervised pretraining suffers from fundamental limitations when scaling toward foundation models:
- Annotation Bottleneck: Curating high-quality, dense annotations scales linearly in human effort. Web-scale image data (e.g., billions of images) is abundant, but labeling it is economically prohibitive.
- Semantic Bottleneck: Supervised objectives force the network to compress its representation into a fixed, mutually exclusive set of classes (a one-hot vector). A model trained to classify "dog" vs. "cat" learns to aggressively discard background information, fine-grained object attributes, and relative spatial positioning, as these are "distractors" to the classification objective.
Self-supervised learning bypasses both bottlenecks by converting unlabeled abundance into a continuous, high-dimensional training signal. By constructing proxy tasks (pretext tasks) that require the model to understand image structure to solve them, the model learns general-purpose visual representations that transfer broadly to downstream tasks. The mathematical design of the proxy task determines what invariances the representation learns, what it suppresses, and how well it grounds multimodal reasoning.
Contrastive representation learning (SimCLR & MoCo)
Contrastive learning forces a visual encoder to produce similar high-dimensional vectors for different augmented views of the same image (positive pairs) and dissimilar vectors for views of different images (negative pairs).
The InfoNCE Objective
The canonical objective function is the InfoNCE (Information Noise-Contrastive Estimation) loss. Given a batch of $N$ images, two different data augmentations (e.g., random crop, color jitter) are applied to each, producing $2N$ total views. For a specific image $x$, let its two augmented views be $\tilde{x}_i$ and $\tilde{x}_j$. Passing these through an encoder network $f(\cdot)$ and a projection head $g(\cdot)$ yields normalized representation vectors $z_i$ and $z_j$.
The InfoNCE loss for a single positive pair $(i, j)$ is defined as:
$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1,\, k \neq i}^{2N} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$
where $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)$ is the cosine similarity, $\tau$ is a temperature hyperparameter that scales the logits, and the denominator sums the exponentiated similarities over all other views in the batch (all of which, apart from the positive $z_j$, act as negative examples).
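This is structurally a cross-entropy loss that treats the matching view as the single "correct class" among all other views in the batch. Minimizing this loss pushes the cosine similarity of positive pairs toward 1 and pushes the similarity of negative pairs down relative to it. The loss rewards a large gap between the two rather than a specific target value for the negatives.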
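The loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `l2_normalize` and `info_nce`, the toy two-dimensional embeddings, and the temperature value of 0.5 are all assumptions chosen for readability.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(z, i, j, tau=0.5):
    """InfoNCE loss for anchor view i whose positive is view j.

    z   : list of embedding vectors, one per augmented view (2N in total)
    tau : temperature (0.5 is an arbitrary illustrative value)
    """
    z = [l2_normalize(v) for v in z]
    # Temperature-scaled cosine similarity between the anchor and every view.
    sims = [sum(a * b for a, b in zip(z[i], zk)) / tau for zk in z]
    # Denominator: every view except the anchor itself (positive + negatives).
    logits = [s for k, s in enumerate(sims) if k != i]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_denominator - sims[j]

# Toy batch: N = 2 images, 2 views each; views (0, 1) and (2, 3) are positives.
views = [[1.0, 0.1], [0.9, 0.2], [-0.2, 1.0], [-0.1, 0.9]]
aligned = info_nce(views, 0, 1)      # anchor paired with its true positive
mismatched = info_nce(views, 0, 2)   # anchor paired with a view of another image
```

On this toy batch the correctly paired loss (`aligned`) comes out smaller than the mismatched one, which is the behavior the cross-entropy reading of the loss predicts.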
Augmentations define the invariances
The choice of data augmentations defines what the network learns to ignore. SimCLR (Chen et al., 2020) relies heavily on random cropping, aggressive color jitter, grayscale conversion, and Gaussian blur. Because the model is trained to produce nearly the same vector regardless of whether the image is zoomed in on a dog's ear or zoomed out showing the whole dog, and regardless of lighting conditions, the network learns scale and color invariance.
This invariance is a double-edged sword: it forces the representation to encode robust, global semantic identity (highly useful for classifying an object), but it suppresses spatial precision and fine-grained color information (highly detrimental for a robotics VLM trying to locate a red block).
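To make the link between augmentations and invariances concrete, here is a toy sketch of how a positive pair might be generated. It assumes an image stored as nested lists of RGB tuples in [0, 1], and it simplifies the real pipeline: the crop has a fixed size instead of being resized, and color jitter is reduced to a single brightness factor. The helper names and default strengths are hypothetical.

```python
import random

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random location."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def color_jitter(image, strength, rng):
    """Scale all channels by one random brightness factor, clamped to [0, 1]."""
    factor = 1.0 + rng.uniform(-strength, strength)
    return [[tuple(min(1.0, max(0.0, c * factor)) for c in px) for px in row]
            for row in image]

def to_grayscale(image):
    """Replace each pixel with its channel mean, repeated over three channels."""
    return [[(sum(px) / 3.0,) * 3 for px in row] for row in image]

def make_view(image, rng, crop=4, jitter=0.4, p_gray=0.2):
    """One stochastic view: crop, then jitter, then occasionally drop color."""
    view = random_crop(image, crop, crop, rng)
    view = color_jitter(view, jitter, rng)
    if rng.random() < p_gray:
        view = to_grayscale(view)
    return view

rng = random.Random(0)
# Synthetic 8x8 gradient image standing in for a real photo.
image = [[(r / 7.0, c / 7.0, 0.5) for c in range(8)] for r in range(8)]
view_a, view_b = make_view(image, rng), make_view(image, rng)  # a positive pair
```

Whatever differs between `view_a` and `view_b` (crop location, brightness, presence of color) is exactly what the contrastive objective trains the encoder to disregard.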
Preventing Collapse and the MoCo Queue
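Contrastive learning requires explicit negative pairs to prevent a degenerate solution known as representation collapse, where the encoder simply maps every image to the same constant vector $c$. Collapse makes every positive-pair similarity equal to 1, so an objective that only pulled positives together would be trivially minimized. The negatives in the InfoNCE denominator penalize this solution.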
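Because the quality of negative mining improves with batch size, SimCLR requires massive batches (thousands of images per step; the SimCLR paper reports batch sizes up to 8192) to expose the model to enough hard negatives. MoCo (Momentum Contrast; He et al., 2020) solves this memory bottleneck by decoupling the batch size from the number of negatives. It maintains a separate "momentum encoder" whose weights $\theta_k$ are an exponential moving average (EMA) of the active encoder's weights $\theta_q$:
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$$
where $m \in [0, 1)$ is a momentum coefficient close to 1 (MoCo uses $m = 0.999$ by default).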
MoCo encodes a massive queue (e.g., 65,536) of negative examples using the slow-moving momentum encoder. This provides a rich, consistent dictionary of negatives without requiring them to be housed in the active GPU computational graph, drastically reducing VRAM requirements.
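The two mechanisms described here, the EMA weight update and the fixed-size queue of negatives, can be sketched as follows. This is an illustrative toy in plain Python: parameters are flat lists of floats, `momentum_update` and `NegativeQueue` are hypothetical names, and the momentum of 0.9 and queue size of 4 are chosen only so the effect is visible in a few steps.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """EMA update: key <- m * key + (1 - m) * query, element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class NegativeQueue:
    """Fixed-size FIFO of key embeddings that serve as negatives."""

    def __init__(self, max_size):
        self.keys = deque(maxlen=max_size)  # oldest keys fall off automatically

    def enqueue(self, batch_keys):
        # Keys are stored as plain values, so no gradient flows through them.
        self.keys.extend(batch_keys)

    def negatives(self):
        return list(self.keys)

# The key (momentum) encoder drifts slowly toward the query encoder.
query_params = [1.0, -2.0]
key_params = [0.0, 0.0]
for _ in range(3):
    key_params = momentum_update(key_params, query_params, m=0.9)

# The queue keeps only the most recent keys, independent of batch size.
queue = NegativeQueue(max_size=4)
queue.enqueue([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
queue.enqueue([[0.7, 0.8], [0.9, 1.0]])  # pushes out the oldest key
```

Because the queue stores detached values, its length can grow far beyond the batch size while the part of the computation that needs gradients stays the size of one batch.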
Masked autoencoders (MAE)
Masked Autoencoders (MAE; He et al., 2022) discard contrastive similarity entirely in favor of a generative approach: mask a massive fraction of an image and force the network to reconstruct the missing pixels from the remaining visible context.
Architecture and Objective
An image is divided into a grid of non-overlapping patches. A highly aggressive random masking ratio (typically 75%) is applied. Only the 25% visible patches are passed into a deep Vision Transformer (ViT) encoder, producing a set of dense latent features.
A separate, lightweight decoder network then takes these encoded visible features, inserts learned, generic [MASK] tokens into the empty spatial positions, adds 2D positional embeddings, and attempts to reconstruct the raw pixel values of the original image.
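The masking and decoder-input steps can be sketched without any model at all. The example assumes a 224x224 image and 16x16 patches (a common ViT configuration, not stated in the text above), and the function names and the string placeholders for tokens are hypothetical.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Return (visible_ids, masked_ids) for one image, both sorted."""
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    num_visible = int(num_patches * (1.0 - mask_ratio))
    return sorted(ids[:num_visible]), sorted(ids[num_visible:])

def build_decoder_input(encoded_visible, visible_ids, num_patches, mask_token):
    """Put encoded visible tokens back in their grid slots; fill the rest with mask_token."""
    slots = [mask_token] * num_patches
    for idx, token in zip(visible_ids, encoded_visible):
        slots[idx] = token
    return slots

# A 224x224 image cut into 16x16 patches gives a 14x14 grid (assumed sizes).
num_patches = (224 // 16) ** 2
visible_ids, masked_ids = random_masking(num_patches)

# Stand-in for the encoder: one placeholder token per visible patch.
encoded_visible = ["enc"] * len(visible_ids)
decoder_input = build_decoder_input(encoded_visible, visible_ids, num_patches, "[MASK]")
```

With these sizes the encoder sees 49 of the 196 patches, while the decoder receives the full 196-slot sequence, 147 of which are mask tokens.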
The training objective is the simple Mean Squared Error (MSE) computed only over the set of masked patches $\mathcal{M}$:
$$\mathcal{L}_{\text{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \lVert x_i - \hat{x}_i \rVert_2^2$$
where $x_i$ represents the normalized ground-truth pixel values of patch $i$, and $\hat{x}_i$ is the decoder's pixel-space prediction. After pretraining, the decoder is completely discarded; only the encoder is retained for downstream VLM use.
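A minimal sketch of a reconstruction loss restricted to masked patches is shown below. Each patch is assumed to be a flat list of pixel values, and the per-patch error is the squared L2 distance; implementations often also average over the pixels within a patch, which changes the loss only by a constant factor. The function name and toy values are hypothetical.

```python
def masked_mse(target_patches, predicted_patches, masked_ids):
    """Squared L2 error per patch, averaged over the masked patches only."""
    total = 0.0
    for i in masked_ids:
        target, pred = target_patches[i], predicted_patches[i]
        total += sum((t - p) ** 2 for t, p in zip(target, pred))
    return total / len(masked_ids)

# Four toy patches of two pixels each; patches 1 and 3 were masked.
targets = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
predictions = [[9.0, 9.0], [1.0, 2.0], [2.0, 2.0], [1.0, 3.0]]
loss = masked_mse(targets, predictions, masked_ids=[1, 3])
```

The large error on patch 0 does not contribute because that patch was visible to the encoder; only the two masked patches are scored.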
Why 75% Masking?
Images possess immense spatial redundancy; adjacent pixels share colors and gradients. If only 15% of an image is masked (as is common in NLP with BERT), the network can cheat by simply interpolating colors from directly adjacent, visible pixels without understanding what the object is.
By masking 75% of the patches, simple local interpolation stops being a viable shortcut, because most masked patches no longer have visible neighbors to copy from. To reconstruct a missing patch showing a dog's leg, the encoder must deduce that it is looking at a dog based on sparse, distant clues (like an ear and a tail), draw on internal representations of canine anatomy, and infer where the leg should appear. The extreme difficulty of the task forces the encoder to learn deep semantic structure.
Spatial Precision vs. Semantic Invariance
Unlike contrastive learning, which suppresses location information to achieve invariance, the MAE reconstruction task provides spatially precise supervision. The model must place specific textures at specific coordinates. Consequently, MAE representations preserve dense spatial geometry. This makes MAE-pretrained ViTs exceptionally powerful for downstream tasks requiring precise localization (bounding box detection, pixel-level segmentation, and robotic grasping), though their frozen features typically underperform contrastive models on pure linear-classification benchmarks.
DINO and self-distillation
DINO (Self-Distillation with No Labels; Caron et al., 2021) achieves the semantic strength of contrastive learning without requiring negative pairs, using a teacher-student distillation framework.
The DINO Objective
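A student network $g_{\theta_s}$ and a teacher network $g_{\theta_t}$ (where $\theta_t$ is an EMA of $\theta_s$) process different augmented crops of the same image. The augmentation pipeline generates global crops (large views of the image; DINO uses two) and several local crops (small, zoomed-in patches).
Only the global crops are shown to the teacher, while the student is shown all crops, including the local ones. The objective forces the student's output probability distribution (over a large feature dimension) to match the teacher's distribution for a different view of the same image:
$$\mathcal{L}_{\text{DINO}} = -\sum_{x \in \{x^g_1, x^g_2\}} \; \sum_{x' \in V,\; x' \neq x} P_t(x)^\top \log P_s(x')$$
where $V$ is the set of all crops, $P_s(x') = \mathrm{softmax}(g_{\theta_s}(x') / \tau_s)$, and $P_t(x) = \mathrm{softmax}((g_{\theta_t}(x) - c) / \tau_t)$. To prevent representation collapse without negative pairs, DINO relies on centering (subtracting a running mean $c$ from the teacher's outputs, preventing one dimension from dominating) and sharpening (using a very low temperature $\tau_t$ for the teacher, forcing confident predictions).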
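The interaction of centering, sharpening, and the cross-entropy term can be sketched for a single teacher/student view pair. This is a toy under stated assumptions: outputs are three-dimensional lists, the function names are hypothetical, and the temperatures (0.1 for the student, 0.04 for the teacher) and center momentum (0.9) are illustrative defaults, not values taken from the text above.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student distributions for one view pair."""
    # Centering: subtract the running mean. Sharpening: low teacher temperature.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    p_teacher = [math.exp(v) for v in log_softmax(centered, tau_t)]
    log_p_student = log_softmax(student_logits, tau_s)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))

def update_center(center, teacher_batch, momentum=0.9):
    """EMA of the mean teacher output over a batch."""
    batch_mean = [sum(col) / len(teacher_batch) for col in zip(*teacher_batch)]
    return [momentum * c + (1.0 - momentum) * b for c, b in zip(center, batch_mean)]

center = [0.0, 0.0, 0.0]
teacher_out = [2.0, 1.0, 0.1]
loss_agree = dino_loss([2.0, 1.0, 0.1], teacher_out, center)     # student matches
loss_disagree = dino_loss([0.1, 1.0, 2.0], teacher_out, center)  # student reversed
center = update_center(center, [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])
```

The loss is small when the student ranks the output dimensions the same way as the sharpened teacher and large when it does not. The center update then nudges the running mean toward the batch average of the teacher outputs.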
Emergent Object Segmentation
Because the student (seeing only a small local crop) must predict the exact same high-dimensional semantic cluster as the teacher (seeing the whole global crop), the network naturally learns to group pixels that belong to the same object.
This results in emergent object segmentation. If you visualize the self-attention maps of the [CLS] token in a DINO-pretrained ViT, the attention weights closely trace the contours of the foreground objects, even though the network never sees a single human-annotated segmentation mask.
DINOv2 (Oquab et al., 2024) scales this approach further, combining self-distillation with a patch-level masked image modeling objective (in the style of iBOT, and related in spirit to MAE's masking) and training on a large, curated dataset (LVD-142M). DINOv2 features are among the strongest available for dense spatial tasks, combining the semantic robustness of contrastive learning with the spatial precision of masked modeling.
Evaluating downstream transfer
The three paradigms produce fundamentally different mathematical representations. To evaluate them, researchers rely on established transfer protocols:
- Linear Probing: Freeze the pretrained encoder weights entirely. Attach a single, untrained linear layer on top of the representations and train it on a labeled dataset (e.g., ImageNet). If the linear probe achieves high accuracy, this is strong evidence that the self-supervised representations are already linearly separable according to human semantic concepts.
- K-Nearest Neighbors (k-NN): Freeze the encoder and compute the embedding for every image in a dataset. For a new test image, find the $k$ closest embeddings via cosine similarity and take a vote over their labels. High k-NN accuracy indicates that the geometric clustering of the latent space aligns with real-world categories.
- Full Fine-Tuning: Unfreeze the encoder and update all weights with a low learning rate using task-specific labels. While this yields the highest absolute performance, it risks catastrophic forgetting of the general representations learned during pretraining.
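Of the three protocols, k-NN evaluation is the simplest to sketch, since it involves no training at all. The example below assumes the frozen encoder has already produced embeddings (here, hand-written two-dimensional vectors) and uses a plain majority vote; published evaluations sometimes weight votes by similarity, which this sketch omits. All names and values are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, bank, labels, k=3):
    """Majority vote among the k bank embeddings most similar to the query."""
    ranked = sorted(range(len(bank)), key=lambda i: cosine(query, bank[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Frozen-encoder embeddings of labeled images (toy values).
bank = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
prediction = knn_predict([0.95, 0.05], bank, labels)
```

If the embedding space clusters images by category, as it does in this toy bank, the nearest neighbors of a test embedding share its label and the vote recovers it.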
| Property | Contrastive (SimCLR/MoCo) | MAE (Reconstruction) | Self-distillation (DINOv2) |
|---|---|---|---|
| Global semantics | Strong | Moderate | Strong |
| Spatial precision | Weak (invariant to crops) | Strong | Strong (emergent boundaries) |
| Linear probe accuracy | High | Moderate | High |
| Dense prediction (detection) | Moderate | High | High |
| Training efficiency | Requires massive memory queue | Highly efficient (only processes 25% of patches) | Moderate |
Role in Vision-Language Models
When designing a VLM, the choice of the frozen vision encoder permanently shapes the model's capabilities:
- LLaVA and standard Chat-VLMs primarily use CLIP-ViT encoders (which use a variant of contrastive learning mapped against text embeddings, covered in Week 3). Because contrastive representations are semantically invariant, they map very cleanly to discrete language tokens.
- Robotics VLMs (vision-language-action models, the family that includes RT-2) and other systems requiring precise 3D spatial grounding increasingly favor MAE or DINOv2 encoders. If a robot must grasp the handle of a mug, a contrastive encoder that blurred the handle's specific spatial coordinates to achieve scale invariance is likely to struggle, whereas an MAE or DINOv2 encoder preserves the spatial features required for fine-grained motor control.
Key takeaways
Self-supervised representation learning mathematically maps raw pixels into structured, semantically meaningful latent spaces without human labels. Contrastive learning (InfoNCE) pushes positive augmentations together and negative pairs apart, creating strong global semantics but destroying spatial precision due to crop invariance. Masked Autoencoders (MAE) treat images as generative puzzles, forcing the network to predict missing patches, which preserves dense spatial geometry and trains highly efficiently. Self-distillation (DINO) trains a student to match an EMA teacher's global predictions using only local crops, resulting in emergent, zero-shot object segmentation. The geometric properties of these latent spaces dictate which visual backbone is chosen when constructing modern, multimodal foundation models.
Conceptual questions
- InfoNCE Mathematics and Batch Limits: The InfoNCE loss relies on the denominator to push negative pairs apart. Mathematically, what happens to the gradient of this loss if the batch size is very small? If an engineer tells you they cannot increase their batch size due to strict 16GB VRAM limits on their GPU, explain exactly how implementing the MoCo momentum queue mathematically resolves this constraint without increasing the memory footprint of the computational graph.
- MAE Masking Ratios: You are applying the MAE architecture to a dataset of highly detailed satellite imagery for urban planning. A colleague suggests lowering the masking ratio from the standard 75% down to 15%, arguing that "satellite images have too many tiny details, so 75% masking deletes the objects entirely." Analyze this argument. What specific "shortcut" will the ViT encoder mathematically exploit if the masking ratio is dropped to 15%, and why will the resulting frozen representations perform poorly on a downstream linear probe?
- Representation Collapse in DINO: Unlike SimCLR, DINO does not use negative pairs in its loss function. If you randomly initialize a DINO network and train it without applying the "centering" operation to the teacher's outputs, the student and teacher will quickly undergo representation collapse. Describe the exact mathematical state of the feature vectors during total representation collapse. Why does centering (subtracting the moving average of the batch from the teacher's logits) prevent the softmax distribution from settling into this degenerate state?
- Inductive Biases in Transfer Learning: Your team evaluates three frozen ViT-Base encoders using a linear probe on ImageNet-1K: one trained via supervised learning, one via MoCo (Contrastive), and one via MAE. The supervised and MoCo probes achieve ~82% accuracy, while MAE achieves only ~68%. The team proposes discarding the MAE model. Frame a counter-argument explaining why linear probe accuracy on global classification is structurally biased against MAE. Propose an alternative downstream task (e.g., in medical image segmentation) where the MAE representations would fundamentally outperform the MoCo representations.
- VLM Architecture Design: You are designing a vision-language-action (VLA) model for a robotic arm that must fold laundry. The model receives an RGB image and a text prompt ("fold the left sleeve"), and must output continuous 3D coordinates for the gripper. You must choose between a CLIP-ViT-L (Contrastive) and a DINOv2-ViT-L as your frozen visual backbone. Select the superior backbone for this specific physical task, explicitly contrasting the spatial invariance of contrastive learning with the emergent segmentation properties of self-distillation.
Looking ahead
Having established how vision models learn structural representations from pure pixels, the next challenge is bridging those visual representations with natural language.
Week 3: Contrastive Vision–Language Learning (CLIP). We will derive the symmetric InfoNCE objective computed over cross-modal image-text pairs, analyze how web-scale noisy supervision enables zero-shot classification, and examine why CLIP's joint embedding space became the standard interface for multimodal AI.
Further reading
- Chen, T., et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML. (SimCLR and the InfoNCE loss).
- He, K., et al. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. CVPR. (MoCo).
- He, K., et al. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR. (MAE).
- Caron, M., et al. (2021). Emerging Properties in Self-Supervised Vision Transformers. ICCV. (DINO and emergent segmentation).
- Oquab, M., et al. (2024). DINOv2: Learning Robust Visual Features without Supervision. TMLR.