Week 2: Self-Supervised Representation Learning for Vision

Purpose of this lecture#

Modern Vision–Language Models (VLMs) rarely train their vision encoders from scratch as part of multimodal training. Image-text pairs are noisy and expensive to curate, so VLMs typically start from a vision encoder that was pretrained separately at scale, either with self-supervised objectives on image-only corpora (the subject of this lecture) or with contrastive image-text objectives such as CLIP (Week 3).

Self-supervised objectives construct learning signals directly from the data without requiring human annotations. The quality, geometry, and character of these pretrained representations largely determine what a downstream VLM can perceive: how precisely it resolves object boundaries, how robustly it identifies semantic categories, and whether it preserves or discards spatial geometry. This lecture examines the three dominant paradigms of self-supervised visual pretraining, namely Contrastive Learning, Masked Autoencoding (MAE), and Self-Distillation (DINO), and analyzes how their different objectives produce representations with systematically different properties.

The imperative of self-supervised pretraining#

Historically, vision backbones were pretrained using supervised learning on datasets like ImageNet-1K (1.28 million labeled images across 1,000 classes). While this produced models that generalized reasonably well, supervised pretraining suffers from fundamental limitations when scaling toward foundation models:

Annotation Bottleneck: Curating high-quality, dense annotations scales linearly in human effort. Web-scale image data (e.g., billions of images) is abundant, but labeling it is economically prohibitive.
Semantic Bottleneck: Supervised objectives force the network to compress its representation into a fixed, mutually exclusive set of classes (a one-hot vector). A model trained to classify "dog" vs. "cat" learns to aggressively discard background information, fine-grained object attributes, and relative spatial positioning, as these are "distractors" to the classification objective.

Self-supervised learning bypasses both bottlenecks by converting unlabeled abundance into a continuous, high-dimensional training signal. By constructing proxy tasks (pretext tasks) that require the model to understand image structure to solve them, the model learns general-purpose visual representations that transfer broadly to downstream tasks. The mathematical design of the proxy task determines what invariances the representation learns, what it suppresses, and how well it grounds multimodal reasoning.

Contrastive representation learning (SimCLR & MoCo)#

Contrastive learning forces a visual encoder to produce similar high-dimensional vectors for different augmented views of the same image (positive pairs) and dissimilar vectors for views of different images (negative pairs).

The InfoNCE Objective#

The canonical objective function is the InfoNCE (Information Noise-Contrastive Estimation) loss. Given a batch of $B$ images, two different data augmentations (e.g., random crop, color jitter) are applied to each, producing $2B$ total views. For a specific image $x_i$, let its two augmented views be $v_i$ and $v_i'$. Passing these through an encoder network $f_\theta$ and a projection head $g_\phi$ yields normalized representation vectors $z_i$ and $z_i'$.

The InfoNCE loss for a single positive pair $(z_i, z_i')$ is defined as:

$$\mathcal{L}_\text{NCE}(z_i, z_i') = -\log \frac{\exp(\text{sim}(z_i, z_i')/\tau)}{\sum_{k=1}^{2B} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}$$

where $\text{sim}(u, v) = \frac{u^T v}{\|u\| \|v\|}$ is the cosine similarity, $\tau$ is a temperature hyperparameter that scales the logits, and the denominator sums over all $2B - 1$ other views in the batch: the positive $z_i'$ plus the $2B - 2$ views of other images, which act as negative examples.

This is structurally a cross-entropy loss that treats the matching view $z_i'$ as the single "correct class" among $2B - 1$ candidates. Minimizing it pushes the cosine similarity of positive pairs toward 1 and drives the similarity of negative pairs down.
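As a rough illustration of the loss above, the following NumPy sketch averages InfoNCE over all $2B$ views of a batch. The function name `info_nce_loss`, the default `tau=0.1`, and the choice to average over both views of every pair are assumptions made for this example, not details taken from a specific codebase.

```python
import numpy as np

def info_nce_loss(z_a, z_b, tau=0.1):
    """Mean InfoNCE loss over all 2B views.

    z_a, z_b: arrays of shape (B, D); row i of each is one view of image i.
    """
    B = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)              # (2B, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit norm: dot product = cosine
    logits = (z @ z.T) / tau                            # (2B, 2B) scaled similarities
    np.fill_diagonal(logits, -np.inf)                   # the indicator [k != i]
    # Row i's positive is the other view of the same image.
    pos = np.concatenate([np.arange(B, 2 * B), np.arange(0, B)])
    # Numerically stable log of the denominator.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    log_prob_pos = logits[np.arange(2 * B), pos] - log_denom
    return -log_prob_pos.mean()

rng = np.random.default_rng(0)
B, D = 8, 16
views_a = rng.normal(size=(B, D))
views_b = views_a + 0.05 * rng.normal(size=(B, D))      # slightly perturbed second view
print("aligned views:", info_nce_loss(views_a, views_b))

collapsed = np.ones((B, D))                             # every embedding identical
print("collapsed:    ", info_nce_loss(collapsed, collapsed))
print("log(2B - 1):  ", np.log(2 * B - 1))
```

When every embedding is identical, each of the $2B - 1$ terms in the denominator equals the numerator, so the loss equals $\log(2B - 1)$ regardless of $\tau$. Well-aligned positives with dissimilar negatives score below that value.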

Augmentations define the invariances#

The choice of data augmentations $t \sim \mathcal{T}$ determines what the network learns to ignore. SimCLR (Chen et al., 2020) relies heavily on random cropping, aggressive color jitter, grayscale conversion, and Gaussian blur. Because the model is trained to produce nearly the same vector $z_i$ whether the image is zoomed in on a dog's ear or zoomed out showing the whole dog, and regardless of lighting conditions, the network learns scale and color invariance.

This invariance is a double-edged sword: it forces the representation to encode robust, global semantic identity (highly useful for classifying an object), but it also suppresses spatial precision and fine-grained color information (highly detrimental for a robotics VLM trying to locate a red block).
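To make the link between augmentations and invariances concrete, here is a minimal NumPy sketch that produces two views of one image. The helpers `random_crop`, `color_jitter`, and `two_views` are hypothetical and deliberately simplified stand-ins: SimCLR's actual pipeline also resizes the crop and applies flips, grayscale conversion, and blur.

```python
import numpy as np

def random_crop(img, out_size, rng):
    """Cut a random out_size x out_size window from an (H, W, 3) image."""
    h, w, _ = img.shape
    top = rng.integers(0, h - out_size + 1)
    left = rng.integers(0, w - out_size + 1)
    return img[top:top + out_size, left:left + out_size]

def color_jitter(img, rng, strength=0.4):
    """Rescale overall brightness and each color channel by random gains."""
    brightness = 1.0 + rng.uniform(-strength, strength)
    channel_gain = 1.0 + rng.uniform(-strength, strength, size=3)
    return np.clip(img * brightness * channel_gain, 0.0, 1.0)

def two_views(img, rng, out_size=32):
    """Two independently augmented views of the same image (a positive pair)."""
    return tuple(
        color_jitter(random_crop(img, out_size, rng), rng) for _ in range(2)
    )

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))          # placeholder image with values in [0, 1]
view_1, view_2 = two_views(image, rng)
print(view_1.shape, view_2.shape)
```

The two views generally show different regions with different color statistics. Training the encoder to embed them identically is what removes position and color information from the representation.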

Preventing Collapse and the MoCo Queue#

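Contrastive learning relies on explicit negative pairs to rule out a degenerate solution known as representation collapse, in which the encoder maps every image to the same constant vector $\vec{c}$. A positives-only objective is trivially minimized by such an encoder, since every positive similarity equals 1. Under InfoNCE the collapsed solution gains nothing: every term in the denominator equals the numerator, so the loss sits at $\log(2B - 1)$ instead of approaching 0.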

Because SimCLR draws its negatives from the current batch, both their number and the chance of encountering hard negatives grow with batch size, so SimCLR relies on very large batches (e.g., $B=4096$). MoCo (Momentum Contrast; He et al., 2020) relieves this memory bottleneck by decoupling the number of negatives from the batch size. It maintains a separate "momentum encoder" $f_{\theta_k}$ whose weights are an exponential moving average (EMA) of the active encoder $f_{\theta_q}$:

$$\theta_k \leftarrow m \theta_k + (1 - m) \theta_q$$

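Here $m$ is a momentum coefficient close to 1 (the MoCo paper uses 0.999), so the key encoder changes slowly. MoCo maintains a large queue (e.g., 65,536 entries) of keys produced by this momentum encoder on earlier batches: each step enqueues the current batch's keys and dequeues the oldest. This provides a large, consistent dictionary of negatives that are stored as plain vectors outside the active GPU computational graph, greatly reducing VRAM requirements.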
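The EMA update and the queue can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: `ema_update`, `KeyQueue`, and `moco_loss` are hypothetical names, parameters are held in a plain dictionary, the queue size is assumed to be a multiple of the batch size, and the "keys" in the demo are simulated instead of coming from a real momentum encoder.

```python
import numpy as np

def ema_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied to each parameter array."""
    return {name: m * theta_k[name] + (1.0 - m) * theta_q[name] for name in theta_k}

class KeyQueue:
    """Fixed-size FIFO of unit-norm key vectors used as negatives."""

    def __init__(self, size, dim, rng):
        keys = rng.normal(size=(size, dim))              # random initial contents
        self.keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        self.ptr = 0

    def enqueue(self, new_keys):
        # Overwrite the oldest entries (assumes size % batch_size == 0).
        n = new_keys.shape[0]
        self.keys[self.ptr:self.ptr + n] = new_keys
        self.ptr = (self.ptr + n) % self.keys.shape[0]

def moco_loss(q, k_pos, queue_keys, tau=0.07):
    """InfoNCE with one positive key per query and the queue as negatives."""
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)     # (B, 1)
    l_neg = q @ queue_keys.T                             # (B, K)
    logits = np.concatenate([l_pos, l_neg], axis=1) / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
queue = KeyQueue(size=1024, dim=32, rng=rng)
q = rng.normal(size=(8, 32))
q /= np.linalg.norm(q, axis=1, keepdims=True)
k = q + 0.05 * rng.normal(size=q.shape)                  # stand-in for momentum-encoder keys
k /= np.linalg.norm(k, axis=1, keepdims=True)
print("loss with 1024 queued negatives:", moco_loss(q, k, queue.keys))
queue.enqueue(k)                                         # current keys become future negatives
```

Only the 8 queries would need gradients in a real training step. The 1,024 queued keys are constants, which is why the number of negatives can grow without enlarging the backward pass.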

Masked autoencoders (MAE)#

Masked Autoencoders (MAE; He et al., 2022) discard contrastive similarity entirely in favor of a generative approach: mask a massive fraction of an image and force the network to reconstruct the missing pixels from the remaining visible context.

Architecture and Objective#

An image is divided into a grid of non-overlapping patches. A highly aggressive random masking ratio (typically 75%) is applied. Only the 25% visible patches are passed into a deep Vision Transformer (ViT) encoder, producing a set of dense latent features.

A separate, lightweight decoder network then takes these encoded visible features, inserts a shared, learned [MASK] token at each masked spatial position, adds 2D positional embeddings, and attempts to reconstruct the raw pixel values of the original image.

The training objective is the simple Mean Squared Error (MSE) computed only over the masked patches $M$:

$$\mathcal{L}_\text{MAE} = \frac{1}{|M|} \sum_{i \in M} \|x_i - \hat{x}_i\|^2$$

where $x_i$ represents the normalized ground-truth pixel values of patch $i$, and $\hat{x}_i$ is the decoder's pixel-space prediction. After pretraining, the decoder is completely discarded; only the encoder is retained for downstream VLM use.
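The masking step and the masked-only loss can be sketched as follows. This is an illustrative NumPy fragment, not the reference implementation: `random_masking` and `mae_loss` are hypothetical names, the patches are random placeholders, and the example assumes a 224×224 RGB image cut into 16×16 patches (196 patches of 768 values each).

```python
import numpy as np

def random_masking(num_patches, mask_ratio, rng):
    """Return sorted index arrays (visible, masked) for one image."""
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])

def mae_loss(patches, pred, masked_idx):
    """(1/|M|) * sum over masked patches of ||x_i - x_hat_i||^2.

    patches, pred: arrays of shape (num_patches, patch_dim).
    """
    diff = patches[masked_idx] - pred[masked_idx]
    return np.mean(np.sum(diff ** 2, axis=1))

rng = np.random.default_rng(0)
num_patches, patch_dim = 14 * 14, 16 * 16 * 3             # 196 patches, 768 values each
patches = rng.normal(size=(num_patches, patch_dim))       # placeholder pixel values

visible_idx, masked_idx = random_masking(num_patches, 0.75, rng)
print(len(visible_idx), "visible /", len(masked_idx), "masked")

encoder_input = patches[visible_idx]                      # only these reach the encoder
prediction = np.zeros_like(patches)                       # stand-in for the decoder output
print("loss:", mae_loss(patches, prediction, masked_idx))
```

Errors on visible patches do not contribute to the loss, and the encoder input has only 49 of the 196 tokens, which is where MAE's training efficiency comes from.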

Why 75% Masking?#

Images possess immense spatial redundancy; adjacent pixels share colors and gradients. If only 15% of an image is masked (as is common in NLP with BERT), the network can cheat by simply interpolating colors from directly adjacent, visible pixels without understanding what the object is.

By masking 75% of the patches, local interpolation stops being a viable shortcut, because most of a masked patch's neighbors are masked too. To reconstruct a missing patch showing a dog's leg, the encoder must infer that it is looking at a dog from sparse, distant clues (like an ear and a tail), draw on internal representations of canine anatomy, and predict where the leg should be. The difficulty of the task pushes the encoder to learn deeper semantic structure.

Spatial Precision vs. Semantic Invariance#

Unlike contrastive learning, which suppresses location information to achieve invariance, the MAE reconstruction task provides spatially precise supervision. The model must place specific textures at specific $(x, y)$ coordinates. Consequently, MAE representations preserve dense spatial geometry. This makes MAE-pretrained ViTs well suited to downstream tasks requiring precise localization (bounding box detection, pixel-level segmentation, and robotic grasping), though with a frozen encoder they typically trail contrastive models on linear-classification benchmarks, a gap that largely closes under full fine-tuning.

DINO and self-distillation#

DINO (Self-Distillation with No Labels; Caron et al., 2021) achieves the semantic strength of contrastive learning without requiring negative pairs, using a teacher-student distillation framework.

The DINO Objective#

A student network $f_{\theta_s}$ and a teacher network $f_{\theta_t}$ (where $\theta_t$ is an EMA of $\theta_s$) process different augmented crops of the same image. A multi-crop augmentation pipeline produces a set $V$ of views: two global crops (large views of the image) and several local crops (small, zoomed-in regions).

The teacher sees only the global crops $V_\text{global}$, while the student sees every crop in $V$, local and global. The objective trains the student's output probability distribution (over a large feature dimension) on each view to match the teacher's distribution on a different, global view, with gradients flowing only through the student:

$$\mathcal{L}_\text{DINO} = -\sum_{v \in V_\text{global}} \sum_{\substack{v' \in V \\ v' \neq v}} P_t(v) \log P_s(v')$$

where $P(v) = \text{softmax}(f(v) / \tau)$ and the product $P_t(v) \log P_s(v')$ is shorthand for a cross-entropy, summed over the output dimensions. To prevent representation collapse without negative pairs, DINO relies on centering (subtracting a running mean from the teacher's outputs, which prevents one dimension from dominating) and sharpening (using a very low temperature $\tau_t$ for the teacher, which forces confident, non-uniform predictions).
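A minimal NumPy sketch of this objective for a single image follows, matching the double sum above in which the student term ranges over every view other than the teacher's. The names `dino_loss` and `update_center`, the temperatures `tau_s=0.1` and `tau_t=0.04`, and the center momentum of 0.9 are assumptions chosen for illustration. The sketch also averages over pairs, which differs from the sum in the formula only by a constant factor.

```python
import numpy as np

def log_softmax(logits, tau):
    """Temperature-scaled log-softmax over the last axis (numerically stable)."""
    x = logits / tau
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student distributions over view pairs.

    student_out: (V, K) outputs for all V crops, global crops first.
    teacher_out: (G, K) outputs for the G global crops only.
    center:      (K,) running mean of teacher outputs.
    """
    p_t = np.exp(log_softmax(teacher_out - center, tau_t))   # centering + sharpening
    log_p_s = log_softmax(student_out, tau_s)
    total, pairs = 0.0, 0
    for g in range(teacher_out.shape[0]):
        for v in range(student_out.shape[0]):
            if v == g:                # skip pairs where both networks see the same crop
                continue
            total += -np.sum(p_t[g] * log_p_s[v])
            pairs += 1
    return total / pairs

def update_center(center, teacher_out, momentum=0.9):
    """Exponential moving average of the mean teacher output."""
    return momentum * center + (1.0 - momentum) * teacher_out.mean(axis=0)

rng = np.random.default_rng(0)
K = 64
student_out = rng.normal(size=(6, K))     # 2 global + 4 local crops
teacher_out = rng.normal(size=(2, K))     # 2 global crops
center = np.zeros(K)
print("loss:", dino_loss(student_out, teacher_out, center))
center = update_center(center, teacher_out)
```

In a real training loop the teacher outputs are treated as constants (no gradient), the student is updated by backpropagation, and the teacher weights and the center are then refreshed by their respective moving averages.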

Emergent Object Segmentation#

Because the student (seeing only a small local crop) must predict the exact same high-dimensional semantic cluster as the teacher (seeing the whole global crop), the network naturally learns to group pixels that belong to the same object.

This results in emergent object segmentation. If you visualize the self-attention maps of the [CLS] token in a DINO-pretrained ViT, the attention weights closely follow the contours of the foreground objects, even though the network never saw a single human-annotated segmentation mask.

DINOv2 (Oquab et al., 2024) scales this approach further, combining image-level self-distillation with a patch-level masked image modeling objective (in the style of iBOT, which predicts teacher features for masked patches where MAE predicts raw pixels) and training on a large, curated dataset (LVD-142M). DINOv2 features are among the strongest available for dense spatial tasks, pairing semantic robustness comparable to contrastive learning with the spatial precision associated with masked modeling.

Evaluating downstream transfer#

The three paradigms produce fundamentally different mathematical representations. To evaluate them, researchers rely on established transfer protocols:

Linear Probing: Freeze the pretrained encoder weights entirely. Attach a single, untrained linear layer on top of the representations and train it on a labeled dataset (e.g., ImageNet). If the linear probe achieves high accuracy (e.g., $> 80\%$), this indicates that the self-supervised representations are already close to linearly separable along human semantic categories.
K-Nearest Neighbors (k-NN): Freeze the encoder and compute the embedding for every image in a labeled dataset. For a new test image, find the $k$ closest embeddings via cosine similarity and take a vote over their labels. High k-NN accuracy indicates that the geometric clustering of the latent space aligns with real-world categories.
Full Fine-Tuning: Unfreeze the encoder and update all weights with a low learning rate using task-specific labels. While this yields the highest absolute performance, it risks catastrophic forgetting of the general representations learned during pretraining.
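The two frozen-encoder protocols can be sketched on precomputed embeddings. In this NumPy illustration the "embeddings" are synthetic Gaussian clusters standing in for real encoder outputs, and `knn_predict`, `fit_linear_probe`, and `make_split` are hypothetical helper names. The probe is trained with plain full-batch gradient descent, which is a simplification of how probes are usually trained.

```python
import numpy as np

def knn_predict(train_emb, train_labels, test_emb, k=5):
    """Majority vote among the k nearest training embeddings (cosine similarity)."""
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    b = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = b @ a.T                                        # (n_test, n_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]            # indices of the k most similar
    votes = train_labels[nearest]                         # (n_test, k)
    return np.array([np.bincount(row).argmax() for row in votes])

def fit_linear_probe(train_emb, train_labels, num_classes, lr=0.1, steps=200):
    """Fit one softmax layer on frozen embeddings; the encoder is never touched."""
    n, d = train_emb.shape
    W, b = np.zeros((d, num_classes)), np.zeros(num_classes)
    onehot = np.eye(num_classes)[train_labels]
    for _ in range(steps):
        logits = train_emb @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                       # gradient of mean cross-entropy
        W -= lr * train_emb.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def make_split(n, rng, dim=8):
    """Synthetic two-class embeddings: well-separated Gaussian clusters."""
    labels = rng.integers(0, 2, size=n)
    centers = np.zeros((2, dim))
    centers[0, 0], centers[1, 0] = 3.0, -3.0
    return centers[labels] + 0.5 * rng.normal(size=(n, dim)), labels

rng = np.random.default_rng(0)
train_emb, train_labels = make_split(200, rng)
test_emb, test_labels = make_split(50, rng)

knn_acc = np.mean(knn_predict(train_emb, train_labels, test_emb) == test_labels)
W, b = fit_linear_probe(train_emb, train_labels, num_classes=2)
probe_acc = np.mean(np.argmax(test_emb @ W + b, axis=1) == test_labels)
print("k-NN accuracy:", knn_acc, "linear-probe accuracy:", probe_acc)
```

Both protocols read the representation without changing it, so their accuracy reflects how the pretraining objective organized the latent space, not how well the encoder can adapt.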

| Property | Contrastive (SimCLR/MoCo) | MAE (Reconstruction) | Self-distillation (DINOv2) |
|---|---|---|---|
| Global semantics | Strong | Moderate | Strong |
| Spatial precision | Weak (invariant to crops) | Strong | Strong (emergent boundaries) |
| Linear probe accuracy | High | Moderate | High |
| Dense prediction (detection) | Moderate | High | High |
| Training efficiency | Needs very large batches (SimCLR) or a large key queue (MoCo) | Highly efficient (encoder processes only 25% of patches) | Moderate |

Role in Vision-Language Models#

When designing a VLM, the choice of the frozen vision encoder permanently shapes the model's capabilities:

LLaVA and standard Chat-VLMs primarily use CLIP-ViT encoders (trained with a variant of contrastive learning that matches images against text embeddings, covered in Week 3). Because these representations are trained to align with captions and are invariant to low-level appearance changes, they map cleanly onto discrete language tokens.
Robotics VLMs (VLAs, the model family that includes RT-2) and other systems requiring precise 3D spatial grounding increasingly favor encoders with MAE- or DINOv2-style dense features. If a robot must grasp the handle of a mug, a contrastive encoder that blurred the handle's specific spatial coordinates to achieve scale invariance is likely to struggle, whereas an MAE or DINOv2 encoder retains the patch-level spatial features needed for fine-grained motor control.

Key takeaways#

Self-supervised representation learning maps raw pixels into structured, semantically meaningful latent spaces without human labels.
Contrastive learning (InfoNCE) pulls augmented views of the same image together and pushes views of different images apart, creating strong global semantics but sacrificing spatial precision because of crop invariance.
Masked Autoencoders (MAE) treat images as generative puzzles, forcing the network to predict missing patches, which preserves dense spatial geometry and trains efficiently.
Self-distillation (DINO) trains a student to match an EMA teacher's predictions on global crops, including when the student sees only a local crop, resulting in emergent object segmentation without any segmentation labels.
The geometric properties of these latent spaces guide which visual backbone is chosen when constructing modern multimodal foundation models.

Conceptual questions#

InfoNCE Mathematics and Batch Limits: The InfoNCE loss $\mathcal{L}_\text{NCE}$ relies on the denominator $\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)$ to push negative pairs apart. Mathematically, what happens to the gradient of this loss if the batch size $B$ is very small (e.g., $B=8$)? If an engineer tells you they cannot increase their batch size due to strict 16GB VRAM limits on their GPU, explain exactly how implementing the MoCo momentum queue mathematically resolves this constraint without increasing the memory footprint of the computational graph.
MAE Masking Ratios: You are applying the MAE architecture to a dataset of highly detailed satellite imagery for urban planning. A colleague suggests lowering the masking ratio from the standard 75% down to 15%, arguing that "satellite images have too many tiny details, so 75% masking deletes the objects entirely." Analyze this argument. What specific "shortcut" will the ViT encoder mathematically exploit if the masking ratio is dropped to 15%, and why will the resulting frozen representations perform poorly on a downstream linear probe?
Representation Collapse in DINO: Unlike SimCLR, DINO does not use negative pairs in its loss function. If you randomly initialize a DINO network and train it without applying the "centering" operation to the teacher's outputs, the student and teacher will quickly undergo representation collapse. Describe the exact mathematical state of the feature vectors $z$ during total representation collapse. Why does centering (subtracting the moving average of the batch from the teacher's logits) prevent the softmax distribution from settling into this degenerate state?
Inductive Biases in Transfer Learning: Your team evaluates three frozen ViT-Base encoders using a linear probe on ImageNet-1K: one trained via supervised learning, one via MoCo (Contrastive), and one via MAE. The supervised and MoCo probes achieve ~82% accuracy, while MAE achieves only ~68%. The team proposes discarding the MAE model. Frame a counter-argument explaining why linear probe accuracy on global classification is structurally biased against MAE. Propose an alternative downstream task (e.g., in medical image segmentation) where the MAE representations would fundamentally outperform the MoCo representations.
VLM Architecture Design: You are designing a visual-language action (VLA) model for a robotic arm that must fold laundry. The model receives an RGB image and a text prompt ("fold the left sleeve"), and must output continuous 3D $(x,y,z)$ coordinates for the gripper. You must choose between a CLIP-ViT-L (Contrastive) and a DINOv2-ViT-L as your frozen visual backbone. Select the superior backbone for this specific physical task, explicitly contrasting the spatial invariance of contrastive learning with the emergent segmentation properties of self-distillation.

Solutions

Small-batch InfoNCE / MoCo. With $B=8$ the denominator has only ~7 negatives, so the partition function is a high-variance, biased estimate of the true normalization — the repulsion gradient is weak and noisy and the InfoNCE lower bound on mutual information is loose, slowing convergence. MoCo decouples the negative count from the batch: it keeps a queue of keys encoded by a momentum (EMA) encoder. Those keys carry no stored activations and are not part of the current computational graph, so you get thousands of negatives at the memory cost of storing vectors, not the backward-pass graph.
15% masking shortcut. At a low masking ratio the encoder reconstructs each missing patch by interpolating/copying from adjacent visible patches — a local texture-smoothing trick that needs no semantic modeling. The frozen features then encode local statistics rather than object semantics, so a linear probe (which reads global semantics) underperforms. The 75% ratio destroys the interpolation shortcut and forces holistic reasoning.
DINO collapse. Total collapse means every $z$ maps to the same constant vector (zero variance across the batch), so the teacher softmax is input-independent and the student trivially matches it. Centering subtracts the running batch-mean from the teacher logits, preventing any single dimension from dominating; paired with sharpening (low teacher temperature) it rules out both the uniform and the one-dimension-dominant degenerate modes.
Linear-probe bias against MAE. The probe rewards encoders whose pooled global vector is already linearly separable by category — exactly what contrastive/supervised losses optimize. MAE optimizes pixel reconstruction and spreads information across patch tokens instead of concentrating linearly separable semantics globally, so one linear layer under-reads it. On a dense task like medical-image segmentation, MAE's per-patch spatial features outperform MoCo's globally-pooled invariant features (and full fine-tuning, not probing, closes the classification gap).
CLIP vs DINOv2 for the laundry VLA. DINOv2. CLIP's contrastive objective aligns a single image-level embedding with the caption, so it rewards invariance to where things are and tends to discard fine spatial detail, which makes it a poor basis for regressing precise $(x,y,z)$ gripper targets. DINOv2 self-distillation produces emergent, segmentation-quality dense patch features that localize parts like the sleeve, giving the spatial grounding manipulation requires.

Looking ahead#

Having established how vision models learn structural representations from pure pixels, the next challenge is bridging those visual representations with natural language.

Week 3: Contrastive Vision–Language Learning (CLIP). We will derive the symmetric InfoNCE objective computed over cross-modal image-text pairs, analyze how web-scale noisy supervision enables zero-shot classification, and examine why CLIP's joint embedding space became the standard interface for multimodal AI.
