Purpose of this lecture
For the first ten weeks, the goal of generative modeling was the output: a high-fidelity image, audio clip, or action sequence. This lecture asks a different question: what has the model learned internally in order to generate that output? Generative pretraining produces representations that capture geometric structure, object semantics, and scene composition — often exceeding the representations learned by supervised discriminative models at the same scale. Extracting and applying these representations to downstream tasks is one of the most practically impactful developments in modern AI, directly connecting generative model research to robotics, medical imaging, and autonomous planning.
Generative vs. contrastive self-supervised learning
Self-supervised learning learns representations without labels by solving a pretext task. Two paradigms dominate:
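Contrastive learning (SimCLR, MoCo, CLIP): teaches a model to produce similar embeddings for semantically equivalent views of the same data and dissimilar embeddings for different data. For an anchor embedding $z_i$ with positive view $z_i^+$, a similarity function $\mathrm{sim}$ (typically cosine similarity), a temperature $\tau$, and a sum over the positive and all in-batch negatives $z_k$, the InfoNCE loss:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_i^+)/\tau\right)}{\sum_{k} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$
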
trains a representation that is semantically discriminative: "cat" and "dog" are far apart; two crops of the same cat are close. Contrastive features excel at classification and retrieval but tend to collapse fine-grained spatial information because invariance across views discards the local details that distinguish views.
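As a rough illustration of the contrastive objective, the sketch below computes an InfoNCE-style loss over a batch with NumPy. The function name `info_nce`, the temperature value, and the random toy embeddings are assumptions made for this example, not details taken from SimCLR, MoCo, or CLIP.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE over a batch: row i of z_a and row i of z_b are two views of sample i."""
    # L2-normalize so that the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; the other entries in each row act as negatives.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                              # toy embeddings (assumed)
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))   # views agree
unrelated = info_nce(z, rng.normal(size=z.shape))            # views do not agree
```

With matching views the loss should come out much lower than with unrelated embeddings, which is the behavior the objective rewards.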
Generative learning (VAE, diffusion, MAE): teaches a model to reconstruct or predict missing data. To reconstruct a masked image patch from the surrounding context, the model must understand local geometry, texture boundaries, and global semantic composition. Generative representations tend to be richer in spatial and structural detail because no invariance is imposed: every pixel matters for reconstruction.
The key empirical finding is that for tasks requiring spatial precision (dense prediction, segmentation, depth estimation, grasping) generative representations typically outperform contrastive representations despite contrastive models having better linear classification accuracy. The representation quality metric depends entirely on the downstream task.
Masked autoencoders
Masked Autoencoder (MAE; He et al., 2022) is the preeminent generative pretraining method for images. Architecture: a ViT encoder processes only the unmasked patches (typically 25% of the image); a lightweight decoder attends to all patch positions (visible and masked) and reconstructs the masked patches:

$$\mathcal{L}_{\text{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \lVert x_i - \hat{x}_i \rVert_2^2$$

where $\mathcal{M}$ is the set of masked patch indices, $x_i$ is the pixel values in patch $i$, and $\hat{x}_i$ is the decoder's reconstruction. Masking 75% of patches creates a challenging reconstruction task that cannot be solved by low-level texture copying: the encoder must learn semantic context from widely separated patches.
Why MAE representations are good: the encoder sees only 25% of the image and must produce embeddings that contain sufficient information for the decoder to reconstruct the remaining 75%. This information compression forces the encoder to capture high-level semantic structure. The decoder, trained to reconstruct low-level pixels, is discarded at inference; only the encoder is used for downstream tasks.
Scaling properties: MAE scales well with model size. A ViT-H/14 (632M parameters) pretrained with MAE on ImageNet-1K achieves 86.9% top-1 accuracy with fine-tuning, competitive with supervised ViT-H while using only unlabeled data for pretraining (labels enter only at the fine-tuning stage). The representation quality continues to improve with scale, whereas contrastive methods tend to saturate earlier.
MAE implementation: masking strategy and decoder design
Masking in MAE is applied at the patch level: a 224×224 image is divided into 196 patches of 16×16 pixels each; patches are uniformly randomly sampled without structure, and 75% (147 patches) are discarded before the encoder sees them. The encoder processes only the remaining 25% (49 patches) along with their positional embeddings, computing full self-attention over this sparse subset; no masked token embeddings are fed to the encoder.
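The patch-level masking described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the helper name `random_masking` and the zero-filled placeholder patches are hypothetical, and shuffling by sorting random noise is one common way to draw a uniform random subset rather than a claim about any specific codebase.

```python
import numpy as np

def random_masking(num_patches=196, mask_ratio=0.75, rng=None):
    """Return indices of visible patches and a boolean mask (True = masked)."""
    rng = rng or np.random.default_rng()
    num_keep = int(num_patches * (1 - mask_ratio))
    # Shuffle patch indices by sorting random noise, then keep the first num_keep.
    shuffle = np.argsort(rng.random(num_patches))
    keep_idx = np.sort(shuffle[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask

# A 224x224 image cut into 16x16 patches gives a 14x14 grid = 196 patches.
patches = np.zeros((196, 16 * 16 * 3))          # placeholder patch pixels (assumed)
keep_idx, mask = random_masking(rng=np.random.default_rng(0))
visible = patches[keep_idx]                     # only these rows reach the encoder
```

The encoder input is the `visible` array of 49 patches; the 147 masked positions are reintroduced only at the decoder.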
The decoder is a separate, much smaller transformer than the encoder: whereas the encoder may have 12–32 layers and hidden dimension 768–1280 (ViT-B, ViT-L, ViT-H), the decoder uses only 8 layers with hidden dimension 512. The decoder takes two inputs: (1) the encoder outputs for the 49 visible patches, and (2) a shared learnable mask token placed at each of the 147 masked positions, giving 196 tokens in total (49 visible + 147 masked), with positional embeddings added at every position. The decoder applies full self-attention across all 196 positions and predicts normalized pixel values for the masked patches. The decoder is used only during pretraining and is discarded entirely at inference; the representation quality depends entirely on what the encoder has learned.
The reconstruction target is normalized pixel values in each 16×16 patch — specifically, the mean and variance are computed per patch, and the network predicts the normalized (zero-mean, unit-variance) pixel values. This is not LPIPS, perceptual loss, or any other learned metric; the simple mean-squared error (MSE) pixel loss is sufficient to learn rich representations because the masking creates a sufficiently hard task that the model cannot solve it by memorizing low-level texture patterns.
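A minimal sketch of the per-patch normalized target and the masked MSE loss follows. The function names, the `eps` stabilizer, and the random toy patches are assumptions for illustration only.

```python
import numpy as np

def normalized_target(patches, eps=1e-6):
    """Normalize each patch to zero mean and (approximately) unit variance."""
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)   # shape: (num_patches,)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)
patches = rng.uniform(0.0, 1.0, size=(196, 768))       # toy pixel values (assumed)
mask = np.zeros(196, dtype=bool)
mask[:147] = True                                      # 147 masked, 49 visible

target = normalized_target(patches)
pred = target.copy()
pred[~mask] += 5.0                                     # errors on visible patches
loss_visible_errors_only = masked_mse(pred, target, mask)
loss_zero_prediction = masked_mse(np.zeros_like(target), target, mask)
```

Errors on visible patches do not contribute to the loss, and predicting all zeros against a unit-variance target gives a loss close to 1, which is a handy sanity check when implementing the objective.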
The training efficiency of MAE derives directly from the masking strategy. Since only 25% of patches pass through the encoder, the attention computation is reduced dramatically: the self-attention cost scales with $49^2$ (not $196^2$), which corresponds to only $1/16$ of the full attention cost. In practice, MAE trains approximately 3× faster than standard ViT at the same effective batch size, making it a highly efficient pretraining approach.
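The attention-cost arithmetic can be checked directly. The sketch below counts only the quadratic token-pair term of self-attention; it deliberately ignores the MLP blocks (linear in the number of tokens) and the decoder, so it is not a full FLOP estimate.

```python
image_size, patch_size, keep_ratio = 224, 16, 0.25

num_patches = (image_size // patch_size) ** 2      # 14 * 14 patches
num_visible = int(num_patches * keep_ratio)        # patches seen by the encoder

# Self-attention compares every token with every other token: cost ~ n^2.
full_pairs = num_patches ** 2
visible_pairs = num_visible ** 2
attention_ratio = visible_pairs / full_pairs
```

Here `attention_ratio` evaluates to 0.0625, that is 1/16. The end-to-end speedup quoted in the text is smaller than 16× because the non-attention layers and the decoder still have to run.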
Diffusion model features for perception
DIFT (Tang et al., 2023) discovers that intermediate feature maps from a diffusion U-Net — specifically, the activations of the U-Net's decoder layers at specific timesteps — are powerful zero-shot correspondence estimators. For any two images, the spatial features at matching objects are geometrically close in activation space, even without explicit supervision.
The mechanism: at timestep $t$, the diffusion model has denoised the image to reveal structure at a scale corresponding to the noise level. At low $t$ (low noise), features encode fine details; at high $t$ (high noise), features encode global structure. The U-Net decoder's attention layers perform implicit spatial grouping: pixels belonging to the same object are denoised as a coherent unit, clustering their features.
Emergent segmentation: the self-attention maps of diffusion U-Nets naturally segment objects into coherent groups without any segmentation training. The attention map at layer $l$ for a query patch $q$ shows which spatial positions are semantically related to $q$, which acts as a rough zero-shot segmentation mask. This property has been exploited for open-vocabulary segmentation built on frozen diffusion features (ODISE, Xu et al., 2023).
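To make the idea of reading a mask off an attention map concrete, here is a toy sketch. Everything in it is an assumption for illustration: the 4×4 patch grid, the synthetic two-object features, and the "above uniform attention" threshold are not taken from any diffusion model.

```python
import numpy as np

def attention_row(queries, keys, query_index):
    """Softmax attention weights from one query patch to every key patch."""
    d = queries.shape[1]
    logits = keys @ queries[query_index] / np.sqrt(d)
    logits = logits - logits.max()                 # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy 4x4 patch grid: the left two columns are "object A", the rest "object B".
rng = np.random.default_rng(0)
grid = 4
is_a = (np.arange(grid * grid) % grid) < 2
feats = np.where(is_a[:, None], np.eye(8)[0], np.eye(8)[1]) * 4.0
feats = feats + 0.05 * rng.standard_normal(feats.shape)

attn = attention_row(feats, feats, query_index=0)      # patch 0 lies in object A
mask = (attn > 1.0 / attn.size).reshape(grid, grid)    # keep above-uniform weights
```

In this toy setting the thresholded attention row recovers exactly the patches that share an object with the query. Real attention maps are noisier and usually need aggregation across heads and layers.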
DINOv2 features from a transformer trained with self-distillation (a generative-style objective where the student predicts the teacher's features) also exhibit emergent semantic segmentation, indicating that the self-supervised objective design — forcing the representation to predict rich target features — is more important than whether the task is explicitly generative.
Diffusion features: extraction procedure and timestep selection
The extraction procedure for DIFT is straightforward but requires careful implementation. Given a query image $x_0$ and a target timestep $t$:
- Corrupt the query image to timestep $t$ using the diffusion forward process: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_t$ is the cumulative noise schedule coefficient.
- Run the denoising network forward pass on the corrupted image $x_t$ without taking a denoising step, and extract intermediate feature maps $F_t^{(l)}$ from decoder layer $l$ of the U-Net.
- Use the features $F_t^{(l)}$ as dense visual descriptors for downstream tasks; the spatial resolution of $F_t^{(l)}$ is typically preserved from the U-Net's spatial feature maps.
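The three steps above can be sketched as follows. This is a minimal illustration under loud assumptions: the linear beta schedule, the timestep `t = 250`, and especially `toy_decoder_activations` (a hypothetical stand-in for a real U-Net forward pass with activation capture) are choices made for the example, not DIFT's actual configuration.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def toy_decoder_activations(x_t, t):
    """Hypothetical stand-in for a U-Net forward pass returning decoder activations."""
    coarse = x_t.reshape(4, 8, 4, 8, 3).mean(axis=(1, 3))   # 8x8 average pooling
    return [coarse, x_t]

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(32, 32, 3))   # toy image scaled to [-1, 1]
t = 250                                          # arbitrary intermediate timestep
x_t = add_noise(x0, t, alpha_bar, rng)
# Single forward pass, no denoising step: just read out one layer's activations.
features = toy_decoder_activations(x_t, t)[0]
```

With a real diffusion model, the stand-in would be replaced by the network's own forward pass plus whatever mechanism the framework offers for capturing intermediate activations.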
Timestep selection is critical for downstream performance. Empirically, an intermediate $t$ (out of $T$ total diffusion steps) provides the best correspondence features for semantic matching tasks:
- At $t \approx 0$ (minimal noise): features are nearly raw pixel values with minimal diffusion context; correspondence matching fails because only low-level texture similarity is captured, missing semantic structure.
- At $t \approx T$ (maximum noise): features encode only global noise statistics and aggregate information from the entire image with equal weight; no spatial discrimination remains.
- At intermediate $t$ (intermediate noise): the model is denoising moderate noise and must aggregate context across the image to recover object structure; the intermediate features capture both local texture details and global semantic context, enabling strong correspondence matching.
Layer selection also matters: decoder layers at moderate depth (e.g., layer 8 of 16 layers) provide better correspondence features than early layers (which encode low-level, task-specific features like texture boundaries) or late layers (which are close to pixel space and have been optimized for reconstruction rather than semantic correspondence).
Zero-shot semantic correspondence is achieved by extracting DIFT features from two images $I_1$ and $I_2$: compute $F_1$ and $F_2$ at the same layer and timestep. For a query point $p$ in $I_1$, find the nearest neighbor in $F_2$ by cosine similarity across the feature dimension; the result is a dense correspondence map without any training on correspondence pairs.
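The nearest-neighbor lookup can be sketched with plain NumPy. The function name `match_point` and the toy feature maps (one random map and a spatially shifted copy of it) are assumptions for illustration; real DIFT features would come from the U-Net.

```python
import numpy as np

def match_point(feat_a, feat_b, query_yx):
    """feat_a, feat_b: (H, W, C) feature maps; query_yx: (row, col) in feat_a."""
    h, w, c = feat_b.shape
    q = feat_a[query_yx]                                  # (C,) query descriptor
    q = q / np.linalg.norm(q)
    flat = feat_b.reshape(-1, c)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ q                                        # cosine similarity per location
    best = int(np.argmax(sim))
    return divmod(best, w), float(sim[best])              # (row, col), score

rng = np.random.default_rng(0)
feat_a = rng.standard_normal((8, 8, 32))
# Second "image": the same content shifted by 2 rows and 3 columns.
feat_b = np.roll(feat_a, shift=(2, 3), axis=(0, 1))
location, score = match_point(feat_a, feat_b, (1, 1))
```

Because the second map is a shifted copy, the query at `(1, 1)` is matched to `(3, 4)` with cosine similarity 1. Repeating the lookup for every location in the first map yields the dense correspondence map.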
Comparison to DINO features: DIFT outperforms DINOv2 on semantic correspondence benchmarks (SPair-71k) despite DINOv2 being explicitly trained for dense feature matching. This empirical finding is noteworthy because it indicates that the denoising pretext task can provide stronger spatial feature organization than self-distillation or contrastive objectives, even when the competing model is specifically designed for dense features.
Feature extraction protocols
Three standard protocols for using pretrained generative features in downstream tasks:
Linear probing: freeze the pretrained encoder, extract features from all training examples, train a linear classifier on the frozen features. Linear probe accuracy measures the linear separability of the representation — a strong linear probe indicates well-organized semantic features without relying on fine-tuning to restructure them.
Feature distillation: train a smaller, faster student model to match the frozen teacher's features (rather than matching labels). The student learns a compressed version of the representation. Feature distillation is used to deploy large generative representations efficiently on edge devices.
Fine-tuning: unfreeze all (or a subset of) the pretrained parameters and fine-tune with task-specific labels. Fine-tuning typically outperforms linear probing and distillation but risks catastrophic forgetting of the pretrained representation (the same tradeoff as in VLA fine-tuning, Week 11 of Course 2). LoRA or adapter-based fine-tuning preserves the base representation while specializing for the task.
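To illustrate why LoRA-style adaptation leaves the base representation intact, the sketch below wraps a frozen weight matrix with a trainable low-rank update. The class name `LoRALinear`, the rank, and the scaling value are assumptions chosen for the example; only the general form (frozen weight plus scaled product of two small matrices, with one of them initialized to zero) follows the standard LoRA recipe.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W + (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, rng=None):
        rng = rng or np.random.default_rng()
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen pretrained weight
        self.a = rng.normal(scale=0.01, size=(rank, d_in)) # trainable
        self.b = np.zeros((d_out, rank))                   # trainable, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.a.T) @ self.b.T
        return base + self.scale * update

    def num_trainable(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(0)
pretrained = rng.standard_normal((768, 768))
layer = LoRALinear(pretrained, rank=4, rng=rng)
x = rng.standard_normal((2, 768))
unchanged_at_init = np.allclose(layer(x), x @ pretrained.T)
trainable_fraction = layer.num_trainable() / pretrained.size
```

Because `b` starts at zero, the adapted layer reproduces the pretrained layer exactly before any training, and only about 1% of the parameter count is trainable in this configuration.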
Generative representations for robotics
In manipulation and navigation, the robot must perceive its environment, infer object geometry, and predict the outcome of actions. Generative representations provide three capabilities that discriminative representations do not:
Spatial detail: diffusion model features resolve sub-object structure (handles, joints, deformable regions) at spatial precision unavailable from CLIP or contrastive models. Grasping and assembly tasks require this precision.
Prediction: a generative model can predict what a scene would look like after an action. Using the latent representation as the state space and a trained transition model enables imagination-based planning: generate candidate action sequences, simulate their outcomes in latent space, and select the sequence with the best predicted outcome according to a reward model.
Data efficiency: generative pretraining on large image datasets (internet-scale for diffusion, robotics video for VLAs) provides strong priors for downstream tasks. A robot perception system initialized from a generative model requires fewer labeled examples to reach a given performance level than one trained from scratch.
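The imagination-based planning loop described under Prediction above can be sketched as random-shooting search in latent space. All components here are toy assumptions: the 2-D latent state, `toy_transition`, `toy_reward`, and the uniform action sampling stand in for a learned transition model and reward model.

```python
import numpy as np

def plan_random_shooting(z0, transition, reward, horizon=5,
                         num_candidates=256, action_dim=2, rng=None):
    """Sample action sequences, roll them out in latent space, return the best one."""
    rng = rng or np.random.default_rng()
    actions = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    z = np.repeat(z0[None, :], num_candidates, axis=0)
    returns = np.zeros(num_candidates)
    for step in range(horizon):
        z = transition(z, actions[:, step])    # imagined next latent states
        returns += reward(z)                   # predicted reward, summed over time
    best = int(np.argmax(returns))
    return actions[best], float(returns[best])

goal = np.array([1.0, 1.0])

def toy_transition(z, a):
    """Hypothetical latent dynamics: each action displaces the latent state."""
    return z + 0.3 * a

def toy_reward(z):
    """Hypothetical reward model: negative distance to a goal latent."""
    return -np.linalg.norm(z - goal, axis=1)

best_actions, best_return = plan_random_shooting(
    np.zeros(2), toy_transition, toy_reward, rng=np.random.default_rng(0))
do_nothing_return = -5 * float(np.linalg.norm(goal))   # stay at the origin for 5 steps
```

In a receding-horizon controller only the first action of `best_actions` would be executed before replanning from the newly encoded observation.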
Foundation model features in robotic manipulation
The practical impact of generative and foundation model representations for robotics has been substantial in recent years. Several models demonstrate how different pretraining objectives and architectures affect downstream manipulation performance.
R3M (Nair et al., 2022) learns robot-relevant visual representations from egocentric human video by combining three objectives: (1) temporal contrastive learning (minimizing distance between temporally close frames, which indicates similar actions), (2) video-language alignment (aligning video frames with task descriptions by predicting how far a clip has progressed toward the described task), and (3) a sparsity penalty that encourages compact representations. R3M features transfer to downstream manipulation tasks with frozen features and a small policy head, achieving sample-efficient learning with minimal task-specific training.
CLIP features for robot policies show that open-vocabulary task specification is feasible: a policy that uses CLIP's vision encoder for visual features and conditions on CLIP text embeddings of the task description can perform both "pick up the cup" and "pick up the scissors" without separate policy training for each object category. This reuse reduces the annotation burden and allows policies to generalize to object categories that are new to the robot but were seen during CLIP pretraining.
Voltron (Karamcheti et al., 2023) extends MAE with language conditioning: the masked image reconstruction decoder is additionally conditioned on a language description of the scene. This joint objective forces the encoder to produce features that are both spatially detailed (for pixel reconstruction) and semantically aligned with language (for understanding task descriptions). Voltron features outperform R3M and CLIP features on downstream robot manipulation tasks when fine-tuned with few demonstrations.
Evaluation of robot representations follows a standard protocol: freeze the pretrained encoder, attach a small behavior-cloning policy head (typically a 2-layer MLP or Gaussian mixture model), train on $N$ demonstrations, and measure task success rate on held-out test episodes. A strong representation should enable a high success rate at low $N$; data efficiency is the primary evaluation criterion. The implicit assumption is that pretrained features capture task-relevant structure, so downstream learning requires only a shallow policy adaptation layer.
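A toy version of this protocol is sketched below. It makes several simplifying assumptions that should not be read as part of the standard benchmark: the "pretrained" encoder is a fixed random projection, the expert is a hypothetical linear policy, the policy head is a ridge-regression linear layer instead of an MLP, and held-out action error stands in for task success rate.

```python
import numpy as np

rng = np.random.default_rng(0)
w_enc = rng.standard_normal((10, 32)) / np.sqrt(10)   # frozen encoder weights (assumed)
w_expert = rng.standard_normal((10, 2))               # hypothetical expert policy

def encode(obs):
    """Frozen encoder: its weights are never updated during policy training."""
    return np.tanh(obs @ w_enc)

def collect_demos(n):
    obs = rng.standard_normal((n, 10))
    actions = obs @ w_expert + 0.01 * rng.standard_normal((n, 2))
    return obs, actions

def fit_policy_head(feats, actions, ridge=1e-2):
    """Behavior cloning with a linear head, solved in closed form."""
    x = np.hstack([feats, np.ones((len(feats), 1))])  # append a bias column
    return np.linalg.solve(x.T @ x + ridge * np.eye(x.shape[1]), x.T @ actions)

def action_error(head, obs, actions):
    x = np.hstack([encode(obs), np.ones((len(obs), 1))])
    return float(np.mean((x @ head - actions) ** 2))

test_obs, test_actions = collect_demos(500)
errors = {}
for n in (5, 50, 500):                                # number of demonstrations N
    obs, actions = collect_demos(n)
    head = fit_policy_head(encode(obs), actions)
    errors[n] = action_error(head, test_obs, test_actions)
```

Plotting `errors` against the number of demonstrations gives a data-efficiency curve; comparing such curves across encoders is the essence of the protocol.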
The general finding across R3M, Voltron, CLIP, and MAE evaluations is that representations preserving spatial structure (MAE, DIFT, Voltron, DINO) outperform representations that collapse spatial information (CLIP ViT class token, DINO class token) for contact-rich manipulation tasks. The class token averages spatial information over the entire image, losing the local detail needed for grasp point selection, force control, and deformable object handling. In contrast, spatial feature maps maintain resolution in the space where the gripper contacts the object, directly supporting fine-grained action control.
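The loss of spatial information under global aggregation can be seen in a small example. The sketch uses global average pooling as a stand-in for a single global token, which is an assumption made for illustration; the synthetic 14×14 feature map with one active location is also hypothetical.

```python
import numpy as np

# Toy feature map: one location (e.g. a mug handle) carries a strong response.
feat = np.zeros((14, 14, 8))
feat[2, 3] = 1.0
# Same scene content, but the object part sits elsewhere in the image.
shifted = np.roll(feat, shift=(8, 7), axis=(0, 1))

# A single global vector cannot tell the two scenes apart.
pooled_a = feat.mean(axis=(0, 1))
pooled_b = shifted.mean(axis=(0, 1))
same_global_vector = bool(np.allclose(pooled_a, pooled_b))

def peak_location(feature_map):
    """Location of the strongest response in channel 0 of a spatial feature map."""
    idx = np.unravel_index(np.argmax(feature_map[..., 0]), feature_map.shape[:2])
    return tuple(int(i) for i in idx)

loc_a = peak_location(feat)
loc_b = peak_location(shifted)
```

Both scenes produce identical pooled vectors, while the spatial maps localize the part at `(2, 3)` and `(10, 10)` respectively, which is the information a grasp-point selector needs.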
GenAI context
The representation learning perspective reveals deep connections across the four courses in the GenAI sequence. The table below summarizes how different representation learning approaches are deployed across courses:
| Aspect | Course 3 (Generative Models) | Course 1 (RL) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| Pretraining objective | MAE: masked patch reconstruction; Diffusion: denoising | RLHF: reward model pretraining; representation for Q-function | Robot video pretraining: R3M, Voltron | CLIP: contrastive image-text; MAE on images (C4W2) |
| Representation type | Spatial feature maps (encoder activations) | State value embedding | RGB observation embedding for policy | Visual token embeddings (patch features) |
| Downstream adaptation | Linear probe, distillation, LoRA fine-tuning | Q-function fine-tuning from pretrained features | Frozen encoder + BC policy head | Q-Former (BLIP-2), projector (LLaVA) bridging to LLM |
| Spatial precision | High (MAE/DIFT capture sub-object structure) | Low (state abstractions) | Required for grasping (contact-rich tasks) | Variable (class token low, patch tokens high) |
The representation learning perspective reveals a fundamental connection between Course 3 and Course 4. MAE and ViT (Course 4, Week 1) are the same core architecture with different pretraining objectives: MAE uses masked patch reconstruction while the ViT in CLIP uses contrastive image-text pretraining. The representation quality difference between the two directly determines their utility for spatial tasks: robotics policies trained on MAE features consistently outperform those trained on CLIP class tokens, which is why modern VLA models (vision-language-action models, Course 2 Week 8; Course 4 Week 12) typically use vision encoders based on ViT-DINO or ViT-MAE rather than CLIP's class token aggregation.
Key takeaways
Generative pretraining produces representations that capture geometric and semantic structure not available from contrastive learning, particularly for spatial tasks. MAE trains a ViT encoder to reconstruct masked patches from context, producing features that scale well with model size. Diffusion U-Net feature maps exhibit emergent zero-shot segmentation through spatial clustering in the attention layers. Three downstream protocols — linear probing, distillation, and fine-tuning — provide different tradeoffs between representation fidelity and downstream adaptation cost. For robotics, generative representations enable spatial precision, predictive planning, and data-efficient perception.
Conceptual questions
- MAE uses 75% masking during pretraining. If the masking ratio is reduced to 25% (only 25% masked), the reconstruction task becomes easier. Analyze how this change affects: (a) the difficulty of the pretext task; (b) the quality of the learned representation; (c) the computational efficiency of training. What masking ratio would you choose for a domain with high spatial redundancy (e.g., satellite imagery of farmland) vs. a domain with high spatial information density (e.g., circuit board inspection)?
- DIFT uses diffusion model features at a specific timestep $t$ for correspondence estimation. At $t = 0$ (no noise), the features are close to raw image features; at $t = T$ (full noise), features encode no spatial information. Argue why an intermediate $t$ is optimal for semantic correspondence: what types of correspondences are captured at small vs. large $t$, and how would you empirically determine the optimal $t$ for a given downstream task?
- A linear probe is trained on top of frozen features from (a) a supervised ResNet-50, (b) a CLIP ViT-B/16, and (c) a MAE ViT-B/16, all on an ImageNet subset. The probe accuracies are 85%, 83%, and 74% respectively. Based on these results, can you conclude that the supervised ResNet features are better for all downstream tasks? Describe a downstream task where the ordering would reverse.
- Imagination-based planning uses a learned transition model in latent space. Identify two sources of error that accumulate when planning $H$ steps ahead: (a) error from the initial state encoding $z_0$, and (b) compounding error from the transition model. Derive how error grows with planning horizon $H$ under the assumption that the transition model has bounded error $\epsilon$ at each step.
- Feature distillation trains a student model to match a teacher's feature representations. If the teacher is a large diffusion model (e.g., Stable Diffusion U-Net features), the student receives supervision at every spatial position from a model with very different architecture. Describe the loss function for feature distillation and explain what properties of the teacher's features are preserved vs. lost during distillation. Under what conditions does distilled-feature quality approach teacher-feature quality?
Looking ahead
Generative representations capture world structure. The next lecture examines how this world structure can be used not just for perception but for decision-making — building models that predict the future and allow planning without real-world interaction.
Week 12: World Models and Reinforcement Learning. We study the RSSM architecture, DreamerV3's imagination-based policy optimization, latent-space model predictive control, and the use of generative models as trajectory priors for offline RL.
Further reading
- Chen, T., et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML. (SimCLR).
- He, K., et al. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR. (MAE).