Week 11: Representation Learning with Generative Models

✦Learning Outcomes
  • Extract and apply representations from VAE (Variational Autoencoder) encoders and diffusion models to downstream tasks
  • Analyze when generative pretraining outperforms discriminative approaches for downstream tasks
  • Apply representation learning concepts to robotics perception and planning
◆Prerequisites
  • Week 2: VAEs - VAE encoder representations
  • Week 6: DDPM - Diffusion model features

Background in machine learning fundamentals is assumed. Course 1 concepts helpful but not required.

Purpose of this lecture

For the first ten weeks, the goal of generative modeling was the output: a high-fidelity image, audio clip, or action sequence. This lecture asks a different question: what has the model learned internally in order to generate that output? Generative pretraining produces representations that capture geometric structure, object semantics, and scene composition — often exceeding the representations learned by supervised discriminative models at the same scale. Extracting and applying these representations to downstream tasks is one of the most practically impactful developments in modern AI, directly connecting generative model research to robotics, medical imaging, and autonomous planning.


Generative vs. contrastive self-supervised learning

Self-supervised learning learns representations without labels by solving a pretext task. Two paradigms dominate:

Contrastive learning (SimCLR, MoCo, CLIP): teaches a model to produce similar embeddings for semantically equivalent views of the same data and dissimilar embeddings for different data. The InfoNCE loss:

$$\mathcal{L}_\text{NCE} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$

trains a representation that is semantically discriminative: "cat" and "dog" are far apart; two crops of the same cat are close. Contrastive features excel at classification and retrieval but tend to collapse fine-grained spatial information because invariance across views discards the local details that distinguish views.
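The InfoNCE loss above can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity for sim and a temperature of 0.1; the function names and toy embeddings are hypothetical and not taken from any library.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(z, i, j, tau=0.1):
    """InfoNCE loss for anchor index i with positive index j.

    Every k != i (including the positive j) appears in the denominator,
    matching the formula above.
    """
    logits = [cosine_sim(z[i], z[k]) / tau for k in range(len(z)) if k != i]
    pos = cosine_sim(z[i], z[j]) / tau
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - pos

# Toy embeddings (assumed): indices 0 and 1 are two views of the same image.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]
loss_pos = info_nce(z, i=0, j=1)  # true positive pair: loss near zero
loss_neg = info_nce(z, i=0, j=2)  # mismatched pair: much larger loss
```

The loss is small when the anchor is far more similar to its positive than to any other sample, which is the invariance pressure described above.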

Generative learning (VAE, diffusion, MAE): teaches a model to reconstruct or predict missing data. To reconstruct a masked image patch from the surrounding context, the model must understand local geometry, texture boundaries, and global semantic composition. Generative representations tend to be richer in spatial and structural detail because no invariance is imposed: every pixel matters for reconstruction.

The key empirical finding is that for tasks requiring spatial precision (dense prediction, segmentation, depth estimation, grasping) generative representations typically outperform contrastive representations, even though contrastive models have better linear classification accuracy. Which representation counts as better therefore depends on the downstream task; linear classification accuracy alone is not a sufficient measure.


Masked autoencoders

Masked Autoencoder (MAE; He et al., 2022) is one of the most widely used generative pretraining methods for images. Architecture: a ViT encoder processes only the unmasked patches (typically 25% of the image); a lightweight decoder attends to all patch positions (visible and masked) and reconstructs the masked patches:

$$\mathcal{L}_\text{MAE} = \mathbb{E}_{x, M}\!\left[\frac{1}{|M|} \sum_{i \in M} \|x_i - \hat{x}_i\|^2\right]$$

where $M$ is the set of masked patch indices, $x_i$ is the vector of pixel values in patch $i$, and $\hat{x}_i$ is the decoder's reconstruction. Masking 75% of patches creates a challenging reconstruction task that cannot be solved by low-level texture copying: the encoder must learn semantic context from widely separated patches.

Why MAE representations are good: the encoder sees only 25% of the image and must produce embeddings that contain sufficient information for the decoder to reconstruct the remaining 75%. This information compression forces the encoder to capture high-level semantic structure. The decoder, trained to reconstruct low-level pixels, is discarded at inference; only the encoder is used for downstream tasks.

Scaling properties: MAE scales well with model size. A ViT-H/14 (632M parameters) pretrained with MAE on ImageNet-1K achieves 86.9% top-1 accuracy after fine-tuning, competitive with supervised ViT-H even though the pretraining stage uses no labels (labels are needed only for fine-tuning). Representation quality continues to improve with scale, whereas contrastive methods tend to saturate earlier.


MAE implementation: masking strategy and decoder design

Masking in MAE is applied at the patch level: a 224×224 image is divided into 196 patches of 16×16 pixels each; patches are uniformly randomly sampled without structure, and 75% (147 patches) are discarded before the encoder sees them. The encoder processes only the remaining 25% (49 patches) along with their positional embeddings, computing full self-attention over this sparse subset; no masked token embeddings are fed to the encoder.
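The patch-level masking described above can be sketched as follows. This is a minimal illustration using only the standard library; the function name and the fixed seed are assumptions made for reproducibility, not part of any MAE codebase.

```python
import random

def random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0):
    """Return (visible_idx, masked_idx) for one image.

    Patch indices are sampled uniformly at random without replacement
    by shuffling and splitting the permutation.
    """
    rng = random.Random(seed)
    perm = list(range(num_patches))
    rng.shuffle(perm)
    num_masked = int(num_patches * mask_ratio)
    masked_idx = sorted(perm[:num_masked])
    visible_idx = sorted(perm[num_masked:])
    return visible_idx, masked_idx

visible_idx, masked_idx = random_patch_mask()
# Only the visible patches (with their positional embeddings) reach the encoder.
```

With the defaults this yields 49 visible and 147 masked indices, matching the counts in the text.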

The decoder is a separate, much smaller transformer than the encoder: whereas the encoder has 12 to 32 layers and hidden dimension 768 to 1280 (ViT-B: 12 layers, 768; ViT-L: 24 layers, 1024; ViT-H: 32 layers, 1280), the default decoder uses only 8 layers with hidden dimension 512. The decoder takes two inputs: (1) the encoder outputs for the 49 visible patches, and (2) a shared learnable mask token placed at each of the 147 masked positions. Positional embeddings are added at all 196 positions (49 visible + 147 masked). The decoder applies full self-attention across all 196 positions and predicts normalized pixel values for the masked patches. Critically, the decoder is used only during pretraining and is discarded entirely at inference; the representation quality depends entirely on what the encoder has learned.

The reconstruction target is normalized pixel values in each 16×16 patch — specifically, the mean and variance are computed per patch, and the network predicts the normalized (zero-mean, unit-variance) pixel values. This is not LPIPS, perceptual loss, or any other learned metric; the simple mean-squared error (MSE) pixel loss is sufficient to learn rich representations because the masking creates a sufficiently hard task that the model cannot solve it by memorizing low-level texture patterns.
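A minimal sketch of the normalized-pixel target and the masked MSE loss, following the loss formula given earlier (sum of squared errors per patch, averaged over masked patches). The helper names, the small `eps` constant, and the toy patches are assumptions for illustration; real implementations operate on tensors and may average over pixels as well.

```python
import math

def normalize_patch(patch, eps=1e-6):
    """Zero-mean, unit-variance version of one flattened patch."""
    n = len(patch)
    mean = sum(patch) / n
    var = sum((p - mean) ** 2 for p in patch) / n
    return [(p - mean) / math.sqrt(var + eps) for p in patch]

def mae_loss(patches, predictions, masked_idx):
    """Average over masked patches of the squared error to the normalized target.

    patches: list of flattened pixel patches.
    predictions: dict mapping patch index to a predicted normalized patch.
    """
    total = 0.0
    for i in masked_idx:
        target = normalize_patch(patches[i])
        total += sum((t - p) ** 2 for t, p in zip(target, predictions[i]))
    return total / len(masked_idx)

# Toy data (assumed): three 4-pixel patches, the first two are masked.
patches = [[0.0, 1.0, 2.0, 3.0], [4.0, 4.0, 6.0, 6.0], [1.0, 1.0, 1.0, 1.0]]
masked = [0, 1]
perfect = {i: normalize_patch(patches[i]) for i in masked}
zeros = {i: [0.0] * 4 for i in masked}
loss_perfect = mae_loss(patches, perfect, masked)
loss_zero = mae_loss(patches, zeros, masked)
```

Visible patches contribute nothing to the loss, and a constant patch normalizes to all zeros because `eps` keeps the division well defined.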

The training efficiency of MAE derives directly from the masking strategy. Since only 25% of patches pass through the encoder, the attention computation is reduced dramatically: the $O(n^2)$ self-attention cost scales with $n = 49$ (not 196), which corresponds to only $(49/196)^2 = 6.25\%$ of the full attention cost. In practice, MAE trains approximately 3× faster than standard ViT at the same effective batch size, making it a highly efficient pretraining approach.
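The arithmetic above is easy to check directly. The sketch below assumes the encoder's attention cost is proportional to the square of the number of visible patches; the function name is hypothetical.

```python
def attention_cost_fraction(num_patches=196, mask_ratio=0.75):
    """Visible patch count and fraction of full self-attention cost.

    Assumes cost proportional to n^2 for n visible patches.
    """
    visible = round(num_patches * (1.0 - mask_ratio))
    return visible, (visible / num_patches) ** 2

visible_75, frac_75 = attention_cost_fraction(mask_ratio=0.75)
visible_25, frac_25 = attention_cost_fraction(mask_ratio=0.25)
# Relative encoder attention cost of 25% masking versus 75% masking.
relative_cost = frac_25 / frac_75
```

Only the attention term is quadratic; the per-token MLP layers scale linearly with the number of visible patches, so their cost drops to 25% rather than 6.25%.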


Diffusion model features for perception

DIFT (Tang et al., 2023) discovers that intermediate feature maps from a diffusion U-Net — specifically, the activations of the U-Net's decoder layers at specific timesteps — are powerful zero-shot correspondence estimators. For any two images, the spatial features at matching objects are geometrically close in activation space, even without explicit supervision.

The mechanism: at timestep $t$, the diffusion model has denoised the image to reveal structure at a scale corresponding to the noise level. At low $t$ (low noise), features encode fine details; at high $t$ (high noise), features encode global structure. The U-Net decoder's attention layers perform implicit spatial grouping: pixels belonging to the same object are denoised as a coherent unit, clustering their features.

Emergent segmentation: the self-attention maps of diffusion U-Nets naturally segment objects into coherent groups without any segmentation training. The attention map at layer $l$ for a query patch $q$ shows which spatial positions are semantically related to $q$, which acts as a zero-shot segmentation mask. This property has been exploited for segmentation built on frozen diffusion features (ODISE, Xu et al., 2023).

DINOv2 features from a transformer trained with self-distillation (a generative-style objective where the student predicts the teacher's features) also exhibit emergent semantic segmentation, indicating that the self-supervised objective design — forcing the representation to predict rich target features — is more important than whether the task is explicitly generative.


Diffusion features: extraction procedure and timestep selection

The extraction procedure for DIFT is straightforward but requires careful implementation. Given a query image $x$ and a target timestep $t^*$:

  1. Corrupt the query image to timestep $t^*$ using the diffusion forward process: $x_{t^*} = \sqrt{\bar{\alpha}_{t^*}}\, x + \sqrt{1 - \bar{\alpha}_{t^*}}\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_{t^*}$ is the cumulative noise schedule coefficient.

  2. Run the denoising network forward pass on the corrupted image without taking a denoising step, and extract intermediate feature maps $\phi_l(x_{t^*})$ from decoder layer $l$ of the U-Net.

  3. Use the features as dense visual descriptors for downstream tasks; the spatial resolution of $\phi_l$ is typically preserved from the U-Net's spatial feature maps.
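Step 1 of the procedure can be sketched without any deep learning library. The example below assumes a linear beta schedule from 1e-4 to 0.02 over 1000 steps, a common DDPM default but not something the text specifies; the function names and the four-pixel toy image are hypothetical. Steps 2 and 3 need a pretrained U-Net and are not shown.

```python
import math
import random

def alpha_bar_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for t = 1..T (assumed linear betas)."""
    alpha_bars, prod = [], 1.0
    for step in range(T):
        beta = beta_start + (beta_end - beta_start) * step / (T - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def corrupt(x, t, alpha_bars, rng):
    """Sample x_t from the forward process for a flattened image x."""
    if t == 0:
        return list(x)  # no noise is added at t = 0
    a_bar = alpha_bars[t - 1]
    return [math.sqrt(a_bar) * xi + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
            for xi in x]

alpha_bars = alpha_bar_schedule()
rng = random.Random(0)
x = [0.5, -0.25, 1.0, 0.0]  # toy "image" with four pixels
x_t = corrupt(x, 100, alpha_bars, rng)
# x_t is what would be passed to the U-Net, together with t, in step 2.
```

Under this assumed schedule the cumulative coefficient at step 100 is still close to 0.9, so most of the image signal survives, while at step 1000 it is essentially zero.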

Timestep selection is critical for downstream performance. Empirically, a small but nonzero timestep such as $t^* \approx 100$ (out of $T = 1000$ diffusion steps) works well for semantic matching tasks, although the best value depends on the model and the task:

  • At $t^* = 0$ (no added noise): features are nearly raw pixel values with minimal diffusion context; correspondence matching fails because only low-level texture similarity is captured, missing semantic structure.

  • At $t^* = 1000$ (maximum noise): features encode only global noise statistics and aggregate information from the entire image with equal weight; no spatial discrimination remains.

  • At $t^* = 100$ (intermediate noise): the model is denoising moderate noise and must aggregate context across the image to recover object structure; the intermediate features capture both local texture details and global semantic context, enabling strong correspondence matching.

Layer selection also matters: decoder layers at moderate depth (e.g., layer 8 of 16 layers) provide better correspondence features than early layers (which encode low-level, task-specific features like texture boundaries) or late layers (which are close to pixel space and have been optimized for reconstruction rather than semantic correspondence).

Zero-shot semantic correspondence is achieved by extracting DIFT features from two images $I_1$ and $I_2$: compute $\phi(I_1^{t^*})$ and $\phi(I_2^{t^*})$ at the same layer and timestep. For a query point $(u, v)$ in $I_1$, find the nearest neighbor in $\phi(I_2^{t^*})$ by cosine similarity across the feature dimension; repeating this for every query point yields a dense correspondence map without any training on correspondence pairs.
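The nearest-neighbor matching step can be sketched as below. The feature maps here are tiny hand-written grids standing in for real U-Net activations, and treating $(u, v)$ as (row, column) indices is an assumption made for the illustration; the function names are hypothetical.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def match_point(feat1, feat2, u, v):
    """Find the position in feat2 whose feature is closest to feat1[u][v].

    feat1 and feat2 are H x W grids (nested lists) of feature vectors.
    Returns ((row, col), similarity).
    """
    query = feat1[u][v]
    best, best_sim = None, -2.0  # cosine similarity is always >= -1
    for r, row in enumerate(feat2):
        for c, f in enumerate(row):
            s = cosine_sim(query, f)
            if s > best_sim:
                best, best_sim = (r, c), s
    return best, best_sim

# Toy 2x2 feature maps (assumed): feat2 is feat1 mirrored left-right and rescaled.
feat1 = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
         [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]]
feat2 = [[[0.0, 2.0, 0.0], [2.0, 0.0, 0.0]],
         [[2.0, 2.0, 0.0], [0.0, 0.0, 2.0]]]
match_a, sim_a = match_point(feat1, feat2, 0, 0)
match_b, sim_b = match_point(feat1, feat2, 1, 1)
```

Because cosine similarity ignores feature magnitude, the rescaled features in the second map still match their mirrored counterparts.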

Comparison to DINO features: DIFT is reported to outperform DINO features on the SPair-71k semantic correspondence benchmark, even though self-distilled ViT features are themselves known for strong dense matching behavior. This finding is noteworthy because it suggests that the denoising pretext task can organize spatial features more effectively than self-distillation or contrastive objectives, neither of which reconstructs the input.


Feature extraction protocols

Three standard protocols for using pretrained generative features in downstream tasks:

Linear probing: freeze the pretrained encoder, extract features from all training examples, train a linear classifier on the frozen features. Linear probe accuracy measures the linear separability of the representation — a strong linear probe indicates well-organized semantic features without relying on fine-tuning to restructure them.
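A linear probe is just a linear classifier fitted on fixed feature vectors. The sketch below assumes a binary task and uses plain gradient descent on the logistic loss; the feature values, learning rate, and function names are illustrative assumptions, and the encoder itself is not shown because it stays frozen.

```python
import math

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit binary logistic regression on frozen features by gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            for d in range(dim):
                grad_w[d] += (p - y) * x[d] / n
            grad_b += (p - y) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def probe_accuracy(w, b, features, labels):
    """Fraction of examples on the correct side of the linear boundary."""
    correct = 0
    for x, y in zip(features, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)

# Hypothetical frozen-encoder features for six examples of two classes.
features = [[2.0, 0.1], [1.5, -0.2], [1.8, 0.3],
            [-2.0, 0.2], [-1.6, -0.1], [-1.9, 0.0]]
labels = [1, 1, 1, 0, 0, 0]
w, b = train_linear_probe(features, labels)
acc = probe_accuracy(w, b, features, labels)
```

Only `w` and `b` are updated; if the features were not linearly separable, the probe could not compensate, which is exactly what the protocol is designed to reveal.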

Feature distillation: train a smaller, faster student model to match the frozen teacher's features (rather than matching labels). The student learns a compressed version of the representation. Feature distillation is used to deploy large generative representations efficiently on edge devices.
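A minimal sketch of a feature distillation loss, assuming mean squared error and a linear projection that maps student features up to the teacher's dimension. Where the projection sits, and the choice of MSE over a cosine loss, are illustrative assumptions; all names and values are hypothetical.

```python
def project(vec, weight):
    """Linear projection; weight has shape out_dim x in_dim."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def distillation_loss(student_feats, teacher_feats, weight):
    """Squared error between projected student and teacher features,
    averaged over spatial positions."""
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        projected = project(s, weight)
        total += sum((a - b) ** 2 for a, b in zip(projected, t))
    return total / len(student_feats)

# Toy setup (assumed): 2-d student features, 3-d teacher features, 2 positions.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[0.5, -1.0], [2.0, 0.25]]
teacher_matched = [project(s, weight) for s in student]
teacher_shifted = [[t_k + 1.0 for t_k in t] for t in teacher_matched]
loss_matched = distillation_loss(student, teacher_matched, weight)
loss_shifted = distillation_loss(student, teacher_shifted, weight)
```

In training, the student network and the projection would be optimized jointly to reduce this loss while the teacher stays frozen.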

Fine-tuning: unfreeze all (or a subset of) the pretrained parameters and fine-tune with task-specific labels. Fine-tuning typically outperforms linear probing and distillation but risks catastrophic forgetting of the pretrained representation (the same tradeoff as in VLA fine-tuning, Week 11 of Course 2). LoRA or adapter-based fine-tuning preserves the base representation while specializing for the task.


Generative representations for robotics

In manipulation and navigation, the robot must perceive its environment, infer object geometry, and predict the outcome of actions. Generative representations provide three capabilities that discriminative representations do not:

Spatial detail: diffusion model features resolve sub-object structure (handles, joints, deformable regions) at spatial precision unavailable from CLIP or contrastive models. Grasping and assembly tasks require this precision.

Prediction: a generative model can predict what a scene would look like after an action. Using the latent representation as the state space and a trained transition model $z_{t+1} = T_\theta(z_t, a_t)$ enables imagination-based planning: generate $K$ candidate action sequences, simulate their outcomes in latent space, and select the sequence with the best predicted outcome according to a reward model.
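Imagination-based planning as described above can be sketched with random shooting. The one-dimensional latent, the additive transition, and the distance-based reward below are toy stand-ins assumed for illustration; in practice the transition and reward would be learned networks, and the function names are hypothetical.

```python
import random

def plan(z0, transition, reward, horizon=5, num_candidates=64, seed=0):
    """Sample K candidate action sequences, roll each out in latent space,
    and return the sequence with the highest predicted return."""
    rng = random.Random(seed)
    best_seq, best_ret = None, float("-inf")
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        z, ret = z0, 0.0
        for a in actions:
            z = transition(z, a)  # imagined next latent state
            ret += reward(z)      # predicted reward, no environment step
        if ret > best_ret:
            best_seq, best_ret = actions, ret
    return best_seq, best_ret

def toy_transition(z, a):
    """Stand-in for a learned T_theta: the action shifts the latent."""
    return z + a

def toy_reward(z, goal=2.0):
    """Stand-in for a learned reward model: negative distance to a goal latent."""
    return -abs(z - goal)

best_actions, best_return = plan(0.0, toy_transition, toy_reward)
# Return of taking no action for the same horizon, for comparison.
idle_return = sum(toy_reward(0.0) for _ in range(5))
```

A common design choice is to execute only the first action of the selected sequence and then replan from the next observation, which limits the effect of model error over long horizons.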

Data efficiency: generative pretraining on large image datasets (internet-scale for diffusion, robotics video for VLAs) provides strong priors for downstream tasks. A robot perception system initialized from a generative model requires fewer labeled examples to reach a given performance level than one trained from scratch.


Foundation model features in robotic manipulation

The practical impact of generative and foundation model representations for robotics has been substantial in recent years. Several models demonstrate how different pretraining objectives and architectures affect downstream manipulation performance.

R3M (Nair et al., 2022) learns robot-relevant visual representations by combining three objectives: (1) temporal contrastive learning on egocentric human video (minimizing distance between temporally close frames, which indicates similar actions), (2) language alignment (aligning video frames with task descriptions via language-conditioned distances), and (3) reward prediction (predicting task completion from visual features). R3M features transfer to downstream manipulation tasks with frozen features and a small policy head, achieving sample-efficient learning with minimal task-specific training.

CLIP features for robot policies have shown that open-vocabulary task specification is feasible: using CLIP's vision encoder to extract visual features enables the same robot policy to perform "pick up the cup" and "pick up the scissors" by conditioning on CLIP text embeddings of the task description, without requiring separate policy training for each object category. This reuse reduces the annotation burden and allows policies to generalize to object categories that are new to the robot's training data but were covered during CLIP pretraining.

Voltron (Karamcheti et al., 2023) extends MAE with language conditioning — the masked image reconstruction decoder is additionally conditioned on a language description of the scene. The encoder must produce features that jointly support both pixel reconstruction and language grounding. The joint objective forces the encoder to produce features that are both spatially detailed (for reconstruction) and semantically aligned with language (for understanding task descriptions). Voltron features outperform R3M and CLIP features on downstream robot manipulation tasks when fine-tuned with few demonstrations.

Evaluation of robot representations follows a standard protocol: freeze the pretrained encoder, attach a small behavior-cloning policy head (typically a 2-layer MLP or Gaussian mixture model), train on $N \in \{10, 50, 100, 500\}$ demonstrations, and measure task success rate on held-out test episodes. A strong representation should enable a high success rate at low $N$; data efficiency is the primary evaluation criterion. The implicit assumption is that pretrained features capture task-relevant structure, so downstream learning requires only a shallow policy adaptation layer.

The general finding across R3M, Voltron, CLIP, and MAE evaluations is that representations preserving spatial structure (MAE, DIFT, Voltron, and DINO patch features) outperform representations that collapse spatial information (CLIP ViT class token, DINO class token) for contact-rich manipulation tasks. The class token aggregates spatial information over the entire image, losing the local detail needed for grasp point selection, force control, and deformable object handling. In contrast, spatial feature maps maintain resolution in the region where the gripper contacts the object, directly supporting fine-grained action control.


GenAI context

The representation learning perspective reveals deep connections across the four courses in the GenAI sequence. The table below summarizes how different representation learning approaches are deployed across courses:

| Aspect | Course 3 (Generative Models) | Course 1 (RL) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| Pretraining objective | MAE: masked patch reconstruction; Diffusion: denoising | RLHF (Reinforcement Learning from Human Feedback): reward model pretraining; representation for Q-function | Robot video pretraining: R3M, Voltron | CLIP: contrastive image-text; MAE on images (C4W2) |
| Representation type | Spatial feature maps (encoder activations) | State value embedding | RGB observation embedding for policy | Visual token embeddings (patch features) |
| Downstream adaptation | Linear probe, distillation, LoRA fine-tuning | Q-function fine-tuning from pretrained features | Frozen encoder + BC policy head | Q-Former (BLIP-2), projector (LLaVA) bridging to LLM |
| Spatial precision | High (MAE/DIFT capture sub-object structure) | Low (state abstractions) | Required for grasping (contact-rich tasks) | Variable (class token low, patch tokens high) |

The representation learning perspective reveals a fundamental connection between Course 3 and Course 4. MAE and ViT (Course 4, Week 1) are the same core architecture with different pretraining objectives: MAE uses masked patch reconstruction while the ViT in CLIP uses contrastive image-text pretraining. The representation quality difference between the two directly determines their utility for spatial tasks: robotics policies trained on MAE features tend to outperform those trained on CLIP class tokens, which is why modern VLA models (vision-language-action models, Course 2 Week 8; Course 4 Week 12) often use vision encoders that expose patch-level features, such as ViT-DINO or ViT-MAE, rather than relying on CLIP's class token aggregation.


Key takeaways

Generative pretraining produces representations that capture geometric and semantic structure not available from contrastive learning, particularly for spatial tasks. MAE trains a ViT encoder to reconstruct masked patches from context, producing features that scale well with model size. Diffusion U-Net feature maps exhibit emergent zero-shot segmentation through spatial clustering in the attention layers. Three downstream protocols — linear probing, distillation, and fine-tuning — provide different tradeoffs between representation fidelity and downstream adaptation cost. For robotics, generative representations enable spatial precision, predictive planning, and data-efficient perception.


Conceptual questions

  1. MAE uses 75% masking during pretraining. If the masking ratio is reduced to 25% (only 25% masked), the reconstruction task becomes easier. Analyze how this change affects: (a) the difficulty of the pretext task; (b) the quality of the learned representation; (c) the computational efficiency of training. What masking ratio would you choose for a domain with high spatial redundancy (e.g., satellite imagery of farmland) vs. a domain with high spatial information density (e.g., circuit board inspection)?

  2. DIFT uses diffusion model features at a specific timestep $t^*$ for correspondence estimation. At $t^* = 0$ (no noise), the features are close to raw image features; at $t^* = T$ (full noise), features encode no spatial information. Argue why an intermediate $t^*$ is optimal for semantic correspondence: what types of correspondences are captured at small $t^*$ vs. large $t^*$, and how would you empirically determine the optimal $t^*$ for a given downstream task?

  3. A linear probe is trained on top of frozen features from (a) a supervised ResNet-50, (b) a CLIP ViT-B/16, and (c) a MAE ViT-B/16, all on an ImageNet subset. The probe accuracies are 85%, 83%, and 74% respectively. Based on these results, can you conclude that the supervised ResNet features are better for all downstream tasks? Describe a downstream task where the ordering would reverse.

  4. Imagination-based planning uses a learned transition model $z_{t+1} = T_\theta(z_t, a_t)$ in latent space. Identify two sources of error that accumulate when planning $H$ steps ahead: (a) error from the initial state encoding $z_0 = \mathcal{E}(x_0)$, and (b) compounding error from the transition model. Derive how error grows with planning horizon $H$ under the assumption that the transition model has bounded error $\|z_{t+1} - T_\theta(z_t, a_t)\|_2 \leq \delta$ at each step.

  5. Feature distillation trains a student model to match a teacher's feature representations. If the teacher is a large diffusion model (e.g., Stable Diffusion U-Net features), the student receives supervision at every spatial position from a model with very different architecture. Describe the loss function for feature distillation and explain what properties of the teacher's features are preserved vs. lost during distillation. Under what conditions does distilled-feature quality approach teacher-feature quality?

✦Solutions
  1. (a) 25% masking makes the pretext task much easier: the model can interpolate masked patches from nearby texture rather than reasoning about global structure. (b) Representation quality drops: the encoder need not capture semantics, so it learns lower-level features. (c) Efficiency worsens: the encoder now processes 75% of patches instead of 25%, and $O(n^2)$ attention grows roughly $9\times$ as $n$ triples. Choose masking by redundancy: high-redundancy domains (farmland satellite) tolerate even higher masking (80–90%); high-information-density domains (circuit-board inspection) need lower masking (~50–65%) to keep the task solvable and detail-preserving.
  2. Small $t^*$: low noise → features near raw pixels, capturing low-level/textural correspondence but missing semantic matches across appearance changes. Large $t^*$: high noise → only coarse global structure survives, so precise spatial discrimination is lost. Intermediate $t^*$ (~100) balances local detail with semantic context, giving the best correspondence. Determine it empirically by sweeping $t^*$ (and layer) on a labeled correspondence validation set (e.g. SPair-71K) and maximizing PCK/accuracy for the target task.
  3. No — linear-probe accuracy measures linear separability for ImageNet classification only; it does not generalize to all tasks. The ordering reverses on spatially-precise dense tasks — semantic correspondence, segmentation, depth, or robotic grasp-point prediction — where MAE's spatial feature maps outperform the supervised ResNet / CLIP class-token features despite lower linear-probe classification accuracy.
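  4. (a) Initial-encoding error: the encoder output $\hat{z}_0 = \mathcal{E}(x_0)$ may differ from the true latent state $z_0$, giving $\varepsilon_0 = \|z_0 - \hat{z}_0\|$, a fixed offset at the start. (b) Let $e_t = z_t - \hat{z}_t$ be the gap between the true latent and the imagined latent, where $\hat{z}_{t+1} = T_\theta(\hat{z}_t, a_t)$. With per-step bound $\delta$ and an $L$-Lipschitz transition, $\|e_{t+1}\| \le \|z_{t+1} - T_\theta(z_t, a_t)\| + \|T_\theta(z_t, a_t) - T_\theta(\hat{z}_t, a_t)\| \le \delta + L\|e_t\|$. Unrolling this recursion from $\|e_0\| = \varepsilon_0$ gives $\|e_H\| \le L^{H}\varepsilon_0 + \delta\frac{L^H - 1}{L - 1}$ for $L \neq 1$. For a non-expansive model ($L \approx 1$) this is $\approx \varepsilon_0 + H\delta$, linear growth in the horizon; for $L > 1$ it grows exponentially, which bounds how far ahead latent planning stays reliable.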
  5. Loss: $\mathcal{L} = \sum_\text{spatial} \|f_\text{student}(x) - g(f_\text{teacher}(x))\|^2$ (often cosine or MSE on normalized features, with a projection head $g$ aligning dimensions), summed over positions. Preserved: the teacher's relative feature geometry / semantic structure that the student has capacity to represent. Lost: distinctions exceeding student capacity, teacher-architecture-specific artifacts, and information in directions the projection discards. Distilled quality approaches the teacher when the student has sufficient capacity, the distillation data covers the input distribution, and the projection adequately aligns the two feature spaces.

Looking ahead

Generative representations capture world structure. The next lecture examines how this world structure can be used not just for perception but for decision-making — building models that predict the future and allow planning without real-world interaction.

Week 12: World Models and Reinforcement Learning. We study the RSSM architecture, DreamerV3's imagination-based policy optimization, latent-space model predictive control, and the use of generative models as trajectory priors for offline RL.


Further reading

  • Chen, T., et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML. (SimCLR).
  • He, K., et al. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR. (MAE).