Purpose of this lecture
Pixel-space diffusion at high resolution (512×512 and above) is computationally prohibitive: a 512×512 RGB image has 786,432 dimensions, and each diffusion step applies the full denoising network to this entire tensor. Latent diffusion models (LDMs; Rombach et al., 2022) solve this by first compressing images into a lower-dimensional latent representation using a pretrained variational autoencoder (VAE), then running the diffusion process in latent space. This decoupling dramatically reduces computation while preserving perceptual quality. The same framework extends to audio, video, and multimodal generation by swapping the VAE encoder for domain-appropriate compression.
Latent diffusion models
The LDM architecture has two stages. First, a perceptual compression stage: a KL-regularized VAE (similar to Week 2 but designed for perceptual quality) is trained to compress images $x \in \mathbb{R}^{H \times W \times 3}$ to latents $z = \mathcal{E}(x) \in \mathbb{R}^{H/f \times W/f \times c}$, with $f$ the spatial downsampling factor (typically $f = 4$ or $f = 8$) and $c$ the number of latent channels. The decoder $\mathcal{D}$ reconstructs the image, $\hat{x} = \mathcal{D}(z)$. This stage is trained to minimize a combination of reconstruction loss, perceptual loss (LPIPS), adversarial loss, and KL regularization toward $\mathcal{N}(0, I)$.
Second, a latent generation stage: DDPM (or flow matching) is applied in the latent space $z$ rather than pixel space $x$. The denoising network is a U-Net operating on the latent tensors. Each spatial dimension is $f$ times smaller than in pixel space, so there are $f^2$ times fewer spatial positions, and the cost of a full self-attention layer, which is quadratic in the number of positions, falls by $f^4$. For $f = 8$, a 512×512 image with a 64×64 latent is $8^4 = 4096$ times cheaper per attention operation.
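As a rough check on these numbers, the sketch below computes the latent grid and the relative cost of one full self-attention layer. It assumes attention cost is proportional to the square of the number of spatial positions and ignores channel width and convolution cost; the function names are illustrative, not taken from any library.

```python
def latent_grid(height, width, f):
    """Spatial size of the latent for an image downsampled by factor f."""
    return height // f, width // f


def attention_cost(num_positions):
    """Pairwise-interaction count for full self-attention (constant factors dropped)."""
    return num_positions ** 2


h, w = latent_grid(512, 512, f=8)
pixel_positions = 512 * 512      # positions if attention ran in pixel space
latent_positions = h * w         # positions in latent space

speedup = attention_cost(pixel_positions) // attention_cost(latent_positions)
print(h, w, latent_positions, speedup)   # 64 64 4096 4096
```

The $f^4$ factor applies only to attention; convolutional layers scale with the number of positions and so gain roughly $f^2$.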
The two-stage separation is critical: the perceptual compression stage handles local, high-frequency details (textures, sharpness), which are expensive for a diffusion model to capture because doing so requires operating at full resolution. The diffusion stage handles semantic structure (layout, objects, lighting) at low resolution. The VAE decoder upsamples back to pixel quality.
Text conditioning via cross-attention
LDMs (specifically Stable Diffusion) condition the denoising U-Net on text prompts through cross-attention (Week 8). The text encoder is CLIP ViT-L/14 for SD 1.x and OpenCLIP ViT-H/14 for SD 2.x; SDXL uses CLIP ViT-L and OpenCLIP ViT-bigG jointly, and SD3 adds T5-XXL alongside the two CLIP-style encoders. The text embedding sequence is projected to match the U-Net's attention dimension and attended to at multiple resolutions.
SDXL (Podell et al., 2023) scales LDM in three ways: (1) a larger U-Net (2.6B parameters vs. 860M) with a second text encoder (OpenCLIP ViT-bigG); (2) conditioning on image size and crop coordinates as additional signals (enabling the model to learn resolution-appropriate generation); (3) a separate refiner model that applies additional denoising steps in latent space to improve high-frequency detail. Together these achieve substantially improved sample quality and prompt adherence over SD 1.x.
Stable Diffusion 3 replaces the U-Net with a diffusion transformer (DiT) and uses flow matching rather than DDPM. The DiT processes flattened image patches and text tokens jointly in a multimodal transformer, with the image and text representations attending to each other at every layer rather than only through cross-attention.
The Diffusion Transformer architecture
The diffusion transformer (DiT; Peebles & Xie, 2023) departs from the convolutional U-Net inductive bias by using a pure transformer backbone for latent diffusion. This shift is motivated by a key observation: U-Nets have strong local inductive biases (spatial convolution windows, hierarchical skip connections) that suit pixel-space processing. In the latent space of a VAE with spatial downsampling, however, the representation is already heavily compressed and semantic, so these biases become less critical and may even constrain model capacity.
| Patching | Conditioning | Scaling |
| --- | --- | --- |
| Latents are divided into non-overlapping $p \times p$ patches and linearly projected to a fixed model dimension $d$. | AdaLN-Zero layers inject timestep and class information by modulating the scale and shift of the layer normalization. | Replacing U-Net convolutions with transformers enables predictable performance gains as GFLOPs increase. |
Patch embedding and tokenization: The input latent $z \in \mathbb{R}^{h \times w \times c}$ is divided into non-overlapping patches of size $p \times p$ (e.g., $p = 2$), producing $N = hw / p^2$ tokens, each of dimension $p^2 c$. These patch embeddings are linearly projected to a fixed model dimension $d$, creating a sequence of spatial tokens analogous to text token embeddings in language models. For a $64 \times 64$ latent with $c = 4$ channels and $p = 2$, this produces $32 \times 32 = 1024$ tokens of dimension $2 \cdot 2 \cdot 4 = 16$, projected to dimension $d$.
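The patching step can be sketched in a few lines of plain Python. This is a toy illustration on nested lists, assuming an $(h, w, c)$ latent layout and a row-major patch order; real implementations use a strided convolution or tensor reshapes, and the learned linear projection to dimension $d$ is omitted. The name `patchify` is illustrative.

```python
def patchify(latent, p):
    """Split an (h, w, c) latent, stored as nested lists, into flattened p x p patches.

    Returns a list of h*w/p^2 tokens, each a flat list of p*p*c values.
    """
    h, w = len(latent), len(latent[0])
    assert h % p == 0 and w % p == 0, "patch size must divide the latent grid"
    tokens = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            token = []
            for i in range(top, top + p):
                for j in range(left, left + p):
                    token.extend(latent[i][j])   # append the c channel values
            tokens.append(token)
    return tokens


# Toy 64 x 64 latent with c = 4 channels, filled with zeros.
latent = [[[0.0] * 4 for _ in range(64)] for _ in range(64)]
tokens = patchify(latent, p=2)
print(len(tokens), len(tokens[0]))   # 1024 16
```

Halving the patch size quadruples the token count, which is why the patch size is one of the main compute knobs in a DiT.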
DiT block architecture: Each transformer block consists of:
- Time and conditioning modulation via AdaLN-Zero: the timestep $t$ is embedded to a vector and combined with the conditioning embedding (a class label or a pooled text embedding such as CLIP's; the original DiT sums the two). The combined conditioning vector is used to compute scale $\gamma$ and shift $\beta$ parameters that modulate the layer norm, plus a gating parameter $\alpha$ that controls block output strength.
- Multi-head self-attention: standard scaled dot-product attention over all patch tokens, with $Q$, $K$, $V$ projections from the normalized input.
- Pointwise MLP: a two-layer feedforward network applied independently to each token position.
The AdaLN-Zero gating mechanism is particularly elegant: each sublayer's output is computed as $x_{\text{out}} = x + \alpha \odot F(\mathrm{AdaLN}(x; \gamma, \beta))$, where $F$ is the attention or MLP sublayer and the gating parameter $\alpha$ is initialized to zero. This ensures that at initialization each block contributes zero to the residual stream, so every block (and hence the whole stack of blocks) starts as the identity function. Training then moves $\alpha$ away from zero, making this a form of learned residual-branch strength.
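A minimal sketch of the gated residual on a single token vector makes the identity-at-initialization property concrete. It assumes the modulation takes the form $\mathrm{LN}(x) \odot (1 + \gamma) + \beta$, a common parameterization that is an assumption here, and uses a toy stand-in for the attention or MLP sublayer; all names are illustrative.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_zero_residual(x, sublayer, gamma, beta, alpha):
    """Compute x + alpha * sublayer(LN(x) * (1 + gamma) + beta), elementwise."""
    modulated = [n * (1.0 + g) + b for n, g, b in zip(layer_norm(x), gamma, beta)]
    update = sublayer(modulated)
    return [xi + a * u for xi, a, u in zip(x, alpha, update)]


def toy_sublayer(v):
    """Stand-in for attention or the MLP: any function of the modulated token."""
    return [3.0 * vi + 1.0 for vi in v]


x = [0.5, -1.0, 2.0, 0.0]
zeros = [0.0] * len(x)

# With alpha = 0 (the initial value) the residual branch is switched off.
at_init = adaln_zero_residual(x, toy_sublayer, zeros, zeros, alpha=zeros)
print(at_init == x)      # True

# Once alpha moves away from zero, the sublayer starts to contribute.
later = adaln_zero_residual(x, toy_sublayer, zeros, zeros, alpha=[0.1] * len(x))
print(later != x)        # True
```

In a real block, $\gamma$, $\beta$, and $\alpha$ are produced per sample by a small network applied to the conditioning vector, with that network's final layer initialized to zero.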
Scale and model families: DiT-XL/2 (the largest configuration in Peebles & Xie, 2023) has 28 transformer blocks, patch size 2, and model dimension $d = 1152$, totaling 675M parameters. The "/2" denotes the patch size $p = 2$; the MLP hidden dimension is $4\times$ the model dimension, as is typical for transformers. This is comparable in capacity to the 860M-parameter SD 1.x U-Net but with fundamentally different architectural constraints. Stable Diffusion 3 uses a larger multimodal variant of this design (MM-DiT) rather than DiT-XL/2 itself.
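The 675M figure can be roughly reproduced from the block structure. The estimate below assumes $d = 1152$, 28 blocks, an MLP ratio of 4, and one conditioning-to-modulation linear layer per block that emits six $d$-dimensional vectors (shift, scale, and gate for each of the two sublayers). Biases, the patch and timestep embeddings, and the final layer are ignored, so it slightly undercounts.

```python
def dit_block_params(d, mlp_ratio=4):
    """Approximate weight count of one DiT block with AdaLN-Zero (biases ignored)."""
    attention = 4 * d * d            # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d * d      # two linear layers with hidden width mlp_ratio * d
    modulation = 6 * d * d           # conditioning -> shift, scale, gate for both sublayers
    return attention + mlp + modulation


depth, d = 28, 1152
total = depth * dit_block_params(d)
print(f"{total / 1e6:.0f}M")         # 669M
```

Nearly all of the parameters sit in the transformer blocks, and the count grows as depth times $d^2$.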
U-Net vs. DiT comparison: The architectural difference has training and scaling implications:
| Inductive Bias | Skip Connections | Scaling Law |
| --- | --- | --- |
| U-Nets use local convolutions; DiTs use global self-attention over learned patch positions. | U-Nets rely on hardwired encoder-decoder pairs; DiTs rely on residual blocks and global context. | DiTs follow predictable transformer scaling laws, making them more compute-optimal at scale. |
| Property | U-Net | DiT |
|---|---|---|
| Spatial inductive bias | Strong (local convolution kernels) | Weak (learned absolute/relative patch positions) |
| Skip connections | Yes (encoder–decoder pairs at multiple scales) | No (only residual connections within blocks) |
| Attention scope | Local at high resolution, global at low resolution (via downsampling) | Global at all scales (all patches attend to all patches) |
| Scaling behavior with model size | Less predictable; hierarchy must be rebalanced | Follows transformer scaling laws; more compute-optimal |
| Computational scaling | Attention cost dominated by low-resolution bottleneck | Attention cost quadratic in patch count; predictable |
| Used in production | SD 1.x (860M U-Net), SD 2.x (865M), SDXL (2.6B) | SD3 (2B), FLUX (12B), Sora |
The lack of long-range skip connections in DiT is counterintuitive, since they seem essential for preserving fine details, but empirical results show that DiT achieves comparable or better FID at similar model sizes and scales more predictably to very large models (FLUX at 12B parameters). The hypothesis is that the global attention at every layer allows the model to directly learn which information to propagate, obviating the need for hardwired skip paths.
Audio diffusion
Audio waveforms have different compression tradeoffs than images: raw audio at 44 kHz is extremely high-dimensional but contains significant redundancy. Two approaches dominate:
Spectrogram diffusion: convert audio to a mel-spectrogram (time-frequency representation), apply 2D diffusion in spectrogram space, then invert the spectrogram to audio using a pretrained vocoder (HiFi-GAN, WaveNet). This treats audio generation as image generation of spectrograms and inherits all the text-conditioning tools from image diffusion.
Latent audio diffusion: apply a 1D audio VAE to compress waveforms directly, then run diffusion in the latent waveform space. Stable Audio (Evans et al., 2024) uses this approach; AudioLDM is a hybrid that runs diffusion in the latent space of a mel-spectrogram VAE and decodes to a waveform with a vocoder. The compressed latent captures the audio's temporal structure at a much lower rate than the raw sample rate, enabling generation of minutes-long audio with manageable compute.
Text-to-audio diffusion models such as AudioLDM condition on CLAP embeddings, the audio analog of CLIP that aligns audio and text descriptions in a shared embedding space. (MusicGen, by contrast, generates audio tokens with an autoregressive language-model backbone rather than diffusion.) Sound effects, music, and speech can all be generated by conditioning the diffusion model on CLAP text embeddings.
Video diffusion
Video is spatially and temporally high-dimensional, requiring compression in both dimensions. Video LDMs extend the image VAE with temporal compression: a 3D VAE encodes a video clip of $T$ frames at resolution $H \times W$ to a spatiotemporal latent of shape $(T/f_t) \times (H/f) \times (W/f) \times c$, where $f_t$ is the temporal downsampling factor (typically 4). The denoising network must model spatiotemporal dependencies.
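The shape bookkeeping can be sketched as follows. The values $f_t = 4$, $f = 8$, and $c = 4$ are illustrative defaults, and exact divisibility is assumed; some 3D VAEs treat the first frame separately, which this sketch ignores.

```python
def video_latent_shape(frames, height, width, f_t=4, f=8, c=4):
    """Latent shape (T / f_t, H / f, W / f, c) for a clip, assuming exact divisibility."""
    assert frames % f_t == 0 and height % f == 0 and width % f == 0
    return frames // f_t, height // f, width // f, c


shape = video_latent_shape(frames=96, height=512, width=512)
print(shape)                                   # (24, 64, 64, 4)

raw_values = 96 * 512 * 512 * 3                # RGB values in the clip
latent_values = shape[0] * shape[1] * shape[2] * shape[3]
print(raw_values // latent_values)             # 192
```

Under these assumptions a 4-second, 24 fps clip shrinks by a factor of 192 before the diffusion model ever sees it.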
Architectural choices for temporal modeling: (1) 3D convolutions in the U-Net extend spatial convolutions to include temporal neighbors; (2) temporal attention adds attention over the time dimension at each spatial location; (3) video transformers (e.g., Sora's space-time transformer) process all space-time patches jointly through full 3D self-attention, enabling long-range temporal consistency at the cost of higher compute.
Consistency and temporal coherence: an image diffusion model applied frame by frame generates each frame independently, which produces flicker unless temporal structure is explicitly enforced. Temporal attention at each layer and temporal convolutions bias the model toward smooth temporal transitions. Training on video data with an optical flow loss further enforces coherence.
Temporal attention: the reshape trick and 3D convolutions
Efficiently modeling temporal dependencies in video LDMs requires careful architectural choices. Full spatiotemporal self-attention over all $T \cdot H \cdot W$ positions is prohibitively expensive ($O((THW)^2)$ complexity), so practical video models use decoupled or hierarchical temporal processing.
| Temporal Attention | 3D Convolutions | Temporal Embeds |
| --- | --- | --- |
| Reshaping the 5D video tensor to process time separately from space, reducing complexity to $O(T^2)$ per spatial location. | Replacing 2D $k \times k$ kernels with $k_t \times k \times k$ kernels (or decoupled $1 \times k \times k$ and $k_t \times 1 \times 1$) to learn local spatiotemporal correlations. | Injecting relative time information into the transformer blocks so the model can distinguish sequence order. |
Reshape trick for temporal attention: The most practical approach reshapes the feature tensor to process time separately from space. Given a 5D feature tensor of shape $(B, C, T, H, W)$ from a video diffusion model, reshape to $(B \cdot H \cdot W, T, C)$ to create a 3D tensor where each of the $B \cdot H \cdot W$ spatial positions is treated as an independent sequence of length $T$.
| Reshape Trick | 3D Convolution | Optical Flow |
| --- | --- | --- |
| Processing time separately from space reduces complexity from $O((THW)^2)$ to $O(HW \cdot T^2)$, enabling efficient temporal modeling. | Using $k_t \times k \times k$ kernels to capture local spatiotemporal correlations and enforce temporal smoothness across frames. | Auxiliary supervision using flow estimators (e.g., RAFT) to penalize flickering and ensure realistic motion patterns. |
Apply standard 1D self-attention along the $T$ dimension, then reshape back to $(B, C, T, H, W)$. This processes each spatial location independently across the temporal axis with $O(T^2)$ complexity per location, for a total cost of $O(HW \cdot T^2)$. For a $64 \times 64$ spatial resolution with $T = 16$ frames, this is about 1 million pairwise operations vs. the naive 4.3 billion for full 3D attention.
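The reshape itself is only an index permutation. The sketch below spells it out on a flat Python list, assuming a $(B, C, T, H, W)$ row-major layout; in a tensor library the same operation is a permute followed by a reshape. The function name is illustrative.

```python
def to_temporal_sequences(x, B, C, T, H, W):
    """Regroup a flat (B, C, T, H, W) tensor into B*H*W sequences of T tokens.

    Each token holds the C channel values at one (frame, pixel) location.
    """
    def at(b, c, t, h, w):
        return x[(((b * C + c) * T + t) * H + h) * W + w]

    return [
        [[at(b, c, t, h, w) for c in range(C)] for t in range(T)]
        for b in range(B) for h in range(H) for w in range(W)
    ]


B, C, T, H, W = 1, 2, 3, 2, 2
x = list(range(B * C * T * H * W))       # 24 distinct values so positions can be traced
seqs = to_temporal_sequences(x, B, C, T, H, W)

print(len(seqs), len(seqs[0]), len(seqs[0][0]))   # 4 3 2
print(seqs[0])                                    # [[0, 12], [4, 16], [8, 20]]
```

Each of the four sequences follows one pixel through the three frames, so attention over a sequence mixes information across time only.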
Alternating temporal and spatial attention: To capture spatiotemporal correlations (not just temporal ones at each location), models alternate temporal attention (reshape as above) with spatial attention (reshape to $(B \cdot T, H \cdot W, C)$, apply 2D spatial self-attention). A sequence of blocks alternating these operations gives an approximation to full 3D attention at $O(HW \cdot T^2 + T \cdot (HW)^2)$ cost per pair of layers. The temporal layers are $HW$ times cheaper than naive 3D attention (1000–10,000× for typical latent resolutions) and the spatial layers are $T$ times cheaper, which enables training on longer sequences (16–32 frames).
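To see where the savings come from, the sketch below counts pairwise attention interactions for each scheme. It assumes a $64 \times 64$ latent grid and 16 frames as an illustrative setting and drops constant factors, heads, and channel width.

```python
def full_3d_cost(T, H, W):
    """All T*H*W positions attend to each other."""
    return (T * H * W) ** 2


def temporal_cost(T, H, W):
    """Each of the H*W locations attends over its own T frames."""
    return H * W * T ** 2


def spatial_cost(T, H, W):
    """Each of the T frames attends over its own H*W locations."""
    return T * (H * W) ** 2


T, H, W = 16, 64, 64
full = full_3d_cost(T, H, W)
temporal = temporal_cost(T, H, W)
spatial = spatial_cost(T, H, W)

print(full // temporal)                       # 4096
print(full // spatial)                        # 16
print(round(full / (temporal + spatial), 1))  # 15.9
```

The temporal layers are nearly free; the cost of a factorized block is dominated by the per-frame spatial attention.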
3D convolution architecture: An alternative to attention is to use 3D convolutional layers that directly operate on the spatiotemporal volume. A 3D convolution kernel of size $k_t \times k_h \times k_w$ convolves over temporal and spatial neighbors simultaneously. Typical configurations include:
- $3 \times 3 \times 3$ kernels for motion-sensitive features (captures local temporal flow)
- $1 \times 3 \times 3$ kernels for spatial-only features (no temporal mixing, used in early video VAE stages)
- Dilated kernels with dilation in the temporal dimension to increase receptive field without increasing parameter count
The 3D convolution naturally enforces temporal smoothness (adjacent frames are mixed) and costs only a factor of $k_t$ more parameters than the corresponding 2D spatial convolution. However, the temporal receptive field grows slowly: with $L$ layers of 3D convolution using temporal kernel size $k_t$, the temporal receptive field spans $1 + L(k_t - 1)$ frames. For $k_t = 3$ and $L = 12$ layers, this is only 25 frames. Longer temporal dependencies require more layers or dilated temporal convolutions (e.g., dilation pattern 1, 2, 4, 8, so that the receptive field grows exponentially with depth).
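The receptive-field arithmetic is easy to check. The helper below assumes stride-1 temporal convolutions, where a layer with kernel size $k_t$ and dilation $d$ widens the receptive field by $d(k_t - 1)$ frames; the function name is illustrative.

```python
def temporal_receptive_field(kernel_size, dilations):
    """Frames seen by a stack of stride-1 temporal convolutions, one dilation per layer."""
    return 1 + sum(d * (kernel_size - 1) for d in dilations)


print(temporal_receptive_field(3, [1] * 12))       # 25
print(temporal_receptive_field(3, [1, 2, 4, 8]))   # 31
```

Four dilated layers cover more frames than twelve undilated ones, which is the motivation for the doubling pattern.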
Optical flow supervision: A powerful training technique for video diffusion is to add an auxiliary loss that encourages temporal coherence. Given a pretrained optical flow estimator $\Phi$ (e.g., RAFT), compute the reference optical flow between consecutive ground-truth frames: $F_i = \Phi(x_i, x_{i+1})$. Then during training, also estimate optical flow from the reconstructed frames, $\hat{F}_i = \Phi(\hat{x}_i, \hat{x}_{i+1})$, and minimize $\mathcal{L}_{\text{flow}} = \sum_i \lVert \hat{F}_i - F_i \rVert$ (for example with an L1 or squared L2 norm). This loss penalizes flickering and ensures that the model learns to generate frames that obey realistic motion patterns, improving temporal coherence without hand-labeled motion annotations.
Joint embedding spaces and multimodal generation
Unified multimodal LDMs generate across modalities by learning a shared latent space that multiple domain-specific encoders and decoders map to and from. Text, image, audio, and depth map encoders all map to a common embedding; a single diffusion model operates in this shared space; modality-specific decoders reconstruct the outputs.
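Composable diffusion enables combining multiple conditioning signals: generate an image conditioned on a text prompt, a reference image, and an edge map together by combining the scores from each conditioner. For conditions $c_1, \dots, c_n$, the composed score is:
$$\nabla_{x_t} \log p(x_t \mid c_1, \dots, c_n) \approx \nabla_{x_t} \log p(x_t) + \sum_{i=1}^{n} \left[ \nabla_{x_t} \log p(x_t \mid c_i) - \nabla_{x_t} \log p(x_t) \right]$$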
This approximation assumes conditional independence between conditioning signals, which holds approximately when conditioners target different aspects of the image (text → semantics, edge → structure, reference → style).
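As a sketch of how such a combination might look in code, assume the composed score is the unconditional score plus the sum of each conditioner's offset from it. The per-conditioner weights are an added assumption in the style of guidance scales; setting them all to 1 recovers the plain sum. Scores are plain lists of floats and all names are illustrative.

```python
def compose_scores(uncond, conds, weights=None):
    """Unconditional score plus weighted (conditional - unconditional) offsets."""
    if weights is None:
        weights = [1.0] * len(conds)
    composed = list(uncond)
    for weight, cond in zip(weights, conds):
        for k, (c, u) in enumerate(zip(cond, uncond)):
            composed[k] += weight * (c - u)
    return composed


uncond = [0.0, 1.0]        # score with no conditioning
text = [1.0, 1.0]          # text conditioner shifts the first coordinate
edge = [0.0, 3.0]          # edge conditioner shifts the second coordinate

print(compose_scores(uncond, [text, edge]))   # [1.0, 3.0]
```

In this toy case the two conditioners act on different coordinates, so their offsets add without interfering; when they push the same coordinate in conflicting directions, the sum is no longer a faithful joint score.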
CLAP: contrastive audio-language pretraining
CLAP (Elizalde et al., 2022) is the audio-language equivalent of CLIP, training audio and text encoders with a symmetric contrastive loss to align audio clips with their textual descriptions. Just as CLIP learns a joint vision-language embedding space, CLAP learns a joint audio-language space where audio clips and text descriptions with matching semantics are embedded near each other.
| Contrastive Loss | Audio Encoder | Mel-Spectrogram |
| --- | --- | --- |
| Symmetric audio-to-text and text-to-audio loss using in-batch negatives to align paired embeddings in a shared 512D space. | CNN or hybrid transformer backbones (PANN/HTS-AT) that process mel-spectrograms into normalized feature vectors. | A 2D time-frequency representation computed via STFT and log-mel filterbanks to match human auditory perception. |
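Training objective: The CLAP loss is formalized as:
$$\mathcal{L}_{\text{CLAP}} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(a_i^\top t_i / \tau)}{\sum_{j=1}^{N} \exp(a_i^\top t_j / \tau)} + \log \frac{\exp(a_i^\top t_i / \tau)}{\sum_{j=1}^{N} \exp(a_j^\top t_i / \tau)} \right]$$
where $a_i$ and $t_j$ are L2-normalized embeddings for audio clip $i$ and description $j$, and $\tau$ is a temperature parameter. The loss is symmetric (both audio→text and text→audio) and uses in-batch negatives, so a batch of $N$ paired examples produces $N$ positive pairs and $N(N-1)$ negative pairs for each direction.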
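A small pure-Python version of a symmetric contrastive loss of this kind is sketched below. It assumes the embeddings are already L2-normalized and uses a fixed temperature of 0.07 as an illustrative value; in practice the temperature is usually learned. Function names are illustrative.

```python
import math


def mean_diag_log_softmax(logits):
    """Mean log-probability that each row assigns to its matching (diagonal) entry."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)                                       # subtract max for stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += row[i] - log_z
    return total / len(logits)


def clap_loss(audio, text, tau=0.07):
    """Symmetric in-batch contrastive loss over paired audio/text embeddings."""
    n = len(audio)
    sim = [
        [sum(a * t for a, t in zip(audio[i], text[j])) / tau for j in range(n)]
        for i in range(n)
    ]
    sim_t = [list(col) for col in zip(*sim)]               # text-to-audio direction
    return -0.5 * (mean_diag_log_softmax(sim) + mean_diag_log_softmax(sim_t))


aligned = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]

print(clap_loss(aligned, aligned) < 1e-3)                         # True
print(abs(clap_loss(collapsed, collapsed) - math.log(2)) < 1e-9)  # True
```

Perfectly aligned, mutually orthogonal pairs drive the loss toward zero, while embeddings that cannot be told apart give $\log N$, the loss of a uniform guess over the batch.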
Audio encoder architecture: The audio encoder typically operates on mel-spectrograms and is implemented as a CNN such as PANN (Pretrained Audio Neural Networks) or a transformer such as HTS-AT (Hierarchical Token-Semantic Audio Transformer). These models are pretrained on large-scale audio tagging datasets and transfer well to CLAP training. The encoder outputs an embedding vector of dimension 512 or 1024, L2-normalized for use in the contrastive loss.
Mel-spectrogram computation: The audio signal is converted to a mel-spectrogram in a standardized pipeline:
- STFT (Short-Time Fourier Transform): apply a window function (Hann window of size $N_{\text{fft}}$, typically 2048 samples) with hop size $h$ (typically 512 samples) to the raw waveform at sample rate $f_s$ (44 kHz or 16 kHz), producing a complex spectrogram $X[k, m]$ where $k$ indexes frequency bins and $m$ indexes time frames. Compute the magnitude spectrogram $|X[k, m]|$.
- Mel filterbank: apply a mel-scale filterbank $M \in \mathbb{R}^{n_{\text{mels}} \times K}$, where $n_{\text{mels}}$ is the number of mel-scale bands (typically 64 or 128) and $K = N_{\text{fft}}/2 + 1$ is the number of frequency bins. The mel scale compresses high frequencies perceptually (humans are less sensitive to differences above ~1 kHz). The filterbank is a matrix of overlapping triangular filters spaced on the mel scale.
- Log magnitude: apply log compression, $S = \log(M\,|X| + \epsilon)$, to approximate the perceptual loudness scale. The result is an $n_{\text{mels}} \times T_{\text{frames}}$ image (mel-spectrogram) that serves as input to the audio encoder.
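The shapes in this pipeline follow from a little arithmetic, sketched below. The mel conversion uses the HTK-style formula $2595 \log_{10}(1 + f/700)$, one of several conventions in use, and the frame count depends on the padding convention: `center=True` assumes the signal is padded by half a window on each side, `center=False` assumes no padding. Function names and defaults are illustrative.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style mel scale; other conventions differ slightly."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def spectrogram_shape(num_samples, n_fft=2048, hop=512, n_mels=128, center=True):
    """Return (n_mels, frames) for a log-mel spectrogram of num_samples samples."""
    if center:
        frames = 1 + num_samples // hop              # padded by n_fft // 2 on each side
    else:
        frames = 1 + (num_samples - n_fft) // hop    # only fully covered windows
    return n_mels, frames


print(2048 // 2 + 1)                                 # 1025 frequency bins
print(spectrogram_shape(16000))                      # (128, 32)
print(spectrogram_shape(16000, center=False))        # (128, 28)
print(round(hz_to_mel(1000.0)))                      # 1000

# The octave from 4 kHz to 8 kHz spans fewer mels than the range 0 to 4 kHz.
print(hz_to_mel(8000.0) - hz_to_mel(4000.0) < hz_to_mel(4000.0))   # True
```

One second of 16 kHz audio thus becomes a $128 \times 32$ image under these settings, small enough to treat with the same 2D machinery as image latents.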
Conditioning in AudioLDM: AudioLDM uses CLAP embeddings for text-to-audio generation by conditioning the diffusion model on text embeddings via cross-attention, identical to how Stable Diffusion uses CLIP embeddings. At inference, a user provides a text description such as "heavy rain on a metal roof" or "jazz saxophone solo in a crowded bar," the text is encoded with the CLAP text encoder to produce a 512-dimensional embedding $e_{\text{text}}$, and this embedding is fed into the cross-attention layers of the audio diffusion U-Net. The diffusion process then generates an audio spectrogram that matches the description, which is converted back to a waveform via a vocoder.
Cross-course context: latent compression and tokenization in generative models
A unifying theme emerges across all the courses in this sequence. Both language models and diffusion models solve the same fundamental problem: raw signals (text bytes, image pixels, audio samples) are too high-dimensional for standard sequence models, so they must be compressed into a lower-dimensional token-like representation.
| Architecture/concept | Course 3 (Generative Models) | Course 1 (Reinforcement Learning) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| Latent compression | VAE latent space ($8\times$ spatial downsampling reduces 512×512 to 64×64) | State abstraction / representation learning | Robot proprioceptive + exteroceptive compression (C2W8 CVAE latent space) | Visual tokenization: ViT patch embeddings as visual "tokens" |
| Patch-based processing | DiT: image latent divided into $p \times p$ patches (typically $p = 2$) | Trajectory patches in decision transformer | ACT (Action Chunking with Transformers): chunk of actions as single prediction unit (temporal patch) | ViT: image divided into fixed-size patches, each embedded as a token (C4W1) |
| Temporal modeling | Video LDM: 3D convolutions + temporal attention (reshape trick) | MDP (Markov decision process): temporal credit assignment via discount factor | Robot trajectory temporal modeling: LSTM/Transformer over history of proprioceptive states | Video VLMs: temporal transformer for frame-by-frame video-language alignment |
| Multi-encoder conditioning | SDXL: CLIP-L text encoder + OpenCLIP-bigG text encoder + size/crop conditioning signals | Multi-objective RL: combining reward signal and constraint satisfaction signals | ACT: DINO visual encoder + proprioceptive encoder + language instruction encoder | BLIP-2 Q-Former: bridging frozen vision encoder (ViT) and frozen language model (T5) |
The analogy between latent diffusion and language model tokenization is profound: both operations project a raw, high-dimensional signal into a lower-dimensional space (discrete for text tokens, continuous but compact for VAE latents) where sequence models can operate efficiently. This is why the same transformer architecture (DiT, GPT) works effectively in both domains: both operate on compressed, semantic-level tokens rather than raw bits.
In Course 4 (Vision–Language Models), the ViT backbone (Week 1) learns a similar patch-based visual compression: an image is divided into fixed-size patches, each embedded to a 768-dimensional token, and these patch tokens are processed by a transformer. The full text-to-image generation pipeline in Stable Diffusion can be understood as a vision encoder (VAE compresses pixels to latent tokens), a language-vision bridge (CLIP text encoder + cross-attention), and a generative decoder (U-Net or DiT generates latent tokens, VAE decoder upsamples to pixels). This is the same functional structure as BLIP-2 (frozen ViT → visual tokens → Q-Former bridge → frozen LLM), without the language generation head. This structural homology suggests that as generative models scale, they will continue to converge on a common bottleneck: learning efficient, semantically meaningful compressed representations of their respective domains.
Key takeaways
LDMs decouple perceptual compression (VAE) from semantic generation (diffusion); with downsampling factor $f$, the diffusion model sees $f^2$ times fewer spatial positions and attention cost falls by roughly $f^4$ relative to pixel-space diffusion. Text conditioning via cross-attention over CLIP/T5 embeddings enables prompt-guided generation; SDXL improves this with a larger model and additional conditioning signals. The diffusion transformer (DiT) replaces U-Net convolutions with pure transformer attention, scaling more predictably to very large models. Audio diffusion operates on spectrograms or compressed audio latents, conditioned via CLAP embeddings. Video diffusion requires temporal modeling through 3D convolutions or temporal attention (via the reshape trick); optical flow losses enforce temporal coherence. Joint embedding spaces enable multimodal generation; composable diffusion combines multiple independent conditioning signals by linearly summing their score contributions.
Conceptual questions
- An LDM uses a VAE with spatial downsampling factor $f = 8$ to compress $512 \times 512$ images to $64 \times 64$ latents with $c = 4$ channels. (a) Compute the compression ratio in pixels. (b) A U-Net with full spatial attention at the $64 \times 64$ resolution has an attention matrix of size $4096 \times 4096$. Compare the memory and FLOPs for this attention to the same operation applied in pixel space at $512 \times 512$. (c) Explain qualitatively why fine texture details survive the VAE encoding-decoding despite the high compression ratio.
- Stable Diffusion's VAE uses a KL regularization term toward $\mathcal{N}(0, I)$ on the latent space, while some video LDMs use a VQ-VAE (vector-quantized) encoder. Compare the two approaches: (a) What does KL regularization ensure about the latent space geometry? (b) What does VQ discretization enable that continuous KL regularization does not? (c) For temporal consistency in video generation, which approach is preferable and why?
- Composable diffusion combines conditioning signals by summing their individual score contributions. Show that this approximation corresponds to assuming conditional independence between conditioners given the noised image. Construct a scenario where this independence assumption fails badly (where the combined effect of two conditioners is not the sum of their individual effects), and describe the artifact that would appear in the generated image.
- Temporal attention in video diffusion models attends over the time dimension at each spatial location independently. Compare this to a full 3D space-time attention mechanism. (a) What temporal artifacts would temporal-only attention produce that space-time attention avoids? (b) For a video with $T$ frames and spatial resolution $H \times W$, compare the memory cost of temporal-only attention vs. space-time attention.
- A video diffusion model generates 4-second clips at 24 fps (96 frames) by running diffusion in a spatiotemporal latent space with temporal downsampling $f_t = 4$ (24 latent time steps). Describe the key inference-time challenge when generating longer videos (e.g., 30 seconds) and propose two strategies for extending the generation window beyond the training clip length.
- The DiT architecture removes skip connections entirely, relying only on residual connections within blocks. Explain why this architectural choice might be less problematic for DiT than for convolutional U-Nets, considering the receptive field and attention mechanisms in play.
- A CLAP-conditioned audio diffusion model generates 10 seconds of audio at 44 kHz (440,000 samples). Assuming a mel-spectrogram with $n_{\text{mels}}$ mel bands and hop size $h$ (for example, 128 and 512), compute the spectrogram shape. How does this compare to the spatial dimensions of an $H \times W$ image in terms of total tokens after patch embedding with patch size $p$?
Looking ahead
With the major model families and conditioning mechanisms established, a practical question arises: how do we know if a generative model is actually good?
Week 10: Evaluating Generative Models. We examine bits-per-dimension and FID computation, analyze the known failure modes of automatic metrics, survey human preference evaluation methods, and discuss the precision-recall tradeoff that quantifies sample quality vs. diversity.
Further reading
- Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR. (The Stable Diffusion paper).
- Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV. (The DiT architecture).