Week 9: Latent Diffusion and Multimodal Generation

✦Learning Outcomes
  • Compare the computational tradeoffs between pixel-space and latent-space diffusion
  • Implement text conditioning via cross-attention in latent diffusion
  • Describe how the same framework extends to audio and video generation
◆Prerequisites
  • Week 2: Variational Autoencoders - VAE (Variational Autoencoder) architecture
  • Week 6: DDPM - Diffusion in latent space context
  • Week 8: Conditioning - Cross-attention conditioning

Understanding of perceptual compression concepts is helpful.

Purpose of this lecture

Pixel-space diffusion at high resolution (512×512 and above) is computationally prohibitive: a 512×512 RGB image has 786,432 dimensions, and each diffusion step applies the full denoising network to this entire tensor. Latent diffusion models (LDMs; Rombach et al., 2022) solve this by first compressing images into a lower-dimensional latent representation using a pretrained VAE (Variational Autoencoder), then running the diffusion process in latent space. This decoupling dramatically reduces computation while preserving perceptual quality. The same framework extends to audio, video, and multimodal generation by swapping the image VAE for domain-appropriate compression.


Latent diffusion models

The LDM architecture has two stages. First, a perceptual compression stage: a KL-regularized VAE (similar to Week 2 but designed for perceptual quality) is trained to compress images $x \in \mathbb{R}^{H \times W \times 3}$ to latents $z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}$, with $f = H/h = W/w$ the spatial downsampling factor (typically $f = 4$ or $f = 8$) and $c$ the number of latent channels. The decoder $\mathcal{D}(z)$ reconstructs the image. This stage is trained to minimize a combination of reconstruction loss, perceptual loss (LPIPS), adversarial loss, and KL regularization toward $\mathcal{N}(0, I)$.

Second, a latent generation stage: DDPM (or flow matching) is applied in the latent space $z$ rather than pixel space $x$. The denoising network $\epsilon_\theta(z_t, t, c)$ is a U-Net operating on the latent tensors (here $c$ denotes the conditioning signal, not the channel count). The latent has $f^2$ times fewer spatial positions than the image, and because self-attention cost grows quadratically with the number of positions, the cost of an attention layer over all positions drops by a factor of $f^4$. For $f = 8$, a 512×512 image maps to a $64 \times 64$ latent, so each such attention operation is about $8^4 = 4096$ times cheaper.
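The savings quoted above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming $c = 4$ latent channels (the value used later in this lecture) and counting attention cost simply as the square of the number of spatial positions; the function names are illustrative.

```python
def latent_shape(height, width, f, c):
    """Shape of the latent produced by a VAE with spatial downsampling factor f."""
    return (height // f, width // f, c)


def attention_cost_ratio(height, width, f):
    """Pixel-space vs. latent-space self-attention cost, counted as N^2 per layer."""
    n_pixel = height * width
    n_latent = (height // f) * (width // f)
    return (n_pixel ** 2) / (n_latent ** 2)


pixel_dims = 512 * 512 * 3
h, w, c = latent_shape(512, 512, f=8, c=4)
latent_dims = h * w * c

print("pixel dims:", pixel_dims)                     # 786432
print("latent dims:", latent_dims)                   # 16384
print("compression:", pixel_dims / latent_dims)      # 48.0
print("attention ratio:", attention_cost_ratio(512, 512, 8))  # 4096.0
```

The convolutional layers of the network scale only with the number of positions, so their cost drops by roughly $f^2$ rather than $f^4$; the $f^4$ figure applies to attention over all positions.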

The two-stage separation is critical: the perceptual compression stage handles local, high-frequency details (textures, sharpness), which are expensive for a diffusion model to capture because doing so means running it at full pixel resolution. The diffusion stage handles semantic structure (layout, objects, lighting) at low resolution. The VAE decoder then maps the latent back to full-resolution pixels.


Text conditioning via cross-attention

LDMs (specifically Stable Diffusion) condition the denoising U-Net on text prompts through cross-attention (Week 8). The text encoder is CLIP ViT-L/14 in SD 1.x and OpenCLIP ViT-H/14 in SD 2.x; SDXL combines the outputs of CLIP ViT-L and OpenCLIP ViT-bigG, and SD3 adds T5-XXL alongside its two CLIP encoders. The text embedding sequence $c = \tau_\theta(\text{text})$ is projected to match the U-Net's attention dimension and attended to at multiple resolutions.
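As a rough illustration of the mechanism, the sketch below implements a single cross-attention head in NumPy: queries come from the latent feature map, keys and values come from the text embedding sequence. The sizes (77 text tokens of width 768, 320 latent channels, a 16×16 feature map) and the random weights are illustrative assumptions, and the output projection and multi-head split used in real models are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: latent positions attend to text tokens."""
    q = latent_tokens @ w_q                    # (N, d_attn) queries from the image latent
    k = text_tokens @ w_k                      # (M, d_attn) keys from the text
    v = text_tokens @ w_v                      # (M, d_attn) values from the text
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, M) scaled dot products
    weights = softmax(scores, axis=-1)         # each latent position: distribution over text tokens
    return weights @ v, weights


rng = np.random.default_rng(0)
n_latent, n_text = 16 * 16, 77                 # illustrative sizes, not a specific model's
d_latent, d_text, d_attn = 320, 768, 64

latent_tokens = rng.normal(size=(n_latent, d_latent))
text_tokens = rng.normal(size=(n_text, d_text))
w_q = rng.normal(size=(d_latent, d_attn)) / np.sqrt(d_latent)
w_k = rng.normal(size=(d_text, d_attn)) / np.sqrt(d_text)
w_v = rng.normal(size=(d_text, d_attn)) / np.sqrt(d_text)

out, weights = cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)                # (256, 64) (256, 77)
```

Note that the attention matrix is $N \times M$ (latent positions by text tokens), so its cost grows only linearly with the number of latent positions, unlike spatial self-attention.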

SDXL (Podell et al., 2023) scales LDM in three ways: (1) larger U-Net (2.6B parameters vs. 860M) with a second text encoder (OpenCLIP ViT-G); (2) conditioning on image size and crop coordinates as additional signals (enabling the model to learn resolution-appropriate generation); (3) a separate refiner model that applies additional denoising steps in latent space to improve high-frequency detail. Together these achieve substantially improved sample quality and prompt adherence over SD 1.x.

Stable Diffusion 3 replaces the U-Net with a diffusion transformer (DiT) and uses flow matching rather than DDPM. The DiT processes flattened image patches and text tokens jointly in a multimodal transformer, with the image and text representations attending to each other at every layer rather than only through cross-attention.


The Diffusion Transformer architecture

The diffusion transformer (DiT; Peebles & Xie, 2023) departs from the convolutional U-Net inductive bias by using a pure transformer backbone for latent diffusion. This shift is motivated by a key observation: U-Nets have strong local inductive biases (spatial convolution windows, hierarchical skip connections) that are well suited to pixel-space processing. However, in the latent space of a VAE with $f = 8$ downsampling, the spatial structure is already heavily compressed and semantic, so U-Net biases become less critical and may even constrain model capacity.

| Patching | Conditioning | Scaling |
| --- | --- | --- |
| Latents $z$ are divided into $p \times p$ non-overlapping patches and linearly projected to a fixed model dimension $d$. | AdaLN-Zero layers inject timestep and class information by modulating the scale and shift of the layer normalization. | Replacing U-Net convolutions with transformers enables predictable performance gains as compute (Gflops) increases. |

Patch embedding and tokenization: The input latent $z \in \mathbb{R}^{h \times w \times c}$ is divided into non-overlapping patches of size $p \times p$ (e.g., $p = 2$), producing $(h/p)(w/p)$ tokens, each of dimension $p^2 c$. These patch embeddings are linearly projected to a fixed model dimension $d$, creating a sequence of spatial tokens analogous to text token embeddings in language models. For a $64 \times 64$ latent with $c = 4$ channels and $p = 2$, this produces $32 \times 32 = 1024$ tokens of dimension $4 \cdot 4 = 16$, projected to dimension $d$.
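The patching step can be sketched directly with array reshapes. The following is a minimal NumPy illustration, assuming a channels-last latent of shape $(h, w, c)$; `patchify` and `unpatchify` are illustrative names, and the random projection matrix stands in for the learned linear embedding.

```python
import numpy as np


def patchify(z, p):
    """Split an (h, w, c) latent into (h/p * w/p) tokens of dimension p*p*c."""
    h, w, c = z.shape
    assert h % p == 0 and w % p == 0, "patch size must divide the latent size"
    z = z.reshape(h // p, p, w // p, p, c)     # split each spatial axis into (blocks, within-block)
    z = z.transpose(0, 2, 1, 3, 4)             # (h/p, w/p, p, p, c): group each patch's pixels together
    return z.reshape((h // p) * (w // p), p * p * c)


def unpatchify(tokens, h, w, c, p):
    """Inverse of patchify: rebuild the (h, w, c) latent from its tokens."""
    z = tokens.reshape(h // p, w // p, p, p, c)
    z = z.transpose(0, 2, 1, 3, 4)             # back to (h/p, p, w/p, p, c)
    return z.reshape(h, w, c)


rng = np.random.default_rng(0)
z = rng.normal(size=(64, 64, 4))               # latent for a 512x512 image with f = 8, c = 4
tokens = patchify(z, p=2)
print(tokens.shape)                            # (1024, 16)

d_model = 1152                                 # model dimension; the projection is learned in practice
w_embed = rng.normal(size=(tokens.shape[1], d_model)) / np.sqrt(tokens.shape[1])
embedded = tokens @ w_embed
print(embedded.shape)                          # (1024, 1152)
```

The transpose before the final reshape matters: reshaping without it would mix values from different patches into the same token.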

DiT block architecture: Each transformer block consists of:

  1. Time and conditioning modulation via AdaLN-Zero: timestep $t$ is embedded to a vector and combined with the conditioning embedding $c$ (a class-label embedding in the original DiT, where the two vectors are summed, or a pooled text embedding such as CLIP's in text-conditioned variants); this combined conditioning vector is used to compute scale $\gamma(t, c)$ and shift $\delta(t, c)$ parameters that modulate the layer norm, plus a gating parameter $\alpha(t, c)$ that controls block output strength.
  2. Multi-head self-attention: standard scaled dot-product attention over all patch tokens, with Q,K,VQ, K, VQ,K,V projections from the normalized input.
  3. Pointwise MLP: a two-layer feedforward network applied independently to each token position.

The AdaLN-Zero gating mechanism is particularly elegant. Each sub-layer sits in its own gated residual branch, $x \leftarrow x + \alpha_1 \cdot \text{Attention}(\text{AdaLN}_1(x, t, c))$ followed by $x \leftarrow x + \alpha_2 \cdot \text{MLP}(\text{AdaLN}_2(x, t, c))$, where the layer that predicts the gates $\alpha_1, \alpha_2$ is initialized so that they start at zero. At initialization each block therefore contributes nothing to the residual stream and acts as the identity function. Training then moves the gates away from zero, making this a form of learned residual-branch strength.
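A compact way to see the zero-initialization property is to build a toy block and check that it returns its input unchanged. The sketch below is a simplified NumPy version under several assumptions: a single attention head, ReLU in place of GELU, no learned layer-norm parameters, and separately gated attention and MLP branches as in the original DiT paper. All names (`dit_block`, `w_mod`, and so on) are illustrative.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def modulate(x, gamma, delta):
    """Adaptive layer norm: scale and shift come from the conditioning vector."""
    return layer_norm(x) * (1.0 + gamma) + delta


def self_attention(x, w_qkv, w_o):
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    s = q @ k.T / np.sqrt(q.shape[-1])
    s = np.exp(s - s.max(axis=-1, keepdims=True))
    return (s / s.sum(axis=-1, keepdims=True)) @ v @ w_o


def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2        # ReLU stands in for GELU


def dit_block(x, cond, params):
    """x: (N, d) tokens; cond: (d,) combined timestep + conditioning embedding."""
    mod = cond @ params["w_mod"] + params["b_mod"]     # (6d,) modulation vector
    g1, d1, a1, g2, d2, a2 = np.split(mod, 6)          # scale, shift, gate for each sub-layer
    x = x + a1 * self_attention(modulate(x, g1, d1), params["w_qkv"], params["w_o"])
    x = x + a2 * mlp(modulate(x, g2, d2), params["w1"], params["w2"])
    return x


rng = np.random.default_rng(0)
n_tokens, d = 16, 8
params = {
    "w_mod": np.zeros((d, 6 * d)),             # zero init: gates, scales, shifts all start at 0
    "b_mod": np.zeros(6 * d),
    "w_qkv": rng.normal(size=(d, 3 * d)),
    "w_o": rng.normal(size=(d, d)),
    "w1": rng.normal(size=(d, 4 * d)),
    "w2": rng.normal(size=(4 * d, d)),
}
x = rng.normal(size=(n_tokens, d))
cond = rng.normal(size=d)

print(np.allclose(dit_block(x, cond, params), x))      # True: the block is the identity at init
```

Once the modulation weights move away from zero during training, the gates open and the attention and MLP branches start contributing to the residual stream.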

Scale and model families: DiT-XL/2, the largest configuration in Peebles & Xie (2023), has 28 transformer blocks, patch size 2, and model dimension $d = 1152$, totaling 675M parameters. The "/2" denotes the patch size $p = 2$; the DiT-XL/4 and DiT-XL/8 configurations use larger patches and therefore fewer tokens. This is comparable in capacity to the 860M-parameter SD 1.x U-Net, and considerably smaller than the 2.6B-parameter SDXL U-Net, but with fundamentally different architectural constraints. Stable Diffusion 3 uses a larger multimodal variant of this design rather than DiT-XL/2 itself.

U-Net vs. DiT comparison: The architectural difference has training and scaling implications:

| Inductive Bias | Skip Connections | Scaling Law |
| --- | --- | --- |
| U-Nets use local convolutions; DiTs use global self-attention over patch tokens with positional embeddings. | U-Nets rely on hardwired encoder-decoder pairs; DiTs rely on residual blocks and global context. | DiTs follow predictable transformer scaling laws, making them more compute-optimal at scale. |

| Property | U-Net | DiT |
|---|---|---|
| Spatial inductive bias | Strong (local convolution kernels) | Weak (position information comes only from patch positional embeddings) |
| Skip connections | Yes (encoder–decoder pairs at multiple scales) | No (only residual connections within blocks) |
| Attention scope | Local at high resolution, global at low resolution (via downsampling) | Global at all scales (all patches attend to all patches) |
| Scaling behavior with model size | Less predictable; hierarchy must be rebalanced | Follows transformer scaling laws; more compute-optimal |
| Computational scaling | Attention cost dominated by low-resolution bottleneck | Attention cost quadratic in patch count at every layer; predictable |
| Used in production | SD 1.x (860M U-Net), SD 2.x (865M), SDXL (2.6B) | SD3 (2B), FLUX (12B), Sora |

The lack of skip connections in DiT is counterintuitive — they seem essential for preserving fine details — but empirical results show that DiT achieves comparable or better FID at similar model sizes, and scales more predictably to very large models (FLUX at 12B parameters). The hypothesis is that the global attention at every layer allows the model to directly learn which information to propagate, obviating the need for hardwired skip paths.


Audio diffusion

Audio waveforms have different compression tradeoffs than images: raw audio at 44.1 kHz is extremely high-dimensional (tens of thousands of samples per second) but contains significant redundancy. Two approaches dominate:

Spectrogram diffusion: convert audio to a mel-spectrogram (time-frequency representation), apply 2D diffusion in spectrogram space, then invert the spectrogram to audio using a pretrained vocoder (HiFi-GAN, WaveNet). This treats audio generation as image generation of spectrograms and inherits all the text-conditioning tools from image diffusion.

Latent audio diffusion: apply a 1D audio VAE to compress waveforms directly, then run diffusion in the latent waveform space. Stable Audio (Evans et al., 2024) uses this approach. AudioLDM is a hybrid of the two: it compresses mel-spectrograms with a VAE, runs diffusion in that latent space, and relies on a vocoder to return to a waveform. In both cases the compressed latent captures the audio's temporal structure at a much lower rate than the raw sample rate, enabling generation of minutes-long audio with manageable compute.

Text-to-audio diffusion models such as AudioLDM condition on CLAP embeddings, the audio analog of CLIP that aligns audio and text descriptions in a shared embedding space. (MusicGen, often mentioned alongside them, instead uses an autoregressive language model backbone over discrete audio tokens.) Sound effects, music, and speech can all be generated by conditioning the diffusion model on CLAP text embeddings.


Video diffusion

Video is spatially and temporally high-dimensional, requiring compression in both dimensions. Video LDMs extend the image VAE with temporal compression: a 3D VAE encodes a video clip of $T$ frames at $H \times W$ resolution to a spatiotemporal latent of shape $T/f_t \times h \times w \times c$, where $f_t$ is the temporal downsampling factor (typically 4). The denoising network must model spatiotemporal dependencies.
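To get a feel for the sizes involved, the short sketch below computes the latent shape and overall compression ratio for a clip. The spatial factor $f = 8$ and $c = 4$ channels are assumptions carried over from the image case, and the function name is illustrative; real video VAEs differ in these choices.

```python
def video_latent_shape(T, H, W, f_t=4, f_s=8, c=4):
    """Latent shape (T/f_t, h, w, c) for a clip of T frames at H x W resolution."""
    return (T // f_t, H // f_s, W // f_s, c)


T, H, W = 96, 512, 512                     # 4 seconds at 24 fps
shape = video_latent_shape(T, H, W)
raw_values = T * H * W * 3                 # RGB pixels in the clip
latent_values = shape[0] * shape[1] * shape[2] * shape[3]

print(shape)                               # (24, 64, 64, 4)
print(raw_values // latent_values)         # 192
```

The ratio equals $f_t \cdot f^2 \cdot 3 / c$, so temporal compression multiplies the image-only compression ratio by $f_t$.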

Architectural choices for temporal modeling: (1) 3D convolutions in the U-Net extend spatial convolutions to include temporal neighbors; (2) temporal attention adds attention over the time dimension at each spatial location; (3) video transformers (e.g., Sora's space-time transformer) process all space-time patches jointly through full 3D self-attention, enabling long-range temporal consistency at the cost of higher compute.

Consistency and temporal coherence: an image diffusion model applied frame by frame treats each frame independently, so the output flickers unless temporal structure is explicitly built in. Temporal attention at each layer and temporal convolutions bias the model toward smooth temporal transitions. Training on video data with an optical flow loss can further encourage coherence.


Temporal attention: the reshape trick and 3D convolutions

Efficiently modeling temporal dependencies in video LDMs requires careful architectural choices. Full spatiotemporal self-attention over all $T \times H \times W$ positions is prohibitively expensive ($O(T^2 H^2 W^2)$ complexity), so practical video models use decoupled or hierarchical temporal processing.

| Temporal Attention | 3D Convolutions | Temporal Embeddings |
| --- | --- | --- |
| Reshaping the 5D video tensor to process time separately from space, reducing complexity to $O(T^2)$ per spatial location. | Replacing 2D kernels with $3 \times 3 \times 3$ kernels (or decoupled $1 \times 3 \times 3$ and $3 \times 1 \times 1$) to learn local spatiotemporal correlations. | Injecting relative time information into the transformer blocks so the model can distinguish sequence order. |

Reshape trick for temporal attention: The most practical approach reshapes the feature tensor to process time separately from space. Given a 5D feature tensor $x \in \mathbb{R}^{B \times T \times H \times W \times C}$ from a video diffusion model, move the time axis next to the channel axis and reshape to $[B \cdot H \cdot W, T, C]$, creating a 3D tensor where each of the $H \cdot W$ spatial positions is treated as an independent sequence of length $T$. Apply standard 1D self-attention along the $T$ dimension, then reshape back to $[B, T, H, W, C]$. This processes each spatial location independently across the temporal axis with $O(T^2)$ complexity per location, for a total cost of $O(H \cdot W \cdot T^2)$. For a $32 \times 32$ spatial resolution with $T = 16$ frames, this is $1024 \cdot 256 = 262{,}144$ attention entries vs. $(32 \cdot 32 \cdot 16)^2 \approx 268$ million for full 3D attention, a 1024× reduction.

| Reshape Trick | 3D Convolution | Optical Flow |
| --- | --- | --- |
| Processing time separately from space reduces complexity from $O((HTW)^2)$ to $O(HW T^2)$, enabling efficient temporal modeling. | Using $k_t \times k_h \times k_w$ kernels to capture local spatiotemporal correlations and enforce temporal smoothness across frames. | Auxiliary supervision using flow estimators (e.g., RAFT) to penalize flickering and ensure realistic motion patterns. |

Alternating temporal and spatial attention: To capture spatiotemporal correlations (not just temporal ones at each location), models alternate temporal attention (reshape as above) with spatial attention (reshape to $[B \cdot T, H \cdot W, C]$, apply spatial self-attention within each frame). A sequence of blocks alternating these operations approximates full 3D attention at $O(T^2 \cdot H \cdot W + H^2 \cdot W^2 \cdot T)$ cost. Relative to naive 3D attention this is cheaper by a factor of $THW/(T + HW)$, which is close to $T$ when $HW \gg T$: about 16× for a $32 \times 32$ feature map with $T = 16$ (roughly 17 million vs. 268 million attention entries). The saving grows with clip length, which makes training on longer sequences (16–32 frames) practical.
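The two reshapes can be sketched in NumPy as follows. This is a minimal illustration, assuming a channels-last $[B, T, H, W, C]$ tensor and using projection-free single-head attention (`attend`, an illustrative helper) so that the axis handling stays in focus. One detail worth noting: the time axis has to be moved with a transpose before reshaping to $[B \cdot H \cdot W, T, C]$, otherwise values from different locations end up in the same sequence.

```python
import numpy as np


def attend(seq):
    """Single-head self-attention without learned projections. seq: (batch, length, C)."""
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(seq.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ seq


def temporal_attention(x):
    """Attend across time independently at every spatial location."""
    B, T, H, W, C = x.shape
    seq = x.transpose(0, 2, 3, 1, 4).reshape(B * H * W, T, C)   # one length-T sequence per location
    out = attend(seq)
    return out.reshape(B, H, W, T, C).transpose(0, 3, 1, 2, 4)  # back to (B, T, H, W, C)


def spatial_attention(x):
    """Attend across all spatial positions independently within every frame."""
    B, T, H, W, C = x.shape
    seq = x.reshape(B * T, H * W, C)                            # one length-HW sequence per frame
    return attend(seq).reshape(B, T, H, W, C)


rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 3, 3, 8))
y = spatial_attention(temporal_attention(x))                    # one alternating block
print(y.shape)                                                  # (1, 4, 3, 3, 8)

# Attention-entry counts for T = 16 frames on a 32 x 32 feature map.
T, H, W = 16, 32, 32
temporal = H * W * T ** 2
spatial = T * (H * W) ** 2
full = (T * H * W) ** 2
print(temporal, spatial, full)                                  # 262144 16777216 268435456
print(round(full / (temporal + spatial), 2))                    # 15.75
```

Temporal attention alone is by far the cheapest term; in the alternating scheme the per-frame spatial attention dominates the cost.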

3D convolution architecture: An alternative to attention is to use 3D convolutional layers that directly operate on the spatiotemporal volume. A 3D convolution kernel of size $(k_t, k_h, k_w)$ convolves over temporal and spatial neighbors simultaneously. Typical configurations include:

  • $(3, 3, 3)$ kernels for motion-sensitive features (captures local temporal flow)
  • $(1, 3, 3)$ kernels for spatial-only features (no temporal mixing, used in early video VAE stages)
  • Dilated kernels with dilation $d_t > 1$ in the temporal dimension to increase receptive field without increasing parameter count

The 3D convolution biases the model toward temporal smoothness (adjacent frames are mixed), but it costs $k_t$ times more parameters and FLOPs than a 2D spatial convolution with the same spatial kernel and channel counts. The temporal receptive field also grows slowly: with $L$ layers of 3D conv using kernel size $k_t$, the temporal receptive field spans $(k_t - 1) \cdot L + 1$ frames. For $k_t = 3$ and $L = 12$ layers, this is only 25 frames. Longer temporal dependencies require more layers or dilated temporal convolutions (e.g., dilation pattern 1, 2, 4, 8, so that the receptive field grows exponentially with depth).
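The receptive-field arithmetic is easy to verify. The sketch below assumes stride-1 temporal convolutions, where each layer with dilation $d_t$ adds $(k_t - 1) \cdot d_t$ frames to the receptive field; the function name is illustrative.

```python
def temporal_receptive_field(kernel_size, dilations):
    """Frames seen by one output position after a stack of stride-1 temporal convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d        # each layer widens the field by (k_t - 1) * dilation
    return rf


plain = temporal_receptive_field(3, [1] * 12)
dilated = temporal_receptive_field(3, [1, 2, 4, 8] * 3)

print(plain)      # 25
print(dilated)    # 91
```

With the same 12 layers and the same parameter count, cycling the dilation pattern 1, 2, 4, 8 widens the temporal receptive field from 25 to 91 frames.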

Optical flow supervision: A powerful training technique for video diffusion is to add an auxiliary loss that encourages temporal coherence. Given a pretrained optical flow estimator (e.g., RAFT), compute reference optical flow between consecutive frames of the training video: $f_{t \to t+1}^* = \text{FlowEstimator}(x_t, x_{t+1})$, where $t$ here indexes frames rather than diffusion timesteps. Then during training, also estimate optical flow from the reconstructed frames $\hat{x}_t$ and minimize $$\mathcal{L}_\text{flow} = \sum_{t=1}^{T-1} \left\| \text{FlowEstimator}(\hat{x}_t, \hat{x}_{t+1}) - f_{t \to t+1}^* \right\|_2^2.$$ This loss penalizes flickering and encourages the model to generate frames that obey realistic motion patterns, improving temporal coherence without manually annotated motion labels.
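The structure of this loss can be sketched independently of any particular flow network. In the NumPy example below, `flow_estimator` is a hypothetical callable that maps two frames to a flow field, and `toy_flow` is only a stand-in (a frame difference, not real optical flow) so that the example runs without a pretrained model such as RAFT.

```python
import numpy as np


def flow_consistency_loss(flow_estimator, real_frames, generated_frames):
    """Sum over consecutive frame pairs of the squared flow mismatch."""
    loss = 0.0
    for t in range(len(real_frames) - 1):
        target = flow_estimator(real_frames[t], real_frames[t + 1])            # reference flow
        pred = flow_estimator(generated_frames[t], generated_frames[t + 1])    # flow of the output
        loss += float(np.sum((pred - target) ** 2))
    return loss


def toy_flow(frame_a, frame_b):
    """Stand-in for a real estimator: NOT optical flow, just a two-channel frame difference."""
    diff = frame_b - frame_a
    return np.stack([diff, diff], axis=-1)


rng = np.random.default_rng(0)
real = rng.normal(size=(5, 8, 8))                      # 5 grayscale frames of size 8 x 8
noisy = real + 0.1 * rng.normal(size=real.shape)       # imperfect reconstruction

print(flow_consistency_loss(toy_flow, real, real))     # 0.0
print(flow_consistency_loss(toy_flow, real, noisy) > 0.0)   # True
```

In an actual training loop the estimator would be a frozen, differentiable network so that gradients flow back through the generated frames; that part is outside the scope of this sketch.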


Joint embedding spaces and multimodal generation

Unified multimodal LDMs generate across modalities by learning a shared latent space that multiple domain-specific encoders and decoders map to and from. Text, image, audio, and depth map encoders all map to a common embedding; a single diffusion model operates in this shared space; modality-specific decoders reconstruct the outputs.

Composable diffusion enables combining multiple conditioning signals: generate an image conditioned on both a text prompt AND a reference image AND an edge map by combining the scores from each conditioner. The composed score:

$$\nabla_x \log p(x \mid c_1, c_2, c_3) \approx \nabla_x \log p(x) + s_1 \nabla_x \log p(c_1 \mid x) + s_2 \nabla_x \log p(c_2 \mid x) + s_3 \nabla_x \log p(c_3 \mid x)$$

Here $s_1, s_2, s_3$ are guidance weights for the individual conditioners. This approximation assumes the conditioning signals are conditionally independent given $x$, which holds approximately when conditioners target different aspects of the image (text → semantics, edge → structure, reference → style).
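In practice each term $\nabla_x \log p(c_i \mid x)$ can be obtained from a conditional and an unconditional score via Bayes' rule, $\nabla_x \log p(c_i \mid x) = \nabla_x \log p(x \mid c_i) - \nabla_x \log p(x)$, in the same way as classifier-free guidance. The sketch below applies that identity to a toy Gaussian case where the scores are known in closed form; the function name and the toy distributions are illustrative assumptions.

```python
import numpy as np


def compose_scores(score_uncond, scores_cond, weights):
    """Unconditional score plus a weighted sum of (conditional - unconditional) differences."""
    composed = score_uncond.copy()
    for s_i, score_i in zip(weights, scores_cond):
        composed += s_i * (score_i - score_uncond)     # s_i * grad log p(c_i | x)
    return composed


# Toy check: p(x) = N(0, I) and p(x | c_i) = N(mu_i, I), so the scores are linear in x.
x = np.array([0.5, -1.0])
mu_1 = np.array([2.0, 0.0])
mu_2 = np.array([0.0, 3.0])

score_uncond = -x
score_1 = -(x - mu_1)
score_2 = -(x - mu_2)

composed = compose_scores(score_uncond, [score_1, score_2], weights=[1.0, 1.0])
print(composed)                                        # [1.5 4. ]
print(np.allclose(composed, -(x - (mu_1 + mu_2))))     # True
```

In this toy case the composed score is that of a Gaussian centred at $\mu_1 + \mu_2$: each conditioner shifts the sample along its own direction, which is the behaviour the independence assumption relies on.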


CLAP: contrastive audio-language pretraining

CLAP (Elizalde et al., 2022) is the audio-language equivalent of CLIP, training audio and text encoders with a symmetric contrastive loss to align audio clips with their textual descriptions. Just as CLIP learns a joint vision-language embedding space, CLAP learns a joint audio-language space where audio clips and text descriptions with matching semantics are embedded near each other.

| Contrastive Loss | Audio Encoder | Mel-Spectrogram |
| --- | --- | --- |
| Symmetric audio-to-text and text-to-audio loss using in-batch negatives to align paired embeddings in a shared 512-dimensional space. | CNN or transformer backbones (PANN/HTS-AT) that process mel-spectrograms into normalized feature vectors. | A 2D time-frequency representation computed via STFT and log-mel filterbanks to match human auditory perception. |

Training objective: The CLAP loss is formalized as $$\mathcal{L}_\text{CLAP} = -\frac{1}{N}\sum_{i=1}^N \left[\log \frac{e^{a_i^\top t_i / \tau}}{\sum_j e^{a_i^\top t_j/\tau}} + \log \frac{e^{t_i^\top a_i/\tau}}{\sum_j e^{t_i^\top a_j/\tau}}\right]$$ where $a_i = \text{AudioEncoder}(x_i)$ and $t_i = \text{TextEncoder}(s_i)$ are L2-normalized embeddings for audio clip $x_i$ and description $s_i$, and $\tau$ is a temperature parameter. The first term normalizes over all texts $t_j$ for a fixed audio clip (audio→text) and the second over all audio clips $a_j$ for a fixed text (text→audio). The loss uses in-batch negatives, so a batch of $N$ paired examples produces $N$ positive pairs and $N(N-1)$ negative pairs for each direction.
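A symmetric contrastive loss of this kind takes only a few lines in NumPy. The sketch below is illustrative rather than the reference implementation: the temperature value 0.07 is an assumption (real models typically learn it), and the embeddings are random vectors standing in for encoder outputs. The audio→text term normalizes each row of the similarity matrix, the text→audio term each column.

```python
import numpy as np


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clap_loss(audio_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss over a batch of paired (audio, text) embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)   # L2-normalize
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / tau                          # logits[i, j] = a_i . t_j / tau
    audio_to_text = np.diag(log_softmax(logits, axis=1))   # normalize over texts j
    text_to_audio = np.diag(log_softmax(logits, axis=0))   # normalize over audio clips j
    return float(-np.mean(audio_to_text + text_to_audio))


rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 32))
text_matched = audio + 0.01 * rng.normal(size=audio.shape)   # well-aligned pairs
text_shuffled = np.roll(text_matched, shift=1, axis=0)       # every pair mismatched

print(clap_loss(audio, text_matched) < clap_loss(audio, text_shuffled))   # True
```

The positives sit on the diagonal of `logits`; every off-diagonal entry acts as an in-batch negative for both directions.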

Audio encoder architecture: The audio encoder typically operates on mel-spectrograms and is implemented either as a CNN such as PANN (from PANNs: Large-Scale Pretrained Audio Neural Networks) or as a transformer such as HTS-AT (Hierarchical Token-Semantic Audio Transformer). These models are pretrained on large-scale audio tagging datasets and transfer well to CLAP training. The encoder outputs an embedding vector of dimension 512 or 1024, L2-normalized for use in the contrastive loss.

Mel-spectrogram computation: The audio signal is converted to a mel-spectrogram in a standardized pipeline:

  1. STFT (Short-Time Fourier Transform): apply a window function (Hann window of size $w$, typically 2048 samples) with hop size $h$ (typically 512 samples) to the raw waveform at sample rate $f_s$ (44.1 kHz or 16 kHz), producing a complex spectrogram $X[k, t]$ where $k$ indexes frequency bins and $t$ indexes time frames. Compute the power spectrogram $|X[k, t]|^2$.
  2. Mel filterbank: apply a mel-scale filterbank $M \in \mathbb{R}^{F \times K}$ where $F$ is the number of mel-scale bands (typically 64 or 128) and $K$ is the number of frequency bins. The mel scale $m = 2595 \log_{10}(1 + f/700)$ compresses high frequencies perceptually (humans are less sensitive to differences above ~1 kHz). The filterbank is a matrix of overlapping triangular filters spaced on the mel scale.
  3. Log magnitude: apply log compression $\log(1 + \text{MelSpectrogram})$ to approximate the perceptual loudness scale. The result is an $F \times T$ image (mel-spectrogram), with $T$ the number of time frames, that serves as input to the audio encoder.
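The three steps above can be sketched with NumPy alone. This is a simplified illustration, not a drop-in replacement for an audio library: it uses no padding (so the frame count is $1 + \lfloor (L - w)/h \rfloor$ for a signal of $L$ samples), unnormalized triangular filters, and illustrative function names.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """(F, K) matrix of triangular filters with centres equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, mid, hi = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def log_mel_spectrogram(wave, sample_rate, n_fft=2048, hop=512, n_mels=128):
    # Step 1: windowed STFT and power spectrogram, shape (T, K).
    n_frames = 1 + (len(wave) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Step 2: project frequency bins onto mel bands, shape (T, F).
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    # Step 3: log compression; return as an (F, T) "image".
    return np.log1p(mel).T


sample_rate = 16000
time = np.arange(sample_rate) / sample_rate            # one second of audio
tone = np.sin(2.0 * np.pi * 1000.0 * time)             # 1 kHz sine wave

spec = log_mel_spectrogram(tone, sample_rate, n_mels=64)
print(spec.shape)                                      # (64, 28)
print(int(spec.mean(axis=1).argmax()))                 # index of the band containing 1 kHz
```

For a pure tone, almost all of the energy lands in the one or two mel bands whose triangles cover 1 kHz, which is a quick sanity check on the filterbank.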

Conditioning in AudioLDM: AudioLDM uses CLAP embeddings for text-to-audio generation by conditioning the diffusion model on text embeddings via cross-attention, identical to how Stable Diffusion uses CLIP embeddings. At inference, a user provides a text description such as "heavy rain on a metal roof" or "jazz saxophone solo in a crowded bar," the text is encoded with the CLAP text encoder to produce a 512-dimensional embedding $t = \text{TextEncoder}(\text{description})$, and this embedding is fed into the cross-attention layers of the audio diffusion U-Net. The diffusion process then generates an audio spectrogram that matches the description, which is converted back to waveform via a vocoder.


Cross-course context: latent compression and tokenization in generative models

A unifying theme emerges across all the courses in this sequence. Both language models and diffusion models solve the same fundamental problem: raw signals (text bytes, image pixels, audio samples) are too high-dimensional for standard sequence models, so they must be compressed into a lower-dimensional token-like representation.

| Architecture/concept | Course 3 (Generative Models) | Course 1 (RL, Reinforcement Learning) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| Latent compression | VAE latent space ($f = 8$ spatial downsampling reduces 512×512 to 64×64) | State abstraction / representation learning | Robot proprioceptive + exteroceptive compression (C2W8 CVAE latent space) | Visual tokenization: ViT patch embeddings as visual "tokens" |
| Patch-based processing | DiT: image latent divided into $p \times p$ patches (typically $p = 2$) | Trajectory patches in decision transformer | ACT (Action Chunking with Transformers): chunk of $k$ actions as single prediction unit (temporal patch) | ViT: image divided into $16 \times 16$ patches, each embedded as a token (C4W1) |
| Temporal modeling | Video LDM: 3D convolutions + temporal attention (reshape trick) | MDP (Markov Decision Process): temporal credit assignment via discount factor $\gamma$ | Robot trajectory temporal modeling: LSTM/Transformer over history of proprioceptive states | Video VLMs: temporal transformer for frame-by-frame video-language alignment |
| Multi-encoder conditioning | SDXL: CLIP-L text encoder + OpenCLIP-G text encoder + size/crop conditioning signals | Multi-objective RL: combining reward signal and constraint satisfaction signals | ACT: DINO visual encoder + proprioceptive encoder + language instruction encoder | BLIP-2 Q-Former: bridging frozen vision encoder (ViT) and frozen language model (T5) |

The analogy between latent diffusion and language model tokenization runs deep: both operations project a raw, high-dimensional signal into a compact lower-dimensional space (discrete for text tokens, continuous for VAE latents) where sequence models can operate efficiently. This is why the same transformer architecture (DiT, GPT) works effectively in both domains: each operates on compressed, semantic-level tokens rather than raw bytes or pixels.

In Course 4 (Vision–Language Models), the ViT backbone (Week 1) learns a similar patch-based visual compression: an image is divided into $16 \times 16$ patches, each embedded to a 768-dimensional token, and these patch tokens are processed by a transformer. The full text-to-image generation pipeline in Stable Diffusion can be understood as a vision encoder (VAE compresses pixels to latent tokens), a language-vision bridge (CLIP text encoder + cross-attention), and a generative decoder (U-Net or DiT generates latent tokens, VAE decoder upsamples to pixels). This is the same functional structure as BLIP-2 (frozen ViT → visual tokens → Q-Former bridge → frozen LLM, i.e. Large Language Model) without the language generation head. This structural homology suggests that as generative models scale, they will continue to converge on a common bottleneck: learning efficient, semantically meaningful compressed representations of their respective domains.


Key takeaways

LDMs decouple perceptual compression (VAE) from semantic generation (diffusion), reducing the cost of self-attention by roughly $f^4$ relative to pixel-space diffusion. Text conditioning via cross-attention over CLIP/T5 embeddings enables prompt-guided generation; SDXL improves this with a larger model and additional conditioning signals. The diffusion transformer (DiT) replaces U-Net convolutions with pure transformer attention, scaling more predictably to very large models. Audio diffusion operates on spectrograms or compressed audio latents, conditioned via CLAP embeddings. Video diffusion requires temporal modeling through 3D convolutions or temporal attention (via the reshape trick); optical flow losses encourage temporal coherence. Joint embedding spaces enable multimodal generation; composable diffusion combines multiple conditioning signals, assumed conditionally independent, by linearly summing their score contributions.


Conceptual questions

  1. An LDM uses a VAE with spatial downsampling factor $f = 8$ to compress $512 \times 512$ images to $64 \times 64$ latents with $c = 4$ channels. (a) Compute the compression ratio in pixels. (b) A U-Net with full spatial attention at the $64 \times 64$ resolution has an attention matrix of size $4096 \times 4096$. Compare the memory and FLOPs for this attention to the same operation applied in pixel space at $512 \times 512$. (c) Explain qualitatively why fine texture details survive the VAE encoding-decoding despite the high compression ratio.

  2. Stable Diffusion's VAE uses a KL regularization term toward $\mathcal{N}(0, I)$ on the latent space, while some video LDMs use a VQ-VAE (vector quantized) encoder. Compare the two approaches: (a) What does KL regularization ensure about the latent space geometry? (b) What does VQ discretization enable that continuous KL regularization does not? (c) For temporal consistency in video generation, which approach is preferable and why?

  3. Composable diffusion combines conditioning signals by summing their individual score contributions. Show that this approximation corresponds to assuming conditional independence between conditioners given the noised image. Construct a scenario where this independence assumption fails badly (where the combined effect of two conditioners is not the sum of their individual effects), and describe the artifact that would appear in the generated image.

  4. Temporal attention in video diffusion models attends over the time dimension at each spatial location independently. Compare this to a full 3D space-time attention mechanism. (a) What temporal artifacts would temporal-only attention produce that space-time attention avoids? (b) For a video with $T = 16$ frames and spatial resolution $h \times w = 32 \times 32$, compare the memory cost of temporal-only attention vs. space-time attention.

  5. A video diffusion model generates 4-second clips at 24 fps (96 frames) by running diffusion in a spatiotemporal latent space with temporal downsampling $f_t = 4$ (24 latent time steps). Describe the key inference-time challenge when generating longer videos (e.g., 30 seconds) and propose two strategies for extending the generation window beyond the training clip length.

  6. The DiT architecture removes skip connections entirely, relying only on residual connections within blocks. Explain why this architectural choice might be less problematic for DiT than for convolutional U-Nets, considering the receptive field and attention mechanisms in play.

  7. A CLAP-conditioned audio diffusion model generates 10 seconds of audio at 44 kHz (440,000 samples). Assuming a mel-spectrogram with $F = 128$ mel bands and hop size $h = 512$, compute the spectrogram shape. How does this compare to the spatial dimensions of a $512 \times 512$ image in terms of total tokens after patch embedding with patch size $p = 2$?

✦Solutions
  1. (a) Pixels: $512 \cdot 512 \cdot 3 = 786{,}432$ vs. latent $64 \cdot 64 \cdot 4 = 16{,}384$, a $48\times$ compression. (b) Spatial positions grow $64\times$ (4096 → 262,144), and attention is $O(N^2)$, so pixel-space attention costs $\approx 64^2 = 4096\times$ more memory and FLOPs than the $4096 \times 4096$ latent attention. (c) The VAE is trained with perceptual (LPIPS) + adversarial losses, so its decoder synthesizes plausible high-frequency texture from the compact latent; the latent need only carry enough information to regenerate texture, not store every pixel.
  2. (a) KL regularization keeps the latent near $\mathcal{N}(0, I)$: a smooth, continuous, near-isotropic geometry the diffusion prior can sample. (b) VQ gives a finite codebook, enabling discrete autoregressive/transformer priors over latents and crisp reconstruction of repeated structure. (c) For video, the continuous KL latent is usually preferable: it interpolates smoothly between frames, whereas discrete codes can flicker as adjacent frames snap to different codebook entries unless temporal smoothing is added.
  3. Summing scores assumes $\nabla_x \log p(c_1, c_2 \mid x) = \nabla_x \log p(c_1 \mid x) + \nabla_x \log p(c_2 \mid x)$, i.e. $c_1 \perp c_2 \mid x$. It fails when conditioners interact or conflict: e.g. text "a red cube" + an edge map of a sphere. The two scores pull toward incompatible geometries, producing ghosted/blended shapes or one signal overriding the other with distortions. (Another: "small dog" text + a reference image of a large object → scale conflict.)
  4. (a) Temporal-only attention lets each spatial location attend only across time at its own position, so it cannot track an object translating across positions, and moving objects flicker/ghost; full space-time attention captures motion across locations. (b) Temporal-only: $HW$ sequences of length $T$ → $HW \cdot T^2 = 1024 \cdot 256 = 262{,}144$ attention entries. Full space-time: $(THW)^2 = (16 \cdot 1024)^2 \approx 2.68 \times 10^8$, about $1024\times$ more.
  5. Generating 30 s exceeds the fixed training horizon (24 latent steps), so the model extrapolates poorly (drift, loss of consistency) and full attention blows up in memory/compute. Strategies: (1) autoregressive sliding window: generate in overlapping chunks, conditioning each on the last few frames of the previous chunk for continuity; (2) hierarchical generation: produce sparse keyframes first, then interpolate intermediate frames (temporal super-resolution).
  6. DiT can drop skip connections because global self-attention at every layer lets any patch directly read fine detail from any other patch, so there is no need for hardwired encoder→decoder skip paths to re-inject lost spatial information. U-Nets need skips precisely because local convolutions + downsampling discard spatial detail that must be restored; DiT routes information through learned attention + residual connections instead.
  7. $T = \lceil 440{,}000/512 \rceil = 860$ frames, $F = 128$ → spectrogram $128 \times 860$. With $p = 2$: $(128/2)(860/2) = 64 \cdot 430 = 27{,}520$ tokens. A $512 \times 512$ image at $p = 2$ gives $256 \cdot 256 = 65{,}536$ tokens, so the 10 s audio spectrogram is roughly $2.4\times$ fewer tokens than the full-resolution image.

Looking ahead

With the major model families and conditioning mechanisms established, a practical question arises: how do we know if a generative model is actually good?

Week 10: Evaluating Generative Models. We examine bits-per-dimension and FID computation, analyze the known failure modes of automatic metrics, survey human preference evaluation methods, and discuss the precision-recall tradeoff that quantifies sample quality vs. diversity.


Further reading

  • Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR. (The Stable Diffusion paper).
  • Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV. (The DiT architecture).