Week 9: Latent Diffusion and Multimodal Generation

Purpose of this lecture

Pixel-space diffusion at high resolution (512×512 and above) is computationally prohibitive: a 512×512 RGB image has 786,432 dimensions, and each diffusion step applies the full denoising network to this entire tensor. Latent diffusion models (LDMs; Rombach et al., 2022) solve this by first compressing images into a lower-dimensional latent representation using a pretrained VAE, then running the diffusion process in latent space. This decoupling dramatically reduces computation while preserving perceptual quality. The same framework extends to audio, video, and multimodal generation by swapping the VAE encoder for domain-appropriate compression.

Latent diffusion models

The LDM architecture has two stages. First, a perceptual compression stage: a KL-regularized VAE (similar to Week 2 but designed for perceptual quality) is trained to compress images $x \in \mathbb{R}^{H \times W \times 3}$ to latents $z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}$ with $f = H/h = W/w$ the spatial downsampling factor (typically $f = 4$ or $f = 8$ ) and $c$ the number of latent channels. The decoder $\mathcal{D}(z)$ reconstructs the image. This stage is trained to minimize a combination of reconstruction loss, perceptual loss (LPIPS), adversarial loss, and KL regularization toward $\mathcal{N}(0, I)$ .
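One way to write the first-stage objective described above is sketched below. The $L_1$ reconstruction term and the weights $\lambda_\text{perc}$, $\lambda_\text{adv}$, $\lambda_\text{KL}$ are illustrative assumptions; the lecture only states which four terms are combined, not their exact form or weighting.

$$\mathcal{L}_\text{VAE} = \|x - \mathcal{D}(\mathcal{E}(x))\|_1 + \lambda_\text{perc}\,\text{LPIPS}\big(x, \mathcal{D}(\mathcal{E}(x))\big) + \lambda_\text{adv}\,\mathcal{L}_\text{adv}\big(\mathcal{D}(\mathcal{E}(x))\big) + \lambda_\text{KL}\, D_\text{KL}\big(q_\mathcal{E}(z \mid x)\,\|\,\mathcal{N}(0, I)\big)$$

In practice $\lambda_\text{KL}$ is kept small so that the latent stays informative; the KL term only needs to keep the latent scale well behaved for the diffusion stage.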

Second, a latent generation stage: DDPM (or flow matching) is applied in the latent space $z$ rather than pixel space $x$ . The denoising network $\epsilon_\theta(z_t, t, c)$ is a U-Net operating on the latent tensors. The number of spatial positions is $f^2$ times smaller than in pixel space, and because full self-attention is quadratic in the number of positions, the cost of attention layers falls by a factor of $f^4$ . For $f = 8$ , a 512×512 image maps to a $64 \times 64$ latent, so each attention operation is $8^4 = 4096$ times cheaper.
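The arithmetic behind the $f^4$ factor can be checked with a few lines of Python. This is a minimal sketch that counts attention-score entries only (one per pair of positions); the function name is hypothetical.

```python
def attention_cost_ratio(height: int, width: int, f: int) -> dict:
    """Compare full self-attention in pixel space vs. an f-times downsampled latent."""
    pixel_tokens = height * width
    latent_tokens = (height // f) * (width // f)
    return {
        "pixel_tokens": pixel_tokens,
        "latent_tokens": latent_tokens,
        # positions shrink by f^2 ...
        "token_ratio": pixel_tokens // latent_tokens,
        # ... and the N x N attention matrix shrinks by f^4
        "attention_ratio": (pixel_tokens ** 2) // (latent_tokens ** 2),
    }


stats = attention_cost_ratio(512, 512, f=8)
print(stats["latent_tokens"])    # 4096
print(stats["token_ratio"])      # 64
print(stats["attention_ratio"])  # 4096
```

Convolutional layers scale with the number of positions rather than its square, so their saving is closer to $f^2$ than $f^4$.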

The two-stage separation is critical: the perceptual compression stage handles local, high-frequency details (textures, sharpness), which are expensive for a diffusion model to learn directly because doing so requires operating at full resolution. The diffusion stage handles semantic structure (layout, objects, lighting) at low resolution, and the VAE decoder upsamples the result back to pixel space.

Text conditioning via cross-attention

Stable Diffusion, the best-known LDM, conditions the denoising U-Net on text prompts through cross-attention (Week 8). The text encoder is CLIP ViT-L/14 in SD 1.x and OpenCLIP ViT-H/14 in SD 2.x; SDXL combines two encoders (CLIP ViT-L and OpenCLIP ViT-bigG), and SD3 adds T5-XXL alongside the two CLIP-family encoders. The text embedding sequence $c = \tau_\theta(\text{text})$ is projected to match the U-Net's attention dimension and attended to at multiple resolutions.

SDXL (Podell et al., 2023) scales LDM in three ways: (1) larger U-Net (2.6B parameters vs. 860M) with a second text encoder (OpenCLIP ViT-G); (2) conditioning on image size and crop coordinates as additional signals (enabling the model to learn resolution-appropriate generation); (3) a separate refiner model that applies additional denoising steps in latent space to improve high-frequency detail. Together these achieve substantially improved sample quality and prompt adherence over SD 1.x.

Stable Diffusion 3 replaces the U-Net with a diffusion transformer (DiT) and uses flow matching rather than DDPM. The DiT processes flattened image patches and text tokens jointly in a multimodal transformer, with the image and text representations attending to each other at every layer rather than only through cross-attention.

The Diffusion Transformer architecture

The diffusion transformer (DiT; Peebles & Xie, 2023) departs from the convolutional U-Net inductive bias by using a pure transformer backbone for latent diffusion. This shift is motivated by a key observation: U-Nets have strong local inductive biases (spatial convolution windows, hierarchical skip connections) that are optimized for pixel-space processing. However, in the latent space of a VAE with $f = 8$ downsampling, the spatial structure is already heavily compressed and semantic — U-Net biases become less critical and may even constrain model capacity.

| Patching | Conditioning | Scaling |
| --- | --- | --- |
| Latents $z$ are divided into $p \times p$ non-overlapping patches and linearly projected to a fixed model dimension $d$ . | AdaLN-Zero layers inject timestep and class information by modulating the scale and shift of the layer normalization. | Replacing U-Net convolutions with transformers enables predictable performance gains as GFLOPS increase. |

Patch embedding and tokenization: The input latent $z \in \mathbb{R}^{h \times w \times c}$ is divided into non-overlapping patches of size $p \times p$ (e.g., $p=2$ ), producing $(h/p)(w/p)$ tokens, each of dimension $p^2 c$ . These patch embeddings are linearly projected to a fixed model dimension $d$ , creating a sequence of spatial tokens analogous to text token embeddings in language models. For a $64 \times 64$ latent with $c=4$ channels and $p=2$ , this produces $32 \times 32 = 1024$ tokens of dimension $4 \cdot 4 = 16$ , projected to dimension $d$ .
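The patching step can be sketched in NumPy as follows. The function name `patchify`, the random projection matrix `W_embed`, and the choice $d = 1152$ are illustrative assumptions; a real model learns the projection and also adds positional embeddings.

```python
import numpy as np


def patchify(z: np.ndarray, p: int) -> np.ndarray:
    """Split an (h, w, c) latent into (h/p * w/p) tokens of dimension p*p*c."""
    h, w, c = z.shape
    assert h % p == 0 and w % p == 0, "latent size must be divisible by patch size"
    z = z.reshape(h // p, p, w // p, p, c)   # split each spatial axis into (block, offset)
    z = z.transpose(0, 2, 1, 3, 4)           # (h/p, w/p, p, p, c)
    return z.reshape((h // p) * (w // p), p * p * c)


rng = np.random.default_rng(0)
z = rng.standard_normal((64, 64, 4))         # latent for a 512x512 image with f = 8
tokens = patchify(z, p=2)

d = 1152                                     # model dimension (assumed for illustration)
W_embed = rng.standard_normal((tokens.shape[1], d)) * 0.02
embedded = tokens @ W_embed                  # linear projection to dimension d

print(tokens.shape)    # (1024, 16)
print(embedded.shape)  # (1024, 1152)
```

Each row of `tokens` is one flattened $2 \times 2 \times 4$ patch, in row-major order over the patch grid.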

DiT block architecture: Each transformer block consists of:

Time and conditioning modulation via AdaLN-Zero: timestep $t$ is embedded to a vector and combined with the conditioning embedding $c$ (a class embedding in the original DiT, where the two are summed, or a pooled text embedding such as CLIP in text-to-image variants); this combined conditioning vector is used to compute scale $\gamma(t, c)$ and shift $\delta(t, c)$ parameters that modulate the layer norm, plus a gating parameter $\alpha(t, c)$ that controls block output strength.
Multi-head self-attention: standard scaled dot-product attention over all patch tokens, with $Q, K, V$ projections from the normalized input.
Pointwise MLP: a two-layer feedforward network applied independently to each token position.

The AdaLN-Zero gating mechanism works as follows: each block applies two gated residual updates, $x \leftarrow x + \alpha_1 \cdot \text{Attention}(\text{AdaLN}_1(x, t, c))$ followed by $x \leftarrow x + \alpha_2 \cdot \text{MLP}(\text{AdaLN}_2(x, t, c))$ , where the layers that produce the gates $\alpha_1, \alpha_2$ are initialized so that they output zero. At initialization each block therefore contributes nothing to the residual stream and acts as an identity function. Training then moves the gates away from zero, which makes them a form of learned residual strength for each sub-layer.
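The identity-at-initialization property can be demonstrated with a small NumPy sketch. It assumes the standard DiT layout in which attention and MLP each sit on their own gated residual branch, and it simplifies in ways that do not affect the property: a single attention head, ReLU in place of GELU, and no learned layer-norm affine. All names (`dit_block`, `W_mod`, and so on) are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def modulate(x, shift, scale):
    # Adaptive layer norm: the conditioning sets the scale and shift.
    return layer_norm(x) * (1.0 + scale) + shift


def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2            # ReLU stands in for GELU


def dit_block(x, cond, params):
    # One linear map turns the conditioning vector into six modulation vectors.
    mod = cond @ params["W_mod"] + params["b_mod"]
    shift1, scale1, gate1, shift2, scale2, gate2 = np.split(mod, 6)
    x = x + gate1 * self_attention(modulate(x, shift1, scale1),
                                   params["Wq"], params["Wk"], params["Wv"])
    x = x + gate2 * mlp(modulate(x, shift2, scale2), params["W1"], params["W2"])
    return x


rng = np.random.default_rng(0)
n_tokens, d = 8, 16
params = {
    "W_mod": np.zeros((d, 6 * d)),   # zero init: shift = scale = gate = 0
    "b_mod": np.zeros(6 * d),
    "Wq": rng.standard_normal((d, d)) * 0.1,
    "Wk": rng.standard_normal((d, d)) * 0.1,
    "Wv": rng.standard_normal((d, d)) * 0.1,
    "W1": rng.standard_normal((d, 4 * d)) * 0.1,
    "W2": rng.standard_normal((4 * d, d)) * 0.1,
}
x = rng.standard_normal((n_tokens, d))
cond = rng.standard_normal(d)        # stand-in for the combined timestep + class embedding
out = dit_block(x, cond, params)
print(np.allclose(out, x))           # True: the block is an identity map at initialization
```

Only the modulation layer is zero-initialized; the attention and MLP weights are random, yet the block output equals its input because both gates are zero.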

Scale and model families: DiT-XL/2, the largest configuration in the original DiT paper, has 28 transformer blocks, patch size 2, and model dimension $d = 1152$ , totaling 675M parameters. The "/2" denotes the patch size $p = 2$ (DiT-XL/4 and DiT-XL/8 use larger patches and therefore fewer tokens). This is comparable in capacity to the 860M-parameter SD 1.x U-Net, and well below the 2.6B-parameter SDXL U-Net, but with fundamentally different architectural constraints. Stable Diffusion 3 uses a multimodal variant of this design (MMDiT) rather than DiT-XL/2 itself.
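A back-of-the-envelope count shows where the 675M figure comes from. The sketch assumes an MLP expansion ratio of 4 and an adaLN modulation layer mapping $d \to 6d$, as in the public DiT reference implementation; biases, patch and timestep embedders, and the final layer are ignored, so the total is slightly below the reported number.

```python
def dit_block_params(d: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one DiT block (biases ignored)."""
    attn = 4 * d * d               # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d * d    # d -> mlp_ratio*d -> d
    adaln = 6 * d * d              # conditioning -> 6 modulation vectors
    return attn + mlp + adaln


blocks, d = 28, 1152               # DiT-XL/2 depth and width
total = blocks * dit_block_params(d)
print(f"{total / 1e6:.0f}M parameters in the transformer blocks")  # 669M
```

Note that the patch size does not appear in the block count at all: it changes the number of tokens (and hence FLOPs), not the number of block parameters.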

U-Net vs. DiT comparison: The architectural difference has training and scaling implications:

| Inductive Bias | Skip Connections | Scaling Law |
| --- | --- | --- |
| U-Nets use local convolutions; DiTs use global self-attention over learned patch positions. | U-Nets rely on hardwired encoder-decoder pairs; DiTs rely on residual blocks and global context. | DiTs follow predictable transformer scaling laws, making them more compute-optimal at scale. |

| Property | U-Net | DiT |
|---|---|---|
| Spatial inductive bias | Strong (local convolution kernels) | Weak (learned absolute/relative patch positions) |
| Skip connections | Yes (encoder–decoder pairs at multiple scales) | No (only residual connections within blocks) |
| Attention scope | Local at high resolution, global at low resolution (via downsampling) | Global at all scales (all patches attend to all patches) |
| Scaling behavior with model size | Less predictable; hierarchy must be rebalanced | Follows transformer scaling laws; more compute-optimal |
| Computational scaling | Attention cost dominated by low-resolution bottleneck | Attention cost quadratic in patch count but identical in every layer; predictable |
| Used in production | SD 1.x (860M), SD 2.x (865M), SDXL (2.6B) | SD3 (2B), FLUX (12B), Sora |

The lack of skip connections in DiT is counterintuitive — they seem essential for preserving fine details — but empirical results show that DiT achieves comparable or better FID at similar model sizes, and scales more predictably to very large models (FLUX at 12B parameters). The hypothesis is that the global attention at every layer allows the model to directly learn which information to propagate, obviating the need for hardwired skip paths.

Audio diffusion

Audio waveforms have different compression tradeoffs than images: raw audio at 44kHz is extremely high-dimensional but contains significant redundancy. Two approaches dominate:

Spectrogram diffusion: convert audio to a mel-spectrogram (time-frequency representation), apply 2D diffusion in spectrogram space, then invert the spectrogram to audio using a pretrained vocoder (HiFi-GAN, WaveNet). This treats audio generation as image generation of spectrograms and inherits all the text-conditioning tools from image diffusion.

Latent audio diffusion: compress the audio with a learned autoencoder, then run diffusion in its latent space. Stable Audio (Evans et al., 2024) applies a 1D VAE directly to the waveform, whereas AudioLDM applies a VAE to the mel-spectrogram and therefore still needs a vocoder to return to a waveform. In both cases the compressed latent captures the audio's temporal structure at a much lower temporal rate than the raw samples, enabling generation of minutes-long audio with manageable compute.

Text-to-audio diffusion models such as AudioLDM condition on CLAP embeddings: CLAP is the audio analog of CLIP and aligns audio and text descriptions in a shared embedding space. (MusicGen, often mentioned alongside them, is not a diffusion model; it generates discrete audio codec tokens with an autoregressive language-model backbone.) Sound effects, music, and speech can all be generated by conditioning the diffusion model on CLAP text embeddings.

Video diffusion

Video is spatially and temporally high-dimensional, requiring compression in both dimensions. Video LDMs extend the image VAE with temporal compression: a 3D VAE encodes a video clip of $T$ frames at $H \times W$ resolution to a spatiotemporal latent of shape $T/f_t \times h \times w \times c$ where $f_t$ is the temporal downsampling factor (typically 4). The denoising network must model spatiotemporal dependencies.
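The shape bookkeeping can be sketched as below. The defaults $f = 8$ and $c = 4$ are carried over from the image VAE as assumptions, and the simple $T / f_t$ rule is the one stated above; some causal 3D VAEs instead keep the first frame separately, which changes the count slightly.

```python
def video_latent_shape(T: int, H: int, W: int, f_t: int = 4, f: int = 8, c: int = 4):
    """Shape of the spatiotemporal latent for a T-frame clip at H x W resolution."""
    return (T // f_t, H // f, W // f, c)


def compression_ratio(T: int, H: int, W: int, **kwargs) -> float:
    t, h, w, c = video_latent_shape(T, H, W, **kwargs)
    return (T * H * W * 3) / (t * h * w * c)


shape = video_latent_shape(96, 512, 512)   # 4 seconds at 24 fps
ratio = compression_ratio(96, 512, 512)
print(shape)   # (24, 64, 64, 4)
print(ratio)   # 192.0
```

The ratio factors as the $48\times$ of the image VAE times the temporal factor $f_t = 4$.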

Architectural choices for temporal modeling: (1) 3D convolutions in the U-Net extend spatial convolutions to include temporal neighbors; (2) temporal attention adds attention over the time dimension at each spatial location; (3) video transformers (e.g., Sora's space-time transformer) process all space-time patches jointly through full 3D self-attention, enabling long-range temporal consistency at the cost of higher compute.

Consistency and temporal coherence: an image diffusion model applied frame by frame generates each frame independently, so temporal structure has to be enforced explicitly. Temporal attention at each layer and temporal convolutions bias the model toward smooth temporal transitions. Training on video data with an optical flow loss further enforces coherence.

Temporal attention: the reshape trick and 3D convolutions

Efficiently modeling temporal dependencies in video LDMs requires careful architectural choices. Full spatiotemporal self-attention over all $T \times H \times W$ positions is prohibitively expensive ( $O(T^2 H^2 W^2)$ complexity), so practical video models use decoupled or hierarchical temporal processing.

| Temporal Attention | 3D Convolutions | Temporal Embeds |
| --- | --- | --- |
| Reshaping the 5D video tensor to process time separately from space, reducing complexity to $O(T^2)$ per spatial location. | Replacing 2D kernels with $3 \times 3 \times 3$ kernels (or decoupled $1 \times 3 \times 3$ and $3 \times 1 \times 1$ ) to learn local spatiotemporal correlations. | Injecting relative time information into the transformer blocks so the model can distinguish sequence order. |

Reshape trick for temporal attention: The most practical approach reshapes the feature tensor to process time separately from space. Given a 5D feature tensor $x \in \mathbb{R}^{B \times T \times H \times W \times C}$ from a video diffusion model, move the time axis next to the channel axis (giving $[B, H, W, T, C]$ ) and reshape to $[B \cdot H \cdot W, T, C]$ , a 3D tensor in which each of the $H \cdot W$ spatial positions is treated as an independent sequence of length $T$ .

| Reshape Trick | 3D Convolution | Optical Flow |
| --- | --- | --- |
| Processing time separately from space reduces complexity from $O((HTW)^2)$ to $O(HW T^2)$ , enabling efficient temporal modeling. | Using $k_t \times k_h \times k_w$ kernels to capture local spatiotemporal correlations and enforce temporal smoothness across frames. | Auxiliary supervision using flow estimators (e.g., RAFT) to penalize flickering and ensure realistic motion patterns. |

Apply standard 1D self-attention along the $T$ dimension, then reshape back to $[B, T, H, W, C]$ . This processes each spatial location independently across the temporal axis with $O(T^2)$ complexity per location, for a total cost of $O(H \cdot W \cdot T^2)$ . For a $32 \times 32$ spatial resolution with $T=16$ frames, this is $1024 \cdot 256 = 262,144$ attention scores vs. $(32 \cdot 32 \cdot 16)^2 = 16{,}384^2 \approx 268$ million for full 3D attention, a reduction of about $1024\times$ .
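A minimal NumPy sketch of the reshape trick follows. To keep it short it uses the features themselves as queries, keys, and values (no learned $Q, K, V$ projections and a single head), which is a simplification; the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_self_attention(x: np.ndarray) -> np.ndarray:
    """x: (B, T, H, W, C). Attend over T independently at each spatial location."""
    B, T, H, W, C = x.shape
    # Move time next to channels, then fold space into the batch axis.
    seq = x.transpose(0, 2, 3, 1, 4).reshape(B * H * W, T, C)   # (B*H*W, T, C)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(C)          # (B*H*W, T, T)
    out = softmax(scores, axis=-1) @ seq                        # (B*H*W, T, C)
    # Undo the fold and restore the original axis order.
    return out.reshape(B, H, W, T, C).transpose(0, 3, 1, 2, 4)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16, 8, 8, 4))
y = temporal_self_attention(x)
print(y.shape)  # (2, 16, 8, 8, 4)
```

Because space is folded into the batch axis, perturbing the features at one spatial location changes the output only at that location. This is exactly the independence that makes the operation cheap, and also why it must be paired with spatial mixing.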

Alternating temporal and spatial attention: To capture spatiotemporal correlations (not just temporal ones at each location), models alternate temporal attention (reshape as above) with spatial attention (reshape to $[B \cdot T, H \cdot W, C]$ , apply 2D spatial self-attention). A sequence of blocks alternating these operations approximates full 3D attention at $O(T^2 \cdot H \cdot W + H^2 \cdot W^2 \cdot T)$ cost. Relative to naive 3D attention the saving is a factor of $T \cdot HW / (T + HW)$ , which is close to $T$ whenever $HW \gg T$ : roughly $16\times$ for $T = 16$ at $32 \times 32$ . The spatial term dominates the factorized cost, so adding frames is comparatively cheap, which makes training on longer sequences (16–32 frames) practical.
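The three cost expressions can be compared directly. This sketch counts attention-score entries per layer and ignores constant factors, heads, and the channel dimension; the function name is hypothetical.

```python
def attention_entries(T: int, H: int, W: int) -> dict:
    """Number of attention scores for full 3D vs. factorized space/time attention."""
    n = H * W
    temporal = n * T ** 2      # H*W sequences of length T
    spatial = T * n ** 2       # T sequences of length H*W
    return {
        "full_3d": (T * n) ** 2,
        "temporal": temporal,
        "spatial": spatial,
        "factorized": temporal + spatial,
    }


costs = attention_entries(T=16, H=32, W=32)
saving = costs["full_3d"] / costs["factorized"]
print(costs["full_3d"])     # 268435456
print(costs["temporal"])    # 262144
print(costs["factorized"])  # 17039360
print(round(saving, 2))     # 15.75
```

Temporal-only attention is about $1024\times$ cheaper than full 3D attention, but once the spatial pass is included the combined saving is close to $T$.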

3D convolution architecture: An alternative to attention is to use 3D convolutional layers that directly operate on the spatiotemporal volume. A 3D convolution kernel of size $(k_t, k_h, k_w)$ convolves over temporal and spatial neighbors simultaneously. Typical configurations include:

$(3, 3, 3)$ kernels for motion-sensitive features (captures local temporal flow)
$(1, 3, 3)$ kernels for spatial-only features (no temporal mixing, used in early video VAE stages)
Dilated kernels with dilation $d_t > 1$ in the temporal dimension to increase receptive field without increasing parameter count

The 3D convolution naturally encourages temporal smoothness (adjacent frames are mixed), but it costs $k_t$ times more parameters and FLOPs than a 2D convolution with the same spatial kernel, which is one reason factorized $(1, 3, 3)$ and $(3, 1, 1)$ kernels are a common substitute. The temporal receptive field also grows slowly: with $L$ layers of 3D conv using kernel size $k_t$ , the temporal receptive field spans $(k_t - 1) \cdot L + 1$ frames. For $k_t = 3$ and $L = 12$ layers, this is only 25 frames. Longer temporal dependencies require more layers or dilated temporal convolutions (e.g., dilation pattern 1, 2, 4, 8, which grows the receptive field exponentially with depth).
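The receptive-field formula generalizes to dilated stacks: each layer with dilation $d_t$ adds $(k_t - 1) \cdot d_t$ frames. A small sketch, assuming stride 1 throughout (the function name is hypothetical):

```python
def temporal_receptive_field(kernel_size: int, dilations: list) -> int:
    """Frames seen by one output position after a stack of stride-1 temporal convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


plain = temporal_receptive_field(3, [1] * 12)             # 12 undilated layers
dilated = temporal_receptive_field(3, [1, 2, 4, 8] * 3)   # same depth, dilated
print(plain)    # 25
print(dilated)  # 91
```

With the same 12 layers and the same parameter count, the dilated stack covers 91 frames instead of 25.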

Optical flow supervision: One training technique for improving temporal coherence in video diffusion is an auxiliary flow loss. Given a pretrained optical flow estimator (e.g., RAFT), compute a pseudo-ground-truth flow between consecutive training frames, $f_{t \to t+1}^* = \text{FlowEstimator}(x_t, x_{t+1})$ , where $t$ here indexes frames rather than diffusion timesteps. During training, also estimate the flow between the reconstructed frames $\hat{x}_t$ and minimize

$$\mathcal{L}_\text{flow} = \sum_{t=1}^{T-1} \|\text{FlowEstimator}(\hat{x}_t, \hat{x}_{t+1}) - f_{t \to t+1}^*\|_2^2$$

This loss penalizes flickering and pushes the model toward frames that obey realistic motion patterns, improving temporal coherence without manually annotated motion labels.
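The structure of the flow loss can be sketched as follows. The flow estimator is passed in as a callable; `toy_flow` below is only a frame difference used as a stand-in so the example runs, not a real optical flow model such as RAFT. In actual training the estimator would need to be differentiable so that gradients reach the generated frames.

```python
import numpy as np


def flow_consistency_loss(frames, recon, flow_estimator) -> float:
    """Sum over consecutive frame pairs of ||flow(recon) - flow(real)||^2."""
    total = 0.0
    for t in range(len(frames) - 1):
        target = flow_estimator(frames[t], frames[t + 1])   # pseudo-ground-truth flow
        pred = flow_estimator(recon[t], recon[t + 1])       # flow of reconstructed frames
        total += float(np.sum((pred - target) ** 2))
    return total


def toy_flow(frame_a, frame_b):
    # Stand-in only: a frame difference, NOT a real optical flow estimator.
    return frame_b - frame_a


rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 8, 8))                     # 5 frames of 8x8
flickery = frames + 0.5 * rng.standard_normal(frames.shape)

print(flow_consistency_loss(frames, frames, toy_flow))      # 0.0
print(flow_consistency_loss(frames, flickery, toy_flow) > 0)  # True
```

Independent per-frame noise (flicker) changes the estimated motion between frames and is penalized, while a perfect reconstruction incurs no loss.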

Joint embedding spaces and multimodal generation

Unified multimodal LDMs generate across modalities by learning a shared latent space that multiple domain-specific encoders and decoders map to and from. Text, image, audio, and depth map encoders all map to a common embedding; a single diffusion model operates in this shared space; modality-specific decoders reconstruct the outputs.

Composable diffusion enables combining multiple conditioning signals: generate an image conditioned on both a text prompt AND a reference image AND an edge map by combining the scores from each conditioner. The composed score:

$$\nabla_x \log p(x \mid c_1, c_2, c_3) \approx \nabla_x \log p(x) + s_1 \nabla_x \log p(c_1 \mid x) + s_2 \nabla_x \log p(c_2 \mid x) + s_3 \nabla_x \log p(c_3 \mid x)$$

Here $s_1, s_2, s_3$ are guidance weights. With all $s_i = 1$ the expression is exact when the conditioning signals are conditionally independent given $x$ ; this holds approximately when conditioners target different aspects of the image (text → semantics, edge → structure, reference → style).
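A one-dimensional Gaussian toy model makes the independence assumption concrete. Assume a prior $x \sim \mathcal{N}(0, 1)$ and conditioners $c_i \mid x \sim \mathcal{N}(x, \sigma_i^2)$ that are independent given $x$; these distributions are chosen purely for illustration because the posterior score is then available in closed form. With unit weights the composed score matches it exactly.

```python
import math


def prior_score(x: float) -> float:
    return -x                                   # d/dx log N(x; 0, 1)


def likelihood_score(x: float, c: float, sigma: float) -> float:
    return (c - x) / sigma ** 2                 # d/dx log N(c; x, sigma^2)


def composed_score(x, conds, weights):
    """Prior score plus weighted conditioner scores, as in the formula above."""
    return prior_score(x) + sum(
        s * likelihood_score(x, c, sigma) for s, (c, sigma) in zip(weights, conds)
    )


def exact_posterior_score(x, conds):
    """Score of the true Gaussian posterior p(x | c_1, ..., c_k)."""
    precision = 1.0 + sum(1.0 / sigma ** 2 for _, sigma in conds)
    mean = sum(c / sigma ** 2 for c, sigma in conds) / precision
    return -(x - mean) * precision


conds = [(1.0, 0.5), (-2.0, 2.0), (0.5, 1.0)]   # (observed c_i, sigma_i)
x = 0.3
composed = composed_score(x, conds, weights=[1.0, 1.0, 1.0])
exact = exact_posterior_score(x, conds)
print(math.isclose(composed, exact, abs_tol=1e-9))  # True
```

Weights $s_i \neq 1$ deliberately depart from the true posterior to strengthen or weaken individual conditioners, in the same spirit as classifier guidance.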

CLAP: contrastive audio-language pretraining

CLAP (Elizalde et al., 2022) is the audio-language equivalent of CLIP, training audio and text encoders with a symmetric contrastive loss to align audio clips with their textual descriptions. Just as CLIP learns a joint vision-language embedding space, CLAP learns a joint audio-language space where audio clips and text descriptions with matching semantics are embedded near each other.

| Contrastive Loss | Audio Encoder | Mel-Spectrogram |
| --- | --- | --- |
| Symmetric audio-to-text and text-to-audio loss using in-batch negatives to align paired embeddings in a shared 512D space. | CNN or hybrid transformer backbones (PANN/HTS-AT) that process mel-spectrograms into normalized feature vectors. | A 2D time-frequency representation computed via STFT and log-mel filterbanks to match human auditory perception. |

Training objective: The CLAP loss is formalized as

$$\mathcal{L}_\text{CLAP} = -\frac{1}{N}\sum_{i=1}^N \left[\log \frac{e^{a_i^\top t_i / \tau}}{\sum_j e^{a_i^\top t_j/\tau}} + \log \frac{e^{t_i^\top a_i/\tau}}{\sum_j e^{t_i^\top a_j/\tau}}\right]$$

where $a_i = \text{AudioEncoder}(x_i)$ and $t_i = \text{TextEncoder}(s_i)$ are L2-normalized embeddings for audio clip $x_i$ and description $s_i$ , and $\tau$ is a temperature parameter. The first term is the audio→text direction (audio $a_i$ scored against every text in the batch) and the second is the text→audio direction (text $t_i$ scored against every audio clip). Both directions use in-batch negatives, so a batch of $N$ paired examples produces $N$ positive pairs and $N(N-1)$ negative pairs for each direction.
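A NumPy sketch of the symmetric contrastive loss follows. The two directions differ only in which axis of the similarity matrix is normalized. The temperature $\tau = 0.07$ is an assumed illustrative value, and the function names are hypothetical.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    logits = logits - logits.max(axis=axis, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))


def clap_loss(audio_emb: np.ndarray, text_emb: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of N paired (audio, text) embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / tau                        # logits[i, j] = a_i . t_j / tau
    a2t = np.diag(log_softmax(logits, axis=1))    # audio i vs. all texts j
    t2a = np.diag(log_softmax(logits, axis=0))    # text i vs. all audio clips j
    return float(-(a2t + t2a).mean())


aligned = clap_loss(np.eye(4), np.eye(4))                        # matched pairs
shuffled = clap_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))   # mismatched pairs
uninformative = clap_loss(np.ones((4, 3)), np.ones((4, 3)))      # all embeddings equal

print(aligned < shuffled)                          # True
print(np.isclose(uninformative, 2 * np.log(4)))    # True
```

When every embedding is identical the model cannot tell pairs apart, and each direction contributes $\log N$, which gives the $2 \log 4$ baseline above.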

Audio encoder architecture: The audio encoder typically operates on mel-spectrograms and is implemented either as a CNN such as PANNs (Large-Scale Pretrained Audio Neural Networks) or as a transformer such as HTS-AT (Hierarchical Token-Semantic Audio Transformer). These models are pretrained on large-scale audio tagging datasets and transfer well to CLAP training. The encoder outputs an embedding vector of dimension 512 or 1024, L2-normalized for use in the contrastive loss.

Mel-spectrogram computation: The audio signal is converted to a mel-spectrogram in a standardized pipeline:

STFT (Short-Time Fourier Transform): apply a window function (Hann window of size $w$ , typically 2048 samples) with hop size $h$ (typically 512 samples) to the raw waveform at sample rate $f_s$ (44 kHz or 16 kHz), producing a complex spectrogram $X[k, t]$ where $k$ indexes frequency bins and $t$ indexes time frames. Compute the power spectrogram $|X[k, t]|^2$ (the squared magnitude).
Mel filterbank: apply a mel-scale filterbank $M \in \mathbb{R}^{F \times K}$ where $F$ is the number of mel-scale bands (typically 64 or 128) and $K$ is the number of frequency bins. The mel scale $m = 2595 \log_{10}(1 + f/700)$ compresses high frequencies perceptually (humans are less sensitive to differences above ~1 kHz). The filterbank is a matrix of overlapping triangular filters spaced on the mel scale.
Log magnitude: apply log compression $\log(1 + \text{MelSpectrogram})$ to approximate the perceptual loudness scale. The result is an $F \times T$ image (mel-spectrogram) that serves as input to the audio encoder.
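The mel-scale formula and the frame count from the pipeline above can be checked with standard-library Python. This is a sketch: the helper names are hypothetical, filter centers are placed uniformly in mel between `f_min` and `f_max`, and the frame count uses the simple $\lceil n / h \rceil$ convention (libraries differ by a frame depending on padding).

```python
import math


def hz_to_mel(f: float) -> float:
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int, f_min: float, f_max: float) -> list:
    """Center frequencies (Hz) of n_mels triangular filters, evenly spaced in mel."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]


def num_frames(num_samples: int, hop: int) -> int:
    return math.ceil(num_samples / hop)


centers = mel_center_frequencies(n_mels=128, f_min=0.0, f_max=22000.0)
gaps = [b - a for a, b in zip(centers, centers[1:])]

print(round(hz_to_mel(1000.0)))                 # 1000
print(gaps[0] < gaps[-1])                       # True: filters widen at high frequency
print(num_frames(440_000, hop=512))             # 860
```

The widening gaps between filter centers are the concrete form of the perceptual compression of high frequencies: equal steps in mel correspond to ever larger steps in Hz.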

Conditioning in AudioLDM: AudioLDM uses CLAP embeddings for text-to-audio generation by conditioning the diffusion model on text embeddings via cross-attention, identical to how Stable Diffusion uses CLIP embeddings. At inference, a user provides a text description such as "heavy rain on a metal roof" or "jazz saxophone solo in a crowded bar," the text is encoded with the CLAP text encoder to produce a 512-dimensional embedding $t = \text{TextEncoder}(\text{description})$ , and this embedding is fed into the cross-attention layers of the audio diffusion U-Net. The diffusion process then generates an audio spectrogram that matches the description, which is converted back to waveform via a vocoder.

Cross-course context: latent compression and tokenization in generative models

A unifying theme emerges across all the courses in this sequence. Both language models and diffusion models solve the same fundamental problem: raw signals (text bytes, image pixels, audio samples) are too high-dimensional for standard sequence models, so they must be compressed into a lower-dimensional token-like representation.

| Architecture/concept | Course 3 (Generative Models) | Course 1 (RL) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| Latent compression | VAE latent space ( $f=8$ spatial downsampling reduces 512×512 to 64×64) | State abstraction / representation learning | Robot proprioceptive + exteroceptive compression (C2W8 CVAE latent space) | Visual tokenization: ViT patch embeddings as visual "tokens" |
| Patch-based processing | DiT: image latent divided into $p \times p$ patches (typically $p=2$ ) | Trajectory patches in decision transformer | ACT: chunk of $k$ actions as single prediction unit (temporal patch) | ViT: image divided into $16\times16$ patches, each embedded as a token (C4W1) |
| Temporal modeling | Video LDM: 3D convolutions + temporal attention (reshape trick) | MDP: temporal credit assignment via discount factor $\gamma$ | Robot trajectory temporal modeling: LSTM/Transformer over history of proprioceptive states | Video VLMs: temporal transformer for frame-by-frame video-language alignment |
| Multi-encoder conditioning | SDXL: CLIP-L text encoder + OpenCLIP-G text encoder + size/crop conditioning signals | Multi-objective RL: combining reward signal and constraint satisfaction signals | ACT: DINO visual encoder + proprioceptive encoder + language instruction encoder | BLIP-2 Q-Former: bridging frozen vision encoder (ViT) and frozen language model (T5) |

The analogy between latent diffusion and language model tokenization is close: both project a raw, high-dimensional signal into a compact lower-dimensional representation (discrete tokens for language models, continuous latent patches for LDMs) where sequence models can operate efficiently. This is why the same transformer architecture (DiT, GPT) works effectively in both domains: both operate on compressed, semantic-level tokens rather than raw bits.

In Course 4 (Vision–Language Models), the ViT backbone (Week 1) learns a similar patch-based visual compression: an image is divided into $16 \times 16$ patches, each embedded to a 768-dimensional token, and these patch tokens are processed by a transformer. The full text-to-image generation pipeline in Stable Diffusion can be understood as a vision encoder (VAE compresses pixels to latent tokens), a language-vision bridge (CLIP text encoder + cross-attention), and a generative decoder (U-Net or DiT generates latent tokens, VAE decoder upsamples to pixels) — precisely the same functional structure as BLIP-2 (frozen ViT → visual tokens → Q-Former bridge → frozen LLM) without the language generation head. This structural homology suggests that as generative models scale, they will continue to converge on a common bottleneck: learning efficient, semantically meaningful discretizations of their respective domains.

Key takeaways

LDMs decouple perceptual compression (VAE) from semantic generation (diffusion), reducing attention cost by up to $f^4$ relative to pixel-space diffusion.
Text conditioning via cross-attention over CLIP/T5 embeddings enables prompt-guided generation; SDXL improves this with a larger model and additional conditioning signals.
The diffusion transformer (DiT) replaces U-Net convolutions with pure transformer attention, scaling more predictably to very large models.
Audio diffusion operates on spectrograms or compressed audio latents, conditioned via CLAP embeddings.
Video diffusion requires temporal modeling through 3D convolutions or temporal attention (via the reshape trick); optical flow losses enforce temporal coherence.
Joint embedding spaces enable multimodal generation; composable diffusion combines multiple conditioning signals, assumed conditionally independent, by linearly summing their score contributions.

Conceptual questions

An LDM uses a VAE with spatial downsampling factor $f = 8$ to compress $512 \times 512$ images to $64 \times 64$ latents with $c = 4$ channels. (a) Compute the compression ratio in pixels. (b) A U-Net with full spatial attention at the $64 \times 64$ resolution has an attention matrix of size $4096 \times 4096$ . Compare the memory and FLOPs for this attention to the same operation applied in pixel space at $512 \times 512$ . (c) Explain qualitatively why fine texture details survive the VAE encoding-decoding despite the high compression ratio.
Stable Diffusion's VAE uses a KL regularization term toward $\mathcal{N}(0, I)$ on the latent space, while some video LDMs use a VQ-VAE (vector quantized) encoder. Compare the two approaches: (a) What does KL regularization ensure about the latent space geometry? (b) What does VQ discretization enable that continuous KL regularization does not? (c) For temporal consistency in video generation, which approach is preferable and why?
Composable diffusion combines conditioning signals by summing their individual score contributions. Show that this approximation corresponds to assuming conditional independence between conditioners given the noised image. Construct a scenario where this independence assumption fails badly (where the combined effect of two conditioners is not the sum of their individual effects), and describe the artifact that would appear in the generated image.
Temporal attention in video diffusion models attends over the time dimension at each spatial location independently. Compare this to a full 3D space-time attention mechanism. (a) What temporal artifacts would temporal-only attention produce that space-time attention avoids? (b) For a video with $T = 16$ frames and spatial resolution $h \times w = 32 \times 32$ , compare the memory cost of temporal-only attention vs. space-time attention.
A video diffusion model generates 4-second clips at 24 fps (96 frames) by running diffusion in a spatiotemporal latent space with temporal downsampling $f_t = 4$ (24 latent time steps). Describe the key inference-time challenge when generating longer videos (e.g., 30 seconds) and propose two strategies for extending the generation window beyond the training clip length.
The DiT architecture removes skip connections entirely, relying only on residual connections within blocks. Explain why this architectural choice might be less problematic for DiT than for convolutional U-Nets, considering the receptive field and attention mechanisms in play.
A CLAP-conditioned audio diffusion model generates 10 seconds of audio at 44 kHz (440,000 samples). Assuming a mel-spectrogram with $F=128$ mel bands and hop size $h=512$ , compute the spectrogram shape. How does this compare to the spatial dimensions of a $512 \times 512$ image in terms of total tokens after patch embedding with patch size $p=2$ ?

Solutions

(a) Pixels: $512\cdot512\cdot3 = 786{,}432$ vs latent $64\cdot64\cdot4 = 16{,}384$ , a $48\times$ compression. (b) Spatial positions grow $64\times$ (4096 → 262,144), and attention is $O(N^2)$ , so pixel-space attention costs $\approx 64^2 = 4096\times$ more memory and FLOPs than the $4096\times4096$ latent attention. (c) The VAE is trained with perceptual (LPIPS) + adversarial losses, so its decoder synthesizes plausible high-frequency texture from the compact latent — the latent need only carry enough information to regenerate texture, not store every pixel.
(a) KL regularization keeps the latent near $\mathcal{N}(0,I)$ : a smooth, continuous, near-isotropic geometry the diffusion prior can sample. (b) VQ gives a finite codebook, enabling discrete autoregressive/transformer priors over latents and crisp reconstruction of repeated structure. (c) For video, the continuous KL latent is usually preferable — it interpolates smoothly between frames, whereas discrete codes can flicker as adjacent frames snap to different codebook entries unless temporal smoothing is added.
Summing scores assumes $\nabla_x\log p(c_1,c_2\mid x)=\nabla_x\log p(c_1\mid x)+\nabla_x\log p(c_2\mid x)$ — i.e. $c_1\perp c_2 \mid x$ . It fails when conditioners interact or conflict: e.g. text "a red cube" + an edge map of a sphere. The two scores pull toward incompatible geometries, producing ghosted/blended shapes or one signal overriding the other with distortions. (Another: "small dog" text + a reference image of a large object → scale conflict.)
(a) Temporal-only attention lets each spatial location attend only across time at its own position, so it cannot track an object translating across positions — moving objects flicker/ghost; full space-time attention captures motion across locations. (b) Temporal-only: $H W$ sequences of length $T$ → $HW\cdot T^2 = 1024\cdot256 = 262{,}144$ attention entries. Full space-time: $(THW)^2 = (16\cdot1024)^2 \approx 2.68\times10^8$ — about $1024\times$ more.
Generating 30 s exceeds the fixed training horizon (24 latent steps), so the model extrapolates poorly — drift, loss of consistency — and full attention blows up in memory/compute. Strategies: (1) autoregressive sliding window — generate in overlapping chunks, conditioning each on the last few frames of the previous chunk for continuity; (2) hierarchical generation — produce sparse keyframes first, then interpolate intermediate frames (temporal super-resolution).
DiT can drop skip connections because global self-attention at every layer lets any patch directly read fine detail from any other patch, so there is no need for hardwired encoder→decoder skip paths to re-inject lost spatial information. U-Nets need skips precisely because local convolutions + downsampling discard spatial detail that must be restored; DiT routes information through learned attention + residual connections instead.
$T=\lceil 440{,}000/512\rceil \approx 860$ frames, $F=128$ → spectrogram $128\times860$ . With $p=2$ : $(128/2)(860/2)=64\cdot430 = 27{,}520$ tokens. A $512\times512$ image at $p=2$ gives $256\cdot256 = 65{,}536$ tokens — so the 10 s audio spectrogram is roughly $2.4\times$ fewer tokens than the full-resolution image.

Looking ahead

With the major model families and conditioning mechanisms established, a practical question arises: how do we know if a generative model is actually good?

Week 10: Evaluating Generative Models. We examine bits-per-dimension and FID computation, analyze the known failure modes of automatic metrics, survey human preference evaluation methods, and discuss the precision-recall tradeoff that quantifies sample quality vs. diversity.
