
Week 8: Conditioning and Control

✦Learning Outcomes
  • Implement cross-attention conditioning for text-to-image generation
  • Compare ControlNet-style architectural conditioning with prompt-based conditioning
  • Apply guidance scale to balance sample diversity and prompt adherence
◆Prerequisites
  • Week 6: DDPM - Core diffusion model training and sampling
  • Week 7: Flow Matching - Alternative training objectives

Understanding of attention mechanisms from Course 1 is helpful but not required.

Purpose of this lecture

Unconditional generative models produce samples from the full learned distribution, which is useful for sampling diverse outputs but insufficient for applications requiring specific content. Conditioning mechanisms enable targeted generation: producing samples that satisfy a constraint (text prompt, edge map, reference image, class label). This lecture derives the two primary conditioning paradigms — classifier guidance and classifier-free guidance — and examines the architectural mechanisms that implement conditioning in practice: cross-attention, adaptive layer normalization, and ControlNet structural conditioning.


Classifier guidance

Classifier guidance (Dhariwal and Nichol, 2021) derives a sampling procedure for a class-conditional distribution $p(x \mid y)$ using an unconditional diffusion model $p_\theta(x)$ and a separately trained classifier $p_\phi(y \mid x_t)$ that classifies noisy images. The key identity, which follows from Bayes' rule because $\log p(y)$ does not depend on $x_t$:

$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p_\phi(y \mid x_t)$$

Since the denoising score is $\nabla_{x_t} \log p_t(x_t) \approx -\epsilon_\theta(x_t, t)/\sqrt{1-\bar\alpha_t}$, multiplying the identity by $-\sqrt{1-\bar\alpha_t}$ gives the guided noise prediction:

$$\tilde\epsilon_\theta(x_t, t, y) = \epsilon_\theta(x_t, t) - \sqrt{1-\bar\alpha_t}\, \nabla_{x_t} \log p_\phi(y \mid x_t)$$

Replacing $\epsilon_\theta$ with $\tilde\epsilon_\theta$ in ancestral sampling steers the denoising toward regions of high $p_\phi(y \mid x_t)$, that is, toward samples that the classifier attributes to class $y$. A guidance scale $s > 1$ amplifies this effect:

$$\tilde\epsilon_\theta = \epsilon_\theta - s \sqrt{1-\bar\alpha_t}\, \nabla_{x_t} \log p_\phi(y \mid x_t)$$

Classifier guidance requires training a separate noise-robust classifier on noisy inputs and performing backward passes through the classifier at each sampling step, adding computational overhead. The classifier must be specifically designed for noisy images at multiple noise levels — standard classifiers trained on clean images perform poorly.
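To make the guided update concrete, here is a minimal one-dimensional sketch. It is an illustration under stated assumptions, not the setup from Dhariwal and Nichol: the "classifier" is a toy logistic model $p_\phi(y \mid x_t) = \sigma(w x_t)$ with an analytic gradient, and the names `grad_log_classifier`, `guided_eps`, and `predict_x0` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_log_classifier(x_t, w=2.0):
    """d/dx log sigmoid(w * x): gradient of a toy 1-D logistic 'classifier'."""
    return w * (1.0 - sigmoid(w * x_t))

def guided_eps(eps, x_t, alpha_bar_t, s=1.0):
    """eps_tilde = eps - s * sqrt(1 - alpha_bar_t) * grad log p_phi(y | x_t)."""
    return eps - s * math.sqrt(1.0 - alpha_bar_t) * grad_log_classifier(x_t)

def predict_x0(x_t, eps, alpha_bar_t):
    """Clean-sample estimate implied by x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    return (x_t - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)

# Toy numbers (assumed for illustration only).
x_t, alpha_bar_t, eps_uncond = 0.0, 0.75, 0.3
eps_guided = guided_eps(eps_uncond, x_t, alpha_bar_t, s=2.0)
x0_plain = predict_x0(x_t, eps_uncond, alpha_bar_t)
x0_guided = predict_x0(x_t, eps_guided, alpha_bar_t)
print(eps_guided, x0_plain, x0_guided)
```

In this toy case the classifier favors larger $x$, and the guided noise prediction moves the implied clean-sample estimate in that direction. A real implementation would obtain the gradient by backpropagating through a noise-aware classifier network.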


Classifier-free guidance

Classifier-free guidance (CFG; Ho and Salimans, 2021) avoids the separate classifier by jointly training a conditional model $\epsilon_\theta(x_t, t, c)$ and an unconditional model $\epsilon_\theta(x_t, t, \emptyset)$, where $\emptyset$ is a null conditioning token. The training procedure: with probability $p_\text{uncond}$ (typically 10–20%), set $c = \emptyset$ (drop the conditioning); otherwise use the real conditioning $c$. A single network learns both modes.
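The conditioning-dropout step can be sketched in a few lines. This is only an illustration: the `NULL_TOKEN` placeholder and the helper `drop_condition` are hypothetical names, and a real training loop would swap in a learned or fixed null embedding rather than a string.

```python
import random

NULL_TOKEN = "<null>"  # hypothetical placeholder for the null condition

def drop_condition(c, p_uncond, rng):
    """Return NULL_TOKEN with probability p_uncond, else the real condition."""
    return NULL_TOKEN if rng.random() < p_uncond else c

rng = random.Random(0)                      # seeded for reproducibility
batch = ["a red cube"] * 10_000
trained_on = [drop_condition(c, p_uncond=0.1, rng=rng) for c in batch]
frac_uncond = trained_on.count(NULL_TOKEN) / len(trained_on)
print(f"fraction trained unconditionally: {frac_uncond:.3f}")
```

Because the drop is sampled independently per example, the same network sees both conditional and unconditional targets within every batch.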

At sampling time, the guided noise prediction is a linear extrapolation from the unconditional prediction through the conditional one:

$$\tilde\epsilon_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + s \cdot [\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)]$$

This can be rearranged as $(1-s)\,\epsilon_\theta(x_t, t, \emptyset) + s\cdot\epsilon_\theta(x_t, t, c)$. For $s = 1$, CFG recovers the standard conditional model. For $s > 1$, it extrapolates beyond the conditional, sampling from an implicit distribution that more strongly satisfies the condition but with reduced diversity:

$$\log p_s(x \mid c) = \log p(x) + s \log p(c \mid x) + \text{const} = \log p(x \mid c) + (s-1) \log p(c \mid x) + \text{const}$$

Equivalently, $p_s(x \mid c) \propto p(x \mid c)\, p(c \mid x)^{s-1}$: the true conditional multiplied by the conditional likelihood raised to the power $(s-1)$, a sharpened version of the conditional that trades diversity for prompt adherence. Typical guidance scales range from $s = 5$ to $s = 15$ for image generation; higher values produce sharper adherence to the prompt but more saturated, artifact-prone images.
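The extrapolation itself is a one-line combination of two noise predictions. The sketch below assumes the two predictions are already available as plain lists of floats; `cfg_combine` is a hypothetical helper name, and the numbers are made up for illustration.

```python
def cfg_combine(eps_uncond, eps_cond, s):
    """eps_tilde = eps_u + s * (eps_c - eps_u), applied elementwise."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.10, -0.20, 0.05]    # unconditional prediction (toy values)
eps_c = [0.30, -0.10, -0.15]   # conditional prediction (toy values)

for s in (0.0, 1.0, 7.5):
    print(s, cfg_combine(eps_u, eps_c, s))
```

With `s = 0` the result is the unconditional prediction, with `s = 1` it is the conditional one, and larger values push further along the same direction.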

CFG advantages over classifier guidance: no separate classifier needed; the same network handles both conditional and unconditional generation; no backward pass through a classifier at sampling time; applicable to any conditioning signal including continuous embeddings. The cost is two forward passes per sampling step (one conditional, one unconditional), which are usually batched together. CFG is the standard conditioning mechanism in most widely used text-to-image systems.


CFG: the implicit energy model interpretation

CFG is best understood as energy-based modeling. Starting from Bayes' rule: $\log p(x \mid c) = \log p(x) + \log p(c \mid x) - \log p(c)$. The CFG sampling procedure applies a scaled version of the conditional likelihood gradient:

$$\nabla_x \log p_s(x \mid c) = \nabla_x \log p(x) + s \cdot \nabla_x \log p(c \mid x)$$

Note that $\log p(c)$ does not depend on $x$, so its gradient is zero and it drops from the score equation. This scaling by $s$ implies an implicit likelihood function where the strength of the conditioning signal is amplified:

$$p_s(x \mid c) \propto p(x) \cdot p(c \mid x)^s$$

For $s = 1$, this recovers the true conditional $p(x \mid c)$. For $s > 1$, we sample from a sharpened version that concentrates near the modes of $p(c \mid x)$.
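The sharpening effect is easy to see on a small discrete example. The sketch below assumes a toy space of four "images" with made-up prior and likelihood values; `tempered_posterior` and `entropy` are hypothetical helper names.

```python
import math

def tempered_posterior(prior, likelihood, s):
    """p_s(x | c) proportional to p(x) * p(c | x)**s over a finite set of x."""
    weights = [p * (l ** s) for p, l in zip(prior, likelihood)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Shannon entropy in nats, used here as a simple diversity measure."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

prior = [0.4, 0.3, 0.2, 0.1]        # p(x) over four toy "images"
likelihood = [0.2, 0.5, 0.9, 0.6]   # p(c | x) for one fixed prompt c

for s in (1, 3, 10):
    p_s = tempered_posterior(prior, likelihood, s)
    print(s, [round(q, 3) for q in p_s], round(entropy(p_s), 3))
```

For these numbers, the mass shifts toward the item with the highest $p(c \mid x)$ as $s$ grows and the entropy drops, which is the diversity loss described above.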

The manifold departure problem. For large $s$, samples concentrate in regions where $p(c \mid x)$ is high but $p(x)$ is low, producing manifold departure artifacts: oversaturated colors, texture repetition, distorted anatomy. The default scale for Stable Diffusion is $s = 7.5$; scales above 15–20 produce visible artifacts from amplifying CLIP-favored features to extremes.

Perp-CFG. Sanchez et al. (2023) propose Perp-CFG to mitigate manifold departure by scaling only the guidance component perpendicular to the unconditional score: $\tilde\epsilon = \epsilon_u + s \cdot \text{proj}_{\perp}(\epsilon_c - \epsilon_u)$, where $\epsilon_u = \epsilon_\theta(x_t, t, \emptyset)$ and $\epsilon_c = \epsilon_\theta(x_t, t, c)$. This is intended to keep the denoising trajectory within the high-density manifold, yielding comparable prompt adherence at lower scales ($s = 5$) with reduced artifacts.
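As a rough sketch of the formula as stated above (not a reproduction of any published implementation), the perpendicular projection can be written with plain vector arithmetic. The helper names `dot` and `perp_cfg` are hypothetical, and the code assumes the unconditional prediction is not the zero vector.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def perp_cfg(eps_u, eps_c, s):
    """eps_u + s * (component of (eps_c - eps_u) orthogonal to eps_u)."""
    d = [c - u for u, c in zip(eps_u, eps_c)]
    coef = dot(d, eps_u) / dot(eps_u, eps_u)   # assumes eps_u is not all zeros
    d_perp = [di - coef * ui for di, ui in zip(d, eps_u)]
    return [u + s * dp for u, dp in zip(eps_u, d_perp)], d_perp

eps_u = [1.0, 0.0]   # toy unconditional prediction
eps_c = [2.0, 1.0]   # toy conditional prediction
guided, d_perp = perp_cfg(eps_u, eps_c, s=5.0)
print(guided, d_perp)
```

In this example the part of the guidance direction parallel to `eps_u` is discarded, so scaling by `s` changes only the orthogonal coordinate.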

This view connects to tempering: raising the likelihood to the power $s$ multiplies its energy $-\log p(c \mid x)$ by $s$, which is equivalent to lowering the temperature of the likelihood term to $1/s$. The tension between guidance strength and quality is fundamental: stronger conditioning means relying more on CLIP's semantics rather than the learned denoising trajectory.


Cross-attention for text conditioning

Text-conditioned diffusion models inject the text conditioning $c$ into the denoising network through cross-attention layers. The conditioning signal is a sequence of text embeddings $c = (c_1, \ldots, c_L)$ from a pretrained text encoder (CLIP, T5, etc.). In each cross-attention layer, the spatial features of the noised image $\phi(x_t)$ serve as queries, and the text embeddings serve as keys and values:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

$$Q = W_Q\, \phi(x_t), \quad K = W_K\, c, \quad V = W_V\, c$$

The attention weights $\text{softmax}(QK^\top/\sqrt{d_k})$ indicate which text tokens are relevant to each spatial location at each timestep. Cross-attention enables soft, position-specific binding: the word "red" can attend to the region of the image being denoised that corresponds to the colored object, while "background" attends to other regions.
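A minimal sketch of the attention computation, in plain Python on tiny hand-written matrices. It assumes the projections $W_Q$, $W_K$, $W_V$ have already been applied, so `Q`, `K`, and `V` are given directly; the function names and all numbers are illustrative.

```python
import math

def softmax(row):
    m = max(row)                              # subtract max for stability
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def cross_attention(Q, K, V):
    """Q: N x d_k image queries; K: L x d_k and V: L x d_v text keys/values."""
    d_k = len(Q[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    A = [softmax(row) for row in scores]      # N x L attention weights
    out = [[sum(a * v_row[j] for a, v_row in zip(a_row, V))
            for j in range(len(V[0]))] for a_row in A]
    return out, A

Q = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]      # 3 spatial positions
K = [[1.0, 0.0], [0.0, 1.0]]                  # 2 text tokens
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # token values, d_v = 3
out, A = cross_attention(Q, K, V)
for a_row in A:
    print([round(a, 3) for a in a_row])
```

Each row of `A` is a distribution over text tokens for one spatial position: the first position attends mostly to the first token, the second mostly to the second, and the third splits evenly.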

CLIP embeddings: trained by contrastive learning on (image, text) pairs to produce a shared embedding space where semantically matching image and text embeddings are close. Using CLIP text encoders as the text conditioning provides strong semantic grounding — the embedding captures that "dog" and "canine" are synonymous, that "red cube on a green surface" combines spatial, color, and object concepts, etc.


Cross-attention: memory and compute analysis

The computational cost of cross-attention in diffusion U-Nets is critical for understanding why certain architectural choices are made. Let us compute this rigorously.

Tensor dimensions. The query tensor has shape $Q: [B, H_f \cdot W_f, d_k]$ where $B$ is batch size, $H_f \times W_f$ is the spatial size of the feature map, and $d_k$ is the head dimension. Key and value tensors have shape $K, V: [B, L, d_k]$ where $L = 77$ (CLIP token length).

Attention matrix. The attention matrix $A = \text{softmax}(QK^\top / \sqrt{d_k})$ has shape $[B, H_f \cdot W_f, L]$: asymmetric, with the image dimension much larger than the text dimension.

Memory analysis. At 64×64 resolution, the attention matrix is $[B, 4096, 77]$, requiring $4096 \times 77 \times 4$ bytes $\approx$ 1.26 MB per batch element and attention head in float32.

Self-attention at 64×64 requires $4096 \times 4096 \times 4$ bytes = 64 MiB (about 67 MB) per batch element and head. Cross-attention is therefore roughly 50× smaller ($4096/77 \approx 53$). This is one reason U-Nets often restrict self-attention to lower-resolution feature maps (for example 8×8 or 16×16) while applying cross-attention more broadly.

High-resolution scaling. At 256×256 resolution, self-attention requires 16 GiB per batch element and head (prohibitive), while cross-attention requires only about 20 MB (tractable). Self-attention scales as $O(N^2)$ in the number of spatial positions $N = H_f W_f$; cross-attention scales as $O(N)$ because the text length is fixed. Text conditioning through cross-attention therefore stays cheap as resolution grows.

Computational cost (FLOPs). At 64×64 with $d_k = 64$, $L = 77$, cross-attention requires approximately 40 million multiply-adds per batch element and head ($2 \times 4096 \times 77 \times 64$, covering the $QK^\top$ and $AV$ products). Self-attention at the same resolution requires approximately 2 billion multiply-adds, making cross-attention roughly 50× cheaper. Efficient algorithms like Flash Attention (Dao et al., 2022) further reduce memory traffic by tiling the computation and avoiding materialization of the full attention matrix.
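The figures above can be reproduced with a few lines of arithmetic. The sketch counts one float32 attention matrix per head and batch element, counts multiply-adds for the $QK^\top$ and $AV$ products only, and assumes the value dimension equals $d_k$; the helper names are hypothetical.

```python
BYTES_FP32 = 4

def attn_matrix_bytes(n_queries, n_keys):
    """Bytes for one float32 attention matrix (one head, one batch element)."""
    return n_queries * n_keys * BYTES_FP32

def attn_multiply_adds(n_queries, n_keys, d_k):
    """Multiply-adds for Q K^T plus A V, ignoring softmax and projections."""
    return 2 * n_queries * n_keys * d_k

L, d_k = 77, 64
for side in (64, 256):
    n = side * side   # number of spatial positions
    print(side,
          attn_matrix_bytes(n, L), attn_matrix_bytes(n, n),
          attn_multiply_adds(n, L, d_k), attn_multiply_adds(n, n, d_k))
```

The ratio between the self-attention and cross-attention figures is $N/L$, which is about 53 at 64×64 and grows linearly with the number of spatial positions.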

Architectural implication. Cross-attention to a fixed-length text sequence remains affordable at high spatial resolution, whereas full self-attention does not. Self-attention is therefore often confined to low resolutions (8×8 or 16×16), while cross-attention can be applied at full feature resolution. This asymmetric pattern appears across vision-language models and long-context transformers.


ControlNet

ControlNet (Zhang et al., 2023) provides spatial structural conditioning (edge maps, depth maps, segmentation masks, pose keypoints) on top of a pretrained text-to-image diffusion model without retraining the base model.

The architecture: take the encoder blocks of a pretrained diffusion U-Net and create a trainable copy initialized with the pretrained weights. The structural conditioning $c_\text{ctrl}$ (e.g., Canny edges) is injected into this copy through a "zero convolution" layer (a 1×1 convolution initialized with zero weights and zero bias). The outputs of the trainable copy are added to the corresponding layers of the locked base model:

$$\text{output}_l = f_\theta^{(\text{locked})}(x, t, c_\text{text}) + \mathcal{Z}_l\!\left(f_{\theta^+}^{(\text{trainable})}(x, t, c_\text{text}, c_\text{ctrl})\right)$$

where $\mathcal{Z}_l$ is the zero convolution at layer $l$ and $\theta^+$ are the trainable copy's parameters. At initialization, $\mathcal{Z}_l$ outputs zero, so the model behaves identically to the base model. As training progresses, the zero convolutions learn to pass structural information that modifies the base model's output.

The locked base model preserves the pretrained text-image knowledge; the trainable copy learns the structural condition. The zero convolution initialization prevents damaging the pretrained features early in training when the structural conditioning signal is still uninformative. ControlNet can be applied to any spatial conditioning type by replacing $c_\text{ctrl}$ with the appropriate spatial input.
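The additive structure can be sketched on a toy feature map. This is a simplified illustration, not the ControlNet code: features are nested lists of shape positions × channels, a 1×1 convolution is written as the same linear map applied at every position, and `conv1x1` and `controlnet_block` are hypothetical names.

```python
def conv1x1(h, W, b):
    """Apply a 1x1 convolution: the same linear map at every spatial position."""
    return [[sum(w * x for w, x in zip(w_row, pos)) + b_i
             for w_row, b_i in zip(W, b)] for pos in h]

def controlnet_block(locked_out, trainable_out, W, b):
    """output_l = locked features + zero_conv(trainable-copy features)."""
    z = conv1x1(trainable_out, W, b)
    return [[lo + zi for lo, zi in zip(lo_pos, z_pos)]
            for lo_pos, z_pos in zip(locked_out, z)]

# Two spatial positions, two channels (toy numbers, not real activations).
locked_out = [[0.5, -1.0], [2.0, 0.3]]
trainable_out = [[1.2, 0.7], [-0.4, 0.9]]

W0, b0 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]     # zero initialization
W1, b1 = [[0.1, 0.0], [0.0, -0.2]], [0.0, 0.0]    # hypothetical trained weights

print(controlnet_block(locked_out, trainable_out, W0, b0))
print(controlnet_block(locked_out, trainable_out, W1, b1))
```

With the zero-initialized weights the block returns the locked features unchanged; once the weights move away from zero, the trainable branch starts to shift them.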


Zero convolution gradient analysis

The zero convolution mechanism deserves closer analysis, as its design encodes important principles about adapter-based fine-tuning. Let us derive the gradient flow explicitly.

Setup. The zero convolution $\mathcal{Z}(h) = Wh + b$ is initialized with $W = 0$, $b = 0$, where $h$ is the hidden representation from the trainable U-Net encoder.

Output at initialization. For any input $h$, the output is

$$\mathcal{Z}(h) = 0 \cdot h + 0 = 0$$

regardless of the value of $h$. Thus the ControlNet adds nothing to the base model output at initialization.

Gradient w.r.t. parameters. The gradient of the loss $\mathcal{L}$ w.r.t. the weight $W$ is

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \mathcal{Z}}\, h^\top$$

where $\frac{\partial \mathcal{L}}{\partial \mathcal{Z}}$ is the upstream gradient from the loss.

When $W = 0$: The output is $\mathcal{Z}(h) = 0$, but the upstream gradient $\frac{\partial \mathcal{L}}{\partial \mathcal{Z}}$ is generally nonzero: the loss depends on the sum of the locked output and the ControlNet output, so it is sensitive to $\mathcal{Z}$ even when $\mathcal{Z}$ itself is zero.

Therefore:

$$\frac{\partial \mathcal{L}}{\partial W}\bigg|_{W=0} = \frac{\partial \mathcal{L}}{\partial \mathcal{Z}}\, h^\top$$

which is nonzero provided $h$ and the upstream gradient are both nonzero. Since $h$ is the hidden state from a forward pass through a neural network (with nonlinear activations, normalization layers, etc.), it is virtually certain to be nonzero. By contrast, the gradient passed back to $h$ is $W^\top \frac{\partial \mathcal{L}}{\partial \mathcal{Z}} = 0$ at $W = 0$, so on the very first step only the zero convolution's own parameters are updated; the layers behind it receive gradient once $W$ is nonzero.

First gradient step. After the first update:

$$W^{(1)} \leftarrow W^{(0)} - \eta \frac{\partial \mathcal{L}}{\partial W}\bigg|_{W=0} = 0 - \eta \frac{\partial \mathcal{L}}{\partial \mathcal{Z}}\, h^\top = -\eta \frac{\partial \mathcal{L}}{\partial \mathcal{Z}}\, h^\top$$

which is nonzero. The zero convolution immediately begins learning to transmit the structural conditioning signal.

Design implication. Zero initialization ensures the ControlNet starts with zero contribution (no disruption to the pretrained model's behavior) but learns from the first gradient step (no dead zone). This cold-start adapter design balances stability with responsiveness, unlike random initialization, which would perturb the base model's output from the start.
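The derivation can be checked numerically at a single spatial position. The sketch below assumes a squared-error loss $\mathcal{L} = \tfrac{1}{2}\lVert \text{locked} + \mathcal{Z}(h) - \text{target}\rVert^2$ purely for illustration (the actual training loss is the denoising objective), and all names and numbers are hypothetical.

```python
def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def zero_conv_grads(W, b, h, locked, target):
    """Gradients of L = 0.5 * ||locked + (W h + b) - target||^2."""
    z = [zi + bi for zi, bi in zip(matvec(W, h), b)]
    dL_dz = [l + zi - t for l, zi, t in zip(locked, z, target)]  # upstream grad
    dL_dW = outer(dL_dz, h)                                      # (dL/dZ) h^T
    W_T = [list(col) for col in zip(*W)]
    dL_dh = matvec(W_T, dL_dz)                                   # W^T (dL/dZ)
    return dL_dW, dL_dh

h = [1.0, -2.0]                 # features from the trainable copy
locked = [0.5, 0.0]             # locked-branch output at one position
target = [1.0, -1.0]
W0, b0 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]

dL_dW, dL_dh = zero_conv_grads(W0, b0, h, locked, target)
lr = 0.1
W1 = [[w - lr * g for w, g in zip(w_row, g_row)]
      for w_row, g_row in zip(W0, dL_dW)]
print(dL_dW, dL_dh, W1)
```

At $W = 0$ the weight gradient is already nonzero, so one step of gradient descent gives a nonzero `W1`, while the gradient with respect to the input `h` is exactly zero on that first step.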


Adaptive layer normalization

An alternative to cross-attention for scalar or low-dimensional conditioning (class labels, timestep, noise level) is adaptive layer normalization (AdaLN). Standard layer normalization normalizes features and applies a fixed learned scale and shift; AdaLN makes these parameters conditioning-dependent:

$$\text{AdaLN}(h, c) = \gamma(c) \odot \frac{h - \mu(h)}{\sigma(h)} + \delta(c)$$

where $\gamma(c)$ and $\delta(c)$ are learned projections of the conditioning signal $c$. In DiT (diffusion transformers; Peebles and Xie, 2023), the adaLN-Zero variant initializes the modulation so that $\gamma = 1$, $\delta = 0$ (plain normalization) and additionally zero-initializes a learned gate on each residual branch, so every block starts as the identity function. This gives training stability similar to ControlNet's zero convolutions.
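A small sketch of the AdaLN formula on a single feature vector. It assumes $\gamma$ and $\delta$ are plain linear maps of $c$ with a bias term, and it does not model the residual gate used in adaLN-Zero; function names and numbers are illustrative.

```python
import math

def layer_norm(h, eps=1e-5):
    mu = sum(h) / len(h)
    var = sum((x - mu) ** 2 for x in h) / len(h)
    return [(x - mu) / math.sqrt(var + eps) for x in h]

def linear(W, b, c):
    return [sum(w * x for w, x in zip(row, c)) + b_i for row, b_i in zip(W, b)]

def ada_ln(h, c, W_gamma, b_gamma, W_delta, b_delta):
    """gamma(c) * normalize(h) + delta(c), with gamma and delta linear in c."""
    gamma = linear(W_gamma, b_gamma, c)
    delta = linear(W_delta, b_delta, c)
    return [g * x + d for g, x, d in zip(gamma, layer_norm(h), delta)]

h = [1.0, 2.0, 4.0]
c = [0.5, -1.0]                                # e.g. a class/timestep embedding
zeros = [[0.0, 0.0] for _ in h]

# Zero weights with bias 1 for gamma and 0 for delta: gamma = 1, delta = 0.
plain = ada_ln(h, c, zeros, [1.0] * 3, zeros, [0.0] * 3)
# Nonzero projections make the scale and shift depend on c.
W_g = [[0.2, 0.0], [0.0, 0.1], [0.3, 0.3]]
modulated = ada_ln(h, c, W_g, [1.0] * 3, zeros, [0.5] * 3)
print(plain, modulated)
```

With zero projection weights the layer reduces to ordinary normalization regardless of `c`; nonzero weights let the conditioning rescale and shift each channel.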


GenAI context: conditioning across the curriculum

Classifier-free guidance is the de facto standard conditioning mechanism in text-to-image diffusion systems, including DALL-E 2 and Stable Diffusion. The cross-attention mechanism for text conditioning is directly analogous to the cross-attention in encoder-decoder transformers (T5, BART): the denoising network plays the role of the decoder, and the text encoder output is the encoder representation being attended to. ControlNet's locked-plus-trainable-copy architecture resembles LoRA (Course 2, Week 11): both freeze the pretrained model and add a trainable adapter, with zero initialization ensuring the adapter starts from the pretrained behavior.

More broadly, the conditioning concepts in this lecture appear throughout the four-course sequence with different instantiations:

| Conditioning mechanism | Course 3 (Generative Models) | Course 1 (Reinforcement Learning) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| Classifier guidance | Noisy-image classifier $\nabla \log p(y\mid x_t)$ | Reward shaping: $r(s,a) + \lambda V(s')$ as a value gradient | Constrained policy with CBF gradient (C2W12): $a \leftarrow a + \lambda \nabla_a h(s,a)$ | VQA reward model gradient |
| CFG | Linear extrapolation $\epsilon_u + s(\epsilon_c - \epsilon_u)$ | Conservative Q-function (CQL): extrapolation beyond observed data | Goal-conditioned imitation: conditioning on goal image drops out with prob. $p$ | Image-conditional generation: CLIP score as guidance |
| Cross-attention | Image queries attend to text keys/values | Q-function attending to action representations | ACT (Action Chunking with Transformers, C2W8): action chunk queries attend to observation keys | VLM (Vision-Language Model) cross-attention (C4W4): image tokens attend to text tokens |
| Structural conditioning | ControlNet: edge/pose maps lock spatial layout | Model-based control: constraint maps in MPC | Force/torque conditioning: proprioceptive state as spatial conditioning | Layout grounding: bounding box conditioning for spatial generation |

The CFG extrapolation formula has a direct analog in model-based reinforcement learning (RL): the guidance scale $s$ weights the conditional signal much as a value function steers rollouts toward high-reward regions. Both face the fundamental tension that stronger guidance toward a target metric (text adherence, reward) trades off against diversity and potential out-of-distribution artifacts.


Key takeaways

  • Classifier guidance steers sampling by adding the gradient of a noise-robust classifier to the denoising score; the guidance scale $s$ controls adherence vs. diversity.
  • Classifier-free guidance achieves the same effect with a single jointly trained conditional/unconditional network and linear extrapolation at sampling time.
  • CFG implicitly samples from an energy-based model where the likelihood is scaled; larger $s$ trades diversity for prompt adherence and, at high values, causes manifold departure artifacts (oversaturation, texture repetition). Perp-CFG mitigates this by scaling only the orthogonal component of guidance.
  • Cross-attention in the denoising U-Net binds text tokens to spatial image regions using CLIP embeddings as keys and values; it is much more memory- and compute-efficient than self-attention on large spatial dimensions, scaling linearly rather than quadratically with resolution.
  • ControlNet provides structural conditioning via a trainable copy of the encoder initialized from pretrained weights, with zero convolution outputs added to the locked base model; zero convolution initialization ensures stability (no disruption of the pretrained model's behavior) while allowing learning from the first gradient step.
  • Adaptive layer normalization provides conditioning for scalar signals through learned scale-and-shift projections.
  • Conditioning mechanisms recur throughout all four courses, unifying concepts from diffusion, reinforcement learning, robotics, and vision-language models.


Conceptual questions

  1. Derive that CFG sampling from $\tilde\epsilon_\theta(x_t, c) = (1-s)\,\epsilon_\theta(x_t, \emptyset) + s\cdot\epsilon_\theta(x_t, c)$ approximates sampling from $p_s(x \mid c) \propto p(x \mid c)^s / p(x)^{s-1}$. For $s = 1$, this recovers $p(x \mid c)$; for $s \to \infty$, what distribution does $p_s(x \mid c)$ approach? What does this imply for the diversity of CFG samples as $s$ increases?

  2. A cross-attention layer in a 64×64 diffusion U-Net attends over a text sequence of $L = 77$ tokens. The queries have dimension $d_k = 64$ and the spatial feature map has $64 \times 64 = 4096$ positions. Compute the memory required for the attention matrix (in bytes, float32) and the computational cost (FLOPs) of computing attention for one image at one timestep. How do these scale with image resolution, and what modification to attention enables higher-resolution generation?

  3. ControlNet uses zero convolutions initialized with weight $W = 0$ and bias $b = 0$. During training, the zero convolution learns a nonzero output after the first gradient step. Derive the gradient of the zero convolution output with respect to the convolution weight $W$ when $W = 0$, and show that the gradient is nonzero only if the input to the convolution is nonzero. What does this imply about the training dynamics in the earliest gradient steps?

  4. Unconditional dropout (setting $c = \emptyset$ with probability $p_\text{uncond}$ during training) is essential for CFG to work. Explain what failure mode occurs when $p_\text{uncond} = 0$ (no unconditional training) and when $p_\text{uncond} = 1$ (no conditional training). What value of $p_\text{uncond}$ would you choose for a task where conditional fidelity is paramount, and how would you validate this choice?

  5. CLIP text embeddings are trained to match image-text pairs; CLIP image embeddings are trained on the same pairs. A text-to-image diffusion model conditioned on CLIP text embeddings can theoretically also be conditioned on CLIP image embeddings (image-to-image generation). Describe the semantic property of CLIP that enables this cross-modal conditioning without architectural changes, and analyze what information is preserved (and lost) when using a CLIP image embedding vs. directly conditioning on pixel values.

✦Solutions
  1. The blend corresponds to the score $\nabla_x\log p(x)+s\,\nabla_x\log p(c\mid x)$, whose stationary distribution is $p_s(x\mid c)\propto p(x)\,p(c\mid x)^s \propto p(x\mid c)^s/p(x)^{s-1}$ (the two forms differ only by the constant factor $p(c)^s$). At $s=1$ this is $p(x\mid c)$. As $s\to\infty$ the mass concentrates on $\arg\max_x p(c\mid x)$ within the support of $p$, approaching a point mass, so diversity collapses as $s$ grows: stronger prompt adherence is bought with mode-seeking and lost variety.
  2. Memory: $4096\times 77\times 4 = 1{,}261{,}568$ bytes $\approx 1.26$ MB. FLOPs: $QK^\top$ costs $\approx 4096\cdot77\cdot64\approx 2\times10^7$ multiply-adds, and roughly the same again for the $AV$ product, for a total of $\approx 4\times10^7$ multiply-adds. The attention matrix is $N\times L$ with $N=H_fW_f$ spatial positions and $L=77$ fixed, so cost is $O(N)$, linear in pixels, whereas self-attention is $O(N^2)$. Higher resolution therefore relies on cross-attention (self-attention only at low resolution), with Flash Attention tiling avoiding materialization of the matrix.
  3. With $\mathcal{Z}(h)=Wh+b$, $\partial\mathcal{L}/\partial W=(\partial\mathcal{L}/\partial\mathcal{Z})\,h^\top$. At $W=0$ the output is $0$ but the upstream gradient $\partial\mathcal{L}/\partial\mathcal{Z}$ is nonzero (the loss depends on the locked + ControlNet sum), so $\partial\mathcal{L}/\partial W=(\partial\mathcal{L}/\partial\mathcal{Z})\,h^\top$ is nonzero exactly when $h\neq 0$. Implication: at initialization the adapter contributes nothing (no disruption to the pretrained model's behavior) yet learns on the very first step (no dead zone), since encoder activations $h$ are essentially always nonzero. This is a clean cold start.
  4. $p_\text{uncond}=0$: the model never sees $\emptyset$, so $\epsilon_\theta(x,\emptyset)$ is undefined or garbage and the CFG extrapolation has no valid unconditional anchor. $p_\text{uncond}=1$: the model never sees a real $c$, so conditional generation fails entirely. When conditional fidelity is paramount, choose a moderate value of $\approx 0.1$–$0.2$ (enough unconditional signal to enable guidance, mostly conditional training); validate by sweeping $s$ on a held-out set and tracing the prompt-adherence (CLIP score) vs. FID/diversity frontier, picking the $p_\text{uncond}$ that gives the best frontier.
  5. CLIP's shared image–text embedding space (contrastive training pulls matching image and text to nearby vectors) means a CLIP image embedding lives in the same space the model was conditioned on via text, so it can be substituted with no architecture change. Preserved: high-level semantics such as objects, scene, and style. Lost: precise spatial layout, fine appearance, and identity, because the CLIP embedding is a compact global summary. Conditioning on pixels directly keeps that spatial detail but lacks the semantic abstraction and the cross-modal transfer.

Looking ahead

Conditioning mechanisms enable text-guided generation in pixel space. But scaling to high-resolution images makes pixel-space training prohibitively expensive.

Week 9: Latent Diffusion and Multimodal Generation. We examine latent diffusion models that compress images into compact latent codes and apply diffusion in latent space, then survey how the same diffusion backbone extends to audio (spectrograms), video (temporal tokens), and joint embedding spaces for multimodal generation.


Further reading

  • Ho, J., & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS Workshop. (The CFG algorithm used in all modern models).
  • Dhariwal, P., & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS. (Classifier Guidance).