Week 8: Conditioning and Control

Purpose of this lecture

Unconditional generative models produce samples from the full learned distribution, which is useful for sampling diverse outputs but insufficient for applications requiring specific content. Conditioning mechanisms enable targeted generation: producing samples that satisfy a constraint (text prompt, edge map, reference image, class label). This lecture derives the two primary conditioning paradigms — classifier guidance and classifier-free guidance — and examines the architectural mechanisms that implement conditioning in practice: cross-attention, adaptive layer normalization, and ControlNet structural conditioning.

Classifier guidance

Classifier guidance (Dhariwal and Nichol, 2021) derives a sampling procedure for a class-conditional distribution $p(x \mid y)$ using an unconditional diffusion model $p_\theta(x)$ and a separately trained classifier $p_\phi(y \mid x_t)$ that classifies noisy images. The key identity:

\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p_\phi(y \mid x_t)
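The identity is Bayes' rule in log form followed by a gradient in $x_t$. A short derivation, writing $p(y)$ for the class prior (which does not depend on $x_t$) and substituting the learned classifier $p_\phi$ for the true $p(y \mid x_t)$ in the last step:

```latex
\begin{aligned}
\log p(x_t \mid y) &= \log p(x_t) + \log p(y \mid x_t) - \log p(y) \\
\nabla_{x_t} \log p(x_t \mid y) &= \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t) - \underbrace{\nabla_{x_t} \log p(y)}_{=\,0} \\
&\approx \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p_\phi(y \mid x_t)
\end{aligned}
```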

Since the denoising score is $\nabla_{x_t} \log p_t(x_t) \approx -\epsilon_\theta(x_t, t)/\sqrt{1-\bar\alpha_t}$, multiplying the guided score by $-\sqrt{1-\bar\alpha_t}$ gives the guided noise prediction:

\tilde\epsilon_\theta(x_t, t, y) = \epsilon_\theta(x_t, t) - \sqrt{1-\bar\alpha_t} \nabla_{x_t} \log p_\phi(y \mid x_t)

Replacing $\epsilon_\theta$ with $\tilde\epsilon_\theta$ in ancestral sampling steers the denoising toward regions of high $p_\phi(y \mid x_t)$ — samples that the classifier attributes to class $y$ . A guidance scale $s > 1$ amplifies this effect:

\tilde\epsilon_\theta = \epsilon_\theta - s \sqrt{1-\bar\alpha_t} \nabla_{x_t} \log p_\phi(y \mid x_t)

Classifier guidance requires training a separate noise-robust classifier on noisy inputs and performing backward passes through the classifier at each sampling step, adding computational overhead. The classifier must be specifically designed for noisy images at multiple noise levels — standard classifiers trained on clean images perform poorly.
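The guided update can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "classifier" is a hypothetical logistic model $p(y{=}1 \mid x) = \sigma(w \cdot x)$ whose log-probability gradient is available in closed form, and `guided_eps`, `toy_log_prob_grad`, and all numeric values are made up for the example. A real system would obtain the gradient by backpropagating through a trained noisy-image classifier.

```python
import math

def guided_eps(eps, classifier_grad, alpha_bar_t, scale=1.0):
    """Return eps - scale * sqrt(1 - alpha_bar_t) * grad log p(y | x_t), elementwise."""
    sigma_t = math.sqrt(1.0 - alpha_bar_t)
    return [e - scale * sigma_t * g for e, g in zip(eps, classifier_grad)]

def toy_log_prob_grad(x, w):
    """Gradient of log sigmoid(w . x) with respect to x: (1 - sigmoid(w . x)) * w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(1.0 - p) * wi for wi in w]

x_t = [0.5, -1.0]
eps = [0.2, 0.1]   # stand-in for the unconditional prediction eps_theta(x_t, t)
w = [1.0, 2.0]     # hypothetical classifier weights
grad = toy_log_prob_grad(x_t, w)
eps_tilde = guided_eps(eps, grad, alpha_bar_t=0.75, scale=3.0)
```

Because the classifier gradient here points along $+w$, the guided prediction is smaller than `eps` in every coordinate, which moves the implied denoised estimate toward the region the classifier assigns to the target class.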

Classifier-free guidance

Classifier-free guidance (CFG; Ho and Salimans, 2021) avoids the separate classifier by jointly training a conditional model $\epsilon_\theta(x_t, t, c)$ and an unconditional model $\epsilon_\theta(x_t, t, \emptyset)$ , where $\emptyset$ is a null conditioning token. The training procedure: with probability $p_\text{uncond}$ (typically 10–20%), set $c = \emptyset$ (drop the conditioning); otherwise use the real conditioning $c$ . A single network learns both modes.
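The conditioning-dropout step of training can be sketched as follows. This is a minimal illustration, not a training loop: `NULL_TOKEN` and `drop_conditioning` are hypothetical names, and real implementations usually replace the text embedding with a learned or empty-prompt embedding rather than `None`.

```python
import random

NULL_TOKEN = None  # hypothetical stand-in for the null conditioning token

def drop_conditioning(conds, p_uncond, rng):
    """Replace each conditioning entry with NULL_TOKEN with probability p_uncond."""
    return [NULL_TOKEN if rng.random() < p_uncond else c for c in conds]

rng = random.Random(0)
batch = ["a red cube"] * 10_000
dropped = drop_conditioning(batch, p_uncond=0.1, rng=rng)
frac_null = sum(c is NULL_TOKEN for c in dropped) / len(dropped)
```

With `p_uncond=0.1`, roughly one example in ten is trained in the unconditional mode, so a single network sees both cases.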

At sampling time, the guided noise prediction is a linear extrapolation from the unconditional prediction through and beyond the conditional one:

\tilde\epsilon_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + s \cdot [\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)]

This can be rearranged as $(1-s)\,\epsilon_\theta(x_t, t, \emptyset) + s\,\epsilon_\theta(x_t, t, c)$. For $s = 1$, CFG recovers the standard conditional model. For $s > 1$, it extrapolates beyond the conditional prediction, approximately sampling from an implicit distribution that satisfies the condition more strongly but with reduced diversity:

\log p_s(x \mid c) = \log p(x) + s \log p(c \mid x) + \text{const} = \log p(x \mid c) + (s-1) \log p(c \mid x) + \text{const}

Equivalently, $p_s(x \mid c) \propto p(x \mid c)\, p(c \mid x)^{s-1}$: the true conditional multiplied by the conditional likelihood raised to the power $(s-1)$, a sharpened version of the conditional that trades diversity for prompt adherence. Typical guidance scales range from $s = 5$ to $s = 15$ for image generation; higher values produce stronger adherence to the prompt but more saturated, artifact-prone images.
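The link between the extrapolated noise prediction and this implicit distribution can be made explicit. The sketch below assumes the network outputs are exact scaled scores of the noised marginals $p_t$, that is $\epsilon_\theta \approx -\sqrt{1-\bar\alpha_t}\,\nabla_{x_t}\log p_t$, which holds only approximately in practice:

```latex
\begin{aligned}
-\frac{\tilde\epsilon_\theta(x_t, t, c)}{\sqrt{1-\bar\alpha_t}}
&\approx \nabla_{x_t}\log p_t(x_t) + s\,\bigl[\nabla_{x_t}\log p_t(x_t \mid c) - \nabla_{x_t}\log p_t(x_t)\bigr] \\
&= \nabla_{x_t}\log p_t(x_t) + s\,\nabla_{x_t}\log p_t(c \mid x_t) \\
&= \nabla_{x_t}\log\bigl[p_t(x_t)\,p_t(c \mid x_t)^{s}\bigr]
\end{aligned}
```

The second line uses Bayes' rule, $\log p_t(x_t \mid c) - \log p_t(x_t) = \log p_t(c \mid x_t) - \log p_t(c)$, and the fact that $\log p_t(c)$ has zero gradient in $x_t$. Note that the sharpened density at noise level $t$ is not in general the noised version of the sharpened clean-data density, which is one reason the correspondence is approximate.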

CFG advantages over classifier guidance: no separate classifier needed; the same network handles both conditional and unconditional generation; no backward pass through a classifier at sampling time; applicable to any conditioning signal including continuous embeddings. The cost is two forward passes of the denoiser per sampling step (conditional and unconditional), which are usually batched together. CFG is the standard conditioning mechanism in most widely used text-to-image systems.
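The sampling-time combination is a one-line operation on the two predictions. A minimal sketch, with `cfg_combine` as a hypothetical helper name and short lists standing in for the network outputs:

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_u + scale * (eps_c - eps_u), elementwise."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.10, -0.30, 0.50]   # stand-in for eps_theta(x_t, t, null)
eps_c = [0.20, -0.10, 0.40]   # stand-in for eps_theta(x_t, t, c)

same_as_cond = cfg_combine(eps_u, eps_c, 1.0)   # scale 1: plain conditional prediction
guided = cfg_combine(eps_u, eps_c, 7.5)         # scale > 1: extrapolates past eps_c
```

Setting `scale` to 0 returns the unconditional prediction, and the result always agrees with the rearranged form $(1-s)\,\epsilon_u + s\,\epsilon_c$ up to floating-point rounding.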

CFG: the implicit energy model interpretation

CFG can be read as sampling from an implicit energy-based model. Starting from Bayes' rule: $\log p(x \mid c) = \log p(x) + \log p(c \mid x) - \log p(c)$. The CFG sampling procedure applies a scaled version of the conditional likelihood gradient:

\nabla_x \log p_s(x\mid c) = \nabla_x \log p(x) + s \cdot \nabla_x \log p(c\mid x)

Note that $\log p(c)$ does not depend on $x$ , so its gradient is zero and it drops from the score equation. This scaling by $s$ implies an implicit likelihood function where the strength of the conditioning signal is amplified:

p_s(x \mid c) \propto p(x) \cdot p(c \mid x)^s

For $s = 1$ , this recovers the true conditional $p(x \mid c)$ . For $s > 1$ , we sample from a sharpened version that concentrates near the modes of $p(c \mid x)$ .
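The sharpening effect is easy to see on a toy discrete example. The sketch below assumes a hypothetical space of three candidate images with made-up prior and likelihood values; it computes $p_s(x \mid c) \propto p(x)\,p(c \mid x)^s$ and its entropy for several scales.

```python
import math

def sharpened_posterior(prior, likelihood, s):
    """Return p_s(x | c) proportional to prior[x] * likelihood[x] ** s, normalized."""
    weights = [p * (l ** s) for p, l in zip(prior, likelihood)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0)

prior = [0.5, 0.3, 0.2]        # hypothetical p(x) over three candidate images
likelihood = [0.2, 0.5, 0.9]   # hypothetical p(c | x) for one fixed prompt c

for s in (1.0, 3.0, 7.5):
    p_s = sharpened_posterior(prior, likelihood, s)
    print(s, [round(p, 3) for p in p_s], round(entropy(p_s), 3))
```

At `s = 1.0` the result is the ordinary Bayes posterior; in this example, larger scales move almost all mass onto the candidate with the highest likelihood, even though it has the lowest prior probability, and the entropy shrinks accordingly.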

The manifold departure problem. For large $s$, samples concentrate in regions where $p(c \mid x)$ is high but $p(x)$ is low, producing manifold departure artifacts: oversaturated colors, texture repetition, distorted anatomy. A common default scale for Stable Diffusion is $s = 7.5$; scales above roughly 15–20 tend to produce visible artifacts because features that the implicit classifier associates with the prompt are amplified to extremes.

Perp-CFG. Sanchez et al. (2023) propose Perp-CFG to mitigate manifold departure by scaling only the guidance component perpendicular to the unconditional prediction: $\tilde\epsilon = \epsilon_u + s \cdot \text{proj}_{\perp}(\epsilon_c - \epsilon_u)$, where $\epsilon_u = \epsilon_\theta(x_t, t, \emptyset)$, $\epsilon_c = \epsilon_\theta(x_t, t, c)$, and $\text{proj}_{\perp}$ removes the component parallel to $\epsilon_u$. The aim is to keep the denoising trajectory closer to the high-density manifold, with comparable prompt adherence reported at lower scales ($s = 5$) and reduced artifacts.
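One reading of the perpendicular-guidance formula can be sketched as below. This is an assumption-laden illustration: it takes $\text{proj}_{\perp}$ to mean removal of the component of $\epsilon_c - \epsilon_u$ along $\epsilon_u$, treats each prediction as a flat vector, and uses hypothetical helper names. It is not taken from any reference implementation, and a published method may define the projection differently.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def perp_component(v, ref):
    """Part of v orthogonal to ref: v - (<v, ref> / <ref, ref>) * ref."""
    coef = dot(v, ref) / dot(ref, ref)
    return [vi - coef * ri for vi, ri in zip(v, ref)]

def perp_guidance(eps_uncond, eps_cond, scale):
    """eps_u + scale * proj_perp(eps_c - eps_u), with proj_perp taken relative to eps_u."""
    diff = [c - u for u, c in zip(eps_uncond, eps_cond)]
    perp = perp_component(diff, eps_uncond)
    return [u + scale * p for u, p in zip(eps_uncond, perp)]

eps_u = [1.0, 0.0, 0.0]    # stand-in unconditional prediction
eps_c = [1.5, 0.4, -0.2]   # stand-in conditional prediction
guided = perp_guidance(eps_u, eps_c, scale=5.0)
```

In this example the guidance leaves the component along `eps_u` untouched and scales only the two orthogonal coordinates, so the norm of the prediction grows more slowly than under plain CFG at the same scale.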

This view connects to tempering: raising the likelihood to the power $s$ multiplies its energy $-\log p(c \mid x)$ by $s$, which is the same as lowering its temperature to $1/s$. The tension between guidance strength and quality is fundamental: stronger conditioning puts more weight on the model's implicit classifier $p(c \mid x)$ and less on the learned data distribution $p(x)$.

Cross-attention for text conditioning

Text-conditioned diffusion models inject the text conditioning $c$ into the denoising network through cross-attention layers. The conditioning signal is a sequence of text embeddings $c = (c_1, \ldots, c_L)$ from a pretrained language encoder (CLIP, T5, etc.). In each cross-attention layer, the spatial features of the noised image $\phi(x_t)$ serve as queries, and the text embeddings serve as keys and values:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

Q = W_Q \phi(x_t), \quad K = W_K c, \quad V = W_V c

The attention weights $\text{softmax}(QK^\top/\sqrt{d_k})$ indicate which text tokens are relevant to each spatial location at each timestep. Cross-attention enables soft, position-specific binding: image locations that belong to the colored object can place most of their attention weight on the token "red", while background locations attend mostly to "background".
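A single-head cross-attention layer can be sketched in plain Python for small inputs. This is a toy illustration with hypothetical values: features are stored as rows, so the projections are written $Q = \phi W_Q$ rather than $W_Q \phi$, and identity matrices stand in for learned weights. Production code would use a tensor library and multiple heads.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax(row):
    peak = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_feats, text_embeds, w_q, w_k, w_v):
    """Single-head cross-attention: image positions query text tokens."""
    q = matmul(image_feats, w_q)          # [N, d_k]
    k = matmul(text_embeds, w_k)          # [L, d_k]
    v = matmul(text_embeds, w_v)          # [L, d_v]
    d_k = len(q[0])
    scores = matmul(q, transpose(k))      # [N, L]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights    # output [N, d_v], attention map [N, L]

image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N = 3 spatial positions
text_embeds = [[2.0, 0.0], [0.0, 2.0]]               # L = 2 text tokens
identity = [[1.0, 0.0], [0.0, 1.0]]                  # hypothetical projection weights
out, attn = cross_attention(image_feats, text_embeds, identity, identity, identity)
```

The attention map has one row per spatial position and one column per text token: the first position weights the first token most, the second position the second token, and the third position splits its weight evenly.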

CLIP embeddings. CLIP is trained by contrastive learning on (image, text) pairs to produce a shared embedding space in which matching image and text embeddings are close. Using a CLIP text encoder for conditioning provides strong semantic grounding: the embeddings place "dog" and "canine" close together, and encode a phrase such as "red cube on a green surface" as a combination of spatial, color, and object concepts.

Cross-attention: memory and compute analysis

The computational cost of cross-attention in diffusion U-Nets is critical for understanding why certain architectural choices are made. Let us compute this rigorously.

Tensor dimensions. The query tensor has shape $Q: [B, H_f \cdot W_f, d_k]$ where $B$ is batch size, $H_f \times W_f$ is spatial dimension, and $d_k$ is head dimension. Key and value tensors have shape $K, V: [B, L, d_k]$ where $L = 77$ (CLIP token length).

Attention matrix. The attention matrix $A = \text{softmax}(QK^\top / \sqrt{d_k})$ has shape $[B, H_f \cdot W_f, L]$ — asymmetric, with image dimension much larger than text dimension.

Memory analysis. At 64×64 resolution the attention matrix is $[B, 4096, 77]$, requiring $4096 \times 77 \times 4$ bytes ≈ 1.26 MB per head and batch element in float32.

Self-attention at 64×64 requires $4096 \times 4096 \times 4$ bytes = 64 MiB (about 67 MB) per head and batch element, so the cross-attention matrix is $4096/77 \approx 53\times$ smaller. This is one reason many pixel-space U-Nets use cross-attention throughout but restrict self-attention to low-resolution feature maps (for example 8×8 or 16×16).

High-resolution scaling. At 256×256 resolution, self-attention requires 16 GiB per head and batch element (prohibitive), while cross-attention requires only about 20 MB (tractable). With $N = H_f W_f$ spatial positions, self-attention scales as $O(N^2)$ and cross-attention as $O(NL)$, which is linear in $N$ because the text length $L$ is fixed. Text conditioning therefore remains cheap at high resolution, while full self-attention does not.

Computational cost. At 64×64 with $d_k = 64$, $L = 77$, the two matrix products $QK^\top$ and $AV$ in cross-attention require approximately 40 million multiply-accumulates (about 80 million FLOPs) per head and batch element. Self-attention at the same resolution requires approximately 2 billion multiply-accumulates, so cross-attention is about 50× cheaper. Memory-efficient kernels such as FlashAttention (Dao et al., 2022) do not reduce the arithmetic but cut memory use and memory traffic by tiling the computation and never materializing the full attention matrix.
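The figures above can be reproduced with a few lines of arithmetic. The sketch assumes one head, float32 storage, a value dimension equal to $d_k$, and counts only the two matrix products $QK^\top$ and $AV$; the function names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4

def attention_matrix_bytes(num_queries, num_keys):
    """Bytes for one float32 attention matrix (one head, one batch element)."""
    return num_queries * num_keys * BYTES_PER_FLOAT32

def attention_macs(num_queries, num_keys, d_k):
    """Multiply-accumulates for Q K^T plus A V, ignoring projections and softmax."""
    return 2 * num_queries * num_keys * d_k

L, D_K = 77, 64
for side in (64, 256):
    n = side * side
    cross_bytes = attention_matrix_bytes(n, L)
    self_bytes = attention_matrix_bytes(n, n)
    print(f"{side}x{side}: cross {cross_bytes / 1e6:.2f} MB, "
          f"self {self_bytes / 2**20:.0f} MiB, ratio {self_bytes / cross_bytes:.1f}x")

cross_macs = attention_macs(64 * 64, L, D_K)
self_macs = attention_macs(64 * 64, 64 * 64, D_K)
```

The memory ratio between self- and cross-attention is simply $N / L$, so it grows from about 53× at 64×64 to about 851× at 256×256.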

Architectural implication. Cross-attention to a short, fixed-length text sequence stays cheap at any resolution, whereas full self-attention over spatial positions quickly becomes the bottleneck. Self-attention is therefore typically confined to low resolutions (8×8 or 16×16), while cross-attention can be applied at every resolution. The same asymmetric pattern, many queries attending to a short context, appears across vision-language models and long-context transformers.

ControlNet

ControlNet (Zhang et al., 2023) provides spatial structural conditioning (edge maps, depth maps, segmentation masks, pose keypoints) on top of a pretrained text-to-image diffusion model without retraining the base model.

The architecture: take the encoder blocks of a pretrained diffusion U-Net and create a trainable copy initialized with the pretrained weights. The structural conditioning $c_\text{ctrl}$ (e.g., Canny edges) is injected into this copy through a "zero convolution" layer (a 1×1 convolution initialized with zero weights and zero bias). The outputs of the trainable copy pass through further zero convolutions and are added to the corresponding layers of the locked base model:

\text{output}_l = f_\theta^{(\text{locked})}(x, t, c_\text{text}) + \mathcal{Z}_l(f_{\theta^+}^{(\text{trainable})}(x, t, c_\text{text}, c_\text{ctrl}))

where $\mathcal{Z}_l$ is the zero convolution at layer $l$ and $\theta^+$ are the trainable copy's parameters. At initialization, $\mathcal{Z}_l$ outputs zero, so the model behaves identically to the base model. As training progresses, the zero convolutions learn to pass structural information that modifies the base model's output.

The locked base model preserves the pretrained text-image knowledge; the trainable copy learns the structural condition. The zero convolution initialization prevents damaging the pretrained features early in training when the structural conditioning signal is still uninformative. ControlNet can be applied to any conditioning type by replacing $c_\text{ctrl}$ with the appropriate spatial input.
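The additive structure can be illustrated at a single spatial position, where a 1×1 convolution reduces to a matrix-vector product. This is a toy sketch with hypothetical function names and made-up feature values, not the ControlNet implementation.

```python
def zero_conv_1x1(features, weight, bias):
    """1x1 convolution at one position: out[o] = sum_i weight[o][i] * features[i] + bias[o]."""
    return [sum(w * f for w, f in zip(row, features)) + b for row, b in zip(weight, bias)]

def controlnet_block(locked_out, trainable_out, weight, bias):
    """Add the zero-convolved trainable-copy features to the locked block's output."""
    residual = zero_conv_1x1(trainable_out, weight, bias)
    return [a + r for a, r in zip(locked_out, residual)]

locked_out = [0.7, -1.2, 0.3]       # stand-in for f_locked(x, t, c_text) at one position
trainable_out = [5.0, -4.0, 9.0]    # stand-in for f_trainable(x, t, c_text, c_ctrl)
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

at_init = controlnet_block(locked_out, trainable_out, zero_w, zero_b)
```

With zero weights and bias, `at_init` equals `locked_out` no matter how large the trainable copy's features are, which is the sense in which the base model is undisturbed at initialization.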

Zero convolution gradient analysis

The zero convolution mechanism deserves closer analysis, as its design encodes important principles about adapter-based fine-tuning. Let us derive the gradient flow explicitly.

Setup. The zero convolution $\mathcal{Z}(h) = Wh + b$ is initialized with $W = 0, b = 0$ , where $h$ is the hidden representation from the trainable U-Net encoder.

Output at initialization. For any input $h$ , the output is: $\mathcal{Z}(h) = 0 \cdot h + 0 = 0$

regardless of the value of $h$ . Thus the ControlNet adds nothing to the base model output at initialization.

Gradient w.r.t. parameters. The gradient of loss $\mathcal{L}$ w.r.t. weight $W$ is: $\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \mathcal{Z}} h^\top$ where $\frac{\partial \mathcal{L}}{\partial \mathcal{Z}}$ is the upstream gradient from the loss.

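When $W = 0$: The output is $\mathcal{Z}(h) = 0$, but the upstream gradient $\frac{\partial \mathcal{L}}{\partial \mathcal{Z}}$ is generally nonzero. Because $\mathcal{Z}(h)$ is added to the locked model's features, $\frac{\partial \mathcal{L}}{\partial \mathcal{Z}}$ equals the gradient of the loss with respect to those features, which vanishes only if the features are already at a stationary point of the loss for the current batch.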

Therefore: $\frac{\partial \mathcal{L}}{\partial W}\bigg|_{W=0} = \frac{\partial \mathcal{L}}{\partial \mathcal{Z}} h^\top$

which is nonzero provided $h$ and the upstream gradient are both nonzero. Since $h$ is the hidden state from a forward pass through a neural network (nonlinear activations, normalization layers, etc.), it is virtually certain to be nonzero.

First gradient step. After the first update: $W^{(1)} \leftarrow W^{(0)} - \eta \frac{\partial \mathcal{L}}{\partial W}\bigg|_{W=0} = 0 - \eta \frac{\partial \mathcal{L}}{\partial \mathcal{Z}} h^\top = -\eta \frac{\partial \mathcal{L}}{\partial \mathcal{Z}} h^\top$

which is nonzero. The zero convolution immediately begins learning to transmit the structural conditioning signal.

Design implication. Zero initialization ensures the ControlNet starts with zero contribution (no disruption to pretrained weights) but learns immediately on the first gradient step (no dead zone). This cold-start adapter design balances stability with responsiveness, in contrast to random initialization, which would perturb the base model's features with noise from the first forward pass.
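The gradient argument can be checked numerically on a tiny linear layer. The values of `h`, `g`, and the learning rate below are hypothetical; the sketch only evaluates the closed-form gradients of $\mathcal{Z}(h) = Wh + b$ at $W = 0$.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def matvec_T(w, g):
    """Compute W^T g for a nested-list matrix W of shape [out, in]."""
    return [sum(w[o][i] * g[o] for o in range(len(w))) for i in range(len(w[0]))]

h = [0.5, -1.0, 2.0]                       # hypothetical trainable-copy features (input to Z)
g = [0.3, -0.2]                            # hypothetical upstream gradient dL/dZ
w0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # zero-initialized weight, shape [out, in]

grad_w = outer(g, h)        # dL/dW = (dL/dZ) h^T, nonzero even though W = 0
grad_h = matvec_T(w0, g)    # dL/dh = W^T (dL/dZ), exactly zero at W = 0

lr = 0.1
w1 = [[w - lr * gw for w, gw in zip(w_row, g_row)] for w_row, g_row in zip(w0, grad_w)]
```

The weight gradient is nonzero, so `w1` leaves zero after one step. The input gradient $\partial\mathcal{L}/\partial h = W^\top\, \partial\mathcal{L}/\partial\mathcal{Z}$ is zero at initialization, however, so the layers feeding this zero convolution receive gradient through it only once $W$ has moved away from zero.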

Adaptive layer normalization

An alternative to cross-attention for scalar or low-dimensional conditioning (class labels, timestep, noise level) is adaptive layer normalization (AdaLN). Standard layer normalization normalizes features and applies a fixed learned scale and shift; AdaLN makes these parameters conditioning-dependent:

\text{AdaLN}(h, c) = \gamma(c) \odot \frac{h - \mu(h)}{\sigma(h)} + \delta(c)

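where $\gamma(c)$ and $\delta(c)$ are learned projections of the conditioning signal $c$ (a linear layer or small MLP). In DiT (diffusion transformers; Peebles and Xie, 2023), the adaLN-Zero variant additionally regresses a per-channel gate $\alpha(c)$ that scales the output of each residual branch, and initializes the conditioning projection so that $\alpha = 0$. Every transformer block therefore starts as the identity function, which stabilizes training in much the same way as ControlNet's zero convolutions.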
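The AdaLN formula can be sketched on a single feature vector. The example assumes $\gamma(c)$ and $\delta(c)$ are plain linear projections with hypothetical weights, adds a small `eps` inside the square root for numerical stability, and does not include the adaLN-Zero gate.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def linear(c, weight, bias):
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weight, bias)]

def ada_ln(h, c, w_gamma, b_gamma, w_delta, b_delta):
    """AdaLN(h, c) = gamma(c) * normalize(h) + delta(c), elementwise."""
    gamma = linear(c, w_gamma, b_gamma)
    delta = linear(c, w_delta, b_delta)
    return [g * x + d for g, x, d in zip(gamma, layer_norm(h), delta)]

h = [1.0, 2.0, 3.0, 4.0]
c = [1.0, 0.0]                     # hypothetical 2-d conditioning embedding
zeros = [[0.0, 0.0] for _ in h]
# With zero projection weights, gamma = b_gamma and delta = b_delta for every c.
plain = ada_ln(h, c, zeros, [1.0] * 4, zeros, [0.0] * 4)
shifted = ada_ln(h, c, [[0.5, 0.0]] * 4, [1.0] * 4, [[2.0, 0.0]] * 4, [0.0] * 4)
```

Here `plain` reduces to ordinary layer normalization, while `shifted` uses the conditioning to scale the normalized features by 1.5 and shift them by 2.0.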

GenAI context: conditioning across the curriculum

Classifier-free guidance is the conditioning mechanism used in DALL-E 2 and Stable Diffusion and, as far as public documentation indicates, in most other major text-to-image systems. The cross-attention mechanism for text conditioning is directly analogous to the cross-attention in encoder-decoder transformers (T5, BART): the denoising network plays the role of the decoder, and the text encoder output is the encoder representation being attended to. ControlNet's locked-plus-trainable-copy architecture resembles LoRA (Course 2, Week 11): both freeze the pretrained model and add a trainable adapter (a full copy of the encoder in ControlNet, a pair of low-rank matrices in LoRA), with zero initialization ensuring the adapter starts from the pretrained behavior.

More broadly, the conditioning concepts in this lecture appear throughout the four-course sequence with different instantiations:

| Conditioning mechanism | Course 3 (Generative Models) | Course 1 (RL) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| Classifier guidance | Noisy-image classifier $\nabla \log p(y\mid x_t)$ | Reward shaping: $r(s,a) + \lambda V(s')$ as a value gradient | Constrained policy with CBF gradient (C2W12): $a \leftarrow a + \lambda \nabla_a h(s,a)$ | VQA reward model gradient |
| CFG | Linear extrapolation $\epsilon_u + s(\epsilon_c - \epsilon_u)$ | Conservative Q-function (CQL): extrapolation beyond observed data | Goal-conditioned imitation: conditioning on goal image drops out with prob. $p$ | Image-conditional generation: CLIP score as guidance |
| Cross-attention | Image queries attend to text keys/values | Q-function attending to action representations | ACT (C2W8): action chunk queries attend to observation keys | VLM cross-attention (C4W4): image tokens attend to text tokens |
| Structural conditioning | ControlNet: edge/pose maps lock spatial layout | Model-based control: constraint maps in MPC | Force/torque conditioning: proprioceptive state as spatial conditioning | Layout grounding: bounding box conditioning for spatial generation |

The CFG extrapolation formula has a loose analog in model-based RL: the guidance scale $s$ weights the conditional signal much as a value function steers rollouts toward high-reward regions. Both face the fundamental tension that stronger guidance toward a target metric (text adherence, reward) trades off against diversity and risks out-of-distribution artifacts.

Key takeaways

Classifier guidance steers sampling by adding the gradient of a noise-robust classifier to the denoising score; the guidance scale $s$ controls adherence vs. diversity.
Classifier-free guidance achieves the same effect with a single jointly-trained conditional/unconditional network and linear extrapolation at sampling time.
CFG implicitly samples from an energy-based model where the likelihood is scaled; for large $s$, this causes manifold departure artifacts (oversaturation, texture repetition) that trade diversity for prompt adherence. Perp-CFG mitigates this by scaling only the orthogonal component of guidance.
Cross-attention in the denoising U-Net binds text tokens to spatial image regions using CLIP embeddings as keys and values; it is much more memory- and compute-efficient than self-attention on large spatial dimensions, scaling linearly rather than quadratically with the number of spatial positions.
ControlNet provides structural conditioning via a trainable copy of the encoder initialized from pretrained weights, with zero convolution outputs added to the locked base model; zero convolution initialization ensures stability (no corruption of pretrained weights) while allowing learning from the first gradient step.
Adaptive layer normalization provides conditioning for scalar signals through learned scale-and-shift projections.
Conditioning mechanisms recur throughout all four courses, unifying concepts from diffusion, reinforcement learning, robotics, and vision-language models.

Conceptual questions

1. Derive that CFG sampling from $\tilde\epsilon_\theta(x_t, c) = (1-s)\epsilon_\theta(x_t, \emptyset) + s\cdot\epsilon_\theta(x_t, c)$ approximates sampling from $p_s(x \mid c) \propto p(x \mid c)^s / p(x)^{s-1}$. For $s = 1$, this recovers $p(x \mid c)$; for $s \to \infty$, what distribution does $p_s(x \mid c)$ approach? What does this imply for the diversity of CFG samples as $s$ increases?
2. A cross-attention layer in a 64×64 diffusion U-Net attends over a text sequence of $L = 77$ tokens. The queries have dimension $d_k = 64$ and the spatial feature map has $64 \times 64 = 4096$ positions. Compute the memory required for the attention matrix (in bytes, float32) and the computational cost (FLOPs) of computing attention for one image at one timestep. How do these scale with image resolution, and what modification to attention enables higher-resolution generation?
3. ControlNet uses zero convolutions initialized with weight $W = 0$ and bias $b = 0$. During training, the zero convolution learns a nonzero output after the first gradient step. Derive the gradient of the loss with respect to the convolution weight $W$ when $W = 0$, and show that it is nonzero only if the input to the convolution is nonzero. What does this imply about the training dynamics in the earliest gradient steps?
4. Unconditional dropout (setting $c = \emptyset$ with probability $p_\text{uncond}$ during training) is essential for CFG to work. Explain what failure mode occurs when $p_\text{uncond} = 0$ (no unconditional training) and when $p_\text{uncond} = 1$ (no conditional training). What value of $p_\text{uncond}$ would you choose for a task where conditional fidelity is paramount, and how would you validate this choice?
5. CLIP text embeddings are trained to match image-text pairs; CLIP image embeddings are trained on the same pairs. A text-to-image diffusion model conditioned on CLIP text embeddings can theoretically also be conditioned on CLIP image embeddings (image-to-image generation). Describe the semantic property of CLIP that enables this cross-modal conditioning without architectural changes, and analyze what information is preserved (and lost) when using a CLIP image embedding vs. directly conditioning on pixel values.

Solutions

1. The blend corresponds to the score $\nabla_x\log p(x)+s\,\nabla_x\log p(c\mid x)$, whose stationary distribution is $p_s(x\mid c)\propto p(x)\,p(c\mid x)^s = p(x\mid c)^s/p(x)^{s-1}$. At $s=1$ this is $p(x\mid c)$. As $s\to\infty$ the mass concentrates on $\arg\max_x p(c\mid x)$ within the support of $p$, a near point-mass, so diversity collapses as $s$ grows large: stronger prompt adherence is bought with mode-seeking and lost variety.
2. Memory: $4096\times 77\times 4 = 1{,}261{,}568$ bytes $\approx 1.26$ MB. FLOPs: $QK^\top$ costs $\approx 4096\cdot77\cdot64\approx 2\times10^7$, and roughly the same again for the $A V$ product, $\approx 4\times10^7$ multiply-adds. The attention matrix is $N\times L$ with $N=H_fW_f$ spatial positions and $L=77$ fixed, so cost is $O(N)$, linear in pixels, whereas self-attention is $O(N^2)$. Higher resolution therefore relies on cross-attention (self-attention only at low res), with FlashAttention tiling avoiding materializing the matrix.
3. With $\mathcal{Z}(h)=Wh+b$, $\partial\mathcal{L}/\partial W=(\partial\mathcal{L}/\partial\mathcal{Z})\,h^\top$. At $W=0$ the output is $0$ but the upstream gradient $\partial\mathcal{L}/\partial\mathcal{Z}$ is nonzero (the loss depends on the locked + ControlNet sum), so $\partial\mathcal{L}/\partial W=(\partial\mathcal{L}/\partial\mathcal{Z})h^\top$ is nonzero exactly when $h\neq 0$. Implication: at init the adapter contributes nothing (no disruption to pretrained weights) yet learns on the very first step (no dead zone), since encoder activations $h$ are essentially always nonzero: a clean cold-start.
4. $p_\text{uncond}=0$: the model never sees $\emptyset$, so $\epsilon_\theta(x,\emptyset)$ is undefined/garbage and the CFG extrapolation has no valid unconditional anchor. $p_\text{uncond}=1$: the model never sees real $c$, so conditional generation fails entirely. When conditional fidelity is paramount, choose a moderate value $\approx 0.1$–$0.2$ (enough unconditional signal to enable guidance, mostly conditional training); validate by sweeping $s$ on a held-out set and tracing the prompt-adherence (CLIP score) vs FID/diversity frontier, picking the $p_\text{uncond}$ that gives the best frontier.
5. CLIP's shared image–text embedding space (contrastive training pulls matching image and text to nearby vectors) means a CLIP image embedding lives in the same space the model was conditioned on via text, so it can be substituted with no architecture change. Preserved: high-level semantics such as objects, scene, and style. Lost: precise spatial layout, fine appearance, and identity, because the CLIP embedding is a compact global summary. Conditioning on pixels directly keeps that spatial detail but lacks the semantic abstraction and the cross-modal transfer.

Looking ahead

Conditioning mechanisms enable text-guided generation in pixel space. But scaling to high-resolution images makes pixel-space training prohibitively expensive.

Week 9: Latent Diffusion and Multimodal Generation. We examine latent diffusion models that compress images into compact latent codes and apply diffusion in latent space, then survey how the same diffusion backbone extends to audio (spectrograms), video (temporal tokens), and joint embedding spaces for multimodal generation.
