Purpose of this lecture
Unconditional generative models produce samples from the full learned distribution, which is useful for sampling diverse outputs but insufficient for applications requiring specific content. Conditioning mechanisms enable targeted generation: producing samples that satisfy a constraint (text prompt, edge map, reference image, class label). This lecture derives the two primary conditioning paradigms — classifier guidance and classifier-free guidance — and examines the architectural mechanisms that implement conditioning in practice: cross-attention, adaptive layer normalization, and ControlNet structural conditioning.
Classifier guidance
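Classifier guidance (Dhariwal and Nichol, 2021) derives a sampling procedure for a class-conditional distribution $p(x_t \mid y)$ using an unconditional diffusion model and a separately trained classifier $p_\phi(y \mid x_t)$ that classifies noisy images. The key identity follows from Bayes' rule, because $p(y)$ does not depend on $x_t$:
$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)$$
Since the denoising score is $\nabla_{x_t} \log p(x_t) \approx -\epsilon_\theta(x_t, t) / \sigma_t$ with $\sigma_t = \sqrt{1 - \bar\alpha_t}$, the guided noise prediction becomes:
$$\hat\epsilon(x_t, t, y) = \epsilon_\theta(x_t, t) - \sigma_t \, \nabla_{x_t} \log p_\phi(y \mid x_t)$$
Replacing $\epsilon_\theta$ with $\hat\epsilon$ in ancestral sampling steers the denoising toward regions of high $p_\phi(y \mid x_t)$, that is, samples that the classifier attributes to class $y$. A guidance scale $w$ amplifies this effect:
$$\hat\epsilon_w(x_t, t, y) = \epsilon_\theta(x_t, t) - w \, \sigma_t \, \nabla_{x_t} \log p_\phi(y \mid x_t)$$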
Classifier guidance requires training a separate noise-robust classifier on noisy inputs and performing backward passes through the classifier at each sampling step, adding computational overhead. The classifier must be specifically designed for noisy images at multiple noise levels — standard classifiers trained on clean images perform poorly.
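The guided update can be sketched in a few lines. This is a minimal illustration, assuming the noise-prediction parameterization in which the score equals $-\epsilon_\theta / \sigma_t$; the function name `guided_eps` and the toy arrays are hypothetical, and in practice the classifier gradient would come from automatic differentiation through the noisy-image classifier.

```python
import numpy as np

def guided_eps(eps_uncond, classifier_grad, sigma_t, scale=1.0):
    """Shift the predicted noise by the scaled classifier gradient.

    eps_uncond:      unconditional noise prediction eps_theta(x_t, t)
    classifier_grad: grad_x log p_phi(y | x_t) (assumed given, e.g. from autograd)
    sigma_t:         noise standard deviation at step t (score = -eps / sigma_t)
    scale:           guidance scale w; 0 disables guidance
    """
    return eps_uncond - scale * sigma_t * classifier_grad

# Toy values standing in for a real model and classifier.
eps = np.array([0.5, -0.2])
grad = np.array([1.0, 0.0])
out = guided_eps(eps, grad, sigma_t=0.5, scale=2.0)
print(out)
```

Only the first coordinate changes here, because the toy classifier gradient is zero along the second one.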
Classifier-free guidance
Classifier-free guidance (CFG; Ho and Salimans, 2021) avoids the separate classifier by jointly training a conditional model $\epsilon_\theta(x_t, c)$ and an unconditional model $\epsilon_\theta(x_t, \varnothing)$, where $\varnothing$ is a null conditioning token (the dependence on the timestep $t$ is left implicit). The training procedure: with probability $p_{\text{uncond}}$ (typically 10–20%), set $c = \varnothing$ (drop the conditioning); otherwise use the real conditioning $c$. A single network learns both modes.
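The conditioning-dropout step of training can be sketched as follows. This is an illustration only: `NULL_TOKEN` is a hypothetical stand-in for the learned null embedding, and `drop_condition` is a name chosen here, not part of any library.

```python
import random

NULL_TOKEN = None  # hypothetical placeholder for the null conditioning token

def drop_condition(cond, p_uncond, rng):
    """Return the null token with probability p_uncond, else the real condition."""
    return NULL_TOKEN if rng.random() < p_uncond else cond

rng = random.Random(0)
batch = ["a red cube on a green surface"] * 10_000
dropped = [drop_condition(c, 0.1, rng) for c in batch]

# Fraction of examples trained in unconditional mode; close to p_uncond.
frac_null = sum(c is NULL_TOKEN for c in dropped) / len(dropped)
print(f"fraction with conditioning dropped: {frac_null:.3f}")
```

The same network then sees both conditioned and unconditioned examples within one training run.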
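At sampling time, the guided noise prediction is a linear extrapolation that starts at the unconditional prediction and moves through and beyond the conditional one:
$$\tilde\epsilon_w(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \, \big[\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big]$$
This can be rearranged as $\tilde\epsilon_w = (1 - w)\,\epsilon_\theta(x_t, \varnothing) + w\,\epsilon_\theta(x_t, c)$. For $w = 1$, CFG recovers the standard conditional model. For $w > 1$, it extrapolates beyond the conditional, approximately sampling from an implicit distribution that more strongly satisfies the condition but with reduced diversity:
$$\tilde p_w(x \mid c) \propto p(x) \, p(c \mid x)^{w}$$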
This is the product of the data prior $p(x)$ with the conditional likelihood $p(c \mid x)$ raised to the power $w$: a sharpened version of the conditional that trades diversity for prompt adherence. Typical guidance scales for image generation are several times larger than 1; higher values produce sharper adherence to the prompt but more saturated, artifact-prone images.
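The extrapolation rule is a one-line combination of two network outputs. The sketch below assumes the parameterization in which $w = 1$ recovers the conditional model; `cfg_eps` is a name chosen for illustration and the arrays stand in for real noise predictions.

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward (w < 1), onto (w = 1), or past (w > 1) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0, -1.0])   # stand-in for eps_theta(x_t, c)
eps_u = np.array([0.5, 2.5, 0.0])    # stand-in for eps_theta(x_t, null)

guided = cfg_eps(eps_c, eps_u, w=3.0)
rearranged = (1 - 3.0) * eps_u + 3.0 * eps_c  # the equivalent weighted form
print(guided, rearranged)
```

Each sampling step needs both predictions, so CFG roughly doubles the number of network evaluations unless the two are batched together.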
CFG advantages over classifier guidance: no separate classifier needed; the same network handles both conditional and unconditional generation; no backward pass through a classifier at sampling time; applicable to any conditioning signal including continuous embeddings. CFG is the standard guidance mechanism in most widely used text-to-image systems.
CFG: the implicit energy model interpretation
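CFG can be understood as energy-based modeling. Starting from Bayes' rule: $p(x \mid c) = p(c \mid x)\,p(x) / p(c)$. The difference between the conditional and unconditional scores is the gradient of an implicit classifier, $\nabla_x \log p(c \mid x) = \nabla_x \log p(x \mid c) - \nabla_x \log p(x)$. The CFG sampling procedure applies a scaled version of this conditional likelihood gradient:
$$\nabla_x \log \tilde p_w(x \mid c) = \nabla_x \log p(x) + w \, \nabla_x \log p(c \mid x)$$
Note that $p(c)$ does not depend on $x$, so its gradient is zero and it drops from the score equation. This scaling by $w$ implies an implicit likelihood function where the strength of the conditioning signal is amplified:
$$\tilde p_w(x \mid c) \propto p(x) \, p(c \mid x)^{w}$$
For $w = 1$, this recovers the true conditional $p(x \mid c)$. For $w > 1$, we sample from a sharpened version that concentrates near the modes of $p(c \mid x)$.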
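A one-dimensional Gaussian example makes the sharpening concrete. Assume, purely for illustration, a prior $p(x) = \mathcal{N}(0, 1)$ and a likelihood $p(c \mid x) = \mathcal{N}(c; x, s^2)$. Then $p(x)\,p(c \mid x)^w$ is again Gaussian, with precision $1 + w/s^2$ and mean $(wc/s^2)/(1 + w/s^2)$. The function name below is hypothetical.

```python
def tempered_posterior(c, s2, w):
    """Mean and variance of p(x) * p(c|x)^w for a N(0, 1) prior and a
    Gaussian likelihood with variance s2 (a toy model, not a diffusion model)."""
    precision = 1.0 + w / s2          # prior precision + w * likelihood precision
    mean = (w * c / s2) / precision
    return mean, 1.0 / precision

for w in (1.0, 4.0, 16.0):
    mean, var = tempered_posterior(c=2.0, s2=1.0, w=w)
    print(f"w={w:5.1f}  mean={mean:.3f}  variance={var:.3f}")
```

As the guidance weight grows, the mean moves toward the observation $c$ and the variance shrinks, which is the diversity loss described above in miniature.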
The manifold departure problem. For large $w$, samples concentrate in regions where $p(c \mid x)$ is high but $p(x)$ is low, producing manifold departure artifacts: oversaturated colors, texture repetition, distorted anatomy. The default scale for Stable Diffusion is $w = 7.5$; scales above 15–20 produce visible artifacts from amplifying CLIP-favored features to extremes.
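Perp-CFG. Sanchez et al. (2023) propose Perp-CFG to mitigate manifold departure by scaling only the guidance component perpendicular to the unconditional score. Writing $\Delta = \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)$ and splitting it into components parallel and perpendicular to $\epsilon_\theta(x_t, \varnothing)$, $\Delta = \Delta_\parallel + \Delta_\perp$, the guided prediction takes the form $\tilde\epsilon = \epsilon_\theta(x_t, \varnothing) + \Delta_\parallel + w\,\Delta_\perp$. This is intended to keep the denoising trajectory closer to the high-density manifold, yielding comparable prompt adherence at lower scales with reduced artifacts.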
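The perpendicular-guidance idea can be sketched with a plain vector projection. This is one reading of the description above, not a reproduction of the cited method: it assumes the guidance direction $\Delta = \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)$ is split relative to the unconditional prediction, that only the perpendicular part is scaled by $w$, and that the unconditional prediction is nonzero. The function names are hypothetical.

```python
import numpy as np

def split_guidance(eps_cond, eps_uncond):
    """Split (eps_cond - eps_uncond) into parts parallel and perpendicular
    to eps_uncond. Assumes eps_uncond is not the zero vector."""
    delta = eps_cond - eps_uncond
    u = eps_uncond.ravel()
    coef = np.dot(delta.ravel(), u) / np.dot(u, u)
    parallel = coef * eps_uncond
    return parallel, delta - parallel

def perp_guided_eps(eps_cond, eps_uncond, w):
    """Scale only the perpendicular component; w = 1 recovers eps_cond."""
    parallel, perp = split_guidance(eps_cond, eps_uncond)
    return eps_uncond + parallel + w * perp

eps_u = np.array([1.0, 0.0])
eps_c = np.array([2.0, 1.0])
parallel, perp = split_guidance(eps_c, eps_u)
guided = perp_guided_eps(eps_c, eps_u, w=3.0)
print(parallel, perp, guided)
```

With these toy values the parallel part is `[1, 0]` and the perpendicular part is `[0, 1]`, so only the second coordinate is amplified.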
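This view connects to tempering: scaling the log-likelihood by $w$ is equivalent to raising the likelihood $p(c \mid x)$ to the power $w$, that is, multiplying its implicit energy $-\log p(c \mid x)$ by $w$ (an inverse temperature). The tension between guidance strength and quality is fundamental: stronger conditioning means relying more on the implicit classifier, whose semantics come from the text encoder (for example CLIP), and less on the learned data prior.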
Cross-attention for text conditioning
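Text-conditioned diffusion models inject the text conditioning into the denoising network through cross-attention layers. The conditioning signal is a sequence of text embeddings $\tau(c)$ from a pretrained language encoder (CLIP, T5, etc.). In each cross-attention layer, the spatial features $\phi(x_t)$ of the noised image serve as queries, and the text embeddings serve as keys and values:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V, \qquad Q = \phi(x_t)\,W_Q, \quad K = \tau(c)\,W_K, \quad V = \tau(c)\,W_V$$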
The attention weights indicate which text tokens are relevant to each spatial location at each timestep. Cross-attention enables soft, position-specific binding: the word "red" can attend to the region of the image being denoised that corresponds to the colored object, while "background" attends to other regions.
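A single-head cross-attention layer can be sketched in NumPy as follows. The shapes and random weights are placeholders chosen for illustration (real models use multiple heads and learned projections); `cross_attention` is a name introduced here.

```python
import numpy as np

def cross_attention(img_feats, txt_emb, W_q, W_k, W_v):
    """Single-head cross-attention: image positions query text tokens.

    img_feats: (N, d_img) flattened spatial features of the noised image
    txt_emb:   (L, d_txt) text token embeddings
    Returns the attended features (N, d) and the attention weights (N, L).
    """
    Q = img_feats @ W_q                              # (N, d)
    K = txt_emb @ W_k                                # (L, d)
    V = txt_emb @ W_v                                # (L, d)
    logits = Q @ K.T / np.sqrt(Q.shape[-1])          # (N, L)
    logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over text tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
N, L, d_img, d_txt, d = 16, 5, 8, 12, 4
out, attn = cross_attention(
    rng.normal(size=(N, d_img)), rng.normal(size=(L, d_txt)),
    rng.normal(size=(d_img, d)), rng.normal(size=(d_txt, d)),
    rng.normal(size=(d_txt, d)),
)
print(out.shape, attn.shape)
```

Each row of `attn` is a distribution over text tokens for one spatial position, which is the soft binding between words and image regions described above.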
CLIP embeddings: trained by contrastive learning on (image, text) pairs to produce a shared embedding space where semantically matching image and text embeddings are close. Using CLIP text encoders as the text conditioning provides strong semantic grounding — the embedding captures that "dog" and "canine" are synonymous, that "red cube on a green surface" combines spatial, color, and object concepts, etc.
Cross-attention: memory and compute analysis
The computational cost of cross-attention in diffusion U-Nets is critical for understanding why certain architectural choices are made. Let us compute this rigorously.
Tensor dimensions. The query tensor has shape $(B, N, d)$ where $B$ is batch size, $N = H \times W$ is the number of spatial positions, and $d$ is head dimension. Key and value tensors have shape $(B, L, d)$ where $L = 77$ (CLIP token length).
Attention matrix. The attention matrix $QK^\top$ has shape $(B, N, L)$: asymmetric, with the image dimension $N$ much larger than the text dimension $L$.
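Memory analysis. At 64×64 resolution ($N = 4096$): the attention matrix is $4096 \times 77$, requiring $4096 \times 77 \times 4$ bytes ≈ 1.26 MB per head and batch element in float32.
Self-attention at 64×64 requires $4096^2 \times 4$ bytes ≈ 67 MB (64 MiB) per head and batch element. Cross-attention is approximately 50× more memory-efficient (the ratio is $N / L = 4096 / 77 \approx 53$). This helps explain why U-Net designs can afford cross-attention at every resolution while often limiting self-attention to the lowest resolutions (8×8 or 16×16).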
High-resolution scaling. At 256×256 resolution ($N = 65{,}536$): self-attention requires about 17 GB (16 GiB) per head and batch element (prohibitive), while cross-attention requires only about 20 MB (tractable). Self-attention scales as $O(N^2)$ in the number of spatial positions; cross-attention scales as $O(N)$ because the text length $L$ is fixed. This is why text conditioning via cross-attention stays tractable at high resolution.
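The memory figures above follow from simple arithmetic, reproduced in the sketch below. It counts only the attention matrix for one head and one batch element in float32, and ignores activations, projections, and any fused-kernel savings; the helper name is hypothetical.

```python
def attn_matrix_bytes(n_query, n_key, bytes_per_element=4):
    """Bytes needed to store one (n_query x n_key) attention matrix."""
    return n_query * n_key * bytes_per_element

TEXT_LEN = 77  # CLIP token length used in the text
report = {}
for side in (64, 256):
    n = side * side
    report[side] = {
        "cross": attn_matrix_bytes(n, TEXT_LEN),
        "self": attn_matrix_bytes(n, n),
    }
    print(f"{side}x{side}: cross = {report[side]['cross'] / 1e6:.2f} MB, "
          f"self = {report[side]['self'] / 1e6:.1f} MB")
```

Quadrupling the number of positions quadruples the cross-attention matrix but multiplies the self-attention matrix by sixteen.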
Computational cost. At 64×64 with $L = 77$ and $d = 64$, cross-attention requires $2NLd \approx 40$ million multiply-accumulate operations per head and batch element (one $N \times L \times d$ product for $QK^\top$ and one for the weighted sum over $V$). Self-attention at the same resolution requires $2N^2 d \approx 2.1$ billion, making cross-attention about 50× cheaper. Efficient algorithms like Flash Attention (Dao et al., 2022) further reduce costs by tiling computation and avoiding full attention matrix materialization.
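The operation counts can be checked the same way. The sketch assumes a head dimension of 64 and counts one multiply-accumulate per term for the two matrix products ($QK^\top$ and the weighted sum over $V$), which reproduces the approximate figures quoted above; the softmax and the linear projections are ignored.

```python
def attn_macs(n_query, n_key, d):
    """Multiply-accumulates for Q K^T plus (weights @ V), one head."""
    qk = n_query * n_key * d   # scores
    av = n_query * n_key * d   # weighted sum of values
    return qk + av

N, L, D_HEAD = 64 * 64, 77, 64   # D_HEAD = 64 is an assumed head dimension
cross_macs = attn_macs(N, L, D_HEAD)
self_macs = attn_macs(N, N, D_HEAD)
print(f"cross-attention: {cross_macs / 1e6:.1f} M MACs")
print(f"self-attention:  {self_macs / 1e9:.2f} G MACs")
print(f"ratio: {self_macs / cross_macs:.1f}x")
```

The ratio equals $N / L$, so it grows with resolution while the text length stays fixed.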
Architectural implication. Because its cost grows only linearly with the number of spatial positions, cross-attention remains affordable at every resolution of the U-Net, whereas self-attention becomes the bottleneck at high resolution and is often restricted to lower-resolution feature maps (for example 8×8 or 16×16). The same asymmetry, a long query sequence attending to a short fixed-length context, appears across vision-language models and long-context transformers.
ControlNet
ControlNet (Zhang et al., 2023) provides spatial structural conditioning (edge maps, depth maps, segmentation masks, pose keypoints) on top of a pretrained text-to-image diffusion model without retraining the base model.
The architecture: take the encoder blocks of a pretrained diffusion U-Net and create a trainable copy initialized with the pretrained weights. The structural conditioning $c_s$ (e.g., Canny edges) is injected into this copy through a "zero convolution" layer (a 1×1 convolution initialized with zero weights and zero bias). The outputs of the trainable copy pass through further zero convolutions and are added to the corresponding layers of the locked base model:
$$y_l = \mathcal{F}_l(x; \Theta) + \mathcal{Z}_l\big(\mathcal{F}_l(x + \mathcal{Z}_0(c_s); \Theta_c)\big)$$
where $\mathcal{F}_l(\cdot\,; \Theta)$ is the locked block at layer $l$, $\mathcal{Z}_0$ is the input zero convolution, $\mathcal{Z}_l$ is the zero convolution at layer $l$ and $\Theta_c$ are the trainable copy's parameters. At initialization, $\mathcal{Z}_l$ outputs zero, so the model behaves identically to the base model. As training progresses, the zero convolutions learn to pass structural information that modifies the base model's output.
The locked base model preserves the pretrained text-image knowledge; the trainable copy learns the structural condition. The zero convolution initialization prevents damaging the pretrained features early in training when the structural conditioning signal is still uninformative. ControlNet can be applied to any spatial conditioning type by replacing the structural input $c_s$ with the appropriate spatial map (edges, depth, segmentation, pose).
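The effect of the zero convolution on the combined output can be sketched with a 1×1 convolution written as a per-pixel linear map. The feature maps below are random stand-ins for the locked block output and the trainable copy output; none of the names come from the ControlNet code base.

```python
import numpy as np

def conv1x1(x, W, b):
    """1x1 convolution on a (C_in, H, W) feature map: a per-pixel linear map."""
    return np.einsum("oc,chw->ohw", W, x) + b[:, None, None]

rng = np.random.default_rng(0)
C, H, Wd = 4, 8, 8
base_feat = rng.normal(size=(C, H, Wd))   # stand-in: locked base block output
copy_feat = rng.normal(size=(C, H, Wd))   # stand-in: trainable copy output

# Zero-initialized convolution: the control branch contributes nothing.
W_zero, b_zero = np.zeros((C, C)), np.zeros(C)
combined_init = base_feat + conv1x1(copy_feat, W_zero, b_zero)

# After training the weights are nonzero and the branch modifies the features.
W_trained = 0.1 * np.eye(C)               # hypothetical trained weights
combined_trained = base_feat + conv1x1(copy_feat, W_trained, b_zero)
print(np.abs(combined_init - base_feat).max(),
      np.abs(combined_trained - base_feat).max())
```

At initialization the combined features equal the base features exactly, so the pretrained model's behavior is untouched until the first update.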
Zero convolution gradient analysis
The zero convolution mechanism deserves closer analysis, as its design encodes important principles about adapter-based fine-tuning. Let us derive the gradient flow explicitly.
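Setup. The zero convolution is a 1×1 convolution $z = W h + b$, applied at every spatial position $p$ and initialized with $W = 0$, $b = 0$, where $h$ is the hidden representation from the trainable U-Net encoder.
Output at initialization. For any input $h$, the output is:
$$z = 0 \cdot h + 0 = 0$$
regardless of the value of $h$. Thus the ControlNet adds nothing to the base model output at initialization.
Gradient w.r.t. parameters. The gradient of the loss $\mathcal{L}$ w.r.t. the weight $W$ is $\partial \mathcal{L} / \partial W = \sum_p \delta_p h_p^\top$, where $\delta = \partial \mathcal{L} / \partial z$ is the upstream gradient from the loss.
When $W = 0$: The output is $z = 0$, but the upstream gradient $\delta$ is generally nonzero. Because $z$ is added to the locked model's features, the loss depends on $z$ even at $z = 0$, and unless the base model already fits the training example perfectly, perturbing $z$ changes the loss.
Therefore:
$$\frac{\partial \mathcal{L}}{\partial W}\bigg|_{W = 0} = \sum_p \delta_p h_p^\top$$
which is generically nonzero provided $h$ and the upstream gradient are both nonzero. Since $h$ is the hidden state from a forward pass through a neural network (with nonlinear activations, normalization layers, etc.), it is virtually certain to be nonzero.
First gradient step. After the first update with learning rate $\eta$:
$$W^{(1)} = W^{(0)} - \eta \, \frac{\partial \mathcal{L}}{\partial W}\bigg|_{W = 0} = -\eta \sum_p \delta_p h_p^\top$$
which is nonzero. The zero convolution immediately begins learning to transmit the structural conditioning signal.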
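Design implication. Zero initialization ensures the ControlNet starts with zero contribution (no disruption to pretrained weights) but learns immediately on the first gradient step (no dead zone). One subtlety: the gradient flowing back through the zero convolution to its input is $W^\top \delta = 0$ at initialization, so on the very first step only the zero convolution's own parameters receive a gradient through that path; the layers behind it start learning once $W$ is nonzero. This cold-start adapter design balances stability with responsiveness, in contrast to random initialization, which would immediately modify the base model's features.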
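The gradient argument can be checked numerically on a toy problem. The sketch assumes a squared-error loss on the sum of a fixed base feature map and the zero convolution output; all tensors are random placeholders, and the closed-form gradient is compared against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, Wd = 3, 4, 4
h = rng.normal(size=(C, H, Wd))        # input to the zero convolution
base = rng.normal(size=(C, H, Wd))     # stand-in for locked-model features
target = rng.normal(size=(C, H, Wd))   # stand-in regression target

def loss(W):
    """0.5 * ||base + W h - target||^2 with W applied as a 1x1 convolution."""
    z = np.einsum("oc,chw->ohw", W, h)
    return 0.5 * np.sum((base + z - target) ** 2)

W = np.zeros((C, C))                            # zero initialization
z = np.einsum("oc,chw->ohw", W, h)              # output is exactly zero
delta = base + z - target                       # upstream gradient dL/dz
grad_W = np.einsum("ohw,chw->oc", delta, h)     # sum over pixels of delta h^T
grad_h = np.einsum("oc,ohw->chw", W, delta)     # W^T delta: zero at init

W_new = W - 0.1 * grad_W                        # one gradient step

# Finite-difference check of one weight entry.
E = np.zeros((C, C)); E[0, 0] = 1e-4
fd = (loss(W + E) - loss(W - E)) / 2e-4
print(grad_W[0, 0], fd, np.abs(grad_h).max(), np.abs(W_new).max())
```

The weight gradient is nonzero even though the output is zero, while the gradient with respect to the input vanishes until the weights move away from zero.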
Adaptive layer normalization
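An alternative to cross-attention for scalar or low-dimensional conditioning (class labels, timestep, noise level) is adaptive layer normalization (AdaLN). Standard layer normalization normalizes features and applies a fixed learned scale and shift; AdaLN makes these parameters conditioning-dependent:
$$\text{AdaLN}(h, c) = \gamma(c) \odot \frac{h - \mu(h)}{\sigma(h)} + \beta(c)$$
where $\gamma(c)$ and $\beta(c)$ are learned linear projections of the conditioning signal $c$, and $\mu(h)$, $\sigma(h)$ are the per-token mean and standard deviation. In DiT (diffusion transformers; Peebles and Xie, 2023), AdaLN-Zero zero-initializes the projection that produces the modulation parameters, so that the modulation starts at $\gamma = 1$, $\beta = 0$ (identity modulation) and an additional gate on each residual branch starts at zero. Every block is therefore the identity at the start, enabling training stability similar to ControlNet's zero convolutions.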
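A minimal AdaLN layer can be sketched as follows. The sketch assumes the scale is parameterized as an offset from one, $\gamma(c) = 1 + c\,W_\gamma$, so that zero-initialized projections give plain layer normalization; the residual gate used in AdaLN-Zero is omitted, and all names and shapes are illustrative.

```python
import numpy as np

def ada_layer_norm(h, c, W_gamma, W_beta, eps=1e-5):
    """Layer norm over the last axis with conditioning-dependent scale and shift.

    h: (T, D) token features; c: (d_c,) conditioning vector
    W_gamma, W_beta: (d_c, D) projections (zero-init gives plain layer norm)
    """
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mu) / np.sqrt(var + eps)
    gamma = 1.0 + c @ W_gamma    # assumed offset-from-one parameterization
    beta = c @ W_beta
    return gamma * h_norm + beta

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 8))
c = rng.normal(size=(3,))
W_gamma = np.zeros((3, 8))       # zero initialization
W_beta = np.zeros((3, 8))

out_init = ada_layer_norm(h, c, W_gamma, W_beta)
out_shift = ada_layer_norm(h, c, W_gamma, np.ones((3, 8)))  # shift = sum(c)
print(out_init.mean(axis=-1), (out_shift - out_init)[0, 0], c.sum())
```

With zero projections the conditioning has no effect; once the projections are trained, the same conditioning vector shifts and rescales every token's features.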
GenAI context: conditioning across the curriculum
Classifier-free guidance is the standard guidance mechanism in most major text-to-image systems, including Stable Diffusion and DALL-E 2. The cross-attention mechanism for text conditioning is directly analogous to the cross-attention in encoder-decoder transformers (T5, BART): the denoising network plays the role of the decoder, and the text encoder output is the encoder representation being attended to. ControlNet's locked-plus-trainable-copy architecture resembles LoRA (Course 2, Week 11): both freeze the pretrained model and add a trainable branch (a full encoder copy for ControlNet, low-rank matrices for LoRA), with zero initialization ensuring the adapter starts from the pretrained behavior.
More broadly, the conditioning concepts in this lecture appear throughout the four-course sequence with different instantiations:
| Conditioning mechanism | Course 3 (Generative Models) | Course 1 (Reinforcement Learning) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| Classifier guidance | Noisy-image classifier | Reward shaping as a value gradient | Constrained policy with CBF gradient (C2W12) | VQA reward model gradient |
| CFG | Linear extrapolation | Conservative Q-function (CQL): extrapolation beyond observed data | Goal-conditioned imitation: goal-image conditioning is randomly dropped during training | Image-conditional generation: CLIP score as guidance |
| Cross-attention | Image queries attend to text keys/values | Q-function attending to action representations | Action Chunking with Transformers (ACT, C2W8): action chunk queries attend to observation keys | Vision-language model (VLM) cross-attention (C4W4): image tokens attend to text tokens |
| Structural conditioning | ControlNet: edge/pose maps lock spatial layout | Model-based control: constraint maps in MPC | Force/torque conditioning: proprioceptive state as spatial conditioning | Layout grounding: bounding box conditioning for spatial generation |
The CFG extrapolation formula has a direct analog in model-based reinforcement learning: the guidance scale $w$ weights the conditional signal much as a value function guides rollouts toward high-reward regions. Both face the fundamental tension that stronger guidance toward a target metric (text adherence, reward) trades off against diversity and potential out-of-distribution artifacts.
Key takeaways
- Classifier guidance steers sampling by adding the gradient of a noise-robust classifier to the denoising score; the guidance scale controls adherence vs. diversity.
- Classifier-free guidance achieves the same effect with a single jointly trained conditional/unconditional network and linear extrapolation at sampling time.
- CFG implicitly samples from an energy-based model where the likelihood is raised to the power $w$; for large $w$, this trades diversity for prompt adherence and causes manifold departure artifacts (oversaturation, texture repetition). Perp-CFG mitigates this by scaling only the orthogonal component of guidance.
- Cross-attention in the denoising U-Net binds text tokens to spatial image regions using text-encoder (e.g., CLIP) embeddings as keys and values; it is much more memory- and compute-efficient than self-attention on large spatial dimensions, scaling linearly rather than quadratically with the number of spatial positions.
- ControlNet provides structural conditioning via a trainable copy of the encoder initialized from pretrained weights, with zero convolution outputs added to the locked base model; zero convolution initialization ensures stability (no corruption of pretrained features) while allowing learning from the first gradient step.
- Adaptive layer normalization provides conditioning for scalar signals through learned scale-and-shift projections.
- Conditioning mechanisms recur throughout all four courses, unifying concepts from diffusion, reinforcement learning, robotics, and vision-language models.
Conceptual questions
- Derive that CFG sampling from $\tilde\epsilon_w = (1 - w)\,\epsilon_\theta(x_t, \varnothing) + w\,\epsilon_\theta(x_t, c)$ approximates sampling from $\tilde p_w(x \mid c) \propto p(x)\,p(c \mid x)^w$. For $w = 1$, this recovers $p(x \mid c)$; for $w \to \infty$, what distribution does $\tilde p_w$ approach? What does this imply for the diversity of CFG samples as $w$ increases?
- A cross-attention layer in a 64×64 diffusion U-Net attends over a text sequence of $L = 77$ tokens. The queries have dimension $d = 64$ and the spatial feature map has $N = 4096$ positions. Compute the memory required for the attention matrix (in bytes, float32) and the computational cost (FLOPs) of computing attention for one image at one timestep. How do these scale with image resolution, and what modification to attention enables higher-resolution generation?
- ControlNet uses zero convolutions initialized with weight $W = 0$ and bias $b = 0$. During training, the zero convolution learns a nonzero output after the first gradient step. Derive the gradient of the zero convolution output with respect to the convolution weight when $W = 0$, and show that the gradient is nonzero only if the input to the convolution is nonzero. What does this imply about the training dynamics in the earliest gradient steps?
- Unconditional dropout (setting $c = \varnothing$ with probability $p_{\text{uncond}}$ during training) is essential for CFG to work. Explain what failure mode occurs when $p_{\text{uncond}} = 0$ (no unconditional training) and when $p_{\text{uncond}} = 1$ (no conditional training). What value of $p_{\text{uncond}}$ would you choose for a task where conditional fidelity is paramount, and how would you validate this choice?
- CLIP text embeddings are trained to match image-text pairs; CLIP image embeddings are trained on the same pairs. A text-to-image diffusion model conditioned on CLIP text embeddings can theoretically also be conditioned on CLIP image embeddings (image-to-image generation). Describe the semantic property of CLIP that enables this cross-modal conditioning without architectural changes, and analyze what information is preserved (and lost) when using a CLIP image embedding vs. directly conditioning on pixel values.
Looking ahead
Conditioning mechanisms enable text-guided generation in pixel space. But scaling to high-resolution images makes pixel-space training prohibitively expensive.
Week 9: Latent Diffusion and Multimodal Generation. We examine latent diffusion models that compress images into compact latent codes and apply diffusion in latent space, then survey how the same diffusion backbone extends to audio (spectrograms), video (temporal tokens), and joint embedding spaces for multimodal generation.
Further reading
- Ho, J., & Salimans, T. (2021). Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop. (The CFG algorithm used in most modern text-to-image models.)
- Dhariwal, P., & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS. (Classifier guidance.)