Week 10: ControlNet and Controlled Generation

Purpose of this lecture

As explored in Course 3, text-to-image diffusion models (like Stable Diffusion) can generate remarkably diverse, high-fidelity images from language prompts. However, natural language is a fundamentally weak control signal for structure-critical applications. Specifying "a person standing with arms raised" in text produces a different physical pose every time the model is sampled. Specifying "a building with three windows on the left" rarely produces the exact spatial layout intended.

To bridge the gap between pure generative art and physical AI, we need precise geometric control. ControlNet (Zhang et al., 2023) provides it by attaching a trainable structural conditioning network to a large, frozen diffusion model. This enables spatially aligned control via edge maps, depth maps, pose skeletons, and semantic segmentation masks. This lecture derives the ControlNet architecture and the mathematics of zero-initialized convolutions, and examines how VLMs serve as high-level "System 2" controllers that translate abstract language into the structured signals required by these generative pipelines.

The structural conditioning problem

Standard text-conditioned diffusion (from Course 3) operates on the noisy latent $z_t$ at timestep $t$ and a text embedding $c_\text{text}$ (e.g., from CLIP). The UNet denoiser learns to predict the noise:

$$\epsilon_\theta(z_t, t, c_\text{text})$$

Text conditioning controls the semantic content (style, texture, identity) but only weakly constrains the structural layout. Two diffusion trajectories that start from different initial noise samples $z_T$ (different random seeds) but share the exact same text prompt can produce radically different object arrangements and spatial structures. For applications requiring structural precision, such as a robot generating a synthetic training image from a depth map or an architect rendering a photorealistic house from a CAD floor plan, text conditioning is insufficient.

Structured conditioning mathematically extends the denoiser to accept an additional spatial condition $c_\text{struct}$:

$$\epsilon_\theta(z_t, t, c_\text{text}, c_\text{struct})$$

where $c_\text{struct} \in \mathbb{R}^{H \times W \times C}$ encodes spatial structure explicitly. The goal is to constrain the generative diffusion trajectory to strictly obey the geometric boundaries of $c_\text{struct}$ while allowing $c_\text{text}$ to freely hallucinate the semantic textures within those boundaries.

ControlNet Architecture

Training a massive diffusion model (like Stable Diffusion, with ~1 billion parameters) from scratch to accept $c_\text{struct}$ would require thousands of GPU-days and risk forgetting its pre-trained aesthetic priors.

ControlNet bypasses this by creating a parallel, trainable copy of the diffusion model's encoder, while keeping the original generative backbone strictly frozen. The architecture has three key components:

1. Frozen Backbone: The original UNet denoiser remains completely frozen ($\theta_\text{frozen}$). This preserves the visual concepts and aesthetic priors learned from billions of images during large-scale pretraining.

2. Trainable Copy: A copy of the UNet's downsampling encoder blocks and middle block is created ($\theta_\text{trainable}$), with its parameters initialized to the frozen model's weights. This trainable copy processes the latent $z_t$ together with the structural condition $c_\text{struct}$.

3. Zero Convolution Layers: The structural condition $c_\text{struct}$ is injected into the trainable copy, and the trainable copy's features are injected back into the frozen UNet's decoder, using an operation called a Zero Convolution. A zero convolution $\mathcal{Z}(\cdot; \phi)$ is a $1 \times 1$ convolutional layer whose weight matrix $W$ and bias vector $b$ are both initialized to exactly zero.

At a given depth $d$ in the UNet, the modified hidden state $h_\text{out}$ passed to the decoder is:

$$h_\text{out} = h_\text{frozen} + \mathcal{Z}_d(h_\text{ctrl})$$

where $h_\text{ctrl}$ is the output of the trainable encoder block at depth $d$.
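The injection rule above can be sketched in a few lines of plain Python, with nested lists standing in for $H \times W \times C$ feature tensors. This is only an illustration under stated assumptions: the names `ZeroConv1x1` and `inject` are hypothetical and not taken from any ControlNet codebase, and a real implementation would use a tensor library.

```python
class ZeroConv1x1:
    """A 1x1 convolution: the same linear map applied at every pixel.

    Weights and bias start at exactly zero. Illustrative class, not a library API.
    """

    def __init__(self, in_channels, out_channels):
        self.weight = [[0.0] * in_channels for _ in range(out_channels)]
        self.bias = [0.0] * out_channels

    def __call__(self, feature_map):
        # feature_map[y][x] is a list of in_channels floats.
        return [
            [
                [
                    sum(w * v for w, v in zip(row, pixel)) + b
                    for row, b in zip(self.weight, self.bias)
                ]
                for pixel in line
            ]
            for line in feature_map
        ]


def inject(h_frozen, h_ctrl, zero_conv):
    """h_out = h_frozen + Z_d(h_ctrl), element-wise over an H x W x C map."""
    z = zero_conv(h_ctrl)
    return [
        [[f + c for f, c in zip(pf, pc)] for pf, pc in zip(lf, lc)]
        for lf, lc in zip(h_frozen, z)
    ]


# Toy 2 x 2 feature maps with 3 channels each.
h_frozen = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
            [[7.0, 8.0, 9.0], [0.5, 0.25, 0.125]]]
h_ctrl = [[[9.0, -1.0, 2.0], [3.0, 3.0, 3.0]],
          [[-4.0, 0.0, 1.0], [2.0, 2.0, -2.0]]]

zero_conv = ZeroConv1x1(in_channels=3, out_channels=3)
h_out = inject(h_frozen, h_ctrl, zero_conv)
print(h_out == h_frozen)  # True: at initialization the control branch adds nothing
```

Whatever the trainable branch computes, its contribution is multiplied by zero weights at initialization, so the frozen features pass through unchanged.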

The Mathematics of Zero Initialization

If a new, randomly initialized block is attached to a pre-trained diffusion UNet, its initial output is high-variance noise. Added to the frozen UNet's features, this noise corrupts the signal the decoder relies on, so the combined model produces degraded images at the first training step, before any weight update. The resulting large, uninformative gradients can damage the trainable copy before it learns anything useful.

Zero initialization removes this risk. At training step 0, every zero convolution outputs $\mathcal{Z}_d(\cdot) = 0$, so the equation reduces to $h_\text{out} = h_\text{frozen} + 0$. The complete model with the ControlNet attached computes exactly the same function as the original frozen model.

As training proceeds on a dataset of $(x, c_\text{struct}, c_\text{text})$ triplets, the gradients gently push the weights of the zero convolutions away from zero. The ControlNet learns incrementally: first, the zero convolutions learn to let small structural signals leak into the main network; then, the trainable encoder learns to aggressively extract geometric features from $c_\text{struct}$ .
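The incremental behaviour described above can be checked numerically on a single pixel. The sketch below is a toy under stated assumptions: it uses the row-vector convention $y = xW + b$, a squared-error loss as a stand-in for the diffusion loss, and plain SGD on $W$ only (the bias would update in the same way and is omitted for brevity). The function names and all numbers are illustrative.

```python
def forward(x, W, b):
    """y = x W + b for one pixel (row-vector convention, W is in x out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def grads(x, W, b, h_frozen, target):
    """Gradients of L = 0.5 * ||h_frozen + y - target||^2 w.r.t. W and x."""
    y = forward(x, W, b)
    g = [hf + yj - t for hf, yj, t in zip(h_frozen, y, target)]  # dL/dy
    dW = [[xi * gj for gj in g] for xi in x]                      # x^T g
    dx = [sum(gj * W[i][j] for j, gj in enumerate(g))             # g W^T
          for i in range(len(x))]
    return dW, dx


x = [1.0, -2.0]        # trainable-encoder output at one pixel
h_frozen = [0.5, 0.5]  # frozen UNet feature at the same pixel
target = [1.0, 0.0]    # toy regression target (assumption, not the real loss)
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
lr = 0.1

# Step 1: W is zero, so nothing flows back to x, but dL/dW is nonzero.
dW_1, dx_1 = grads(x, W, b, h_frozen, target)

# One SGD update moves W away from zero.
W = [[w - lr * d for w, d in zip(row_w, row_d)] for row_w, row_d in zip(W, dW_1)]

# Step 2: W is nonzero, so the gradient now reaches the encoder output x.
dW_2, dx_2 = grads(x, W, b, h_frozen, target)

print("step 1 dL/dx:", dx_1)
print("step 2 dL/dx:", dx_2)
```

On the first step the encoder receives a zero gradient while the zero convolution's own weights receive a nonzero one; only after that first update does the signal reach the trainable encoder.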

Control signal types and extraction

ControlNet is fundamentally agnostic to what $c_\text{struct}$ actually represents. Different preprocessing pipelines extract different physical constraints from source images, resulting in different ControlNet variants:

1. Depth Maps: Extracted using monocular depth estimation networks (e.g., MiDaS, ZoeDepth) or physical RGB-D cameras. A depth-conditioned ControlNet steers the diffusion model to generate objects consistent with the specified depth layout. For physical AI, a robot can use a depth map of its current environment to hallucinate hundreds of photorealistic variations of that same room (changing lighting, colors, and textures) to train a robust Sim2Real policy, while the collision geometry stays close to the original.

2. Canny Edge Maps: Extracted using the classical Canny edge detector. Training a ControlNet on (image, Canny edge map) pairs produces a model that treats the edges as boundaries to follow. This is useful for industrial design or for converting rough spatial sketches into photorealistic images.

3. Human Pose Skeletons (OpenPose): 2D keypoint estimates representing human joints. A pose-conditioned ControlNet generates humans in the specified pose while allowing the text prompt to freely alter identity, clothing, and background.
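As an illustration of the preprocessing step for the depth case, the sketch below converts a metric depth map into a min-max-normalized inverse depth image. This is an assumption about the conditioning format: it mirrors one common convention for monocular estimators, and the function name `normalize_inverse_depth` is hypothetical. It also shows that two scenes differing only by a global scale factor yield the same conditioning map.

```python
def normalize_inverse_depth(depth_m):
    """Map metric depth (metres, all > 0) to relative inverse depth in [0, 1].

    Near pixels map to values close to 1, far pixels to values close to 0.
    """
    inv = [[1.0 / d for d in row] for row in depth_m]
    lo = min(min(row) for row in inv)
    hi = max(max(row) for row in inv)
    span = hi - lo
    if span == 0.0:
        # A perfectly flat scene carries no relative depth information.
        return [[0.0 for _ in row] for row in inv]
    return [[(v - lo) / span for v in row] for row in inv]


# A small scene, and the same scene with every distance multiplied by 10.
near_scene = [[0.5, 0.5, 1.0],
              [0.5, 0.25, 1.0]]
far_scene = [[d * 10.0 for d in row] for row in near_scene]

a = normalize_inverse_depth(near_scene)
b = normalize_inverse_depth(far_scene)

max_diff = max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))
print(max_diff < 1e-9)  # the two conditioning maps agree up to rounding
```

Under this normalization the absolute scale of the scene is discarded, so the text prompt has to supply it.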

Combinatorial Inference: Because ControlNet additions $\mathcal{Z}_d(h_\text{ctrl})$ are just tensor additions into the frozen UNet's residual stream, multiple ControlNets can be combined at inference time. You can simultaneously apply a Depth ControlNet (to enforce room layout) and a Pose ControlNet (to place a human in a specific location).
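The additive combination can be sketched as follows, using flat lists in place of feature tensors. The per-ControlNet `scales` are an assumption added for illustration (the text above describes a plain sum, which corresponds to every scale being 1.0), and `combine_residuals` is a hypothetical helper name.

```python
def combine_residuals(h_frozen, residuals, scales):
    """h_out = h_frozen + sum_k scale_k * residual_k over flat feature lists."""
    if len(residuals) != len(scales):
        raise ValueError("need one scale per ControlNet residual")
    h_out = list(h_frozen)
    for residual, scale in zip(residuals, scales):
        if len(residual) != len(h_frozen):
            raise ValueError("residual shape must match the frozen feature")
        h_out = [h + scale * r for h, r in zip(h_out, residual)]
    return h_out


h_frozen = [1.0, 2.0, 3.0]
depth_residual = [0.5, 0.0, -0.5]  # stands in for Z_d(h_ctrl) of a depth ControlNet
pose_residual = [0.0, 1.0, 1.0]    # stands in for Z_d(h_ctrl) of a pose ControlNet

h_out = combine_residuals(h_frozen, [depth_residual, pose_residual], [1.0, 0.5])
print(h_out)  # [1.5, 2.5, 3.0]
```

Because each branch only adds to the residual stream, nothing in the sum arbitrates between conditions that disagree; that has to be handled upstream.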

VLMs as high-level semantic controllers

ControlNet shows that diffusion models can follow structural commands closely. However, an automated system still needs a component that generates those structural commands in the first place. Here, Vision-Language Models (like LLaVA or BLIP) act as the "System 2" semantic controller for the generative pipeline.

Language to Structure: Consider an interior design agent given the prompt: "Make the room feel warmer by adding a wooden chair to the left of the table."

1. The VLM processes the image of the room and the text instruction.
2. Relying on its spatial grounding capabilities (Week 4), the VLM outputs a bounding box or a segmentation mask covering the empty space to the left of the table.
3. This VLM-generated mask becomes the $c_\text{struct}$ for an inpainting ControlNet.
4. The diffusion model renders the wooden chair inside that region, blended into the lighting and depth of the surrounding scene.

The VLM handles the abstract semantic reasoning and logical placement; the ControlNet handles the continuous, pixel-perfect geometric rendering.
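The hand-off from the VLM to the ControlNet can be illustrated by rasterizing a bounding box into a binary mask. This is a minimal sketch: `box_to_mask` is a hypothetical helper, and the pixel-coordinate box format `(x1, y1, x2, y2)` with exclusive upper bounds is an assumption about what the VLM emits.

```python
def box_to_mask(height, width, box):
    """Rasterize a pixel box (x1, y1, x2, y2) into a binary H x W mask.

    x2 and y2 are exclusive; coordinates are clipped to the image bounds.
    """
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [
        [1 if (x1 <= x < x2 and y1 <= y < y2) else 0 for x in range(width)]
        for y in range(height)
    ]


# A 4 x 6 image with a box two pixels wide and three pixels tall.
mask = box_to_mask(4, 6, (1, 1, 3, 4))
for row in mask:
    print("".join(str(v) for v in row))
# 000000
# 011000
# 011000
# 011000
```

The mask marks where the diffusion model may paint; everything outside it is left to the original image.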

Iterative Refinement (VLM as Critic): Once the image is generated, the VLM can inspect the output. If the diffusion model generated a metal chair instead of a wooden one, the VLM can output a structured text correction: "The object generated in box $[x_1, y_1, x_2, y_2]$ violates the material constraint 'wooden'." This feedback loop is analogous to the Actor-Critic loops in RL (Course 1): the VLM supplies a reward-like evaluation signal that guides the generative actor toward alignment.
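A generate-and-critique loop of this kind might be wired up as below. Everything here is a hypothetical stand-in: `refine`, `fake_generate`, and `fake_critique` are illustrative names, the "image" is a small dictionary, and appending the critic's message to the prompt is just one possible way to feed corrections back.

```python
def refine(generate, critique, prompt, c_struct, max_rounds=3):
    """Regenerate until the critic reports no violations or the budget runs out.

    generate(prompt, c_struct) -> image
    critique(image, prompt)    -> list of violation strings
    Both callables stand in for a ControlNet pipeline and a VLM respectively.
    """
    image = generate(prompt, c_struct)
    for _ in range(max_rounds):
        violations = critique(image, prompt)
        if not violations:
            return image, []
        # Fold the critic's feedback into the next prompt.
        prompt = prompt + " Fix: " + "; ".join(violations)
        image = generate(prompt, c_struct)
    return image, critique(image, prompt)


def fake_generate(prompt, c_struct):
    # Toy generator: gets the material wrong until it has seen a correction.
    material = "wooden" if "Fix:" in prompt else "metal"
    return {"material": material, "box": c_struct}


def fake_critique(image, prompt):
    # Toy critic: checks a single material constraint.
    if image["material"] != "wooden":
        return ["object in box %s violates the material constraint 'wooden'"
                % (image["box"],)]
    return []


image, remaining = refine(fake_generate, fake_critique,
                          "a wooden chair", (10, 20, 60, 90))
print(image["material"], remaining)  # wooden []
```

The `max_rounds` budget matters in practice: a critic that is never satisfied would otherwise loop forever.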

Key takeaways

Text prompts alone cannot guarantee spatial geometry in text-to-image diffusion models. ControlNet addresses this by locking the pre-trained diffusion UNet and training a parallel encoder that injects structural conditions (depth, edges, poses) into the residual stream. Zero Convolutions make this stable: they initialize at exactly zero, so the frozen network's priors are undisturbed at the start of training. With explicit geometric constraints, diffusion moves from an art generation tool toward a controllable rendering engine for physical AI. In autonomous pipelines, VLMs serve as the semantic brain, translating abstract language instructions into the explicit bounding boxes, depth maps, and segmentation masks required to steer these ControlNet pipelines.

Conceptual questions

1. Zero Convolution Gradient Flow: A zero convolution is initialized with weight matrix $W=0$. During the first backward pass, the gradient of the loss with respect to the input features $x$ is $\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} W^T$. Because $W=0$, the gradient passed backward through the zero convolution to the trainable encoder is exactly zero. Mathematically explain how the zero convolution's weights $W$ actually update on step 1 (hint: look at $\frac{\partial \mathcal{L}}{\partial W}$), and why it takes until step 2 for gradients to finally flow backwards into the trainable encoder block.
2. ControlNet vs. PEFT (LoRA): In Week 8, we discussed LoRA as a method to adapt a model without catastrophic forgetting. ControlNet also adapts a model by freezing the backbone and training a separate set of side-parameters. Contrast the mathematical objectives of these two approaches. Why is LoRA sufficient for changing the style of a diffusion model (e.g., fine-tuning it to output anime), whereas the full ControlNet parallel encoder block is required for injecting spatial condition maps $c_\text{struct}$?
3. Depth Map Scale Ambiguity: A robotic agent captures a depth map with an RGB-D camera, normalizes it, and passes it as $c_\text{struct}$ to a depth-conditioned ControlNet. The text prompt is "A red coffee mug on a table." The depth map clearly shows a cylindrical object on a flat plane. However, a normalized depth map, like the output of a monocular depth estimator, is scale-ambiguous (a mug 1 foot away looks geometrically identical in depth to an oil drum 10 feet away). Describe the failure mode the diffusion model might exhibit when rendering this object. How does providing the text prompt "coffee mug" mathematically anchor the scale ambiguity in the UNet's cross-attention layers?
4. VLM-ControlNet Pipeline Debugging: You build a pipeline where a LLaVA model generates a bounding box for "a new window," and a ControlNet inpaints the window. During testing, the window is drawn perfectly, but the lighting on the floor of the room doesn't change to reflect the new light source, making the image look fake. Diagnose which component is failing. Is the VLM passing an incomplete $c_\text{struct}$, or is the frozen diffusion backbone failing to model global illumination? Propose a fix using a depth-map ControlNet.
5. Combinatorial Conflict Resolution: You apply two ControlNets simultaneously to a frozen UNet: a Pose ControlNet conditioned on $c_\text{pose}$ (forcing a person to stand with arms wide open) and a Depth ControlNet conditioned on $c_\text{depth}$ (showing a narrow, crowded hallway where wide arms physically cannot fit). At inference, you simply sum their outputs at each depth $d$: $h_\text{out} = h_\text{frozen} + \mathcal{Z}^\text{pose}_d(h^\text{pose}_\text{ctrl}) + \mathcal{Z}^\text{depth}_d(h^\text{depth}_\text{ctrl})$. Describe the likely visual artifact the diffusion model will generate to resolve this mathematical contradiction. How could a VLM "critic" be used to detect this geometric impossibility before the diffusion process even begins?

Solutions

1. Zero-conv update. Even though $\partial \mathcal{L}/\partial x = (\partial \mathcal{L}/\partial y)\,W^\top = 0$ when $W=0$ (no gradient to the upstream encoder on step 1), the weight gradient $\partial \mathcal{L}/\partial W = x^\top (\partial \mathcal{L}/\partial y)$ is generally nonzero since $x \neq 0$, so $W$ updates on step 1. Once $W \neq 0$, on step 2 the input gradient becomes nonzero and finally flows into the trainable encoder, hence the one-step delay.
2. ControlNet vs LoRA. LoRA adds a low-rank $\Delta W$ inside existing layers, enough to shift global style. Injecting a spatial map requires a parallel branch that processes the high-dimensional condition $c_\text{struct}$ through its own copied encoder and returns spatial features at matching resolutions; a low-rank weight delta has no input path for a per-pixel condition, so the full ControlNet side-encoder is required.
3. Depth scale ambiguity. The depth condition carries no absolute scale, so the model might render the cylinder at the wrong absolute size (mug vs oil drum). The text token "coffee mug" enters the UNet cross-attention and biases appearance and scale toward a mug, anchoring the ambiguous geometry to the intended object.
4. Lighting bug. The window geometry is correct, so the VLM passed a good structural condition; the unchanged floor lighting is the frozen diffusion backbone failing to model the new light source, not a missing $c_\text{struct}$. Add a depth/normal ControlNet so the model has 3D geometry to compute consistent shading, or condition on an illumination map.
5. Combinatorial conflict. Summing a wide-arm pose with a narrow-hallway depth forces the UNet to partially satisfy both, producing artifacts: distorted or clipped arms, melted geometry, or arms passing through walls. A VLM critic can check geometric feasibility (is this pose compatible with this depth map?) and reject or repair the conditions before sampling begins.

Looking ahead

ControlNet demonstrates how to constrain generation into strict geometric bounds. But what happens when the VLM itself is not just generating images or answering questions, but actively operating a computer, browsing the internet, or moving a robot?

Week 11: Multimodal Agents and Tool Use. We examine how VLMs are extended into autonomous agents capable of UI grounding (identifying and clicking icons on a screen), web navigation, and tool selection. We model VLM agent decision-making explicitly as a Partially Observable Markov Decision Process (POMDP), bridging Course 4 directly back to the reinforcement learning theory of Course 1.

