Week 6: Denoising Diffusion Probabilistic Models

✦Learning Outcomes
  • Explain the reverse denoising process and derive the simplified noise prediction training objective
  • Implement DDPM ancestral sampling and compare with DDIM for accelerated generation
  • Connect diffusion models to the SDE/ODE formalism and explain classifier-free guidance
◆Prerequisites
  • Week 1: Probabilistic Foundations - Score functions, ELBO
  • Week 4: Energy-Based Models - Denoising score matching
  • Week 5: Normalizing Flows - Change of variables (conceptual bridge)

Understanding of Gaussian distributions and reparameterization is essential.

Purpose of this lecture

DDPM (Ho et al., 2020) is the framework underlying many modern image, audio, and video generation systems. It achieves sample quality competitive with GANs without adversarial training, provides a tractable variational bound on the likelihood, and trains stably across model scales. This lecture derives DDPM from first principles: the forward noising process, the variational lower bound on the reverse process, the simplification from the full ELBO to the noise prediction objective, the ancestral sampling algorithm, and the SDE/ODE perspective that connects DDPM to score matching and enables accelerated DDIM sampling.


The forward process

DDPM defines a fixed (non-learned) forward process that gradually corrupts a data sample $x_0 \sim p_\text{data}$ by adding Gaussian noise over $T$ steps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\right)$$

where $\beta_1 < \beta_2 < \cdots < \beta_T$ is the noise schedule (small positive constants). Each step scales the current sample toward zero while adding noise. A key property: marginals can be computed in closed form for any $t$ directly from $x_0$. Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. Then:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I\right)$$

This allows reparameterizing any noised sample as $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. As $t \to T$ with an appropriate schedule, $\bar{\alpha}_T \approx 0$ and $x_T$ is approximately distributed as $\mathcal{N}(0, I)$: the data has been almost completely destroyed.
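The closed-form marginal means a noised sample at any timestep can be drawn without simulating the chain. The following is a minimal sketch in plain Python, not a reference implementation: the helper names (`linear_betas`, `alpha_bars`, `q_sample`) are illustrative, the endpoint values $\beta_1 = 10^{-4}$ and $\beta_T = 0.02$ are the ones quoted later in these notes, and a real implementation would use an array library rather than lists.

```python
import math
import random

def linear_betas(T, beta_1=1e-4, beta_T=0.02):
    """Linearly spaced beta_1, ..., beta_T (list index 0 holds beta_1)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot (t is 1-indexed). Returns (x_t, eps)."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

T = 1000
abar = alpha_bars(linear_betas(T))
rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]                       # toy 3-dimensional "data point"
x_mid, eps_mid = q_sample(x0, 500, abar, rng)
x_end, eps_end = q_sample(x0, T, abar, rng)
print("alpha_bar at t=1, 500, 1000:", abar[0], abar[499], abar[-1])
print("x_500: ", x_mid)
print("x_1000:", x_end)
```

Because `q_sample` returns the noise it used, the same call can later supply the regression target for the noise-prediction loss.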

Noise schedules: the original DDPM used a linear schedule $\beta_t = \frac{t-1}{T-1}\beta_T + \frac{T-t}{T-1}\beta_1$. The cosine schedule (Nichol and Dhariwal, 2021) instead defines $\bar{\alpha}_t$ directly as $\bar{\alpha}_t = f(t)/f(0)$ with $f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ and a small offset $s$. It decreases more slowly near $t = 0$ (preserving more data information early in the process) and is better matched to the informational needs of image generation.
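To make the difference between the two schedules concrete, here is a small sketch that builds both and prints $\bar{\alpha}_t$ at a few timesteps. It assumes the cosine form $\bar{\alpha}_t = f(t)/f(0)$, $f(t) = \cos^2\!\big(\tfrac{t/T + s}{1+s}\cdot\tfrac{\pi}{2}\big)$ from Nichol and Dhariwal with $s = 0.008$, and the clipping of $\beta_t$ at 0.999 that they describe; the function names are illustrative.

```python
import math

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t = f(t) / f(0), f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2), t = 1..T."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

def betas_from_alpha_bars(abar, max_beta=0.999):
    """Recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped near t = T."""
    betas, prev = [], 1.0
    for a in abar:
        betas.append(min(1.0 - a / prev, max_beta))
        prev = a
    return betas

def linear_alpha_bars(T, beta_1=1e-4, beta_T=0.02):
    """alpha_bar_t for the linear beta schedule."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_1 + (beta_T - beta_1) * i / (T - 1))
        out.append(prod)
    return out

T = 1000
cos_abar = cosine_alpha_bars(T)
lin_abar = linear_alpha_bars(T)
cos_betas = betas_from_alpha_bars(cos_abar)
for t in (50, 250, 500, 750):
    print(t, "linear:", round(lin_abar[t - 1], 4), "cosine:", round(cos_abar[t - 1], 4))
```

Defining the schedule through $\bar{\alpha}_t$ rather than $\beta_t$ is the design choice that matters here: the per-step $\beta_t$ values are derived quantities and only need clipping at the very end of the chain.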


The reverse process and ELBO

The reverse process is a Markov chain that progressively denoises $x_T \sim \mathcal{N}(0,I)$ back to $x_0$:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t), \quad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t))$$

The reverse process is learned; the denoising network $\mu_\theta(x_t, t)$ must predict where the clean signal was. The variational bound $\mathcal{L}$ (the negative ELBO) upper-bounds the negative log-likelihood $-\log p_\theta(x_0)$, so minimizing it maximizes a lower bound on $\log p_\theta(x_0)$:

$$\mathcal{L} = \mathbb{E}_q\!\left[\log\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] = \mathbb{E}_q\!\Bigg[\underbrace{D_\text{KL}(q(x_T \mid x_0) \,\|\, p(x_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))}_{L_{t-1}} \underbrace{-\,\log p_\theta(x_0 \mid x_1)}_{L_0}\Bigg]$$

The term $L_T$ is a constant (the forward process is fixed, and $q(x_T \mid x_0) \approx p(x_T) = \mathcal{N}(0,I)$ for a good schedule). The reconstruction term $L_0$ is handled with a discrete decoder. The key learning terms are $L_{t-1}$: KL divergences between the forward posterior and the learned reverse step.

The forward posterior $q(x_{t-1} \mid x_t, x_0)$ is tractable given $x_0$ (because the forward process is Gaussian). By Bayes' rule:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\right)$$

where $\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t$ and $\tilde{\beta}_t = \frac{(1-\bar{\alpha}_{t-1})\,\beta_t}{1-\bar{\alpha}_t}$. When the reverse variance is fixed to $\Sigma_\theta = \sigma_t^2 I$ (Ho et al. use $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t$), this Gaussian form makes $L_{t-1}$ a scaled squared distance between the model mean and the target mean: $L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2\right] + C$.
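The posterior coefficients are easy to tabulate, and the variance can be cross-checked against the product-of-Gaussians precision $\tilde{\beta}_t^{-1} = \alpha_t/\beta_t + 1/(1-\bar{\alpha}_{t-1})$. The sketch below assumes the linear schedule with $\beta_1 = 10^{-4}$, $\beta_T = 0.02$ and the convention $\bar{\alpha}_0 = 1$; the function names are illustrative only.

```python
import math

def linear_schedule(T, beta_1=1e-4, beta_T=0.02):
    """Return (betas, alpha_bars) for the linear schedule; index 0 holds t = 1."""
    betas = [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]
    abar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abar.append(prod)
    return betas, abar

def posterior_coeffs(t, betas, abar):
    """Coefficients of q(x_{t-1} | x_t, x_0): mu_tilde = c0 * x0 + ct * x_t, variance beta_tilde."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = abar[t - 1]
    abar_prev = abar[t - 2] if t > 1 else 1.0   # convention: alpha_bar_0 = 1
    c0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    ct = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    beta_tilde = (1.0 - abar_prev) * beta_t / (1.0 - abar_t)
    return c0, ct, beta_tilde

betas, abar = linear_schedule(1000)
for t in (2, 500, 1000):
    c0, ct, beta_tilde = posterior_coeffs(t, betas, abar)
    # Precision from multiplying the two Gaussians in x_{t-1}.
    precision = (1.0 - betas[t - 1]) / betas[t - 1] + 1.0 / (1.0 - abar[t - 2])
    print(t, "c0:", c0, "ct:", ct, "beta_tilde:", beta_tilde, "1/precision:", 1.0 / precision)
```

At small $t$ almost all of the weight sits on $x_0$, while at large $t$ the posterior mean is dominated by $x_t$, which matches the intuition that early reverse steps only set coarse structure.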


The noise prediction objective

The learned reverse mean $\mu_\theta(x_t, t)$ should match $\tilde{\mu}_t(x_t, x_0)$. Substituting the reparameterization $x_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon)/\sqrt{\bar{\alpha}_t}$ into the expression for $\tilde{\mu}_t$:

$$\tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon\right)$$

This suggests parameterizing $\mu_\theta(x_t, t)$ as:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)$$

where $\epsilon_\theta(x_t, t)$ predicts the noise $\epsilon$ that was added to $x_0$ to produce $x_t$. Up to a timestep-dependent weight, the $L_{t-1}$ terms then reduce to:

$$\mathcal{L}_\text{simple}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\right)\right\|^2\right]$$

This is the simple objective of Ho et al.: sample a timestep $t$, sample noise $\epsilon \sim \mathcal{N}(0,I)$, compute the noised input $x_t$, and train the network to predict $\epsilon$ from $x_t$ and $t$. The objective is a sum of denoising score matching (DSM) losses at every noise level simultaneously, and the connection to score matching (Week 4) is exact: $\epsilon_\theta(x_t, t) = -\sqrt{1-\bar{\alpha}_t}\, s_\theta(x_t, t)$.
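The sampling recipe for the objective can be sketched as a Monte Carlo estimate. This is a toy under stated assumptions rather than a training loop: there is no network or optimizer, `eps_model` is a hypothetical callable standing in for $\epsilon_\theta$, and the dataset contains a single point so that the exact noise predictor is known in closed form and can serve as a sanity check.

```python
import math
import random

def alpha_bars(T, beta_1=1e-4, beta_T=0.02):
    """alpha_bar_t for the linear beta schedule; index 0 holds t = 1."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_1 + (beta_T - beta_1) * i / (T - 1))
        out.append(prod)
    return out

def simple_loss(eps_model, dataset, abar, rng, n_draws=2000):
    """Monte Carlo estimate of L_simple, averaged per dimension."""
    total = 0.0
    for _ in range(n_draws):
        x0 = rng.choice(dataset)                  # x_0 ~ p_data
        t = rng.randint(1, len(abar))             # t ~ Uniform{1, ..., T}
        a = abar[t - 1]
        eps = [rng.gauss(0.0, 1.0) for _ in x0]   # eps ~ N(0, I)
        xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
        pred = eps_model(xt, t)
        total += sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(x0)
    return total / n_draws

ABAR = alpha_bars(1000)
X_STAR = [1.0, -0.5, 0.25]   # toy dataset consisting of one point

def zero_model(xt, t):
    """Baseline that always predicts zero noise."""
    return [0.0 for _ in xt]

def oracle_model(xt, t):
    """Exact noise for a point-mass dataset: eps = (x_t - sqrt(abar) x*) / sqrt(1 - abar)."""
    a = ABAR[t - 1]
    return [(x - math.sqrt(a) * s) / math.sqrt(1.0 - a) for x, s in zip(xt, X_STAR)]

loss_zero = simple_loss(zero_model, [X_STAR], ABAR, random.Random(0))
loss_oracle = simple_loss(oracle_model, [X_STAR], ABAR, random.Random(0))
print("zero predictor:", loss_zero, " oracle predictor:", loss_oracle)
```

The zero predictor's per-dimension loss sits near 1 because $\mathbb{E}[\epsilon^2] = 1$, which gives a useful reference level when reading early training curves.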

Step-by-step: why noise prediction works

The algebraic substitution connecting the reverse mean to noise prediction deserves careful walkthrough. Starting from the forward posterior mean:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t$$

The key insight is that $x_0$ is not observed during generation; only $x_t$ is. But we can express $x_0$ in terms of $x_t$ and the noise $\epsilon$ using the reparameterization $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, which gives:

$$\hat{x}_0(x_t, \epsilon) = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}$$

Substituting into $\tilde{\mu}_t$:

$$\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t} \cdot \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}} + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t$$

Collecting terms in $x_t$ (using $\beta_t = 1 - \alpha_t$ and $\bar{\alpha}_{t-1} = \bar{\alpha}_t / \alpha_t$):

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon\right)$$

This is the clean noise-prediction parameterization. The model $\epsilon_\theta(x_t, t)$ that predicts $\epsilon$ from $x_t$ gives the reverse mean $\mu_\theta(x_t, t) = \left(x_t - \beta_t\,\epsilon_\theta(x_t,t)/\sqrt{1-\bar{\alpha}_t}\right)/\sqrt{\alpha_t}$. The L2 loss between $\tilde{\mu}_t$ and $\mu_\theta$ then reduces to $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ (with a constant prefactor that Ho et al. drop to get $\mathcal{L}_\text{simple}$, empirically finding equal-weight timestep averaging performs better than the exact ELBO weighting).
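For readers who want the "collecting terms" step spelled out, the two coefficients can be simplified separately. The only facts used are $\sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}$ and $\beta_t + \alpha_t - \bar{\alpha}_t = 1 - \bar{\alpha}_t$:

```latex
\begin{aligned}
\text{coefficient of } x_t:\quad
&\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{(1-\bar{\alpha}_t)\sqrt{\bar{\alpha}_t}}
 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}
 = \frac{\beta_t + \alpha_t(1-\bar{\alpha}_{t-1})}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}
 = \frac{1-\bar{\alpha}_t}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}
 = \frac{1}{\sqrt{\alpha_t}}, \\[4pt]
\text{coefficient of } \epsilon:\quad
&-\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\cdot\frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}
 = -\frac{\beta_t}{\sqrt{\alpha_t}\,\sqrt{1-\bar{\alpha}_t}}.
\end{aligned}
```

Factoring $1/\sqrt{\alpha_t}$ out of both coefficients gives the noise-prediction form of the mean.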


Ancestral sampling

Reverse process sampling (ancestral sampling) generates a sample by iteratively denoising from $x_T \sim \mathcal{N}(0,I)$:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sqrt{\tilde{\beta}_t}\, z, \quad z \sim \mathcal{N}(0,I)$$

The stochastic noise term $\sqrt{\tilde{\beta}_t}\, z$ is added at each step except the last ($t = 1$). With $T = 1000$ steps, this produces high-quality samples but requires 1000 neural network evaluations per sample, which is computationally expensive.
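The sampling loop can be exercised end to end without a trained network if the data distribution is simple enough that the optimal noise predictor is known. The sketch below assumes scalar data $x_0 \sim \mathcal{N}(M, S^2)$, for which $\mathbb{E}[\epsilon \mid x_t] = \sqrt{1-\bar{\alpha}_t}\,(x_t - \sqrt{\bar{\alpha}_t}M)/(\bar{\alpha}_t S^2 + 1 - \bar{\alpha}_t)$; this `eps_oracle` is a stand-in for $\epsilon_\theta$, and the values of `M`, `S`, and the sample count are arbitrary choices for illustration.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
abar, _prod = [], 1.0
for _a in alphas:
    _prod *= _a
    abar.append(_prod)

M, S = 2.0, 0.5   # toy data distribution: x_0 ~ N(M, S^2)

def eps_oracle(xt, t):
    """Analytic E[eps | x_t] for Gaussian data; stands in for a trained eps_theta."""
    a = abar[t - 1]
    var_t = a * S * S + (1.0 - a)
    return math.sqrt(1.0 - a) * (xt - math.sqrt(a) * M) / var_t

def ancestral_sample(eps_model, rng):
    """One pass of DDPM ancestral sampling with sigma_t^2 = beta_tilde_t."""
    x = rng.gauss(0.0, 1.0)                      # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        beta_t, alpha_t, a_t = betas[t - 1], alphas[t - 1], abar[t - 1]
        mean = (x - beta_t / math.sqrt(1.0 - a_t) * eps_model(x, t)) / math.sqrt(alpha_t)
        if t > 1:
            beta_tilde = (1.0 - abar[t - 2]) * beta_t / (1.0 - a_t)
            x = mean + math.sqrt(beta_tilde) * rng.gauss(0.0, 1.0)
        else:
            x = mean                             # no noise on the final step
    return x

rng = random.Random(0)
samples = [ancestral_sample(eps_oracle, rng) for _ in range(400)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print("sample mean:", mean, " sample std:", std, " (target:", M, S, ")")
```

Each sample costs $T$ calls to the noise predictor, which is the cost that the accelerated samplers later in the lecture aim to reduce.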

The U-Net architecture

The denoising network $\epsilon_\theta(x_t, t)$ is a time-conditioned U-Net in the original DDPM and in most implementations that follow it. The U-Net has four components:

Encoder (downsampling path): a series of residual blocks that progressively halve the spatial resolution while doubling the channel count (e.g., $256^2 \times 64 \to 128^2 \times 128 \to 64^2 \times 256 \to 32^2 \times 512$). Each residual block consists of two Conv2D + GroupNorm + SiLU layers with a skip connection.

Time embedding: the timestep $t \in \{1, \ldots, T\}$ is encoded into a sinusoidal position embedding $\gamma(t)$ (same formulation as positional encodings in Transformers), then passed through two linear layers with SiLU activation to produce a time embedding vector $\tau \in \mathbb{R}^{d_\text{model}}$. This time vector is added to the features of each residual block (or applied as a scale and shift via adaptive group norm), conditioning all feature computations on the noise level.

Decoder (upsampling path): a mirror of the encoder with skip connections from the encoder at each resolution (the "U" shape). Transposed convolutions or bilinear upsampling followed by convolution restore the original spatial resolution.

Attention at the bottleneck: self-attention blocks (multi-head attention at the lowest-resolution feature maps, e.g., $32^2$) allow the model to capture global structure. At higher resolutions, attention is too expensive and only local convolutions are used. The number of attention heads and the resolution at which attention is applied are key hyperparameters.

As a rough reference configuration for $256 \times 256$ images: the U-Net has $\sim$100M parameters, 4 resolution levels with channel counts $[128, 256, 512, 512]$, and attention at the lowest-resolution levels. Note that reaching a $16 \times 16$ attention level from $256 \times 256$ inputs takes more than 4 levels; the $256 \times 256$ models in Ho et al. use six resolution levels with self-attention at $16 \times 16$.


DDIM and accelerated sampling

DDIM (Song et al., 2020) derives a non-Markovian forward process that has the same marginals $q(x_t \mid x_0)$ as DDPM but allows deterministic reverse trajectories. The DDIM update:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \underbrace{\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}}_{\text{predicted }x_0} + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2}\; \epsilon_\theta(x_t, t) + \sigma_t z$$

With $\sigma_t = 0$, this is fully deterministic: $x_{t-1}$ depends only on $x_t$ and the predicted noise, with no added stochasticity. DDIM samples are deterministic functions of the initial noise $x_T$, enabling: (1) accelerated sampling: skip from $t$ to $t - \Delta t$ for large $\Delta t$ (using $\bar{\alpha}_{t-\Delta t}$ in place of $\bar{\alpha}_{t-1}$ in the update), reducing from 1000 to 10-50 steps with modest quality loss; (2) interpolation: interpolating between two $x_T$ values produces interpolated images; (3) inversion: running DDIM backward from a real image produces the noise $x_T$ that would have generated it.
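A deterministic DDIM sampler with step skipping fits in a few lines. The sketch below covers only the $\sigma_t = 0$ case and uses an evenly spaced timestep subsequence, which is one common choice rather than the only one. To keep it checkable without a trained model, it assumes a point-mass data distribution at `X0_STAR`, where the exact noise predictor is known; all names are illustrative.

```python
import math

def linear_alpha_bars(T, beta_1=1e-4, beta_T=0.02):
    """alpha_bar_t for the linear beta schedule; index 0 holds t = 1."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_1 + (beta_T - beta_1) * i / (T - 1))
        out.append(prod)
    return out

def ddim_timesteps(T, num_steps):
    """Evenly spaced, descending subsequence of {1, ..., T} that starts at T."""
    return sorted({max(1, round(T * k / num_steps)) for k in range(1, num_steps + 1)},
                  reverse=True)

def ddim_sample(eps_model, x_T, abar, num_steps):
    """Deterministic DDIM (sigma_t = 0) over a shortened timestep sequence."""
    ts = ddim_timesteps(len(abar), num_steps)
    x = x_T
    for i, t in enumerate(ts):
        a_t = abar[t - 1]
        # alpha_bar at the next retained timestep; 1.0 once we reach t = 0.
        a_prev = abar[ts[i + 1] - 1] if i + 1 < len(ts) else 1.0
        eps = eps_model(x, t)
        x0_pred = (x - math.sqrt(1.0 - a_t) * eps) / math.sqrt(a_t)
        x = math.sqrt(a_prev) * x0_pred + math.sqrt(1.0 - a_prev) * eps
    return x

ABAR = linear_alpha_bars(1000)
X0_STAR = 0.7   # toy point-mass "dataset"

def eps_point_mass(x_t, t):
    """Exact noise predictor when all data equals X0_STAR."""
    a = ABAR[t - 1]
    return (x_t - math.sqrt(a) * X0_STAR) / math.sqrt(1.0 - a)

x_50 = ddim_sample(eps_point_mass, 1.3, ABAR, num_steps=50)
x_10 = ddim_sample(eps_point_mass, 1.3, ABAR, num_steps=10)
x_50_again = ddim_sample(eps_point_mass, 1.3, ABAR, num_steps=50)
print(x_50, x_10, x_50 == x_50_again)
```

With an exact predictor the step count does not matter in this toy case; with a learned $\epsilon_\theta$ the predicted $x_0$ changes from step to step, and that is where the quality loss at small step counts comes from.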


SDE and ODE unification

Song et al. (2021) unify DDPM, NCSN, and normalizing flows in a single continuous-time SDE framework. The forward process is a stochastic differential equation:

$$dx = f(x, t)\, dt + g(t)\, dW$$

with $f$ a drift coefficient and $g$ a diffusion coefficient. Every choice of $(f, g)$ defines a different noising process; DDPM and NCSN are discrete approximations of specific SDEs (VP-SDE and VE-SDE respectively). The reverse SDE:

$$dx = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar{W}$$

requires the score $\nabla_x \log p_t(x)$ at each time step, which is what the denoising network learns. Crucially, the reverse SDE has a corresponding probability flow ODE:

$$dx = \left[f(x, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x)\right] dt$$

whose marginals match those of the SDE at every $t$ but which is deterministic: a continuous normalizing flow. DDIM sampling can be read as a first-order discretization of this ODE (Euler steps on a reparameterized form of it).
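The claim that the probability flow ODE transports the marginals can be checked numerically in a case where everything is known in closed form. The sketch below assumes a VP-SDE with $f(x,t) = -\tfrac{1}{2}\beta(t)x$, $g(t) = \sqrt{\beta(t)}$ and a linear $\beta(t)$ on $t \in [0,1]$ (the endpoints 0.1 and 20 are illustrative values), and scalar Gaussian data $\mathcal{N}(M, S^2)$ so that the score is analytic. Explicit Euler steps run backward from $t=1$ to $t=0$ and are compared with the exact transport map between the two Gaussians.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0   # illustrative linear beta(t) for a VP-SDE on [0, 1]
M, S = 2.0, 0.5                  # toy data distribution N(M, S^2)

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def mean_scale(t):
    """m_t = exp(-0.5 * integral_0^t beta(s) ds), so E[x_t] = m_t * M."""
    return math.exp(-0.5 * (BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))

def marginal_var(t):
    m = mean_scale(t)
    return m * m * S * S + 1.0 - m * m

def score(x, t):
    """Exact grad_x log p_t(x) for the Gaussian marginal."""
    return -(x - mean_scale(t) * M) / marginal_var(t)

def pf_ode_drift(x, t):
    """f(x, t) - 0.5 * g(t)^2 * score(x, t)."""
    return -0.5 * beta(t) * x - 0.5 * beta(t) * score(x, t)

def integrate_backward(x1, n_steps=1000):
    """Explicit Euler from t = 1 down to t = 0."""
    dt = 1.0 / n_steps
    x = x1
    for k in range(n_steps):
        t = 1.0 - k * dt
        x = x - dt * pf_ode_drift(x, t)
    return x

def exact_map(x1):
    """Monotone map sending N(m_1 M, v_1) to N(M, S^2)."""
    return M + S * (x1 - mean_scale(1.0) * M) / math.sqrt(marginal_var(1.0))

for x1 in (-1.0, 0.0, 1.5):
    print(x1, "Euler:", integrate_backward(x1), " exact:", exact_map(x1))
```

In one dimension the deterministic flow is just the monotone map between the two marginals, which is why a closed-form comparison is available here; in higher dimensions the score has to be learned and the ODE solved numerically.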


Practical considerations for DDPM training and inference

Effective DDPM implementation requires attention to several engineering details:

Noise schedule design: the original linear schedule $\beta_t = \beta_{\min} + \frac{t-1}{T-1}(\beta_{\max} - \beta_{\min})$ with $\beta_{\min} = 0.0001$, $\beta_{\max} = 0.02$ works reasonably well for CIFAR-10. The cosine schedule $\bar{\alpha}_t = f(t)/f(0)$, $f(t) = \cos^2\!\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$, with $s = 0.008$ often performs better by preserving more signal early in the noising process. The schedule choice determines the log-SNR curve $\log\text{SNR}(t) = \log(\bar{\alpha}_t / (1-\bar{\alpha}_t))$; a smooth log-SNR curve with no sharp kinks tends to perform better than one with discontinuities. Later work reports that learning the noise schedule jointly with the model, or reweighting the per-timestep loss (for example min-SNR weighting), can improve training efficiency or sample quality.

Timestep embedding: the Transformer-style positional encoding $\gamma(t) = [\sin(\omega_0 t), \ldots, \sin(\omega_{d/2-1} t), \cos(\omega_0 t), \ldots, \cos(\omega_{d/2-1} t)]$ with geometrically spaced frequencies $\omega_i = 10000^{-2i/d}$ and embedding width $d \sim 128$ provides rich information about the noise level to all layers of the U-Net. Some models use learned embeddings instead; the sinusoidal form needs no extra parameters and is the common default.
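As an illustration, a Transformer-style sinusoidal embedding of the timestep can be written as below. This is a sketch of one common variant: the base `max_period = 10000`, the sin-then-cos concatenation order, and the function name are assumptions, and implementations differ in these details.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep; dim must be even.

    Frequencies are geometrically spaced: w_i = max_period ** (-i / (dim / 2)).
    """
    half = dim // 2
    freqs = [max_period ** (-i / half) for i in range(half)]
    return [math.sin(t * w) for w in freqs] + [math.cos(t * w) for w in freqs]

emb_10 = timestep_embedding(10, 128)
emb_11 = timestep_embedding(11, 128)
print(len(emb_10), emb_10[:3], emb_10[64:67])
```

In a U-Net this vector would then pass through the small MLP described in the architecture section before being added to the residual-block features.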

GroupNorm vs. BatchNorm: diffusion models use GroupNorm (normalizing across groups of channels within each sample) rather than BatchNorm because batch statistics are unreliable with small batch sizes and the noise level varies per sample. BatchNorm couples the denoising network to batch statistics, degrading generalization.

Exponential moving average (EMA) of weights: during training, it is common to maintain an EMA of the model weights (e.g., $\theta_\text{EMA} \leftarrow 0.9999 \cdot \theta_\text{EMA} + 0.0001 \cdot \theta_\text{current}$). The EMA model is used for evaluation and typically provides better sample quality than the raw weights of the final checkpoint.
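The EMA update itself is a one-line recursion per parameter. The sketch below uses a plain dictionary of floats as a stand-in for model parameters and a shorter decay (0.99) so the effect is visible within a few hundred steps; both are illustrative choices, not part of any particular framework.

```python
def ema_update(ema_params, params, decay=0.9999):
    """In-place update: ema <- decay * ema + (1 - decay) * current."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

params = {"w": 0.0, "b": 1.0}
ema = dict(params)              # EMA copy starts at the initial weights
for step in range(1000):
    params["w"] = 1.0           # pretend the optimizer moved w to 1.0 and kept it there
    ema_update(ema, params, decay=0.99)
print(ema)
```

With decay $d$ the average has an effective memory of roughly $1/(1-d)$ steps, so a decay of 0.9999 averages over on the order of ten thousand updates.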

Guidance and conditional generation: classifier-free guidance trains a single network for both conditional (class-conditioned or text-conditioned) and unconditional denoising by randomly replacing the conditioning $c$ with a null token $\emptyset$ during training. During inference, the two predictions are combined into a guided prediction: $\tilde{\epsilon}_\theta(x_t, c, t) = \epsilon_\theta(x_t, \emptyset, t) + w \cdot \left(\epsilon_\theta(x_t, c, t) - \epsilon_\theta(x_t, \emptyset, t)\right)$ for guidance weight $w > 0$. Setting $w = 1$ recovers the purely conditional prediction, and $w > 1$ extrapolates beyond it. Larger $w$ increases adherence to the conditioning signal but risks degrading sample diversity. Typical values: $w \in [1.5, 7.5]$ depending on the guidance strength desired.
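The two ingredients of classifier-free guidance, conditioning dropout at training time and the weighted combination at inference time, can be sketched as follows. The function names, the `None` null token, and the dropout probability of 0.1 are illustrative assumptions; the noise predictions are plain lists standing in for network outputs.

```python
import random

def maybe_drop_condition(cond, p_uncond, rng, null_token=None):
    """Training time: replace the condition with the null token with probability p_uncond."""
    return null_token if rng.random() < p_uncond else cond

def cfg_combine(eps_uncond, eps_cond, w):
    """Inference time: eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.0, 1.0]   # stand-in for eps_theta(x_t, null, t)
eps_c = [1.0, 3.0]   # stand-in for eps_theta(x_t, c, t)
for w in (0.0, 1.0, 2.0):
    print(w, cfg_combine(eps_u, eps_c, w))

rng = random.Random(0)
dropped = sum(maybe_drop_condition("cat", 0.1, rng) is None for _ in range(10000))
print("fraction of training examples seen unconditionally:", dropped / 10000)
```

Each guided sampling step needs two network evaluations (conditional and unconditional), so guidance roughly doubles the per-step cost unless the two are batched together.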

Sampling efficiency trade-offs: ancestral sampling with $T = 1000$ steps typically produces the highest sample quality but is slow. DDIM with $T' = 50$ steps runs $20\times$ faster with modest quality loss ($\sim 0.5$ FID increase on CIFAR-10). Rectified flows and consistency models (Week 7) push this further, toward few-step or even single-step ($T' = 1$) generation. The choice of acceleration method reflects a quality-speed tradeoff inherent to diffusion-based generation.


GenAI context: DDPM across the course sequence

The DDPM framework appears across all four courses in different guises:

| DDPM concept | Robotics (Course 2) | Reinforcement Learning (Course 1) | VLMs (Course 4) |
|---|---|---|---|
| Forward process (data → noise) | Action trajectory corruption for training | State transition noise model | Image corruption for masked pretraining |
| Reverse process (denoising) | Diffusion policy action generation | Planning via backward pass | Visual token prediction |
| Noise prediction objective $\mathcal{L}_\text{simple}$ | ACT (Action Chunking with Transformers) / diffusion policy training loss | TD (temporal-difference) error as noise signal | MAE pixel prediction |
| U-Net score network | Observation + time → action score | Value network over state-time | Vision encoder + time embedding |
| Classifier-free guidance | Goal-conditioned diffusion policy | Reward-weighted policy gradient | Text-conditioned image generation |
| DDIM deterministic sampling | 10-step diffusion policy inference | Model predictive control with score | Fast text-to-image generation |

The diffusion policy in Course 2 (Week 9) is DDPM applied to action distributions: $x_0$ becomes the clean action sequence, $x_t$ is the noised action, and the denoising network takes the current observation as conditioning. The same DDPM math applies exactly; the only difference is that the data being modeled is a robot action trajectory rather than an image.

At 50 Hz robot control, 1000-step DDPM sampling is infeasible. DDIM with 10–20 steps runs in ~20ms per action, making real-time diffusion policy practical. The fact that DDIM's deterministic reverse process enables step-skipping without retraining is therefore not just a generative modeling curiosity — it is a hard engineering requirement for physical robot deployment.


Key takeaways

  • DDPM defines a fixed Gaussian forward process that corrupts data to noise in $T$ steps; the marginals $q(x_t \mid x_0)$ are Gaussian and computable in closed form from $\bar{\alpha}_t$, which is set by the noise schedule (linear or cosine).
  • The ELBO decomposes into KL terms between the forward posterior and the learned reverse step, and the posterior has a tractable Gaussian form.
  • Reparameterizing the reverse mean in terms of the predicted noise $\epsilon_\theta(x_t, t)$ reduces all $L_{t-1}$ terms to $\mathcal{L}_\text{simple}$: equal-weight MSE between true noise and predicted noise across all timesteps, which Ho et al. found empirically to give better samples than exact ELBO weighting.
  • The denoising network is a U-Net with time embeddings, residual blocks, and attention at low resolutions; its noise prediction is a rescaled negative of the score $\nabla_x \log p_t(x)$ at each noise level, $\epsilon_\theta(x_t, t) = -\sqrt{1-\bar{\alpha}_t}\, s_\theta(x_t, t)$.
  • Ancestral sampling generates samples by iteratively applying the learned reverse step with stochastic noise injection; DDIM enables deterministic accelerated sampling by skipping timesteps and setting the noise scale to zero.
  • The SDE/ODE unification reveals that DDPM's score network defines a probability flow ODE with the same marginals as the full stochastic reverse process, that is, a continuous normalizing flow. This connection links EBMs (Week 4), score matching, flows (Week 5), and diffusion models into a single theoretical framework.
  • The practical success of diffusion models stems from: (1) likelihood lower bounds via the ELBO, (2) stable training with standard MSE objectives unlike GANs, (3) high-quality samples competitive with or exceeding GANs without adversarial training, (4) accelerated sampling via DDIM for inference efficiency, and (5) natural conditioning for class-conditional and text-conditional generation via classifier-free guidance.


Conceptual questions

  1. Derive the forward posterior $q(x_{t-1} \mid x_t, x_0)$ from Bayes' rule using $q(x_t \mid x_{t-1})$ and $q(x_t \mid x_0)$. Show that it is Gaussian and derive the mean $\tilde{\mu}_t$ and variance $\tilde{\beta}_t$ in terms of $\bar{\alpha}_t$, $\alpha_t$, and $\beta_t$. Verify that in the limit $T \to \infty$ (infinitesimal steps), $\tilde{\beta}_t \to 0$, so that the forward posterior becomes deterministic.

  2. The simple objective $\mathcal{L}_\text{simple}$ weights all timesteps equally. An alternative is to use the exact ELBO weighting $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)}$ for each $L_{t-1}$ term. Analyze how this weighting differs from equal weighting: at early timesteps ($t \approx 1$, low noise), which weighting emphasizes the objective more? Explain the practical implication for sample quality if the simple objective underweights low-noise terms.

  3. DDIM sampling with $\sigma_t = 0$ produces deterministic samples from a given $x_T \sim \mathcal{N}(0,I)$. Using DDIM inversion (running the deterministic reverse process backward from a real image $x_0$), one can obtain a noise vector $x_T$ such that sampling from $x_T$ approximately recovers $x_0$. Describe a generative editing application enabled by this inversion capability, and explain what approximation error accumulates when the inversion is not exact.

  4. The cosine noise schedule is designed so that $\bar{\alpha}_t$ decreases slowly near $t = 0$. For a linear schedule vs. cosine schedule, compare the signal-to-noise ratio $\text{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t)$ at $t = 0.05T$. Explain why a higher SNR at small $t$ is beneficial for image generation quality, particularly for fine-detail structure in images.

  5. The probability flow ODE $dx = \left[f(x,t) - \frac{1}{2}g(t)^2 s_\theta(x,t)\right] dt$ defines a continuous normalizing flow. Compare the architectural requirements of this flow (using a score network as the vector field) to a coupling-layer normalizing flow (Week 5). Which model class is more flexible in terms of the distributions it can represent? Which is more computationally efficient for exact likelihood computation?

✦Solutions
  1. By Bayes' rule, $q(x_{t-1}\mid x_t,x_0)\propto q(x_t\mid x_{t-1})\,q(x_{t-1}\mid x_0)$; both factors are Gaussian in $x_{t-1}$, and a product of Gaussians is Gaussian. Completing the square gives precision $\tilde\beta_t^{-1}=\alpha_t\beta_t^{-1}+(1-\bar\alpha_{t-1})^{-1}$ (the factor $\alpha_t$ appears because $x_{t-1}$ enters $q(x_t \mid x_{t-1})$ scaled by $\sqrt{\alpha_t}$), i.e. $\tilde\beta_t=\frac{(1-\bar\alpha_{t-1})\beta_t}{1-\bar\alpha_t}$, and the stated $\tilde\mu_t$. As $T\to\infty$ each $\beta_t\to 0$, and since $\tilde\beta_t \le \beta_t$, also $\tilde\beta_t\to 0$: the per-step posterior collapses to a point and the reverse step becomes deterministic.
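  2. With $\sigma_t^2 = \beta_t$ the exact ELBO weight $\beta_t^2/[\,2\sigma_t^2\alpha_t(1-\bar\alpha_t)]$ becomes $\beta_t/[\,2\alpha_t(1-\bar\alpha_t)]$. At $t = 1$ we have $1-\bar\alpha_1 = \beta_1$, so the weight is about $1/2$; as $t$ grows, $1-\bar\alpha_t \to 1$ and the weight falls to roughly $\beta_t/2 \approx 0.01$. The exact ELBO weighting therefore emphasizes the low-noise (small-$t$) terms, and $\mathcal{L}_\text{simple}$ down-weights them relative to the ELBO. Ho et al. argue that this is beneficial for sample quality: denoising at very low noise is an easy task, so equal weighting lets the network spend capacity on the harder, higher-noise steps that determine global structure. The cost of underweighting low-noise terms shows up mainly in log-likelihood and, if taken too far, in less accurate fine detail.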
  3. DDIM inversion enables real-image editing (e.g. prompt-to-prompt): invert $x_0\to x_T$, then resample with modified conditioning to edit while preserving structure. The error: the deterministic ODE is only approximately reversible with finite steps (and under CFG), so discretization and linearization errors accumulate at each step and the reconstruction drifts from the original, more so at high guidance scale and with few steps.
  4. $\text{SNR}(t)=\bar\alpha_t/(1-\bar\alpha_t)$. At $t=0.05T$ the cosine schedule keeps $\bar\alpha_t$ near 1 (high SNR), while the linear schedule has already dropped $\bar\alpha_t$ further (lower SNR). Higher SNR early preserves signal and detail at low noise, where high-frequency image structure lives, so the model devotes effective capacity there; the linear schedule destroys fine detail too quickly.
  5. The score-network PF-ODE uses an unconstrained vector field (a U-Net) with no invertibility or Jacobian constraint on the architecture, so in practice it can represent a broader class of distributions than a coupling flow, whose layers must be invertible with tractable Jacobian. So the diffusion/ODE model is more flexible. For exact likelihood, the coupling flow is more efficient: one forward pass plus a closed-form log-Jacobian, versus integrating the Jacobian trace along the whole ODE trajectory (an ODE solve, usually with a Hutchinson trace estimator) for the PF-ODE.
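The weighting in question 2 is easy to inspect numerically. The sketch below assumes the linear schedule with $\beta_1 = 10^{-4}$, $\beta_T = 0.02$, $T = 1000$, and the choice $\sigma_t^2 = \beta_t$ (with $\sigma_t^2 = \tilde\beta_t$ the $t=1$ term is handled separately by the decoder). It prints the exact ELBO weight at several timesteps, which shows directly which end of the schedule that weighting emphasizes.

```python
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
abar, _prod = [], 1.0
for _b in betas:
    _prod *= 1.0 - _b
    abar.append(_prod)

def elbo_weight(t):
    """beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t)) with sigma_t^2 = beta_t."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    sigma2_t = beta_t                      # assumption: sigma_t^2 = beta_t
    return beta_t ** 2 / (2.0 * sigma2_t * alpha_t * (1.0 - abar[t - 1]))

weights = [elbo_weight(t) for t in range(1, T + 1)]
for t in (1, 10, 100, 500, 1000):
    print(t, weights[t - 1])
```

Equal weighting corresponds to replacing every entry of `weights` by the same constant, so the ratio between entries is what distinguishes the two objectives.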

Looking ahead

DDPM establishes the denoising framework. The next development simplifies the training objective and accelerates sampling by learning vector fields directly rather than noise.

Week 7: Flow Matching and Consistency Models. We derive the flow matching objective as regression against a conditional vector field, show that rectified flows produce straight trajectories enabling few-step sampling, and examine consistency models that distill a diffusion trajectory into a single-step generator.


Further reading

  • Sohl-Dickstein, J., et al. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML. (The conceptual origin of diffusion).
  • Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS. (DDPM).
  • Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR. (DDIM for faster sampling).