Week 6: Denoising Diffusion Probabilistic Models

Purpose of this lecture

DDPM (Ho et al., 2020) is the generative modeling framework behind many modern image, audio, and video generation systems. It reaches GAN-level sample quality without adversarial training, provides a tractable variational bound on the likelihood, and trains stably across model scales. This lecture derives DDPM from first principles: the forward noising process, the variational bound for the reverse process, the simplification from the full ELBO to the noise prediction objective, the ancestral sampling algorithm, and the SDE/ODE perspective that connects DDPM to score matching and underlies accelerated DDIM sampling.

The forward process

DDPM defines a fixed (non-learned) forward process that gradually corrupts a data sample $x_0 \sim p_\text{data}$ by adding Gaussian noise over $T$ steps:

q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\right)

where $\beta_1 < \beta_2 < \cdots < \beta_T$ is the noise schedule (small positive constants). Each step scales the current sample toward zero while adding noise. A key property: marginals can be computed in closed form for any $t$ directly from $x_0$ . Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$ . Then:

q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I\right)

This allows reparameterizing any noised sample as $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ . As $t \to T$ with an appropriate schedule, $\bar{\alpha}_T \approx 0$ and $x_T$ is approximately distributed as $\mathcal{N}(0, I)$ : almost all information about $x_0$ has been destroyed.
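The closed-form marginal can be checked numerically. The sketch below is illustrative only: it assumes a linear schedule with $\beta_1 = 10^{-4}$ and $\beta_T = 0.02$, a toy one-dimensional "dataset" in which every sample equals 2.0, and helper names (`linear_betas`, `q_sample`) that are made up for this example. It compares one-shot sampling from $q(x_t \mid x_0)$ with iterating the one-step kernel $t$ times.

```python
import numpy as np

def linear_betas(T, beta_1=1e-4, beta_T=0.02):
    """Linearly spaced beta_1..beta_T (endpoint values are an assumption)."""
    return np.linspace(beta_1, beta_T, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

T = 1000
betas = linear_betas(T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = np.full(20000, 2.0)          # toy data: every sample equals 2.0
t = 300

# One-shot sample from the closed-form marginal.
xt_closed, _ = q_sample(x0, t, alpha_bar, rng)

# Same marginal obtained by applying q(x_s | x_{s-1}) for s = 1..t.
xt_iter = x0.copy()
for s in range(t):
    noise = rng.standard_normal(x0.shape)
    xt_iter = np.sqrt(1.0 - betas[s]) * xt_iter + np.sqrt(betas[s]) * noise

target_mean = np.sqrt(alpha_bar[t - 1]) * 2.0
target_var = 1.0 - alpha_bar[t - 1]
print("mean:", xt_closed.mean(), xt_iter.mean(), "target", target_mean)
print("var: ", xt_closed.var(), xt_iter.var(), "target", target_var)
print("alpha_bar_T:", alpha_bar[-1])
```

Both routes should agree with the target mean and variance up to Monte Carlo error, and $\bar{\alpha}_T$ should be close to zero for this schedule.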

Noise schedules: the original DDPM used a linear schedule $\beta_t = \frac{t-1}{T-1}\beta_T + \frac{T-t}{T-1}\beta_1$ , which interpolates linearly from $\beta_1$ to $\beta_T$ . The cosine schedule (Nichol and Dhariwal, 2021) instead defines $\bar{\alpha}_t$ directly as $\bar{\alpha}_t = f(t)/f(0)$ with $f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ and a small offset $s$ . It decreases more slowly near $t = 0$ (preserving more data information early in the process) and avoids spending many late steps on samples that are already almost pure noise.
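As a rough illustration of how the two schedules differ, the sketch below tabulates $\bar{\alpha}_t$ and the ratio $\bar{\alpha}_t / (1 - \bar{\alpha}_t)$ for both. It assumes $T = 1000$, linear endpoints $\beta_1 = 10^{-4}$ and $\beta_T = 0.02$, and the cosine form $\bar{\alpha}_t = f(t)/f(0)$ with $f(t) = \cos^2\!\left(\frac{t/T + s}{1+s}\cdot\frac{\pi}{2}\right)$ and $s = 0.008$; the function names are hypothetical.

```python
import numpy as np

def linear_alpha_bar(T, beta_1=1e-4, beta_T=0.02):
    """alpha_bar[t] for t = 0..T under a linear beta schedule (alpha_bar[0] = 1)."""
    betas = np.linspace(beta_1, beta_T, T)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar[t] = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]

def snr(alpha_bar_t):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar); undefined at t = 0."""
    return alpha_bar_t / (1.0 - alpha_bar_t)

T = 1000
lin = linear_alpha_bar(T)
cos_ = cosine_alpha_bar(T)

for t in (50, 250, 500, 750):
    print(f"t={t:4d}  linear alpha_bar={lin[t]:.4f} (SNR {snr(lin[t]):.3g})  "
          f"cosine alpha_bar={cos_[t]:.4f} (SNR {snr(cos_[t]):.3g})")
```

Practical implementations of the cosine schedule usually also clip the implied $\beta_t$ away from 1 near $t = T$; that detail is omitted here.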

The reverse process and ELBO

The reverse process is a Markov chain that progressively denoises $x_T \sim \mathcal{N}(0,I)$ back to $x_0$ :

p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t), \quad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t))

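The reverse process is learned: the network outputs $\mu_\theta(x_t, t)$ , the mean of the slightly less noisy $x_{t-1}$ given $x_t$ . The ELBO lower-bounds the log-likelihood $\log p_\theta(x_0)$ ; equivalently, its negative $\mathcal{L}$ upper-bounds $-\log p_\theta(x_0)$ and is the quantity minimized in training: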

\mathcal{L} = \mathbb{E}_q\!\left[\log\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] = \mathbb{E}_q\!\Bigg[\underbrace{D_\text{KL}(q(x_T \mid x_0) \| p(x_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(x_{t-1} \mid x_t, x_0) \| p_\theta(x_{t-1} \mid x_t))}_{L_{t-1}} \underbrace{-\, \log p_\theta(x_0 \mid x_1)}_{L_0}\Bigg]

The term $L_T$ contains no learnable parameters, so it is a constant with respect to $\theta$ (and it is close to zero because $q(x_T \mid x_0) \approx p(x_T) = \mathcal{N}(0,I)$ for a good schedule). The reconstruction term $L_0$ is handled with a separate discrete decoder that assigns probabilities to the quantized pixel values. The key learning terms are $L_{t-1}$ : KL divergences between the forward posterior and the learned reverse step.

The forward posterior $q(x_{t-1} \mid x_t, x_0)$ is tractable given $x_0$ (because the forward process is Gaussian). By Bayes' rule:

q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\right)

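where $\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} x_t$ and $\tilde{\beta}_t = \frac{(1-\bar{\alpha}_{t-1})\beta_t}{1-\bar{\alpha}_t}$ . When the reverse variance is fixed to $\Sigma_\theta = \sigma_t^2 I$ (Ho et al. use $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t$ ), both distributions in $L_{t-1}$ are Gaussians with fixed covariance, so the KL reduces to a squared distance between the means: $L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2\right] + C$ , with $C$ independent of $\theta$ .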
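The posterior coefficients can be sanity-checked by multiplying the two Gaussian factors in $x_{t-1}$ directly. This is a minimal numerical sketch under an assumed linear schedule ( $\beta_1 = 10^{-4}$ , $\beta_T = 0.02$ , $T = 1000$ ); the variable names are illustrative. The precision of the product is the sum of the precisions contributed by $q(x_t \mid x_{t-1})$ , which is $\alpha_t/\beta_t$ , and by $q(x_{t-1} \mid x_0)$ , which is $1/(1-\bar{\alpha}_{t-1})$ .

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                  # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
# alpha_bar_{t-1}, using the convention alpha_bar_0 = 1
alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])

# Closed-form coefficients of the forward posterior as stated in the text.
coef_x0 = np.sqrt(alpha_bar_prev) * betas / (1.0 - alpha_bar)
coef_xt = np.sqrt(alphas) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)
beta_tilde = (1.0 - alpha_bar_prev) * betas / (1.0 - alpha_bar)

# Same quantities from the product of Gaussians in x_{t-1}, for t >= 2
# (t = 1 is excluded because q(x_0 | x_0) is a point mass).
i = slice(1, None)
precision = alphas[i] / betas[i] + 1.0 / (1.0 - alpha_bar_prev[i])
var_bayes = 1.0 / precision
coef_xt_bayes = var_bayes * np.sqrt(alphas[i]) / betas[i]
coef_x0_bayes = var_bayes * np.sqrt(alpha_bar_prev[i]) / (1.0 - alpha_bar_prev[i])

print(np.allclose(var_bayes, beta_tilde[i]),
      np.allclose(coef_xt_bayes, coef_xt[i]),
      np.allclose(coef_x0_bayes, coef_x0[i]))
```

Note that $\tilde{\beta}_1 = 0$ under the convention $\bar{\alpha}_0 = 1$ , which is consistent with no noise being added on the last reverse step.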

The noise prediction objective

The learned reverse mean $\mu_\theta(x_t, t)$ should match $\tilde{\mu}_t(x_t, x_0)$ . Substituting the reparameterization $x_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\epsilon)/\sqrt{\bar{\alpha}_t}$ into the expression for $\tilde{\mu}_t$ :

\tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon\right)

This suggests parameterizing $\mu_\theta(x_t, t)$ as:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t)\right)

where $\epsilon_\theta(x_t, t)$ predicts the noise $\epsilon$ that was added to $x_0$ to produce $x_t$ . The $L_{t-1}$ terms then reduce to:

\mathcal{L}_\text{simple}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\!\left[\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, t)\|^2\right]

This is the simple objective of Ho et al.: sample a timestep $t$ , sample noise $\epsilon \sim \mathcal{N}(0,I)$ , compute the noised input $x_t$ , and train the network to predict $\epsilon$ from $x_t$ and $t$ . Up to timestep-dependent weights, the objective is a sum of denoising score matching (DSM) losses over all noise levels, and the connection to score matching (Week 4) is exact: $\epsilon_\theta(x_t, t) = -\sqrt{1-\bar{\alpha}_t}\, s_\theta(x_t, t)$ .
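The training recipe above can be sketched as a single Monte Carlo estimate of $\mathcal{L}_\text{simple}$ . The code is a NumPy illustration, not a training loop: `eps_model` is a hypothetical callable standing in for the network, the linear schedule is an assumption, and the toy data is a point mass at `c` so that an "oracle" predictor can recover $\epsilon$ exactly from $x_t$ .

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bar, rng):
    """One Monte Carlo estimate of L_simple on a batch x0 of shape (B, D)."""
    B = x0.shape[0]
    T = len(alpha_bar)
    t = rng.integers(1, T + 1, size=B)            # uniform timestep per example
    eps = rng.standard_normal(x0.shape)           # target noise
    a = alpha_bar[t - 1][:, None]                 # alpha_bar_t, broadcast over D
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return np.mean(np.sum((eps - eps_model(x_t, t)) ** 2, axis=1))

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
c = 1.5
x0 = np.full((4096, 8), c)                        # toy data: point mass at c

def zero_model(x_t, t):
    """Baseline that always predicts zero noise."""
    return np.zeros_like(x_t)

def oracle_model(x_t, t):
    """Exact noise for point-mass data: invert the reparameterization."""
    a = alpha_bar[t - 1][:, None]
    return (x_t - np.sqrt(a) * c) / np.sqrt(1.0 - a)

loss_zero = simple_loss(zero_model, x0, alpha_bar, rng)
loss_oracle = simple_loss(oracle_model, x0, alpha_bar, rng)
print(loss_zero, loss_oracle)
```

The zero predictor should give a loss near the data dimension $D = 8$ (since $\mathbb{E}\|\epsilon\|^2 = D$ ), while the oracle drives it to numerical zero. For real data $x_0$ is not recoverable from $x_t$ alone, so the best achievable predictor is $\mathbb{E}[\epsilon \mid x_t]$ and the minimum loss is strictly positive.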

Step-by-step: why noise prediction works

The algebraic substitution connecting the reverse mean to noise prediction deserves careful walkthrough. Starting from the forward posterior mean: $\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} x_t$

The key insight is that $x_0$ is not observed during generation — only $x_t$ is. But we can express $x_0$ in terms of $x_t$ and the noise $\epsilon$ using the reparameterization $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$ , which gives: $\hat{x}_0(x_t, \epsilon) = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\epsilon}{\sqrt{\bar{\alpha}_t}}$

Substituting into $\tilde{\mu}_t$ : $\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t} \cdot \frac{x_t - \sqrt{1-\bar{\alpha}_t}\epsilon}{\sqrt{\bar{\alpha}_t}} + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} x_t$

Collecting terms in $x_t$ (using $\beta_t = 1 - \alpha_t$ and $\bar{\alpha}_{t-1} = \bar{\alpha}_t / \alpha_t$ ): $= \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon\right)$

This is the clean noise-prediction parameterization. The model $\epsilon_\theta(x_t, t)$ that predicts $\epsilon$ from $x_t$ gives the reverse mean $\mu_\theta(x_t, t) = (x_t - \beta_t\epsilon_\theta(x_t,t)/\sqrt{1-\bar{\alpha}_t})/\sqrt{\alpha_t}$ . The L2 loss between $\tilde{\mu}_t$ and $\mu_\theta$ then reduces to $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ , multiplied by a timestep-dependent prefactor $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)}$ . Ho et al. drop this prefactor to get $\mathcal{L}_\text{simple}$ , finding empirically that equal weighting across timesteps gives better samples than the exact ELBO weighting.
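The algebra can be confirmed numerically by evaluating the posterior mean both ways for random inputs. This is a small check under an assumed linear schedule; the timestep `t = 400` and the vector size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

t = 400                                 # arbitrary 1-indexed timestep, t >= 2
b, a = betas[t - 1], alphas[t - 1]
ab, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]

x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps   # reparameterized sample

# Posterior mean written in terms of (x_0, x_t).
mu_from_x0 = (np.sqrt(ab_prev) * b / (1.0 - ab)) * x0 \
           + (np.sqrt(a) * (1.0 - ab_prev) / (1.0 - ab)) * x_t

# Posterior mean written in terms of (x_t, eps).
mu_from_eps = (x_t - b / np.sqrt(1.0 - ab) * eps) / np.sqrt(a)

print(np.max(np.abs(mu_from_x0 - mu_from_eps)))
```

The two expressions should agree to floating-point precision, because the coefficient of $x_t$ collapses via $\beta_t + \alpha_t - \bar{\alpha}_t = 1 - \bar{\alpha}_t$ and the coefficient of $\epsilon$ via $\sqrt{\bar{\alpha}_{t-1}/\bar{\alpha}_t} = 1/\sqrt{\alpha_t}$ .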

Ancestral sampling

Reverse process sampling (ancestral sampling) generates a sample by iteratively denoising from $x_T \sim \mathcal{N}(0,I)$ :

x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t)\right) + \sqrt{\tilde{\beta}_t}\, z, \quad z \sim \mathcal{N}(0,I)

The stochastic noise term $\sqrt{\tilde{\beta}_t}\, z$ is added at each step except the last ( $t = 1$ ). With $T = 1000$ steps, this produces high-quality samples but requires 1000 neural network evaluations per sample — computationally expensive.
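The sampling loop can be illustrated end to end on a toy problem where no training is needed. The sketch below assumes one-dimensional data $x_0 \sim \mathcal{N}(m, s^2)$ , for which the optimal noise predictor $\mathbb{E}[\epsilon \mid x_t]$ has a closed form; `gaussian_eps` is a hypothetical stand-in for a trained $\epsilon_\theta$ , and the linear schedule is an assumption.

```python
import numpy as np

def make_schedule(T, beta_1=1e-4, beta_T=0.02):
    """Linear schedule plus the derived quantities used by the sampler."""
    betas = np.linspace(beta_1, beta_T, T)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])
    beta_tilde = (1.0 - alpha_bar_prev) * betas / (1.0 - alpha_bar)
    return betas, alphas, alpha_bar, beta_tilde

def gaussian_eps(x_t, t, alpha_bar, m, s):
    """Exact E[eps | x_t] when x_0 ~ N(m, s^2); replaces a trained network."""
    ab = alpha_bar[t - 1]
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * m) / (ab * s**2 + 1.0 - ab)

def ancestral_sample(n, T, m, s, rng):
    betas, alphas, alpha_bar, beta_tilde = make_schedule(T)
    x = rng.standard_normal(n)                          # x_T ~ N(0, I)
    for t in range(T, 0, -1):                           # t = T, ..., 1
        eps_hat = gaussian_eps(x, t, alpha_bar, m, s)
        mean = (x - betas[t - 1] / np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat) \
               / np.sqrt(alphas[t - 1])
        z = rng.standard_normal(n) if t > 1 else 0.0    # no noise on the last step
        x = mean + np.sqrt(beta_tilde[t - 1]) * z
    return x

samples = ancestral_sample(n=5000, T=1000, m=2.0, s=0.5,
                           rng=np.random.default_rng(0))
print(samples.mean(), samples.std())
```

The sample mean and standard deviation should land close to the data values $m = 2.0$ and $s = 0.5$ . With a learned network the loop is identical; only `gaussian_eps` is replaced by the model call.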

The U-Net architecture

The denoising network $\epsilon_\theta(x_t, t)$ is a time-conditioned U-Net in the original DDPM and in many later implementations. It has four main components:

Encoder (downsampling path): a series of residual blocks that progressively halve the spatial resolution while doubling the channel count (e.g., $256^2 \times 64 \to 128^2 \times 128 \to 64^2 \times 256 \to 32^2 \times 512$ ). Each residual block consists of two Conv2D + GroupNorm + SiLU layers with a skip connection.

Time embedding: the timestep $t \in \{1, \ldots, T\}$ is encoded into a sinusoidal position embedding $\gamma(t)$ (same formulation as positional encodings in Transformers), then passed through two linear layers with SiLU activation to produce a time embedding vector $\tau \in \mathbb{R}^{d_\text{model}}$ . This time vector is added (or scale-shifted via adaptive group norm) to the features after each residual block — conditioning all feature computations on the noise level.

Decoder (upsampling path): a mirror of the encoder with skip connections from the encoder at each resolution (the "U" shape). Transposed convolutions or bilinear upsampling followed by convolution restore the original spatial resolution.

Attention at the bottleneck: self-attention blocks (multi-head attention at the lowest-resolution feature maps, e.g., $32^2$ ) allow the model to capture global structure. At higher resolutions, attention is too expensive and only local convolutions are used. The number of attention heads and the resolution at which attention is applied are key hyperparameters.

For reference, the $256 \times 256$ models in Ho et al. (2020) have $\sim$ 114M parameters, six resolution levels ( $256 \times 256$ down to $8 \times 8$ ), two residual blocks per level, and self-attention at the $16 \times 16$ level; channel widths grow with depth (for example from 128 to 512).
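The time-embedding path described above can be sketched in a few lines. This is an illustration only: it uses the Transformer-style sinusoidal encoding with a base period of 10000 (an assumption about the exact frequencies), random placeholder weights instead of trained ones, and hypothetical names `sinusoidal_embedding` and `time_mlp`.

```python
import numpy as np

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps of shape (B,) to (B, dim) features; dim must be even."""
    t = np.asarray(t, dtype=np.float64).reshape(-1, 1)
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

def silu(x):
    """SiLU activation x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def time_mlp(gamma, W1, b1, W2, b2):
    """Two linear layers with a SiLU in between, producing the vector tau."""
    return silu(gamma @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
emb_dim, d_model = 128, 256
# Placeholder weights; in a real model these are learned parameters.
W1, b1 = 0.02 * rng.standard_normal((emb_dim, d_model)), np.zeros(d_model)
W2, b2 = 0.02 * rng.standard_normal((d_model, d_model)), np.zeros(d_model)

emb = sinusoidal_embedding([0, 10, 999], emb_dim)
tau = time_mlp(emb, W1, b1, W2, b2)
print(emb.shape, tau.shape)
```

In a U-Net, `tau` would then be projected per residual block and added to (or used to scale and shift) that block's feature maps.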

DDIM and accelerated sampling

DDIM (Song et al., 2020) derives a non-Markovian forward process that has the same marginals $q(x_t \mid x_0)$ as DDPM but allows deterministic reverse trajectories. The DDIM update:

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\frac{x_t - \sqrt{1-\bar{\alpha}_t}\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}}_{\text{predicted }x_0} + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2}\, \epsilon_\theta(x_t, t) + \sigma_t z

With $\sigma_t = 0$ , this is fully deterministic: $x_{t-1}$ depends only on $x_t$ and the predicted noise, with no added stochasticity. DDIM samples are deterministic functions of the initial noise $x_T$ , enabling: (1) accelerated sampling — skip from $t$ to $t - \Delta t$ for large $\Delta t$ , reducing from 1000 to 10-50 steps with modest quality loss; (2) interpolation — interpolating between two $x_T$ values produces interpolated images; (3) inversion — running DDIM backward from a real image produces the noise $x_T$ that would have generated it.
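A deterministic DDIM sampler with step skipping can be sketched on the same kind of toy problem used to illustrate samplers without training. The code assumes one-dimensional data $x_0 \sim \mathcal{N}(m, s^2)$ with its closed-form optimal noise predictor (`gaussian_eps`, a hypothetical stand-in for $\epsilon_\theta$ ), a linear schedule, and evenly spaced timesteps; all of these are illustrative choices.

```python
import numpy as np

def alpha_bar_table(T, beta_1=1e-4, beta_T=0.02):
    """alpha_bar[t] for t = 0..T, with alpha_bar[0] = 1 (clean data)."""
    betas = np.linspace(beta_1, beta_T, T)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def gaussian_eps(x_t, ab, m, s):
    """Exact E[eps | x_t] when x_0 ~ N(m, s^2); replaces a trained network."""
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * m) / (ab * s**2 + 1.0 - ab)

def ddim_sample(x_T, T, steps, m, s):
    """Deterministic DDIM (sigma_t = 0) over `steps` evenly spaced timesteps."""
    ab = alpha_bar_table(T)
    ts = np.linspace(T, 0, steps + 1).round().astype(int)   # T, ..., 0
    x = x_T
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps_hat = gaussian_eps(x, ab[t], m, s)
        x0_hat = (x - np.sqrt(1.0 - ab[t]) * eps_hat) / np.sqrt(ab[t])   # predicted x_0
        x = np.sqrt(ab[t_prev]) * x0_hat + np.sqrt(1.0 - ab[t_prev]) * eps_hat
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(5000)
out_a = ddim_sample(x_T, T=1000, steps=50, m=2.0, s=0.5)
out_b = ddim_sample(x_T, T=1000, steps=50, m=2.0, s=0.5)
print(out_a.mean(), out_a.std(), np.array_equal(out_a, out_b))
```

Running the sampler twice from the same $x_T$ gives identical outputs, which is the property that makes interpolation and inversion possible. With only 50 steps the sample statistics are close to, but not exactly, those of the data; the gap is the discretization error of the skipped steps.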

SDE and ODE unification

Song et al. (2021) unify DDPM and NCSN in a single continuous-time SDE framework and connect both to continuous normalizing flows. The forward process is a stochastic differential equation:

dx = f(x, t)\, dt + g(t)\, dW

with $f$ a drift coefficient and $g$ a diffusion coefficient. Every choice of $(f, g)$ defines a different noising process; DDPM and NCSN are discrete approximations of specific SDEs (VP-SDE and VE-SDE respectively). The reverse SDE:

dx = [f(x, t) - g(t)^2 \nabla_x \log p_t(x)]\, dt + g(t)\, d\bar{W}

requires the score $\nabla_x \log p_t(x)$ at each time step — learned by the denoising network. Crucially, the reverse SDE has a corresponding probability flow ODE:

dx = \left[f(x, t) - \tfrac{1}{2} g(t)^2 \nabla_x \log p_t(x)\right] dt

whose marginals match those of the SDE at every $t$ but which is deterministic, making it a continuous normalizing flow. DDIM sampling with $\sigma_t = 0$ can be viewed as a discretization of this ODE for the VP-SDE.
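To make the correspondence between DDPM and the VP-SDE concrete, here is a short worked step. The continuous-time rate $\beta(t)$ and step size $\Delta t$ are notation introduced only for this sketch, with $\beta_t = \beta(t)\,\Delta t$ . Expanding one forward DDPM step to first order in $\beta_t$ :

x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z \;\approx\; \left(1 - \tfrac{1}{2}\beta(t)\,\Delta t\right) x_{t-1} + \sqrt{\beta(t)\,\Delta t}\; z, \quad z \sim \mathcal{N}(0, I)

This is an Euler-Maruyama step of the SDE

dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dW

so for DDPM the drift and diffusion coefficients are $f(x, t) = -\tfrac{1}{2}\beta(t)\, x$ and $g(t) = \sqrt{\beta(t)}$ . Substituting these, together with $\nabla_x \log p_t(x) \approx -\epsilon_\theta(x, t)/\sqrt{1 - \bar{\alpha}_t}$ , into the probability flow ODE gives a deterministic sampler driven entirely by the noise-prediction network.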

Practical considerations for DDPM training and inference

Effective DDPM implementation requires attention to several engineering details:

Noise schedule design: the original linear schedule $\beta_t = \beta_{\min} + \frac{t-1}{T-1} (\beta_{\max} - \beta_{\min})$ with $\beta_{\min} = 0.0001$ , $\beta_{\max} = 0.02$ works reasonably well for CIFAR-10. The cosine schedule $\bar{\alpha}_t = f(t)/f(0)$ , $f(t) = \cos^2\!\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$ , with $s = 0.008$ performs better by preserving more signal early in the noising process. The schedule choice determines the signal-to-noise curve $\text{SNR}(t) = \bar{\alpha}_t / (1-\bar{\alpha}_t)$ (usually inspected on a log scale); a smooth log-SNR curve with no sharp kinks tends to perform better than one with discontinuities. Later work reports that learning the noise schedule jointly with the model, or reweighting the loss as a function of SNR (for example min-SNR weighting), can improve training efficiency and sample quality.

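Timestep embedding: the sinusoidal encoding $\gamma(t) = [\sin(\omega_0 t), \cos(\omega_0 t), \ldots, \sin(\omega_{L-1} t), \cos(\omega_{L-1} t)]$ with geometrically spaced frequencies $\omega_k = 10000^{-k/L}$ and $L \approx 128$ frequencies (the Transformer positional-encoding form) provides rich information about the noise level to all layers of the U-Net. Some models use learned embeddings instead; sinusoidal embeddings need no extra parameters and vary smoothly across neighbouring timesteps.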

GroupNorm vs. BatchNorm: diffusion models use GroupNorm (normalizing across groups of channels within each sample) rather than BatchNorm because batch statistics are unreliable with small batch sizes and the noise level varies per sample. BatchNorm couples the denoising network to batch statistics, degrading generalization.

EMA weights for evaluation: during training, it is common to maintain an exponential moving average (EMA) of model weights (e.g., $\theta_\text{EMA} \leftarrow 0.9999 \cdot \theta_\text{EMA} + 0.0001 \cdot \theta_\text{current}$ ). The EMA model is used for evaluation and sampling, and typically gives better sample quality than the raw training weights at the same step.
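The EMA update is a one-line rule applied to every parameter after each optimizer step. A minimal sketch, assuming parameters are stored as a dictionary of NumPy arrays (the dictionary layout and the function name `ema_update` are illustrative, not a specific library's API):

```python
import numpy as np

def ema_update(ema_params, params, decay=0.9999):
    """In-place update: ema <- decay * ema + (1 - decay) * current."""
    for name, value in params.items():
        ema_params[name] *= decay
        ema_params[name] += (1.0 - decay) * value
    return ema_params

# Toy example with a smaller decay so the effect is visible after two steps.
params = {"w": np.ones(3)}
ema = {"w": np.zeros(3)}

ema_update(ema, params, decay=0.9)   # 0.9 * 0 + 0.1 * 1 = 0.1
first = ema["w"].copy()
ema_update(ema, params, decay=0.9)   # 0.9 * 0.1 + 0.1 * 1 = 0.19
second = ema["w"].copy()
print(first, second)
```

With a decay of 0.9999 the average has an effective horizon of roughly 10,000 steps, so implementations commonly initialize the EMA copy from the model's initial weights rather than from zeros.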

Guidance and conditional generation: classifier-free guidance trains a single model to make both unconditional and conditional (class-conditioned or text-conditioned) noise predictions, typically by randomly replacing the condition $c$ with a null token $\emptyset$ during training. During inference, the guided prediction is a weighted combination: $\hat{\epsilon}_\theta(x_t, c, t) = \epsilon_\theta(x_t, \emptyset, t) + w \cdot (\epsilon_\theta(x_t, c, t) - \epsilon_\theta(x_t, \emptyset, t))$ for guidance weight $w > 0$ . Setting $w = 1$ recovers the purely conditional prediction; $w > 1$ extrapolates away from the unconditional one. Larger $w$ increases adherence to the conditioning signal but reduces sample diversity. Typical values: $w \in [1.5, 7.5]$ depending on the guidance strength desired.
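The guidance rule is a simple extrapolation between two noise predictions. The sketch below uses made-up vectors in place of real network outputs; `cfg_eps` is a hypothetical helper name.

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Placeholder predictions standing in for eps_theta(x_t, null, t) and eps_theta(x_t, c, t).
eps_uncond = np.array([0.0, 1.0, -1.0])
eps_cond = np.array([1.0, 1.0, 0.0])

guided_w0 = cfg_eps(eps_uncond, eps_cond, w=0.0)   # equals the unconditional prediction
guided_w1 = cfg_eps(eps_uncond, eps_cond, w=1.0)   # equals the conditional prediction
guided_w3 = cfg_eps(eps_uncond, eps_cond, w=3.0)   # extrapolates past the conditional one
print(guided_w0, guided_w1, guided_w3)
```

The guided prediction is then used in place of $\epsilon_\theta(x_t, t)$ in whichever sampler is running, at the cost of two network evaluations per step (or one batched evaluation of both inputs).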

Sampling efficiency trade-offs: ancestral sampling with $T = 1000$ steps gives high sample quality but is slow. DDIM with $T' = 50$ steps needs $20\times$ fewer network evaluations with a modest quality loss (on the order of $0.5$ FID on CIFAR-10 relative to 1000-step DDIM sampling). Rectified flows and consistency models (Week 7) push this further, toward few-step or even single-step generation. The choice of acceleration method reflects a quality-speed tradeoff inherent to diffusion-based generation.

GenAI context: DDPM across the course sequence

The DDPM framework appears across all four courses in different guises:

| DDPM concept | Robotics (Course 2) | RL (Course 1) | VLMs (Course 4) |
|---|---|---|---|
| Forward process (data → noise) | Action trajectory corruption for training | State transition noise model | Image corruption for masked pretraining |
| Reverse process (denoising) | Diffusion policy action generation | Planning via backward pass | Visual token prediction |
| Noise prediction objective $\mathcal{L}_\text{simple}$ | ACT / diffusion policy training loss | TD-error as noise signal | MAE pixel prediction |
| U-Net score network | Observation + time → action score | Value network over state-time | Vision encoder + time embedding |
| Classifier-free guidance | Goal-conditioned diffusion policy | Reward-weighted policy gradient | Text-conditioned image generation |
| DDIM deterministic sampling | 10-step diffusion policy inference | Model predictive control with score | Fast text-to-image generation |

The diffusion policy in Course 2 (Week 9) is DDPM applied to action distributions: $x_0$ becomes the clean action sequence, $x_t$ is the noised action, and the denoising network takes the current observation as conditioning. The same DDPM math applies exactly — the only difference is that the data being modeled is a robot action trajectory rather than an image.

At 50 Hz robot control, 1000-step DDPM sampling is infeasible. DDIM with 10–20 steps runs in ~20ms per action, making real-time diffusion policy practical. The fact that DDIM's deterministic reverse process enables step-skipping without retraining is therefore not just a generative modeling curiosity — it is a hard engineering requirement for physical robot deployment.

Key takeaways

- DDPM defines a fixed Gaussian forward process that corrupts data to noise in $T$ steps; marginals $q(x_t \mid x_0)$ are Gaussian and computable in closed form from the $\bar{\alpha}_t$ schedule (linear or cosine).
- The ELBO decomposes into KL terms between the forward posterior and the learned reverse step, and the posterior has a tractable Gaussian form.
- Reparameterizing the reverse mean in terms of the predicted noise $\epsilon_\theta(x_t, t)$ reduces all $L_{t-1}$ terms to $\mathcal{L}_\text{simple}$ : an equal-weight MSE between true and predicted noise across all timesteps, which Ho et al. found empirically to give better samples than exact ELBO weighting.
- The denoising network is a U-Net with time embeddings, residual blocks, and attention at low resolutions; its noise prediction is a rescaled estimate of the score, $\epsilon_\theta(x_t, t) = -\sqrt{1-\bar{\alpha}_t}\, \nabla_x \log p_t(x_t)$ , at each noise level.
- Ancestral sampling generates samples by iteratively applying the learned reverse step with stochastic noise injection; DDIM enables deterministic accelerated sampling by skipping timesteps and setting the noise scale to zero.
- The SDE/ODE unification shows that DDPM's score network defines a probability flow ODE with the same marginals as the full stochastic reverse process, i.e. a continuous normalizing flow. This connection links EBMs (Week 4), score matching, flows (Week 5), and diffusion models into a single theoretical framework.
- The practical success of diffusion models stems from: (1) tractable likelihood lower bounds via the ELBO, (2) stable training with standard MSE objectives, unlike GANs, (3) high-quality samples competitive with or exceeding GANs without adversarial training, (4) accelerated sampling via DDIM for inference efficiency, and (5) natural conditioning for class-conditional and text-conditional generation via classifier-free guidance.

Conceptual questions

1. Derive the forward posterior $q(x_{t-1} \mid x_t, x_0)$ from Bayes' rule using $q(x_t \mid x_{t-1})$ and $q(x_{t-1} \mid x_0)$ . Show that it is Gaussian and derive the mean $\tilde{\mu}_t$ and variance $\tilde{\beta}_t$ in terms of $\bar{\alpha}_t$ , $\alpha_t$ , and $\beta_t$ . Verify that in the limit $T \to \infty$ (infinitesimal steps), $\tilde{\beta}_t \to 0$ , so the forward posterior becomes deterministic.
2. The simple objective $\mathcal{L}_\text{simple}$ weights all timesteps equally. An alternative is to use the exact ELBO weighting $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)}$ for each $L_{t-1}$ term. Analyze how this weighting differs from equal weighting: at early timesteps ( $t \approx 1$ , low noise), which weighting emphasizes the objective more? Explain the practical implication for sample quality if the simple objective underweights low-noise terms.
3. DDIM sampling with $\sigma_t = 0$ produces deterministic samples from a given $x_T \sim \mathcal{N}(0,I)$ . Using DDIM inversion (running the deterministic reverse process backward from a real image $x_0$ ), one can obtain a noise vector $x_T$ such that sampling from $x_T$ approximately recovers $x_0$ . Describe a generative editing application enabled by this inversion capability, and explain what approximation error accumulates when the inversion is not exact.
4. The cosine noise schedule is designed so that $\bar{\alpha}_t$ decreases slowly near $t = 0$ . For a linear schedule vs. cosine schedule, compare the signal-to-noise ratio $\text{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t)$ at $t = 0.05T$ . Explain why a higher SNR at small $t$ is beneficial for image generation quality, particularly for fine-detail structure in images.
5. The probability flow ODE $dx = [f(x,t) - \frac{1}{2}g(t)^2 s_\theta(x,t)] dt$ defines a continuous normalizing flow. Compare the architectural requirements of this flow (using a score network as the vector field) to a coupling-layer normalizing flow (Week 5). Which model class is more flexible in terms of the distributions it can represent? Which is more computationally efficient for exact likelihood computation?

Solutions

1. By Bayes, $q(x_{t-1}\mid x_t,x_0)\propto q(x_t\mid x_{t-1})\,q(x_{t-1}\mid x_0)$ ; both factors are Gaussian in $x_{t-1}$ , and a product of Gaussians is Gaussian. Completing the square gives precision $\tilde\beta_t^{-1}=\alpha_t\beta_t^{-1}+(1-\bar\alpha_{t-1})^{-1} = \frac{1-\bar\alpha_t}{\beta_t(1-\bar\alpha_{t-1})}$ (using $\alpha_t(1-\bar\alpha_{t-1}) + \beta_t = 1-\bar\alpha_t$ ), i.e. $\tilde\beta_t=\frac{(1-\bar\alpha_{t-1})\beta_t}{1-\bar\alpha_t}$ , and the stated $\tilde\mu_t$ . As $T\to\infty$ each $\beta_t\to 0$ , so $\tilde\beta_t\to 0$ : the posterior collapses to a point and the reverse step becomes deterministic.
2. With $\sigma_t^2 = \beta_t$ the exact ELBO weight simplifies to $\frac{\beta_t}{2\alpha_t(1-\bar\alpha_t)}$ . At $t = 1$ we have $1-\bar\alpha_1 = \beta_1$ , so the weight is $\frac{1}{2\alpha_1} \approx 0.5$ ; at $t = T$ with the linear schedule ( $\beta_T = 0.02$ , $1 - \bar\alpha_T \approx 1$ ) it is about $0.01$ . The ELBO weighting therefore emphasizes low-noise (small- $t$ ) terms, and equal weighting down-weights them relative to the ELBO. Ho et al. argue this is beneficial: denoising at very small noise levels is an easy task, so $\mathcal{L}_\text{simple}$ lets the network focus on the harder, high-noise steps that determine global structure, which improves sample quality. The cost is that the low-noise terms are fit less tightly, so log-likelihood is somewhat worse and very fine detail may be modeled less precisely.
3. DDIM inversion enables real-image editing (e.g. prompt-to-prompt): invert $x_0\to x_T$ , then resample with modified conditioning to edit while preserving structure. The error: the deterministic ODE is only approximately reversible with finite steps (and under CFG), so discretization and linearization errors accumulate at each step and the reconstruction drifts from the original. The drift is worse at high guidance scale and with few steps.
4. $\text{SNR}(t)=\bar\alpha_t/(1-\bar\alpha_t)$ . At $t=0.05T$ the cosine schedule keeps $\bar\alpha_t$ near 1 (high SNR), while the linear schedule has already dropped $\bar\alpha_t$ (lower SNR). Higher SNR early preserves signal/detail at low noise, where high-frequency image structure lives, so the model devotes effective capacity there; the linear schedule destroys fine detail too quickly.
5. The score-network PF-ODE uses an unconstrained vector field (a U-Net) with no invertibility or Jacobian constraint, so it represents a strictly broader class of distributions than a coupling flow, whose layers must be invertible with tractable Jacobian. So the diffusion/ODE model is more flexible. For exact likelihood, the coupling flow is more efficient: one forward pass plus a closed-form log-Jacobian, versus integrating the Jacobian trace along the whole ODE trajectory (an ODE solve with a Hutchinson estimator) for the PF-ODE.

Looking ahead

DDPM establishes the denoising framework. The next development simplifies the training objective and accelerates sampling by learning vector fields directly rather than noise.

Week 7: Flow Matching and Consistency Models. We derive the flow matching objective as regression against a conditional vector field, show that rectified flows produce straight trajectories enabling few-step sampling, and examine consistency models that distill a diffusion trajectory into a single-step generator.