Purpose of this lecture
DDPM (Ho et al., 2020) is the generative modeling framework underlying many modern image, audio, and video generation systems. It achieves sample quality competitive with GANs without adversarial training, provides a tractable variational bound on the likelihood, and trains stably across model scales. This lecture derives DDPM from first principles: the forward noising process, the variational lower bound on the reverse process, the simplification from the full ELBO to the noise prediction objective, the ancestral sampling algorithm, and the SDE/ODE perspective that connects DDPM to score matching and underlies accelerated DDIM sampling.
The forward process
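DDPM defines a fixed (non-learned) forward process that gradually corrupts a data sample $x_0$ by adding Gaussian noise over $T$ steps:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \qquad t = 1, \dots, T,$$
where $\beta_t \in (0, 1)$ is the noise schedule (small positive constants). Each step scales the current sample toward zero while adding noise. A key property: marginals can be computed in closed form for any $t$ directly from $x_0$. Define $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$. Then:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right).$$
This allows reparameterizing any noised sample as $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. As $t \to T$ with an appropriate schedule, $\bar\alpha_T \approx 0$ and $q(x_T \mid x_0) \approx \mathcal{N}(0, I)$: the data has been completely destroyed.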
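The closed-form marginal can be checked numerically. The sketch below is a minimal illustration for scalar or list-valued data using only the Python standard library. The function names (`alpha_bars`, `q_sample`, `stepwise_moments`) and the specific schedule values are assumptions made for this example. It propagates the conditional mean scale and variance one step at a time and compares them with $\sqrt{\bar\alpha_t}$ and $1-\bar\alpha_t$.

```python
import math
import random


def alpha_bars(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for t = 1..T."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot (t is 1-indexed)."""
    a = abar[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def stepwise_moments(betas):
    """Propagate the mean scale and variance of x_t | x_0 one step at a time."""
    scale, var, out = 1.0, 0.0, []
    for beta in betas:
        scale *= math.sqrt(1.0 - beta)   # mean shrinks by sqrt(1 - beta_t)
        var = (1.0 - beta) * var + beta  # old variance shrinks, beta_t is added
        out.append((scale, var))
    return out


T = 1000
# Assumed schedule for illustration: beta_t rising linearly from 1e-4 to 0.02.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
abar = alpha_bars(betas)
moments = stepwise_moments(betas)
x_t = q_sample([1.0, -0.5, 0.25], t=500, abar=abar, rng=random.Random(0))
```

Because `q_sample` needs only $x_0$ and $t$, training can draw an arbitrary timestep per example without simulating the chain up to that step.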
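Noise schedules: the original DDPM used a linear schedule, with $\beta_t$ increasing linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. The cosine schedule (Nichol and Dhariwal, 2021) defines $\bar\alpha_t = f(t)/f(0)$ with $f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ for a small offset $s$, which decreases more slowly near $t = 0$ (preserving more data information early in the process) and is better matched to the informational needs of image generation.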
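To make the comparison concrete, the following sketch tabulates $\bar\alpha_t$ under both schedules. It assumes $T = 1000$, the linear endpoints $10^{-4}$ and $0.02$, and the offset $s = 0.008$ reported by Nichol and Dhariwal. The function names are hypothetical, and the cosine version omits the clipping of the implied $\beta_t$ that practical implementations usually apply.

```python
import math


def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for a linear beta schedule (endpoints are assumed defaults)."""
    out, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]


def snr(alpha_bar):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1.0 - alpha_bar)


T = 1000
linear = linear_alpha_bars(T)
cosine = cosine_alpha_bars(T)
for t in (100, 250, 500, 750):
    print(t, round(linear[t - 1], 4), round(cosine[t - 1], 4))
```

Under these assumptions the cosine schedule retains noticeably more signal at the midpoint of the chain than the linear one.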
The reverse process and ELBO
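The reverse process is a Markov chain that progressively denoises $x_T \sim \mathcal{N}(0, I)$ back to $x_0$:
$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right).$$
The reverse process is learned; the denoising network must predict where the clean signal was. The ELBO lower-bounds the log-likelihood $\log p_\theta(x_0)$:
$$\log p_\theta(x_0) \ \ge\ -\,\mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \ \underbrace{-\,\log p_\theta(x_0 \mid x_1)}_{L_0}\Big].$$
The term $L_T$ is a constant (the forward process is fixed, and $q(x_T \mid x_0) \approx \mathcal{N}(0, I)$ for a good schedule). The reconstruction term $L_0$ is handled with a discrete decoder. The key learning terms are $L_{t-1}$: KL divergences between the forward posterior $q(x_{t-1} \mid x_t, x_0)$ and the learned reverse step $p_\theta(x_{t-1} \mid x_t)$.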
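The forward posterior $q(x_{t-1} \mid x_t, x_0)$ is tractable given $x_0$ (because the forward process is Gaussian). By Bayes' rule:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right),$$
where $\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t$ and $\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t$. Because the reverse-step variance $\sigma_t^2$ is fixed, this Gaussian form makes $L_{t-1}$ a simple squared distance between the model mean $\mu_\theta$ and the target mean $\tilde\mu_t$.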
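The closed-form posterior can be checked against a direct application of Bayes' rule. The sketch below works with scalars and assumes arbitrary illustrative values for $\alpha_t$ and $\bar\alpha_{t-1}$; the function names are hypothetical. It multiplies the two Gaussian factors $q(x_t \mid x_{t-1})$ and $q(x_{t-1} \mid x_0)$ as functions of $x_{t-1}$ by adding their precisions, then compares the result with the stated formulas for $\tilde\mu_t$ and $\tilde\beta_t$.

```python
import math


def posterior_closed_form(x_t, x0, alpha_t, abar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the closed-form expressions."""
    beta_t = 1.0 - alpha_t
    abar_t = alpha_t * abar_prev
    mean = (math.sqrt(abar_prev) * beta_t * x0
            + math.sqrt(alpha_t) * (1.0 - abar_prev) * x_t) / (1.0 - abar_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return mean, var


def posterior_by_bayes(x_t, x0, alpha_t, abar_prev):
    """Same posterior, obtained by multiplying two Gaussian factors in x_{t-1}."""
    beta_t = 1.0 - alpha_t
    # q(x_t | x_{t-1}) viewed as a function of x_{t-1}: precision and linear term.
    prec_lik = alpha_t / beta_t
    lin_lik = math.sqrt(alpha_t) * x_t / beta_t
    # q(x_{t-1} | x_0): precision and linear term.
    prec_prior = 1.0 / (1.0 - abar_prev)
    lin_prior = math.sqrt(abar_prev) * x0 / (1.0 - abar_prev)
    var = 1.0 / (prec_lik + prec_prior)
    return var * (lin_lik + lin_prior), var


# Illustrative values only (assumed, not taken from any particular schedule).
closed = posterior_closed_form(x_t=0.3, x0=-1.2, alpha_t=0.98, abar_prev=0.5)
bayes = posterior_by_bayes(x_t=0.3, x0=-1.2, alpha_t=0.98, abar_prev=0.5)
```

The two routes agree to floating-point precision, which is the scalar version of the completing-the-square argument behind the formula.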
The noise prediction objective
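The learned reverse mean $\mu_\theta(x_t, t)$ should match $\tilde\mu_t(x_t, x_0)$. Substituting the reparameterization $x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon\right)$ into the expression for $\tilde\mu_t$:
$$\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\right).$$
This suggests parameterizing $\mu_\theta$ as:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right),$$
where $\epsilon_\theta(x_t, t)$ predicts the noise $\epsilon$ that was added to $x_0$ to produce $x_t$. Up to a timestep-dependent weight, the $L_{t-1}$ terms then reduce to:
$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right)\right\|^2\right].$$
This is the simple objective of Ho et al.: sample a timestep $t \sim \mathrm{Uniform}\{1, \dots, T\}$, sample noise $\epsilon \sim \mathcal{N}(0, I)$, compute the noised input $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, and train the network to predict $\epsilon$ from $x_t$ and $t$. The objective is a sum of denoising score matching (DSM) losses at every noise level simultaneously, and the connection to score matching (Week 4) is exact: $\nabla_{x_t} \log q(x_t \mid x_0) = -\epsilon / \sqrt{1-\bar\alpha_t}$, so the network implicitly defines a score estimate $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t) / \sqrt{1-\bar\alpha_t}$.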
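The training recipe can be sketched as a Monte Carlo estimate of the noise-prediction loss. This is a toy illustration on scalar data rather than a training loop: the dataset is a single point `C`, the schedule is an assumed linear one, and the two "models" are hand-written functions standing in for a neural network (all names are hypothetical). For a point-mass dataset the optimal noise predictor is known in closed form, which gives a useful sanity check.

```python
import math
import random


def make_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for an assumed linear beta schedule."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_start + (beta_end - beta_start) * i / (T - 1))
        out.append(prod)
    return out


def simple_loss(eps_model, dataset, abar, n_samples, rng):
    """Monte Carlo estimate of E ||eps - eps_model(x_t, t)||^2 for scalar data."""
    total = 0.0
    for _ in range(n_samples):
        x0 = rng.choice(dataset)
        t = rng.randint(1, len(abar))      # t ~ Uniform{1, ..., T}
        eps = rng.gauss(0.0, 1.0)          # eps ~ N(0, 1)
        a = abar[t - 1]
        x_t = math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps
        total += (eps - eps_model(x_t, t)) ** 2
    return total / n_samples


C = 2.0                                    # toy dataset: a single point
abar = make_alpha_bars(1000)


def zero_model(x_t, t):
    """Baseline that always predicts zero noise."""
    return 0.0


def oracle_model(x_t, t):
    """Optimal predictor when all data sits at C: invert the reparameterization."""
    a = abar[t - 1]
    return (x_t - math.sqrt(a) * C) / math.sqrt(1.0 - a)


loss_zero = simple_loss(zero_model, [C], abar, 2000, random.Random(0))
loss_oracle = simple_loss(oracle_model, [C], abar, 2000, random.Random(0))
```

The zero predictor scores close to $\mathbb{E}[\epsilon^2] = 1$, while the oracle drives the loss to zero. A real network sits between the two and is trained by gradient descent on the same quantity.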
Step-by-step: why noise prediction works
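The algebraic substitution connecting the reverse mean to noise prediction deserves a careful walkthrough. Starting from the forward posterior mean:
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t.$$
The key insight is that $x_0$ is not observed during generation; only $x_t$ is. But we can express $x_0$ in terms of $x_t$ and the noise $\epsilon$ using the reparameterization $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, which gives:
$$x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon\right).$$
Substituting into $\tilde\mu_t$:
$$\tilde\mu_t = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{(1-\bar\alpha_t)\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon\right) + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t.$$
Collecting terms in $x_t$ (using $\bar\alpha_t = \alpha_t \bar\alpha_{t-1}$ and $\beta_t = 1 - \alpha_t$), the coefficient of $x_t$ is $\frac{\beta_t + \alpha_t - \bar\alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}} = \frac{1}{\sqrt{\alpha_t}}$ and the coefficient of $\epsilon$ is $-\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}$, so:
$$\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\right).$$
This is the clean noise-prediction parameterization. The model $\epsilon_\theta$ that predicts $\epsilon$ from $(x_t, t)$ gives the reverse mean $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$. The L2 loss between $\tilde\mu_t$ and $\mu_\theta$ then reduces to $\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2$. The prefactor depends on $t$ but not on $\theta$; Ho et al. drop it to get $L_{\text{simple}}$, finding empirically that equal-weight timestep averaging performs better than the exact ELBO weighting.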
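The substitution can be verified numerically for scalars. The sketch below assumes arbitrary illustrative values for $\alpha_t$, $\bar\alpha_{t-1}$, $x_0$ and $\epsilon$, with hypothetical function names. It evaluates the posterior mean written in terms of $(x_t, x_0)$ and the noise-prediction form written in terms of $(x_t, \epsilon)$, and also computes the ELBO prefactor $\beta_t^2 / (2\sigma_t^2 \alpha_t (1-\bar\alpha_t))$ that the simple objective drops.

```python
import math


def posterior_mean(x_t, x0, alpha_t, abar_prev):
    """Posterior mean written in terms of (x_t, x_0)."""
    beta_t = 1.0 - alpha_t
    abar_t = alpha_t * abar_prev
    return (math.sqrt(abar_prev) * beta_t * x0
            + math.sqrt(alpha_t) * (1.0 - abar_prev) * x_t) / (1.0 - abar_t)


def noise_form_mean(x_t, eps, alpha_t, abar_prev):
    """The same mean written in terms of (x_t, eps)."""
    beta_t = 1.0 - alpha_t
    abar_t = alpha_t * abar_prev
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)


def elbo_weight(alpha_t, abar_prev, sigma2):
    """Prefactor beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t))."""
    beta_t = 1.0 - alpha_t
    abar_t = alpha_t * abar_prev
    return beta_t ** 2 / (2.0 * sigma2 * alpha_t * (1.0 - abar_t))


# Illustrative values (assumptions for this check only).
alpha_t, abar_prev = 0.98, 0.5
x0, eps = -1.2, 0.7
abar_t = alpha_t * abar_prev
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

mean_from_x0 = posterior_mean(x_t, x0, alpha_t, abar_prev)
mean_from_eps = noise_form_mean(x_t, eps, alpha_t, abar_prev)
weight = elbo_weight(alpha_t, abar_prev, sigma2=1.0 - alpha_t)
```

The two means coincide whenever $x_t$, $x_0$ and $\epsilon$ are linked by the reparameterization, which is all the derivation claims.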
Ancestral sampling
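Reverse process sampling (ancestral sampling) generates a sample by iteratively denoising from $x_T \sim \mathcal{N}(0, I)$:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I).$$
The stochastic noise term $\sigma_t z$ is added at each step except the last ($t = 1$). With $T = 1000$ steps, this produces high-quality samples but requires 1000 neural network evaluations per sample, which is computationally expensive.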
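The sampling loop is short enough to sketch in full. The example below is a toy under stated assumptions: scalar data concentrated at a single point `C`, an assumed linear schedule, the choice $\sigma_t^2 = \beta_t$, and a hand-written oracle standing in for the trained network $\epsilon_\theta$. All function names are hypothetical.

```python
import math
import random


def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule and its cumulative products alpha_bar_t."""
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    abar, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        abar.append(prod)
    return betas, abar


def ancestral_sample(eps_model, betas, abar, rng):
    """Run the reverse chain from x_T ~ N(0, 1) down to x_0 (scalar data)."""
    x = rng.gauss(0.0, 1.0)
    for t in range(len(betas), 0, -1):
        beta, a_bar = betas[t - 1], abar[t - 1]
        # Reverse mean in the noise-prediction parameterization.
        mean = (x - beta / math.sqrt(1.0 - a_bar) * eps_model(x, t)) / math.sqrt(1.0 - beta)
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0   # no noise on the final step
        x = mean + math.sqrt(beta) * z              # sigma_t^2 = beta_t (assumed choice)
    return x


C = 2.0
betas, abar = make_schedule(1000)


def oracle_eps(x_t, t):
    """Exact noise predictor for a dataset that is a point mass at C."""
    a = abar[t - 1]
    return (x_t - math.sqrt(a) * C) / math.sqrt(1.0 - a)


sample = ancestral_sample(oracle_eps, betas, abar, random.Random(0))
```

With the oracle predictor the final noiseless step lands on `C` up to rounding, which confirms that the update rule is wired consistently with the forward process. The cost of 1000 model calls per sample is visible directly in the loop length.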
The U-Net architecture
The denoising network is a time-conditioned U-Net in all standard implementations. The U-Net has three components:
Encoder (downsampling path): a series of residual blocks that progressively halve the spatial resolution while doubling the channel count (e.g., $C \to 2C \to 4C \to 8C$). Each residual block consists of two Conv2D + GroupNorm + SiLU layers with a skip connection.
Time embedding: the timestep $t$ is encoded into a sinusoidal position embedding (same formulation as positional encodings in Transformers), then passed through two linear layers with SiLU activation to produce a time embedding vector. This time vector is added (or scale-shifted via adaptive group norm) to the features after each residual block, conditioning all feature computations on the noise level.
Decoder (upsampling path): a mirror of the encoder with skip connections from the encoder at each resolution (the "U" shape). Transposed convolutions or bilinear upsampling followed by convolution restore the original spatial resolution.
Attention at the bottleneck: self-attention blocks (multi-head attention applied to the lowest-resolution feature maps) allow the model to capture global structure. At higher resolutions, attention is too expensive and only local convolutions are used. The number of attention heads and the resolution at which attention is applied are key hyperparameters.
For a standard DDPM on images: the U-Net has roughly 100M parameters, 4 resolution levels, attention at two of the lower-resolution levels, and channel counts that double at each downsampling stage.
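The sinusoidal time embedding described above can be sketched in a few lines. The layout below (all sines followed by all cosines, frequencies spaced geometrically down from 1 to roughly `1 / max_period`) is one common convention and is an assumption here; implementations differ in the exact exponent and in whether sine and cosine terms are interleaved. The function name and the `max_period` parameter are hypothetical.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep into `dim` features (dim even)."""
    half = dim // 2
    # Geometrically spaced frequencies: 1, max_period^(-1/half), ...
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb_zero = timestep_embedding(0, 8)
emb_a = timestep_embedding(10, 128)
emb_b = timestep_embedding(500, 128)
```

In a full model this vector would then pass through the two linear layers with SiLU and be injected into every residual block; those learned layers are omitted here.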
DDIM and accelerated sampling
DDIM (Song et al., 2020) derives a non-Markovian forward process that has the same marginals as DDPM but allows deterministic reverse trajectories. The DDIM update:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\left(\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right) + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t, t) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I).$$
With $\sigma_t = 0$ (equivalently $\eta = 0$), this is fully deterministic: $x_{t-1}$ depends only on $x_t$ and the predicted noise, with no added stochasticity. DDIM samples are deterministic functions of the initial noise $x_T$, enabling: (1) accelerated sampling: skip from $t$ to $t-k$ for large $k$, reducing from 1000 to 10-50 steps with modest quality loss; (2) interpolation: interpolating between two $x_T$ values produces interpolated images; (3) inversion: running DDIM backward from a real image produces the noise that would have generated it.
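A deterministic DDIM sampler with step skipping can be sketched as follows. As with any toy, the assumptions matter: data is scalar and concentrated at a single point `C`, the schedule is an assumed linear one with $T = 1000$, the timestep subsequence is evenly spaced, and an oracle noise predictor stands in for the trained network. The function names are hypothetical.

```python
import math


def make_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for an assumed linear beta schedule."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_start + (beta_end - beta_start) * i / (T - 1))
        out.append(prod)
    return out


def ddim_sample(eps_model, abar, x_T, n_steps):
    """Deterministic DDIM (sigma_t = 0) over an evenly spaced timestep subsequence."""
    T = len(abar)
    stride = T // n_steps                      # assumes n_steps divides T
    timesteps = list(range(T, 0, -stride))     # e.g. 1000, 900, ..., 100
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = abar[t - 1]
        # alpha_bar at the next (earlier) timestep; equals 1 once we reach t = 0.
        a_prev = abar[timesteps[i + 1] - 1] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, t)
        x0_pred = (x - math.sqrt(1.0 - a_t) * eps) / math.sqrt(a_t)
        x = math.sqrt(a_prev) * x0_pred + math.sqrt(1.0 - a_prev) * eps
    return x


C = 2.0
abar = make_alpha_bars(1000)


def oracle_eps(x_t, t):
    """Exact noise predictor for a dataset that is a point mass at C."""
    a = abar[t - 1]
    return (x_t - math.sqrt(a) * C) / math.sqrt(1.0 - a)


out_a = ddim_sample(oracle_eps, abar, x_T=0.3, n_steps=10)
out_b = ddim_sample(oracle_eps, abar, x_T=0.3, n_steps=10)
```

The same $x_T$ always maps to the same output, and only 10 model calls are made instead of 1000. With a learned, imperfect $\epsilon_\theta$ the quality loss from skipping depends on how well the predicted $x_0$ holds up across large jumps.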
SDE and ODE unification
Song et al. (2021) unify DDPM, NCSN, and normalizing flows in a single continuous-time SDE framework. The forward process is a stochastic differential equation:
$$dx = f(x, t)\,dt + g(t)\,dw,$$
with $f$ a drift coefficient and $g$ a diffusion coefficient. Every choice of $(f, g)$ defines a different noising process; DDPM and NCSN are discrete approximations of specific SDEs (VP-SDE and VE-SDE respectively). The reverse SDE:
$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar w$$
requires the score $\nabla_x \log p_t(x)$ at each time step, which is learned by the denoising network. Crucially, the reverse SDE has a corresponding probability flow ODE:
$$dx = \left[f(x, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)\right]dt,$$
whose marginals match those of the SDE at every $t$ but which is deterministic: a continuous normalizing flow. DDIM sampling can be viewed as a discretization of this ODE.
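The probability flow ODE can be integrated exactly in a toy case where the score is known. The sketch below assumes a VP-SDE with constant $\beta(t) = \beta$, so $f(x, t) = -\tfrac{1}{2}\beta x$ and $g(t) = \sqrt{\beta}$, and scalar data $x_0 \sim \mathcal{N}(0, \sigma_0^2)$. Every marginal $p_t$ is then Gaussian with variance $v(t) = \sigma_0^2 e^{-\beta t} + 1 - e^{-\beta t}$, and the ODE solution is $x(t) \propto \sqrt{v(t)}$. The constants and function names are illustrative assumptions; a trained network would replace the analytic `score`.

```python
import math

BETA = 5.0      # constant beta(t) for a VP-SDE (illustrative assumption)
SIGMA0 = 2.0    # toy data distribution: x_0 ~ N(0, SIGMA0^2)


def marginal_var(t):
    """Variance of p_t under dx = -0.5*BETA*x dt + sqrt(BETA) dw."""
    decay = math.exp(-BETA * t)
    return SIGMA0 ** 2 * decay + (1.0 - decay)


def score(x, t):
    """Exact score of the Gaussian marginal p_t."""
    return -x / marginal_var(t)


def probability_flow_drift(x, t):
    """f(x, t) - 0.5 * g(t)^2 * score(x, t)."""
    f = -0.5 * BETA * x
    g2 = BETA
    return f - 0.5 * g2 * score(x, t)


def integrate_backward(x_T, t_end=1.0, n_steps=2000):
    """Euler steps of the ODE from t = t_end down to t = 0."""
    dt = t_end / n_steps
    x, t = x_T, t_end
    for _ in range(n_steps):
        x -= dt * probability_flow_drift(x, t)
        t -= dt
    return x


x_T = 0.7
x_0 = integrate_backward(x_T)
# Analytic solution: x(0) = x(T) * sqrt(v(0) / v(T)).
expected = x_T * SIGMA0 / math.sqrt(marginal_var(1.0))
```

The deterministic map rescales a sample from $p_T$ so that its distribution becomes $p_0$, which is what "same marginals, no noise" means in practice.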
Practical considerations for DDPM training and inference
Effective DDPM implementation requires attention to several engineering details:
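Noise schedule design: the original linear schedule with $\beta_1 = 10^{-4}$, $\beta_T = 0.02$ works reasonably well for CIFAR-10. The cosine schedule $\bar\alpha_t \propto \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ for a small offset $s$ tends to perform better by preserving more signal early in the noising process. The schedule choice affects the SNR curve $\mathrm{SNR}(t) = \bar\alpha_t / (1 - \bar\alpha_t)$; a smooth SNR schedule with no sharp kinks performs better than one with discontinuities. Later work reports that learning the noise schedule jointly with the model, or reweighting the loss across timesteps (for example min-SNR weighting), can improve sample quality.
Timestep embedding: the positional encoding $\mathrm{PE}(t)_{2i} = \sin\!\left(t / 10000^{2i/d}\right)$, $\mathrm{PE}(t)_{2i+1} = \cos\!\left(t / 10000^{2i/d}\right)$ for $i = 0, \dots, d/2 - 1$ provides rich information about the noise level to all layers of the U-Net. Some models use learned embeddings; sinusoidal embeddings add no parameters and are generally regarded as more sample-efficient.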
GroupNorm vs. BatchNorm: diffusion models use GroupNorm (normalizing across groups of channels within each sample) rather than BatchNorm because batch statistics are unreliable with small batch sizes and the noise level varies per sample. BatchNorm couples the denoising network to batch statistics, degrading generalization.
EMA weights for evaluation: during training, it is common to maintain an exponential moving average (EMA) of model weights (e.g., with decay $0.9999$). The EMA model is used for evaluation and typically provides better sample quality than the final checkpoint.
Guidance and conditional generation: classifier-free guidance trains the model to output predictions for both unconditional and conditional (class-conditioned or text-conditioned) denoising. During inference, the final prediction is a weighted blend: $\hat\epsilon = \epsilon_\theta(x_t, t, \varnothing) + w\,\big[\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big]$ for guidance weight $w$. Larger $w$ increases adherence to the conditioning signal but risks degrading sample diversity. With this convention $w = 1$ recovers the purely conditional prediction, and the appropriate value depends on the guidance strength desired.
Sampling efficiency trade-offs: ancestral sampling with $T = 1000$ steps produces the highest sample quality but is slow. DDIM with 10 to 50 steps runs proportionally faster with modest quality loss (a small FID increase on CIFAR-10). Rectified flows (Week 7) push this further toward few-step and single-step sampling. The choice of acceleration method reflects a quality-speed tradeoff inherent to diffusion-based generation.
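Two of the engineering details above reduce to one-line updates, sketched here on plain Python lists. Both functions are illustrative and their names are hypothetical. The guidance blend assumes the convention $\hat\epsilon = \epsilon_{\text{uncond}} + w\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$; other write-ups use $(1 + w)\,\epsilon_{\text{cond}} - w\,\epsilon_{\text{uncond}}$, which shifts the meaning of $w$ by one. The EMA decay of 0.9999 is a commonly used value and is an assumption here.

```python
def ema_update(ema_weights, weights, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance blend of unconditional and conditional predictions."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy values standing in for network weights and noise predictions.
ema = ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9)
plain = cfg_combine([0.0, 1.0], [1.0, 3.0], w=1.0)    # purely conditional
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], w=2.0)   # extrapolates past conditional
```

In a training loop `ema_update` would run after every optimizer step, and at inference `cfg_combine` costs two network evaluations per sampling step, one with and one without the conditioning signal.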
GenAI context: DDPM across the course sequence
The DDPM framework appears across all four courses in different guises:
| DDPM concept | Robotics (Course 2) | Reinforcement Learning (Course 1) | VLMs (Course 4) |
|---|---|---|---|
| Forward process (data → noise) | Action trajectory corruption for training | State transition noise model | Image corruption for masked pretraining |
| Reverse process (denoising) | Diffusion policy action generation | Planning via backward pass | Visual token prediction |
| Noise prediction objective | ACT (Action Chunking with Transformers) / diffusion policy training loss | TD (Temporal Difference) error as noise signal | MAE pixel prediction |
| U-Net score network | Observation + time → action score | Value network over state-time | Vision encoder + time embedding |
| Classifier-free guidance | Goal-conditioned diffusion policy | Reward-weighted policy gradient | Text-conditioned image generation |
| DDIM deterministic sampling | 10-step diffusion policy inference | Model predictive control with score | Fast text-to-image generation |
The diffusion policy in Course 2 (Week 9) is DDPM applied to action distributions: $x_0$ becomes the clean action sequence, $x_t$ is the noised action, and the denoising network takes the current observation as conditioning. The same DDPM math applies exactly; the only difference is that the data being modeled is a robot action trajectory rather than an image.
At 50 Hz robot control, 1000-step DDPM sampling is infeasible. DDIM with 10–20 steps runs in ~20ms per action, making real-time diffusion policy practical. The fact that DDIM's deterministic reverse process enables step-skipping without retraining is therefore not just a generative modeling curiosity — it is a hard engineering requirement for physical robot deployment.
Key takeaways
- DDPM defines a fixed Gaussian forward process that corrupts data to noise in $T$ steps; marginals are Gaussian and computable in closed form using schedules $\beta_t$ (linear or cosine).
- The ELBO decomposes into KL terms $L_{t-1}$ between the forward posterior $q(x_{t-1} \mid x_t, x_0)$ and the learned reverse step $p_\theta(x_{t-1} \mid x_t)$, with the posterior having a tractable Gaussian form.
- Reparameterizing the reverse mean in terms of the predicted noise $\epsilon_\theta$ reduces all $L_{t-1}$ terms to $L_{\text{simple}}$: equal-weight MSE between true noise and predicted noise across all timesteps, which Ho et al. found empirically superior to exact ELBO weighting.
- The denoising network is a U-Net with time embeddings, residual blocks, and multi-scale attention; it learns to estimate (a rescaling of) the score $\nabla_{x_t} \log p_t(x_t)$ at each noise level.
- Ancestral sampling generates samples by iteratively applying the learned reverse step with stochastic noise injection; DDIM enables deterministic accelerated sampling by skipping timesteps and setting the noise scale $\sigma_t$ to zero.
- The SDE/ODE unification reveals that DDPM's score network defines a probability flow ODE with the same marginals as the full stochastic reverse process: a continuous normalizing flow. This connection links EBMs (Week 4), score matching, flows (Week 5), and diffusion models into a single theoretical framework.
- The practical success of diffusion models stems from: (1) likelihood lower bounds via the ELBO, (2) stable training with standard MSE objectives unlike GANs, (3) high-quality samples competitive with or exceeding GANs without adversarial training, (4) accelerated sampling via DDIM for inference efficiency, and (5) natural conditioning for class-conditional and text-conditional generation via classifier-free guidance.
Conceptual questions
- Derive the forward posterior $q(x_{t-1} \mid x_t, x_0)$ from Bayes' rule using $q(x_t \mid x_{t-1})$ and $q(x_{t-1} \mid x_0)$. Show that it is Gaussian and derive the mean and variance in terms of $\beta_t$, $\bar\alpha_t$, and $\bar\alpha_{t-1}$. Verify that in the limit $\beta_t \to 0$ (infinitesimal steps), $\tilde\beta_t \to 0$: the forward posterior becomes deterministic.
- The simple objective $L_{\text{simple}}$ weights all timesteps equally. An alternative is to use the exact ELBO weighting $\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}$ for each $L_{t-1}$ term. Analyze how this weighting differs from equal weighting: at early timesteps (small $t$, low noise), which weighting emphasizes the objective more? Explain the practical implication for sample quality if the simple objective underweights low-noise terms.
- DDIM sampling with $\sigma_t = 0$ produces deterministic samples from a given $x_T$. Using DDIM inversion (running the deterministic reverse process backward from a real image $x_0$), one can obtain a noise vector $x_T$ such that sampling from $x_T$ approximately recovers $x_0$. Describe a generative editing application enabled by this inversion capability, and explain what approximation error accumulates when the inversion is not exact.
- The cosine noise schedule is designed so that $\bar\alpha_t$ decreases slowly near $t = 0$. For a linear schedule vs. cosine schedule, compare the signal-to-noise ratio $\mathrm{SNR}(t) = \bar\alpha_t / (1 - \bar\alpha_t)$ at small $t$. Explain why a higher SNR at small $t$ is beneficial for image generation quality, particularly for fine-detail structure in images.
- The probability flow ODE defines a continuous normalizing flow. Compare the architectural requirements of this flow (using a score network as the vector field) to a coupling-layer normalizing flow (Week 5). Which model class is more flexible in terms of the distributions it can represent? Which is more computationally efficient for exact likelihood computation?
Looking ahead
DDPM establishes the denoising framework. The next development simplifies the training objective and accelerates sampling by learning vector fields directly rather than noise.
Week 7: Flow Matching and Consistency Models. We derive the flow matching objective as regression against a conditional vector field, show that rectified flows produce straight trajectories enabling few-step sampling, and examine consistency models that distill a diffusion trajectory into a single-step generator.
Further reading
- Sohl-Dickstein, J., et al. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML. (The conceptual origin of diffusion).
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS. (DDPM).
- Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR. (DDIM for faster sampling).