Purpose of this lecture
Energy-based models (EBMs) and score-based models are the theoretical ancestors of diffusion models. EBMs define distributions through an unnormalized density function whose normalization is intractable; learning them requires either approximating the partition function or finding a training objective that does not require it. Score matching provides exactly such an objective, and denoising score matching is the direct mathematical precursor of DDPM's training target. Understanding this lineage clarifies why diffusion models work and what their connection to physics-inspired sampling methods is.
Energy-based models
An energy-based model assigns an energy $E_\theta(x)$ to each configuration $x$, with lower energy corresponding to more probable configurations. The probability distribution is:
$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}$$
where $Z(\theta) = \int \exp(-E_\theta(x))\,dx$ is the partition function (normalizing constant). The energy function is a neural network; the Boltzmann/Gibbs form ensures non-negativity. EBMs are flexible: the energy function can be any architecture, and the distribution can have complex multi-modal structure captured by the shape of the energy landscape.
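The partition function problem: computing $Z(\theta)$ requires integrating over all of the input space $\mathcal{X}$, which is intractable for high-dimensional $x$. This makes direct likelihood evaluation impossible, and the MLE gradient:
$$\nabla_\theta \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)] = -\mathbb{E}_{x \sim p_{\text{data}}}[\nabla_\theta E_\theta(x)] + \mathbb{E}_{x \sim p_\theta}[\nabla_\theta E_\theta(x)]$$
requires computing $\mathbb{E}_{x \sim p_\theta}[\nabla_\theta E_\theta(x)]$, an expectation under the model distribution. This expectation is the key obstacle: it requires sampling from $p_\theta$, which itself requires MCMC.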
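To make the two terms of the MLE gradient concrete, here is a minimal sketch on a one-dimensional toy problem where the partition function can still be computed by brute force. The quadratic energy, the grid bounds, and the four data points are all assumptions chosen for illustration; in high dimensions the grid sum is exactly the step that becomes infeasible.

```python
import math

def energy(x, theta):
    """Toy energy E_theta(x) = 0.5 * (x - theta)^2 (hypothetical example)."""
    return 0.5 * (x - theta) ** 2

def dE_dtheta(x, theta):
    """Derivative of the toy energy with respect to theta."""
    return -(x - theta)

def model_expectation(f, theta, lo=-10.0, hi=10.0, n=4001):
    """Return (E_{p_theta}[f(x)], Z) using a Riemann sum on a 1D grid."""
    h = (hi - lo) / (n - 1)
    xs = [lo + i * h for i in range(n)]
    weights = [math.exp(-energy(x, theta)) for x in xs]
    Z = sum(weights) * h
    mean_f = sum(w * f(x) for w, x in zip(weights, xs)) * h / Z
    return mean_f, Z

data = [0.5, 1.0, 1.5, 2.0]   # made-up observations
theta = 0.0

# Positive phase: energy gradient averaged over data.
positive_phase = sum(dE_dtheta(x, theta) for x in data) / len(data)
# Negative phase: energy gradient averaged over the model distribution.
negative_phase, Z = model_expectation(lambda x: dE_dtheta(x, theta), theta)
grad_loglik = -positive_phase + negative_phase

print(f"Z = {Z:.4f}, log-likelihood gradient = {grad_loglik:.4f}")
```

For this particular energy the gradient reduces to the data mean minus $\theta$, so gradient ascent moves $\theta$ toward the data.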
Contrastive divergence
Contrastive divergence (CD) (Hinton, 2002) approximates the MLE gradient by running short MCMC chains (typically $k$ steps for a small $k$, often $k = 1$) initialized at training data rather than running chains to convergence. The gradient estimate is:
$$\nabla_\theta \log p_\theta(x^+) \approx -\nabla_\theta E_\theta(x^+) + \nabla_\theta E_\theta(x^-)$$
where $x^+$ is a training example (positive sample) and $x^-$ is the result of running $k$ steps of Markov chain Monte Carlo from $x^+$ (negative sample). CD lowers the energy of data and raises the energy of near-data points generated by short chains, a reasonable approximation to the true gradient when the model is close to the data distribution.
Persistent contrastive divergence (PCD) maintains a persistent buffer of MCMC chains across training iterations. Chains are not reinitialized from data but continue from their previous state, sampling more broadly from $p_\theta$ as the chains have more time to mix. PCD reduces bias in the gradient estimate at the cost of requiring many parallel chain states.
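The CD update can be sketched on a toy problem. This is an illustration under stated assumptions, not a reference implementation: the one-parameter energy $E_\theta(x) = \tfrac{1}{2}(x - \theta)^2$, the synthetic data, the Langevin transition used as the MCMC kernel, and all hyperparameters are hypothetical choices.

```python
import math
import random

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]   # synthetic data centered at 2.0

def grad_x_energy(x, theta):
    """d/dx of the toy energy E_theta(x) = 0.5 * (x - theta)^2."""
    return x - theta

def langevin_step(x, theta, eps):
    """One noisy gradient step on the energy, used here as the MCMC transition."""
    return x - 0.5 * eps * grad_x_energy(x, theta) + math.sqrt(eps) * rng.gauss(0.0, 1.0)

def cd_k_update(theta, batch, k=1, eps=0.5, lr=0.1):
    """One CD-k parameter update: chains start at data and run for k steps."""
    grad = 0.0
    for x_pos in batch:
        x_neg = x_pos
        for _ in range(k):
            x_neg = langevin_step(x_neg, theta, eps)
        # Estimate -dE/dtheta(x+) + dE/dtheta(x-), with dE/dtheta = -(x - theta).
        grad += (x_pos - theta) - (x_neg - theta)
    return theta + lr * grad / len(batch)

theta = -1.0
for _ in range(2000):
    theta = cd_k_update(theta, rng.sample(data, 64))

print(f"learned theta = {theta:.2f}")
```

In this toy setting $\theta$ drifts toward the data mean even with $k = 1$. A persistent variant would keep `x_neg` between updates instead of restarting it from `x_pos`.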
Langevin dynamics and MCMC sampling
Given a trained EBM, generating samples requires MCMC. Stochastic gradient Langevin dynamics (SGLD) combines gradient-based proposals with injected noise:
$$x_{t+1} = x_t - \frac{\epsilon}{2}\nabla_x E_\theta(x_t) + \sqrt{\epsilon}\,z_t, \qquad z_t \sim \mathcal{N}(0, I)$$
With step size $\epsilon \to 0$ and sufficiently many steps, SGLD converges to exact samples from $p_\theta$. The negative gradient $-\nabla_x E_\theta(x)$ points from high-energy to low-energy regions (downhill), while the noise prevents the chain from getting stuck in local minima. SGLD is a discretization of Brownian motion in a potential energy landscape.
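The negative energy gradient $-\nabla_x E_\theta(x)$ is the score function $\nabla_x \log p_\theta(x)$ under the EBM parameterization (since $\log p_\theta(x) = -E_\theta(x) - \log Z(\theta)$ and the partition function is constant in $x$). This connection between energy functions and scores is central to score matching.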
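As a sanity check on the Langevin update, the following sketch samples from the simplest possible energy, $E(x) = \tfrac{1}{2}x^2$, whose target density is a standard normal. The energy, step size, chain length, and number of chains are assumptions made for the example.

```python
import math
import random

def grad_energy(x):
    """Gradient of the toy energy E(x) = 0.5 * x^2, whose density is N(0, 1)."""
    return x

def langevin_chain(x0, eps, n_steps, rng):
    """Run the Langevin update for n_steps and return the final state."""
    x = x0
    for _ in range(n_steps):
        x = x - 0.5 * eps * grad_energy(x) + math.sqrt(eps) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
# Every chain starts far from the mode, at x = 3.
samples = [langevin_chain(3.0, eps=0.1, n_steps=200, rng=rng) for _ in range(2000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

For this quadratic energy the update is a linear recursion, and its stationary variance at fixed $\epsilon$ works out to $1/(1 - \epsilon/4)$ rather than exactly 1. That is a small instance of the finite-step-size bias that disappears only as $\epsilon \to 0$.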
Multi-modal landscapes and the sampling challenge
EBMs can represent arbitrarily complex multi-modal distributions, a key advantage. However, Langevin dynamics struggles precisely when the energy landscape has multiple well-separated modes. If the chain is initialized near one mode, the noise must be large enough to cross the energy barrier to reach other modes, but large noise degrades sample quality near any single mode. This is the mixing problem: for a bimodal distribution with modes separated by distance $r$ and energy barrier $\Delta E$, the mixing time scales as $\exp(\Delta E)$, exponentially in the barrier height.
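The effect of barrier height on mixing can be seen in a small experiment. This sketch assumes a hypothetical double-well energy $E(x) = h\,(x^2 - 1)^2$ with minima at $x = \pm 1$ and barrier height $h$; the step size, chain length, and chain count are likewise illustrative.

```python
import math
import random

def grad_double_well(x, h):
    """Gradient of E(x) = h * (x^2 - 1)^2: minima at -1 and +1, barrier h at 0."""
    return 4.0 * h * x * (x * x - 1.0)

def fraction_crossed(h, n_chains=100, n_steps=2000, eps=0.01, seed=0):
    """Start every chain in the left well; report the share that ends in the right well."""
    rng = random.Random(seed)
    crossed = 0
    for _ in range(n_chains):
        x = -1.0
        for _ in range(n_steps):
            x = x - 0.5 * eps * grad_double_well(x, h) + math.sqrt(eps) * rng.gauss(0.0, 1.0)
        if x > 0.0:
            crossed += 1
    return crossed / n_chains

low_barrier = fraction_crossed(h=1.0)
high_barrier = fraction_crossed(h=8.0)
print(f"barrier 1: {low_barrier:.2f} crossed, barrier 8: {high_barrier:.2f} crossed")
```

A well-mixed chain would end in the right well about half the time. With the low barrier the chains get close to that; with the high barrier most of them never leave the well they started in.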
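Replica exchange (parallel tempering) runs multiple MCMC chains at different temperatures $T_1 < T_2 < \dots < T_K$ (implemented by scaling the energy as $E_\theta(x)/T_i$). Hot chains explore freely; cold chains refine detail near modes. Chains periodically swap states according to a Metropolis acceptance criterion: the swap between chains $i$ and $j$, currently at states $x_i$ and $x_j$, is accepted with probability $\min\left(1, \exp\left[\left(\tfrac{1}{T_i} - \tfrac{1}{T_j}\right)\left(E_\theta(x_i) - E_\theta(x_j)\right)\right]\right)$. This gives cold chains access to the global distribution without sacrificing local quality.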
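The swap rule is short enough to write out directly. This is a minimal sketch of the acceptance probability only, assuming each chain at temperature $T$ targets $\exp(-E/T)$; the function name and the example energies are made up.

```python
import math

def swap_acceptance(energy_i, energy_j, temp_i, temp_j):
    """Metropolis probability of swapping the states of chains i and j.

    Chain i holds a state with energy `energy_i` and targets exp(-E / temp_i);
    likewise for chain j.
    """
    log_ratio = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    # Avoid exponentiating large positive values: the probability is capped at 1.
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)

# Cold chain (T = 1) sits at high energy; hot chain (T = 4) has found a low-energy state.
p_helpful = swap_acceptance(energy_i=5.0, energy_j=1.0, temp_i=1.0, temp_j=4.0)
# Reverse situation: the swap would hand the cold chain a worse state.
p_harmful = swap_acceptance(energy_i=1.0, energy_j=5.0, temp_i=1.0, temp_j=4.0)
print(p_helpful, round(p_harmful, 4))
```

Swaps that move a lower-energy state to the colder chain are always accepted, while the reverse is accepted only with exponentially small probability. This is how the cold chain inherits discoveries made by the hot one.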
Energy barriers in high dimensions: in $d$-dimensional space, the effective energy barrier grows with $d$ because the volume of the saddle region between modes shrinks exponentially. Practical EBM sampling for high-dimensional images (large $d$) requires one of: short chains (biased samples but computationally tractable), noise-annealed schedules (starting from high-temperature exploration and cooling, the NCSN approach), or transition kernels specifically designed for the energy landscape (such as Hamiltonian Monte Carlo, which uses momentum to cross barriers efficiently).
Score matching
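The score function of a distribution $p(x)$ is:
$$s(x) = \nabla_x \log p(x)$$
Score matching (Hyvärinen, 2005) provides a training objective for learning the score function without computing the normalizing constant. For a model with score $s_\theta(x) = \nabla_x \log p_\theta(x)$, the score matching objective is:
$$J(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\tfrac{1}{2}\|s_\theta(x)\|^2 + \operatorname{tr}\left(\nabla_x s_\theta(x)\right)\right]$$
The second term involves the trace of the Jacobian of the score with respect to $x$, which takes $O(d)$ backward passes to compute naively for $d$-dimensional $x$. Integration by parts shows that minimizing $J(\theta)$ is equivalent to minimizing $\tfrac{1}{2}\mathbb{E}_{x \sim p_{\text{data}}}\left[\|s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2\right]$, the Fisher divergence between the model score and the true score. Crucially, the unknown data score $\nabla_x \log p_{\text{data}}(x)$ drops out of $J(\theta)$, and $Z(\theta)$ never enters because the score is a gradient in $x$, so the objective is tractable.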
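Sliced score matching reduces the Jacobian computation by projecting the score onto random directions $v$: $\mathbb{E}_{v}\,\mathbb{E}_{x \sim p_{\text{data}}}\left[v^\top \nabla_x s_\theta(x)\, v + \tfrac{1}{2}\left(v^\top s_\theta(x)\right)^2\right]$. This estimator is unbiased and needs only $O(1)$ backward passes per sample and projection.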
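A one-dimensional example shows that the score matching objective can be minimized without ever touching a normalizing constant. The sketch assumes a hypothetical linear score model $s(x) = -a\,(x - \mu)$, which is the score of a Gaussian with mean $\mu$ and precision $a$, and synthetic Gaussian data. In one dimension the Jacobian trace is just the derivative $-a$.

```python
import random

def score_matching_loss(a, mu, data):
    """Empirical J for the model score s(x) = -a * (x - mu), whose derivative is -a."""
    return sum(0.5 * (a * (x - mu)) ** 2 - a for x in data) / len(data)

rng = random.Random(0)
data = [rng.gauss(1.0, 2.0) for _ in range(5000)]   # synthetic data, true precision 1/4

# Setting dJ/dmu = 0 and dJ/da = 0 gives the sample mean and the inverse sample variance.
mu_hat = sum(data) / len(data)
var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)
a_hat = 1.0 / var_hat

best = score_matching_loss(a_hat, mu_hat, data)
print(f"mu_hat = {mu_hat:.3f}, a_hat = {a_hat:.3f}, J = {best:.4f}")
```

The minimizer is the familiar Gaussian fit, but it was obtained from the score and its derivative alone. For a neural score model the same loss would be minimized by gradient descent, and the derivative term is what becomes expensive in high dimensions.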
Denoising score matching
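Denoising score matching (Vincent, 2011) avoids the Jacobian entirely by perturbing the data and matching the score of the noisy distribution. For noise $z \sim \mathcal{N}(0, I)$, $\tilde{x} = x + \sigma z$, the optimal denoiser can be learned by:
$$J_{\text{DSM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}},\, z}\left[\tfrac{1}{2}\left\| s_\theta(\tilde{x}) + \frac{\tilde{x} - x}{\sigma^2} \right\|^2\right]$$
The target $-(\tilde{x} - x)/\sigma^2 = \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)$ is the score of the Gaussian perturbation kernel evaluated at the noised sample. This objective requires only sampling and forward passes: no Jacobians, no partition function. Minimizing DSM recovers the score function of the noisy marginal $q_\sigma(\tilde{x})$, which approximates the score of $p_{\text{data}}$ as $\sigma \to 0$.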
This is the direct mathematical ancestor of DDPM's training target. DDPM (Week 6) can be understood as DSM applied simultaneously across a sequence of noise levels, with the score network learning to denoise from any level.
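The claim that a DSM regression target recovers the score of the noisy distribution can be checked numerically in one dimension. The sketch below assumes synthetic data from $\mathcal{N}(0, 1)$ and a hypothetical one-parameter score model $s(\tilde{x}) = -a\,\tilde{x}$. For this model the DSM loss is quadratic in $a$, so least squares gives the minimizer directly.

```python
import random

rng = random.Random(0)
sigma = 0.5
n = 20000

num = 0.0
den = 0.0
for _ in range(n):
    x = rng.gauss(0.0, 1.0)                 # clean sample from the synthetic data N(0, 1)
    z = rng.gauss(0.0, 1.0)
    x_noisy = x + sigma * z
    target = -(x_noisy - x) / sigma ** 2    # score of the Gaussian perturbation kernel
    # Model score: s(x_noisy) = -a * x_noisy. Accumulate the least-squares solution for a.
    num += -x_noisy * target
    den += x_noisy ** 2

a_hat = num / den
# The noisy marginal is N(0, 1 + sigma^2), whose score is -x / (1 + sigma^2).
a_true = 1.0 / (1.0 + sigma ** 2)
print(f"a_hat = {a_hat:.3f}, a_true = {a_true:.3f}")
```

Each individual target points back at the clean sample and is very noisy, yet the regression averages them into the score of the smoothed density. With a neural network in place of the single parameter, the same averaging happens through stochastic gradient descent.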
Multi-scale score estimation and NCSN
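NCSN (Noise Conditional Score Network; Song and Ermon, 2019) trains a single score network $s_\theta(\tilde{x}, \sigma)$ conditioned on the noise level $\sigma$, estimating the score of the noisy distribution $q_\sigma(\tilde{x})$ for many noise levels simultaneously:
$$\mathcal{L}(\theta) = \frac{1}{L}\sum_{i=1}^{L} \lambda(\sigma_i)\, \mathbb{E}_{x \sim p_{\text{data}},\, z}\left[\tfrac{1}{2}\left\| s_\theta(\tilde{x}, \sigma_i) + \frac{\tilde{x} - x}{\sigma_i^2} \right\|^2\right], \qquad \tilde{x} = x + \sigma_i z$$
where $\sigma_1 > \sigma_2 > \dots > \sigma_L$ is a geometric sequence of noise levels and $\lambda(\sigma_i)$ is a weighting factor. Generation uses annealed Langevin dynamics: run SGLD at noise level $\sigma_1$ (high noise, easy exploration), then progressively reduce to $\sigma_L$ (low noise, fine detail). The high-noise steps handle global structure; low-noise steps refine fine details.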
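Annealed Langevin dynamics can be sketched without training a network by using a distribution whose noisy score is known in closed form. Everything specific here is an assumption for illustration: the data are two equally likely points at $\pm 2$, the exact mixture score stands in for a trained $s_\theta(\tilde{x}, \sigma)$, and the step size is scaled in proportion to $\sigma_i^2$.

```python
import math
import random

MODES = (-2.0, 2.0)   # hypothetical data: two equally likely points

def noisy_score(x, sigma):
    """Exact score of the data convolved with N(0, sigma^2): an equal-weight Gaussian mixture."""
    logits = [-0.5 * ((x - m) / sigma) ** 2 for m in MODES]
    top = max(logits)                                  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return sum(w * (m - x) for w, m in zip(weights, MODES)) / (total * sigma ** 2)

def annealed_langevin(sigmas, eps, steps_per_level, rng):
    """Run Langevin updates at each noise level in turn, from largest sigma to smallest."""
    x = rng.gauss(0.0, sigmas[0])                      # start from broad noise
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2        # step size scaled with the noise level
        for _ in range(steps_per_level):
            x = x + 0.5 * alpha * noisy_score(x, sigma) + math.sqrt(alpha) * rng.gauss(0.0, 1.0)
    return x

L = 10
sigmas = [3.0 * (0.1 / 3.0) ** (i / (L - 1)) for i in range(L)]   # geometric: 3.0 down to 0.1
rng = random.Random(0)
samples = [annealed_langevin(sigmas, eps=0.002, steps_per_level=30, rng=rng) for _ in range(200)]

mean_abs = sum(abs(s) for s in samples) / len(samples)
frac_positive = sum(s > 0 for s in samples) / len(samples)
print(f"mean |x| = {mean_abs:.2f}, fraction at the positive mode = {frac_positive:.2f}")
```

At the largest noise level the two modes merge into one broad bump, so chains distribute themselves across both sides before the landscape sharpens. Running only at the smallest $\sigma$ from the same initialization would leave each chain committed to whichever mode it happened to start nearest.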
NCSN demonstrated that score matching across noise scales could generate competitive image samples without adversarial training — a key empirical validation before DDPM showed that the same principle, formulated as a noising/denoising chain, could match GAN quality.
Energy-based models in imitation learning and reinforcement learning
The EBM framework has direct applications in robot learning and reinforcement learning that connect this course to Courses 1 and 2.
Energy-based imitation learning: given a set of expert demonstrations $\mathcal{D} = \{(s_i, a_i)\}$, an EBM can represent the expert policy implicitly as $\pi(a \mid s) \propto \exp(-E_\theta(s, a))$, where low-energy pairs $(s, a)$ are those favored by the expert. Training uses contrastive divergence: lower the energy of expert (state, action) pairs, raise the energy of policy-generated pairs. This avoids the mode-averaging problem of regression-based behavior cloning: the EBM places probability mass only where the expert does, rather than averaging over multiple expert modes.
GAIL (Ho and Ermon, 2016) is the adversarial version: the discriminator $D(s, a)$ distinguishes expert from policy trajectories, and the policy gradient uses a function of the discriminator output as a reward (for example $-\log(1 - D(s, a))$ when $D$ estimates the probability that a pair came from the expert). GAIL is formally equivalent to minimizing the Jensen-Shannon divergence between the occupancy measures of the expert and the learned policy, up to an entropy regularizer: the GAN objective applied in trajectory space rather than data space. This is the direct connection between Week 3's GAN theory and Course 2's imitation learning.
Reward models as EBMs: in RLHF (reinforcement learning from human feedback; Course 1, Week 12), the reward model assigns scalar rewards $r$ to language model outputs. This is an energy function with $E = -r$: lower energy (higher reward) for preferred outputs. Policy optimization via PPO (proximal policy optimization) is loosely analogous to Langevin dynamics in policy space, with the reward model gradient guiding the policy toward high-reward (low-energy) completions and the KL penalty playing the regularizing role of the noise.
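The mode-averaging point can be illustrated with a tiny discrete example. All numbers below are made up: a single state, three candidate actions, demonstrations in which the expert dodges left or right equally often, and hand-set energies standing in for a trained $E_\theta(s, a)$.

```python
import math

def boltzmann_policy(energies):
    """Normalize exp(-E) over a finite action set (stable softmax of negative energies)."""
    lowest = min(energies)
    weights = [math.exp(-(e - lowest)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

actions = [-1.0, 0.0, 1.0]                  # steer left, go straight, steer right
expert_actions = [-1.0, 1.0, -1.0, 1.0]     # made-up demonstrations at one state
energies = [0.0, 4.0, 0.0]                  # hand-set stand-in for a trained energy

policy = boltzmann_policy(energies)
# Least-squares behavior cloning predicts the mean of the demonstrated actions.
regression_action = sum(expert_actions) / len(expert_actions)

print("EBM policy:", [round(p, 4) for p in policy])
print("regression output:", regression_action)
```

The regression output is the "go straight" action that the expert never takes, while the energy-based policy keeps both demonstrated modes and assigns that averaged action almost no probability. The same normalization applies to a reward model by setting each energy to the negative reward.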
Practical implementation of EBM training
Training EBMs in practice requires careful choices:
Buffer management in persistent contrastive divergence: maintaining a persistent buffer of MCMC chains requires allocating GPU memory for many parallel chains (typically 64–512). The buffer is initialized with random noise or data samples at the start of training; chains continue evolving across batches. Periodically refreshing a fraction of the buffer with new random initializations prevents a form of mode collapse in which chains get stuck in a few low-energy regions. The resulting gradient estimate is more stable than CD's, at the cost of higher memory overhead.
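Step size and schedule for Langevin dynamics: the step size $\epsilon$ in SGLD must decrease over time to ensure convergence. Typical schedules decay polynomially, for example $\epsilon_t = \epsilon_0 / t$ or, more generally, $\epsilon_t = a(b + t)^{-\gamma}$ with $\gamma \in (0.5, 1]$. Too large a step causes divergence; too small a step causes slow mixing. A heuristic: initialize $\epsilon$ so that the size of a single update falls within a chosen target range. For high-dimensional images, SGLD with 50–100 steps produces usable samples; fewer steps leave samples biased toward their initialization, which is the data distribution when chains start from data (a failure mode when training with CD).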
Energy function architectures: the energy network is typically a convolutional ResNet outputting a scalar. For images, a common design: down-sample to a low latent resolution, apply fully connected layers, and output a scalar energy $E_\theta(x)$. A critical detail: the energy function should not be too powerful (very deep networks with many parameters) because this allows the model to achieve zero training loss on the data distribution while learning little about it, a form of overfitting. Regularizing the energy norm (via weight decay or spectral normalization) mitigates this.
Relationship to other divergences: contrastive divergence minimizes a biased approximation to KL divergence; persistent CD reduces bias but increases variance. Neither achieves the unbiased gradient of perfect MCMC. Other objectives like score matching (Week 4) sidestep these issues entirely, avoiding MCMC sampling during training — one reason score-based and diffusion models became dominant.
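The buffer bookkeeping described above can be sketched as a small class. This is an illustration, not a reference implementation: the class name, the `init_fn` callback, the 5% default refresh fraction, and the scalar chain states are all assumptions. A real implementation would hold image tensors on the GPU and replace the placeholder update with Langevin steps on the energy network.

```python
import random

class ChainBuffer:
    """Persistent store of MCMC chain states for PCD-style training (hypothetical helper)."""

    def __init__(self, size, init_fn, refresh_frac=0.05, seed=0):
        self.rng = random.Random(seed)
        self.init_fn = init_fn                # draws a fresh starting state, e.g. random noise
        self.refresh_frac = refresh_frac      # chance that a drawn chain is restarted
        self.states = [init_fn(self.rng) for _ in range(size)]

    def sample(self, batch_size):
        """Pick distinct chains; restart a random fraction of them from fresh noise."""
        idx = self.rng.sample(range(len(self.states)), batch_size)
        starts = [
            self.init_fn(self.rng) if self.rng.random() < self.refresh_frac else self.states[i]
            for i in idx
        ]
        return idx, starts

    def update(self, idx, new_states):
        """Write the evolved chain states back so the next batch continues from them."""
        for i, x in zip(idx, new_states):
            self.states[i] = x

buffer = ChainBuffer(size=128, init_fn=lambda rng: rng.uniform(-1.0, 1.0))
idx, starts = buffer.sample(32)
evolved = [x + 0.1 for x in starts]           # placeholder for k Langevin steps per chain
buffer.update(idx, evolved)
```

Sampling indices without replacement keeps one batch from evolving the same chain twice, and writing states back is what distinguishes this from CD, where every negative chain restarts from data.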
GenAI context: EBMs across the AI stack
| EBM / score-based concept | Robotics (Course 2) | RL (Course 1) | VLMs (Course 4) |
|---|---|---|---|
| Energy function | Policy energy for IRL | Reward model | CLIP similarity score |
| Score | Gradient for action refinement | Policy gradient direction | Contrastive gradient for retrieval |
| Langevin sampling | Trajectory optimization via gradient descent | MCMC policy search | Test-time compute via iterative refinement |
| SGLD noise injection | Domain randomization (noise stabilizes training) | Exploration noise in policy gradient | Visual augmentation for robustness |
| NCSN multi-scale noise | Domain randomization schedules | Entropy-regularized RL at multiple scales | Multi-resolution image understanding |
The EBM perspective reveals a unified principle running through all four courses: probability is assigned through an energy function, and both training (MLE / contrastive divergence) and inference (MCMC / Langevin) are gradient computations on that energy. Diffusion models are EBMs that have learned to estimate scores at multiple noise levels; RLHF reward models are EBMs trained on human preference data; CLIP is an EBM whose energy is the negative dot product of image and text embeddings. Recognizing these as the same mathematical object enables practitioners to transfer techniques across domains, for instance using test-time Langevin refinement (originally from EBMs) to improve diffusion model samples, or using reward-conditioned EBM sampling (from RLHF) to steer robot trajectory generation.
Key takeaways
- EBMs define probability distributions through unnormalized energy functions; the partition function makes direct likelihood computation intractable.
- Contrastive divergence approximates the MLE gradient using short MCMC chains, trading bias for computational tractability, and persistent contrastive divergence maintains long-running chains for better gradient estimates.
- Langevin dynamics generates samples from an EBM by following the negative energy gradient with injected noise, but suffers from exponential slowdown in high-dimensional multimodal landscapes; replica exchange and noise annealing are practical solutions.
- Score matching learns the score function without the partition function; denoising score matching replaces the expensive Jacobian term with a simple denoising target.
- NCSN generalizes DSM to multiple noise levels, enabling annealed Langevin sampling, the direct precursor to DDPM's reverse diffusion process.
- The conceptual insight that bridges all of these: probability densities can be represented implicitly through energy functions or scores, and both learning and inference reduce to gradient-based operations on these energy landscapes. This perspective unifies generative modeling, reinforcement learning, and inverse reinforcement learning as variants of the same underlying principle.
Conceptual questions
- The MLE gradient for an EBM is $\nabla_\theta \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)] = -\mathbb{E}_{x \sim p_{\text{data}}}[\nabla_\theta E_\theta(x)] + \mathbb{E}_{x \sim p_\theta}[\nabla_\theta E_\theta(x)]$. Show that following this gradient is equivalent to minimizing the KL divergence $D_{\text{KL}}(p_{\text{data}} \,\|\, p_\theta)$. Explain why the expectation under $p_\theta$ makes this gradient intractable in practice and what approximation contrastive divergence makes.
- Stochastic gradient Langevin dynamics converges to the target distribution as $\epsilon \to 0$ and $T \to \infty$, where $\epsilon$ is the step size and $T$ the number of steps. For finite $\epsilon$ and $T$, the chain is biased. (a) Show that with fixed $\epsilon$, the stationary distribution of SGLD is not $p_\theta$ but a biased approximation. (b) If the energy function is strongly convex with parameter $m$, bound the mixing time required for the chain to approximate $p_\theta$ within total variation $\delta$.
- Denoising score matching trains $s_\theta$ to estimate $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ for a single noise level $\sigma$. As $\sigma \to 0$, the noisy distribution $q_\sigma$ approaches the data distribution $p_{\text{data}}$, so the learned score should approximate $\nabla_x \log p_{\text{data}}(x)$ well, but in practice small $\sigma$ produces poor score estimates. Explain the geometric reason why score matching is unreliable at very small noise levels, particularly in regions of low data density.
- NCSN uses a geometric sequence of noise levels $\sigma_1 > \sigma_2 > \dots > \sigma_L$ and runs annealed Langevin dynamics from $\sigma_1$ down to $\sigma_L$. Explain why this annealing is necessary and what failure mode would occur if sampling were performed entirely at the lowest noise level without the annealing schedule. How does this relate to the multi-modal structure of typical image distributions?
- An EBM trained on natural images learns an energy function $E_\theta(x)$ with low energy on realistic images and high energy on random noise. A red-teaming researcher discovers that a small adversarial perturbation $\delta$ can move a random-noise image $x$ to a low-energy region without making $x + \delta$ look realistic to humans. What does this finding imply about the geometric structure of the EBM's energy landscape? How does this relate to adversarial examples in discriminative classifiers?
Looking ahead
Score matching and denoising score matching provide a way to learn distributions without partition functions. The next model family takes a different approach: defining distributions through exact bijections whose Jacobian determinants are computable.
Week 5: Normalizing Flows. We derive the change-of-variables formula, examine coupling layer architectures (RealNVP, Glow), analyze autoregressive flows and their tradeoff between training and inference parallelism, and introduce continuous normalizing flows via neural ODEs.
Further reading
- LeCun, Y., et al. (2006). A Tutorial on Energy-Based Learning. Predicting Structured Data.
- Hyvärinen, A. (2005). Estimation of Non-Normalized Statistical Models by Score Matching. JMLR.
- Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS. (NCSN / Score-Based Generative Modeling).