Week 4: Energy-Based Models and Score Matching

✦Learning Outcomes
  • Implement contrastive divergence and explain its approximation to the true MLE gradient
  • Derive score matching and denoising score matching objectives and explain why they avoid the partition function
  • Connect denoising score matching to DDPM training as a mathematical ancestor
◆Prerequisites
  • Week 1: Probabilistic Foundations - Score functions $\nabla_x \log p(x)$
  • Week 3: Generative Adversarial Networks - Divergence geometry concepts

Familiarity with basic Monte Carlo estimation and gradient-based optimization is assumed.

Purpose of this lecture

Energy-based models (EBMs) and score-based models are the theoretical ancestors of diffusion models. EBMs define distributions through an unnormalized density function whose normalization is intractable; learning them requires either approximating the partition function or finding a training objective that does not require it. Score matching provides exactly such an objective, and denoising score matching is the direct mathematical precursor of DDPM's training target. Understanding this lineage clarifies why diffusion models work and what their connection to physics-inspired sampling methods is.


Energy-based models

An energy-based model assigns an energy $E_\theta(x) \in \mathbb{R}$ to each configuration $x$, with lower energy corresponding to more probable configurations. The probability distribution is:

$$p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)}, \quad Z(\theta) = \int e^{-E_\theta(x)} \, dx$$

where $Z(\theta)$ is the partition function (normalizing constant). The energy function $E_\theta$ is a neural network; the Boltzmann/Gibbs form $e^{-E}$ ensures non-negativity. EBMs are flexible: the energy function can be any architecture, and the distribution can have complex multi-modal structure captured by the shape of the energy landscape.

The partition function problem: computing $Z(\theta)$ requires integrating $e^{-E_\theta(x)}$ over all of $\mathcal{X}$, which is intractable for high-dimensional $x$. This makes direct likelihood evaluation impossible, and the MLE gradient:

$$\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) - \nabla_\theta \log Z(\theta) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')]$$

requires computing $\mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')]$, an expectation under the model distribution. This expectation is the key obstacle: it requires sampling from $p_\theta$, which itself requires MCMC.
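In one dimension the partition function can still be computed by brute-force quadrature, which makes it possible to check the gradient identity above numerically. The sketch below is only an illustration under stated assumptions: the quadratic energy $E_\theta(x) = \tfrac{1}{2}\theta x^2$, the grid bounds, and all function names are hypothetical choices, not part of the lecture.

```python
import math

def energy(x, theta):
    # Hypothetical toy energy: a quadratic well, so p_theta is N(0, 1/theta).
    return 0.5 * theta * x * x

def grad_theta_energy(x, theta):
    # d/dtheta of 0.5 * theta * x^2.
    return 0.5 * x * x

def integrate(f, lo=-12.0, hi=12.0, n=20001):
    # Trapezoidal rule on a fixed grid; only feasible in low dimension.
    h = (hi - lo) / (n - 1)
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n - 1):
        total += f(lo + i * h)
    return total * h

def partition(theta):
    return integrate(lambda x: math.exp(-energy(x, theta)))

def log_prob(x, theta):
    return -energy(x, theta) - math.log(partition(theta))

def mle_grad(x, theta):
    # -grad E(x) + E_{p_theta}[grad E(x')], with the expectation done by quadrature.
    z = partition(theta)
    model_term = integrate(
        lambda xp: grad_theta_energy(xp, theta) * math.exp(-energy(xp, theta))
    ) / z
    return -grad_theta_energy(x, theta) + model_term

theta, x = 2.0, 0.7
z_numeric = partition(theta)
z_exact = math.sqrt(2.0 * math.pi / theta)  # closed form for this toy energy
h = 1e-5
fd_grad = (log_prob(x, theta + h) - log_prob(x, theta - h)) / (2.0 * h)
print(z_numeric, z_exact)
print(mle_grad(x, theta), fd_grad)
```

The grid has 20001 points for a single scalar $x$; the same resolution in $d$ dimensions would need $20001^d$ evaluations, which is the intractability the text refers to.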


Contrastive divergence

Contrastive divergence (CD) (Hinton, 2002) approximates the MLE gradient by running short MCMC chains (typically $k$ steps) initialized at training data rather than running chains to convergence. The gradient estimate is:

$$\nabla_\theta \mathcal{L}_\text{CD} \approx -\nabla_\theta E_\theta(x^+) + \nabla_\theta E_\theta(x^-)$$

where $x^+$ is a training example (positive sample) and $x^-$ is the result of running $k$ steps of Markov chain Monte Carlo from $x^+$ (negative sample). The signs follow the MLE gradient of the log-likelihood, so this estimate is an ascent direction. CD lowers the energy of data and raises the energy of near-data points generated by short chains, which is a reasonable approximation to the true gradient when the model is close to the data distribution.
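A minimal CD-$k$ sketch on a toy model may help make the update concrete. It assumes a hypothetical one-parameter energy $E_\theta(x) = \tfrac{1}{2}(x - \theta)^2$ (so $p_\theta = \mathcal{N}(\theta, 1)$) and uses Langevin steps as the MCMC kernel; the function names and hyperparameters are illustrative only.

```python
import random

def grad_x_energy(x, theta):
    # E_theta(x) = 0.5 * (x - theta)^2, a hypothetical toy energy with p_theta = N(theta, 1).
    return x - theta

def grad_theta_energy(x, theta):
    return theta - x

def short_chain(x, theta, k, eta, rng):
    # k Langevin steps started at a data point: this produces the CD negative sample.
    for _ in range(k):
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * eta * grad_x_energy(x, theta) + (eta ** 0.5) * noise
    return x

def train_cd(data, k=1, eta=0.5, lr=0.5, epochs=100, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(epochs):
        ascent = 0.0
        for x_pos in data:
            x_neg = short_chain(x_pos, theta, k, eta, rng)
            # -grad E(x+) + grad E(x-): lower energy on data, raise it on negatives.
            ascent += -grad_theta_energy(x_pos, theta) + grad_theta_energy(x_neg, theta)
        theta += lr * ascent / len(data)  # gradient ascent on the log-likelihood
    return theta

data_rng = random.Random(1)
data = [data_rng.gauss(3.0, 1.0) for _ in range(500)]
theta_hat = train_cd(data)
data_mean = sum(data) / len(data)
print(theta_hat, data_mean)
```

For this particular toy model the expected CD direction works out to the exact MLE gradient scaled by $1 - (1 - \eta/2)^k$, so short chains slow learning without moving the fixed point. In general models the short-chain approximation also changes the fixed point, which is the bias discussed above.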

Persistent contrastive divergence (PCD) maintains a persistent buffer of MCMC chains across training iterations. Chains are not reinitialized from data but continue from their previous state, sampling more broadly from $p_\theta$ as the chains have more time to mix. PCD reduces bias in the gradient estimate at the cost of requiring many parallel chain states.


Langevin dynamics and MCMC sampling

Given a trained EBM, generating samples requires MCMC. Stochastic gradient Langevin dynamics (SGLD) combines gradient-based proposals with injected noise:

$$x_{t+1} = x_t - \frac{\eta}{2} \nabla_x E_\theta(x_t) + \sqrt{\eta} \, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I)$$

With step size $\eta \to 0$ and sufficient steps, SGLD converges to exact samples from $p_\theta(x) \propto e^{-E_\theta(x)}$. The gradient $-\nabla_x E_\theta(x)$ points from high-energy to low-energy regions (downhill), while the noise $\sqrt{\eta}\,\epsilon_t$ prevents the chain from getting stuck in local minima. SGLD is the physical analog of Brownian motion in a potential energy landscape.

The gradient $-\nabla_x E_\theta(x)$ is the score function $\nabla_x \log p_\theta(x)$ under the EBM parameterization (since $\log p_\theta(x) = -E_\theta(x) - \log Z(\theta)$ and the partition function is constant in $x$). This connection between energy functions and scores is central to score matching.
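The Langevin update can be written directly in terms of the score, as in the sketch below. It assumes a hypothetical one-dimensional Gaussian target $\mathcal{N}(2, 0.5^2)$ whose score is known in closed form; the constants and function names are illustrative, and a trained model would supply `score_fn` instead.

```python
import math
import random

MU, STD = 2.0, 0.5  # hypothetical target N(MU, STD^2)

def score(x):
    # Score of the target: -dE/dx for E(x) = (x - MU)^2 / (2 * STD^2).
    return -(x - MU) / (STD * STD)

def langevin_chain(score_fn, x0, eta, n_steps, rng):
    # x_{t+1} = x_t + (eta / 2) * score(x_t) + sqrt(eta) * noise
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * eta * score_fn(x) + math.sqrt(eta) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [langevin_chain(score, 0.0, eta=0.0125, n_steps=300, rng=rng)
           for _ in range(1000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # should be close to MU and STD^2, with a small step-size bias in var
```

Each of the 1000 samples comes from an independent chain started at $x_0 = 0$. With a fixed step size the empirical variance sits slightly above $0.25$; that gap is the discretization bias and shrinks as $\eta$ is reduced.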


Multi-modal landscapes and the sampling challenge

EBMs can represent arbitrarily complex multi-modal distributions, which is a key advantage. However, Langevin dynamics struggles precisely when the energy landscape has multiple well-separated modes. If the chain is initialized near one mode, the noise $\sqrt{\eta}\,\epsilon_t$ must be large enough to escape the energy barrier to reach other modes, but large noise degrades sample quality near any single mode. This is the mixing problem: for a bimodal distribution with modes separated by $\Delta x$ and energy barrier $\Delta E$, the mixing time scales as $e^{\Delta E}$, that is, exponentially in the barrier height.

Replica exchange (parallel tempering) runs multiple MCMC chains at different temperatures $T_1 < T_2 < \cdots < T_K$ (implemented by scaling the energy as $E_\theta(x)/T_i$). Hot chains explore freely; cold chains refine detail near modes. Chains periodically swap states according to a Metropolis acceptance criterion: a swap between chains $i$ and $j$ is accepted with probability $\min(1, \exp((E(x_i) - E(x_j))(1/T_i - 1/T_j)))$. This gives cold chains access to the global distribution without sacrificing local quality.
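The swap rule can be sketched on a toy problem as follows. Everything here is an assumption made for illustration: the double-well energy, the temperature ladder, the proposal width, and the use of random-walk Metropolis (rather than Langevin) as the within-temperature kernel.

```python
import math
import random

def energy(x):
    # Hypothetical double-well energy: minima at x = -1 and x = +1, barrier of 8 at x = 0.
    return 8.0 * (x * x - 1.0) ** 2

def swap_accept_prob(e_i, e_j, t_i, t_j):
    # min(1, exp((E(x_i) - E(x_j)) * (1/T_i - 1/T_j))), evaluated without overflow.
    log_ratio = (e_i - e_j) * (1.0 / t_i - 1.0 / t_j)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)

def metropolis_step(x, temp, rng, width=0.5):
    # Random-walk Metropolis targeting exp(-E(x) / temp).
    proposal = x + rng.gauss(0.0, width)
    log_accept = (energy(x) - energy(proposal)) / temp
    if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
        return proposal
    return x

def parallel_tempering(temps, n_sweeps, seed=0):
    rng = random.Random(seed)
    states = [-1.0 for _ in temps]  # every chain starts in the left well
    cold_trace = []
    for sweep in range(n_sweeps):
        states = [metropolis_step(x, t, rng) for x, t in zip(states, temps)]
        # Alternate between even and odd neighbouring pairs of temperatures.
        for i in range(sweep % 2, len(temps) - 1, 2):
            p = swap_accept_prob(energy(states[i]), energy(states[i + 1]),
                                 temps[i], temps[i + 1])
            if rng.random() < p:
                states[i], states[i + 1] = states[i + 1], states[i]
        cold_trace.append(states[0])
    return cold_trace

cold_trace = parallel_tempering([1.0, 2.0, 4.0, 8.0], n_sweeps=5000)
right_share = sum(x > 0 for x in cold_trace) / len(cold_trace)
print(min(cold_trace), max(cold_trace), right_share)
```

A swap is always accepted when the colder chain currently holds the higher-energy state, which is how low-energy configurations found by hot chains migrate down the temperature ladder.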

Energy barriers in high dimensions: in $d$-dimensional space, the effective energy barrier grows with $d$ because the volume of the saddle region between modes shrinks exponentially. Practical EBM sampling for high-dimensional images ($d > 10^4$) requires one of: short chains (biased samples but computationally tractable), noise-annealed schedules (starting from high-temperature exploration and cooling, which is the NCSN approach), or transition kernels designed for the energy landscape (such as Hamiltonian Monte Carlo, which uses momentum to make long, directed moves).


Score matching

The score function of a distribution $p(x)$ is:

$$s(x) = \nabla_x \log p(x)$$

Score matching (Hyvärinen, 2005) provides a training objective for learning the score function without computing the normalizing constant. For a model with score $s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$, the score matching objective is:

$$\mathcal{L}_\text{SM}(\theta) = \mathbb{E}_{p_\text{data}(x)}\!\left[\tfrac{1}{2}\|s_\theta(x)\|^2 + \text{tr}(\nabla_x s_\theta(x))\right]$$

The second term involves the trace of the Jacobian of the score with respect to $x$, which is $O(d^2)$ to compute naively. Integration by parts shows that minimizing $\mathcal{L}_\text{SM}$ is equivalent, up to an additive constant that does not depend on $\theta$, to minimizing $\tfrac{1}{2}\mathbb{E}_{p_\text{data}}[\|s_\theta(x) - \nabla_x \log p_\text{data}(x)\|^2]$, the Fisher divergence between the model score and the true score. Crucially, $Z(\theta)$ never appears, because the score is a gradient with respect to $x$, so the objective is tractable.
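The integration-by-parts step can be written out as follows. This is a sketch under the usual regularity assumption that $p_\text{data}(x)\, s_\theta(x) \to 0$ as $\|x\| \to \infty$, so the boundary term vanishes; $C = \tfrac{1}{2}\mathbb{E}_{p_\text{data}}[\|\nabla_x \log p_\text{data}(x)\|^2]$ collects the term that does not depend on $\theta$.

```latex
\begin{align*}
\tfrac{1}{2}\,\mathbb{E}_{p_\text{data}}\!\left[\|s_\theta(x) - \nabla_x \log p_\text{data}(x)\|^2\right]
  &= \mathbb{E}_{p_\text{data}}\!\left[\tfrac{1}{2}\|s_\theta(x)\|^2\right]
   - \mathbb{E}_{p_\text{data}}\!\left[s_\theta(x)^\top \nabla_x \log p_\text{data}(x)\right] + C, \\
\mathbb{E}_{p_\text{data}}\!\left[s_\theta(x)^\top \nabla_x \log p_\text{data}(x)\right]
  &= \int s_\theta(x)^\top \nabla_x p_\text{data}(x)\, dx
   = -\int p_\text{data}(x)\, \text{tr}(\nabla_x s_\theta(x))\, dx .
\end{align*}
```

Substituting the second line into the first gives $\mathbb{E}_{p_\text{data}}[\tfrac{1}{2}\|s_\theta(x)\|^2 + \text{tr}(\nabla_x s_\theta(x))] + C = \mathcal{L}_\text{SM}(\theta) + C$, in which the unknown data score no longer appears.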

Sliced score matching reduces the $O(d^2)$ Jacobian computation by projecting the score onto random directions: $\mathcal{L}_\text{SSM} = \mathbb{E}_{v \sim p_v}\mathbb{E}_{p_\text{data}}[v^\top \nabla_x(v^\top s_\theta(x)) + \frac{1}{2}(v^\top s_\theta(x))^2]$. This is unbiased and $O(d)$ per sample.
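The following sketch checks the slicing identity on a small example. It assumes a hypothetical linear score $s(x) = -Px$ with a fixed $3 \times 3$ symmetric matrix $P$, Rademacher directions for $p_v$, and a finite difference in place of automatic differentiation; all names are illustrative.

```python
import itertools

# Hypothetical symmetric precision matrix; the model score is s(x) = -P x.
P = [[2.0, 0.5, 0.0],
     [0.5, 1.0, -0.3],
     [0.0, -0.3, 3.0]]
DIM = len(P)

def score(x):
    return [-sum(P[i][j] * x[j] for j in range(DIM)) for i in range(DIM)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def directional_term(score_fn, x, v, h=1e-4):
    # v^T grad_x (v^T s(x)), as a central difference of v^T s along direction v.
    # Only two score evaluations are needed, independent of the dimension.
    x_plus = [xi + h * vi for xi, vi in zip(x, v)]
    x_minus = [xi - h * vi for xi, vi in zip(x, v)]
    return (dot(v, score_fn(x_plus)) - dot(v, score_fn(x_minus))) / (2.0 * h)

def sliced_loss_term(score_fn, x, v):
    # Sliced score matching integrand for one data point and one direction.
    return directional_term(score_fn, x, v) + 0.5 * dot(v, score_fn(x)) ** 2

x = [0.3, -1.2, 0.8]
# Averaging over all 2^d sign vectors gives the exact expectation over Rademacher v.
directions = list(itertools.product([-1.0, 1.0], repeat=DIM))
trace_estimate = sum(directional_term(score, x, v) for v in directions) / len(directions)
exact_trace = -sum(P[i][i] for i in range(DIM))  # tr(grad_x s) = -tr(P)
print(trace_estimate, exact_trace)
```

Enumerating all sign vectors is only possible because $d = 3$ here; in practice one or a few random directions are drawn per data point, which keeps the estimate unbiased but noisy.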


Denoising score matching

Denoising score matching (Vincent, 2011) avoids the Jacobian entirely by perturbing the data and matching the score of the noisy distribution. For a noisy sample $\tilde{x} = x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, the score of the noisy marginal, $s_\theta(\tilde{x}) = \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$, can be learned by minimizing:

$$\mathcal{L}_\text{DSM}(\theta) = \mathbb{E}_{x \sim p_\text{data},\, \epsilon \sim \mathcal{N}(0, \sigma^2 I)}\!\left[\left\|s_\theta(x + \epsilon) - \left(-\frac{\epsilon}{\sigma^2}\right)\right\|^2\right]$$

The target $-\epsilon/\sigma^2$ is the score of the Gaussian perturbation kernel $p_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$ evaluated at the noised sample, since $\nabla_{\tilde{x}} \log p_\sigma(\tilde{x} \mid x) = -(\tilde{x} - x)/\sigma^2$. This objective requires only sampling and forward passes: no Jacobians, no partition function. Minimizing DSM recovers the score function of the noisy marginal $p_\sigma$, which approximates the score of $p_\text{data}$ as $\sigma \to 0$.

This is the direct mathematical ancestor of DDPM's training target. DDPM (Week 6) can be understood as DSM applied simultaneously across a sequence of noise levels, with the score network learning to denoise from any level.
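The single-noise-level DSM objective can be checked on a case where the answer is known. The sketch below assumes hypothetical data $x \sim \mathcal{N}(0, 1)$ and a linear score model $s_w(\tilde{x}) = w\tilde{x}$, and it solves the least-squares problem in closed form instead of running SGD; since $p_\sigma = \mathcal{N}(0, 1 + \sigma^2)$, the true score slope is $-1/(1 + \sigma^2)$.

```python
import random

def fit_linear_score(sigma, n_samples=50000, seed=0):
    # Minimizes sum (w * x_noisy - target)^2 over w, with target = -eps / sigma^2.
    rng = random.Random(seed)
    num = 0.0
    den = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)      # clean sample from the toy data distribution
        eps = rng.gauss(0.0, sigma)  # Gaussian perturbation with standard deviation sigma
        x_noisy = x + eps
        target = -eps / sigma ** 2   # DSM regression target
        num += x_noisy * target
        den += x_noisy * x_noisy
    return num / den                 # least-squares solution for the slope w

sigma = 0.5
w_hat = fit_linear_score(sigma)
w_true = -1.0 / (1.0 + sigma ** 2)   # slope of the score of N(0, 1 + sigma^2)
print(w_hat, w_true)
```

The regression never sees the score of $p_\sigma$ directly, only the per-sample targets $-\epsilon/\sigma^2$, yet the fitted slope approaches the true marginal score. The targets have variance of order $1/\sigma^2$, so the estimate gets noisier as $\sigma$ shrinks.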


Multi-scale score estimation and NCSN

NCSN (Noise Conditional Score Network; Song and Ermon, 2019) trains a single score network $s_\theta(x, \sigma)$ conditioned on the noise level $\sigma$, estimating the score of $p_\sigma(x)$ for many noise levels simultaneously:

$$\mathcal{L}_\text{NCSN}(\theta) = \sum_{i=1}^L \lambda(\sigma_i)\, \mathbb{E}_{x, \tilde{x}}\!\left[\left\|s_\theta(\tilde{x}, \sigma_i) + \frac{\tilde{x} - x}{\sigma_i^2}\right\|^2\right]$$

where $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ is a geometric sequence of noise levels and $\lambda(\sigma_i) = \sigma_i^2$ is a weighting factor. Generation uses annealed Langevin dynamics: run SGLD at noise level $\sigma_L$ (high noise, easy exploration), then progressively reduce to $\sigma_1$ (low noise, fine detail). The high-noise steps handle global structure; low-noise steps refine fine details.
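Annealed Langevin sampling can be sketched on a toy distribution where the noisy scores are available in closed form. The example below assumes a hypothetical two-mode data distribution (narrow Gaussians at $\pm 4$), uses the exact score of $p_\sigma$ as a stand-in for a trained $s_\theta(x, \sigma)$, and scales the step size as $\sigma_i^2 / \sigma_1^2$; the schedule and constants are illustrative choices.

```python
import math
import random

MODES = [-4.0, 4.0]  # hypothetical two-mode data distribution, equal weights
DATA_STD = 0.2       # width of each mode

def noisy_score(x, sigma):
    # Exact score of p_sigma: the mixture convolved with N(0, sigma^2).
    var = DATA_STD ** 2 + sigma ** 2
    logits = [-(x - m) ** 2 / (2.0 * var) for m in MODES]
    top = max(logits)  # subtract the max so exp() cannot overflow
    weights = [math.exp(l - top) for l in logits]
    return sum(w * (m - x) / var for w, m in zip(weights, MODES)) / sum(weights)

def geometric_sigmas(sigma_min, sigma_max, n_levels):
    ratio = (sigma_max / sigma_min) ** (1.0 / (n_levels - 1))
    return [sigma_min * ratio ** i for i in range(n_levels)]

def annealed_langevin(x, sigmas, steps_per_level, base_step, rng):
    # sigmas is increasing (sigma_1 < ... < sigma_L); sampling runs from sigma_L down to sigma_1.
    for sigma in reversed(sigmas):
        step = base_step * (sigma / sigmas[0]) ** 2  # larger steps at higher noise
        for _ in range(steps_per_level):
            x = x + 0.5 * step * noisy_score(x, sigma) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
sigmas = geometric_sigmas(0.1, 10.0, 10)
samples = [annealed_langevin(rng.uniform(-10.0, 10.0), sigmas, 50, 0.005, rng)
           for _ in range(400)]
right_fraction = sum(s > 0 for s in samples) / len(samples)
print(right_fraction)  # both modes should be populated roughly equally
```

At $\sigma_L = 10$ the two modes are blurred into a single broad bump, so chains move freely between the two sides; which mode a chain ends in is decided while the noise is still large, and the low-noise levels only sharpen the sample around that mode.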

NCSN demonstrated that score matching across noise scales could generate competitive image samples without adversarial training — a key empirical validation before DDPM showed that the same principle, formulated as a noising/denoising chain, could match GAN quality.


Energy-based models in imitation learning and RL

The EBM framework has direct applications in robot learning and reinforcement learning that connect this course to Courses 1 and 2.

Energy-based imitation learning: given a set of expert demonstrations $\{x^{(n)}\}$, an EBM can represent the expert policy implicitly as $\pi_\text{expert}(a \mid s) \propto e^{-E_\theta(s, a)}$, where low-energy $(s, a)$ pairs are those favored by the expert. Training uses contrastive divergence: lower the energy of expert (state, action) pairs, raise the energy of policy-generated pairs. This avoids the mode-averaging problem of regression-based behavior cloning: the EBM places probability mass only where the expert does, rather than averaging over multiple expert modes.

GAIL (Ho and Ermon, 2016) is the adversarial version: the discriminator $D(s, a)$ distinguishes expert from policy trajectories, and the policy gradient uses $\log D(s, a)$ as a reward. With an optimal discriminator, GAIL minimizes the Jensen-Shannon divergence between the occupancy measures of the expert and the learned policy (up to a causal-entropy regularizer), which is the GAN objective applied in trajectory space rather than data space. This is the direct connection between Week 3's GAN theory and Course 2's imitation learning.

Reward models as EBMs: in RLHF (Course 1, Week 12), the reward model $r_\phi(x)$ assigns scalar rewards to language model outputs. This is precisely an energy function with $E_\phi(x) = -r_\phi(x)$: lower energy (higher reward) for preferred outputs. Policy optimization via PPO can be loosely viewed as a form of Langevin dynamics in policy space, with the reward model gradient guiding the policy toward high-reward (low-energy) completions and the KL penalty playing the regularizing role of the noise term.


Practical implementation of EBM training

Training EBMs in practice requires careful choices:

Buffer management in persistent contrastive divergence: maintaining a persistent buffer of MCMC chains requires allocating GPU memory for many parallel chains (typically 64–512). The buffer is initialized with random noise or data samples at the start of training; chains continue evolving across batches. Periodically refreshing a fraction of the buffer with new random initializations keeps the chains from collapsing onto a few modes that they cannot leave. The resulting gradient estimate is more stable than CD's, at the cost of higher memory overhead.
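The buffer logic described above might be organized as in the following sketch. The class name `ChainBuffer`, its methods, and the refresh probability are hypothetical; the toy energy $E(x) = \tfrac{1}{2}x^2$ stands in for a learned energy network, and states are scalars rather than GPU tensors.

```python
import random

class ChainBuffer:
    """Persistent store of MCMC chain states with random re-initialization."""

    def __init__(self, size, init_fn, refresh_prob=0.05, seed=0):
        self.rng = random.Random(seed)
        self.init_fn = init_fn            # draws a fresh starting state
        self.refresh_prob = refresh_prob  # chance that a drawn chain restarts
        self.states = [init_fn(self.rng) for _ in range(size)]

    def sample(self, batch_size):
        # Pick distinct chains; each one restarts from init_fn with probability refresh_prob.
        idx = self.rng.sample(range(len(self.states)), batch_size)
        starts = [
            self.init_fn(self.rng) if self.rng.random() < self.refresh_prob else self.states[i]
            for i in idx
        ]
        return idx, starts

    def update(self, idx, new_states):
        # Write the evolved chain states back so the next batch continues from them.
        for i, x in zip(idx, new_states):
            self.states[i] = x

def langevin_step(x, eta, rng):
    # One Langevin step on the toy energy E(x) = 0.5 * x^2 (gradient is x).
    return x - 0.5 * eta * x + eta ** 0.5 * rng.gauss(0.0, 1.0)

buffer = ChainBuffer(size=64, init_fn=lambda rng: rng.uniform(-1.0, 1.0))
for _ in range(200):
    idx, starts = buffer.sample(16)
    negatives = [langevin_step(x, 0.1, buffer.rng) for x in starts]
    # A training loop would use `negatives` as x^- in the CD gradient here.
    buffer.update(idx, negatives)
print(len(buffer.states))
```

Setting `refresh_prob` to zero gives plain PCD, while setting it to one discards all persistence and restarts every chain from the initializer.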

Step size and schedule for Langevin dynamics: the step size $\eta$ in SGLD must decrease over time to ensure convergence: typical schedules use $\eta_t = \eta_0 / \sqrt{t}$ or $\eta_t = \eta_0 / t^{1/3}$. Too large $\eta$ causes divergence; too small $\eta$ causes slow mixing. A heuristic: initialize $\eta$ such that $|\nabla_x E_\theta| \cdot \eta$ is $O(0.01)$ to $O(0.1)$ in magnitude. For high-dimensional images, SGLD with 50–100 steps produces usable samples; with fewer steps the samples stay close to the chain's initialization, which under CD is the data itself (a known failure mode of CD training).
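The schedules and the initialization heuristic above translate into a couple of one-line helpers. The function names and the default target of 0.05 (a value inside the quoted 0.01 to 0.1 range) are assumptions for illustration.

```python
def initial_step_size(grad_norm, target=0.05):
    # Choose eta_0 so that |grad_x E| * eta_0 is about `target`.
    # The small floor avoids division by zero for a vanishing gradient.
    return target / max(grad_norm, 1e-12)

def step_size(eta0, t, power=0.5):
    # eta_t = eta0 / t**power for t = 1, 2, ...; power = 1/2 or 1/3 as in the text.
    return eta0 / (t ** power)

eta0 = initial_step_size(grad_norm=10.0)
schedule = [step_size(eta0, t) for t in range(1, 6)]
print(eta0, schedule)
```

In practice `grad_norm` would be measured on a batch of initial samples before sampling starts, for example as the mean gradient norm of the energy network.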

Energy function architectures: the energy network $E_\theta(x)$ is typically a convolutional ResNet outputting a scalar. For images, a common design: down-sample to latent resolution ($8 \times 8$), apply fully connected layers, output $E_\theta(x) \in \mathbb{R}$. A critical detail: the energy function should not be too powerful (very deep networks with many parameters) because this allows the model to achieve zero training loss on the data distribution while learning little about it, which is a form of overfitting. Regularizing the energy norm (via weight decay or spectral normalization) mitigates this.

Relationship to other divergences: contrastive divergence minimizes a biased approximation to KL divergence; persistent CD reduces bias but increases variance. Neither achieves the unbiased gradient of perfect MCMC. Other objectives like score matching (covered earlier in this lecture) sidestep these issues entirely by avoiding MCMC sampling during training, which is one reason score-based and diffusion models became dominant.


GenAI context: EBMs across the AI stack

| EBM / score-based concept | Robotics (Course 2) | RL (Course 1) | VLMs (Course 4) |
|---|---|---|---|
| Energy function $E_\theta(x)$ | Policy energy for IRL | Reward model $r_\phi(x)$ | CLIP similarity score |
| Score $\nabla_x \log p$ | Gradient for action refinement | Policy gradient direction | Contrastive gradient for retrieval |
| Langevin sampling | Trajectory optimization via gradient descent | MCMC policy search | Test-time compute via iterative refinement |
| SGLD noise injection | Domain randomization (noise stabilizes training) | Exploration noise in policy gradient | Visual augmentation for robustness |
| NCSN multi-scale noise | Domain randomization schedules | Entropy-regularized RL at multiple scales | Multi-resolution image understanding |

The EBM perspective reveals a unified principle running through all four courses: probability is assigned through an energy function, and both training (MLE / contrastive divergence) and inference (MCMC / Langevin) are gradient computations on that energy. Diffusion models are EBMs that have learned to estimate scores at multiple noise levels; RLHF reward models are EBMs trained on human preference data; CLIP is an EBM whose energy is the negative dot product of image and text embeddings. Recognizing these as the same mathematical object enables practitioners to transfer techniques across domains: for instance, using test-time Langevin refinement (originally from EBMs) to improve diffusion model samples, or using reward-conditioned EBM sampling (from RLHF) to steer robot trajectory generation.


Key takeaways

  • EBMs define probability distributions through unnormalized energy functions; the partition function makes direct likelihood computation intractable.
  • Contrastive divergence approximates the MLE gradient using short MCMC chains, trading bias for computational tractability, and persistent contrastive divergence maintains long-running chains for better gradient estimates.
  • Langevin dynamics generates samples from an EBM by following the negative energy gradient with injected noise, but suffers from exponential slowdown in high-dimensional multimodal landscapes; replica exchange and noise annealing are practical remedies.
  • Score matching learns the score function $\nabla_x \log p(x)$ without the partition function; denoising score matching replaces the expensive Jacobian trace with a simple denoising target.
  • NCSN generalizes DSM to multiple noise levels, enabling annealed Langevin sampling, the direct precursor to DDPM's reverse diffusion process.
  • The conceptual insight that bridges all of these: probability densities can be represented implicitly through energy functions or scores, and both learning and inference reduce to gradient-based operations on these energy landscapes. This perspective unifies generative modeling, reinforcement learning, and inverse reinforcement learning as variants of the same underlying principle.


Conceptual questions

  1. The MLE gradient for an EBM is $\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')]$. Show that following this gradient, averaged over $p_\text{data}$, is equivalent to minimizing the KL divergence $D_\text{KL}(p_\text{data} \| p_\theta)$. Explain why the expectation under $p_\theta$ makes this gradient intractable in practice and what approximation contrastive divergence makes.

  2. Stochastic gradient Langevin dynamics converges to the target distribution $p_\theta$ as $\eta \to 0$ and $T \to \infty$. For finite $\eta$ and $T$, the chain is biased. (a) Show that with fixed $\eta > 0$, the stationary distribution of SGLD is not $p_\theta$ but a biased approximation. (b) If the energy function $E_\theta(x)$ is strongly convex with parameter $m$, bound the mixing time required for the chain to approximate $p_\theta$ within total variation $\epsilon$.

  3. Denoising score matching trains $s_\theta(\tilde{x})$ to estimate $\nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$ for a single noise level $\sigma$. As $\sigma \to 0$, the noisy distribution $p_\sigma$ approaches the data distribution $p_\text{data}$, so the learned score should approximate $\nabla_x \log p_\text{data}(x)$ well, but in practice small $\sigma$ produces poor score estimates. Explain the geometric reason why score matching is unreliable at very small noise levels, particularly in regions of low data density.

  4. NCSN uses a geometric sequence of noise levels $\sigma_1 < \cdots < \sigma_L$ and runs annealed Langevin dynamics from $\sigma_L$ down to $\sigma_1$. Explain why this annealing is necessary and what failure mode would occur if sampling were performed entirely at the lowest noise level $\sigma_1$ without the annealing schedule. How does this relate to the multi-modal structure of typical image distributions?

  5. An EBM trained on natural images learns an energy function $E_\theta(x)$ with low energy on realistic images and high energy on random noise. A red-teaming researcher discovers that a small adversarial perturbation $\delta$ can move a random-noise image $x_\text{noise}$ to a low-energy region without making $x_\text{noise} + \delta$ look realistic to humans. What does this finding imply about the geometric structure of the EBM's energy landscape? How does this relate to adversarial examples in discriminative classifiers?

✦Solutions
  1. The empirical loss is $-\mathbb{E}_{p_\text{data}}[\log p_\theta]$, whose gradient is $\mathbb{E}_{p_\text{data}}[\nabla_\theta E] - \mathbb{E}_{p_\theta}[\nabla_\theta E]$, the negative of the data-averaged MLE gradient. This is exactly $\nabla_\theta D_\text{KL}(p_\text{data}\|p_\theta)$, because the data entropy is $\theta$-independent. The second term is an expectation under the model, requiring samples from $p_\theta$ via MCMC. CD approximates those samples with $k$ short MCMC steps started at the data, biasing the negative phase toward near-data points.
  2. (a) SGLD with fixed $\eta$ is the Euler–Maruyama discretization of the Langevin SDE; because of the discretization error, its stationary distribution is the invariant distribution of the discrete chain rather than of the SDE, and it matches $p_\theta$ only up to $O(\eta)$ bias. (b) If $E_\theta$ is $m$-strongly convex, $p_\theta$ is $m$-strongly log-concave and the dynamics contract geometrically: mixing to TV $\epsilon$ takes $O(\tfrac{1}{m}\log\tfrac{1}{\epsilon})$ units of Langevin time, which for step size $\eta$ is on the order of $\tfrac{1}{m\eta}\log\tfrac{1}{\epsilon}$ iterations (up to dimension/condition-number factors).
  3. At small $\sigma$ the noisy distribution concentrates on a thin shell around the data manifold, so low-density / off-manifold regions are essentially never sampled: the DSM target $\nabla\log p_\sigma$ is never observed there, and the score is unconstrained and inaccurate. Since Langevin starts off-manifold, these bad estimates dominate. Geometrically: the score must point toward a low-dimensional manifold, but receives near-zero training signal away from it when $\sigma$ is tiny.
  4. Annealing is needed because at low $\sigma$ the score is accurate only near data and Langevin cannot cross the empty, high-barrier regions between well-separated modes. Starting at high $\sigma$ blurs the modes into one connected distribution the chain can traverse, then cooling refines detail. Sampling only at $\sigma_1$ traps the chain near its initialization, producing wrong mode proportions (mode imbalance). This matters because natural-image distributions are highly multimodal (many disconnected class manifolds).
  5. The energy is trained to be low on data and high only on the specific negatives seen during CD; everywhere else it is unconstrained, so spurious low-energy basins exist far from the data manifold. The adversarial $\delta$ simply walks into one. This is the same phenomenon as adversarial examples in classifiers: the model is well-behaved only near training data and extrapolates arbitrarily off-manifold, where small input changes cross into unintended low-energy / high-confidence regions.

Looking ahead

Score matching and denoising score matching provide a way to learn distributions without partition functions. The next model family takes a different approach: defining distributions through exact bijections whose Jacobian determinants are computable.

Week 5: Normalizing Flows. We derive the change-of-variables formula, examine coupling layer architectures (RealNVP, Glow), analyze autoregressive flows and their tradeoff between training and inference parallelism, and introduce continuous normalizing flows via neural ODEs.


Further reading

  • LeCun, Y., et al. (2006). A Tutorial on Energy-Based Learning. Predicting Structured Data.
  • Hyvärinen, A. (2005). Estimation of Non-Normalized Statistical Models by Score Matching. JMLR.
  • Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS. (NCSN / Score-Based Generative Modeling).