Week 4: Energy-Based Models and Score Matching

Purpose of this lecture#

Energy-based models (EBMs) and score-based models are the theoretical ancestors of diffusion models. EBMs define distributions through an unnormalized density function whose normalization is intractable; learning them requires either approximating the partition function or finding a training objective that does not require it. Score matching provides exactly such an objective, and denoising score matching is the direct mathematical precursor of DDPM's training target. Understanding this lineage clarifies why diffusion models work and what their connection to physics-inspired sampling methods is.

Energy-based models#

An energy-based model assigns an energy $E_\theta(x) \in \mathbb{R}$ to each configuration $x$ , with lower energy corresponding to more probable configurations. The probability distribution is:

p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)}, \quad Z(\theta) = \int e^{-E_\theta(x)} \, dx

where $Z(\theta)$ is the partition function (normalizing constant). The energy function $E_\theta$ is a neural network; the Boltzmann/Gibbs form $e^{-E}$ ensures non-negativity. EBMs are flexible: the energy function can be any architecture, and the distribution can have complex multi-modal structure captured by the shape of the energy landscape.

The partition function problem: computing $Z(\theta)$ requires integrating $e^{-E_\theta(x)}$ over all of $\mathcal{X}$ , which is intractable for high-dimensional $x$ . This makes direct likelihood evaluation impossible, and the MLE gradient:

\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) - \nabla_\theta \log Z(\theta) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')]

requires computing $\mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')]$ — an expectation under the model distribution. This expectation is the key obstacle: it requires sampling from $p_\theta$ , which itself requires MCMC.
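
The identity $-\nabla_\theta \log Z(\theta) = \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')]$ can be checked numerically in one dimension, where $Z(\theta)$ is still computable on a grid. The sketch below assumes a toy quadratic energy $E_\theta(x) = \theta x^2/2$ (chosen here for illustration, not taken from the lecture), for which both sides equal $1/(2\theta)$ in closed form. The function names are hypothetical.

```python
import math

def energy(x, theta):
    # Toy quadratic energy E_theta(x) = theta * x^2 / 2 (an assumption for this demo).
    return 0.5 * theta * x * x

def grad_theta_energy(x, theta):
    # dE/dtheta = x^2 / 2 for this energy.
    return 0.5 * x * x

def make_grid(lo=-10.0, hi=10.0, n=20001):
    dx = (hi - lo) / (n - 1)
    return [lo + i * dx for i in range(n)], dx

def log_partition(theta):
    # Riemann-sum approximation of log Z(theta); only feasible in low dimension.
    xs, dx = make_grid()
    return math.log(sum(math.exp(-energy(x, theta)) for x in xs) * dx)

def model_expectation(theta):
    # E_{p_theta}[dE/dtheta]: the model-side term of the MLE gradient.
    xs, _ = make_grid()
    w = [math.exp(-energy(x, theta)) for x in xs]
    return sum(wi * grad_theta_energy(x, theta) for wi, x in zip(w, xs)) / sum(w)

theta = 2.0
h = 1e-4
# Central finite difference of log Z with respect to theta.
dlogZ = (log_partition(theta + h) - log_partition(theta - h)) / (2 * h)
neg_phase = model_expectation(theta)
print(dlogZ, neg_phase)  # the two values should be negatives of each other
```

The grid has 20001 points for a single dimension; the same resolution in $d$ dimensions would need $20001^d$ points, which is why the expectation has to be estimated by sampling instead.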

Contrastive divergence#

Contrastive divergence (CD) (Hinton, 2002) approximates the MLE gradient by running short MCMC chains (typically $k$ steps) initialized at training data rather than running chains to convergence. The gradient estimate is:

\nabla_\theta \mathcal{L}_\text{CD} \approx -\nabla_\theta E_\theta(x^+) + \nabla_\theta E_\theta(x^-)

where $x^+$ is a training example (positive sample) and $x^-$ is the result of running $k$ steps of Markov chain Monte Carlo from $x^+$ (negative sample). Here $\mathcal{L}_\text{CD}$ is a surrogate for the log-likelihood, so the parameters follow this gradient uphill: the update lowers the energy of data and raises the energy of the near-data points generated by short chains. This is a reasonable approximation to the true gradient when the model is close to the data distribution.
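
A minimal sketch of CD-$k$ on a one-parameter toy model may make the update concrete. It assumes the energy $E_\theta(x) = (x - \theta)^2/2$, Langevin steps as the MCMC kernel, and hypothetical hyperparameters (`k`, `eta`, `lr`), none of which are prescribed by the lecture. For this energy the per-sample update reduces to $x^+ - x^-$, so $\theta$ should drift toward the data mean.

```python
import random

def grad_x_energy(x, theta):
    # Toy energy E_theta(x) = (x - theta)^2 / 2, so dE/dx = x - theta.
    return x - theta

def grad_theta_energy(x, theta):
    # dE/dtheta = theta - x for the same toy energy.
    return theta - x

def short_chain(x, theta, k, eta, rng):
    # k Langevin steps started at a data point: the CD negative sample.
    for _ in range(k):
        x = x - 0.5 * eta * grad_x_energy(x, theta) + (eta ** 0.5) * rng.gauss(0.0, 1.0)
    return x

def train_cd(data, k=5, eta=0.1, lr=0.2, steps=300, batch=64, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        xs_pos = [rng.choice(data) for _ in range(batch)]
        xs_neg = [short_chain(x, theta, k, eta, rng) for x in xs_pos]
        # CD estimate of the log-likelihood gradient: -dE(x+) + dE(x-).
        g = sum(-grad_theta_energy(xp, theta) + grad_theta_energy(xn, theta)
                for xp, xn in zip(xs_pos, xs_neg)) / batch
        theta += lr * g  # gradient ascent on the surrogate log-likelihood
    return theta

data_rng = random.Random(1)
data = [data_rng.gauss(3.0, 1.0) for _ in range(500)]
data_mean = sum(data) / len(data)
theta_hat = train_cd(data)
print(theta_hat, data_mean)  # theta_hat should land near the data mean
```

Because the negative samples start at the data and take only a few steps, the gradient signal is a shrunken version of the exact one. In this toy case that only slows convergence, but in general it biases the estimate.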

Persistent contrastive divergence (PCD) maintains a persistent buffer of MCMC chains across training iterations. Chains are not reinitialized from data but continue from their previous state, so they sample more broadly from $p_\theta$ as they have more time to mix. PCD reduces bias in the gradient estimate at the cost of storing many parallel chain states.

Langevin dynamics and MCMC sampling#

Given a trained EBM, generating samples requires MCMC. Stochastic gradient Langevin dynamics (SGLD) combines gradient-based proposals with injected noise:

x_{t+1} = x_t - \frac{\eta}{2} \nabla_x E_\theta(x_t) + \sqrt{\eta} \, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I)

With step size $\eta \to 0$ and sufficient steps, SGLD converges to exact samples from $p_\theta(x) \propto e^{-E_\theta(x)}$ . The gradient $-\nabla_x E_\theta(x)$ points from high-energy to low-energy regions (downhill), while the noise $\sqrt{\eta}\,\epsilon_t$ prevents the chain from getting stuck in local minima. SGLD is the physical analog of Brownian motion in a potential energy landscape.
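
The update rule can be sketched in a few lines. The example assumes a one-dimensional quadratic energy $E(x) = x^2/2$ (so the target is a standard normal) and a hypothetical helper `langevin_chain`. For this target the fixed-step chain has stationary variance $1/(1 - \eta/4)$, slightly above 1, which illustrates the bias that vanishes only as $\eta \to 0$.

```python
import random

def langevin_chain(grad_energy, x0, eta, n_steps, rng):
    # Iterates x <- x - (eta/2) * dE/dx + sqrt(eta) * noise.
    x = x0
    for _ in range(n_steps):
        x = x - 0.5 * eta * grad_energy(x) + (eta ** 0.5) * rng.gauss(0.0, 1.0)
    return x

def grad_energy(x):
    # Toy energy E(x) = x^2 / 2, whose Boltzmann distribution is N(0, 1).
    return x

rng = random.Random(0)
eta = 0.1
# Independent chains started from a broad uniform initialization.
samples = [langevin_chain(grad_energy, rng.uniform(-5.0, 5.0), eta, 200, rng)
           for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # mean near 0, variance near 1 / (1 - eta / 4)
```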

The gradient $-\nabla_x E_\theta(x)$ is the score function $\nabla_x \log p_\theta(x)$ under the EBM parameterization (since $\log p_\theta(x) = -E_\theta(x) - \log Z(\theta)$ and the partition function is constant in $x$ ). This connection between energy functions and scores is central to score matching.

Multi-modal landscapes and the sampling challenge#

EBMs can represent arbitrarily complex multi-modal distributions — a key advantage. However, Langevin dynamics struggles precisely when the energy landscape has multiple well-separated modes. If the chain is initialized near one mode, the noise $\sqrt{\eta}\epsilon_t$ must be large enough to escape the energy barrier to reach other modes, but large noise degrades sample quality near any single mode. This is the mixing problem: for a bimodal distribution with modes separated by $\Delta x$ and energy barrier $\Delta E$ , the mixing time scales as $e^{\Delta E}$ — exponentially in the barrier height.

Replica exchange (parallel tempering) runs multiple MCMC chains at different temperatures $T_1 < T_2 < \cdots < T_K$ (implemented by scaling the energy as $E_\theta(x)/T_i$ ). Hot chains explore freely; cold chains refine detail near modes. Chains periodically swap states according to a Metropolis acceptance criterion: the swap from chain $i$ to $j$ is accepted with probability $\min(1, \exp((E(x_i) - E(x_j))(1/T_i - 1/T_j)))$ . This gives cold chains access to the global distribution without sacrificing local quality.
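
The swap rule is short enough to write out directly. The sketch below uses hypothetical helper names (`swap_acceptance`, `maybe_swap`) and assumes the energies passed in are the untempered values $E(x_i)$ and $E(x_j)$.

```python
import math
import random

def swap_acceptance(e_i, e_j, t_i, t_j):
    # min(1, exp((E_i - E_j) * (1/T_i - 1/T_j))), computed without overflow.
    log_ratio = (e_i - e_j) * (1.0 / t_i - 1.0 / t_j)
    return math.exp(min(0.0, log_ratio))

def maybe_swap(states, energies, temps, i, j, rng):
    # Exchange the states of chains i and j in place with the Metropolis probability.
    if rng.random() < swap_acceptance(energies[i], energies[j], temps[i], temps[j]):
        states[i], states[j] = states[j], states[i]
        energies[i], energies[j] = energies[j], energies[i]
        return True
    return False

# Cold chain (T=1) holds a high-energy state, hot chain (T=2) a low-energy one:
# handing the low-energy state to the cold chain is always accepted.
p_up = swap_acceptance(e_i=5.0, e_j=1.0, t_i=1.0, t_j=2.0)
# The reverse exchange is accepted only with probability exp(-2).
p_down = swap_acceptance(e_i=1.0, e_j=5.0, t_i=1.0, t_j=2.0)

states, energies, temps = ["a", "b"], [5.0, 1.0], [1.0, 2.0]
swapped = maybe_swap(states, energies, temps, 0, 1, random.Random(0))
print(p_up, p_down, swapped, states)
```

Taking the minimum in log space before exponentiating keeps the function safe when the energy gap is large.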

Energy barriers in high dimensions: in $d$ -dimensional space, the effective energy barrier grows with $d$ because the volume of the saddle region between modes shrinks exponentially. Practical EBM sampling for high-dimensional images ( $d > 10^4$ ) requires either: short chains (biased samples but computationally tractable), noise-annealed schedules (starting from high-temperature exploration and cooling — the NCSN approach), or transition kernels specifically designed for the energy landscape (Hamiltonian Monte Carlo, which uses momentum to cross barriers efficiently).

Score matching#

The score function of a distribution $p(x)$ is:

s(x) = \nabla_x \log p(x)

Score matching (Hyvärinen, 2005) provides a training objective for learning the score function without computing the normalizing constant. For a model with score $s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$ , the score matching objective is:

\mathcal{L}_\text{SM}(\theta) = \mathbb{E}_{p_\text{data}(x)}\!\left[\tfrac{1}{2}\|s_\theta(x)\|^2 + \text{tr}(\nabla_x s_\theta(x))\right]

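The second term involves the trace of the Jacobian of the score with respect to $x$, which takes $O(d)$ backward passes, or $O(d^2)$ work, to compute naively. Integration by parts (assuming $p_\text{data}(x)\, s_\theta(x) \to 0$ as $\|x\| \to \infty$) shows that $\mathcal{L}_\text{SM}$ equals $\tfrac{1}{2}\mathbb{E}_{p_\text{data}}[\|s_\theta(x) - \nabla_x \log p_\text{data}(x)\|^2]$, the Fisher divergence between the model score and the true score, up to an additive constant that does not depend on $\theta$. Crucially, $Z(\theta)$ never appears: the score is a gradient in $x$ and $Z(\theta)$ is constant in $x$, so the objective is tractable.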
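
As a small worked case, the sketch below evaluates the empirical score matching loss for a one-dimensional Gaussian model with linear score $s(x) = -\lambda (x - \mu)$, where the precision $\lambda$ is the parameter. This model and the names `sm_loss` and `best_precision` are assumptions for illustration. In one dimension the trace term is just $ds/dx = -\lambda$, and setting the derivative of the loss to zero gives $\lambda = 1/\text{Var}(x)$, the same answer as maximum likelihood, without ever computing a normalizing constant.

```python
import random

def sm_loss(precision, mu, data):
    # Empirical L_SM for the linear score s(x) = -precision * (x - mu):
    # 0.5 * s(x)^2 + ds/dx, where ds/dx = -precision in one dimension.
    return sum(0.5 * (precision * (x - mu)) ** 2 - precision for x in data) / len(data)

rng = random.Random(0)
data = [rng.gauss(1.0, 2.0) for _ in range(5000)]
mu = sum(data) / len(data)
var = sum((x - mu) ** 2 for x in data) / len(data)

# Setting d L_SM / d precision = 0 gives precision * var - 1 = 0.
best_precision = 1.0 / var
candidates = [0.5 * best_precision, best_precision, 2.0 * best_precision]
losses = [sm_loss(p, mu, data) for p in candidates]
print(best_precision, losses)  # the middle loss should be the smallest
```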

Sliced score matching reduces the $O(d^2)$ Jacobian computation by projecting the score onto random directions: $\mathcal{L}_\text{SSM} = \mathbb{E}_{v \sim p_v}\mathbb{E}_{p_\text{data}}[v^\top \nabla_x(v^\top s_\theta(x)) + \frac{1}{2}(v^\top s_\theta(x))^2]$ . This is unbiased and $O(d)$ per sample.
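
The key step in sliced score matching is that $\mathbb{E}_v[v^\top J v] = \text{tr}(J)$ whenever $\mathbb{E}[v v^\top] = I$. The sketch below checks this for a hypothetical linear score $s(x) = A x$, whose Jacobian is the fixed matrix $A$, using Rademacher ($\pm 1$) projection vectors. The matrix entries are arbitrary illustration values.

```python
import random

def quad_form(v, a):
    # v^T A v, which equals v^T grad_x(v^T s(x)) for the linear score s(x) = A x.
    n = len(v)
    return sum(v[i] * a[i][j] * v[j] for i in range(n) for j in range(n))

# Jacobian of the toy linear score (arbitrary symmetric example).
A = [[-2.0, 0.3, 0.0],
     [0.3, -1.0, 0.5],
     [0.0, 0.5, -0.5]]
exact_trace = sum(A[i][i] for i in range(3))

rng = random.Random(0)
n_probes = 20000
# Average the random projections; each probe costs one quadratic form.
estimate = sum(quad_form([rng.choice((-1.0, 1.0)) for _ in range(3)], A)
               for _ in range(n_probes)) / n_probes
print(exact_trace, estimate)
```

With a neural score network the quadratic form would be computed by one extra backward pass per projection rather than by forming the Jacobian, which is where the cost saving comes from.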

Denoising score matching#

Denoising score matching (Vincent, 2011) avoids the Jacobian entirely by perturbing the data and matching the score of the noisy distribution. For noisy samples $\tilde{x} = x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, a model of the noisy marginal score, $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$, can be learned by minimizing:

\mathcal{L}_\text{DSM}(\theta) = \mathbb{E}_{x \sim p_\text{data},\, \epsilon \sim \mathcal{N}(0, \sigma^2 I)}\!\left[\left\|s_\theta(x + \epsilon) - \frac{-\epsilon}{\sigma^2}\right\|^2\right]

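The target $-\epsilon/\sigma^2$ is the score of the Gaussian perturbation kernel $p_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$ evaluated at the noised sample: $\nabla_{\tilde{x}} \log p_\sigma(\tilde{x} \mid x) = -(\tilde{x} - x)/\sigma^2 = -\epsilon/\sigma^2$. This objective requires only sampling and forward passes: no Jacobians, no partition function. Although each target is a conditional score, the minimizer of the DSM objective is the score of the noisy marginal $p_\sigma(\tilde{x}) = \int p_\text{data}(x)\, p_\sigma(\tilde{x} \mid x)\, dx$, which approximates the score of $p_\text{data}$ as $\sigma \to 0$.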
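
A one-dimensional check shows that regressing onto $-\epsilon/\sigma^2$ recovers the score of the noisy marginal. The sketch assumes Gaussian data $x \sim \mathcal{N}(0, s^2)$ and a linear score model $s_a(\tilde{x}) = a\,\tilde{x}$ (both choices are illustrative). The noisy marginal is then $\mathcal{N}(0, s^2 + \sigma^2)$ with score slope $-1/(s^2 + \sigma^2)$, and the least-squares minimizer of the DSM loss should match it.

```python
import random

rng = random.Random(0)
data_std, sigma, n = 2.0, 0.5, 20000

num = 0.0  # accumulates x_tilde * target
den = 0.0  # accumulates x_tilde ** 2
for _ in range(n):
    x = rng.gauss(0.0, data_std)   # clean sample
    eps = rng.gauss(0.0, sigma)    # Gaussian perturbation
    x_tilde = x + eps
    target = -eps / sigma ** 2     # DSM regression target
    num += x_tilde * target
    den += x_tilde * x_tilde

# Closed-form minimizer of mean((a * x_tilde - target)^2) over the slope a.
a_dsm = num / den
# Score slope of the noisy marginal N(0, data_std^2 + sigma^2).
a_true = -1.0 / (data_std ** 2 + sigma ** 2)
print(a_dsm, a_true)
```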

This is the direct mathematical ancestor of DDPM's training target. DDPM (Week 6) can be understood as DSM applied simultaneously across a sequence of noise levels, with the score network learning to denoise from any level.

Multi-scale score estimation and NCSN#

NCSN (Noise Conditional Score Network; Song and Ermon, 2019) trains a single score network $s_\theta(x, \sigma)$ conditioned on the noise level $\sigma$ , estimating the score of $p_\sigma(x)$ for many noise levels simultaneously:

\mathcal{L}_\text{NCSN}(\theta) = \sum_{i=1}^L \lambda(\sigma_i) \mathbb{E}_{x, \tilde{x}}\!\left[\left\|s_\theta(\tilde{x}, \sigma_i) + \frac{\tilde{x} - x}{\sigma_i^2}\right\|^2\right]

where $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ is a geometric sequence of noise levels and $\lambda(\sigma_i) = \sigma_i^2$ is a weighting factor. Generation uses annealed Langevin dynamics: run SGLD at noise level $\sigma_L$ (high noise, easy exploration), then progressively reduce to $\sigma_1$ (low noise, fine detail). The high-noise steps handle global structure; low-noise steps refine fine details.
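
Annealed Langevin dynamics can be sketched with an exact score in place of a trained network. The example assumes toy data consisting of two equal-weight point masses at $\pm 4$, so that $p_\sigma$ is a two-component Gaussian mixture with a closed-form score. The step-size rule $\alpha_i \propto \sigma_i^2$ and all constants (`base_step`, `steps_per_level`, the noise range) are illustrative choices, not values from the lecture.

```python
import math
import random

MODES = (-4.0, 4.0)  # equal-weight point masses; a toy stand-in for data

def score(x, sigma):
    # Exact score of p_sigma = 0.5 N(-4, sigma^2) + 0.5 N(4, sigma^2).
    logits = [-(x - m) ** 2 / (2.0 * sigma ** 2) for m in MODES]
    top = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - top) for l in logits]
    total = sum(w)
    return sum(wi / total * (m - x) for wi, m in zip(w, MODES)) / sigma ** 2

def geometric_sigmas(sigma_min, sigma_max, n_levels):
    ratio = (sigma_max / sigma_min) ** (1.0 / (n_levels - 1))
    return [sigma_min * ratio ** i for i in range(n_levels)]

def annealed_langevin(sigmas, rng, steps_per_level=50, base_step=1e-3):
    x = rng.gauss(0.0, sigmas[-1])  # broad initialization at the largest noise
    for sigma in reversed(sigmas):  # sigma_L (largest) down to sigma_1
        alpha = base_step * (sigma / sigmas[0]) ** 2  # step scales with sigma^2
        for _ in range(steps_per_level):
            x = x + 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
sigmas = geometric_sigmas(0.1, 10.0, 10)
samples = [annealed_langevin(sigmas, rng) for _ in range(400)]
frac_right = sum(s > 0 for s in samples) / len(samples)
print(frac_right)  # roughly half the chains should end near each mode
```

Scaling the step with $\sigma_i^2$ keeps the ratio of step size to noise variance constant across levels, so each level mixes at a comparable rate.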

NCSN demonstrated that score matching across noise scales could generate competitive image samples without adversarial training — a key empirical validation before DDPM showed that the same principle, formulated as a noising/denoising chain, could match GAN quality.

Energy-based models in imitation learning and RL#

The EBM framework has direct applications in robot learning and reinforcement learning that connect this course to Courses 1 and 2.

Energy-based imitation learning: given a set of expert demonstrations $\{x^{(n)}\}$ , an EBM can represent the expert policy implicitly as $\pi_\text{expert}(a \mid s) \propto e^{-E_\theta(s, a)}$ , where low-energy $(s, a)$ pairs are those favored by the expert. Training uses contrastive divergence: lower the energy of expert (state, action) pairs, raise the energy of policy-generated pairs. This avoids the mode-averaging problem of regression-based behavior cloning — the EBM places probability mass only where the expert does, rather than averaging over multiple expert modes.

GAIL (Ho and Ermon, 2016) is the adversarial version: the discriminator $D(s, a)$ distinguishes expert from policy trajectories, and the policy gradient uses $\log D(s, a)$ as a reward. GAIL is formally equivalent to minimizing the Jensen-Shannon divergence between the occupancy measures of the expert and the learned policy — the GAN objective applied in trajectory space rather than data space. This is the direct connection between Week 3's GAN theory and Course 2's imitation learning.

Reward models as EBMs: in RLHF (Course 1, Week 12), the reward model $r_\phi(x)$ assigns scalar rewards to language model outputs. This is precisely an energy function with $E_\phi(x) = -r_\phi(x)$: lower energy (higher reward) for preferred outputs. Policy optimization via PPO can be loosely viewed as a form of Langevin dynamics in policy space: the reward model guides the policy toward high-reward (low-energy) completions, while the KL penalty plays the regularizing role, keeping the policy close to the reference model rather than injecting noise in the literal sense.

Practical implementation of EBM training#

Training EBMs in practice requires careful choices:

Buffer management in persistent contrastive divergence: maintaining a persistent buffer of MCMC chains requires allocating GPU memory for many parallel chains (typically 64–512). The buffer is initialized with random noise or data samples at the start of training; chains continue evolving across batches. Periodically refreshing a fraction of the buffer with new random initializations keeps the negative samples diverse and prevents chains from staying stuck in a few low-energy modes. The resulting gradient is more stable than with CD, at the cost of higher memory overhead.
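
The buffer logic can be sketched independently of any particular model. The class below is a hypothetical minimal version with scalar states. `ChainBuffer`, `reinit_prob`, and the fixed toy energy $E(x) = (x - 2)^2/2$ standing in for the current model are all assumptions for illustration.

```python
import random

class ChainBuffer:
    """Hypothetical persistent-chain buffer for PCD (scalar states for brevity)."""

    def __init__(self, capacity, rng, init_std=5.0):
        self.rng = rng
        self.init_std = init_std
        self.states = [self._fresh() for _ in range(capacity)]

    def _fresh(self):
        # New chain state drawn from broad random noise.
        return self.rng.gauss(0.0, self.init_std)

    def sample(self, batch_size, reinit_prob=0.05):
        # Pick distinct chains; restart a small fraction from noise.
        idx = self.rng.sample(range(len(self.states)), batch_size)
        starts = [self._fresh() if self.rng.random() < reinit_prob else self.states[i]
                  for i in idx]
        return idx, starts

    def update(self, idx, new_states):
        # Write the evolved chain states back so they persist across batches.
        for i, x in zip(idx, new_states):
            self.states[i] = x

def langevin_steps(x, n_steps, eta, rng):
    # Toy fixed energy E(x) = (x - 2)^2 / 2 stands in for the current model.
    for _ in range(n_steps):
        x = x - 0.5 * eta * (x - 2.0) + (eta ** 0.5) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
buffer = ChainBuffer(capacity=256, rng=rng)
for _ in range(200):  # one iteration per training batch
    idx, starts = buffer.sample(batch_size=64)
    negatives = [langevin_steps(x, 10, 0.1, rng) for x in starts]
    # A real trainer would use `negatives` in the CD gradient here.
    buffer.update(idx, negatives)
buffer_mean = sum(buffer.states) / len(buffer.states)
print(buffer_mean)  # close to 2, pulled slightly toward 0 by the fresh restarts
```

The restart probability trades off two effects: restarts keep the buffer diverse, but freshly restarted chains are far from equilibrium and pull the negative samples away from $p_\theta$.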

Step size and schedule for Langevin dynamics: for asymptotically exact sampling, the step size $\eta$ in SGLD must decrease over time; typical schedules use $\eta_t = \eta_0 / \sqrt{t}$ or $\eta_t = \eta_0 / t^{1/3}$. Too large an $\eta$ causes divergence; too small an $\eta$ causes slow mixing. A heuristic: initialize $\eta$ such that $\|\nabla_x E_\theta\| \cdot \eta$ is roughly between 0.01 and 0.1 in magnitude. For high-dimensional images, SGLD with 50–100 steps produces usable samples; with fewer steps the samples stay close to their initialization, which for CD means close to the data (a failure mode when training with CD).

Energy function architectures: the energy network $E_\theta(x)$ is typically a convolutional ResNet outputting a scalar. For images, a common design: down-sample to latent resolution ( $8 \times 8$ ), apply fully connected layers, output $E_\theta(x) \in \mathbb{R}$ . A critical detail: the energy function should not be too powerful (very deep networks with many parameters) because this allows the model to achieve zero training loss on the data distribution while learning little about it — a form of overfitting. Regularizing the energy norm (via weight decay or spectral normalization) mitigates this.

Relationship to other divergences: contrastive divergence minimizes a biased approximation to KL divergence; persistent CD reduces bias but increases variance. Neither achieves the unbiased gradient of perfect MCMC. Other objectives such as score matching (covered earlier in this lecture) sidestep these issues by avoiding MCMC sampling during training, which is one reason score-based and diffusion models became dominant.

GenAI context: EBMs across the AI stack#

| EBM / score-based concept | Robotics (Course 2) | RL (Course 1) | VLMs (Course 4) |
|---|---|---|---|
| Energy function $E_\theta(x)$ | Policy energy for IRL | Reward model $r_\phi(x)$ | CLIP similarity score |
| Score $\nabla_x \log p$ | Gradient for action refinement | Policy gradient direction | Contrastive gradient for retrieval |
| Langevin sampling | Trajectory optimization via gradient descent | MCMC policy search | Test-time compute via iterative refinement |
| SGLD noise injection | Domain randomization (noise stabilizes training) | Exploration noise in policy gradient | Visual augmentation for robustness |
| NCSN multi-scale noise | Domain randomization schedules | Entropy-regularized RL at multiple scales | Multi-resolution image understanding |

The EBM perspective reveals a unified principle running through all four courses: probability is assigned through an energy function, and both training (MLE / contrastive divergence) and inference (MCMC / Langevin) are gradient computations on that energy. Diffusion models are EBMs that have learned to estimate scores at multiple noise levels; RLHF reward models are EBMs trained on human preference data; CLIP is an EBM whose energy is the negative dot product of image and text embeddings. Recognizing these as the same mathematical object enables practitioners to transfer techniques across domains — for instance, using test-time Langevin refinement (originally from EBMs) to improve diffusion model samples, or using reward-conditioned EBM sampling (from RLHF) to steer robot trajectory generation.

Key takeaways#

EBMs define probability distributions through unnormalized energy functions; the partition function makes direct likelihood computation intractable. Contrastive divergence approximates the MLE gradient using short MCMC chains, trading bias for computational tractability, and persistent contrastive divergence maintains long-running chains for better gradient estimates. Langevin dynamics generates samples from an EBM by following the negative energy gradient with injected noise, but suffers from exponential slowdown in high-dimensional multimodal landscapes — replica exchange and noise annealing are practical solutions. Score matching learns the score function $\nabla_x \log p(x)$ without the partition function; denoising score matching replaces the intractable Jacobian with a simple denoising target. NCSN generalizes DSM to multiple noise levels, enabling annealed Langevin sampling — the direct precursor to DDPM's reverse diffusion process. The conceptual insight that bridges all of these: probability densities can be represented implicitly through energy functions or scores, and both learning and inference reduce to gradient-based operations on these energy landscapes. This perspective unifies generative modeling, reinforcement learning, and inverse reinforcement learning as variants of the same underlying principle.

Conceptual questions#

The MLE gradient for an EBM is $\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x')]$ . Show that this is equivalent to minimizing the KL divergence $D_\text{KL}(p_\text{data} \| p_\theta)$ . Explain why the expectation under $p_\theta$ makes this gradient intractable in practice and what approximation contrastive divergence makes.
Stochastic gradient Langevin dynamics converges to the target distribution $p_\theta$ as $\eta \to 0$ and $T \to \infty$ . For finite $\eta$ and $T$ , the chain is biased. (a) Show that with fixed $\eta > 0$ , the stationary distribution of SGLD is not $p_\theta$ but a biased approximation. (b) If the energy function $E_\theta(x)$ is strongly convex with parameter $m$ , bound the mixing time required for the chain to approximate $p_\theta$ within total variation $\epsilon$ .
Denoising score matching trains $s_\theta(\tilde{x})$ to estimate $\nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$ for a single noise level $\sigma$ . As $\sigma \to 0$ , the noisy distribution $p_\sigma$ approaches the data distribution $p_\text{data}$ , so the learned score should approximate $\nabla_x \log p_\text{data}(x)$ well — but in practice, small $\sigma$ produces poor score estimates. Explain the geometric reason why score matching is unreliable at very small noise levels, particularly in regions of low data density.
NCSN uses a geometric sequence of noise levels $\sigma_1 < \cdots < \sigma_L$ and runs annealed Langevin dynamics from $\sigma_L$ down to $\sigma_1$ . Explain why this annealing is necessary and what failure mode would occur if sampling were performed entirely at the lowest noise level $\sigma_1$ without the annealing schedule. How does this relate to the multi-modal structure of typical image distributions?
An EBM trained on natural images learns an energy function $E_\theta(x)$ with low energy on realistic images and high energy on random noise. A red-teaming researcher discovers that a small adversarial perturbation $\delta$ can move a random-noise image $x_\text{noise}$ to a low-energy region without making $x_\text{noise} + \delta$ look realistic to humans. What does this finding imply about the geometric structure of the EBM's energy landscape? How does this relate to adversarial examples in discriminative classifiers?

Solutions

The empirical loss is $-\mathbb{E}_{p_\text{data}}[\log p_\theta]$, whose gradient is $\mathbb{E}_{p_\text{data}}[\nabla_\theta E_\theta] - \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta]$, the negative of the expected log-likelihood gradient stated in the question. This is exactly $\nabla_\theta D_\text{KL}(p_\text{data}\|p_\theta)$, because $D_\text{KL}(p_\text{data}\|p_\theta) = -H(p_\text{data}) - \mathbb{E}_{p_\text{data}}[\log p_\theta]$ and the data entropy $H(p_\text{data})$ is $\theta$-independent. The second term is an expectation under the model, requiring samples from $p_\theta$ via MCMC. CD approximates those samples with $k$ short MCMC steps started at the data, biasing the negative phase toward near-data points.
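(a) SGLD with fixed $\eta$ is the Euler–Maruyama discretization of the Langevin SDE; the discretization error means its stationary distribution matches $p_\theta$ only up to an $O(\eta)$ bias: it is the invariant distribution of the discrete chain, not of the SDE. (b) If $E_\theta$ is $m$-strongly convex, $p_\theta$ is $m$-strongly log-concave and the Langevin diffusion contracts geometrically at a rate proportional to $m$: reaching TV $\epsilon$ takes continuous time $O(\tfrac{1}{m}\log\tfrac{1}{\epsilon})$, which corresponds to $O(\tfrac{1}{m\eta}\log\tfrac{1}{\epsilon})$ discrete steps of size $\eta$ (up to dimension/condition-number factors, and with $\eta$ chosen small enough that the bias from part (a) is itself below $\epsilon$).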
At small $\sigma$ the noisy distribution concentrates on a thin shell around the data manifold, so low-density / off-manifold regions are essentially never sampled — the DSM target $\nabla\log p_\sigma$ is never observed there and the score is unconstrained and inaccurate. Since Langevin starts off-manifold, these bad estimates dominate. Geometrically: the score must point toward a low-dimensional manifold, but receives near-zero training signal away from it when $\sigma$ is tiny.
Annealing is needed because at low $\sigma$ the score is accurate only near data and Langevin cannot cross the empty, high-barrier regions between well-separated modes. Starting at high $\sigma$ blurs the modes into one connected distribution the chain can traverse, then cooling refines detail. Sampling only at $\sigma_1$ traps the chain near its initialization, producing wrong mode proportions (mode imbalance) — critical because natural-image distributions are highly multimodal (many disconnected class manifolds).
The energy is trained to be low on data and high only on the specific negatives seen during CD; everywhere else it is unconstrained, so spurious low-energy basins exist far from the data manifold. The adversarial $\delta$ simply walks into one. This is the same phenomenon as adversarial examples in classifiers: the model is well-behaved only near training data and extrapolates arbitrarily off-manifold, where small input changes cross into unintended low-energy / high-confidence regions.

Looking ahead#

Score matching and denoising score matching provide a way to learn distributions without partition functions. The next model family takes a different approach: defining distributions through exact bijections whose Jacobian determinants are computable.

Week 5: Normalizing Flows. We derive the change-of-variables formula, examine coupling layer architectures (RealNVP, Glow), analyze autoregressive flows and their tradeoff between training and inference parallelism, and introduce continuous normalizing flows via neural ODEs.
