Week 1: Probabilistic Foundations

Purpose of this lecture

Generative modeling is the problem of learning a probability distribution $p(x)$ over high-dimensional data — images, audio, text, robot trajectories — and using that distribution to draw new samples, evaluate likelihoods, or infer latent structure. Every algorithm in this course (VAE, GAN, diffusion, flow matching) is a different answer to the same underlying question: how do we represent and compute with $p(x)$ when $x$ lives in thousands or millions of dimensions?

This lecture establishes the probabilistic vocabulary shared by all of them. The central objects are the likelihood function, the latent variable model, the evidence lower bound (ELBO), and amortized variational inference. These are not preliminary material — they are the load-bearing structure of everything that follows.

Maximum likelihood estimation

Given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ drawn i.i.d. from some true distribution $p^*(x)$ , the maximum likelihood estimator (MLE) of a parametric model $p_\theta(x)$ is:

\theta^* = \arg\max_\theta \sum_{n=1}^N \log p_\theta(x^{(n)})

The log-likelihood sum is the empirical expectation $N \cdot \mathbb{E}_{x \sim \hat{p}}[\log p_\theta(x)]$ , where $\hat{p}$ is the empirical distribution. Minimizing the negative log-likelihood is equivalent to minimizing the KL divergence $D_\text{KL}(\hat{p} \| p_\theta)$ :

D_\text{KL}(\hat{p} \| p_\theta) = \mathbb{E}_{\hat{p}}[\log \hat{p}(x)] - \mathbb{E}_{\hat{p}}[\log p_\theta(x)]

The first term is a constant with respect to $\theta$ , so maximizing $\mathbb{E}_{\hat{p}}[\log p_\theta(x)]$ exactly minimizes the KL from the data distribution to the model. MLE is thus the canonical method for fitting generative models: it directly minimizes the divergence between model and data.
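
The identity between average negative log-likelihood and forward KL can be checked numerically. The sketch below assumes a categorical model over three symbols and a small made-up dataset (both are illustrative choices, not part of the lecture); for that model family the MLE is the vector of empirical frequencies.

```python
import math
from collections import Counter

# Toy dataset over three symbols (hypothetical data for illustration).
data = ["a", "a", "b", "a", "c", "b", "a", "a"]
counts = Counter(data)
N = len(data)
p_hat = {k: counts[k] / N for k in counts}  # empirical distribution

def avg_log_lik(p_model):
    """Average log-likelihood (1/N) * sum_n log p_model(x_n)."""
    return sum(math.log(p_model[x]) for x in data) / N

def kl(p, q):
    """D_KL(p || q) for discrete distributions with shared support."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

entropy = -sum(p * math.log(p) for p in p_hat.values())

# For a categorical model the MLE is the vector of empirical frequencies.
mle = dict(p_hat)
other = {"a": 0.4, "b": 0.4, "c": 0.2}  # an arbitrary competing model

for name, model in [("mle", mle), ("other", other)]:
    nll = -avg_log_lik(model)
    # Identity: average NLL = H(p_hat) + D_KL(p_hat || p_model)
    print(name, round(nll, 4), round(entropy + kl(p_hat, model), 4))
```

The two printed columns agree for every model, and the KL term vanishes only at the MLE, so ranking models by average log-likelihood is the same as ranking them by forward KL to the empirical distribution.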

Maximum a posteriori (MAP) estimation adds a prior $p(\theta)$ and maximizes $\log p_\theta(\mathcal{D}) + \log p(\theta)$. This is equivalent to MLE with regularization: a zero-mean isotropic Gaussian prior on $\theta$ with precision $\lambda$ contributes $\log p(\theta) = -\tfrac{\lambda}{2}\|\theta\|^2 + \text{const}$, which is an L2 penalty $\tfrac{\lambda}{2}\|\theta\|^2$ added to the negative log-likelihood.

Latent variable models

Many natural distributions are most concisely described as a mixture: each observation $x$ is generated from a low-dimensional latent variable $z$ that captures the semantics (identity, pose, style), with variation in $x$ given $z$ representing rendering noise. The joint model is:

p_\theta(x, z) = p_\theta(x \mid z) \, p(z)

where $p(z)$ is the prior (typically $\mathcal{N}(0, I)$ ) and $p_\theta(x \mid z)$ is the likelihood (a neural network decoder). The marginal likelihood is obtained by integrating out the latent:

p_\theta(x) = \int p_\theta(x \mid z) \, p(z) \, dz

For a neural network decoder this integral has no closed form, and numerical integration over continuous $z$ becomes infeasible in even moderate dimension. We cannot evaluate $\log p_\theta(x)$ exactly, so we cannot directly optimize the MLE objective. This is the fundamental difficulty that variational inference addresses.
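
A naive way to approximate the marginal is to average $p_\theta(x \mid z)$ over samples from the prior. The sketch below assumes a one-dimensional toy model, $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, \texttt{noise\_var})$, chosen only because its marginal $\mathcal{N}(0, 1 + \texttt{noise\_var})$ is known in closed form and can serve as a reference. The function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mc_marginal(x, noise_var, num_samples, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x | z)] by sampling z from the prior."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)               # z ~ N(0, 1)
        total += normal_pdf(x, z, noise_var)  # p(x | z) = N(x; z, noise_var)
    return total / num_samples

rng = random.Random(0)
x, noise_var = 1.2, 0.5
estimate = mc_marginal(x, noise_var, 20000, rng)
exact = normal_pdf(x, 0.0, 1.0 + noise_var)  # closed form for this toy model
print(estimate, exact)
```

In one dimension the estimate is close to the exact value. With a high-dimensional $z$ and a sharply peaked decoder, almost all prior samples land where $p_\theta(x \mid z)$ is negligible, so the variance of this estimator becomes impractically large.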

The evidence lower bound

The ELBO (evidence lower bound) is a tractable lower bound on $\log p_\theta(x)$ obtained by introducing an approximate posterior $q_\phi(z \mid x)$ and applying Jensen's inequality:

\log p_\theta(x) = \log \int p_\theta(x \mid z) p(z) \, dz \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_\text{KL}(q_\phi(z \mid x) \| p(z))

This bound is tight when $q_\phi(z \mid x) = p_\theta(z \mid x)$ — the true posterior. The derivation proceeds via:

\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] + D_\text{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x))

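The KL term on the right is non-negative, so the first term (the ELBO) is a lower bound. Jointly maximizing the ELBO over $(\theta, \phi)$ trains the decoder to assign high $\log p_\theta(x \mid z)$ to latents drawn from the encoder, and trains the encoder to move $q_\phi(z \mid x)$ toward the true posterior $p_\theta(z \mid x)$, which tightens the bound. In the two-term form, the same pressure appears as a trade-off between reconstruction accuracy and staying close to the prior $p(z)$.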

The ELBO decomposes into two interpretable terms:

\mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x \mid z)]}_{\text{reconstruction}} - \underbrace{D_\text{KL}(q_\phi(z \mid x) \| p(z))}_{\text{regularization}}

The reconstruction term rewards the decoder for correctly generating $x$ given the latent sample. The KL term penalizes the encoder for producing posteriors that differ from the prior — it acts as a bottleneck that prevents the latent space from collapsing to a lookup table.
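
Both ELBO terms have closed forms when everything is Gaussian, which makes the bound easy to inspect. The sketch below assumes a one-dimensional toy model, $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, \texttt{noise\_var})$, with a Gaussian $q(z \mid x) = \mathcal{N}(m, v)$; the model and the function names are illustrative assumptions rather than anything defined in the lecture.

```python
import math

def log_marginal(x, noise_var):
    """Exact log p(x) for z ~ N(0,1), x | z ~ N(z, noise_var)."""
    var = 1.0 + noise_var
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * x * x / var

def elbo(x, noise_var, m, v):
    """ELBO for q(z | x) = N(m, v), using closed-form Gaussian expectations."""
    recon = (-0.5 * math.log(2 * math.pi * noise_var)
             - ((x - m) ** 2 + v) / (2 * noise_var))   # E_q[log p(x | z)]
    kl = 0.5 * (v + m * m - 1.0 - math.log(v))          # KL(q || N(0, 1))
    return recon - kl

x, noise_var = 1.2, 0.5
post_mean = x / (1.0 + noise_var)          # true posterior mean for this model
post_var = noise_var / (1.0 + noise_var)   # true posterior variance

print("log p(x)           ", log_marginal(x, noise_var))
print("ELBO, q = prior    ", elbo(x, noise_var, 0.0, 1.0))
print("ELBO, q = posterior", elbo(x, noise_var, post_mean, post_var))
```

Setting $q$ to the prior makes the KL term zero but gives a loose bound; setting $q$ to the true posterior pays a nonzero KL cost and closes the gap to $\log p_\theta(x)$.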

The derivation in full: Jensen's inequality step by step

The ELBO derivation is worth following carefully, as the same algebraic maneuver appears in nearly every subsequent model. Starting from the marginal log-likelihood:

\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz

Introduce $q_\phi(z \mid x)$ via an importance weighting identity — multiply and divide inside the integral:

= \log \int q_\phi(z \mid x)\, \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\, dz \;=\; \log\, \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]

Jensen's inequality states that for any concave function $f$ and any random variable $W$ , $f(\mathbb{E}[W]) \geq \mathbb{E}[f(W)]$ . Because $\log$ is concave:

\log\,\mathbb{E}_{q_\phi}\!\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] \;\geq\; \mathbb{E}_{q_\phi}\!\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]

Expanding the log of the ratio recovers the two-term ELBO:

= \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] \;+\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(z)}{q_\phi(z \mid x)}\right] \;=\; \mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)] - D_\text{KL}(q_\phi(z \mid x) \| p(z))

The tightness of the bound follows immediately: subtracting the ELBO from $\log p_\theta(x)$ yields $D_\text{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x)) \geq 0$ . Maximizing the ELBO is equivalent to simultaneously maximizing $\log p_\theta(x)$ and minimizing the gap $D_\text{KL}(q_\phi \| p_\theta(z \mid x))$ , driving the encoder toward the true posterior.
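
The Jensen step can also be seen in samples: with importance weights $w = p_\theta(x \mid z)\,p(z)/q_\phi(z \mid x)$, the log of the average weight estimates $\log p_\theta(x)$, while the average of the log weights estimates the ELBO. The sketch below assumes the toy model $z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, \texttt{noise\_var})$ and a deliberately imperfect $q = \mathcal{N}(m, v)$; all numerical values are arbitrary.

```python
import math
import random

def log_normal(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

rng = random.Random(0)
x, noise_var = 1.2, 0.5          # toy model: z ~ N(0,1), x | z ~ N(z, noise_var)
m, v = 0.3, 0.8                  # a deliberately imperfect q(z | x) = N(m, v)

log_w = []
for _ in range(5000):
    z = rng.gauss(m, math.sqrt(v))                 # z ~ q
    log_w.append(log_normal(x, z, noise_var)       # log p(x | z)
                 + log_normal(z, 0.0, 1.0)         # + log p(z)
                 - log_normal(z, m, v))            # - log q(z | x)

# Average of logs: a Monte Carlo estimate of the ELBO.
mean_of_log = sum(log_w) / len(log_w)
# Log of the average: a Monte Carlo estimate of log p(x).
log_of_mean = math.log(sum(math.exp(a) for a in log_w) / len(log_w))
print(mean_of_log, log_of_mean)
```

Jensen's inequality holds for the empirical average as well, so the first number never exceeds the second regardless of the random seed.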

Amortized variational inference

Classical variational inference optimizes a separate set of variational parameters for each data point, which does not generalize to new observations and does not scale to large datasets. Amortized inference solves both problems by training a single neural network — the inference network or encoder — that takes $x$ as input and outputs the parameters of $q_\phi(z \mid x)$ :

q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x),\, \sigma^2_\phi(x) \cdot I)

where $\mu_\phi$ and $\sigma_\phi$ are the outputs of a neural encoder. This amortization introduces an approximation gap (the encoder may not represent the true posterior exactly), but it provides: generalization to new $x$ without re-optimization, $O(1)$ inference time at test time, and mini-batch stochastic gradient optimization of the ELBO.

The gradient of the ELBO requires differentiating through a sampling operation $z \sim q_\phi(z \mid x)$ . The reparameterization trick (covered in Week 2) enables this by writing $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ , moving the stochasticity into the fixed noise variable $\epsilon$ so that gradients can flow through $\mu_\phi$ and $\sigma_\phi$ .
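
As a preview of that construction, the sketch below estimates gradients of $\mathbb{E}[f(z)]$ with respect to $\mu$ and $\sigma$ by differentiating through $z = \mu + \sigma\epsilon$. It assumes scalar $\mu$ and $\sigma$ and the test function $f(z) = z^2$, chosen because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ gives analytic gradients to compare against; the function name is hypothetical, and a real encoder would obtain these derivatives by automatic differentiation.

```python
import random

def reparam_grads(mu, sigma, num_samples, rng):
    """Pathwise gradient estimates of E[f(z)] with f(z) = z**2, z = mu + sigma * eps."""
    g_mu = g_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)     # noise independent of (mu, sigma)
        z = mu + sigma * eps          # deterministic in (mu, sigma) given eps
        df_dz = 2.0 * z               # f'(z)
        g_mu += df_dz * 1.0           # chain rule with dz/dmu = 1
        g_sigma += df_dz * eps        # chain rule with dz/dsigma = eps
    return g_mu / num_samples, g_sigma / num_samples

rng = random.Random(0)
mu, sigma = 1.0, 0.5
g_mu, g_sigma = reparam_grads(mu, sigma, 100000, rng)
# Analytic check: E[z^2] = mu^2 + sigma^2, so the gradients are (2*mu, 2*sigma).
print(g_mu, 2 * mu)
print(g_sigma, 2 * sigma)
```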

Score functions and denoising score matching

A concept that runs through the second half of the course is the score function of a distribution $p(x)$ : the gradient of the log-density with respect to the data point $x$ itself (not the parameters):

s(x) = \nabla_x \log p(x)

The score points in the direction of steepest increase in log-probability — toward higher-density regions. For an energy-based model $p_\theta(x) \propto e^{-E_\theta(x)}$ (covered in Week 4), the score is $s_\theta(x) = -\nabla_x E_\theta(x)$ : the negative gradient of the energy, free of the intractable partition function $Z(\theta)$ . This makes scores far more useful than densities for many computational tasks: you can evaluate and differentiate a score without ever computing $p_\theta(x)$ itself.
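
The independence from the partition function is easy to verify on a small example. The sketch below assumes a hypothetical one-dimensional double-well energy $E(x) = \tfrac{1}{4}x^4 - x^2$ and compares the analytic score $-\,dE/dx$ with a finite-difference derivative of the unnormalized log-density $-E(x)$.

```python
import math

def energy(x):
    """Hypothetical double-well energy E(x); p(x) is proportional to exp(-E(x))."""
    return 0.25 * x ** 4 - x ** 2

def score(x):
    """s(x) = d/dx log p(x) = -dE/dx; the log-partition term has zero gradient."""
    return -(x ** 3 - 2.0 * x)

def log_unnormalized(x):
    return -energy(x)

def finite_difference(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, score(x), finite_difference(log_unnormalized, x))
```

The normalizing constant never appears. The sign of the score at each test point indicates the direction of the nearest density mode, which for this energy sits at $x = \pm\sqrt{2}$.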

Score matching (Hyvärinen, 2005) trains a score network $s_\theta(x) \approx \nabla_x \log p(x)$ by minimizing:

J(\theta) = \mathbb{E}_{p(x)}\!\left[\tfrac{1}{2}\|s_\theta(x)\|^2 + \nabla_x \cdot s_\theta(x)\right]

where $\nabla_x \cdot s_\theta = \sum_i \partial [s_\theta]_i / \partial x_i$ is the divergence. Crucially, this objective equals $\mathbb{E}_{p(x)}[\frac{1}{2}\|s_\theta(x) - \nabla_x \log p(x)\|^2]$ up to a constant, without requiring the true score — only samples from $p(x)$ .
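
A one-dimensional case shows the objective at work. The sketch below assumes the model family $s_\theta(x) = -(x - m)/v$, which is the score of a Gaussian $\mathcal{N}(m, v)$, and synthetic Gaussian data; in one dimension the divergence is simply $ds/dx = -1/v$. For this family the empirical objective works out to be minimized at the sample mean and variance, which the code checks against two perturbed parameter settings.

```python
import random

def sm_objective(data, m, v):
    """Empirical Hyvarinen objective for the 1-D model score s(x) = -(x - m) / v."""
    total = 0.0
    for x in data:
        s = -(x - m) / v
        ds_dx = -1.0 / v                 # divergence of s in one dimension
        total += 0.5 * s * s + ds_dx
    return total / len(data)

rng = random.Random(0)
data = [rng.gauss(2.0, 1.5) for _ in range(5000)]   # synthetic data (assumption)
mean = sum(data) / len(data)
var = sum((x - mean) ** 2 for x in data) / len(data)

# Compare the objective at the sample moments with two perturbed settings.
best = sm_objective(data, mean, var)
print(best, sm_objective(data, mean + 0.5, var), sm_objective(data, mean, 2.0 * var))
```

Only samples are used: the true score of the data distribution is never evaluated.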

Denoising score matching (DSM; Vincent, 2011) circumvents the expensive divergence computation. Corrupt each data point with Gaussian noise, $\tilde{x} = x + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ , and train the network to recover the clean direction:

J_\text{DSM}(\theta) = \mathbb{E}_{x,\, \epsilon}\!\left[\left\| s_\theta(\tilde{x},\, \sigma) + \frac{\epsilon}{\sigma} \right\|^2\right]

The minimizer is the conditional expectation $s_\theta(\tilde{x}, \sigma) = \mathbb{E}[-\epsilon/\sigma \mid \tilde{x}] = \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$: the score of the noise-perturbed distribution $p_\sigma$, which points back toward the clean data. The network never observes $\epsilon$, so it cannot match $-\epsilon/\sigma$ sample by sample; it recovers the average over all clean points consistent with $\tilde{x}$. Up to a rescaling by $\sigma$, this is the noise-prediction objective that DDPM (Week 6) optimizes: the diffusion model's training loss is DSM applied simultaneously across a schedule of noise levels $\sigma_t$. Establishing this connection now means that when the DDPM objective appears, it will be recognizable as a likelihood-free probabilistic training method rather than an ad hoc engineering choice.
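
The DSM regression can be run by hand in a case where the answer is known. The sketch below assumes clean data $x \sim \mathcal{N}(0, 1)$, so the noisy marginal is $\mathcal{N}(0, 1 + \sigma^2)$ with score $-\tilde{x}/(1 + \sigma^2)$, and a linear score model $s(\tilde{x}) = a\,\tilde{x}$ fitted by least squares to the targets $-\epsilon/\sigma$. Both the data distribution and the linear model are illustrative assumptions.

```python
import random

rng = random.Random(0)
sigma = 0.5
num = 0.0
den = 0.0
for _ in range(50000):
    x = rng.gauss(0.0, 1.0)          # clean data, x ~ N(0, 1) (toy assumption)
    eps = rng.gauss(0.0, 1.0)
    x_noisy = x + sigma * eps
    target = -eps / sigma            # DSM regression target
    num += x_noisy * target
    den += x_noisy * x_noisy

a_fit = num / den                    # least-squares slope for s(x_noisy) = a * x_noisy
a_true = -1.0 / (1.0 + sigma ** 2)   # score of N(0, 1 + sigma^2) is -x / (1 + sigma^2)
print(a_fit, a_true)
```

Although each individual target $-\epsilon/\sigma$ is noisy, the fitted slope approaches the slope of the noisy distribution's score, which is what regression onto those targets estimates.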

Autoregressive models

An alternative to latent variable models is to factorize $p(x)$ using the chain rule of probability without any latent variables:

p_\theta(x) = \prod_{i=1}^D p_\theta(x_i \mid x_1, \ldots, x_{i-1})

This factorization is exact for any joint distribution. A model that parameterizes each conditional $p_\theta(x_i \mid x_{<i})$ with a neural network — a PixelCNN for images, a transformer for tokens — is an autoregressive model. Autoregressive models have exact likelihood: $\log p_\theta(x)$ is computable as a sum of conditional log-likelihoods, enabling direct MLE training.

The tradeoff is sampling efficiency: generating a sample requires $D$ sequential neural network evaluations, one per dimension, which is slow for high-dimensional $x$ . Latent variable models and diffusion models generate $x$ in $O(1)$ forward passes (or $O(T)$ diffusion steps) independently of the data dimension.
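
Both properties, exact likelihood and sequential sampling, show up in a very small example. The sketch below assumes binary sequences of length $D = 4$ and a hand-written conditional `p_one` standing in for the neural network; the rule it uses is hypothetical and chosen only so that every conditional is a valid probability.

```python
import itertools
import math
import random

D = 4

def p_one(prefix):
    """Hypothetical conditional p(x_i = 1 | x_<i): depends on the count of ones so far."""
    return (1 + sum(prefix)) / (2 + len(prefix))

def log_prob(x):
    """Exact log-likelihood as a sum of conditional log-probabilities."""
    total = 0.0
    for i in range(len(x)):
        p1 = p_one(x[:i])
        total += math.log(p1 if x[i] == 1 else 1.0 - p1)
    return total

def sample(rng):
    """Ancestral sampling: D sequential draws, each conditioned on the prefix."""
    x = []
    for _ in range(D):
        x.append(1 if rng.random() < p_one(x) else 0)
    return tuple(x)

total_mass = sum(math.exp(log_prob(x)) for x in itertools.product([0, 1], repeat=D))
print(total_mass)                      # sums to 1 up to floating-point rounding
print(sample(random.Random(0)))
```

Because each conditional is normalized, the product is automatically a normalized joint distribution, and no partition function is needed. Sampling, by contrast, cannot be parallelized across positions since each draw needs the previous ones.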

Connections to language models: GPT-style transformers are autoregressive models over discrete token sequences, trained with the same MLE objective. The negative log-probability of a sequence, $-\log p_\theta(x_1, \ldots, x_T) = -\sum_t \log p_\theta(x_t \mid x_{<t})$, is the sum of the per-position cross-entropy losses. The differences from image autoregressive models are the data type (text tokens vs. pixel intensities) and the conditional architecture (self-attention vs. masked convolutions).
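
The link between sequence likelihood and per-position cross-entropy can be made explicit with a few lines of arithmetic. The sketch below assumes a three-token vocabulary and made-up logits standing in for what a model conditioned on $x_{<t}$ would output at each position.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

# Hypothetical per-position logits over a 3-token vocabulary, as a model
# conditioned on x_<t would produce them; the values are made up.
logits_per_position = [[2.0, 0.5, -1.0], [0.1, 0.2, 1.5], [-0.3, 1.2, 0.0]]
tokens = [0, 2, 1]                     # observed sequence x_1, x_2, x_3

# Cross-entropy against a one-hot target is -log p(observed token).
per_position_ce = [-log_softmax(logits)[tok]
                   for logits, tok in zip(logits_per_position, tokens)]
sequence_nll = sum(per_position_ce)    # equals -log p(x_1, x_2, x_3)
print(per_position_ce, sequence_nll)
```

Summing the per-position losses gives the negative log-probability of the whole sequence, so minimizing the usual language-model loss is maximum likelihood on the chain-rule factorization.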

The expressiveness–tractability frontier

Every generative model design choice is a point on an expressiveness–tractability frontier: more expressive representations of $p_\theta(x)$ generally make evaluation, sampling, or training harder.

Fully tractable models (normalizing flows, autoregressive models) can evaluate $\log p_\theta(x)$ exactly and sample in a single forward pass (flows) or in $D$ sequential steps (autoregressive models). Their training is simple MLE. Their cost is architectural constraint: flows must be invertible with tractable Jacobian determinants; autoregressive models must factorize along dimensions.
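
The role of the Jacobian determinant is visible in the smallest possible flow. The sketch below assumes a one-dimensional affine map $x = a z + b$ with $z \sim \mathcal{N}(0, 1)$ (an illustrative choice, with hypothetical function names) and evaluates $\log p(x)$ by the change-of-variables formula, then compares it with the Gaussian $\mathcal{N}(b, a^2)$ density that this map is known to produce.

```python
import math

def log_standard_normal(z):
    return -0.5 * math.log(2 * math.pi) - 0.5 * z * z

def flow_log_prob(x, a, b):
    """log p(x) for x = a * z + b with z ~ N(0, 1), via change of variables."""
    z = (x - b) / a                                    # inverse map
    return log_standard_normal(z) - math.log(abs(a))   # minus log |dx/dz|

def gaussian_log_prob(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2

a, b, x = 2.0, -1.0, 0.7
print(flow_log_prob(x, a, b), gaussian_log_prob(x, b, abs(a)))
```

A deep flow stacks many such invertible maps and sums their log-determinant terms, which is why each layer needs an inverse and a determinant that are cheap to compute.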

Latent-variable models (VAEs) relax the architectural constraint — the decoder can be any neural network — at the cost of an intractable marginal likelihood. The ELBO approximates MLE at the price of an amortization gap and a bound gap. Sampling is cheap ( $z \sim p(z)$ , then one decoder forward pass), but reconstruction quality is limited by the quality of the encoder.

Implicit models (GANs) abandon likelihood evaluation entirely. The generator is completely free — there is no constraint on its architecture. The cost is training instability: optimizing a minimax game rather than a simple loss. Mode collapse (the generator covering only a subset of the data distribution) is the signature failure mode.

Energy-based models express the most general probability distributions but make both evaluation and sampling hard: the partition function is intractable (making MLE impossible directly) and sampling requires MCMC chains that may not mix in practice.

Diffusion and flow matching models occupy a distinctive point: they evaluate $\log p_\theta(x)$ only approximately (via a variational bound or through ODE likelihood), but they achieve state-of-the-art sample quality by leveraging deep networks in a many-step generation process, trading single-pass efficiency for iterative quality.

Understanding where each model falls on this frontier — and why the engineering constraints of the deployment scenario (need for exact likelihoods? need for fast sampling? need for conditional generation?) determine which model class is appropriate — is the meta-lesson that every lecture builds toward.

The generative model taxonomy

The models in this course differ in how they represent $p_\theta(x)$ :

Likelihood-based models — VAEs, normalizing flows, diffusion — can evaluate (or bound) $\log p_\theta(x)$ and train by maximizing it. They have well-behaved training dynamics because the loss is a proxy for the data likelihood.

Implicit models — GANs — represent $p_\theta(x)$ only through a sampler (the generator). Likelihood evaluation requires intractable density estimation. Training uses an adversarial game rather than maximum likelihood.

Energy-based models define $p_\theta(x) \propto e^{-E_\theta(x)}$, where $E_\theta$ is a learned energy function. Likelihood evaluation requires computing the normalizing constant (the partition function), which is typically intractable, so training and sampling rely on approximate MCMC methods.

GenAI context: probabilistic foundations across the course sequence

The ELBO, latent variable model, and score function are not just Course 3 abstractions — they are the same objects used throughout the course sequence under different names.

| Concept | Generative models (this course) | Robotics (Course 2) | RL (Course 1) |
|---|---|---|---|
| Latent variable model | VAE encoder–decoder | CVAE in ACT; action codes | World model latent state $h_t$ |
| Prior $p(z)$ | $\mathcal{N}(0, I)$ over image latents | Gaussian prior over action modes | Dynamics prior in RSSM |
| Encoder $q_\phi(z \mid x)$ | Image→latent inference | Demo sequence→action code | Posterior state estimation |
| Score function | $\nabla_x \log p(x)$ in diffusion | Energy gradient for imitation | Reward gradient for policy search |
| Autoregressive factorization | PixelCNN, Transformer LM | Token-by-token VLA action generation | POMDP belief update |
| ELBO bound gap | $D_\text{KL}(q_\phi \| p_\theta(z \mid x))$ | Encoder approximation error | Variational inference in POMDP |

The ELBO and its gradient are the same mathematical objects whether they are computed over robot demonstration sequences, image pixels, or language tokens. A practitioner who understands the unified probabilistic framework can recognize that a diffusion policy for manipulation, a VAE for image compression, and an autoregressive language model are all distinct engineering answers to the same underlying question: how do we represent and compute with a high-dimensional probability distribution?

Key takeaways

MLE minimizes the KL divergence from the data distribution to the model and is the training objective underlying all likelihood-based generative models. Latent variable models $p_\theta(x, z) = p_\theta(x \mid z) p(z)$ capture data structure through a low-dimensional semantic bottleneck but require variational approximation because the marginal $p_\theta(x)$ is intractable. The ELBO lower-bounds $\log p_\theta(x)$ as a reconstruction term minus a KL regularization term, and is tight exactly when the encoder equals the true posterior $p_\theta(z \mid x)$. Amortized inference parameterizes $q_\phi(z \mid x)$ with a single shared encoder, enabling scalable stochastic training. Autoregressive models factor $p(x)$ along dimensions using the chain rule, enabling exact likelihood but requiring sequential sampling.

Conceptual questions

The ELBO can be written as $\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)] - D_\text{KL}(q_\phi(z \mid x) \| p(z))$ . Show algebraically that maximizing the ELBO is equivalent to minimizing $D_\text{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x))$ while maximizing $\log p_\theta(x)$ . What does this imply about the gap between the ELBO and the true log-likelihood?
A latent variable model uses a standard Gaussian prior $p(z) = \mathcal{N}(0, I)$ and a Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$ . Derive the closed-form KL term $D_\text{KL}(q_\phi(z \mid x) \| p(z))$ as a function of $\mu_\phi$ and $\sigma_\phi$ . Under what conditions does this KL term equal zero?
Consider an autoregressive model factored as $p_\theta(x) = \prod_i p_\theta(x_i \mid x_{<i})$ . If each conditional is modeled as a Gaussian with mean $\mu_\theta^{(i)}(x_{<i})$ and fixed variance $\sigma^2$ , show that maximizing the log-likelihood is equivalent to minimizing the sum of squared prediction errors $\sum_i \|x_i - \mu_\theta^{(i)}(x_{<i})\|^2$ . What generative model does this correspond to?
MLE minimizes $D_\text{KL}(\hat{p} \| p_\theta)$ (forward KL) rather than $D_\text{KL}(p_\theta \| \hat{p})$ (reverse KL). Explain geometrically why forward KL leads to mode covering behavior (the model spreads to cover all modes of the data) while reverse KL leads to mode seeking (the model collapses to one mode). Which behavior is preferable for generative sampling, and why does the choice of KL direction matter for the resulting model?
A normalizing flow model achieves $\log p_\theta(x) = -2.3$ nats on a test image, while a VAE achieves an ELBO of $-3.1$ nats on the same image. Can you conclude the flow is a better model? What additional information would you need to make a valid comparison between the two models?

Solutions

Subtract the ELBO from $\log p_\theta(x)$ : the gap is exactly $D_\text{KL}(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)) \geq 0$ . So maximizing the ELBO over $\phi$ tightens the bound (shrinks the gap) while maximizing it over $\theta$ raises $\log p_\theta(x)$ . The bound is never above the true log-likelihood, and equals it only when the encoder matches the true posterior.
$D_\text{KL}(q_\phi\|p) = \tfrac{1}{2}\sum_{j=1}^{d_z}\big(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2\big)$ . Each term is zero iff $\mu_j = 0$ and $\sigma_j = 1$ , so the KL is zero exactly when the posterior equals the standard-Gaussian prior.
With fixed-variance Gaussian conditionals, $\log p_\theta(x_i\mid x_{<i}) = -\tfrac{1}{2\sigma^2}\|x_i - \mu_\theta^{(i)}\|^2 + \text{const}$ . Summing over $i$ shows MLE is equivalent to minimizing $\sum_i\|x_i - \mu_\theta^{(i)}(x_{<i})\|^2$ — i.e. an autoregressive least-squares predictor (a Gaussian autoregressive model).
Forward KL $D_\text{KL}(\hat p\|p_\theta)$ is infinite wherever the data has mass but the model assigns ~zero density, so the model is forced to cover every data mode (mode-covering). Reverse KL penalizes the model placing mass where the data has none, so it collapses onto a single high-density mode (mode-seeking). For generative sampling, mode-covering is generally preferable — dropping modes means never generating whole classes of plausible samples.
No. The flow reports an exact log-likelihood while the VAE reports a lower bound: the true VAE log-likelihood is $\geq -3.1$ and could exceed $-2.3$. A valid comparison needs a tighter estimate of the VAE marginal likelihood (e.g. an importance-weighted or annealed importance sampling estimate) to set against the flow's exact value, together with matched preprocessing (same bits-per-dim convention, dequantization, and test set).
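
The closed-form Gaussian KL from the second solution can be sanity-checked against a Monte Carlo estimate of $\mathbb{E}_{q}[\log q(z) - \log p(z)]$. The sketch below assumes a two-dimensional diagonal Gaussian with arbitrary example values for $\mu$ and $\sigma$; the function names are hypothetical.

```python
import math
import random

def kl_closed_form(mu, sigma):
    """D_KL(N(mu, diag(sigma^2)) || N(0, I)) summed over latent dimensions."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))

def kl_monte_carlo(mu, sigma, num_samples, rng):
    """Estimate E_q[log q(z) - log p(z)] by sampling z ~ q."""
    total = 0.0
    for _ in range(num_samples):
        for m, s in zip(mu, sigma):
            z = rng.gauss(m, s)
            log_q = -0.5 * math.log(2 * math.pi * s * s) - 0.5 * ((z - m) / s) ** 2
            log_p = -0.5 * math.log(2 * math.pi) - 0.5 * z * z
            total += log_q - log_p
    return total / num_samples

mu, sigma = [0.5, -1.0], [0.8, 1.3]
print(kl_closed_form(mu, sigma), kl_monte_carlo(mu, sigma, 50000, random.Random(0)))
print(kl_closed_form([0.0, 0.0], [1.0, 1.0]))   # posterior equal to the prior
```

The two estimates in the first line agree up to sampling error, and the second line evaluates the formula where the posterior coincides with the prior.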

Looking ahead

With the probabilistic foundations established, the next lecture introduces the first complete generative architecture that directly optimizes the ELBO.

Week 2: Variational Autoencoders. We derive the VAE objective in full, implement the reparameterization trick that enables stochastic backpropagation, analyze the posterior collapse failure mode, and examine $\beta$ -VAE and hierarchical variants that improve representation quality.
