
Week 1: Probabilistic Foundations

✦Learning Outcomes
  • Compute the ELBO for a latent variable model and identify its reconstruction and regularization components
  • Recognize how the reparameterization trick enables stochastic backpropagation through latent samples (developed in full in Week 2)
  • Compare likelihood-based, implicit, and energy-based generative models on the expressiveness-tractability frontier
◆Prerequisites
  • Probability: random variables, expectation, conditional distributions, and KL divergence
  • Linear algebra and multivariable calculus: vectors, matrices, gradients
  • Maximum likelihood estimation and the basics of Bayesian inference
  • Neural network training and backpropagation

This is the first lecture of the course and assumes no prior exposure to generative modeling.

Purpose of this lecture

Generative modeling is the problem of learning a probability distribution $p(x)$ over high-dimensional data (images, audio, text, robot trajectories) and using that distribution to draw new samples, evaluate likelihoods, or infer latent structure. Every algorithm in this course (the variational autoencoder (VAE), the generative adversarial network (GAN), diffusion, flow matching) is a different answer to the same underlying question: how do we represent and compute with $p(x)$ when $x$ lives in thousands or millions of dimensions?

This lecture establishes the probabilistic vocabulary shared by all of them. The central objects are the likelihood function, the latent variable model, the evidence lower bound (ELBO), and amortized variational inference. These are not preliminary material — they are the load-bearing structure of everything that follows.


Maximum likelihood estimation

Given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ drawn i.i.d. from some true distribution $p^*(x)$, the maximum likelihood estimator (MLE) of a parametric model $p_\theta(x)$ is:

$$\theta^* = \arg\max_\theta \sum_{n=1}^N \log p_\theta(x^{(n)})$$

The log-likelihood sum is the empirical expectation $N \cdot \mathbb{E}_{x \sim \hat{p}}[\log p_\theta(x)]$, where $\hat{p}$ is the empirical distribution. Minimizing the negative log-likelihood is equivalent to minimizing the KL divergence $D_\text{KL}(\hat{p} \| p_\theta)$:

$$D_\text{KL}(\hat{p} \| p_\theta) = \mathbb{E}_{\hat{p}}[\log \hat{p}(x)] - \mathbb{E}_{\hat{p}}[\log p_\theta(x)]$$

The first term is a constant with respect to $\theta$, so maximizing $\mathbb{E}_{\hat{p}}[\log p_\theta(x)]$ exactly minimizes the KL from the data distribution to the model. MLE is thus the canonical method for fitting generative models: it directly minimizes the divergence between model and data.

Maximum a posteriori (MAP) estimation adds a prior $p(\theta)$ and maximizes $\log p_\theta(\mathcal{D}) + \log p(\theta)$. This is equivalent to MLE with regularization: a zero-mean isotropic Gaussian prior on $\theta$ with precision $\lambda$ contributes $\log p(\theta) = -\tfrac{\lambda}{2}\|\theta\|^2 + \text{const}$, which is an L2 penalty $\tfrac{\lambda}{2}\|\theta\|^2$ added to the negative log-likelihood (coefficient $\lambda$ under the common $\tfrac{1}{2}$ convention).
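As a small illustration of the MLE objective above, the following sketch fits a one-dimensional Gaussian to synthetic data and checks that the closed-form estimate does minimize the average negative log-likelihood. The toy dataset, the function names, and the choice of a Gaussian model are assumptions made for illustration only.

```python
import math
import random


def gaussian_log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


def avg_nll(data, mu, sigma):
    """Average negative log-likelihood, i.e. -E_{p_hat}[log p_theta(x)]."""
    return -sum(gaussian_log_pdf(x, mu, sigma) for x in data) / len(data)


def gaussian_mle(data):
    """Closed-form MLE for a Gaussian: sample mean and (biased) sample std."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    return mu, math.sqrt(var)


# Hypothetical toy dataset: 5000 draws from N(2.0, 1.5^2).
rng = random.Random(0)
data = [rng.gauss(2.0, 1.5) for _ in range(5000)]

mu_hat, sigma_hat = gaussian_mle(data)
nll_at_mle = avg_nll(data, mu_hat, sigma_hat)
print(f"mu_hat={mu_hat:.3f}  sigma_hat={sigma_hat:.3f}  avg NLL={nll_at_mle:.4f}")

# Any perturbation of the parameters should increase the average NLL.
print(avg_nll(data, mu_hat + 0.1, sigma_hat) > nll_at_mle)
print(avg_nll(data, mu_hat, sigma_hat * 1.1) > nll_at_mle)
```

Because the entropy term of the empirical distribution does not depend on the parameters, the parameters that minimize this average NLL are also the ones that minimize the forward KL to the empirical distribution.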


Latent variable models

Many natural distributions are most concisely described as a mixture: each observation $x$ is generated from a low-dimensional latent variable $z$ that captures the semantics (identity, pose, style), with variation in $x$ given $z$ representing rendering noise. The joint model is:

$$p_\theta(x, z) = p_\theta(x \mid z) \, p(z)$$

where $p(z)$ is the prior (typically $\mathcal{N}(0, I)$) and $p_\theta(x \mid z)$ is the likelihood (a neural network decoder). The marginal likelihood is obtained by integrating out the latent:

$$p_\theta(x) = \int p_\theta(x \mid z) \, p(z) \, dz$$

For continuous $z$ in even moderate dimension, this integral is intractable. We cannot evaluate $\log p_\theta(x)$ exactly, so we cannot directly optimize MLE. This is the fundamental difficulty that variational inference addresses.
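To make the marginalization concrete, here is a minimal sketch on a one-dimensional linear-Gaussian model, chosen because its marginal is known in closed form. The model ($z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, s^2)$, so $x \sim \mathcal{N}(0, 1 + s^2)$), the constant `NOISE_VAR`, and the function names are illustrative assumptions, not part of the lecture's models.

```python
import math
import random

NOISE_VAR = 0.25  # assumed decoder variance s^2 for this toy model


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def exact_marginal(x):
    """Closed-form p(x) = N(x; 0, 1 + s^2) for the linear-Gaussian model."""
    return normal_pdf(x, 0.0, 1.0 + NOISE_VAR)


def mc_marginal(x, num_samples, rng):
    """Naive Monte Carlo estimate: average p(x | z_k) over z_k drawn from the prior."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)              # z_k ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, NOISE_VAR)  # p(x | z_k)
    return total / num_samples


rng = random.Random(0)
x = 1.0
estimate = mc_marginal(x, 20000, rng)
print(f"exact p(x) = {exact_marginal(x):.4f}, Monte Carlo estimate = {estimate:.4f}")
```

In one dimension the naive estimator works well. In higher dimensions, almost every prior sample lands where $p_\theta(x \mid z)$ is negligible, so the estimator's variance becomes unusable; this is the practical sense in which the integral is intractable.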


The evidence lower bound

The ELBO (evidence lower bound) is a tractable lower bound on $\log p_\theta(x)$ obtained by introducing an approximate posterior $q_\phi(z \mid x)$ and applying Jensen's inequality:

$$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z) \, dz \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_\text{KL}(q_\phi(z \mid x) \| p(z))$$

This bound is tight when $q_\phi(z \mid x) = p_\theta(z \mid x)$, the true posterior. The derivation proceeds via:

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] + D_\text{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x))$$

The KL term on the right is non-negative, so the first term (the ELBO) is a lower bound. Jointly maximizing the ELBO over $(\theta, \phi)$ simultaneously trains the decoder to increase $\log p_\theta(x \mid z)$ and trains the encoder to produce posteriors that are close to the prior and to the true posterior.

The ELBO decomposes into two interpretable terms:

$$\mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]}_{\text{reconstruction}} - \underbrace{D_\text{KL}(q_\phi(z \mid x) \| p(z))}_{\text{regularization}}$$

The reconstruction term rewards the decoder for correctly generating $x$ given the latent sample. The KL term penalizes the encoder for producing posteriors that differ from the prior; it acts as a bottleneck that prevents the latent space from collapsing to a lookup table.

The derivation in full: Jensen's inequality step by step

The ELBO derivation is worth following carefully, as the same algebraic maneuver appears in nearly every subsequent model. Starting from the marginal log-likelihood:

$$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz$$

Introduce $q_\phi(z \mid x)$ via an importance weighting identity: multiply and divide inside the integral:

$$= \log \int q_\phi(z \mid x)\, \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\, dz \;=\; \log\, \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]$$

Jensen's inequality states that for any concave function $f$ and any random variable $W$, $f(\mathbb{E}[W]) \geq \mathbb{E}[f(W)]$. Because $\log$ is concave:

$$\log\,\mathbb{E}_{q_\phi}\!\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] \;\geq\; \mathbb{E}_{q_\phi}\!\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]$$

Expanding the log of the ratio recovers the two-term ELBO:

$$= \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] \;+\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(z)}{q_\phi(z \mid x)}\right] \;=\; \mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)] - D_\text{KL}(q_\phi(z \mid x) \| p(z))$$

The tightness of the bound follows immediately: subtracting the ELBO from $\log p_\theta(x)$ yields $D_\text{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x)) \geq 0$. Maximizing the ELBO is equivalent to simultaneously maximizing $\log p_\theta(x)$ and minimizing the gap $D_\text{KL}(q_\phi \| p_\theta(z \mid x))$, driving the encoder toward the true posterior.
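The bound and its tightness condition can be checked numerically on a model where every quantity is available in closed form. The sketch below assumes a one-dimensional linear-Gaussian model ($z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, s^2)$) and a Gaussian $q = \mathcal{N}(m, v)$; the model, the constant `NOISE_VAR`, and the function names are hypothetical choices for illustration.

```python
import math

NOISE_VAR = 0.25  # assumed decoder variance s^2


def log_marginal(x):
    """Exact log p(x) = log N(x; 0, 1 + s^2)."""
    var = 1.0 + NOISE_VAR
    return -0.5 * math.log(2 * math.pi * var) - x**2 / (2 * var)


def elbo(x, q_mean, q_var):
    """Closed-form ELBO for q(z | x) = N(q_mean, q_var)."""
    # Reconstruction term E_q[log p(x | z)], using E_q[(x - z)^2] = (x - m)^2 + v.
    recon = -0.5 * math.log(2 * math.pi * NOISE_VAR) - (
        (x - q_mean) ** 2 + q_var
    ) / (2 * NOISE_VAR)
    # Regularization term KL(N(m, v) || N(0, 1)).
    kl = 0.5 * (q_var + q_mean**2 - 1.0 - math.log(q_var))
    return recon - kl


def true_posterior(x):
    """Exact posterior p(z | x) = N(x / (1 + s^2), s^2 / (1 + s^2))."""
    return x / (1.0 + NOISE_VAR), NOISE_VAR / (1.0 + NOISE_VAR)


x = 1.0
post_mean, post_var = true_posterior(x)
print(f"log p(x)                 = {log_marginal(x):.5f}")
print(f"ELBO at true posterior   = {elbo(x, post_mean, post_var):.5f}")
print(f"ELBO with q set to prior = {elbo(x, 0.0, 1.0):.5f}")
```

With $q$ equal to the true posterior the two printed values coincide; any other choice of $(m, v)$ gives a strictly smaller ELBO, and the shortfall is the KL gap to the true posterior.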


Amortized variational inference

Classical variational inference optimizes a separate set of variational parameters for each data point, which does not generalize to new observations and does not scale to large datasets. Amortized inference solves both problems by training a single neural network (the inference network, or encoder) that takes $x$ as input and outputs the parameters of $q_\phi(z \mid x)$:

$$q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x),\, \sigma^2_\phi(x) \cdot I)$$

where $\mu_\phi$ and $\sigma_\phi$ are the outputs of a neural encoder. This amortization introduces an approximation gap (the encoder may not represent the true posterior exactly), but it provides: generalization to new $x$ without re-optimization, $O(1)$ inference time at test time, and mini-batch stochastic gradient optimization of the ELBO.

The gradient of the ELBO requires differentiating through a sampling operation $z \sim q_\phi(z \mid x)$. The reparameterization trick (covered in Week 2) enables this by writing $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, moving the stochasticity into the fixed noise variable $\epsilon$ so that gradients can flow through $\mu_\phi$ and $\sigma_\phi$.
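A minimal sketch of the reparameterization idea, ahead of the full treatment in Week 2: estimate the gradient of $\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[z^2]$ with respect to $\mu$ and $\sigma$ by differentiating through $z = \mu + \sigma\epsilon$. The test function $z^2$ is an assumption chosen because its expectation $\mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$; the derivatives are written out by hand rather than using an autodiff library.

```python
import random


def reparam_gradients(mu, sigma, num_samples, rng):
    """Monte Carlo pathwise gradients of E[z^2] for z = mu + sigma * eps."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # noise drawn independently of (mu, sigma)
        z = mu + sigma * eps        # reparameterized sample from N(mu, sigma^2)
        # Chain rule on f(z) = z^2: df/dz = 2z, dz/dmu = 1, dz/dsigma = eps.
        grad_mu += 2.0 * z
        grad_sigma += 2.0 * z * eps
    return grad_mu / num_samples, grad_sigma / num_samples


rng = random.Random(0)
mu, sigma = 1.5, 0.5
g_mu, g_sigma = reparam_gradients(mu, sigma, num_samples=50000, rng=rng)
print(f"estimated dE/dmu    = {g_mu:.3f}  (analytic {2 * mu:.3f})")
print(f"estimated dE/dsigma = {g_sigma:.3f}  (analytic {2 * sigma:.3f})")
```

The key point is that the randomness enters only through `eps`, whose distribution does not depend on the parameters, so each sample contributes an ordinary derivative.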


Score functions and denoising score matching

A concept that runs through the second half of the course is the score function of a distribution $p(x)$: the gradient of the log-density with respect to the data point $x$ itself (not the parameters):

$$s(x) = \nabla_x \log p(x)$$

The score points in the direction of steepest increase in log-probability, toward higher-density regions. For an energy-based model $p_\theta(x) \propto e^{-E_\theta(x)}$ (covered in Week 4), the score is $s_\theta(x) = -\nabla_x E_\theta(x)$: the negative gradient of the energy, free of the intractable partition function $Z(\theta)$. This makes scores far more useful than densities for many computational tasks: you can evaluate and differentiate a score without ever computing $p_\theta(x)$ itself.
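The independence of the score from the partition function can be seen in a few lines. The sketch below uses a hypothetical one-dimensional double-well energy and compares the analytic score with a finite-difference derivative of the log-density under two different (arbitrary) values of the log normalizer; the energy function and all names are assumptions for illustration.

```python
def energy(x):
    """Hypothetical double-well energy E(x) = x^4 / 4 - x^2 / 2."""
    return 0.25 * x**4 - 0.5 * x**2


def score(x):
    """Analytic score s(x) = -dE/dx = -(x^3 - x)."""
    return -(x**3 - x)


def log_density(x, log_z):
    """log p(x) = -E(x) - log Z, with log Z treated as an arbitrary constant."""
    return -energy(x) - log_z


def finite_diff_score(x, log_z, h=1e-5):
    """Central finite-difference approximation of d/dx log p(x)."""
    return (log_density(x + h, log_z) - log_density(x - h, log_z)) / (2 * h)


for x in (-1.5, -0.3, 0.0, 0.7, 2.0):
    print(
        f"x={x:+.1f}  analytic={score(x):+.4f}  "
        f"fd(logZ=0)={finite_diff_score(x, 0.0):+.4f}  "
        f"fd(logZ=7.3)={finite_diff_score(x, 7.3):+.4f}"
    )
```

Changing `log_z` shifts the log-density by a constant and leaves its derivative untouched, which is why score-based methods never need the normalizer.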

Score matching (Hyvärinen, 2005) trains a score network $s_\theta(x) \approx \nabla_x \log p(x)$ by minimizing:

$$J(\theta) = \mathbb{E}_{p(x)}\!\left[\tfrac{1}{2}\|s_\theta(x)\|^2 + \nabla_x \cdot s_\theta(x)\right]$$

where $\nabla_x \cdot s_\theta = \sum_i \partial [s_\theta]_i / \partial x_i$ is the divergence. Crucially, this objective equals $\mathbb{E}_{p(x)}[\tfrac{1}{2}\|s_\theta(x) - \nabla_x \log p(x)\|^2]$ up to a constant, so it can be estimated without the true score, using only samples from $p(x)$.

Denoising score matching (DSM; Vincent, 2011) circumvents the expensive divergence computation. Corrupt each data point with Gaussian noise, $\tilde{x} = x + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and train the network to recover the clean direction:

$$J_\text{DSM}(\theta) = \mathbb{E}_{x,\, \epsilon}\!\left[\left\| s_\theta(\tilde{x},\, \sigma) + \frac{\epsilon}{\sigma} \right\|^2\right]$$

Each training target is $-\epsilon/\sigma = \nabla_{\tilde{x}} \log p_\sigma(\tilde{x} \mid x)$, the score of the Gaussian corruption kernel. Because the loss is a squared error, its minimizer is the conditional mean of that target, $s_\theta(\tilde{x}, \sigma) = \mathbb{E}[-\epsilon/\sigma \mid \tilde{x}] = \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$: the score of the noisy distribution, which points back toward the clean data. Up to a per-noise-level weighting and a rescaling of the network output, this is the noise-prediction objective that DDPM (Week 6) optimizes: the diffusion model's training loss is DSM applied simultaneously across a schedule of noise levels $\sigma_t$. Establishing this connection now means that when the DDPM objective appears, it will be recognizable as a principled probabilistic training method that never evaluates a normalized density, rather than an ad hoc engineering choice.
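The claim that the DSM minimizer is the score of the noisy distribution can be checked on a toy case. The sketch below assumes data $x \sim \mathcal{N}(0, 1)$, so the noisy marginal is $\mathcal{N}(0, 1 + \sigma^2)$ with score $-\tilde{x}/(1 + \sigma^2)$, and fits a one-parameter linear score model $s(\tilde{x}) = a\tilde{x}$ by least squares on the DSM targets. The data distribution, the linear model, and the function name are illustrative assumptions.

```python
import random


def fit_linear_dsm(sigma, num_samples, rng):
    """Least-squares fit of s(x_noisy) = a * x_noisy to DSM targets -eps / sigma."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)        # clean data point, x ~ N(0, 1)
        eps = rng.gauss(0.0, 1.0)      # corruption noise
        x_noisy = x + sigma * eps      # corrupted sample
        target = -eps / sigma          # per-sample DSM regression target
        numerator += x_noisy * target
        denominator += x_noisy * x_noisy
    # Minimizer of sum (a * x_noisy - target)^2 over a.
    return numerator / denominator


sigma = 0.5
a_hat = fit_linear_dsm(sigma, num_samples=100000, rng=random.Random(0))
a_true = -1.0 / (1.0 + sigma**2)  # slope of the true noisy score -x / (1 + sigma^2)
print(f"fitted slope = {a_hat:.4f}, true noisy-score slope = {a_true:.4f}")
```

Individual targets $-\epsilon/\sigma$ are noisy and do not equal the score at any given $\tilde{x}$; only their conditional average does, which is what the regression recovers.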


Autoregressive models

An alternative to latent variable models is to factorize $p(x)$ using the chain rule of probability without any latent variables:

$$p_\theta(x) = \prod_{i=1}^D p_\theta(x_i \mid x_1, \ldots, x_{i-1})$$

This factorization is exact for any joint distribution. A model that parameterizes each conditional $p_\theta(x_i \mid x_{<i})$ with a neural network (a PixelCNN for images, a transformer for tokens) is an autoregressive model. Autoregressive models have exact likelihood: $\log p_\theta(x)$ is computable as a sum of conditional log-likelihoods, enabling direct MLE training.

The tradeoff is sampling efficiency: generating a sample requires $D$ sequential neural network evaluations, one per dimension, which is slow for high-dimensional $x$. Latent variable models and diffusion models generate $x$ in $O(1)$ forward passes (or $O(T)$ diffusion steps) independently of the data dimension.

Connections to language models: GPT-style transformers are autoregressive models over discrete token sequences, trained with exactly the same MLE objective. The sequence probability factorizes as $p_\theta(x_1, \ldots, x_T) = \prod_t p_\theta(x_t \mid x_{<t})$, so the negative log-likelihood $-\log p_\theta(x_1, \ldots, x_T) = -\sum_t \log p_\theta(x_t \mid x_{<t})$ is the sum of the cross-entropy losses computed at each position. The difference from image autoregressive models lies mainly in the data type (text tokens vs. pixel intensities) and the conditional architecture (self-attention vs. masked convolutions).
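The chain-rule factorization can be sketched with a tiny autoregressive model over binary sequences. The conditional used here (a logistic function of the number of ones seen so far) is a hypothetical stand-in for a neural network; the point is only that the log-likelihood is an exact sum of conditional terms and that sampling proceeds one position at a time.

```python
import itertools
import math
import random


def cond_prob_one(prefix):
    """Hypothetical conditional p(x_i = 1 | x_<i): logistic in the count of ones so far."""
    logit = -0.5 + 1.2 * sum(prefix)
    return 1.0 / (1.0 + math.exp(-logit))


def log_prob(seq):
    """Exact log p(x) as a sum of conditional log-likelihoods (chain rule)."""
    total = 0.0
    for i, x_i in enumerate(seq):
        p_one = cond_prob_one(seq[:i])
        total += math.log(p_one if x_i == 1 else 1.0 - p_one)
    return total


def sample(length, rng):
    """Ancestral sampling: one conditional evaluation per position, in order."""
    seq = []
    for _ in range(length):
        seq.append(1 if rng.random() < cond_prob_one(seq) else 0)
    return tuple(seq)


# The factorization defines a valid distribution: probabilities of all 2^4 sequences sum to 1.
total_mass = sum(math.exp(log_prob(s)) for s in itertools.product([0, 1], repeat=4))
print(f"total probability mass = {total_mass:.12f}")

draw = sample(4, random.Random(0))
print(f"sample = {draw}, negative log-likelihood = {-log_prob(draw):.4f}")
```

The negative of `log_prob` is the sum of per-position binary cross-entropy terms, which is the same quantity a language model's training loss accumulates over tokens.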


The expressiveness–tractability frontier

Every generative model design choice is a point on an expressiveness–tractability frontier: more expressive representations of $p_\theta(x)$ generally make evaluation, sampling, or training harder.

Fully tractable models (normalizing flows, autoregressive models) can evaluate $\log p_\theta(x)$ exactly and sample in a single forward pass (flows) or in $D$ sequential steps (autoregressive models). Their training is simple MLE. Their cost is architectural constraint: flows must be invertible with tractable Jacobian determinants; autoregressive models must factorize along dimensions.

Latent-variable models (VAEs) relax the architectural constraint (the decoder can be any neural network) at the cost of an intractable marginal likelihood. The ELBO approximates MLE at the price of an amortization gap and a bound gap. Sampling is cheap ($z \sim p(z)$, then one decoder forward pass), but reconstruction quality is limited by the quality of the encoder.

Implicit models (GANs) abandon likelihood evaluation entirely. The generator is completely free — there is no constraint on its architecture. The cost is training instability: optimizing a minimax game rather than a simple loss. Mode collapse (the generator covering only a subset of the data distribution) is the signature failure mode.

Energy-based models express the most general probability distributions but make both evaluation and sampling hard: the partition function is intractable (making MLE impossible directly) and sampling requires MCMC chains that may not mix in practice.

Diffusion and flow matching models occupy a distinctive point: they evaluate $\log p_\theta(x)$ only approximately (via a variational bound or through ODE likelihood), but they achieve state-of-the-art sample quality by leveraging deep networks in a many-step generation process, trading single-pass efficiency for iterative quality.

Understanding where each model falls on this frontier — and why the engineering constraints of the deployment scenario (need for exact likelihoods? need for fast sampling? need for conditional generation?) determine which model class is appropriate — is the meta-lesson that every lecture builds toward.


The generative model taxonomy

The models in this course differ in how they represent $p_\theta(x)$:

Likelihood-based models (VAEs, normalizing flows, diffusion) can evaluate (or bound) $\log p_\theta(x)$ and train by maximizing it. They have well-behaved training dynamics because the loss is a proxy for the data likelihood.

Implicit models (GANs) represent $p_\theta(x)$ only through a sampler (the generator). Likelihood evaluation requires intractable density estimation. Training uses an adversarial game rather than maximum likelihood.

Energy-based models define $p_\theta(x) \propto e^{-E_\theta(x)}$, where $E_\theta$ is a learned energy function. Likelihood evaluation requires computing the normalizing constant (the partition function), which is typically intractable, requiring approximate MCMC methods.


GenAI context: probabilistic foundations across the course sequence

The ELBO, latent variable model, and score function are not just Course 3 abstractions — they are the same objects used throughout the course sequence under different names.

| Concept | Generative models (this course) | Robotics (Course 2) | Reinforcement learning (Course 1) |
|---|---|---|---|
| Latent variable model | VAE encoder–decoder | CVAE in ACT (Action Chunking with Transformers); action codes | World model latent state $h_t$ |
| Prior $p(z)$ | $\mathcal{N}(0, I)$ over image latents | Gaussian prior over action modes | Dynamics prior in RSSM |
| Encoder $q_\phi(z \mid x)$ | Image→latent inference | Demo sequence→action code | Posterior state estimation |
| Score function | $\nabla_x \log p(x)$ in diffusion | Energy gradient for imitation | Reward gradient for policy search |
| Autoregressive factorization | PixelCNN, Transformer LM | Token-by-token VLA action generation | POMDP (partially observable Markov decision process) belief update |
| ELBO bound gap | $D_\text{KL}(q_\phi \,\Vert\, p_\theta(z \mid x))$ | Encoder approximation error | Variational inference in POMDP |

The ELBO and its gradient are the same mathematical objects whether they are computed over robot demonstration sequences, image pixels, or language tokens. A practitioner who understands the unified probabilistic framework can recognize that a diffusion policy for manipulation, a VAE for image compression, and an autoregressive language model are all distinct engineering answers to the same underlying question: how do we represent and compute with a high-dimensional probability distribution?


Key takeaways

MLE minimizes the KL divergence from the data distribution to the model and is the training objective underlying all likelihood-based generative models. Latent variable models $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$ capture data structure through a low-dimensional semantic bottleneck but require variational approximation because the marginal $p_\theta(x)$ is intractable. The ELBO lower-bounds $\log p_\theta(x)$ as a reconstruction term minus a KL regularization term, and is tight exactly when the encoder distribution equals the true posterior. Amortized inference parameterizes $q_\phi(z \mid x)$ with a single shared encoder, enabling scalable stochastic training. Autoregressive models factor $p(x)$ along dimensions using the chain rule, enabling exact likelihood but requiring sequential sampling.


Conceptual questions

  1. The ELBO can be written as $\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)] - D_\text{KL}(q_\phi(z \mid x) \| p(z))$. Show algebraically that maximizing the ELBO is equivalent to minimizing $D_\text{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x))$ while maximizing $\log p_\theta(x)$. What does this imply about the gap between the ELBO and the true log-likelihood?

  2. A latent variable model uses a standard Gaussian prior $p(z) = \mathcal{N}(0, I)$ and a Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$. Derive the closed-form KL term $D_\text{KL}(q_\phi(z \mid x) \| p(z))$ as a function of $\mu_\phi$ and $\sigma_\phi$. Under what conditions does this KL term equal zero?

  3. Consider an autoregressive model factored as $p_\theta(x) = \prod_i p_\theta(x_i \mid x_{<i})$. If each conditional is modeled as a Gaussian with mean $\mu_\theta^{(i)}(x_{<i})$ and fixed variance $\sigma^2$, show that maximizing the log-likelihood is equivalent to minimizing the sum of squared prediction errors $\sum_i \|x_i - \mu_\theta^{(i)}(x_{<i})\|^2$. What generative model does this correspond to?

  4. MLE minimizes $D_\text{KL}(\hat{p} \| p_\theta)$ (forward KL) rather than $D_\text{KL}(p_\theta \| \hat{p})$ (reverse KL). Explain geometrically why forward KL leads to mode covering behavior (the model spreads to cover all modes of the data) while reverse KL leads to mode seeking (the model collapses to one mode). Which behavior is preferable for generative sampling, and why does the choice of KL direction matter for the resulting model?

  5. A normalizing flow model achieves $\log p_\theta(x) = -2.3$ nats on a test image, while a VAE achieves an ELBO of $-3.1$ nats on the same image. Can you conclude the flow is a better model? What additional information would you need to make a valid comparison between the two models?

✦Solutions
  1. Subtract the ELBO from $\log p_\theta(x)$: the gap is exactly $D_\text{KL}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)) \geq 0$. So maximizing the ELBO over $\phi$ tightens the bound (shrinks the gap) while maximizing it over $\theta$ raises $\log p_\theta(x)$. The bound is never above the true log-likelihood, and equals it only when the encoder matches the true posterior.
  2. $D_\text{KL}(q_\phi \| p) = \tfrac{1}{2}\sum_{j=1}^{d_z}\big(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2\big)$. Each term is zero iff $\mu_j = 0$ and $\sigma_j = 1$, so the KL is zero exactly when the posterior equals the standard-Gaussian prior.
  3. With fixed-variance Gaussian conditionals, $\log p_\theta(x_i \mid x_{<i}) = -\tfrac{1}{2\sigma^2}\|x_i - \mu_\theta^{(i)}\|^2 + \text{const}$. Summing over $i$ shows MLE is equivalent to minimizing $\sum_i \|x_i - \mu_\theta^{(i)}(x_{<i})\|^2$, i.e. an autoregressive least-squares predictor (a Gaussian autoregressive model).
  4. Forward KL $D_\text{KL}(\hat{p} \| p_\theta)$ is infinite wherever the data has mass but the model assigns ~zero density, so the model is forced to cover every data mode (mode-covering). Reverse KL penalizes the model placing mass where the data has none, so it collapses onto a single high-density mode (mode-seeking). For generative sampling, mode-covering is generally preferable: dropping modes means never generating whole classes of plausible samples.
  5. No. The flow reports an exact log-likelihood while the VAE reports a lower bound: the true VAE log-likelihood is $\geq -3.1$ and could exceed $-2.3$. A valid comparison needs both a tight estimate of the VAE marginal likelihood (e.g. an importance-weighted or annealed importance sampling estimate) and matched preprocessing (same bits-per-dim convention, dequantization, and test set).
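
The closed form in Solution 2 can be sanity-checked against a Monte Carlo estimate of $\mathbb{E}_{q}[\log q(z) - \log p(z)]$. The sketch below does this for a diagonal Gaussian with hypothetical parameter values; the function names and the specific means and standard deviations are assumptions for illustration.

```python
import math
import random


def kl_closed_form(mu, sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)) using the closed form from Solution 2."""
    return 0.5 * sum(s**2 + m**2 - 1.0 - math.log(s**2) for m, s in zip(mu, sigma))


def kl_monte_carlo(mu, sigma, num_samples, rng):
    """Monte Carlo estimate of E_q[log q(z) - log p(z)] with z sampled from q."""
    total = 0.0
    for _ in range(num_samples):
        for m, s in zip(mu, sigma):
            z = rng.gauss(m, s)
            log_q = -0.5 * math.log(2 * math.pi * s * s) - (z - m) ** 2 / (2 * s * s)
            log_p = -0.5 * math.log(2 * math.pi) - z * z / 2
            total += log_q - log_p  # dimensions are independent, so terms add
    return total / num_samples


mu = [0.5, -1.0]    # hypothetical encoder means
sigma = [0.8, 1.3]  # hypothetical encoder standard deviations

exact = kl_closed_form(mu, sigma)
approx = kl_monte_carlo(mu, sigma, num_samples=50000, rng=random.Random(0))
print(f"closed form = {exact:.4f}, Monte Carlo = {approx:.4f}")
print(f"KL at the prior itself = {kl_closed_form([0.0, 0.0], [1.0, 1.0]):.4f}")
```

The last line illustrates the zero condition: the KL vanishes only when every mean is 0 and every standard deviation is 1.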

Looking ahead

With the probabilistic foundations established, the next lecture introduces the first complete generative architecture that directly optimizes the ELBO.

Week 2: Variational Autoencoders. We derive the VAE objective in full, implement the reparameterization trick that enables stochastic backpropagation, analyze the posterior collapse failure mode, and examine $\beta$-VAE and hierarchical variants that improve representation quality.


Further reading

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. (Chapters on probability distributions and latent variables).
  • Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. (Variational inference and deep generative models).
  • Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. ICLR. (Introduces amortized variational inference and the reparameterization trick).
  • Hyvärinen, A. (2005). Estimation of Non-Normalized Statistical Models by Score Matching. JMLR. (The original score matching objective).
  • Vincent, P. (2011). A Connection Between Score Matching and Denoising Autoencoders. Neural Computation. (Denoising score matching, the basis of the DDPM objective).