Purpose of this lecture
Generative modeling is the problem of learning a probability distribution over high-dimensional data (images, audio, text, robot trajectories) and using that distribution to draw new samples, evaluate likelihoods, or infer latent structure. Every algorithm in this course (variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion, flow matching) is a different answer to the same underlying question: how do we represent and compute with $p(x)$ when $x$ lives in thousands or millions of dimensions?
This lecture establishes the probabilistic vocabulary shared by all of them. The central objects are the likelihood function, the latent variable model, the evidence lower bound (ELBO), and amortized variational inference. These are not preliminary material — they are the load-bearing structure of everything that follows.
Maximum likelihood estimation
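Given a dataset $\mathcal{D} = \{x_i\}_{i=1}^{N}$ drawn i.i.d. from some true distribution $p_{\text{data}}(x)$, the maximum likelihood estimator (MLE) of a parametric model $p_\theta(x)$ is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_\theta(x_i)$$

Up to a factor of $1/N$, the log-likelihood sum is the empirical expectation $\mathbb{E}_{x \sim \hat{p}_{\text{data}}}[\log p_\theta(x)]$, where $\hat{p}_{\text{data}}$ is the empirical distribution. Minimizing the negative log-likelihood is equivalent to minimizing the KL divergence $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta)$:

$$D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}}[\log p_{\text{data}}(x)] - \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]$$

The first term is a constant with respect to $\theta$, so maximizing $\mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]$ exactly minimizes the KL from the data distribution to the model. MLE is thus the canonical method for fitting generative models: it directly minimizes the divergence between model and data.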
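As a small numerical illustration of the identity above, the sketch below uses a three-outcome categorical distribution. The probabilities and the candidate models are made-up values for illustration, not taken from the lecture. It checks that the KL divergence equals cross-entropy minus entropy, and that the candidate with the lowest expected negative log-likelihood is the one that matches the data distribution.

```python
import math

def entropy(p):
    """H(p) = -sum_k p_k log p_k, in nats."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    """-E_p[log q]: the expected negative log-likelihood of model q under p."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl(p, q):
    """D_KL(p || q) = sum_k p_k log(p_k / q_k)."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Hypothetical "true" data distribution over three outcomes.
p_data = [0.5, 0.3, 0.2]

# Hypothetical candidate models; the first one matches p_data exactly.
candidates = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [1 / 3, 1 / 3, 1 / 3]]

for q in candidates:
    nll = cross_entropy(p_data, q)
    print(f"model={q}  expected NLL={nll:.4f}  KL={kl(p_data, q):.4f}")

# The entropy term does not depend on the model, so ranking by expected NLL
# and ranking by KL pick the same winner.
best = min(candidates, key=lambda q: cross_entropy(p_data, q))
```

Because the entropy of `p_data` is the same for every candidate, the two printed columns differ by a constant, which is the sense in which MLE and KL minimization coincide.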
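Maximum a posteriori (MAP) estimation adds a prior $p(\theta)$ and maximizes $\log p(\mathcal{D} \mid \theta) + \log p(\theta)$. This is equivalent to MLE with regularization: a zero-mean Gaussian prior on $\theta$ with precision $\lambda$ corresponds to L2 regularization with coefficient $\lambda/2$, since $\log p(\theta) = -\tfrac{\lambda}{2}\|\theta\|^2 + \text{const}$.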
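The MAP and L2 correspondence can be checked on a one-parameter toy problem. The sketch below assumes unit-variance Gaussian observations with unknown mean and a zero-mean Gaussian prior with precision `lam`; the data values and `lam` are made up for illustration. Under those assumptions the penalized objective has a closed-form minimizer, which shrinks the MLE toward zero.

```python
def neg_log_posterior(theta, xs, lam):
    """Negative log-posterior up to an additive constant.

    Assumes x_i ~ N(theta, 1) and theta ~ N(0, 1 / lam).
    """
    nll = 0.5 * sum((x - theta) ** 2 for x in xs)   # negative log-likelihood
    penalty = 0.5 * lam * theta ** 2                # L2 term from the prior
    return nll + penalty

def map_estimate(xs, lam):
    """Closed-form minimizer of neg_log_posterior for this toy model."""
    return sum(xs) / (len(xs) + lam)

xs = [2.1, 1.9, 2.4, 1.6]   # made-up observations
lam = 2.0                   # made-up prior precision

theta_mle = map_estimate(xs, lam=0.0)   # no prior: plain sample mean
theta_map = map_estimate(xs, lam)       # shrunk toward the prior mean 0

print(f"MLE: {theta_mle:.4f}, MAP: {theta_map:.4f}")
```

Setting `lam=0.0` recovers the MLE, so the prior precision plays exactly the role of a regularization strength.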
Latent variable models
Many natural distributions are most concisely described as a mixture: each observation $x$ is generated from a low-dimensional latent variable $z$ that captures the semantics (identity, pose, style), with variation in $x$ given $z$ representing rendering noise. The joint model is:

$$p_\theta(x, z) = p(z)\, p_\theta(x \mid z)$$

where $p(z)$ is the prior (typically $\mathcal{N}(0, I)$) and $p_\theta(x \mid z)$ is the likelihood (a neural network decoder). The marginal likelihood is obtained by integrating out the latent:

$$p_\theta(x) = \int p(z)\, p_\theta(x \mid z)\, dz$$

For continuous $z$ in even moderate dimension, this integral is intractable. We cannot evaluate $p_\theta(x)$ exactly, so we cannot directly optimize MLE. This is the fundamental difficulty that variational inference addresses.
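To make the marginalization concrete, here is a minimal sketch of the naive Monte Carlo estimator $p(x) \approx \frac{1}{S}\sum_s p(x \mid z_s)$ with $z_s$ drawn from the prior. It assumes a one-dimensional linear-Gaussian toy model (decoder mean equal to $z$, fixed noise variance), chosen only because the exact marginal is known in closed form for comparison. The function names and constants are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mc_marginal(x, noise_var, num_samples, rng):
    """Naive Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)]."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)               # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, noise_var)  # toy decoder: p(x | z) = N(x; z, noise_var)
    return total / num_samples

rng = random.Random(0)
x_obs, noise_var = 1.0, 0.25                  # made-up observation and noise level

estimate = mc_marginal(x_obs, noise_var, 20_000, rng)
exact = normal_pdf(x_obs, 0.0, 1.0 + noise_var)   # closed form for this toy model only

print(f"MC estimate: {estimate:.4f}, exact: {exact:.4f}")
```

In one dimension the estimate is accurate, but with a neural decoder there is no closed form to compare against, and prior samples rarely land where $p(x \mid z)$ is large once $z$ has more than a few dimensions, which is the practical face of the intractability described above.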
The evidence lower bound
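The ELBO (evidence lower bound) is a tractable lower bound on $\log p_\theta(x)$ obtained by introducing an approximate posterior $q_\phi(z \mid x)$ and applying Jensen's inequality:

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] =: \mathcal{L}(\theta, \phi; x)$$

This bound is tight when $q_\phi(z \mid x) = p_\theta(z \mid x)$, the true posterior. The derivation proceeds via:

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$

The KL term on the right is non-negative, so the first term (the ELBO) is a lower bound. Jointly maximizing the ELBO over $(\theta, \phi)$ simultaneously trains the decoder to increase $p_\theta(x \mid z)$ and trains the encoder to produce posteriors that are close to the prior and to the true posterior.
The ELBO decomposes into two interpretable terms:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$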
The reconstruction term rewards the decoder for correctly generating given the latent sample. The KL term penalizes the encoder for producing posteriors that differ from the prior — it acts as a bottleneck that prevents the latent space from collapsing to a lookup table.
The derivation in full: Jensen's inequality step by step
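The ELBO derivation is worth following carefully, as the same algebraic maneuver appears in nearly every subsequent model. Starting from the marginal log-likelihood:

$$\log p_\theta(x) = \log \int p_\theta(x, z)\, dz$$

Introduce $q_\phi(z \mid x)$ via an importance weighting identity, multiplying and dividing inside the integral:

$$\log p_\theta(x) = \log \int q_\phi(z \mid x)\, \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$

Jensen's inequality states that for any concave function $f$ and any random variable $Y$, $f(\mathbb{E}[Y]) \ge \mathbb{E}[f(Y)]$. Because $\log$ is concave:

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$

Expanding the log of the ratio with $p_\theta(x, z) = p(z)\, p_\theta(x \mid z)$ recovers the two-term ELBO:

$$\mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

The tightness of the bound follows immediately: subtracting the ELBO from $\log p_\theta(x)$ yields $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) \ge 0$. Maximizing the ELBO is equivalent to simultaneously maximizing $\log p_\theta(x)$ and minimizing the gap $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$, driving the encoder toward the true posterior.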
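The bound and its gap can be verified numerically in a case where every term has a closed form. The sketch below assumes a scalar linear-Gaussian toy model ($z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z, s^2)$) and a Gaussian $q$; the observation, noise variance, and the trial settings of $q$ are made-up values. For each $q$ it checks that the ELBO plus the KL to the true posterior reproduces $\log p(x)$.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log-density of N(mean, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def kl_gauss(m1, v1, m2, v2):
    """D_KL(N(m1, v1) || N(m2, v2)) for scalar Gaussians."""
    return 0.5 * (v1 / v2 + (m1 - m2) ** 2 / v2 - 1.0 + math.log(v2 / v1))

def elbo(x, q_mean, q_var, noise_var):
    """Reconstruction term minus KL(q || prior), both in closed form."""
    # E_q[log N(x; z, noise_var)] with z ~ N(q_mean, q_var):
    # E_q[(x - z)^2] = (x - q_mean)^2 + q_var.
    recon = (-0.5 * math.log(2 * math.pi * noise_var)
             - ((x - q_mean) ** 2 + q_var) / (2 * noise_var))
    return recon - kl_gauss(q_mean, q_var, 0.0, 1.0)

x_obs, noise_var = 1.0, 0.25                           # made-up values
log_px = log_normal_pdf(x_obs, 0.0, 1.0 + noise_var)   # exact log-evidence
post_mean = x_obs / (1.0 + noise_var)                  # true posterior mean
post_var = noise_var / (1.0 + noise_var)               # true posterior variance

# Three trial choices of q: the prior, an arbitrary Gaussian, the true posterior.
trials = [(0.0, 1.0), (0.5, 0.5), (post_mean, post_var)]
for q_mean, q_var in trials:
    bound = elbo(x_obs, q_mean, q_var, noise_var)
    gap = kl_gauss(q_mean, q_var, post_mean, post_var)
    print(f"q=N({q_mean:.2f}, {q_var:.2f})  ELBO={bound:.4f}  "
          f"gap={gap:.4f}  ELBO+gap={bound + gap:.4f}  log p(x)={log_px:.4f}")
```

The last trial sets $q$ to the true posterior, so its gap is zero and the ELBO equals the log-evidence; the other two sit strictly below it.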
Amortized variational inference
Classical variational inference optimizes a separate set of variational parameters for each data point, which does not generalize to new observations and does not scale to large datasets. Amortized inference solves both problems by training a single neural network, the inference network or encoder, that takes $x$ as input and outputs the parameters of $q_\phi(z \mid x)$:

$$q_\phi(z \mid x) = \mathcal{N}\big(z;\, \mu_\phi(x),\, \operatorname{diag}(\sigma_\phi^2(x))\big)$$

where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the outputs of a neural encoder. This amortization introduces an approximation gap (the encoder may not represent the true posterior exactly), but it provides: generalization to new $x$ without re-optimization, $O(1)$ inference time at test time (one encoder forward pass rather than a per-example optimization), and mini-batch stochastic gradient optimization of the ELBO.
The gradient of the ELBO requires differentiating through a sampling operation $z \sim q_\phi(z \mid x)$. The reparameterization trick (covered in Week 2) enables this by writing $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, moving the stochasticity into the fixed noise variable $\epsilon$ so that gradients can flow through $\mu_\phi$ and $\sigma_\phi$.
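A minimal sketch of the reparameterized gradient, assuming a scalar Gaussian and the stand-in integrand $f(z) = z^2$ in place of the ELBO integrand (chosen because the exact gradients, $2\mu$ and $2\sigma$, are known). The function name, sample count, and parameter values are hypothetical, and the derivatives are written out by hand rather than taken by an autodiff library.

```python
import random

def reparam_grads(mu, sigma, num_samples, rng):
    """Estimate d/dmu and d/dsigma of E_{z ~ N(mu, sigma^2)}[z^2].

    Uses the reparameterization z = mu + sigma * eps with eps ~ N(0, 1).
    """
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # noise whose distribution does not depend on (mu, sigma)
        z = mu + sigma * eps        # deterministic, differentiable in (mu, sigma)
        df_dz = 2.0 * z             # derivative of the stand-in f(z) = z^2
        g_mu += df_dz * 1.0         # chain rule: dz/dmu = 1
        g_sigma += df_dz * eps      # chain rule: dz/dsigma = eps
    return g_mu / num_samples, g_sigma / num_samples

rng = random.Random(0)
mu, sigma = 1.5, 0.8                # made-up parameter values

g_mu, g_sigma = reparam_grads(mu, sigma, 20_000, rng)

# Analytic reference: E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu and 2*sigma.
print(f"d/dmu:    estimate {g_mu:.3f}, exact {2 * mu:.3f}")
print(f"d/dsigma: estimate {g_sigma:.3f}, exact {2 * sigma:.3f}")
```

In a VAE, `mu` and `sigma` would be encoder outputs and the chain rule would continue into the encoder weights; the structure of the estimator is otherwise the same.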
Score functions and denoising score matching
A concept that runs through the second half of the course is the score function of a distribution $p(x)$: the gradient of the log-density with respect to the data point itself (not the parameters):

$$s(x) = \nabla_x \log p(x)$$

The score points in the direction of steepest increase in log-probability, toward higher-density regions. For an energy-based model $p_\theta(x) = \exp(-E_\theta(x)) / Z_\theta$ (covered in Week 4), the score is $-\nabla_x E_\theta(x)$: the negative gradient of the energy, free of the intractable partition function $Z_\theta$. This makes scores far more useful than densities for many computational tasks: you can evaluate and differentiate a score without ever computing the normalized density $p_\theta(x)$ itself.
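The claim that the partition function drops out of the score can be checked directly. The sketch below assumes a scalar Gaussian written as an energy-based model, where the log partition function is known, and compares finite-difference gradients of the unnormalized and normalized log-densities. The constants and helper names are hypothetical.

```python
import math

MU, SIGMA = 1.0, 2.0   # made-up parameters of the toy distribution

def energy(x):
    """E(x) for a Gaussian: p(x) is proportional to exp(-E(x))."""
    return 0.5 * (x - MU) ** 2 / SIGMA ** 2

def log_density(x):
    """Normalized log-density, which needs the log partition function."""
    log_z = 0.5 * math.log(2 * math.pi * SIGMA ** 2)
    return -energy(x) - log_z

def grad(f, x, h=1e-5):
    """Central finite-difference derivative of a scalar function."""
    return (f(x + h) - f(x - h)) / (2 * h)

points = [-2.0, 0.0, 3.0]
for x in points:
    score_from_energy = -grad(energy, x)       # never touches log_z
    score_from_density = grad(log_density, x)  # log_z is constant in x, so it cancels
    analytic = -(x - MU) / SIGMA ** 2
    print(f"x={x:+.1f}  -dE/dx={score_from_energy:+.6f}  "
          f"dlogp/dx={score_from_density:+.6f}  analytic={analytic:+.6f}")
```

Both routes give the same number at every point, which is why a model can be trained on scores even when `log_z` is unavailable.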
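Score matching (Hyvärinen, 2005) trains a score network $s_\theta(x)$ by minimizing:

$$J(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\tfrac{1}{2}\,\|s_\theta(x)\|^2 + \operatorname{tr}\big(\nabla_x s_\theta(x)\big)\right]$$

where $\operatorname{tr}(\nabla_x s_\theta(x)) = \nabla_x \cdot s_\theta(x)$ is the divergence. Crucially, this objective equals $\tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\|s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2\big]$ up to a constant, without requiring the true score: only samples from $p_{\text{data}}$ are needed.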
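As a sanity check on this objective, the sketch below assumes scalar zero-mean Gaussian data and a one-parameter linear score model $s_a(x) = -a\,x$, whose divergence is simply $-a$. The data variance, sample count, and search grid are made-up choices. Minimizing the empirical objective should recover $a \approx 1/\text{variance}$, the coefficient of the true score, using samples only.

```python
import random

def sm_objective(a, xs):
    """Empirical score matching objective for the linear model s_a(x) = -a * x.

    Per sample: 0.5 * s_a(x)^2 + d s_a / dx, where d s_a / dx = -a.
    """
    return sum(0.5 * (a * x) ** 2 - a for x in xs) / len(xs)

rng = random.Random(0)
true_var = 2.0                                   # made-up data variance
xs = [rng.gauss(0.0, true_var ** 0.5) for _ in range(10_000)]

# Coarse grid search over a in (0, 2]; a real score network would use SGD instead.
grid = [i / 100 for i in range(1, 201)]
a_hat = min(grid, key=lambda a: sm_objective(a, xs))

# For this model the minimizer is also available in closed form: 1 / mean(x^2).
a_closed = 1.0 / (sum(x * x for x in xs) / len(xs))

print(f"grid minimizer: {a_hat:.2f}, closed form: {a_closed:.4f}, "
      f"true 1/variance: {1.0 / true_var:.4f}")
```

Note that the true score never appears in `sm_objective`; the divergence term does the work that the unknown target would otherwise do.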
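Denoising score matching (DSM; Vincent, 2011) circumvents the expensive divergence computation. Corrupt each data point with Gaussian noise, $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and train the network to recover the clean direction:

$$J_{\text{DSM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0, I)}\left[\tfrac{1}{2}\,\Big\|s_\theta(\tilde{x}) + \frac{\tilde{x} - x}{\sigma^2}\Big\|^2\right]$$

The optimal solution is $s_\theta^*(\tilde{x}) = \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$, where $q_\sigma(\tilde{x}) = \int \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I)\, p_{\text{data}}(x)\, dx$: the score of the noisy distribution, which points back toward the clean data. Since the regression target is $-(\tilde{x} - x)/\sigma^2 = -\epsilon/\sigma$, this is, up to a rescaling by the noise level, the noise-prediction objective that DDPM (Week 6) optimizes: the diffusion model's training loss is DSM applied simultaneously across a schedule of noise levels $\sigma_1 < \sigma_2 < \dots < \sigma_T$. Establishing this connection now means that when the DDPM objective appears, it will be recognizable as a likelihood-free probabilistic training method rather than an ad hoc engineering choice.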
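The statement about the DSM optimum can be illustrated with a toy regression. The sketch below assumes scalar clean data $x \sim \mathcal{N}(0, 1)$, a single noise level, and a linear score model $s(\tilde{x}) = a\,\tilde{x}$ fitted by least squares; all of these are illustrative assumptions, not the lecture's setup. The noisy marginal is then $\mathcal{N}(0, 1 + \sigma^2)$, whose score has slope $-1/(1 + \sigma^2)$.

```python
import random

rng = random.Random(0)
sigma = 0.5             # made-up noise level
num_samples = 50_000

num = 0.0               # accumulates sum of x_noisy * target
den = 0.0               # accumulates sum of x_noisy^2
for _ in range(num_samples):
    x = rng.gauss(0.0, 1.0)                 # clean sample, x ~ N(0, 1) (toy assumption)
    eps = rng.gauss(0.0, 1.0)
    x_noisy = x + sigma * eps               # corrupted sample
    target = -(x_noisy - x) / sigma ** 2    # DSM regression target, equal to -eps / sigma
    num += x_noisy * target
    den += x_noisy ** 2

# Least-squares minimizer of sum (a * x_noisy - target)^2 over the slope a.
a_dsm = num / den

# Slope of the true score of the noisy marginal N(0, 1 + sigma^2).
a_true = -1.0 / (1.0 + sigma ** 2)

print(f"DSM fit: {a_dsm:.4f}, score of noisy marginal: {a_true:.4f}")
```

Each individual target depends on the particular noise draw, yet the fitted slope lands on the score of the noisy marginal, because the regression averages the targets over all clean points that could have produced a given $\tilde{x}$.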
Autoregressive models
An alternative to latent variable models is to factorize $p(x)$ using the chain rule of probability without any latent variables:

$$p(x) = \prod_{d=1}^{D} p(x_d \mid x_{<d})$$

This factorization is exact for any joint distribution. A model that parameterizes each conditional with a neural network (a PixelCNN for images, a transformer for tokens) is an autoregressive model. Autoregressive models have exact likelihood: $\log p_\theta(x)$ is computable as a sum of conditional log-likelihoods, enabling direct MLE training.
The tradeoff is sampling efficiency: generating a sample requires $D$ sequential neural network evaluations, one per dimension, which is slow for high-dimensional $x$. Latent variable models and diffusion models generate in $O(1)$ forward passes (or $T$ diffusion steps), independently of the data dimension.
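Both properties, exact likelihood and sequential sampling, show up in even the smallest autoregressive model. The sketch below assumes binary data and a hand-written conditional (a logistic function of the prefix sum) standing in for a neural network; `cond_prob_one`, its `bias` and `weight`, and the dimension are hypothetical. It evaluates $\log p(x)$ as a sum of conditionals and confirms that the probabilities of all $2^D$ sequences sum to one.

```python
import itertools
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def cond_prob_one(prefix, bias=-0.5, weight=0.8):
    """p(x_d = 1 | x_<d): a toy conditional that depends on the prefix sum."""
    return sigmoid(bias + weight * sum(prefix))

def log_prob(x):
    """Exact log p(x) as a sum of conditional log-probabilities (chain rule)."""
    total = 0.0
    for d, bit in enumerate(x):
        p1 = cond_prob_one(x[:d])
        total += math.log(p1 if bit == 1 else 1.0 - p1)
    return total

def sample(dim, rng):
    """Ancestral sampling: one conditional evaluation per dimension, in order."""
    x = []
    for _ in range(dim):
        x.append(1 if rng.random() < cond_prob_one(x) else 0)
    return tuple(x)

D = 4
total_mass = sum(math.exp(log_prob(x)) for x in itertools.product([0, 1], repeat=D))
draw = sample(D, random.Random(0))

print(f"sum of p(x) over all {2 ** D} sequences: {total_mass:.12f}")
print(f"one sample: {draw}, log p = {log_prob(draw):.4f}")
```

The loop in `sample` cannot be parallelized across dimensions because each step conditions on the bits already drawn, whereas `log_prob` for a known `x` could evaluate all conditionals at once.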
Connections to language models: GPT-style transformers are autoregressive models over discrete token sequences, trained with exactly the same MLE objective. The negative log-probability of a sequence is the sum of the cross-entropy losses computed at each position. The difference from image autoregressive models is only the data type (discrete tokens vs. continuous pixels) and the conditional architecture (self-attention vs. masked convolutions).
The expressiveness–tractability frontier
Every generative model design choice is a point on an expressiveness–tractability frontier: more expressive representations of $p(x)$ generally make evaluation, sampling, or training harder.
Fully tractable models (normalizing flows, autoregressive models) can evaluate $p_\theta(x)$ exactly and sample in a single forward pass (flows) or in $D$ sequential steps (autoregressive models). Their training is simple MLE. Their cost is architectural constraint: flows must be invertible with tractable Jacobian determinants; autoregressive models must factorize along dimensions.
Latent-variable models (VAEs) relax the architectural constraint (the decoder can be any neural network) at the cost of an intractable marginal likelihood. The ELBO approximates MLE at the price of an amortization gap and a bound gap. Sampling is cheap ($z \sim p(z)$, then one decoder forward pass), but reconstruction quality is limited by the quality of the encoder.
Implicit models (GANs) abandon likelihood evaluation entirely. The generator is completely free — there is no constraint on its architecture. The cost is training instability: optimizing a minimax game rather than a simple loss. Mode collapse (the generator covering only a subset of the data distribution) is the signature failure mode.
Energy-based models express the most general probability distributions but make both evaluation and sampling hard: the partition function is intractable (making MLE impossible directly) and sampling requires MCMC chains that may not mix in practice.
Diffusion and flow matching models occupy a distinctive point: they evaluate $\log p_\theta(x)$ only approximately (via a variational bound or through an ODE-based likelihood computation), but they achieve state-of-the-art sample quality by leveraging deep networks in a many-step generation process, trading single-pass efficiency for iterative quality.
Understanding where each model falls on this frontier — and why the engineering constraints of the deployment scenario (need for exact likelihoods? need for fast sampling? need for conditional generation?) determine which model class is appropriate — is the meta-lesson that every lecture builds toward.
The generative model taxonomy
The models in this course differ in how they represent $p_\theta(x)$:
Likelihood-based models (VAEs, normalizing flows, diffusion) can evaluate (or bound) $\log p_\theta(x)$ and train by maximizing it. They have well-behaved training dynamics because the loss is a proxy for the data likelihood.
Implicit models (GANs) represent $p_\theta(x)$ only through a sampler (the generator). Likelihood evaluation requires intractable density estimation. Training uses an adversarial game rather than maximum likelihood.
Energy-based models define $p_\theta(x) = \exp(-E_\theta(x)) / Z_\theta$, where $E_\theta$ is a learned energy function. Likelihood evaluation requires computing the normalizing constant $Z_\theta = \int \exp(-E_\theta(x))\, dx$ (the partition function), which is typically intractable, requiring approximate MCMC methods.
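To see why the partition function is the bottleneck, the sketch below computes it by brute force for a toy energy over binary vectors (a nearest-neighbour agreement energy, chosen because a closed-form answer exists for comparison). The energy function, `coupling` value, and dimension are hypothetical. The enumeration visits all $2^D$ states, which is feasible at $D = 10$ and hopeless at image scale.

```python
import itertools
import math

def energy(x, coupling=0.7):
    """Toy energy over binary vectors: lower when neighbouring bits agree."""
    agreement = sum(1.0 if a == b else -1.0 for a, b in zip(x, x[1:]))
    return -coupling * agreement

def partition_function(dim):
    """Z = sum over all 2**dim states of exp(-E(x)), by explicit enumeration."""
    return sum(math.exp(-energy(x)) for x in itertools.product([0, 1], repeat=dim))

def prob(x, z):
    """Normalized probability of a single state, given Z."""
    return math.exp(-energy(x)) / z

D = 10
Z = partition_function(D)                     # 2**10 = 1024 terms

# Closed form for this particular chain-structured energy (a known special case).
Z_closed = 2.0 * (2.0 * math.cosh(0.7)) ** (D - 1)

all_ones = (1,) * D
print(f"Z by enumeration: {Z:.4f}, closed form: {Z_closed:.4f}")
print(f"p(all ones) = {prob(all_ones, Z):.6f}")
```

Each additional dimension doubles the number of terms in `partition_function`, and a learned neural energy has no closed form to fall back on, which is why energy-based training resorts to MCMC or to score-based objectives that avoid $Z_\theta$ altogether.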
GenAI context: probabilistic foundations across the course sequence
The ELBO, latent variable model, and score function are not just Course 3 abstractions — they are the same objects used throughout the course sequence under different names.
| Concept | Generative models (this course) | Robotics (Course 2) | Reinforcement learning (Course 1) |
|---|---|---|---|
| Latent variable model | VAE encoder–decoder | CVAE in ACT (Action Chunking with Transformers); action codes | World model latent state |
| Prior | $\mathcal{N}(0, I)$ over image latents | Gaussian prior over action modes | Dynamics prior in RSSM |
| Encoder | Image→latent inference | Demo sequence→action code | Posterior state estimation |
| Score function | $\nabla_x \log p(x)$ in diffusion | Energy gradient for imitation | Reward gradient for policy search |
| Autoregressive factorization | PixelCNN, Transformer LM | Token-by-token VLA action generation | POMDP (Partially Observable Markov Decision Process) belief update |
| ELBO bound gap | $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\Vert\, p_\theta(z \mid x))$ | Encoder approximation error | Variational inference in POMDP |
The ELBO and its gradient are the same mathematical objects whether they are computed over robot demonstration sequences, image pixels, or language tokens. A practitioner who understands the unified probabilistic framework can recognize that a diffusion policy for manipulation, a VAE for image compression, and an autoregressive language model are all distinct engineering answers to the same underlying question: how do we represent and compute with a high-dimensional probability distribution?
Key takeaways
- MLE minimizes the KL divergence from the data distribution to the model and is the training objective underlying all likelihood-based generative models.
- Latent variable models capture data structure through a low-dimensional semantic bottleneck but require variational approximation because the marginal $p_\theta(x)$ is intractable.
- The ELBO lower-bounds $\log p_\theta(x)$ as a reconstruction term minus a KL regularization term, and is tight when the encoder distribution equals the true posterior.
- Amortized inference parameterizes $q_\phi(z \mid x)$ with a single shared encoder, enabling scalable stochastic training.
- Autoregressive models factor $p(x)$ along dimensions using the chain rule, enabling exact likelihood but requiring sequential sampling.
Conceptual questions
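- The ELBO can be written as $\mathcal{L}(\theta, \phi; x) = \log p_\theta(x) - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$. Show algebraically that maximizing the ELBO is equivalent to minimizing $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$ while maximizing $\log p_\theta(x)$. What does this imply about the gap between the ELBO and the true log-likelihood?
- A latent variable model uses a standard Gaussian prior $p(z) = \mathcal{N}(0, I)$ and a Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$. Derive the closed-form KL term $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$ as a function of $\mu$ and $\sigma$. Under what conditions does this KL term equal zero?
- Consider an autoregressive model factored as $p(x) = \prod_{d=1}^{D} p(x_d \mid x_{<d})$. If each conditional is modeled as a Gaussian with mean $\mu_\theta(x_{<d})$ and fixed variance $\sigma^2$, show that maximizing the log-likelihood is equivalent to minimizing the sum of squared prediction errors $\sum_{d=1}^{D} \big(x_d - \mu_\theta(x_{<d})\big)^2$. What generative model does this correspond to?
- MLE minimizes $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta)$ (forward KL) rather than $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\text{data}})$ (reverse KL). Explain geometrically why forward KL leads to mode covering behavior (the model spreads to cover all modes of the data) while reverse KL leads to mode seeking (the model collapses to one mode). Which behavior is preferable for generative sampling, and why does the choice of KL direction matter for the resulting model?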
- A normalizing flow model achieves some value of $\log p_\theta(x)$ (in nats) on a test image, while a VAE achieves some ELBO value (in nats) on the same image. Can you conclude from these two numbers alone that the flow is a better model? What additional information would you need to make a valid comparison between the two models?
Looking ahead
With the probabilistic foundations established, the next lecture introduces the first complete generative architecture that directly optimizes the ELBO.
Week 2: Variational Autoencoders. We derive the VAE objective in full, implement the reparameterization trick that enables stochastic backpropagation, analyze the posterior collapse failure mode, and examine $\beta$-VAE and hierarchical variants that improve representation quality.
Further reading
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. (Chapters on probability distributions and latent variables).
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. (Variational inference and deep generative models).
- Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. ICLR. (Introduces amortized variational inference and the reparameterization trick).
- Hyvärinen, A. (2005). Estimation of Non-Normalized Statistical Models by Score Matching. JMLR. (The original score matching objective).
- Vincent, P. (2011). A Connection Between Score Matching and Denoising Autoencoders. Neural Computation. (Denoising score matching, the basis of the DDPM objective).