Purpose of this lecture
The variational autoencoder (VAE; Kingma and Welling, 2013) is the first complete generative architecture to directly operationalize the ELBO framework from Week 1. It simultaneously trains an encoder that compresses observations into a structured latent space and a decoder that reconstructs observations from latents, using backpropagation throughout via the reparameterization trick. The VAE is foundational both in its own right (latent diffusion models in Week 9 use VAE encoders as their compression stage) and as a template for understanding the tradeoffs that all latent variable generative models face.
The autoencoder baseline
A deterministic autoencoder compresses an observation $x$ to a code $z = f_\phi(x)$ and reconstructs $\hat{x} = g_\theta(z)$, trained to minimize the reconstruction error $\|x - \hat{x}\|^2$. This achieves compression but fails as a generative model: the latent space has no prescribed structure, so sampling a random $z$ from, say, a Gaussian and passing it through $g_\theta$ produces out-of-distribution garbage. The decoder was never trained on latent codes other than those produced by the encoder on training data.
The VAE solves this by replacing the deterministic encoder with a stochastic encoder $q_\phi(z \mid x)$ (the inference network) and adding a prior $p(z)$ that regularizes the latent space. The reconstruction objective is supplemented by a KL penalty that pulls $q_\phi(z \mid x)$ toward the prior, making the entire latent space, not just the codes of training examples, meaningful to the decoder.
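The VAE objective
For a Gaussian encoder and standard Gaussian prior, the ELBO from Week 1 becomes the tractable VAE objective:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$
With $q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x))\big)$ and $p(z) = \mathcal{N}(0, I)$, the KL term has a closed form:
$$D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big) = \frac{1}{2}\sum_{j=1}^{d}\big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)$$
This sum over the $d$ latent dimensions is computable without Monte Carlo. The reconstruction term requires sampling $z \sim q_\phi(z \mid x)$ and evaluating $\log p_\theta(x \mid z)$. For a Gaussian decoder $p_\theta(x \mid z) = \mathcal{N}(\mu_\theta(z), \sigma^2 I)$, this becomes $-\frac{1}{2\sigma^2}\|x - \mu_\theta(z)\|^2$ plus a constant: a squared error weighted by the inverse decoder variance.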
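As a sanity check on the closed-form KL, the sketch below compares it against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$ for a diagonal Gaussian and a standard Gaussian prior. The function names and the particular values of `mu` and `log_var` are illustrative choices, not part of the lecture.

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def kl_monte_carlo(mu, log_var, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the same KL as the average of log q(z) - log p(z)."""
    rng = np.random.default_rng(seed)
    sigma = np.exp(0.5 * log_var)
    z = mu + sigma * rng.standard_normal((n_samples, mu.shape[-1]))
    log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + log_var + np.log(2 * np.pi), axis=-1)
    log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=-1)
    return np.mean(log_q - log_p)

# Illustrative encoder outputs for a single input with d = 3 latent dimensions.
mu = np.array([0.5, -1.0, 0.0])
log_var = np.array([0.0, -1.0, 0.5])

kl_exact = kl_diag_gaussian_to_standard(mu, log_var)
kl_mc = kl_monte_carlo(mu, log_var)
print(f"closed form: {kl_exact:.4f}   Monte Carlo: {kl_mc:.4f}")
```

The two numbers should agree to within sampling noise, and the closed form is exactly zero when `mu` is all zeros and `log_var` is all zeros.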
The reparameterization trick
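The reconstruction gradient $\nabla_\phi \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ cannot be computed by naive backpropagation because the sampling operation $z \sim q_\phi(z \mid x)$ has no gradient with respect to $\phi$. The reparameterization trick sidesteps this by rewriting the sample as a deterministic function of noise drawn from a fixed distribution:
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Now $z$ is a differentiable function of $\phi$, and gradients flow through $z$ to the encoder parameters:
$$\nabla_\phi \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_\phi \log p_\theta\big(x \mid \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon\big)\big]$$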
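The trick works for any distribution whose samples can be expressed as $z = g_\phi(\epsilon, x)$ for some deterministic, differentiable function $g_\phi$ and noise $\epsilon$ drawn from a fixed (parameter-free) distribution. Location-scale families such as the Laplace reparameterize this way directly; Gamma and Beta distributions also admit reparameterized gradients, though through less direct constructions. The trick fails for discrete distributions, which require alternative gradient estimators: REINFORCE, which is unbiased but high-variance, or the Gumbel-Softmax relaxation, which trades a small bias for lower variance.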
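To make the variance difference concrete, here is a minimal sketch on a one-dimensional toy problem, assuming $z \sim \mathcal{N}(\mu, \sigma^2)$ and the test function $f(z) = z^2$, for which the true gradient $\partial_\mu \mathbb{E}[f(z)] = 2\mu$ is known. The function `gradient_estimates` and the values of `mu` and `sigma` are hypothetical, chosen only for illustration.

```python
import numpy as np

def gradient_estimates(mu, sigma, n_samples=100_000, seed=0):
    """Per-sample estimates of d/dmu E[f(z)] for z ~ N(mu, sigma^2) and f(z) = z^2."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps                   # reparameterized sample
    pathwise = 2.0 * z                     # f'(z) * dz/dmu, where dz/dmu = 1
    score_fn = z**2 * (z - mu) / sigma**2  # f(z) * d/dmu log N(z; mu, sigma^2)
    return pathwise, score_fn

pathwise, score_fn = gradient_estimates(mu=1.5, sigma=0.8)
true_grad = 2.0 * 1.5

print(f"true gradient:       {true_grad:.3f}")
print(f"reparameterization:  mean {pathwise.mean():.3f}, per-sample std {pathwise.std():.3f}")
print(f"score function:      mean {score_fn.mean():.3f}, per-sample std {score_fn.std():.3f}")
```

Both estimators average to the true gradient, but the per-sample spread of the score-function (REINFORCE-style) estimate is several times larger in this toy setting, which is why the reparameterized form is preferred whenever it is available.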
Why the trick fails for discrete variables
The failure mode for discrete distributions deserves careful attention because it recurs in vector-quantized models (VQ-VAEs, VQ-BeT from Course 2). For a discrete latent $z \in \{1, \dots, K\}$, there is no way to write $z = g_\phi(\epsilon, x)$ with a continuous differentiable $g_\phi$: the sampling operation is a discontinuous argmax that blocks gradients. The REINFORCE estimator rewrites $\nabla_\phi \mathbb{E}_{q_\phi(z \mid x)}[f(z)]$ as $\mathbb{E}_{q_\phi(z \mid x)}[f(z)\,\nabla_\phi \log q_\phi(z \mid x)]$, which is unbiased but high-variance. The Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) provides a continuous relaxation: replace the discrete sample with a softmax over Gumbel-perturbed logits:
$$y_k = \frac{\exp\big((\log \pi_k + g_k)/\tau\big)}{\sum_{j=1}^{K} \exp\big((\log \pi_j + g_j)/\tau\big)}, \qquad g_k \sim \mathrm{Gumbel}(0, 1)$$
where $\pi_k$ are the class probabilities and $\tau > 0$ is a temperature. As $\tau \to 0$, $y$ approaches the one-hot argmax; as $\tau \to \infty$, it approaches a uniform distribution. At any $\tau > 0$, $y$ is differentiable in the logits $\log \pi$, enabling gradient-based training. In practice, the straight-through Gumbel estimator uses the hard discrete sample in the forward pass (for correct categorical behavior) but the soft Gumbel-Softmax gradient in the backward pass, a pragmatic approximation that works well in the VQ-VAE context.
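The temperature behavior can be seen in a small NumPy sketch of the forward sampling step (no gradients are computed here). The function name `gumbel_softmax` and the class probabilities in `pi` are assumptions made for illustration.

```python
import numpy as np

def gumbel_softmax(log_pi, tau, rng, n_samples=1):
    """Draw relaxed one-hot samples: softmax((log_pi + Gumbel noise) / tau)."""
    u = rng.uniform(low=1e-12, high=1.0, size=(n_samples, log_pi.shape[0]))
    g = -np.log(-np.log(u))                      # Gumbel(0, 1) noise
    scores = (log_pi + g) / tau
    scores -= scores.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    y = np.exp(scores)
    return y / y.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
pi = np.array([0.6, 0.3, 0.1])                   # illustrative class probabilities

y_cold = gumbel_softmax(np.log(pi), tau=0.05, rng=rng, n_samples=20_000)
y_hot = gumbel_softmax(np.log(pi), tau=100.0, rng=rng, n_samples=20_000)

# Taking the argmax of the relaxed sample recovers a categorical draw (Gumbel-max trick).
hard_freq = np.bincount(y_cold.argmax(axis=1), minlength=3) / len(y_cold)

print("low temperature, average largest entry: ", y_cold.max(axis=1).mean())
print("high temperature, average sample:       ", y_hot.mean(axis=0))
print("argmax frequencies vs. pi:              ", hard_freq, pi)
```

At low temperature the samples are nearly one-hot, at high temperature they are nearly uniform, and the argmax frequencies track `pi`. A straight-through variant would use the argmax in the forward pass and the soft sample's gradient in the backward pass, which requires an autodiff framework and is not shown.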
Training and architecture
The standard VAE training loop: for each mini-batch, run the encoder to obtain $\mu_\phi(x)$ and $\sigma_\phi(x)$; sample $\epsilon \sim \mathcal{N}(0, I)$; compute $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$; run the decoder to obtain $\mu_\theta(z)$; compute the ELBO loss; backpropagate. The encoder and decoder are trained jointly via the same gradient step.
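The forward half of this loop can be sketched in NumPy as follows. This is only an illustration under strong assumptions: the encoder and decoder are single linear maps with hypothetical weights `W_mu`, `W_logvar`, and `W_dec`, the decoder variance is fixed, and the backpropagation step is omitted because NumPy has no automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, x_dim, z_dim = 8, 6, 2
sigma2_dec = 1.0  # fixed decoder variance (an assumed hyperparameter)

# Hypothetical toy parameters: two linear encoder heads and a linear decoder.
W_mu = 0.1 * rng.standard_normal((x_dim, z_dim))
W_logvar = 0.1 * rng.standard_normal((x_dim, z_dim))
W_dec = 0.1 * rng.standard_normal((z_dim, x_dim))

def negative_elbo(x, rng):
    """One forward pass: encode, reparameterize, decode, and evaluate the loss."""
    mu = x @ W_mu                               # encoder mean
    log_var = x @ W_logvar                      # encoder log-variance
    eps = rng.standard_normal(mu.shape)         # noise from the fixed distribution
    z = mu + np.exp(0.5 * log_var) * eps        # reparameterized latent sample
    x_hat = z @ W_dec                           # decoder mean
    # Gaussian negative log-likelihood up to an additive constant.
    recon = np.sum((x - x_hat) ** 2, axis=1) / (2.0 * sigma2_dec)
    # Closed-form KL to the standard Gaussian prior.
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=1)
    return np.mean(recon + kl), np.mean(recon), np.mean(kl)

x = rng.standard_normal((batch, x_dim))         # stand-in mini-batch
loss, recon, kl = negative_elbo(x, rng)
print(f"negative ELBO: {loss:.3f}  (reconstruction {recon:.3f}, KL {kl:.3f})")
```

In a real implementation the same computation would be written in an autodiff framework so that a single backward pass updates encoder and decoder parameters together.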
Architecture in practice
For image VAEs, the encoder is a convolutional network that takes an image $x \in \mathbb{R}^{H \times W \times 3}$ and outputs two vectors $\mu, \log \sigma^2 \in \mathbb{R}^d$ through a series of strided convolutions followed by a linear projection. A typical small image VAE uses 4 convolutional stages (doubling channels: 64 → 128 → 256 → 512) followed by a projection to a latent vector of dimension $d$. The decoder is a mirror image: a linear projection followed by transposed convolutions (or nearest-neighbor upsampling + convolution, which avoids checkerboard artifacts) back to $\mathbb{R}^{H \times W \times 3}$.
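The shape bookkeeping for such an encoder can be worked through with the standard convolution output-size formula. The sketch below assumes a 64×64 input and kernel size 4, stride 2, padding 1 at every stage; none of these values are stated in the lecture, so treat them as one plausible configuration.

```python
def conv_out(n, kernel=4, stride=2, padding=1):
    """Spatial output size of a convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def encoder_shapes(height, width, channels=(64, 128, 256, 512)):
    """Feature-map shape (H, W, C) after each strided convolutional stage."""
    shapes = []
    h, w = height, width
    for c in channels:
        h, w = conv_out(h), conv_out(w)
        shapes.append((h, w, c))
    return shapes

shapes = encoder_shapes(64, 64)  # assumed input resolution
h, w, c = shapes[-1]
flat_features = h * w * c        # input size of the final linear projection

for stage, shape in enumerate(shapes, start=1):
    print(f"stage {stage}: {shape}")
print("flattened features before the linear projection:", flat_features)
```

Under these assumptions each stage halves the spatial resolution, so four stages take 64×64 down to 4×4 before the linear layer produces $\mu$ and $\log \sigma^2$.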
The latent space design for latent diffusion models (Week 9) differs fundamentally from the vector-latent design above. The Stable Diffusion VAE uses a spatial latent map rather than a vector: a $512 \times 512 \times 3$ input is encoded to $64 \times 64 \times 4$, an $8\times$ spatial downsampling into 4 channels. The diffusion model then operates on these latents rather than pixels, reducing the dimensionality from 786,432 to 16,384, a $48\times$ reduction. The KL regularization in this spatial VAE uses a weight $\beta \approx 10^{-6}$ (extremely small), because the primary goal is high-fidelity compression rather than a structured latent distribution. The slight KL regularization prevents the latent space from being too irregular for the subsequent diffusion model to sample from.
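The dimensionality figures can be reproduced with a few lines of arithmetic, assuming a 512×512 RGB input and a 64×64 latent map with 4 channels, which are the shapes that yield the element counts quoted above.

```python
pixel_shape = (512, 512, 3)   # assumed input: height, width, RGB channels
latent_shape = (64, 64, 4)    # assumed latent map: height, width, channels

def numel(shape):
    """Total number of elements in a tensor of the given shape."""
    total = 1
    for s in shape:
        total *= s
    return total

pixel_dim = numel(pixel_shape)
latent_dim = numel(latent_shape)
spatial_factor = pixel_shape[0] // latent_shape[0]
reduction = pixel_dim / latent_dim

print(pixel_dim, latent_dim, spatial_factor, reduction)
```

Note that the 48× overall reduction is smaller than the 64× spatial reduction (8 × 8) because the latent map carries 4 channels instead of 3.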
Decoder variance: treating $\sigma^2$ as a fixed hyperparameter (rather than a learned output) simplifies training. $\sigma^2 = 1$ gives equal weight to reconstruction and KL; $\sigma^2 < 1$ upweights reconstruction (sharper images, less regularization); $\sigma^2 > 1$ upweights KL (more structured latent space, blurrier images). The decoder variance is the primary lever for balancing generation quality against representation quality.
Posterior collapse
A notorious failure mode of VAEs is posterior collapse: during training, the encoder learns to output $\mu_\phi(x) \approx 0$ and $\sigma_\phi(x) \approx 1$ for all $x$, so it ignores the input and simply reproduces the prior. When this happens, the KL term is zero, the decoder learns to generate from the prior alone (ignoring $z$), and the latent space carries no information about $x$.
Posterior collapse occurs because: (1) powerful decoders can achieve low reconstruction loss without using $z$ at all, so gradients through $z$ vanish early in training; (2) the KL term keeps pushing $q_\phi(z \mid x)$ toward the prior regardless of the reconstruction quality. The model finds a degenerate but locally stable optimum.
Detecting collapse in practice: monitor the per-dimension KL during training, $\mathrm{KL}_j = \tfrac{1}{2}\,\mathbb{E}_x\big[\mu_j(x)^2 + \sigma_j(x)^2 - \log \sigma_j(x)^2 - 1\big]$, for each latent dimension $j$. A healthy VAE will have most dimensions with a KL clearly above zero, meaning the encoder is using those dimensions to encode information. Collapsed dimensions have $\mathrm{KL}_j \approx 0$: the encoder outputs $\mathcal{N}(0, 1)$ regardless of $x$. As a rule of thumb, if more than 10–20% of latent dimensions are below 0.1 nats, the model is collapsing and training should be restarted with lower decoder capacity or stronger KL annealing.
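A monitoring helper along these lines might look like the sketch below. The function names are hypothetical, the encoder outputs are synthetic (two active and two collapsed dimensions), and the 0.1-nat threshold is the rule of thumb from the text.

```python
import numpy as np

def per_dim_kl(mu, log_var):
    """Batch-averaged KL to N(0, 1) per latent dimension, in nats. Inputs: (batch, d)."""
    return np.mean(0.5 * (mu**2 + np.exp(log_var) - log_var - 1.0), axis=0)

def collapsed_fraction(mu, log_var, threshold=0.1):
    """Fraction of latent dimensions whose average KL falls below the threshold."""
    kl = per_dim_kl(mu, log_var)
    return float(np.mean(kl < threshold)), kl

# Synthetic encoder outputs: dimensions 0-1 carry information, 2-3 match the prior.
rng = np.random.default_rng(0)
batch = 256
mu = np.zeros((batch, 4))
log_var = np.zeros((batch, 4))
mu[:, :2] = 2.0 * rng.standard_normal((batch, 2))  # means vary with the input
log_var[:, :2] = -2.0                              # posterior narrower than the prior

frac, kl = collapsed_fraction(mu, log_var)
print("per-dimension KL (nats):", np.round(kl, 3))
print("collapsed fraction:", frac)
```

In training code, `mu` and `log_var` would be the encoder outputs on a validation batch, logged every few hundred steps.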
Mitigations include KL annealing (start training with a KL weight $\beta = 0$ and slowly ramp it to 1, preventing the KL term from forcing collapse before the decoder learns to use $z$), free bits (set a floor $\lambda$ on the KL of each latent dimension, zeroing out the KL gradient when that dimension's KL is below $\lambda$), and architectural constraints that prevent the decoder from being too powerful (e.g., restricting its receptive field or capacity).
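Both training-side mitigations are short to express. The sketch below assumes a linear warm-up schedule and a per-dimension floor implemented with a maximum; the function names, the warm-up length, and the floor value are illustrative choices.

```python
import numpy as np

def kl_weight(step, warmup_steps):
    """Linear KL annealing: the weight rises from 0 to 1 over warmup_steps, then stays at 1."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, floor):
    """KL penalty with a per-dimension floor.

    Dimensions whose KL is below `floor` contribute the constant `floor`,
    so under autodiff they receive no KL gradient.
    """
    return float(np.sum(np.maximum(kl_per_dim, floor)))

weights = [kl_weight(step, warmup_steps=1000) for step in (0, 500, 1000, 2000)]
penalty = free_bits_kl(np.array([0.01, 0.5, 2.0]), floor=0.25)

print("annealing weights:", weights)
print("free-bits KL penalty:", penalty)
```

In the example, the first dimension sits below the floor and contributes 0.25 instead of 0.01, so the optimizer has no incentive to squeeze it further toward the prior.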
$\beta$-VAE and disentanglement
$\beta$-VAE (Higgins et al., 2017) modifies the VAE objective by upweighting the KL term:
$$\mathcal{L}_\beta(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big), \qquad \beta > 1$$
Increasing $\beta$ compresses the information in $z$ more aggressively, which tends to disentangle the latent dimensions: each dimension of $z$ captures a single semantically interpretable factor of variation (e.g., rotation, scale, color) rather than multiple entangled factors. The tradeoff is reduced reconstruction quality.
The disentanglement intuition: with $\beta > 1$, the information bottleneck is so tight that only the most compressible (statistically independent) structure survives in $z$. Independent generative factors, which by definition have no statistical interaction, are the most efficient representation under this bottleneck.
Hierarchical VAEs (NVAE, VDVAE) stack multiple levels of latent variables $z_1, \dots, z_L$ with the joint prior factored as $p(z_1, \dots, z_L) = p(z_1)\prod_{l=2}^{L} p(z_l \mid z_{<l})$. Each level captures structure at a different scale: $z_1$ encodes global semantics, $z_L$ encodes local details. Hierarchical VAEs substantially improve sample quality over single-level VAEs, approaching GAN-level quality on face generation benchmarks.
Vector-Quantized VAE
The VQ-VAE (van den Oord et al., 2017) replaces the continuous Gaussian latent space with a discrete codebook. The encoder maps $x$ to a continuous embedding $z_e(x)$, which is then quantized to the nearest codebook vector:
$$z_q(x) = e_k, \qquad k = \arg\min_{j} \|z_e(x) - e_j\|_2$$
where $\{e_j\}_{j=1}^{K}$ are the learnable codebook vectors. The decoder receives $z_q(x)$ and reconstructs $x$. The discrete bottleneck sidesteps the posterior collapse problem of continuous VAEs: the codebook has a fixed, discrete structure that cannot be bypassed.
Training requires handling the non-differentiable argmin. The straight-through estimator copies the gradient from $z_q(x)$ back to $z_e(x)$ directly, ignoring the quantization step in the backward pass. The full VQ-VAE loss has three terms: reconstruction, codebook alignment (pulling codebook vectors toward encoder outputs), and commitment (pulling encoder outputs toward codebook vectors):
$$\mathcal{L} = -\log p_\theta\big(x \mid z_q(x)\big) + \big\|\mathrm{sg}[z_e(x)] - e_k\big\|_2^2 + \beta\,\big\|z_e(x) - \mathrm{sg}[e_k]\big\|_2^2$$
where $\mathrm{sg}[\cdot]$ denotes stop-gradient. The commitment weight $\beta = 0.25$ (default) balances encoder stability against codebook update speed.
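The quantization step and the values of the two codebook-related loss terms can be sketched in NumPy as below. This is a forward-only illustration with a made-up 2-dimensional codebook: stop-gradient has no effect on values, so the codebook and commitment terms differ here only by the factor `beta`, and the decoder and reconstruction term are omitted.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output (n, d) to its nearest codebook vector (K, d)."""
    sq_dist = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)  # (n, K)
    idx = np.argmin(sq_dist, axis=1)
    return codebook[idx], idx

def vq_terms(z_e, z_q, beta=0.25):
    """Values of the codebook and commitment terms, averaged over the batch."""
    sq = np.mean(np.sum((z_e - z_q) ** 2, axis=-1))
    codebook_term = sq           # || sg[z_e] - e ||^2, whose gradient updates the codebook
    commitment_term = beta * sq  # beta * || z_e - sg[e] ||^2, whose gradient updates the encoder
    return codebook_term, commitment_term

# Illustrative codebook with K = 3 entries and three encoder outputs.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z_e = np.array([[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]])

z_q, idx = quantize(z_e, codebook)
codebook_term, commitment_term = vq_terms(z_e, z_q)

print("selected codes:", idx)
print("codebook term:", codebook_term, " commitment term:", commitment_term)
```

In an autodiff framework the straight-through estimator is commonly written as `z_e + stop_gradient(z_q - z_e)`, which evaluates to `z_q` in the forward pass while passing decoder gradients to `z_e` unchanged.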
VQ-VAEs enable two-stage generation: first train the VQ-VAE to produce discrete codes; then train a separate autoregressive or diffusion model over the discrete code sequence. This two-stage approach produces sharp samples (the VQ-VAE preserves local detail well) while enabling tractable likelihood over the discrete code sequence. DALL-E 1 used a $32 \times 32$ grid of tokens drawn from an 8192-entry codebook, learned with a discrete VAE closely related to VQ-VAE; subsequent work replaced the discrete stage with continuous spatial VAEs for better high-frequency fidelity. In robotics (Course 2 Week 8), VQ-BeT applies the same codebook structure to discrete action bins, enabling a behavior transformer to model multimodal action distributions by predicting codebook indices rather than continuous action values.
VAEs as generative models
At inference time, the VAE generates samples by: (1) drawing $z \sim \mathcal{N}(0, I)$; (2) passing $z$ through the decoder to obtain the output distribution $p_\theta(x \mid z)$; (3) sampling $x \sim p_\theta(x \mid z)$ or taking the mean $\mu_\theta(z)$. Generated samples are often blurry because the Gaussian decoder with MSE reconstruction minimizes the expected pixel error, which averages over aleatoric uncertainty and so produces the mean of the conditional rather than a sharp mode.
This blurriness is not a bug in the VAE algorithm but a consequence of the Gaussian likelihood assumption: the optimal decoder output under MSE loss is the conditional mean $\mathbb{E}[x \mid z]$, which averages over multiple plausible renderings. The latent diffusion models of Week 9 address this by keeping the autoencoder for compression only, with a very weak KL penalty, and handing the generative modeling to a diffusion model that operates on the latent codes.
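The conditional-mean effect shows up even in a one-pixel toy problem. The sketch below assumes that, for a fixed latent code, the true pixel value is either $-1$ or $+1$ with equal probability (two equally plausible "renderings") and searches for the single prediction that minimizes mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two equally plausible pixel values for the same latent code (a toy assumption).
x = rng.choice([-1.0, 1.0], size=10_000)

# Evaluate the MSE of every constant prediction on a grid and pick the best one.
candidates = np.linspace(-1.5, 1.5, 301)
mse = np.array([np.mean((x - c) ** 2) for c in candidates])
best = candidates[np.argmin(mse)]

print(f"MSE-optimal prediction: {best:.2f}")
print(f"sample mean of x:       {x.mean():.2f}")
print(f"MSE at the optimum:     {mse.min():.2f}")
```

The MSE-optimal output lands near 0, the average of the two modes, even though the data never takes that value. For images, the analogous average over plausible edge positions and textures is what appears as blur.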
GenAI context: VAEs across the course sequence
| VAE concept | Language model analog | Robotics (Course 2) analog |
|---|---|---|
| Encoder | Input token embedding | Demo sequence → action code (ACT, Action Chunking with Transformers) |
| Reparameterization trick | Straight-through for discrete tokens | CVAE continuous action latents |
| Posterior collapse | LM ignoring context | CVAE encoder ignoring demonstration |
| $\beta$-VAE disentanglement | Bottleneck transformers | Disentangled task/motion latents |
| Hierarchical VAE | Multi-level prompt latents | Hierarchical skill latents |
| Spatial latent VAE | VQGAN token grid for image LMs | Compressed sensor observations |
The CVAE architecture in ACT (Action Chunking with Transformers; Course 2, Week 8) is a direct application of the conditional VAE framework developed here: the encoder maps a demonstration sequence to an action latent code $z$, the prior is $\mathcal{N}(0, I)$, and the decoder generates action chunk predictions conditioned on $z$ and the current observation. The ELBO trained there is the same mathematical object (a reconstruction term plus KL regularization) with "demonstration" replacing "image" in the reconstruction objective.
Key takeaways
- The VAE jointly trains encoder and decoder by maximizing the ELBO, decomposed into a reconstruction term and a KL divergence toward the prior.
- The reparameterization trick enables gradient-based optimization through stochastic latent samples by expressing $z$ as a deterministic function of the encoder output and noise from a fixed distribution.
- Posterior collapse is the primary failure mode, caused by powerful decoders learning to ignore $z$, and is mitigated by KL annealing or free bits.
- $\beta$-VAE upweights the KL term to improve disentanglement at the cost of reconstruction quality.
- Hierarchical VAEs stack multiple latent levels, recovering sample quality competitive with GANs while retaining a tractable lower bound on the likelihood.
Conceptual questions
- A VAE decoder is parameterized as a Gaussian with variance $\sigma^2 \to 0$. Show that in this limit, the ELBO reconstruction term approaches $-\frac{1}{2\sigma^2}\|x - \mu_\theta(z)\|^2$, which dominates the KL term. What is the risk of using very small $\sigma^2$? Conversely, what happens to sample quality as $\sigma^2 \to \infty$?
- Derive the closed-form KL divergence $D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\,\|\,\mathcal{N}(0, I)\big)$ for $d$-dimensional Gaussian distributions. Show that it equals zero if and only if $\mu = 0$ and $\sigma = 1$ in every dimension. How does this expression change if the prior is a general Gaussian such as $\mathcal{N}(\mu_0, \sigma_0^2 I)$ rather than the standard Gaussian?
- During training of a VAE, the KL term is observed to be zero after 100 gradient steps while the reconstruction loss is still high. Diagnose this failure mode and propose two architectural or training modifications that would prevent it. For each modification, explain the mechanism by which it discourages the degenerate solution.
- A $\beta$-VAE is trained on a dataset of 3D objects with independent generative factors (shape, size, orientation, color). For $\beta = 1$ (standard VAE) and a larger value $\beta > 1$, predict: (a) the reconstruction quality on held-out objects, (b) whether the latent dimensions will be aligned with the generative factors, and (c) the quality of samples generated by interpolating between two latent codes. Justify each prediction using the information-bottleneck interpretation.
- Hierarchical VAEs use $L$ levels of latent variables. During inference, the posterior $q_\phi(z_1, \dots, z_L \mid x)$ must approximate the true posterior $p_\theta(z_1, \dots, z_L \mid x)$. Write the ELBO for a two-level hierarchical VAE and identify where the posterior approximation gap arises for each level. Which level (top or bottom) do you expect to suffer from larger posterior approximation error, and why?
Looking ahead
VAEs provide a probabilistic, likelihood-based approach to generative modeling. The next model family takes a different approach: instead of approximating the intractable posterior, it sidesteps likelihood entirely.
Week 3: Generative Adversarial Networks. We derive the GAN min-max objective, show its connection to Jensen-Shannon divergence, analyze the Wasserstein distance alternative, and examine training instability and mode collapse — the pathologies that motivated nearly all subsequent GAN research.
Further reading
- Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. ICLR. (The original VAE paper).
- Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML.
- Higgins, I., et al. (2017). beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR.
- van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning (VQ-VAE). NeurIPS.
- Jang, E., Gu, S., & Poole, B. (2017). Categorical Reparameterization with Gumbel-Softmax. ICLR.
- Vahdat, A., & Kautz, J. (2020). NVAE: A Deep Hierarchical Variational Autoencoder. NeurIPS.