Week 2: Variational Autoencoders

✦Learning Outcomes
  • Implement the reparameterization trick to enable gradient-based optimization through stochastic latent samples
  • Diagnose and mitigate posterior collapse in VAE training
  • Compare Gaussian, categorical, and hierarchical VAE variants for different representation learning goals
◆Prerequisites
  • Latent variable models and the marginal likelihood
  • Evidence Lower Bound (ELBO) derivation
  • Amortized variational inference

Recommended: Basic understanding of neural network training and backpropagation.

Purpose of this lecture

The variational autoencoder (VAE; Kingma and Welling, 2013) is the first complete generative architecture to directly operationalize the ELBO framework from Week 1. It simultaneously trains an encoder that compresses observations into a structured latent space and a decoder that reconstructs observations from latents, using backpropagation throughout via the reparameterization trick. The VAE is foundational both in its own right (latent diffusion models in Week 9 use VAE encoders as their compression stage) and as a template for understanding the tradeoffs that all latent variable generative models face.


The autoencoder baseline

A deterministic autoencoder compresses $x$ to a code $z = f_\phi(x)$ and reconstructs $\hat{x} = g_\theta(z)$, trained to minimize reconstruction error $\|x - \hat{x}\|^2$. This achieves compression but fails as a generative model: the latent space has no prescribed structure, so sampling a random $z$ from, say, a Gaussian and passing it through $g_\theta$ produces out-of-distribution garbage. The decoder was never trained on latent codes other than those produced by the encoder on training data.

The VAE solves this by replacing the deterministic encoder with a stochastic encoder (inference network) $q_\phi(z \mid x)$ and adding a prior $p(z)$ that regularizes the latent space. The reconstruction objective is supplemented by a KL penalty that pulls $q_\phi(z \mid x)$ toward the prior, so that the entire latent space, not just the codes of training examples, is meaningful to the decoder.


The VAE objective

For a Gaussian encoder and standard Gaussian prior, the ELBO from Week 1 becomes the tractable VAE objective:

$$\mathcal{L}_\text{VAE}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_\text{KL}(q_\phi(z \mid x) \| p(z))$$

With $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$ and $p(z) = \mathcal{N}(0, I)$, the KL term has a closed form:

$$D_\text{KL}(q_\phi \| p) = \frac{1}{2}\sum_{j=1}^{d_z}\!\left(\sigma_{\phi,j}^2(x) + \mu_{\phi,j}^2(x) - 1 - \log \sigma_{\phi,j}^2(x)\right)$$

This sum over latent dimensions is computable without Monte Carlo. The reconstruction term requires sampling $z \sim q_\phi(z \mid x)$ and evaluating $\log p_\theta(x \mid z)$. For a Gaussian decoder $p_\theta(x \mid z) = \mathcal{N}(\mu_\theta(z), \sigma^2 I)$, this becomes, up to an additive constant that does not depend on $\theta$ or $\phi$, $-\frac{1}{2\sigma^2}\|x - \mu_\theta(z)\|^2$: squared error weighted by the inverse of the decoder variance.
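As a sanity check on the closed-form KL, the sketch below compares it against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$. The function names and the example values of `mu` and `log_var` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), in nats."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def monte_carlo_kl(mu, log_var, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the same KL as E_q[log q(z) - log p(z)]."""
    rng = np.random.default_rng(seed)
    sigma = np.exp(0.5 * log_var)
    z = mu + sigma * rng.standard_normal((n_samples, mu.shape[-1]))
    # The -0.5*log(2*pi) normalisers of q and p cancel in the difference.
    log_q = np.sum(-0.5 * log_var - 0.5 * ((z - mu) / sigma) ** 2, axis=-1)
    log_p = np.sum(-0.5 * z**2, axis=-1)
    return np.mean(log_q - log_p)

# Hypothetical encoder outputs for a single input with d_z = 3.
mu = np.array([0.5, -1.0, 0.0])
log_var = np.array([0.0, -1.0, 0.5])

kl_exact = gaussian_kl_to_standard_normal(mu, log_var)
kl_mc = monte_carlo_kl(mu, log_var)
print(f"closed form: {kl_exact:.4f}  Monte Carlo: {kl_mc:.4f}")
```

Parameterizing the encoder by `log_var` rather than the variance itself keeps the variance positive without any constraint on the network output.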


The reparameterization trick

The reconstruction gradient $\nabla_\phi \mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ cannot be computed by naive backpropagation because the sampling operation $z \sim q_\phi(z \mid x)$ has no gradient with respect to $\phi$. The reparameterization trick sidesteps this by rewriting the sample as a deterministic function of noise drawn from a fixed distribution:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

Now $z$ is a differentiable function of $(\mu_\phi, \sigma_\phi)$, and gradients flow through $z$ to the encoder parameters:

$$\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\!\left[\nabla_\phi f(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon)\right]$$

The trick works for any distribution that can be expressed as $z = g_\phi(\epsilon, x)$ for some deterministic, differentiable function $g_\phi$ and noise $\epsilon$ drawn from a fixed (parameter-free) distribution. Location-scale families such as the Laplace reparameterize in exactly this way; Gamma and Beta distributions have no simple closed-form $g_\phi$ and are usually handled with implicit or approximate reparameterization gradients. The trick fails for discrete distributions, which require alternative gradient estimators: REINFORCE (unbiased but high-variance) or the Gumbel-Softmax relaxation (lower-variance but biased).
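A small numerical sketch can make the estimator concrete. It assumes a scalar latent and the toy objective $f(z) = z^2$ (both chosen here for illustration), for which $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ and the exact gradients are $2\mu$ and $2\sigma$. The score-function (REINFORCE) estimate of the same $\mu$-gradient is included only to compare per-sample spread.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7                 # illustrative encoder outputs
eps = rng.standard_normal(500_000)   # parameter-free noise

z = mu + sigma * eps                 # reparameterized sample
df_dz = 2.0 * z                      # derivative of the toy objective f(z) = z**2

# Chain rule through z = mu + sigma * eps: dz/dmu = 1, dz/dsigma = eps.
grad_mu_reparam = df_dz * 1.0
grad_sigma_reparam = df_dz * eps

# Score-function estimate of d/dmu: f(z) * d/dmu log q(z) = f(z) * (z - mu) / sigma**2.
grad_mu_score = z**2 * (z - mu) / sigma**2

print("d/dmu    reparam:", grad_mu_reparam.mean(), " exact:", 2 * mu)
print("d/dsigma reparam:", grad_sigma_reparam.mean(), " exact:", 2 * sigma)
print("per-sample std  reparam:", grad_mu_reparam.std(), " score:", grad_mu_score.std())
```

Both estimators target the same gradient, but in this toy setting the score-function estimate has a noticeably larger per-sample standard deviation, which is the variance gap the text refers to.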

Why the trick fails for discrete variables

The failure mode for discrete distributions deserves careful attention because it recurs in vector-quantized models (VQ-VAEs, VQ-BeT from Course 2). For a discrete latent $z \sim \text{Categorical}(p_\phi(x))$, there is no way to write $z = g(p_\phi(x), \epsilon)$ with a continuous differentiable $g$: the sampling operation is a discontinuous argmax that blocks gradients. The REINFORCE estimator replaces $\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)]$ with $\mathbb{E}_{z \sim q_\phi}[f(z) \nabla_\phi \log q_\phi(z)]$, which is unbiased but high-variance. The Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) provides a continuous relaxation: replace the discrete sample with a softmax over Gumbel-perturbed logits:

$$\tilde{z}_k = \frac{\exp((\log p_k + g_k) / \tau)}{\sum_j \exp((\log p_j + g_j) / \tau)}, \quad g_k \sim \text{Gumbel}(0, 1)$$

where $\tau > 0$ is a temperature. As $\tau \to 0$, $\tilde{z}$ approaches the one-hot argmax; as $\tau \to \infty$, it approaches the uniform probability vector. At any $\tau > 0$, $\tilde{z}$ is differentiable in the logits $\log p_k$, enabling gradient-based training. In practice, the straight-through Gumbel estimator uses the hard discrete sample in the forward pass (for correct categorical behavior) but the soft Gumbel-Softmax gradient in the backward pass, a pragmatic but biased approximation that works well for discrete-latent models in practice.
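The relaxation can be sketched in a few lines of NumPy. This is a forward-pass illustration only (NumPy has no autodiff), and the helper name `gumbel_softmax_sample` and the example probabilities are assumptions made for the sketch.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)      # stabilise the exponentials
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

probs = np.array([0.1, 0.6, 0.3])              # illustrative categorical distribution
batch_logits = np.tile(np.log(probs), (20_000, 1))
rng = np.random.default_rng(0)

sharp = gumbel_softmax_sample(batch_logits, tau=0.1, rng=rng)   # close to one-hot
smooth = gumbel_softmax_sample(batch_logits, tau=5.0, rng=rng)  # close to uniform

# The argmax of the perturbed logits is an exact categorical sample (Gumbel-max).
freq = np.bincount(sharp.argmax(axis=1), minlength=3) / len(sharp)
print("argmax frequencies:", freq)
print("mean largest entry  tau=0.1:", sharp.max(axis=1).mean(),
      " tau=5.0:", smooth.max(axis=1).mean())
```

The argmax frequencies should track `probs` regardless of temperature, while the mean of the largest entry shows how the temperature moves samples between near-one-hot and near-uniform vectors.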


Training and architecture

The standard VAE training loop: for each mini-batch, run the encoder to obtain $(\mu_\phi(x), \sigma_\phi(x))$; sample $\epsilon \sim \mathcal{N}(0, I)$; compute $z = \mu_\phi + \sigma_\phi \odot \epsilon$; run the decoder to obtain $\mu_\theta(z)$; compute the negative ELBO as the loss; backpropagate. The encoder and decoder are trained jointly in the same gradient step.
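The forward half of this loop can be sketched as follows. To stay self-contained the sketch uses hypothetical linear maps (`W_mu`, `W_logvar`, `W_dec`) in place of real encoder and decoder networks and stops before the backward pass, which an autodiff framework would supply. The additive constant of the Gaussian log-likelihood is dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, batch = 8, 2, 16

# Hypothetical linear stand-ins for the encoder and decoder networks.
W_mu = 0.1 * rng.standard_normal((d_x, d_z))
W_logvar = 0.1 * rng.standard_normal((d_x, d_z))
W_dec = 0.1 * rng.standard_normal((d_z, d_x))

def negative_elbo(x, rng, decoder_var=1.0):
    """Single-sample negative ELBO for a mini-batch, averaged over examples."""
    mu = x @ W_mu                                # encoder mean
    log_var = x @ W_logvar                       # encoder log-variance
    eps = rng.standard_normal(mu.shape)          # parameter-free noise
    z = mu + np.exp(0.5 * log_var) * eps         # reparameterized sample
    x_hat = z @ W_dec                            # decoder mean
    recon = 0.5 / decoder_var * np.sum((x - x_hat) ** 2, axis=-1)
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    return np.mean(recon + kl), np.mean(recon), np.mean(kl)

x = rng.standard_normal((batch, d_x))            # synthetic mini-batch
loss, recon, kl = negative_elbo(x, rng)
print(f"loss={loss:.3f}  recon={recon:.3f}  kl={kl:.3f}")
```

Logging the reconstruction and KL terms separately, as done here, is what makes the collapse diagnostics discussed later in the lecture possible.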

Architecture in practice

For image VAEs, the encoder is a convolutional network that takes $x \in \mathbb{R}^{H \times W \times 3}$ and outputs two vectors $\mu_\phi(x), \log \sigma^2_\phi(x) \in \mathbb{R}^{d_z}$ through a series of strided convolutions followed by a linear projection. A typical small VAE for $64 \times 64$ images uses 4 convolutional stages (doubling channels: 64 → 128 → 256 → 512) and a latent dimension $d_z = 128$–$256$. The decoder is a mirror image: a linear projection followed by transposed convolutions (or nearest-neighbor upsampling + convolution, which avoids checkerboard artifacts) back to $H \times W \times 3$.

The latent space design for latent diffusion models (Week 9) differs fundamentally from the vector-latent design above. The Stable Diffusion VAE uses a spatial latent map rather than a vector: input $x \in \mathbb{R}^{512 \times 512 \times 3}$ is encoded to $z \in \mathbb{R}^{64 \times 64 \times 4}$, an $8\times$ spatial downsampling into 4 channels. The diffusion model then operates on these $64 \times 64 \times 4$ latents rather than $512 \times 512 \times 3$ pixels, reducing the dimensionality from 786,432 to 16,384, a $48\times$ reduction. The KL regularization in this spatial VAE uses a weight $\beta = 10^{-6}$ (extremely small), because the primary goal is high-fidelity compression rather than a structured latent distribution. The slight KL regularization keeps the latent space from becoming too irregular for the subsequent diffusion model to model and sample from.
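The compression figures quoted above follow directly from the tensor shapes; a short check, using only the shapes stated in the text:

```python
# Shapes quoted in the text: pixels are 512 x 512 x 3, latents are 64 x 64 x 4.
pixel_shape = (512, 512, 3)
latent_shape = (64, 64, 4)

pixel_dims = pixel_shape[0] * pixel_shape[1] * pixel_shape[2]
latent_dims = latent_shape[0] * latent_shape[1] * latent_shape[2]
spatial_factor = pixel_shape[0] // latent_shape[0]

print(pixel_dims, latent_dims)          # 786432 16384
print(spatial_factor)                   # 8
print(pixel_dims / latent_dims)         # 48.0
```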

Decoder variance: treating $\sigma^2$ as a fixed hyperparameter (rather than a learned output) simplifies training. $\sigma^2 = 1$ gives the unweighted objective, $\tfrac{1}{2}\|x - \mu_\theta(z)\|^2$ plus the KL; $\sigma^2 < 1$ upweights reconstruction (sharper images, less regularization); $\sigma^2 > 1$ upweights KL (more structured latent space, blurrier images). The decoder variance is the primary lever for balancing generation quality against representation quality.
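One way to see why the decoder variance acts as a reconstruction-versus-KL dial is to rescale the single-sample negative ELBO by $2\sigma^2$. The label $\beta_\text{eff}$ is introduced here purely for illustration and is not a symbol from the lecture; "const" collects terms that do not depend on $\theta$ or $\phi$ when $\sigma^2$ is fixed.

```latex
-\mathcal{L}_\text{VAE}
  = \frac{1}{2\sigma^2}\,\|x - \mu_\theta(z)\|^2
    + D_\text{KL}(q_\phi(z \mid x) \| p(z)) + \text{const}
\quad\Longrightarrow\quad
2\sigma^2 \left(-\mathcal{L}_\text{VAE}\right)
  = \|x - \mu_\theta(z)\|^2
    + \underbrace{2\sigma^2}_{\beta_\text{eff}}\, D_\text{KL}(q_\phi(z \mid x) \| p(z)) + \text{const}
```

Read this way, choosing a fixed decoder variance is equivalent to choosing a KL weight of $2\sigma^2$ relative to plain squared error, so smaller $\sigma^2$ means weaker regularization.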


Posterior collapse

A notorious failure mode of VAEs is posterior collapse: during training, the encoder learns to output $\mu_\phi(x) \approx 0$ and $\sigma_\phi(x) \approx 1$ for all $x$, so it ignores the input and reproduces the prior. When this happens, the KL term is zero, the decoder learns to model the data while ignoring $z$, and the latent space carries no information about $x$.

Posterior collapse occurs because: (1) powerful decoders can achieve low reconstruction loss without using $z$ at all, so gradients through $z$ vanish early in training; (2) the KL term exerts a persistent pull on $q_\phi$ toward the prior regardless of the reconstruction quality. The model finds a degenerate but locally stable optimum.

Detecting collapse in practice: monitor the per-dimension KL during training, $D_\text{KL}(q_\phi(z_j \mid x) \| p(z_j))$ for each latent dimension $j$. A healthy VAE will have most dimensions with KL $\approx 1$–$5$ nats, meaning the encoder is using those dimensions to encode information. Collapsed dimensions have KL $\approx 0$: the encoder outputs $\mu_j \approx 0, \sigma_j \approx 1$ regardless of $x$. As a rule of thumb, if more than 10–20% of latent dimensions are below 0.1 nats, the model is collapsing and training should be restarted with lower decoder capacity or stronger KL annealing.
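The monitoring rule above can be sketched as a small diagnostic. The helper names are hypothetical, and the encoder outputs are synthetic: six dimensions are made informative and two are made to mimic collapse, so the expected fraction is known in advance.

```python
import numpy as np

def per_dimension_kl(mu, log_var):
    """Batch-averaged KL to N(0, 1) for each latent dimension, in nats."""
    kl = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var)   # shape (batch, d_z)
    return kl.mean(axis=0)

def collapsed_fraction(mu, log_var, threshold=0.1):
    """Fraction of latent dimensions whose average KL is below the threshold."""
    return float(np.mean(per_dimension_kl(mu, log_var) < threshold))

rng = np.random.default_rng(0)
batch = 1000

# Synthetic encoder outputs: 6 active dimensions, 2 collapsed ones.
mu = np.concatenate([1.5 * rng.standard_normal((batch, 6)),
                     0.01 * rng.standard_normal((batch, 2))], axis=1)
log_var = np.concatenate([np.full((batch, 6), -2.0),
                          np.zeros((batch, 2))], axis=1)

kl_j = per_dimension_kl(mu, log_var)
frac = collapsed_fraction(mu, log_var)
print("per-dimension KL (nats):", np.round(kl_j, 3))
print("collapsed fraction:", frac)
```

In a real training run, `mu` and `log_var` would come from the encoder on a held-out batch, and the fraction would be tracked over time rather than computed once.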

Mitigations include KL annealing (multiply the KL term by a weight $\beta$, starting at $\beta = 0$ and slowly ramping to 1, so the KL term cannot force collapse before the decoder learns to use $z$), free bits (set a floor of $\lambda$ nats per latent dimension, zeroing out the KL gradient whenever that dimension's KL is below $\lambda$), and architectural constraints that prevent the decoder from being too powerful (e.g., restricting its receptive field or capacity).
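The first two mitigations are simple enough to sketch directly. The linear ramp and the `max(KL_j, floor)` form are common choices assumed here for illustration; the lecture does not fix a particular schedule, and the function names are hypothetical. The floor is expressed in nats to match the per-dimension KL values.

```python
import numpy as np

def linear_kl_weight(step, warmup_steps):
    """KL weight that ramps linearly from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, floor_nats):
    """Sum of max(KL_j, floor): dimensions below the floor contribute a
    constant, so they receive no KL gradient."""
    return float(np.sum(np.maximum(kl_per_dim, floor_nats)))

# Annealing schedule at a few training steps (warm-up length is illustrative).
for step in (0, 500, 1000, 2000):
    print(step, linear_kl_weight(step, warmup_steps=1000))

# Free bits on hypothetical batch-averaged per-dimension KLs.
kl_per_dim = np.array([0.02, 0.8, 2.5])
penalty = free_bits_kl(kl_per_dim, floor_nats=0.5)
print(penalty)   # 0.5 + 0.8 + 2.5 = 3.8
```

Because the clamped dimension contributes the constant 0.5 rather than its actual KL, an optimizer sees no incentive to shrink it further toward the prior.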


$\beta$-VAE and disentanglement

$\beta$-VAE (Higgins et al., 2017) modifies the VAE objective by upweighting the KL term:

$$\mathcal{L}_\beta = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \beta \cdot D_\text{KL}(q_\phi(z \mid x) \| p(z)), \quad \beta > 1$$

Increasing $\beta$ compresses the information in $z$ more aggressively, which tends to disentangle the latent dimensions: each dimension of $z$ captures a single semantically interpretable factor of variation (e.g., rotation, scale, color) rather than multiple entangled factors. The tradeoff is reduced reconstruction quality.

The disentanglement intuition: with $\beta \gg 1$, the information bottleneck is so tight that only the most compressible (statistically independent) structure survives in $z$. Independent generative factors, which by definition have no statistical interaction, are the most efficient representation under this bottleneck.

Hierarchical VAEs (NVAE, VDVAE) stack multiple levels of latent variables $z_1, z_2, \ldots, z_L$ with the joint prior factored as $p(z_1, \ldots, z_L) = p(z_L) \prod_{\ell=1}^{L-1} p(z_\ell \mid z_{\ell+1}, \ldots, z_L)$. Each level captures structure at a different scale: $z_L$ encodes global semantics, $z_1$ encodes local details. Hierarchical VAEs substantially improve sample quality over single-level VAEs, approaching GAN-level quality on face generation benchmarks.


Vector-Quantized VAE

The VQ-VAE (van den Oord et al., 2017) replaces the continuous Gaussian latent space with a discrete codebook. The encoder maps $x$ to a continuous embedding $z_e = E_\phi(x)$, which is then quantized to the nearest codebook vector:

$$z_q = e_{k^*}, \quad k^* = \arg\min_k \|z_e - e_k\|_2$$

where $\{e_k\}_{k=1}^K$ are the $K$ learnable codebook vectors. The decoder receives $z_q$ and reconstructs $x$. The discrete bottleneck prevents the posterior collapse problem of continuous VAEs: the codebook has a fixed, discrete structure that cannot be bypassed.

Training requires handling the non-differentiable argmin. The straight-through estimator copies the gradient from $z_q$ back to $z_e$ directly, ignoring the quantization step in the backward pass. The full VQ-VAE loss has three terms: reconstruction, codebook alignment (pulling codebook vectors toward encoder outputs), and commitment (pulling encoder outputs toward codebook vectors):

$$\mathcal{L}_\text{VQ} = \underbrace{\|x - D_\theta(z_q)\|^2}_{\text{reconstruction}} + \underbrace{\|\text{sg}(z_e) - z_q\|^2}_{\text{codebook update}} + \underbrace{\beta_\text{commit}\|z_e - \text{sg}(z_q)\|^2}_{\text{commitment}}$$

where $\text{sg}(\cdot)$ denotes stop-gradient. The commitment weight $\beta_\text{commit} = 0.25$ (default) balances encoder stability against codebook update speed.
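The quantization step and the two codebook-related loss terms can be sketched numerically. This is a forward-pass illustration on a hypothetical 3-entry, 2-dimensional codebook; in NumPy the stop-gradient is a no-op, so the codebook and commitment terms share the same value and differ only in which parameters would receive the gradient under autodiff.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output to its nearest codebook vector (L2 distance)."""
    # Squared distances between every encoder output and every code: (N, K).
    d2 = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    idx = np.argmin(d2, axis=1)
    return codebook[idx], idx

# Hypothetical codebook (K = 3) and encoder outputs (N = 3).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z_e = np.array([[0.9, 1.2], [-0.1, 0.2], [-0.8, 0.7]])

z_q, idx = quantize(z_e, codebook)
print(idx)                                   # [1 0 2]

beta_commit = 0.25
quant_err = np.sum((z_e - z_q) ** 2)         # shared forward value
codebook_term = quant_err                    # gradient would reach the codebook only
commitment_term = beta_commit * quant_err    # gradient would reach the encoder only
print(codebook_term, commitment_term)

# Straight-through: the forward value of z_e + sg(z_q - z_e) is just z_q.
z_st = z_e + (z_q - z_e)
```

In an autodiff framework, the straight-through line is what lets the reconstruction gradient reach the encoder as if quantization were the identity.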

VQ-VAEs enable two-stage generation: first train the VQ-VAE to produce discrete codes; then train a separate autoregressive or diffusion model over the discrete code sequence. This two-stage approach produces sharp samples (the VQ-VAE decoder preserves much of the local detail) while enabling tractable likelihood over the discrete code sequence. DALL-E 1 used a $32 \times 32$ grid of tokens from an 8192-entry codebook, learned by a discrete VAE trained with a Gumbel-Softmax relaxation rather than nearest-neighbor quantization; subsequent work replaced the discrete stage with continuous spatial VAEs for better high-frequency fidelity. In robotics (Course 2 Week 8), VQ-BeT applies the same codebook structure to discrete action bins, enabling a behavior transformer to model multimodal action distributions by predicting codebook indices rather than continuous action values.


VAEs as generative models

At inference time, the VAE generates samples by: (1) drawing $z \sim p(z) = \mathcal{N}(0, I)$; (2) passing $z$ through the decoder to obtain the output distribution $p_\theta(x \mid z)$; (3) sampling $x \sim p_\theta(x \mid z)$ or taking the mean $\mu_\theta(z)$. Generated samples are often blurry because the Gaussian decoder with MSE reconstruction minimizes the expected pixel error, which averages over aleatoric uncertainty and produces the mean of the conditional rather than a sharp mode.

This blurriness is not a bug in the VAE algorithm but a consequence of the Gaussian likelihood assumption: the optimal decoder output under MSE loss is the conditional mean $\mathbb{E}[x \mid z]$, which averages over multiple plausible renderings. The latent diffusion models of Week 9 sidestep the problem: the autoencoder is trained with only a very small KL weight so that it reconstructs faithfully, and the generative modeling is moved into a diffusion model over the latent codes.
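A one-pixel toy case shows the averaging effect. Suppose, purely for illustration, that a given latent code is equally compatible with two sharp renderings of a pixel, $-1$ and $+1$. The single output that minimizes mean squared error is neither of them.

```python
import numpy as np

# Two equally plausible sharp values of one pixel for the same latent code.
targets = np.array([-1.0, 1.0])

# Scan candidate decoder outputs and compute the MSE against both targets.
candidates = np.linspace(-1.5, 1.5, 301)
mse = np.array([np.mean((targets - c) ** 2) for c in candidates])

best = candidates[np.argmin(mse)]
print("MSE-optimal output:", best)          # approximately 0, the mean of the targets
print("mean of targets:  ", targets.mean())
```

The minimizer sits at the mean of the two modes, a gray value that matches neither rendering; applied across all pixels of an image, this is the blur described above.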


GenAI context: VAEs across the course sequence

| VAE concept | Language model analog | Robotics (Course 2) analog |
|---|---|---|
| Encoder $q_\phi(z \mid x)$ | Input token embedding | Demo sequence → action code (ACT) |
| Reparameterization trick | Straight-through for discrete tokens | CVAE continuous action latents |
| Posterior collapse | LM ignoring context | CVAE encoder ignoring demonstration |
| $\beta$-VAE disentanglement | Bottleneck transformers | Disentangled task/motion latents |
| Hierarchical VAE | Multi-level prompt latents | Hierarchical skill latent hierarchy |
| Spatial latent VAE | VQGAN token grid for image LMs | Compressed sensor observations |

The CVAE architecture in ACT (Action Chunking with Transformers; Course 2, Week 8) is a direct application of the conditional VAE framework developed here: the encoder maps a demonstration sequence to an action latent code $z$, the prior is $\mathcal{N}(0, I)$, and the decoder generates action chunk predictions conditioned on $z$ and the current observation. The ELBO trained there is the same mathematical object, a reconstruction term plus KL regularization, with "demonstration" replacing "image" in the reconstruction objective.


Key takeaways

The VAE jointly trains encoder and decoder by maximizing the ELBO, decomposed into a reconstruction term and a KL divergence toward the prior. The reparameterization trick enables gradient-based optimization through stochastic latent samples by expressing $z$ as a deterministic function of the encoder output and noise from a fixed distribution. Posterior collapse is the primary failure mode, caused by powerful decoders learning to ignore $z$, and is mitigated by KL annealing or free bits. $\beta$-VAE upweights the KL term to improve disentanglement at the cost of reconstruction quality. Hierarchical VAEs stack multiple latent levels, recovering sample quality competitive with GANs while retaining a tractable lower bound on the likelihood.


Conceptual questions

  1. A VAE decoder is parameterized as a Gaussian $p_\theta(x \mid z) = \mathcal{N}(\mu_\theta(z), \sigma^2 I)$ with $\sigma^2 = 0.01$. Show that in this limit, the ELBO reconstruction term approaches $-\frac{1}{2\sigma^2}\|x - \mu_\theta(z)\|^2$, which dominates the KL term. What is the risk of using very small $\sigma^2$? Conversely, what happens to sample quality as $\sigma^2 \to \infty$?

  2. Derive the closed-form KL divergence $D_\text{KL}(\mathcal{N}(\mu, \sigma^2 I) \| \mathcal{N}(0, I))$ for $d_z$-dimensional Gaussian distributions. Show that it equals zero if and only if $\mu = 0$ and $\sigma = 1$. How does this expression change if the prior is $\mathcal{N}(\mu_0, \sigma_0^2 I)$ rather than the standard Gaussian?

  3. During training of a VAE, the KL term is observed to be zero after 100 gradient steps while the reconstruction loss is still high. Diagnose this failure mode and propose two architectural or training modifications that would prevent it. For each modification, explain the mechanism by which it discourages the degenerate solution.

  4. A $\beta$-VAE is trained on a dataset of 3D objects with independent generative factors (shape, size, orientation, color). For $\beta = 1$ (standard VAE) and $\beta = 10$, predict: (a) the reconstruction quality on held-out objects, (b) whether the latent dimensions will be aligned with the generative factors, and (c) the quality of samples generated by interpolating between two latent codes. Justify each prediction using the information-bottleneck interpretation.

  5. Hierarchical VAEs use $L$ levels of latent variables. During inference, the posterior $q_\phi(z_1, \ldots, z_L \mid x)$ must approximate the true posterior $p_\theta(z_1, \ldots, z_L \mid x)$. Write the ELBO for a two-level hierarchical VAE and identify where the posterior approximation gap arises for each level. Which level (top or bottom) do you expect to suffer from larger posterior approximation error, and why?

✦Solutions
  1. As $\sigma^2 \to 0$, the reconstruction term $-\tfrac{1}{2\sigma^2}\|x-\mu_\theta(z)\|^2$ is scaled by $1/\sigma^2 \to \infty$, dwarfing the (unscaled) KL term. The risk is that the KL is effectively ignored, the latent space loses its prior structure, and the model overfits and behaves like a deterministic autoencoder (poor samples from $p(z)$). As $\sigma^2 \to \infty$ the reconstruction term vanishes, the KL dominates and forces posterior collapse, giving blurry, content-free samples.
  2. $D_\text{KL}(\mathcal{N}(\mu,\sigma^2 I)\,\|\,\mathcal{N}(0,I)) = \tfrac12\sum_j(\sigma_j^2+\mu_j^2-1-\log\sigma_j^2)$, which is $\geq 0$ with equality iff every $\mu_j=0$ and $\sigma_j=1$. For a general prior $\mathcal{N}(\mu_0,\sigma_0^2 I)$ each term becomes $\tfrac12\big(\tfrac{\sigma_j^2+(\mu_j-\mu_{0,j})^2}{\sigma_0^2} - 1 - \log\tfrac{\sigma_j^2}{\sigma_0^2}\big)$.
  3. KL $\to 0$ with reconstruction loss still high is posterior collapse: the KL pressure has driven $q_\phi(z \mid x)$ to the prior before the decoder learned to use $z$, so the latent carries no information about $x$. Two fixes: (a) KL annealing: ramp $\beta$ from 0 to 1 so the decoder learns to use $z$ before the KL pressure applies; (b) free bits: zero the KL gradient below a floor $\lambda$ per dimension, so low-KL dimensions are not pushed all the way to the prior. Alternatively, weaken the decoder's capacity or receptive field so it cannot ignore $z$.
  4. (a) Reconstruction is worse at $\beta=10$ (tighter information bottleneck). (b) Latent dimensions are more likely aligned with the independent factors at $\beta=10$, since only statistically independent structure survives the bottleneck. (c) Interpolations are smoother and more semantically meaningful at $\beta=10$ because the latent space is more structured and disentangled, at the cost of sharpness.
  5. Two-level ELBO: $\mathbb{E}_{q_\phi(z_1,z_2\mid x)}[\log p_\theta(x\mid z_1)] - \mathbb{E}_{q_\phi(z_2\mid x)}D_\text{KL}(q_\phi(z_1\mid z_2,x)\|p_\theta(z_1\mid z_2)) - D_\text{KL}(q_\phi(z_2\mid x)\|p(z_2))$. A gap appears at each level wherever $q_\phi$ at that level departs from the true conditional posterior. The top level ($z_2$) typically suffers more: its posterior depends on marginalizing the lower level and is farther from the data, so the amortized Gaussian approximation is cruder there.

Looking ahead

VAEs provide a probabilistic, likelihood-based approach to generative modeling. The next model family takes a different approach: instead of approximating the intractable posterior, it sidesteps likelihood entirely.

Week 3: Generative Adversarial Networks. We derive the GAN min-max objective, show its connection to Jensen-Shannon divergence, analyze the Wasserstein distance alternative, and examine training instability and mode collapse — the pathologies that motivated nearly all subsequent GAN research.


Further reading

  • Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. ICLR. (The original VAE paper).
  • Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML.
  • Higgins, I., et al. (2017). beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR.
  • van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning (VQ-VAE). NeurIPS.
  • Jang, E., Gu, S., & Poole, B. (2017). Categorical Reparameterization with Gumbel-Softmax. ICLR.
  • Vahdat, A., & Kautz, J. (2020). NVAE: A Deep Hierarchical Variational Autoencoder. NeurIPS.