Week 2: Variational Autoencoders

Purpose of this lecture#

The variational autoencoder (VAE; Kingma and Welling, 2013) is the first complete generative architecture to directly operationalize the ELBO framework from Week 1. It simultaneously trains an encoder that compresses observations into a structured latent space and a decoder that reconstructs observations from latents, using backpropagation throughout via the reparameterization trick. The VAE is foundational both in its own right — latent diffusion models (Week 9) use VAE encoders as their compression stage — and as a template for understanding the tradeoffs that all latent variable generative models face.

The autoencoder baseline#

A deterministic autoencoder compresses $x$ to a code $z = f_\phi(x)$ and reconstructs $\hat{x} = g_\theta(z)$, trained to minimize the reconstruction error $\|x - \hat{x}\|^2$. This achieves compression but fails as a generative model: the latent space has no prescribed structure, so sampling a random $z$ from, say, a Gaussian and passing it through $g_\theta$ typically produces out-of-distribution outputs. The decoder was never trained on latent codes other than those the encoder produces on training data.

The VAE solves this by replacing the deterministic encoder with a stochastic encoder (inference network) $q_\phi(z \mid x)$ and adding a prior $p(z)$ that regularizes the latent space. The reconstruction objective is supplemented by a KL penalty that forces $q_\phi(z \mid x)$ toward the prior, making the entire latent space — not just the codes of training examples — meaningful to the decoder.

The VAE objective#

For a Gaussian encoder and standard Gaussian prior, the ELBO from Week 1 becomes the tractable VAE objective:

$$
\mathcal{L}_\text{VAE}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\log p_\theta(x \mid z)\right] - D_\text{KL}(q_\phi(z \mid x) \| p(z))
$$

With $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$ and $p(z) = \mathcal{N}(0, I)$, the KL term has a closed form:

$$
D_\text{KL}(q_\phi \| p) = \frac{1}{2}\sum_{j=1}^{d_z}\!\left(\sigma_{\phi,j}^2(x) + \mu_{\phi,j}^2(x) - 1 - \log \sigma_{\phi,j}^2(x)\right)
$$

This sum over latent dimensions is computable without Monte Carlo. The reconstruction term requires sampling $z \sim q_\phi(z \mid x)$ and evaluating $\log p_\theta(x \mid z)$. For a Gaussian decoder $p_\theta(x \mid z) = \mathcal{N}(\mu_\theta(z), \sigma^2 I)$, this equals $-\frac{1}{2\sigma^2}\|x - \mu_\theta(z)\|^2$ plus a constant that does not depend on $\theta$ or $\phi$ when $\sigma^2$ is fixed: a squared error weighted by the inverse decoder variance.
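The two ingredients of the objective can be sketched in a few lines. This is a minimal illustration using plain Python lists in place of tensors; the function names are hypothetical, and the encoder is assumed to output the log-variance $\log \sigma^2_\phi(x)$ rather than $\sigma_\phi(x)$ itself.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), in nats."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )

def gaussian_log_likelihood(x, mean, sigma2):
    """log N(x; mean, sigma2 * I), including the normalizing constant."""
    sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -sq_err / (2.0 * sigma2) - 0.5 * len(x) * math.log(2.0 * math.pi * sigma2)

# The KL vanishes when the encoder output equals the prior.
print(kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Shifting one mean by 1 at unit variance costs 0.5 nats.
print(kl_diag_gaussian_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

Working with the log-variance keeps the variance positive without any constraint on the encoder output.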

The reparameterization trick#

The reconstruction gradient $\nabla_\phi \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x \mid z)]$ cannot be computed by naive backpropagation because the sampling operation $z \sim q_\phi(z \mid x)$ is not a differentiable function of $\phi$. The reparameterization trick sidesteps this by rewriting the sample as a deterministic function of the encoder outputs and noise drawn from a fixed distribution:

$$
z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$

Now $z$ is a differentiable function of $(\mu_\phi, \sigma_\phi)$, and gradients flow through $z$ to the encoder parameters:

$$
\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\!\left[\nabla_\phi f(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon)\right]
$$

The trick works for any distribution whose samples can be written as $z = g_\phi(\epsilon, x)$ for a differentiable function $g_\phi$ and noise $\epsilon$ drawn from a fixed (parameter-free) distribution. Location-scale families such as the Gaussian and Laplace qualify directly; Gamma and Beta distributions have no such simple form and are usually handled with implicit or generalized reparameterization gradients. The trick fails for discrete distributions, which require alternative gradient estimators: REINFORCE (unbiased but high-variance) or the Gumbel-Softmax relaxation (lower-variance but biased).
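A scalar example makes the pathwise gradient concrete. The sketch below assumes the test function $f(z) = z^2$ (chosen here only because its expectation $\mu^2 + \sigma^2$ has known gradients) and applies the chain rule by hand in place of an autodiff framework.

```python
import random

def reparam_gradients(mu, sigma, num_samples=20000, seed=0):
    """Pathwise Monte Carlo estimate of d/dmu and d/dsigma of E[z^2], z = mu + sigma*eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # parameter-free noise
        z = mu + sigma * eps        # deterministic, differentiable in (mu, sigma)
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = reparam_gradients(mu=1.0, sigma=0.5)
# Analytic values: E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu = 2.0 and 2*sigma = 1.0.
print(g_mu, g_sigma)
```

The estimates should land close to the analytic values, with the usual Monte Carlo error shrinking as the number of samples grows.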

Why the trick fails for discrete variables#

The failure mode for discrete distributions deserves careful attention because it recurs in vector-quantized models (VQ-VAEs, VQ-BeT from Course 2). For a discrete latent $z \sim \text{Categorical}(p_\phi(x))$, there is no way to write $z = g(p_\phi(x), \epsilon)$ with a continuous differentiable $g$: sampling can be written as an argmax over noise-perturbed logits, but the argmax is piecewise constant, so its gradient is zero almost everywhere. The REINFORCE estimator instead uses the identity $\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)] = \mathbb{E}_{z \sim q_\phi}[f(z) \nabla_\phi \log q_\phi(z \mid x)]$, which yields an unbiased but high-variance Monte Carlo estimate. The Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) provides a continuous relaxation that replaces the discrete sample with a softmax over Gumbel-perturbed logits:

$$
\tilde{z}_k = \frac{\exp((\log p_k + g_k) / \tau)}{\sum_j \exp((\log p_j + g_j) / \tau)}, \quad g_k \sim \text{Gumbel}(0, 1)
$$

where $\tau > 0$ is a temperature. As $\tau \to 0$, $\tilde{z}$ approaches the one-hot argmax; as $\tau \to \infty$, it approaches a uniform distribution. At any $\tau > 0$, $\tilde{z}$ is differentiable in the logits $\log p_k$, enabling gradient-based training. In practice, the straight-through Gumbel estimator uses the hard discrete sample in the forward pass (for correct categorical behavior) but the soft Gumbel-Softmax gradient in the backward pass. This is a pragmatic, biased approximation; VQ-VAEs use a closely related straight-through estimator for their nearest-neighbor quantization step.
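The relaxation and its temperature limits can be sketched directly from the formula. This is an illustrative standard-library version; the helper names and the three-class probabilities are assumptions made for the example, and the Gumbel noise is passed in explicitly so the effect of $\tau$ can be seen for a fixed draw.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) variate via the inverse CDF: -log(-log(u))."""
    u = rng.random()
    while u == 0.0:   # avoid log(0); random() can return exactly 0.0
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_softmax(log_probs, gumbels, tau):
    """Relaxed one-hot sample: softmax((log p + g) / tau)."""
    scaled = [(lp + g) / tau for lp, g in zip(log_probs, gumbels)]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

log_probs = [math.log(p) for p in (0.5, 0.3, 0.2)]
rng = random.Random(0)
gumbels = [sample_gumbel(rng) for _ in log_probs]
for tau in (0.1, 1.0, 10.0):
    # Small tau gives a nearly one-hot vector; large tau flattens it toward uniform.
    print(tau, gumbel_softmax(log_probs, gumbels, tau))
```

A straight-through variant would return the one-hot argmax of this vector in the forward pass while reusing the soft vector's gradient in the backward pass, which requires an autodiff framework and is not shown here.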

Training and architecture#

The standard VAE training loop: for each mini-batch, run the encoder to obtain $(\mu_\phi(x), \sigma_\phi(x))$; sample $\epsilon \sim \mathcal{N}(0, I)$; compute $z = \mu_\phi + \sigma_\phi \odot \epsilon$; run the decoder to obtain $\mu_\theta(z)$; compute the negative ELBO as the loss; backpropagate. The encoder and decoder are trained jointly in the same gradient step.
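The forward half of that loop can be sketched for a single data point. The toy `encoder` and `decoder` below are hypothetical elementwise linear maps standing in for neural networks, and the final backpropagation step is omitted because it would be handled by an autodiff framework.

```python
import math
import random

def encoder(x, w=0.5, log_var=-1.0):
    """Hypothetical toy encoder: mu = w * x elementwise, constant log-variance."""
    return [w * xi for xi in x], [log_var for _ in x]

def decoder(z, v=2.0):
    """Hypothetical toy decoder: mean = v * z elementwise."""
    return [v * zi for zi in z]

def negative_elbo(x, eps, sigma2=1.0):
    """Single-sample negative ELBO (additive constants dropped) for one data point."""
    mu, log_var = encoder(x)                         # run the encoder
    z = [m + math.exp(0.5 * lv) * e                  # z = mu + sigma * eps
         for m, lv, e in zip(mu, log_var, eps)]
    x_mean = decoder(z)                              # run the decoder
    recon = sum((xi - mi) ** 2 for xi, mi in zip(x, x_mean)) / (2.0 * sigma2)
    kl = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))
    return recon + kl                                # loss to minimize

rng = random.Random(0)
x = [1.0, -2.0]
eps = [rng.gauss(0.0, 1.0) for _ in x]               # sample the noise
print(negative_elbo(x, eps))
```

Because the noise enters only through `eps`, the loss is an ordinary differentiable function of the encoder and decoder parameters, which is what lets a single gradient step update both.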

Architecture in practice#

For image VAEs, the encoder is a convolutional network that takes $x \in \mathbb{R}^{H \times W \times 3}$ and outputs two vectors $\mu_\phi(x), \log \sigma^2_\phi(x) \in \mathbb{R}^{d_z}$ through a series of strided convolutions followed by a linear projection. A typical small VAE for $64 \times 64$ images uses 4 convolutional stages (doubling channels: 64 → 128 → 256 → 512) and a latent dimension $d_z$ of 128–256. The decoder is a mirror image: a linear projection followed by transposed convolutions (or nearest-neighbor upsampling + convolution, which avoids checkerboard artifacts) back to $H \times W \times 3$.

The latent space design for latent diffusion models (Week 9) differs fundamentally from the vector-latent design above. The Stable Diffusion VAE uses a spatial latent map rather than a vector: an input $x \in \mathbb{R}^{512 \times 512 \times 3}$ is encoded to $z \in \mathbb{R}^{64 \times 64 \times 4}$, an $8\times$ downsampling along each spatial axis into 4 channels. The diffusion model then operates on these $64 \times 64 \times 4$ latents rather than $512 \times 512 \times 3$ pixels, reducing the dimensionality from 786,432 to 16,384, a $48\times$ reduction. The KL regularization in this spatial VAE uses a very small weight, $\beta = 10^{-6}$, because the primary goal is high-fidelity compression rather than a structured latent distribution. This slight regularization keeps the latent distribution from becoming too irregular for the subsequent diffusion model to fit.
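The compression figures follow from the tensor shapes alone, as this short check shows (the helper name is hypothetical):

```python
def num_elements(shape):
    """Total number of scalar entries in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n

pixel_shape = (512, 512, 3)    # RGB image, as stated in the text
latent_shape = (64, 64, 4)     # spatial latent map, as stated in the text

pixels = num_elements(pixel_shape)
latents = num_elements(latent_shape)
print(pixels, latents, pixels // latents)   # 786432 16384 48
print(pixel_shape[0] // latent_shape[0])    # 8 (downsampling factor per spatial axis)
```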

Decoder variance: treating $\sigma^2$ as a fixed hyperparameter (rather than a learned output) simplifies training. Setting $\sigma^2 = 1$ keeps the standard ELBO weighting between reconstruction and KL; $\sigma^2 < 1$ upweights reconstruction (sharper images, less regularization); $\sigma^2 > 1$ upweights KL (more structured latent space, blurrier images). With a fixed-variance Gaussian decoder, $\sigma^2$ is the primary lever for trading reconstruction sharpness against latent-space regularity.
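The link between decoder variance and KL weight can be made explicit with a short derivation. It assumes a fixed isotropic decoder variance and writes $D$ for the dimensionality of $x$ (a symbol introduced here for convenience). Substituting the Gaussian log-likelihood into the negative ELBO gives

$$
-\mathcal{L}_\text{VAE} = \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{1}{2\sigma^2}\|x - \mu_\theta(z)\|^2\right] + D_\text{KL}(q_\phi(z \mid x) \| p(z)) + \frac{D}{2}\log(2\pi\sigma^2)
$$

Multiplying by the positive constant $2\sigma^2$ leaves the minimizer unchanged:

$$
2\sigma^2 \cdot \left(-\mathcal{L}_\text{VAE}\right) = \mathbb{E}_{q_\phi(z|x)}\!\left[\|x - \mu_\theta(z)\|^2\right] + 2\sigma^2 \, D_\text{KL}(q_\phi(z \mid x) \| p(z)) + \text{const}
$$

Read this way, a fixed $\sigma^2$ behaves like a KL weight of $2\sigma^2$ applied on top of a plain sum-of-squares reconstruction loss, so shrinking $\sigma^2$ shrinks the effective regularization.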

Posterior collapse#

A notorious failure mode of VAEs is posterior collapse: during training, the encoder learns to output $\mu_\phi(x) \approx 0$ and $\sigma_\phi(x) \approx 1$ for all $x$, so that it ignores the input and simply reproduces the prior. When this happens, the KL term is zero, the decoder learns to generate from the prior alone (ignoring $z$), and the latent space carries no information about $x$.

Posterior collapse occurs because: (1) powerful decoders (autoregressive ones in particular) can model the data well without using $z$ at all, so the reconstruction gradient through $z$ becomes weak early in training; (2) the KL term exerts a persistent pull of $q_\phi$ toward the prior regardless of reconstruction quality, and that pull vanishes only at the prior itself. The model settles into a degenerate but locally stable optimum.

Detecting collapse in practice: monitor the per-dimension KL during training, $D_\text{KL}(q_\phi(z_j \mid x) \| p(z_j))$ for each latent dimension $j$, averaged over a batch. A healthy VAE will typically have most dimensions with KL $\approx$ 1–5 nats, meaning the encoder is using those dimensions to encode information. Collapsed dimensions have KL $\approx 0$: the encoder outputs $\mu_j \approx 0, \sigma_j \approx 1$ regardless of $x$. As a rule of thumb, if more than 10–20% of latent dimensions are below 0.1 nats, the model is collapsing and training should be restarted with lower decoder capacity or stronger KL annealing.
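This monitoring rule can be sketched as follows. The batch of encoder outputs is made up for illustration, the function names are hypothetical, and the 0.1-nat threshold is the rule of thumb quoted above rather than a universal constant.

```python
import math

def per_dimension_kl(mu_batch, log_var_batch):
    """Batch-averaged KL to N(0, 1) for each latent dimension, in nats."""
    n = len(mu_batch)
    d = len(mu_batch[0])
    kl = [0.0] * d
    for mu, log_var in zip(mu_batch, log_var_batch):
        for j in range(d):
            kl[j] += 0.5 * (math.exp(log_var[j]) + mu[j] ** 2 - 1.0 - log_var[j])
    return [k / n for k in kl]

def collapsed_fraction(kl_per_dim, threshold=0.1):
    """Fraction of latent dimensions whose average KL is below the threshold."""
    return sum(k < threshold for k in kl_per_dim) / len(kl_per_dim)

# Hypothetical encoder outputs for a batch of 2 inputs and 3 latent dimensions.
mu_batch = [[2.0, 0.0, -1.5], [-2.0, 0.0, 1.5]]
log_var_batch = [[-2.0, 0.0, -1.0], [-2.0, 0.0, -1.0]]

kl = per_dimension_kl(mu_batch, log_var_batch)
print(kl)                       # the middle dimension sits exactly at the prior
print(collapsed_fraction(kl))   # one of three dimensions is collapsed
```

Logging this vector every few hundred steps makes it easy to see whether dimensions are switching off as the KL weight increases.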

Mitigations include KL annealing (start training with $\beta = 0$ and slowly ramp $\beta$ to 1, preventing the KL term from forcing collapse before the decoder learns to use $z$), free bits (set a floor of $\lambda$ nats per latent dimension, zeroing out the KL gradient whenever that dimension's KL is below $\lambda$), and architectural constraints that prevent the decoder from being too powerful (e.g., restricting its receptive field or capacity).
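The first two mitigations amount to small changes in how the KL term enters the loss. The sketch below assumes a linear warm-up schedule (one common choice among several) and uses hypothetical function names.

```python
def kl_weight(step, warmup_steps):
    """Linear KL annealing: beta rises from 0 to 1 over warmup_steps, then stays at 1."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, floor):
    """Free-bits KL: each dimension contributes max(KL_j, floor).

    Below the floor the term is constant, so an autodiff framework would
    pass zero KL gradient to that dimension.
    """
    return sum(max(k, floor) for k in kl_per_dim)

print([kl_weight(s, 1000) for s in (0, 250, 1000, 5000)])   # [0.0, 0.25, 1.0, 1.0]
print(free_bits_kl([0.02, 0.5, 2.0], floor=0.1))            # first term is clamped up to 0.1
```

In a training loop the loss would then be the reconstruction term plus `kl_weight(step, warmup_steps)` times the (optionally clamped) KL.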

$\beta$ -VAE and disentanglement#

$\beta$-VAE (Higgins et al., 2017) modifies the VAE objective by upweighting the KL term:

$$
\mathcal{L}_\beta = \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x \mid z)\right] - \beta \cdot D_\text{KL}(q_\phi(z \mid x) \| p(z)), \quad \beta > 1
$$

Increasing $\beta$ compresses the information in $z$ more aggressively, which tends to disentangle the latent dimensions: each dimension of $z$ tends to capture a single semantically interpretable factor of variation (e.g., rotation, scale, color) rather than multiple entangled factors. The tradeoff is reduced reconstruction quality.

The disentanglement intuition: with $\beta \gg 1$, the information bottleneck is so tight that only the most compressible (statistically independent) structure survives in $z$. Independent generative factors, which by definition have no statistical interaction, are the most efficient representation under this bottleneck.

Hierarchical VAEs (NVAE, VDVAE) stack multiple levels of latent variables $z_1, z_2, \ldots, z_L$ with the joint prior factored as $p(z_1, \ldots, z_L) = p(z_L) \prod_{\ell=1}^{L-1} p(z_\ell \mid z_{\ell+1}, \ldots, z_L)$. Each level captures structure at a different scale: $z_L$ encodes global semantics, $z_1$ encodes local details. Hierarchical VAEs substantially improve sample quality over single-level VAEs, approaching GAN-level quality on face generation benchmarks.

Vector-Quantized VAE#

The VQ-VAE (van den Oord et al., 2017) replaces the continuous Gaussian latent space with a discrete codebook. The encoder maps $x$ to a continuous embedding $z_e = E_\phi(x)$, which is then quantized to the nearest codebook vector:

$$
z_q = e_{k^*}, \quad k^* = \arg\min_k \|z_e - e_k\|_2
$$

where $\{e_k\}_{k=1}^K$ are the $K$ learnable codebook vectors. The decoder receives $z_q$ and reconstructs $x$. Because there is no KL term pulling a per-example posterior toward the prior, the discrete bottleneck avoids the KL-driven posterior collapse of continuous VAEs, although VQ-VAEs have an analogous failure mode of their own in which only a few codebook entries are ever used.

Training requires handling the non-differentiable argmin. The straight-through estimator copies the gradient from $z_q$ back to $z_e$ directly, ignoring the quantization step in the backward pass. The full VQ-VAE loss has three terms: reconstruction, codebook alignment (pulling codebook vectors toward encoder outputs), and commitment (pulling encoder outputs toward codebook vectors):

$$
\mathcal{L}_\text{VQ} = \underbrace{\|x - D_\theta(z_q)\|^2}_{\text{reconstruction}} + \underbrace{\|\text{sg}(z_e) - z_q\|^2}_{\text{codebook update}} + \underbrace{\beta_\text{commit}\|z_e - \text{sg}(z_q)\|^2}_{\text{commitment}}
$$

where $\text{sg}(\cdot)$ denotes stop-gradient. The commitment weight $\beta_\text{commit} = 0.25$ (the default from the original paper) balances encoder stability against codebook update speed.
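The quantization step and the forward values of the three loss terms can be sketched without any autodiff machinery. The tiny codebook and encoder output below are hypothetical, and the function names are illustrative only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantize(z_e, codebook):
    """Return (index, vector) of the codebook entry nearest to z_e in L2 distance."""
    k_star = min(range(len(codebook)), key=lambda k: squared_distance(z_e, codebook[k]))
    return k_star, codebook[k_star]

def vq_loss_terms(x, x_recon, z_e, z_q, beta_commit=0.25):
    """Forward values of the three VQ-VAE loss terms.

    Stop-gradient only changes the backward pass, so the codebook and
    commitment terms share the same forward value ||z_e - z_q||^2.
    """
    recon = squared_distance(x, x_recon)
    codebook_term = squared_distance(z_e, z_q)
    commitment = beta_commit * squared_distance(z_e, z_q)
    return recon, codebook_term, commitment

# Hypothetical 2-D codebook with K = 3 entries and one encoder output.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
z_e = [0.9, 1.2]
k_star, z_q = quantize(z_e, codebook)
print(k_star, z_q)   # 1 [1.0, 1.0]
```

In a framework with autodiff, the straight-through estimator is commonly written as `z_e + stop_gradient(z_q - z_e)`, which equals `z_q` in the forward pass but routes the decoder's gradient to `z_e`.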

VQ-VAEs enable two-stage generation: first train the VQ-VAE to produce discrete codes; then train a separate autoregressive or diffusion model over the resulting code sequence. This two-stage approach produces sharp samples (the VQ-VAE decoder retains much of the local detail) while enabling tractable likelihoods over the discrete code sequence when the second stage is autoregressive. DALL-E 1 used a $32 \times 32$ grid of tokens from an 8192-entry codebook, learned with a discrete VAE trained via a Gumbel-Softmax relaxation rather than nearest-neighbor quantization; subsequent work replaced the discrete stage with continuous spatial VAEs for better high-frequency fidelity. In robotics (Course 2 Week 8), VQ-BeT applies the same codebook idea to actions, quantizing continuous actions into discrete codes so that a behavior transformer can model multimodal action distributions by predicting codebook indices rather than regressing continuous action values directly.

VAEs as generative models#

At inference time, the VAE generates samples by: (1) drawing $z \sim p(z) = \mathcal{N}(0, I)$; (2) passing $z$ through the decoder to obtain the output distribution $p_\theta(x \mid z)$; (3) sampling $x \sim p_\theta(x \mid z)$ or taking the mean $\mu_\theta(z)$. Generated samples are often blurry because the Gaussian decoder with MSE reconstruction minimizes the expected squared pixel error, which averages over aleatoric uncertainty and so produces the mean of the conditional distribution rather than a sharp mode.

This blurriness is not a bug in the VAE algorithm but a consequence of the Gaussian likelihood assumption: the optimal decoder output under MSE loss is the conditional mean $\mathbb{E}[x \mid z]$, which averages over multiple plausible renderings. The latent diffusion models of Week 9 address this from two sides: the autoencoder is trained with only a very weak KL penalty so that $z$ retains most of the image detail, and a diffusion model replaces the fixed Gaussian prior as the generator of latents.
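The conditional-mean argument can be checked on a one-pixel toy problem. The example assumes a latent code that is compatible with two equally likely renderings of a pixel, black or white, and searches a grid of constant predictions for the one with the lowest MSE.

```python
def mse(prediction, targets):
    """Mean squared error of a single constant prediction against all targets."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)

# Two equally plausible "renderings" of one pixel: black (0.0) or white (1.0).
targets = [0.0, 1.0, 0.0, 1.0]

candidates = [i / 100 for i in range(101)]          # grid over [0, 1]
best = min(candidates, key=lambda p: mse(p, targets))
print(best)                                          # 0.5: the gray average, matching neither mode
print(mse(0.5, targets), mse(0.0, targets))          # 0.25 0.5
```

The MSE-optimal output is gray even though no training target is gray, which is the one-pixel version of a blurry sample.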

GenAI context: VAEs across the course sequence#

| VAE concept | Language model analog | Robotics (Course 2) analog |
|---|---|---|
| Encoder $q_\phi(z \mid x)$ | Input token embedding | Demo sequence → action code (ACT) |
| Reparameterization trick | Straight-through for discrete tokens | CVAE continuous action latents |
| Posterior collapse | LM ignoring context | CVAE encoder ignoring demonstration |
| $\beta$-VAE disentanglement | Bottleneck transformers | Disentangled task/motion latents |
| Hierarchical VAE | Multi-level prompt latents | Hierarchical skill latent hierarchy |
| Spatial latent VAE | VQGAN token grid for image LMs | Compressed sensor observations |

The CVAE architecture in ACT (Course 2, Week 8) is a direct application of the conditional VAE framework developed here: the encoder maps a demonstration sequence to an action latent code $z$, the prior is $\mathcal{N}(0, I)$, and the decoder generates action chunk predictions conditioned on $z$ and the current observation. The ELBO optimized there is the same mathematical object (a reconstruction term plus KL regularization), with "demonstration" replacing "image" in the reconstruction objective.

Key takeaways#

The VAE jointly trains encoder and decoder by maximizing the ELBO, which decomposes into a reconstruction term and a KL divergence toward the prior. The reparameterization trick enables gradient-based optimization through stochastic latent samples by expressing $z$ as a deterministic function of the encoder output and noise from a fixed distribution. Posterior collapse is the primary failure mode, caused by powerful decoders learning to ignore $z$, and is mitigated by KL annealing or free bits. $\beta$-VAE upweights the KL term to improve disentanglement at the cost of reconstruction quality. Hierarchical VAEs stack multiple latent levels, recovering sample quality competitive with GANs while retaining a tractable lower bound on the likelihood.

Conceptual questions#

1. A VAE decoder is parameterized as a Gaussian $p_\theta(x \mid z) = \mathcal{N}(\mu_\theta(z), \sigma^2 I)$ with $\sigma^2 = 0.01$. Show that the ELBO reconstruction term equals $-\frac{1}{2\sigma^2}\|x - \mu_\theta(z)\|^2$ up to an additive constant, and that for small $\sigma^2$ it dominates the KL term. What is the risk of using very small $\sigma^2$? Conversely, what happens to sample quality as $\sigma^2 \to \infty$?
2. Derive the closed-form KL divergence $D_\text{KL}(\mathcal{N}(\mu, \sigma^2 I) \| \mathcal{N}(0, I))$ for $d_z$-dimensional Gaussian distributions. Show that it equals zero if and only if $\mu = 0$ and $\sigma = 1$. How does this expression change if the prior is $\mathcal{N}(\mu_0, \sigma_0^2 I)$ rather than the standard Gaussian?
3. During training of a VAE, the KL term is observed to be zero after 100 gradient steps while the reconstruction loss is still high. Diagnose this failure mode and propose two architectural or training modifications that would prevent it. For each modification, explain the mechanism by which it discourages the degenerate solution.
4. A $\beta$-VAE is trained on a dataset of 3D objects with independent generative factors (shape, size, orientation, color). For $\beta = 1$ (standard VAE) and $\beta = 10$, predict: (a) the reconstruction quality on held-out objects, (b) whether the latent dimensions will be aligned with the generative factors, and (c) the quality of samples generated by interpolating between two latent codes. Justify each prediction using the information-bottleneck interpretation.
5. Hierarchical VAEs use $L$ levels of latent variables. During inference, the posterior $q_\phi(z_1, \ldots, z_L \mid x)$ must approximate the true posterior $p_\theta(z_1, \ldots, z_L \mid x)$. Write the ELBO for a two-level hierarchical VAE and identify where the posterior approximation gap arises for each level. Which level (top or bottom) do you expect to suffer from larger posterior approximation error, and why?

Solutions

1. As $\sigma^2 \to 0$, the reconstruction term $-\tfrac{1}{2\sigma^2}\|x-\mu_\theta(z)\|^2$ is scaled by $1/\sigma^2 \to \infty$, dwarfing the (unscaled) KL term. The risk is that the KL is effectively ignored, the latent space loses its prior structure, and the model behaves like a deterministic autoencoder that overfits and gives poor samples from $p(z)$. As $\sigma^2 \to \infty$ the reconstruction term vanishes, the KL dominates and forces posterior collapse, yielding blurry, content-free samples.
2. $D_\text{KL}(\mathcal{N}(\mu,\sigma^2 I)\,\|\,\mathcal{N}(0,I)) = \tfrac12\sum_j(\sigma_j^2+\mu_j^2-1-\log\sigma_j^2)$, which is $\geq 0$ with equality iff every $\mu_j=0$ and $\sigma_j=1$. For a general prior $\mathcal{N}(\mu_0,\sigma_0^2 I)$ each term becomes $\tfrac12\big(\tfrac{\sigma_j^2+(\mu_j-\mu_{0,j})^2}{\sigma_{0,j}^2} - 1 - \log\tfrac{\sigma_j^2}{\sigma_{0,j}^2}\big)$.
3. KL $\to 0$ with reconstruction still high is posterior collapse: the KL pressure has driven $q_\phi(z \mid x)$ to the prior before the decoder learned to use $z$, so $z$ carries no information about $x$. Two fixes: (a) KL annealing: ramp $\beta$ from 0 to 1 so the decoder learns to use $z$ before the KL pressure applies; (b) free bits: zero the KL gradient below a floor $\lambda$ per dimension, so dimensions carrying little information are not pushed all the way to the prior. A third option is to weaken the decoder's capacity or receptive field so that it cannot model $x$ without $z$.
4. (a) Reconstruction is worse at $\beta=10$ (tighter information bottleneck). (b) Latent dimensions are more likely aligned with the independent factors at $\beta=10$, since only statistically independent structure survives the bottleneck. (c) Interpolations are smoother and more semantically meaningful at $\beta=10$ because the latent space is more structured and disentangled, at the cost of sharpness.
5. Two-level ELBO: $\mathbb{E}_{q_\phi(z_1,z_2\mid x)}[\log p_\theta(x\mid z_1)] - \mathbb{E}_{q_\phi(z_2\mid x)}D_\text{KL}(q_\phi(z_1\mid z_2,x)\|p_\theta(z_1\mid z_2)) - D_\text{KL}(q_\phi(z_2\mid x)\|p(z_2))$. A gap appears at each level wherever $q_\phi$ at that level departs from the true conditional posterior. The top level ($z_2$) typically suffers more: its posterior depends on marginalizing the lower level and is farther from the data, so the amortized Gaussian approximation is cruder there.

Looking ahead#

VAEs provide a probabilistic, likelihood-based approach to generative modeling. The next model family takes a different approach: instead of approximating the intractable posterior, it sidesteps likelihood entirely.

Week 3: Generative Adversarial Networks. We derive the GAN min-max objective, show its connection to Jensen-Shannon divergence, analyze the Wasserstein distance alternative, and examine training instability and mode collapse — the pathologies that motivated nearly all subsequent GAN research.
