Week 5: Normalizing Flows

✦Learning Outcomes
  • Implement affine coupling layers and explain why they have tractable Jacobian determinants
  • Compare autoregressive flows with coupling flows on the training-inference parallelism tradeoff
  • Explain continuous normalizing flows and their connection to neural ODEs
◆Prerequisites
  • Week 1: Probabilistic Foundations - Likelihood and ELBO concepts
  • Week 4: Energy-Based Models - Understanding tractable vs. intractable objectives

Familiarity with matrix operations and Jacobian determinants is helpful.

Purpose of this lecture

Normalizing flows define a probability distribution by constructing an invertible, differentiable mapping $f_\theta: \mathcal{X} \to \mathcal{Z}$ from data space to a simple base space (usually a Gaussian), then transforming the base density through the inverse mapping. They offer exact likelihood evaluation, exact sampling, and exact latent inference: a uniquely complete set of tractable operations among generative model families. Understanding flows clarifies both their strengths and the architectural constraints required to maintain invertibility efficiently, and provides the conceptual bridge to neural ODEs and Schrödinger bridges.


The change-of-variables formula

For an invertible differentiable mapping $z = f_\theta(x)$ with $x = f_\theta^{-1}(z) = g_\theta(z)$, the change-of-variables formula relates the densities of $x$ and $z$:

$$p_X(x) = p_Z(f_\theta(x)) \cdot \left|\det J_{f_\theta}(x)\right|$$

where $J_{f_\theta}(x) = \frac{\partial f_\theta(x)}{\partial x}$ is the Jacobian matrix of $f_\theta$ at $x$. Taking logarithms gives the exact log-likelihood:

$$\log p_X(x) = \log p_Z(f_\theta(x)) + \log \left|\det J_{f_\theta}(x)\right|$$

The first term is the log-density of the transformed point under the base distribution (available in closed form if $p_Z$ is Gaussian). The second term is the log-absolute-Jacobian-determinant, which accounts for the volume change induced by the mapping. Computing a general $d \times d$ Jacobian determinant costs $O(d^3)$, which is prohibitive for high-dimensional $x$. All flow architectures are designed to make this computation tractable.

Sampling from a flow requires only the inverse: draw $z \sim p_Z$, compute $x = g_\theta(z)$. Density evaluation requires the forward pass: compute $z = f_\theta(x)$, then evaluate $\log p_Z(z) + \log|\det J_{f_\theta}(x)|$. Both directions are exact without approximation.
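The two directions can be checked numerically on a toy case. The following is a minimal sketch, assuming a hypothetical elementwise affine flow $z = (x - \mu)/\sigma$ with hand-picked `mu` and `sigma` (these values and all function names are illustrative, not part of the lecture). For this flow the result must match the closed-form density of $\mathcal{N}(\mu, \text{diag}(\sigma^2))$.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of N(0, I) evaluated at a vector z."""
    d = z.shape[-1]
    return -0.5 * np.sum(z**2, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)

# Hypothetical toy flow: z = f(x) = (x - mu) / sigma, applied elementwise.
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 3.0])

def f(x):
    """Forward map (data -> base) together with log|det J_f(x)|."""
    z = (x - mu) / sigma
    log_det = -np.sum(np.log(sigma))  # Jacobian is diag(1 / sigma)
    return z, log_det

def g(z):
    """Inverse map (base -> data), used for sampling."""
    return z * sigma + mu

def flow_log_prob(x):
    """log p_X(x) = log p_Z(f(x)) + log|det J_f(x)|."""
    z, log_det = f(x)
    return standard_normal_logpdf(z) + log_det

x = np.array([0.3, 1.7])
# Reference: the closed-form log-density of N(mu, diag(sigma^2)) at x.
reference = np.sum(
    -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
)
print(flow_log_prob(x), reference)

# Sampling uses only the inverse direction.
rng = np.random.default_rng(0)
sample = g(rng.standard_normal(2))
```

The printed values agree, which is the change-of-variables formula at work: the `log_det` term is exactly the Gaussian normalizer's $-\sum_i \log \sigma_i$.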


Coupling layers: RealNVP and Glow

Affine coupling layers (Dinh et al., 2017, RealNVP) achieve $O(d)$ Jacobian determinant computation by dividing the input dimensions into two halves and applying a learned affine transformation to one half conditioned on the other:

$$x_A' = x_A, \quad x_B' = x_B \odot \exp(s_\theta(x_A)) + t_\theta(x_A)$$

where $s_\theta$ and $t_\theta$ are arbitrary neural networks mapping from $x_A$ to scale and translation for $x_B$. Because $x_A' = x_A$, the Jacobian of the full layer is lower-triangular:

$$J = \begin{pmatrix} I & 0 \\ \frac{\partial x_B'}{\partial x_A} & \text{diag}(\exp(s_\theta(x_A))) \end{pmatrix}$$

The determinant is $\prod_i \exp(s_{\theta,i}(x_A)) = \exp(\sum_i s_{\theta,i}(x_A))$, an $O(d)$ computation. The inverse is also $O(d)$: given $x_B'$, recover $x_B = (x_B' - t_\theta(x_A)) \odot \exp(-s_\theta(x_A))$. Critically, $s_\theta$ and $t_\theta$ do not need to be invertible; any architecture works.
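A small numerical check makes the triangular structure concrete. This is a sketch under stated assumptions: the conditioner is a hypothetical one-hidden-layer tanh MLP with random weights (`W1`, `W2`), and the finite-difference Jacobian is used only as a brute-force reference for the cheap $\sum_i s_i$ formula.

```python
import numpy as np

rng = np.random.default_rng(0)
d, half = 4, 2
# Hypothetical conditioner weights (random, untrained).
W1 = rng.standard_normal((half, 8))
W2 = 0.5 * rng.standard_normal((8, 2 * half))

def conditioner(x_a):
    """Return (s, t) for x_B given x_A; this network is never inverted."""
    h = np.tanh(x_a @ W1) @ W2
    return h[:half], h[half:]

def coupling_forward(x):
    x_a, x_b = x[:half], x[half:]
    s, t = conditioner(x_a)
    y = np.concatenate([x_a, x_b * np.exp(s) + t])
    return y, np.sum(s)  # log|det J| = sum_i s_i

def coupling_inverse(y):
    y_a, y_b = y[:half], y[half:]
    s, t = conditioner(y_a)  # y_A equals x_A, so s and t are recomputed exactly
    return np.concatenate([y_a, (y_b - t) * np.exp(-s)])

def numerical_jacobian(fn, x, eps=1e-6):
    """Central-difference Jacobian; column i holds d fn / d x_i."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((fn(x + e) - fn(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = rng.standard_normal(d)
y, log_det = coupling_forward(x)
J = numerical_jacobian(lambda v: coupling_forward(v)[0], x)
sign, brute_log_det = np.linalg.slogdet(J)
print(log_det, brute_log_det)            # the two values agree
print(np.allclose(coupling_inverse(y), x))
```

The upper-right block of `J` is zero because $x_A'$ does not depend on $x_B$, which is why the determinant reduces to the product of the diagonal.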

Stacking coupling layers with alternating splits and shuffling between layers allows information to flow across all dimensions. Glow (Kingma and Dhariwal, 2018) extends RealNVP with invertible $1 \times 1$ convolutions (mixing channels) and activation normalization (ActNorm, replacing batch norm), while keeping RealNVP's multi-scale architecture, which factors out part of the latent variables at each resolution. Glow achieved state-of-the-art likelihoods among flow-based models on image benchmarks at the time and demonstrated latent-space interpolation for face attributes.

What the $s_\theta$ and $t_\theta$ subnetworks look like in practice

The key freedom in coupling layers, that $s_\theta$ and $t_\theta$ need not be invertible, allows using powerful architectures without constraint. In RealNVP for images, $s_\theta$ and $t_\theta$ are convolutional ResNets: the split into $x_A$ and $x_B$ is defined by a checkerboard or channel-wise mask, and the conditioner networks take $x_A$ as input and produce per-pixel scale and translation maps for $x_B$.

For the checkerboard split, alternating pixels form $x_A$: the black squares on a checkerboard pattern. The conditioner ResNet sees every other pixel and predicts scale and translation for the remaining pixels. Alternating which half is held fixed, and switching between checkerboard and channel splits across coupling layers, ensures that every dimension is transformed and that information flows across all spatial locations. This is the same principle as alternating row/column attention in Transformers.

Glow places $1 \times 1$ invertible convolutions (a learned generalization of channel permutations) between coupling layers, enabling the model to learn which channels to pass through unchanged versus transform. The $s_\theta$ and $t_\theta$ networks are 3-layer ConvNets with ReLU activations; the scale output $s_\theta(x_A)$ is often bounded, for example to $[-2, 2]$ with a scaled $\tanh$, to prevent the Jacobian from becoming numerically unstable.
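One way to bound the log-scale is a scaled $\tanh$. The sketch below assumes a bound of 2 and a hypothetical helper name `bounded_log_scale`; it illustrates the idea and does not reproduce any particular codebase.

```python
import numpy as np

def bounded_log_scale(raw, bound=2.0):
    """Squash an unconstrained conditioner output into [-bound, bound].

    Near zero the map is close to the identity (slope 1), so small raw
    outputs are left almost unchanged.
    """
    return bound * np.tanh(raw / bound)

raw = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])  # illustrative raw outputs
s = bounded_log_scale(raw)
scale = np.exp(s)  # multiplicative scale stays within [exp(-2), exp(2)]
print(s, scale)
```

Because the per-dimension log-determinant contribution is $s_i$, bounding $s$ also bounds each layer's log-Jacobian by $\pm 2$ per transformed dimension.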


Autoregressive flows

Autoregressive flows (MAF and IAF; Papamakarios et al., 2017; Kingma et al., 2016) generalize coupling layers to the full autoregressive factorization:

$$z_i = (x_i - \mu_\theta^{(i)}(x_{<i})) / \exp(\alpha_\theta^{(i)}(x_{<i}))$$

where $\mu_\theta^{(i)}$ and $\alpha_\theta^{(i)}$ are learned conditioners depending on all preceding dimensions. This achieves a triangular Jacobian with $O(d)$ determinant, like coupling layers, but allows each dimension to depend on all preceding dimensions rather than just the half-partition.

Masked Autoregressive Flow (MAF): the forward pass (density evaluation) computes all $z_i$ in a single parallel pass of the masked network, that is, $O(1)$ sequential passes; the inverse pass (sampling) requires $d$ sequential evaluations, because each $x_i$ depends on the already generated $x_{<i}$. MAF is fast at density evaluation and slow at sampling.

Inverse Autoregressive Flow (IAF): the transformation is parameterized in the other direction, making sampling $O(1)$ parallel passes but density evaluation of arbitrary data points $O(d)$ sequential passes. IAF is fast at sampling and slow at density evaluation.

This speed asymmetry reflects a fundamental duality: an autoregressive parameterization makes one direction of the transformation parallel and the other sequential, so a single autoregressive flow cannot offer both fast density evaluation and fast sampling. Coupling layers occupy the middle ground: both directions need only a single conditioner pass, at the cost of only allowing half the dimensions to be transformed per layer.
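The asymmetry shows up directly in code. In this minimal sketch the conditioners are assumed to be strictly lower-triangular linear maps (`M_mu`, `M_alpha`), a hypothetical stand-in for a masked network such as MADE: the forward direction is one vectorized expression, while the inverse needs a loop over dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
# Strictly lower-triangular weights so that mu_i and alpha_i depend only on x_{<i}.
M_mu = np.tril(rng.standard_normal((d, d)), k=-1)
M_alpha = 0.3 * np.tril(rng.standard_normal((d, d)), k=-1)

def maf_forward(x):
    """x -> z in one parallel pass; returns z and log|det dz/dx|."""
    mu, alpha = M_mu @ x, M_alpha @ x
    z = (x - mu) * np.exp(-alpha)
    return z, -np.sum(alpha)

def maf_inverse(z):
    """z -> x in d sequential steps, since x_i depends on x_{<i}."""
    x = np.zeros(d)
    for i in range(d):
        mu_i = M_mu[i] @ x        # only the already filled entries x_{<i} contribute
        alpha_i = M_alpha[i] @ x
        x[i] = z[i] * np.exp(alpha_i) + mu_i
    return x

x = rng.standard_normal(d)
z, log_det = maf_forward(x)
print(np.allclose(maf_inverse(z), x))
```

An IAF swaps the roles: the conditioners would read $z_{<i}$ instead of $x_{<i}$, so the loop moves to the density-evaluation direction.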


Practical training: flows vs other generative models

Normalizing flows have distinctive training characteristics compared to VAEs and GANs:

Memory requirements: flows require storing the full Jacobian computation graph for backpropagation. For a coupling layer with $O(d)$ Jacobian determinant, the memory cost is $O(d)$ per layer and $O(Ld)$ total for $L$ layers, proportional to the model depth. In contrast, VAEs require storing the encoder, decoder, and latent sample; GANs require storing generator and discriminator. Flows are generally more memory-intensive than VAEs for the same model capacity because every intermediate activation is needed for the backward pass through the Jacobian.

Training stability: flows train with standard MLE and do not suffer from the adversarial instability of GANs or the posterior collapse failure of VAEs. However, they are sensitive to numerical stability: the log-Jacobian must be finite, which can fail if $\exp(s_\theta(x_A))$ becomes very large or very small. Clipping $s_\theta$ outputs to a bounded range and using careful weight initialization (to keep $s_\theta \approx 0$ at initialization, making the initial flow near-identity) are standard practices.

Data preprocessing: discrete-valued data (images with integer pixel values in $\{0, \dots, 255\}$) must be mapped to $\mathbb{R}^d$ before a flow can model it. The standard pipeline: (1) add uniform noise $x' = x + u$, $u \sim \text{Uniform}(0, 1)^d$ (dequantization, converting discrete pixels to continuous values); (2) apply a logit transform $y = \text{logit}(\alpha + (1-2\alpha) x'/256)$ for $\alpha \approx 0.05$ (preventing mass accumulation at the boundaries); (3) model $y$ with the flow. Without dequantization, a continuous density can place arbitrarily narrow spikes on the integer pixel values, so the training log-likelihood can grow without bound while the model learns little about realistic images.
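The pipeline above can be sketched as follows. The function names `preprocess` and `postprocess` are illustrative, and $\alpha = 0.05$ is taken from the text. The returned `log_det` is the log-Jacobian of the logit step, which has to be added to the flow's log-density of $y$ to obtain a density for the dequantized pixels $x'$.

```python
import numpy as np

ALPHA = 0.05

def preprocess(x_int, rng):
    """Map integer pixels in {0, ..., 255} to R^d; return y and log|det dy/dx'|."""
    x_cont = x_int + rng.uniform(0.0, 1.0, size=x_int.shape)  # dequantize into [0, 256)
    p = ALPHA + (1.0 - 2.0 * ALPHA) * x_cont / 256.0          # squeeze into (alpha, 1 - alpha)
    y = np.log(p) - np.log1p(-p)                              # logit(p)
    # Elementwise derivative: dy/dx' = (1 - 2 alpha) / 256 * 1 / (p (1 - p)).
    log_det = np.sum(np.log((1.0 - 2.0 * ALPHA) / 256.0) - np.log(p) - np.log1p(-p))
    return y, log_det

def postprocess(y):
    """Invert the logit transform and floor back to integer pixel values."""
    p = 1.0 / (1.0 + np.exp(-y))
    x_cont = (p - ALPHA) / (1.0 - 2.0 * ALPHA) * 256.0
    return np.clip(np.floor(x_cont), 0, 255).astype(np.int64)

x_int = np.array([0, 17, 128, 255])
y, log_det = preprocess(x_int, np.random.default_rng(0))
print(y, log_det, postprocess(y))
```

Flooring in `postprocess` undoes the dequantization noise, so generated samples land back on valid pixel values.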

Bits per dimension: flow likelihoods are typically reported in bits per dimension (bpd): $-\log_2 p_\theta(x) / d$. Lower is better. State-of-the-art flows achieve 3.3–3.5 bpd on CIFAR-10 (3072 dimensions); autoregressive transformers achieve 2.8–3.0 bpd but are much slower to sample. Understanding bpd allows direct comparison between model families using the same metric.
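Converting a log-likelihood in nats to bpd is a one-line change of logarithm base. In this sketch the value of $-7000$ nats is a made-up number for illustration, not a reported result. The log-likelihood is assumed to be on the original pixel scale, so any preprocessing log-Jacobian must already be included.

```python
import numpy as np

def bits_per_dim(log_prob_nats, num_dims):
    """Convert log p(x) in nats into bits per dimension: -log2 p(x) / d."""
    return -log_prob_nats / (num_dims * np.log(2.0))

# Hypothetical example: log p(x) = -7000 nats for a 32 x 32 x 3 image (3072 dims).
bpd = bits_per_dim(-7000.0, 3072)
print(round(bpd, 2))  # about 3.29
```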


Neural ODEs and continuous normalizing flows

Continuous normalizing flows (CNFs; Chen et al., 2018) replace the discrete sequence of transformations with a continuous flow defined by an ODE:

$$\frac{dz(t)}{dt} = f_\theta(z(t), t), \quad z(0) = z_0, \quad z(1) = x$$

The change-of-variables formula for the continuous flow is the instantaneous change of variables:

$$\frac{d \log p(z(t))}{dt} = -\text{tr}\!\left(\frac{\partial f_\theta}{\partial z(t)}\right)$$

This requires only the trace of the Jacobian (not the full determinant). The trace can be estimated without forming $J$ by using Hutchinson's trace estimator, $\text{tr}(J) = \mathbb{E}_\epsilon[\epsilon^\top J \epsilon]$ for $\epsilon \sim \mathcal{N}(0, I)$: each sample needs only one Jacobian-vector product, which costs about as much as one evaluation of $f_\theta$, and even a single sample gives an unbiased estimate.
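A quick way to see the estimator at work is to apply it to an explicit matrix whose trace is known. This is a sketch, with a random matrix `J` standing in for a Jacobian that can only be accessed through products $v \mapsto Jv$. In a real CNF that product would come from automatic differentiation, which is not shown here.

```python
import numpy as np

def hutchinson_trace(jvp, dim, num_samples, rng):
    """Monte Carlo estimate of tr(J) using only products v -> J v."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.standard_normal(dim)
        total += eps @ jvp(eps)  # eps^T J eps, an unbiased estimate of tr(J)
    return total / num_samples

rng = np.random.default_rng(0)
J = rng.standard_normal((6, 6))  # illustrative stand-in for a Jacobian
estimate = hutchinson_trace(lambda v: J @ v, 6, 20000, rng)
exact = np.trace(J)
print(estimate, exact)
```

Averaging many samples is done here only to make the agreement visible; during training a single noise sample per data point is commonly used, trading variance for speed.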

The CNF can be trained by running the ODE forward (from $z_0$ to $x$) and backward (from $x$ to $z_0$), with the adjoint method providing $O(1)$-memory gradients. The neural ODE framework generalizes naturally to data on manifolds, graph-structured data, and irregular time series, domains where discrete flow layers would require bespoke architecture.
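Running the ODE forward while tracking the log-density amounts to integrating an augmented state $(z, \Delta \log p)$. The sketch below assumes a hypothetical affine vector field $f(z, t) = Az + b$, chosen because its Jacobian trace is known exactly, and uses a fixed-step RK4 integrator in place of an adaptive solver.

```python
import numpy as np

A = np.array([[0.3, -1.0], [1.0, -0.1]])  # hypothetical field f(z, t) = A z + b
b = np.array([0.5, 0.0])

def augmented_dynamics(state):
    """Joint ODE for (z, delta_logp): dz/dt = A z + b, d(delta_logp)/dt = -tr(A)."""
    z = state[:2]
    return np.concatenate([A @ z + b, [-np.trace(A)]])

def rk4_integrate(state, num_steps=100):
    """Fixed-step fourth-order Runge-Kutta from t = 0 to t = 1."""
    h = 1.0 / num_steps
    for _ in range(num_steps):
        k1 = augmented_dynamics(state)
        k2 = augmented_dynamics(state + 0.5 * h * k1)
        k3 = augmented_dynamics(state + 0.5 * h * k2)
        k4 = augmented_dynamics(state + h * k3)
        state = state + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return state

z0 = np.array([1.0, -0.5])
final = rk4_integrate(np.concatenate([z0, [0.0]]))
x, delta_logp = final[:2], final[2]
# log p_1(x) = log p_0(z0) + delta_logp; here delta_logp = -tr(A) = -0.2.
print(x, delta_logp)
```

For a nonlinear field the trace term would depend on $z(t)$ and would typically be replaced by a stochastic estimate, but the augmented-state structure stays the same.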

FFJORD (Grathwohl et al., 2019) implements CNFs with free-form Jacobians (no masking or coupling required) using the stochastic trace estimator, achieving competitive likelihoods with a more flexible architecture than coupling-based flows.


Schrödinger bridges

Schrödinger bridges generalize normalizing flows to stochastic processes: instead of a deterministic ODE connecting $p_0$ to $p_1$, a Schrödinger bridge finds the minimum-entropy-transport stochastic process connecting two marginals $p_0$ and $p_1$. The bridge minimizes:

$$\min_{P \in \mathcal{C}(p_0, p_1)} D_\text{KL}(P \,\|\, Q)$$

where $Q$ is a reference Brownian motion process and $\mathcal{C}(p_0, p_1)$ is the set of processes with the correct marginals. The solution generalizes optimal transport while accounting for diffusion. Schrödinger bridges have applications in single-cell biology (interpolating between gene expression distributions) and are the theoretical foundation of diffusion Schrödinger bridges, an alternative approach to generative modeling that bridges two arbitrary distributions rather than a data distribution and a Gaussian.

The Schrödinger bridge is closely related to optimal transport and provides theoretical background for flow matching (Week 7). The key difference from a normalizing flow is that the source distribution $p_0$ can be any distribution, not just a Gaussian. For generative modeling, the Schrödinger bridge connects the data distribution $p_1 = p_\text{data}$ to $p_0 = \mathcal{N}(0, I)$ via a minimum-entropy stochastic process. As the noise level of the reference process goes to zero, the bridge approaches the dynamic optimal-transport solution, whose vector field minimizes the total kinetic energy $\mathbb{E}[\int_0^1 \|v_t\|^2 dt]$. This is the objective that Optimal Transport Flow Matching targets (Course 2, Week 9 made use of this for real-time robot control via OT-CFM). Seen this way, flow matching, OT-CFM, and Schrödinger bridges are closely related forms of minimum-kinetic-energy interpolation, differing mainly in whether the process is deterministic (flow matching), stochastic (Schrödinger bridge), or constrained to straight-line paths (rectified flow).


Practical considerations for flow architecture design

Designing an effective normalizing flow requires balancing multiple constraints:

Depth vs. expressiveness: deeper flows can represent more complex distributions (since each layer adds a nonlinear transformation), but each additional layer increases memory consumption for backpropagation and increases the number of Jacobian determinants that must be computed. Empirically, 16–32 coupling layers are sufficient for high-quality image modeling; beyond that, gains diminish. The depth of the conditioner networks $s_\theta$ and $t_\theta$ matters less than the depth of the flow; using 3-layer ConvNets for conditioners is typical.

Masked convolutions and efficiency: in RealNVP, the checkerboard mask determines which pixels are transformed; channel masks apply the split at the channel level. Checkerboard splitting preserves spatial structure early in sampling but requires alternating with channel splits to allow information flow. Channel-first masks are faster (no spatial inefficiency) but require careful reshuffling between layers. Modern implementations use $1 \times 1$ invertible convolutions (Glow) or learned permutations to allow the model to discover optimal split patterns rather than using fixed masking.

Initialization strategies: initializing a normalizing flow to be near-identity is critical. If $s_\theta(x_A) \approx 0$ and $t_\theta(x_A) \approx 0$ at initialization, the coupling layer is close to the identity transformation. This ensures the flow starts with reasonable log-Jacobian values (near zero) and avoids exploding or vanishing Jacobians early in training. Techniques: zero-initialize the final layer of $s_\theta$ and $t_\theta$, use small weight initialization for conditioner networks, or apply a scaling factor to the output of conditioners.
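The effect of zero-initializing the final conditioner layer can be verified directly. This is a minimal sketch, assuming a hypothetical two-layer ReLU MLP conditioner with weights `W_hidden`, `W_out` and bias `b_out`; with the last layer at zero, the coupling layer is exactly the identity and contributes a log-determinant of zero.

```python
import numpy as np

rng = np.random.default_rng(0)
half, hidden = 3, 16
W_hidden = rng.standard_normal((half, hidden))  # ordinary random initialization
W_out = np.zeros((hidden, 2 * half))            # zero-initialized final layer
b_out = np.zeros(2 * half)

def coupling_forward(x):
    x_a, x_b = x[:half], x[half:]
    h = np.maximum(x_a @ W_hidden, 0.0) @ W_out + b_out  # ReLU MLP conditioner
    s, t = h[:half], h[half:]
    return np.concatenate([x_a, x_b * np.exp(s) + t]), np.sum(s)

x = rng.standard_normal(2 * half)
y, log_det = coupling_forward(x)
print(np.allclose(y, x), log_det)  # identity map, zero log-determinant
```

Gradients still reach `W_out` on the first update, so the layer moves away from the identity as training proceeds.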

Flow inversion for synthesis: sampling from a flow requires computing the inverse transformation at each coupling layer. Some architectures support exact inverse computation (affine coupling layers, autoregressive flows); others require solving an equation or iterative refinement (neural ODEs via ODE solvers). For real-time generation, exact invertibility is essential — neural ODEs, while theoretically elegant, are slower than discrete coupling layers.

Likelihood-free estimation: while flows provide exact likelihoods, computing the likelihood for a large batch is memory-intensive. For big datasets, sampling-based evaluation (estimating likelihoods via importance sampling or variational bounds) can be more efficient than exact computation.


GenAI context: flows across the generative modeling landscape

| Concept | Flow analog | Application |
|---|---|---|
| Exact likelihood $\log p_\theta(x)$ | Log-Jacobian formula | Model selection, density estimation, anomaly detection |
| Coupling layer | Masked self-attention in parallel | Parallel sampling in autoregressive models |
| IAF fast sampling | Used in VAE inference networks (encoders) for expressive posteriors | Posterior approximation quality |
| Neural ODE | Continuous residual networks | Continuous-time sequence models, NODE-based RNNs |
| Schrödinger bridge | Minimum-kinetic-energy interpolation | Image-to-image translation, cell trajectory modeling |

The practical lesson from normalizing flows is that invertibility is expensive: enforcing it either constrains the architecture (coupling layers), requires $O(d)$ sequential computation in one direction (autoregressive flows), or requires ODE integration (CNFs). Flow matching (Week 7) abandons architecturally enforced invertibility and likelihood-based training, and instead trains a vector field directly via regression. This allows an unconstrained architecture and sampling that is parallel across dimensions, at the cost of losing cheap exact likelihoods. The dominance of flow matching and diffusion over normalizing flows in practice reflects this tradeoff.


Key takeaways

  • Normalizing flows provide exact likelihood, exact sampling, and exact inference through invertible mappings; the log-likelihood equals the base log-likelihood plus the log-absolute-Jacobian-determinant.
  • Coupling layers achieve $O(d)$ Jacobian determinant computation with triangular structure, using unrestricted networks for the scale and translation conditioners; RealNVP uses convolutional conditioners with alternating checkerboard/channel splits; Glow adds invertible $1 \times 1$ convolutions and ActNorm.
  • Autoregressive flows (MAF/IAF) extend coupling layers to full autoregressive structure, with a fundamental tradeoff between density-evaluation speed and sampling speed: MAF is fast for density evaluation but slow at sampling; IAF is fast at sampling but slow for likelihood.
  • Continuous normalizing flows replace discrete transformations with ODEs, requiring only the trace (not determinant) of the Jacobian; stochastic trace estimation enables free-form architectures, and the adjoint method keeps gradient memory constant.
  • In practice, flows are memory-intensive and sensitive to numerical stability issues with unbounded Jacobians; they require careful data preprocessing (dequantization, logit transforms) to work well.
  • The dominance of flow matching (Week 7) over normalizing flows reflects the fundamental tradeoff: invertibility enables exact likelihoods but constrains architecture and loses memory efficiency compared to unconstrained regression-based models.


Conceptual questions

  1. A 2D normalizing flow applies the affine coupling layer $x_1' = x_1$, $x_2' = x_2 \cdot e^{s(x_1)} + t(x_1)$. Compute the log-likelihood contribution of this layer to $\log p_X(x)$ for a given $(x_1, x_2)$ with $s(x_1) = 0.5$ and $t(x_1) = 1.0$. Then compute the inverse transformation to recover $(x_1, x_2)$ from $(x_1', x_2')$. Show that the inverse exists without requiring $s$ or $t$ to be invertible.

  2. MAF has $O(1)$ training (density evaluation) but $O(d)$ sampling, while IAF has $O(1)$ sampling but $O(d)$ density evaluation. Describe a generative modeling scenario where you would prefer MAF over IAF and vice versa. How does this tradeoff relate to the design of variational inference algorithms (where the inference network is an IAF)?

  3. A continuous normalizing flow uses the instantaneous change-of-variables formula $\frac{d \log p}{dt} = -\text{tr}(\partial f/\partial z)$. Show that for an affine vector field $f(z, t) = Az + b$ where $A \in \mathbb{R}^{d \times d}$, the change in log-probability along any trajectory is constant at $-\text{tr}(A)$. What does this imply for the expressiveness of affine CNFs?

  4. The Schrödinger bridge between distributions $p_0$ and $p_1$ minimizes $D_\text{KL}(P \,\|\, Q)$ over processes with correct marginals. Compared to the deterministic optimal transport map, what additional flexibility does the stochastic Schrödinger bridge provide? In which application domains does this stochastic interpolation provide meaningful advantages over deterministic flow?

  5. A normalizing flow trained on a distribution with disconnected support (two well-separated modes) must map the data to a unimodal Gaussian base distribution. Explain why this requires the Jacobian determinant to vary dramatically across the data space. What practical training difficulty does this create, and how does it compare to the training difficulty faced by GANs and EBMs on the same multimodal distribution?

✦Solutions
  1. Only $x_2$ is scaled, so $\det J = e^{s(x_1)}$ and the layer's contribution is $\log|\det J| = s(x_1) = 0.5$. Inverse: $x_1 = x_1'$, and $x_2 = (x_2' - t(x_1'))\,e^{-s(x_1')} = (x_2' - 1.0)\,e^{-0.5}$. Note $s$ and $t$ are only ever evaluated (at $x_1 = x_1'$, which passes through unchanged), never inverted, so they can be arbitrary non-invertible networks.
  2. Prefer MAF when density evaluation/training throughput matters (MLE on a dataset, density estimation, anomaly detection). Prefer IAF when sampling speed matters (real-time generation, or as a VAE posterior $q_\phi(z \mid x)$ where you draw $z$ once and only need the density of that sample, which IAF gives cheaply). In VI the inference network is an IAF precisely because sampling and evaluating its own samples' density are both fast.
  3. For $f = Az + b$, $\partial f/\partial z = A$ and $\text{tr}(A)$ is constant in $z, t$; integrating $\tfrac{d\log p}{dt} = -\text{tr}(A)$ over $t \in [0,1]$ gives $\Delta\log p = -\text{tr}(A)$ on every trajectory. So an affine CNF only rescales density by a single global constant: it cannot reshape density locally and is no more expressive than one linear map; nonlinear $f$ is required.
  4. The stochastic bridge optimizes over a distribution of trajectories (Brownian-like paths) rather than the single deterministic OT map, so it models genuinely noisy transitions and spreads mass with an entropy term. This helps where the underlying process is stochastic — single-cell trajectories, image-to-image translation with inherent uncertainty — where a deterministic map would be artificially rigid.
  5. To send two separated modes to a unimodal Gaussian, the bijection must stretch the empty inter-mode region (low data density ⇒ assigned low base density) and compress within modes, forcing $|\det J|$ to span a huge dynamic range. This causes numerical instability (very large/small log-Jacobians) and hard optimization; moreover, because the map is a continuous bijection it cannot "tear" space, so a thin bridge of spurious probability always connects the modes. GANs (discontinuous generator) and EBMs (separate energy basins) represent disconnected support more naturally; flows are uniquely penalized by the invertibility/continuity constraint.

Looking ahead

Normalizing flows provide exact likelihood through invertible architectures constrained to have tractable Jacobians. The next model family achieves even higher sample quality by abandoning invertibility and instead defining distributions through a learned denoising process.

Week 6: Denoising Diffusion Probabilistic Models. We derive the DDPM forward process (data → noise) and reverse process (noise → data), show that the optimal denoising target is the simple noise prediction objective $\mathcal{L}_\text{simple}$, and connect DDPM to SDE/ODE formulations that enable accelerated sampling.


Further reading

  • Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using Real NVP. ICLR.
  • Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative Flow with Invertible 1x1 Convolutions. NeurIPS.
  • Chen, R. T. Q., et al. (2018). Neural Ordinary Differential Equations. NeurIPS. (Continuous Normalizing Flows).