Purpose of this lecture
Normalizing flows define a probability distribution by constructing an invertible, differentiable mapping from data space to a simple base space (usually a Gaussian), then transforming the base density through the inverse mapping. They offer exact likelihood evaluation, exact sampling, and exact latent inference — a uniquely complete set of tractable operations among generative model families. Understanding flows clarifies both their strengths and the architectural constraints required to maintain invertibility efficiently, and provides the conceptual bridge to neural ODEs and Schrödinger bridges.
The change-of-variables formula
For an invertible differentiable mapping $f: \mathbb{R}^d \to \mathbb{R}^d$ with $z = f(x)$, the change-of-variables formula relates the densities of $x$ and $z$:
$$p_X(x) = p_Z(f(x)) \left| \det J_f(x) \right|$$
where $J_f(x) = \partial f / \partial x$ is the Jacobian matrix of $f$ at $x$. Taking logarithms gives the exact log-likelihood:
$$\log p_X(x) = \log p_Z(f(x)) + \log \left| \det J_f(x) \right|$$
The first term is the likelihood of the transformed point under the base distribution (computable if $p_Z$ is Gaussian). The second term is the log-absolute-Jacobian-determinant, which accounts for the volume change induced by the mapping. Computing a general Jacobian determinant costs $O(d^3)$, which is prohibitive for high-dimensional $x$. All flow architectures are designed to make this computation tractable.
Sampling from a flow requires only the inverse: draw $z \sim p_Z$, compute $x = f^{-1}(z)$. Density evaluation requires the forward pass: compute $z = f(x)$, then evaluate $\log p_Z(z) + \log \left| \det J_f(x) \right|$. Both directions are exact without approximation.
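As a minimal sketch of the two directions described above, the following uses the simplest possible flow, an elementwise affine map $z = (x - \mu) / \sigma$ with a standard normal base. The function names and the values of `mu` and `sigma` are illustrative assumptions, not part of the lecture. Because this flow is just a reparameterized Gaussian, the flow log-likelihood can be checked against the closed-form Gaussian log-density.

```python
import numpy as np

def base_log_prob(z):
    """Log-density of a standard normal base distribution, summed over dimensions."""
    return -0.5 * np.sum(z**2 + np.log(2.0 * np.pi), axis=-1)

def affine_flow_forward(x, mu, sigma):
    """Forward map z = (x - mu) / sigma and its log|det J| (diagonal Jacobian)."""
    z = (x - mu) / sigma
    log_det = -np.sum(np.log(sigma)) * np.ones(x.shape[:-1])
    return z, log_det

def affine_flow_inverse(z, mu, sigma):
    """Inverse map x = z * sigma + mu, used for sampling."""
    return z * sigma + mu

def flow_log_prob(x, mu, sigma):
    """Exact log-likelihood: base log-density of f(x) plus log|det J_f(x)|."""
    z, log_det = affine_flow_forward(x, mu, sigma)
    return base_log_prob(z) + log_det

# Illustrative parameters (assumed values).
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 3.0])
x = np.array([[0.3, 1.7], [1.0, -2.0]])

# Density evaluation: forward pass.
log_px = flow_log_prob(x, mu, sigma)

# Closed-form log-density of N(mu, diag(sigma^2)) for comparison.
reference = -0.5 * np.sum(
    ((x - mu) / sigma) ** 2 + np.log(2.0 * np.pi * sigma**2), axis=-1
)

# Sampling: draw from the base, push through the inverse.
rng = np.random.default_rng(0)
z_samples = rng.standard_normal((10000, 2))
samples = affine_flow_inverse(z_samples, mu, sigma)
```

Here `log_px` and `reference` agree up to floating-point error, which is the change-of-variables formula at work: the $-\sum_i \log \sigma_i$ term is exactly the log-determinant of the diagonal Jacobian.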
Coupling layers: RealNVP and Glow
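Affine coupling layers (Dinh et al., 2017, RealNVP) achieve $O(d)$ Jacobian determinant computation by dividing the input dimensions into two halves, $x_A$ and $x_B$, and applying a learned affine transformation to one half conditioned on the other:
$$z_A = x_A, \qquad z_B = x_B \odot \exp(s(x_A)) + t(x_A)$$
where $s$ and $t$ are arbitrary neural networks mapping from $x_A$ to a scale and a translation for $x_B$. Because $z_A = x_A$, the Jacobian of the full layer is lower-triangular:
$$J = \begin{pmatrix} I & 0 \\ \partial z_B / \partial x_A & \operatorname{diag}(\exp(s(x_A))) \end{pmatrix}$$
The determinant is $\prod_i \exp(s_i(x_A)) = \exp\left(\sum_i s_i(x_A)\right)$, an $O(d)$ computation. The inverse is also $O(d)$: given $z$, recover $x_A = z_A$ and $x_B = (z_B - t(z_A)) \odot \exp(-s(z_A))$. Critically, $s$ and $t$ do not need to be invertible; any architecture works.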
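The coupling layer can be sketched in a few lines of NumPy. The conditioner below is a hypothetical one-hidden-layer MLP with random weights, chosen only to show that it need not be invertible; the sizes and weight scales are assumptions. The sketch checks the cheap log-determinant (the sum of the log-scales) against a brute-force finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d_A = 6, 3  # total dimension and size of the pass-through half (assumed)

# Hypothetical conditioner weights: tanh MLP, deliberately not invertible.
W1 = rng.standard_normal((d_A, 16)) * 0.5
W_s = rng.standard_normal((16, D - d_A)) * 0.5
W_t = rng.standard_normal((16, D - d_A)) * 0.5

def conditioner(x_a):
    """Map the pass-through half to a log-scale s and a translation t."""
    h = np.tanh(x_a @ W1)
    return h @ W_s, h @ W_t

def coupling_forward(x):
    """z_A = x_A, z_B = x_B * exp(s(x_A)) + t(x_A); log|det J| = sum(s)."""
    x_a, x_b = x[..., :d_A], x[..., d_A:]
    s, t = conditioner(x_a)
    z_b = x_b * np.exp(s) + t
    return np.concatenate([x_a, z_b], axis=-1), np.sum(s, axis=-1)

def coupling_inverse(z):
    """Recompute s, t from z_A (which equals x_A) and undo the affine map."""
    z_a, z_b = z[..., :d_A], z[..., d_A:]
    s, t = conditioner(z_a)
    x_b = (z_b - t) * np.exp(-s)
    return np.concatenate([z_a, x_b], axis=-1)

def numerical_log_abs_det(f, x, eps=1e-6):
    """Brute-force log|det J| via central differences (checking only)."""
    n = x.shape[0]
    jac = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        jac[:, j] = (f(x + e)[0] - f(x - e)[0]) / (2.0 * eps)
    return np.linalg.slogdet(jac)[1]

x = rng.standard_normal(D)
z, log_det = coupling_forward(x)
x_rec = coupling_inverse(z)
log_det_fd = numerical_log_abs_det(coupling_forward, x)
```

The inverse never inverts `conditioner`; it only re-evaluates it on the unchanged half, which is why any network can be used for $s$ and $t$.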
Stacking coupling layers with alternating splits and shuffling between layers allows information to flow across all dimensions. Glow (Kingma and Dhariwal, 2018) extends RealNVP with invertible $1 \times 1$ convolutions (mixing channels), activation normalization (ActNorm, replacing batch norm), and a multi-scale architecture that generates samples at multiple resolutions. Glow achieved state-of-the-art likelihoods among flow models on image benchmarks at the time and demonstrated latent-space interpolation for face attributes.
What the $s$ and $t$ subnetworks look like in practice
The key freedom in coupling layers, that $s$ and $t$ need not be invertible, allows using powerful architectures without constraint. In RealNVP for images, $s$ and $t$ are ResNets with masked convolutions: the split into a pass-through part $x_A$ and a transformed part $x_B$ corresponds to a checkerboard or channel-wise mask, and the conditioner networks are standard convolutional ResNets taking $x_A$ as input and producing per-pixel scale and translation maps for $x_B$.
For the checkerboard split, alternating pixels form $x_A$: the black squares on a checkerboard pattern. The conditioner ResNet sees every other pixel and predicts scale and translation for the remaining pixels. Alternating between checkerboard and channel splits across coupling layers lets information propagate to all spatial locations within a few layers, the same principle as alternating row and column attention in Transformers.
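The two mask types can be built directly as binary arrays. This is a sketch under assumptions: the helper names are hypothetical, and the "conditioner" is a toy stand-in (a shifted copy of the visible pixels) rather than a ResNet, used only to show that pass-through pixels are untouched and that the scale and translation depend solely on visible pixels.

```python
import numpy as np

def checkerboard_mask(height, width, flip=False):
    """Binary mask with 1 where (row + col) is even; flip swaps the roles."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    mask = ((rows + cols) % 2 == 0).astype(np.float64)
    return 1.0 - mask if flip else mask

def channel_mask(channels, flip=False):
    """Binary mask with 1 on the first half of the channels."""
    mask = np.zeros(channels)
    mask[: channels // 2] = 1.0
    return 1.0 - mask if flip else mask

def masked_coupling(x, mask, s, t):
    """Masked affine coupling: entries with mask == 1 pass through unchanged."""
    z = mask * x + (1.0 - mask) * (x * np.exp(s) + t)
    log_det = np.sum((1.0 - mask) * s)
    return z, log_det

x = np.arange(16, dtype=np.float64).reshape(4, 4)
mask = checkerboard_mask(4, 4)

# Toy stand-in for the conditioner: it reads only the visible (mask == 1) pixels.
visible = mask * x
s = 0.01 * np.roll(visible, 1, axis=1)
t = np.roll(visible, 1, axis=1)

z, log_det = masked_coupling(x, mask, s, t)
```

Passing `flip=True` on the next layer swaps which pixels are visible, which is one simple way to realize the alternation described above.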
In Glow, invertible $1 \times 1$ convolutions (a learned generalization of channel permutations) sit between coupling layers, enabling the model to learn which channels to pass through unchanged versus transform. The $s$ and $t$ networks are 3-layer ConvNets with ReLU activations; the scale output is often squashed to a bounded range (for example with a $\tanh$ or sigmoid) to prevent the Jacobian from becoming numerically unstable.
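An invertible $1 \times 1$ convolution applies the same $C \times C$ matrix to the channel vector at every pixel, so its log-determinant is $H \cdot W \cdot \log |\det W|$. The sketch below illustrates this with plain NumPy; the tensor sizes are assumptions, and the weight is initialized as a random rotation so that the log-determinant starts at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 3, 5  # channels, height, width (assumed sizes)

# Random rotation as the initial channel-mixing matrix: |det| = 1.
weight, _ = np.linalg.qr(rng.standard_normal((C, C)))

def conv1x1_forward(x, weight):
    """Apply the same C x C matrix at every pixel of a (C, H, W) tensor."""
    z = np.einsum("ij,jhw->ihw", weight, x)
    # Each of the H * W pixels contributes log|det weight|.
    log_det = x.shape[1] * x.shape[2] * np.linalg.slogdet(weight)[1]
    return z, log_det

def conv1x1_inverse(z, weight):
    """Undo the channel mixing with the inverse matrix."""
    return np.einsum("ij,jhw->ihw", np.linalg.inv(weight), z)

x = rng.standard_normal((C, H, W))
z, log_det = conv1x1_forward(x, weight)
x_rec = conv1x1_inverse(z, weight)
```

A fixed channel permutation is the special case where `weight` is a permutation matrix, which is why this layer can be read as a learned replacement for shuffling.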
Autoregressive flows
Autoregressive flows (MAF and IAF; Papamakarios et al., 2017; Kingma et al., 2016) generalize coupling layers to the full autoregressive factorization:
$$z_i = x_i \exp\left(s_i(x_{1:i-1})\right) + t_i(x_{1:i-1}), \qquad i = 1, \dots, d$$
where $s_i$ and $t_i$ are learned conditioners depending on all preceding dimensions. This achieves a triangular Jacobian with an $O(d)$ determinant, like coupling layers, but allows each dimension to depend on all preceding dimensions rather than just the half-partition.
Masked Autoregressive Flow (MAF): the forward pass (density evaluation) takes $O(1)$ passes of the full network; the inverse pass (sampling) requires $O(d)$ sequential evaluations. MAF is fast at density evaluation and slow at sampling.
Inverse Autoregressive Flow (IAF): the transformation is parameterized in the other direction, making sampling $O(1)$ parallel passes but density evaluation $O(d)$ sequential evaluations. IAF is fast at sampling and slow at density evaluation.
This speed asymmetry reflects a fundamental duality: an autoregressive transformation is parallel in the direction in which it is parameterized and sequential in the other, so fast density evaluation and fast sampling cannot both be had. Coupling layers occupy the middle ground, with $O(1)$ passes in both directions at the cost of only allowing half the dimensions to be transformed per layer.
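The asymmetry is easy to see in code. The sketch below assumes the MAF-style form $z_i = x_i \exp(s_i(x_{1:i-1})) + t_i(x_{1:i-1})$ and uses hypothetical linear conditioners with strictly lower-triangular weights as a stand-in for a masked network; a real MAF would use a MADE-style masked MLP instead.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Strictly lower-triangular weights: s_i and t_i see only x_1 .. x_{i-1}.
W_s = np.tril(rng.standard_normal((d, d)), k=-1) * 0.3
W_t = np.tril(rng.standard_normal((d, d)), k=-1)

def maf_forward(x):
    """Density direction: every s_i, t_i comes from one parallel pass over x."""
    s, t = W_s @ x, W_t @ x
    z = x * np.exp(s) + t
    return z, np.sum(s)

def maf_inverse(z):
    """Sampling direction: x_i needs x_1 .. x_{i-1}, so d sequential steps."""
    x = np.zeros(d)
    for i in range(d):
        s_i = W_s[i] @ x  # entries of x at index >= i meet zero weights
        t_i = W_t[i] @ x
        x[i] = (z[i] - t_i) * np.exp(-s_i)
    return x

x = rng.standard_normal(d)
z, log_det = maf_forward(x)
x_rec = maf_inverse(z)
```

An IAF uses the same structure with the roles swapped: the conditioners read $z_{1:i-1}$, so the loop moves to the density-evaluation direction.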
Practical training: flows vs other generative models
Normalizing flows have distinctive training characteristics compared to VAEs and GANs:
Memory requirements: flows require storing the full Jacobian computation graph for backpropagation. For a coupling layer with an $O(d)$ Jacobian determinant, the memory cost is $O(d)$ per layer and $O(Ld)$ total for $L$ layers, proportional to the model depth. In contrast, VAEs require storing the encoder, decoder, and latent sample; GANs require storing generator and discriminator. Flows are generally more memory-intensive than VAEs for the same model capacity because every intermediate activation is needed for the backward pass through the Jacobian.
Training stability: flows train with standard MLE and do not suffer from the adversarial instability of GANs or the posterior collapse failure of VAEs. However, they are sensitive to numerical stability: the log-Jacobian must be finite, which can fail if $\exp(s)$ becomes very large or very small. Clipping the $s$ outputs to a bounded range and using careful weight initialization (to keep $s \approx 0$ at initialization, making the initial flow near-identity) are standard practices.
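Data preprocessing: image data (discrete pixel values in $\{0, \dots, 255\}$) must be mapped to $\mathbb{R}^d$ for flows. The standard pipeline: (1) add uniform noise $u \sim \mathcal{U}[0, 1)$ and rescale, $\tilde{x} = (x + u) / 256$ (dequantization, converting discrete pixels to continuous values); (2) apply a logit transform $y = \operatorname{logit}(\alpha + (1 - 2\alpha)\tilde{x})$ for a small constant $\alpha > 0$ (preventing mass accumulation at the boundaries); (3) model $y$ with the flow. Skipping dequantization lets a continuous density place arbitrarily narrow, arbitrarily tall spikes on the discrete pixel values, so the reported likelihood can be made arbitrarily high without the model learning anything useful.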
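A minimal sketch of this pipeline follows. The function names and the value `alpha=0.05` are illustrative assumptions. The returned `log_det` is the log-Jacobian of the map from the dequantized pixel value (on the 0 to 256 scale) to $y$; adding it to the flow's log-density of $y$ gives the log-density in pixel space.

```python
import numpy as np

def preprocess(pixels, rng, alpha=0.05):
    """Dequantize 8-bit pixels, squeeze into [alpha, 1 - alpha), apply logit."""
    x = pixels.astype(np.float64)
    u = rng.uniform(0.0, 1.0, size=x.shape)
    x_tilde = (x + u) / 256.0                     # in [0, 1)
    p = alpha + (1.0 - 2.0 * alpha) * x_tilde     # away from 0 and 1
    y = np.log(p) - np.log1p(-p)                  # logit(p)
    # dy/dx = (1 - 2 alpha) / 256 / (p (1 - p)), summed in log space per image.
    log_det = np.sum(
        np.log(1.0 - 2.0 * alpha) - np.log(256.0) - np.log(p) - np.log1p(-p),
        axis=-1,
    )
    return y, log_det

def postprocess(y, alpha=0.05):
    """Invert the logit and rescaling, then floor back to 8-bit pixels."""
    p = 1.0 / (1.0 + np.exp(-y))
    x_tilde = (p - alpha) / (1.0 - 2.0 * alpha)
    return np.clip(np.floor(x_tilde * 256.0), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(4, 12))   # four tiny fake "images"
y, log_det = preprocess(pixels, rng)
recovered = postprocess(y)
```

Flooring in `postprocess` discards the dequantization noise, so `recovered` matches the original integer pixels.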
Bits per dimension: flow likelihoods are typically reported in bits per dimension (bpd): $\mathrm{bpd} = -\log_2 p(x) / d$, where $d$ is the number of data dimensions. Lower is better. State-of-the-art flows achieve 3.3–3.5 bpd on CIFAR-10 (3072 dimensions); autoregressive transformers achieve 2.8–3.0 bpd but are much slower to sample. Understanding bpd allows direct comparison between model families using the same metric.
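Since most frameworks report log-likelihoods in nats, the conversion divides by $d \ln 2$. The helper name below is an assumption. A useful sanity check is that a uniform model over 256 values per dimension scores exactly 8 bpd, so any useful image model should land well below that.

```python
import math

def bits_per_dim(log_likelihood_nats, num_dims):
    """Convert a total log-likelihood in nats to bits per dimension."""
    return -log_likelihood_nats / (num_dims * math.log(2.0))

d = 3072  # CIFAR-10: 32 * 32 * 3

# Uniform baseline: log p(x) = -d * ln(256), which is 8 bits per dimension.
uniform_bpd = bits_per_dim(-d * math.log(256.0), d)

# Going the other way: the total log-likelihood in nats implied by 3.35 bpd.
nats_at_3_35 = -3.35 * d * math.log(2.0)
```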
Neural ODEs and continuous normalizing flows
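Continuous normalizing flows (CNFs; Chen et al., 2018) replace the discrete sequence of transformations with a continuous flow defined by an ODE:
$$\frac{dz(t)}{dt} = f_\theta(z(t), t)$$
where $f_\theta$ here denotes the learned vector field. The change-of-variables formula for the continuous flow is the instantaneous change of variables:
$$\frac{d \log p(z(t))}{dt} = -\operatorname{tr}\left( \frac{\partial f_\theta}{\partial z(t)} \right)$$
This requires only the trace of the Jacobian (not the full determinant), which can be estimated in $O(d)$ using Hutchinson's trace estimator: $\operatorname{tr}(A) = \mathbb{E}_{\epsilon}\left[ \epsilon^\top A \epsilon \right]$ for $\epsilon \sim \mathcal{N}(0, I)$.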
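Hutchinson's estimator only needs matrix-vector products, never the matrix itself. The sketch below uses an explicit random matrix so the estimate can be compared with the exact trace; in a CNF the `matvec` callable would instead be a Jacobian-vector product obtained by automatic differentiation. The function name and sizes are assumptions.

```python
import numpy as np

def hutchinson_trace(matvec, dim, num_samples, rng):
    """Estimate tr(A) as the average of eps^T (A eps) over Gaussian probes."""
    estimates = np.empty(num_samples)
    for k in range(num_samples):
        eps = rng.standard_normal(dim)
        estimates[k] = eps @ matvec(eps)
    return estimates.mean()

rng = np.random.default_rng(0)
dim = 50
A = rng.standard_normal((dim, dim)) / np.sqrt(dim)

estimate = hutchinson_trace(lambda v: A @ v, dim, num_samples=20000, rng=rng)
exact = np.trace(A)
```

The estimate is unbiased but noisy; the many probes here are only for a tight comparison, whereas training typically tolerates a much noisier estimate from very few probes per step.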
The CNF can be trained by running the ODE forward (from $t_0$ to $t_1$) and backward (from $t_1$ to $t_0$), with the adjoint method providing $O(1)$-memory gradients. The neural ODE framework generalizes naturally to data on manifolds, graph-structured data, and irregular time series, domains where discrete flow layers would require bespoke architecture.
FFJORD (Grathwohl et al., 2019) implements CNFs with free-form Jacobians (no masking or coupling required) using the stochastic trace estimator, achieving competitive likelihoods with a more flexible architecture than coupling-based flows.
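To make the continuous change of variables concrete, the sketch below integrates the state and the log-density together with a fixed-step Euler solver. Everything here is an illustrative assumption: the vector field is a hand-picked affine map (the simplest case, where the Jacobian trace is known exactly), and a practical CNF would use a learned network, an adaptive solver, and a stochastic trace estimate.

```python
import numpy as np

# Hand-picked affine vector field f(z, t) = A z + b (assumed for illustration).
A = np.array([[0.3, -1.0], [1.0, -0.1]])
b = np.array([0.5, 0.0])

def vector_field(z, t):
    return A @ z + b

def divergence(z, t):
    """Trace of the Jacobian df/dz; for an affine field this is tr(A)."""
    return np.trace(A)

def integrate_cnf(z0, log_p0, t0=0.0, t1=1.0, steps=1000):
    """Euler integration of dz/dt = f(z, t) and d(log p)/dt = -tr(df/dz)."""
    z, log_p = z0.copy(), log_p0
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        log_p = log_p - dt * divergence(z, t)
        z = z + dt * vector_field(z, t)
        t += dt
    return z, log_p

z0 = np.array([1.0, 2.0])
log_p0 = -0.5 * np.sum(z0**2 + np.log(2.0 * np.pi))  # standard normal at t0
z1, log_p1 = integrate_cnf(z0, log_p0)
```

Swapping `divergence` for a stochastic trace estimate leaves the rest of the integration loop unchanged.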
Schrödinger bridges
Schrödinger bridges generalize normalizing flows to stochastic processes: instead of a deterministic ODE connecting $p_0$ to $p_1$, a Schrödinger bridge finds the minimum-entropy-transport stochastic process connecting two marginals $p_0$ and $p_1$. The bridge minimizes:
$$\min_{\mathbb{P} \in \mathcal{D}(p_0, p_1)} \mathrm{KL}(\mathbb{P} \,\|\, \mathbb{Q})$$
where $\mathbb{Q}$ is a reference Brownian motion process and $\mathcal{D}(p_0, p_1)$ is the set of processes with the correct marginals. The solution generalizes optimal transport while accounting for diffusion. Schrödinger bridges have applications in single-cell biology (interpolating between gene expression distributions) and are the theoretical foundation of diffusion Schrödinger bridges, an alternative approach to generative modeling that bridges two arbitrary distributions rather than a data distribution and a Gaussian.
The Schrödinger bridge is closely related to optimal transport and provides theoretical grounding for flow matching (Week 7). The key difference from a normalizing flow is that the source distribution can be any distribution, not just a Gaussian. For generative modeling, the Schrödinger bridge connects the data distribution to $\mathcal{N}(0, I)$ via a minimum-entropy stochastic process. In the zero-noise limit, the bridge reduces to the optimal transport interpolation, whose vector field minimizes the total kinetic energy; this is the objective that Optimal Transport Flow Matching targets (Course 2, Week 9 made use of this for real-time robot control via OT-CFM). This connection shows that flow matching, OT-CFM, and Schrödinger bridges are closely related forms of minimum-energy interpolation between two distributions, differing mainly in whether the process is deterministic (flow matching), stochastic (Schrödinger bridge), or constrained to straight-line paths (rectified flow).
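The link to optimal transport can be illustrated on a grid. With a Brownian reference, the endpoint (static) version of the Schrödinger bridge problem coincides with entropy-regularized optimal transport under a quadratic cost, which Sinkhorn iterations solve. The sketch below is that swapped-in discrete technique, not a full dynamic bridge; the grid, the two histograms, and `epsilon` (playing the role of the reference noise level) are assumptions.

```python
import numpy as np

def sinkhorn(p0, p1, cost, epsilon, num_iters=500):
    """Entropy-regularized OT coupling between two histograms."""
    K = np.exp(-cost / epsilon)       # Gibbs kernel of the reference process
    u = np.ones_like(p0)
    v = np.ones_like(p1)
    for _ in range(num_iters):
        u = p0 / (K @ v)              # rescale rows to match p0
        v = p1 / (K.T @ u)            # rescale columns to match p1
    return u[:, None] * K * v[None, :]

grid = np.linspace(-1.0, 1.0, 40)
p0 = np.exp(-0.5 * (grid + 0.5) ** 2 / 0.2**2)   # source bump on the left
p1 = np.exp(-0.5 * (grid - 0.5) ** 2 / 0.3**2)   # target bump on the right
p0, p1 = p0 / p0.sum(), p1 / p1.sum()
cost = 0.5 * (grid[:, None] - grid[None, :]) ** 2

coupling = sinkhorn(p0, p1, cost, epsilon=0.5)

# Average destination for mass starting at each source location.
transport_mean = (coupling @ grid) / p0
```

As `epsilon` shrinks, the coupling concentrates toward a deterministic map; larger values spread each source point over many destinations, which is the stochasticity the bridge adds over plain optimal transport.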
Practical considerations for flow architecture design
Designing an effective normalizing flow requires balancing multiple constraints:
Depth vs. expressiveness: deeper flows can represent more complex distributions (since each layer adds a nonlinear transformation), but each additional layer increases memory consumption for backpropagation and increases the number of Jacobian determinants that must be computed. Empirically, 16–32 coupling layers are sufficient for high-quality image modeling; beyond that, gains diminish. The depth of the conditioner networks $s$ and $t$ matters less than the depth of the flow; using 3-layer ConvNets for conditioners is typical.
Masked convolutions and efficiency: in RealNVP, the checkerboard mask determines which pixels are transformed; channel masks apply the split at the channel level. Checkerboard splitting preserves spatial structure but must alternate with channel splits to allow information flow. Channel-wise masks are faster (no computation is wasted on masked-out spatial positions) but require careful reshuffling between layers. Modern implementations use invertible $1 \times 1$ convolutions (Glow) or learned permutations to allow the model to discover useful split patterns rather than using fixed masking.
Initialization strategies: initializing a normalizing flow to be near-identity is critical. If $s \approx 0$ and $t \approx 0$ at initialization, the coupling layer is close to the identity transformation. This ensures the flow starts with reasonable log-Jacobian values (near zero) and avoids exploding or vanishing Jacobians early in training. Techniques: zero-initialize the final layer of $s$ and $t$, use small weight initialization for conditioner networks, or apply a scaling factor to the output of conditioners.
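The zero-initialization trick can be sketched as follows. The conditioner is a hypothetical two-layer MLP whose final layer starts at zero, and the log-scale is passed through a $\tanh$ to keep it bounded; layer sizes and the parameter dictionary layout are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_a, d_b, hidden = 3, 3, 32  # assumed sizes

params = {
    "W1": rng.standard_normal((d_a, hidden)) / np.sqrt(d_a),
    "b1": np.zeros(hidden),
    # Zero-initialized final layer: s = t = 0 for every input at the start.
    "W2": np.zeros((hidden, 2 * d_b)),
    "b2": np.zeros(2 * d_b),
}

def conditioner(x_a, params):
    """Two-layer MLP producing a bounded log-scale s and a translation t."""
    h = np.maximum(x_a @ params["W1"] + params["b1"], 0.0)  # ReLU
    out = h @ params["W2"] + params["b2"]
    s_raw, t = out[..., :d_b], out[..., d_b:]
    return np.tanh(s_raw), t  # tanh keeps the log-scale in (-1, 1)

def coupling_forward(x, params):
    """Affine coupling: pass x_a through, transform x_b using s and t."""
    x_a, x_b = x[..., :d_a], x[..., d_a:]
    s, t = conditioner(x_a, params)
    z = np.concatenate([x_a, x_b * np.exp(s) + t], axis=-1)
    return z, np.sum(s, axis=-1)

x = rng.standard_normal((4, d_a + d_b))
z, log_det = coupling_forward(x, params)
```

At initialization `z` equals `x` and `log_det` is zero, so a deep stack of such layers starts as the identity map and the initial loss is just the base log-density of the data.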
Flow inversion for synthesis: sampling from a flow requires computing the inverse transformation at each layer. Some architectures support exact closed-form inverse computation (affine coupling layers, autoregressive flows); others require solving an equation or iterative refinement (neural ODEs via ODE solvers). For real-time generation, a cheap closed-form inverse is essential: neural ODEs, while theoretically elegant, are slower than discrete coupling layers.
Likelihood-free estimation: while flows provide exact likelihoods, computing the likelihood for a large batch is memory-intensive. For big datasets, sampling-based evaluation (estimating likelihoods via importance sampling or variational bounds) can be more efficient than exact computation.
GenAI context: flows across the generative modeling landscape
| Concept | Flow analog | Application |
|---|---|---|
| Exact likelihood | Log-Jacobian formula | Model selection, density estimation, anomaly detection |
| Coupling layer | Masked self-attention in parallel | Parallel sampling in autoregressive models |
| IAF fast sampling | Used in VAE encoders (inference networks) for expressive posteriors | Posterior approximation quality |
| Neural ODE | Continuous residual networks | Continuous-time sequence models, NODE-based RNNs |
| Schrödinger bridge | Minimum-kinetic-energy interpolation | Image-to-image translation, cell trajectory modeling |
The practical lesson from normalizing flows is that invertibility is expensive: enforcing it either constrains the architecture (coupling layers), requires sequential computation (autoregressive flows), or requires ODE integration (CNFs). Flow matching (Week 7) stops enforcing invertibility through the architecture and stops training through the likelihood; instead it trains a vector field directly via regression. This allows an unconstrained architecture, simulation-free training, and sampling that is parallel across dimensions, at the cost of cheap exact likelihoods, which now require integrating an ODE. The dominance of flow matching and diffusion over normalizing flows in practice reflects this tradeoff.
Key takeaways
- Normalizing flows provide exact likelihood, exact sampling, and exact inference through invertible mappings; the log-likelihood equals the base log-likelihood plus the log-absolute-Jacobian-determinant.
- Coupling layers achieve $O(d)$ Jacobian determinant computation with triangular structure, using unrestricted networks for the scale and translation conditioners; RealNVP uses masked ConvNets with alternating checkerboard/channel splits; Glow adds invertible $1 \times 1$ convolutions and multi-scale generation.
- Autoregressive flows (MAF/IAF) extend coupling layers to full autoregressive structure, with a fundamental tradeoff between sampling speed and density-evaluation speed: MAF is fast for density evaluation but slow at sampling; IAF is fast at sampling but slow for likelihood.
- Continuous normalizing flows replace discrete transformations with ODEs, requiring only the trace (not determinant) of the Jacobian; stochastic trace estimation enables free-form architectures, and the adjoint method keeps memory constant.
- In practice, flows are memory-intensive and sensitive to numerical stability issues with unbounded Jacobians; they require careful data preprocessing (dequantization, logit transforms) to work well.
- The dominance of flow matching (Week 7) over normalizing flows reflects the fundamental tradeoff: invertibility enables exact likelihoods but constrains architecture and costs memory efficiency compared to unconstrained regression-based models.
Conceptual questions
- A 2D normalizing flow applies the affine coupling layer $z_1 = x_1$, $z_2 = x_2 \exp(s(x_1)) + t(x_1)$. Compute the log-likelihood contribution of this layer to $\log p(x)$ for a given $x$ with known values of $s(x_1)$ and $t(x_1)$. Then compute the inverse transformation to recover $x$ from $z$. Show that the inverse exists without requiring $s$ or $t$ to be invertible.
- MAF has $O(1)$ training (density evaluation) but $O(d)$ sampling, while IAF has $O(1)$ sampling but $O(d)$ density evaluation. Describe a generative modeling scenario where you would prefer MAF over IAF and vice versa. How does this tradeoff relate to the design of variational inference algorithms (where the inference network is an IAF)?
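- A continuous normalizing flow uses the instantaneous change-of-variables formula $\frac{d \log p(z(t))}{dt} = -\operatorname{tr}\left( \frac{\partial f}{\partial z} \right)$. Show that for an affine vector field $f(z, t) = Az + b$ where $A \in \mathbb{R}^{d \times d}$, the rate of change of log-probability along any trajectory is constant at $-\operatorname{tr}(A)$. What does this imply for the expressiveness of affine CNFs?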
- The Schrödinger bridge between distributions $p_0$ and $p_1$ minimizes $\mathrm{KL}(\mathbb{P} \,\|\, \mathbb{Q})$ over processes $\mathbb{P}$ with correct marginals, where $\mathbb{Q}$ is the reference process. Compared to the deterministic optimal transport map, what additional flexibility does the stochastic Schrödinger bridge provide? In which application domains does this stochastic interpolation provide meaningful advantages over deterministic flow?
- A normalizing flow trained on a distribution with disconnected support (two well-separated modes) must map the data to a unimodal Gaussian base distribution. Explain why this requires the Jacobian determinant to vary dramatically across the data space. What practical training difficulty does this create, and how does it compare to the training difficulty faced by GANs and EBMs on the same multimodal distribution?
Looking ahead
Normalizing flows provide exact likelihood through invertible architectures constrained to have tractable Jacobians. The next model family achieves even higher sample quality by abandoning invertibility and instead defining distributions through a learned denoising process.
Week 6: Denoising Diffusion Probabilistic Models. We derive the DDPM forward process (data → noise) and reverse process (noise → data), show that the optimal denoising target is the simple noise prediction objective $\mathbb{E}_{t, x_0, \epsilon}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$, and connect DDPM to SDE/ODE formulations that enable accelerated sampling.
Further reading
- Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using Real NVP. ICLR.
- Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative Flow with Invertible 1x1 Convolutions. NeurIPS.
- Chen, R. T. Q., et al. (2018). Neural Ordinary Differential Equations. NeurIPS. (Continuous Normalizing Flows).