Week 5: Normalizing Flows

Purpose of this lecture

Normalizing flows define a probability distribution by constructing an invertible, differentiable mapping $f_\theta: \mathcal{X} \to \mathcal{Z}$ from data space to a simple base space (usually a Gaussian), then transforming the base density through the inverse mapping. They offer exact likelihood evaluation, exact sampling, and exact latent inference — a uniquely complete set of tractable operations among generative model families. Understanding flows clarifies both their strengths and the architectural constraints required to maintain invertibility efficiently, and provides the conceptual bridge to neural ODEs and Schrödinger bridges.

The change-of-variables formula

For an invertible differentiable mapping $z = f_\theta(x)$ with $x = f_\theta^{-1}(z) = g_\theta(z)$, the change-of-variables formula relates the densities of $x$ and $z$:

$$p_X(x) = p_Z(f_\theta(x)) \cdot \left|\det J_{f_\theta}(x)\right|$$

where $J_{f_\theta}(x) = \frac{\partial f_\theta(x)}{\partial x}$ is the Jacobian matrix of $f_\theta$ at $x$. Taking logarithms gives the exact log-likelihood:

$$\log p_X(x) = \log p_Z(f_\theta(x)) + \log \left|\det J_{f_\theta}(x)\right|$$

The first term is the likelihood of the transformed point under the base distribution (computable if $p_Z$ is Gaussian). The second term is the log-absolute-Jacobian-determinant, which accounts for the volume change induced by the mapping. Computing a general $d \times d$ Jacobian determinant costs $O(d^3)$, which is prohibitive for high-dimensional $x$. All flow architectures are designed to make this computation tractable.

Sampling from a flow requires only the inverse: draw $z \sim p_Z$, compute $x = g_\theta(z)$. Density evaluation requires the forward pass: compute $z = f_\theta(x)$, then evaluate $\log p_Z(z) + \log|\det J_{f_\theta}(x)|$. Both directions are exact without approximation.
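The two directions can be checked on the simplest possible flow. The sketch below assumes a one-dimensional affine map with hypothetical parameters `MU` and `SIGMA` (not from the lecture) and compares the change-of-variables log-density with the closed-form Gaussian it should reproduce.

```python
import math
import random

# Hypothetical parameters of a 1-D affine flow z = f(x) = (x - MU) / SIGMA.
MU, SIGMA = 2.0, 0.5


def base_log_prob(z):
    """Log-density of the base distribution p_Z = N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def forward(x):
    """Data -> base. Returns z and log|det J_f(x)| (here |df/dx| = 1 / SIGMA)."""
    return (x - MU) / SIGMA, -math.log(SIGMA)


def inverse(z):
    """Base -> data: x = g(z)."""
    return MU + SIGMA * z


def log_prob(x):
    """Exact log p_X(x) = log p_Z(f(x)) + log|det J_f(x)|."""
    z, log_abs_det = forward(x)
    return base_log_prob(z) + log_abs_det


def sample(rng):
    """Exact sampling: draw z ~ p_Z, then push it through the inverse."""
    return inverse(rng.gauss(0.0, 1.0))


def gaussian_log_prob(x, mean, std):
    """Closed-form N(mean, std^2) log-density, used only as a reference."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


rng = random.Random(0)
x = sample(rng)
print(log_prob(x), gaussian_log_prob(x, MU, SIGMA))  # the two values should agree
```

Because the map is affine, the flow density is exactly $\mathcal{N}(\mu, \sigma^2)$; deeper flows use the same two functions, with nonlinear layers in place of the affine map.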

Coupling layers: RealNVP and Glow

Affine coupling layers (Dinh et al., 2017, RealNVP) achieve $O(d)$ Jacobian determinant computation by dividing the input dimensions into two halves and applying a learned affine transformation to one half conditioned on the other:

$$x_A' = x_A, \quad x_B' = x_B \odot \exp(s_\theta(x_A)) + t_\theta(x_A)$$

where $s_\theta$ and $t_\theta$ are arbitrary neural networks mapping from $x_A$ to scale and translation for $x_B$. Because $x_A' = x_A$, the Jacobian of the full layer is lower-triangular:

$$J = \begin{pmatrix} I & 0 \\ \frac{\partial x_B'}{\partial x_A} & \text{diag}(\exp(s_\theta(x_A))) \end{pmatrix}$$

The determinant is $\prod_i \exp(s_{\theta,i}(x_A)) = \exp(\sum_i s_{\theta,i}(x_A))$, an $O(d)$ computation. The inverse is also $O(d)$: since $x_A = x_A'$ passes through unchanged, the conditioners can be re-evaluated on it, giving $x_B = (x_B' - t_\theta(x_A)) \odot \exp(-s_\theta(x_A))$. Critically, $s_\theta$ and $t_\theta$ do not need to be invertible, so any architecture works.
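A minimal sketch of one affine coupling layer, using plain Python lists. The `conditioner` function is a hypothetical stand-in for $s_\theta$ and $t_\theta$ (it is deliberately non-invertible, since it only sees the sum of $x_A$), and the two halves are assumed to have equal length.

```python
import math


def conditioner(x_a):
    """Toy stand-in for s_theta and t_theta; any function of x_a works."""
    total = sum(x_a)
    s = [math.tanh(total) for _ in x_a]       # log-scales for x_b
    t = [math.sin(total) + 1.0 for _ in x_a]  # translations for x_b
    return s, t


def coupling_forward(x_a, x_b):
    """Apply the layer; returns the outputs and log|det J| = sum of log-scales."""
    s, t = conditioner(x_a)
    y_b = [xb * math.exp(si) + ti for xb, si, ti in zip(x_b, s, t)]
    return list(x_a), y_b, sum(s)


def coupling_inverse(y_a, y_b):
    """Undo the layer; y_a equals x_a, so the conditioner is simply re-evaluated."""
    s, t = conditioner(y_a)
    x_b = [(yb - ti) * math.exp(-si) for yb, si, ti in zip(y_b, s, t)]
    return list(y_a), x_b


x_a, x_b = [0.3, -1.2], [0.7, 2.0]
y_a, y_b, log_det = coupling_forward(x_a, x_b)
rec_a, rec_b = coupling_inverse(y_a, y_b)
print(rec_b, log_det)  # rec_b should match x_b up to rounding
```

Note that neither direction ever inverts `conditioner`; both just call it on the half that passed through unchanged.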

Stacking coupling layers with alternating splits and shuffling between layers allows information to flow across all dimensions. Glow (Kingma and Dhariwal, 2018) extends RealNVP with invertible $1\times 1$ convolutions (mixing channels) and activation normalization (ActNorm, replacing batch norm), while keeping RealNVP's multi-scale architecture, which factors out part of the latent variables at each resolution. Glow achieved state-of-the-art likelihoods among flow-based models on image benchmarks at the time and demonstrated latent-space interpolation for face attributes.
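When layers are stacked, the per-layer log-determinants simply add, by the chain rule. A short worked statement, writing $h_0 = x$ and $h_k = f_k(h_{k-1})$ (the symbols $f_k$, $h_k$, and $K$ are introduced here for illustration only):

$$
z = (f_K \circ \cdots \circ f_1)(x), \qquad
\log p_X(x) = \log p_Z(z) + \sum_{k=1}^{K} \log\left|\det J_{f_k}(h_{k-1})\right|
$$

For affine coupling layers each summand is the sum of that layer's scale outputs, so the full log-determinant costs $O(Kd)$.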

What the $s_\theta$ and $t_\theta$ subnetworks look like in practice

The key freedom in coupling layers, that $s_\theta$ and $t_\theta$ need not be invertible, allows using powerful architectures without constraint. In RealNVP for images, $s_\theta$ and $t_\theta$ are ResNets with masked convolutions: the split into $x_A$ and $x_B$ corresponds to a checkerboard or channel-wise mask, and the conditioner networks are standard convolutional ResNets taking $x_A$ as input and producing per-pixel scale and translation maps for $x_B$.

For the checkerboard split, alternating pixels form $x_A$: the black squares on a checkerboard pattern. The conditioner ResNet sees every other pixel and predicts scale and translation for the remaining pixels. Alternating between checkerboard and channel splits across coupling layers lets information mix across spatial locations and channels over successive layers, a principle similar to alternating row and column attention in axial Transformers.

In Glow, $1 \times 1$ invertible convolutions (learned generalizations of channel permutations) sit between coupling layers, so the model learns how channels are mixed before each split instead of relying on a fixed permutation. The $s_\theta$ and $t_\theta$ networks are shallow 3-layer ConvNets with ReLU activations; the scale output $s_\theta(x_A)$ is often bounded, for example to $(-2, 2)$ with a scaled $\tanh$, to prevent the Jacobian from becoming numerically unstable.

Autoregressive flows

Autoregressive flows (MAF and IAF; Papamakarios et al., 2017; Kingma et al., 2016) generalize coupling layers to the full autoregressive factorization:

$$z_i = (x_i - \mu_\theta^{(i)}(x_{<i})) / \exp(\alpha_\theta^{(i)}(x_{<i}))$$

where $\mu_\theta^{(i)}$ and $\alpha_\theta^{(i)}$ are learned conditioners depending on all preceding dimensions. This achieves a triangular Jacobian with $O(d)$ determinant, like coupling layers, but allows each dimension to depend on all preceding dimensions rather than just the half-partition.

Masked Autoregressive Flow (MAF): the forward pass (density evaluation) takes a single parallel pass of the masked network, since every conditioner input $x_{<i}$ is already known; the inverse pass (sampling) requires $d$ sequential evaluations, because $x_i$ can only be computed once $x_{<i}$ is available. MAF is fast at density evaluation and slow at sampling.

Inverse Autoregressive Flow (IAF): the transformation is parameterized in the other direction, with conditioners that depend on $z_{<i}$, making sampling $O(1)$ parallel passes but density evaluation of an arbitrary data point $O(d)$ sequential. IAF is fast at sampling and slow at density evaluation.

This speed asymmetry reflects a fundamental duality: fast density evaluation requires a cheap forward direction of the transformation; fast sampling requires a cheap inverse. Coupling layers occupy the middle ground: both directions take a single network pass (with an $O(d)$ log-determinant), at the cost of only allowing half the dimensions to be transformed per layer.
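The asymmetry is visible in a few lines of code. The sketch below assumes a hypothetical toy `conditioner` in place of a masked network; in the density direction every loop iteration reads only the known input `x`, whereas in the sampling direction each step needs the outputs of all earlier steps.

```python
import math


def conditioner(prefix):
    """Toy stand-in for (mu_i, alpha_i) as a function of x_{<i}; not a real MADE."""
    h = sum(prefix)
    return 0.5 * h, math.tanh(h)  # shift mu_i and log-scale alpha_i


def maf_forward(x):
    """Density direction: iterations are independent, so they could run in parallel."""
    z, log_det = [], 0.0
    for i in range(len(x)):
        mu, alpha = conditioner(x[:i])
        z.append((x[i] - mu) * math.exp(-alpha))
        log_det -= alpha  # dz_i/dx_i = exp(-alpha_i)
    return z, log_det


def maf_inverse(z):
    """Sampling direction: x_i depends on x_{<i}, so the d steps must run in order."""
    x = []
    for i in range(len(z)):
        mu, alpha = conditioner(x)  # x currently holds x_{<i}
        x.append(z[i] * math.exp(alpha) + mu)
    return x


x = [0.4, -1.0, 2.5, 0.1]
z, log_det = maf_forward(x)
print(maf_inverse(z))  # should reproduce x up to rounding
```

An IAF swaps the roles: its conditioners read $z_{<i}$, so the loop that can be parallelized is the sampling one.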

Practical training: flows vs other generative models

Normalizing flows have distinctive training characteristics compared to VAEs and GANs:

Memory requirements: flows require storing the full Jacobian computation graph for backpropagation. For a coupling layer with $O(d)$ Jacobian determinant, the memory cost is $O(d)$ per layer and $O(Ld)$ total for $L$ layers — proportional to the model depth. In contrast, VAEs require storing the encoder, decoder, and latent sample; GANs require storing generator and discriminator. Flows are generally more memory-intensive than VAEs for the same model capacity because every intermediate activation is needed for the backward pass through the Jacobian.

Training stability: flows train with standard MLE and do not suffer from the adversarial instability of GANs or the posterior collapse failure of VAEs. However, they are sensitive to numerical stability: the log-Jacobian must be finite, which can fail if $\exp(s_\theta(x_A))$ becomes very large or very small. Clipping $s_\theta$ outputs to a bounded range and using careful weight initialization (to keep $s_\theta \approx 0$ at initialization, making the initial flow near-identity) are standard practices.
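A small sketch of the bounding trick, assuming a scaled $\tanh$ as the squashing function (one common choice, not the only one). The helper names `bounded_log_scale` and `scale_from_raw` are hypothetical.

```python
import math


def bounded_log_scale(raw, bound=2.0):
    """Squash an unconstrained conditioner output into [-bound, bound]."""
    return bound * math.tanh(raw / bound)


def scale_from_raw(raw, bound=2.0):
    """Scale factor exp(s) with s bounded, so it stays in [exp(-bound), exp(bound)]."""
    return math.exp(bounded_log_scale(raw, bound))


# An unbounded log-scale overflows double precision.
try:
    math.exp(1000.0)
    overflowed = False
except OverflowError:
    overflowed = True

print(overflowed)              # True
print(scale_from_raw(1000.0))  # finite, at most exp(2)
print(scale_from_raw(0.0))     # 1.0: a zero conditioner output gives the identity scale
```

Near zero the squashing is approximately the identity, so a zero-initialized conditioner still starts the layer at the identity map.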

Data preprocessing: discrete data (images with integer pixel values in $\{0, \dots, 255\}$) must be mapped to $\mathbb{R}^d$ for flows. The standard pipeline: (1) add uniform noise $x' = x + u$, $u \sim \text{Uniform}(0, 1)^d$ (dequantization, converting discrete pixels to continuous values); (2) apply a logit transform $y = \text{logit}(\alpha + (1-2\alpha) x'/256)$ for $\alpha \approx 0.05$ (mapping the bounded range onto the real line while avoiding infinities at the boundaries); (3) model $y$ with the flow, adding the log-determinant of the preprocessing transform when likelihoods are reported in the original pixel space. Skipping dequantization lets a continuous density place arbitrarily narrow, tall spikes on the integer pixel values, so the training likelihood can grow without bound and stops being a meaningful measure of fit.
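The pipeline can be sketched for a flat list of pixel values as follows. This assumes the formula from the text with $\alpha = 0.05$; the function names and the per-pixel loop are illustrative choices, not a reference implementation.

```python
import math
import random

ALPHA = 0.05  # boundary padding used in the logit transform


def preprocess(pixels, rng):
    """Dequantize and logit-transform integer pixels in {0, ..., 255}.

    Returns y and log|det dy/dx'|, so that log p(x') = log p_Y(y) + log_det.
    """
    y, log_det = [], 0.0
    for x in pixels:
        x_cont = x + rng.random()                         # uniform noise in [0, 1)
        p = ALPHA + (1.0 - 2.0 * ALPHA) * x_cont / 256.0  # in [ALPHA, 1 - ALPHA)
        y.append(math.log(p) - math.log1p(-p))            # logit(p)
        # dy/dx' = (1 - 2*ALPHA) / 256 / (p * (1 - p))
        log_det += (math.log(1.0 - 2.0 * ALPHA) - math.log(256.0)
                    - math.log(p) - math.log1p(-p))
    return y, log_det


def postprocess(y):
    """Invert the logit and rescaling, then floor to recover integer pixels."""
    pixels = []
    for v in y:
        p = 1.0 / (1.0 + math.exp(-v))
        x_cont = (p - ALPHA) / (1.0 - 2.0 * ALPHA) * 256.0
        pixels.append(min(255, max(0, math.floor(x_cont))))
    return pixels


pixels = [0, 17, 128, 255]
y, log_det = preprocess(pixels, random.Random(0))
print(postprocess(y))  # should recover the original integer pixels
```

The returned `log_det` is the term that has to be added to the flow's log-density of `y` before converting to bits per dimension in pixel space.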

Bits per dimension: flow likelihoods are typically reported in bits per dimension (bpd): $-\log_2 p_\theta(x) / d$. Lower is better. Coupling-based flows such as RealNVP and Glow report roughly 3.3–3.5 bpd on CIFAR-10 (3072 dimensions); autoregressive transformers reach roughly 2.8–3.0 bpd but are much slower to sample. Because bpd normalizes by dimensionality and uses a common unit, it allows direct comparison between model families on the same dataset.
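Most frameworks return log-likelihoods in nats, so the conversion is a division by $d \ln 2$. A minimal sketch, where the log-likelihood value is a made-up number for illustration and is assumed to already include any preprocessing log-determinant:

```python
import math


def bits_per_dim(log_prob_nats, num_dims):
    """Convert a total log-likelihood in nats to bits per dimension."""
    return -log_prob_nats / (num_dims * math.log(2.0))


# Hypothetical log-likelihood of one 32x32x3 image (d = 3072), in nats.
d = 32 * 32 * 3
log_p = -7000.0
print(bits_per_dim(log_p, d))
```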

Neural ODEs and continuous normalizing flows

Continuous normalizing flows (CNFs; Chen et al., 2018) replace the discrete sequence of transformations with a continuous flow defined by an ODE:

$$\frac{dz(t)}{dt} = f_\theta(z(t), t), \quad z(0) = z_0, \quad z(1) = x$$

The change-of-variables formula for the continuous flow is the instantaneous change of variables:

$$\frac{d \log p(z(t))}{dt} = -\text{tr}\!\left(\frac{\partial f_\theta}{\partial z(t)}\right)$$

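This requires only the trace of the Jacobian (not the full determinant). Computing the trace exactly still takes $d$ derivative evaluations, but it can be estimated at roughly the cost of one extra pass using Hutchinson's trace estimator: $\text{tr}(J) = \mathbb{E}_\epsilon[\epsilon^\top J \epsilon]$ for $\epsilon \sim \mathcal{N}(0, I)$, so a single probe $\epsilon^\top J \epsilon$ is an unbiased estimate.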
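The estimator can be sketched without ever forming $J$. The example below assumes a hypothetical three-dimensional vector field and uses a central finite difference as a stand-in for the Jacobian-vector product that an automatic differentiation library would provide.

```python
import math
import random


def vector_field(z):
    """Hypothetical nonlinear dynamics f(z) for a 3-D state (time omitted)."""
    return [math.tanh(z[0]) + z[1], z[0] * z[1], math.sin(z[2])]


def exact_trace(z):
    """Analytic tr(df/dz) for the toy field above, used only as a reference."""
    return (1.0 - math.tanh(z[0]) ** 2) + z[0] + math.cos(z[2])


def jvp(f, z, v, h=1e-5):
    """Jacobian-vector product J(z) v by central differences (autodiff stand-in)."""
    plus = f([zi + h * vi for zi, vi in zip(z, v)])
    minus = f([zi - h * vi for zi, vi in zip(z, v)])
    return [(p - m) / (2.0 * h) for p, m in zip(plus, minus)]


def hutchinson_trace(f, z, num_probes, rng):
    """Average eps^T J eps over Gaussian probes; each probe costs two calls to f."""
    total = 0.0
    for _ in range(num_probes):
        eps = [rng.gauss(0.0, 1.0) for _ in z]
        j_eps = jvp(f, z, eps)
        total += sum(e * je for e, je in zip(eps, j_eps))
    return total / num_probes


z = [0.3, -0.7, 1.1]
estimate = hutchinson_trace(vector_field, z, num_probes=20000, rng=random.Random(0))
print(estimate, exact_trace(z))  # the estimate should be close to the exact trace
```

Training typically uses a single probe per step and relies on averaging across steps; many probes are used here only to make the agreement visible.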

The CNF can be trained by running the ODE forward (from $z_0$ to $x$) and backward (from $x$ to $z_0$), with the adjoint method providing $O(1)$ memory gradients. The neural ODE framework generalizes naturally to data on manifolds, graph-structured data, and irregular time series, domains where discrete flow layers would require bespoke architecture.
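Integrating the instantaneous change-of-variables formula along a trajectory gives the log-likelihood that CNF training maximizes. With $z(0) = z_0$ and $z(1) = x$ as above:

$$
\log p_X(x) = \log p_Z(z_0) - \int_0^1 \text{tr}\!\left(\frac{\partial f_\theta}{\partial z(t)}\right) dt
$$

In practice the state $z(t)$ and the accumulated log-density change are usually integrated together as one augmented ODE, so a single solver call returns both.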

FFJORD (Grathwohl et al., 2019) implements CNFs with free-form Jacobians (no masking or coupling required) using the stochastic trace estimator, achieving competitive likelihoods with a more flexible architecture than coupling-based flows.

Schrödinger bridges

Schrödinger bridges generalize normalizing flows to stochastic processes: instead of a deterministic ODE connecting $p_0$ to $p_1$, a Schrödinger bridge finds the stochastic process that stays closest, in KL divergence, to a reference process while matching the two marginals $p_0$ and $p_1$ (an entropy-regularized form of optimal transport). The bridge solves:

$$\min_{P \in \mathcal{C}(p_0, p_1)} D_\text{KL}(P \| Q)$$

where $Q$ is a reference Brownian motion process and $\mathcal{C}(p_0, p_1)$ is the set of processes with the correct marginals. The solution generalizes optimal transport while accounting for diffusion. Schrödinger bridges have applications in single-cell biology (interpolating between gene expression distributions) and are the theoretical foundation of diffusion Schrödinger bridges — an alternative approach to generative modeling that bridges two arbitrary distributions rather than a data distribution and a Gaussian.

The Schrödinger bridge is closely related to optimal transport and to flow matching (Week 7). Unlike a normalizing flow, whose base distribution is a fixed simple density, the bridge can start from any source distribution $p_0$, not just a Gaussian. For generative modeling, the Schrödinger bridge connects $p_0 = \mathcal{N}(0, I)$ to the data distribution $p_1 = p_\text{data}$ via the stochastic process closest to the reference. In the zero-noise limit the bridge recovers dynamic optimal transport, whose solution is the vector field that minimizes the total kinetic energy $\mathbb{E}[\int_0^1 \|v_t\|^2 dt]$; this is the objective that Optimal Transport Flow Matching approximates (Course 2, Week 9 made use of this for real-time robot control via OT-CFM). Seen this way, flow matching, OT-CFM, and Schrödinger bridges are closely related interpolation schemes that differ mainly in whether the process is deterministic (flow matching), stochastic (Schrödinger bridge), or straightened toward straight-line paths (rectified flow).

Practical considerations for flow architecture design

Designing an effective normalizing flow requires balancing multiple constraints:

Depth vs. expressiveness: deeper flows can represent more complex distributions (since each layer adds a nonlinear transformation), but each additional layer increases memory consumption for backpropagation and adds another Jacobian determinant to compute. In practice, image flows stack tens of coupling layers (Glow, for example, uses 32 flow steps per scale level on CIFAR-10), and gains diminish with further depth. The depth of the conditioner networks $s_\theta$ and $t_\theta$ matters less than the depth of the flow; 3-layer ConvNets are typical conditioners.

Masked convolutions and efficiency: in RealNVP, the checkerboard mask determines which pixels are transformed, while channel masks apply the split at the channel level. Checkerboard splitting preserves spatial structure but must alternate with channel splits so that information can flow between the two partitions. Channel-wise masks are cheaper to apply (the split is a plain slice of the channel axis) but require careful reshuffling between layers. Later implementations use $1 \times 1$ invertible convolutions (Glow) or learned permutations so that the model can learn the channel mixing instead of relying on fixed masks.

Initialization strategies: initializing a normalizing flow to be near-identity is critical. If $s_\theta(x_A) \approx 0$ and $t_\theta(x_A) \approx 0$ at initialization, the coupling layer is close to the identity transformation. This ensures the flow starts with reasonable log-Jacobian values (near zero) and avoids exploding or vanishing Jacobians early in training. Techniques: zero-initialize the final layer of $s_\theta$ and $t_\theta$, use small weight initialization for conditioner networks, or apply a scaling factor to the output of conditioners.

Flow inversion for synthesis: sampling from a flow requires computing the inverse transformation of every layer. Some architectures have a closed-form inverse (affine coupling layers, autoregressive flows); others require a numerical solve (neural ODEs via ODE solvers). For real-time generation a cheap closed-form inverse matters: neural ODEs, while theoretically elegant, are slower to sample from than discrete coupling layers.

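Approximate likelihood evaluation: while flows provide exact likelihoods, the exact computation can be expensive for large batches or for architectures without a cheap log-determinant (the exact Jacobian trace in a CNF, for instance). In those cases stochastic estimates or bounds (Hutchinson-style trace estimates, importance sampling, variational bounds) can be more efficient than exact computation, at the price of noise or bias in the reported value.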

GenAI context: flows across the generative modeling landscape

| Concept | Flow analog | Application |
|---|---|---|
| Exact likelihood $\log p_\theta(x)$ | Log-Jacobian formula | Model selection, density estimation, anomaly detection |
| Coupling layer | Masked self-attention in parallel | Parallel sampling in autoregressive models |
| IAF fast sampling | Used in VAE encoders (inference networks) for expressive posteriors | Posterior approximation quality |
| Neural ODE | Continuous residual networks | Continuous-time sequence models, NODE-based RNNs |
| Schrödinger bridge | Minimum-kinetic-energy interpolation | Image-to-image translation, cell trajectory modeling |

The practical lesson from normalizing flows is that tractable invertibility is expensive: enforcing it either constrains the architecture (coupling layers), requires $O(d)$ sequential computation in one direction (autoregressive flows), or requires ODE integration (CNFs). Flow matching (Week 7) drops the likelihood-based training objective and instead trains a vector field directly via regression, which allows an unconstrained architecture and simulation-free training; the cost is that exact likelihoods are no longer a byproduct of training and must be obtained, when needed, by integrating the ODE. The dominance of flow matching and diffusion over normalizing flows in practice reflects this tradeoff.

Key takeaways

Normalizing flows provide exact likelihood, exact sampling, and exact inference through invertible mappings; the log-likelihood equals the base log-likelihood plus the log-absolute-Jacobian-determinant. Coupling layers achieve $O(d)$ Jacobian determinant computation with triangular structure, using unrestricted networks for the scale and translation conditioners; RealNVP uses masked ConvNets with alternating checkerboard/channel splits; Glow adds invertible $1 \times 1$ convolutions and ActNorm within a multi-scale architecture. Autoregressive flows (MAF/IAF) extend coupling layers to full autoregressive structure, with a fundamental tradeoff between evaluation speed and sampling speed: MAF is fast for density evaluation but slow at sampling; IAF is fast at sampling but slow for likelihood. Continuous normalizing flows replace discrete transformations with ODEs, requiring only the trace (not the determinant) of the Jacobian; stochastic trace estimation enables free-form architectures, and the adjoint method keeps memory constant in depth. In practice, flows are memory-intensive and sensitive to numerical instability from unbounded log-scales; they require careful data preprocessing (dequantization, logit transforms) to work well. The dominance of flow matching (Week 7) over normalizing flows reflects the fundamental tradeoff: tractable invertibility enables exact likelihoods but constrains the architecture and costs memory compared with unconstrained regression-based models.

Conceptual questions

1. A 2D normalizing flow applies the affine coupling layer $x_1' = x_1$, $x_2' = x_2 \cdot e^{s(x_1)} + t(x_1)$. Compute the log-likelihood contribution of this layer to $\log p_X(x)$ for a given $(x_1, x_2)$ with $s(x_1) = 0.5$ and $t(x_1) = 1.0$. Then compute the inverse transformation to recover $(x_1, x_2)$ from $(x_1', x_2')$. Show that the inverse exists without requiring $s$ or $t$ to be invertible.
2. MAF has $O(1)$ training (density evaluation) but $O(d)$ sampling, while IAF has $O(1)$ sampling but $O(d)$ density evaluation. Describe a generative modeling scenario where you would prefer MAF over IAF and vice versa. How does this tradeoff relate to the design of variational inference algorithms (where the inference network is an IAF)?
3. A continuous normalizing flow uses the instantaneous change-of-variables formula $\frac{d \log p}{dt} = -\text{tr}(\partial f/\partial z)$. Show that for an affine vector field $f(z, t) = Az + b$ where $A \in \mathbb{R}^{d \times d}$, the change in log-probability along any trajectory is constant at $-\text{tr}(A)$. What does this imply for the expressiveness of affine CNFs?
4. The Schrödinger bridge between distributions $p_0$ and $p_1$ minimizes $D_\text{KL}(P \| Q)$ over processes with correct marginals. Compared to the deterministic optimal transport map, what additional flexibility does the stochastic Schrödinger bridge provide? In which application domains does this stochastic interpolation provide meaningful advantages over deterministic flow?
5. A normalizing flow trained on a distribution with disconnected support (two well-separated modes) must map the data to a unimodal Gaussian base distribution. Explain why this requires the Jacobian determinant to vary dramatically across the data space. What practical training difficulty does this create, and how does it compare to the training difficulty faced by GANs and EBMs on the same multimodal distribution?

Solutions

1. Only $x_2$ is scaled, so $\det J = e^{s(x_1)}$ and the layer's contribution is $\log|\det J| = s(x_1) = 0.5$. Inverse: $x_1 = x_1'$, and $x_2 = (x_2' - t(x_1'))\,e^{-s(x_1')} = (x_2'-1.0)\,e^{-0.5}$. Note $s$ and $t$ are only ever evaluated (at $x_1=x_1'$, which passes through unchanged), never inverted, so they can be arbitrary non-invertible networks.
2. Prefer MAF when density evaluation/training throughput matters (MLE on a dataset, density estimation, anomaly detection). Prefer IAF when sampling speed matters (real-time generation, or as a VAE posterior $q_\phi(z\mid x)$ where you draw $z$ once and only need the density of that sample, which IAF gives cheaply). In VI the inference network is an IAF precisely because sampling and evaluating its own samples' density are both fast.
3. For $f=Az+b$, $\partial f/\partial z = A$ and $\text{tr}(A)$ is constant in $z,t$; integrating $\tfrac{d\log p}{dt}=-\text{tr}(A)$ over $t\in[0,1]$ gives $\Delta\log p=-\text{tr}(A)$ on every trajectory. So an affine CNF only rescales density by a single global constant: it cannot reshape density locally and is no more expressive than one affine map; nonlinear $f$ is required.
4. The stochastic bridge optimizes over a distribution of trajectories (Brownian-like paths) rather than the single deterministic OT map, so it models genuinely noisy transitions and spreads mass with an entropy term. This helps where the underlying process is stochastic (single-cell trajectories, image-to-image translation with inherent uncertainty), where a deterministic map would be artificially rigid.
5. To send two separated modes to a unimodal Gaussian, the forward map $f_\theta$ must squeeze the nearly empty inter-mode region into a sliver of base space (so that little mass lands there) while expanding each mode to cover the bulk of the Gaussian, forcing $|\det J|$ to span a huge dynamic range. This causes numerical instability (very large or very small log-Jacobians) and hard optimization; moreover, because the map is a continuous bijection it cannot "tear" space, so a thin bridge of spurious probability always connects the modes. GANs, whose generator need not be invertible, and EBMs, which can carve separate energy basins, tend to represent well-separated modes more easily; the invertibility and continuity constraint is a penalty specific to flows.

Looking ahead

Normalizing flows provide exact likelihood through invertible architectures constrained to have tractable Jacobians. The next model family achieves even higher sample quality by abandoning invertibility and instead defining distributions through a learned denoising process.

Week 6: Denoising Diffusion Probabilistic Models. We derive the DDPM forward process (data → noise) and reverse process (noise → data), show that the variational training objective simplifies to the noise-prediction loss $\mathcal{L}_\text{simple}$, and connect DDPM to SDE/ODE formulations that enable accelerated sampling.
