Week 7: Flow Matching and Consistency Models
✦Learning Outcomes
  • Compare flow matching to DDPM and explain why rectified flows produce straighter trajectories
  • Implement consistency models and explain the distillation process from multi-step to single-step generation
  • Analyze tradeoffs between sample quality, inference speed, and training complexity
◆Prerequisites
  • Week 6: DDPM - Forward/reverse processes, noise prediction
  • Week 5: Normalizing Flows - Change of variables, continuous flows

Understanding of ODEs and numerical integration is helpful.

Purpose of this lecture

Flow matching (Lipman et al., 2022; Liu et al., 2022; Albergo and Vanden-Eijnden, 2022) reframes generative modeling as regression on a vector field rather than denoising at each noise level. This reformulation is simpler to derive, produces straighter trajectories (fewer function evaluations at inference), and generalizes naturally to any source-target pair rather than requiring a Gaussian source. Consistency models (Song et al., 2023) take a further step: instead of learning a multi-step process, they learn a function that maps any point on a diffusion trajectory directly to the clean sample in a single evaluation. Together, these methods represent the frontier of fast generative inference.


Vector fields and the continuity equation

A continuous normalizing flow defines a time-varying vector field $v_\theta(x, t): \mathcal{X} \times [0,1] \to \mathcal{X}$ that generates a flow $\phi_t$ satisfying:

$$\frac{d\phi_t(x)}{dt} = v_\theta(\phi_t(x), t), \quad \phi_0(x) = x$$

This lecture places data at $t = 0$ and the source at $t = 1$. Starting from $p_0 = p_\text{data}$, the map $\phi_t$ pushes $p_0$ forward to an intermediate density $p_t$, ending at the source distribution $p_1 = \mathcal{N}(0, I)$; sampling integrates the same ODE backward from $t = 1$ to $t = 0$. The density $p_t$ evolves according to the continuity equation:

$$\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v_t) = 0$$

The goal of flow matching is to learn $v_\theta$ such that $\phi_1$ transports $p_0$ to $p_1$. The key insight is that this can be done by regressing $v_\theta$ against a conditional vector field $u(x, t \mid x_0)$ that generates the correct marginal flow, without requiring the intractable marginal vector field $v_t(x)$.
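As a rough illustration of how a vector field generates a flow, the sketch below integrates $dx/dt = v(x, t)$ with forward Euler. The helper `euler_flow` and the field `toy_field` ($v(x,t) = -x$) are hypothetical choices, picked only because the exact flow $\phi_t(x) = x e^{-t}$ is known and can be compared against.

```python
import numpy as np

def euler_flow(v, x0, t0=0.0, t1=1.0, n_steps=1000):
    """Approximate phi_{t1}(x0) for dx/dt = v(x, t) using forward Euler."""
    x = np.asarray(x0, dtype=float).copy()
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        x = x + dt * v(x, t)   # one Euler step along the vector field
        t += dt
    return x

def toy_field(x, t):
    # Hypothetical field (not from the lecture): exact flow is x * exp(-t).
    return -x

x_start = np.array([1.0, -2.0, 0.5])
x_end = euler_flow(toy_field, x_start)
exact = x_start * np.exp(-1.0)
print(np.max(np.abs(x_end - exact)))  # discretization error of the Euler scheme
```

A trained $v_\theta$ would replace `toy_field`; for sampling under this lecture's time convention one would call the same routine with `t0=1.0, t1=0.0`.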


The flow matching objective

The marginal flow matching objective directly targets the marginal vector field:

$$\mathcal{L}_\text{MFM}(\theta) = \mathbb{E}_{t,\, x \sim p_t}\!\left[\|v_\theta(x, t) - v_t(x)\|^2\right]$$

where $v_t(x) = \mathbb{E}[u(x, t \mid x_0) \mid x_t = x]$ is the marginal vector field, averaged over all data points that could have produced $x_t$. This expectation is intractable because it requires the posterior over $x_0$ given $x_t$, which depends on the unknown data distribution.

The conditional flow matching (CFM) objective bypasses this by regressing against conditional vector fields conditioned on individual data points:

$$\mathcal{L}_\text{CFM}(\theta) = \mathbb{E}_{t,\, x_0,\, x_t \sim p_t(\cdot \mid x_0)}\!\left[\|v_\theta(x_t, t) - u(x_t, t \mid x_0)\|^2\right]$$

This is tractable because $u(x_t, t \mid x_0)$ can be computed analytically for any choice of interpolation path. The CFM and MFM objectives differ only by a constant that does not depend on $\theta$, so they have the same gradient with respect to $\theta$: regressing on conditional vector fields gives an unbiased stochastic estimate of the marginal objective's gradient. Each training step proceeds as follows:

  1. Draw $x_0 \sim p_\text{data}$.
  2. Draw $x_1 \sim \mathcal{N}(0,I)$.
  3. Draw $t \sim \mathcal{U}[0,1]$, interpolate to get $x_t = \psi_t(x_1, x_0)$, and compute the target $u_t = d\psi_t/dt$.
  4. Minimize the squared error $\|v_\theta(x_t, t) - u_t\|^2$.
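The equal-gradient claim can be checked in a few lines. What follows is a sketch of the standard argument, assuming enough regularity to exchange expectation and gradient; $C_1$ and $C_2$ are shorthand introduced here for terms that do not depend on $\theta$.

```latex
\begin{aligned}
\mathcal{L}_\text{MFM}(\theta)
  &= \mathbb{E}_{t, x_t}\!\left[\|v_\theta(x_t,t)\|^2
     - 2\langle v_\theta(x_t,t),\, v_t(x_t)\rangle\right] + C_1, \\
\mathcal{L}_\text{CFM}(\theta)
  &= \mathbb{E}_{t, x_0, x_t}\!\left[\|v_\theta(x_t,t)\|^2
     - 2\langle v_\theta(x_t,t),\, u(x_t,t \mid x_0)\rangle\right] + C_2, \\
\mathbb{E}_{x_0, x_t}\!\left[\langle v_\theta(x_t,t),\, u(x_t,t \mid x_0)\rangle\right]
  &= \mathbb{E}_{x_t}\!\left[\langle v_\theta(x_t,t),\,
     \mathbb{E}[u(x_t,t \mid x_0) \mid x_t]\rangle\right]
   = \mathbb{E}_{x_t}\!\left[\langle v_\theta(x_t,t),\, v_t(x_t)\rangle\right].
\end{aligned}
```

The quadratic terms agree because $x_t$ has marginal $p_t$ in both expectations, and the cross terms agree by the tower property in the last line. Hence $\mathcal{L}_\text{CFM} - \mathcal{L}_\text{MFM} = C_2 - C_1$ and $\nabla_\theta \mathcal{L}_\text{CFM} = \nabla_\theta \mathcal{L}_\text{MFM}$.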


Linear interpolation: rectified flow

Rectified flow (Liu et al., 2022) chooses the simplest possible interpolation: a straight line between $x_0 \sim p_\text{data}$ and $x_1 \sim \mathcal{N}(0,I)$:

$$x_t = (1 - t) x_0 + t x_1, \quad t \in [0, 1]$$

The conditional vector field for this interpolation is simply the constant velocity:

$$u(x_t, t \mid x_0, x_1) = x_1 - x_0$$

The flow matching objective becomes:

$$\mathcal{L}_\text{RF}(\theta) = \mathbb{E}_{t, x_0, x_1}\!\left[\|v_\theta(x_t, t) - (x_1 - x_0)\|^2\right]$$

This is even simpler than DDPM's noise prediction objective: no noise schedule, no signal-to-noise weighting, no Markov chain, just regression on straight-line velocities. The marginal probability path is:

$$p_t(x) = \int p_\text{data}(x_0)\, \mathcal{N}(x; (1-t)x_0, t^2 I)\, dx_0$$
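The rectified flow objective above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a full training loop: `rectified_flow_batch` and `rf_loss` are hypothetical helper names, the toy data is an arbitrary Gaussian blob, and the network $v_\theta$ is left out (any function returning an array of the same shape as the target could be plugged in).

```python
import numpy as np

def rectified_flow_batch(x0, rng):
    """Build (x_t, t, target) for one batch of data points x0 with shape (B, D)."""
    batch = x0.shape[0]
    x1 = rng.standard_normal(x0.shape)   # source sample x1 ~ N(0, I)
    t = rng.uniform(size=(batch, 1))     # one time per example, t ~ U[0, 1)
    xt = (1.0 - t) * x0 + t * x1         # straight-line interpolation
    target = x1 - x0                     # constant velocity target
    return xt, t, target

def rf_loss(v_pred, target):
    """Mean over the batch of the squared error ||v_pred - target||^2."""
    return np.mean(np.sum((v_pred - target) ** 2, axis=1))

rng = np.random.default_rng(0)
# Hypothetical toy data: a small 2D Gaussian blob standing in for p_data.
x0 = 0.1 * rng.standard_normal((512, 2)) + np.array([2.0, 0.0])
xt, t, target = rectified_flow_batch(x0, rng)

zero_model_loss = rf_loss(np.zeros_like(target), target)  # baseline: predict v = 0
oracle_loss = rf_loss(target, target)                     # perfect prediction
```

Because the target is constant along each conditional path, stepping back from `xt` by `t * target` recovers `x0` exactly, which is a convenient sanity check on the batch construction.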

Each conditional path is a straight line, but the sampled trajectories follow the marginal field, which averages over all pairs $(x_0, x_1)$ passing through a point. Rectified flow as defined above pairs $x_0$ and $x_1$ with an independent coupling, which in general is not the optimal transport plan (the coupling $\pi(x_0, x_1)$ that minimizes the expected squared transport distance). Real trajectories will therefore not be perfectly straight in general, but they tend toward straight when the coupling is near-optimal.
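To see the effect of an independent coupling concretely, consider a toy 1D target that puts equal mass on $\pm A$ (an assumption made for this sketch, not an example from the lecture). For $x_t = (1-t)x_0 + t x_1$ with $x_1 \sim \mathcal{N}(0,1)$, the marginal field has the closed form $v_t(x) = (x - \mathbb{E}[x_0 \mid x_t = x])/t$ with $\mathbb{E}[x_0 \mid x_t = x] = A \tanh(A(1-t)x/t^2)$. The code integrates this exact field backward in time with Euler steps.

```python
import numpy as np

A = 2.0  # hypothetical toy target: x0 = +A or -A with equal probability

def marginal_velocity(x, t):
    """Exact v_t(x) = E[x1 - x0 | x_t = x] under the independent coupling."""
    post_mean_x0 = A * np.tanh(A * (1.0 - t) * x / t**2)
    return (x - post_mean_x0) / t

def sample(n_steps, n_samples=2000, seed=0):
    """Integrate from the source (t = 1) back to data (t = 0) with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t_hi, t_lo in zip(ts[:-1], ts[1:]):
        # The field is only evaluated at t_hi > 0, so there is no division by zero.
        x = x - (t_hi - t_lo) * marginal_velocity(x, t_hi)
    return x

one_step = sample(1)      # a single Euler step from t = 1 to t = 0
many_steps = sample(50)   # a fine discretization of the same ODE
```

With one step every sample lands on 0, the mean of the two modes, because at $t = 1$ the averaged field carries no information about which mode to move toward. With 50 steps the samples settle near $\pm A$. The gap between the two is the cost of trajectory curvature.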

Reflow takes a trained rectified flow, generates (noise, sample) pairs by integrating the learned ODE from noise to data, and trains a new rectified flow on these pairs. This makes trajectories straighter because each noise sample is now paired deterministically with the sample it flows to, which reduces the path crossings that cause curvature. After one reflow iteration, sampling often works with very few steps (in favorable cases as few as 1).


Optimal transport coupling

OT-CFM (Tong et al., 2023) improves flow matching by pairing source and data samples with an optimal transport coupling rather than an independent one. The OT plan $\pi^*(x_0, x_1)$ minimizes the expected squared transport distance $\mathbb{E}_\pi[\|x_0 - x_1\|^2]$. Each conditional path $x_t = (1-t)x_0 + t x_1$ is a straight line under any coupling; when the coupling is near-OT these lines rarely cross, so the marginal trajectories are also approximately straight and fewer integration steps are needed at inference.

The mini-batch OT approximation computes the OT plan within each batch using the Sinkhorn algorithm, yielding straighter trajectories with little additional overhead. OT-CFM is reported to reach sample quality comparable to DDPM with roughly 5–10$\times$ fewer function evaluations.
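A quick way to see how much transport cost an independent coupling leaves on the table is a 1D example, where the optimal coupling for squared cost between two equal-size samples is obtained by sorting both. The sketch below assumes a bimodal toy target and a Gaussian source; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical bimodal data and Gaussian source samples.
modes = rng.choice([-2.0, 2.0], size=n)
x0 = modes + 0.3 * rng.standard_normal(n)
x1 = rng.standard_normal(n)

# Independent coupling: pair samples in the arbitrary order they were drawn.
cost_independent = np.mean((x0 - x1) ** 2)

# 1D optimal coupling for squared cost: match order statistics.
cost_sorted = np.mean((np.sort(x0) - np.sort(x1)) ** 2)

print(cost_independent, cost_sorted)
```

Sorting never increases the cost (by the rearrangement inequality), and in this example it also removes pairings that cross the origin, which are the ones that bend the marginal trajectories.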


Mini-batch Sinkhorn: the computational implementation

The Sinkhorn algorithm (Cuturi, 2013) solves the regularized optimal transport problem efficiently through iterative scaling:

$$\pi^*_\varepsilon = \arg\min_{\pi \in \Pi(\mu,\nu)} \langle C, \pi \rangle - \varepsilon H(\pi)$$

where $C_{ij} = \|x_i - y_j\|^2$ is the pairwise cost matrix, $H(\pi) = -\sum_{ij}\pi_{ij}\log\pi_{ij}$ is the entropy of the plan (subtracting $\varepsilon H(\pi)$ rewards spread-out plans, and $\varepsilon > 0$ controls smoothness), and $\Pi(\mu,\nu)$ is the set of couplings with marginals $\mu$ and $\nu$.

The Sinkhorn iterations alternately rescale the rows and columns of the Gibbs kernel $K_{ij} = \exp(-C_{ij}/\varepsilon)$ (all divisions are elementwise):

  1. Initialize $u = \mathbf{1}_m$ (row scaling)
  2. Repeat: $v \leftarrow \nu / (K^\top u)$ and $u \leftarrow \mu / (Kv)$, with $\mu = \mathbf{1}_m / m$ and $\nu = \mathbf{1}_n / n$ for uniform mini-batch marginals
  3. Recover the transport plan: $\pi^* = \text{diag}(u) K \text{diag}(v)$
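The iteration above can be sketched in a few lines of NumPy. This is a minimal version under stated assumptions: uniform marginals, a fixed iteration count instead of a convergence test, and a moderate `eps` so that the kernel does not underflow (a log-domain variant would be needed for small `eps`). The function name `sinkhorn_plan` and the toy inputs are hypothetical.

```python
import numpy as np

def sinkhorn_plan(x, y, eps=2.0, n_iters=500):
    """Entropy-regularized OT plan between point sets x (m, d) and y (n, d)."""
    m, n = x.shape[0], y.shape[0]
    mu = np.full(m, 1.0 / m)   # uniform row marginal
    nu = np.full(n, 1.0 / n)   # uniform column marginal
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-cost / eps)   # Gibbs kernel K
    u = np.ones(m)
    for _ in range(n_iters):
        v = nu / (kernel.T @ u)    # rescale columns to match nu
        u = mu / (kernel @ v)      # rescale rows to match mu
    return u[:, None] * kernel * v[None, :]   # diag(u) K diag(v)

rng = np.random.default_rng(0)
data = rng.standard_normal((32, 2))    # stand-in for a batch of data samples
noise = rng.standard_normal((32, 2))   # stand-in for a batch of noise samples
plan = sinkhorn_plan(data, noise)
```

Because the loop ends with a row update, the row sums match `mu` up to floating-point error, while the column sums match `nu` only up to the remaining Sinkhorn residual.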

Theoretical behavior: As $\varepsilon \to 0$, the Sinkhorn plan converges to the unregularized (true) OT plan. Larger $\varepsilon$ yields smoother, more diffuse couplings and faster, more stable iterations, at the price of higher transport cost; as $\varepsilon \to \infty$ the plan approaches the independent coupling and the straightening benefit for flow matching disappears. The entropy term keeps every entry of the plan strictly positive instead of concentrating mass on a sparse one-to-one matching, spreading the coupling smoothly across the cost landscape.

Mini-batch implementation: Within each batch of size $B$, compute the mini-batch Sinkhorn plan $\pi^* \in \mathbb{R}^{B \times B}$ between the $B$ data samples and $B$ noise samples, then draw training pairs $(x_i, y_j)$ with probability proportional to $\pi^*_{ij}$. As $\varepsilon \to 0$ with uniform marginals, the plan concentrates on a permutation $\sigma$ and each data sample $x_i$ is simply assigned the noise sample $y_{\sigma(i)}$. Each Sinkhorn iteration costs $\mathcal{O}(B^2)$ (plus a one-time $\mathcal{O}(B^2 d)$ for the cost matrix in dimension $d$), which is tractable for typical batch sizes ($B = 128$–$512$).
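One way to turn a plan into training pairs is to sample index pairs with probability proportional to the plan entries. The sketch below assumes this sampling scheme; `sample_pairs` is a hypothetical helper, and the plan used here is a hand-built one-to-one matching rather than the output of Sinkhorn, so that the result is easy to inspect.

```python
import numpy as np

def sample_pairs(plan, n_pairs, rng):
    """Draw index pairs (i, j) with probability proportional to plan[i, j]."""
    probs = plan.ravel() / plan.sum()
    flat = rng.choice(plan.size, size=n_pairs, p=probs)
    return np.unravel_index(flat, plan.shape)

rng = np.random.default_rng(0)
B = 4
plan = np.eye(B) / B   # illustrative plan: data sample i matched to noise sample i

rows, cols = sample_pairs(plan, B, rng)
data = rng.standard_normal((B, 2))
noise = rng.standard_normal((B, 2))
x0, x1 = data[rows], noise[cols]   # coupled pairs fed to the interpolation
```

Sampling with replacement means some batch elements may appear more than once and others not at all; with a near-permutation plan this has little effect, and it keeps the marginals correct in expectation.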

Wall-clock overhead: Sinkhorn typically converges in 100–200 iterations (with early stopping on the dual variable residual). For $B=256$, this typically adds 5–10% to wall-clock training time. The benefit is substantial: the OT coupling lowers the variance of the regression targets, and the resulting straighter trajectories reduce the number of ODE function evaluations at inference from $\sim 50$ (DDPM) to $\sim 10$ (OT-CFM) and enable one-step generation after reflow.


Stochastic interpolants and general probability paths

Flow matching and DDPM are both special cases of the broader stochastic interpolant framework (Albergo et al., 2023), which parameterizes the path from source to data as:

$$x_t = \alpha(t) x_0 + \beta(t) x_1 + \gamma(t) \xi, \quad t \in [0, 1]$$

where $x_0 \sim p_\text{data}$, $x_1 \sim p_\text{source}$, $\xi \sim \mathcal{N}(0,I)$ is independent Gaussian noise, and $(\alpha, \beta, \gamma)$ are time-dependent coefficients satisfying boundary conditions:

  • At $t=0$: $\alpha(0)=1, \beta(0)=0, \gamma(0)=0$ (start at clean data)
  • At $t=1$: $\alpha(1)=0, \beta(1)=1, \gamma(1)=0$ (end at source)

Different interpolants correspond to different modeling choices:

  1. Rectified flow: $\alpha(t) = 1-t$, $\beta(t) = t$, $\gamma(t) = 0$. Linear interpolation with no independent noise; produces constant velocity targets $u_t = x_1 - x_0$.

  2. DDPM: $\alpha(t) = \sqrt{\bar\alpha_t}$, $\beta(t) = 0$, $\gamma(t) = \sqrt{1-\bar\alpha_t}$. No interpolation term ($\beta=0$); the Gaussian source enters through $\xi$ rather than through $x_1$, so the $t=1$ boundary condition becomes $\gamma(1) \approx 1$ (equivalently, identify $x_1$ with $\xi$ and set $\beta(t) = \sqrt{1-\bar\alpha_t}$, $\gamma(t) = 0$). This is the variance-preserving scaling from the DDPM paper.

  3. Trigonometric interpolant: $\alpha(t) = \cos(\pi t/2)$, $\beta(t) = \sin(\pi t/2)$, $\gamma(t) = 0$. Smooth trigonometric interpolation with $\alpha(t)^2 + \beta(t)^2 = 1$, so the path keeps total variance constant when $x_0$ and $x_1$ are independent with unit variance. Used in some recent models for improved sample quality.

The conditional vector field for a general interpolant is:

$$u(x_t, t \mid x_0, x_1, \xi) = \alpha'(t) x_0 + \beta'(t) x_1 + \gamma'(t) \xi$$

where primes denote time derivatives. The flow matching objective remains:

$$\mathcal{L}_\text{CFM}(\theta) = \mathbb{E}_{t, x_0, x_1, \xi}\!\left[\|v_\theta(x_t, t) - u(x_t, t \mid x_0, x_1, \xi)\|^2\right]$$

This unifying view shows that flow matching, denoising diffusion, and other variants are all instances of the same core principle: regression on a conditional velocity field. The choice of interpolant affects the geometry of trajectories (straight vs. curved), the variance at different noise levels, and the form of the target vector field, but the training procedure is the same regression in every case.
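The general interpolant and its velocity target can be sketched with coefficient functions that return both the coefficients and their time derivatives. The function names below are hypothetical; the two schedules are the rectified flow and trigonometric interpolants listed above.

```python
import numpy as np

def rectified(t):
    """(alpha, beta, gamma) and their time derivatives for rectified flow."""
    return (1.0 - t, t, 0.0), (-1.0, 1.0, 0.0)

def trigonometric(t):
    """(alpha, beta, gamma) and their time derivatives for the trig interpolant."""
    w = 0.5 * np.pi
    return (np.cos(w * t), np.sin(w * t), 0.0), (-w * np.sin(w * t), w * np.cos(w * t), 0.0)

def interpolate(coeffs, t, x0, x1, xi):
    """Return the point x_t on the path and the conditional velocity target u_t."""
    (a, b, g), (da, db, dg) = coeffs(t)
    xt = a * x0 + b * x1 + g * xi
    ut = da * x0 + db * x1 + dg * xi
    return xt, ut

rng = np.random.default_rng(0)
x0, x1, xi = rng.standard_normal((3, 5))   # three independent 5-dimensional vectors

_, u_rect = interpolate(rectified, 0.3, x0, x1, xi)
_, u_trig = interpolate(trigonometric, 0.3, x0, x1, xi)

# Central finite difference of the trig path, to compare against the analytic target.
h = 1e-5
x_plus, _ = interpolate(trigonometric, 0.3 + h, x0, x1, xi)
x_minus, _ = interpolate(trigonometric, 0.3 - h, x0, x1, xi)
u_trig_fd = (x_plus - x_minus) / (2 * h)
```

Swapping the interpolant changes only the `coeffs` argument; the regression loss on `ut` is unchanged, which is the point of the unifying view.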


Consistency models

Consistency models (Song et al., 2023) learn a function $f_\theta(x_t, t)$ with the consistency property: for any two points $x_t$ and $x_{t'}$ on the same probability flow (PF) ODE trajectory, $f_\theta(x_t, t) = f_\theta(x_{t'}, t') = x_0$, so the function maps every trajectory point to the same clean sample. A consistency model can therefore generate high-quality samples in a single step: draw $x_T \sim \mathcal{N}(0,I)$ and compute $x_0 = f_\theta(x_T, T)$.

Consistency distillation (CD) trains the consistency model to satisfy the consistency property by minimizing:

$$\mathcal{L}_\text{CD}(\theta, \theta^-) = \mathbb{E}\!\left[d\!\left(f_\theta(x_{t+\Delta t}, t + \Delta t),\, f_{\theta^-}(x_t^{\phi}, t)\right)\right]$$

where $x_t^{\phi}$ is obtained by running one ODE solver step from $x_{t+\Delta t}$ using the pretrained teacher diffusion model $\phi$, $d(\cdot, \cdot)$ is a distance metric (LPIPS perceptual distance works well), and $\theta^-$ is an exponential moving average of $\theta$ (the target network parameters, distinct from the teacher). CD requires a pretrained diffusion model (the teacher ODE solver) and produces a one-step model.

Consistency training (CT) trains the consistency model without a pretrained teacher: instead of taking a teacher ODE step, it noises the same data point $x_0$ with the same Gaussian noise at the two adjacent levels $t$ and $t + \Delta t$, so the forward process alone provides the pair. CT is typically less stable than CD but allows training from scratch.

Progressive time discretization in consistency training: CT uses a discretization of the time interval $[0, T]$ into $N$ timesteps. Rather than using a fixed $N$ throughout training, the schedule $N(k) = \min(s_0 + \lfloor k / k_0 \rfloor, s_1)$ starts with a coarse discretization (small $N$, large $\Delta t$) and progressively refines it as training progresses (increasing $N$, reducing $\Delta t$). Here $k$ is the training iteration and $(s_0, s_1, k_0)$ are hyperparameters (e.g., $s_0=2, s_1=150, k_0=400$).
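The schedule is a one-liner in code. The sketch below uses the example hyperparameters quoted above as defaults; the function name `num_timesteps` is hypothetical.

```python
def num_timesteps(k, s0=2, s1=150, k0=400):
    """N(k) = min(s0 + floor(k / k0), s1) for training iteration k >= 0."""
    return min(s0 + k // k0, s1)

# With these defaults N grows by one every k0 = 400 iterations and reaches
# the cap s1 once s0 + k // k0 >= s1, i.e. from k = (s1 - s0) * k0 onward.
first_capped_iteration = (150 - 2) * 400
schedule_preview = [num_timesteps(k) for k in (0, 399, 400, 4000, first_capped_iteration)]
print(schedule_preview)
```

With uniform spacing, the step size at iteration `k` would be `T / num_timesteps(k)`; non-uniform spacings are also possible and are not specified here.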

The intuition is that coarse discretization provides a strong global consistency signal: the learned function must output the same $x_0$ for very different noise levels, which discourages mode collapse and enforces large-scale structure. As training refines the discretization, the consistency signal shifts to local consistency (adjacent points on the trajectory agree), fine-tuning the mapping near the data manifold. This curriculum is intended to keep the model out of poor local minima and helps CT approach CD's sample quality without a pretrained teacher.

Multi-step generation with consistency models: draw $x_T^{(0)} \sim \mathcal{N}(0,I)$, compute $x_0^{(0)} = f_\theta(x_T^{(0)}, T)$, add noise to get $x_{t_1}^{(1)} = \sqrt{\bar\alpha_{t_1}}\, x_0^{(0)} + \sqrt{1-\bar\alpha_{t_1}}\, \epsilon$, apply $f_\theta$ again, and repeat. This stochastic refinement scheme enables a quality-speed tradeoff from 1 to $\sim 4$ steps.
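The multi-step loop can be sketched on a toy problem where the consistency function is known in closed form. The assumptions here are loud ones: the data is 1D Gaussian $\mathcal{N}(\mu, s^2)$, the schedule `abar` is a hypothetical cosine-squared choice with $\bar\alpha(0) = 1$ and $\bar\alpha(1) = 0$, and `f_exact` stands in for a trained $f_\theta$. For 1D Gaussian data the PF ODE preserves the standardized value $(x_t - m_t)/\sigma_t$, which gives the formula used below.

```python
import numpy as np

MU, S = 3.0, 0.5   # toy data distribution N(MU, S^2), an assumption of this sketch

def abar(t):
    """Hypothetical signal schedule with abar(0) = 1 and abar(1) = 0."""
    return np.cos(0.5 * np.pi * t) ** 2

def f_exact(x, t):
    """Closed-form consistency function for the toy Gaussian (stand-in for f_theta)."""
    a = abar(t)
    return MU + S * (x - np.sqrt(a) * MU) / np.sqrt(a * S**2 + 1.0 - a)

def multistep_sample(f, times, n, rng):
    """times: decreasing noise levels in (0, 1], starting at T = 1."""
    x = rng.standard_normal(n)        # x_T ~ N(0, 1)
    x0 = f(x, times[0])               # one-step prediction
    for t in times[1:]:
        eps = rng.standard_normal(n)
        xt = np.sqrt(abar(t)) * x0 + np.sqrt(1.0 - abar(t)) * eps   # re-noise
        x0 = f(xt, t)                 # map back to a clean sample
    return x0

rng = np.random.default_rng(0)
samples = multistep_sample(f_exact, [1.0, 0.6, 0.3], 20000, rng)
print(samples.mean(), samples.std())
```

With an exact consistency function every refinement step leaves the sample distribution unchanged, so extra steps only matter when $f_\theta$ is imperfect, which is the situation the quality-speed tradeoff refers to.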


Cross-course connections: Flow matching across generative, RL, robotics, and vision domains

Flow matching concepts extend far beyond image generation. The table below maps core ideas from this week across all four GenAI courses:

| Concept | Course 3 (Generative Models) | Course 1 (RL) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| OT coupling | Pairs noise and data for straight trajectories in diffusion space | Distributional RL (e.g., IQN, QR-DQN): Wasserstein distance on return distributions; optimal coupling of Q-functions | OT-CFM diffusion policy (Week 9): pairs noise with robot action sequences to minimize trajectory cost; enables 50 Hz control | Multimodal alignment: optimal coupling between image regions and text token embeddings for cross-modal matching |
| Probability path $p_t$ | Marginal density at time $t$; continuity equation governs density evolution during generation | State visitation distribution $\rho^\pi(s)$ under a policy; steady-state distribution in value iteration | Distribution over robot state-action trajectories; path integral structure relates to energy-based planning | Feature distribution shift during CLIP contrastive learning; alignment space between vision and language |
| One-step inference | Consistency model: $x_0 = f_\theta(x_T, T)$ (single network call) | One-step model-based planning: $a^* = \pi(s)$ using a single forward pass (vs. tree search) | 50 Hz real-time control requires ≤1 network call per action; consistency models achieve this directly | Fast CLIP retrieval: one forward pass per image-text pair for similarity matching (no iterative refinement) |
| Source distribution | Standard Gaussian $\mathcal{N}(0,I)$ for unconditional generation | Initial state distribution $d_0(s)$; uniform over task starts | Distribution of robot initial configurations and object poses; can be learned from demonstrations | Uniform distribution over image patches in self-supervised vision pre-training |

Bridging deployment across disciplines: Flow matching matters most where inference must run in near real time. In robotics (Course 2 Week 9), diffusion policies based on OT-CFM were a breakthrough because robot control demands 50 Hz action generation, far faster than DDPM's 20–1000 step inference allows. The OT coupling keeps trajectories nearly straight, reducing integration to 5–10 steps and making real-time control viable. Consistency distillation parallels fast policy distillation in robotics (Course 2 Week 11): training a large imitation learning model and distilling it into a reactive, low-latency policy. In RL (Course 1), Wasserstein losses on distributional value functions use OT couplings to compare return distributions, drawing on the same geometric insight that makes straight trajectories cheap to sample. In vision-language models (Course 4), optimal transport can be used to align image-region and text-token features, a coupling-based complement to the pairwise comparisons of CLIP-style contrastive learning.


Key takeaways

  • Flow matching trains a vector field $v_\theta(x, t)$ to transport the source distribution to the data distribution.
  • The conditional flow matching objective has the same gradient as the marginal flow matching objective and is tractable because it conditions on individual data points.
  • Rectified flow uses linear interpolation, producing constant velocity targets $u_t = x_1 - x_0$ and approximately straight trajectories.
  • OT-CFM uses a mini-batch optimal transport coupling to make trajectories straighter, reducing the number of inference steps.
  • The Sinkhorn algorithm solves regularized OT within each batch at $\mathcal{O}(B^2)$ cost per iteration, with modest wall-clock overhead.
  • Stochastic interpolants unify flow matching and DDPM as special cases of a general probability path framework.
  • Consistency models learn the endpoint mapping $f_\theta(x_t, t) = x_0$ along a diffusion trajectory, enabling single-step generation; consistency distillation trains the model against a teacher ODE solver, while consistency training uses progressive time discretization for training from scratch.


Conceptual questions

  1. In rectified flow, the target vector field at each timestep is $u_t = x_1 - x_0$ (a constant). Show that for this vector field, the probability flow ODE $dx/dt = v_t(x)$ integrates to a straight line if $v_t$ exactly equals $u_t$ everywhere. Then explain why the learned $v_\theta$ will not produce perfectly straight trajectories in practice, even after training. Specifically, identify the source of trajectory curvature that arises from the averaging in the marginal vector field.

  2. OT-CFM minimizes the expected transport cost $\mathbb{E}[\|x_0 - x_1\|^2]$ when coupling data and noise. Standard CFM uses an independent coupling. Construct a 1D example with $p_\text{data}$ as a bimodal distribution (two Gaussians) and $p_\text{source} = \mathcal{N}(0,1)$ where the OT coupling produces qualitatively straighter trajectories than the independent coupling. Explain why straighter trajectories require fewer ODE integration steps.

  3. A consistency model trained with distillation from a DDPM teacher must satisfy $f_\theta(x_t, t) \approx f_\theta(x_{t+\Delta t}, t + \Delta t)$ along ODE trajectories. If $\Delta t$ is chosen very small, the training signal becomes noisy; if $\Delta t$ is large, the bootstrap target $f_{\theta^-}(x_t^{\phi}, t)$ may be inaccurate. Derive the optimal $\Delta t$ schedule that balances these competing errors, and explain how the progressive time discretization used in consistency training (increasing the number of discretization steps during training) manages this tradeoff.

  4. Multi-step consistency model generation adds noise to the predicted $x_0$ before applying the consistency function again. Show that this is equivalent to a short diffusion process starting from the predicted $x_0$ rather than from noise. What error accumulates across multiple refinement steps if the initial one-step prediction $f_\theta(x_T, T)$ is slightly incorrect?

  5. Flow matching can use any source distribution, not just $\mathcal{N}(0, I)$. Describe a robotics application where the source distribution should be a learned distribution over previous robot states rather than a Gaussian. What computational modification to the flow matching training loop is required, and how does this compare to conditioning the flow on state information?

✦Solutions
  1. If $v_t \equiv u_t = x_1 - x_0$ (constant in $x, t$), then $dx/dt$ is constant and integrates to $x(t) = x_0 + t(x_1 - x_0)$, a straight line. In practice $v_\theta$ learns the marginal field $v_t(x) = \mathbb{E}[x_1 - x_0 \mid x_t = x]$, an average over every endpoint pair passing through $x$. That conditional expectation varies with $x, t$, so where trajectories from different endpoints cross, the averaged field bends: the curvature comes from the averaging in the marginal field (severe under an independent coupling).
  2. Take $p_\text{data} = \tfrac12\mathcal{N}(-a, \sigma^2) + \tfrac12\mathcal{N}(+a, \sigma^2)$ and $p_\text{source} = \mathcal{N}(0,1)$. The independent coupling pairs a noise sample with a random mode, so many trajectories cross the origin (e.g. positive noise → $-a$ data), producing long, crossing, curved marginal paths. The OT coupling matches each noise sample to the nearer mode (positive ↔ $+a$, negative ↔ $-a$), giving non-crossing, near-straight paths. Straighter paths have nearly constant velocity, so Euler integration with few steps is accurate (a straight line is integrated exactly in one step); curved paths need many small steps.
  3. Small $\Delta t$: the bootstrap signal $d(f(x_{t+\Delta t}), f(x_t)) \to 0$ and is swamped by estimation/numerical noise (high variance). Large $\Delta t$: the single ODE-step target $x_t^\phi$ is inaccurate (discretization bias $\sim O(\Delta t^p)$). The optimal $\Delta t$ balances variance ($\downarrow$ with $\Delta t$) against bias ($\uparrow$ with $\Delta t$). Progressive discretization starts coarse (large $\Delta t$: stable, low-variance global signal) and refines (small $\Delta t$: accurate local targets) as the model improves, annealing the bias–variance tradeoff over training.
  4. The update $x_{t_1} = \sqrt{\bar\alpha_{t_1}}\,\hat x_0 + \sqrt{1-\bar\alpha_{t_1}}\,\epsilon$ is exactly the DDPM forward process applied to the predicted $\hat x_0$, i.e. a short diffusion starting from $\hat x_0$ rather than from pure noise. If the one-step prediction $f_\theta(x_T, T)$ is slightly wrong, re-noising sits near that biased $\hat x_0$ and each subsequent $f_\theta$ maps back toward it, so the bias persists; added noise gives only partial correction toward the manifold, leaving bounded but non-vanishing error growth across refinement steps.
  5. Receding-horizon control: the source should be the (learned/empirical) distribution over the previous action chunk or robot state, so successive plans are temporally consistent (warm-started) rather than starting from unstructured Gaussian noise. Modification: replace $x_1 \sim \mathcal{N}(0,I)$ with samples from that source and pair them with target actions; the interpolation and target $u_t = x_1 - x_0$ are unchanged. Versus conditioning on state (keep a Gaussian source, feed state as an extra input to $v_\theta$): conditioning is more general and easily amortized across states, while changing the source shortens transport when source already resembles target (faster sampling) but is harder to amortize.

Looking ahead

Unconditional generative models produce samples from a learned distribution without control over the output. Deploying these models requires mechanisms to steer generation toward specific targets.

Week 8: Conditioning and Control. We derive classifier guidance and classifier-free guidance, examine cross-attention as the mechanism for text conditioning, analyze ControlNet's architectural approach to structural conditioning, and assess CLIP embeddings as the shared semantic space connecting text and image generation.


Further reading

  • Lipman, Y., et al. (2022). Flow Matching for Generative Modeling. ICLR. (Conditional flow matching framework).
  • Albergo, M. S., & Vanden-Eijnden, E. (2022). Building Normalizing Flows with Stochastic Interpolants. ICLR.
  • Song, Y., et al. (2023). Consistency Models. ICML. (Single-step diffusion generation).