Purpose of this lecture
Flow matching (Lipman et al., 2022; Liu et al., 2022; Albergo and Vanden-Eijnden, 2022) reframes generative modeling as regression on a vector field rather than denoising at each noise level. This reformulation is simpler to derive, produces straighter trajectories (fewer function evaluations at inference), and generalizes naturally to any source-target pair rather than requiring a Gaussian source. Consistency models (Song et al., 2023) take a further step: instead of learning a multi-step process, they learn a function that maps any point on a diffusion trajectory directly to the clean sample in a single evaluation. Together, these methods represent the frontier of fast generative inference.
Vector fields and the continuity equation
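A continuous normalizing flow defines a time-varying vector field $v_\theta(x, t)$ that generates a flow $\phi_t$ satisfying:
$$\frac{d}{dt}\phi_t(x) = v_\theta(\phi_t(x), t), \qquad \phi_0(x) = x.$$
If $\phi_t$ pushes forward the source distribution $p_0$ to $p_t$, then $\phi_1$ pushes $p_0$ forward to the data distribution $p_1$. The probability density $p_t$ at time $t$ evolves according to the continuity equation:
$$\frac{\partial p_t(x)}{\partial t} + \nabla \cdot \big(p_t(x)\, v_\theta(x, t)\big) = 0.$$
The goal of flow matching is to learn $v_\theta$ such that $\phi_1$ transports $p_0$ to $p_1$. The key insight is that this can be done by regressing against a conditional vector field $u_t(x \mid x_1)$ that generates the correct marginal flow, without requiring the intractable marginal vector field $u_t(x)$.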
The flow matching objective
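The marginal flow matching (MFM) objective directly targets the marginal vector field:
$$\mathcal{L}_{\text{MFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x \sim p_t}\,\big\| v_\theta(x, t) - u_t(x) \big\|^2,$$
where $u_t(x)$ is the marginal vector field averaged over all data points $x_1$ that could have produced $x$. This target is intractable because it requires knowing the posterior over which $x_1$ generated $x$.
The conditional flow matching (CFM) objective bypasses this by regressing against conditional vector fields conditioned on individual data points:
$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim p_1,\; x \sim p_t(\cdot \mid x_1)}\,\big\| v_\theta(x, t) - u_t(x \mid x_1) \big\|^2.$$
This is tractable because $u_t(x \mid x_1)$ can be computed analytically for any choice of interpolation path. The CFM objective differs from MFM only by a constant that does not depend on $\theta$, so the two have the same gradient with respect to $\theta$: regressing on conditional vector fields gives an unbiased estimator of the marginal vector field gradient. Training proceeds in four steps: (1) draw a data point $x_1 \sim p_1$; (2) draw a source sample $x_0 \sim p_0$ and a time $t \sim \mathcal{U}[0,1]$; (3) interpolate to get $x_t$ and compute the target $u_t(x_t \mid x_1)$; (4) minimize the squared error.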
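One way to see why the two gradients agree is the following sketch. It assumes enough regularity to exchange the order of integration, and the notation ($p_t(x \mid x_1)$ for the conditional path, $p_1$ for the data distribution, $u_t$ for the target fields) is assumed here for illustration. The marginal field is the posterior average of the conditional fields:
$$u_t(x) = \int u_t(x \mid x_1)\, \frac{p_t(x \mid x_1)\, p_1(x_1)}{p_t(x)}\, dx_1.$$
Expanding both squared errors, the $\|v_\theta(x,t)\|^2$ terms coincide because both objectives draw $x$ from $p_t$, and the cross terms coincide because
$$\mathbb{E}_{x \sim p_t}\big\langle v_\theta(x,t),\, u_t(x) \big\rangle = \iint \big\langle v_\theta(x,t),\, u_t(x \mid x_1) \big\rangle\, p_t(x \mid x_1)\, p_1(x_1)\, dx_1\, dx = \mathbb{E}_{x_1 \sim p_1,\; x \sim p_t(\cdot \mid x_1)}\big\langle v_\theta(x,t),\, u_t(x \mid x_1) \big\rangle.$$
The remaining terms, $\mathbb{E}\|u_t(x)\|^2$ and $\mathbb{E}\|u_t(x \mid x_1)\|^2$, do not involve $\theta$, so the two objectives differ by a constant and have identical gradients.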
Linear interpolation: rectified flow
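Rectified flow (Liu et al., 2022) chooses the simplest possible interpolation: a straight line between $x_0 \sim p_0$ and $x_1 \sim p_1$:
$$x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \in [0, 1].$$
The conditional vector field for this interpolation is simply the constant velocity:
$$u_t(x_t \mid x_0, x_1) = \frac{d x_t}{dt} = x_1 - x_0.$$
The flow matching objective becomes:
$$\mathcal{L}_{\text{RF}}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}\,\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2.$$
This is even simpler than DDPM's noise prediction objective: no noise schedule, no signal-to-noise weighting, no Markov chain, just regression on straight-line velocities. With a standard Gaussian source $p_0 = \mathcal{N}(0, I)$, the marginal probability path is:
$$p_t(x) = \int \mathcal{N}\big(x;\; t\, x_1,\; (1 - t)^2 I\big)\, p_1(x_1)\, dx_1.$$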
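As a rough illustration of the training steps for the straight-line path, the following NumPy sketch builds one batch of regression pairs and evaluates the squared-error loss. The function names, the toy data, and the standard Gaussian source are assumptions made for this example; in practice the placeholder predictions would come from a neural network.

```python
import numpy as np

def rectified_flow_batch(x1, rng):
    """Build one training batch for the straight-line path.

    x1: array of shape (B, D) holding data samples.
    Returns (x_t, t, target) with target = x1 - x0.
    """
    B, D = x1.shape
    x0 = rng.standard_normal((B, D))        # source samples, x0 ~ N(0, I)
    t = rng.uniform(0.0, 1.0, size=(B, 1))  # one time per example
    x_t = (1.0 - t) * x0 + t * x1           # point on the straight line
    target = x1 - x0                        # constant conditional velocity
    return x_t, t, target

def flow_matching_loss(v_pred, target):
    """Squared error between predicted and target velocities, averaged over the batch."""
    return float(np.mean(np.sum((v_pred - target) ** 2, axis=1)))

rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 2)) + 3.0      # toy stand-in for data samples
x_t, t, target = rectified_flow_batch(x1, rng)

# A predictor that outputs the exact target has zero loss; an all-zero predictor does not.
loss_perfect = flow_matching_loss(target, target)
loss_zero = flow_matching_loss(np.zeros_like(target), target)
```

Moving from `x_t` along `target` for the remaining time `1 - t` lands exactly on `x1`, which is a quick sanity check that the interpolation and the velocity target are consistent.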
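Each conditional path is a straight line, but rectified flow pairs source and data samples with an independent coupling, which is generally not the optimal transport plan (the coupling that minimizes the expected squared displacement between paired points). Under the independent coupling, straight lines from different pairs cross, and the marginal vector field averages over all pairs passing through a point, so the learned trajectories will not be perfectly straight in general. They tend toward straight when the coupling is near-optimal.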
Reflow takes a trained rectified flow, generates (noise, sample) pairs by integrating the learned ODE from each noise sample to its generated sample, and trains a new rectified flow on these pairs. This makes trajectories straighter because each noise sample is now coupled deterministically to the point the previous flow transports it to, which gives the new training distribution a near-optimal transport structure. After one reflow iteration, sampling requires very few steps (often as few as 1).
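The pair-generation step of reflow can be sketched as follows, assuming a fixed-step Euler integrator and a hypothetical `toy_velocity` function standing in for a trained model. Neither the integrator choice nor the names come from the original method description.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps=50):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

def make_reflow_pairs(velocity, num_pairs, dim, rng, num_steps=50):
    """Draw noise, push it through the current flow, and return (noise, sample) pairs."""
    x0 = rng.standard_normal((num_pairs, dim))
    x1 = euler_sample(velocity, x0, num_steps)
    return x0, x1

def toy_velocity(x, t):
    """Hypothetical stand-in for a trained model: shifts every point by +2 over unit time."""
    return np.full_like(x, 2.0)

rng = np.random.default_rng(0)
z, x_gen = make_reflow_pairs(toy_velocity, num_pairs=4, dim=3, rng=rng)

# The next rectified flow would regress on x_gen - z along the line between z and x_gen.
new_targets = x_gen - z
```

Because the pairs are produced by a deterministic map, each noise sample has exactly one partner, which is what removes the crossing of conditional lines in the next round of training.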
Optimal transport coupling
OT-CFM (Tong et al., 2023) improves flow matching by using an optimal transport coupling between the source and data rather than an independent coupling. The OT plan minimizes the expected squared transport distance $\mathbb{E}_{(x_0, x_1) \sim \pi}\,\|x_1 - x_0\|^2$ over couplings $\pi$ of $p_0$ and $p_1$. When the coupling is near-OT, the conditional trajectories are approximately straight in expectation, meaning fewer integration steps are needed at inference.
The mini-batch OT approximation computes the OT plan within each batch using the Sinkhorn algorithm, yielding straighter trajectories with modest additional overhead. OT-CFM achieves comparable sample quality to DDPM with roughly $10\times$ fewer function evaluations.
Mini-batch Sinkhorn: the computational implementation
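The Sinkhorn algorithm (Cuturi, 2013) solves the entropy-regularized optimal transport problem efficiently through iterative scaling:
$$\pi_\varepsilon = \arg\min_{\pi \in \Pi(a, b)} \; \langle C, \pi \rangle - \varepsilon H(\pi), \qquad H(\pi) = -\sum_{i,j} \pi_{ij} \log \pi_{ij},$$
where $C$ is the pairwise cost matrix (here $C_{ij} = \|x_1^{(i)} - x_0^{(j)}\|^2$), $-\varepsilon H(\pi)$ is the entropy regularization (negative entropy; $\varepsilon > 0$ controls smoothness), and $\Pi(a, b)$ is the set of couplings with marginals $a$ and $b$.
The Sinkhorn iterations alternate between scaling row and column vectors using the Gibbs kernel $K = \exp(-C / \varepsilon)$ (applied elementwise):
- Initialize $u = \mathbf{1}$ (uniform row scaling)
- Repeat: $v \leftarrow b \oslash (K^\top u)$ and $u \leftarrow a \oslash (K v)$, where $\oslash$ denotes elementwise division
- Recover the transport plan: $\pi = \operatorname{diag}(u)\, K\, \operatorname{diag}(v)$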
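The iteration above can be sketched in a few lines of NumPy. This is a minimal version under stated assumptions: uniform marginals, a squared Euclidean cost rescaled to $[0, 1]$, a fixed iteration count instead of a convergence test, and no log-domain stabilization (which would be needed for small $\varepsilon$). The function name and default values are illustrative.

```python
import numpy as np

def sinkhorn_plan(C, a, b, eps=0.5, num_iters=200):
    """Entropy-regularized OT plan via Sinkhorn scaling.

    C: (n, m) cost matrix; a: (n,) row marginal; b: (m,) column marginal.
    """
    K = np.exp(-C / eps)               # Gibbs kernel
    u = np.ones_like(a)                # uniform row scaling
    for _ in range(num_iters):
        v = b / (K.T @ u)              # match column marginals
        u = a / (K @ v)                # match row marginals
    return u[:, None] * K * v[None, :] # diag(u) K diag(v)

rng = np.random.default_rng(0)
x_data = rng.standard_normal((5, 2))
x_noise = rng.standard_normal((5, 2))

# Pairwise squared distances, rescaled so exp(-C / eps) stays well-conditioned.
C = np.sum((x_data[:, None, :] - x_noise[None, :, :]) ** 2, axis=-1)
C = C / C.max()

a = np.full(5, 1.0 / 5)
b = np.full(5, 1.0 / 5)
plan = sinkhorn_plan(C, a, b)
```

Rows of `plan` index data samples and columns index noise samples, so `plan[i, j]` is the mass coupling data point `i` to noise point `j`.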
Theoretical behavior: As $\varepsilon \to 0$, the Sinkhorn plan converges to the unregularized (true) OT plan. Larger $\varepsilon$ yields smoother, more diffuse couplings that move toward the independent coupling: the iterations converge faster and are numerically more stable, but the transport cost is higher and the resulting trajectories are less straight. The entropy term prevents the algorithm from assigning all probability mass to a single matching pair, spreading the coupling smoothly across the cost landscape.
Mini-batch implementation: Within each batch of size $B$, compute the mini-batch Sinkhorn plan between the $B$ data samples and $B$ noise samples. Use this plan to assign a noise sample $x_0^{(\sigma(i))}$ to each data sample $x_1^{(i)}$, where $\sigma$ is the assignment induced by the OT coupling (a permutation when the plan is close to the unregularized one). Each Sinkhorn iteration costs $O(B^2)$, which is tractable for typical batch sizes.
Wall-clock overhead: Sinkhorn typically converges in 100–200 iterations (with early stopping on the dual variable residual), which adds roughly 5–10% to wall-clock training time at typical batch sizes. The benefit is substantial: the coupling lowers the variance of the regression targets during training, and the resulting straighter trajectories reduce the number of ODE function evaluations needed at inference relative to DDPM and make one-step generation after reflow more attainable.
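The assignment step in the mini-batch implementation can be sketched as below. This version samples a noise index for each data sample from the corresponding row of the plan, which is one possible way to turn a soft plan into pairs and is an assumption of this example; the toy plan is a hand-built permutation so the outcome is unambiguous.

```python
import numpy as np

def sample_pairing(plan, rng):
    """For each row i, sample a column j with probability plan[i, j] / plan[i].sum()."""
    probs = plan / plan.sum(axis=1, keepdims=True)
    return np.array([rng.choice(plan.shape[1], p=row) for row in probs])

# Toy plan that is an exact permutation: row 0 -> col 2, row 1 -> col 0, row 2 -> col 1.
plan = np.eye(3)[[2, 0, 1]] / 3.0

rng = np.random.default_rng(0)
idx = sample_pairing(plan, rng)

noise = np.arange(6.0).reshape(3, 2)  # three toy noise samples
paired_noise = noise[idx]             # noise sample assigned to each data sample
```

Sampling row by row can assign the same noise sample to several data points when the plan is diffuse. If a strict one-to-one matching is wanted, an exact assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the cost matrix is a common alternative.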
Stochastic interpolants and general probability paths
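Flow matching and DDPM are both special cases of the broader stochastic interpolant framework (Albergo et al., 2023), which parameterizes the path from source to data as:
$$x_t = \alpha_t\, x_0 + \beta_t\, x_1 + \gamma_t\, \epsilon,$$
where $x_0 \sim p_{\text{data}}$, $x_1 \sim p_{\text{source}}$, $\epsilon \sim \mathcal{N}(0, I)$ is independent Gaussian noise, and $\alpha_t, \beta_t, \gamma_t$ are time-dependent coefficients satisfying boundary conditions (note that in this parameterization $t = 0$ is the data end of the path and $t = 1$ is the source end):
- At $t = 0$: $\alpha_0 = 1$, $\beta_0 = \gamma_0 = 0$ (start at clean data)
- At $t = 1$: $\alpha_1 = 0$, with $\beta_1 x_1 + \gamma_1 \epsilon$ distributed as the source (end at source)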
Different interpolants correspond to different modeling choices:
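- Rectified flow: $\alpha_t = 1 - t$, $\beta_t = t$, $\gamma_t = 0$. Linear interpolation with no independent noise; produces constant velocity targets $x_1 - x_0$.
- DDPM: $\alpha_t = \sqrt{\bar{\alpha}_t}$, $\beta_t = 0$, $\gamma_t = \sqrt{1 - \bar{\alpha}_t}$, where $\bar{\alpha}_t$ is the cumulative signal coefficient of the noise schedule. No interpolation term ($\beta_t = 0$); the source distribution is implicit in the noise schedule. This is the variance-preserving scaling from the DDPM paper.
- Trigonometric interpolant: $\alpha_t = \cos(\pi t / 2)$, $\beta_t = \sin(\pi t / 2)$, $\gamma_t = 0$. Smooth trigonometric interpolation; concentrates density changes near $t = 0$ and $t = 1$, reducing variance at intermediate times. Used in some recent models for improved sample quality.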
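The conditional vector field for a general interpolant is:
$$u_t(x_t \mid x_0, x_1, \epsilon) = \alpha_t'\, x_0 + \beta_t'\, x_1 + \gamma_t'\, \epsilon,$$
where primes denote time derivatives. The flow matching objective remains:
$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1,\, \epsilon}\,\big\| v_\theta(x_t, t) - u_t(x_t \mid x_0, x_1, \epsilon) \big\|^2.$$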
This unifying view shows that flow matching, denoising diffusion, and other variants are all instances of the same core principle: regression on a conditional velocity field. The choice of interpolant affects the geometry of trajectories (straight vs. curved), variance at different noise levels, and the form of the target vector field — but the training procedure and theoretical guarantees remain identical.
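To make the shared structure concrete, the sketch below computes an interpolated point and its velocity target from any triple of coefficient functions and their time derivatives. The function names are hypothetical, and the convention follows the boundary conditions stated above (data at $t = 0$, source at $t = 1$). Only the linear and trigonometric coefficient choices are shown.

```python
import numpy as np

def interpolant_sample(x0, x1, noise, t, coeffs, dcoeffs):
    """Return (x_t, target) for a general interpolant.

    coeffs(t)  -> (alpha, beta, gamma)
    dcoeffs(t) -> their time derivatives
    """
    a, b, g = coeffs(t)
    da, db, dg = dcoeffs(t)
    x_t = a * x0 + b * x1 + g * noise
    target = da * x0 + db * x1 + dg * noise
    return x_t, target

def linear_coeffs(t):
    return 1.0 - t, t, 0.0 * t

def linear_dcoeffs(t):
    return -np.ones_like(t), np.ones_like(t), np.zeros_like(t)

def trig_coeffs(t):
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t), 0.0 * t

def trig_dcoeffs(t):
    return (-0.5 * np.pi * np.sin(0.5 * np.pi * t),
            0.5 * np.pi * np.cos(0.5 * np.pi * t),
            0.0 * t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2))     # stand-in for data samples
x1 = rng.standard_normal((4, 2))     # stand-in for source samples
noise = rng.standard_normal((4, 2))
t = rng.uniform(0.0, 1.0, size=(4, 1))

x_lin, u_lin = interpolant_sample(x0, x1, noise, t, linear_coeffs, linear_dcoeffs)
x_trig0, u_trig0 = interpolant_sample(x0, x1, noise, np.zeros((4, 1)),
                                      trig_coeffs, trig_dcoeffs)
```

Swapping the coefficient functions changes the path geometry and the target, while the regression loss on `target` stays the same, which is the point of the unifying view.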
Consistency models
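Consistency models (Song et al., 2023) learn a function $f_\theta(x_t, t)$ with the consistency property: for any two points $(x_t, t)$ and $(x_{t'}, t')$ on the same probability flow (PF) ODE trajectory, $f_\theta(x_t, t) = f_\theta(x_{t'}, t')$, so the function maps any trajectory point to the same clean sample. A consistency model can generate high-quality samples in a single step: draw $x_T \sim \mathcal{N}(0, T^2 I)$ and compute $\hat{x}_0 = f_\theta(x_T, T)$.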
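Consistency distillation (CD) trains the consistency model to satisfy the consistency property by minimizing:
$$\mathcal{L}_{\text{CD}}(\theta) = \mathbb{E}\Big[\, d\big( f_\theta(x_{t_{n+1}}, t_{n+1}),\; f_{\theta^-}(\hat{x}_{t_n}, t_n) \big) \Big],$$
where $\hat{x}_{t_n}$ is obtained by running one ODE step from $x_{t_{n+1}}$ using the pretrained diffusion model, $d$ is a distance metric (LPIPS perceptual distance works well), and $\theta^-$ is an exponential moving average of $\theta$ (the parameters of the target network). CD requires a pretrained diffusion model (the teacher that drives the ODE solver) and produces a one-step model.
Consistency training (CT) trains the consistency model without a pretrained teacher by replacing the ODE step with an estimate built from the data point $x_0$ itself: the forward process noises the same $x_0$ with the same Gaussian noise $z$ to two adjacent levels, giving the pair $(x_0 + t_n z,\; x_0 + t_{n+1} z)$. CT is less stable than CD but allows training from scratch.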
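Progressive time discretization in consistency training: CT uses a discretization of the time interval $[\epsilon, T]$ into $N$ timesteps. Rather than using a fixed $N$ throughout training, the schedule $N(k)$ starts with a coarse discretization (small $N$, large $\Delta t$) and progressively refines it as training progresses (increasing $N$, reducing $\Delta t$). Here $k$ is the training iteration and $s_0$, $s_1$ are hyperparameters that set the initial and final number of discretization steps (e.g., $s_0 = 2$, $s_1 = 150$ in the CIFAR-10 setup of Song et al., 2023).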
The intuition is that coarse discretization provides a strong global consistency signal: the learned function must output the same $\hat{x}_0$ for very different noise levels, which discourages mode collapse and enforces large-scale structure. As training refines the discretization, the consistency signal shifts to local consistency (adjacent points on the trajectory agree), fine-tuning the mapping near the data manifold. This curriculum helps the model avoid poor local minima and lets CT approach CD's sample quality without a pretrained teacher.
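A schedule of this kind can be sketched as follows. The square-root form and the default values below are written from memory of the schedule reported in Song et al. (2023) and should be treated as assumptions to check against the paper; the function name is hypothetical.

```python
import math

def num_discretization_steps(k, total_iters, s0=2, s1=150):
    """Number of discretization steps N(k) at training iteration k.

    Assumed form: N(k) = ceil(sqrt(k / K * ((s1 + 1)^2 - s0^2) + s0^2) - 1) + 1,
    which grows from s0 at k = 0 to s1 + 1 at k = K.
    """
    frac = k / total_iters
    inner = frac * ((s1 + 1) ** 2 - s0 ** 2) + s0 ** 2
    return math.ceil(math.sqrt(inner) - 1) + 1

# Sample the schedule at eleven evenly spaced points of a 1000-iteration run.
schedule = [num_discretization_steps(k, 1000) for k in range(0, 1001, 100)]
```

The square root makes the number of steps grow quickly early in training and more slowly later, so most of the run is spent at fine discretizations.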
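Multi-step generation with consistency models: draw $x_T \sim \mathcal{N}(0, T^2 I)$, compute $\hat{x}_0 = f_\theta(x_T, T)$, add noise to get $x_\tau = \hat{x}_0 + \sqrt{\tau^2 - \epsilon^2}\, z$ at an intermediate noise level $\tau < T$ with $z \sim \mathcal{N}(0, I)$, apply $f_\theta$ again, and repeat. This stochastic refinement scheme enables a quality-speed tradeoff from 1 to 4 steps.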
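The refinement loop can be sketched as below. The noise levels, the minimum time `t_min`, and the closed-form `toy_consistency_fn` (the exact consistency function for data concentrated at a single point) are assumptions chosen so the example runs without a trained network.

```python
import numpy as np

def multistep_consistency_sample(f, shape, times, rng, t_min=0.002):
    """Alternate denoising with f and re-noising to decreasing levels.

    f(x, t): consistency function returning an estimate of the clean sample.
    times:   decreasing noise levels, with times[0] = T.
    """
    T = times[0]
    x = T * rng.standard_normal(shape)   # x_T ~ N(0, T^2 I)
    x0_hat = f(x, T)                     # one-step estimate
    for tau in times[1:]:
        z = rng.standard_normal(shape)
        x_tau = x0_hat + np.sqrt(tau ** 2 - t_min ** 2) * z  # re-noise to level tau
        x0_hat = f(x_tau, tau)           # refine the estimate
    return x0_hat

center = np.array([3.0, -1.0])

def toy_consistency_fn(x, t, t_min=0.002):
    """Hypothetical stand-in for a trained model: data is a point mass at `center`."""
    return center + (t_min / t) * (x - center)

rng = np.random.default_rng(0)
sample = multistep_consistency_sample(
    toy_consistency_fn, shape=(4, 2), times=[80.0, 10.0, 1.0], rng=rng
)
```

With `times` of length one the loop body never runs and the function reduces to single-step generation, so the same routine covers the whole 1-to-several-step range.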
Cross-course connections: Flow matching across generative, RL, robotics, and vision domains
Flow matching concepts extend far beyond image generation. The table below maps core ideas from this week across all four GenAI courses:
| Concept | Course 3 (Generative Models) | Course 1 (RL) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| OT coupling | Pairs noise and data for straight trajectories in diffusion space | Distributional RL (e.g., IQN, QR-DQN): Wasserstein distance on return distributions; optimal coupling of Q-functions | OT-CFM diffusion policy (Week 9): pairs noise with robot action sequences to minimize trajectory cost; enables 50 Hz control | Multimodal alignment: optimal coupling between image regions and text token embeddings for cross-modal matching |
| Probability path | Marginal density $p_t(x)$ at time $t$; continuity equation governs density evolution during generation | State visitation distribution under a policy; steady-state distribution in value iteration | Distribution over robot state-action trajectories; path integral structure relates to energy-based planning | Feature distribution shift during CLIP contrastive learning; alignment space between vision and language |
| One-step inference | Consistency model: $\hat{x}_0 = f_\theta(x_T, T)$ (single network call) | One-step model-based planning: selecting an action using a single forward pass (vs. tree search) | 50 Hz real-time control requires ≤1 network call per action; consistency models achieve this directly | Fast CLIP retrieval: one forward pass per image-text pair for similarity matching (no iterative refinement) |
| Source distribution | Standard Gaussian $\mathcal{N}(0, I)$ for unconditional generation | Initial state distribution; uniform over task starts | Distribution of robot initial configurations and object poses; can be learned from demonstrations | Uniform distribution over image patches in self-supervised vision pre-training |
Bridging deployment across disciplines: Flow matching's near-real-time inference matters most in the settings where it is deployed. In robotics (Course 2 Week 9), diffusion policies based on OT-CFM were a breakthrough because robot control demands 50 Hz action generation, far faster than DDPM's 20–1000 step inference allows. The OT coupling keeps trajectories nearly straight, reducing integration to 5–10 steps and making real-time control viable. Consistency distillation parallels fast policy distillation in robotics (Course 2 Week 11): training a large imitation learning model and distilling it into a reactive, low-latency policy. In RL (Course 1), Wasserstein losses on distributional value functions use OT couplings to compare return distributions, drawing on the same geometric insight that makes straight trajectories reduce sampling cost. In vision-language models (Course 4), aligning image and text features can be posed as an optimal transport problem between image regions and text tokens, which matches visual and linguistic structure efficiently; standard CLIP training itself uses a contrastive loss rather than an explicit OT solver.
Key takeaways
Flow matching trains a vector field to transport the source distribution to the data distribution. The conditional flow matching objective has the same gradient as the marginal flow matching objective and is tractable because it conditions on individual data points. Rectified flow uses linear interpolation, producing constant velocity targets and approximately straight trajectories. OT-CFM uses a mini-batch optimal transport coupling to make trajectories straighter, reducing the number of inference steps. The Sinkhorn algorithm solves regularized OT within each batch in $O(B^2)$ time per iteration with modest wall-clock overhead. Stochastic interpolants unify flow matching and DDPM as special cases of a general probability path framework. Consistency models learn the endpoint mapping along a diffusion trajectory, enabling single-step generation; consistency distillation trains the model against a teacher ODE solver, while consistency training uses progressive time discretization for training from scratch.
Conceptual questions
- In rectified flow, the target vector field at each timestep is $x_1 - x_0$ (a constant). Show that for this vector field, the probability flow ODE integrates to a straight line if $v_\theta$ exactly equals $x_1 - x_0$ everywhere. Then explain why the learned $v_\theta$ will not produce perfectly straight trajectories in practice, even after training; specifically, identify the source of trajectory curvature that arises from the averaging in the marginal vector field.
- OT-CFM minimizes the expected transport cost when coupling data and noise. Standard CFM uses an independent coupling. Construct a 1D example with $p_1$ as a bimodal distribution (two Gaussians) and $p_0 = \mathcal{N}(0, 1)$ where the OT coupling produces qualitatively straighter trajectories than the independent coupling. Explain why straighter trajectories require fewer ODE integration steps.
- A consistency model trained with distillation from a DDPM teacher must satisfy $f_\theta(x_{t + \Delta t}, t + \Delta t) = f_\theta(x_t, t)$ along ODE trajectories. If $\Delta t$ is chosen very small, the training signal becomes noisy; if $\Delta t$ is large, the bootstrap target may be inaccurate. Derive the optimal schedule for $\Delta t$ that balances these competing errors, and explain how the progressive time discretization used in consistency training (increasing the number of discretization steps during training) manages this tradeoff.
- Multi-step consistency model generation adds noise to the predicted $\hat{x}_0$ before applying the consistency function again. Show that this is equivalent to a short diffusion process starting from the predicted $\hat{x}_0$ rather than from noise. What error accumulates across multiple refinement steps if the initial one-step prediction is slightly incorrect?
- Flow matching can use any source distribution, not just $\mathcal{N}(0, I)$. Describe a robotics application where the source distribution should be a learned distribution over previous robot states rather than a Gaussian. What computational modification to the flow matching training loop is required, and how does this compare to conditioning the flow on state information?
Looking ahead
Unconditional generative models produce samples from a learned distribution without control over the output. Deploying these models requires mechanisms to steer generation toward specific targets.
Week 8: Conditioning and Control. We derive classifier guidance and classifier-free guidance, examine cross-attention as the mechanism for text conditioning, analyze ControlNet's architectural approach to structural conditioning, and assess CLIP embeddings as the shared semantic space connecting text and image generation.
Further reading
- Lipman, Y., et al. (2022). Flow Matching for Generative Modeling. ICLR. (Conditional flow matching framework).
- Albergo, M. S., & Vanden-Eijnden, E. (2022). Building Normalizing Flows with Stochastic Interpolants. ICLR.
- Song, Y., et al. (2023). Consistency Models. ICML. (Single-step diffusion generation).