Week 7: Flow Matching and Consistency Models

Purpose of this lecture

Flow matching (Lipman et al., 2022; Liu et al., 2022; Albergo and Vanden-Eijnden, 2022) reframes generative modeling as regression on a vector field rather than denoising at each noise level. This reformulation is simpler to derive, can produce straighter trajectories (and therefore fewer function evaluations at inference), and generalizes naturally to any source-target pair rather than requiring a Gaussian source. Consistency models (Song et al., 2023) take a further step: instead of learning a multi-step process, they learn a function that maps any point on a diffusion trajectory directly to the clean sample in a single evaluation. Both lines of work target the same bottleneck: the number of network evaluations needed to draw one sample.

Vector fields and the continuity equation

A continuous normalizing flow defines a time-varying vector field $v_\theta(x, t): \mathcal{X} \times [0,1] \to \mathcal{X}$ that generates a flow $\phi_t$ satisfying:

$$\frac{d\phi_t(x)}{dt} = v_\theta(\phi_t(x), t), \quad \phi_0(x) = x$$

If $x \sim p_0$, where $p_0 = \mathcal{N}(0,I)$ is the source distribution, then $\phi_t(x) \sim p_t$: the flow map pushes $p_0$ forward to a density $p_t$ at each time $t$, and the aim is for $\phi_1$ to push $p_0$ forward to the data distribution $p_1 = p_\text{data}$. The probability density at time $t$ evolves according to the continuity equation:

$$\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v_t) = 0$$

The goal of flow matching is to learn $v_\theta$ such that $\phi_1$ transports $p_0$ to $p_1$. The key insight is that this can be done by regressing $v_\theta$ against a conditional vector field $u(x, t \mid x_0)$, defined per data point $x_0$, whose average over data points is the marginal field that generates $p_t$. The intractable marginal vector field $v_t(x)$ itself is never needed.
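To make the flow ODE concrete, here is a minimal sketch that integrates $dx/dt = v(x, t)$ with forward Euler for a case where the field is known in closed form. The Gaussian-to-Gaussian target, the function names `velocity` and `euler_flow`, and the parameters `m` and `s` are illustrative assumptions, not part of the lecture; a learned $v_\theta$ would take the place of `velocity`.

```python
import numpy as np

def velocity(x, t, m=2.0, s=0.5):
    """Closed-form field that moves N(0, 1) at t=0 to N(m, s^2) at t=1.

    It is the time derivative of the flow phi_t(z) = (1 - t + t*s) * z + t*m,
    rewritten as a function of the current position x = phi_t(z).
    """
    z = (x - t * m) / (1.0 - t + t * s)   # recover the starting point z
    return (s - 1.0) * z + m              # d/dt phi_t(z)

def euler_flow(x_start, n_steps=20, **field_kwargs):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = np.array(x_start, dtype=float)
    h = 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * velocity(x, k * h, **field_kwargs)
    return x

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)      # samples from the source p_0 = N(0, 1)
x_end = euler_flow(z)            # transported samples, approximately N(2, 0.25)
print(x_end.mean(), x_end.std())
```

Every trajectory of this particular field has constant velocity, so Euler reproduces the exact map $z \mapsto sz + m$ regardless of the number of steps. That is the property the rest of the lecture tries to approximate for learned fields.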

The flow matching objective

The marginal flow matching objective directly targets the marginal vector field:

$$\mathcal{L}_\text{MFM}(\theta) = \mathbb{E}_{t, x \sim p_t}\!\left[\|v_\theta(x, t) - v_t(x)\|^2\right]$$

where $v_t(x) = \mathbb{E}[u(x, t \mid x_0) \mid x_t = x]$ is the marginal vector field, averaged over all data points that could have produced $x_t$. This expectation is intractable because it is taken under the posterior $p(x_0 \mid x_t = x)$, which requires integrating over the entire data distribution for every query point $x$.

The conditional flow matching (CFM) objective bypasses this by regressing against conditional vector fields conditioned on individual data points:

$$\mathcal{L}_\text{CFM}(\theta) = \mathbb{E}_{t, x_0, x_t \sim p_t(\cdot | x_0)}\!\left[\|v_\theta(x_t, t) - u(x_t, t \mid x_0)\|^2\right]$$

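This is tractable because $u(x_t, t \mid x_0)$ can be computed analytically for any choice of interpolation path. The CFM and MFM objectives differ only by a constant that does not depend on $\theta$, so they have the same gradient with respect to $\theta$, and a minibatch CFM gradient is an unbiased estimate of the MFM gradient. The training procedure is: (1) draw $x_0 \sim p_\text{data}$; (2) draw $x_1 \sim \mathcal{N}(0,I)$; (3) draw $t$ (typically uniform on $[0,1]$), interpolate to get $x_t = \psi_t(x_1, x_0)$, and compute the target $u_t = d\psi_t/dt$; (4) minimize the squared error. Note the indexing: at the sample level these notes write $x_0$ for the data endpoint and $x_1$ for the noise endpoint, so $t=0$ is data and $t=1$ is noise, and generation integrates the learned ODE from $t=1$ back to $t=0$. This is the reverse of the density labels $p_0$ (source) and $p_1$ (data) used in the previous section.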
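The claim that regressing on conditional targets recovers the marginal field can be checked numerically in a case where the marginal field has a closed form. The sketch below assumes a toy 1D data distribution with two equally likely points $\pm a$ and the linear path $x_t = (1-t)x_0 + t x_1$; the helper `marginal_velocity` and all constants are illustrative choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
a, t, n = 1.0, 0.5, 400_000

x0 = rng.choice([-a, a], size=n)   # data endpoint
x1 = rng.normal(size=n)            # noise endpoint
xt = (1 - t) * x0 + t * x1         # point on the conditional path
u = x1 - x0                        # conditional regression target d(psi_t)/dt

def marginal_velocity(x, t, a=1.0):
    """E[x1 - x0 | x_t = x] for the two-point data distribution."""
    modes = np.array([-a, a])
    # Posterior over the data point: x_t | x0 ~ N((1 - t) * x0, t^2).
    log_w = -((x - (1 - t) * modes) ** 2) / (2 * t ** 2)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Given x_t = x and x0, the noise is (x - (1 - t) * x0) / t, so u = (x - x0) / t.
    return float(np.sum(w * (x - modes) / t))

# The squared-error minimizer at x is the average of the conditional targets
# over all samples whose x_t lands near x.
x_query = 0.3
in_bin = np.abs(xt - x_query) < 0.05
mc_estimate = u[in_bin].mean()
exact = marginal_velocity(x_query, t, a)
print(mc_estimate, exact)
```

The two printed numbers agree up to Monte Carlo error, even though every individual target `u` is one of only two values at this query point. The network never sees the posterior weights; the squared loss performs the averaging.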

Linear interpolation: rectified flow

Rectified flow (Liu et al., 2022) chooses the simplest possible interpolation: a straight line between $x_0 \sim p_\text{data}$ and $x_1 \sim \mathcal{N}(0,I)$ :

$$x_t = (1 - t) x_0 + t x_1, \quad t \in [0, 1]$$

The conditional vector field for this interpolation is simply the constant velocity:

$$u(x_t, t \mid x_0, x_1) = x_1 - x_0$$

The flow matching objective becomes:

$$\mathcal{L}_\text{RF}(\theta) = \mathbb{E}_{t, x_0, x_1}\!\left[\|v_\theta(x_t, t) - (x_1 - x_0)\|^2\right]$$

This is even simpler than DDPM's noise prediction objective: no noise schedule, no signal-to-noise weighting, no Markov chain — just regression on straight-line velocities. The marginal probability path is:

$$p_t(x) = \int p_\text{data}(x_0) \mathcal{N}(x; (1-t)x_0, t^2 I)\, dx_0$$
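As a minimal sketch of how one training batch for this objective could be assembled (the function names `rectified_flow_batch` and `rf_loss` are hypothetical, and the network that would produce `v_pred` is left out):

```python
import numpy as np

def rectified_flow_batch(x0, rng):
    """Build (x_t, t, target) for a batch of data points x0 with shape (B, d)."""
    B = x0.shape[0]
    x1 = rng.normal(size=x0.shape)      # noise endpoint, x1 ~ N(0, I)
    t = rng.uniform(size=(B, 1))        # one time per example, broadcast over d
    xt = (1.0 - t) * x0 + t * x1        # straight-line interpolation
    target = x1 - x0                    # constant-velocity regression target
    return xt, t, target

def rf_loss(v_pred, target):
    """Batch mean of the squared velocity error."""
    return float(np.mean(np.sum((v_pred - target) ** 2, axis=1)))

rng = np.random.default_rng(0)
x0 = rng.normal(loc=3.0, size=(8, 2))   # stand-in for a batch of data
xt, t, target = rectified_flow_batch(x0, rng)
print(rf_loss(np.zeros_like(target), target))   # loss of a model that predicts zero velocity
```

Conditioned on $x_0$ and $t$, the point `xt` is Gaussian with mean $(1-t)x_0$ and standard deviation $t$, which is the kernel inside the marginal path integral above.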

Each conditional path is a straight line, but that does not make the learned marginal trajectories straight. By default $x_0$ and $x_1$ are drawn independently, and the independent coupling is generally not the optimal transport plan (the coupling $\pi(x_0, x_1)$ that minimizes the expected squared distance between paired endpoints). Under an independent coupling many conditional paths cross, and the marginal field averages over them, so the ODE trajectories curve. They tend toward straight lines when the coupling is near-optimal.
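How far a marginal trajectory can be from straight is easy to see in a toy case. The sketch below assumes 1D data with two equally likely points $\pm a$ and an independent coupling with Gaussian noise, for which the marginal velocity has a closed form; it then integrates the ODE backward from noise ($t=1$) toward data. The setup and helper names are illustrative assumptions.

```python
import numpy as np

def marginal_velocity(x, t, a=1.0):
    """E[x1 - x0 | x_t = x] for data at -a or +a, noise N(0, 1), independent coupling."""
    modes = np.array([-a, a])
    log_w = -((x - (1 - t) * modes) ** 2) / (2 * t ** 2)   # log posterior weights
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return float(np.sum(w * (x - modes) / t))

def integrate_to_data(x_noise, t_end=0.01, n_steps=2000):
    """Euler-integrate dx/dt = v_t(x) backward from t=1 (noise) to t=t_end."""
    ts = np.linspace(1.0, t_end, n_steps + 1)
    x = float(x_noise)
    path = [x]
    for t_now, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_now) * marginal_velocity(x, t_now)
        path.append(x)
    return ts, np.array(path)

ts, path = integrate_to_data(0.3)
v_start = marginal_velocity(path[0], ts[0])            # velocity at the noise end
v_average = (path[-1] - path[0]) / (ts[-1] - ts[0])    # velocity of the straight chord
print(v_start, v_average, path[-1])
```

The trajectory that starts at noise value 0.3 ends near the data point $+1$, but its initial velocity has the opposite sign to the chord connecting its endpoints. A one-step Euler sampler would therefore move this point in the wrong direction, which is the practical cost of curvature.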

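Reflow takes a trained rectified flow, generates (noise, sample) pairs by integrating the learned ODE from noise to data, and trains a new rectified flow on these pairs. The new coupling is deterministic, and its straight-line paths cross far less than those of the independent coupling, which makes the retrained trajectories straighter; Liu et al. (2022) show that the procedure does not increase transport cost. After one or more reflow iterations, sampling can require very few steps (in favorable cases a single Euler step).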

Optimal transport coupling

OT-CFM (Tong et al., 2023) improves flow matching by using an optimal transport coupling between the source and data rather than an independent coupling. The OT plan $\pi^*(x_0, x_1)$ minimizes the expected squared transport distance $\mathbb{E}_\pi[\|x_0 - x_1\|^2]$. The conditional paths $x_t = (1-t)x_0 + t x_1$ are straight under any coupling; what a near-OT coupling changes is that these paths rarely cross, so the marginal trajectories followed by the ODE are also approximately straight and fewer integration steps are needed at inference.

The mini-batch approximation computes an OT plan within each batch, for example with the Sinkhorn algorithm, yielding straighter trajectories at modest additional cost. With this coupling, OT-CFM can reach sample quality comparable to DDPM with roughly $10\times$ fewer function evaluations.
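In one dimension the exact mini-batch OT plan for squared cost is obtained by sorting both batches, which makes the effect of the coupling easy to inspect without any solver. The following sketch assumes a bimodal toy data distribution and counts how many pairs of straight conditional paths cross under each coupling; the constants and the helper `count_crossings` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 256
# Toy bimodal data and Gaussian noise, drawn independently of each other.
x0 = np.where(rng.random(B) < 0.5, -2.0, 2.0) + 0.3 * rng.normal(size=B)
x1 = rng.normal(size=B)

# Independent coupling: pair samples in the random order they were drawn.
cost_independent = np.mean((x0 - x1) ** 2)

# Exact 1D OT for squared cost: pair the k-th smallest data point
# with the k-th smallest noise sample.
x0_ot, x1_ot = np.sort(x0), np.sort(x1)
cost_ot = np.mean((x0_ot - x1_ot) ** 2)

def count_crossings(a, b):
    """Number of index pairs whose straight paths a_i -> b_i and a_j -> b_j cross."""
    da = a[:, None] - a[None, :]
    db = b[:, None] - b[None, :]
    return int(np.sum(da * db < 0) // 2)   # each crossing is counted twice

print(cost_independent, cost_ot)
print(count_crossings(x0, x1), count_crossings(x0_ot, x1_ot))
```

The sorted pairing has no crossings and a lower transport cost, so the marginal field has nothing to average away. In higher dimensions sorting is not available, which is where a solver such as Sinkhorn comes in.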

Mini-batch Sinkhorn: the computational implementation

The Sinkhorn algorithm (Cuturi, 2013) solves the regularized optimal transport problem efficiently through iterative scaling:

$$\pi^*_\varepsilon = \arg\min_{\pi \in \Pi(\mu,\nu)} \langle C, \pi \rangle + \varepsilon H(\pi)$$

where $C_{ij} = \|x_i - y_j\|^2$ is the pairwise cost matrix, $H(\pi) = \sum_{ij}\pi_{ij}\log\pi_{ij}$ is the negative entropy of the plan (so adding $\varepsilon H(\pi)$ with $\varepsilon > 0$ rewards higher-entropy, smoother plans), and $\Pi(\mu,\nu)$ is the set of couplings with marginals $\mu$ and $\nu$.

The Sinkhorn iterations alternate between scaling row and column vectors using the Gibbs kernel $K_{ij} = \exp(-C_{ij}/\varepsilon)$ :

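Initialize $u = \mathbf{1}_m$ (any positive vector works; here $u$ and $v$ are scaling vectors, unrelated to the vector fields above)
Repeat until the marginals match: $v \leftarrow \nu / (K^\top u)$ and $u \leftarrow \mu / (Kv)$, with elementwise division; for equally weighted mini-batches $\mu = \mathbf{1}_m / m$ and $\nu = \mathbf{1}_n / n$
Recover the transport plan: $\pi^*_\varepsilon = \text{diag}(u) K \text{diag}(v)$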
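The scaling iterations can be sketched in a few lines of NumPy. This version writes the marginals explicitly as uniform weights $\mu = \mathbf{1}_m/m$, $\nu = \mathbf{1}_n/n$ and stops on the column-marginal residual; the function name `sinkhorn_plan`, the stopping rule, and the default tolerances are assumptions for illustration. It works in the probability domain, so very small `eps` relative to the costs would underflow; practical implementations usually iterate in the log domain.

```python
import numpy as np

def sinkhorn_plan(x, y, eps, max_iters=2000, tol=1e-9):
    """Entropic OT plan between equally weighted point sets x (m, d) and y (n, d)."""
    m, n = x.shape[0], y.shape[0]
    mu = np.full(m, 1.0 / m)                                     # row marginal
    nu = np.full(n, 1.0 / n)                                     # column marginal
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)   # squared distances
    K = np.exp(-C / eps)                                         # Gibbs kernel
    u = np.ones(m)
    for _ in range(max_iters):
        v = nu / (K.T @ u)            # rescale columns
        u = mu / (K @ v)              # rescale rows (row sums are now exact)
        plan = u[:, None] * K * v[None, :]
        if np.abs(plan.sum(axis=0) - nu).sum() < tol:
            break                     # column sums also match: converged
    return plan, C

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 2))
y = rng.normal(size=(32, 2))
plan, C = sinkhorn_plan(x, y, eps=4.0)
print("transport cost under the plan:", (plan * C).sum())
print("transport cost under the independent coupling:", C.mean())
```

The independent coupling puts mass $1/(mn)$ on every pair, so its cost is simply the mean of `C`; the entropic plan always does at least as well.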

Theoretical behavior: As $\varepsilon \to 0$, the Sinkhorn plan converges to the unregularized (true) OT plan. Larger $\varepsilon$ yields smoother, more diffuse couplings that approach the independent coupling $\mu \otimes \nu$ as $\varepsilon \to \infty$, so $\varepsilon$ trades numerical stability and convergence speed against how close the coupling is to true OT, and with it how straight the resulting trajectories are. The entropy term $\varepsilon H(\pi)$ keeps the plan from collapsing onto a sparse one-to-one matching, spreading each sample's mass across several nearby partners.

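Mini-batch implementation: Within each batch of size $B$, compute the mini-batch Sinkhorn plan $\pi^* \in \mathbb{R}^{B \times B}$ between the $B$ data samples and $B$ noise samples. Use this plan to assign a noise sample $y_{\sigma(i)}$ to each data sample $x_i$. Because an entropic plan is not a permutation matrix, $\sigma(i)$ is drawn from row $i$ of the plan, normalized to sum to one; only for exact OT between equally weighted batches can the plan be taken to be a permutation. Building the cost matrix costs $\mathcal{O}(B^2 d)$ for $d$-dimensional samples and each Sinkhorn iteration costs $\mathcal{O}(B^2)$, which is tractable for typical batch sizes ($B = 128$–$512$).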
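One simple way to turn a plan into training pairs is to sample a partner for each data point from its row of the plan. This is a sketch under that assumption (the name `sample_pairs` is hypothetical, and other schemes, such as sampling index pairs jointly from the whole plan, are equally valid):

```python
import numpy as np

def sample_pairs(plan, rng):
    """For each data index i, draw a noise index j with probability plan[i, j] / sum_j plan[i, j]."""
    conditional = plan / plan.sum(axis=1, keepdims=True)   # normalize each row
    n_noise = plan.shape[1]
    return np.array([rng.choice(n_noise, p=row) for row in conditional])

rng = np.random.default_rng(0)

# A plan that is a scaled permutation matrix gives back that permutation.
perm = np.array([2, 0, 1])
hard_plan = np.eye(3)[perm] / 3.0
print(sample_pairs(hard_plan, rng))

# A diffuse plan gives random partners, biased toward high-mass entries.
soft_plan = np.array([[0.30, 0.03], [0.03, 0.64]])
print(sample_pairs(soft_plan, rng))
```

With the returned indices `sigma`, the batch used in the regression is `(x[i], y[sigma[i]])`; the interpolation and target are computed exactly as with an independent coupling.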

Wall-clock overhead: Sinkhorn typically converges in 100–200 iterations (with early stopping on the dual variable residual). For $B=256$, this typically adds 5–10% to wall-clock training time. The benefit is twofold. During training, the OT pairing reduces the variance of the regression target and hence of the gradient estimates. At inference, straighter trajectories lower the number of ODE function evaluations from $\sim 50$ (DDPM) to $\sim 10$ (OT-CFM); combined with reflow, this can be pushed toward one-step generation.

Stochastic interpolants and general probability paths

Flow matching and DDPM are both special cases of the broader stochastic interpolant framework (Albergo et al., 2023), which parameterizes the path from source to data as:

$$x_t = \alpha(t) x_0 + \beta(t) x_1 + \gamma(t) \xi, \quad t \in [0, 1]$$

where $x_0 \sim p_\text{data}$ , $x_1 \sim p_\text{source}$ , $\xi \sim \mathcal{N}(0,I)$ is independent Gaussian noise, and $(\alpha, \beta, \gamma)$ are time-dependent coefficients satisfying boundary conditions:

At $t=0$ : $\alpha(0)=1, \beta(0)=0, \gamma(0)=0$ (start at clean data)
At $t=1$ : $\alpha(1)=0, \beta(1)=1, \gamma(1)=0$ (end at source)

Different interpolants correspond to different modeling choices:

Rectified flow: $\alpha(t) = 1-t$ , $\beta(t) = t$ , $\gamma(t) = 0$ . Linear interpolation with no independent noise; produces constant velocity targets $u_t = x_1 - x_0$ .
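DDPM: $\alpha(t) = \sqrt{\bar\alpha_t}$, $\beta(t) = \sqrt{1-\bar\alpha_t}$, $\gamma(t) = 0$, with $x_1 \sim \mathcal{N}(0,I)$ playing the role of the injected noise $\epsilon$. This is the variance-preserving scaling from the DDPM paper ($\alpha^2 + \beta^2 = 1$), with the discrete timestep rescaled to $t \in [0,1]$. The boundary condition at $t=1$ holds only approximately, because $\bar\alpha_T$ is small but not exactly zero. (Putting the noise in the $\gamma(t)\xi$ term with $\beta = 0$ describes the same path, but it does not satisfy the boundary conditions $\beta(1)=1$, $\gamma(1)=0$ as stated above.)
Trigonometric interpolant: $\alpha(t) = \cos(\pi t/2)$, $\beta(t) = \sin(\pi t/2)$, $\gamma(t) = 0$. Smooth trigonometric interpolation that satisfies the boundary conditions exactly and keeps $\alpha^2 + \beta^2 = 1$ for all $t$, so it is variance-preserving when $x_0$ and $x_1$ are independent with unit variance. Conditional paths are arcs rather than straight lines. Used in some recent models.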

The conditional vector field for a general interpolant is:

$$u(x_t, t \mid x_0, x_1) = \alpha'(t) x_0 + \beta'(t) x_1 + \gamma'(t) \xi$$

where primes denote time derivatives. The flow matching objective remains:

$$\mathcal{L}_\text{CFM}(\theta) = \mathbb{E}_{t, x_0, x_1, \xi}\!\left[\|v_\theta(x_t, t) - u(x_t, t \mid x_0, x_1)\|^2\right]$$

This unifying view shows that flow matching, denoising diffusion, and other variants are all instances of the same core principle: regression on a conditional velocity field. The choice of interpolant affects the geometry of trajectories (straight vs. curved), the variance at different noise levels, and the form of the target vector field. The training procedure is otherwise the same, up to a reparameterization of the regression target (velocity vs. noise) and a time-dependent loss weighting.
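The interpolant view translates directly into code: an interpolant is a set of coefficient functions plus their derivatives, and the regression target is the time derivative of the path. The sketch below covers the two noise-free cases ($\gamma = 0$) from the list above; the dictionary layout and the name `point_and_target` are illustrative assumptions.

```python
import numpy as np

# name -> (alpha, beta, d_alpha/dt, d_beta/dt), all with gamma(t) = 0
INTERPOLANTS = {
    "rectified": (
        lambda t: 1.0 - t,
        lambda t: t,
        lambda t: -1.0 + 0.0 * t,
        lambda t: 1.0 + 0.0 * t,
    ),
    "trigonometric": (
        lambda t: np.cos(np.pi * t / 2),
        lambda t: np.sin(np.pi * t / 2),
        lambda t: -np.pi / 2 * np.sin(np.pi * t / 2),
        lambda t: np.pi / 2 * np.cos(np.pi * t / 2),
    ),
}

def point_and_target(name, x0, x1, t):
    """Return the path point x_t and the conditional velocity target u at time t."""
    alpha, beta, d_alpha, d_beta = INTERPOLANTS[name]
    xt = alpha(t) * x0 + beta(t) * x1
    u = d_alpha(t) * x0 + d_beta(t) * x1
    return xt, u

rng = np.random.default_rng(0)
x0 = rng.normal(size=5)     # stand-in for a data sample
x1 = rng.normal(size=5)     # stand-in for a source sample
for name in INTERPOLANTS:
    xt, u = point_and_target(name, x0, x1, 0.3)
    print(name, np.linalg.norm(u))
```

Swapping the interpolant changes only the entries of the table; the loss and the training loop stay the same.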

Consistency models

Consistency models (Song et al., 2023) learn a function $f_\theta(x_t, t)$ with the consistency property: for any two points $(x_t, x_{t'})$ on the same probability flow (PF) ODE trajectory, $f_\theta(x_t, t) = f_\theta(x_{t'}, t') = x_0$. In other words, the function maps every point on a trajectory to that trajectory's clean endpoint, and at the clean end it reduces to the identity. A consistency model can therefore generate samples in a single step: draw $x_T \sim \mathcal{N}(0,I)$ and compute $x_0 = f_\theta(x_T, T)$.

Consistency distillation (CD) trains the consistency model to satisfy the consistency property by minimizing:

$$\mathcal{L}_\text{CD}(\theta, \theta^-) = \mathbb{E}\!\left[d\!\left(f_\theta(x_{t+\Delta t}, t + \Delta t),\, f_{\theta^-}(x_t^{\phi}, t)\right)\right]$$

where $x_t^{\phi}$ is obtained by running one ODE solver step from $x_{t+\Delta t}$ using the pretrained diffusion model $\phi$ (the teacher), $d(\cdot, \cdot)$ is a distance metric (LPIPS perceptual distance works well for images), and $\theta^-$ is an exponential moving average of $\theta$ (the target network, which receives no gradient). CD therefore requires a pretrained diffusion model and produces a one-step model.

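Consistency training (CT) trains the consistency model without a pretrained teacher. Instead of taking an ODE step with a learned score, it builds both points of the pair from the forward process: the same data point $x_0$ is noised with the same noise draw at the two adjacent noise levels $t$ and $t + \Delta t$, which amounts to replacing the teacher's score with an unbiased single-sample estimate. CT tends to be less stable than CD but allows training from scratch.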

Progressive time discretization in consistency training: CT uses a discretization of the time interval $[0, T]$ into $N$ timesteps. Rather than using a fixed $N$ throughout training, the schedule $N(k) = \min(s_0 + \lfloor k / k_0 \rfloor, s_1)$ starts with a coarse discretization (small $N$ , large $\Delta t$ ) and progressively refines it as training progresses (increasing $N$ , reducing $\Delta t$ ). Here $k$ is the training iteration and $(s_0, s_1, k_0)$ are hyperparameters (e.g., $s_0=2, s_1=150, k_0=400$ ).
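The schedule is a one-line function. The sketch below uses the example hyperparameters quoted above as defaults; the function name is hypothetical.

```python
def num_discretization_steps(k, s0=2, s1=150, k0=400):
    """N(k) = min(s0 + floor(k / k0), s1): start coarse, add one step every k0 iterations."""
    return min(s0 + k // k0, s1)

for k in (0, 399, 400, 20_000, 59_200, 1_000_000):
    print(k, num_discretization_steps(k))
```

With these values the discretization grows by one step every 400 iterations and saturates at $N = 150$ from iteration $148 \times 400 = 59{,}200$ onward.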

The intuition is that coarse discretization provides a strong global consistency signal: the learned function must output the same $x_0$ for very different noise levels, which pins down large-scale structure early and keeps the bootstrapped targets low-variance. As training refines the discretization, the consistency signal shifts to local consistency (adjacent points on the trajectory agree), which reduces the bias introduced by large $\Delta t$ and fine-tunes the mapping near the data manifold. This curriculum is intended to keep the model out of poor local minima and to narrow the gap between CT and CD without a pretrained teacher.

Multi-step generation with consistency models: generate $x_T^{(0)} \sim \mathcal{N}(0,I)$ , compute $x_0^{(0)} = f_\theta(x_T^{(0)}, T)$ , add noise to get $x_{t_1}^{(1)} = \sqrt{\bar\alpha_{t_1}}x_0^{(0)} + \sqrt{1-\bar\alpha_{t_1}}\epsilon$ , apply $f_\theta$ again, repeat. This stochastic refinement scheme enables a quality-speed tradeoff from 1 to $\sim$ 4 steps.
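The multi-step loop can be exercised end to end on a toy problem where the consistency function is known exactly. The sketch assumes 1D data $x_0 \sim \mathcal{N}(0, S^2)$ under the variance-preserving forward process; every marginal is then a zero-mean Gaussian and the PF ODE only rescales $x$, so the exact consistency function is a rescaling. The closed-form `consistency_fn` stands in for a trained $f_\theta$, and all names and constants are illustrative.

```python
import numpy as np

S = 0.5   # standard deviation of the toy data distribution N(0, S^2)

def consistency_fn(x, alpha_bar):
    """Exact consistency function for the toy problem.

    Under x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps the marginal
    is N(0, alpha_bar * S^2 + 1 - alpha_bar), and the PF ODE maps x_t to
    x_t * S / std(x_t).
    """
    marginal_std = np.sqrt(alpha_bar * S ** 2 + 1.0 - alpha_bar)
    return x * S / marginal_std

def multistep_sample(n, alpha_bars, rng):
    """One-step generation, then one re-noise/denoise round per entry of alpha_bars."""
    x_T = rng.normal(size=n)                  # pure noise: alpha_bar = 0 at t = T
    x0_hat = consistency_fn(x_T, 0.0)         # single-step sample
    for ab in alpha_bars:                     # increasing alpha_bar = decreasing noise
        eps = rng.normal(size=n)
        x_t = np.sqrt(ab) * x0_hat + np.sqrt(1.0 - ab) * eps
        x0_hat = consistency_fn(x_t, ab)
    return x0_hat

rng = np.random.default_rng(0)
samples = multistep_sample(100_000, alpha_bars=[0.2, 0.6], rng=rng)
print(samples.mean(), samples.std())
```

With an exact consistency function, each refinement round maps the data distribution back to itself, so extra steps neither help nor hurt here. With a learned $f_\theta$ the rounds matter because re-noising gives the model another chance to correct errors in the previous estimate.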

Cross-course connections: Flow matching across generative, RL, robotics, and vision domains

Flow matching concepts extend far beyond image generation. The table below maps core ideas from this week across all four GenAI courses:

| Concept | Course 3 (Generative Models) | Course 1 (RL) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| OT coupling | Pairs noise and data for straight trajectories in diffusion space | Distributional RL (e.g., IQN, QR-DQN): Wasserstein distance on return distributions; optimal coupling of Q-functions | OT-CFM diffusion policy (Week 9): pairs noise with robot action sequences to minimize trajectory cost; enables 50 Hz control | Multimodal alignment: optimal coupling between image regions and text token embeddings for cross-modal matching |
| Probability path $p_t$ | Marginal density at time $t$; continuity equation governs density evolution during generation | State visitation distribution $\rho^\pi(s)$ under a policy; steady-state distribution in value iteration | Distribution over robot state-action trajectories; path integral structure relates to energy-based planning | Feature distribution shift during CLIP contrastive learning; alignment space between vision and language |
| One-step inference | Consistency model: $x_0 = f_\theta(x_T, T)$ (single network call) | One-step model-based planning: $a^* = \pi(s)$ using a single forward pass (vs. tree search) | 50 Hz real-time control requires ≤1 network call per action; consistency models achieve this directly | Fast CLIP retrieval: one forward pass per image-text pair for similarity matching (no iterative refinement) |
| Source distribution | Standard Gaussian $\mathcal{N}(0,I)$ for unconditional generation | Initial state distribution $d_0(s)$; uniform over task starts | Distribution of robot initial configurations and object poses; can be learned from demonstrations | Uniform distribution over image patches in self-supervised vision pre-training |

Bridging deployment across disciplines: Fast inference matters most in the settings where these models are deployed. In robotics (Course 2 Week 9), diffusion policies based on OT-CFM were a breakthrough because robot control demands 50 Hz action generation, far faster than DDPM's 20–1000 step inference allows. The OT coupling keeps trajectories nearly straight, reducing integration to 5–10 steps and bringing real-time control within reach. Consistency distillation parallels fast policy distillation in robotics (Course 2 Week 11): train a large imitation learning model, then distill it into a reactive, low-latency policy. In RL (Course 1), Wasserstein losses on distributional value functions use OT couplings to compare return distributions, the same geometric idea that underlies straight, low-cost trajectories here. In vision-language models (Course 4), CLIP itself is trained with a contrastive loss rather than an OT objective, but optimal transport can be used to match image regions to text tokens when a finer-grained cross-modal alignment is needed.

Key takeaways

Flow matching trains a vector field $v_\theta(x, t)$ to transport the source distribution to the data distribution.
The conditional flow matching objective has the same gradient as the marginal flow matching objective and is tractable because it conditions on individual data points.
Rectified flow uses linear interpolation, producing constant velocity targets $u_t = x_1 - x_0$; the learned marginal trajectories are only approximately straight.
OT-CFM uses a mini-batch optimal transport coupling to make trajectories straighter, reducing the number of inference steps.
The Sinkhorn algorithm solves regularized OT within each batch at $\mathcal{O}(B^2)$ cost per iteration, with modest wall-clock overhead.
Stochastic interpolants unify flow matching and DDPM as special cases of a general probability path framework.
Consistency models learn the endpoint mapping $f_\theta(x_t, t) = x_0$ along a diffusion trajectory, enabling single-step generation; consistency distillation trains the model against a teacher ODE solver, while consistency training uses progressive time discretization for training from scratch.

Conceptual questions

In rectified flow, the target vector field at each timestep is $u_t = x_1 - x_0$ (a constant). Show that for this vector field, the probability flow ODE $dx/dt = v_t(x)$ integrates to a straight line if $v_t$ exactly equals $u_t$ everywhere. Then explain why the learned $v_\theta$ will not produce perfectly straight trajectories in practice, even after training — specifically, identify the source of trajectory curvature that arises from the averaging in the marginal vector field.
OT-CFM minimizes the expected transport cost $\mathbb{E}[\|x_0 - x_1\|^2]$ when coupling data and noise. Standard CFM uses an independent coupling. Construct a 1D example with $p_\text{data}$ as a bimodal distribution (two Gaussians) and $p_0 = \mathcal{N}(0,1)$ where the OT coupling produces qualitatively straighter trajectories than the independent coupling. Explain why straighter trajectories require fewer ODE integration steps.
A consistency model trained with distillation from a DDPM teacher must satisfy $f_\theta(x_t, t) \approx f_\theta(x_{t+\Delta t}, t + \Delta t)$ along ODE trajectories. If $\Delta t$ is chosen very small, the training signal becomes noisy; if $\Delta t$ is large, the bootstrap target $f_{\theta^-}(x_t^{\phi}, t)$ may be inaccurate. Derive the optimal $\Delta t$ schedule that balances these competing errors, and explain how the progressive time discretization used in consistency training (increasing the number of discretization steps during training) manages this tradeoff.
Multi-step consistency model generation adds noise to the predicted $x_0$ before applying the consistency function again. Show that this is equivalent to a short diffusion process starting from the predicted $x_0$ rather than from noise. What error accumulates across multiple refinement steps if the initial one-step prediction $f_\theta(x_T, T)$ is slightly incorrect?
Flow matching can use any source distribution, not just $\mathcal{N}(0, I)$ . Describe a robotics application where the source distribution should be a learned distribution over previous robot states rather than a Gaussian. What computational modification to the flow matching training loop is required, and how does this compare to conditioning the flow on state information?

Solutions

If $v_t \equiv u_t = x_1-x_0$ (constant in $x,t$ ), then $dx/dt$ is constant and integrates to $x(t)=x_0+t(x_1-x_0)$ — a straight line. In practice $v_\theta$ learns the marginal field $v_t(x)=\mathbb{E}[x_1-x_0\mid x_t=x]$ , an average over every endpoint pair passing through $x$ . That conditional expectation varies with $x,t$ , so where trajectories from different endpoints cross, the averaged field bends — the curvature comes from the averaging in the marginal field (severe under an independent coupling).
Take $p_\text{data}=\tfrac12\mathcal{N}(-a,\sigma^2)+\tfrac12\mathcal{N}(+a,\sigma^2)$ and $p_0=\mathcal{N}(0,1)$ . The independent coupling pairs a noise sample with a random mode, so many trajectories cross the origin (e.g. positive noise → $-a$ data), producing long, crossing, curved marginal paths. The OT coupling matches each noise sample to the nearer mode (positive↔ $+a$ , negative↔ $-a$ ), giving non-crossing, near-straight paths. Straighter paths have nearly constant velocity, so Euler integration with few steps is accurate (a straight line is integrated exactly in one step); curved paths need many small steps.
Small $\Delta t$ : the bootstrap signal $d(f(x_{t+\Delta t}),f(x_t))\to 0$ and is swamped by estimation/numerical noise (high variance). Large $\Delta t$ : the single ODE-step target $x_t^\phi$ is inaccurate (discretization bias $\sim O(\Delta t^p)$ ). The optimal $\Delta t$ balances variance ( $\downarrow$ with $\Delta t$ ) against bias ( $\uparrow$ with $\Delta t$ ). Progressive discretization starts coarse (large $\Delta t$ : stable, low-variance global signal) and refines (small $\Delta t$ : accurate local targets) as the model improves — annealing the bias–variance tradeoff over training.
The update $x_{t_1}=\sqrt{\bar\alpha_{t_1}}\hat x_0+\sqrt{1-\bar\alpha_{t_1}}\epsilon$ is exactly the DDPM forward process applied to the predicted $\hat x_0$, i.e. a short diffusion starting from $\hat x_0$ rather than from pure noise. If the one-step prediction $f_\theta(x_T,T)$ is slightly wrong, the re-noised point stays centered on that biased $\hat x_0$ and each subsequent $f_\theta$ maps back toward it. Errors smaller than the injected noise scale $\sqrt{1-\bar\alpha_{t_1}}$ are washed out and can be corrected, but errors in coarse structure that survive the re-noising persist, so refinement sharpens detail without fully removing the initial bias.
Receding-horizon control: the source should be the (learned/empirical) distribution over the previous action chunk or robot state, so successive plans are temporally consistent (warm-started) rather than starting from unstructured Gaussian noise. Modification: replace $x_1\sim\mathcal{N}(0,I)$ with samples from that source and pair them with target actions; the interpolation and target $u_t=x_1-x_0$ are unchanged. Versus conditioning on state (keep a Gaussian source, feed state as an extra input to $v_\theta$ ): conditioning is more general and easily amortized across states, while changing the source shortens transport when source already resembles target (faster sampling) but is harder to amortize.

Looking ahead

Unconditional generative models produce samples from a learned distribution without control over the output. Deploying these models requires mechanisms to steer generation toward specific targets.

Week 8: Conditioning and Control. We derive classifier guidance and classifier-free guidance, examine cross-attention as the mechanism for text conditioning, analyze ControlNet's architectural approach to structural conditioning, and assess CLIP embeddings as the shared semantic space connecting text and image generation.
