Week 9: Flow Matching and Diffusion for Robot Policies

Purpose of this lecture

The previous lecture introduced the diffusion policy framework as a way to generate action sequences through iterative denoising. This lecture goes deeper: we derive both diffusion and flow matching from first principles, understand why generative modeling is the right paradigm for multimodal manipulation behavior, and analyze the architectural and training choices that determine policy performance in practice.

Flow matching is a continuous-time framework that includes diffusion-style probability paths as a special case and offers a cleaner mathematical structure, fewer inference steps in practice, and straightforward conditioning on multimodal observations. Both frameworks have been demonstrated in state-of-the-art robot policies (diffusion in Diffusion Policy by Chi et al., 2023, and flow matching in $\pi_0$ by Black et al., 2024), and understanding their relationship is essential for reasoning about why and when to use each. The common thread is that robot trajectories are not unique: for any manipulation task, many valid action sequences exist, and the policy must represent this distribution over behaviors, not merely a single best action.

Why generative policies for manipulation

Consider a bimanual task: pick up a bottle with one hand while holding a cup steady with the other. At each point in the grasp approach, there is a distribution over valid arm trajectories — different approach angles and speeds are all valid — but the actions of the two arms must be jointly consistent. A reactive policy $\pi(a_t \mid o_t)$ that models each arm independently cannot capture this joint distribution, and one that averages over all valid approach angles produces an arm trajectory that goes straight through the bottle. The problem is not that no good action exists; it is that the policy must commit to a mode and execute it consistently.

More formally, the distribution over expert action sequences given an observation is multimodal: $p^*(a_{0:K} \mid o)$ has multiple well-separated modes corresponding to distinct manipulation strategies. The mean of this distribution, $\mathbb{E}[a_{0:K} \mid o]$, is generally not in any mode; it lies between them. Regression-based policies, by minimizing mean squared error, converge to this mean and produce averaging artifacts.
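A minimal numeric sketch of this averaging effect, assuming a hypothetical one-dimensional action (a lateral approach offset) whose demonstrations pass the object either on the left (around -1) or on the right (around +1). The numbers are invented for illustration only:

```python
import random

random.seed(0)

# Hypothetical 1-D expert actions: half the demonstrations approach from
# the left (around -1.0), half from the right (around +1.0).
left = [random.gauss(-1.0, 0.05) for _ in range(500)]
right = [random.gauss(+1.0, 0.05) for _ in range(500)]
actions = left + right

# The constant prediction that minimizes mean squared error is the sample mean.
mse_optimal = sum(actions) / len(actions)

# Distance from that prediction to the nearest demonstrated action.
gap = min(abs(a - mse_optimal) for a in actions)

print(f"MSE-optimal action: {mse_optimal:+.3f}")
print(f"distance to nearest demonstration: {gap:.3f}")
```

The MSE-optimal action sits near zero, far from every demonstration: it is the "straight through the bottle" trajectory in one dimension.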

Generative policies model the full conditional distribution $p_\theta(a_{0:K} \mid o)$, allowing samples that are consistent with a single mode. The challenge is learning and sampling from this high-dimensional distribution efficiently.

Score matching and denoising diffusion

The score function of a distribution $p(x)$ is its log-density gradient: $\nabla_x \log p(x)$ . Score matching trains a neural network to estimate this gradient, and gradient ascent on the estimated score moves samples from low-density to high-density regions of the distribution — from noise toward data.

Denoising Score Matching (DSM; Vincent, 2011) avoids directly estimating the score of $p$ (which requires the intractable normalizing constant) by instead estimating the score of a noisy version of $p$ : for a Gaussian kernel $q_\sigma(x \mid x_0) = \mathcal{N}(x; x_0, \sigma^2 I)$ , the marginal noisy distribution is $q_\sigma(x) = \int q_\sigma(x \mid x_0) p(x_0)\, dx_0$ , and its score is:

\nabla_x \log q_\sigma(x) = -\mathbb{E}\!\left[\frac{x - x_0}{\sigma^2} \;\Big|\; x\right]

Training a network $s_\theta(x, \sigma)$ to predict $-(x - x_0)/\sigma^2$ given $x$ and $\sigma$ is equivalent to denoising: since $x - x_0$ is exactly the added noise, the network learns the direction from a noisy sample back toward the clean one.
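Written out for a single noise level, the denoising score matching objective under the Gaussian kernel above takes the following form. Here $\varepsilon$ is a standard normal draw introduced for this sketch, so that $x = x_0 + \sigma\varepsilon$; note the sign and the $\sigma^2$ scaling of the target:

```latex
\mathcal{L}_{\text{DSM}}(\theta; \sigma)
  = \mathbb{E}_{x_0 \sim p,\; x \sim q_\sigma(\cdot \mid x_0)}
    \left[ \left\| s_\theta(x, \sigma) + \frac{x - x_0}{\sigma^2} \right\|^2 \right]
  = \mathbb{E}_{x_0,\, \varepsilon}
    \left[ \left\| s_\theta(x_0 + \sigma\varepsilon, \sigma) + \frac{\varepsilon}{\sigma} \right\|^2 \right]
```

The minimizer of a squared-error regression is the conditional expectation of the target given the input, which here is $-\mathbb{E}[(x - x_0)/\sigma^2 \mid x]$, the score of $q_\sigma$ in the identity displayed above.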

DDPMs (Ho et al., 2020) make this into a practical algorithm by: (1) defining a forward process that gradually corrupts the data over $T$ timesteps, $q(x_t \mid x_0) = \mathcal{N}(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\, (1-\bar\alpha_t) I)$ with $\bar\alpha_t = \prod_{s=1}^t (1 - \beta_s)$ , and (2) training a network $\varepsilon_\theta(x_t, t)$ to predict the added noise $\varepsilon$ from the noisy sample $x_t$ and diffusion step $t$ . The simple MSE objective

\mathcal{L}_{\text{DDPM}}(\theta) = \mathbb{E}_{t, x_0, \varepsilon}\!\left[\| \varepsilon - \varepsilon_\theta(x_t, t) \|^2\right], \quad x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon

is equivalent to a weighted sum of denoising score matching objectives at all noise levels. The trained denoiser can then be used to run the reverse Markov chain from $x_T \sim \mathcal{N}(0, I)$ to a sample $x_0 \sim p(x_0)$ .

For robot policies, $x_0 = a_{0:K}$ is an action chunk, the diffusion network is conditioned on the observation $o$ , and inference runs the reverse chain from noise to a coherent action sequence: $a_{0:K} \sim p_\theta(\cdot \mid o)$ .
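The forward noising step and the regression target can be sketched in a few lines. This is a minimal illustration rather than a training loop: the schedule endpoints (1e-4 and 0.02 with $T = 1000$) follow common DDPM defaults, and the four-element `action_chunk` is a hypothetical stand-in for a flattened $a_{0:K}$.

```python
import math
import random

def linear_betas(T, beta_min=1e-4, beta_max=0.02):
    """Linearly spaced per-step noise variances beta_1..beta_T."""
    return [beta_min + (beta_max - beta_min) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def make_training_example(x0, abar, rng):
    """Draw (x_t, t, eps): the network input and its regression target."""
    t = rng.randrange(1, len(abar) + 1)        # uniform step in 1..T
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]    # noise the network must predict
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, t, eps

rng = random.Random(0)
abar = alpha_bars(linear_betas(T=1000))
action_chunk = [0.1, -0.3, 0.25, 0.0]          # hypothetical flattened chunk
x_t, t, eps = make_training_example(action_chunk, abar, rng)
# The loss for this example would be ||eps - eps_theta(x_t, t, o)||^2,
# where eps_theta is the observation-conditioned denoiser (not defined here).
```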

DDIM: accelerated inference through non-Markovian processes

The standard DDPM reverse process is a Markov chain that must run all $T$ steps sequentially — each denoising step depends on the result of the previous step. With $T = 1000$ , this requires 1000 network forward passes to generate one sample, which is far too slow for real-time robot control.

Denoising Diffusion Implicit Models (DDIM; Song et al., 2020) achieve dramatic acceleration by replacing the Markov forward process with a non-Markovian one that admits a deterministic reverse chain. The key insight is that the DDPM training objective depends only on the marginals $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t)I)$, not on the chain being Markovian: many different forward processes share these marginals. DDIM picks a non-Markovian forward process with the same marginals, so a network trained with the DDPM loss can be reused unchanged, and its reverse update is:

x_{t-1} = \sqrt{\bar\alpha_{t-1}} \underbrace{\left(\frac{x_t - \sqrt{1-\bar\alpha_t}\,\varepsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right)}_{\text{predicted }x_0} + \sqrt{1-\bar\alpha_{t-1} - \sigma_t^2}\cdot \varepsilon_\theta(x_t, t) + \sigma_t \epsilon

where $\epsilon \sim \mathcal{N}(0, I)$ is fresh noise and $\sigma_t = 0$ gives a fully deterministic reverse process. The crucial property is that this reverse update can skip steps: instead of stepping from $t = T$ to $t = T-1$ to … to $t=0$, DDIM can jump from $t$ directly to $t - \Delta$ for any $\Delta > 1$ (replacing $\bar\alpha_{t-1}$ by $\bar\alpha_{t-\Delta}$ in the update), with the predicted $x_0$ serving as the bridge. This allows the reverse chain to use only a subsequence $\{t_1, t_2, \ldots, t_S\} \subset \{1, \ldots, T\}$ of size $S \ll T$, reducing the number of network evaluations from $T$ to $S$.

For robot policies, DDIM with $S = 10$ steps achieves generation quality comparable to full DDPM with $T = 100$, enabling roughly $10\times$ inference acceleration. At $S = 5$ steps, some quality degradation is visible on high-precision tasks, but the control frequency becomes compatible with real-time deployment. DDIM thus occupies the middle ground between DDPM (highest quality, too slow) and flow matching (comparable quality at the lowest step counts).
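The step-skipping update can be sketched for a scalar sample. To keep the example runnable without a trained network, it assumes an oracle noise predictor for a single known $x_0$ (the name `eps_oracle` and the schedule values are illustrative choices, not part of DDIM itself):

```python
import math

def ddim_step(x_t, eps_hat, abar_t, abar_prev):
    """One deterministic DDIM update (sigma_t = 0) for a scalar sample."""
    x0_hat = (x_t - math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(abar_t)
    return math.sqrt(abar_prev) * x0_hat + math.sqrt(1.0 - abar_prev) * eps_hat

# Linear-beta schedule with T = 100; abar[0] = 1 stands for the clean sample.
T = 100
abar = [1.0]
for i in range(T):
    beta = 1e-4 + (0.02 - 1e-4) * i / (T - 1)
    abar.append(abar[-1] * (1.0 - beta))

# Oracle noise predictor for one known x0: an assumption made so the
# example runs without a trained denoiser.
x0_true = 0.7

def eps_oracle(x_t, t):
    return (x_t - math.sqrt(abar[t]) * x0_true) / math.sqrt(1.0 - abar[t])

# Visit only S = 10 of the T = 100 steps: 100, 90, ..., 10, each jumping 10 down.
steps = list(range(T, 0, -10))
x = math.sqrt(abar[T]) * x0_true + math.sqrt(1.0 - abar[T]) * 1.3  # noisy start
for t in steps:
    x = ddim_step(x, eps_oracle(x, t), abar[t], abar[t - 10])

print(f"recovered x0 = {x:.6f} after {len(steps)} network evaluations")
```

With an exact noise prediction the skipped chain lands on $x_0$ regardless of the step count; with a learned predictor, larger jumps trade accuracy for speed.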

Conditioning and guidance

A diffusion policy is useful only if it can be precisely conditioned on the robot's current state. The conditioning signals relevant for manipulation are:

Visual observations — RGB or RGB-D frames from one or more cameras provide scene geometry, object identity, and spatial relationships. Typical architectures encode each camera frame through a pretrained or jointly trained visual backbone (ResNet, ViT, or DINO) to produce a sequence of spatial features. These features are injected into the denoiser through cross-attention, where the denoiser's query is the noisy action $a_t$ and the key-value pairs are the visual feature tokens.

Proprioception — joint positions, velocities, and wrist force-torque measurements ground the policy in the robot's actual configuration. Proprioceptive state is typically concatenated with the diffusion timestep embedding and injected via the denoiser's conditioning MLP or added to the action representation before denoising.

Language — task descriptions or instructions condition the policy on what the robot should do, enabling a single policy to handle multiple tasks. Language conditioning is often implemented through a pretrained language encoder (BERT, T5, or a language model backbone) whose output is used as additional cross-attention keys-values alongside the visual features.

Classifier-free guidance (CFG) strengthens conditioning at inference time without training a separate classifier. During training the condition $c$ is dropped (replaced by a null input) with probability $p_{\text{drop}}$, so a single network learns both the conditional denoiser $\varepsilon_\theta(a_t, t, c)$ and the unconditional denoiser $\varepsilon_\theta(a_t, t)$. At inference, the guided denoiser is:

\tilde\varepsilon_\theta(a_t, t, c) = (1 + w)\,\varepsilon_\theta(a_t, t, c) - w\,\varepsilon_\theta(a_t, t)

where $w > 0$ is the guidance weight. Higher $w$ produces samples more strongly conditioned on $c$ at the cost of reduced diversity — useful when the conditioning is precise and the policy should follow it tightly.
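As a small sketch of the guidance formula, the function below combines two denoiser outputs elementwise. The vectors `eps_c` and `eps_u` are hypothetical values standing in for the conditional and unconditional network outputs on the same noisy action:

```python
def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

# Hypothetical denoiser outputs for a 3-dimensional noisy action.
eps_c = [0.2, -0.5, 1.0]   # eps_theta(a_t, t, c)
eps_u = [0.0, -0.1, 0.4]   # eps_theta(a_t, t), condition dropped

no_guidance = guided_eps(eps_c, eps_u, w=0.0)  # reduces to eps_c
strong = guided_eps(eps_c, eps_u, w=2.0)       # extrapolates away from eps_u
```

At $w = 0$ the guided output is the plain conditional prediction; larger $w$ pushes the prediction further along the direction in which conditioning changed it, at the cost of two forward passes per denoising step unless the two branches are batched.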

Noise schedules and their practical effects

The noise schedule $\{\beta_t\}_{t=1}^T$ (equivalently $\{\bar\alpha_t\}$ ) controls how much information about the original sample is retained at each noise level. The schedule has important practical consequences for robot policies:

Linear schedules increase the per-step noise variance linearly: $\beta_t = \beta_{\min} + \frac{t-1}{T-1}(\beta_{\max} - \beta_{\min})$. These were the first proposed and work adequately, but they drive $\bar\alpha_t$ close to zero well before $t = T$, so many steps are spent on samples that are already almost pure noise, whereas most structure recovery happens at low noise levels.

Cosine schedules (Nichol and Dhariwal, 2021) define $\bar\alpha_t = f(t)/f(0)$ with $f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ and a small offset $s$. Under this schedule $\bar\alpha_t$ changes slowly near $t = 0$ (low noise) and near $t = T$ (high noise) and falls roughly linearly in between. Compared with the linear schedule, signal is destroyed more gradually, which concentrates denoising steps where they matter most. For action sequences, the cosine schedule tends to produce smoother trajectory generation and better performance on precise manipulation tasks.

Learned schedules parameterize the noise process and optimize the schedule jointly with the denoiser, in principle maximizing task performance subject to the constraints of the diffusion formalism. These are not yet standard in robot learning but represent an active research direction.
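The difference between the linear and cosine schedules is easiest to see by tabulating $\bar\alpha_t$. The sketch below assumes $T = 1000$, linear endpoints of 1e-4 and 0.02, and the offset $s = 0.008$ used by Nichol and Dhariwal, with the cosine form normalized by $f(0)$; their additional clipping of $\beta_t$ is omitted for brevity.

```python
import math

def linear_alpha_bar(T, beta_min=1e-4, beta_max=0.02):
    """alpha_bar_t for t = 1..T under linearly increasing beta_t."""
    out, prod = [], 1.0
    for t in range(1, T + 1):
        beta = beta_min + (t - 1) / (T - 1) * (beta_max - beta_min)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t) / f(0), f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

T = 1000
lin, cos_ = linear_alpha_bar(T), cosine_alpha_bar(T)
for t in (100, 500, 900):
    print(f"t={t:4d}  linear={lin[t - 1]:.4f}  cosine={cos_[t - 1]:.4f}")
```

Halfway through the process the cosine schedule still retains roughly half of the signal variance, while the linear schedule has already removed most of it.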

Flow matching: continuous-time transport

Flow matching (Lipman et al., 2022; Albergo and Vanden-Eijnden, 2022) provides a mathematically cleaner alternative to diffusion that achieves faster inference. Instead of a discrete-time Markov chain, flow matching defines a continuous-time ODE that transports samples from a source distribution $p_0 = \mathcal{N}(0, I)$ to the target distribution $p_1 = p_{\text{data}}$ :

\frac{dx_t}{dt} = v_\theta(x_t, t), \qquad t \in [0, 1]

where $v_\theta$ is the learned velocity field. A sample from the data distribution is obtained by integrating the ODE from $t = 0$ (noise) to $t = 1$ (data): $x_1 = x_0 + \int_0^1 v_\theta(x_t, t)\, dt$ .

Training matches the learned velocity field to the conditional velocity field defined by a transport path between paired noise-data samples. For the simplest (linear) transport path, $x_t = (1-t) x_0 + t x_1$ (interpolation between noise $x_0$ and data $x_1$ ), the conditional velocity is simply $v(x_t, t \mid x_0, x_1) = x_1 - x_0$ , and the training objective is:

\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t, x_0, x_1}\!\left[\| v_\theta(x_t, t) - (x_1 - x_0) \|^2\right]

where $x_0 \sim \mathcal{N}(0, I)$ , $x_1 \sim p_{\text{data}}$ , and $t \sim U[0, 1]$ . This objective is strikingly simple: the model learns to predict the straight-line velocity from the noise sample to the data sample, evaluated along the interpolated path.
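Both halves of the method, building a training target and sampling by Euler integration, fit in a short sketch. Since no trained network is available here, sampling uses a stand-in velocity field: if all data sat at a single point `mode`, the exact field would be $(\text{mode} - x)/(1 - t)$. That oracle and the two-dimensional `mode` are assumptions for illustration.

```python
import random

def fm_training_pair(x1, rng):
    """Return (x_t, t, target) for one data sample x1 under the linear path."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # noise endpoint
    t = rng.random()                                       # t ~ U[0, 1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the path
    target = [b - a for a, b in zip(x0, x1)]               # conditional velocity
    return x_t, t, target

def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for k in range(n_steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Stand-in for a trained v_theta (an assumption, not a learned model).
mode = [0.5, -0.2]

def oracle(x, t):
    return [(m - xi) / (1.0 - t) for m, xi in zip(mode, x)]

rng = random.Random(0)
x_t, t, target = fm_training_pair(mode, rng)
sample = euler_sample(oracle, [rng.gauss(0.0, 1.0) for _ in mode], n_steps=5)
print(sample)
```

Because the trajectory in this toy case is a straight line, the velocity is constant along it and Euler integration is exact even with very few steps, which is the intuition behind low-step flow matching inference.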

The key advantages of flow matching for robot policies are: (1) fewer inference steps: the ODE can be integrated in as few as 5–10 steps with a fixed-step solver (first-order Euler, or a higher-order Runge-Kutta method) while maintaining high-quality generation, compared to 50–1000 steps for standard DDPM inference; (2) cleaner conditioning: the simple conditional velocity formulation makes it straightforward to condition the velocity field on arbitrary observation features via the same cross-attention mechanisms; and (3) more stable training: the regression target $x_1 - x_0$ has the same scale at every $t$, so gradient magnitudes are well-conditioned across the full trajectory, whereas diffusion losses can become poorly conditioned near $t = 0$.

Optimal Transport Conditional Flow Matching

A key limitation of the basic flow matching formulation is that, when noise and data samples are paired independently, the linear interpolation paths $x_t = (1-t)x_0 + tx_1$ can cross: the path from $x_0^{(A)}$ to $x_1^{(A)}$ and the path from $x_0^{(B)}$ to $x_1^{(B)}$ may pass through the same point $x_t$ at the same time $t \in (0, 1)$. At that point the regression target asks the velocity field to point toward $x_1^{(A)}$ (from one path) and toward $x_1^{(B)}$ (from the other). The squared-error loss resolves the conflict by averaging, so the learned field approximates the conditional expectation $\mathbb{E}[x_1 - x_0 \mid x_t]$. This is still a valid flow from $p_0$ to $p_1$, but its trajectories are curved rather than straight, which demands more ODE solver steps and increases discretization error at low step counts.

Optimal Transport Conditional Flow Matching (OT-CFM; Tong et al., 2023) reduces crossings by choosing the coupling between noise samples and data samples to minimize the total transport cost. The optimal transport plan is the joint distribution over $(x_0, x_1)$ that minimizes the expected squared displacement $\mathbb{E}[\|x_0 - x_1\|^2]$, subject to the constraint that its marginals match the noise and data distributions. Under the exact OT coupling, the straight-line interpolation paths do not cross. In practice the coupling is approximated within each minibatch, where each data sample receives a unique noise partner, so crossings are reduced rather than eliminated. This translates to straighter ODE trajectories, allowing accurate integration in fewer steps. The minibatch OT problem can be solved exactly for typical batch sizes or approximately with the Sinkhorn algorithm, adding modest overhead to the training pipeline. Note that the $\pi_0$ model (Black et al., 2024) does not rely on an OT pairing of samples: it pairs independently sampled Gaussian noise with action chunks along the linear path (which that paper also calls the optimal transport path) and reaches a low step count with plain Euler integration.
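A minibatch coupling can be sketched with a brute-force search over permutations. This is only feasible for tiny batches and is used here purely for illustration; practical implementations use an assignment solver or Sinkhorn iterations. The bimodal `data` batch is a hypothetical stand-in for action chunks.

```python
import itertools
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def ot_pairing(noise, data):
    """Permutation of data indices minimizing total squared distance to noise.

    Brute force over all permutations: illustration only, tiny batches only.
    """
    n = len(noise)
    return min(
        itertools.permutations(range(n)),
        key=lambda perm: sum(sq_dist(noise[i], data[perm[i]]) for i in range(n)),
    )

rng = random.Random(0)
noise = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(6)]
# Hypothetical bimodal data: three samples near each of two modes.
data = [[-2 + rng.gauss(0, 0.1), 0.0] for _ in range(3)] + \
       [[+2 + rng.gauss(0, 0.1), 0.0] for _ in range(3)]

perm = ot_pairing(noise, data)
cost_independent = sum(sq_dist(noise[i], data[i]) for i in range(6))
cost_ot = sum(sq_dist(noise[i], data[perm[i]]) for i in range(6))
print(f"independent pairing cost: {cost_independent:.2f}, OT pairing cost: {cost_ot:.2f}")
# Training would then use the pairs (noise[i], data[perm[i]]) in the FM loss.
```

The OT pairing tends to send noise samples on the left to the left mode and those on the right to the right mode, which is what keeps the interpolation paths from crossing.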

The $\pi_0$ model uses flow matching with a pre-trained vision-language model backbone and generates action chunks with a small fixed number of Euler steps (the paper reports 10) for robots controlled at up to 50 Hz, a combination of speed and expressive power that is difficult to match with standard DDPM sampling.

Inference latency analysis

The choice of generative architecture has direct, quantifiable consequences for robot control frequency. The following comparison uses a representative 100M-parameter denoising/velocity network, evaluated on a single NVIDIA RTX 4090 with a batch size of 1 (single environment, no parallelism), with a 60-dimensional action chunk ( $K=10$ actions, 6 DoF each):

| Architecture | Forward passes | Per-pass latency | Total latency | Max control frequency |
|---|---|---|---|---|
| ACT (decoded one token per pass) | 10 (one per token) | ~1 ms | ~10 ms | ~100 Hz |
| DDPM ($T=100$ steps) | 100 | ~2 ms | ~200 ms | ~5 Hz |
| DDIM ($S=10$ steps) | 10 | ~2 ms | ~20 ms | ~50 Hz |
| OT-CFM Flow Matching ($S=5$ steps) | 5 | ~2 ms | ~10 ms | ~100 Hz |

These numbers reveal why the field has moved from DDPM to DDIM and then to flow matching: DDPM at 5 Hz is too slow for most manipulation tasks requiring reactive closed-loop control (the robot arm moves on 100 ms timescales), while DDIM and flow matching recover real-time performance. ACT ties with flow matching at ~10 ms in this table because it is charged one decoder pass per action token. The published ACT architecture decodes the whole chunk in a single parallel forward pass, in which case it is the fastest option, trading some multimodal expressiveness for speed.
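The table entries follow from simple arithmetic, reproduced below as a sketch. It assumes forward passes run strictly one after another and that nothing else (image encoding, data transfer) adds to the latency; the pass counts and per-pass times are the ones listed in the table.

```python
def chunk_latency_ms(forward_passes, per_pass_ms):
    """Total time to generate one action chunk, assuming sequential passes."""
    return forward_passes * per_pass_ms

def max_generation_rate_hz(latency_ms):
    """Chunks per second if generation runs back to back."""
    return 1000.0 / latency_ms

# (forward passes, per-pass latency in ms), taken from the table above.
configs = {
    "ACT": (10, 1.0),
    "DDPM (T=100)": (100, 2.0),
    "DDIM (S=10)": (10, 2.0),
    "Flow matching (S=5)": (5, 2.0),
}

rates = {}
for name, (passes, ms) in configs.items():
    latency = chunk_latency_ms(passes, ms)
    rates[name] = max_generation_rate_hz(latency)
    print(f"{name:22s} {latency:6.1f} ms  {rates[name]:6.1f} Hz")
```

These rates are chunk generation rates. Because each chunk contains $K = 10$ actions, a policy that executes several actions per chunk before regenerating can drive the robot at a higher per-action rate than the figure in the last column.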

For legged locomotion, which operates at 500–1000 Hz low-level control frequencies, neither diffusion nor flow matching is currently viable as the direct policy output — proprioceptive joint-space policies (Week 6) remain necessary at the lowest level. Diffusion and flow matching policies operate at the task-command frequency (5–50 Hz), generating high-level goals or action chunks that are tracked by the fast joint-space controller beneath them.

Diffusion versus flow matching for robot policies

The two frameworks have complementary practical profiles:

| Dimension | Diffusion (DDPM) | DDIM | Flow Matching (OT-CFM) |
|---|---|---|---|
| Inference steps | 50–1000 | 5–20 | 5–10 |
| Sample quality | Highest | Good | Comparable |
| Training objective | Noise prediction | Same (reuses DDPM weights) | Velocity regression |
| Gradient stability | Unstable near $t=0$ | Same | Stable throughout |
| Max control frequency | ~5 Hz | ~50 Hz | ~100 Hz |

For real-time control at 50+ Hz, flow matching is now preferred. DDIM provides a practical middle ground that reuses trained DDPM weights without retraining. For offline trajectory generation (motion planning, dataset augmentation), DDPM with a high step count can produce the smoothest results. The $\pi_0$ model's adoption of flow matching over diffusion reflects this tradeoff in a practical system where latency directly affects manipulation quality.

GenAI context: trajectory generation as latent reasoning

The connection between generative robot policies and generative reasoning in language models is structural. Both involve:

learning a distribution over valid completions (action sequences / token sequences) conditioned on a context (observations / prompts),
the need to represent multimodal distributions (multiple valid grasps / multiple valid answers), and
a generative process that progressively refines an initial sample (denoising / chain-of-thought reasoning).

Flow matching over action sequences is analogous to continuous-time reasoning processes in diffusion language models, where the model iteratively refines a draft sequence rather than generating token-by-token. Both involve training a model to predict the direction of improvement given a partially resolved state, whether that state is a noisy action chunk or a partially correct reasoning trace.

The conditioning story also aligns: classifier-free guidance in diffusion is structurally similar to the KL penalty in RLHF — both are mechanisms for trading diversity against faithfulness to a conditioning signal, with a scalar weight controlling the tradeoff.

Key takeaways

Generative policies address the multimodal action distribution problem by modeling the full conditional distribution $p_\theta(a_{0:K} \mid o)$ rather than its mean.
Denoising diffusion models learn to reverse a Gaussian noising process via score matching; the DDPM training objective predicts the added noise, which is equivalent to a weighted sum of denoising score matching losses.
Conditional diffusion policies inject observation context (visual, proprioceptive, language) via cross-attention in the denoiser.
Noise schedules control the allocation of denoising capacity and affect trajectory smoothness and precision.
Flow matching provides a continuous-time ODE formulation with a simpler training objective (velocity regression on linear interpolation paths) and requires 5–10 inference steps versus 50–1000 for DDPM, enabling 50+ Hz control.
The $\pi_0$ architecture demonstrates that flow matching over pre-trained VLM features enables both high-quality manipulation and real-time inference.

Conceptual questions

The DDPM objective $\mathcal{L}_{\text{DDPM}}$ trains the network to predict the added noise $\varepsilon$ at a random diffusion step $t$. The target $\varepsilon$ has unit variance at every step, but the perturbation actually applied to the data, $\sqrt{1-\bar\alpha_t}\,\varepsilon$, is small at small $t$ (low noise, nearly clean data) and dominant at large $t$ (high noise, nearly pure Gaussian), so the same $\varepsilon$ error corresponds to very different errors in the implied clean sample. Analyze how this asymmetry affects training: which noise levels does the denoiser learn to denoise accurately, and which does it under-fit? Explain how the choice of noise schedule (for example cosine versus linear) changes the mix of noise levels seen in training, and derive a loss weighting that balances the contribution of each noise level.
Classifier-free guidance at inference combines conditional and unconditional denoisers: $\tilde\varepsilon = (1+w)\varepsilon(a_t, t, c) - w\varepsilon(a_t, t)$ . For a diffusion policy conditioned on a visual observation of a specific object pose, analyze what happens to the generated action distribution as $w \to \infty$ . Using the interpretation of the guidance formula as score interpolation, show that high guidance weight increases the policy's sensitivity to the conditioning signal at the cost of diversity. Identify a failure mode that would emerge specifically for robot policies with very high guidance weight during contact-rich tasks.
The flow matching objective $\mathcal{L}_{\text{FM}}$ trains the velocity field to predict $x_1 - x_0$ along linear interpolation paths. For two data modes $x_1^{(A)}$ and $x_1^{(B)}$ that are far apart in action space (two different grasping orientations), analyze the geometry of the learned velocity field at the midpoint $t = 0.5$ . Does the learned velocity correctly transport samples toward both modes? If not, explain the failure mode and describe how optimal transport flow matching (OT-CFM) would address it.
A manipulation task requires generating a 2-second action chunk at 50 Hz (100 actions, 6 DoF each = 600-dimensional output) using flow matching with 5 ODE steps. Calculate the wall-clock inference time required if the flow matching network has 100M parameters and a single forward pass takes 2 ms. Is 50 Hz real-time control achievable? If not, what architectural optimizations (network compression, output dimension reduction, parallelism) would be needed, and what is the minimum achievable latency?
Both diffusion policies and ACT with CVAE conditioning model multimodal distributions over action chunks. Compare the two approaches on a task where the robot must choose between two qualitatively different strategies at the start of each episode (strategy A: overhead grasp; strategy B: side grasp). For each approach, describe: (a) how the strategy is committed to at the beginning of the episode, (b) whether the policy can switch strategies mid-episode if the first attempt fails, and (c) what training data distribution is needed for the policy to correctly represent both strategies. Which approach is more appropriate for this specific task structure?

Solutions

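DDPM loss asymmetry. The target $\varepsilon$ has unit variance at every $t$, so the asymmetry lies in what an $\varepsilon$ error means. Writing $\hat{x}_0$ for the clean sample implied by the prediction, $\|\varepsilon - \varepsilon_\theta\|^2 = \mathrm{SNR}(t)\,\|x_0 - \hat{x}_0\|^2$ with $\mathrm{SNR}(t) = \bar\alpha_t / (1 - \bar\alpha_t)$. The uniform $\varepsilon$ loss therefore weights clean-sample error heavily at small $t$ (high SNR) and very lightly at large $t$, where the coarse structure of the sample is decided, so high-noise levels are the ones at risk of under-fitting in $x_0$ terms. Cosine schedules shift more of the training steps to intermediate SNR instead of near-pure noise, rebalancing where capacity is spent. A balanced weighting rescales each level by a function of its SNR, for example min-SNR weighting, which caps the effective weight at $\min(\mathrm{SNR}(t), \gamma)$ so that low-noise levels do not dominate.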
CFG $w \to \infty$ . The guided score is dominated by the (conditional − unconditional) difference, collapsing the action distribution onto the single highest-density mode given the observation and destroying diversity. For contact-rich tasks this makes the policy overconfident and effectively deterministic, losing the small corrective multimodality needed under contact uncertainty — brittle behavior that cannot represent alternative recovery actions.
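Flow matching between far modes. With independent pairing, the learned velocity at a point $x_t$ is the conditional expectation $\mathbb{E}[x_1 - x_0 \mid x_t]$, a posterior-weighted average over both modes. At $t = 0.5$, points roughly equidistant from the two modes receive a velocity that points toward their mean, while points already closer to one mode are steered mostly toward it. Integrated exactly, this field still transports noise to both modes, but trajectories start out heading toward the gap and then bend, so a coarse solver with few steps can leave samples between modes. Optimal-transport flow matching (OT-CFM) pairs nearby noise and data samples to reduce crossing paths, yielding straighter, mode-consistent velocity fields that tolerate fewer steps.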
Flow-matching latency. Five ODE steps × 2 ms = 10 ms to generate the whole 2 s chunk, well under the 20 ms budget — because generation is once per chunk, not per control step, 50 Hz is achievable via chunked execution. If per-step generation were required it would be borderline; optimizations include step distillation (1–2 steps), a smaller network, or output-dimension reduction, with a floor near a single forward pass (~2 ms).
Diffusion vs ACT-CVAE strategy choice. Both can represent the two strategies if the demos contain both. (a) Commitment: diffusion commits via the sampled denoising trajectory at chunk start; the CVAE commits via the sampled $z$ . (b) Mid-episode switching: both can switch when re-sampled at the next chunk if the new observation favors the other mode — neither is locked unless executed fully open-loop. (c) Data: balanced demonstrations of both strategies. Diffusion typically renders sharp, well-separated multimodality most faithfully, so it is the better fit here.

Looking ahead

Diffusion and flow matching policies generate high-quality action sequences conditioned on visual and proprioceptive observations. The next frontier is integrating semantic understanding and language grounding into this generation process, moving from policies that act on raw sensor observations to policies that understand task descriptions, reason about object properties, and generalize across tasks through language.

Week 10: Vision-Language-Action Models. We examine how pretrained vision-language foundations ($\pi_0$, GR00T, SMOL-VLA) are adapted for robot control, focusing on the architectural choices that allow a single model to perceive, reason about, and act in diverse robotic manipulation scenarios.
