Purpose of this lecture
The previous lecture introduced the diffusion policy framework as a way to generate action sequences through iterative denoising. This lecture goes deeper: we derive both diffusion and flow matching from first principles, understand why generative modeling is the right paradigm for multimodal manipulation behavior, and analyze the architectural and training choices that determine policy performance in practice.
Flow matching is a continuous-time generalization that subsumes diffusion models and provides a cleaner mathematical structure, faster inference, and easier conditioning on complex multi-modal observations. Both frameworks have been demonstrated in state-of-the-art robot policies (diffusion in Diffusion Policy (Chi et al., 2023), flow matching in $\pi_0$ (Black et al., 2024)), and understanding their relationship is essential for reasoning about why and when to use each. The common thread is that robot trajectories are not unique: for any manipulation task, many valid action sequences exist, and the policy must represent this distribution over behaviors, not merely a single best action.
Why generative policies for manipulation
Consider a bimanual task: pick up a bottle with one hand while holding a cup steady with the other. At each point in the grasp approach, there is a distribution over valid arm trajectories — different approach angles and speeds are all valid — but the actions of the two arms must be jointly consistent. A reactive policy that models each arm independently cannot capture this joint distribution, and one that averages over all valid approach angles produces an arm trajectory that goes straight through the bottle. The problem is not that no good action exists; it is that the policy must commit to a mode and execute it consistently.
More formally, the distribution over expert action sequences given an observation, $p(A \mid o)$, is multimodal: it has multiple well-separated modes corresponding to distinct manipulation strategies. The mean of this distribution generally lies between the modes rather than inside any of them. Regression-based policies trained with mean squared error converge to this conditional mean and produce averaging artifacts.
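The averaging artifact can be illustrated with a toy one-dimensional example. This is only a sketch: the "approach angle" values below are hypothetical numbers chosen for illustration, not data from any real task.

```python
import statistics

# Hypothetical 1-D "approach angle" demonstrations (radians): half the experts
# approach from the left (around -1.0), half from the right (around +1.0).
left_mode = [-1.05, -1.00, -0.95]
right_mode = [0.95, 1.00, 1.05]
demos = left_mode + right_mode

def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - y) ** 2 for y in targets) / len(targets)

# The constant that minimizes MSE is the sample mean of the targets.
mse_optimal = statistics.fmean(demos)

# How far is the MSE-optimal action from the closest action an expert ever took?
gap_to_nearest_demo = min(abs(mse_optimal - y) for y in demos)

print(f"MSE-optimal action: {mse_optimal:+.3f}")
print(f"distance to nearest expert action: {gap_to_nearest_demo:.3f}")
print(f"MSE at the mean: {mse(mse_optimal, demos):.3f}")
print(f"MSE at a valid mode (+1.0): {mse(1.0, demos):.3f}")
```

The regression optimum sits between the two modes, far from every demonstrated action, yet it scores a lower MSE than committing to either valid mode.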
Generative policies model the full conditional distribution $p(A \mid o)$, allowing samples that are consistent with a single mode. The challenge is learning and sampling from this high-dimensional distribution efficiently.
Score matching and denoising diffusion
The score function of a distribution $p(x)$ is the gradient of its log-density: $\nabla_x \log p(x)$. Score matching trains a neural network $s_\theta(x)$ to estimate this gradient, and following the estimated score moves samples from low-density to high-density regions of the distribution, that is, from noise toward data.
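As a concrete instance, the score of a scalar Gaussian has a closed form, $-(x - \mu)/\sigma^2$. The sketch below checks that formula against a finite-difference derivative and then follows the score uphill. The function names and the numeric values are illustrative choices, and plain gradient ascent is used here as a simplification (practical samplers typically add noise at each step).

```python
import math

def gaussian_log_density(x, mean, std):
    """log N(x; mean, std^2) for a scalar x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def gaussian_score(x, mean, std):
    """Analytic score: d/dx log N(x; mean, std^2) = -(x - mean) / std^2."""
    return -(x - mean) / std ** 2

def finite_difference_score(x, mean, std, h=1e-5):
    """Central-difference estimate of the same derivative, as a sanity check."""
    upper = gaussian_log_density(x + h, mean, std)
    lower = gaussian_log_density(x - h, mean, std)
    return (upper - lower) / (2.0 * h)

def follow_score(x, mean, std, step_size=0.05, num_steps=50):
    """Deterministic gradient ascent on the log-density using the score."""
    for _ in range(num_steps):
        x = x + step_size * gaussian_score(x, mean, std)
    return x

mean, std, x_start = 2.0, 0.5, -1.0
x_end = follow_score(x_start, mean, std)
print(f"analytic score at start:  {gaussian_score(x_start, mean, std):.4f}")
print(f"numerical score at start: {finite_difference_score(x_start, mean, std):.4f}")
print(f"start {x_start:.2f} -> end {x_end:.4f} (density peak at {mean})")
```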
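Denoising Score Matching (DSM; Vincent, 2011) avoids needing the score of the data distribution $p(x)$ itself, which is unknown because only samples are available, by instead estimating the score of a noisy version of $p$. For a Gaussian kernel $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I)$, the marginal noisy distribution is $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p(x)\, dx$, and its score is:
$$\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) = \mathbb{E}_{x \sim q_\sigma(x \mid \tilde{x})}\left[\frac{x - \tilde{x}}{\sigma^2}\right]$$
Writing $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, the per-sample regression target is $(x - \tilde{x})/\sigma^2 = -\epsilon/\sigma$. Training a network to predict $-\epsilon/\sigma$ given $\tilde{x}$ and $\sigma$ is equivalent to denoising: the network learns to predict the direction toward the clean sample from a noisy one.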
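DDPMs (Ho et al., 2020) make this into a practical algorithm by: (1) defining a forward process $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I)$ that gradually corrupts the data over $T$ timesteps, with $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ and $\bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$, and (2) training a network $\epsilon_\theta(x_t, t)$ to predict the added noise $\epsilon$ from the noisy sample $x_t$ and diffusion step $t$. The simple MSE objective
$$\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{x_0,\, \epsilon,\, t}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t\right) \right\|^2\right]$$
is equivalent to a weighted sum of denoising score matching objectives at all noise levels. The trained denoiser can then be used to run the reverse Markov chain from $x_T \sim \mathcal{N}(0, I)$ to a sample $x_0$.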
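The training recipe can be sketched in a few lines. This is a minimal illustration using only the standard library: it assumes the closed-form noising $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ with $\bar\alpha_t$ the cumulative product of $1 - \beta_s$, and a linear $\beta$ schedule from $10^{-4}$ to $0.02$ as in Ho et al. (2020). The "models" are hypothetical stand-ins (a zero predictor and an oracle that knows the clean sample), not a trained network, and the 4-dimensional action chunk is made up.

```python
import math
import random

def make_alpha_bar(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s); step t is stored at index t-1."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def q_sample(x0, alpha_bar_t, eps):
    """Closed-form forward noising: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def simple_loss_sample(eps_model, x0, alpha_bar, rng):
    """One Monte Carlo sample of the noise-prediction MSE (averaged over dimensions)."""
    t = rng.randrange(1, len(alpha_bar) + 1)      # uniform diffusion step in 1..T
    eps = [rng.gauss(0.0, 1.0) for _ in x0]       # the noise the network must predict
    x_t = q_sample(x0, alpha_bar[t - 1], eps)
    eps_hat = eps_model(x_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat)) / len(x0)

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = make_alpha_bar(betas)
action_chunk = [0.3, -0.1, 0.8, 0.0]              # hypothetical flattened action chunk

def zero_model(x_t, t):
    """Untrained stand-in: always predicts zero noise."""
    return [0.0] * len(x_t)

def oracle_model(x_t, t):
    """Cheating stand-in that knows x0 and inverts the noising formula exactly."""
    ab = alpha_bar[t - 1]
    return [(x - math.sqrt(ab) * c) / math.sqrt(1.0 - ab) for x, c in zip(x_t, action_chunk)]

rng_a, rng_b = random.Random(0), random.Random(0)
loss_zero = sum(simple_loss_sample(zero_model, action_chunk, alpha_bar, rng_a) for _ in range(200)) / 200
loss_oracle = sum(simple_loss_sample(oracle_model, action_chunk, alpha_bar, rng_b) for _ in range(200)) / 200
print(f"zero predictor loss:   {loss_zero:.3f}")
print(f"oracle predictor loss: {loss_oracle:.2e}")
```

The zero predictor's loss sits near 1 (the variance of the noise), while a predictor that recovers the noise exactly drives the loss to zero, which is the target a trained denoiser approaches.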
For robot policies, $x_0$ is an action chunk $A$, the diffusion network $\epsilon_\theta(x_t, t, o)$ is conditioned on the observation $o$, and inference runs the reverse chain from noise to a coherent action sequence: $x_T \to x_{T-1} \to \cdots \to x_0 = A$.
DDIM: accelerated inference through non-Markovian processes
The standard DDPM reverse process is a Markov chain that must run all $T$ steps sequentially, since each denoising step depends on the result of the previous one. With $T = 1000$, this requires 1000 network forward passes to generate one sample, which is far too slow for real-time robot control.
Denoising Diffusion Implicit Models (DDIM; Song et al., 2020) achieve dramatic acceleration by replacing the Markov forward process with a non-Markovian one. The key insight is that the marginal distributions $q(x_t \mid x_0)$ that define the noise corruption at each step do not require the chain to be Markovian: they can be realized by many different forward processes, not just the sequential Markov chain. DDIM defines a non-Markovian forward process with the same marginals whose reverse update is:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, \hat{x}_0 + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2}\; \epsilon_\theta(x_t, t) + \sigma_t z, \qquad \hat{x}_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}$$
where $z \sim \mathcal{N}(0, I)$ and $\sigma_t = 0$ gives a fully deterministic reverse process. The crucial property is that this reverse update can skip steps: instead of stepping from $x_t$ to $x_{t-1}$ to … to $x_0$, DDIM can jump from $x_t$ directly to $x_s$ for any $s < t$, with the predicted $\hat{x}_0$ serving as the bridge. This allows the reverse chain to use only a subsequence of size $S \ll T$, reducing the number of network evaluations from $T$ to $S$.
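The step-skipping idea can be sketched as follows, assuming the deterministic ($\sigma = 0$) DDIM update: predict the clean sample from the current noise estimate, then re-noise it to an earlier step $s$ with the same noise estimate. The linear schedule, the evenly spaced step subsequence, and the oracle noise model for a single known target are all illustrative assumptions; a real policy would call a trained network instead.

```python
import math

def linear_alpha_bar(T, beta_min=1e-4, beta_max=0.02):
    """alpha_bar_t for a linear beta schedule; step t is stored at index t-1."""
    alpha_bar, running = [], 1.0
    for i in range(T):
        running *= 1.0 - (beta_min + (beta_max - beta_min) * i / (T - 1))
        alpha_bar.append(running)
    return alpha_bar

def ddim_step(x_t, eps_hat, ab_t, ab_s):
    """Deterministic DDIM jump from step t to an earlier step s (ab = alpha_bar)."""
    x0_hat = [(x - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t) for x, e in zip(x_t, eps_hat)]
    return [math.sqrt(ab_s) * c + math.sqrt(1.0 - ab_s) * e for c, e in zip(x0_hat, eps_hat)]

def ddim_sample(eps_model, x_T, alpha_bar, num_steps):
    """Run the reverse chain on an evenly spaced subsequence of num_steps steps."""
    T = len(alpha_bar)
    steps = sorted({max(1, round(T * (i + 1) / num_steps)) for i in range(num_steps)}, reverse=True)
    x, calls = list(x_T), 0
    for i, t in enumerate(steps):
        s = steps[i + 1] if i + 1 < len(steps) else 0   # step 0 means the clean sample
        ab_s = alpha_bar[s - 1] if s > 0 else 1.0
        x = ddim_step(x, eps_model(x, t), alpha_bar[t - 1], ab_s)
        calls += 1
    return x, calls

alpha_bar = linear_alpha_bar(1000)
target = [0.5, -0.2]                                    # hypothetical clean action

def oracle_eps(x_t, t):
    """Exact noise predictor when the data distribution is a single point (illustration only)."""
    ab = alpha_bar[t - 1]
    return [(x - math.sqrt(ab) * c) / math.sqrt(1.0 - ab) for x, c in zip(x_t, target)]

sample, num_calls = ddim_sample(oracle_eps, [1.3, -0.7], alpha_bar, num_steps=10)
print(f"network calls: {num_calls}, sample: {[round(v, 6) for v in sample]}")
```

With a perfect noise predictor the 10-step chain lands on the target using 10 network calls instead of 1000; with a learned predictor, the quality at a given step count depends on how accurate the noise estimates are.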
For robot policies, DDIM with $S = 100$ steps achieves generation quality comparable to full DDPM with $T = 1000$, enabling roughly 10× inference acceleration. At $S = 10$ steps, some quality degradation is visible on high-precision tasks, but the control frequency becomes compatible with real-time deployment. DDIM thus occupies the middle ground between DDPM (highest quality, too slow) and flow matching (fastest, comparable quality at low step count).
Conditioning and guidance
A diffusion policy is useful only if it can be precisely conditioned on the robot's current state. The conditioning signals relevant for manipulation are:
Visual observations — RGB or RGB-D frames from one or more cameras provide scene geometry, object identity, and spatial relationships. Typical architectures encode each camera frame through a pretrained or jointly trained visual backbone (ResNet, ViT, or DINO) to produce a sequence of spatial features. These features are injected into the denoiser through cross-attention, where the denoiser's query is the noisy action and the key-value pairs are the visual feature tokens.
Proprioception — joint positions, velocities, and wrist force-torque measurements ground the policy in the robot's actual configuration. Proprioceptive state is typically concatenated with the diffusion timestep embedding and injected via the denoiser's conditioning MLP or added to the action representation before denoising.
Language — task descriptions or instructions condition the policy on what the robot should do, enabling a single policy to handle multiple tasks. Language conditioning is often implemented through a pretrained language encoder (BERT, T5, or a language model backbone) whose output is used as additional cross-attention keys-values alongside the visual features.
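Classifier-free guidance (CFG) provides a mechanism to strengthen conditioning at inference time without training a separate classifier. During training, the condition $c$ is dropped with probability $p_{\text{uncond}}$, so that a single network jointly learns the unconditional denoiser $\epsilon_\theta(x_t, t, \varnothing)$ and the conditional denoiser $\epsilon_\theta(x_t, t, c)$. At inference, the guided denoiser is:
$$\tilde\epsilon_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + w \left[ \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing) \right]$$
where $w$ is the guidance weight ($w = 1$ recovers the purely conditional denoiser). Higher $w$ produces samples more strongly conditioned on $c$ at the cost of reduced diversity, which is useful when the conditioning is precise and the policy should follow it tightly.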
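Both halves of classifier-free guidance are short enough to sketch directly. The snippet assumes the common parameterization in which the guided prediction is the unconditional prediction plus $w$ times the difference between conditional and unconditional predictions; other papers use an equivalent form with a shifted weight. The function names and the `None` null token are illustrative choices.

```python
import random

def maybe_drop_condition(cond, p_uncond, rng, null_cond=None):
    """Training-time condition dropout: with probability p_uncond, feed a null condition."""
    return null_cond if rng.random() < p_uncond else cond

def guided_eps(eps_cond, eps_uncond, w):
    """Inference-time guidance: eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Hypothetical denoiser outputs for one noisy action.
eps_cond, eps_uncond = [1.0, 0.0], [0.0, 0.0]
for w in (0.0, 1.0, 3.0):
    print(f"w = {w}: {guided_eps(eps_cond, eps_uncond, w)}")

# Roughly p_uncond of the training examples see the null condition.
rng = random.Random(0)
dropped = sum(maybe_drop_condition("obs", 0.1, rng) is None for _ in range(10_000))
print(f"fraction of conditions dropped: {dropped / 10_000:.3f}")
```

Setting `w = 0` ignores the condition, `w = 1` reproduces the conditional prediction, and `w > 1` extrapolates past it, which is where the loss of diversity comes from.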
Noise schedules and their practical effects
The noise schedule $\beta_t$ (equivalently $\bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$) controls how much information about the original sample is retained at each noise level. The schedule has important practical consequences for robot policies:
Linear schedules add noise in equal increments: $\beta_t = \beta_1 + \frac{t - 1}{T - 1} (\beta_T - \beta_1)$. These were the first proposed and work adequately, but they allocate denoising capacity roughly equally across all noise levels, whereas most structure recovery happens at low noise levels.
Cosine schedules (Nichol and Dhariwal, 2021) define $\bar\alpha_t = f(t)/f(0)$ with $f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ and a small offset $s$, so that $\bar\alpha_t$ changes gently near $t = 0$ (low noise) and near $t = T$ (high noise) and falls roughly linearly in between. Compared with a linear schedule, this avoids destroying the signal too early in the forward process and concentrates denoising steps where they matter most. For action sequences, the cosine schedule tends to produce smoother trajectory generation and better performance on precise manipulation tasks.
Learned schedules parameterize the noise process and optimize the schedule jointly with the denoiser, in principle maximizing task performance subject to the constraints of the diffusion formalism. These are not yet standard in robot learning but represent an active research direction.
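The difference between the linear and cosine schedules is easiest to see by tabulating the retained signal level $\bar\alpha_t$ (the cumulative product of $1 - \beta_s$). The sketch below assumes the linear range $10^{-4}$ to $0.02$ from Ho et al. (2020) and the cosine form $\bar\alpha_t = f(t)/f(0)$, $f(t) = \cos^2\big(\tfrac{t/T + s}{1 + s} \cdot \tfrac{\pi}{2}\big)$ with offset $s = 0.008$ from Nichol and Dhariwal (2021); the choice of $T = 1000$ is illustrative.

```python
import math

def linear_alpha_bar(T, beta_min=1e-4, beta_max=0.02):
    """alpha_bar_t under a linear beta schedule; step t is stored at index t-1."""
    alpha_bar, running = [], 1.0
    for i in range(T):
        running *= 1.0 - (beta_min + (beta_max - beta_min) * i / (T - 1))
        alpha_bar.append(running)
    return alpha_bar

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

T = 1000
lin, cos = linear_alpha_bar(T), cosine_alpha_bar(T)
print(" step   linear   cosine")
for t in (T // 4, T // 2, 3 * T // 4, T):
    print(f"{t:5d}   {lin[t - 1]:.4f}   {cos[t - 1]:.4f}")
```

At the midpoint of the chain the linear schedule has already removed most of the signal, while the cosine schedule still retains about half of it, which is the sense in which the cosine schedule spends more of its steps on informative noise levels.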
Flow matching: continuous-time transport
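Flow matching (Lipman et al., 2022; Albergo and Vanden-Eijnden, 2022) provides a mathematically cleaner alternative to diffusion that achieves faster inference. Instead of a discrete-time Markov chain, flow matching defines a continuous-time ODE that transports samples from a source distribution $p_0 = \mathcal{N}(0, I)$ to the target distribution $p_1 = p_{\text{data}}$:
$$\frac{dx_t}{dt} = v_\theta(x_t, t)$$
where $v_\theta$ is the learned velocity field. Note that the time convention is reversed relative to the diffusion sections above: here $x_0$ is noise and $x_1$ is data. A sample from the data distribution is obtained by integrating the ODE from $t = 0$ (noise) to $t = 1$ (data): $x_1 = x_0 + \int_0^1 v_\theta(x_t, t)\, dt$.
Training matches the learned velocity field to the conditional velocity field defined by a transport path between paired noise-data samples. For the simplest (linear) transport path, $x_t = (1 - t)\, x_0 + t\, x_1$ (interpolation between noise $x_0$ and data $x_1$), the conditional velocity is simply $u_t(x_t \mid x_0, x_1) = x_1 - x_0$, and the training objective is:
$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}\left[\left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2\right]$$
where $t \sim \mathcal{U}[0, 1]$, $x_0 \sim \mathcal{N}(0, I)$, and $x_1 \sim p_{\text{data}}$. This objective is strikingly simple: the model learns to predict the straight-line velocity from the noise sample to the data sample, evaluated along the interpolated path.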
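The flow matching objective translates almost line for line into code. The sketch below assumes the linear path $x_t = (1 - t)\, x_0 + t\, x_1$ with noise $x_0$, data $x_1$, and regression target $x_1 - x_0$. The velocity "models" are hypothetical stand-ins (a zero predictor and an oracle valid only when the data distribution is a single known point), and the data sample is made up.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t = 0) to data x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def cfm_loss_sample(v_model, x1, rng):
    """One Monte Carlo sample of the velocity-regression loss (averaged over dimensions)."""
    t = rng.random()                                   # t ~ U[0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]             # noise sample
    x_t = interpolate(x0, x1, t)
    target = [b - a for a, b in zip(x0, x1)]           # straight-line velocity x1 - x0
    v_hat = v_model(x_t, t)
    return sum((v - u) ** 2 for v, u in zip(v_hat, target)) / len(x1)

data_sample = [0.4, -0.6, 0.1]                         # hypothetical action chunk

def zero_velocity(x_t, t):
    """Untrained stand-in: predicts no motion."""
    return [0.0] * len(x_t)

def oracle_velocity(x_t, t):
    """For a single-point data distribution, (x1 - x_t) / (1 - t) equals x1 - x0 on the path."""
    return [(c - x) / (1.0 - t) for x, c in zip(x_t, data_sample)]

rng_a, rng_b = random.Random(0), random.Random(0)
loss_zero = sum(cfm_loss_sample(zero_velocity, data_sample, rng_a) for _ in range(200)) / 200
loss_oracle = sum(cfm_loss_sample(oracle_velocity, data_sample, rng_b) for _ in range(200)) / 200
print(f"zero velocity loss:   {loss_zero:.3f}")
print(f"oracle velocity loss: {loss_oracle:.2e}")
```

Unlike the diffusion loss, there is no schedule to configure: the only ingredients are a uniform time, a noise draw, and a straight-line target.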
The key advantages of flow matching for robot policies are: (1) fewer inference steps: the ODE can be integrated in as few as 5–10 steps with standard solvers (first-order Euler or higher-order Runge-Kutta) while maintaining high-quality generation, compared to 50–1000 steps for standard DDPM inference; (2) cleaner conditioning: the simple conditional velocity formulation makes it straightforward to condition the velocity field on arbitrary observation features via the same cross-attention mechanisms; and (3) more stable training: the loss landscape is smooth and the gradient magnitude is well-conditioned across the full trajectory, unlike diffusion losses, which can be unstable near $t = 0$.
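Few-step inference amounts to a short fixed-step ODE solve. The sketch below uses forward Euler from $t = 0$ (noise) to $t = 1$ (data). The oracle velocity field is a hypothetical idealization for a data distribution concentrated on one known target, chosen because its trajectories are exactly straight; a learned field over real multimodal data will be curved to some degree and will not be this forgiving.

```python
def euler_sample(v_model, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step forward Euler."""
    x, dt = list(x0), 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = v_model(x, t)                               # one network call per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

target = [0.4, -0.6]                                    # hypothetical clean action

def oracle_velocity(x, t):
    """Ideal straight-line field toward a single target: (target - x) / (1 - t)."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, target)]

noise = [1.5, 0.2]
for n in (1, 5, 10):
    result = euler_sample(oracle_velocity, noise, n)
    print(f"{n:2d} Euler steps -> {[round(v, 6) for v in result]}")
```

When the paths are perfectly straight, even a single Euler step reaches the target, which is why straightening the learned flow is the main lever for reducing step counts.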
Optimal Transport Conditional Flow Matching
A key limitation of the basic flow matching formulation is that the linear interpolation paths can cross. Because noise and data samples are paired independently, paths heading toward two different data samples $x_1^{(a)}$ and $x_1^{(b)}$ can pass through the same point $x_t$ at the same time $t \in (0, 1)$. At the intersection, the regression target points toward $x_1^{(a)}$ along one path and toward $x_1^{(b)}$ along the other, so the learned velocity field can only fit their average. The resulting marginal flow is still a valid ODE, but its trajectories are curved rather than straight, which demands more ODE solver steps and introduces approximation error at low step counts.
Optimal Transport Conditional Flow Matching (OT-CFM; Tong et al., 2023) eliminates crossings by choosing the coupling between noise samples and data samples to minimize the total transport cost. The Monge optimal transport plan pairs each noise sample $x_0$ with a data sample $x_1$ so as to minimize the expected squared displacement $\mathbb{E}\left[\|x_1 - x_0\|^2\right]$, subject to the constraint that the marginals match the noise and data distributions. Under the OT coupling, interpolation paths are non-crossing by construction: each data sample receives a unique noise partner, and the resulting straight-line paths span the transport uniformly without intersections. This directly translates to straighter ODE trajectories, allowing accurate integration in fewer steps. The $\pi_0$ model specifically uses OT-CFM to couple the zero-mean Gaussian noise prior with the action chunk distribution, achieving its 5-step inference target. The OT cost minimization can be computed efficiently using the Sinkhorn algorithm, adding modest overhead to the training data pipeline but producing significantly cleaner velocity fields that generalize better to new tasks.
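A minibatch version of the coupling step can be sketched with a plain Sinkhorn iteration. This is a simplified illustration, not the procedure of any particular paper or library: the cost is normalized by its maximum, the regularization strength and iteration count are arbitrary choices, and the final pairing takes each row's largest plan entry (one could instead sample pairs from the plan). The two-sample batch is made up so that the optimal pairing is obvious.

```python
import math

def squared_cost(noise_batch, data_batch):
    """Pairwise squared Euclidean distances between noise and data samples."""
    return [[sum((a - b) ** 2 for a, b in zip(n, d)) for d in data_batch] for n in noise_batch]

def sinkhorn_plan(cost, reg=0.05, num_iters=200):
    """Entropy-regularized OT plan with uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    scale = max(max(row) for row in cost) or 1.0        # normalize costs to [0, 1]
    K = [[math.exp(-c / (reg * scale)) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(num_iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def pair_by_plan(plan):
    """Simplified pairing: each noise sample takes the data index with the most plan mass."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in plan]

noise_batch = [[0.0], [10.0]]                           # hypothetical 1-D noise samples
data_batch = [[10.1], [0.1]]                            # hypothetical 1-D data samples
cost = squared_cost(noise_batch, data_batch)
plan = sinkhorn_plan(cost)
pairing = pair_by_plan(plan)

cost_batch_order = sum(cost[i][i] for i in range(2))
cost_ot = sum(cost[i][pairing[i]] for i in range(2))
print(f"pairing: {pairing}")
print(f"transport cost, batch order: {cost_batch_order:.2f}, OT pairing: {cost_ot:.2f}")
```

In batch order the two paths would cross in the middle of the interval; the OT pairing matches each noise sample to its nearby data sample, so the paths are short and never meet.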
The $\pi_0$ model uses flow matching with a pre-trained vision-language model backbone and achieves 5-step inference at 50 Hz control frequency, a combination of speed and expressive power that standard DDPM-based diffusion cannot match.
Inference latency analysis
The choice of generative architecture has direct, quantifiable consequences for robot control frequency. The following comparison uses a representative 100M-parameter denoising/velocity network, evaluated on a single NVIDIA RTX 4090 with a batch size of 1 (single environment, no parallelism), with a 60-dimensional action chunk ($H = 10$ actions, 6 DoF each):
| Architecture | Forward passes | Per-pass latency | Total latency | Max control frequency |
|---|---|---|---|---|
| ACT (Action Chunking with Transformers; single-pass autoregressive) | 10 (one per token) | ~1 ms | ~10 ms | ~100 Hz |
| DDPM ($T = 100$ steps) | 100 | ~2 ms | ~200 ms | ~5 Hz |
| DDIM ($S = 10$ steps) | 10 | ~2 ms | ~20 ms | ~50 Hz |
| OT-CFM Flow Matching ($N = 5$ steps) | 5 | ~2 ms | ~10 ms | ~100 Hz |
These numbers reveal why the field has moved from DDPM to DDIM and then to flow matching: DDPM at 5 Hz is too slow for most manipulation tasks requiring reactive closed-loop control (the robot arm moves on 100 ms timescales), while DDIM and flow matching recover real-time performance. ACT is at least as fast as flow matching in this comparison because it needs no iterative denoising, trading some multimodal expressiveness for speed.
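The latency figures follow from simple arithmetic: total latency is the number of sequential forward passes times the per-pass latency, and the maximum control frequency is its reciprocal. The sketch below reproduces the table's numbers under that assumption, ignoring any overhead beyond the forward passes (observation encoding, data transfer); the function name is an illustrative choice.

```python
def chunk_latency(forward_passes, per_pass_ms):
    """Return (total latency in ms, max control frequency in Hz) for sequential passes."""
    total_ms = forward_passes * per_pass_ms
    return total_ms, 1000.0 / total_ms

# (name, forward passes, per-pass latency in ms), taken from the comparison table.
architectures = [
    ("ACT", 10, 1.0),
    ("DDPM", 100, 2.0),
    ("DDIM", 10, 2.0),
    ("OT-CFM flow matching", 5, 2.0),
]

for name, passes, per_pass_ms in architectures:
    total_ms, freq_hz = chunk_latency(passes, per_pass_ms)
    print(f"{name:22s} {passes:4d} passes  {total_ms:6.1f} ms  {freq_hz:6.1f} Hz")
```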
For legged locomotion, which operates at 500–1000 Hz low-level control frequencies, neither diffusion nor flow matching is currently viable as the direct policy output — proprioceptive joint-space policies (Week 6) remain necessary at the lowest level. Diffusion and flow matching policies operate at the task-command frequency (5–50 Hz), generating high-level goals or action chunks that are tracked by the fast joint-space controller beneath them.
Diffusion versus flow matching for robot policies
The two frameworks have complementary practical profiles:
| Dimension | Diffusion (DDPM) | DDIM | Flow Matching (OT-CFM) |
|---|---|---|---|
| Inference steps | 50–1000 | 5–20 | 5–10 |
| Sample quality | Highest | Good | Comparable |
| Training objective | Noise prediction | Same (reuses DDPM weights) | Velocity regression |
| Gradient stability | Unstable near $t = 0$ | Same | Stable throughout |
| Max control frequency | ~5 Hz | ~50 Hz | ~100 Hz |
For real-time control at 50+ Hz, flow matching is now preferred. DDIM provides a practical middle ground that reuses trained DDPM weights without retraining. For offline trajectory generation (motion planning, dataset augmentation), DDPM with a high step count can produce the smoothest results. The $\pi_0$ model's adoption of OT-CFM flow matching over diffusion reflects this tradeoff in a practical system where latency directly affects manipulation quality.
GenAI context: trajectory generation as latent reasoning
The connection between generative robot policies and generative reasoning in language models is structural. Both involve:
- learning a distribution over valid completions (action sequences / token sequences) conditioned on a context (observations / prompts),
- the need to represent multimodal distributions (multiple valid grasps / multiple valid answers), and
- a generative process that progressively refines an initial sample (denoising / chain-of-thought reasoning).
Flow matching over action sequences is analogous to continuous-time reasoning processes in diffusion language models, where the model iteratively refines a draft sequence rather than generating token-by-token. Both involve training a model to predict the direction of improvement given a partially resolved state, whether that state is a noisy action chunk or a partially correct reasoning trace.
The conditioning story also aligns: classifier-free guidance in diffusion is structurally similar to the KL penalty in RLHF (Reinforcement Learning from Human Feedback). Both are mechanisms for trading diversity against faithfulness to a conditioning signal, with a scalar weight controlling the tradeoff.
Key takeaways
- Generative policies address the multimodal action distribution problem by modeling the full conditional distribution rather than its mean.
- Denoising diffusion models learn to reverse a Gaussian noising process via score matching; the DDPM training objective predicts the added noise, which is equivalent to a weighted sum of denoising score matching losses.
- Conditional diffusion policies inject observation context (visual, proprioceptive, language) via cross-attention in the denoiser.
- Noise schedules control the allocation of denoising capacity and affect trajectory smoothness and precision.
- Flow matching provides a continuous-time ODE formulation with a simpler training objective (velocity regression on linear interpolation paths) and requires 5–10 inference steps versus 50–1000 for DDPM, enabling 50+ Hz control.
- The $\pi_0$ architecture demonstrates that flow matching over pre-trained VLM (Vision-Language Model) features enables both high-quality manipulation and real-time inference.
Conceptual questions
- The denoising score matching objective trains the network to predict added noise $\epsilon$ at a random diffusion step $t$. The loss at small $t$ (low noise, nearly clean data) has small magnitude and therefore small gradient, while the loss at large $t$ (high noise, nearly pure Gaussian) has large magnitude and larger gradient. Analyze how this asymmetry affects training: which noise levels does the denoiser learn to denoise accurately, and which does it under-fit? Explain why cosine noise schedules mitigate this asymmetry, and derive what the ideal loss weighting would be to balance gradient magnitudes across all noise levels.
- Classifier-free guidance at inference combines conditional and unconditional denoisers: $\tilde\epsilon_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + w \left[ \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing) \right]$. For a diffusion policy conditioned on a visual observation of a specific object pose, analyze what happens to the generated action distribution as $w \to \infty$. Using the interpretation of the guidance formula as score interpolation, show that high guidance weight increases the policy's sensitivity to the conditioning signal at the cost of diversity. Identify a failure mode that would emerge specifically for robot policies with very high guidance weight during contact-rich tasks.
- The flow matching objective trains the velocity field $v_\theta$ to predict $x_1 - x_0$ along linear interpolation paths. For two data modes $x_1^{(a)}$ and $x_1^{(b)}$ that are far apart in action space (two different grasping orientations), analyze the geometry of the learned velocity field at the midpoint $\tfrac{1}{2}\big(x_1^{(a)} + x_1^{(b)}\big)$. Does the learned velocity correctly transport samples toward both modes? If not, explain the failure mode and describe how optimal transport flow matching (OT-CFM) would address it.
- A manipulation task requires generating a 2-second action chunk at 50 Hz (100 actions, 6 DoF each = 600-dimensional output) using flow matching with 5 ODE steps. Calculate the wall-clock inference time required if the flow matching network has 100M parameters and a single forward pass takes 2 ms. Is 50 Hz real-time control achievable? If not, what architectural optimizations (network compression, output dimension reduction, parallelism) would be needed, and what is the minimum achievable latency?
- Both diffusion policies and ACT (Action Chunking with Transformers) with CVAE conditioning model multimodal distributions over action chunks. Compare the two approaches on a task where the robot must choose between two qualitatively different strategies at the start of each episode (strategy A: overhead grasp; strategy B: side grasp). For each approach, describe: (a) how the strategy is committed to at the beginning of the episode, (b) whether the policy can switch strategies mid-episode if the first attempt fails, and (c) what training data distribution is needed for the policy to correctly represent both strategies. Which approach is more appropriate for this specific task structure?
Looking ahead
Diffusion and flow matching policies generate high-quality action sequences conditioned on visual and proprioceptive observations. The next frontier is integrating semantic understanding and language grounding into this generation process: moving from policies that act on raw sensor observations to policies that understand task descriptions, reason about object properties, and generalize across tasks through language.
Week 10: Vision-Language-Action Models. We examine how pretrained vision-language foundations ($\pi_0$, GR00T, SmolVLA) are adapted for robot control, focusing on the architectural choices that allow a single model to perceive, reason about, and act in diverse robotic manipulation scenarios.
Further reading
- Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS. (The foundational Diffusion Policy paper).
- Black, K., et al. (2023). Training Diffusion Models with Reinforcement Learning. (Connections between diffusion and control).
- Physical Intelligence. (2024). $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control. (The state-of-the-art use of Flow Matching in robotics).