Week 8: Foundation Models for Manipulation — ACT and Action Chunking

✦Learning Outcomes
  • Implement Action Chunking with Transformers (ACT) for manipulation
  • Connect transformer architectures to robot learning
  • Analyze temporal modeling in multi-step manipulation
  • Compare ACT to other foundation model approaches
◆Prerequisites
  • Week 5: Imitation learning
  • Week 7: Sim2real pipelines
  • Basic deep learning (transformers, attention)

Recommended: Review Week 5 and Week 7 before proceeding.

Purpose of this lecture

Classical robot policies are reactive: at each timestep they observe the current state and output a single action. This formulation, while simple, discards information about the temporal structure of manipulation tasks. Grasping a cup, stacking blocks, and folding fabric are not independent sequences of single-step decisions; they are coherent behaviors with temporal dependencies, where the action at step $t+5$ is informed by the commitment made at step $t$ and the trajectory of states visited between them.

Foundation models for manipulation reformulate robot control as sequence prediction: the policy consumes a history of observations and produces a sequence of future actions, using architectures and training paradigms borrowed from large language models and image generation. This lecture focuses on Action Chunking with Transformers (ACT), one of the most influential such architectures, and introduces the broader family of sequence-modeling approaches that it exemplifies. The goal is to understand not just the mechanics but the conceptual shift: what problems action chunking solves, why transformer architectures are particularly well-suited to it, and how this paradigm connects to the GenAI stack.


From reactive control to sequence modeling

The contrast between reactive and sequence-modeling policies is not merely architectural — it reflects different statistical assumptions about the structure of manipulation behavior.

A reactive policy $\pi(a_t \mid o_t)$ models actions as conditionally independent given the current observation. This is the implicit assumption of standard imitation learning: the dataset is treated as a collection of $(o, a)$ pairs, and the policy learns the conditional distribution over actions at each observation independently. The assumption holds when the task has a unique optimal action at every state, that is, when manipulation is unambiguous. In practice, it fails for two reasons.

First, temporal consistency: a reactive policy can output actions at $t$ and $t+1$ that are each locally plausible but globally inconsistent, for instance grasping from the left at $t$ and from the right at $t+1$. The policy has no mechanism to commit to a grasping strategy and execute it coherently because each timestep is treated as independent. This produces jittery, oscillatory behavior that fails despite reasonable single-step accuracy.

Second, multimodal ambiguity: when multiple grasping strategies are equally valid, a reactive policy trained by regression (maximum likelihood under a unimodal output distribution such as a Gaussian) learns to average over the modes of the distribution, producing an action that lies between the valid options and is therefore poor for all of them. A policy that should grasp either from the left or from the right will instead try to grasp from the middle, the mean of the multimodal distribution.
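
The averaging effect can be seen in a toy calculation. The sketch below assumes a one-dimensional grasp offset with half of the demonstrations at -1 (left) and half at +1 (right); the values and helper names are invented for illustration. The constant prediction that minimizes squared error is the sample mean, which has the lowest loss of the three candidates yet matches no demonstrated action.

```python
# Toy 1-D "grasp offset" demonstrations with two equally valid modes (hypothetical values).
demos = [-1.0] * 50 + [1.0] * 50

def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# The squared-error-optimal constant prediction is the sample mean.
mean_action = sum(demos) / len(demos)

# Compare the averaged action with committing to either mode.
candidates = {"left": -1.0, "mean": mean_action, "right": 1.0}
losses = {name: mse(a, demos) for name, a in candidates.items()}

# Distance from each candidate to the nearest action that was actually demonstrated.
nearest_gap = {name: min(abs(a - t) for t in demos) for name, a in candidates.items()}

for name in candidates:
    print(name, candidates[name], losses[name], nearest_gap[name])
```

The regression objective prefers the middle action even though it is a full unit away from anything an expert did, which is the failure that mode-committing sequence models are meant to avoid.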

Sequence modeling addresses both problems by modeling the joint distribution over a chunk of $K$ actions given the last $H+1$ observations:

$$p(a_t, a_{t+1}, \ldots, a_{t+K-1} \mid o_t, o_{t-1}, \ldots, o_{t-H})$$

This joint distribution can represent temporal consistency (actions that form a coherent trajectory) and multimodal behavior (the distribution selects a mode and generates actions consistent with that mode throughout the sequence).


Action Chunking with Transformers (ACT)

Architecture and training

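ACT (Zhao et al., 2023) trains an encoder-decoder transformer to predict a chunk of $K$ future actions given a history of $H$ observations. Write $a_{t:t+K}$ for the $K$ actions $a_t, \ldots, a_{t+K-1}$ and $o_{t-H:t}$ for the observations $o_{t-H}, \ldots, o_t$. In its general sequence-modeling form, training maximizes the log-likelihood of the expert action chunk given the observation history:

$$\mathcal{L}_{\text{ACT}}(\theta) = -\mathbb{E}_{(o_{t-H:t},\, a_{t:t+K}) \sim \mathcal{D}}\!\left[\sum_{k=0}^{K-1} \log p_\theta(a_{t+k} \mid a_{t:t+k}, o_{t-H:t})\right]$$

Here $a_{t:t+k}$ denotes the $k$ actions that precede $a_{t+k}$ within the chunk (empty for $k=0$). The original implementation simplifies this objective: it predicts all $K$ actions in parallel, dropping the dependence on earlier actions in the chunk, and replaces the log-likelihood with an L1 regression loss.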
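
The expectation in the objective runs over (observation history, action chunk) pairs drawn from demonstrations. A minimal sketch of how one episode might be sliced into such pairs follows; the function name `make_training_pairs` and the toy string data are hypothetical, and real pipelines often pad the final chunks of an episode instead of dropping them.

```python
def make_training_pairs(observations, actions, H, K):
    """Slice one demonstration into (observation history, action chunk) pairs.

    observations[t] and actions[t] are aligned per timestep.
    Each pair holds H + 1 observations (o_{t-H} .. o_t) and K actions (a_t .. a_{t+K-1}).
    """
    assert len(observations) == len(actions)
    T = len(actions)
    pairs = []
    # t must leave room for H past observations and K future actions.
    for t in range(H, T - K + 1):
        history = observations[t - H : t + 1]
        chunk = actions[t : t + K]
        pairs.append((history, chunk))
    return pairs

# Toy episode of 6 steps, with strings standing in for observations and actions.
obs = [f"o{t}" for t in range(6)]
act = [f"a{t}" for t in range(6)]
pairs = make_training_pairs(obs, act, H=1, K=3)
first_history, first_chunk = pairs[0]
print(len(pairs), first_history, first_chunk)
```

Dropping incomplete windows keeps every chunk at exactly $K$ actions, at the cost of never supervising the last few steps of an episode as chunk starts.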

The observation encoder processes proprioceptive state (joint positions and velocities) and visual observations (RGB images from one or more cameras) into a sequence of embedding vectors. Visual frames are processed by a CNN or ViT backbone producing patch embeddings; proprioceptive state is linearly projected. All observation embeddings are concatenated into the encoder's input sequence. The decoder generates the $K$-step action chunk, attending to the full encoded observation context via cross-attention. In the original ACT implementation the $K$ actions are decoded in parallel from fixed positional queries in one forward pass, not token by token.

In the original ACT formulation, the model also incorporates a conditional variational autoencoder (CVAE) component to handle multimodal action distributions. During training, a style encoder $q_\phi(z \mid a_{t:t+K}, o_{t-H:t})$ processes the full demonstration action chunk together with the observations to produce a posterior distribution over a latent style variable $z \in \mathbb{R}^d$, a compact representation of the behavioral mode the expert chose (approach angle, speed profile, grasping strategy). The policy decoder is conditioned on both the observations and $z$, generating actions consistent with the selected mode.

The CVAE is trained by maximizing the evidence lower bound (ELBO) on the log-likelihood of the demonstration actions, with the KL term weighted by $\beta$:

$$\mathcal{L}_{\text{CVAE}}(\theta, \phi) = \underbrace{\mathbb{E}_{z \sim q_\phi}\!\left[\sum_{k=0}^{K-1} \log p_\theta(a_{t+k} \mid z, o_{t-H:t})\right]}_{\text{reconstruction}} - \underbrace{\beta\, D_{\text{KL}}\!\left(q_\phi(z \mid a_{t:t+K}, o_{t-H:t}) \;\|\; p(z)\right)}_{\text{regularization}}$$

where $p(z) = \mathcal{N}(0, I)$ is the Gaussian prior over the latent style. The reconstruction term trains the decoder to produce accurate action chunks given the encoded style; the KL divergence term forces the posterior $q_\phi$ to remain close to the prior, ensuring that the latent space is well-organized and that sampling $z \sim \mathcal{N}(0, I)$ at test time draws from a smooth distribution of behavioral modes. The coefficient $\beta$ (from the $\beta$-VAE) controls the disentanglement-fidelity tradeoff: large $\beta$ produces a more disentangled latent space where different dimensions correspond to interpretable behavioral attributes, at the cost of some reconstruction fidelity.
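
For a diagonal-Gaussian posterior, the KL term has a closed form, and the latent is usually drawn with the reparameterization trick. The sketch below shows both in plain Python, written as a loss to minimize (the negative of the objective above). It assumes an L1 reconstruction error as a stand-in for the negative log-likelihood, and the function names and the value `beta=10.0` are illustrative choices.

```python
import math
import random

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    This is the form autodiff frameworks use so that mu and logvar stay
    differentiable; plain Python only reproduces the arithmetic.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def l1_reconstruction(pred_chunk, expert_chunk):
    """Mean absolute error over all timesteps and action dimensions of a chunk."""
    diffs = [abs(p - e)
             for p_t, e_t in zip(pred_chunk, expert_chunk)
             for p, e in zip(p_t, e_t)]
    return sum(diffs) / len(diffs)

def cvae_loss(pred_chunk, expert_chunk, mu, logvar, beta):
    """Loss to minimize: reconstruction error plus beta-weighted KL penalty."""
    return l1_reconstruction(pred_chunk, expert_chunk) + beta * kl_to_standard_normal(mu, logvar)

rng = random.Random(0)
mu, logvar = [0.0, 0.0], [0.0, 0.0]        # posterior equal to the prior, so KL = 0
z = reparameterize(mu, logvar, rng)
loss = cvae_loss([[0.1, 0.2]], [[0.0, 0.0]], mu, logvar, beta=10.0)
print(z, loss)
```

When the posterior equals the prior the KL penalty vanishes and only the reconstruction error remains; moving `mu` away from zero or `logvar` away from zero raises the penalty in proportion to `beta`.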

At test time the style encoder is not used, since the expert chunk is unavailable. Sampling $z \sim \mathcal{N}(0, I)$ from the prior selects a mode before the chunk is generated, so the policy commits to one mode of the distribution rather than averaging across all modes. The original ACT implementation instead fixes $z$ to the prior mean (zero) so that decoding is deterministic.

Action chunking and temporal abstraction

The chunk length $K$ is the most important hyperparameter in ACT. A chunk of $K=1$ reduces to a reactive policy; a chunk of $K=T$ (the full episode) produces a one-shot trajectory generator. Intermediate values provide a tunable tradeoff between replanning frequency and action consistency.

The critical insight is that action chunking reduces the effective decision frequency of the policy. A policy operating at 30 Hz with $K=30$ that executes each chunk in full replans once per second, treating each 1-second window as a single decision. This coarser decision frequency helps in two ways. First, the effective horizon of the task, counted in decisions, shrinks by a factor of $K$ (the policy needs to learn dependencies at the granularity of chunks rather than individual actions), which leaves fewer opportunities for errors to compound. Second, the policy commits to a behavioral mode for a full chunk duration rather than being free to switch on every step.

Temporal ensembling is a deployment technique for ACT that smooths execution across overlapping chunks. At each timestep, the policy generates a new chunk; rather than executing an entire chunk open-loop, the controller executes a weighted average of all overlapping chunk predictions for the current timestep. If chunks of length $K$ are predicted at every step, then at time $t$ there are $K$ overlapping predictions for action $a_t$, from the chunks initiated at $t, t-1, \ldots, t-K+1$. Averaging these with exponentially decaying weights produces smoother motion than executing any single chunk. In the original implementation the weights are $w_i = \exp(-m\,i)$ with $w_0$ assigned to the oldest prediction, so a smaller $m$ incorporates new observations faster.
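
The ensembling step can be sketched in a few lines. This version assumes chunks are stored in a dictionary keyed by their start time and uses exponential weights with index 0 for the oldest prediction, as described for the original ACT implementation; the function name, the storage layout, and the default `m` are illustrative assumptions.

```python
import math

def ensemble_action(chunks, t, m=0.1):
    """Weighted average of every stored prediction for timestep t.

    chunks maps the start time s of a chunk to its list of K predicted actions
    (each action a list of floats); chunk s predicts timesteps s .. s+K-1.
    Weights are w_i = exp(-m * i), with i = 0 for the oldest prediction.
    """
    # Collect the predictions that cover time t, oldest chunk first.
    preds = [chunks[s][t - s] for s in sorted(chunks) if 0 <= t - s < len(chunks[s])]
    if not preds:
        raise ValueError("no chunk covers timestep %d" % t)
    weights = [math.exp(-m * i) for i in range(len(preds))]
    total = sum(weights)
    dim = len(preds[0])
    return [sum(w * p[d] for w, p in zip(weights, preds)) / total for d in range(dim)]

# Three chunks of length K = 3 with 1-D actions, started at t = 0, 1, 2.
chunks = {
    0: [[0.0], [1.0], [2.0]],
    1: [[1.2], [2.2], [3.2]],
    2: [[2.4], [3.4], [4.4]],
}
uniform = ensemble_action(chunks, t=2, m=0.0)    # plain mean of 2.0, 2.2, 2.4
weighted = ensemble_action(chunks, t=2, m=0.5)   # older predictions count more
print(uniform, weighted)
```

With `m=0` the ensemble is a plain mean; increasing `m` shifts weight toward the oldest prediction, which smooths more but reacts more slowly to new observations.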


Tokenization strategies for continuous control

Transformer architectures were designed for discrete token vocabularies. Adapting them to continuous robot action spaces requires an explicit tokenization strategy.

Direct regression is the simplest approach: the transformer's output is a sequence of real-valued action vectors, one per timestep in the chunk, trained with L2 or L1 loss. This requires no discretization and has no quantization error, but it loses the categorical modeling strengths of transformers: the ability to represent sharp multimodal distributions and to benefit from techniques like temperature sampling and top-$k$ truncation.

Uniform discretization bins each action dimension into $B$ equal-width bins and represents each action as a tuple of bin indices. The transformer predicts action dimensions as classification problems, outputting a probability vector over $B$ categories per dimension. This representation can capture multimodal distributions (high probability mass at two non-adjacent bins) and benefits from the same training stability and sampling flexibility as language models. The cost is quantization error: fine-grained precision requires large $B$, which increases the effective action vocabulary and slows inference.
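
The precision cost of binning is easy to quantify: decoding to the bin center gives a worst-case round-trip error of half a bin width. The sketch below assumes a single action dimension normalized to the range [-1, 1] and 256 bins; the function names and those values are illustrative.

```python
def encode(value, low, high, num_bins):
    """Map a continuous value to a bin index in [0, num_bins - 1], clipping to [low, high]."""
    clipped = min(max(value, low), high)
    width = (high - low) / num_bins
    # The upper edge of the range belongs to the last bin.
    return min(int((clipped - low) / width), num_bins - 1)

def decode(index, low, high, num_bins):
    """Return the center of the given bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width

def max_quantization_error(low, high, num_bins):
    """Worst-case round-trip error for in-range values: half a bin width."""
    return (high - low) / (2 * num_bins)

LOW, HIGH, B = -1.0, 1.0, 256
bound = max_quantization_error(LOW, HIGH, B)
samples = [-1.0, -0.37, 0.0, 0.3, 0.999, 1.0]
errors = [abs(decode(encode(v, LOW, HIGH, B), LOW, HIGH, B) - v) for v in samples]
print(bound, max(errors))
```

Halving the error requires doubling $B$, so the vocabulary grows linearly in the inverse of the required precision for every action dimension.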

Vector quantization (VQ) learns a codebook of $M$ action prototypes $\{e_m\}_{m=1}^M \subset \mathbb{R}^d$ and represents each action embedding as the index of its nearest prototype. This produces a compressed trajectory representation as a sequence of codebook indices, enabling a transformer to operate over a discrete token vocabulary of motor primitives. VQ-BeT (Vector Quantized Behavior Transformer; Lee et al., 2024) combines vector quantization with a GPT-style transformer backbone that predicts both the codebook index (which motor primitive) and a continuous residual offset (fine-grained correction within the primitive). The codebook is trained with a commitment loss that encourages action embeddings to cluster near codebook entries:

$$\mathcal{L}_{\text{VQ}} = \underbrace{\|\text{sg}(z_e) - e_{m^*}\|^2}_{\text{codebook update}} + \beta_{\text{commit}}\underbrace{\|z_e - \text{sg}(e_{m^*})\|^2}_{\text{commitment}}$$

where $z_e$ is the encoder's continuous action embedding, $e_{m^*}$ is the nearest codebook entry, and $\text{sg}(\cdot)$ denotes the stop-gradient operator. The first term moves the codebook entries toward the encoder's outputs (in practice this gradient update is often replaced by an exponential moving average of the assigned embeddings); the second term encourages the encoder to commit to the nearest codebook entry rather than continuously drifting between entries. The codebook can be pre-trained on the entire demonstration dataset to capture natural action clusters, then frozen during policy training, or trained jointly end-to-end. VQ-BeT achieves strong performance on multimodal manipulation tasks because the discrete codebook vocabulary allows the transformer to select distinct action modes cleanly, while the residual offset provides the fine-grained precision that pure codebook prediction would lack.
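
The code-plus-offset idea reduces to a nearest-neighbor lookup followed by a subtraction. The sketch below uses a hand-written three-entry codebook and plain Python lists; the names, the codebook, and `beta_commit=0.25` are illustrative assumptions, and only the numerical value of the loss is computed because stop-gradient has no effect outside an autodiff framework.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(z_e, codebook):
    """Return (index of the nearest codebook entry, residual z_e - e_index)."""
    index = min(range(len(codebook)), key=lambda m: squared_distance(z_e, codebook[m]))
    residual = [a - b for a, b in zip(z_e, codebook[index])]
    return index, residual

def vq_loss_value(z_e, codebook, beta_commit=0.25):
    """Numerical value of the VQ loss for one embedding.

    Both terms equal ||z_e - e_m*||^2 in value; the stop-gradient only decides
    which side receives gradients, which plain Python does not model.
    """
    index, _ = quantize(z_e, codebook)
    dist = squared_distance(z_e, codebook[index])
    return dist + beta_commit * dist

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # M = 3 prototypes in R^2
z_e = [0.9, 0.2]
index, residual = quantize(z_e, codebook)
reconstructed = [e + r for e, r in zip(codebook[index], residual)]
loss = vq_loss_value(z_e, codebook)
print(index, residual, reconstructed, loss)
```

Predicting `index` as a class and `residual` as a small regression target recovers the full embedding exactly, which is why the discrete code can carry the mode choice while the offset carries precision.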


Diffusion-style manipulation policies

Parallel to the ACT development, diffusion policies (Chi et al., 2023) proposed using denoising diffusion probabilistic models (DDPMs) for action generation. Where ACT decodes a chunk in a single forward pass, diffusion policies generate full action sequences through an iterative denoising process.

A diffusion policy defines a forward noising process that progressively corrupts a clean action chunk $a_0^{1:K}$ through $N$ steps:

$$q(a_n^{1:K} \mid a_{n-1}^{1:K}) = \mathcal{N}\!\left(a_n^{1:K};\; \sqrt{1-\beta_n}\, a_{n-1}^{1:K},\; \beta_n I\right)$$

and trains a denoising network $\varepsilon_\theta(a_n^{1:K}, n, o)$ to predict the injected noise at each diffusion step $n$, conditioned on the observation $o$. The training objective is:

$$\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{n, a_0, \varepsilon}\!\left[\| \varepsilon - \varepsilon_\theta(a_n^{1:K}, n, o) \|^2\right]$$
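
In practice the noised chunk is not produced by running the forward chain step by step. Composing the Gaussian steps gives $a_n = \sqrt{\bar\alpha_n}\, a_0 + \sqrt{1-\bar\alpha_n}\, \varepsilon$ with $\bar\alpha_n = \prod_{i \le n} (1-\beta_i)$, so one training term can be sampled directly. The sketch below assumes a linear noise schedule and a chunk flattened into a list of floats; the function names and schedule endpoints are illustrative.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule beta_1 .. beta_N (an assumed choice; others are common)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_n = prod_{i <= n} (1 - beta_i)."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out

def make_training_sample(clean_chunk, abar, rng):
    """Draw (noised chunk, step n, noise) for one term of the denoising loss."""
    n = rng.randrange(1, len(abar) + 1)            # diffusion step in 1 .. N
    eps = [rng.gauss(0.0, 1.0) for _ in clean_chunk]
    scale = math.sqrt(abar[n - 1])
    sigma = math.sqrt(1.0 - abar[n - 1])
    noised = [scale * a + sigma * e for a, e in zip(clean_chunk, eps)]
    return noised, n, eps

betas = linear_betas(100)
abar = alpha_bars(betas)
clean = [0.1, -0.2, 0.3]                           # a flattened toy action chunk
noised, n, eps = make_training_sample(clean, abar, random.Random(0))
print(n, noised)
```

The network would then be asked to predict `eps` from `noised`, `n`, and the observation, and the squared difference is one sample of the loss above.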

At inference, the policy starts from Gaussian noise and runs the reverse denoising chain for $N$ steps to produce a clean action chunk. The condition $o$ (current and past observations) guides the denoising toward action chunks consistent with the current task context.
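
The reverse chain can be sketched with a stand-in for the trained network. The code below implements the standard DDPM update with the simple variance choice $\sigma_n^2 = \beta_n$, and replaces $\varepsilon_\theta$ with an oracle that is exact when the data distribution is a single known chunk. The oracle, the omitted observation argument, and the extra `abar_n` argument are simplifications for illustration, not how a learned policy is called.

```python
import math
import random

def ddpm_sample(denoiser, betas, dim, rng):
    """Run the DDPM reverse chain from pure noise down to a clean sample."""
    alphas = [1.0 - b for b in betas]
    abar, running = [], 1.0
    for a in alphas:
        running *= a
        abar.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]     # start from a_N ~ N(0, I)
    for n in range(len(betas), 0, -1):                # n = N, ..., 1
        i = n - 1                                     # list index for step n
        eps_hat = denoiser(x, n, abar[i])
        coef = betas[i] / math.sqrt(1.0 - abar[i])
        mean = [(xi - coef * e) / math.sqrt(alphas[i]) for xi, e in zip(x, eps_hat)]
        if n > 1:
            sigma = math.sqrt(betas[i])               # variance choice sigma_n^2 = beta_n
            x = [mu + sigma * rng.gauss(0.0, 1.0) for mu in mean]
        else:
            x = mean                                  # no noise is added on the final step
    return x

def make_oracle_denoiser(target):
    """Exact noise predictor when the data distribution is the single point `target`."""
    def denoiser(x, n, abar_n):
        scale, sigma = math.sqrt(abar_n), math.sqrt(1.0 - abar_n)
        return [(xi - scale * t) / sigma for xi, t in zip(x, target)]
    return denoiser

target = [0.5, -0.25, 0.1]                            # toy flattened action chunk
betas = [1e-4 + i * (0.02 - 1e-4) / 49 for i in range(50)]
sample = ddpm_sample(make_oracle_denoiser(target), betas, len(target), random.Random(0))
print(sample)
```

With a perfect noise predictor the chain lands on the target chunk; a learned network replaces the oracle and takes the observation as an additional input.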

Diffusion policies are particularly well-suited to contact-rich manipulation: when the task requires precise contact forces (inserting a plug, unfolding a cloth), the distribution over valid actions is narrow and multimodal, and the iterative refinement of diffusion allows the model to progressively narrow the action distribution from broad initial noise to a precise, coherent action sequence. ACT, by contrast, produces the chunk in a single forward pass, which limits its ability to represent extremely tight distributions.

The cost of diffusion policies relative to ACT is inference latency: running $N = 100$ denoising steps for a new chunk at every step of a 30 Hz control loop requires executing the denoising network 3000 times per second. Accelerated samplers (for example DDIM with 5–20 inference steps) reduce this cost, as does executing several actions from each chunk before replanning, but diffusion inference remains heavier than ACT's single-pass decoding. The $\pi_0$ model (Black et al., 2024) uses flow matching, a continuous-time generalization of diffusion, to reduce inference steps to as few as 5–10 while maintaining diffusion's multimodal expressiveness.
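
The latency arithmetic above can be written out as a small calculator. The function names and the `replan_every` parameter (how many control steps are executed from each chunk before a new one is generated) are illustrative assumptions.

```python
def network_evals_per_second(control_hz, denoise_steps, replan_every=1):
    """Denoiser forward passes per second when a new chunk is generated
    every `replan_every` control steps."""
    chunks_per_second = control_hz / replan_every
    return chunks_per_second * denoise_steps

def per_eval_budget_ms(control_hz, denoise_steps, replan_every=1):
    """Average time available per forward pass to keep up with real time."""
    return 1000.0 / network_evals_per_second(control_hz, denoise_steps, replan_every)

full_ddpm = network_evals_per_second(30, 100)                 # replan at every step
fast = network_evals_per_second(30, 10, replan_every=8)       # 10 steps, replan every 8
budget = per_eval_budget_ms(30, 100)
print(full_ddpm, fast, budget)
```

Replanning at every step with 100 denoising steps leaves about a third of a millisecond per forward pass, which is why fewer sampler steps and less frequent replanning are usually combined.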


Temporal resolution and horizon design

A fundamental design axis for sequence-model policies is the tradeoff between temporal resolution (how finely the policy discretizes time) and planning horizon (how many steps into the future the policy models explicitly).

High temporal resolution (short control period, e.g., 1 ms) allows precise trajectory shaping but requires the policy to model very long sequences even for short behaviors, straining the transformer's context capacity. Low temporal resolution (long control period, e.g., 100 ms) allows longer nominal horizons but coarsens the trajectory and can miss the contact precision required for assembly tasks. ACT with typical $K = 20$–$100$ at 10–30 Hz operates at an effective planning horizon of roughly 0.7–10 seconds, appropriate for most manipulation tasks.
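
The horizon figures follow from dividing chunk length by control rate. A short sketch of that arithmetic, with hypothetical function names:

```python
import math

def planning_horizon_s(chunk_len, control_hz):
    """Seconds of future motion covered by one chunk."""
    return chunk_len / control_hz

def chunk_len_for_horizon(horizon_s, control_hz):
    """Smallest chunk length K that covers horizon_s seconds at the given rate."""
    return math.ceil(horizon_s * control_hz)

shortest = planning_horizon_s(20, 30)       # K = 20 at 30 Hz
longest = planning_horizon_s(100, 10)       # K = 100 at 10 Hz
k_at_1khz = chunk_len_for_horizon(2.0, 1000)
k_at_10hz = chunk_len_for_horizon(2.0, 10)
print(shortest, longest, k_at_1khz, k_at_10hz)
```

Covering the same 2-second behavior takes 2000 actions at a 1 ms control period but only 20 at 100 ms, which is the sequence-length pressure described above.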

The replanning rate (how often the policy generates a new chunk) introduces a secondary design choice. Replanning every step allows rapid correction but wastes computation on chunks that are immediately superseded; replanning every $K/2$ steps with temporal ensembling provides a smooth tradeoff. The optimal replanning rate depends on the task's sensitivity to disturbances: tasks with unpredictable contacts (deformable objects, multi-fingered grasping) benefit from frequent replanning; tasks with predictable open-loop dynamics (sweeping, wiping) benefit from infrequent replanning.


GenAI context: control as sequence generation

The parallel between ACT and language-model sequence generation is close:

| Robot control (ACT) | Language model |
|---|---|
| Observation history | Prompt / context |
| Action chunk $a_{t:t+K}$ | Completion sequence |
| Proprioception tokens | System instruction |
| Visual patch embeddings | Image tokens |
| Chunk length $K$ | Generation length |
| CVAE latent $z$ | Temperature / sampling strategy |
| Temporal ensembling | Token reranking / aggregation |

This alignment is not accidental: it reflects that both manipulation and language generation are sequential tasks where generating consistent, coherent outputs requires modeling temporal dependencies and committing to a mode among multiple valid continuations. The alignment also opens direct technical connections: ACT-style policies can be initialized from language model weights (sharing the transformer backbone), conditioned on language task descriptions (natural language as the "prompt"), and scaled using the same training recipes as LLMs. These connections motivate the Vision-Language-Action models studied in Week 10.


Key takeaways

ACT reformulates manipulation as sequence prediction, using an encoder-decoder transformer to produce $K$-step action chunks from observation histories. The chunk length $K$ controls the tradeoff between replanning frequency and action consistency; temporal ensembling smooths execution between replanning events. The CVAE-based conditioning addresses the multimodal averaging problem of direct regression by conditioning each chunk on a latent style variable. Tokenization of continuous actions can be achieved through direct regression, uniform discretization, or learned vector quantization, each with different expressiveness and computational tradeoffs. Diffusion policies generate action chunks through iterative denoising, achieving strong performance on contact-rich tasks at higher inference cost. The architectural alignment between ACT and language-model sequence generation opens direct technical connections: shared backbones, language conditioning, and cross-domain pretraining are all natural extensions.


Conceptual questions

  1. The temporal ensembling strategy for ACT averages the action predictions from $K$ overlapping chunks at each timestep, weighted by recency. Analyze the effect of this averaging on the policy's closed-loop bandwidth: if a sudden perturbation occurs at time $t$ (an unexpected contact pushes the object), how many timesteps does it take for the ensembled policy to respond fully? Derive the transfer function of the temporal ensembling filter and express the bandwidth reduction as a function of $K$. Under what task conditions is this bandwidth reduction acceptable, and when is it not?

  2. ACT with CVAE conditioning samples a latent style variable $z \sim \mathcal{N}(0, I)$ at test time, where $z$ was learned to represent behavioral modes during training. For a task with 3 distinct valid grasping modes, analyze what happens to the CVAE latent space during training if the demonstration dataset contains 80% mode A, 15% mode B, and 5% mode C. How does this imbalance affect the conditional action distribution $p_\theta(a \mid o, z)$ for $z$ values corresponding to mode C? What modification to the training procedure would give more uniform mode coverage?

  3. Diffusion policies run $N$ denoising steps to generate each action chunk, where $N$ is typically 50–100 for high-quality generation and 5–10 for fast inference. Using the DDIM sampler, the number of steps can be reduced at the cost of sample quality. Analyze the tradeoff between step count and sample quality specifically for a peg insertion task requiring 0.5 mm positional precision. What denoising step count would you use, and how would you diagnose whether the step count is insufficient without running physical experiments?

  4. A behavior-cloned ACT policy achieves 85% success rate on a cup-placement task in evaluation. The team observes that on the 15% failures, the robot produces a jittery approach trajectory that ends with the cup dropped approximately 3 cm from the target. Hypothesize whether this failure is caused by (a) incorrect mode selection from the CVAE, (b) insufficient chunk length for this task, or (c) temporal ensembling overly smoothing a rapid contact adjustment. Describe the diagnostic procedure to distinguish these hypotheses and the corresponding fixes.

  5. The alignment between ACT and LLM architectures suggests pretraining the ACT transformer on a large language corpus before fine-tuning on robot demonstrations. Analyze the expected benefits and failure modes of this approach: which components of the transformer architecture transfer well (attention patterns, positional encoding, MLP layers), and which would require re-learning from scratch (action tokenization, visual conditioning)? Propose an experimental design to measure the contribution of language pretraining to the final policy's performance separately from the contribution of architecture choice.


✦Solutions
  1. Temporal-ensembling bandwidth. Averaging $K$ overlapping, recency-weighted chunks is a low-pass filter, so a perturbation at $t$ is fully reflected only after the effective averaging window (on the order of the chunk overlap). The filter is a weighted moving average; exponential weights give roughly a first-order low-pass whose cutoff falls as the window grows, so bandwidth drops with $K$. This is acceptable for smooth, quasi-static tasks but not for fast contact reactions that need high closed-loop bandwidth.
  2. CVAE mode imbalance. The latent space allocates volume in proportion to mode frequency, so the 5% mode C region is under-trained: $p_\theta(a \mid o, z)$ for those $z$ is high-variance and may collapse toward mode A. Mode C becomes unreliable. Fix by reweighting/oversampling mode C, using a discrete or mixture latent with a balanced prior, or clustering-then-balancing the demos so each mode gets adequate latent coverage.
  3. Diffusion steps for 0.5 mm. Too few DDIM steps under-integrate the denoising ODE, biasing or blurring the action and missing sub-millimeter precision; use more steps (~50) or a high-quality fast sampler. Diagnose without hardware by measuring generated-action error against a many-step reference on held-out data, checking positional variance of chunks near contact, and confirming precision plateaus as steps increase.
  4. Cup-drop diagnosis. (a) Wrong CVAE mode: cluster failures by the sampled $z$ and see if they concentrate on one mode. (b) Too-short chunk: check whether failures sit at chunk boundaries and retry with longer chunks. (c) Over-smoothing: reduce or disable temporal ensembling and see if the jittery approach and drop disappear. Run each as an ablation; a jittery approach ending in a near-contact drop points most to (c) or (b).
  5. Language pretraining for ACT. Attention patterns, positional encodings, and MLP layers (general sequence machinery) transfer reasonably; action tokenization, visual conditioning, and the action head have no language analog and must be learned from scratch. Benefit: stronger long-range temporal modeling; failure mode: irrelevant or harmful language priors and tokenization mismatch. Measure the contribution with a 2×2 ablation: language-pretrained vs random init, crossed with the transformer vs a matched alternative architecture.
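
For solution 1, one way to make the filter explicit under simplifying assumptions: treat the prediction made $j$ steps ago as reflecting information only up to time $t-j$, and take recency weights $w_j = e^{-\lambda j}$ for $j = 0, \ldots, K-1$. The symbol $\lambda$ and the linear-filter idealization are assumptions introduced here for illustration. The ensemble then behaves as a finite-impulse-response filter with transfer function $H(z)$ and step response $s[n]$:

```latex
H(z) = \frac{\sum_{j=0}^{K-1} w_j\, z^{-j}}{\sum_{j=0}^{K-1} w_j}
     = \frac{1 - e^{-\lambda}}{1 - e^{-\lambda K}}
       \cdot \frac{1 - e^{-\lambda K} z^{-K}}{1 - e^{-\lambda} z^{-1}},
\qquad
s[n] = \frac{\sum_{j=0}^{n} w_j}{\sum_{j=0}^{K-1} w_j}
     = \frac{1 - e^{-\lambda (n+1)}}{1 - e^{-\lambda K}},
\quad 0 \le n \le K-1 .
```

The DC gain is $H(1) = 1$, and the step response reaches 1 only at $n = K-1$, so a perturbation is fully reflected after $K-1$ further steps. As $\lambda \to 0$ the filter becomes a uniform moving average of length $K$, whose bandwidth shrinks roughly in proportion to $1/K$.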

Looking ahead

ACT and diffusion policies model the action distribution conditioned on visual and proprioceptive observations. The next step is to understand the generative process more deeply: specifically, how flow matching provides a mathematically cleaner and computationally faster alternative to diffusion, and how these generative paradigms can be unified under the framework of transport maps between probability distributions.

Week 9: Flow Matching and Diffusion for Robot Policies. We derive the flow matching objective from first principles, examine how it relates to the diffusion score-matching objective, and analyze the practical tradeoffs between diffusion and flow-based policies for real-time robotic control.


Further reading

  • Zhao, T. Z., et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS. (The Action Chunking with Transformers / ACT paper.)
  • Bahl, S., et al. (2023). Affordances from Human Videos as a Versatile Representation for Robotics. CVPR.