Week 8: Foundation Models for Manipulation — ACT and Action Chunking — Robot Learning

Purpose of this lecture#

Classical robot policies are reactive: at each timestep they observe the current state and output a single action. This formulation, while simple, discards information about the temporal structure of manipulation tasks. Grasping a cup, stacking blocks, and folding fabric are not independent sequences of single-step decisions — they are coherent behaviors with temporal dependencies, where the action at step $t+5$ is informed by the commitment made at step $t$ and the trajectory of states visited between them.

Foundation models for manipulation reformulate robot control as sequence prediction: the policy consumes a history of observations and produces a sequence of future actions, using architectures and training paradigms borrowed from large language models and image generation. This lecture focuses on the Action Chunking Transformer (ACT) — the most influential such architecture — and introduces the broader family of sequence-modeling approaches that it exemplifies. The goal is to understand not just the mechanics but the conceptual shift: what problems action chunking solves, why transformer architectures are particularly well-suited to it, and how this paradigm connects to the GenAI stack.

From reactive control to sequence modeling#

The contrast between reactive and sequence-modeling policies is not merely architectural — it reflects different statistical assumptions about the structure of manipulation behavior.

A reactive policy $\pi(a_t \mid o_t)$ models actions as conditionally independent given the current observation. This is the implicit assumption of standard imitation learning: the dataset is treated as a collection of $(o, a)$ pairs, and the policy learns the conditional distribution over actions at each observation independently. The assumption holds when the task has a unique optimal action at every state — when manipulation is unambiguous. In practice, it fails for two reasons.

First, temporal consistency: a reactive policy can output actions at $t$ and $t+1$ that are each locally plausible but globally inconsistent, for instance grasping from the left at $t$ and from the right at $t+1$. The policy has no mechanism to commit to a grasping strategy and execute it coherently because each timestep is treated as independent. This produces jittery, oscillatory behavior that fails despite reasonable single-step accuracy.

Second, multimodal ambiguity: when multiple grasping strategies are equally valid, a reactive policy trained by maximum likelihood learns to average over the modes of the distribution, producing an action that is between all valid options and therefore sub-optimal for any of them. A policy that should either grasp from the left or from the right will instead try to grasp from the middle — the mean of the multimodal distribution.
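The averaging effect can be reproduced with a toy one-dimensional example. The sketch below assumes a hypothetical dataset in which, for one fixed observation, half of the demonstrations grasp at lateral offset -1 (left) and half at +1 (right), and it assumes a unimodal Gaussian likelihood so that maximum likelihood reduces to squared error. The data and names are illustrative only.

```python
# Toy illustration (hypothetical data): one observation, two equally valid
# grasp offsets. Under squared error, the best constant prediction is the
# mean of the targets, which matches neither demonstrated mode.

def mse(prediction, targets):
    """Mean squared error of a single constant prediction."""
    return sum((prediction - a) ** 2 for a in targets) / len(targets)

# Half of the demonstrations grasp from the left (-1.0), half from the right (+1.0).
demo_actions = [-1.0] * 50 + [1.0] * 50

# The minimizer of the MSE over constant predictions is the sample mean.
mse_optimal = sum(demo_actions) / len(demo_actions)

# Compare the averaged action against committing to either demonstrated mode.
candidates = {"left": -1.0, "middle": mse_optimal, "right": 1.0}
losses = {name: mse(a, demo_actions) for name, a in candidates.items()}

for name, loss in losses.items():
    print(f"{name:>6}: action={candidates[name]:+.1f}  mse={loss:.2f}")
```

The middle action attains the lowest training loss even though no demonstration ever used it, which is the failure described above.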

Sequence modeling addresses both problems by modeling the joint distribution over action sequences:

p(a_t, a_{t+1}, \ldots, a_{t+K-1} \mid o_t, o_{t-1}, \ldots, o_{t-H})

This joint distribution can represent temporal consistency (actions that form a coherent trajectory) and multimodal behavior (the distribution selects a mode and generates actions consistent with that mode throughout the sequence).

Action Chunking with Transformers (ACT)#

Architecture and training#

ACT (Zhao et al., 2023) trains an encoder-decoder transformer to predict a chunk of $K$ future actions given a history of $H$ observations. Here the slice $a_{t:t+K}$ denotes the $K$ actions $a_t, \ldots, a_{t+K-1}$, so $a_{t:t+k}$ is the (possibly empty) prefix of actions preceding $a_{t+k}$. Written in its general autoregressive form, training minimizes the negative log-likelihood of the expert action chunk given the observation history:

\mathcal{L}_{\text{ACT}}(\theta) = -\mathbb{E}_{(o_{t-H:t}, a_{t:t+K}) \sim \mathcal{D}}\!\left[\sum_{k=0}^{K-1} \log p_\theta(a_{t+k} \mid a_{t:t+k}, o_{t-H:t})\right]

The observation encoder processes proprioceptive state (joint positions and velocities) and visual observations (RGB images from one or more cameras) into a sequence of embedding vectors. Visual frames are processed by a CNN or ViT backbone producing patch embeddings; proprioceptive state is linearly projected. All observation embeddings are concatenated into the encoder's input sequence. The decoder attends to the full encoded observation context via cross-attention and produces the $K$-step action chunk. In the original ACT implementation the decoder emits all $K$ actions in one forward pass from fixed positional queries rather than token by token, so the conditioning on earlier actions in the objective above drops out and the log-likelihood is realized as an L1 regression loss on the chunk.

In the original ACT formulation, the model also incorporates a Conditional VAE (CVAE) component to handle multimodal action distributions. During training, a style encoder $q_\phi(z \mid a_{t:t+K}, o_{t-H:t})$ processes the full demonstration action chunk together with the observations to produce a posterior distribution over a latent style variable $z \in \mathbb{R}^d$ — a compact representation of the behavioral mode the expert chose (approach angle, speed profile, grasping strategy). The policy decoder is conditioned on both the observations and $z$ , generating actions consistent with the selected mode.

The CVAE is trained by maximizing the Evidence Lower BOund (ELBO) on the log-likelihood of the demonstration actions:

\mathcal{L}_{\text{CVAE}}(\theta, \phi) = \underbrace{\mathbb{E}_{z \sim q_\phi}\!\left[\sum_{k=0}^{K-1} \log p_\theta(a_{t+k} \mid z, o_{t-H:t})\right]}_{\text{reconstruction}} - \underbrace{\beta\, D_{\text{KL}}\!\left(q_\phi(z \mid a_{t:t+K}, o) \;\|\; p(z)\right)}_{\text{regularization}}

where $p(z) = \mathcal{N}(0, I)$ is the Gaussian prior over the latent style. The reconstruction term trains the decoder to produce accurate action chunks given the encoded style; the KL divergence term forces the posterior $q_\phi$ to remain close to the prior, ensuring that the latent space is well-organized and that sampling $z \sim \mathcal{N}(0, I)$ at test time draws from a smooth distribution of behavioral modes. The coefficient $\beta$ (from $\beta$ -VAE) controls the disentanglement-fidelity tradeoff: large $\beta$ produces a more disentangled latent space where different dimensions correspond to interpretable behavioral attributes, at the cost of some reconstruction fidelity.
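To make the two ELBO terms concrete, here is a minimal sketch of the per-sample training loss (the negative ELBO, up to constants). It assumes a diagonal-Gaussian posterior parameterized by `mu` and `log_var`, for which the KL term to $\mathcal{N}(0, I)$ has a closed form, and a fixed-scale Laplace likelihood so that the reconstruction term becomes an L1 error (a Gaussian likelihood would give squared error instead). The function names and the default `beta=1.0` are placeholders, not values taken from the ACT codebase.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )

def l1_reconstruction(pred_chunk, expert_chunk):
    """Mean absolute error over all timesteps and action dimensions."""
    errors = [
        abs(p - e)
        for pred, expert in zip(pred_chunk, expert_chunk)
        for p, e in zip(pred, expert)
    ]
    return sum(errors) / len(errors)

def cvae_loss(pred_chunk, expert_chunk, mu, log_var, beta=1.0):
    """Negative ELBO up to constants: reconstruction error + beta * KL."""
    recon = l1_reconstruction(pred_chunk, expert_chunk)
    return recon + beta * kl_to_standard_normal(mu, log_var)

# Hypothetical 2-step chunk of 2-D actions and a 2-D latent.
expert = [[0.10, 0.20], [0.30, 0.40]]
pred = [[0.12, 0.18], [0.33, 0.35]]
print(cvae_loss(pred, expert, mu=[0.5, -0.5], log_var=[0.0, 0.0]))
```

The KL term is zero exactly when the posterior equals the prior (`mu = 0`, `log_var = 0`), which is the pressure that keeps test-time samples from $\mathcal{N}(0, I)$ inside the region the decoder was trained on.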

At test time the style encoder is discarded and $z$ is drawn from the prior $\mathcal{N}(0, I)$; the original implementation simply fixes $z$ to the prior mean, $z = 0$, for deterministic decoding. In both cases the decoder is conditioned on a single latent value before generating the chunk, so it commits to one mode of the distribution rather than averaging across all modes.

Action chunking and temporal abstraction#

The chunk length $K$ is the most important hyperparameter in ACT. A chunk of $K=1$ reduces to a reactive policy; a chunk of $K=T$ (the full episode) produces a one-shot trajectory generator. Intermediate values provide a tunable tradeoff between replanning frequency and action consistency.

The critical insight is that action chunking reduces the effective decision frequency of the policy. A policy operating at 30 Hz with $K=30$ that executes each chunk in full replans once per second, treating each 1-second window as a single decision. This coarser decision frequency helps in two ways. First, the policy only needs to learn dependencies at the granularity of chunks rather than individual actions, so an episode contains $K$ times fewer decisions at which errors can compound. Second, the policy commits to a behavioral mode for a full chunk duration rather than being free to switch on every step.
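The relationship between chunk length and replanning frequency can be sketched as a simple open-loop execution loop. Everything here is hypothetical scaffolding: `policy`, `get_observation`, and `apply_action` stand in for a trained model and a robot interface, and the dummy versions below exist only so the loop can run.

```python
def run_chunked(policy, get_observation, apply_action, num_steps, chunk_len):
    """Execute `policy` open-loop in chunks; return the number of policy calls."""
    policy_calls = 0
    chunk = []
    for step in range(num_steps):
        # Query the policy only at chunk boundaries.
        if step % chunk_len == 0:
            chunk = policy(get_observation(), chunk_len)
            policy_calls += 1
        # Between boundaries, play back the stored chunk without replanning.
        apply_action(chunk[step % chunk_len])
    return policy_calls

# Dummy stand-ins: the "observation" is just the number of actions executed so far,
# and the "policy" returns a ramp starting from that observation.
executed = []

def dummy_policy(observation, chunk_len):
    return [observation + k for k in range(chunk_len)]

calls = run_chunked(
    policy=dummy_policy,
    get_observation=lambda: float(len(executed)),
    apply_action=executed.append,
    num_steps=90,   # 3 seconds at 30 Hz
    chunk_len=30,   # K = 30
)
print(calls, len(executed))
```

With 90 control steps and $K = 30$ the policy is queried three times, one decision per second of motion at 30 Hz.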

Temporal ensembling is a deployment technique for ACT that smooths execution between chunk replanning events. The policy generates a new chunk at every timestep; rather than executing any one chunk open-loop, it executes a weighted average of all overlapping chunk predictions for the current timestep. With chunks of length $K$ predicted at every step, at time $t \ge K-1$ there are $K$ overlapping predictions for action $a_t$, from the chunks initiated at $t, t-1, \ldots, t-K+1$. These are combined with exponentially decaying weights, which produces smoother motion than executing any single chunk. In the original ACT paper the weights are $w_i = \exp(-m\,i)$ with $i = 0$ indexing the oldest prediction, so a smaller $m$ incorporates new observations faster.
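A minimal sketch of the ensembling step follows. It assumes every chunk predicted so far is kept in a dictionary keyed by its start time, and it uses the weighting $w_i = \exp(-m\,i)$ with $i = 0$ assigned to the oldest overlapping prediction, which is the convention reported for ACT; treat that convention and the value of `m` as assumptions to check against the reference implementation. The helper name `ensemble_action` is hypothetical.

```python
import math

def ensemble_action(chunks, t, m=0.1):
    """Weighted average of every stored prediction for timestep t.

    `chunks` maps a start time s to the chunk predicted at s, so
    chunks[s][t - s] is that chunk's prediction for time t. Weights are
    w_i = exp(-m * i), with i = 0 for the oldest overlapping chunk.
    """
    starts = sorted(s for s in chunks if 0 <= t - s < len(chunks[s]))
    if not starts:
        raise ValueError(f"no stored chunk covers timestep {t}")
    weights = [math.exp(-m * i) for i in range(len(starts))]
    total = sum(weights)
    dim = len(chunks[starts[0]][0])
    return [
        sum(w * chunks[s][t - s][d] for w, s in zip(weights, starts)) / total
        for d in range(dim)
    ]

# Hypothetical 1-D example with K = 4: the chunk predicted at step s says "s" for
# all four of its timesteps, so older chunks vote for smaller values.
K = 4
chunks = {s: [[float(s)] for _ in range(K)] for s in range(4)}

print(ensemble_action(chunks, t=3, m=0.0))  # uniform weights: plain mean
print(ensemble_action(chunks, t=3, m=1.0))  # older predictions dominate
```

With `m = 0` the filter is a plain moving average over the $K$ overlapping predictions; increasing `m` shifts weight toward older chunks under this convention, which smooths more strongly but reacts more slowly to new observations.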

Tokenization strategies for continuous control#

Transformer architectures were designed for discrete token vocabularies. Adapting them to continuous robot action spaces requires an explicit tokenization strategy.

Direct regression is the simplest approach: the transformer's output is a sequence of real-valued action vectors, one per timestep, trained with an L2 or L1 loss. This requires no discretization and has no quantization error, but it loses the categorical modeling strengths of transformers: the ability to represent sharp multimodal distributions and to benefit from techniques like temperature sampling and top-$k$ truncation.

Uniform discretization bins each action dimension into $B$ equal-width bins and represents each action as a tuple of bin indices. The transformer predicts action dimensions as classification problems, outputting a probability vector over $B$ categories per dimension. This representation can capture multimodal distributions (high probability mass at two non-adjacent bins) and benefits from the same training stability and sampling flexibility as language models. The cost is quantization error: fine-grained precision requires large $B$ , which increases the effective action vocabulary and slows inference.
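The binning scheme and its quantization error can be sketched in a few lines. The action range $[-1, 1]$ and $B = 256$ below are illustrative assumptions, and `encode` / `decode` are hypothetical helper names; decoding to the bin center bounds the round-trip error by half a bin width.

```python
import math

def encode(value, low, high, num_bins):
    """Map a continuous value in [low, high] to a bin index in [0, num_bins - 1]."""
    width = (high - low) / num_bins
    index = int(math.floor((value - low) / width))
    # Clip so that value == high (and out-of-range inputs) land in a valid bin.
    return min(max(index, 0), num_bins - 1)

def decode(index, low, high, num_bins):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width

# Hypothetical 2-D action, each dimension normalized to [-1, 1], B = 256 bins.
LOW, HIGH, B = -1.0, 1.0, 256
action = [0.25, -0.8]
tokens = [encode(a, LOW, HIGH, B) for a in action]
recovered = [decode(k, LOW, HIGH, B) for k in tokens]
max_error = max(abs(a - r) for a, r in zip(action, recovered))

print(tokens, recovered, max_error)
```

Halving the worst-case error requires doubling $B$, which is the precision versus vocabulary-size tradeoff described above.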

Vector quantization (VQ) learns a codebook of $M$ action prototypes $\{e_m\}_{m=1}^M \subset \mathbb{R}^d$ and represents each action embedding as the index of its nearest prototype. This produces a compressed trajectory representation as a sequence of codebook indices, enabling a transformer to operate over a discrete token vocabulary of motor primitives. VQ-BeT (Vector Quantized Behavior Transformer; Lee et al., 2024) combines vector quantization with a GPT-style transformer backbone that autoregressively predicts both the codebook index (which motor primitive) and a continuous residual offset (fine-grained correction within the primitive). The codebook is trained with a commitment loss that encourages action embeddings to cluster near codebook entries:

\mathcal{L}_{\text{VQ}} = \underbrace{\|\text{sg}(z_e) - e_{m^*}\|^2}_{\text{codebook update}} + \beta_{\text{commit}}\underbrace{\|z_e - \text{sg}(e_{m^*})\|^2}_{\text{commitment}}

where $z_e$ is the encoder's continuous action embedding, $e_{m^*}$ is the nearest codebook entry, and $\text{sg}(\cdot)$ denotes the stop-gradient operator. The first term pulls the codebook entries toward the encoder's outputs (in practice this term is often replaced by an exponential-moving-average update of the codebook); the second term encourages the encoder to commit to the nearest codebook entry rather than continuously drifting between entries. The codebook can be pre-trained on the entire demonstration dataset to capture natural action clusters, then frozen during policy training, or trained jointly end-to-end. VQ-BeT achieves strong performance on multimodal manipulation tasks precisely because the discrete codebook vocabulary allows the transformer to select distinct action modes cleanly, while the residual offset provides the fine-grained precision that pure codebook prediction would lack.
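The nearest-entry lookup and the code-plus-residual decomposition can be illustrated with a tiny hand-written codebook. This is a sketch under stated assumptions, not the VQ-BeT implementation: the codebook values are made up, `beta_commit=0.25` is a commonly used default rather than a value from the source, and because the stop-gradient only changes which parameters receive gradients, both loss terms have the same numerical value in the forward pass.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(z_e, codebook):
    """Return (index, entry, residual) for the codebook entry nearest to z_e."""
    index = min(range(len(codebook)), key=lambda m: squared_distance(z_e, codebook[m]))
    entry = codebook[index]
    residual = [a - b for a, b in zip(z_e, entry)]
    return index, entry, residual

def vq_loss_value(z_e, entry, beta_commit=0.25):
    """Forward-pass value of the VQ loss: codebook term + beta_commit * commitment term."""
    d = squared_distance(z_e, entry)
    return d + beta_commit * d

# Hypothetical codebook of three 2-D "motor primitives".
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
z_e = [0.9, 1.2]

index, entry, residual = quantize(z_e, codebook)
reconstructed = [e + r for e, r in zip(entry, residual)]

print(index, residual, vq_loss_value(z_e, entry))
```

The discrete index selects the mode and the residual restores the detail that the codebook alone would discard, so `entry + residual` reproduces the original embedding.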

Diffusion-style manipulation policies#

Parallel to the ACT development, diffusion policies (Chi et al., 2023) proposed using denoising diffusion probabilistic models (DDPMs) for action generation. Where ACT decodes a chunk in a single pass through its decoder, diffusion policies generate full action sequences through an iterative denoising process.

A diffusion policy defines a forward noising process that progressively corrupts a clean action chunk $a_0^{1:K}$ through $N$ steps:

q(a_n^{1:K} \mid a_{n-1}^{1:K}) = \mathcal{N}(a_n^{1:K};\; \sqrt{1-\beta_n}\, a_{n-1}^{1:K},\; \beta_n I)

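and trains a denoising network $\varepsilon_\theta(a_n^{1:K}, n, o)$ to predict the injected noise at each diffusion step $n$, conditioned on the observation $o$. With $a_n^{1:K}$ obtained by noising a demonstration chunk $a_0^{1:K}$ with $\varepsilon \sim \mathcal{N}(0, I)$ and $n$ drawn uniformly from $\{1, \ldots, N\}$, the training objective is: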

\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{n, a_0, \varepsilon}\!\left[\| \varepsilon - \varepsilon_\theta(a_n^{1:K}, n, o) \|^2\right]

At inference, the policy starts from Gaussian noise and runs the reverse denoising chain for $N$ steps to produce a clean action chunk. The condition $o$ (current and past observations) guides the denoising toward action chunks consistent with the current task context.
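Composing the Gaussian noising steps gives the standard closed form $a_n = \sqrt{\bar\alpha_n}\, a_0 + \sqrt{1 - \bar\alpha_n}\, \varepsilon$ with $\bar\alpha_n = \prod_{i \le n} (1 - \beta_i)$, which is how training samples are usually drawn. The sketch below builds one such training pair and checks that a perfect noise prediction recovers the clean chunk. The linear schedule endpoints are common DDPM defaults assumed here, not values from the source, and no network is involved: the "prediction" is simply the true noise.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_N (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_n)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def noise_chunk(a0, n, abar, rng):
    """Sample a_n from a_0 in one shot; returns (a_n, eps). n is 1-indexed."""
    scale, sigma = math.sqrt(abar[n - 1]), math.sqrt(1.0 - abar[n - 1])
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in a0]
    a_n = [[scale * x + sigma * e for x, e in zip(r, er)] for r, er in zip(a0, eps)]
    return a_n, eps

def recover_a0(a_n, eps_hat, n, abar):
    """Invert the closed form using a noise estimate eps_hat."""
    scale, sigma = math.sqrt(abar[n - 1]), math.sqrt(1.0 - abar[n - 1])
    return [[(x - sigma * e) / scale for x, e in zip(r, er)] for r, er in zip(a_n, eps_hat)]

def denoising_loss(eps, eps_hat):
    """Mean squared error between true and predicted noise."""
    diffs = [(e - h) ** 2 for r, hr in zip(eps, eps_hat) for e, h in zip(r, hr)]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
abar = alpha_bars(linear_beta_schedule(100))
a0 = [[0.1, -0.2], [0.3, 0.4]]            # hypothetical 2-step chunk of 2-D actions
a_n, eps = noise_chunk(a0, n=50, abar=abar, rng=rng)
print(denoising_loss(eps, eps), recover_a0(a_n, eps, 50, abar))
```

In training, `eps_hat` would come from the denoising network evaluated on `(a_n, n, o)`; the loss above is then the per-sample version of the objective.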

Diffusion policies are particularly well-suited to contact-rich manipulation: when the task requires precise contact forces (inserting a plug, unfolding a cloth), the distribution over valid actions is narrow and multimodal, and the iterative refinement of diffusion allows the model to progressively narrow the action distribution from broad initial noise to a precise, coherent action sequence. ACT, by contrast, produces the whole chunk in a single forward pass with no iterative refinement, which limits its ability to represent extremely tight distributions.

The cost of diffusion policies relative to ACT is inference latency: if a new chunk is sampled at every control step, running $N = 100$ denoising steps at 30 Hz requires 3000 evaluations of the denoising network per second. Executing several actions from each chunk before replanning and using accelerated samplers (DDIM, 5–20 step inference) both reduce this cost, but diffusion inference remains heavier than ACT's single-pass decoding. The $\pi_0$ model (Black et al., 2024) uses flow matching, a continuous-time generalization of diffusion, to reduce inference steps to as few as 5–10 while maintaining diffusion's multimodal expressiveness.
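The latency budget is simple arithmetic, sketched below. The helper name and the second configuration (10 sampler steps, replanning every 8 control steps) are hypothetical values chosen for illustration.

```python
def denoiser_calls_per_second(num_denoise_steps, control_hz, replan_every=1):
    """Network evaluations per second when a new chunk is sampled
    every `replan_every` control steps."""
    return num_denoise_steps * control_hz / replan_every

# Full DDPM chain, replanning at every control step.
print(denoiser_calls_per_second(100, 30))

# Hypothetical accelerated setting: 10 sampler steps, replanning every 8 control steps.
print(denoiser_calls_per_second(10, 30, replan_every=8))
```

The two knobs multiply: cutting sampler steps by 10 and replanning 8 times less often reduces the load by a factor of 80 in this example.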

Temporal resolution and horizon design#

A fundamental design axis for sequence-model policies is the tradeoff between temporal resolution (how finely the policy discretizes time) and planning horizon (how many steps into the future the policy models explicitly).

High temporal resolution (short control period, e.g., 1 ms) allows precise trajectory shaping but requires the policy to model very long sequences even for short behaviors, straining the transformer's context capacity. Low temporal resolution (long control period, e.g., 100 ms) allows longer nominal horizons but coarsens the trajectory and can miss the contact precision required for assembly tasks. ACT with typical $K = 20$–$100$ at 10–30 Hz has an effective planning horizon of roughly 0.7–10 seconds ($K = 20$ at 30 Hz up to $K = 100$ at 10 Hz), which is appropriate for most manipulation tasks.
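The horizon figures follow directly from chunk length and control rate. A small sketch of the arithmetic, with hypothetical helper names:

```python
def horizon_seconds(chunk_len, control_hz):
    """Duration covered by one chunk of `chunk_len` actions at `control_hz`."""
    return chunk_len / control_hz

def steps_for_duration(duration_s, control_period_s):
    """Number of control steps needed to cover `duration_s` seconds."""
    return round(duration_s / control_period_s)

# Shortest and longest horizons in the quoted range.
shortest = horizon_seconds(20, 30)    # K = 20 at 30 Hz
longest = horizon_seconds(100, 10)    # K = 100 at 10 Hz

# Sequence length needed to model a 1-second behavior at two control periods.
fine = steps_for_duration(1.0, 0.001)   # 1 ms period
coarse = steps_for_duration(1.0, 0.1)   # 100 ms period

print(round(shortest, 2), longest, fine, coarse)
```

The same one-second behavior spans two orders of magnitude more steps at a 1 ms control period than at 100 ms, which is the context-length pressure described above.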

The replanning rate (how often the policy generates a new chunk) introduces a secondary design choice. Replanning every step allows rapid correction but wastes computation on chunks that are immediately superseded; replanning every $K/2$ steps with temporal ensembling provides a smooth tradeoff. The optimal replanning rate depends on the task's sensitivity to disturbances: tasks with unpredictable contacts (deformable objects, multi-fingered grasping) benefit from frequent replanning; tasks with predictable open-loop dynamics (sweeping, wiping) benefit from infrequent replanning.

GenAI context: control as sequence generation#

The parallel between ACT and autoregressive language generation is precise:

| Robot control (ACT) | Language model |
|---|---|
| Observation history | Prompt / context |
| Action chunk $a_{t:t+K}$ | Completion sequence |
| Proprioception tokens | System instruction |
| Visual patch embeddings | Image tokens |
| Chunk length $K$ | Generation length |
| CVAE latent $z$ | Temperature / sampling strategy |
| Temporal ensembling | Token reranking / aggregation |

This alignment is not accidental — it reflects that both manipulation and language generation are sequential tasks where generating consistent, coherent outputs requires modeling temporal dependencies and committing to a mode among multiple valid continuations. The alignment also opens direct technical connections: ACT-style policies can be initialized from language model weights (sharing the transformer backbone), conditioned on language task descriptions (natural language as the "prompt"), and scaled using the same training recipes as LLMs. These connections motivate the Vision-Language-Action models studied in Week 10.

Key takeaways#

ACT reformulates manipulation as sequence prediction, using an encoder-decoder transformer to produce $K$-step action chunks from observation histories.
The chunk length $K$ controls the tradeoff between replanning frequency and action consistency; temporal ensembling smooths execution between replanning events.
The CVAE-based conditioning resolves the multimodal averaging problem of direct regression by sampling a latent style variable before generating the chunk.
Tokenization of continuous actions can be achieved through direct regression, uniform discretization, or learned vector quantization, each with different expressiveness and computational tradeoffs.
Diffusion policies generate action chunks through iterative denoising, achieving strong performance on contact-rich tasks at higher inference cost.
The architectural alignment between ACT and autoregressive language generation opens direct technical connections: shared backbones, language conditioning, and cross-domain pretraining are all natural extensions.

Conceptual questions#

The temporal ensembling strategy for ACT averages the action predictions from $K$ overlapping chunks at each timestep, weighted by recency. Analyze the effect of this averaging on the policy's closed-loop bandwidth: if a sudden perturbation occurs at time $t$ (an unexpected contact pushes the object), how many timesteps does it take for the ensembled policy to respond fully? Derive the transfer function of the temporal ensembling filter and express the bandwidth reduction as a function of $K$ . Under what task conditions is this bandwidth reduction acceptable, and when is it not?
ACT with CVAE conditioning samples a latent style variable $z \sim \mathcal{N}(0, I)$ at test time, where $z$ was learned to represent behavioral modes during training. For a task with 3 distinct valid grasping modes, analyze what happens to the CVAE latent space during training if the demonstration dataset contains 80% mode A, 15% mode B, and 5% mode C. How does this imbalance affect the conditional action distribution $p_\theta(a \mid o, z)$ for $z$ values corresponding to mode C? What modification to the training procedure would give more uniform mode coverage?
Diffusion policies run $N$ denoising steps to generate each action chunk, where $N$ is typically 50–100 for high-quality generation and 5–10 for fast inference. Using the DDIM sampler, the number of steps can be reduced at the cost of sample quality. Analyze the tradeoff between step count and sample quality specifically for a peg insertion task requiring 0.5 mm positional precision. What denoising step count would you use, and how would you diagnose whether the step count is insufficient without running physical experiments?
A behavior-cloned ACT policy achieves 85% success rate on a cup-placement task in evaluation. The team observes that on the 15% failures, the robot produces a jittery approach trajectory that ends with the cup dropped approximately 3 cm from the target. Hypothesize whether this failure is caused by (a) incorrect mode selection from the CVAE, (b) insufficient chunk length for this task, or (c) temporal ensembling overly smoothing a rapid contact adjustment. Describe the diagnostic procedure to distinguish these hypotheses and the corresponding fixes.
The alignment between ACT and LLM architectures suggests pretraining the ACT transformer on a large language corpus before fine-tuning on robot demonstrations. Analyze the expected benefits and failure modes of this approach: which components of the transformer architecture transfer well (attention patterns, positional encoding, MLP layers), and which would require re-learning from scratch (action tokenization, visual conditioning)? Propose an experimental design to measure the contribution of language pretraining to the final policy's performance separately from the contribution of architecture choice.

Solutions

Temporal-ensembling bandwidth. Averaging $K$ overlapping, recency-weighted chunks is a low-pass filter, so a perturbation at $t$ is fully reflected only after the effective averaging window (on the order of the chunk overlap). The filter is a weighted moving average — exponential weights give roughly a first-order low-pass whose cutoff falls as the window grows — so bandwidth drops with $K$ . This is acceptable for smooth, quasi-static tasks but not for fast contact reactions that need high closed-loop bandwidth.
CVAE mode imbalance. The latent space allocates volume in proportion to mode frequency, so the 5% mode C region is under-trained: $p_\theta(a\mid o,z)$ for those $z$ is high-variance and may collapse toward mode A. Mode C becomes unreliable. Fix by reweighting/oversampling mode C, using a discrete or mixture latent with a balanced prior, or clustering-then-balancing the demos so each mode gets adequate latent coverage.
Diffusion steps for 0.5 mm. Too few DDIM steps under-integrate the denoising ODE, biasing or blurring the action and missing sub-millimeter precision; use more steps (~50) or a high-quality fast sampler. Diagnose without hardware by measuring generated-action error against a many-step reference on held-out data, checking positional variance of chunks near contact, and confirming precision plateaus as steps increase.
Cup-drop diagnosis. (a) Wrong CVAE mode: cluster failures by the sampled $z$ and see if they concentrate on one mode. (b) Too-short chunk: check whether failures sit at chunk boundaries and retry with longer chunks. (c) Over-smoothing: reduce or disable temporal ensembling and see if the jittery approach and drop disappear. Run each as an ablation; a jittery approach ending in a near-contact drop points most to (c) or (b).
Language pretraining for ACT. Attention patterns, positional encodings, and MLP layers (general sequence machinery) transfer reasonably; action tokenization, visual conditioning, and the action head have no language analog and must be learned from scratch. Benefit: stronger long-range temporal modeling; failure mode: irrelevant or harmful language priors and tokenization mismatch. Measure the contribution with a 2×2 ablation — language-pretrained vs random init, crossed with the transformer vs a matched alternative architecture.

Looking ahead#

ACT and diffusion policies model the action distribution conditioned on visual and proprioceptive observations. The next step is to understand the generative process more deeply — specifically, how flow matching provides a mathematically cleaner and computationally faster alternative to diffusion, and how these generative paradigms can be unified under the framework of transport maps between probability distributions.

Week 9: Flow Matching and Diffusion for Robot Policies. We derive the flow matching objective from first principles, examine how it relates to the diffusion score-matching objective, and analyze the practical tradeoffs between diffusion and flow-based policies for real-time robotic control.
