Week 4: Teleoperation and Data Collection

Purpose of this lecture#

Modern robot learning begins not with algorithms, but with data. Before imitation learning, reinforcement learning, or any foundation model for robotics can succeed, a more fundamental question must be answered: how do we collect high-quality robot experience that reflects meaningful behavior, covers the relevant state space, and can be reproduced at scale?

This lecture addresses teleoperation and human-in-the-loop data collection — the dominant paradigm for generating the robot datasets that underpin everything from classical behavior cloning to the most capable vision-language-action models. The engineering decisions made at the data collection stage propagate through every downstream component. A policy cannot generalize to states absent from its training set; a world model cannot predict dynamics it has never observed. Understanding what makes a dataset good — and what makes it pathological — is therefore a prerequisite for understanding why algorithms succeed or fail in practice.

Why robot data is structurally different#

Robot data is expensive, slow to generate, safety-critical, and tightly coupled to physical hardware. Each of these properties has consequences that distinguish robot learning from most of machine learning.

Expense arises because demonstrations require skilled human operators, physical robot hardware, and dedicated lab time. At current scales, a robotics team might collect thousands of demonstrations per task per week; a web-scale language model trains on trillions of tokens assembled passively from existing content. The orders-of-magnitude gap in data generation rate shapes every design decision: unlike NLP, robotics cannot afford to discard mediocre data or iterate freely on data collection protocols.

Coupling to hardware means that robot data is not reusable across systems without significant effort. Demonstrations collected on a 7-DoF arm with a parallel-jaw gripper do not directly transfer to a 6-DoF arm with a dexterous hand, because the action space, joint limits, and contact geometry differ. Even within a single hardware platform, mechanical wear, sensor drift, and recalibration events make data collected at different times subtly inconsistent.

Safety constraints during collection are qualitatively different from safety constraints during deployment. When a human operator teleoperates a robot, they impose implicit safety constraints — slowing down near contacts, avoiding joint limits, aborting executions that look dangerous — that the learned policy must eventually internalize without the human present. These implicit constraints are invisible in the recorded $(s, a)$ pairs; the policy sees the safe trajectory but not the reasoning that avoided the unsafe one.

These structural properties motivate thinking of data collection as a first-class engineering problem, not a preprocessing step.

Teleoperation: the dominant paradigm#

Teleoperation is the process by which a human operator controls a robot in real time, generating expert demonstrations that are logged as state-action trajectories. The human provides intent, task structure, and implicit safety knowledge; the robot's sensors record states, actions, forces, and visual observations. This interaction forms the backbone of nearly every large-scale robot learning dataset, from the Open X-Embodiment collection to the datasets used to train ACT, diffusion policies, and GR00T.

The quality of teleoperation data is jointly determined by the operator (skill, consistency, deliberateness), the interface (expressiveness, ergonomics, latency), and the robot (backdrivability, sensor suite, end-effector design). Each of these axes introduces variance into the demonstration distribution, and that variance has downstream consequences for policy generalization.

VR-based teleoperation#

Virtual reality interfaces track the operator's full 6-DoF head and hand pose using optical or electromagnetic tracking systems, mapping human arm movements directly onto the robot's end-effector in Cartesian space. Because the mapping is spatial and intuitive — moving your hand forward moves the robot end-effector forward by a corresponding amount — VR interfaces allow operators to provide demonstrations that mirror natural manipulation behavior without formal robotics training.

The fidelity of VR teleoperation is constrained by two factors: latency and workspace mismatch. Round-trip latency from human motion to robot motion to visual feedback must be kept below roughly 50–100 ms for stable closed-loop operation; at higher latencies, operators begin to over-correct, introducing oscillations into the demonstration that the policy will reproduce. Workspace mismatch occurs when the human's range of motion does not correspond to the robot's reachable workspace — operators must mentally remap scales and angles, which degrades demonstration quality on tasks requiring precision.
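Workspace mismatch is often handled with a scaled, clamped mapping from hand displacement to end-effector target. The following is a minimal sketch of that idea, not the mapping used by any particular system: the function name `hand_to_ee_target`, the uniform `scale` factor, and the axis-aligned workspace box are all illustrative assumptions.

```python
import numpy as np


def hand_to_ee_target(hand_pos, hand_origin, ee_origin, scale, ws_min, ws_max):
    """Map a tracked hand position to a clamped end-effector target position."""
    # Displacement of the hand from the pose at which teleoperation was engaged.
    delta = np.asarray(hand_pos, dtype=float) - np.asarray(hand_origin, dtype=float)
    # Scale human motion into robot motion, offset from the robot's starting pose.
    target = np.asarray(ee_origin, dtype=float) + scale * delta
    # Clamp to the (assumed) reachable box so out-of-range hand motion saturates.
    return np.clip(target, ws_min, ws_max)


# Hypothetical numbers: the hand moves 20 cm forward and 20 cm up.
target = hand_to_ee_target(
    hand_pos=[0.30, 0.00, 1.20],
    hand_origin=[0.10, 0.00, 1.00],
    ee_origin=[0.40, 0.00, 0.30],
    scale=0.5,
    ws_min=[0.20, -0.40, 0.05],
    ws_max=[0.70, 0.40, 0.35],
)
print(target)  # x moves 10 cm; z saturates at the top of the workspace box
```

A scale below 1 trades reach for precision; the clamp is where the operator's mental remapping burden shows up, because hand motion past the box produces no robot motion.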

Kinesthetic teaching#

Kinesthetic teaching bypasses the operator-to-robot mapping problem entirely: the operator physically grasps the robot and moves it through the desired trajectory by direct physical contact. This produces demonstrations that are intrinsically consistent with the robot's kinematic constraints, because the robot's own structure filters out infeasible configurations.

The fundamental prerequisite for kinesthetic teaching is backdrivability — the robot must offer low mechanical impedance when being moved, so that the operator can guide it without substantial force. Series elastic actuators and torque-controlled joints enable this, whereas high-gear-ratio position-controlled actuators (common in industrial robots) do not. When backdrivability is available, kinesthetic teaching produces some of the most precise and physically consistent demonstrations achievable; when it is absent, the operator is fighting the robot's own mechanism, and the resulting demonstrations are compromised.

Haptic and force-feedback interfaces#

Haptic teleoperation devices — ranging from simple force-feedback joysticks to full exoskeleton-style bilateral control systems — transmit force and torque information back to the operator as they control the robot. When the robot contacts an object, the operator feels a corresponding resistive force; when the robot lifts a heavy object, the operator experiences a corresponding load.

This bidirectional force channel is particularly valuable for manipulation tasks where contact forces carry critical task information: assembly operations requiring precise insertion forces, deformable object manipulation where compliance determines success, and tasks involving fragile objects where exceeding a force threshold causes failure. The dataset produced by haptic teleoperation therefore includes richer information than the robot's motor commands alone — it captures the contact dynamics that the operator used to govern the trajectory.

From the state-action perspective, haptic systems expand what can be included in the state observation $s$. A dataset logged with a wrist force-torque sensor records the contact wrench $F \in \mathbb{R}^6$ at each timestep. When $F$ is included in the observation $s_t = (q_t, \dot{q}_t, x_{ee,t}, F_t, I_t)$, where $I_t$ is the RGB image, the policy can learn to condition its actions on contact force: modulating insertion depth based on felt resistance, or adjusting grip force based on object compliance. Policies that receive only proprioceptive and visual observations cannot reliably infer contact force from the image and therefore struggle to exhibit this force-adaptive behavior. The choice to include or exclude force from the state observation $s$ is a design decision with direct consequences for the policy's capability on contact-rich tasks, and it must be consistent between data collection and policy deployment.
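To make the design decision concrete, the sketch below assembles the low-dimensional part of $s_t$ with and without the wrench $F_t$. It is illustrative only: the helper name `build_observation`, the 7-joint arm, and the 7-D end-effector pose (position plus quaternion) are assumptions, and the image $I_t$ is left out because it would normally pass through a separate vision encoder.

```python
import numpy as np


def build_observation(q, dq, x_ee, wrench=None):
    """Concatenate proprioceptive state, optionally appending the 6-D contact wrench."""
    parts = [q, dq, x_ee]
    if wrench is not None:
        parts.append(wrench)
    return np.concatenate([np.asarray(p, dtype=float).ravel() for p in parts])


q = np.zeros(7)      # joint positions (assumed 7-DoF arm)
dq = np.zeros(7)     # joint velocities
x_ee = np.zeros(7)   # end-effector pose: position + quaternion (assumption)
F = np.array([0.0, 0.0, -4.2, 0.0, 0.0, 0.0])  # hypothetical wrench reading

obs_with_force = build_observation(q, dq, x_ee, F)
obs_without_force = build_observation(q, dq, x_ee)
print(obs_with_force.shape, obs_without_force.shape)  # (27,) (21,)
```

Because the two variants have different dimensions, a policy trained on one cannot be fed the other, which is one practical reason the choice has to match between collection and deployment.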

The engineering cost of haptic systems is significant: bilateral control requires low-latency, high-bandwidth force measurement at the robot, precise force rendering at the haptic device, and a communications architecture that maintains stability under feedback delay. A haptic control loop that becomes unstable due to communication latency produces erratic, jerky demonstrations that contaminate the dataset.

Joystick and gamepad interfaces#

At the other end of the complexity spectrum, joystick and gamepad interfaces map simple analog inputs (axis deflections, button presses) to robot commands. These interfaces are inexpensive, familiar to non-expert operators, and require no hardware beyond a standard controller. Their cost is expressiveness: a typical gamepad exposes only two 2-axis analog sticks, too few axes to command the position and orientation of a 6-DoF end-effector simultaneously, so operators must switch between control modes, introducing discontinuities into the demonstration trajectories.

For tasks amenable to low-dimensional control — mobile manipulation, object pushing, simple pick-and-place — joystick interfaces can produce adequate demonstrations at lower cost than VR or haptic systems. For tasks requiring simultaneous control of position and orientation (bimanual manipulation, contact-rich assembly), they are generally insufficient.
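The mode-switching limitation can be sketched in a few lines. This is a hypothetical mapping, not a real controller API: `gamepad_to_twist`, the three-axis input, and the speed limits `max_lin` and `max_ang` are assumptions chosen for illustration.

```python
import numpy as np


def gamepad_to_twist(axes, rotate_mode, max_lin=0.1, max_ang=0.5):
    """Map three analog axes in [-1, 1] to a 6-D twist [vx, vy, vz, wx, wy, wz]."""
    axes = np.clip(np.asarray(axes, dtype=float), -1.0, 1.0)
    twist = np.zeros(6)
    if rotate_mode:
        twist[3:] = max_ang * axes  # angular velocity in rad/s
    else:
        twist[:3] = max_lin * axes  # linear velocity in m/s
    return twist


stick = [1.0, 0.0, -0.5]
translate_cmd = gamepad_to_twist(stick, rotate_mode=False)
rotate_cmd = gamepad_to_twist(stick, rotate_mode=True)
print(translate_cmd)
print(rotate_cmd)
```

In either mode, half of the twist is identically zero, so a logged trajectory alternates between pure translation and pure rotation. Those segment boundaries are the discontinuities a policy trained on this data would inherit.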

Dataset design: coverage, diversity, and bias#

Collecting more data is necessary but not sufficient. What matters most is which behavioral distribution the dataset represents and how that distribution relates to the deployment distribution the policy will face.

The behavioral cloning objective and distributional coverage#

Before discussing what makes datasets good or bad, it is useful to state precisely what the downstream algorithm is trying to optimize. Behavior cloning (examined in depth in Week 5) treats the demonstration dataset as a supervised learning problem and minimizes the expected negative log-likelihood of the expert action under the policy:

$$\mathcal{L}_{\text{BC}}(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[-\log \pi_\theta(a \mid s)\right]$$

The expectation is taken over the demonstration distribution $d_{\text{demo}}(s)$, the marginal over states that the expert visited. The critical implication is that the policy has no training signal for states outside $\text{supp}(d_{\text{demo}})$. At deployment, the policy induces its own state visitation distribution $d_{\pi_\theta}(s)$. If $\pi_\theta$ is not a perfect copy of the expert, the two distributions diverge, and the policy must act on states for which the behavioral cloning loss provides no supervision.
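As a concrete instance of the behavior cloning objective, the sketch below evaluates the negative log-likelihood for a diagonal Gaussian policy. The Gaussian form, the function name `gaussian_bc_loss`, and the synthetic data are assumptions made for illustration; the lecture does not prescribe a particular policy class.

```python
import numpy as np


def gaussian_bc_loss(pred_mean, log_std, expert_actions):
    """Mean negative log-likelihood of expert actions under a diagonal Gaussian policy."""
    pred_mean = np.asarray(pred_mean, dtype=float)
    expert_actions = np.asarray(expert_actions, dtype=float)
    log_std = np.broadcast_to(np.asarray(log_std, dtype=float), pred_mean.shape)
    var = np.exp(2.0 * log_std)
    # Per-dimension NLL: 0.5 * (a - mu)^2 / sigma^2 + log(sigma) + 0.5 * log(2 * pi)
    nll = 0.5 * ((expert_actions - pred_mean) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi))
    # Sum over action dimensions, average over the batch of (s, a) pairs.
    return nll.sum(axis=-1).mean()


rng = np.random.default_rng(0)
expert = rng.normal(size=(128, 7))                      # synthetic expert actions
close_pred = expert + 0.01 * rng.normal(size=expert.shape)
far_pred = np.zeros_like(expert)

loss_close = gaussian_bc_loss(close_pred, -1.0, expert)
loss_far = gaussian_bc_loss(far_pred, -1.0, expert)
print(loss_close < loss_far)
```

Note that the loss is only ever evaluated on states that appear in the batch, which is the code-level counterpart of having no training signal outside the support of the demonstrations.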

The severity of this distributional shift has been formally characterized. Let $\epsilon$ be the per-step loss under the demonstration distribution. The cumulative loss of the behavior-cloned policy over a trajectory of $T$ steps is bounded by $O(T^2 \epsilon)$ — the quadratic term reflects compounding errors: each small deviation moves the policy slightly off the demonstrated path, increasing the probability of a larger error at the next step, which compounds further. This $O(T^2)$ growth is the theoretical motivation for all the data coverage practices described below: if the dataset does not cover states near the demonstrated trajectory, the policy will compound errors catastrophically on long-horizon tasks.
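One informal way to see where the quadratic term comes from (a simplified illustration, not a proof): suppose the policy makes its first mistake at step $t$ with probability at most $\epsilon$ and, in the worst case, pays unit cost on every remaining step. Summing over $t$ gives $\sum_{t=1}^{T} \epsilon\,(T - t + 1) = \epsilon\,T(T+1)/2$, which grows as $T^2$. The helper name `compounding_error_bound` is hypothetical.

```python
def compounding_error_bound(horizon, eps):
    """Worst-case expected cost when a first error at step t costs (horizon - t + 1)."""
    return sum(eps * (horizon - t + 1) for t in range(1, horizon + 1))


eps = 0.01
for T in (10, 100, 1000):
    bound = compounding_error_bound(T, eps)
    closed_form = eps * T * (T + 1) / 2
    print(f"T={T:4d}  bound={bound:9.2f}  closed form={closed_form:9.2f}")
```

A tenfold increase in horizon raises the bound roughly a hundredfold. Once the value exceeds $T$ it is vacuous as a bound on a cost that is at most $T$, but the scaling is the point: long-horizon tasks are disproportionately sensitive to per-step error.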

| Coverage | Diversity | Operator Bias |
| --- | --- | --- |
| Does the dataset include recovery behaviors for states the policy will encounter when it inevitably drifts from the nominal demonstration? | Variation in initial configurations, lighting, object placements, and operator styles to prevent overfitting to specific collection conditions. | Idiosyncrasies like preferred grasp orientations or approach speeds that a policy might overfit to as task constraints. |

Coverage refers to whether the dataset includes the states and transitions that the policy will encounter during autonomous operation. A dataset collected entirely from nominal-start, nominal-conditions demonstrations systematically excludes the recovery behaviors needed when the policy drifts to non-nominal states. Because learned policies inevitably deviate from the demonstrated trajectory (no learned mapping is perfect), the policy will encounter out-of-distribution states regardless of how accurate its nominal-state behavior is. A dataset without recovery behaviors will produce policies that succeed while they stay on the demonstrated trajectories and fail catastrophically after any perturbation away from them.

Diversity refers to variation along dimensions that affect policy generalization: initial configurations, object placements, lighting conditions, operator execution styles, and task parameterizations. Policies trained on low-diversity datasets overfit to the specific conditions present during collection and fail when deployment conditions differ. Diversity is usually obtained by varying initial conditions systematically, collecting demonstrations from multiple operators, and introducing controlled environmental perturbations.

Operator bias is an underappreciated source of dataset pathology. Human demonstrators exhibit consistent behavioral tendencies — preferred grasp orientations, characteristic approach trajectories, habitual speed profiles — that the policy will reproduce regardless of whether those tendencies are task-optimal. A dataset collected by a single operator encodes that operator's idiosyncrasies as apparent task constraints. A policy trained on this data will attempt to reproduce them even in configurations where they are suboptimal or infeasible. Using multiple operators, providing standardized demonstration protocols, and filtering demonstrations by task completion quality can partially mitigate this, but bias introduced at the collection stage cannot be fully removed by downstream processing.
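Systematic variation of initial conditions across multiple operators can be planned before collection starts. The sketch below enumerates every combination of operator, object placement, and lighting condition and shuffles the order; the function name `make_collection_plan` and all the example values are hypothetical.

```python
import itertools
import random


def make_collection_plan(operators, object_positions, lighting_conditions, repeats=1, seed=0):
    """Enumerate every (operator, position, lighting) combination, then shuffle the order."""
    plan = [
        {"operator": op, "object_position": pos, "lighting": light}
        for op, pos, light in itertools.product(operators, object_positions, lighting_conditions)
        for _ in range(repeats)
    ]
    # Shuffling avoids confounding a condition with time-of-day or operator fatigue.
    random.Random(seed).shuffle(plan)
    return plan


plan = make_collection_plan(
    operators=["op_a", "op_b", "op_c"],
    object_positions=[(0.40, -0.10), (0.45, 0.00), (0.50, 0.10)],
    lighting_conditions=["overhead", "side"],
    repeats=2,
)
print(len(plan))  # 3 operators * 3 positions * 2 lighting * 2 repeats = 36
print(plan[0])
```

A full factorial plan like this guarantees that no operator is paired with only one placement, so operator habits are less likely to be learned as if they were properties of a particular object position.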

Logging pipelines and time synchronization#

A robot learning dataset is a synchronized multimodal recording: at each timestep, the logging system must capture joint positions and velocities, commanded actions (torques, targets, or position setpoints depending on the control interface), RGB and depth images from one or more cameras, force-torque sensor readings at the wrist, and any task-specific metadata (object identities, goal configurations, episode flags).

| Multimodal Logging | Time Sync | Jitter & Drift |
| --- | --- | --- |
| Synchronizing joint states, end-effector poses, multiple camera feeds, and high-frequency force-torque data into a unified stream. | Handling disparate sensor rates (30 Hz cameras vs. 1000 Hz encoders) using shared time references or hardware triggers. | Mitigating non-deterministic network delays and clock drift that cause misaligned observations in the dataset. |

The central engineering challenge is time synchronization. Sensors run at different rates: cameras at 30–60 Hz, force-torque sensors at 100–1000 Hz, joint encoders at the servo controller update rate (typically 500–1000 Hz), and the teleoperation interface at a rate determined by its own hardware. Associating these streams into a coherent dataset requires a shared time reference and a principled strategy for handling misaligned timestamps.

The cleanest solution is hardware synchronization: a hardware trigger signal (SYNC pulse) is sent to all sensors simultaneously, forcing them to sample at a common reference time. This eliminates timestamp uncertainty but requires hardware support on all sensors. When hardware synchronization is unavailable, software timestamps (assigned by the host computer at the moment of receipt) are used instead, with interpolation employed to align streams to a common grid. Software timestamps introduce jitter that grows with operating system scheduling latency; under load, this jitter can exceed 10 ms, which is roughly one-third of the 33 ms frame period of a 30 Hz camera.
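When streams carry only software timestamps, a common first step is to associate each sample of the slow stream with the nearest sample of the fast stream. The sketch below does this with `np.searchsorted`; the helper name `nearest_indices` and the idealized, jitter-free timestamps are assumptions for illustration.

```python
import numpy as np


def nearest_indices(ref_times, query_times):
    """For each query time, return the index of the closest reference timestamp.

    ref_times must be sorted in ascending order and contain at least two entries.
    """
    ref_times = np.asarray(ref_times, dtype=float)
    query_times = np.asarray(query_times, dtype=float)
    # Index of the first reference timestamp that is >= each query time.
    right = np.searchsorted(ref_times, query_times)
    right = np.clip(right, 1, len(ref_times) - 1)
    left = right - 1
    # Choose whichever neighbour is closer in time.
    pick_left = (query_times - ref_times[left]) <= (ref_times[right] - query_times)
    return np.where(pick_left, left, right)


encoder_t = np.arange(0.0, 1.0, 0.002)       # 500 Hz encoder timestamps
camera_t = np.arange(0.0, 1.0, 1.0 / 30.0)   # 30 Hz camera timestamps
idx = nearest_indices(encoder_t, camera_t)
residual = np.abs(encoder_t[idx] - camera_t)
print(f"worst-case residual: {residual.max() * 1000:.2f} ms")
```

With ideal timestamps the residual is bounded by half the fast stream's period (1 ms here). Association cannot remove jitter in the timestamps themselves; it only guarantees that the pairing adds no more than that half-period on top.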

The consequences of timing misalignment can be quantified. Let $\Delta t$ be the timestamp mismatch between the visual observation and the joint command at a given timestep, and let $\dot{x}$ be the end-effector velocity at that moment. The spatial misalignment introduced into the state-action mapping is:

$$\Delta x = \dot{x}\,\Delta t$$

For a robot moving at a typical manipulation speed of $\dot{x} = 0.1$ m/s with a timestamp mismatch of $\Delta t = 15$ ms (a plausible software-timestamp offset under moderate system load), the spatial error is $\Delta x = 1.5$ mm per observation. While small in isolation, this error is systematic whenever the offset has a consistent sign, as it does when images are always stamped late: it consistently biases the mapping from the visual observation (showing the world 15 ms ago) to the current joint command, and it propagates into a systematic policy error. In practice, this manifests as a policy that reaches slightly past targets or initiates grasp closures fractionally early, producing failure modes that look like calibration errors but stem from data collection methodology.
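The same arithmetic can be tabulated for other speeds and mismatches. This is a direct evaluation of $\Delta x = \dot{x}\,\Delta t$; the function name `spatial_misalignment_mm` and the particular grid of values are illustrative choices.

```python
def spatial_misalignment_mm(speed_m_per_s, mismatch_ms):
    """Spatial error (mm) from end-effector speed (m/s) and timestamp mismatch (ms)."""
    mismatch_s = mismatch_ms / 1000.0
    return speed_m_per_s * mismatch_s * 1000.0  # metres -> millimetres


for speed in (0.05, 0.1, 0.5):
    for mismatch in (5, 15, 33):
        error = spatial_misalignment_mm(speed, mismatch)
        print(f"{speed:.2f} m/s, {mismatch:2d} ms -> {error:5.2f} mm")
```

The error is linear in both factors, so a fast reaching motion with a one-frame (33 ms) mismatch produces an error more than ten times larger than the 1.5 mm case in the text.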

Time synchronization errors are silent: they produce a dataset that appears consistent but contains subtle misalignments between the visual observation and the motor command at each timestep. A policy trained on misaligned data will learn a slightly incorrect mapping from observations to actions, and this bias will manifest as systematic errors in deployed behavior that are extremely difficult to diagnose.

Well-designed logging pipelines record raw, unprocessed sensor streams with high-resolution timestamps and apply synchronization in post-processing, where alignment strategies can be verified and adjusted. This separation of raw logging from post-processed datasets is essential for reproducibility: if the raw streams are preserved, the dataset can be re-synchronized with different alignment strategies as post-processing techniques improve.
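The separation of raw logging from synchronization can be sketched as a logger that keeps each stream in its own list of timestamped samples and performs no alignment at all. The class name `RawStreamLogger`, the stream names, and the simulated timestamps are assumptions for illustration.

```python
import time
from collections import defaultdict


class RawStreamLogger:
    """Stores each sensor stream as its own list of (timestamp, value) samples."""

    def __init__(self):
        self.streams = defaultdict(list)

    def record(self, stream, value, timestamp=None):
        # Fall back to a monotonic host clock when the sensor supplies no timestamp.
        t = time.monotonic() if timestamp is None else float(timestamp)
        self.streams[stream].append((t, value))

    def timestamps(self, stream):
        return [t for t, _ in self.streams[stream]]


raw = RawStreamLogger()
for k in range(100):  # simulated 100 Hz encoder over one second
    raw.record("joint_angles", [0.0] * 7, timestamp=k / 100)
for k in range(30):   # simulated 30 Hz camera over the same second
    raw.record("image_path", f"frame_{k:04d}.png", timestamp=k / 30)

print({name: len(samples) for name, samples in raw.streams.items()})
```

Because nothing is interpolated or dropped at record time, any alignment strategy can be applied later, and a different one can be tried without re-collecting data.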

Python Implementation: Multimodal Data Logger#

The following code implements a simple logger that simulates a camera (30 Hz) and joint encoders (100 Hz) and writes both into a single timestamped log. Every encoder step is recorded; the image field is filled only on steps where a new camera frame arrives.

```python
import json
import time

import numpy as np


class RobotDataLogger:
    """Collects timestamped observation-action entries for one demonstration."""

    def __init__(self):
        self.data = []
        # Monotonic clock: unaffected by system clock adjustments mid-episode.
        self.start_time = time.monotonic()

    def log_step(self, camera_frame, joint_angles, action, timestamp=None):
        """
        Logs one encoder-rate step. `camera_frame` is None on steps where no
        new image arrived; `timestamp` defaults to elapsed host-clock time.
        """
        if timestamp is None:
            timestamp = time.monotonic() - self.start_time

        entry = {
            "timestamp": float(timestamp),
            "observation": {
                # Simplified: real pipelines usually store image file paths.
                "image": None if camera_frame is None else camera_frame.tolist(),
                "joint_angles": joint_angles.tolist(),
            },
            "action": action.tolist(),
        }
        self.data.append(entry)

    def save(self, filename="demonstration.json"):
        with open(filename, "w") as f:
            json.dump(self.data, f)
        print(f"Logged {len(self.data)} steps to {filename}")


# Simulation of a data collection session
logger = RobotDataLogger()
duration = 1.0      # seconds
camera_rate = 30    # Hz
encoder_rate = 100  # Hz

print("Starting data collection...")

# In a real system, each sensor would be read in its own thread or process.
last_frame_idx = -1
for step in range(int(duration * encoder_rate)):
    # Simulated time, so the logged timestamps span the full one-second episode.
    t = step / encoder_rate

    # Simulated sensor readings
    joint_angles = np.random.randn(7)
    action = np.random.randn(7)

    # Integer frame counter avoids floating-point edge cases in the 30 Hz gating
    # and ensures the first frame (step 0) is captured.
    frame_idx = (step * camera_rate) // encoder_rate
    if frame_idx != last_frame_idx:
        camera_frame = np.random.rand(64, 64, 3)  # dummy image
        last_frame_idx = frame_idx
    else:
        camera_frame = None  # no new frame on this step

    logger.log_step(camera_frame, joint_angles, action, timestamp=t)

logger.save()
```

Python Implementation: Dataset Normalization and Alignment#

After logging raw data, we must align timestamps and normalize actions for training.

```python
import json

import numpy as np


def post_process_demonstration(data, action_min=None, action_max=None, target_hz=50.0):
    """
    Normalizes actions to [-1, 1] and linearly resamples them onto a uniform
    time grid (50 Hz by default). Returns (t_target, aligned_actions).
    """
    timestamps = np.array([step["timestamp"] for step in data], dtype=float)
    actions = np.array([step["action"] for step in data], dtype=float)  # (N, 7)

    # 1. Min-max normalization to [-1, 1]. The statistics default to this
    #    episode; for training, compute them once over the whole dataset and
    #    reuse the same values at deployment.
    if action_min is None:
        action_min = actions.min(axis=0)
    if action_max is None:
        action_max = actions.max(axis=0)
    norm_actions = 2 * (actions - action_min) / (action_max - action_min + 1e-8) - 1

    # 2. Resample onto a uniform grid spanning the recorded interval.
    target_dt = 1.0 / target_hz
    t_target = np.arange(timestamps[0], timestamps[-1], target_dt)

    # np.interp is one-dimensional, so interpolate each action dimension separately.
    aligned_actions = np.column_stack(
        [np.interp(t_target, timestamps, norm_actions[:, d])
         for d in range(norm_actions.shape[1])]
    )
    return t_target, aligned_actions


# Post-process the demonstration written by the logger above
with open("demonstration.json", "r") as f:
    raw_data = json.load(f)

t_new, a_new = post_process_demonstration(raw_data)
print(f"Resampled to {len(t_new)} steps at 50 Hz")
```

GenAI context: robot datasets as foundation data#

The relationship between teleoperation data and robot learning models is structurally analogous to the relationship between web text corpora and large language models. Both enable a generalist pretrained representation that can be fine-tuned for specific tasks; both impose hard limits on generalization to behaviors outside the training distribution; and both exhibit qualitative improvements as dataset scale, diversity, and curation quality improve.

The Open X-Embodiment dataset, comprising over 1 million robot trajectories across 22 robot embodiments, represents an attempt to achieve the breadth of coverage that allows cross-embodiment generalization, analogous to training a language model on text from diverse domains to achieve broad linguistic competence. The VLA models that build on this data (GR00T, OpenVLA, $\pi_0$) require exactly the kind of high-quality, synchronized, diverse demonstrations described in this lecture, at scale, to achieve the cross-task and cross-embodiment generalization they demonstrate.

A key difference from language data is the role of action labeling: language models are trained on text that exists in the world as a naturally occurring signal; robot action datasets must be deliberately created through teleoperation. This means that the bottleneck for scaling robot foundation models is not compute or model capacity — it is data, and specifically the rate and quality at which teleoperation pipelines can produce diverse, well-labeled demonstrations. Investment in teleoperation infrastructure is therefore investment in the foundation of the entire robot learning stack.

Key takeaways#

Teleoperation is the dominant method for generating robot demonstrations because it allows human task knowledge, safety judgment, and contact expertise to be transferred directly into the data distribution. The choice of teleoperation interface — VR, kinesthetic, haptic, or joystick — shapes the expressive richness and physical consistency of the resulting dataset. Dataset quality is determined not by size alone but by coverage of the deployment state distribution, diversity across operators and conditions, and absence of systematic operator bias. Logging pipelines must address the time synchronization problem rigorously, as silent misalignments introduce systematic biases that are difficult to diagnose in downstream training. The analogy between robot teleoperation datasets and web-scale language corpora clarifies why data collection infrastructure is a prerequisite for everything that follows.

Conceptual questions#

1. A manipulation dataset is collected by a single highly skilled operator performing 500 demonstrations of a bottle-cap assembly task. The resulting policy achieves 95% accuracy in evaluation but fails consistently when tested by a different operator who uses a slightly different initial grasp. Explain this failure using the concepts of coverage and operator bias. Describe two modifications to the data collection protocol, one targeting coverage and one targeting bias, that would address the failure without discarding the existing 500 demonstrations.
2. A robot learning team must decide between two teleoperation setups for a contact-rich assembly task: (a) VR-based teleoperation with 60 ms round-trip latency but 6-DoF Cartesian control, and (b) kinesthetic teaching with no latency but requiring operators to physically move a backdrivable arm. Analyze the tradeoffs for this specific task type. Which approach would produce a higher-quality dataset, and why? What property of the assembly task is the deciding factor?
3. In a logged dataset, joint encoder data is recorded at 500 Hz with hardware timestamps, while RGB camera data is recorded at 30 Hz with software timestamps subject to ±15 ms jitter. When training a visuomotor policy that maps RGB observations to joint torques, describe the effect of this timing mismatch on the learned policy. Propose a post-processing strategy to mitigate the misalignment, and identify what information is lost in the process.
4. The Open X-Embodiment dataset aggregates demonstrations from 22 different robot embodiments, combining data that was collected with different action spaces, sensor suites, and control frequencies. Explain why naively concatenating all demonstrations and training a single policy would likely fail. What normalization, encoding, or conditioning strategies would be needed to enable cross-embodiment learning, and what limits the amount of cross-embodiment transfer that is achievable regardless of these strategies?
5. A teleoperation dataset is collected with a haptic bilateral control system. The force feedback loop uses a software communication stack with 20 ms latency, which causes occasional instability in the force feedback channel, resulting in brief oscillatory episodes (0.3–0.5 seconds) where the operator overreacts to force feedback. These episodes are present in roughly 8% of demonstrations. Describe the effect of these oscillatory episodes on a behavior-cloned policy and on a diffusion policy trained on the same data. Which is more robust to this contamination, and why?

Solutions

1. Coverage vs operator bias. 95% on the operator's own start distribution but failure on a new grasp means the data lacks coverage of alternative initial conditions and encodes one operator's idiosyncratic style (bias). Coverage fix: collect demonstrations from varied initial grasps or add DAgger-style corrections from new starts. Bias fix: add multiple operators (or randomize the initial grasp). Both augment the existing 500 rather than discarding them.
2. VR vs kinesthetic. Contact-rich assembly hinges on force fidelity, and VR's 60 ms latency degrades the operator's reaction to contact (over/undershoot), whereas backdrivable kinesthetic teaching has no latency and conveys real contact forces, yielding higher-quality data. The deciding factor is that the task is contact-rich, so latency and force feedback dominate over the convenience of 6-DoF Cartesian control.
3. Timing mismatch. Because the camera's software timestamps jitter by up to ±15 ms relative to the hardware-timestamped 500 Hz encoders, observation-action pairs are temporally misaligned, so the policy maps mistimed images to torques and learns a lagged, blurred mapping that hurts precision and reactivity. Mitigate by treating the hardware-timestamped encoder clock as the reference and associating each image with the nearest encoder sample or an interpolated value (or by enforcing a fixed buffered latency); what's lost is the high-rate detail between camera frames and the exact capture time of each frame.
4. Open X-Embodiment. Different action spaces, control rates, and sensors mean an identical action vector means different things across robots, so one head cannot fit the contradictory semantics. You need per-embodiment action normalization, a canonical action representation (e.g., end-effector deltas), embodiment conditioning/embeddings, and aligned observation encoders. Morphology differences (kinematics, DoF, dynamics) ultimately cap how much transfers, regardless of these strategies.
5. Oscillatory contamination. A unimodal behavior-cloned policy regresses toward the mean and can blend the clean and oscillatory actions into jittery averages, or memorize the bad mode. A diffusion policy models the full multimodal distribution, keeping the dominant clean mode separate from the rare oscillatory mode (sampled at most ~8% of the time), so it is typically more robust to the contamination, although it can still occasionally reproduce the oscillation.

Looking ahead#

With a mechanism for collecting high-quality demonstrations established, the next question is how to turn those demonstrations into an autonomous policy. The naive answer — supervised regression — works surprisingly well within the demonstrated distribution but breaks down immediately once the robot encounters a state that no expert demonstrated.

Week 5: Imitation Learning. We analyze why behavior cloning fails under distributional shift, derive the DAgger algorithm as a principled fix, and examine the adversarial imitation framework (GAIL) that connects demonstration-based learning to reinforcement learning through a learned reward signal.

Purpose of this lecture#

Why robot data is structurally different#

These structural properties motivate thinking of data collection as a first-class engineering problem, not a preprocessing step.

Teleoperation: the dominant paradigm#

VR-based teleoperation#

Kinesthetic teaching#

Haptic and force-feedback interfaces#

Joystick and gamepad interfaces#

Dataset design: coverage, diversity, and bias#

The behavioral cloning objective and distributional coverage#

\mathcal{L}_{\text{BC}}(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[-\log \pi_\theta(a \mid s)\right]

Logging pipelines and time synchronization#

\Delta x = \dot{x}\,\Delta t

Python Implementation: Multimodal Data Logger#

The following code implements a simple logger that simulates capturing data from a camera (30Hz) and joint encoders (100Hz), then synchronizes them into a single dataset.

python · runs in browser

import numpy as np
import time
import json

class RobotDataLogger:
    def __init__(self):
        self.data = []
        self.start_time = time.time()

    def log_step(self, camera_frame, joint_angles, action):
        """
        Logs a single synchronized step.
        """
        timestamp = time.time() - self.start_time
        
        entry = {
            "timestamp": timestamp,
            "observation": {
                "image": camera_frame.tolist(), # Simplified: usually saved as paths
                "joint_angles": joint_angles.tolist()
            },
            "action": action.tolist()
        }
        self.data.append(entry)

    def save(self, filename="demonstration.json"):
        with open(filename, 'w') as f:
            json.dump(self.data, f)
        print(f"Logged {len(self.data)} steps to {filename}")

# Simulation of a data collection session
logger = RobotDataLogger()
duration = 1.0  # seconds
camera_rate = 30  # Hz
encoder_rate = 100 # Hz

print("Starting data collection...")

# In a real system, these would run in separate threads
for t in np.arange(0, duration, 1.0/encoder_rate):
    # Simulating data
    joint_angles = np.random.randn(7)
    action = np.random.randn(7)
    
    # Only capture camera at 30Hz
    if int(t * camera_rate) > int((t - 1.0/encoder_rate) * camera_rate):
        camera_frame = np.random.rand(64, 64, 3) # Dummy image
    else:
        camera_frame = None # Frame not ready
        
    if camera_frame is not None:
        logger.log_step(camera_frame, joint_angles, action)
    
    time.sleep(0.001) # Simulate some processing time

logger.save()

Python Implementation: Dataset Normalization and Alignment#

After logging raw data, we must align timestamps and normalize actions for training.

python · runs in browser

import json

import numpy as np


def post_process_demonstration(data):
    """
    Normalizes actions to [-1, 1] and resamples them onto a uniform 50 Hz grid.
    Only actions are processed here; images and joint angles are left untouched.
    """
    timestamps = np.array([step['timestamp'] for step in data])
    # actions: (N, 7)
    actions = np.array([step['action'] for step in data])

    # 1. Normalization (min-max to [-1, 1]) using this demonstration's own range.
    #    The small epsilon avoids division by zero for a constant dimension.
    action_min = actions.min(axis=0)
    action_max = actions.max(axis=0)
    norm_actions = 2 * (actions - action_min) / (action_max - action_min + 1e-8) - 1

    # 2. Resampling to 50Hz (0.02s steps). The grid starts at the first logged
    #    timestamp; before that point np.interp would only repeat the first value.
    target_dt = 0.02
    t_target = np.arange(timestamps[0], timestamps[-1], target_dt)

    # Linear interpolation for each dimension
    aligned_actions = np.zeros((len(t_target), norm_actions.shape[1]))
    for d in range(norm_actions.shape[1]):
        aligned_actions[:, d] = np.interp(t_target, timestamps, norm_actions[:, d])

    return t_target, aligned_actions

# Mock post-processing
with open("demonstration.json", "r") as f:
    raw_data = json.load(f)
    
t_new, a_new = post_process_demonstration(raw_data)
print(f"Resampled to {len(t_new)} steps at 50Hz")
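
The function above takes the min and max from a single demonstration, so the same physical action can map to different normalized values in different demonstrations. A minimal sketch of dataset-level normalization follows, assuming all demonstrations share one action space. The helper names `compute_action_stats`, `normalize`, and `denormalize` are hypothetical, and the random arrays stand in for real logged actions.

```python
import numpy as np


def compute_action_stats(demos):
    """demos: list of (N_i, D) action arrays. Returns per-dimension min and max over all demos."""
    stacked = np.concatenate(demos, axis=0)
    return stacked.min(axis=0), stacked.max(axis=0)


def normalize(actions, a_min, a_max, eps=1e-8):
    """Map actions to [-1, 1] using shared statistics."""
    return 2.0 * (actions - a_min) / (a_max - a_min + eps) - 1.0


def denormalize(norm_actions, a_min, a_max, eps=1e-8):
    """Invert normalize(), e.g. to turn policy outputs back into robot commands."""
    return (norm_actions + 1.0) / 2.0 * (a_max - a_min + eps) + a_min


# Three mock demonstrations of different lengths, 7 action dimensions each
rng = np.random.default_rng(0)
demos = [rng.standard_normal((n, 7)) for n in (40, 55, 30)]

a_min, a_max = compute_action_stats(demos)
norm_demos = [normalize(d, a_min, a_max) for d in demos]
recovered = denormalize(norm_demos[0], a_min, a_max)
print(np.allclose(recovered, demos[0]))
```

The statistics would typically be saved alongside the dataset, since the same `a_min` and `a_max` are needed at deployment to convert normalized policy outputs back into commands.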

GenAI context: robot datasets as foundation data#

Key takeaways#

Conceptual questions#

A manipulation dataset is collected by a single highly skilled operator performing 500 demonstrations of a bottle-cap assembly task. The resulting policy achieves 95% accuracy in evaluation but fails consistently when tested by a different operator who uses a slightly different initial grasp. Explain this failure using the concepts of coverage and operator bias. Describe two modifications to the data collection protocol — one targeting coverage and one targeting bias — that would address the failure without discarding the existing 500 demonstrations.
A robot learning team must decide between two teleoperation setups for a contact-rich assembly task: (a) VR-based teleoperation with 60 ms round-trip latency but 6-DoF Cartesian control, and (b) kinesthetic teaching with no latency but requiring operators to physically move a backdrivable arm. Analyze the tradeoffs for this specific task type. Which approach would produce a higher-quality dataset, and why? What property of the assembly task is the deciding factor?
In a logged dataset, joint encoder data is recorded at 500 Hz with hardware timestamps, while RGB camera data is recorded at 30 Hz with software timestamps subject to ±15 ms jitter. When training a visuomotor policy that maps RGB observations to joint torques, describe the effect of this timing mismatch on the learned policy. Propose a post-processing strategy to mitigate the misalignment, and identify what information is lost in the process.
The Open X-Embodiment dataset aggregates demonstrations from 22 different robot embodiments, combining data that was collected with different action spaces, sensor suites, and control frequencies. Explain why naively concatenating all demonstrations and training a single policy would likely fail. What normalization, encoding, or conditioning strategies would be needed to enable cross-embodiment learning, and what limits the amount of cross-embodiment transfer that is achievable regardless of these strategies?
A teleoperation dataset is collected with a haptic bilateral control system. The force feedback loop uses a software communication stack with 20 ms latency, which causes occasional instability in the force feedback channel — resulting in brief oscillatory episodes (0.3–0.5 seconds) where the operator overreacts to force feedback. These episodes are present in roughly 8% of demonstrations. Describe the effect of these oscillatory episodes on a behavior-cloned policy and on a diffusion policy trained on the same data. Which is more robust to this contamination, and why?

Solutions

Coverage vs operator bias. 95% on the operator's own start distribution but failure on a new grasp means the data lacks coverage of alternative initial conditions and encodes one operator's idiosyncratic style (bias). Coverage fix: collect demonstrations from varied initial grasps or add DAgger-style corrections from new starts. Bias fix: add multiple operators (or randomize the initial grasp). Both augment the existing 500 rather than discarding them.
VR vs kinesthetic. Contact-rich assembly hinges on force fidelity, and VR's 60 ms latency degrades the operator's reaction to contact (over/undershoot), whereas backdrivable kinesthetic teaching has no latency and conveys real contact forces — yielding higher-quality data. The deciding factor is that the task is contact-rich, so latency and force feedback dominate over the convenience of 6-DoF Cartesian control.
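Timing mismatch. With ±15 ms of jitter on the camera timestamps, each image can be paired with an encoder sample roughly 7 samples away at 500 Hz, so observation-action pairs are temporally misaligned. The policy then maps stale or early images to torques and learns a lagged, blurred mapping that hurts precision and reactivity. Mitigate by treating the hardware-timestamped encoder stream as the reference clock and associating each camera frame with the nearest encoder sample, or by interpolating the encoder signal at the frame time (enforcing a fixed buffered latency is an alternative). What is lost is the encoder detail between camera frames once the data is reduced to the frame rate, and the exact capture time of each image, which the jittery software timestamps cannot recover.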
Open X-Embodiment. Different action spaces, control rates, and sensors mean an identical action vector means different things across robots, so one head cannot fit the contradictory semantics. You need per-embodiment action normalization, a canonical action representation (e.g., end-effector deltas), embodiment conditioning/embeddings, and aligned observation encoders. Morphology differences (kinematics, DoF, dynamics) ultimately cap how much transfers, regardless of these strategies.
Oscillatory contamination. A behavior-cloned policy trained with a unimodal regression loss regresses toward the conditional mean, so it can blend the clean and oscillatory actions into jittery averages, or memorize the bad mode. A diffusion policy models the full multimodal action distribution and keeps the dominant clean mode separate from the rare oscillatory one, which appears in only about 8% of demonstrations and for under half a second each time. The diffusion policy is therefore the more robust of the two, although it can still occasionally sample the oscillatory mode.
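
The nearest-timestamp association mentioned in the timing-mismatch solution can be sketched as follows. This is a minimal illustration on simulated timestamps, assuming the 500 Hz encoder clock is the trusted reference and the camera timestamps carry uniform ±15 ms jitter. The function name `nearest_indices` and all the arrays are hypothetical.

```python
import numpy as np


def nearest_indices(reference_t, query_t):
    """For each query time, return the index of the closest reference time.

    reference_t must be sorted in increasing order.
    """
    right = np.searchsorted(reference_t, query_t)
    right = np.clip(right, 1, len(reference_t) - 1)
    left = right - 1
    pick_left = (query_t - reference_t[left]) <= (reference_t[right] - query_t)
    return np.where(pick_left, left, right)


rng = np.random.default_rng(0)
encoder_t = np.arange(500) / 500.0   # 500 Hz, hardware-timestamped reference clock
camera_t = np.arange(30) / 30.0      # nominal 30 Hz frame times
camera_t_logged = camera_t + rng.uniform(-0.015, 0.015, camera_t.shape)  # software jitter

idx = nearest_indices(encoder_t, camera_t_logged)
residual = np.abs(encoder_t[idx] - camera_t_logged)
print(len(idx), residual.max())
```

Association only matches each logged camera timestamp to the closest encoder sample. It does not correct the jitter itself, and the encoder samples that are not selected are dropped unless they are stored separately.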
