Purpose of this lecture
Reinforcement learning for robotics almost never happens directly on physical hardware at scale. The sample efficiency constraints examined in Week 6 — thousands of hours of robot time for millions of transitions — make direct hardware training impractical for all but the simplest tasks. The practical answer is simulation: train in a virtual environment that approximates the physics of the real world, then transfer the learned policy to hardware.
This is the sim2real problem: not a single technique but a full engineering pipeline spanning physics modeling, simulation infrastructure, domain randomization, and system identification. Getting it right determines whether the policy works on the real robot. Getting it wrong produces policies that are experts in simulation and useless on hardware. This lecture examines the problem from both ends: what properties of simulation make transfer possible, and what failure modes break it.
Why simulation is indispensable
The asymmetry between simulation and hardware is stark. A modern GPU-accelerated simulator running on a cluster can execute tens of thousands of parallel environment instances, collecting millions of transitions per minute. The same experiment on a single physical robot would require years. Simulation enables the data volumes that modern reinforcement learning (RL) algorithms require while providing safe, free-reset, repeatable environments where exploration cannot damage hardware or injure operators.
But simulation is an approximation. Every simulated environment encodes assumptions (about rigid-body dynamics, friction models, actuator behavior, sensor noise, and contact resolution) that diverge from physical reality in ways the simulator designer may not anticipate or control. A policy trained in a simulator that models friction with a simple Coulomb model will encounter stiction, viscoelasticity, and asymmetric friction on the physical robot, and its behavior can degrade accordingly. The gap between simulated and real dynamics is the fundamental obstacle that all sim2real methods attempt to address.
The Isaac Sim and IsaacLab stack
NVIDIA's Isaac Sim is a high-fidelity robotics simulator built on the Omniverse platform and USD (Universal Scene Description), using PhysX as its physics engine for rigid-body and articulated dynamics and the Omniverse RTX renderer (exposed through USD's Hydra rendering framework) for photorealistic rendering. The key features for robot learning are GPU acceleration of both physics simulation and rendering, articulated robot support with configurable joint models, and tight integration with ROS 2 and standard robot learning frameworks.
The physics engine in Isaac Sim resolves rigid-body contacts with an iterative constraint solver (PhysX offers projected Gauss-Seidel and temporal Gauss-Seidel variants; position-based dynamics is used for particle-based objects rather than rigid contact) and simulates articulated bodies with a reduced-coordinate, Featherstone-style formulation. GPU parallelism is achieved by batching many simulation instances into a single tensor computation: joint positions, velocities, and applied torques for $N$ parallel environments are represented as tensors and updated together on the GPU. At high parallelization factors (large $N$), the per-environment marginal cost becomes very small, enabling data collection rates that make billion-step training feasible.
IsaacLab (the successor to Isaac Gym and the Orbit framework) is the robot learning framework built on top of Isaac Sim. It provides task templates and environment interfaces that follow the Gym-style environment API (maintained today as Gymnasium), built-in domain randomization utilities for physics and appearance parameters, a manager-based architecture that separates task logic, observation construction, and reward computation for modularity, and integration with standard RL libraries (RSL-RL, rl_games, Stable Baselines3).
The central technical mechanism enabling IsaacLab's parallelism is tensor-based vectorized computation. All $N$ parallel environment instances share a single physics simulation, with their states represented as batched tensors. Joint positions, velocities, and forces for all environments are stored in tensors that are updated together by the PhysX GPU pipeline. Observation construction is performed by slicing and transforming these state tensors entirely on the GPU, so no data is copied to the CPU between physics steps and policy inference. Reward computation is similarly vectorized: the reward function is evaluated as an elementwise operation on the batched state and action tensors (leading dimension $N$, one row per environment), returning a vector of $N$ scalar rewards. Episode termination conditions (joint limit violations, task completion, maximum episode length) are evaluated in parallel as boolean masks on the state tensor, and terminated environments are reset by writing new initial conditions into the corresponding tensor rows without interrupting the other simulations. This design means the marginal cost of adding one more parallel environment is small as long as the GPU has remaining capacity, enabling training with thousands of parallel instances on a single high-end GPU.
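The batched step described above can be sketched on the CPU with NumPy. This is a toy illustration, not IsaacLab code: the point-mass dynamics, the `step_batched` function, and every constant are assumptions chosen only to show elementwise rewards, boolean termination masks, and row-wise resets.

```python
import numpy as np

def step_batched(state, action, target, max_steps, step_count, rng):
    """One vectorized step for N toy 1-D point-mass environments.

    state: (N, 2) array of [position, velocity]; action: (N,) applied force.
    All dynamics and thresholds here are hypothetical.
    """
    dt = 0.02
    pos, vel = state[:, 0], state[:, 1]
    vel = vel + dt * action            # unit mass: acceleration equals force
    pos = pos + dt * vel
    state = np.stack([pos, vel], axis=1)
    step_count = step_count + 1

    # Reward: one scalar per environment, computed elementwise over the batch.
    reward = -np.abs(pos - target) - 0.01 * action**2

    # Termination conditions evaluated as boolean masks.
    out_of_bounds = np.abs(pos) > 5.0
    timed_out = step_count >= max_steps
    done = out_of_bounds | timed_out

    # Reset only the terminated rows; the other environments are untouched.
    n_done = int(done.sum())
    state[done] = rng.uniform(-0.1, 0.1, size=(n_done, 2))
    step_count[done] = 0
    return state, reward, done, step_count
```

In a GPU simulator the same pattern runs on device tensors, so the masks and resets never require a round trip to the CPU.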
Robot modeling: assets and fidelity
Simulation fidelity begins with the digital twin: the robot model that the simulator uses to compute dynamics. For robots with available URDF (Unified Robot Description Format) or USD files, importing is straightforward; the critical parameters are link masses and inertia tensors, actuator models (motor gain, back-EMF, friction), joint limits, and collision geometry.
Mass and inertia errors compound through the dynamics equations. If the simulated link inertia is incorrect by 10%, the simulated response to a torque command will be incorrect by a corresponding amount, and the dynamics that the policy has implicitly adapted to will not match the physical robot. Obtaining accurate inertia parameters requires either CAD models of the robot's mechanical components (available from manufacturers for most platforms) or experimental identification through system identification experiments.
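As a quick worked example of how an inertia error propagates, consider a single revolute joint driven by a constant torque, ignoring gravity and friction. All numbers below are hypothetical and chosen only for illustration.

```python
# Hypothetical single-joint example: angular acceleration = torque / inertia.
true_inertia = 0.50                # kg m^2 (assumed value)
sim_inertia = 1.10 * true_inertia  # simulator overestimates inertia by 10%
torque = 2.0                       # N m (assumed constant command)

alpha_real = torque / true_inertia
alpha_sim = torque / sim_inertia
relative_error = (alpha_sim - alpha_real) / alpha_real

# Angle reached from rest after t seconds: 0.5 * alpha * t^2.
t = 0.5
angle_real = 0.5 * alpha_real * t**2
angle_sim = 0.5 * alpha_sim * t**2

print(f"acceleration error: {relative_error:.1%}")
print(f"angle after {t} s: real {angle_real:.3f} rad, sim {angle_sim:.3f} rad")
```

Under these assumptions a 10% inertia overestimate gives roughly 9% less acceleration, and the position discrepancy grows quadratically with time while the torque is held.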
Actuator models are particularly important for sim2real transfer. Physical servo actuators exhibit position-dependent friction, current-dependent torque limits, thermal derating, and communication delays (from motor controller to computer and back). A simulation that models actuators as ideal torque sources with instantaneous response will produce policies that assume perfect torque control authority — authority the physical actuator cannot provide in contact-rich situations where the motor is near its torque limit.
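A minimal sketch of a non-ideal actuator model, assuming a fixed command delay measured in control steps and a linear torque-speed curve. The class name, parameters, and the symmetric torque limit are simplifying assumptions for illustration, not a model of any particular servo.

```python
from collections import deque
import numpy as np

class DelayedSaturatingActuator:
    """Toy actuator: delays commands and limits torque as speed increases."""

    def __init__(self, max_torque, no_load_speed, delay_steps):
        self.max_torque = max_torque        # stall torque (assumed)
        self.no_load_speed = no_load_speed  # speed at which available torque reaches zero
        # Pre-filled queue: a command issued now takes effect delay_steps later.
        self.queue = deque([0.0] * delay_steps)

    def torque_limit(self, joint_velocity):
        # Linear torque-speed curve, applied symmetrically for simplicity.
        frac = 1.0 - abs(joint_velocity) / self.no_load_speed
        return self.max_torque * max(frac, 0.0)

    def step(self, commanded_torque, joint_velocity):
        self.queue.append(commanded_torque)
        delayed = self.queue.popleft()
        limit = self.torque_limit(joint_velocity)
        return float(np.clip(delayed, -limit, limit))
```

Replacing an ideal torque source with a model of this kind exposes the policy to lag and reduced authority during training rather than for the first time on hardware.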
The sim2real gap: sources and characterization
The sim2real gap is the distribution shift between the dynamics experienced in simulation and those experienced on physical hardware. Its sources are diverse:
Unmodeled actuator dynamics include transmission flexibility (backlash, cable stretch in cable-driven robots), motor friction (position-dependent, velocity-dependent, and static), and thermal effects that change motor constants as the robot warms up during use. These effects are small individually but accumulate over a trajectory to produce systematic deviations from the simulated path.
Contact and friction modeling is the single hardest problem in physics simulation for manipulation. Physical contacts involve micro-scale surface deformation, viscoelastic material response, adhesion, and velocity-dependent friction behavior (Stribeck curve) that standard rigid-body simulators approximate with a single friction coefficient. A grasping policy trained with one simulated friction coefficient $\mu_{\text{sim}}$ will fail to transfer when the physical friction $\mu_{\text{real}}$ differs from it, unless the policy's learned grasp is robust to this variation.
Sensor noise and delays differ between simulation and hardware. Real sensors exhibit correlated noise, bias drift, and saturation behavior that ideal Gaussian noise models in simulation do not capture. The round-trip delay from commanding a joint position to reading the resulting encoder state is typically 1–5 ms on physical hardware, introducing a phase lag that destabilizes policies designed for zero-delay simulation.
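One way to move beyond ideal Gaussian noise is sketched below: a sensor model with a slowly drifting bias (a random walk), white noise, a fixed reading delay, and saturation. The function name and all default values are assumptions for illustration only.

```python
import numpy as np

def simulate_sensor(true_signal, rng, white_std=0.01, bias_walk_std=0.001,
                    limit=1.0, delay_steps=2):
    """Return noisy, delayed, saturated readings of true_signal (1-D array)."""
    true_signal = np.asarray(true_signal, dtype=float)
    n = len(true_signal)
    # Bias drift: cumulative sum of small steps gives time-correlated error.
    bias = np.cumsum(rng.normal(0.0, bias_walk_std, size=n))
    white = rng.normal(0.0, white_std, size=n)
    # Delay: the reading at step t reflects the signal at t - delay_steps,
    # holding the first sample until real data arrives.
    delayed = np.concatenate([
        np.full(delay_steps, true_signal[0]),
        true_signal[:n - delay_steps],
    ])
    # Saturation: readings cannot leave the sensor's measurable range.
    return np.clip(delayed + bias + white, -limit, limit)
```

Setting both noise scales to zero isolates the effect of the delay, which is useful when checking how much phase lag a policy tolerates.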
Rendering gaps affect visual policies: real RGB images contain specular highlights, shadow variation, motion blur, and lens distortion not present in simulated renders. A policy trained on simulation renders will encounter a visually distinct domain at deployment, which can cause substantial degradation in visuomotor policies.
Domain randomization
The principal strategy for bridging the sim2real gap is domain randomization: instead of trying to make the simulator accurate, make it variable enough that the real world is one of the cases the policy was trained on. If the policy performs well across a distribution of simulated environments with randomized physics parameters, visual appearances, and noise characteristics, and if the real environment lies within that distribution, the policy is expected to transfer.
| Dynamics | Visual | Sensors & Delay |
| --- | --- | --- |
| Randomizing masses, friction, and gains so the policy learns to compensate for physical uncertainty across episodes. | Varying lighting, textures, and camera poses to prevent the policy from overfitting to specific simulation graphics. | Injecting noise and latency that mimic real-world hardware jitter and communications overhead. |
Dynamics randomization perturbs the physical parameters of the simulation on each episode reset: link masses are drawn from intervals around their nominal values, friction coefficients are sampled uniformly from a range, actuator gains are multiplied by random scalars, and communication delays are drawn from a distribution. The policy must learn behaviors that are robust to this variation — behaviors that succeed not by exploiting specific physics values but by closed-loop correction using sensor feedback.
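A minimal sketch of per-episode dynamics randomization follows. The `PhysicsParams` container, the `sample_physics` function, and every range are hypothetical; in practice the ranges come from the identification uncertainty for the specific robot.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PhysicsParams:
    link_mass_scale: float      # multiplier on nominal link masses
    friction: float             # contact friction coefficient
    actuator_gain_scale: float  # multiplier on nominal actuator gain
    delay_steps: int            # command delay in control steps

def sample_physics(rng, nominal_friction=0.8):
    """Draw one set of physics parameters, to be applied at episode reset."""
    return PhysicsParams(
        link_mass_scale=rng.uniform(0.8, 1.2),
        friction=rng.uniform(0.5 * nominal_friction, 1.5 * nominal_friction),
        actuator_gain_scale=rng.uniform(0.9, 1.1),
        delay_steps=int(rng.integers(0, 4)),  # 0 to 3 steps inclusive
    )

rng = np.random.default_rng(seed=0)
params_for_episode = sample_physics(rng)
```

Calling `sample_physics` once per reset means each episode has fixed but unknown dynamics, which is what forces the policy toward closed-loop correction.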
Appearance randomization (visual domain randomization) perturbs the visual rendering on each episode: object textures are replaced with random RGB images or procedural patterns, lighting direction and intensity are randomized, camera pose is perturbed, and background geometry is varied. Visual policies trained under aggressive appearance randomization develop feature representations that are robust to appearance changes because none of the specific appearance details are reliable — the policy must rely on shape and geometry invariants that transfer to the real domain.
Structured randomization is more targeted: instead of randomizing all parameters uniformly, it focuses randomization on the parameters that are most uncertain (poor system identification) and narrows randomization ranges for parameters that are well-characterized. This concentrates the robustness budget where it is most needed.
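One simple way to express this idea is to size each randomization interval by how uncertain the parameter estimate is. The sketch below assumes each parameter has an estimate and a standard error; the function name, the factor `k`, and the example values are all hypothetical.

```python
def structured_ranges(estimates, std_errors, k=2.0):
    """Return a (low, high) randomization interval per parameter.

    The half-width is k times the assumed standard error, so poorly
    identified parameters get wide ranges and well-characterized ones
    get narrow ranges.
    """
    return {
        name: (estimates[name] - k * std_errors[name],
               estimates[name] + k * std_errors[name])
        for name in estimates
    }

# Hypothetical identification results, for illustration only.
estimates = {"link_mass": 1.20, "friction": 0.60}
std_errors = {"link_mass": 0.01, "friction": 0.15}
ranges = structured_ranges(estimates, std_errors)
```

Here the well-measured link mass is randomized only slightly, while the uncertain friction coefficient receives most of the variation.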
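The theoretical justification for domain randomization is rooted in a formal expected return objective. Let $\xi$ denote the vector of physical parameters (masses, friction coefficients, actuator gains, delays) sampled from the randomization distribution $p_{\text{sim}}(\xi)$. For a given set of parameters $\xi$, trajectories follow $\tau \sim p(\tau \mid \pi, \xi)$, the distribution over trajectories induced by executing policy $\pi$ in the environment with parameters $\xi$. The domain randomization training objective is the expected return integrated over the parameter distribution:
$$J(\pi) = \mathbb{E}_{\xi \sim p_{\text{sim}}(\xi)} \left[ \mathbb{E}_{\tau \sim p(\tau \mid \pi, \xi)} \left[ \sum_{t=0}^{T} \gamma^{t} \, r(s_t, a_t) \right] \right]$$
Maximizing $J(\pi)$ produces a policy that performs well on average across the randomization distribution. The guarantee is stated as follows: let $p_{\text{real}}(\xi)$ be the (unknown) distribution of physical parameters in the real world and $p_{\text{sim}}(\xi)$ be the simulation's randomization distribution. If $p_{\text{real}} \ll p_{\text{sim}}$ (the real distribution is absolutely continuous with respect to the randomization distribution), then every set of parameter values that can occur on hardware is also covered in training, and a policy that maximizes $J(\pi)$ in simulation is argued to be at least as good as any non-adaptive policy in the real world. This is a coverage condition rather than a quantitative bound: real-world performance still depends on how much probability mass $p_{\text{sim}}$ places near the real parameters. The requirement $p_{\text{real}} \ll p_{\text{sim}}$ motivates using wide randomization ranges, but wide ranges also make $J(\pi)$ harder to maximize because the policy must succeed under the hardest physics configurations. This tension between coverage width and training tractability is the central design tradeoff in domain randomization.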
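The expected-return objective over randomized parameters can be estimated by Monte Carlo sampling. The sketch below uses a toy system that is entirely an assumption: a unit point mass with viscous friction as the randomized parameter and a proportional controller with gain `gain` as the policy. The environment is deterministic, so one rollout per sampled parameter suffices for the inner expectation.

```python
import numpy as np

def episode_return(gain, friction, steps=200, dt=0.02):
    """Return of a proportional controller driving a point mass to position 1."""
    pos, vel, total = 0.0, 0.0, 0.0
    for _ in range(steps):
        force = gain * (1.0 - pos)     # policy: proportional control
        acc = force - friction * vel   # dynamics depend on the sampled friction
        vel += dt * acc
        pos += dt * vel
        total += -((1.0 - pos) ** 2) * dt  # reward: negative squared error
    return total

def dr_objective(gain, rng, friction_range=(0.5, 3.0), n_samples=256):
    """Monte Carlo estimate of the randomized objective for a given gain."""
    frictions = rng.uniform(*friction_range, size=n_samples)
    return float(np.mean([episode_return(gain, f) for f in frictions]))

rng = np.random.default_rng(0)
candidate_gains = [0.5, 2.0, 8.0]
scores = {g: dr_objective(g, rng) for g in candidate_gains}
```

Comparing `scores` across gains is a crude stand-in for policy optimization: the preferred gain is the one with the best average return over the whole friction range, not the best return at any single friction value.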
System identification
Domain randomization solves the transfer problem at the cost of training efficiency: a policy that must succeed for all friction coefficients in a wide interval $[\mu_{\min}, \mu_{\max}]$ is harder to train than one that only needs to succeed for a single known value $\mu^{*}$. System identification reduces this cost by estimating the real robot's physical parameters from hardware experiments, allowing the simulation to be centered at the identified values before applying randomization.
Classical system identification for articulated robots collects torque-position-velocity data from controlled joint movements and fits parametric models for mass, inertia, friction, and actuator constants. The fitting procedure minimizes the mismatch between simulated and real joint trajectories under the same input sequence:
$$\hat{\xi} = \arg\min_{\xi} \sum_{t=1}^{T} \left\| q_t^{\text{real}} - q_t^{\text{sim}}(\xi; u_{1:t}) \right\|^2$$
where $\xi$ parameterizes the robot model, $u_{1:t}$ is the applied input sequence, and $q_t^{\text{real}}$ and $q_t^{\text{sim}}$ are the measured and simulated joint trajectories. The identified parameters $\hat{\xi}$ are then used as the nominal simulation values, with randomization applied as perturbations around these identified values. The randomization range reflects residual uncertainty in the identification.
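For a single joint, a common simplification of the fitting problem is linear least squares on the torque equation rather than on trajectories. The sketch below swaps in that equation-error formulation, assuming the model torque = inertia * acceleration + viscous * velocity + coulomb * sign(velocity); the model structure and the function name are assumptions for illustration.

```python
import numpy as np

def identify_joint(q_dot, q_ddot, torque):
    """Fit inertia, viscous friction, and Coulomb friction for one joint.

    Solves torque ~ inertia * q_ddot + viscous * q_dot + coulomb * sign(q_dot)
    in the least-squares sense from logged velocity, acceleration, and torque.
    """
    regressors = np.column_stack([q_ddot, q_dot, np.sign(q_dot)])
    params, *_ = np.linalg.lstsq(regressors, torque, rcond=None)
    inertia, viscous, coulomb = params
    return float(inertia), float(viscous), float(coulomb)
```

Because the model is linear in its parameters, the fit is a single `lstsq` call; the excitation trajectory still has to vary velocity and acceleration independently, otherwise the regressor columns become nearly collinear and the estimates are poorly conditioned.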
The relationship between system identification and domain randomization is best understood as a sequential variance management strategy. Before any hardware experiments, the randomization distribution $p_{\text{sim}}(\xi)$ has high variance: physical parameters are set to broad default ranges to ensure coverage, at the cost of training difficulty. System identification narrows this distribution by concentrating $p_{\text{sim}}(\xi)$ around the identified values $\hat{\xi}$: the post-identification randomization range for each parameter $\xi_i$ shrinks from the prior interval $[\xi_i^{\min}, \xi_i^{\max}]$ to a tighter interval $[\hat{\xi}_i - \delta_i, \hat{\xi}_i + \delta_i]$, where $\delta_i$ reflects residual identification uncertainty. This narrowing directly reduces the variance of the training objective under $p_{\text{sim}}(\xi)$, making it easier to optimize.
Adaptive domain randomization (ADR; introduced by OpenAI under the name automatic domain randomization) then applies the reverse operation: it starts with the identified, narrow $p_{\text{sim}}(\xi)$ and progressively expands the ranges as the policy demonstrates robustness. When the policy achieves a success rate above an upper threshold on the current distribution, the boundaries of $p_{\text{sim}}(\xi)$ are widened by a small increment. When the success rate drops below a lower threshold, the boundaries are contracted. This provides a curriculum over the variance of $p_{\text{sim}}(\xi)$ that keeps the training difficulty in the productive range throughout training, and ultimately reaches the broad coverage needed for the transfer guarantee. The interplay between system identification (narrowing initial variance to center the distribution) and ADR (progressively expanding variance to achieve coverage) is the principled sim2real strategy employed in systems like OpenAI's Dactyl and NVIDIA's Isaac-trained locomotion policies.
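The expand-and-contract rule can be sketched for a single parameter as follows. This is a simplified, symmetric version under assumed thresholds and step size; published ADR implementations typically evaluate and move one boundary at a time, which this sketch does not attempt to reproduce.

```python
def adr_update(low, high, success_rate, nominal, step=0.05,
               expand_above=0.8, contract_below=0.4,
               hard_low=0.0, hard_high=float("inf")):
    """Widen or narrow one parameter's randomization interval.

    Thresholds, step size, and hard limits are hypothetical defaults.
    """
    if success_rate > expand_above:
        # Policy is robust on the current range: widen both boundaries.
        low, high = low - step, high + step
    elif success_rate < contract_below:
        # Policy is struggling: shrink toward the identified nominal value.
        low = min(low + step, nominal)
        high = max(high - step, nominal)
    # Keep the interval inside physically meaningful limits.
    return max(low, hard_low), min(high, hard_high)
```

Calling this once per evaluation window, with the interval initialized around the identified value, produces a range that grows only as fast as the policy can keep up.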
Randomization schedules and training stability
Starting training with maximum randomization from the beginning often stalls learning: if friction and delay can each fall anywhere in their widest ranges simultaneously, the policy encounters effectively unrelated dynamics on every episode and cannot build stable behavioral patterns. A randomization curriculum starts with narrow randomization around identified nominal values and progressively widens the ranges as the policy demonstrates robustness at each level.
The interaction between randomization strength and policy architecture is non-trivial. Policies that represent the domain parameters as part of the observation — either by directly observing them (unrealistic) or by inferring them from a history of state-action-observation transitions (domain randomization with history, DR+H) — can adapt their behavior to the specific parameters in the current episode, achieving better performance than policies that must be robust to all parameters simultaneously. This adaptive approach requires a policy with memory (recurrent or attention-based) that can maintain an implicit estimate of the current domain parameters.
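The simplest form of memory is a fixed window of recent observations and actions fed to the policy as one flat vector. The sketch below shows that windowed alternative to a recurrent or attention-based policy; the `HistoryBuffer` class and its dimensions are assumptions for illustration.

```python
import numpy as np

class HistoryBuffer:
    """Keeps the last `horizon` (observation, action) pairs, oldest first."""

    def __init__(self, obs_dim, act_dim, horizon):
        self.buf = np.zeros((horizon, obs_dim + act_dim))

    def reset(self):
        # Clear history at episode reset, since the domain parameters change.
        self.buf = np.zeros_like(self.buf)

    def push(self, obs, action):
        # Drop the oldest row and append the newest pair at the end.
        self.buf = np.roll(self.buf, -1, axis=0)
        self.buf[-1] = np.concatenate([obs, action])
        # Flattened history is what the policy network would consume.
        return self.buf.reshape(-1).copy()
```

Because the response to past actions depends on friction, mass, and delay, a policy that sees this window can implicitly infer the current episode's parameters instead of having to be robust to all of them at once.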
GenAI context: sim2real as structured data augmentation
The analogy between domain randomization and data augmentation in vision-language pretraining is precise and instructive.
| Robotics sim2real | Vision/language pretraining |
| --- | --- |
| Domain randomization | Data augmentation (crop, flip, color jitter) |
| System identification | Dataset curation and filtering |
| Sim2real gap | Covariate shift between training and test distribution |
| Appearance randomization | Rendering/style variation in synthetic data |
| Adaptive randomization | Curriculum or difficulty-aware sampling |
Both fields have converged on the same insight: training for distributional robustness is more effective than training for point-estimate accuracy. A policy trained on a narrow, perfectly simulated distribution fails when the real world deviates from that point; a policy trained on a broad, randomized distribution is robust to the inevitable deviation. The synthesis in both fields is the same: use structured domain knowledge (system identification, dataset curation) to center the distribution, then use randomization (domain randomization, augmentation) to broaden it.
Key takeaways
Simulation is the primary training environment for robot RL because it provides unlimited, fast, safe data collection. Isaac Sim provides GPU-accelerated physics and rendering with parallel environment support; IsaacLab wraps this with robot learning infrastructure. The sim2real gap arises from unmodeled actuator dynamics, imprecise contact models, sensor noise and delay mismatches, and visual domain differences. Domain randomization addresses the gap by training policies over a distribution of simulated environments broad enough to contain the real-world configuration. System identification narrows the randomization domain to efficient ranges by estimating physical parameters from hardware experiments. Randomization curricula that widen the randomization range progressively maintain training stability while eventually reaching broad coverage. The design principle, robustness over accuracy, is the same principle underlying data augmentation in vision and language model training.
Conceptual questions
- A legged locomotion policy trained in Isaac Sim achieves near-perfect performance across 1000 parallel simulation instances but fails immediately on the physical robot, exhibiting oscillatory joint motions at the hip. Post-hoc analysis reveals that the simulation modeled actuators as ideal torque sources with zero delay, while the physical actuators have 3 ms communication delay and 15% torque saturation at high velocities. Explain exactly how these two modeling errors would interact to produce oscillatory motion. What changes to the simulation model and domain randomization schedule would prevent this failure?
- The theoretical guarantee for domain randomization requires that the real-world parameter distribution $p_{\text{real}}(\xi)$ is absolutely continuous with respect to the randomization distribution $p_{\text{sim}}(\xi)$. For a contact-rich grasping task, identify at least three physical parameters where this condition might fail in practice (i.e., the real parameter lies outside the randomization range), explain why it is difficult to discover these failures before deployment, and propose diagnostic procedures to detect range violations before physical testing.
- System identification estimates physical parameters by minimizing the mismatch between simulated and real trajectories under controlled inputs. Explain why the identified parameters from quasi-static joint movements (slow, low-acceleration motions) may not produce accurate simulation of the high-acceleration, contact-rich motions used during manipulation RL training. How would you design an identification experiment specifically targeting the dynamic regime used during training? What parameter correlation structure would make the identification problem ill-conditioned?
- A team uses adaptive domain randomization (ADR) to train a bin-picking policy. During training, the friction coefficient range expands progressively from a narrow initial interval to a wide final interval as the policy improves. After deployment, the physical friction is measured at a value $\mu_{\text{real}}$ that lies within the final randomization range, but the policy's success rate is only 40%, despite 85% success in simulation at $\mu = \mu_{\text{real}}$. Diagnose this failure. Consider the interaction between friction and the other parameters being simultaneously randomized, and identify which parameter combination in the randomization distribution is most likely underrepresented near $\mu_{\text{real}}$.
- Visual domain randomization replaces simulation textures with random RGB patterns. A visuomotor policy trained with aggressive visual randomization achieves robust transfer but requires significantly more training steps than a policy trained with photorealistic simulation renders. Propose a training curriculum that achieves the final robustness of visual randomization while matching the early learning speed of photorealistic training. What theoretical property of the policy's learned visual representation changes between the two approaches, and how would you measure this?
Looking ahead
With simulation providing training scale and domain randomization providing robustness, the next question is what architectural and algorithmic designs best leverage large-scale simulated and real-world demonstration data. The answer has shifted from standard actor-critic architectures to transformer-based foundation models trained on massive multi-task datasets.
Week 8: Foundation Models for Manipulation: ACT and Action Chunking. We examine the Action Chunking with Transformers (ACT) architecture, the role of temporal action chunking in reducing the effective decision frequency, and how sequence modeling fundamentally changes the way policies represent and generate multi-step manipulation behaviors.
Further reading
- Tobin, J., et al. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS. (The foundational Domain Randomization paper).
- Peng, X. B., et al. (2018). Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. ICRA.
- Makoviychuk, V., et al. (2021). Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning. NeurIPS Datasets and Benchmarks. (The architecture underlying IsaacLab).