Purpose of this lecture
Imitation learning gives a robot competent initial behavior. What it cannot provide is the ability to optimize long-term outcomes, recover from states outside the demonstration distribution, or discover strategies that improve upon what any human demonstrator would exhibit. Reinforcement learning (RL) is the natural complement to imitation: starting from an imitation-learned initialization, RL explores the state-action space and updates the policy toward trajectories of higher cumulative reward, potentially exceeding the performance of the expert demonstrations.
In practice, applying RL to physical robots is substantially more constrained than applying it to software environments. Sample efficiency, safety, continuous action spaces, and the difficulty of resetting physical environments impose design requirements that dominate algorithm selection. This lecture analyzes how RL is actually deployed in robotics: which algorithms are used, why, and how reward design, action space structure, and curriculum scheduling are chosen to make learning tractable under real-world constraints.
The physical robot RL setting
Robotic RL differs from the canonical Atari or MuJoCo benchmark setting in several ways that are not merely cosmetic.
Data is expensive in wall-clock time. A simulation-based RL experiment can collect tens of millions of environment steps in hours by running many environments in parallel. A physical robot yields useful transitions at an effective rate of roughly 3–10 Hz once resets, safety monitoring, and hardware delays are accounted for, which amounts to about 10,000–36,000 transitions per hour under favorable conditions. An experiment that consumes transitions at the scale of a simulated benchmark would require months to years of continuous operation on physical hardware. This makes sample efficiency not a preference but an existential requirement.
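The arithmetic behind these data rates can be sketched in a few lines. This is a minimal illustration only: the 3 Hz and 10 Hz rates come from the paragraph above, while the 50-million-transition budget is a hypothetical figure chosen for the example, not a number from the lecture.

```python
def transitions_per_hour(rate_hz: float) -> float:
    """Transitions collected in one hour at a given effective data rate."""
    return rate_hz * 3600.0


def days_to_collect(n_transitions: float, rate_hz: float) -> float:
    """Days of continuous operation needed to collect n_transitions."""
    return n_transitions / transitions_per_hour(rate_hz) / 24.0


# Effective rates quoted in the text.
LOW_HZ, HIGH_HZ = 3.0, 10.0

# Hypothetical experiment budget (an assumption made for illustration).
BUDGET = 50_000_000

low = transitions_per_hour(LOW_HZ)
high = transitions_per_hour(HIGH_HZ)
print(f"{low:.0f} to {high:.0f} transitions per hour")
print(
    f"{days_to_collect(BUDGET, HIGH_HZ):.0f} to "
    f"{days_to_collect(BUDGET, LOW_HZ):.0f} days for {BUDGET:,} transitions"
)
```

Even with no downtime at all, a budget of this size occupies a single robot for months, which is why the rest of the lecture treats every transition as too valuable to discard.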
Unsafe exploration damages hardware and people. In an Atari game, the worst outcome of a random action is losing points. On a physical robot, a random high-torque action can snap a joint, throw an object at a bystander, or drive the arm into a singularity. The standard RL exploration strategies ($\epsilon$-greedy, additive Gaussian noise on actions, random network distillation) are all potentially unsafe on physical hardware and must be constrained or replaced.
Environments are difficult to reset. Episodic RL assumes that the environment can be reset to an initial state distribution between episodes. On a physical robot, resetting means manually repositioning objects, returning the arm to a home configuration, and potentially recalibrating sensors. This process can take minutes per episode and dominates the total experiment time for tasks with short episodes.
Action spaces are continuous and high-dimensional. Physical robots operate in $\mathbb{R}^n$, with $n$ typically ranging from 6 to 30 (joint positions, velocities, or torques). The structure of this space (joint limits, kinematic constraints, dynamics coupling) must be reflected in the action space representation, which has no direct counterpart in discrete action environments.
Off-policy methods as the dominant paradigm
Why off-policy
The sample efficiency imperative drives robot RL toward off-policy algorithms that can reuse data from previous interactions regardless of which policy collected them. On-policy algorithms such as Proximal Policy Optimisation (PPO), Trust Region Policy Optimisation (TRPO), and Advantage Actor-Critic (A2C) require that all training data be collected from the current policy, making earlier data unusable once the policy updates. For a robot that collects only tens of thousands of transitions per hour at best, discarding any experience is unacceptable.
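Off-policy methods maintain a replay buffer that stores past transitions and samples mini-batches uniformly (or with prioritization) for training. The policy can be updated continuously using data collected by arbitrarily old policies, provided the update accounts for the mismatch between the policy that collected the data and the current one. The actor-critic methods discussed below do this by bootstrapping each target from the current policy's action at the next state rather than from the action the old policy actually took.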
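A replay buffer of the kind described above can be sketched as follows. This is a minimal uniform-sampling version under stated assumptions: the class name `ReplayBuffer`, the transition layout `(s, a, r, s_next, done)`, and the fixed capacity with oldest-first eviction are illustrative choices, not details from the lecture.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# Assumed transition layout: (state, action, reward, next_state, done).
Transition = Tuple[list, list, float, list, bool]


class ReplayBuffer:
    """Fixed-capacity buffer with uniform mini-batch sampling."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        # A bounded deque silently evicts the oldest transition when full.
        self._storage: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, s: list, a: list, r: float, s_next: list, done: bool) -> None:
        self._storage.append((s, a, r, s_next, done))

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement; needs batch_size <= len(self).
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


buffer = ReplayBuffer(capacity=100_000)
buffer.add([0.0, 0.0], [0.1], 0.0, [0.01, 0.0], False)
buffer.add([0.01, 0.0], [0.2], 1.0, [0.03, 0.0], True)
batch = buffer.sample(batch_size=2)
```

On hardware, the capacity is usually set large enough that nothing is ever evicted, so every transition the robot has collected stays available for training.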
SAC: the standard off-policy algorithm for continuous robotics
Soft Actor-Critic (SAC; Haarnoja et al., 2018) is the most widely deployed RL algorithm in robot learning. SAC optimizes a maximum entropy objective:
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$
where $\mathcal{H}(\pi(\cdot \mid s_t))$ is the policy entropy and $\alpha$ is a temperature parameter. The entropy term rewards the policy for maintaining diverse action distributions, which provides automatic exploration without requiring separate noise injection, and prevents premature convergence to a narrow local optimum.
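SAC maintains twin soft Q-functions $Q_{\theta_1}$ and $Q_{\theta_2}$ to reduce overestimation bias, the same double-Q trick introduced in TD3. The soft Bellman target for policy evaluation is:
$$y = r + \gamma \left( \min_{j=1,2} Q_{\bar{\theta}_j}(s', a') - \alpha \log \pi_\phi(a' \mid s') \right), \quad a' \sim \pi_\phi(\cdot \mid s')$$
where $\bar{\theta}_j$ denotes the target network parameters updated by exponential moving average. The policy is updated to maximize the entropy-regularized soft Q-value:
$$J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\phi(\cdot \mid s)} \left[ \min_{j=1,2} Q_{\theta_j}(s, a) - \alpha \log \pi_\phi(a \mid s) \right]$$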
In practice, SAC's automatic temperature tuning, which adjusts $\alpha$ to maintain a target entropy, removes the need for manual exploration tuning. This is a significant practical advantage in robot learning, where re-running experiments to tune hyperparameters is expensive.
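The two SAC quantities described above, the soft Bellman target and the temperature adjustment, can be sketched on scalar inputs. This is an illustrative sketch under assumptions: the function names are hypothetical, the `done` mask is a common implementation convention rather than part of the formula in the text, and the temperature update is one standard variant (gradient descent on $-\alpha(\log \pi + \bar{\mathcal{H}})$ with $\alpha$ parameterized by its logarithm).

```python
import math
from typing import Sequence


def soft_bellman_target(
    reward: float,
    done: bool,
    q1_next: float,
    q2_next: float,
    logp_next: float,
    alpha: float,
    gamma: float = 0.99,
) -> float:
    """r + gamma * (min of the twin target critics - alpha * log pi(a'|s'))."""
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    not_done = 0.0 if done else 1.0  # assumed convention: no bootstrap at terminal states
    return reward + gamma * not_done * soft_value


def temperature_step(
    log_alpha: float,
    logp_batch: Sequence[float],
    target_entropy: float,
    lr: float = 1e-3,
) -> float:
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi + target_entropy)]."""
    mean_logp = sum(logp_batch) / len(logp_batch)
    alpha = math.exp(log_alpha)
    grad = -alpha * (mean_logp + target_entropy)  # dJ / d(log_alpha)
    return log_alpha - lr * grad


y = soft_bellman_target(
    reward=1.0, done=False, q1_next=2.0, q2_next=3.0,
    logp_next=-1.0, alpha=0.5, gamma=0.9,
)
new_log_alpha = temperature_step(0.0, logp_batch=[-0.5, -0.5], target_entropy=1.0)
```

When the policy's estimated entropy (the negative mean log-probability) falls below the target, the step raises $\alpha$ and strengthens the entropy bonus; when entropy exceeds the target, it lowers $\alpha$.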
TD3: mathematical formulation
Twin Delayed Deep Deterministic Policy Gradient (TD3; Fujimoto et al., 2018) addresses the overestimation and instability problems of Deep Deterministic Policy Gradient (DDPG) through three mathematically precise modifications.
Double Q-networks: TD3 maintains two critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$ and uses the minimum of their target-network predictions as the Bellman target, reducing the overestimation bias that causes DDPG to diverge:
$$y = r + \gamma \min_{j=1,2} Q_{\bar{\theta}_j}(s', \tilde{a}')$$
Target policy smoothing: the target action $\tilde{a}'$ used to compute the Bellman target is obtained by adding clipped Gaussian noise to the target policy's deterministic output and clipping the result to the valid action range:
$$\tilde{a}' = \operatorname{clip}\big( \mu_{\bar{\phi}}(s') + \epsilon,\ a_{\min},\ a_{\max} \big), \quad \epsilon \sim \operatorname{clip}\big( \mathcal{N}(0, \sigma^2),\ -c,\ c \big)$$
This smoothing regularizes the Q-function by making the target less sensitive to narrow peaks in the critic's value surface. DDPG exploits such peaks catastrophically, computing gradients that push the policy toward actions whose values the Q-function has overestimated near a single point. The clipping of the noise $\epsilon$ to $[-c, c]$ (typically $c = 0.5$) prevents the noise from pushing the target action so far off-policy that the Bellman target becomes meaningless.
Delayed policy updates: TD3 updates the actor network only every $d$ critic gradient steps (typically $d = 2$), allowing the critic to partially converge before the actor uses it to compute policy gradients. Because the actor's gradient is computed by differentiating through the critic, an inaccurate critic produces misleading actor gradients. Delayed updates break the feedback loop between unstable critics and erratic actor updates.
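The three TD3 modifications can be sketched on plain Python values. This is a minimal illustration under assumptions: the function names are hypothetical, the critic and target-policy outputs are passed in as numbers rather than computed by networks, and the defaults `sigma=0.2`, `c=0.5`, `d=2` follow the values commonly used with TD3.

```python
import random
from typing import List, Sequence


def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def smoothed_target_action(
    mu_next: Sequence[float],
    a_low: Sequence[float],
    a_high: Sequence[float],
    sigma: float = 0.2,
    c: float = 0.5,
    rng=random,
) -> List[float]:
    """Target policy smoothing: clipped Gaussian noise, then action-range clipping."""
    smoothed = []
    for m, lo, hi in zip(mu_next, a_low, a_high):
        eps = clip(rng.gauss(0.0, sigma), -c, c)
        smoothed.append(clip(m + eps, lo, hi))
    return smoothed


def td3_target(
    reward: float, done: bool, q1_next: float, q2_next: float, gamma: float = 0.99
) -> float:
    """Clipped double-Q target: bootstrap from the smaller of the two target critics."""
    not_done = 0.0 if done else 1.0  # assumed convention: no bootstrap at terminal states
    return reward + gamma * not_done * min(q1_next, q2_next)


def actor_update_steps(num_critic_steps: int, d: int = 2) -> List[int]:
    """Delayed policy updates: critic steps after which the actor is also updated."""
    return [k for k in range(1, num_critic_steps + 1) if k % d == 0]


a_tilde = smoothed_target_action([0.0, 0.95], [-1.0, -1.0], [1.0, 1.0])
y = td3_target(reward=1.0, done=False, q1_next=2.0, q2_next=3.0, gamma=0.9)
schedule = actor_update_steps(6, d=2)
```

In a full implementation `q1_next` and `q2_next` would be the target critics evaluated at the smoothed action, and the target networks would be updated on the same delayed schedule as the actor.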
TD3 is often competitive with SAC on continuous control benchmarks and is preferred in robotics settings where stochastic policies are undesirable, for instance when the policy must be deterministic for safety certification, or when action noise is explicitly managed by a higher-level safety filter rather than the policy itself.
Safety constraints in robotic RL
Physical safety during RL training and deployment is enforced at multiple levels, with different mechanisms addressing different timescales and failure modes.
Workspace limits define the set of joint positions, velocities, and torques that are physically safe and hardware-preserving. These are typically implemented as hard constraints in the robot's low-level controller, which clips or rejects commands that violate the limits before they reach the actuators. From the RL algorithm's perspective, the executed action is therefore always a clipped version of what the policy proposes.
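Control Barrier Functions (CBFs) provide a formal safety filter at the policy output level. A CBF $h$ defines a safe set $\mathcal{C} = \{ s : h(s) \ge 0 \}$, and the safety filter solves a quadratic program (CBF-QP) at each timestep to find the nearest safe action $a^*$ to the action $a_\pi$ that the policy proposes:
$$a^* = \arg\min_{a} \lVert a - a_\pi \rVert^2 \quad \text{s.t.} \quad \dot{h}(s, a) \ge -\kappa\, h(s)$$
where $\kappa > 0$ is a gain. For control-affine dynamics the constraint is linear in $a$, which is what makes the problem a QP. The CBF constraint ensures forward invariance of the safe set: if the robot is currently safe ($h(s) \ge 0$), executing $a^*$ guarantees it remains safe, provided the constraint is enforced at every step. This filter can be applied to any learned policy without modifying the learning algorithm, which is particularly valuable in early training when the policy is far from safe.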
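When the CBF condition reduces to a single constraint that is affine in the action, the QP has a closed-form solution: project the proposed action onto the constraint half-space. The sketch below illustrates that special case only. The function name `cbf_qp_filter`, the constraint form `c . a + b >= 0`, the gain `KAPPA`, and the one-dimensional single-integrator example are all assumptions made for illustration; a real system with several constraints would call a QP solver.

```python
from typing import List, Sequence


def cbf_qp_filter(a_pi: Sequence[float], c: Sequence[float], b: float) -> List[float]:
    """Solve min ||a - a_pi||^2 subject to c . a + b >= 0 (one affine constraint)."""
    slack = sum(ci * ai for ci, ai in zip(c, a_pi)) + b
    if slack >= 0.0:
        return list(a_pi)  # proposed action already satisfies the constraint
    norm_sq = sum(ci * ci for ci in c)
    if norm_sq == 0.0:
        raise ValueError("constraint is infeasible for every action")
    # Move along c just far enough to land on the constraint boundary.
    scale = slack / norm_sq
    return [ai - scale * ci for ai, ci in zip(a_pi, c)]


# Hypothetical example: position p with velocity command a (p_dot = a) and
# barrier h(p) = P_MAX - p. Then h_dot = -a, so the CBF condition
# h_dot >= -KAPPA * h becomes -a + KAPPA * (P_MAX - p) >= 0.
P_MAX, KAPPA = 1.0, 2.0
p = 0.9
c = [-1.0]
b = KAPPA * (P_MAX - p)

safe_fast = cbf_qp_filter([0.5], c, b)  # too fast near the limit: reduced
safe_slow = cbf_qp_filter([0.1], c, b)  # already safe: returned unchanged
```

In this example the filter caps the approach velocity at `KAPPA * (P_MAX - p)`, so the allowed speed shrinks to zero as the position reaches the limit.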
Shielded RL combines the learned policy with a classical controller that takes over when safety-critical conditions are predicted. Unlike CBF-QP, which modifies individual actions, a shield performs predictive forward trajectory simulation: at each timestep, it rolls the proposed policy action forward for a short horizon of $H$ steps using a dynamics model, and checks whether any predicted state along the rollout enters the unsafe set:
$$\text{engage shield if} \quad \min_{k = 1, \dots, H} h(\hat{s}_{t+k}) < \delta$$
where $h$ is the safety function, $\hat{s}_{t+k}$ is the predicted state under the proposed policy, and $\delta$ is a safety margin. If any predicted state violates safety, the shield engages before that violation occurs, executing a pre-programmed safe recovery behavior (stop, reverse, or return to a home configuration) and resuming learned policy control once the predicted rollout is again safe. This predictive structure is the key difference from reactive safety filters: the shield catches near-misses before they happen rather than correcting them as they occur. The horizon $H$ is the primary design parameter; too short a horizon misses slow-developing violations, while too long a horizon causes excessive conservative shielding on trajectories that would self-correct.
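The predictive check can be sketched as a short rollout loop. This is a minimal illustration under assumptions: the function names are hypothetical, the proposed action is held constant over the horizon, and the one-dimensional dynamics model, barrier, and recovery action are toy stand-ins for a real robot model.

```python
from typing import Callable

State = float
Action = float


def shield_engages(
    state: State,
    action: Action,
    dynamics: Callable[[State, Action], State],
    h: Callable[[State], float],
    horizon: int,
    margin: float,
) -> bool:
    """True if any predicted state within the horizon has h(s) below the margin."""
    s = state
    for _ in range(horizon):
        s = dynamics(s, action)  # assumption: the proposed action is held constant
        if h(s) < margin:
            return True
    return False


def shielded_action(
    state: State,
    proposed: Action,
    recovery: Action,
    dynamics: Callable[[State, Action], State],
    h: Callable[[State], float],
    horizon: int,
    margin: float,
) -> Action:
    """Execute the recovery action when the shield engages, else the proposal."""
    if shield_engages(state, proposed, dynamics, h, horizon, margin):
        return recovery
    return proposed


# Toy model (assumed): position integrates the commanded velocity.
P_MAX, DT = 1.0, 0.1


def step(p: State, v: Action) -> State:
    return p + DT * v


def barrier(p: State) -> float:
    return P_MAX - p


short = shielded_action(0.5, 1.0, 0.0, step, barrier, horizon=3, margin=0.05)
longer = shielded_action(0.5, 1.0, 0.0, step, barrier, horizon=5, margin=0.05)
```

The same proposed action passes with a three-step horizon and is overridden with a five-step horizon, which is the horizon sensitivity described above in miniature.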
Action space design
The choice of action representation has a larger effect on learning efficiency than the choice of RL algorithm in most robotic applications. The action space determines what the policy must learn, how directly the action affects the task objective, and how much prior knowledge about robot kinematics and dynamics can be injected into the learning system.
Joint space actions (torques or positions) provide maximum expressive control but require the policy to learn the full inverse kinematics and dynamics of the robot. A policy operating in joint space must learn that moving a cup on a table requires coordinating 6–7 joints in a specific coupled manner — a mapping that takes many samples to learn from scratch. Joint-space torque control is the most expressive and physically faithful representation but is rarely used directly in end-to-end robot learning.
End-effector Cartesian commands specify the desired position, orientation, or velocity of the end-effector, with an inverse kinematics solver converting these to joint commands. This abstraction reduces the effective dimensionality of the policy's task from joint coordination to Cartesian manipulation geometry, which is much more natural for contact and grasping tasks. The policy must still learn the geometry of manipulation, but it is freed from learning the robot-specific coupling between joints. Most modern robot learning policies operate in this space.
Delta actions represent the desired change from the current end-effector pose rather than the absolute target: $x_t^{\text{target}} = x_t + \Delta x_t$, where $x_t$ is the current pose and the policy outputs $\Delta x_t$. This parameterization has better training properties than absolute targets because the gradient of the policy's loss with respect to small delta actions is well-conditioned near the identity mapping: the "do nothing" action corresponds to zero, which is easier to initialize around than arbitrary absolute poses.
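A delta-action interface can be sketched as a function that turns the policy output into an absolute target for the downstream controller. This is an illustrative sketch under assumptions: the function name, the per-step bound `max_step`, and the workspace box are hypothetical, and only position is handled (orientation deltas need a rotation composition that is omitted here).

```python
from typing import List, Sequence


def apply_delta_action(
    position: Sequence[float],
    delta: Sequence[float],
    max_step: float,
    lower: Sequence[float],
    upper: Sequence[float],
) -> List[float]:
    """Convert a policy delta into an absolute end-effector position target."""
    target = []
    for p, d, lo, hi in zip(position, delta, lower, upper):
        d = max(-max_step, min(max_step, d))     # bound the per-step motion
        target.append(max(lo, min(hi, p + d)))   # stay inside the workspace box
    return target


# Hypothetical workspace box and current position, in metres.
LOWER = [0.2, -0.3, 0.05]
UPPER = [0.7, 0.3, 0.50]
current = [0.40, 0.00, 0.30]

hold = apply_delta_action(current, [0.0, 0.0, 0.0], 0.02, LOWER, UPPER)
move = apply_delta_action(current, [0.10, -0.01, 0.0], 0.02, LOWER, UPPER)
```

A zero output leaves the target at the current position, which is the "do nothing" property discussed above, and the per-step bound means a freshly initialized policy can only produce small motions.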
Skill-based or options-space actions operate at a higher level of abstraction, where each action corresponds to a parameterized skill primitive (approach, grasp, push, release) or a learned sub-policy from a library. In the options framework, a high-level policy selects which option to execute and for how long, while low-level controllers carry out the option. This hierarchical structure dramatically reduces the effective horizon for the high-level policy and allows compositional manipulation behaviors to be learned with many fewer samples than a flat policy.
Reward design and shaping
The choice of action space is not independent of reward design: it directly restricts which reward terms are computable and how those terms interact with the policy gradient. A joint-space torque control policy allows the shaped reward to directly penalize joint torques as an energy or effort proxy: $r_{\text{effort}} = -\lVert \tau_t \rVert^2$. This term is only directly available if the policy produces torques itself. If the policy outputs Cartesian delta commands that are tracked by a low-level controller, the torques are an internal signal not exposed to the RL layer, and penalizing them requires computing inverse dynamics through the full robot model, which is an additional modeling and computational step. Conversely, a task-space Cartesian policy makes it straightforward to penalize end-effector Cartesian jerk for smooth motion, while a joint-space policy would need to propagate this penalty through forward kinematics (FK) to evaluate it. These interaction effects between action space and reward structure are a frequently underappreciated source of difficulty in reward engineering.
Reward specification in robotics is simultaneously critical and difficult. The reward signal must accurately capture the task objective, provide sufficient learning signal to guide exploration, and avoid Goodhart's Law failure modes where the policy achieves high reward through unintended behaviors.
Sparse rewards (binary success at episode end) are the most faithful to the true task objective but provide almost no learning signal in early training: a policy that has never successfully grasped an object receives a constant reward of zero regardless of how much it improved over the course of the episode, making gradient-based updates uninformative.
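Shaped rewards add intermediate feedback by defining reward components that correlate with task progress:
$$r_t = r_{\text{task}} + w_p\, r_{\text{prox}} + w_s\, r_{\text{smooth}} + w_e\, r_{\text{effort}}$$
where $r_{\text{task}}$ is the sparse task reward, $r_{\text{prox}}$ rewards decreasing distance to the goal, $r_{\text{smooth}}$ penalizes jerky joint motions, $r_{\text{effort}}$ penalizes unnecessary torque, and $w_p, w_s, w_e \ge 0$ are weights. The Ng et al. (1999) potential-based reward shaping theorem guarantees that a shaped reward $r'(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s)$, where $\Phi$ is any real-valued potential function over states, does not change the optimal policy of the original Markov Decision Process (MDP). Non-potential-based shaping (arbitrary additions to the reward) can change the optimal policy and can produce unintended behavior even when the shaping terms seem reasonable.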
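The reason potential-based shaping is harmless can be checked numerically: along any trajectory the shaping terms telescope, so the shaped return differs from the original return only by $\gamma^T \Phi(s_T) - \Phi(s_0)$, which does not depend on the actions taken in between. The sketch below verifies this on a toy trajectory. The states, rewards, discount, and the potential $\Phi(s) = -s$ (negative distance to goal) are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence


def shaped_rewards(
    rewards: Sequence[float],
    states: Sequence[float],
    potential: Callable[[float], float],
    gamma: float,
) -> List[float]:
    """r'_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t); states has len(rewards) + 1 entries."""
    return [
        r + gamma * potential(states[t + 1]) - potential(states[t])
        for t, r in enumerate(rewards)
    ]


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Toy trajectory: each state is the remaining distance to the goal.
states = [3.0, 2.0, 1.0, 0.0]
rewards = [0.0, 0.0, 1.0]  # sparse success reward on the final step
gamma = 0.9


def phi(s: float) -> float:
    return -s  # assumed potential: negative distance to the goal


original = discounted_return(rewards, gamma)
shaped = discounted_return(shaped_rewards(rewards, states, phi, gamma), gamma)

# Telescoping identity: the difference depends only on the endpoints.
horizon = len(rewards)
offset = (gamma ** horizon) * phi(states[-1]) - phi(states[0])
assert abs(shaped - (original + offset)) < 1e-9
```

The shaped rewards are dense (every step that closes distance is rewarded), yet the ranking of trajectories that share a start state and end at zero potential is unchanged, which is the practical content of the theorem.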
Reward hacking in robotics often manifests as the policy exploiting simulator imprecision: sliding objects rather than lifting them to minimize effort while maximizing proximity rewards, vibrating the end-effector to generate false contact signals, or holding the arm stationary in a pose that maximizes proximity reward without completing the grasp. Catching these behaviors requires rigorous success-criterion evaluation that is independent of the training reward.
Curriculum learning
Learning a complex manipulation task from a random policy initialization is typically infeasible: the probability of reaching the goal from random actions is near zero, the resulting zero-reward trajectories provide no useful gradient signal, and the policy cannot improve. Curriculum learning addresses this by sequencing a progression of tasks from easy to hard, ensuring that each stage of training provides useful signal for the next.
A curriculum can vary difficulty along multiple axes: initial state distributions (start with the object already in the robot's hand before training pick-and-place), tolerances (require coarser position accuracy early in training, then tighten), horizon lengths (train on short-horizon tasks that require fewer sequential decisions), or physics complexity (simple frictionless grasping before contact-rich assembly).
Automatic curriculum generation methods, including POET (Paired Open-Ended Trailblazer) and goal-conditioned approaches built on universal value functions, adaptively adjust task difficulty based on the agent's current performance, keeping the task in the "zone of proximal development" where it is challenging but solvable. This avoids both the trivial-task plateau (too easy to provide useful signal) and the impossible-task plateau (too hard for any gradient signal).
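The idea of keeping difficulty matched to current performance can be sketched with a simple success-rate rule. This is a deliberately minimal heuristic, not POET or any published method: the class name, the window size, and the promotion and demotion thresholds are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque


class SuccessRateCurriculum:
    """Raise difficulty when recent success is high, lower it when it is low."""

    def __init__(
        self,
        num_levels: int,
        window: int = 20,
        promote_at: float = 0.8,
        demote_at: float = 0.2,
    ) -> None:
        self.level = 0  # 0 is the easiest task variant
        self.num_levels = num_levels
        self.promote_at = promote_at
        self.demote_at = demote_at
        self._results: Deque[float] = deque(maxlen=window)

    def record(self, success: bool) -> int:
        """Log one episode outcome and return the level for the next episode."""
        self._results.append(1.0 if success else 0.0)
        if len(self._results) == self._results.maxlen:
            rate = sum(self._results) / len(self._results)
            if rate >= self.promote_at and self.level < self.num_levels - 1:
                self.level += 1
                self._results.clear()  # restart statistics at the new level
            elif rate <= self.demote_at and self.level > 0:
                self.level -= 1
                self._results.clear()
        return self.level


curriculum = SuccessRateCurriculum(num_levels=3, window=4)
for outcome in [True, True, True, True]:
    level = curriculum.record(outcome)
```

Each level would map to one setting of the difficulty axes listed earlier, for example a wider initial state distribution or a tighter position tolerance.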
GenAI context: comparing RL regimes
The contrast between RL in robotics and RL in language model training illuminates which constraints are intrinsic to sequential decision-making and which are domain-specific.
| Dimension | Robot RL | Language model RL (RLHF/GRPO) |
|---|---|---|
| Dominant algorithm | Off-policy SAC/TD3 | On-policy PPO/GRPO |
| Sample cost | Wall-clock expensive (hardware) | Token-compute expensive (but parallelizable) |
| Action space | Continuous | Discrete (vocabulary) |
| Safety | Hard physical constraints (CBF-QP) | KL constraint from reference model |
| Reward | Shaped dense or sparse task reward | Human preference or rule-based verifier |
| Curriculum | Task complexity and initial conditions | Difficulty of reasoning problems |
The preference for on-policy RL in large language model (LLM) training reverses the robot RL preference for off-policy RL because token generation is cheap (LLMs can produce thousands of rollouts per minute), so the sample efficiency advantage of off-policy methods is less decisive than their instability cost. The on-policy PPO training that is impractical for physical robots becomes the default choice when data generation is fast and cheap.
Key takeaways
- Robotic RL operates under hard constraints (sample efficiency, physical safety, and difficult resets) that dominate algorithm selection.
- Off-policy methods, especially SAC and TD3, are standard because they reuse collected experience through replay buffers.
- SAC's maximum entropy objective provides automatic exploration through policy entropy regularization, removing the need for separate noise injection.
- Safety is enforced at multiple levels: hardware workspace limits, CBF-QP filters at the policy output, and shielded controllers for emergency recovery.
- Action space design has large effects on learning efficiency; end-effector Cartesian delta commands are the most common choice for manipulation.
- Potential-based reward shaping is theoretically principled; non-potential shaping can change the optimal policy in unintended ways.
- Curriculum learning is essential for complex tasks where a randomly initialized policy cannot reach the goal.
- The RL choices made for robotics and for language models differ systematically because the cost of data generation differs by orders of magnitude between domains.
Conceptual questions
- A team attempts to train a peg-insertion policy using SAC directly on physical hardware with a sparse reward (success = 1 when the peg is inserted, 0 otherwise). After 10,000 hardware interactions over two days, the policy shows no improvement. Diagnose this failure using the concepts of sample efficiency and sparse reward. Propose a complete modified experimental design (including reward shaping, curriculum, and replay buffer strategy) that would make learning tractable, and justify each component's role.
- The CBF-QP safety filter modifies the policy's proposed action to satisfy $\dot{h}(s, a) \ge -\kappa\, h(s)$. During RL training, the policy consistently proposes actions that are slightly outside the safe set, and the CBF-QP correction is small (< 5% change in action norm). Analyze the effect of this persistent correction on the RL training dynamics. Does the policy converge to the CBF-constrained optimal policy? Under what conditions would the CBF correction interfere with convergence, and how would you detect this?
- An engineer must choose between joint-space torque control and end-effector Cartesian delta control for training a bimanual assembly task where both arms must coordinate to simultaneously insert two pegs. Analyze the tradeoffs specifically for bimanual coordination: which action space makes the inter-arm coupling easier or harder to learn? Does the answer change if a high-quality analytical inverse kinematics solution is available? If the task requires contact forces to be controlled explicitly?
- A robot RL policy is trained with a shaped reward $r_t = r_{\text{task}} + w_p\, r_{\text{prox}} + w_s\, r_{\text{smooth}} + w_e\, r_{\text{effort}}$. After training, the policy achieves 90% success rate in simulation but only 30% in physical deployment. Post-hoc analysis shows the policy learned to exploit a simulation artifact where the proximity reward can be maximized by a specific jerky motion that is physically impossible on the real robot. Identify which reward shaping principle was violated, explain how the Ng et al. theorem fails to protect against this failure mode, and redesign the reward.
- SAC optimizes the maximum entropy objective with temperature $\alpha$. Compare the behavior of SAC with very large $\alpha$ (high entropy target) versus very small $\alpha$ (low entropy target) during the early training phase of a contact-rich grasping task. Specifically: how does each setting affect the exploration-exploitation tradeoff, the rate of unsafe actions near joint limits, and the eventual convergence behavior? Use this analysis to argue for a specific $\alpha$ schedule over the course of training.
Looking ahead
Most practical robot RL training does not occur on physical hardware but in simulation, with the trained policy subsequently transferred to the real world. The validity of this approach depends entirely on how faithfully the simulator captures the physical dynamics the robot will encounter.
Week 7: Sim2Real Pipelines and IsaacLab. We examine how modern GPU-accelerated simulation stacks (Isaac Sim, IsaacLab) are structured, why the sim2real gap is fundamentally difficult to close, and how domain randomization combined with system identification produces policies that transfer robustly to physical hardware.
Further reading
- Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement Learning in Robotics: A Survey. IJRR. (Classic overview of RL challenges on hardware.)
- Hwangbo, J., et al. (2019). Learning Agile and Dynamic Motor Skills for Legged Robots. Science Robotics. (Pioneering work on using RL for continuous legged locomotion.)