Week 6: Reinforcement Learning for Robotics

✦Learning Outcomes
  • Compare on-policy vs off-policy reinforcement learning (RL) in robotic settings
  • Design reward functions for robotic manipulation and locomotion
  • Apply curriculum learning to complex robot tasks
  • Connect robot RL to Course 1 concepts
◆Prerequisites
  • Course 1 (Reinforcement Learning): Weeks 7-8 (Policy gradient, PPO, SAC)
  • Week 5: Imitation learning
  • Week 3: Control fundamentals

This week connects RL theory to robotics. Review Course 1 Week 7 for policy gradient concepts.

Purpose of this lecture

Imitation learning gives a robot competent initial behavior. What it cannot provide is the ability to optimize long-term outcomes, recover from states outside the demonstration distribution, or discover strategies that improve upon what any human demonstrator would exhibit. Reinforcement learning (RL) is the natural complement to imitation: starting from an imitation-learned initialization, RL explores the state-action space and updates the policy toward trajectories of higher cumulative reward, potentially exceeding the performance of the expert demonstrations.

In practice, applying RL to physical robots is substantially more constrained than applying RL to software environments. Sample efficiency, safety, continuous action spaces, and the difficulty of resetting physical environments impose design requirements that dominate algorithm selection. This lecture analyzes how RL is actually deployed in robotics: which algorithms are used, why, and how reward design, action space structure, and curriculum scheduling are chosen to make learning tractable under real-world constraints.


The physical robot RL setting

Robotic RL differs from the canonical Atari or MuJoCo benchmark setting in several ways that are not merely cosmetic.

Data is expensive in wall-clock time. A simulation-based RL experiment can collect tens of millions of environment steps in hours by running many environments in parallel. A physical robot collects useful data at roughly 3–10 Hz (accounting for resets, safety monitoring, and hardware delays), which amounts to roughly 10,000–36,000 transitions per hour under favorable conditions. An experiment that consumes $10^7$ transitions on a simulated benchmark would therefore require weeks to months of continuous operation on physical hardware, and longer still once maintenance and operator downtime are included. This makes sample efficiency not a preference but an existential requirement.
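The arithmetic behind this comparison can be checked directly. The sketch below assumes the effective data rates quoted above (3–10 Hz) and uninterrupted operation; the helper names `transitions_per_hour` and `days_to_collect` are illustrative, not part of any library.

```python
SECONDS_PER_HOUR = 3600


def transitions_per_hour(rate_hz: float) -> float:
    """Transitions collected in one hour at an effective data rate in Hz."""
    return rate_hz * SECONDS_PER_HOUR


def days_to_collect(num_transitions: float, rate_hz: float) -> float:
    """Days of uninterrupted operation needed to collect num_transitions."""
    hours = num_transitions / transitions_per_hour(rate_hz)
    return hours / 24.0


if __name__ == "__main__":
    # Assumed budget: 1e7 transitions, as for a typical simulated benchmark run.
    for rate_hz in (3.0, 10.0):
        per_hour = transitions_per_hour(rate_hz)
        days = days_to_collect(1e7, rate_hz)
        print(f"{rate_hz:4.0f} Hz: {per_hour:7.0f} transitions/hour, "
              f"{days:6.1f} days for 1e7 transitions")
```

Even under these optimistic assumptions a single run occupies the robot for well over a week, and any real schedule with breaks, repairs, and supervision stretches that further.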

Unsafe exploration damages hardware and people. In an Atari game, the worst outcome of a random action is losing points. On a physical robot, a random high-torque action can snap a joint, throw an object at a bystander, or drive the arm into a singularity. The standard RL exploration strategies ($\varepsilon$-greedy, additive Gaussian noise on actions, random network distillation) are all potentially unsafe on physical hardware and must be constrained or replaced.

Environments are difficult to reset. Episodic RL assumes that the environment can be reset to an initial state distribution between episodes. On a physical robot, resetting means manually repositioning objects, returning the arm to a home configuration, and potentially recalibrating sensors. This process can take minutes per episode and dominates the total experiment time for tasks with short episodes.

Action spaces are continuous and high-dimensional. Physical robots operate in $\mathbb{R}^n$, with $n$ typically ranging from 6 to 30 (joint positions, velocities, or torques). The structure of this space (joint limits, kinematic constraints, dynamics coupling) must be reflected in the action space representation, which has no direct counterpart in discrete action environments.


Off-policy methods as the dominant paradigm

Why off-policy

The sample efficiency imperative drives robot RL toward off-policy algorithms that can reuse data from previous interactions regardless of which policy collected them. On-policy algorithms (PPO, TRPO, A2C) require that all training data be collected from the current policy, making earlier data unusable once the policy updates. For a robot that collects at most a few tens of thousands of transitions per hour, discarding any experience is unacceptable.

Off-policy methods maintain a replay buffer $\mathcal{B}$ that stores all past transitions $(s_t, a_t, r_t, s_{t+1})$ and samples mini-batches uniformly (or with prioritization) for training. The policy can be updated continuously using data collected by arbitrarily old policies, provided that the update rule remains valid for off-policy data, as the Q-learning-style Bellman backups used by SAC and TD3 do.
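A replay buffer of this kind can be sketched in a few lines. The version below is a minimal illustration using only the Python standard library; the finite `capacity`, the `done` flag, and the class and field names are assumptions added for the example and are not prescribed by the lecture.

```python
import random
from collections import deque
from typing import NamedTuple, Sequence


class Transition(NamedTuple):
    state: Sequence[float]
    action: Sequence[float]
    reward: float
    next_state: Sequence[float]
    done: bool  # assumed extra field marking episode termination


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform mini-batch sampling."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        # deque(maxlen=...) evicts the oldest transition once capacity is reached.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        self._storage.append(transition)

    def sample(self, batch_size: int) -> list:
        # Uniform sampling without replacement; the policy that generated
        # each transition plays no role in whether it is selected.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=3)
    for step in range(5):
        buffer.add(Transition([float(step)], [0.0], float(step), [step + 1.0], False))
    print(len(buffer), [t.reward for t in buffer.sample(2)])
```

On hardware the capacity is usually set large enough that nothing is ever evicted; prioritized sampling would replace the uniform draw in `sample`.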

SAC: the standard off-policy algorithm for continuous robotics

Soft Actor-Critic (SAC; Haarnoja et al., 2018) is among the most widely deployed RL algorithms in robot learning. SAC maximizes a maximum entropy objective:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t))\right]$$

where $H(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s_t)]$ is the policy entropy and $\alpha$ is a temperature parameter. The entropy term rewards the policy for maintaining diverse action distributions, which provides automatic exploration without requiring separate noise injection, and prevents premature convergence to a narrow local optimum.

SAC maintains twin soft Q-functions $Q_{\phi_1}(s, a)$ and $Q_{\phi_2}(s, a)$ to reduce overestimation bias, using the same clipped double-Q trick introduced in TD3. The soft Bellman target for policy evaluation is:

$$y = r + \gamma \left(\min_{j=1,2} Q_{\bar{\phi}_j}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s')\right), \quad \tilde{a}' \sim \pi_\theta(\cdot \mid s')$$

where $\bar{\phi}_j$ denotes the target network parameters, updated as an exponential moving average of $\phi_j$. The policy is updated to maximize the entropy-regularized soft Q-value:

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{B},\, a \sim \pi_\theta}\!\left[\min_{j=1,2} Q_{\phi_j}(s, a) - \alpha \log \pi_\theta(a \mid s)\right]$$

In practice, SAC's automatic temperature tuning, which adjusts $\alpha$ to maintain a target entropy $H_{\text{target}}$, removes much of the need for manual exploration tuning. This is a significant practical advantage in robot learning, where re-running experiments to tune hyperparameters is expensive.
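The three scalar computations in this section (the soft Bellman target, the moving-average target update, and the temperature adjustment) can be written out as follows. This is a framework-free sketch on plain floats, not a training loop. The function names, the step size `lr`, the rate `tau`, and the choice to optimize $\log \alpha$ with the loss $-\log\alpha \cdot \mathbb{E}[\log \pi + H_{\text{target}}]$ are assumptions; the last is one common implementation variant rather than the only formulation.

```python
def soft_bellman_target(reward, gamma, q1_next, q2_next, log_prob_next, alpha):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + gamma * soft_value


def polyak_update(target_params, online_params, tau):
    """Exponential moving average: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def temperature_step(log_alpha, log_probs, target_entropy, lr):
    """One gradient-descent step on L = -log_alpha * mean(log_pi + H_target)."""
    entropy_estimate = -sum(log_probs) / len(log_probs)
    grad = entropy_estimate - target_entropy  # dL / d(log_alpha)
    return log_alpha - lr * grad


if __name__ == "__main__":
    y = soft_bellman_target(reward=1.0, gamma=0.99, q1_next=5.0, q2_next=4.0,
                            log_prob_next=-1.0, alpha=0.2)
    print("soft target:", y)
    print("target params:", polyak_update([0.0], [1.0], tau=0.005))
    # Entropy estimate (1.0) exceeds the target (-2.0), so log_alpha decreases.
    print("new log_alpha:", temperature_step(0.0, [-1.0, -1.0], -2.0, lr=0.1))
```

When the sampled entropy is above the target, the temperature falls and the policy is allowed to become more deterministic; when it is below, the temperature rises and exploration is restored.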

TD3: mathematical formulation

Twin Delayed Deep Deterministic Policy Gradient (TD3; Fujimoto et al., 2018) addresses the overestimation and instability problems of DDPG through three modifications.

Double Q-networks: TD3 maintains two critic networks $Q_{\phi_1}$ and $Q_{\phi_2}$ and uses the minimum of their target-network predictions in the Bellman target, reducing the overestimation bias that can cause DDPG to diverge:

$$y = r + \gamma \min_{j=1,2} Q_{\bar{\phi}_j}(s', \tilde{a}')$$

Target policy smoothing: the target action $\tilde{a}'$ used to compute the Bellman target is obtained by adding clipped Gaussian noise to the target policy's deterministic output and clipping the result to the valid action range:

$$\tilde{a}' = \text{clip}\!\left(\pi_{\bar{\theta}}(s') + \text{clip}(\epsilon, -c, c),\; a_{\text{low}},\; a_{\text{high}}\right), \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

Here $\theta$ denotes the actor parameters and $\bar{\theta}$ their target-network copy. This smoothing regularizes the Q-function by making the target less sensitive to narrow peaks in the critic's value surface. DDPG exploits such peaks catastrophically, computing gradients that push the policy toward action values that the Q-function has overestimated near a single point. The clipping of $\epsilon$ to $[-c, c]$ (typically $c = 0.5$) prevents the noise from pushing the target action so far from the target policy's output that the Bellman target becomes meaningless.

Delayed policy updates: TD3 updates the actor network only every $d$ critic gradient steps (typically $d = 2$), allowing the critic to partially converge before the actor uses it to compute policy gradients. Because the actor's gradient $\nabla_\theta J = \mathbb{E}\!\left[\nabla_a Q_{\phi_1}(s,a)\big|_{a=\pi_\theta(s)} \, \nabla_\theta \pi_\theta(s)\right]$ is computed by differentiating through the critic, an inaccurate critic produces misleading actor gradients. Delayed updates break the feedback loop between unstable critics and erratic actor updates.
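The three TD3 modifications can be sketched without a deep learning framework. In the example below the critics are passed in as plain callables, and the noise scale `sigma`, the action bounds, and all function names are assumptions chosen for illustration.

```python
import random


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def smoothed_target_action(target_policy_action, sigma, c, a_low, a_high, rng):
    """Target policy smoothing: add clipped Gaussian noise, then clip to the action range."""
    noise = [clip(rng.gauss(0.0, sigma), -c, c) for _ in target_policy_action]
    return [clip(a + n, a_low, a_high) for a, n in zip(target_policy_action, noise)]


def td3_target(reward, gamma, q1_target, q2_target, next_state, next_action):
    """Clipped double-Q target: bootstrap from the smaller of the two target critics."""
    q_min = min(q1_target(next_state, next_action), q2_target(next_state, next_action))
    return reward + gamma * q_min


def actor_update_steps(num_critic_steps, d):
    """Critic step indices (1-based) at which the actor is also updated."""
    return [step for step in range(1, num_critic_steps + 1) if step % d == 0]


if __name__ == "__main__":
    rng = random.Random(0)
    a_next = smoothed_target_action([0.9, -0.9], sigma=0.2, c=0.5,
                                    a_low=-1.0, a_high=1.0, rng=rng)
    # Stand-in critics returning constants, purely for demonstration.
    y = td3_target(1.0, 0.5, lambda s, a: 3.0, lambda s, a: 2.0, [0.0], a_next)
    print(a_next, y, actor_update_steps(6, d=2))
```

The schedule returned by `actor_update_steps` makes the delay explicit: with $d = 2$ the actor (and, in the original algorithm, the target networks) is updated on every second critic step.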

TD3 is often competitive with SAC on continuous control benchmarks and is preferred in robotics settings where stochastic policies are undesirable, for instance when the policy must be deterministic for safety certification, or when action noise is explicitly managed by a higher-level safety filter rather than the policy itself.


Safety constraints in robotic RL

Physical safety during RL training and deployment is enforced at multiple levels, with different mechanisms addressing different timescales and failure modes.

Workspace limits define the set of joint positions, velocities, and torques that are physically safe and hardware-preserving. These are typically implemented as hard constraints in the robot's low-level controller, which clips or rejects commands that violate the limits before they reach the actuators. From the RL algorithm's perspective, the executed action is therefore always a clipped version of what the policy proposes.

Control Barrier Functions (CBFs) provide a formal safety filter at the policy output level. A CBF $h: \mathcal{S} \to \mathbb{R}$ defines a safe set $\mathcal{C} = \{s : h(s) \geq 0\}$, and the safety filter solves a quadratic program (CBF-QP) at each timestep to find the safe action nearest to the one the policy proposes:

$$a_{\text{safe}} = \arg\min_{a} \| a - \pi_\theta(s) \|^2 \quad \text{s.t.} \quad \dot{h}(s, a) + \kappa h(s) \geq 0$$

The CBF constraint $\dot{h} + \kappa h \geq 0$, with gain $\kappa > 0$, ensures forward invariance of the safe set: if the robot is currently safe ($h(s) \geq 0$), applying $a_{\text{safe}}$ keeps $h$ from becoming negative, so the state stays in the safe set. The guarantee is exact in continuous time with an accurate dynamics model; a discrete-time implementation inherits it only approximately. This filter can be applied to any learned policy without modifying the learning algorithm, which is particularly valuable in early training when the policy is far from safe.
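When the CBF condition is affine in the action and there is only one constraint, the CBF-QP has a closed-form solution: project the proposed action onto the constraint half-space. The sketch below assumes that setting, writing the condition as $g \cdot a + b \geq 0$, and illustrates it on a hypothetical 1-D velocity-controlled system $\dot{x} = v$ with $h(x) = x_{\max} - x$. General problems with several constraints or actuator limits need a QP solver instead.

```python
def cbf_qp_single_constraint(a_nominal, g, b):
    """Solve min ||a - a_nominal||^2 s.t. g . a + b >= 0 in closed form.

    Assumes the CBF condition h_dot + kappa * h >= 0 is affine in the action.
    """
    slack = sum(gi * ai for gi, ai in zip(g, a_nominal)) + b
    if slack >= 0.0:
        return list(a_nominal)  # proposed action already satisfies the constraint
    g_norm_sq = sum(gi * gi for gi in g)
    if g_norm_sq == 0.0:
        raise ValueError("constraint does not depend on the action; no safe action exists")
    scale = -slack / g_norm_sq
    # Smallest correction, taken along the constraint normal g.
    return [ai + scale * gi for ai, gi in zip(a_nominal, g)]


def filter_velocity_command(x, v_nominal, x_max, kappa):
    """Toy 1-D example: x_dot = v, h(x) = x_max - x, so h_dot = -v."""
    # CBF condition: -v + kappa * (x_max - x) >= 0, i.e. g = [-1], b = kappa * h(x).
    return cbf_qp_single_constraint([v_nominal], g=[-1.0], b=kappa * (x_max - x))[0]


if __name__ == "__main__":
    print(filter_velocity_command(x=0.9, v_nominal=1.0, x_max=1.0, kappa=2.0))
    print(filter_velocity_command(x=0.9, v_nominal=0.1, x_max=1.0, kappa=2.0))
```

In the toy example the filter caps the approach speed at $\kappa (x_{\max} - x)$, so the allowed velocity shrinks to zero as the state reaches the boundary, while commands that already respect the cap pass through unchanged.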

Shielded RL combines the learned policy with a classical controller that takes over when safety-critical conditions are predicted. Unlike CBF-QP, which modifies individual actions, a shield performs predictive forward trajectory simulation: at each timestep, it rolls the proposed policy forward for a short horizon of $H$ steps using a dynamics model, and checks whether any predicted state along the rollout comes within a margin of the unsafe set:

$$\text{engage shield} \;\Leftrightarrow\; \exists\, k \in \{1, \ldots, H\} : h(s_{t+k}) < \delta$$

where $h(s)$ is the safety function, $s_{t+k}$ is the predicted state under the proposed policy, and $\delta > 0$ is a safety margin, so the shield triggers while the predicted state is still inside the safe set $\{s : h(s) \geq 0\}$. If any predicted state meets this condition, the shield engages before a violation occurs, executing a pre-programmed safe recovery behavior (stop, reverse, or return to a home configuration) and resuming learned policy control once the predicted rollout is again safe. This predictive structure is the key difference from reactive safety filters: the shield catches near-misses before they happen rather than correcting them as they occur. The horizon $H$ is the primary design parameter; too short a horizon misses slow-developing violations, while too long a horizon causes excessively conservative shielding on trajectories that would self-correct.
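The rollout check can be sketched as follows. The dynamics model, safety function, and policy are passed in as plain callables, and the trigger level on $h$ is an explicit `threshold` argument so the margin convention stays visible. The toy 1-D system, the names, and the "stop" recovery action are assumptions for illustration only.

```python
def shield_engages(state, policy, dynamics, h, horizon, threshold):
    """Roll the policy forward `horizon` steps; True if any predicted h(s) < threshold."""
    s = state
    for _ in range(horizon):
        s = dynamics(s, policy(s))
        if h(s) < threshold:
            return True
    return False


def shielded_action(state, policy, dynamics, h, horizon, threshold, recovery_action):
    """Return the recovery action if the shield engages, else the policy's action."""
    if shield_engages(state, policy, dynamics, h, horizon, threshold):
        return recovery_action
    return policy(state)


# Toy 1-D system (assumed): position x, velocity command v, wall at X_MAX.
DT = 0.1
X_MAX = 1.0


def toy_dynamics(x, v):
    return x + DT * v


def toy_h(x):
    return X_MAX - x


def eager_policy(x):
    return 1.0  # always drives toward the wall


if __name__ == "__main__":
    for horizon in (3, 5):
        action = shielded_action(0.5, eager_policy, toy_dynamics, toy_h,
                                 horizon=horizon, threshold=0.05, recovery_action=0.0)
        print(f"horizon={horizon}: executed action {action}")
```

With the short horizon the rollout never gets close to the wall and the learned action passes through; with the longer horizon the predicted approach is detected and the stop action is executed instead, which illustrates the horizon trade-off described above.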


Action space design

The choice of action representation has a larger effect on learning efficiency than the choice of RL algorithm in most robotic applications. The action space determines what the policy must learn, how directly the action affects the task objective, and how much prior knowledge about robot kinematics and dynamics can be injected into the learning system.

Joint space actions (torques or positions) provide maximum expressive control but require the policy to learn the full inverse kinematics and dynamics of the robot. A policy operating in joint space must learn that moving a cup on a table requires coordinating 6–7 joints in a specific coupled manner — a mapping that takes many samples to learn from scratch. Joint-space torque control is the most expressive and physically faithful representation but is rarely used directly in end-to-end robot learning.

End-effector Cartesian commands specify the desired position, orientation, or velocity of the end-effector, with an inverse kinematics solver converting these to joint commands. This abstraction reduces the effective dimensionality of the policy's task from joint coordination to Cartesian manipulation geometry, which is much more natural for contact and grasping tasks. The policy must still learn the geometry of manipulation, but it is freed from learning the robot-specific coupling between joints. Most modern robot learning policies operate in this space.

Delta actions represent the desired change from the current end-effector pose rather than the absolute target: $a_t = \Delta \text{pose}_t = \text{pose}_{t+1} - \text{pose}_t$. This parameterization typically trains better than absolute targets: the "do nothing" action corresponds to zero, so a policy initialized with small outputs starts near the identity mapping (hold the current pose), which is easier to initialize around and optimize from than arbitrary absolute poses.
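As a small illustration of how a delta action becomes a command, the sketch below converts a positional delta into an absolute target for a downstream controller. It handles position only (orientation deltas need rotation composition rather than addition), and the per-step bound `max_step`, the box-shaped workspace limits, and the function name are assumptions for the example.

```python
def apply_delta_action(current_position, delta, max_step, lower, upper):
    """Turn a policy's positional delta action into an absolute position target."""
    target = []
    for p, d, lo, hi in zip(current_position, delta, lower, upper):
        d = max(-max_step, min(max_step, d))    # bound the per-step motion
        target.append(max(lo, min(hi, p + d)))  # keep the target inside the workspace box
    return target


if __name__ == "__main__":
    position = [0.40, 0.00, 0.20]
    lower, upper = [0.0, -0.5, 0.0], [0.5, 0.5, 0.5]
    # A zero action leaves the target at the current position.
    print(apply_delta_action(position, [0.0, 0.0, 0.0], 0.1, lower, upper))
    # The second component exceeds max_step and is bounded to -0.1.
    print(apply_delta_action(position, [0.05, -0.2, 0.0], 0.1, lower, upper))
```

The zero action mapping to "stay where you are" is what makes a freshly initialized policy with near-zero outputs behave benignly.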

Skill-based or options-space actions operate at a higher level of abstraction, where each action corresponds to a parameterized skill primitive (approach, grasp, push, release) or a learned sub-policy from a library. In the options framework, a high-level policy selects which option to execute and for how long, while low-level controllers carry out the option. This hierarchical structure dramatically reduces the effective horizon for the high-level policy and allows compositional manipulation behaviors to be learned with many fewer samples than a flat policy.


Reward design and shaping

The choice of action space is not independent of reward design: it directly restricts which reward terms are computable and how those terms interact with the policy gradient. A joint-space torque control policy allows the shaped reward to directly penalize joint torques as an energy or effort proxy: $r_{\text{effort}} = -\|\tau\|^2$. This term is only directly available if the policy produces torques itself; if the policy outputs Cartesian delta commands that are tracked by a low-level controller, the torques are an internal signal not exposed to the RL layer, and penalizing them requires computing inverse dynamics through the full robot model, an additional step that adds computation and model error to the reward. Conversely, a task-space Cartesian policy makes it straightforward to penalize end-effector Cartesian jerk $\|\dddot{x}_{ee}\|^2$ for smooth motion, while a joint-space policy would need to map its joint trajectory through forward kinematics (FK) to evaluate it. These interaction effects between action space and reward structure are a frequently underappreciated source of difficulty in reward engineering.

Reward specification in robotics is simultaneously critical and difficult. The reward signal must accurately capture the task objective, provide sufficient learning signal to guide exploration, and avoid Goodhart's Law failure modes where the policy achieves high reward through unintended behaviors.

Sparse rewards (binary success at episode end) are the most faithful to the true task objective but provide almost no learning signal in early training: a policy that has never successfully grasped an object receives a constant reward of zero regardless of how close it came to success during the episode, making gradient-based updates uninformative.

Shaped rewards add intermediate feedback by defining reward components that correlate with task progress:

$$r_{\text{shaped}}(s, a) = r_{\text{task}}(s) + \lambda_1 r_{\text{proximity}}(s) + \lambda_2 r_{\text{smoothness}}(a) + \lambda_3 r_{\text{effort}}(a)$$

where $r_{\text{proximity}}$ rewards decreasing distance to the goal, $r_{\text{smoothness}}$ penalizes jerky joint motions, and $r_{\text{effort}}$ penalizes unnecessary torque. The potential-based reward shaping theorem of Ng et al. (1999) guarantees that a shaped reward of the form $r_{\text{shaped}} = r + \gamma \Phi(s') - \Phi(s)$, where $\Phi$ is any real-valued potential function over states, does not change the optimal policy of the original MDP. Non-potential-based shaping (arbitrary additions to the reward, including action-dependent smoothness and effort terms) can change the optimal policy and can produce unintended behavior even when the shaping terms seem reasonable.
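The reason potential-based shaping is harmless can be seen numerically: along any trajectory the shaping terms telescope, so the shaped return differs from the original return only by $\gamma^{T}\Phi(s_T) - \Phi(s_0)$, which does not depend on the actions taken in between. The sketch below checks this identity on a made-up 1-D trajectory; the potential $\Phi(s) = -|s|$, the trajectory, and the function names are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def shaped_rewards(rewards, states, potential, gamma):
    """r'_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t); states has len(rewards) + 1 entries."""
    return [r + gamma * potential(states[t + 1]) - potential(states[t])
            for t, r in enumerate(rewards)]


def negative_distance(s):
    """Assumed potential: negative distance to a goal at the origin."""
    return -abs(s)


if __name__ == "__main__":
    gamma = 0.9
    states = [1.0, 0.6, 0.3, 0.0]  # hypothetical distances to the goal
    rewards = [0.0, 0.0, 1.0]      # sparse task reward on reaching the goal
    original = discounted_return(rewards, gamma)
    shaped = discounted_return(shaped_rewards(rewards, states, negative_distance, gamma), gamma)
    horizon = len(rewards)
    boundary = (gamma ** horizon) * negative_distance(states[-1]) - negative_distance(states[0])
    print(original, shaped, original + boundary)
```

The last two printed values agree up to floating-point rounding. A shaping term that depends on the action, or on the state in a way that cannot be written as a difference of potentials, has no such cancellation and can therefore reorder policies.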

Reward hacking in robotics often manifests as the policy exploiting simulator imprecision: sliding objects rather than lifting them to minimize effort while maximizing proximity rewards, vibrating the end-effector to generate false contact signals, or holding the arm stationary in a pose that maximizes proximity reward without completing the grasp. Catching these behaviors requires rigorous success-criterion evaluation that is independent of the training reward.


Curriculum learning

Learning a complex manipulation task from a random policy initialization is typically infeasible: the probability of reaching the goal from random actions is near zero, the resulting zero-reward trajectories provide no useful gradient signal, and the policy cannot improve. Curriculum learning addresses this by sequencing a progression of tasks from easy to hard, ensuring that each stage of training provides useful signal for the next.

A curriculum can vary difficulty along multiple axes: initial state distributions (start with the object already in the robot's hand before training pick-and-place), tolerances (require coarser position accuracy early in training, then tighten), horizon lengths (train on short-horizon tasks that require fewer sequential decisions), or physics complexity (grasping under simplified contact before contact-rich assembly).

Automatic curriculum generation methods, including POET (Paired Open-Ended Trailblazer) and approaches built on universal value functions, adaptively adjust task difficulty based on the agent's current performance, keeping the task in the "zone of proximal development" where it is challenging but solvable. This avoids both the trivial-task plateau (too easy to provide useful signal) and the impossible-task plateau (too hard for any gradient signal).
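The idea of adapting difficulty to current performance can be illustrated with a very simple rule: promote to a harder level when the recent success rate is high and demote when it is low. The sketch below is a generic threshold scheduler written for this purpose. It is not POET or any other named method, and the window size, thresholds, and class name are assumptions.

```python
from collections import deque


class SuccessRateCurriculum:
    """Move between difficulty levels based on a sliding-window success rate."""

    def __init__(self, num_levels, window=20, promote_at=0.8, demote_at=0.2):
        self.level = 0
        self.num_levels = num_levels
        self.promote_at = promote_at
        self.demote_at = demote_at
        self._results = deque(maxlen=window)

    def record(self, success: bool) -> int:
        """Record one episode outcome and return the level for the next episode."""
        self._results.append(1.0 if success else 0.0)
        if len(self._results) == self._results.maxlen:
            rate = sum(self._results) / len(self._results)
            if rate >= self.promote_at and self.level < self.num_levels - 1:
                self.level += 1        # task has become too easy
                self._results.clear()  # restart statistics at the new level
            elif rate <= self.demote_at and self.level > 0:
                self.level -= 1        # task is currently too hard
                self._results.clear()
        return self.level


if __name__ == "__main__":
    curriculum = SuccessRateCurriculum(num_levels=3, window=4)
    outcomes = [True] * 4 + [False] * 4 + [True] * 8
    print([curriculum.record(o) for o in outcomes])
```

Each level would map to one of the difficulty axes listed above, for example a wider initial state distribution or a tighter success tolerance.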


GenAI context: comparing RL regimes

The contrast between RL in robotics and RL in language model training illuminates which constraints are intrinsic to sequential decision-making and which are domain-specific.

| Dimension | Robot RL | Language model RL (RLHF/GRPO) |
|---|---|---|
| Dominant algorithm | Off-policy SAC/TD3 | On-policy PPO/GRPO |
| Sample cost | Wall-clock expensive (hardware) | Token-compute expensive (but parallelizable) |
| Action space | Continuous $\mathbb{R}^n$ | Discrete (vocabulary) |
| Safety | Hard physical constraints (CBF-QP) | KL constraint from reference model |
| Reward | Shaped dense or sparse task reward | Human preference or rule-based verifier |
| Curriculum | Task complexity and initial conditions | Difficulty of reasoning problems |

The preference for on-policy RL in LLM training reverses the robot RL preference for off-policy RL because token generation is cheap (LLMs can produce thousands of rollouts per minute), so the sample efficiency advantage of off-policy methods is less decisive than their instability cost. The on-policy PPO training that is impractical for physical robots becomes the default choice when data generation is fast and cheap.


Key takeaways

  • Robotic RL operates under hard constraints (sample efficiency, physical safety, and difficult resets) that dominate algorithm selection.
  • Off-policy methods, especially SAC and TD3, are standard because they reuse all collected experience through replay buffers.
  • SAC's maximum entropy objective provides automatic exploration through policy entropy regularization, removing the need for separate noise injection.
  • Safety is enforced at multiple levels: hardware workspace limits, CBF-QP filters at the policy output, and shielded controllers for emergency recovery.
  • Action space design has large effects on learning efficiency; end-effector Cartesian delta commands are the most common choice for manipulation.
  • Potential-based reward shaping is theoretically principled; non-potential shaping can introduce unintended optimal policies.
  • Curriculum learning is essential for complex tasks where random initialization cannot reach the goal.
  • The RL choices made for robotics and for language models differ systematically because the cost of data generation differs by orders of magnitude between domains.


Conceptual questions

  1. A team attempts to train a peg-insertion policy using SAC directly on physical hardware with a sparse reward (success = 1 when the peg is inserted, 0 otherwise). After 10,000 hardware interactions over two days, the policy shows no improvement. Diagnose this failure using the concepts of sample efficiency and sparse reward. Propose a complete modified experimental design, including reward shaping, curriculum, and replay buffer strategy, that would make learning tractable, and justify each component's role.

  2. The CBF-QP safety filter modifies the policy's proposed action to satisfy $\dot{h} + \kappa h \geq 0$. During RL training, the policy consistently proposes actions that are slightly outside the safe set, and the CBF-QP correction is small (< 5% change in action norm). Analyze the effect of this persistent correction on the RL training dynamics. Does the policy converge to the CBF-constrained optimal policy? Under what conditions would the CBF correction interfere with convergence, and how would you detect this?

  3. An engineer must choose between joint-space torque control and end-effector Cartesian delta control for training a bimanual assembly task where both arms must coordinate to simultaneously insert two pegs. Analyze the tradeoffs specifically for bimanual coordination: which action space makes the inter-arm coupling easier or harder to learn? Does the answer change if a high-quality analytical inverse kinematics solution is available? If the task requires contact forces to be controlled explicitly?

  4. A robot RL policy is trained with a shaped reward $r = r_{\text{success}} + 0.1 \cdot r_{\text{proximity}} + 0.05 \cdot r_{\text{smoothness}}$. After training, the policy achieves 90% success rate in simulation but only 30% in physical deployment. Post-hoc analysis shows the policy learned to exploit a simulation artifact where the proximity reward can be maximized by a specific jerky motion that is physically impossible on the real robot. Identify which reward shaping principle was violated, explain how the Ng et al. theorem fails to protect against this failure mode, and redesign the reward.

  5. SAC optimizes the maximum entropy objective with temperature $\alpha$. Compare the behavior of SAC with very large $\alpha$ (high entropy target) versus very small $\alpha$ (low entropy target) during the early training phase of a contact-rich grasping task. Specifically: how does each setting affect the exploration-exploitation tradeoff, the rate of unsafe actions near joint limits, and the eventual convergence behavior? Use this analysis to argue for a specific $\alpha$ schedule over the course of training.


✦Solutions
  1. SAC sparse reward on hardware. 10,000 interactions is far too few for sparse-reward RL — random exploration essentially never triggers success, so there is no gradient, and hardware is sample-limited. Redesign: a dense shaped reward (distance-to-hole plus alignment), a curriculum (start with the peg near/above the hole and widen), demonstrations seeding the replay buffer (or offline pretraining / residual RL on a scripted controller), and goal relabeling (HER). Each supplies learning signal or shrinks the exploration burden.
  2. Persistent small CBF correction. Because the executed (clipped) action differs from the proposed one, you must train on the executed actions to converge to the CBF-constrained optimum — training on proposed actions never lets the policy experience the true consequences, so it won't converge. Interference appears when corrections are large or frequent and credit is assigned to the wrong action; detect it by monitoring proposed-vs-executed action divergence and reward plateaus while corrections persist.
  3. Action space for bimanual. Cartesian end-effector deltas make inter-arm spatial coordination easier (the task lives in relative-pose space) but hide dynamics; joint-torque control gives full authority including forces but forces the policy to learn coordination and IK implicitly. A good analytical IK makes Cartesian even more attractive. If contact forces must be controlled explicitly, the answer flips toward torque/impedance control, since position deltas cannot regulate force.
  4. Reward hacking. The violated principle is that shaping should be potential-based to preserve the optimal policy — but Ng et al. only guarantees policy invariance under correct dynamics; it cannot prevent exploiting a simulator artifact, because the jerky motion is genuinely optimal in the wrong sim dynamics. Redesign: make the proximity term potential-based and physically grounded, penalize jerk/torque, and add domain randomization so the exploit is not robust across dynamics.
  5. SAC temperature $\alpha$. Large $\alpha$ drives high entropy: lots of exploration but more unsafe actions near joint limits early and slower, less precise convergence; small $\alpha$ is greedy early, under-explores contact, but makes fewer unsafe excursions. This argues for a schedule (or automatic entropy tuning with a decreasing target): start high to explore, ideally behind a CBF, then anneal to exploit and refine precision.

Looking ahead

Most practical robot RL training does not occur on physical hardware but in simulation, with the trained policy subsequently transferred to the real world. The validity of this approach depends entirely on how faithfully the simulator captures the physical dynamics the robot will encounter.

Week 7: Sim2Real Pipelines and IsaacLab. We examine how modern GPU-accelerated simulation stacks (Isaac Sim, IsaacLab) are structured, why the sim2real gap is fundamentally difficult to close, and how domain randomization combined with system identification produces policies that transfer robustly to physical hardware.


Further reading

  • Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement Learning in Robotics: A Survey. IJRR. (Classic overview of RL challenges on hardware).
  • Hwangbo, J., et al. (2019). Learning Agile and Dynamic Motor Skills for Legged Robots. Science Robotics. (Pioneering work on using RL for continuous legged locomotion).