Week 6: Reinforcement Learning for Robotics

Purpose of this lecture#

Imitation learning gives a robot competent initial behavior. What it cannot provide is the ability to optimize long-term outcomes, recover from states outside the demonstration distribution, or discover strategies that improve upon what any human demonstrator would exhibit. Reinforcement learning is the natural complement to imitation: starting from an imitation-learned initialization, RL explores the state-action space and updates the policy toward trajectories of higher cumulative reward, potentially exceeding the performance of the expert demonstrations.

In practice, applying RL to physical robots is substantially more constrained than applying RL to software environments. Sample efficiency, safety, continuous action spaces, and the difficulty of resetting physical environments impose design requirements that dominate algorithm selection. This lecture analyzes how RL is actually deployed in robotics: which algorithms are used, why, and how reward design, action space structure, and curriculum scheduling are chosen to make learning tractable under real-world constraints.

The physical robot RL setting#

Robotic RL differs from the canonical Atari or MuJoCo benchmark setting in several ways that are not merely cosmetic.

Data is expensive in wall-clock time. A simulation-based RL experiment can collect tens of millions of environment steps in hours by running many environments in parallel. A physical robot collects useful data at roughly 3–10 Hz (accounting for resets, safety monitoring, and hardware delays), or about 10,000–36,000 transitions per hour under favorable conditions. An experiment that consumes $10^7$ transitions on a simulated benchmark would therefore require roughly 280 to 1,000 hours, that is, weeks of continuous operation on a single physical robot, and considerably longer once downtime and maintenance are counted. This makes sample efficiency not a preference but a hard requirement.
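The arithmetic behind these figures can be checked directly. The sketch below is a rough illustration, assuming a single robot collecting continuously at the quoted 3–10 Hz with no downtime; `hours_needed` is a hypothetical helper name.

```python
# Back-of-the-envelope data budget for physical-robot RL.
# Assumption: one robot, sustained collection rate, no downtime.

SECONDS_PER_HOUR = 3600
TARGET_TRANSITIONS = 10**7  # the simulated-benchmark budget quoted in the text


def hours_needed(transitions: int, rate_hz: float) -> float:
    """Wall-clock hours to collect `transitions` at a sustained `rate_hz`."""
    return transitions / (rate_hz * SECONDS_PER_HOUR)


for rate_hz in (3.0, 10.0):
    per_hour = rate_hz * SECONDS_PER_HOUR
    hours = hours_needed(TARGET_TRANSITIONS, rate_hz)
    print(f"{rate_hz:4.0f} Hz: {per_hour:7.0f} transitions/h, "
          f"{hours:6.0f} h = {hours / 24:5.1f} days of nonstop operation")
```

At these rates the budget works out to several weeks of uninterrupted collection, which is why algorithms that discard data are hard to justify on hardware.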

Unsafe exploration damages hardware and people. In an Atari game, the worst outcome of a random action is losing points. On a physical robot, a random high-torque action can snap a joint, throw an object at a bystander, or drive the arm into a singularity. The standard RL exploration strategies ($\varepsilon$-greedy, additive Gaussian noise on actions, random network distillation) are all potentially unsafe on physical hardware and must be constrained or replaced.

Environments are difficult to reset. Episodic RL assumes that the environment can be reset to an initial state distribution between episodes. On a physical robot, resetting means manually repositioning objects, returning the arm to a home configuration, and potentially recalibrating sensors — a process that can take minutes per episode and dominates the total experiment time for tasks with short episodes.

Action spaces are continuous and high-dimensional. Physical robots operate in $\mathbb{R}^n$ for $n$ typically ranging from 6 to 30 (joint positions, velocities, or torques). The structure of this space — joint limits, kinematic constraints, dynamics coupling — must be reflected in the action space representation, which has no direct counterpart in discrete action environments.

Off-policy methods as the dominant paradigm#

Why off-policy#

The sample efficiency imperative drives robot RL toward off-policy algorithms that can reuse data from previous interactions regardless of which policy collected them. On-policy algorithms (PPO, TRPO, A2C) require that all training data be collected from the current policy, making earlier data unusable once the policy updates. For a robot that collects at most tens of thousands of transitions per hour, discarding any experience is unacceptable.

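Off-policy methods maintain a replay buffer $\mathcal{B}$ that stores past transitions $(s_t, a_t, r_t, s_{t+1})$ and samples mini-batches uniformly (or with prioritization) for training. The policy can be updated continuously using data collected by arbitrarily old policies, provided the update rule remains valid for off-policy data. One-step Q-learning-style targets, as used by SAC and TD3, satisfy this without importance weighting because the target depends only on the stored transition and the current networks.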
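A minimal sketch of such a buffer, using only the standard library, is shown below. The class and field names are hypothetical, the `done` flag is an addition not present in the tuple above, and the fixed capacity is an assumption: a bounded buffer evicts the oldest data rather than keeping every transition forever.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple, Sequence


class Transition(NamedTuple):
    state: Sequence[float]
    action: Sequence[float]
    reward: float
    next_state: Sequence[float]
    done: bool  # assumption: terminal flag, added for bootstrapping control


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform sampling."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._storage: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        # When the deque is full, appending evicts the oldest transition.
        self._storage.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement within one mini-batch.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


# Usage: five transitions into a buffer of capacity three.
buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(Transition([float(i)], [0.0], 0.0, [float(i + 1)], False))
batch = buffer.sample(2)
print(len(buffer), [t.state for t in batch])
```

Prioritized sampling would replace the uniform draw in `sample` with a draw weighted by, for example, TD error; that variant is not shown here.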

SAC: the standard off-policy algorithm for continuous robotics#

Soft Actor-Critic (SAC; Haarnoja et al., 2018) is among the most widely deployed RL algorithms for learning directly on robot hardware. SAC maximizes a maximum entropy objective:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t))\right]$$

where $H(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s_t)]$ is the policy entropy and $\alpha$ is a temperature parameter. The entropy term rewards the policy for maintaining diverse action distributions, which provides automatic exploration without requiring separate noise injection, and prevents premature convergence to a narrow local optimum.

SAC maintains twin soft Q-functions $Q_{\phi_1}(s, a)$ and $Q_{\phi_2}(s, a)$ to reduce overestimation bias — the same double-Q trick introduced in TD3. The soft Bellman target for policy evaluation is:

$$y = r + \gamma \left(\min_{j=1,2} Q_{\bar{\phi}_j}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s')\right), \quad \tilde{a}' \sim \pi_\theta(\cdot \mid s')$$

where $\bar{\phi}$ denotes the target network parameters updated by exponential moving average. The policy is updated to maximize the soft Q-function:

$$\max_\theta \mathbb{E}_{s \sim \mathcal{B},\, a \sim \pi_\theta}\!\left[\min_{j=1,2} Q_{\phi_j}(s, a) - \alpha \log \pi_\theta(a \mid s)\right]$$

In practice, SAC's automatic temperature tuning — adjusting $\alpha$ to maintain a target entropy $\mathcal{H}_{\text{target}}$ — removes the need for manual exploration tuning, which is a significant practical advantage in robot learning where re-running experiments to tune hyperparameters is expensive.
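The soft Bellman target and the temperature adjustment can be sketched for scalar inputs as follows. This is an illustration rather than a full SAC implementation: the function names are hypothetical, the `done` masking is an added assumption, and the temperature loss $J(\alpha) = \mathbb{E}[-\alpha(\log \pi(a \mid s) + \mathcal{H}_{\text{target}})]$ is the commonly used form, assumed here because the text does not state it.

```python
import math
from typing import Sequence


def soft_bellman_target(reward: float, done: bool, q1_next: float,
                        q2_next: float, log_prob_next: float,
                        alpha: float, gamma: float = 0.99) -> float:
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')) for one transition."""
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    # Assumption: no bootstrapping from terminal states.
    return reward + (0.0 if done else gamma * soft_value)


def temperature_step(log_alpha: float, log_probs: Sequence[float],
                     target_entropy: float, lr: float = 3e-4) -> float:
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi + H_target)].

    alpha is parameterized as exp(log_alpha) so that it stays positive.
    """
    alpha = math.exp(log_alpha)
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad_log_alpha = -alpha * mean_gap  # dJ / d(log_alpha)
    return log_alpha - lr * grad_log_alpha


# Usage with made-up numbers.
y = soft_bellman_target(1.0, False, q1_next=2.0, q2_next=3.0,
                        log_prob_next=-1.0, alpha=0.5, gamma=0.9)
# Entropy estimate 0.5 is below the target 1.0, so alpha should increase.
new_log_alpha = temperature_step(0.0, [-0.5, -0.5], target_entropy=1.0, lr=0.1)
print(y, new_log_alpha)
```

When the sampled entropy $-\mathbb{E}[\log \pi]$ falls below the target, the step raises $\alpha$ and the entropy bonus grows; when it exceeds the target, $\alpha$ shrinks.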

TD3: mathematical formulation#

Twin Delayed Deep Deterministic Policy Gradient (TD3; Fujimoto et al., 2018) addresses the overestimation and instability problems of DDPG through three mathematically precise modifications.

Double Q-networks: TD3 maintains two critic networks $Q_{\phi_1}$ and $Q_{\phi_2}$ and uses the minimum of their predictions as the Bellman target, reducing overestimation bias that causes DDPG to diverge:

$$y = r + \gamma \min_{j=1,2} Q_{\bar{\phi}_j}(s', \tilde{a}')$$

Target policy smoothing: the target action $\tilde{a}'$ used to compute the Bellman target is obtained by adding Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$, clipped to $[-c, c]$, to the target policy's deterministic output and then clipping the result to the valid action range:

$$\tilde{a}' = \text{clip}\!\left(\pi_{\bar{\theta}}(s') + \text{clip}(\epsilon, -c, c),\; a_{\text{low}},\; a_{\text{high}}\right)$$

where $\bar{\theta}$ denotes the target actor parameters. This smoothing regularizes the Q-function by making the target less sensitive to narrow peaks in the critic's value surface. DDPG exploits such peaks catastrophically: its gradients push the policy toward actions whose value the Q-function has overestimated near a single point. The clipping of $\epsilon$ to $[-c, c]$ (typically $c = 0.5$) prevents the noise from pushing the target action so far off-policy that the Bellman target becomes meaningless.

Delayed policy updates: TD3 updates the actor network only every $d$ critic gradient steps (typically $d = 2$), allowing the critic to partially converge before the actor uses it to compute policy gradients. Because the actor's gradient $\nabla_\theta J = \mathbb{E}\left[\nabla_a Q_{\phi_1}(s,a)|_{a=\pi_\theta(s)} \cdot \nabla_\theta \pi_\theta(s)\right]$ is computed by differentiating through the critic, an inaccurate critic produces misleading actor gradients. Delayed updates dampen the feedback loop between unstable critics and erratic actor updates.

TD3 is often competitive with SAC on continuous control benchmarks and is preferred in robotics settings where stochastic policies are undesirable — for instance, when the policy must be deterministic for safety certification, or when action noise is explicitly managed by a higher-level safety filter rather than the policy itself.
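The three modifications can be sketched without any deep learning framework by treating the target actor and target critics as plain callables. All names here are hypothetical, the `done` masking is an added assumption, and the default `sigma = 0.2` is a commonly used value that the text does not specify.

```python
import random
from typing import Callable, List

Vector = List[float]


def clip(x: float, low: float, high: float) -> float:
    return max(low, min(high, x))


def td3_target(reward: float, next_state: Vector, done: bool,
               target_actor: Callable[[Vector], Vector],
               target_q1: Callable[[Vector, Vector], float],
               target_q2: Callable[[Vector, Vector], float],
               rng: random.Random, gamma: float = 0.99,
               sigma: float = 0.2, noise_clip: float = 0.5,
               a_low: float = -1.0, a_high: float = 1.0) -> float:
    # Target policy smoothing: clipped noise, then clip to the action range.
    next_action = [
        clip(a + clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip),
             a_low, a_high)
        for a in target_actor(next_state)
    ]
    # Clipped double-Q: take the smaller of the two target critics.
    q_next = min(target_q1(next_state, next_action),
                 target_q2(next_state, next_action))
    # Assumption: no bootstrapping from terminal states.
    return reward + (0.0 if done else gamma * q_next)


def actor_update_due(critic_step: int, policy_delay: int = 2) -> bool:
    """Delayed policy updates: act only on every `policy_delay`-th critic step."""
    return critic_step % policy_delay == 0


# Usage with toy stand-ins for the networks (noise disabled via sigma=0).
rng = random.Random(0)
actor = lambda s: [2.0]              # out of range, will be clipped to 1.0
q1 = lambda s, a: a[0] + 1.0
q2 = lambda s, a: a[0] + 3.0
y = td3_target(0.5, [0.0], False, actor, q1, q2, rng, gamma=0.9, sigma=0.0)
actor_steps = [k for k in range(1, 7) if actor_update_due(k)]
print(y, actor_steps)
```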

Safety constraints in robotic RL#

Physical safety during RL training and deployment is enforced at multiple levels, with different mechanisms addressing different timescales and failure modes.

Workspace limits define the set of joint positions, velocities, and torques that are physically safe and hardware-preserving. These are typically implemented as hard constraints in the robot's low-level controller, which clips or rejects commands that violate the limits before they reach the actuators. From the RL algorithm's perspective, the executed action is therefore a clipped version of what the policy proposes whenever a limit is active.

Control Barrier Functions (CBFs) provide a formal safety filter at the policy output level. A CBF $h(s): \mathcal{S} \to \mathbb{R}$ defines a safe set $\mathcal{C} = \{s : h(s) \geq 0\}$ and the safety filter solves a quadratic program (CBF-QP) at each timestep to find the nearest safe action to what the policy proposes:

$$a_{\text{safe}} = \arg\min_{a} \| a - \pi_\theta(s) \|^2 \quad \text{s.t.} \quad \dot{h}(s, a) + \kappa h(s) \geq 0$$

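The CBF constraint $\dot{h} + \kappa h \geq 0$, with gain $\kappa > 0$ and $\dot{h}(s, a)$ the time derivative of $h$ along the dynamics under action $a$, ensures forward invariance of the safe set: if the robot is currently safe ($h(s) \geq 0$), applying $a_{\text{safe}}$ keeps $h$ from crossing zero, so the state stays in $\mathcal{C}$. The guarantee is stated in continuous time and assumes an accurate dynamics model; with discrete control steps it holds only approximately. For control-affine dynamics the constraint is linear in $a$, which is what makes the problem a quadratic program. This filter can be applied to any learned policy without modifying the learning algorithm, which is particularly valuable in early training when the policy is far from safe.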
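For a single constraint and no actuator limits, the CBF-QP has a closed-form solution: project the proposed action onto a half-space. The sketch below assumes control-affine dynamics, so that $\dot{h}(s, a) + \kappa h(s)$ can be written as $g \cdot a + b$; the names `g`, `b`, and `cbf_filter` are hypothetical, and a general problem with several constraints would need a QP solver instead.

```python
from typing import List, Sequence


def cbf_filter(a_nominal: Sequence[float], g: Sequence[float],
               b: float) -> List[float]:
    """Closest action (Euclidean norm) to a_nominal with g . a + b >= 0."""
    slack = sum(gi * ai for gi, ai in zip(g, a_nominal)) + b
    if slack >= 0.0:
        return list(a_nominal)  # already safe: leave the action untouched
    g_norm_sq = sum(gi * gi for gi in g)
    if g_norm_sq == 0.0:
        raise ValueError("infeasible: the action cannot influence h")
    # Move along g just far enough to land on the constraint boundary.
    scale = -slack / g_norm_sq
    return [ai + scale * gi for ai, gi in zip(a_nominal, g)]


# Toy example (assumed, not from the text): a velocity-controlled joint with
# x_dot = a and an upper limit x_max, using h(x) = x_max - x.
# Then h_dot = -a, so the constraint is -a + kappa * (x_max - x) >= 0.
x, x_max, kappa = 0.9, 1.0, 2.0
g = [-1.0]
b = kappa * (x_max - x)

a_fast = cbf_filter([1.0], g, b)   # proposed speed too high near the limit
a_slow = cbf_filter([0.1], g, b)   # already satisfies the constraint
print(a_fast, a_slow)
```

In this example the filter caps the approach speed at $\kappa (x_{\max} - x)$, so the allowed speed shrinks as the joint nears its limit.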

Shielded RL combines the learned policy with a classical controller that takes over when safety-critical conditions are predicted. Unlike CBF-QP, which modifies individual actions, a shield performs predictive forward trajectory simulation: at each timestep, it rolls the proposed policy action forward for a short horizon $H$ steps using a dynamics model, and checks whether any predicted state along the rollout enters the unsafe set:

$$\text{engage shield} \;\Leftrightarrow\; \exists\, k \in \{1, \ldots, H\} : h(s_{t+k}) < \delta$$

where $h(s)$ is the safety function, $s_{t+k}$ is the predicted state under the proposed policy, and $\delta > 0$ is a safety margin. If any predicted state violates safety, the shield engages before that violation occurs, executing a pre-programmed safe recovery behavior (stop, reverse, or return to a home configuration) and resuming learned policy control once the predicted rollout is again safe. This predictive structure is the key difference from reactive safety filters: the shield catches violations before they happen rather than correcting them as they occur. The horizon $H$ is the primary design parameter: too short a horizon misses slow-developing violations, while too long a horizon makes the shield excessively conservative, triggering on trajectories that would self-correct, and compounds dynamics-model error over the rollout.
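The rollout check can be sketched as follows. The dynamics model, policy, recovery behavior, and safety function are passed in as callables, all function names are hypothetical, and `threshold` stands in for the margin term that the engage condition compares $h$ against.

```python
from typing import Callable, List

State = List[float]
Action = List[float]


def shield_engages(state: State, proposed: Action,
                   policy: Callable[[State], Action],
                   dynamics: Callable[[State, Action], State],
                   h: Callable[[State], float],
                   horizon: int, threshold: float) -> bool:
    """Roll the model forward; report whether any predicted state is unsafe."""
    s, a = state, proposed
    for _ in range(horizon):
        s = dynamics(s, a)
        if h(s) < threshold:
            return True
        a = policy(s)  # later steps follow the learned policy
    return False


def shielded_action(state: State, policy: Callable[[State], Action],
                    recovery: Callable[[State], Action],
                    dynamics: Callable[[State, Action], State],
                    h: Callable[[State], float],
                    horizon: int, threshold: float) -> Action:
    proposed = policy(state)
    if shield_engages(state, proposed, policy, dynamics, h, horizon, threshold):
        return recovery(state)
    return proposed


# Toy 1-D example (assumed): the policy always pushes toward a wall at x = 1.
policy = lambda s: [1.0]
recovery = lambda s: [0.0]                      # "stop"
dynamics = lambda s, a: [s[0] + 0.1 * a[0]]     # simple integrator model
h = lambda s: 1.0 - s[0]

far = shielded_action([0.0], policy, recovery, dynamics, h, 5, 0.05)
near = shielded_action([0.7], policy, recovery, dynamics, h, 5, 0.05)
short = shielded_action([0.7], policy, recovery, dynamics, h, 1, 0.05)
print(far, near, short)
```

The last call illustrates the horizon tradeoff: with a one-step lookahead the same state that triggers the five-step shield is passed through unchanged.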

Action space design#

The choice of action representation has a larger effect on learning efficiency than the choice of RL algorithm in most robotic applications. The action space determines what the policy must learn, how directly the action affects the task objective, and how much prior knowledge about robot kinematics and dynamics can be injected into the learning system.

Joint space actions (torques or positions) provide maximum expressive control but require the policy to learn the full inverse kinematics and dynamics of the robot. A policy operating in joint space must learn that moving a cup on a table requires coordinating 6–7 joints in a specific coupled manner — a mapping that takes many samples to learn from scratch. Joint-space torque control is the most expressive and physically faithful representation but is rarely used directly in end-to-end robot learning.

End-effector Cartesian commands specify the desired position, orientation, or velocity of the end-effector, with an inverse kinematics solver converting these to joint commands. This abstraction reduces the effective dimensionality of the policy's task from joint coordination to Cartesian manipulation geometry, which is much more natural for contact and grasping tasks. The policy must still learn the geometry of manipulation, but it is freed from learning the robot-specific coupling between joints. Most modern robot learning policies operate in this space.

Delta actions represent the desired change from the current end-effector pose rather than the absolute target: $a_t = \Delta \text{pose}_t = \text{pose}_{t+1} - \text{pose}_t$, where $\text{pose}_{t+1}$ is the commanded pose for the next step. This parameterization tends to train better than absolute targets for two reasons: the "do nothing" action corresponds to zero, which a freshly initialized network with small output weights already approximates, and the action range is small and centered, so the same output scale is appropriate everywhere in the workspace.
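A minimal sketch of how a normalized policy output might be turned into a Cartesian position target is given below. It covers position only (orientation deltas need a rotation composition rather than a subtraction), and the function name, the `max_step` scale, and the workspace bounds are all assumptions for illustration.

```python
from typing import List, Sequence


def apply_delta_action(current_pos: Sequence[float],
                       action: Sequence[float],
                       max_step: float,
                       low: Sequence[float],
                       high: Sequence[float]) -> List[float]:
    """Map an action in [-1, 1]^3 to a position target inside the workspace."""
    target = []
    for p, a, lo, hi in zip(current_pos, action, low, high):
        a = max(-1.0, min(1.0, a))            # bound the raw policy output
        step = max_step * a                   # at most max_step per axis
        target.append(max(lo, min(hi, p + step)))  # clip to workspace box
    return target


# Usage: workspace box and step size are made-up values in meters.
low, high = [0.0, -0.5, 0.0], [0.55, 0.5, 0.5]
hold = apply_delta_action([0.5, 0.0, 0.2], [0.0, 0.0, 0.0], 0.1, low, high)
move = apply_delta_action([0.5, 0.0, 0.2], [2.0, -1.0, 0.5], 0.1, low, high)
print(hold, move)
```

The zero action reproduces the current position, and an out-of-range output is bounded twice: once by the action clip and once by the workspace box.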

Skill-based or options-space actions operate at a higher level of abstraction, where each action corresponds to a parameterized skill primitive (approach, grasp, push, release) or a learned sub-policy from a library. In the options framework, a high-level policy selects which option to execute and for how long, while low-level controllers carry out the option. This hierarchical structure dramatically reduces the effective horizon for the high-level policy and allows compositional manipulation behaviors to be learned with many fewer samples than a flat policy.

Reward design and shaping#

The choice of action space is not independent of reward design: it determines which reward terms are directly computable from the policy's own outputs. A joint-space torque control policy allows the shaped reward to directly penalize joint torques as an energy or effort proxy: $r_{\text{effort}} = -\|\tau\|^2$. This term is immediate only if the policy produces torques directly; if the policy outputs Cartesian delta commands that are tracked by a low-level controller, the torques are an internal signal not exposed to the RL layer, and penalizing them requires reading them back from the controller or estimating them through inverse dynamics on the full robot model, an additional step that adds computation and model dependence. Conversely, a task-space Cartesian policy makes it straightforward to penalize end-effector Cartesian jerk $\|\dddot{x}_{ee}\|^2$ for smooth motion, while a joint-space policy would need to map joint trajectories through forward kinematics (FK) to evaluate the same penalty. These interaction effects between action space and reward structure are a frequently underappreciated source of difficulty in reward engineering.
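As an illustration of the Cartesian smoothness term, jerk (the third time derivative of position) can be estimated from the last four end-effector positions with a backward finite difference. This is a sketch under the assumption of a fixed control period `dt`; `jerk_penalty` is a hypothetical name, and real pipelines usually filter the signal first because finite differences amplify sensor noise.

```python
from typing import List, Sequence


def jerk_penalty(positions: Sequence[Sequence[float]], dt: float) -> float:
    """Negative squared norm of the third finite difference of position.

    `positions` is a history of end-effector positions, oldest first.
    Returns 0.0 until four samples are available.
    """
    if len(positions) < 4:
        return 0.0
    p0, p1, p2, p3 = positions[-4:]
    jerk: List[float] = [
        (d - 3.0 * c + 3.0 * b - a) / dt**3
        for a, b, c, d in zip(p0, p1, p2, p3)
    ]
    return -sum(j * j for j in jerk)


# Check on x(t) = t^3, whose third derivative is exactly 6.
dt = 0.5
cubic = [[(k * dt) ** 3] for k in range(4)]
# A constant-velocity path has zero jerk and therefore zero penalty.
straight = [[0.0], [1.0], [2.0], [3.0]]
print(jerk_penalty(cubic, dt), jerk_penalty(straight, 1.0))
```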

Reward specification in robotics is simultaneously critical and difficult. The reward signal must accurately capture the task objective, provide sufficient learning signal to guide exploration, and avoid Goodhart's Law failure modes where the policy achieves high reward through unintended behaviors.

Sparse rewards (binary success at episode end) are the most faithful to the true task objective but provide almost no learning signal in early training: a policy that has never successfully grasped an object receives a constant reward of zero regardless of how much it improved over the course of the episode, making gradient-based updates uninformative.

Shaped rewards add intermediate feedback by defining reward components that correlate with task progress:

$$r_{\text{shaped}}(s, a) = r_{\text{task}}(s) + \lambda_1 r_{\text{proximity}}(s) + \lambda_2 r_{\text{smoothness}}(a) + \lambda_3 r_{\text{effort}}(a)$$

where $r_{\text{proximity}}$ rewards decreasing distance to the goal, $r_{\text{smoothness}}$ penalizes jerky joint motions, and $r_{\text{effort}}$ penalizes unnecessary torque. The potential-based reward shaping theorem of Ng et al. (1999) guarantees that a shaped reward of the form $r' = r + \gamma \Phi(s') - \Phi(s)$, where $\Phi$ is any real-valued potential function over states, does not change the optimal policy of the original MDP. Non-potential-based shaping (arbitrary additions to the reward, which generally includes the weighted sum above) can change the optimal policy and can produce unintended behavior even when the shaping terms seem reasonable.
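The intuition behind the theorem is that potential-based terms telescope: along any trajectory their discounted sum depends only on the first and last potentials, so they cannot favor one route over another. The sketch below checks this numerically for two made-up potential sequences; the function names are hypothetical.

```python
from typing import Sequence


def shaping_return(potentials: Sequence[float], gamma: float) -> float:
    """Discounted sum of F_t = gamma * Phi(s_{t+1}) - Phi(s_t) on a trajectory."""
    return sum(
        gamma**t * (gamma * potentials[t + 1] - potentials[t])
        for t in range(len(potentials) - 1)
    )


def telescoped(potentials: Sequence[float], gamma: float) -> float:
    """Closed form of the same sum: gamma^T * Phi(s_T) - Phi(s_0)."""
    horizon = len(potentials) - 1
    return gamma**horizon * potentials[-1] - potentials[0]


# Two trajectories of equal length with the same start and end potentials
# but very different intermediate states.
gamma = 0.9
direct = [-1.0, -0.5, -0.2, 0.0]
detour = [-1.0, -3.0, -0.7, 0.0]
print(shaping_return(direct, gamma), shaping_return(detour, gamma))
```

Both trajectories collect the same total shaping bonus, which is why the ranking of policies under the original reward is preserved.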

Reward hacking in robotics often manifests as the policy exploiting simulator imprecision or loopholes in the shaped reward: sliding objects rather than lifting them to minimize effort while maximizing proximity rewards, vibrating the end-effector to generate false contact signals, or holding the arm stationary in a pose that maximizes proximity reward without completing the grasp. Catching these behaviors requires rigorous success-criterion evaluation that is independent of the training reward.

Curriculum learning#

Learning a complex manipulation task from a random policy initialization is typically infeasible: the probability of reaching the goal from random actions is near zero, the resulting zero-reward trajectories provide no useful gradient signal, and the policy cannot improve. Curriculum learning addresses this by sequencing a progression of tasks from easy to hard, ensuring that each stage of training provides useful signal for the next.

A curriculum can vary difficulty along multiple axes: initial state distributions (start with the object already in the robot's hand before training pick-and-place), tolerances (require coarser position accuracy early in training, then tighten), horizon lengths (train on short-horizon tasks that require fewer sequential decisions), or physics complexity (simple frictionless grasping before contact-rich assembly).

Automatic curriculum generation methods, such as POET (Paired Open-Ended Trailblazer) and goal-sampling schemes built on goal-conditioned (universal) value functions, adaptively adjust task difficulty based on the agent's current performance, keeping the task in the "zone of proximal development" where it is challenging but solvable. This avoids both the trivial-task plateau (too easy to provide useful signal) and the impossible-task plateau (too hard for any gradient signal).
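The idea of adapting difficulty to current performance can be illustrated with a much simpler scheme than POET: promote or demote the task level based on a rolling success rate. The sketch below is that threshold rule, not any published method; the class name, thresholds, and window size are assumptions.

```python
from collections import deque
from typing import Deque, Sequence


class SuccessRateCurriculum:
    """Move between difficulty levels based on a rolling success rate."""

    def __init__(self, levels: Sequence[float], window: int = 20,
                 promote_at: float = 0.8, demote_at: float = 0.2) -> None:
        self.levels = list(levels)   # ordered from easiest to hardest
        self.index = 0
        self.promote_at = promote_at
        self.demote_at = demote_at
        self._results: Deque[bool] = deque(maxlen=window)

    @property
    def level(self) -> float:
        return self.levels[self.index]

    def record(self, success: bool) -> float:
        """Log one episode outcome and return the level for the next episode."""
        self._results.append(success)
        if len(self._results) < self._results.maxlen:
            return self.level        # not enough evidence yet
        rate = sum(self._results) / len(self._results)
        if rate >= self.promote_at and self.index < len(self.levels) - 1:
            self.index += 1          # too easy: tighten the task
            self._results.clear()
        elif rate <= self.demote_at and self.index > 0:
            self.index -= 1          # too hard: back off
            self._results.clear()
        return self.level


# Usage: levels are position tolerances in meters (made-up values).
curriculum = SuccessRateCurriculum([0.05, 0.02, 0.005], window=4)
for _ in range(4):
    after_successes = curriculum.record(True)
for _ in range(4):
    after_failures = curriculum.record(False)
print(after_successes, after_failures)
```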

GenAI context: comparing RL regimes#

The contrast between RL in robotics and RL in language model training illuminates which constraints are intrinsic to sequential decision-making and which are domain-specific.

| Dimension | Robot RL | Language model RL (RLHF/GRPO) |
|---|---|---|
| Dominant algorithm | Off-policy SAC/TD3 | On-policy PPO/GRPO |
| Sample cost | Wall-clock expensive (hardware) | Token-compute expensive (but parallelizable) |
| Action space | Continuous $\mathbb{R}^n$ | Discrete (vocabulary) |
| Safety | Hard physical constraints (CBF-QP) | KL constraint from reference model |
| Reward | Shaped dense or sparse task reward | Human preference or rule-based verifier |
| Curriculum | Task complexity and initial conditions | Difficulty of reasoning problems |

LLM training prefers on-policy RL, the reverse of the robot RL preference, because token generation is cheap (LLMs can produce thousands of rollouts per minute), so the sample efficiency advantage of off-policy methods matters less than their instability cost. The on-policy PPO training that is impractical for physical robots becomes the default choice when data generation is fast and cheap.

Key takeaways#

Robotic RL operates under hard constraints (sample efficiency, physical safety, and difficult resets) that dominate algorithm selection. Off-policy methods, especially SAC and TD3, are standard because they reuse all collected experience through replay buffers. SAC's maximum entropy objective provides automatic exploration through policy entropy regularization, removing the need for separate noise injection. Safety is enforced at multiple levels: hardware workspace limits, CBF-QP filters at the policy output, and shielded controllers for emergency recovery. Action space design has large effects on learning efficiency; end-effector Cartesian delta commands are the most common choice for manipulation. Potential-based reward shaping is theoretically principled; non-potential shaping can change the optimal policy in unintended ways. Curriculum learning is essential for complex tasks where random initialization cannot reach the goal. The RL choices made for robotics and for language models differ systematically because the cost of data generation differs by orders of magnitude between domains.

Conceptual questions#

1. A team attempts to train a peg-insertion policy using SAC directly on physical hardware with a sparse reward (success = 1 when the peg is inserted, 0 otherwise). After 10,000 hardware interactions over two days, the policy shows no improvement. Diagnose this failure using the concepts of sample efficiency and sparse reward. Propose a complete modified experimental design, including reward shaping, curriculum, and replay buffer strategy, that would make learning tractable, and justify each component's role.
2. The CBF-QP safety filter modifies the policy's proposed action to satisfy $\dot{h} + \kappa h \geq 0$. During RL training, the policy consistently proposes actions that are slightly outside the safe set, and the CBF-QP correction is small (< 5% change in action norm). Analyze the effect of this persistent correction on the RL training dynamics. Does the policy converge to the CBF-constrained optimal policy? Under what conditions would the CBF correction interfere with convergence, and how would you detect this?
3. An engineer must choose between joint-space torque control and end-effector Cartesian delta control for training a bimanual assembly task where both arms must coordinate to simultaneously insert two pegs. Analyze the tradeoffs specifically for bimanual coordination: which action space makes the inter-arm coupling easier or harder to learn? Does the answer change if a high-quality analytical inverse kinematics solution is available? If the task requires contact forces to be controlled explicitly?
4. A robot RL policy is trained with a shaped reward $r = r_{\text{success}} + 0.1 \cdot r_{\text{proximity}} + 0.05 \cdot r_{\text{smoothness}}$. After training, the policy achieves 90% success rate in simulation but only 30% in physical deployment. Post-hoc analysis shows the policy learned to exploit a simulation artifact where the proximity reward can be maximized by a specific jerky motion that is physically impossible on the real robot. Identify which reward shaping principle was violated, explain how the Ng et al. theorem fails to protect against this failure mode, and redesign the reward.
5. SAC optimizes the maximum entropy objective with temperature $\alpha$. Compare the behavior of SAC with very large $\alpha$ (high entropy target) versus very small $\alpha$ (low entropy target) during the early training phase of a contact-rich grasping task. Specifically: how does each setting affect the exploration-exploitation tradeoff, the rate of unsafe actions near joint limits, and the eventual convergence behavior? Use this analysis to argue for a specific $\alpha$ schedule over the course of training.

Solutions

1. SAC sparse reward on hardware. 10,000 interactions is far too few for sparse-reward RL: random exploration essentially never triggers success, so there is no gradient, and hardware is sample-limited. Redesign: a dense shaped reward (distance-to-hole plus alignment), a curriculum (start with the peg near or above the hole and widen), demonstrations seeding the replay buffer (or offline pretraining, or residual RL on a scripted controller), and goal relabeling (HER). Each supplies learning signal or shrinks the exploration burden.
2. Persistent small CBF correction. Because the executed (filtered) action differs from the proposed one, you must train on the executed actions to converge to the CBF-constrained optimum. Training on proposed actions attributes outcomes to actions that were never executed, so the policy never experiences the true consequences of its proposals and will not converge. Interference appears when corrections are large or frequent and credit is assigned to the wrong action; detect it by monitoring proposed-vs-executed action divergence and reward plateaus while corrections persist.
3. Action space for bimanual. Cartesian end-effector deltas make inter-arm spatial coordination easier (the task lives in relative-pose space) but hide dynamics; joint-torque control gives full authority including forces but forces the policy to learn coordination and IK implicitly. A good analytical IK makes Cartesian even more attractive. If contact forces must be controlled explicitly, the answer flips toward torque or impedance control, since position deltas cannot regulate force.
4. Reward hacking. The violated principle is that shaping should be potential-based to preserve the optimal policy. Even so, Ng et al. guarantees policy invariance only within the MDP being optimized, here the simulator; it cannot prevent exploiting a simulator artifact, because the jerky motion is genuinely optimal under the wrong sim dynamics. Redesign: make the proximity term potential-based and physically grounded, penalize jerk or torque, and add domain randomization so the exploit is not robust across dynamics.
5. SAC temperature $\alpha$. Large $\alpha$ drives high entropy: lots of exploration, but more unsafe actions near joint limits early on and slower, less precise convergence. Small $\alpha$ is greedy early and under-explores contact, but makes fewer unsafe excursions. This argues for a schedule (or automatic entropy tuning with a decreasing target): start high to explore, ideally behind a CBF, then anneal to exploit and refine precision.

Looking ahead#

Most practical robot RL training does not occur on physical hardware but in simulation, with the trained policy subsequently transferred to the real world. The validity of this approach depends entirely on how faithfully the simulator captures the physical dynamics the robot will encounter.

Week 7: Sim2Real Pipelines and IsaacLab. We examine how modern GPU-accelerated simulation stacks (Isaac Sim, IsaacLab) are structured, why the sim2real gap is fundamentally difficult to close, and how domain randomization combined with system identification produces policies that transfer robustly to physical hardware.
