Purpose of this lecture
So far, we have established:
- how robots are shaped and move (kinematics),
- how forces produce motion under uncertainty (dynamics and state estimation).
This lecture addresses the next essential question:
Given a state estimate and a model, how do we choose actions that make the robot behave as desired?
Control theory provides principled answers to this question. Many modern robot learning methods—especially MPC-based RLReinforcement Learning and sim-to-real pipelines—are best understood as learning-augmented control systems, not replacements for control. This lecture builds the vocabulary and intuitions needed to understand that relationship precisely.
Feedback control: the core idea
At the heart of robotics is feedback.
A feedback controller continuously:
- measures the system state,
- compares it to a desired reference,
- applies corrective action.
This closed-loop structure is what makes robots robust to disturbances, modeling errors, and sensor noise. Open-loop control — executing a pre-planned action sequence without measuring the result — fails in practice because no model is perfect and no actuator is noiseless. Feedback is what bridges the gap between the model and the world.
PID control
The most widely used controller in robotics is the PID controller.
Given an error , the control input is:
The three terms each address a different aspect of the error signal:
- Proportional (P): corrects present error proportionally. Large gives fast response but can cause oscillation.
- Integral (I): accumulates past error and eliminates steady-state bias. Essential when constant disturbances (e.g. gravity, friction) must be rejected.
- Derivative (D): acts on the rate of change of error, damping oscillations and providing anticipatory correction.
The PD action interface in torque-controlled robots
For torque-controlled robots — which includes virtually all modern legged systems and many manipulation platforms — PID typically takes the form of a joint-space PD impedance law:
where is the desired joint position and is the commanded torque. The integral term is usually omitted in torque control to avoid integrator windup under contact transitions.
This is the action interface between a learned policy and the robot hardware. In almost every locomotion RLReinforcement Learning deployment (Unitree Go1/Go2, ANYmal, MIT Mini Cheetah), the policy outputs desired joint positions at 50–100 Hz; the PD controller runs at 500–1000 Hz and converts those targets to motor torques. Understanding this stack is essential: the policy is not directly commanding torques — it is commanding set-points into a lower-level impedance controller. The stiffness and damping of that controller shape the effective dynamics the policy experiences, and choosing them poorly can destabilize an otherwise well-trained policy at deployment.
Python Implementation: PID Controller
The following code implements a standard discrete-time PID controller.
import numpy as np
class PIDController:
def __init__(self, kp, ki, kd, dt=0.01):
self.kp = kp
self.ki = ki
self.kd = kd
self.dt = dt
self.integral = 0
self.prev_error = 0
def compute(self, target, current):
"""
Computes the PID control output.
"""
error = target - current
# Proportional term
p_term = self.kp * error
# Integral term (with basic anti-windup clamping)
self.integral += error * self.dt
self.integral = np.clip(self.integral, -10, 10)
i_term = self.ki * self.integral
# Derivative term
derivative = (error - self.prev_error) / self.dt
d_term = self.kd * derivative
# Total output
output = p_term + i_term + d_term
# Update state
self.prev_error = error
return output
# Simulation: 1D mass following a target
# Mass dynamics: x_next = x + v*dt; v_next = v + (u/mass)*dt
dt = 0.01
controller = PIDController(kp=10.0, ki=1.0, kd=2.0, dt=dt)
current_pos = 0.0
current_vel = 0.0
target_pos = 1.0
mass = 1.0
print(f"Step | Position | Output")
print("-" * 30)
for step in range(10):
u = controller.compute(target_pos, current_pos)
# Simple Euler integration of mass dynamics
acceleration = u / mass
current_vel += acceleration * dt
current_pos += current_vel * dt
print(f"{step:4d} | {current_pos:8.3f} | {u:7.2f}")
Stability and Lyapunov theory
A controller is only useful if it is stable. Stability means that the system returns to — or stays near — a desired operating point when perturbed. Designing and verifying stability is where control theory diverges most sharply from purely empirical learning.
Python Implementation: Linear Quadratic Regulator (LQR)
For linear systems, LQR provides the optimal control gain by solving the Discrete Algebraic Riccati Equation (DARE).
import scipy.linalg
def compute_lqr_gain(A, B, Q, R):
"""
Computes LQR gain K for discrete-time system x_next = Ax + Bu.
"""
# Solve DARE: P = Q + A'PA - A'PB(R + B'PB)^-1 B'PA
P = scipy.linalg.solve_discrete_are(A, B, Q, R)
# Compute K = (R + B'PB)^-1 B'PA
K = np.linalg.inv(R + B.T @ P @ B) @ (B.T @ P @ A)
return K
# Double integrator system (discrete time)
dt = 0.1
A = np.array([[1, dt], [0, 1]])
B = np.array([[0], [dt]])
# Penalize position and velocity error equally
Q = np.eye(2)
# Low penalty on control effort
R = np.array([[0.01]])
K = compute_lqr_gain(A, B, Q, R)
print(f"LQR Gain K: {K}")
Informal definition
A system is stable (in the Lyapunov sense) if:
- small disturbances produce small, bounded deviations,
- the system returns to equilibrium over time (asymptotic stability).
Lyapunov functions
A Lyapunov function is an energy-like scalar function that certifies stability without requiring an explicit solution of the system's differential equations.
For a system with equilibrium at , certifies asymptotic stability if:
The third condition — — says the "energy" is strictly decreasing along every trajectory. If energy always decreases, the system must converge to the zero-energy state.
Worked example
Consider the scalar system with candidate Lyapunov function :
All three conditions are satisfied. The system is asymptotically stable, and is the certificate. This example is trivial, but the same structure — find a function whose derivative is negative under the closed-loop dynamics — applies to arbitrarily complex nonlinear systems.
Control Lyapunov Functions (CLFs)
A Control Lyapunov Function extends this idea to systems with control inputs. A CLF is a Lyapunov function for which, at every state, there exists a control input that makes . This reframes stability from a post-hoc analysis into a constraint on the controller: choose such that decreases.
CLFs are the foundation of:
- CLF-QP controllers: at each timestep, solve a small quadratic program to find the minimum-norm control input that satisfies (plus other constraints). This gives a provably stabilizing controller with low computational cost.
- Neural Lyapunov functions: learn as a neural network jointly with the policy, enforcing the CLF conditions as training constraints. This allows stability certification for complex nonlinear systems where analytical Lyapunov functions are unavailable (Berkenkamp et al., Chang et al.).
Control Barrier Functions (CBFs)
Safety is the spatial analog of stability. A Control Barrier Function defines a safe set and enforces that the system never leaves it. The CBF condition requires:
for some class- function . Like the CLF, this can be enforced as a constraint in a CBF-QP that wraps around any nominal policy and minimally modifies its output to maintain safety. CBF-QP safety filters are increasingly used to wrap learned locomotion and manipulation policies, providing hard safety guarantees without retraining the policy.
Input-to-state stability (ISS)
Classical Lyapunov stability applies to unperturbed systems. In practice — and especially in sim-to-real — the robot always faces disturbances: model mismatch, sensor noise, terrain variation. Input-to-state stability (ISS) extends Lyapunov theory to perturbed systems: a system is ISS if bounded disturbances produce bounded state deviations, with the bound scaling continuously from zero. ISS is the theoretical foundation for understanding why domain randomization helps sim-to-real: a policy trained to be ISS-robust to a distribution of disturbances will remain stable under the bounded mismatch between simulation and hardware.
Why this matters for learning
Classical controllers come with stability guarantees derived from Lyapunov theory. Learned controllers generally do not — a policy that achieves high reward in training may be unstable in the Lyapunov sense and fail catastrophically under small perturbations at deployment. The modern response to this gap takes two forms: (1) enforcing Lyapunov or barrier constraints during policy training, and (2) wrapping a learned policy in a CLF-QP or CBF-QP safety filter that provides guarantees without modifying the policy itself.
Joint-space vs. task-space control
Robots can be controlled in different coordinate systems, and the choice has significant consequences for both controller design and learned policy architecture.
Joint-space control
Control objectives are defined directly in joint space. The controller computes torques as a function of joint position and velocity errors:
Simple and directly matched to the actuator space. This is the dominant paradigm in locomotion RLReinforcement Learning, where policies output desired joint positions into the PD layer described above.
Task-space (operational-space) control
Control objectives are defined in Cartesian space. The end-effector pose error is mapped to joint torques via the Jacobian and the robot dynamics.
The operational-space controller (Khatib, 1987) computes:
where is the desired Cartesian force/wrench, is the pseudoinverse Jacobian, and is a secondary joint-space objective (joint limit avoidance, posture control) projected into the null space of the task. This structure separates the primary task (end-effector control) from secondary objectives, which is the conceptual foundation of hierarchical manipulation policy design.
Connection to learning
The action space of a learned policy determines which control regime it operates in, and this choice propagates directly into how the RLReinforcement Learning reward function must be designed.
Locomotion RLReinforcement Learning policies (PPOProximal Policy Optimisation on quadrupeds) typically output desired joint positions → executed by joint-space PD control. In this paradigm, rewards must be formulated in joint space or in terms of quantities that are computable from joint-space targets: joint tracking error , joint velocity penalties , and energy consumption proxies . The policy never directly reasons about end-effector Cartesian position, so task-space objectives (such as foot clearance or swing-foot trajectory) must be projected through the Jacobian or computed from FK within the reward evaluation.
Manipulation policies (diffusion policy, ACTAction Chunking with Transformers, ) typically output end-effector waypoints → executed by IK plus joint PD, or by operational-space control directly. Here, rewards are naturally expressed in task space: Cartesian distance to the target object , orientation alignment , or contact force at the end-effector. Joint-space penalties (for smooth motion or avoiding singularities, as discussed in Week 1) must be brought back through FK or IK to be evaluated, adding complexity to the reward computation.
The key asymmetry is this: joint-space policies require rewards on joint targets, while task-space policies require rewards based on Cartesian end-effector distance — bringing IK or inverse dynamics into the reward evaluation loop whenever joint-space quantities need to be penalized. This interaction between action space and reward structure is a frequently overlooked source of reward engineering difficulty in manipulation RLReinforcement Learning.
Hybrid architectures output task-space goals that are tracked by a classical task-space controller, allowing the policy to reason at the level of object interaction rather than motor commands. Understanding which layer the policy operates at determines both what geometric structure it must implicitly learn and how the reward function must be structured to provide useful gradient signal.
Impedance and admittance control
For contact-rich tasks — manipulation, assembly, wiping, legged locomotion — impedance control is the standard control paradigm.
The core idea
Pure position control fails under contact: if the robot commands a position that is physically blocked, it generates unbounded forces. Impedance control instead defines a desired dynamic relationship between end-effector motion error and contact force:
where is the Cartesian error, , , are the desired virtual mass, damping, and stiffness, and is the external contact force. Rather than commanding the robot to reach a position regardless of contact, impedance control makes the robot behave like a mass-spring-damper: it complies with contact forces while still being attracted to the desired trajectory.
Admittance control
Impedance control is implemented in torque-controlled robots (command , measure ). Admittance control is the dual formulation for position-controlled robots (command , measure ): contact force measurements are used to modify the position command, achieving the same compliant behavior through position-loop modulation rather than direct torque control.
Connection to passivity and Week 2
The passivity property introduced in Week 2 — that is skew-symmetric — is precisely what makes energy-based impedance control stable. A passivity-based impedance controller can guarantee stability under arbitrary contact geometries because the interaction with any passive environment (a wall, a box, a human) cannot add energy to the system. This is a much stronger guarantee than Lyapunov stability at an isolated equilibrium.
Connection to learning
Impedance control is the low-level substrate for much of contact-rich imitation learning:
- Teleoperated demonstrations for assembly or insertion tasks are typically collected under impedance control, because it allows the human operator to make contact without generating damaging forces.
- Learned manipulation policies often output impedance set-points (, , ) rather than raw torques — allowing the policy to modulate compliance as a function of the task phase (high stiffness for precise positioning, low stiffness for compliant insertion).
- Variable impedance learning (learning as part of the policy) is an active research area in manipulation RLReinforcement Learning.
Optimal control: formalizing "best behavior"
Control design can be framed as an optimization problem: find the sequence of actions that minimizes a cost function subject to the robot's dynamics.
General form
This formulation is the foundation of both classical optimal control and reinforcement learning. The cost in optimal control corresponds to the negative reward in RLReinforcement Learning; the dynamics constraint corresponds to the environment transition function. The structural connection is exact — RLReinforcement Learning is optimal control with an unknown dynamics model and a learned value function.
Linear Quadratic Regulator (LQR)
For linear dynamics and quadratic cost, the optimal control problem has a closed-form solution.
Setup
where penalizes state deviation and penalizes control effort. The optimal controller is linear: .
Solving for K: the Discrete Algebraic Riccati Equation
is obtained by solving the Discrete Algebraic Riccati Equation (DARE):
is the solution to a fixed-point equation — the optimal value function for the infinite-horizon LQR problem. This is directly analogous to the Bellman equation in reinforcement learning:
LQR solves the Bellman equation in closed form for the linear-quadratic case. The Riccati equation is the algebraic fixed point of the Bellman operator under linearity and quadratic cost. This connection is not coincidental — the value function is also a Lyapunov function for the closed-loop system, linking optimal control, stability theory, and RLReinforcement Learning in a single object.
Properties
LQR is globally optimal for linear systems, guaranteed stable under mild conditions on , and computationally cheap once is solved offline. It is widely used as a local controller around an operating point for nonlinear systems — linearize, solve DARE, apply — and as a baseline against which more complex controllers are benchmarked.
Iterative LQR (iLQR) and Differential Dynamic Programming
Real robots are nonlinear. iLQR extends LQR to nonlinear systems through iterative local linearization.
Algorithm structure
iLQR alternates between two passes over a nominal trajectory :
Backward pass: starting from the terminal cost, compute a quadratic approximation of the value function at each timestep by linearizing the dynamics around the current trajectory. This produces local feedback gains and feedforward corrections .
Forward pass: apply the updated control law (with step size ) to generate a new trajectory. The new trajectory has lower cost than the previous one.
Iterate until convergence. The result is a locally optimal trajectory with a stabilizing feedback policy defined along it.
Differential Dynamic Programming (DDP)
DDP is the second-order extension of iLQR: it includes second-order derivatives of the dynamics in the backward pass, giving faster convergence at the cost of more computation. In practice, iLQR (which drops the second-order dynamics terms) is used more widely because the first-order approximation is sufficient for most robot problems and the Hessian of the dynamics is expensive to compute.
Key limitation
Both iLQR and DDP assume smooth dynamics. Contact events — foot strikes, object grasps, surface transitions — introduce discontinuities that violate this assumption and cause the local linearization to fail across contact boundaries. This motivates:
- Contact-implicit trajectory optimization (e.g. TRAJOPT, DIRTREL), which handles contact mode sequences explicitly.
- Learned dynamics models that smooth over contact discontinuities.
- RLReinforcement Learning, which sidesteps the need for differentiable dynamics entirely by using zeroth-order gradient estimates (policy gradient).
iLQR is a direct ancestor of trajectory optimization in MuJoCo (via MJPC), shooting methods in model-based RLReinforcement Learning, and the local policy improvement step in Guided Policy Search.
Model Predictive Control (MPC)
MPC applies optimal control online by solving a finite-horizon optimization problem at every timestep, using the current state as the initial condition.
Core loop
At each time step:
- Observe current state
- Solve: subject to
- Execute only the first action
- Discard the rest of the plan; replan at
Why replan every step?
Models are imperfect. By resolving the optimization at each step, MPC continuously corrects for:
- accumulated prediction error,
- unexpected disturbances,
- state estimation updates.
This receding-horizon structure gives MPC much of the robustness of feedback control while retaining the optimality guarantees of trajectory optimization.
Sampling-based MPC: MPPI
Classical MPC solves the optimization with gradient-based methods (iLQR, SQP). Model Predictive Path Integral (MPPI) control takes a different approach: sample thousands of random control sequences, simulate their outcomes in parallel, and compute a importance-weighted average of the samples biased toward low-cost trajectories.
where is the cumulative cost of sample and is a temperature parameter. MPPI:
- requires no gradient through the dynamics (works with non-differentiable simulators),
- handles non-smooth costs and contact naturally,
- maps directly onto GPU-parallel simulation (used in aggressive driving and locomotion).
MPPI is the classical antecedent of many RLReinforcement Learning planning algorithms and the direct inspiration for CEM-based model-based RLReinforcement Learning methods.
Connections to model-based reinforcement learning
The structural parallel between MPC and model-based RLReinforcement Learning is exact:
| Classical MPC | Learning Analogue | |---|---| | Analytical dynamics model | Learned world model (Dreamer, RSSM) | | Hand-designed cost function | Learned or specified reward function | | Receding-horizon planning | Rollout-based policy search | | iLQR / DDP backward pass | Local policy improvement (GPS, ILQR-RLReinforcement Learning) | | MPPI sampling | CEM / MPPI with learned dynamics | | Terminal value function | Learned value function (critic) |
MuZero performs planning in a learned latent space using a learned value function as the terminal cost — it is MPC with a neural dynamics model and a neural terminal value. Dreamer performs MPC over a learned RSSM world model. In locomotion, MPC is often used as a safety wrapper around an RLReinforcement Learning policy: the RLReinforcement Learning policy proposes actions and MPC projects them onto a feasible, collision-free trajectory.
Hybrid architectures: control + learning in deployed systems
Rather than choosing between classical control and learned policies, modern deployed systems combine them in structured ways. Two architectures dominate:
Residual policy
A classical controller handles nominal behavior (stance control, gravity compensation, trajectory following). A learned policy outputs a residual correction:
The classical controller provides stability and baseline performance; the learned residual captures dynamics that the model gets wrong or behavior that is hard to hand-design (adaptive gait recovery, contact-rich dexterity). Residual policies typically train faster than end-to-end RLReinforcement Learning because the classical controller handles the bulk of the task.
Policy + low-level controller stack
The learned policy operates at a higher abstraction level and outputs targets for a classical low-level controller:
This is the dominant architecture in locomotion RLReinforcement Learning. The policy reasons about gait, balance, and terrain at 50 Hz; the PD controller handles torque execution at 500–1000 Hz. The separation of timescales provides robustness: the fast PD loop rejects high-frequency disturbances that the policy never sees.
Understanding these architectures — not just the individual components — is what allows a robotics engineer to design systems that are both learnable and deployable.
Key takeaways
The concepts in this lecture form a progression from local to global, and from classical to learned:
PID and joint-space PD control are the lowest layer of almost every real robot, including the action interface through which learned policies execute. Stability and Lyapunov theory provide the mathematical framework for certifying that controllers work — and motivate CLFs, CBFs, and neural Lyapunov functions as tools for extending those guarantees to learned systems. Impedance control is the standard paradigm for contact-rich manipulation and the substrate on which most imitation learning for assembly is built. Task-space and operational-space control generalize joint-space control to Cartesian objectives, and are the execution layer for manipulation policies that reason about end-effector pose. LQR is optimal control in closed form — and the Riccati equation is the Bellman equation under linearity and quadratic cost, directly linking control theory and RLReinforcement Learning. iLQR and DDP extend optimality to nonlinear systems through iterative linearization, at the cost of assuming smooth dynamics. MPC closes the loop by resolving the optimization online, giving robustness through feedback. MPPI removes the smoothness assumption through sampling, bridging MPC and gradient-free RLReinforcement Learning. And hybrid architectures — residual policies and policy + controller stacks — are how these classical and learned components are combined in systems that actually deploy.
Conceptual questions
-
A locomotion policy outputs desired joint positions at 50 Hz, which are tracked by a PD controller at 500 Hz. What happens to stability if the PD gains and are set too high? Too low? How does the choice of gains interact with the policy's training dynamics in simulation?
-
You have a Lyapunov function for a linear system. Show that is also a valid Lyapunov function for the LQR-controlled closed-loop system. What does this imply about the relationship between the optimal cost-to-go and stability certification?
-
A manipulation policy trained with behavioral cloning outputs end-effector positions. At deployment, these are tracked with a stiff position controller. The robot is asked to insert a peg into a hole. Describe the failure mode, and explain how switching to impedance control would change the behavior.
-
iLQR converges to a locally optimal trajectory for a robot arm reaching task. The same algorithm is applied to a legged robot performing a jumping motion. Describe specifically why iLQR is likely to fail for the jumping task and what alternative approach you would use.
-
A learned locomotion policy achieves excellent sim performance but fails at deployment when the terrain is slightly different from training. You decide to wrap it in a CBF-QP safety filter. What does the CBF need to represent, what constraint does it enforce, and what does it not protect against?
Looking ahead
With control foundations in place, we now ask:
What if we learn behaviors directly from data instead of hand-designing controllers?
Week 4: Teleoperation and Data Collection
We study how robot datasets are generated, logged, and prepared for imitation learning
and RLReinforcement Learning — and how the control architecture at collection time shapes what policies can learn.
Further reading
- Ames, A. D., et al. (2019). Control Barrier Functions: Theory and Applications. European Control Conference. (The foundational modern paper on CBFs for safety).
- Borrelli, F., Bemporad, A., & Morari, M. (2017). Predictive Control for Linear and Hybrid Systems. Cambridge University Press. (Comprehensive MPC reference).