Skip to main content
illumin8
Courses
Week 3: Control Fundamentals
Robot Learning
01Week 1: Robot Modeling and Kinematics
02Week 2: Dynamics and State Estimation
03Week 3: Control Fundamentals
04Week 4: Teleoperation and Data Collection
05Week 5: Imitation Learning
06Week 6: Reinforcement Learning for Robotics
07Week 7: Sim2Real Pipelines and IsaacLab
08Week 8: Foundation Models for Manipulation — ACT and Action Chunking
09Week 9: Flow Matching and Diffusion for Robot Policies
10Week 10: Vision–Language–Action Models
11Week 11: Fine-Tuning and Adaptation
12Week 12: Safety, Constraints, and Reliability
13Week 13: Multi-Robot and Multi-Task Learning
14Week 14: Sim2Real Capstone
Week 3

Week 3: Control Fundamentals

✦Learning Outcomes
  • Derive and apply Linear Quadratic Regulator (LQR)
  • Understand Model Predictive Control (MPC) formulation
  • Connect classical control to model-based reinforcement learning
  • Apply Control Barrier Functions (CBFs) for safety-critical control
◆Prerequisites
  • Week 2: Dynamics, state estimation
  • Week 1: Kinematics, Jacobians

Recommended: Review Week 2 before proceeding.

Purpose of this lecture

So far, we have established:

  • how robots are shaped and move (kinematics),
  • how forces produce motion under uncertainty (dynamics and state estimation).

This lecture addresses the next essential question:

Given a state estimate and a model, how do we choose actions that make the robot behave as desired?

Control theory provides principled answers to this question. Many modern robot learning methods—especially MPC-based RLReinforcement Learning and sim-to-real pipelines—are best understood as learning-augmented control systems, not replacements for control. This lecture builds the vocabulary and intuitions needed to understand that relationship precisely.


Feedback control: the core idea

At the heart of robotics is feedback.

A feedback controller continuously:

  1. measures the system state,
  2. compares it to a desired reference,
  3. applies corrective action.

This closed-loop structure is what makes robots robust to disturbances, modeling errors, and sensor noise. Open-loop control — executing a pre-planned action sequence without measuring the result — fails in practice because no model is perfect and no actuator is noiseless. Feedback is what bridges the gap between the model and the world.


PID control

The most widely used controller in robotics is the PID controller.

Given an error e(t)=xdesired−x(t)e(t) = x_{\text{desired}} - x(t)e(t)=xdesired​−x(t), the control input is:

u(t)=KPe(t)+KI∫0te(τ) dτ+KDde(t)dtu(t) = K_P e(t) + K_I \int_0^t e(\tau)\,d\tau + K_D \frac{d e(t)}{dt}u(t)=KP​e(t)+KI​∫0t​e(τ)dτ+KD​dtde(t)​

The three terms each address a different aspect of the error signal:

  • Proportional (P): corrects present error proportionally. Large KPK_PKP​ gives fast response but can cause oscillation.
  • Integral (I): accumulates past error and eliminates steady-state bias. Essential when constant disturbances (e.g. gravity, friction) must be rejected.
  • Derivative (D): acts on the rate of change of error, damping oscillations and providing anticipatory correction.

The PD action interface in torque-controlled robots

For torque-controlled robots — which includes virtually all modern legged systems and many manipulation platforms — PID typically takes the form of a joint-space PD impedance law:

τ=Kp(qd−q)+Kd(q˙d−q˙)\tau = K_p(q_d - q) + K_d(\dot{q}_d - \dot{q})τ=Kp​(qd​−q)+Kd​(q˙​d​−q˙​)

where qdq_dqd​ is the desired joint position and τ\tauτ is the commanded torque. The integral term is usually omitted in torque control to avoid integrator windup under contact transitions.

This is the action interface between a learned policy and the robot hardware. In almost every locomotion RLReinforcement Learning deployment (Unitree Go1/Go2, ANYmal, MIT Mini Cheetah), the policy outputs desired joint positions qdq_dqd​ at 50–100 Hz; the PD controller runs at 500–1000 Hz and converts those targets to motor torques. Understanding this stack is essential: the policy is not directly commanding torques — it is commanding set-points into a lower-level impedance controller. The stiffness KpK_pKp​ and damping KdK_dKd​ of that controller shape the effective dynamics the policy experiences, and choosing them poorly can destabilize an otherwise well-trained policy at deployment.

Python Implementation: PID Controller

The following code implements a standard discrete-time PID controller.

import numpy as np

class PIDController:
    def __init__(self, kp, ki, kd, dt=0.01):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.dt = dt
        
        self.integral = 0
        self.prev_error = 0

    def compute(self, target, current):
        """
        Computes the PID control output.
        """
        error = target - current
        
        # Proportional term
        p_term = self.kp * error
        
        # Integral term (with basic anti-windup clamping)
        self.integral += error * self.dt
        self.integral = np.clip(self.integral, -10, 10) 
        i_term = self.ki * self.integral
        
        # Derivative term
        derivative = (error - self.prev_error) / self.dt
        d_term = self.kd * derivative
        
        # Total output
        output = p_term + i_term + d_term
        
        # Update state
        self.prev_error = error
        
        return output

# Simulation: 1D mass following a target
# Mass dynamics: x_next = x + v*dt; v_next = v + (u/mass)*dt
dt = 0.01
controller = PIDController(kp=10.0, ki=1.0, kd=2.0, dt=dt)

current_pos = 0.0
current_vel = 0.0
target_pos = 1.0
mass = 1.0

print(f"Step | Position | Output")
print("-" * 30)

for step in range(10):
    u = controller.compute(target_pos, current_pos)
    
    # Simple Euler integration of mass dynamics
    acceleration = u / mass
    current_vel += acceleration * dt
    current_pos += current_vel * dt
    
    print(f"{step:4d} | {current_pos:8.3f} | {u:7.2f}")

Stability and Lyapunov theory

A controller is only useful if it is stable. Stability means that the system returns to — or stays near — a desired operating point when perturbed. Designing and verifying stability is where control theory diverges most sharply from purely empirical learning.

Python Implementation: Linear Quadratic Regulator (LQR)

For linear systems, LQR provides the optimal control gain KKK by solving the Discrete Algebraic Riccati Equation (DARE).

import scipy.linalg

def compute_lqr_gain(A, B, Q, R):
    """
    Computes LQR gain K for discrete-time system x_next = Ax + Bu.
    """
    # Solve DARE: P = Q + A'PA - A'PB(R + B'PB)^-1 B'PA
    P = scipy.linalg.solve_discrete_are(A, B, Q, R)
    
    # Compute K = (R + B'PB)^-1 B'PA
    K = np.linalg.inv(R + B.T @ P @ B) @ (B.T @ P @ A)
    
    return K

# Double integrator system (discrete time)
dt = 0.1
A = np.array([[1, dt], [0, 1]])
B = np.array([[0], [dt]])

# Penalize position and velocity error equally
Q = np.eye(2)
# Low penalty on control effort
R = np.array([[0.01]])

K = compute_lqr_gain(A, B, Q, R)
print(f"LQR Gain K: {K}")

Informal definition

A system is stable (in the Lyapunov sense) if:

  • small disturbances produce small, bounded deviations,
  • the system returns to equilibrium over time (asymptotic stability).

Lyapunov functions

A Lyapunov function V(x)V(x)V(x) is an energy-like scalar function that certifies stability without requiring an explicit solution of the system's differential equations.

For a system x˙=f(x)\dot{x} = f(x)x˙=f(x) with equilibrium at x=0x = 0x=0, V(x)V(x)V(x) certifies asymptotic stability if:

V(x)>0∀x≠0,V(0)=0,V˙(x)=∇V⋅f(x)<0∀x≠0V(x) > 0 \quad \forall x \neq 0, \qquad V(0) = 0, \qquad \dot{V}(x) = \nabla V \cdot f(x) < 0 \quad \forall x \neq 0V(x)>0∀x=0,V(0)=0,V˙(x)=∇V⋅f(x)<0∀x=0

The third condition — V˙<0\dot{V} < 0V˙<0 — says the "energy" is strictly decreasing along every trajectory. If energy always decreases, the system must converge to the zero-energy state.

Worked example

Consider the scalar system x˙=−x\dot{x} = -xx˙=−x with candidate Lyapunov function V(x)=x2V(x) = x^2V(x)=x2:

V˙(x)=2xx˙=2x(−x)=−2x2<0∀x≠0\dot{V}(x) = 2x \dot{x} = 2x(-x) = -2x^2 < 0 \quad \forall x \neq 0V˙(x)=2xx˙=2x(−x)=−2x2<0∀x=0

All three conditions are satisfied. The system is asymptotically stable, and V(x)=x2V(x) = x^2V(x)=x2 is the certificate. This example is trivial, but the same structure — find a function whose derivative is negative under the closed-loop dynamics — applies to arbitrarily complex nonlinear systems.

Control Lyapunov Functions (CLFs)

A Control Lyapunov Function extends this idea to systems with control inputs. A CLF is a Lyapunov function for which, at every state, there exists a control input uuu that makes V˙(x,u)<0\dot{V}(x, u) < 0V˙(x,u)<0. This reframes stability from a post-hoc analysis into a constraint on the controller: choose uuu such that VVV decreases.

CLFs are the foundation of:

  • CLF-QP controllers: at each timestep, solve a small quadratic program to find the minimum-norm control input that satisfies V˙<0\dot{V} < 0V˙<0 (plus other constraints). This gives a provably stabilizing controller with low computational cost.
  • Neural Lyapunov functions: learn V(x)V(x)V(x) as a neural network jointly with the policy, enforcing the CLF conditions as training constraints. This allows stability certification for complex nonlinear systems where analytical Lyapunov functions are unavailable (Berkenkamp et al., Chang et al.).

Control Barrier Functions (CBFs)

Safety is the spatial analog of stability. A Control Barrier Function h(x)h(x)h(x) defines a safe set S={x:h(x)≥0}\mathcal{S} = \{x : h(x) \geq 0\}S={x:h(x)≥0} and enforces that the system never leaves it. The CBF condition requires:

h˙(x,u)≥−α(h(x))\dot{h}(x, u) \geq -\alpha(h(x))h˙(x,u)≥−α(h(x))

for some class-K\mathcal{K}K function α\alphaα. Like the CLF, this can be enforced as a constraint in a CBF-QP that wraps around any nominal policy and minimally modifies its output to maintain safety. CBF-QP safety filters are increasingly used to wrap learned locomotion and manipulation policies, providing hard safety guarantees without retraining the policy.

Input-to-state stability (ISS)

Classical Lyapunov stability applies to unperturbed systems. In practice — and especially in sim-to-real — the robot always faces disturbances: model mismatch, sensor noise, terrain variation. Input-to-state stability (ISS) extends Lyapunov theory to perturbed systems: a system is ISS if bounded disturbances produce bounded state deviations, with the bound scaling continuously from zero. ISS is the theoretical foundation for understanding why domain randomization helps sim-to-real: a policy trained to be ISS-robust to a distribution of disturbances will remain stable under the bounded mismatch between simulation and hardware.

Why this matters for learning

Classical controllers come with stability guarantees derived from Lyapunov theory. Learned controllers generally do not — a policy that achieves high reward in training may be unstable in the Lyapunov sense and fail catastrophically under small perturbations at deployment. The modern response to this gap takes two forms: (1) enforcing Lyapunov or barrier constraints during policy training, and (2) wrapping a learned policy in a CLF-QP or CBF-QP safety filter that provides guarantees without modifying the policy itself.


Joint-space vs. task-space control

Robots can be controlled in different coordinate systems, and the choice has significant consequences for both controller design and learned policy architecture.

Joint-space control

Control objectives are defined directly in joint space. The controller computes torques as a function of joint position and velocity errors:

τ=f(qd−q, q˙d−q˙)\tau = f(q_d - q, \, \dot{q}_d - \dot{q})τ=f(qd​−q,q˙​d​−q˙​)

Simple and directly matched to the actuator space. This is the dominant paradigm in locomotion RLReinforcement Learning, where policies output desired joint positions into the PD layer described above.

Task-space (operational-space) control

Control objectives are defined in Cartesian space. The end-effector pose error is mapped to joint torques via the Jacobian and the robot dynamics.

The operational-space controller (Khatib, 1987) computes:

τ=JTFtask+(I−JTJ+T)τ0\tau = J^T F_{\text{task}} + (I - J^T J^{+T}) \tau_0τ=JTFtask​+(I−JTJ+T)τ0​

where FtaskF_{\text{task}}Ftask​ is the desired Cartesian force/wrench, J+J^+J+ is the pseudoinverse Jacobian, and τ0\tau_0τ0​ is a secondary joint-space objective (joint limit avoidance, posture control) projected into the null space of the task. This structure separates the primary task (end-effector control) from secondary objectives, which is the conceptual foundation of hierarchical manipulation policy design.

Connection to learning

The action space of a learned policy determines which control regime it operates in, and this choice propagates directly into how the RLReinforcement Learning reward function must be designed.

Locomotion RLReinforcement Learning policies (PPOProximal Policy Optimisation on quadrupeds) typically output desired joint positions → executed by joint-space PD control. In this paradigm, rewards must be formulated in joint space or in terms of quantities that are computable from joint-space targets: joint tracking error ∥qd−q∥2\|q_d - q\|^2∥qd​−q∥2, joint velocity penalties ∥q˙∥2\|\dot{q}\|^2∥q˙​∥2, and energy consumption proxies ∥τ⊙q˙∥1\|\tau \odot \dot{q}\|_1∥τ⊙q˙​∥1​. The policy never directly reasons about end-effector Cartesian position, so task-space objectives (such as foot clearance or swing-foot trajectory) must be projected through the Jacobian or computed from FK within the reward evaluation.

Manipulation policies (diffusion policy, ACTAction Chunking with Transformers, π0\pi_0π0​) typically output end-effector waypoints → executed by IK plus joint PD, or by operational-space control directly. Here, rewards are naturally expressed in task space: Cartesian distance to the target object ∥xee−xgoal∥\|x_{ee} - x_{\text{goal}}\|∥xee​−xgoal​∥, orientation alignment ∥ReeTRgoal−I∥F\|R_{ee}^T R_{\text{goal}} - I\|_F∥ReeT​Rgoal​−I∥F​, or contact force at the end-effector. Joint-space penalties (for smooth motion or avoiding singularities, as discussed in Week 1) must be brought back through FK or IK to be evaluated, adding complexity to the reward computation.

The key asymmetry is this: joint-space policies require rewards on joint targets, while task-space policies require rewards based on Cartesian end-effector distance — bringing IK or inverse dynamics into the reward evaluation loop whenever joint-space quantities need to be penalized. This interaction between action space and reward structure is a frequently overlooked source of reward engineering difficulty in manipulation RLReinforcement Learning.

Hybrid architectures output task-space goals that are tracked by a classical task-space controller, allowing the policy to reason at the level of object interaction rather than motor commands. Understanding which layer the policy operates at determines both what geometric structure it must implicitly learn and how the reward function must be structured to provide useful gradient signal.


Impedance and admittance control

For contact-rich tasks — manipulation, assembly, wiping, legged locomotion — impedance control is the standard control paradigm.

The core idea

Pure position control fails under contact: if the robot commands a position that is physically blocked, it generates unbounded forces. Impedance control instead defines a desired dynamic relationship between end-effector motion error and contact force:

Mde¨+Dde˙+Kde=FextM_d \ddot{e} + D_d \dot{e} + K_d e = F_{\text{ext}}Md​e¨+Dd​e˙+Kd​e=Fext​

where e=xd−xe = x_d - xe=xd​−x is the Cartesian error, MdM_dMd​, DdD_dDd​, KdK_dKd​ are the desired virtual mass, damping, and stiffness, and FextF_{\text{ext}}Fext​ is the external contact force. Rather than commanding the robot to reach a position regardless of contact, impedance control makes the robot behave like a mass-spring-damper: it complies with contact forces while still being attracted to the desired trajectory.

Admittance control

Impedance control is implemented in torque-controlled robots (command τ\tauτ, measure FFF). Admittance control is the dual formulation for position-controlled robots (command xxx, measure FFF): contact force measurements are used to modify the position command, achieving the same compliant behavior through position-loop modulation rather than direct torque control.

Connection to passivity and Week 2

The passivity property introduced in Week 2 — that M˙−2C\dot{M} - 2CM˙−2C is skew-symmetric — is precisely what makes energy-based impedance control stable. A passivity-based impedance controller can guarantee stability under arbitrary contact geometries because the interaction with any passive environment (a wall, a box, a human) cannot add energy to the system. This is a much stronger guarantee than Lyapunov stability at an isolated equilibrium.

Connection to learning

Impedance control is the low-level substrate for much of contact-rich imitation learning:

  • Teleoperated demonstrations for assembly or insertion tasks are typically collected under impedance control, because it allows the human operator to make contact without generating damaging forces.
  • Learned manipulation policies often output impedance set-points (KdK_dKd​, DdD_dDd​, xdx_dxd​) rather than raw torques — allowing the policy to modulate compliance as a function of the task phase (high stiffness for precise positioning, low stiffness for compliant insertion).
  • Variable impedance learning (learning Kd(t)K_d(t)Kd​(t) as part of the policy) is an active research area in manipulation RLReinforcement Learning.

Optimal control: formalizing "best behavior"

Control design can be framed as an optimization problem: find the sequence of actions that minimizes a cost function subject to the robot's dynamics.

General form

min⁡u0:T∑t=0Tℓ(xt,ut)subject toxt+1=f(xt,ut)\min_{u_{0:T}} \sum_{t=0}^{T} \ell(x_t, u_t) \quad \text{subject to} \quad x_{t+1} = f(x_t, u_t)u0:T​min​t=0∑T​ℓ(xt​,ut​)subject toxt+1​=f(xt​,ut​)

This formulation is the foundation of both classical optimal control and reinforcement learning. The cost ℓ\ellℓ in optimal control corresponds to the negative reward in RLReinforcement Learning; the dynamics constraint fff corresponds to the environment transition function. The structural connection is exact — RLReinforcement Learning is optimal control with an unknown dynamics model and a learned value function.


Linear Quadratic Regulator (LQR)

For linear dynamics and quadratic cost, the optimal control problem has a closed-form solution.

Setup

xt+1=Axt+But,ℓ(x,u)=x⊤Qx+u⊤Rux_{t+1} = Ax_t + Bu_t, \qquad \ell(x,u) = x^\top Q x + u^\top R uxt+1​=Axt​+But​,ℓ(x,u)=x⊤Qx+u⊤Ru

where Q⪰0Q \succeq 0Q⪰0 penalizes state deviation and R≻0R \succ 0R≻0 penalizes control effort. The optimal controller is linear: u∗=−Kxu^* = -Kxu∗=−Kx.

Solving for K: the Discrete Algebraic Riccati Equation

KKK is obtained by solving the Discrete Algebraic Riccati Equation (DARE):

P=Q+A⊤PA−A⊤PB(R+B⊤PB)−1B⊤PAP = Q + A^\top P A - A^\top P B (R + B^\top P B)^{-1} B^\top P AP=Q+A⊤PA−A⊤PB(R+B⊤PB)−1B⊤PA K=(R+B⊤PB)−1B⊤PAK = (R + B^\top P B)^{-1} B^\top P AK=(R+B⊤PB)−1B⊤PA

PPP is the solution to a fixed-point equation — the optimal value function for the infinite-horizon LQR problem. This is directly analogous to the Bellman equation in reinforcement learning:

V∗(x)=min⁡u[ℓ(x,u)+V∗(f(x,u))]V^*(x) = \min_u \bigl[ \ell(x, u) + V^*(f(x,u)) \bigr]V∗(x)=umin​[ℓ(x,u)+V∗(f(x,u))]

LQR solves the Bellman equation in closed form for the linear-quadratic case. The Riccati equation is the algebraic fixed point of the Bellman operator under linearity and quadratic cost. This connection is not coincidental — the value function V(x)=x⊤PxV(x) = x^\top P xV(x)=x⊤Px is also a Lyapunov function for the closed-loop system, linking optimal control, stability theory, and RLReinforcement Learning in a single object.

Properties

LQR is globally optimal for linear systems, guaranteed stable under mild conditions on (A,B,Q,R)(A, B, Q, R)(A,B,Q,R), and computationally cheap once PPP is solved offline. It is widely used as a local controller around an operating point for nonlinear systems — linearize, solve DARE, apply u=−Kxu = -Kxu=−Kx — and as a baseline against which more complex controllers are benchmarked.


Iterative LQR (iLQR) and Differential Dynamic Programming

Real robots are nonlinear. iLQR extends LQR to nonlinear systems through iterative local linearization.

Algorithm structure

iLQR alternates between two passes over a nominal trajectory (x0:T∗,u0:T∗)(x_{0:T}^*, u_{0:T}^*)(x0:T∗​,u0:T∗​):

Backward pass: starting from the terminal cost, compute a quadratic approximation of the value function at each timestep by linearizing the dynamics fff around the current trajectory. This produces local feedback gains KtK_tKt​ and feedforward corrections ktk_tkt​.

Forward pass: apply the updated control law ut=ut∗+Kt(xt−xt∗)+αktu_t = u_t^* + K_t(x_t - x_t^*) + \alpha k_tut​=ut∗​+Kt​(xt​−xt∗​)+αkt​ (with step size α\alphaα) to generate a new trajectory. The new trajectory has lower cost than the previous one.

Iterate until convergence. The result is a locally optimal trajectory with a stabilizing feedback policy defined along it.

Differential Dynamic Programming (DDP)

DDP is the second-order extension of iLQR: it includes second-order derivatives of the dynamics in the backward pass, giving faster convergence at the cost of more computation. In practice, iLQR (which drops the second-order dynamics terms) is used more widely because the first-order approximation is sufficient for most robot problems and the Hessian of the dynamics is expensive to compute.

Key limitation

Both iLQR and DDP assume smooth dynamics. Contact events — foot strikes, object grasps, surface transitions — introduce discontinuities that violate this assumption and cause the local linearization to fail across contact boundaries. This motivates:

  • Contact-implicit trajectory optimization (e.g. TRAJOPT, DIRTREL), which handles contact mode sequences explicitly.
  • Learned dynamics models that smooth over contact discontinuities.
  • RLReinforcement Learning, which sidesteps the need for differentiable dynamics entirely by using zeroth-order gradient estimates (policy gradient).

iLQR is a direct ancestor of trajectory optimization in MuJoCo (via MJPC), shooting methods in model-based RLReinforcement Learning, and the local policy improvement step in Guided Policy Search.


Model Predictive Control (MPC)

MPC applies optimal control online by solving a finite-horizon optimization problem at every timestep, using the current state as the initial condition.

Core loop

At each time step:

  1. Observe current state xtx_txt​
  2. Solve: min⁡ut:t+H∑k=0Hℓ(xt+k,ut+k)\min_{u_{t:t+H}} \sum_{k=0}^{H} \ell(x_{t+k}, u_{t+k})minut:t+H​​∑k=0H​ℓ(xt+k​,ut+k​) subject to xt+k+1=f(xt+k,ut+k)x_{t+k+1} = f(x_{t+k}, u_{t+k})xt+k+1​=f(xt+k​,ut+k​)
  3. Execute only the first action ut∗u_t^*ut∗​
  4. Discard the rest of the plan; replan at t+1t+1t+1

Why replan every step?

Models are imperfect. By resolving the optimization at each step, MPC continuously corrects for:

  • accumulated prediction error,
  • unexpected disturbances,
  • state estimation updates.

This receding-horizon structure gives MPC much of the robustness of feedback control while retaining the optimality guarantees of trajectory optimization.

Sampling-based MPC: MPPI

Classical MPC solves the optimization with gradient-based methods (iLQR, SQP). Model Predictive Path Integral (MPPI) control takes a different approach: sample thousands of random control sequences, simulate their outcomes in parallel, and compute a importance-weighted average of the samples biased toward low-cost trajectories.

ut∗=∑kwkUk∑kwk,wk∝exp⁡ ⁣(−1λS(Uk))u_t^* = \frac{\sum_k w_k U_k}{\sum_k w_k}, \qquad w_k \propto \exp\!\left(-\frac{1}{\lambda} S(U_k)\right)ut∗​=∑k​wk​∑k​wk​Uk​​,wk​∝exp(−λ1​S(Uk​))

where S(Uk)S(U_k)S(Uk​) is the cumulative cost of sample kkk and λ\lambdaλ is a temperature parameter. MPPI:

  • requires no gradient through the dynamics (works with non-differentiable simulators),
  • handles non-smooth costs and contact naturally,
  • maps directly onto GPU-parallel simulation (used in aggressive driving and locomotion).

MPPI is the classical antecedent of many RLReinforcement Learning planning algorithms and the direct inspiration for CEM-based model-based RLReinforcement Learning methods.

Connections to model-based reinforcement learning

The structural parallel between MPC and model-based RLReinforcement Learning is exact:

| Classical MPC | Learning Analogue | |---|---| | Analytical dynamics model | Learned world model (Dreamer, RSSM) | | Hand-designed cost function | Learned or specified reward function | | Receding-horizon planning | Rollout-based policy search | | iLQR / DDP backward pass | Local policy improvement (GPS, ILQR-RLReinforcement Learning) | | MPPI sampling | CEM / MPPI with learned dynamics | | Terminal value function | Learned value function (critic) |

MuZero performs planning in a learned latent space using a learned value function as the terminal cost — it is MPC with a neural dynamics model and a neural terminal value. Dreamer performs MPC over a learned RSSM world model. In locomotion, MPC is often used as a safety wrapper around an RLReinforcement Learning policy: the RLReinforcement Learning policy proposes actions and MPC projects them onto a feasible, collision-free trajectory.


Hybrid architectures: control + learning in deployed systems

Rather than choosing between classical control and learned policies, modern deployed systems combine them in structured ways. Two architectures dominate:

Residual policy

A classical controller uc(x)u_c(x)uc​(x) handles nominal behavior (stance control, gravity compensation, trajectory following). A learned policy πθ(x)\pi_\theta(x)πθ​(x) outputs a residual correction:

u=uc(x)+πθ(x)u = u_c(x) + \pi_\theta(x)u=uc​(x)+πθ​(x)

The classical controller provides stability and baseline performance; the learned residual captures dynamics that the model gets wrong or behavior that is hard to hand-design (adaptive gait recovery, contact-rich dexterity). Residual policies typically train faster than end-to-end RLReinforcement Learning because the classical controller handles the bulk of the task.

Policy + low-level controller stack

The learned policy operates at a higher abstraction level and outputs targets for a classical low-level controller:

qd=πθ(x),τ=Kp(qd−q)+Kd(q˙d−q˙)q_d = \pi_\theta(x), \qquad \tau = K_p(q_d - q) + K_d(\dot{q}_d - \dot{q})qd​=πθ​(x),τ=Kp​(qd​−q)+Kd​(q˙​d​−q˙​)

This is the dominant architecture in locomotion RLReinforcement Learning. The policy reasons about gait, balance, and terrain at 50 Hz; the PD controller handles torque execution at 500–1000 Hz. The separation of timescales provides robustness: the fast PD loop rejects high-frequency disturbances that the policy never sees.

Understanding these architectures — not just the individual components — is what allows a robotics engineer to design systems that are both learnable and deployable.


Key takeaways

The concepts in this lecture form a progression from local to global, and from classical to learned:

PID and joint-space PD control are the lowest layer of almost every real robot, including the action interface through which learned policies execute. Stability and Lyapunov theory provide the mathematical framework for certifying that controllers work — and motivate CLFs, CBFs, and neural Lyapunov functions as tools for extending those guarantees to learned systems. Impedance control is the standard paradigm for contact-rich manipulation and the substrate on which most imitation learning for assembly is built. Task-space and operational-space control generalize joint-space control to Cartesian objectives, and are the execution layer for manipulation policies that reason about end-effector pose. LQR is optimal control in closed form — and the Riccati equation is the Bellman equation under linearity and quadratic cost, directly linking control theory and RLReinforcement Learning. iLQR and DDP extend optimality to nonlinear systems through iterative linearization, at the cost of assuming smooth dynamics. MPC closes the loop by resolving the optimization online, giving robustness through feedback. MPPI removes the smoothness assumption through sampling, bridging MPC and gradient-free RLReinforcement Learning. And hybrid architectures — residual policies and policy + controller stacks — are how these classical and learned components are combined in systems that actually deploy.


Conceptual questions

  1. A locomotion policy outputs desired joint positions at 50 Hz, which are tracked by a PD controller at 500 Hz. What happens to stability if the PD gains KpK_pKp​ and KdK_dKd​ are set too high? Too low? How does the choice of gains interact with the policy's training dynamics in simulation?

  2. You have a Lyapunov function V(x)=x⊤PxV(x) = x^\top P xV(x)=x⊤Px for a linear system. Show that VVV is also a valid Lyapunov function for the LQR-controlled closed-loop system. What does this imply about the relationship between the optimal cost-to-go and stability certification?

  3. A manipulation policy trained with behavioral cloning outputs end-effector positions. At deployment, these are tracked with a stiff position controller. The robot is asked to insert a peg into a hole. Describe the failure mode, and explain how switching to impedance control would change the behavior.

  4. iLQR converges to a locally optimal trajectory for a robot arm reaching task. The same algorithm is applied to a legged robot performing a jumping motion. Describe specifically why iLQR is likely to fail for the jumping task and what alternative approach you would use.

  5. A learned locomotion policy achieves excellent sim performance but fails at deployment when the terrain is slightly different from training. You decide to wrap it in a CBF-QP safety filter. What does the CBF need to represent, what constraint does it enforce, and what does it not protect against?


✦Solutions
  1. PD gains. Too-high Kp,KdK_p,K_dKp​,Kd​ make the loop stiff and can excite high-frequency oscillation or instability (worsened by the 50 Hz command rate and actuator delay) and fight the policy; too-low gains track sluggishly, so commanded positions are never realized and the policy's intended motion is lost. In sim the policy adapts to whatever gains it trains with, so mismatched deploy gains break transfer — the gains define the effective action space the policy experiences.
  2. Lyapunov for LQR. With V(x)=x⊤PxV(x)=x^\top P xV(x)=x⊤Px (PPP the Riccati solution) and the closed loop u=−Kxu=-Kxu=−Kx, V˙=−x⊤(Q+K⊤RK)x<0\dot V = -x^\top(Q + K^\top R K)x < 0V˙=−x⊤(Q+K⊤RK)x<0, so VVV is a valid Lyapunov function. Thus the optimal cost-to-go is itself a stability certificate — for LQR, optimality implies stability.
  3. Stiff position control for peg insertion. A stiff controller commands exact positions, so tiny pose errors generate large contact forces and the peg wedges against the hole edge with no way to comply — jamming or excessive force. Impedance control makes the arm behave like a spring-damper, so misalignment produces gentle corrective forces and the peg settles in, trading position accuracy for force regulation.
  4. iLQR on jumping. iLQR linearizes around a trajectory and assumes smooth, differentiable dynamics; jumping makes and breaks contact (discontinuous hybrid dynamics) and has an underactuated flight phase, so the linearization is invalid across mode switches and iLQR stalls or diverges. Use a contact-implicit/hybrid trajectory optimizer (or RL) that reasons over contact modes.
  5. CBF-QP wrap. The CBF must encode the safe set as h(x)≥0h(x)\ge 0h(x)≥0 (distance to terrain/obstacles/joint limits) and enforces forward-invariance (h˙≥−κh\dot h \ge -\kappa hh˙≥−κh) by minimally adjusting the action. It does not protect against unmodeled dynamics, state-estimation error, or inaccuracy in the h˙\dot hh˙ model, and it does not guarantee task success — it only keeps the system in the explicitly encoded safe set, and only if the QP's dynamics model is accurate.

Looking ahead

With control foundations in place, we now ask:

What if we learn behaviors directly from data instead of hand-designing controllers?

Week 4: Teleoperation and Data Collection
We study how robot datasets are generated, logged, and prepared for imitation learning and RLReinforcement Learning — and how the control architecture at collection time shapes what policies can learn.


Further reading

  • Ames, A. D., et al. (2019). Control Barrier Functions: Theory and Applications. European Control Conference. (The foundational modern paper on CBFs for safety).
  • Borrelli, F., Bemporad, A., & Morari, M. (2017). Predictive Control for Linear and Hybrid Systems. Cambridge University Press. (Comprehensive MPC reference).
← Previous
Week 2: Dynamics and State Estimation
Next →
Week 4: Teleoperation and Data Collection
On this page
  • Purpose of this lecture
  • Feedback control: the core idea
  • PID control
  • The PD action interface in torque-controlled robots
  • Python Implementation: PID Controller
  • Stability and Lyapunov theory
  • Python Implementation: Linear Quadratic Regulator (LQR)
  • Informal definition
  • Lyapunov functions
  • Worked example
  • Control Lyapunov Functions (CLFs)
  • Control Barrier Functions (CBFs)
  • Input-to-state stability (ISS)
  • Why this matters for learning
  • Joint-space vs. task-space control
  • Joint-space control
  • Task-space (operational-space) control
  • Connection to learning
  • Impedance and admittance control
  • The core idea
  • Admittance control
  • Connection to passivity and Week 2
  • Connection to learning
  • Optimal control: formalizing "best behavior"
  • General form
  • Linear Quadratic Regulator (LQR)
  • Setup
  • Solving for K: the Discrete Algebraic Riccati Equation
  • Properties
  • Iterative LQR (iLQR) and Differential Dynamic Programming
  • Algorithm structure
  • Differential Dynamic Programming (DDP)
  • Key limitation
  • Model Predictive Control (MPC)
  • Core loop
  • Why replan every step?
  • Sampling-based MPC: MPPI
  • Connections to model-based reinforcement learning
  • Hybrid architectures: control + learning in deployed systems
  • Residual policy
  • Policy + low-level controller stack
  • Key takeaways
  • Conceptual questions
  • Looking ahead
  • Further reading