Week 3: Control Fundamentals

Purpose of this lecture#

So far, we have established:

how robots are shaped and move (kinematics),
how forces produce motion under uncertainty (dynamics and state estimation).

This lecture addresses the next essential question:

Given a state estimate and a model, how do we choose actions that make the robot behave as desired?

Control theory provides principled answers to this question. Many modern robot learning methods—especially MPC-based RL and sim-to-real pipelines—are best understood as learning-augmented control systems, not replacements for control. This lecture builds the vocabulary and intuitions needed to understand that relationship precisely.

Feedback control: the core idea#

At the heart of robotics is feedback.

A feedback controller continuously:

measures the system state,
compares it to a desired reference,
applies corrective action.

This closed-loop structure is what makes robots robust to disturbances, modeling errors, and sensor noise. Open-loop control — executing a pre-planned action sequence without measuring the result — fails in practice because no model is perfect and no actuator is noiseless. Feedback is what bridges the gap between the model and the world.

PID control#

The most widely used controller in robotics is the PID controller.

Given an error $e(t) = x_{\text{desired}} - x(t)$ , the control input is:

u(t) = K_P e(t) + K_I \int_0^t e(\tau)\,d\tau + K_D \frac{d e(t)}{dt}

The three terms each address a different aspect of the error signal:

Proportional (P): corrects present error proportionally. Large $K_P$ gives fast response but can cause oscillation.
Integral (I): accumulates past error and eliminates steady-state bias. Essential when constant disturbances (e.g. gravity, friction) must be rejected.
Derivative (D): acts on the rate of change of error, damping oscillations and providing anticipatory correction.

The PD action interface in torque-controlled robots#

For torque-controlled robots — which includes virtually all modern legged systems and many manipulation platforms — PID typically takes the form of a joint-space PD impedance law:

\tau = K_p(q_d - q) + K_d(\dot{q}_d - \dot{q})

where $q_d$ is the desired joint position and $\tau$ is the commanded torque. The integral term is usually omitted in torque control to avoid integrator windup under contact transitions.

This is the action interface between a learned policy and the robot hardware. In almost every locomotion RL deployment (Unitree Go1/Go2, ANYmal, MIT Mini Cheetah), the policy outputs desired joint positions $q_d$ at 50–100 Hz; the PD controller runs at 500–1000 Hz and converts those targets to motor torques. Understanding this stack is essential: the policy is not directly commanding torques — it is commanding set-points into a lower-level impedance controller. The stiffness $K_p$ and damping $K_d$ of that controller shape the effective dynamics the policy experiences, and choosing them poorly can destabilize an otherwise well-trained policy at deployment.

Python Implementation: PID Controller#

The following code implements a standard discrete-time PID controller.

python · runs in browser

import numpy as np

class PIDController:
    def __init__(self, kp, ki, kd, dt=0.01):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.dt = dt
        
        self.integral = 0
        self.prev_error = 0

    def compute(self, target, current):
        """
        Computes the PID control output.
        """
        error = target - current
        
        # Proportional term
        p_term = self.kp * error
        
        # Integral term (with basic anti-windup clamping)
        self.integral += error * self.dt
        self.integral = np.clip(self.integral, -10, 10) 
        i_term = self.ki * self.integral
        
        # Derivative term
        derivative = (error - self.prev_error) / self.dt
        d_term = self.kd * derivative
        
        # Total output
        output = p_term + i_term + d_term
        
        # Update state
        self.prev_error = error
        
        return output

# Simulation: 1D mass following a target
# Mass dynamics: x_next = x + v*dt; v_next = v + (u/mass)*dt
dt = 0.01
controller = PIDController(kp=10.0, ki=1.0, kd=2.0, dt=dt)

current_pos = 0.0
current_vel = 0.0
target_pos = 1.0
mass = 1.0

print(f"Step | Position | Output")
print("-" * 30)

for step in range(10):
    u = controller.compute(target_pos, current_pos)
    
    # Simple Euler integration of mass dynamics
    acceleration = u / mass
    current_vel += acceleration * dt
    current_pos += current_vel * dt
    
    print(f"{step:4d} | {current_pos:8.3f} | {u:7.2f}")

Stability and Lyapunov theory#

A controller is only useful if it is stable. Stability means that the system returns to — or stays near — a desired operating point when perturbed. Designing and verifying stability is where control theory diverges most sharply from purely empirical learning.

Python Implementation: Linear Quadratic Regulator (LQR)#

For linear systems, LQR provides the optimal control gain $K$ by solving the Discrete Algebraic Riccati Equation (DARE).

python · runs in browser

import scipy.linalg

def compute_lqr_gain(A, B, Q, R):
    """
    Computes LQR gain K for discrete-time system x_next = Ax + Bu.
    """
    # Solve DARE: P = Q + A'PA - A'PB(R + B'PB)^-1 B'PA
    P = scipy.linalg.solve_discrete_are(A, B, Q, R)
    
    # Compute K = (R + B'PB)^-1 B'PA
    K = np.linalg.inv(R + B.T @ P @ B) @ (B.T @ P @ A)
    
    return K

# Double integrator system (discrete time)
dt = 0.1
A = np.array([[1, dt], [0, 1]])
B = np.array([[0], [dt]])

# Penalize position and velocity error equally
Q = np.eye(2)
# Low penalty on control effort
R = np.array([[0.01]])

K = compute_lqr_gain(A, B, Q, R)
print(f"LQR Gain K: {K}")

Informal definition#

A system is stable (in the Lyapunov sense) if:

small disturbances produce small, bounded deviations,
the system returns to equilibrium over time (asymptotic stability).

Lyapunov functions#

A Lyapunov function $V(x)$ is an energy-like scalar function that certifies stability without requiring an explicit solution of the system's differential equations.

For a system $\dot{x} = f(x)$ with equilibrium at $x = 0$ , $V(x)$ certifies asymptotic stability if:

V(x) > 0 \quad \forall x \neq 0, \qquad V(0) = 0, \qquad \dot{V}(x) = \nabla V \cdot f(x) < 0 \quad \forall x \neq 0

The third condition — $\dot{V} < 0$ — says the "energy" is strictly decreasing along every trajectory. If energy always decreases, the system must converge to the zero-energy state.

Worked example#

Consider the scalar system $\dot{x} = -x$ with candidate Lyapunov function $V(x) = x^2$ :

\dot{V}(x) = 2x \dot{x} = 2x(-x) = -2x^2 < 0 \quad \forall x \neq 0

All three conditions are satisfied. The system is asymptotically stable, and $V(x) = x^2$ is the certificate. This example is trivial, but the same structure — find a function whose derivative is negative under the closed-loop dynamics — applies to arbitrarily complex nonlinear systems.

Control Lyapunov Functions (CLFs)#

A Control Lyapunov Function extends this idea to systems with control inputs. A CLF is a Lyapunov function for which, at every state, there exists a control input $u$ that makes $\dot{V}(x, u) < 0$ . This reframes stability from a post-hoc analysis into a constraint on the controller: choose $u$ such that $V$ decreases.

CLFs are the foundation of:

CLF-QP controllers: at each timestep, solve a small quadratic program to find the minimum-norm control input that satisfies $\dot{V} < 0$ (plus other constraints). This gives a provably stabilizing controller with low computational cost.
Neural Lyapunov functions: learn $V(x)$ as a neural network jointly with the policy, enforcing the CLF conditions as training constraints. This allows stability certification for complex nonlinear systems where analytical Lyapunov functions are unavailable (Berkenkamp et al., Chang et al.).

Control Barrier Functions (CBFs)#

Safety is the spatial analog of stability. A Control Barrier Function $h(x)$ defines a safe set $\mathcal{S} = \{x : h(x) \geq 0\}$ and enforces that the system never leaves it. The CBF condition requires:

\dot{h}(x, u) \geq -\alpha(h(x))

for some class- $\mathcal{K}$ function $\alpha$ . Like the CLF, this can be enforced as a constraint in a CBF-QP that wraps around any nominal policy and minimally modifies its output to maintain safety. CBF-QP safety filters are increasingly used to wrap learned locomotion and manipulation policies, providing hard safety guarantees without retraining the policy.

Input-to-state stability (ISS)#

Classical Lyapunov stability applies to unperturbed systems. In practice — and especially in sim-to-real — the robot always faces disturbances: model mismatch, sensor noise, terrain variation. Input-to-state stability (ISS) extends Lyapunov theory to perturbed systems: a system is ISS if bounded disturbances produce bounded state deviations, with the bound scaling continuously from zero. ISS is the theoretical foundation for understanding why domain randomization helps sim-to-real: a policy trained to be ISS-robust to a distribution of disturbances will remain stable under the bounded mismatch between simulation and hardware.

Why this matters for learning#

Classical controllers come with stability guarantees derived from Lyapunov theory. Learned controllers generally do not — a policy that achieves high reward in training may be unstable in the Lyapunov sense and fail catastrophically under small perturbations at deployment. The modern response to this gap takes two forms: (1) enforcing Lyapunov or barrier constraints during policy training, and (2) wrapping a learned policy in a CLF-QP or CBF-QP safety filter that provides guarantees without modifying the policy itself.

Joint-space vs. task-space control#

Robots can be controlled in different coordinate systems, and the choice has significant consequences for both controller design and learned policy architecture.

Joint-space control#

Control objectives are defined directly in joint space. The controller computes torques as a function of joint position and velocity errors:

\tau = f(q_d - q, \, \dot{q}_d - \dot{q})

Simple and directly matched to the actuator space. This is the dominant paradigm in locomotion RL, where policies output desired joint positions into the PD layer described above.

Task-space (operational-space) control#

Control objectives are defined in Cartesian space. The end-effector pose error is mapped to joint torques via the Jacobian and the robot dynamics.

The operational-space controller (Khatib, 1987) computes:

\tau = J^T F_{\text{task}} + (I - J^T J^{+T}) \tau_0

where $F_{\text{task}}$ is the desired Cartesian force/wrench, $J^+$ is the pseudoinverse Jacobian, and $\tau_0$ is a secondary joint-space objective (joint limit avoidance, posture control) projected into the null space of the task. This structure separates the primary task (end-effector control) from secondary objectives, which is the conceptual foundation of hierarchical manipulation policy design.

Connection to learning#

The action space of a learned policy determines which control regime it operates in, and this choice propagates directly into how the RL reward function must be designed.

Locomotion RL policies (PPO on quadrupeds) typically output desired joint positions → executed by joint-space PD control. In this paradigm, rewards must be formulated in joint space or in terms of quantities that are computable from joint-space targets: joint tracking error $\|q_d - q\|^2$ , joint velocity penalties $\|\dot{q}\|^2$ , and energy consumption proxies $\|\tau \odot \dot{q}\|_1$ . The policy never directly reasons about end-effector Cartesian position, so task-space objectives (such as foot clearance or swing-foot trajectory) must be projected through the Jacobian or computed from FK within the reward evaluation.

Manipulation policies (diffusion policy, ACT, $\pi_0$ ) typically output end-effector waypoints → executed by IK plus joint PD, or by operational-space control directly. Here, rewards are naturally expressed in task space: Cartesian distance to the target object $\|x_{ee} - x_{\text{goal}}\|$ , orientation alignment $\|R_{ee}^T R_{\text{goal}} - I\|_F$ , or contact force at the end-effector. Joint-space penalties (for smooth motion or avoiding singularities, as discussed in Week 1) must be brought back through FK or IK to be evaluated, adding complexity to the reward computation.

The key asymmetry is this: joint-space policies require rewards on joint targets, while task-space policies require rewards based on Cartesian end-effector distance — bringing IK or inverse dynamics into the reward evaluation loop whenever joint-space quantities need to be penalized. This interaction between action space and reward structure is a frequently overlooked source of reward engineering difficulty in manipulation RL.

Hybrid architectures output task-space goals that are tracked by a classical task-space controller, allowing the policy to reason at the level of object interaction rather than motor commands. Understanding which layer the policy operates at determines both what geometric structure it must implicitly learn and how the reward function must be structured to provide useful gradient signal.

Impedance and admittance control#

For contact-rich tasks — manipulation, assembly, wiping, legged locomotion — impedance control is the standard control paradigm.

The core idea#

Pure position control fails under contact: if the robot commands a position that is physically blocked, it generates unbounded forces. Impedance control instead defines a desired dynamic relationship between end-effector motion error and contact force:

M_d \ddot{e} + D_d \dot{e} + K_d e = F_{\text{ext}}

where $e = x_d - x$ is the Cartesian error, $M_d$ , $D_d$ , $K_d$ are the desired virtual mass, damping, and stiffness, and $F_{\text{ext}}$ is the external contact force. Rather than commanding the robot to reach a position regardless of contact, impedance control makes the robot behave like a mass-spring-damper: it complies with contact forces while still being attracted to the desired trajectory.

Admittance control#

Impedance control is implemented in torque-controlled robots (command $\tau$ , measure $F$ ). Admittance control is the dual formulation for position-controlled robots (command $x$ , measure $F$ ): contact force measurements are used to modify the position command, achieving the same compliant behavior through position-loop modulation rather than direct torque control.

Connection to passivity and Week 2#

The passivity property introduced in Week 2 — that $\dot{M} - 2C$ is skew-symmetric — is precisely what makes energy-based impedance control stable. A passivity-based impedance controller can guarantee stability under arbitrary contact geometries because the interaction with any passive environment (a wall, a box, a human) cannot add energy to the system. This is a much stronger guarantee than Lyapunov stability at an isolated equilibrium.

Connection to learning#

Impedance control is the low-level substrate for much of contact-rich imitation learning:

Teleoperated demonstrations for assembly or insertion tasks are typically collected under impedance control, because it allows the human operator to make contact without generating damaging forces.
Learned manipulation policies often output impedance set-points ( $K_d$ , $D_d$ , $x_d$ ) rather than raw torques — allowing the policy to modulate compliance as a function of the task phase (high stiffness for precise positioning, low stiffness for compliant insertion).
Variable impedance learning (learning $K_d(t)$ as part of the policy) is an active research area in manipulation RL.

Optimal control: formalizing "best behavior"#

Control design can be framed as an optimization problem: find the sequence of actions that minimizes a cost function subject to the robot's dynamics.

General form#

\min_{u_{0:T}} \sum_{t=0}^{T} \ell(x_t, u_t) \quad \text{subject to} \quad x_{t+1} = f(x_t, u_t)

This formulation is the foundation of both classical optimal control and reinforcement learning. The cost $\ell$ in optimal control corresponds to the negative reward in RL; the dynamics constraint $f$ corresponds to the environment transition function. The structural connection is exact — RL is optimal control with an unknown dynamics model and a learned value function.

Linear Quadratic Regulator (LQR)#

For linear dynamics and quadratic cost, the optimal control problem has a closed-form solution.

Setup#

x_{t+1} = Ax_t + Bu_t, \qquad \ell(x,u) = x^\top Q x + u^\top R u

where $Q \succeq 0$ penalizes state deviation and $R \succ 0$ penalizes control effort. The optimal controller is linear: $u^* = -Kx$ .

Solving for K: the Discrete Algebraic Riccati Equation#

$K$ is obtained by solving the Discrete Algebraic Riccati Equation (DARE):

P = Q + A^\top P A - A^\top P B (R + B^\top P B)^{-1} B^\top P A

K = (R + B^\top P B)^{-1} B^\top P A

$P$ is the solution to a fixed-point equation — the optimal value function for the infinite-horizon LQR problem. This is directly analogous to the Bellman equation in reinforcement learning:

V^*(x) = \min_u \bigl[ \ell(x, u) + V^*(f(x,u)) \bigr]

LQR solves the Bellman equation in closed form for the linear-quadratic case. The Riccati equation is the algebraic fixed point of the Bellman operator under linearity and quadratic cost. This connection is not coincidental — the value function $V(x) = x^\top P x$ is also a Lyapunov function for the closed-loop system, linking optimal control, stability theory, and RL in a single object.

Properties#

LQR is globally optimal for linear systems, guaranteed stable under mild conditions on $(A, B, Q, R)$ , and computationally cheap once $P$ is solved offline. It is widely used as a local controller around an operating point for nonlinear systems — linearize, solve DARE, apply $u = -Kx$ — and as a baseline against which more complex controllers are benchmarked.

Iterative LQR (iLQR) and Differential Dynamic Programming#

Real robots are nonlinear. iLQR extends LQR to nonlinear systems through iterative local linearization.

Algorithm structure#

iLQR alternates between two passes over a nominal trajectory $(x_{0:T}^*, u_{0:T}^*)$ :

Backward pass: starting from the terminal cost, compute a quadratic approximation of the value function at each timestep by linearizing the dynamics $f$ around the current trajectory. This produces local feedback gains $K_t$ and feedforward corrections $k_t$ .

Forward pass: apply the updated control law $u_t = u_t^* + K_t(x_t - x_t^*) + \alpha k_t$ (with step size $\alpha$ ) to generate a new trajectory. The new trajectory has lower cost than the previous one.

Iterate until convergence. The result is a locally optimal trajectory with a stabilizing feedback policy defined along it.

Differential Dynamic Programming (DDP)#

DDP is the second-order extension of iLQR: it includes second-order derivatives of the dynamics in the backward pass, giving faster convergence at the cost of more computation. In practice, iLQR (which drops the second-order dynamics terms) is used more widely because the first-order approximation is sufficient for most robot problems and the Hessian of the dynamics is expensive to compute.

Key limitation#

Both iLQR and DDP assume smooth dynamics. Contact events — foot strikes, object grasps, surface transitions — introduce discontinuities that violate this assumption and cause the local linearization to fail across contact boundaries. This motivates:

Contact-implicit trajectory optimization (e.g. TRAJOPT, DIRTREL), which handles contact mode sequences explicitly.
Learned dynamics models that smooth over contact discontinuities.
RL, which sidesteps the need for differentiable dynamics entirely by using zeroth-order gradient estimates (policy gradient).

iLQR is a direct ancestor of trajectory optimization in MuJoCo (via MJPC), shooting methods in model-based RL, and the local policy improvement step in Guided Policy Search.

Model Predictive Control (MPC)#

MPC applies optimal control online by solving a finite-horizon optimization problem at every timestep, using the current state as the initial condition.

Core loop#

At each time step:

Observe current state $x_t$
Solve: $\min_{u_{t:t+H}} \sum_{k=0}^{H} \ell(x_{t+k}, u_{t+k})$ subject to $x_{t+k+1} = f(x_{t+k}, u_{t+k})$
Execute only the first action $u_t^*$
Discard the rest of the plan; replan at $t+1$

Why replan every step?#

Models are imperfect. By resolving the optimization at each step, MPC continuously corrects for:

accumulated prediction error,
unexpected disturbances,
state estimation updates.

This receding-horizon structure gives MPC much of the robustness of feedback control while retaining the optimality guarantees of trajectory optimization.

Sampling-based MPC: MPPI#

Classical MPC solves the optimization with gradient-based methods (iLQR, SQP). Model Predictive Path Integral (MPPI) control takes a different approach: sample thousands of random control sequences, simulate their outcomes in parallel, and compute a importance-weighted average of the samples biased toward low-cost trajectories.

u_t^* = \frac{\sum_k w_k U_k}{\sum_k w_k}, \qquad w_k \propto \exp\!\left(-\frac{1}{\lambda} S(U_k)\right)

where $S(U_k)$ is the cumulative cost of sample $k$ and $\lambda$ is a temperature parameter. MPPI:

requires no gradient through the dynamics (works with non-differentiable simulators),
handles non-smooth costs and contact naturally,
maps directly onto GPU-parallel simulation (used in aggressive driving and locomotion).

MPPI is the classical antecedent of many RL planning algorithms and the direct inspiration for CEM-based model-based RL methods.

Connections to model-based reinforcement learning#

The structural parallel between MPC and model-based RL is exact:

| Classical MPC | Learning Analogue | |---|---| | Analytical dynamics model | Learned world model (Dreamer, RSSM) | | Hand-designed cost function | Learned or specified reward function | | Receding-horizon planning | Rollout-based policy search | | iLQR / DDP backward pass | Local policy improvement (GPS, ILQR-RL) | | MPPI sampling | CEM / MPPI with learned dynamics | | Terminal value function | Learned value function (critic) |

MuZero performs planning in a learned latent space using a learned value function as the terminal cost — it is MPC with a neural dynamics model and a neural terminal value. Dreamer performs MPC over a learned RSSM world model. In locomotion, MPC is often used as a safety wrapper around an RL policy: the RL policy proposes actions and MPC projects them onto a feasible, collision-free trajectory.

Hybrid architectures: control + learning in deployed systems#

Rather than choosing between classical control and learned policies, modern deployed systems combine them in structured ways. Two architectures dominate:

Residual policy#

A classical controller $u_c(x)$ handles nominal behavior (stance control, gravity compensation, trajectory following). A learned policy $\pi_\theta(x)$ outputs a residual correction:

u = u_c(x) + \pi_\theta(x)

The classical controller provides stability and baseline performance; the learned residual captures dynamics that the model gets wrong or behavior that is hard to hand-design (adaptive gait recovery, contact-rich dexterity). Residual policies typically train faster than end-to-end RL because the classical controller handles the bulk of the task.

Policy + low-level controller stack#

The learned policy operates at a higher abstraction level and outputs targets for a classical low-level controller:

q_d = \pi_\theta(x), \qquad \tau = K_p(q_d - q) + K_d(\dot{q}_d - \dot{q})

This is the dominant architecture in locomotion RL. The policy reasons about gait, balance, and terrain at 50 Hz; the PD controller handles torque execution at 500–1000 Hz. The separation of timescales provides robustness: the fast PD loop rejects high-frequency disturbances that the policy never sees.

Understanding these architectures — not just the individual components — is what allows a robotics engineer to design systems that are both learnable and deployable.

Key takeaways#

The concepts in this lecture form a progression from local to global, and from classical to learned:

PID and joint-space PD control are the lowest layer of almost every real robot, including the action interface through which learned policies execute. Stability and Lyapunov theory provide the mathematical framework for certifying that controllers work — and motivate CLFs, CBFs, and neural Lyapunov functions as tools for extending those guarantees to learned systems. Impedance control is the standard paradigm for contact-rich manipulation and the substrate on which most imitation learning for assembly is built. Task-space and operational-space control generalize joint-space control to Cartesian objectives, and are the execution layer for manipulation policies that reason about end-effector pose. LQR is optimal control in closed form — and the Riccati equation is the Bellman equation under linearity and quadratic cost, directly linking control theory and RL. iLQR and DDP extend optimality to nonlinear systems through iterative linearization, at the cost of assuming smooth dynamics. MPC closes the loop by resolving the optimization online, giving robustness through feedback. MPPI removes the smoothness assumption through sampling, bridging MPC and gradient-free RL. And hybrid architectures — residual policies and policy + controller stacks — are how these classical and learned components are combined in systems that actually deploy.

Conceptual questions#

A locomotion policy outputs desired joint positions at 50 Hz, which are tracked by a PD controller at 500 Hz. What happens to stability if the PD gains $K_p$ and $K_d$ are set too high? Too low? How does the choice of gains interact with the policy's training dynamics in simulation?
You have a Lyapunov function $V(x) = x^\top P x$ for a linear system. Show that $V$ is also a valid Lyapunov function for the LQR-controlled closed-loop system. What does this imply about the relationship between the optimal cost-to-go and stability certification?
A manipulation policy trained with behavioral cloning outputs end-effector positions. At deployment, these are tracked with a stiff position controller. The robot is asked to insert a peg into a hole. Describe the failure mode, and explain how switching to impedance control would change the behavior.
iLQR converges to a locally optimal trajectory for a robot arm reaching task. The same algorithm is applied to a legged robot performing a jumping motion. Describe specifically why iLQR is likely to fail for the jumping task and what alternative approach you would use.
A learned locomotion policy achieves excellent sim performance but fails at deployment when the terrain is slightly different from training. You decide to wrap it in a CBF-QP safety filter. What does the CBF need to represent, what constraint does it enforce, and what does it not protect against?

Solutions

PD gains. Too-high $K_p,K_d$ make the loop stiff and can excite high-frequency oscillation or instability (worsened by the 50 Hz command rate and actuator delay) and fight the policy; too-low gains track sluggishly, so commanded positions are never realized and the policy's intended motion is lost. In sim the policy adapts to whatever gains it trains with, so mismatched deploy gains break transfer — the gains define the effective action space the policy experiences.
Lyapunov for LQR. With $V(x)=x^\top P x$ ( $P$ the Riccati solution) and the closed loop $u=-Kx$ , $\dot V = -x^\top(Q + K^\top R K)x < 0$ , so $V$ is a valid Lyapunov function. Thus the optimal cost-to-go is itself a stability certificate — for LQR, optimality implies stability.
Stiff position control for peg insertion. A stiff controller commands exact positions, so tiny pose errors generate large contact forces and the peg wedges against the hole edge with no way to comply — jamming or excessive force. Impedance control makes the arm behave like a spring-damper, so misalignment produces gentle corrective forces and the peg settles in, trading position accuracy for force regulation.
iLQR on jumping. iLQR linearizes around a trajectory and assumes smooth, differentiable dynamics; jumping makes and breaks contact (discontinuous hybrid dynamics) and has an underactuated flight phase, so the linearization is invalid across mode switches and iLQR stalls or diverges. Use a contact-implicit/hybrid trajectory optimizer (or RL) that reasons over contact modes.
CBF-QP wrap. The CBF must encode the safe set as $h(x)\ge 0$ (distance to terrain/obstacles/joint limits) and enforces forward-invariance ( $\dot h \ge -\kappa h$ ) by minimally adjusting the action. It does not protect against unmodeled dynamics, state-estimation error, or inaccuracy in the $\dot h$ model, and it does not guarantee task success — it only keeps the system in the explicitly encoded safe set, and only if the QP's dynamics model is accurate.

Looking ahead#

With control foundations in place, we now ask:

What if we learn behaviors directly from data instead of hand-designing controllers?

Week 4: Teleoperation and Data Collection
We study how robot datasets are generated, logged, and prepared for imitation learning and RL — and how the control architecture at collection time shapes what policies can learn.

Purpose of this lecture#

So far, we have established:

how robots are shaped and move (kinematics),
how forces produce motion under uncertainty (dynamics and state estimation).

This lecture addresses the next essential question:

Given a state estimate and a model, how do we choose actions that make the robot behave as desired?

Feedback control: the core idea#

At the heart of robotics is feedback.

A feedback controller continuously:

measures the system state,
compares it to a desired reference,
applies corrective action.

PID control#

The most widely used controller in robotics is the PID controller.

Given an error $e(t) = x_{\text{desired}} - x(t)$ , the control input is:

u(t) = K_P e(t) + K_I \int_0^t e(\tau)\,d\tau + K_D \frac{d e(t)}{dt}

The three terms each address a different aspect of the error signal:

Proportional (P): corrects present error proportionally. Large $K_P$ gives fast response but can cause oscillation.
Integral (I): accumulates past error and eliminates steady-state bias. Essential when constant disturbances (e.g. gravity, friction) must be rejected.
Derivative (D): acts on the rate of change of error, damping oscillations and providing anticipatory correction.

The PD action interface in torque-controlled robots#

For torque-controlled robots — which includes virtually all modern legged systems and many manipulation platforms — PID typically takes the form of a joint-space PD impedance law:

\tau = K_p(q_d - q) + K_d(\dot{q}_d - \dot{q})

where $q_d$ is the desired joint position and $\tau$ is the commanded torque. The integral term is usually omitted in torque control to avoid integrator windup under contact transitions.

Python Implementation: PID Controller#

The following code implements a standard discrete-time PID controller.

python · runs in browser

import numpy as np

class PIDController:
    def __init__(self, kp, ki, kd, dt=0.01):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.dt = dt
        
        self.integral = 0
        self.prev_error = 0

    def compute(self, target, current):
        """
        Computes the PID control output.
        """
        error = target - current
        
        # Proportional term
        p_term = self.kp * error
        
        # Integral term (with basic anti-windup clamping)
        self.integral += error * self.dt
        self.integral = np.clip(self.integral, -10, 10) 
        i_term = self.ki * self.integral
        
        # Derivative term
        derivative = (error - self.prev_error) / self.dt
        d_term = self.kd * derivative
        
        # Total output
        output = p_term + i_term + d_term
        
        # Update state
        self.prev_error = error
        
        return output

# Simulation: 1D mass following a target
# Mass dynamics: x_next = x + v*dt; v_next = v + (u/mass)*dt
dt = 0.01
controller = PIDController(kp=10.0, ki=1.0, kd=2.0, dt=dt)

current_pos = 0.0
current_vel = 0.0
target_pos = 1.0
mass = 1.0

print(f"Step | Position | Output")
print("-" * 30)

for step in range(10):
    u = controller.compute(target_pos, current_pos)
    
    # Simple Euler integration of mass dynamics
    acceleration = u / mass
    current_vel += acceleration * dt
    current_pos += current_vel * dt
    
    print(f"{step:4d} | {current_pos:8.3f} | {u:7.2f}")

Stability and Lyapunov theory#

Python Implementation: Linear Quadratic Regulator (LQR)#

For linear systems, LQR provides the optimal control gain $K$ by solving the Discrete Algebraic Riccati Equation (DARE).

python · runs in browser

import scipy.linalg

def compute_lqr_gain(A, B, Q, R):
    """
    Computes LQR gain K for discrete-time system x_next = Ax + Bu.
    """
    # Solve DARE: P = Q + A'PA - A'PB(R + B'PB)^-1 B'PA
    P = scipy.linalg.solve_discrete_are(A, B, Q, R)
    
    # Compute K = (R + B'PB)^-1 B'PA
    K = np.linalg.inv(R + B.T @ P @ B) @ (B.T @ P @ A)
    
    return K

# Double integrator system (discrete time)
dt = 0.1
A = np.array([[1, dt], [0, 1]])
B = np.array([[0], [dt]])

# Penalize position and velocity error equally
Q = np.eye(2)
# Low penalty on control effort
R = np.array([[0.01]])

K = compute_lqr_gain(A, B, Q, R)
print(f"LQR Gain K: {K}")

Informal definition#

A system is stable (in the Lyapunov sense) if:

small disturbances produce small, bounded deviations,
the system returns to equilibrium over time (asymptotic stability).

Lyapunov functions#

A Lyapunov function $V(x)$ is an energy-like scalar function that certifies stability without requiring an explicit solution of the system's differential equations.

For a system $\dot{x} = f(x)$ with equilibrium at $x = 0$ , $V(x)$ certifies asymptotic stability if:

V(x) > 0 \quad \forall x \neq 0, \qquad V(0) = 0, \qquad \dot{V}(x) = \nabla V \cdot f(x) < 0 \quad \forall x \neq 0

The third condition — $\dot{V} < 0$ — says the "energy" is strictly decreasing along every trajectory. If energy always decreases, the system must converge to the zero-energy state.

Worked example#

Consider the scalar system $\dot{x} = -x$ with candidate Lyapunov function $V(x) = x^2$ :

\dot{V}(x) = 2x \dot{x} = 2x(-x) = -2x^2 < 0 \quad \forall x \neq 0

Control Lyapunov Functions (CLFs)#

CLFs are the foundation of:

CLF-QP controllers: at each timestep, solve a small quadratic program to find the minimum-norm control input that satisfies $\dot{V} < 0$ (plus other constraints). This gives a provably stabilizing controller with low computational cost.
Neural Lyapunov functions: learn $V(x)$ as a neural network jointly with the policy, enforcing the CLF conditions as training constraints. This allows stability certification for complex nonlinear systems where analytical Lyapunov functions are unavailable (Berkenkamp et al., Chang et al.).

Control Barrier Functions (CBFs)#

\dot{h}(x, u) \geq -\alpha(h(x))

Input-to-state stability (ISS)#

Why this matters for learning#

Joint-space vs. task-space control#

Robots can be controlled in different coordinate systems, and the choice has significant consequences for both controller design and learned policy architecture.

Joint-space control#

Control objectives are defined directly in joint space. The controller computes torques as a function of joint position and velocity errors:

\tau = f(q_d - q, \, \dot{q}_d - \dot{q})

Simple and directly matched to the actuator space. This is the dominant paradigm in locomotion RL, where policies output desired joint positions into the PD layer described above.

Task-space (operational-space) control#

Control objectives are defined in Cartesian space. The end-effector pose error is mapped to joint torques via the Jacobian and the robot dynamics.

The operational-space controller (Khatib, 1987) computes:

\tau = J^T F_{\text{task}} + (I - J^T J^{+T}) \tau_0

Connection to learning#

The action space of a learned policy determines which control regime it operates in, and this choice propagates directly into how the RL reward function must be designed.

Impedance and admittance control#

For contact-rich tasks — manipulation, assembly, wiping, legged locomotion — impedance control is the standard control paradigm.

The core idea#

M_d \ddot{e} + D_d \dot{e} + K_d e = F_{\text{ext}}

Admittance control#

Connection to passivity and Week 2#

Connection to learning#

Impedance control is the low-level substrate for much of contact-rich imitation learning:

Teleoperated demonstrations for assembly or insertion tasks are typically collected under impedance control, because it allows the human operator to make contact without generating damaging forces.
Learned manipulation policies often output impedance set-points ( $K_d$ , $D_d$ , $x_d$ ) rather than raw torques — allowing the policy to modulate compliance as a function of the task phase (high stiffness for precise positioning, low stiffness for compliant insertion).
Variable impedance learning (learning $K_d(t)$ as part of the policy) is an active research area in manipulation RL.

Optimal control: formalizing "best behavior"#

Control design can be framed as an optimization problem: find the sequence of actions that minimizes a cost function subject to the robot's dynamics.

General form#

\min_{u_{0:T}} \sum_{t=0}^{T} \ell(x_t, u_t) \quad \text{subject to} \quad x_{t+1} = f(x_t, u_t)

Linear Quadratic Regulator (LQR)#

For linear dynamics and quadratic cost, the optimal control problem has a closed-form solution.

Setup#

x_{t+1} = Ax_t + Bu_t, \qquad \ell(x,u) = x^\top Q x + u^\top R u

where $Q \succeq 0$ penalizes state deviation and $R \succ 0$ penalizes control effort. The optimal controller is linear: $u^* = -Kx$ .

Solving for K: the Discrete Algebraic Riccati Equation#

$K$ is obtained by solving the Discrete Algebraic Riccati Equation (DARE):

P = Q + A^\top P A - A^\top P B (R + B^\top P B)^{-1} B^\top P A

K = (R + B^\top P B)^{-1} B^\top P A

$P$ is the solution to a fixed-point equation — the optimal value function for the infinite-horizon LQR problem. This is directly analogous to the Bellman equation in reinforcement learning:

V^*(x) = \min_u \bigl[ \ell(x, u) + V^*(f(x,u)) \bigr]

Properties#

Iterative LQR (iLQR) and Differential Dynamic Programming#

Real robots are nonlinear. iLQR extends LQR to nonlinear systems through iterative local linearization.

Algorithm structure#

iLQR alternates between two passes over a nominal trajectory $(x_{0:T}^*, u_{0:T}^*)$ :

Iterate until convergence. The result is a locally optimal trajectory with a stabilizing feedback policy defined along it.

Differential Dynamic Programming (DDP)#

Key limitation#

Contact-implicit trajectory optimization (e.g. TRAJOPT, DIRTREL), which handles contact mode sequences explicitly.
Learned dynamics models that smooth over contact discontinuities.
RL, which sidesteps the need for differentiable dynamics entirely by using zeroth-order gradient estimates (policy gradient).

iLQR is a direct ancestor of trajectory optimization in MuJoCo (via MJPC), shooting methods in model-based RL, and the local policy improvement step in Guided Policy Search.

Model Predictive Control (MPC)#

MPC applies optimal control online by solving a finite-horizon optimization problem at every timestep, using the current state as the initial condition.

Core loop#

At each time step:

Observe current state $x_t$
Solve: $\min_{u_{t:t+H}} \sum_{k=0}^{H} \ell(x_{t+k}, u_{t+k})$ subject to $x_{t+k+1} = f(x_{t+k}, u_{t+k})$
Execute only the first action $u_t^*$
Discard the rest of the plan; replan at $t+1$

Why replan every step?#

Models are imperfect. By resolving the optimization at each step, MPC continuously corrects for:

accumulated prediction error,
unexpected disturbances,
state estimation updates.

This receding-horizon structure gives MPC much of the robustness of feedback control while retaining the optimality guarantees of trajectory optimization.

Sampling-based MPC: MPPI#

u_t^* = \frac{\sum_k w_k U_k}{\sum_k w_k}, \qquad w_k \propto \exp\!\left(-\frac{1}{\lambda} S(U_k)\right)

where $S(U_k)$ is the cumulative cost of sample $k$ and $\lambda$ is a temperature parameter. MPPI:

requires no gradient through the dynamics (works with non-differentiable simulators),
handles non-smooth costs and contact naturally,
maps directly onto GPU-parallel simulation (used in aggressive driving and locomotion).

MPPI is the classical antecedent of many RL planning algorithms and the direct inspiration for CEM-based model-based RL methods.

Connections to model-based reinforcement learning#

The structural parallel between MPC and model-based RL is exact:

Hybrid architectures: control + learning in deployed systems#

Rather than choosing between classical control and learned policies, modern deployed systems combine them in structured ways. Two architectures dominate:

Residual policy#

A classical controller $u_c(x)$ handles nominal behavior (stance control, gravity compensation, trajectory following). A learned policy $\pi_\theta(x)$ outputs a residual correction:

u = u_c(x) + \pi_\theta(x)

Policy + low-level controller stack#

The learned policy operates at a higher abstraction level and outputs targets for a classical low-level controller:

q_d = \pi_\theta(x), \qquad \tau = K_p(q_d - q) + K_d(\dot{q}_d - \dot{q})

Understanding these architectures — not just the individual components — is what allows a robotics engineer to design systems that are both learnable and deployable.

Key takeaways#

The concepts in this lecture form a progression from local to global, and from classical to learned:

Conceptual questions#

A locomotion policy outputs desired joint positions at 50 Hz, which are tracked by a PD controller at 500 Hz. What happens to stability if the PD gains $K_p$ and $K_d$ are set too high? Too low? How does the choice of gains interact with the policy's training dynamics in simulation?
You have a Lyapunov function $V(x) = x^\top P x$ for a linear system. Show that $V$ is also a valid Lyapunov function for the LQR-controlled closed-loop system. What does this imply about the relationship between the optimal cost-to-go and stability certification?
A manipulation policy trained with behavioral cloning outputs end-effector positions. At deployment, these are tracked with a stiff position controller. The robot is asked to insert a peg into a hole. Describe the failure mode, and explain how switching to impedance control would change the behavior.
iLQR converges to a locally optimal trajectory for a robot arm reaching task. The same algorithm is applied to a legged robot performing a jumping motion. Describe specifically why iLQR is likely to fail for the jumping task and what alternative approach you would use.
A learned locomotion policy achieves excellent sim performance but fails at deployment when the terrain is slightly different from training. You decide to wrap it in a CBF-QP safety filter. What does the CBF need to represent, what constraint does it enforce, and what does it not protect against?

Solutions

PD gains. Too-high $K_p,K_d$ make the loop stiff and can excite high-frequency oscillation or instability (worsened by the 50 Hz command rate and actuator delay) and fight the policy; too-low gains track sluggishly, so commanded positions are never realized and the policy's intended motion is lost. In sim the policy adapts to whatever gains it trains with, so mismatched deploy gains break transfer — the gains define the effective action space the policy experiences.
Lyapunov for LQR. With $V(x)=x^\top P x$ ( $P$ the Riccati solution) and the closed loop $u=-Kx$ , $\dot V = -x^\top(Q + K^\top R K)x < 0$ , so $V$ is a valid Lyapunov function. Thus the optimal cost-to-go is itself a stability certificate — for LQR, optimality implies stability.
Stiff position control for peg insertion. A stiff controller commands exact positions, so tiny pose errors generate large contact forces and the peg wedges against the hole edge with no way to comply — jamming or excessive force. Impedance control makes the arm behave like a spring-damper, so misalignment produces gentle corrective forces and the peg settles in, trading position accuracy for force regulation.
iLQR on jumping. iLQR linearizes around a trajectory and assumes smooth, differentiable dynamics; jumping makes and breaks contact (discontinuous hybrid dynamics) and has an underactuated flight phase, so the linearization is invalid across mode switches and iLQR stalls or diverges. Use a contact-implicit/hybrid trajectory optimizer (or RL) that reasons over contact modes.
CBF-QP wrap. The CBF must encode the safe set as $h(x)\ge 0$ (distance to terrain/obstacles/joint limits) and enforces forward-invariance ( $\dot h \ge -\kappa h$ ) by minimally adjusting the action. It does not protect against unmodeled dynamics, state-estimation error, or inaccuracy in the $\dot h$ model, and it does not guarantee task success — it only keeps the system in the explicitly encoded safe set, and only if the QP's dynamics model is accurate.

Looking ahead#

With control foundations in place, we now ask:

What if we learn behaviors directly from data instead of hand-designing controllers?

Purpose of this lecture#

Feedback control: the core idea#

PID control#

The PD action interface in torque-controlled robots#

Python Implementation: PID Controller#

Stability and Lyapunov theory#

Python Implementation: Linear Quadratic Regulator (LQR)#

Informal definition#

Lyapunov functions#

Worked example#

Control Lyapunov Functions (CLFs)#

Control Barrier Functions (CBFs)#

Input-to-state stability (ISS)#

Why this matters for learning#

Joint-space vs. task-space control#

Joint-space control#

Task-space (operational-space) control#

Connection to learning#

Impedance and admittance control#

The core idea#

Admittance control#

Connection to passivity and Week 2#

Connection to learning#

Optimal control: formalizing "best behavior"#

General form#

Linear Quadratic Regulator (LQR)#

Setup#

Solving for K: the Discrete Algebraic Riccati Equation#

Properties#

Iterative LQR (iLQR) and Differential Dynamic Programming#

Algorithm structure#

Differential Dynamic Programming (DDP)#

Key limitation#

Model Predictive Control (MPC)#

Core loop#

Why replan every step?#

Sampling-based MPC: MPPI#

Connections to model-based reinforcement learning#

Hybrid architectures: control + learning in deployed systems#

Residual policy#

Policy + low-level controller stack#

Key takeaways#

Conceptual questions#

Looking ahead#

Further reading#

Week 3: Control Fundamentals

Purpose of this lecture#

Feedback control: the core idea#

PID control#

The PD action interface in torque-controlled robots#

Python Implementation: PID Controller#

Stability and Lyapunov theory#

Python Implementation: Linear Quadratic Regulator (LQR)#

Informal definition#

Lyapunov functions#

Worked example#

Control Lyapunov Functions (CLFs)#

Control Barrier Functions (CBFs)#

Input-to-state stability (ISS)#

Why this matters for learning#

Joint-space vs. task-space control#

Joint-space control#

Task-space (operational-space) control#

Connection to learning#

Impedance and admittance control#

The core idea#

Admittance control#

Connection to passivity and Week 2#

Connection to learning#

Optimal control: formalizing "best behavior"#

General form#

Linear Quadratic Regulator (LQR)#

Setup#

Solving for K: the Discrete Algebraic Riccati Equation#

Properties#

Iterative LQR (iLQR) and Differential Dynamic Programming#

Algorithm structure#

Differential Dynamic Programming (DDP)#

Key limitation#