Week 5: Imitation Learning

Purpose of this lecture

We now have expert demonstrations: synchronized state-action trajectories collected through teleoperation, covering a range of initial conditions and operator styles. The central question is how to turn these demonstrations into an autonomous policy — a function $\pi_\theta(a \mid s)$ that maps observations to actions without a human in the loop.

Imitation learning addresses this question by training a policy to reproduce expert behavior without an explicit reward function. It is the dominant initialization strategy for modern robot manipulation policies, VLA models, and any system that must be competitive with human-level manipulation from the start of deployment rather than after extensive autonomous exploration. This lecture analyzes the core algorithms — behavior cloning, DAgger, inverse reinforcement learning, and GAIL — from first principles, with particular attention to the distributional shift problem that unifies their failure modes and motivates their differences.

Behavior cloning: imitation as supervised learning

The basic formulation

The simplest approach to imitation learning is behavior cloning (BC): treat the demonstration dataset $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^N$ as a supervised regression problem and minimize the loss

$$\mathcal{L}_{\text{BC}}(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[\| a - \pi_\theta(s) \|^2\right]$$

for continuous action spaces, or the cross-entropy loss for discrete ones. The policy $\pi_\theta$ is typically a neural network parameterized by $\theta$ , and training proceeds with standard gradient descent on $\mathcal{L}_{\text{BC}}$ . No environment interaction, reward model, or simulator is required: the entire training pipeline is offline, using only the fixed demonstration dataset.
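As a concrete illustration of this loss, the following is a minimal sketch of behavior cloning with a linear policy $\pi_\theta(s) = K s$ trained by gradient descent. The expert gain `K_true`, the noise level, and the synthetic dataset are assumptions made up for the example; a real pipeline would use a neural network and recorded demonstrations.

```python
import numpy as np

def make_demos(n=500, seed=0):
    """Synthetic demonstrations from a hypothetical linear expert a = K_true s + noise."""
    rng = np.random.default_rng(seed)
    K_true = np.array([[-1.0, 0.5], [0.2, -0.8]])  # assumed expert gain (illustrative)
    S = rng.normal(size=(n, 2))
    A = S @ K_true.T + 0.01 * rng.normal(size=(n, 2))
    return S, A, K_true

def bc_loss(K, S, A):
    """Mean over the dataset of || a - pi(s) ||^2 for the linear policy pi(s) = K s."""
    return np.mean(np.sum((A - S @ K.T) ** 2, axis=1))

def train_bc(S, A, lr=0.1, steps=500):
    """Plain gradient descent on the BC loss; fully offline, no environment needed."""
    K = np.zeros((A.shape[1], S.shape[1]))
    for _ in range(steps):
        residual = S @ K.T - A                 # shape (N, action_dim)
        grad = 2.0 * residual.T @ S / len(S)   # gradient of the mean squared error
        K -= lr * grad
    return K

S, A, K_true = make_demos()
K_hat = train_bc(S, A)
print("final BC loss:", bc_loss(K_hat, S, A))
```

The training loop touches only the fixed arrays `S` and `A`, which is the sense in which the pipeline is offline.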

The appeal of behavior cloning is self-evident. It is stable to train — supervised regression on a fixed dataset exhibits none of the instability associated with RL training loops. It scales directly with data: more demonstrations produce better policies, and the relationship between dataset size and policy quality is relatively smooth and predictable. It is also computationally inexpensive relative to RL-based alternatives, requiring only forward and backward passes through the policy network rather than environment rollouts.

For tasks where the demonstration dataset is large, diverse, and covers the deployment distribution well, behavior cloning produces excellent policies. Many of the most capable recent manipulation policies — including ACT, diffusion policies, and the supervised training stage of VLA models — are fundamentally behavior-cloned models trained on large teleoperation datasets. The issues with behavior cloning are not with its performance within the training distribution. They are with what happens when the policy inevitably encounters states outside that distribution.

The distributional shift problem

At training time, behavior cloning minimizes expected loss under the demonstration distribution $d_{\text{demo}}(s)$ , the marginal distribution over states visited by the expert. At test time, the policy induces its own state visitation distribution $d_{\pi_\theta}(s)$ . If $\pi_\theta$ is not a perfect copy of the expert, the trajectory it generates will deviate from the expert's trajectory over time, visiting states that the expert never visited and that therefore have no corresponding actions in the training data. The policy must act in these states by extrapolating from its training distribution, and that extrapolation will generally be poor.

The formal analysis, due to Ross and Bagnell (2010), quantifies the cost of this mismatch. Let the per-step loss under the training distribution be $\epsilon = \mathbb{E}_{s \sim d_{\text{demo}}}[\ell(\pi_\theta, s)]$ . The total loss of the behavior-cloned policy over a horizon of $T$ steps is bounded by

$$J(\pi_\theta) \leq J(\pi_E) + T \epsilon + O(T^2 \epsilon \cdot \text{TV}(d_{\pi_\theta}, d_{\text{demo}}))$$

where $\text{TV}(\cdot, \cdot)$ is the total variation distance between the state distributions. The quadratic $T^2$ term is the critical observation: for long-horizon tasks (large $T$ ), even a small per-step error $\epsilon$ admits a cumulative cost that grows quadratically with the horizon. With 1% per-step error on a 200-step manipulation task, the quadratic term is of order $\epsilon T^2 = 0.01 \times 200^2 = 400$ in units of the per-step loss scale, compared with $\epsilon T = 2$ for the linear term. If the per-step loss is at most 1, the total cost can never exceed $T = 200$ , so a bound of 400 guarantees nothing: a policy that appears accurate in isolation has no protection against catastrophic failure on long tasks.
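The arithmetic can be checked with a few lines of Python. This sketch drops the constants hidden in the $O(\cdot)$ notation and the total variation factor, so the numbers are orders of magnitude only; the function names are chosen for this example.

```python
def linear_term(eps, T):
    """Order of the T * eps term in the bound."""
    return eps * T

def quadratic_term(eps, T):
    """Order of the T^2 * eps term, ignoring constants and the TV factor."""
    return eps * T ** 2

eps = 0.01
for T in (10, 50, 200):
    lin, quad = linear_term(eps, T), quadratic_term(eps, T)
    # With per-step loss in [0, 1], total cost cannot exceed T,
    # so any bound above T carries no information.
    vacuous = quad > T
    print(f"T={T:4d}  linear={lin:6.2f}  quadratic={quad:8.2f}  vacuous={vacuous}")
```

For short horizons the quadratic term is still informative; it stops being so once $\epsilon T$ exceeds 1, which for $\epsilon = 0.01$ happens beyond $T = 100$.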

The mechanism is compounding errors: the policy makes a small mistake that moves it slightly off the demonstrated path; this new state is slightly out-of-distribution, causing a slightly larger error; this larger error moves the policy further off the path; and so on in a feedback loop that drives the policy rapidly toward failure. The dynamics of compounding errors are particularly severe in robotics because the physical state space has no built-in regularization — a robot arm that moves 1 cm in the wrong direction will continue moving in the wrong direction unless the policy applies a corrective action.

Dataset aggregation: DAgger

Motivation and algorithm

The compounding error problem has a precise diagnosis: the behavior-cloned policy trains on $d_{\text{demo}}(s)$ but deploys under $d_{\pi_\theta}(s)$ , and these distributions diverge as training errors accumulate. The direct fix is to collect training data from the deploying distribution: run the learned policy in the environment, observe which states it visits, obtain expert action labels for those states, and retrain.

DAgger (Dataset Aggregation; Ross et al., 2011) implements this fix as an iterative algorithm. The procedure is: initialize by training $\pi_1$ on the original expert dataset $\mathcal{D}_1$ . At each subsequent iteration $n$ , roll out a mixture policy $\pi_{\text{mix}} = (1 - \beta_n)\pi_n + \beta_n \pi_E$ in the environment (using the expert with probability $\beta_n$ to prevent catastrophic failures during data collection), collect the resulting trajectory $\{s_t\}$ , query the expert for action labels $\{a^*_t = \pi_E(s_t)\}$ at all visited states, aggregate the new labeled data into the dataset $\mathcal{D}_{n+1} = \mathcal{D}_n \cup \{(s_t, a^*_t)\}$ , and retrain $\pi_{n+1}$ on $\mathcal{D}_{n+1}$ . As $\beta_n \to 0$ with iterations, the mixture shifts from primarily expert-driven to primarily policy-driven exploration.
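The procedure can be sketched on a toy one-dimensional problem. Everything concrete here is an assumption for illustration: the integrator dynamics in `step`, the proportional-controller `expert`, the linear policy class, and the schedule $\beta_n = 0.5^n$. The structure of the loop (roll out the mixture, relabel visited states with the expert, aggregate, refit) is the part that corresponds to the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(s):
    """Hypothetical expert: proportional controller toward the origin."""
    return -0.5 * s

def step(s, a):
    """Assumed toy dynamics: an integrator with small process noise."""
    return s + a + 0.05 * rng.normal()

def fit(states, actions):
    """Least-squares fit of a linear policy a = w * s + b."""
    X = np.column_stack([states, np.ones(len(states))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(actions), rcond=None)
    return coef

def policy(coef, s):
    return coef[0] * s + coef[1]

def rollout(coef, beta, T=30):
    """Run the mixture policy: expert with probability beta, learner otherwise."""
    s, visited = rng.uniform(-2, 2), []
    for _ in range(T):
        visited.append(s)
        a = expert(s) if rng.random() < beta else policy(coef, s)
        s = step(s, a)
    return visited

# D_1: expert-only demonstrations (beta = 1 means the learner is never used).
states = rollout(coef=np.zeros(2), beta=1.0)
actions = [expert(s) for s in states]
coef = fit(states, actions)

for n in range(1, 6):
    beta = 0.5 ** n                               # assumed decay schedule
    new_states = rollout(coef, beta)              # states visited by the mixture
    states += new_states
    actions += [expert(s) for s in new_states]    # expert labels at learner states
    coef = fit(states, actions)                   # retrain on the aggregate

print("learned (w, b):", coef)
```

Because the toy expert is itself linear, plain behavior cloning would already recover it here; the example shows the data flow rather than a case where DAgger is needed.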

The key property that DAgger provides is that the training distribution converges to the policy's own state visitation distribution. Under mild assumptions on the expert and the policy class, DAgger achieves a regret bound that scales linearly (not quadratically) with the horizon $T$ :

$$J(\pi_N) \leq J(\pi_E) + O(T \cdot \epsilon_N)$$

where $\epsilon_N$ is the per-step supervised loss after $N$ iterations. The improvement from $T^2\epsilon$ (behavior cloning) to $T\epsilon_N$ (DAgger) is the formal gain from eliminating distributional shift.

Practical challenges

DAgger's requirement for expert labels at states generated by the learner's trajectories creates an operational bottleneck for physical robots. In simulation, this is trivial: the expert policy (a hand-crafted controller or privileged-state oracle) can be queried automatically at any state. On a physical robot, the "expert" is a human operator who must label each state, which requires either displaying the state to the operator and receiving a label in real time (tedious and cognitively demanding) or recording the trajectory and labeling offline (expensive in human time).

Practical implementations of DAgger on physical robots often use HG-DAgger (human-gated DAgger), in which the operator observes the policy executing and intervenes to provide corrections only when they detect an impending failure — rather than labeling every state. This reduces the annotation burden while still collecting data at the critical states where the policy diverges from expert behavior.
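A minimal sketch of the gating idea follows. The `gate` function stands in for the human's decision to take over and is simulated here by a simple threshold on the state; the drifting `policy`, the `expert`, and the dynamics `s + a` are likewise hypothetical placeholders, not part of any HG-DAgger implementation.

```python
def gated_rollout(policy, expert, gate, s0, T=50):
    """Run the learner, hand control to the expert while the gate is active,
    and record labels only during those interventions."""
    s, corrections = s0, []
    for _ in range(T):
        if gate(s):
            a = expert(s)                # operator takes over
            corrections.append((s, a))   # only intervention states are labeled
        else:
            a = policy(s)                # learner acts unassisted
        s = s + a                        # assumed toy dynamics
    return corrections

policy = lambda s: 0.1          # hypothetical flawed learner that drifts right
expert = lambda s: -0.5 * s     # hypothetical corrective controller
gate = lambda s: abs(s) > 1.0   # stand-in for "operator sees impending failure"

corrections = gated_rollout(policy, expert, gate, s0=0.0)
print(len(corrections), "labeled states out of 50 steps")
```

The recorded pairs concentrate on states where the learner has already drifted, which are the states the original demonstrations are least likely to cover.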

Inverse reinforcement learning

The reward identification problem

Behavior cloning and DAgger both assume that the correct action at any state is the one the expert chose. This assumption fails for stochastic optimal behavior: if multiple actions are equally optimal at a given state (e.g., reaching for an object from slightly different angles), the expert's chosen action is a sample from an equivalence class of optimal actions, and training the policy to replicate the specific sample rather than any member of the class imposes unnecessary constraint.

Inverse reinforcement learning (IRL) takes a fundamentally different approach. Rather than learning a policy directly, IRL asks: what reward function $r(s, a)$ would make the expert's observed behavior optimal? Given the expert trajectories, IRL recovers a reward function $r$ such that the expert's policy is approximately the solution to

$$\pi_E \approx \arg\max_\pi \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$$

The recovered reward function can then be used to train a policy via standard RL, potentially discovering behaviors that generalize better than direct imitation because the policy can find novel trajectories that achieve high reward even when they differ from the expert's specific demonstrations.

The IRL problem is formally underdetermined: an uncountable family of reward functions is consistent with any finite set of demonstrations (the constant reward function explains all behavior, for instance). Classical approaches to resolving this ambiguity include maximum entropy IRL (Ziebart et al., 2008), which selects the reward function that maximizes the entropy of the induced policy distribution subject to matching observed feature expectations, and Bayesian IRL, which maintains a posterior over reward functions conditioned on the demonstrations.
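To make feature matching concrete: for a reward that is linear in features, $r(s) = \theta^\top \phi(s)$, the maximum entropy log-likelihood gradient is the expert's feature expectation minus the feature expectation under the policy induced by the current reward. The sketch below computes that difference. The feature map `phi` and both trajectory sets are invented for illustration; in an actual MaxEnt IRL loop the policy trajectories would come from solving for the policy under the current $\theta$, which is not shown here.

```python
import numpy as np

def phi(s):
    """Hypothetical state features: position and squared position."""
    return np.array([s, s ** 2])

def feature_expectation(trajectories, gamma=1.0):
    """Average discounted feature sum over a set of state trajectories."""
    total = np.zeros(2)
    for traj in trajectories:
        for t, s in enumerate(traj):
            total += (gamma ** t) * phi(s)
    return total / len(trajectories)

def maxent_gradient(expert_trajs, policy_trajs, gamma=1.0):
    """Expert feature expectation minus policy feature expectation."""
    return feature_expectation(expert_trajs, gamma) - feature_expectation(policy_trajs, gamma)

expert_trajs = [[1.0, 1.0], [1.0, 0.5]]   # placeholder expert state sequences
policy_trajs = [[0.0, 0.0], [0.0, 0.0]]   # placeholder rollouts under the current reward

theta = np.zeros(2)
theta = theta + 0.1 * maxent_gradient(expert_trajs, policy_trajs)  # one ascent step
print("updated reward weights:", theta)
```

The gradient vanishes exactly when the two feature expectations match, which is the constraint the maximum entropy formulation imposes.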

Despite its theoretical appeal, direct IRL is computationally expensive: recovering $r$ from demonstrations typically requires solving an RL problem in an inner loop (to compute the policy induced by candidate reward functions), and this inner-loop RL must converge for each candidate before the outer optimization over reward functions can proceed. This double-loop structure is impractical for large state-action spaces. IRL's most important practical legacy is not in direct use but as the foundation for GAIL.

Adversarial imitation learning: GAIL

Connecting imitation to GANs

Generative Adversarial Imitation Learning (GAIL; Ho and Ermon, 2016) reformulates IRL as an adversarial game that avoids the inner-loop RL problem by simultaneously training the policy and the reward function. The connection to IRL is exact: GAIL can be derived as an occupancy-measure matching problem — it minimizes the Jensen-Shannon divergence between the state-action occupancy measures of the expert and the policy:

$$\min_\pi \max_{D \in (0,1)^{\mathcal{S} \times \mathcal{A}}}\; \mathbb{E}_{\pi_E}\!\left[\log D(s, a)\right] + \mathbb{E}_\pi\!\left[\log(1 - D(s, a))\right] - \lambda H(\pi)$$

where $D(s,a)$ is a discriminator that assigns high probability to expert state-action pairs and low probability to policy pairs, and $H(\pi)$ is a causal entropy regularizer.

The training dynamics are:

The discriminator $D$ is a binary classifier trained to distinguish expert state-action pairs from the current policy's state-action pairs. A high discriminator score $D(s, a)$ indicates that $(s,a)$ resembles expert behavior; a low score indicates deviation.
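The policy is trained via RL to maximize the reward signal $r(s, a) = -\log(1 - D(s,a))$ . A common alternative is $\log D(s, a)$ , which ranks state-action pairs the same way but is always negative rather than always positive, so the two are not interchangeable in episodic tasks with early termination. In both cases the reward is high for state-action pairs that fool the discriminator into thinking they are expert behavior.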
As training proceeds, the policy generates trajectories increasingly similar to the expert, making discrimination harder; the discriminator adapts to distinguish more subtle differences; the policy adapts to those in turn — the same GAN training dynamic, with trajectories instead of images.
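The discriminator side of these dynamics can be sketched with a logistic classifier in NumPy. The two Gaussian clusters standing in for expert and policy state-action pairs, the learning rate, and the number of steps are assumptions for the example; the RL update of the policy against `gail_reward` is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(w, sa):
    """D(s, a) for a logistic discriminator; sa has shape (N, d)."""
    return sigmoid(sa @ w)

def discriminator_step(w, expert_sa, policy_sa, lr=0.5):
    """One gradient-ascent step on E_expert[log D] + E_policy[log(1 - D)]."""
    d_e = discriminator(w, expert_sa)
    d_p = discriminator(w, policy_sa)
    grad = (expert_sa.T @ (1.0 - d_e) / len(expert_sa)
            - policy_sa.T @ d_p / len(policy_sa))
    return w + lr * grad

def gail_reward(w, sa, eps=1e-8):
    """Reward r = -log(1 - D); eps guards against log(0)."""
    return -np.log(1.0 - discriminator(w, sa) + eps)

def with_bias(x):
    return np.column_stack([x, np.ones(len(x))])

rng = np.random.default_rng(0)
# Placeholder data: expert and policy (s, a) pairs drawn from different clusters.
expert_sa = with_bias(rng.normal(loc=1.0, scale=0.5, size=(200, 2)))
policy_sa = with_bias(rng.normal(loc=-1.0, scale=0.5, size=(200, 2)))

w = np.zeros(3)
for _ in range(200):
    w = discriminator_step(w, expert_sa, policy_sa)

print("mean reward on expert-like pairs:", gail_reward(w, expert_sa).mean())
print("mean reward on policy pairs:     ", gail_reward(w, policy_sa).mean())
```

In a full GAIL loop the policy samples would be regenerated after every policy update, so the discriminator is always trained against the current policy rather than a fixed set.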

GAIL is more robust to distributional shift than behavior cloning because the discriminator provides reward at every state-action pair the policy visits during its RL training, including states far from the expert's trajectory. A policy that wanders into a non-demonstrated state receives a low reward signal from the discriminator, which the RL step treats as a negative outcome and moves away from — providing a correction mechanism that behavior cloning completely lacks.

The cost of this improvement is computational. GAIL requires on-policy RL during training, which means running the policy in the environment (or a simulator) at each iteration. For physical robots, this is expensive; for tasks where high-fidelity simulation is available, GAIL is feasible and has been demonstrated on complex locomotion and manipulation tasks.

Offline imitation from large datasets

The computational demands of DAgger and GAIL conflict with the operational realities of physical robot deployment. This conflict has motivated a third paradigm: offline imitation, in which the policy is trained entirely on a fixed, pre-collected dataset with no environment interaction during training.

Offline imitation is structurally identical to behavior cloning — it is supervised learning on a fixed dataset — but the scale and diversity of the dataset are qualitatively different. While early behavior cloning datasets contained hundreds to thousands of demonstrations on a single task, modern large-scale robot learning datasets aggregate millions of demonstrations across hundreds of tasks, multiple robot embodiments, and diverse environmental conditions. At sufficient scale and diversity, the behavioral coverage becomes broad enough that the distributional shift problem is mitigated not by algorithmic corrections but by brute-force data coverage: the policy has seen enough variation that any state encountered at deployment is close to something in the training distribution.

This approach, scaling offline imitation with dataset breadth, is the strategy employed by foundation models for manipulation. Models like ACT (Action Chunking with Transformers) and $\pi_0$ are behavior-cloned models at their core, trained on large, carefully curated teleoperation datasets. Their capability comes not from algorithmic innovations that solve distributional shift in the traditional sense, but from datasets large and diverse enough that the effective distributional shift is small.

The remaining challenge for offline imitation is distribution mismatch between dataset and deployment: no finite dataset covers the full deployment distribution perfectly. This gap is addressed in practice by (a) fine-tuning the pretrained policy on task-specific data, (b) applying RL-based post-training to improve beyond the demonstrated distribution (analogous to RLHF in language model alignment), and (c) incorporating explicit uncertainty quantification or fallback mechanisms.

GenAI context: imitation learning as supervised fine-tuning

The relationship between imitation learning and modern language model training is more than superficial. Behavior cloning is formally identical to supervised fine-tuning (SFT) in the LLM alignment pipeline: both minimize cross-entropy (or mean squared error) between a policy's outputs and a target distribution derived from expert behavior. The expert in SFT is a human writer producing instruction-following demonstrations; the expert in behavior cloning is a human operator producing manipulation demonstrations.

The distributional shift problem in imitation learning has a close counterpart in the problem that motivates RLHF: a model fine-tuned by SFT alone performs well on prompts similar to the fine-tuning distribution but degrades on novel prompts, and because generation is sequential, each token conditions on the model's own earlier outputs, so small errors can compound in the same way as the $T^2$ failure in behavior cloning. RLHF applies RL-based correction analogous to GAIL or DAgger: it collects data from the model's own output distribution and applies gradient signals derived from that distribution.

The offline imitation strategy of scaling dataset diversity maps onto the observation that pre-training large language models on diverse web-scale text produces representations that generalize broadly, reducing the distributional shift seen during fine-tuning. Both robotics and NLP converge on the same conclusion: algorithmic fixes for distributional shift help at the margin, but broad data coverage is the dominant factor.

Key takeaways

Behavior cloning treats imitation as supervised regression on expert state-action pairs. Its fundamental limitation is distributional shift: the policy trains on the expert's state distribution but deploys on its own, and compounding errors cause the two distributions to diverge with a cost that scales quadratically in the task horizon. DAgger resolves this by collecting training data from the policy's own trajectories with expert action labels, converting the quadratic bound to a linear one. The practical barrier for DAgger on physical robots is the cost of expert labeling at learner-induced states. Inverse reinforcement learning recovers a reward function from demonstrations rather than a policy directly, enabling generalization to novel trajectories that achieve high reward; its computational cost (inner-loop RL) is prohibitive at scale. GAIL operationalizes IRL as an adversarial game, training a discriminator as the reward signal and the policy as the generator; it requires on-policy RL but avoids the double-loop structure of direct IRL. Offline imitation at scale — training behavior-cloned models on large, diverse datasets — mitigates distributional shift through data coverage rather than algorithmic correction, which is the strategy underlying modern foundation models for robotics.

Conceptual questions

A robot policy trained by behavior cloning achieves 98% per-step action accuracy on held-out demonstration data (measured as the fraction of steps where the predicted action is within 2% of the expert action). Estimate its expected success rate on a manipulation task requiring 150 sequential steps, using the quadratic error compounding bound. What per-step accuracy would be required to achieve a 90% success rate at this horizon? Discuss why the quadratic bound may be too pessimistic in practice and under what conditions the actual degradation approaches the bound.
DAgger converges to training on the policy's own state visitation distribution. An engineer argues that this makes DAgger equivalent to on-policy RL with expert-labeled rewards and that the two methods should produce policies of equal quality given the same number of environment interactions. Identify the assumptions under which this equivalence holds and the conditions under which DAgger is expected to outperform RL (and vice versa). Where does IRL fit in this comparison?
GAIL trains a discriminator to distinguish expert and policy trajectories and uses the discriminator's output as a reward signal. If the expert demonstrations contain a consistent systematic bias — for example, the expert always grasps objects from the left side — analyze how GAIL will behave when deployed in an environment where the object's left side is inaccessible. Compare GAIL's behavior in this scenario to behavior cloning and to standard RL with a ground-truth task reward.
An offline imitation learning system is trained on a dataset of 50,000 demonstrations collected across 200 manipulation tasks. After deployment, the team observes that success rate decreases from 85% during training distribution evaluation to 60% during deployment, where the object positions are slightly different from the training distribution. Describe a principled diagnostic procedure to determine whether this gap is caused by distributional shift in the observation space, distributional shift in the action space, or missing recovery behaviors. For each cause, describe the appropriate corrective intervention.
In the LLM alignment analogy, behavior cloning corresponds to SFT and RLHF corresponds to RL-based correction of the SFT policy. However, RLHF uses human preference comparisons rather than state-action demonstrations. Identify the structural difference between (a) applying DAgger to a robot imitation policy and (b) applying RLHF to a language model. Specifically: what plays the role of the "expert labeler," what plays the role of the "reward model," and what is lost in RLHF that DAgger retains? Use this comparison to predict a failure mode that RLHF would exhibit but DAgger would not, and vice versa.

Solutions

With per-step error $\epsilon = 0.02$ , the quadratic compounding bound on failures is $O(\epsilon T^2) = 0.02 \times 150^2 = 450$ — far above 1, so it guarantees essentially nothing (near-0 worst-case success) at $T = 150$ . To bound failure at 10% you need $\epsilon T^2 \le 0.1$ , i.e. $\epsilon \le 0.1 / 22500 \approx 4\times10^{-6}$ (≈99.9996% per step). The bound is pessimistic because it assumes every mistake pushes the agent into an unrecoverable state; in practice many states are self-correcting, so degradation is often closer to linear. It is approached only when errors truly compound and off-distribution states are catastrophic with no recovery.
The equivalence holds only if the expert is optimal and matching its actions is exactly the task objective. DAgger wins when a near-optimal expert is available and the reward is sparse/hard to optimize (RL is sample-inefficient); RL wins when the expert is suboptimal or biased and the true reward lets the agent surpass it. IRL sits in between — it recovers a reward from demonstrations and then runs RL, useful when you want to generalize the expert's intent beyond the demonstrated states.
GAIL imitates the expert's distribution, so it learns "grasp from the left." Where the left side is inaccessible, the discriminator reward keeps pushing toward infeasible left-side grasps, the policy fails to earn reward, and it may never discover the right-side option. Behavior cloning fails the same way (it copies the bias); standard RL with a ground-truth task reward can discover the right-side grasp because it optimizes task success, not expert similarity.
Diagnose by comparing distributions: (a) observation-space shift: measure the distance between train and deploy observation statistics/encoder features; fix with augmentation, domain randomization, or more diverse data. (b) action-space shift: check whether predicted actions are in-distribution yet lead to unseen states; fix by collecting demonstrations at the shifted object positions so that the actions they require are represented in the data. (c) missing recovery: roll out and see whether failures follow the policy drifting off the demo manifold without correcting; fix with DAgger (collect expert corrections on the policy's own states).
In DAgger the "expert labeler" supplies the correct action at every visited state (dense, action-level), whereas in RLHF the human supplies pairwise preferences that train a learned "reward model." What RLHF loses that DAgger keeps is per-state action-level supervision — RLHF only has a relative/scalar signal, not a demonstration of the right action. A failure mode RLHF shows but DAgger does not is reward-model over-optimization / reward hacking (verbosity, sycophancy); conversely DAgger needs an online queryable expert, so it fails on open-ended tasks where no per-state expert exists.

Looking ahead

Imitation learning initializes policies that perform well within the demonstrated distribution, but it cannot optimize beyond what the expert demonstrated or adapt online to novel conditions. Reinforcement learning is the natural complement: starting from an imitation-learned initialization, RL explores beyond the demonstrated distribution and discovers behaviors that improve on the expert.

Week 6: Reinforcement Learning for Robotics. We examine how RL algorithms — particularly off-policy actor-critic methods — are adapted for physical robot deployment under constraints of safety, limited experience, and contact-rich dynamics.
