Week 5: Imitation Learning

✦Learning Outcomes
  • Derive DAgger and explain covariate shift solutions
  • Compare imitation learning, inverse reinforcement learning, and GAIL
  • Analyze distribution shift in robot learning
  • Connect imitation learning to RLHF (Reinforcement Learning from Human Feedback) for language models
◆Prerequisites
  • Week 4: Teleoperation and data collection
  • Basic machine learning (supervised learning, regression)

Recommended: Review Week 4 before proceeding.

Purpose of this lecture

We now have expert demonstrations: synchronized state-action trajectories collected through teleoperation, covering a range of initial conditions and operator styles. The central question is how to turn these demonstrations into an autonomous policy: a function $\pi_\theta(a \mid s)$ that maps observations to actions without a human in the loop.

Imitation learning addresses this question by training a policy to reproduce expert behavior without an explicit reward function. It is the dominant initialization strategy for modern robot manipulation policies, VLA models, and any system that must be competitive with human-level manipulation from the start of deployment rather than after extensive autonomous exploration. This lecture analyzes the core algorithms — behavior cloning, DAgger, inverse reinforcement learning, and GAIL — from first principles, with particular attention to the distributional shift problem that unifies their failure modes and motivates their differences.


Behavior cloning: imitation as supervised learning

The basic formulation

The simplest approach to imitation learning is behavior cloning (BC): treat the demonstration dataset $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^N$ as the training set for a supervised regression problem and minimize the loss

$$\mathcal{L}_{\text{BC}}(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[\| a - \pi_\theta(s) \|^2\right]$$

for continuous action spaces, or the cross-entropy loss for discrete ones. The policy $\pi_\theta$ is typically a neural network parameterized by $\theta$, and training proceeds with standard gradient descent on $\mathcal{L}_{\text{BC}}$. No environment interaction, reward model, or simulator is required: the entire training pipeline is offline, using only the fixed demonstration dataset.
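As a minimal sketch of this objective, the snippet below fits a one-dimensional linear policy $a = w s + b$ by full-batch gradient descent on the squared-error loss. The scripted expert (`expert_action`), the noise level, and the hyperparameters are assumptions chosen for illustration; a real policy would be a neural network trained on teleoperation data.

```python
import random

def expert_action(s):
    # Hypothetical expert: a proportional controller driving the state to zero.
    return -0.8 * s

def train_bc(dataset, lr=0.1, epochs=500):
    """Fit a linear policy a = w*s + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(dataset)
    for _ in range(epochs):
        # Gradients of (1/n) * sum (w*s + b - a)^2 with respect to w and b.
        grad_w = sum(2.0 * (w * s + b - a) * s for s, a in dataset) / n
        grad_b = sum(2.0 * (w * s + b - a) for s, a in dataset) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def bc_loss(dataset, w, b):
    """Empirical behavior cloning loss on the dataset."""
    return sum((a - (w * s + b)) ** 2 for s, a in dataset) / len(dataset)

rng = random.Random(0)
states = [rng.uniform(-1.0, 1.0) for _ in range(200)]
# Demonstrations: expert actions with a little operator noise (assumed).
dataset = [(s, expert_action(s) + rng.gauss(0.0, 0.01)) for s in states]

w, b = train_bc(dataset)
print(f"w={w:.3f}, b={b:.3f}, loss={bc_loss(dataset, w, b):.5f}")
```

Nothing in this loop touches an environment: the dataset is fixed before training starts, which is the sense in which the pipeline is offline.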

The appeal of behavior cloning is self-evident. It is stable to train: supervised regression on a fixed dataset exhibits none of the instability associated with RL (reinforcement learning) training loops. It scales directly with data: more demonstrations generally produce better policies, and the relationship between dataset size and policy quality is relatively smooth and predictable. It is also computationally inexpensive relative to RL-based alternatives, requiring only forward and backward passes through the policy network rather than environment rollouts.

For tasks where the demonstration dataset is large, diverse, and covers the deployment distribution well, behavior cloning produces excellent policies. Many of the most capable recent manipulation policies, including ACT (Action Chunking with Transformers), diffusion policies, and the supervised training stage of VLA models, are fundamentally behavior-cloned models trained on large teleoperation datasets. The issues with behavior cloning are not with its performance within the training distribution. They are with what happens when the policy inevitably encounters states outside that distribution.

The distributional shift problem

At training time, behavior cloning minimizes expected loss under the demonstration distribution $d_{\text{demo}}(s)$, the marginal distribution over states visited by the expert. At test time, the policy induces its own state visitation distribution $d_{\pi_\theta}(s)$. If $\pi_\theta$ is not a perfect copy of the expert, the trajectory it generates will deviate from the expert's trajectory over time, visiting states that the expert never visited and that therefore have no corresponding actions in the training data. The policy must act in these states by extrapolating from its training distribution, and that extrapolation is generally poor.

The formal analysis, due to Ross and Bagnell (2010), quantifies the cost of this mismatch. Let the per-step loss under the training distribution be $\epsilon = \mathbb{E}_{s \sim d_{\text{demo}}}[\ell(\pi_\theta, s)]$. The total loss of the behavior-cloned policy over a horizon of $T$ steps is bounded by

$$J(\pi_\theta) \leq J(\pi_E) + T \epsilon + O\!\left(T^2 \epsilon \cdot \text{TV}(d_{\pi_\theta}, d_{\text{demo}})\right)$$

where $\text{TV}(\cdot, \cdot)$ is the total variation distance between the state distributions. The quadratic $T^2$ term is the critical observation: for long-horizon tasks (large $T$), even a small per-step error $\epsilon$ admits a worst-case cumulative cost that grows quadratically with the horizon. For a policy with 1% per-step error on a 200-step manipulation task, the quadratic term is of order $0.01 \times 200^2 = 400$ in units of the maximum per-step cost, which exceeds the largest total cost possible over 200 steps. The bound therefore guarantees nothing at this horizon: a policy that appears accurate in isolation can still fail catastrophically on long tasks.

The mechanism is compounding errors: the policy makes a small mistake that moves it slightly off the demonstrated path; this new state is slightly out-of-distribution, causing a slightly larger error; this larger error moves the policy further off the path; and so on in a feedback loop that drives the policy rapidly toward failure. The dynamics of compounding errors are particularly severe in robotics because the physical state space has no built-in regularization — a robot arm that moves 1 cm in the wrong direction will continue moving in the wrong direction unless the policy applies a corrective action.
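The feedback loop can be made concrete with a toy simulation. The error model below is an assumption introduced here, not part of the lecture's analysis: the action error at each step is a small constant plus a term proportional to how far the state has already drifted from the demonstrated path, and errors integrate into the state with no correction. Setting the out-of-distribution gain to zero gives plain additive drift for comparison.

```python
def simulate_drift(T, base_error=0.001, ood_gain=0.05):
    """Deviation from the demonstrated path under an assumed toy error model.

    action_error = base_error + ood_gain * deviation, and each action error
    is added directly to the deviation (no corrective action is applied).
    """
    deviation = 0.0
    history = []
    for _ in range(T):
        action_error = base_error + ood_gain * deviation
        deviation += action_error
        history.append(deviation)
    return history

additive = simulate_drift(T=200, ood_gain=0.0)      # errors simply add up
compounding = simulate_drift(T=200, ood_gain=0.05)  # errors feed back on themselves

for t in (10, 50, 100, 200):
    print(t, round(additive[t - 1], 4), round(compounding[t - 1], 4))
```

In this toy model the deviation grows geometrically, which is faster than the quadratic bound; the bound is quadratic because it assumes a bounded per-step cost. The sketch illustrates the mechanism only, not the bound itself.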


Dataset aggregation: DAgger

Motivation and algorithm

The compounding error problem has a precise diagnosis: the behavior-cloned policy trains on $d_{\text{demo}}(s)$ but deploys under $d_{\pi_\theta}(s)$, and these distributions diverge as training errors accumulate. The direct fix is to collect training data from the deployment distribution: run the learned policy in the environment, observe which states it visits, obtain expert action labels for those states, and retrain.

DAgger (Dataset Aggregation; Ross et al., 2011) implements this fix as an iterative algorithm. The procedure is: initialize by training $\pi_1$ on the original expert dataset $\mathcal{D}_1$. At each subsequent iteration $n$, roll out a mixture policy $\pi_{\text{mix}} = (1 - \beta_n)\pi_n + \beta_n \pi_E$ in the environment (using the expert with probability $\beta_n$ to prevent catastrophic failures during data collection), collect the resulting trajectory $\{s_t\}$, query the expert for action labels $\{a^*_t = \pi_E(s_t)\}$ at all visited states, aggregate the new labeled data into the dataset $\mathcal{D}_{n+1} = \mathcal{D}_n \cup \{(s_t, a^*_t)\}$, and retrain $\pi_{n+1}$ on $\mathcal{D}_{n+1}$. As $\beta_n \to 0$ with iterations, the mixture shifts from primarily expert-driven to primarily policy-driven exploration.
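The loop described above can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the scripted `expert_policy`, the one-dimensional dynamics, the single-weight linear policy class, the per-step sampling used to realize the mixture, and the schedule $\beta_n = 0.5^n$.

```python
import random

def expert_policy(s):
    # Hypothetical scripted expert: proportional controller toward s = 0.
    return -0.5 * s

def fit_linear(dataset):
    """Least-squares fit of a = w * s (closed form for a single weight)."""
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, _ in dataset)
    w = num / den if den > 0 else 0.0
    return lambda s: w * s

def rollout(policy, expert, beta, rng, horizon=20):
    """Roll out the mixture: each step is taken by the expert w.p. beta, else the learner."""
    s = rng.uniform(-1.0, 1.0)
    visited = []
    for _ in range(horizon):
        visited.append(s)
        actor = expert if rng.random() < beta else policy
        s = s + actor(s) + rng.gauss(0.0, 0.02)  # assumed toy dynamics
    return visited

def dagger(n_iters=5, rollouts_per_iter=10, seed=0):
    rng = random.Random(seed)
    dataset = []
    # Iteration 1: expert-only data (beta = 1), then train pi_1.
    for _ in range(rollouts_per_iter):
        states = rollout(expert_policy, expert_policy, 1.0, rng)
        dataset += [(s, expert_policy(s)) for s in states]
    policy = fit_linear(dataset)
    sizes = [len(dataset)]
    for n in range(1, n_iters):
        beta = 0.5 ** n  # mixing coefficient decays toward zero
        for _ in range(rollouts_per_iter):
            states = rollout(policy, expert_policy, beta, rng)
            # Label every visited state with the expert's action and aggregate.
            dataset += [(s, expert_policy(s)) for s in states]
        policy = fit_linear(dataset)  # retrain on the aggregated dataset
        sizes.append(len(dataset))
    return policy, sizes

policy, sizes = dagger()
print(sizes, policy(1.0))
```

Because the toy expert is itself linear, the learner can represent it exactly and the benefit of aggregation is not visible in the final error; the sketch is meant to show the data flow (roll out the mixture, relabel with the expert, aggregate, retrain), not a performance comparison.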

The key property that DAgger provides is that the training distribution converges to the policy's own state visitation distribution. Under mild assumptions on the expert and the policy class, DAgger achieves a regret bound that scales linearly (not quadratically) with the horizon $T$:

$$J(\pi_N) \leq J(\pi_E) + O(T \cdot \epsilon_N)$$

where $\epsilon_N$ is the per-step supervised loss after $N$ iterations. The improvement from $T^2\epsilon$ (behavior cloning) to $T\epsilon_N$ (DAgger) is the formal gain from eliminating distributional shift.
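To get a feel for the gap between the two rates, the snippet below evaluates the leading terms $\epsilon T^2$ and $\epsilon T$ at a few horizons. It drops the constants hidden in the $O(\cdot)$ notation and, as a simplifying assumption, uses the same per-step error for both methods, so the numbers indicate scaling only.

```python
def bc_bound(eps, T):
    """Leading term of the behavior cloning bound (constants dropped)."""
    return eps * T ** 2

def dagger_bound(eps, T):
    """Leading term of the DAgger bound (constants dropped)."""
    return eps * T

eps = 0.01  # assumed per-step error, taken equal for both methods
for T in (10, 50, 200):
    print(T, bc_bound(eps, T), dagger_bound(eps, T))
```

The ratio of the two terms is exactly $T$, so the advantage of training on the learner's own state distribution widens with task length.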

Practical challenges

DAgger's requirement for expert labels at states generated by the learner's trajectories creates an operational bottleneck for physical robots. In simulation, this is trivial: the expert policy (a hand-crafted controller or privileged-state oracle) can be queried automatically at any state. On a physical robot, the "expert" is a human operator who must label each state, which requires either displaying the state to the operator and receiving a label in real time (tedious and cognitively demanding) or recording the trajectory and labeling offline (expensive in human time).

Practical implementations of DAgger on physical robots often use HG-DAgger (human-gated DAgger), in which the operator observes the policy executing and intervenes to provide corrections only when they detect an impending failure — rather than labeling every state. This reduces the annotation burden while still collecting data at the critical states where the policy diverges from expert behavior.
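A simplified sketch of human-gated collection follows. The gate here is a fixed threshold on a scalar state, standing in for the operator's judgment; `novice_policy`, `expert_policy`, the threshold, and the dynamics are all hypothetical. In practice the human decides when to take over and when to hand control back.

```python
import random

def expert_policy(s):
    # Hypothetical stand-in for the human operator's corrective action.
    return -0.5 * s

def novice_policy(s):
    # Hypothetical imperfect learned policy with a constant bias.
    return -0.1 * s + 0.15

def human_gate(s, threshold=0.8):
    # Stand-in for the operator's judgment that a failure is imminent.
    return abs(s) > threshold

def collect_hg_dagger(horizon=100, seed=0):
    rng = random.Random(seed)
    s = 0.0
    corrections = []  # only states where the human intervened receive labels
    for _ in range(horizon):
        if human_gate(s):
            a = expert_policy(s)        # human takes over; the action is recorded
            corrections.append((s, a))
        else:
            a = novice_policy(s)        # novice keeps control; nothing is labeled
        s = s + a + rng.gauss(0.0, 0.02)  # assumed toy dynamics
    return corrections

corrections = collect_hg_dagger()
print(len(corrections), "labeled states out of 100 steps")
```

Only the intervention states enter the dataset, which is where the reduction in annotation burden comes from.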


Inverse reinforcement learning

The reward identification problem

Behavior cloning and DAgger both assume that the correct action at any state is the one the expert chose. This assumption fails for stochastic optimal behavior: if multiple actions are equally optimal at a given state (e.g., reaching for an object from slightly different angles), the expert's chosen action is a sample from an equivalence class of optimal actions, and training the policy to replicate the specific sample rather than any member of the class imposes unnecessary constraint.

Inverse reinforcement learning (IRL) takes a fundamentally different approach. Rather than learning a policy directly, IRL asks: what reward function $r(s, a)$ would make the expert's observed behavior optimal? Given the expert trajectories, IRL recovers a reward function $r$ such that the expert's policy is approximately the solution to

$$\pi_E \approx \arg\max_\pi \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$$

The recovered reward function can then be used to train a policy via standard RL, potentially discovering behaviors that generalize better than direct imitation because the policy can find novel trajectories that achieve high reward even when they differ from the expert's specific demonstrations.

The IRL problem is formally underdetermined: an uncountable family of reward functions is consistent with any finite set of demonstrations (the constant reward function explains all behavior, for instance). Classical approaches to resolving this ambiguity include maximum entropy IRL (Ziebart et al., 2008), which selects the reward function that maximizes the entropy of the induced policy distribution subject to matching observed feature expectations, and Bayesian IRL, which maintains a posterior over reward functions conditioned on the demonstrations.
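For reference, the maximum entropy formulation can be written compactly under two common simplifying assumptions: the reward is linear in a feature map, $r_\psi(s,a) = \psi^\top f(s,a)$, and the dynamics are deterministic. The symbols $\psi$, $f$, and $Z$ are notation introduced here for illustration. Trajectories are modeled as exponentially more likely the higher their total reward, and the gradient of the demonstration log-likelihood $\mathcal{L}(\psi)$ is the gap between expert and model feature expectations:

```latex
p_\psi(\tau) = \frac{1}{Z(\psi)} \exp\!\Big(\sum_{t} \psi^\top f(s_t, a_t)\Big),
\qquad
\nabla_\psi \mathcal{L}(\psi)
  = \mathbb{E}_{\tau \sim \mathcal{D}}\Big[\sum_{t} f(s_t, a_t)\Big]
  - \mathbb{E}_{\tau \sim p_\psi}\Big[\sum_{t} f(s_t, a_t)\Big]
```

The gradient vanishes exactly when the model's feature expectations match the expert's, which is the feature-matching constraint mentioned above. Evaluating the second expectation requires the trajectory distribution induced by the current reward, and that is the source of the inner-loop cost.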

Despite its theoretical appeal, direct IRL is computationally expensive: recovering $r$ from demonstrations typically requires solving an RL problem in an inner loop (to compute the policy induced by candidate reward functions), and this inner-loop RL must converge for each candidate before the outer optimization over reward functions can proceed. This double-loop structure is impractical for large state-action spaces. IRL's most important practical legacy is not in direct use but as the foundation for GAIL.


Adversarial imitation learning: GAIL

Connecting imitation to GANs

Generative Adversarial Imitation Learning (GAIL; Ho and Ermon, 2016) reformulates IRL as an adversarial game that avoids the inner-loop RL problem by simultaneously training the policy and the reward function. The connection to IRL is exact: GAIL can be derived as an occupancy-measure matching problem that minimizes the Jensen-Shannon divergence between the state-action occupancy measures of the expert and the policy:

$$\min_\pi \max_{D \in (0,1)^{\mathcal{S} \times \mathcal{A}}}\; \mathbb{E}_{\pi_E}\!\left[\log D(s, a)\right] + \mathbb{E}_\pi\!\left[\log(1 - D(s, a))\right] - \lambda H(\pi)$$

where $D(s,a)$ is a discriminator that assigns high probability to expert state-action pairs and low probability to policy pairs, and $H(\pi)$ is a causal entropy regularizer.

The training dynamics are:

  • The discriminator $D$ is a binary classifier trained to distinguish expert state-action pairs from the current policy's state-action pairs. A high discriminator score $D(s, a)$ indicates that $(s,a)$ resembles expert behavior; a low score indicates deviation.

  • The policy is trained via RL to maximize the reward signal $r(s, a) = -\log(1 - D(s,a))$ (or the alternative $\log D(s, a)$, which is also increasing in $D$ but differs in sign and scale). This reward is high for state-action pairs that fool the discriminator into thinking they are expert behavior.

  • As training proceeds, the policy generates trajectories increasingly similar to the expert, making discrimination harder; the discriminator adapts to distinguish more subtle differences; the policy adapts to those in turn. This is the same GAN (Generative Adversarial Network) training dynamic, with trajectories instead of images.
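The discriminator half of these dynamics can be sketched as follows, assuming each state-action pair has been reduced to a single scalar feature. That reduction, the logistic discriminator, and the Gaussian sample distributions are hypothetical choices for illustration. The discriminator is trained by gradient ascent on $\mathbb{E}_{\pi_E}[\log D] + \mathbb{E}_\pi[\log(1 - D)]$, and the reward $-\log(1 - D)$ is then read off for the policy's RL step, which is omitted here.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_discriminator(expert_x, policy_x, lr=0.1, steps=500):
    """Logistic discriminator D(x) = sigmoid(w*x + b) on a scalar feature x.

    Gradient ascent on mean_expert[log D] + mean_policy[log(1 - D)].
    """
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x in expert_x:
            g = 1.0 - sigmoid(w * x + b)   # d/dz log D(z) = 1 - D
            gw += g * x / len(expert_x)
            gb += g / len(expert_x)
        for x in policy_x:
            g = -sigmoid(w * x + b)        # d/dz log(1 - D(z)) = -D
            gw += g * x / len(policy_x)
            gb += g / len(policy_x)
        w += lr * gw
        b += lr * gb
    return lambda x: sigmoid(w * x + b)

def gail_reward(D, x, eps=1e-8):
    """Policy reward r = -log(1 - D); eps guards against log(0)."""
    return -math.log(1.0 - D(x) + eps)

rng = random.Random(0)
expert_x = [rng.gauss(1.0, 0.5) for _ in range(200)]   # hypothetical expert features
policy_x = [rng.gauss(-1.0, 0.5) for _ in range(200)]  # hypothetical policy features

D = train_discriminator(expert_x, policy_x)
print(D(1.0), D(-1.0), gail_reward(D, 1.0), gail_reward(D, -1.0))
```

In a full GAIL loop the policy samples would be regenerated from the current policy after every RL update, so the discriminator's target keeps moving.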

GAIL is more robust to distributional shift than behavior cloning because the discriminator provides reward at every state-action pair the policy visits during its RL training, including states far from the expert's trajectory. A policy that wanders into a non-demonstrated state receives a low reward signal from the discriminator, which the RL step treats as a negative outcome and moves away from, providing a correction mechanism that behavior cloning completely lacks.

The cost of this improvement is computational. GAIL requires on-policy RL during training, which means running the policy in the environment (or a simulator) at each iteration. For physical robots, this is expensive; for tasks where high-fidelity simulation is available, GAIL is feasible and has been demonstrated on complex locomotion and manipulation tasks.


Offline imitation from large datasets

The computational demands of DAgger and GAIL conflict with the operational realities of physical robot deployment. This conflict has motivated a third paradigm: offline imitation, in which the policy is trained entirely on a fixed, pre-collected dataset with no environment interaction during training.

Offline imitation is structurally identical to behavior cloning — it is supervised learning on a fixed dataset — but the scale and diversity of the dataset are qualitatively different. While early behavior cloning datasets contained hundreds to thousands of demonstrations on a single task, modern large-scale robot learning datasets aggregate millions of demonstrations across hundreds of tasks, multiple robot embodiments, and diverse environmental conditions. At sufficient scale and diversity, the behavioral coverage becomes broad enough that the distributional shift problem is mitigated not by algorithmic corrections but by brute-force data coverage: the policy has seen enough variation that any state encountered at deployment is close to something in the training distribution.

This approach, scaling offline imitation with dataset breadth, is the strategy employed by foundation models for manipulation. Models like ACT (Action Chunking with Transformers) and $\pi_0$ are behavior-cloned models at their core, trained on large, carefully curated teleoperation datasets. Their capability comes not from algorithmic innovations that solve distributional shift in the traditional sense, but from datasets large and diverse enough that the effective distributional shift is small.

The remaining challenge for offline imitation is distribution mismatch between dataset and deployment: no finite dataset covers the full deployment distribution perfectly. This gap is addressed in practice by (a) fine-tuning the pretrained policy on task-specific data, (b) applying RL-based post-training to improve beyond the demonstrated distribution (analogous to RLHF in language model alignment), and (c) incorporating explicit uncertainty quantification or fallback mechanisms.


GenAI context: imitation learning as supervised fine-tuning

The relationship between imitation learning and modern language model training is more than superficial. Behavior cloning is formally identical to supervised fine-tuning (SFT) in the LLM (large language model) alignment pipeline: both minimize cross-entropy (or mean squared error) between a policy's outputs and a target distribution derived from expert behavior. The expert in SFT is a human writer producing instruction-following demonstrations; the expert in behavior cloning is a human operator producing manipulation demonstrations.
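The shared objective can be shown with one function applied to two kinds of input. The logits and targets below are made-up numbers; the point of the sketch is only that the loss is the same negative log-likelihood of the expert's choice, whether the choice is a discrete robot action or a next token.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def imitation_nll(logits_per_step, expert_choices):
    """Mean negative log-likelihood of the expert's choices under the model."""
    total = 0.0
    for logits, target in zip(logits_per_step, expert_choices):
        total -= math.log(softmax(logits)[target])
    return total / len(expert_choices)

# Behavior cloning reading: logits over 3 discrete actions at 2 states,
# targets are the operator's chosen action indices (illustrative values).
bc_loss = imitation_nll([[2.0, 0.1, -1.0], [0.0, 1.5, 0.3]], [0, 1])

# SFT reading: logits over a 4-token vocabulary at 3 positions,
# targets are the human writer's next-token indices (illustrative values).
sft_loss = imitation_nll(
    [[1.0, 0.0, 0.0, 0.0], [0.2, 0.2, 3.0, 0.1], [0.0, 0.5, 0.5, 2.0]],
    [0, 2, 3],
)
print(bc_loss, sft_loss)
```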

The distributional shift problem in imitation learning corresponds exactly to the distributional shift problem that motivates RLHF: a model fine-tuned by SFT alone performs well on prompts similar to the fine-tuning distribution but degrades on novel prompts, exhibiting the same $T^2$ compounding failure in a sequential generation context. RLHF applies RL-based correction analogous to GAIL or DAgger: it collects data from the model's own output distribution and applies gradient signals derived from that distribution.

The offline imitation strategy of scaling dataset diversity maps onto the observation that pre-training large language models on diverse web-scale text produces representations that generalize broadly, reducing the distributional shift seen during fine-tuning. Both robotics and NLP converge on the same conclusion: algorithmic fixes for distributional shift help at the margin, but broad data coverage is the dominant factor.


Key takeaways

Behavior cloning treats imitation as supervised regression on expert state-action pairs. Its fundamental limitation is distributional shift: the policy trains on the expert's state distribution but deploys on its own, and compounding errors cause the two distributions to diverge with a cost that scales quadratically in the task horizon. DAgger resolves this by collecting training data from the policy's own trajectories with expert action labels, converting the quadratic bound to a linear one. The practical barrier for DAgger on physical robots is the cost of expert labeling at learner-induced states. Inverse reinforcement learning recovers a reward function from demonstrations rather than a policy directly, enabling generalization to novel trajectories that achieve high reward; its computational cost (inner-loop RL) is prohibitive at scale. GAIL operationalizes IRL as an adversarial game, training a discriminator as the reward signal and the policy as the generator; it requires on-policy RL but avoids the double-loop structure of direct IRL. Offline imitation at scale (training behavior-cloned models on large, diverse datasets) mitigates distributional shift through data coverage rather than algorithmic correction, which is the strategy underlying modern foundation models for robotics.


Conceptual questions

  1. A robot policy trained by behavior cloning achieves 98% per-step action accuracy on held-out demonstration data (measured as the fraction of steps where the predicted action is within 2% of the expert action). Estimate its expected success rate on a manipulation task requiring 150 sequential steps, using the quadratic error compounding bound. What per-step accuracy would be required to achieve a 90% success rate at this horizon? Discuss why the quadratic bound may be too pessimistic in practice and under what conditions the actual degradation approaches the bound.

  2. DAgger converges to training on the policy's own state visitation distribution. An engineer argues that this makes DAgger equivalent to on-policy RL with expert-labeled rewards and that the two methods should produce policies of equal quality given the same number of environment interactions. Identify the assumptions under which this equivalence holds and the conditions under which DAgger is expected to outperform RL (and vice versa). Where does IRL fit in this comparison?

  3. GAIL trains a discriminator to distinguish expert and policy trajectories and uses the discriminator's output as a reward signal. If the expert demonstrations contain a consistent systematic bias (for example, the expert always grasps objects from the left side), analyze how GAIL will behave when deployed in an environment where the object's left side is inaccessible. Compare GAIL's behavior in this scenario to behavior cloning and to standard RL with a ground-truth task reward.

  4. An offline imitation learning system is trained on a dataset of 50,000 demonstrations collected across 200 manipulation tasks. After deployment, the team observes that success rate decreases from 85% during training distribution evaluation to 60% during deployment, where the object positions are slightly different from the training distribution. Describe a principled diagnostic procedure to determine whether this gap is caused by distributional shift in the observation space, distributional shift in the action space, or missing recovery behaviors. For each cause, describe the appropriate corrective intervention.

  5. In the LLM alignment analogy, behavior cloning corresponds to SFT and RLHF corresponds to RL-based correction of the SFT policy. However, RLHF uses human preference comparisons rather than state-action demonstrations. Identify the structural difference between (a) applying DAgger to a robot imitation policy and (b) applying RLHF to a language model. Specifically: what plays the role of the "expert labeler," what plays the role of the "reward model," and what is lost in RLHF that DAgger retains? Use this comparison to predict a failure mode that RLHF would exhibit but DAgger would not, and vice versa.

✦Solutions
  1. With per-step error $\epsilon = 0.02$, the quadratic compounding bound on failures is $O(\epsilon T^2) = 0.02 \times 150^2 = 450$, far above 1, so it guarantees essentially nothing (near-0 worst-case success) at $T = 150$. To bound failure at 10% you need $\epsilon T^2 \le 0.1$, i.e. $\epsilon \le 0.1 / 22500 \approx 4.4\times10^{-6}$ (≈99.9996% per step). The bound is pessimistic because it assumes every mistake pushes the agent into an unrecoverable state; in practice many states are self-correcting, so degradation is often closer to linear. It is approached only when errors truly compound and off-distribution states are catastrophic with no recovery.
  2. The equivalence holds only if the expert is optimal and matching its actions is exactly the task objective. DAgger wins when a near-optimal expert is available and the reward is sparse/hard to optimize (RL is sample-inefficient); RL wins when the expert is suboptimal or biased and the true reward lets the agent surpass it. IRL sits in between — it recovers a reward from demonstrations and then runs RL, useful when you want to generalize the expert's intent beyond the demonstrated states.
  3. GAIL imitates the expert's distribution, so it learns "grasp from the left." Where the left side is inaccessible, the discriminator reward keeps pushing toward infeasible left-side grasps, the policy fails to earn reward, and it may never discover the right-side option. Behavior cloning fails the same way (it copies the bias); standard RL with a ground-truth task reward can discover the right-side grasp because it optimizes task success, not expert similarity.
  4. Diagnose by comparing distributions: (a) observation-space shift — measure the distance between train and deploy observation statistics/encoder features; fix with augmentation, domain randomization, or more diverse data. (b) action-space shift — check whether predicted actions are in-distribution yet lead to unseen states. (c) missing recovery — roll out and see whether failures follow the policy drifting off the demo manifold without correcting; fix with DAgger (collect expert corrections on the policy's own states).
  5. In DAgger the "expert labeler" supplies the correct action at every visited state (dense, action-level), whereas in RLHF the human supplies pairwise preferences that train a learned "reward model." What RLHF loses that DAgger keeps is per-state action-level supervision — RLHF only has a relative/scalar signal, not a demonstration of the right action. A failure mode RLHF shows but DAgger does not is reward-model over-optimization / reward hacking (verbosity, sycophancy); conversely DAgger needs an online queryable expert, so it fails on open-ended tasks where no per-state expert exists.

Looking ahead

Imitation learning initializes policies that perform well within the demonstrated distribution, but it cannot optimize beyond what the expert demonstrated or adapt online to novel conditions. Reinforcement learning is the natural complement: starting from an imitation-learned initialization, RL explores beyond the demonstrated distribution and discovers behaviors that improve on the expert.

Week 6: Reinforcement Learning for Robotics. We examine how RL algorithms, particularly off-policy actor-critic methods, are adapted for physical robot deployment under constraints of safety, limited experience, and contact-rich dynamics.


Further reading

  • Pomerleau, D. A. (1988). ALVINN: An Autonomous Land Vehicle in a Neural Network. NeurIPS. (The original Behavioral Cloning paper).
  • Ross, S., Gordon, G., & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS. (The DAgger paper and covariate shift proofs).
  • Ho, J., & Ermon, S. (2016). Generative Adversarial Imitation Learning. NeurIPS. (GAIL).