Purpose of this lecture
The course has so far built a complete pipeline for training and deploying a single robot on a well-defined task: collect demonstrations, learn a policy via imitation or reinforcement learning (RL), bridge the sim2real gap, fine-tune with efficient adaptation, and enforce safety constraints at deployment. This pipeline works. Its limitation is scalability: training a separate policy for each (task, robot) pair produces a combinatorial proliferation of models that cannot share knowledge, cannot leverage data across tasks, and require independent safety validation for each deployment configuration.
The path forward is multi-task and multi-robot learning: systems that share representations, learn skills that compose, and leverage data pooled across task and embodiment boundaries. This is the robotics analog of the transition from task-specific NLP models to general-purpose language models — the same argument that data diversity and shared parameterization produce qualitatively better generalization applies with equal force in the manipulation domain. This lecture develops the technical foundations of shared policy architectures, skill libraries, hierarchical control, policy distillation, and cross-embodiment learning.
Multi-task policies: conditioning and interference
A multi-task policy operates under the same formalism as a single-task policy, but the task identity is an additional input: $\pi_\theta(a \mid s, z)$, where $z$ is the task descriptor. The task descriptor can take several forms depending on the level of semantic richness required.
Task IDs are discrete tokens or one-hot vectors indexing into a finite set of defined tasks. They are the simplest conditioning mechanism and work well when the task set is fixed and enumerable. The policy learns a task-conditioned embedding that adapts its behavior per task ID. The limitation is poor generalization: a policy conditioned on task IDs cannot handle novel tasks not seen during training without additional fine-tuning.
Goal images provide a visual specification of the desired end state: an RGB image of the configuration the robot should achieve. Goal images are task-agnostic in the sense that the same policy can be applied to any task for which a goal image can be provided. The policy must learn to infer the required motion by comparing the current observation to the goal image — a visual correspondence problem. Goal-conditioned policies are natural for manipulation tasks but can be ambiguous when the goal is partially specified (the image shows the desired object position but not the desired approach trajectory).
Language instructions specify tasks semantically through natural language descriptions. The VLA architectures from Week 10 implement this conditioning, and language instructions are the dominant conditioning mechanism in modern large-scale robot learning because they are infinitely composable (novel task descriptions can be constructed by combining known words and concepts), require no visual goal specification, and connect naturally to the LLM representations that provide semantic grounding.
Negative transfer and task interference
The primary challenge in multi-task learning is that tasks can interfere with each other: learning task A better simultaneously makes task B worse. Negative transfer occurs when the gradient updates needed to improve performance on task A move the parameters away from the optimum for task B. This happens when tasks are dissimilar — a policy fine-tuned on overhead grasping and side grasping simultaneously may converge to a compromise trajectory that is suboptimal for both.
The geometry of the per-task gradients provides a diagnostic tool: if $g_i$ and $g_j$ have negative inner product ($g_i \cdot g_j < 0$), the tasks conflict. Conflict-mitigation algorithms can reduce interference when tasks share the same parameter space. PCGrad (Yu et al., 2020) is the most widely used: for tasks $i$ and $j$ with gradients $g_i$ and $g_j$, if the gradients conflict ($g_i \cdot g_j < 0$), PCGrad projects $g_i$ onto the normal plane of $g_j$, removing the conflicting component:
$$g_i \leftarrow g_i - \frac{g_i \cdot g_j}{\lVert g_j \rVert^2}\, g_j$$
otherwise $g_i$ is left unchanged. The final parameter update sums the projected gradients over all tasks, after each task's gradient has been projected against every other task it conflicts with. This projection preserves the component of $g_i$ orthogonal to $g_j$ (the part that improves task $i$ without harming task $j$) while discarding the conflicting component. On average, PCGrad reduces task interference by approximately 30–40% in manipulation multi-task settings at negligible computational overhead. MGDA (the multiple-gradient descent algorithm, which finds a descent direction that reduces all task losses simultaneously) provides stronger theoretical guarantees at higher cost; GradNorm (which normalizes gradient magnitudes to balance task contributions) is preferred when task losses are on very different scales.
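The projection step can be made concrete with a minimal sketch in plain Python. This is an illustration under stated assumptions, not the reference implementation: the function names `dot` and `pcgrad`, the two toy gradients, and the fixed random seed are all hypothetical, and gradients are plain lists rather than framework tensors.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pcgrad(task_grads, seed=0):
    """Project each task gradient away from the gradients it conflicts with.

    Returns (update, projected): the summed update and the per-task
    projected gradients.
    """
    rng = random.Random(seed)
    projected = []
    for i, grad in enumerate(task_grads):
        g_i = list(grad)
        others = [j for j in range(len(task_grads)) if j != i]
        rng.shuffle(others)  # other tasks are visited in random order
        for j in others:
            g_j = task_grads[j]  # always the original, unprojected gradient
            inner = dot(g_i, g_j)
            if inner < 0.0:  # conflict: remove the component along g_j
                scale = inner / dot(g_j, g_j)
                g_i = [a - scale * b for a, b in zip(g_i, g_j)]
        projected.append(g_i)
    update = [sum(column) for column in zip(*projected)]
    return update, projected


# Hypothetical two-parameter gradients for two conflicting grasp tasks.
g_overhead = [1.0, 0.0]
g_side = [-1.0, 1.0]

update, projected = pcgrad([g_overhead, g_side])
print(projected)  # [[0.5, 0.5], [0.0, 1.0]]
print(update)     # [0.5, 1.5]
```

In this toy case the naive sum of the two gradients is `[0.0, 1.0]`, so the first parameter would not move at all; after projection each task's gradient is orthogonal to the other task's original gradient, and the summed update retains a component that helps the first task.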
The more scalable solution is architectural: separate task-specific components (action heads, task-specific LoRA adapters) from shared components (visual backbone, proprioceptive encoder), so that task-specific optimization affects only task-specific parameters and cannot interfere with the shared representation.
Skill libraries and hierarchical control
Manipulation tasks have natural hierarchical structure: a "prepare coffee" task decomposes into "open cabinet," "grasp cup," "place under dispenser," "press button," "carry cup to table." Each subtask is a skill: a temporally extended behavior with clear entry and exit conditions. A system that learns individual skills and composes them to solve complex tasks acquires combinatorial reach: $K$ skills can be composed to solve up to $K^n$ distinct $n$-step tasks (in principle), whereas a flat policy must learn each combination separately.
Skill libraries are collections of parameterized reusable behaviors. Skills may be: hand-designed controllers (reach-to-grasp, place-with-contact), policies learned via imitation from demonstrations segmented by subtask, or skills discovered through unsupervised segmentation of demonstration trajectories (Dirichlet Process mixture models, changepoint detection on phase features). The task parameters that instantiate a skill — the specific object to grasp, the target location for placing — are provided by the high-level policy at skill invocation time.
In the options framework (Sutton et al., 1999), a skill (option) is a tuple $\omega = (\mathcal{I}, \pi, \beta)$ where $\mathcal{I}$ is the initiation set (states where the option can be started), $\pi$ is the intra-option policy, and $\beta$ is the termination condition (probability of terminating in each state). A high-level policy operates over the option space, selecting which option to execute based on the current state and goal, and the selected option executes until its termination condition fires.
Hierarchical architectures separate the temporal scales of decision-making: the high-level policy makes decisions at a slow timescale (e.g., one decision per 1–5 seconds), while the low-level option policy executes at the fast control frequency (50–100 Hz). This separation reduces the effective planning horizon of the high-level policy, making it easier to train, while the low-level policies are each focused on a small, well-defined subtask where specialized learning is most efficient.
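The separation of timescales can be illustrated with a small sketch of an option-execution loop. Everything here is a simplifying assumption made for illustration: the `Option` dataclass, the `make_skill` and `run` helpers, the per-skill durations, the deterministic termination rule (a termination probability of 0 or 1), and the trivial high-level policy that picks the first startable skill. The skill names are taken from the coffee example above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Option:
    name: str
    can_start: Callable[[dict], bool]    # membership test for the initiation set
    policy: Callable[[dict], str]        # intra-option (low-level) policy
    terminates: Callable[[dict], bool]   # termination condition (deterministic here)


def make_skill(name, prerequisite, seconds, control_hz):
    """Build a toy skill that runs for a fixed number of control ticks."""
    ticks = int(seconds * control_hz)
    return Option(
        name=name,
        can_start=lambda s: name not in s["done"]
        and (prerequisite is None or prerequisite in s["done"]),
        policy=lambda s: f"{name}: low-level command",
        terminates=lambda s: s["ticks_in_option"] >= ticks,
    )


def run(options):
    """Execute options until none can start; count decisions at both levels."""
    state = {"done": set(), "ticks_in_option": 0}
    decisions = 0
    low_level_steps = 0
    while True:
        startable = [o for o in options if o.can_start(state)]
        if not startable:
            break
        option = startable[0]  # placeholder for a learned high-level policy
        decisions += 1
        state["ticks_in_option"] = 0
        while not option.terminates(state):
            option.policy(state)  # in a real system this goes to the robot
            state["ticks_in_option"] += 1
            low_level_steps += 1
        state["done"].add(option.name)
    return decisions, low_level_steps


CONTROL_HZ = 50
# Durations in seconds are made up for the example.
plan = [
    ("open cabinet", 3.0),
    ("grasp cup", 2.0),
    ("place under dispenser", 3.0),
    ("press button", 1.0),
    ("carry cup to table", 4.0),
]
options = []
previous = None
for skill_name, seconds in plan:
    options.append(make_skill(skill_name, previous, seconds, CONTROL_HZ))
    previous = skill_name

decisions, low_level_steps = run(options)
print(decisions, low_level_steps)  # 5 650
```

With these assumed durations the high-level policy makes 5 decisions while the low-level policies issue 650 commands, which is the horizon reduction the paragraph above describes.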
Policy distillation
Policy distillation (Rusu et al., 2016) addresses the scenario where multiple strong task-specific policies have been trained independently and must be consolidated into a single model without retraining from scratch. The process: collect trajectories from each expert policy by rolling them out in their respective environments; construct a training dataset $\mathcal{D}$ of (state, expert action distribution) pairs; train a student policy $\pi_\theta$ by minimizing the KL divergence between the student's action distribution and the expert's:
$$\mathcal{L}_{\text{distill}}(\theta) = \tau^2 \, \mathbb{E}_{s \sim \mathcal{D}} \left[ D_{\mathrm{KL}}\!\left( \pi_E^{\tau}(\cdot \mid s) \,\|\, \pi_\theta^{\tau}(\cdot \mid s) \right) \right]$$
where $\pi^{\tau}$ denotes the distribution produced by temperature-scaling the policy's logits by $1/\tau$ before the softmax, and $\pi_E$ is the expert policy. High temperature softens both distributions, revealing the relative probability mass on non-peak actions: the "dark knowledge" (Hinton et al., 2015) that encodes the expert's implicit comparison between all actions, not just the most likely one. For a manipulation policy with a multimodal action distribution (two valid grasp orientations), the softened distribution assigns meaningful probability to both modes, giving the student a richer learning signal than hard-label imitation would. Low temperature sharpens distributions toward the mode, converging to standard behavior cloning as $\tau \to 0$. In practice, – is used for continuous action policies with multimodal outputs; the $\tau^2$ prefactor in the loss keeps the gradient magnitude roughly invariant to temperature scaling. When distilling from deterministic expert policies (which output Dirac distributions), temperature has no effect, and the objective reduces to mean squared error on the expert's action.
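The effect of temperature on the distillation target can be seen in a short sketch. It assumes a discrete action set, expert and student policies that expose raw logits, and a loss of the form $\tau^2 \, D_{\mathrm{KL}}(\text{expert}^\tau \,\|\, \text{student}^\tau)$ for a single state; the function names and the logit values are hypothetical.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax: logits are divided by tau first."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(expert_logits, student_logits, tau):
    """tau^2 * KL(expert^tau || student^tau) for one state."""
    p = softmax(expert_logits, tau)
    q = softmax(student_logits, tau)
    kl = sum(p_k * math.log(p_k / q_k) for p_k, q_k in zip(p, q) if p_k > 0.0)
    return tau ** 2 * kl


# Hypothetical expert with two good actions (two grasp orientations)
# and one clearly bad action.
expert_logits = [4.0, 3.5, -2.0]
student_logits = [1.0, 0.0, 0.5]

sharp = softmax(expert_logits, tau=1.0)
soft = softmax(expert_logits, tau=4.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
print(distill_loss(expert_logits, student_logits, tau=4.0))
```

At the higher temperature the probability of the weak third action rises and the gap between the two good actions narrows, so the student sees more of the expert's relative preferences. The loss is zero exactly when the student's softened distribution matches the expert's.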
Distillation has several advantages over multi-task training from scratch. The expert policies provide a stable training target: the student does not need to discover what behaviors are good, only to match behaviors the experts have already discovered. This makes distillation stable even when the tasks are diverse. Distillation also naturally handles capacity: a student with sufficient capacity will faithfully reproduce all expert behaviors; a student with insufficient capacity will learn the best compromise, which typically preserves the most frequently needed skills while slightly compromising on rare ones.
Progressive neural networks (Rusu et al., 2016) extend distillation to sequentially acquired skills by maintaining columns of frozen previous-task networks and using lateral connections to allow new columns to leverage previous representations. This prevents catastrophic forgetting while adding new skills — though at the cost of linear parameter growth with the number of tasks.
Shared backbones and representation learning
The scalable approach to multi-task robot learning is shared representation learning: train a single encoder (visual backbone, proprioceptive encoder, language embedding) that is shared across all tasks, with lightweight task-specific heads for action generation. The shared backbone learns a task-agnostic representation that captures the features relevant across all tasks — object poses, scene structure, contact geometry — while the task-specific heads specialize the representation for each task's action requirements.
The shared backbone quality determines how much positive transfer is achievable. A backbone trained only on robot manipulation data learns manipulation-specific features; a backbone pretrained on web-scale visual data (ImageNet, CLIP, DINOv2) learns broader visual features that generalize to novel objects, environments, and viewpoints encountered in deployment. VLA architectures exploit this by initializing the visual backbone from a pretrained VLM rather than training from scratch on robot data: the pretraining provides rich, task-agnostic visual representations that dramatically accelerate multi-task learning.
The tradeoff is task-specific representation degradation: a representation optimized for all tasks simultaneously is not optimal for any individual task. Tasks that require fine-grained spatial precision (peg insertion, fine assembly) may benefit from a more specialized visual backbone than a general multi-task encoder can provide. The task-specific adapter approach from Week 11 addresses this: the shared backbone provides a general starting point, and task-specific LoRA adapters tune the backbone features for each task's specific representational requirements.
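A minimal sketch of the shared-backbone-plus-adapter idea follows. It is a toy stand-in, not the Week 11 implementation: the "backbone" is a single linear map, the class `AdaptedBackbone` and its method names are hypothetical, the adapter's `A` matrix is set to a constant rather than randomly initialized, and the usual LoRA scaling factor is omitted. What it does preserve is the structure: frozen shared weights plus a per-task low-rank correction whose `B` matrix starts at zero.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class AdaptedBackbone:
    """Shared linear features plus one low-rank adapter per task."""

    def __init__(self, shared_weight, rank, task_names):
        self.shared_weight = shared_weight  # d_out x d_in, shared by all tasks
        d_out, d_in = len(shared_weight), len(shared_weight[0])
        # Per-task pair: A is (rank x d_in), B is (d_out x rank).
        # B starts at zero, so every adapter is initially a no-op.
        self.adapters = {
            name: {
                "A": [[0.1] * d_in for _ in range(rank)],
                "B": [[0.0] * rank for _ in range(d_out)],
            }
            for name in task_names
        }

    def features(self, x, task):
        adapter = self.adapters[task]
        base = matvec(self.shared_weight, x)
        delta = matvec(adapter["B"], matvec(adapter["A"], x))
        return [b + d for b, d in zip(base, delta)]


backbone = AdaptedBackbone(
    shared_weight=[[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]],
    rank=1,
    task_names=["peg_insertion", "bin_picking"],
)
x = [1.0, 2.0, 3.0]
before = backbone.features(x, "bin_picking")

# Stand-in for a gradient step that touches only the peg-insertion adapter.
backbone.adapters["peg_insertion"]["B"] = [[1.0], [0.0]]

peg = backbone.features(x, "peg_insertion")
after = backbone.features(x, "bin_picking")
print(before, after, peg)
```

Because the update lands only in the peg-insertion adapter, the bin-picking features are unchanged, which is the isolation property that makes task-specific adapters attractive for precision-sensitive tasks.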
Multi-robot learning: cross-embodiment generalization
The multi-robot learning problem extends multi-task learning to the additional challenge of embodiment variation: the state space, action space, kinematic constraints, and sensor characteristics differ between robots. A joint-space policy learned on a 7-DoF Franka Panda cannot be directly applied to a 6-DoF UR5 because their action spaces and kinematics differ.
Cross-embodiment learning enables knowledge sharing despite these differences by learning representations that are invariant to embodiment-specific factors while capturing the task-relevant structure that is shared. The primary approaches are:
Embodiment-conditioned policies augment the observation with an embodiment descriptor — a vector encoding the robot's kinematic parameters, joint limits, and sensor configuration. The policy learns to adapt its behavior to the current embodiment by conditioning on this descriptor, in the same way that task-conditioned policies adapt to different tasks. At deployment, providing the correct embodiment descriptor allows the policy to use appropriate joint ranges, control frequencies, and motion profiles for the specific hardware.
Cartesian action space normalization eliminates embodiment-specific joint spaces by expressing all actions in the robot's end-effector Cartesian frame, the common physical space across embodiments. A manipulation policy that outputs Cartesian end-effector delta commands is invariant to the specific joint configuration that achieves those end-effector motions, allowing the same policy to be applied to robots with different DoF and kinematics (subject to each robot realizing the commanded Cartesian motion through its own embodiment-specific inverse kinematics, IK).
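To illustrate how one Cartesian delta command maps onto different embodiments, the sketch below sends the same end-effector delta to a 2-link and a 3-link planar arm through a damped least-squares Jacobian step. The planar arms, link lengths, joint angles, and damping value are all assumptions chosen to keep the example small; real manipulators need a full 6-D Jacobian and joint-limit handling.

```python
import math


def forward_kinematics(link_lengths, q):
    """End-effector (x, y) of a planar serial arm with revolute joints."""
    x = y = angle = 0.0
    for length, q_i in zip(link_lengths, q):
        angle += q_i
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return x, y


def jacobian(link_lengths, q):
    """Rows of the 2 x n position Jacobian (d x / d q, d y / d q)."""
    n = len(q)
    cumulative, angle = [], 0.0
    for q_i in q:
        angle += q_i
        cumulative.append(angle)
    j_x = [-sum(link_lengths[k] * math.sin(cumulative[k]) for k in range(j, n))
           for j in range(n)]
    j_y = [sum(link_lengths[k] * math.cos(cumulative[k]) for k in range(j, n))
           for j in range(n)]
    return j_x, j_y


def cartesian_delta_to_joint_delta(link_lengths, q, dx, dy, damping=1e-3):
    """Damped least squares: dq = J^T (J J^T + damping^2 I)^-1 [dx, dy]."""
    j_x, j_y = jacobian(link_lengths, q)
    a = sum(v * v for v in j_x) + damping ** 2
    b = sum(u * v for u, v in zip(j_x, j_y))
    d = sum(v * v for v in j_y) + damping ** 2
    det = a * d - b * b
    w_x = (d * dx - b * dy) / det   # solve the 2 x 2 system in closed form
    w_y = (-b * dx + a * dy) / det
    return [u * w_x + v * w_y for u, v in zip(j_x, j_y)]


# The same Cartesian command (metres) sent to two hypothetical embodiments.
command = (0.001, -0.0005)
arms = {
    "two_link": ([0.4, 0.3], [0.3, 0.8]),
    "three_link": ([0.3, 0.25, 0.15], [0.2, 1.0, 0.7]),
}
achieved = {}
for arm_name, (lengths, q) in arms.items():
    dq = cartesian_delta_to_joint_delta(lengths, q, *command)
    x0, y0 = forward_kinematics(lengths, q)
    x1, y1 = forward_kinematics(lengths, [a + b for a, b in zip(q, dq)])
    achieved[arm_name] = (x1 - x0, y1 - y0)
    print(arm_name, len(dq), achieved[arm_name])
```

The joint-space deltas have different dimensions (2 and 3), yet both arms realize approximately the same end-effector motion, which is the sense in which the policy's action space is shared. Near singular configurations or joint limits this mapping degrades, and the commanded motion may not be achievable.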
The Open X-Embodiment dataset and the corresponding RT-X and Octo models demonstrate that cross-embodiment pretraining produces measurable positive transfer: a model pretrained on trajectories from 22 robot embodiments performs significantly better on a new embodiment with few-shot fine-tuning than a model pretrained only on that embodiment's data. The diversity of embodiments forces the model to learn embodiment-agnostic task representations, which generalize more broadly than single-embodiment representations.
Centralized training with decentralized execution
For systems with multiple robots operating simultaneously (a warehouse with 50 robot arms, a factory floor with 10 collaborative manipulators), training and execution have different requirements. Centralized Training, Decentralized Execution (CTDE), the paradigm established for multi-agent RL in Course 1, Week 9, applies directly to manipulation systems. Recall from Course 1 that CTDE was introduced in the context of cooperative multi-agent RL: the centralized critic in MAPPO (or the mixing network in QMIX) has access to the joint state during training, enabling it to compute value functions that correctly attribute credit across agents, while each agent's policy conditions only on local observation during execution. The same architectural split applies here with a direct substitution: agents become robot arms, local observations become each robot's proprioceptive state and local camera, and the centralized critic evaluates workspace-level collision risk and throughput jointly.
During training, the full joint state of all robots is accessible, along with their interactions and global task progress. The centralized critic evaluates the joint state $s$ and computes the global value function $V(s)$ or joint Q-function $Q_{\text{tot}}(s, \mathbf{a})$; the value factorization used in QMIX expresses this joint Q-function in terms of per-agent utilities $Q_i(o_i, a_i)$ and constrains $Q_{\text{tot}}$ to be a monotone function of each $Q_i$, ensuring that each agent's local greedy action is consistent with the greedy joint action. Joint training on this centralized information allows credit assignment across robots (which robot's action caused the task success?), learning of implicit coordination strategies (robots that work in the same workspace must avoid interfering with each other), and optimization of system-level objectives (total throughput, not per-robot throughput).
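The monotonicity property behind QMIX-style factorization can be checked numerically with a small sketch. It makes several simplifying assumptions: the mixing weights are fixed constants rather than outputs of state-conditioned hypernetworks, there are only two agents with three discrete actions each, and all numeric values are invented. Taking the absolute value of the weights mirrors how QMIX keeps the mixer monotone in each per-agent utility.

```python
import itertools
import math


def elu(x):
    """Monotonically increasing nonlinearity used inside the mixer."""
    return x if x > 0 else math.exp(x) - 1.0


def monotone_mix(agent_qs, w1, b1, w2, b2):
    """Two-layer mixer; abs() on the weights keeps dQ_tot/dQ_i >= 0."""
    hidden = [
        elu(sum(abs(w) * q for w, q in zip(row, agent_qs)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2


# Hypothetical per-agent utilities Q_i(o_i, a_i) for three discrete actions.
agent_q_tables = [
    [0.2, 1.5, -0.3],   # robot arm 0
    [0.9, -1.0, 0.4],   # robot arm 1
]
# Fixed stand-ins for what the hypernetworks would produce from the state.
w1, b1 = [[0.5, -1.2], [0.8, 0.3]], [0.1, -0.2]
w2, b2 = [1.0, -0.7], 0.0

# Decentralized choice: each agent maximizes its own utility.
local_greedy = tuple(
    max(range(len(table)), key=lambda a, t=table: t[a]) for table in agent_q_tables
)

# Centralized choice: search every joint action for the best Q_tot.
joint_actions = itertools.product(*(range(len(t)) for t in agent_q_tables))
joint_greedy = max(
    joint_actions,
    key=lambda acts: monotone_mix(
        [table[a] for table, a in zip(agent_q_tables, acts)], w1, b1, w2, b2
    ),
)
print(local_greedy, joint_greedy)  # (1, 0) (1, 0)
```

Both searches select the same joint action, so each robot can act greedily on its own utility at execution time without evaluating the joint Q-function. The comparison uses discrete actions for clarity; with continuous actions the exhaustive joint search is not available.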
At execution time, each robot receives only its local observations (its own joint state, its local camera view, and the task instruction) and executes its policy independently. The learned policy has internalized coordination implicitly through centralized training: robots that were trained together tend to avoid collisions and coordinate task execution without explicit communication, because the training process shaped their individual policies to be compatible. For readers of Course 1 Week 9: the multi-arm factory setting is structurally identical to the cooperative MARL setting studied there, with the additional constraint that actions are continuous torque vectors rather than discrete choices, which makes the MAPPO extension (with continuous action heads) the natural algorithmic choice over the discrete-action QMIX.
GenAI context: embodied generalism
| Multi-task/multi-robot robotics | GenAI |
|---|---|
| Task-conditioned policies | Instruction-tuned LLMs |
| Skill libraries | Tool-use and function calling |
| Policy distillation | Model compression and knowledge distillation |
| Shared visual backbones | Pretrained VLM representations |
| Cross-embodiment learning | Cross-domain / cross-language generalization |
| CTDE for multi-robot | Multi-agent language model ensembles |
The central insight shared across both domains is that generalization comes from diversity: models trained on diverse tasks, robots, and environments learn representations that transfer more broadly than models trained narrowly, regardless of the specific algorithm. The investment in data diversity — collecting demonstrations across tasks, robots, and environments — pays compound dividends in downstream generalization.
Key takeaways
- Multi-task policies condition on task descriptors (IDs, goal images, language instructions) to apply a single model across many tasks.
- Negative transfer and task interference are the primary challenges, addressable through gradient conflict mitigation or task-specific parameterization.
- Skill libraries and the options framework decompose long-horizon tasks into reusable primitives, enabling combinatorial task composition.
- Policy distillation consolidates multiple expert policies into a single model by matching the student's action distribution to the experts' distributions.
- Shared visual backbones pretrained on diverse data provide the foundation for positive transfer across tasks and robots.
- Cross-embodiment learning handles robot differences through embodiment conditioning, Cartesian action spaces, and pooled multi-robot data such as the Open X-Embodiment dataset.
- The CTDE paradigm from multi-agent RL enables scalable multi-robot deployment with no explicit inter-robot communication at execution time.
Conceptual questions
- A multi-task policy is trained on 20 manipulation tasks using a shared backbone with task-specific LoRA adapters. For two tasks, "peg insertion" (requiring sub-millimeter precision) and "bin picking" (requiring only centimeter-level accuracy), analyze the gradient conflict that arises when training both tasks simultaneously on the shared backbone. Using the PCGrad framework (gradient projection), describe what happens to the gradient updates for "peg insertion" when a conflicting gradient from "bin picking" is projected away. Does this projection help or hurt peg insertion performance? How does the LoRA adapter design mitigate this conflict?
- A skill library is designed for a coffee preparation task with 6 skills (open cabinet, grasp cup, position cup, press button, transport cup, set down). The high-level policy must sequence these skills to complete the task. Formalize this as an options-framework MDP: define the augmented state space (including option in-progress information), the option initiation sets, and the inter-option transition dynamics. What is the effective planning horizon for the high-level policy, and how does it compare to the planning horizon of a flat policy operating at 50 Hz over a 30-second task?
- A policy distillation procedure trains a student policy to match the action distributions of 10 expert policies using KL divergence minimization. Expert $i$ has success rate $p_i$ and action diversity $H_i$ (entropy of the action distribution at the average state). Show that the KL minimization objective will bias the student toward matching high-entropy experts more than low-entropy experts even when both experts have equal success rates. Propose a modified distillation objective that weights experts by success rate rather than distribution entropy.
- Cross-embodiment learning uses Cartesian end-effector delta commands as the common action space. A policy trained on 7-DoF arms (3 position + 3 orientation + 1 gripper = 7-dimensional Cartesian action) is deployed on a 4-DoF arm (2 position + 1 orientation + 1 gripper = 4-dimensional Cartesian action). Identify the gap between the training action distribution and the deployment action space, and explain how a delta command that was valid for the 7-DoF arm (a combination of position and orientation change) might be infeasible for the 4-DoF arm. What constraint or projection is needed to handle this mismatch at deployment?
- In a CTDE multi-robot warehouse scenario, 4 robot arms share a common workspace and must avoid collisions while executing independent pick-and-place tasks. During centralized training, each robot observes the full joint state of all robots. After deployment with decentralized execution (each robot observes only its own state), the collision rate increases from 0% in training to 5% in deployment. Explain the mechanism of this failure: what information about other robots' states did the policy rely on during training that is unavailable at deployment? Propose a minimal change to the observation space during decentralized execution that would recover the zero-collision performance without requiring full centralized state sharing.
Looking ahead
With multi-task and multi-robot architectures established, the final lecture synthesizes the full Course 2 curriculum by examining a complete end-to-end pipeline from problem specification to deployed policy.
Week 14: Sim2Real Capstone. We work through a complete case study of training, evaluating, and deploying a robot manipulation policy — from IsaacLab simulation setup and domain randomization through real-world evaluation — identifying where each week's concepts enter the pipeline and where the critical engineering decisions lie.
Further reading
- Yu, T., et al. (2020). Gradient Surgery for Multi-Task Learning. NeurIPS. (The PCGrad algorithm).
- Kalashnikov, D., et al. (2021). MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale. arXiv. (Massive multi-task robotic manipulation).