Purpose of this lecture
Every algorithm studied so far has assumed the agent can interact with the environment during training. This assumption fails in three practically critical settings: when environment interaction is dangerous (surgical robotics, clinical decision support, autonomous driving), expensive (physical robot hardware, real-time industrial systems), or ethically constrained (deploying exploratory policies on real users). In all three cases, the agent must learn purely from a fixed dataset collected in the past.
Offline reinforcement learning (also called batch RL) studies exactly this setting: given a static dataset of transitions, learn the best possible policy without any further environment interaction. The goal is the same as in standard RL (maximize expected return), but the information constraint is fundamentally different.
The challenge is not a data-engineering problem. Standard off-policy algorithms such as DQN (Deep Q-Network), SAC (Soft Actor-Critic), and DDPG (Deep Deterministic Policy Gradient) already learn from replay buffers of past experience. When those same algorithms are applied to a fixed offline dataset with no additional collection, they fail catastrophically. The cause is a structural property of the Bellman backup and function approximation that does not appear in the online setting: distributional shift combined with overestimated Q-values. Understanding why this failure occurs and how modern offline RL methods address it is the subject of this lecture.
The offline RL setting
The agent is given a fixed dataset $\mathcal{D} = \{(s_i, a_i, r_i, s_i')\}_{i=1}^{N}$ collected by a behavior policy $\pi_\beta$, which may be a prior agent, a scripted policy, a human demonstrator, or a mixture. The behavior policy is typically unknown.
The agent must find:
$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$$
using only transitions in $\mathcal{D}$. It cannot query the environment for new transitions. This is a harder problem than off-policy RL (DQN, SAC), which also learns from a replay buffer, because in off-policy RL the agent still periodically collects fresh data to correct errors. In offline RL there is no such correction mechanism.
Dataset quality spectrum
The difficulty of offline RL depends heavily on the dataset:
- Expert data: all transitions come from a near-optimal policy. Behavioral cloning (supervised imitation) already works well here, and offline RL adds only marginal benefit.
- Scripted or random data: transitions from heuristic or random policies. Value function learning is possible but stitching across different behavior modes is required.
- Mixed data: a combination of expert, random, and possibly suboptimal transitions. This is the most realistic and most difficult setting: the offline RL algorithm must identify and improve upon the best parts of the dataset.
The dataset quality matters because the learned policy can only be as good as the best trajectory that can be stitched together from individual transitions in $\mathcal{D}$. Offline RL's unique promise, and its central difficulty, is that it can potentially produce a policy better than any individual trajectory in the dataset by combining good segments from different trajectories.
Why standard off-policy RL fails offline
The distributional shift problem
A standard Q-learning update selects actions by maximizing the Q-function:
$$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a'), \qquad \pi(s) = \arg\max_{a} Q(s, a)$$
The Q-network is trained only on state-action pairs $(s, a)$ present in $\mathcal{D}$. For out-of-distribution (OOD) actions, meaning actions the behavior policy never took in state $s$, the network's outputs are unconstrained by any training signal. They are whatever the network extrapolates from nearby training points, and that extrapolation can be wildly incorrect.
Because the policy takes the argmax over all actions (including OOD ones), it actively selects OOD actions whenever the network happens to assign them a falsely high Q-value. This is the distributional shift problem: the policy induces a state-action distribution $d^{\pi}(s, a)$ that differs systematically from the dataset distribution $d^{\pi_\beta}(s, a)$.
Bootstrapping error amplifies the problem
Distributional shift alone would be manageable if the Q-network's errors on OOD actions were random and stayed local. The deeper problem is that Bellman backups propagate errors. Consider the backup for a dataset transition $(s, a, r, s')$:
$$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$$
If $Q(s', a')$ is overestimated for some OOD $a'$, this overestimate is used as the target for updating $Q(s, a)$. On the next backup, this corrupted estimate of $Q(s, a)$ propagates to states that transition to $s$. Overestimation propagates backward through the value function in the direction of the learned policy's trajectory, which is precisely the trajectory that visits the most overestimated actions. The result is an explosive feedback loop:
- $Q$ overestimates the value of some OOD action $a'$ in state $s'$.
- The policy updates to prefer $a'$.
- Bellman targets flowing backward from $s'$ become inflated.
- Inflated targets spread to all states leading to $s'$.
- The policy further exploits the now-inflated values throughout the state space.
This cycle, absent in online RL (where interaction reveals the true value of $a'$), destroys the value function in offline settings. Empirically, standard SAC applied to offline data achieves near-zero performance on tasks where behavioral cloning succeeds.
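The propagation described above can be reproduced in a few lines. The following is a minimal tabular sketch, not a reproduction of any experiment in the lecture: the four-state chain, the `+10` error placed on the OOD action, and all names are illustrative assumptions. It contrasts the standard max backup with a backup that bootstraps only through the action the dataset actually took.

```python
GAMMA = 0.9
N_STATES = 4           # hypothetical chain: states 0..3, state 3 is terminal
FORWARD, OOD = 0, 1    # OOD is never taken anywhere in the dataset

# Dataset from a behavior policy that always moves forward;
# reward 1 is received on the transition into the terminal state.
dataset = [(0, FORWARD, 0.0, 1), (1, FORWARD, 0.0, 2), (2, FORWARD, 1.0, 3)]


def run_backups(use_max, n_sweeps=50):
    """Sweep Bellman backups over the fixed dataset and return the Q-table."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    # Simulated extrapolation error: no transition ever corrects this entry.
    q[2][OOD] = 10.0
    for _ in range(n_sweeps):
        for s, a, r, s_next in dataset:
            if s_next == N_STATES - 1:
                bootstrap = 0.0                  # terminal state has zero value
            elif use_max:
                bootstrap = max(q[s_next])       # may select the OOD action
            else:
                bootstrap = q[s_next][FORWARD]   # in-sample: dataset action only
            q[s][a] = r + GAMMA * bootstrap
    return q


q_max = run_backups(use_max=True)
q_in_sample = run_backups(use_max=False)
print("max backup:      ", [round(q_max[s][FORWARD], 3) for s in range(3)])
print("in-sample backup:", [round(q_in_sample[s][FORWARD], 3) for s in range(3)])
```

With the max backup, the single erroneous entry inflates the values of the upstream states 0 and 1 to 8.1 and 9.0, while the in-sample backup recovers the true values 0.81 and 0.9. Online interaction would eventually try the OOD action and correct the entry; the fixed dataset never does.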
Solution 1: Behavior regularization
The most direct fix is to force the learned policy to remain close to the behavior policy $\pi_\beta$, preventing it from selecting OOD actions the Q-network cannot reliably evaluate.
Policy-constrained objective
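Modify the policy improvement step with a divergence penalty:
$$\pi \leftarrow \arg\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D}}\left[\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q(s, a)\right] - \alpha \, D\big(\pi(\cdot \mid s), \pi_\beta(\cdot \mid s)\big)\right]$$
where $D$ is a divergence (KL divergence, MMD, or a support constraint) and $\alpha > 0$ sets the strength of the penalty. The penalty discourages the policy from placing mass on actions not represented in $\mathcal{D}$.
BEAR (Kumar et al., 2019) uses Maximum Mean Discrepancy (MMD), which can be estimated without explicitly modeling $\pi_\beta$, only requiring samples from it. For samples $x_1, \dots, x_n \sim \pi(\cdot \mid s)$, $y_1, \dots, y_m \sim \pi_\beta(\cdot \mid s)$, and a kernel $k$:
$$\mathrm{MMD}^2 = \frac{1}{n^2} \sum_{i, i'} k(x_i, x_{i'}) - \frac{2}{nm} \sum_{i, j} k(x_i, y_j) + \frac{1}{m^2} \sum_{j, j'} k(y_j, y_{j'})$$
The key insight of BEAR is that MMD provides a gradient signal for staying in-distribution without requiring an explicit density model of $\pi_\beta$. The method showed that behavior regularization could match behavioral cloning on expert data while enabling improvement beyond the behavior policy on mixed datasets. However, BEAR's sensitivity to the kernel choice and to the number of samples needed for a reliable MMD estimate limited its adoption.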
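As a concrete illustration of the sample-based estimate, the sketch below computes the squared MMD between two sets of action samples with a Gaussian kernel. It is a minimal example under stated assumptions, not BEAR's implementation: the kernel bandwidth `sigma`, the sample sizes, and the synthetic one-dimensional actions are all illustrative.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two action vectors given as lists."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased sample estimate of MMD^2 between the sample sets xs and ys."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, sigma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)


# Synthetic 1-D actions: dataset actions, a nearby policy, and a distant policy.
rng = random.Random(0)
dataset_actions = [[rng.gauss(0.0, 1.0)] for _ in range(40)]
near_policy_actions = [[rng.gauss(0.1, 1.0)] for _ in range(40)]
far_policy_actions = [[rng.gauss(3.0, 1.0)] for _ in range(40)]

print("MMD^2 (near policy):", mmd_squared(near_policy_actions, dataset_actions))
print("MMD^2 (far policy): ", mmd_squared(far_policy_actions, dataset_actions))
```

Only samples enter the computation, which is why no density model of the behavior policy is needed. The estimate does depend on `sigma` and on the number of samples, which is the sensitivity noted above.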
Advantage-Weighted Actor-Critic (AWAC; Nair et al., 2020) takes a simpler approach: update the policy only on actions from the dataset, weighted by their estimated advantage:
$$\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[\log \pi(a \mid s) \, \exp\left(\frac{1}{\lambda} A^{\pi_k}(s, a)\right)\right]$$
Here $A^{\pi_k}(s, a) = Q^{\pi_k}(s, a) - V^{\pi_k}(s)$ is the advantage under the current policy and $\lambda > 0$ is a temperature. This is a supervised update (no sampling of actions from the learned policy is needed) that upweights actions with positive advantage under the current policy. Actions with negative advantage (worse than the policy average) are down-weighted exponentially and contribute near-zero gradient, so the update is driven by the best subset of the dataset. The elegance of AWAC lies in its connection to offline-to-online transfer: the same objective can be applied to both online and offline data without modification, making it a natural warm-start for fine-tuning. Empirically, AWAC's performance tends to plateau on standard D4RL benchmarks, which is one motivation for the value-function-level methods discussed next, CQL and IQL.
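The advantage weighting can be made concrete with a short sketch. The function names, the temperature value, and the weight clipping are assumptions for illustration (clipping is a common stabilization trick, not something stated in the lecture), and the log-probabilities would come from the policy network in a real implementation.

```python
import math


def awac_weights(advantages, lam=1.0, max_weight=20.0):
    """Exponentiated-advantage weights exp(A / lam), clipped for stability (assumption)."""
    return [min(math.exp(a / lam), max_weight) for a in advantages]


def awac_loss(log_probs, advantages, lam=1.0):
    """Weighted negative log-likelihood of dataset actions under the policy."""
    weights = awac_weights(advantages, lam)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


# Three dataset actions: clearly bad, average, and clearly good.
example_advantages = [-3.0, 0.0, 2.0]
example_log_probs = [-1.2, -0.9, -1.5]   # hypothetical log pi(a|s) values
print("weights:", awac_weights(example_advantages))
print("loss:   ", awac_loss(example_log_probs, example_advantages))
```

The weights never become negative, so a bad action is ignored rather than actively pushed away from; this is what keeps the update a purely supervised regression on dataset actions.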
Drawback of behavior regularization
While behavior regularization addresses the core offline RL problem, it is fundamentally limited to policies close to $\pi_\beta$. If the behavior policy is suboptimal (as in most real datasets), the learned policy inherits that suboptimality. This conservative bound on policy improvement motivates looking beyond policy regularization to methods that act on the Q-function itself.
Solution 2: Conservative Q-Learning (CQL)
Rather than constraining the policy, Conservative Q-Learning (CQL; Kumar et al., 2020) addresses the problem at the level of the value function itself. The key insight is that if we can ensure the Q-function does not overestimate OOD actions, the policy's argmax will remain in-distribution without any explicit policy constraint.
The CQL objective
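CQL modifies the standard Bellman loss to add a pessimism term that pushes down Q-values on OOD actions and pushes up Q-values on dataset actions:
$$\mathcal{L}_{\text{CQL}}(Q) = \frac{1}{2} \, \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\big(Q(s, a) - (r + \gamma \max_{a'} \bar{Q}(s', a'))\big)^2\right] + \alpha \, \mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_{a} \exp Q(s, a)\right] - \alpha \, \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[Q(s, a)\right]$$
where $\bar{Q}$ is the target network and $\alpha > 0$ is the penalty weight. The middle term is the log-sum-exp (soft maximum) over all actions, a differentiable approximation to $\max_a Q(s, a)$. Minimizing it pushes down the Q-value of whichever actions the network currently thinks are best, disproportionately penalizing overestimated OOD actions. The final term pushes up Q-values for actions in the dataset, preserving the Bellman signal on observed transitions.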
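To see why the log-sum-exp term targets overestimated actions, note that its gradient with respect to each $Q(s, a)$ is the softmax weight of that action. The sketch below computes the penalty for discrete actions and prints those weights. It is an illustration under assumptions: the Q-values are made up, the helper names are hypothetical, and a continuous-action implementation would have to approximate the sum by sampling.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softmax(values):
    """Gradient of logsumexp with respect to each entry."""
    lse = logsumexp(values)
    return [math.exp(v - lse) for v in values]


def cql_penalty(q_rows, dataset_actions):
    """Batch mean of logsumexp_a Q(s, a) - Q(s, a_data) for discrete actions."""
    total = 0.0
    for q_values, a_data in zip(q_rows, dataset_actions):
        total += logsumexp(q_values) - q_values[a_data]
    return total / len(q_rows)


# One state with three actions; action 2 is OOD and badly overestimated.
q_rows = [[1.0, 1.2, 8.0]]
dataset_actions = [1]
print("penalty:          ", cql_penalty(q_rows, dataset_actions))
print("push-down weights:", softmax(q_rows[0]))
```

Almost all of the push-down gradient lands on the inflated action, while the dataset action receives a push-up of full weight from the second term. The penalty is never negative, because the log-sum-exp is at least as large as any single Q-value.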
CQL's central claim, that pessimism (lower-bounding the true value) is both theoretically justified and empirically sufficient, challenged the conventional RL intuition that learning accurate values is always desirable. The paper demonstrated that intentional underestimation on unseen actions prevents the catastrophic divergence observed in standard off-policy methods, and it reports that CQL often attains 2 to 5 times higher final return than prior offline RL methods on D4RL tasks, including the MuJoCo locomotion benchmarks. The community's initial concern that pessimism gives up potential improvement has been partially borne out: CQL can sacrifice upside on datasets containing near-optimal or mixed trajectories compared to more aggressive methods.
Theoretical guarantee
Under mild regularity conditions and for a sufficiently large penalty weight $\alpha$, CQL guarantees that the learned Q-function yields a lower bound on the true value function:
$$\hat{V}^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[\hat{Q}^{\pi}(s, a)\right] \le V^{\pi}(s) \quad \text{for all } s \in \mathcal{D}$$
This conservative (pessimistic) bias is intentional. An agent that underestimates the value of OOD actions will avoid them, which is exactly the desired behavior. The tradeoff is that CQL may also underestimate the value of in-distribution actions, leading to a policy that is overly conservative about deviations from $\pi_\beta$ even when they would be beneficial. The penalty weight $\alpha$ controls this tradeoff: a larger $\alpha$ is more conservative, and as $\alpha \to 0$ the objective approaches standard Q-learning.
Connection to offline-to-online transfer
CQL-initialized policies can be fine-tuned online: the conservative Q-function provides a stable initialization that limits early-training instability, and online interaction corrects the intentional underestimation bias as fresh data arrives. This offline pre-training followed by online fine-tuning is a common practical pipeline in robotics deployment. In contrast to the pure offline setting, the offline-to-online setting shows that pessimism's cost (reduced upside) diminishes once data collection resumes.
Implicit Q-Learning (IQL)
A limitation of CQL and behavior regularization approaches is that policy improvement still requires querying the Q-function on actions outside the dataset. Implicit Q-Learning (IQL; Kostrikov et al., 2021) avoids this entirely by reformulating the Bellman backup to never evaluate the Q-function outside the dataset.
The key idea
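Standard Q-learning bootstraps with $\max_{a'} Q(s', a')$, which requires querying $Q$ at the maximizing action, and that action may be OOD. IQL replaces this maximum with an expectile regression over dataset actions:
$$L_V = \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[L_2^{\tau}\big(Q(s, a) - V(s)\big)\right]$$
where $L_2^{\tau}(u) = |\tau - \mathbb{1}(u < 0)| \, u^2$ is the asymmetric squared loss. At $\tau = 0.5$ this is ordinary MSE (fitting the mean). As $\tau \to 1$ it fits the maximum Q-value achievable by in-distribution actions. The Q-function is then updated via:
$$L_Q = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\big(r + \gamma V(s') - Q(s, a)\big)^2\right]$$
Finally, the policy is extracted by advantage-weighted regression on dataset actions:
$$L_\pi = -\mathbb{E}_{(s, a) \sim \mathcal{D}}\left[\exp\big(\beta \, (Q(s, a) - V(s))\big) \log \pi(a \mid s)\right]$$
where $\beta \ge 0$ is an inverse temperature.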
IQL's central innovation is to use expectile regression to define value targets implicitly, without sampling OOD actions, which gave offline RL a qualitatively different approach. Rather than penalizing the Q-function (CQL) or constraining the policy (AWAC), IQL reformulates the backup itself to avoid OOD evaluation entirely. Empirically, IQL matched or exceeded CQL on most D4RL benchmarks while being simpler to implement and tune. The method has been widely adopted for offline robot learning since 2022, in part because the advantage-weighted policy extraction (the same form as AWAC's supervised loss) combines readily with diffusion policies and other expressive model classes.
IQL is entirely in-sample: every step of training evaluates $Q$ and $V$ only on $(s, a)$ pairs from $\mathcal{D}$. This makes it more stable than CQL in practice and easier to combine with expressive policy classes (diffusion policies, transformers). IQL with $\tau$ between roughly 0.7 and 0.9 achieves strong performance across D4RL offline benchmarks.
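The effect of the expectile level can be checked numerically on a handful of Q-values for one state. The sketch below is a toy illustration, not IQL's training code: the Q-values are made up, and the expectile is found by a simple reweighted-mean fixed-point iteration instead of gradient descent on a network.

```python
def expectile_loss(u, tau):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2 with u = Q(s, a) - V(s)."""
    weight = (1.0 - tau) if u < 0 else tau
    return weight * u * u


def fit_expectile(values, tau, n_iters=100):
    """Return the scalar V that minimizes the summed expectile loss over values."""
    v = sum(values) / len(values)
    for _ in range(n_iters):
        # Values above the current estimate get weight tau, the rest 1 - tau.
        weights = [tau if q > v else (1.0 - tau) for q in values]
        v = sum(w * q for w, q in zip(weights, values)) / sum(weights)
    return v


# Q-values of the dataset actions observed in a single state (hypothetical).
q_of_dataset_actions = [1.0, 2.0, 3.0, 10.0]
for tau in (0.5, 0.7, 0.9, 0.99):
    print(tau, round(fit_expectile(q_of_dataset_actions, tau), 3))
```

At `tau = 0.5` the fit is the mean of the four values, and as `tau` approaches 1 it moves toward the largest of them. The maximum here is taken only over actions that appear in the dataset, which is what keeps the backup in-sample.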
When IQL applies and when it does not
IQL shines on datasets where behavioral cloning plus mild value improvement is sufficient, particularly in robotics where the behavior policy is often competent but suboptimal. It can struggle on datasets where success requires stitching together trajectory segments that share few states or actions: because IQL never evaluates actions outside the dataset, it can only combine segments through in-sample transitions, and it cannot recover a high-value action that the dataset lacks at a junction state. This limitation matters for tasks requiring significant policy improvement beyond the behavior policy, where CQL or other pessimistic methods may perform better.
Off-policy evaluation
If the agent cannot interact with the environment during training, it also cannot measure its policy's performance directly. Deploying a poorly performing policy on real hardware or real users is costly. Off-policy evaluation (OPE) estimates the expected return $J(\pi)$ of a target policy $\pi$ using only the offline dataset collected by $\pi_\beta$.
Direct method (DM)
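Fit a Q-function $\hat{Q}^{\pi}$ using the offline dataset and evaluate it at the initial state distribution $d_0$:
$$\hat{J}_{\text{DM}}(\pi) = \mathbb{E}_{s_0 \sim d_0}\left[\mathbb{E}_{a \sim \pi(\cdot \mid s_0)}\left[\hat{Q}^{\pi}(s_0, a)\right]\right]$$
Simple but biased: all errors in $\hat{Q}^{\pi}$ propagate directly into the estimate.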
Importance sampling (IS)
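Reweight trajectory returns by the density ratio between $\pi$ and $\pi_\beta$:
$$\hat{J}_{\text{IS}}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \left(\prod_{t=0}^{T-1} \frac{\pi(a_t^i \mid s_t^i)}{\pi_\beta(a_t^i \mid s_t^i)}\right) \sum_{t=0}^{T-1} \gamma^t r_t^i$$
IS is unbiased when $\pi_\beta$ is known, but its variance grows exponentially with trajectory length $T$: the product of $T$ per-step importance weights has a variance that can grow exponentially in $T$, which explodes unless $\pi \approx \pi_\beta$. For long-horizon tasks, IS estimates are effectively unusable. This exponential variance is not a minor implementation detail. It is a fundamental statistical barrier that IS cannot overcome without additional structure.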
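The exponential growth is easy to verify in a simplified case. The sketch below implements the trajectory-wise IS estimate and, separately, the exact variance of the weight product under the assumption that both policies are state-independent and actions are drawn independently at each step. The policies, the trajectory format `(state, action, reward)`, and the function names are illustrative assumptions.

```python
def pi(a, s):
    """Hypothetical target policy: prefers action 0 in every state."""
    return (0.8, 0.2)[a]


def pi_beta(a, s):
    """Hypothetical behavior policy: uniform over two actions."""
    return 0.5


def trajectory_weight(traj, target, behavior):
    """Product of per-step importance ratios along one trajectory."""
    weight = 1.0
    for s, a, _ in traj:
        weight *= target(a, s) / behavior(a, s)
    return weight


def is_estimate(trajs, target, behavior, gamma=1.0):
    """Trajectory-wise importance sampling estimate of the target policy's return."""
    total = 0.0
    for traj in trajs:
        ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))
        total += trajectory_weight(traj, target, behavior) * ret
    return total / len(trajs)


def weight_variance(target_probs, behavior_probs, horizon):
    """Exact variance of the weight product for i.i.d. state-independent steps."""
    second_moment = sum(p * p / b for p, b in zip(target_probs, behavior_probs))
    return second_moment ** horizon - 1.0


for horizon in (1, 10, 50):
    print(horizon, weight_variance((0.8, 0.2), (0.5, 0.5), horizon))
```

In this example each step multiplies the second moment of the weight by 1.36, so the variance is 0.36 at horizon 1 and in the millions by horizon 50, even though the two policies differ only moderately at each step.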
Doubly robust (DR) estimator
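DR combines the direct method as a variance-reduction baseline with IS for residual correction:
$$\hat{J}_{\text{DR}}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \left[\hat{V}(s_0^i) + \sum_{t=0}^{T-1} \gamma^t \rho_{0:t}^i \left(r_t^i + \gamma \hat{V}(s_{t+1}^i) - \hat{Q}(s_t^i, a_t^i)\right)\right]$$
where $\rho_{0:t} = \prod_{k=0}^{t} \pi(a_k \mid s_k) / \pi_\beta(a_k \mid s_k)$ is the cumulative importance weight, $\hat{V}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}[\hat{Q}(s, a)]$, and $\hat{V}(s_T) = 0$.
The DR estimator is doubly robust in the precise sense: it is consistent (converges to the true value) if either the importance weights are correct (an accurate behavior policy model) or the value model is correct. Both need not hold simultaneously. This resilience makes DR a widely used OPE estimator in industrial recommendation systems and AI evaluation pipelines, where one or the other model component is typically reliable. The double robustness property and its implications for offline evaluation are developed in Kennedy (2023); in practice, DR often outperforms both DM and IS, especially when one of the two component models (behavior policy or value function) is moderately well-fit to the data.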
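A per-trajectory doubly robust estimate can be computed with a backward recursion over the trajectory. The sketch below is one common recursive form, written under assumptions: the callables `q_hat` and `v_hat`, the `(state, action, reward)` trajectory format, the optional `max_weight` truncation, and the one-step example are all illustrative and not taken from a specific library.

```python
def dr_trajectory(traj, target, behavior, q_hat, v_hat, gamma=0.99, max_weight=None):
    """Doubly robust value estimate for one trajectory via backward recursion."""
    value = 0.0  # estimate beyond the final step
    for s, a, r in reversed(traj):
        rho = target(a, s) / behavior(a, s)
        if max_weight is not None:
            rho = min(rho, max_weight)  # truncation trades bias for variance
        # Model baseline plus importance-weighted correction of the model's error.
        value = v_hat(s) + rho * (r + gamma * value - q_hat(s, a))
    return value


def dr_estimate(trajs, target, behavior, q_hat, v_hat, gamma=0.99, max_weight=None):
    """Average the per-trajectory estimates over the dataset."""
    values = [dr_trajectory(t, target, behavior, q_hat, v_hat, gamma, max_weight)
              for t in trajs]
    return sum(values) / len(values)


# One-step example: action 1 pays reward 1, action 0 pays 0 (deterministic).
def target(a, s):
    return (0.1, 0.9)[a]


def behavior(a, s):
    return 0.5


def exact_q(s, a):
    return float(a)


def exact_v(s):
    return 0.9  # expectation of exact_q under the target policy


trajs = [[("s", 0, 0.0)], [("s", 1, 1.0)], [("s", 1, 1.0)]]
print(dr_estimate(trajs, target, behavior, exact_q, exact_v))
```

With an exact value model the correction term vanishes on every trajectory, so the estimate equals the model's value regardless of the importance weights. With a zero value model the recursion reduces to step-wise importance sampling, which is the other half of the double robustness property.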
OPE in practice
In offline RL deployment (robotics, autonomous systems), OPE serves as a sanity check rather than a definitive metric. The standard practice is to use DR with conservative importance weight truncation (capping at some threshold) to trade off bias and variance, then validate on a small held-out batch of real-world data or simulation rollouts if feasible. This hybrid approach—OPE + limited online validation—reflects the reality that no offline evaluation method can fully substitute for interaction with the environment.
GenAI context: RLHF as offline RL
Offline RL is not a niche robotics topic. It is the structural setting of modern large language model (LLM) alignment.
In the pipeline for reinforcement learning from human feedback (RLHF): a fixed dataset of human preference labels is collected once; a reward model is trained on this static data; and the LLM policy is optimized against the reward model without further human annotation in the training loop. This is offline RL on a fixed preference dataset, with the reward model playing the role of the learned value estimate $\hat{Q}$.
The distributional shift problem appears directly: if PPO (Proximal Policy Optimization) pushes the LLM policy to maximize the reward model, it will find OOD text sequences (responses outside the distribution of the preference dataset) that score highly under the reward model but are not actually preferred by humans. This is exactly extrapolation error on OOD actions. The reward model, trained only on prompts and responses from the supervised fine-tuned (SFT) model, has no reliable predictions for text far from that distribution.
The standard fix is the KL penalty against the SFT reference model:
$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D}, \, y \sim \pi(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta \, D_{\text{KL}}\big(\pi(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x)\big)$$
This is behavior regularization with $\pi_\beta = \pi_{\text{SFT}}$: it prevents the optimized policy from straying into regions where the reward model cannot be trusted. The complete theoretical analysis of this connection, including the closed-form solution of the KL-regularized RLHF objective, is the starting point of the DPO (Direct Preference Optimization) derivation studied in Week 13.
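For a single prompt with a small discrete set of responses, the KL-regularized objective has the well-known closed-form maximizer $\pi^*(y) \propto \pi_{\text{ref}}(y) \exp(r(y) / \beta)$, and it can be checked numerically. The sketch below uses made-up reference probabilities and rewards; the function names and numbers are illustrative assumptions.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    unnormalized = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def objective(probs, ref_probs, rewards, beta):
    """Expected reward minus beta times KL(probs || ref_probs)."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(probs, ref_probs) if p > 0)
    return expected_reward - beta * kl


ref_probs = [0.5, 0.3, 0.2]   # hypothetical SFT probabilities of three responses
rewards = [0.0, 1.0, 2.0]     # hypothetical reward model scores
for beta in (0.1, 0.5, 5.0):
    optimum = kl_regularized_optimum(ref_probs, rewards, beta)
    print(beta, [round(p, 3) for p in optimum])
```

A small `beta` concentrates the policy on the highest-scoring response, which is the regime where reward model errors are exploited; a large `beta` keeps the policy near the reference distribution.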
Key takeaways
Offline RL learns from fixed datasets without environment interaction. Standard off-policy algorithms fail because the Q-function provides unreliable estimates for OOD actions, and Bellman backups propagate these errors into a catastrophic feedback loop. Behavior regularization constrains the policy to stay close to $\pi_\beta$, preventing OOD action selection; AWAC implements this via advantage-weighted supervised regression, enabling direct offline-to-online transfer. CQL addresses the problem at the value function level: a modified Bellman loss explicitly penalizes high Q-values on OOD actions, guaranteeing a conservative lower bound on the true value function. IQL avoids all OOD evaluation by replacing the Bellman max with expectile regression over dataset actions, making it fully in-sample. OPE via doubly robust estimators allows performance evaluation without environment interaction. These methods collectively constitute the machinery underlying RLHF, decision-making from offline datasets in robotics, and any AI system that must learn from historical data without the ability to explore.
Conceptual questions
- Standard SAC applied to an offline dataset achieves near-zero performance despite using the same Bellman updates and neural architectures that work in online settings. Trace the failure mechanism step by step: starting from OOD Q-value overestimation in a single state-action pair, show how the error propagates through successive Bellman backups to corrupt the entire value function. Identify the one structural difference between offline and online RL that breaks the self-correcting property of online Bellman backups.
- CQL adds a term $\log \sum_{a} \exp Q(s, a)$ to the loss. Show that this term is the log-sum-exp (soft maximum) approximation to $\max_a Q(s, a)$. Why does minimizing this term disproportionately penalize overestimated OOD actions rather than uniformly shrinking all Q-values? How does the tradeoff weight $\alpha$ in CQL relate to the KL penalty coefficient $\beta$ in RLHF behavior regularization? Are they solving the same underlying problem?
- IQL fits a value function using expectile regression at level $\tau$ on the Q-values of dataset actions. Explain why $\tau = 0.5$ recovers the mean Q-value and $\tau \to 1$ approximates the maximum. In a dataset where the behavior policy is uniformly random over all actions, what does IQL with $\tau \to 1$ effectively compute, and how does this relate to the optimal value $V^*(s)$?
- An offline RL system for a medical treatment recommendation task is trained on a dataset $\mathcal{D}$ that consists of conservative physician decisions that rarely deviate from standard of care. The learned policy recommends aggressive treatments with higher estimated return but minimal dataset support. Identify the failure mode using the offline RL framework, explain which of the three solution approaches (behavior regularization, CQL, IQL) would best prevent this failure, and describe the tradeoff introduced by applying that approach.
- The doubly robust OPE estimator is consistent if either the importance weights or the value model is correctly specified, but not necessarily both. Describe a scenario where (a) $\pi \approx \pi_\beta$, making IS reliable, but the direct method model $\hat{Q}$ is inaccurate, and (b) $\pi$ is very different from $\pi_\beta$, making IS unreliable, but $\hat{Q}$ is accurate. For case (b), explain what happens to the IS variance as the evaluation horizon $T$ grows, and why DR remains reliable despite high IS variance in this case.
Implementation exercises
Exercise 1: Reproduce the offline RL failure
Train a standard SAC agent online on HalfCheetah-v3 until it reaches a moderate performance level (return ~4000). Save all transitions to create a "medium" offline dataset. Then:
- Train a new SAC agent offline on this dataset (no environment interaction, just replay buffer sampling). Track the Q-value estimates and policy return over training.
- You should observe Q-values inflating to 10–100× the true return while actual policy performance collapses to near zero. Plot the divergence between estimated and true Q-values.
- Add behavior cloning (BC) as a baseline on the same dataset. Observe that BC achieves non-trivial return while offline SAC fails, demonstrating that the failure is specific to value-based methods and is not a data quality issue.
Identify the moment in training where the Q-value divergence begins and correlate it with the policy starting to select actions outside the data distribution.
Exercise 2: Implement Conservative Q-Learning (CQL)
Starting from your offline SAC implementation, add the CQL penalty term:
$$\alpha \, \mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_{a} \exp Q(s, a)\right] - \alpha \, \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[Q(s, a)\right]$$
- For continuous actions, the log-sum-exp term cannot be computed exactly. Use importance sampling: sample $N$ actions from the current policy and $N$ actions from a uniform distribution over the action space, then compute the soft maximum over the combined set.
- Tune $\alpha$. Report the final policy return after offline training on the medium dataset from Exercise 1.
- Compare with the BC baseline: does CQL exceed BC performance, and at what $\alpha$?
Test on halfcheetah-medium-v2 from D4RL: published results place CQL at a normalized score in the mid-40s and BC at roughly 42, while naive offline SAC scores far lower, consistent with the failure reproduced in Exercise 1. The d4rl package provides the dataset directly via d4rl.qlearning_dataset(env).
Exercise 3: Compare IQL and CQL on a stitching task
Construct a custom dataset for PointMaze or a simple navigation task where:
- Half the trajectories go from start to a midpoint and terminate.
- Half go from the midpoint to the goal and terminate.
- No single trajectory goes from start to goal.
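Train IQL (with expectile level $\tau$) and CQL (with penalty weight $\alpha$) offline on this dataset. Measure:
- Success rate (start → goal completion) for each method.
- The IQL-CQL performance gap. The hypothesis from the IQL section is that IQL struggles, since no trajectory covers the full path and IQL can only connect the two halves through in-sample transitions near the midpoint, while CQL succeeds by identifying the midpoint → goal segment as high-value and crossing into it. Check whether your results support this, and how the gap changes with the amount of state overlap between the two halves at the midpoint.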
This exercise directly tests the stitching limitation discussed in the IQL section.
Extension prompts
- Offline RL with diffusion policies: Replace IQL's Gaussian policy with a diffusion model (as in Diffuser or Decision Diffuser). Train on the same D4RL datasets and compare: (a) performance on heterogeneous data where the behavior policy is a mixture of experts and random actions, and (b) the ability to generate multi-modal action distributions. Does the expressivity of the diffusion policy compensate for IQL's stitching limitation?
- OPE for model selection without deployment: Implement the doubly robust estimator on the D4RL halfcheetah datasets. Use OPE to rank 5 candidate policies (trained with different offline RL methods on the same dataset) by estimated return. Then evaluate all 5 policies in the real environment. Measure the Spearman rank correlation between OPE-predicted and true returns. At what dataset size does OPE ranking become reliable?
- RLHF as offline RL, the extrapolation failure: Take a small language model fine-tuned via RLHF (PPO with a reward model). Freeze the reward model and generate completions from the optimized policy. Compare the reward model scores with human preference ratings on a held-out set of prompts. Identify text sequences where the reward model score is high but human preference is low. These are the OOD "actions" (text completions) that the reward model overestimates, exactly analogous to the overestimated OOD actions that CQL penalizes.
Looking ahead
Offline RL provides the theoretical foundation for the most important application of RL in modern AI.
Week 12: Reinforcement Learning from Human Feedback (RLHF). We study how the SFT, reward modeling, and KL-regularized PPO pipeline transforms a pre-trained language model into an aligned conversational agent, and examine the limitations of this approach that motivate the preference-optimization methods of Week 13.
Further reading
- Levine, S., et al. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv. (The definitive overview).
- Kumar, A., et al. (2019). Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS. (BEAR — behavior regularization via MMD constraint).
- Kumar, A., et al. (2020). Conservative Q-Learning for Offline Reinforcement Learning. NeurIPS. (CQL).
- Kostrikov, I., et al. (2021). Offline Reinforcement Learning with Implicit Q-Learning. ICLR. (IQL).
- Nair, A., et al. (2020). Accelerating Online Reinforcement Learning with Offline Datasets. arXiv. (AWAC).
- Kennedy, E. H. (2023). Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics. (Double robustness theory — the statistical foundation for the DR estimator in OPE).