
Week 11: Offline Reinforcement Learning

✦Learning Outcomes
  • Analyze distributional shift and extrapolation error in offline settings
  • Implement Conservative Q-Learning (CQL) and understand its design
  • Compare IQL, AWAC, and other offline RL methods
  • Connect offline RL to real-world deployment constraints
◆Prerequisites
  • Week 10: Model-based RL, value functions
  • Week 8: Deep RL algorithms (DQN, SAC)

Recommended: Review Week 10 before proceeding.

◆Grounded In
  • Robotics: Offline RL is the standard deployment pipeline for real-robot systems: collect demonstration and suboptimal data from teleoperation or scripted policies, train conservatively offline, then fine-tune online. CQL-initialized policies on Franka manipulation and autonomous driving stacks follow this exact offline-to-online pattern. OPE via doubly robust estimators serves as the deployment gate before real-hardware testing.
  • GenAI: RLHF (Reinforcement Learning from Human Feedback) is offline RL on a fixed preference dataset: the KL penalty against the SFT model is exactly behavior regularization with $\pi_\beta = \pi_{\text{SFT}}$. The distributional shift problem (reward model inflating OOD text scores) is the same mechanism as Q-value overestimation on unseen actions. This connection is the theoretical bridge to DPO (Direct Preference Optimization) in Week 13.

Purpose of this lecture

Every algorithm studied so far has assumed the agent can interact with the environment during training. This assumption fails in three practically critical settings: when environment interaction is dangerous (surgical robotics, clinical decision support, autonomous driving), expensive (physical robot hardware, real-time industrial systems), or ethically constrained (deploying exploratory policies on real users). In all three cases, the agent must learn purely from a fixed dataset collected in the past.

Offline reinforcement learning (also called batch RL) studies exactly this setting: given a static dataset $\mathcal{D}$ of transitions, learn the best possible policy without any further environment interaction. The goal is the same as in standard RL (maximize expected return), but the information constraint is fundamentally different.

The challenge is not a data-engineering problem. Standard off-policy algorithms (DQN, SAC, DDPG) already learn from replay buffers of past experience. Yet when those same algorithms are applied to a fixed offline dataset with no additional collection, they fail catastrophically. The cause is a structural property of the Bellman backup and function approximation that does not appear in the online setting: distributional shift combined with overestimated Q-values. Understanding why this failure occurs and how modern offline RL methods address it is the subject of this lecture.


The offline RL setting

The agent is given a fixed dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^N$ collected by a behavior policy $\pi_\beta$, which may be a prior agent, a scripted policy, a human demonstrator, or a mixture. The behavior policy is typically unknown.

The agent must find:

$$\pi^* = \arg\max_\pi J(\pi) = \arg\max_\pi \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t r_t\right]$$

using only transitions in $\mathcal{D}$. It cannot query the environment for new transitions. This is a harder problem than off-policy RL (DQN, SAC), which also learns from a replay buffer, because in off-policy RL the agent still periodically collects fresh data to correct errors. In offline RL there is no such correction mechanism.

Dataset quality spectrum

The difficulty of offline RL depends heavily on the dataset:

  • Expert data: all transitions come from a near-optimal policy. Behavioral cloning (supervised imitation) already works well here, and offline RL adds marginal benefit.
  • Scripted or random data: transitions from heuristic or random policies. Value function learning is possible but stitching across different behavior modes is required.
  • Mixed data: a combination of expert, random, and possibly suboptimal transitions. This is the most realistic and most difficult setting: the offline RL algorithm must identify and improve upon the best parts of the dataset.

The dataset quality matters because the learned policy can only be as good as the best trajectory that can be stitched together from individual transitions in $\mathcal{D}$. Offline RL's unique promise, and its central difficulty, is that it can potentially produce a policy better than any individual trajectory in the dataset by combining good segments from different trajectories.


Why standard off-policy RL fails offline

The distributional shift problem

A standard Q-learning update selects actions by maximizing the Q-function:

$$\pi(s) = \arg\max_a Q_\theta(s, a)$$

The Q-network $Q_\theta$ is trained only on state-action pairs $(s, a)$ present in $\mathcal{D}$. For out-of-distribution (OOD) actions, meaning actions the behavior policy never took in state $s$, the network's outputs are unconstrained by any training signal. They are whatever the network extrapolates from nearby training points, and that extrapolation can be wildly incorrect.

Because the policy takes the argmax over all actions (including OOD ones), it actively selects OOD actions whenever the network happens to assign them a falsely high Q-value. This is the distributional shift problem: the policy induces a state-action distribution that differs systematically from the dataset distribution $d^{\pi_\beta}(s, a)$.

Bootstrapping error amplifies the problem

Distributional shift alone would be manageable if the Q-network's errors on OOD actions were random. The deeper problem is that Bellman backups propagate errors. Consider:

$$Q_\theta(s, a) \leftarrow r + \gamma \max_{a'} Q_\theta(s', a')$$

If $Q_\theta(s', a')$ is overestimated for some OOD $a'$, this overestimate is used as the target for updating $Q_\theta(s, a)$. On the next backup, this corrupted estimate of $Q_\theta(s, a)$ propagates to states that transition to $s$. Overestimation propagates backward through the value function in the direction of the learned policy's trajectory, which is precisely the trajectory that visits the most overestimated actions. The result is an explosive feedback loop:

  1. $Q_\theta$ overestimates value for some OOD action $a'$ in state $s'$.
  2. The policy updates to prefer $a'$.
  3. Bellman targets flowing backward from $s'$ become inflated.
  4. Inflated targets spread to all states leading to $s'$.
  5. The policy further exploits the now-inflated values throughout the state space.

This cycle, absent in online RL (where interaction reveals the true value of $a'$), destroys the value function in offline settings. Empirically, standard SAC applied to offline data achieves near-zero performance on tasks where behavioral cloning succeeds.
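The feedback loop can be made concrete with a toy tabular sketch. Everything here is an illustrative assumption rather than part of the lecture: a three-step chain, a dataset that only ever contains action 0, and a hand-planted erroneous value of 10 on the unseen action 1 in state 2, standing in for a function approximator's extrapolation error.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS, TERMINAL = 4, 2, 3

# Fixed dataset (s, a, r, s_next): the behavior policy only ever took action 0.
DATASET = [(0, 0, 0.0, 1), (1, 0, 0.0, 2), (2, 0, 1.0, 3)]


def fitted_q(dataset, in_sample_only, sweeps=50):
    """Tabular Q-iteration over a fixed dataset.

    in_sample_only=False: bootstrap with a max over ALL actions (standard Q-learning).
    in_sample_only=True:  bootstrap with a max over actions seen in the dataset at s_next.
    """
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    q[2][1] = 10.0  # hypothetical extrapolation error on an action with no data

    # Record which actions have dataset support in each state.
    seen = {}
    for s, a, _, _ in dataset:
        seen.setdefault(s, set()).add(a)

    for _ in range(sweeps):
        for s, a, r, s_next in dataset:
            if s_next == TERMINAL:
                bootstrap = 0.0
            elif in_sample_only:
                bootstrap = max(q[s_next][b] for b in seen[s_next])
            else:
                bootstrap = max(q[s_next])
            q[s][a] = r + GAMMA * bootstrap
    return q


naive = fitted_q(DATASET, in_sample_only=False)
in_sample = fitted_q(DATASET, in_sample_only=True)

print(f"naive     Q(0,0) = {naive[0][0]:.2f}")
print(f"in-sample Q(0,0) = {in_sample[0][0]:.2f}")
```

The planted error is never updated because no transition covers it, yet the naive backup copies it into every predecessor state: the estimate at the start state comes out near 8.1 instead of the true 0.81, and the greedy action in state 2 becomes the unsupported one. Restricting the bootstrap to supported actions recovers the true values in this toy case.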


Solution 1: Behavior regularization

The most direct fix is to force the learned policy to remain close to the behavior policy $\pi_\beta$, preventing it from selecting OOD actions the Q-network cannot reliably evaluate.

Policy-constrained objective

Modify the policy improvement step with a divergence penalty:

$$\max_\pi \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\!\left[ Q(s,a) - \alpha\, D(\pi(\cdot|s) \,\|\, \pi_\beta(\cdot|s)) \right]$$

where $D$ is a divergence (KL divergence, MMD, or support constraint). The penalty discourages the policy from placing mass on actions not represented in $\pi_\beta$.

BEAR (Kumar et al., 2019) uses Maximum Mean Discrepancy (MMD), which can be estimated without explicitly modeling $\pi_\beta$, only requiring samples from it:

$$D_{\text{MMD}}(\pi, \pi_\beta) = \left\|\mathbb{E}_{a \sim \pi}[\phi(a)] - \mathbb{E}_{a \sim \pi_\beta}[\phi(a)]\right\|^2$$

The key insight of BEAR is that MMD provides a gradient signal for staying in-distribution without requiring an explicit density model of $\pi_\beta$. The method showed that behavior regularization could match behavioral cloning on expert data while enabling improvement beyond the behavior policy on mixed datasets. However, BEAR's dependence on kernel selection and the number of samples for reliable MMD estimation limited its adoption.
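As a rough illustration of why MMD needs only samples, the squared norm above can be expanded into kernel evaluations, $\mathbb{E}[k(a,a')] + \mathbb{E}[k(b,b')] - 2\,\mathbb{E}[k(a,b)]$, and estimated by plain averages. The sketch below is not BEAR's implementation; the Gaussian kernel, the bandwidth, and the one-dimensional synthetic actions are all assumptions chosen for brevity.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for equal-length vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased sample estimate of squared MMD between two sets of action samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


rng = random.Random(0)
# Synthetic 1-D actions: dataset actions, a policy that stays close, and one that drifts.
behavior_actions = [[rng.gauss(0.0, 1.0)] for _ in range(100)]
near_policy_actions = [[rng.gauss(0.0, 1.0)] for _ in range(100)]
far_policy_actions = [[rng.gauss(3.0, 1.0)] for _ in range(100)]

mmd_near = mmd_squared(near_policy_actions, behavior_actions)
mmd_far = mmd_squared(far_policy_actions, behavior_actions)
print(f"MMD^2 near: {mmd_near:.4f}   MMD^2 far: {mmd_far:.4f}")
```

No density of the behavior policy appears anywhere in the estimate, only pairwise kernel values between samples. The bandwidth and sample count visibly affect the result, which mirrors the sensitivity noted above.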

Advantage-Weighted Actor-Critic (AWAC) takes a simpler approach: update the policy only on actions from the dataset, weighted by their estimated advantage:

$$\mathcal{L}_{\text{AWAC}}(\theta) = -\mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[ \log \pi_\theta(a \mid s) \cdot \exp\!\left(\frac{A^\pi(s,a)}{\lambda}\right) \right]$$

This is a supervised update (no off-policy sampling needed) that upweights actions with positive advantage under the current policy. Actions with negative advantage (worse than the policy average) contribute near-zero gradient, so the update is driven by the best subset of the dataset. The elegance of AWAC lies in its connection to offline-to-online transfer: the same objective can be applied to both online and offline data without modification, making it a natural warm-start for fine-tuning. Empirically, AWAC's performance plateau on standard D4RL benchmarks motivated the development of more aggressive methods like CQL and IQL that relax the in-distribution constraint.
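A minimal sketch of the weighting in the AWAC loss, assuming advantages and log-probabilities have already been computed for a batch of dataset actions. The function names and the `max_weight` clip are hypothetical additions for numerical safety, not something the loss above prescribes.

```python
import math


def awac_weights(advantages, lam=1.0, max_weight=20.0):
    """exp(A / lambda) per dataset action, clipped (the clip is an assumed safeguard)."""
    return [min(math.exp(a / lam), max_weight) for a in advantages]


def awac_loss(log_probs, advantages, lam=1.0):
    """Negative advantage-weighted log-likelihood averaged over the batch."""
    weights = awac_weights(advantages, lam)
    return -sum(lp * w for lp, w in zip(log_probs, weights)) / len(weights)


advantages = [-2.0, 0.0, 2.0]          # illustrative values for three dataset actions
log_probs = [math.log(1.0 / 3.0)] * 3  # a uniform policy over three actions

sharp = awac_weights(advantages, lam=1.0)
flat = awac_weights(advantages, lam=1000.0)
print("lambda = 1   :", [round(w, 3) for w in sharp])
print("lambda = 1000:", [round(w, 3) for w in flat])
print("loss at lambda = 1:", round(awac_loss(log_probs, advantages), 3))
```

With a small temperature the update is dominated by the positive-advantage action. With a very large temperature all weights approach 1 and the loss reduces to ordinary behavioral cloning on the dataset.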

⚠Critical Lens

Strengths: AWAC elegantly unifies offline pre-training and online fine-tuning under a single objective — the same advantage-weighted regression works for both phases without modification. This makes it the most natural choice for offline-to-online transfer pipelines in robotics.

Limitations: The exponential advantage weighting $\exp(A/\lambda)$ concentrates updates heavily on actions with large positive advantage, which amplifies noise when the value function is poorly fit. On datasets where the behavior policy is uniformly mediocre (no actions with clearly positive advantage), the weighting collapses to near-uniform and AWAC effectively becomes behavioral cloning, improving only by mimicking the dataset's best actions rather than discovering better ones.

Drawback of behavior regularization

While behavior regularization addresses the core offline RL problem, it is fundamentally limited to policies close to $\pi_\beta$. If the behavior policy is suboptimal (as in most real datasets), the learned policy inherits that suboptimality. This conservative bound on policy improvement motivates looking beyond regularization to methods that more aggressively reweight or transform the Q-function itself.


Solution 2: Conservative Q-Learning (CQL)

Rather than constraining the policy, Conservative Q-Learning (CQL; Kumar et al., 2020) addresses the problem at the level of the value function itself. The key insight is that if we can ensure the Q-function does not overestimate OOD actions, the policy's argmax will remain in-distribution without any explicit policy constraint.

The CQL objective

CQL modifies the standard Bellman loss to add a pessimism term that pushes down Q-values on OOD actions and pushes up Q-values on dataset actions:

$$\mathcal{L}_{\text{CQL}}(\theta) = \underbrace{\mathcal{L}_{\text{TD}}(\theta)}_{\text{Bellman error}} + \alpha\,\mathbb{E}_{s \sim \mathcal{D}}\!\left[ \underbrace{\log \sum_a \exp Q_\theta(s,a)}_{\text{push down: log-sum-exp over all actions}} - \underbrace{\mathbb{E}_{a \sim \mathcal{D}}[Q_\theta(s,a)]}_{\text{push up: dataset actions}} \right]$$

The middle term $\log\sum_a \exp Q_\theta(s,a)$ is the log-sum-exp (soft maximum) over all actions, a differentiable approximation to $\max_a Q_\theta(s,a)$. Minimizing it pushes down the Q-value of whichever actions the network currently thinks are best, disproportionately penalizing overestimated OOD actions. The final term $\mathbb{E}_{a \sim \mathcal{D}}[Q_\theta(s,a)]$ pushes up Q-values for actions in the dataset, preserving the Bellman signal on observed transitions.
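To see why the penalty targets overestimated actions rather than shrinking everything equally, it helps to look at its gradient with respect to the Q-values in one state: the log-sum-exp contributes the softmax of the Q-values, and the dataset term subtracts 1 at the dataset action. The sketch below assumes a small discrete action set and a single dataset action per state; the numbers and function names are illustrative only.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def cql_penalty(q_values, dataset_action):
    """log-sum-exp over all actions minus the Q-value of the dataset action."""
    return logsumexp(q_values) - q_values[dataset_action]


def cql_penalty_grad(q_values, dataset_action):
    """Gradient of the penalty w.r.t. each Q(s, a): softmax(Q) minus a one-hot at the dataset action."""
    grad = softmax(q_values)
    grad[dataset_action] -= 1.0
    return grad


# One state, three actions. Action 0 is in the dataset; action 2 is an
# unsupported action that the network currently rates far too highly.
q_values = [1.0, 1.2, 6.0]
penalty = cql_penalty(q_values, dataset_action=0)
grad = cql_penalty_grad(q_values, dataset_action=0)

print("penalty :", round(penalty, 3))
print("gradient:", [round(g, 3) for g in grad])
```

A gradient-descent step moves each Q-value opposite to its gradient entry, so the inflated action is pushed down almost one-for-one, the modest unsupported action barely moves, and the dataset action is pushed up. For continuous actions the sum cannot be enumerated and has to be approximated by sampling, which is outside this sketch.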

CQL's central claim, that pessimism (lower-bounding the true value) is both theoretically justified and empirically sufficient, challenged the conventional RL intuition that learning accurate values is always desirable. The paper demonstrated that intentional underestimation on unseen actions prevents the catastrophic divergence observed in standard off-policy methods, achieving 3–5× the performance of DDPG and SAC when applied offline to MuJoCo benchmarks (D4RL). The community's initial skepticism about pessimism giving up potential improvements has been partially validated: CQL does sacrifice upside on datasets containing near-optimal or mixed trajectories compared to more aggressive methods.

Theoretical guarantee

Under mild regularity conditions, CQL guarantees that the learned Q-function is a lower bound on the true value function:

$$Q_\theta^{\pi}(s, a) \leq Q^{\pi}(s, a) \quad \forall (s,a) \in \mathcal{D}$$

This conservative (pessimistic) bias is intentional. An agent that underestimates the value of OOD actions will avoid them, which is exactly the desired behavior. The tradeoff is that CQL may also underestimate the value of in-distribution actions, leading to a policy that is overly conservative about deviations from $\pi_\beta$ even when they would be beneficial. The penalty weight $\alpha$ controls this tradeoff: large $\alpha$ is maximally conservative; small $\alpha$ approaches standard Q-learning.

⚠Critical Lens

Strengths: CQL's theoretical lower-bound guarantee provides the strongest safety signal of any offline RL method: when the guarantee's conditions hold, the Q-function does not overestimate, so the policy does not chase phantom value on unseen actions. On D4RL locomotion benchmarks, CQL achieves 3–5× the performance of naive SAC/DDPG applied offline.

Limitations: The log-sum-exp penalty over all actions requires either enumerating the action space (infeasible for continuous actions) or importance-sampling OOD actions, introducing estimator variance. CQL's conservatism can undershoot even in-distribution actions, leaving performance on the table when the dataset is already near-optimal. The community has observed that CQL-initialized policies sometimes require significant online fine-tuning to recover the remaining performance gap.

Connection to offline-to-online transfer

CQL-initialized policies can be efficiently fine-tuned online: the conservative Q-function provides a stable initialization that prevents early-training instability, and online interaction quickly corrects the intentional underestimation bias. This offline pre-training followed by online fine-tuning is a dominant practical pipeline in robotics deployment. In contrast to pure offline methods, the offline-to-online setting shows that pessimism's cost (reduced upside) diminishes rapidly once data collection resumes.


Implicit Q-Learning (IQL)

A limitation of CQL and behavior regularization approaches is that policy improvement still requires querying the Q-function on actions outside the dataset. Implicit Q-Learning (IQL; Kostrikov et al., 2021) avoids this entirely by reformulating the Bellman backup to never evaluate the Q-function outside the dataset.

The key idea

Standard Q-learning bootstraps with $\max_{a'} Q(s', a')$, which requires querying Q at the optimal action, which may be OOD. IQL replaces this maximum with an expectile regression over dataset actions:

$$\mathcal{L}_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[ L_\tau^2(Q_\theta(s, a) - V_\psi(s)) \right]$$

where $L_\tau^2(u) = |\tau - \mathbf{1}_{u < 0}| \cdot u^2$ is the asymmetric $L_2$ loss. At $\tau = 0.5$ this is ordinary MSE (fitting the mean). At $\tau \to 1$ it fits the maximum Q-value achievable by in-distribution actions. The Q-function is then updated via:

$$\mathcal{L}_Q(\theta) = \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\!\left[(r + \gamma V_\psi(s') - Q_\theta(s,a))^2\right]$$

Finally, the policy is extracted by advantage-weighted regression on dataset actions:

$$\pi_\theta = \arg\min_\pi -\mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[\exp\bigl(\beta (Q_\theta(s,a) - V_\psi(s))\bigr) \log \pi_\theta(a|s)\right]$$
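The behavior of the asymmetric loss is easy to check numerically. The sketch below fits a single scalar $V$ to a handful of Q-values for one state by gradient descent on $L_\tau^2$; the Q-values, step size, and iteration count are arbitrary illustrative choices, and a real implementation would fit a network $V_\psi$ over many states instead.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss |tau - 1{diff < 0}| * diff^2, with diff = Q - V."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(q_values, tau, steps=2000, lr=0.1):
    """Gradient descent on a scalar v minimizing the mean expectile loss of (q - v)."""
    v = sum(q_values) / len(q_values)
    for _ in range(steps):
        grad = 0.0
        for q in q_values:
            diff = q - v
            weight = tau if diff >= 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # d/dv of weight * (q - v)^2
        v -= lr * grad / len(q_values)
    return v


# Q-values of the dataset actions observed in one state (illustrative numbers).
q_in_state = [0.0, 1.0, 2.0, 10.0]

v_mean = fit_expectile(q_in_state, tau=0.5)
v_mid = fit_expectile(q_in_state, tau=0.9)
v_high = fit_expectile(q_in_state, tau=0.99)
print(f"tau=0.50 -> {v_mean:.3f}")
print(f"tau=0.90 -> {v_mid:.3f}")
print(f"tau=0.99 -> {v_high:.3f}")
```

At $\tau = 0.5$ the fit sits at the mean of the four values, and as $\tau$ increases it moves toward the largest one while never exceeding it. No value outside the listed dataset actions is ever evaluated, which is the in-sample property the method relies on.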

IQL's central innovation—using expectile regression to implicitly define value targets without OOD action sampling—introduced a qualitatively different approach to offline RL. Rather than penalizing the Q-function (CQL) or constraining the policy (AWAC), IQL reformulates the backup itself to avoid OOD evaluation entirely. The empirical result was striking: IQL matched or exceeded CQL on D4RL benchmarks while being simpler to implement and tune. The method has become the de facto standard for offline robot learning from 2022 onward, particularly because the advantage-weighted policy extraction (identical to AWAC's supervised loss) combines seamlessly with diffusion policies and other expressive model classes.

IQL is entirely in-sample: every step of training evaluates Q and V only on $(s,a)$ pairs from $\mathcal{D}$. This makes it more stable than CQL in practice and easier to combine with expressive policy classes (diffusion policies, transformers). IQL with $\tau = 0.7$–$0.9$ achieves strong performance across D4RL offline benchmarks.

⚠Critical Lens

Strengths: IQL's fully in-sample design eliminates the single largest source of instability in offline RL: OOD action evaluation. The method has fewer hyperparameters than CQL (only $\tau$ and the advantage temperature $\beta$), making it easier to tune across diverse datasets. Its compatibility with expressive policy classes (diffusion policies, transformers) has made it the de facto standard for offline robot learning from 2022 onward.

Limitations: The stitching gap (see "When IQL applies and when it does not" below) is fundamental, not an implementation detail: IQL can never discover a policy better than the convex combination of trajectories already in the dataset. On tasks requiring synthesis of meaningfully different trajectory segments (e.g., navigation where the start and goal are in different data clusters), IQL underperforms CQL. The expectile parameter $\tau$ also introduces a bias–variance tradeoff that is dataset-dependent: values near 0.5 are stable but conservative, while values near 1.0 risk instability.

When IQL applies and when it does not

IQL shines on datasets where behavioral cloning plus mild value improvement is sufficient, particularly in robotics where the behavior policy is often competent but suboptimal. It struggles on datasets where successful policies require stitching together disparate trajectory segments: because IQL never queries high-value OOD actions, it cannot cross trajectories with large advantage gaps. This limitation becomes critical for tasks requiring significant policy improvement beyond the behavior policy, where CQL (or other pessimistic methods) may perform better.


Off-policy evaluation

If the agent cannot interact with the environment during training, it also cannot measure its policy's performance directly. Deploying a poorly performing policy on real hardware or real users is costly. Off-policy evaluation (OPE) estimates the expected return of a target policy $\pi_e$ using only the offline dataset collected by $\pi_\beta$.

Direct method (DM)

Fit a Q-function $\hat{Q}^{\pi_e}$ using the offline dataset and evaluate it at the initial state distribution:

$$V^{\pi_e}_{\text{DM}} = \mathbb{E}_{s_0}\!\left[\sum_a \pi_e(a|s_0) \hat{Q}^{\pi_e}(s_0, a)\right]$$

Simple but biased: all errors in $\hat{Q}$ propagate directly into the estimate.

Importance sampling (IS)

Reweight trajectory returns by the density ratio between $\pi_e$ and $\pi_\beta$:

$$\rho_{0:T} = \prod_{t=0}^T \frac{\pi_e(a_t | s_t)}{\pi_\beta(a_t | s_t)}, \qquad V^{\pi_e}_{\text{IS}} = \mathbb{E}_{\tau \sim \mathcal{D}}\!\left[\rho_{0:T} \sum_{t=0}^T \gamma^t r_t\right]$$

IS is unbiased but its variance grows exponentially with trajectory length $T$: writing $\rho_t = \pi_e(a_t | s_t) / \pi_\beta(a_t | s_t)$ for the per-step weight and treating the steps as independent, the product of importance weights has variance $\prod_t \mathbb{E}[\rho_t^2] - 1$, which explodes unless $\pi_e \approx \pi_\beta$. For long-horizon tasks, IS estimates are effectively unusable. This exponential variance is not a minor implementation detail: it is a fundamental statistical barrier that IS cannot overcome without additional structure.
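The variance expression can be evaluated directly in a simplified case. The sketch below assumes state-independent policies over a small discrete action set, so every step has the same second moment $\mathbb{E}_{a \sim \pi_\beta}[\rho^2]$ and the product over a horizon becomes a power; both policies are made-up numbers.

```python
def is_weight_variance(pi_e, pi_b, horizon):
    """Variance of a product of `horizon` independent per-step importance weights.

    Assumes both policies are the same fixed distributions at every step,
    so the variance is E[rho^2]^horizon - 1 with rho = pi_e(a) / pi_b(a), a ~ pi_b.
    """
    second_moment = sum(pb * (pe / pb) ** 2 for pe, pb in zip(pi_e, pi_b))
    return second_moment ** horizon - 1.0


pi_b = [0.5, 0.5]  # behavior policy: uniform over two actions
pi_e = [0.9, 0.1]  # evaluation policy: strongly prefers action 0

for horizon in (1, 10, 50):
    print(horizon, f"{is_weight_variance(pi_e, pi_b, horizon):.3g}")
```

Here the per-step second moment is 1.64, so the variance is 0.64 for a single step and grows by that factor with every additional step. When the two policies coincide the second moment is exactly 1 and the variance stays at zero for any horizon.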

Doubly robust (DR) estimator

DR combines the direct method as a variance-reduction baseline with IS for residual correction:

$$V^{\pi_e}_{\text{DR}} = V^{\pi_e}_{\text{DM}} + \mathbb{E}_{\tau \sim \mathcal{D}}\!\left[ \sum_{t=0}^T \gamma^t \rho_{0:t} \left(r_t + \gamma \hat{V}(s_{t+1}) - \hat{Q}(s_t, a_t)\right) \right]$$

The DR estimator is doubly robust in the precise sense: it is consistent (converges to the true value) if either the importance weights are correct (accurate $\pi_\beta$ model) or the value model $\hat{Q}$ is correct; only one of the two needs to hold. This resilience makes DR the standard OPE estimator in industrial recommendation systems and AI evaluation pipelines, where one or the other model component is typically reliable. The double robustness property and its implications for offline evaluation are developed in Kennedy (2023); in practice, DR typically outperforms both DM and IS, especially when one of the two component models (behavior policy or value function) is moderately well-fit to the data.
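A tabular sketch of the DR formula may help make the double robustness tangible. It assumes discrete states and actions, policies and $\hat{Q}$ given as nested lists indexed `[state][action]`, and $\hat{V}(s) = \sum_a \pi_e(a|s)\hat{Q}(s,a)$; the trajectory format and all names are hypothetical. The example is a one-step problem with rewards 1 and 0 for the two actions, so the true value of the evaluation policy is 0.8.

```python
def v_hat(q_hat, pi_e, s):
    """Model-based state value: expectation of Q-hat under the evaluation policy."""
    return sum(p * q for p, q in zip(pi_e[s], q_hat[s]))


def doubly_robust(trajectories, pi_e, pi_b, q_hat, gamma):
    """Average over trajectories of V-hat(s_0) plus importance-weighted TD residuals.

    Each trajectory is a list of (s, a, r, s_next, done) tuples.
    """
    total = 0.0
    for traj in trajectories:
        estimate = v_hat(q_hat, pi_e, traj[0][0])  # direct-method baseline
        rho = 1.0                                  # cumulative weight rho_{0:t}
        for t, (s, a, r, s_next, done) in enumerate(traj):
            rho *= pi_e[s][a] / pi_b[s][a]
            next_v = 0.0 if done else v_hat(q_hat, pi_e, s_next)
            estimate += gamma ** t * rho * (r + gamma * next_v - q_hat[s][a])
        total += estimate
    return total / len(trajectories)


# One state, two actions, one step. Logged data: each action taken once.
data = [[(0, 0, 1.0, 0, True)], [(0, 1, 0.0, 0, True)]]
pi_e = [[0.8, 0.2]]
true_pi_b = [[0.5, 0.5]]
wrong_pi_b = [[0.9, 0.1]]
exact_q = [[1.0, 0.0]]
wrong_q = [[0.0, 0.0]]

dr_good_weights_bad_q = doubly_robust(data, pi_e, true_pi_b, wrong_q, gamma=1.0)
dr_bad_weights_good_q = doubly_robust(data, pi_e, wrong_pi_b, exact_q, gamma=1.0)
print(dr_good_weights_bad_q, dr_bad_weights_good_q)
```

With an exact value model every residual is zero, so the importance weights do not matter; with exact weights the residual term corrects the wrong value model on average. Both cases recover 0.8 here, although with real data the second case still carries the variance of the weights.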

⚠Critical Lens

Strengths: DR's double robustness is a genuine practical advantage: in many deployment settings, either the behavior policy model (collected from logged interactions) or the value model (trained on abundant offline data) is reliable, and DR degrades gracefully when one fails. The importance-weight truncation trick (capping $\rho_t$) provides a tunable bias–variance knob for practitioners.

Limitations: DR's consistency guarantee is asymptotic and requires at least one of the two models to be correctly specified. In finite samples both models can be misspecified at once, and DR offers no guarantee in that regime. The per-step importance weight $\rho_t$ must still be computed, requiring either a known $\pi_\beta$ or a learned behavior policy model, and errors in the behavior model compound multiplicatively. For safety-critical applications (medical treatment, autonomous driving), OPE is strictly a screening tool; it cannot substitute for real-world validation.

OPE in practice

In offline RL deployment (robotics, autonomous systems), OPE serves as a sanity check rather than a definitive metric. The standard practice is to use DR with conservative importance weight truncation (capping $\rho_t$ at some threshold) to trade off bias and variance, then validate on a small held-out batch of real-world data or simulation rollouts if feasible. This hybrid approach (OPE plus limited online validation) reflects the reality that no offline evaluation method can fully substitute for interaction with the environment.


GenAI context: RLHF as offline RL

Offline RL is not a niche robotics topic: it is the structural setting of modern LLM (large language model) alignment.

In the RLHF pipeline: a fixed dataset of human preference labels is collected once; a reward model is trained on this static data; and the LLM policy is optimized against the reward model without further human annotation in the training loop. This is offline RL on a fixed preference dataset, with the reward model playing the role of $\hat{R}$.

The distributional shift problem appears directly: if PPO (Proximal Policy Optimization) pushes the LLM policy to maximize the reward model, it will find OOD text sequences, that is, responses outside the distribution of the preference dataset, that score highly under the reward model but are not actually preferred by humans. This is exactly extrapolation error on OOD actions. The reward model, trained only on prompts and responses from the SFT model, has no reliable predictions for text far from that distribution.

The standard fix is the KL penalty against the SFT reference model:

$$\max_\pi \mathbb{E}\!\left[r_\phi(x, y) - \beta\, D_{\text{KL}}(\pi \| \pi_{\text{SFT}})\right]$$

This is behavior regularization with $\pi_\beta = \pi_{\text{SFT}}$: it prevents the optimized policy from straying into regions where the reward model cannot be trusted. The complete theoretical analysis of this connection, including the closed-form solution of the KL-regularized RLHF objective, is the starting point of the DPO derivation studied in Week 13.
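As a small preview of that closed form, for a single prompt with a finite set of candidate responses the maximizer of the KL-regularized objective is the reference distribution reweighted by exponentiated reward, $\pi^*(y) \propto \pi_{\text{SFT}}(y)\exp(r_\phi(y)/\beta)$. The sketch below checks this numerically; the three candidate responses, their reference probabilities, and their reward-model scores are invented for illustration.

```python
import math


def kl_regularized_optimum(pi_ref, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), computed stably in log space."""
    logits = [math.log(p) + r / beta for p, r in zip(pi_ref, rewards)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]


def objective(pi, pi_ref, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_ref) over a finite response set."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl


pi_sft = [0.7, 0.2, 0.1]   # reference probabilities of three candidate responses
rewards = [0.0, 1.0, 5.0]  # the rare third response gets a suspiciously high score
beta = 1.0

pi_star = kl_regularized_optimum(pi_sft, rewards, beta)
greedy = [0.0, 0.0, 1.0]   # unregularized reward maximizer

print("pi*      :", [round(p, 3) for p in pi_star])
print("J(pi*)   :", round(objective(pi_star, pi_sft, rewards, beta), 3))
print("J(greedy):", round(objective(greedy, pi_sft, rewards, beta), 3))
print("J(pi_sft):", round(objective(pi_sft, pi_sft, rewards, beta), 3))
```

The reweighted policy scores higher on the regularized objective than both the reference policy and the pure reward maximizer. Raising $\beta$ pulls it back toward the reference, which is the same conservatism knob seen in the behavior-regularized objective earlier in the lecture.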


◆Open Problems
  • Offline RL at LLM scale: Current methods (CQL, IQL) are validated on D4RL (MuJoCo tasks with tens of state dimensions). Scaling to the action space of language model token generation (~50K discrete actions) with billion-parameter architectures introduces new challenges: the log-sum-exp in CQL becomes intractable, and in-sample methods must handle extreme sparsity (most tokens never appear in the preference dataset).
  • Offline-to-online transfer guarantees: Empirically, offline pre-training followed by online fine-tuning is the dominant pipeline, but no method provides a formal guarantee on how many online steps are needed to correct the offline conservatism. The pessimism cost at convergence is unknown.
  • Heterogeneous data sources: Real offline datasets combine teleoperation, scripted policies, human demonstrations, and random exploration, often without source labels. Current methods assume a single $\pi_\beta$; methods that model the dataset as a mixture of behavior policies could enable better stitching across diverse sources.
  • Stitching vs. conservatism Pareto frontier: There is no unified framework for trading off IQL-style stitching (cross-trajectory synthesis) against CQL-style conservatism (safety against OOD overestimation). Understanding this tradeoff theoretically would guide method selection.

Key takeaways

Offline RL learns from fixed datasets without environment interaction. Standard off-policy algorithms fail because the Q-function provides unreliable estimates for OOD actions, and Bellman backups propagate these errors into a catastrophic feedback loop. Behavior regularization constrains the policy to stay close to $\pi_\beta$, preventing OOD action selection; AWAC implements this via advantage-weighted supervised regression, enabling direct offline-to-online transfer. CQL addresses the problem at the value function level: a modified Bellman loss explicitly penalizes high Q-values on OOD actions, guaranteeing a conservative lower bound on the true value function. IQL avoids all OOD evaluation by replacing the Bellman max with expectile regression over dataset actions, making it fully in-sample. OPE via doubly robust estimators allows performance evaluation without environment interaction. These methods collectively constitute the machinery underlying RLHF, decision-making from offline datasets in robotics, and any AI system that must learn from historical data without the ability to explore.


Conceptual questions

  1. Standard SACSoft Actor-Critic applied to an offline dataset achieves near-zero performance despite using the same Bellman updates and neural architectures that work in online settings. Trace the failure mechanism step by step: starting from OOD Q-value overestimation in a single state-action pair, show how the error propagates through successive Bellman backups to corrupt the entire value function. Identify the one structural difference between offline and online RLReinforcement Learning that breaks the self-correcting property of online Bellman backups.

  2. CQL adds a log⁡∑aexp⁡Qθ(s,a)\log\sum_a \exp Q_\theta(s,a)log∑a​expQθ​(s,a) term to the loss. Show that this term is the log-sum-exp softmax approximation to max⁡aQθ(s,a)\max_a Q_\theta(s,a)maxa​Qθ​(s,a). Why does minimizing this term disproportionately penalize overestimated OOD actions rather than uniformly shrinking all Q-values? How does the tradeoff weight α\alphaα in CQL relate to the KL penalty β\betaβ in RLHFReinforcement Learning from Human Feedback behavior regularization — are they solving the same underlying problem?

  3. IQL fits a value function Vψ(s)V_\psi(s)Vψ​(s) using expectile regression at τ∈(0.5,1)\tau \in (0.5, 1)τ∈(0.5,1) on the Q-values of dataset actions. Explain why τ=0.5\tau = 0.5τ=0.5 recovers the mean Q-value and τ→1\tau \to 1τ→1 approximates the maximum. In a dataset where the behavior policy is uniformly random over all actions, what does IQL with τ=0.9\tau = 0.9τ=0.9 effectively compute, and how does this relate to the optimal value V∗(s)V^*(s)V∗(s)?

  4. An offline RLReinforcement Learning system for a medical treatment recommendation task is trained on a dataset where πβ\pi_\betaπβ​ consists of conservative physician decisions that rarely deviate from standard of care. The learned policy learns to recommend aggressive treatments with higher expected return but minimal dataset support. Identify the failure mode using the offline RLReinforcement Learning framework, explain which of the three solution approaches (behavior regularization, CQL, IQL) would best prevent this failure, and describe the tradeoff introduced by applying that approach.

  5. The doubly robust OPE estimator is consistent if either the importance weights or the value model is correctly specified, but not necessarily both. Describe a scenario where (a) $\pi_e \approx \pi_\beta$, making IS reliable, but the direct-method model $\hat{Q}$ is inaccurate, and (b) $\pi_e$ is very different from $\pi_\beta$, making IS unreliable, but $\hat{Q}$ is accurate. For case (b), explain what happens to the IS variance as the evaluation horizon $T$ grows, and why DR remains reliable despite high IS variance in this case.


✦Solutions
  1. Offline SAC collapse. An OOD $(s,a)$ receives an overestimated $Q$ with no data to refute it; that inflated value enters the bootstrap target for predecessor states ($\max_{a'}Q(s',a')$ in Q-learning, or the soft value of the policy's own sampled action at $s'$ in SAC), inflating their values, and the error propagates through successive backups to corrupt the whole value function. The one structural difference: online RL self-corrects because it can visit the overrated action and observe its true low reward, pulling $Q$ down. Offline RL can never collect that corrective data, so the overestimation is never refuted.
  2. CQL. $\log\sum_a\exp Q_\theta(s,a)$ is the log-sum-exp soft maximum, approaching $\max_a Q_\theta(s,a)$ as the gap between the largest Q-value and the rest grows. Its gradient with respect to each $Q_\theta(s,a)$ is the softmax weight of that action, which concentrates on the highest $Q$, so minimizing it pushes down the most overestimated (typically OOD) actions disproportionately rather than shrinking all values uniformly, while the paired term lifts dataset-action values. CQL's $\alpha$ and RLHF's $\beta$ are the same kind of knob: both trade reward/value maximization against staying close to the data/reference distribution.
  3. IQL expectile. Expectile regression at $\tau=0.5$ minimizes squared error and recovers the mean $Q$ over dataset actions; as $\tau\to1$ it weights upper deviations more heavily and approaches the in-support maximum without querying OOD actions. With a uniformly random behavior policy, $\tau=0.9$ effectively estimates an upper expectile of the in-dataset action values, lying between their mean and their maximum; it approaches $V^*(s)$ only to the extent that the optimal action is covered by the data and $\tau$ is pushed toward 1. In general the limit is the in-support maximum, a lower bound on $V^*$.
  4. Medical aggressive treatments. The policy extrapolates to OOD aggressive actions with high predicted return but minimal dataset support: overestimation off the behavior distribution, which is dangerous in a safety-critical setting. Behavior regularization (keeping the policy close to the conservative $\pi_\beta$) best prevents it, with CQL a close alternative. The tradeoff is that conservatism caps achievable improvement: you forgo potentially better treatments to remain within demonstrated, supported actions.
  5. Doubly robust OPE. (a) When $\pi_e\approx\pi_\beta$ the importance weights are near 1 and reliable, so DR is accurate even if $\hat Q$ is wrong. (b) When $\pi_e$ differs greatly from $\pi_\beta$, the IS weights, which are products of per-step ratios, have variance that can grow exponentially with the horizon $T$, making pure IS useless; but an accurate $\hat Q$ carries the direct-method estimate while the IS term only corrects small residuals, so DR stays reliable. DR is consistent if either component is correctly specified.
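
The expectile behavior described in solution 3 can be checked numerically. The following is a minimal sketch, not IQL itself: it computes the $\tau$-expectile of a fixed list of numbers by fixed-point iteration, whereas IQL fits $V_\psi(s)$ with a neural network and gradient descent. The function name `expectile` and the sample `q_values` are hypothetical and chosen only for illustration.

```python
def expectile(values, tau, iters=200):
    """Return the tau-expectile of `values` by fixed-point iteration.

    The tau-expectile m minimizes sum_i |tau - 1[x_i < m]| * (x_i - m)^2,
    so it is a weighted mean with weight tau on points above m and
    (1 - tau) on points at or below m.
    """
    m = sum(values) / len(values)  # the tau = 0.5 solution is the plain mean
    for _ in range(iters):
        weights = [tau if x > m else 1.0 - tau for x in values]
        m = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return m


# Hypothetical Q(s, a) values for the dataset actions observed at one state.
q_values = [1.0, 2.0, 3.0, 4.0, 10.0]

for tau in (0.5, 0.7, 0.9, 0.99):
    print(f"tau={tau}: expectile={expectile(q_values, tau):.3f}")
```

On this toy list the estimate starts at the mean for $\tau = 0.5$ and moves monotonically toward the largest value as $\tau$ increases, without ever exceeding it, which is the sense in which IQL approximates an in-support maximum.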

Implementation exercises

Exercise 1: Reproduce the offline RL failure

Train a standard SAC agent online on HalfCheetah-v3 until it reaches a moderate performance level (return ~4000). Save all transitions to create a "medium" offline dataset. Then:

  1. Train a new SAC agent offline on this dataset (no environment interaction, just replay buffer sampling). Track the Q-value estimates and policy return over training.
  2. You should observe Q-values inflating to 10–100× the true return while actual policy performance collapses to near zero. Plot the divergence between estimated and true Q-values.
  3. Add behavior cloning (BC) as a baseline on the same dataset. Observe that BC achieves non-trivial return while offline SAC fails, demonstrating that the failure is specific to value-based methods, not a data quality issue.

Identify the moment in training where the Q-value divergence begins and correlate it with the policy starting to select actions outside the data distribution.
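
One way to obtain the "true" Q-values for the divergence plot is to roll out the current policy and compute Monte Carlo discounted returns along each episode. The sketch below assumes you have already logged per-step rewards and the critic's estimates for one rollout; the helper names `discounted_returns` and `overestimation_ratio` are hypothetical, and the ratio is undefined when the mean return is zero.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return the discounted return-to-go G_t for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def overestimation_ratio(q_estimates, rewards, gamma=0.99):
    """Mean critic estimate divided by mean Monte Carlo return on one rollout.

    Assumes the rollout's mean return is non-zero. Values well above 1
    indicate the critic is inflating relative to what the policy achieves.
    """
    mc = discounted_returns(rewards, gamma)
    return (sum(q_estimates) / len(q_estimates)) / (sum(mc) / len(mc))


# Toy rollout: three steps of reward 1 with a critic that is 10x too high.
toy_rewards = [1.0, 1.0, 1.0]
toy_q = [17.5, 15.0, 10.0]
ratio = overestimation_ratio(toy_q, toy_rewards, gamma=0.5)
```

Because episodes cut off by a time limit drop the reward that would have followed, the Monte Carlo returns near the end of a truncated rollout are biased low; comparing only the early steps of each episode is one way to reduce that effect.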

Exercise 2: Implement Conservative Q-Learning (CQL)

Starting from your offline SAC implementation, add the CQL penalty term:

  • For continuous actions, the log-sum-exp term $\log\sum_a \exp Q_\theta(s,a)$ cannot be computed exactly. Use importance sampling: sample $M$ actions from the current policy $\pi_\theta$ and $M$ actions from a uniform distribution over the action space, correct each sampled Q-value by the log-density of the distribution it was drawn from, then compute the soft maximum over the combined set.
  • Tune $\alpha \in \{0.1, 1.0, 5.0, 10.0\}$. Report the final policy return after offline training on the medium dataset from Exercise 1.
  • Compare with the BC baseline: does CQL exceed BC performance, and at what $\alpha$?
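
The sampled log-sum-exp estimate in the first bullet can be sketched as follows. This is a framework-free illustration for a single state, assuming actions are bounded in $[-1, 1]^d$; `q_fn` and `policy_sample` are hypothetical stand-ins for your critic and for a policy that returns an action together with its log-probability. In a real implementation the same computation would be batched in your deep learning framework and added to the Bellman loss with weight $\alpha$.

```python
import math
import random


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def cql_penalty(q_fn, policy_sample, state, data_action, action_dim,
                num_samples=10, rng=random):
    """Sampled estimate of log-sum-exp_a Q(s, a) minus Q(s, a_data)."""
    log_uniform_density = -action_dim * math.log(2.0)  # uniform on [-1, 1]^d
    terms = []
    for _ in range(num_samples):
        # Uniform proposal: subtract its log-density as the importance weight.
        a_unif = [rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
        terms.append(q_fn(state, a_unif) - log_uniform_density)
        # Policy proposal: subtract the policy's log-probability.
        a_pi, log_prob = policy_sample(state)
        terms.append(q_fn(state, a_pi) - log_prob)
    soft_max = logsumexp(terms) - math.log(len(terms))
    return soft_max - q_fn(state, data_action)


# Toy check with hypothetical stand-ins: a constant critic and a uniform "policy".
def constant_q(state, action):
    return 3.0


def uniform_policy(state):
    action = [random.uniform(-1.0, 1.0)]
    return action, -math.log(2.0)  # log-density of uniform on [-1, 1]


penalty = cql_penalty(constant_q, uniform_policy, state=None,
                      data_action=[0.0], action_dim=1)
```

For a constant critic the estimate reduces to the log-volume of the action space ($\log 2$ in this one-dimensional toy case), which is a quick sanity check that the importance weights are applied with the right sign.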

Test on halfcheetah-medium-v2 from D4RL: CQL should achieve a normalized score of ~47 (vs ~42 for BC, ~30 for naive SAC). The d4rl package provides the dataset directly via d4rl.qlearning_dataset(env).

Exercise 3: Compare IQL and CQL on a stitching task

Construct a custom dataset for PointMaze or a simple navigation task where:

  • Half the trajectories go from start to a midpoint and terminate.
  • Half go from the midpoint to the goal and terminate.
  • No single trajectory goes from start to goal.

Train IQL ($\tau = 0.7, 0.9$) and CQL ($\alpha = 1.0, 5.0$) offline on this dataset. Measure:

  • Success rate (start → goal completion) for each method.
  • The IQL-CQL performance gap — IQL should struggle (no trajectory covers the full path, and IQL cannot stitch across the gap), while CQL should succeed by identifying the midpoint → goal segment as high-value and crossing into it.

This exercise directly tests the stitching limitation discussed in the IQL section.
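
The dataset described above could be built along these lines. The sketch uses a 1-D corridor rather than PointMaze so that it stays self-contained; the function name, the corridor layout, the 80/20 action noise, and the tuple layout are all assumptions made for illustration. One design choice worth noting: trajectories that stop at the midpoint are flagged as timeouts rather than terminals, on the assumption that you want the learner to keep bootstrapping through the midpoint state.

```python
import random


def make_stitching_dataset(num_trajectories=100, midpoint=5, goal=10, seed=0):
    """Build transitions (s, a, r, s_next, terminal, timeout) on cells 0..goal.

    Even-indexed trajectories walk start -> midpoint, odd-indexed ones walk
    midpoint -> goal, so no single trajectory covers start -> goal.
    """
    rng = random.Random(seed)
    dataset = []
    for i in range(num_trajectories):
        pos, end = (0, midpoint) if i % 2 == 0 else (midpoint, goal)
        while pos < end:
            # Mostly step right, sometimes left, so the data is suboptimal.
            action = 1 if rng.random() < 0.8 else -1
            nxt = min(max(pos + action, 0), goal)
            terminal = nxt == goal                    # true end of the task
            timeout = (nxt == end) and not terminal   # trajectory cut at midpoint
            reward = 1.0 if terminal else 0.0
            dataset.append((pos, action, reward, nxt, terminal, timeout))
            pos = nxt
    return dataset


dataset = make_stitching_dataset(num_trajectories=20)
```

Only the second-half trajectories ever receive reward, so any start-to-goal success at evaluation time has to come from value propagating backward through the shared midpoint state.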


Extension prompts

  1. Offline RL with diffusion policies: Replace IQL's Gaussian policy with a diffusion model (as in Diffuser or Decision Diffuser). Train on the same D4RL datasets and compare: (a) performance on heterogeneous data where the behavior policy is a mixture of experts and random actions, and (b) the ability to generate multi-modal action distributions. Does the expressivity of the diffusion policy compensate for IQL's stitching limitation?

  2. OPE for model selection without deployment: Implement the doubly robust estimator on the D4RL halfcheetah datasets. Use OPE to rank 5 candidate policies (trained with different offline RL methods on the same dataset) by estimated return. Then evaluate all 5 policies in the real environment. Measure the Spearman rank correlation between OPE-predicted and true returns. At what dataset size does OPE ranking become reliable?

  3. RLHF as offline RL, the extrapolation failure: Take a small language model fine-tuned via RLHF (PPO with a reward model). Freeze the reward model and generate completions from the optimized policy. Compare the reward model scores with human preference ratings on a held-out set of prompts. Identify text sequences where the reward model score is high but human preference is low: these are the OOD "actions" (text completions) that the reward model overestimates, exactly analogous to the CQL scenario.
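
For the ranking comparison in extension prompt 2, the Spearman rank correlation can be computed without extra dependencies. The sketch below is one plain-Python version that averages ranks for ties; the five OPE estimates and true returns are made-up numbers used only to exercise the function, and the correlation is undefined if either list is constant.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical numbers for five candidate policies.
ope_estimates = [4100.0, 3900.0, 4600.0, 2500.0, 4400.0]
true_returns = [4000.0, 4200.0, 4700.0, 2000.0, 4300.0]
rho = spearman(ope_estimates, true_returns)
```

With only five candidates a single swapped pair moves the coefficient noticeably, so it may be worth repeating the ranking over several seeds or dataset subsamples before drawing conclusions about reliability.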


Looking ahead

Offline RL provides the theoretical foundation for the most important application of RL in modern AI.

Week 12: Reinforcement Learning from Human Feedback (RLHF). We study how the SFT, reward modeling, and KL-regularized PPO pipeline transforms a pre-trained language model into an aligned conversational agent, and examine the limitations of this approach that motivate the preference-optimization methods of Week 13.


Further reading

  • Levine, S., et al. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv. (The definitive overview).
  • Kumar, A., et al. (2019). Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS. (BEAR — behavior regularization via MMD constraint).
  • Kumar, A., et al. (2020). Conservative Q-Learning for Offline Reinforcement Learning. NeurIPS. (CQL).
  • Kostrikov, I., et al. (2022). Offline Reinforcement Learning with Implicit Q-Learning. ICLR. (IQL).
  • Nair, A., et al. (2020). Accelerating Online Reinforcement Learning with Offline Datasets. arXiv. (AWAC).
  • Kennedy, E. H. (2023). Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics. (Double robustness theory — the statistical foundation for the DR estimator in OPE).