Week 11: Offline Reinforcement Learning

Grounded In

Robotics: Offline RL is the standard deployment pipeline for real-robot systems: collect demonstration and suboptimal data from teleoperation or scripted policies, train conservatively offline, then fine-tune online. CQL-initialized policies on Franka manipulation and autonomous driving stacks follow this exact offline-to-online pattern. OPE via doubly robust estimators serves as the deployment gate before real-hardware testing.
GenAI: RLHF is offline RL on a fixed preference dataset — the KL penalty against the SFT model is exactly behavior regularization with $\pi_\beta = \pi_{\text{SFT}}$ . The distributional shift problem (reward model inflating OOD text scores) is the same mechanism as Q-value overestimation on unseen actions. This connection is the theoretical bridge to DPO in Week 13.

Purpose of this lecture#

Every algorithm studied so far has assumed the agent can interact with the environment during training. This assumption fails in three practically critical settings: when environment interaction is dangerous (surgical robotics, clinical decision support, autonomous driving), expensive (physical robot hardware, real-time industrial systems), or ethically constrained (deploying exploratory policies on real users). In all three cases, the agent must learn purely from a fixed dataset collected in the past.

Offline reinforcement learning (also called batch RL) studies exactly this setting: given a static dataset $\mathcal{D}$ of transitions, learn the best possible policy without any further environment interaction. The goal is the same as standard RL — maximize expected return — but the information constraint is fundamentally different.

The challenge is not a data-engineering problem. Standard off-policy algorithms (DQN, SAC, DDPG) already learn from replay buffers of past experience. When those same algorithms are applied to a fixed offline dataset with no additional collection, they fail catastrophically. The cause is a structural property of the Bellman backup and function approximation that does not appear in the online setting: distributional shift combined with overestimated Q-values. Understanding why this failure occurs and how modern offline RL methods address it is the subject of this lecture.

The offline RL setting#

The agent is given a fixed dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^N$ collected by a behavior policy $\pi_\beta$ — which may be a prior agent, a scripted policy, a human demonstrator, or a mixture. The behavior policy is typically unknown.

The agent must find:

\pi^* = \arg\max_\pi J(\pi) = \arg\max_\pi \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t r_t\right]

using only transitions in $\mathcal{D}$; it cannot query the environment for new transitions. This is a harder problem than off-policy RL (DQN, SAC). Those methods also learn from a replay buffer, but the agent still periodically collects fresh data that corrects errors. In offline RL there is no such correction mechanism.

Dataset quality spectrum#

The difficulty of offline RL depends heavily on the dataset:

Expert data: all transitions come from a near-optimal policy. Behavioral cloning (supervised imitation) already works well here, and offline RL adds marginal benefit.
Scripted or random data: transitions from heuristic or random policies. Value function learning is possible but stitching across different behavior modes is required.
Mixed data: a combination of expert, random, and possibly suboptimal transitions. This is the most realistic and most difficult setting — the offline RL algorithm must identify and improve upon the best parts of the dataset.

The dataset quality matters because the learned policy can only be as good as the best trajectory that can be stitched together from individual transitions in $\mathcal{D}$ . Offline RL's unique promise — and its central difficulty — is that it can potentially produce a policy better than any individual trajectory in the dataset by combining good segments from different trajectories.

Why standard off-policy RL fails offline#

The distributional shift problem#

A standard Q-learning update selects actions by maximizing the Q-function:

\pi(s) = \arg\max_a Q_\theta(s, a)

The Q-network $Q_\theta$ is trained only on state-action pairs $(s, a)$ present in $\mathcal{D}$ . For out-of-distribution (OOD) actions — actions the behavior policy never took in state $s$ — the network's outputs are unconstrained by any training signal. They are whatever the network extrapolates from nearby training points, and that extrapolation can be wildly incorrect.

Because the policy takes the argmax over all actions (including OOD ones), it actively selects OOD actions whenever the network happens to assign them a falsely high Q-value. This is the distributional shift problem: the policy induces a state-action distribution that differs systematically from the dataset distribution $d^{\pi_\beta}(s, a)$ .

Bootstrapping error amplifies the problem#

Distributional shift alone would be manageable if the Q-network's errors on OOD actions were random. The deeper problem is that Bellman backups propagate errors. Consider:

Q_\theta(s, a) \leftarrow r + \gamma \max_{a'} Q_\theta(s', a')

If $Q_\theta(s', a')$ is overestimated for some OOD $a'$ , this overestimate is used as the target for updating $Q_\theta(s, a)$ . On the next backup, this corrupted estimate of $Q_\theta(s, a)$ propagates to states that transition to $s$ . Overestimation propagates backward through the value function in the direction of the learned policy's trajectory — precisely the trajectory that visits the most overestimated actions. The result is an explosive feedback loop:

$Q_\theta$ overestimates value for some OOD action $a'$ in state $s'$ .
The policy updates to prefer $a'$ .
Bellman targets flowing backward from $s'$ become inflated.
Inflated targets spread to all states leading to $s'$ .
The policy further exploits the now-inflated values throughout the state space.

This cycle, absent in online RL (where interaction reveals the true value of $a'$ ), destroys the value function in offline settings. Empirically, standard SAC applied to offline data achieves near-zero performance on tasks where behavioral cloning succeeds.
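The feedback loop above can be reproduced in a few lines. The following is a minimal sketch, assuming a hypothetical five-state chain, a tabular Q-function, and a single erroneous OOD entry (`ood_error`) standing in for a bad network extrapolation; none of these names come from the lecture.

```python
GAMMA = 0.9
N_STATES = 5  # states 0..3 are non-terminal; state 4 is terminal

# Offline dataset: the behavior policy only ever took action 0 ("step right").
# Each tuple is (s, a, r, s_next, done); reward 1 on reaching the terminal state.
DATASET = [
    (s, 0, 1.0 if s + 1 == N_STATES - 1 else 0.0, s + 1, s + 1 == N_STATES - 1)
    for s in range(N_STATES - 1)
]


def make_q_table(ood_state=None, ood_error=0.0):
    """Q[s][a] for two actions; action 1 never appears in DATASET (it is OOD)."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    if ood_state is not None:
        q[ood_state][1] = ood_error  # stand-in for a bad extrapolation
    return q


def offline_q_iteration(q, sweeps=50):
    """Repeated Bellman backups using dataset transitions only."""
    for _ in range(sweeps):
        for s, a, r, s_next, done in DATASET:
            # The max ranges over all actions, including the OOD one.
            bootstrap = 0.0 if done else max(q[s_next])
            q[s][a] = r + GAMMA * bootstrap
    return q


clean = offline_q_iteration(make_q_table())
corrupted = offline_q_iteration(make_q_table(ood_state=2, ood_error=5.0))

print("clean     Q(s, right):", [round(row[0], 3) for row in clean[:4]])
print("corrupted Q(s, right):", [round(row[0], 3) for row in corrupted[:4]])
print("greedy action at s=2 :", corrupted[2].index(max(corrupted[2])))
```

No return in this chain can exceed 1, yet the single bad entry at state 2 is never updated (no data covers it) and inflates the values of both upstream states to 4.5 and 4.05. The greedy policy at state 2 then selects the OOD action.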

Solution 1: Behavior regularization#

The most direct fix is to force the learned policy to remain close to the behavior policy $\pi_\beta$ , preventing it from selecting OOD actions the Q-network cannot reliably evaluate.

Policy-constrained objective#

Modify the policy improvement step with a divergence penalty:

\max_\pi \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\!\left[ Q(s,a) - \alpha\, D(\pi(\cdot|s) \,\|\, \pi_\beta(\cdot|s)) \right]

where $D$ is a divergence (KL divergence, MMD, or support constraint). The penalty discourages the policy from placing mass on actions not represented in $\pi_\beta$ .

BEAR (Kumar et al., 2019) uses Maximum Mean Discrepancy (MMD), which can be estimated without explicitly modeling $\pi_\beta$ , only requiring samples from it:

D_{\text{MMD}}(\pi, \pi_\beta) = \left\|\mathbb{E}_{a \sim \pi}[\phi(a)] - \mathbb{E}_{a \sim \pi_\beta}[\phi(a)]\right\|^2

The key insight of BEAR is that MMD provides a gradient signal for staying in-distribution without requiring an explicit density model of $\pi_\beta$ . The method showed that behavior regularization could match behavioral cloning on expert data while enabling improvement beyond the behavior policy on mixed datasets. However, BEAR's dependence on kernel selection and the number of samples for reliable MMD estimation limited its adoption.
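As an illustration of the sample-based estimate, here is a minimal sketch of a plug-in MMD computation on scalar actions. The Gaussian kernel, its bandwidth, and the toy action samples are assumptions chosen for the example, not BEAR's actual configuration.

```python
import math


def gaussian_kernel(a, b, bandwidth=0.5):
    """RBF kernel on scalar actions; the bandwidth is an illustrative choice."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=0.5):
    """Plug-in estimate of ||E[phi(x)] - E[phi(y)]||^2 via the kernel trick."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


behavior_actions = [-0.2, -0.1, 0.0, 0.1, 0.2]  # samples standing in for pi_beta(.|s)
near_policy = [-0.15, -0.05, 0.05, 0.15, 0.25]  # stays close to the data
far_policy = [0.8, 0.9, 1.0, 1.1, 1.2]          # drifts out of distribution

print("MMD^2 near:", mmd_squared(near_policy, behavior_actions))
print("MMD^2 far :", mmd_squared(far_policy, behavior_actions))
```

Only samples from the two distributions are needed, which is why no density model of $\pi_\beta$ is required. The estimate does depend on the kernel and on the number of samples, the sensitivity noted above.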

Advantage-Weighted Actor-Critic (AWAC) takes a simpler approach: update the policy only on actions from the dataset, weighted by their estimated advantage:

\mathcal{L}_{\text{AWAC}}(\theta) = -\mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[ \log \pi_\theta(a \mid s) \cdot \exp\!\left(\frac{A^\pi(s,a)}{\lambda}\right) \right]

This is a supervised update (no off-policy sampling needed) that upweights actions with positive advantage under the current policy. Actions with negative advantage (worse than the policy average) receive exponentially small weight, so the update is driven by the best subset of the dataset. The elegance of AWAC lies in its connection to offline-to-online transfer: the same objective can be applied to both online and offline data without modification, making it a natural warm-start for fine-tuning. Empirically, AWAC's performance plateaus on standard D4RL benchmarks, which is part of the motivation for methods such as CQL and IQL that change how the Q-function itself is trained.
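A minimal sketch of the weighting follows, assuming hypothetical advantage and log-probability values. The `max_weight` clip is a commonly used stabilizer added here as an assumption; it is not part of the loss as written above.

```python
import math


def awac_weights(advantages, lam=1.0, max_weight=20.0):
    """exp(A / lambda) per dataset action, clipped for stability (assumed)."""
    return [min(math.exp(adv / lam), max_weight) for adv in advantages]


def awac_loss(log_probs, advantages, lam=1.0):
    """Weighted negative log-likelihood over a batch of dataset (s, a) pairs."""
    weights = awac_weights(advantages, lam)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


advantages = [2.0, 0.0, -2.0]   # hypothetical critic outputs for three dataset actions
log_probs = [-1.0, -1.0, -1.0]  # hypothetical log pi_theta(a|s) for the same actions

print("weights:", [round(w, 3) for w in awac_weights(advantages)])
print("loss   :", round(awac_loss(log_probs, advantages), 3))
```

With $\lambda = 1$ the three actions receive weights of roughly 7.4, 1.0, and 0.14, so the gradient is dominated by the positive-advantage action. A larger $\lambda$ flattens the weights toward plain behavioral cloning.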

Critical Lens

Strengths: AWAC elegantly unifies offline pre-training and online fine-tuning under a single objective — the same advantage-weighted regression works for both phases without modification. This makes it the most natural choice for offline-to-online transfer pipelines in robotics.

Limitations: The exponential advantage weighting $\exp(A/\lambda)$ concentrates updates heavily on actions with large positive advantage, which amplifies noise when the value function is poorly fit. On datasets where the behavior policy is uniformly mediocre (no actions with clearly positive advantage), the weighting collapses to near-uniform and AWAC effectively becomes behavioral cloning, imitating the dataset as a whole rather than discovering better actions.

Drawback of behavior regularization#

While behavior regularization addresses the core offline RL problem, it is fundamentally limited to policies close to $\pi_\beta$ . If the behavior policy is suboptimal (as in most real datasets), the learned policy inherits that suboptimality. This conservative bound on policy improvement motivates looking beyond regularization to methods that more aggressively reweight or transform the Q-function itself.

Solution 2: Conservative Q-Learning (CQL)#

Rather than constraining the policy, Conservative Q-Learning (CQL; Kumar et al., 2020) addresses the problem at the level of the value function itself. The key insight is that if we can ensure the Q-function does not overestimate OOD actions, the policy's argmax will remain in-distribution without any explicit policy constraint.

The CQL objective#

CQL modifies the standard Bellman loss to add a pessimism term that pushes down Q-values on OOD actions and pushes up Q-values on dataset actions:

\mathcal{L}_{\text{CQL}}(\theta) = \underbrace{\mathcal{L}_{\text{TD}}(\theta)}_{\text{Bellman error}} + \alpha\,\mathbb{E}_{s \sim \mathcal{D}}\!\left[ \underbrace{\log \sum_a \exp Q_\theta(s,a)}_{\text{push down: log-sum-exp over all actions}} - \underbrace{\mathbb{E}_{a \sim \mathcal{D}}[Q_\theta(s,a)]}_{\text{push up: dataset actions}} \right]

The middle term $\log\sum_a \exp Q_\theta(s,a)$ is the log-sum-exp (soft maximum) over all actions — a differentiable approximation to $\max_a Q_\theta(s,a)$ . Minimizing it pushes down the Q-value of whichever actions the network currently thinks are best, disproportionately penalizing overestimated OOD actions. The final term $\mathbb{E}_{a \sim \mathcal{D}}[Q_\theta(s,a)]$ pushes up Q-values for actions in the dataset, preserving the Bellman signal on observed transitions.
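For a discrete action space the penalty and its gradient can be written out directly. This is a minimal sketch for a single state, assuming a hypothetical row of Q-values in which one OOD action is inflated; the helper names are illustrative.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_row, data_action):
    """log-sum-exp over all actions minus the Q-value of the dataset action."""
    return logsumexp(q_row) - q_row[data_action]


def cql_penalty_grad(q_row, data_action):
    """Gradient w.r.t. each Q(s, a): softmax(Q) minus a one-hot on the dataset action."""
    lse = logsumexp(q_row)
    grad = [math.exp(q - lse) for q in q_row]
    grad[data_action] -= 1.0
    return grad


q_row = [1.0, 1.0, 6.0]  # hypothetical Q(s, .); action 2 is an inflated OOD estimate
data_action = 0          # the action actually logged in the dataset for this state

print("penalty :", round(cql_penalty(q_row, data_action), 3))
print("gradient:", [round(g, 3) for g in cql_penalty_grad(q_row, data_action)])
```

The gradient is a softmax over actions minus an indicator on the dataset action. A descent step therefore lowers the inflated action almost one-for-one, barely touches the other unseen action, and raises the logged action.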

CQL's central claim, that pessimism (lower-bounding the true value) is both theoretically justified and empirically sufficient, challenged the conventional RL intuition that learning accurate values is always desirable. The paper demonstrated that intentional underestimation on unseen actions prevents the catastrophic divergence observed in standard off-policy methods, achieving 3–5× the performance of DDPG and SAC when applied offline to MuJoCo benchmarks (D4RL). The community's initial skepticism that pessimism gives up potential improvements has been partially validated: CQL does sacrifice upside on datasets containing near-optimal or mixed trajectories compared to more aggressive methods.

Theoretical guarantee#

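For a sufficiently large penalty weight $\alpha$, and up to sampling error, CQL guarantees that the learned Q-function is a lower bound on the true value function:

Q_\theta^{\pi}(s, a) \leq Q^{\pi}(s, a) \quad \forall (s,a) \in \mathcal{D}

Strictly, this pointwise bound is proved for the variant that uses only the push-down term. With the push-up term on dataset actions included, as in the objective above, the guarantee is the tighter bound on the expected value under the policy, $\mathbb{E}_{a \sim \pi}[Q_\theta^{\pi}(s,a)] \leq V^{\pi}(s)$.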

This conservative (pessimistic) bias is intentional. An agent that underestimates the value of OOD actions will avoid them, which is exactly the desired behavior. The tradeoff is that CQL may also underestimate the value of in-distribution actions, leading to a policy that is overly conservative about deviations from $\pi_\beta$ even when they would be beneficial. The penalty weight $\alpha$ controls this tradeoff: a larger $\alpha$ is more conservative, and as $\alpha \to 0$ the objective reduces to standard Q-learning.

Critical Lens

Strengths: CQL's theoretical lower-bound guarantee provides one of the strongest safety signals among offline RL methods: under the theorem's conditions the learned values do not overestimate, so the policy does not chase phantom value on unseen actions. On D4RL locomotion benchmarks, CQL achieves 3–5× the performance of naive SAC/DDPG applied offline.

Limitations: The log-sum-exp penalty over all actions requires either enumerating the action space (infeasible for continuous actions) or importance-sampling OOD actions, introducing estimator variance. CQL's conservatism can undershoot even in-distribution actions, leaving performance on the table when the dataset is already near-optimal. The community has observed that CQL-initialized policies sometimes require significant online fine-tuning to recover the remaining performance gap.

Connection to offline-to-online transfer#

CQL-initialized policies can be efficiently fine-tuned online: the conservative Q-function provides a stable initialization that prevents early-training instability, and online interaction quickly corrects the intentional underestimation bias. This offline pre-training followed by online fine-tuning is a dominant practical pipeline in robotics deployment. In contrast to pure offline methods, the offline-to-online setting shows that pessimism's cost (reduced upside) diminishes rapidly once data collection resumes.

Implicit Q-Learning (IQL)#

A limitation of CQL and behavior regularization approaches is that policy improvement still requires querying the Q-function on actions outside the dataset. Implicit Q-Learning (IQL; Kostrikov et al., 2021) avoids this entirely by reformulating the Bellman backup to never evaluate the Q-function outside the dataset.

The key idea#

Standard Q-learning bootstraps with $\max_{a'} Q(s', a')$ , which requires querying Q at the optimal action, which may be OOD. IQL replaces this maximum with an expectile regression over dataset actions:

\mathcal{L}_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[ L_\tau^2(Q_\theta(s, a) - V_\psi(s)) \right]

where $L_\tau^2(u) = |\tau - \mathbf{1}_{u < 0}| \cdot u^2$ is the asymmetric $L_2$ loss applied to $u = Q_\theta(s,a) - V_\psi(s)$. At $\tau = 0.5$ this is ordinary MSE up to a constant factor (fitting the mean). As $\tau \to 1$ it approaches the maximum Q-value achievable by in-distribution actions. The Q-function is then updated via:

\mathcal{L}_Q(\theta) = \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\!\left[(r + \gamma V_\psi(s') - Q_\theta(s,a))^2\right]

Finally, the policy is extracted by advantage-weighted regression on dataset actions:

\hat{\pi} = \arg\min_\pi -\mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[\exp\bigl(\beta (Q_\theta(s,a) - V_\psi(s))\bigr) \log \pi(a|s)\right]

IQL's central innovation—using expectile regression to implicitly define value targets without OOD action sampling—introduced a qualitatively different approach to offline RL. Rather than penalizing the Q-function (CQL) or constraining the policy (AWAC), IQL reformulates the backup itself to avoid OOD evaluation entirely. The empirical result was striking: IQL matched or exceeded CQL on D4RL benchmarks while being simpler to implement and tune. The method has become the de facto standard for offline robot learning from 2022 onward, particularly because the advantage-weighted policy extraction (identical to AWAC's supervised loss) combines seamlessly with diffusion policies and other expressive model classes.

IQL is entirely in-sample: every step of training evaluates Q and V only on $(s,a)$ pairs from $\mathcal{D}$. This makes it more stable than CQL in practice and easier to combine with expressive policy classes (diffusion policies, transformers). IQL with $\tau \in [0.7, 0.9]$ achieves strong performance across D4RL offline benchmarks.
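The effect of $\tau$ is easiest to see on a single state. The sketch below fits the expectile of a hypothetical set of dataset Q-values with iteratively reweighted averaging, a simple solver chosen for illustration. IQL itself trains $V_\psi$ by gradient descent on the same loss.

```python
def expectile_loss(u, tau):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2."""
    weight = (1.0 - tau) if u < 0 else tau
    return weight * u * u


def total_loss(q_values, v, tau):
    """Sum of expectile losses for a candidate value v."""
    return sum(expectile_loss(q - v, tau) for q in q_values)


def fit_expectile(q_values, tau, iterations=100):
    """Minimize total_loss over a scalar v.

    At the optimum, v is a weighted mean of the q_i with weight tau on
    values above v and (1 - tau) on values below it, so we iterate that.
    """
    v = sum(q_values) / len(q_values)
    for _ in range(iterations):
        weights = [(1.0 - tau) if q - v < 0 else tau for q in q_values]
        v = sum(w * q for w, q in zip(weights, q_values)) / sum(weights)
    return v


q_values = [1.0, 2.0, 3.0, 10.0]  # hypothetical Q(s, a) for dataset actions at one state

for tau in (0.5, 0.7, 0.9, 0.99):
    print(f"tau={tau}: V(s) ~ {fit_expectile(q_values, tau):.3f}")
```

At $\tau = 0.5$ the fit is the mean (4.0). As $\tau$ increases it moves toward the best dataset action (10.0) without ever exceeding it, and no action outside the list is evaluated.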

Critical Lens

Strengths: IQL's fully in-sample design eliminates the single largest source of instability in offline RL — OOD action evaluation. The method has fewer hyperparameters than CQL (only $\tau$ and the advantage temperature $\beta$ ), making it easier to tune across diverse datasets. Its compatibility with expressive policy classes (diffusion policies, transformers) has made it the de facto standard for offline robot learning from 2022 onward.

Limitations: The stitching gap (discussed in the next subsection) is fundamental, not an implementation detail: because IQL never evaluates actions outside the dataset, it cannot recover behavior that requires actions with no dataset support. On tasks requiring synthesis of meaningfully different trajectory segments (e.g., navigation where the start and goal are in different data clusters), IQL can underperform CQL. The expectile parameter $\tau$ also introduces a bias–variance tradeoff that is dataset-dependent: values near 0.5 are stable but conservative, while values near 1.0 risk instability.

When IQL applies and when it does not#

IQL shines on datasets where behavioral cloning plus mild value improvement is sufficient, particularly in robotics where the behavior policy is often competent but suboptimal. It struggles on datasets where successful policies require stitching together disparate trajectory segments: because IQL never queries high-value OOD actions, it cannot cross trajectories with large advantage gaps. This limitation becomes critical for tasks requiring significant policy improvement beyond the behavior policy, where CQL or other pessimistic methods may perform better.

Off-policy evaluation#

If the agent cannot interact with the environment during training, it also cannot measure its policy's performance directly. Deploying a poorly performing policy on real hardware or real users is costly. Off-policy evaluation (OPE) estimates the expected return of a target policy $\pi_e$ using only the offline dataset collected by $\pi_\beta$ .

Direct method (DM)#

Fit a Q-function $\hat{Q}^{\pi_e}$ using the offline dataset and evaluate it at the initial state distribution:

V^{\pi_e}_{\text{DM}} = \mathbb{E}_{s_0}\!\left[\sum_a \pi_e(a|s_0) \hat{Q}^{\pi_e}(s_0, a)\right]

Simple but biased: all errors in $\hat{Q}$ propagate directly into the estimate.

Importance sampling (IS)#

Reweight trajectory returns by the density ratio between $\pi_e$ and $\pi_\beta$ :

\rho_{0:T} = \prod_{t=0}^T \frac{\pi_e(a_t | s_t)}{\pi_\beta(a_t | s_t)}, \qquad V^{\pi_e}_{\text{IS}} = \mathbb{E}_{\tau \sim \mathcal{D}}\!\left[\rho_{0:T} \sum_{t=0}^T \gamma^t r_t\right]

IS is unbiased but its variance grows exponentially with trajectory length $T$. Writing $\rho_t = \pi_e(a_t | s_t) / \pi_\beta(a_t | s_t)$ for the per-step ratio, each $\rho_t$ has mean 1 under $\pi_\beta$, and if the ratios are independent their product has variance $\prod_t \mathbb{E}[\rho_t^2] - 1$, which explodes unless $\pi_e \approx \pi_\beta$. For long-horizon tasks, IS estimates are effectively unusable. This exponential variance is not a minor implementation detail: it is a fundamental statistical barrier that IS cannot overcome without additional structure.
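The exponential growth is easy to check numerically. This is a minimal sketch assuming a state-independent two-action toy problem with independent per-step ratios; the policies are hypothetical.

```python
def importance_weight(actions, pi_e, pi_b):
    """Trajectory weight: product of per-step ratios pi_e(a) / pi_b(a)."""
    weight = 1.0
    for a in actions:
        weight *= pi_e[a] / pi_b[a]
    return weight


def second_moment(pi_e, pi_b):
    """E_{a ~ pi_b}[rho^2] for one step, with rho = pi_e(a) / pi_b(a)."""
    return sum(pe * pe / pb for pe, pb in zip(pi_e, pi_b))


def weight_variance(pi_e, pi_b, steps):
    """Variance of a product of `steps` independent per-step ratios."""
    return second_moment(pi_e, pi_b) ** steps - 1.0


pi_b = [0.5, 0.5]  # behavior policy over two actions
pi_e = [0.8, 0.2]  # evaluation policy

for steps in (1, 10, 50, 100):
    print(f"steps={steps:3d}  Var[rho] = {weight_variance(pi_e, pi_b, steps):.3e}")
```

Here the per-step second moment is 1.36, so the variance is 0.36 after one step and already above 20 after ten. When the two policies coincide the second moment is exactly 1 and the variance stays at zero for any horizon.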

Doubly robust (DR) estimator#

DR combines the direct method as a variance-reduction baseline with IS for residual correction:

V^{\pi_e}_{\text{DR}} = V^{\pi_e}_{\text{DM}} + \mathbb{E}_{\tau \sim \mathcal{D}}\!\left[ \sum_{t=0}^T \gamma^t \rho_{0:t} \left(r_t + \gamma \hat{V}(s_{t+1}) - \hat{Q}(s_t, a_t)\right) \right]

Here $\hat{V}(s) = \sum_a \pi_e(a|s)\, \hat{Q}(s, a)$ and $\rho_{0:t}$ is the product of per-step ratios through step $t$. The DR estimator is doubly robust in the precise sense: it is consistent (converges to the true value) if either the importance weights are correct (accurate $\pi_\beta$ model) or the value model $\hat{Q}$ is correct; only one of the two needs to hold. This resilience makes DR the standard OPE estimator in industrial recommendation systems and AI evaluation pipelines, where one or the other model component is typically reliable. The double robustness property and its implications for offline evaluation are developed in Kennedy (2023); in practice, DR typically outperforms both DM and IS, especially when one of the two component models (behavior policy or value function) is moderately well-fit to the data.
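The double robustness property can be verified on a toy problem. The sketch below implements the estimator above for logged trajectories and applies it to a hypothetical one-step problem whose logged action frequencies match $\pi_\beta$ exactly, so the arithmetic is exact. All function and variable names are illustrative.

```python
def dr_estimate(trajectories, pi_e, pi_b, q_hat, gamma=0.99):
    """Doubly robust value estimate averaged over logged trajectories.

    trajectories: list of lists of (s, a, r, s_next, done)
    pi_e, pi_b:   functions s -> {action: probability}
    q_hat:        function (s, a) -> estimated Q^{pi_e}(s, a)
    """
    def v_hat(s):
        return sum(p * q_hat(s, a) for a, p in pi_e(s).items())

    total = 0.0
    for traj in trajectories:
        estimate = v_hat(traj[0][0])        # direct-method baseline at s_0
        rho, discount = 1.0, 1.0
        for s, a, r, s_next, done in traj:
            rho *= pi_e(s)[a] / pi_b(s)[a]  # cumulative weight rho_{0:t}
            next_v = 0.0 if done else v_hat(s_next)
            estimate += discount * rho * (r + gamma * next_v - q_hat(s, a))
            discount *= gamma
        total += estimate
    return total / len(trajectories)


# One-step toy problem: action "good" pays 1, action "bad" pays 0.
REWARD = {"good": 1.0, "bad": 0.0}
logged = [[("s", "good", 1.0, None, True)], [("s", "bad", 0.0, None, True)]]


def pi_e(s):
    return {"good": 0.9, "bad": 0.1}  # true value of pi_e is 0.9


def pi_b(s):
    return {"good": 0.5, "bad": 0.5}  # correct behavior model


def wrong_pi_b(s):
    return {"good": 0.2, "bad": 0.8}  # misspecified behavior model


def exact_q(s, a):
    return REWARD[a]


def wrong_q(s, a):
    return 0.5                        # misspecified value model


print("right weights, wrong Q:", dr_estimate(logged, pi_e, pi_b, wrong_q))
print("wrong weights, right Q:", dr_estimate(logged, pi_e, wrong_pi_b, exact_q))
print("both wrong            :", dr_estimate(logged, pi_e, wrong_pi_b, wrong_q))
```

The first two calls recover the true value of 0.9 because one component is correct in each. The third call, with both components misspecified, does not.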

Critical Lens

Strengths: DR's double robustness is a genuine practical advantage: in many deployment settings, either the behavior policy model (estimated from logged interactions) or the value model (trained on abundant offline data) is reliable, and DR degrades gracefully when one fails. The importance-weight truncation trick (capping $\rho_t$) provides a tunable bias–variance knob for practitioners.

Limitations: DR's guarantee is asymptotic and requires at least one of the two models to be correctly specified. When both are misspecified, which is common with finite data, DR offers no guarantee. The per-step importance weight $\rho_t$ must still be computed, requiring either a known $\pi_\beta$ or a learned behavior policy model, and errors in the behavior model compound multiplicatively. For safety-critical applications (medical treatment, autonomous driving), OPE is strictly a screening tool; it cannot substitute for real-world validation.

OPE in practice#

In offline RL deployment (robotics, autonomous systems), OPE serves as a sanity check rather than a definitive metric. The standard practice is to use DR with conservative importance weight truncation (capping $\rho_t$ at some threshold) to trade off bias and variance, then validate on a small held-out batch of real-world data or simulation rollouts if feasible. This hybrid approach—OPE + limited online validation—reflects the reality that no offline evaluation method can fully substitute for interaction with the environment.

GenAI context: RLHF as offline RL#

Offline RL is not a niche robotics topic — it is the structural setting of modern LLM alignment.

In the RLHF pipeline: a fixed dataset of human preference labels is collected once; a reward model is trained on this static data; and the LLM policy is optimized against the reward model without further human annotation in the training loop. This is offline RL on a fixed preference dataset, with the reward model playing the role of $\hat{R}$ .

The distributional shift problem appears directly: if PPO pushes the LLM policy to maximize the reward model, it will find OOD text sequences — responses outside the distribution of the preference dataset — that score highly under the reward model but are not actually preferred by humans. This is exactly extrapolation error on OOD actions. The reward model, trained only on prompts and responses from the SFT model, has no reliable predictions for text far from that distribution.

The standard fix is the KL penalty against the SFT reference model:

\max_\pi \mathbb{E}\!\left[r_\phi(x, y) - \beta\, D_{\text{KL}}(\pi \| \pi_{\text{SFT}})\right]

This is behavior regularization with $\pi_\beta = \pi_{\text{SFT}}$ : it prevents the optimized policy from straying into regions where the reward model cannot be trusted. The complete theoretical analysis of this connection — including the closed-form solution of the KL-regularized RLHF objective — is the starting point of the DPO derivation studied in Week 13.
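A small numerical sketch shows the effect of the penalty. It assumes a single prompt with three candidate responses, hypothetical reward-model scores, and the standard closed form $\pi^*(y) \propto \pi_{\text{SFT}}(y)\exp(r(y)/\beta)$ for the KL-regularized optimum. The numbers are invented for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)


def rlhf_objective(policy, rewards, sft, beta):
    """Expected reward minus beta times KL(policy || sft)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, sft)


def kl_regularized_optimum(rewards, sft, beta):
    """pi*(y) proportional to pi_SFT(y) * exp(r(y) / beta)."""
    unnormalized = [q * math.exp(r / beta) for q, r in zip(sft, rewards)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


sft = [0.70, 0.29, 0.01]   # hypothetical pi_SFT over three candidate responses
rewards = [1.0, 1.2, 3.0]  # hypothetical reward-model scores; the last response is rare
beta = 1.0

greedy = [0.0, 0.0, 1.0]   # pure reward maximization
optimum = kl_regularized_optimum(rewards, sft, beta)

for name, policy in (("sft", sft), ("greedy", greedy), ("optimum", optimum)):
    print(f"{name:8s} objective = {rlhf_objective(policy, rewards, sft, beta):.3f}")
```

Putting all mass on the highest-scoring response, which is rare under $\pi_{\text{SFT}}$, pays a KL cost that outweighs its reward. The regularized optimum instead shifts mass toward it only in proportion to how much the reference model already supports it.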

Open Problems

Offline RL at LLM scale: Current methods (CQL, IQL) are validated on D4RL (MuJoCo tasks with state spaces of tens of dimensions). Scaling to the action space of language model token generation (~50K discrete actions) with billion-parameter architectures introduces new challenges: the log-sum-exp in CQL becomes intractable, and in-sample methods must handle extreme sparsity (most tokens never appear in the preference dataset).
Offline-to-online transfer guarantees: Empirically, offline pre-training followed by online fine-tuning is the dominant pipeline, but no method provides a formal guarantee on how many online steps are needed to correct the offline conservatism. The pessimism cost at convergence is unknown.
Heterogeneous data sources: Real offline datasets combine teleoperation, scripted policies, human demonstrations, and random exploration — often without source labels. Current methods assume a single $\pi_\beta$ ; methods that model the dataset as a mixture of behavior policies could enable better stitching across diverse sources.
Stitching vs. conservatism Pareto frontier: There is no unified framework for trading off the ability to stitch across trajectories against conservatism toward OOD actions; IQL and CQL occupy different points on this frontier. Understanding this tradeoff theoretically would guide method selection.

Key takeaways#

Offline RL learns from fixed datasets without environment interaction.
Standard off-policy algorithms fail because the Q-function provides unreliable estimates for OOD actions, and Bellman backups propagate these errors into a catastrophic feedback loop.
Behavior regularization constrains the policy to stay close to $\pi_\beta$, preventing OOD action selection; AWAC implements this via advantage-weighted supervised regression, enabling direct offline-to-online transfer.
CQL addresses the problem at the value function level: a modified Bellman loss explicitly penalizes high Q-values on OOD actions, guaranteeing a conservative lower bound on the true value function.
IQL avoids all OOD evaluation by replacing the Bellman max with expectile regression over dataset actions, making it fully in-sample.
OPE via doubly robust estimators allows performance evaluation without environment interaction.
These methods collectively constitute the machinery underlying RLHF, decision-making from offline datasets in robotics, and any AI system that must learn from historical data without the ability to explore.

Conceptual questions#

Standard SAC applied to an offline dataset achieves near-zero performance despite using the same Bellman updates and neural architectures that work in online settings. Trace the failure mechanism step by step: starting from OOD Q-value overestimation in a single state-action pair, show how the error propagates through successive Bellman backups to corrupt the entire value function. Identify the one structural difference between offline and online RL that breaks the self-correcting property of online Bellman backups.
CQL adds a $\log\sum_a \exp Q_\theta(s,a)$ term to the loss. Show that this term is a log-sum-exp (soft maximum) approximation to $\max_a Q_\theta(s,a)$. Why does minimizing this term disproportionately penalize overestimated OOD actions rather than uniformly shrinking all Q-values? How does the tradeoff weight $\alpha$ in CQL relate to the KL penalty $\beta$ in RLHF behavior regularization? Are they solving the same underlying problem?
IQL fits a value function $V_\psi(s)$ using expectile regression at $\tau \in (0.5, 1)$ on the Q-values of dataset actions. Explain why $\tau = 0.5$ recovers the mean Q-value and $\tau \to 1$ approximates the maximum. In a dataset where the behavior policy is uniformly random over all actions, what does IQL with $\tau = 0.9$ effectively compute, and how does this relate to the optimal value $V^*(s)$ ?
An offline RL system for a medical treatment recommendation task is trained on a dataset where $\pi_\beta$ consists of conservative physician decisions that rarely deviate from standard of care. The learned policy learns to recommend aggressive treatments with higher expected return but minimal dataset support. Identify the failure mode using the offline RL framework, explain which of the three solution approaches (behavior regularization, CQL, IQL) would best prevent this failure, and describe the tradeoff introduced by applying that approach.
The doubly robust OPE estimator is consistent if either the importance weights or the value model is correctly specified, but not necessarily both. Describe a scenario where (a) $\pi_e \approx \pi_\beta$ making IS reliable but the direct method model $\hat{Q}$ is inaccurate, and (b) $\pi_e$ is very different from $\pi_\beta$ making IS unreliable but $\hat{Q}$ is accurate. For case (b), explain what happens to the IS variance as the evaluation horizon $T$ grows, and why DR remains reliable despite high IS variance in this case.

Solutions

Offline SAC collapse. An OOD $(s,a)$ receives an overestimated $Q$ with no data to refute it; that inflated value becomes the bootstrap target $\max_{a'}Q(s',a')$ for predecessor states, inflating their values, and the error propagates through successive backups to corrupt the whole value function. The one structural difference: online RL self-corrects because it can visit the overrated action and observe its true low reward, pulling $Q$ down — offline RL can never collect that corrective data, so the overestimation is never refuted.
CQL. $\log\sum_a\exp Q_\theta(s,a)$ is the log-sum-exp soft maximum, approaching $\max_a Q$ as values spread. Its gradient is the softmax over actions, concentrating weight on the highest $Q$ , so minimizing it pushes down the most overestimated (typically OOD) actions disproportionately rather than shrinking all values uniformly — while the paired term lifts dataset-action values. CQL's $\alpha$ and RLHF's $\beta$ are the same kind of knob: both trade reward/value maximization against staying close to the data/reference distribution.
IQL expectile. Expectile regression at $\tau=0.5$ minimizes squared error and recovers the mean $Q$ over dataset actions; as $\tau\to1$ it weights upper deviations more and approaches the in-support maximum without querying OOD actions. With a uniformly random behavior policy, $\tau=0.9$ effectively estimates an upper expectile (near-best) of the in-dataset action values; it approaches $V^*(s)$ only to the extent the optimal action is covered by the data. In general it is bounded by the in-support max, which is itself a lower bound on $V^*$.
Medical aggressive treatments. The policy extrapolates to OOD aggressive actions with high predicted return but minimal dataset support — overestimation off the behavior distribution, dangerous in a safety-critical setting. Behavior regularization (keeping the policy close to the conservative $\pi_\beta$ ) best prevents it, with CQL a close alternative; the tradeoff is conservatism caps achievable improvement — you forgo potentially-better treatments to remain within demonstrated, supported actions.
Doubly robust OPE. (a) When $\pi_e\approx\pi_\beta$ the importance weights are near 1 and reliable, so DR is accurate even if $\hat Q$ is wrong. (b) When $\pi_e$ differs greatly the IS weights — products of per-step ratios — have variance that grows exponentially with horizon $T$ , making pure IS useless; but an accurate $\hat Q$ carries the direct-method estimate while the IS term only corrects small residuals, so DR stays reliable. DR is consistent if either component is correctly specified.

Implementation exercises#

Exercise 1: Reproduce the offline RL failure#

Train a standard SAC agent online on HalfCheetah-v3 until it reaches a moderate performance level (return ~4000). Save all transitions to create a "medium" offline dataset. Then:

Train a new SAC agent offline on this dataset (no environment interaction, just replay buffer sampling). Track the Q-value estimates and policy return over training.
You should observe Q-values inflating to 10–100× the true return while actual policy performance collapses to near zero. Plot the divergence between estimated and true Q-values.
Add behavior cloning (BC) as a baseline on the same dataset. Observe that BC achieves non-trivial return while offline SAC fails — demonstrating that the failure is specific to value-based methods, not a data quality issue.

Identify the moment in training where the Q-value divergence begins and correlate it with the policy starting to select actions outside the data distribution.
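One way to quantify the divergence is to compare the critic's predictions along an evaluation rollout of the learned policy against the discounted returns actually obtained. The helper below is a minimal sketch with hypothetical names; it assumes you log per-step rewards and the critic's $Q_\theta(s_t, a_t)$ during evaluation-only rollouts.

```python
def returns_to_go(rewards, gamma=0.99):
    """Discounted return from each time step to the end of the episode."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def overestimation_gap(q_estimates, rewards, gamma=0.99):
    """Mean of Q_theta(s_t, a_t) minus the realized return-to-go along one rollout."""
    targets = returns_to_go(rewards, gamma)
    return sum(q - g for q, g in zip(q_estimates, targets)) / len(targets)


# Hypothetical three-step rollout: the critic predicts 2.0 at every step.
rewards = [1.0, 1.0, 1.0]
q_estimates = [2.0, 2.0, 2.0]

print("returns-to-go:", returns_to_go(rewards, gamma=0.5))
print("mean gap     :", overestimation_gap(q_estimates, rewards, gamma=0.5))
```

A gap that grows over training while the realized return falls is the signature described in this exercise. For truncated episodes the realized return omits the tail, so the gap is slightly overstated near the end of each rollout.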

Exercise 2: Implement Conservative Q-Learning (CQL)#

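Starting from your offline SAC implementation, add the CQL penalty term $\alpha\,\mathbb{E}_{s \sim \mathcal{D}}\!\left[\log \sum_a \exp Q_\theta(s,a) - \mathbb{E}_{a \sim \mathcal{D}}[Q_\theta(s,a)]\right]$ to the critic loss: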

For continuous actions, the log-sum-exp term $\log\sum_a \exp Q_\theta(s,a)$ becomes an integral over actions and cannot be computed exactly. Use importance sampling: sample $M$ actions from the current policy $\pi_\theta$ and $M$ actions from a uniform distribution over the action space, subtract each sample's log-density under its proposal from its Q-value, then take the log-sum-exp over the combined set.
Tune $\alpha \in \{0.1, 1.0, 5.0, 10.0\}$ . Report the final policy return after offline training on the medium dataset from Exercise 1.
Compare with the BC baseline: does CQL exceed BC performance, and at what $\alpha$ ?

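Test on halfcheetah-medium-v2 from D4RL: CQL should achieve a normalized score of ~47 (vs ~42 for BC, ~30 for naive SAC); treat these as approximate targets, since reported scores vary with implementation and seed. The d4rl package provides the dataset directly via d4rl.qlearning_dataset(env), where env is the environment created with gym.make('halfcheetah-medium-v2').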
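The sampled log-sum-exp step in this exercise can be sketched as follows. This is a minimal illustration for a 1-D action in $[-1, 1]$, not the reference CQL implementation: the function names, the Gaussian policy, and the rule that out-of-bounds policy samples contribute zero are all assumptions made here to keep the estimator self-contained.

```python
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def gaussian_logpdf(a, mean, std):
    """Log-density of a 1-D Gaussian, used here as the policy density."""
    return -0.5 * ((a - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def cql_logsumexp_estimate(q_fn, state, policy_mean, policy_std,
                           num_samples, rng, low=-1.0, high=1.0):
    """Estimate log of the integral of exp Q(s, a) over a in [low, high]."""
    log_uniform_density = -math.log(high - low)
    terms = []
    # M uniform samples: each contributes exp(Q) / p_uniform(a).
    for _ in range(num_samples):
        a = rng.uniform(low, high)
        terms.append(q_fn(state, a) - log_uniform_density)
    # M policy samples: each contributes exp(Q) / pi(a | s). Samples outside
    # the action bounds contribute zero to the integral (assumption).
    for _ in range(num_samples):
        a = rng.gauss(policy_mean, policy_std)
        if low <= a <= high:
            terms.append(q_fn(state, a) - gaussian_logpdf(a, policy_mean, policy_std))
    # Average over all 2M draws; dropped draws count as zeros.
    return logsumexp(terms) - math.log(2 * num_samples)


def flat_q(state, action):
    """Toy critic: Q = 0 everywhere, so the true integral over [-1, 1] is 2."""
    return 0.0


rng = random.Random(0)
estimate = cql_logsumexp_estimate(flat_q, None, 0.0, 0.5, 5000, rng)
print(estimate, math.log(2.0))
```

With a constant critic the exact answer is $\log 2$, which gives a quick sanity check before plugging in a learned $Q_\theta$.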

Exercise 3: Compare IQL and CQL on a stitching task#

Construct a custom dataset for PointMaze or a simple navigation task where:

Half the trajectories go from start to a midpoint and terminate.
Half go from the midpoint to the goal and terminate.
No single trajectory goes from start to goal.

Train IQL ( $\tau = 0.7, 0.9$ ) and CQL ( $\alpha = 1.0, 5.0$ ) offline on this dataset. Measure:

Success rate (start → goal completion) for each method.
The IQL-CQL performance gap. The expectation from the IQL section is that IQL struggles (no trajectory covers the full path, so reaching the goal requires stitching across the gap), while CQL succeeds by identifying the midpoint → goal segment as high-value and crossing into it. Report whether your results bear this out.

This exercise directly tests the stitching limitation discussed in the IQL section.

Extension prompts#

Offline RL with diffusion policies: Replace IQL's Gaussian policy with a diffusion model (as in Diffuser or Decision Diffuser). Train on the same D4RL datasets and compare: (a) performance on heterogeneous data where the behavior policy is a mixture of experts and random actions, and (b) the ability to generate multi-modal action distributions. Does the expressivity of the diffusion policy compensate for IQL's stitching limitation?
OPE for model selection without deployment: Implement the doubly robust estimator on the D4RL halfcheetah datasets. Use OPE to rank 5 candidate policies (trained with different offline RL methods on the same dataset) by estimated return. Then evaluate all 5 policies in the real environment. Measure the Spearman rank correlation between OPE-predicted and true returns. At what dataset size does OPE ranking become reliable?
RLHF as offline RL — the extrapolation failure: Take a small language model fine-tuned via RLHF (PPO with a reward model). Freeze the reward model and generate completions from the optimized policy. Compare the reward model scores with human preference ratings on a held-out set of prompts. Identify text sequences where the reward model score is high but human preference is low — these are the OOD "actions" (text completions) that the reward model overestimates, exactly analogous to the CQL scenario.
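For the OPE model-selection prompt above, the Spearman rank correlation can be computed without extra dependencies. The sketch below is one way to do it (Pearson correlation on average ranks); the five score pairs are illustrative placeholders, not measured results.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder numbers for five candidate policies (not real results).
ope_estimates = [41.0, 47.5, 30.2, 44.1, 38.0]
true_returns = [42.0, 45.0, 28.0, 46.5, 39.0]
rho = spearman(ope_estimates, true_returns)
print(rho)
```

Here OPE swaps the order of the top two policies and ranks the rest correctly, so the correlation is high but below 1.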

Looking ahead#

Offline RL provides the theoretical foundation for the most important application of RL in modern AI: aligning language models using a fixed dataset of human preferences.

Week 12: Reinforcement Learning from Human Feedback (RLHF). We study how the SFT, reward modeling, and KL-regularized PPO pipeline transforms a pre-trained language model into an aligned conversational agent — and examine the limitations of this approach that motivate the preference-optimization methods of Week 13.

Purpose of this lecture#

The offline RL setting#

The agent must find:

$$\pi^* = \arg\max_\pi J(\pi) = \arg\max_\pi \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t r_t\right]$$

Dataset quality spectrum#

The difficulty of offline RL depends heavily on the dataset:

Expert data: all transitions come from a near-optimal policy. Behavioral cloning (supervised imitation) already works well here, and offline RL adds marginal benefit.
Scripted or random data: transitions from heuristic or random policies. Value function learning is possible but stitching across different behavior modes is required.
Mixed data: a combination of expert, random, and possibly suboptimal transitions. This is the most realistic and most difficult setting — the offline RL algorithm must identify and improve upon the best parts of the dataset.

Why standard off-policy RL fails offline#

The distributional shift problem#

A standard Q-learning update selects actions by maximizing the Q-function:

$$\pi(s) = \arg\max_a Q_\theta(s, a)$$

Bootstrapping error amplifies the problem#

Distributional shift alone would be manageable if the Q-network's errors on OOD actions were random. The deeper problem is that Bellman backups propagate errors. Consider:

$$Q_\theta(s, a) \leftarrow r + \gamma \max_{a'} Q_\theta(s', a')$$

$Q_\theta$ overestimates value for some OOD action $a'$ in state $s'$ .
The policy updates to prefer $a'$ .
Bellman targets flowing backward from $s'$ become inflated.
Inflated targets spread to all states leading to $s'$ .
The policy further exploits the now-inflated values throughout the state space.
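The propagation steps above can be reproduced in a few lines on a toy problem. The sketch below assumes a hypothetical 5-state chain whose dataset only ever logs action 0; a single spurious value of 10 is placed on the unlogged action 1 in the last state, and tabular backups are run with and without the max over actions.

```python
GAMMA = 0.9
NUM_STATES = 5

# Logged transitions (s, a, r, s_next, done). Only action 0 appears in the data;
# the final transition ends the episode with reward 1.
dataset = [
    (s, 0, 1.0 if s == NUM_STATES - 1 else 0.0, s + 1, s == NUM_STATES - 1)
    for s in range(NUM_STATES)
]


def fitted_q(transitions, ood_error, use_max, sweeps=50):
    """Tabular Bellman backups over the fixed dataset.

    use_max=True  : standard Q-learning target, max over all actions.
    use_max=False : in-sample target, only the logged action (always 0 here).
    """
    q = [[0.0, 0.0] for _ in range(NUM_STATES)]
    # Spurious value on an action with no data: no backup ever corrects it.
    q[NUM_STATES - 1][1] = ood_error
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            if done:
                target = r
            elif use_max:
                target = r + GAMMA * max(q[s_next])
            else:
                target = r + GAMMA * q[s_next][0]
            q[s][a] = target
    return q


naive = fitted_q(dataset, ood_error=10.0, use_max=True)
in_sample = fitted_q(dataset, ood_error=10.0, use_max=False)
print([round(row[0], 4) for row in naive])      # inflated along the whole chain
print([round(row[0], 4) for row in in_sample])  # true discounted values
```

One bad entry in the last state inflates the value of every earlier state by roughly a factor of ten, while the in-sample backup never reads the unlogged action and recovers the true values.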

Solution 1: Behavior regularization#

The most direct fix is to force the learned policy to remain close to the behavior policy $\pi_\beta$ , preventing it from selecting OOD actions the Q-network cannot reliably evaluate.

Policy-constrained objective#

Modify the policy improvement step with a divergence penalty:

$$\max_\pi \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\!\left[ Q(s,a) - \alpha\, D(\pi(\cdot|s) \,\|\, \pi_\beta(\cdot|s)) \right]$$

where $D$ is a divergence (KL divergence, MMD, or support constraint). The penalty discourages the policy from placing mass on actions not represented in $\pi_\beta$ .

BEAR (Kumar et al., 2019) uses Maximum Mean Discrepancy (MMD), which can be estimated without explicitly modeling $\pi_\beta$ , only requiring samples from it:

$$D_{\text{MMD}}(\pi, \pi_\beta) = \left\|\mathbb{E}_{a \sim \pi}[\phi(a)] - \mathbb{E}_{a \sim \pi_\beta}[\phi(a)]\right\|^2$$
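Because the squared norm expands into kernel evaluations, the MMD can be estimated from samples alone. The sketch below uses a Gaussian kernel on 1-D actions; the kernel choice, bandwidth, and sample distributions are assumptions for illustration and are not taken from the BEAR paper.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """k(a, b) = exp(-(a - b)^2 / (2 * bandwidth^2))."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Sample estimate of ||E[phi(x)] - E[phi(y)]||^2 via the kernel trick."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


rng = random.Random(0)
dataset_actions = [rng.gauss(0.0, 0.3) for _ in range(200)]      # samples from pi_beta
near_policy_actions = [rng.gauss(0.0, 0.3) for _ in range(200)]  # policy close to the data
far_policy_actions = [rng.gauss(0.8, 0.3) for _ in range(200)]   # policy drifting off-support

mmd_near = mmd_squared(near_policy_actions, dataset_actions)
mmd_far = mmd_squared(far_policy_actions, dataset_actions)
print(mmd_near, mmd_far)
```

Only action samples are needed on both sides, which is why no explicit density model of $\pi_\beta$ appears anywhere in the estimate.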

Advantage-Weighted Actor-Critic (AWAC) takes a simpler approach: update the policy only on actions from the dataset, each weighted by its exponentiated estimated advantage:

$$\mathcal{L}_{\text{AWAC}}(\theta) = -\mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[ \log \pi_\theta(a \mid s) \cdot \exp\!\left(\frac{A^\pi(s,a)}{\lambda}\right) \right]$$
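As a rough sketch, the AWAC loss on a batch is a weighted negative log-likelihood over logged actions. The function below takes precomputed log-probabilities and advantages as plain lists; the weight cap `max_weight` is a stabilizing assumption added here, not part of the formula above.

```python
import math


def awac_loss(log_probs, advantages, lam=1.0, max_weight=20.0):
    """Mean of -exp(A / lam) * log pi(a | s) over logged (s, a) pairs.

    log_probs  : log pi_theta(a | s) for each dataset action
    advantages : estimated A(s, a) for the same pairs
    max_weight : cap on the exponential weight (assumption, for stability)
    """
    weights = [min(math.exp(adv / lam), max_weight) for adv in advantages]
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


# Two logged actions in the same state: the first has positive advantage.
advantages = [1.0, -1.0]
loss_uniform = awac_loss([math.log(0.5), math.log(0.5)], advantages)
loss_prefers_good = awac_loss([math.log(0.9), math.log(0.1)], advantages)
loss_prefers_bad = awac_loss([math.log(0.1), math.log(0.9)], advantages)
print(loss_prefers_good, loss_uniform, loss_prefers_bad)
```

The loss falls as the policy shifts probability toward the logged action with higher advantage, and it never evaluates an action outside the dataset.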

Critical Lens

Drawback of behavior regularization#

Solution 2: Conservative Q-Learning (CQL)#

The CQL objective#

CQL modifies the standard Bellman loss to add a pessimism term that pushes down Q-values on OOD actions and pushes up Q-values on dataset actions:

$$\mathcal{L}_{\text{CQL}}(\theta) = \underbrace{\mathcal{L}_{\text{TD}}(\theta)}_{\text{Bellman error}} + \alpha\,\mathbb{E}_{s \sim \mathcal{D}}\!\left[ \underbrace{\log \sum_a \exp Q_\theta(s,a)}_{\text{push down: log-sum-exp over all actions}} - \underbrace{\mathbb{E}_{a \sim \mathcal{D}}[Q_\theta(s,a)]}_{\text{push up: dataset actions}} \right]$$
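For a discrete action space the objective can be written out directly. The sketch below assumes Q-values arrive as one list per state and that the push-up expectation is approximated by the single logged action for that state; function names and the toy numbers are illustrative.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_values, dataset_actions):
    """Batch mean of logsumexp_a Q(s, a) - Q(s, a_logged)."""
    gaps = [logsumexp(q_row) - q_row[a] for q_row, a in zip(q_values, dataset_actions)]
    return sum(gaps) / len(gaps)


def cql_loss(q_values, dataset_actions, td_targets, alpha):
    """Squared TD error on logged actions plus alpha times the CQL penalty."""
    td_error = sum(
        (q_row[a] - y) ** 2 for q_row, a, y in zip(q_values, dataset_actions, td_targets)
    ) / len(q_values)
    return td_error + alpha * cql_penalty(q_values, dataset_actions)


# One state, three actions, logged action is index 0.
penalty_flat = cql_penalty([[1.0, 1.0, 1.0]], [0])
penalty_ood_high = cql_penalty([[1.0, 5.0, 1.0]], [0])   # unlogged action overestimated
penalty_data_high = cql_penalty([[5.0, 1.0, 1.0]], [0])  # logged action dominates
print(penalty_flat, penalty_ood_high, penalty_data_high)
```

The penalty is largest when an unlogged action carries the highest Q-value and shrinks toward zero when the logged action dominates, which is the push-down and push-up behavior described above.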

Theoretical guarantee#

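For a sufficiently large $\alpha$ (and up to sampling error), CQL guarantees that the learned Q-function lower-bounds the true value of the policy. In the CQL paper the pointwise form below is proved for the penalty without the push-up term; with the push-up term included, the bound holds for the expected value, $\mathbb{E}_{a \sim \pi}[Q_\theta(s,a)] \leq V^\pi(s)$: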

$$Q_\theta^{\pi}(s, a) \leq Q^{\pi}(s, a) \quad \forall (s,a) \in \mathcal{D}$$

Critical Lens

Connection to offline-to-online transfer#

Implicit Q-Learning (IQL)#

The key idea#

$$\mathcal{L}_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[ L_\tau^2(Q_\theta(s, a) - V_\psi(s)) \right]$$

$$\mathcal{L}_Q(\theta) = \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\!\left[(r + \gamma V_\psi(s') - Q_\theta(s,a))^2\right]$$

Finally, the policy is extracted by advantage-weighted regression on dataset actions:

$$\hat{\pi} = \arg\min_\pi\; -\mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[\exp\bigl(\beta (Q_\theta(s,a) - V_\psi(s))\bigr) \log \pi(a|s)\right]$$
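The expectile loss $L_\tau^2(u) = |\tau - \mathbb{1}(u < 0)|\,u^2$ in the value objective above is easiest to understand on a single state. The sketch below fits a scalar $V$ to a handful of hypothetical Q-values of logged actions, using an iteratively reweighted mean (a solver chosen here for brevity; IQL itself uses gradient descent on a network).

```python
def expectile_loss(u, tau):
    """L_tau^2(u) = |tau - 1(u < 0)| * u^2, with u = Q(s, a) - V(s)."""
    weight = tau if u > 0 else 1.0 - tau
    return weight * u * u


def fit_expectile(q_values, tau, iters=100):
    """Minimize sum_i L_tau^2(q_i - v) over the scalar v.

    At the minimizer, v is a weighted mean with weight tau on values above v
    and (1 - tau) on values at or below it, so we iterate that fixed point.
    """
    v = sum(q_values) / len(q_values)
    for _ in range(iters):
        weights = [tau if q > v else 1.0 - tau for q in q_values]
        v = sum(w * q for w, q in zip(weights, q_values)) / sum(weights)
    return v


# Hypothetical Q-values of the logged actions in one state.
q_values = [0.0, 1.0, 2.0, 10.0]
v_mean = fit_expectile(q_values, tau=0.5)   # plain mean
v_high = fit_expectile(q_values, tau=0.9)   # pulled toward the best logged action
v_top = fit_expectile(q_values, tau=0.99)   # close to, but below, the in-sample max
print(v_mean, v_high, v_top)
```

The fitted value moves from the mean toward the largest logged Q-value as $\tau$ increases, and it never exceeds that maximum because no unlogged action is ever queried.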

Critical Lens

When IQL applies and when it does not#

Off-policy evaluation#

Direct method (DM)#

Fit a Q-function $\hat{Q}^{\pi_e}$ using the offline dataset and evaluate it at the initial state distribution:

$$V^{\pi_e}_{\text{DM}} = \mathbb{E}_{s_0}\!\left[\sum_a \pi_e(a|s_0) \hat{Q}^{\pi_e}(s_0, a)\right]$$

Simple but biased: all errors in $\hat{Q}$ propagate directly into the estimate.

Importance sampling (IS)#

Reweight trajectory returns by the density ratio between $\pi_e$ and $\pi_\beta$ :

$$\rho_{0:T} = \prod_{t=0}^T \frac{\pi_e(a_t | s_t)}{\pi_\beta(a_t | s_t)}, \qquad V^{\pi_e}_{\text{IS}} = \mathbb{E}_{\tau \sim \mathcal{D}}\!\left[\rho_{0:T} \sum_{t=0}^T \gamma^t r_t\right]$$
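A minimal sketch of the trajectory-wise IS estimator follows, assuming trajectories are lists of `(state, action, reward)` tuples and policies are functions returning action probabilities. The two-action bandit and the 10-step trajectory are toy constructions for illustration.

```python
def trajectory_weight(trajectory, pi_e, pi_b):
    """rho_{0:T}: product of per-step probability ratios along one trajectory."""
    rho = 1.0
    for state, action, _ in trajectory:
        rho *= pi_e(action, state) / pi_b(action, state)
    return rho


def is_estimate(trajectories, pi_e, pi_b, gamma):
    """Average of rho_{0:T} times the discounted return over logged trajectories."""
    total = 0.0
    for trajectory in trajectories:
        discounted_return = sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))
        total += trajectory_weight(trajectory, pi_e, pi_b) * discounted_return
    return total / len(trajectories)


def pi_b(action, state):
    """Behavior policy: uniform over two actions."""
    return 0.5


def pi_e(action, state):
    """Evaluation policy: always picks action 1."""
    return 1.0 if action == 1 else 0.0


# One-step bandit: action 1 pays 1, action 0 pays 0; each was logged once.
bandit_data = [[(0, 0, 0.0)], [(0, 1, 1.0)]]
value = is_estimate(bandit_data, pi_e, pi_b, gamma=0.99)

# A 10-step trajectory that happens to match pi_e at every step.
long_trajectory = [(0, 1, 0.0)] * 10
weight = trajectory_weight(long_trajectory, pi_e, pi_b)
print(value, weight)
```

Each matching step doubles the weight here, so a 10-step trajectory already carries a weight of 1024 while any trajectory with a single mismatched action gets weight zero; this is the source of the variance growth with horizon.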

Doubly robust (DR) estimator#

DR combines the direct method as a variance-reduction baseline with IS for residual correction:

$$V^{\pi_e}_{\text{DR}} = V^{\pi_e}_{\text{DM}} + \mathbb{E}_{\tau \sim \mathcal{D}}\!\left[ \sum_{t=0}^T \gamma^t \rho_{0:t} \left(r_t + \gamma \hat{V}(s_{t+1}) - \hat{Q}(s_t, a_t)\right) \right]$$
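The DR formula can be checked on a toy bandit where either component is deliberately broken. The sketch below assumes transitions of the form `(s, a, r, s_next, done)`, takes $\hat{V}(s) = \sum_a \pi_e(a|s)\hat{Q}(s,a)$ with $\hat{V} = 0$ after termination, and estimates the direct-method term from logged initial states; all names and numbers are illustrative.

```python
def dr_estimate(trajectories, pi_e, pi_b, q_hat, actions, gamma):
    """Doubly robust estimate: direct-method value plus IS-weighted TD residuals."""
    def v_hat(state):
        return sum(pi_e(a, state) * q_hat(state, a) for a in actions)

    total = 0.0
    for trajectory in trajectories:
        value = v_hat(trajectory[0][0])  # direct-method term at s_0
        rho = 1.0
        for t, (s, a, r, s_next, done) in enumerate(trajectory):
            rho *= pi_e(a, s) / pi_b(a, s)
            next_value = 0.0 if done else v_hat(s_next)
            value += gamma ** t * rho * (r + gamma * next_value - q_hat(s, a))
        total += value
    return total / len(trajectories)


def pi_e(action, state):
    return 1.0 if action == 1 else 0.0   # evaluation policy: always action 1


def pi_b_true(action, state):
    return 0.5                           # actual behavior policy: uniform


def pi_b_wrong(action, state):
    return 0.9 if action == 0 else 0.1   # misspecified behavior model


def q_exact(state, action):
    return 1.0 if action == 1 else 0.0   # correct value model


def q_zero(state, action):
    return 0.0                           # badly wrong value model


# One-step bandit, true value of pi_e is 1.0; each action was logged once.
data = [[(0, 0, 0.0, 0, True)], [(0, 1, 1.0, 0, True)]]
dr_bad_q = dr_estimate(data, pi_e, pi_b_true, q_zero, [0, 1], gamma=0.99)
dr_bad_weights = dr_estimate(data, pi_e, pi_b_wrong, q_exact, [0, 1], gamma=0.99)
dr_both_bad = dr_estimate(data, pi_e, pi_b_wrong, q_zero, [0, 1], gamma=0.99)
print(dr_bad_q, dr_bad_weights, dr_both_bad)
```

With one component correct the estimate recovers the true value of 1.0; with both misspecified it does not, which is the double-robustness property in miniature.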

Critical Lens

OPE in practice#

GenAI context: RLHF as offline RL#

Offline RL is not a niche robotics topic — it is the structural setting of modern LLM alignment.

The standard fix is the KL penalty against the SFT reference model:

$$\max_\pi \mathbb{E}\!\left[r_\phi(x, y) - \beta\, D_{\text{KL}}(\pi \| \pi_{\text{SFT}})\right]$$
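For a single prompt with a finite set of candidate completions, this objective has the well-known closed-form maximizer $\pi^*(y) \propto \pi_{\text{SFT}}(y)\exp(r_\phi(y)/\beta)$. The sketch below evaluates it on three hypothetical completions, where the third has low reference probability but a high reward-model score; the probabilities and rewards are made up for illustration.

```python
import math


def kl_regularized_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), computed stably."""
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    m = max(logits)
    unnormalized = [math.exp(l - m) for l in logits]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def regularized_objective(policy, ref_probs, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) for a discrete set of completions."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0.0)
    return expected_reward - beta * kl


ref_probs = [0.7, 0.2, 0.1]   # hypothetical SFT probabilities of three completions
rewards = [0.0, 1.0, 3.0]     # hypothetical reward-model scores

optimal = kl_regularized_policy(ref_probs, rewards, beta=1.0)
tight = kl_regularized_policy(ref_probs, rewards, beta=100.0)   # stays near SFT
loose = kl_regularized_policy(ref_probs, rewards, beta=0.01)    # chases the reward model
print(optimal, tight, loose)
```

A large $\beta$ keeps the policy near the SFT distribution, while a small $\beta$ lets it collapse onto whichever completion the reward model scores highest, including one it may be overestimating.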

Open Problems

Offline RL at LLM scale: Current methods (CQL, IQL) are validated on D4RL (MuJoCo tasks, $\sim 10^1$ state dimensions). Scaling to the action space of language model token generation (~50K discrete actions) with billion-parameter architectures introduces new challenges: the log-sum-exp in CQL becomes intractable, and in-sample methods must handle extreme sparsity (most tokens never appear in the preference dataset).
Offline-to-online transfer guarantees: Empirically, offline pre-training followed by online fine-tuning is the dominant pipeline, but no method provides a formal guarantee on how many online steps are needed to correct the offline conservatism. The pessimism cost at convergence is unknown.
Heterogeneous data sources: Real offline datasets combine teleoperation, scripted policies, human demonstrations, and random exploration — often without source labels. Current methods assume a single $\pi_\beta$ ; methods that model the dataset as a mixture of behavior policies could enable better stitching across diverse sources.
Stitching vs. conservatism Pareto frontier: There is no unified framework for trading off IQL-style stitching (cross-trajectory synthesis) against CQL-style conservatism (safety against OOD overestimation). Understanding this tradeoff theoretically would guide method selection.

Key takeaways#

Conceptual questions#

Standard SAC applied to an offline dataset achieves near-zero performance despite using the same Bellman updates and neural architectures that work in online settings. Trace the failure mechanism step by step: starting from OOD Q-value overestimation in a single state-action pair, show how the error propagates through successive Bellman backups to corrupt the entire value function. Identify the one structural difference between offline and online RL that breaks the self-correcting property of online Bellman backups.
CQL adds a $\log\sum_a \exp Q_\theta(s,a)$ term to the loss. Show that this term is the log-sum-exp (soft maximum) approximation to $\max_a Q_\theta(s,a)$. Why does minimizing this term disproportionately penalize overestimated OOD actions rather than uniformly shrinking all Q-values? How does the tradeoff weight $\alpha$ in CQL relate to the KL penalty $\beta$ in RLHF behavior regularization? Are they solving the same underlying problem?
IQL fits a value function $V_\psi(s)$ using expectile regression at $\tau \in (0.5, 1)$ on the Q-values of dataset actions. Explain why $\tau = 0.5$ recovers the mean Q-value and $\tau \to 1$ approximates the maximum. In a dataset where the behavior policy is uniformly random over all actions, what does IQL with $\tau = 0.9$ effectively compute, and how does this relate to the optimal value $V^*(s)$ ?
An offline RL system for a medical treatment recommendation task is trained on a dataset where $\pi_\beta$ consists of conservative physician decisions that rarely deviate from standard of care. The learned policy learns to recommend aggressive treatments with higher expected return but minimal dataset support. Identify the failure mode using the offline RL framework, explain which of the three solution approaches (behavior regularization, CQL, IQL) would best prevent this failure, and describe the tradeoff introduced by applying that approach.
The doubly robust OPE estimator is consistent if either the importance weights or the value model is correctly specified, but not necessarily both. Describe a scenario where (a) $\pi_e \approx \pi_\beta$ making IS reliable but the direct method model $\hat{Q}$ is inaccurate, and (b) $\pi_e$ is very different from $\pi_\beta$ making IS unreliable but $\hat{Q}$ is accurate. For case (b), explain what happens to the IS variance as the evaluation horizon $T$ grows, and why DR remains reliable despite high IS variance in this case.


