Purpose of this lecture
Here is the core puzzle: Why is learning anything difficult if the problem is stateless?
In a multi-armed bandit, there are no state transitions, no long-term planning, no hidden dynamics to uncover. Each round is independent. Yet the agent still faces a fundamental challenge: it can only learn by pulling arms, and pulling the wrong arm wastes immediate reward. This is the exploration-exploitation tradeoff in its purest, most unavoidable form.
The payoff of studying bandits is real. Despite their simplicity, they:
-
Isolate the statistical core of learning from interaction. What makes learning hard is not complex dynamics—it is uncertainty under partial feedback. A bandit distills this to its essence.
-
Provide theoretical foundations. Bandits have clean regret analysis, lower bounds, and instance-optimal algorithms. These tools generalize to MDPs and beyond.
-
Show up everywhere in practice. Recommendation systems, A/B testing, and RLHFReinforcement Learning from Human Feedback all reduce to bandit or contextual bandit problems in key components.
This lecture formalizes the exploration-exploitation tradeoff, establishes fundamental limits via the Lai-Robbins lower bound, and derives algorithms that provably match those limits. The progression from -greedy (heuristic) through UCB (principled confidence bounds) to Thompson Sampling (Bayesian) is a masterclass in algorithm design: formalize the problem, establish what is theoretically possible, then derive algorithms from first principles.
The multi-armed bandit problem
A multi-armed bandit problem consists of:
- A fixed set of actions (arms)
- Each arm has an unknown reward distribution with mean
- At each round , the agent selects an arm and observes a reward
There is no state, no transition dynamics, and no delayed consequences. The agent's goal is to maximize cumulative reward over rounds by learning which arms are good — while still occasionally trying others.
The stateless structure makes bandits analytically tractable and exposes the exploration-exploitation tradeoff in its purest form. Every difficulty present in bandits — uncertainty about unknown means, the cost of exploration, partial observability of counterfactual rewards — persists in full MDPs alongside additional challenges. By studying bandits first, you understand the irreducible statistical core before adding the complexity of state transitions and long-horizon planning.
Regret
Performance in bandit problems is measured by regret: the cumulative reward foregone relative to an oracle that always pulls the optimal arm.
Definition
The suboptimality gap of arm is:
The expected cumulative regret after rounds is:
The regret decomposition
The most useful form of the regret is the decomposition by suboptimality gap:
where is the number of times arm is pulled over rounds. This decomposition follows directly from linearity of expectation and the definition of .
The decomposition is fundamental because it reveals what a bandit algorithm must control: the expected number of pulls of each suboptimal arm, weighted by how suboptimal it is. Arms with small gaps () are hard to distinguish from the optimal arm and require many pulls to identify — this is why problem difficulty scales with . Arms with large gaps are easy to identify but costly per pull. Good algorithm design minimizes for each suboptimal arm at a rate that balances statistical confidence against exploration cost.
Sublinear regret
Good bandit algorithms achieve sublinear regret: .
Sublinear regret means the average regret per step as : the algorithm eventually learns to behave near-optimally. Linear regret — — means the algorithm permanently pulls suboptimal arms at a constant rate, which is a failure of learning.
The exploration–exploitation trade-off
At every time step, the agent faces a fundamental choice:
- Exploitation: select the arm that currently appears best (maximize immediate reward)
- Exploration: select an uncertain arm to gather information (sacrifice immediate reward for future improvement)
Why pure exploitation fails: a concrete example
Consider arms and suppose in the first two rounds, arm 1 is pulled once with reward 1 and arm 2 is pulled once with reward 0. A greedy algorithm commits to arm 1 forever. But suppose the true means are and . The greedy algorithm was unlucky in round 1 and will now suffer near-maximal regret for all remaining rounds. The problem is not the decision at round 3 — it's that the algorithm has no mechanism to recognize that its estimate of arm 1 is based on a single noisy sample. Exploitation without uncertainty quantification is fragile.
Why pure exploration fails
Conversely, an algorithm that samples each arm uniformly achieves zero bias in its estimates but wastes reward proportional to per round — linear regret. Learning and earning must happen simultaneously.
-greedy: the simplest resolution
The simplest approach that forces exploration is -greedy: with probability , select a random arm; with probability , select the empirically best arm.
-greedy is simple and works in practice, but has a critical limitation: the exploration rate is fixed regardless of uncertainty. In early rounds, when estimates are highly uncertain, may be too small. In late rounds, when the optimal arm is well-identified, is too large — the algorithm keeps exploring at the same rate even when there is nothing left to learn. The result is linear regret: .
Decaying can achieve logarithmic regret but requires tuning to the unknown suboptimality gaps. This motivates algorithms that adapt their exploration rate automatically based on observed uncertainty — which is exactly what UCB and Thompson Sampling do.
Stochastic bandits
In stochastic bandits, each arm has a fixed but unknown reward distribution with mean . Observed rewards are i.i.d. conditioned on the chosen arm. This setting admits clean statistical analysis and sharp theoretical guarantees, making it the canonical formulation before extending to adversarial and contextual variants.
The Lai-Robbins lower bound
Before studying algorithms, it is worth asking: how good can a bandit algorithm be?
Theorem (Lai and Robbins, 1985): For any consistent algorithm (one that achieves sublinear regret on every bandit instance), the expected number of pulls of any suboptimal arm satisfies:
where is the KL divergence between arm 's reward distribution and the optimal arm's distribution. For Gaussian rewards with unit variance, this simplifies to up to constants, giving:
This is a fundamental lower bound: no algorithm can achieve better than logarithmic regret. The lower bound has two consequences. First, logarithmic regret is not just what UCB achieves — it is the best any algorithm can achieve. Second, the dependence is unavoidable: hard instances are those where the optimal and suboptimal arms are close in mean reward, requiring many pulls to distinguish them statistically.
Optimism in the face of uncertainty: UCB
The principle of optimism in the face of uncertainty provides a principled response to the exploration-exploitation tradeoff:
When uncertain, act as if the best plausible outcome will occur.
An arm that has been pulled rarely has a wide confidence interval. An optimistic algorithm treats the upper end of that interval as the arm's value, which drives exploration of uncertain arms without requiring a manually tuned exploration parameter.
Deriving the UCB bonus from Hoeffding's inequality
For rewards bounded in , Hoeffding's inequality gives:
We want a confidence bound that holds with high probability across all rounds and all arms. Setting the failure probability to and solving for :
This is an upper confidence bound on the gap between the empirical mean and the true mean: with probability at least , the true mean lies below . This is exactly the UCB bonus term. The formula is not a heuristic — it is the statistical confidence interval derived from Hoeffding's inequality with failure probability .
UCB1 algorithm
where is the empirical mean of arm and is its pull count prior to round . Arms that have not yet been pulled are assigned infinite UCB and are pulled first.
Interpretation: the first term exploits arms with high observed reward; the second term explores arms with high uncertainty. As grows, the bonus shrinks, and the algorithm naturally transitions from exploration to exploitation. No tuning is required.
Regret bound
UCB1 achieves:
This matches the Lai-Robbins lower bound up to constants, confirming that UCB1 is essentially instance-optimal: no consistent algorithm can do significantly better on any bandit instance.
Implementation
import numpy as np
class UCB1:
"""
UCB1 algorithm for multi-armed bandits.
Args:
n_arms: Number of bandit arms
alpha: Exploration parameter (default=2 for UCB1)
Attributes:
counts: Number of times each arm was pulled
values: Estimated mean reward for each arm
"""
def __init__(self, n_arms, alpha=2.0):
self.n_arms = n_arms
self.alpha = alpha
self.counts = np.zeros(n_arms) # N_t(a) - pull counts
self.values = np.zeros(n_arms) # \hat{\mu}_a - empirical means
self.t = 0 # Total rounds
def select_arm(self):
"""
Select arm using UCB1 formula: argmax( \hat{\mu}_a + \sqrt{\alpha * log(t) / N_t(a)} )
Returns: arm index to pull
"""
# First, pull each arm once (ensure N_t(a) > 0)
for arm in range(self.n_arms):
if self.counts[arm] == 0:
return arm
# Compute UCB for each arm
ucb_values = np.zeros(self.n_arms)
for arm in range(self.n_arms):
bonus = np.sqrt(self.alpha * np.log(self.t) / self.counts[arm])
ucb_values[arm] = self.values[arm] + bonus
return np.argmax(ucb_values)
def update(self, arm, reward):
"""
Update estimates after observing reward.
Args:
arm: The arm that was pulled
reward: The observed reward
"""
self.t += 1
self.counts[arm] += 1
# Incremental mean update: \hat{\mu}_a = \hat{\mu}_a + (r - \hat{\mu}_a) / N_t(a)
n = self.counts[arm]
self.values[arm] += (reward - self.values[arm]) / n
# Example: 4-armed bandit with Bernoulli rewards
np.random.seed(42)
n_arms = 4
true_means = [0.3, 0.5, 0.7, 0.9] # True arm means (unknown to algorithm)
n_rounds = 1000
# Run UCB1
ucb = UCB1(n_arms)
total_reward = 0
rewards = []
for t in range(n_rounds):
arm = ucb.select_arm()
reward = float(np.random.random() < true_means[arm]) # Bernoulli reward
ucb.update(arm, reward)
total_reward += reward
rewards.append(total_reward)
print(f"UCB1 Total Reward: {total_reward:.0f}/{n_rounds}")
print(f"UCB1 Estimated Means: {ucb.values}")
print(f"True Means: {true_means}")
Key implementation details:
- Initial exploration: Each arm is pulled once before UCB formula applies (handles division by zero)
- Incremental updates: Uses
values[arm] += (reward - values[arm]) / nfor numerical stability - Logarithmic regret: The in the bonus ensures diminishing exploration over time
- No epsilon tuning: Unlike -greedy, UCB automatically balances exploration/exploitation
Comparison with Epsilon-Greedy
class EpsilonGreedy:
"""Baseline: explore with probability epsilon, exploit otherwise."""
def __init__(self, n_arms, epsilon=0.1):
self.n_arms = n_arms
self.epsilon = epsilon
self.counts = np.zeros(n_arms)
self.values = np.zeros(n_arms)
def select_arm(self):
if np.random.random() < self.epsilon:
return np.random.randint(self.n_arms) # Explore: random arm
return np.argmax(self.values) # Exploit: best known arm
def update(self, arm, reward):
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] += (reward - self.values[arm]) / n
# Comparison
for epsilon in [0.01, 0.1, 0.3]:
eg = EpsilonGreedy(n_arms, epsilon)
eg_reward = 0
for _ in range(n_rounds):
arm = eg.select_arm()
reward = float(np.random.random() < true_means[arm])
eg.update(arm, reward)
eg_reward += reward
print(f"ε={epsilon}: Total={eg_reward:.0f} (needs tuning!)")
# UCB1 automatically tunes exploration
print(f"UCB1: Total={total_reward:.0f} (no tuning needed)")
Why UCB beats ε-greedy:
- ε-greedy uses fixed exploration rate (requires tuning)
- UCB exploration bonus automatically decreases as confidence increases
- UCB has theoretical regret guarantees, ε-greedy does not
Thompson Sampling
Thompson Sampling approaches exploration from a Bayesian perspective. Rather than constructing deterministic confidence bounds, it maintains a posterior distribution over each arm's reward parameter and samples from it.
Algorithm
- Maintain a posterior over each arm's reward parameter
- At round , sample for each arm
- Select
- Observe and update the posterior for arm
The key insight: an arm with high posterior uncertainty has high variance in its samples, which means it will occasionally produce a very high sample and get selected — this is exploration. An arm whose posterior is tightly concentrated near a low mean will rarely produce a sample that beats the optimal arm — this is exploitation. Exploration is implicit in the posterior variance, not forced by a separate mechanism.
Beta-Bernoulli Thompson Sampling
For binary rewards (click/no-click, thumbs-up/thumbs-down, success/failure), the Beta distribution is the conjugate prior for the Bernoulli likelihood. This makes the posterior update exact and closed-form.
Prior: , initialized as (uniform prior).
Why Beta is conjugate to Bernoulli: By Bayes' rule, given binary observations with successes and failures from arm :
This is — the posterior is Beta with updated counts. Each observation updates exactly one parameter: on success, on failure. The Beta distribution is not an arbitrary modeling choice — it is the exact Bayesian posterior for a Bernoulli arm under a Beta prior.
Action selection:
import numpy as np
def thompson_sampling_step(alphas, betas):
"""
One step of Beta-Bernoulli Thompson Sampling.
alphas, betas: arrays of shape (K,) with current posterior parameters.
Returns selected arm index.
"""
samples = np.random.beta(alphas, betas)
return np.argmax(samples)
def update_posterior(alphas, betas, arm, reward):
"""
Bayesian posterior update for Bernoulli arm.
reward: 1 (success) or 0 (failure).
"""
alphas[arm] += reward
betas[arm] += (1 - reward)
return alphas, betas
- We sample from the Beta distribution for each arm using their current posterior parameters and . This single line handles both the exploitation (mean of the distribution) and exploration (variance of the distribution).
- We simply act greedily with respect to the sampled values. This is probability matching in action.
- If the reward is 1 (success), we increment , effectively shifting the distribution mean closer to 1.
- If the reward is 0 (failure), we increment , shifting the mean closer to 0.
The same conjugacy principle extends beyond Bernoulli rewards: Gaussian arms use a normal-inverse-gamma conjugate prior; Poisson arms use a gamma-Poisson conjugacy. Any conjugate prior-likelihood pair gives a Thompson Sampling algorithm with closed-form posterior updates.
Regret guarantees
Thompson Sampling achieves regret in the frequentist sense and matches the Lai-Robbins lower bound asymptotically for exponential family reward distributions. In practice it often outperforms UCB empirically despite having slightly weaker worst-case guarantees, because its randomized exploration is better calibrated to posterior uncertainty than the deterministic UCB bonus.
Comparison: -greedy vs UCB vs Thompson Sampling
| Aspect | -greedy | UCB1 | Thompson Sampling | |---|---|---|---| | Exploration mechanism | Fixed random rate | Deterministic confidence bound | Posterior sampling | | Exploration rate | Manual tuning of | Automatic via | Automatic via posterior variance | | Regret | Linear (fixed ) | — optimal | — near-optimal | | Implementation | Trivial | Simple counters and means | Requires prior and sampling | | Empirical behavior | Often competitive with tuning | Strong, consistent | Often best in practice | | Theoretical status | Not instance-optimal | Instance-optimal (matches Lai-Robbins) | Asymptotically optimal |
The progression -greedy UCB Thompson Sampling is a progression from heuristic exploration toward principled uncertainty quantification. All three implement the same underlying principle — uncertain arms should be explored more — with increasing statistical sophistication.
Contextual bandits
In many real systems, the optimal action depends on context observed at each round.
Formulation
In a contextual bandit:
- At round , the agent observes context
- Selects action
- Receives reward
There are still no state transitions or delayed effects — each round is independent given the context. But the policy is now a mapping from contexts to action distributions, rather than a fixed action.
LinUCB: contextual bandits with linear reward models
The canonical contextual bandit algorithm is LinUCB, which assumes:
for arm-specific parameter vectors . Under this model, the ridge regression estimate and its covariance give a confidence ellipsoid, and the UCB for arm in context is:
where controls the confidence width. LinUCB achieves regret, with dependence on (the context dimension) rather than (the number of arms) — exploiting the linear structure to generalize across contexts.
LinUCB is deployed in production recommendation systems (the original paper describes its use at Yahoo for news article recommendation) and is the conceptual foundation for neural contextual bandit algorithms that replace the linear reward model with a neural network.
Partial feedback: the core difficulty
In both standard and contextual bandits, the agent observes only the reward for the chosen action — not the rewards it would have received from other actions. This is bandit feedback or partial feedback, as opposed to full information feedback where all rewards are observed.
Partial feedback is what distinguishes bandit learning from supervised learning over the same data. In supervised learning, all labels are observed and a classifier can be trained by regression on the full dataset. In a contextual bandit, you only observe , not for . This means you cannot directly evaluate a policy that would have made different choices — you must estimate counterfactual performance.
This partial feedback structure is the root of the distributional shift problem in offline RLReinforcement Learning: a policy trained from logged data never observes rewards for actions the logging policy did not take, so its value estimates for those actions are ungrounded.
Policy evaluation with bandit feedback
Suppose you have a logged dataset collected by some behavior policy and you want to evaluate a new target policy . Since may take different actions than , you cannot simply average the observed rewards.
Importance sampling estimator
The inverse propensity scoring (IPS) estimator corrects for this mismatch:
The ratio is the importance weight: it upweights rounds where would have chosen the same action as and downweights rounds where it would not. Under the assumption that wherever (coverage), the IPS estimator is unbiased: .
The limitation is high variance: when and differ substantially, the importance weights become large, inflating the estimator's variance. In extreme cases, a single round with a large weight can dominate the estimate entirely.
The doubly robust (DR) estimator reduces variance by combining IPS with a learned reward model :
The DR estimator is unbiased if either the reward model or the importance weights are correct — it is robust to misspecification of one but not both. These estimators reappear in offline RLReinforcement Learning and in RLHFReinforcement Learning from Human Feedback evaluation wherever counterfactual reasoning is required.
GenAI context: bandits in RLHFReinforcement Learning from Human Feedback
Several components of RLHFReinforcement Learning from Human Feedback reduce directly to bandit or contextual bandit problems.
The RLHFReinforcement Learning from Human Feedback bandit formulation
| MDPMarkov Decision Process/Bandit component | RLHFReinforcement Learning from Human Feedback interpretation | |---|---| | Context | Prompt | | Action | Generated response (full completion) | | Reward | Human preference score or reward model output | | Behavior policy | Reference model (SFT checkpoint) |
The RLHFReinforcement Learning from Human Feedback training loop is a contextual bandit where the action space is the set of all possible text completions — astronomically large and structured. This structure has two important consequences:
Why direct UCB/Thompson Sampling are inapplicable: With a continuous, high-dimensional action space, maintaining per-action counts or posteriors is infeasible. Instead, RLHFReinforcement Learning from Human Feedback learns a reward model that generalizes across the action space, then uses it to score completions. The reward model plays the role of the UCB confidence bound or the Thompson sample — it scores actions by their estimated value — but it generalizes via function approximation rather than per-arm statistics.
The partial feedback problem in RLHFReinforcement Learning from Human Feedback: Human preference data is inherently partial: for a given prompt, the human rates one or two completions, not all possible completions. The reward model must generalize from this bandit feedback to the full action space. When the reward model overfits to the distribution of rated completions, it produces unreliable scores for out-of-distribution responses — this is one mechanism behind reward hacking in RLHFReinforcement Learning from Human Feedback and is a direct manifestation of the distributional shift problem identified in bandit policy evaluation.
Limitations of bandits
Bandits deliberately ignore:
- delayed consequences of actions,
- state evolution over time,
- long-horizon planning.
They are insufficient when actions influence future states and rewards. Their value lies in isolating the statistical core of learning from interaction — uncertainty quantification, exploration, and partial feedback — in the simplest setting where these issues arise.
The transition from bandits to full MDPs reintroduces state transitions. The arm value generalizes to the action-value function : the expected return of taking action in state and acting optimally thereafter. The UCB bonus for exploration generalizes to optimism-based exploration in MDPs. The importance sampling estimator for bandit policy evaluation generalizes to importance-weighted policy gradient estimators. Every bandit concept has an MDPMarkov Decision Process analog.
Key takeaways
The structure of this lecture mirrors the structure of a good algorithm design argument: identify the objective (regret), decompose it into controllable quantities (the suboptimality gap decomposition), establish what is theoretically achievable (Lai-Robbins lower bound), and then derive algorithms that meet the bound from first principles (UCB from Hoeffding, Thompson Sampling from Bayes' rule). This pattern — formalize, lower bound, match — recurs throughout the course.
Concretely: regret decomposes into expected pulls of suboptimal arms weighted by their gaps. -greedy achieves linear regret because it explores at a fixed rate regardless of uncertainty. UCB achieves logarithmic regret by deriving the exploration bonus from a statistical confidence bound that shrinks as uncertainty decreases. Thompson Sampling achieves the same asymptotically by sampling from the posterior, making exploration implicit in Bayesian uncertainty. Contextual bandits extend the framework to context-dependent rewards; LinUCB applies UCB to a linear reward model. Partial feedback requires importance sampling for counterfactual evaluation, introducing the distributional shift that reappears throughout offline RLReinforcement Learning and RLHFReinforcement Learning from Human Feedback.
Conceptual questions
-
Suppose you have arms with true means , , . Write out the regret decomposition explicitly. Which arm dominates the regret, and why does the answer depend on both and ? Why is arm 2 harder to handle than arm 3?
-
Derive the UCB1 bonus term from Hoeffding's inequality by setting the failure probability to . Explain why the choice of (rather than, say, ) matters for the union bound over all arms and all rounds.
-
The Beta-Bernoulli Thompson Sampling update is on success, on failure. Derive this rule from Bayes' theorem using the Beta prior and Bernoulli likelihood. What does the ratio represent, and how does the posterior variance change as more data is collected?
-
An RLHFReinforcement Learning from Human Feedback system collects human preference labels for 1000 prompt-response pairs, always showing humans responses sampled from the current SFT model. A reward model is trained on this data and used to score responses from a fine-tuned model that has drifted significantly from the SFT checkpoint. Explain this failure in terms of partial feedback, distributional shift, and the coverage assumption required by the IPS estimator.
-
UCB1 achieves logarithmic regret and matches the Lai-Robbins lower bound up to constants. Does this mean UCB1 is the "best possible" bandit algorithm? Explain what "instance-optimal" means, what the lower bound actually says, and describe a setting where Thompson Sampling would empirically outperform UCB1 despite both being asymptotically optimal.
-
Extension: The KL-UCB algorithm replaces the Hoeffding-derived bonus with a tighter bound based on KL divergence, achieving: Explain why KL-UCB is tighter than UCB1 for Bernoulli arms. In what regime ( large vs. small) does the difference matter most? Would you expect KL-UCB or UCB1 to have a larger advantage when is large and arms are mostly near-optimal?
Coding exercise
Implement and compare UCB1 variants.
Starting from the UCB1 class in this lesson, implement a KLUCB class that selects arms using the KL-UCB index (binary search over to solve , where for Bernoulli arms). Then run a simulation comparing UCB1 and KL-UCB on a 5-arm Bernoulli bandit with means [0.5, 0.55, 0.6, 0.65, 0.9] over rounds. Plot cumulative regret for both algorithms.
Things to observe:
- Which algorithm accumulates more regret early vs. late?
- On the hard arms (means 0.5–0.65), which algorithm explores more efficiently?
- How does the gap between algorithms change as shrinks?
Looking ahead
The next lecture reintroduces state transitions through dynamic programming for finite MDPs. We will see how the Bellman equations from Week 1 become computational algorithms — policy evaluation, policy iteration, and value iteration — and why exact solutions quickly become infeasible as the state space grows. The exploration-exploitation tradeoff identified in bandits reappears in the MDPMarkov Decision Process setting, where it is compounded by the need to explore a state space rather than a fixed set of arms.
Further reading
- Lattimore, T., & Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press. (The definitive modern textbook on bandit theory).
- Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning. (The original UCB paper).
- Agrawal, S., & Goyal, N. (2012). Analysis of Thompson Sampling for the multi-armed bandit problem. Conference on Learning Theory (COLT).
- Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A Tutorial on Thompson Sampling. Foundations and Trends® in Machine Learning. (Accessible survey covering theory, implementation, and applications including RLHF-adjacent settings).