Skip to main content
illumin8
Courses
Week 2: Multi-Armed Bandits
Reinforcement Learning
01Week 1: Reinforcement Learning Problem Formulation
02Week 2: Multi-Armed Bandits
03Week 3: Dynamic Programming for Finite MDPs
04Week 4: Monte Carlo and Temporal-Difference Learning
05Week 5: Function Approximation in Reinforcement Learning
06Week 6: Deep Q-Learning and Variants
07Week 7: Policy Gradient and Actor–Critic Methods
08Week 8: Modern Deep Reinforcement Learning Algorithms
09Week 9: Exploration, Partial Observability, and Multi-Agent Reinforcement Learning
10Week 10: Model-Based Reinforcement Learning and Planning
11Week 11: Offline Reinforcement Learning
12Week 12: Reinforcement Learning from Human Feedback
13Week 13: Direct Preference Optimization and GRPO
14Week 14: Agentic Systems and Course Capstone
Week 2

Week 2: Multi-Armed Bandits

✦Learning Outcomes
  • Explain the exploration-exploitation trade-off and why pure exploitation fails
  • Derive and interpret the regret decomposition formula
  • Implement Upper Confidence Bound (UCB) and Thompson Sampling algorithms
  • Connect bandit formulations to RLHFReinforcement Learning from Human Feedback and preference-based learning
◆Prerequisites
  • Week 1: Reward hypothesis, returns with discounting, and the concept of maximizing cumulative reward
  • Review: Basic probability theory (expectation, distributions)

Recommended: Review Week 1 sections on "The reward hypothesis" and "Returns and discounting" before proceeding.

Purpose of this lecture

Here is the core puzzle: Why is learning anything difficult if the problem is stateless?

In a multi-armed bandit, there are no state transitions, no long-term planning, no hidden dynamics to uncover. Each round is independent. Yet the agent still faces a fundamental challenge: it can only learn by pulling arms, and pulling the wrong arm wastes immediate reward. This is the exploration-exploitation tradeoff in its purest, most unavoidable form.

The payoff of studying bandits is real. Despite their simplicity, they:

  1. Isolate the statistical core of learning from interaction. What makes learning hard is not complex dynamics—it is uncertainty under partial feedback. A bandit distills this to its essence.

  2. Provide theoretical foundations. Bandits have clean regret analysis, lower bounds, and instance-optimal algorithms. These tools generalize to MDPs and beyond.

  3. Show up everywhere in practice. Recommendation systems, A/B testing, and RLHFReinforcement Learning from Human Feedback all reduce to bandit or contextual bandit problems in key components.

This lecture formalizes the exploration-exploitation tradeoff, establishes fundamental limits via the Lai-Robbins lower bound, and derives algorithms that provably match those limits. The progression from ϵ\epsilonϵ-greedy (heuristic) through UCB (principled confidence bounds) to Thompson Sampling (Bayesian) is a masterclass in algorithm design: formalize the problem, establish what is theoretically possible, then derive algorithms from first principles.


The multi-armed bandit problem

A multi-armed bandit problem consists of:

  • A fixed set of KKK actions (arms) a∈{1,…,K}a \in \{1, \dots, K\}a∈{1,…,K}
  • Each arm aaa has an unknown reward distribution with mean μa\mu_aμa​
  • At each round ttt, the agent selects an arm ata_tat​ and observes a reward rt∼νatr_t \sim \nu_{a_t}rt​∼νat​​

There is no state, no transition dynamics, and no delayed consequences. The agent's goal is to maximize cumulative reward over TTT rounds by learning which arms are good — while still occasionally trying others.

✦Intuition: Why statelessness is the right abstraction

Removing state doesn't make the problem trivial — it makes it analytically tractable. Without states, every remaining difficulty is purely statistical: you don't know the arm means, you can only learn them by pulling, and pulling the wrong arm costs reward. Every additional challenge in full MDPs (state transitions, delayed credit assignment, long-horizon planning) sits on top of this statistical core. Bandits let you study the statistical core in isolation, before those additional complexities are layered back in.

The stateless structure makes bandits analytically tractable and exposes the exploration-exploitation tradeoff in its purest form. Every difficulty present in bandits — uncertainty about unknown means, the cost of exploration, partial observability of counterfactual rewards — persists in full MDPs alongside additional challenges. By studying bandits first, you understand the irreducible statistical core before adding the complexity of state transitions and long-horizon planning.


Regret

Performance in bandit problems is measured by regret: the cumulative reward foregone relative to an oracle that always pulls the optimal arm.

Definition

The suboptimality gap of arm aaa is:

Δa=μ∗−μa,μ∗=max⁡a′μa′\Delta_a = \mu^* - \mu_a, \qquad \mu^* = \max_{a'} \mu_{a'}Δa​=μ∗−μa​,μ∗=a′max​μa′​

The expected cumulative regret after TTT rounds is:

R(T)=Tμ∗−E ⁣[∑t=1Trt]\mathcal{R}(T) = T\mu^* - \mathbb{E}\!\left[\sum_{t=1}^T r_t\right]R(T)=Tμ∗−E[t=1∑T​rt​]
✦Intuition: Regret as a decomposition

R(T)=∑a=1KΔa⋅E[NT(a)]\mathcal{R}(T) = \sum_{a=1}^K \Delta_a \cdot \mathbb{E}[N_T(a)]R(T)=∑a=1K​Δa​⋅E[NT​(a)]

This says: the total regret is the sum of each arm's suboptimality gap multiplied by the expected number of times it is pulled. This is profound because it tells you exactly what a good algorithm must do: pull suboptimal arms as few times as possible, weighted by how bad they are. An arm with a tiny gap to the optimal arm (hard to distinguish) naturally gets pulled more. An arm with a huge gap gets pulled rarely. The algorithm's job is to allocate exploration budget to the hard arms and avoid the easy ones.

The regret decomposition

The most useful form of the regret is the decomposition by suboptimality gap:

R(T)=∑a=1KΔa⋅E[NT(a)]\mathcal{R}(T) = \sum_{a=1}^K \Delta_a \cdot \mathbb{E}[N_T(a)]R(T)=a=1∑K​Δa​⋅E[NT​(a)]

where NT(a)=∑t=1T1[at=a]N_T(a) = \sum_{t=1}^T \mathbf{1}[a_t = a]NT​(a)=∑t=1T​1[at​=a] is the number of times arm aaa is pulled over TTT rounds. This decomposition follows directly from linearity of expectation and the definition of Δa\Delta_aΔa​.

The decomposition is fundamental because it reveals what a bandit algorithm must control: the expected number of pulls of each suboptimal arm, weighted by how suboptimal it is. Arms with small gaps (Δa≈0\Delta_a \approx 0Δa​≈0) are hard to distinguish from the optimal arm and require many pulls to identify — this is why problem difficulty scales with 1/Δa21/\Delta_a^21/Δa2​. Arms with large gaps are easy to identify but costly per pull. Good algorithm design minimizes E[NT(a)]\mathbb{E}[N_T(a)]E[NT​(a)] for each suboptimal arm at a rate that balances statistical confidence against exploration cost.

Sublinear regret

Good bandit algorithms achieve sublinear regret: R(T)=o(T)\mathcal{R}(T) = o(T)R(T)=o(T).

Sublinear regret means the average regret per step R(T)/T→0\mathcal{R}(T)/T \to 0R(T)/T→0 as T→∞T \to \inftyT→∞: the algorithm eventually learns to behave near-optimally. Linear regret — R(T)=Θ(T)\mathcal{R}(T) = \Theta(T)R(T)=Θ(T) — means the algorithm permanently pulls suboptimal arms at a constant rate, which is a failure of learning.


The exploration–exploitation trade-off

At every time step, the agent faces a fundamental choice:

  • Exploitation: select the arm that currently appears best (maximize immediate reward)
  • Exploration: select an uncertain arm to gather information (sacrifice immediate reward for future improvement)

Why pure exploitation fails: a concrete example

Consider K=2K = 2K=2 arms and suppose in the first two rounds, arm 1 is pulled once with reward 1 and arm 2 is pulled once with reward 0. A greedy algorithm commits to arm 1 forever. But suppose the true means are μ1=0.1\mu_1 = 0.1μ1​=0.1 and μ2=0.9\mu_2 = 0.9μ2​=0.9. The greedy algorithm was unlucky in round 1 and will now suffer near-maximal regret for all remaining rounds. The problem is not the decision at round 3 — it's that the algorithm has no mechanism to recognize that its estimate of arm 1 is based on a single noisy sample. Exploitation without uncertainty quantification is fragile.

Why pure exploration fails

Conversely, an algorithm that samples each arm uniformly achieves zero bias in its estimates but wastes reward proportional to ∑aΔa/K\sum_a \Delta_a / K∑a​Δa​/K per round — linear regret. Learning and earning must happen simultaneously.

◆Why Not uniform exploration?

Uniform exploration treats all arms as equally uncertain forever. Early on, this wastes budget on obvious losers you've already identified. Late on, it wastes budget on arms whose means are already well-estimated. A smarter algorithm would front-load exploration on arms that are hard to distinguish from the optimum and taper off as confidence grows — exactly what UCB and Thompson Sampling do automatically.

ϵ\epsilonϵ-greedy: the simplest resolution

The simplest approach that forces exploration is ϵ\epsilonϵ-greedy: with probability ϵ\epsilonϵ, select a random arm; with probability 1−ϵ1 - \epsilon1−ϵ, select the empirically best arm.

at={uniform random armwith probability ϵarg⁡max⁡aμ^a(t)with probability 1−ϵa_t = \begin{cases} \text{uniform random arm} & \text{with probability } \epsilon \\ \arg\max_a \hat{\mu}_a(t) & \text{with probability } 1 - \epsilon \end{cases}at​={uniform random armargmaxa​μ^​a​(t)​with probability ϵwith probability 1−ϵ​

ϵ\epsilonϵ-greedy is simple and works in practice, but has a critical limitation: the exploration rate ϵ\epsilonϵ is fixed regardless of uncertainty. In early rounds, when estimates are highly uncertain, ϵ\epsilonϵ may be too small. In late rounds, when the optimal arm is well-identified, ϵ\epsilonϵ is too large — the algorithm keeps exploring at the same rate even when there is nothing left to learn. The result is linear regret: R(T)=Ω(ϵ⋅∑aΔa⋅T)\mathcal{R}(T) = \Omega(\epsilon \cdot \sum_a \Delta_a \cdot T)R(T)=Ω(ϵ⋅∑a​Δa​⋅T).

Decaying ϵt=c/t\epsilon_t = c/tϵt​=c/t can achieve logarithmic regret but requires tuning ccc to the unknown suboptimality gaps. This motivates algorithms that adapt their exploration rate automatically based on observed uncertainty — which is exactly what UCB and Thompson Sampling do.


Stochastic bandits

In stochastic bandits, each arm aaa has a fixed but unknown reward distribution νa\nu_aνa​ with mean μa\mu_aμa​. Observed rewards are i.i.d. conditioned on the chosen arm. This setting admits clean statistical analysis and sharp theoretical guarantees, making it the canonical formulation before extending to adversarial and contextual variants.


The Lai-Robbins lower bound

Before studying algorithms, it is worth asking: how good can a bandit algorithm be?

Theorem (Lai and Robbins, 1985): For any consistent algorithm (one that achieves sublinear regret on every bandit instance), the expected number of pulls of any suboptimal arm aaa satisfies:

lim inf⁡T→∞E[NT(a)]log⁡T≥1KL(νa∥ν∗)\liminf_{T \to \infty} \frac{\mathbb{E}[N_T(a)]}{\log T} \geq \frac{1}{\text{KL}(\nu_a \| \nu^*)}T→∞liminf​logTE[NT​(a)]​≥KL(νa​∥ν∗)1​

where KL(νa∥ν∗)\text{KL}(\nu_a \| \nu^*)KL(νa​∥ν∗) is the KL divergence between arm aaa's reward distribution and the optimal arm's distribution. For Gaussian rewards with unit variance, this simplifies to 1/Δa21 / \Delta_a^21/Δa2​ up to constants, giving:

R(T)=Ω ⁣(∑a:Δa>0log⁡TΔa)\mathcal{R}(T) = \Omega\!\left(\sum_{a:\Delta_a > 0} \frac{\log T}{\Delta_a}\right)R(T)=Ω(a:Δa​>0∑​Δa​logT​)
✦Intuition: Why KL divergence?

KL(νa∥ν∗)\text{KL}(\nu_a \| \nu^*)KL(νa​∥ν∗) measures how hard it is to statistically distinguish arm aaa's reward distribution from the optimal arm's. A small KL means the two arms look nearly identical — many samples are needed to tell them apart, so many pulls are unavoidable before you can safely stop exploring arm aaa. A large KL means easy discrimination and fewer pulls suffice. The lower bound is ultimately a statement about hypothesis testing difficulty: you cannot stop pulling an arm until you have enough evidence to rule out that it might be optimal.

This is a fundamental lower bound: no algorithm can achieve better than logarithmic regret. The lower bound has two consequences. First, logarithmic regret is not just what UCB achieves — it is the best any algorithm can achieve. Second, the 1/Δa1/\Delta_a1/Δa​ dependence is unavoidable: hard instances are those where the optimal and suboptimal arms are close in mean reward, requiring many pulls to distinguish them statistically.


Optimism in the face of uncertainty: UCB

The principle of optimism in the face of uncertainty provides a principled response to the exploration-exploitation tradeoff:

When uncertain, act as if the best plausible outcome will occur.

An arm that has been pulled rarely has a wide confidence interval. An optimistic algorithm treats the upper end of that interval as the arm's value, which drives exploration of uncertain arms without requiring a manually tuned exploration parameter.

Deriving the UCB bonus from Hoeffding's inequality

For rewards bounded in [0,1][0,1][0,1], Hoeffding's inequality gives:

P ⁣(μ^a−μa≥ϵ)≤exp⁡(−2Nt(a)ϵ2)P\!\left(\hat{\mu}_a - \mu_a \geq \epsilon\right) \leq \exp(-2 N_t(a) \epsilon^2)P(μ^​a​−μa​≥ϵ)≤exp(−2Nt​(a)ϵ2)

We want a confidence bound that holds with high probability across all rounds ttt and all arms. Setting the failure probability to 1/t21/t^21/t2 and solving for ϵ\epsilonϵ:

exp⁡(−2Nt(a)ϵ2)=1t2  ⇒  ϵ=log⁡t2Nt(a)⋅2=2log⁡tNt(a)\exp(-2 N_t(a) \epsilon^2) = \frac{1}{t^2} \;\Rightarrow\; \epsilon = \sqrt{\frac{\log t}{2 N_t(a)}} \cdot \sqrt{2} = \sqrt{\frac{2 \log t}{N_t(a)}}exp(−2Nt​(a)ϵ2)=t21​⇒ϵ=2Nt​(a)logt​​⋅2​=Nt​(a)2logt​​

This ϵ\epsilonϵ is an upper confidence bound on the gap between the empirical mean and the true mean: with probability at least 1−1/t21 - 1/t^21−1/t2, the true mean μa\mu_aμa​ lies below μ^a+2log⁡t/Nt(a)\hat{\mu}_a + \sqrt{2\log t / N_t(a)}μ^​a​+2logt/Nt​(a)​. This is exactly the UCB bonus term. The formula is not a heuristic — it is the statistical confidence interval derived from Hoeffding's inequality with failure probability 1/t21/t^21/t2.

✦Intuition: Optimism via confidence bounds

Why use the upper confidence bound rather than the estimated mean? Because optimism creates a self-correcting mechanism: an uncertain arm with a high upper bound gets pulled, which shrinks that bound; a well-known poor arm never accumulates a high upper bound and is left alone. Compare this to pessimism — an algorithm using the lower bound would ignore uncertain arms entirely and over-commit to whatever looked best early, replicating the failure of pure exploitation. Optimism is the minimal modification to greedy selection that guarantees uncertain arms get explored.

UCB1 algorithm

at=arg⁡max⁡a(μ^a+2log⁡tNt(a))a_t = \arg\max_a \left(\hat{\mu}_a + \sqrt{\frac{2\log t}{N_t(a)}}\right)at​=argamax​(μ^​a​+Nt​(a)2logt​​)

where μ^a\hat{\mu}_aμ^​a​ is the empirical mean of arm aaa and Nt(a)N_t(a)Nt​(a) is its pull count prior to round ttt. Arms that have not yet been pulled are assigned infinite UCB and are pulled first.

Interpretation: the first term exploits arms with high observed reward; the second term explores arms with high uncertainty. As Nt(a)N_t(a)Nt​(a) grows, the bonus shrinks, and the algorithm naturally transitions from exploration to exploitation. No tuning is required.

Regret bound

UCB1 achieves:

R(T)≤∑a:Δa>08log⁡TΔa+(1+π23)∑aΔa\mathcal{R}(T) \leq \sum_{a:\Delta_a > 0} \frac{8 \log T}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right)\sum_a \Delta_aR(T)≤a:Δa​>0∑​Δa​8logT​+(1+3π2​)a∑​Δa​

This matches the Lai-Robbins lower bound up to constants, confirming that UCB1 is essentially instance-optimal: no consistent algorithm can do significantly better on any bandit instance.

Implementation

import numpy as np

class UCB1:
    """
    UCB1 algorithm for multi-armed bandits.

    Args:
        n_arms: Number of bandit arms
        alpha: Exploration parameter (default=2 for UCB1)

    Attributes:
        counts: Number of times each arm was pulled
        values: Estimated mean reward for each arm
    """

    def __init__(self, n_arms, alpha=2.0):
        self.n_arms = n_arms
        self.alpha = alpha
        self.counts = np.zeros(n_arms)  # N_t(a) - pull counts
        self.values = np.zeros(n_arms)   # \hat{\mu}_a - empirical means
        self.t = 0  # Total rounds

    def select_arm(self):
        """
        Select arm using UCB1 formula: argmax( \hat{\mu}_a + \sqrt{\alpha * log(t) / N_t(a)} )

        Returns: arm index to pull
        """
        # First, pull each arm once (ensure N_t(a) > 0)
        for arm in range(self.n_arms):
            if self.counts[arm] == 0:
                return arm

        # Compute UCB for each arm
        ucb_values = np.zeros(self.n_arms)
        for arm in range(self.n_arms):
            bonus = np.sqrt(self.alpha * np.log(self.t) / self.counts[arm])
            ucb_values[arm] = self.values[arm] + bonus

        return np.argmax(ucb_values)

    def update(self, arm, reward):
        """
        Update estimates after observing reward.

        Args:
            arm: The arm that was pulled
            reward: The observed reward
        """
        self.t += 1
        self.counts[arm] += 1

        # Incremental mean update: \hat{\mu}_a = \hat{\mu}_a + (r - \hat{\mu}_a) / N_t(a)
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n


# Example: 4-armed bandit with Bernoulli rewards
np.random.seed(42)
n_arms = 4
true_means = [0.3, 0.5, 0.7, 0.9]  # True arm means (unknown to algorithm)
n_rounds = 1000

# Run UCB1
ucb = UCB1(n_arms)
total_reward = 0
rewards = []

for t in range(n_rounds):
    arm = ucb.select_arm()
    reward = float(np.random.random() < true_means[arm])  # Bernoulli reward
    ucb.update(arm, reward)
    total_reward += reward
    rewards.append(total_reward)

print(f"UCB1 Total Reward: {total_reward:.0f}/{n_rounds}")
print(f"UCB1 Estimated Means: {ucb.values}")
print(f"True Means: {true_means}")

Key implementation details:

  1. Initial exploration: Each arm is pulled once before UCB formula applies (handles division by zero)
  2. Incremental updates: Uses values[arm] += (reward - values[arm]) / n for numerical stability
  3. Logarithmic regret: The log⁡t\log tlogt in the bonus ensures diminishing exploration over time
  4. No epsilon tuning: Unlike ϵ\epsilonϵ-greedy, UCB automatically balances exploration/exploitation

Comparison with Epsilon-Greedy

class EpsilonGreedy:
    """Baseline: explore with probability epsilon, exploit otherwise."""

    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)  # Explore: random arm
        return np.argmax(self.values)  # Exploit: best known arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

# Comparison
for epsilon in [0.01, 0.1, 0.3]:
    eg = EpsilonGreedy(n_arms, epsilon)
    eg_reward = 0
    for _ in range(n_rounds):
        arm = eg.select_arm()
        reward = float(np.random.random() < true_means[arm])
        eg.update(arm, reward)
        eg_reward += reward
    print(f"ε={epsilon}: Total={eg_reward:.0f} (needs tuning!)")

# UCB1 automatically tunes exploration
print(f"UCB1: Total={total_reward:.0f} (no tuning needed)")

Why UCB beats ε-greedy:

  • ε-greedy uses fixed exploration rate (requires tuning)
  • UCB exploration bonus automatically decreases as confidence increases
  • UCB has theoretical regret guarantees, ε-greedy does not

Thompson Sampling

Thompson Sampling approaches exploration from a Bayesian perspective. Rather than constructing deterministic confidence bounds, it maintains a posterior distribution over each arm's reward parameter and samples from it.

Algorithm

  1. Maintain a posterior p(θa∣history)p(\theta_a \mid \text{history})p(θa​∣history) over each arm's reward parameter
  2. At round ttt, sample θ~a∼p(θa∣history)\tilde{\theta}_a \sim p(\theta_a \mid \text{history})θ~a​∼p(θa​∣history) for each arm
  3. Select at=arg⁡max⁡aθ~aa_t = \arg\max_a \tilde{\theta}_aat​=argmaxa​θ~a​
  4. Observe rtr_trt​ and update the posterior for arm ata_tat​
✦Intuition: Sampling as implicit exploration

Under-explored arms have wide posteriors with high variance. When you sample from a high-variance posterior, you'll occasionally draw a value large enough to beat the empirical leader — forcing the algorithm to pull that arm and narrow its posterior. As data accumulates, posteriors tighten, variance shrinks, and extreme samples become increasingly rare. Exploration is not bolted on as a separate mechanism: it emerges directly from Bayesian uncertainty. The less you know about an arm, the more likely it is to win a sampling round.

The key insight: an arm with high posterior uncertainty has high variance in its samples, which means it will occasionally produce a very high sample and get selected — this is exploration. An arm whose posterior is tightly concentrated near a low mean will rarely produce a sample that beats the optimal arm — this is exploitation. Exploration is implicit in the posterior variance, not forced by a separate mechanism.

Beta-Bernoulli Thompson Sampling

For binary rewards (click/no-click, thumbs-up/thumbs-down, success/failure), the Beta distribution is the conjugate prior for the Bernoulli likelihood. This makes the posterior update exact and closed-form.

Prior: θa∼Beta(αa,βa)\theta_a \sim \text{Beta}(\alpha_a, \beta_a)θa​∼Beta(αa​,βa​), initialized as Beta(1,1)\text{Beta}(1, 1)Beta(1,1) (uniform prior).

Why Beta is conjugate to Bernoulli: By Bayes' rule, given nnn binary observations with sss successes and f=n−sf = n - sf=n−s failures from arm aaa:

p(θa∣data)∝θas(1−θa)f⋅θaαa−1(1−θa)βa−1=θaαa+s−1(1−θa)βa+f−1p(\theta_a \mid \text{data}) \propto \theta_a^s (1-\theta_a)^f \cdot \theta_a^{\alpha_a - 1}(1-\theta_a)^{\beta_a-1} = \theta_a^{\alpha_a + s - 1}(1-\theta_a)^{\beta_a + f - 1}p(θa​∣data)∝θas​(1−θa​)f⋅θaαa​−1​(1−θa​)βa​−1=θaαa​+s−1​(1−θa​)βa​+f−1

This is Beta(αa+s, βa+f)\text{Beta}(\alpha_a + s,\, \beta_a + f)Beta(αa​+s,βa​+f) — the posterior is Beta with updated counts. Each observation updates exactly one parameter: αa←αa+1\alpha_a \leftarrow \alpha_a + 1αa​←αa​+1 on success, βa←βa+1\beta_a \leftarrow \beta_a + 1βa​←βa​+1 on failure. The Beta distribution is not an arbitrary modeling choice — it is the exact Bayesian posterior for a Bernoulli arm under a Beta prior.

✦Intuition: Conjugacy as closed-form learning

Conjugacy means the posterior has the same functional form as the prior, so each update requires no numerical integration. After observing sss successes and fff failures, the posterior is still Beta — just with counts incremented. Without conjugacy, computing the exact posterior after each observation would require MCMC or variational inference, making real-time updates in a bandit loop infeasible. Conjugate priors are the reason Thompson Sampling can update in O(1)O(1)O(1) per round. The same principle extends to Gaussian arms (normal-inverse-gamma conjugacy) and Poisson arms (gamma-Poisson conjugacy).

Action selection:

import numpy as np

def thompson_sampling_step(alphas, betas):
    """
    One step of Beta-Bernoulli Thompson Sampling.
    alphas, betas: arrays of shape (K,) with current posterior parameters.
    Returns selected arm index.
    """
    samples = np.random.beta(alphas, betas)
    return np.argmax(samples)

def update_posterior(alphas, betas, arm, reward):
    """
    Bayesian posterior update for Bernoulli arm.
    reward: 1 (success) or 0 (failure).
    """
    alphas[arm] += reward
    betas[arm] += (1 - reward)
    return alphas, betas
  1. We sample from the Beta distribution for each arm using their current posterior parameters α\alphaα and β\betaβ. This single line handles both the exploitation (mean of the distribution) and exploration (variance of the distribution).
  2. We simply act greedily with respect to the sampled values. This is probability matching in action.
  3. If the reward is 1 (success), we increment α\alphaα, effectively shifting the distribution mean closer to 1.
  4. If the reward is 0 (failure), we increment β\betaβ, shifting the mean closer to 0.

The same conjugacy principle extends beyond Bernoulli rewards: Gaussian arms use a normal-inverse-gamma conjugate prior; Poisson arms use a gamma-Poisson conjugacy. Any conjugate prior-likelihood pair gives a Thompson Sampling algorithm with closed-form posterior updates.

Regret guarantees

Thompson Sampling achieves O(KTlog⁡T)O(\sqrt{KT\log T})O(KTlogT​) regret in the frequentist sense and matches the Lai-Robbins lower bound asymptotically for exponential family reward distributions. In practice it often outperforms UCB empirically despite having slightly weaker worst-case guarantees, because its randomized exploration is better calibrated to posterior uncertainty than the deterministic UCB bonus.


Comparison: ϵ\epsilonϵ-greedy vs UCB vs Thompson Sampling

| Aspect | ϵ\epsilonϵ-greedy | UCB1 | Thompson Sampling | |---|---|---|---| | Exploration mechanism | Fixed random rate | Deterministic confidence bound | Posterior sampling | | Exploration rate | Manual tuning of ϵ\epsilonϵ | Automatic via log⁡t/Nt(a)\sqrt{\log t / N_t(a)}logt/Nt​(a)​ | Automatic via posterior variance | | Regret | Linear (fixed ϵ\epsilonϵ) | O(log⁡T/Δmin⁡)O(\log T / \Delta_{\min})O(logT/Δmin​) — optimal | O(KTlog⁡T)O(\sqrt{KT\log T})O(KTlogT​) — near-optimal | | Implementation | Trivial | Simple counters and means | Requires prior and sampling | | Empirical behavior | Often competitive with tuning | Strong, consistent | Often best in practice | | Theoretical status | Not instance-optimal | Instance-optimal (matches Lai-Robbins) | Asymptotically optimal |

The progression ϵ\epsilonϵ-greedy →\to→ UCB →\to→ Thompson Sampling is a progression from heuristic exploration toward principled uncertainty quantification. All three implement the same underlying principle — uncertain arms should be explored more — with increasing statistical sophistication.


Contextual bandits

In many real systems, the optimal action depends on context observed at each round.

Formulation

In a contextual bandit:

  • At round ttt, the agent observes context xt∈Xx_t \in \mathcal{X}xt​∈X
  • Selects action at∈{1,…,K}a_t \in \{1, \ldots, K\}at​∈{1,…,K}
  • Receives reward rt∼ν(xt,at)r_t \sim \nu(x_t, a_t)rt​∼ν(xt​,at​)

There are still no state transitions or delayed effects — each round is independent given the context. But the policy is now a mapping π:X→Δ(A)\pi: \mathcal{X} \to \Delta(\mathcal{A})π:X→Δ(A) from contexts to action distributions, rather than a fixed action.

◆Why Not apply UCB/Thompson Sampling directly?

Standard UCB tracks one confidence interval per arm; Thompson Sampling maintains one posterior per arm. Both assume each arm has a fixed unknown mean. In a contextual bandit, the reward depends on both the arm and the context — "arm 2's mean" is no longer a fixed number, so you would need a separate estimate for arm 2 in every possible context. For continuous or high-dimensional context spaces this is intractable. LinUCB resolves this by assuming a linear reward function E[r∣x,a]=x⊤θa\mathbb{E}[r \mid x, a] = x^\top \theta_aE[r∣x,a]=x⊤θa​, enabling generalization across contexts via ridge regression with O(d)O(d)O(d) parameters per arm rather than O(K⋅∣X∣)O(K \cdot |\mathcal{X}|)O(K⋅∣X∣).

LinUCB: contextual bandits with linear reward models

The canonical contextual bandit algorithm is LinUCB, which assumes:

E[rt∣xt,at]=xt⊤θa\mathbb{E}[r_t \mid x_t, a_t] = x_t^\top \theta_aE[rt​∣xt​,at​]=xt⊤​θa​

for arm-specific parameter vectors θa∈Rd\theta_a \in \mathbb{R}^dθa​∈Rd. Under this model, the ridge regression estimate θ^a\hat{\theta}_aθ^a​ and its covariance Aa−1A_a^{-1}Aa−1​ give a confidence ellipsoid, and the UCB for arm aaa in context xtx_txt​ is:

UCBt(a)=xt⊤θ^a+αxt⊤Aa−1xt\text{UCB}_t(a) = x_t^\top \hat{\theta}_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t}UCBt​(a)=xt⊤​θ^a​+αxt⊤​Aa−1​xt​​

where α\alphaα controls the confidence width. LinUCB achieves O(dTlog⁡T)O(d\sqrt{T\log T})O(dTlogT​) regret, with dependence on ddd (the context dimension) rather than KKK (the number of arms) — exploiting the linear structure to generalize across contexts.

LinUCB is deployed in production recommendation systems (the original paper describes its use at Yahoo for news article recommendation) and is the conceptual foundation for neural contextual bandit algorithms that replace the linear reward model with a neural network.

Partial feedback: the core difficulty

In both standard and contextual bandits, the agent observes only the reward for the chosen action — not the rewards it would have received from other actions. This is bandit feedback or partial feedback, as opposed to full information feedback where all rewards are observed.

⚠What Breaks Here

With bandit feedback you never observe the counterfactual reward — what arm 3 would have yielded when you pulled arm 2. Empirical estimates for non-chosen arms therefore never update. A policy that was unlucky early and avoided a good arm has no evidence to contradict the initial bad estimate, so it continues avoiding that arm indefinitely. This is the fundamental gap between supervised and bandit learning: a supervised classifier sees all labels and can freely correct its beliefs; a bandit algorithm only sees rewards for actions it actually took, making every exploratory pull load-bearing.

Partial feedback is what distinguishes bandit learning from supervised learning over the same data. In supervised learning, all labels are observed and a classifier can be trained by regression on the full dataset. In a contextual bandit, you only observe rt(at)r_t(a_t)rt​(at​), not rt(a)r_t(a)rt​(a) for a≠ata \neq a_ta=at​. This means you cannot directly evaluate a policy that would have made different choices — you must estimate counterfactual performance.

This partial feedback structure is the root of the distributional shift problem in offline RLReinforcement Learning: a policy trained from logged data never observes rewards for actions the logging policy did not take, so its value estimates for those actions are ungrounded.


Policy evaluation with bandit feedback

Suppose you have a logged dataset {(xt,at,rt)}t=1T\{(x_t, a_t, r_t)\}_{t=1}^T{(xt​,at​,rt​)}t=1T​ collected by some behavior policy πb\pi_bπb​ and you want to evaluate a new target policy π\piπ. Since π\piπ may take different actions than πb\pi_bπb​, you cannot simply average the observed rewards.

Importance sampling estimator

The inverse propensity scoring (IPS) estimator corrects for this mismatch:

V^(π)=1T∑t=1Tπ(at∣xt)πb(at∣xt)rt\hat{V}(\pi) = \frac{1}{T} \sum_{t=1}^T \frac{\pi(a_t \mid x_t)}{\pi_b(a_t \mid x_t)} r_tV^(π)=T1​t=1∑T​πb​(at​∣xt​)π(at​∣xt​)​rt​

The ratio π(at∣xt)/πb(at∣xt)\pi(a_t \mid x_t) / \pi_b(a_t \mid x_t)π(at​∣xt​)/πb​(at​∣xt​) is the importance weight: it upweights rounds where π\piπ would have chosen the same action as πb\pi_bπb​ and downweights rounds where it would not. Under the assumption that πb(a∣x)>0\pi_b(a \mid x) > 0πb​(a∣x)>0 wherever π(a∣x)>0\pi(a \mid x) > 0π(a∣x)>0 (coverage), the IPS estimator is unbiased: E[V^(π)]=V(π)\mathbb{E}[\hat{V}(\pi)] = V(\pi)E[V^(π)]=V(π).

⚠What Breaks Here: The Coverage Assumption
  • If π\piπ is a heavily optimized policy and πb\pi_bπb​ is the original model, they may differ substantially. Actions that π\piπ favors but πb\pi_bπb​ avoided will have zero coverage.
  • An extreme importance weight (from a rare action in the behavior dataset) can make the entire estimate unreliable: a single lucky or unlucky sample with a huge weight dominates the average.
  • This is why offline RLReinforcement Learning is hard: you are trying to evaluate a policy on a distribution of states and actions it has never experienced.

The limitation is high variance: when π\piπ and πb\pi_bπb​ differ substantially, the importance weights become large, inflating the estimator's variance. In extreme cases, a single round with a large weight can dominate the estimate entirely.

The doubly robust (DR) estimator reduces variance by combining IPS with a learned reward model r^(x,a)\hat{r}(x, a)r^(x,a):

V^DR(π)=1T∑t=1T[r^(xt,π)+π(at∣xt)πb(at∣xt)(rt−r^(xt,at))]\hat{V}_{\text{DR}}(\pi) = \frac{1}{T} \sum_{t=1}^T \left[ \hat{r}(x_t, \pi) + \frac{\pi(a_t \mid x_t)}{\pi_b(a_t \mid x_t)}(r_t - \hat{r}(x_t, a_t)) \right]V^DR​(π)=T1​t=1∑T​[r^(xt​,π)+πb​(at​∣xt​)π(at​∣xt​)​(rt​−r^(xt​,at​))]

The DR estimator is unbiased if either the reward model or the importance weights are correct — it is robust to misspecification of one but not both. These estimators reappear in offline RLReinforcement Learning and in RLHFReinforcement Learning from Human Feedback evaluation wherever counterfactual reasoning is required.


GenAI context: bandits in RLHFReinforcement Learning from Human Feedback

Several components of RLHFReinforcement Learning from Human Feedback reduce directly to bandit or contextual bandit problems.

The RLHFReinforcement Learning from Human Feedback bandit formulation

| MDPMarkov Decision Process/Bandit component | RLHFReinforcement Learning from Human Feedback interpretation | |---|---| | Context xtx_txt​ | Prompt | | Action ata_tat​ | Generated response (full completion) | | Reward rtr_trt​ | Human preference score or reward model output | | Behavior policy πb\pi_bπb​ | Reference model (SFT checkpoint) |

The RLHFReinforcement Learning from Human Feedback training loop is a contextual bandit where the action space is the set of all possible text completions — astronomically large and structured. This structure has two important consequences:

Why direct UCB/Thompson Sampling are inapplicable: With a continuous, high-dimensional action space, maintaining per-action counts or posteriors is infeasible. Instead, RLHFReinforcement Learning from Human Feedback learns a reward model rϕ(x,a)r_\phi(x, a)rϕ​(x,a) that generalizes across the action space, then uses it to score completions. The reward model plays the role of the UCB confidence bound or the Thompson sample — it scores actions by their estimated value — but it generalizes via function approximation rather than per-arm statistics.

The partial feedback problem in RLHFReinforcement Learning from Human Feedback: Human preference data is inherently partial: for a given prompt, the human rates one or two completions, not all possible completions. The reward model must generalize from this bandit feedback to the full action space. When the reward model overfits to the distribution of rated completions, it produces unreliable scores for out-of-distribution responses — this is one mechanism behind reward hacking in RLHFReinforcement Learning from Human Feedback and is a direct manifestation of the distributional shift problem identified in bandit policy evaluation.


Limitations of bandits

Bandits deliberately ignore:

  • delayed consequences of actions,
  • state evolution over time,
  • long-horizon planning.

They are insufficient when actions influence future states and rewards. Their value lies in isolating the statistical core of learning from interaction — uncertainty quantification, exploration, and partial feedback — in the simplest setting where these issues arise.

The transition from bandits to full MDPs reintroduces state transitions. The arm value μa\mu_aμa​ generalizes to the action-value function Q∗(s,a)Q^*(s, a)Q∗(s,a): the expected return of taking action aaa in state sss and acting optimally thereafter. The UCB bonus for exploration generalizes to optimism-based exploration in MDPs. The importance sampling estimator for bandit policy evaluation generalizes to importance-weighted policy gradient estimators. Every bandit concept has an MDPMarkov Decision Process analog.


Key takeaways

The structure of this lecture mirrors the structure of a good algorithm design argument: identify the objective (regret), decompose it into controllable quantities (the suboptimality gap decomposition), establish what is theoretically achievable (Lai-Robbins lower bound), and then derive algorithms that meet the bound from first principles (UCB from Hoeffding, Thompson Sampling from Bayes' rule). This pattern — formalize, lower bound, match — recurs throughout the course.

Concretely: regret decomposes into expected pulls of suboptimal arms weighted by their gaps. ϵ\epsilonϵ-greedy achieves linear regret because it explores at a fixed rate regardless of uncertainty. UCB achieves logarithmic regret by deriving the exploration bonus from a statistical confidence bound that shrinks as uncertainty decreases. Thompson Sampling achieves the same asymptotically by sampling from the posterior, making exploration implicit in Bayesian uncertainty. Contextual bandits extend the framework to context-dependent rewards; LinUCB applies UCB to a linear reward model. Partial feedback requires importance sampling for counterfactual evaluation, introducing the distributional shift that reappears throughout offline RLReinforcement Learning and RLHFReinforcement Learning from Human Feedback.


Conceptual questions

  1. Suppose you have K=3K = 3K=3 arms with true means μ1=0.9\mu_1 = 0.9μ1​=0.9, μ2=0.85\mu_2 = 0.85μ2​=0.85, μ3=0.1\mu_3 = 0.1μ3​=0.1. Write out the regret decomposition R(T)=∑aΔa⋅E[NT(a)]\mathcal{R}(T) = \sum_a \Delta_a \cdot \mathbb{E}[N_T(a)]R(T)=∑a​Δa​⋅E[NT​(a)] explicitly. Which arm dominates the regret, and why does the answer depend on both Δa\Delta_aΔa​ and E[NT(a)]\mathbb{E}[N_T(a)]E[NT​(a)]? Why is arm 2 harder to handle than arm 3?

  2. Derive the UCB1 bonus term 2log⁡t/Nt(a)\sqrt{2\log t / N_t(a)}2logt/Nt​(a)​ from Hoeffding's inequality by setting the failure probability to 1/t21/t^21/t2. Explain why the choice of 1/t21/t^21/t2 (rather than, say, 1/t1/t1/t) matters for the union bound over all arms and all rounds.

  3. The Beta-Bernoulli Thompson Sampling update is α←α+1\alpha \leftarrow \alpha + 1α←α+1 on success, β←β+1\beta \leftarrow \beta + 1β←β+1 on failure. Derive this rule from Bayes' theorem using the Beta prior and Bernoulli likelihood. What does the ratio α/(α+β)\alpha / (\alpha + \beta)α/(α+β) represent, and how does the posterior variance αβ/[(α+β)2(α+β+1)]\alpha\beta / [(\alpha+\beta)^2(\alpha+\beta+1)]αβ/[(α+β)2(α+β+1)] change as more data is collected?

  4. An RLHFReinforcement Learning from Human Feedback system collects human preference labels for 1000 prompt-response pairs, always showing humans responses sampled from the current SFT model. A reward model is trained on this data and used to score responses from a fine-tuned model that has drifted significantly from the SFT checkpoint. Explain this failure in terms of partial feedback, distributional shift, and the coverage assumption required by the IPS estimator.

  5. UCB1 achieves logarithmic regret and matches the Lai-Robbins lower bound up to constants. Does this mean UCB1 is the "best possible" bandit algorithm? Explain what "instance-optimal" means, what the lower bound actually says, and describe a setting where Thompson Sampling would empirically outperform UCB1 despite both being asymptotically optimal.

  6. Extension: The KL-UCB algorithm replaces the Hoeffding-derived bonus with a tighter bound based on KL divergence, achieving: at=arg⁡max⁡asup⁡ ⁣{q∈[0,1]:Nt(a) KL(μ^a∥q)≤log⁡t}a_t = \arg\max_a \sup\!\left\{q \in [0,1] : N_t(a)\,\text{KL}(\hat{\mu}_a \| q) \leq \log t\right\}at​=argmaxa​sup{q∈[0,1]:Nt​(a)KL(μ^​a​∥q)≤logt} Explain why KL-UCB is tighter than UCB1 for Bernoulli arms. In what regime (Δa\Delta_aΔa​ large vs. small) does the difference matter most? Would you expect KL-UCB or UCB1 to have a larger advantage when KKK is large and arms are mostly near-optimal?


Coding exercise

Implement and compare UCB1 variants.

Starting from the UCB1 class in this lesson, implement a KLUCB class that selects arms using the KL-UCB index (binary search over qqq to solve Nt(a) KL(μ^a∥q)=log⁡tN_t(a)\,\text{KL}(\hat{\mu}_a \| q) = \log tNt​(a)KL(μ^​a​∥q)=logt, where KL(p∥q)=plog⁡(p/q)+(1−p)log⁡((1−p)/(1−q))\text{KL}(p \| q) = p\log(p/q) + (1-p)\log((1-p)/(1-q))KL(p∥q)=plog(p/q)+(1−p)log((1−p)/(1−q)) for Bernoulli arms). Then run a simulation comparing UCB1 and KL-UCB on a 5-arm Bernoulli bandit with means [0.5, 0.55, 0.6, 0.65, 0.9] over T=5000T = 5000T=5000 rounds. Plot cumulative regret for both algorithms.

Things to observe:

  • Which algorithm accumulates more regret early vs. late?
  • On the hard arms (means 0.5–0.65), which algorithm explores more efficiently?
  • How does the gap between algorithms change as Δmin⁡\Delta_{\min}Δmin​ shrinks?

Looking ahead

The next lecture reintroduces state transitions through dynamic programming for finite MDPs. We will see how the Bellman equations from Week 1 become computational algorithms — policy evaluation, policy iteration, and value iteration — and why exact solutions quickly become infeasible as the state space grows. The exploration-exploitation tradeoff identified in bandits reappears in the MDPMarkov Decision Process setting, where it is compounded by the need to explore a state space rather than a fixed set of arms.


Further reading

  • Lattimore, T., & Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press. (The definitive modern textbook on bandit theory).
  • Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning. (The original UCB paper).
  • Agrawal, S., & Goyal, N. (2012). Analysis of Thompson Sampling for the multi-armed bandit problem. Conference on Learning Theory (COLT).
  • Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A Tutorial on Thompson Sampling. Foundations and Trends® in Machine Learning. (Accessible survey covering theory, implementation, and applications including RLHF-adjacent settings).
← Previous
Week 1: Reinforcement Learning Problem Formulation
Next →
Week 3: Dynamic Programming for Finite MDPs
On this page
  • Purpose of this lecture
  • The multi-armed bandit problem
  • Regret
  • Definition
  • The regret decomposition
  • Sublinear regret
  • The exploration–exploitation trade-off
  • Why pure exploitation fails: a concrete example
  • Why pure exploration fails
  • \epsilon-greedy: the simplest resolution
  • Stochastic bandits
  • The Lai-Robbins lower bound
  • Optimism in the face of uncertainty: UCB
  • Deriving the UCB bonus from Hoeffding's inequality
  • UCB1 algorithm
  • Regret bound
  • Implementation
  • Comparison with Epsilon-Greedy
  • Thompson Sampling
  • Algorithm
  • Beta-Bernoulli Thompson Sampling
  • Regret guarantees
  • Comparison: \epsilon-greedy vs UCB vs Thompson Sampling
  • Contextual bandits
  • Formulation
  • LinUCB: contextual bandits with linear reward models
  • Partial feedback: the core difficulty
  • Policy evaluation with bandit feedback
  • Importance sampling estimator
  • GenAI context: bandits in RLHF
  • The RLHF bandit formulation
  • Limitations of bandits
  • Key takeaways
  • Conceptual questions
  • Coding exercise
  • Looking ahead
  • Further reading