Week 2: Multi-Armed Bandits

Purpose of this lecture#

Here is the core puzzle: why is learning difficult at all if the problem is stateless?

In a multi-armed bandit, there are no state transitions, no long-term planning, no hidden dynamics to uncover. Each round is independent. Yet the agent still faces a fundamental challenge: it can only learn by pulling arms, and pulling the wrong arm wastes immediate reward. This is the exploration-exploitation tradeoff in its purest, most unavoidable form.

The payoff of studying bandits is real. Despite their simplicity, they:

Isolate the statistical core of learning from interaction. What makes learning hard is not complex dynamics—it is uncertainty under partial feedback. A bandit distills this to its essence.
Provide theoretical foundations. Bandits have clean regret analysis, lower bounds, and instance-optimal algorithms. These tools generalize to MDPs and beyond.
Show up everywhere in practice. Recommendation systems, A/B testing, and RLHF all reduce to bandit or contextual bandit problems in key components.

This lecture formalizes the exploration-exploitation tradeoff, establishes fundamental limits via the Lai-Robbins lower bound, and derives algorithms that provably match those limits. The progression from $\epsilon$ -greedy (heuristic) through UCB (principled confidence bounds) to Thompson Sampling (Bayesian) is a masterclass in algorithm design: formalize the problem, establish what is theoretically possible, then derive algorithms from first principles.

The multi-armed bandit problem#

A multi-armed bandit problem consists of:

A fixed set of $K$ actions (arms) $a \in \{1, \dots, K\}$
Each arm $a$ has an unknown reward distribution with mean $\mu_a$
At each round $t$ , the agent selects an arm $a_t$ and observes a reward $r_t \sim \nu_{a_t}$

There is no state, no transition dynamics, and no delayed consequences. The agent's goal is to maximize cumulative reward over $T$ rounds by learning which arms are good — while still occasionally trying others.

The stateless structure makes bandits analytically tractable and exposes the exploration-exploitation tradeoff in its purest form. Every difficulty present in bandits — uncertainty about unknown means, the cost of exploration, partial observability of counterfactual rewards — persists in full MDPs alongside additional challenges. By studying bandits first, you understand the irreducible statistical core before adding the complexity of state transitions and long-horizon planning.

Regret#

Performance in bandit problems is measured by regret: the cumulative reward foregone relative to an oracle that always pulls the optimal arm.

Definition#

The suboptimality gap of arm $a$ is:

\Delta_a = \mu^* - \mu_a, \qquad \mu^* = \max_{a'} \mu_{a'}

The expected cumulative regret after $T$ rounds is:

\mathcal{R}(T) = T\mu^* - \mathbb{E}\!\left[\sum_{t=1}^T r_t\right]

The regret decomposition#

The most useful form of the regret is the decomposition by suboptimality gap:

\mathcal{R}(T) = \sum_{a=1}^K \Delta_a \cdot \mathbb{E}[N_T(a)]

where $N_T(a) = \sum_{t=1}^T \mathbf{1}[a_t = a]$ is the number of times arm $a$ is pulled over $T$ rounds. This decomposition follows directly from linearity of expectation and the definition of $\Delta_a$ .
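For completeness, the two steps behind the decomposition can be written out. This is a sketch under the stochastic-bandit assumption that, given the chosen arm $a_t$, the reward has conditional mean $\mu_{a_t}$; it uses only symbols already defined above.

```latex
\begin{aligned}
\mathcal{R}(T)
  &= T\mu^* - \mathbb{E}\!\left[\sum_{t=1}^T r_t\right]
   = \mathbb{E}\!\left[\sum_{t=1}^T \left(\mu^* - \mu_{a_t}\right)\right]
   && \text{(tower rule: } \mathbb{E}[r_t \mid a_t] = \mu_{a_t}\text{)} \\
  &= \mathbb{E}\!\left[\sum_{t=1}^T \sum_{a=1}^K \mathbf{1}[a_t = a]\,\Delta_a\right]
   = \sum_{a=1}^K \Delta_a \,\mathbb{E}[N_T(a)]
   && \text{(linearity of expectation)}
\end{aligned}
```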

The decomposition is fundamental because it reveals what a bandit algorithm must control: the expected number of pulls of each suboptimal arm, weighted by how suboptimal it is. Arms with small gaps ($\Delta_a \approx 0$) are hard to distinguish from the optimal arm: separating two means that differ by $\Delta_a$ takes on the order of $1/\Delta_a^2$ samples. Arms with large gaps are easy to identify but costly per pull. A good algorithm keeps $\mathbb{E}[N_T(a)]$ for each suboptimal arm as small as statistical confidence allows.

Sublinear regret#

Good bandit algorithms achieve sublinear regret: $\mathcal{R}(T) = o(T)$ .

Sublinear regret means the average regret per step $\mathcal{R}(T)/T \to 0$ as $T \to \infty$ : the algorithm eventually learns to behave near-optimally. Linear regret — $\mathcal{R}(T) = \Theta(T)$ — means the algorithm permanently pulls suboptimal arms at a constant rate, which is a failure of learning.

The exploration–exploitation trade-off#

At every time step, the agent faces a fundamental choice:

Exploitation: select the arm that currently appears best (maximize immediate reward)
Exploration: select an uncertain arm to gather information (sacrifice immediate reward for future improvement)

Why pure exploitation fails: a concrete example#

Consider $K = 2$ arms with rewards in $[0, 1]$ and suppose that in the first two rounds, arm 1 is pulled once with reward 1 and arm 2 is pulled once with reward 0. A greedy algorithm now commits to arm 1 forever: arm 1's empirical mean stays strictly positive because it already contains one success, while arm 2's estimate stays at 0 because arm 2 is never pulled again. But suppose the true means are $\mu_1 = 0.1$ and $\mu_2 = 0.9$. The greedy algorithm was unlucky in its first two pulls and will now incur regret $\Delta_1 = 0.8$ in every remaining round. The problem is not the decision at round 3; it is that the algorithm has no mechanism to recognize that each of its estimates rests on a single noisy sample. Exploitation without uncertainty quantification is fragile.
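The lock-in described above can be reproduced in a few lines. This is a minimal sketch, assuming Bernoulli rewards and the unlucky first two observations from the example; the function name `greedy_after_unlucky_start` and the round count are illustrative choices, not part of the lecture.

```python
import numpy as np

def greedy_after_unlucky_start(true_means=(0.1, 0.9), n_rounds=1000, seed=0):
    """Run pure greedy after arm index 0 returned 1 and arm index 1 returned 0."""
    rng = np.random.default_rng(seed)
    counts = np.array([1.0, 1.0])   # each arm has been pulled once
    values = np.array([1.0, 0.0])   # empirical means after the unlucky start
    pulls = np.zeros(2, dtype=int)  # pulls made by greedy from round 3 onward
    for _ in range(n_rounds):
        arm = int(np.argmax(values))  # no exploration at all
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        pulls[arm] += 1
    return pulls, values

pulls, values = greedy_after_unlucky_start()
print("pulls per arm:", pulls)
print("empirical means:", values)
```

Because the empirical mean of the first arm can shrink toward 0.1 but never reach 0, the second arm's estimate of 0 is never revisited, whatever the random seed.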

Why pure exploration fails#

Conversely, an algorithm that samples each arm uniformly achieves zero bias in its estimates but wastes reward proportional to $\sum_a \Delta_a / K$ per round — linear regret. Learning and earning must happen simultaneously.

$\epsilon$ -greedy: the simplest resolution#

The simplest approach that forces exploration is $\epsilon$ -greedy: with probability $\epsilon$ , select a random arm; with probability $1 - \epsilon$ , select the empirically best arm.

a_t = \begin{cases} \text{uniform random arm} & \text{with probability } \epsilon \\ \arg\max_a \hat{\mu}_a(t) & \text{with probability } 1 - \epsilon \end{cases}

$\epsilon$-greedy is simple and works in practice, but has a critical limitation: the exploration rate $\epsilon$ is fixed regardless of uncertainty. In early rounds, when estimates are highly uncertain, $\epsilon$ may be too small. In late rounds, when the optimal arm is well-identified, $\epsilon$ is too large: the algorithm keeps exploring at the same rate even when there is nothing left to learn. The result is linear regret. Every round is an exploration round with probability $\epsilon$, and an exploration round picks each arm with probability $1/K$, so $\mathcal{R}(T) \geq \frac{\epsilon}{K} \sum_a \Delta_a \cdot T$.

Decaying $\epsilon_t = c/t$ can achieve logarithmic regret but requires tuning $c$ to the unknown suboptimality gaps. This motivates algorithms that adapt their exploration rate automatically based on observed uncertainty — which is exactly what UCB and Thompson Sampling do.
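A decaying schedule can be sketched as follows. The cap at 1 and the value `c = 5.0` are assumptions made for illustration; in practice a good $c$ depends on the unknown gaps, which is exactly the tuning problem described above.

```python
import numpy as np

def decaying_epsilon(t, c=5.0):
    """Exploration probability at round t (1-indexed): min(1, c / t)."""
    return min(1.0, c / t)

def select_arm_decaying(values, t, rng, c=5.0):
    """Epsilon-greedy choice with epsilon_t = min(1, c / t)."""
    if rng.random() < decaying_epsilon(t, c):
        return int(rng.integers(len(values)))  # explore: uniform random arm
    return int(np.argmax(values))              # exploit: empirically best arm

# The expected number of exploration rounds grows like c * log(T)
# instead of epsilon * T for a fixed epsilon.
T = 10_000
expected_explore = sum(decaying_epsilon(t) for t in range(1, T + 1))
print(f"expected exploration rounds over T={T}: {expected_explore:.1f}")
```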

Stochastic bandits#

In stochastic bandits, each arm $a$ has a fixed but unknown reward distribution $\nu_a$ with mean $\mu_a$ . Observed rewards are i.i.d. conditioned on the chosen arm. This setting admits clean statistical analysis and sharp theoretical guarantees, making it the canonical formulation before extending to adversarial and contextual variants.

The Lai-Robbins lower bound#

Before studying algorithms, it is worth asking: how good can a bandit algorithm be?

Theorem (Lai and Robbins, 1985): For any consistent algorithm (one that achieves sublinear regret on every bandit instance), the expected number of pulls of any suboptimal arm $a$ satisfies:

\liminf_{T \to \infty} \frac{\mathbb{E}[N_T(a)]}{\log T} \geq \frac{1}{\text{KL}(\nu_a \| \nu^*)}

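where $\text{KL}(\nu_a \| \nu^*)$ is the KL divergence between arm $a$'s reward distribution and the optimal arm's distribution. For Gaussian rewards with unit variance, $\text{KL}(\nu_a \| \nu^*) = \Delta_a^2 / 2$, so each suboptimal arm must be pulled on the order of $\log T / \Delta_a^2$ times. Weighting each pull count by its gap $\Delta_a$, as in the regret decomposition, gives: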

\mathcal{R}(T) = \Omega\!\left(\sum_{a:\Delta_a > 0} \frac{\log T}{\Delta_a}\right)

This is a fundamental lower bound: no consistent algorithm can achieve better than logarithmic regret. The lower bound has two consequences. First, logarithmic regret is not just what UCB achieves; it is the best any consistent algorithm can achieve. Second, the $1/\Delta_a$ dependence is unavoidable: hard instances are those where the optimal and suboptimal arms are close in mean reward, requiring many pulls to distinguish them statistically.
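To get a feel for the constant multiplying $\log T$, the sketch below evaluates $\sum_a \Delta_a / \text{KL}(\nu_a \| \nu^*)$ for Bernoulli arms. The helper names `bernoulli_kl` and `lai_robbins_constant` and the example means are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def lai_robbins_constant(means):
    """Sum over suboptimal arms of gap / KL(arm || best arm)."""
    best = max(means)
    return sum((best - m) / bernoulli_kl(m, best) for m in means if m < best)

means = [0.3, 0.5, 0.7, 0.9]  # hypothetical Bernoulli instance
c = lai_robbins_constant(means)
print(f"asymptotic regret lower bound is roughly {c:.2f} * log(T)")
```

Moving the suboptimal means closer to the best mean makes each KL term shrink faster than the corresponding gap, so the constant grows as the instance gets harder.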

Optimism in the face of uncertainty: UCB#

The principle of optimism in the face of uncertainty provides a principled response to the exploration-exploitation tradeoff:

When uncertain, act as if the best plausible outcome will occur.

An arm that has been pulled rarely has a wide confidence interval. An optimistic algorithm treats the upper end of that interval as the arm's value, which drives exploration of uncertain arms without requiring a manually tuned exploration parameter.

Deriving the UCB bonus from Hoeffding's inequality#

For rewards bounded in $[0,1]$ , Hoeffding's inequality gives:

P\!\left(\mu_a - \hat{\mu}_a \geq \epsilon\right) \leq \exp(-2 N_t(a) \epsilon^2)

We want a confidence bound that holds with high probability across all rounds $t$ and all arms. Setting the failure probability to $1/t^4$ and solving for $\epsilon$:

\exp(-2 N_t(a) \epsilon^2) = \frac{1}{t^4} \;\Rightarrow\; 2 N_t(a) \epsilon^2 = 4 \log t \;\Rightarrow\; \epsilon = \sqrt{\frac{2 \log t}{N_t(a)}}

This $\epsilon$ is the width of a one-sided confidence interval: with probability at least $1 - 1/t^4$, the true mean $\mu_a$ lies below $\hat{\mu}_a + \sqrt{2\log t / N_t(a)}$. By symmetry, the same bound holds for the lower tail. This is exactly the UCB bonus term. The formula is not a heuristic; it is the confidence width derived from Hoeffding's inequality with failure probability $1/t^4$. The exponent 4 is chosen so that a union bound over the possible pull counts still leaves a per-round failure probability of order $1/t^2$, which is summable over $t$.
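A quick numerical check makes the link between the bonus and the failure probability concrete. The sketch below plugs the bonus $\sqrt{2\log t / N_t(a)}$ back into the Hoeffding bound and prints the result next to $t^{-2}$ and $t^{-4}$; the function names and the sample values of $t$ and $N_t(a)$ are illustrative.

```python
import numpy as np

def ucb_bonus(t, n):
    """UCB1 bonus sqrt(2 log t / n) for an arm pulled n times by round t."""
    return np.sqrt(2.0 * np.log(t) / n)

def hoeffding_tail(n, eps):
    """One-sided Hoeffding bound exp(-2 n eps^2) for rewards in [0, 1]."""
    return np.exp(-2.0 * n * eps ** 2)

for t, n in [(10, 3), (100, 20), (1000, 50)]:
    tail = hoeffding_tail(n, ucb_bonus(t, n))
    print(f"t={t:5d} n={n:3d} bonus={ucb_bonus(t, n):.3f} "
          f"tail={tail:.3e} t^-2={float(t) ** -2:.3e} t^-4={float(t) ** -4:.3e}")
```

The tail value does not depend on the pull count: the $N_t(a)$ in the exponent cancels against the $N_t(a)$ inside the bonus, leaving $\exp(-4 \log t) = t^{-4}$.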

UCB1 algorithm#

a_t = \arg\max_a \left(\hat{\mu}_a + \sqrt{\frac{2\log t}{N_t(a)}}\right)

where $\hat{\mu}_a$ is the empirical mean of arm $a$ and $N_t(a)$ is its pull count prior to round $t$ . Arms that have not yet been pulled are assigned infinite UCB and are pulled first.

Interpretation: the first term exploits arms with high observed reward; the second term explores arms with high uncertainty. As $N_t(a)$ grows, the bonus shrinks, and the algorithm naturally transitions from exploration to exploitation. No tuning is required.

Regret bound#

UCB1 achieves:

\mathcal{R}(T) \leq \sum_{a:\Delta_a > 0} \frac{8 \log T}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right)\sum_a \Delta_a

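This matches the $\log T / \Delta_a$ rate of the Lai-Robbins lower bound up to constant factors, so UCB1 is order-optimal on every instance: no consistent algorithm can improve the dependence on $T$ or $\Delta_a$ by more than a constant. The leading constant 8 is not tight; KL-based indices such as KL-UCB close that gap for Bernoulli arms.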
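To see how the bound behaves on a concrete instance, the sketch below evaluates its right-hand side for a few horizons. The helper name `ucb1_regret_bound` and the example means are assumptions for illustration; the formula is the one stated above.

```python
import numpy as np

def ucb1_regret_bound(means, T):
    """Evaluate sum_a 8 log(T) / gap_a + (1 + pi^2 / 3) * sum_a gap_a."""
    means = np.asarray(means, dtype=float)
    gaps = means.max() - means
    suboptimal = gaps[gaps > 0]
    log_term = np.sum(8.0 * np.log(T) / suboptimal)
    constant_term = (1.0 + np.pi ** 2 / 3.0) * gaps.sum()
    return float(log_term + constant_term)

means = [0.3, 0.5, 0.7, 0.9]  # hypothetical instance
for T in (10**3, 10**4, 10**5):
    print(f"T={T:>6d}  bound={ucb1_regret_bound(means, T):8.1f}  "
          f"worst possible regret={T * (max(means) - min(means)):9.1f}")
```

Each tenfold increase in $T$ adds the same amount to the bound, which is the logarithmic growth the theorem promises. For short horizons the bound is loose compared with the trivial worst case.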

Implementation#

python · runs in browser

import numpy as np

class UCB1:
    """
    UCB1 algorithm for multi-armed bandits.

    Args:
        n_arms: Number of bandit arms
        alpha: Exploration parameter (default=2 for UCB1)

    Attributes:
        counts: Number of times each arm was pulled
        values: Estimated mean reward for each arm
    """

    def __init__(self, n_arms, alpha=2.0):
        self.n_arms = n_arms
        self.alpha = alpha
        self.counts = np.zeros(n_arms)  # N_t(a) - pull counts
        self.values = np.zeros(n_arms)   # \hat{\mu}_a - empirical means
        self.t = 0  # Total rounds

    def select_arm(self):
        """
        Select arm using the UCB1 index: argmax_a( mu_hat[a] + sqrt(alpha * log(t) / N_t(a)) )

        Returns: arm index to pull
        """
        # First, pull each arm once (ensure N_t(a) > 0)
        for arm in range(self.n_arms):
            if self.counts[arm] == 0:
                return arm

        # Compute UCB for each arm
        ucb_values = np.zeros(self.n_arms)
        for arm in range(self.n_arms):
            bonus = np.sqrt(self.alpha * np.log(self.t) / self.counts[arm])
            ucb_values[arm] = self.values[arm] + bonus

        return np.argmax(ucb_values)

    def update(self, arm, reward):
        """
        Update estimates after observing reward.

        Args:
            arm: The arm that was pulled
            reward: The observed reward
        """
        self.t += 1
        self.counts[arm] += 1

        # Incremental mean update: \hat{\mu}_a = \hat{\mu}_a + (r - \hat{\mu}_a) / N_t(a)
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n


# Example: 4-armed bandit with Bernoulli rewards
np.random.seed(42)
n_arms = 4
true_means = [0.3, 0.5, 0.7, 0.9]  # True arm means (unknown to algorithm)
n_rounds = 1000

# Run UCB1
ucb = UCB1(n_arms)
total_reward = 0
rewards = []

for t in range(n_rounds):
    arm = ucb.select_arm()
    reward = float(np.random.random() < true_means[arm])  # Bernoulli reward
    ucb.update(arm, reward)
    total_reward += reward
    rewards.append(total_reward)

print(f"UCB1 Total Reward: {total_reward:.0f}/{n_rounds}")
print(f"UCB1 Estimated Means: {ucb.values}")
print(f"True Means: {true_means}")

Key implementation details:

Initial exploration: Each arm is pulled once before UCB formula applies (handles division by zero)
Incremental updates: Uses values[arm] += (reward - values[arm]) / n for numerical stability
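Logarithmic regret: The bonus shrinks as $N_t(a)$ grows, so exploration of a well-sampled arm diminishes over time, while the slowly growing $\log t$ term ensures that neglected arms are eventually revisited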
No epsilon tuning: Unlike $\epsilon$ -greedy, UCB automatically balances exploration/exploitation

Comparison with Epsilon-Greedy#

python · runs in browser

class EpsilonGreedy:
    """Baseline: explore with probability epsilon, exploit otherwise."""

    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)  # Explore: random arm
        return np.argmax(self.values)  # Exploit: best known arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

# Comparison
for epsilon in [0.01, 0.1, 0.3]:
    eg = EpsilonGreedy(n_arms, epsilon)
    eg_reward = 0
    for _ in range(n_rounds):
        arm = eg.select_arm()
        reward = float(np.random.random() < true_means[arm])
        eg.update(arm, reward)
        eg_reward += reward
    print(f"ε={epsilon}: Total={eg_reward:.0f} (needs tuning!)")

# UCB1 automatically tunes exploration
print(f"UCB1: Total={total_reward:.0f} (no tuning needed)")

Why UCB beats ε-greedy:

ε-greedy uses fixed exploration rate (requires tuning)
UCB exploration bonus automatically decreases as confidence increases
UCB1 has a logarithmic regret guarantee; ε-greedy with a fixed ε has linear regret

Thompson Sampling#

Thompson Sampling approaches exploration from a Bayesian perspective. Rather than constructing deterministic confidence bounds, it maintains a posterior distribution over each arm's reward parameter and samples from it.

Algorithm#

Maintain a posterior $p(\theta_a \mid \text{history})$ over each arm's reward parameter
At round $t$ , sample $\tilde{\theta}_a \sim p(\theta_a \mid \text{history})$ for each arm
Select $a_t = \arg\max_a \tilde{\theta}_a$
Observe $r_t$ and update the posterior for arm $a_t$

The key insight: an arm with high posterior uncertainty has high variance in its samples, which means it will occasionally produce a very high sample and get selected — this is exploration. An arm whose posterior is tightly concentrated near a low mean will rarely produce a sample that beats the optimal arm — this is exploitation. Exploration is implicit in the posterior variance, not forced by a separate mechanism.

Beta-Bernoulli Thompson Sampling#

For binary rewards (click/no-click, thumbs-up/thumbs-down, success/failure), the Beta distribution is the conjugate prior for the Bernoulli likelihood. This makes the posterior update exact and closed-form.

Prior: $\theta_a \sim \text{Beta}(\alpha_a, \beta_a)$ , initialized as $\text{Beta}(1, 1)$ (uniform prior).

Why Beta is conjugate to Bernoulli: By Bayes' rule, given $n$ binary observations with $s$ successes and $f = n - s$ failures from arm $a$ :

p(\theta_a \mid \text{data}) \propto \theta_a^s (1-\theta_a)^f \cdot \theta_a^{\alpha_a - 1}(1-\theta_a)^{\beta_a-1} = \theta_a^{\alpha_a + s - 1}(1-\theta_a)^{\beta_a + f - 1}

This is $\text{Beta}(\alpha_a + s,\, \beta_a + f)$ — the posterior is Beta with updated counts. Each observation updates exactly one parameter: $\alpha_a \leftarrow \alpha_a + 1$ on success, $\beta_a \leftarrow \beta_a + 1$ on failure. The Beta distribution is not an arbitrary modeling choice — it is the exact Bayesian posterior for a Bernoulli arm under a Beta prior.

Action selection:

python · runs in browser

import numpy as np

def thompson_sampling_step(alphas, betas):
    """
    One step of Beta-Bernoulli Thompson Sampling.
    alphas, betas: arrays of shape (K,) with current posterior parameters.
    Returns selected arm index.
    """
    samples = np.random.beta(alphas, betas)
    return np.argmax(samples)

def update_posterior(alphas, betas, arm, reward):
    """
    Bayesian posterior update for Bernoulli arm.
    reward: 1 (success) or 0 (failure).
    """
    alphas[arm] += reward
    betas[arm] += (1 - reward)
    return alphas, betas

`np.random.beta(alphas, betas)` draws one sample per arm from its current Beta posterior with parameters $\alpha$ and $\beta$. This single line handles both exploitation (through the posterior mean) and exploration (through the posterior variance).
`np.argmax(samples)` acts greedily with respect to the sampled values. This is probability matching in action: each arm is chosen with the posterior probability that it is the best arm.
If the reward is 1 (success), `update_posterior` increments $\alpha$, shifting the posterior mean closer to 1.
If the reward is 0 (failure), it increments $\beta$, shifting the posterior mean closer to 0.
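Putting selection and update together gives a complete loop. The following is a minimal self-contained sketch, assuming Bernoulli arms and a $\text{Beta}(1, 1)$ prior; the function name `run_thompson`, the arm means, and the horizon are illustrative choices.

```python
import numpy as np

def run_thompson(true_means, n_rounds=2000, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns pull counts and posterior means."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alphas = np.ones(k)  # Beta(1, 1) prior: one pseudo-success per arm
    betas = np.ones(k)   # and one pseudo-failure per arm
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        samples = rng.beta(alphas, betas)   # one posterior draw per arm
        arm = int(np.argmax(samples))       # act greedily on the draws
        reward = float(rng.random() < true_means[arm])
        alphas[arm] += reward               # success increments alpha
        betas[arm] += 1.0 - reward          # failure increments beta
        pulls[arm] += 1
    return pulls, alphas / (alphas + betas)

pulls, post_means = run_thompson([0.3, 0.5, 0.7, 0.9])
print("pulls per arm:", pulls)
print("posterior means:", np.round(post_means, 3))
```

Posterior means for rarely pulled arms stay imprecise. That is expected: Thompson Sampling only needs enough evidence to rule an arm out, not an accurate estimate of every mean.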

The same conjugacy principle extends beyond Bernoulli rewards: Gaussian arms use a normal-inverse-gamma conjugate prior; Poisson arms use a gamma-Poisson conjugacy. Any conjugate prior-likelihood pair gives a Thompson Sampling algorithm with closed-form posterior updates.

Regret guarantees#

Thompson Sampling achieves $O(\sqrt{KT\log T})$ worst-case regret in the frequentist sense, and for Bernoulli rewards (more generally, one-dimensional exponential families) its instance-dependent regret matches the Lai-Robbins lower bound asymptotically. It often outperforms UCB1 empirically; a common explanation is that its randomized exploration tracks posterior uncertainty more closely than the conservative Hoeffding-based UCB bonus.

Comparison: $\epsilon$ -greedy vs UCB vs Thompson Sampling#

| Aspect | $\epsilon$-greedy | UCB1 | Thompson Sampling |
|---|---|---|---|
| Exploration mechanism | Fixed random rate | Deterministic confidence bound | Posterior sampling |
| Exploration rate | Manual tuning of $\epsilon$ | Automatic via $\sqrt{\log t / N_t(a)}$ | Automatic via posterior variance |
| Regret | Linear (fixed $\epsilon$) | $O(\log T / \Delta_{\min})$ (optimal) | $O(\sqrt{KT\log T})$ (near-optimal) |
| Implementation | Trivial | Simple counters and means | Requires prior and sampling |
| Empirical behavior | Often competitive with tuning | Strong, consistent | Often best in practice |
| Theoretical status | Not instance-optimal | Instance-optimal (matches Lai-Robbins) | Asymptotically optimal |

The progression $\epsilon$ -greedy $\to$ UCB $\to$ Thompson Sampling is a progression from heuristic exploration toward principled uncertainty quantification. All three implement the same underlying principle — uncertain arms should be explored more — with increasing statistical sophistication.

Contextual bandits#

In many real systems, the optimal action depends on context observed at each round.

Formulation#

In a contextual bandit:

At round $t$ , the agent observes context $x_t \in \mathcal{X}$
Selects action $a_t \in \{1, \ldots, K\}$
Receives reward $r_t \sim \nu(x_t, a_t)$

There are still no state transitions or delayed effects — each round is independent given the context. But the policy is now a mapping $\pi: \mathcal{X} \to \Delta(\mathcal{A})$ from contexts to action distributions, rather than a fixed action.

LinUCB: contextual bandits with linear reward models#

The canonical contextual bandit algorithm is LinUCB, which assumes:

\mathbb{E}[r_t \mid x_t, a_t] = x_t^\top \theta_a

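for arm-specific parameter vectors $\theta_a \in \mathbb{R}^d$. Under this model, ridge regression on the rounds where arm $a$ was pulled gives $\hat{\theta}_a = A_a^{-1} b_a$, with $A_a = I_d + \sum_{s: a_s = a} x_s x_s^\top$ and $b_a = \sum_{s: a_s = a} r_s x_s$. The matrix $A_a^{-1}$ plays the role of a covariance and defines a confidence ellipsoid around $\hat{\theta}_a$, so the UCB for arm $a$ in context $x_t$ is: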

\text{UCB}_t(a) = x_t^\top \hat{\theta}_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t}

where $\alpha$ controls the confidence width. Under the linear model, analyses of LinUCB-style algorithms give regret of order $d\sqrt{T}$ up to logarithmic factors. The leading dependence is on $d$ (the context dimension) rather than on the number of distinct contexts, because the linear structure lets the algorithm generalize across contexts.
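The index above translates into a few lines of linear algebra per arm. The following is a minimal sketch of the disjoint-model variant, assuming an identity ridge regularizer; the class name `LinUCBArm`, the helper `select_arm`, the value `alpha = 1.0`, and the simulated parameters are hypothetical choices for illustration, not a production implementation.

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge regression state for a disjoint linear model (sketch)."""

    def __init__(self, d, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(d)    # A_a = I + sum of x x^T over this arm's pulls
        self.b = np.zeros(d)  # b_a = sum of reward * x over this arm's pulls

    def ucb(self, x):
        theta_hat = np.linalg.solve(self.A, self.b)      # ridge estimate
        width = np.sqrt(x @ np.linalg.solve(self.A, x))  # sqrt(x^T A^-1 x)
        return float(x @ theta_hat + self.alpha * width)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def select_arm(arms, x):
    """Pick the arm with the largest upper confidence bound in context x."""
    return int(np.argmax([arm.ucb(x) for arm in arms]))

# Tiny simulation with hypothetical true parameters (unknown to the learner).
rng = np.random.default_rng(0)
d, thetas = 3, np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
arms = [LinUCBArm(d) for _ in thetas]
for _ in range(500):
    x = rng.normal(size=d)
    a = select_arm(arms, x)
    reward = x @ thetas[a] + 0.1 * rng.normal()
    arms[a].update(x, reward)
```

Solving the linear system on every call keeps the sketch short. An implementation that cares about speed would typically maintain $A_a^{-1}$ incrementally instead.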

LinUCB is deployed in production recommendation systems (the original paper describes its use at Yahoo for news article recommendation) and is the conceptual foundation for neural contextual bandit algorithms that replace the linear reward model with a neural network.

Partial feedback: the core difficulty#

In both standard and contextual bandits, the agent observes only the reward for the chosen action — not the rewards it would have received from other actions. This is bandit feedback or partial feedback, as opposed to full information feedback where all rewards are observed.

Partial feedback is what distinguishes bandit learning from supervised learning over the same data. In supervised learning, all labels are observed and a classifier can be trained by regression on the full dataset. In a contextual bandit, you only observe $r_t(a_t)$ , not $r_t(a)$ for $a \neq a_t$ . This means you cannot directly evaluate a policy that would have made different choices — you must estimate counterfactual performance.

This partial feedback structure is the root of the distributional shift problem in offline RL: a policy trained from logged data never observes rewards for actions the logging policy did not take, so its value estimates for those actions are ungrounded.

Policy evaluation with bandit feedback#

Suppose you have a logged dataset $\{(x_t, a_t, r_t)\}_{t=1}^T$ collected by some behavior policy $\pi_b$ and you want to evaluate a new target policy $\pi$ . Since $\pi$ may take different actions than $\pi_b$ , you cannot simply average the observed rewards.

Importance sampling estimator#

The inverse propensity scoring (IPS) estimator corrects for this mismatch:

\hat{V}(\pi) = \frac{1}{T} \sum_{t=1}^T \frac{\pi(a_t \mid x_t)}{\pi_b(a_t \mid x_t)} r_t

The ratio $\pi(a_t \mid x_t) / \pi_b(a_t \mid x_t)$ is the importance weight: it upweights rounds whose logged action is more likely under $\pi$ than under $\pi_b$ and downweights rounds whose logged action $\pi$ would rarely take. Under the assumption that $\pi_b(a \mid x) > 0$ wherever $\pi(a \mid x) > 0$ (coverage), the IPS estimator is unbiased: $\mathbb{E}[\hat{V}(\pi)] = V(\pi)$.

The limitation is high variance: when $\pi$ and $\pi_b$ differ substantially, the importance weights become large, inflating the estimator's variance. In extreme cases, a single round with a large weight can dominate the estimate entirely.
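The estimator and its variance problem can both be seen in a small simulation. The sketch below assumes a context-free setting with two actions, known logging propensities, and made-up mean rewards; the function name `ips_estimate` and all numeric values are illustrative.

```python
import numpy as np

def ips_estimate(rewards, target_probs, behavior_probs):
    """Inverse propensity scoring: mean of (pi / pi_b) * r over logged rounds."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    return float(np.mean(weights * rewards)), weights

rng = np.random.default_rng(0)
n = 50_000
mean_reward = np.array([0.2, 0.8])  # hypothetical Bernoulli means per action
pi_b = np.array([0.5, 0.5])         # logging (behavior) policy
pi = np.array([0.1, 0.9])           # target policy to evaluate

actions = rng.choice(2, size=n, p=pi_b)
rewards = (rng.random(n) < mean_reward[actions]).astype(float)

v_hat, weights = ips_estimate(rewards, pi[actions], pi_b[actions])
true_value = float(pi @ mean_reward)
print(f"IPS estimate: {v_hat:.3f}   true value: {true_value:.3f}")
print(f"largest importance weight: {weights.max():.1f}")
```

Shrinking the logging probability of the action that $\pi$ prefers makes the largest weight grow, and with it the spread of the estimate across repeated runs.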

The doubly robust (DR) estimator reduces variance by combining IPS with a learned reward model $\hat{r}(x, a)$ :

\hat{V}_{\text{DR}}(\pi) = \frac{1}{T} \sum_{t=1}^T \left[ \hat{r}(x_t, \pi) + \frac{\pi(a_t \mid x_t)}{\pi_b(a_t \mid x_t)}(r_t - \hat{r}(x_t, a_t)) \right]

Here $\hat{r}(x_t, \pi) = \sum_a \pi(a \mid x_t)\, \hat{r}(x_t, a)$ is the reward the model predicts for $\pi$ in context $x_t$. The DR estimator is unbiased if either the reward model or the importance weights are correct; it is robust to misspecification of one but not both. These estimators reappear in offline RL and in RLHF evaluation wherever counterfactual reasoning is required.
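The "either one may be wrong" property can be checked numerically. The sketch below assumes a context-free two-action problem with known propensities and deliberately feeds the estimator a poor reward model; the function name `dr_estimate`, the array layout, and all numbers are assumptions for illustration.

```python
import numpy as np

def dr_estimate(rewards, actions, target_probs_all, behavior_probs_logged, r_hat_all):
    """
    Doubly robust value estimate.
    rewards: (T,) observed rewards; actions: (T,) logged action indices
    target_probs_all: (T, K) pi(a | x_t); behavior_probs_logged: (T,) pi_b(a_t | x_t)
    r_hat_all: (T, K) reward model predictions for every action
    """
    idx = np.arange(len(rewards))
    model_term = np.sum(target_probs_all * r_hat_all, axis=1)   # r_hat(x_t, pi)
    weights = target_probs_all[idx, actions] / behavior_probs_logged
    correction = weights * (rewards - r_hat_all[idx, actions])  # IPS on the residual
    return float(np.mean(model_term + correction))

rng = np.random.default_rng(0)
n = 50_000
mean_reward = np.array([0.2, 0.8])  # hypothetical Bernoulli means per action
pi_b = np.array([0.5, 0.5])
pi = np.array([0.1, 0.9])
actions = rng.choice(2, size=n, p=pi_b)
rewards = (rng.random(n) < mean_reward[actions]).astype(float)

target_all = np.tile(pi, (n, 1))
bad_model = np.full((n, 2), 0.5)     # misspecified: predicts 0.5 everywhere
good_model = np.tile(mean_reward, (n, 1))

true_value = float(pi @ mean_reward)
dr_bad = dr_estimate(rewards, actions, target_all, pi_b[actions], bad_model)
dr_good = dr_estimate(rewards, actions, target_all, pi_b[actions], good_model)
print(f"true={true_value:.3f}  DR(bad model)={dr_bad:.3f}  DR(good model)={dr_good:.3f}")
```

With correct propensities, both runs land near the true value. The better reward model mainly reduces the variance of the correction term.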

GenAI context: bandits in RLHF#

Several components of RLHF reduce directly to bandit or contextual bandit problems.

The RLHF bandit formulation#

| MDP/Bandit component | RLHF interpretation |
|---|---|
| Context $x_t$ | Prompt |
| Action $a_t$ | Generated response (full completion) |
| Reward $r_t$ | Human preference score or reward model output |
| Behavior policy $\pi_b$ | Reference model (SFT checkpoint) |

The RLHF training loop is a contextual bandit where the action space is the set of all possible text completions — astronomically large and structured. This structure has two important consequences:

Why UCB and Thompson Sampling do not apply directly: The action space is discrete but combinatorially large, so maintaining per-action counts or posteriors is infeasible. Instead, RLHF learns a reward model $r_\phi(x, a)$ that generalizes across the action space, then uses it to score completions. The reward model plays the role of the value estimate $\hat{\mu}_a$: it scores actions by their estimated value, but it generalizes via function approximation rather than per-arm statistics. A standard reward model is a point estimate, so unlike a UCB index or a Thompson sample it carries no explicit uncertainty term.

The partial feedback problem in RLHF: Human preference data is inherently partial: for a given prompt, the human rates one or two completions, not all possible completions. The reward model must generalize from this bandit feedback to the full action space. When the reward model overfits to the distribution of rated completions, it produces unreliable scores for out-of-distribution responses — this is one mechanism behind reward hacking in RLHF and is a direct manifestation of the distributional shift problem identified in bandit policy evaluation.

Limitations of bandits#

Bandits deliberately ignore:

delayed consequences of actions,
state evolution over time,
long-horizon planning.

They are insufficient when actions influence future states and rewards. Their value lies in isolating the statistical core of learning from interaction — uncertainty quantification, exploration, and partial feedback — in the simplest setting where these issues arise.

The transition from bandits to full MDPs reintroduces state transitions. The arm value $\mu_a$ generalizes to the action-value function $Q^*(s, a)$ : the expected return of taking action $a$ in state $s$ and acting optimally thereafter. The UCB bonus for exploration generalizes to optimism-based exploration in MDPs. The importance sampling estimator for bandit policy evaluation generalizes to importance-weighted policy gradient estimators. Every bandit concept has an MDP analog.

Key takeaways#

The structure of this lecture mirrors the structure of a good algorithm design argument: identify the objective (regret), decompose it into controllable quantities (the suboptimality gap decomposition), establish what is theoretically achievable (Lai-Robbins lower bound), and then derive algorithms that meet the bound from first principles (UCB from Hoeffding, Thompson Sampling from Bayes' rule). This pattern — formalize, lower bound, match — recurs throughout the course.

Concretely: regret decomposes into expected pulls of suboptimal arms weighted by their gaps. $\epsilon$ -greedy achieves linear regret because it explores at a fixed rate regardless of uncertainty. UCB achieves logarithmic regret by deriving the exploration bonus from a statistical confidence bound that shrinks as uncertainty decreases. Thompson Sampling achieves the same asymptotically by sampling from the posterior, making exploration implicit in Bayesian uncertainty. Contextual bandits extend the framework to context-dependent rewards; LinUCB applies UCB to a linear reward model. Partial feedback requires importance sampling for counterfactual evaluation, introducing the distributional shift that reappears throughout offline RL and RLHF.

Conceptual questions#

Suppose you have $K = 3$ arms with true means $\mu_1 = 0.9$ , $\mu_2 = 0.85$ , $\mu_3 = 0.1$ . Write out the regret decomposition $\mathcal{R}(T) = \sum_a \Delta_a \cdot \mathbb{E}[N_T(a)]$ explicitly. Which arm dominates the regret, and why does the answer depend on both $\Delta_a$ and $\mathbb{E}[N_T(a)]$ ? Why is arm 2 harder to handle than arm 3?
Derive the UCB1 bonus term $\sqrt{2\log t / N_t(a)}$ from Hoeffding's inequality by setting the failure probability to $1/t^4$. Explain why a per-event failure probability of $1/t^4$ (rather than, say, $1/t^2$) matters for the union bound over pull counts and rounds.
The Beta-Bernoulli Thompson Sampling update is $\alpha \leftarrow \alpha + 1$ on success, $\beta \leftarrow \beta + 1$ on failure. Derive this rule from Bayes' theorem using the Beta prior and Bernoulli likelihood. What does the ratio $\alpha / (\alpha + \beta)$ represent, and how does the posterior variance $\alpha\beta / [(\alpha+\beta)^2(\alpha+\beta+1)]$ change as more data is collected?
An RLHF system collects human preference labels for 1000 prompt-response pairs, always showing humans responses sampled from the current SFT model. A reward model is trained on this data and used to score responses from a fine-tuned model that has drifted significantly from the SFT checkpoint. Explain why the reward model's scores are unreliable in this setting, in terms of partial feedback, distributional shift, and the coverage assumption required by the IPS estimator.
UCB1 achieves logarithmic regret and matches the Lai-Robbins lower bound up to constants. Does this mean UCB1 is the "best possible" bandit algorithm? Explain what "instance-optimal" means, what the lower bound actually says, and describe a setting where Thompson Sampling would empirically outperform UCB1 despite both being asymptotically optimal.
Extension: The KL-UCB algorithm replaces the Hoeffding-derived bonus with a tighter bound based on KL divergence, selecting $a_t = \arg\max_a \sup\!\left\{q \in [0,1] : N_t(a)\,\text{KL}(\hat{\mu}_a \| q) \leq \log t\right\}$. Explain why KL-UCB is tighter than UCB1 for Bernoulli arms. In what regime ($\Delta_a$ large vs. small) does the difference matter most? Would you expect KL-UCB's advantage over UCB1 to grow or shrink when $K$ is large and arms are mostly near-optimal?


Coding exercise#

Implement and compare UCB1 variants.

Starting from the UCB1 class in this lesson, implement a KLUCB class that selects arms using the KL-UCB index (binary search over $q$ to solve $N_t(a)\,\text{KL}(\hat{\mu}_a \| q) = \log t$ , where $\text{KL}(p \| q) = p\log(p/q) + (1-p)\log((1-p)/(1-q))$ for Bernoulli arms). Then run a simulation comparing UCB1 and KL-UCB on a 5-arm Bernoulli bandit with means [0.5, 0.55, 0.6, 0.65, 0.9] over $T = 5000$ rounds. Plot cumulative regret for both algorithms.

Things to observe:

Which algorithm accumulates more regret early vs. late?
On the hard arms (means 0.5–0.65), which algorithm explores more efficiently?
How does the gap between algorithms change as $\Delta_{\min}$ shrinks?

Looking ahead#

The next lecture reintroduces state transitions through dynamic programming for finite MDPs. We will see how the Bellman equations from Week 1 become computational algorithms — policy evaluation, policy iteration, and value iteration — and why exact solutions quickly become infeasible as the state space grows. The exploration-exploitation tradeoff identified in bandits reappears in the MDP setting, where it is compounded by the need to explore a state space rather than a fixed set of arms.

Purpose of this lecture#

Here is the core puzzle: Why is learning anything difficult if the problem is stateless?

The payoff of studying bandits is real. Despite their simplicity, they:

Isolate the statistical core of learning from interaction. What makes learning hard is not complex dynamics—it is uncertainty under partial feedback. A bandit distills this to its essence.
Provide theoretical foundations. Bandits have clean regret analysis, lower bounds, and instance-optimal algorithms. These tools generalize to MDPs and beyond.
Show up everywhere in practice. Recommendation systems, A/B testing, and RLHF all reduce to bandit or contextual bandit problems in key components.

The multi-armed bandit problem#

A multi-armed bandit problem consists of:

A fixed set of $K$ actions (arms) $a \in \{1, \dots, K\}$
Each arm $a$ has an unknown reward distribution with mean $\mu_a$
At each round $t$ , the agent selects an arm $a_t$ and observes a reward $r_t \sim \nu_{a_t}$

Regret#

Performance in bandit problems is measured by regret: the cumulative reward foregone relative to an oracle that always pulls the optimal arm.

Definition#

The suboptimality gap of arm $a$ is:

\Delta_a = \mu^* - \mu_a, \qquad \mu^* = \max_{a'} \mu_{a'}

The expected cumulative regret after $T$ rounds is:

\mathcal{R}(T) = T\mu^* - \mathbb{E}\!\left[\sum_{t=1}^T r_t\right]

The regret decomposition#

The most useful form of the regret is the decomposition by suboptimality gap:

\mathcal{R}(T) = \sum_{a=1}^K \Delta_a \cdot \mathbb{E}[N_T(a)]
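
The decomposition can be evaluated directly once the gaps and expected pull counts are known. The following is a minimal sketch; the function name `regret_from_counts`, the arm means, and the pull counts are illustrative assumptions, not results from the lecture.

```python
import numpy as np

def regret_from_counts(means, expected_pulls):
    """Regret decomposition: sum over arms of Delta_a * E[N_T(a)]."""
    means = np.asarray(means, dtype=float)
    expected_pulls = np.asarray(expected_pulls, dtype=float)
    gaps = means.max() - means            # Delta_a = mu* - mu_a
    return float(np.dot(gaps, expected_pulls))

# Hypothetical instance: three arms, T = 1000 pulls in total.
means = [0.8, 0.7, 0.3]
pulls = [900, 80, 20]                     # made-up counts for illustration
print(regret_from_counts(means, pulls))   # roughly 0.1 * 80 + 0.5 * 20
```

Pulls of the optimal arm contribute nothing, so only the time spent on suboptimal arms matters, weighted by how costly each one is.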

Sublinear regret#

Good bandit algorithms achieve sublinear regret: $\mathcal{R}(T) = o(T)$ .

The exploration–exploitation trade-off#

At every time step, the agent faces a fundamental choice:

Exploitation: select the arm that currently appears best (maximize immediate reward)
Exploration: select an uncertain arm to gather information (sacrifice immediate reward for future improvement)

Why pure exploitation fails: a concrete example#

Why pure exploration fails#

$\epsilon$ -greedy: the simplest resolution#

The simplest approach that forces exploration is $\epsilon$ -greedy: with probability $\epsilon$ , select a random arm; with probability $1 - \epsilon$ , select the empirically best arm.

a_t = \begin{cases} \text{uniform random arm} & \text{with probability } \epsilon \\ \arg\max_a \hat{\mu}_a(t) & \text{with probability } 1 - \epsilon \end{cases}

Stochastic bandits#

The Lai-Robbins lower bound#

Before studying algorithms, it is worth asking: how good can a bandit algorithm be?

Theorem (Lai and Robbins, 1985): For any consistent algorithm (one that achieves sublinear regret on every bandit instance), the expected number of pulls of any suboptimal arm $a$ satisfies:

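\liminf_{T \to \infty} \frac{\mathbb{E}[N_T(a)]}{\log T} \geq \frac{1}{\text{KL}(\nu_a \| \nu^*)}

Combining this with the regret decomposition, and using the fact that $\text{KL}(\nu_a \| \nu^*)$ scales as $\Delta_a^2$ for small gaps (for example, Gaussian arms with fixed variance, or Bernoulli arms with means bounded away from 0 and 1), gives the gap-dependent form:

\mathcal{R}(T) = \Omega\!\left(\sum_{a:\Delta_a > 0} \frac{\log T}{\Delta_a}\right)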
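
For Bernoulli arms the lower-bound constant can be computed in closed form. The sketch below assumes Bernoulli rewards with means strictly inside (0, 1); the helper names `bernoulli_kl` and `lai_robbins_constant` are hypothetical. It evaluates the coefficient $\sum_a \Delta_a / \text{KL}(\nu_a \| \nu^*)$ that multiplies $\log T$ in the asymptotic bound.

```python
import math

def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(means):
    """Coefficient c such that regret grows at least like c * log T asymptotically."""
    mu_star = max(means)
    return sum((mu_star - mu) / bernoulli_kl(mu, mu_star)
               for mu in means if mu < mu_star)

# The 5-arm instance used in the coding exercise.
means = [0.5, 0.55, 0.6, 0.65, 0.9]
print(lai_robbins_constant(means))
```

Arms whose distribution is close to the optimal one have small KL divergence and therefore require more pulls before they can be ruled out.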

Optimism in the face of uncertainty: UCB#

The principle of optimism in the face of uncertainty provides a principled response to the exploration-exploitation tradeoff:

When uncertain, act as if the best plausible outcome will occur.

Deriving the UCB bonus from Hoeffding's inequality#

For rewards bounded in $[0,1]$ , Hoeffding's inequality gives:

P\!\left(\hat{\mu}_a - \mu_a \geq \epsilon\right) \leq \exp(-2 N_t(a) \epsilon^2)

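We want a confidence bound that holds with high probability across all rounds $t$ and all arms. Setting the failure probability to $1/t^4$ and solving for $\epsilon$ :

\exp(-2 N_t(a) \epsilon^2) = \frac{1}{t^4} \;\Rightarrow\; 2 N_t(a) \epsilon^2 = 4 \log t \;\Rightarrow\; \epsilon = \sqrt{\frac{2 \log t}{N_t(a)}}

(The looser choice $1/t^2$ would give $\epsilon = \sqrt{\log t / N_t(a)}$ by the same algebra; the factor of 2 in the UCB1 bonus corresponds to a failure probability of $1/t^4$ .)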
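
The relation between the failure probability and the confidence radius can be checked numerically. This is a small sketch under the same bounded-reward assumption; `hoeffding_radius` is a hypothetical helper that inverts the Hoeffding tail $\exp(-2 n \epsilon^2) = \delta$ for a generic failure probability $\delta$.

```python
import math

def hoeffding_radius(n_pulls, delta):
    """Radius eps solving exp(-2 * n * eps^2) = delta, i.e. sqrt(log(1/delta) / (2n))."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n_pulls))

t, n = 100, 10
print(hoeffding_radius(n, t ** -2), math.sqrt(math.log(t) / n))      # delta = 1/t^2
print(hoeffding_radius(n, t ** -4), math.sqrt(2 * math.log(t) / n))  # delta = 1/t^4
```

Each printed pair agrees up to floating-point error: a failure probability of $1/t^4$ reproduces the bonus $\sqrt{2 \log t / N_t(a)}$, while $1/t^2$ yields the smaller radius $\sqrt{\log t / N_t(a)}$.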

UCB1 algorithm#

a_t = \arg\max_a \left(\hat{\mu}_a + \sqrt{\frac{2\log t}{N_t(a)}}\right)

where $\hat{\mu}_a$ is the empirical mean of arm $a$ and $N_t(a)$ is its pull count prior to round $t$ . Arms that have not yet been pulled are assigned infinite UCB and are pulled first.

Regret bound#

UCB1 achieves:

\mathcal{R}(T) \leq \sum_{a:\Delta_a > 0} \frac{8 \log T}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right)\sum_a \Delta_a

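This matches the $\log T / \Delta_a$ scaling of the Lai-Robbins lower bound up to constant factors, so UCB1 is order-optimal: no consistent algorithm can improve on its logarithmic growth rate. It is not exactly asymptotically optimal, however, because the constant $8/\Delta_a$ can be considerably larger than the Lai-Robbins constant $\Delta_a / \text{KL}(\nu_a \| \nu^*)$ .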
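
To get a feel for the size of this guarantee, the bound can be evaluated for a specific instance. The sketch below uses a hypothetical helper `ucb1_regret_bound` and the 4-arm means from the implementation example; it evaluates the upper bound only and does not measure regret.

```python
import math

def ucb1_regret_bound(means, horizon):
    """Evaluate sum_a 8 log T / Delta_a + (1 + pi^2 / 3) * sum_a Delta_a."""
    mu_star = max(means)
    gaps = [mu_star - mu for mu in means]
    log_term = sum(8.0 * math.log(horizon) / g for g in gaps if g > 0)
    const_term = (1.0 + math.pi ** 2 / 3.0) * sum(gaps)
    return log_term + const_term

means = [0.3, 0.5, 0.7, 0.9]
for horizon in (1_000, 10_000, 100_000):
    print(horizon, round(ucb1_regret_bound(means, horizon), 1))
```

The bound grows only logarithmically in the horizon, but as a worst-case guarantee it is usually loose compared with the regret observed in simulation.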

Implementation#

python · runs in browser

import numpy as np

class UCB1:
    """
    UCB1 algorithm for multi-armed bandits.

    Args:
        n_arms: Number of bandit arms
        alpha: Exploration parameter (default=2 for UCB1)

    Attributes:
        counts: Number of times each arm was pulled
        values: Estimated mean reward for each arm
    """

    def __init__(self, n_arms, alpha=2.0):
        self.n_arms = n_arms
        self.alpha = alpha
        self.counts = np.zeros(n_arms)  # N_t(a) - pull counts
        self.values = np.zeros(n_arms)   # \hat{\mu}_a - empirical means
        self.t = 0  # Total rounds

    def select_arm(self):
        """
        Select arm using the UCB1 index: argmax_a( mu_hat_a + sqrt(alpha * log(t) / N_t(a)) )

        Returns: arm index to pull
        """
        # First, pull each arm once (ensure N_t(a) > 0)
        for arm in range(self.n_arms):
            if self.counts[arm] == 0:
                return arm

        # Compute UCB for each arm
        ucb_values = np.zeros(self.n_arms)
        for arm in range(self.n_arms):
            bonus = np.sqrt(self.alpha * np.log(self.t) / self.counts[arm])
            ucb_values[arm] = self.values[arm] + bonus

        return np.argmax(ucb_values)

    def update(self, arm, reward):
        """
        Update estimates after observing reward.

        Args:
            arm: The arm that was pulled
            reward: The observed reward
        """
        self.t += 1
        self.counts[arm] += 1

        # Incremental mean update: \hat{\mu}_a = \hat{\mu}_a + (r - \hat{\mu}_a) / N_t(a)
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n


# Example: 4-armed bandit with Bernoulli rewards
np.random.seed(42)
n_arms = 4
true_means = [0.3, 0.5, 0.7, 0.9]  # True arm means (unknown to algorithm)
n_rounds = 1000

# Run UCB1
ucb = UCB1(n_arms)
total_reward = 0
rewards = []

for t in range(n_rounds):
    arm = ucb.select_arm()
    reward = float(np.random.random() < true_means[arm])  # Bernoulli reward
    ucb.update(arm, reward)
    total_reward += reward
    rewards.append(total_reward)

print(f"UCB1 Total Reward: {total_reward:.0f}/{n_rounds}")
print(f"UCB1 Estimated Means: {ucb.values}")
print(f"True Means: {true_means}")

Key implementation details:

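Initial exploration: Each arm is pulled once before the UCB formula applies, which avoids division by zero in the bonus.
Incremental updates: values[arm] += (reward - values[arm]) / n maintains the running mean in constant memory, without storing past rewards.
Logarithmic exploration: The $\log t$ numerator grows slowly, so every arm is revisited occasionally, while the $1/N_t(a)$ factor shrinks the bonus of arms that have already been sampled often.
No epsilon tuning: Unlike $\epsilon$ -greedy, UCB1 with the default alpha = 2 has no exploration rate to tune; the bonus balances exploration and exploitation automatically.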

Comparison with Epsilon-Greedy#

python · runs in browser

class EpsilonGreedy:
    """Baseline: explore with probability epsilon, exploit otherwise."""

    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)  # Explore: random arm
        return np.argmax(self.values)  # Exploit: best known arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

# Comparison
for epsilon in [0.01, 0.1, 0.3]:
    eg = EpsilonGreedy(n_arms, epsilon)
    eg_reward = 0
    for _ in range(n_rounds):
        arm = eg.select_arm()
        reward = float(np.random.random() < true_means[arm])
        eg.update(arm, reward)
        eg_reward += reward
    print(f"ε={epsilon}: Total={eg_reward:.0f} (needs tuning!)")

# UCB1 automatically tunes exploration
print(f"UCB1: Total={total_reward:.0f} (no tuning needed)")

Why UCB beats ε-greedy:

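ε-greedy uses a fixed exploration rate, which must be tuned per problem and keeps exploring at the same rate even after the best arm is clear.
UCB's exploration bonus for an arm shrinks automatically as that arm's pull count grows.
UCB1 has a logarithmic regret guarantee; ε-greedy with constant ε incurs regret that grows linearly in $T$ .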

Thompson Sampling#

Algorithm#

Maintain a posterior $p(\theta_a \mid \text{history})$ over each arm's reward parameter
At round $t$ , sample $\tilde{\theta}_a \sim p(\theta_a \mid \text{history})$ for each arm
Select $a_t = \arg\max_a \tilde{\theta}_a$
Observe $r_t$ and update the posterior for arm $a_t$

Beta-Bernoulli Thompson Sampling#

Prior: $\theta_a \sim \text{Beta}(\alpha_a, \beta_a)$ , initialized as $\text{Beta}(1, 1)$ (uniform prior).

Why Beta is conjugate to Bernoulli: By Bayes' rule, given $n$ binary observations with $s$ successes and $f = n - s$ failures from arm $a$ :

p(\theta_a \mid \text{data}) \propto \theta_a^s (1-\theta_a)^f \cdot \theta_a^{\alpha_a - 1}(1-\theta_a)^{\beta_a-1} = \theta_a^{\alpha_a + s - 1}(1-\theta_a)^{\beta_a + f - 1}

Action selection:

python · runs in browser

import numpy as np

def thompson_sampling_step(alphas, betas):
    """
    One step of Beta-Bernoulli Thompson Sampling.
    alphas, betas: arrays of shape (K,) with current posterior parameters.
    Returns selected arm index.
    """
    samples = np.random.beta(alphas, betas)
    return np.argmax(samples)

def update_posterior(alphas, betas, arm, reward):
    """
    Bayesian posterior update for Bernoulli arm.
    reward: 1 (success) or 0 (failure).
    """
    alphas[arm] += reward
    betas[arm] += (1 - reward)
    return alphas, betas

In thompson_sampling_step, np.random.beta draws one sample per arm from its current Beta posterior with parameters $\alpha$ and $\beta$ . This single line handles both exploitation (arms with a high posterior mean tend to produce high samples) and exploration (arms with a wide posterior occasionally produce high samples).
np.argmax then acts greedily with respect to the sampled values. This is probability matching: each arm is chosen with the posterior probability that it is the best arm.
In update_posterior, a reward of 1 (success) increments $\alpha$ , shifting the posterior mean closer to 1.
A reward of 0 (failure) increments $\beta$ , shifting the posterior mean closer to 0.
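
Putting the sampling and update steps together gives a complete interaction loop. The following is a minimal sketch, assuming Bernoulli arms and a Beta(1, 1) prior; the function `run_thompson`, the seed, and the arm means are illustrative choices rather than part of the lecture's code.

```python
import numpy as np

def run_thompson(true_means, n_rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling; returns pull counts and posterior means."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alphas = np.ones(k)                    # Beta(1, 1) prior for every arm
    betas = np.ones(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        samples = rng.beta(alphas, betas)  # one posterior sample per arm
        arm = int(np.argmax(samples))      # act greedily on the samples
        reward = float(rng.random() < true_means[arm])  # Bernoulli reward
        alphas[arm] += reward              # success count
        betas[arm] += 1.0 - reward         # failure count
        pulls[arm] += 1
    posterior_means = alphas / (alphas + betas)
    return pulls, posterior_means

pulls, posterior_means = run_thompson([0.3, 0.5, 0.7, 0.9], n_rounds=1000)
print("pulls per arm:", pulls)
print("posterior means:", np.round(posterior_means, 2))
```

Arms that are rarely pulled keep wide posteriors, so their estimated means can remain inaccurate; this is expected, since the algorithm stops spending samples on arms that are unlikely to be optimal.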

Regret guarantees#

Comparison: $\epsilon$ -greedy vs UCB vs Thompson Sampling#

Contextual bandits#

In many real systems, the optimal action depends on context observed at each round.

Formulation#

In a contextual bandit:

At round $t$ , the agent observes context $x_t \in \mathcal{X}$
Selects action $a_t \in \{1, \ldots, K\}$
Receives reward $r_t \sim \nu(x_t, a_t)$

LinUCB: contextual bandits with linear reward models#

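The canonical contextual bandit algorithm is LinUCB, which assumes that the expected reward of each arm is linear in the context:

\mathbb{E}[r_t \mid x_t, a_t] = x_t^\top \theta_a

It then selects the arm with the largest upper confidence bound:

\text{UCB}_t(a) = x_t^\top \hat{\theta}_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t}

where $\hat{\theta}_a$ is the regularized least-squares estimate of $\theta_a$ , $A_a$ is the regularized design matrix built from the contexts on which arm $a$ was pulled, and $\alpha$ scales the width of the confidence bonus.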
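
The index above can be sketched in a few lines. This version assumes one independent ridge-regression model per arm, with $A_a = \lambda I + \sum x x^\top$ and $b_a = \sum r x$ accumulated over the rounds in which arm $a$ was pulled, so that $\hat{\theta}_a = A_a^{-1} b_a$. The class name `LinUCB`, the `ridge` parameter, and the default `alpha=1.0` are assumptions made for illustration.

```python
import numpy as np

class LinUCB:
    """Minimal per-arm LinUCB sketch (one ridge regression per arm)."""

    def __init__(self, n_arms, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha
        # A_a = ridge * I + sum of x x^T ; b_a = sum of r * x
        self.A = np.stack([ridge * np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))

    def select_arm(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            theta_hat = np.linalg.solve(A_a, b_a)   # ridge estimate of theta_a
            A_inv_x = np.linalg.solve(A_a, x)       # A_a^{-1} x without explicit inverse
            scores.append(x @ theta_hat + self.alpha * np.sqrt(x @ A_inv_x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Usage: after one rewarded pull of arm 1, arm 1 has the higher index for the same context.
agent = LinUCB(n_arms=2, dim=2)
x = np.array([1.0, 0.0])
agent.update(arm=1, x=x, reward=1.0)
print(agent.select_arm(x))
```

Solving the linear systems with `np.linalg.solve` avoids forming $A_a^{-1}$ explicitly; for large dimensions an incrementally maintained inverse would be a common alternative.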

Partial feedback: the core difficulty#

Policy evaluation with bandit feedback#

Importance sampling estimator#

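The inverse propensity scoring (IPS) estimator corrects for the mismatch between the behavior policy $\pi_b$ that collected the data and the target policy $\pi$ being evaluated: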

\hat{V}(\pi) = \frac{1}{T} \sum_{t=1}^T \frac{\pi(a_t \mid x_t)}{\pi_b(a_t \mid x_t)} r_t

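The doubly robust (DR) estimator reduces variance by combining IPS with a learned reward model $\hat{r}(x, a)$ , where $\hat{r}(x_t, \pi) = \sum_a \pi(a \mid x_t)\,\hat{r}(x_t, a)$ denotes the model's predicted reward under $\pi$ :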

\hat{V}_{\text{DR}}(\pi) = \frac{1}{T} \sum_{t=1}^T \left[ \hat{r}(x_t, \pi) + \frac{\pi(a_t \mid x_t)}{\pi_b(a_t \mid x_t)}(r_t - \hat{r}(x_t, a_t)) \right]
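
Both estimators reduce to a few array operations on logged data. The sketch below assumes a finite action set, policies given as arrays of shape (T, K) holding action probabilities for each logged context, and a reward model given as an array of predictions for every action. The function names and the toy log are hypothetical.

```python
import numpy as np

def ips_estimate(pi_target, pi_behavior, actions, rewards):
    """IPS: average of importance-weighted logged rewards."""
    idx = np.arange(len(actions))
    weights = pi_target[idx, actions] / pi_behavior[idx, actions]
    return float(np.mean(weights * rewards))

def dr_estimate(pi_target, pi_behavior, actions, rewards, r_hat):
    """DR: model prediction under pi plus an importance-weighted residual correction."""
    idx = np.arange(len(actions))
    weights = pi_target[idx, actions] / pi_behavior[idx, actions]
    model_value = np.sum(pi_target * r_hat, axis=1)         # r_hat(x_t, pi)
    correction = weights * (rewards - r_hat[idx, actions])  # fixes model error on logged actions
    return float(np.mean(model_value + correction))

# Toy log: uniform behavior policy, target policy always plays action 1.
actions = np.array([0, 1, 1, 0])
rewards = np.array([0.0, 1.0, 1.0, 0.0])
pi_behavior = np.full((4, 2), 0.5)
pi_target = np.tile([0.0, 1.0], (4, 1))
r_hat = np.tile([0.0, 1.0], (4, 1))   # a reward model that happens to be exact here
print(ips_estimate(pi_target, pi_behavior, actions, rewards))
print(dr_estimate(pi_target, pi_behavior, actions, rewards, r_hat))
```

Both estimators divide by $\pi_b(a_t \mid x_t)$, so they require the behavior policy to assign nonzero probability to every action the target policy might take.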

GenAI context: bandits in RLHF#

Several components of RLHF reduce directly to bandit or contextual bandit problems.

The RLHF bandit formulation#

Limitations of bandits#

Bandits deliberately ignore:

delayed consequences of actions,
state evolution over time,
long-horizon planning.

Key takeaways#

Conceptual questions#

Suppose you have $K = 3$ arms with true means $\mu_1 = 0.9$ , $\mu_2 = 0.85$ , $\mu_3 = 0.1$ . Write out the regret decomposition $\mathcal{R}(T) = \sum_a \Delta_a \cdot \mathbb{E}[N_T(a)]$ explicitly. Which arm dominates the regret, and why does the answer depend on both $\Delta_a$ and $\mathbb{E}[N_T(a)]$ ? Why is arm 2 harder to handle than arm 3?
Derive the UCB1 bonus term $\sqrt{2\log t / N_t(a)}$ from Hoeffding's inequality by setting the failure probability to $1/t^4$ . Explain why a rapidly decaying failure probability such as $1/t^4$ (rather than, say, $1/t$ ) matters for the union bound over all arms and all rounds.
The Beta-Bernoulli Thompson Sampling update is $\alpha \leftarrow \alpha + 1$ on success, $\beta \leftarrow \beta + 1$ on failure. Derive this rule from Bayes' theorem using the Beta prior and Bernoulli likelihood. What does the ratio $\alpha / (\alpha + \beta)$ represent, and how does the posterior variance $\alpha\beta / [(\alpha+\beta)^2(\alpha+\beta+1)]$ change as more data is collected?
An RLHF system collects human preference labels for 1000 prompt-response pairs, always showing humans responses sampled from the current SFT model. A reward model is trained on this data and used to score responses from a fine-tuned model that has drifted significantly from the SFT checkpoint. Explain this failure in terms of partial feedback, distributional shift, and the coverage assumption required by the IPS estimator.
UCB1 achieves logarithmic regret and matches the Lai-Robbins lower bound up to constants. Does this mean UCB1 is the "best possible" bandit algorithm? Explain what "instance-optimal" means, what the lower bound actually says, and describe a setting where Thompson Sampling would empirically outperform UCB1 despite both achieving logarithmic regret.
Extension: The KL-UCB algorithm replaces the Hoeffding-derived bonus with a tighter bound based on KL divergence, selecting $a_t = \arg\max_a \sup\!\left\{q \in [0,1] : N_t(a)\,\text{KL}(\hat{\mu}_a \| q) \leq \log t\right\}$ . Explain why KL-UCB is tighter than UCB1 for Bernoulli arms. In what regime ( $\Delta_a$ large vs. small) does the difference matter most? Would you expect KL-UCB or UCB1 to have a larger advantage when $K$ is large and arms are mostly near-optimal?


Coding exercise#

Implement and compare UCB1 variants.

Things to observe:

Which algorithm accumulates more regret early vs. late?
On the hard arms (means 0.5–0.65), which algorithm explores more efficiently?
How does the gap between algorithms change as $\Delta_{\min}$ shrinks?
