Online Learning & Bandits

Chapter 17: Learning Theory & Advanced ML Section 2

I avoided online learning for an embarrassingly long time. Every time it came up in a paper or a system design discussion, I'd nod along, quietly thinking "that's for the theory crowd." I understood batch ML perfectly well — collect data, train, evaluate, deploy. That loop made sense. But the idea of learning from a stream, one example at a time, with no do-overs? That felt like trying to build a house while someone keeps handing you bricks and you can't go back to fix the foundation. Finally the discomfort of not understanding what powers every ad system, every recommendation engine, every adaptive A/B test grew too loud. Here is that dive.

Online learning is a framework where a model receives examples one at a time and updates after each one. It was formalized in the 1950s with the Perceptron and matured through the work of Littlestone, Vovk, and others in the 1990s. Multi-armed bandits, a special case where you only see the outcome of the action you chose, date back to a 1933 paper by Thompson and were given their modern theoretical foundation by Robbins in 1952. Today, these ideas are the beating heart of production systems at Google, Netflix, Spotify, and every major ad platform.

Before we start, a heads-up. We're going to be building up from basic prediction loops to Bayesian posterior sampling, touching on confidence bounds, importance weighting, and linear algebra along the way. You don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

The Prediction Game
  Regret: Keeping Score Against Hindsight
  The Perceptron: Where Online Learning Began
  Online Gradient Descent
  Prediction with Expert Advice
Rest Stop
The Bandit Problem: When You Can't See the Road Not Taken
  Epsilon-Greedy: The Coin-Flip Approach
  UCB: Optimism as a Strategy
  Thompson Sampling: Let Uncertainty Do the Thinking
  Adversarial Bandits and EXP3
Contextual Bandits: Because People Are Different
  LinUCB and the Yahoo Story
Online Convex Optimization: The Unifying Framework
Where All of This Lives in the Real World
Wrap-Up
Resources

The Prediction Game

Let's start with a scenario we'll carry through this entire journey. Imagine you run a small coffee shop. Every morning, you have to decide which single daily special to feature on the chalkboard before any customers walk in. You pick one, the day plays out, and at the end you see how many people ordered it. Tomorrow, you pick again.

You have five candidates: a lavender latte, a cold brew float, a matcha espresso, a caramel cortado, and a chai affogato. You have no historical data. You have no focus groups. You have a chalkboard and one shot per day.

This is the essence of online learning. Not "collect data, then train." Instead: predict, observe, update, repeat. One round at a time. No second passes. No reshuffling. The world hands you a situation, you make a call, you see what happens, you adjust.

In a more formal version, the game works like this. Each morning (round t), you see some information about the day — maybe the weather forecast, the day of the week. You make a prediction or a choice. The world reveals the outcome. You update whatever strategy you've been using. Then round t + 1 begins.

The striking thing about this setup is that there's no assumption about where the data comes from. It doesn't need to be "independent and identically distributed." The weather could be adversarial. Customer preferences could shift week to week. The framework doesn't care. It works anyway. That's what makes it powerful, and that's what makes it different from the batch ML you're used to.

Regret: Keeping Score Against Hindsight

In batch ML, we measure how good a model is with accuracy, or loss on a test set. In online learning, the question is more interesting: how much worse did I do, over all T rounds, compared to the best fixed strategy I could have chosen if I'd had a crystal ball?

This quantity has a name. It's called regret.

Back to our coffee shop. Suppose after 100 days, your total daily-special sales add up to 620 cups. But if you'd featured the cold brew float every single day — which turns out to have been the best choice — you would have sold 780 cups. Your regret is 780 − 620 = 160 cups. That's the price you paid for not knowing in advance which special was best, and for the days you spent experimenting.

More precisely, regret is your cumulative loss minus the cumulative loss of the single best fixed predictor from your hypothesis class, computed with the luxury of seeing all T rounds laid out. You're not competing against some magical oracle that changes its strategy each round. You're competing against the single best unchanging strategy — the one that, in hindsight, would have been the best overall.

A good online algorithm achieves sublinear regret — the regret grows slower than the number of rounds. If regret is O(√T), that means after 10,000 rounds, your average per-round regret is proportional to 1/√10000 = 1/100. It's shrinking. You're converging to the performance of the best fixed strategy, even though you never had all the data at once.

I'll be honest — when I first encountered regret as a metric, it felt weirdly defeatist. You're measuring how much you lost? But it's actually more nuanced than accuracy. Accuracy tells you how often you were right. Regret tells you how quickly you figured out how to be right. In a streaming world, the speed of learning is what matters.

The Perceptron: Where Online Learning Began

The Perceptron algorithm, published by Rosenblatt in 1958, is the oldest online learning algorithm most people have heard of. It makes predictions one data point at a time, and it only updates when it makes a mistake.

Here's the setup. Suppose each morning, based on the weather forecast (represented as a vector of numbers — temperature, humidity, chance of rain), you're trying to predict whether it'll be a "hot drinks" day or a "cold drinks" day. You maintain a weight vector w that encodes your current belief about which weather features matter.

When a new forecast x arrives, you compute w · x (the dot product). If it's positive, you predict "hot drinks." If it's negative, "cold drinks." If you're right, you do nothing. If you're wrong, you nudge the weights: add x if you should have predicted positive, subtract it if you should have predicted negative.

import numpy as np

def perceptron_online(stream):
    w = None
    mistakes = 0

    for x, y in stream:  # y is +1 or -1
        if w is None:
            w = np.zeros(len(x))

        prediction = 1 if w @ x >= 0 else -1

        if prediction != y:
            w += y * x    # nudge toward correct side
            mistakes += 1

    return w, mistakes

That's the entire algorithm. No learning rate to tune. No loss function to differentiate. Just: when you're wrong, adjust.

Now, the remarkable result. If the data is linearly separable — meaning some perfect weight vector w* exists that classifies everything correctly with a margin γ (every point is at least γ away from the decision boundary) — then the Perceptron will make at most (R/γ)² mistakes, where R is the maximum length of any data point.

This is the Perceptron Convergence Theorem, and it's worth sitting with for a moment. The mistake bound is (R/γ)². Think of it like stepping toward a finish line. Each mistake pushes the weights closer to the correct solution. If the margin γ is large — meaning the classes are well-separated — each mistake makes a big stride forward. If the data points are small (R is small), you can't overshoot. Big strides, short track, fewer steps needed.

The reason this matters for our story is that the Perceptron was the first algorithm to come with a guarantee about online performance. It doesn't matter what order the data arrives in. It doesn't matter if someone is adversarially choosing the hardest example to show you next. The bound holds regardless. That idea — guarantees without distributional assumptions — is the philosophical core of online learning.

But the Perceptron only works for linearly separable data. What about when no perfect separator exists? What about when we're doing regression, not classification? We need something more general.

Online Gradient Descent

The workhorse of online learning is Online Gradient Descent (OGD). It looks a lot like the stochastic gradient descent you already know, but the framing is different. In SGD, you're trying to minimize a fixed loss function over a fixed dataset, and you use random samples to approximate the gradient. In OGD, the loss functions can change every round. There's no fixed dataset. You're playing the prediction game, and each round brings a new loss.

Let's apply this to our coffee shop. Suppose instead of picking a single special, you're setting prices for all five drinks. Each morning you choose a price vector w. Customers arrive, and at the end of the day you see how much revenue you lost compared to the optimal pricing. That loss is f_t(w), and it can be different every day because customer behavior shifts.

OGD says: after each day, take one gradient step against today's loss.

import numpy as np

def online_gradient_descent(stream, d, lr_fn=None):
    """OGD for online convex optimization.
    stream yields (loss_fn, gradient_fn) pairs each round."""
    w = np.zeros(d)
    total_loss = 0.0

    for t, (loss_fn, grad_fn) in enumerate(stream, 1):
        # suffer today's loss
        total_loss += loss_fn(w)

        # step size shrinks over time
        lr = lr_fn(t) if lr_fn else 1.0 / np.sqrt(t)

        # update using today's gradient
        w = w - lr * grad_fn(w)

    return w, total_loss

With a learning rate of η_t = 1/√t, OGD achieves O(√T) regret for convex losses. That's the best you can do without stronger assumptions. If the losses happen to be strongly convex — they curve sharply, like a bowl rather than a flat plate — you can get O(log T) regret, which is dramatically better. After 10,000 rounds, √T = 100 while log T ≈ 9. That's the difference between "pretty close to optimal" and "nearly indistinguishable from optimal."

Here's a connection that surprised me when I first learned it: the Perceptron is OGD applied to the hinge loss. The "only update on mistakes" behavior? That's what happens when the gradient of the hinge loss is zero for correctly classified points. The Perceptron isn't some separate invention. It's a special case of the same framework. We've been doing online learning since the 1950s and didn't always know it.

Prediction with Expert Advice

There's another flavor of online learning that's worth understanding before we move on. Instead of maintaining a single model, imagine you have access to N "experts" — they could be different models, different heuristics, different weather forecasters — and each round, every expert gives you their prediction. You have to combine those predictions into one, then see the truth.

Back to the coffee shop. Suppose you have three advisors: your barista (who thinks cold drinks always win), a regular customer (who insists seasonal specials are the way to go), and an algorithm you found on the internet. Each morning, they all tell you which special to feature. You want to learn which advisor to trust, and you want to learn fast.

The Weighted Majority Algorithm (Littlestone and Warmuth, 1994) does exactly this. Start by giving every expert equal weight. Each round, predict based on a weighted vote. After seeing the outcome, multiply the weight of every expert who was wrong by (1 − η), where η is a small learning rate. Correct experts keep their weight. Over time, bad experts fade away.

The guarantee: if the best expert makes M mistakes, the Weighted Majority algorithm makes at most roughly 2(1 + η)M + (2 ln N)/η mistakes. With the right choice of η, this becomes about M + O(√(M ln N)). You're not much worse than the best expert, and you didn't need to know in advance which one that was.

The randomized version of this idea — the Hedge algorithm (also called Exponential Weights) — is one of the most beautiful results in all of learning theory. Instead of taking a hard vote, you sample an expert proportionally to their weight. The regret bound sharpens to O(√(T ln N)). That logarithmic dependence on N means you can have a million experts and the cost of not knowing which one is best grows only as ln(1,000,000) ≈ 14. Fourteen rounds of extra regret, no matter how many experts you're juggling.

I'm still developing my intuition for why the exponential weighting works so well. The rough idea is that multiplicative updates amplify small differences between experts faster than additive updates would. An expert who's slightly worse gets its weight halved exponentially fast, while one who's slightly better keeps accumulating influence. The exponential function is doing the heavy lifting — it's the same mathematical trick that shows up later in EXP3 for adversarial bandits.

But there's a limitation to all of this. In the prediction-with-expert-advice framework, and in OGD, you see the full outcome after each round. You find out what would have happened for every possible action, not just the one you took. That's called full information feedback. In many real problems, you don't get that luxury.

Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a working mental model of online learning: predictions arrive one at a time, you're scored by regret against the best fixed strategy in hindsight, and algorithms like OGD and Exponential Weights can achieve sublinear regret — meaning you converge to optimal even without seeing all the data up front. You know the Perceptron convergence theorem, and you know that the expert-advice framework lets you track the best among many strategies with only logarithmic overhead.

That's a genuine, useful understanding. It's how spam filters adapt, how real-time bidding systems adjust prices, and how streaming prediction systems work.

But here's what we haven't confronted: what happens when you can't see the full outcome? When you pick the lavender latte as today's special but never find out how the cold brew float would have done? That's the bandit problem, and it's where the exploration-exploitation tradeoff — the most fundamental tension in decision-making under uncertainty — lives.

If the discomfort of not knowing what's underneath is nagging at you, read on.

The Bandit Problem: When You Can't See the Road Not Taken

Here's the shift. In online learning, you pick an action and then the environment reveals the loss for all actions — you see the full picture, even for choices you didn't make. In the multi-armed bandit problem, you only see the reward for the action you actually took. Everything else stays hidden.

The name comes from a gambler standing in front of a row of slot machines — each with one arm, hence "one-armed bandits." Each machine has a different (unknown) probability of paying out. You have a finite number of coins. Which machines do you play?

Our coffee shop version: you pick one daily special each morning. At the end of the day, you know how many cups sold. But you have no idea what would have happened if you'd chosen a different special. The lavender latte sold 12 cups today — great? Terrible? You can't know, because you didn't try the alternatives.

This is partial information feedback, or bandit feedback, and it makes the problem fundamentally harder. In full-information online learning, you can compute gradients. You can evaluate every strategy after each round. With bandit feedback, you're flying partly blind.

The central tension is immediate: should you keep featuring the special that's been selling well (exploit), or should you try a different one to gather information that might lead to even better sales (explore)? Exploit too much and you might miss the real winner. Explore too much and you waste days on specials you already suspect are mediocre.

This is the exploration-exploitation tradeoff. I'd argue it's the single most important idea in decision theory. It shows up everywhere — in how you decide which restaurants to try, how venture capitalists allocate funds, how drug companies design clinical trials, and how every major tech company chooses which ads to show you.

Formally, you have K arms. At each round t, you pull arm a_t and receive a random reward from that arm's distribution. Bandit regret compares your total reward against what you'd have earned pulling the best arm every round: Regret(T) = T · μ* − Σ μ_{a_t}, where μ* is the mean of the best arm. The Lai-Robbins lower bound (1985) proves that even the best possible algorithm must incur at least Ω(log T) regret — exploration is unavoidable, and no cleverness can eliminate it entirely.

The question isn't whether to explore. It's how.

Epsilon-Greedy: The Coin-Flip Approach

The most obvious strategy. Each round, flip a (biased) coin. With probability (1 − ε), pick whichever special has the highest average sales so far. With probability ε, pick one at random.

Let's watch it play out at our coffee shop. Say ε = 0.1. Most days — 90% of them — you feature whatever's been selling best. But one day in ten, you randomly try something else. Maybe that random day reveals the chai affogato is a sleeper hit. Maybe it confirms the matcha espresso is still mediocre. Either way, you're gathering information.

import numpy as np

class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.counts))
        return np.argmax(self.values)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

That incremental mean update is worth pausing on. Instead of storing all past rewards and recomputing the average, we fold each new observation into the running average with one line. The algebra works out because (reward − old_mean) / n is the exact correction needed. It's the same trick used in Welford's algorithm for online variance computation.

Epsilon-greedy works. It works surprisingly well for how unsophisticated it is. But there's an obvious problem: ε is fixed. After 5,000 days — years of running your coffee shop — when you're extremely confident the cold brew float is the winner, you're still wasting 10% of your days on random experiments. That's 500 days of lost sales, not because you needed the information, but because the algorithm doesn't know when to stop exploring.

You can decay ε over time — ε_t = 1/t, or ε₀/(1 + t/τ) for some constant τ. With the right decay schedule, regret improves to O(log T). The catch: "the right schedule" depends on the gaps between arms, which you don't know. You're tuning a hyperparameter in a setting where the whole point was to avoid needing information you don't have. That circularity motivated people to look for something better.

UCB: Optimism as a Strategy

Instead of exploring randomly, what if you explored strategically? What if you always picked the option that could be the best, given what you know and what you don't?

That's the idea behind Upper Confidence Bound (UCB). For each arm, compute two things: your best estimate of its average reward, and a bonus that represents your uncertainty. Pick the arm where the sum is highest.

Back to the coffee shop. The cold brew float has an average of 15 cups/day over 200 days — you're quite confident in that number. The chai affogato averaged 14 cups/day, but you've only tried it 5 times. That average could be way off. UCB says: give the chai affogato a big uncertainty bonus because you barely know anything about it. Maybe it's actually a 20-cup-per-day blockbuster hiding behind a noisy 5-day sample.

The UCB1 formula makes this precise:

Pick arm a that maximizes: μ̂_a + √(2 ln t / N_a)

The first term, μ̂_a, is the empirical mean — exploitation. The second term, √(2 ln t / N_a), is the confidence bonus — exploration. When N_a is small (you've barely tried this arm), the bonus is large. When N_a is large, the bonus shrinks toward zero. The ln t in the numerator ensures that even well-explored arms eventually get revisited, but at a decreasing rate.

class UCB1:
    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)

    def select_arm(self):
        n_arms = len(self.counts)
        for a in range(n_arms):
            if self.counts[a] == 0:
                return a  # try each arm at least once

        t = np.sum(self.counts)
        bonuses = np.sqrt(2 * np.log(t) / self.counts)
        return np.argmax(self.values + bonuses)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

UCB1 achieves O(log T) regret, matching the Lai-Robbins lower bound up to constants. No hyperparameters. No random coin flips. The algorithm figures out on its own how much to explore and when to stop. That phrase — "optimism in the face of uncertainty" — is one of the most powerful ideas in decision-making. When you don't know something, assume it might be wonderful, and then go find out. If it turns out to be wonderful, great. If it turns out to be mediocre, the uncertainty bonus shrinks and you stop wasting time on it.

This same principle appears all over ML. The UCT algorithm used in AlphaGo's Monte Carlo Tree Search is UCB applied to game trees. Bayesian optimization uses a similar acquisition function (the Upper Confidence Bound variant). Optimistic initialization in reinforcement learning is the same philosophy. The idea is portable.

But UCB has a quirk. It explores all uncertain arms with roughly equal enthusiasm. If you have 1,000 arms and 990 of them are clearly terrible after a few pulls, UCB still gives each one an uncertainty bonus that forces occasional revisits. It's principled, but not always efficient. That brings us to the algorithm that often beats it in practice.

Thompson Sampling: Let Uncertainty Do the Thinking

Thompson Sampling was proposed in 1933. Yes, 1933 — by William R. Thompson, a statistician working on clinical trials. It was then mostly forgotten for about 80 years, rediscovered by the tech industry in the 2010s, and is now the default bandit algorithm at most major tech companies. The gap between invention and adoption is one of the wilder stories in applied statistics.

The idea is Bayesian. For each arm, maintain a probability distribution — a posterior — over what you believe the arm's true reward rate might be. Each round, sample a value from each arm's posterior. Pull the arm whose sample is highest.

Let's make this concrete with our coffee shop. Suppose the reward for each special is binary: a customer either orders it (1) or doesn't (0). After observing successes and failures for each special, you maintain a Beta distribution for each one. The Beta distribution is the natural posterior for binary data — it's parameterized by α (number of successes + 1) and β (number of failures + 1).

On day one, every special has Beta(1, 1) — a completely flat distribution. You have no idea what any of them will do. So when you sample from each posterior, the samples are wild and spread out. Any special could win. That's pure exploration.

After 50 days, suppose the cold brew float has had 40 orders and 10 non-orders: Beta(41, 11). That posterior is tight and centered around 0.78. The chai affogato has had 3 orders and 2 non-orders: Beta(4, 3). That posterior is wide, centered around 0.57, but it could plausibly be as high as 0.85 or as low as 0.25. On any given sampling round, the chai might draw 0.82 from its wide posterior and beat the cold brew's sampled 0.76. That triggers exploration — you try the chai again. But most of the time, the cold brew's tight, high posterior wins the sampling, and you exploit.

class ThompsonSampling:
    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # successes + 1
        self.beta = np.ones(n_arms)   # failures + 1

    def select_arm(self):
        samples = np.random.beta(self.alpha, self.beta)
        return np.argmax(samples)

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

That's the entire algorithm. Five lines of logic. The posterior does all the work — it naturally balances exploration and exploitation. An arm with few observations has a wide posterior, so its samples sometimes swing high, triggering exploration. An arm with many observations and low reward has a tight posterior near zero — it almost never wins the sampling contest. No confidence bounds. No epsilon. No tuning.

Thompson Sampling achieves O(log T) regret, matching the Lai-Robbins lower bound. But the reason practitioners love it goes beyond the theory. UCB explores all uncertain arms with similar intensity. Thompson concentrates exploration on arms that are plausibly near-optimal. If two arms have similar estimated rewards, Thompson will naturally sample between them. If one arm is clearly inferior, its posterior quickly becomes tight and low, and Thompson stops wasting pulls on it. When you have thousands of arms — common in ad serving — this efficiency difference is enormous.

My favorite thing about Thompson Sampling is that, aside from high-level explanations like the one I gave, no one is completely certain why it works so well empirically. The theoretical analysis showing it matches the Lai-Robbins bound only came in 2012 (Agrawal and Goyal), nearly 80 years after the algorithm was proposed. For most of that time, people who used it were running on faith and experimental results. The theory finally caught up, but practitioners had already figured out it was the right tool.

For continuous rewards (say, average revenue per customer rather than just click/no-click), you swap the Beta posterior for a Gaussian. Track the sample mean and variance for each arm, and sample from a Normal distribution with the appropriate posterior uncertainty:

class ThompsonGaussian:
    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms)
        self.means = np.zeros(n_arms)
        self.sum_sq = np.zeros(n_arms)  # Welford's running variance

    def select_arm(self):
        samples = np.zeros(len(self.counts))
        for a in range(len(self.counts)):
            if self.counts[a] < 2:
                return a  # need at least 2 observations for variance
            var = self.sum_sq[a] / (self.counts[a] - 1)
            # posterior uncertainty shrinks as 1/sqrt(n)
            samples[a] = np.random.normal(
                self.means[a], np.sqrt(var / self.counts[a]))
        return np.argmax(samples)

    def update(self, arm, reward):
        self.counts[arm] += 1
        old = self.means[arm]
        self.means[arm] += (reward - old) / self.counts[arm]
        self.sum_sq[arm] += (reward - old) * (reward - self.means[arm])

Adversarial Bandits and EXP3

Everything so far assumed the rewards are stochastic — each arm has a fixed (unknown) distribution, and the rewards are drawn from it independently. But what if the environment is adversarial? What if someone is actively trying to make your algorithm look bad — changing the rewards each round to maximize your regret?

In our coffee shop analogy, this is like a competitor across the street who watches your chalkboard each morning and then runs a promotion on whatever special you chose, stealing your customers. Or, less dramatically, it captures settings where rewards are non-stationary in unpredictable ways — seasonal shifts, viral trends, supply chain disruptions.

For adversarial bandits, we need EXP3 — Exponential-weight algorithm for Exploration and Exploitation (Auer et al., 2002). The key challenge: with bandit feedback, you only see the reward for the arm you pulled. You need to somehow estimate rewards for arms you didn't pull, without actually observing them. EXP3 solves this with a trick called importance-weighted estimation.

The idea works like this. Maintain a weight for each arm. Convert weights to probabilities (with some mixing for guaranteed exploration). Sample an arm according to those probabilities. Observe the reward. Now here's the trick: construct an estimated reward for the chosen arm by dividing the observed reward by the probability of having chosen it. If you chose arm a with probability p_a and saw reward r, the estimate is r / p_a. Why? Because you only see this arm with probability p_a, so the estimate needs to be inflated to compensate. In expectation, this estimate is unbiased — it equals the true reward on average.

class EXP3:
    def __init__(self, n_arms, gamma=0.1):
        self.n_arms = n_arms
        self.gamma = gamma
        self.weights = np.ones(n_arms)

    def _probabilities(self):
        total = np.sum(self.weights)
        # mix exponential weights with uniform exploration
        return (1 - self.gamma) * (self.weights / total) \
               + self.gamma / self.n_arms

    def select_arm(self):
        probs = self._probabilities()
        return np.random.choice(self.n_arms, p=probs)

    def update(self, arm, reward):
        probs = self._probabilities()
        # importance-weighted reward estimate
        estimated_reward = reward / probs[arm]
        # exponential weight update
        self.weights[arm] *= np.exp(
            self.gamma * estimated_reward / self.n_arms)

EXP3 achieves O(√(TK ln K)) regret in the adversarial setting, where K is the number of arms. That's worse than the O(log T) of UCB or Thompson in the stochastic setting — but the stochastic algorithms can fail catastrophically when rewards are adversarial. EXP3 gives you a guarantee that holds no matter what the environment does.

The importance weighting is the part that took me the longest to develop intuition for. The division by p_a seems like it could blow up — what if p_a is tiny? It can, and that's why the variance of the estimates is high. The γ/K floor on each probability keeps things from going completely off the rails, but the price is noisier updates. That's the fundamental cost of adversarial robustness: you trade tight estimates for guarantees that hold in the worst case.

Contextual Bandits: Because People Are Different

Everything we've built so far treats every round as identical. Pull an arm, get a reward from that arm's distribution. But in the real world, context matters. A lavender latte might sell well on rainy Mondays to the morning commuter crowd, and terribly on sunny Friday afternoons to college students. The optimal action depends on who is walking through the door and when.

Contextual bandits formalize this. Each round, you observe a feature vector x_t — the context — before choosing an action. The expected reward depends on both the action and the context. You still only see the reward for the action you chose.

This is the exact setup for recommendation systems. The arms are items to show (news articles, ads, products). The context is everything you know about the user (demographics, session features, time of day, device). The reward is engagement — a click, a purchase, time spent. You want to learn a policy — a function from context to action — that maximizes expected reward.

LinUCB and the Yahoo Story

The most influential contextual bandit algorithm is LinUCB, published by Li et al. in 2010, developed and deployed at Yahoo for personalized news article recommendation. The model behind it is disarmingly clean: assume the expected reward for arm a given context x is linear. That is, E[reward | x, a] = θ_a^Tx, where θ_a is an unknown weight vector specific to each arm.

For each arm, LinUCB maintains a ridge regression estimate of θ_a along with a confidence ellipsoid — a multi-dimensional version of the confidence interval. Given context x, the predicted reward plus an exploration bonus gives the UCB:

UCB_a(x) = θ̂_a^Tx + α √(x^T A_a⁻¹ x)

The first term is the predicted reward — exploitation. The second term, √(x^T A_a⁻¹ x), measures how unfamiliar this context is to arm a. If the current user has features that arm a has rarely seen before, the bonus is large and LinUCB explores. If the context is well-covered by past data, the bonus is small and LinUCB exploits.

class LinUCB:
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        # per-arm: A is d×d, b is d-vector
        self.A = [np.eye(d) for _ in range(n_arms)]
        self.b = [np.zeros(d) for _ in range(n_arms)]

    def select_arm(self, x):
        ucbs = np.zeros(len(self.A))
        for a in range(len(self.A)):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            ucbs[a] = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return np.argmax(ucbs)

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)   # tighten confidence ellipsoid
        self.b[arm] += reward * x         # update reward signal

The A_a matrix accumulates the outer products of contexts seen by arm a. Its inverse represents the uncertainty — directions in feature space where little data has been collected. The update np.outer(x, x) tightens the ellipsoid in the direction of the observed context. Over time, as more data accumulates, the exploration bonuses shrink and LinUCB converges to the best linear policy.

The Yahoo deployment was the landmark that made the industry take contextual bandits seriously. The system served personalized news articles to millions of users on the Yahoo front page, and the bandit approach significantly outperformed both random selection and a fixed supervised model. The key insight: a supervised model trained on historical click data suffers from feedback loops. If the old system never showed sports articles to users over 60, the supervised model learns "users over 60 don't click sports articles" — but that's because it never gave them the chance. LinUCB, by exploring, discovers these hidden preferences.

I still occasionally get tripped up by the difference between contextual bandits and full reinforcement learning, so let me be explicit. In contextual bandits, each round is independent — your action today doesn't change what context you see tomorrow. In RL, actions change the state. Recommending a video changes what the user wants to watch next. Contextual bandits ignore that sequential dependency, which is a real limitation but also what makes them tractable enough for production deployment at web scale.

Online Convex Optimization: The Unifying Framework

There's a theoretical framework that ties together nearly everything we've discussed. Online Convex Optimization (OCO) is the general game: each round, you choose a point from a convex set, an adversary reveals a convex loss function, you incur that loss. The goal: minimize regret against the best fixed point in hindsight.

OGD is one algorithm for this game. But there's a more elegant, unifying approach: Follow the Regularized Leader (FTRL). At each round, choose the point that minimizes the sum of all past losses plus a regularization term:

w_t = argmin_w [ Σ_s=1..t-1 f_s(w) + R(w) ]

Without the regularizer, this would be Follow the Leader (FTL) — always play whatever would have been best so far. FTL sounds reasonable but can oscillate wildly. If today's best answer is the opposite of yesterday's, FTL whiplashes back and forth. The regularizer penalizes large deviations, smoothing the trajectory.

Here's what makes FTRL beautiful: different regularizers give you different algorithms. L2 regularization (R(w) = ||w||²/2η) recovers Online Gradient Descent. Negative entropy regularization (R(w) = Σ w_i ln w_i) gives you the Exponential Weights / Hedge algorithm. It's the same template, with the regularizer acting as a knob that controls the geometry of the update.

Google's ad serving system uses a variant called FTRL-Proximal, which adds L1 regularization for sparsity. When you have billions of features (most of which are irrelevant for any given ad), L1 regularization drives irrelevant weights to exactly zero, keeping the model tractable. The paper describing this (McMahan et al., 2013, "Ad Click Prediction: a View from the Trenches") is one of the most cited papers in applied ML, and it's fundamentally an online learning algorithm.

The OCO framework also gives a clean way to state the regret bounds we've been collecting. For convex losses: O(√T). For strongly convex losses: O(log T). For exp-concave losses: O(d log T) where d is the dimension. These bounds hold against adversarial loss sequences — no distributional assumptions needed. That's the power of the framework.

Where All of This Lives in the Real World

Let's bring it all back from theory to practice.

A/B testing. Traditional A/B testing splits traffic 50/50, waits two weeks, then computes a p-value. During those two weeks, half your users see the worse variant. If variant B is clearly superior after day 3, you're still forcing 50% of traffic to the loser for 11 more days. Bandit-based testing — typically Thompson Sampling — gradually shifts traffic toward the winning variant while the experiment is running. Early on, it's roughly 50/50. As evidence accumulates, it becomes 90/10. Less wasted exposure, faster convergence. Google, Netflix, and Microsoft all use variants of this. One critical caveat: because the sampling is adaptive, standard frequentist confidence intervals become invalid. You need "always-valid" confidence sequences or careful sequential testing on top of the bandit allocation.

Ad placement. Every ad impression is a bandit round. The context is the user profile plus page content. The arms are candidate ads. The reward is a click or a purchase. The system needs to maximize click-through rate while still discovering which creatives work for which audiences. This is the canonical contextual bandit application, and Thompson Sampling is the industry standard. Netflix uses a similar approach for choosing which thumbnail to show for each title — the same movie gets different artwork depending on who's looking at the catalog.

Recommendation systems. When you open Spotify or YouTube, a contextual bandit (or something close) decides what to surface. The reason bandits beat pure supervised models here is the feedback loop problem. A supervised model trained on historical data only learns about items the old system chose to show. If the old system never showed jazz to rock fans, the supervised model concludes rock fans don't like jazz — not because it's true, but because it never ran the experiment. Bandits, by construction, explore past the edges of what's already known.

Clinical trials. Adaptive clinical trials use bandit algorithms to shift more patients toward the treatment arm that's performing better. The response-adaptive randomization used during COVID-19 vaccine trials is an example. Compared to fixed equal randomization, this approach can reduce the number of patients receiving inferior treatment while maintaining statistical validity. The ethical argument is hard to ignore: if you have evidence mid-trial that treatment A works better than treatment B, is it ethical to keep assigning half the patients to B?

Hyperparameter tuning. Bayesian optimization for hyperparameter search is, at its core, a bandit problem with continuous arms. Each "arm" is a hyperparameter configuration, the reward is validation performance, and Gaussian process UCB or Thompson Sampling guides which configuration to try next. Tools like Optuna and BoTorch are bandit algorithms wearing a different hat.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a simple coffee shop and a chalkboard, trying to figure out which daily special to offer without any historical data. From that seed, we built up the online learning framework — predictions one at a time, scored by regret against the best fixed strategy in hindsight. We saw how the Perceptron, the oldest neural network, is fundamentally an online algorithm. We traced Online Gradient Descent and Exponential Weights. Then we confronted the bandit problem — what happens when you can't see the road not taken — and worked through three strategies for navigating it: epsilon-greedy (the coin flip), UCB (the optimist), and Thompson Sampling (the Bayesian). We handled adversarial settings with EXP3, added context with LinUCB, and unified everything under Online Convex Optimization and FTRL.

My hope is that the next time you encounter a system that adapts in real time — whether it's an A/B test that shifts traffic, a news feed that personalizes in the moment, or an ad system that learns which creative to show — instead of treating it as a black box, you'll recognize the prediction game underneath. You'll see the regret it's trying to minimize, the exploration-exploitation tradeoff it's navigating, and the posterior distributions (or confidence bounds, or exponential weights) doing the heavy lifting. You'll have a pretty darn good mental model of what's going on under the hood.

Resources

"Prediction, Learning, and Games" by Cesa-Bianchi and Lugosi — the definitive textbook on online learning. Dense but deeply rewarding. If you're going to read one book on this topic, this is the one.

"A Tutorial on Thompson Sampling" by Russo et al. (2018) — the clearest exposition of Thompson Sampling I've found, with practical guidance that goes well beyond the theory. Covers Gaussian bandits, contextual bandits, and real deployment considerations.

"Introduction to Online Convex Optimization" by Elad Hazan — freely available online. Beautifully written, covers the OCO framework and FTRL in a way that makes the math feel like storytelling. Wildly helpful for building intuition about regret bounds.

"A Contextual-Bandit Approach to Personalized News Article Recommendation" by Li et al. (2010) — the LinUCB paper that changed how the industry thinks about recommendations. The O.G. paper for contextual bandits in production.

"Ad Click Prediction: a View from the Trenches" by McMahan et al. (2013, Google) — the paper that introduced FTRL-Proximal to the world. Insightful and practical, full of lessons from deploying online learning at massive scale.

"Bandit Algorithms" by Lattimore and Szepesvári — a modern, comprehensive treatment. The first few chapters alone are an unforgettable introduction to the exploration-exploitation tradeoff.

← Previous Learning Theory Foundations Next → Reinforcement Learning