
Multi-Armed Bandit

An optimization algorithm that balances exploration of unknown options with exploitation of known good options, dynamically allocating more traffic to better-performing variants during an experiment.

The multi-armed bandit problem gets its name from a gambler facing multiple slot machines (one-armed bandits) with unknown payout rates. The gambler must balance trying different machines to learn their payouts (exploration) with playing the machine that seems best so far (exploitation). In optimization, each "arm" is a variant being tested.

Unlike traditional A/B tests, which split traffic equally for the entire experiment, bandit algorithms dynamically shift traffic toward better-performing variants. Common algorithms include Thompson Sampling, Upper Confidence Bound (UCB), and epsilon-greedy. As data accumulates, more traffic flows to the winning variant, reducing the opportunity cost of showing inferior variants to users.
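To make the traffic-shifting mechanism concrete, here is a minimal sketch of Thompson Sampling for two variants with Bernoulli (convert / don't convert) outcomes. The variant names, conversion rates, and round count are illustrative assumptions, not from the source; the true rates are used only to simulate user responses and are unknown to the algorithm.

```python
import random

def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Thompson Sampling over Bernoulli arms.

    true_rates: hypothetical conversion rate per variant (used only to
    simulate user behavior; the algorithm never sees these values).
    Returns (pulls, successes) lists, one entry per arm.
    """
    rng = random.Random(seed)
    n = len(true_rates)
    successes = [0] * n  # observed conversions per arm
    failures = [0] * n   # observed non-conversions per arm
    pulls = [0] * n      # how often each arm was shown

    for _ in range(rounds):
        # Draw a plausible conversion rate for each arm from its
        # Beta(1 + successes, 1 + failures) posterior, then show the
        # arm whose sampled rate is highest. Uncertain arms sometimes
        # sample high (exploration); proven arms usually win
        # (exploitation).
        samples = [rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        # Simulate the user's response from the arm's true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls, successes

# Hypothetical experiment: variant B truly converts at 11% vs. A's 5%.
pulls, successes = thompson_sampling([0.05, 0.11], rounds=5000)
```

After a few thousand rounds the posterior for the better variant concentrates above the other's, so most traffic flows to it without any manual stopping rule — the dynamic allocation described above.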

For growth teams, bandits are ideal for ongoing optimization decisions: which headline to show, which recommendation algorithm to use, which pricing tier to display. They converge on clear winners faster than A/B tests and continuously adapt to changing conditions. The trade-off is reduced statistical rigor: bandits optimize for cumulative reward rather than producing clean causal estimates of treatment effects.
