Peeking Problem

The statistical inflation of false positive rates that occurs when experimenters repeatedly check experiment results and stop the test as soon as statistical significance is observed, rather than waiting for the pre-determined sample size to be reached.

The peeking problem is one of the most common and damaging statistical errors in online experimentation. When an analyst checks results daily and stops the experiment the first time the p-value drops below 0.05, the actual false positive rate can be dramatically higher than the nominal 5%. Simulations show that daily checking over a 30-day experiment can inflate the false positive rate to 25% or higher. This happens because random fluctuations in the data can temporarily create the appearance of a significant effect that would disappear with more data. For growth teams under pressure to move fast and ship results, the temptation to peek and stop early is strong, but doing so systematically undermines the credibility of the entire experimentation program by flooding the portfolio with false positives.
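The inflation is easy to reproduce with a Monte Carlo simulation. The sketch below (function and parameter names are illustrative; standard library only) runs A/A experiments in which no true effect exists, peeks at the running z-statistic after every batch of users, and stops at the first nominally significant result; the fraction of experiments that stop early is the realized false positive rate.

```python
import random

def peeking_false_positive_rate(n_looks=30, users_per_look=1000,
                                z_crit=1.96, n_sims=20000, seed=0):
    """Monte Carlo estimate of the false positive rate of an A/A test
    (no true effect) that is peeked at after every batch of users and
    stopped at the first nominally significant |z| > z_crit."""
    rng = random.Random(seed)
    stopped_early = 0
    for _ in range(n_sims):
        diff_sum, n = 0.0, 0
        for _ in range(n_looks):
            # The sum of (treatment - control) differences for one batch
            # of unit-variance outcomes is Normal(0, 2 * users_per_look).
            diff_sum += rng.gauss(0, (2 * users_per_look) ** 0.5)
            n += users_per_look
            z = diff_sum / (2 * n) ** 0.5  # z-statistic at this look
            if abs(z) > z_crit:            # peek and stop on "significance"
                stopped_early += 1
                break
    return stopped_early / n_sims
```

With 30 daily looks this typically comes out around 25-30%, consistent with the figure quoted above, versus the nominal 5% of a single fixed-horizon test.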

The mathematical mechanism behind peeking inflation is rooted in the properties of random walks. Under the null hypothesis, the running z-statistic of a test follows a process that, given enough looks, will eventually cross any fixed threshold. The probability of crossing the 1.96 threshold (the two-sided alpha = 0.05 boundary) at least once during N checks is therefore much higher than 0.05. Formally, peeking is a multiple-testing problem created by a data-dependent stopping rule: at any single look the test statistic has expected value zero under the null, but the maximum of its absolute value over multiple looks is systematically larger than at any one look. If the k looks were independent, the probability of at least one significant result would be approximately 1 - (1 - alpha)^k; in practice each look contains the previous look's data, so the looks are positively correlated and the exact inflation depends on the checking schedule and the rate of data accumulation. With continuous monitoring the inflation is even worse, because you are effectively performing a test after every new observation.
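The independent-looks approximation takes one line to compute. It is a worst-case figure, since real looks share data and are positively correlated; the function name below is illustrative.

```python
def worst_case_false_positive_rate(alpha, k):
    """Chance of at least one false positive in k *independent* looks,
    each tested at level alpha. Correlated looks inflate less than this."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 30):
    print(f"{k:>2} looks at alpha=0.05 -> "
          f"{worst_case_false_positive_rate(0.05, k):.0%}")
```

At alpha = 0.05 this gives roughly 23% for 5 looks, 40% for 10, and 79% for 30, which is why the correlated-data inflation of ~25% over 30 daily looks, while severe, is still well below the independent-looks bound.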

The primary solution is sequential testing, which adjusts significance boundaries to account for multiple looks. Group sequential methods such as the O'Brien-Fleming and Pocock boundaries, together with the alpha-spending functions of Lan and DeMets that generalize them, allocate the total alpha budget across planned interim analyses. O'Brien-Fleming boundaries are very conservative early (requiring very strong evidence to stop) and close to the fixed-sample boundary at the end, while Pocock boundaries distribute alpha more evenly across looks. Always-valid confidence sequences, developed recently by Howard, Ramdas, and others, provide confidence intervals that maintain their coverage guarantee no matter when you look at the data. Modern experimentation platforms increasingly adopt these methods: Eppo uses always-valid sequential testing, Statsig implements group sequential boundaries, and Optimizely offers sequential testing as an option.
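The two boundary shapes can be compared with a short simulation. The sketch below uses the standard tabulated constants for K = 5 equally spaced looks at two-sided alpha = 0.05 (approximately 2.413 for Pocock and 2.040 for O'Brien-Fleming) and checks by Monte Carlo that each boundary family holds the overall false positive rate near 5%, while naive repeated testing at 1.96 does not. Function and variable names are illustrative.

```python
import random

def crossing_rate(boundaries, n_sims=20000, seed=1):
    """Fraction of A/A (null) experiments whose running z-statistic
    exceeds |boundaries[k]| at any of len(boundaries) equal-size looks."""
    rng = random.Random(seed)
    crossed = 0
    for _ in range(n_sims):
        s = 0.0
        for k, b in enumerate(boundaries):
            s += rng.gauss(0, 1)          # one look's worth of information
            z = s / (k + 1) ** 0.5        # running z-statistic
            if abs(z) > b:
                crossed += 1
                break
    return crossed / n_sims

K = 5
pocock = [2.413] * K                      # flat boundary, tabulated for K=5
obf = [2.040 * (K / (k + 1)) ** 0.5       # ~4.56 at the first look,
       for k in range(K)]                 # 2.04 at the final look
naive = [1.96] * K                        # uncorrected repeated testing
```

Here `crossing_rate(pocock)` and `crossing_rate(obf)` both land near 0.05, while `crossing_rate(naive)` is roughly 0.14. The ~4.56 first-look O'Brien-Fleming boundary demands overwhelming evidence, which is why most experiments under that scheme run to completion.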

Advanced approaches to the peeking problem include mixture sequential probability ratio tests (mSPRT), which provide always-valid p-values by mixing over a class of alternative hypotheses, and e-values, which offer a more flexible framework for sequential evidence accumulation that composes naturally under optional stopping. Bayesian approaches are inherently immune to the peeking problem in theory because the posterior distribution is valid regardless of the stopping rule, though in practice the operating characteristics (false positive rate, power) still depend on the stopping rule. For organizations that cannot implement sequential testing, pragmatic solutions include setting a minimum experiment duration before any analysis (e.g., one full business cycle), automating the analysis to run only at the pre-specified end date, and requiring experiment review board approval for early stopping. Training all stakeholders on the peeking problem and making it a cultural norm to respect experiment timelines is essential for maintaining experiment integrity.
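To make the mSPRT idea concrete, the sketch below implements the normal-data version with a Normal(0, tau2) mixing distribution over the mean difference, using the standard closed-form mixture likelihood ratio; the always-valid p-value is the running minimum of 1/Lambda_n and retains its guarantee under any stopping rule. The function name and the parameter choices (sigma2, tau2) are illustrative assumptions, not a specific platform's implementation.

```python
import math
import random

def msprt_p_values(diffs, sigma2=2.0, tau2=1.0):
    """Always-valid p-values from a mixture SPRT (normal data with known
    variance sigma2, Normal(0, tau2) mixture over the mean difference).
    The running minimum of 1/Lambda_n is a valid p-value at every n,
    regardless of when the analyst decides to stop."""
    p, s, out = 1.0, 0.0, []
    for n, d in enumerate(diffs, start=1):
        s += d  # running sum of per-user (treatment - control) differences
        # Closed-form mixture likelihood ratio against the zero-difference null.
        lam = math.sqrt(sigma2 / (sigma2 + n * tau2)) * math.exp(
            tau2 * s * s / (2 * sigma2 * (sigma2 + n * tau2))
        )
        p = min(p, 1.0 / lam)
        out.append(p)
    return out

# A/A data (true difference is zero): p only rarely dips below 0.05, ever.
rng = random.Random(0)
null_p = msprt_p_values([rng.gauss(0, 2 ** 0.5) for _ in range(200)])
```

Because p_n only ever decreases, an analyst can check it after every single observation and stop the moment it crosses 0.05 without pushing the false positive rate above 5%.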
