Peeking Problem
The statistical inflation of false positive rates that occurs when experimenters repeatedly check experiment results and stop the test as soon as statistical significance is observed, rather than waiting for the pre-determined sample size to be reached.
The peeking problem is one of the most common and damaging statistical errors in online experimentation. When an analyst checks results daily and stops the experiment the first time the p-value drops below 0.05, the actual false positive rate can be dramatically higher than the nominal 5%. Simulations show that daily checking over a 30-day experiment can inflate the false positive rate to 25% or higher. This happens because random fluctuations in the data can temporarily create the appearance of a significant effect that would disappear with more data. For growth teams under pressure to move fast and ship results, the temptation to peek and stop early is strong, but doing so systematically undermines the credibility of the entire experimentation program by flooding the portfolio with false positives.
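The inflation described above is easy to demonstrate with a Monte Carlo simulation. The sketch below (with hypothetical parameters: 30 days, 100 unit-variance observations per day, a two-sided z-test checked after every day of an A/A test with no true effect) estimates how often "peek and stop at significance" produces a false positive:

```python
import random

def simulate_peeking(n_days=30, users_per_day=100, z_crit=1.96,
                     n_sims=20000, seed=0):
    """Estimate the false positive rate when a two-sided z-test is
    checked after every day of an A/A test (no true effect)."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        cum_sum = 0.0
        for day in range(1, n_days + 1):
            # Daily aggregate of users_per_day unit-variance observations.
            cum_sum += rng.gauss(0.0, users_per_day ** 0.5)
            z = cum_sum / (day * users_per_day) ** 0.5
            if abs(z) > z_crit:  # "peek" and stop at first significance
                false_positives += 1
                break
    return false_positives / n_sims

print(simulate_peeking())  # far above the nominal 0.05
```

Running this typically yields a false positive rate in the high twenties of percent, consistent with the 25%-or-higher figure cited above.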
The mathematical mechanism behind peeking inflation is rooted in the properties of random walks. Under the null hypothesis, the running z-statistic behaves like a standardized random walk, which (by the law of the iterated logarithm) will eventually cross any fixed threshold given enough looks. The probability of crossing the 1.96 threshold (corresponding to alpha = 0.05) at least once during N checks is therefore much higher than 0.05. Formally, the problem is one of optional stopping: the expected value of the test statistic remains zero at any fixed sample size, but the maximum over multiple looks has a larger expected absolute value, and stopping at the first crossing selects for that maximum. The probability of at least one significant result in k independent looks is approximately 1 - (1 - alpha)^k, though actual experiment data is correlated across looks, making the exact inflation depend on the checking schedule and accumulation rate. With continuous monitoring, the inflation is even worse because you are effectively performing infinitely many tests.
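The independent-looks bound 1 - (1 - alpha)^k is a one-liner to evaluate; because real looks at accumulating data are positively correlated, these values overstate the actual inflation, but they show how quickly the budget erodes:

```python
# Worst-case (independent looks) chance of at least one false positive
# after k peeks at nominal alpha = 0.05: 1 - (1 - alpha)^k.
alpha = 0.05
for k in (1, 5, 10, 30):
    print(k, round(1 - (1 - alpha) ** k, 3))
# 1 -> 0.05, 5 -> 0.226, 10 -> 0.401, 30 -> 0.785
```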
The primary solution is sequential testing, which adjusts significance boundaries to account for multiple looks. Group sequential methods like O'Brien-Fleming and Pocock boundaries define spending functions that allocate the total alpha budget across planned interim analyses. O'Brien-Fleming boundaries are very conservative early (requiring very strong evidence to stop) and close to the fixed-sample boundary at the end, while Pocock boundaries distribute alpha more evenly across looks. Always-valid confidence sequences, recently developed by Howard, Ramdas, and others, provide confidence intervals that maintain their coverage guarantee no matter when you look at the data. Modern experimentation platforms increasingly adopt these methods: Eppo uses always-valid sequential testing, Statsig implements group sequential boundaries, and Optimizely offers sequential testing as an option.
Advanced approaches to the peeking problem include mixture sequential probability ratio tests (mSPRT), which provide always-valid p-values by mixing over a class of alternative hypotheses, and e-values, which offer a more flexible framework for sequential evidence accumulation that composes naturally under optional stopping. Bayesian approaches are inherently immune to the peeking problem in theory because the posterior distribution is valid regardless of the stopping rule, though in practice the operating characteristics (false positive rate, power) still depend on the stopping rule. For organizations that cannot implement sequential testing, pragmatic solutions include setting a minimum experiment duration before any analysis (e.g., one full business cycle), automating the analysis to run only at the pre-specified end date, and requiring experiment review board approval for early stopping. Training all stakeholders on the peeking problem and making it a cultural norm to respect experiment timelines is essential for maintaining experiment integrity.
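The mSPRT mentioned above can be sketched concretely for the simplest case: observations y_i ~ N(theta, sigma^2), null hypothesis theta = 0, and a N(0, tau^2) mixing distribution over the alternative, which gives a closed-form mixture likelihood ratio and the always-valid p-value p_n = min(p_{n-1}, 1/Lambda_n). This is a simplified illustration in the style of Johari et al., not any platform's exact implementation, and sigma and tau are assumed known:

```python
import math
import random

def msprt_pvalues(observations, sigma=1.0, tau=1.0):
    """Always-valid p-values from a mixture SPRT with N(0, tau^2)
    mixing over the effect size, for y_i ~ N(theta, sigma^2), H0: theta=0."""
    p, total, pvals = 1.0, 0.0, []
    for n, y in enumerate(observations, start=1):
        total += y
        mean = total / n
        v = sigma ** 2 + n * tau ** 2
        # log of the mixture likelihood ratio Lambda_n
        log_lam = 0.5 * math.log(sigma ** 2 / v) + \
                  (n ** 2 * tau ** 2 * mean ** 2) / (2 * sigma ** 2 * v)
        if log_lam > 0:  # 1/Lambda_n < 1, so it can tighten p
            p = min(p, math.exp(-log_lam))
        pvals.append(p)  # valid at every n, regardless of stopping rule
    return pvals

rng = random.Random(42)
null_p = msprt_pvalues([rng.gauss(0.0, 1.0) for _ in range(500)])
alt_p = msprt_pvalues([rng.gauss(0.5, 1.0) for _ in range(500)])
print(null_p[-1], alt_p[-1])
```

Because p_n is non-increasing by construction and maintains its guarantee at every n, an analyst can check it after every observation and stop whenever it crosses the threshold without inflating the error rate.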
Related Terms
Stopping Rules
Pre-defined criteria that determine when an experiment should be concluded, including both the conditions for early termination due to clear results and the maximum duration or sample size at which a final analysis is performed.
Type I Error
The error of incorrectly rejecting a true null hypothesis, also known as a false positive, where an experiment concludes that a treatment has an effect when in reality there is no true difference between treatment and control.
Confidence Interval
A range of values, derived from sample data, that is expected to contain the true population parameter with a specified probability, providing both an estimate of the treatment effect and the precision of that estimate.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing, which typically varies a single element at a time.
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.
Holdout Testing
An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.