Stopping Rules

Pre-defined criteria that determine when an experiment should be concluded, including both the conditions for early termination due to clear results and the maximum duration or sample size at which a final analysis is performed.

Stopping rules formalize the decision of when an experiment has collected enough data to reach a conclusion, preventing both premature termination (which inflates false positive rates) and unnecessarily prolonged experiments (which waste time and limit experimentation velocity). A well-designed stopping rule specifies the planned sample size or duration, any interim analysis points where early stopping is permitted, the statistical boundaries for early stopping decisions, and criteria for stopping due to safety concerns. For growth teams, clear stopping rules are essential for maintaining experiment discipline and enabling valid sequential monitoring without compromising statistical rigor.

The most common framework for stopping rules in online experimentation is group sequential testing. The experiment is planned for a maximum sample size determined by power analysis, with K planned interim analyses at evenly spaced fractions of the maximum sample. At each interim look, the test statistic is compared to adjusted boundaries. The O'Brien-Fleming spending function produces boundaries that are very stringent early and approach the fixed-sample boundary at the final analysis, which aligns with the practical desire to stop early only when evidence is overwhelming. For a 5-look design with overall two-sided alpha = 0.05, O'Brien-Fleming boundaries are approximately z = 4.56, 3.23, 2.63, 2.28, 2.04 at the five looks. The Lan-DeMets alpha spending function generalizes this by allowing flexible analysis timing. These methods are implemented in tools like Statsig, Eppo, and the R package gsDesign. For futility stopping (concluding no effect exists), conditional power or predictive power calculations assess whether, given the data observed so far, the experiment has a reasonable chance of reaching significance if continued to the planned end.
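The classical O'Brien-Fleming boundaries follow a simple closed form: the critical value at look k is the final-look critical value scaled by sqrt(K / k). A minimal sketch in Python, using the standard tabulated constant c ≈ 2.04 for K = 5 looks at overall two-sided alpha = 0.05 (for other K or alpha, the constant must be looked up or solved numerically, as gsDesign does):

```python
import math

def obrien_fleming_bounds(K, c):
    """Classical O'Brien-Fleming efficacy boundaries.

    z_k = c * sqrt(K / k), where c is the final-look critical value
    chosen so the overall type I error equals the target alpha.
    For K = 5 looks at two-sided alpha = 0.05, c is about 2.04
    (from standard group sequential tables).
    """
    return [c * math.sqrt(K / k) for k in range(1, K + 1)]

bounds = obrien_fleming_bounds(K=5, c=2.04)
print([round(z, 2) for z in bounds])
# → [4.56, 3.23, 2.63, 2.28, 2.04]
```

An interim look stops the experiment for efficacy only if the observed |z| exceeds the boundary for that look; otherwise the experiment continues to the next scheduled analysis.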

Teams should establish stopping rules before launching any experiment and document them in the experiment analysis plan. The stopping rule should include: the maximum sample size and duration, the schedule of interim analyses (e.g., weekly after a minimum 1-week burn-in), the statistical boundaries for efficacy and futility stopping, and any safety guardrail metrics that trigger immediate stopping if degraded beyond a threshold. Common pitfalls include having no stopping rule at all (leading to indefinite experiments that block traffic), having rules that are too permissive (allowing early stopping based on peeked p-values), and not implementing futility stopping (continuing hopeless experiments that waste traffic). Teams should also consider the business cycle: experiments should run for at least one full week to capture day-of-week effects, and preferably through any known cyclical patterns.
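An analysis plan's stopping rule is most useful when it is written as structured data that tooling can log and enforce, not just prose. A hypothetical sketch of such a record (all field names and values here are illustrative, not a standard schema):

```python
# Hypothetical stopping-rule section of an experiment analysis plan,
# expressed as plain data so dashboards and alerting can enforce it.
stopping_rule = {
    "max_sample_per_arm": 50_000,    # from power analysis
    "max_duration_days": 28,         # covers full weekly cycles
    "burn_in_days": 7,               # no interim looks before this
    "interim_looks_days": [7, 14, 21, 28],
    "efficacy_boundary": "O'Brien-Fleming, overall two-sided alpha = 0.05",
    "futility_rule": "stop if conditional power < 0.20 at any look",
    "guardrails": {
        # any guardrail breach triggers immediate stopping
        "error_rate": {"max_relative_increase": 0.05},
        "p95_latency_ms": {"max_absolute_increase": 50},
    },
}

# Sanity-check that every look respects the burn-in period.
assert all(d >= stopping_rule["burn_in_days"]
           for d in stopping_rule["interim_looks_days"])
```

Recording the rule before launch, in a form the experimentation platform reads, is what makes later interim analyses auditable.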

Advanced stopping rule designs include adaptive sample size re-estimation, where the planned maximum sample is adjusted at an interim analysis based on the observed effect size and variance, allowing the experiment to grow if effects are smaller than expected. Bayesian stopping rules based on posterior probability thresholds (e.g., stop when P(treatment > control | data) > 0.99) offer intuitive interpretation and natural immunity to the peeking problem, though their frequentist operating characteristics should be verified through simulation. For always-on experiments like recommendation algorithm changes, continuous monitoring with always-valid inference provides valid conclusions at any stopping time. Multi-arm stopping rules add complexity: arms can be dropped for futility while promising arms continue, using methods like the Dunnett procedure or response-adaptive randomization that shifts traffic away from underperforming arms.
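A Bayesian posterior-probability stopping rule for a conversion metric can be sketched with conjugate Beta posteriors and Monte Carlo sampling. A minimal illustration, assuming independent Beta(1, 1) priors on each arm's conversion rate (the counts below are invented interim data, not from any real experiment):

```python
import random

def prob_treatment_beats_control(conv_t, n_t, conv_c, n_c,
                                 draws=100_000, seed=42):
    """Monte Carlo estimate of P(rate_treatment > rate_control | data).

    Assumes independent Beta(1, 1) priors, so each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        p_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        if p_t > p_c:
            wins += 1
    return wins / draws

# Hypothetical interim look: 650/5000 vs 520/5000 conversions.
prob = prob_treatment_beats_control(650, 5000, 520, 5000)
stop_for_efficacy = prob > 0.99  # the posterior threshold from the text
```

The frequentist caveat in the text still applies: before adopting a threshold like 0.99, simulate the rule under the null to measure how often repeated looks would stop on a true zero effect.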

Related Terms

Peeking Problem

The statistical inflation of false positive rates that occurs when experimenters repeatedly check experiment results and stop the test as soon as statistical significance is observed, rather than waiting for the pre-determined sample size to be reached.

Power Analysis

A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.

Adaptive Experiment

An experiment design that modifies its parameters during execution based on accumulating data, including adjusting traffic allocation between variants, dropping underperforming arms, or modifying the sample size, while maintaining statistical validity through appropriate corrections.

Multivariate Testing

An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.

Split Testing

The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.

Holdout Testing

An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.