Power Analysis
A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.
Power analysis is the critical planning step that determines whether an experiment is worth running before any resources are committed. Statistical power is the probability that a test will correctly detect a true effect when one exists, conventionally set at 80% or higher. A power analysis takes as inputs the desired significance level (alpha, typically 0.05), the minimum detectable effect size the team cares about, the baseline metric value and its variance, and the desired power level, then outputs the required sample size per variant. For growth teams, power analysis prevents two costly mistakes: running experiments that are too small to detect realistic effects (wasting time on inconclusive results) and running experiments far longer than necessary (delaying the rollout of winning changes).
The standard power analysis formula for a two-sample proportion test is n = (Z_alpha/2 + Z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2, where p1 and p2 are the baseline and expected treatment proportions, Z_alpha/2 is the critical value for the significance level, and Z_beta is the critical value for the desired power. For continuous metrics, the formula uses the pooled standard deviation instead of the proportion variances. Most experimentation platforms like Statsig, Eppo, and Optimizely include built-in power calculators that automate this computation. The practical workflow involves specifying the primary metric, looking up its current baseline value and variance from historical data, specifying the minimum effect size worth detecting (the minimum detectable effect or MDE), and computing the required sample size. Dividing this by the daily traffic rate gives the expected experiment duration. If the duration is impractically long, teams can either increase the MDE (accept that only larger effects will be detected), use variance reduction techniques like CUPED, choose a more sensitive metric, or increase the significance level.
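The workflow above can be sketched in a few lines of Python using only the standard library; the 10% baseline, 10% relative MDE, and daily traffic figure below are illustrative assumptions, not values from any particular product.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """n = (Z_alpha/2 + Z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    for a two-sided, two-sample proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Assumed scenario: 10% baseline conversion, 10% relative MDE (0.10 -> 0.11).
baseline = 0.10
mde_relative = 0.10
n = sample_size_per_variant(baseline, baseline * (1 + mde_relative))

# Dividing by daily traffic per variant gives the expected experiment duration.
daily_traffic_per_variant = 2_000  # assumed traffic figure
days = math.ceil(n / daily_traffic_per_variant)
print(f"{n} users per variant, roughly {days} days")
```

Note how sensitive the result is to the MDE: because the effect size appears squared in the denominator, halving the MDE roughly quadruples the required sample.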
Power analysis should be performed for every experiment before launch, and experiments should not be launched if they cannot achieve adequate power within a reasonable timeframe. A common pitfall is performing power analysis on the overall conversion rate when the change only affects a subset of users; in such cases, the analysis should be restricted to the affected population using triggered analysis. Another mistake is using unrealistically small MDEs to justify large experiments when the team would ship the change even with a smaller effect. Teams should also account for multiple variants: a three-arm test (control plus two treatments) requires roughly 50% more total sample than a two-arm test for the same per-comparison power, since each arm needs the full per-variant sample. Sequential testing designs can reduce the expected sample size by allowing early stopping when effects are large, but the power analysis for these designs uses different formulas that account for the multiple interim analyses.
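The triggered-analysis pitfall can be quantified. If only a fraction of users are exposed to a change, the same effect appears diluted on the overall metric, and the diluted analysis demands far more traffic. A minimal sketch under assumed numbers (10% baseline for all users, a 10% relative lift among the 20% of users who trigger the feature):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

baseline = 0.10         # assumed conversion rate for triggered and untriggered users alike
lift = 0.10             # 10% relative lift among affected users
triggered_share = 0.20  # only 20% of users ever see the change

# Triggered analysis: power the test on the affected population only,
# then gross up by the trigger rate to get total traffic per variant.
n_triggered = sample_size_per_variant(baseline, baseline * (1 + lift))
traffic_triggered = math.ceil(n_triggered / triggered_share)

# Diluted analysis: the same effect shrinks to a 2% relative lift overall.
p2_overall = triggered_share * baseline * (1 + lift) + (1 - triggered_share) * baseline
traffic_diluted = sample_size_per_variant(baseline, p2_overall)

print(traffic_triggered, traffic_diluted)  # the diluted design needs several times more traffic
```

Under these assumptions the diluted design needs several times the traffic of the triggered one, which is why restricting the analysis to the affected population matters.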
Advanced power analysis considerations include accounting for clustered randomization, where the effective sample size is reduced by the intracluster correlation coefficient; handling ratio metrics where the variance depends on both numerator and denominator; and using simulation-based power analysis for complex metrics that do not follow standard distributional assumptions. Bayesian power analysis, sometimes called assurance analysis, computes the probability that the posterior credible interval will exclude zero given a prior distribution over the true effect size, which more naturally incorporates uncertainty about the expected effect. For organizations running many experiments, meta-analytic power analysis uses the distribution of observed effects from past experiments to inform realistic MDE assumptions, often revealing that most true effects are smaller than teams typically assume.
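The simulation-based approach mentioned above can be sketched directly: simulate the experiment many times at a candidate sample size and count how often the test rejects. The rates and sample size here are illustrative assumptions, and the significance test is a standard pooled two-proportion z-test; for metrics with nonstandard distributions, the draw inside the loop would be replaced with the metric's actual data-generating process.

```python
import random
from statistics import NormalDist

def two_prop_pvalue(x1, x2, n):
    """Two-sided pooled z-test for two proportions with equal n per arm."""
    pooled = (x1 + x2) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    if se == 0:
        return 1.0  # degenerate case: no conversions in either arm
    z = abs(x2 / n - x1 / n) / se
    return 2 * (1 - NormalDist().cdf(z))

def simulated_power(p1, p2, n, alpha=0.05, sims=500, seed=7):
    """Fraction of simulated experiments in which the test rejects."""
    random.seed(seed)
    rejections = 0
    for _ in range(sims):
        x1 = sum(random.random() < p1 for _ in range(n))  # control conversions
        x2 = sum(random.random() < p2 for _ in range(n))  # treatment conversions
        if two_prop_pvalue(x1, x2, n) < alpha:
            rejections += 1
    return rejections / sims

# Assumed scenario: 10% baseline, 30% relative lift, n near the analytic answer,
# so the estimate should land near the conventional 80% power target.
print(simulated_power(0.10, 0.13, n=1772))
```

The same loop generalizes to ratio metrics, clustered randomization, or any estimator for which a closed-form power formula is unavailable.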
Related Terms
Effect Size
A quantitative measure of the magnitude of a treatment's impact, expressed either as an absolute difference between groups or as a standardized metric like Cohen's d that allows comparison across different experiments and metrics.
Minimum Detectable Effect
The smallest improvement in a metric that an experiment is designed to reliably detect with a given level of statistical power and significance, determining the practical sensitivity of the test.
Type II Error
The error of failing to reject a false null hypothesis, also known as a false negative, where an experiment fails to detect a real treatment effect, concluding there is no difference when one actually exists.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.
Holdout Testing
An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.