Power Analysis
A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.
Power analysis is the critical planning step that determines whether an experiment is worth running before any resources are committed. Statistical power is the probability that a test will correctly detect a true effect when one exists, conventionally set at 80% or higher. A power analysis takes as inputs the desired significance level (alpha, typically 0.05), the minimum detectable effect size the team cares about, the baseline metric value and its variance, and the desired power level, then outputs the required sample size per variant. For growth teams, power analysis prevents two costly mistakes: running experiments that are too small to detect realistic effects (wasting time on inconclusive results) and running experiments far longer than necessary (delaying the rollout of winning changes).
The standard power analysis formula for a two-sample proportion test is n = (Z_alpha/2 + Z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2, where p1 and p2 are the baseline and expected treatment proportions, Z_alpha/2 is the critical value for the significance level, and Z_beta is the critical value for the desired power. For continuous metrics, the formula uses the pooled standard deviation instead of the proportion variances. Most experimentation platforms like Statsig, Eppo, and Optimizely include built-in power calculators that automate this computation. The practical workflow involves specifying the primary metric, looking up its current baseline value and variance from historical data, specifying the minimum effect size worth detecting (the minimum detectable effect or MDE), and computing the required sample size. Dividing this by the daily traffic rate gives the expected experiment duration. If the duration is impractically long, teams can either increase the MDE (accept that only larger effects will be detected), use variance reduction techniques like CUPED, choose a more sensitive metric, or increase the significance level.
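The workflow above can be sketched in a few lines of Python using only the standard library; the 10% baseline, 10% relative MDE, and daily traffic figure below are illustrative assumptions, not values from any particular product.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """n = (Z_alpha/2 + Z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    for a two-sided, two-sample proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Assumed scenario: 10% baseline conversion, 10% relative MDE (0.10 -> 0.11).
baseline = 0.10
mde_relative = 0.10
n = sample_size_per_variant(baseline, baseline * (1 + mde_relative))

# Dividing by daily traffic per variant gives the expected experiment duration.
daily_traffic_per_variant = 2_000  # assumed traffic figure
days = math.ceil(n / daily_traffic_per_variant)
print(f"{n} users per variant, roughly {days} days")
```

Note how sensitive the result is to the MDE: because the effect size appears squared in the denominator, halving the MDE roughly quadruples the required sample.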
Power analysis should be performed for every experiment before launch, and experiments should not be launched if they cannot achieve adequate power within a reasonable timeframe. A common pitfall is performing power analysis on the overall conversion rate when the change only affects a subset of users; in such cases, the analysis should be restricted to the affected population using triggered analysis. Another mistake is using unrealistically small MDEs to justify large experiments when the team would ship the change even with a smaller effect. Teams should also account for multiple variants: a three-arm test (control plus two treatments) requires roughly 50% more total sample than a two-arm test for the same per-comparison power, since each arm needs the full per-variant sample. Sequential testing designs can reduce the expected sample size by allowing early stopping when effects are large, but the power analysis for these designs uses different formulas that account for the multiple interim analyses.
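The triggered-analysis pitfall can be quantified. If only a fraction of users are exposed to a change, the same effect appears diluted on the overall metric, and the diluted analysis demands far more traffic. A minimal sketch under assumed numbers (10% baseline for all users, a 10% relative lift among the 20% of users who trigger the feature):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

baseline = 0.10         # assumed conversion rate for triggered and untriggered users alike
lift = 0.10             # 10% relative lift among affected users
triggered_share = 0.20  # only 20% of users ever see the change

# Triggered analysis: power the test on the affected population only,
# then gross up by the trigger rate to get total traffic per variant.
n_triggered = sample_size_per_variant(baseline, baseline * (1 + lift))
traffic_triggered = math.ceil(n_triggered / triggered_share)

# Diluted analysis: the same effect shrinks to a 2% relative lift overall.
p2_overall = triggered_share * baseline * (1 + lift) + (1 - triggered_share) * baseline
traffic_diluted = sample_size_per_variant(baseline, p2_overall)

print(traffic_triggered, traffic_diluted)  # the diluted design needs several times more traffic
```

Under these assumptions the diluted design needs several times the traffic of the triggered one, which is why restricting the analysis to the affected population matters.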
Advanced power analysis considerations include accounting for clustered randomization, where the effective sample size is reduced by the intracluster correlation coefficient; handling ratio metrics where the variance depends on both numerator and denominator; and using simulation-based power analysis for complex metrics that do not follow standard distributional assumptions. Bayesian power analysis, sometimes called assurance analysis, computes the probability that the posterior credible interval will exclude zero given a prior distribution over the true effect size, which more naturally incorporates uncertainty about the expected effect. For organizations running many experiments, meta-analytic power analysis uses the distribution of observed effects from past experiments to inform realistic MDE assumptions, often revealing that most true effects are smaller than teams typically assume.
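The simulation-based approach mentioned above can be sketched directly: simulate the experiment many times at a candidate sample size and count how often the test rejects. The rates and sample size here are illustrative assumptions, and the significance test is a standard pooled two-proportion z-test; for metrics with nonstandard distributions, the draw inside the loop would be replaced with the metric's actual data-generating process.

```python
import random
from statistics import NormalDist

def two_prop_pvalue(x1, x2, n):
    """Two-sided pooled z-test for two proportions with equal n per arm."""
    pooled = (x1 + x2) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    if se == 0:
        return 1.0  # degenerate case: no conversions in either arm
    z = abs(x2 / n - x1 / n) / se
    return 2 * (1 - NormalDist().cdf(z))

def simulated_power(p1, p2, n, alpha=0.05, sims=500, seed=7):
    """Fraction of simulated experiments in which the test rejects."""
    random.seed(seed)
    rejections = 0
    for _ in range(sims):
        x1 = sum(random.random() < p1 for _ in range(n))  # control conversions
        x2 = sum(random.random() < p2 for _ in range(n))  # treatment conversions
        if two_prop_pvalue(x1, x2, n) < alpha:
            rejections += 1
    return rejections / sims

# Assumed scenario: 10% baseline, 30% relative lift, n near the analytic answer,
# so the estimate should land near the conventional 80% power target.
print(simulated_power(0.10, 0.13, n=1772))
```

The same loop generalizes to ratio metrics, clustered randomization, or any estimator for which a closed-form power formula is unavailable.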
Related Terms
Effect Size
A quantitative measure of the magnitude of a treatment's impact, expressed either as an absolute difference between groups or as a standardized metric like Cohen's d that allows comparison across different experiments and metrics.
Minimum Detectable Effect
The smallest improvement in a metric that an experiment is designed to reliably detect with a given level of statistical power and significance, determining the practical sensitivity of the test.
Type II Error
The error of failing to reject a false null hypothesis, also known as a false negative, where an experiment fails to detect a real treatment effect, concluding there is no difference when one actually exists.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.
Holdout Testing
An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.