Confidence Interval

A range of values, derived from sample data, that is constructed to contain the true population parameter at a specified confidence level, conveying both an estimate of the treatment effect and the precision of that estimate.

A confidence interval (CI) provides far more information than a simple point estimate or p-value by quantifying the uncertainty around a measured effect. A 95% confidence interval means that if the experiment were repeated many times, 95% of the computed intervals would contain the true effect. For growth teams, confidence intervals are essential for making informed ship/no-ship decisions because they communicate both the likely magnitude and the range of plausible values for a treatment effect. A point estimate of +3% conversion lift tells you the most likely outcome, but the confidence interval [+0.5%, +5.5%] tells you the best and worst realistic scenarios, enabling proper risk assessment and revenue forecasting.
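The repeated-sampling interpretation above can be checked directly by simulation: construct many 95% intervals from fresh samples of a distribution whose true mean is known, and count how often they cover it. This is a minimal sketch using only the standard library; the true mean, spread, and sample size are arbitrary illustrative values.

```python
# Coverage simulation: a 95% CI procedure, repeated many times,
# should contain the true parameter in roughly 95% of repetitions.
import random
import math

random.seed(42)
true_mean, sigma, n = 0.10, 0.30, 400  # hypothetical metric parameters
z = 1.96                               # critical value for 95% confidence
trials = 2000

covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    se = sigma / math.sqrt(n)          # known-sigma case, for simplicity
    lo, hi = xbar - z * se, xbar + z * se
    if lo <= true_mean <= hi:
        covered += 1

print(f"empirical coverage: {covered / trials:.3f}")  # close to 0.95
```

Any single interval either contains the true mean or it does not; the 95% figure describes the long-run behavior of the procedure, which is exactly what the simulation measures.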

The standard confidence interval for a difference in means is calculated as (X_bar_treatment - X_bar_control) +/- z_(alpha/2) * SE, where SE is the standard error of the difference, computed as sqrt(s1^2/n1 + s2^2/n2). For proportions, the SE uses the proportion formula sqrt(p1(1-p1)/n1 + p2(1-p2)/n2). The width of the confidence interval is inversely proportional to the square root of the sample size, so quadrupling the sample size halves the interval width. Experimentation platforms like Statsig, Optimizely, and Eppo display confidence intervals prominently in their dashboards. Many platforms also offer Bayesian credible intervals, which have a more intuitive interpretation: a 95% credible interval means there is a 95% probability that the true parameter lies within the interval, given the data and prior. This distinction matters because the frequentist CI's coverage guarantee applies to the procedure across hypothetical repetitions, not to the specific interval computed.
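The proportion formula above translates directly into code. This sketch computes a 95% CI for a lift in conversion rate between treatment and control; the conversion counts and sample sizes are hypothetical.

```python
# 95% CI for a difference in proportions (treatment minus control),
# using the unpooled standard error from the formula above.
import math

def diff_prop_ci(x1, n1, x2, n2, z=1.96):
    """CI for p1 - p2, where x1/n1 and x2/n2 are conversions/users."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Hypothetical experiment: 12.4% vs 11.0% conversion on 5,000 users each.
lo, hi = diff_prop_ci(620, 5000, 550, 5000)
print(f"lift CI: [{lo:+.4f}, {hi:+.4f}]")
```

Because the width scales as 1/sqrt(n), rerunning this with n1 = n2 = 20000 produces an interval roughly half as wide.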

Teams should use confidence intervals as the primary output of experiment analysis rather than relying on binary significant/not-significant conclusions. A result can be not statistically significant but still highly informative: a 95% CI of [-0.5%, +4%] for conversion lift suggests the treatment is unlikely to be harmful and has a good chance of being beneficial, which might justify shipping. Conversely, a statistically significant result with a wide CI that includes trivially small effects may not warrant the implementation investment. Common pitfalls include misinterpreting the confidence level (it is not the probability that the true value is in this specific interval in the frequentist framework), ignoring the width of the interval when making decisions, and not adjusting confidence levels when examining multiple metrics or segments simultaneously. When running multiple comparisons, the family-wise confidence level is lower than the individual interval level.
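One common remedy for the multiple-comparisons issue mentioned above is a Bonferroni correction: with k simultaneous intervals, build each at level 1 - alpha/k so that family-wise coverage is at least 1 - alpha. This sketch shows how the critical value widens; the choice of four metrics is illustrative.

```python
# Bonferroni-adjusted critical values for simultaneous confidence
# intervals across k metrics. Uses only the standard library.
from statistics import NormalDist

def z_crit(conf):
    """Two-sided normal critical value for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)

alpha, k = 0.05, 4                    # e.g. four metrics examined together
per_interval_conf = 1 - alpha / k     # 0.9875 per interval

print(f"unadjusted z: {z_crit(0.95):.3f}")
print(f"adjusted z:   {z_crit(per_interval_conf):.3f}")
```

The adjusted intervals are wider, which is the price of a valid joint coverage guarantee; platforms differ in whether they apply such corrections by default.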

Advanced applications include using confidence intervals for equivalence testing, where the goal is to show that a new system performs within an acceptable range of the old one rather than showing superiority. If the entire CI falls within a pre-specified equivalence margin, the treatment is declared equivalent. This is valuable for infrastructure migrations, code refactors, and cost-reduction changes where the goal is to confirm no harm. Confidence intervals also enable more nuanced meta-analysis across experiments, where overlapping CIs from multiple tests can be combined using inverse-variance weighting to produce a pooled estimate with a narrower interval. For sequential experiments that allow early stopping, the confidence intervals must be adjusted using methods like alpha spending functions to maintain valid coverage despite the multiple looks at the data.
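The equivalence decision rule described above reduces to a containment check: ship only if the entire CI lies inside the pre-registered margin. A minimal sketch, with a hypothetical +/-1% margin and made-up intervals:

```python
# Equivalence check for a "confirm no harm" rollout: the change passes
# only if the whole CI for the lift sits inside [-margin, +margin].
def is_equivalent(ci_low, ci_high, margin=0.01):
    """True if the entire interval is within the equivalence margin."""
    return -margin <= ci_low and ci_high <= margin

print(is_equivalent(-0.004, 0.007))  # entire CI inside the margin
print(is_equivalent(-0.004, 0.015))  # upper bound breaches the margin
```

Note that a CI merely overlapping the margin is not enough; partial overlap means a harmful effect larger than the margin is still plausible, so the migration would need more data before being declared equivalent.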
