False Discovery Rate
The expected proportion of false positives among all statistically significant results, offering a less conservative alternative to familywise error rate control that is more appropriate when many hypotheses are tested and some false discoveries are acceptable.
The false discovery rate (FDR) provides a practical middle ground between no multiple testing correction (which floods results with false positives) and strict familywise error rate control (which is so conservative it misses many real effects). While FWER methods control the probability of even one false positive, FDR controls the expected fraction of discoveries that are false. If you declare 20 metrics as significantly affected and the FDR is controlled at 5%, you expect about 1 of those 20 to be a false positive. For growth teams analyzing dozens of metrics per experiment, FDR control is usually the right approach: it accepts that some false discoveries will occur as long as the majority of declared effects are real, enabling more aggressive exploration without drowning in noise.
The standard FDR control method is the Benjamini-Hochberg (BH) procedure, introduced in 1995. The algorithm is: (1) order all m p-values from smallest to largest as p(1) <= p(2) <= ... <= p(m); (2) find the largest k such that p(k) <= k * q / m, where q is the target FDR level (e.g., 0.05); (3) reject all hypotheses corresponding to p(1) through p(k). This procedure controls FDR at level q when the tests are independent or positively correlated, which is usually satisfied for experiment metrics. The Benjamini-Yekutieli procedure extends FDR control to arbitrary dependence structures at the cost of being more conservative. In practice, the BH procedure is much less conservative than Bonferroni: with 100 tests and alpha = 0.05, Bonferroni requires p < 0.0005 for each test, while BH might declare significance for any p < 0.02 depending on the p-value distribution, dramatically increasing the discovery rate for real effects.
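The three steps above translate directly into code. This is a minimal NumPy sketch of the BH step-up procedure (the function name and example p-values are illustrative, not from any particular library):

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean array (in the original order of `pvalues`)
    marking which hypotheses are rejected at target FDR level q.
    """
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)                        # step 1: sort p-values
    sorted_p = p[order]
    thresholds = np.arange(1, m + 1) * q / m     # step 2: k * q / m for k = 1..m
    below = sorted_p <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest k with p(k) <= k*q/m
        reject[order[: k + 1]] = True            # step 3: reject p(1)..p(k)
    return reject

# Hypothetical p-values from 10 metric comparisons:
pvals = [0.001, 0.008, 0.012, 0.019, 0.042, 0.06, 0.074, 0.205, 0.212, 0.6]
print(benjamini_hochberg(pvals, q=0.05).sum())   # BH rejects 4 hypotheses
print((np.asarray(pvals) < 0.05 / 10).sum())     # Bonferroni rejects only 1
```

On this toy input, BH rejects the four smallest p-values while Bonferroni (p < 0.005) rejects only one, illustrating the discovery-rate gap described above.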
FDR control is most appropriate when analyzing multiple secondary metrics, performing segment-level analyses, or conducting discovery-oriented research where the goal is to identify promising leads for follow-up rather than making definitive ship decisions. For the primary experiment metric that determines the ship/no-ship decision, standard single-hypothesis testing at the desired alpha level is more appropriate because you want to control the probability of error for that specific decision. Common pitfalls include confusing FDR with the false positive rate (they answer different questions), not reporting which correction was applied (making results non-reproducible), and cherry-picking the correction method after seeing results. Teams should pre-specify their correction strategy in the experiment analysis plan.
Advanced FDR concepts include the local false discovery rate (lfdr), which estimates the probability that a specific discovery is false rather than controlling the overall rate, and the positive false discovery rate (pFDR), which conditions on making at least one discovery. The q-value, introduced by John Storey, is the FDR analog of the p-value: the minimum FDR at which a result would be declared significant. Adaptive FDR methods estimate the proportion of true nulls from the data and use this to increase power, since fewer true nulls means more lenient thresholds can maintain FDR control. For sequential experimentation where results accumulate over time, online FDR control algorithms like LOND and SAFFRON maintain FDR guarantees in the streaming setting, which is relevant for always-on experimentation platforms that continuously evaluate new experiments.
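The q-value in the BH sense can be computed as an "adjusted p-value": for each hypothesis, q(i) = min over j >= i of p(j) * m / j, the smallest FDR level at which it would be rejected. A minimal NumPy sketch follows; note that Storey's full q-value method also estimates the proportion of true nulls (pi0) from the data, which this sketch omits by conservatively assuming pi0 = 1 (making it equivalent to BH adjustment):

```python
import numpy as np

def bh_qvalues(pvalues):
    """BH-adjusted p-values: the minimum FDR level at which each
    hypothesis would be declared significant (assumes pi0 = 1)."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)       # p(j) * m / j
    # Enforce monotonicity: q(i) = min over j >= i via reverse cumulative min
    qsorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(qsorted, 0.0, 1.0)             # back to original order
    return q

# Hypothetical p-values: the first three share a q-value of 0.04
print(bh_qvalues([0.01, 0.02, 0.03, 0.5]))  # [0.04 0.04 0.04 0.5]
```

Reporting q-values rather than reject/accept decisions lets readers apply their own FDR threshold, which is useful in the discovery-oriented settings described above.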
Related Terms
Multiple Comparison Correction
Statistical adjustments applied when testing multiple hypotheses simultaneously to control the overall probability of making at least one Type I error, preventing the inflation of false positive rates that occurs when many tests are conducted.
Type I Error
The error of incorrectly rejecting a true null hypothesis, also known as a false positive, where an experiment concludes that a treatment has an effect when in reality there is no true difference between treatment and control.
Peeking Problem
The statistical inflation of false positive rates that occurs when experimenters repeatedly check experiment results and stop the test as soon as statistical significance is observed, rather than waiting for the pre-determined sample size to be reached.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.
Holdout Testing
An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.