Heterogeneous Treatment Effects
Variation in treatment effects across different subgroups of the population, where an intervention may have different impacts depending on user characteristics such as tenure, geography, device type, or behavioral patterns.
Heterogeneous treatment effects (HTE) occur when a treatment's impact varies across individuals or subgroups rather than being uniform. An average treatment effect of +2% conversion lift might mask a +8% lift for new users and a -1% effect for power users. Understanding HTE is crucial for growth teams because it reveals who benefits from a change, who is harmed, and who is unaffected, enabling more targeted and effective product decisions. A feature that shows a modest average improvement might be transformative for a specific segment, and knowing this allows teams to ship changes selectively or to design personalized experiences that serve each segment optimally.
Analyzing HTE typically follows a progression from simple to sophisticated methods. The simplest approach is pre-specified subgroup analysis: before the experiment, define subgroups of interest (e.g., new vs. returning users, mobile vs. desktop) and estimate the treatment effect within each subgroup. The interaction test formally assesses whether the treatment effect differs significantly across subgroups using an interaction term in the regression model: Y = beta_0 + beta_1*Treatment + beta_2*Subgroup + beta_3*(Treatment*Subgroup) + epsilon, where beta_3 is the differential treatment effect. For discovery-oriented HTE analysis, machine learning methods such as causal forests, Bayesian Additive Regression Trees (BART), and meta-learners (T-learner, X-learner, R-learner) estimate personalized treatment effects as a function of many covariates simultaneously. These methods are implemented in the R grf package and in Python's EconML and CausalML libraries. Visualization of HTE often uses sorted treatment effect plots, in which users are ordered by their estimated conditional average treatment effect (CATE) and the cumulative effect is plotted, revealing what fraction of users benefit from the treatment.
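The interaction test above can be sketched on simulated data with statsmodels; the subgroup (new vs. returning users), effect sizes, and baseline conversion rate are illustrative assumptions, not real experiment results:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 10_000

# Simulated experiment: the treatment helps new users, slightly hurts returning users.
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "is_new_user": rng.integers(0, 2, n),
})
true_effect = np.where(df["is_new_user"] == 1, 0.08, -0.01)  # assumed effects
base_rate = 0.10
df["converted"] = (rng.random(n) < base_rate + true_effect * df["treatment"]).astype(int)

# Y = beta_0 + beta_1*Treatment + beta_2*Subgroup + beta_3*(Treatment*Subgroup) + epsilon
# The formula "treatment * is_new_user" expands to main effects plus the interaction.
model = smf.ols("converted ~ treatment * is_new_user", data=df).fit()
b3 = model.params["treatment:is_new_user"]   # differential treatment effect (beta_3)
p3 = model.pvalues["treatment:is_new_user"]  # interaction test p-value
print(f"beta_3 = {b3:.3f}, p = {p3:.4g}")
```

With these simulated effects, beta_3 should land near 0.09 (the gap between +0.08 and -0.01), and the interaction test should reject the hypothesis of a uniform effect.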
HTE analysis should be conducted for every major experiment, but teams must be disciplined about distinguishing confirmatory from exploratory analysis. Pre-specified subgroups with a hypothesis-driven rationale (e.g., "we expect the new onboarding flow to help new users more than existing users") provide stronger evidence than post-hoc data mining across many possible subgroups, and multiple comparison corrections should be applied when testing many subgroups. Common pitfalls include the garden of forking paths, where analysts search across many subgroup definitions until they find a significant interaction; overfitting in small subgroups, where random variation creates the appearance of large effects; and forgetting that average effects within subgroups are still averages that may themselves be heterogeneous. The clinical trials literature has extensive guidance on proper subgroup analysis that digital experimentation teams can adapt.
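A minimal sketch of a multiple-comparison correction across subgroup interaction tests, using Benjamini-Hochberg via statsmodels; the subgroup names and p-values are made up for illustration:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from interaction tests in six pre-specified subgroups.
subgroups = ["new_users", "power_users", "mobile", "desktop", "us", "intl"]
p_values = np.array([0.003, 0.31, 0.04, 0.52, 0.78, 0.02])

# Benjamini-Hochberg controls the false discovery rate across the subgroup tests.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for name, p, p_adj, sig in zip(subgroups, p_values, p_adjusted, reject):
    print(f"{name:12s} raw p={p:.3f}  adjusted p={p_adj:.3f}  significant={sig}")
```

Note how two subgroups that look significant at the raw 0.05 threshold (mobile at 0.04, intl at 0.02) no longer survive after adjustment; only the new_users interaction does.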
Advanced HTE considerations include using the estimated heterogeneity to design optimal treatment policies (which users should receive the treatment and which should not), applying HTE estimates to power analysis for follow-up experiments targeting specific subgroups, and using HTE to understand the mechanisms behind treatment effects. If a treatment works only for users with a specific behavioral pattern, that suggests a mechanistic explanation that can inform future interventions. The best linear projection of the CATE onto user features provides an interpretable summary of the most important dimensions of heterogeneity. For organizations running many experiments, meta-analytic approaches can identify consistent patterns of heterogeneity across experiments, revealing general principles, such as certain user segments being consistently more responsive to UI changes, that inform both experiment design and product strategy.
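One way to turn estimated heterogeneity into a treatment policy is to estimate per-user CATEs (here with a T-learner, one of the meta-learners mentioned above) and treat only users with a positive estimate. The sketch below uses simulated data with scikit-learn; the single tenure-like covariate and the shape of the true effect are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 5_000

# Simulated experiment: the CATE depends on one covariate (e.g., user tenure in [0, 1]).
X = rng.uniform(0, 1, size=(n, 1))
t = rng.integers(0, 2, n)
true_cate = 0.5 - X[:, 0]  # assumed: positive for low-tenure users, negative for high
y = X[:, 0] + true_cate * t + rng.normal(0, 0.1, n)

# T-learner: fit separate outcome models on the treated and control arms,
# then take the difference of their predictions as the CATE estimate.
m1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
m0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
cate_hat = m1.predict(X) - m0.predict(X)

# Simple policy: treat only users whose estimated CATE is positive.
treat_policy = cate_hat > 0
print(f"fraction targeted: {treat_policy.mean():.2f}")
```

By construction, the true effect is positive for roughly half the population, so a well-estimated policy should target close to 50% of users. Sorting `cate_hat` and plotting the cumulative effect yields the sorted treatment effect plot described earlier.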
Related Terms
Causal Forest
A machine learning method based on random forests that estimates heterogeneous treatment effects, discovering how the impact of a treatment varies across different subgroups of users defined by their observable characteristics.
Propensity Score Matching
A statistical method that reduces selection bias in observational studies by matching treated and untreated units that have similar probabilities (propensity scores) of receiving the treatment, creating a pseudo-randomized comparison.
Triggered Analysis
An analysis technique that restricts experiment evaluation to users who actually encountered or were exposed to the experimental change, reducing noise from unaffected users while maintaining the validity of the randomization through careful implementation.
Multivariate Testing
An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.
Split Testing
The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.
Holdout Testing
An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.