Per-Protocol Analysis

An analysis approach that evaluates experiment results based on which treatment users actually received rather than their original random assignment, providing an estimate of the treatment effect among compliant users but potentially introducing selection bias.

Per-protocol (PP) analysis restricts the comparison to users who actually received the treatment as intended by their assignment. Users who were assigned to treatment but did not engage with it, or who were assigned to control but somehow received the treatment, are either excluded or reassigned. This approach answers a different question than intention-to-treat: ITT asks "what is the effect of assigning users to this treatment?", while PP asks "what is the effect of actually receiving it?". For growth teams, PP analysis is tempting because it estimates the undiluted treatment effect, which is what product managers often want to know. However, PP analysis is susceptible to selection bias: the users who comply with their assignment may differ systematically from those who do not.
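The restriction described above is mechanically simple. A minimal sketch (pandas; the column names and values are hypothetical) of building a per-protocol sample, keeping only users whose received arm matches their assigned arm:

```python
import pandas as pd

# Hypothetical experiment log: one row per user, with assigned and received arms.
df = pd.DataFrame({
    "user_id":  [1, 2, 3, 4],
    "assigned": ["treatment", "treatment", "control", "control"],
    "received": ["treatment", "none",      "control", "treatment"],
})

# Per-protocol sample: drop the treated non-engager (user 2) and the
# control user who somehow received treatment (user 4).
pp_sample = df[df["assigned"] == df["received"]]
print(pp_sample)  # users 1 and 3 only
```

Note that this filtering happens after randomization, which is exactly why the resulting comparison is no longer protected by the random assignment.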

Per-protocol analysis is computed by restricting the sample to compliers and comparing outcomes: tau_PP = E[Y | Received Treatment, Assigned Treatment] - E[Y | Received Control, Assigned Control]. The problem is that compliance is a post-randomization outcome that may be correlated with the treatment effect itself. Users who actively engage with a new feature may be more motivated, tech-savvy, or have higher baseline engagement, all of which correlate with better outcomes regardless of the treatment. This means PP analysis confounds the treatment effect with selection effects, potentially making the treatment appear more effective than it truly is. The bias can go in either direction: if sicker patients drop out of treatment, PP analysis in a clinical trial overestimates efficacy; if more engaged users are more likely to encounter a feature change, PP analysis in a digital experiment overestimates the impact.

Per-protocol analysis may be reported as a secondary analysis alongside the primary ITT analysis when teams want to understand the magnitude of the treatment effect for engaged users. It is particularly relevant when non-compliance rates are high and the ITT estimate is severely diluted, making it difficult to assess whether the treatment itself is effective. Common pitfalls include reporting only the PP estimate without the ITT (which overstates impact and is statistically invalid as the primary analysis), not checking whether compliers and non-compliers differ on observable characteristics (which would indicate selection bias), and using PP as justification for shipping when the ITT is not significant. A better alternative to PP analysis is the instrumental variables approach to estimate the complier average causal effect (CACE), which provides the treatment effect for compliers without the selection bias of PP, under certain assumptions.
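Under those assumptions (exclusion restriction and monotonicity, with random assignment as the instrument), the CACE reduces to a simple Wald/IV calculation: the ITT effect on the outcome divided by the ITT effect on treatment receipt. A minimal sketch (numpy; the data and compliance rate are hypothetical):

```python
import numpy as np

def cace_wald(assigned, received, outcome):
    """Wald/IV estimator of the complier average causal effect:
    ITT effect on the outcome divided by the ITT effect on receipt."""
    assigned = np.asarray(assigned)
    received = np.asarray(received)
    outcome = np.asarray(outcome)
    itt_y = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()
    itt_d = received[assigned == 1].mean() - received[assigned == 0].mean()
    return itt_y / itt_d

# Toy data: 50% compliance in treatment, none in control, true effect 0.2.
rng = np.random.default_rng(1)
n = 100_000
assigned = rng.integers(0, 2, n)
received = (assigned == 1) & (rng.random(n) < 0.5)
outcome = 0.2 * received + rng.normal(0, 1, n)

print(f"CACE: {cace_wald(assigned, received, outcome):.3f}")  # near 0.2
```

Intuitively, the diluted ITT effect (about 0.1 here) is scaled back up by the compliance rate (0.5), recovering the effect for compliers without conditioning on post-randomization behavior.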

Advanced alternatives to simple PP analysis include inverse probability of compliance weighting, which re-weights compliers to represent the full population, and principal stratification, which formally models the latent types of users (compliers, always-takers, never-takers, defiers) and estimates treatment effects within each stratum. For digital experiments where exposure is often partial and graded rather than binary, dose-response models can relate the intensity of treatment exposure to the outcome while using random assignment as an instrument to handle selection. The clinical trials literature has extensively debated ITT vs. PP, and the consensus strongly favors ITT as the primary analysis with PP as supplementary. Digital experimentation teams should follow this convention and reserve PP analysis for exploratory investigation rather than primary decision-making.
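The first of these ideas can be sketched with simulated data where compliance depends only on an observed covariate (the key assumption behind the weighting). For simplicity this sketch uses the true compliance propensity directly; in practice it would be estimated, for example with logistic regression. All quantities are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

baseline = rng.normal(0, 1, n)        # observed pre-experiment engagement
assigned = rng.integers(0, 2, n)

# Compliance depends only on the observed covariate (the key IPW assumption).
p_comply = 1 / (1 + np.exp(-baseline))
complied = (assigned == 1) & (rng.random(n) < p_comply)

outcome = baseline + 0.2 * complied + rng.normal(0, 1, n)
control = assigned == 0

# Naive per-protocol: compliers skew toward high-baseline users, so this is inflated.
pp_naive = outcome[complied].mean() - outcome[control].mean()

# IPW: weight each complier by 1 / P(comply) so compliers stand in for everyone.
pp_ipw = (np.average(outcome[complied], weights=1 / p_comply[complied])
          - outcome[control].mean())

print(f"naive PP: {pp_naive:.3f}")    # biased upward
print(f"IPW PP:   {pp_ipw:.3f}")      # close to the true effect of 0.2
```

The weighting removes the selection on the observed covariate, but unlike the IV approach it offers no protection against selection on unobserved traits.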

Related Terms

Intention-to-Treat

An analysis principle that evaluates experiment results based on the original random assignment of users to treatment groups, regardless of whether they actually received or engaged with the treatment, preserving the validity of randomization.

Triggered Analysis

An analysis technique that restricts experiment evaluation to users who actually encountered or were exposed to the experimental change, reducing noise from unaffected users while maintaining the validity of the randomization through careful implementation.

Split Testing

The practice of randomly dividing users into two or more groups and exposing each group to a different version of a product experience to measure which version performs better on a target metric, commonly known as A/B testing.

Multivariate Testing

An experimentation method that simultaneously tests multiple variables and their combinations to determine which combination of changes produces the best outcome, unlike A/B testing which typically varies a single element at a time.

Holdout Testing

An experimental design where a small percentage of users are permanently excluded from receiving a new feature or set of features, serving as a long-term control group to measure the cumulative impact of product changes over time.

Power Analysis

A statistical calculation performed before an experiment to determine the minimum sample size required to detect a meaningful effect with a specified probability, balancing the risk of false negatives against practical constraints like traffic and experiment duration.