AI & Machine Learning

Causal Inference for AI Feature Adoption: A Propensity Score Guide in Python

2026-05-01 22:01:38

Introduction

When your product launches an opt-in AI feature—like an agent mode or smart reply toggle—comparing users who enable it to those who don't gives a misleading picture. Volunteers are not a random sample; heavy users opt in far more often. This opt-in trap confounds the feature's true effect with pre-existing differences between user segments. Propensity score methods fix this by reweighting or matching the groups so they resemble a randomized experiment. This guide walks through the full pipeline using a synthetic SaaS dataset (50,000 users) where the true causal effect is known. You will estimate the effect, quantify uncertainty, and learn when the method silently breaks.

Causal Inference for AI Feature Adoption: A Propensity Score Guide in Python
Source: www.freecodecamp.org

What You Need

A Python environment with NumPy, pandas, and scikit-learn (the usual stack for this kind of analysis), plus the synthetic SaaS dataset of 50,000 users described in the introduction. Basic familiarity with logistic regression helps.

Step 1: Estimate the Propensity Score

The propensity score is the probability of opting in given covariates. Use logistic regression to model treatment ~ X1 + X2 + X3 + X4 + X5. Fit the model on the entire dataset, then predict probabilities for each user.

Why this works: the score summarizes all observed differences between opt-in and control users into a single number, enabling later weighting or matching.
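As a concrete sketch, here is Step 1 on simulated data. The column names (X1 through X5), the opt-in rule, and the use of scikit-learn are illustrative assumptions, not the article's exact notebook:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data: heavy users (high X1) opt in more often,
# which is exactly the opt-in trap described above.
rng = np.random.default_rng(0)
n = 50_000
cols = [f"X{i}" for i in range(1, 6)]
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=cols)
opt_in_prob = 1 / (1 + np.exp(-(df["X1"] + 0.5 * df["X2"])))
df["treated"] = (rng.random(n) < opt_in_prob).astype(int)

# Fit treatment ~ X1 + ... + X5 on the entire dataset,
# then predict a probability (the propensity score) for every user.
model = LogisticRegression(max_iter=1000).fit(df[cols], df["treated"])
df["pscore"] = model.predict_proba(df[cols])[:, 1]
```

If the model captures the selection process, treated users should show systematically higher scores than controls.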

Step 2: Inverse-Probability Weighting

With propensity scores in hand, create weights so that each group resembles the whole population. For treated users, weight = 1 / score; for control users, weight = 1 / (1 - score). Then compute the weighted difference in mean outcomes:

ATE = weighted_mean(treated_outcome) - weighted_mean(control_outcome)

This removes bias from observable covariates, but only if the propensity model is correctly specified. Cap extreme weights (e.g., at 10) to avoid a variance blow-up from users whose scores sit near 0 or 1.

In the notebook, the IPW estimate typically lands near the true ATE (around +12 tasks) when the model is well calibrated.
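A minimal end-to-end sketch of the IPW estimate, on synthetic data where the true effect is built in as +12 tasks (the data-generating process here is an illustrative assumption, not the article's notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data with a known treatment effect of +12 tasks.
rng = np.random.default_rng(42)
n = 50_000
X = rng.normal(size=(n, 5))
t = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))).astype(int)
y = 30 + 5 * X[:, 0] + 12 * t + rng.normal(scale=5, size=n)  # true ATE = +12

pscore = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
w = np.where(t == 1, 1 / pscore, 1 / (1 - pscore))
w = np.clip(w, None, 10)  # cap extreme weights to avoid variance blow-up

naive = y[t == 1].mean() - y[t == 0].mean()  # confounded: overstates the effect
ate = (np.average(y[t == 1], weights=w[t == 1])
       - np.average(y[t == 0], weights=w[t == 0]))
```

The naive difference is inflated because opt-in users have higher X1, which also raises the outcome; the weighted difference should land much closer to +12.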

Step 3: Nearest-Neighbor Matching

Match each treated user to one or more control users with the closest propensity score (within a caliper, e.g., 0.01). Then compute the average difference in outcomes within matched pairs.

Implementation:
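A sketch using scikit-learn's NearestNeighbors for 1-to-1 matching with replacement on the same kind of synthetic setup (the data-generating details are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Illustrative synthetic data with a known +12 effect.
rng = np.random.default_rng(42)
n = 50_000
X = rng.normal(size=(n, 5))
t = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))).astype(int)
y = 30 + 5 * X[:, 0] + 12 * t + rng.normal(scale=5, size=n)

pscore = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

# 1-NN matching (with replacement) on the propensity score, caliper 0.01.
nn = NearestNeighbors(n_neighbors=1).fit(pscore[t == 0].reshape(-1, 1))
dist, idx = nn.kneighbors(pscore[t == 1].reshape(-1, 1))
within = dist.ravel() <= 0.01  # drop treated users with no close control
att = (y[t == 1][within] - y[t == 0][idx.ravel()][within]).mean()
```

Treated users whose nearest control lies outside the caliper are dropped, which is one reason the matched estimate can drift from the ATE.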

This produces the Average Treatment Effect on the Treated (ATT). For our synthetic data, the ATT often differs slightly from the ATE: it averages the effect only over users who actually opted in, and treated users with extreme scores are hard to match and may be dropped by the caliper.

Step 4: Check Covariate Balance

After weighting or matching, check whether the groups now look similar on covariates. For each covariate, compute the standardized mean difference (SMD) between treated and control groups after adjustment; an SMD below 0.1 is conventionally considered balanced.

Also run a t-test or compute the variance ratio. If balance is poor, revisit the propensity model (add interactions, splines, or use a more flexible classifier like gradient boosting). A love plot (displaying SMD before and after) helps communicate the improvement.

In the notebook, IPW achieves SMDs below 0.05 for all covariates; matching is slightly worse but still acceptable.
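The SMD itself is a short computation: weighted mean difference divided by the pooled standard deviation. A sketch on the same illustrative synthetic setup as earlier (the data-generating process is an assumption for demonstration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def smd(x, t, w):
    """Standardized mean difference between treated and control, with weights."""
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    v1 = np.average((x[t == 1] - m1) ** 2, weights=w[t == 1])
    v0 = np.average((x[t == 0] - m0) ** 2, weights=w[t == 0])
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)

# Illustrative confounded data: X1 drives opt-in, so it starts imbalanced.
rng = np.random.default_rng(42)
n = 50_000
X = rng.normal(size=(n, 5))
t = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))).astype(int)

pscore = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
w_ipw = np.clip(np.where(t == 1, 1 / pscore, 1 / (1 - pscore)), None, 10)
w_raw = np.ones(n)

before = smd(X[:, 0], t, w_raw)  # large: groups differ strongly on X1
after = smd(X[:, 0], t, w_ipw)   # should shrink well below 0.1
```

Computing `before` and `after` for every covariate gives you exactly the numbers a love plot displays.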

Step 5: Bootstrap Confidence Intervals

Propensity score methods produce point estimates, but you need uncertainty bounds. Use bootstrap resampling (e.g., 1000 repetitions):

  1. Draw a sample of size N with replacement from the original data.
  2. Estimate propensity scores and the treatment effect (IPW or matching) on that sample.
  3. Record the estimate.
  4. After all repetitions, take the 2.5th and 97.5th percentiles as the confidence interval.

This yields robust intervals even with complex weighting. The true effect (e.g., +12 tasks) should fall within the 95% CI if the model is correctly specified.
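The four steps above can be sketched as follows. Note that the propensity model is refit inside every repetition, so the interval reflects uncertainty in the scores themselves; the sample size and repetition count here are scaled down for speed, and the data-generating process is again an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, t, y):
    """Refit the propensity model and recompute the IPW estimate."""
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    w = np.clip(np.where(t == 1, 1 / ps, 1 / (1 - ps)), None, 10)
    return (np.average(y[t == 1], weights=w[t == 1])
            - np.average(y[t == 0], weights=w[t == 0]))

# Illustrative synthetic data with a known +12 effect (smaller n for speed).
rng = np.random.default_rng(7)
n = 5_000
X = rng.normal(size=(n, 5))
t = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))).astype(int)
y = 30 + 5 * X[:, 0] + 12 * t + rng.normal(scale=5, size=n)

estimates = []
for _ in range(200):  # use ~1000 repetitions in practice
    idx = rng.integers(0, n, size=n)  # draw N rows with replacement
    estimates.append(ipw_ate(X[idx], t[idx], y[idx]))
ci_low, ci_high = np.percentile(estimates, [2.5, 97.5])
```

Resampling whole rows keeps each covariate-treatment-outcome triple intact, which is what makes the interval valid for the full pipeline rather than just the final comparison.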

Tips and Pitfalls

When propensity scores fail:

  - Unobserved confounders: the method only balances covariates you actually measure; hidden differences between adopters and non-adopters still bias the estimate.
  - Poor overlap: if some users have scores near 0 or 1, they have no comparable counterparts, weights explode, and trimming only partially helps.
  - Misspecified models: a badly calibrated propensity model leaves residual imbalance; the SMD check in Step 4 is how you catch this.

Best practices:

  - Inspect covariate balance before and after adjustment, and report both.
  - Trim or cap extreme weights, and disclose the threshold when you report results.
  - Bootstrap the entire pipeline, including the propensity model fit, rather than only the final comparison.

Propensity score methods are powerful but not magical. When used correctly, they transform noisy opt-in comparisons into credible causal estimates, helping you decide whether to ship that AI feature to everyone.
