Causal Inference
I avoided causal inference for longer than I'd like to admit. Every time someone said "correlation is not causation," I'd nod sagely and move on, as if repeating the phrase was the same as understanding it. I could train models that predicted outcomes with impressive accuracy. I could tune hyperparameters, cross-validate, deploy to production. And then one day a product manager asked me: "Okay, but if we actually show the banner to these users, will they buy more?" I stared at my gradient-boosted trees. They had no answer. They had never been asked that question. This post is the deep dive I finally took to answer it.
Causal inference is the set of mathematical tools for moving from "X and Y move together" to "changing X will change Y." The field sits at the intersection of statistics, economics, epidemiology, and computer science. Two major frameworks dominate: Judea Pearl's structural causal models (developed through the 1990s and 2000s) and Donald Rubin's potential outcomes framework (formalized in the 1970s). They attack the same problem from different angles, and understanding both gives you genuine working knowledge.
Before we start, a heads-up. We're going to be talking about probabilities, conditional distributions, and a little bit of graph theory, but you don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.
This isn't a short journey, but I hope you'll be glad you came.
The Two Realities You Can Never Both See
A Tiny E-Commerce Experiment
What We Actually Want to Estimate
The Gold Standard: Randomized Experiments
Rest Stop
When You Can't Randomize: Observational Studies
Propensity Score Matching
Inverse Probability Weighting
Difference-in-Differences
Instrumental Variables
Regression Discontinuity
Pearl's Structural Causal Models
The do-Operator and Graph Surgery
The Backdoor Criterion
Simpson's Paradox
Causal Discovery: Learning the Graph
Second Rest Stop
Machine Learning Meets Causality
Causal Inference in Tech
Wrapping Up
Resources and Credits
The Two Realities You Can Never Both See
Imagine you have a headache. You take aspirin. An hour later, the headache is gone. Did the aspirin cause the relief?
To truly know, you'd need to compare two parallel universes: one where you took the aspirin, and one where you didn't, with everything else identical — same headache severity, same stress level, same lunch, same weather. The difference in your headache between those two universes is the causal effect of the aspirin on you, specifically.
This thought experiment is the foundation of the potential outcomes framework, developed by Donald Rubin. For any individual and any treatment, there are two potential outcomes: Y(1), the outcome if treated, and Y(0), the outcome if not treated. The causal effect for that person is Y(1) − Y(0). The problem, of course, is that you can only ever observe one of these. You took the aspirin or you didn't. You can never visit both universes.
This is called the fundamental problem of causal inference, and it's worth sitting with for a moment because everything else in this entire field is a response to it. Every method we'll build, every assumption we'll make, every clever trick — all of it exists because we can never directly see both Y(1) and Y(0) for the same person at the same time.
I'll be honest — when I first encountered this framing, I thought it was philosophy, not math. "Parallel universes? Come on." But the more I worked with it, the more I realized it's the most practical framing imaginable. It forces you to define exactly what you're comparing, which is where most causal reasoning goes wrong in the first place.
Think of it like a courtroom. A lawyer asking "did the defendant's action cause the harm?" is really asking: "in the parallel universe where the defendant didn't act, would the harm still have happened?" This is the counterfactual or "but-for" test, and it maps directly to Y(1) − Y(0). We'll keep returning to this courtroom analogy because it turns out the legal intuition tracks the mathematics remarkably well.
A Tiny E-Commerce Experiment
To make everything concrete, let's build a running example. Suppose you work at a small online store. You want to know: does showing a promotional banner cause users to spend more?
We start with six users. That's it. Six people, each with two potential outcomes — what they'd spend if they see the banner (treated), and what they'd spend if they don't (untreated). In our parallel-universe thought experiment, we can peek at both:
User Y(0): No Banner Y(1): Banner Causal Effect
───── ────────────────── ────────────── ─────────────
Alice $40 $55 +$15
Bob $20 $22 +$2
Carol $60 $58 −$2
Dave $10 $30 +$20
Eve $35 $40 +$5
Frank $50 $50 $0
In reality, we'd never have this table. We're playing omniscient narrator here to build understanding. Each user has a personal causal effect: Alice benefits a lot from the banner (+$15), Carol actually spends slightly less with it (−$2), and Frank is completely unaffected. The effect is heterogeneous — different for different people. That's a crucial insight. Asking "does the banner work?" is a bit like asking "do shoes fit?" It depends on who's wearing them.
Now here's the fundamental problem in action. In the real world, each user either sees the banner or doesn't. Let's say Alice, Dave, and Eve see the banner, while Bob, Carol, and Frank don't. Our observed data looks like this:
User Treatment Observed Spending (The other outcome is ? forever)
───── ───────── ──────────────── ────────────────────────────────
Alice Banner $55 Y(0) = ?
Bob No Banner $20 Y(1) = ?
Carol No Banner $60 Y(1) = ?
Dave Banner $30 Y(0) = ?
Eve Banner $40 Y(0) = ?
Frank No Banner $50 Y(1) = ?
The question marks are the parallel-universe outcomes we can never observe. Every method in causal inference is, in some way, a strategy for filling in those question marks — or at least estimating what they'd average out to across a population.
What We Actually Want to Estimate
Since individual causal effects are unobservable, we work with averages. Three quantities come up constantly, and they answer subtly different questions.
The Average Treatment Effect (ATE) is the average causal effect across the entire population. In our tiny example, ATE = (15 + 2 + (−2) + 20 + 5 + 0) / 6 = $6.67. It answers: "if we rolled out the banner to everyone, how much more would the average person spend?" This is the broadest, most democratic measure — it weights every person equally regardless of whether they'd actually receive the treatment.
The Average Treatment Effect on the Treated (ATT) focuses only on the people who actually received the treatment. Among our treated users (Alice, Dave, Eve), ATT = (15 + 20 + 5) / 3 = $13.33. This answers a different question: "for the users who saw the banner, how much did it actually help them?" Notice ATT ≠ ATE here — the treated group happened to include users who benefited more than average. In practice, this is common: people who seek treatment often differ from those who don't, and the treatment may affect them differently.
The Conditional Average Treatment Effect (CATE) is the average effect for a specific subgroup defined by observable characteristics. If we grouped our users by spending tier — say, high spenders (Alice, Carol, Frank) versus low spenders (Bob, Dave, Eve) — we could compute CATE for each tier. This is where things get exciting for personalization: if CATE varies dramatically across groups, you can target the treatment at people who'd actually benefit. CATE estimation is what powers uplift modeling, personalized medicine, and targeted marketing. We'll come back to it.
I'm still developing my intuition for when ATT is the right target versus ATE. In many product settings, you want ATT because you're deciding whether to keep a feature for users who already have it. In policy settings, you often want ATE because you're deciding whether to roll out an intervention to everyone. The distinction matters more than most textbooks suggest.
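If it helps to see the arithmetic as code, here is a tiny sketch on the omniscient table. It's purely illustrative, with the high/low spending-tier split from the CATE paragraph hard-coded by hand.
import numpy as np

y0 = np.array([40, 20, 60, 10, 35, 50])   # Alice, Bob, Carol, Dave, Eve, Frank: no banner
y1 = np.array([55, 22, 58, 30, 40, 50])   # same users: with banner
treated = np.array([1, 0, 0, 1, 1, 0])    # who actually saw the banner (Alice, Dave, Eve)
high_spender = np.array([1, 0, 1, 0, 0, 1])  # Alice, Carol, Frank

effects = y1 - y0
print(f"ATE: ${effects.mean():.2f}")                              # averages over everyone
print(f"ATT: ${effects[treated == 1].mean():.2f}")                # averages over the treated only
print(f"CATE (high spenders): ${effects[high_spender == 1].mean():.2f}")
print(f"CATE (low spenders):  ${effects[high_spender == 0].mean():.2f}")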
The Gold Standard: Randomized Experiments
Here is the simplest, most powerful idea in all of causal inference. If you want to know whether the banner causes higher spending, randomly assign who sees it.
Why does randomization work? Return to our six users. If we flip a fair coin for each user to decide treatment, the coin doesn't know or care about Alice's spending habits, Bob's mood, or Carol's income. The coin is independent of everything. That independence is the key — it means that, on average, the treated group and the control group will be similar in every way except the treatment. Every confounding variable — the ones you measured and the ones you didn't even think of — gets balanced by chance.
Let's trace through this. Suppose the coin assigns Alice, Carol, and Frank to treatment (banner), and Bob, Dave, and Eve to control (no banner). We observe:
Treated group: Alice=$55, Carol=$58, Frank=$50 → avg = $54.33
Control group: Bob=$20, Dave=$10, Eve=$35 → avg = $21.67
Difference in means = $54.33 − $21.67 = $32.67
That estimate ($32.67) is quite far from the true ATE of $6.67, and that's because six people is a ludicrously small sample. Randomization guarantees unbiased estimates, not precise ones. With 6,000 users instead of 6, the law of large numbers kicks in and the difference in means converges to the true ATE. This is why A/B tests need sample size calculators — not because the method is flawed, but because randomness is noisy in small doses.
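To make "unbiased but noisy" concrete, here is a small simulation of my own: repeat the coin-flip assignment many times on our six users and look at the spread of the difference-in-means estimates.
import numpy as np

# Potential outcomes from the omniscient table (Alice..Frank)
y0 = np.array([40, 20, 60, 10, 35, 50])
y1 = np.array([55, 22, 58, 30, 40, 50])

rng = np.random.default_rng(0)
estimates = []
for _ in range(100_000):
    assign = rng.permutation([1, 1, 1, 0, 0, 0])          # randomly treat 3 of the 6 users
    observed = np.where(assign == 1, y1, y0)              # we only see one outcome per user
    estimates.append(observed[assign == 1].mean() - observed[assign == 0].mean())

print(f"Average estimate over many randomizations: ${np.mean(estimates):.2f}")  # ≈ true ATE
print(f"Spread of a single estimate (std dev):     ${np.std(estimates):.2f}")   # large with n = 6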
A randomized controlled trial (RCT), which is the formal name for this approach, is the gold standard for exactly this reason. Randomization physically implements what Pearl calls the do-operator — a concept we'll build properly later. When you randomly assign treatment, you are literally reaching into the system and setting the treatment variable, severing it from everything that might have influenced it naturally.
So if randomization is this powerful, why do we need anything else? Four reasons keep coming up in practice. First, randomization can be expensive — running an A/B test means actually showing (or withholding) the treatment for weeks. Second, it can be unethical — you can't randomly assign smoking to study lung cancer. Third, it's backward-looking blind — you have years of historical data sitting in a warehouse, and you can't go back in time to randomize. Fourth, it can be slow — if you have forty causal questions, running forty sequential A/B tests takes years.
This is where observational causal inference comes in. The entire rest of this journey is about getting A/B-test-quality answers without the A/B test. It requires stronger assumptions, but when those assumptions hold, it works.
Rest Stop
Congratulations on making it this far. You can stop here if you want.
You now have a working mental model: every person has two potential outcomes, we can only see one, the causal effect is the difference, and randomization is the cleanest way to estimate it. That's not a toy understanding — it's the foundation that every method in this field is built on. If someone asks you "why can't we just use the correlation?" you can now explain it in terms of potential outcomes and confounding, which is a far more precise answer than the hand-wavy "correlation isn't causation" platitude.
But the model doesn't tell the complete story. What do you do when randomization isn't possible? What happens when the world hands you messy observational data and a causal question that needs an answer by next Friday?
The short version: there are a half-dozen clever methods that exploit different structural assumptions about your data — matching on similar users, reweighting the population, finding natural experiments, exploiting arbitrary cutoffs. They're not as clean as randomization, but they're often all you've got. There. You're about 30% of the way through the full picture.
But if the discomfort of not knowing what's underneath is nagging at you, read on.
When You Can't Randomize: Observational Studies
In an observational study, you don't control who gets treated. The data arrives pre-cooked — users who saw the banner might have already been more engaged, or been visiting from a particular referral source, or been shopping on a weekend. Any of these factors could independently affect spending, which means the difference between the treated and untreated groups reflects a mix of the banner's true effect and these pre-existing differences.
This mixing is called confounding. A confounder is any variable that influences both the treatment and the outcome. In our banner example, suppose that users with higher prior engagement are both more likely to be shown the banner (the recommendation algorithm favors them) and more likely to spend money (they're already fans). Prior engagement is a confounder — it creates a spurious association between the banner and spending that has nothing to do with the banner's causal effect.
Every observational method we'll cover is a different strategy for neutralizing confounders. They each exploit a different structural assumption about the data. None of them are free — every one requires you to believe something about the world that you cannot fully verify from data alone. Choosing the right method means choosing the assumption you're most comfortable defending.
Propensity Score Matching
The idea behind propensity score matching is almost embarrassingly intuitive. If you can't randomize, fake it.
Here's the logic. The reason randomization works is that treated and control groups end up similar. In observational data, they're not similar — treated users might be younger, richer, more engaged. But what if we could construct a comparison by pairing each treated user with an untreated user who looks almost identical on all the variables we can measure?
The challenge is dimensionality. If you have 20 covariates, finding exact matches across all 20 is nearly impossible. Paul Rosenbaum and Donald Rubin showed in 1983 that you can collapse all those dimensions into a single number: the propensity score, defined as e(X) = P(Treatment = 1 | X). It's the probability of being treated, given your observable characteristics. They proved that if you match on this single score, the balance across all the covariates comes for free.
Let's walk through this with our banner example, now scaled up. Imagine 2,000 users, roughly half of whom end up seeing the banner. The simulation below uses two covariates, prior spending and days since last visit (a real dataset would add device type, referral source, and more). We fit a logistic regression to predict banner exposure from these covariates, which gives each user a propensity score between 0 and 1.
import numpy as np
from scipy.spatial import KDTree
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 2000

# Simulated covariates; the true propensity below drives assignment but would be hidden in real data
prior_spending = rng.normal(50, 20, n)
days_since_visit = rng.exponential(10, n)
true_propensity = 1 / (1 + np.exp(-(0.03 * prior_spending - 0.05 * days_since_visit - 1)))
treatment = rng.binomial(1, true_propensity)

# Outcome: spending depends on the covariates plus a true banner effect of $8
true_effect = 8.0
spending = (
    0.6 * prior_spending
    - 0.3 * days_since_visit
    + true_effect * treatment
    + rng.normal(0, 5, n)
)

naive_diff = spending[treatment == 1].mean() - spending[treatment == 0].mean()

# Estimate the propensity score with logistic regression, as we would with real data
X = np.column_stack([prior_spending, days_since_visit])
propensity = LogisticRegression(max_iter=1000).fit(X, treatment).predict_proba(X)[:, 1]

# Match each treated user to the control user with the nearest propensity score
treated_idx = np.where(treatment == 1)[0]
control_idx = np.where(treatment == 0)[0]
tree = KDTree(propensity[control_idx].reshape(-1, 1))
_, match_idx = tree.query(propensity[treated_idx].reshape(-1, 1))
matched_effect = (spending[treated_idx] - spending[control_idx[match_idx]]).mean()

print(f"True causal effect: ${true_effect:.2f}")
print(f"Naive difference: ${naive_diff:.2f}")
print(f"Propensity score match: ${matched_effect:.2f}")
The naive difference is biased because high-spenders are both more likely to see the banner and more likely to spend. The propensity-matched estimate gets much closer to the truth by comparing each treated user to a control user with a similar probability of treatment — effectively asking "among users who were equally likely to see the banner, what's the difference in spending between those who did and those who didn't?"
The limitation is sitting right there in the definition: the propensity score is based on observable covariates. If there's an unobserved confounder — something that affects both treatment and outcome but isn't in your dataset — matching can't fix it. This assumption, that all confounders are observed, is called unconfoundedness or ignorability, and it's the price of admission for this method. It's unverifiable from data. You defend it with domain knowledge and hope.
Inverse Probability Weighting
Propensity score matching pairs up individuals. Inverse probability weighting takes a different tack — instead of discarding unmatched data, it reweights the entire population so that the treated and untreated groups "look like" a randomized experiment.
The core idea: if a treated user has a propensity score of 0.9 (they were very likely to be treated), they don't tell us much — most people like them would have been treated anyway. But a treated user with a propensity score of 0.1 is informative: they got treated despite being the type of user who usually doesn't. We upweight that rare, informative user by giving them a weight of 1/e(X) = 1/0.1 = 10. Similarly, on the control side, we weight by 1/(1 − e(X)).
Here's the formula in words: the causal effect is the weighted average of outcomes in the treated group (weighted by 1/e(X)) minus the weighted average in the control group (weighted by 1/(1−e(X))). This is the Horvitz-Thompson estimator, and it creates a pseudo-population where treatment assignment is independent of covariates — mimicking randomization.
# Upweight treated users who were unlikely to be treated, and control users who were
# likely to be treated; np.average normalizes the weights within each group
weights_treated = 1 / propensity[treated_idx]
weights_control = 1 / (1 - propensity[control_idx])
ipw_effect = (
    np.average(spending[treated_idx], weights=weights_treated)
    - np.average(spending[control_idx], weights=weights_control)
)
print(f"IPW estimate: ${ipw_effect:.2f}")
IPW uses all the data (no discarding), which is an advantage over matching. But it has its own weakness: if propensity scores get very close to 0 or 1, the weights explode. A user with e(X) = 0.01 gets a weight of 100, which means a single user can dominate your entire estimate. Practitioners typically trim or stabilize the weights, but there's a tension between bias and variance that never fully resolves. I still find IPW the hardest of these methods to get right in practice.
A natural follow-up: what if you combined IPW and regression adjustment, so that if either model (the propensity model or the outcome model) is wrong, you're still okay? That's exactly what doubly robust estimators do, and they've become the default recommendation in modern causal inference work. We'll see them again when we get to Double ML.
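Here is a minimal doubly robust sketch in that spirit, using the AIPW (augmented IPW) form and reusing the simulated banner data and propensity scores from the matching example above. The linear outcome models and the names mu1_hat and mu0_hat are my own illustrative choices, not any particular library's API.
from sklearn.linear_model import LinearRegression

# Doubly robust (AIPW) sketch: combine outcome-model predictions with IPW-style residual
# corrections; if either the propensity model or the outcome model is right, the estimate
# is consistent
X = np.column_stack([prior_spending, days_since_visit])
mu1 = LinearRegression().fit(X[treatment == 1], spending[treatment == 1])
mu0 = LinearRegression().fit(X[treatment == 0], spending[treatment == 0])
mu1_hat, mu0_hat = mu1.predict(X), mu0.predict(X)

aipw = np.mean(
    treatment * (spending - mu1_hat) / propensity + mu1_hat
) - np.mean(
    (1 - treatment) * (spending - mu0_hat) / (1 - propensity) + mu0_hat
)
print(f"Doubly robust (AIPW) estimate: ${aipw:.2f}")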
Difference-in-Differences
Propensity scores and IPW both try to control for confounders by measuring them. Difference-in-differences takes a completely different approach: it controls for confounders by exploiting time.
Here's the setup. Imagine our online store launches the promotional banner in January, but only for users in region A. Users in region B don't get it. We have spending data for both regions from before (December) and after (January) the launch.
Group           Before (Dec)   After (Jan)   Change
─────────────   ────────────   ───────────   ──────
Region A        $42            $56           +$14
(got banner)
Region B        $38            $46           +$8
(no banner)
Region A's spending went up by $14, but region B's went up by $8 too — and they never saw the banner. Maybe it's the holiday season, maybe the economy improved, maybe the store released better products. Whatever the reason, $8 of region A's increase would have happened anyway. The difference-in-differences estimate strips that out: $14 − $8 = $6. That's our estimate of the banner's causal effect.
The name captures the math exactly: we take the difference in outcomes over time (before vs. after) and then take the difference of those differences across groups (treated vs. control).
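Here is a minimal simulation of that arithmetic, with the region effects, the common trend, and a $6 banner effect all chosen to mirror the table above (the setup is an assumption of the toy model, not real data):
import numpy as np

rng = np.random.default_rng(0)
n = 4000
region_a = rng.binomial(1, 0.5, n)          # 1 = region A (gets the banner in January)
post = rng.binomial(1, 0.5, n)              # 1 = January observation
treated = region_a * post                   # banner only active for region A after launch
spending = (
    38 + 4 * region_a                       # permanent level difference between regions
    + 8 * post                              # common seasonal trend hitting both regions
    + 6 * treated                           # true causal effect of the banner
    + rng.normal(0, 5, n)
)

# DiD estimate: (A after - A before) - (B after - B before)
did = (
    spending[(region_a == 1) & (post == 1)].mean() - spending[(region_a == 1) & (post == 0)].mean()
) - (
    spending[(region_a == 0) & (post == 1)].mean() - spending[(region_a == 0) & (post == 0)].mean()
)
print(f"DiD estimate: ${did:.2f}   (true effect: $6.00)")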
The crucial assumption is called parallel trends: in the absence of treatment, the two groups would have followed the same trajectory over time. Not the same level — region A and region B can have permanently different average spending — but the same change. If region A was about to experience a boom regardless of the banner (maybe a new warehouse opened nearby), the parallel trends assumption fails and our estimate is wrong.
Difference-in-differences was popularized by David Card and Alan Krueger's famous 1994 study of minimum wage effects in New Jersey and Pennsylvania — work that helped earn Card the 2021 Nobel Prize in Economics. The method is beautiful because it doesn't require you to measure individual-level confounders. Anything that is constant within each group over time (regional culture, demographics, store preferences) gets differenced away. Anything that changes over time but affects both groups equally (macro trends, seasonality) also gets differenced away. Only a time-varying, group-specific confounder can break it.
Instrumental Variables
Sometimes you can't measure the confounders (ruling out propensity methods), and you don't have the before-after structure that difference-in-differences requires. Are you stuck? Not necessarily — if you can find an instrumental variable.
An instrument is a variable Z that meets three conditions. First, Z affects the treatment (it's relevant). Second, Z affects the outcome only through the treatment, not through any other path (the exclusion restriction). Third, Z is independent of the confounders (it's exogenous).
This sounds abstract, so let's make it vivid. Suppose you want to know if attending an outdoor concert causes happiness. People who attend concerts might already be happier (confounder: personality). But rain is a plausible instrument: rain affects whether you attend (relevant), rain plausibly doesn't affect your happiness except through preventing attendance (exclusion), and rain is random with respect to your personality (exogenous).
The estimation works through two-stage least squares. In the first stage, you predict attendance from rain (plus controls). In the second stage, you predict happiness from the predicted attendance (the part that's driven purely by rain). Because that predicted component is driven by something exogenous, the confounders are neutralized.
from numpy.linalg import lstsq

rng = np.random.default_rng(42)
n = 5000
personality = rng.normal(0, 1, n)              # unobserved confounder
rain = rng.binomial(1, 0.3, n)                 # instrument
attendance = 0.6 * personality - 0.8 * rain + rng.normal(0, 0.5, n)
happiness = 0.5 * attendance + 0.7 * personality + rng.normal(0, 0.3, n)

# Naive regression of happiness on attendance (confounded by personality)
naive_slope = np.polyfit(attendance, happiness, 1)[0]

# First stage: predict attendance from the instrument (rain) plus an intercept
A_first = np.column_stack([rain, np.ones(n)])
beta_first, _, _, _ = lstsq(A_first, attendance, rcond=None)
attendance_hat = A_first @ beta_first

# Second stage: regress happiness on the predicted (rain-driven) attendance
A_second = np.column_stack([attendance_hat, np.ones(n)])
beta_second, _, _, _ = lstsq(A_second, happiness, rcond=None)

print(f"True causal effect: 0.500")
print(f"Naive regression: {naive_slope:.3f}")
print(f"IV estimate: {beta_second[0]:.3f}")
The naive regression overestimates the effect because personality confounds both attendance and happiness. The IV estimate, by isolating only the variation in attendance that's driven by rain, recovers something much closer to the true effect.
Finding good instruments is notoriously difficult. The exclusion restriction — that Z affects Y only through treatment — is untestable from data. You defend it with logic and domain expertise, and reasonable people can disagree. This is why instrumental variable papers tend to generate more heated seminar discussions than almost any other econometric technique. I'll be honest: every time I see someone claim they've found a great instrument, my first instinct is skepticism, and that instinct has served me well.
Regression Discontinuity
Regression discontinuity exploits one of the most common features of real-world treatment assignment: arbitrary cutoffs.
Back to our e-commerce example. Suppose the banner is only shown to users with a loyalty score of 70 or above. A user with a score of 71 sees the banner. A user with a score of 69 does not. Are these two users fundamentally different? Not really — they're nearly identical in everything except that one of them happened to land on the right side of an arbitrary line.
That's the insight behind regression discontinuity design (RDD). Near the cutoff, treatment assignment is essentially random, because tiny variations in the assignment variable (loyalty score) are driven by noise rather than meaningful differences. So we can estimate the causal effect by comparing outcomes right above and right below the threshold.
rng = np.random.default_rng(42)
n = 3000
loyalty = rng.normal(70, 15, n)
treatment = (loyalty >= 70).astype(int)
true_effect = 10.0
spending = 0.5 * loyalty + true_effect * treatment + rng.normal(0, 5, n)

# Local linear RD: fit a separate line on each side of the cutoff (within a bandwidth)
# and compare the two fitted values at the cutoff itself, so the slope in the running
# variable doesn't leak into the estimate
cutoff, bandwidth = 70, 5
near = np.abs(loyalty - cutoff) < bandwidth
above = near & (loyalty >= cutoff)
below = near & (loyalty < cutoff)
fit_above = np.polyfit(loyalty[above] - cutoff, spending[above], 1)
fit_below = np.polyfit(loyalty[below] - cutoff, spending[below], 1)
rd_estimate = fit_above[1] - fit_below[1]      # difference of intercepts at the cutoff
print(f"True effect at cutoff: ${true_effect:.2f}")
print(f"RD estimate: ${rd_estimate:.2f}")
This is called a sharp regression discontinuity because treatment switches from 0 to 1 exactly at the cutoff. In a fuzzy design, the cutoff increases the probability of treatment but doesn't determine it perfectly — maybe some users below 70 still see the banner through a different channel. The fuzzy case uses instrumental variables (with the cutoff as the instrument), which connects these two methods elegantly.
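Here is a rough sketch of the fuzzy case under assumed compliance rates: "above the cutoff" acts as the instrument, and the estimate is the jump in spending at the cutoff divided by the jump in banner take-up, a simple Wald-style ratio rather than a full two-stage least squares.
rng = np.random.default_rng(7)
n = 5000
loyalty = rng.normal(70, 15, n)
above = (loyalty >= 70).astype(int)
banner = rng.binomial(1, np.where(above == 1, 0.8, 0.15))   # imperfect compliance (assumed rates)
spending = 0.5 * loyalty + 10.0 * banner + rng.normal(0, 5, n)

def jump_at_cutoff(x, y, cutoff=70, bandwidth=5):
    # Local linear fit on each side of the cutoff; return the gap between the two intercepts
    near = np.abs(x - cutoff) < bandwidth
    hi, lo = near & (x >= cutoff), near & (x < cutoff)
    return np.polyfit(x[hi] - cutoff, y[hi], 1)[1] - np.polyfit(x[lo] - cutoff, y[lo], 1)[1]

# Wald-style fuzzy RD: jump in spending divided by jump in banner take-up at the cutoff
fuzzy_rd = jump_at_cutoff(loyalty, spending) / jump_at_cutoff(loyalty, banner)
print(f"Fuzzy RD estimate: ${fuzzy_rd:.2f}   (true effect: $10.00)")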
The limitation: RDD gives you a causal estimate at the cutoff, not for the entire population. The effect of the banner on users with loyalty scores of 70 might be very different from its effect on users with scores of 30 or 95. You're estimating a local effect, and extrapolating beyond the cutoff requires additional assumptions.
Pearl's Structural Causal Models
Everything up to this point has been largely in the Rubin tradition — potential outcomes, treated-vs-control comparisons, estimation strategies. Now we switch to Judea Pearl's framework, which approaches causation from a completely different angle: graphs.
A Structural Causal Model (SCM) has three components. First, a set of structural equations: each variable is defined as a function of its direct causes plus independent noise. Second, a directed acyclic graph (DAG): nodes are variables, and an arrow from A to B means "A directly causes B." Third, a set of exogenous noise variables representing everything outside the model.
Let's build the DAG for our banner example. We believe prior engagement causes both banner exposure (the algorithm shows it to active users) and spending. And we believe the banner directly causes spending. Drawing this out:
        Prior Engagement
         /            \
        v              v
  Banner Shown   →   Spending
Three arrows, three causal claims. Prior engagement causes banner exposure. Prior engagement causes spending. And the banner causes spending. This graph is the heart of Pearl's framework: it's a compact, visual encoding of your causal assumptions. Every path in any DAG is built from exactly three elementary structures, and understanding these three patterns is essential to everything that follows.
The fork (also called a confounder pattern): X ← Z → Y. Z causes both X and Y, creating a spurious correlation between them even though neither causes the other. Conditioning on Z removes the spurious association. This is the "common cause" pattern, and it's what makes naive regressions biased. In our example, prior engagement is the fork — it's the common cause of banner exposure and spending.
The chain (also called a mediator pattern): X → Z → Y. X causes Z, and Z causes Y. The causal effect of X on Y flows through Z. Here's the trap: if you condition on Z, you block the causal pathway and underestimate the total effect of X. This is why blindly "controlling for everything" in a regression can be dangerous.
The collider: X → Z ← Y. Both X and Y independently cause Z. Here's the counterintuitive part: X and Y are independent in the general population, but conditioning on Z (or selecting based on Z) creates a spurious association between them. This is the most treacherous of the three patterns because it violates the instinct that controlling for more variables is always better.
A vivid example of collider bias: suppose talent and physical attractiveness are independent in the general population. But Hollywood success requires either talent or attractiveness (or both). If you only look at successful actors — conditioning on the collider — you'll find that talent and attractiveness appear negatively correlated. Among the successful, if someone isn't talented, they must be attractive, and vice versa. The correlation is entirely created by the selection. This is known as Berkson's paradox, and it trips up experienced researchers regularly.
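A quick simulation makes the selection effect visible. The "success requires talent or attractiveness" rule below is a made-up toy model, not real data:
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
talent = rng.normal(0, 1, n)
attractiveness = rng.normal(0, 1, n)
# Success requires being strong on at least one dimension (the collider)
successful = (talent > 1) | (attractiveness > 1)

print(f"Correlation overall:          {np.corrcoef(talent, attractiveness)[0, 1]:+.3f}")
print(f"Correlation among successful: {np.corrcoef(talent[successful], attractiveness[successful])[0, 1]:+.3f}")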
I still occasionally get tripped up by colliders. The instinct to "control for more variables" is deeply ingrained, and the collider is the one case where that instinct is exactly wrong. Drawing the DAG before running any regression is the single most useful habit I've developed.
The do-Operator and Graph Surgery
Here's where Pearl's framework delivers its deepest insight. There's a precise mathematical difference between seeing that X takes a value and forcing X to take a value.
P(Y | X = x) — the conditional probability — asks: "Among all cases where X happens to be x, what's the distribution of Y?" This includes cases where X = x because of some confounder. It mixes the causal signal with confounding noise.
P(Y | do(X = x)) — the interventional probability — asks: "If I reach into the system, force X to be x regardless of everything else, what happens to Y?" This is the causal quantity. It's what an experiment measures.
Pearl's insight is that the do-operator corresponds to a specific surgery on the graph. When you do(X = x), you delete all arrows coming into X (because you're overriding whatever normally causes X), fix X to the value x, and leave everything else untouched. In the resulting "mutilated" graph, the confounding paths through X are severed, and what remains is the pure causal effect.
ORIGINAL graph:                        AFTER do(Banner = shown):

    Prior Engagement                       Prior Engagement
     /           \                                      \
    v             v                                      v
  Banner   →   Spending                 Banner=shown  →  Spending

The arrow from Prior Engagement        That arrow is gone now.
to Banner creates confounding.         We set Banner by force.
P(Spending | Banner) is biased.        P(Spending | do(Banner)) is causal.
I'll admit — I haven't figured out a great way to make the do-operator feel viscerally intuitive on first encounter. But our courtroom analogy helps. The conditional P(Y|X) is like asking "what usually happens when X is observed?" — it's a description of correlation in the world as it naturally unfolds. The do-operator P(Y|do(X)) is like asking "what would happen if a judge ordered X to take a specific value?" — it's a description of what follows from an intervention, regardless of what normally causes X.
Every A/B test is a physical implementation of the do-operator. When you randomly assign treatment, you're performing the graph surgery — severing the arrows into the treatment variable by having a coin flip decide instead of confounders. That's why RCTs are the gold standard: they compute P(Y|do(X)) directly, by construction.
The Backdoor Criterion
We've established that P(Y|do(X)) is what we want. But can we compute it from observational data — data where nobody did any intervening?
Often, yes. Pearl's backdoor criterion tells you exactly which variables to condition on. A set of variables Z satisfies the backdoor criterion relative to (X, Y) if two conditions hold: no variable in Z is a descendant of X (you're not conditioning on something the treatment causes), and Z blocks every "backdoor path" from X to Y — that is, every path that starts with an arrow into X.
If such a Z exists, the backdoor adjustment formula applies:
P(Y | do(X=x)) = Σz P(Y | X=x, Z=z) · P(Z=z)
In words: take the conditional distribution P(Y|X,Z), and average it over the marginal distribution of Z. This is subtly different from conditioning on Z directly. Conditioning uses P(Z|X); backdoor adjustment uses P(Z). The difference is the entire gap between correlation and causation.
In our banner example, the DAG tells us that prior engagement is a confounder (it has arrows into both the banner and spending). It satisfies the backdoor criterion. So we can estimate the causal effect by stratifying on prior engagement: within each stratum, compare spending between those who saw the banner and those who didn't, then average across strata weighted by the overall distribution of prior engagement.
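Here is what that stratification looks like in a toy simulation with a binary engagement confounder and made-up probabilities. Note that the adjustment averages the strata using P(engagement), not P(engagement | banner):
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
engaged = rng.binomial(1, 0.4, n)                                # confounder
banner = rng.binomial(1, np.where(engaged == 1, 0.7, 0.2))       # engaged users see the banner more
spending = 30 + 20 * engaged + 5 * banner + rng.normal(0, 5, n)  # true banner effect: $5

naive = spending[banner == 1].mean() - spending[banner == 0].mean()

# Backdoor adjustment: compare within each engagement stratum, then average the strata
# using the marginal distribution P(engaged), not P(engaged | banner)
adjusted = sum(
    (spending[(banner == 1) & (engaged == z)].mean()
     - spending[(banner == 0) & (engaged == z)].mean()) * np.mean(engaged == z)
    for z in (0, 1)
)
print(f"Naive difference:    ${naive:.2f}")
print(f"Backdoor adjustment: ${adjusted:.2f}   (true effect: $5.00)")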
This connects directly to what we did earlier with propensity scores and IPW — those are different computational strategies for performing the same backdoor adjustment. The DAG tells you what to adjust for; propensity scores and IPW tell you how to adjust. Pearl gives the "what," Rubin gives the "how." The two frameworks converge on the same answer when the assumptions align.
Simpson's Paradox
With the backdoor criterion in hand, we can now tackle one of the most famous puzzles in statistics.
In 1973, UC Berkeley was accused of gender bias in graduate admissions. The aggregate numbers looked damning: 44% of male applicants were admitted, but only 35% of female applicants. But when researchers looked department by department, most departments actually admitted women at slightly higher rates than men. The aggregate trend reversed when you stratified by department.
What happened? Women disproportionately applied to competitive departments with low admission rates (humanities, social sciences). Men disproportionately applied to departments with higher admission rates (engineering, sciences). Department choice was a confounder — it influenced both the applicant's gender distribution and the admission rate.
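Here is a toy reconstruction of the reversal. The numbers below are invented for illustration, not the real Berkeley figures:
#               dept     applicants (men, women)   admit rate (men, women)
departments = [("Easy",  800, 200,                 0.60, 0.65),
               ("Hard",  200, 800,                 0.10, 0.15)]

# Each department admits women at a slightly HIGHER rate, yet the aggregate favors men
men_admits = sum(m * rm for _, m, w, rm, rw in departments)
women_admits = sum(w * rw for _, m, w, rm, rw in departments)
men_total = sum(m for _, m, w, rm, rw in departments)
women_total = sum(w for _, m, w, rm, rw in departments)
print(f"Aggregate admit rate, men:   {men_admits / men_total:.0%}")
print(f"Aggregate admit rate, women: {women_admits / women_total:.0%}")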
This reversal — where a trend in aggregated data flips when you stratify — is called Simpson's paradox. And here's the punchline: there is no purely statistical rule that resolves it. "Always stratify" is wrong. "Always aggregate" is wrong. The correct answer depends on the causal structure.
In the Berkeley case, department is a confounder (it affects both gender composition and admission probability), so stratifying is correct. But imagine a different scenario: a drug is more effective in mild cases than severe cases, and doctors prescribe it more to severe patients. If disease severity is a mediator (the drug works through reducing severity), you should not stratify by severity — that would block the causal pathway. Same paradox, opposite resolution, because the DAG is different.
This is why causal inference exists. It provides the framework to answer questions that statistics alone cannot. No amount of data, no clever test, no p-value can tell you whether to aggregate or stratify. You need a causal model.
Causal Discovery: Learning the Graph
Everything so far has assumed you know the DAG. You drew it from domain knowledge, debated it with colleagues, defended it in a paper. But what if you could learn the causal structure from data itself?
That's the promise of causal discovery algorithms. The most foundational is the PC algorithm (named after its creators, Peter Spirtes and Clark Glymour). It starts with a complete graph (every variable connected to every other) and systematically removes edges by testing for conditional independence. If X and Y are independent given some set of variables Z, the edge between X and Y is removed. After pruning, it orients remaining edges by looking for collider patterns.
The FCI algorithm (Fast Causal Inference) extends PC to handle latent confounders — variables that affect the observed ones but aren't in the dataset. Instead of a clean DAG, FCI outputs a partial ancestral graph (PAG) with some edges marked as ambiguous, honestly reflecting the limits of what the data can tell you.
My favorite thing about causal discovery is also its biggest caveat: these algorithms can only identify the causal structure up to an equivalence class. Multiple DAGs can produce the same set of conditional independencies. The data tells you the skeleton (which variables are connected), and it identifies some edge directions (colliders are detectable), but other directions remain ambiguous. You still need domain knowledge to resolve those ambiguities. No one gets a free causal graph from data alone — the data constrains it, but doesn't fully determine it.
Second Rest Stop
If you've made it here, you have the classical toolkit. You understand the fundamental problem (two potential outcomes, only one observable), the gold standard (randomization), and five observational methods: propensity matching, IPW, difference-in-differences, instrumental variables, and regression discontinuity. You understand Pearl's DAGs, the do-operator, backdoor adjustment, and why Simpson's paradox is a causal problem. You know the limits of causal discovery. That's a genuinely strong foundation.
What remains is the intersection of causal inference and machine learning — where flexible models meet causal questions, where treatment effects get personalized, and where the field is moving fastest. If you're interested in applying causal thinking to real products, the next section is where it gets most directly useful.
But if you need to stop, you have a working mental model that covers the vast majority of what practitioners encounter. The short version of what's next: machine learning models can estimate heterogeneous treatment effects, which means figuring out not just "does the treatment work?" but "for whom does it work best?" There. That's the core idea. You're about 80% of the way through.
Machine Learning Meets Causality
Classical causal inference methods — regression, matching, IPW — all estimate one number: the average treatment effect. But in a world of personalization, we often want to know: which specific users benefit from the treatment, and by how much?
This is CATE estimation at scale, and it's where machine learning enters the picture.
Double Machine Learning (also called Double/Debiased ML), developed by Victor Chernozhukov and colleagues in 2018, combines the flexibility of ML with the rigor of causal inference. The insight is clever: use ML to estimate two "nuisance" functions — the propensity score e(X) = P(T=1|X) and the conditional outcome E[Y|X] — and then use the residuals (the parts these models can't explain) to estimate the causal effect. The key trick is cross-fitting: you estimate the nuisance functions on one half of the data and the causal effect on the other half, then swap. This prevents overfitting from leaking bias into the causal estimate.
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
n = 4000
X_feat = rng.normal(0, 1, (n, 5))
propensity_true = 1 / (1 + np.exp(-X_feat[:, 0]))
T = rng.binomial(1, propensity_true)
true_cate = 2 + 3 * X_feat[:, 1]                     # heterogeneous treatment effect
Y = X_feat[:, 0] + true_cate * T + rng.normal(0, 1, n)

# Cross-fitting: nuisance models are trained on one fold, residuals computed on the other
kf = KFold(n_splits=2, shuffle=True, random_state=42)
theta_parts = []
for train_idx, est_idx in kf.split(X_feat):
    # Outcome nuisance: E[Y | X]
    m_model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
    m_model.fit(X_feat[train_idx], Y[train_idx])
    Y_residual = Y[est_idx] - m_model.predict(X_feat[est_idx])

    # Treatment nuisance (propensity): P(T = 1 | X)
    e_model = GradientBoostingClassifier(n_estimators=100, max_depth=3)
    e_model.fit(X_feat[train_idx], T[train_idx])
    T_residual = T[est_idx] - e_model.predict_proba(X_feat[est_idx])[:, 1]

    # Regress outcome residuals on treatment residuals (partialling out)
    theta = np.sum(Y_residual * T_residual) / np.sum(T_residual ** 2)
    theta_parts.append(theta)

dml_ate = np.mean(theta_parts)
true_ate = np.mean(true_cate)
print(f"True ATE: {true_ate:.3f}")
print(f"DML ATE: {dml_ate:.3f}")
Double ML gives you a debiased estimate even when you're using complex, nonlinear ML models for the nuisance parameters. This matters because if you naively plug ML predictions into a causal formula, regularization bias leaks in and corrupts the estimate. The cross-fitting structure prevents that.
Causal Forests, developed by Susan Athey and Stefan Wager in 2018, go further — they estimate CATE(x) directly. Built on the random forest machinery, causal forests split the feature space not to predict outcomes but to find regions where the treatment effect differs. Each leaf of each tree estimates a local treatment effect. The forest aggregates across trees to produce CATE estimates with valid confidence intervals.
What makes causal forests remarkable is that they inherit the flexibility of random forests (handling nonlinear interactions, high-dimensional features) while providing the statistical guarantees that causal inference demands (asymptotic normality, honest inference through sample splitting). They're the workhorse method for heterogeneous treatment effect estimation in industry today.
A family of approaches called meta-learners provides a simpler entry point. The T-learner fits separate outcome models for treated and control groups, then takes the difference. The S-learner fits a single model with treatment as a feature. The X-learner uses a clever cross-fitting scheme that works especially well when treatment groups are imbalanced. These are easy to implement with any ML model and are a good starting point before reaching for the full causal forest or DML machinery.
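As a concrete example, here is a minimal T-learner sketch reusing X_feat, T, Y, and true_cate from the Double ML simulation above. It's a rough baseline, not a substitute for causal forests or DML:
from sklearn.ensemble import GradientBoostingRegressor

# Fit separate outcome models for treated and control users
model_treated = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X_feat[T == 1], Y[T == 1])
model_control = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X_feat[T == 0], Y[T == 0])

# CATE estimate for every user: predicted outcome if treated minus predicted outcome if not
cate_hat = model_treated.predict(X_feat) - model_control.predict(X_feat)
print(f"Estimated ATE from T-learner: {cate_hat.mean():.3f}")
print(f"Correlation with true CATE:   {np.corrcoef(cate_hat, true_cate)[0, 1]:.3f}")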
Causal Inference in Tech
Let's bring all of this back to where it lives in production systems.
Uplift modeling is the direct application of CATE estimation to business decisions. Instead of asking "which users are likely to convert?" (a prediction question), it asks "which users' conversion probability will increase if we send them an email?" (a causal question). The difference is enormous. A high-propensity user who would convert anyway doesn't need the email — sending it to them is waste. A user whose behavior changes because of the email is where the value is. Companies like Uber, Booking.com, and Wayfair use uplift models to target promotions, reducing cost while increasing effectiveness.
Consider our banner example one last time. An uplift model would segment users into four types. Persuadables: buy if shown the banner, don't buy otherwise — these are your target. Sure things: buy regardless — don't waste the banner on them. Lost causes: don't buy regardless — also don't waste the banner. Sleeping dogs: buy if not shown the banner, don't buy if shown — showing the banner actively hurts with these users. A traditional predictive model conflates sure things with persuadables, because both have high conversion probabilities. An uplift model separates them by estimating the causal effect per user.
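As a sketch of how this feeds a decision, here is a hypothetical targeting rule built on the T-learner CATE estimates from earlier; the $1 banner cost is an assumption I made up:
# Hypothetical targeting rule: only show the banner where the estimated uplift
# exceeds its cost (roughly, target the persuadables and skip everyone else)
banner_cost = 1.0
target = cate_hat > banner_cost
print(f"Users targeted: {target.mean():.0%}")
print(f"Estimated incremental value per targeted user: {cate_hat[target].mean():.2f}")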
Beyond marketing, causal inference powers feature impact estimation (did our new recommendation algorithm actually improve engagement, or did engagement go up because of the holiday season?), algorithmic fairness (does the model's disparate impact on protected groups reflect a causal mechanism we should fix?), and policy optimization (given a budget for discounts, which users should receive what amount to maximize total revenue?).
The field is moving fast. Microsoft's EconML, Uber's CausalML, and PyWhy's DoWhy have made these methods accessible to engineers who aren't econometricians. The gap between "research technique" and "production tool" has collapsed dramatically in the last five years.
Wrapping Up
If you're still with me, thank you. I hope it was worth it.
We started with a thought experiment about parallel universes — the two potential outcomes for every person, only one of which we can ever observe. We built up from there: randomized experiments as the cleanest solution, then five observational methods for when randomization isn't possible (propensity matching, IPW, difference-in-differences, instrumental variables, regression discontinuity). We switched frameworks from Rubin's potential outcomes to Pearl's graphs, learned about DAGs, the do-operator, and the backdoor criterion, and saw how Simpson's paradox can only be resolved with causal reasoning. We ended with the modern marriage of ML and causality — Double ML, causal forests, and uplift modeling.
My hope is that the next time someone asks "did this feature increase revenue?", instead of pulling up a correlation and shrugging, you'll draw a DAG, identify the confounders, pick the estimation strategy that fits, and give an answer you can actually defend — having a pretty darn good mental model of what's going on under the hood.
Resources and Credits
Judea Pearl, The Book of Why (2018) — the most accessible introduction to causal thinking. Pearl writes with a passion that's rare in technical books. If you read one thing, make it this.
Scott Cunningham, Causal Inference: The Mixtape (2021) — freely available online, wildly helpful, covers the econometric toolkit (DiD, IV, RDD) with humor and clarity.
Miguel Hernan and James Robins, Causal Inference: What If (2020) — the definitive textbook for the potential outcomes framework. Free PDF from the authors' website. Dense but incredibly thorough.
Brady Neal's Introduction to Causal Inference online course — if you prefer video, this is the best free resource. Covers both Pearl and Rubin with real code examples.
Athey and Imbens, "Machine Learning Methods That Economists Should Know About" (2019) — the O.G. survey paper connecting ML and causal inference. If you want to understand where the field is headed, start here.
The DoWhy documentation (dowhy.readthedocs.io) — clear, well-organized, with worked examples for every major method. The best way to get your hands dirty.