Fairness & Bias Detection

Chapter 18: Responsible AI and Ethics Section 1

I avoided fairness in machine learning for longer than I'd like to admit. Every time someone mentioned "algorithmic bias" in a meeting, I'd nod gravely and think "that's important, someone should look into that" — and then go back to optimizing my AUC score. It felt like a political problem dressed up in technical language, and I wasn't sure I had anything useful to contribute. Finally the discomfort of not understanding what's actually going on under the hood — what "fair" even means mathematically, and why smart people can't agree on a definition — grew too great for me. Here is that dive.

Fairness in ML became a formal research area around 2012–2016, though the underlying legal and philosophical questions go back decades. The field exploded after a 2016 ProPublica investigation showed that a criminal recidivism prediction tool called COMPAS was nearly twice as likely to falsely label Black defendants as high-risk compared to white defendants. That investigation, and the fierce technical debate it sparked, forced the community to confront something uncomfortable: there are multiple mathematically rigorous definitions of "fair," and they contradict each other.

Before we start, a heads-up. We're going to be talking about probability, conditional statistics, and some light algebra. We'll also touch on real-world case studies that involve race, gender, and other sensitive attributes — because that's where bias does its damage. You don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

The Loan Officer and the Spreadsheet
Where Bias Comes From
The First Instinct That Doesn't Work
Demographic Parity — The Simplest Definition
When Equal Outcomes Aren't Enough
Calibration — When the Score Itself Matters
The Impossibility Theorem
Rest Stop and an Off Ramp
Detecting Bias in Practice
The Four-Fifths Rule
Fixing It — Three Intervention Points
Intersectional Fairness
Fairness in LLMs
The Cases That Changed Everything
Resources and Credits

The Loan Officer and the Spreadsheet

Imagine a small bank — three branches, one loan officer at each. They've been approving and rejecting loan applications by hand for years. Now the bank wants to build a model to automate the process. They hand you a spreadsheet: 20 recent applications, each with an income, a credit score, a zip code, and whether the loan was approved or denied.

The full spreadsheet has ten applicants from each neighborhood. Here's a slice of it: five from neighborhood A, five from neighborhood B.

Applicant   Neighborhood   Income    Credit Score   Approved?
────────────────────────────────────────────────────────────────
Alice       A              $62,000   710            Yes
Bob         A              $58,000   680            Yes
Carol       A              $45,000   620            No
David       B              $61,000   700            No
Elena       B              $59,000   690            No
Frank       A              $70,000   740            Yes
Grace       B              $67,000   720            No
Hiro        A              $52,000   650            Yes
Irene       B              $55,000   660            No
James       B              $48,000   630            No

Something jumps out. David has almost identical numbers to Alice — $61k income, 700 credit score — but David was denied and Alice was approved. Grace has better numbers than Hiro, yet Hiro was approved and Grace wasn't. The pattern becomes clear when you look at the neighborhood column: across the full spreadsheet, applicants from neighborhood A were approved at 70% and applicants from neighborhood B at 10%.

Now imagine neighborhood A is predominantly white and neighborhood B is predominantly Black. The historical loan officers weren't using race as an explicit criterion. They were using their "gut feeling," shaped by years of pattern-matching on outcomes that were themselves shaped by decades of discriminatory lending practices. The spreadsheet faithfully recorded those decisions. And if you train a model on this spreadsheet, the model will faithfully reproduce them.

That's bias. Not a bug in the code. Not a malicious actor. A machine learning system doing exactly what we asked it to do — learn from the past — when the past was unfair.

We'll keep coming back to this bank throughout our journey. It starts small, but the problems it surfaces are the same ones that show up in criminal justice, healthcare, hiring, and every other high-stakes domain where ML makes decisions about people.

Where Bias Comes From

Our bank's spreadsheet has a specific kind of problem — it reflects decisions made by biased humans. But that's only one of several ways bias sneaks into a machine learning pipeline. I find it helpful to think of these as different entry points, because the fix depends on where the contamination happened.

Historical bias is what our bank has. The data faithfully records an unfair world. Amazon learned this the hard way with an experimental hiring model, built in the mid-2010s and trained on a decade of resumes. The tech industry had been predominantly male for those ten years, so the model learned that "male" was a signal for "good candidate." It started penalizing resumes that included the word "women's" — as in "women's chess club captain" — and downranking graduates of all-women's colleges. Amazon scrapped the project entirely. The data wasn't wrong in a technical sense. It was a truthful record of a biased process.

Representation bias is about who's missing. If your training data is 90% one demographic, the model will perform well on that demographic and poorly on everyone else. This is what Joy Buolamwini found when she audited facial recognition systems in 2018 — the Gender Shades study. The training data skewed heavily toward lighter-skinned faces, so the systems worked brilliantly on light-skinned men (error rate below 1%) and terribly on dark-skinned women (error rates up to 34.7%). The model wasn't malicious. It was underfed.

Measurement bias is subtler and, I think, the most dangerous. It happens when the thing you're measuring isn't the thing you think you're measuring. In 2019, Obermeyer and colleagues published a study in Science that shook the healthcare world. A widely-used algorithm was supposed to identify patients who needed extra medical care. It used healthcare cost as a proxy for healthcare need. Sounds reasonable — sicker people cost more, right? Except Black patients in America systematically have less access to healthcare, so they incur lower costs even when they're equally sick. The algorithm was essentially saying "people who've spent less money on healthcare must be healthier." At any given risk score, Black patients were substantially sicker than white patients. Only 17.7% of Black patients were flagged for extra care, when 46.5% should have been.

Think of it like a thermometer that reads five degrees low, but only in certain rooms. You can calibrate it perfectly for the rooms where it works, and the measurement will still be wrong in the rooms where it doesn't. Except in this case, the "rooms" are people's lives.

Aggregation bias happens when you pool data across groups that actually behave differently. A diabetes prediction model that doesn't account for how HbA1c levels relate to risk differently across ethnic groups will work well on average and fail on specific populations. The aggregate looks fine. The parts don't.

Label bias is when the humans doing the labeling bring their own prejudices. Content moderation models trained on human annotations have been shown to flag African American Vernacular English as "toxic" at higher rates than Standard American English. The model learned what the annotators taught it.

Sampling bias is the most mechanical — your data collection process systematically excludes populations. Speech recognition trained primarily on podcast audio underperforms on elderly speakers, non-native speakers, and people with speech impediments. The pipeline never heard them, so it doesn't know them.

Each of these entry points has its own fix, and that's important. Rebalancing your training data helps with representation bias but does nothing for measurement bias. Reweighing your samples helps with historical bias but doesn't fix broken labels. There's no single "debias" button. Knowing where the contamination entered is the first step toward knowing how to clean it up.

The First Instinct That Doesn't Work

When I first thought about making a model "fair," my immediate reaction was: remove the protected attribute. Don't give the model access to race or gender, and it can't discriminate. Problem solved.

I'll be honest — I held this belief for an embarrassingly long time. It's called fairness through unawareness, and it's the most common first instinct in the industry. It's also almost completely useless.

Go back to our bank. Suppose we remove the "neighborhood" column from the training data. The model still has zip code (which correlates with neighborhood), income (which correlates with neighborhood due to historical wealth gaps), and credit score (which correlates with access to credit, which correlates with neighborhood). These are proxy variables — features that carry the signal of the thing you removed, sometimes with such fidelity that the model barely notices the protected attribute is gone.

Here's a quick test. Take all the features you plan to use and try to predict the sensitive attribute from them. If a simple model can do it well, you have proxies.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Can we predict neighborhood from the "neutral" features?
X = df[["income", "credit_score", "zip_code_encoded"]]
y = df["neighborhood"]

# Cross-validate so we measure generalization, not memorization.
proxy_model = GradientBoostingClassifier(n_estimators=100, max_depth=3)
proxy_accuracy = cross_val_score(proxy_model, X, y, cv=5).mean()
baseline = y.value_counts(normalize=True).max()

# proxy_accuracy: 0.94, baseline: 0.50
# The "neutral" features predict neighborhood with 94% accuracy.
# Removing the column was theater.

In our bank example, the proxy model predicts neighborhood with 94% accuracy using only income, credit score, and zip code. Removing the neighborhood column from the training data didn't remove the information; it just left it spread across the remaining features, where it's harder to see and harder to audit.

This is the central tension that makes fairness genuinely hard. The very features that carry useful predictive signal — income predicts loan repayment, credit score predicts default risk — are also the features that encode historical discrimination. You can't untangle the "legitimate signal" from the "discriminatory signal" by deleting a column. The signals are woven into each other.

So if removing the attribute doesn't work, what does "fair" even mean? That turns out to be a deeper question than I expected.

Demographic Parity — The Simplest Definition

Let's start with the most intuitive definition. Demographic parity says: the model should approve loans at the same rate for every group. If 60% of neighborhood-A applicants get approved, then 60% of neighborhood-B applicants should also get approved.

Mathematically: P(Ŷ=1 | G=a) = P(Ŷ=1 | G=b), where Ŷ is the prediction and G is the group. The probability of getting a "yes" shouldn't depend on which group you belong to.

Back at our bank, demographic parity would look like this. Suppose we have 100 applicants from each neighborhood. The uncorrected model approves 70 from A and 10 from B. Demographic parity requires us to equalize those rates — maybe 40 from each, or 50 from each, depending on where we set the threshold.
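
Measuring it is a one-liner once you split predictions by group. A minimal sketch with hypothetical arrays (y_pred and group are made-up names, not anything from a library), using the 70-versus-10 approval counts from above:

import numpy as np

# Hypothetical decisions matching the uncorrected model:
# 70 approvals out of 100 in A, 10 out of 100 in B.
y_pred = np.array([1] * 70 + [0] * 30 + [1] * 10 + [0] * 90)
group = np.array(["A"] * 100 + ["B"] * 100)

rate_a = y_pred[group == "A"].mean()   # 0.70
rate_b = y_pred[group == "B"].mean()   # 0.10

# Demographic parity difference: 0 means perfect parity.
dp_gap = abs(rate_a - rate_b)          # 0.60, a long way from parity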

This feels right. Equal outcomes. Nobody is being treated differently based on group membership. So what's wrong with it?

The problem is that demographic parity ignores whether the applicants are actually qualified. If neighborhood A genuinely has more high-income, high-credit-score applicants (because of historical wealth accumulation, not because of anything inherent), then forcing equal approval rates means either approving less-qualified applicants from B or rejecting more-qualified applicants from A. The bank might approve a 550-credit-score applicant from B while rejecting a 720-credit-score applicant from A.

Another way to think about it: imagine a fair courtroom. One judge says "fairness means acquitting the same percentage of defendants from every racial group." That feels appealing until you realize it could mean acquitting people who are guilty or convicting people who are innocent — to hit a quota. Most people's intuition is that the judge should look at the evidence, not the demographics. That instinct leads us to the next definition.

When Equal Outcomes Aren't Enough

If demographic parity ignores qualifications, maybe we should define fairness in terms of errors. A model makes two kinds of mistakes: false positives (approving someone who will default) and false negatives (rejecting someone who would have repaid). Equalized odds says both error rates should be equal across groups.

Formally: the false positive rate (FPR) and the true positive rate (TPR) should be the same for every group. If the model catches 85% of good borrowers in neighborhood A (TPR = 0.85), it should also catch 85% of good borrowers in neighborhood B. And if it incorrectly approves 10% of bad borrowers in A (FPR = 0.10), it should incorrectly approve 10% of bad borrowers in B too.

Let's trace through the bank example. Imagine the model's confusion matrix looks like this:

Neighborhood A (100 applicants, 60 actually qualified):
  True Positives:  51   (approved, would repay)     TPR = 51/60 = 0.85
  False Negatives:  9   (rejected, would have repaid)
  False Positives:  4   (approved, will default)     FPR = 4/40 = 0.10
  True Negatives:  36   (rejected, will default)

Neighborhood B (100 applicants, 30 actually qualified):
  True Positives:  18   (approved, would repay)     TPR = 18/30 = 0.60
  False Negatives: 12   (rejected, would have repaid)
  False Positives:  2   (approved, will default)     FPR = 2/70 = 0.03
  True Negatives:  68   (rejected, will default)

The TPR gap (0.85 vs 0.60) tells us that qualified applicants from B are much less likely to be approved than equally qualified applicants from A. The FPR gap (0.10 vs 0.03) tells us that unqualified applicants from A are more likely to slip through. Equalized odds says both of these gaps should be zero.
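
Computing those gaps in code is mechanical once you have the actual outcomes and the predictions for each group. A sketch, assuming arrays y_true, y_pred, and group for the outcomes, decisions, and neighborhoods (placeholder names, not library objects):

from sklearn.metrics import confusion_matrix

def error_rates(y_true_g, y_pred_g):
    # TPR and FPR for one group's predictions.
    tn, fp, fn, tp = confusion_matrix(y_true_g, y_pred_g, labels=[0, 1]).ravel()
    return tp / (tp + fn), fp / (fp + tn)

tpr_a, fpr_a = error_rates(y_true[group == "A"], y_pred[group == "A"])
tpr_b, fpr_b = error_rates(y_true[group == "B"], y_pred[group == "B"])

# Equalized odds wants both gaps near zero.
tpr_gap = abs(tpr_a - tpr_b)   # 0.85 - 0.60 = 0.25 in the tables above
fpr_gap = abs(fpr_a - fpr_b)   # 0.10 - 0.03 = 0.07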

This feels like a better judge — one who says "I should make the same quality of decisions regardless of who's in front of me." Mistakes are inevitable, but they shouldn't fall disproportionately on one group.

Equal opportunity is the relaxed version: it only requires equal true positive rates. It says "qualified people from every group should have the same chance of being approved." It's silent on false positives. This is the definition that feels most natural for situations like loan approvals or college admissions, where the primary harm is denying a benefit to someone who deserved it.

But equalized odds has a limitation we haven't confronted yet. It cares about error rates but says nothing about what the model's scores mean. And that turns out to matter enormously.

Calibration — When the Score Itself Matters

Many real-world systems don't output a binary yes/no — they output a score. COMPAS assigns recidivism risk scores from 1 to 10. Credit scoring systems produce three-digit numbers. Healthcare algorithms assign acuity levels. When a system outputs a score, we want that score to mean the same thing regardless of who it's applied to.

Calibration says: among all the people who receive a risk score of 7, roughly 70% should actually have the outcome, regardless of their group. A score of 7 for a Black defendant should mean the same thing as a score of 7 for a white defendant. The closely related binary-prediction version is predictive parity: among the people the model flags as positive, the fraction who actually have the outcome should be the same for every group. Formally, P(Y=1 | Ŷ=1, G=a) = P(Y=1 | Ŷ=1, G=b).
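
One way to check this on a model that outputs scores is sklearn's calibration_curve, computed separately per group. A sketch, assuming arrays y_true, scores, and group (hypothetical names; scores are the model's predicted probabilities):

from sklearn.calibration import calibration_curve

# Bin the scores within each group and compare the average predicted
# probability in each bin to the observed outcome rate in that bin.
for g in ("A", "B"):
    mask = group == g
    observed, predicted = calibration_curve(
        y_true[mask], scores[mask], n_bins=5, strategy="quantile"
    )
    # Calibrated means observed ≈ predicted in every bin, for every group.
    print(g, list(zip(predicted.round(2), observed.round(2))))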

Our courtroom judge analogy is useful again. A calibrated judge says "when I label a defendant high-risk, I'm right 70% of the time, no matter who they are." That sounds unimpeachable. The number means what it says.

So now we have three reasonable definitions: demographic parity (equal outcomes), equalized odds (equal error rates), and calibration (scores mean the same thing). Each one captures a genuine aspect of what people mean when they say "fair." And here is where the floor drops out.

The Impossibility Theorem

In 2016 and 2017, two independent research groups — one led by Kleinberg, Mullainathan, and Raghavan, the other by Alexandra Chouldechova — proved something that fundamentally changed the fairness conversation. When the base rates differ between groups (meaning the actual rate of the outcome is different — say, actual loan default rates differ between neighborhoods), you cannot simultaneously have:

  1. Calibration — a score of 0.7 means 70% probability for every group
  2. Equal false positive rates across groups
  3. Equal false negative rates across groups

Not "it's really hard." Not "we haven't found a way yet." Mathematically impossible. Proven with algebra, not experiments.

I'll be honest — when I first read this, I didn't believe it. It felt like there had to be a loophole. So let me walk through why it's true, using our bank.

Suppose neighborhood A has a 15% actual default rate and neighborhood B has a 30% actual default rate (these differ because of historical economic conditions, not anything inherent). Now suppose we have a perfectly calibrated model — when it says "risk = 0.3," exactly 30% of those people actually default, in both neighborhoods.

Here's the catch. Because the base rate is higher in B, the model has to assign higher risk scores to B applicants on average to stay calibrated. That means more B applicants end up above any given threshold. Which means more B applicants who won't default get flagged as high-risk (higher FPR for B). And more A applicants who will default slip below the threshold (higher FNR for A).

You can see it in the math. By Bayes' rule, the positive predictive value for group g is:

PPV_g = (TPR_g × base_rate_g) / (TPR_g × base_rate_g + FPR_g × (1 - base_rate_g))

If you force TPR and FPR to be equal across groups (equalized odds), but the base rates differ, the numerator and denominator shift differently for each group. The PPV must differ. Calibration breaks.
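
To see it with numbers, plug hypothetical values into that formula: force the same TPR and FPR on both groups, and use the two default base rates from our bank.

# Equal error rates forced on both groups, different base rates.
tpr, fpr = 0.80, 0.10          # hypothetical, identical for A and B
base_a, base_b = 0.15, 0.30    # actual default rates from the example above

def ppv(tpr, fpr, base):
    return (tpr * base) / (tpr * base + fpr * (1 - base))

ppv_a = ppv(tpr, fpr, base_a)  # ~0.59
ppv_b = ppv(tpr, fpr, base_b)  # ~0.77
# Same TPR, same FPR, yet a "high-risk" label is right 59% of the time
# in A and 77% of the time in B. Calibration is gone.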

And if you force calibration and equal FPR, the false negative rates must differ. Any two of the three can be satisfied. All three cannot. It's a mathematical identity, as inescapable as 1 + 1 = 2.

This is exactly what happened in the COMPAS debate. ProPublica analyzed the tool and showed that Black defendants who did not go on to reoffend were nearly twice as likely to be labeled high-risk compared to white defendants who didn't reoffend. That's unequal false positive rates. Northpointe, the company behind COMPAS, responded by showing that their scores were calibrated — a score of 7 meant the same thing for every race. Both were right. The impossibility theorem guarantees they had to disagree on at least one metric. The question was never "is COMPAS fair?" The question was "which definition of fairness matters most when you're deciding whether someone stays in jail?" That's a value judgment, not a technical one.

There are also two more fairness definitions worth knowing, even though they're less common in practice. Individual fairness says similar individuals should receive similar predictions — but defining "similar" is the entire problem, and there's no universal answer. Counterfactual fairness asks "would the prediction change if this person had been a different race?" — but it requires a causal model of the domain, and if your causal model is wrong, your fairness guarantee is meaningless.

Rest Stop and an Off Ramp

Congratulations on making it this far. You can stop here if you want.

You now have a mental model that puts you ahead of most practitioners: bias enters through data (historical, representation, measurement, aggregation, label, sampling), removing protected attributes doesn't work because of proxies, there are multiple mathematically rigorous definitions of fairness (demographic parity, equalized odds, calibration, individual, counterfactual), and — here's the kicker — those definitions contradict each other when base rates differ, which they almost always do in the real world. There is no "be fair" button. Every choice of fairness metric is a value judgment about who benefits and who is harmed.

If that's enough, here's the short version of what comes next: use Fairlearn or AIF360 to measure your model's behavior across groups, apply pre-processing, in-processing, or post-processing mitigation depending on your constraints, check the four-fifths rule for legal compliance, and always audit at intersections (not just individual protected attributes). There. You're 80% of the way there.

But if the discomfort of not knowing how any of that actually works is nagging at you, read on.

Detecting Bias in Practice

Knowing what fairness means is one thing. Measuring it in a running system is another. Let's go back to our bank. The model is deployed. Loans are being approved and rejected. How do we know if it's behaving fairly?

The core idea is to compute the same performance metrics — accuracy, TPR, FPR, precision — separately for each group, then compare them. If the numbers differ significantly, you have a disparity. The question is then whether that disparity is acceptable.

Microsoft's Fairlearn library makes this mechanical. Its MetricFrame object takes any sklearn metric and disaggregates it across a sensitive attribute.

from fairlearn.metrics import (
    MetricFrame, selection_rate,
    false_positive_rate, false_negative_rate
)
from sklearn.metrics import accuracy_score, recall_score

# y_true: actual outcomes (did the person repay?)
# y_pred: model's predictions
# sensitive: neighborhood group for each applicant

mf = MetricFrame(
    metrics={
        "approval_rate": selection_rate,
        "accuracy": accuracy_score,
        "recall": recall_score,           # TPR — equal opportunity
        "fpr": false_positive_rate,       # equalized odds, part 1
        "fnr": false_negative_rate,       # equalized odds, part 2
    },
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive
)

# This gives us per-group metrics in a single table.
# mf.by_group shows the breakdown.
# mf.difference() shows the max gap for each metric.

That single object tells us whether the model satisfies demographic parity (equal approval rates), equal opportunity (equal recall), and equalized odds (equal FPR and FNR) — all in one pass. The .difference() method gives us the maximum gap between any two groups for each metric. If the gap is near zero, we're in good shape for that particular definition. If it's large, we have a problem — and which problem it is depends on which metric has the gap.

IBM's AIF360 is the other major toolkit. It wraps data in its own BinaryLabelDataset structure, which is heavier and less Pythonic but offers a wider range of algorithms. I'd start with Fairlearn for most production work and reach for AIF360 when you need a specific algorithm it has and Fairlearn doesn't — like disparate impact remover or the optimized pre-processing method.

The Four-Fifths Rule

Before you pick a fairness definition, you might have a legal floor to clear. In the United States, the Equal Employment Opportunity Commission uses the four-fifths rule (also called the 80% rule): if the selection rate for any protected group is less than 80% of the rate for the group with the highest selection rate, there's prima facie evidence of disparate impact.

Two terms to pin down. Disparate treatment is intentionally using a protected attribute — the model literally has a "race" feature. That's illegal and easy to spot. Disparate impact is a facially neutral policy that disproportionately harms a protected group. Your model can have zero race features and still create disparate impact through proxies. That's the harder one.

Let's apply the four-fifths rule to our bank. Neighborhood A: 70% approval rate. Neighborhood B: 10% approval rate. The ratio is 10% / 70% = 0.14. The threshold is 0.80. We fail the four-fifths rule by a landslide.

# Four-fifths rule check
rates = mf.by_group["approval_rate"]
ratio = rates.min() / rates.max()
# ratio = 0.14 — well below 0.80
# This is prima facie evidence of disparate impact.

The four-fifths rule doesn't tell you which fairness definition to optimize for. It tells you the floor below which you have a legal problem. Think of it as a smoke alarm — it doesn't tell you how to put out the fire, but it does tell you there's smoke.
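
Fairlearn also has a convenience function that computes this min-over-max selection-rate ratio directly, if you'd rather not pull it out of the MetricFrame by hand. Same y_true, y_pred, and sensitive arrays as before:

from fairlearn.metrics import demographic_parity_ratio

dpr = demographic_parity_ratio(y_true, y_pred, sensitive_features=sensitive)
# dpr below 0.80 means the four-fifths smoke alarm is ringing.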

The EU AI Act, which came into force in 2024, takes a different approach. It classifies AI systems by risk tier. High-risk systems — hiring, credit, criminal justice, healthcare — must undergo conformity assessments that include bias audits. The specifics are still being formalized, but the direction is clear: "we didn't know" is no longer an acceptable defense.

Fixing It — Three Intervention Points

Suppose the smoke alarm went off. The model has a disparity. Now what? There are three places in the pipeline where you can intervene, and each one has a different character.

Pre-processing changes the data before training. The idea is to fix the root cause — if the data is biased, make it less biased. The most common technique is reweighing: assign higher sample weights to underrepresented group-outcome combinations and lower weights to overrepresented ones. If neighborhood-B approvals are rare, give each one more weight during training so the model pays more attention to them.

# Reweighing: the model sees the same data, but some
# examples count more than others during training.
# Weight = (expected frequency) / (observed frequency)
# for each (group, outcome) combination.

overall_approval_rate = y_train.mean()  # say, 0.40
group_b_approval_rate = y_train[sensitive == "B"].mean()  # say, 0.10

# A positive example from B is underrepresented:
weight_b_positive = overall_approval_rate / group_b_approval_rate
# weight = 0.40 / 0.10 = 4.0 — counts as four examples

# Build a weight like this for every (group, outcome) pair, then:
# model.fit(X_train, y_train, sample_weight=weights)

Pre-processing is model-agnostic — you can use any downstream model. But it can distort the data in ways that hurt overall performance, and it only addresses bias that's visible in the training labels.

In-processing modifies the training procedure itself. Instead of optimizing only for accuracy, you optimize for accuracy subject to a fairness constraint. Fairlearn's ExponentiatedGradient is the gold standard here. It frames the problem as a game between a learner (trying to be accurate) and an auditor (trying to find fairness violations), and iterates until they reach an equilibrium.

from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds
from sklearn.linear_model import LogisticRegression

# The learner optimizes accuracy.
# The constraint enforces equalized odds.
# ExponentiatedGradient finds the best tradeoff.

mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(solver="liblinear"),
    constraints=EqualizedOdds(),
    max_iter=50
)
mitigator.fit(X_train, y_train, sensitive_features=sensitive_train)
y_pred_fair = mitigator.predict(X_test)

This is the most principled approach — it directly optimizes the tradeoff between accuracy and fairness — but it requires access to the training process, which means you can't use it on a model you didn't build.
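
Whatever mitigation you pick, close the loop by re-running the same MetricFrame audit from earlier on the mitigated predictions. A sketch, assuming y_test and sensitive_test hold the held-out outcomes and group labels:

mf_after = MetricFrame(
    metrics={"fpr": false_positive_rate, "fnr": false_negative_rate},
    y_true=y_test,
    y_pred=y_pred_fair,
    sensitive_features=sensitive_test
)
# mf_after.difference() should now be near zero for both error rates,
# usually at the cost of a small drop in overall accuracy.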

Post-processing adjusts predictions after the model is already trained. The most common technique is threshold optimization: use different decision thresholds for different groups. If the model systematically underscores neighborhood B, lower the threshold for B so that qualified applicants in B have the same chance of approval as equally qualified applicants in A.

from fairlearn.postprocessing import ThresholdOptimizer

# Finds per-group thresholds that satisfy a fairness constraint
# while maximizing accuracy. No retraining needed.
postprocessor = ThresholdOptimizer(
    estimator=trained_model,
    constraints="equalized_odds",
    objective="accuracy_score",
    prefit=True
)
postprocessor.fit(X_val, y_val, sensitive_features=sensitive_val)
y_adjusted = postprocessor.predict(X_test, sensitive_features=sensitive_test)

Post-processing is the most pragmatic option — it works on any model, including ones you can't retrain. But it requires access to sensitive attributes at prediction time, and it can feel like a patch rather than a fix. Using different thresholds for different groups is itself a form of disparate treatment, which creates an uncomfortable paradox: you're treating people differently based on group membership in order to achieve fair outcomes.

I'm still developing my intuition for when each intervention is appropriate. In practice, I've found that in-processing works best when you're building from scratch and care about a specific fairness definition, pre-processing is safest when you control the data but not the downstream model, and post-processing is the emergency lever for deployed systems where you've discovered a disparity and can't retrain in time.

Intersectional Fairness

Everything we've done so far checks one sensitive attribute at a time. "Is the model fair across neighborhoods?" "Is it fair across genders?" But checking each dimension separately can hide compounding effects.

Kimberlé Crenshaw coined the term intersectionality in 1989 to describe how different forms of discrimination overlap and compound. It applies directly to ML auditing. A model might be fair for women overall and fair for Black applicants overall, while being deeply unfair to Black women specifically. The single-axis audits say "everything's fine." The intersectional audit says "you have a serious problem."

The Gender Shades study is the most vivid demonstration. Buolamwini and Gebru tested facial recognition systems from IBM, Microsoft, and Face++ across four subgroups: lighter-skinned males, lighter-skinned females, darker-skinned males, darker-skinned females. The systems' error rates on lighter-skinned males were below 1%. On darker-skinned females, error rates reached 34.7%. That's a 44x disparity. A single-axis audit by gender would have shown a moderate gap. A single-axis audit by skin tone would have shown a moderate gap. Only the intersection revealed the catastrophe.

In practice, intersectional auditing means creating combined group labels — "neighborhood B × female × age over 50" — and computing metrics for each combination.

# Intersectional audit: combine sensitive attributes
intersection = (
    df[["neighborhood", "gender"]].astype(str).agg(" × ".join, axis=1)
)
mf_intersect = MetricFrame(
    metrics={"approval_rate": selection_rate, "recall": recall_score},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=intersection
)
# mf_intersect.by_group now shows metrics for each combination:
# "A × male", "A × female", "B × male", "B × female"

The catch — and this is something I still wrestle with — is that intersectional subgroups get small fast. "Neighborhood B × female × age 60+" might be 8 people in your test set. Metrics computed on 8 people are noisy enough to be meaningless. You need to report confidence intervals, use bootstrap sampling, or accept that some subgroups are too small to audit reliably with the data you have. That's an uncomfortable answer, but it's an honest one.
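
If you do want to put error bars on a tiny subgroup, a plain percentile bootstrap is enough to show how wide the uncertainty really is. A sketch with made-up numbers:

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(values, n_boot=2000, alpha=0.05):
    # Resample with replacement and take percentiles of the resampled means.
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Hypothetical: 8 people in "B × female × 60+", 2 of them approved.
tiny_group = np.array([1, 1, 0, 0, 0, 0, 0, 0])
low, high = bootstrap_ci(tiny_group)
# Roughly (0.0, 0.62): an interval so wide it spans "no disparity"
# to "severe disparity". The honest conclusion is you need more data.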

Fairness in LLMs

Everything we've discussed so far applies to classical ML with structured data — tabular inputs, binary outputs, well-defined protected attributes. Large language models break most of these assumptions, and nobody fully knows how to measure fairness for them yet. I think it's worth being honest about the state of the art here.

LLMs can exhibit bias in ways that don't map cleanly onto the frameworks we've built. A model might generate more positive recommendation letters when the name sounds male. It might associate certain nationalities with terrorism in story completion tasks. It might produce more confident medical advice for conditions that predominantly affect demographics overrepresented in its training data.

The most structured evaluation tool is the Bias Benchmark for QA (BBQ), introduced by Parrish et al. in 2022. BBQ contains over 58,000 multiple-choice questions targeting nine social categories — age, disability, gender identity, nationality, physical appearance, race, religion, sexual orientation, and socioeconomic status. Each question appears in two contexts: ambiguous (where the right answer is "unknown" and choosing anything else reveals stereotype reliance) and disambiguated (where evidence points to a specific answer). If the model picks the stereotyped answer in the ambiguous condition, that's measured bias.

Beyond benchmarks, red teaming — having humans try to elicit biased or harmful outputs through adversarial prompting — remains the most practical approach for discovering LLM bias. Organizations like Anthropic and OpenAI run structured red-teaming campaigns before major model releases. The challenge is that this doesn't scale, and the biases you find depend entirely on what the red teamers think to look for.

Embedding-level bias is another avenue. Word embeddings encode geometric relationships that reflect societal stereotypes — "man" relates to "programmer" the way "woman" relates to "homemaker." These biases propagate through every layer of the model. Debiasing techniques exist (projecting out bias directions, counterfactual data augmentation), but they tend to reduce measurable bias on benchmarks while leaving subtler biases intact.

My favorite thing about fairness in LLMs is that, aside from high-level intuitions like "check for stereotyped outputs," no one is completely certain how to do this well. The field is moving fast, the benchmarks are imperfect, and the models are too large and opaque for the kind of surgical analysis we can do on logistic regression. If someone tells you they've "solved" LLM fairness, be skeptical.

The Cases That Changed Everything

Throughout this journey, I've woven in real-world examples. Let me bring the major ones together, because each one taught the field something specific that you can carry forward into your own work.

COMPAS (2016) taught us that fairness is not a single metric. ProPublica showed unequal false positive rates by race. Northpointe showed calibrated risk scores. The impossibility theorem proved both couldn't be satisfied simultaneously. The lesson: before you deploy a risk scoring system, decide which fairness definition matters most for your stakeholders, document that choice, and prepare to defend it. There is no neutral option.

Amazon Hiring (2014–2017) taught us about historical bias. The model trained on ten years of mostly-male hiring decisions and learned that "male" was a positive signal. It penalized the word "women's" on resumes. Amazon tried to fix it, concluded the bias was too deeply embedded, and scrapped the project. The lesson: if your training data reflects a biased process, the model will reproduce that bias with mechanical efficiency. Audit your labels before you train.

Obermeyer Healthcare Study (2019) taught us about measurement bias — specifically, the danger of proxy variables. Using healthcare cost as a proxy for healthcare need systematically disadvantaged Black patients because they had less access to care. Only 17.7% were flagged for extra care when 46.5% should have been. The lesson: your proxy variable encodes the biases of the system that generated it. Ask yourself what your target variable is actually measuring.

Gender Shades (2018) taught us about intersectional failure. Facial recognition error rates were below 1% for light-skinned men and up to 34.7% for dark-skinned women. Single-axis audits missed it entirely. The lesson: always audit at intersections. The groups that experience the worst outcomes are often the ones that disappear when you average over a single axis.

Each of these cases is also, in a sense, a story about a thermometer that read five degrees low in certain rooms. The COMPAS thermometer read "high risk" differently depending on your race. The Amazon thermometer read "qualified" differently depending on your gender. The healthcare thermometer read "sick" differently depending on your ability to access a doctor. The Gender Shades thermometer read "face" differently depending on your skin tone. The fix in every case started with the same step: measuring what the thermometer actually said for each group, and comparing.

If you're still with me, thank you. I hope it was worth it.

We started with a simple spreadsheet of loan applications and a nagging pattern in the approval rates. From there we traced where bias enters (six different ways), discovered that removing the protected column is theater because of proxy variables, built up three competing definitions of fairness (demographic parity, equalized odds, calibration), and then watched the floor drop out with the impossibility theorem — the proof that you can't have all three when base rates differ. We learned to detect bias with Fairlearn and the four-fifths rule, explored three intervention points (pre-processing, in-processing, post-processing), confronted the compounding effects of intersectional bias, and acknowledged that fairness in LLMs is still an open problem.

My hope is that the next time you're building a model that makes decisions about people — who gets a loan, who gets flagged for review, who gets shown a job listing, who gets a healthcare referral — instead of assuming it's fair because you didn't include a race column, you'll measure its behavior across groups, choose a fairness definition that fits your context, document the tradeoff, and defend that choice. That's the job. Not "be fair." Choose which kind of fair, and own it.

Resources and Credits

Chouldechova, "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments" (2017) — the impossibility theorem paper. Short, readable, and devastating. If you only read one paper on this topic, make it this one.

Obermeyer et al., "Dissecting racial bias in an algorithm used to manage the health of populations," Science (2019) — the healthcare proxy-variable study. One of the clearest demonstrations of how measurement bias works in practice.

Buolamwini & Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification" (2018) — the study that made the entire industry reckon with intersectional bias. The project website is beautifully done.

Fairlearn documentation — clean API, solid tutorials, and the ExponentiatedGradient implementation is wildly useful for constrained optimization. Start here for production work.

IBM AIF360 — heavier toolkit with a wider range of algorithms. Reach for it when Fairlearn doesn't have what you need.

Parrish et al., "BBQ: A Hand-Built Bias Benchmark for Question Answering" (2022) — the most structured approach to measuring LLM bias. Imperfect but the best we have.