Interpretability and Explainability
I avoided looking closely at explainability for longer than I'm comfortable admitting. Every time a colleague mentioned SHAP values or LIME, I'd nod along, mumble something about Shapley, and quietly change the subject. I could use the tools — call shap.TreeExplainer, get a pretty waterfall plot — but I didn't really understand what was happening underneath. What exactly is a Shapley value? Why do SHAP and LIME sometimes disagree on which feature matters most? And if the explanations themselves can be unreliable, what are we even doing? Finally the discomfort of not knowing grew too great. Here is that dive.
Interpretability is the study of understanding why a model makes the predictions it does. The field has exploded since roughly 2016, driven by two forces: the rise of deep learning (which produced models too complex to inspect by eye) and the arrival of regulations — the EU's GDPR in 2018, followed by the EU AI Act in 2024 — that started requiring explanations for automated decisions. The tools we'll build up here — SHAP, LIME, counterfactuals, concept-based methods, and more — are the practitioner's toolkit for answering the question "why did the model say that?"
Before we start, a heads-up. We're going to be working through some game theory, some calculus, a bit of linear algebra, and a lot of conceptual reasoning. You don't need to know any of it beforehand. We'll add what we need, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
Contents
A loan, denied
Two philosophies: intrinsic vs. post-hoc
The map analogy
SHAP — Shapley values from scratch
LIME — the fast local approximation
PDP and ICE — the global landscape
Rest stop
Model-specific methods: attention, gradients, Grad-CAM
Counterfactual explanations
Concept-based explanations (TCAV)
Mechanistic interpretability
Fairness through interpretability
The regulatory landscape
The hard truth: limitations of current XAI
Wrapping up
Resources and further reading
A Loan, Denied
Here's our running scenario. We've built a model that decides whether to approve or deny loan applications. It takes in three features: income (annual, in dollars), credit_score (300–850), and debt_ratio (total monthly debt divided by monthly income, as a percentage). The model is an XGBoost classifier. It works well — 94% accuracy on the test set. Stakeholders are happy.
Then an applicant named Maria gets denied. She calls customer service and asks: "Why?" The customer service agent looks at the model's output — a single number, 0.23 probability of repayment — and has nothing useful to say. There's no "because." There's no trail of reasoning. There's a number.
This is the problem. The model made a decision. It might be the right decision. But nobody — not Maria, not the agent, not the regulator who will audit this system next quarter — can tell why. And in many jurisdictions, "the model said so" is no longer a legally acceptable answer.
Everything we build in this section is an attempt to answer Maria's question. We'll keep coming back to her.
Two Philosophies: Intrinsic vs. Post-Hoc
There are two fundamentally different approaches to making models understandable, and the tension between them defines the entire field.
The first approach says: build a model that is intrinsically interpretable. A decision tree, for example. If the tree says "deny" for Maria, you can trace the path from root to leaf: her credit score was below 620, so we went left; her debt ratio was above 40%, so we went left again; that leaf says "deny." The explanation is the model. There's no separate explanation system to build, no approximation, no gap between what the model does and what we say it does.
Generalized Additive Models (GAMs) are another intrinsically interpretable family. A GAM decomposes the prediction into a sum of individual feature effects — one smooth curve for income, one for credit score, one for debt ratio. You can literally plot each curve and see "here's how credit score affects the prediction, holding everything else constant." The explanation is baked into the structure.
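To make that concrete, here's a minimal sketch using the pygam library (one GAM implementation among several); the training data and column order are assumptions borrowed from the loan example above.

```python
from pygam import LogisticGAM, s

# A minimal sketch, assuming X_train / y_train from the loan example with
# columns in the order income, credit_score, debt_ratio.
gam = LogisticGAM(s(0) + s(1) + s(2)).fit(X_train, y_train)  # one smooth term per feature

# Each term's learned curve can be inspected directly: the explanation is
# the model itself.
for i, name in enumerate(["income", "credit_score", "debt_ratio"]):
    grid = gam.generate_X_grid(term=i)
    print(name, gam.partial_dependence(term=i, X=grid)[:5])
```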
Cynthia Rudin, in a 2019 paper that stirred up the whole field, argued forcefully that for high-stakes decisions — criminal sentencing, medical diagnosis, loan approvals — we should stop using black-box models with post-hoc explanations and use intrinsically interpretable models instead. Her core argument: a post-hoc explanation is a second model approximating the first one, and there's no guarantee that approximation is faithful. If we can build a model that's both accurate and interpretable, why add the extra layer of uncertainty?
She has a point. But here's the uncomfortable reality: for many problems, especially in deep learning (image classification, language modeling, speech recognition), there is no intrinsically interpretable model that comes close to the performance of a 100-million-parameter neural network. So we're stuck. We need the black box. And then we need to figure out what's happening inside it, after the fact.
That's the second approach: post-hoc explainability. Train whatever model performs best, then apply separate tools — SHAP, LIME, Integrated Gradients — to generate explanations for its predictions. The model and the explanation are two different things. The explanation is an approximation of what the model is doing, not a direct readout.
I'll be honest — I find this tension genuinely unresolved. Rudin's argument is intellectually compelling, but in practice, most of the industry runs on post-hoc methods applied to complex models. We'll cover both sides, but the bulk of this section focuses on post-hoc methods because that's what most practitioners need.
The Map Analogy
Before we dive into specific methods, it helps to have a mental model for what explanation methods actually do. Think of it this way.
Your trained model is a high-dimensional landscape — a surface with peaks, valleys, ridges, and cliffs, one dimension for each feature. The prediction for any input is the height of the terrain at that point. This landscape is real, but it exists in a space with hundreds or thousands of dimensions. We can't see it. We can't walk around it.
An explanation method is a map of that landscape. SHAP draws one kind of map — it decomposes the height at any point into the contribution from each dimension. LIME draws a different kind of map — it flattens the terrain around a point into a simple plane and reads off the slopes. Grad-CAM draws yet another — it shows you which region of the input image corresponds to the steepest ascent on the terrain. Counterfactual explanations draw a fourth — they find the nearest point on the map where the terrain crosses a boundary.
Every map is a projection. Every projection loses information. Different maps of the same terrain can look wildly different and still be "correct" in their own terms. This is why SHAP and LIME sometimes disagree — they're different projections of the same landscape.
We'll keep returning to this map analogy as we go.
SHAP — Shapley Values from Scratch
SHAP is, in my experience, the most important single tool in the explainability toolkit. It's the one you'll reach for most often, and it's the one with the strongest theoretical foundation. But to really understand it, we need to start somewhere that seems completely unrelated: a problem from cooperative game theory.
Three friends and a taxi
Imagine three friends — Alice, Bob, and Carol — sharing a taxi. The taxi costs $24. Alice's destination is along the way and would cost $6 if she rode alone. Bob's would cost $12 alone. Carol's is the full $24 ride. They all share the taxi, but how should they split the $24 fare? If they split evenly ($8 each), Alice is paying more than her solo ride — that's unfair. If they pay proportionally to their solo costs ($6 + $12 + $24 = $42, so Alice pays 6/42 × $24 ≈ $3.43), that seems better, but it doesn't account for the fact that Alice's leg of the journey benefits Bob and Carol too.
Lloyd Shapley solved this problem in 1953. His insight: consider every possible ordering of the players. For each ordering, look at what each player adds to the cost when they "join" the coalition. The player's fair share is their average marginal contribution across all possible orderings.
Let's work through it. There are 3! = 6 possible orderings of Alice (A), Bob (B), and Carol (C). For each ordering, we track what each person adds:
When the order is A → B → C: Alice joins first, adds $6. Bob joins, extending the ride from $6 to $12, so he adds $6. Carol joins, extending from $12 to $24, so she adds $12.
When the order is A → C → B: Alice adds $6. Carol joins, extending to $24, adds $18. Bob joins — the ride already goes past his stop, so he adds $0.
Working through all six orderings and averaging, we get Alice's Shapley value: $2. Bob's: $5. Carol's: $17. They sum to exactly $24 — the total cost. That's not a coincidence. It's a mathematical guarantee.
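If you want to check that arithmetic yourself, here's a small sketch that enumerates all six orderings and averages each person's marginal contribution. The coalition costs are the ones from the story:

```python
from itertools import permutations

# Cost of the taxi ride needed to cover each coalition of passengers,
# taken from the story above.
cost = {
    frozenset(): 0,
    frozenset("A"): 6,   frozenset("B"): 12,  frozenset("C"): 24,
    frozenset("AB"): 12, frozenset("AC"): 24, frozenset("BC"): 24,
    frozenset("ABC"): 24,
}

players = ["A", "B", "C"]
shapley = {p: 0.0 for p in players}

# For every ordering, add up each player's marginal contribution on joining.
for order in permutations(players):
    coalition = frozenset()
    for p in order:
        shapley[p] += cost[coalition | {p}] - cost[coalition]
        coalition = coalition | {p}

shapley = {p: total / 6 for p, total in shapley.items()}  # 3! = 6 orderings
print(shapley)  # {'A': 2.0, 'B': 5.0, 'C': 17.0} -- sums to 24
```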
From taxis to features
Now here's the leap. Replace "friends" with "features." Replace "taxi fare" with "model prediction." Replace "coalition" with "subset of features included in the model."
For our loan model with three features (income, credit_score, debt_ratio), the "game" works like this. The "payout" for any coalition of features is the model's expected prediction when we know those features and marginalize over the rest. The Shapley value for each feature tells us: on average, across all possible orderings of features being revealed, how much does this feature change the prediction?
The formal definition: for a model f whose full feature set N contains n features, the Shapley value for feature i is:
φ_i = Σ_{S ⊆ N\{i}} [ |S|! (n − |S| − 1)! / n! ] × [ v(S ∪ {i}) − v(S) ]
That looks intimidating, but it's doing exactly what we did with the taxi. The fraction |S|!(n−|S|−1)!/n! is the weight — it accounts for how many orderings put feature i right after the coalition S. The bracketed term v(S ∪ {i}) − v(S) is the marginal contribution: how much does the prediction change when we add feature i to the set S?
The beautiful thing about Shapley values is that they're the only attribution method satisfying four properties simultaneously: efficiency (they sum to the total prediction minus the baseline), symmetry (equal contributors get equal values), null player (irrelevant features get zero), and additivity (for combined models, attributions add up). The uniqueness result goes back to Shapley's game-theoretic work; Lundberg and Lee carried it into model explanation in their 2017 paper that introduced SHAP, showing that SHAP is the unique additive feature attribution with these properties. That's not marketing — it's a theorem.
The exponential problem and its solutions
There's a catch, and it's a serious one. Computing exact Shapley values requires evaluating v(S) for every possible subset S. With n features, there are 2^n subsets. For our 3-feature loan model, that's 8 subsets — totally manageable. For a model with 100 features, that's 2^100 ≈ 10^30 subsets. Not manageable. Not even close.
This is where the practical variants come in. TreeSHAP exploits the structure of decision trees to compute exact Shapley values in polynomial time — O(TLD²) where T is the number of trees, L is the number of leaves, and D is the maximum depth. For a typical XGBoost model, this takes milliseconds per prediction. It's the reason SHAP dominates tabular ML.
KernelSHAP works on any model by sampling random coalitions and fitting a weighted linear regression. It's model-agnostic but much slower — seconds to minutes per prediction for high-dimensional inputs. It inherits some instability from the sampling, though much less than LIME.
DeepSHAP adapts the DeepLIFT algorithm to satisfy the Shapley axioms approximately, running through the neural network's computation graph. GradientSHAP uses expected gradients — it's noisier but scales well to large inputs like images.
Back to Maria
Let's return to our loan denial. We compute SHAP values for Maria's application and find: income contributes +0.08 (pushing toward approval), credit_score contributes −0.31 (pushing toward denial), and debt_ratio contributes −0.19 (also pushing toward denial). These sum to −0.42, which is exactly the difference between Maria's predicted probability (0.23) and the average prediction across all applicants (0.65).
Now we can answer Maria's question. The model denied her primarily because of her credit score, with her high debt ratio as a secondary factor. Her income was actually working in her favor. That's an actionable answer — she knows what to work on.
This is SHAP's superpower: every explanation is a complete accounting. Nothing is hidden, nothing is double-counted. The values always add up. When I first understood this additivity property, it was the moment SHAP stopped feeling like a black box explaining another black box and started feeling like an actual tool I could trust.
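For the record, here's roughly what that computation looks like with the shap library. This is a sketch that assumes a fitted XGBoost classifier named model and a test DataFrame X_test; Maria's row index is a placeholder.

```python
import shap

# A sketch assuming the fitted XGBoost classifier `model` and a test
# DataFrame `X_test` (columns: income, credit_score, debt_ratio).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

maria = 0  # hypothetical row index for Maria's application
print(dict(zip(X_test.columns, shap_values[maria])))

# Efficiency in action: the base value plus the per-feature contributions
# reconstructs the model's raw output for this row. (For an XGBoost
# classifier, TreeExplainer works in log-odds space by default, so this is
# the margin, not the probability.)
print(explainer.expected_value + shap_values[maria].sum())
```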
LIME — The Fast Local Approximation
SHAP approaches explanation through game theory. LIME (Local Interpretable Model-agnostic Explanations, Ribeiro et al. 2016) takes a completely different route — it approaches explanation through approximation.
The core idea is beguilingly simple. Around any prediction, the model's behavior — however complex globally — is approximately linear in a small neighborhood. So let's fit a simple linear model right there, in that neighborhood, and use its coefficients as our explanation.
The recipe, step by step
Take Maria's loan application. LIME generates, say, 5,000 perturbed copies of her data — versions where income is a bit higher, credit score a bit lower, debt ratio changed. For each perturbed copy, LIME asks the original model "what would you predict for this?" Then it weights each perturbed sample by how close it is to the original (closer neighbors get higher weight) and fits a sparse linear regression on this weighted data.
The coefficients of that linear model are the explanation. A positive coefficient for credit_score means "increasing credit score pushes the prediction toward approval in this neighborhood." A large negative coefficient for debt_ratio means "debt ratio is strongly pushing toward denial right here."
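Here's a sketch of that recipe with the lime library, assuming the model and data from our loan example. Setting random_state pins down the random perturbations, which matters for the stability issue discussed below.

```python
from lime.lime_tabular import LimeTabularExplainer

# A sketch assuming `model`, `X_train`, and `X_test` from the loan example.
explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=["income", "credit_score", "debt_ratio"],
    class_names=["deny", "approve"],
    mode="classification",
    random_state=0,          # keep the perturbations reproducible
)
exp = explainer.explain_instance(
    X_test.values[0],        # hypothetical row for Maria
    model.predict_proba,     # LIME queries the original model here
    num_samples=5000,        # perturbed copies, as described above
    num_features=3,
)
print(exp.as_list())         # [(feature condition, local weight), ...]
```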
Back to our map analogy: LIME takes the complex landscape, zooms into a tiny patch around Maria's location, and lays a flat plane over it. The slopes of that plane are the explanation. If the terrain is actually curved, the plane is an approximation — it's good in the immediate neighborhood but might be misleading further out.
Where LIME struggles
The biggest problem with LIME is stability. Because it generates perturbations randomly, running LIME twice on the exact same input can produce different explanations. I've seen cases where the top feature flips between runs. In a Jupyter notebook, that's a minor annoyance. In a regulatory filing, it's a deal-breaker.
LIME also doesn't guarantee that the explanation is consistent with the actual model. The linear approximation might assign a feature positive importance when the model's actual behavior is non-monotonic in that region. There's no theoretical guarantee analogous to SHAP's efficiency axiom. The coefficients don't have to sum to anything meaningful.
That said, LIME has real strengths. It's faster than KernelSHAP for high-dimensional inputs. It works beautifully for image explanations — by perturbing superpixels rather than individual pixels, it produces clean visual explanations that are intuitive to non-technical stakeholders. And for text, dropping words and seeing what happens is a natural and compelling approach.
I think of LIME as the explanation method equivalent of a quick sketch — fast, intuitive, often useful, but not something you'd submit as a legal document.
PDP and ICE — The Global Landscape
SHAP and LIME explain individual predictions — "why did the model deny Maria?" But sometimes we need a bigger picture: "how does the model use credit score in general, across all applicants?" That's where Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) curves come in.
The thought experiment
Imagine taking every applicant in our dataset and artificially setting their credit score to 500. Run the model on all of them. Record the average prediction. Now set everyone's credit score to 510. Run the model again. Average again. Repeat this across the full range of credit scores, from 300 to 850. Plot the averages. That's a Partial Dependence Plot.
The PDP shows the average effect of credit score on the prediction, marginalizing over all the other features. If the curve rises steeply between 600 and 700, it means credit scores in that range matter a lot, on average. If it's flat above 750, it means once you have excellent credit, more credit doesn't help much.
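The thought experiment translates almost line for line into code. A sketch, assuming the fitted model and a feature DataFrame X from our loan example:

```python
import numpy as np

# Manual partial dependence for credit_score, following the thought
# experiment above.
grid = np.arange(300, 851, 10)
pdp = []
for score in grid:
    X_mod = X.copy()
    X_mod["credit_score"] = score                        # force everyone to this score
    pdp.append(model.predict_proba(X_mod)[:, 1].mean())  # average predicted approval
```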
When the average lies
The problem with averages is that they hide heterogeneity. Maybe for high-income applicants, the credit score curve is steep everywhere. But for low-income applicants, the curve is essentially flat — credit score barely matters because their income alone predicts denial. The PDP, by averaging both groups together, would show a moderately steep curve that represents neither group accurately.
ICE curves (Goldstein et al., 2015) fix this by plotting the curve for every individual, not just the average. You get a bundle of lines instead of a single line. If the lines are roughly parallel, the PDP tells the whole story. If the lines diverge or cross, there are interaction effects, and the PDP is hiding something important.
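In practice you rarely compute these by hand; scikit-learn will draw the ICE bundle and the PDP average in one call (a sketch, again assuming model and X from the loan example):

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# kind="both" overlays every individual ICE curve on the averaged PDP,
# which makes diverging or crossing lines easy to spot.
PartialDependenceDisplay.from_estimator(model, X, features=["credit_score"], kind="both")
plt.show()
```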
In our map analogy, a PDP is like a topographic cross-section of the terrain averaged across all north-south positions. An ICE plot shows every individual cross-section. When the terrain has ridges running in different directions, the average cross-section can be deceptively flat.
I'll be honest — I underestimated ICE curves for a long time. PDPs are what everyone shows in presentations. But the few times ICE curves revealed crossing patterns in my models, it completely changed how I understood the model's behavior. They're worth the extra visual clutter.
Rest Stop
Congratulations on making it this far. You can stop if you want.
If you stopped here, you'd have a solid working toolkit: SHAP for principled attributions (especially with tree models), LIME for quick local approximations, and PDP/ICE for understanding global feature effects. These three — plus the understanding that Shapley values are the only attribution satisfying all four fairness axioms — put you ahead of most practitioners I've met.
What this doesn't tell you is what to do when your model is a neural network (where TreeSHAP doesn't apply), how to explain decisions in terms of human concepts rather than raw features, or what's happening at the very frontier of interpretability research, where people are trying to reverse-engineer neural networks neuron by neuron.
The short version: gradient methods handle neural nets, TCAV handles concepts, and mechanistic interpretability is trying to understand the actual circuits inside the network. There. That gets you 80% of the way.
But if the discomfort of not knowing what's underneath is nagging at you, read on.
Model-Specific Methods: Attention, Gradients, Grad-CAM
Everything we've covered so far is model-agnostic — it treats the model as a black box and only cares about inputs and outputs. But when we know the model's architecture, we can exploit that structure for faster, richer explanations.
Attention weights: the seductive trap
If you've worked with transformers, you've probably seen attention heatmaps — those colorful grids showing which tokens the model "attends to" when making a prediction. They're visually compelling. When the model classifies a movie review as positive and the attention highlights "brilliant performance," it feels like we've found the explanation.
It's mostly an illusion. Jain and Wallace (2019) showed three devastating findings: attention weights frequently don't correlate with gradient-based feature importance; you can find alternative attention patterns that produce identical predictions; and adversarial attention distributions can attend to completely different tokens while maintaining the same output. The attention pattern is not uniquely determined by the prediction — it's one of many possible configurations that could produce the same result.
Wiegreffe and Pinter (2019) offered a partial defense — "attention is not not explanation" — showing that in some tasks, attention does correlate with human-annotated rationales. But the consensus has settled into a cautious middle: attention is useful for debugging (noticing the model attends to punctuation when it shouldn't) but not for formal explanation (telling a regulator why a decision was made).
I still look at attention visualizations regularly. They're useful conversation starters. But I've trained myself to say "interesting" instead of "explanatory."
Gradient-based methods: the calculus path
When a model is differentiable (neural networks), we can compute ∂output/∂input in one backward pass. That gradient tells us: which input dimensions is the model most sensitive to right now? If the gradient with respect to a particular pixel is large, nudging that pixel changes the prediction a lot.
Vanilla gradients have a problem: they capture sensitivity at a single point. In our map analogy, it's like measuring the slope of the terrain exactly where you're standing. Take one step and the slope might be completely different. The resulting saliency maps look like visual static — noisy, hard to interpret, fragile.
Integrated Gradients (Sundararajan et al., 2017) fixes this elegantly. Instead of measuring the gradient at one point, it measures the gradient at every point along a straight-line path from a blank baseline (say, an all-black image) to the actual input, then averages them all. The resulting attribution for each input dimension is the integral of the gradient along that path, times the difference between the input and baseline.
Why this works: Integrated Gradients satisfies completeness — the attributions sum to exactly the difference between the model's output at the input and at the baseline. Sound familiar? It's the same efficiency axiom that makes Shapley values add up. That's not a coincidence — both methods are anchored in the same mathematical foundations. This is why, empirically, Integrated Gradients and SHAP tend to agree more often than they disagree.
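Here's a minimal sketch of the path integral in PyTorch, approximated as an average over evenly spaced points. It assumes model maps a single input tensor to a scalar score, say the logit of the class we care about:

```python
import torch

def integrated_gradients(model, x, baseline, steps=50):
    # Average the gradient along the straight-line path from `baseline` to
    # `x`, then scale by (x - baseline). Assumes `model` returns one scalar.
    alphas = torch.linspace(0.0, 1.0, steps)
    grads = []
    for a in alphas:
        point = (baseline + a * (x - baseline)).detach().requires_grad_(True)
        model(point).backward()
        grads.append(point.grad.detach().clone())
    avg_grad = torch.stack(grads).mean(dim=0)
    return (x - baseline) * avg_grad
```

A useful sanity check on any implementation is completeness: the attributions should sum approximately to model(x) minus model(baseline).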
Grad-CAM is the go-to method specifically for CNNs. It takes the gradients flowing into the last convolutional layer and uses them to weight the feature maps, producing a coarse spatial heatmap — a blob over the region the model found relevant. The resolution is low (you get a blurry highlight, not pixel-level precision), but that's actually a feature: it's much easier to look at a blob and say "the model focused on the chest area" than to make sense of a noisy pixel-level attribution.
SmoothGrad is a simple enhancement: add Gaussian noise to the input multiple times, compute gradients each time, average. It denoises any gradient-based saliency map. Think of it as applying a Gaussian blur to a photograph — the important structures remain, the noise washes out.
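A sketch of that idea, under the same assumption as the Integrated Gradients snippet above (model returns a scalar score for the class of interest):

```python
import torch

def smoothgrad(model, x, n_samples=25, sigma=0.1):
    # Average the input gradient over noisy copies of x.
    grads = []
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).detach().requires_grad_(True)
        model(noisy).backward()
        grads.append(noisy.grad.detach())
    return torch.stack(grads).mean(dim=0)
```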
Counterfactual Explanations
All the methods we've covered so far answer the question "which features drove this prediction?" Counterfactual explanations answer a fundamentally different question: "what's the smallest change that would have produced a different outcome?"
Back to Maria. SHAP told her: credit score was the main reason for denial. A counterfactual explanation goes further: "If your credit score had been 645 instead of 580, your application would have been approved — all else being equal." That's not an attribution. It's a recipe for change.
Wachter, Mittelstadt, and Russell formalized this in 2017, borrowing the philosophical concept of the "closest possible world." Given an input x with prediction f(x) = denied, we search for the nearest input x' such that f(x') = approved, measuring "nearest" by some distance function. The idea borrows from how humans naturally reason about decisions: "I would have gotten the job if I'd had one more year of experience."
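To make the search concrete, here's a deliberately toy sketch that varies only Maria's credit score, assuming model from the loan example and a hypothetical feature vector x_maria in the order (income, credit_score, debt_ratio). Real counterfactual methods search over all features under a distance metric.

```python
import numpy as np

# Toy counterfactual search along a single axis: what credit score would
# flip the decision, with everything else held fixed?
best = None
for new_score in np.arange(x_maria[1], 851, 5):
    candidate = x_maria.copy()
    candidate[1] = new_score                                   # nudge the credit score up
    prob_repay = model.predict_proba(candidate.reshape(1, -1))[0, 1]
    if prob_repay >= 0.5:                                      # decision flips to "approve"
        best = candidate
        break                                                  # first flip = nearest along this axis
print(best)
```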
The elegance of counterfactuals is that they're inherently actionable. "Your credit score was the most important factor" is informative. "Reduce your debt-to-income ratio by 5 percentage points" is actionable. For Maria, the counterfactual is the most useful explanation of all.
There are practical complications. A naive counterfactual might suggest "if your age were 15 years younger" — technically correct but useless, because you can't change your age. Good counterfactual methods incorporate actionability constraints (only suggest changes to features the person can actually change) and causal structure (if you suggest increasing income, the model should also update features that would naturally change with income). This is an area of very active research — and, I'll admit, one where my intuition is still developing.
Concept-Based Explanations (TCAV)
Here's a frustration I kept running into: SHAP tells me that "pixel (127, 84) contributed positively to classifying this as a zebra." Great. But what I really want to know is: "did the model use stripes to classify this as a zebra?" Features and concepts are different levels of abstraction, and humans think in concepts.
TCAV — Testing with Concept Activation Vectors — was introduced by Been Kim et al. in 2018 to bridge this gap. The idea is beautifully direct.
Pick a concept you care about — say, "stripes." Collect a set of images that exemplify stripes and a set of random images that don't. Feed both sets through the neural network and extract the activations at some intermediate layer. Train a simple linear classifier (like logistic regression) to separate "stripe" activations from "non-stripe" activations. The direction perpendicular to that classifier's decision boundary — a vector in activation space — is the Concept Activation Vector (CAV). It points in the direction of "more stripey" in the network's internal representation.
Now, for test images of zebras, compute the gradient of the "zebra" output with respect to the activations at that layer. Check whether those gradients point in the same direction as the striped CAV. The TCAV score is the fraction of zebra images for which the gradient aligns with the concept direction. A TCAV score of 0.95 for "stripes" on "zebra" means: for 95% of zebra images, making the internal representation "more stripey" would increase the model's confidence that it's a zebra.
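Here's a sketch of those two steps. The activation arrays acts_concept and acts_random and the gradient array grads_zebra are placeholders you'd collect from the network beforehand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# CAV step: separate "stripes" activations from random activations; the
# classifier's coefficient vector is the concept direction.
X = np.vstack([acts_concept, acts_random])
y = np.array([1] * len(acts_concept) + [0] * len(acts_random))
cav = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]

# TCAV score: fraction of zebra images whose gradient has a positive
# directional derivative along the concept direction.
tcav_score = np.mean(grads_zebra @ cav > 0)
```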
What makes TCAV powerful is that you can test concepts that the model was never explicitly trained on. You can ask: "Does my skin cancer classifier use 'ruler markings' as a concept?" (If so, it might be keying on the presence of a ruler in the image rather than the lesion itself.) You can ask: "Does my hiring model use 'gender' as a concept?" (If so, that's a fairness problem.)
The limitation is that TCAV requires you to define your concepts in advance and provide example images. If you don't know what concept to test, TCAV won't discover it for you. It confirms hypotheses; it doesn't generate them.
Mechanistic Interpretability
Everything so far has been about explaining individual predictions from the outside — treating the model as, at best, a partially transparent box. Mechanistic interpretability is something fundamentally more ambitious: it's trying to reverse-engineer what the model actually learned, neuron by neuron, layer by layer.
I find this the most fascinating and unsettling area of interpretability research. It asks: can we open up a neural network and read its "source code"?
The central challenge is superposition. Neural networks represent far more concepts than they have neurons. A single neuron doesn't represent "cat" or "legal clause" — it participates in representing hundreds of different concepts simultaneously, in different combinations and at different activation levels. It's as if every variable in a program stored multiple values at once, encoded in a way that the program can still function correctly. This makes it enormously hard to understand what any individual neuron "means."
Anthropic's team made a major breakthrough in 2024, scaling up sparse autoencoders (SAEs) to production-sized language models like Claude Sonnet. A sparse autoencoder is trained to decompose a model's internal activations into a much larger set of features, most of which are zero for any given input. The idea is that superposition encodes many features in a few neurons, and the SAE tries to reverse that compression. They found millions of features corresponding to interpretable concepts — things like "mentions of the Golden Gate Bridge," "legal language," "code that's about to have a bug."
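The core of an SAE is small enough to sketch. This toy PyTorch version only shows the shape of the idea, with made-up dimensions; production SAEs involve many additional training tricks:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Expand d_model activations into a much wider dictionary of features,
    # then reconstruct the original activations from them.
    def __init__(self, d_model=512, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # mostly zero after training
        return self.decoder(features), features

def sae_loss(recon, acts, features, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty that pushes features toward zero.
    return ((recon - acts) ** 2).mean() + l1_coef * features.abs().mean()
```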
Circuit analysis goes a step further: once you've identified features, trace how they connect to each other. What pattern of features in layer 3 activates what features in layer 7? These chains of feature activations are called circuits, and they're the closest thing we have to reading a neural network's reasoning process.
I'll be honest about where this stands: mechanistic interpretability has produced genuinely stunning results on individual examples, but we're far from being able to mechanistically explain an entire large model. It's still more "proof of concept" than "production tool." No one is using circuit analysis to debug their production fraud model (yet). But the pace of progress is remarkable, and I wouldn't be surprised if this changes within a few years.
Fairness Through Interpretability
Interpretability tools aren't only about satisfying regulators or curious applicants. They're one of the most effective ways to detect bias in your model — sometimes biases that no aggregate fairness metric would catch.
Return to our loan model. Suppose we run SHAP on the entire test set and create a summary plot. We notice something uncomfortable: zip_code has a high SHAP importance. That's odd — why would geography matter so much? We dig deeper. We overlay SHAP values for zip_code with the racial demographics of each zip code. The pattern is stark: zip codes with predominantly minority populations systematically get negative SHAP contributions. The model isn't using race as a feature — race isn't even in the dataset — but it's using zip code as a proxy, and the effect is discriminatory.
This is the kind of finding that no accuracy metric, no confusion matrix, and no aggregate fairness score would reveal. You needed a tool that could show you which features drive which predictions for which subgroups. SHAP gives you exactly that.
The general workflow for fairness auditing through interpretability: compute SHAP values for all predictions, segment by protected group (gender, race, age), and compare the distributions. Are the same features driving decisions across groups, or is the model using different reasoning for different people? Do any features serve as proxies for protected attributes? The answers aren't always comfortable, but they're essential.
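That workflow is only a few lines once you have SHAP values in hand. A sketch, assuming the shap_values array and X_test from earlier, plus a hypothetical pandas Series group holding a protected attribute that is not among the model's features:

```python
import numpy as np
import pandas as pd

# Mean absolute SHAP contribution per feature, per group. `group` is a
# Series (e.g. self-reported demographic) aligned with X_test's index.
audit = (
    pd.DataFrame(np.abs(shap_values), columns=X_test.columns, index=X_test.index)
      .groupby(group)
      .mean()
)
print(audit)  # one row per group; large gaps flag features worth investigating
```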
TCAV is equally powerful here. You can directly test: "Does my model use the concept 'female' when making hiring decisions?" If the TCAV score for 'female' on 'reject' is significantly above 0.5, your model has a problem — and you now have a precise, quantifiable measurement of that problem.
The Regulatory Landscape
Explainability is no longer optional in many domains. The regulatory landscape is complex and evolving rapidly, but here's the picture as of 2024.
The EU's GDPR (2018) includes Article 22, which gives individuals the right not to be subject to decisions based solely on automated processing. Recital 71 and Articles 13–15 establish what amounts to a "right to explanation" — data subjects must receive "meaningful information about the logic involved." The exact scope of this right is still debated by legal scholars, but in practice, companies operating in the EU need to be able to explain automated decisions.
The EU AI Act (2024) goes much further. It classifies AI systems by risk level — high-risk systems (credit scoring, hiring, medical devices) require transparency, documentation, human oversight, and explainability as explicit design requirements. This isn't a suggestion. There are penalties up to €35 million or 7% of global annual turnover.
In the US, banking regulators have long required explanations for credit decisions (Equal Credit Opportunity Act, Fair Credit Reporting Act). The Federal Reserve's SR 11-7 guidance on model risk management (adopted by the OCC as well) requires that models be "explainable" and that their limitations be well understood. In healthcare, the FDA's framework for clinical decision support software increasingly requires that clinicians understand the basis for AI recommendations.
For practitioners, this means a few concrete things: you need deterministic explanations (the same input must always produce the same explanation, which rules out unseeded LIME), you need documented methodology (which method you chose, why, and how you validated it), and you need stability testing (small changes to input shouldn't wildly change the explanation).
The Hard Truth: Limitations of Current XAI
After building up all these tools, I want to be honest about their limits. This is the part of the field where I have the most unresolved discomfort.
Explanation methods disagree with each other. Run SHAP, LIME, and Integrated Gradients on the same prediction and you'll often get different top features. Which one is right? There's no ground truth to compare against — we don't actually know what the model "really" uses. Each method makes different assumptions (game theory, local linearity, path integration) and produces a different projection of the same terrain.
Some explanation methods lie. Adebayo et al. (2018) introduced a devastating sanity check: compare explanations from a trained model against a randomly initialized one with garbage weights. If the explanations look similar, the method isn't detecting learned behavior — it's detecting input structure. Guided Backpropagation and Guided Grad-CAM failed this test. They produce visually appealing saliency maps that are essentially edge detectors, regardless of what the model learned. They look explanatory. They aren't.
Faithfulness is not guaranteed. A post-hoc explanation is a second model approximating the first. There's no guarantee that approximation is accurate, especially in regions of the feature space where the model's behavior is highly nonlinear. Rudin's 2019 critique cuts deep here: if the explanation can be wrong, and we can't easily check whether it's wrong, are we better off or worse off than having no explanation at all?
Human users misunderstand explanations. Studies have shown that providing explanations can actually increase user overtrust in incorrect model predictions. People see a confident-looking SHAP waterfall plot and think "well, it has reasons, so it must be right." The explanation becomes a tool for persuasion rather than understanding. This is particularly dangerous in high-stakes settings.
Explanations can be gamed. Slack et al. (2020) showed that it's possible to build models that behave unfairly on real data but produce fair-looking SHAP explanations — by detecting when the input is a SHAP perturbation and switching to a "fair" mode. If your adversary knows which explanation method you're using, they can fool it.
I don't say this to discourage you from using these tools. They're the best we have, and they catch real problems. But I've learned to hold explanations with a degree of skepticism — to treat them as evidence rather than proof, and to always cross-check with multiple methods before making important decisions based on them.
Wrapping Up
If you're still with me, thank you. I hope it was worth it.
We started with a denied loan application and a question nobody could answer. We built up from cooperative game theory (splitting a taxi fare) to Shapley values, then to SHAP. We added LIME as a faster but less rigorous alternative, PDP and ICE for global patterns, gradient-based methods for neural networks, counterfactuals for actionable "what-if" answers, TCAV for concept-level explanations, and mechanistic interpretability for the ambitious goal of reverse-engineering what models actually learn. We confronted the regulatory pressures driving the field and the uncomfortable limitations that keep it humble.
My hope is that the next time you deploy a model and someone asks "why did it decide that?" — instead of mumbling about probabilities and changing the subject (as I used to do), you'll reach for the right tool, understand what it's actually telling you, and know where its answers stop being trustworthy. Not because you memorized a catalog of methods, but because you have a genuine mental model of what each one is doing underneath.
Resources and Further Reading
- Christoph Molnar, "Interpretable Machine Learning" — the single best book on this topic. Free online, exhaustively thorough, and updated regularly. If you read one thing after this, make it this.
- Lundberg & Lee, "A Unified Approach to Interpreting Model Predictions" (2017) — the original SHAP paper. Beautifully written, connects the dots between Shapley values and several existing methods in a way that feels inevitable in hindsight.
- Cynthia Rudin, "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions" (2019, Nature Machine Intelligence) — the sharpest critique of post-hoc explainability. Whether or not you agree, it will permanently change how you think about the field.
- Anthropic, "Mapping the Mind of a Large Language Model" (2024) — the paper that made mechanistic interpretability feel real. The examples of features they discovered are genuinely mind-bending.
- Been Kim et al., "Interpretability Beyond Feature Attribution: TCAV" (2018) — the paper that introduced concept-based explanations. Elegant idea, well-executed, and deeply practical for model auditing.
- Adebayo et al., "Sanity Checks for Saliency Maps" (2018) — essential reading before you trust any gradient-based explanation method. The sanity checks are straightforward to run and should be standard practice.