Interpretability & Visualization
Your CNN makes a prediction — but what did it actually look at? This section builds interpretability from scratch: we start with a single pixel's gradient, construct saliency maps, discover why they're noisy, fix that with SmoothGrad and Integrated Gradients, then graduate to Grad-CAM's spatial heatmaps. We explore model-agnostic methods (LIME, SHAP, occlusion), peer inside neurons with feature visualization, and touch on the frontier — mechanistic interpretability, where researchers are literally reverse-engineering what individual neurons compute. Throughout, we use a running example of debugging a skin-lesion classifier that's suspiciously good, because interpretability isn't decoration. It's the difference between a model that works and a model that cheats.
I'll be honest — I treated interpretability as an afterthought for an embarrassingly long time. I'd train a model, check accuracy, maybe glance at a confusion matrix, and move on. The model said "malignant." The number was 0.94. Good enough, right? Then I saw a Grad-CAM heatmap of one of my "best" models, and the hotspot was on the ruler in the corner of a dermoscopy image. Not the lesion. The ruler. The model had learned that rulers appear more often in images that dermatologists thought were worth measuring — and measuring correlates with malignancy. My 94% accuracy was a lie.
That experience changed how I think about every model I build. Interpretability isn't a nice-to-have academic exercise. It's the only way to know whether your model learned the right thing or the convenient thing. The tools in this section are your X-ray machine for X-ray machines — they let you look inside the network and see what it actually learned.
Before we start, a heads-up. We're going to be computing gradients, manipulating feature maps, and building some intuition for game theory (Shapley values). You don't need to know any of it beforehand. We'll add what we need, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
The Question Every Model Must Answer
Saliency Maps — The Simplest Window
Fixing the Noise — SmoothGrad and Integrated Gradients
Grad-CAM — From Pixels to Regions
Occlusion Sensitivity — The Brute-Force Sanity Check
LIME — Explaining Without Seeing the Weights
SHAP for Images — Game Theory Meets Pixels
Rest Stop
Feature Visualization — What Neurons Dream About
Concept Activation Vectors — Thinking in Human Terms
Attention Maps in Vision Transformers
The Sanity Check That Shook the Field
Mechanistic Interpretability — The Frontier
The Debugging Playbook
The Toolbox — Libraries and Practicalities
The Question Every Model Must Answer
Imagine you've trained a CNN to classify skin lesions from dermoscopy images. It hits 96% accuracy on your test set. Your stakeholder — a dermatologist with 20 years of experience — asks a single question: "What is the model looking at when it says melanoma?"
If you can't answer that, the model doesn't ship. Not because the dermatologist is being difficult, but because a model that achieves 96% by detecting rulers, color cards, or ink marks on the skin is worse than no model at all. It would give confident wrong answers on images without those artifacts. In medicine, confident wrong answers kill people.
This is the problem interpretability tools solve. They take the billions of multiplications happening inside a network and produce something a human can look at and say: "Yes, it's focusing on the lesion border irregularity" or "No, it's focusing on the hair follicles."
We'll call our skin-lesion classifier DermNet throughout this section — a ResNet-50 fine-tuned on 25,000 dermoscopy images across 7 lesion types. Every technique we build will be applied to DermNet, so you can see each tool doing real work on the same model.
Saliency Maps — The Simplest Window
The most basic question we can ask about a neural network's decision is: "Which input pixels, if I wiggled them slightly, would change the output the most?" That question has a precise mathematical answer — the gradient of the output with respect to the input.
Let's make this concrete with DermNet. We feed in a dermoscopy image. The network outputs a score for each of the 7 lesion classes. We pick the predicted class — say, melanoma — and ask PyTorch to compute how that score changes if we nudge each of the 224×224×3 = 150,528 input values by a tiny amount. The magnitude of each gradient tells us how sensitive the prediction is to that pixel.
import torch

model.eval()                        # inference behavior for dropout / batchnorm
tensor.requires_grad_(True)         # we want gradients w.r.t. the input pixels
output = model(tensor)              # [1, 7] class scores
predicted_class = output.argmax(dim=1).item()
score = output[0, predicted_class]
score.backward()
# Each pixel now has a gradient — take absolute value, collapse RGB channels
saliency = tensor.grad.data.abs().squeeze()  # [3, 224, 224]
saliency, _ = saliency.max(dim=0)            # [224, 224]
This gives us a saliency map — a pixel-level heat map of sensitivity. The term was introduced by Simonyan et al. in 2013, making it one of the oldest neural network visualization techniques still in use. The idea is elegant: pixels with large gradients are the ones the model "cares about" most.
But here's the catch. If you actually look at a raw saliency map, it's... noisy. Speckled. Individual pixels light up like static on an old TV. The map doesn't show clean regions — it shows scattered bright dots everywhere, and it's hard to tell whether the model is looking at the lesion or at noise in the image sensor. I remember staring at my first saliency map thinking it must be a bug.
It's not a bug. It's a fundamental limitation. Gradients are local — they tell you what would happen if you moved a single pixel by an infinitesimal amount, holding everything else fixed. Real features span many pixels, and the gradient at any one pixel is a noisy estimate of a broader pattern.
Fixing the Noise — SmoothGrad and Integrated Gradients
That noisy saliency map is frustrating, but the underlying idea — use gradients to measure importance — is sound. The noise is the problem, not the principle. Two techniques fix it in different ways, and understanding both matters because they reveal different things about the model.
SmoothGrad takes the brute-force approach. Instead of computing one saliency map, it computes fifty. Each time, it adds a small amount of random Gaussian noise to the input before computing gradients. Then it averages all fifty maps. The noise is different each time, so the speckle averages out, but the consistent signal — the regions the model truly depends on — survives.
import torch

def smooth_grad(model, tensor, target_class, n_samples=50, noise_level=0.15):
    mean_grad = torch.zeros_like(tensor)
    for _ in range(n_samples):
        # Fresh Gaussian noise each pass; detach so noisy is a leaf we can grad
        noisy = (tensor + noise_level * torch.randn_like(tensor)).detach()
        noisy.requires_grad_(True)
        score = model(noisy)[0, target_class]
        score.backward()
        mean_grad += noisy.grad.data.abs()
    # Average the maps, then collapse RGB channels as before
    return (mean_grad / n_samples).squeeze().max(dim=0).values
The improvement is dramatic. Where the vanilla map was speckled static, SmoothGrad produces coherent regions you can actually interpret. For DermNet, the smoothed map clearly highlights the lesion boundary — which is exactly where dermatologists look for irregularity.
Integrated Gradients (Sundararajan et al., 2017) takes a more principled approach. Instead of adding noise, it asks: "What's the path from nothing to this image, and how much does each pixel contribute along that path?" It starts with a baseline — typically a black image (all zeros) — and interpolates in small steps toward the actual input. At each step, it computes the gradient. The final attribution for each pixel is the average gradient across all those steps, multiplied by the difference between input and baseline.
Why go to this trouble? Because Integrated Gradients satisfies mathematical axioms that other attribution methods violate. Sensitivity: if the input and baseline differ in a single feature and produce different predictions, that feature gets a non-zero attribution. Raw gradients fail this whenever a saturated ReLU zeroes out the gradient of a feature that clearly matters. Implementation invariance: two networks that compute identical functions always get identical attributions, regardless of internal architecture differences. These properties matter when you need to trust your explanations, not approximate them.
Integrated Gradients requires a baseline — the "nothing" you interpolate from. A black image works for natural photos, but for medical imaging where black regions carry meaning (like the dark background in X-rays), a black baseline can produce misleading attributions. For dermoscopy, I've found that using the per-channel dataset mean as the baseline gives more reliable results. The choice of baseline is not a minor detail — it can completely change your explanation.
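Here's a minimal sketch of the standard Riemann-sum approximation of that recipe. The step count is a knob, and the baseline defaults to black here; swap in the per-channel mean for dermoscopy as discussed above:

import torch

def integrated_gradients(model, tensor, target_class, baseline=None, steps=50):
    if baseline is None:
        baseline = torch.zeros_like(tensor)   # black image by default
    total_grad = torch.zeros_like(tensor)
    for alpha in torch.linspace(0, 1, steps):
        # Interpolate between baseline and input, and take the gradient there
        point = (baseline + alpha * (tensor - baseline)).detach().requires_grad_(True)
        score = model(point)[0, target_class]
        score.backward()
        total_grad += point.grad.data
    # Average gradient along the path, scaled by (input - baseline)
    return (tensor - baseline) * total_grad / steps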
Grad-CAM — From Pixels to Regions
Saliency maps, even the smoothed varieties, answer the question at the pixel level. But when the dermatologist asks "what is the model looking at?", they don't want to hear about pixel (147, 203). They want to hear "the irregular border in the upper-left quadrant." We need to zoom out from pixels to regions.
This is where Gradient-weighted Class Activation Mapping — Grad-CAM — comes in, and it's the single most important interpretability tool in the CNN practitioner's toolkit. Selvaraju et al. introduced it in 2017, and it became the default visualization method for good reason: it's fast, works on any CNN architecture without modification, and produces class-specific heatmaps that match human intuition about what the model should be looking at.
The core insight is beautiful. Deep inside a CNN, the last convolutional layer produces a stack of feature maps — let's say 2,048 of them in ResNet-50, each one a 7×7 spatial grid. Each feature map responds to a different learned pattern: one might fire when it sees irregular borders, another when it sees brown pigment, another when it sees a particular texture. These maps still carry spatial information — they know where in the image the pattern occurred — but they're semantically rich enough to relate to specific classes.
Grad-CAM figures out which of those 2,048 feature maps matter for the target class. It does this with gradients. Backpropagate the target class score, and for each feature map, compute the average of its gradient across all spatial positions. That average becomes the importance weight for that channel. Multiply each feature map by its weight, sum them all up, pass through ReLU (we only care about features that positively influence the class), and you've got a 7×7 heatmap showing which spatial regions mattered most. Upscale it to 224×224, overlay it on the original image, and you can see exactly where the model was looking.
import torch
import torch.nn.functional as F
# Hook into the last conv layer to capture activations and gradients
activations, gradients = {}, {}
def save_act(module, inp, out):
    activations['val'] = out.detach()

def save_grad(module, grad_in, grad_out):
    gradients['val'] = grad_out[0].detach()
target_layer = model.layer4[-1]
target_layer.register_forward_hook(save_act)
target_layer.register_full_backward_hook(save_grad)
# Forward pass, then backprop the target class
output = model(tensor)
model.zero_grad()
output[0, target_class].backward()
# The Grad-CAM recipe:
# 1) Global average pool the gradients → one weight per channel
weights = gradients['val'].mean(dim=[2, 3], keepdim=True)
# 2) Weighted sum of feature maps, then ReLU
cam = (weights * activations['val']).sum(dim=1, keepdim=True)
cam = F.relu(cam)
# 3) Upscale to input size
cam = F.interpolate(cam, size=(224, 224), mode='bilinear', align_corners=False)
cam = cam.squeeze().cpu().numpy()  # move off the GPU before converting
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
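Overlaying the heatmap takes a few lines of matplotlib. Here original_image is assumed to be the denormalized input as an H×W×3 array in [0, 1]:

import matplotlib.pyplot as plt

plt.imshow(original_image)               # the dermoscopy image itself
plt.imshow(cam, cmap='jet', alpha=0.4)   # heatmap at partial opacity
plt.axis('off')
plt.show()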
When I ran this on DermNet's melanoma predictions, the story became immediately clear. On correctly diagnosed images, the heatmap sat right over the lesion — specifically over regions with asymmetric borders and color variegation, the textbook dermoscopic features. On my "ruler images" that had fooled me before, the heatmap was firmly planted on the measurement artifact. Two minutes of Grad-CAM revealed what a week of metric-staring had missed.
There's a variant called Grad-CAM++ that uses second-order gradients (the gradient of the gradient) to produce better localization when multiple instances of the same object appear in one image. In practice, vanilla Grad-CAM is sufficient for most debugging workflows.
The limitation of Grad-CAM is its resolution. That 7×7 grid means the heatmap is inherently coarse — it can tell you "upper-left quadrant" but not "this specific cluster of 20 pixels." When you need pixel-level attribution, you're back to Integrated Gradients or the methods we'll see next.
Occlusion Sensitivity — The Brute-Force Sanity Check
All the methods so far rely on gradients — they peek inside the model's computational graph. Occlusion sensitivity does something refreshingly dumb: it covers up part of the image and checks whether the model's confidence drops.
Slide a grey square across DermNet's input, one position at a time. At each position, run the (now partially occluded) image through the network and record the confidence for the target class. Where the confidence drops the most, that region was carrying the most important information. No hooks, no gradients, no assumptions about the model's internals.
import numpy as np
import torch

def occlusion_map(model, img_tensor, target_class, patch=32, stride=8):
    _, _, H, W = img_tensor.shape
    heatmap = np.zeros((H, W))
    count = np.zeros((H, W))
    with torch.no_grad():   # pure forward passes, no gradients needed
        base = model(img_tensor).softmax(1)[0, target_class].item()
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                masked = img_tensor.clone()
                # Grey out this patch and record the confidence drop
                masked[:, :, y:y+patch, x:x+patch] = 0.5
                score = model(masked).softmax(1)[0, target_class].item()
                heatmap[y:y+patch, x:x+patch] += (base - score)
                count[y:y+patch, x:x+patch] += 1
    return heatmap / np.maximum(count, 1)
The appeal is that this method is model-agnostic. It works on CNNs, transformers, even a model wrapped behind an API where you only get predictions. No access to weights needed — you're treating the model as a pure black box.
The cost is speed. For a 224×224 image with a 32×32 patch and stride of 8, you're running 625 forward passes per image (25 patch positions along each axis). That's fine for inspecting a handful of suspicious predictions. It's impractical for large-scale analysis, which is why Grad-CAM (one forward + one backward pass) remains the default for routine work.
But here's why occlusion still matters: when Grad-CAM and occlusion sensitivity agree — when both methods highlight the same region — your confidence that the model is genuinely using that region goes way up. Gradient-based methods can sometimes be misleading (more on that later), but occlusion can't lie. If covering up a region tanks the prediction, that region was carrying the signal. Period.
LIME — Explaining Without Seeing the Weights
Occlusion sensitivity is model-agnostic but gives you a heatmap, not an explanation in natural terms. LIME — Local Interpretable Model-agnostic Explanations (Ribeiro et al., 2016) — takes the black-box idea further by building an entire interpretable model around a single prediction.
For images, LIME works like this. First, it segments the image into superpixels — coherent blobs of similar color, typically 50–150 of them, using an algorithm called SLIC. Each superpixel becomes a "feature." Then LIME generates hundreds of perturbed versions of the image, randomly turning superpixels on (visible) or off (greyed out). Each perturbed image is run through the model, and the prediction is recorded.
Now comes the clever bit: LIME fits a simple weighted linear regression. The features are which superpixels were on/off (a binary vector), and the target is the model's predicted probability. Perturbations closer to the original image get higher weight. The resulting coefficients tell you how much each superpixel contributed to the prediction. The top positive coefficients are highlighted as the explanation.
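The whole loop fits in a short sketch. This is a from-scratch illustration, not the lime package's implementation: predict_fn is assumed to map a batch of H×W×3 float images to class probabilities, and the exponential similarity kernel is a simple stand-in for the library's:

import numpy as np
from skimage.segmentation import slic
from sklearn.linear_model import Ridge

def lime_explain(predict_fn, image, target_class, n_segments=80, n_samples=500):
    segments = slic(image, n_segments=n_segments)           # superpixel label per pixel
    n_feats = segments.max() + 1
    masks = np.random.randint(0, 2, (n_samples, n_feats))   # superpixels on/off
    grey = image.mean(axis=(0, 1))
    preds, weights = [], []
    for m in masks:
        perturbed = image.copy()
        for s in np.where(m == 0)[0]:
            perturbed[segments == s] = grey                 # "off" means greyed out
        preds.append(predict_fn(perturbed[None])[0, target_class])
        # Weight samples by closeness to the original (fraction of superpixels kept)
        weights.append(np.exp(-((1 - m.mean()) ** 2) / 0.25))
    reg = Ridge(alpha=1.0)
    reg.fit(masks, preds, sample_weight=np.array(weights))
    return reg.coef_, segments   # one contribution per superpixel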
For DermNet, LIME might show that superpixels covering the dark, asymmetric core of a lesion are the most important for the melanoma prediction, while the surrounding healthy skin contributes nothing. That's an explanation a dermatologist can evaluate — "the model focused on the dark irregular region" is actionable in a way that "pixel (147, 203) has gradient 0.003" is not.
The limitation is stability. Run LIME twice on the same image, and you might get slightly different explanations, because the random perturbations differ. For deployment scenarios where you need reproducible explanations, pin the random seed or consider SHAP.
SHAP for Images — Game Theory Meets Pixels
The instability of LIME bothered researchers, and the fix came from an unexpected direction: cooperative game theory. SHAP — SHapley Additive exPlanations (Lundberg & Lee, 2017) — computes the Shapley value for each feature, a concept from the 1950s that asks: "If every possible coalition of features were tried, what's the average marginal contribution of this particular feature?"
For images, computing exact Shapley values for every pixel is intractable — with 150,528 pixels, you'd need to evaluate more coalitions than there are atoms in the universe. So SHAP uses the same superpixel trick as LIME: segment the image, treat each superpixel as a player in a cooperative game. DeepSHAP goes further by exploiting the network's layer structure to approximate Shapley values efficiently using a technique called DeepLIFT propagation rules.
What makes SHAP special is its theoretical foundation. Shapley values are the unique attribution scheme satisfying three properties simultaneously. Local accuracy: the attributions for all features sum to the difference between the model's output and the baseline expectation. Missingness: a feature that is absent from the input gets zero attribution. Consistency: if the model changes so that a feature's marginal contribution grows, its attribution never shrinks. No other attribution method satisfies all three.
The practical trade-off is compute. Even with DeepSHAP's approximations, generating SHAP explanations for a batch of images is significantly slower than Grad-CAM. In my workflow, I use Grad-CAM for fast iteration and pull out SHAP when I need attributions I can defend in front of a regulatory body or a skeptical clinician.
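In code, assuming the shap library, that looks roughly like this. The background batch bg, the test batch, and the class index melanoma_idx are placeholders you'd supply:

import shap

# bg: a small background batch of training images, e.g. [64, 3, 224, 224]
explainer = shap.GradientExplainer(model, bg)

# One attribution array per output class, each shaped like the input batch
shap_values = explainer.shap_values(test_batch)
melanoma_attr = shap_values[melanoma_idx]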
Congratulations on making it this far. You can stop here if you want.
You now have a working mental model of the core interpretability toolkit: saliency maps for pixel-level sensitivity, SmoothGrad and Integrated Gradients for cleaner versions, Grad-CAM for spatial region attribution, occlusion for model-agnostic verification, LIME for human-readable explanations, and SHAP for theoretically grounded attributions. That's enough to debug any production CNN.
What comes next goes deeper — we'll look inside individual neurons with feature visualization, bridge human concepts to model internals with TCAV, examine why some of these methods can lie to you, and peek at the frontier of mechanistic interpretability where researchers are literally reverse-engineering neural circuits. These topics show up in senior ML interviews and are increasingly important in regulated industries.
But if the discomfort of not knowing what's underneath is nagging at you, read on.
Feature Visualization — What Neurons Dream About
Everything so far has been about explaining a specific prediction: "Why did the model say melanoma for this image?" Feature visualization asks a different question: "What has this neuron learned to recognize, across all images?"
The technique, popularized by Chris Olah and colleagues in a landmark 2017 Distill article, is activation maximization. Pick a neuron — say, channel 137 in ResNet-50's layer3. Start with a random noise image. Then use gradient ascent (not descent — we're maximizing, not minimizing) to iteratively modify the image so that it makes that neuron fire as strongly as possible. After a few hundred iterations, the resulting image shows what the neuron "wants to see."
The results are striking. Early-layer neurons dream of edges and color gradients — exactly what you'd expect from Gabor-like filters. Mid-layer neurons produce textures: honeycomb patterns, fur, scales, repeating geometric motifs. Deep-layer neurons produce recognizable objects or parts of objects — eyes, wheels, dog faces — though they often look surreal, like a hallucination. That's not a coincidence. Google's DeepDream was built on exactly this technique: run activation maximization on an entire image rather than starting from noise, and the network's learned features get amplified into those psychedelic, eye-covered landscapes that went viral in 2015.
Without regularization, activation-maximized images look like high-frequency noise that happens to excite the neuron. Adding jitter (randomly shifting the image by a few pixels each step), applying small rotations, and penalizing high-frequency patterns produces images that look much more natural and interpretable. The regularization isn't a cosmetic trick — it acts as a prior for "the kinds of images that actually occur in the real world."
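Here's a minimal sketch of the loop, with jitter as the lone regularizer. The layer hook, channel index, and hyperparameters are illustrative:

import torch

def visualize_channel(model, layer, channel, steps=200, lr=0.05):
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)        # only the image gets optimized
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(val=o))
    img = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Jitter: shift the image a few pixels each step to suppress noise patterns
        dy, dx = torch.randint(-4, 5, (2,)).tolist()
        model(torch.roll(img, shifts=(dy, dx), dims=(2, 3)))
        # Gradient ascent: push the chosen channel's mean activation up
        loss = -acts['val'][0, channel].mean()
        loss.backward()
        opt.step()
    handle.remove()
    return img.detach()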
For DermNet, feature visualization revealed something interesting. Several deep neurons had learned to respond to specific dermoscopic patterns — blue-white structures, pigment network, regression areas — that correspond almost exactly to features dermatologists are trained to look for. That's reassuring. But other neurons had clearly learned to respond to the circular border of the dermoscope itself. Those neurons would fire whenever they saw the characteristic dark vignette, regardless of what lesion was inside. That's the kind of insight that no accuracy metric would ever reveal.
Concept Activation Vectors — Thinking in Human Terms
Feature visualization shows what neurons respond to, but interpreting those dream-like images still requires guesswork. Testing with Concept Activation Vectors — TCAV (Kim et al., 2018) — flips the script. Instead of asking "what does this neuron represent?", it asks "how important is this human-defined concept to the model's prediction?"
The mechanism is clever. You pick a concept — say, "irregular border." You collect a set of images that exhibit irregular borders and a set that don't. You pass both sets through the network and extract activations from a chosen layer. Then you train a simple linear classifier (logistic regression or linear SVM) to separate "has irregular border" from "doesn't." The normal vector of that classifier's decision boundary becomes the Concept Activation Vector — a direction in activation space that corresponds to your concept.
The TCAV score then answers: "For images the model classifies as melanoma, what fraction get a higher melanoma score when you nudge their activations in the 'irregular border' direction?" If the score is high — say, 0.85 — the model is strongly influenced by border irregularity when predicting melanoma. If it's 0.50 (random), the concept doesn't matter. If you test the concept "ruler present" and get a high TCAV score, you've caught the model using a shortcut.
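In code, the two halves are a linear fit and a directional-derivative count. This sketch assumes you've already extracted flattened layer activations for concept and random images, plus per-example gradients of the class score with respect to that layer; the full TCAV protocol also repeats this against many random sets for significance testing:

import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(acts_concept, acts_random):
    # acts_*: [n, d] layer activations for concept vs. random example images
    X = np.concatenate([acts_concept, acts_random])
    y = np.concatenate([np.ones(len(acts_concept)), np.zeros(len(acts_random))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_[0]
    return v / np.linalg.norm(v)   # unit normal of the decision boundary

def tcav_score(grads, cav):
    # grads: [n, d] gradients of the class score w.r.t. the same layer's activations
    # Fraction of examples whose score rises when nudged along the concept direction
    return float(((grads @ cav) > 0).mean())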
TCAV's power is that it speaks in the language of domain experts. Instead of showing a dermatologist a heatmap and asking "does this look right?", you can say: "The model's melanoma predictions are 82% influenced by border irregularity, 74% by color variegation, and 15% by the presence of a ruler." That's a conversation a clinician can engage with.
Attention Maps in Vision Transformers
Vision Transformers (ViTs) split images into patches — typically 16×16 pixels — and process them as a sequence with self-attention. This architecture gives us interpretability almost for free: the attention weights are right there, telling us which patches the model attends to.
with torch.no_grad():
    out = vit_model(tensor, output_attentions=True)  # assumes a Hugging Face-style ViT
    # Last layer, all heads, CLS token attending to image patches
    attn = out.attentions[-1]        # [batch, heads, tokens, tokens]; tokens = CLS + patches
    cls_attn = attn[0, :, 0, 1:]     # CLS → image patches, all heads
    cls_attn = cls_attn.mean(dim=0)  # average across heads
    grid = int(cls_attn.shape[0] ** 0.5)
    attn_map = cls_attn.reshape(grid, grid)  # 14×14 spatial map for ViT-B/16 at 224 px
Individual attention heads often show surprisingly clean specialization. In ViT models trained on dermoscopy, I've seen heads that consistently attend to lesion borders, heads that attend to color patches, and heads that attend to the image center regardless of content. Visualizing heads separately is more informative than averaging them.
This is a trap worth flagging explicitly, because it comes up in interviews. Attention maps show where the model looks, not what causes the prediction. A head might attend strongly to a region, but downstream layers might completely ignore that information. For rigorous causal attribution in ViTs, use Attention Rollout (multiply attention matrices across all layers to trace information flow from input to output) or apply Grad-CAM to the ViT's patch embeddings. Raw single-layer attention is suggestive, not conclusive.
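Attention Rollout itself is only a few lines. A sketch under the usual recipe (add the identity for residual connections, re-normalize, multiply across layers), assuming a list of per-layer attention tensors like the one above:

import torch

def attention_rollout(attentions):
    # attentions: per-layer tensors of shape [batch, heads, tokens, tokens]
    n = attentions[0].shape[-1]
    result = torch.eye(n)
    for attn in attentions:
        a = attn[0].mean(dim=0)               # average over heads
        a = (a + torch.eye(n)) / 2            # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)   # keep rows as distributions
        result = a @ result                   # compose flow layer by layer
    return result[0, 1:]   # CLS row, restricted to image patches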
The Sanity Check That Shook the Field
In 2018, Adebayo et al. published a paper called "Sanity Checks for Saliency Maps" that made a lot of people uncomfortable — myself included. The experiment was devastatingly simple: take a trained model, randomize its weights (destroying everything it learned), and then re-run the saliency method. If the explanation changes dramatically, the method is actually sensitive to what the model learned. If it looks roughly the same... the method was showing you the input image's structure, not the model's reasoning.
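The test is easy to run yourself. Here's a sketch of the all-at-once variant (the paper also randomizes layer by layer, from the top down); saliency_fn stands in for whatever attribution method you want to audit, returning a tensor:

import copy
import torch
import torch.nn.functional as F

def randomization_check(model, tensor, target_class, saliency_fn):
    before = saliency_fn(model, tensor, target_class)
    # Destroy everything the model learned
    randomized = copy.deepcopy(model)
    for m in randomized.modules():
        if hasattr(m, 'reset_parameters'):
            m.reset_parameters()
    after = saliency_fn(randomized, tensor, target_class)
    # High similarity is the red flag: the explanation ignores the weights
    return F.cosine_similarity(before.flatten(), after.flatten(), dim=0).item()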
Several popular methods failed. Guided Backpropagation — which produces visually stunning, sharp-looking explanations — produced nearly identical outputs before and after weight randomization. It was effectively performing edge detection on the input, with no connection to the model's decision process. Pretty, but meaningless.
Grad-CAM fared better but not perfectly. It changes when weights are randomized, which is good, but some researchers have shown cases where the changes are smaller than you'd expect. The takeaway isn't that Grad-CAM is useless — it isn't — but that no single attribution method should be trusted in isolation. When Grad-CAM and occlusion sensitivity agree, you can be confident. When they disagree, investigate further.
I still use Grad-CAM as my first tool. But after the Adebayo paper, I always corroborate with at least one other method before making claims about what the model "learned." That extra five minutes has caught problems more than once.
Mechanistic Interpretability — The Frontier
Everything we've covered so far treats the network as something to be probed from the outside — we poke at it with inputs and watch what happens to outputs. Mechanistic interpretability is an entirely different philosophy: reverse-engineer the network from the inside, neuron by neuron, circuit by circuit, until you understand how it computes what it computes.
The challenge is a phenomenon called superposition. In an ideal world, each neuron would represent one clean, interpretable feature. In practice, networks learn to pack far more concepts into their neurons than they have dimensions — a single neuron might encode "striped texture," "diagonal edge at 45 degrees," and "part of a boat" simultaneously, in different activation contexts. This makes individual neurons hard to interpret.
Anthropic's research (2023–2025) pioneered a promising approach: train sparse autoencoders on a layer's activations. The autoencoder has many more output dimensions than the layer has neurons, forcing a sparse, overcomplete representation. Each output dimension tends to correspond to a single, human-interpretable concept — disentangling the superposition. Researchers have used this to identify features in large language models that correspond to specific concepts, and the same ideas are being applied to vision models.
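The core object is small enough to sketch. An overcomplete autoencoder with an L1 penalty on its hidden features; the dimensions and sparsity coefficient are illustrative:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act, d_feat):      # d_feat >> d_act: overcomplete
        super().__init__()
        self.enc = nn.Linear(d_act, d_feat)
        self.dec = nn.Linear(d_feat, d_act)

    def forward(self, x):
        f = torch.relu(self.enc(x))          # sparse, non-negative features
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruct the activations while keeping few features active at once
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()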
I'm still developing my intuition for where mechanistic interpretability is heading. It's the most exciting work in the field, but it's also the least mature. For production CNN debugging today, Grad-CAM and SHAP are your workhorses. But five years from now, mechanistic tools might let us make guarantees like "this network cannot use background color as a feature" — not by testing, but by inspecting the circuit directly. That would change everything about how we trust models.
The Debugging Playbook
These tools aren't decorations for paper figures. They're diagnostic instruments. Here's how they fit into a real debugging workflow, using DermNet as our running example.
The model predicts wrong on an image you expected it to get right. Start with Grad-CAM. If the heatmap highlights the background instead of the lesion, the model learned a shortcut — maybe all melanoma training images came from one hospital with a distinctive skin marker, and the model learned "skin marker = melanoma." The fix is data-side: diversify your sources, augment aggressively, or crop out the irrelevant regions.
Accuracy is suspiciously high. DermNet hitting 99% on a dermatology dataset should trigger alarm bells, not celebration. Run Grad-CAM on a batch of 50–100 correctly classified images. If the heatmaps cluster on watermarks, ruler marks, or the circular vignette of the dermoscope rather than the lesion, you've found data leakage. Strip the metadata, crop the margins, and re-evaluate on images from a hospital not in your training set.
One class consistently underperforms. Overall accuracy looks fine, but "dermatofibroma" keeps getting confused with "benign keratosis." Run feature visualization on the neurons most active for each class. If they look similar, the model hasn't found discriminative features. You need more training examples of the underperforming class, or harder negative mining to force the model to learn the subtle differences.
A stakeholder asks "can you prove it's not using race/skin-tone?" This is a TCAV question. Define the concept "dark skin tone" with example images, compute the concept activation vector, and check the TCAV score for each diagnosis class. If the score is significantly above 0.5, the model's predictions are influenced by skin tone — a fairness problem that needs addressing before deployment.
The Clever Hans problem. Named after a horse in 1900s Berlin that appeared to do arithmetic but was actually reading its trainer's unconscious body language. CNNs do the same thing. The most famous example: a model trained to distinguish wolves from huskies learned to detect snow in the background — because wolves were photographed in snow and huskies on grass. The model never learned what a wolf looks like. It learned what snow looks like. Grad-CAM would have caught this in minutes. The researchers who discovered it had to use LIME. The lesson: never deploy a model whose Grad-CAM you haven't checked.
The Toolbox — Libraries and Practicalities
You don't need to implement Grad-CAM from scratch for production work. The ecosystem has matured significantly.
Captum (Meta) is the most comprehensive library for PyTorch. It implements Integrated Gradients, Layer Conductance, Grad-CAM, SHAP, and dozens more — all through a consistent API where you pass your model and input, specify a target layer, and get attributions back. It also includes visualization utilities that produce publication-quality overlay images.
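A taste of Captum's API, reusing the DermNet model, input tensor, target class, and baseline from earlier:

from captum.attr import IntegratedGradients, LayerGradCam

# Integrated Gradients over input pixels
ig = IntegratedGradients(model)
ig_attr = ig.attribute(tensor, target=target_class, baselines=baseline)

# Grad-CAM on the last convolutional block of the ResNet-50
gradcam = LayerGradCam(model, model.layer4[-1])
cam_attr = gradcam.attribute(tensor, target=target_class)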
pytorch-grad-cam is laser-focused on CAM variants. If all you need is Grad-CAM, Grad-CAM++, Score-CAM, and their relatives, this library is smaller and faster to get started with. It handles multi-GPU models and works with both CNNs and ViTs.
SHAP (the library) includes DeepExplainer for deep networks and GradientExplainer as a faster approximation. For image work, it handles the superpixel segmentation automatically.
One practical tip that has saved me hours: always run at least two methods on the same image before drawing conclusions. If Grad-CAM says "the model is looking at the lesion" but occlusion sensitivity says "confidence doesn't change when you cover the lesion," something is wrong — and the gradient-based method is more likely to be misleading than the brute-force one.
If you're still with me, thank you. I hope it was worth it.
We started with the simplest possible question — "which pixels have the largest gradients?" — and discovered that the answer was noisy. So we smoothed it with SmoothGrad and built a principled version with Integrated Gradients. We zoomed out from pixels to regions with Grad-CAM, verified our trust with brute-force occlusion, and built human-readable explanations with LIME and SHAP. Then we went deeper: peering inside neurons with feature visualization, bridging human concepts to model internals with TCAV, watching Vision Transformer attention, learning from the Adebayo sanity checks that some explanations are less trustworthy than they look, and glimpsing the frontier of mechanistic interpretability where circuits are being reverse-engineered one neuron at a time.
My hope is that the next time your model hits an impressive accuracy number, instead of celebrating and shipping, you'll open Grad-CAM, run it across a batch of predictions, and look. Because a model that's right for the right reasons is worth ten models that are right by accident — and the only way to tell the difference is to look under the hood.
What You Should Now Be Able To Do
- Generate and interpret a Grad-CAM heatmap — hook into the last conv layer, compute gradient-weighted feature maps, overlay the result on the input image, and explain whether the model is focusing on the right region for the right reason.
- Compute saliency maps and know their limits — use vanilla gradients for a quick look, SmoothGrad for noise reduction, and Integrated Gradients when you need theoretically grounded attributions. Know that baseline choice matters.
- Apply model-agnostic methods — use occlusion sensitivity to verify gradient-based results, LIME to produce superpixel-level explanations, and SHAP when you need Shapley-value guarantees.
- Visualize what neurons have learned — use activation maximization to see what individual neurons respond to, and TCAV to test whether the model uses specific human-defined concepts.
- Debug with a multi-method workflow — never trust a single attribution method. Corroborate with at least two, and always check for the Clever Hans problem before deploying to production.
- Evaluate attribution quality — understand why the Adebayo sanity checks matter, what parameter randomization tests reveal, and why visually pretty explanations aren't necessarily faithful ones.