Anomaly Detection

Chapter 6: Unsupervised Learning
Finding the weird stuff in your data

I'll be honest — I avoided anomaly detection for a long time. It sat in my mental backlog as this vaguely unsettling topic, the kind of thing I knew mattered but kept putting off because it didn't fit neatly into the supervised-learning box I was comfortable in. There was no tidy train/test split, no clear loss function to minimize, no accuracy number to chase. Every time I peeked at the literature I found a different taxonomy, a different set of assumptions, and a different reason to close the tab. But anomaly detection kept showing up in real work — in manufacturing alerts, in fraud pipelines, in the health checks we ran on our own ML models. Eventually the discomfort outgrew the avoidance. Here is that dive.

Anomaly detection — also called outlier detection or novelty detection depending on who you ask — is the practice of identifying data points that deviate meaningfully from the expected pattern. The field picked up serious momentum in the late 1990s and early 2000s as credit-card fraud, network intrusion, and industrial monitoring all demanded automated ways to spot the weird stuff. Today it powers everything from detecting fraudulent transactions to flagging defective products on an assembly line to catching sensor drift in IoT systems.

Before we start: you don't need much. Some comfort with means and standard deviations, a rough sense of what a covariance matrix does, and a passing familiarity with scikit-learn. If you've read the earlier chapters on probability and on clustering, you're more than ready. If you haven't, you'll still follow along — I'll define every term the first time it appears.

This isn't a short journey. We'll start with the humblest statistical test, build through tree-based and density-based methods, touch on neural approaches, wrestle with evaluation, and end with hard-won production lessons. By the end you'll have a mental toolkit you can actually use on Monday morning. Let's go.

What We'll Cover

The Core Flip — Learning Normal to Find Abnormal
Z-Score — The Simplest Detector
Mahalanobis Distance — Handling Correlated Features
Isolation Forest — The Workhorse
Local Outlier Factor — Density-Aware Detection
Rest Stop
One-Class SVM — A Boundary in Kernel Space
Autoencoders — The Photocopier That Catches Imposters
Evaluation — Why This Is Harder Than Classification
Production Reality — Drift, Thresholds, and Retraining
Which Method When? — A Comparison

The Core Flip — Learning Normal to Find Abnormal

Most machine learning we encounter is about learning a target: given these features, predict that label. Anomaly detection flips the script. We don't try to learn what anomalies look like — we learn what normal looks like, and flag anything that doesn't fit. The reason is practical: you can't collect all the bad stuff. Fraudsters invent new tricks. Manufacturing defects appear in forms nobody anticipated. Sensor failures produce readings that no engineer would have thought to include in a training set. The space of "abnormal" is enormous and constantly shifting, but the space of "normal" is relatively stable and well-sampled.

Let me make this concrete. Imagine a small factory with a packaging machine monitored by three temperature sensors — one near the motor, one at the sealing head, one on the conveyor belt. On a healthy day, the motor sensor hovers around 70°C, the sealing head around 150°C, the conveyor around 35°C. We have thousands of hours of this boring, beautiful, normal data. We have maybe two incidents of genuine overheating in the past year. Training a binary classifier on two positive examples is a non-starter. But training a model that memorizes the shape of "normal" and screams when something looks different? That we can do.

This insight — learn normal, flag different — is the thread that connects every method we'll build in this section, from a one-line z-score to a deep autoencoder. Keep it in mind. It's the compass.

Z-Score — The Simplest Detector

We start with the tool that's been in every statistician's pocket for over a century. The z-score (also called the standard score) measures how many standard deviations a point lies from the mean: z = (x − μ) / σ, where μ is the mean of your data and σ is the standard deviation. A z-score of 0 means the reading is right at the average. A z-score of 3 means it's three standard deviations above. The classic rule of thumb: flag anything with |z| > 3 as anomalous.

Let's use our factory. Suppose the motor-temperature sensor has five recent readings in degrees Celsius: 69, 71, 70, 72, and 68. The mean is 70, the standard deviation is about 1.58. A new reading of 74 gives z = (74 − 70) / 1.58 ≈ 2.53 — elevated but within bounds. A reading of 85 gives z ≈ 9.49 — wildly outside anything we've seen. That one deserves an alert.
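Here's that arithmetic as a quick NumPy sketch: the same five readings and the same two candidate points. The sample standard deviation (ddof=1) matches the 1.58 above.

import numpy as np

readings = np.array([69, 71, 70, 72, 68])          # recent motor temperatures (°C)
mu, sigma = readings.mean(), readings.std(ddof=1)  # mean 70, sample std ≈ 1.58

def z_score(x):
    """How many standard deviations x lies from the historical mean."""
    return (x - mu) / sigma

for temp in (74, 85):
    z = z_score(temp)
    flag = "ANOMALY" if abs(z) > 3 else "ok"
    print(f"{temp}°C: z = {z:.2f} ({flag})")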

The z-score is univariate, meaning it looks at one feature at a time. It also assumes your data is roughly bell-shaped (Gaussian). If your sensor readings follow a skewed distribution or have heavy tails, the z-score's ±3 threshold will misfire — either crying wolf too often or staying silent when it shouldn't. Still, for a quick sanity check on a single well-behaved measurement, nothing beats it for simplicity and interpretability.

💡 When z-scores break

If your data has multiple modes (two humps instead of one bell curve), the mean and standard deviation become meaningless summaries. A reading that sits between the two humps might get a low z-score despite being in a region where no real data lives. For multimodal data, you need density-based methods further down this page.

Mahalanobis Distance — Handling Correlated Features

Here's where the z-score breaks down and I had to push past my comfort zone. Our factory has three sensors, not one. The motor temperature and sealing-head temperature are correlated — when one runs hot, the other tends to follow, because they share a heating element. The z-score checks each sensor independently, so it might see 78°C on the motor (a bit high but not alarming) and 160°C on the sealing head (also a bit high but not alarming) and shrug. But the combination — motor running unusually cool relative to how hot the sealing head is — might be the actual danger sign. We need a distance measure that accounts for the shape and orientation of the data cloud, not one that treats each dimension in isolation.

The Mahalanobis distance does exactly this. The formula is:

D(x) = √( (x − μ)ᵀ Σ⁻¹ (x − μ) )

Let me break down every piece. x is the data point we're scoring — a vector of our three sensor readings. μ is the mean vector of our training data, one average per sensor. Σ is the covariance matrix (a matrix that captures not only how much each sensor varies but how they vary together). Σ⁻¹ is its inverse, which effectively "un-correlates" the features. The transpose (x − μ)ᵀ turns our column vector into a row so the matrix multiplication works out, and the square root at the end brings us back to distance units.

I'll be honest — the first time I saw this formula, my eyes glazed over. But the intuition rescued me. Imagine your normal data forms an elongated ellipse in 2D — like a cigar tilted at 45 degrees. The Euclidean distance (or equivalently, independent z-scores) draws circles around the center, so a point at the tip of the cigar looks just as far away as a point way off to the side, even though the tip-of-the-cigar point is perfectly normal. Mahalanobis distance draws ellipses that match the data's actual shape. A point at the tip gets a small distance; a point off to the side gets a large one. That's all it's doing: measuring "how many standard deviations away" but along the data's own natural axes, not the coordinate axes.

There's a practical gotcha. Estimating the covariance matrix Σ requires enough data relative to the number of features, and if the data itself is contaminated with outliers, those outliers pull the covariance estimate toward them, making themselves look less anomalous. This is a chicken-and-egg problem. The fix is called the Minimum Covariance Determinant (MCD) — a robust estimator that finds the subset of points (typically 50–75% of the data) whose covariance matrix has the smallest determinant, effectively ignoring the most extreme points when computing the covariance. Scikit-learn provides this via sklearn.covariance.MinCovDet.
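Here's a minimal sketch of both versions: the textbook formula with a plain covariance estimate, and the robust MCD estimator from scikit-learn (whose mahalanobis() method returns squared distances, hence the square root at the end). The covariance values used to simulate the sensor history are made up for illustration.

import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.RandomState(0)
# Simulated "normal" history for [motor, sealing_head, conveyor];
# the off-diagonal terms make motor and sealing head move together
mean_true = np.array([70.0, 150.0, 35.0])
cov_true = np.array([[2.5, 2.0, 0.2],
                     [2.0, 4.0, 0.3],
                     [0.2, 0.3, 1.0]])
X_hist = rng.multivariate_normal(mean_true, cov_true, size=500)

# Textbook formula: D(x) = sqrt( (x - mu)^T Sigma^{-1} (x - mu) )
mu = X_hist.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X_hist, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)

# Robust version: MCD ignores the most extreme points when estimating Sigma
mcd = MinCovDet(random_state=0).fit(X_hist)

x_new = np.array([78.0, 150.0, 35.0])  # motor hot while the sealing head looks ordinary
print(f"plain: {mahalanobis(x_new):.2f}")
print(f"MCD:   {np.sqrt(mcd.mahalanobis(x_new[None, :])[0]):.2f}")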

Isolation Forest — The Workhorse

Statistical methods got us started, but they lean on distributional assumptions that real data gleefully violates. I wanted something that makes no assumptions about the shape of normal — and that could handle fifty features as easily as three. The Isolation Forest, published by Liu, Ting, and Zhou in 2008, is that tool. It has become the default first-reach anomaly detector in industry for good reason.

Random Chopping: The Core Intuition

Think about chopping wood. If you have a whole log and a small twig lying on a table, and you make a random chop across the table, what happens? The twig gets isolated in one or two chops — it's small and far from the log. The log needs many more chops before any single piece is fully separated. Anomalies are twigs. Normal points are chunks of the log. The number of random chops it takes to isolate a point is itself the anomaly score.

More precisely, the Isolation Forest builds an ensemble of isolation trees. Each tree is constructed by repeatedly picking a random feature and a random split value between that feature's minimum and maximum in the current subset, then partitioning the data. This continues until every point sits in its own leaf or the tree reaches a maximum depth. A point that ends up in a leaf after very few splits has a short path length — it was easy to isolate. A point buried in a dense cluster needs many splits before it's alone, producing a long path length.

A Tiny Walk-Through

Let's make this tangible. Imagine six data points in 2D — five forming a tight cluster around (5, 5) and one outlier at (−3, 8). We build one isolation tree. First split: feature x₁, split value 1.0. The outlier (x₁ = −3) goes left, completely alone. Path length: 1. The five normal points all go right. They need several more splits among themselves before any single one is isolated, giving path lengths of 3–5. Average over many trees, and the outlier consistently gets the shortest path.

The Scoring Formula

The raw path lengths get normalized into a score between 0 and 1 using:

s(x, n) = 2^( −E(h(x)) / c(n) )

Here E(h(x)) is the average path length of point x across all the trees in the forest. c(n) is the average path length of an unsuccessful search in a Binary Search Tree (BST) built from n points — it serves as a normalization constant so that scores are comparable across different dataset sizes. When s is close to 1, the point has very short paths and is almost certainly anomalous. When s is close to 0.5, the point's path length is about average — it blends in. Scores near 0 indicate points that are even deeper than average, solidly normal. I'm still developing my intuition for why this particular BST normalization works so elegantly — but empirically, it produces well-calibrated scores across wildly different datasets, and that's why it became the standard.
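For concreteness, here's the score computed directly. The standard expression for the normalization constant is c(n) = 2·H(n−1) − 2(n−1)/n, with the harmonic number approximated by ln(i) plus the Euler–Mascheroni constant; the path lengths below are made-up illustrations.

import numpy as np

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649  # approximate H_{n-1}
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2^(-E(h(x)) / c(n)); near 1 = anomalous, near 0.5 = ordinary."""
    return 2.0 ** (-avg_path_length / c(n))

# Hypothetical average path lengths for a forest grown on subsamples of n = 256
print(f"{anomaly_score(4.0, 256):.2f}")   # short path: score well above 0.5
print(f"{anomaly_score(10.0, 256):.2f}")  # near-average path: score close to 0.5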

The contamination Parameter — The One Knob That Matters

When you call scikit-learn's IsolationForest, the parameter contamination tells the model what fraction of your training data you believe is anomalous. It shifts the decision threshold on the internal anomaly scores: set it too high and you'll flag normal points; too low and real anomalies slip through. In practice, use domain knowledge. If you know fraud is roughly 0.1% of transactions, set contamination=0.001. If you genuinely have no idea, set contamination="auto" and work directly with the raw decision_function scores, choosing your own threshold based on the operational cost of false positives versus missed detections.

Let's run our factory data through it.

import numpy as np
from sklearn.ensemble import IsolationForest

# Five normal readings from 3 sensors: [motor, sealing_head, conveyor]
X_normal = np.array([
    [70, 150, 35],
    [69, 148, 34],
    [71, 152, 36],
    [70, 149, 35],
    [72, 151, 34],
])
# Simulate more normal data by adding small noise
rng = np.random.RandomState(42)
X_train = X_normal[rng.choice(5, size=200)] + rng.normal(0, 1.5, (200, 3))

# New readings to score — one normal, one suspicious
X_test = np.array([
    [71, 150, 35],   # looks perfectly normal
    [85, 140, 50],   # motor and conveyor running hot, sealing head cool
])

iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
iso.fit(X_train)

scores = iso.decision_function(X_test)  # higher = more normal
preds = iso.predict(X_test)             # 1 = normal, -1 = anomaly
for i, (s, p) in enumerate(zip(scores, preds)):
    label = "normal" if p == 1 else "ANOMALY"
    print(f"Reading {i}: score={s:.3f}  → {label}")

The first reading scores comfortably positive — it's well inside the normal cloud. The second, with its unusual combination of hot motor, cool sealing head, and warm conveyor, scores negative and gets flagged. Two hundred trees, one fit call, and we have a working detector.

💡 Extended Isolation Forest

Classic Isolation Forest splits along one axis at a time, which creates axis-aligned decision boundaries. This can introduce bias when anomalies lie along diagonal directions in feature space. The Extended Isolation Forest (Hariri et al., 2019) fixes this by using random hyperplane cuts instead of axis-aligned cuts, better capturing anomalies in any orientation. The eif package on PyPI provides an implementation.

Local Outlier Factor — Density-Aware Detection

Isolation Forest is powerful, but it has a blind spot: it treats the whole feature space with the same level of suspicion. What if your normal data lives in clusters of wildly different densities? Our factory might have two operating modes — a daytime mode where the machine runs steadily, producing a tight cluster of readings, and a nighttime "low-power" mode with a wider spread of temperatures. A point sitting between the two clusters might not have a short isolation path (it's not far from either cluster), but it's in a region where no real data actually lives.

This is where the Local Outlier Factor (LOF) comes in, and I like to think of it through the neighborhood watch analogy. Imagine each data point is a house in a neighborhood. LOF asks: "Is this house's neighborhood as densely populated as its neighbors' neighborhoods?" If everyone around you lives on a crowded street but your immediate surroundings are eerily empty, you stand out — even if you're not far from the city center in absolute terms.

The Mechanics, Step by Step

LOF works through three layers of computation. First, for each point, we find its k nearest neighbors (where k is a parameter we choose, typically 10–50). Then we compute the reachability distance from point A to point B — this is the maximum of the actual distance from A to B and B's own k-distance (the distance from B to its k-th nearest neighbor). The reachability distance prevents points within very tight clusters from having artificially tiny distances; it smooths things out.

Next we compute the Local Reachability Density (LRD) of each point: the inverse of the average reachability distance to its k neighbors. High LRD means you live in a dense neighborhood; low LRD means your neighbors are spread out. Finally, the LOF score for each point is the average ratio of its neighbors' LRDs to its own LRD. A score near 1.0 means your density is similar to your neighbors' — you fit in. A score well above 1.0 (say, 2 or 3) means your neighborhood is much sparser than the neighborhoods around you — the neighborhood watch has flagged you.

from sklearn.neighbors import LocalOutlierFactor

# Same factory data as before
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.05,
    novelty=False  # scoring the training data itself
)
lof_preds = lof.fit_predict(X_train)
lof_scores = lof.negative_outlier_factor_  # more negative = more anomalous

n_anomalies = (lof_preds == -1).sum()
print(f"LOF flagged {n_anomalies} points out of {len(X_train)}")

That code scores every point in our training set. Points deep inside dense regions get scores close to −1 (which confusingly means "normal" — the sign is an artifact of scikit-learn's convention). Points in sparse regions score much more negative.

⚙️ novelty=True vs novelty=False

When novelty=False (the default), LOF operates in transductive mode — it scores the training data itself. You call fit_predict and get labels for every training point. When novelty=True, LOF operates in inductive mode — you train on clean data, then call predict on new, unseen points. This is what you want in production: train on yesterday's known-good sensor readings, then score today's incoming stream as it arrives.
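Here's a minimal sketch of the inductive mode, reusing the X_train and X_test arrays from the Isolation Forest example above.

from sklearn.neighbors import LocalOutlierFactor

# Inductive mode: fit on known-good history, then score new readings as they arrive
lof_prod = LocalOutlierFactor(n_neighbors=20, contamination=0.05, novelty=True)
lof_prod.fit(X_train)

new_scores = lof_prod.decision_function(X_test)  # higher = more normal
new_preds = lof_prod.predict(X_test)             # 1 = normal, -1 = anomaly
print(new_preds)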

LOF shines on datasets with clusters of varying density — the exact situation that trips up Isolation Forest. But it has costs. Distance computations degrade in high dimensions (the infamous "curse of dimensionality"), and without spatial indexing structures the naive algorithm is O(n²) in memory and time. For datasets beyond about 100K points, you'll want approximate nearest neighbor libraries or a different method entirely.

Rest Stop

Let's pause. If you've followed from the top, you now have three genuinely useful tools in your hands. The z-score and Mahalanobis distance handle cases where your data is low-dimensional and well-behaved. Isolation Forest is the general-purpose workhorse — reach for it first in most real problems. Local Outlier Factor covers the gap where data density varies across regions. Together, these three cover the vast majority of anomaly detection tasks you'll encounter in practice.

If you stop here, you're well-equipped. You understand the core flip (learn normal, flag different), you have concrete tools to implement it, and you know when each tool is the right one. Go build something. Come back when you're curious about the rest.

What follows is for those who want the full picture: a boundary-based method (One-Class SVM), a neural approach (autoencoders), the thorny problem of evaluation, and the messy reality of keeping these systems alive in production.

One-Class SVM — A Boundary in Kernel Space

One-Class SVM takes a different approach from everything we've seen so far. Instead of measuring distances or densities or path lengths, it learns a decision boundary — a surface that wraps tightly around the normal data. Anything outside the boundary is anomalous.

The mechanics borrow heavily from the classic Support Vector Machine. The algorithm maps your data into a high-dimensional space using a kernel function (usually the Radial Basis Function, or RBF kernel), then finds the hyperplane in that space that separates the data from the origin with maximum margin. The key parameter is ν (nu), which controls the trade-off: setting nu=0.05 tells the algorithm that up to 5% of training points are allowed to fall outside the boundary. It's simultaneously an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors (the points that define the boundary).
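A minimal sketch with scikit-learn, again reusing the factory arrays from the Isolation Forest example. One-Class SVM is sensitive to feature scale, so the features are standardized first; gamma is left at its default here, though in practice both nu and gamma usually need tuning.

from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

scaler = StandardScaler().fit(X_train)
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(scaler.transform(X_train))

print(ocsvm.predict(scaler.transform(X_test)))  # 1 = inside the boundary, -1 = outside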

One-Class SVM works well on small, clean datasets where the boundary between normal and abnormal is relatively sharp. But training time scales between O(n²) and O(n³), which means it becomes impractical beyond about 10,000 points. For larger data, Isolation Forest is almost always the better choice. I mention One-Class SVM here because you'll encounter it in the literature and in legacy codebases, and because there are niche situations — small sensor datasets with very clean training data — where it still earns its keep.

Autoencoders — The Photocopier That Catches Imposters

Here's an analogy that made autoencoders click for me. Imagine a photocopier that's been trained exclusively on pictures of cats. It's gotten extraordinarily good at reproducing cat photos — whiskers, ears, tails, all faithfully replicated. Now you feed it a photo of a dog. The copier tries its best, but the output looks wrong — blurry ears, a tail that curls the wrong way, a snout where a flat face should be. The reconstruction error (how different the output is from the input) is high. That high error is your anomaly signal.

An autoencoder is a neural network trained to reconstruct its own input. It has an encoder that compresses the input into a low-dimensional latent representation (a bottleneck), and a decoder that tries to reconstruct the original input from that compressed representation. When trained on normal data, the network learns to capture the patterns and regularities of normality in its bottleneck. At inference time, normal inputs are reconstructed faithfully (low error), while anomalous inputs — patterns the network has never seen — produce poor reconstructions (high error).

Threshold setting is where this gets genuinely hard — more art than science, I'll admit. One approach: compute reconstruction errors on a validation set of known-normal data, then set the threshold at the 95th or 99th percentile. Another: if you have a handful of known anomalies, use them to calibrate. In practice, I've found that plotting the distribution of reconstruction errors and looking for a natural gap between normal and anomalous is the most reliable starting point, though it requires human judgment.
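Here's a minimal sketch assuming TensorFlow/Keras is available (any deep learning framework would do), again on the factory arrays. Three sensors are far simpler than the data autoencoders are usually worth deploying on, but it keeps the example self-contained: train on standardized normal readings, then threshold at the 99th percentile of the training reconstruction error.

import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Standardize so reconstruction error isn't dominated by the hottest sensor
scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

# Encoder squeezes 3 sensors down to 2 latent dimensions; decoder rebuilds them
inputs = keras.Input(shape=(3,))
encoded = keras.layers.Dense(2, activation="relu")(inputs)
decoded = keras.layers.Dense(3)(encoded)
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_scaled, X_scaled, epochs=100, batch_size=16, verbose=0)

# Per-point reconstruction error; threshold at the 99th percentile of "normal"
errors = np.mean((X_scaled - autoencoder.predict(X_scaled, verbose=0)) ** 2, axis=1)
threshold = np.percentile(errors, 99)

X_test_scaled = scaler.transform(X_test)
test_recon = autoencoder.predict(X_test_scaled, verbose=0)
test_errors = np.mean((X_test_scaled - test_recon) ** 2, axis=1)
print(test_errors > threshold)  # True = flagged as anomalous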

Two important variants deserve mention. LSTM autoencoders replace the encoder and decoder with Long Short-Term Memory recurrent layers, making them well-suited for time-series anomaly detection — exactly the kind of data our factory sensors produce. Instead of reconstructing a single snapshot, they reconstruct a window of sequential readings, catching anomalies that manifest as unusual temporal patterns rather than unusual individual values. Variational Autoencoders (VAEs) add a probabilistic twist: instead of compressing each input to a single point in latent space, they compress it to a distribution. This gives you a natural anomaly score — the reconstruction probability — rather than a raw error value, which can be easier to calibrate.

Evaluation — Why This Is Harder Than Classification

Nobody fully agrees on the best evaluation approach for anomaly detection, and I've come to believe that's inherent to the problem rather than a sign that the field is immature. The core issue: anomalies are rare by definition, often less than 1% of the data. This rarity breaks the evaluation tools we rely on for balanced classification.

The Accuracy Trap

A model that predicts "normal" for every single input achieves 99.97% accuracy on a credit-card fraud dataset. That number is useless. It tells you nothing about the model's ability to find the 0.03% that matters. We need metrics that focus on performance among the rare positives.

PR-AUC over ROC-AUC

ROC-AUC (the Area Under the Receiver Operating Characteristic curve) measures the trade-off between the true positive rate and the false positive rate. It's the standard metric in balanced classification, but it's misleading for anomaly detection. When the positive class is tiny, even a mediocre model can maintain a very low false positive rate in absolute terms, inflating the ROC-AUC score. A model that catches 50% of frauds while flagging 1% of normal transactions looks great on ROC-AUC because 1% of a million normal transactions is 10,000 false alarms — operationally terrible, but ROC-AUC doesn't care about the absolute count.

PR-AUC (Precision-Recall AUC) is more honest. Precision asks: "Of the things I flagged, how many are real anomalies?" Recall asks: "Of the real anomalies, how many did I catch?" The PR curve plots precision against recall at various thresholds, and its area under the curve penalizes models that produce lots of false positives. When your anomaly rate is below 1%, PR-AUC gives you a far more realistic picture of how the model will perform in practice.

Precision@k — The Operational Metric

Here's the metric that actually maps to how anomaly detection systems get used. Imagine a fraud investigation team that can review 200 alerts per day. They don't care about the model's performance across the entire dataset — they care about the quality of the top 200 alerts. Precision@k (in this case, Precision@200) asks: of the 200 most anomalous points according to the model, how many are genuinely fraudulent? If 180 out of 200 are real, that's a Precision@200 of 0.9, and the team is happy. If only 40 are real, they're wasting 80% of their time, and your model needs work regardless of what PR-AUC says.

🎯 A practical guideline

Report PR-AUC for model comparison during development. Report Precision@k for stakeholder conversations, where k matches the team's daily review capacity or operational budget. These two metrics together — one for model selection, one for operational readiness — cover most of what you need.
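Here's a minimal sketch of both metrics, using scikit-learn's average_precision_score (a standard estimate of PR-AUC) and a few lines of NumPy for Precision@k. The labels and scores are synthetic and purely illustrative.

import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.RandomState(0)
# Synthetic evaluation set: roughly 1% anomalies, an imperfect detector
y_true = (rng.rand(10_000) < 0.01).astype(int)               # 1 = real anomaly
scores = rng.rand(10_000) + 2.0 * y_true * rng.rand(10_000)  # higher = more anomalous

# PR-AUC for comparing models during development
pr_auc = average_precision_score(y_true, scores)

# Precision@k for the operational question: of the top k alerts, how many are real?
k = 200
top_k = np.argsort(scores)[::-1][:k]
precision_at_k = y_true[top_k].mean()

print(f"PR-AUC: {pr_auc:.3f}   Precision@{k}: {precision_at_k:.3f}")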

Production Reality — Drift, Thresholds, and Retraining

Getting an anomaly detector working in a notebook is one thing. Keeping it working in production for six months is a completely different challenge, and the gap between the two is where most projects fail.

Concept Drift

Normal changes over time. Our factory's packaging machine runs cooler in winter and hotter in summer. A fraud model trained on last year's attack patterns won't recognize this year's schemes. This is concept drift — the statistical properties of the data shift, and what was once anomalous becomes the new normal (or vice versa). If you don't account for it, your detector's precision degrades steadily until it's either missing real anomalies or flooding the team with false alarms.

Retraining Strategies

The most common approach is periodic retraining on a sliding window of recent data — retrain weekly on the last 30 days, for example. This works when drift is gradual. For faster drift, you might retrain daily or use online learning variants that update incrementally. The tricky part is ensuring your retraining data is clean — if a burst of undetected anomalies contaminates the training window, the model learns to treat them as normal. Some teams maintain a human-reviewed "gold standard" dataset that gets updated less frequently but with higher confidence.
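A minimal sketch of the sliding-window idea, assuming a pandas DataFrame df with a timestamp column and the three sensor columns; the column names and window length are hypothetical.

import pandas as pd
from sklearn.ensemble import IsolationForest

def retrain_on_recent(df, sensor_cols, window_days=30):
    """Refit the detector on the most recent window of (assumed-clean) readings."""
    cutoff = df["timestamp"].max() - pd.Timedelta(days=window_days)
    recent = df.loc[df["timestamp"] >= cutoff, sensor_cols]
    model = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
    return model.fit(recent.values)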

Threshold Calibration Over Time

Even if the model itself is stable, the right threshold can shift. Seasonal patterns, business growth, changes in data volume — all of these can move the distribution of anomaly scores. I've seen teams set a threshold once and never revisit it, only to discover six months later that their detector hasn't flagged anything in weeks because the score distribution drifted upward. The fix: monitor the distribution of scores continuously, set alerts on the alert rate itself (meta-monitoring), and recalibrate thresholds monthly or whenever the score distribution shifts by more than a standard deviation.
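One way to implement that, sketched with NumPy: pin the threshold to a percentile of recent scores rather than to a fixed number, and alert on the alert rate itself. The target rate and tolerance here are illustrative choices, not recommendations.

import numpy as np

def recalibrated_threshold(recent_scores, flag_fraction=0.005):
    """Threshold that flags roughly the top flag_fraction of recent anomaly scores."""
    return np.quantile(recent_scores, 1.0 - flag_fraction)

def alert_rate_healthy(recent_flags, expected_rate=0.005, tolerance=3.0):
    """Meta-monitoring: has the observed alert rate drifted far from what we expect?"""
    observed = np.mean(recent_flags)
    return abs(observed - expected_rate) <= tolerance * expected_rate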

Which Method When?

After all of that, here's the comparison I wish I'd had when I started. No single method wins everywhere — the right choice depends on your data size, dimensionality, density structure, and operational constraints.

Method | Best When | Avoid When | Scales To | Key Assumption
Z-Score | Single feature, roughly Gaussian | Correlated features, multimodal data | Any size | Unimodal, symmetric distribution
Mahalanobis Distance | Few correlated features, Gaussian-ish | High dimensions, contaminated training data | Any size (with MCD for robustness) | Elliptical distribution
Isolation Forest | General default — no distributional assumptions | Variable-density clusters | Millions of rows | Anomalies are few and different
Local Outlier Factor | Variable-density clusters | High dimensions (>50), very large datasets | ~100K rows | Local density is meaningful
One-Class SVM | Small data, clean training set, sharp boundary | Large data, messy or overlapping boundaries | ~10K rows | Smooth boundary in kernel space
Autoencoder | Complex patterns, images, time series, high dimensions | Small data, no deep learning infrastructure | Any size (GPU) | Normal data has learnable structure

My default workflow: start with Isolation Forest. If the data has obvious multi-density structure, try LOF. If the features are low-dimensional and roughly Gaussian, check Mahalanobis distance first because it's faster and more interpretable. Autoencoders come out when the data is rich enough to justify the complexity — sensor time series, images, or high-dimensional embeddings. One-Class SVM is a last resort for small, clean datasets where nothing else fits.

Wrapping Up

We started with a confession and a single factory machine, and we've come a long way. The core insight — learn what normal looks like, flag what doesn't fit — carried us from a one-line z-score through elliptical distances, isolation forests that chop twigs, neighborhood watches that compare densities, kernel boundaries, and neural photocopiers. Along the way we wrestled with evaluation metrics that refuse to be simple and production realities that refuse to be static.

I'm genuinely glad I finally took this dive. Anomaly detection is one of those corners of machine learning where the methods are elegant, the applications are immediately useful, and the gap between theory and practice is large enough to keep you humble. I hope this walkthrough saves you some of the time I spent circling the topic before jumping in.

Thank you for reading. Go find some anomalies.
