Nice to Know

Chapter 6: Unsupervised Learning
Interesting but not essential — awareness level

I'll be honest — I put off learning most of these topics for a long time. They sit in this awkward middle ground: not core enough to build your daily workflows around, but persistent enough that they keep showing up. In interviews, in legacy codebases, in that one product meeting where someone casually drops the word "lift" and everyone nods like they know what it means.

So I finally sat down with each of them. What I found is that they're not hard individually — they're the kind of thing where ten minutes of genuine understanding beats ten hours of skimming. That's what we'll do here. We won't go deep enough to implement any of these from scratch. But we'll go deep enough that when they appear in front of you, you won't flinch.

Think of this as a walking tour of the neighborhoods we haven't visited yet in the unsupervised learning city. We know the downtown — clustering, dimensionality reduction, anomaly detection. Now we're exploring the side streets.

Association Rules — The "Beer and Diapers" Story

This one starts with a question that sounds trivial but turns out to be genuinely useful: which items tend to show up together in shopping carts? Imagine a grocery store with a million transaction receipts. Each receipt is a list of items someone bought. We want to discover patterns — not by asking a human to stare at receipts, but by letting the data tell us.

The classic finding, possibly apocryphal, is that diapers and beer are often purchased together. The story goes that young fathers, sent on a diaper run, would grab beer on the way out. Whether or not it's true, it captures the spirit perfectly: we're looking for surprising co-occurrences that reflect real human behavior.

To measure these co-occurrences, we need three numbers, and the relationship between them matters more than any one in isolation.

Support answers: how common is this combination? If diapers and beer appear together in 5% of all transactions, the support is 0.05. It tells you whether the pattern is frequent enough to care about.

Confidence answers: if someone bought diapers, how likely are they to also buy beer? This is a conditional probability — P(beer | diapers). A confidence of 0.7 means 70% of diaper-buyers also bought beer. Sounds impressive, right? But there's a trap.

Here's where I got tripped up the first time. Suppose beer is already purchased by 65% of all shoppers. That 70% confidence doesn't look so special anymore — it barely exceeds the base rate. This is where lift comes in. Lift divides the confidence by the base rate of the consequent: confidence / P(beer). If lift equals 1, the items are independent — knowing about diapers tells you nothing about beer. Lift above 1 means there's a genuine association. In our example, lift is only 0.7 / 0.65 ≈ 1.08 — barely above random. Not the dramatic finding we hoped for.
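If it helps to see the arithmetic laid out, here's a tiny Python sketch that computes all three numbers from a handful of made-up transactions (the items and counts are purely illustrative):

```python
# Support, confidence, and lift for the rule "diapers -> beer",
# computed from a handful of made-up transactions.
transactions = [
    {"diapers", "beer", "bread"},
    {"diapers", "beer"},
    {"beer", "chips"},
    {"diapers", "milk"},
    {"beer", "milk"},
]

n = len(transactions)
n_diapers = sum("diapers" in t for t in transactions)          # 3
n_beer = sum("beer" in t for t in transactions)                # 4
n_both = sum({"diapers", "beer"} <= t for t in transactions)   # 2

support = n_both / n                   # P(diapers and beer) = 0.40
confidence = n_both / n_diapers        # P(beer | diapers) ~= 0.67
lift = confidence / (n_beer / n)       # confidence / P(beer) ~= 0.83

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

In this toy data the confidence looks respectable at two thirds, but the lift comes out below 1: beer is simply popular on its own, and knowing someone bought diapers actually predicts beer slightly worse than the base rate.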

That distinction between confidence and lift is one of the most common interview gotchas in this space. Interviewers love handing you a rule with high confidence and asking you to interpret it. The experienced answer always involves checking lift.

Two algorithms dominate the mechanics of finding these rules. Apriori works bottom-up: start with individual items, count which pairs are frequent, then which triples, pruning at each level. If an item pair isn't frequent, no superset of it will be either — that's the Apriori property, and it's what makes the search tractable. The downside is that Apriori requires multiple passes through the entire transaction database, which gets painful at scale.

FP-Growth takes a different approach. It compresses all the transactions into a tree structure — the FP-tree — and mines patterns directly from the tree without generating candidate itemsets. For large datasets, it's substantially faster. If someone asks you to pick between the two for a million transactions with ten thousand products, FP-Growth is the answer.
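In practice you rarely implement either algorithm yourself. Here's a sketch of how this usually looks with the mlxtend library; exact function signatures can differ between versions, so treat it as a starting point rather than gospel:

```python
# Sketch with the mlxtend library (pip install mlxtend); exact signatures
# can differ across versions, so treat this as a starting point.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

transactions = [
    ["diapers", "beer", "bread"],
    ["diapers", "beer"],
    ["beer", "chips"],
    ["diapers", "milk"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Bottom-up candidate generation (Apriori)...
frequent = apriori(onehot, min_support=0.25, use_colnames=True)
# ...or the FP-tree approach, usually much faster at scale:
# frequent = fpgrowth(onehot, min_support=0.25, use_colnames=True)

# Turn frequent itemsets into rules, then filter on lift, not confidence.
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```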

Here's the practical reality, though: in modern ML pipelines, association rules have largely been replaced by collaborative filtering and embedding-based approaches. Recommendations at Netflix or Spotify don't run Apriori. But association rules still show up in retail analytics, in fraud pattern detection, and — maybe most importantly — in interviews. They test whether you understand the subtlety behind seemingly simple metrics.

Topic Modeling — What Are These Documents About?

Imagine you've inherited a database of fifty thousand customer support tickets and someone asks: "What are the main themes?" You could read them all. Or you could use topic modeling — algorithms that discover latent themes in a collection of documents without any labels.

The grandfather of this space is Latent Dirichlet Allocation, introduced in 2003. LDA tells a generative story about how documents come into existence. The story goes like this: each document starts by choosing a mix of topics — maybe 40% "billing issues" and 60% "shipping problems." Then for each word in the document, it picks a topic from that mix, and then picks a word from that topic's vocabulary. "Invoice" comes from the billing topic. "Delayed" comes from the shipping topic. The algorithm's job is to run this story in reverse — given the observed words, figure out the hidden topics.

I'll be honest: LDA's generative story is elegant, but its practical outputs can be maddening. You'll get topics that are clearly coherent ("payment, refund, charge, billing, invoice") and topics that are nonsensical ("the, and, system, update, please"). Choosing the number of topics requires experimentation. Preprocessing matters enormously — stop word removal, lemmatization, minimum document frequency — and different preprocessing choices give you genuinely different topics. I still find this frustrating. It means that LDA results always carry an asterisk: these topics are reasonable given these preprocessing choices.
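For reference, here's a minimal LDA sketch with scikit-learn. The documents and every preprocessing choice in it (stop words, min_df, number of topics) are stand-ins you'd tune for your own corpus:

```python
# Minimal LDA sketch with scikit-learn. The documents and every preprocessing
# choice (stop words, min_df, number of topics) are stand-ins to tune.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "refund for the duplicate charge on my invoice",
    "package delayed again and the shipping tracker is not updating",
    "cannot log in after the latest app update",
]

# Bag-of-words counts; these choices quietly shape the topics you get back.
vectorizer = CountVectorizer(stop_words="english", min_df=1)
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic mixtures

# Top words per topic: the part a human actually inspects.
words = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")
```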

The important thing LDA assumes — and this is its fundamental limitation — is that documents are bags of words. Word order doesn't matter. "The dog bit the man" and "the man bit the dog" are identical to LDA. For discovering broad themes across thousands of documents, this is often fine. For anything requiring nuance, it's not.

BERTopic is the modern answer, and it takes a completely different path. Instead of a generative word model, BERTopic leans on the semantic understanding baked into transformer embeddings. The pipeline has four steps: embed each document using a sentence-transformer model, reduce those high-dimensional embeddings down to something manageable with UMAP, cluster the reduced embeddings with HDBSCAN, and then figure out what each cluster is "about" using a variant of TF-IDF applied to the clusters (called c-TF-IDF).

The result is a fundamentally different kind of topic model. Because it starts from semantic embeddings, synonyms get grouped together. Short texts — tweets, chat messages, support tickets — work well, where LDA would starve for word co-occurrence signal. BERTopic also supports dynamic topics (watching themes evolve over time) and works with any language if you pick the right embedding model.

If you're building a topic modeling pipeline in 2024, BERTopic is the starting point. If you're maintaining a legacy system that already has an LDA pipeline, understanding how LDA works will save you debugging time. Both are worth knowing, but for different reasons.

NMF — When You Want Parts, Not Directions

There's a third player in the topic modeling space that often gets overlooked: Non-negative Matrix Factorization. The idea is deceptively simple. Take a matrix V (say, documents × words, where each cell counts how often a word appears in a document) and factorize it into two smaller matrices: V ≈ W × H. The twist — and it's the whole point — is that every value in W and H must be non-negative. No negative numbers allowed.

Why does this constraint matter so much? Think about what PCA does when you ask it to decompose a face image. PCA gives you eigenfaces — ghostly, full-face images where some pixels are positive and others are negative. The components subtract from each other. NMF, by contrast, can only add. So its components end up looking like parts: an eyebrow here, a nose there, a mouth shape. Each component is a recognizable piece that combines with others to reconstruct the whole.

For text data, this means NMF topics are weighted combinations of words where every weight is positive — a topic can include a word strongly or weakly, but it can't "anti-include" a word. This makes NMF topics feel more natural and interpretable than PCA components, though perhaps less sophisticated than LDA's probabilistic framework.
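Here's a minimal sketch with scikit-learn's NMF on a TF-IDF matrix; the documents and the number of components are made up for illustration:

```python
# NMF topic sketch with scikit-learn: V (docs x words) is factorized as
# W (docs x topics) times H (topics x words), every entry non-negative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "refund charge invoice billing payment",
    "shipping delayed package tracking courier",
    "refund requested for a delayed package",
]

vectorizer = TfidfVectorizer()
V = vectorizer.fit_transform(docs)

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(V)   # document-topic weights, all >= 0
H = nmf.components_        # topic-word weights, all >= 0

words = vectorizer.get_feature_names_out()
for k, row in enumerate(H):
    top = [words[i] for i in row.argsort()[::-1][:4]]
    print(f"topic {k}: {top}")
```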

Beyond text, NMF appears in recommender systems (factorizing user-item matrices), audio source separation (decomposing spectrograms into individual instruments), and medical imaging. It's a workhorse for any situation where your data is naturally non-negative and you want an additive decomposition.

Density Estimation — What Shape Is My Data?

Most of the algorithms we've covered in this chapter make implicit assumptions about data density. K-Means assumes clusters are roughly spherical and equally dense. DBSCAN explicitly uses density to define clusters. But what if we wanted to estimate the density itself — to answer "how likely is a data point in this region?"

Kernel Density Estimation is the non-parametric way to do this. The intuition is wonderfully physical: place a small, smooth bump (a kernel, usually Gaussian-shaped) on top of every data point. Sum all the bumps. Where data is dense, the bumps pile up into peaks. Where data is sparse, the surface is nearly flat.

The critical parameter is bandwidth — how wide each bump is. Too narrow, and the estimate is spiky and noisy, overfitting to individual data points. Too wide, and you smooth away all the interesting structure. Bandwidth selection is genuinely tricky. Scott's rule and Silverman's rule give reasonable defaults for roughly Gaussian data. Cross-validation works for everything else but is slower.
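Here's a sketch of that workflow with scikit-learn's KernelDensity, using cross-validation to pick the bandwidth rather than trusting a rule of thumb; the toy data and the bandwidth grid are arbitrary:

```python
# KDE with scikit-learn: Gaussian bumps on every point, bandwidth chosen by
# cross-validation. The toy data and the bandwidth grid are arbitrary.
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Two 1-D blobs of different density.
x = np.concatenate([rng.normal(0, 1.0, 300),
                    rng.normal(5, 0.5, 100)]).reshape(-1, 1)

grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.linspace(0.1, 1.0, 10)}, cv=5)
grid.fit(x)
kde = grid.best_estimator_

# score_samples returns log density; exponentiate to get the estimate itself.
query = np.linspace(-4, 8, 200).reshape(-1, 1)
density = np.exp(kde.score_samples(query))
print(f"chosen bandwidth: {kde.bandwidth:.2f}, peak density: {density.max():.3f}")
```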

KDE is the engine behind Mean-Shift clustering, which finds clusters by climbing density peaks — each point rolls uphill until it settles at a local maximum, and all points that converge to the same peak belong to the same cluster. It's also the foundation for certain anomaly detection approaches: estimate the density, flag anything that lands in a low-density region.

But KDE has a wall it hits hard: dimensionality. In one or two dimensions, it's beautiful. In five dimensions, it starts struggling. Beyond ten, it's essentially useless without an enormous amount of data. The curse of dimensionality means your bumps spread so thin in high-dimensional space that the estimate becomes unreliable. This is why you'll rarely see KDE used directly on raw feature spaces in modern ML. But as a visualization tool, as a density estimator for low-dimensional embeddings after UMAP, and as the conceptual backbone of density-based methods, it's indispensable.

Spectral Clustering — The Graph Theory Angle

We touched on spectral clustering in the clustering section, but the mathematical intuition deserves a closer look because it connects to something deeper.

The core idea: build a similarity graph where each data point is a node and edge weights reflect how similar two points are. Now ask: what's the best way to cut this graph into groups? "Best" here means minimizing the total weight of edges we cut while keeping the pieces reasonably balanced, so the answer isn't simply to snip off a single point — we want groups that are tightly connected internally but loosely connected to each other.

This is a graph cut problem, and solving it optimally is NP-hard. But there's a beautiful relaxation. Construct the graph Laplacian — the matrix L = D − W, where D is the degree matrix (diagonal matrix of row sums) and W is the similarity matrix. The eigenvectors of L, particularly the ones corresponding to the smallest eigenvalues, encode the optimal approximate cuts. The second-smallest eigenvector — called the Fiedler vector — gives you the best binary partition of the graph.
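If you want to see the idea without any library magic, here's a small numpy sketch: build an RBF similarity graph over two toy blobs, form L = D − W, and read the partition off the sign of the Fiedler vector:

```python
# Numpy sketch of the spectral idea: Laplacian eigenvectors as a relaxed cut.
import numpy as np

rng = np.random.default_rng(0)
# Two loose blobs in 2-D (toy data).
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Similarity graph with Gaussian (RBF) edge weights.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / 0.5)

# Graph Laplacian L = D - W, with D the diagonal degree matrix.
D = np.diag(W.sum(axis=1))
L = D - W

# Eigenvectors of L in ascending eigenvalue order; the second is the Fiedler vector.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]

# The sign of the Fiedler vector gives the approximate binary partition.
labels = (fiedler > 0).astype(int)
print(labels)
```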

What makes this profound is the connection to random walks. Imagine a random walker hopping between nodes, with transition probabilities proportional to edge weights. The graph Laplacian eigenvectors reveal where the walker tends to get "trapped" — spending a long time in one cluster before randomly jumping to another. Those trapping regions are the clusters.

I'm still developing my intuition for why the spectral relaxation works as well as it does. The math proves it, but the gap between "I can follow the proof" and "I feel in my bones why eigenvectors of the Laplacian reveal cluster structure" is one I haven't fully closed. If you get there, let me know how.

When does this matter practically? When your clusters aren't convex — rings, spirals, irregular shapes where K-Means fails catastrophically. Spectral clustering handles these because it operates on the graph structure, not on raw coordinates.

Clustering at Scale — BIRCH, CURE, and Mini-Batch

Here's a problem that sounds mundane but matters enormously in production: your dataset has fifty million data points and doesn't fit in memory. Standard K-Means wants the entire dataset loaded. Hierarchical clustering needs an O(n²) distance matrix. What do you do?

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) takes a single pass through the data, building a compact tree structure called a CF-tree. Each leaf node in the tree stores a "clustering feature" — a summary consisting of the number of points, their linear sum, and their squared sum. This is enough information to compute centroids and radii without storing individual points. You stream your fifty million data points through, the tree summarizes them incrementally, and then you cluster the leaf summaries. One pass. Bounded memory.
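A streaming sketch with scikit-learn's Birch is below; the chunk generator is a stand-in for whatever actually feeds you batches from disk or a database:

```python
# Streaming sketch with scikit-learn's Birch. The chunk generator is a
# stand-in for whatever actually feeds you batches from disk or a database.
import numpy as np
from sklearn.cluster import Birch

def stream_chunks(n_chunks=10, chunk_size=1_000, dim=8, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        yield rng.normal(size=(chunk_size, dim))

birch = Birch(threshold=1.5, n_clusters=10)
for chunk in stream_chunks():
    birch.partial_fit(chunk)    # updates the CF-tree; memory stays bounded

# The leaf summaries are then grouped into the final clusters.
labels = birch.predict(next(stream_chunks(n_chunks=1)))
print(labels[:10])
```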

CURE (Clustering Using Representatives) tackles a different problem: cluster shapes. Instead of representing each cluster with a single centroid, CURE picks multiple scattered representative points from each cluster and then shrinks them toward the cluster center. This lets it handle non-spherical, non-uniform clusters that would fool centroid-based approaches.

And then there's the pragmatic choice: Mini-Batch K-Means. Take a small random batch of data, update the centroids using that batch, repeat. It's an approximation of K-Means that converges faster and uses far less memory. It's in scikit-learn. It works. For many practical scenarios, this is the right first attempt before reaching for anything fancier.
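And the Mini-Batch version keeps the familiar KMeans interface:

```python
# Mini-Batch K-Means: same interface as KMeans, a fraction of the memory.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.default_rng(0).normal(size=(100_000, 16))   # stand-in data

mbk = MiniBatchKMeans(n_clusters=20, batch_size=4096, random_state=0)
labels = mbk.fit_predict(X)           # centroids updated one mini-batch at a time
print(mbk.cluster_centers_.shape)     # (20, 16)
```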

Here's the honest advice: if your data fits in memory, use standard algorithms. If it doesn't, try Mini-Batch K-Means first. If cluster shapes are weird, look at CURE. If you need a single-pass streaming solution, BIRCH is your friend. Most of the time, Mini-Batch K-Means is "good enough," and "good enough" ships products.

Fuzzy C-Means — When Points Belong Everywhere

K-Means makes a hard decision: every point belongs to exactly one cluster. Gaussian Mixture Models make a soft decision but assume clusters are Gaussian-shaped. Fuzzy C-Means sits between them — soft assignments without the Gaussian assumption.

Each data point gets a membership degree for every cluster, and these degrees must sum to one. A point sitting squarely inside cluster A might have membership 0.95 in A and 0.05 in B. A point on the boundary might be 0.55 and 0.45. The algorithm iteratively updates cluster centers (weighted by membership degrees) and recalculates membership degrees (based on distance to centers), much like K-Means but with this continuous softness.

The fuzziness is controlled by a parameter called m (the fuzzifier). When m approaches 1, Fuzzy C-Means collapses into hard K-Means. As m grows larger, all memberships converge toward equal — every point belongs equally to every cluster. In practice, m = 2 is the default that most people use, and I'll be honest, I've rarely seen a compelling argument for changing it.
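Because there's no single canonical library for this, here's a minimal numpy sketch of the two alternating updates (centers from memberships, memberships from distances), with m = 2:

```python
# Minimal numpy sketch of Fuzzy C-Means: alternate between center updates
# (membership-weighted means) and membership updates (from distances).
import numpy as np

def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = d ** (-2.0 / (m - 1.0))            # closer centers get more membership
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, memberships = fuzzy_c_means(X)
print(memberships[0].round(2))   # something like [0.97, 0.03] for a core point
```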

Where does this actually appear? Medical image segmentation, where tissue boundaries are genuinely fuzzy — a pixel might be part brain tissue, part cerebrospinal fluid. Control systems and pattern recognition in industrial applications. It's niche, but within those niches, it's preferred precisely because hard boundaries would be a lie.

Deep Clustering — When Features Fail

Sometimes the raw features you have aren't good enough for clustering. The data lives in a high-dimensional space where distances are meaningless, and no amount of PCA or UMAP preprocessing gives you clusters that make sense. The intuition behind deep clustering is: what if we could learn a representation that's specifically optimized for clustering?

DEC (Deep Embedded Clustering) does this in two phases. First, pretrain an autoencoder — an encoder-decoder neural network that learns to compress data and reconstruct it. The encoder's bottleneck layer gives you a lower-dimensional representation. Second — and this is the clever part — fine-tune the encoder by adding a clustering objective. DEC computes soft cluster assignments in the embedding space, creates a sharpened "target" distribution, and minimizes the KL divergence between them. This pulls the embeddings toward tighter, more separable clusters.

DeepCluster takes an alternating approach: cluster the current representations, use those cluster assignments as pseudo-labels, train the network to predict those pseudo-labels, repeat. The representations and the cluster assignments co-evolve, each improving the other.

This is an active research area. My honest take: deep clustering works impressively on image data (faces, digits) where autoencoders capture meaningful structure. On tabular data, the gains over traditional approaches are less clear. When someone tells you deep clustering "solved" their problem, ask what they compared against. A well-tuned UMAP + HDBSCAN pipeline is a strong baseline that doesn't require training a neural network.

Semi-Supervised Learning — The Space Between

Strictly speaking, semi-supervised learning isn't unsupervised. But it lives at the boundary, and understanding unsupervised methods is prerequisite to understanding how it works. The setup: you have a small number of labeled examples and a large number of unlabeled examples. This is the reality of most production ML — labels are expensive, data is cheap.

Self-training is the most intuitive approach. Train a model on your labeled data. Use it to predict labels for the unlabeled data. Add the high-confidence predictions to your training set. Retrain. Repeat. It's so simple it feels like cheating, and that simplicity is both its strength and its danger. If the model makes confident-but-wrong predictions early on, those errors compound — the model trains on its own mistakes. This confirmation bias is the fundamental failure mode of self-training, and mitigating it is where all the sophistication enters: confidence thresholds, temperature sharpening, curriculum strategies.

Label propagation takes a graph-based approach that connects directly to what we discussed in spectral clustering. Build a similarity graph over all your data — labeled and unlabeled. Then let labels "flow" through edges to nearby unlabeled points. If a labeled point has high similarity to an unlabeled point, the label propagates. The assumption is that nearby points in feature space are likely to share labels — the smoothness assumption. This is built into scikit-learn and works surprisingly well when the assumption holds.

Modern deep semi-supervised methods like FixMatch combine self-training with data augmentation. The idea: if you augment an unlabeled image in two different ways, a good model should predict the same label for both. This consistency regularization provides a training signal without any labels at all. Combined with pseudo-labeling for high-confidence predictions, these methods achieve remarkable performance with very few labels.

The interview version of this question is: "You have 100 labeled examples and 100,000 unlabeled examples. What do you do?" The experienced answer walks through self-training or label propagation first (cheap, interpretable), then modern semi-supervised methods if you have deep learning infrastructure, and acknowledges the confirmation bias risk throughout.

Interview Gotchas — The Things That Trip People Up

I want to close with a collection of traps that I've seen catch experienced practitioners in interviews and in production. These aren't obscure trivia — they're genuine misunderstandings that lead to real bugs.

t-SNE and UMAP clusters aren't real clusters. These are visualization algorithms optimized for preserving local structure in two dimensions. The beautiful clusters you see in a UMAP plot may not correspond to meaningful groupings in the original high-dimensional space. I've watched teams build entire analyses on UMAP cluster boundaries and then wonder why their downstream classifier didn't work. The visualization showed them what they wanted to see.

PCA before clustering can destroy your signal. PCA finds directions of maximum variance. But maximum variance doesn't mean maximum cluster separation. If two clusters overlap along the highest-variance direction but separate along a lower-variance direction, PCA might collapse exactly the dimension you need. This doesn't mean PCA is bad preprocessing for clustering — it often helps. But it's not guaranteed to help, and blindly applying it is a mistake.

K-Means makes assumptions you never agreed to. Spherical clusters. Equal sizes. Equal density. Features on the same scale. If any of these are violated, K-Means will still run, still converge, and still give you centroids. It won't tell you it's giving you nonsense. This silence is what makes it dangerous — the algorithm never fails loudly.

Distance metrics change everything. Euclidean distance and cosine distance will give you genuinely different clusters on the same data. For text embeddings, cosine similarity is almost always the right choice. For spatial data, Euclidean makes sense. For mixed-type data, Gower distance exists but has its own issues. The distance metric is arguably the most important decision in unsupervised learning, and it's the one people spend the least time thinking about.

Evaluating without labels is fundamentally ambiguous. Silhouette score works for convex clusters but fails for non-convex ones. The elbow method requires you to squint at a plot and decide where the "elbow" is — ask three people, get three answers. If you have even a handful of labeled points, use them for validation. Unsupervised evaluation metrics are useful signals, not ground truth.

Anomalies can form their own cluster. If you have a group of similar anomalies — say, a coordinated fraud ring — a clustering algorithm might detect them as a small but valid cluster rather than flagging them as outliers. This is why pure clustering and pure anomaly detection require different tools, even though people often conflate them.


None of these topics demand deep mastery right now. But each one represents a real scenario you'll encounter — in code reviews, in legacy systems, in interviews, in that moment when someone names a technique and the room expects you to nod along. Now there's genuine understanding behind that nod instead of a bluff. And if you ever need to go deeper on any of them, you know where the rabbit hole starts.