Clustering

Chapter 6: Unsupervised Learning. Finding structure without labels

The Confession

I avoided really understanding clustering for an embarrassingly long time. I'd call KMeans(n_clusters=5), look at some colored scatter plots, nod sagely in a meeting, and move on. It felt like one of those topics where the API was so clean that you could convince yourself you understood what was happening underneath. I didn't. And then one day I clustered a customer dataset, got five beautiful groups, presented them to the product team, and someone asked: "Why five? Why not four? Or seven?" I had no answer. The discomfort of not knowing what's really going on grew too great, so I finally took the deep dive. This chapter is that dive.

Clustering is the act of grouping data points so that points within the same group are more similar to each other than to points in other groups. No labels. No teacher. No "right answer" column. The data has to speak for itself. The concept has been around since the 1950s — Stuart Lloyd developed the first version of what we now call K-Means at Bell Labs in 1957, though it wasn't published until 1982. Since then, dozens of algorithms have emerged, each with a different opinion about what "group" means.

Before we start, a heads-up. We're going to be talking about distances, density, eigenvectors, and probability distributions, but you don't need to know any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

Contents

What "similar" even means
K-Means from scratch
Why starting position matters — K-Means++
How many groups? The k question
Where K-Means breaks
Rest stop
DBSCAN — when density is the answer
HDBSCAN — the modern default
Hierarchical clustering — the merge tree
Spectral clustering — changing the lens
Gaussian Mixture Models — soft assignments
How do you know it worked?
Picking the right tool
Wrap-up

What "Similar" Even Means

Every clustering algorithm needs an answer to one deceptively simple question: how close are these two points? The most common answer is Euclidean distance — the straight-line distance you'd measure with a ruler. For two points in a space with d features, it's the square root of the sum of squared differences across every feature.

Let's make this concrete. Imagine we're building a recommendation engine for a small coffee shop. We have six customers and two features: how many espressos they order per week and how many pastries. We can plot them on a flat surface, espressos on one axis and pastries on the other. Customer A orders 1 espresso and 1 pastry. Customer B orders 2 espressos and 1 pastry. The Euclidean distance between them is 1 — they're neighbors. Customer F, who orders 8 espressos and 7 pastries, is far away from both.
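
That claim is easy to verify in a few lines. A minimal check with NumPy's norm, using the three customers just mentioned:

import numpy as np

a = np.array([1, 1])   # Customer A: 1 espresso, 1 pastry per week
b = np.array([2, 1])   # Customer B
f = np.array([8, 7])   # Customer F

print(np.linalg.norm(a - b))   # 1.0: next-door neighbors
print(np.linalg.norm(a - f))   # ~9.2: a different world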

This is our toy world. Six customers, two dimensions, everything visible on a napkin sketch. We'll keep coming back to this coffee shop as the algorithms get more sophisticated.

Here's the thing about distance, though: it's fragile. If espressos are measured in cups (0–10) and pastries in dollars (0–50), the dollar axis dominates and the algorithm "thinks" pastry differences matter five times more than coffee differences. That's why we almost always standardize features before clustering — subtract the mean, divide by the standard deviation. Every feature gets an equal voice. I still forget this sometimes and get garbage clusters. It's the most common clustering bug in production.
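
A minimal sketch of the fix, using hypothetical dollar amounts for the pastry column just to exaggerate the scale mismatch:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data: espressos in cups, pastries in dollars
raw = np.array([[1, 5.0], [2, 5.0], [2, 10.0],
                [7, 35.0], [8, 35.0], [8, 40.0]])

scaled = StandardScaler().fit_transform(raw)
print(raw.std(axis=0))      # the dollar column has roughly 5x the spread
print(scaled.std(axis=0))   # after scaling, both features have unit spread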

K-Means from Scratch

Back to our coffee shop. Six customers on a napkin. We suspect there are two groups — the "quick espresso" crowd and the "settle in with pastries" crowd. We want the algorithm to find them.

Here's the entire algorithm, known as Lloyd's algorithm. We pick two random points as starting centers — call them centroids. Then we alternate two steps. First, assign each customer to their nearest centroid. Customer A is closer to centroid 1, so they go there. Customer F is closer to centroid 2. We do this for all six customers. Second, move each centroid to the average position of its assigned customers. If centroid 1 got customers A, B, and C, it slides to the mean of their coordinates. That's it. Assign, update, repeat.

import numpy as np

customers = np.array([
    [1, 1], [2, 1], [2, 2],   # the espresso crowd
    [7, 7], [8, 7], [8, 8],   # the pastry-and-latte crowd
])

# Start with two random centroids
rng = np.random.RandomState(42)
centroids = customers[rng.choice(6, 2, replace=False)]

for iteration in range(10):
    # ASSIGN: each customer to nearest centroid
    distances = np.linalg.norm(
        customers[:, None] - centroids[None, :], axis=2
    )
    labels = distances.argmin(axis=1)

    # UPDATE: each centroid to mean of its customers
    new_centroids = np.array([
        customers[labels == j].mean(axis=0) for j in range(2)
    ])

    if np.allclose(centroids, new_centroids):
        print(f"Converged after {iteration + 1} iterations")
        break
    centroids = new_centroids

print(f"Final centroids: {centroids}")
print(f"Assignments: {labels}")

With our six coffee-shop customers, this converges in two or three iterations. The centroids land right in the middle of each natural group. On our napkin, you can literally see it happen — the centers slide toward the clumps, the assignments stabilize, and we're done.

The objective that K-Means minimizes has a name: within-cluster sum of squares, sometimes called inertia. For each point, measure the squared distance to its assigned centroid. Add them all up. That's the number K-Means is trying to shrink.
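
A quick sanity check, sketched with the same six customers: compute that sum by hand and compare it to the inertia_ attribute sklearn reports after fitting.

import numpy as np
from sklearn.cluster import KMeans

customers = np.array([[1, 1], [2, 1], [2, 2],
                      [7, 7], [8, 7], [8, 8]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(customers)

# Within-cluster sum of squares: squared distance of each point to its centroid
manual_inertia = ((customers - km.cluster_centers_[km.labels_]) ** 2).sum()
print(manual_inertia, km.inertia_)   # the two numbers agree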

Why does the algorithm converge? The assign step can only decrease inertia (or leave it unchanged) because each point moves to its nearest centroid. The update step can only decrease inertia because the arithmetic mean is the unique point that minimizes squared distances to a set of points. The objective is bounded below by zero. It decreases monotonically. And there are only finitely many possible assignments (at most k^n, where n is the number of points). A decreasing sequence over a finite set must terminate. That's the whole proof.

But here's the catch that took me a while to internalize. Convergence means local minimum. Not global. The algorithm finds a good grouping, not necessarily the best grouping. Where you drop those initial centroids matters enormously.

Why Starting Position Matters — K-Means++

Imagine our coffee shop has grown. Thirty customers now, in three natural clumps. If we randomly drop all three initial centroids into the same clump, the algorithm might converge with one cluster hogging 25 customers and two clusters splitting the remaining 5. Technically converged. Practically useless.

The brute-force fix: run K-Means 10 times with different random seeds, keep the run with the lowest inertia. Sklearn supports this via n_init (the default was 10 for years; recent versions default to a single run when K-Means++ initialization is used). It works, but it's wasteful.

K-Means++, introduced by Arthur and Vassilvitskii in 2007, is much cleverer. Pick the first centroid uniformly at random from the data. For the second centroid, compute every point's squared distance to the first centroid, then pick the next centroid with probability proportional to that squared distance. Points far away are much more likely to get picked. Repeat for each subsequent centroid, always measuring each point's squared distance to its nearest already-chosen centroid. This is called D² weighting.

The intuition is physical. Imagine dropping pins on a map. The first one lands somewhere. For the next pin, you want it far from the first. D² weighting gives distant points a much higher chance. The third pin wants to be far from both existing pins. By the time you've placed k pins, they're spread across the data like well-placed weather stations — each one covering a different region.

The mathematical guarantee: K-Means++ initialization produces a solution that is, in expectation, O(log k) competitive with the true optimum. In practice, it's often so close to the final answer that Lloyd's iterations barely move the centroids. Sklearn uses K-Means++ by default. You don't need to change anything.
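
Here's a minimal sketch of that D² sampling loop (kmeans_pp_init is an illustrative helper, not sklearn's internal implementation):

import numpy as np

def kmeans_pp_init(X, k, seed=42):
    rng = np.random.RandomState(seed)
    # First centroid: uniformly at random from the data
    centroids = [X[rng.randint(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # D^2 weighting: far points are proportionally more likely to be picked
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)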

How Many Groups? The k Question

Nobody tells you how many clusters exist. That number k is your responsibility, and getting it wrong can mean the difference between a useful segmentation and an expensive fiction. I've tried three approaches, and I'll be honest about how much I trust each one.

The silhouette score is the one I reach for first. For every point, it measures two things: how close that point is to others in its own cluster (call that a), and how close it is to the nearest neighboring cluster (call that b). The score for that point is (b − a) / max(a, b). A score near +1 means the point sits comfortably in its cluster with clear daylight to the next one. Near 0 means it's on the boundary. Negative means it's probably in the wrong cluster. Average across all points, sweep k from 2 to 10 or so, and pick the k with the highest average silhouette.

The elbow method is the classic. Plot inertia against k. As you add clusters, inertia drops — more centers means less total distance. Look for the "elbow" where the curve bends and returns start diminishing. The problem? Real data almost never gives you a clean elbow. You end up squinting at a gently curving line, trying to convince yourself the bend is at 4 and not 5. I've been in meetings where three people pointed at three different elbows on the same plot.

The gap statistic is more principled. It compares your inertia to what you'd get from data with no structure at all — uniform random noise. The k where the gap between real and random is largest wins. It requires generating reference datasets, which makes it slow, but it's the closest thing we have to a statistically grounded answer.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# Our coffee shop grew — 200 customers, unknown number of groups
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=200, centers=4,
                   cluster_std=0.9, random_state=42)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}  inertia={km.inertia_:>8.0f}  silhouette={sil:.3f}")
    # Watch for the silhouette peak — that's your best k
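
And a rough sketch of the gap statistic, reusing X and the imports from the block above. The gap_statistic helper is illustrative; the full method from Tibshirani et al. also uses the spread of the reference inertias when picking k, which is skipped here.

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.RandomState(seed)
    real = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    # Reference inertias: uniform random noise inside the data's bounding box
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_log = []
    for _ in range(n_refs):
        noise = rng.uniform(mins, maxs, size=X.shape)
        ref_log.append(np.log(KMeans(n_clusters=k, n_init=10,
                                      random_state=42).fit(noise).inertia_))
    return np.mean(ref_log) - np.log(real)

for k in range(2, 9):
    print(f"k={k}  gap={gap_statistic(X, k):.3f}")
    # Larger gap = more structure relative to pure noise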

Where K-Means Breaks

K-Means carves the feature space into Voronoi cells — regions where every point is closer to one centroid than to any other. These cells are always convex, bounded by straight lines (flat planes in higher dimensions). That geometric constraint bakes in three assumptions that the algorithm cannot escape.

First, it assumes clusters are roughly spherical. Picture two crescent moons interlocking — a classic test shape. K-Means draws a straight line through the middle, splitting each crescent in half. It cannot follow curves because its boundaries are flat planes. Second, it assumes clusters are roughly equal-sized. A cluster with 10,000 points next to one with 50? The massive cluster's centroid barely notices the small one and gradually absorbs it. Third, it requires you to know k. If you guess wrong, you're either splitting real clusters or merging distinct ones.

There's also the outlier problem. A single extreme point drags the centroid toward it like a weight on a fishing line, distorting the entire cluster. K-Means has no concept of "this point doesn't belong anywhere."

I'll be honest — I spent a year applying K-Means to everything, getting puzzling results on non-blobby data, and blaming my feature engineering. The algorithm was doing exactly what it was designed to do. I was the one using the wrong tool.

Mini-Batch K-Means. Standard K-Means reads every point every iteration. With a million rows, that's painful. MiniBatchKMeans samples a random batch per step — 10x faster, typically within 1–2% of the full solution's quality. If your dataset has more than 100,000 rows, use it without hesitation.
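
A minimal sketch of the swap (the dataset size and batch_size here are just illustrative):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X_big, _ = make_blobs(n_samples=200_000, centers=5, random_state=42)

# Same API as KMeans, but each update step only sees a random mini-batch
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024,
                      n_init=10, random_state=42).fit(X_big)
print(f"Inertia: {mbk.inertia_:.0f}")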

Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a solid mental model of K-Means: Lloyd's algorithm alternates assign and update steps, converging to a local minimum of inertia. K-Means++ gives you smart initialization. The silhouette score helps you pick k. And you know the fundamental limitation — K-Means can only find spherical, equal-sized clusters because Voronoi cells are convex.

But that summary doesn't tell the complete story. Real data has crescent moons, varying densities, hierarchical structure, and ambiguous boundaries. The algorithms that follow were invented specifically because K-Means can't handle those cases.

The short version: DBSCAN handles arbitrary shapes via density, hierarchical clustering gives you multi-resolution views, spectral clustering transforms geometry into connectivity, and GMMs give you soft probabilistic assignments. There. You're 70% of the way there.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

DBSCAN — When Density Is the Answer

K-Means defines a cluster as "stuff near a center." But what if there is no sensible center? Think of our coffee shop customers again, except now imagine they form two long, winding lines on the napkin — one group snakes along the espresso axis, another curves along the pastry axis. No centroid can represent a snake. We need a different definition of "group."

DBSCAN says a cluster is a dense region separated from other dense regions by sparse space. It doesn't care about centers. It cares about crowds. You give it two numbers: eps (ε), a neighborhood radius, and min_samples, the minimum number of neighbors within that radius for a point to count as being in a dense region.

Every point in the dataset gets classified into one of three roles. A core point has at least min_samples neighbors within eps — it's in the thick of the crowd. A border point falls within eps of a core point but doesn't have enough neighbors to be core itself — it's standing at the edge of the crowd, close enough to belong but not dense enough to anchor anything. A noise point is neither — it's alone in sparse territory, too far from any crowd. DBSCAN gives noise points the label −1, which is its way of saying "I don't know where this belongs, and I'm not going to pretend."

The algorithm starts at an arbitrary unvisited core point and begins expanding. It looks at all neighbors within eps. Any neighbor that is itself a core point gets its own neighborhood expanded too. This recursive expansion continues until no more core points are reachable — that entire connected web of dense points becomes one cluster. Then DBSCAN picks the next unvisited core point and starts a new cluster. Border points attach to whatever cluster their nearest core point belongs to.

That recursive chain of core-to-core connections is what the formal literature calls density reachability. Point q is directly density-reachable from core point p if q is within p's eps-neighborhood. Point q is density-reachable from p if there's a chain of core points linking them, each within eps of the next. Two points are density-connected if there's some core point from which both are reachable. A cluster, formally, is a maximal set of density-connected points. This chain-linking is why DBSCAN finds arbitrary shapes. Two crescent moons? The dense arc of each crescent connects through its core points, link by link, like a chain of people holding hands. No centroid needed.

Tuning the Two Knobs

The k-distance plot is the standard tool for finding eps. For each point, compute the distance to its k-th nearest neighbor (where k equals your min_samples). Sort those distances in descending order and plot them. The curve falls steeply through the sparse outliers, then flattens out across the dense majority: the bend between the two regimes is where density transitions from sparse to dense. Set eps at that bend.

For min_samples, the rule of thumb is to start at 2 × the number of features, with a floor of 5. Higher values make DBSCAN more conservative — it demands thicker crowds before it'll call something a cluster. Lower values make it more aggressive, potentially chaining isolated points together.

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons
import numpy as np

X, _ = make_moons(n_samples=500, noise=0.06, random_state=42)
X = StandardScaler().fit_transform(X)

# k-distance plot to find eps
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])[::-1]
# The elbow in k_dist tells us eps is around 0.27

db = DBSCAN(eps=0.27, min_samples=5).fit(X)
n_clusters = len(set(db.labels_) - {-1})
n_noise = np.sum(db.labels_ == -1)
print(f"Clusters: {n_clusters}, Noise: {n_noise}")

The Single-eps Trap

DBSCAN has one deep limitation: a single eps for the entire dataset. Think of it like setting one volume threshold for an entire concert hall. If the orchestra pit is packed shoulder-to-shoulder and the balcony has seats scattered far apart, no single radius captures both. Set eps large enough for the balcony, and the orchestra pit merges into one giant mass. Set it tight enough for the orchestra, and the balcony shatters into noise.

Also worth knowing: high-dimensional data is hostile territory for DBSCAN. As dimensions increase, distances between points converge — the nearest neighbor and the farthest neighbor end up almost equally far away. The concept of "neighborhood" dissolves. Always reduce dimensions before running DBSCAN when you have more than about 15–20 features.

HDBSCAN — The Modern Default

HDBSCAN fixes DBSCAN's single-eps problem by asking: what if we ran DBSCAN at every possible eps value simultaneously?

The idea is built on something called mutual reachability distance. For two points a and b, it's the maximum of three values: the core distance of a (distance to a's k-th nearest neighbor), the core distance of b, and the actual distance between a and b. This modified distance effectively "spreads out" sparse regions — if either point is isolated, the mutual reachability distance inflates, preventing fragile single-point bridges from connecting unrelated clusters. It's like saying "I won't trust a connection unless both endpoints are already in reasonably dense areas."
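
A rough sketch of that definition (this helper follows one common convention where a point counts as its own first neighbor; the real hdbscan library computes this far more efficiently):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_reachability(X, k=5):
    # Core distance: distance to each point's k-th nearest neighbor
    core = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)[0][:, -1]
    # Plain pairwise Euclidean distances
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # max(core distance of a, core distance of b, distance between a and b)
    return np.maximum(np.maximum(core[:, None], core[None, :]), dist)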

HDBSCAN uses these distances to build a hierarchy — a tree that records how clusters merge as the density threshold changes. Then it scores each cluster on stability: how long does this cluster persist as we sweep through density levels? A cluster that appears at one density and vanishes at a slightly different one is noise. A cluster that survives across a wide range of densities is real structure.

The result: no eps to tune, automatic cluster count, graceful handling of varying densities, and confidence scores for every point. Points that HDBSCAN can't confidently assign get label −1 — it genuinely says "I don't know" rather than forcing an assignment. Going back to our concert hall analogy, HDBSCAN measures the density at the orchestra pit and the balcony separately, recognizing that both are real crowds at different scales.

from hdbscan import HDBSCAN
import numpy as np

# X: the scaled two-moons data from the DBSCAN example above
hdb = HDBSCAN(min_cluster_size=15, min_samples=5)
hdb_labels = hdb.fit_predict(X)

n_clusters = len(set(hdb_labels) - {-1})
n_noise = np.sum(hdb_labels == -1)
print(f"Clusters: {n_clusters}, Noise: {n_noise}")
print(f"Mean confidence: {hdb.probabilities_.mean():.3f}")

If you're reaching for DBSCAN today, reach for HDBSCAN instead. The only exception is when eps has a domain meaning you need to preserve — "all cell towers within 500 meters" or "all transactions within 10 seconds." In that case, the interpretability of a fixed radius matters more than adaptive density.

Hierarchical Clustering — The Merge Tree

Every algorithm so far gives you one answer: here are your clusters, take it or leave it. But what if you want to see the data at 3 groups, 5 groups, and 12 groups, and understand which groups split into which? That's where hierarchical clustering earns its name.

Agglomerative clustering starts with every point as its own cluster — our six coffee-shop customers are six clusters of one. At each step, it merges the two closest clusters. After the first merge, we have five clusters. Then four. Then three. It keeps merging until everything collapses into one big cluster. The entire history of merges gets recorded in a tree structure called a dendrogram.

The dendrogram is the whole point. Its y-axis shows the distance at which each merge happened. Short bars at the bottom mean nearby points collapsed together early — that's fine structure. Tall bars near the top mean the algorithm had to reach far to connect two groups — those are your real cluster boundaries. Draw a horizontal line at any height, and the number of vertical lines it crosses is your cluster count at that resolution. Slide the line up, you get fewer bigger clusters. Slide it down, you get more smaller ones. It's clustering at every resolution, stored in one picture.
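
A minimal sketch of drawing one, using the six coffee-shop customers (and assuming matplotlib is available):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

customers = np.array([[1, 1], [2, 1], [2, 2],
                      [7, 7], [8, 7], [8, 8]], dtype=float)

# Ward linkage records the full merge history
Z = linkage(customers, method='ward')
dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E', 'F'])
plt.ylabel("Merge distance")
plt.show()
# The tall final merge is the gap between the two natural groups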

Linkage — The Make-or-Break Decision

"Distance between two clusters" isn't obvious when clusters have multiple points. Which points do you measure between? The answer to that question is called the linkage method, and it changes everything about the clusters you get.

Single linkage measures the minimum distance between any pair of points, one from each cluster. It finds long, chain-like structures because a single bridge point connecting two groups is enough to merge them. This chaining effect makes single linkage dangerous for most applications — one noisy point between two legitimate clusters can glue them together. But for geographic data where chains are real (river systems, mountain ridges), it's the right choice.

Complete linkage goes the other direction — it measures the maximum distance between any pair. This forces compact, spherical clusters because two groups won't merge until even their farthest members are close. It tends to produce roughly equal-sized clusters and will break apart elongated structures.

Average linkage takes the mean of all pairwise distances. It's the middle ground — no strong guarantees, but no dramatic failure modes either.

Ward linkage measures the increase in total within-cluster sum of squares that would result from the merge. It produces the same kind of compact, spherical clusters that K-Means does, but without requiring you to pick k first. Ward is the default for a reason, and it's where I start unless I have a specific reason not to. The catch: it requires Euclidean distance.

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4,
                   cluster_std=0.7, random_state=42)

Z = linkage(X, method='ward')

# Cut the dendrogram at different heights
for n_clust in [2, 3, 4, 5, 6]:
    labels = fcluster(Z, t=n_clust, criterion='maxclust')
    sil = silhouette_score(X, labels)
    print(f"{n_clust} clusters: silhouette = {sil:.3f}")

The cost of this flexibility: O(n²) memory and O(n² log n) time. You're computing and storing the full pairwise distance matrix. Beyond about 20,000 points, this becomes painful. For larger datasets, BIRCH builds a compact summary tree in a single pass and then clusters the summaries — a way to get hierarchical-like analysis at scale.

Spectral Clustering — Changing the Lens

Concentric rings. Spirals. Two interlocking crescents. All the shapes where K-Means draws a straight line through the middle because it can't think in curves. DBSCAN handles some of these, but spectral clustering takes a fundamentally different approach: instead of trying harder to cluster the data as-is, it transforms the data into a new representation where the clusters are easy to find, and then runs K-Means on that.

I'm still developing my intuition for why this works as well as it does, but here's how I think about it.

Imagine you have 500 points arranged in two concentric circles. In the original x-y space, there's no line, plane, or convex boundary that separates them. But what if we could build a map that reflects connectivity rather than position? Points connected by short hops through dense neighborhoods would be near each other on this map, even if they're on opposite sides of a circle in the original space. Points in different rings, despite being geometrically close at certain spots, would be far apart on this map because there are no short-hop paths between the rings.

That's what spectral clustering builds. It constructs a similarity graph where each point is a node and edge weights reflect how similar two points are — typically using a Gaussian kernel: W_ij = exp(−‖xi − xj‖² / 2σ²). Close points get heavy edges, distant points get near-zero edges.

From this graph, it computes the graph Laplacian, L = D − W, where D is a diagonal matrix of node degrees (how connected each node is). The key insight: the eigenvectors of this Laplacian with the smallest eigenvalues encode the "smoothest" ways to partition the graph. If the graph has k well-separated clusters, the eigenvectors for the k smallest eigenvalues take on roughly constant values within each cluster and different values between clusters. Stack these eigenvectors as columns, and each row becomes a new representation of a data point. In this new space, even wildly non-convex clusters become tidy, separable blobs that K-Means can handle. This is closely related to the normalized cut problem — finding a partition that minimizes the edges cut between groups while keeping the groups balanced in size.
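
Here is a minimal sketch of that embedding step on two concentric circles. It uses a normalized variant of the Laplacian rather than the raw L = D − W (closer to what sklearn actually does), and the gamma value is just illustrative. The sklearn call below then collapses the whole pipeline into one line and compares it against plain K-Means.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

X_rings, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

# Similarity graph: Gaussian kernel weights between every pair of points
gamma = 10
sq_dists = ((X_rings[:, None, :] - X_rings[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-gamma * sq_dists)
np.fill_diagonal(W, 0)

# Symmetric normalized Laplacian: I - D^(-1/2) W D^(-1/2)
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(len(X_rings)) - D_inv_sqrt @ W @ D_inv_sqrt

# Eigenvectors for the two smallest eigenvalues become the new representation
_, vecs = eigh(L_sym)
embedding = vecs[:, :2]
embedding /= np.linalg.norm(embedding, axis=1, keepdims=True)

# In the embedded space, the rings are plain blobs K-Means can separate
ring_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(embedding)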

from sklearn.cluster import SpectralClustering, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X, y = make_circles(n_samples=500, factor=0.5,
                     noise=0.05, random_state=42)

km = KMeans(n_clusters=2, random_state=42).fit(X)
sc = SpectralClustering(n_clusters=2, affinity='rbf',
                         gamma=10, random_state=42)
sc_labels = sc.fit_predict(X)

print(f"K-Means ARI: {adjusted_rand_score(y, km.labels_):.3f}")
print(f"Spectral ARI: {adjusted_rand_score(y, sc_labels):.3f}")

The catch is computational. Building the full similarity matrix takes O(n²) space. Computing eigendecompositions takes O(n³) time. Beyond about 10,000 points, it gets impractical. Nyström approximation and sparse affinity matrices help, but spectral clustering is fundamentally a small-to-medium data tool. If your dataset is large and non-convex, HDBSCAN is usually the better path.

Gaussian Mixture Models — Soft Assignments

Every clustering algorithm so far makes a hard call: this point belongs to cluster A, period. But back at our coffee shop, what about the customer who orders a moderate amount of espresso and a moderate amount of pastries? They sit right between the two groups. K-Means picks one and throws away the uncertainty. That always felt unsatisfying to me.

Gaussian Mixture Models keep the uncertainty. Instead of saying "cluster A," a GMM says "72% cluster A, 26% cluster B, 2% cluster C." Each cluster is modeled as a Gaussian distribution — a bell curve in multiple dimensions — with its own mean (center), covariance matrix (shape and orientation), and mixing weight (how many points belong to it overall). The data is assumed to be generated by a mixture of these Gaussians, and fitting the model means finding the parameters that make the observed data most likely.

The engine that fits a GMM is the Expectation-Maximization algorithm, and it has the same heartbeat as K-Means: assign, then update. The E-step computes responsibilities — for each point and each Gaussian component, the probability that this component generated this point. This is Bayes' theorem in action: multiply the component's prior weight by how likely the point is under that component's Gaussian, then normalize so the responsibilities sum to one. The M-step re-estimates each component's mean, covariance, and weight using responsibilities as soft counts — a responsibility-weighted average of all points, instead of a hard average of only the assigned points.
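
The E-step fits in a few lines. A toy sketch for a single point under a two-component mixture, where the means, covariances, and weights are purely illustrative made-up values:

import numpy as np
from scipy.stats import multivariate_normal

point = np.array([4.0, 4.0])                          # the in-between customer
means = [np.array([1.7, 1.3]), np.array([7.7, 7.3])]
covs = [np.eye(2) * 0.5, np.eye(2) * 0.8]
weights = [0.5, 0.5]

# Bayes' theorem: prior weight times likelihood, normalized over components
unnorm = np.array([
    w * multivariate_normal(mean=m, cov=c).pdf(point)
    for w, m, c in zip(weights, means, covs)
])
responsibilities = unnorm / unnorm.sum()
print(responsibilities)   # a soft split over components, not a hard assignment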

Here's the connection that made everything click for me: if you force the covariances to be spherical and equal, and force the responsibilities to be hard (0 or 1 only), EM reduces exactly to Lloyd's algorithm. K-Means is a degenerate special case of GMMs. It's the version where you throw away uncertainty and flexibility for the sake of speed. That realization — exposing the oversimplification I'd been operating under for years — was worth the deep dive by itself.

Covariance Types — Match the Shape

The covariance matrix of each component controls the shape of that cluster. Spherical gives you circles — one variance parameter per component, like K-Means but with varying sizes. Diagonal gives you axis-aligned ellipses — each feature gets its own variance, but no correlations between features. Full gives you arbitrary ellipses tilted at any angle — maximum flexibility, but also maximum parameters to estimate. With d features, that's d(d+1)/2 parameters per component. Tied forces all components to share the same covariance — same shape everywhere, different centers.

Start with full if you have enough data. If overfitting shows up — the BIC starts increasing — step down to diagonal or tied.
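
A quick way to let the data arbitrate, sketched on a small synthetic set: fit each covariance type at a fixed number of components and compare BIC (lower is better).

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X_cov, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.2, random_state=42)

for cov_type in ['spherical', 'diag', 'tied', 'full']:
    gm = GaussianMixture(n_components=3, covariance_type=cov_type,
                         random_state=42).fit(X_cov)
    print(f"{cov_type:>10s}  BIC={gm.bic(X_cov):>10.1f}")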

Choosing k with BIC

GMMs give you something K-Means doesn't: a likelihood. You can measure how probable your data is under the model. That opens the door to information criteria. The Bayesian Information Criterion balances fit against model complexity — it penalizes extra parameters so you don't keep adding components until the likelihood stops improving. Lower BIC is better. Sweep k, pick the valley.

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
import numpy as np

X, _ = make_blobs(n_samples=600, centers=3,
                   cluster_std=[1.0, 1.5, 0.5], random_state=42)

for k in range(1, 8):
    gm = GaussianMixture(n_components=k,
                          covariance_type='full',
                          random_state=42).fit(X)
    print(f"k={k}  BIC={gm.bic(X):>10.1f}")

# Fit the winner and extract soft assignments
gmm = GaussianMixture(n_components=3, covariance_type='full',
                       random_state=42).fit(X)
probs = gmm.predict_proba(X)
# probs[i] gives the probability distribution over clusters for point i

If you don't want to sweep k at all, BayesianGaussianMixture puts a Dirichlet process prior on the weights. Set a generous upper bound — say 15 components — and it automatically prunes unused ones to near-zero weight. The model tells you how many clusters it needs.
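
A minimal sketch, reusing the X from the BIC sweep above; the ceiling of 15 components and the 0.01 weight cutoff are arbitrary illustrative choices.

from sklearn.mixture import BayesianGaussianMixture
import numpy as np

bgm = BayesianGaussianMixture(n_components=15,
                              weight_concentration_prior_type='dirichlet_process',
                              random_state=42).fit(X)

# Components that kept non-trivial weight are the ones the model actually used
effective = np.sum(bgm.weights_ > 0.01)
print(f"Effective components: {effective}")
print(np.round(bgm.weights_, 3))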

When GMMs beat K-Means. Three situations: when you need soft assignments (anomaly detection thrives on this — "how unlikely is this point under the model?"), when clusters have different shapes and sizes (full covariance handles ellipses that K-Means can't), or when you want a generative model you can sample new data from. The tradeoff: EM is slower, can hit local optima (always run multiple initializations), and full covariance needs substantially more data per cluster to be stable.

How Do You Know It Worked?

Clustering has no accuracy score. There's no y_test column to compare against. You're evaluating structure that may or may not exist in the data. This is genuinely one of the hardest problems in unsupervised learning, and I still occasionally get tripped up by it.

Internal Metrics — No Ground Truth

The silhouette score is the most widely used. For each point, it compares how tightly the point sits within its own cluster (the mean intra-cluster distance, a) to how far it is from the nearest neighboring cluster (the mean nearest-cluster distance, b). The formula is (b − a) / max(a, b). A score near +1 means the point is well-placed. Near 0 means boundary. Negative means probably in the wrong cluster. Average it across all points for an overall score. Above 0.7 is strong structure, above 0.5 is reasonable, below 0.25 suggests you might be clustering noise.

The Davies-Bouldin Index measures the average worst-case overlap between each cluster and its most similar neighbor. Lower is better. Fast to compute, but biased toward convex shapes. Calinski-Harabasz is the ratio of between-cluster variance to within-cluster variance — a clustering F-statistic. Higher is better.

from sklearn.metrics import (silhouette_score,
                              davies_bouldin_score,
                              calinski_harabasz_score)

sil = silhouette_score(X, labels)
dbi = davies_bouldin_score(X, labels)
ch  = calinski_harabasz_score(X, labels)
print(f"Silhouette: {sil:.3f}  DB: {dbi:.3f}  CH: {ch:.0f}")

External Metrics — When You Have Labels

Sometimes you do have ground truth — a benchmark dataset, a validation set, or known categories you're checking against.

The Adjusted Rand Index counts all pairs of points in the dataset. How many pairs are in the same cluster and the same true class? How many are in the same cluster but different true classes? ARI adjusts for what you'd expect by random chance. Range: −1 to 1. Zero means the clustering is no better than random.

Normalized Mutual Information asks: how much does knowing the cluster assignment reduce your uncertainty about the true label? It's rooted in information theory. Range: 0 (independent) to 1 (perfect correspondence). One subtlety worth knowing: NMI can overrate results when you have many small clusters, because more partitions make it easier to "match."

from sklearn.metrics import (adjusted_rand_score,
                              normalized_mutual_info_score)

# y_true: known ground-truth categories; labels: the clustering being scored
ari = adjusted_rand_score(y_true, labels)
nmi = normalized_mutual_info_score(y_true, labels)
print(f"ARI: {ari:.3f}  NMI: {nmi:.3f}")

The metric bias trap. Here's something that burned me. Silhouette favors compact, spherical clusters — the kind K-Means produces. On a dataset where DBSCAN correctly found two crescent moons, the silhouette score rated K-Means higher because K-Means produced tighter (but wrong) convex blobs. The metric was rewarding the algorithm for matching its own assumptions, not for finding real structure. Internal metrics reflect the algorithm's geometry, not ground truth. Always visualize. Run multiple algorithms. Compare with external metrics when you have them.

Stability — The Underrated Check

Run your clustering on bootstrap samples or random subsets of the data. Do the same clusters keep appearing? If yes, they're real structure. If the results reshuffle every time, your "clusters" are artifacts of noise. This is the validation technique I trust most when metrics are ambiguous, and I wish I'd learned about it earlier.
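
A minimal sketch of the idea: cluster repeated random subsamples, then compare each subsample's labels against a reference clustering on the same points using ARI. The 80% subsample size and 20 repeats are arbitrary choices.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.datasets import make_blobs

X_stab, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
reference = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_stab)

rng = np.random.RandomState(0)
scores = []
for _ in range(20):
    idx = rng.choice(len(X_stab), size=int(0.8 * len(X_stab)), replace=False)
    sub_labels = KMeans(n_clusters=4, n_init=10).fit_predict(X_stab[idx])
    # ARI is permutation-invariant, so the raw label values don't need to match
    scores.append(adjusted_rand_score(reference.labels_[idx], sub_labels))

print(f"Mean stability ARI: {np.mean(scores):.3f}")   # near 1.0 = stable structure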

Picking the Right Tool

After all this, the single most important insight is that each algorithm defines "cluster" differently. K-Means says "stuff near a center." DBSCAN says "a dense crowd." Spectral clustering says "a tightly connected subgraph." GMMs say "a Gaussian distribution." None of them are wrong. They're different lenses on the same data, and the right lens depends on what you're looking at.

Situation                         | First choice            | Why
Know k, expect blobs, need speed  | K-Means                 | Fast, interpretable, well-understood
n > 100k, blob-shaped clusters    | MiniBatchKMeans         | 10x faster, negligible quality loss
Unknown k, noise and outliers     | HDBSCAN                 | Auto k, varying density, honest uncertainty
Non-convex shapes, moderate data  | Spectral                | Captures connectivity, not geometry
Need soft probabilities           | GMM                     | Soft assignments, elliptical clusters, generative model
Multi-resolution exploration      | Agglomerative           | Dendrogram gives every k at once
Interpretable density threshold   | DBSCAN                  | eps has physical meaning
Very high-dimensional data        | K-Means after PCA/UMAP  | Reduce dimensions first, always
Anomaly scoring                   | GMM                     | Low-likelihood points = anomalies

In practice, I run two or three algorithms on the same data, compare visually and via metrics, and pick the one that tells a story the domain experts believe. No single algorithm is universally best because "cluster" has no universal definition.

from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Stand-in data so the snippet runs; swap in whatever you're actually comparing on
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

algos = {
    'KMeans(4)': KMeans(n_clusters=4, n_init=10, random_state=42),
    'DBSCAN':    DBSCAN(eps=0.5, min_samples=5),
    'GMM(4)':    GaussianMixture(n_components=4, random_state=42),
}

for name, algo in algos.items():
    if hasattr(algo, 'fit_predict'):
        labels = algo.fit_predict(X_scaled)
    else:
        labels = algo.fit(X_scaled).predict(X_scaled)

    n_clusters = len(set(labels) - {-1})
    if n_clusters < 2:
        print(f"{name:15s}: {n_clusters} cluster(s), skipping")
        continue

    mask = labels != -1
    sil = silhouette_score(X_scaled[mask], labels[mask])
    print(f"{name:15s}: {n_clusters} clusters, silhouette={sil:.3f}")

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a question — what does "similar" mean? — and built up from there. Lloyd's algorithm on a napkin with six coffee-shop customers. K-Means++, spreading initial centroids like weather stations across a map. The silhouette score, peering into the quality of our groupings. Then the limitations of K-Means pulled us into DBSCAN's density-based world, HDBSCAN's adaptive hierarchies, agglomerative clustering's merge trees, spectral clustering's graph-theoretic lens, and finally GMMs, which taught us that K-Means was a special case of something richer all along.

My hope is that the next time someone asks you "why five clusters?", instead of staring at a scatter plot and hoping the silence passes, you'll have a pretty good mental model of what's going on under the hood — and you'll know which tool to reach for, and why.