EM Algorithm & Gaussian Mixtures

Chapter 6: Unsupervised Learning
Soft clustering · Hidden variables · Interview favorite

The Problem That Haunted Me

I avoided the EM algorithm for longer than I'd like to admit. Every time someone mentioned "expectation-maximization," my brain would quietly file it under "things that sound important but I'll learn later." The formulas looked like they were specifically designed to punish curiosity. Then one day I needed to cluster customer data where the groups weren't neat little spheres — they were stretched, overlapping, messy — and K-Means gave me garbage. I couldn't avoid it anymore. Here is that dive.

The Expectation-Maximization algorithm was formalized by Dempster, Laird, and Rubin in 1977 as a general method for finding maximum likelihood estimates when data is incomplete or has hidden variables. Its most famous application is fitting Gaussian Mixture Models — probability distributions that assume your data came from a blend of several Gaussian bells — but EM shows up everywhere: speech recognition, gene analysis, missing data imputation, topic modeling. It's one of those algorithms that, once you see it, you start noticing it in places you never expected.

Before we start, a heads-up. We're going to be building up from a concrete toy example, touching on Bayes' theorem and probability distributions, and eventually reaching the mathematical foundations that make EM provably work. You don't need to know any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

Contents

The hidden-variable problem
A detective story: our running example
Guess and refine — the EM loop
Walking through one full iteration
Rest stop
Why it always gets better: the ELBO
K-Means was EM all along
Gaussian Mixture Models: shape, size, and uncertainty
Covariance types and what shapes they allow
Choosing K: BIC, AIC, and the Bayesian escape hatch
The singularity trap
Convergence: the good and the frustrating
Where else EM hides in the wild
Putting it all together in code
Resources

The Hidden-Variable Problem

Imagine you're a data scientist at a music streaming service. You have listening data for 50 users — hours of pop music and hours of rock music per month. When you plot it, you see three-ish clouds of points. There are clearly different listener personas here: pop enthusiasts, rock devotees, and eclectic listeners who enjoy both. But nobody labeled these users. You don't know who belongs to which group.

Here's the circular trap. If someone handed you the labels — "this user is a pop lover, that one is a rock fan" — computing each group's average listening pattern and spread would be trivial arithmetic. And if someone handed you the group statistics — means and variances for each persona — assigning users to groups would be a straightforward application of Bayes' theorem. But you have neither. You need the labels to estimate the statistics, and you need the statistics to estimate the labels.

This is the hidden-variable problem, sometimes called the incomplete data problem. The "hidden" part is whatever you can't observe — cluster assignments, missing values in a spreadsheet, the topic that generated a word in a document, the hidden state sequence inside a speech recognition system. It's one of the most common situations in machine learning, and the reason we need EM.

A Detective Story: Our Running Example

Think of EM like a detective working a case with three suspects. The detective doesn't know who did what, but they have evidence — our data points. Here's the strategy: start by guessing. Make a rough sketch of each suspect's profile based on gut feeling. Then, for each piece of evidence, estimate how likely it is that each suspect is connected to it. Then refine the profiles based on those estimates. Then re-evaluate the evidence. Keep going until the profiles stop changing.

Let's make this concrete with the tiniest possible example. We have six users and two features: hours of pop per month and hours of rock per month. We believe there are two listener personas (we'll tackle "how many personas?" later). Here is our data:

User    Pop Hours    Rock Hours
─────   ─────────    ──────────
  A        8.0          2.0
  B        7.5          3.0
  C        2.0          8.5
  D        3.0          7.0
  E        5.0          5.5
  F        6.0          4.0

Looking at this, our gut says users A and B are pop listeners, C and D are rock listeners, and E and F are somewhere in between. But EM doesn't have a gut. It has math. Let's watch it work.

Guess and Refine — The EM Loop

EM has two steps that alternate, over and over, until the answer stabilizes. That's it. The entire algorithm structure fits on a napkin. But inside those two steps is where all the insight lives.

The E-Step: "Who Probably Belongs Where?"

Given whatever parameter estimates we currently have — the mean listening pattern and spread for each persona, plus how common each persona is — we compute responsibilities. A responsibility is the probability that a particular user was generated by a particular persona. These are soft assignments. User E might be 60% persona 1 and 40% persona 2. We're not forcing a choice. We're distributing credit.

The math behind this is Bayes' theorem, nothing more exotic. For each user x_i and each persona k, the responsibility is:

r_ik = π_k · N(x_i | μ_k, Σ_k) / Σ_j [π_j · N(x_i | μ_j, Σ_j)]

Let me break that down. π_k is the mixing weight — our current belief about how common persona k is (what fraction of all users belong to it). N(x_i | μ_k, Σ_k) is the Gaussian density — how likely user i's data is if they came from persona k with mean μ_k and covariance Σ_k. The denominator is a normalizer that ensures responsibilities sum to 1 across all personas for each user. A user sitting right on top of a persona's mean gets a high responsibility for that persona. A user in the overlap zone between two personas splits their responsibility.
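Here's a minimal sketch of that E-step, assuming X is an (n, d) data matrix and pis, mus, covs hold the current mixing weights, means, and covariances; it leans on scipy's multivariate_normal for the Gaussian densities:

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, covs):
    """Responsibilities r_ik: one row per data point, one column per component."""
    unnorm = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=cov)   # pi_k * N(x_i | mu_k, Sigma_k)
        for pi, mu, cov in zip(pis, mus, covs)
    ])
    return unnorm / unnorm.sum(axis=1, keepdims=True)        # normalize each row to sum to 1

In practice, libraries do this computation in log space (with a logsumexp for the normalizer) to avoid underflow when densities get tiny, but the plain version above shows the structure.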

The M-Step: "Given Those Soft Labels, What Are the Best Parameters?"

Now treat those soft responsibilities as fractional head-counts and re-estimate everything. The detective refines each suspect's profile based on the weighted evidence. The new mean for persona k is a weighted average of all user data, where the weights are the responsibilities toward k. The new covariance is the weighted scatter. The new mixing weight is the average responsibility.

N_k  = Σ_i r_ik                                            (effective count for persona k)
μ_k  = (1/N_k) · Σ_i r_ik · x_i                            (weighted mean)
Σ_k  = (1/N_k) · Σ_i r_ik · (x_i - μ_k)(x_i - μ_k)ᵀ      (weighted covariance)
π_k  = N_k / N                                              (updated mixing weight)

These are the same formulas you'd use if you had hard labels — compute group means, group variances, group proportions. The only difference is that instead of each user counting as "1" in exactly one group, each user contributes a fractional count to every group based on their responsibility. That's the entire M-step. Weighted maximum likelihood.
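Here's a minimal numpy sketch of those updates, assuming X is an (n, d) data matrix and R is the (n, K) responsibility matrix from the E-step:

import numpy as np

def m_step(X, R):
    """Weighted maximum-likelihood updates from responsibilities R of shape (n, K)."""
    N_k = R.sum(axis=0)                       # effective count per component
    pis = N_k / X.shape[0]                    # updated mixing weights
    mus = (R.T @ X) / N_k[:, None]            # weighted means
    covs = []
    for k in range(R.shape[1]):
        diff = X - mus[k]                     # (n, d) deviations from component k's mean
        covs.append((R[:, k, None] * diff).T @ diff / N_k[k])   # weighted scatter
    return pis, mus, np.array(covs)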

Walking Through One Full Iteration

Let's trace through our six users with two personas, starting from a random initialization. Suppose our initial guess is:

Persona 1: μ₁ = (6, 3),  σ₁² = 2  (spherical for simplicity),  π₁ = 0.5
Persona 2: μ₂ = (4, 7),  σ₂² = 2  (spherical),                 π₂ = 0.5

Persona 1 is our initial guess for "pop-leaning listeners" and Persona 2 for "rock-leaning listeners." Both start with equal weight because we have no reason to prefer one.

In the E-step, we compute how likely each user's data is under each persona's Gaussian, then normalize. User A at (8, 2) is close to Persona 1's mean of (6, 3) and far from Persona 2's mean of (4, 7). So User A's responsibility toward Persona 1 comes out essentially 1.0, and toward Persona 2 essentially 0. User C at (2, 8.5) is the opposite story. User E at (5, 5.5) sits roughly in between, splitting about 0.27 toward Persona 1 and 0.73 toward Persona 2.

In the M-step, we use those responsibilities as weights to recompute each persona's mean. Persona 1's new mean gets pulled toward users A, B, and F (who gave it high responsibility), landing around (7.0, 3.2). Persona 2's new mean gets pulled toward C, D, and E, landing around (3.2, 7.1). The covariances and mixing weights update too.

That's one iteration. The detective has refined their profiles. Now we go back to the E-step with these updated parameters, recompute responsibilities, and repeat. Each cycle, the profiles sharpen. After maybe 10 to 20 iterations, the changes become negligibly small, and we call it converged.
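If you want to check those numbers, here's a minimal numpy sketch of that single iteration on the six toy users, assuming the spherical initialization stated above:

import numpy as np

X = np.array([[8.0, 2.0], [7.5, 3.0], [2.0, 8.5],
              [3.0, 7.0], [5.0, 5.5], [6.0, 4.0]])   # users A..F
mus = np.array([[6.0, 3.0], [4.0, 7.0]])             # initial persona means
var = 2.0                                             # shared spherical variance
pis = np.array([0.5, 0.5])                            # initial mixing weights

# E-step: spherical Gaussian densities, then normalize per user
sq_dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (6, 2)
dens = np.exp(-sq_dists / (2 * var)) / (2 * np.pi * var)
unnorm = pis * dens
R = unnorm / unnorm.sum(axis=1, keepdims=True)
print(np.round(R, 3))              # user A ≈ [1.00, 0.00], user E ≈ [0.27, 0.73]

# M-step: weighted counts, means, and mixing weights
N_k = R.sum(axis=0)
new_mus = (R.T @ X) / N_k[:, None]
new_pis = N_k / len(X)
print(np.round(new_mus, 1))        # roughly [[7.0, 3.2], [3.2, 7.1]]
print(np.round(new_pis, 2))        # roughly [0.54, 0.46]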

Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a working mental model of EM: it's a guess-and-refine loop where the E-step computes soft assignments (responsibilities) using Bayes' theorem, and the M-step re-estimates parameters using weighted statistics. Each iteration improves the fit. It's the detective sharpening suspect profiles by iterating between "who does the evidence point to?" and "given those assignments, what does each suspect look like?"

That model is genuinely useful. You can explain EM in an interview, understand what sklearn's GaussianMixture is doing under the hood, and reason about why initialization matters.

But it doesn't tell the complete story. We haven't answered why EM is guaranteed to improve at each step, what it's really optimizing under the hood, how it relates to K-Means, or what to do when things go wrong. If the discomfort of not knowing what's underneath is nagging at you, read on.

Why It Always Gets Better: The ELBO

I'll be honest — the first time I saw the proof that EM always improves, it felt like a magic trick. I had to trace through it three times before it stopped feeling like sleight of hand. Here's the core difficulty: we want to maximize the log-likelihood of the observed data, log p(X|θ). But there's a sum over hidden variables trapped inside the log, and that makes direct optimization intractable.

EM sidesteps this by constructing a lower bound on the log-likelihood. Think of it like this: if you can't climb the mountain directly because you can't see the peak through the fog, build a floor underneath you that you can see, and push that floor upward. Each time you raise the floor, you're guaranteed to be at least as high as you were before.

That floor is the Evidence Lower Bound, or ELBO. Here's where it comes from. Start with the log-likelihood and introduce any distribution q(Z) over the hidden variables:

log p(X|θ) = log ∫ p(X, Z|θ) dZ
           = log ∫ q(Z) · [p(X, Z|θ) / q(Z)] dZ

Now apply Jensen's inequality — the fact that for a concave function like log, the log of an average is at least as large as the average of the logs. Pulling the log inside the integral can therefore only make the value smaller (or leave it unchanged):

log p(X|θ) ≥ ∫ q(Z) · log[p(X, Z|θ) / q(Z)] dZ = ELBO(q, θ)

The ELBO is always less than or equal to the true log-likelihood. That's the "lower bound" part. Now here's the beautiful bit. The E-step sets q(Z) to the exact posterior p(Z|X, θ_old), which makes the bound tight — the ELBO touches the true log-likelihood at the current parameters. The M-step then maximizes this tightened bound with respect to θ. Since we started with the bound touching the true likelihood and then pushed the bound upward, the true likelihood must have gone up too (or at least stayed the same).

This is why EM's convergence guarantee is not hand-wavy. It's coordinate ascent on the ELBO — optimize q in the E-step, optimize θ in the M-step. Each step is guaranteed to not decrease the objective. The floor keeps rising until it can't anymore.

Another way to see the ELBO is through the lens of KL divergence. The gap between the log-likelihood and the ELBO is exactly KL(q || p(Z|X,θ)) — the divergence between our chosen q and the true posterior. The E-step closes that gap to zero. The M-step may reopen it slightly (because changing θ changes the true posterior), but the net effect on the log-likelihood is always non-negative.
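To watch the guarantee in action, here's a small self-contained experiment on synthetic two-blob data. It runs sklearn's GaussianMixture one EM iteration at a time (max_iter=1 with warm_start=True, so each fit call continues from the previous parameters) and records the average log-likelihood after each step; the sequence should be non-decreasing:

import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
                    rng.normal([4, 4], 1.5, size=(200, 2))])

gmm = GaussianMixture(n_components=2, max_iter=1, warm_start=True,
                      init_params='random', random_state=0)

log_liks = []
with warnings.catch_warnings():
    warnings.simplefilter('ignore', ConvergenceWarning)   # max_iter=1 always warns
    for _ in range(30):
        gmm.fit(X_demo)                      # one more EM iteration per call
        log_liks.append(gmm.score(X_demo))   # average log-likelihood per point

print("monotone non-decreasing:",
      all(b >= a - 1e-8 for a, b in zip(log_liks, log_liks[1:])))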

💡 The ELBO connects to everything

If you go on to study variational autoencoders, variational inference, or Bayesian deep learning, you'll meet the ELBO again. The difference is that in those settings, the true posterior is too complex to compute exactly, so the E-step uses an approximate q from a restricted family. EM is the special case where q is unconstrained and set to the exact posterior. Variational inference is the generalization where q is approximate. Same objective, different level of constraint.

K-Means Was EM All Along

Here's something that rewired my understanding of both algorithms. K-Means is not a separate algorithm that happens to look like EM. It is EM, with three specific constraints bolted on.

First, responsibilities are hard — each point gets responsibility 1.0 for its nearest centroid and 0.0 for everything else. There's no soft assignment, no "40% cluster A." This is what K-Means' "assign to nearest centroid" step is: a degenerate E-step where the Gaussian densities have been replaced with "closest wins."

Second, all covariance matrices are spherical and identical — every cluster is assumed to be a perfect sphere of the same size. That's why K-Means can only find round, equally-sized blobs. It can't tilt or stretch to match the actual shape of the data.

Third, mixing weights are uniform — K-Means implicitly assumes all clusters are equally likely, regardless of how many points end up in each.

K-Means' "recompute centroids" step is the M-step under these three constraints: with hard assignments and spherical equal covariances, the weighted mean simplifies to the plain mean of assigned points. Relax any of the three constraints and you move from K-Means toward a full GMM. Relax all three and you get the EM algorithm for Gaussian mixtures. The detective stops forcing binary decisions, starts allowing suspects to have different builds, and adjusts how likely each suspect is based on the evidence.

This relationship matters in interviews and in practice. When someone asks "why not K-Means?", the answer is: K-Means is the version that assumes all your clusters are perfect identical spheres with hard boundaries. If that assumption holds, great. If your clusters are elliptical, different sizes, or overlapping — and real clusters almost always are — you need the full version.

Gaussian Mixture Models: Shape, Size, and Uncertainty

Back to our music streaming example. Suppose the pop enthusiasts have a tight, round listening pattern — they all listen to about the same amount of pop. But the eclectic listeners are spread out in a stretched, tilted cloud — their pop and rock hours are correlated (more of one tends to mean more of the other). K-Means would try to fit identical circles over both groups and do a poor job. A GMM gives each persona its own mean, its own covariance matrix (controlling the shape and orientation of the cluster), and its own mixing weight (how common that persona is).

This is why GMMs are sometimes called "clustering with soft assignments and shape." They're the natural upgrade when you need clusters that aren't all the same size and shape, and when you want probabilities instead of hard labels. A user being "60% pop enthusiast, 35% eclectic, 5% rock devotee" is often more useful than a flat label, especially when you're making personalized recommendations.

Because GMMs define a proper probability density over the data, they double as density estimators. You can score new users with score_samples and flag those with unusually low likelihood as anomalies — maybe bots, maybe users whose behavior has shifted. That's something K-Means can't give you.
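Here's a small sketch of that anomaly-detection use, assuming X_train and X_new are feature matrices; the cutoff choice (the 1st percentile of training scores) is just an illustrative convention:

import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='diag',
                      n_init=10, random_state=42).fit(X_train)

train_scores = gmm.score_samples(X_train)      # log-likelihood of each training point
threshold = np.percentile(train_scores, 1)     # anything below the lowest 1% looks unusual

new_scores = gmm.score_samples(X_new)
is_anomaly = new_scores < threshold
print(f"{is_anomaly.sum()} of {len(X_new)} new points flagged as anomalous")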

Covariance Types and What Shapes They Allow

The covariance matrix is where the complexity lives. Choosing the right type is a tradeoff between expressiveness and the amount of data you need to not overfit.

With full covariance, each component gets its own unconstrained covariance matrix. It can form arbitrary ellipsoids — tilted, stretched, squashed in any direction. This needs d(d+1)/2 parameters per component, where d is the number of features. For 100 features, that's over 5,000 parameters per cluster. With enough data, this captures the richest structure. Without enough data, it overfits spectacularly or the covariance matrix becomes singular (more on that trap shortly).

With tied covariance, all components share a single covariance matrix but have different means. The clusters are all the same shape and orientation, differing only in location. Useful when you believe the groups have similar spread but are centered in different places.

With diag covariance, each component gets its own diagonal covariance — axis-aligned ellipsoids. Features are treated as independent within each cluster. This needs only d parameters per component. It's the practical sweet spot for many problems: expressive enough to capture different spreads along each feature, cheap enough to fit without drowning in parameters.

With spherical covariance, each component is a perfect sphere — a single variance value shared across all features. This is what K-Means implicitly uses. It's the most constrained option, needing only 1 parameter per component, but it can only find round clusters.

My default strategy: start with diag. If the data is low-dimensional and you have plenty of samples, try full. If you're in high dimensions or short on data, diag or even spherical may be all you can afford.
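One quick way to compare the options empirically, assuming a feature matrix X: fit one GMM per covariance type and compare BIC, which charges the richer types for their extra parameters:

from sklearn.mixture import GaussianMixture

for ct in ['spherical', 'diag', 'tied', 'full']:
    gm = GaussianMixture(n_components=3, covariance_type=ct,
                         n_init=5, random_state=42).fit(X)
    print(f"{ct:>9}: BIC = {gm.bic(X):.1f}")    # lower is better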

Choosing K: BIC, AIC, and the Bayesian Escape Hatch

With K-Means, people stare at elbow plots and squint. GMMs offer something more principled: information criteria that balance fit against complexity.

BIC (Bayesian Information Criterion) computes -2·log(L) + k·log(n), where L is the maximized likelihood, k is the number of free parameters, and n is the number of data points. The penalty grows with log(n), so BIC punishes complexity more harshly as you get more data. It's consistent — meaning as n→∞, it will select the true number of components if the data really came from a mixture. This is usually the right default.

AIC (Akaike Information Criterion) computes -2·log(L) + 2k. The penalty is fixed regardless of sample size, so AIC tends to favor more components than BIC, especially with large datasets. It's better when the goal is prediction rather than identifying the "true" K, though in practice BIC is what most people reach for first.

The practical recipe: fit GMMs with K ranging from 1 to some reasonable upper bound, compute BIC for each, and pick the K where BIC is lowest (or where it first stops dropping significantly).

There's a third option that sidesteps the question entirely. Sklearn's BayesianGaussianMixture places a Dirichlet process prior on the mixing weights. You set a large upper bound on K, and the model automatically deactivates unnecessary components by driving their weights toward zero. After fitting, you check which components survived. It's not magic — it's sensitive to the concentration parameter — but it removes the need to sweep over K values.

from sklearn.mixture import BayesianGaussianMixture

bgmm = BayesianGaussianMixture(
    n_components=10,                          # generous upper bound
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=0.01,          # encourages fewer active clusters
    covariance_type='full',
    random_state=42
)
bgmm.fit(X)

active = (bgmm.weights_ > 1e-2).sum()
print(f"Active components: {active}")          # often much less than 10

The Singularity Trap

There's a failure mode that catches people off guard. If a Gaussian component collapses onto a single data point — its mean lands exactly on that point and its variance shrinks toward zero — the Gaussian density at that point shoots toward infinity. The likelihood goes to infinity too. Mathematically, EM found an answer. Practically, that answer is useless garbage.

This is the singularity problem, and it's a genuine hazard with GMMs, especially in high dimensions or with too many components relative to the data. It happens because maximum likelihood doesn't inherently penalize degenerate solutions. A component that perfectly memorizes one data point achieves infinite density, and EM, faithfully maximizing likelihood, will happily march toward that degenerate solution.

The standard fix is covariance regularization: add a small value λ (typically 1e-6) to the diagonal of each covariance matrix after every M-step. This prevents any eigenvalue from reaching zero, keeping the matrix invertible and the density finite. In sklearn, this is the reg_covar parameter, and it's set to 1e-6 by default. If you see convergence warnings or NaN values, bump it up.
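Here's a tiny illustration of why that diagonal bump matters, using a made-up, nearly collapsed covariance matrix:

import numpy as np

Sigma = np.array([[1e-12, 0.0],    # a component collapsing along one axis
                  [0.0,   2.0]])
reg_covar = 1e-6
Sigma_reg = Sigma + reg_covar * np.eye(2)   # the diagonal bump applied each M-step

print(np.linalg.cond(Sigma))       # ~2e12: effectively singular, density blows up
print(np.linalg.cond(Sigma_reg))   # ~2e6: comfortably invertible, density stays finite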

Other defenses include using diag or spherical covariance (which have far fewer parameters and are harder to collapse), reducing dimensionality with PCA before fitting, and using the Bayesian variant (which places priors on the covariance matrices that naturally resist degeneracy).

Convergence: The Good and the Frustrating

The good news: every EM iteration is guaranteed to increase (or at least not decrease) the observed data log-likelihood. This follows directly from the ELBO argument — we tighten the bound and then push it up. So EM always converges to something.

The frustrating news comes in two flavors.

First, that "something" is a local optimum, not necessarily the global one. The likelihood surface for mixture models is riddled with local maxima and saddle points. Two runs with different random starting points can land in completely different places, giving different cluster assignments and different parameter estimates. The fix is unglamorous but effective: run EM multiple times with different initializations and keep the result with the highest log-likelihood. That's what sklearn's n_init parameter does. I typically set it to 10 or 20.

Second, EM exhibits linear convergence — the error shrinks by a roughly constant fraction each iteration. Compare that to Newton's method, which converges quadratically (error squares each step). Near a flat region or saddle point, EM can feel painfully slow, making tiny improvements iteration after iteration. My intuition for the rate is still forming, but two facts help: EM is a first-order method — it never uses curvature information, so it doesn't know which direction the landscape bends most steeply — and its rate is governed by how much information the hidden variables carry, so the more the components overlap, the slower it crawls. It's the detective who refines profiles at a steady pace regardless of whether the case is nearly solved or still wide open.

In practice, initializing from a K-Means solution (sklearn's default, init_params='kmeans') or with K-Means++ seeding gets EM close to a good solution from the start, so the slow linear convergence in the endgame is less painful. For very large datasets, there are acceleration techniques — SQUAREM, Aitken acceleration, or hybrid approaches that switch from EM to a gradient-based optimizer once close enough — but these are specialized tools you'll rarely need.

Where Else EM Hides in the Wild

EM is not a GMM-specific algorithm. It's a general-purpose framework for any model with hidden variables, and it shows up in places that might surprise you.

Hidden Markov Models. The Baum-Welch algorithm — used to train HMMs for speech recognition, biological sequence analysis, and financial regime detection — is EM applied to sequential data. The E-step computes forward-backward probabilities (what hidden state was the system in at each time step?). The M-step re-estimates transition and emission parameters. If you've ever used a speech-to-text system, you've used EM.

Missing data imputation. When a spreadsheet has missing values, you can treat those missing entries as hidden variables. EM iterates between imputing expected values based on the current model (E-step) and re-estimating the model from the completed data (M-step). This is the engine behind several multivariate imputation methods.
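Here's a simplified sketch of that idea for a single multivariate Gaussian, assuming X is a numpy array with NaNs marking the missing entries. (Full EM would also add the conditional covariance of the imputed entries to the scatter estimate; this sketch skips that correction for brevity.)

import numpy as np

def em_impute(X, n_iter=50):
    """EM-style imputation under one multivariate Gaussian (simplified sketch)."""
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])   # crude initial fill
    for _ in range(n_iter):
        # M-step: re-estimate mean and covariance from the completed data
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])   # singularity guard
        # E-step: replace each missing block with its conditional mean given the observed block
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            S_oo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
            X[i, m] = mu[m] + Sigma[np.ix_(m, o)] @ S_oo_inv @ (X[i, o] - mu[o])
    return X

# hypothetical usage: knock out ~10% of entries from correlated synthetic data
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3)) @ np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
data_missing = data.copy()
data_missing[rng.random(data.shape) < 0.1] = np.nan
completed = em_impute(data_missing)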

Topic models. Latent Dirichlet Allocation, the workhorse of topic modeling, uses a variational EM algorithm. The hidden variables are the topic assignments for each word. The E-step estimates which topics generated which words. The M-step updates the topic-word distributions. The "variational" part means the E-step uses an approximate posterior rather than the exact one — a direct generalization of the ELBO framework we discussed.

Semi-supervised learning. When you have some labeled and some unlabeled data, EM can incorporate both. The labeled data contributes directly to the M-step; the unlabeled data goes through the E-step to get soft assignments first. This was one of the early approaches to semi-supervised learning before deep methods took over.

Putting It All Together in Code

Here's the code that brings everything together. The most important thing to notice is how little of it there is — sklearn does the heavy lifting. The choices that matter are the covariance type, the number of components, and the number of restarts.

from sklearn.mixture import GaussianMixture
import numpy as np

gmm = GaussianMixture(
    n_components=3,
    covariance_type='diag',       # start here, upgrade to 'full' if data supports it
    n_init=10,                    # 10 random restarts, keep best
    init_params='k-means++',      # smart initialization
    reg_covar=1e-6,               # singularity guard
    random_state=42
)
gmm.fit(X)

The fit call runs the full EM loop internally — E-step, M-step, repeat until convergence or max iterations. What comes out is a fitted model with three attributes that tell the whole story: means_ (where each cluster is centered), covariances_ (how each cluster is shaped), and weights_ (how common each cluster is).

soft_labels = gmm.predict_proba(X)    # (n_samples, 3) — responsibilities
hard_labels = gmm.predict(X)          # argmax of responsibilities
scores = gmm.score_samples(X_new)     # log-likelihood per point — useful for anomaly detection

For model selection, sweep over K and plot BIC:

import matplotlib.pyplot as plt

bics = []
K_range = range(1, 10)
for k in K_range:
    gm = GaussianMixture(n_components=k, covariance_type='diag',
                          n_init=5, random_state=42).fit(X)
    bics.append(gm.bic(X))

plt.plot(K_range, bics, 'o-')
plt.xlabel('Number of components')
plt.ylabel('BIC (lower is better)')
plt.title('Pick the K where BIC stops dropping')
plt.show()

And here's a production-ready pipeline that handles scaling, dimensionality reduction, and clustering in one shot:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

pipe = Pipeline([
    ('scaler', StandardScaler()),             # GMMs are distance-sensitive
    ('pca', PCA(n_components=20)),            # tame high dimensions
    ('gmm', GaussianMixture(
        n_components=5,
        covariance_type='full',               # PCA reduced dims enough for full
        n_init=10,
        random_state=42
    ))
])

pipe.fit(X_train)
soft_labels = pipe.predict_proba(X_test)
hard_labels = pipe.predict(X_test)

The scaler matters because GMMs, like K-Means, are sensitive to feature scale. A feature measured in millions will dominate the covariance estimates and drown out features measured in single digits. PCA matters because full covariance in high dimensions is a recipe for overfitting or singularity — reducing to 20 or 30 dimensions first lets you use full without needing an ocean of data.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a circular trap — you need labels to estimate parameters, and parameters to estimate labels — and broke it with a beautifully simple idea: guess one, solve the other, repeat. We traced through a toy example of music listeners to see the E-step (compute soft responsibilities via Bayes' theorem) and M-step (re-estimate parameters via weighted statistics) in action. We pulled back the curtain to see the ELBO — the lower bound that makes EM's convergence guarantee rigorous, not hand-wavy. We discovered that K-Means was EM all along, operating under three restrictive assumptions. We learned how covariance types control cluster shape, how BIC and Bayesian methods help choose K, and how regularization prevents the singularity trap. And we saw that EM isn't limited to clustering — it's a general framework hiding inside HMMs, topic models, imputation methods, and more.

My hope is that the next time you encounter a mixture model in a paper, a codebase, or an interview, instead of feeling that old instinct to quietly move on, you'll recognize the detective's loop — guess, refine, repeat — and have a pretty solid mental model of what's happening under the hood.

Resources

Bishop, Pattern Recognition and Machine Learning, Chapter 9 — the gold standard treatment of EM and GMMs. Thorough but readable, which is rare for a textbook at this level.

Murphy, Machine Learning: A Probabilistic Perspective, Chapter 11 — covers EM, the ELBO, and variational connections with excellent clarity. Insightful for the probabilistic framing.

Dempster, Laird & Rubin (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm" — the O.G. paper. Surprisingly accessible for a 1977 statistics paper.

Scikit-learn documentation on Gaussian Mixture Models — wildly helpful for the practitioner. Covers covariance types, BIC selection, and the Bayesian variant with excellent diagrams.

Andrew Ng's CS229 Lecture Notes on EM — a clean, concise derivation that strips away unnecessary formalism. Available free online and worth reading alongside this section.