Dimensionality Reduction

Chapter 6: Unsupervised Learning · Compression · Visualization · Feature engineering

A Confession and an Invitation

I avoided this topic for an embarrassingly long time. Every explanation I found would start reasonably — "PCA finds directions of maximum variance" — and within two paragraphs I'd be staring at eigenvectors, covariance matrices, and notation that made my eyes glaze over. I'd close the tab, promise myself I'd come back when I was "better at linear algebra," and go build another random forest instead.

It took me years to realize the problem wasn't my math. It was that nobody explained why these tools exist before dumping the machinery on me. Nobody started with the frustration that makes dimensionality reduction necessary, or walked through a tiny example where you could watch the thing work with your own eyes.

So here's what this chapter is: the explanation I wish I'd found back then. We're going to start with a real problem — too many features, not enough signal — and build every tool from scratch using examples small enough to hold in your head. First comes PCA, the workhorse that handles 80% of cases, then the wilder territory of t-SNE, UMAP, and a handful of specialized methods that solve problems PCA can't touch.

A heads-up before we start: there will be some matrix math. I've tried to make every equation earn its place by showing you what it does on real numbers before naming it. If you survived the linear algebra chapter, you have more than enough. If you skipped it, you'll still follow the intuition — I'll point out where to come back if you want the full derivation.

This isn't a short journey, but I hope you'll be glad you came.

What We'll Cover

Why high dimensions hurt (the curse of dimensionality) and why reduction is possible anyway (the manifold hypothesis); PCA from the covariance matrix through eigendecomposition and SVD to production pipelines; t-SNE and UMAP for nonlinear structure; LDA, ICA, NMF, and random projections for specialized jobs; and a decision framework for choosing among them.

The Curse of Dimensionality

Before we learn how to reduce dimensions, we need to feel why high dimensions hurt. Not in the abstract "it's computationally expensive" way, but viscerally, in a way that changes how you think about data.

Let's start with two data points. In one dimension, place them at 0 and 1. The distance between them is 1. Now add a second dimension — put them at (0, 0) and (1, 1). Distance is √2 ≈ 1.41. Three dimensions: (0, 0, 0) and (1, 1, 1). Distance is √3 ≈ 1.73. In 100 dimensions, the distance between the two corners of the unit hypercube is √100 = 10.

That doesn't sound so bad. But here's the part that breaks your intuition: generate five hundred random points uniformly inside the 100-dimensional unit hypercube, and measure all the pairwise distances. The average distance will be large, but much more importantly, the spread of those distances will be tiny relative to the mean. Every point is roughly the same distance from every other point.

import numpy as np

np.random.seed(42)
for d in [2, 10, 100, 1000]:
    points = np.random.rand(500, d)
    # Compute all pairwise distances
    diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Get upper triangle (unique pairs)
    idx = np.triu_indices(500, k=1)
    pair_dists = dists[idx]
    ratio = pair_dists.std() / pair_dists.mean()
    print(f"d={d:4d}  mean={pair_dists.mean():.2f}  "
          f"std={pair_dists.std():.3f}  std/mean={ratio:.4f}")

Run that and watch the std/mean ratio shrink toward zero. In 1000 dimensions, the farthest pair and the nearest pair are almost the same distance apart. This is distance concentration, and it follows a clean mathematical pattern: the relative spread of distances scales roughly as 1/√d. As dimensions grow, distances lose their ability to discriminate.

Think about what that means for k-nearest neighbors. The algorithm works by finding the k closest points and taking a vote. If every point is roughly equidistant, the "nearest" neighbors are barely closer than the farthest strangers. The vote becomes random. K-NN doesn't gradually degrade in high dimensions — it effectively stops working.
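
Here's a small sketch of that effect (exact numbers vary with the seed): pick one query point, measure its distance to a couple of thousand random points, and watch the gap between nearest and farthest close as d grows.

import numpy as np

np.random.seed(0)
for d in [2, 10, 100, 1000]:
    X = np.random.rand(2000, d)      # 2000 random "training" points
    query = np.random.rand(d)        # one query point
    dists = np.linalg.norm(X - query, axis=1)
    print(f"d={d:4d}  nearest={dists.min():.2f}  farthest={dists.max():.2f}  "
          f"nearest/farthest={dists.min() / dists.max():.3f}")

# As the ratio creeps toward 1.0, the "nearest" neighbors are barely nearer
# than the farthest strangers, and the k-NN vote carries almost no signal.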

Three things break as dimensions grow. Speed — distance calculations scale linearly with features, and tree-based indexing structures (like KD-trees) degrade to brute-force search beyond about 20 dimensions. Overfitting — a dataset with 200 samples and 500 features gives your model enormous room to memorize noise. There's always some spurious direction in 500-D space that separates any two classes perfectly by accident. Visualization — you cannot plot 500 axes. You can barely fake 3.

That's why we need dimensionality reduction. Not because it's elegant, but because without it, large swaths of machine learning stop functioning.

The Manifold Hypothesis — Why Reduction Even Works

Here's a question that bugged me for a long time: if we throw away dimensions, aren't we throwing away information? How can we compress a 500-feature dataset down to 50 and still have a useful model?

The answer is the manifold hypothesis, and it's one of those ideas that sounds fancy but is actually very intuitive. The claim is that real-world high-dimensional data doesn't fill the entire space it lives in. Instead, it lies on or near a lower-dimensional surface — a manifold — embedded in the high-dimensional space.

Picture a Swiss roll. Take a flat 2D sheet of paper and curl it up in 3D space. The data points live in 3D, but the real structure — the relationships between points — is 2D. Two points that are far apart in 3D (one on the outside of the roll, one on the inside) might actually be close together if you unroll the sheet. The 2D surface is the manifold.

This happens constantly in real data. Images of faces live in a space with millions of pixel dimensions, but the actual variation — pose, lighting, expression, identity — has maybe a few hundred degrees of freedom. A user behavior dataset might have 300 features, but users really fall along a handful of behavioral axes: price-sensitive vs. brand-loyal, frequent-browser vs. targeted-buyer, impulse vs. deliberate.

Dimensionality reduction works because the data was never truly high-dimensional to begin with. We're not compressing information — we're finding the lower-dimensional surface the data actually lives on and throwing away the empty space around it. Think of it like a map projection: the Earth's surface is 2D, even though it's embedded in 3D. A good map preserves the relationships that matter while discarding the dimension (altitude, for most purposes) that doesn't.

This map projection analogy will come back. Different dimensionality reduction methods are like different map projections — they all sacrifice something, and the right choice depends on what you need to preserve.

PCA: Finding the Best Camera Angle

The Photography Analogy

Imagine you have a 3D sculpture and you need to photograph it for a catalog. You get exactly one photo — one 2D image to capture as much of the sculpture's detail as possible. You walk around it, looking for the angle where the most interesting features are visible. From the front, you can see the face and the arms. From the side, the arms disappear behind the torso. From directly above, everything flattens into an unrecognizable blob.

PCA is the algorithm that finds the best camera angle. It rotates your coordinate system so that the first axis captures the most variation in the data, the second axis captures the most remaining variation perpendicular to the first, and so on. Then you drop the axes that show the least — the ones where the sculpture looks the same from every angle along that axis.

The key insight: PCA doesn't warp, stretch, or distort your data. It rotates it. The new axes — called principal components — are ordered by how much of the data's spread they capture. The first few components show you the "interesting directions." The rest are noise you can safely ignore.

A Tiny Example: 4 Products, 3 Features

Let's make this concrete. Suppose you're building a product recommendation system for a small online store. You have 4 products, and for each one you've measured 3 features of user behavior: average time spent on the product page (in seconds), number of times it was added to a wishlist, and number of purchases.

import numpy as np

# 4 products × 3 features: [page_time, wishlists, purchases]
X = np.array([
    [120,  15,  45],   # Product A: popular, high engagement
    [100,  12,  40],   # Product B: somewhat popular
    [ 20,   2,   5],   # Product C: niche, low engagement
    [ 30,   4,   8],   # Product D: niche, slightly more visible
])

# Step 1: Center the data (subtract the mean of each feature)
X_centered = X - X.mean(axis=0)
print("Centered data:")
print(X_centered)
# [[ 52.5    6.75  20.5 ]
#  [ 32.5    3.75  15.5 ]
#  [-47.5   -6.25 -19.5 ]
#  [-37.5   -4.25 -16.5 ]]

Look at those centered values. Products A and B have positive values across all three features. Products C and D have negative values across all three. The features are moving together — when page time goes up, wishlists and purchases go up too. That correlated movement is exactly what PCA will exploit.

If all three features are telling roughly the same story ("this product is popular"), then we don't need three numbers to describe each product. One number — a "popularity score" — might capture most of what's going on. That's what the first principal component will be.

The Covariance Matrix

To find the direction of maximum spread, we need to quantify how features vary together. That's what the covariance matrix does.

# Step 2: Compute the covariance matrix
n = X_centered.shape[0]
C = (1 / (n - 1)) * X_centered.T @ X_centered
print("Covariance matrix (3×3):")
print(np.round(C, 1))
# [[2491.7  310.8 1041.7]
#  [ 310.8   38.9  129.5]
#  [1041.7  129.5  437.7]]

The covariance matrix C = (1/(n−1)) Xᵀ X (where X here is the centered data) has shape (features × features) — here, 3×3. The diagonal entries are the variances of each feature. C[0,0] = 2491.7 is the variance of page_time; it's large because page_time has big numbers. The off-diagonal entry C[0,1] = 310.8 is the covariance between page_time and wishlists — it's positive, confirming they move together.

Every entry in this matrix tells you something about how two features relate. The covariance matrix is a complete summary of the linear relationships in your data. PCA's job is to find the directions in this matrix where the variance is concentrated.

Eigendecomposition

I'll be honest — when I first saw that PCA was "eigendecomposition of the covariance matrix," I nearly closed the tab. It sounded like someone had combined every intimidating word from linear algebra into one phrase designed to repel beginners.

But here's what eigendecomposition actually does. You give it a square matrix (our covariance matrix C), and it returns two things: a set of eigenvectors — directions — and a set of eigenvalues — one number per direction, telling you how much variance lives along that direction. The eigenvectors are perpendicular to each other. Ordered by eigenvalue, they give you a ranking: first direction captures the most variance, second captures the next most, and so on.

# Step 3: Eigendecompose the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Sort by eigenvalue descending (eigh returns ascending)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print("Eigenvalues:", np.round(eigenvalues, 1))
# ≈ [2966.2  2.0  0.0]

print("Explained variance ratios:")
ratios = eigenvalues / eigenvalues.sum()
print(np.round(ratios, 4))
# ≈ [0.9993  0.0007  0.0000]

print("\nFirst eigenvector (PC1 direction):")
print(np.round(eigenvectors[:, 0], 4))
# ≈ [-0.9165 -0.1143 -0.3834]  (sign is arbitrary; yours may be flipped)

Look at those eigenvalues. The first one (2966.2) dwarfs the other two (2.0 and 0.05). That means 99.93% of all the variance in our 3-feature dataset lives along a single direction. The data cloud isn't a sphere — it's a cigar, elongated along one axis. We came in with 3 features and the data is screaming that it's essentially one-dimensional.

The first eigenvector [-0.92, -0.11, -0.38] tells us the direction of that cigar. It loads heavily on page_time (the first entry is -0.92), with smaller contributions from wishlists and purchases. This is our "popularity score" direction.

The formal decomposition is C = VΛVᵀ, where V is the matrix of eigenvectors as columns, and Λ is the diagonal matrix of eigenvalues. That's the equation. The columns of V are our principal component directions. The eigenvalues in Λ tell us how important each direction is.
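
If you want to see that identity hold on our numbers, here's a quick check (reusing the variables from the code above):

# Sanity check: rebuild C from the eigenvectors and eigenvalues
V = eigenvectors                    # columns are the principal directions
L = np.diag(eigenvalues)            # eigenvalues on the diagonal
print(np.allclose(V @ L @ V.T, C))        # True: C = V Λ Vᵀ
print(np.allclose(V.T @ V, np.eye(3)))    # True: the directions are orthonormal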

# Step 4: Project the data onto the first principal component
pc1 = X_centered @ eigenvectors[:, 0]  # 4 products → 4 scores
print("PC1 scores:", np.round(pc1, 1))
# [-56.7  -36.2   51.7   41.2]

# Products A and B get large negative scores (popular)
# Products C and D get large positive scores (niche)
# (If your eigenvector came out sign-flipped, all the scores flip too;
#  the separation is what matters.)
# One number now captures most of the 3-feature story

That's it. That's PCA. Center the data, compute the covariance matrix, find the eigenvectors, project onto the top k. Everything else is optimization, numerics, and practical advice.

SVD: What Actually Runs Under the Hood

In our toy example with 3 features, we formed the covariance matrix Xᵀ X and eigendecomposed it. That works fine for small problems. But imagine a gene expression dataset with 20,000 features. Xᵀ X would be a 20,000 × 20,000 matrix — that's 400 million entries. Forming it is expensive. Decomposing it is worse. And there's a subtler problem: multiplying Xᵀ X can amplify floating-point errors, making the result numerically unstable.

In practice, nobody forms Xᵀ X. Instead, we apply Singular Value Decomposition directly to the data matrix X:

# X = U Σ Vᵀ
# With full_matrices=False and k = min(n, p):
# U: (n × k) — left singular vectors (row space)
# Σ: length-k vector of singular values
# Vᵀ: (k × p) — right singular vectors (= principal component directions!)

U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)

print("Singular values:", np.round(sigma, 2))
# [94.33  2.45  0.37]

# The relationship: eigenvalue_i = sigma_i^2 / (n - 1)
print("Eigenvalues from SVD:", np.round(sigma**2 / (n - 1), 1))
# ≈ [2966.2  2.0  0.0]; matches the eigendecomposition

print("PC directions from SVD (rows of Vt):")
print(np.round(Vt[0], 4))
# Same direction as eigenvectors[:, 0], possibly sign-flipped

The right singular vectors (rows of Vᵀ) are exactly the principal component directions. The singular values σᵢ relate to eigenvalues by λᵢ = σᵢ² / (n − 1). Same answer, better numerics, and we never build the covariance matrix.

Randomized SVD goes a step further. If you only need the top 50 components out of 20,000 features, why compute all 20,000? Randomized algorithms use random projections to approximate the top-k singular vectors dramatically faster. Scikit-learn uses this automatically when you set svd_solver='randomized' or 'auto' (which selects randomized when n_components is small relative to features).
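
Here's a rough, illustrative sketch on synthetic data (the matrix shape, rank, and noise level are arbitrary choices, and timings depend entirely on your machine):

import numpy as np
from time import perf_counter
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# A wide synthetic matrix with ~100 strong directions plus noise
X_big = rng.standard_normal((2000, 100)) @ rng.standard_normal((100, 3000))
X_big += 0.5 * rng.standard_normal((2000, 3000))

for solver in ['full', 'randomized']:
    start = perf_counter()
    pca = PCA(n_components=50, svd_solver=solver, random_state=0).fit(X_big)
    print(f"{solver:>10s}: {perf_counter() - start:6.2f}s  "
          f"top-50 explained variance = {pca.explained_variance_ratio_.sum():.3f}")

# The randomized solver only approximates the top components, so it runs much
# faster on wide matrices while reporting nearly the same explained variance.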

But here's the limitation that matters: both eigendecomposition and SVD find linear directions. They can rotate the data, but they can't bend it. Remember the Swiss roll — a 2D sheet curled up in 3D? PCA would try to flatten it by finding the direction of maximum variance, which would project the spiral onto a line and mangle the structure. PCA can find the best camera angle for a sculpture, but it can't unroll the sculpture into a flat photograph. We'll need nonlinear methods for that, and we'll get to them after the rest stop.

How Many Components? The Scree Plot

Knowing how PCA works is one thing. Deciding how many components to keep is the actual practical question you'll face every time you use it.

The explained variance ratio for component k is its eigenvalue divided by the sum of all eigenvalues. In our product example, PC1 explained 99.93% — we were lucky, the data was almost perfectly one-dimensional. Real datasets are messier.

A scree plot shows explained variance ratios in order, from largest to smallest. You're looking for an elbow — the point where the curve drops off and each additional component contributes diminishing returns. Everything past the elbow is mostly noise.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
print(f"Original shape: {X.shape}")  # (1797, 64) — 64 pixel features

# ALWAYS standardize before PCA
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components to inspect the variance breakdown
pca_full = PCA().fit(X_scaled)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)

# How many components for 95% variance?
n_95 = np.argmax(cumvar >= 0.95) + 1
print(f"Components for 95% variance: {n_95}")  # ~28

# Now reduce for real
pca = PCA(n_components=n_95)
X_reduced = pca.fit_transform(X_scaled)
print(f"Reduced shape: {X_reduced.shape}")  # (1797, 28)

# Sanity check: reconstruction error
X_back = pca.inverse_transform(X_reduced)
mse = np.mean((X_scaled - X_back) ** 2)
print(f"Reconstruction MSE: {mse:.4f}")

The pragmatic rule: keep enough components to reach 90–95% cumulative explained variance. Then validate on your actual downstream task — if classification accuracy doesn't drop, you've successfully compressed your data. If it does, bump the threshold up.

I should confess something here: the 90–95% rule is a useful starting point, but it's not gospel. I've seen datasets where 80% variance was plenty, and others where even 99% wasn't enough because the signal lived in the small-variance components. The scree plot is a guide, not a commandment. Always let your downstream metric have the final word.

PCA in Production

There's one mistake I see more often than any other with PCA, and I made it myself more than once: fitting PCA on the entire dataset, including the test set, before doing cross-validation. This leaks information. The principal components are computed using statistics (means, variances, covariances) from the data — if those statistics include test data, your model has peeked at information it shouldn't have.

Standardize first! PCA maximizes variance. If feature A ranges from 0 to 1,000,000 and feature B ranges from 0 to 1, PCA will think A dominates — not because it's more informative, but because it has bigger numbers. StandardScaler before PCA. Every time. No exceptions. I learned this the hard way when my first PCA reduced 50 features down to one that was 99.9% "revenue in dollars."

The fix is to put PCA inside a scikit-learn Pipeline, so it only sees training data in each fold:

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# The right way: PCA inside the pipeline, after scaling
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),  # float = auto-select for 95% variance
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"PCA pipeline accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Compare: no PCA
pipe_full = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])
scores_full = cross_val_score(pipe_full, X, y, cv=5, scoring='accuracy')
print(f"Full features accuracy: {scores_full.mean():.3f} ± {scores_full.std():.3f}")

Passing n_components=0.95 as a float tells PCA to automatically keep however many components reach 95% explained variance. No manual tuning needed — the number of components adapts to the data in each training fold.

PCA ≈ Linear Autoencoder

This is one of those connections that clicked for me once and permanently changed how I think about learned representations.

Build a neural network with one hidden layer, linear activations, and MSE loss. The input and output are the same data (it's an autoencoder — it tries to reconstruct its own input). The hidden layer has k neurons, fewer than the input dimension — a bottleneck. Train it.

That network learns to project the data down to k dimensions and back up, minimizing reconstruction error. The subspace it finds is exactly the same subspace PCA finds. Mathematically identical. The encoder learns the top-k eigenvectors, and the decoder learns their transpose.

Add nonlinear activations and you get something strictly more powerful — a nonlinear autoencoder that can capture curves and folds PCA misses. But the linear case is PCA. This makes PCA the theoretical baseline for all learned compressions. When someone claims their fancy autoencoder is great at dimensionality reduction, PCA is the sanity check: are you actually doing better than a linear projection?
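
Here's a minimal sketch of that equivalence. Instead of SGD I'll fit the linear autoencoder with alternating least squares (same model and loss, just far easier to make converge in a few lines); the data, sizes, and iteration count are arbitrary choices for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with a clear 4-dimensional structure plus a little noise
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 20))
X += 0.1 * rng.standard_normal((500, 20))
X = X - X.mean(axis=0)

k = 4
D = rng.standard_normal((k, 20))              # decoder weights: k -> 20
for _ in range(50):
    Z = X @ np.linalg.pinv(D)                 # encoder step: best codes for this decoder
    D = np.linalg.lstsq(Z, X, rcond=None)[0]  # decoder step: best decoder for these codes

ae_mse = np.mean((Z @ D - X) ** 2)

pca = PCA(n_components=k).fit(X)
pca_mse = np.mean((pca.inverse_transform(pca.transform(X)) - X) ** 2)
print(f"linear autoencoder MSE: {ae_mse:.5f}   PCA MSE: {pca_mse:.5f}")
# The two reconstruction errors should agree closely: the bottleneck has
# found the same subspace that the top-k principal components span.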

Rest Stop

Congratulations on making it this far. Seriously — you've absorbed the curse of dimensionality, the manifold hypothesis, and the complete mechanics of PCA from covariance matrices to SVD to production pipelines. That's a lot.

You can stop here if you want. What you know right now covers probably 80% of real-world dimensionality reduction needs. You know what PCA does (rotates to maximum-variance axes), how it works (eigendecomposition of the covariance matrix, or SVD in practice), how to choose the number of components (scree plot, 90-95% variance), and how to use it correctly in a pipeline (StandardScaler → PCA → model, inside cross-validation). For many datasets and many problems, that's all you'll ever need.

But PCA has a hard limitation: it's linear. It finds the best camera angle, but it can only rotate the camera — it can't reshape the sculpture. When your data lies on a curved manifold (like the Swiss roll), PCA will flatten it in the wrong way, and the structure you care about gets destroyed.

The rest of this chapter covers what to do when linearity isn't enough. We'll meet t-SNE (the visualization tool everyone uses but few interpret correctly), UMAP (the modern workhorse), and a handful of specialized methods for specific problems. Each one trades PCA's simplicity for the ability to see something PCA can't.

If that's all you need, you're 80% of the way there. Close this tab, go build something, and come back when you hit a wall.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

When PCA Fails: The Nonlinear World

Remember the Swiss roll — a 2D sheet of paper curled up into a spiral in 3D space? Let's see what PCA does with it.

from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA

X_swiss, color = make_swiss_roll(n_samples=1500, noise=0.5, random_state=42)
print(f"Swiss roll shape: {X_swiss.shape}")  # (1500, 3)

# PCA: project from 3D to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_swiss)

# If you plot X_pca colored by 'color', you'll see the spiral
# has been flattened — points that were far apart on the sheet
# (but close in 3D because the sheet is rolled up) are now
# overlapping. PCA preserved the 3D distances, not the
# 2D-on-the-manifold distances.
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")
# High! PCA thinks it did great. But the structure is mangled.

PCA finds the two directions of maximum variance in 3D and projects onto them. Those happen to be the width and height of the roll. The spiral structure — which is the actual geometry of the data — gets flattened and folded on top of itself. Points that were far apart along the surface of the sheet end up on top of each other in the PCA projection.

Going back to our map analogy: this is like trying to make a flat map of the Earth using only a single photograph from space. You'll capture one hemisphere but lose the other completely. What you need is a proper map projection — an algorithm that knows the Earth is curved and unfolds it intelligently. That's what nonlinear methods do.

Kernel PCA is one approach — it applies the kernel trick to PCA, mapping data into a higher-dimensional space where linear PCA can capture nonlinear structure. It works, but you need to choose a kernel (RBF is the usual default), it doesn't scale well to large datasets, and in practice it's been largely displaced by the methods we'll cover next.

t-SNE: The Visualization Microscope

Let me say this upfront and in bold: t-SNE is a visualization tool. Not a feature engineering method. Not a preprocessing step. Not a way to produce inputs for a downstream model. It makes pretty 2D pictures of high-dimensional data. That's what it's for, and it's spectacularly good at it — but using it for anything else is a mistake I see constantly.

I still get tripped up interpreting t-SNE plots, if I'm being honest. After years of using it, I sometimes catch myself staring at a gap between two clusters and thinking "those groups are very different" — which is exactly the kind of conclusion t-SNE doesn't support. Old habits die hard.

How It Works

The core idea has three steps, and each one is worth understanding.

Step 1 — Build a probability distribution in high-D. For each data point, t-SNE computes the probability that it would "pick" every other point as a neighbor. This uses a Gaussian kernel centered on each point: nearby points get high probability, distant points get near-zero. The width of the Gaussian is adapted per point so that each one has roughly the same number of effective neighbors (controlled by the perplexity parameter). The result is a probability distribution over all pairs of points that encodes the local neighborhood structure of the data.

Step 2 — Build a similar distribution in 2D, but with a different kernel. Place points randomly in 2D. Compute pairwise similarities again — but this time, use a Student's t-distribution with one degree of freedom (which is the same as a Cauchy distribution) instead of a Gaussian. This is the key innovation, and the reason for the name t-SNE.

Step 3 — Make the two distributions match. Use gradient descent to minimize the KL divergence between the high-D distribution and the low-D distribution. Points shuffle around in 2D until the neighborhood structure matches as closely as possible — things that were close in high-D end up close in 2D.

The Crowding Problem and Why Heavy Tails Fix It

This is the insight that made t-SNE work where earlier methods (like the original SNE) failed, and it took me a while to really internalize it.

In 100 dimensions, a single point can have dozens of roughly equidistant neighbors. The surface area of a hypersphere is enormous — there's room for everyone. In 2D, you can pack maybe 6 equidistant neighbors around a point before they start colliding with other clusters. Trying to preserve all those high-D neighborhoods with a Gaussian in 2D creates a traffic jam — everything smushes into the center. This is the crowding problem.

The Student's t-distribution has heavier tails than the Gaussian. Points that are moderately far apart in high-D get pushed much further apart in the 2D embedding. This creates breathing room — clusters separate, and the local neighborhoods can be preserved without everything collapsing into a blob. The heavy tails are like adding extra lanes to a highway during rush hour.
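
You can see the heavy-tail effect with two lines of arithmetic. Here's the (unnormalized) similarity each kernel assigns to a pair of points at distance d:

import numpy as np

for d in [0.5, 1.0, 2.0, 4.0, 8.0]:
    gauss = np.exp(-d ** 2)           # Gaussian kernel, unit bandwidth
    student = 1.0 / (1.0 + d ** 2)    # Student's t with 1 degree of freedom
    print(f"d={d:4.1f}   gaussian={gauss:.6f}   student-t={student:.6f}")

# The Gaussian similarity collapses to essentially zero by d=4; the t-kernel still
# assigns noticeable similarity there. To reproduce a given low similarity, the
# embedding must push moderately-dissimilar pairs much farther apart, and that is
# the breathing room that fixes crowding.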

Perplexity: The One Knob That Matters

Perplexity (typical range: 5–50) roughly controls "how many effective neighbors each point considers." Think of it as a soft version of k in k-nearest neighbors. Low perplexity (5–10) means tight local focus — you'll see fine-grained structure, but clusters might fragment into subclusters that aren't real. High perplexity (30–50) means broader context — clusters are chunkier and their relative positions are more stable, but you might miss internal structure.

Golden rule for t-SNE interpretation. Run it at perplexity 5, 30, and 50. If a cluster appears in all three, it's probably real. If it only appears at low perplexity, it might be an artifact of the algorithm splitting a continuous group.
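
In code, that habit is just a loop (this is a sketch: plot each embedding and compare by eye):

from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

embeddings = {}
for perp in [5, 30, 50]:
    tsne = TSNE(n_components=2, perplexity=perp, init='pca', random_state=42)
    embeddings[perp] = tsne.fit_transform(X)
    print(f"perplexity={perp:2d}  embedding shape: {embeddings[perp].shape}")

# Plot the three embeddings side by side, colored by whatever label you have.
# Trust only the clusters that survive all three perplexity settings.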

The Gotchas — Read This Before Interpreting Any t-SNE Plot

Cluster sizes are meaningless. The algorithm can inflate or squash clusters freely. A big blob on screen might be tighter in the original space than a tiny blob next to it.

Distances between clusters are meaningless. Two well-separated groups might be nearly identical in high-D. The gap between them tells you nothing reliable.

Only cluster existence and membership are trustworthy. "These points group together" — yes, you can believe that. "This cluster is far from that one" — no, you can't.

Never use t-SNE output as features for a model. The embedding is optimized for visual aesthetics, not predictive power. The axes have no consistent meaning. Different random seeds produce different layouts. Training a classifier on t-SNE coordinates is like training on random art.

Run it multiple times. Even with PCA initialization (which helps a lot), the results can vary. If the structure changes dramatically across runs, be skeptical of whatever you think you're seeing.

from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate='auto',   # sklearn auto-tunes since v1.2
    init='pca',             # much more stable than random init
    random_state=42,
    n_iter=1000
)
X_2d = tsne.fit_transform(X)

# Visualization code (uncomment with matplotlib):
# import matplotlib.pyplot as plt
# plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=5, alpha=0.7)
# plt.title('t-SNE of Digits Dataset')
# plt.xticks([]); plt.yticks([])  # axes are meaningless — don't label them
# plt.show()
print(f"t-SNE output shape: {X_2d.shape}")  # (1797, 2)

The Barnes-Hut approximation brings t-SNE's complexity from O(n²) to O(n log n), making it usable for up to about 100,000 points. Beyond that, the runtime becomes painful. For larger datasets, UMAP is the way to go.

UMAP: The Modern Default

UMAP (Uniform Manifold Approximation and Projection) showed up in 2018 and quickly became the tool most practitioners reach for first. It's faster than t-SNE, preserves more global structure, and — this is the big one — you can actually use its output as features for downstream models.

How It Works

UMAP's theoretical foundation involves fuzzy simplicial sets and Riemannian geometry. I won't pretend I fully understand the topology paper. But the practical algorithm is surprisingly clean:

Step 1 — Build a weighted k-nearest-neighbor graph in high-D. For each point, find its n_neighbors closest points. Assign edge weights based on distance: closer means stronger. Each point's distances are normalized relative to its own local scale, so dense regions and sparse regions are treated on equal footing.

Step 2 — Build a similar graph in low-D. Initialize points in 2D (or whatever target dimension) using spectral embedding. Compute edge weights in the low-D space.

Step 3 — Optimize the low-D layout so the two graphs match. Minimize the cross-entropy between the high-D and low-D edge weight distributions using stochastic gradient descent.

That cross-entropy objective is the key difference from t-SNE. KL divergence (t-SNE's loss) penalizes "things that should be close but aren't" — attractive forces. Cross-entropy also penalizes "things that should be far apart but aren't" — repulsive forces. This is why UMAP preserves global structure better: it actively pushes apart points that don't belong together, maintaining the large-scale relationships between clusters, not only the local neighborhoods within them.
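
Here's a toy illustration of that asymmetry on a single pair of points, with made-up similarity values p (what the high-D data says) and q (what the current layout says):

import numpy as np

def kl_term(p, q):
    # t-SNE-style contribution for one pair: only p·log(p/q)
    return p * np.log(p / q)

def ce_term(p, q):
    # UMAP-style cross-entropy adds the (1-p)·log((1-p)/(1-q)) part
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, q = 0.01, 0.9   # the pair should be far apart (p tiny) but the layout has them close (q large)
print(f"KL-style penalty:       {kl_term(p, q):+.3f}")
print(f"cross-entropy penalty:  {ce_term(p, q):+.3f}")

# The KL term barely reacts to this mistake, so t-SNE feels little pressure to fix it.
# The cross-entropy term is large; that repulsive pressure is what keeps unrelated
# points apart and preserves more of the global layout.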

The Two Knobs

n_neighbors (default 15): how many neighbors define "local." Small values (5–15) emphasize fine-grained local structure — you'll see detailed clusters, possibly with more noise. Large values (50–200) capture broader topology — the big picture of how clusters relate to each other, at the cost of losing internal detail. This is the most important parameter.

min_dist (default 0.1): how tightly points can pack in the embedding. Set to 0.0 for dense, well-separated clusters with clear boundaries. Set to 0.5 or higher if you want to see the continuous structure within clusters rather than hard edges.

# pip install umap-learn
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# 2D visualization
reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='euclidean',
    random_state=42
)
X_2d = reducer.fit_transform(X)
print(f"UMAP 2D shape: {X_2d.shape}")

# Feature engineering (higher-D) — this is what t-SNE CAN'T do
reducer_feat = umap.UMAP(n_components=10, random_state=42)
X_features = reducer_feat.fit_transform(X)
print(f"UMAP 10D feature shape: {X_features.shape}")
# Feed X_features into any classifier — these are usable features

# Supervised mode — uses labels to guide the embedding
reducer_sup = umap.UMAP(n_components=2, random_state=42)
X_sup = reducer_sup.fit_transform(X, y=y)
print(f"Supervised UMAP shape: {X_sup.shape}")
# Class boundaries become much sharper — great for exploration

That supervised mode is worth calling out. By passing labels to fit_transform, UMAP adjusts the neighbor graph to favor same-class connections. The result is an embedding where class boundaries are cleaner and the structure is more interpretable. It's not cheating — it's using available information to make a better map.

t-SNE vs UMAP — When to Use Which

Criterion | t-SNE | UMAP
Speed | O(n log n) with Barnes-Hut; slow beyond 100k points | O(n) effective; handles millions
Global structure | Inter-cluster distances meaningless | Better preserved (not perfect, but more reliable)
Use output as features | No. Never. | Yes — higher-D embeddings are stable and useful
Determinism | Varies across runs, even with same seed | More stable with fixed random_state
Supervised mode | No | Yes — pass y to fit_transform
Theoretical basis | Probability matching (KL divergence) | Topological (cross-entropy on fuzzy simplicial sets)
Best for | Publication-quality 2D plots; second opinion on clusters | Almost everything: visualization, feature engineering, exploration

The honest answer: use UMAP by default. Use t-SNE when you want a second opinion on the visualization, or when a reviewer asks for it. Run both with multiple parameter settings before drawing any biological, business, or scientific conclusions from the plots.

LDA: Supervised Dimensionality Reduction

Everything we've seen so far is unsupervised — PCA, t-SNE, and UMAP don't know about class labels. Linear Discriminant Analysis flips this: it uses labels to find the directions that best separate classes. Different goal, different math, and sometimes dramatically better results.

Fisher's Criterion

Imagine two clouds of points in 2D — class A and class B. You need to project them onto a single line (1D) in a way that makes it easy to tell the classes apart. What line do you choose?

PCA would choose the direction of maximum overall variance. But that might be a direction where the two classes overlap completely — the spread is large, but it's all within-class spread. LDA is smarter: it looks for the direction where the class means are far apart (high between-class scatter, S_B) and the points within each class are tightly packed (low within-class scatter, S_W).

Formally, LDA finds the vector w that maximizes the ratio wᵀS_Bw / wᵀS_Ww. This is a generalized eigenvalue problem, and it gives you at most C − 1 components, where C is the number of classes. Binary classification? One component. Ten digit classes? At most nine. That's a hard ceiling, and it's both LDA's strength (extreme compression) and its limitation.
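
For two classes the maximizer has a closed form worth seeing once: w ∝ S_W⁻¹(μ₁ − μ₀). A small sketch on made-up Gaussian data (all the numbers here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=200)   # class 0
X1 = rng.multivariate_normal([2, 1], [[3, 2], [2, 2]], size=200)   # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter: summed (unnormalized) covariance of each class
S_W = np.cov(X0, rowvar=False) * (len(X0) - 1) + np.cov(X1, rowvar=False) * (len(X1) - 1)

w = np.linalg.solve(S_W, mu1 - mu0)   # Fisher's discriminant direction
w /= np.linalg.norm(w)
print("Fisher direction:", np.round(w, 3))
print(f"mean 1D score, class 0: {(X0 @ w).mean():.2f}   class 1: {(X1 @ w).mean():.2f}")

# Projecting onto w separates the class means while keeping each class tight,
# which is exactly the ratio wᵀS_Bw / wᵀS_Ww that LDA maximizes.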

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# 10 digit classes → maximum 9 LDA components
lda = LinearDiscriminantAnalysis(n_components=9)
X_lda = lda.fit_transform(X, y)
print(f"LDA shape: {X_lda.shape}")  # (1797, 9)

# These 9 dimensions are OPTIMIZED for separating the 10 digit classes.
# Compare: PCA needed ~28 components for 95% variance.
# LDA often outperforms PCA for classification despite having far fewer dimensions.

LDA assumes each class follows a Gaussian distribution with equal covariance matrices. When those assumptions roughly hold — and they hold more often than you'd expect — LDA is remarkably effective. Nine dimensions vs. 28 from PCA, and the nine are laser-focused on class boundaries rather than general variance. When the assumptions fail badly (curved decision boundaries, wildly different class shapes), LDA underperforms.

Name collision warning. LDA in dimensionality reduction means Linear Discriminant Analysis (Fisher, 1936). LDA in NLP means Latent Dirichlet Allocation (Blei, 2003). Completely different algorithms that share an acronym. In the context of this chapter, LDA always means Fisher's method. If someone mentions LDA in a topic modeling conversation, they mean Blei's.

ICA: The Cocktail Party Solver

PCA finds uncorrelated directions. ICA finds statistically independent sources. The difference matters more than it sounds: uncorrelated means zero covariance (no linear relationship). Independent means knowing one signal tells you absolutely nothing about the others — not even nonlinear relationships. Independence is a much stronger condition.

The Cocktail Party Problem

Three people are talking at a party. Three microphones are scattered around the room, each recording a different mix of the three voices: mic1 = 0.7·Alice + 0.2·Bob + 0.1·Carol. You have the three mixed recordings. You don't know the mixing weights. Can you recover the three original voices?

ICA says yes, under one key assumption: the original sources must be non-Gaussian. Why? Because of the Central Limit Theorem — when you mix independent signals, the mixture tends toward Gaussian. So the most non-Gaussian direction in the mixed data corresponds to a recovered source. ICA finds directions that maximize non-Gaussianity (measured by kurtosis or negentropy), and those directions give you the original independent signals. The FastICA algorithm does this efficiently with a fixed-point iteration.

from sklearn.decomposition import FastICA
import numpy as np

# Simulate the cocktail party: 3 sources, 3 mixtures
np.random.seed(42)
t = np.linspace(0, 1, 1000)
s1 = np.sin(2 * np.pi * 5 * t)            # sine wave
s2 = np.sign(np.sin(2 * np.pi * 3 * t))   # square wave
s3 = np.random.laplace(size=1000)          # noise burst (non-Gaussian!)
S = np.c_[s1, s2, s3]

A = np.array([[1, 0.5, 0.3],
              [0.5, 1, 0.7],
              [0.3, 0.7, 1]])  # unknown mixing matrix
X = S @ A.T  # observed mixtures

# ICA recovers the sources (up to sign and permutation)
ica = FastICA(n_components=3, random_state=42)
S_recovered = ica.fit_transform(X)
print(f"Recovered sources shape: {S_recovered.shape}")  # (1000, 3)
# Columns of S_recovered ≈ original sources, possibly reordered and sign-flipped

ICA is a specialist tool. Use it for EEG/MEG signal processing (separating brain signals from eye blinks and muscle artifacts), audio source separation, or financial factor discovery. Don't use it for general dimensionality reduction — it assumes you have at least as many observed mixture signals as underlying sources, and those sources must be non-Gaussian. If those conditions aren't met, reach for something else.

NMF: Parts-Based Decomposition

Here's something that always bothered me about PCA: the components can have negative values. When you decompose a face image with PCA, the first eigenvector might have positive loadings on the nose and negative loadings on the eyes. What does a "negative eye" mean? Nothing intuitive. PCA components are holistic — they capture global patterns of variation, and the individual entries don't have clean interpretations.

Non-negative Matrix Factorization (NMF) adds a single constraint: everything must be non-negative. The original data, the basis vectors, and the coefficients are all ≥ 0. This seemingly small change has a profound effect on what the algorithm learns.

Instead of holistic eigenfaces that combine positive and negative regions, NMF discovers parts. Apply it to face images, and you get components that correspond to eyes, noses, mouths, jawlines — recognizable parts that combine additively to build a face. Apply it to text data (word-frequency matrices), and you get topics — groups of words that co-occur because they're about the same subject.

from sklearn.decomposition import NMF
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
# Digits pixel values are already non-negative (0-16 grayscale)

nmf = NMF(n_components=15, random_state=42, max_iter=500)
W = nmf.fit_transform(X)     # (1797, 15) — coefficients per sample
H = nmf.components_           # (15, 64) — the 15 basis "parts"

print(f"Coefficients shape: {W.shape}")
print(f"Components shape: {H.shape}")
print(f"Reconstruction error: {nmf.reconstruction_err_:.2f}")

# Each row of H can be reshaped into an 8×8 "part" image
# Unlike PCA eigenvectors, these parts are all non-negative —
# you can visualize them as actual image patches

NMF is the right tool when your data is naturally non-negative (pixel intensities, word counts, chemical concentrations, user-item ratings) and you want components that are interpretable by construction. You can show an NMF component to a domain expert and they'll recognize what it represents. Try that with a PCA eigenvector and you'll get blank stares.

The connection to topic modeling is direct: apply NMF to a term-document matrix (TF-IDF), and the components are topics. Each topic is a distribution over words (non-negative), and each document is a mixture of topics (also non-negative). This is one of the simplest and most effective approaches to discovering themes in a text corpus.
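
Here's a minimal sketch of that pipeline on a made-up six-document corpus (real topic models need far more text than this; the corpus and parameter choices are purely illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "the team won the match after a late goal",
    "the striker scored twice in the final match",
    "the goalkeeper saved a penalty in the match",
    "the central bank raised interest rates again",
    "inflation and interest rates worry the markets",
    "markets fell after the bank announced new rates",
]

tfidf = TfidfVectorizer(stop_words='english')
A = tfidf.fit_transform(docs)            # document-term TF-IDF matrix (non-negative)

nmf = NMF(n_components=2, random_state=42)
W = nmf.fit_transform(A)                 # document-topic weights
H = nmf.components_                      # topic-word weights

terms = tfidf.get_feature_names_out()
for k, topic in enumerate(H):
    top_words = [terms[i] for i in topic.argsort()[::-1][:4]]
    print(f"topic {k}: {', '.join(top_words)}")
# One topic should gather the football vocabulary, the other the finance vocabulary.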

Random Projections: The Brute Force Shortcut

Every method we've discussed so far learns something from the data — PCA learns the covariance structure, t-SNE learns the neighborhood probabilities, UMAP learns the nearest-neighbor graph. Learning takes time and compute. What if you don't have either?

The Johnson-Lindenstrauss lemma says something remarkable: you can project n points from high-D to O(log n / ε²) dimensions using a random matrix, and all pairwise distances are preserved within a factor of (1 ± ε). No learning. No data dependence. You generate a random matrix, multiply, and you're done.
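
scikit-learn even ships a helper that evaluates the lemma's bound for you. Note that the answer depends only on the number of points and the tolerance ε, not on the original number of features:

from sklearn.random_projection import johnson_lindenstrauss_min_dim

for n_samples in [1_000, 100_000, 10_000_000]:
    k = johnson_lindenstrauss_min_dim(n_samples=n_samples, eps=0.1)
    print(f"n = {n_samples:>10,}  ->  {k} dimensions preserve pairwise distances within ±10%")

# The bound is conservative (in practice you can often get away with fewer
# dimensions than it suggests), but it's a useful order-of-magnitude guide.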

from sklearn.random_projection import GaussianRandomProjection
from sklearn.datasets import load_digits
import numpy as np

X, y = load_digits(return_X_y=True)

# Project from 64D to ~25D using a random matrix
rp = GaussianRandomProjection(n_components=25, random_state=42)
X_rp = rp.fit_transform(X)
print(f"Random projection shape: {X_rp.shape}")  # (1797, 25)

# Check distance preservation
from sklearn.metrics import pairwise_distances
D_orig = pairwise_distances(X[:100])
D_proj = pairwise_distances(X_rp[:100])
ratios = D_proj / (D_orig + 1e-10)
print(f"Distance ratio — mean: {ratios[ratios > 0].mean():.3f}, "
      f"std: {ratios[ratios > 0].std():.3f}")
# Ratios should be close to 1.0 — distances are approximately preserved

Random projection is the fastest possible dimensionality reduction. It's O(n·d·k) — one matrix multiplication. No iteration, no convergence, no hyperparameters beyond the target dimension. It doesn't find the best projection — PCA will always do better for the same number of components — but when you have a million samples with ten thousand features and you need to reduce dimensions in seconds, random projection gets you 90% of the way there at a fraction of the cost.

I think of it as the "good enough" option. When speed matters more than optimality, or when you need a quick baseline before investing in something fancier, random projection is your friend.

Choosing a Method — The Decision Framework

I wish I could give you a formula. I can't. But I can give you a decision table that matches what I've seen work in practice, followed by some meta-advice that's more honest than any table.

Your Goal | First Choice | Second Choice | Avoid
Feature compression before a model | PCA (95% variance) | LDA (if you have labels) | t-SNE
2D visualization | UMAP | t-SNE (second opinion) | PCA to 2D (usually loses too much)
Features from nonlinear structure | UMAP (10–50D) | Autoencoder | t-SNE (non-reproducible, axes meaningless)
Signal separation (EEG, audio) | ICA | NMF (for non-negative data) | PCA (finds uncorrelated, not independent)
Speed on massive data | Random projection | Randomized PCA | Kernel PCA, Isomap (don't scale)
Interpretable components | NMF (non-negative data) | Sparse PCA | t-SNE / UMAP (opaque embeddings)
Supervised, classification preprocessing | LDA | Supervised UMAP | Unsupervised methods (ignoring labels you have)
Topic discovery in text | NMF on TF-IDF | LDA (Blei's, in NLP sense) | PCA (negative loadings on words make no sense)

And here's the meta-advice, which I think matters more than the table: evaluate by downstream performance. Dimensionality reduction is a preprocessing step, not a final answer. Run PCA → train model → measure metric. Run UMAP → train model → measure metric. Whichever wins on the metric you care about, wins. The mathematical elegance of the decomposition is irrelevant if it doesn't help your task.

I usually try 2–3 methods and compare. There's no formula for knowing in advance which one will work best on your data. The table gives you a starting point. The downstream metric gives you the answer.

One more thing nobody tells you: sometimes dimensionality reduction makes things worse. If your features are already clean, relatively low-dimensional, and informative, compressing them can only lose signal. Always compare against the "no reduction" baseline. If it wins, let it win. There's no prize for using a fancier pipeline.

Putting It All Together

"""
Full workflow: compare PCA, LDA, UMAP, and no-reduction head-to-head.
"""
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

# --- Baseline: all 64 features ---
pipe_full = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', GradientBoostingClassifier(n_estimators=100, random_state=42))
])
score_full = cross_val_score(pipe_full, X, y, cv=5).mean()

# --- PCA: keep 95% variance ---
pipe_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('clf', GradientBoostingClassifier(n_estimators=100, random_state=42))
])
score_pca = cross_val_score(pipe_pca, X, y, cv=5).mean()

# --- LDA: max 9 components for 10 classes ---
pipe_lda = Pipeline([
    ('scaler', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis(n_components=9)),
    ('clf', GradientBoostingClassifier(n_estimators=100, random_state=42))
])
score_lda = cross_val_score(pipe_lda, X, y, cv=5).mean()

print(f"Full features (64D): {score_full:.3f}")
print(f"PCA (~28D):          {score_pca:.3f}")
print(f"LDA (9D):            {score_lda:.3f}")
# LDA often wins — 9 dimensions PURPOSE-BUILT to separate digits
# vs. 28 PCA dimensions that maximize variance (not separation)

Wrap-Up

Thank you for staying through all of this. We covered a lot of ground.

We started with the curse of dimensionality — the unsettling reality that high-dimensional spaces make distances meaningless, models brittle, and visualization impossible. We saw why reduction is even possible (the manifold hypothesis: real data lives on lower-dimensional surfaces). Then we built PCA from scratch, from the covariance matrix through eigendecomposition to the SVD that actually runs in practice, and learned how to use it properly inside a pipeline.

After the rest stop, we confronted PCA's blind spot — linearity — and met the tools designed for curved manifolds: t-SNE for visualization (with all its caveats about uninterpretable distances and cluster sizes), UMAP as the modern default that does almost everything well. Then the specialists: LDA for supervised compression, ICA for blind source separation, NMF for interpretable parts-based decomposition, and random projections for when speed is all that matters.

I'm still learning the nuances of when to apply which tool. I don't think that uncertainty ever fully goes away — the best practitioners I know still try multiple methods and let the downstream metric decide. But the vocabulary and intuition you've built here means you're equipped to make those experiments intelligently.

Next time you see a 500-feature dataset, instead of dreading the dimensionality, you'll know exactly which tool to reach for — and more importantly, why.

Resources

A Tutorial on Principal Component Analysis — Jonathon Shlens (2014). The clearest mathematical treatment of PCA I've found. Short, builds from scratch, no unnecessary abstraction. If you want the proofs behind what we covered, start here.

How to Use t-SNE Effectively — Wattenberg, Viégas, and Johnson (Distill, 2016). An interactive article that lets you play with perplexity and watch t-SNE break in real time. After reading this, you'll never over-interpret a t-SNE plot again. The interactive examples are worth more than a dozen paragraphs of explanation.

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction — McInnes, Healy, and Melville (2018). The original paper. The topology sections are dense, but sections 2–4 on the practical algorithm are readable and surprisingly well-written for an ML paper.

Scikit-learn Decomposition Module Documentation. Not glamorous, but endlessly practical. Every method we covered has a page with clear examples, parameter descriptions, and links to the underlying papers. I keep this bookmarked and refer to it more often than I'd like to admit.

The Elements of Statistical Learning — Hastie, Tibshirani, and Friedman. Chapter 14 covers unsupervised learning including PCA, ICA, and various nonlinear methods. Dense, mathematical, and comprehensive. Not where you start, but where you go when you want to really understand what's happening under the hood.

What You Should Now Be Able To Do