Linear Algebra

Chapter 2: Mathematical Foundations
TL;DR

Linear algebra is the operating system of machine learning. Every neural network forward pass is matrix multiplication. PCA, SVD, and eigendecomposition are the workhorses behind dimensionality reduction, recommender systems, and feature extraction. If you understand how matrices transform space — stretching, rotating, projecting — you understand what ML models actually do to your data. We build from single numbers to graph Laplacians using one running example: a tiny movie recommender. Two recurring analogies — a telescope and a shadow — will carry you from vectors through SVD and beyond, with NumPy code you can run today.

I'll be honest: I avoided linear algebra for years. I could use scikit-learn's PCA. I could call np.linalg.svd and get numbers out. But if you had asked me what those numbers meant geometrically — what a singular value actually is, or why eigendecomposition makes PCA work — I would have changed the subject. I had this background hum of anxiety that one day someone would ask me a question in an interview and I'd be exposed as the person who shipped production ML without understanding the math underneath.

Linear algebra is the language machines use to talk about data. Every vector is a data point. Every matrix is a transformation. Every decomposition is a way of looking at that transformation from a better angle. It shows up everywhere: neural network forward passes are matrix multiplication, linear regression is a geometric projection, recommender systems are matrix factorization, and the attention mechanism in transformers is a sequence of matrix products with a softmax sandwiched in. If you understand what matrices do to space, every algorithm you encounter will make more sense.

Before we start: I'm assuming you're comfortable with basic Python and NumPy. If you can create arrays, index into them, and call functions from np.linalg, you have everything you need. We won't need calculus until the optimization chapter. Everything here is algebra and geometry.

We're going to build from the smallest possible object — a single number — all the way to decompositions that power production ML at scale. And we'll do it using one running example that threads through every section: a tiny movie recommender with three users and three movies. By the end, you'll watch SVD predict a missing rating, and the entire journey from scalars to singular values will click into place. Let's go.

What We'll Cover

  Vectors: Where Everything Starts
  Matrices: The Language of Transformation
  What Matrices Actually Do to Space
  Systems of Equations and the Least Squares Trick
  Vector Spaces: The Stage Where Data Lives
  Eigenvalues and Eigenvectors: The Directions That Matter
  PCA: Finding the Best Angle
  Rest Stop
  SVD: The Swiss Army Knife
  Matrix Decompositions: The Right Tool
  Condition Number: Why Your Model Produces NaN
  Covariance, Correlation, and Mahalanobis
  Graph Laplacian and Spectral Methods
  NMF: When You Need Parts, Not Wholes
  Wrap-Up
  Resources

Vectors: Where Everything Starts

The Data Hierarchy: Scalars, Vectors, Matrices, Tensors

Let's start at the very bottom. A scalar is a single number. One movie rating: 4 stars. One pixel intensity. One weight in your neural network. That's a scalar.

A vector is an ordered list of scalars. Suppose we have three movies — Alien, Titanic, and Toy Story — and user Alice rates them [5, 2, 4]. That list of three numbers is Alice's rating vector. It lives in three-dimensional space, one axis per movie. Each user gets a vector. Each movie gets a vector. This is our running example, and we'll keep building on it.

A matrix is a grid of numbers — stack multiple vectors together and you get one. If Alice, Bob, and Carol each rate three movies, we get a 3×3 matrix. Each row is a user. Each column is a movie. The entire dataset in one object. A tensor generalizes this to more dimensions. A color image is a 3D tensor (height × width × 3 color channels). A batch of 32 images is a 4D tensor. In code, all of these are NumPy arrays with different numbers of dimensions.

import numpy as np

# Our running example: 3 users × 3 movies
# Movies: Alien, Titanic, Toy Story
alice = np.array([5, 2, 4])   # loves sci-fi and animation
bob   = np.array([4, 1, 5])   # loves sci-fi and animation
carol = np.array([1, 5, 2])   # loves romance

# Stack into a matrix: each row is a user
R = np.array([alice, bob, carol])
print(R)
# [[5 2 4]
#  [4 1 5]
#  [1 5 2]]

Notice something already. Alice and Bob have similar taste — both rate Alien high, Titanic low. Carol is the opposite. We haven't done any math yet, and the structure is visible. Linear algebra gives us tools to make this structure precise.

The Dot Product: Measuring Similarity

The dot product of two vectors multiplies corresponding elements and sums the results. It's the most important single operation in all of ML. Here's why: the dot product measures how much two vectors "agree." If two vectors point in the same direction, their dot product is large and positive. If they're perpendicular — no relationship — it's zero. If they point in opposite directions, it's large and negative.

Let's compute the dot product of Alice and Bob's rating vectors, then Alice and Carol's.

# Dot product: element-wise multiply and sum
alice_bob = np.dot(alice, bob)    # 5*4 + 2*1 + 4*5 = 42
alice_carol = np.dot(alice, carol) # 5*1 + 2*5 + 4*2 = 23

print(f"Alice · Bob:   {alice_bob}")    # 42 — high agreement
print(f"Alice · Carol: {alice_carol}")  # 23 — less agreement

Alice and Bob agree more than Alice and Carol. The dot product caught it. But there's a problem: if Alice had rated everything on a 1-to-10 scale instead of 1-to-5, all her dot products would double, even though her preferences haven't changed. The magnitude of the vectors is contaminating our similarity measure.

That's where cosine similarity comes in. It divides the dot product by the product of the vectors' magnitudes: cos(θ) = (a · b) / (||a|| × ||b||). This strips away magnitude and leaves pure directional agreement, a number between −1 and +1. Word embeddings like Word2Vec use cosine similarity to find that "king" and "queen" point in similar directions in embedding space, even though their vectors might have different lengths.

# Cosine similarity removes magnitude effects
from numpy.linalg import norm

cos_ab = np.dot(alice, bob) / (norm(alice) * norm(bob))
cos_ac = np.dot(alice, carol) / (norm(alice) * norm(carol))

print(f"cos(Alice, Bob):   {cos_ab:.4f}")   # ~0.937 — very similar
print(f"cos(Alice, Carol): {cos_ac:.4f}")    # ~0.610 — less similar

Norms: Measuring Size and the Geometry of Regularization

A norm measures the "size" of a vector. The L2 norm (Euclidean norm) is the straight-line distance from the origin: ||v||₂ = √(v₁² + v₂² + ... + vₙ²). The L1 norm (Manhattan distance) sums the absolute values: ||v||₁ = |v₁| + |v₂| + ... + |vₙ|. The L∞ norm is the largest absolute component.

Why do norms matter? Regularization. This is where the geometry gets interesting and interview-relevant. L2 regularization (Ridge) penalizes the L2 norm of your weight vector. Picture the set of all vectors with L2 norm ≤ 1 — it's a smooth circle (in 2D) or sphere (in higher dimensions). The optimizer can slide smoothly along this boundary, so weights get small but rarely hit zero. L1 regularization (Lasso) penalizes the L1 norm. The L1 "ball" is a diamond with sharp corners sitting right on the axes. When the optimizer bumps into a corner, some weights land exactly at zero. That's why L1 gives you sparsity — feature selection for free — and L2 doesn't.

w = np.array([0.5, -1.2, 0.01, 3.0])

l1 = np.linalg.norm(w, ord=1)       # 4.71 — sum of absolute values
l2 = np.linalg.norm(w, ord=2)       # 3.27 — Euclidean length
linf = np.linalg.norm(w, ord=np.inf) # 3.0 — largest component

print(f"L1: {l1:.2f}, L2: {l2:.2f}, L∞: {linf:.2f}")

A unit vector has L2 norm of exactly 1. You create one by dividing a vector by its norm — this operation is called normalization. Batch normalization and layer normalization in deep learning are more sophisticated versions of this same idea: keep magnitudes controlled so signals don't explode or vanish as they flow through the network.
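
Here's that normalization step as a minimal sketch, reusing Alice's rating vector from earlier (re-declared so the snippet stands alone):

import numpy as np

alice = np.array([5.0, 2.0, 4.0])
alice_unit = alice / np.linalg.norm(alice)

print(alice_unit.round(3))          # same direction as before
print(np.linalg.norm(alice_unit))   # length is now exactly 1.0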

Orthogonality

Two vectors are orthogonal if their dot product is zero — they're at right angles, contributing nothing to each other. Orthogonal features (once centered) are statistically uncorrelated, which is the ideal situation for ML. PCA explicitly produces orthogonal components. The angle between any two vectors comes from the dot product: θ = arccos((a · b) / (||a|| × ||b||)).
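
Here's a quick sketch of both ideas — the angle formula and an orthogonality check — using the rating vectors from our running example (re-created so the snippet stands alone):

import numpy as np
from numpy.linalg import norm

alice = np.array([5, 2, 4])
carol = np.array([1, 5, 2])

# Angle between Alice and Carol via the arccos formula
cos_theta = np.dot(alice, carol) / (norm(alice) * norm(carol))
print(f"Angle(Alice, Carol): {np.degrees(np.arccos(cos_theta)):.1f} degrees")

# Orthogonal vectors: dot product is zero, angle is exactly 90 degrees
u = np.array([1, 0, 0])
v = np.array([0, 3, 0])
print(f"u · v = {np.dot(u, v)}")  # 0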

Limitation: Where Single Vectors Fall Short

We can now measure how similar two users are. But a single vector can't capture the relationships between all users and all movies simultaneously. Alice's vector tells us her preferences, but it doesn't tell us which movies relate to each other, or which groups of users share hidden taste dimensions. For that, we need to go from individual vectors to the grid that holds them all. We need matrices.

Matrices: The Language of Transformation

A Matrix Is a Grid — and a Function

A matrix is a rectangular grid of numbers with m rows and n columns. Our ratings table is a 3×3 matrix. In ML, your dataset is typically an m×n matrix where m is the number of samples and n is the number of features. But here's the deeper idea: a matrix is also a function. When you multiply a matrix by a vector, you get a new vector. The matrix transforms the input into the output. Every matrix is a machine that eats vectors and produces vectors.

# Our ratings matrix R is both a table and a transformation
R = np.array([
    [5, 2, 4],  # Alice
    [4, 1, 5],  # Bob
    [1, 5, 2],  # Carol
])

# What does multiplying R by a vector mean?
# If v represents "movie qualities," then R @ v gives "user scores"
movie_quality = np.array([1, 0, 0])  # only Alien matters
user_scores = R @ movie_quality
print(user_scores)  # [5, 4, 1] — each user's Alien rating

Matrix Multiplication: THE Operation

Matrix multiplication is the single most important operation in ML. It is not element-wise. To multiply an m×n matrix A by an n×p matrix B, you take the dot product of each row of A with each column of B, producing an m×p result. The inner dimensions must match.

A neural network's fully connected layer is one matrix multiplication and one vector addition: y = Wx + b. Here x is your input (n×1), W is the weight matrix (k×n), and b is the bias (k×1). That's it. Stack several of these with nonlinearities between them, and you have a deep neural network. The entire forward pass of a transformer is a carefully orchestrated sequence of matrix multiplications.

# A single neural network layer: y = Wx + b
W = np.random.randn(3, 4)                    # 4 inputs → 3 outputs
x = np.array([1.0, 0.5, -1.0, 2.0])         # input vector
b = np.array([0.1, -0.1, 0.0])              # bias

y = W @ x + b   # @ is matrix multiplication in Python 3.5+
print(f"Layer output: {y}")

# Batch: process 100 samples at once
X = np.random.randn(100, 4)
Y = X @ W.T + b   # (100, 3) — all samples transformed simultaneously
print(f"Batch output shape: {Y.shape}")

Transpose, Identity, and Special Matrices

The transpose flips a matrix across its diagonal: rows become columns, columns become rows. An m×n matrix becomes n×m. In NumPy: A.T. You'll see transposes everywhere — the least squares solution that falls out of the normal equation is x̂ = (AᵀA)⁻¹Aᵀb.

The identity matrix I has ones on the diagonal and zeros elsewhere. Multiplying by I changes nothing — it's the matrix equivalent of multiplying by 1. Symmetric matrices equal their own transpose: A = Aᵀ. Covariance matrices are always symmetric, which gives them beautiful decomposition properties we'll exploit in PCA.

An orthogonal matrix Q has columns that are orthonormal unit vectors. Its inverse equals its transpose: Q⁻¹ = Qᵀ. Orthogonal matrices represent rotations and reflections — they preserve lengths and angles. They're the building blocks of SVD.
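
A quick numerical check of that Q⁻¹ = Qᵀ property, using a 45° rotation matrix (just a sketch — any rotation would do):

import numpy as np

theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))        # True — the transpose undoes Q
print(np.allclose(np.linalg.inv(Q), Q.T))     # True — inverse equals transpose

# Rotations preserve lengths
v = np.array([3.0, 4.0])
print(np.linalg.norm(Q @ v), np.linalg.norm(v))  # both 5.0 (up to float error)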

Limitation: We Have the Grid, But What Does It Do?

We now have our ratings table as a matrix and we can multiply it by things. But we haven't answered the deeper question: what does multiplication by this matrix do to space? Stretch it? Rotate it? Flatten it? Understanding the geometry of matrix transformations is what separates someone who can call APIs from someone who understands what the APIs are doing.

What Matrices Actually Do to Space

Stretching, Rotating, Projecting

Think of a matrix as a machine that grabs the fabric of space and deforms it. Some matrices stretch space — they pull some directions farther from the origin than others. Some rotate — they spin the entire space around. Some project — they flatten a 3D object onto a 2D surface, like casting a shadow of a statue onto a wall. (Remember this shadow analogy. We'll need it when we get to regression and PCA.)

Here's a concrete example. Take the 2D matrix [[2, 0], [0, 0.5]]. It stretches the x-axis by 2 and squishes the y-axis to half. Apply it to a unit circle, and you get an ellipse — wider than it is tall. A rotation matrix like [[cos θ, -sin θ], [sin θ, cos θ]] spins every point by angle θ without changing distances. A projection matrix collapses a dimension entirely.

import numpy as np

# A stretching matrix
S = np.array([[2, 0], [0, 0.5]])

# Apply to a vector
v = np.array([1, 1])
print(S @ v)  # [2, 0.5] — stretched horizontally, squished vertically

# A rotation matrix (45 degrees)
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
print(rot @ v)  # [0, 1.414] — rotated 45° counterclockwise

# A projection matrix (onto x-axis)
P = np.array([[1, 0], [0, 0]])
print(P @ v)  # [1, 0] — shadow on x-axis

The Determinant: How Much Space Gets Scaled

The determinant of a square matrix is a single number that tells you the scaling factor of the transformation. For a 2×2 matrix [[a, b], [c, d]], it's ad - bc. The absolute value tells you how much areas (2D) or volumes (3D) change. A determinant of 2 means the matrix doubles all areas. A determinant of −1 means it flips orientation (like looking in a mirror) while preserving area.

A determinant of zero is the important one. It means the matrix squishes space down to a lower dimension — a 2D region collapses to a line, or a line collapses to a point. If space has been flattened, you can't unflatten it. That's why a zero determinant means the matrix is not invertible: you can't undo a squish.

A = np.array([[3, 1], [2, 4]])
print(f"det(A) = {np.linalg.det(A):.1f}")  # 10.0 — scales area by 10x

# Singular matrix: row 2 is 2× row 1
B = np.array([[1, 2], [2, 4]])
print(f"det(B) = {np.linalg.det(B):.1f}")  # 0.0 — squishes to a line

In probability, the determinant of a covariance matrix appears in the normalization constant of a multivariate Gaussian. Geometrically, it measures the "volume" of the uncertainty ellipsoid.

Never Invert a Matrix in Production Code

If you need to solve Ax = b, do NOT compute A⁻¹ and multiply. Use np.linalg.solve(A, b). It uses LU decomposition under the hood and is both faster and more numerically stable. Matrix inversion is O(n³) and amplifies floating-point errors. I've seen production systems output garbage because someone used np.linalg.inv on a nearly singular matrix. Don't be that person.
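
To make that advice concrete, here's a small comparison on a classically ill-conditioned system — the 8×8 Hilbert matrix, built by hand so the snippet stands alone. The exact error figures will vary, but solve is never less accurate than forming the inverse, and it does less work:

import numpy as np

# Hilbert matrix: H[i, j] = 1 / (i + j + 1), famously ill-conditioned
n = 8
A = 1.0 / (np.arange(1, n + 1)[:, None] + np.arange(n))
x_true = np.ones(n)
b = A @ x_true

x_solve = np.linalg.solve(A, b)    # LU-based: the recommended route
x_inv = np.linalg.inv(A) @ b       # explicit inverse: avoid in production

print(f"Condition number: {np.linalg.cond(A):.1e}")
print(f"solve error: {np.linalg.norm(x_solve - x_true):.1e}")
print(f"inv error:   {np.linalg.norm(x_inv - x_true):.1e}")
# Neither recovers x_true perfectly at this conditioning, but solve's error
# is reliably as small or smaller, and it's faster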

Positive Definite Matrices

A symmetric matrix is positive definite if xᵀAx > 0 for every nonzero vector x. It is positive semi-definite (PSD) if xᵀAx ≥ 0. Geometrically, a positive definite matrix transforms space in a way that never flips or collapses any direction — it's a "pure stretch." Covariance matrices are always PSD. Kernel matrices (Gram matrices) must be PSD for a kernel function to be valid. If your Hessian matrix is positive definite at a point, that point is a local minimum — the foundation of second-order optimization.
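
In practice, the easiest way to test positive definiteness is to check the eigenvalues (all strictly positive) or simply attempt a Cholesky factorization, which fails for anything that isn't PD. A minimal sketch of both checks:

import numpy as np

def is_positive_definite(A):
    """Cholesky succeeds exactly when a symmetric matrix is positive definite."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

good = np.array([[2.0, 0.5], [0.5, 1.0]])  # PD: both eigenvalues positive
bad  = np.array([[1.0, 2.0], [2.0, 1.0]])  # indefinite: one negative eigenvalue

print(np.linalg.eigvalsh(good))  # all > 0
print(np.linalg.eigvalsh(bad))   # contains a negative value
print(is_positive_definite(good), is_positive_definite(bad))  # True False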

Systems of Equations and the Least Squares Trick

Ax = b: The Setup

Any system of linear equations can be written as Ax = b. You have a coefficient matrix A, an unknown vector x, and a right-hand side b. In linear regression, the rows of A are your data points, b is the vector of observed outputs, and x is the weight vector you're trying to find.

If A is square and invertible, there's one exact solution. If there are more equations than unknowns (the usual case in ML — thousands of data points, a handful of features), the system is overdetermined and no exact solution exists. We need the best approximate solution.

Least Squares: The Shadow of Reality

The least squares solution is the x̂ that minimizes ||Ax - b||² — the total squared error. Here's where the shadow analogy pays off. Imagine b is a point in high-dimensional space, and the column space of A is a flat surface (subspace) within that space. The least squares solution Ax̂ is the point on that surface closest to b. Geometrically, it's the projection — the shadow of b cast straight down onto the column space.

The residual b - Ax̂ is perpendicular to the column space. Setting up this perpendicularity condition gives us the normal equation: AᵀAx̂ = Aᵀb. "Normal" here means perpendicular. The equation guarantees the error is orthogonal to every feature — the model has extracted everything it can from the data.

import numpy as np

# 3 users rate a "sci-fi affinity" feature and "romance affinity" feature
# We want to predict their overall satisfaction score
A = np.array([
    [5, 2],   # Alice: high sci-fi, low romance
    [4, 1],   # Bob: high sci-fi, very low romance
    [1, 5],   # Carol: low sci-fi, high romance
])
b = np.array([9, 8, 6])  # satisfaction scores

# Least squares via normal equation
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(f"Weights: {x_hat.round(3)}")

# Predictions = shadow of b onto column space of A
predictions = A @ x_hat
residual = b - predictions
print(f"Predictions: {predictions.round(2)}")
print(f"Residual: {residual.round(2)}")

# Verify: residual is orthogonal to column space
print(f"A^T @ residual ≈ 0: {(A.T @ residual).round(10)}")

This is one of the most beautiful results in ML. Linear regression isn't curve fitting. It's a geometric projection. The predicted values are the shadow of reality cast onto the subspace of possible predictions. Every time you fit a linear model, this is what's happening under the hood.

Why "Normal" Equation?

"Normal" means perpendicular. The normal equation ensures the residual vector is perpendicular (normal) to the column space of A. This is the same as saying the error is orthogonal to all features — the model has extracted everything it can from the data. The equation exists because perpendicularity gives us the closest point.

Vector Spaces: The Stage Where Data Lives

Span, Basis, Dimension

A linear combination of vectors is any weighted sum: c₁v₁ + c₂v₂ + ... + cₙvₙ. The span of a set of vectors is every point you can reach with linear combinations. If vectors span all of ℝⁿ, you can reach any point in n-dimensional space. Vectors are linearly independent if none can be written as a combination of the others — they each contribute a new direction.

A basis is a minimal set of linearly independent vectors that spans the space. The number of vectors in any basis is the dimension. ℝ³ has dimension 3. Here's the key insight for ML: if your data lives in a 100-dimensional space but only varies along 5 directions, the effective dimension is 5. The other 95 dimensions are noise or redundancy. PCA finds exactly this — the basis for the subspace where your data actually lives.

Column Space, Null Space, Rank

The column space of a matrix is the span of its columns — it's all possible outputs Ax. When you fit a linear model, your predictions live in the column space. The null space is all vectors x such that Ax = 0 — inputs the matrix kills entirely. If the null space contains anything besides the zero vector, the matrix loses information.

The rank is the dimension of the column space — the number of linearly independent columns. Rank tells you how much information the matrix carries. A rank-deficient matrix has redundant columns. In ML, a low-rank data matrix means your features are correlated and can be compressed. The rank-nullity theorem ties it together: rank + nullity = number of columns. Information preserved + information destroyed = total information.

Back to our movie example. Our 3×3 ratings matrix has rank at most 3. But if Alice and Bob's taste vectors happen to be proportional, the effective rank drops to 2 — there are really two taste dimensions, not three. Low-rank structure in ratings matrices is exactly what recommender systems exploit.
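
You can check this directly with np.linalg.matrix_rank. A small sketch: our actual ratings matrix has full rank, but if Bob's tastes were an exact multiple of Alice's, the rank would drop to 2:

import numpy as np

R = np.array([
    [5, 2, 4],  # Alice
    [4, 1, 5],  # Bob
    [1, 5, 2],  # Carol
], dtype=float)
print(np.linalg.matrix_rank(R))  # 3 — three independent taste directions

# Replace Bob's row with a scaled copy of Alice's
R_dependent = R.copy()
R_dependent[1] = 2 * R_dependent[0]
print(np.linalg.matrix_rank(R_dependent))  # 2 — one direction is redundant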

Limitation: We Know the Stage. Now We Need to Find the Best Directions on It.

We can talk about spaces and dimensions, but we haven't identified which directions matter most. If your data sprawls across 100 dimensions but most of the action happens along 3 of them, which 3? That's where eigenvalues and eigenvectors enter.

Eigenvalues and Eigenvectors: The Directions That Matter

The Key Idea

We know matrices transform space. They stretch, rotate, project. But which directions survive the transformation most intact? Which directions does the matrix treat as special?

An eigenvector of a matrix A is a nonzero vector v that, when multiplied by A, only gets scaled — not rotated. The scaling factor is the eigenvalue λ. In equation form: Av = λv. Most vectors change direction when multiplied by a matrix. Eigenvectors are the rare ones that keep pointing the same way. They're the natural axes of the transformation.

Think of it like a telescope. You're looking at a galaxy, and from most angles it looks like a blurry mess. But there are a few special angles where the structure snaps into focus — where you can see the spiral arms, the core, the outer halo. Eigenvectors are those angles. They're the directions from which the transformation is clearest. (Remember this telescope analogy. It'll come back in PCA and SVD.)

Why Eigenvectors Matter for ML

Consider a covariance matrix. Its eigenvectors point in the directions of maximum variance in your data. Its eigenvalues tell you how much variance exists along each direction. The eigenvector with the largest eigenvalue is the direction along which your data spreads the most. This is PCA in a nutshell: find the eigenvectors of the covariance matrix, keep the ones with the biggest eigenvalues, and project your data onto them.

import numpy as np

# Covariance-like symmetric matrix
A = np.array([[4, 2],
              [2, 3]])

# Use eigh for symmetric matrices — faster, more stable, sorted
eigenvalues, eigenvectors = np.linalg.eigh(A)
print(f"Eigenvalues: {eigenvalues.round(3)}")
print(f"Eigenvectors:\n{eigenvectors.round(3)}")

# Verify: Av = λv for each pair
for i in range(len(eigenvalues)):
    Av = A @ eigenvectors[:, i]
    lv = eigenvalues[i] * eigenvectors[:, i]
    print(f"Av = {Av.round(4)}, λv = {lv.round(4)}, match: {np.allclose(Av, lv)}")

Computation and the Spectral Theorem

To find eigenvalues analytically, you solve det(A - λI) = 0, the characteristic equation. For an n×n matrix, this gives a degree-n polynomial. Nobody solves polynomials for matrices bigger than 3×3. Numerical algorithms like QR iteration handle it — that's what np.linalg.eig relies on under the hood.

The spectral theorem says every real symmetric matrix can be decomposed as A = QΛQᵀ, where Q is orthogonal (columns are eigenvectors) and Λ is diagonal (eigenvalues on the diagonal). All eigenvalues are real. All eigenvectors are orthogonal. This is why PCA works so cleanly — covariance matrices are symmetric, so their eigenvectors form a nice orthogonal coordinate system. Always use np.linalg.eigh (the "h" stands for Hermitian/symmetric) instead of eig for symmetric matrices — it's faster and more numerically stable.

A matrix with n linearly independent eigenvectors can be diagonalized: A = PDP⁻¹. Powers become trivial: Aⁿ = PDⁿP⁻¹. This is how Markov chain steady-state analysis works — raise the transition matrix to a large power by diagonalizing it.
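
Here's a tiny illustration of that idea with a made-up two-state weather Markov chain (the transition probabilities are invented for the example). Diagonalize once, and any power of the matrix is just a power of the diagonal:

import numpy as np

# Column-stochastic transition matrix: columns sum to 1 (sunny, rainy)
P = np.array([[0.9, 0.5],
              [0.1, 0.5]])

# Diagonalize: P = V D V⁻¹
eigvals, V = np.linalg.eig(P)
V_inv = np.linalg.inv(V)

# P^50 = V D^50 V⁻¹ — raise the eigenvalues, not the matrix
P_50 = V @ np.diag(eigvals ** 50) @ V_inv
print(P_50.round(4))
# Every column has converged to the steady state (~[0.833, 0.167])

# Sanity check against brute-force repeated multiplication
print(np.allclose(P_50, np.linalg.matrix_power(P, 50)))  # True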

Limitation

Eigendecomposition requires square matrices. Our ratings matrix is often rectangular (more users than movies, or vice versa). We need a decomposition that works for any matrix shape. That's SVD — but first, let's use eigendecomposition for what it does best: PCA.

PCA: Finding the Best Angle

The Telescope, Calibrated

Remember the telescope analogy? PCA is the process of calibrating that telescope. You have data scattered across many dimensions, and most of those dimensions contribute noise. PCA rotates your coordinate system until the axes align with the directions of maximum variance — the directions where the data actually spreads out. Then you drop the axes that don't matter.

I'm going to walk through PCA step by step using our movie recommender data, because I think seeing the mechanics with familiar data makes everything click.

Step-by-Step with the Movie Recommender

Step 1: Center the data. Subtract the mean of each feature (movie) so that the data is centered at the origin. PCA's covariance formula assumes centered data.

Step 2: Compute the covariance matrix. This tells us how each pair of movies co-varies across users. A large positive entry means "when one movie gets rated high, so does the other."

Step 3: Eigendecompose the covariance matrix. The eigenvectors are the principal components — the telescope angles. The eigenvalues tell us how much variance each direction captures.

Step 4: Project the data onto the top eigenvectors. This is the dimensionality reduction — we keep the interesting directions and drop the rest.

import numpy as np

# Our running example: 3 users × 3 movies
R = np.array([
    [5, 2, 4],  # Alice
    [4, 1, 5],  # Bob
    [1, 5, 2],  # Carol
], dtype=float)

# Step 1: Center (subtract column means)
means = R.mean(axis=0)
R_centered = R - means
print(f"Means: {means.round(2)}")
print(f"Centered:\n{R_centered.round(2)}")

# Step 2: Covariance matrix (3×3, one entry per movie pair)
cov = (R_centered.T @ R_centered) / (R.shape[0] - 1)
print(f"Covariance matrix:\n{cov.round(2)}")

# Step 3: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# eigh returns sorted ascending; we want descending
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print(f"Eigenvalues: {eigenvalues.round(3)}")
print(f"Explained variance ratio: {(eigenvalues / eigenvalues.sum()).round(3)}")

# Step 4: Project onto top 2 components
k = 2
W = eigenvectors[:, :k]  # top-k eigenvectors
R_projected = R_centered @ W
print(f"Projected data (3 users in 2D):\n{R_projected.round(3)}")

The first eigenvalue captures most of the variance — it's the "taste dimension" that separates sci-fi lovers from romance lovers. The second captures what's left. The third is negligible. We've compressed three dimensions into two without losing much. In production, you'd go from 768 dimensions to 50 using the same logic.

Scree Plot and Choosing k

A scree plot graphs eigenvalues in descending order. You look for the "elbow" — where eigenvalues drop sharply. Components before the elbow capture signal; those after capture noise. The explained variance ratio of each component is its eigenvalue divided by the sum of all eigenvalues. A common rule of thumb: keep enough components to explain 95% of the variance.

PCA can be derived two equivalent ways. The variance maximization view says: find the direction that maximizes projected variance. That's the first eigenvector. The reconstruction error view says: find the k-dimensional subspace that minimizes the average squared distance between points and their projections. Both produce the same answer — a satisfying duality that tells you the math is self-consistent.
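
You can check this duality numerically: the variance of the data projected onto the top eigenvector equals the top eigenvalue, and the squared reconstruction error from keeping only that direction equals the sum of the remaining eigenvalues. A small sketch on synthetic correlated data (invented for the demonstration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2, 0, 0], [1, 1, 0], [0, 0.5, 0.3]])
Xc = X - X.mean(axis=0)
n = len(Xc)

cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order

w1 = eigvecs[:, 0]                      # first principal direction
proj_var = (Xc @ w1).var(ddof=1)        # variance of the 1-D projection
recon = np.outer(Xc @ w1, w1)           # best rank-1 reconstruction
recon_err = ((Xc - recon) ** 2).sum() / (n - 1)

print(f"Projected variance:   {proj_var:.4f}  vs λ1 = {eigvals[0]:.4f}")
print(f"Reconstruction error: {recon_err:.4f}  vs λ2 + λ3 = {eigvals[1:].sum():.4f}")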

PCA Variants

Kernel PCA applies the kernel trick to capture nonlinear structure — it does PCA in an implicit high-dimensional feature space without computing coordinates there. Incremental PCA processes data in batches, handling datasets too large for memory. Sparse PCA adds L1 regularization so each component loads on few features, improving interpretability. Probabilistic PCA casts PCA as a latent variable model, enabling proper likelihood computation and handling of missing data.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Larger example: 500 samples, 10 features with injected correlations
np.random.seed(42)
X = np.random.randn(500, 10)
X[:, 1] = X[:, 0] * 0.8 + np.random.randn(500) * 0.3
X[:, 2] = X[:, 0] * 0.6 + np.random.randn(500) * 0.5

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
pca.fit(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("Cumulative:", np.cumsum(pca.explained_variance_ratio_).round(3))

# Keep 95% of variance
pca_95 = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca_95.fit_transform(X_scaled)
print(f"Reduced: {X_scaled.shape[1]}D → {X_reduced.shape[1]}D")

# Reconstruction error
X_recon = pca_95.inverse_transform(X_reduced)
error = np.mean((X_scaled - X_recon) ** 2)
print(f"Reconstruction error: {error:.4f}")

I'm still developing my intuition for why PCA works so elegantly — why the directions of maximum variance also minimize reconstruction error. The proof isn't hard once you set it up, but the fact that these two different goals lead to the same answer still feels like a small miracle.

PCA Under the Hood: SVD, Not Eigendecomposition

Scikit-learn's PCA does not actually compute the covariance matrix and find its eigenvalues. It computes the SVD of the centered data matrix directly. This is more numerically stable (no squaring the data matrix) and handles the case where features exceed samples. The right singular vectors of the data matrix are the principal components. The connection: if X = UΣVᵀ, then XᵀX = VΣ²Vᵀ, so the eigenvectors of XᵀX are the right singular vectors of X.
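
You can see that connection directly with the movie data: the squared singular values of the centered ratings matrix, divided by n − 1, match the covariance eigenvalues, and the right singular vectors match the principal components up to sign. A quick verification sketch:

import numpy as np

R = np.array([[5, 2, 4], [4, 1, 5], [1, 5, 2]], dtype=float)
Rc = R - R.mean(axis=0)
n = R.shape[0]

# Route 1: eigendecomposition of the covariance matrix
eigvals = np.linalg.eigvalsh(Rc.T @ Rc / (n - 1))[::-1]

# Route 2: SVD of the centered data matrix
U, sigma, Vt = np.linalg.svd(Rc, full_matrices=False)

print(eigvals.round(4))
print((sigma ** 2 / (n - 1)).round(4))  # same numbers
# Rows of Vt are the principal components (possibly with flipped signs)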

Rest Stop

Congratulations on making it this far. Seriously. If you've followed along from scalars through PCA, you've covered the core of linear algebra that matters for ML. You can stop here if you want — what you already know is enough to understand neural network forward passes, regression as projection, dimensionality reduction, and the geometry of regularization.

Here's what you now have in your toolkit: vectors as data representations, dot products as similarity measures, matrices as transformations of space, determinants as volume scaling factors, eigenvectors as the natural axes of a transformation, and PCA as the process of finding those axes in your data. The telescope is calibrated. The shadow analogy made regression geometric. The running example showed us that Alice and Bob live in the same taste neighborhood while Carol lives across town.

What's coming next goes deeper. SVD generalizes eigendecomposition to any matrix shape. Matrix decompositions give you the right numerical tool for each situation. Condition numbers explain why your model produces NaN. Spectral methods turn graphs into linear algebra problems. Each section builds on what you already know. But if you need a break, this is a good place to take one.

SVD: The Swiss Army Knife

The Decomposition That Works on Anything

Eigendecomposition requires square matrices. Our ratings matrix is often rectangular. We need something more general. Singular Value Decomposition (SVD) decomposes any m×n matrix A into three matrices: A = UΣVᵀ. Think of it as factoring the transformation into three steps: rotate (Vᵀ), scale (Σ), rotate again (U). Every matrix transformation is a rotation, followed by a stretch along the axes, followed by another rotation.

U is m×m orthogonal (left singular vectors). Σ is m×n diagonal with non-negative entries called singular values, sorted in decreasing order. V is n×n orthogonal (right singular vectors). The singular values of A are the square roots of the eigenvalues of AᵀA. The right singular vectors are the eigenvectors of AᵀA. SVD is the generalization of eigendecomposition to non-square matrices.
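
A quick check of that singular-value relationship on our ratings matrix (a minimal sketch):

import numpy as np

R = np.array([[5, 2, 4], [4, 1, 5], [1, 5, 2]], dtype=float)

sigma = np.linalg.svd(R, compute_uv=False)    # singular values of R
eigvals = np.linalg.eigvalsh(R.T @ R)[::-1]   # eigenvalues of RᵀR, descending

print(sigma.round(4))
print(np.sqrt(eigvals).round(4))  # identical: σᵢ = √λᵢ(RᵀR)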

Back to our telescope: eigenvectors found the best angles for square transformations. SVD finds the best angles for any transformation, even one that maps between spaces of different dimensions.

Low-Rank Approximation and the Netflix Prize

Here's where SVD earns its "Swiss Army knife" title. You can approximate a matrix by keeping only the top k singular values and zeroing out the rest. The Eckart-Young theorem guarantees this is the best possible rank-k approximation — no other matrix of rank k gets closer (in Frobenius or spectral norm). If a 1000×1000 matrix has 10 significant singular values, you can represent it with roughly 10 × (1000 + 1000) = 20,000 numbers instead of 1,000,000.

This is the insight that powered the Netflix Prize. In 2006, Netflix offered $1 million to anyone who could improve their recommendation algorithm by 10%. Simon Funk published a blog post describing a technique that was essentially SVD applied to the ratings matrix — decompose the user×movie matrix into user-factor and movie-factor matrices, then predict missing ratings by multiplying them back together. The key insight: if Alice and Bob both love sci-fi, and Alice rated Interstellar but Bob didn't, the latent "sci-fi" factor lets you predict Bob's rating from Alice's.

I didn't fully grasp SVD until I saw it predict missing ratings. Let me show you with our running example.

import numpy as np

# Our 3×3 ratings matrix. But now Carol hasn't rated Toy Story.
# We'll put 0 as a placeholder and see if SVD can fill it in.
R = np.array([
    [5, 2, 4],  # Alice: Alien=5, Titanic=2, ToyStory=4
    [4, 1, 5],  # Bob:   Alien=4, Titanic=1, ToyStory=5
    [1, 5, 0],  # Carol: Alien=1, Titanic=5, ToyStory=?
], dtype=float)

# Full SVD
U, sigma, Vt = np.linalg.svd(R, full_matrices=False)
print(f"Singular values: {sigma.round(2)}")

# Low-rank approximation: keep top 2 components
k = 2
approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
print(f"\nRank-{k} approximation:\n{approx.round(1)}")
print(f"\nPredicted Carol's Toy Story rating: {approx[2, 2]:.1f}")
# Carol's taste runs opposite to Alice and Bob (who both rate Toy Story high),
# so we'd expect a low-ish prediction

The rank-2 approximation fills in the missing entry with a prediction. Since Carol's taste pattern (low sci-fi, high romance) is different from Alice and Bob's, SVD predicts a modest rating for Toy Story. In production, this same idea scales to millions of users and items.

Randomized SVD for Scale

Full SVD costs O(mn × min(m,n)), which is impractical for matrices with millions of rows and columns. Randomized SVD uses random projections to approximate the top-k singular vectors in O(mn × k) time. The idea: multiply the matrix by a random skinny matrix to get a "sketch," compute the SVD of the much smaller sketch, and project back. Scikit-learn uses randomized SVD in TruncatedSVD and PCA(svd_solver='randomized'). For datasets with more than ~1000 features, it's almost certainly what you want.

from sklearn.decomposition import TruncatedSVD
import numpy as np

# Simulate a large ratings-style matrix (dense here for simplicity; real ones are sparse)
np.random.seed(42)
large_R = np.random.rand(10000, 500)

# Randomized SVD: much faster than full SVD
svd = TruncatedSVD(n_components=20, algorithm='randomized', random_state=42)
R_reduced = svd.fit_transform(large_R)
print(f"Reduced: {large_R.shape} → {R_reduced.shape}")
print(f"Explained variance: {svd.explained_variance_ratio_.sum():.3f}")

Matrix Decompositions: The Right Tool

Cholesky: The Square Root of a Matrix

The Cholesky decomposition factors a positive definite matrix into A = LLᵀ, where L is lower triangular. It's twice as fast as general LU decomposition and is the matrix equivalent of taking a square root. Its killer application: sampling from multivariate Gaussians. To sample from a Gaussian with mean μ and covariance Σ, compute the Cholesky factor L, sample standard normals z, and compute x = μ + Lz. np.random.multivariate_normal does essentially the same thing internally — it factors the covariance and transforms standard normals, though its default factorization is SVD rather than Cholesky. Gaussian processes rely on Cholesky for their marginal likelihood computation.

import numpy as np

# Sampling from a multivariate Gaussian via Cholesky
mean = np.array([1.0, 2.0])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

L = np.linalg.cholesky(cov)
z = np.random.randn(2)
sample = mean + L @ z
print(f"Sample: {sample.round(3)}")

# Verify with many samples
samples = mean + (L @ np.random.randn(2, 10000)).T
print(f"Empirical mean: {samples.mean(axis=0).round(2)}")
print(f"Empirical cov:\n{np.cov(samples.T).round(2)}")

QR: The Stable Least Squares Solver

QR decomposition factors A into an orthogonal Q and upper triangular R: A = QR. The standard method uses Householder reflections, which are unconditionally numerically stable. QR's main application in ML: solving least squares. Instead of the normal equation AᵀAx = Aᵀb (which squares the condition number), decompose A = QR and solve Rx = Qᵀb. Since R is triangular, back-substitution is fast. (np.linalg.lstsq goes one step further and uses an SVD-based LAPACK routine — more on that in the solver comparison below.)

import numpy as np

# Least squares via QR vs. normal equation
A = np.random.randn(100, 5)
true_w = np.array([1, -2, 3, 0.5, -1])
b = A @ true_w + np.random.randn(100) * 0.1

# QR approach (numerically stable)
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Normal equation (less stable)
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

print(f"QR solution:     {x_qr.round(3)}")
print(f"Normal eq:       {x_ne.round(3)}")
print(f"True weights:    {true_w}")

LU: General Purpose Solving

LU decomposition factors a matrix into a lower triangular L and upper triangular U: A = LU (with pivoting: PA = LU). It's Gaussian elimination organized for efficiency. This is what np.linalg.solve uses internally. It's the workhorse for solving Ax = b when A is square and well-conditioned.
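
If you want to see the factorization explicitly rather than letting solve handle it, SciPy exposes it. A minimal sketch, assuming SciPy is installed (the scipy.sparse.csgraph import later in this chapter assumes as much anyway):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[3.0, 1.0, 2.0],
              [6.0, 3.0, 4.0],
              [3.0, 1.0, 5.0]])
b = np.array([1.0, 2.0, 3.0])

# Factor once, then reuse the factorization for many right-hand sides
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)

print(np.allclose(A @ x, b))                  # True
print(np.allclose(x, np.linalg.solve(A, b)))  # same answer as np.linalg.solve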

When to Use Which

The choice between decompositions is about matching the tool to the matrix structure. Cholesky works only for positive definite matrices but is the fastest. QR handles any matrix and preserves the condition number — the default for least squares. LU is general-purpose for square systems. SVD handles everything, including rank-deficient matrices, but is the slowest.

For numerical stability: the normal equation squares the condition number (dangerous). QR preserves it. SVD handles even rank-deficient matrices gracefully. For well-conditioned problems, all methods agree. For ill-conditioned problems, only SVD remains reliable.

Choosing a Least Squares Solver

Small, well-conditioned: normal equation (fastest). General use: QR (stable, fast). Ill-conditioned or rank-deficient: SVD via np.linalg.lstsq (most robust). Scikit-learn's LinearRegression uses SVD by default — a sensible choice since you rarely know your condition number in advance.

Condition Number: Why Your Model Produces NaN

Definition and Intuition

The condition number of a matrix measures how sensitive the solution of Ax = b is to small perturbations. It's defined as the ratio of the largest to smallest singular value: κ(A) = σ_max / σ_min. A condition number near 1 means the matrix is well-behaved. A condition number of 10⁸ means your solution could have 8 fewer digits of accuracy than your input.

I spent two days once debugging NaN values in a production model. The features were right. The labels were right. The architecture was right. The learning rate was reasonable. It turned out one feature ranged from 0 to 1 and another from 0 to 1,000,000. The condition number of the data matrix was around 10¹². No amount of hyperparameter tuning was going to fix that. Five minutes of feature scaling solved it.

Feature Scaling Is Not Optional

When features have wildly different scales, the resulting matrices become ill-conditioned. Feature scaling (standardization, normalization) is not a best practice — it's a numerical necessity. Batch normalization in deep learning serves the same purpose: keeping intermediate activations well-conditioned so gradients don't explode or vanish.

import numpy as np

# Ill-conditioned: one feature is huge, another tiny
X_bad = np.column_stack([
    np.random.randn(100) * 1e6,   # income in dollars
    np.random.randn(100) * 0.001   # some tiny feature
])
print(f"Condition number (bad): {np.linalg.cond(X_bad):.0e}")

# After standardization
X_good = (X_bad - X_bad.mean(axis=0)) / X_bad.std(axis=0)
print(f"Condition number (good): {np.linalg.cond(X_good):.1f}")

NaN Debugging Checklist

When your model outputs NaN: (1) Check feature scales — standardize everything. (2) Compute np.linalg.cond(X) — if it's above 10⁶, you have a problem. (3) Lower the learning rate — large steps amplify numerical errors. (4) Add a small epsilon (1e-8) to every denominator in custom loss functions. (5) Check for log(0) or division by zero. (6) If using custom matrix operations, switch from np.linalg.inv to np.linalg.solve. I keep this list pinned to my monitor.

Covariance, Correlation, and Mahalanobis

The Covariance Matrix

The covariance matrix Σ captures pairwise linear relationships between features. The (i,j) entry is the covariance between features i and j. Diagonal entries are variances. It's always symmetric and positive semi-definite. For centered data X, the covariance matrix is (1/(n-1))XᵀX.

In our movie example, the covariance matrix tells us how each pair of movies co-varies across users. A large positive entry between Alien and Toy Story means "users who rate Alien high tend to rate Toy Story high too." A negative entry between Alien and Titanic means they tend to move in opposite directions. The eigenvectors of this covariance matrix are the principal components — the directions of maximum variance in taste space.
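
Here's that covariance matrix computed two ways — by hand from the centered ratings and with np.cov — as a quick sketch (np.cov needs rowvar=False so columns are treated as features):

import numpy as np

R = np.array([[5, 2, 4], [4, 1, 5], [1, 5, 2]], dtype=float)
Rc = R - R.mean(axis=0)

cov_manual = Rc.T @ Rc / (R.shape[0] - 1)
cov_numpy = np.cov(R, rowvar=False)

print(cov_manual.round(2))
print(np.allclose(cov_manual, cov_numpy))  # True
# The Alien–Titanic entry is negative, the Alien–Toy Story entry positive —
# the same taste pattern we eyeballed back at the start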

Correlation vs. Covariance

The correlation matrix is the covariance matrix of standardized features. Each entry is between −1 and +1, removing the effect of feature scales. Running PCA on the correlation matrix is equivalent to running PCA on standardized data — and is almost always what you want, because otherwise PCA is dominated by whichever feature happens to be measured in the largest units.

Whitening

The whitening transform decorrelates features and normalizes their variances so the covariance matrix becomes the identity. You compute it from the eigendecomposition of the covariance matrix, Σ = QΛQᵀ: a centered data point x becomes x_white = Λ^(-1/2) Qᵀ x, or with samples as rows, X_white = X_centered Q Λ^(-1/2). Whitening is used as preprocessing in ICA (Independent Component Analysis) and is conceptually related to batch normalization.
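
A minimal whitening sketch on synthetic correlated data (invented for the example); after the transform, the empirical covariance is approximately the identity:

import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=1000)
Xc = X - X.mean(axis=0)

# Eigendecompose the covariance, then rescale each eigen-direction by 1/√λ
eigvals, Q = np.linalg.eigh(np.cov(Xc, rowvar=False))
X_white = Xc @ Q @ np.diag(1.0 / np.sqrt(eigvals))

print(np.cov(X_white, rowvar=False).round(2))  # ≈ identity matrix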

Mahalanobis Distance: The Right Ruler

Euclidean distance treats all dimensions equally, which is wrong when features are correlated or have different variances. The Mahalanobis distance from a point x to a distribution with mean μ and covariance Σ is √((x - μ)ᵀ Σ⁻¹ (x - μ)). It's Euclidean distance after whitening — it stretches the axes so the data looks spherical, then measures distance in the corrected space.

This matters for anomaly detection. A point might be close in Euclidean distance (close to the mean in raw coordinates) but far in Mahalanobis distance (far from the main cloud when correlations are accounted for). Gaussian classifiers, outlier detection, and Gaussian mixture models all rely on Mahalanobis distance.

import numpy as np

# Two-feature data with strong correlation
np.random.seed(42)
mean = np.array([0, 0])
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
X = np.random.multivariate_normal(mean, cov, 200)

# A test point: close in Euclidean but far along the thin axis
point = np.array([2, -1])

# Euclidean distance from mean
eucl = np.linalg.norm(point - mean)

# Mahalanobis distance from mean
diff = point - mean
cov_inv = np.linalg.inv(cov)  # small matrix, inv is OK here
mahal = np.sqrt(diff @ cov_inv @ diff)

print(f"Euclidean distance: {eucl:.2f}")
print(f"Mahalanobis distance: {mahal:.2f}")
# Mahalanobis is much larger — this point is a true outlier

Graph Laplacian and Spectral Methods

Turning Graphs into Matrices

Not all data comes in neat feature matrices. Social networks, citation graphs, molecular structures — these are graphs. And here's where linear algebra reaches into unexpected territory: you can turn any graph into a matrix, decompose it, and extract structure.

The adjacency matrix A of a graph has A[i,j] = 1 if there's an edge between nodes i and j (or a weight, for weighted graphs). The degree matrix D is diagonal, with D[i,i] = degree of node i. The graph Laplacian is L = D - A. It looks like an arbitrary definition, but it encodes deep structural information.

Spectral Clustering

The eigenvectors of the Laplacian reveal cluster structure. The smallest eigenvalue is always 0 (with eigenvector = all ones). The second-smallest eigenvector — the Fiedler vector — gives the optimal two-way partition of the graph. Nodes with positive Fiedler values go in one cluster, negative values in the other. Spectral clustering computes the k smallest eigenvectors, treats each node's eigenvector entries as coordinates, and runs k-means in that space.

Why does this work? The Laplacian's eigenvectors minimize a quantity called the "graph cut" — they find splits where few edges cross between groups. It's the same telescope idea, applied to graphs: the eigenvectors find the angles from which the community structure is clearest.

import numpy as np
from scipy.sparse.csgraph import laplacian

# A small graph with two clear clusters
# Nodes 0,1,2 are connected; nodes 3,4,5 are connected
# One weak link between 2 and 3
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],  # node 2 bridges the clusters
    [0, 0, 1, 0, 1, 1],  # node 3 bridges the clusters
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])

L = laplacian(A)
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler = eigenvectors[:, 1]  # second-smallest eigenvector

print(f"Fiedler vector: {fiedler.round(3)}")
print(f"Cluster assignment: {['A' if x < 0 else 'B' for x in fiedler]}")

PageRank: Power Iteration in Action

How do you find the dominant eigenvector of a matrix without computing all of them? Power iteration: start with a random vector, repeatedly multiply by the matrix and normalize. The vector converges to the eigenvector with the largest eigenvalue.

Google's original PageRank algorithm is power iteration on the web graph's transition matrix. Each iteration refines the importance scores of web pages. It converges because the transition matrix has a dominant eigenvalue of 1 (guaranteed by the Perron-Frobenius theorem for stochastic matrices). The web has billions of pages, so full eigendecomposition is impossible — but power iteration scales gracefully with sparse matrix-vector products.

import numpy as np

# Tiny web graph: 4 pages, links as edges
# Page 0 → 1, 2; Page 1 → 2; Page 2 → 0; Page 3 → 0, 2
n = 4
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}

# Build transition matrix
M = np.zeros((n, n))
for src, dests in links.items():
    for dst in dests:
        M[dst, src] = 1.0 / len(dests)

# Power iteration
v = np.ones(n) / n
for i in range(50):
    v = M @ v
    v /= v.sum()

print(f"PageRank scores: {v.round(4)}")
# Page 0 and 2 have highest scores — most linked to

NMF: When You Need Parts, Not Wholes

The Problem with PCA and SVD

PCA and SVD can produce components with negative values. That's fine mathematically, but sometimes it's nonsensical. A face can't have "negative nose." A document can't have "negative topic." When your data is inherently non-negative (pixel intensities, word counts, ratings), you want a decomposition that respects that constraint.

Non-Negative Matrix Factorization

NMF decomposes a non-negative matrix V into two non-negative matrices: V ≈ WH. Both W and H have only non-negative entries. This constraint forces the decomposition to learn additive parts. In face recognition, NMF learns features like "nose," "eye," "mouth" that combine additively to form faces. PCA learns holistic features — entire eigenfaces — that cancel each other out through subtraction. NMF's parts are more interpretable.

In topic modeling, NMF applied to a term-document matrix produces topics (columns of W) as non-negative mixtures of words, and document representations (rows of H) as non-negative mixtures of topics. It's a popular alternative to LDA that's often faster and gives comparable results.

import numpy as np
from sklearn.decomposition import NMF

# Term-document matrix: 4 documents, 6 words
# Each entry = word count (non-negative)
V = np.array([
    [5, 0, 3, 0, 2, 0],  # doc about topic A
    [4, 0, 4, 0, 1, 0],  # doc about topic A
    [0, 3, 0, 5, 0, 4],  # doc about topic B
    [0, 4, 0, 3, 0, 5],  # doc about topic B
], dtype=float)

nmf = NMF(n_components=2, random_state=42)
W = nmf.fit_transform(V)  # document-topic matrix
H = nmf.components_        # topic-word matrix

print("Document-topic weights:\n", W.round(2))
print("Topic-word weights:\n", H.round(2))
print("Reconstruction:\n", (W @ H).round(1))

Notice how NMF found two clean topics: one loads on words 0, 2, 4 (our "topic A" words), the other on words 1, 3, 5 (our "topic B" words). Documents 0-1 load heavily on the first of those topics, documents 2-3 on the second. Because everything is non-negative, the parts are additive and interpretable in a way that PCA components aren't.

The tradeoff: NMF doesn't have the optimality guarantee of SVD (Eckart-Young). The solution depends on initialization and isn't unique. But for many applications, interpretability matters more than mathematical optimality.

Wrap-Up

Thank you for staying with me through this. Linear algebra is a big subject, and I know some of these sections were dense. But here's what I hope happened along the way: the subject went from abstract manipulation of symbols to something you can see and feel.

We started with a single number — a scalar, one movie rating — and built from there. Vectors gave us a way to represent an entire user's preferences. Dot products measured how similar two users are. Matrices organized everything into one object — and turned out to be transformations that stretch, rotate, and project space. The determinant told us when a transformation squishes things flat. Eigenvalues and eigenvectors found the directions that matter most, the angles where the telescope focuses. PCA calibrated that telescope on real data, reducing dimensions while keeping signal. SVD generalized the trick to any matrix shape and predicted a missing movie rating. Decompositions gave us the right numerical tool for each situation. Condition numbers explained why things blow up. Covariance and Mahalanobis distance measured the right kind of similarity. Graph Laplacians extended everything to networks. And NMF showed that when you need additive parts, there's a tool for that too.

The next time you see a matrix in an ML paper, I hope you see a transformation. When you see a loss function, I hope you see geometry — a shadow being cast onto a subspace, or a telescope being rotated to maximize clarity. When your training run produces NaN, I hope you check the condition number before blaming the learning rate. Linear algebra isn't a prerequisite you grind through and forget. It's the lens through which everything else comes into focus.

Resources

3Blue1Brown — Essence of Linear Algebra (YouTube series). If you only consume one resource on this list, make it this one. Grant Sanderson's animations will give you geometric intuition that no textbook can match. I watch the eigenvalue episode at least once a year.

Gilbert Strang — Introduction to Linear Algebra. The gold standard textbook for building real understanding. Strang teaches at MIT and his explanations prioritize insight over rigor. The four fundamental subspaces chapter is worth the price of the book.

Immersive Linear Algebra (immersivemath.com). An interactive online textbook with live demos. Drag vectors around and watch transformations happen in real time. Invaluable for building the geometric intuition we've been talking about.

Philip Klein — Coding the Matrix. Linear algebra taught entirely through Python programming exercises. If you learn by coding rather than reading proofs, this is your book.

fast.ai — Computational Linear Algebra (free online course). Rachel Thomas teaches linear algebra from the perspective of applications — NLP, image processing, PageRank. The emphasis on numerical stability and speed is exactly what production ML engineers need.