Nice to Know
I'll be honest: when I first encountered most of these topics, my reaction was some combination of "I'll never need this" and "what is even happening." Then a paper would cite Itô's lemma, or an interview would ask about the pseudoinverse, and I'd wish I'd spent thirty minutes on it earlier. This isn't a syllabus — it's a field guide. Enough to recognize what you're looking at when it appears, enough to know where to dig deeper when it actually matters.
Linear Algebra
Tensor Decomposition (CP, Tucker)
Matrices handle two-dimensional data elegantly. The trouble is that a lot of interesting data refuses to be two-dimensional: user × item × time in recommendation systems, spatial × spectral × temporal in remote sensing, or the weights of a neural network layer once you start stacking attention heads. Flattening a three-way array into a matrix loses structure that may actually matter.
Tensor decompositions are the higher-dimensional analog of SVD. CP decomposition (CANDECOMP/PARAFAC) writes a tensor as a sum of rank-one outer products — think of it as SVD's more ambitious cousin. Tucker decomposition is more general: it produces a compressed core tensor multiplied by factor matrices along each mode, which gives you more flexibility at the cost of more parameters.
In practice you'll encounter these in the context of compressing convolutional layers (decomposing a 4D weight tensor), multi-relational knowledge graph embeddings, or any signal processing pipeline where the data is naturally multi-way. The algorithms are iterative, the decompositions are generally not unique, and the question of what rank to use is about as settled as it is for matrices — which is to say, not very.
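If you want to see the CP structure concretely, here's a minimal numpy sketch that builds a rank-R order-3 tensor as a sum of rank-one outer products. The sizes are arbitrary; fitting the factors from a given tensor (by alternating least squares or similar) is what a library like tensorly does — this just verifies the structure itself.

```python
import numpy as np

# CP structure: a rank-R order-3 tensor is T[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3                      # arbitrary illustrative sizes
A, B, C = rng.normal(size=(I, R)), rng.normal(size=(J, R)), rng.normal(size=(K, R))

# Build the tensor two ways: einsum over the shared rank index,
# and an explicit sum of R rank-one outer products.
T_einsum = np.einsum('ir,jr,kr->ijk', A, B, C)
T_outer = sum(np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
              for r in range(R))

print(np.allclose(T_einsum, T_outer))        # True: same rank-R tensor
```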
Kronecker Product
The Kronecker product A ⊗ B builds a block matrix by replacing each entry aᵢⱼ with the scaled matrix aᵢⱼB. It's the natural algebraic structure for describing operations that act simultaneously on two spaces — which sounds abstract until you realize it shows up everywhere once you know to look for it.
In graph neural networks, message-passing operations often involve adjacency structure ⊗ feature transformations. In multi-task learning, the relationship between task covariances and input covariances is naturally Kroneckerian. It also appears when you vectorize matrix equations: the identity vec(AXB) = (Bᵀ ⊗ A) vec(X) turns a matrix problem into a vector problem, which is sometimes exactly what a solver needs. K-FAC, the approximate natural gradient method, exploits Kronecker structure to make Fisher information matrix inversion tractable.
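The vec identity is easy to sanity-check in numpy. One caveat: it holds for column-major vectorization, hence order='F' below. The matrices are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
A, X, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 5)), rng.normal(size=(5, 2))

vec = lambda M: M.ravel(order='F')           # column-major vectorization

lhs = vec(A @ X @ B)                         # vec(AXB)
rhs = np.kron(B.T, A) @ vec(X)               # (B^T ⊗ A) vec(X)

print(np.allclose(lhs, rhs))                 # True
```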
Matrix Square Root
The matrix square root S of a positive semidefinite matrix A satisfies S² = A. Unlike the scalar case, there can be many such S — we typically want the unique symmetric positive semidefinite one, computed via eigendecomposition: diagonalize A, take the square root of the eigenvalues, rotate back.
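Here's a minimal numpy sketch of that recipe — diagonalize, take square roots of the eigenvalues, rotate back. (scipy.linalg.sqrtm handles the general, non-symmetric case.)

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T                                  # a symmetric PSD matrix

# Diagonalize, take the square root of the eigenvalues, rotate back.
eigvals, Q = np.linalg.eigh(A)
S = Q @ np.diag(np.sqrt(np.clip(eigvals, 0, None))) @ Q.T

print(np.allclose(S @ S, A))                 # True: S² = A
print(np.allclose(S, S.T))                   # True: the symmetric PSD root
```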
I first encountered this in neural style transfer, where matching the square root of the feature covariance (the Gram matrix) between a content image and a style image turned out to transfer artistic texture. It also appears in the natural gradient, where the update direction requires multiplying by the inverse square root of the Fisher information matrix — computationally expensive enough that most implementations approximate it. K-FAC-style optimizers use the Kronecker structure mentioned above to make this approximation cheap enough to actually run.
Pseudoinverse (Moore-Penrose)
The regular matrix inverse requires a square, full-rank matrix. Real problems rarely cooperate on either count. The Moore-Penrose pseudoinverse A⁺ handles any shape and any rank: it gives the minimum-norm least-squares solution to Ax = b. If A is full-rank and square, A⁺ equals the regular inverse — that's a comforting special case worth remembering for interviews.
Computationally, you get it from the SVD: invert the nonzero singular values, leave the zero ones alone. The intuition is that you're inverting the "informative" directions of A and leaving alone the directions in which A provides no information. You'll see it in the theory behind linear regression (the normal equations), in analysis of over- and under-parameterized models, and in the initialization schemes of some networks where you want a controlled pseudoinverse-like mapping at the start of training.
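A sketch of the SVD construction, checked against numpy's built-in pinv. The only subtlety is choosing a tolerance for what counts as a "nonzero" singular value — here a generous fixed cutoff, since the example is deliberately rank-deficient.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4)) @ rng.normal(size=(4, 5))    # 6x5 matrix of rank at most 4

U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = 1e-10 * s.max()                                     # cutoff for "effectively zero"
s_inv = np.where(s > tol, 1.0 / s, 0.0)                   # invert nonzero singular values only
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

print(np.allclose(A_pinv, np.linalg.pinv(A)))             # True

# Minimum-norm least-squares solution to Ax = b:
b = rng.normal(size=6)
x = A_pinv @ b
```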
Random Matrix Theory
Here's a thing that surprised me: if you take a large matrix of i.i.d. random entries, the eigenvalues of its sample covariance (equivalently, its squared singular values) are not random-looking at all — they follow a precise deterministic law called the Marchenko-Pastur distribution. The shape of the bulk eigenvalue spectrum depends only on the aspect ratio of the matrix and the entry variance, not on the specific entries. This is random matrix theory in action.
Why should a deep learning practitioner care? Because the spectrum of WᵀW for a randomly initialized weight matrix W should look like Marchenko-Pastur. After training, it often doesn't — a few eigenvalues "spike" out of the bulk, and those spikes correspond to the directions the network actually learned. This gives you a diagnostic: compare the empirical eigenvalue spectrum of a trained weight matrix to the Marchenko-Pastur prediction for a random matrix of the same shape. Eigenvalues outside the bulk represent learned structure; eigenvalues inside the bulk might as well be noise. Some initialization schemes and implicit regularization analyses are built directly on this insight.
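Here's the diagnostic as a sketch: compute the spectrum of WᵀW/n for a random matrix and compare it to the Marchenko-Pastur bulk edges σ²(1 ± √γ)², where γ is the aspect ratio. For a trained weight matrix, you'd look for eigenvalues escaping past the upper edge.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 2000, 500, 1.0
W = rng.normal(scale=sigma, size=(n, p))

evals = np.linalg.eigvalsh(W.T @ W / n)        # spectrum of the sample covariance

gamma = p / n
bulk_lo = sigma**2 * (1 - np.sqrt(gamma))**2   # Marchenko-Pastur bulk edges
bulk_hi = sigma**2 * (1 + np.sqrt(gamma))**2

print(f"empirical range: [{evals.min():.3f}, {evals.max():.3f}]")
print(f"MP bulk edges:   [{bulk_lo:.3f}, {bulk_hi:.3f}]")
# For a random matrix the empirical spectrum hugs the bulk edges; eigenvalues
# well above bulk_hi in a trained matrix suggest learned structure.
```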
Calculus
Neural ODEs
A residual network update looks like hₜ₊₁ = hₜ + f(hₜ). As you make the layers thinner and more numerous, this discrete update approaches a continuous differential equation: dh/dt = f(h, t, θ). That's the idea behind Neural ODEs — parameterize the derivative of the hidden state with a neural network, then use a black-box ODE solver to integrate forward to get the output.
The memory efficiency is genuine. Standard backpropagation through an ODE solver would require storing all intermediate states. Instead, the adjoint method solves a second ODE backwards in time to compute gradients, using constant memory regardless of how many "steps" the solver took. The cost is that forward passes are slower and less predictable. You'll see Neural ODEs cited in papers on continuous-time dynamics, latent SDEs, and irregularly sampled time series where the discrete-layer assumption is awkward.
Variational Calculus
Standard calculus asks: what value of x minimizes f(x)? Variational calculus asks: what function q(x) minimizes a functional F[q]? A functional is a function that takes a function as input and returns a scalar. This sounds recursive, but the setup is natural: F[q] = ∫ L(q(x), q'(x), x) dx, and the optimal q satisfies the Euler-Lagrange equation, which is the "derivative equals zero" condition for functionals.
In ML, the most important application is variational inference. When we write the ELBO and ask "what distribution q(z) minimizes KL(q||p)?", we're doing variational calculus. The connection isn't always made explicit in deep learning tutorials, but it's there. Physics-informed neural networks, optimal control, and the theory of normalizing flows all draw on the same machinery.
Stochastic Calculus / SDEs
Ordinary differential equations describe deterministic dynamics. Stochastic differential equations add noise: dX = f(X, t) dt + g(X, t) dW, where dW is an infinitesimal Wiener process (Brownian motion) increment. The trouble is that Brownian motion is nowhere differentiable, which breaks ordinary calculus. Itô's lemma is the replacement for the chain rule in this setting — it has an extra second-derivative term that appears precisely because of the roughness of the noise.
This stopped being niche when diffusion models became mainstream. Score-based generative models like DDPM and their continuous-time extensions are literally SDEs. The forward process adds noise according to a specific SDE; the reverse process removes noise according to a related SDE involving the score function ∇ₓ log p(x). If you want to understand why diffusion models work at a mathematical level — not just the recipe, but the why — you're reading stochastic calculus.
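The standard way to simulate an SDE is the Euler-Maruyama scheme: step the drift by dt and the noise by √dt times a standard normal. Here's a sketch for a mean-reverting Ornstein-Uhlenbeck process, dX = −X dt + σ dW, which has the flavor of the forward noising process diffusion models use; the parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_steps, n_paths, sigma = 5.0, 1000, 10_000, 1.0
dt = T / n_steps

x = rng.normal(loc=3.0, scale=0.1, size=n_paths)          # start far from equilibrium
for _ in range(n_steps):
    drift = -x                                             # f(X, t) dt term
    noise = sigma * np.sqrt(dt) * rng.normal(size=n_paths) # g(X, t) dW term
    x = x + drift * dt + noise

# The OU process forgets its starting point and settles to N(0, sigma^2 / 2).
print(x.mean(), x.var())     # approximately 0.0 and 0.5
```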
Implicit Differentiation
Most functions you differentiate are explicit: given x, compute y = f(x), done. But sometimes y is defined as the fixed point or solution of an equation g(x, y) = 0 — you solve for y numerically, and you want gradients through that solve. Implicit differentiation computes dy/dx = −(∂g/∂y)⁻¹ (∂g/∂x) without re-solving the equation or unrolling the iterative solver.
Deep equilibrium models define their "output layer" as the fixed point of a transformation, then use implicit differentiation to backpropagate through it. iMAML and other meta-learning algorithms differentiate through inner optimization loops this way, avoiding the memory explosion of unrolling hundreds of gradient steps. If you've ever wondered how you can backpropagate through "run gradient descent until convergence," implicit differentiation is the answer.
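Here's a scalar sketch. The function g below is a made-up fixed-point equation chosen only because it has a unique, easy-to-find solution; the point is that the implicit-function formula matches a finite difference obtained by re-solving.

```python
import numpy as np

def g(x, y):                     # implicit equation: g(x, y) = 0 defines y(x)
    return y - np.tanh(x + 0.5 * y)

def solve_y(x, iters=200):       # fixed-point iteration: y <- tanh(x + 0.5 y)
    y = 0.0
    for _ in range(iters):
        y = np.tanh(x + 0.5 * y)
    return y

x0 = 0.7
y0 = solve_y(x0)

# Implicit differentiation: dy/dx = -(dg/dy)^(-1) (dg/dx)
sech2 = 1.0 - np.tanh(x0 + 0.5 * y0) ** 2
dg_dy = 1.0 - 0.5 * sech2
dg_dx = -sech2
dydx_implicit = -dg_dx / dg_dy

# Finite-difference check: re-solve the equation at nudged x.
eps = 1e-6
dydx_fd = (solve_y(x0 + eps) - solve_y(x0 - eps)) / (2 * eps)
print(dydx_implicit, dydx_fd)    # agree to several decimal places
```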
Probability & Statistics
Copulas
Pearson correlation is the workhorse of dependence measurement, but it has a dirty secret: it only measures linear dependence, and it can be misleading when your marginal distributions are non-Gaussian. Two variables can have zero correlation but substantial dependence — and correlation says nothing at all about what happens in the tails, which is often exactly where things go wrong.
Copulas solve this by separating the marginal distributions from the dependence structure. By Sklar's theorem, any joint distribution can be written as a copula applied to the marginals — the copula captures purely how the variables co-move, stripped of any information about their individual shapes. A Gaussian copula has the dependence structure of a multivariate Gaussian but arbitrary marginals. A Clayton copula has strong lower-tail dependence — it models situations where things tend to fail together when they're both bad. The 2008 financial crisis was partly a story about using Gaussian copulas for mortgage default correlations and being surprised that tail events were actually correlated.
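A sketch of sampling from a Gaussian copula with non-Gaussian marginals: correlated normals become correlated uniforms via the normal CDF, and the uniforms are pushed through whatever inverse CDFs you like. The exponential and Student-t marginals here are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, n = 0.8, 100_000
cov = np.array([[1.0, rho], [rho, 1.0]])

# 1. Correlated standard normals -> correlated uniforms (the Gaussian copula sample).
z = rng.multivariate_normal(mean=[0, 0], cov=cov, size=n)
u = stats.norm.cdf(z)

# 2. Apply the inverse CDFs of the marginals you actually want.
x = stats.expon.ppf(u[:, 0])            # exponential marginal
y = stats.t.ppf(u[:, 1], df=3)          # heavy-tailed Student-t marginal

# The dependence survives even though neither marginal is Gaussian.
print(stats.spearmanr(x, y)[0])         # rank correlation close to that of z
```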
Sufficient Statistics
When you collect data to estimate a parameter, there's a natural question: how much of the data do you actually need to keep? A sufficient statistic T(X) captures everything about θ that's in the data — formally, the distribution of X given T(X) doesn't depend on θ at all. The mean and variance of a sample are sufficient for a Gaussian. The sum of a Bernoulli sample is sufficient for the success probability.
This matters for two reasons. First, it tells you when you can safely compress your data without losing estimation power — useful for streaming or distributed settings. Second, it connects deeply to the exponential family: every exponential family distribution has a finite-dimensional sufficient statistic, which is one of the reasons that family is so mathematically tractable. The concept also underlies the Rao-Blackwell theorem, which tells you how to improve any unbiased estimator by conditioning on a sufficient statistic.
Exponential Family
Gaussian, Bernoulli, Poisson, Gamma, Beta, Dirichlet — these distributions feel unrelated until you write them in a common form. Every member of the exponential family can be written as p(x | η) = h(x) exp(η · T(x) − A(η)), where η are the natural parameters, T(x) is the sufficient statistic, and A(η) is the log-partition function that normalizes everything. Once you see the pattern, computing gradients, conjugate priors, and maximum likelihood estimates all follow the same template.
The log-partition function A(η) is convex, and its gradient gives the mean of the sufficient statistic: E[T(x)] = ∇A(η). This means fitting exponential family models reduces to moment matching — match the model's expected sufficient statistics to the empirical ones. Generalized linear models, variational inference with structured approximations, and the theory of natural gradients all live in this framework. I avoided learning this for an embarrassingly long time, thinking it was abstract machinery. It's actually the most useful unifying framework in classical statistics.
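A quick numeric check of E[T(x)] = ∇A(η) for the Bernoulli, where T(x) = x, the natural parameter η is the log-odds, and A(η) = log(1 + exp(η)).

```python
import numpy as np

# Bernoulli in exponential family form: T(x) = x, A(eta) = log(1 + e^eta),
# natural parameter eta = log(p / (1 - p)).
p = 0.3
eta = np.log(p / (1 - p))

A = lambda e: np.log1p(np.exp(e))
eps = 1e-6
grad_A = (A(eta + eps) - A(eta - eps)) / (2 * eps)    # numerical gradient of A

rng = np.random.default_rng(0)
samples = rng.binomial(1, p, size=1_000_000)
print(grad_A, samples.mean())   # both approximately p = 0.3: E[T(x)] = grad A(eta)
```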
Survival Analysis
Survival analysis addresses a specific annoyance: your outcome is a time — how long until a customer churns, a bearing fails, a patient relapses — and you often don't get to observe the outcome for everyone in your dataset because the study ends, or the customer is still active, or the patient is still alive. This is called censoring, and naively ignoring censored observations biases every estimate you make.
The field developed a suite of methods for handling this gracefully. The Kaplan-Meier estimator produces a nonparametric survival curve that accounts for censoring at each time step. The Cox proportional hazards model estimates how covariates affect the relative risk of the event without assuming any particular shape for the baseline hazard — it's semiparametric in a way that's both elegant and practically useful. Deep survival models like DeepSurv graft a neural network onto the Cox framework. If you're building churn models or predictive maintenance systems and not accounting for censoring, your estimates are probably wrong in a systematic direction.
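The Kaplan-Meier recursion itself is short enough to write from scratch — multiply the running survival estimate by (1 − events/at-risk) at each observed event time. The durations and censoring flags below are made up; a library like lifelines adds confidence intervals and plotting on top of the same idea.

```python
import numpy as np

# Made-up data: time-to-event, and whether the event was observed (1) or censored (0).
durations = np.array([5, 6, 6, 2, 4, 4, 9, 3, 8, 8], dtype=float)
observed  = np.array([1, 0, 1, 1, 1, 0, 0, 1, 1, 1])

def kaplan_meier(durations, observed):
    times = np.sort(np.unique(durations[observed == 1]))   # distinct event times
    surv, curve = 1.0, []
    for t in times:
        at_risk = np.sum(durations >= t)                    # still in the study at time t
        events = np.sum((durations == t) & (observed == 1))
        surv *= 1.0 - events / at_risk                      # product-limit update
        curve.append((t, surv))
    return curve

for t, s in kaplan_meier(durations, observed):
    print(f"t={t:>4.0f}  S(t)={s:.3f}")
```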
Causal Discovery (PC, FCI)
Most ML models learn that X and Y are associated, not that X causes Y. Often that's fine. But when you want to understand the mechanism, design an intervention, or build a model that generalizes under distribution shift, correlation isn't enough. Causal discovery algorithms try to learn the actual directed causal graph from observational data — without any experiments.
The PC algorithm starts with a fully connected undirected graph and removes edges where it finds conditional independence, then orients as many edges as possible using orientation rules. FCI extends this to handle latent confounders — variables that influence two observed variables without being observed themselves. Neither algorithm gives you certainty; they give you a Markov equivalence class of graphs that are all consistent with your conditional independence tests. The practical upshot is that you can sometimes identify which causal directions are impossible, even if you can't identify the true graph uniquely. That's still useful for ruling out wrong stories.
Optimization
Mirror Descent
Gradient descent assumes your parameters live in Euclidean space and that the Euclidean distance is the right way to measure how far you've moved. Neither assumption is always appropriate. When parameters are constrained to a probability simplex, an L2 gradient step will typically send you outside the simplex, and you have to project back. That projection is wasteful and breaks the natural geometry of the problem.
Mirror descent replaces the Euclidean distance with a Bregman divergence that matches the geometry of the constraint set. For the simplex, the natural choice is KL divergence — and mirror descent with this divergence gives you the exponentiated gradient algorithm, which automatically stays on the simplex without projection. The framework is the foundation of much of online learning theory, and it clarifies why different algorithms are appropriate for different geometries. When you read about online mirror descent in a regret analysis, this is what they mean.
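A sketch of exponentiated gradient on a toy linear loss over the simplex — the update is multiplicative and the iterate never needs a projection. The loss vector is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, step, steps = 5, 0.5, 200
c = rng.normal(size=d)                  # toy linear loss: f(w) = c . w on the simplex

w = np.ones(d) / d                      # start at the uniform distribution
for _ in range(steps):
    grad = c                            # gradient of the linear loss
    w = w * np.exp(-step * grad)        # multiplicative (mirror) update
    w = w / w.sum()                     # renormalize -- stays on the simplex by construction

print(np.round(w, 4))                   # mass concentrates on the best coordinate
print(np.argmax(w) == np.argmin(c))     # True
```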
Proximal Operators
Gradient descent needs a smooth objective. The moment you add L1 regularization — which is what gives you exactly-zero weights — you've introduced a term that isn't differentiable. Subgradient methods exist, but they converge slowly. Proximal operators handle non-smooth terms more cleanly: split the objective into a smooth part and a non-smooth part, take a gradient step on the smooth part, then apply the proximal operator of the non-smooth part.
The proximal operator of a function g is prox_g(v) = argmin_x [g(x) + ½‖x − v‖²]. For L1 regularization, this works out to soft thresholding: push values toward zero by λ, and set them to exactly zero if they're within λ of the origin. That's why LASSO produces sparse solutions. ISTA (Iterative Shrinkage-Thresholding Algorithm) and FISTA (its accelerated version) alternate gradient steps on the smooth loss with proximal steps on the regularizer. The proximal gradient family is the right tool whenever you want to combine smooth losses with structured sparsity penalties.
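Soft thresholding and a bare-bones ISTA loop, on a made-up sparse regression problem; the step size 1/L comes from the largest eigenvalue of AᵀA.

```python
import numpy as np

def soft_threshold(v, lam):
    # prox of lam * ||x||_1: shrink toward zero, clip to exactly zero within lam
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

rng = np.random.default_rng(0)
n, d, lam = 50, 20, 0.1
A = rng.normal(size=(n, d))
x_true = np.zeros(d); x_true[:3] = [2.0, -1.5, 1.0]      # sparse ground truth
b = A @ x_true + 0.01 * rng.normal(size=n)

L = np.linalg.eigvalsh(A.T @ A).max()                     # Lipschitz constant of the smooth part
x = np.zeros(d)
for _ in range(500):
    grad = A.T @ (A @ x - b)                              # gradient step on the smooth loss
    x = soft_threshold(x - grad / L, lam / L)             # proximal step on the L1 term

print(np.round(x, 2))                                     # sparse: most entries exactly zero
```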
Frank-Wolfe
Constrained optimization usually works by projecting your unconstrained gradient step back onto the feasible set. But projection can be expensive — for matrix norm balls, trace norm constraints, or polytopes with complex structure, it may require solving a convex program at every step.
The Frank-Wolfe algorithm (also called conditional gradient) avoids projection entirely. Instead of projecting, it finds the vertex of the constraint set that most aligns with the negative gradient — a linear minimization problem, which is often much cheaper. It then moves toward that vertex by a convex combination. The iterates are naturally feasible at every step. The cost is sublinear convergence in general, but for problems where projection is expensive and the constraint set has easily computable vertices — nuclear norm balls in matrix completion, for instance — Frank-Wolfe often wins in practice.
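A Frank-Wolfe sketch for a quadratic over an L1 ball, where the linear minimization oracle is just "pick the signed coordinate vertex most aligned with the negative gradient." The quadratic is random and only there to give the loop something to minimize.

```python
import numpy as np

rng = np.random.default_rng(0)
d, radius = 20, 1.0
A = rng.normal(size=(d, d)); Q = A.T @ A + np.eye(d)      # a positive definite quadratic
b = rng.normal(size=d)
f_grad = lambda x: Q @ x - b                               # gradient of 0.5 x'Qx - b'x

x = np.zeros(d)                                            # feasible start
for t in range(200):
    g = f_grad(x)
    i = np.argmax(np.abs(g))                               # linear minimization oracle:
    s = np.zeros(d); s[i] = -radius * np.sign(g[i])        # best vertex of the L1 ball
    gamma = 2.0 / (t + 2.0)                                # classic step-size schedule
    x = (1 - gamma) * x + gamma * s                        # convex combination: always feasible

print(np.abs(x).sum() <= radius + 1e-9)                    # True: never left the ball
```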
Alternating Minimization
Some optimization problems factor into two groups of variables where, if you fix one group, the problem over the other group is easy. Alternating minimization exploits this: hold one fixed, optimize the other, swap, repeat. It's inelegant in the sense that you're not directly minimizing the joint objective, but it's practical in the sense that it actually works on a surprising number of problems.
Non-negative matrix factorization decomposes this way: with W fixed, solving for H is a (nonnegative) least-squares problem, and vice versa. The EM algorithm alternates between computing expected sufficient statistics (E-step) and maximizing the expected log-likelihood (M-step). K-means alternates between assigning points to clusters and recomputing cluster centers. None of these are guaranteed to find global optima, but all of them converge to local optima reliably, which in practice is often good enough. The pattern is worth recognizing because it often suggests a clean implementation strategy for problems that look joint and hard.
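Here's the alternating pattern in its plainest form — unconstrained low-rank factorization by alternating least squares. NMF would add a nonnegativity constraint to each half-step, but the fix-one-solve-the-other rhythm is identical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 30, 40, 5
X = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))     # a genuinely rank-r matrix

W = rng.normal(size=(m, r))                                # random initialization
for _ in range(50):
    # Fix W, solve for H: an ordinary least-squares problem.
    H = np.linalg.lstsq(W, X, rcond=None)[0]
    # Fix H, solve for W: least squares on the transposed problem.
    W = np.linalg.lstsq(H.T, X.T, rcond=None)[0].T

print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))       # relative error near zero
```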
Information Theory
Rate-Distortion Theory
Lossless compression has a hard floor: you can't compress beyond the Shannon entropy of your source. But if you're willing to tolerate some reconstruction error, you can go further. Rate-distortion theory characterizes this tradeoff: for any level of distortion D (measured by some distortion metric), there's a minimum number of bits R(D) needed to encode the source. The function R(D) is the rate-distortion function, and it's the fundamental limit for lossy compression.
This framework matters for ML because representation learning is implicitly a compression problem. The information bottleneck principle frames learning as finding a compressed representation Z of input X that retains as much information about output Y as possible — it's rate-distortion with a semantic distortion measure. Neural compression codecs are explicitly trying to approach the rate-distortion bound. And when someone claims their learned representation is "compact" or "efficient," rate-distortion theory is the right framework for making that precise.
Channel Capacity
Shannon's noisy channel coding theorem is one of those results that feels like it shouldn't be true. It says: for any noisy channel, there exists a maximum rate C (the channel capacity) such that below this rate you can transmit information with arbitrarily low error probability — reliable communication is always possible, no matter how noisy the channel — and above this rate, errors are unavoidable. The boundary is sharp.
For ML practitioners, channel capacity shows up most often in theoretical analyses of learning systems — information-theoretic bounds on generalization, the capacity of neural networks as communication channels, and communication-efficient distributed learning where you need to transmit gradients over limited bandwidth. It's also a genuinely beautiful result that makes you appreciate why information theory is its own discipline rather than a chapter of statistics.
Rényi Entropy
Shannon entropy is one member of a parametric family. Rényi entropy of order α is H_α(X) = (1/(1−α)) log Σ p_i^α. At α → 1 you recover Shannon entropy. At α = 0 you get the log of the support size. At α = 2 you get collision entropy, related to the probability that two independent draws from the distribution are equal. At α → ∞ you get min-entropy, determined entirely by the most likely outcome.
Min-entropy is the conservative measure of uncertainty — it answers "how uncertain is the worst case?" rather than "how uncertain is the average case?" This makes it the right measure for cryptographic and privacy applications, where an adversary always gets to pick the best attack. Differential privacy's composition theorems and the moments accountant method use Rényi divergences (the generalization to pairs of distributions) to track privacy budget across multiple adaptive queries. If you're implementing or analyzing differential privacy, you'll encounter Rényi divergences repeatedly.
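A quick computation of Rényi entropies at several orders for a skewed distribution, treating α = 1 as the Shannon limit; the ordering Shannon ≥ collision ≥ min-entropy is visible in the output.

```python
import numpy as np

def renyi_entropy(p, alpha):
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):                        # the Shannon limit
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.7, 0.1, 0.1, 0.05, 0.05])             # a skewed distribution
for alpha in [0.0, 0.5, 1.0, 2.0, 100.0]:
    print(f"alpha={alpha:>6}: H={renyi_entropy(p, alpha):.4f}")

# alpha = 0 gives log(support size); large alpha approaches min-entropy = -log(max p).
print(np.log(len(p)), -np.log(p.max()))
```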
f-Divergences
KL divergence is the most used measure of how different two probability distributions are, but it's one choice among many. An f-divergence D_f(P‖Q) = E_Q[f(dP/dQ)] is defined by a convex function f with f(1) = 0. Choosing different f gives different divergences: f(t) = t log t gives forward KL; f(t) = −log t gives reverse KL; f(t) = (t−1)² gives χ²; f(t) = |t−1|/2 gives total variation. They all measure distributional mismatch but in different ways, penalizing different kinds of discrepancy.
This became practically important when GANs arrived. The original GAN minimizes Jensen-Shannon divergence (an f-divergence). f-GAN showed you can recover any f-divergence as the objective for a minimax game. Different choices lead to different training dynamics and mode-dropping behaviors. Understanding f-divergences gives you a unified view of why different GAN variants exist and what tradeoffs they're making — rather than treating each variant as an ad hoc modification.
Numerical Computing
Mixed Precision Beyond fp16
fp16 training was the first big precision reduction: halve the memory, get roughly the same accuracy if you're careful about gradient scaling. But the industry didn't stop there. INT8 inference packs weights into 8-bit integers — the quantization step rounds continuous weights to a discrete grid, and with careful calibration the accuracy loss is small. Running INT8 also tends to be faster because many hardware units have higher INT8 throughput than fp16.
The more recent development is NF4 (4-bit NormalFloat), introduced with QLoRA. NF4 is designed for weights that follow an approximately normal distribution: it places its 16 quantization levels non-uniformly, with higher density near zero where most weights cluster. This gives better reconstruction error than uniform 4-bit quantization. QLoRA uses NF4 for frozen base model weights and full precision for the trainable LoRA adapters, enabling fine-tuning of 65B models on a single 48 GB GPU. The trend in inference is clear: go as low as you can on bit-width without measurable task degradation, and NF4 plus INT8 are currently the practical endpoints.
Kahan Summation
Floating-point addition is not associative. Adding 10⁶ numbers in a naïve loop accumulates rounding error with each step — the running sum grows large, and small addends get rounded away entirely. For typical ML workloads this is usually fine. For gradient accumulators, long-sequence statistics, or any setting where you're summing many small numbers, it can silently degrade precision in ways that are hard to debug.
Kahan summation fixes this with a two-line idea: keep a separate compensation variable c that tracks the rounding error from the last addition. On each step, you adjust the new term by c before adding, then compute the new c from what got lost. The result is that you accumulate nearly full-precision sums even over millions of additions, at the cost of a handful of extra floating-point operations per step. It's one of those algorithms that feels disproportionately clever for how simple it is to implement.
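Here's the loop itself, deliberately run in float32 against a naive accumulator so the error is visible; in float64 you'd need far more terms to notice.

```python
import numpy as np

def kahan_sum(xs):
    total = np.float32(0.0)
    c = np.float32(0.0)                  # running compensation for lost low-order bits
    for x in xs:
        y = x - c                        # adjust the incoming term by what was lost last time
        t = total + y                    # big + small: the low bits of y may get rounded off
        c = (t - total) - y              # recover exactly what was rounded off
        total = t
    return total

rng = np.random.default_rng(0)
xs = rng.random(1_000_000).astype(np.float32)

naive = np.float32(0.0)
for x in xs:
    naive += x

exact = xs.astype(np.float64).sum()      # reference sum in higher precision
print(abs(naive - exact), abs(kahan_sum(xs) - exact))   # Kahan's error is far smaller
```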
Matrix-Free Methods
The Hessian of a neural network loss has as many rows and columns as the number of parameters. For a model with 100 million parameters, the Hessian is a 10¹⁶-element matrix — you cannot store it, period. But you can compute Hessian-vector products cheaply: run a forward and backward pass to get the gradient, then run a second backward pass through the gradient to get Hv for any vector v. This is the key operation for matrix-free methods.
Conjugate gradient solves linear systems using only matrix-vector products, never the matrix itself. Lanczos iteration computes extreme eigenvalues and eigenvectors the same way. Randomized SVD estimates the top singular vectors via a small number of matrix-vector products. All of these are essential for second-order optimization (KFAC, Hessian-free optimization), for analyzing the loss landscape (computing the Hessian spectrum), and for large-scale kernel methods where the kernel matrix is too large to materialize. The abstraction — "I have an operator, not a matrix" — is a powerful one.
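A conjugate gradient sketch that only ever touches the operator through a matvec closure. Here the closure wraps a matrix we happen to have in memory; in the Hessian case it would wrap a double backward pass instead.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=100, tol=1e-10):
    """Solve Ax = b given only the map v -> Av, never A itself."""
    x = np.zeros_like(b)
    r = b - matvec(x)                    # residual
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)                   # the only access to A
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50)); A = M @ M.T + 50 * np.eye(50)   # SPD stand-in operator
b = rng.normal(size=50)

x = conjugate_gradient(lambda v: A @ v, b)
print(np.allclose(A @ x, b, atol=1e-6))  # True
```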
Structured Matrices
A Toeplitz matrix has constant values along each diagonal — the (i,j) entry depends only on i−j. This structure appears in any shift-invariant linear operation, which is exactly what convolution is. A circulant matrix is a Toeplitz matrix with periodic boundary conditions — the last row wraps around to the first. The key fact about circulant matrices is that they are diagonalized by the discrete Fourier transform: any multiplication by a circulant can be done in O(n log n) using FFT.
This is why computing convolution in the frequency domain is faster for large kernels, why state space models like S4 can process sequences efficiently, and why the eigenvalue analysis of convolutional layers has a clean spectral form. Beyond Toeplitz/circulant, the broader class of structured matrices — including low-displacement-rank matrices, butterfly matrices, and others — captures linear transformations that can be applied in O(n log n) rather than O(n²). This is an active area of neural architecture research: can you replace dense weight matrices with structured ones and retain expressivity while gaining efficiency?
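A quick check that multiplying by a circulant matrix is the same as FFT → pointwise multiply → inverse FFT, which is the O(n log n) route.

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(0)
n = 512
c = rng.normal(size=n)          # first column defines the circulant matrix
x = rng.normal(size=n)

C = circulant(c)                # dense n x n matrix: O(n^2) to multiply
dense = C @ x

# The same product via the DFT: transform, multiply pointwise, transform back.
fft_based = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

print(np.allclose(dense, fft_based))    # True
```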
The Deeper Waters
Everything above is a detour you might take in the first year or two of serious ML work. What follows is further out. These topics show up when you read theory papers carefully, when you work in fields like differential geometry or physics-inspired ML, or when a senior researcher says something in passing that you want to actually understand rather than nod at. I'm not suggesting you study these now. I'm suggesting that when you hear the words, you don't dismiss them as permanently out of reach.
Measure Theory
Probability theory as taught in most ML courses is informal about what a "probability space" actually is. Measure theory makes it rigorous: a probability space is a set Ω, a σ-algebra of measurable events, and a probability measure that assigns values in [0,1] to those events. The σ-algebra is there to handle the subtlety that not every subset of Ω can be assigned a probability consistently — a fact that's obvious once you've seen the paradoxes it prevents.
The payoff is Lebesgue integration, which generalizes the Riemann integral to handle functions that are discontinuous in complicated ways. Expectations in probability are Lebesgue integrals. The phrase "almost surely" — as in "converges almost surely" — means "except on a set of measure zero," which is the measure-theoretic way of saying something holds everywhere that matters. You don't need to do measure theory to do ML. But the first time you read a formal convergence proof and every step cites measure theory, you'll wish you had at least the vocabulary.
Functional Analysis & RKHS
Linear algebra lives in finite-dimensional vector spaces. Functional analysis extends the same ideas — inner products, norms, orthogonality, projections — to infinite-dimensional spaces of functions. A Hilbert space is an inner product space that's complete: every Cauchy sequence converges to a limit that's actually in the space, so there are no "missing points." The space of square-integrable functions is a Hilbert space. So is the space of functions defined by a kernel.
A reproducing kernel Hilbert space (RKHS) is a Hilbert space where point evaluation is a continuous linear functional — which means evaluation has a representer in the space. This is the mathematical foundation for the kernel trick: rather than working in an explicit infinite-dimensional feature space, you work with the kernel function k(x, x') = ⟨φ(x), φ(x')⟩ that implicitly computes inner products. The Representer theorem says that the solution to any regularized kernel learning problem can be written as a finite linear combination of kernels centered at training points — which is why kernel methods with infinite-dimensional features produce finite-dimensional solutions. If you ever want to understand why kernel methods work at a deeper level than "it's dot products in a big space," RKHS is the language.
Topology & Topological Data Analysis
Topology studies properties of spaces that are preserved under continuous deformation — stretching and bending, but not tearing or gluing. A coffee mug and a donut are topologically equivalent (both have one hole); a sphere and a torus are not. Topological data analysis (TDA) applies these ideas to data: it tracks how the "shape" of a point cloud changes as you vary a scale parameter, recording when connected components appear and merge, when loops form and fill, when voids appear and close.
This is called persistent homology, and it produces a "persistence diagram" that summarizes the data's multi-scale topological structure. In ML, TDA has been applied to analyze the loss landscape (are the minima connected? is there a connected manifold of good solutions, which would explain why different training runs find similar models?), to study mode connectivity in neural networks, and as a feature engineering tool for structured data where shape matters. It's still a research-level tool, not a production one, but the language of topology is increasingly appearing in theory papers about neural networks.
Group Theory & Geometric Deep Learning
A group is a set with an associative binary operation, an identity, and inverses for every element. The integers under addition form a group. The rotations of a sphere form a group. The permutations of n elements form a group. Groups formalize the notion of symmetry, and symmetry is everywhere in machine learning.
CNNs are equivariant to translations: translate the input, the feature maps translate correspondingly. This equivariance is the reason CNNs work for images — it hard-codes the translation symmetry of visual scenes into the architecture rather than making the network learn it from data. Geometric deep learning generalizes this: given any symmetry group G, how do you build neural networks that are equivariant to G? E(3)-equivariant GNNs are equivariant to 3D rotations, reflections, and translations — essential for molecular property prediction, where a molecule's properties don't depend on its orientation in space. Lie groups are continuous groups (like the rotation group SO(3)) with differentiable structure, which allows gradient-based optimization over the group. Once you see the framework, the architecture design choices in many modern geometric and physics-inspired models become transparent.
Information Geometry
Probability distributions form a manifold — a smooth curved space. The natural geometry on this manifold is given by the Fisher information matrix, which acts as a Riemannian metric: it defines the notion of distance between nearby distributions in a way that's invariant to reparameterization. This is information geometry, and it's the mathematical foundation for understanding why the natural gradient matters.
Gradient descent moves in parameter space, treating all parameter directions as equally important. But nearby parameters in Euclidean space can correspond to very different distributions, and nearby distributions can correspond to very different parameters. The natural gradient moves in distribution space by premultiplying the gradient by the inverse Fisher matrix, effectively correcting for the curvature of the statistical manifold. It converges faster than ordinary gradient descent in a parameter-invariant sense. K-FAC approximates the natural gradient cheaply. Amari's work on information geometry established the theoretical framework; it remains one of the more beautiful bridges between differential geometry and statistics.
Optimal Transport
Imagine two piles of dirt. You want to reshape one pile into the other. The optimal transport problem asks: what's the minimum total work required, where work is mass times distance moved? The minimum cost is the Wasserstein distance between the two distributions, also called the Earth Mover's Distance. Unlike KL divergence, Wasserstein distance is defined even when the two distributions have non-overlapping support — which matters enormously in practice.
This became prominent in ML with Wasserstein GANs, which replaced the Jensen-Shannon divergence with Wasserstein distance to get smoother gradients and more stable training. But optimal transport is broader: it's used for distribution alignment in domain adaptation, for comparing point clouds in 3D shape analysis, for training set selection, and as a loss function that respects the geometric structure of the output space. The Sinkhorn algorithm computes approximate optimal transport efficiently via entropy regularization, making it practical at the scale of modern ML.
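A minimal Sinkhorn sketch for two histograms on a line with squared-distance cost (the distributions are arbitrary bumps). The alternating row and column scalings are the whole algorithm; the resulting plan's marginals match the inputs by construction, and the regularized cost approaches the true transport cost as the regularization shrinks.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.05, iters=500):
    """Entropy-regularized optimal transport between histograms a and b, cost matrix C."""
    K = np.exp(-C / reg)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                # scale to match the column marginals...
        u = a / (K @ v)                  # ...then the row marginals
    P = u[:, None] * K * v[None, :]      # transport plan
    return P, np.sum(P * C)              # plan and regularized transport cost

# Two piles of dirt on a line, cost = squared distance.
xs = np.linspace(0, 1, 50)
a = np.exp(-((xs - 0.2) ** 2) / 0.01); a /= a.sum()
b = np.exp(-((xs - 0.7) ** 2) / 0.02); b /= b.sum()
C = (xs[:, None] - xs[None, :]) ** 2

P, cost = sinkhorn(a, b, C)
print(np.allclose(P.sum(axis=1), a, atol=1e-6), cost)   # marginals match; cost ≈ squared W2
```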
Category Theory
Category theory is the mathematics of structure and composition. A category consists of objects and morphisms (arrows between objects) with composition rules. A functor maps between categories in a way that preserves structure. A natural transformation maps between functors. The punchline is that category theory captures the essential compositional structure of mathematical objects, abstracted away from the specific details.
This is furthest from daily ML practice, but it's not merely decorative. There's a paper called "Backprop as Functor" that formulates reverse-mode automatic differentiation as a functor between categories — which gives a clean compositional account of why backprop works the way it does. More practically, categorical thinking shows up in the design of differentiable programming frameworks and in understanding the algebraic structure of model composition. I'll be honest: I understand category theory at the "I've read the Wikipedia article several times" level, and I suspect that's fine for most ML work. But knowing it exists and roughly what it claims is worth more than knowing nothing.
Interview Gotchas
A friend once warned me about these before I walked into a technical interview. I'm passing the favor along. These are the questions where the answer feels obvious, and then you say it, and it's wrong.
Eigenvalues, Zero, and Nilpotent Matrices
If all eigenvalues of a matrix are zero, you might think the matrix must be the zero matrix. It isn't. A nilpotent matrix has all zero eigenvalues but is not itself zero — the classic example is a strictly upper-triangular matrix. The eigenvalues being zero tells you about the characteristic polynomial, not the entries of the matrix. Don't confuse "all eigenvalues are zero" with "the matrix is zero."
Zero Covariance Is Not Independence
Zero covariance means no linear relationship. Two variables can have zero covariance and wildly dependent distributions — the classic example is X uniform on [−1, 1] and Y = X². They have zero covariance because the positive and negative parts cancel, but knowing X tells you Y exactly. This is one of those things that everyone says they know and then gets wrong in the moment.
Jensen's Inequality and E[f(X)] ≠ f(E[X])
For a nonlinear function f, the expectation of f(X) is not the same as f applied to the expectation of X. For convex f, E[f(X)] ≥ f(E[X]) — this is Jensen's inequality. For concave f, the inequality flips. This comes up constantly: the expected loss under stochastic training is not the loss at the expected parameters, the expected prediction under an ensemble is not the prediction at the mean weights, the expected perplexity is not the perplexity at the expected distribution. When you see a paper "lower bound" something using Jensen, this is what they're doing.
Medical Tests and Base Rates
A test for a rare disease is 99% accurate (1% false positive rate, 1% false negative rate). You test positive. What's the probability you have the disease? Most people say "99%." The answer depends entirely on the base rate of the disease. If one in 10,000 people have it, most positive tests are false positives — even a very accurate test. This is Bayes' theorem applied to medical testing, and the failure mode is ignoring the prior. The practical lesson for ML: when your classifier says "positive," ask how rare the positive class is before trusting it.
PCA and Uniform Variance
PCA finds directions of maximum variance. If all features have equal variance and all directions are equally informative, PCA will find... something, but it won't be meaningful. The principal components will be sensitive to random noise rather than capturing signal. The technique assumes there's genuine variance structure to find. When the data is isotropic — when variance really is uniform in all directions — PCA's output is essentially arbitrary, and using it for dimensionality reduction will hurt rather than help.
The Pseudoinverse of a Full-Rank Square Matrix
If A is square and full-rank, then A⁺ = A⁻¹. The pseudoinverse reduces to the ordinary inverse in exactly the case where the ordinary inverse exists. This is a common interview question in the "here's a formula, what does it reduce to in this special case?" category. Knowing that A⁺ equals A⁻¹ when the inverse exists is a sign that you understand the pseudoinverse rather than having memorized its definition.
The Softmax Jacobian
The Jacobian of the softmax function σ(z) has a specific structure worth knowing. The diagonal entries are σᵢ(1 − σᵢ) and the off-diagonal entries are −σᵢσⱼ. In matrix form: J = diag(σ) − σσᵀ. The matrix is symmetric, rank-deficient (it's rank n−1), and positive semidefinite. The rank deficiency comes from the fact that softmax outputs always sum to one, so there's one direction in which the Jacobian is identically zero. When you're deriving cross-entropy gradient updates by hand or debugging a custom softmax implementation, this structure tells you exactly what to expect.
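A numeric check of that structure: build diag(σ) − σσᵀ, compare it against a finite-difference Jacobian, and confirm both the rank and the direction of zero sensitivity (shifting all logits by a constant).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)
s = softmax(z)

J_analytic = np.diag(s) - np.outer(s, s)           # diag(sigma) - sigma sigma^T

# Finite-difference Jacobian for comparison: column i is the response to nudging z_i.
eps = 1e-6
J_fd = np.stack([(softmax(z + eps * np.eye(5)[i]) - softmax(z - eps * np.eye(5)[i])) / (2 * eps)
                 for i in range(5)], axis=1)

print(np.allclose(J_analytic, J_fd, atol=1e-6))    # True
print(np.linalg.matrix_rank(J_analytic))           # 4, i.e. n - 1
print(np.allclose(J_analytic @ np.ones(5), 0))     # True: constant logit shifts change nothing
```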