Nice to Know — Mathematical Frontiers

Chapter 17: Learning Theory & Advanced ML

I avoided these topics for a long time. Each one lives at the intersection of pure mathematics and deep learning, and every time I'd see phrases like "Wasserstein distance" or "algorithmic information theory" in a paper, I'd nod knowingly and keep scrolling. Finally the discomfort of pretending grew too great. Here is what I found.

These ideas won't change your PyTorch code tomorrow. But they form the theoretical bedrock beneath the things that do change your code — scaling decisions, pruning strategies, model selection, understanding why overparameterized networks generalize at all. When these come up in a paper, a conference talk, or an interview with a senior researcher, you'll know what people are actually arguing about.

Before we start, a heads-up. We'll touch on information theory, complexity theory, and some heavy geometry. You don't need any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

The landscape

Optimal Transport and the Wasserstein Distance
The Sinkhorn Shortcut
The Information Bottleneck
Minimum Description Length
Algorithmic Information Theory
Computational Complexity of Learning
The Lottery Ticket Hypothesis, Revisited
Neural Collapse
Feature Learning Theory
Scaling Law Theory
Universality in Deep Learning
Rest Stop
Resources

Optimal Transport and the Wasserstein Distance

Here's a problem that predates machine learning by about two centuries. A French mathematician named Gaspard Monge posed it in 1781: you have a pile of sand, and you need to reshape it into a different pile somewhere else. What's the cheapest way to move the sand, measured by total weight times distance carried?

That's optimal transport. The minimum cost of reshaping one distribution of mass into another.

Now replace "pile of sand" with "probability distribution." You have a distribution P (maybe the distribution of generated images from your GAN) and a distribution Q (the distribution of real images). You want to measure how different P and Q are. The classic tools — KL divergence, JS divergence — compare how much the distributions overlap. But here's the problem: if P and Q don't overlap at all (which happens constantly during early GAN training, when the generator's output is garbage), KL divergence is infinite and JS divergence is a flat constant. Neither gives you a useful gradient. The loss landscape is a plateau, and your optimizer wanders aimlessly.

The Wasserstein distance, also called the Earth Mover's Distance, doesn't have this problem. It measures the minimum total "work" needed to morph P into Q, where work means amount of probability mass moved times the distance it travels. Even when P and Q are completely disjoint, the Wasserstein distance is finite and meaningful — it tells you how far apart the distributions are, not whether they happen to share the same support.

Think of it this way. KL divergence is like asking "do these two piles of sand occupy the same space?" If the answer is no, the measurement is useless. Wasserstein distance is like asking "how many truck-loads would it take to turn one pile into the other?" That's always a meaningful question, no matter where the piles sit.
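If you'd rather see the contrast than take my word for it, here's a minimal sketch, assuming NumPy and SciPy are available, that compares the two measurements on sample sets with completely disjoint support. The specific distributions are made up; only the contrast matters.

```python
import numpy as np
from scipy.stats import wasserstein_distance, entropy

rng = np.random.default_rng(0)

# Two sample sets with completely disjoint support:
# P sits near 0, Q sits near 10.
p_samples = rng.normal(loc=0.0, scale=0.1, size=1000)
q_samples = rng.normal(loc=10.0, scale=0.1, size=1000)

# Wasserstein distance stays finite and informative: roughly the distance
# the mass has to travel, so about 10 here.
print("Wasserstein:", wasserstein_distance(p_samples, q_samples))

# KL divergence between histograms of the same samples blows up: wherever
# P has mass and Q has (essentially) none, the log-ratio explodes. The tiny
# epsilon below is only there to avoid dividing by zero.
bins = np.linspace(-1.0, 11.0, 50)
p_hist, _ = np.histogram(p_samples, bins=bins, density=True)
q_hist, _ = np.histogram(q_samples, bins=bins, density=True)
print("KL:", entropy(p_hist + 1e-12, q_hist + 1e-12))  # enormous, not a useful signal
```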

This is why Wasserstein GANs (WGANs) use it as their loss function. It gave GANs a smooth, informative gradient everywhere, not just where the two distributions happened to overlap. It also shows up in domain adaptation (matching feature distributions between source and target domains) and fairness (measuring how far a model's predictions differ across groups).

The catch? Computing the exact Wasserstein distance requires solving a linear program, and that gets expensive fast — roughly cubic in the number of points. For anything at modern scale, you need a shortcut.

The Sinkhorn Shortcut

The Sinkhorn algorithm provides that shortcut. The idea is to add a small entropy term to the transport problem, which makes it slightly "fuzzier." Instead of finding the single cheapest way to move mass, you allow a spread of nearly-cheap ways, weighted by how orderly they are.

What does that buy you? The entropic version can be solved by alternately normalizing the rows and columns of a matrix — an operation that's fast, parallelizable, and differentiable. That last property is crucial: differentiable means you can backpropagate through it, which means you can plug it into a neural network's loss function.

The entropy parameter ε controls the trade-off. Large ε gives you a very blurry approximation that's quick to compute. Small ε gives you something close to the true Wasserstein distance but takes more iterations. In practice, people tune ε the same way they tune any hyperparameter — start coarse, refine as needed.

I'll be honest — the first time I saw the Sinkhorn iterations described as "alternating row and column normalization of a kernel matrix," I didn't believe something that simple could approximate an optimization problem that old. But it does. And it's become the go-to computational tool for optimal transport in machine learning, showing up in everything from generative models to single-cell genomics alignment.
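If you share my initial disbelief, here's a bare-bones NumPy sketch of the iterations. The toy cost matrix and the value of ε are mine, chosen for readability rather than realism.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport between histograms a and b.

    a, b : 1-D arrays summing to 1 (the two distributions)
    C    : cost matrix, C[i, j] = cost of moving mass from bin i to bin j
    eps  : entropy regularization (smaller = closer to the true Wasserstein cost)
    """
    K = np.exp(-C / eps)                 # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # rescale rows to match marginal a
        v = b / (K.T @ u)                # rescale columns to match marginal b
    P = u[:, None] * K * v[None, :]      # the transport plan
    return float(np.sum(P * C)), P       # regularized transport cost

# Toy example: move all the mass from one end of a line to the other.
n = 5
a = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
x = np.arange(n, dtype=float)
C = np.abs(x[:, None] - x[None, :])      # ground cost |i - j|
cost, plan = sinkhorn(a, b, C, eps=0.05)
print(cost)  # roughly 4.0: everything travels from bin 0 to bin 4
```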

The Information Bottleneck

Naftali Tishby introduced the Information Bottleneck (IB) method in 1999, and it spent years as a niche idea in information theory. Then in 2015, Tishby applied it to deep learning and caused a genuine stir.

The core question is deceptively clean: given an input X and a target Y, what is the most compressed representation T of X that still preserves everything you need to predict Y?

Formally, you're minimizing I(T; X) — the mutual information between your representation and the input — while maximizing I(T; Y) — the mutual information between your representation and the label. A parameter β controls the trade-off. Crank β up and you demand better prediction at the cost of keeping more input details. Dial it down and you demand aggressive compression.
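Here's that objective as a small discrete sketch in NumPy. The toy joint distribution and the two hand-built encoders are my own, purely for illustration: only the parity of X matters for Y, so an encoder that keeps just the parity should score better.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A; B) in nats for a joint distribution given as a 2-D array."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    ratio = np.where(p_joint > 0, p_joint / (p_a * p_b), 1.0)
    return float(np.sum(p_joint * np.log(ratio)))

def ib_objective(p_xy, q_t_given_x, beta):
    """Information Bottleneck Lagrangian I(T;X) - beta * I(T;Y); lower is better."""
    p_x = p_xy.sum(axis=1)               # p(x)
    p_xt = p_x[:, None] * q_t_given_x    # joint p(x, t)
    p_ty = q_t_given_x.T @ p_xy          # joint p(t, y)
    i_tx = mutual_information(p_xt)
    i_ty = mutual_information(p_ty)
    return i_tx - beta * i_ty, i_tx, i_ty

# Toy problem: X has 4 states, Y has 2, and only the parity of X predicts Y.
p_xy = np.array([[0.24, 0.01],
                 [0.01, 0.24],
                 [0.24, 0.01],
                 [0.01, 0.24]])

q_identity = np.eye(4)                   # encoder that keeps everything about X
q_parity = np.array([[1, 0], [0, 1],     # encoder that keeps only the parity
                     [1, 0], [0, 1]], dtype=float)

for name, q in [("identity", q_identity), ("parity", q_parity)]:
    obj, i_tx, i_ty = ib_objective(p_xy, q, beta=5.0)
    print(f"{name:8s}  I(T;X)={i_tx:.3f}  I(T;Y)={i_ty:.3f}  objective={obj:.3f}")
# The parity encoder keeps the same I(T;Y) at half the I(T;X), so it wins.
```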

The controversial claim is that deep networks do this automatically during training, in two phases. First comes a fitting phase: the network rapidly learns to predict Y, and both I(T; X) and I(T; Y) increase as the layers soak up information. Then comes a compression phase: the network starts "forgetting" details about X that don't help predict Y. Mutual information I(T; X) decreases. The network is distilling input noise away, keeping only signal.

I should be upfront: this is still debated. Some researchers have shown the compression phase depends heavily on the activation function used (it appears clearly with saturating activations like tanh, less clearly with ReLU). Others question whether mutual information in high-dimensional continuous spaces can even be reliably estimated. The information bottleneck is a beautiful framework that sometimes explains what networks do. Whether it's a universal law or a special case — that's an open question.

What's not debated is the intuition it provides. The best representations are minimal sufficient statistics — they throw away everything about the input except what matters for the task. Any senior engineer who's ever said "your model is memorizing noise" is making an information bottleneck argument, whether they know it or not.

Minimum Description Length

Occam's Razor says prefer the simpler explanation. The Minimum Description Length (MDL) principle makes that quantitative.

The idea: the best model for a dataset is the one that produces the shortest total description, where total description means the length (in bits) of describing the model itself plus the length of describing the data given that model.

A concrete analogy. Imagine you're sending a friend a sequence of 1,000 numbers. Option A: transmit all 1,000 numbers verbatim. That costs 1,000 units. Option B: notice the numbers follow a pattern — say, the squares of 1 through 1,000 — and send the rule "square the index" plus the five numbers that don't quite fit. That might cost 10 + 5 = 15 units. Option C: fit a degree-999 polynomial that hits every point exactly. The polynomial coefficients alone cost nearly as much as the raw data. MDL picks Option B. It balances model complexity against residual error.

This is exactly the bias-variance trade-off, reframed through a compression lens. An overfit model has low data-given-model cost but high model cost (you need to specify all those parameters). An underfit model has low model cost but high residual cost (lots of unexplained data). MDL finds the sweet spot.
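Here's the analogy turned into a toy calculation. The sketch below uses crude assumptions of my own (32 bits per coefficient, a Gaussian code for the residuals, synthetic quadratic data); the point is only that the two-part code length bottoms out at the right complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data generated by a quadratic rule plus a little noise.
x = np.linspace(-1, 1, 200)
y = 3 * x**2 - x + rng.normal(scale=0.1, size=x.size)

BITS_PER_PARAM = 32  # crude cost of describing one polynomial coefficient

def description_length(degree):
    """Two-part code length in bits: L(model) + L(data | model).

    Code lengths are relative (a constant from discretizing y is dropped),
    so only differences between degrees matter.
    """
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    var = max(residuals.var(), 1e-12)
    model_bits = (degree + 1) * BITS_PER_PARAM
    data_bits = 0.5 * x.size * np.log2(2 * np.pi * np.e * var)  # Gaussian residual code
    return model_bits + data_bits

for d in [0, 1, 2, 3, 6, 10]:
    print(f"degree {d:2d}: {description_length(d):9.1f} bits")
# The minimum lands at degree 2: higher degrees shave a little off the
# residual term but pay more to describe themselves.
```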

The connection to regularization is direct. L1 and L2 penalties on model weights? Those are rough proxies for the "cost of describing the model" term. MDL gives them a principled information-theoretic justification.

Algorithmic Information Theory

MDL is practical but approximate. The theoretical ideal behind it is Kolmogorov complexity, the centerpiece of Algorithmic Information Theory.

The Kolmogorov complexity of a string x is the length of the shortest computer program that produces x as output. A 1,000-bit string consisting of "01" repeated 500 times has low Kolmogorov complexity — a short loop generates it. A truly random string of 1,000 bits has Kolmogorov complexity close to 1,000 — you can't do better than writing it out.

This is the purest possible measure of the "information content" of data. A pattern-rich dataset is compressible and has low algorithmic complexity. A noisy dataset is incompressible and has high complexity. The ideal model, in an MDL sense, is the shortest program that generates the data — which is exactly the Kolmogorov complexity.

There's a beautiful catch: Kolmogorov complexity is uncomputable. You cannot write a program that calculates it for arbitrary inputs. This isn't a practical limitation that faster hardware will solve — it's a proven mathematical impossibility, related to the halting problem. So every compression scheme, every model selection criterion, every regularization penalty is an approximation to a quantity we can never exactly know.
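What we can compute are upper bounds. Any off-the-shelf compressor is one particular short program that reproduces the data, so its output length bounds the Kolmogorov complexity from above. A tiny sketch with Python's standard-library zlib:

```python
import os
import zlib

patterned = b"01" * 500          # a highly regular 1,000-byte string
random_ish = os.urandom(1000)    # 1,000 bytes from the OS entropy source

for name, data in [("patterned", patterned), ("random", random_ish)]:
    compressed = len(zlib.compress(data, level=9))
    print(f"{name:9s}: {len(data)} bytes -> {compressed} bytes compressed")
# The patterned string shrinks to a few dozen bytes; the random one barely
# shrinks at all. Each compressed size is a computable upper bound on the
# uncomputable Kolmogorov complexity, never the exact value.
```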

I find this both humbling and clarifying. Every time we argue about which regularizer is "right," we're arguing about which approximation to an uncomputable ideal is most useful in practice. That reframes the whole debate.

Computational Complexity of Learning

Learning theory doesn't only ask "how many samples do we need?" It also asks "how much computation do we need?"

In the PAC (Probably Approximately Correct) framework, a concept class is efficiently learnable if there's an algorithm that, given enough samples, outputs a good hypothesis in polynomial time. But some concept classes are provably hard to learn efficiently, unless you're willing to believe P = NP.

Here's the crux. Proper learning requires the algorithm to output a hypothesis from the same class it's trying to learn. If finding a good hypothesis within that class is NP-hard — and it is for many natural classes, including decision trees, DNF formulas, and intersections of halfspaces — then proper learning is intractable: assuming P ≠ NP, no polynomial-time algorithm can do it.

Improper learning relaxes this. The algorithm can output a hypothesis from a different, possibly richer class. Sometimes this makes the problem tractable. A decision tree is hard to learn properly, but a neural network that mimics its behavior might be learnable efficiently — because the network isn't constrained to be a decision tree.

There are also cryptographic hardness results. Under standard cryptographic assumptions (like the hardness of factoring or learning with errors), certain learning problems are provably intractable even for improper learners. These results draw a hard line: no amount of algorithmic cleverness will solve them in polynomial time.

The practical takeaway: when your model class struggles on certain data distributions, sometimes the problem isn't your optimizer or your hyperparameters. Sometimes the problem is computationally hard in a fundamental sense, and the right response is to change the problem formulation, not train longer.

The Lottery Ticket Hypothesis, Revisited

Jonathan Frankle and Michael Carbin proposed the Lottery Ticket Hypothesis in 2018: within a large, randomly initialized neural network, there exists a much smaller subnetwork — a "winning ticket" — that, when trained in isolation from the same initialization, matches the full network's performance.

The original finding was empirical. Train a big network. Prune the smallest-magnitude weights. Reset the survivors to their original initialization values. Retrain. The sparse network matches the dense one. The metaphor is buying lottery tickets — most random initializations lose, but a few happen to be perfect starting points.
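Here's that recipe as a toy PyTorch sketch. The tiny MLP, the synthetic data, and the 80% pruning ratio are stand-ins I picked for illustration, not the paper's setup, but the train, prune, reset, retrain loop has the same shape.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in task: classify by the sign of the first two features.
X = torch.randn(512, 20)
y = (X[:, :2].sum(dim=1) > 0).long()
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
init_state = copy.deepcopy(model.state_dict())     # remember the "ticket numbers"

def train(model, mask=None, steps=500):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
        if mask is not None:                        # keep pruned weights at zero
            with torch.no_grad():
                for name, p in model.named_parameters():
                    if name in mask:
                        p *= mask[name]
    return (model(X).argmax(dim=1) == y).float().mean().item()

# 1) Train the dense network.
print("dense accuracy: ", train(model))

# 2) Prune: keep only the largest-magnitude 20% of each weight matrix.
mask = {}
for name, p in model.named_parameters():
    if p.dim() > 1:                                 # prune weight matrices, not biases
        threshold = p.abs().flatten().quantile(0.8)
        mask[name] = (p.abs() >= threshold).float()

# 3) Reset the survivors to their ORIGINAL initialization and retrain.
model.load_state_dict(init_state)
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in mask:
            p *= mask[name]
print("sparse ticket accuracy:", train(model, mask=mask))
```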

What's happened since is the theoretical machinery catching up.

The Strong Lottery Ticket Hypothesis goes further: a sufficiently large random network contains a subnetwork that performs well without any training at all. You don't retrain — you find the right mask and you're done. This sounds almost magical, but there's a probabilistic argument behind it: with enough random weights, some subset will, by chance, already approximate any target function. It's the infinite monkey theorem, but for neural networks.

Pruning at initialization methods like SNIP and GraSP try to find winning tickets before any training happens, using one-shot saliency scores to decide which connections matter. The appeal is obvious — if you can identify the winning ticket upfront, you never train the full network, saving enormous compute.

The honest summary of the theory, as of 2024, is that it's incomplete. The hypothesis holds well for CNNs and some transformer architectures. For very large or very deep networks, finding reliable tickets at initialization remains hard. The connection between lottery tickets and signal propagation at initialization (networks near the "edge of chaos") is promising but not yet a complete theory.

What excites me most is the implication: overparameterization might not be wasteful. The extra parameters aren't doing useful work directly — they're creating a combinatorial space large enough that a good subnetwork exists somewhere inside it.

Neural Collapse

This one genuinely surprised me when I first encountered it.

Neural collapse is a phenomenon observed in the terminal phase of training — that window after the training loss has essentially hit zero but you keep training anyway. In this phase, the network's internal representations develop a remarkably rigid geometric structure, independently of the architecture, dataset, or optimizer.

Four things happen simultaneously, and they happen reliably across a striking range of settings:

First, within-class features collapse to their class mean. Every image of a cat, no matter how different, maps to nearly the same point in feature space at the penultimate layer. The within-class variance goes to zero.

Second, these class means arrange themselves into a simplex Equiangular Tight Frame (ETF). If you have C classes, the C mean vectors become equiangular — the angle between any two of them is exactly the same — and they're maximally spread apart. For 3 classes, this is an equilateral triangle. For 4 classes, a regular tetrahedron. For C classes, the (C−1)-dimensional generalization of "as far apart as geometrically possible while still being equally spaced."

Third, the classifier's weight vectors align with these class means. The last linear layer's weight matrix converges to the same simplex ETF structure as the features.

Fourth, the network's decision is reduced to a nearest-class-mean classifier. All the complex nonlinear computation in dozens of layers collapses to: "which class prototype is this closest to?"

This is wild. A network with millions of parameters and dozens of nonlinear layers spontaneously organizes its final representation into the most symmetric possible arrangement. It's as if the optimization process, left to run long enough, discovers the same geometric sweet spot every time — regardless of how it started or what architecture it's running in.

I'm still developing my intuition for why this happens so universally, but the leading explanation connects it to the cross-entropy loss and overparameterization. When the network has enough capacity and the loss drives features to be maximally separable, the simplex ETF is the mathematically optimal configuration. The network doesn't "know" geometry — it's gradient descent finding the global optimum of a well-behaved landscape.
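If you want to touch the geometry directly, the simplex ETF is easy to build. A short NumPy sketch using the standard construction (a scaled, centered identity matrix) shows that all C class-mean directions meet at exactly the same angle, with cosine −1/(C−1):

```python
import numpy as np

def simplex_etf(C):
    """C unit vectors forming a simplex Equiangular Tight Frame.

    Standard construction: columns of sqrt(C/(C-1)) * (I - ones/C),
    which span only a (C-1)-dimensional subspace.
    """
    return np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)

C = 4
M = simplex_etf(C)
print(np.round(np.linalg.norm(M, axis=0), 3))  # every column has norm 1
print(np.round(M.T @ M, 3))                    # pairwise cosine similarities
# Off-diagonal entries are all -1/(C-1) = -0.333...: every pair of class
# means sits at the same angle, maximally spread for C equal-norm vectors.
```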

Feature Learning Theory

One of the deepest questions in deep learning theory is deceptively simple: do neural networks actually learn features, or do they mostly rely on the random features they were born with?

The Neural Tangent Kernel (NTK) regime, also called the lazy regime, describes very wide networks where the answer is "they barely move." In this limit, the weights change so little during training that the network is essentially performing kernel regression with a fixed kernel determined by its initialization. The features don't evolve. The network is "lazy" — it combines its initial random features in clever ways but never creates new ones.

The NTK regime is theoretically elegant. Training dynamics become linear. Convergence guarantees are clean. But there's a problem: NTK predictions don't match what real networks do. Real networks at practical widths learn rich, hierarchical features — edges, textures, objects, concepts — that weren't present at initialization. The lazy theory predicts none of this.

The rich regime, also called the feature learning regime or the mean-field regime, describes what actually happens in practice. Weights move significantly. Features evolve. The network develops new representations tailored to the data. This is the regime that produces the spectacular results we associate with deep learning.
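One rough way to see which regime a given network is in: measure how far its weights travel during training. The toy diagnostic below, with a synthetic target and widths I chose arbitrarily, tends to show the relative movement shrinking as width grows under the standard parameterization, which is the direction of the lazy limit.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = torch.sin(X.sum(dim=1, keepdim=True))   # an arbitrary nonlinear target

def relative_weight_movement(width, steps=2000, lr=0.05):
    net = nn.Sequential(nn.Linear(10, width), nn.ReLU(), nn.Linear(width, 1))
    w0 = [p.detach().clone() for p in net.parameters()]
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((net(X) - y) ** 2).mean().backward()
        opt.step()
    moved = sum(((p - q) ** 2).sum() for p, q in zip(net.parameters(), w0))
    start = sum((q ** 2).sum() for q in w0)
    return (moved / start).sqrt().item()

for width in [16, 256, 4096]:
    print(f"width {width:5d}: relative weight movement "
          f"{relative_weight_movement(width):.4f}")
# Wider networks need to move their weights proportionally less to fit the
# same data; in the infinite-width limit they barely move at all.
```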

The catch is that the rich regime is much harder to analyze mathematically. The dynamics are nonlinear. The loss landscape is non-convex. The features interact in complex ways. We're still building the theoretical tools to properly characterize it.

Greg Yang's work on the "maximal update parameterization" (μP) bridges this gap somewhat — it provides a principled way to set learning rates and initialization so that wide networks operate in the feature learning regime rather than the lazy one. This has direct practical implications: it means hyperparameters tuned on small models can transfer to large ones, if you parameterize things correctly.

The big picture: the NTK was a breakthrough in our ability to prove things about neural networks, but the things it proves are about a regime that real networks don't inhabit. Feature learning theory is trying to close that gap, and it's one of the most active frontiers in ML theory today.

Scaling Law Theory

In 2020, Jared Kaplan and colleagues at OpenAI published a finding that reshaped how the industry thinks about training: the test loss of neural language models follows a power law in model size, dataset size, and compute.

A power law means: Loss ∝ N^(−α), where N is the number of parameters (or tokens, or FLOPs) and α is a small positive exponent, typically around 0.05 to 0.1. Double your compute, and your loss drops by a predictable, modest amount. There's no cliff where things suddenly work, and no ceiling where improvements stop — it's a smooth, diminishing curve stretching across orders of magnitude.

The original Kaplan scaling laws suggested making models as large as possible for a given compute budget, even if it meant training on less data. DeepMind's Chinchilla paper (2022) challenged this: they showed that the optimal strategy is to balance model size and data size. For a fixed compute budget, a smaller model trained on more data beats a larger model trained on less. Their rule of thumb: about 20 tokens per parameter is the sweet spot.

This wasn't a minor correction — it changed the industry's scaling strategy. Models like LLaMA followed the Chinchilla recipe and achieved strong results with fewer parameters than their predecessors.
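To turn the rule of thumb into numbers, here's a back-of-the-envelope sizing sketch. It leans on the common approximation that training costs about 6 × parameters × tokens FLOPs, which is an assumption I'm importing, not something established above.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a compute budget into parameters and tokens, Chinchilla-style.

    Assumes training cost ~= 6 * params * tokens FLOPs and the ~20
    tokens-per-parameter rule of thumb.
    """
    params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    tokens = tokens_per_param * params
    return params, tokens

for budget in [1e21, 5.76e23, 1e25]:    # hypothetical compute budgets in FLOPs
    n, d = chinchilla_optimal(budget)
    print(f"compute {budget:.2e} FLOPs -> ~{n:,.0f} params, ~{d:,.0f} tokens")
# The middle budget lands near 70 billion parameters and 1.4 trillion tokens,
# roughly the configuration Chinchilla itself was trained with.
```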

The theoretical question — why do power laws appear at all? — is largely unanswered. Power laws show up across physics (turbulence, earthquake magnitudes, city sizes) and seem to emerge whenever there's some kind of scale invariance in the underlying process. Some researchers link them to the fractal structure of natural language. Others point to the statistical properties of high-dimensional optimization. My favorite thing about scaling laws is that, aside from the empirical curves, no one is completely certain why they work so well. We have a recipe, but not the chemistry.

Recent work (2023–2024) has refined the picture further. The exponents vary by domain — vision models scale differently than language models. Data quality matters at least as much as data quantity. And at extreme scales, some researchers observe "breaks" in the power law, possibly because the model is approaching the irreducible entropy of the data — the point where there's nothing more to learn.

Universality in Deep Learning

The phenomena in this section are, collectively, the strangest things in modern machine learning. They suggest that deep learning obeys laws we don't yet fully understand.

Double descent. Classical statistics teaches that as models get more complex, test error first decreases (less underfitting) then increases (more overfitting). A clean U-shaped curve. But modern deep networks break this. If you keep increasing model size past the interpolation threshold — the point where the model has exactly enough capacity to fit every training point — test error starts decreasing again. The U-shape has a second dip. Overparameterized models, the ones with far more parameters than training examples, generalize better than models with "just enough" parameters. This violates everything classical statistics predicts, and we're still working out exactly why. The leading theories involve implicit regularization by gradient descent and the smoothness of interpolating solutions in high dimensions.
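The cleanest place I know to watch that second dip appear is a random-features model: random ReLU features followed by minimum-norm least squares, which is the solution gradient descent converges to from zero initialization. A sketch on synthetic data of my own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher: a linear function of a 30-dimensional input, plus label noise.
d, n_train, n_test = 30, 100, 2000
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true

def test_error(n_features):
    """Random ReLU features + minimum-norm least squares."""
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    features = lambda X: np.maximum(X @ W, 0.0)
    beta = np.linalg.pinv(features(X_train)) @ y_train   # min-norm interpolator
    return np.mean((features(X_test) @ beta - y_test) ** 2)

for p in [10, 50, 90, 100, 110, 200, 1000, 5000]:
    print(f"{p:5d} random features: test MSE {test_error(p):10.2f}")
# Test error climbs as p approaches n_train = 100 (the interpolation
# threshold), spikes near it, then falls again as the model becomes heavily
# overparameterized: the second descent.
```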

Grokking. In 2022, researchers at OpenAI observed something bizarre on small algorithmic tasks: a model would memorize the training data quickly (near-perfect train accuracy, terrible test accuracy) and then, after vastly more training — sometimes 100x more epochs — suddenly generalize. Test accuracy would jump from near-chance to near-perfect, long after you'd normally stop training. The model "groks" the underlying algorithm. The name stuck because it captures the feel perfectly: a long period of confusion followed by sudden insight. It appears to be related to the network transitioning from memorized lookup tables to genuine algorithmic representations, driven by weight decay slowly eroding the memorized solution.

Emergent abilities. Large language models, when scaled up, sometimes acquire capabilities that weren't present at smaller scales — and the transition is sharp rather than gradual. A model that can't do three-digit arithmetic at 10 billion parameters might suddenly handle it at 100 billion. This looks like a phase transition in physics: a qualitative change in behavior at a critical threshold. Whether these are "real" emergence or an artifact of how we measure performance (some recent papers argue the sharp transitions disappear with better metrics) is actively debated. But the empirical pattern, at minimum, is that scale buys you capabilities that extrapolation from smaller models wouldn't predict.

Taken together, double descent, grokking, and emergent abilities all point toward the same uncomfortable truth: our classical intuitions about learning — more parameters means more overfitting, longer training means diminishing returns, bigger models are predictable from smaller ones — are wrong in ways we don't fully understand. Deep learning theory is still catching up to deep learning practice.

Rest Stop

If you've made it this far, congratulations. You now have a working mental model for the mathematical frontier of deep learning theory — optimal transport for comparing distributions, information bottleneck for understanding compression, MDL and Kolmogorov complexity for formalizing simplicity, computational hardness for knowing when a problem is fundamentally intractable, lottery tickets for why overparameterization might be a feature rather than a bug, neural collapse for the geometry that emerges in trained networks, feature learning theory for the gap between what we can prove and what networks actually do, scaling laws for predicting performance, and universality phenomena for the things that defy prediction entirely.

You can stop here. This mental map is enough to follow research talks, read paper abstracts without flinching, and hold your own in conversations about why deep learning works (or why we don't know why it works).

But if the discomfort of not having built the connections between these ideas is nagging at you — if you're noticing that MDL connects to the information bottleneck, that neural collapse might explain why overparameterized models generalize, that scaling laws and universality phenomena might be two views of the same underlying structure — then I'd encourage you to dive deeper into the resources below.

Resources

A handful of resources that genuinely helped me connect these ideas:

If you're still with me, thank you. We've walked from 18th-century sand-moving problems to phenomena that were discovered last year, from ideas with rigorous proofs to observations that defy our current theoretical tools. The thread connecting them all is the same question: what does it mean for a model to learn?

My hope is that the next time you see "Wasserstein distance" in a GAN paper, or "neural collapse" in a representation learning talk, or "grokking" in a Twitter thread, instead of that familiar impulse to nod and scroll past, you'll have a pretty good mental model of what's going on under the hood — and you'll know exactly where to dig deeper.