Learning Theory Foundations
I avoided learning theory for longer than I'd like to admit. Every time someone brought up "VC dimension" or "PAC bounds" in a meeting, I'd nod along and silently change the subject back to hyperparameter tuning. The math looked impenetrable. The notation was hostile. And the practical payoff felt questionable — after all, I was already training models that worked. Why bother with theory that seemed designed for a 1990s textbook?
Then I started hitting walls. I couldn't explain why my 100-million-parameter model generalized better than my 1-million-parameter model. I couldn't answer "how much data do we need?" without hand-waving. And I definitely couldn't explain why deep learning works at all, given that everything classical theory says should make it fail. The discomfort of not knowing grew too great. Here is that dive.
Learning theory is the mathematical framework behind a deceptively simple question: when you train a model on some data, why should it work on data it hasn't seen? It was formalized starting in the 1980s, and has gone through a quiet revolution in the last five years as researchers scramble to explain why modern deep learning defies the predictions of classical theory.
Before we start, a heads-up. We're going to be working with some probability, some set theory, and a bit of abstract reasoning about function classes. You don't need any of it beforehand. We'll build what we need, one piece at a time, with explanation. This isn't a short journey, but I hope you'll be glad you came.
No Free Lunch
PAC Learning
VC Dimension
Generalization Bounds
Rademacher Complexity
The Bias-Complexity Tradeoff
Rest Stop
Double Descent
Benign Overfitting
The Neural Tangent Kernel
Implicit Regularization
Grokking
Why Deep Learning Works
Resources
No Free Lunch
Let's start with something humbling. Imagine you're building a spam classifier. You've got a training set of 500 emails, and you're trying to decide which algorithm to use. A colleague insists random forests are always the best. Another swears by logistic regression. Who's right?
Neither, in general. And this isn't a diplomatic dodge — it's a theorem.
In 1996, David Wolpert proved the No Free Lunch theorem. The idea is this: take any learning algorithm — neural networks, decision trees, SVMs, anything. Now average its performance across every possible problem (every conceivable mapping from inputs to outputs). That average performance is the same for every algorithm. No exceptions. If your algorithm does better on some class of problems, it must do worse on some other class, and the gains and losses cancel out perfectly.
Think of it like a restaurant that claims its dishes are the best in the world. Maybe the pasta is incredible. But if you average the quality across every possible cuisine — Mongolian, Brazilian, Ethiopian, molecular gastronomy — no restaurant beats any other. The claim "we are the best" is only meaningful for a specific type of food.
For our spam classifier, No Free Lunch says: the reason random forests work well on your email data has nothing to do with random forests being universally superior. It has everything to do with the structure of your particular problem matching the assumptions that random forests make. Change the problem, and the ranking of algorithms changes too.
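You can actually watch this averaging happen. Here's a toy sketch with a deliberately tiny universe of eight possible emails and two made-up learners (the names and setup are mine, purely for illustration): one predicts the majority training label for every unseen email, the other perversely predicts the minority label. Averaged over every conceivable labeling of that universe, they tie exactly.

import numpy as np
from itertools import product

# A tiny universe: 8 possible emails; the learner sees labels for 4 and is judged on the other 4
train_idx, test_idx = [0, 1, 2, 3], [4, 5, 6, 7]

def majority_learner(train_labels):
    """Predict the majority training label for every unseen email."""
    return int(sum(train_labels) >= 2)

def contrarian_learner(train_labels):
    """Predict the minority training label for every unseen email."""
    return 1 - majority_learner(train_labels)

scores = {"majority": [], "contrarian": []}
for target in product([0, 1], repeat=8):  # all 256 conceivable spam rules on this universe
    train_labels = [target[i] for i in train_idx]
    test_labels = [target[i] for i in test_idx]
    for name, learner in [("majority", majority_learner), ("contrarian", contrarian_learner)]:
        pred = learner(train_labels)
        scores[name].append(np.mean([pred == label for label in test_labels]))

for name, accs in scores.items():
    print(f"{name:10s}: average accuracy on unseen emails = {np.mean(accs):.3f}")

Both come out at exactly 0.5, because averaged over all possible labelings, the unseen labels carry no information about the seen ones.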
This theorem is the philosophical bedrock of learning theory. It tells us there is no magic bullet. Every algorithm carries assumptions — about smoothness, about which patterns matter, about how the world is structured. Success comes from matching those assumptions to reality. And that means we need a way to measure when a model's assumptions are good enough for the data at hand.
That measurement is what the rest of this section builds toward.
PAC Learning
Every time you train a model and deploy it, you're making a bet. You're betting that the patterns your model found in the training data aren't flukes — that they'll hold up on new data drawn from the same source. The question is: how do you know when that bet is safe?
In 1984, Leslie Valiant introduced a framework to answer this precisely. He called it PAC learning — Probably Approximately Correct. I love this name because it's unusually honest. It doesn't promise your model will be perfect. It promises that, with enough data, your model will probably be approximately right.
Let's make this concrete with our spam classifier. We have some unknown "truth" about what makes an email spam — call it c, the target concept. We have a set of candidate classifiers our algorithm can consider — call it H, the hypothesis class. And we have training emails, drawn independently from some distribution of real-world emails we'll call D.
PAC learning gives us two knobs to turn. The first is ε (epsilon), the accuracy knob. This controls how much error we're willing to tolerate. If ε = 0.05, we're saying "I'm okay if the classifier misclassifies up to 5% of emails." The second knob is δ (delta), the confidence knob. If δ = 0.05, we want to be 95% sure our classifier achieves that accuracy.
A concept class is PAC learnable if there exists an algorithm that, for any ε and δ you pick, can find an approximately correct hypothesis with high probability, using a number of training examples that grows reasonably (polynomially) with 1/ε, 1/δ, and the problem size.
Here's where it gets interesting. For a finite hypothesis class with |H| candidates, the number of training examples you need is:
m ≥ (1/ε) · ln(|H|/δ)
Let's unpack this with real numbers. Say our spam classifier has a hypothesis class of size 59,049 (all possible conjunctions of 10 binary features — things like "contains 'free'" AND "no greeting" AND "has attachment"). Each feature can be included positively, negatively, or not at all, giving 3¹⁰ = 59,049 candidates.
import numpy as np

def pac_sample_complexity(epsilon, delta, h_size):
    """How many training emails do we need?"""
    return int(np.ceil((1 / epsilon) * np.log(h_size / delta)))

h_size = 3**10  # 59,049 candidate spam rules

# "I want less than 5% error, 95% confidence"
m_relaxed = pac_sample_complexity(0.05, 0.05, h_size)
print(f"Relaxed: {m_relaxed} emails needed")

# "I want less than 1% error, 99% confidence"
m_strict = pac_sample_complexity(0.01, 0.01, h_size)
print(f"Strict: {m_strict} emails needed")

# 5x tighter on both knobs, but not 25x more data
print(f"Ratio: {m_strict / m_relaxed:.1f}x more data")
The output is revealing. Going from "5% error, 95% confidence" to "1% error, 99% confidence" — five times stricter on both knobs — costs only about 6x more data, not 25x. The logarithmic dependence on |H| is the key insight here. Make the hypothesis class a million times bigger and the log term grows by only ln(10⁶) ≈ 14, which in this example roughly doubles the required data rather than multiplying it by a million. Learning is feasible even with enormous candidate pools, as long as they are finite.
There's a catch, though. That formula assumes the realizable case — the true spam rule actually exists somewhere in your hypothesis class H. Your model family is expressive enough to represent the truth perfectly. In the real world, this is almost never true. Your 10-feature conjunction can't capture "this email has the emotional manipulation pattern of a Nigerian prince scam." The truth isn't in H.
This is the agnostic setting. Your goal shifts from finding the perfect hypothesis to finding the best available one in H, even though none are exactly right. Agnostic PAC learning is harder — the sample complexity grows as 1/ε² instead of 1/ε, and the guarantees become relative: you're competing against the best hypothesis in H, not against perfection.
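To feel the difference, here's a quick comparison using one standard Hoeffding-plus-union-bound version of the agnostic sample complexity, m ≥ (2/ε²)·ln(2|H|/δ), alongside the realizable formula from before (exact constants vary by textbook; the 1/ε² is the point):

import numpy as np

def agnostic_sample_complexity(epsilon, delta, h_size):
    """Agnostic case: m >= (2 / eps^2) * ln(2|H| / delta)."""
    return int(np.ceil((2 / epsilon**2) * np.log(2 * h_size / delta)))

h_size = 3**10  # the same 59,049 candidate spam rules
m_realizable = int(np.ceil((1 / 0.05) * np.log(h_size / 0.05)))
m_agnostic = agnostic_sample_complexity(0.05, 0.05, h_size)
print(f"Realizable, 5% error: ~{m_realizable} emails")
print(f"Agnostic,   5% error: ~{m_agnostic} emails")

Same error target, roughly forty times the data, purely because 1/ε became 1/ε².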
But PAC learning with finite hypothesis classes only gets us so far. Linear classifiers, neural networks, SVMs — these are all infinite hypothesis classes. The |H| in our formula is infinity. Does that mean we need infinite data?
VC Dimension
No, we don't need infinite data. And the reason is one of the most elegant ideas in all of machine learning.
I'll be honest — the concept of "shattering" took me three separate attempts to internalize. Each time I read about it, I thought I understood it, and each time I was wrong in a slightly different way. Let me try to spare you that experience by building it from the ground up with our spam classifier.
Imagine we've zoomed into the simplest possible version of our problem. We have three emails, and we're trying to classify them as spam or not-spam. With three emails and two possible labels each, there are 2³ = 8 possible ways to label them:
All three legit. All three spam. First one spam, rest legit. First one legit, rest spam. And so on — eight combinations total.
Now here's the question: can our classifier handle all eight of those labelings? If someone adversarially picked any labeling whatsoever, could we find some hypothesis in our class that gets all three emails right?
If yes, we say our hypothesis class shatters those three points. It can memorize any pattern on those three examples, no matter how arbitrary. The hypothesis class is flexible enough to accommodate anything.
The VC dimension (named after Vladimir Vapnik and Alexey Chervonenkis) is the largest number of points your hypothesis class can shatter. More precisely: VC(H) equals the largest m such that there exists some set of m points that H can shatter.
Let's see this in action with something more visual. Consider lines in 2D — linear classifiers that divide a flat plane into a "+" region and a "−" region.
Take three points arranged in a triangle (not on a line). There are 8 possible labelings. For each one — all positive, all negative, two positive and one negative, one positive and two negative — you can find a line that puts the positives on one side and the negatives on the other. Try this on a piece of paper. Pick any labeling of three triangle points, and you'll always find a separating line. Three points: shattered.
Now try four points arranged in a square. Label the top-left and bottom-right as "+", the top-right and bottom-left as "−". This is the XOR pattern — positive points on one diagonal, negative on the other. No single line can separate them. It doesn't matter how you arrange four points in the plane — there will always be at least one labeling that defeats every possible line.
Therefore, the VC dimension of linear classifiers in 2D is 3. They can shatter 3 points, but not 4.
import numpy as np
from itertools import product
from scipy.optimize import linprog

def can_line_separate(points, labels):
    """Can a linear classifier separate these points according to these labels?"""
    n, d = points.shape
    # Find w, b such that label_i * (w · x_i + b) >= 1 for every point,
    # i.e. -sign_i * (w · x_i + b) <= -1 in linprog's A_ub x <= b_ub form
    c = np.zeros(d + 1)  # feasibility problem: any objective will do
    A_ub = np.zeros((n, d + 1))
    b_ub = -np.ones(n)
    for i in range(n):
        sign = 1 if labels[i] else -1
        A_ub[i, :d] = -sign * points[i]
        A_ub[i, d] = -sign
    # Weights and bias must be allowed to go negative (linprog defaults to >= 0)
    bounds = [(None, None)] * (d + 1)
    return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs').success

# Triangle of 3 points
triangle = np.array([[0, 0], [1, 0], [0, 1]], dtype=float)
all_3 = [l for l in product([False, True], repeat=3)
         if any(l) and not all(l)]  # skip trivial all-same labelings
shatters_3 = all(can_line_separate(triangle, np.array(l)) for l in all_3)
print(f"Lines shatter 3 points in triangle? {shatters_3}")

# Square of 4 points
square = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
all_4 = [l for l in product([False, True], repeat=4)
         if any(l) and not all(l)]
shatters_4 = all(can_line_separate(square, np.array(l)) for l in all_4)
print(f"Lines shatter 4 points in square? {shatters_4}")  # XOR kills it
This pattern generalizes beautifully. For linear classifiers in d dimensions, the VC dimension is d + 1. A line in 2D can shatter 3 points. A plane in 3D can shatter 4 points. A hyperplane in 100 dimensions can shatter 101 points. The VC dimension scales with the number of parameters (d weights plus one bias), which aligns with intuition: more knobs to turn means more flexibility to memorize patterns.
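One nice thing about the LP check above is that it isn't tied to 2D; can_line_separate works for points in any dimension. Reusing it (and the imports) from the previous snippet, here's a quick, merely suggestive check of the d + 1 claim in 3D: a plane shatters the four corners of a tetrahedron, but a particular five-point arrangement already has a labeling no plane can realize.

# Four affinely independent points in 3D: shattered (d + 1 = 4)
tetrahedron = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
labels_4 = [l for l in product([False, True], repeat=4) if any(l) and not all(l)]
print(all(can_line_separate(tetrahedron, np.array(l)) for l in labels_4))  # True

# Add a fifth point: this arrangement has at least one unachievable labeling
five = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
labels_5 = [l for l in product([False, True], repeat=5) if any(l) and not all(l)]
print(all(can_line_separate(five, np.array(l)) for l in labels_5))  # False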
Going back to our restaurant analogy from No Free Lunch — if the hypothesis class is a restaurant menu, the VC dimension tells you how many distinct palates the menu can satisfy simultaneously. A menu with 3 dishes can satisfy any combination of 3 diners' preferences. But with 4 diners, there's always some combination of tastes it can't accommodate.
Here are VC dimensions for some common hypothesis classes, to build your intuition:
| Hypothesis Class | VC Dimension | Why |
|---|---|---|
| Threshold on ℝ (is x > θ?) | 1 | One knob to turn |
| Intervals on ℝ (is a < x < b?) | 2 | Two endpoints |
| Linear classifiers in ℝᵈ | d + 1 | d weights + bias |
| Decision stumps | 2 | One feature, one threshold |
| Neural net with W weights | O(W log W) | Roughly proportional to parameters |
| k-nearest neighbors | ∞ | Can memorize anything |
That last row is interesting. k-NN has infinite VC dimension — theory says it should overfit catastrophically on everything. In practice, it works fine on many problems. This is our first hint that VC dimension, while powerful, is telling us about worst-case scenarios across all possible data distributions. The actual data distribution you encounter might be much kinder than the worst case.
But the real power of VC dimension is this: it replaces |H| in our sample complexity formula. For an infinite hypothesis class with VC dimension d, you need roughly m = O(d/ε²) training examples in the agnostic setting (about O(d/ε) if the truth is in H), up to logarithmic factors. Infinite hypothesis class, finite data requirement. That's the breakthrough.
Generalization Bounds
Now we arrive at the question that started this whole journey: how far can your test error be from your training error?
For any hypothesis h from a class H with VC dimension d, with probability at least 1 − δ:
L_true(h) ≤ L_train(h) + √( (d · ln(2m/d) + ln(4/δ)) / m )
That square root term is the complexity penalty. Think of it as a tax your model pays for being flexible. The more complex the model (higher d), the bigger the tax. The more data you have (larger m), the smaller the tax becomes.
For our spam classifier: if we're using a linear classifier with 50 features (VC dimension = 51) and we have 10,000 training emails, the complexity penalty is moderate. The gap between training accuracy and real-world accuracy should be manageable. But if we switch to a neural network with 10 million weights, the VC bound gives an astronomically large penalty — far larger than any meaningful number.
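Here's a rough back-of-the-envelope version of that comparison, plugging numbers straight into the bound above. Treating the network's parameter count as its VC dimension is a simplification of the O(W log W) row in the table, but it's enough to show the scale of the problem:

import numpy as np

def vc_penalty(d, m, delta=0.05):
    """Complexity penalty term from the VC bound; only meaningful when m is well above d."""
    if d >= m:
        return float('inf')  # the bound exceeds 1 and promises nothing
    return np.sqrt((d * np.log(2 * m / d) + np.log(4 / delta)) / m)

m = 10_000  # training emails
print(f"Linear classifier, VC dim 51: penalty ≈ {vc_penalty(51, m):.3f}")
print(f"Network with ~10M parameters: penalty = {vc_penalty(10_000_000, m)}")

For the linear classifier the bound caps the train-test gap at roughly 18 percentage points; for the network it caps nothing at all.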
And this is where I need to be honest about the limits of what we've built so far. VC generalization bounds are almost always too loose for modern deep learning. A neural network with millions of parameters gets a VC bound that says "you need more training examples than there are atoms in the observable universe." That is clearly wrong — these networks generalize fine with millions or even thousands of examples.
Nobody uses VC bounds to compute actual numbers for deep networks. Their value is conceptual: they tell you the right variables to think about (model complexity, sample size, training error) and why generalization happens in principle. But they don't tell you what happens in practice for overparameterized models.
We need something sharper.
Rademacher Complexity
VC dimension measures the capacity of a hypothesis class without ever looking at your data. It's a property of the model family alone. This is a strength — the guarantees hold for any possible dataset. But it's also a weakness — the bounds are loose precisely because they prepare for the worst case, and your actual data might be far from the worst case.
Rademacher complexity fixes this by asking a different question: how well can your hypothesis class fit random noise on your actual data?
Here's the procedure, and it's beautifully intuitive. Take your 500 training emails. Throw away the real labels. Instead, assign each email a random label — flip a fair coin for each one. Some emails get labeled "spam," some get "not spam," completely at random. Now find the best classifier in your hypothesis class for these garbage labels. How well does it do?
If it does well — if it can find strong correlations even with random labels — that's bad news. It means your hypothesis class is flexible enough to fit noise, which means it's prone to overfitting real data too. If it does poorly — if the best classifier can't do much better than chance with random labels — that's good. The hypothesis class is constrained enough that it can only pick up real signal.
Repeat this many times with different random labelings, and average the result. That average is the empirical Rademacher complexity of your hypothesis class on your data.
import numpy as np

def empirical_rademacher(X, hypothesis_fn, n_trials=1000):
    """How well can hypothesis class fit random noise on this data?"""
    m = X.shape[0]
    correlations = []
    for _ in range(n_trials):
        # Random ±1 labels (coin flips)
        sigma = np.random.choice([-1, 1], size=m)
        best_corr = hypothesis_fn(X, sigma)
        correlations.append(best_corr)
    return np.mean(correlations)

def linear_max_correlation(X, sigma):
    """Best unit-norm linear classifier's correlation with random labels."""
    m = X.shape[0]
    return np.linalg.norm(X.T @ sigma) / m

np.random.seed(42)
for m in [50, 200, 1000, 5000]:
    X = np.random.randn(m, 10)
    rc = empirical_rademacher(X, linear_max_correlation)
    print(f"{m:5d} emails, 10 features: Rademacher ≈ {rc:.4f}")
Notice something important in the output: as the number of training examples grows, Rademacher complexity drops. More data makes it harder to fit random noise, which means your generalization guarantees get tighter. This makes intuitive sense — it's harder to find fake patterns in a larger dataset.
The generalization bound using Rademacher complexity is: with probability at least 1 − δ,
L_true(h) ≤ L_train(h) + 2·R̂_m(H) + √( ln(1/δ) / (2m) )
These bounds are typically tighter than VC bounds because they incorporate the geometry of your actual data — how spread out it is, how the features relate to each other — rather than treating all datasets as equally dangerous.
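As a rough illustration, and glossing over the technicality that the bound formally wants the Rademacher complexity of the loss class rather than of the raw classifiers, you can plug the empirical estimate from the snippet above straight in (this continues that snippet's definitions):

# Illustrative plug-in at m = 1000 emails with 10 features
m, delta = 1000, 0.05
X = np.random.randn(m, 10)
rc = empirical_rademacher(X, linear_max_correlation)
gap = 2 * rc + np.sqrt(np.log(1 / delta) / (2 * m))
print(f"With {m} emails: true error ≤ train error + {gap:.3f} (with 95% confidence)")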
But even Rademacher bounds have limits. For truly overparameterized models like deep networks, they still tend to be loose. The reason is that both VC dimension and Rademacher complexity measure what the hypothesis class could do. They don't account for what gradient descent actually does. And as we'll see after the rest stop, that distinction turns out to be everything.
The Bias-Complexity Tradeoff
Everything we've built so far leads to a clean formalization of something you've likely felt in your gut: use too simple a model and you miss real patterns. Use too complex a model and you memorize noise. The sweet spot is somewhere in between.
Learning theory decomposes the true error into two pieces:
True Error ≤ Approximation Error + Estimation Error
Approximation error is the best your hypothesis class can possibly do, even with infinite data. If you're fitting a straight line to data that follows a sine wave, the best line in the world still has significant error. This is the cost of choosing a model family that can't represent the truth. Returning to our restaurant analogy: it's the error from only having Italian dishes on the menu when the customer wants sushi.
Estimation error is the gap between that best-possible hypothesis and the one your algorithm actually finds from finite data. Even if your model family contains something close to the truth, you might not find it with limited training examples — or you might accidentally pick a hypothesis that happened to fit noise in your sample. This is the cost of learning from a finite peek at reality.
Increase model complexity and approximation error drops (the menu gets bigger, more likely to have what the customer wants). But estimation error rises (harder to find the right dish when the menu has 500 items and you've only seen 3 customers). Decrease model complexity and the tradeoff reverses.
Structural Risk Minimization (SRM) is the principled answer. Consider a nested sequence of increasingly complex hypothesis classes — H₁ ⊂ H₂ ⊂ H₃ ⊂ ... — like polynomial regression with degrees 1, 2, 3, and so on. For each class, compute training error plus a complexity penalty based on its VC dimension. Pick the class that minimizes the sum.
import numpy as np

def srm_select(train_errors, vc_dims, m, delta=0.05):
    """Pick the right model complexity by balancing fit vs. penalty."""
    bounds = []
    for err, d in zip(train_errors, vc_dims):
        penalty = np.sqrt((d * np.log(2*m/d) + np.log(4/delta)) / m) if d > 0 else 0
        bounds.append(err + penalty)
    best = np.argmin(bounds)
    return best, bounds

m = 100
degrees = list(range(1, 11))
vc_dims = [k + 1 for k in degrees]  # degree-k polynomial: k + 1 coefficients
# Illustrative training errors: fit improves steadily as the degree grows
train_errs = [0.25, 0.12, 0.06, 0.03, 0.015, 0.008, 0.004, 0.002, 0.001, 0.0005]
best, bounds = srm_select(train_errs, vc_dims, m)

print("Degree | Train Err | Penalty | Total Bound")
print("-------|-----------|-----------|------------")
for i, (deg, te, b) in enumerate(zip(degrees, train_errs, bounds)):
    tag = " ← best" if i == best else ""
    print(f" {deg:4d} | {te:9.4f} | {b-te:9.4f} | {b:.4f}{tag}")
This is what regularization does in practice. L2 regularization, early stopping, dropout — they all implement variations of "minimize training error plus a complexity penalty." Every time you tune a regularization strength via cross-validation, you are doing SRM empirically. The theory names and justifies what practitioners have been doing by instinct.
The classical picture is now complete. It's a tidy story: there's a sweet spot, find it, and your model generalizes. If only it were that simple.
Rest Stop
Congratulations on making it this far. You can stop here if you want.
You now have the core mental model of classical learning theory. You know that PAC learning tells you how many examples you need, that VC dimension measures the effective complexity of a model class, that generalization bounds quantify the gap between training and true performance, that Rademacher complexity makes those bounds data-aware, and that the bias-complexity tradeoff formalizes the underfitting-overfitting balance.
This mental model is genuinely useful. It's enough to explain why regularization works, to reason about sample complexity in interviews, and to understand why the field cared about model capacity for 30 years.
But it doesn't tell the whole story. It predicts that once your model is complex enough to perfectly fit the training data, things should only get worse. More parameters past that point should mean more overfitting. And yet, deep learning — with its billions of parameters and perfect training fits — works spectacularly well. Something is missing from the classical picture.
If the discomfort of not knowing what's underneath is nagging at you, read on. What comes next is where learning theory is being rewritten in real time.
Double Descent
I'll be honest — when I first saw the double descent curve, I didn't believe it. I'd spent years internalizing the U-shaped bias-variance tradeoff as gospel. The idea that test error could go back down after the model became ridiculously overparameterized felt like an artifact of bad experimental design.
It wasn't. In 2019, Belkin et al. documented what practitioners had been quietly noticing for years. If you keep increasing model complexity past the point where the model can perfectly fit the training data — the interpolation threshold — test error does not stay high. It comes back down.
Let's trace this with a concrete example. We'll fit polynomials of increasing degree to 30 noisy data points sampled from sin(3x). As the degree increases, the polynomial gets more flexible.
import numpy as np
np.random.seed(42)

n_train, n_test = 30, 1000
X_train = np.random.uniform(-1, 1, n_train)
y_train = np.sin(3 * X_train) + 0.3 * np.random.randn(n_train)
X_test = np.random.uniform(-1, 1, n_test)
y_test = np.sin(3 * X_test)

print("Degree | Train MSE | Test MSE | What's happening")
print("-------|-----------|-----------|------------------")
for d in [2, 5, 10, 20, 29, 30, 35, 45, 55]:
    X_poly = np.vander(X_train, d + 1, increasing=True)
    X_tpoly = np.vander(X_test, d + 1, increasing=True)
    # lstsq returns the minimum-norm solution once the system is underdetermined
    w = np.linalg.lstsq(X_poly, y_train, rcond=None)[0]
    tr_mse = np.mean((X_poly @ w - y_train)**2)
    te_mse = min(np.mean((X_tpoly @ w - y_test)**2), 10.0)  # cap blow-ups for readability
    if d < 20:
        regime = "Underfitting → improving"
    elif d <= 31:
        regime = "INTERPOLATION THRESHOLD"
    else:
        regime = "Overparameterized → improving again"
    print(f" {d:4d} | {tr_mse:9.4f} | {te_mse:9.4f} | {regime}")
The story unfolds in three acts. In the first act (low degree), test error drops as the model gains capacity — the classical regime where more parameters help. In the second act (around degree 29-30, where the model has about as many parameters as data points), something terrible happens. The model has barely enough capacity to perfectly fit the training data, and it does so in the most brittle, volatile way possible. Test error spikes. This is the interpolation threshold, and it's the worst place to be.
In the third act — and this is the part that broke my mental model — test error starts declining again. The model now has far more parameters than data points, and among the infinitely many polynomials that perfectly fit the training data, the algorithm (minimum-norm least squares) picks one that turns out to generalize well.
Think of it this way. If you ask someone to draw a curve through 30 points using a polynomial of degree 29, there's exactly one polynomial that does it. It has to contort wildly to hit every point, and that contortion produces terrible predictions between the points. But if you ask someone to draw a curve through 30 points using a polynomial of degree 55, there are infinitely many options. The algorithm picks the "smoothest" one — the one with the smallest coefficients — and that smoothness translates to good generalization.
Our restaurant analogy evolves here. Classical theory says: "If your menu has 30 items for 30 customers, each customer gets forced into their closest match — rigid, no room for error." Double descent says: "If your menu has 100 items for 30 customers, each customer has many good options. The host can seat them at dishes they actually enjoy." Abundance creates flexibility.
Double descent isn't limited to polynomials. It has been observed in random Fourier features, decision trees, random forests, and deep neural networks — with respect to both model size and training duration. Epoch-wise double descent is particularly striking: for a fixed-size model, if you train long enough, test error can peak, then drop, then rise, then drop again.
Benign Overfitting
Double descent raises an uncomfortable question: how can a model memorize all the training data, including the noise, and still generalize well? Isn't memorizing noise the very definition of overfitting?
In 2020, Bartlett, Long, Lugosi, and Tsigler gave this phenomenon a name: benign overfitting. The model overfits — it has zero training error, it has memorized every noisy label — but the overfitting is benign. It doesn't hurt generalization much.
The key insight is about high-dimensional geometry. When your model has far more parameters than data points, the noise it memorizes gets spread across many dimensions. Imagine your training data lives in a 100-dimensional space, but the real signal only occupies 5 of those dimensions. The model memorizes the noise in the other 95 dimensions. When a new data point arrives, the noise in those 95 dimensions looks completely different, so it averages out to roughly zero. The signal in the 5 meaningful dimensions, however, remains consistent.
A physical analogy helps. Imagine shouting a message across a windy field. The wind (noise) is loud, but it's random — it doesn't carry any pattern. If you shout the same message 30 times (training examples), someone on the other side can average out the wind and recover your words. Benign overfitting is similar. The model "hears" the noise, but because the noise is random and high-dimensional, it doesn't contaminate the signal when you make predictions.
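Here's a small simulation of that geometry, with entirely made-up dimensions and scales: five feature directions carry the signal, 1,995 weak directions carry nothing, and a minimum-norm interpolator memorizes 50 noisy labels exactly.

import numpy as np
np.random.seed(0)

n_train, n_test = 50, 2000
d_signal, d_noise = 5, 1995                      # signal lives in 5 dims, junk in the rest
beta = np.array([2.0, -1.0, 1.5, 0.5, -2.0])     # true signal weights
sigma = 0.5                                      # label noise, so the best possible MSE is 0.25

def make_data(n):
    X_sig = np.random.randn(n, d_signal)
    X_junk = 0.05 * np.random.randn(n, d_noise)  # many weak, irrelevant directions
    y = X_sig @ beta + sigma * np.random.randn(n)
    return np.hstack([X_sig, X_junk]), y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

# Minimum-norm interpolator: fits every noisy training label exactly
w = X_tr.T @ np.linalg.solve(X_tr @ X_tr.T, y_tr)

print(f"Train MSE: {np.mean((X_tr @ w - y_tr)**2):.1e}  (the noise is memorized)")
print(f"Test MSE:  {np.mean((X_te @ w - y_te)**2):.3f}  (noise floor is {sigma**2:.2f})")
print(f"Ignore-the-signal MSE: {np.mean((y_te - y_tr.mean())**2):.2f}")

The interpolator fits the training noise to machine precision, yet its test error should land near the noise floor rather than anywhere near the ignore-the-signal baseline. Rescale the junk directions and the same interpolator stops being benign, which is exactly the structural condition the theory pins down.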
This doesn't work in all situations. Benign overfitting requires the data to have a specific structure — roughly, the "noise directions" in parameter space must be nearly orthogonal to the "signal directions." When this condition fails, overfitting is not benign. The model memorizes noise in ways that do corrupt predictions. Understanding exactly when overfitting is benign and when it's catastrophic is one of the sharpest questions in modern learning theory.
The Neural Tangent Kernel
I'm still developing my intuition for this one, so let me share what I've got so far and be upfront about where my understanding gets fuzzy.
In 2018, Jacot, Gabriel, and Hongler proved something remarkable about neural networks in the limit of infinite width — networks where each hidden layer has infinitely many neurons. In that limit, training a neural network with gradient descent becomes equivalent to a much simpler operation: kernel regression.
Here's the core idea. When you initialize a neural network randomly and train it with gradient descent, the weights change over time. In a finite network, those weight changes can be dramatic — the network learns new representations, discovers features, reshapes how it sees the data. But as you make the network infinitely wide, something strange happens: the weights barely move from their initial values. Training becomes a small perturbation around the starting point.
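You can check this claim on a toy problem. Below is a rough numpy sketch under assumptions chosen purely for convenience: a one-hidden-layer tanh network with the 1/√width output scaling the NTK analysis uses, full-batch gradient descent on ten points of a 1-D regression task, and an arbitrary learning rate and step count. The only number to watch is how far the weights travel from their initialization as the width grows.

import numpy as np
np.random.seed(0)

# A tiny 1-D regression task
X = np.linspace(-1, 1, 10).reshape(-1, 1)
y = np.sin(3 * X).ravel()

def relative_weight_movement(width, steps=5000, lr=0.2):
    """Train a one-hidden-layer tanh net with 1/sqrt(width) output scaling; report weight travel."""
    W = np.random.randn(width, 1)   # input -> hidden
    a = np.random.randn(width)      # hidden -> output
    W0, a0 = W.copy(), a.copy()
    n = len(X)
    for _ in range(steps):
        h = np.tanh(X @ W.T)                      # (n, width) hidden activations
        err = h @ a / np.sqrt(width) - y          # (n,) prediction errors
        grad_a = h.T @ err / (np.sqrt(width) * n)
        grad_W = (np.outer(err, a / np.sqrt(width)) * (1 - h**2)).T @ X / n
        a -= lr * grad_a
        W -= lr * grad_W
    moved = np.sqrt(np.sum((W - W0)**2) + np.sum((a - a0)**2))
    return moved / np.sqrt(np.sum(W0**2) + np.sum(a0**2))

for width in [10, 100, 1000, 10000]:
    print(f"width {width:6d}: relative weight movement = {relative_weight_movement(width):.4f}")

As the width grows, the relative movement should shrink toward zero; that's the lazy-training regime in which the linearization described next is accurate.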
If the weights barely move, you can approximate the network's output with a first-order Taylor expansion around the initial weights. That approximation turns the neural network into a linear model — linear in the parameters, not in the inputs — operating in a fixed feature space defined by the random initialization. And a linear model in a fixed feature space is exactly a kernel method. The kernel that emerges is called the Neural Tangent Kernel (NTK).
Think back to our restaurant analogy, but now think about the chefs. A finite neural network is like a chef who invents new techniques as they cook — they taste, adjust, and develop new methods in response to each dish. An infinitely wide network in the NTK regime is like a chef who already knows every technique on day one (from the random initialization) and only needs to figure out which techniques to apply to which dish. The tools are fixed; only the application changes.
The NTK gives us a theoretical lens to understand convergence — why gradient descent finds a global minimum in overparameterized networks. But it has a significant limitation: real, finite-width networks do something that NTK cannot explain. They learn features. They discover representations. A convolutional network learns edge detectors and texture recognizers that weren't present at initialization. This feature learning is what makes deep learning powerful, and it happens precisely because finite networks leave the NTK regime — their weights move far enough that the linear approximation breaks down.
The NTK is like a map that's accurate in one terrain (infinite width, lazy training) but misleading in another (finite width, feature learning). Still, it's the best analytical tool we have for understanding why training converges, and it's a foundation on which more nuanced theories are being built.
Implicit Regularization
Here is perhaps the deepest puzzle in modern machine learning: why does gradient descent, applied to an overparameterized model with no explicit regularization, find solutions that generalize well?
In the overparameterized regime, there are infinitely many sets of weights that perfectly fit the training data. Most of them generalize terribly — they memorize noise in ways that corrupt predictions. But gradient descent doesn't pick an arbitrary one. It exhibits a bias toward certain solutions, and that bias acts like an invisible regularizer.
For linear models, this is well understood. Gradient descent initialized at zero converges to the minimum-norm solution — the set of weights with the smallest L2 norm among all that perfectly fit the data. This is mathematically equivalent to L2-regularized regression in a specific limit. The algorithm, by its very mechanics, prefers "simple" solutions even though nobody asked it to.
import numpy as np
np.random.seed(42)
# Overparameterized: 10 data points, 50 features
m, d = 10, 50
X = np.random.randn(m, d)
y = np.random.randn(m)
# Minimum-norm solution (what gradient descent from zero converges to)
w_min_norm = X.T @ np.linalg.inv(X @ X.T) @ y
print(f"Norm of min-norm solution: {np.linalg.norm(w_min_norm):.4f}")
# A random solution that also fits the data perfectly
null_component = np.random.randn(d)
null_component -= X.T @ np.linalg.inv(X @ X.T) @ X @ null_component
w_random = w_min_norm + 5 * null_component
print(f"Norm of random solution: {np.linalg.norm(w_random):.4f}")
# Both fit perfectly
print(f"Min-norm fit error: {np.max(np.abs(X @ w_min_norm - y)):.1e}")
print(f"Random fit error: {np.max(np.abs(X @ w_random - y)):.1e}")
Both solutions fit the training data with essentially zero error, but the minimum-norm solution has a dramatically smaller norm. On test data, the minimum-norm solution will typically generalize much better, because its small weights mean it isn't putting large bets on any particular feature.
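To connect this back to the claim about gradient descent, here's a continuation of the snippet above: running plain gradient descent from zero on the same X and y, with no regularizer anywhere in sight, should land numerically on that same minimum-norm solution.

# Plain gradient descent from w = 0 on the unregularized squared loss
w_gd = np.zeros(d)
lr = 0.01
for _ in range(20_000):
    w_gd -= lr * X.T @ (X @ w_gd - y) / m

print(f"Distance from min-norm solution: {np.linalg.norm(w_gd - w_min_norm):.2e}")
print(f"Norm of GD solution: {np.linalg.norm(w_gd):.4f}")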
For deep networks, the picture is murkier and more fascinating. SGD doesn't converge to the minimum-norm solution in parameter space — the relationship between parameters and predictions is nonlinear. But something analogous appears to happen in function space: gradient descent seems to prefer functions that are "simple" in some hard-to-formalize sense. The network settles into flat regions of the loss landscape, where the function it computes doesn't change much if the weights are slightly perturbed. These flat minima tend to generalize better than sharp minima.
My favorite thing about implicit regularization is that, aside from the linear case, no one is completely certain why it works so well. We have partial explanations — the NTK perspective, flat minima arguments, information-theoretic bounds — but a complete, unified theory remains elusive. It's one of the great open problems in machine learning theory.
Grokking
If implicit regularization is the great open puzzle, grokking is the great open mystery.
In 2022, Power et al. reported something that shouldn't happen. They trained small neural networks on algorithmic tasks — things like modular arithmetic (what's 23 + 47 mod 59?). The networks memorized the training data very quickly, reaching near-zero training error. Test accuracy, meanwhile, stayed at chance. Memorization without generalization — classic overfitting.
Then they kept training. Way past the point where training loss had flatlined. For hundreds of thousands of additional steps. And eventually, suddenly, test accuracy shot up. The network went from "memorizing answers" to "understanding the algorithm." Not gradually — abruptly. A phase transition from memorization to generalization, long after anyone would have stopped training.
They called it grokking: the delayed, sudden onset of generalization, well after the model has perfectly memorized the training data.
The emerging explanation involves competing solutions in parameter space. Early in training, the network finds a memorization solution — essentially a lookup table. This is fast because memorization requires less coordination among parameters. Somewhere in the background, a generalizable solution — one that actually computes the underlying algorithm — is slowly forming. It's "slower" because it requires the parameters to organize themselves in a more structured way.
Weight decay plays a crucial role. It steadily penalizes large weights, and the memorization solution typically has larger weights (it needs to carve out a separate region of input space for each training example). Over many, many steps, weight decay erodes the memorization solution until the generalizable one takes over. The phase transition happens when the structured solution becomes more "energetically favorable" than the memorized one.
Grokking is fascinating because it violates the most basic heuristic everyone uses: "if your training loss has plateaued and your test loss isn't improving, stop training." It suggests that models can undergo fundamental structural reorganization long after the loss surface looks flat. Whether this happens in large-scale practical models or only in these small algorithmic settings is still being investigated.
Why Deep Learning Works
We've now built enough machinery to tackle the big question. Classical learning theory says overparameterized models should overfit. Deep learning uses massively overparameterized models and they generalize beautifully. What gives?
The answer isn't any single mechanism. It's the conspiracy of several:
Implicit regularization from gradient descent. SGD doesn't explore the full hypothesis class. It follows a specific path through parameter space, biased toward solutions with small norms, low rank, or other simplicity measures. The "effective" hypothesis class — the set of solutions SGD can actually reach — is much smaller than the theoretical class, and smaller effective classes mean better generalization.
The architecture encodes assumptions. Convolutional networks assume spatial locality and translation invariance. Transformers assume that relationships between tokens matter more than absolute position. These architectural priors are a form of regularization that doesn't appear in the loss function. They constrain the hypothesis class in ways that happen to match the structure of images and language.
Overparameterization smooths the loss landscape. In the overparameterized regime, the loss surface has fewer bad local minima and more paths to good solutions. This connects back to our restaurant menu: with more dishes than customers, it's easier for the host (gradient descent) to find a satisfying seating arrangement.
Feature learning compounds generalization. Unlike kernel methods (and the NTK regime), deep networks learn representations that are themselves useful for the task. A network doesn't memorize "image #4573 is a cat" — it learns "things with pointy ears and whiskers tend to be cats." These learned features compress the input space, and the compressed representation has much lower effective complexity than the original data.
The data has structure. No Free Lunch says no algorithm wins on all problems. But real-world problems — images, language, code, molecules — have massive amounts of structure: smoothness, compositionality, symmetries, hierarchies. Deep networks, through the mechanisms above, are exceptionally good at exploiting this structure. The bet you make when you train a deep model is that reality is structured. And for the problems we care about, it overwhelmingly is.
I still find it remarkable that we built a theoretical framework in the 1980s-90s, it predicted overparameterized models should fail, practitioners ignored the prediction, and it took another 20 years for theory to catch up and explain why the practitioners were right. The field is still in the middle of this reckoning. PAC-Bayes bounds (which measure complexity as the distance between your trained model and a prior, rather than counting parameters) are currently the tightest non-vacuous generalization bounds we have for neural networks. But even they don't tell the full story.
If you're still with me, thank you. I hope it was worth the climb.
We started with the No Free Lunch theorem — the humbling reminder that no algorithm is universally best. We built PAC learning to understand sample complexity, then VC dimension to handle infinite hypothesis classes. We formalized the bias-complexity tradeoff and saw how classical theory paints a neat U-shaped picture. Then we watched that picture shatter: double descent showed that overparameterized models can generalize again, benign overfitting explained how memorizing noise can be harmless, the neural tangent kernel connected wide networks to kernel methods, and implicit regularization revealed the hidden bias of gradient descent. Grokking showed us that generalization can arrive suddenly, long after we'd given up hope.
My hope is that the next time someone asks "why does your huge model generalize so well?" you won't need to shrug. You'll have a mental model of what's actually happening under the hood — imperfect, evolving, but grounded in the real mechanisms that make learning from data possible.
Resources
Understanding Machine Learning: From Theory to Algorithms by Shalev-Shwartz and Ben-David — the single best textbook on learning theory. Rigorous but humane. Available free online.
"Reconciling Modern Machine Learning Practice and the Classical Bias-Variance Trade-off" by Belkin et al. (2019) — the O.G. double descent paper. Changed how a generation of researchers thinks about overfitting.
"Benign Overfitting in Linear Regression" by Bartlett et al. (2020) — the paper that made benign overfitting rigorous. Clear writing for a theory paper.
"Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" by Power et al. (2022) — a short, surprising paper that opened a new research direction. The figures alone are worth the read.
"Neural Tangent Kernel: Convergence and Generalization in Neural Networks" by Jacot et al. (2018) — technically dense, but the core message is profound. The blog posts summarizing it are wildly helpful for building intuition.
The "Deep Learning Theory" lecture series by Matus Telgarsky — if you want to go deeper. The best bridge between the intuitions in this section and the full mathematical machinery.