Sparse & Kernel Methods

Chapter 17: Learning Theory & Advanced ML
Sparsity · Kernels · RKHS · Approximations

I avoided a deep dive into sparsity and kernel methods for longer than I'd like to admit. Every time I saw an L1 penalty in a paper, I'd nod and move on. Every time someone mentioned "RKHS," I'd quietly change the subject. Kernels felt like a relic of the pre-deep-learning era, and sparsity felt like something I already understood well enough. I was wrong on both counts. Here is that dive.

Sparse methods and kernel methods are two of the oldest and most powerful ideas in machine learning. Sparsity is about a simple bet: most of the features you're looking at don't matter, and if you can find the few that do, everything gets better — your model, your understanding, your compute bill. Kernel methods are about a different bet: your data has rich structure that lives in a space you can't see, and there's a mathematical shortcut that lets you work in that invisible space without ever going there. Both ideas were developed largely in the 1990s and early 2000s, and both are more relevant today than most practitioners realize.

Before we start, a heads-up. We'll touch on some geometry, a bit of linear algebra, and a sprinkle of functional analysis. But you don't need to know any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

What We'll Cover

A house price predictor and the noise problem
L1 vs L2: the geometry of zero
The LASSO path
Elastic net: when LASSO gets greedy
Compressed sensing: fewer measurements than dimensions
Sparse coding and dictionary learning
Rest stop
The kernel trick, from scratch
RKHS: why kernels aren't a hack
Kernels beyond SVM
The scaling wall
Nyström approximation
Random Fourier features
Attention is a kernel
Wrap-up
Resources

A House Price Predictor and the Noise Problem

Imagine we're building a model to predict house prices. We've collected data on 50 features for each house: square footage, number of bedrooms, distance to the nearest school, the color of the front door, the number of light switches, the latitude of the nearest fire hydrant, and 44 other things our data team scraped from public records. We fit an ordinary regression model on all 50 features. It works okay on our training data, but it generalizes poorly.

Here's the thing we already know intuitively: most of those 50 features are noise. The color of the front door doesn't predict the price. The number of light switches is correlated with square footage, so it's redundant. If we could somehow figure out that only about 5 of these features actually matter — square footage, number of bedrooms, neighborhood quality, lot size, age of the house — and zero out everything else, we'd have a simpler model that generalizes better and is easier to explain to a homebuyer.

That's the core promise of sparsity. Not "shrink everything a little bit." Literally delete the things that don't matter. Set their coefficients to exactly zero.

But how do you convince a mathematical optimization procedure to produce exact zeros? That turns out to be a surprisingly geometric question.

L1 vs L2: The Geometry of Zero

Let's start with the smallest possible version of this problem. We have a model with two coefficients, β₁ and β₂. Our loss function (say, squared error) draws elliptical contours in the β₁–β₂ plane — ovals centered around the optimal unconstrained solution. We want to penalize large coefficients to prevent overfitting. The question is: how?

The L2 penalty (Ridge regression) says: keep β₁² + β₂² below some budget. Geometrically, that's a circle. Picture an ellipse of loss contours bumping up against a circle. Where do they touch? On the smooth edge of the circle. Both β₁ and β₂ will be small, but neither will be exactly zero. The circle has no corners, so there's no reason for the solution to land on an axis.

The L1 penalty (LASSO) says: keep |β₁| + |β₂| below some budget. That constraint region is a diamond — it has sharp corners sitting right on the axes. When those same elliptical contours bump into a diamond, they tend to hit a corner. And at a corner, one of the coordinates is exactly zero.

I'll be honest — when I first heard this explanation, I didn't believe it. Why should the shape of the constraint region matter so much? But try drawing it yourself. Sketch an ellipse and a diamond on paper. Tilt the ellipse at various angles. The corner of the diamond is like a knife edge; the ellipse almost always makes first contact right at a corner. Now replace the diamond with a circle and try again. The contact point slides along the smooth surface, landing at a corner only by extreme coincidence.

That's the entire geometric intuition for why L1 produces exact zeros and L2 doesn't. The diamond has corners. The circle doesn't. That's it.

Think of it like sculpting. L2 regularization is like sanding down a block of wood — everything gets smoother and smaller, but nothing gets removed. L1 is like chiseling — entire chunks get knocked off, and what remains is the essential shape. We'll come back to this sculpting analogy as the ideas get more sophisticated.

Back to our house price predictor. If we fit LASSO on our 50 features, the optimization will drive the front-door-color coefficient, the light-switch coefficient, and most of the other noise features to exactly zero. We end up with a model that says "only these 5 features matter" — and it selected them automatically, as part of fitting the model. No manual feature selection needed.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# 50 features, but only 5 actually matter
np.random.seed(42)
n = 200
X = np.random.randn(n, 50)
true_coefs = np.zeros(50)
true_coefs[:5] = [8.0, -5.0, 3.0, -2.0, 1.5]  # only first 5 are real
y = X @ true_coefs + np.random.randn(n) * 2

lasso = Lasso(alpha=0.3).fit(X, y)
ridge = Ridge(alpha=0.3).fit(X, y)

print(f"Lasso nonzero: {np.sum(lasso.coef_ != 0)}")  # ~5-8
print(f"Ridge nonzero: {np.sum(ridge.coef_ != 0)}")   # 50 — every one

LASSO found roughly the right set of features. Ridge kept all 50 and shrank them all a little. For interpretation and generalization on this kind of problem, LASSO wins.

But LASSO has a limitation that becomes painful quickly.

The LASSO Path

The LASSO has a tuning parameter, usually called λ (lambda) or α (alpha in scikit-learn). It controls how aggressively we penalize. When λ is zero, we get ordinary least squares — no penalty, all 50 coefficients doing whatever they want. As we crank λ up, coefficients start shrinking, and one by one they hit zero and stay there.

The LASSO path is a picture of this process. Plot every coefficient on the y-axis, λ on the x-axis, and watch what happens as λ increases from left to right. At the far left, all coefficients are at their unconstrained values. As we move right, the least important coefficient drops to zero first. Then the next. Then the next. The last coefficient standing is the single most important feature.

This path tells a story. The order in which features drop out reveals their relative importance. Features that survive to high values of λ are genuinely predictive. Features that vanish immediately were noise all along. For our house price model, square footage would be the last survivor. Front door color would be among the first to go.
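We can trace that ordering directly. Here's a minimal sketch using scikit-learn's lasso_path on the X and y from the earlier snippet (the grid size is arbitrary): for each feature, we find the strongest penalty it survives, then sort by that.

from sklearn.linear_model import lasso_path

# Reusing X and y from the 50-feature example above; alphas come back in
# decreasing order, so index 0 is the harshest penalty.
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)

# For each feature, the largest alpha at which its coefficient is still nonzero
entry_alpha = np.array([
    alphas[np.nonzero(coefs[j])[0][0]] if np.any(coefs[j]) else 0.0
    for j in range(coefs.shape[0])
])
order = np.argsort(-entry_alpha)
print("Features in order of survival:", order[:8])  # the true features 0-4 should lead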

In practice, we pick the best λ using cross-validation — typically with LassoCV in scikit-learn, which fits the entire path efficiently using coordinate descent with warm starts along a grid of λ values, then picks the λ that minimizes cross-validated error. Each fit starts from the previous λ's solution rather than from scratch, which makes this surprisingly fast.

from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(cv=5, random_state=42).fit(X, y)
print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"Nonzero features: {np.sum(lasso_cv.coef_ != 0)}")
print(f"Which ones: {np.where(lasso_cv.coef_ != 0)[0]}")

Here's where LASSO runs into trouble. Suppose two of our features — say, square footage and living area — are highly correlated (they measure nearly the same thing). LASSO will pick one and kill the other, more or less at random. If you run it again with slightly different data, it might swap which one it keeps. This is unstable, and for interpretation it can be misleading. You'd report "square footage matters but living area doesn't," when in reality both carry the same information.

Elastic Net: When LASSO Gets Greedy

The elastic net was invented to fix exactly this problem. It blends L1 and L2 penalties: ρ × (L1 part) + (1 − ρ) × (L2 part), where ρ controls the mix (it's the l1_ratio in scikit-learn, separate from the overall penalty strength α). When ρ = 1, you get pure LASSO. When ρ = 0, pure Ridge. In between, you get something that produces zeros (from the L1 part) but handles correlated features gracefully (from the L2 part).

Going back to our sculpting analogy: elastic net chisels off the big chunks of noise like LASSO does, but when it encounters a group of correlated features — features that are essentially different measurements of the same underlying thing — it keeps the whole group together instead of arbitrarily picking one. It sands within the group (L2) while chiseling between groups (L1).

For gene expression data, where you might have 20,000 features in correlated clusters, elastic net is the standard choice. For our house price model with correlated features like square footage and living area, elastic net would keep both with similar coefficients, giving a more honest picture of what matters.

from sklearn.linear_model import ElasticNetCV

# Add a feature highly correlated with feature 0
X_corr = np.column_stack([X, X[:, 0] + np.random.randn(n) * 0.1])

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9], cv=5).fit(X_corr, y)
print(f"Best L1 ratio: {enet.l1_ratio_}")
print(f"Coef for feature 0: {enet.coef_[0]:.3f}")
print(f"Coef for correlated copy: {enet.coef_[-1]:.3f}")
# Both get similar nonzero coefficients — elastic net grouped them

Elastic net is often the safest default for linear models with regularization. But all of these methods — Ridge, LASSO, elastic net — assume that your signal is sparse in the coefficient space of your given features. What if the signal is sparse, but not in the features you measured?

Compressed Sensing: Fewer Measurements Than Dimensions

Classical signal processing has a rule called the Nyquist theorem: to perfectly reconstruct a signal, you need to sample at twice its highest frequency. For decades, this was treated as gospel. If your signal has bandwidth B, you need 2B samples per second. End of story.

In the early 2000s, Emmanuel Candès, Justin Romberg, and Terence Tao showed something that felt almost illegal: if your signal is sparse in some basis (meaning most of its coefficients in that basis are zero), you can recover it exactly from far fewer measurements than Nyquist demands. How many fewer? Roughly proportional to the sparsity level, not the signal dimension.

The mechanism is L1 minimization — the same principle behind LASSO, but applied to signal recovery. You take a small number of random linear measurements of your signal, then solve: "find the sparsest signal consistent with these measurements." Sparsest, in practice, means minimizing the L1 norm. The diamond-shaped constraint region does its corner-finding magic, and out pops the original sparse signal.

The math guarantees this works under a condition called the Restricted Isometry Property (RIP) — roughly, your measurement matrix must not collapse sparse signals into the same measurement. Random matrices (Gaussian, Bernoulli) satisfy RIP with high probability, which is why random measurements work so well.
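Here's a toy version of that recovery, as a sketch under the assumptions above — a random Gaussian measurement matrix and scikit-learn's Lasso standing in for the L1 minimizer (the dimensions, sparsity level, and alpha are all illustrative choices):

import numpy as np
from sklearn.linear_model import Lasso

# A 200-dimensional signal with only 10 nonzeros, observed through 60 random measurements
np.random.seed(0)
n_dim, n_meas, k = 200, 60, 10
signal = np.zeros(n_dim)
signal[np.random.choice(n_dim, k, replace=False)] = np.random.randn(k) * 3

A = np.random.randn(n_meas, n_dim) / np.sqrt(n_meas)  # random Gaussian measurement matrix
y_meas = A @ signal                                   # 60 numbers describing a 200-dim signal

# "Find the sparsest signal consistent with the measurements" via an L1 penalty
recovered = Lasso(alpha=0.01, fit_intercept=False, max_iter=50000).fit(A, y_meas).coef_
rel_err = np.linalg.norm(recovered - signal) / np.linalg.norm(signal)
print(f"Relative recovery error: {rel_err:.3f}")  # small, despite 60 measurements for 200 dims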

The most famous application is MRI. A full MRI scan requires the patient to lie still for 20-30 minutes while the scanner collects measurements in k-space (the Fourier domain of the image). The image is sparse in a wavelet basis. Compressed sensing says: take a fraction of the k-space measurements, solve an L1 problem, and reconstruct the image. Scan time drops to 5-10 minutes with nearly identical diagnostic quality. That's not a theoretical curiosity — this is in clinical use today in hospitals worldwide.

I still find it a little surprising that this works. You're recovering a signal from fewer measurements than it has dimensions. It feels like getting something for nothing. But the "something" isn't free — you're paying with the prior knowledge that the signal is sparse. That prior is doing enormous work.

Sparse Coding and Dictionary Learning

LASSO and compressed sensing assume your signal is sparse in some known basis — the original feature space, or wavelets, or Fourier. But what if you don't know the right basis? What if the best representation for your data is one you haven't discovered yet?

Sparse coding flips the script: instead of using a fixed basis, you learn a dictionary of basis elements — called atoms — from the data itself, and then represent each data point as a sparse combination of those atoms. The dictionary is tailored to your specific data, so the representations tend to be sparser and more meaningful than anything you'd get from a generic basis.

Going back to our sculpting analogy one more time: if LASSO is chiseling with tools you brought to the workshop, dictionary learning is forging your own custom chisels first, then sculpting. The tools fit the material better.

Concretely, given a matrix of data X, you factorize it as X ≈ D × A, where D is the dictionary (columns are atoms) and A is a matrix of sparse codes (each column has mostly zeros). The optimization alternates between two steps: fix the dictionary and find the best sparse codes, then fix the codes and update the dictionary. The algorithm K-SVD is the workhorse here — it updates atoms one at a time using a rank-1 SVD approximation.

When you apply dictionary learning to image patches, the learned atoms look remarkably like edge detectors at various orientations and scales — similar to what the primary visual cortex (V1) in the brain does. This connection to neuroscience was one of the original motivations: Olshausen and Field showed in 1996 that sparse coding on natural images produces Gabor-like filters, matching the receptive fields of biological neurons. The brain appears to use sparsity as a coding principle.

from sklearn.decomposition import DictionaryLearning
import numpy as np

# Simulate 100 data points, each a mix of a few dictionary atoms
np.random.seed(42)
true_dict = np.random.randn(20, 50)  # 50 atoms, each 20-dimensional
true_codes = np.zeros((50, 100))
for i in range(100):
    active = np.random.choice(50, 3, replace=False)  # only 3 atoms per sample
    true_codes[active, i] = np.random.randn(3)
X = (true_dict @ true_codes).T  # 100 samples, 20 features

# 'omp' honors transform_n_nonzero_coefs, enforcing at most 5 active atoms per code
dl = DictionaryLearning(n_components=50, transform_algorithm='omp',
                        transform_n_nonzero_coefs=5, random_state=42)
codes = dl.fit_transform(X)
print(f"Average nonzeros per sample: {np.mean(np.sum(codes != 0, axis=1)):.1f}")  # ≈ 5

Sparse coding has been used for image denoising (each noisy patch is represented sparsely in a learned dictionary, and the noise gets left out), image super-resolution, audio source separation, and as a feature extraction step before classification. It's less fashionable now that deep learning handles feature learning end-to-end, but the principle — learn a representation where most coefficients are zero — remains influential. Autoencoders with sparsity constraints are direct descendants of this idea.

The limitation of sparse coding is computational cost. The alternating optimization is expensive, and it doesn't scale to the millions-of-examples regime without careful engineering. For large-scale problems, deep learning took over not because sparse coding was wrong in principle, but because gradient descent on neural networks is easier to parallelize on GPUs.

Rest Stop

Congratulations on making it this far. If you want to stop here, you can.

You now have a solid mental model of the sparsity half of this story. You understand why L1 penalties produce exact zeros (the diamond has corners), how the LASSO path reveals which features matter, why elastic net handles correlated features better, how compressed sensing exploits sparsity for signal recovery, and how dictionary learning lets you discover the best sparse basis from data.

That's genuinely useful. If someone asks you in an interview "why does L1 give zeros but L2 doesn't?" you can draw the diamond and circle and explain it in thirty seconds. If someone mentions compressed sensing, you know the core idea: sparse signals can be recovered from few measurements via L1 minimization.

But there's a whole other half of this section — kernel methods — and it connects to sparsity in surprising ways. Kernel methods are about working in high-dimensional (even infinite-dimensional) spaces without ever going there, and they quietly underpin some of the most important ideas in modern ML, including attention in transformers. If the discomfort of not knowing what's underneath is nagging at you, read on.

The Kernel Trick, from Scratch

Back to our house price predictor. Suppose we add a feature: distance to downtown. We notice that houses very close to downtown are expensive (walkable urban), houses far out are expensive (large suburban lots), and houses in the middle are cheap (neither urban charm nor suburban space). If we plot price versus distance, it's a U-shape. No straight line can capture that relationship.

The classic fix: add a squared feature. Instead of using distance alone, use both distance and distance². Now a linear model in this expanded space can fit the U-shape, because it can learn a quadratic relationship. We've mapped our 1D input into a 2D feature space where the relationship is linear.

That works for one feature. But with 50 features, if we wanted all pairwise interactions and squared terms, we'd have about 1,325 new features. With cubic terms, we'd have about 23,000. The explosion gets worse fast. And for some problems, we'd need features in a space so high-dimensional that storing them is physically impossible.
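The counting itself is quick to verify — a one-line sketch using the stars-and-bars formula C(d + k, k) for the number of monomials of degree at most k on d inputs (constant term included):

from math import comb

# All monomials of degree <= k on 50 inputs, including the constant term
for k in [2, 3, 4]:
    print(f"degree <= {k}: {comb(50 + k, k):,} features")
# degree <= 2: 1,326   degree <= 3: 23,426   degree <= 4: 316,251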

Here's the key insight that makes kernel methods work. Many algorithms — SVMs, ridge regression, PCA — never actually need the feature vectors themselves. They only need dot products between pairs of data points. The dot product φ(x) · φ(y) in the high-dimensional space measures similarity between two transformed points. If we could compute that dot product without ever building φ(x) and φ(y), we'd get all the power of the high-dimensional space with none of the storage cost.

A kernel function K(x, y) does exactly that. It computes φ(x) · φ(y) — the dot product in the high-dimensional space — using only the original low-dimensional inputs x and y. No mapping required.

I think of it like a telescope. A telescope lets you observe objects in distant space without traveling there. A kernel lets you compute dot products in high-dimensional space without transforming there. We'll keep coming back to this telescope analogy as the ideas get richer.

Let's see this concretely. For a polynomial kernel of degree 2 with two input features:

import numpy as np

def explicit_map(x):
    """Map 2D -> 6D: all degree-2 terms."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def poly_kernel(x, y, c=1):
    """Polynomial kernel, degree 2."""
    return (np.dot(x, y) + c) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit_result = np.dot(explicit_map(a), explicit_map(b))
kernel_result = poly_kernel(a, b)

print(f"Explicit map then dot product: {explicit_result:.1f}")
print(f"Kernel shortcut:               {kernel_result:.1f}")
# Both give 144.0 — same answer, but the kernel never built the 6D vectors

Both approaches give 144.0. But the kernel version never created those 6D vectors. It computed one dot product and one squaring. For degree-2 polynomials, the savings are modest. For the RBF (Gaussian) kernel — K(x, y) = exp(−γ‖x − y‖²) — the implicit feature space is infinite-dimensional. You literally cannot build φ(x). The kernel is the only way in.

The RBF kernel is the Swiss army knife of kernel functions. Its implicit feature space is so rich that it can represent any smooth decision boundary, given enough data. The parameter γ controls how "local" the kernel is: large γ means each data point only influences its immediate neighbors (high variance, risk of overfitting), small γ means each point casts a wide influence (high bias, risk of underfitting). Finding the right γ is the central tuning challenge of kernel methods.
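One quick way to feel what γ does is to watch how fast the similarity decays with distance; the distances and γ values below are arbitrary, just for illustration.

# RBF similarity between two points as a function of their distance, for several gammas
distances = np.array([0.1, 0.5, 1.0, 2.0, 4.0])
for gamma in [0.1, 1.0, 10.0]:
    sims = np.exp(-gamma * distances**2)
    print(f"gamma={gamma:>4}: " + "  ".join(f"{s:.3f}" for s in sims))
# Large gamma: similarity vanishes outside the immediate neighborhood (local).
# Small gamma: even distant points still look fairly similar (global).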

But there's a catch. So far, the kernel trick is a computational convenience — a faster way to compute dot products. What ensures that this isn't a house of cards? What guarantees that there actually exists a feature space φ where K(x, y) = φ(x) · φ(y)?

RKHS: Why Kernels Aren't a Hack

I avoided reading about Reproducing Kernel Hilbert Spaces for years because the name is terrifying. Three intimidating words in a row, each more abstract than the last. But the core idea is more approachable than the name suggests, and understanding it changes how you think about kernel methods from "clever trick" to "principled framework."

Here's the idea, stripped down. Normally, we think of data points as vectors in ℝⁿ — lists of numbers. An RKHS is a space where the "vectors" are functions, and you can measure angles and distances between functions using an inner product, the same way you measure angles between vectors using a dot product.

The "reproducing" part is the magic. In this space, if you want to evaluate a function f at a point x — that is, compute f(x) — you can do it by taking the inner product of f with a special function K(x, ·). The kernel K reproduces function values via inner products. That sounds abstract, but it has a concrete payoff: the representer theorem.

The representer theorem says: for any regularized optimization problem in an RKHS (minimize some loss plus a smoothness penalty), the optimal solution can be written as a weighted sum of kernel evaluations at the training points. That is, f*(·) = Σᵢ αᵢ K(xᵢ, ·). You don't have to search over all possible functions — the answer is always a combination of kernels centered at your data. The optimization reduces from "search over infinite-dimensional function space" to "find n numbers α₁, ..., αₙ."
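To see that reduction in code, here's a minimal kernel ridge regression sketch — the canonical application of the representer theorem. The data, γ, and λ values are made up for illustration; the point is that fitting means solving for n numbers α, and predicting means summing kernel evaluations at the training points.

import numpy as np

def rbf(A, B, gamma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

np.random.seed(0)
X_train = np.random.uniform(-3, 3, (40, 1))
y_train = np.sin(X_train[:, 0]) + np.random.randn(40) * 0.1

lam = 0.1
K = rbf(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(40), y_train)  # the n numbers the theorem promises

x_new = np.array([[0.5]])
f_new = rbf(x_new, X_train) @ alpha  # f*(x) = sum_i alpha_i K(x_i, x)
print(f"Prediction at 0.5: {f_new[0]:.3f}  (sin(0.5) = {np.sin(0.5):.3f})")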

Using our telescope analogy: RKHS theory is what guarantees the telescope isn't lying to you. It proves that the things you observe through the kernel lens correspond to real geometric structure in a real (if abstract) space. The representer theorem then says that you only need to point the telescope at your training data to find the best function.

I'm still developing my intuition for why the reproducing property is the specific condition that makes everything work, rather than some other property. But the practical consequence is clear: kernels aren't a hack. They're backed by a rigorous mathematical framework that guarantees the existence of the feature space and the form of the optimal solution.

Not every function K(x, y) is a valid kernel. A function qualifies if and only if the matrix K(xᵢ, xⱼ) for any set of points is positive semi-definite — meaning all its eigenvalues are non-negative. This is called Mercer's condition, and it's the mathematical test that separates legitimate kernels from arbitrary similarity functions.
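You can eyeball Mercer's condition numerically — a sanity check rather than a proof, on made-up points: an RBF kernel matrix should have no negative eigenvalues, while an arbitrary similarity function has no such guarantee.

import numpy as np

np.random.seed(0)
pts = np.random.randn(30, 3)
sq_dists = np.sum((pts[:, None, :] - pts[None, :, :])**2, axis=-1)

K_rbf = np.exp(-0.5 * sq_dists)    # a legitimate kernel
K_sin = np.sin(pts @ pts.T)        # an arbitrary similarity, not a valid kernel

print(f"Smallest eigenvalue, RBF: {np.linalg.eigvalsh(K_rbf).min():.2e}")  # >= 0 up to rounding
print(f"Smallest eigenvalue, sin: {np.linalg.eigvalsh(K_sin).min():.2e}")  # typically negative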

Kernels Beyond SVM

The kernel trick was popularized through SVMs, but SVM is not the only algorithm that works entirely through dot products. Any algorithm that can be expressed in terms of inner products between data points can be kernelized. Three important extensions:

Kernel PCA. Standard PCA finds the directions of maximum variance by eigendecomposing the covariance matrix. But covariance is built from dot products. Replace those dot products with kernel evaluations, and you get kernel PCA — nonlinear dimensionality reduction. Apply it to data that lies on a Swiss roll or a circle, and it unfolds structures that linear PCA misses entirely. The eigendecomposition is done on the n × n kernel matrix rather than the d × d covariance matrix, which means it scales with the number of data points rather than the number of features.

Kernel ridge regression. Ordinary ridge regression finds coefficients that minimize squared error plus an L2 penalty. The dual formulation involves only dot products between training points. Swap those for kernel evaluations and you get kernel ridge regression — a nonlinear smoother that fits arbitrarily complex functions while still having a closed-form solution. It's also the deterministic heart of Gaussian process regression: the GP posterior mean is exactly kernel ridge regression, with the Bayesian machinery adding uncertainty estimates on top.

Multiple kernel learning (MKL). Sometimes no single kernel captures all the structure in your data. MKL learns a weighted combination of multiple kernels: K_combined(x, y) = Σₘ βₘ Kₘ(x, y), where the weights βₘ are learned alongside the model. This is especially useful when your data comes from heterogeneous sources — say, text features (where a linear kernel works well) combined with image features (where an RBF kernel is better). Instead of choosing one kernel, you let the optimizer blend them — a fixed-weight sketch follows the kernel PCA example below.

from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_circles

# Data on a circle — linear PCA can't separate inner from outer ring
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)

# In the kernel PCA space, the two circles become linearly separable
print(f"Original shape: {X.shape}")       # (200, 2)
print(f"Kernel PCA shape: {X_kpca.shape}")  # (200, 2) — same dims, but now separable

These extensions share a common limitation. And it's a big one.

The Scaling Wall

Every kernel method requires computing the kernel matrix: K(xᵢ, xⱼ) for all pairs of training points. With n training points, that's an n × n matrix. Computing it costs O(n²). Solving a kernel SVM or kernel ridge regression involves operations on this matrix that cost O(n²) to O(n³). Storing it requires O(n²) memory.

With 1,000 data points, the kernel matrix has a million entries. Fine. With 10,000 points, it has 100 million entries. Tight but manageable. With 100,000 points, it has 10 billion entries. With a million points — common in production systems — it would have a trillion entries. That matrix doesn't fit in any computer's memory.
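The back-of-envelope arithmetic, assuming dense float64 storage:

# n^2 entries, 8 bytes each
for n in [1_000, 10_000, 100_000, 1_000_000]:
    print(f"n = {n:>9,}: {n**2 * 8 / 1e9:>10,.3f} GB")
# 1,000 -> 0.008 GB; 10,000 -> 0.8 GB; 100,000 -> 80 GB; 1,000,000 -> 8,000 GB (8 TB)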

This is the scaling wall. It's the reason kernel methods largely lost to deep learning for large-scale problems. Neural networks train with stochastic gradient descent, processing one mini-batch at a time, never needing to hold all pairwise relationships in memory. Kernel methods, in their exact form, need the whole matrix.

For years, this seemed like a fundamental limitation. But two clever approximation techniques brought kernel methods back into the game for larger problems — and one of them turned out to be the key ingredient in making transformers efficient.

I oversimplified when I said earlier that kernel methods "can't scale." What I should have said is that exact kernel methods can't scale. The approximations we're about to see make them surprisingly practical.

Nyström Approximation

The first approximation is beautifully intuitive. If the full n × n kernel matrix is too big to store, what if we only compute part of it?

The Nyström approximation works like this. Pick m landmark points from your data, where m is much smaller than n — maybe 500 landmarks from 100,000 data points. Compute the kernel between every data point and every landmark (an n × m matrix, call it C). Compute the kernel among the landmarks themselves (an m × m matrix, call it W). Then approximate the full n × n matrix as K ≈ C W⁻¹ Cᵀ.

Why does this work? Because kernel matrices are typically low-rank or approximately low-rank — they have a lot of redundant structure. The information in the full n × n matrix is mostly captured by the relationships to those m landmarks. It's like describing a crowd of 100,000 people by their relationships to 500 representative individuals. You lose some nuance, but you capture most of the structure.

The cost drops from O(n²) for the full matrix to O(nm) for the approximation. With m = 500 and n = 100,000, that's a 200× improvement. The accuracy depends on how well your landmarks represent the data — random sampling works surprisingly well, and more sophisticated strategies (like k-means centroids) can improve things further.
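Here's a minimal sketch on made-up data with an arbitrary γ, using the pseudoinverse of W for numerical safety (n and m are small enough here that we can still build the exact matrix to compare against):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

np.random.seed(42)
n, m = 2000, 100
X_data = np.random.randn(n, 10)
landmarks = X_data[np.random.choice(n, m, replace=False)]

C = rbf_kernel(X_data, landmarks, gamma=0.1)       # n x m: everyone vs the landmarks
W = rbf_kernel(landmarks, landmarks, gamma=0.1)    # m x m: landmarks vs each other
K_nystrom = C @ np.linalg.pinv(W) @ C.T            # K ≈ C W⁻¹ Cᵀ (pinv for stability)

K_exact = rbf_kernel(X_data, X_data, gamma=0.1)
rel_err = np.linalg.norm(K_exact - K_nystrom) / np.linalg.norm(K_exact)
print(f"Relative Frobenius error: {rel_err:.4f}")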

Through our telescope analogy: the full kernel matrix is like pointing the telescope at every pair of stars in the sky. Nyström is like picking a few reference stars and triangulating everything else from those. Fewer observations, nearly the same map.

Random Fourier Features

The second approximation takes a completely different approach and, to me, is one of the most elegant ideas in applied mathematics. It was introduced by Ali Rahimi and Ben Recht in 2007 (the paper won a "Test of Time" award at NeurIPS ten years later).

The idea starts with a theorem from harmonic analysis called Bochner's theorem. It says that any shift-invariant positive-definite kernel (like the RBF kernel, which depends only on ‖x − y‖, not on x and y separately) can be written as the Fourier transform of a non-negative measure. In less formal terms: the RBF kernel is secretly a weighted average of cosine waves at different frequencies, where the frequencies are drawn from a Gaussian distribution.

The practical consequence: if you sample D random frequency vectors ω₁, ..., ω_D from that Gaussian, and for each data point x compute z(x) = √(2/D) [cos(ω₁ᵀx + b₁), ..., cos(ω_Dᵀx + b_D)] (where the bᵢ are random phase shifts), then the dot product z(x) · z(y) ≈ K(x, y). The approximation gets better as D grows.

What Rahimi and Recht did, in effect, is build an explicit finite-dimensional feature map φ that approximates the infinite-dimensional feature map of the RBF kernel. Once you have this map, you can use any linear algorithm — linear SVM, linear regression, logistic regression — on the transformed features, and you get approximately the same result as the kernelized version. No kernel matrix needed. The cost is O(nD) instead of O(n²), and D can be much smaller than n.

import numpy as np

def random_fourier_features(X, D=500, gamma=1.0):
    """Approximate RBF kernel with D random Fourier features."""
    d = X.shape[1]
    omega = np.random.randn(d, D) * np.sqrt(2 * gamma)
    b = np.random.uniform(0, 2 * np.pi, D)
    Z = np.sqrt(2.0 / D) * np.cos(X @ omega + b)
    return Z

# Compare: exact kernel evaluation vs RFF inner product.
# The two points must share the same random features, so transform them together.
np.random.seed(42)
x = np.random.randn(1, 5)
y = np.random.randn(1, 5)
gamma = 0.5

exact_rbf = np.exp(-gamma * np.sum((x - y)**2))
Z = random_fourier_features(np.vstack([x, y]), D=5000, gamma=gamma)
approx_rbf = Z[0] @ Z[1]

print(f"Exact RBF kernel:   {exact_rbf:.4f}")
print(f"RFF approximation:  {approx_rbf:.4f}")

My favorite thing about random Fourier features is that, aside from high-level explanations like "Bochner's theorem guarantees it," no one is completely certain why it works so well with relatively small D in practice. The theoretical bounds on D are often much larger than what's needed empirically. There's something forgiving about the random sampling that we don't fully understand yet.

Attention Is a Kernel

Here's where the story takes a turn I didn't expect when I started this dive. The attention mechanism in transformers — the thing powering GPT, BERT, and every large language model — is a kernel computation.

Standard scaled dot-product attention computes: Attention(Q, K, V) = softmax(QKᵀ / √d) V. The softmax(QKᵀ / √d) term is an n × n matrix of pairwise similarities between all query-key pairs. Each entry measures how much one token should attend to another. This is, conceptually, a kernel matrix — it computes a similarity function for all pairs of positions in the sequence.

And just like classical kernel methods, it has a scaling problem. The n × n attention matrix is quadratic in sequence length. For a sequence of 1,000 tokens, that's a million entries per attention head. For 10,000 tokens, it's 100 million. This is the same wall that kernel methods hit decades earlier.

The connection between attention and kernels sat dormant for a few years after the original transformer paper. Then in 2020, a paper called "Rethinking Attention with Performers" made the link explicit. The authors showed that softmax attention can be approximated using random Fourier features — the same technique Rahimi and Recht developed for kernel approximation thirteen years earlier.

The idea: instead of computing the full n × n attention matrix, replace the softmax with a kernel approximation. Use a feature map φ so that softmax(qᵢᵀkⱼ / √d) ≈ φ(qᵢ) · φ(kⱼ). Then, instead of computing the n × n matrix and multiplying by V, you can rearrange the computation using associativity of matrix multiplication: φ(Q) × (φ(K)ᵀ V). The product φ(K)ᵀ V is a D × d matrix (independent of n), and multiplying φ(Q) by it costs O(nDd) instead of O(n²d). Linear in n, not quadratic.
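Here's a sketch of just the rearrangement, with a simple positive feature map standing in for the Performer's actual softmax-kernel estimator (the map, sizes, and normalization are illustrative, not the real thing):

import numpy as np

def phi(x):
    """A hypothetical positive feature map — a stand-in, not the Performer estimator."""
    return np.maximum(x, 0) + 1e-6

np.random.seed(0)
n, d = 1000, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))

# Quadratic route: build the n x n similarity matrix, normalize rows, multiply by V
A = phi(Q) @ phi(K).T
out_quadratic = (A / A.sum(axis=1, keepdims=True)) @ V

# Linear route: compute phi(K)^T V first (D x d, independent of n), never forming n x n
KV = phi(K).T @ V
normalizer = phi(Q) @ phi(K).sum(axis=0)
out_linear = (phi(Q) @ KV) / normalizer[:, None]

print(np.allclose(out_quadratic, out_linear))  # True — same output, O(nDd) instead of O(n^2 d)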

The connection between attention and kernels still makes me pause every time I think about it. Transformers, the architecture that dominates modern AI, are running into the same scaling problem that kernel SVMs hit in the 2000s. And the solution — random feature approximation — was developed in the kernel methods community years before transformers existed. The old ideas weren't obsolete. They were waiting for the right problem.

Not every efficient attention method uses kernel approximations (some use sparse patterns, low-rank projections, or hashing), but the kernel perspective provides the cleanest theoretical framework for understanding why these approximations work.

Seeing the Full Picture

Let's step back and trace the path we've traveled. We started with a noisy house price predictor with 50 features. We discovered that L1 regularization (LASSO) can chisel away the noise features, producing exact zeros thanks to the geometry of the diamond constraint. We walked the LASSO path to see features drop out in order of importance. We saw elastic net handle correlated features more gracefully. We detoured through compressed sensing — the same L1 principle applied to signal recovery, powering fast MRI scans. We watched sparse coding learn its own dictionary of basis elements from data.

Then we crossed to kernel methods. We saw how the kernel trick lets you compute dot products in impossibly high-dimensional spaces without going there — a mathematical telescope. We grounded this in RKHS theory, which proves the telescope is telling the truth. We extended kernels beyond SVM to PCA, regression, and multi-kernel blending. We hit the scaling wall (the O(n²) kernel matrix), and found two escape routes: Nyström approximation (subsample the matrix) and random Fourier features (build an explicit approximate map). And we discovered that attention in transformers is itself a kernel computation, and that the efficient attention literature is rediscovering kernel approximation techniques from the 2000s.

The thread connecting all of it: working smarter in high-dimensional spaces. Sparsity says "most dimensions don't matter, find the ones that do." Kernels say "you need more dimensions, but you don't have to go there." The approximations say "when things get too big, subsample or randomize." These are the same ideas, wearing different hats.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a noisy regression problem and the question of how to delete irrelevant features automatically. That led us through the geometry of L1 penalties, the LASSO path, elastic nets, compressed sensing, and dictionary learning. Then we picked up kernel methods — the trick of working in high-dimensional spaces through a shortcut, grounded by RKHS theory, extended far beyond SVMs, limited by quadratic scaling, and rescued by Nyström and random Fourier approximations. We ended with the surprising connection between classical kernel methods and modern transformer attention.

My hope is that the next time you see an L1 penalty in a paper, instead of nodding and moving on, you'll picture the diamond and its corners. The next time someone mentions RKHS, instead of changing the subject, you'll think "functions as vectors, kernel reproduces evaluations, representer theorem gives the optimal form." And the next time you read about efficient attention, you'll recognize it as a kernel approximation problem that was solved, in principle, over a decade before transformers were invented.

Resources

Tibshirani (1996), "Regression Shrinkage and Selection via the LASSO" — the O.G. paper. Clear writing for a statistics paper, and the geometric argument holds up beautifully.

Hastie, Tibshirani, Wainwright, "Statistical Learning with Sparsity" — the definitive book on sparse methods. Chapter 2 on the LASSO path is particularly illuminating. Available free online.

Candès and Wakin (2008), "An Introduction to Compressive Sampling" — the most accessible introduction to compressed sensing I've found. IEEE Signal Processing Magazine.

Schölkopf and Smola, "Learning with Kernels" — still the best comprehensive treatment of kernel methods. Heavy on theory but packed with insight.

Rahimi and Recht (2007), "Random Features for Large-Scale Kernel Machines" — the random Fourier features paper. Remarkably short and readable for how influential it became.

Choromanski et al. (2020), "Rethinking Attention with Performers" — the paper that explicitly connected transformer attention to kernel methods and used random features for linear-time attention.