Emerging Methods

Chapter 16: Advanced Topics in Deep Learning
Scaling Laws · Meta-Learning · Continual Learning · Neurosymbolic AI · Geometric DL

I put off digging into the "emerging methods" grab-bag for an embarrassingly long time. Every few months a new paper would land — scaling laws revisited, test-time training, neurosymbolic this, topological that — and I'd skim the abstract, nod knowingly, and move on. The problem is that these topics don't form a neat narrative the way transformers or diffusion models do. They're scattered across subfields, each with its own jargon and its own conference clique. Eventually the discomfort of having a dozen blank spots in my mental map of modern ML grew too great to ignore. Here is that dive.

This section covers the methods and ideas that don't fit neatly into the standard deep learning curriculum but keep showing up in research papers, interviews, and system designs. Scaling laws and their recent unraveling. Test-time training, where models keep learning even after deployment. Meta-learning — MAML, Reptile, prototypical networks — the art of learning to learn. Continual learning and the fight against catastrophic forgetting. Neurosymbolic AI, where neural nets meet logic. Geometric deep learning, where symmetry becomes a design principle. And a few more that round out the frontier.

Before we start, a heads-up. We're going to cover a lot of ground, touching on group theory, topology, quantum computing, and program synthesis along the way. You don't need to know any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

What We'll Build Together

Neural scaling laws — and why they broke
Test-time training and adaptation
Meta-learning: MAML, Reptile, and Prototypical Networks
Few-shot and zero-shot generalization
Continual and lifelong learning
Neurosymbolic AI and program synthesis
Geometric deep learning
Topological data analysis meets ML
Quantum ML — hype vs. hope
Rest stop
Where it all connects

Neural Scaling Laws — And Why They Broke

Let's start with a question that consumed billions of dollars: if you want a better model, what should you spend your money on? More parameters? More data? More training time?

Imagine you're building a language model for a small startup. You've got a fixed compute budget — say, 1,000 GPU-hours. You could train a 10-billion-parameter model on a modest dataset, or a 1-billion-parameter model on ten times more text. Which gives you the better model?

In 2020, Kaplan and colleagues at OpenAI published a set of neural scaling laws — empirical relationships between model performance (measured by loss on a test set) and three quantities: the number of parameters, the size of the training dataset, and the total compute used. The headline finding was seductive in its simplicity: loss decreases as a power law with each of these quantities. Double your parameters, your loss drops by a predictable amount. Double your data, same thing. The relationship holds over many orders of magnitude, tracing remarkably straight lines on log-log plots.

That straight line on a log-log plot is a power law. If you've seen one before in physics — say, how the brightness of a star relates to its mass — you know the appeal. It means the system has no characteristic scale. There's no "cliff" where suddenly more parameters stop helping. The gains get smaller in absolute terms, but they don't stop.
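
To see what that means numerically, here's a toy sketch. The exponent is in the ballpark Kaplan et al. report for parameter scaling (roughly 0.076); the constant is made up, so only the shape of the curve means anything:

def loss_from_params(n_params, alpha=0.076, C=1e14):
    # Hypothetical power law: loss = (C / n_params) ** alpha; C is arbitrary
    return (C / n_params) ** alpha

for n in [1e9, 2e9, 4e9, 8e9]:
    print(f"{n:.0e} params -> loss {loss_from_params(n):.3f}")
# Each doubling multiplies the loss by the same factor, 2**-0.076 ≈ 0.949:
# the absolute gains shrink, but they never hit a cliff.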

Kaplan's original work suggested a lopsided strategy: parameters matter more than data. GPT-3 was designed accordingly — 175 billion parameters trained on what now looks like a relatively modest amount of text. The token-to-parameter ratio was roughly 1.7:1. Crank up the model size, feed it enough data to not completely overfit, and call it a day.

Chinchilla Changes the Story

In 2022, DeepMind's Hoffmann et al. published the Chinchilla paper, and the scaling law narrative shifted dramatically. Their key insight: Kaplan's team had undertrained their models. When you actually run the experiment properly — training each model size to something approaching its optimal data budget — you find that parameters and data should scale in roughly equal proportion. The magic ratio turned out to be about 20 tokens per parameter.

This had an immediate and expensive implication. GPT-3, with its 175B parameters trained on roughly 300B tokens (a ratio of ~1.7:1), was wildly undertrained by Chinchilla's standards. It should have seen 3.5 trillion tokens. DeepMind demonstrated this by training Chinchilla — a smaller 70B-parameter model on 1.4 trillion tokens — and watching it outperform the much larger Gopher (280B parameters). Smaller model, more data, better results.
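
The arithmetic behind those claims is quick to sanity-check, using the standard rule of thumb that training compute is roughly C ≈ 6·N·D for N parameters and D tokens:

gpt3_params, gpt3_tokens = 175e9, 300e9
print(gpt3_tokens / gpt3_params)        # ~1.7 tokens per parameter
print(20 * gpt3_params / 1e12)          # Chinchilla-optimal data: 3.5 trillion tokens

# Chinchilla (70B params, 1.4T tokens) vs Gopher (280B params, ~300B tokens):
# roughly the same training compute, spent very differently
print(6 * 70e9 * 1.4e12, 6 * 280e9 * 300e9)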

I'll be honest — when the Chinchilla paper dropped, I found the logic almost too clean. A single ratio that tells you exactly how to allocate your budget? It felt like finding a universal constant, and those are rare in ML.

And Then the Ratio Kept Climbing

Here's where the story gets interesting. By 2024, the industry had blown past the Chinchilla-optimal 20:1 ratio. Meta's Llama 3 trained at roughly 200 tokens per parameter. Microsoft's Phi-3 pushed to a staggering 870:1. These models weren't following Chinchilla's prescription at all — they were massively "overtraining" smaller models on enormous datasets.

Why? Because Chinchilla optimized for a world where training compute is your only cost. In the real world, you also pay for inference — every single time someone sends a query to your model. A 70B-parameter model that's Chinchilla-optimal costs far more per query than a 7B-parameter model that's been overtrained. The 2024 insight was to add inference cost to the equation: train smaller, train longer, deploy cheaper.

Think of it like building a house. The Chinchilla approach says: for a fixed construction budget, here's the optimal house size. The 2024 approach says: wait, we're going to live in this house for 30 years, so maybe build a slightly smaller, extremely well-insulated house that costs less to heat every winter. The upfront math changes completely when you account for the ongoing costs.

Era                    | Guiding Law             | Token:Parameter Ratio | Strategy
2020 (Kaplan)          | Scale parameters first  | ~1.7:1                | Bigger model, modest data
2022 (Chinchilla)      | Balance both equally    | ~20:1                 | Right-sized model, matched data
2024 (Llama 3, Phi-3)  | Account for inference   | 200:1 to 870:1        | Smaller model, massive data

The scaling laws themselves aren't wrong — loss still follows power laws with compute. What broke was the assumption about what you should optimize. The law describes the landscape; the strategy for traversing it keeps changing as economics evolve.

And there's a second axis of scaling emerging now, which we'll get to next. But the core limitation of parameter-scaling is already visible: you eventually run out of high-quality training data. The entire internet has roughly 10-15 trillion tokens of decent text. When your training recipe demands hundreds of trillions, the well runs dry.

Test-Time Training and Adaptation

If scaling up training data has limits, what if the model could keep learning after training — not on some new labeled dataset, but on each individual test input as it arrives?

That's the core idea behind test-time training (TTT). During training, you build a model with two objectives: the main task (say, classification or language modeling) and an auxiliary self-supervised task (say, predicting rotations or reconstructing masked patches). At inference time, you freeze the main task head but keep the self-supervised objective active. When a new input arrives, the model takes a few gradient steps on that input using the self-supervised loss, adapting its internal representations to the specific characteristics of this particular sample. Then it makes its prediction.

Let me make this concrete with a tiny example. Suppose you've trained an image classifier on photos taken during the day. Now someone sends it a nighttime photo. The visual statistics are completely different — darker, noisier, different contrast patterns. A standard model would push the image through its frozen weights and hope for the best. A TTT model would first say: "This input looks different from what I'm used to. Let me adjust my internal normalization by doing a few gradient steps on my self-supervised objective using this very image." Then, with its adapted representations, it makes the classification.

Sun et al. (2020) introduced this idea with a rotation prediction auxiliary task. The model learns to predict how an image has been rotated — 0°, 90°, 180°, 270° — alongside its main classification objective. At test time, the model adapts to each test image by fine-tuning on rotation prediction for that specific image. The rotation task requires no labels. It's asking: "do I understand the structure of this particular image well enough to tell which way is up?"
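
Here's a minimal sketch of that loop, assuming a model split into a shared backbone plus a rotation head and a classification head (the attribute names and hyperparameters are illustrative, not the paper's exact recipe):

import copy
import torch
import torch.nn.functional as F

def predict_with_ttt(model, image, n_steps=5, lr=1e-3):
    adapted = copy.deepcopy(model)                    # never mutate the deployed weights
    opt = torch.optim.SGD(adapted.backbone.parameters(), lr=lr)
    for _ in range(n_steps):
        angle = torch.randint(0, 4, (1,))             # 0°, 90°, 180°, 270°
        rotated = torch.rot90(image, k=int(angle), dims=(-2, -1))
        opt.zero_grad()
        loss = F.cross_entropy(adapted.rotation_head(adapted.backbone(rotated)), angle)
        loss.backward()                               # self-supervised: no labels needed
        opt.step()
    with torch.no_grad():                             # classify with the adapted features
        return adapted.class_head(adapted.backbone(image)).argmax(-1)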

TTT Layers — A More Radical Version

The 2024 wave of TTT research takes this further. Instead of treating test-time training as a post-hoc adaptation trick, newer work bakes it into the architecture itself. TTT layers are network components specifically designed to be updated at inference time. Only the normalization parameters, or a small adapter module, get modified — the bulk of the network stays frozen. This keeps the adaptation fast and prevents the model from overfitting to a single test example.

The most striking recent result: TTT layers used as a replacement for self-attention in sequence models. Where a transformer uses attention to "look back" at previous tokens, a TTT layer instead does gradient descent on those previous tokens' self-supervised loss, effectively encoding their information into the layer's weights. The claim is bold — linear complexity like an RNN, with quality approaching transformers.

I'm still developing my intuition for why updating weights at test time doesn't blow up more often than it does. The stability seems to come from two things: only a tiny fraction of parameters get updated (adapter layers, normalization statistics), and the self-supervised objectives are gentle — they push representations in the right direction without dramatic weight changes.

The limitation? Compute cost. Every test input now requires backward passes, not a single forward pass. That's fine for a medical imaging system processing 100 scans per day. It's a non-starter for a web search engine handling 100,000 queries per second. Test-time training trades inference speed for adaptability, and that tradeoff doesn't always pay off.

Meta-Learning: Learning to Learn

Test-time training adapts to individual inputs. But what if you wanted a model that could adapt to entirely new tasks — tasks it's never been trained on — using a handful of examples?

Let me set the scene with our running example. Imagine you're building a wildlife camera system for a national park. You train a classifier on 50 species of animals with thousands of images each. Works great. Then a biologist says: "We just spotted a new species of frog. Here are 3 photos. Can the system identify it?" Training a new model from scratch for one frog species is absurd. Fine-tuning on 3 images will almost certainly overfit. What you need is a model that has learned how to learn new categories quickly from tiny amounts of data.

That's meta-learning — the art of learning to learn. Instead of training on one big task, you train on hundreds of tiny simulated tasks, each designed to mimic the "here are a few examples, now classify" situation you'll face in the real world.

The Episode: A Tiny Simulated Task

Each training iteration in meta-learning is called an episode. Think of it as a self-contained mini-experiment. An episode has two halves. The support set contains the handful of labeled examples the model gets to study — like three photos of the new frog species plus three photos each of a few other species. The query set contains new images the model has to classify after seeing only the support set.

The standard benchmark format is N-way K-shot: N classes, K examples per class. "5-way 1-shot" means: here are 5 species you've never seen, one photo each. Now classify these other photos. It's deliberately brutal. And over thousands of episodes with different random species combinations, the model builds a transferable skill for rapid adaptation rather than memorizing any particular set of species.

Here's the key trick that makes this work: training conditions match test conditions. During training, the model never gets thousands of examples of anything. It always practices the "learn from a few" skill. By the time it encounters your real 3-photo frog problem, it's already done thousands of similar challenges.
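
To make the episodic setup concrete, here's a sketch of sampling one N-way K-shot episode from a dictionary mapping class names to lists of image tensors (the data structure and names are illustrative):

import random
import torch

def sample_episode(images_by_class, n_way=5, k_shot=3, n_query=5):
    # Pick N classes, then K support + n_query query images from each
    classes = random.sample(list(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):       # labels are episode-local: 0..N-1
        imgs = random.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in imgs[:k_shot]]
        query += [(img, label) for img in imgs[k_shot:]]
    random.shuffle(query)
    sx, sy = zip(*support)
    qx, qy = zip(*query)
    return torch.stack(sx), torch.tensor(sy), torch.stack(qx), torch.tensor(qy)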

Prototypical Networks: Embed, Average, Compare

The simplest and often most practical approach is the Prototypical Network (Snell et al., 2017). The idea is almost disappointingly intuitive: learn an embedding space where similar things cluster together, then classify new examples by proximity.

Here's how it works for our wildlife camera. The model has an encoder — a convolutional neural network, say — that maps any image to a point in a high-dimensional embedding space. Given a 5-way 3-shot episode, it embeds all 15 support images. For each of the 5 species, it computes the prototype — the mean of the 3 embeddings for that species. Now it has 5 points in embedding space, one per species. To classify a new query image, embed it and measure the distance to each prototype. The nearest prototype wins.

import torch
import torch.nn.functional as F

def prototypical_episode(encoder, support_imgs, support_labels,
                         query_imgs, query_labels, n_classes):
    # Embed everything through the shared encoder
    z_support = encoder(support_imgs)    # (N*K, embed_dim)
    z_query = encoder(query_imgs)        # (Q, embed_dim)

    # One prototype per class: the mean of that class's embeddings
    prototypes = torch.stack([
        z_support[support_labels == c].mean(dim=0)
        for c in range(n_classes)
    ])  # (N, embed_dim)

    # Distance from each query to each prototype
    dists = torch.cdist(z_query, prototypes)   # (Q, N)

    # Nearest prototype = prediction (negate distance for softmax)
    log_probs = F.log_softmax(-dists, dim=-1)
    loss = F.nll_loss(log_probs, query_labels)
    acc = (log_probs.argmax(-1) == query_labels).float().mean()
    return loss, acc

That's the whole algorithm. Embed, average, measure distance. The mean embedding acts as a robust summary of each class — more stable than any single example, especially when K is small and individual photos might be blurry or atypical. The original paper found that Euclidean distance outperforms cosine similarity here, because negative squared Euclidean distance through a softmax turns out to be equivalent to a linear classifier — clean hyperplane boundaries between prototypes.

Prototypical Networks have a beautiful property: at test time, there are zero gradient updates. One forward pass through the encoder, a few distance computations, and you're done. Many fancier methods from the 2018–2020 era beat ProtoNets by 1–2% while being 10× more complex. For our wildlife camera, this is the first thing I'd try.

MAML: Learning a Starting Point

Prototypical Networks assume the problem reduces to "embed and compare." But what if your tasks are structurally different — some classification, some regression, different output spaces? What if distance in an embedding space isn't enough?

Model-Agnostic Meta-Learning (MAML, Finn et al., 2017) takes a fundamentally different approach. Instead of learning an embedding space, it learns an initialization — a specific set of starting weights from which a few gradient descent steps on any new task produce a well-adapted model. You're not learning a model. You're learning a starting point for fast learning.

The mechanics have two nested loops. The inner loop takes the current weights θ and fine-tunes them on a single episode's support set for a few gradient steps, producing adapted weights θ'. The outer loop evaluates θ' on the query set and backpropagates that query loss all the way back through the inner gradient steps to update θ. It's computing gradients of gradients — second-order optimization.

# The MAML idea in pseudocode
for episode in episodes:
    support_x, support_y, query_x, query_y = episode

    # Inner loop: adapt to this specific task
    theta_adapted = theta.clone()
    for step in range(inner_steps):
        loss = cross_entropy(model(support_x, theta_adapted), support_y)
        theta_adapted = theta_adapted - inner_lr * grad(loss, theta_adapted)

    # Outer loop: evaluate adaptation quality, update starting point
    query_loss = cross_entropy(model(query_x, theta_adapted), query_y)
    theta = theta - outer_lr * grad(query_loss, theta)  # grad through inner loop!

That grad(query_loss, theta) is the expensive part. It differentiates through the inner loop — computing Hessian-vector products, maintaining large computation graphs. This is why practitioners typically use only 1–5 inner steps.

The good news: there are cheaper approximations. FOMAML (First-Order MAML) pretends the inner loop didn't involve θ, dropping all second-order terms. It works nearly as well. Reptile (Nichol et al., 2018) simplifies further — take several SGD steps on the support set, then nudge θ toward the result. No query set needed, no second derivatives. Both are much cheaper and usually good enough.
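
Reptile is simple enough to sketch in full. This is a minimal, hedged version (one meta-update from one episode, with a throwaway copy of the model for the inner loop; names and hyperparameters are illustrative):

import copy
import torch
import torch.nn.functional as F

def reptile_step(model, support_x, support_y,
                 inner_lr=0.01, inner_steps=5, meta_lr=0.1):
    # Inner loop: plain SGD on this task's support set, on a throwaway copy
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        opt.zero_grad()
        F.cross_entropy(adapted(support_x), support_y).backward()
        opt.step()
    # Meta-update: nudge the original weights toward where the inner loop ended up
    with torch.no_grad():
        for p, p_adapted in zip(model.parameters(), adapted.parameters()):
            p.add_(meta_lr * (p_adapted - p))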

The limitation of all these meta-learning methods becomes apparent when you compare them to what happened next in AI.

Few-Shot and Zero-Shot: The Foundation Model Plot Twist

Here's the uncomfortable truth for the meta-learning community. By 2024, for most practical few-shot problems, you should try three things before reaching for MAML or Prototypical Networks: prompt a large language model with a few examples (zero training required), fine-tune a foundation model's embeddings (CLIP, DINOv2) with a linear probe on your few examples, or use LoRA to cheaply adapt a pretrained model.

A CLIP embedding plus nearest-neighbor on 5-way 1-shot ImageNet subsets beats many purpose-built meta-learning algorithms. The foundation model has already "learned to learn" during pretraining — at a scale no meta-learning dataset can match. For our wildlife camera, if the park has internet access, a CLIP-based approach would likely outperform our carefully trained Prototypical Network.

Zero-shot generalization is even more striking. Models like CLIP can classify images into categories they were never explicitly trained on, using only a text description of the category. "A photo of a golden poison dart frog" — the model has never seen this exact label during training, but it has learned to align image and text embeddings in a shared space. The "zero examples" aren't really zero; the knowledge comes from the billions of image-text pairs seen during pretraining.
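
Here's roughly what that looks like with the Hugging Face CLIP wrappers (the checkpoint name, prompts, and file path are just examples):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a golden poison dart frog",
          "a photo of a white-tailed deer",
          "a photo of a red fox"]
image = Image.open("camera_trap_frame.jpg")            # illustrative path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))            # similarity to each text prompt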

So is meta-learning dead? No, but its role has shifted. The principles — episodic evaluation, learning to adapt, the support/query protocol — are baked into how we train and evaluate foundation models today. The standalone algorithms still win in specific niches: on-device deployment where you can't run a 7B model, truly novel domains where pretrained models have zero relevant data (satellite hyperspectral imaging, industrial defect detection), and privacy-constrained settings where data can't leave the device. The concepts are alive. The 2017-era methods are increasingly niche.

Continual Learning: The War Against Forgetting

Our wildlife camera learns to identify a new frog from 3 photos. Excellent. But what happens six months later when the biologist brings 20 more new species? We fine-tune on each batch, and one morning we discover the model can't recognize deer anymore — an animal it used to classify perfectly.

This is catastrophic forgetting, and it's not a bug. It's a fundamental property of how neural networks store knowledge. Every gradient update that improves frog recognition overwrites some of the weight configurations that encoded deer knowledge. The weights are a shared resource, and new learning doesn't politely carve out its own space — it bulldozes whatever was there before.

The naive fix — retrain from scratch on all data combined — works if you have all the old data and unlimited compute. In the real world, the original camera footage might be deleted. User data expires under privacy laws. And retraining a massive model every time new species arrive is financially absurd. We need models that can learn continuously without forgetting.

The Stability-Plasticity Dilemma

Every continual learning method is a different answer to the same tradeoff. Plasticity is the ability to learn new things — to reshape weights in response to new data. Stability is the ability to retain old knowledge — to resist changes that would erase what's already been learned. Too much plasticity and you forget everything. Too much stability and you can't learn anything new. The brain solves this remarkably well. Neural networks, so far, don't.

Think of the model's weight space as a physical landscape — valleys represent good configurations for different tasks. When you train on Task A, you descend into Valley A. When you then train on Task B, gradient descent pulls you toward Valley B, potentially dragging you up and out of Valley A entirely. The question is: can you find a spot that's reasonably deep in both valleys at once?

Elastic Weight Consolidation: Springs on Important Weights

The first major defense came from Kirkpatrick et al. (2017) with Elastic Weight Consolidation (EWC). The intuition is lovely: after training on Task A, figure out which weights were important for that task. Then, when training on Task B, add a penalty that makes those important weights resistant to change — like attaching springs that pull them back toward their Task A values.

The "importance" of each weight is estimated using the Fisher Information Matrix, which measures how much the loss changes when you wiggle each weight. A weight with high Fisher Information is one that Task A really depends on. A weight with low Fisher Information is one that Task A barely notices — it's free to be repurposed for Task B.

# EWC loss during Task B training (ewc_lambda is the penalty strength;
# "lambda" on its own is a reserved word in Python)
L_total = L_taskB(theta) + (ewc_lambda / 2) * sum(
    F_i * (theta_i - theta_star_A_i) ** 2
    for i in all_weights
)
# F_i: Fisher Information (spring stiffness) for weight i
# theta_star_A_i: optimal value of weight i after Task A
# High F_i = stiff spring, weight stays near its Task A value
# Low F_i  = loose spring, weight is free to move
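
Where do the F_i values come from? A common recipe (this is a hedged sketch, not the paper's exact procedure) is the empirical diagonal Fisher: average the squared gradients of the loss over Task A's data once training on Task A is done. The model and data loader below are assumed:

import torch
import torch.nn.functional as F

def estimate_fisher(model, task_A_loader):
    # Empirical diagonal Fisher: mean squared gradient per weight over Task A
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in task_A_loader:
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(task_A_loader) for n, f in fisher.items()}

# After Task A: snapshot the weights the springs will anchor to
theta_star_A = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher_A = estimate_fisher(model, task_A_loader)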

The "elastic" in the name comes from those springs — each weight has its own stiffness. It's a beautiful analogy that I keep returning to. Variants like Synaptic Intelligence (SI) compute importance on the fly during training, and Memory Aware Synapses (MAS) measure importance without needing labels. They all perform similarly in practice.

The limitation becomes clear after many tasks. Each new task adds more springs. Eventually every weight is so spring-loaded that there's no room to move at all. The model becomes rigid — all stability, no plasticity. Our landscape analogy: the spring network anchors you to Valleys A, B, C, D... and eventually there's no position that satisfies all the springs simultaneously.

Replay: The Crude Fix That Works

The second family of defenses is embarrassingly simple. Keep a small buffer of examples from previous tasks — maybe 200 per task. Each training batch mixes new data with samples randomly drawn from the buffer. It's crude. It works surprisingly well.

# Experience replay, stripped to its essence
for new_images, new_labels in new_species_loader:
    old_images, old_labels = replay_buffer.sample(batch_size=32)
    all_images = torch.cat([new_images, old_images])
    all_labels = torch.cat([new_labels, old_labels])
    optimizer.zero_grad()
    loss = F.cross_entropy(model(all_images), all_labels)
    loss.backward()
    optimizer.step()

The model literally rehearses old tasks while learning new ones. A-GEM (Averaged Gradient Episodic Memory) is a more principled version: it uses the buffer to constrain the gradient direction. If the gradient from the new task would increase the loss on the buffered examples, project it to the closest direction that doesn't. Roughly the same cost as plain replay, with a guarantee attached.
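
The projection itself is one line of algebra. A minimal sketch on flattened gradient vectors (g_new from the current batch, g_ref from a replay-buffer batch):

import torch

def agem_project(g_new, g_ref):
    # g_new: flattened gradient from the current batch
    # g_ref: flattened gradient from a replay-buffer batch
    dot = torch.dot(g_new, g_ref)
    if dot < 0:                        # the update would hurt the old tasks
        g_new = g_new - (dot / torch.dot(g_ref, g_ref)) * g_ref
    return g_new                       # otherwise leave it untouched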

When you can't store real data — privacy regulations, deletion requirements — generative replay trains a small generative model alongside the classifier and synthesizes fake examples of old tasks. Elegant in theory. In practice, if the generator also forgets... cascading failure. You're trusting a model that has the same disease to produce the medicine.

Architecture-Based: Give Each Task Its Own Space

The third approach avoids the problem entirely by separating parameters. Progressive Networks add a fresh neural network column for each task, freeze old columns, and connect them with lateral links. Zero forgetting by construction — old weights never change. But the network grows linearly with tasks, which is impractical after a dozen.

PackNet is cleverer: train on Task A, prune aggressively (remove 80% of weights), freeze survivors, train the now-freed weights on Task B. Each task gets its own weight subset within a fixed-size network. Works for 10–20 tasks before capacity runs out.

In practice, the best results come from combining approaches — EWC-style regularization plus a small replay buffer plus task-aware output heads. I'm still not convinced any single method "solves" continual learning. Most production systems retrain periodically on all available data, using continual learning techniques to bridge the gap between full retrains. The field is important and active, but the honest assessment is: don't expect a drop-in solution today.

Rest Stop

If you've made it this far, take a breath. You now have a working understanding of the most practically relevant emerging methods: how scaling laws evolved from "bigger is better" to "smaller and smarter," how models can adapt at test time, how meta-learning trains models that learn from tiny amounts of data, and how continual learning fights catastrophic forgetting. That's a solid mental model for most conversations and interviews.

What comes next is more speculative — methods from the research frontier that haven't yet reached production ubiquity but keep appearing in top conference papers. Neurosymbolic AI, geometric deep learning, topological data analysis, and quantum ML. These are the ideas that might define the next decade, or might remain academic curiosities. Either way, knowing they exist — and roughly how they work — puts you ahead of most practitioners.

If the discomfort of not knowing what's underneath is nagging at you, read on.

Neurosymbolic AI and Program Synthesis

Every method we've discussed so far shares a common DNA: they're all fundamentally pattern matching. Neural networks learn statistical regularities from data. They're spectacularly good at it. But they have a well-documented blind spot: systematic, compositional reasoning. A language model that can solve "What is 347 + 528?" by pattern matching will eventually fail on "What is 3,478,291 + 5,283,742?" because it's not actually doing arithmetic — it's doing fuzzy recall of patterns that look like arithmetic.

Neurosymbolic AI tries to get the best of both worlds. Let the neural network handle what it's good at — perceiving, embedding, pattern recognition — and hand off the reasoning to a symbolic system that actually follows logical rules. The neural side sees the image of a chess board. The symbolic side runs the actual game tree search.

Let me make this concrete with our wildlife camera. Suppose the biologist defines a rule: "If the animal has webbed feet AND is found near water AND is smaller than 10 cm, classify it as an amphibian." A pure neural network would need thousands of examples to implicitly learn this rule. A neurosymbolic system could encode the rule explicitly in logic and use the neural network only for the perception part — detecting webbed feet, estimating size, recognizing water.
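
A toy version of that split, with hypothetical neural "perception" functions (detect_webbed_feet, detect_water, estimate_length_cm, neural_species_classifier are illustrative names) feeding an explicit symbolic rule:

def classify(image):
    webbed = detect_webbed_feet(image) > 0.5    # neural: perception
    water = detect_water(image) > 0.5           # neural: perception
    small = estimate_length_cm(image) < 10.0    # neural: regression
    # symbolic: the biologist's rule, applied exactly, every single time
    if webbed and water and small:
        return "amphibian"
    return neural_species_classifier(image)     # otherwise fall back to the learned classifier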

Program Synthesis: Learning Code, Not Weights

Program synthesis takes this further. Instead of learning a weight matrix that maps inputs to outputs, the system learns an actual program — executable code with variables, loops, and conditionals. The program is interpretable by construction. You can read it, verify it, and trust it in ways you never can with a neural network.

The typical approach: a neural network proposes candidate programs (or program fragments) from a library of primitives, and a symbolic search verifies which programs satisfy the input-output specification. DeepCoder (Balog et al., 2017) pioneered this — a neural guide that predicts which operations are likely to appear in the solution, dramatically narrowing the combinatorial search space.

In the era of large language models, program synthesis has taken an unexpected turn. LLMs can generate code directly from natural language specifications. That's a form of neural program synthesis — the "symbolic" output (the program) is produced by a neural system. But the generated programs can then be executed and verified symbolically, creating a neuro → symbolic → execution pipeline that gets correctness guarantees no pure neural system can offer.
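
The verification half of that pipeline can be surprisingly small. A sketch, assuming candidate program strings (from an LLM or a DSL search) that each define a function f, and a spec given as input-output pairs (never exec untrusted code outside a sandbox):

def synthesize(candidate_sources, examples):
    # candidate_sources: program strings, each expected to define a function f
    # examples: list of (input, expected_output) pairs that specify the task
    for source in candidate_sources:
        namespace = {}
        try:
            exec(source, namespace)               # illustrative only; sandbox in practice
            f = namespace["f"]
            if all(f(x) == y for x, y in examples):
                return source                     # first program consistent with the spec
        except Exception:
            continue                              # broken candidates are simply rejected
    return None

# e.g. synthesize(llm_generated_programs, [(2, 4), (5, 25), (9, 81)])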

The persistent challenge: aligning continuous neural representations with discrete symbolic structures. Neural networks live in smooth, differentiable spaces. Logic lives in crisp true/false spaces. Bridging that gap — making the system end-to-end differentiable while still producing valid symbolic outputs — remains an active research frontier. No one has fully solved it, and I suspect the eventual solution won't look like either pure approach.

Geometric Deep Learning

Here's a question that seems trivial but has deep consequences: why do convolutional neural networks work so well on images?

The standard answer is "parameter sharing" and "local receptive fields." But there's a deeper reason: CNNs exploit a specific symmetry of images. If you slide a cat three pixels to the right, it's still a cat. CNNs are built to be translation equivariant — shifting the input shifts the output in the same way, without needing to learn this property from data. That built-in symmetry is doing enormous work. Without it, the network would need to see the cat at every possible position to learn that position doesn't matter.

Geometric deep learning generalizes this insight. It asks: what other symmetries exist in your data, and how can you build them into the network architecture?

The word equivariance is key here. A function f is equivariant to a transformation T if applying T to the input and then f gives the same result as applying f first and then T. In notation: f(T(x)) = T'(f(x)). CNNs are equivariant to translations. But what about rotations? Reflections? Permutations? The symmetries of 3D space?
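
Here's a quick numerical check of that definition for the translation case; circular padding keeps the equality exact at the image borders:

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 1, 32, 32)

def shift(t):                                    # T: slide the image 5 pixels to the right
    return torch.roll(t, shifts=5, dims=-1)

print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-6))   # True: f(T(x)) = T(f(x))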

Our wildlife camera makes this tangible. Suppose we want to classify animals regardless of the angle the camera catches them. A standard CNN needs to see the deer from many angles to learn that orientation doesn't matter. A rotation-equivariant network knows this by construction. Every filter automatically applies to all rotations of the input, producing features that transform predictably under rotation. Fewer parameters, less data needed, better generalization.

The Symmetry Zoo

Different data types have different symmetries, and different architectures encode them:

Images on a grid are equivariant to translations — that's standard CNNs.
Sets and point clouds are equivariant to permutations — the order of points shouldn't matter. This gives us architectures like DeepSets and PointNet (sketched just below).
Graphs are equivariant to node relabeling — you can shuffle node indices without changing the graph. That's what Graph Neural Networks (covered in Ch 16 Section 1) exploit.
Molecules and 3D structures respect the symmetries of 3D Euclidean space — rotations and translations. Architectures like SE(3)-Transformers and EGNN build this in, which is why they've become essential in protein folding (AlphaFold2 uses equivariant components) and drug discovery.
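
Permutation invariance is the easiest of these to see in code. A minimal DeepSets-style model (the layer sizes are arbitrary): sum pooling over the points makes the output independent of their order by construction.

import torch
import torch.nn as nn

class DeepSet(nn.Module):
    def __init__(self, in_dim=3, hidden=64, out_dim=10):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, points):                 # points: (batch, n_points, in_dim)
        return self.rho(self.phi(points).sum(dim=1))   # sum pooling ignores point order

model = DeepSet()
x = torch.randn(2, 100, 3)
perm = torch.randperm(100)
assert torch.allclose(model(x), model(x[:, perm]), atol=1e-4)   # same output, any order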

The mathematical framework behind all of this is group theory. A group is a set of transformations (rotations, translations, permutations) with a composition operation. Equivariant networks are designed so that their operations commute with the group action. If that sounds abstract, here's the practical payoff: when you build the right symmetry into your architecture, you get models that generalize better with less data, because they don't waste capacity learning invariances they already know.

Bronstein et al. crystallized this framework in their 2021 paper "Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges" — which I'd describe as the Rosetta Stone for understanding why CNNs, GNNs, Transformers, and equivariant architectures are all instances of the same underlying idea. It's dense reading, but it fundamentally changed how I think about architecture design.

The limitation of geometric deep learning is that you need to know your symmetries upfront. If you get the symmetry wrong — if your data actually isn't rotationally symmetric, say — the constraint hurts more than it helps. And for many real-world problems, the exact symmetry structure is unclear. The data is approximately symmetric, but not exactly. Handling approximate symmetries gracefully is an open problem.

Topological Data Analysis Meets ML

Geometric deep learning cares about symmetries — the transformations that leave your data unchanged. Topological data analysis (TDA) cares about shape — the qualitative structure of your data that persists even when you stretch, squeeze, or deform it.

The core tool is persistent homology. Here's the intuition. Imagine you have a point cloud — thousands of data points scattered in high-dimensional space. You want to understand the "shape" of this cloud. Are there clusters? Holes? Loops? Voids? These are topological features, and they're invariant to the exact coordinates. Stretch the cloud, rotate it, deform it smoothly — the clusters, holes, and voids persist.

Persistent homology works like this: start with your point cloud and gradually grow a ball around each point. As the balls expand, they start overlapping, forming connections. At some ball radius, a cluster of points becomes connected — a "component" is born. At a larger radius, a loop of connections forms — a "hole" is born. At an even larger radius, that hole might get filled in and die. The key insight: features that persist across a wide range of radii are real structure. Features that appear and vanish quickly are noise.

The output is typically a persistence diagram — a scatter plot where each point represents a topological feature, with birth time on one axis and death time on the other. Points far from the diagonal are long-lived, robust features. Points near the diagonal are ephemeral noise. You can compute these diagrams from any point cloud using tools like Ripser or GUDHI, and they serve as powerful features for downstream machine learning.
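
In code, computing a persistence diagram takes only a couple of lines, assuming the ripser package:

import numpy as np
from ripser import ripser

points = np.random.randn(200, 3)                 # a toy point cloud
diagrams = ripser(points, maxdim=1)["dgms"]      # H0 (components) and H1 (loops)
h0, h1 = diagrams
# Each row is (birth, death); persistence = death - birth
persistence = h1[:, 1] - h1[:, 0]
print(h1[np.argsort(persistence)[::-1][:5]])     # the five longest-lived loops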

For our wildlife camera, persistent homology might seem overkill — and for image classification, it usually is. But consider the broader park monitoring system. Sensor networks measuring temperature, humidity, and wildlife movement across the park produce complex, high-dimensional time series. TDA can detect regime shifts — sudden changes in the topological structure of the data — that indicate ecological events like seasonal migrations or habitat disruption. These are structural changes that standard statistical tests might miss.

The challenge for integration with deep learning is that persistence diagrams aren't vectors — they're multisets of points, with variable size. You can't directly feed them into a neural network. Approaches like persistence images (kernel density estimation on the diagram), persistence landscapes (functional summaries), and recently, differentiable topological layers that backpropagate through the persistence computation itself, are all active research areas. The software is maturing — Giotto-tda, GUDHI, and Ripser integrate with PyTorch — but the field is still establishing which topological features are worth computing for which problems.

Quantum ML — Hype Versus Hope

I'll be direct: quantum machine learning is the topic on this list where the gap between excitement and practical reality is widest. Physicists and quantum computing researchers are genuinely excited, and I understand why — the mathematics is elegant. But as an ML practitioner, I need to be honest about where things stand.

The basic premise: quantum computers operate on qubits, which can exist in superpositions of 0 and 1. This means a quantum computer with n qubits can, in principle, explore 2ⁿ states simultaneously. The hope is that certain ML computations — optimization landscapes with many local minima, kernel methods in exponentially large feature spaces, sampling from complex distributions — could be done exponentially faster on quantum hardware.

The most promising near-term approaches are hybrid quantum-classical systems. A classical computer handles the data loading, preprocessing, and most of the computation. A quantum circuit handles a specific subroutine — typically a parameterized quantum circuit (PQC) that acts like a neural network layer, with rotation angles as trainable parameters. You train the rotation angles by measuring the circuit output and computing classical gradients.
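
A minimal parameterized quantum circuit, sketched with PennyLane (assumed installed); the rotation angles are the trainable parameters, and measuring an expectation value gives the layer's scalar output:

import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def circuit(params, x):
    qml.RX(x[0], wires=0)                 # encode the classical input as rotations
    qml.RX(x[1], wires=1)
    qml.RY(params[0], wires=0)            # trainable rotation angles
    qml.RY(params[1], wires=1)
    qml.CNOT(wires=[0, 1])                # entangle the two qubits
    return qml.expval(qml.PauliZ(0))      # measured expectation = scalar output

params = np.array([0.1, 0.2], requires_grad=True)
print(circuit(params, np.array([0.5, -0.3])))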

The showstopping problem right now is barren plateaus. As quantum circuits get larger (more qubits, more layers), the gradients of the loss function with respect to the circuit parameters vanish exponentially. This is the quantum equivalent of the vanishing gradient problem, but worse — it's been proven to be exponential in the number of qubits for certain circuit architectures (McClean et al., 2018). Mitigation strategies exist (problem-specific circuit designs, local cost functions, better initialization) but none fully solves it.

Current quantum hardware — so-called NISQ (Noisy Intermediate-Scale Quantum) devices — has tens to a few hundred noisy qubits. That's enough for proof-of-concept experiments but nowhere near enough for practical advantage on real ML problems. The most honest summary: quantum ML is a bet on future hardware. If fault-tolerant quantum computers with thousands of logical qubits arrive, certain ML tasks may see genuine speedups. If they don't, classical hardware with algorithmic improvements may keep winning.

My advice: know it exists, understand the basic ideas (superposition, entanglement, PQCs, barren plateaus), and keep an eye on the hardware progress. Don't build production systems around it today. The researchers working on it are brilliant, and there may be a genuine revolution coming. But as of 2025, it's firmly in the "promising research" category, not the "engineering tool" category.

Where It All Connects

If you're still with me, thank you. I hope it was worth it.

We started with scaling laws — the empirical rules that guided billions of dollars in compute allocation — and watched them evolve from "bigger is better" to "smaller and smarter and cheaper to run." We saw how test-time training lets models adapt at inference, and how meta-learning teaches models to learn from tiny datasets. We confronted catastrophic forgetting and the three-way tug-of-war between stability, plasticity, and memory. Then we ventured into the research frontier: neurosymbolic AI blending pattern matching with logical reasoning, geometric deep learning encoding symmetries directly into architectures, topological data analysis capturing the shape of data, and quantum ML betting on hardware that doesn't quite exist yet.

These topics don't form a linear story. They're more like the spokes of a wheel, all radiating from a central question: how do we make ML systems that are more capable, more efficient, and more robust than what we have today? Scaling laws and test-time training are about efficiency. Meta-learning and continual learning are about adaptability. Neurosymbolic AI is about reasoning. Geometric DL is about building in prior knowledge. TDA and quantum ML are about new mathematical tools.

My hope is that the next time you encounter one of these terms in a paper abstract, a conference talk, or a job interview, instead of nodding and moving on like I did for too long, you'll have a pretty darn good mental model of what's going on under the hood. And if one of them turns out to be the right tool for your problem — a continual learning strategy for your streaming deployment, an equivariant architecture for your molecular data, a meta-learning setup for your few-shot edge device — you'll know where to start digging.

Resources and Credits

Kaplan et al., "Scaling Laws for Neural Language Models" (2020) — the paper that started the scaling conversation. Still worth reading for the methodology alone.

Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022) — the Chinchilla paper. Changed how every lab allocates its training budget.

Finn et al., "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (2017) — the MAML paper. One of the most cited ML papers of the decade, and surprisingly readable.

Snell et al., "Prototypical Networks for Few-shot Learning" (2017) — the algorithm is almost disappointingly simple, and that's what makes it great.

Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks" (2017) — the EWC paper. The spring analogy in the original is even better than mine.

Bronstein et al., "Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges" (2021) — the Rosetta Stone of geometric DL. Dense but unforgettable.

Otter et al., "A roadmap for the computation of persistent homology" (2017) — the most accessible on-ramp to TDA I've found.