Nice to Know
I'll be honest — I kept a running list of architectures and ideas I'd see referenced in papers: I'd skim the abstract, think "that's interesting," and then never actually sit down with any of them. Kolmogorov-Arnold Networks. Liquid neural networks. Spiking neurons. Each one felt like a rabbit hole I'd get lost in. But over time, the discomfort of seeing these terms everywhere without understanding what they actually do grew too heavy to ignore. Here is that dive.
This section is a tour of ideas that live at the edges of mainstream deep learning. None of them are required for your day-to-day work with transformers and CNNs. But each one represents a genuinely different way of thinking about what a neural network is — and at least a few of them will likely matter a great deal in the years ahead.
Before we start, a heads-up. We're going to cover a lot of ground — learnable activation functions, biological neurons, networks that generate other networks, AI that predicts protein structures. You don't need background in any of it. We'll build each idea from scratch, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
Contents
Kolmogorov-Arnold Networks — When Edges Do the Thinking
Liquid Neural Networks — Computation That Flows
Hypernetworks — Networks That Build Networks
Neural Cellular Automata — Growing Intelligence from Local Rules
Rest Stop
Boltzmann Machines and the Energy Landscape
Reservoir Computing — The Lazy Genius
Spiking Neural Networks — The Brain's Native Language
Differentiable Rendering and Neural Implicit Representations
AI for Science — When Models Meet the Physical World
Resources and Credits
Kolmogorov-Arnold Networks — When Edges Do the Thinking
To understand why KANs matter, we need to first feel the limitation they're solving. In a standard MLP, every neuron does two things: it takes a weighted sum of its inputs, then passes the result through a fixed activation function — ReLU, GELU, whatever you chose at design time. The weights are learnable. The activation function is not. It's the same function for every neuron, frozen in place before training begins.
Think of it like a kitchen where every chef has the same knife. You can rearrange the chefs (change the weights), but no one gets a different tool. KANs flip this around entirely.
In a Kolmogorov-Arnold Network (introduced in 2024 by Ziming Liu et al.), the learnable parts live on the edges, not the nodes. Each connection between neurons has its own small, trainable function — typically a spline, a smooth curve that can bend into whatever shape the data demands. The nodes themselves do nothing fancy — they sum their incoming values. All the expressiveness lives in the wiring.
The mathematical foundation comes from the Kolmogorov-Arnold representation theorem, a result from the 1950s that says any continuous function of multiple variables can be decomposed into sums and compositions of functions of a single variable. That's a mouthful. What it means in practice: if you can learn the right set of one-dimensional functions and compose them properly, you can represent any continuous mapping. KANs are a direct neural network implementation of that theorem.
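For the curious, the theorem itself fits on one line. Any continuous function f of n variables can be written as

f(x_1, ..., x_n) = Σ_{q=1}^{2n+1} Φ_q( Σ_{p=1}^{n} φ_{q,p}(x_p) )

where every Φ_q and every φ_{q,p} is a continuous function of a single variable. The KAN paper relaxes this rigid two-layer form into networks of arbitrary width and depth, with learned splines playing the role of the one-dimensional functions.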
Here's a concrete way to see the difference. In an MLP, you might have 3 inputs flowing into a neuron. Each input gets multiplied by a scalar weight (say 0.7, -1.2, 0.3), those products get summed, and the result passes through ReLU. In a KAN, each of those 3 inputs passes through its own learned spline function — a flexible curve that might squash small values, amplify medium ones, and saturate at the high end — and then those transformed values get summed at the node. No fixed activation function anywhere.
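To make that concrete, here's a minimal sketch of a KAN-style layer in PyTorch. It's deliberately simplified: each edge function is a sum of fixed Gaussian bumps with learnable coefficients rather than the B-spline parameterization in the paper, and names like `KANLayer` and `num_basis` are mine, not the official `pykan` API.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """One KAN-style layer: a learnable 1-D function on every edge, a plain sum at every node.

    Each edge function is a small curve built from fixed Gaussian bumps with
    learnable coefficients, a simplification of the B-splines in the original paper.
    """
    def __init__(self, in_dim, out_dim, num_basis=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        centers = torch.linspace(*grid_range, num_basis)
        self.register_buffer("centers", centers)            # fixed bump locations
        self.width = (grid_range[1] - grid_range[0]) / num_basis
        # one coefficient vector per edge: (out_dim, in_dim, num_basis)
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_basis))

    def forward(self, x):                                    # x: (batch, in_dim)
        # Evaluate every basis bump at every input: (batch, in_dim, num_basis)
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        # Edge functions: phi_{j,i}(x_i) = sum_k coef[j,i,k] * basis_k(x_i)
        edge_out = torch.einsum("bik,oik->boi", basis, self.coef)
        # Nodes do nothing fancy: they sum their incoming edges -> (batch, out_dim)
        return edge_out.sum(dim=-1)

# A tiny two-layer KAN for a 3-input regression problem.
model = nn.Sequential(KANLayer(3, 5), KANLayer(5, 1))
y = model(torch.randn(16, 3))
print(y.shape)  # torch.Size([16, 1])
```

Notice there is no ReLU or GELU anywhere: all the nonlinearity lives in the learned edge curves.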
The result? KANs can represent complex functions with fewer parameters and shallower networks than MLPs. They're especially strong on tasks where the underlying relationship has clean mathematical structure — scientific modeling, symbolic regression, physics simulations. On messy, high-dimensional tasks like ImageNet classification, they haven't displaced transformers. Back to our kitchen analogy: when every chef gets a custom-forged knife for their specific ingredient, you need fewer chefs to make the same dish — but a massive banquet hall with 1,000 dishes still favors the standardized approach.
I'm still developing my intuition for where KANs will settle in the ecosystem. They're genuinely novel, not a rehash of an old idea. But whether they'll become a mainstream tool or remain a niche for scientific computing is something no one has fully resolved yet.
Liquid Neural Networks — Computation That Flows
Standard neural networks have a curious property: the moment training ends, they freeze. Every weight locks into place. The network becomes a fixed function — it will respond to new inputs, but it will never adapt to them. If the world shifts (new lighting conditions, seasonal changes, different sensor noise), the frozen network doesn't notice. It keeps applying yesterday's understanding to today's data.
MIT's CSAIL team, inspired by the nervous system of a tiny worm called C. elegans (which manages to navigate its environment with only 302 neurons), developed liquid neural networks to address exactly this rigidity.
The core idea: instead of discrete layer-by-layer computation, each neuron's state evolves continuously over time according to an ordinary differential equation (ODE). And here's the key twist — the time constant of each neuron (how quickly it reacts) is itself a function of both the neuron's current state and the incoming input. This is called a liquid time constant, and it's what gives the network its name. The neurons literally adjust their responsiveness based on what they're seeing.
Imagine two faucets. A regular neural network is a faucet that's been set to a fixed flow rate — you turn it on during training, find the right position, then weld it in place. A liquid neural network is a faucet that adjusts its own flow rate based on the water pressure coming in. More pressure? It opens wider. Less? It tightens. The faucet is always adapting.
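Here's a toy version of that adjustable faucet in PyTorch, integrated with a single forward-Euler step. It's a simplification of the liquid time-constant ODE rather than the exact formulation or solver from Hasani et al.: the point is just to show a time constant that depends on the current state and the incoming input.

```python
import torch
import torch.nn as nn

class LiquidCell(nn.Module):
    """A toy liquid time-constant cell, integrated with one forward-Euler step.

    The effective time constant of each unit depends on the current state and
    input, so the cell speeds up or slows down its own dynamics as data arrives.
    This is a simplified sketch, not the exact formulation from Hasani et al. (2021).
    """
    def __init__(self, input_dim, hidden_dim, tau=1.0, dt=0.1):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(input_dim + hidden_dim, hidden_dim), nn.Sigmoid())
        self.A = nn.Parameter(torch.randn(hidden_dim))   # learned attractor levels
        self.tau, self.dt = tau, dt

    def forward(self, u, x):
        """One Euler step: u is the input at this time step, x the hidden state."""
        gate = self.f(torch.cat([u, x], dim=-1))         # state- and input-dependent rate
        # dx/dt = -(1/tau + gate) * x + gate * A  -> larger gate means faster, input-driven dynamics
        dxdt = -(1.0 / self.tau + gate) * x + gate * self.A
        return x + self.dt * dxdt

# Run a short sequence through the cell.
cell = LiquidCell(input_dim=4, hidden_dim=8)
x = torch.zeros(1, 8)
for t in range(20):
    x = cell(torch.randn(1, 4), x)
print(x.shape)  # torch.Size([1, 8])
```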
In practice, this means liquid networks can handle distribution shift without retraining. MIT demonstrated drones navigating environments with weather and lighting conditions they'd never seen in training — something conventional networks fail at badly. The networks are also remarkably compact. Because each "rich" neuron adapts dynamically, you need far fewer of them. We're talking 19 neurons matching the performance of networks with tens of thousands of parameters.
The limitation is the ODE solver. Solving differential equations at every forward pass is computationally expensive, which makes liquid networks slower to run than standard feedforward architectures. Recent work on parallel-in-time solvers and hybrid numerical methods is closing this gap, but it's not closed yet. That faucet metaphor has a dark side: adjustable plumbing is more complex plumbing.
Hypernetworks — Networks That Build Networks
Here's an idea that sounds like it shouldn't work: what if, instead of training the weights of a neural network directly, you trained a second neural network whose job is to generate the weights of the first one?
That's a hypernetwork. The concept was formalized by David Ha, Andrew Dai, and Quoc Le in 2016, though the underlying intuition — one system parameterizing another — shows up across many fields. The network being generated is called the target network. The network doing the generating is the hypernetwork. The hypernetwork takes some conditioning input (a task descriptor, a time step, an embedding of the current context) and outputs a complete set of weights for the target network.
Think of it like a master locksmith. Instead of carrying around thousands of keys (a separate trained model for each lock), the locksmith carries one key-cutting machine (the hypernetwork) and a description of each lock (the conditioning input). Given any lock description, the machine cuts a key on the spot.
Why would you want this? Three reasons show up repeatedly. First, parameter efficiency — if you have many related tasks, a single hypernetwork can generate specialized weights for each one, using far fewer total parameters than training separate models. Second, dynamic adaptation — the weights can change in response to context, making the architecture inherently adaptive. Third, meta-learning — hypernetworks are a natural fit for few-shot learning, where you want to rapidly produce a task-specific model from minimal data.
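A minimal sketch makes the mechanics less mysterious. Here a small hypernetwork takes a conditioning vector and emits the full weight matrix and bias of a target linear layer; the sizes and the `HyperLinear` name are illustrative, not from the original paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperLinear(nn.Module):
    """A hypernetwork that emits the weights of a small target linear layer.

    Given a conditioning vector (a task embedding, a time step, a context
    summary), the hypernetwork outputs a full weight matrix and bias, which are
    then applied to the actual input.
    """
    def __init__(self, cond_dim, in_dim, out_dim, hyper_hidden=64):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.hyper = nn.Sequential(
            nn.Linear(cond_dim, hyper_hidden), nn.ReLU(),
            nn.Linear(hyper_hidden, out_dim * in_dim + out_dim),  # weights + bias
        )

    def forward(self, x, cond):
        params = self.hyper(cond)                                 # (out*in + out,)
        W = params[: self.out_dim * self.in_dim].view(self.out_dim, self.in_dim)
        b = params[self.out_dim * self.in_dim:]
        return F.linear(x, W, b)                                  # target layer, generated on the fly

# Same module, different "tasks": each task embedding yields different effective weights.
layer = HyperLinear(cond_dim=16, in_dim=32, out_dim=10)
x = torch.randn(8, 32)
task_a, task_b = torch.randn(16), torch.randn(16)
print(layer(x, task_a).shape)                                 # torch.Size([8, 10])
print(torch.allclose(layer(x, task_a), layer(x, task_b)))     # False: different task, different weights
```

Only the hypernetwork's parameters are trained; the target layer's weights are never stored anywhere, they're re-cut from the conditioning input on every forward pass.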
The catch is training stability. You're now optimizing the parameters of a network that generates the parameters of another network. That's a meta-optimization problem, and the loss landscape can be treacherous. Hypernetworks tend to be harder to train and more sensitive to hyperparameter choices (yes, the irony of hypernetwork hyperparameters is not lost on anyone). Our locksmith's key-cutting machine is powerful, but calibrating it is fiddly work.
Neural Cellular Automata — Growing Intelligence from Local Rules
A cellular automaton is a grid of cells, each in some state, where every cell updates its state at each time step based only on its immediate neighbors. Conway's Game of Life is the most famous example — four rules applied locally produce astonishing global complexity. Gliders emerge. Oscillators pulse. Structures self-replicate. All from cells that can only see one square away.
Neural cellular automata (NCAs), pioneered by Alexander Mordvintsev and collaborators at Google in 2020, replace the hand-crafted update rules with a small neural network. Each cell still looks only at its neighbors. But instead of "if exactly 3 neighbors are alive, come alive," the rule is a learned function that takes the local neighborhood as input and outputs the cell's next state.
The stunning result: you can train an NCA to grow a target image from a single seed cell. Start with one pixel. Apply the learned local rule repeatedly. Watch as the pattern organizes itself — first rough shapes, then finer detail, converging on a target like a lizard emoji or a human face. And here's what blew my mind when I first saw it: if you damage the pattern (erase a chunk of the image), the NCA repairs itself. The local rules have encoded not a static image, but a regenerative process. It's artificial morphogenesis — the same principle that lets a salamander regrow a limb.
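Here's a stripped-down version of that update rule in PyTorch, a simplification of the Growing NCA recipe (no alive masking, no training loop), with hyperparameters chosen only for illustration. Each cell perceives its 3×3 neighborhood through fixed identity and Sobel filters, a tiny learned network proposes a state change, and a random mask makes cells update asynchronously.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NCAStep(nn.Module):
    """One update step of a neural cellular automaton (simplified from the Growing NCA paper)."""
    def __init__(self, channels=16, hidden=128, fire_rate=0.5):
        super().__init__()
        ident = torch.tensor([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=torch.float32)
        sobel_x = torch.tensor([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=torch.float32) / 8
        kernels = torch.stack([ident, sobel_x, sobel_x.t()])        # 3 fixed perception filters
        # One copy of each filter per channel (depthwise perception): (3*C, 1, 3, 3)
        self.register_buffer("filters", kernels.repeat(channels, 1, 1).unsqueeze(1))
        self.channels, self.fire_rate = channels, fire_rate
        self.update = nn.Sequential(
            nn.Conv2d(3 * channels, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, state):                                       # state: (B, C, H, W)
        # Each cell sees only its 3x3 neighborhood, via fixed identity + Sobel filters.
        perception = F.conv2d(state, self.filters, padding=1, groups=self.channels)
        delta = self.update(perception)
        # Stochastic update mask: each cell fires this step with probability fire_rate.
        mask = (torch.rand(state.shape[0], 1, *state.shape[2:]) < self.fire_rate).float()
        return state + delta * mask

# Grow from a single seed cell in the middle of a 32x32 grid.
step = NCAStep()
grid = torch.zeros(1, 16, 32, 32)
grid[:, :, 16, 16] = 1.0
for _ in range(50):
    grid = step(grid)
print(grid.shape)  # torch.Size([1, 16, 32, 32])
```

Training would backpropagate through many such steps so that the grown pattern matches a target image; this sketch only shows the local rule being applied.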
Coming back to our locksmith analogy from hypernetworks, NCAs represent something fundamentally different. The locksmith builds a key from a blueprint (top-down). An NCA grows a key from a single atom of metal, with each atom only talking to its neighbors (bottom-up). No central plan. No global coordination. Structure emerges from local interaction.
The practical applications are still largely in research — procedural texture generation, self-organizing agent behaviors, growing neural network topologies. The limitation is speed and scalability: running thousands of update steps on a large grid is expensive, and training through that many steps requires careful gradient management. But as a way of thinking about computation — intelligence as a developmental process rather than a fixed architecture — NCAs are deeply compelling.
Rest Stop
If you've made it this far, congratulations. You can stop here if you want.
You now have a mental model of four genuinely different approaches to neural computation: KANs (learnable functions on edges), liquid networks (continuous-time adaptation), hypernetworks (networks generating networks), and neural cellular automata (intelligence emerging from local rules). Each one challenges a fundamental assumption of standard deep learning — fixed activations, frozen weights, direct parameterization, global architecture.
That mental model is incomplete. We haven't talked about energy-based thinking, or what happens when you refuse to train most of a network, or how to make neurons that fire in spikes instead of continuous values. And there's a whole world where neural networks are being pointed at physics, chemistry, and biology instead of text and images.
But if you're feeling the pull of the next section, read on. The ideas ahead are some of the most beautifully weird things in the field.
Boltzmann Machines and the Energy Landscape
Before deep learning as we know it, there were Boltzmann machines. The concept dates to the 1980s, proposed by Geoffrey Hinton and Terry Sejnowski, and it introduces an idea that keeps resurfacing in modern research: energy-based modeling.
A Boltzmann machine assigns an "energy" to every possible configuration of its neurons. Low energy = likely configuration. High energy = unlikely. Training means adjusting the weights so that configurations matching real data get low energy scores, and everything else gets high ones. It's like a landscape of valleys and hills, where the valleys correspond to patterns the network has learned.
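In symbols, a configuration x gets probability p(x) = exp(-E(x)) / Z, where Z sums exp(-E) over every possible configuration. Lowering the energy of data-like configurations raises their probability; the pain point, as we'll see, is Z, which is intractable for anything but tiny networks.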
Think of a ball rolling on a hilly terrain. It naturally settles into valleys. The Boltzmann machine sculpts that terrain during training so the valleys align with the data it's seen. When you want to generate new data, you release the ball and see where it settles. This is the energy landscape metaphor, and it's one of the most powerful intuitions in all of machine learning.
The original Boltzmann machines were nearly impossible to train at scale — the math required sampling from a distribution that was computationally intractable. The Restricted Boltzmann Machine (RBM), which removes connections within the same layer, made training feasible. Stacked RBMs — called Deep Belief Networks — became the mechanism that reignited deep learning around 2006. Hinton showed you could pre-train deep networks layer by layer using RBMs, solving the vanishing gradient problem that had stalled the field for a decade.
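To see why the restriction helps, here's a minimal sketch of a binary RBM's energy function and one block-Gibbs step, the building block of contrastive divergence. The variable names and sizes are mine; the formulas are the standard ones.

```python
import torch

def rbm_energy(v, h, W, b, c):
    """Energy of a joint (visible, hidden) configuration: E = -b.v - c.h - v.W.h
    Low energy corresponds to high probability, since p(v, h) is proportional to exp(-E)."""
    return -(v @ b) - (h @ c) - ((v @ W) * h).sum(-1)

def gibbs_step(v, W, b, c):
    """One block-Gibbs step: sample hidden given visible, then visible given hidden.
    Because an RBM has no within-layer connections, both conditionals factorize and are cheap."""
    p_h = torch.sigmoid(v @ W + c)
    h = torch.bernoulli(p_h)
    p_v = torch.sigmoid(h @ W.t() + b)
    return torch.bernoulli(p_v), h

# Toy RBM: 6 visible units, 4 hidden units.
n_vis, n_hid = 6, 4
W = 0.1 * torch.randn(n_vis, n_hid)
b, c = torch.zeros(n_vis), torch.zeros(n_hid)
v = torch.bernoulli(torch.full((1, n_vis), 0.5))
v, h = gibbs_step(v, W, b, c)
print(rbm_energy(v, h, W, b, c))   # energy of the sampled configuration
```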
Today, nobody uses RBMs for pre-training. We have better tools. But the energy-based thinking underneath has never gone away. Modern energy-based models (EBMs) use the same principle — learn a scalar energy function, make real data low-energy — with modern training techniques. Diffusion models, score matching, and contrastive learning all have deep roots in the energy landscape. The Boltzmann machine isn't a historical footnote. It's the intellectual ancestor of half of modern generative AI. That hilly terrain we imagined? It's been quietly shaping the field for forty years.
Reservoir Computing — The Lazy Genius
Every neural network architecture we've discussed so far shares one assumption: you train the weights. All of them. Or at least most of them. Reservoir computing asks an unsettling question: what if you didn't?
An Echo State Network (ESN), the most common form of reservoir computing, works like this. You create a large recurrent neural network — the "reservoir." You initialize its weights randomly. And then you never touch them. The input weights? Random and fixed. The internal recurrent connections? Random and fixed. The only thing you train is a single linear layer that reads out from the reservoir's state to produce the output.
The metaphor that clicks for me: imagine dropping a pebble into a pond. The pond (reservoir) creates complex, interacting ripples — that's the high-dimensional nonlinear transformation of the input. You don't design the ripples. You don't control the pond. You read the surface pattern and learn to interpret it. Different pebbles (inputs) create different ripple patterns, and a simple linear model can learn to map those patterns to desired outputs.
The mathematics are surprisingly clean. At each time step, the reservoir state updates as x(t+1) = tanh(W·x(t) + W_in·u(t)), where W and W_in are random and fixed. The output is y(t) = W_out·x(t), and W_out is the only learned parameter — trained with ordinary least-squares regression. No backpropagation through time. No gradient flow through recurrent connections. Training takes seconds.
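Here's what that looks like end to end: a minimal echo state network in NumPy, predicting the next value of a noisy sine wave. The reservoir size, sparsity, and spectral radius are illustrative choices, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, n_in = 300, 1

# Random, fixed reservoir. Rescale W so its spectral radius is below 1,
# which (roughly) enforces the echo state property.
W = rng.standard_normal((n_res, n_res)) * (rng.random((n_res, n_res)) < 0.05)
W *= 0.9 / max(abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))

def run_reservoir(inputs):
    """Drive the fixed reservoir with an input sequence and collect its states."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W @ x + W_in @ np.atleast_1d(u))
        states.append(x.copy())
    return np.array(states)

# Task: predict the next value of a noisy sine wave.
t = np.linspace(0, 60, 3000)
u = np.sin(t) + 0.05 * rng.standard_normal(len(t))
X, y = run_reservoir(u[:-1]), u[1:]

# The only training: ridge regression from reservoir states to targets.
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)
print("train MSE:", np.mean((X @ W_out - y) ** 2))
```

No backpropagation anywhere: the closed-form regression at the end is the entire training procedure.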
This sounds like it shouldn't work. But it does — remarkably well for chaotic time series prediction, speech recognition, and any task where temporal dynamics matter. The key requirement is the echo state property: the reservoir must "forget" its initial state over time, so that its current state genuinely reflects the input history rather than some arbitrary starting condition.
The limitation is precision. Because the reservoir isn't optimized for your specific task, you're relying on random projections to create useful representations. For complex tasks with subtle structure, a fully trained RNN or transformer will outperform it. Our pond creates beautiful ripples, but it can't choose to create the exact ripples you need. Still, for problems where training speed matters more than the last fraction of accuracy, reservoir computing is remarkably effective — and it's found a second life in neuromorphic and photonic computing, where the "reservoir" is a physical system (an optical fiber, a bucket of water — no, seriously) rather than a simulated one.
Spiking Neural Networks — The Brain's Native Language
Every artificial neuron we've discussed computes a weighted sum, applies an activation function, and outputs a continuous number — 0.73, -1.2, 4.56. Biological neurons do something fundamentally different. They accumulate input over time, and when their internal voltage crosses a threshold, they fire a discrete spike — a brief electrical pulse. Then they reset. The information isn't in the magnitude of the output. It's in the timing of the spikes.
Spiking neural networks (SNNs) attempt to replicate this. Each artificial neuron maintains a membrane potential that rises with incoming spikes and decays over time. When it crosses a threshold — spike. The signal propagates to connected neurons, nudging their potentials up. Information is encoded in when neurons fire relative to each other, not in continuous activation values.
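The simplest spiking neuron, the leaky integrate-and-fire model, takes only a few lines to simulate. This is a toy sketch with made-up constants, but it captures the loop: leak, integrate, fire, reset.

```python
import random

def lif_simulate(input_current, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron.

    The membrane potential decays by `beta` each step (the leak), adds the incoming
    current (integration), emits a spike when it crosses the threshold, and resets.
    The output is a train of 0s and 1s; the information lives in when the 1s occur,
    not in their size.
    """
    v = 0.0
    spikes = []
    for i in input_current:
        v = beta * v + i                # leak + integrate
        spike = v >= threshold          # fire?
        v = 0.0 if spike else v         # reset after a spike
        spikes.append(int(spike))
    return spikes

current = [0.3 * random.random() for _ in range(60)]   # random input current over 60 steps
print("".join("|" if s else "." for s in lif_simulate(current)))
```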
Back to our pond metaphor from reservoir computing: if a standard neural network is measuring the height of each ripple, a spiking network is measuring when each ripple arrives at the shore. Same pond, fundamentally different measurement. And that timing information turns out to be incredibly rich.
The appeal is energy efficiency. In a conventional network, every neuron computes on every forward pass, whether it has anything useful to say or not. In an SNN, a neuron that hasn't reached threshold doesn't fire, doesn't communicate, and doesn't consume energy. This is event-driven computation — work happens only when something interesting occurs. Intel's Loihi chip and IBM's TrueNorth are neuromorphic processors built specifically to exploit this property, achieving orders of magnitude better energy efficiency than GPUs for certain tasks.
The limitation is training. Spikes are discrete events — either a neuron fires or it doesn't. That makes the loss function non-differentiable at the spike threshold, which means standard backpropagation doesn't directly apply. Researchers use surrogate gradient methods (smooth approximations of the spike function during the backward pass), but training SNNs remains harder and less mature than training conventional networks. The biological authenticity of SNNs is both their greatest strength and their biggest practical obstacle. We have wonderful brain-inspired hardware waiting for brain-inspired algorithms to catch up.
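Here's the surrogate gradient trick in miniature, as a PyTorch autograd function: a hard threshold on the forward pass, a smooth stand-in for its derivative on the backward pass. The specific surrogate (a fast-sigmoid-style bump) is one common choice among several.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Spike nonlinearity with a surrogate gradient.

    Forward: a hard threshold (the neuron fires or it doesn't).
    Backward: pretend the threshold was smooth, so gradients can flow through it.
    """
    @staticmethod
    def forward(ctx, membrane_potential):
        ctx.save_for_backward(membrane_potential)
        return (membrane_potential >= 0.0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        surrogate = 1.0 / (1.0 + 10.0 * u.abs()) ** 2   # smooth bump around the threshold
        return grad_output * surrogate

u = torch.randn(5, requires_grad=True)       # membrane potentials, threshold at 0
spikes = SurrogateSpike.apply(u)
spikes.sum().backward()
print(spikes)   # hard 0/1 spikes
print(u.grad)   # ...but nonzero gradients flow back anyway
```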
Differentiable Rendering and Neural Implicit Representations
I avoided 3D vision for a long time because it felt like a completely separate field from the rest of deep learning. Then I saw NeRF results and realized the boundary had dissolved.
Traditional 3D graphics works with explicit representations — meshes (triangles stitched together), point clouds (scattered dots in space), voxel grids (3D pixels). You build a scene from these primitives, then a renderer turns it into a 2D image. The renderer is a fixed function. It's not differentiable. You can't backpropagate through it.
Differentiable rendering changes that. It makes the rendering process smooth enough that gradients can flow from pixels all the way back to the 3D scene parameters. This means you can start with a set of 2D photographs, define a 3D representation, render it into images, compare those to the real photos, and use gradient descent to adjust the 3D representation until the renders match. It's inverse graphics — going from images back to the 3D world that produced them.
Neural Radiance Fields (NeRF), introduced in 2020, pushed this to stunning effect. A NeRF represents a 3D scene as a single neural network — an MLP that takes a 3D coordinate and a viewing direction as input, and outputs the color and density at that point. To render an image, you shoot rays from the camera through each pixel, sample points along each ray, query the network at each point, and composite the results using volume rendering. The entire process is differentiable. Train on a few dozen photographs of a scene, and the network learns a continuous 3D representation of it.
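The compositing step is more mechanical than it sounds. Here it is for a single ray in NumPy, using the standard volume-rendering weights; in a real NeRF, the densities and colors would come from querying the MLP at each sample point rather than from random numbers.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Volume-render one camera ray from per-sample densities and colors.

    sigmas: (N,) densities at N sample points along the ray
    colors: (N, 3) RGB at those points
    deltas: (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)               # prob. of the ray stopping in each segment
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # prob. of getting that far
    weights = transmittance * alphas
    return (weights[:, None] * colors).sum(axis=0)        # final pixel color

# 64 samples along one ray with made-up densities and colors.
n = 64
sigmas = np.random.rand(n) * 2.0
colors = np.random.rand(n, 3)
deltas = np.full(n, 1.0 / n)
print(composite_ray(sigmas, colors, deltas))              # an RGB triple
```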
This is a neural implicit representation — the 3D geometry isn't stored as an explicit mesh or grid. It's implicit in the weights of a neural network. Other variants include DeepSDF (which learns a signed distance function — positive outside an object, negative inside, zero on the surface) and occupancy networks (which output the probability that a point is inside an object). All of them share the same beautiful property: infinite resolution, because the network is a continuous function you can query at any coordinate.
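The signed-distance idea is easy to see with a hand-written example. DeepSDF's contribution is learning a neural network that plays the role of this function for arbitrary shapes, but the query-anywhere property is the same.

```python
import numpy as np

def sphere_sdf(points, radius=1.0):
    """A hand-written signed distance function for a sphere: positive outside,
    negative inside, zero exactly on the surface."""
    return np.linalg.norm(points, axis=-1) - radius

# Query at any coordinate, at any resolution -- no mesh or grid required.
pts = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(sphere_sdf(pts))   # [-1.  -0.5  1. ]
```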
3D Gaussian Splatting (2023) took a different approach — represent the scene as a collection of 3D Gaussian ellipsoids, each with learned position, size, orientation, opacity, and color. It's explicitly parameterized (not neural-implicit), but still differentiable. The payoff: real-time rendering at quality competitive with NeRF, because rasterizing Gaussians is dramatically faster than querying an MLP millions of times per frame. NeRF gives you quality; Gaussian splatting gives you speed. Both give you differentiability.
I'll be honest — when I first read that a single MLP could encode an entire 3D scene, I didn't believe it. The network is small. The scenes are complex. And yet it works. My favorite thing about this area is that it's one of the clearest examples of a deep learning idea creating genuine new capability, not incremental improvement on an existing benchmark.
AI for Science — When Models Meet the Physical World
Most of this book has been about teaching neural networks to process text, images, and sequences. The applications we've discussed live in the digital world — language models, classifiers, recommender systems. But some of the most consequential applications of deep learning are now happening in domains where the data has physical structure, where the predictions have atomic-level consequences, and where a wrong answer isn't a bad recommendation — it's a failed experiment that cost a year of lab time.
AlphaFold (DeepMind, 2020) solved the protein folding problem — predicting a protein's 3D structure from its amino acid sequence. This was a grand challenge in biology for fifty years. AlphaFold 2 combined attention mechanisms over evolutionary relationships (multiple sequence alignments) with SE(3)-equivariant structure modules that respect the rotational symmetry of 3D space. AlphaFold 3 (2024) extended this to predict not only individual protein structures but complexes involving DNA, RNA, small molecules, and ions — bringing it closer to modeling real cellular machinery.
The impact is hard to overstate. Before AlphaFold, determining a single protein structure could take a PhD student's entire career. After AlphaFold, the structures of over 200 million proteins were predicted in months. Drug discovery timelines that spanned decades are being compressed to years.
GenCast (DeepMind, 2024) applies similar ambition to weather prediction. Traditional weather forecasting runs massive physics simulations on supercomputers — billion-dollar infrastructure producing forecasts that still struggle beyond 7-10 days. GenCast uses a diffusion-based approach to produce ensemble weather forecasts that outperform the best physics-based models at medium-range prediction, including for extreme events like hurricanes and heatwaves. It runs on a cluster of TPUs instead of a national supercomputing center.
GNoME (Graph Networks for Materials Exploration, DeepMind, 2023) turned graph neural networks loose on materials science. The system predicted the stability of 2.2 million new crystal structures — more than the entire history of human materials discovery. Of those, 380,000 were identified as highly stable, meaning they're strong candidates for real-world synthesis. This is the kind of result that could accelerate development of better batteries, solar cells, and superconductors.
The pattern across all three: take a domain with rich physical structure, design or adapt neural architectures that respect that structure's symmetries (equivariance, graph topology, physical constraints), train on existing data, and produce predictions that would have taken traditional methods orders of magnitude longer. The models don't replace the science. They compress the search space so dramatically that scientists can focus on the experiments most likely to matter. Our energy landscape from the Boltzmann machines section makes one final appearance here — AlphaFold is literally searching for low-energy protein conformations, and GNoME is searching for low-energy crystal structures. The metaphor isn't a metaphor. It's the actual physics.
Resources and Credits
If any of these topics hooked you, here are the places I'd start digging.
- 📄 "KAN: Kolmogorov-Arnold Networks" — Liu et al., 2024. The original paper. Clear writing, excellent visualizations comparing KAN and MLP function approximation. The one to read first.
- 📄 "Liquid Time-constant Networks" — Hasani et al., 2021 (AAAI). MIT's foundational paper on liquid neural networks. The C. elegans motivation section alone is worth the read.
- 📄 "Growing Neural Cellular Automata" — Mordvintsev et al., 2020 (Distill). One of the most visually stunning papers in deep learning. The interactive demos are unforgettable.
- 📄 "NeRF: Representing Scenes as Neural Radiance Fields" — Mildenhall et al., 2020. The paper that launched a thousand follow-ups. Still the clearest introduction to the idea.
- 📄 "Highly accurate protein structure prediction with AlphaFold" — Jumper et al., 2021 (Nature). Wildly ambitious. The structural biology sections are accessible even without a biology background.
- 📄 "Scaling deep learning for materials discovery" — Merchant et al., 2023 (Nature). The GNoME paper. The scale of what they discovered is genuinely staggering.
- 📄 "A Practical Guide to Echo State Networks" — Lukoševičius, 2012. Still the best tutorial on reservoir computing. Clear, practical, and honest about limitations.
If you're still with me, thank you. I hope it was worth the tour.
We started with a network that puts learnable functions on its edges instead of its nodes, moved through networks that flow like liquid, networks that generate other networks, and cells that grow patterns from local rules. We crossed into energy landscapes, frozen reservoirs, neurons that speak in spikes, and neural networks that encode entire 3D worlds in their weights. We ended with AI systems that are folding proteins, predicting storms, and discovering new materials.
My hope is that the next time you encounter one of these terms in a paper or a conversation, instead of that familiar itch of "I should look that up someday," you'll have a pretty solid mental model of what's going on under the hood — and enough curiosity to dig deeper when the time is right.