Classic Generative Models

Autoencoders · VAEs · GANs · Flows · EBMs

I avoided generative models for longer than I'd like to admit. Every time someone mentioned "latent spaces" or "adversarial training" or "evidence lower bound," I'd nod along and quietly change the subject. I understood what these models did — they generate images, text, molecules — but how they actually worked under the hood? The mechanics of making a neural network create something new, rather than classify or predict? That felt like a different kind of magic, and I didn't trust myself to explain it. Finally the discomfort of not knowing grew too great. Here is that dive.

Generative models learn the underlying distribution of training data so they can produce new samples from that distribution. The idea has been around in various forms since the 1980s (Boltzmann machines, anyone?), but the modern era really kicked off around 2013–2014 with Variational Autoencoders and Generative Adversarial Networks. Since then, normalizing flows, energy-based models, and eventually diffusion models have all entered the picture, each attacking the same problem from a different mathematical angle.

Before we start, a heads-up. We're going to be working with probability distributions, KL divergence, Jacobian determinants, and minimax optimization. But you don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

The compression game

Vanilla autoencoders

Denoising and sparse autoencoders

The problem with autoencoder latent spaces

Variational autoencoders: encoding to distributions

The reparameterization trick

The ELBO: why the loss looks the way it does

Rest stop

The adversarial game

Training a GAN: the delicate dance

Mode collapse and the dirt-pile fix

Conditional generation: steering the output

Normalizing flows: invertible transformations

Energy-based models: the landscape view

The generative model landscape

Wrap-up

Resources and credits

The compression game

Imagine you run a tiny portrait studio. You've taken thousands of 28×28 pixel grayscale headshots. Each image is 784 numbers. You want to store them efficiently, and you notice something: most of those 784 numbers are redundant. Faces share structure — two eyes, a nose, a mouth, roughly the same layout. If you could find the essential information that distinguishes one face from another, you could store each face as a much smaller set of numbers.

That's the core idea behind an autoencoder. It's two neural networks glued together. The first one, the encoder, takes a 784-dimensional image and squeezes it down to, say, 20 numbers. The second one, the decoder, takes those 20 numbers and tries to reconstruct the original 784-dimensional image. The 20-number representation in the middle is called the latent code, and the narrow passage it must squeeze through is called the bottleneck.

The whole thing is trained with a single objective: make the output look as close to the input as possible. The loss is typically mean squared error — the average squared difference between each original pixel and its reconstruction.

Here's the thing that makes this interesting. The bottleneck is tiny. 784 dimensions in, 20 dimensions in the middle, 784 dimensions out. The network cannot memorize inputs — there's not enough room. It has to figure out which features of a face actually matter (the angle, the expression, whether there are glasses) and throw away the rest. The bottleneck forces the network to learn a compressed representation. That's representation learning, and the autoencoder didn't even know it was doing it.

Vanilla autoencoders

Let's trace through a concrete example. We'll take our tiny portrait studio's simplest possible case: 4-pixel grayscale images. Each image is a vector of 4 numbers between 0 and 1. Our encoder compresses that to 2 numbers. Our decoder expands those 2 numbers back to 4.

Input image:     [0.9, 0.1, 0.8, 0.2]   (4 pixels)
                        │
                   ┌────▼────┐
                   │ Encoder  │   (4 → 2)
                   └────┬────┘
                        │
Latent code:         [0.7, -0.3]           (2 numbers)
                        │
                   ┌────▼────┐
                   │ Decoder  │   (2 → 4)
                   └────┬────┘
                        │
Reconstruction:  [0.85, 0.15, 0.78, 0.22] (4 pixels)

The reconstruction isn't perfect — we lost information in the bottleneck. But it's close. The loss (mean squared error between input and reconstruction) measures how close, and training adjusts the encoder and decoder weights to minimize it.
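
A minimal sketch of that 4 → 2 → 4 setup in PyTorch; the layer sizes, learning rate, and single training step here are illustrative choices, not anything canonical:

import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 2)                               # 4 pixels -> 2 latent numbers
        self.decoder = nn.Sequential(nn.Linear(2, 4), nn.Sigmoid())  # 2 numbers -> 4 pixels in [0, 1]

    def forward(self, x):
        z = self.encoder(x)              # latent code
        return self.decoder(z), z        # reconstruction and code

model = TinyAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.tensor([[0.9, 0.1, 0.8, 0.2]])         # one 4-pixel "image"
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)           # mean squared error against the input
loss.backward()
opt.step()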

Here's a useful mental model. Picture the encoder as a person summarizing a book into a tweet. The decoder is a different person who reads only the tweet and tries to rewrite the book. If the tweet captures the essence of the story, the rewrite will be recognizable. If it doesn't, you get nonsense. Training is the process of both people getting better at their jobs simultaneously.

A subtle but important point: if the bottleneck were the same size as the input (or larger), the network could learn the identity function — pass everything through unchanged, reconstruct perfectly, and learn nothing interesting. A single-layer linear autoencoder with MSE loss learns exactly the same subspace as PCA. The deep, nonlinear version learns richer structures, but the spirit is the same: find the dimensions that matter most and ignore the rest.

Denoising and sparse autoencoders

The vanilla autoencoder has a problem that becomes apparent quickly: it can be lazy. If the bottleneck is even slightly too generous, or the network is too powerful, it starts memorizing inputs rather than learning meaningful structure. Two ideas emerged to fight this, and both are worth knowing.

A denoising autoencoder deliberately corrupts the input before feeding it to the encoder — adding Gaussian noise, randomly zeroing out pixels, or both. But the loss is still measured against the clean original. The network can't memorize the corrupted input (it changes every time), so it's forced to learn what the underlying clean signal looks like. Think of our portrait studio: we're handing the encoder a blurry, scratched photo and asking the decoder to reconstruct the clean original. To do that, the encoder must understand what a face is, not what any particular noisy version of a face looks like.
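
In code, the change from the vanilla training step is small: corrupt the input, but score the reconstruction against the clean original. A sketch reusing the TinyAutoencoder from above, with an assumed noise level of 0.2:

noise_std = 0.2                                    # illustrative corruption strength
x_noisy = x + noise_std * torch.randn_like(x)      # Gaussian corruption (could also zero out pixels)
recon, _ = model(x_noisy)                          # encode and decode the corrupted image
loss = nn.functional.mse_loss(recon, x)            # ...but measure against the clean original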

This idea was ahead of its time. BERT's masked language modeling — corrupt some words, predict the originals — is the same trick applied to text, published over a decade later.

A sparse autoencoder takes a different approach. Instead of corrupting the input, it adds a penalty to the loss that discourages the latent code from having too many active (non-zero) neurons at once. The typical approach is to add a KL divergence penalty that pushes the average activation of each hidden neuron toward some small target value, like 0.05. The result: for any given input, only a handful of latent dimensions "fire." Each active dimension represents a distinct, meaningful feature. It's like forcing the tweet-summarizer to use only 5 of their 20 available words — they choose very carefully, and each word carries heavy meaning.
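
Here's a sketch of that penalty, again building on the snippets above. It assumes the latent activations z lie in (0, 1), say after a sigmoid, so each dimension's average activation can be read as a firing rate; rho is the 0.05 target mentioned above and beta is an illustrative penalty weight:

def kl_sparsity_penalty(z, rho=0.05, eps=1e-8):
    rho_hat = z.mean(dim=0).clamp(eps, 1 - eps)    # average activation of each latent unit
    kl = rho * torch.log(rho / rho_hat) + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

beta = 1e-3                                        # illustrative weight on the sparsity penalty
loss = nn.functional.mse_loss(recon, x) + beta * kl_sparsity_penalty(z)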

Both denoising and sparse autoencoders produce better representations than vanilla autoencoders. But they share a fundamental limitation: the latent space they produce is unstructured. Points are scattered wherever the optimization happens to put them. There's no guarantee that nearby latent codes correspond to similar images, and no way to sample from the latent space to generate new images. The decoder can reconstruct codes it was trained on, but hand it a random point in latent space and it'll produce garbage.

That limitation is what pulls us toward the next idea.

The problem with autoencoder latent spaces

Let me make this concrete. Suppose we train a vanilla autoencoder on our portrait studio dataset — thousands of headshots — and look at the resulting 2D latent space. We'd see something like this: each face maps to a specific point, but the points form irregular clusters with vast empty regions in between. Image of a smiling woman? That's at coordinates (3.7, -1.2). Image of a frowning man? That's at (-8.4, 5.6). What's at (0, 0)? Who knows. Probably nothing coherent.

We can't generate new faces from this space because we have no idea which coordinates produce valid faces and which produce noise. We can't interpolate between faces because the path between two points in latent space might pass through dead zones that decode to static. The autoencoder learned to compress and decompress, but it didn't learn the distribution of faces. It knows individual faces, not face-ness.

I'll be honest — when I first heard people describe the latent space of a standard autoencoder, I didn't understand why this was such a big deal. Compression is useful, right? Who cares about sampling? But then I realized: if you want a generative model — one that creates new data rather than parroting old data — you need a latent space you can navigate. You need to know that any point you pick will decode to something meaningful. That requires structure. That requires probability.

Variational autoencoders: encoding to distributions

The Variational Autoencoder, introduced by Kingma and Welling in 2013, makes one conceptual change to the autoencoder that transforms everything. Instead of encoding an image to a single point in latent space, the encoder outputs the parameters of a probability distribution.

Let's walk through it with our portrait studio. We feed in a headshot. The encoder doesn't output a single latent code like [0.7, −0.3]. Instead, it outputs two vectors: a mean vector μ = [0.7, −0.3] and a log-variance vector log σ² = [−1.0, −0.8]. Together, these define a Gaussian distribution centered at μ with spread σ. We then sample a latent code z from that distribution and pass it to the decoder.

Input image → Encoder → μ = [0.7, -0.3], log σ² = [-1.0, -0.8]
                              │
                         Sample z from N(μ, σ²)
                              │
                         z = [0.65, -0.25]  (a random draw)
                              │
                         Decoder → Reconstructed image

Why log-variance instead of variance directly? Because variance must be positive, and log-variance can be any real number, making it easier for a neural network to output. We recover σ by computing exp(0.5 × log σ²).

Now here's the crucial part. The loss function has two terms. The first is the same reconstruction loss as before: make the decoder's output match the input. The second is new — a KL divergence term that measures how far each encoding distribution is from a standard normal distribution N(0, I). This term penalizes the encoder for putting images in wildly different regions of latent space. It gently pulls all the encoding distributions toward the origin, toward each other, creating overlap and continuity.

The result is a latent space with structure. Nearby points decode to similar images. Paths between points produce smooth transitions. And critically, we can sample z from a standard normal, hand it to the decoder, and get a coherent new portrait that doesn't exist in the training data. The VAE hasn't memorized faces — it's learned face-ness.

Back to our portrait studio analogy: the vanilla autoencoder was a filing system — each photo in its own arbitrary drawer. The VAE is more like a map. Similar photos are near each other. You can walk from "smiling woman" to "smiling man" and the images morph smoothly along the way. And you can point to a random spot on the map and find a reasonable face there.

The reparameterization trick

There's a mechanical problem with what I described. During training, we need to backpropagate gradients through the entire network — from the reconstruction loss, back through the decoder, through the latent code z, and into the encoder. But z was sampled from a distribution. Sampling is a random operation. You can't take the derivative of randomness.

The reparameterization trick is the solution, and it's clever in a way that makes you feel silly for not thinking of it yourself. Instead of sampling z directly from N(μ, σ²), we do this:

First, sample ε from a standard normal N(0, I). This is pure randomness that doesn't depend on any model parameters. Then compute z = μ + σ ⊙ ε, where ⊙ means element-wise multiplication.

z is the same random variable — it has the same distribution. But now it's expressed as a deterministic, differentiable function of μ and σ (which are the encoder's outputs, and therefore functions of its weights). The randomness is quarantined in ε, which sits outside the computation graph. Gradients flow cleanly through μ and σ back into the encoder.

# The trick in three lines of PyTorch:
std = torch.exp(0.5 * logvar)   # σ = exp(½ log σ²)
eps = torch.randn_like(std)      # ε ~ N(0, I) — no gradients needed
z = mu + std * eps               # z = μ + σ ⊙ ε — differentiable in μ, σ

I remember reading this for the first time and thinking: "Wait, that's it? You moved the randomness to a separate variable?" Yes. That's it. And it's the thing that made VAEs trainable with standard backpropagation. Without this trick, you'd need reinforcement-learning-style gradient estimators, which are noisy and slow. With it, you get clean, low-variance gradients for free.

The ELBO: why the loss looks the way it does

The VAE loss is called the Evidence Lower Bound, or ELBO. The name sounds imposing, but the logic behind it is surprisingly natural.

Here's the situation. We want to maximize the probability of our training data under the model — the quantity log p(x), where p(x) = ∫ p(x|z) p(z) dz. That integral runs over all possible latent codes z, which makes it intractable. We can't compute it directly.

So we do something practical. We introduce our approximate encoder q(z|x) and use it to derive a lower bound on log p(x). The math involves Jensen's inequality, but the result is clean:

log p(x) ≥ 𝔼_q(z|x)[log p(x|z)] − D_KL(q(z|x) ∥ p(z))

That's the ELBO. Two terms, fighting each other.

The first term, 𝔼_q(z|x)[log p(x|z)], is the reconstruction term. It says: "For the latent codes that the encoder produces, how well does the decoder reconstruct the original input?" Maximizing this term pushes the model toward perfect reconstruction. In practice, maximizing it amounts to minimizing the mean squared error or binary cross-entropy between input and output.

The second term, D_KL(q(z|x) ∥ p(z)), is the KL divergence between the encoder's distribution and the prior. It says: "How different is the distribution the encoder learned from the simple standard normal we chose as our prior?" Minimizing this pushes all encoding distributions toward N(0, I).

For Gaussian encoders with a standard normal prior, the KL term has a beautiful closed-form solution. No sampling, no approximation:

D_KL = −½ Σ_j (1 + log σ_j² − μ_j² − σ_j²)

Where j runs over each dimension of the latent space. This is computationally cheap and exact.
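
That closed form is a couple of lines of code. A sketch of the complete VAE loss, assuming mu and logvar are the encoder's outputs, recon is the decoder's output, and x is the input (beta = 1 gives the standard VAE; beta > 1 gives the β-VAE discussed below):

import torch
import torch.nn.functional as F

def vae_loss(recon, x, mu, logvar, beta=1.0):
    # Reconstruction term: how well the decoder rebuilt the input
    recon_loss = F.mse_loss(recon, x, reduction='sum')
    # KL term: the closed form above, summed over latent dimensions (and the batch)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl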

Now, these two terms are in tension. The reconstruction term wants the encoder to spread different inputs far apart in latent space — it's easier to reconstruct if each input has its own unique, distant code. The KL term wants everything pulled together near the origin. Train a VAE and you're mediating a never-ending argument between fidelity and structure. This is why VAE outputs tend to be slightly blurry compared to, say, GAN outputs. The KL term is smoothing things out, and that smoothing costs sharpness.

I'm still developing my intuition for why the β-VAE trick works as well as it does. The idea, from Higgins et al. (2017), is to multiply the KL term by a scalar β > 1. Higher β means more pressure toward a structured latent space, which encourages disentanglement — individual latent dimensions capture independent factors like rotation, color, size. But it also means blurrier reconstructions. The tradeoff is real, and picking the right β is more art than science.

Rest stop

Congratulations on making it this far. You can stop here if you want.

You now have a mental model for the entire autoencoder family: vanilla autoencoders compress and reconstruct through a bottleneck, denoising autoencoders learn robust features by training on corrupted inputs, sparse autoencoders encourage selective feature activation, and VAEs impose probabilistic structure on the latent space so you can actually sample new data from it. The reparameterization trick makes VAEs trainable, and the ELBO loss balances reconstruction quality against latent space regularity.

That's a useful mental model. If someone asks you how a VAE works in an interview, you can explain the encoder-to-distribution idea, the KL penalty, and the reparameterization trick, and you'll be in solid shape.

But this doesn't tell the whole story. VAEs produce smooth, structured outputs, but they tend to be blurry. What if you want razor-sharp generated images? What if you want exact likelihood computation? What if you want to control what gets generated? Those questions pull us into GANs, normalizing flows, and energy-based models — three radically different approaches to the same generation problem.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

The adversarial game

In 2014, Ian Goodfellow proposed something that felt almost absurd: instead of training one network to generate data, train two networks that fight each other.

The generator G takes random noise z (sampled from a simple distribution, usually a standard normal) and produces a fake sample — say, a fake portrait for our studio. The discriminator D looks at images — some real from the training set, some fake from G — and tries to tell them apart, outputting a probability that its input is real.

Back to our portrait studio. The generator is a forger who's never seen a real photograph but has been given a pile of random noise and told "make this look like a face." The discriminator is a detective who examines photos and says "real" or "fake." As the detective gets sharper, the forger has to improve technique. As the forger improves, the detective has to look more carefully. Both get better through competition.

The formal objective captures this as a minimax game:

min_G max_D V(D, G) = 𝔼_x~p_data[log D(x)] + 𝔼_z~p_z[log(1 − D(G(z)))]

Let's unpack that. D wants to maximize this expression: push D(x) toward 1 for real data (making log D(x) large) and D(G(z)) toward 0 for fakes (making log(1 − D(G(z))) large, since log(1 − 0) = 0). G wants to minimize it: make D(G(z)) close to 1 so the second term collapses.

In theory, this game has a Nash equilibrium where the generator has perfectly learned the data distribution and the discriminator can't do better than random guessing — D(x) = 0.5 for everything. In practice, reaching that equilibrium is extraordinarily difficult, and therein lies the entire drama of GAN training.

One practical note that matters: the generator's loss is often modified from log(1 − D(G(z))) to −log D(G(z)). Why? Early in training, the generator produces obvious garbage. The discriminator rejects it with confidence: D(G(z)) ≈ 0. At that point, the gradient of log(1 − D(G(z))) is tiny — the generator gets almost no learning signal. It's trying to climb a hill but the ground is flat. The −log D(G(z)) formulation has steeper gradients near zero, giving the generator enough signal to start improving. This is a small change in the math, but it's the difference between a GAN that learns and one that sits there doing nothing.
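
In code, the difference is just which line you use as the generator's loss (assuming G, D, and a noise batch z are defined, and that D outputs a probability):

loss_G_saturating = torch.log(1 - D(G(z))).mean()    # gradient vanishes when D(G(z)) ≈ 0
loss_G_nonsat = -torch.log(D(G(z))).mean()           # strong gradient exactly where training starts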

Training a GAN: the delicate dance

If you've trained classifiers or regression models, GAN training will feel alien. There's no single loss that decreases monotonically. There are two losses, for two networks, and they oscillate. The discriminator gets better at catching fakes, then the generator adapts, then the discriminator responds, and on it goes. There's no natural "we're done" signal.

The training loop alternates between two steps. First, fix the generator and update the discriminator: show it real images (label: 1) and fake images from G (label: 0), compute binary cross-entropy loss, update D's weights. Second, fix the discriminator and update the generator: generate fake images, pass them through D, compute loss based on how well they fooled D (we want D to output 1 for our fakes), update G's weights.

# Assumes a generator G, a discriminator D ending in a sigmoid, their optimizers
# opt_G and opt_D, a dataloader of real images, and latent_dim are already defined.
import torch

bce = torch.nn.BCELoss()

for real, _ in dataloader:
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Step 1: Train the detective
    z = torch.randn(b, latent_dim)
    fake = G(z).detach()                 # .detach() stops gradients into G
    loss_D = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: Train the forger
    z = torch.randn(b, latent_dim)
    fake = G(z)                          # no detach — G needs gradients through D
    loss_G = bce(D(fake), ones)          # G wants D to think fakes are real
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

That .detach() call is critical. When training D, we don't want gradients flowing back into G — we're only improving the detective, not the forger. When training G, we deliberately let gradients flow through D because G needs to learn what D considers convincing.

The balance between these two networks is everything. If D becomes too strong too fast, it rejects everything G produces with near-100% confidence. D's output saturates, the gradients flowing to G become vanishingly small, and G stops learning. If G somehow gets ahead and fools D completely, D's gradients become meaningless too. You need both networks improving at roughly the same pace. Practitioners have tried training D for multiple steps per G step, using label smoothing (labeling real images as 0.9 instead of 1.0), adding noise to D's inputs — an entire bag of tricks developed over years of collective frustration.

I'll be honest — the first time I trained a GAN, I watched the loss curves for an hour and had no idea if things were going well. They were oscillating wildly. Turns out, that was normal. The only reliable way to monitor GAN training is to periodically look at the generated samples yourself, or compute FID scores (a metric that compares the distribution of generated images to real ones). Loss values alone tell you almost nothing.

Mode collapse and the dirt-pile fix

Here's the signature GAN failure mode, and once you've seen it you'll never forget it. You're training a GAN on a dataset of faces — men, women, different ages, different expressions. You check the generated samples and every single face is the same middle-aged woman. A thousand samples, a thousand copies of the same face with minor variations.

This is mode collapse. The generator discovered that one particular output fools the discriminator reliably, and decided that's all it needs to produce. It's our portrait studio forger who realized the detective can't spot one particular fake, so they produce that exact fake over and over instead of learning to forge different styles.

The mathematical reason is subtle. The generator is optimizing to fool D right now. It has no incentive to cover the full data distribution. If one mode of the data is easier to imitate, G overexploits it. D eventually catches on and G shifts to a different mode, creating an oscillation where G cycles through modes without ever covering them all simultaneously. The GAN objective doesn't care about diversity — it only cares about fooling the discriminator.

The most impactful fix came in 2017 with the Wasserstein GAN (WGAN), and the intuition is beautiful. The original GAN loss is closely related to the Jensen-Shannon divergence between the real and generated distributions. Here's the problem: when these two distributions have non-overlapping support (which is almost always the case in high dimensions, especially early in training), the JS divergence is constant. It provides zero useful gradient. The generator is standing in a flat field with no indication of which direction to walk.

The Wasserstein distance, also called the earth mover's distance, fixes this. Imagine the real data distribution as a pile of dirt and the generated distribution as another pile of dirt. The Wasserstein distance measures the minimum amount of work — dirt times distance moved — needed to reshape one pile into the other. Crucially, this distance is smooth and meaningful even when the piles don't overlap at all. There's always a gradient pointing the generator in the right direction: toward the real data.

WGAN replaces the discriminator with a critic (it no longer outputs a probability, but a score), requires the critic to satisfy a Lipschitz constraint (originally enforced by weight clipping, later by a gradient penalty in WGAN-GP), and uses the Wasserstein distance as the loss. The result: dramatically more stable training, a loss curve that actually correlates with sample quality, and reduced mode collapse because the gradient signal is always informative.
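
Here's a minimal sketch of the WGAN-GP losses, assuming a critic C that outputs an unbounded score and a generator G; lambda_gp = 10 is the penalty weight used in the WGAN-GP paper:

import torch

def critic_loss_wgan_gp(C, real, fake, lambda_gp=10.0):
    # Wasserstein term: push real scores up and fake scores down
    w_loss = C(fake).mean() - C(real).mean()
    # Gradient penalty: enforce the Lipschitz constraint at points between real and fake
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(C(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return w_loss + lambda_gp * penalty

def generator_loss_wgan(C, fake):
    return -C(fake).mean()               # the generator pushes its own scores up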

This won't be the last time we see the idea that "smooth, informative gradients solve training problems." It shows up again in normalizing flows and energy-based models.

Conditional generation: steering the output

So far, every generative model we've seen produces random outputs. Sample noise, decode or generate, and you get whatever the model gives you. That's interesting but not very practical. What if we want to generate a portrait of a smiling woman? Or convert a sketch into a photograph? Or generate a specific digit?

The solution is remarkably consistent across model families: feed additional information to the network. In a Conditional GAN (cGAN), you concatenate a label y — a class label, a text embedding, an input image — to both the noise vector for G and the input for D. G learns to produce images conditioned on y. D learns to judge whether an image is a realistic example of category y.

The objective becomes:

min_G max_D 𝔼_x[log D(x|y)] + 𝔼_z[log(1 − D(G(z|y)|y))]

The same idea works for VAEs. A Conditional VAE (CVAE) feeds the label y into both the encoder and decoder. The encoder produces q(z|x, y) — a distribution over latent codes given both the image and the label. The decoder produces p(x|z, y). Now you can fix y = "smiling" and sample different z values to get diverse smiling faces.
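
Mechanically, the conditioning is usually just concatenation. A sketch for a class-conditional generator, where the label is embedded and stitched onto the noise vector; every dimension here (latent size, embedding size, hidden width, 10 classes, 784-pixel output) is an illustrative choice, and the same pattern applies to the discriminator or to a CVAE's encoder and decoder:

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, latent_dim=100, n_classes=10, embed_dim=16, img_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, embed_dim)    # label -> dense vector
        self.net = nn.Sequential(
            nn.Linear(latent_dim + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Sigmoid(),
        )

    def forward(self, z, y):
        zy = torch.cat([z, self.embed(y)], dim=1)          # condition by concatenation
        return self.net(zy)

G = ConditionalGenerator()
z = torch.randn(5, 100)
y = torch.full((5,), 7, dtype=torch.long)                  # "give me five different 7s"
samples = G(z, y)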

The applications range widely. Pix2Pix conditions a GAN on an input image to do paired image-to-image translation: sketch to photo, segmentation map to street scene, day to night. CycleGAN extends this to unpaired translation: horses to zebras, photos to paintings, without requiring matched training pairs. The idea of "condition on something extra" is one of those simple additions that opened an enormous space of practical applications.

Back at our portrait studio: the vanilla GAN is a forger who produces random faces. The conditional GAN is a forger who takes commissions. "Give me a bearded man in his 40s" — and the forger obliges.

Normalizing flows: invertible transformations

VAEs and GANs each have a frustrating weakness. VAEs give you a tractable (approximate) likelihood and a structured latent space, but the outputs are blurry. GANs give you sharp outputs, but no likelihood computation and a temperamental training process. Is there a model that gives you exact likelihoods and high-quality samples?

Normalizing flows take a completely different approach. The idea: start with a simple distribution you know how to work with (a standard normal), and warp it through a series of invertible transformations until it matches the data distribution. Because every transformation is invertible, you can go both directions — from noise to data (generation) and from data to noise (density evaluation).

The mathematical backbone is the change of variables formula. If z has a known density pz(z), and x = f(z) where f is invertible and differentiable, then:

log p_x(x) = log p_z(f⁻¹(x)) + log |det(∂f⁻¹/∂x)|

That second term, the log-absolute-determinant of the Jacobian, is what makes this hard. For a transformation f from ℝ^d to ℝ^d, the Jacobian is a d×d matrix. Computing its determinant naively costs O(d³). For a 784-dimensional image, that's impractical.

Here's where the clever engineering comes in. Coupling layers, introduced in the NICE paper (2014) and refined in RealNVP (2017), are transformations specifically designed to have tractable Jacobian determinants. The trick: split the input into two halves, x_a and x_b. Leave x_a unchanged. Transform x_b using a function that depends on x_a:

y_a = x_a                                     (unchanged)
y_b = x_b ⊙ exp(s(x_a)) + t(x_a)            (scaled and shifted)

Here s() and t() are neural networks that output scale and translation parameters. They can be arbitrarily complex — the coupling layer doesn't require them to be invertible, because x_a passes through unchanged: to invert, we just recompute s(x_a) and t(x_a) from the untouched half and undo the scale and shift. The inverse is easy:

x_a = y_a
x_b = (y_b - t(y_a)) ⊙ exp(-s(y_a))

And the Jacobian of this transformation is triangular, which means its determinant is the product of the diagonal elements — so the log-determinant is simply the sum of the components of s(x_a). No expensive matrix operations. You stack many coupling layers, alternating which half gets transformed, and the composition of invertible functions is itself invertible. The total log-determinant is the sum of each layer's log-determinant.
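
A sketch of a single affine coupling layer following the equations above; the even split and the small MLPs for s() and t() are illustrative choices:

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        # s() and t() can be arbitrary networks: they never need to be inverted
        self.s = nn.Sequential(nn.Linear(self.half, 64), nn.Tanh(), nn.Linear(64, dim - self.half))
        self.t = nn.Sequential(nn.Linear(self.half, 64), nn.Tanh(), nn.Linear(64, dim - self.half))

    def forward(self, x):
        xa, xb = x[:, :self.half], x[:, self.half:]
        s, t = self.s(xa), self.t(xa)
        ya, yb = xa, xb * torch.exp(s) + t                 # y_b = x_b ⊙ exp(s(x_a)) + t(x_a)
        log_det = s.sum(dim=1)                             # log|det J| = sum of s(x_a)
        return torch.cat([ya, yb], dim=1), log_det

    def inverse(self, y):
        ya, yb = y[:, :self.half], y[:, self.half:]
        s, t = self.s(ya), self.t(ya)
        xb = (yb - t) * torch.exp(-s)                      # undo the scale and shift
        return torch.cat([ya, xb], dim=1)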

The Glow model (Kingma and Dhariwal, 2018) added invertible 1×1 convolutions between coupling layers — learnable permutations that mix information across channels before each coupling step. This boosted expressiveness substantially and produced the first flow-based model that generated genuinely convincing face images.

The tradeoff with flows is real, though. The invertibility constraint limits what architectures you can use. You can't have a bottleneck (that would destroy invertibility), so there's no dimensionality reduction. The models tend to be large, and while sample quality has improved enormously, it has historically lagged behind GANs and later diffusion models for the most challenging image generation tasks. But for applications where you need exact likelihoods — anomaly detection, density estimation, lossless compression — flows are hard to beat.

Energy-based models: the landscape view

Every generative model so far has had a specific architecture with a specific generation mechanism: encode/decode, adversarial game, invertible transform. Energy-based models strip all that away and work with the most general formulation possible.

The idea: define a function E_θ(x) that assigns a scalar energy to every possible input x. Low energy means "this looks like real data." High energy means "this is unlikely." The probability of x is:

p_θ(x) = exp(−E_θ(x)) / Z_θ

Where Z_θ = ∫ exp(−E_θ(x)) dx is the partition function, a normalizing constant that makes the density integrate to 1.

Picture an energy landscape — a mountainous terrain where valleys represent likely data points and peaks represent unlikely ones. Faces live in deep valleys. Random noise lives on mountain peaks. Our portrait studio's entire collection of headshots occupies a particular valley in this landscape.

The problem — and it's a big one — is that Z_θ requires integrating over all possible inputs. For a 784-dimensional image, that's an integral over ℝ^784. Completely intractable. This means we can evaluate the unnormalized density exp(−E(x)) for any given x, but not the actual probability. We know which valleys are deeper, but not by how much relative to the total landscape.

Score matching (Hyvärinen, 2005) sidesteps this brilliantly. Instead of learning the density p(x), learn its score function — the gradient of log p(x) with respect to x. The score function is ∇_x log p(x) = −∇_x E(x). Notice what happened: the partition function Z_θ disappeared. It's a constant with respect to x, so its gradient is zero. We can learn the score function without ever computing Z_θ.

But how do you generate samples if you only know the score? This is where Langevin dynamics comes in. Start with a random point x₀ (noise). Then iteratively update:

xₜ₊₁ = xₜ + (ε/2) ∇_x log p(xₜ) + √ε · noise

Each step nudges x in the direction of higher probability (following the score) while adding a bit of random noise. Over many steps, x migrates from the mountain peaks down into the valleys where real data lives. It's like rolling a ball downhill on the energy landscape, with random jostling to prevent it from getting stuck in shallow dips.
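
A sketch of that sampling loop, assuming energy is a trained network that maps a batch of inputs to scalar energies; the step size and step count are illustrative:

import torch

def langevin_sample(energy, shape, n_steps=200, eps=0.01):
    x = torch.randn(shape)                                    # start from noise, up on the peaks
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        score = -torch.autograd.grad(energy(x).sum(), x)[0]   # ∇_x log p(x) = −∇_x E(x)
        x = x.detach() + 0.5 * eps * score + (eps ** 0.5) * torch.randn_like(x)
    return x.detach()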

I'll be honest — the connection between energy-based models and diffusion models is one of those things I had to read three times before it clicked. Diffusion models, which we'll cover in the next section, can be understood as a specific way of doing score matching across multiple noise levels. The line between "EBM with Langevin sampling" and "diffusion model" is blurrier than most presentations suggest. Song and Ermon's 2019 work on score-based generative models made this connection explicit and paved the way for the diffusion revolution.

Energy-based models are the most flexible generative framework — the energy function E(x) can be any neural network, with no architectural constraints like invertibility or encoder-decoder structure. The cost of that flexibility is slow, iterative sampling (hundreds or thousands of Langevin steps) and tricky training (score matching has its own failure modes). But the theoretical framework is elegant, and it underpins much of the modern generative modeling landscape.

The generative model landscape

We've covered five families of generative models, and at this point it helps to see them side by side. Each makes a different tradeoff.

Model Family        | How It Generates                      | Likelihood?        | Sample Quality           | Key Weakness
VAE                 | Sample z ~ N(0, I), decode            | Approximate (ELBO) | Smooth, sometimes blurry | Reconstruction-KL tension
GAN                 | Sample z, one forward pass through G  | None               | Sharp, realistic         | Mode collapse, training instability
Normalizing Flow    | Sample z, apply inverse transforms    | Exact              | Good, improving          | Invertibility limits architecture
Energy-Based        | Langevin dynamics (iterative)         | Unnormalized       | Varies                   | Slow sampling, tricky training
Diffusion (preview) | Iterative denoising                   | Approximate        | Excellent                | Slow (many denoising steps)

Notice a pattern: there's no free lunch. Exact likelihoods (flows) cost architectural flexibility. Sharp samples (GANs) cost training stability and mode coverage. Flexible architectures (EBMs) cost efficient sampling. Diffusion models, which we'll meet in the next section, found a sweet spot by combining ideas from score matching and denoising autoencoders — but they pay for it with slow, multi-step generation.

The portrait studio analogy holds throughout. The VAE forger produces acceptable work but it's a bit fuzzy. The GAN forger produces photorealistic work but sometimes gets stuck forging the same face. The flow forger can tell you exactly how probable any face is, but needs a very specific set of tools. The energy-based forger is the most versatile artist of the bunch, but takes forever to finish each piece. Each has their place.

Wrap-up

If you're still with me, thank you. I hope it was worth it.

We started with the simplest possible idea — compress an image through a bottleneck and reconstruct it — and built up, one piece at a time, through denoising and sparsity tricks, through the probabilistic leap of VAEs with their reparameterization trick and ELBO loss, through the adversarial drama of GANs with their mode collapse and the Wasserstein fix, through conditional generation, through the invertible elegance of normalizing flows, and finally to the energy landscape view that connects to everything that came after.

My hope is that the next time you encounter a paper mentioning "ELBO," "Wasserstein distance," "coupling layers," or "score matching," instead of feeling that familiar urge to quietly change the subject, you'll have a pretty good mental model of what's going on under the hood. These aren't separate, disconnected ideas — they're all different answers to the same question: how do you teach a neural network to create?

Resources and credits

Kingma and Welling, "Auto-Encoding Variational Bayes" (2013) — the O.G. VAE paper. Concise and readable, which is rare for a paper this influential.

Goodfellow et al., "Generative Adversarial Nets" (2014) — the paper that started the GAN era. The theoretical analysis is elegant and the experiments look quaint by modern standards.

Arjovsky, Chintala, and Bottou, "Wasserstein GAN" (2017) — a wildly insightful paper. The explanation of why JS divergence fails in high dimensions is worth reading even if you never train a GAN.

Dinh, Sohl-Dickstein, and Bengio, "Density estimation using Real NVP" (2017) — the paper that made normalizing flows practical. The coupling layer idea is one of those things that seems obvious in retrospect.

Lilian Weng's blog posts on VAEs, GANs, and flow-based models — unforgettable overviews. If you read one blog on generative models besides this one, read hers.

Song and Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution" (2019) — the bridge between energy-based models and diffusion models. This paper changed everything that followed.