Diffusion Models
I avoided diffusion models for longer than I'd like to admit. Every time someone mentioned "denoising score matching" or "the forward process is a Markov chain with a Gaussian kernel," I'd nod politely and change the subject. GANs I understood — two networks fighting, one fakes, one detects. Intuitive. Adversarial. Dramatic. But diffusion? Adding noise a thousand times and then learning to undo it? That felt like the strangest possible way to build an image generator. Finally, the discomfort of not knowing what Stable Diffusion actually does under the hood grew too great, so I took the dive. This is what I found.
Diffusion models are a family of generative models first proposed by Sohl-Dickstein et al. in 2015 and brought to prominence by the DDPM paper (Ho, Jain, & Abbeel, 2020). They power Stable Diffusion, DALL·E, Midjourney, Imagen, and nearly every state-of-the-art image generator as of 2024. The core idea is disarmingly simple: destroy data with noise, then train a neural network to reverse the destruction.
Before we start, a heads-up. We're going to be working through probability distributions, Gaussian math, and some neural network architecture. But you don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.
This isn't a short journey, but I hope you'll be glad you came.
Contents
The forward process — noise by numbers
The closed-form shortcut
The reverse process — learning to denoise
Noise prediction vs. score prediction
The training loop
Rest Stop
DDPM sampling — the slow walk back
DDIM — taking the highway
Classifier-free guidance — the magic dial
Latent diffusion — Stable Diffusion from the inside
ControlNet — giving the model a skeleton
Consistency models — one giant step
Flow matching and rectified flows — the next chapter
Wrap-up
Resources
Kicking Over the Sand Castle
Imagine a child on a beach spends an hour building an elaborate sand castle — turrets, a moat, a drawbridge made from a popsicle stick. Now imagine a second child walks over and kicks it. The castle is rubble in two seconds. Destruction took no skill, no understanding of architecture, no creativity. It was free.
Rebuilding that castle from rubble? That requires knowing what castles look like. Knowing where the turrets go. Knowing what sand at the right moisture level feels like. The reconstruction demands understanding that the destruction did not.
That asymmetry — destruction is trivial, reconstruction requires knowledge — is the entire foundation of diffusion models. We're going to build an image generator by formalizing this idea into precise math. The destruction is Gaussian noise. The reconstruction is a neural network that learns what images look like by learning to undo tiny amounts of damage.
Let's make this concrete with a toy example. Take a tiny 2×2 grayscale image — four pixels, nothing more. Maybe the pixel values are [0.8, 0.3, 0.5, 0.9]. That's our sand castle. Now we add a small amount of random noise to every pixel. The values shift to something like [0.82, 0.27, 0.53, 0.86]. Still recognizable. We add more noise. [0.91, 0.14, 0.68, 0.72]. Getting rougher. After a thousand rounds of noise-adding, the pixel values are something like [−0.42, 1.13, 0.07, −0.88] — random draws from a bell curve, with no trace of the original image. Pure static. Rubble.
That progressive corruption is called the forward process. It requires zero learning. It is pure math. The question that makes diffusion models interesting: can we train a neural network to look at the rubble and work out how to reverse each tiny noise step, eventually walking all the way back to a clean image?
It turns out we can. And the result surpasses GANs in image quality, trains with the stability of supervised learning, and requires nothing more exotic than mean squared error as a loss function. But I'm getting ahead of myself.
The Forward Process — Noise by Numbers
Let's formalize the sand-kicking. We start with a clean data sample — call it x₀. It could be a photograph, a piece of audio, anything represented as an array of numbers (for an image, the pixel values). We're going to corrupt it over T timesteps, where T is typically 1000.
At each step t, we take the current state xt−1 and produce xt by scaling the signal down slightly and adding a small dose of Gaussian noise:
q(xₜ | xₜ₋₁) = 𝒩(xₜ; √(1 − βₜ) · xₜ₋₁, βₜ · I)
This notation means: xt is drawn from a Gaussian (normal) distribution whose mean is √(1 − βt) · xt−1 and whose variance is βt. In plain language: shrink the signal a little, add noise proportional to βt.
The parameter βt is the noise schedule. It's a predefined, monotonically increasing sequence — a dial that we turn up gradually. In the original DDPM, β₁ starts at 0.0001 (a whisper of noise) and βT ends at 0.02 (still a small number, but after a thousand applications, the cumulative effect is devastating). No single step changes the image much. A thousand steps obliterate it entirely.
Going back to our sand castle analogy: each step is one grain of sand getting displaced. Barely noticeable. But after a thousand rounds of displacement, you've got a flat beach. The key property is that this is a Markov chain — each step depends only on the previous state, not the entire history. The chain has no memory. It doesn't need any.
There's a subtlety in the scaling factor √(1 − βt) that tripped me up initially. Why not add noise without scaling? Because we need the variance to stay bounded. If we kept adding noise without ever shrinking the signal, the total variance would grow without bound and the numbers would explode. The scaling keeps the distribution well-behaved: if the data is normalized to unit variance, the total variance of xₜ stays at 1 at every step. This property is called variance preservation, and it's what makes the math work out so cleanly.
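To make the formula concrete, here's a minimal sketch of one forward step in PyTorch, using the linear schedule described above (the variable names are mine, not taken from any particular codebase):

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # linear noise schedule: beta_1 = 0.0001, beta_T = 0.02

def forward_step(x_prev, t):
    # One step of the forward process: shrink the signal, add beta_t-scaled Gaussian noise
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * noise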
The Closed-Form Shortcut
Here's where the math becomes genuinely beautiful, and where training becomes practical.
If we needed to create a noisy version of an image at timestep 500, we'd have to run the forward chain 500 times — applying noise, step by step, from x₀ to x₁ to x₂ all the way to x₅₀₀. For training a neural network that needs millions of noisy examples at random timesteps, this would be catastrophically slow.
But there's a shortcut. Define two quantities:
αₜ = 1 − βₜ
ᾱₜ = α₁ · α₂ · … · αₜ = ∏ (1 − βₛ) for s = 1 to t
αt (alpha) is how much of the signal survives one step. ᾱt (alpha-bar) is the cumulative product — how much of the original signal survives after t steps. When t is small, ᾱt is close to 1 (most signal survives). When t is large, ᾱt is close to 0 (almost no signal remains).
With these definitions, we can jump from the clean image x₀ directly to any timestep t in a single operation:
q(xₜ | x₀) = 𝒩(xₜ; √ᾱₜ · x₀, (1 − ᾱₜ) · I)
Which means: xₜ = √ᾱₜ · x₀ + √(1 − ᾱₜ) · ε where ε ~ 𝒩(0, I)
Read that formula out loud: "The noisy image at step t equals the clean image scaled by √ᾱt, plus pure random noise scaled by √(1 − ᾱt)." The two scaling factors are complementary: their squares sum to 1, which is exactly the variance-preserving property from before. At t = 0, we get almost all signal. At t = T, we get almost all noise.
This is called the reparameterization trick — writing a sample from a distribution as a deterministic function of its parameters plus independent noise. It's the same trick used in VAEs, where it lets gradients flow through the sampling step. Here the main payoff is different: it turns "sample xₜ" into a single deterministic formula, and it makes the noise ε an explicit quantity the network can be trained to predict.
The practical payoff is enormous. Instead of chaining 500 noise steps to create x₅₀₀, we compute it in a single line: scale the original image, add scaled noise, done. Training goes from O(T) per sample to O(1). That's the difference between a training run that finishes in days and one that never finishes at all.
Let's trace through our 2×2 toy image to make this concrete. Say our clean image is x₀ = [0.8, 0.3, 0.5, 0.9]. At t = 250, suppose ᾱ₂₅₀ = 0.5. Then √ᾱ₂₅₀ ≈ 0.707, and √(1 − ᾱ₂₅₀) ≈ 0.707 as well. We draw random noise ε = [0.1, −0.6, 0.3, −0.2] and compute x₂₅₀ = 0.707 · [0.8, 0.3, 0.5, 0.9] + 0.707 · [0.1, −0.6, 0.3, −0.2] = [0.64, −0.21, 0.57, 0.49]. Half signal, half noise — a very foggy version of the original. At t = 999, ᾱ would be nearly 0, and x₉₉₉ would be indistinguishable from random Gaussian noise. The sand castle is a flat beach.
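If you'd rather check that arithmetic with code than by hand, here's the same jump as a few lines of PyTorch. The cumulative product is built from the betas in the earlier snippet, and the ᾱ value at t = 250 is assumed to be exactly 0.5, as in the text:

alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)       # cumulative signal survival: alpha-bar

x0 = torch.tensor([0.8, 0.3, 0.5, 0.9])        # the toy image, flattened
eps = torch.tensor([0.1, -0.6, 0.3, -0.2])     # the noise we drew
ab = torch.tensor(0.5)                         # assumed alpha-bar at t = 250
x_250 = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps
# tensor([ 0.6364, -0.2121,  0.5657,  0.4950]), matching the rounded values above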
The Reverse Process — Learning to Denoise
We've formalized destruction. Now comes reconstruction.
The key mathematical insight is this: when each individual noise step is small enough, the reverse of a Gaussian corruption step is also Gaussian. This means that if we can learn the parameters of that reverse Gaussian — specifically its mean — we can undo the noise, one step at a time.
We parameterize the reverse process as:
p_θ(xₜ₋₁ | xₜ) = 𝒩(xₜ₋₁; μ_θ(xₜ, t), σₜ² · I)
The variance σt² is typically set to βt (a fixed value from the noise schedule). The mean μθ is the part the neural network has to learn. The subscript θ signals that this is parameterized by learnable weights.
Now, there are three different things we could train the network to predict. Each one gives us the mean μθ through a different formula, but they're mathematically equivalent — the same information in different packaging:
Option 1: Predict x₀. The network looks at the noisy image xt and directly guesses what the clean image was. Intuitive, but the difficulty of this prediction varies wildly with t. At t = 5, the input is almost clean — easy. At t = 999, the input is pure noise — nearly impossible. The loss scale across timesteps is all over the place.
Option 2: Predict the mean μ. Directly output the reverse step mean. This is what the math naturally suggests, but in practice it's the least common.
Option 3: Predict the noise ε. The network εθ(xt, t) takes in the noisy image and the timestep, and guesses which specific noise pattern was added. This is what Ho et al. chose in DDPM, and it turned out to be the winning parameterization. The reason: the noise ε always has zero mean and unit variance, regardless of timestep or image content. The prediction target is well-behaved and uniform across all t, so the loss scale stays stable throughout training.
Given the noise prediction, we can recover the reverse step mean:
μ_θ(xₜ, t) = (1 / √αₜ) · (xₜ − (βₜ / √(1 − ᾱₜ)) · ε_θ(xₜ, t))
And here is the punchline — the loss that trains the whole thing:
L = 𝔼[ ‖ε − ε_θ(xₜ, t)‖² ]
Mean squared error between the actual noise that was added and the noise the network predicted. That's it. No adversarial objectives. No KL divergence balancing act. No mode collapse. No training instability. A regression problem, and it trains with the same reliability as predicting house prices from square footage.
I'll be honest — when I first saw this, I didn't believe it. An MSE loss that produces photorealistic images? After years of struggling with GAN training — mode collapse, discriminator oscillation, hyperparameter sensitivity — this felt too easy. But that simplicity is real, and it's the main reason diffusion models won.
Noise Prediction vs. Score Prediction
There's a parallel universe of diffusion research that arrived at essentially the same algorithm from a completely different starting point. It's worth understanding both, because the terminology bleeds across papers and you'll encounter both frames in the wild.
In the noise prediction frame (DDPM), the network predicts ε — the noise that was added to create xt from x₀. We covered this above.
In the score prediction frame (Song & Ermon, 2019), the network predicts the score function — defined as the gradient of the log probability of the data: ∇x log p(x). In plain language, the score points in the direction where data becomes more probable. If you're at a noisy image, the score tells you which way to nudge it to make it look more like a real image.
Here's the connection that took me an embarrassingly long time to see: for Gaussian noise, these two things are the same, up to a scaling factor. The score function at timestep t is related to the noise by:
∇_{xₜ} log p(xₜ) = −ε / √(1 − ᾱₜ)
So predicting the noise and predicting the score are equivalent — they contain the same information, packaged differently. The score is the noise, divided by the noise standard deviation and flipped in sign. A network trained to do one can trivially do the other with a single rescaling.
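In code the conversion is a one-liner. A minimal sketch, reusing the alpha_bar tensor defined earlier and assuming t is a plain integer timestep:

def noise_to_score(eps_pred, t):
    # score = -epsilon / sqrt(1 - alpha_bar_t): rescale and flip the sign
    return -eps_pred / torch.sqrt(1.0 - alpha_bar[t])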
This equivalence has a name: it falls out of Tweedie's formula, a result from the 1950s that tells you the best estimate of the original signal given noisy observations. Tweedie says: your best guess of x₀ given xt is xt plus the noise variance times the score. The diffusion model community essentially rediscovered a seventy-year-old statistical result and built the most powerful generative models in history on top of it.
In practice, most production systems use noise prediction (the DDPM framing) because the implementation is more straightforward. But the score-based framing gives deeper theoretical insight — it connects diffusion models to stochastic differential equations (SDEs), which opens the door to the ODE-based samplers we'll see shortly.
The Training Loop
Everything we've built so far collapses into a training loop that fits on one screen. Here it is:
import torch
import torch.nn.functional as F
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)
def train_step(model, x0, optimizer):
batch_size = x0.shape[0]
# Pick a random timestep for each image in the batch
t = torch.randint(0, T, (batch_size,), device=x0.device)
# Draw fresh noise
epsilon = torch.randn_like(x0)
# Jump to timestep t in one shot (the closed-form shortcut)
    ab_t = alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)  # move the schedule to x0's device before indexing
x_t = torch.sqrt(ab_t) * x0 + torch.sqrt(1.0 - ab_t) * epsilon
# Ask the network: what noise was added?
epsilon_pred = model(x_t, t)
# MSE between real noise and predicted noise
loss = F.mse_loss(epsilon_pred, epsilon)
optimizer.zero_grad()
loss.backward()
optimizer.step()
return loss.item()
Five lines of real logic. Sample a random timestep. Draw noise. Create the noisy image in one shot using the closed-form formula. Feed it to the network. Penalize the difference between real noise and predicted noise.
The network itself — a U-Net, which we'll explore later — receives two inputs: the noisy image xt and the timestep t (so it knows which noise level it's dealing with). It outputs a tensor the same shape as xt, representing its best guess of ε. The architecture is important, but the training objective is what makes diffusion models tick, and the training objective is this five-line loop.
Compare this to a GAN training loop: two networks, two losses, a minimax game, careful balancing of generator vs. discriminator updates, mode collapse diagnostics, gradient penalty terms. The diffusion training loop doesn't even need a second network. It's one-player solitaire.
Rest Stop
Congratulations on making it this far. You can stop here if you want.
You now have a working mental model of diffusion: the forward process destroys data by adding Gaussian noise over many steps. The reverse process trains a neural network to predict and remove that noise, one step at a time. Training is stable, the loss is MSE, and the closed-form shortcut makes it practical. You could implement a basic diffusion model on MNIST right now with what you know.
This doesn't tell the complete story, though. We haven't talked about how to actually generate images (sampling is its own adventure), how to make text control what gets generated (classifier-free guidance), how to make the whole thing fast enough to ship as a product (latent diffusion), or where the field is heading next (flow matching). Those pieces are coming.
But if you're at a dinner party and someone says "diffusion models," you can now hold your own. The short version: add noise until the image is static, train a network to predict the noise at every level, then run the process in reverse to create images from nothing. There. You're 70% of the way there.
But if the discomfort of not knowing what's underneath is nagging at you, read on.
DDPM Sampling — The Slow Walk Back
Training teaches the network to denoise. Sampling is where we actually use it to create images from scratch. Start with pure random noise — a flat beach — and apply the learned reverse process step by step until a sand castle emerges.
@torch.no_grad()
def ddpm_sample(model, shape, device):
x = torch.randn(shape, device=device) # start from pure noise
for t in reversed(range(T)):
t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
eps_pred = model(x, t_batch)
alpha_t = alphas[t]
ab_t = alpha_bar[t]
# Remove predicted noise, scaled by the schedule
x = (1.0 / torch.sqrt(alpha_t)) * (
x - (betas[t] / torch.sqrt(1.0 - ab_t)) * eps_pred
)
# Inject fresh randomness (except at the final step)
if t > 0:
x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
return x
Each iteration asks the network: "given this somewhat-noisy image at timestep t, what noise was added?" The code then subtracts a scaled version of that predicted noise, nudging xt toward xt−1. At each step (except the last), we add a small amount of fresh randomness — this stochasticity is what gives DDPM its diversity, producing different images from different random seeds.
The problem is speed. T = 1000 means a thousand sequential neural network forward passes. On an A100 GPU, that's 10–30 seconds for a single 512×512 image. For a product serving millions of requests, this is a non-starter. The race for faster sampling has been one of the most productive areas in all of generative AI.
Think of it this way: if the forward process is kicking over the sand castle (instantaneous), the reverse process as DDPM implements it is rebuilding it grain by grain. We need to find a way to place whole handfuls at a time.
DDIM — Taking the Highway
DDIM — Denoising Diffusion Implicit Models (Song, Meng, & Ermon, 2020) — offers a different perspective on the same problem. The insight: the reverse process doesn't have to be stochastic. We can rewrite it as a deterministic ordinary differential equation (ODE).
Here's the intuition. DDPM's reverse process is like a hiker taking a winding, random path down a mountain — sometimes going left, sometimes right, always trending downhill. DDIM replaces that with a straight road. Same starting point, same destination, but no randomness along the way. Given the same starting noise, DDIM always produces the same output image.
The mathematical trick: DDIM defines a family of reverse processes parameterized by η (eta). When η = 1, you get DDPM (fully stochastic). When η = 0, you get a fully deterministic mapping. The beautiful thing is that all of these processes produce samples from the same distribution — they arrive at the same mountain village via different roads.
Because the path is deterministic and smooth, you can take bigger steps. Instead of stepping through every one of 1000 timesteps, you pick a subsequence — say, [1000, 950, 900, …, 50, 0] — and skip between them. In practice, 50–100 DDIM steps produce quality close to 1000 DDPM steps, a 10–20× speedup for essentially free.
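Here's what the deterministic (η = 0) variant looks like in code. This is a minimal sketch reusing T and alpha_bar from the training snippet; a real implementation would also handle η > 0 and more careful timestep spacing:

@torch.no_grad()
def ddim_sample(model, shape, device, num_steps=50):
    timesteps = torch.linspace(T - 1, 0, num_steps).long()   # e.g. 50 steps instead of 1000
    x = torch.randn(shape, device=device)
    for i, t in enumerate(timesteps):
        t_batch = torch.full((shape[0],), int(t), device=device, dtype=torch.long)
        eps_pred = model(x, t_batch)
        ab_t = alpha_bar[t]
        # Estimate the clean image implied by the predicted noise
        x0_pred = (x - torch.sqrt(1.0 - ab_t) * eps_pred) / torch.sqrt(ab_t)
        if i < num_steps - 1:
            ab_prev = alpha_bar[timesteps[i + 1]]
            # Deterministic jump to the next (earlier) timestep: no fresh noise injected
            x = torch.sqrt(ab_prev) * x0_pred + torch.sqrt(1.0 - ab_prev) * eps_pred
        else:
            x = x0_pred
    return x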
There's a bonus. Deterministic sampling means the mapping from noise to image is invertible. Given an image, you can run DDIM forward to find its noise code, edit the noise, and run DDIM backward to get a modified image. This is how many image editing tools (inpainting, style transfer) work under the hood: they operate in the noise space and let the deterministic ODE handle the translation back to pixel space.
Higher-order ODE solvers push this even further. DPM-Solver++ (Lu et al. 2022) applies second- and third-order numerical methods — think Runge-Kutta, the same techniques used for simulating physical systems — to the diffusion ODE. The result: 10–20 steps with quality that rivals 1000-step DDPM. This is what made real-time diffusion applications feasible.
Classifier-Free Guidance — The Magic Dial
Everything so far generates random images. They're valid samples from the data distribution, but we can't control what they depict. Saying "I want a cat astronaut on the moon" requires conditioning — feeding the model some information about what to generate. Making that conditioning work well was the breakthrough that launched the text-to-image revolution.
The first attempt was classifier guidance (Dhariwal & Nichol, 2021). Train a separate image classifier on noisy images. During sampling, use the classifier's gradient to nudge the diffusion process toward a target class. It worked, but it required a separately trained classifier, only supported class labels (not free-form text), and the classifier had to be robust to every noise level — an awkward engineering burden.
Classifier-free guidance (Ho & Salimans, 2022) threw all that away with a shockingly elegant trick. During training, randomly drop the conditioning 10–20% of the time, replacing the text prompt with an empty placeholder. This forces the single model to learn both conditional generation ("an astronaut cat") and unconditional generation ("any image at all") in the same set of weights.
At inference, we run the model twice per denoising step:
ε_uncond = model(x_t, t, cond=∅) # "what would any random image look like?"
ε_cond = model(x_t, t, cond=prompt) # "what would THIS image look like?"
ε_guided = ε_uncond + w · (ε_cond − ε_uncond)
The term (εcond − εuncond) isolates the signal that comes purely from the text conditioning — the features that distinguish "cat astronaut" from "generic image." The guidance scale w amplifies that signal. Think of w as a volume dial on a radio. At w = 1, you're hearing the music as recorded — standard conditional generation. Crank it to w = 7.5 (Stable Diffusion's default), and the distinctive features of the prompt get amplified — the image strongly matches the text. Push it past w = 20, and you're cranking the dial so far that the music distorts — oversaturated colors, artifacts, the model trying too hard.
The tradeoff maps cleanly to one in language models: high guidance scale is like low temperature. You get sharper, more prompt-faithful results, but less diversity. Low guidance scale is like high temperature — more creative, more varied, but sometimes off-prompt. Production systems typically land between w = 5 and w = 12.
There's a delightful consequence that falls out of the math for free: negative prompts. Instead of comparing against the null embedding, substitute the embedding of things you want to avoid — "blurry, low quality, deformed hands." The guidance formula then pushes the output away from the negative prompt and toward the positive one. This isn't a hack. It's a direct consequence of how the linear extrapolation works. Every Stable Diffusion user who types a negative prompt is, without realizing it, doing arithmetic in noise-prediction space.
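In code, the guided prediction is just a few lines. A minimal sketch, assuming a hypothetical model that takes a cond argument holding the text embedding; swapping the unconditional embedding for a negative prompt's embedding gives you negative prompting with no extra machinery:

def guided_noise(model, x_t, t, cond_emb, uncond_emb, w=7.5):
    # Run the model twice: once without the prompt, once with it
    eps_uncond = model(x_t, t, cond=uncond_emb)   # uncond_emb: empty prompt, or a negative prompt
    eps_cond = model(x_t, t, cond=cond_emb)
    # Extrapolate along the direction the prompt adds, amplified by the guidance scale w
    return eps_uncond + w * (eps_cond - eps_uncond)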
Latent Diffusion — Stable Diffusion from the Inside
A 512×512 RGB image has 786,432 dimensions. Running a U-Net a thousand times — or even fifty times — on tensors that large requires enormous compute and memory. Research labs could afford it. Nobody else could. This was the gap between "impressive paper" and "product that works on a laptop."
Latent Diffusion Models (Rombach et al. 2022) closed that gap with a two-stage design so clean it's hard to believe nobody thought of it sooner. (People had thought of it sooner, but the Rombach team made it work reliably.)
Stage 1: Compress. Train an autoencoder — specifically a KL-regularized VAE — that compresses 512×512×3 images into 64×64×4 latent tensors. That's a 48× reduction in dimensionality. The encoder squeezes an image into a compact code, the decoder reconstructs it. Train this once with a combination of pixel loss, perceptual loss (comparing VGG features), and a KL penalty that keeps the latent distribution close to a standard normal. Then freeze it.
Stage 2: Diffuse in latent space. Run the entire noise-and-denoise pipeline on 64×64×4 latents instead of 512×512×3 pixels. Same math, same training loop, same MSE loss. Everything is 48× smaller.
Stable Diffusion pipeline (end to end):

"A cat astronaut on the moon, oil painting"
                    │
                    ▼
        ┌───────────────────────┐
        │   CLIP Text Encoder   │  (frozen, 77 × 768 token embeddings)
        └───────────┬───────────┘
                    │  cross-attention at every U-Net layer
                    ▼
 z_T  ──────→  U-Net Denoiser  ──────→  z₀
 noise         (self-attention          clean latent
 64×64×4       + cross-attention        64×64×4
               + timestep embed)
                                          │
                                          ▼
                                  ┌───────────────┐
                                  │  VAE Decoder  │  (frozen)
                                  │  64×64×4 →    │
                                  │  512×512×3    │
                                  └───────┬───────┘
                                          │
                                          ▼
                                  512×512 RGB image
Three components, each doing one thing well. The CLIP text encoder converts text into a sequence of embedding vectors — 77 tokens, each a 768-dimensional vector. It was trained by OpenAI on 400 million image-text pairs to learn a shared embedding space where images and text that describe the same thing land near each other. In Stable Diffusion, it's frozen — its weights never change during diffusion training.
The U-Net denoiser is the only component that gets trained. It's a classic encoder-decoder architecture with skip connections (borrowed from image segmentation), enhanced with three key ingredients for diffusion:
- Timestep embeddings: the integer t gets mapped through sinusoidal encoding (like Transformer positional encoding) and an MLP, then injected into every residual block through adaptive group normalization, so the network knows whether it's denoising heavy static or polishing fine details.
- Self-attention layers: let distant spatial locations attend to each other, giving the model global coherence.
- Cross-attention layers: image features form the queries, text embeddings form the keys and values. This is how the word "cat" tells a particular spatial region to generate fur, and "moon" tells the background to be grey and cratered. A sketch of this query/key/value split follows below.
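Here's a minimal sketch of that query/key/value split using PyTorch's built-in multi-head attention. The dimensions are illustrative; real implementations fuse this into the U-Net's residual blocks and handle the spatial reshaping:

import torch.nn as nn

class TextImageCrossAttention(nn.Module):
    def __init__(self, img_dim=320, txt_dim=768, n_heads=8):
        super().__init__()
        # Queries come from image features, keys and values from text embeddings
        self.attn = nn.MultiheadAttention(img_dim, n_heads,
                                          kdim=txt_dim, vdim=txt_dim,
                                          batch_first=True)

    def forward(self, img_tokens, text_tokens):
        # img_tokens: (batch, H*W, img_dim), text_tokens: (batch, 77, txt_dim)
        out, _ = self.attn(query=img_tokens, key=text_tokens, value=text_tokens)
        return out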
The VAE decoder upsamples the clean 64×64×4 latent back to a 512×512×3 pixel image. Also frozen after Stage 1 training.
The compute savings are not incremental — they're transformational. Stable Diffusion v1.5 (860M parameters in the U-Net) runs on a consumer GPU with 8GB of VRAM. SDXL (2023) scales to dual text encoders, a 2.6B U-Net, and 1024×1024 outputs. Stable Diffusion 3 (2024) goes further: it replaces the U-Net with a Diffusion Transformer (DiT) — a plain Vision Transformer operating on patchified latent tokens — and adopts flow matching instead of traditional diffusion, which we'll get to at the end.
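If you just want to see the whole pipeline run, the Hugging Face diffusers library wraps all three components behind one call. A sketch of typical usage (the model id and default settings may have changed since this was written):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "A cat astronaut on the moon, oil painting",
    negative_prompt="blurry, low quality, deformed hands",
    num_inference_steps=50,    # reduced step count, as in the DDIM section
    guidance_scale=7.5,        # the CFG dial w
).images[0]
image.save("cat_astronaut.png")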
ControlNet — Giving the Model a Skeleton
Text is a surprisingly imprecise language for describing images. "A person standing with arms raised" says nothing about exactly where the arms are, or the person's height, or the camera angle. For applications that need spatial precision — fashion, architecture, animation — text prompts are not enough.
ControlNet (Zhang & Agrawala, 2023) solved this by adding structural conditioning: pose skeletons, depth maps, edge maps, segmentation masks. The architecture is clever. It clones the entire encoder half of the frozen Stable Diffusion U-Net, creates a trainable copy, and connects the copy back to the frozen original through zero convolutions — convolutional layers whose weights are initialized to all zeros.
Why zero initialization? Because it means the ControlNet contributes nothing to the output at the start of training. The frozen base model works exactly as before. As training progresses, the zero-initialized connections gradually learn to incorporate the spatial signal — edges, poses, depth — into the generation process. The base model's generative knowledge is preserved perfectly; the ControlNet layer learns only to steer it.
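The "zero convolution" itself is nothing exotic. A minimal sketch:

import torch.nn as nn

def zero_conv(channels):
    # A 1x1 convolution whose weights and bias start at exactly zero, so the
    # ControlNet branch contributes nothing to the frozen U-Net at initialization
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv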
Back to our sand castle: if text conditioning is like telling the child "build a castle," ControlNet is like giving the child a blueprint. The child still uses their knowledge of sand and architecture (the frozen base model), but now they know exactly where each turret goes (the spatial conditioning).
In practice, the training data is pairs: (conditioning signal, target image). For edge-based ControlNet, run a Canny edge detector on real images, and train the ControlNet copy to reconstruct the images given the edges plus the original text prompt. For pose, use an off-the-shelf pose estimator to extract skeletons, and train on (skeleton, image) pairs. The approach is modular — you can train ControlNets for any spatial signal, and swap them at inference time.
Consistency Models — One Giant Step
Even with DDIM and DPM-Solver++, diffusion requires at least 10–20 neural network forward passes per image. For interactive applications — real-time image editing, video generation, game textures — even that is too many. The dream: generate a quality image in a single forward pass, like a GAN, but without the training instability.
Consistency Models (Song et al. 2023) take a fundamentally different approach. Instead of training the network to reverse one noise step, they train it to map any point on the entire diffusion trajectory directly to x₀ — the clean image. The defining property is self-consistency: the model's output should be the same whether it starts from heavy noise or light noise, as long as both points lie on the same trajectory.
Formally, if fθ(xt, t) is the consistency model's output at timestep t, then for any two timesteps t and t' on the same trajectory: fθ(xt, t) = fθ(xt', t'). Every point on the path maps to the same destination. This means you can start from pure noise and reach the clean image in one step — or, if you want higher quality, take 2–4 steps by applying the model repeatedly at decreasing noise levels.
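Here's what multi-step sampling looks like as a sketch. Real consistency models are usually parameterized with continuous noise levels rather than the discrete DDPM timesteps used here, so treat this as an illustration of the idea, reusing alpha_bar from earlier:

@torch.no_grad()
def consistency_sample(f, shape, device, noise_levels=(999, 599, 299, 99)):
    x = torch.randn(shape, device=device)
    x0 = f(x, noise_levels[0])                 # one giant step: pure noise to clean estimate
    for t in noise_levels[1:]:
        ab_t = alpha_bar[t]
        # Re-noise the current estimate to a lower noise level, then map to x0 again
        x_t = torch.sqrt(ab_t) * x0 + torch.sqrt(1.0 - ab_t) * torch.randn_like(x0)
        x0 = f(x_t, t)
    return x0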
There are two ways to train consistency models. Consistency distillation starts from a pre-trained diffusion model and distills the trajectory-to-x₀ mapping. Consistency training learns the self-consistency property directly from data, without needing a teacher model. The distillation route typically produces better results (since it starts from a strong model), and Latent Consistency Models (LCM) apply this to the latent diffusion framework, achieving impressive quality in 2–4 steps.
I'm still developing my intuition for why this works as well as it does. In principle, you're asking a single forward pass to do what 1000 sequential passes used to do. That's a 1000× compression of information processing. Yet the quality is surprisingly close to the multi-step version, especially with 4 steps. My best understanding is that the network learns to leverage the trajectory structure — nearby points on the same trajectory carry nearly the same information about x₀, so the consistency constraint is less punishing than it might seem.
Flow Matching and Rectified Flows — The Next Chapter
Everything we've discussed so far treats generation as reversing a noise process — adding noise step by step, then learning to undo it. Flow matching (Lipman et al. 2023, Albergo & Vanden-Eijnden, 2023) asks: what if we skip the whole noise-then-reverse framework and instead train a model to transport noise to data along straight lines?
The idea is startlingly direct. Define a path from every noise sample z to its target data sample x as a simple linear interpolation: xt = (1 − t) · z + t · x. At t = 0, you're at the noise sample. At t = 1, you're at the data. The path is a straight line through the space of images.
A neural network learns the velocity field — the direction and speed to move at any point along this path. The training objective is: given xt at time t, predict the velocity v = x − z. That's it. No noise schedules, no βt, no ᾱt, no reparameterization tricks. The loss is MSE between the predicted velocity and the true velocity.
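The whole objective fits in a few lines. A minimal sketch, reusing the torch and F imports from the training loop and assuming a model that takes (xₜ, t) and returns a velocity of the same shape as x:

def flow_matching_loss(model, x):
    z = torch.randn_like(x)                         # noise endpoint of the path
    t = torch.rand(x.shape[0], device=x.device)     # random time in [0, 1]
    t_exp = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_exp) * z + t_exp * x             # straight-line interpolation
    v_target = x - z                                # constant velocity along the line
    return F.mse_loss(model(x_t, t), v_target)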
Going back to our mountain hiking analogy: diffusion models trace a winding path down a mountainside, shaped by the noise schedule and the stochastic reverse process. Flow matching says: forget the mountain path. Draw a straight line from the summit to the village and learn to walk it.
Rectified flows (Liu et al. 2023) take this further. After training a flow matching model, they "straighten" the learned trajectories through a distillation process: sample pairs (noise, generated image) from the trained model, then retrain on these pairs to make the paths even straighter. Straighter paths mean fewer ODE steps to traverse them, because a numerical ODE solver can cover a straight line in fewer steps than a curved one. After one or two rounds of rectification, the paths are straight enough for 1–4 step generation.
This is what Stable Diffusion 3 uses under the hood. Instead of the traditional DDPM forward-reverse framework, SD3 uses a flow matching objective. The result is simpler training, faster sampling, and — according to Stability AI's benchmarks — better image quality. The field is increasingly moving in this direction, and I wouldn't be surprised if "diffusion model" becomes a historical term, replaced by "flow model," within a few years. Though I've been wrong about these predictions before.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with a sand castle and a child kicking it over — the observation that destruction is free but reconstruction requires knowledge. We formalized that into the forward process, adding Gaussian noise step by step until data becomes static. We discovered the closed-form shortcut that makes training practical. We built the reverse process — a neural network that predicts the noise, trained with nothing more than MSE loss. We walked through DDPM sampling and sped it up with DDIM's deterministic highway. We added text control through classifier-free guidance, compressed everything into latent space to make it shippable, gave the model spatial blueprints through ControlNet, compressed the 1000-step walk into a single leap with consistency models, and finally saw the field evolving toward the straight-line elegance of flow matching.
My hope is that the next time you use Stable Diffusion, or read about DALL·E 3's architecture, or see "CFG scale" in a generation interface, instead of treating it as a magic black box, you'll have a pretty darn good mental model of what's happening under the hood. The sand castle is being rebuilt, grain by grain — or, increasingly, in one confident motion — from pure noise, guided by learned knowledge of what real images look like.
Resources
A curated set of things I found genuinely helpful on this journey:
- The DDPM paper — Ho, Jain, & Abbeel (2020). The O.G. paper that made diffusion work. The math is dense but the algorithm section is gold. arxiv.org/abs/2006.11239
- Lilian Weng's "What are Diffusion Models?" — Wildly helpful blog post that walks through the math with more patience than any paper. If you read one thing beyond this section, read this. lilianweng.github.io
- The Latent Diffusion paper — Rombach et al. (2022). The paper that became Stable Diffusion. Insightful on why latent space makes everything better. arxiv.org/abs/2112.10752
- The Classifier-Free Guidance paper — Ho & Salimans (2022). Short paper, outsized impact. The guidance formula is one of those results that feels obvious in hindsight. arxiv.org/abs/2207.12598
- The Consistency Models paper — Song et al. (2023). One-step generation without adversarial training. The self-consistency property is an elegant idea. arxiv.org/abs/2303.01469
- Yang Song's blog on score-based models — The clearest explanation of the score-matching perspective I've found. Connects everything to SDEs in a way that actually clicks. yang-song.net/blog/2021/score