Nice to Know

Chapter 15: Generative Models
TL;DR

Generative models don't end at GANs, VAEs, and diffusion. There's a whole constellation of techniques orbiting around them — from painting photographs in Van Gogh's style, to generating images one pixel at a time, to modeling data as electric fields. These are the topics that show up in papers, interviews, and late-night rabbit holes. You don't need to implement any of them today, but when you encounter them in the wild, you'll know what they are and why they matter.

Neural Style Transfer

I'll be honest — the first time I saw a photograph rendered in the style of Starry Night, I assumed it was some elaborate Photoshop filter. It wasn't until I read the Gatys et al. paper (2015) that I realized how deceptively clever the trick actually is. It doesn't use a generative model at all. It uses a pre-trained image classifier — and that's what makes it so surprising.

Here's the setup. Take a content image (say, a photograph of a city skyline) and a style image (say, a Van Gogh painting). The goal is to produce a new image that keeps the structure of the city but looks like Van Gogh painted it. Two completely different goals — preserve what is in the image, transform how it looks.

The insight is that a CNN like VGG-19, trained for image classification, already separates these two concerns across its layers. Earlier layers capture textures and patterns — brushstrokes, color palettes, repeating motifs. Deeper layers capture high-level structure — "there's a building here, sky there." Style transfer exploits this separation.

The process works by optimizing an image from scratch (starting from random noise or a copy of the content image) to simultaneously minimize two losses. The content loss measures how different the generated image's deep-layer activations are from the content image's — this keeps the buildings, the skyline, the general layout. The style loss measures something more subtle: how different the texture statistics are between the generated image and the style image. Those "texture statistics" are captured by the Gram matrix — a matrix that records how strongly every pair of feature channels correlates across the image. If Van Gogh's painting has swirling blue-yellow textures, the Gram matrix captures that particular pattern of correlations, and the style loss pushes the generated image to develop the same correlations.

The total loss is a weighted sum: α × content_loss + β × style_loss. Crank up β and you get wild painterly effects. Crank up α and the photo stays almost unchanged. The balance between the two is where the art happens — literally.
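
To make that concrete, here's a minimal sketch of the two losses in PyTorch. The feature maps are assumed to come from a frozen pre-trained network like VGG-19; the layer choices and the α/β weights are illustrative rather than the exact values from the Gatys et al. paper.

```python
import torch
import torch.nn.functional as F

def gram_matrix(features):
    # features: (batch, channels, height, width) activations from one CNN layer
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)
    # Correlation between every pair of feature channels, averaged over positions
    return flat @ flat.transpose(1, 2) / (c * h * w)

def content_loss(gen_feats, content_feats):
    # Match deep-layer activations directly: preserves layout and structure
    return F.mse_loss(gen_feats, content_feats)

def style_loss(gen_feats, style_feats):
    # Match channel-correlation statistics: preserves texture, not layout
    return F.mse_loss(gram_matrix(gen_feats), gram_matrix(style_feats))

# Hypothetical usage, with features taken from a frozen VGG-19:
# total = alpha * content_loss(gen_deep, content_deep) \
#         + beta * sum(style_loss(g, s) for g, s in zip(gen_style_feats, style_feats))
```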

The original approach was slow — several minutes per image, because you're running gradient descent on the pixels themselves. Johnson et al. (2016) fixed this with fast style transfer: train a feedforward network to produce the stylized image in a single forward pass. Once trained for a particular style, it runs in real-time. That's what powers most style-transfer apps on your phone today.

The limitation that matters: style transfer works beautifully for textures and color palettes, but it struggles with structural style. It can make your photo look painted, but it can't make your photo look like a Picasso cubist rearrangement of your face. The Gram matrix captures statistical texture patterns, not spatial rearrangements. That's a fundamentally harder problem, and it's part of what motivated the development of more powerful generative approaches.

Deepfakes and Detection

Deepfakes are what happens when generative models meet human faces with malicious (or at least mischievous) intent. The term originated as the handle of a Reddit user who used autoencoders to swap faces in videos, but it's now the catch-all for any AI-generated face manipulation.

The core technique behind face-swapping deepfakes is surprisingly straightforward. Train two autoencoders that share the same encoder but have different decoders — one for face A, one for face B. The shared encoder learns a common "face representation" space. At inference time, feed face A through the shared encoder, then decode it with face B's decoder. The result is face A's expressions and pose, but face B's identity. The blending is imperfect, so a post-processing pipeline handles color correction, edge smoothing, and alignment to make it seamless.
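
Here's a toy sketch of that shared-encoder, two-decoder structure. The architecture is deliberately minimal and illustrative; real pipelines use deeper networks plus the alignment and blending steps mentioned above.

```python
import torch
import torch.nn as nn

class FaceSwapAutoencoders(nn.Module):
    """One shared encoder, one decoder per identity (toy 64x64 RGB version)."""
    def __init__(self, latent_dim=256):
        super().__init__()
        # Shared encoder: learns a common "face representation"
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, latent_dim),
        )
        # Separate decoders: each reconstructs one identity
        self.decoder_a = self._make_decoder(latent_dim)
        self.decoder_b = self._make_decoder(latent_dim)

    def _make_decoder(self, latent_dim):
        return nn.Sequential(
            nn.Linear(latent_dim, 128 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (128, 16, 16)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x, identity):
        z = self.encoder(x)
        return self.decoder_a(z) if identity == "a" else self.decoder_b(z)

# Training: reconstruct A faces through decoder_a and B faces through decoder_b.
# Swapping at inference: encode a frame of A, then decode it with decoder_b.
```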

GAN-based deepfakes took this further. Models like StyleGAN can generate entirely synthetic faces that never existed, or interpolate between real identities with unnerving smoothness. The realism reached a point where humans couldn't reliably distinguish real from synthetic — which is when the arms race with detection began.

Deepfake detection is its own subfield now, and it's a genuinely hard problem. The approaches fall into a few categories. CNN and Transformer-based detectors are trained on datasets of real and fake media to spot pixel-level artifacts — subtle compression inconsistencies, unnatural skin textures, weird reflections in eyes. Temporal analysis looks at frame-to-frame inconsistencies in videos: blinking patterns that are too regular, head movements that don't quite match the body, facial landmarks that jitter. Biological signal analysis is perhaps the most creative — some detectors look for the subtle skin color changes caused by blood flow (photoplethysmography). Real faces pulse with your heartbeat. Synthetic faces don't. Audio-visual mismatch detectors compare lip movements against spoken words, catching cases where the face and voice were generated by different models.

The uncomfortable truth: detection is always playing catch-up. Every new generation of models produces fewer artifacts, and detectors trained on one generator's output often fail on the next. The field is starting to shift toward provenance-based solutions — cryptographic watermarks embedded at capture time — rather than trying to detect forgeries after the fact. I'm still not sure which approach will win in the long run, and I don't think anyone else is either.

Adversarial Examples for Generative Models

Most discussions of adversarial examples focus on classifiers — adding tiny, invisible perturbations to an image of a panda until a model confidently labels it "gibbon." But generative models have their own adversarial vulnerabilities, and they're arguably more concerning because the outputs are things people look at and trust.

The concept transfers directly. For a classifier, an adversarial example is an input that produces a wrong label. For a generative model, an adversarial example is an input (or a manipulation of the latent space, or a perturbation of the conditioning signal) that produces a wildly wrong or harmful output. Think of a text-to-image model where a carefully crafted prompt embedding — not a natural-language prompt, but a numerical vector designed by optimization — produces an image completely unrelated to what a human would expect.

With diffusion models specifically, researchers have shown that imperceptible perturbations to the initial noise can steer the generation toward specific target images. The model "thinks" it's doing normal denoising, but the carefully chosen starting point funnels it toward a predetermined destination. It's like rigging a marble run — the marble follows physics correctly at every step, but the track was designed to send it where you want.
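
A conceptual sketch of that kind of attack, assuming you have a differentiable sampler generate(noise) that maps initial noise to an image (the function name is a placeholder; in practice such attacks differentiate through a truncated or approximated sampling chain):

```python
import torch

def attack_initial_noise(generate, target_image, noise_shape,
                         eps=0.05, steps=200, lr=0.01):
    """Find a small perturbation of the starting noise that pulls the sample toward a target.

    generate: assumed to be a differentiable map from initial noise to an image.
    """
    base_noise = torch.randn(noise_shape)
    delta = torch.zeros_like(base_noise, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        sample = generate(base_noise + delta)              # run the (truncated) sampler
        loss = torch.mean((sample - target_image) ** 2)    # pull the output toward the target
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                        # keep the perturbation imperceptible

    return base_noise + delta.detach()
```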

There's also the defensive angle. Adversarial purification flips the script — using a diffusion model's denoising process to remove adversarial perturbations from images before feeding them to a classifier. The idea is that the denoising process destroys the carefully crafted perturbation while preserving the actual content. It works surprisingly well, which tells us something interesting: the denoising process implicitly knows what "natural" looks like and pushes images back toward the data manifold.
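
The purification step itself is tiny, assuming access to a pre-trained diffusion model's forward-noising and reverse-denoising routines. The names add_noise and denoise_from below are placeholders, not any particular library's API:

```python
def purify(x_adv, add_noise, denoise_from, t_star=100):
    """Diffuse a (possibly adversarial) image part-way, then denoise it back.

    add_noise(x, t): runs the forward noising process up to timestep t.
    denoise_from(x_t, t): runs the learned reverse process from timestep t back to 0.
    The added noise drowns out the crafted perturbation; the denoiser pulls the image
    back toward the natural-image manifold before it ever reaches the classifier.
    """
    x_t = add_noise(x_adv, t_star)
    return denoise_from(x_t, t_star)
```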

I'm still developing my intuition for why some adversarial attacks transfer between generative models and some don't. The transferability question is one of the more active and unsettled areas in this space.

Score Matching and Langevin Dynamics

If you've worked through the diffusion models section of this chapter, you've already met score matching without knowing it. This is where we pull back the curtain on what's really going on mathematically underneath diffusion.

The core problem in generative modeling is learning a probability distribution p(x) from data. The trouble is that computing p(x) requires a normalizing constant — you need to integrate the unnormalized density over all possible x to make the probabilities sum to one. For high-dimensional data like images, that integral is intractable. You can't compute it. Full stop.

Score matching (Hyvärinen, 2005) sidesteps this entirely by learning the score function — ∇ₓ log p(x), the gradient of the log-density with respect to the data. The key insight: when you take the gradient of log p(x), the normalizing constant disappears. It's a constant, and the gradient of a constant is zero. So you can train a neural network s_θ(x) to approximate ∇ₓ log p(x) without ever needing to compute p(x) itself.

Once you have the score function, you can generate samples using Langevin dynamics — an iterative process that starts from random noise and walks toward high-probability regions by following the score. Each step nudges the sample in the direction of increasing log-probability, plus a small random perturbation to maintain diversity. Think of it like a marble rolling downhill on the negative-log-probability landscape, with a little bit of random jiggling to prevent it from getting stuck in the nearest valley.
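
In code, unadjusted Langevin dynamics is a one-line update inside a loop. A minimal sketch, assuming score_fn(x) is a trained network approximating ∇ₓ log p(x) (step size and step count are illustrative):

```python
import torch

def langevin_sample(score_fn, shape, n_steps=1000, step_size=1e-4):
    x = torch.randn(shape)  # start from random noise
    for _ in range(n_steps):
        # Nudge uphill in log-probability, plus a random kick to maintain diversity
        x = x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * torch.randn_like(x)
    return x
```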

Song and Ermon (2019) made this practical by training score networks at multiple noise levels — not one score function, but a whole family of them, one for each noise scale. Start with heavily corrupted data (easy to model), then progressively denoise using the score function at each level. If this sounds exactly like what diffusion models do... that's because it is. The denoising score matching framework and the DDPM framework (Ho et al., 2020) turned out to be mathematically equivalent, derived from different starting points but arriving at the same destination.
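
The training side, denoising score matching across multiple noise scales, is also compact. A simplified sketch, assuming score_net(x, sigma) is a noise-conditional score network and sigmas is your chosen schedule of noise levels:

```python
import torch

def dsm_loss(score_net, x, sigmas):
    """Denoising score matching, averaged over randomly chosen noise levels."""
    idx = torch.randint(len(sigmas), (x.shape[0],))          # one noise level per example
    sigma = sigmas[idx].view(-1, *([1] * (x.dim() - 1)))
    x_noisy = x + sigma * torch.randn_like(x)
    # For Gaussian corruption, the score of the perturbed point is -(x_noisy - x) / sigma^2
    target = -(x_noisy - x) / sigma ** 2
    pred = score_net(x_noisy, sigma)
    # Weight by sigma^2 so every noise level contributes on a comparable scale
    return torch.mean(sigma ** 2 * (pred - target) ** 2)
```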

That's the bridge. Energy-based models define an energy function. The score is the gradient of the negative energy. Score matching learns that gradient. Langevin dynamics uses it to sample. Diffusion models do this at multiple noise scales. It's all one story told from different angles.

Poisson Flow Generative Models

Most generative models borrow their mathematics from probability theory or thermodynamics. Poisson Flow Generative Models (PFGM), introduced in 2022, borrow from electrostatics instead. It's one of those ideas that makes you pause and wonder why nobody tried it sooner.

The analogy goes like this. Imagine every data point in your training set is a tiny electric charge sitting on a flat surface. Those charges create an electric field that extends into the space above the surface. If you drop a test particle from far away and let it fall along the field lines, it will naturally gravitate toward regions where the charges are densely packed — that is, toward regions of high data density. That's your generative process. Start from random noise (far from the data), follow the field, arrive at a realistic sample.

Formally, PFGM embeds the data in a (d+1)-dimensional space and solves the Poisson equation — the fundamental equation of electrostatics — to compute the vector field that guides generation. A neural network learns this vector field, and generation proceeds by integrating (flowing) along it from a random starting point on a spherical shell down to the data hyperplane.
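
Here's a toy illustration of the idea, computing the empirical field directly from the training points. The real method trains a network to predict this field and integrates a more carefully parameterized ODE; everything below is simplified for intuition:

```python
import torch

def empirical_field(x_aug, data_aug):
    """Electric-type field at x_aug created by unit charges at the augmented data points.

    In D dimensions a point charge's field falls off as 1/r^(D-1) and points radially,
    i.e. E(x) is proportional to (x - y) / ||x - y||^D, summed over data points y.
    """
    D = x_aug.shape[-1]
    diff = x_aug - data_aug                       # (num_data, D)
    dist = diff.norm(dim=-1, keepdim=True) + 1e-8
    return (diff / dist ** D).mean(dim=0)         # (D,)

def toy_pfgm_sample(data, n_steps=500, step=0.05, z_init=20.0):
    """Toy sampler: start far above the data plane, walk back along the field lines."""
    d = data.shape[-1]
    data_aug = torch.cat([data, torch.zeros(len(data), 1)], dim=-1)  # data lives at z = 0
    x = torch.cat([torch.randn(d), torch.tensor([z_init])])          # start far from the data
    for _ in range(n_steps):
        E = empirical_field(x, data_aug)
        x = x - step * E / (E.norm() + 1e-8)      # move toward the charges (against the field)
        if x[-1] <= 0:                            # reached the data hyperplane
            break
    return x[:d]
```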

The practical appeal: PFGM can achieve high-quality samples with fewer integration steps than vanilla diffusion models, because the field lines provide a more direct path from noise to data than the stochastic back-and-forth of diffusion. The theoretical appeal: it offers a completely different mathematical lens on the same problem, which means techniques that are hard in the diffusion framework might be natural in the Poisson framework, and vice versa.

This is still a relatively young approach. I haven't seen it deployed at scale in production systems yet, but the physics-inspired framing is elegant enough that it's worth keeping on your radar. The marble-rolling-downhill analogy from score matching works here too — except now the hill is shaped by Coulomb's law instead of thermodynamic noise.

Neural ODEs for Generation

In a standard neural network, each layer is a discrete step: take the hidden state, transform it, pass it forward. Neural ODEs (Chen et al., 2018) replace that sequence of discrete steps with a continuous transformation described by an ordinary differential equation. Instead of h₁ = f(h₀), h₂ = f(h₁), ..., you define dh/dt = f(h(t), t, θ) and let a numerical ODE solver evolve the hidden state from time 0 to time T. The network's "depth" becomes continuous — there are no layers, only a smooth trajectory through state space.

For generation, this becomes a tool called Continuous Normalizing Flows (CNFs). Traditional normalizing flows require each layer to be invertible and need to compute the determinant of the Jacobian at each step — an expensive operation that constrains what architectures you can use. CNFs sidestep this. Because the transformation is defined by an ODE, it's automatically invertible (run the ODE backward), and the change in log-density over time depends only on the trace of the Jacobian, not the full determinant. The trace is much cheaper to compute, and it doesn't constrain your network architecture.
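
A minimal sketch of the CNF bookkeeping, using fixed-step Euler integration and an exact Jacobian trace so it stays self-contained. A real implementation would use an adaptive ODE solver and, in high dimensions, a stochastic (Hutchinson) trace estimator:

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """dh/dt = f(h, t): a small MLP defines the continuous-time dynamics."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, h, t):
        t_col = torch.full_like(h[:, :1], t)
        return self.net(torch.cat([h, t_col], dim=-1))

def cnf_transform(func, h0, n_steps=100, t0=0.0, t1=1.0):
    """Integrate the state and the change in log-density together.

    d log p(h(t)) / dt = -trace(df/dh), so the log-density is tracked alongside h.
    """
    h, logdet = h0, torch.zeros(h0.shape[0])
    dt = (t1 - t0) / n_steps
    for i in range(n_steps):
        t = t0 + i * dt
        # Exact Jacobian trace (fine for 2-D toy data; use Hutchinson's estimator at scale)
        def f_single(x):
            return func(x.unsqueeze(0), t).squeeze(0)
        traces = torch.stack([
            torch.trace(torch.autograd.functional.jacobian(f_single, h[j]))
            for j in range(h.shape[0])
        ])
        h = h + dt * func(h, t)          # Euler step for the state
        logdet = logdet - dt * traces    # accumulate the log-density change
    return h, logdet
```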

The trade-off is that ODE solvers introduce their own computational cost, and the number of function evaluations depends on the complexity of the trajectory. Simple transformations are fast; complex ones require many solver steps. There's also a memory advantage — the adjoint method lets you compute gradients with constant memory regardless of the number of solver steps, because you don't need to store intermediate states.

Neural ODEs connect to score-based models in a beautiful way. Song et al. (2021) showed that the diffusion process can be described as a stochastic differential equation (SDE), and the reverse process — generation — can be cast as either an SDE or a probability flow ODE. The ODE version produces identical samples to the SDE version but without the stochastic noise, trading randomness for a deterministic trajectory. That probability flow ODE is a continuous normalizing flow. The ideas converge.

Discrete Diffusion for Text

Here's something that nagged at me for a while. Diffusion models work by gradually adding Gaussian noise to data and then learning to reverse that process. Gaussian noise makes sense for continuous data like images — pixel values are real numbers, and adding a little noise to a real number gives you another real number. But what about text? Words are discrete. You can't add a little bit of noise to the word "cat" and get something between "cat" and "car." There's no "between" in discrete space.

Discrete diffusion models solve this by replacing Gaussian noise with discrete corruption processes. Instead of adding noise, you mask or randomly substitute tokens. The forward process gradually replaces more and more tokens with a [MASK] token (or a random token from the vocabulary), and the reverse process learns to reconstruct the original tokens from the corrupted sequence.

If this sounds like BERT's masked language modeling — it is, but with principled probabilistic foundations. D3PM (Discrete Denoising Diffusion Probabilistic Models) formalized this in 2021, defining discrete transition matrices that play the same role as the Gaussian noise schedule in continuous diffusion. MDLM (Masked Diffusion Language Models, 2024) simplified the framework dramatically. The training loss turns out to be a weighted mixture of masked language modeling losses at different corruption levels. The weighting comes from the diffusion math, but the actual training looks remarkably familiar.
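
A simplified sketch of the mask-based corruption and the resulting loss, in the spirit of MDLM. The schedule and the 1/t weighting below are stand-ins for the exact diffusion-derived quantities:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assume token id 0 is reserved for [MASK]

def masked_diffusion_loss(model, tokens):
    """One training step of a simplified masked diffusion language model.

    tokens: (batch, seq_len) integer token ids.
    model(corrupted) -> (batch, seq_len, vocab) logits.
    """
    batch = tokens.shape[0]
    # Sample a corruption level t in (0, 1]: the probability that each token is masked
    t = torch.rand(batch, 1).clamp(min=1e-3)
    mask = torch.rand_like(tokens, dtype=torch.float) < t            # positions to corrupt
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

    logits = model(corrupted)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")  # per-token loss
    # Only masked positions contribute; the 1/t weight keeps different corruption
    # levels on a comparable scale (a simplified stand-in for the diffusion weighting)
    return ((ce * mask.float()) / t).mean()
```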

The practical result: MDLM achieves perplexity scores that are competitive with autoregressive language models — a first for diffusion-based text generation. And it gets a capability that autoregressive models don't have: it can fill in tokens in any order, not left-to-right. That's useful for tasks like text infilling, constrained generation, and editing, where you want to modify the middle of a passage while keeping the beginning and end fixed.

The limitation is that generating text from scratch still typically requires many denoising steps (10–1000), while an autoregressive model generates each token in one step. Semi-autoregressive variants help — generate a few tokens in parallel, then refine — but the speed gap hasn't fully closed. For most practical text generation tasks, autoregressive models still win on throughput. But discrete diffusion opens doors that autoregressive models can't.

Autoregressive Image Models: PixelCNN to VQGAN

Before diffusion models took over, there was a different lineage of image generation that asked a surprisingly simple question: what if we generated images the same way we generate text — one piece at a time, predicting each piece based on everything that came before it?

PixelCNN (van den Oord et al., 2016) took this literally. It models the probability of each pixel conditioned on all previously generated pixels, using masked convolutions to ensure causality — the prediction for pixel (i, j) can only see pixels above and to the left. The model scans across the image like reading a book: left to right, top to bottom. At each position, it predicts a probability distribution over possible color values, samples from it, and moves on.
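
The mechanism that enforces that reading order is the masked convolution. A minimal sketch of the type-A mask, the variant used in the first layer where the current pixel itself must be hidden (later layers use type B, which leaves the center visible):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is zeroed so each pixel only sees pixels above and to its left."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)
        # Zero out the center pixel (type A) or everything right of it (both types)...
        mask[:, :, kh // 2, kw // 2 + (1 if mask_type == "B" else 0):] = 0
        # ...and every row below the center
        mask[:, :, kh // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask          # enforce causality before every convolution
        return super().forward(x)

# Example: first layer of a toy PixelCNN over grayscale images
layer = MaskedConv2d("A", in_channels=1, out_channels=64, kernel_size=7, padding=3)
```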

The results were sharp and had legitimate diversity — no mode collapse. But the generation speed was brutal. For a 256×256 image with 3 color channels, you're making about 200,000 sequential predictions. Each one depends on the previous, so you can't parallelize. Training was parallel (teacher forcing), but generation was painfully serial.

VQ-VAE (van den Oord et al., 2017) changed the game by introducing a trick: don't model pixels directly. Instead, train an autoencoder with a discrete bottleneck — a codebook of learned vectors. The encoder maps image patches to the nearest codebook entry, producing a grid of discrete codes much smaller than the original image. A decoder reconstructs the image from these codes. Then train an autoregressive model (like PixelCNN) over the codes, not the pixels. The code grid might be 32×32 instead of 256×256 — that's 64 times fewer sequential predictions.
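
The discrete bottleneck is simpler than it sounds: snap each encoder output to its nearest codebook vector, and use a straight-through estimator so gradients still flow through the snap. A minimal sketch, with the codebook size and commitment weight chosen arbitrarily:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta

    def forward(self, z_e):
        # z_e: (batch, n_positions, code_dim) encoder outputs
        flat = z_e.reshape(-1, z_e.shape[-1])
        dists = torch.cdist(flat, self.codebook.weight)        # distance to every codebook entry
        codes = dists.argmin(dim=-1).reshape(z_e.shape[:-1])   # index of the nearest entry
        z_q = self.codebook(codes)                             # quantized vectors, same shape as z_e

        # Codebook and commitment losses pull the codes and the encoder toward each other
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator: copy gradients from z_q back to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes, vq_loss
```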

VQGAN (Esser et al., 2021) refined this further by adding adversarial training to the autoencoder. The GAN loss forces the decoder to produce images that look realistic at the texture level, not blurry reconstructions. The codebook learns richer, more expressive entries. And the autoregressive prior is upgraded from PixelCNN to a Transformer, which handles long-range dependencies across the code grid better. This tokenize-then-generate recipe is essentially what powered DALL·E 1, which used its own discrete VAE as the image tokenizer: turn images into a vocabulary of visual "words," then generate those words with a Transformer, treating image generation as a sequence modeling problem.

The evolutionary arc is clean: PixelCNN showed autoregressive generation works for images but is slow. VQ-VAE compressed the problem into a smaller discrete space. VQGAN made the compressed space high-fidelity with adversarial training and paired it with Transformers. Diffusion eventually outperformed all of these for raw image quality, but the VQ-VAE/VQGAN tokenization idea survived — it's how many multimodal models handle images today, encoding them as discrete tokens that live in the same "language" as text.

The Creative AI Tools Ecosystem

Everything we've discussed in this chapter — GANs, VAEs, diffusion models, autoregressive generation — has converged into a rapidly evolving ecosystem of creative tools that are reshaping how people make things. It's worth knowing the landscape, not because you need to use every tool, but because interview conversations about generative models increasingly end with "...and how is this used in practice?"

For image generation, the major players are Midjourney (known for high aesthetic quality and stylized outputs), DALL·E (OpenAI, integrated with ChatGPT), and Stable Diffusion (open-source, the backbone of the open ecosystem). Stable Diffusion's open weights spawned an entire cottage industry of fine-tuned models, LoRA adapters, ControlNet conditioning, and community-driven innovation on platforms like Civitai and Hugging Face. The distinction matters: Midjourney and DALL·E are walled-garden services; Stable Diffusion is infrastructure you can run, modify, and build on.

For video generation, Runway (Gen-2, Gen-4) leads among creative professionals, offering AI-powered video generation, inpainting, and motion editing. Pika and Sora (OpenAI) are newer entrants pushing toward longer, more coherent video generation. The underlying architectures vary — some use diffusion in pixel space, others in latent space, some use autoregressive methods — but the user-facing experience converges toward "describe what you want, get a video."

For audio and music, Suno and Udio generate full songs from text descriptions. Most use a two-stage architecture: compress audio into discrete tokens with a neural codec (like Meta's EnCodec), then generate those tokens with a language model or diffusion model. ElevenLabs dominates voice synthesis — cloning voices, generating speech with emotional range, real-time dubbing across languages.

The emerging pattern is a multimodal pipeline: sketch a concept in Midjourney, animate it in Runway, add a voiceover with ElevenLabs, score it with Suno. What used to require a studio, a team, and months of work can now be prototyped by one person in an afternoon. That's not hyperbole — it's the actual workflow of solo creators and small studios today.

The thing I find most interesting about this ecosystem isn't any individual tool. It's that the gap between "understands generative models" and "ships creative products with them" has collapsed to almost nothing. Five years ago, using a GAN required writing training loops. Today, the models are infrastructure. The value has shifted to knowing which model to use, how to control it, and when its outputs need human judgment. That's a different kind of expertise, and it's the one the industry is actually hiring for.

What You Should Now Be Able To Do

✅ Checklist
  • Explain how neural style transfer separates content and style using CNN activations and Gram matrices
  • Describe how face-swapping deepfakes work (shared encoder, separate decoders) and name at least three detection approaches
  • Explain what adversarial examples mean for generative models — not classifiers, but generators — and what adversarial purification does
  • Articulate the score matching → Langevin dynamics → diffusion connection in your own words
  • Describe PFGM's electrostatics analogy and why it might offer faster sampling than diffusion
  • Explain what Neural ODEs and Continuous Normalizing Flows buy you over discrete normalizing flows
  • Describe how discrete diffusion adapts the diffusion framework to text (masking instead of Gaussian noise)
  • Trace the PixelCNN → VQ-VAE → VQGAN evolutionary arc and explain what each step added
  • Name the major creative AI tools across image, video, audio, and voice — and describe the emerging multimodal pipeline pattern