Evaluating Generative Models

I avoided digging into generative model evaluation for an embarrassingly long time. Every paper I read would report "FID = 3.5" like that settled something, and I'd nod along as if I understood what that number meant, or why anyone should care. Meanwhile, I couldn't answer the most basic question: if two models generate different images from the same prompt, which one is better? And what does "better" even mean when there's no right answer? Finally, the discomfort of nodding along blindly grew too much. Here is that dive.

Evaluating generative models — GANs, diffusion models, VAEs, and the rest — has been one of the thorniest open problems in machine learning since the field started generating images in earnest around 2014. The Fréchet Inception Distance (FID) was introduced by Heusel et al. in 2017 and quickly became the default yardstick for image generation quality. Since then, a small ecosystem of complementary metrics has grown up: Inception Score, CLIP Score, LPIPS, precision/recall, and the perennial gold standard — asking a human to look at the image and tell you if it's any good.

Before we start, a heads-up. We're going to walk through some statistics and a bit of information theory, and we'll touch on how neural network feature spaces work. You don't need any of that background beforehand; we'll build each idea from scratch, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

The problem no one warned you about
A tiny art gallery (our running example)
Comparing distributions, not images
FID: the math, the intuition, the flaws
Inception Score — the metric that came first and aged worst
CLIP Score — when faithfulness matters
LPIPS — measuring what your eyes measure
Precision and recall for generators
Rest stop
The likelihood trap
Mode collapse — how to catch a generator cheating
The diversity–quality tradeoff
Asking humans (and doing it right)
Goodhart's law, or: when the metric becomes the enemy
Wrapping up
Resources

The Problem No One Warned You About

With a classifier, evaluation is almost trivially clear. The model predicts "cat." The label says "dog." The model is wrong. Count the rights and wrongs, and you have accuracy. It's so straightforward that we barely think about it.

Generative models break this entirely. Suppose you ask a model to generate "a watercolor painting of a lighthouse at sunset." The model produces an image. Is it correct? There are infinitely many valid watercolor lighthouses at sunset. There's no ground truth to compare against. The question isn't "is this right?" — it's "is this good?"

"Good" turns out to have at least three dimensions that fight with each other. The first is quality: does each individual sample look crisp, coherent, and realistic? The second is diversity: does the model produce a wide variety of outputs, or does it keep generating the same lighthouse over and over? The third is faithfulness: does the image actually depict what you asked for?

Here's the tension. You can achieve perfect quality by memorizing ten beautiful lighthouse paintings from your training set and replaying them — but diversity is zero. You can achieve perfect diversity by outputting random noise — every output is unique, but quality is zero. And you can produce a stunning, photorealistic image of a horse at dawn — high quality, wonderful diversity, zero faithfulness to your lighthouse prompt. Any metric that tries to capture "how good is this generator?" has to somehow balance these three forces. That's what makes this hard.

A Tiny Art Gallery

To make all of this concrete, let's build a running example that we'll return to throughout. Imagine you're running a tiny art gallery that commissions AI-generated paintings. You have exactly five real paintings in your collection — your "ground truth" of what good art looks like. And you're evaluating three different AI artists (generative models) that each produce five paintings for you.

Artist A produces five paintings that are stunningly detailed and technically perfect, but they're all nearly identical — the same landscape, the same color palette, the same composition. Every visitor says "wow, that's beautiful" but also "didn't I already see this one?"

Artist B produces five wildly different paintings — different styles, different subjects, different moods. But each one is messy, with blurry edges, inconsistent lighting, and visible artifacts. Visitors say "interesting variety" but wouldn't buy any of them.

Artist C produces five paintings that are reasonably good-looking and reasonably diverse, though none are as jaw-dropping as Artist A's best, and none are as adventurous as Artist B's range. Visitors find them... satisfying.

Which artist is best? That depends on what you value. And that's the entire problem of generative model evaluation in miniature. We'll keep coming back to these three artists as we explore each metric, and you'll see that different metrics would crown different winners.

Comparing Distributions, Not Images

The key insight that makes automatic evaluation even possible is this: instead of comparing individual images (which has no clear "right answer"), we compare collections of images. More precisely, we compare the statistical distribution of generated images against the statistical distribution of real images.

But comparing raw pixel distributions is useless. Two photos of the same cat, shifted one pixel to the right, are statistically very different in pixel space but perceptually identical. What we need is a feature space that captures how humans perceive images — something that says "this is a face" or "this has the texture of fur" rather than "pixel (127, 42) has value 0.73."

This is where a pretrained neural network comes in. Take a network like InceptionV3, trained to classify ImageNet images into 1,000 categories. It learned to extract rich, hierarchical features along the way — edges in early layers, textures in middle layers, objects and scenes in later layers. If we strip off the final classification layer, what remains is a 2,048-dimensional feature vector for each image — a kind of "perceptual fingerprint" that captures what the image is about, not what its pixels happen to be.

Now we can feed our real images through Inception and get a cloud of 2,048-dimensional points. Feed our generated images through and get another cloud. The question becomes: how similar are these two clouds? That's a well-defined statistical question. And it's the question that FID answers.
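
To make the feature-cloud idea concrete, here is a minimal sketch of extracting those 2,048-dimensional vectors, assuming a recent torchvision. It's illustrative only: official FID implementations use a specific ported Inception checkpoint, so numbers built on this won't be comparable to published ones.

import torch
from torchvision import models, transforms

# Minimal sketch: strip InceptionV3's classifier head so the network
# returns the 2,048-d feature vector instead of 1,000 class logits.
model = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),                    # Inception's input size
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def features(images):
    # images: float tensor of shape (N, 3, H, W), values in [0, 1]
    return model(preprocess(images))                  # (N, 2048)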

FID: The Math, the Intuition, the Flaws

I'll be honest — when I first encountered the FID formula, I stared at it for a while, thinking it was more complicated than it needed to be. Then I understood what it was actually doing, and I felt the opposite: it's an almost elegant compression of a very hard problem into something you can compute in a few lines of code. Let me try to transmit that shift.

We have our two clouds of feature vectors — one from real images, one from generated. FID makes one big simplifying assumption: each cloud is roughly shaped like a multivariate Gaussian — a bell curve, but in 2,048 dimensions. A Gaussian is fully described by its mean (the center of the cloud) and its covariance matrix (the shape and spread of the cloud). So for each set of images, we compute:

μ_real and Σ_real — the mean vector and covariance matrix of the real image features.

μ_gen and Σ_gen — the same for the generated image features.

The Fréchet distance (also called the Wasserstein-2 distance for Gaussians) between these two distributions is:

FID = ||μ_real - μ_gen||² + Tr(Σ_real + Σ_gen - 2·(Σ_real · Σ_gen)^(1/2))

There are two parts, and each tells a different story. The first part, ||μ_real - μ_gen||², is the squared distance between the centers of the two clouds. If your generated images are "about different things" than the real images — say, you trained on faces but you're generating landscapes — the centers will be far apart, and this term blows up. Think of it as the "are you even in the right ballpark?" term.

The second part involves the covariance matrices and is harder to visualize, but here's the intuition. Imagine both clouds are 2D ellipses instead of 2,048-dimensional blobs. The covariance matrix describes the size, shape, and orientation of each ellipse. If the generated ellipse is the same size, shape, and orientation as the real one, this term goes to zero. If the generated ellipse is too small (mode collapse — the generator is only exploring a narrow region), or too big (the generator is spraying features all over the place), or tilted in the wrong direction (the generator has learned wrong feature correlations), this term grows.

Lower FID is better. FID = 0 means the two Gaussians are identical — the generated distribution is indistinguishable from the real one in Inception feature space. State-of-the-art models on ImageNet 256×256 now achieve FID around 1.5–2.0. An FID of 50 means the samples are visibly poor.
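
The formula translates almost line for line into numpy and scipy. A minimal sketch, assuming feats_real and feats_gen are the (N, 2048) feature arrays described above:

import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    # Fit a Gaussian to each feature cloud.
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the covariance product; numerical error can
    # leave a tiny imaginary component, which we discard.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.trace(sigma_r + sigma_g - 2.0 * covmean)
    return mean_term + cov_term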

Let's revisit our art gallery. Artist A (beautiful but identical paintings) would have a low mean distance — the paintings look "real" in feature space — but the covariance term would be large because the generated cloud is tiny compared to the real one. All five feature vectors cluster tightly. Artist B (diverse but messy) might have a large mean distance because blurry, artifact-ridden images land in different regions of feature space than clean ones. Artist C (balanced) would likely have the best FID — not perfect on either dimension, but the best compromise.

And that's exactly what FID is: a compromise metric. It captures both quality and diversity in a single number, which is its great strength and its fundamental limitation. It can't tell you which problem you have — it conflates "your images are ugly" with "your images aren't varied enough" into the same big number.

⚠️ FID's Practical Gotchas

The Gaussian assumption matters. Inception features aren't truly Gaussian, so FID is an approximation of an approximation. You need roughly 50,000 generated images for a stable FID estimate — with 1,000 images, your score can swing by ±10 between runs. And the preprocessing pipeline matters more than you'd think: resizing images with bilinear vs. bicubic interpolation before feeding them to Inception can shift FID by several points. The clean-fid library was built specifically to standardize all of this. Use it.
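
If you just want a number you can trust, usage is about as short as it gets (the folder paths are placeholders):

from cleanfid import fid  # pip install clean-fid

score = fid.compute_fid("path/to/real_images", "path/to/generated_images")
print(f"FID: {score:.2f}")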

Inception Score — The Metric That Came First and Aged Worst

Before FID, there was Inception Score (IS), introduced by Salimans et al. in 2016. It's one of those ideas that sounds clever the first time you hear it and then slowly unravels as you think harder.

The idea has two halves. First, feed each generated image through InceptionV3, which produces a probability distribution over 1,000 ImageNet classes. If the image is clear and recognizable — a sharp photo of a dog — the classifier should be confident: the output distribution p(y|x) should be peaked, with most probability mass on one class. Low entropy means the image "looks like something." That's the quality signal.

Second, look at the marginal distribution p(y) across all generated images — the average of all those per-image predictions. If the model is generating a diverse set of images, this marginal should be spread out across many classes. High entropy means variety. That's the diversity signal.

IS combines them with a KL divergence:

IS = exp(E_x[KL(p(y|x) || p(y))])

Higher is better. Real ImageNet images score around 250. A good GAN circa 2018 scored 50–100.
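
The whole metric fits in a few lines. A sketch, assuming probs holds the InceptionV3 softmax outputs for your generated images; official implementations also average the score over ten splits of the samples, which I omit here:

import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (N, 1000) softmax outputs, one row per generated image
    marginal = probs.mean(axis=0)  # p(y), the average prediction
    # Per-image KL(p(y|x) || p(y)), then average and exponentiate.
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))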

Here's where it falls apart. IS never compares against real images. It only looks at the generated images in isolation. A model that produces perfect, diverse ImageNet images but has absolutely nothing to do with your actual training data would score beautifully. It's like a restaurant critic who rates food without ever comparing it to what was ordered.

Worse, the Inception classifier was trained on ImageNet's 1,000 specific classes. If you're generating faces, medical scans, satellite imagery, or anything else outside those categories, the classifier has no idea what it's looking at. Asking it to judge your face generator is like asking a sommelier to judge a cocktail competition.

And there's a subtler problem. If your model generates 50,000 different-looking dogs — golden retrievers, poodles, huskies — but nothing else, IS might still report a decent score. Each dog image is confidently classified (quality ✓), and the marginal might still spread across several dog-related classes (partial diversity ✓). The complete absence of cats, cars, and everything else goes unpenalized.

I'm still occasionally surprised to see IS reported in new papers. It persists as a historical convention, and it's fast to compute, but treat it as a quick sanity check, never as your primary metric.

CLIP Score — When Faithfulness Matters

FID and IS both miss something critical for modern text-to-image models like Stable Diffusion and DALL·E: they don't measure whether the generated image matches the text prompt. You could generate the world's most photorealistic image and score brilliantly on FID, but if you asked for "a cat sitting on the moon" and got a horse on a beach, that's a failure.

CLIP Score addresses this directly. CLIP (Contrastive Language-Image Pre-training, from OpenAI) was trained on hundreds of millions of text-image pairs to map both text and images into a shared embedding space where similar concepts land near each other. To compute a CLIP Score, you encode the text prompt and the generated image separately, then measure the cosine similarity between the two vectors:

CLIP Score = cosine_similarity(CLIP_image(generated_image), CLIP_text(prompt))

Higher means better alignment. Scores above 0.30 generally indicate strong text-image correspondence. Below 0.20 and the image is probably ignoring the prompt.
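
A sketch using the Hugging Face port of CLIP; the checkpoint name is one public option, not the only choice, and image is assumed to be a PIL image:

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image, prompt):
    inputs = processor(text=[prompt], images=image, return_tensors="pt")
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()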

Back to our art gallery: imagine you commissioned all three artists to paint "a stormy sea crashing against a red lighthouse." Artist A might produce five identical but perfectly rendered lighthouses — high CLIP Score for each, since they all clearly match the description. Artist B might produce wildly varied scenes, but one of them is a desert and another is an abstract splatter — low CLIP Scores for the misfires. Artist C might hit the prompt reasonably well each time, with moderate CLIP Scores across the board.

The catch is that CLIP Score measures semantic alignment but is completely blind to image quality. A blurry, low-resolution blob that vaguely resembles a lighthouse in a storm can outscore a photorealistic painting that renders the lighthouse as slightly pink instead of red. CLIP was also trained on internet data with all its biases — it may score certain demographics, cultures, or visual styles differently. And most insidiously: if your generation model used CLIP-guided optimization during sampling (as many early diffusion models did), the CLIP Score becomes circular. The model was trained to maximize exactly the metric you're now using to evaluate it. That's like grading a student with the same test they practiced on.

LPIPS — Measuring What Your Eyes Measure

Sometimes you don't need to compare distributions. You need to ask a simpler question: how different do these two specific images look?

This comes up constantly in image-to-image tasks: super-resolution (does the upscaled image look like the original?), inpainting (does the filled-in region match the surrounding context?), style transfer (has the content been preserved?). Traditional pixel-based metrics like PSNR and SSIM are notoriously bad at capturing perceptual similarity. Two images can have identical PSNR but look wildly different to a human, because PSNR treats every pixel error equally — whether it's in a critical facial feature or in an irrelevant background texture.

LPIPS (Learned Perceptual Image Patch Similarity), introduced by Zhang et al. in 2018, solves this by exploiting the same insight that powers FID: deep neural network features capture what humans actually perceive. Feed both images through a pretrained network (VGG, AlexNet), extract feature maps at several layers, compute the channel-wise L2 distance at each layer, weight those distances by learned coefficients, and sum everything up. The weights were calibrated against a large dataset of human perceptual judgments — actual people saying "these two images look the same" or "these look different."

Lower LPIPS means more perceptually similar. The paper behind it, titled "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric," is one of those rare cases where the title is not hyperbolic. LPIPS correlates with human similarity judgments far better than PSNR or SSIM, and it's become the default perceptual distance metric for paired image comparisons.
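
The reference implementation ships as a pip package, and using it is pleasantly boring. A sketch with placeholder images (the library expects inputs scaled to [-1, 1]):

import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')  # AlexNet backbone; 'vgg' is the other common choice

img0 = torch.rand(1, 3, 256, 256) * 2 - 1  # placeholder images in [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1
distance = loss_fn(img0, img1)             # lower = more perceptually similar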

The limitation is the "paired" part. LPIPS compares two specific images — it's not a distribution metric. You can't use it to evaluate an unconditional image generator that produces images from scratch with no reference. For that, you still need FID or its cousins.

Precision and Recall for Generators

FID gives you one number. That's convenient but frustrating — when FID goes up, is it because quality degraded or diversity dropped? You can't tell. In 2019, Kynkäänniemi et al. proposed a fix: split the evaluation into two separate numbers, borrowing the familiar language of precision and recall from classification, but giving them entirely new meanings for generative models.

The ideas map beautifully onto our art gallery. Precision asks: of all the generated paintings, what fraction look like they could plausibly be real paintings? This is a quality measurement. If Artist A generates five paintings and all five are realistic enough to fool a gallery visitor, precision is 1.0. If Artist B generates five and only two look real, precision is 0.4.

Recall asks the opposite question: of all the real paintings in the gallery's collection, what fraction are "covered" by the generated set? This is a diversity measurement. If the gallery has paintings of landscapes, portraits, and abstracts, and the generator only produces landscapes, recall is low — it's missing whole categories. Artist A, with five identical landscapes, has low recall. Artist B, despite being messy, might cover more ground.

In practice, both are computed in Inception feature space using nearest-neighbor distances. The technical details involve building a manifold estimate around each set of points and measuring overlap, but the intuition is what matters: precision catches quality failures, recall catches diversity failures. FID conflates the two into one number; precision and recall disentangle them.
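
Here's a simplified sketch of the mechanics, assuming feats_real and feats_gen are precomputed Inception feature arrays. It tests each candidate point only against the ball around its nearest reference point; the paper's version checks against every reference ball, but the idea is the same:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def kth_radii(feats, k=3):
    # Distance from each point to its k-th nearest neighbor in the same set
    # (k + 1 because each point is its own nearest neighbor).
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(feats).kneighbors(feats)
    return dists[:, -1]

def manifold_fraction(candidates, reference, k=3):
    # Fraction of candidate points landing inside the ball around their
    # nearest reference point.
    radii = kth_radii(reference, k)
    dists, idx = NearestNeighbors(n_neighbors=1).fit(reference).kneighbors(candidates)
    return float(np.mean(dists[:, 0] <= radii[idx[:, 0]]))

precision = manifold_fraction(feats_gen, feats_real)  # quality signal
recall = manifold_fraction(feats_real, feats_gen)     # diversity signal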

A refinement came from Naeem et al. in 2020, who proposed density and coverage as more robust alternatives. Density measures how many generated samples cluster near each real sample (a smoother version of precision), while coverage measures the fraction of real samples that have at least one generated neighbor (a more stable version of recall). These are less sensitive to outliers and more reliable with smaller sample sizes.

I find the precision/recall framing genuinely useful because it forces you to articulate which failure mode you're worried about. When a stakeholder says "the model got worse," precision/recall lets you say "quality is fine, but diversity dropped" — a much more actionable diagnosis than "FID went up by 3 points."

Rest Stop

Congratulations on making it this far. Seriously — if you've followed everything up to here, you have a solid working model of how generative models are evaluated in practice. You can stop now and be well-equipped for most conversations about the topic.

Here's what you've got: FID compares the statistical fingerprints of real and generated image sets — it's the workhorse metric. IS measures quality and diversity but only through the lens of ImageNet classification — it's outdated. CLIP Score measures text-image alignment — essential for prompt-conditional models. LPIPS measures perceptual similarity between specific image pairs. And precision/recall decompose what FID conflates into separate quality and diversity signals.

That doesn't tell the complete story, though. We haven't talked about what happens when you try to evaluate generators using likelihood (it's a trap). We haven't discussed how to systematically detect mode collapse, which is the most common and dangerous failure mode of generators. We haven't explored the fundamental tension between quality and diversity, which haunts every evaluation. And we haven't confronted the deepest problem: what happens when the metric itself becomes the enemy.

If the discomfort of those gaps is nagging at you, read on.

The Likelihood Trap

There's a tempting alternative to all these feature-based metrics: if your generative model assigns probabilities to data (as VAEs, normalizing flows, and autoregressive models do), why not evaluate it by measuring how likely it considers the real test data? The model that assigns higher probability to real images should be better, right?

This sounds logical and is mostly wrong. Theis, van den Oord, and Bethge demonstrated this in their influential 2016 paper, "A Note on the Evaluation of Generative Models," and their findings still cause headaches for people who encounter them for the first time.

The core problem is that high likelihood does not imply good samples, and good samples do not imply high likelihood. These two things are genuinely independent. A model can be very good at recognizing that real images are probable — assigning high likelihood to the test set — while producing terrible samples of its own. How? By smearing its probability mass broadly across all of image space. It covers everything, assigns reasonable probability to real images, but when you sample from it, most of what you draw is blurry nonsense because the mass is spread too thin.

Conversely, a model that memorizes 100 training images and assigns all its probability to exact copies of those 100 images will generate stunning samples (they're literal photographs!) but will assign near-zero likelihood to any test image that isn't in its memorized set.

The standard way to report likelihood for images is bits per dimension: the negative log-likelihood in base 2, divided by the number of pixels. Lower is better. A model scoring 3.5 bits per dimension on CIFAR-10 is decent; 3.0 is strong. But a model with 3.0 bits per dimension can produce samples that look worse than a model with 3.5. I've seen this happen, and it still feels wrong every time.
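
The conversion itself is one line; the only subtlety is keeping units straight. A sketch, assuming your model reports a per-image negative log-likelihood in nats:

import numpy as np

def bits_per_dim(nll_nats, num_dims):
    # e.g. for CIFAR-10, num_dims = 32 * 32 * 3 = 3072
    return nll_nats / (num_dims * np.log(2.0))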

The information-theoretic explanation involves the typical set — the narrow band of probability in which real data overwhelmingly lives in high-dimensional spaces. But the practical lesson is simple: never use likelihood as your only evaluation metric. It's measuring something real, but that something is not what you want to measure when you care about sample quality.

Mode Collapse — How to Catch a Generator Cheating

Mode collapse is what happens when a generator decides that variety is for other models. Instead of learning the full diversity of its training data, it latches onto a handful of outputs that happen to satisfy whatever training objective it faces, and produces those over and over.

In GAN training, this is a well-known failure mode: the generator finds a few outputs that consistently fool the discriminator and keeps replaying them. In diffusion models it's rarer but not impossible — it can surface when the model is undertrained, when the dataset has severe class imbalance, or when guidance scale is cranked too high.

The symptoms are easy to spot if you know where to look. Generate a grid of, say, 64 images from different random seeds. If they all look suspiciously similar — same pose, same composition, same color palette — you have mode collapse. It sounds crude, but visual inspection at fixed intervals during training catches the majority of mode collapse events in practice.

For a more quantitative diagnosis, recall and coverage from the precision/recall framework are your friends. A mode-collapsed generator will have decent precision (the few things it generates look real) but terrible recall (it's only covering a tiny fraction of the real data's variety). FID will also spike, but as we discussed, it won't tell you why — recall will.

Another approach: compute pairwise distances between generated samples in feature space. If the distribution of these distances is tightly peaked near zero, samples are suspiciously similar. You can also run a pretrained classifier on the generated images and look at the class distribution — if 90% of outputs are "golden retriever," something's gone wrong.
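
Both checks fit in a few lines. A sketch, assuming feats holds generated-sample feature vectors and preds holds argmax classes from a pretrained classifier:

import numpy as np
from scipy.spatial.distance import pdist

def diversity_report(feats, preds, n_classes=1000):
    # Pairwise feature distances: a tight peak near zero means near-duplicates.
    dists = pdist(feats)
    print(f"pairwise distance: mean {dists.mean():.3f}, min {dists.min():.3f}")
    # Class histogram: one dominant class is a classic mode-collapse signature.
    counts = np.bincount(preds, minlength=n_classes)
    print(f"most frequent class covers {counts.max() / counts.sum():.0%} of samples")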

For GANs specifically, latent interpolation is revealing. Take two random latent codes z₁ and z₂, and generate images at evenly spaced points between them. If the model has learned a smooth, diverse manifold, these interpolated images should show gradual transitions. If the model has collapsed, the interpolation will be jerky — abrupt jumps between the few modes it knows, with nothing in between.
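
As a sketch, with G standing in for whatever generator you're probing:

import torch

z1 = torch.randn(1, 512)  # two random latent codes
z2 = torch.randn(1, 512)
alphas = torch.linspace(0, 1, steps=8).view(-1, 1)
z_path = (1 - alphas) * z1 + alphas * z2  # 8 evenly spaced latents

with torch.no_grad():
    images = G(z_path)  # inspect: smooth transitions, or abrupt jumps?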

The Diversity–Quality Tradeoff

Here's something that caught me off guard when I first understood it: in every deployed generative system I'm aware of, the quality knob and the diversity knob are the same knob. Turning it one way improves quality and reduces diversity. Turning it the other way does the opposite. Every model ships with this knob tuned to some compromise position.

In GANs, this knob is called the truncation trick. StyleGAN samples latent codes from a normal distribution, then pulls each code toward the average code in its intermediate latent space: w_truncated = w̄ + ψ · (w − w̄). At ψ = 1.0, you get the full, untruncated distribution — maximum diversity but occasional ugly samples from the tails. At ψ = 0.5, you get consistently attractive images but much less variety. At ψ = 0.0, every image is identical — the "average" face. If you imagine the latent distribution as a landscape, truncation is like building a fence closer and closer to the center, keeping only the "safe" territory.

In diffusion models, the same knob is called classifier-free guidance (CFG). During sampling, you compute both a conditional prediction (what matches the prompt) and an unconditional one, then amplify the difference:

noise_pred = unconditional + guidance_scale × (conditional - unconditional)

At guidance_scale = 1.0, you get the raw model — diverse but sometimes ignoring the prompt. At 7.5 (Stable Diffusion's default), images are sharp and prompt-faithful but less varied. At 20+, things become oversaturated and repetitive — the model pushes every image toward the most stereotypical possible interpretation of the prompt.

This has direct implications for evaluation. If you plot FID against guidance scale, you get a U-shaped curve: too low and the images are poor quality (high FID), too high and diversity collapses (also high FID). There's a sweet spot in the middle where FID is minimized. This is why you should always report the guidance scale (or truncation ψ) alongside FID. A model boasting FID = 2.0 at guidance_scale = 30 is hiding diversity loss behind cherry-picked settings. That's not a better model — that's a more aggressively truncated one.

Precision and recall make this tradeoff visible in a way that FID alone cannot. As you increase guidance scale, precision goes up (each image looks better) and recall goes down (the generator covers less of the real distribution). Watching both numbers move as you turn the knob is far more informative than watching a single FID number bounce around.

Asking Humans (And Doing It Right)

Every automatic metric we've discussed is a proxy for what we actually care about: whether a human being would look at the output and think it's good. FID can't tell you that a model generates hands with six fingers. CLIP Score can't tell you that the colors feel muddy. Precision/recall can't tell you that the image is technically correct but aesthetically lifeless. For anything going to production, humans have to be in the loop.

There are three main protocols, each with different strengths.

Mean Opinion Score (MOS) is the simplest: show an evaluator a single image, ask them to rate quality on a 1 to 5 scale. The problem is calibration — different people have wildly different standards. Your 4 might be my 2. MOS works for rough absolute quality measurement but requires large numbers of ratings to average out the noise.

Two-Alternative Forced Choice (2AFC) is what most serious evaluations use. Show two images side by side — one from model A, one from model B — and ask: "which is better?" Humans are far more reliable at relative comparisons than absolute ratings. You don't need to agree on what a "4" means; you only need to agree on which of two images you prefer. Randomize left/right positioning to avoid positional bias, blind the evaluators to model identity, and you get surprisingly consistent results.

Elo ratings extend 2AFC to multiple models. Run many pairwise comparisons, and use the Elo algorithm (the same one chess uses) to compute a rating for each model. Every comparison updates the ratings: if a model wins against a higher-rated opponent, it gains more points. After a few hundred comparisons per model, the rankings stabilize. This is exactly what Chatbot Arena uses for LLM evaluation, and image generation leaderboards are increasingly adopting the same approach.
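
The update rule itself is tiny. Here's a sketch of a single comparison (k = 32 is a conventional choice, not a requirement):

def elo_update(rating_a, rating_b, a_wins, k=32):
    # a_wins: 1.0 if model A's image was preferred, 0.0 if B's, 0.5 for a tie.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (a_wins - expected_a)
    rating_b += k * ((1.0 - a_wins) - (1.0 - expected_a))
    return rating_a, rating_b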

A practical guideline: 200 to 500 comparisons with 3 raters each is typically enough to detect statistically significant differences between two models. Always measure inter-rater agreement (Fleiss' kappa or Krippendorff's alpha) to identify and remove unreliable annotators. And design your prompt set carefully — don't evaluate on random prompts. Target your known failure modes: complex scenes, text rendering, human hands, unusual objects. A model that scores well on "a photo of a dog" but fails on "a left hand holding a clock that reads 3:15" is a model you haven't evaluated thoroughly enough.

Goodhart's Law, or: When the Metric Becomes the Enemy

There's a principle from economics that haunts evaluation in every field of machine learning, but nowhere more acutely than in generative modeling. Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure."

This is not abstract philosophy. It's a concrete, practical problem. The moment the field collectively decided that FID was the number to minimize, models started getting optimized — directly or indirectly — to produce images that match Inception feature statistics, whether or not those images actually look good to humans.

Here are some of the ways FID can be gamed, and I've seen each of these in practice. You can tune your guidance scale or truncation to hit exactly the FID sweet spot, hiding diversity problems behind a good headline number. You can adjust your image preprocessing pipeline to happen to match the reference set's pipeline, getting several free FID points from an implementation detail rather than a model improvement. You can generate images at a resolution that Inception handles well, rather than the resolution your model actually targets. You can report FID on a convenient subset of the test set rather than the full distribution.

CLIP Score is even more vulnerable. If your generation pipeline includes CLIP-guided optimization — which many early text-to-image systems used — then optimizing for CLIP Score is circular. The model was trained to maximize the very metric you're now reporting as evidence of quality. It's like measuring a student's learning with the exact practice problems they've been drilling.

The defense against Goodhart's Law is not a better metric. It's a collection of metrics, measured by different underlying models, supplemented by human evaluation, and reported with full transparency about methodology. Report FID and CLIP Score and precision/recall and the guidance scale and the sample size and the preprocessing pipeline. A model that improves on all fronts is genuinely better. A model that improves on one while the others regress is probably gaming the system.

I find this the most important lesson in the entire section: the metrics are tools, not goals. The goal is to build a model that produces outputs humans actually want. Any single number that claims to capture that is either lying to you or lying to itself.

Wrapping Up

If you're still with me, thank you. I hope it was worth the journey.

We started with the fundamental problem — that generative models have no ground truth to compare against — and built up an entire toolkit for dealing with it. We saw how FID compresses real and generated distributions into Gaussians and measures the distance between them. We saw how Inception Score tried to capture quality and diversity but is blind to the real data. We explored CLIP Score for prompt faithfulness, LPIPS for perceptual similarity between specific images, and precision/recall for disentangling quality from diversity. We walked through the likelihood trap, learned how to diagnose mode collapse, grappled with the fundamental diversity–quality tradeoff, designed human evaluation protocols, and confronted Goodhart's Law — the uncomfortable reality that any metric you optimize against will eventually stop measuring what you care about.

My hope is that the next time you encounter a paper claiming "our model achieves FID = 1.79," instead of nodding along the way I used to, you'll ask: at what guidance scale? With how many samples? Compared to what reference? What does precision/recall look like? And did a human actually look at the outputs? Those questions make you a better practitioner than any single number ever could.

Resources

Heusel et al., "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium" (2017) — the paper that introduced FID. The title has almost nothing to do with why you'd read it. Everyone reads it for Section 5.

Theis, van den Oord, and Bethge, "A Note on the Evaluation of Generative Models" (2016) — the paper that convinced me likelihood is a trap. Short, precise, and permanently changed how I think about evaluation.

Kynkäänniemi et al., "Improved Precision and Recall Metric for Assessing Generative Models" (NeurIPS 2019) — where the precision/recall framework for generators was formalized. If you've ever wanted to know why FID went up, this is your tool.

Zhang et al., "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric" (CVPR 2018) — the LPIPS paper. The title delivers on its promise. One of those papers where the central finding seems obvious in hindsight and was not obvious at all before.

Parmar et al., "On Aliased Resizing and Surprising Subtleties in GAN Evaluation" (2022) — the clean-fid paper. If you compute FID for a living, this will save you from reporting wrong numbers due to preprocessing bugs you didn't know you had.

Barratt and Sharma, "A Note on the Inception Score" (2018) — a thorough post-mortem on IS. If you want ammunition for why IS shouldn't be your primary metric, this is the reference.