LLM Evaluation

Chapter 12: Large Language Models

I avoided thinking deeply about LLM evaluation for longer than I should have. Every time someone asked me “how do you know this model is actually better?” I’d wave my hands at some benchmark numbers or say something like “the outputs look pretty good.” That worked for a while. Then I shipped a feature where the model hallucinated legal citations in production, and “looks pretty good” stopped feeling adequate. This is the deep dive I should have done from the start.

LLM evaluation is the discipline of measuring whether a language model’s outputs are good — and “good” turns out to be a surprisingly hard word to pin down. The field has evolved from simple metrics like perplexity (1990s) through standardized benchmarks (2018–2023) to crowdsourced human preference systems and LLM-as-judge approaches. It sits at the center of every decision about which model to use, whether a prompt change helped, and whether a product is safe to ship.

Before we start, a heads-up. We’re going to be talking about information theory, ranking systems borrowed from chess, and some code for building evaluation pipelines. You don’t need to know any of it beforehand. We’ll add what we need, one piece at a time.

This isn’t a short journey, but I hope you’ll be glad you came.

Contents

The restaurant review problem

Perplexity — the original language model metric

When string matching was enough

The benchmark zoo

Rest stop

Letting one LLM judge another

Elo ratings and the arena

The contamination problem

Human evaluation — the expensive gold standard

Evaluation frameworks and building your own evals

Goodhart’s law and the eval crisis

Wrap-up

Resources

The Restaurant Review Problem

To ground everything that follows, imagine we’re building a product called ReviewBot. It reads restaurant reviews and generates a one-paragraph summary for each restaurant. Our dataset has three reviews for a tiny pizza shop:

Review 1: "Best margherita in town. Crust is perfect. Service was slow."
Review 2: "Pizza is great but the wait was 45 minutes on a Tuesday."
Review 3: "Amazing pizza, terrible service. Worth the wait though."

We feed these into ReviewBot and it produces: “Customers love the pizza, especially the margherita. However, several reviewers noted slow service and long wait times.”

How do we know if that summary is “good”? With a traditional classifier, we’d have a label (cat or dog, spam or not) and we’d count how often the model got it right. But there are dozens of equally valid summaries of those three reviews. “Great pizza, bad service” captures the gist. So does a three-sentence version that mentions the margherita specifically. Quality isn’t one-dimensional here — we care about accuracy, completeness, conciseness, and whether it hallucinated something that wasn’t in the reviews.

This is the fundamental challenge of LLM evaluation: the output space is vast, quality is multidimensional, and there’s rarely a single “right answer.” Every method we’ll explore is an attempt to wrestle that messy reality into something measurable. We’ll keep coming back to ReviewBot throughout, watching each evaluation method succeed and fail on this same problem.

Perplexity — The Original Language Model Metric

The oldest way to evaluate a language model doesn’t look at the model’s generated text at all. Instead, it asks: given a sentence, how surprised was the model by each word?

Let’s build the intuition with a tiny example. Suppose our model reads the sentence “The pizza was” and needs to predict the next word. If it assigns probability 0.4 to “great,” 0.3 to “good,” 0.1 to “terrible,” and tiny probabilities to thousands of other words, and the actual next word is “great,” the model was reasonably confident. It assigned 0.4 to the right answer. But if the actual next word is “transcendent” and the model assigned 0.0001 to it, the model was very surprised. That surprise is what we measure.

For each token in a test sequence, the model produces a probability. We take the log of that probability (which is negative, since probabilities are between 0 and 1), average across all tokens, negate it, and exponentiate. That’s perplexity.

# Given tokens x₁, x₂, ..., xₙ and the model's predicted probabilities:
#
#   PPL = exp( -1/N × Σ log P(xᵢ | x₁, ..., xᵢ₋₁) )
#
# Walk through a 4-token example:
#
#   Token:        "The"   "pizza"  "was"   "great"
#   P(token):      0.1     0.05    0.3     0.4
#   log P:        -2.30   -3.00   -1.20   -0.92
#
#   Average log P = (-2.30 + -3.00 + -1.20 + -0.92) / 4 = -1.855
#   PPL = exp(1.855) ≈ 6.4
#
# Interpretation: on average, the model was as uncertain as if it
# were picking uniformly among ~6 options at each step.
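
If you’d rather see it as runnable code, here’s a minimal sketch. Real pipelines pull per-token log probabilities from the model itself (most APIs can return them); the hardcoded values below are just the four tokens from the walkthrough.

import math

def perplexity(token_logprobs: list[float]) -> float:
    # Average the per-token natural-log probabilities, negate, exponentiate.
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-avg_logprob)

# The four tokens from the walkthrough above:
logprobs = [math.log(p) for p in (0.1, 0.05, 0.3, 0.4)]
print(perplexity(logprobs))  # ≈ 6.4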

A perplexity of 1 would mean the model predicted every token with 100% confidence. That never happens in practice. A perplexity of 50,000 (roughly the vocabulary size) would mean the model had no idea whatsoever. Modern LLMs on well-matched text score somewhere between 5 and 30.

I’ll be honest — when I first encountered perplexity, I thought it was the definitive metric. Lower perplexity, better model, end of story. That intuition is wrong in a way that took me a while to appreciate. A model can have excellent perplexity — it predicts tokens brilliantly — while generating bland, generic text. It can predict “the” and “is” and “of” with high confidence, dragging the average down, while being clueless about the words that actually carry meaning. Perplexity tells you nothing about whether the model follows instructions, whether it hallucinates, or whether its outputs are safe.

And there’s a comparison trap: you can only compare perplexity across models that use the same tokenizer. GPT-2 and LLaMA split text into different pieces, producing different sequence lengths for the same input. Comparing their perplexity numbers is like comparing race times when the runners ran different distances.

For our ReviewBot, perplexity would tell us whether the model is a competent English writer. It would not tell us whether the summary was accurate, whether it captured all three reviews, or whether it invented a detail that wasn’t there. We need something more.

When String Matching Was Enough

Before LLMs, the NLP community had a simpler problem: compare generated text against a known reference. Machine translation had a human-translated sentence to compare against. Summarization had a human-written summary. The idea was to measure how much the generated text overlapped with the reference.

Imagine we write a gold-standard summary for our pizza shop: “Excellent pizza with notably slow service.” And ReviewBot produces: “Customers love the pizza, especially the margherita. However, several reviewers noted slow service.”

BLEU (Bilingual Evaluation Understudy) counts n-gram overlaps. It looks at individual words (unigrams), pairs of words (bigrams), triplets, and quadruplets, then combines the scores. It was designed for machine translation, where phrasing matters. The overlap between our reference and ReviewBot’s output? The words “pizza” and “slow service” appear in both, so there’s some overlap. But “Excellent pizza” and “Customers love the pizza” mean the same thing and share zero bigrams. BLEU would penalize a perfectly good paraphrase.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the mirror image — it asks what fraction of the reference n-grams appear in the output. ROUGE-1 looks at unigrams, ROUGE-2 at bigrams, and ROUGE-L finds the longest common subsequence. It was built for summarization, where covering the key content matters more than exact phrasing. Same problem though: “Excellent” and “love” don’t overlap, even though they carry similar meaning.
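
To see the failure concretely, here’s a pure-Python sketch of unigram precision (the BLEU-1 flavor) and unigram recall (the ROUGE-1 flavor) on our pizza example. Real implementations add higher-order n-grams, clipping, and brevity penalties; this only shows the core counting.

import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def unigram_scores(reference: str, candidate: str) -> tuple[float, float]:
    ref, cand = Counter(tokens(reference)), Counter(tokens(candidate))
    overlap = sum((ref & cand).values())      # matches, clipped per word
    precision = overlap / sum(cand.values())  # BLEU-1: output words found in reference
    recall = overlap / sum(ref.values())      # ROUGE-1: reference words covered
    return precision, recall

reference = "Excellent pizza with notably slow service."
candidate = ("Customers love the pizza, especially the margherita. "
             "However, several reviewers noted slow service.")
p, r = unigram_scores(reference, candidate)
# Only "pizza", "slow", "service" match: precision ≈ 0.23, recall = 0.50.
# "Excellent" and "love" contribute nothing despite agreeing in meaning.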

BERTScore tried to fix this. Instead of counting matching strings, it uses BERT embeddings to compute semantic similarity between tokens. “Excellent” and “love” have similar embeddings, so they’d get a high similarity score. It catches paraphrases that BLEU and ROUGE miss.

But all three methods share a fatal limitation for LLM evaluation: they need a reference. For our pizza shop, we wrote one gold summary. In the real world, ReviewBot might summarize thousands of restaurants, and writing a gold-standard summary for each is prohibitively expensive. For open-ended tasks like “write me a poem about autumn” or “help me debug this code,” there isn’t even a sensible reference to write. The field needed evaluation methods that don’t require knowing the right answer in advance.

The Benchmark Zoo

While string-matching metrics were failing for open-ended generation, researchers took a different approach: instead of evaluating free-form text, test the model on structured problems with known answers. This is the world of benchmarks — standardized exams for LLMs.

Think of it like this. We can’t easily score ReviewBot’s free-form summaries, but we can ask it multiple-choice questions about restaurant reviews and check if it picks the right answer. That’s the core idea behind every benchmark.

MMLU (Massive Multitask Language Understanding) is the SAT of LLMs. It has 14,000+ multiple-choice questions across 57 subjects — anatomy, astronomy, philosophy, law, everything. The model reads a question, picks A, B, C, or D, and we check the answer. For years, MMLU was the number everyone quoted. “Our model scores 86% on MMLU.” But by late 2023, frontier models were scoring above 90%, the differences between top models shrank to statistical noise, and researchers found that MMLU had a 6.5% label error rate — meaning the “correct” answers were themselves wrong for roughly 1 in 15 questions. MMLU-Pro appeared in 2024 with 10 answer choices instead of 4 and harder questions, buying some headroom.

HellaSwag tests commonsense reasoning through sentence completion. Given “A woman is sitting at a table. She picks up a fork and...”, the model chooses the most plausible continuation from four options. When it was introduced, it was hard. Now GPT-4 class models score above 95%. The benchmark has been, for lack of a better word, solved.

HumanEval is different from the others, and I find it the most satisfying. It gives the model a Python function signature with a docstring and asks it to write the implementation. Then the generated code is actually run against test cases. No ambiguity — the code either passes or it doesn’t. The metric is pass@k: generate k candidate solutions, and you pass if any one of them works. HumanEval has 164 problems, and its harder variant HumanEval+ adds more rigorous test cases. Code benchmarks resist saturation in a way that multiple-choice tests don’t, because there are always harder programming problems to write.
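
The pass@k numbers you see reported are usually computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass the tests, and estimate the chance that a random draw of k contains at least one pass. The n and c below are made-up numbers for illustration.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples passes,
    # given n total samples of which c passed.
    if n - c < k:
        return 1.0  # fewer than k failures: every k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185
print(pass_at_k(n=200, c=37, k=10))  # ≈ 0.88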

GSM8K has 8,500 grade-school math word problems. “If a restaurant serves 120 customers per day and each customer orders an average of 2.5 dishes...” — that kind of thing. It tests multi-step reasoning, and it’s where chain-of-thought prompting proved its value. Models that “think step by step” score dramatically higher than models that try to jump to the answer. Top models now score above 90% with chain-of-thought, so the harder MATH benchmark (competition-level problems) has taken over as the frontier.

ARC (AI2 Reasoning Challenge) presents science questions at various difficulty levels. ARC-Easy is largely solved. ARC-Challenge contains the questions that retrieval-based methods get wrong — they require genuine reasoning. The more ambitious ARC-AGI tests abstract visual pattern reasoning, and frontier models still struggle with it.

TruthfulQA takes a completely different angle. Its 817 questions are specifically designed to trigger common misconceptions. “What happens if you crack your knuckles?” A truthful model says “nothing harmful.” A model that has absorbed internet folklore says “you’ll get arthritis.” It measures the gap between confident-sounding and actually-correct.

MT-Bench pushes into multi-turn conversation. It has 80 questions across 8 categories (writing, reasoning, math, coding, extraction, STEM, humanities, roleplay), and the model must answer a follow-up question that builds on its first answer. A strong judge model (typically GPT-4) scores each response from 1 to 10. This captures something the other benchmarks miss: whether the model can hold a coherent conversation rather than answer isolated questions.

I’m still developing my intuition for which benchmarks actually predict real-world usefulness. A model can ace MMLU (broad knowledge) while being terrible at following specific instructions for your task. A model can crush HumanEval (clean coding problems) while failing at the messy, underspecified code requests that real users make. The benchmarks tell you something, but the mapping from benchmark score to “works for my product” is weaker than I initially assumed.

Rest Stop

Congratulations on making it this far. If you want to stop here, you can.

You now have a mental model that covers the foundation of LLM evaluation: perplexity measures how well a model predicts text but not whether that text is useful. Classical metrics (BLEU, ROUGE, BERTScore) compare outputs against references but fail when there’s no reference. Benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA) test specific capabilities with known answers but are saturating and getting contaminated. For our ReviewBot, perplexity tells us the model writes coherent English, ROUGE tells us the summary overlaps with a reference we wrote, and benchmarks tell us the model is broadly capable — but none of them directly answer the question we care about: “is this summary good?”

That gap is what the rest of this piece is about. From here, we’ll look at methods that try to evaluate free-form text without needing a reference answer — using other LLMs as judges, crowdsourcing human preferences, and building custom evaluation sets that capture exactly what matters for your product.

But if the discomfort of not knowing how teams actually evaluate LLMs in production is nagging at you, read on.

Letting One LLM Judge Another

Here’s the idea that changed how LLM evaluation actually works in practice: instead of comparing text against a reference, give the text to a stronger model and ask it to rate the quality. Use GPT-4 to judge whether a smaller model’s output is good.

I know. It sounds circular. It bothered me too — using an LLM to evaluate an LLM feels like grading your own homework. But the results are surprisingly defensible. Multiple studies show that GPT-4’s judgments agree with expert human annotators more than 80% of the time. That’s comparable to how often two humans agree with each other.

Let’s walk through how this works with ReviewBot. We have the three pizza reviews, ReviewBot’s summary, and we want to know if the summary is good. Instead of writing a reference summary and computing ROUGE, we send this to a judge model:

# The prompt to our judge model:

"""Given the following restaurant reviews and a generated summary,
rate the summary on four dimensions (1-5 each):

1. Accuracy: Does the summary only contain claims supported by the reviews?
2. Completeness: Does it capture the main themes across all reviews?
3. Conciseness: Is it appropriately brief without losing key information?
4. Coherence: Does it read naturally and flow well?

Reviews:
- "Best margherita in town. Crust is perfect. Service was slow."
- "Pizza is great but the wait was 45 minutes on a Tuesday."
- "Amazing pizza, terrible service. Worth the wait though."

Summary: "Customers love the pizza, especially the margherita.
However, several reviewers noted slow service and long wait times."

Return your ratings as JSON with a brief reason for each score."""

The judge model might return accuracy: 5 (everything in the summary is supported), completeness: 4 (it missed the “worth the wait” sentiment), conciseness: 5, coherence: 5. That’s useful feedback, and we got it without writing a reference summary.

There are three main patterns for LLM-as-judge evaluation. Single-score rating asks the judge to rate one response on a rubric, like we did above. Pairwise comparison gives the judge two responses and asks which is better — this turns out to be more reliable, because humans (and LLMs) are better at comparing than at assigning absolute scores. Reference-guided evaluation provides a gold answer for the judge to compare against, combining the best of both worlds when you have references available.

The research paper that formalized this is G-Eval (2023), which showed that LLM judges using chain-of-thought reasoning and fine-grained rubrics achieve human-level correlation on summarization and dialogue evaluation. The key insight was that the judge needs a specific rubric, not a vague “is this good?” prompt.

But LLM judges have biases, and they’re not subtle. Position bias: in a pairwise comparison, judges tend to prefer whichever response comes first (or last, depending on the model). The fix is to run every comparison twice with the order swapped and check for consistency. Verbosity bias: longer responses get higher scores, even when the extra length is padding. The fix is to explicitly instruct the judge that conciseness matters and to penalize unnecessary verbosity. Self-preference bias: if you use GPT-4 to judge GPT-4’s outputs, it rates them higher than equivalent outputs from Claude. The fix is to use a different model family as judge than the one being evaluated.
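
Here’s what the position-bias fix looks like in practice, as a sketch. The ask_judge helper is hypothetical (it would wrap a chat-completion call much like the judge_summary function below) and returns which of two responses, in the order shown, the judge preferred.

def ask_judge(prompt: str, shown_first: str, shown_second: str) -> str:
    """Hypothetical helper: asks the judge model to compare two
    responses in the order shown and returns "first" or "second"."""
    ...

def debiased_compare(prompt: str, response_a: str, response_b: str) -> str:
    verdict_1 = ask_judge(prompt, response_a, response_b)  # A shown first
    verdict_2 = ask_judge(prompt, response_b, response_a)  # B shown first
    if verdict_1 == "first" and verdict_2 == "second":
        return "A"  # judge picked A in both orderings
    if verdict_1 == "second" and verdict_2 == "first":
        return "B"  # judge picked B in both orderings
    return "tie"    # verdict flipped with position: treat as unreliable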

The economics are what make this approach transformative. Human evaluation costs $10–50+ per sample when you factor in annotator time, training, and quality control. LLM-as-judge costs $0.01–0.10 per sample and runs in seconds. That means you can evaluate hundreds of examples on every code commit instead of waiting for a weekly human review cycle. For ReviewBot, we could evaluate summaries for every restaurant in our dataset on every deploy, catching regressions before users do. Here’s what the single-score judge looks like as working code:

import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an expert evaluator for restaurant review summaries.
Given the original reviews and a generated summary, rate on these dimensions:

1. Accuracy (1-5): Does the summary only state things supported by reviews?
2. Completeness (1-5): Does it capture the main themes?
3. Conciseness (1-5): Is it appropriately brief?
4. Coherence (1-5): Does it read naturally?

Return valid JSON: {"accuracy": N, "completeness": N,
"conciseness": N, "coherence": N, "overall": N, "reasoning": "..."}
Be strict. 3 = acceptable. 5 = exceptional."""

def judge_summary(reviews: list[str], summary: str) -> dict:
    reviews_text = "\n".join(f"- {r}" for r in reviews)
    result = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content":
                f"Reviews:\n{reviews_text}\n\nSummary: {summary}"}
        ]
    )
    return json.loads(result.choices[0].message.content)

I should mention a limitation I’ve bumped into repeatedly: LLM-as-judge is great for relative comparisons (is version A of the prompt better than version B?) but less reliable for absolute quality (is this response good enough to ship?). The judge might consistently rate one response higher than another, but calibrating what score counts as “production-ready” still requires human validation.

Elo Ratings and the Arena

The pairwise comparison idea, taken to its logical extreme, leads us to the most ambitious LLM evaluation system built so far: Chatbot Arena, run by LMSYS.

The concept is borrowed from chess. In chess, every player has an Elo rating — a number that captures their relative strength. When two players compete, the outcome updates both ratings: beating a higher-rated player earns you more points than beating a lower-rated one. Over thousands of games, the ratings converge to a stable ranking that reflects true skill.

Chatbot Arena applies this to LLMs. A user types a prompt and receives responses from two anonymous models side by side. The user doesn’t know which model is which. They read both responses and pick the better one (or declare a tie). The result updates both models’ Elo ratings. Over hundreds of thousands of these blind battles, the ratings produce a leaderboard.

Let’s trace through one battle with our ReviewBot framing. A user submits the three pizza reviews and asks for a summary. Model A returns a terse “Good pizza, bad service.” Model B returns our ReviewBot’s more detailed summary. The user picks Model B. If Model A had a higher Elo going in, it loses more rating points than if it were already lower-rated. This self-correcting property is what makes the system work — early mistakes get washed out as more data accumulates.
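
The update rule itself is small enough to show directly. This is a minimal sketch with an illustrative K-factor of 32 (a common chess default); the Arena’s leaderboard has since moved to a related Bradley–Terry fit, but the intuition is identical.

def expected_score(rating_a: float, rating_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - e_a)  # upsets move ratings more than expected wins
    return rating_a + delta, rating_b - delta

# The favored Model A (1250) loses to Model B (1200):
print(elo_update(1250, 1200, a_won=False))  # A drops ~18 points, B gains ~18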

What makes the Arena compelling is what it doesn’t suffer from. There’s no fixed test set to memorize. The prompts come from real users, so they reflect real use cases. The evaluation is blind, so models can’t be tuned for specific evaluators. And the sheer volume (hundreds of thousands of battles) averages out individual voter biases.

Or at least, that was the theory. In early 2025, Meta submitted a variant of Llama 4 that appeared to be specifically tuned for the kinds of conversational preferences that Arena voters reward — verbose, engaging, stylistically polished — without corresponding improvements on other benchmarks. It topped the leaderboard briefly, then the community noticed the discrepancy. The incident exposed that even crowdsourced evaluation can be gamed if you know what the crowd tends to prefer. No evaluation system is fully immune to optimization pressure. We’ll come back to this uncomfortable truth later.

The Contamination Problem

Here is the elephant in every room where benchmark numbers are discussed. LLMs are trained on massive web crawls. The web contains benchmark datasets. If MMLU questions appear in the training data, the model isn’t reasoning through the question — it’s remembering the answer it saw during training. And this isn’t hypothetical: multiple studies have confirmed that major benchmark datasets leaked into the training sets of production models.

Consider what this means for our ReviewBot scenario. If we evaluate the model on a set of 200 restaurant-summary pairs, and those pairs were scraped from a website that was in the training data, our “evaluation” is measuring memorization, not capability. The model might produce perfect summaries for those specific restaurants while failing on novel ones.

The contamination problem has a few dimensions. Direct contamination happens when test questions appear verbatim in training data. Indirect contamination happens when the training data contains discussions or solutions referencing the benchmark (blog posts explaining MMLU answers, for instance). Temporal contamination happens when a benchmark that was clean at launch gets absorbed into training data for later models.

Detection is hard. You’d need access to the full training dataset, which most model providers don’t share. Some mitigation strategies: use benchmarks that rotate questions (LiveBench, which refreshes monthly). Insert canary strings — unique, never-before-published text snippets — into your test set and check if the model can reproduce them (if it can, your test data leaked). Create your own eval set from scratch, since data you wrote yesterday can’t be in a model trained last year. Use time-based splits: evaluate on inputs from after the model’s training cutoff date.
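
Without access to the training data, detection is necessarily heuristic. Here’s a crude verbatim-overlap screen, a sketch assuming you have some slice of candidate training text to check against. It catches direct contamination (exact n-gram reuse) but not the indirect or paraphrased kind.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(eval_inputs: list[str], corpus_text: str,
                      n: int = 8) -> list[int]:
    # Return indices of eval examples sharing any n-gram with the corpus.
    corpus_grams = ngrams(corpus_text, n)
    return [i for i, example in enumerate(eval_inputs)
            if ngrams(example, n) & corpus_grams]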

By 2024, contamination was so pervasive that nearly every major pre-2023 benchmark had been partially compromised. This is one reason the field has been in a constant arms race, producing harder, newer benchmarks to stay ahead of the training data.

Human Evaluation — The Expensive Gold Standard

Every evaluation method we’ve discussed so far is, in some sense, a proxy for the question we actually care about: do humans think the output is good? At some point, you need to ask actual humans.

I’ll be honest — human evaluation sounds straightforward until you try to implement it. When I first set up a human eval pipeline, I assumed the hard part would be recruiting annotators. The actual hard part was getting them to agree with each other.

Here’s what happens. You give three annotators the pizza reviews and ReviewBot’s summary and ask them to rate quality from 1 to 5. Annotator A gives it a 4 (“good summary, captures the main points”). Annotator B gives it a 3 (“it missed the ‘worth the wait’ nuance”). Annotator C gives it a 5 (“excellent, concise and accurate”). Same input, same rubric, three different scores.

This is why inter-annotator agreement matters so much. The standard measure is Cohen’s kappa (κ), which adjusts for chance agreement. If annotators agree 70% of the time, but random chance would produce 50% agreement, kappa accounts for that.

# Cohen's Kappa measures agreement beyond chance
#
#   κ = (observed agreement - chance agreement) / (1 - chance agreement)
#
#   κ = 1.0  → annotators always agree
#   κ = 0.0  → agreement is no better than coin-flipping
#   κ > 0.8  → excellent (publishable, trustworthy)
#   κ 0.6-0.8 → substantial (usable for most decisions)
#   κ 0.4-0.6 → moderate (your rubric needs work)
#   κ < 0.4  → poor (stop and redesign the annotation task)
#
# Example: two annotators rate 10 summaries on a 1-5 scale
from sklearn.metrics import cohen_kappa_score

annotator_1 = [4, 3, 5, 2, 4, 3, 5, 4, 2, 3]
annotator_2 = [4, 4, 5, 2, 3, 3, 4, 4, 2, 4]

# weights="quadratic" means disagreeing by 1 point is penalized less
# than disagreeing by 3 points. Use this for ordinal scales.
kappa = cohen_kappa_score(annotator_1, annotator_2, weights="quadratic")

The cost math is sobering. Three annotators, 200 examples, 5 minutes per example: that’s 50 person-hours. At $25/hour, one round of human evaluation costs $1,250. And you probably need to do it before every major launch. LLM-as-judge on the same 200 examples costs $2–20 and takes 5 minutes of wall-clock time. The practical strategy that most production teams land on is a layered approach: LLM-as-judge on every commit for fast feedback, weekly human spot-checks on 20–50 examples to calibrate the judge, and full human evaluation before major launches as the final gate.

Human evaluation is also the only reliable way to assess safety. Can an automated judge catch subtle toxicity, cultural insensitivity, or dangerous medical advice? Sometimes. Reliably? No. For safety-critical applications — medical, legal, financial — human review isn’t a luxury. It’s a requirement.

Evaluation Frameworks and Building Your Own Evals

Running benchmarks by hand is tedious and error-prone. Two frameworks have emerged as the standard tools for the job.

lm-eval-harness (by EleutherAI) is the Swiss Army knife. It supports hundreds of benchmarks out of the box — MMLU, HellaSwag, ARC, TruthfulQA, GSM8K, and many more. Point it at a model (local or API-based), specify the tasks you want, and it handles prompt formatting, tokenization, scoring, and result aggregation. Most of the LLM leaderboards you see online use lm-eval-harness under the hood. If you want to compare a new model against established benchmarks, this is where you start.

HELM (Holistic Evaluation of Language Models, by Stanford) goes broader. Where lm-eval-harness focuses on accuracy, HELM evaluates along multiple axes: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. If you need to assess not only whether the model gets the right answer but also whether it’s reliable, fair, and safe, HELM gives you that multi-dimensional view.

But here’s the thing that took me too long to internalize: public benchmarks evaluate models. Custom evals evaluate products. MMLU tells you whether the model has broad knowledge. Your custom eval set tells you whether the model solves your specific problem for your specific users.

Building custom evals for ReviewBot would look like this. We collect 50–200 real sets of restaurant reviews, spanning different cuisines, review lengths, sentiment distributions, and edge cases (contradictory reviews, reviews in broken English, restaurants with only one review). For each, we either write a reference summary or define a rubric: must mention food quality, must mention service if reviewers mention it, must not claim more than three stars if reviews are negative.

# A custom eval case for ReviewBot
eval_case = {
    "id": "pizza-shop-contradictory",
    "reviews": [
        "Worst pizza I've ever had. Burnt crust, cold center.",
        "Best pizza in the city! Perfect every time.",
        "Pizza is fine. Nothing special. Service is fast."
    ],
    "rubric": {
        "must_capture": ["mixed reviews", "both positive and negative opinions"],
        "must_not": ["claim all reviews were positive",
                     "claim all reviews were negative"],
        "tone": "balanced and neutral"
    },
    "difficulty": "hard",
    "category": "contradictory-reviews"
}
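
To tie the pieces together, here’s a minimal runner. It’s a sketch that reuses judge_summary from the LLM-as-judge section plus a deliberately naive rubric check (real rubric items like “must capture mixed reviews” need a judge call of their own, not a substring match); the overall >= 4 threshold is an arbitrary illustration.

def violates_rubric(summary: str, rubric: dict) -> bool:
    # Naive check: only catches literal "must_not" phrases.
    text = summary.lower()
    return any(bad.lower() in text for bad in rubric.get("must_not", []))

def run_eval_suite(cases: list[dict], summarize) -> float:
    # summarize: your product's reviews -> summary function.
    passed = 0
    for case in cases:
        summary = summarize(case["reviews"])
        scores = judge_summary(case["reviews"], summary)  # defined earlier
        if scores["overall"] >= 4 and not violates_rubric(summary, case["rubric"]):
            passed += 1
    return passed / len(cases)  # track this pass rate across deploys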

The most powerful pattern I’ve seen for growing custom evals is the eval flywheel: ship the product, watch for failures in production (user complaints, flagged responses, low satisfaction scores), turn every failure into a new eval case with an input and expected behavior, improve the model or prompt until the new case passes, run the full eval suite to check for regressions, and ship again. After six months of this cycle, your custom eval set is worth more than any public benchmark, because it captures the specific failure modes your users actually encounter.

# The eval flywheel in practice:
#
#   Deploy → Users find failures → Add failures to eval set
#     → Improve model/prompt → Run full eval suite → Deploy
#
# Start with 50 well-chosen examples.
# After 6 months, you'll have 300+ that cover your real failure modes.
# Every production incident becomes a regression test.
# The eval set becomes your most valuable asset.

Goodhart’s Law and the Eval Crisis

There is a decades-old observation from economics that haunts every evaluation system we’ve discussed: “When a measure becomes a target, it ceases to be a good measure.” This is Goodhart’s Law, and it is the single most important concept in LLM evaluation.

Watch how it plays out across every method we’ve covered. Perplexity becomes a target → models optimize for predicting common tokens rather than generating useful text. MMLU becomes a target → training data gets contaminated with MMLU questions. Chatbot Arena becomes a target → models get tuned for the stylistic preferences of Arena voters. Even LLM-as-judge becomes a target → models learn to produce outputs that GPT-4 rates highly, which isn’t the same as outputs that humans find useful.

2024 was when the field collectively acknowledged this as a crisis. Hugging Face’s evaluation team published a guidebook documenting how nearly every pre-2023 benchmark had been compromised by some combination of contamination, saturation, and optimization pressure. Researchers started calling for dynamic benchmarks that rotate questions, private benchmarks that can’t be trained on, and adversarial benchmarks that actively resist gaming.

For our ReviewBot, Goodhart’s Law means we can’t rely on any single metric or benchmark forever. If we optimize ReviewBot’s prompts to maximize our LLM-judge scores, we might end up with summaries that score well by the judge’s rubric but miss something that actually matters to users. The judge becomes the target, and it stops measuring what we care about.

My favorite thing about this problem is that, despite years of work, no one has a clean solution. The best defense is defense in depth: multiple evaluation methods, multiple metrics, regular calibration against human judgment, and a healthy skepticism toward any single number. The moment you start saying “our eval score is 4.2 out of 5, we’re good” is the moment you’ve stopped actually evaluating.

I still get tripped up by this. I’ll optimize a prompt for two days, watch the LLM-judge score climb from 3.8 to 4.3, celebrate, and then discover that the “improvement” came from the model producing longer, more hedged responses that the judge likes but users find annoying. The score went up. The product got worse. That tension never fully goes away.

Wrap-Up

If you’re still with me, thank you. I hope it was worth it.

We started with a deceptively simple question — how do you know if an LLM’s output is good? — and discovered that the answer involves information theory (perplexity), string matching (BLEU, ROUGE), standardized exams (MMLU, HumanEval, GSM8K), using LLMs to judge other LLMs (G-Eval, pairwise comparison), borrowing ranking systems from chess (Elo, Chatbot Arena), paying humans to disagree with each other (human evaluation with Cohen’s kappa), and confronting the fact that every measurement system eventually gets gamed (Goodhart’s Law).

My hope is that the next time someone shows you a model that “scores 92% on MMLU” or “ranks #3 on the Arena,” instead of taking that at face value, you’ll ask: what does that benchmark actually measure? Could it be contaminated? Does it predict performance on my task? And then you’ll go build your own eval set, because the fifty examples you write yourself will tell you more about your product than any leaderboard ever will.

Resources

Hugging Face Evaluation Guidebook — The most comprehensive and honest treatment of the evaluation crisis, including the contamination and saturation problems. Insightful and frequently updated.

Chatbot Arena Leaderboard (lmsys.org) — The live, crowdsourced Elo leaderboard. Watching the rankings shift after new model releases is addictive. The underlying dataset of human preferences is also available for research.

“Judging LLM-as-a-Judge” (Zheng et al., 2023) — The O.G. paper on using GPT-4 as a judge, documenting both its effectiveness and its biases. Required reading if you’re implementing LLM-as-judge in production.

lm-eval-harness (EleutherAI, GitHub) — The standard framework for running benchmarks. Wildly useful for comparing models and getting reproducible results.

“Benchmark Data Contamination of Large Language Models: A Survey” (Xu et al., 2024) — A sobering catalog of how contamination has affected every major benchmark. Changes how you think about leaderboard numbers.

HELM (Stanford CRFM) — Holistic evaluation across accuracy, fairness, robustness, and toxicity. The multi-dimensional view that single-score benchmarks miss.