Text Generation and Translation

Chapter 11: Natural Language Processing
Decoding strategies · Machine translation · Evaluation · Constrained generation

I avoided looking under the hood of text generation for longer than I'd like to admit. Every time I used a chatbot or watched a translation model spit out fluent German, I treated it like a magic trick — enjoy the show, don't ask how it works. I'd tweak temperature and top-p in API calls without understanding what those knobs actually did to the probability distribution. Finally, the discomfort of not knowing what's really happening between "the model produces logits" and "you see text on screen" grew too great for me. Here is that dive.

Text generation is the process of producing sequences of tokens — words, subwords, characters — one at a time. Machine translation is one of the oldest and most demanding applications of this process. Together, they form the backbone of modern NLP: chatbots, code completion, summarization, translation services, creative writing tools. The decoding strategies we'll explore here are the same ones powering every major language model in production today.

Before we start, a heads-up. We're going to trace through probability distributions, walk through sampling algorithms step by step, and build translation evaluation metrics from scratch. You don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

The Autoregressive Loop
Greedy Decoding and Its Discontents
Temperature: Reshaping Confidence
Top-K: Drawing a Hard Line
Top-P (Nucleus Sampling): A Smarter Line
Repetition Penalties: Breaking the Loop
Beam Search: Scouting Multiple Paths
Rest Stop
Machine Translation: The Original Sequence Problem
Parallel Corpora and the Training Data
Measuring Translation: BLEU, METEOR, chrF, COMET
Multilingual Models: One Model, Many Languages
Constrained Generation and Structured Output
Watermarking Generated Text
The Unsolved Problem: Evaluating Open-Ended Generation
Wrap-Up
Resources

The Autoregressive Loop

Let's start with a scenario we'll carry through the entire section. Imagine we're building a tiny story-writing assistant. It has a vocabulary of exactly five tokens: the, cat, sat, on, and mat. That's the whole language. Five words.

Our assistant's job is to continue a prompt by predicting one token at a time. We give it "the", and it produces a probability distribution over all five tokens — a list of five numbers that sum to one. Each number represents the model's confidence that that particular token should come next. Maybe it looks like this:

Prompt: "the"

Model output (probabilities):
  the  → 0.02
  cat  → 0.70
  sat  → 0.15
  on   → 0.08
  mat  → 0.05

The model thinks cat is the most likely next word after the. We pick cat, append it to our sequence, and now our prompt is "the cat". We feed that back in and get a new distribution. Pick again. Append again. Repeat.

This loop — predict, pick, append, repeat — is called autoregressive generation. The term "autoregressive" means self-feeding: each generated token becomes part of the input for predicting the next one. Token 5 depends on tokens 1 through 4. Token 50 depends on tokens 1 through 49. It's inherently sequential, which is why generation is so much slower than training. During training, you can process all positions in parallel because you already know the target sequence. During generation, you're building the sequence one piece at a time, and you can't peek ahead.

This sequential bottleneck is also why KV-caching matters so much in practice. Without it, the model would recompute attention over the entire sequence for every new token — quadratic cost. With KV-cache, the model stores the key and value tensors from previous positions and only computes attention for the newest token. That turns the cost from O(n²) per token to O(n). For a 2,000-token generation, that's the difference between minutes and seconds.

But here's the thing I glossed over. I said "we pick cat." How did we decide to pick cat? The model gave us five probabilities. The strategy we use to turn those probabilities into a concrete token choice — that's the decoding strategy. And it changes everything about the output quality.
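
To make the loop concrete, here is a minimal sketch in Python. The five-token vocabulary and the next_token_probs function are stand-ins for a real model (a real implementation would run a forward pass and apply softmax); the choose callback is exactly where a decoding strategy plugs in.

import random

VOCAB = ["the", "cat", "sat", "on", "mat"]

def next_token_probs(context):
    # Stand-in for the model: return a probability distribution over VOCAB
    # given the tokens so far. Hard-coded here; a real model computes this.
    return {"the": 0.02, "cat": 0.70, "sat": 0.15, "on": 0.08, "mat": 0.05}

def generate(prompt, choose, max_new_tokens=5):
    # The autoregressive loop: predict, pick, append, repeat.
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)   # predict
        tokens.append(choose(probs))       # pick, then append
    return " ".join(tokens)

# Two possible strategies: always take the most likely token (greedy,
# covered next), or sample in proportion to the probabilities.
pick_most_likely = lambda probs: max(probs, key=probs.get)
sample = lambda probs: random.choices(list(probs), weights=list(probs.values()))[0]

print(generate("the", pick_most_likely, max_new_tokens=1))   # "the cat"
print(generate("the", sample, max_new_tokens=1))              # "the" plus one sampled token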

Greedy Decoding and Its Discontents

The most obvious strategy is: pick the token with the highest probability. Every time. No randomness, no deliberation. In our toy example, cat has probability 0.70, so we pick cat. Next step, whatever has the highest probability, we pick that too.

This is greedy decoding. It's fast, it's deterministic (same input always gives same output), and it's the first thing everyone tries. It's also, in many situations, terrible.

The problem is that greedy decoding can't backtrack. Imagine the globally best five-word story requires picking sat (probability 0.15) at step one instead of cat (probability 0.70). Greedy will never find it. It's locked onto the locally optimal path at every step, like a hiker who always takes the steepest uphill trail at every fork. Sometimes the steepest trail leads to a dead end while the gentler path leads to the summit. But the greedy hiker never finds out.

In practice, greedy decoding produces repetitive, bland text. You've probably seen this: "I think that this is a great idea. I think that this is a great idea. I think that this is a great idea." The model keeps picking the same high-probability continuation because, at every single step, that continuation looks best. It never has a reason to deviate.

We need a way to introduce controlled randomness. That's where temperature comes in.

Temperature: Reshaping Confidence

I used to think temperature was some magical creativity dial — turn it up, get wackier text. That's not wrong, but it hides what's actually happening, and once I saw the mechanics, the knob suddenly made sense.

Here's what's really going on. Before the model produces probabilities, it produces logits — raw scores for each token, not yet normalized to sum to one. To turn logits into probabilities, we apply the softmax function. Temperature inserts itself right before that softmax step: we divide every logit by the temperature value T.

Let's trace through our five-token vocabulary with concrete numbers. Suppose the model outputs these logits:

Raw logits:
  the  → -1.0
  cat  →  3.0
  sat  →  1.5
  on   →  0.5
  mat  → -0.2

At temperature T = 1.0, nothing changes. We divide by 1, apply softmax, and get the original distribution. Cat dominates.

At T = 0.5, we divide each logit by 0.5, which doubles them: cat's logit becomes 6.0, sat's becomes 3.0. When we apply softmax to these amplified logits, the gap between cat and everything else becomes enormous. Cat's probability jumps from about 0.73 to roughly 0.94. The distribution gets sharper — peakier. The model becomes more confident, more predictable. In the limit, as T approaches 0, this converges to greedy decoding.

At T = 2.0, we divide by 2, which halves the logits: cat's becomes 1.5, sat's becomes 0.75. The differences between logits shrink. Softmax produces a flatter distribution — cat drops to roughly 0.48, and the other tokens pick up more of the probability mass. More randomness, more surprise. In the limit, as T approaches infinity, every token becomes equally likely.

The formula is: P(token_i) = exp(logit_i / T) / Σ exp(logit_j / T)
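
Here is a small sketch that applies that formula to the toy logits above, so you can check the numbers yourself. Only the logit values come from the example; the rest is plain Python.

import math

logits = {"the": -1.0, "cat": 3.0, "sat": 1.5, "on": 0.5, "mat": -0.2}

def softmax_with_temperature(logits, T):
    # Divide every logit by T, then apply softmax.
    scaled = {tok: v / T for tok, v in logits.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / total for tok, v in scaled.items()}

for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}:", {tok: round(p, 3) for tok, p in probs.items()})
# T=0.5: cat ≈ 0.94 (sharper, nearly deterministic)
# T=1.0: cat ≈ 0.73 (the unmodified distribution)
# T=2.0: cat ≈ 0.48 (flatter, more mass on the alternatives)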

Think of temperature like the contrast knob on a photograph. Low temperature is high contrast: the bright spots become blinding, the shadows go pitch black, and the dominant tokens absorb all the probability. High temperature is low contrast: detail emerges from the shadows, and rare tokens get a fighting chance.

In practice, T between 0.7 and 1.0 works well for creative text. T between 0.1 and 0.3 works for factual tasks where you want the model to stick to its most confident predictions. But there's a problem: even at a well-chosen temperature, we're still sampling from the entire vocabulary. That includes all the garbage tokens — the ones with vanishingly small probability that no reasonable continuation would ever include.

Top-K: Drawing a Hard Line

Top-k sampling addresses the garbage problem by drawing a hard line. Before sampling, we find the k tokens with the highest probabilities, set everything else to zero, and renormalize what's left so it sums to one again. Then we sample from those k survivors.

Back to our toy example. With k = 3:

Original probabilities:
  the  → 0.02    cat  → 0.70    sat  → 0.15
  on   → 0.08    mat  → 0.05

Top-3 survivors: cat (0.70), sat (0.15), on (0.08)
Total surviving mass: 0.93

After renormalization:
  cat  → 0.70 / 0.93 ≈ 0.753
  sat  → 0.15 / 0.93 ≈ 0.161
  on   → 0.08 / 0.93 ≈ 0.086
  the  → 0    mat  → 0

Now the and mat can't be picked at all. We can only sample from the top three. This eliminates the risk of drawing a nonsensical token from the tail of the distribution.
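
A minimal top-k filter in Python, run on the toy distribution above:

def top_k_filter(probs, k):
    # Keep the k highest-probability tokens, drop the rest,
    # and renormalize so the survivors sum to one.
    survivors = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    mass = sum(p for _, p in survivors)
    return {tok: p / mass for tok, p in survivors}

probs = {"the": 0.02, "cat": 0.70, "sat": 0.15, "on": 0.08, "mat": 0.05}
print(top_k_filter(probs, k=3))
# {'cat': 0.753, 'sat': 0.161, 'on': 0.086} (approximately)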

But k is fixed, and that's the limitation. Sometimes the model is extremely confident — cat has probability 0.95 and everything else is noise. With k = 50, we're still considering 49 garbage tokens alongside the one good one. Other times, the model is genuinely uncertain — maybe 200 tokens all have reasonable probability, and k = 50 is cutting off perfectly valid continuations. The right value of k varies at every single generation step, and there's no perfect fixed number.

Top-P (Nucleus Sampling): A Smarter Line

Top-p sampling, also called nucleus sampling (Holtzman et al., 2020), fixes top-k's rigidity with an elegant idea: instead of fixing the number of tokens, fix the total probability mass you're willing to consider.

Sort the tokens by probability from highest to lowest. Walk down the list, accumulating probability as you go. Stop as soon as the cumulative probability reaches your threshold p. Everything you've included is the nucleus — the smallest set of tokens whose probabilities sum to at least p.

Let's trace through with p = 0.90:

Sorted probabilities:
  cat  → 0.70  (cumulative: 0.70)  ← included
  sat  → 0.15  (cumulative: 0.85)  ← included
  on   → 0.08  (cumulative: 0.93)  ← included (this pushes us past 0.90)
  mat  → 0.05  (cumulative: 0.98)  ← excluded
  the  → 0.02  (cumulative: 1.00)  ← excluded

The nucleus contains three tokens: cat, sat, on. We renormalize those three and sample.

Now here's why this is smarter than top-k. Imagine a different step where the model is very confident:

  cat  → 0.92  (cumulative: 0.92)  ← included, and we've hit 0.90
  sat  → 0.04  ← excluded
  on   → 0.02  ← excluded
  mat  → 0.01  ← excluded
  the  → 0.01  ← excluded

With p = 0.90, the nucleus is a single token: cat. We almost certainly pick cat, which makes sense — the model was confident. But with top-k set to 50, we'd still be considering four other tokens unnecessarily. And in the opposite scenario, where the model is uncertain and 200 tokens each have probability 0.005, top-p would include all 200 of them, while top-k = 50 would chop off 150 valid options.

Top-p = 0.95 is a common default that works well across many tasks. You can also combine top-p with temperature — apply temperature first to reshape the distribution, then apply top-p to trim the tails.
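
Here is a sketch of the nucleus filter. In a full pipeline you would apply temperature to the logits first, then run this on the resulting probabilities; both example distributions are the ones used above.

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches p (the nucleus), then renormalize.
    nucleus, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        nucleus[tok] = prob
        cumulative += prob
        if cumulative >= p:    # stop as soon as we cross the threshold
            break
    mass = sum(nucleus.values())
    return {tok: prob / mass for tok, prob in nucleus.items()}

uncertain = {"the": 0.02, "cat": 0.70, "sat": 0.15, "on": 0.08, "mat": 0.05}
print(top_p_filter(uncertain, p=0.90))   # nucleus = cat, sat, on

confident = {"cat": 0.92, "sat": 0.04, "on": 0.02, "mat": 0.01, "the": 0.01}
print(top_p_filter(confident, p=0.90))   # nucleus = cat alone; the filter adapts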

But neither temperature nor top-p addresses a specific, maddening failure mode: repetition.

Repetition Penalties: Breaking the Loop

Even with good sampling settings, models sometimes get stuck in loops. "The cat sat on the mat. The cat sat on the mat. The cat sat on the mat." The model keeps generating the same phrase because, at every step, those tokens genuinely do have high probability given the context — a context that now contains the phrase it keeps repeating.

A repetition penalty fights this by looking at which tokens have already appeared in the generated sequence and actively discouraging them. For each token that's already been generated, its logit gets modified: if the logit is positive, divide it by the penalty factor; if it's negative, multiply by it. Either way, the token becomes less attractive.
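
A sketch of that logit adjustment, using the toy logits from earlier. The penalty value and the tokens marked as already generated are illustrative.

def apply_repetition_penalty(logits, generated_tokens, penalty=1.2):
    # Tokens that already appeared become less attractive: positive logits
    # are divided by the penalty, negative logits are multiplied by it.
    adjusted = dict(logits)
    for tok in set(generated_tokens):
        if tok in adjusted:
            if adjusted[tok] > 0:
                adjusted[tok] /= penalty
            else:
                adjusted[tok] *= penalty
    return adjusted

logits = {"the": -1.0, "cat": 3.0, "sat": 1.5, "on": 0.5, "mat": -0.2}
print(apply_repetition_penalty(logits, ["the", "cat"], penalty=1.2))
# 'cat' drops from 3.0 to 2.5; 'the' drops from -1.0 to -1.2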

A penalty of 1.0 means no change. Values between 1.1 and 1.3 work well in practice. But be careful — crank it too high and the model starts avoiding common words it legitimately needs to reuse. You end up with unnatural, thesaurus-stuffed prose where the model desperately searches for synonyms: "the feline rested upon the rug. said creature positioned itself atop the floor covering." That's worse than repetition.

There's also no_repeat_ngram_size, which is a blunter instrument: it flat-out forbids repeating any n-gram of a specified length. Set it to 2, and the model will never produce the same two-word phrase twice. This is effective for summarization but can be too aggressive for dialogue, where repeating phrases like "I don't know" is natural.

OpenAI's API offers two related knobs: frequency_penalty (penalizes tokens proportional to how many times they've appeared) and presence_penalty (penalizes tokens based on whether they've appeared at all, regardless of count). The distinction matters. Frequency penalty punishes the word "the" more the more it appears. Presence penalty punishes "the" the same amount whether it appeared once or fifty times. For most creative generation tasks, a small presence penalty (0.1–0.6) encourages the model to explore new topics without over-penalizing common function words.

All of these sampling-based approaches share a fundamental trait: they're stochastic. Run the same prompt twice, get different outputs. For some tasks — creative writing, brainstorming — that's a feature. For others — translation, code generation, extracting structured data — you want the single best output, not a random good one. That's where beam search enters.

Beam Search: Scouting Multiple Paths

I'll be honest — I found beam search counterintuitive the first time I encountered it. The name made it sound like a physics concept, and every diagram I saw looked like a tree with too many branches. What helped me understand it was thinking about it as sending scouts down a trail.

Remember our greedy hiker who always picks the steepest trail at every fork? Beam search is like sending a team of hikers — say, five of them. At each fork, every hiker scouts all the branches ahead. Then the group compares notes and keeps only the five most promising trails overall, measured by the total path quality from the start. The rest are abandoned. At the next fork, the surviving five each scout again, compare again, and prune again.

The number of hikers (candidates) is the beam width, usually written as b. With b = 1, beam search is greedy decoding. With an unbounded beam width, it becomes exhaustive search over every possible sequence (which is computationally impossible for real vocabularies). Values of 4 to 10 are typical in practice.

Let's trace through our toy example with beam width b = 2:

Step 1. Prompt: "the"
  Model output: cat (0.70), sat (0.15), on (0.08), mat (0.05), the (0.02)
  Keep top-2: "the cat" (score: log 0.70 = -0.36)
               "the sat" (score: log 0.15 = -1.90)

Step 2. Expand each beam:
  "the cat" → cat (0.05), sat (0.60), on (0.25), mat (0.08), the (0.02)
  "the sat" → cat (0.10), sat (0.05), on (0.70), mat (0.10), the (0.05)

  All candidates with cumulative scores:
    "the cat sat"  → -0.36 + log(0.60) = -0.87
    "the cat on"   → -0.36 + log(0.25) = -1.75
    "the sat on"   → -1.90 + log(0.70) = -2.26
    "the cat mat"  → -0.36 + log(0.08) = -2.89
    ... (rest are worse)

  Keep top-2: "the cat sat" (-0.87), "the cat on" (-1.75)

Notice that "the sat on" got eliminated even though "on" had a high probability in that context — the cumulative score from the bad first step dragged it down. That's the power of beam search: it evaluates entire sequences, not individual steps.

One subtlety: without adjustment, beam search favors shorter sequences. Multiplying probabilities (or adding log probabilities) means longer sequences always have lower scores. A length penalty corrects this by dividing the score by the sequence length raised to a power α. With α = 0.7 (a common choice), the penalty softly encourages longer outputs without forcing them.
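
Here is a compact beam search over the toy model, with the length penalty included. The step function hard-codes the probabilities from the trace above; a real implementation would call the model at every expansion.

import math

TOY_MODEL = {
    ("the",):       {"the": 0.02, "cat": 0.70, "sat": 0.15, "on": 0.08, "mat": 0.05},
    ("the", "cat"): {"the": 0.02, "cat": 0.05, "sat": 0.60, "on": 0.25, "mat": 0.08},
    ("the", "sat"): {"the": 0.05, "cat": 0.10, "sat": 0.05, "on": 0.70, "mat": 0.10},
}

def toy_step(tokens):
    # Fall back to a uniform distribution for contexts the table doesn't cover.
    return TOY_MODEL.get(tuple(tokens), {t: 0.2 for t in ["the", "cat", "sat", "on", "mat"]})

def beam_search(prompt, beam_width=2, steps=2, alpha=0.0):
    beams = [(prompt.split(), 0.0)]            # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, prob in toy_step(tokens).items():
                candidates.append((tokens + [tok], score + math.log(prob)))
        # Rank by length-normalized score; alpha=0 disables the length penalty.
        candidates.sort(key=lambda c: c[1] / (len(c[0]) ** alpha), reverse=True)
        beams = candidates[:beam_width]
    return beams

for tokens, score in beam_search("the"):
    print(" ".join(tokens), round(score, 2))
# the cat sat -0.87
# the cat on  -1.74  (the trace above rounds intermediate values, hence -1.75 there)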

Beam search excels at tasks with a "correct" answer — translation, summarization, code completion. But for open-ended creative writing, it tends to produce safe, generic, committee-approved prose. It gravitates toward high-probability sequences, and high-probability sequences are, by definition, the least surprising ones. The most interesting writing often involves lower-probability choices that pay off later, and beam search with a small beam width will never find those paths.

In production, many systems combine strategies: beam search with a small beam width for translation, top-p with temperature for chatbots. Some use beam search to generate a set of candidates and then rescore them with a separate quality metric. There's no universal answer — the right decoding strategy depends entirely on what you're trying to generate.

Rest Stop

Congratulations on making it this far. If you need to stop here, you've already built a useful mental model.

You now understand the autoregressive generation loop and why it's sequential. You can trace through five decoding strategies — greedy, temperature, top-k, top-p, beam search — and explain what each one does to a probability distribution. You know that repetition penalties exist and why they're necessary. You know that sampling-based methods trade off diversity and quality, while beam search trades off creativity for correctness.

That covers the generation side. But text generation didn't emerge in a vacuum. The hardest, most consequential application of sequence-to-sequence generation — the one that literally gave us the attention mechanism and the transformer architecture — is machine translation. And the way we evaluate translation output reveals deep problems that apply to all text generation.

If the discomfort of not knowing what happens on the translation side is nagging at you, read on.

Machine Translation: The Original Sequence Problem

Translation is where sequence-to-sequence models cut their teeth. The attention mechanism was invented to solve a translation problem. The transformer was introduced in a translation paper. The "T" in GPT stands for "Transformer," which exists because of translation research. Understanding the history helps you understand why modern NLP looks the way it does.

The earliest attempts at machine translation (1950s–1980s) were rule-based. Linguists hand-wrote thousands of grammar rules, morphological transformations, and bilingual dictionaries. To translate English to Russian, you'd parse the English sentence into a syntax tree, apply transformation rules to restructure it into Russian syntax, then look up words in the dictionary. This worked for narrow domains like weather reports. It fell apart everywhere else. Idioms don't translate literally. Word order varies wildly across languages. "Bank" means riverbank or financial institution depending on context, and rules can't resolve that ambiguity reliably. Every new language pair required a completely new set of rules.

Statistical MT (1990s–2015) replaced handcrafted rules with learned probabilities. The key insight was the noisy channel model: assume the foreign sentence is a "corrupted" version of an English sentence, and find the English sentence most likely to have produced it. The math: argmax_e P(e|f) = argmax_e P(f|e) · P(e). P(f|e), the translation model, was learned from parallel corpora — datasets of aligned sentence pairs in both languages. P(e), the language model, was learned from English text alone, ensuring fluent output. Phrase-based SMT extended this from word-level to phrase-level alignment — learning that "in spite of" translates to "malgré" as a unit, not word by word. Google Translate used phrase-based SMT from 2006 to 2016.

Neural MT (2014–present) replaced the entire statistical pipeline with a single neural network. The first version (Sutskever et al., 2014) used an encoder-decoder architecture: an RNN reads the source sentence and compresses it into a single fixed-length vector, then a second RNN generates the target sentence from that vector. The fatal flaw: all information about a 40-word sentence must pass through one vector. Short sentences worked fine. Long sentences fell apart — the decoder couldn't remember the beginning by the time it reached the end.

Attention (Bahdanau et al., 2015) solved this by letting the decoder peek back at every encoder state at every generation step. Instead of one compressed vector, the decoder gets a weighted view of the entire source sentence, with weights that shift depending on which target word it's currently generating. This was the breakthrough that changed everything. The transformer (Vaswani et al., 2017) took it further, replacing RNNs entirely with self-attention and cross-attention. Every modern language model descends from this architecture.

Parallel Corpora and the Training Data

A parallel corpus is a dataset of sentence pairs: the same meaning expressed in two languages. The European Parliament proceedings (Europarl), the United Nations parallel corpus, and web-crawled datasets like ParaCrawl contain millions of such pairs. These are the fuel for training translation models.

The catch: high-quality parallel data is scarce for most language pairs. There's abundant English↔French data because the EU has been translating parliamentary proceedings for decades. But English↔Yoruba? English↔Quechua? Almost nothing. This data scarcity for low-resource languages is one of the central challenges in translation, and it drives many of the architectural decisions in modern multilingual models.

Back-translation is a clever workaround: train a target→source model, use it to translate monolingual target-language text back to the source language, and use those synthetic pairs as additional training data. The translations are imperfect, but the model still learns useful patterns from them. This technique, introduced by Sennrich et al. in 2016, remains one of the most effective ways to overcome data scarcity.

Measuring Translation: BLEU, METEOR, chrF, COMET

I'll be honest — the first time I computed a BLEU score and got 0.32, I had no idea whether that was good or terrible. The number felt completely unanchored. Here's what I wish someone had told me.

BLEU (Bilingual Evaluation Understudy) is the oldest and most widely reported automatic metric for translation quality. It measures n-gram precision: what fraction of the n-grams (contiguous word sequences) in the machine translation also appear in a human reference translation?

Let's compute it by hand. Suppose our model translates a French sentence and produces: "The cat is sitting on the mat." The reference translation is: "The cat sat on the mat."

Machine output:  The cat is sitting on the mat .
Reference:       The cat sat on the mat .

1-gram matches: The ✓, cat ✓, is ✗, sitting ✗, on ✓, the ✓, mat ✓, . ✓
  → 6 out of 8 = 0.75

2-gram matches: "The cat" ✓, "cat is" ✗, "is sitting" ✗, "sitting on" ✗,
                "on the" ✓, "the mat" ✓, "mat ." ✓
  → 4 out of 7 = 0.57

3-gram matches: "The cat is" ✗, "cat is sitting" ✗, "is sitting on" ✗,
                "sitting on the" ✗, "on the mat" ✓, "the mat ." ✓
  → 2 out of 6 = 0.33

4-gram matches: "The cat is sitting" ✗, "cat is sitting on" ✗,
                "is sitting on the" ✗, "sitting on the mat" ✗,
                "on the mat ." ✓
  → 1 out of 5 = 0.20

BLEU takes the geometric mean of the 1-gram through 4-gram precisions and multiplies by a brevity penalty (to punish translations that are too short — you could cheat by outputting a single high-confidence word). The final score: roughly 0.41 in this case. That's actually decent for a single sentence.
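
Here is a sketch that reproduces the hand computation: clipped n-gram precision for n = 1 through 4, geometric mean, brevity penalty. The tokenization is a simple whitespace split, matching the example.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(len(ngrams(cand, n)), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean

print(round(bleu("The cat is sitting on the mat .",
                 "The cat sat on the mat ."), 2))    # ≈ 0.41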

BLEU ranges from 0 to 1 (often reported as 0 to 100). A BLEU of 30+ is generally understandable. 50+ is high quality. Human translators score 60–80 against other humans — not 100, because there are many valid ways to translate any sentence, and no two translators make the same choices.

BLEU's great virtue is that it's fast, cheap, and reproducible. Its great flaw is that it only captures surface-level word overlap. Two equally good translations that use different vocabulary will score poorly against each other. "The automobile halted" and "The car stopped" share almost no n-grams, but they mean the same thing.

METEOR (Metric for Evaluation of Translation with Explicit ORdering) addresses this by incorporating synonyms, stems, and paraphrases. It aligns words between the output and reference using exact matches first, then stems ("running" matches "ran"), then synonyms (from WordNet). It computes both precision and recall (BLEU only uses precision), and applies a penalty for word-order differences. METEOR correlates better with human judgments than BLEU, especially at the sentence level.

chrF (character n-gram F-score) takes a completely different approach: it works at the character level instead of the word level. It computes the F-score over character n-grams — contiguous sequences of characters. This has a wonderful side effect: it handles morphologically rich languages far better than BLEU. In German, Finnish, or Turkish, a single word can take dozens of inflected forms. "Translated" and "translating" share zero word-level n-grams but many character n-grams ("translat", "ranslati", "anslatin"...). chrF captures that overlap. It also works for languages without clear word boundaries, like Chinese, where tokenization itself is a contentious problem.

COMET (Crosslingual Optimized Metric for Evaluation of Translation) represents the newest generation: a learned metric. It feeds the source sentence, the machine translation, and the reference into a multilingual neural encoder, and predicts a quality score. Because it was trained on human quality judgments, it captures meaning — paraphrases score well even without word overlap. COMET correlates significantly better with human evaluation than BLEU, chrF, or METEOR. The cost: you need a GPU to run it, and it's a black box.

My recommendation for practice: report BLEU for comparability with the literature (it's the lingua franca of MT evaluation, whether we like it or not). Add chrF for morphologically rich target languages. Use COMET when you need the most accurate automatic quality assessment. And for anything high-stakes, include human evaluation — no automatic metric is a perfect proxy for whether the translation actually communicates the right meaning.

Multilingual Models: One Model, Many Languages

Early neural MT required a separate model for every language pair. English→French: one model. English→German: another. French→German: a third. With 100 languages, that's 9,900 separate models. This is, to put it mildly, unsustainable.

The breakthrough insight was that languages share deep structure. The concept of agent-action-object exists across almost every human language, even when the surface word order differs. A model that learns English→French translation picks up patterns — how to map between verb tenses, how to handle articles, how adjective placement works — that transfer to English→Spanish, because French and Spanish share Latin roots and similar grammar.

mBART (multilingual BART) extended this by pretraining a single encoder-decoder transformer with a denoising objective across 25 languages: corrupt text in any language, reconstruct it. This forced the model to learn the internal structure of each language. Fine-tuning for translation then required far fewer parallel sentence pairs, because the model already understood how each language works. A target-language tag (a special token like <2fr> for French) told the decoder which language to generate.

NLLB (No Language Left Behind), released by Meta in 2022, scaled this idea to over 200 languages — including many low-resource languages that previous systems had ignored entirely. NLLB uses a sparse mixture-of-experts architecture: different expert sub-networks activate for different language families. Adding Zulu doesn't degrade French performance, because they use largely different experts. For many low-resource language pairs, NLLB improved BLEU scores by over 44% compared to the previous state-of-the-art. That's not an incremental gain. That's a paradigm shift for communities whose languages had been effectively invisible to translation technology.

A particularly surprising capability of multilingual models is zero-shot translation: translating between a language pair the model was never explicitly trained on. If a model learned English↔French and English↔German, can it translate French→German directly? Often, yes. The shared internal representation acts as a kind of universal meaning space — an interlingua — that bridges languages without explicit bridge data. But the quality is typically lower than trained directions, and the model sometimes produces off-target translation: output in the wrong language, usually defaulting to English because English dominated the training data.

I'm still developing my intuition for why zero-shot translation works as well as it does. The conventional explanation — "the model learns a shared meaning space" — feels right but vague. The fact that it works at all, without ever seeing a single French-German sentence pair, is remarkable. The fact that it sometimes fails by slipping into English is a reminder that these models are pattern matchers, not true understanding machines.

Constrained Generation and Structured Output

So far, we've treated generation as open-ended: give the model a prompt, let it produce whatever tokens it wants, and shape the output with temperature and sampling. But many real-world applications need the model to produce output that follows a specific structure — valid JSON, a particular schema, SQL queries, function calls. The model needs guardrails.

Constrained decoding enforces structure at the token level. At each generation step, before sampling, you check which tokens are valid according to a grammar — a formal specification of what sequences are allowed. If you're generating JSON and you've produced {"name": ", the only valid next tokens are characters that could appear in a string value. A closing brace would be syntactically invalid at this point, so its probability is set to zero. The model is forced to follow the grammar.
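
Here is a toy version of that masking step. The "grammar" is reduced to a set of allowed tokens for the current state; real libraries compile a full grammar or schema into a finite state machine, but the per-step mask looks essentially like this.

import math

def mask_invalid(logits, allowed_tokens):
    # Set disallowed tokens to -inf so that, after softmax,
    # their probability is exactly zero.
    return {tok: (v if tok in allowed_tokens else -math.inf)
            for tok, v in logits.items()}

# Hypothetical state: we are inside a JSON string value, so only string
# characters and the closing quote are allowed; structural tokens are not.
logits = {'"': 1.2, '}': 2.0, 'a': 0.5, '{': -0.3, ':': 0.1}
print(mask_invalid(logits, allowed_tokens={'"', 'a'}))
# '}' and the other structural tokens get -inf and can never be sampled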

Libraries like Outlines and Microsoft Guidance implement this by compiling a grammar (context-free grammar, regular expression, or JSON schema) into a finite state machine. At each step, the FSM says which tokens are valid, and only those get nonzero probability. The output is guaranteed to be valid — not "usually valid" or "valid if you ask nicely in the prompt," but mathematically guaranteed.

This is a different philosophy from prompt engineering, where you write "Please return valid JSON" and hope for the best. Constrained decoding doesn't hope. It enforces. OpenAI's function calling and structured outputs features use similar ideas under the hood — the model's output is constrained to match a specified schema.

The tradeoff is speed. Maintaining a grammar state and masking invalid tokens at every step adds overhead to each generation step. For complex grammars, this can be significant. And there's a subtler cost: constraining the model can force it into awkward phrasings or suboptimal token sequences that it wouldn't have chosen freely. The model might "know" a better way to express something, but the grammar won't allow it.

Watermarking Generated Text

As language models produce more of the text we read, a natural question arises: can we tell whether a piece of text was written by a model or a human? Watermarking embeds a detectable but invisible signal in generated text, making it possible to identify machine-generated content after the fact.

The most influential watermarking scheme (Kirchenbauer et al., 2023) works at generation time, not after. Here's the core idea. At each generation step, a pseudorandom number generator (seeded by the previous token and a secret key) splits the entire vocabulary into two sets: a green list and a red list. The split is roughly 50/50. Before sampling, the model adds a small bonus to the logits of all green-list tokens, making them slightly more likely to be chosen.

Any single token looks perfectly natural — the bias is small enough that no individual word choice seems suspicious. But over hundreds of tokens, the statistical signal accumulates. A detector who knows the secret key can reconstruct the green/red split for each position, count how many tokens landed in their respective green lists, and compare to the expected rate. Watermarked text will show a statistically significant overrepresentation of green-list tokens. Unwatermarked text won't.
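
Here is a toy sketch of both sides of the scheme, assuming the green list is derived from a hash of the previous token and a secret key. Real implementations operate on tensor logits over a vocabulary of tens of thousands of tokens and use a proper statistical test; this only shows the shape of the idea.

import hashlib
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]
SECRET_KEY = "my-secret"    # hypothetical key shared with the detector

def green_list(prev_token, key=SECRET_KEY):
    # Pseudorandomly split the vocabulary, seeded by the previous token and the key.
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16)
    shuffled = VOCAB[:]
    random.Random(seed).shuffle(shuffled)
    return set(shuffled[: len(shuffled) // 2])    # roughly half the tokens are "green"

def watermark_logits(logits, prev_token, delta=2.0):
    # Generation side: add a small bonus to every green-list token.
    greens = green_list(prev_token)
    return {tok: v + (delta if tok in greens else 0.0) for tok, v in logits.items()}

def green_fraction(tokens, key=SECRET_KEY):
    # Detection side: how often did each token land in its green list?
    # Watermarked text shows far more than the ~50% expected by chance.
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:]) if tok in green_list(prev, key))
    return hits / max(len(tokens) - 1, 1)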

I find the elegance of this idea striking, and I'm still developing my intuition for its robustness. Minor edits (fixing typos, changing a few words) don't destroy the watermark — the signal is distributed across hundreds of tokens. Heavy paraphrasing or running the text through a separate model can break it. The length requirement (roughly 200+ tokens for reliable detection) means watermarking doesn't work well for short outputs like single-sentence answers. And it only works if the generator embeds the watermark — you can't retroactively watermark text that was generated without it.

The Unsolved Problem: Evaluating Open-Ended Generation

For translation, we have BLEU, chrF, and COMET — imperfect but useful. For open-ended generation — stories, dialogue, essays, brainstorming — we have almost nothing that works well automatically. This is one of the genuinely hard unsolved problems in NLP, and I think anyone who claims to have a good solution is overselling it.

The fundamental challenge: there is no reference to compare against. If I ask a model to "write a short story about a lonely astronaut," there are infinitely many good stories it could produce. What counts as "good"? Fluency? Coherence? Creativity? Emotional resonance? Factual accuracy (for non-fiction)? These are different dimensions, and they often conflict — the most creative response may sacrifice coherence, and the most coherent response may be boring.

Perplexity measures how surprised the model is by a text — specifically, the exponentiated average negative log-likelihood per token. Lower perplexity means the model found the text more predictable. It's useful for comparing language models on the same test set, but it tells you nothing about whether the generated text is good. A model can have low perplexity and still produce repetitive, dull output. A creative, surprising text might have high perplexity precisely because it's interesting.
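
In code, given the probability the model assigned to each token of a text, perplexity is just this (the probability values below are made up for illustration):

import math

def perplexity(token_probs):
    # Exponentiated average negative log-likelihood per token.
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

print(round(perplexity([0.70, 0.60, 0.25]), 2))   # 2.12: the model found this predictable
print(round(perplexity([0.05, 0.02, 0.10]), 2))   # 21.54: the model was far more surprised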

MAUVE (Pillutla et al., 2021) takes a distributional approach: it measures how similar the distribution of generated text is to the distribution of human-written text, using a neural embedding space. If the model's outputs occupy the same region of embedding space as human writing, MAUVE is high. It captures both quality and diversity — a model that produces only one great sentence repeatedly will score low on diversity, and a model that produces diverse garbage will score low on quality. MAUVE correlates better with human judgments than perplexity for open-ended generation.

Distinct-n measures diversity by counting the number of unique n-grams as a fraction of total n-grams. A model that keeps repeating itself will have low distinct-1 and distinct-2 scores. Self-BLEU measures diversity by computing BLEU between pairs of generated outputs — if all outputs are similar, self-BLEU is high (which is bad).
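
Distinct-n is simple enough to compute in a few lines; the example outputs are made up to show the effect of repetition.

def distinct_n(texts, n):
    # Unique n-grams across all outputs divided by total n-grams.
    all_ngrams = []
    for text in texts:
        tokens = text.lower().split()
        all_ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(all_ngrams)) / max(len(all_ngrams), 1)

outputs = ["the cat sat on the mat",
           "the cat sat on the mat",
           "the mat sat on the cat"]
print(round(distinct_n(outputs, 2), 2))   # 0.40: the outputs keep reusing the same bigrams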

But all of these are proxies. The honest assessment: for open-ended generation, human evaluation remains the gold standard. Have multiple human evaluators rate generated text on dimensions like fluency, coherence, informativeness, and engagement. Use Likert scales or pairwise comparisons (which text is better, A or B?). It's expensive, slow, and noisy — different evaluators disagree, and even the same evaluator might rate the same text differently on different days. But it's the closest thing we have to ground truth for whether generated text is actually good.

My favorite thing about the evaluation problem is that it forces us to confront what we actually mean by "good text." We don't have a formula for it. We might never have one. And that might be okay — it means there's something irreducibly human about judging language quality that we haven't captured in a metric yet.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a five-word vocabulary and a probability distribution, traced through decoding strategies from greedy to beam search, and saw how each one trades off between creativity and correctness. We crossed over into machine translation — the original sequence-to-sequence task — and followed the arc from handcrafted rules through statistical models to neural attention. We computed BLEU by hand, learned why chrF handles German better, and saw why COMET is slowly replacing them all. We explored how constrained decoding guarantees structured output, how watermarking hides a statistical signal in plain sight, and why evaluating open-ended generation remains genuinely unsolved.

My hope is that the next time you adjust a temperature parameter or choose between beam search and nucleus sampling, instead of treating those knobs as magic, you'll have a concrete picture of what's happening to the probability distribution at each step — one token at a time, one choice at a time, building a sequence from scratch.

Resources

Holtzman et al., "The Curious Case of Neural Text Degeneration" (2020) — the paper that introduced nucleus sampling and showed why greedy and beam search produce degenerate text. Wildly influential and very readable.

Vaswani et al., "Attention Is All You Need" (2017) — the O.G. transformer paper. Every modern language model traces back to this. Dense but essential.

Papineni et al., "BLEU: a Method for Automatic Evaluation of Machine Translation" (2002) — the paper that launched a thousand evaluation debates. Understanding BLEU means understanding the last two decades of MT evaluation.

Kirchenbauer et al., "A Watermark for Large Language Models" (2023) — the green/red list watermarking paper. Elegant idea, well written, and increasingly relevant as AI-generated text becomes ubiquitous.

NLLB Team, "No Language Left Behind" (2022) — Meta's 200-language translation paper. Insightful on the challenges of low-resource translation and the mixture-of-experts approach to scaling.

HuggingFace Generation documentation — the most practical reference for implementing all the decoding strategies we covered. Updated regularly with new techniques. Indispensable once you've used it in production.