Embeddings & Tokenization

Chapter 11: Natural Language Processing · Word Embeddings · Subword Tokenization

I avoided thinking about word embeddings for an embarrassingly long time. Every time I saw a diagram of vectors floating in space with arrows labeled "king − man + woman = queen," I'd nod, say "neat," and move on without actually understanding how raw text becomes a list of numbers that a neural network can chew on. I told myself the details didn't matter — that I could treat embeddings as a black box and tokenizers as someone else's problem. Eventually the discomfort of not knowing what happens between "the user typed a sentence" and "the model saw a tensor" grew too great to ignore. Here is that dive.

Word embeddings are dense numerical representations of words — compact vectors, typically 100 to 1,024 dimensions, where geometric proximity encodes semantic similarity. The idea traces back to distributional semantics in the 1950s, but the modern revolution started in 2013 with Word2Vec and never really stopped. Tokenization is the less glamorous sibling: the process of deciding where to cut a string of text into pieces before handing those pieces to an embedding layer. Together, embeddings and tokenization form the front door of every language model. Nothing else works until they do.

Before we start, a heads-up. We're going to build everything from scratch — one-hot vectors, prediction tasks, co-occurrence matrices, merge algorithms — but you don't need to know any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

What We'll Cover

One-hot encoding and why it fails
The distributional hypothesis
Word2Vec: Skip-gram and CBOW
The softmax bottleneck and negative sampling
GloVe: the global co-occurrence view
FastText: subword-aware embeddings
Cosine similarity and embedding geometry
Rest Stop
The polysemy problem
ELMo: context enters the picture
From ELMo to BERT: contextual embeddings arrive
The vocabulary problem and why tokenization matters
Byte Pair Encoding (BPE)
WordPiece
Unigram and SentencePiece
Byte-level BPE
Vocabulary size and embedding dimension tradeoffs
Positional embeddings: giving order to orderless models

One-Hot Encoding and Why It Fails

Imagine we're building a movie review analyzer. It reads what people write about films and decides whether they liked it or not. Our first challenge: the model only speaks numbers, and our users speak English. We need a translation layer.

Let's start with a tiny vocabulary of five words: "love," "hate," "great," "movie," and "boring." The most straightforward way to convert these into numbers is one-hot encoding — give each word a vector as long as the vocabulary, with a single 1 in the position assigned to that word and 0s everywhere else.

# Our tiny vocabulary of 5 words
#           love  hate  great  movie  boring
# love    = [1,    0,    0,     0,     0   ]
# hate    = [0,    1,    0,     0,     0   ]
# great   = [0,    0,    1,     0,     0   ]
# movie   = [0,    0,    0,     1,     0   ]
# boring  = [0,    0,    0,     0,     1   ]

Five words, five dimensions. Each word lives at one corner of a five-dimensional hypercube. Let's measure how related "love" and "great" are — two words that feel similar in a movie review context. The dot product of their one-hot vectors is 0. Now check "love" and "boring." Also 0. Every distinct word is equally unrelated to every other word: same Euclidean distance, same zero dot product. Our representation has told the model that "love" is as unrelated to "great" as it is to "boring."
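
To make this concrete, here's a quick NumPy check on the toy vocabulary above; the variable names are just for illustration.

import numpy as np

# One-hot vectors for our 5-word vocabulary
vocab = ["love", "hate", "great", "movie", "boring"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Every pair of distinct words has dot product 0 -- no notion of similarity at all
print(one_hot["love"] @ one_hot["great"])   # 0.0
print(one_hot["love"] @ one_hot["boring"])  # 0.0
print(one_hot["love"] @ one_hot["love"])    # 1.0 (a word only matches itself)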

That's the first problem: one-hot vectors carry zero information about meaning. They're glorified ID numbers with extra zeros.

The second problem is scale. A realistic English vocabulary has 50,000 to 100,000 words. That means every single word is a vector with 100,000 dimensions, of which 99,999 are zero. Multiply by the number of words in your dataset, and you're storing vast oceans of nothing. It's like assigning every person on Earth their own planet and then trying to build a social network between them — the address system tells you nothing about who lives near whom.

What we want instead is a representation where "love" and "great" are close together, "hate" and "boring" are close together, and "love" and "boring" are far apart. We want the geometry of the vector space to mirror the geometry of meaning. That's what word embeddings give us: dense vectors, typically 100 to 300 dimensions, where proximity encodes similarity.

The question is: how do we learn them?

The Distributional Hypothesis

The answer comes from a deceptively powerful idea from linguistics. In 1957, J.R. Firth wrote: "You shall know a word by the company it keeps." This is the distributional hypothesis — words that appear in similar contexts tend to have similar meanings.

Let's go back to our movie reviews. Suppose our dataset contains these sentences:

"I love this great movie"
"I love this amazing movie"
"I hate this boring movie"
"I hate this terrible movie"

Look at the slots around "great" and "amazing." They both appear between "this" and "movie," preceded by "love." Look at "boring" and "terrible." Same pattern, but preceded by "hate." Without anyone telling the model what these words mean, the contexts alone reveal that "great" and "amazing" play similar roles, and "boring" and "terrible" play similar roles. And that the first group is associated with "love" while the second is associated with "hate."

The distributional hypothesis is not a proof that context equals meaning. It's an empirical observation that works shockingly well in practice. If two words consistently show up in the same neighborhoods, surrounded by the same neighbors, they're almost certainly related. Not identical — "big" and "small" also share contexts ("the ___ house") even though they mean opposite things. We'll come back to that wrinkle. But as a starting point, context is remarkably informative.

The challenge now is to turn this linguistic insight into a mathematical training procedure. That's what Tomas Mikolov and colleagues at Google did in 2013, with a system called Word2Vec.

Word2Vec: Learning Embeddings from Prediction

Skip-gram — Predict the Neighbors

The Skip-gram model asks a peculiar question. Given a single word, can you predict which words appear near it? On the surface, this seems like a bizarre task. But the attempt to answer it forces the model to learn rich representations.

Here's the setup. We take a sentence from our movie reviews — say, "I love this great movie" — and slide a window across it. If the window size is 2, we look two words in each direction. When the window is centered on "great," the context words are "love" and "this" on the left and "movie" on the right ("great" sits near the end of the sentence, so there's only one word to its right). The model's task: given "great" as input, predict each of those neighbors.
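
That windowing step is easy to sketch in plain Python. The helper below is illustrative, not part of any library.

def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        # Look up to `window` positions to the left and right of the center word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "I love this great movie".split()
for center, context in skipgram_pairs(sentence):
    if center == "great":
        print(center, "->", context)   # great -> love, great -> this, great -> movie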

Internally, the model has two weight matrices. The first is the embedding matrix — it maps each word's one-hot vector to a dense embedding. This is the matrix we actually care about; it's what we'll keep after training. The second is the context matrix — it maps from embedding space back to vocabulary space to produce predictions. We usually throw this one away.

When we feed "great" into the model, it looks up the embedding for "great" (a row in the embedding matrix), then tries to use that vector to predict "love," "this," and "movie." The loss function measures how well it did. Gradients flow backward and nudge the embedding for "great" — pushing it closer to words it appears near and further from words it doesn't.

We repeat this for every word in every sentence across the entire corpus. Billions of these tiny nudges. After enough passes, words that consistently appear in similar neighborhoods end up with similar vectors. "Great" and "amazing" converge because they both predict the same context words. "Great" and "refrigerator" diverge because their neighborhoods look nothing alike.

That's Skip-gram. It sounds almost too simple to work. And yet the vectors it produces capture nuanced semantic relationships that no one explicitly programmed.

CBOW — Predict the Center

Continuous Bag of Words (CBOW) is Skip-gram's mirror image. Instead of using one word to predict its neighbors, we use the neighbors to predict the one word. Given "love," "this," and "movie," predict "great."

The context word embeddings get averaged into a single vector — this is the "bag of words" part; we throw away word order within the context window. That averaged vector is then used to predict the center word.

CBOW trains faster because it makes one prediction per window position instead of several (Skip-gram makes one prediction per context word). But Skip-gram tends to produce better embeddings for rare words. Here's why: in Skip-gram, a rare word appears as the center many times, each time paired with a different context word. It gets many gradient updates. In CBOW, a rare word in the center still generates one prediction, but a rare word in the context gets averaged with common words and its signal gets diluted.

In practice, Skip-gram is more commonly used, especially for smaller datasets. The quality gap shrinks as data grows. Both produce good embeddings. The bigger question is computational — which brings us to a serious bottleneck.

The Softmax Bottleneck and Negative Sampling

Here's the problem that nearly killed Word2Vec before it started. To predict a context word, the model needs to produce a probability distribution over the entire vocabulary. For each training example, it computes a score for every word, exponentiates all of them, and divides by the sum. That's a softmax over 100,000+ words. For every single window position. For every sentence in a corpus of billions of words.

The computational cost is O(V) per training step, where V is the vocabulary size. This is brutal. Training would take years.

Negative sampling is the trick that makes Word2Vec feasible. Instead of asking "which word among 100,000 is most likely?", it asks a much easier question: "is this specific word-context pair real, or did I make it up?"

For each real pair — like ("great", "movie") from our review corpus — we grab k random words from the vocabulary that almost certainly weren't in the context. Maybe we grab "elephant," "Tuesday," and "carburetor." Now we have one positive example and k negative examples. We train a binary classifier: push the dot product of "great" and "movie" up (they co-occur), push the dot products of "great" with each noise word down (they don't).

# Negative sampling objective for one (center, context) pair:
#
# Maximize:  log σ(v_movie · v_great)          ← real pair, push together
#          + log σ(-v_elephant · v_great)       ← fake pair, push apart
#          + log σ(-v_tuesday · v_great)        ← fake pair, push apart
#          + log σ(-v_carburetor · v_great)     ← fake pair, push apart
#
# σ is the sigmoid function
# k = 3 negative samples in this example (typically 5-15)

Each training step now touches k + 1 words instead of 100,000. That's a reduction from O(V) to O(k), where k is typically 5 to 15. Larger datasets need fewer negative samples (k = 5 works well); smaller datasets benefit from more (k = 15).
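
As a sketch of what one training step now computes, here is the negative-sampling loss for a single (center, context) pair in NumPy, with random vectors standing in for the real embedding rows.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim = 100
v_center = rng.normal(size=dim)           # embedding row for "great"
v_context = rng.normal(size=dim)          # context row for "movie" (the real neighbor)
v_negatives = rng.normal(size=(3, dim))   # context rows for k = 3 sampled noise words

# Pull the real pair together, push the sampled noise pairs apart
loss = -np.log(sigmoid(v_context @ v_center))
loss -= np.sum(np.log(sigmoid(-(v_negatives @ v_center))))
print(loss)   # one positive + k negatives: the step never touches the rest of the vocabulary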

One subtle detail: the negative words aren't sampled uniformly. They're sampled proportional to their frequency raised to the 3/4 power. I'll be honest — I'm still developing my intuition for why 3/4 specifically. Mikolov's team tried several exponents and 3/4 won empirically. The effect is to slightly flatten the frequency distribution: common words like "the" still get sampled more often as negatives (which makes sense — they provide useful contrast), but rare words get sampled more than raw frequency would suggest (which helps their embeddings train faster). The exponent 3/4 isn't theoretically derived; it's an empirical sweet spot that has proven robust across many settings.
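
To see what the 3/4 power actually does, here's a small sketch with made-up word counts.

import numpy as np

words = ["the", "movie", "great", "boring", "exquisite"]
counts = np.array([50000, 5000, 1000, 500, 10], dtype=float)

raw_freq = counts / counts.sum()
smoothed = counts ** 0.75
sampling_prob = smoothed / smoothed.sum()

for word, p_raw, p_neg in zip(words, raw_freq, sampling_prob):
    print(f"{word:10s}  raw {p_raw:.4f}   negative-sampling {p_neg:.4f}")
# Common words are sampled somewhat less often than raw frequency suggests,
# rare words somewhat more. The distribution is flattened, not replaced.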

One more trick worth mentioning: subsampling frequent words. Words like "the" and "is" appear in nearly every context window. They contribute very little information — "movie" appearing near "the" tells us nothing — but they dominate the training data. Word2Vec randomly discards high-frequency words during training, with a drop probability that increases with word frequency. This speeds up training and, counterintuitively, improves embedding quality because the remaining pairs are more informative.
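
The discard rule from the Word2Vec paper is short enough to write out. Here f is the word's corpus frequency as a fraction, and t is a threshold, typically around 1e-5.

import numpy as np

def keep_probability(f, t=1e-5):
    """Word2Vec discards a word with probability 1 - sqrt(t / f); this is the keep side."""
    return min(1.0, np.sqrt(t / f))

print(keep_probability(0.05))     # a "the"-like word: kept only ~1.4% of the time
print(keep_probability(0.0001))   # a mid-frequency word: kept ~32% of the time
print(keep_probability(1e-6))     # a rare word: always kept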

The King–Queen Magic (and Its Limits)

The most famous result from Word2Vec is the analogy test. Take the vector for "king," subtract the vector for "man," add the vector for "woman," and search for the nearest neighbor. You get "queen."

I still find it wild that this works. Nobody told the model about gender. Nobody told it about royalty. But the co-occurrence patterns in billions of words of text were enough for the model to carve out a "gender" direction in vector space and a "royalty" direction, such that arithmetic on those directions produces sensible results.

# vec("king") - vec("man") + vec("woman") ≈ vec("queen")
#
# The vector difference (king - man) encodes something like
# "royalty without gender." Adding it to "woman" gives
# "royalty + female" → queen.
#
# Similarly:
# vec("paris") - vec("france") + vec("germany") ≈ vec("berlin")
# vec("bigger") - vec("big") + vec("cold") ≈ vec("colder")

But let's be careful about what this tells us. The king–queen analogy works because gender is a strong, consistent signal in the training data. Many other analogies fail silently. And some succeed for the wrong reasons: "doctor" minus "man" plus "woman" often returns "nurse" — not because that's the correct analogy, but because the training corpus is saturated with societal stereotypes. The embeddings are a mirror of the data they were trained on, including its biases. Bolukbasi et al. showed in 2016 that Word2Vec associates "programmer" with male names and "homemaker" with female names. If you deploy embeddings in a real system, auditing them for harmful associations is not optional.

GloVe: The Global View

Word2Vec learns by sliding a small window across text, one example at a time. It never steps back to look at the full picture. In 2014, Pennington, Socher, and Manning at Stanford asked: what if we started with the global picture instead?

GloVe (Global Vectors for Word Representation) begins by building a co-occurrence matrix. This is a giant table — one row per word, one column per word — where entry X_ij counts how often word i appears near word j across the entire corpus. For our movie review vocabulary, it might look like this:

#           love  hate  great  movie  boring
# love      -     0     12     45     1
# hate      0     -     1      38     15
# great     12    1     -      30     0
# movie     45    38    30     -      20
# boring    1     15    0      20     -
#
# (numbers are illustrative)

"Love" co-occurs heavily with "movie" (45 times) and "great" (12 times) but rarely with "boring" (once). "Hate" co-occurs heavily with "movie" (38) and "boring" (15) but rarely with "great" (once). The structure of meaning is already visible in this table — GloVe's job is to compress it into dense vectors.

The model learns vectors w_i and w_j (plus bias terms) such that their dot product approximates the logarithm of the co-occurrence count:

# GloVe objective:
# w_i · w_j + b_i + b_j  ≈  log(X_ij)
#
# The loss weights each pair by f(X_ij), a function that:
# - Caps the influence of very frequent pairs (so "the"+"the" doesn't dominate)
# - Ignores very rare pairs (noise)
# f(x) = (x / x_max)^0.75  if x < x_max, else 1

Why the logarithm? Because raw co-occurrence counts span many orders of magnitude. "The" might co-occur with "movie" 50,000 times while "exquisite" co-occurs with "movie" 3 times. Working in log-space compresses this range and makes the optimization better behaved.
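
Putting the objective and the weighting function together, the loss contribution of a single co-occurrence cell looks like this, sketched in NumPy with random vectors standing in for the learned ones.

import numpy as np

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted squared error for one cell of the co-occurrence matrix."""
    weight = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
    error = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return weight * error ** 2

rng = np.random.default_rng(0)
w_love, w_movie = rng.normal(size=100), rng.normal(size=100)
print(glove_pair_loss(w_love, w_movie, 0.0, 0.0, x_ij=45))   # "love" near "movie" 45 times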

The key insight from the GloVe paper — and this is genuinely elegant — is about ratios of co-occurrence probabilities. Consider three words: "ice," "steam," and "solid." The ratio P(solid | ice) / P(solid | steam) is large — "solid" co-occurs much more with "ice" than with "steam." The ratio P(gas | ice) / P(gas | steam) is small. And the ratio P(water | ice) / P(water | steam) is close to 1 — "water" is equally relevant to both. These ratios encode the distinction between "ice" and "steam" more cleanly than raw probabilities. GloVe's objective is designed to make dot products of word vectors reflect these ratios.

Here's the part that surprised me when I first learned it: Levy and Goldberg showed in 2014 that Word2Vec with negative sampling is implicitly factorizing a shifted version of the pointwise mutual information matrix — which is closely related to a log co-occurrence matrix. In other words, Word2Vec and GloVe are doing more similar things than their architectures suggest. One approaches the problem through local prediction, the other through global factorization, but they converge to similar solutions. In practice, well-tuned Word2Vec and well-tuned GloVe produce embeddings of comparable quality on most benchmarks. The choice often comes down to which pretrained vectors are available, not which algorithm is inherently superior.

FastText: Breaking Words into Pieces

Both Word2Vec and GloVe have a quiet but devastating flaw: they treat every word as an atomic, indivisible unit. The word "loving" has no connection to "love" in the model's eyes — it's a completely separate entry in the vocabulary. And if a word never appeared in the training data? No vector. Game over. The model outputs a useless [UNK] token.

Let's feel this pain in our movie review analyzer. A user writes: "unbelievably loveable cinematography." The word "unbelievably" wasn't in our training corpus. Neither was "loveable" (we had "lovable" with a different spelling). Word2Vec gives up on both. Two out of three words — gone.

FastText, published by Bojanowski et al. at Facebook AI Research in 2017, fixes this by cracking words open. Instead of treating "loving" as one atomic symbol, FastText represents it as a bag of character n-grams — overlapping slices of 3 to 6 characters, plus the whole word itself:

# FastText decomposes "loving" into character n-grams
# (using angle brackets as word boundary markers)
#
# n=3:  <lo, lov, ovi, vin, ing, ng>
# n=4:  <lov, lovi, ovin, ving, ing>
# n=5:  <lovi, lovin, oving, ving>
# n=6:  <lovin, loving, oving>
# whole word: <loving>
#
# The word's embedding = sum of all these n-gram vectors

Each character n-gram gets its own vector. The word's embedding is the sum of all its n-gram vectors. Training proceeds exactly like Word2Vec's skip-gram with negative sampling — the only difference is how the center word's vector is computed.
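
Extracting the n-grams is purely mechanical. A sketch (the helper name is illustrative, not FastText's actual API):

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, plus the whole word, FastText style."""
    marked = f"<{word}>"
    grams = {marked}   # the whole word is kept as its own entry
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

# The word's vector is the sum of the vectors for each of these pieces
print(sorted(char_ngrams("loving")))
print(char_ngrams("loving") & char_ngrams("loved"))   # {'<lo', 'lov', '<lov'} tie the two words together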

This has two consequences that matter enormously. First, out-of-vocabulary words are no longer a death sentence. Even if "unbelievably" never appeared in training, FastText can build a reasonable vector from its character n-grams — many of which appeared in other words like "unbelievable," "incredibly," and "believable." The model has never seen this exact word, but it has seen most of its parts.

Second, morphological relationships emerge automatically. "Love," "loved," "loving," and "loveable" all share the n-grams "lov," "love," and others. Their vectors end up close in embedding space — not because anyone told the model about English verb conjugation, but because shared character sequences lead to shared n-gram vectors. This makes FastText particularly powerful for morphologically rich languages like Turkish, Finnish, and German. In Turkish, a single root verb can generate hundreds of inflected forms. Word2Vec treats each form as unrelated; FastText captures their structural kinship.

Coming back to our movie review analyzer: with FastText, "unbelievably" gets a reasonable vector (close to "unbelievable" and "incredibly"), "loveable" gets a vector close to "lovable" and "love," and the model can actually do its job. The address-book analogy holds here too — FastText doesn't need to have met someone personally; it can infer where they'd live on the map from their family name.

Cosine Similarity and Embedding Geometry

Now that we have word vectors, we need a way to measure how similar two words are. The standard tool is cosine similarity: the cosine of the angle between two vectors. It ranges from −1 (pointing in opposite directions) through 0 (perpendicular, unrelated) to 1 (pointing the same way).

# cos(a, b) = (a · b) / (||a|| × ||b||)
#
# Think of it as: how much do these vectors point
# in the same direction, ignoring how long they are?

Why cosine instead of Euclidean distance? Because cosine ignores magnitude and focuses on direction. This matters because embedding magnitude tends to correlate with word frequency — common words often have larger norms. We don't want "the" to be "similar" to everything; we want direction to carry the semantic signal. Cosine factors out the length.
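
In code, cosine similarity is one line plus two norms.

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: direction only, magnitude factored out."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))                         # 1.0  -- same direction, different length
print(cosine_similarity(a, np.array([-1.0, -2.0, -3.0])))  # -1.0 -- opposite direction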

Here's a result that trips people up: "good" and "bad" typically have moderate-to-high cosine similarity, around 0.5 to 0.7. That feels wrong — they're antonyms! But distributional similarity means "used in similar contexts," and antonyms fill identical syntactic slots: "the food was ___," "that was a ___ idea," "how ___ was it?" The model can't distinguish "similar meaning" from "similar usage." If you need a model that separates sentiment polarity, raw word embeddings aren't enough. You'll need a task-specific layer on top.

🛑 Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a solid mental model of static word embeddings: one-hot vectors are wasteful and meaning-blind. Word2Vec learns dense vectors by predicting which words appear near which, using negative sampling to make it tractable. GloVe achieves similar results by factorizing a global co-occurrence matrix. FastText extends the idea to sub-word character n-grams, handling misspellings and rare words gracefully. Cosine similarity measures how much two embeddings point in the same direction.

This model doesn't tell the complete story. Every method we've seen assigns exactly one vector per word, regardless of context. "Bank" gets the same embedding whether it means a financial institution or the side of a river. For many tasks, this ceiling is real and painful.

The short version of what comes next: contextual embeddings (ELMo, BERT) generate different vectors for the same word depending on the surrounding sentence. Then we'll tackle tokenization — how text gets chopped into pieces before embeddings ever enter the picture. There. You're 60% of the way through.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

The Polysemy Problem

Let's return to our movie review analyzer. A user writes: "The film's bank scenes were shot on location near the river bank." The word "bank" appears twice with two completely different meanings. With Word2Vec, GloVe, or FastText, both get the exact same vector — some weighted average of the financial and geographical senses, dominated by whichever appeared more in the training data. Our model can't tell the difference.

This is the polysemy problem — a single word with multiple meanings collapses into a single representation. "Play" as a theatrical performance gets the same vector as "play" as a verb meaning to engage in a game. "Light" as illumination, "light" as not heavy, "light" as to ignite — all the same vector. Language is riddled with these ambiguities, and static embeddings can't resolve them.

For many tasks, this limitation is tolerable. If 95% of "bank" in your corpus means the financial kind, the static embedding captures that dominant sense well enough for document classification. But for tasks that require precise understanding — question answering, named entity recognition, machine translation — a single vector per word hits a hard ceiling.

ELMo: Context Enters the Picture

ELMo (Embeddings from Language Models), published by Peters et al. in 2018, cracked this ceiling wide open. The core idea: instead of looking up a fixed vector from a table, run the entire sentence through a deep neural network and use the network's internal states as word representations. Because the network reads the whole sentence, each word's representation is shaped by every other word around it.

ELMo's architecture is a two-layer bidirectional LSTM trained on a language modeling objective. The forward LSTM reads left-to-right and predicts the next word. The backward LSTM reads right-to-left and predicts the previous word. At inference time, for each word position, ELMo produces three vectors: one from a character-level CNN at the bottom (capturing morphology — word shape and spelling patterns), one from the first LSTM layer (which researchers found captures syntax — part-of-speech, grammatical roles), and one from the second LSTM layer (capturing semantics — word sense, topic).

The final embedding is a learned weighted combination of all three layers. And here's the elegant part: different downstream tasks learn different weights. A part-of-speech tagger puts more weight on the first LSTM layer. A sentiment classifier puts more weight on the second. Each task finds its own optimal mix automatically.
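
Here's a sketch of that mixing step, assuming we already have the three per-layer vectors for one word. The softmax-normalized weights and the scale gamma are the task-specific parameters that get learned; the vectors and numbers below are made up for illustration.

import numpy as np

def elmo_combine(layer_vectors, layer_weights, gamma=1.0):
    """Task-specific weighted sum of per-layer representations (char CNN, LSTM 1, LSTM 2)."""
    w = np.exp(layer_weights) / np.exp(layer_weights).sum()   # softmax over the three layers
    return gamma * sum(w_i * v_i for w_i, v_i in zip(w, layer_vectors))

rng = np.random.default_rng(0)
layers = [rng.normal(size=1024) for _ in range(3)]   # one vector per layer for one word
task_weights = np.array([0.1, 2.0, 0.5])             # a POS tagger might favor the first LSTM layer
word_vector = elmo_combine(layers, task_weights)
print(word_vector.shape)   # (1024,)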

Now "bank" in "I deposited money at the bank" gets a vector influenced by "deposited," "money," and the financial context. "Bank" in "I sat on the river bank" gets a different vector, pulled toward geographical terrain by "river." The same word, different representations. That's contextual embedding.

ELMo improved the state of the art on six NLP benchmarks the day it was published. But its deeper contribution was conceptual: it proved that you don't need a fixed lookup table. You need a function that produces word vectors conditioned on context.

From ELMo to BERT: The Transformer Revolution

ELMo used bidirectional LSTMs. They work, but LSTMs process text sequentially — word by word — which limits both speed and the range of context they can effectively capture. In late 2018, Devlin et al. at Google published BERT (Bidirectional Encoder Representations from Transformers), which replaced LSTMs with self-attention — the mechanism at the heart of the Transformer architecture. Self-attention lets every word attend to every other word in the sentence simultaneously, regardless of distance. It's covered in depth in the chapter on Sequence Models and Attention; the relevant point here is that BERT's contextual embeddings are richer, faster to compute, and better at capturing long-range dependencies than ELMo's.

BERT takes in a sequence of tokens, passes them through 12 (or 24) layers of self-attention, and outputs a contextual embedding for each token position. The output at each position is a function of all input tokens — so every word's representation is shaped by the entire sentence.
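
Here's a sketch of pulling those contextual vectors out with the Hugging Face transformers library, assuming it, PyTorch, and the bert-base-uncased checkpoint are available. The helper below is illustrative; it also relies on "bank" surviving as a single WordPiece token, which a word this common does.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (sequence_length, 768)
    position = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    return hidden[position]

v_money = bank_vector("I deposited money at the bank.")
v_river = bank_vector("I sat on the river bank.")
print(torch.cosine_similarity(v_money, v_river, dim=0))   # well below 1.0: same word, different vectors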

After BERT, contextual embeddings became the default. GPT (autoregressive, left-to-right context), RoBERTa (BERT trained more carefully), XLNet, ALBERT — all produce contextual embeddings, differing in architecture and training details but sharing the same core idea that ELMo pioneered: words don't have fixed meanings; context determines meaning.

I'll confess that the jump from "each word has one vector" to "each word has a different vector in every sentence it appears in" felt unsettling to me at first. It's harder to inspect, harder to visualize, harder to debug. But the performance gains are undeniable, and once you internalize the idea, it actually feels more honest — because words do change meaning with context, and pretending otherwise was always a simplification.

The Vocabulary Problem

We've been talking about turning words into vectors. But we've been quietly dodging a harder question: what is a "word"?

Consider our movie review analyzer again. It's trained on a vocabulary of 50,000 English words. A user writes: "The cinematographer's visualizations were unbelievably breathtaking." How many of those words are in our vocabulary? "The" — yes. "Were" — yes. "Cinematographer's" — maybe not (we might have "cinematographer" without the possessive). "Visualizations" — possibly not. "Unbelievably" — unlikely. "Breathtaking" — hopefully.

Every word that isn't in the vocabulary collapses into a single [UNK] token. Unknown. The model learns nothing from it. In some domains — medical records, legal contracts, user-generated social media — 10 to 15 percent of words can end up unknown. That's like reading a book with every seventh word blacked out.

We could go the other direction: character-level tokenization. Treat every letter, space, and punctuation mark as its own token. Now nothing is ever unknown — the vocabulary is roughly 256 entries (or fewer for English). But the sequences become absurdly long. "The cat sat" goes from 3 tokens to 11. Transformers have a fixed context window — say, 4,096 tokens. You've burned your window on spelling out individual characters instead of processing actual meaning. And the model must learn that "c-a-t" means the same thing as "C-A-T" or "c-a-t-s," which is hard when each letter is processed as a separate step.

Subword tokenization is the sweet spot. Keep common words whole — "the," "movie," "great" stay as single tokens. Break rare words into pieces the model has seen before — "unbelievably" becomes "un" + "believ" + "ably." Each piece is meaningful and reusable. "Unbreakable" shares the "un" and "able" pieces with "unbelievably." Think of it like Lego: you can build any word from a set of reusable bricks, and the bricks themselves carry meaning. You don't need a separate brick for every possible creation.

Vocabulary sizes land in the sweet spot of 30,000 to 50,000 tokens. Sequences stay short. Nothing is ever truly unknown. The question is: how do you decide which pieces to keep?

Byte Pair Encoding (BPE)

Byte Pair Encoding was originally a data compression algorithm from 1994, repurposed for NLP by Sennrich, Haddow, and Birch in 2016. The idea: start with individual characters. Find the most frequent adjacent pair. Merge them into a new token. Repeat until you've hit your target vocabulary size.

Let's walk through it with movie-related words. Suppose our entire training corpus has four words with these frequencies:

# Word frequencies in our corpus:
# "loved"  (appears 50 times)  → l o v e d _
# "love"   (appears 40 times)  → l o v e _
# "movie"  (appears 60 times)  → m o v i e _
# "movies" (appears 30 times)  → m o v i e s _
#
# (_ marks end-of-word)

We count every adjacent pair across all words, weighted by frequency. The pair ("o", "v") appears in "loved" (50) + "love" (40) + "movie" (60) + "movies" (30) = 180 times. That's the most frequent pair. Merge it.

# Merge 1: ("o", "v") → "ov"
#
# "loved"  → l ov e d _    (×50)
# "love"   → l ov e _      (×40)
# "movie"  → m ov i e _    (×60)
# "movies" → m ov i e s _  (×30)

Recount. Now ("l", "ov") appears 50 + 40 = 90 times, ("ov", "i") appears 60 + 30 = 90 times, and ("ov", "e") appears 50 + 40 = 90 times, along with a couple of other pairs at 90. The actual leader is ("e", "_"), at 40 + 60 = 100. A real BPE run would merge ("e", "_") next and keep working through the counts in strict frequency order; to keep the walkthrough short, let's jump ahead to the merges that build up "love" and merge ("l", "ov").

# Merge 2: ("l", "ov") → "lov"
#
# "loved"  → lov e d _     (×50)
# "love"   → lov e _       (×40)
# "movie"  → m ov i e _    (×60)
# "movies" → m ov i e s _  (×30)

Now ("lov", "e") appears 50 + 40 = 90 times. Merge it.

# Merge 3: ("lov", "e") → "love"
#
# "loved"  → love d _      (×50)
# "love"   → love _        (×40)
# "movie"  → m ov i e _    (×60)
# "movies" → m ov i e s _  (×30)

A few merges in, "love" is a single token. "Loved" is "love" + "d" + "_". The algorithm learned that "loved" contains "love" — a morphological relationship — purely from character frequency statistics. No linguistics knowledge was programmed in. No rules about English verb tenses. The Lego bricks assembled themselves.
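
The whole training procedure fits in a screenful. Here's a minimal sketch over our toy corpus, keeping the _ end-of-word marker as its own symbol. Unlike the abridged walkthrough above, it works through every merge in strict frequency order.

from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of pair in symbols with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_freqs, num_merges):
    """Learn BPE merges from {word: frequency}; returns the ordered merge list."""
    corpus = {tuple(word) + ("_",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair
        merges.append(best)
        corpus = {merge_pair(symbols, best): freq for symbols, freq in corpus.items()}
    return merges

freqs = {"loved": 50, "love": 40, "movie": 60, "movies": 30}
print(train_bpe(freqs, num_merges=5))
# [('o', 'v'), ('e', '_'), ('l', 'ov'), ('m', 'ov'), ('mov', 'i')]  (ties broken by first-seen order)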

The number of merges is a hyperparameter that directly controls vocabulary size. Start with ~256 base characters, add one new token per merge. GPT-2 uses about 50,257 tokens. GPT-4 uses around 100,000. LLaMA 3 uses 128,000. More tokens means shorter sequences (good for context window) but a larger embedding table (more parameters to train).

One critical detail about BPE at inference time: when tokenizing new text, BPE replays the merge rules in the exact order they were learned during training. It doesn't greedily find the longest match. The merge priority list is the tokenizer's memory. This is why you must always pair a model with its exact tokenizer — using a different merge table produces completely different token IDs, and the model's embedding layer becomes meaningless. It's like shuffling the labels on a filing cabinet — the contents no longer match the index.
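
Tokenizing a new word is that same merge loop replayed in the learned order: a sketch reusing merge_pair and the merge list from the training sketch above.

def bpe_encode(word, merges):
    """Tokenize one word by replaying the learned merges in training order."""
    symbols = tuple(word) + ("_",)
    for pair in merges:               # priority = the order the merges were learned
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = train_bpe({"loved": 50, "love": 40, "movie": 60, "movies": 30}, num_merges=5)
print(bpe_encode("lover", merges))    # ['lov', 'e', 'r', '_'] -- unseen word, familiar pieces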

WordPiece

WordPiece was introduced by Schuster and Nakajima in 2012 for Japanese and Korean speech processing, and later became famous as BERT's tokenizer. At first glance, it looks identical to BPE — start with characters, iteratively merge pairs. But the merge criterion is different, and that difference matters.

BPE merges the most frequent pair. WordPiece merges the pair that maximizes the likelihood of the training data. Concretely, it asks: if tokens "a" and "b" appear frequently on their own but rarely together, merging them doesn't help much — they were already being modeled fine separately. But if they appear frequently together relative to their individual frequencies, the merge captures a genuine pattern. This is closely related to pointwise mutual information (PMI).

The result is a vocabulary that's slightly more "linguistically aware" than BPE. WordPiece tends to merge real morphological units rather than high-frequency coincidences.
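
The two criteria are easy to compare side by side. A sketch with made-up counts; the WordPiece score shown is the commonly described form, the pair count divided by the product of the individual counts.

def bpe_score(count_ab, count_a, count_b):
    return count_ab                           # BPE: raw pair frequency

def wordpiece_score(count_ab, count_a, count_b):
    return count_ab / (count_a * count_b)     # WordPiece: pair frequency relative to its parts

# "th" + "e": both hugely common on their own, often adjacent by sheer frequency
print(bpe_score(5000, 60000, 90000), wordpiece_score(5000, 60000, 90000))
# "iza" + "tion": rarer pieces, but almost always together when they do appear
print(bpe_score(800, 900, 1200), wordpiece_score(800, 900, 1200))
# BPE prefers the first merge; WordPiece strongly prefers the second.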

WordPiece uses a special marker — ## — to indicate continuation tokens. The first piece of any word has no prefix. All subsequent pieces get ##. This makes it trivial to reconstruct the original text: strip the prefix, join the pieces.

# BERT (WordPiece) tokenization examples:
#
# "unbelievable"  →  ["un", "##bel", "##ie", "##va", "##ble"]
# "tokenization"  →  ["token", "##ization"]
# "the cat sat"   →  ["the", "cat", "sat"]
#
# Common words stay whole. Rare words split into reusable pieces.
# "un" appears in un-happy, un-usual, un-likely — it learns negation.
# "##ble" appears in read-able, cap-able, comfort-able — it learns suffixes.

The model learns morphology without anyone programming morphological rules. The prefix "un" carries the concept of negation across hundreds of words. The suffix "##tion" appears in "action," "nation," "tokenization" — all nominalized forms. These patterns emerge from the statistics of the training data.

Unigram and SentencePiece

BPE and WordPiece both build bottom-up — start with characters, merge upward. The Unigram approach, introduced by Taku Kudo in 2018, goes the opposite direction. Start with a very large initial vocabulary — all substrings up to a certain length, plus all individual characters — and iteratively remove the tokens that contribute least to the corpus likelihood.

Each token has an associated probability, learned through expectation-maximization. Given a word, there are multiple valid ways to segment it. "Unbreakable" could be ["un", "break", "able"] or ["unbreak", "able"] or individual characters. The Unigram model scores each segmentation by the product of its token probabilities and picks the highest-scoring one.

This multiplicity is actually a feature. During training, you can sample different segmentations of the same word — a technique called subword regularization. Sometimes "unbreakable" segments as ["un", "break", "able"], sometimes as ["unbreak", "able"]. This acts as data augmentation, making the model robust to segmentation variation at test time. BPE and WordPiece always produce the same deterministic segmentation; Unigram gives you controlled randomness.
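
Here's a sketch of the scoring step with a hypothetical token-probability table. In a real tokenizer the log-probabilities are learned by EM, and the best segmentation is found with dynamic programming rather than by listing candidates.

# Hypothetical unigram log-probabilities for a few tokens
token_logprob = {"un": -3.0, "break": -5.0, "able": -4.0, "unbreak": -9.5, "unbreakable": -13.0}

def segmentation_score(segmentation):
    """Log-probability of a segmentation = sum of its tokens' log-probabilities."""
    return sum(token_logprob[token] for token in segmentation)

candidates = [["un", "break", "able"], ["unbreak", "able"], ["unbreakable"]]
for seg in candidates:
    print(seg, segmentation_score(seg))
# The best-scoring segmentation wins at inference; during training we can sample
# among them (subword regularization) instead of always taking the best.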

SentencePiece, also by Kudo, is a tokenization library that implements both Unigram and BPE. Its key innovation: it treats the input as a raw stream of characters rather than pre-segmented words. Most tokenizers assume you've already split text on whitespace. SentencePiece doesn't. It treats spaces as ordinary characters (represented by the ▁ symbol). This means it works identically for English (which uses spaces), Chinese and Japanese (which don't), and everything in between. No language-specific pre-processing rules required.

SentencePiece is the tokenizer behind T5, ALBERT, XLNet, and many multilingual models. The "no pre-tokenization" design is why it became the default for multilingual settings.

Byte-Level BPE

GPT-2 introduced a clever twist on standard BPE. Instead of starting with Unicode characters (of which there are ~150,000), it starts with the 256 raw UTF-8 byte values. Every possible text — English, Mandarin, Arabic, emoji, even raw binary data — is first converted to a sequence of bytes, and BPE merges operate on those bytes.

The result: there is no unknown token. Ever. Any byte sequence is representable, because the base vocabulary covers every possible byte. You never need to worry about unseen scripts, unexpected Unicode characters, or creative emoji combinations. The [UNK] problem disappears entirely.

The tradeoff is efficiency across scripts. A common English word like "the" is 3 bytes in UTF-8, and the merge table quickly learns to represent it as a single token. But a Chinese character is 3 bytes, and a rare emoji can be 4. If the merge table hasn't seen a particular character often enough, that single character might consume 3 to 4 tokens. In practice, byte-level BPE uses about 1.3 to 1.5 times more tokens for non-Latin scripts compared to a dedicated tokenizer for that language.
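
The disparity is easy to see directly in Python.

for text in ["the", "电影", "🎬"]:    # an English word, the Chinese word for "movie", one emoji
    raw = text.encode("utf-8")
    print(f"{text!r}: {len(text)} characters, {len(raw)} bytes")
# 'the': 3 characters, 3 bytes
# '电影': 2 characters, 6 bytes
# '🎬': 1 character, 4 bytes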

This has real consequences. If your tokenizer takes 1.5 times more tokens to encode Hindi text than English, Hindi users get only about two-thirds of the effective context window and pay roughly 1.5 times more per API call. This is a genuine equity issue in LLM deployment. Newer models like LLaMA 3 and Gemma use larger vocabularies (128,000 to 256,000 tokens) specifically to improve efficiency across languages.

Vocabulary Size and Embedding Dimension

Two hyperparameters quietly shape everything downstream: how many tokens are in the vocabulary, and how many dimensions each token's embedding has.

Vocabulary Size

Too small — say, 4,000 tokens — and common words get shredded. "Understanding" might become five tokens. Sequences get long, the model burns context window on sub-word fragments instead of meaning, and training slows down. Too large — say, 500,000 tokens — and the embedding table becomes enormous. Each token needs a dense vector (768 to 4,096 dimensions), so 500,000 tokens means hundreds of millions of parameters in the embedding layer alone. Worse, rare tokens in a huge vocabulary see very few training examples, so their embeddings stay noisy and undertrained.
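
The parameter cost of the embedding table is simple arithmetic: vocabulary size times embedding dimension.

def embedding_params(vocab_size, dim):
    return vocab_size * dim

print(f"{embedding_params(50_000, 768):,}")    # 38,400,000 -- BERT-base scale
print(f"{embedding_params(500_000, 768):,}")   # 384,000,000 -- the oversized-vocabulary case above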

The sweet spot for English-centric models has settled around 30,000 to 50,000 tokens. Multilingual models go higher: 100,000 to 256,000, because they need to efficiently represent multiple scripts. The trend in 2023–2024 is toward larger vocabularies as model sizes grow — when the rest of your model has 70 billion parameters, the relative cost of a bigger embedding table shrinks.

Embedding Dimension

For classic static embeddings (Word2Vec, GloVe), typical dimensions are 50, 100, 200, or 300. Smaller dimensions train faster and use less memory but might miss fine-grained distinctions. Larger dimensions can capture richer relationships but risk overfitting on small datasets and show diminishing returns past 300.

For transformer-based models, the embedding dimension is usually tied to the model's hidden dimension: BERT-base uses 768, BERT-large uses 1,024, GPT-3 uses 12,288, and LLaMA-70B uses 8,192. The embedding dimension and model width should scale together — there's little point in having 4,096-dimensional embeddings feeding into a 768-dimensional transformer.

I haven't figured out a perfectly clean way to summarize the dimension tradeoff, but here's a rough heuristic: the embedding dimension should be large enough that adding dimensions still improves downstream task performance, and no larger. For most practical purposes, match whatever the pretrained model uses. If you're training from scratch, start with the community's conventional wisdom for your model scale and adjust based on validation performance.

Positional Embeddings: Giving Order to Orderless Models

We've covered how words become vectors. But there's a subtle gap we haven't addressed. Transformers process all tokens in parallel — they have no inherent sense of order. The sentences "dog bites man" and "man bites dog" would produce identical representations if we only used word embeddings, because the same words are present. Self-attention computes relationships between all pairs, but it doesn't know which came first.

Positional embeddings fix this by injecting position information into each token's representation. The original Transformer paper (Vaswani et al., 2017) used sinusoidal positional encodings — fixed mathematical functions that produce a unique pattern for each position:

# Sinusoidal positional encoding:
# PE(pos, 2i)   = sin(pos / 10000^(2i/d))
# PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
#
# pos = position in the sequence (0, 1, 2, ...)
# i = dimension index
# d = embedding dimension
#
# Each position gets a unique vector. Different dimensions
# oscillate at different frequencies, creating a unique
# "fingerprint" for each position.

These sinusoidal encodings are not learned — they're fixed mathematical functions. The appealing property is that relative positions can be computed from the encodings: PE(pos + k) can be expressed as a linear function of PE(pos), which in theory lets the model learn to attend to relative positions.
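
The formula transcribes directly into NumPy.

import numpy as np

def sinusoidal_positional_encoding(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]   # (max_len, 1)
    i = np.arange(d // 2)[None, :]      # (1, d/2)
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d=768)
print(pe.shape)   # (128, 768) -- one fixed vector per position, no training required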

Most modern models use learned positional embeddings instead — a separate embedding table where each position (0 through max_length − 1) gets its own trainable vector. BERT and GPT-2 both use this approach. The model learns whatever positional patterns are useful, without being constrained to sinusoids.

More recently, Rotary Position Embeddings (RoPE), used in LLaMA and many recent models, take a different approach entirely. Instead of adding a positional vector, RoPE rotates the query and key vectors in attention by an angle proportional to position. This naturally encodes relative position — the angle between two tokens depends on their distance, not their absolute positions. RoPE handles longer sequences more gracefully than fixed-length positional tables and has become the dominant approach for modern LLMs.
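
Here's a minimal sketch of the rotation idea applied to single vectors. A real implementation rotates the query and key vectors inside every attention head, but the relative-position property is visible even in this toy form.

import numpy as np

def rope(x, pos, base=10000):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one rotation frequency per dimension pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
# The score depends only on the relative distance between the two positions:
print(np.dot(rope(q, 10), rope(k, 7)))    # positions 10 and 7 (distance 3)
print(np.dot(rope(q, 25), rope(k, 22)))   # positions 25 and 22: same distance, same score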

Regardless of the approach, the final input to the model is the sum (or combination) of the token embedding and the positional embedding. The token embedding says what the word is. The positional embedding says where it is. Both are essential.

The Full Picture

If you're still with me, thank you. I hope it was worth it.

We started with one-hot vectors — wasteful, enormous, and meaning-blind. We learned that distributional context is all you need to discover word meaning, and built that insight into Skip-gram and CBOW. We tamed the softmax bottleneck with negative sampling, saw GloVe take a global view through co-occurrence matrices, and watched FastText break words into character n-grams so that no word is ever truly unknown. We hit the ceiling of static embeddings, where every word gets one vector regardless of context, and watched ELMo and BERT shatter that ceiling by making representations dynamic — shaped by the full sentence. Then we turned to the less glamorous but equally critical question of tokenization: BPE merging character pairs bottom-up, WordPiece choosing merges by likelihood, Unigram pruning top-down, and byte-level BPE eliminating unknown tokens forever. We closed with the quiet but powerful choices of vocabulary size, embedding dimension, and positional encodings.

My hope is that the next time you see a tokenizer configuration or an embedding layer in a model architecture, instead of nodding and moving on, you'll know what each piece does and why it's there — and you'll have a pretty good mental model of the machinery that turns raw text into the tensors that make everything else possible.

Resources

A handful of sources that shaped my understanding: