Text Pipeline: From Raw Text to Features
I avoided text preprocessing for longer than I'd like to admit. Every time I started an NLP project, I'd skip straight to the model — the interesting part, the neural network, the thing that actually learns. Cleaning text felt like washing dishes before cooking: necessary, unglamorous, and surely someone else's problem. Then I watched a sentiment model confidently classify "not good" as positive, and I realized the dishes were on fire. This is the deep dive into preprocessing I should have done from the start.
Text pipelines are the sequence of transformations that turn raw, messy human language into numerical vectors a machine learning model can process. The core ideas — tokenization, normalization, stop word removal, stemming, bag-of-words, TF-IDF — have been around since the 1970s in information retrieval research, and they remain the backbone of countless production systems today.
Before we start, a heads-up. We're going to be building a spam filter from scratch, getting into some logarithms and matrix operations, and writing Python code with sklearn and spaCy. You don't need to know any of it beforehand. We'll add what we need, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
Why Computers Can't Read
The Mise en Place: Text Normalization
Breaking Text Apart: Tokenization
The Stop Word Dilemma
Stemming: The Meat Cleaver
Lemmatization: The Scalpel
Rest Stop
Bag of Words: Counting Our Way to Numbers
TF-IDF: Making Counts Intelligent
N-grams: A Partial Fix for Word Order
Domain-Specific Cleaning: Where Rules Change
Putting It All Together
Wrap-Up
Resources
Why Computers Can't Read
Imagine we're building a spam filter for a tiny email inbox. We have five emails:
Email 1: "Buy cheap pills now!!!"
Email 2: "Meeting at 3pm tomorrow"
Email 3: "You won FREE tickets — claim NOW"
Email 4: "Can you review the quarterly report?"
Email 5: "DISCOUNT pills!! Buy buy buy"
To us, the spam is obvious. Emails 1, 3, and 5 have that desperate, shouting quality — all caps, exclamation marks, words like "free" and "buy." But a machine learning model has no eyes. It can't read. It needs numbers: vectors it can multiply, distances it can compute, gradients it can follow. A model is, at its core, a function that takes an array of floats as input and produces an array of floats as output.
So how do we get from "Buy cheap pills now!!!" to an array of floats? That's the entire purpose of a text pipeline. Think of it like mise en place in a kitchen — the prep work a chef does before any cooking begins. You wash the vegetables, dice the onions, measure the spices. None of it is the actual cooking, but skip it and the meal falls apart. Raw text is our pile of unprepped ingredients. The text pipeline is every cut, wash, and measurement that turns it into something a model can work with.
Our spam filter will be the running example throughout this section. We'll keep coming back to these five emails as we build up the pipeline piece by piece.
The Mise en Place: Text Normalization
The first thing we notice about our emails is the mess. Email 1 says "Buy" while Email 5 mixes "Buy" and "buy." Email 3 screams "FREE" and "NOW." To a computer comparing raw strings, "Buy", "buy", and "BUY" are three completely different tokens. That's three vocabulary slots for one word, for no good reason. We need to collapse these variations — to bring order to chaos before we do anything else.
This is text normalization: a set of transformations that reduce surface-level variation so that words which carry the same meaning get treated the same way.
Lowercasing
Lowercasing is the most common normalization step. We convert everything to lowercase, so "Buy", "BUY", and "buy" all become "buy." For our spam filter, this is a clear win — we want the model to recognize that screaming "BUY" and whispering "buy" carry the same intent.
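Mechanically, it's a single call. A quick illustration on two of our emails:
emails_raw = ["Buy cheap pills now!!!", "DISCOUNT pills!! Buy buy buy"]
print([e.lower() for e in emails_raw])
# ['buy cheap pills now!!!', 'discount pills!! buy buy buy']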
But lowercasing destroys information. "US" (the country) becomes "us" (the pronoun). "Apple" (the company) becomes "apple" (the fruit). For tasks like named entity recognition, that lost casing is the only signal distinguishing a proper noun from a common one. The rule of thumb: lowercase for bag-of-words models and search. Keep case for models that learn their own representations; cased variants of BERT, for example, were trained with capitalization intact and can use it as a feature.
Unicode Normalization
I'll be honest — I once spent two hours debugging why two strings that looked identical on screen weren't matching in a comparison. The culprit was Unicode normalization, and it taught me a lesson I haven't forgotten.
The character "é" can be stored two different ways in Unicode. As a single code point — U+00E9, "latin small letter e with acute." Or as two code points — U+0065 (plain "e") followed by U+0301 (a combining acute accent). On screen, they're pixel-for-pixel identical. In memory, they're different byte sequences. String equality returns False.
import unicodedata
s1 = "café" # é as single code point
s2 = "cafe\u0301" # e + combining accent
print(s1 == s2) # False — same pixels, different bytes
print(len(s1), len(s2)) # 4, 5 — even the lengths differ
# NFC normalization: compose into single code points
s1_nfc = unicodedata.normalize("NFC", s1)
s2_nfc = unicodedata.normalize("NFC", s2)
print(s1_nfc == s2_nfc) # True — now they match
NFC is Normalization Form C: canonical decomposition followed by canonical composition. It smashes decomposed characters back into their single-codepoint form. There's also NFKC, which goes further — it flattens compatibility characters like fullwidth letters ("ａ" → "a") and ligatures ("ﬀ" → "ff"). For NLP pipelines, NFKC is usually the safest bet. Always normalize before doing anything else with your text. It takes microseconds and prevents bizarre, silent bugs.
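To make the NFKC flattening concrete, a tiny sketch (the inputs are a real ligature and fullwidth letters):
import unicodedata
print(unicodedata.normalize("NFKC", "ﬀ"))     # 'ff' (the U+FB00 ligature becomes two letters)
print(unicodedata.normalize("NFKC", "ａｂｃ"))  # 'abc' (fullwidth letters become ASCII)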
Stripping and Punctuation
What about the exclamation marks in "Buy cheap pills now!!!"? For our spam filter, those triple exclamation marks are actually useful signal — spammers love exclamation marks. But for a topic classifier, punctuation is noise. For sentiment analysis, "great!!!" carries stronger positive sentiment than "great" — stripping punctuation would erase that intensity.
There's no universal right answer here. It depends entirely on what your downstream task needs. The same goes for numbers. If you're classifying financial documents, "10,000" matters. If you're classifying movie reviews, it doesn't. Some pipelines replace all numbers with a placeholder token like <NUM>, preserving the information that a number appeared without caring which one.
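Here's a sketch of both choices. The regex patterns below are illustrative, not a standard recipe:
import re
text = "Buy cheap pills now!!! Only 10,000 left"
print(re.sub(r"[^\w\s]", "", text))        # strip punctuation entirely
# 'Buy cheap pills now Only 10000 left'
print(re.sub(r"\d[\d,]*", "<NUM>", text))  # keep punctuation, mask numbers
# 'Buy cheap pills now!!! Only <NUM> left'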
Our mise en place analogy holds. A chef preparing sushi handles the fish differently from one making stew. What you strip, what you keep, what you transform — those are decisions that depend on the dish you're cooking.
Breaking Text Apart: Tokenization
With our text normalized, we face the next question: where does one word end and the next begin? This process of splitting text into individual pieces is called tokenization, and each piece is a token. A token might be a word, a subword, a character, or a punctuation mark. The choice defines the atoms your model will reason over.
Let's try tokenizing Email 3 from our spam filter: "You won FREE tickets — claim NOW". After lowercasing, it becomes "you won free tickets — claim now". How do we break that into tokens?
Whitespace Tokenization
The most naive approach: split on spaces.
text = "you won free tickets — claim now"
tokens = text.split()
print(tokens)
# ['you', 'won', 'free', 'tickets', '—', 'claim', 'now']
This works for our clean example. But consider a messier string: "dr. smith didn't pay $3.50 for the u.s.a. trip!". Whitespace splitting gives us ["dr.", "smith", "didn't", "pay", "$3.50", "for", "the", "u.s.a.", "trip!"]. The period after "dr" is glued to the word. The exclamation mark is fused with "trip." The contraction "didn't" stays as one token when we might want "did" and "n't" separated. For our five little emails, whitespace splitting would be fine. For anything real, we need more.
Regex Tokenization
A step up is defining what a token looks like using a regular expression. The pattern \w+ grabs sequences of word characters (letters, digits, underscores), effectively stripping punctuation. Fancier patterns like \w+(?:'\w+)? keep contractions intact.
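You can see both the power and the cracks in a few lines:
import re
text = "dr. smith didn't pay $3.50 for the u.s.a. trip!"
print(re.findall(r"\w+", text))
# ['dr', 'smith', 'didn', 't', 'pay', '3', '50', 'for', 'the', 'u', 's', 'a', 'trip']
print(re.findall(r"\w+(?:'\w+)?", text))
# ['dr', 'smith', "didn't", 'pay', '3', '50', 'for', 'the', 'u', 's', 'a', 'trip']
The second pattern rescues the contraction, but "$3.50" and "u.s.a." are still shredded.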
The advantage is control. The disadvantage is that natural language has more edge cases than any regex can anticipate. Abbreviations ("U.S.A."), URLs, email addresses, emoticons, currency symbols — each demands its own pattern. I've seen production regex tokenizers grow to hundreds of lines, and they still had bugs. This path leads to a place you don't want to go.
Library Tokenizers
Libraries like spaCy and NLTK have already walked through that fire for you. They ship with tokenizers that handle abbreviations, contractions, currency, URLs, and dozens of other edge cases. Let's see the difference on a tricky sentence:
import spacy
from nltk.tokenize import word_tokenize
text = "Dr. Smith didn't pay $3.50 for the U.S.A. trip!"
# spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("spaCy:", [t.text for t in doc])
# ['Dr.', 'Smith', 'did', "n't", 'pay', '$', '3.50', 'for',
# 'the', 'U.S.A.', 'trip', '!']
# NLTK
print("NLTK:", word_tokenize(text))
# ['Dr.', 'Smith', 'did', "n't", 'pay', '$', '3.50', 'for',
# 'the', 'U.S.A.', 'trip', '!']
Both correctly split "didn't" into "did" + "n't", separate the dollar sign from the amount, keep "U.S.A." as one token, and detach the exclamation mark. NLTK's word_tokenize uses the Penn Treebank standard — a set of rules designed for English text that has been refined for decades. spaCy uses a rule-based system with exception lists and prefix/suffix/infix patterns.
The lesson: don't build your own tokenizer unless your domain is so unusual that no existing tool handles it. And even then, think twice.
Sentence Tokenization
Sentence tokenization — splitting text into sentences — is harder than it looks. "Dr. Smith went to Washington. He arrived at 3 p.m." has four periods: two that don't end a sentence (after "Dr" and after the first "p" in "p.m."), one that does (after "Washington"), and one doing double duty as both abbreviation and sentence boundary (the final period of "p.m."). NLTK's sent_tokenize uses the Punkt algorithm, an unsupervised model trained to recognize abbreviations and sentence boundaries. spaCy identifies sentence breaks during its dependency parsing. Both handle the common cases well. Neither is perfect — long, convoluted legal sentences still trip them up.
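A quick sketch of Punkt on that example (this assumes NLTK's punkt models are available; newer NLTK versions may ask you to download 'punkt_tab' instead):
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize
text = "Dr. Smith went to Washington. He arrived at 3 p.m."
print(sent_tokenize(text))
# ['Dr. Smith went to Washington.', 'He arrived at 3 p.m.']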
Beyond English
Everything above assumes spaces between words. Chinese, Japanese, and Thai don't use spaces — the string "我喜欢机器学习" ("I like machine learning") is a continuous stream of characters, and finding word boundaries requires a statistical model or dictionary. German smashes nouns into compounds: "Donaudampfschifffahrtsgesellschaft" (Danube steamship company) is one word. Turkish is agglutinative — a single word can encode what English needs an entire phrase to express. The takeaway: always use a tokenizer designed for your target language. There is no universal tokenizer.
The Stop Word Dilemma
Back to our spam filter. After tokenizing all five emails, we have lists of tokens. Many of those tokens are words like "the", "at", "you", "can" — words that appear in practically every English sentence and carry almost no meaning about whether something is spam or not. These are stop words: the most frequent words in a language that carry little discriminating power on their own.
NLTK ships with an English stop word list of about 179 words. spaCy has around 326. The idea behind removing them is straightforward: they dominate frequency counts without contributing topical information. For our spam filter using TF-IDF, removing "the" and "at" and "you" would let the model focus on words that actually distinguish spam from legitimate email — words like "buy", "free", "pills", "discount."
For years, I removed stop words from every pipeline I built. It seemed like obvious good practice. Then I built a sentiment classifier and watched it classify "this movie is not good" as positive. The word "not" was on the stop word list. Removing it turned "not good" into "good." That one deleted word flipped the entire meaning of the sentence.
The phrase "to be or not to be" — after stop word removal — becomes nothing. Every word in it is a stop word.
So when do we remove them? For bag-of-words models and TF-IDF where we're counting individual words in isolation, stop word removal usually helps. The model doesn't understand word order anyway, so "not" wasn't doing much for it. For topic modeling, removing stop words keeps "the" from dominating the word counts. For search engine indexing, it shrinks the index dramatically.
When do we keep them? For any model that understands word order — LSTMs, transformers, BERT. These models were trained with stop words present. Removing them before feeding text to BERT is like tearing pages out of a book before asking someone to read it. The model needs those little connector words to understand the grammar and flow of the sentence.
There's a third category worth knowing about: domain-specific stop words. In a corpus of legal documents, "court", "plaintiff", and "defendant" appear in nearly every document. They're not on any standard stop word list, but within that domain, they carry no discriminating power. You can find them by checking document frequency — any word appearing in more than 90% of documents is a candidate. But be careful: removing them is irreversible, and you might need them for a different task later.
Stemming: The Meat Cleaver
Our spam filter has another problem. Email 1 says "pills" and Email 5 says "pills" — no issue there. But what if one email said "buying" and another said "buy"? What about "discounted" vs "discount"? These are the same concept in different grammatical clothing. We want our model to recognize them as the same word.
Stemming is the blunt approach: chop off suffixes using a set of rules until you reach something that looks like a root. No dictionary, no grammar — pure string surgery. Think of it as a meat cleaver in our kitchen analogy. It gets the job done, but it's not precise.
The most famous stemmer is the Porter Stemmer, designed by Martin Porter in 1980. It applies cascading rules: strip plurals ("cats" → "cat"), strip "-ing" and "-ed" endings ("running" → "run"), strip derivational suffixes ("happiness" → "happi"). The Snowball Stemmer (Porter2) is a refined version that fixes some quirks and supports multiple languages.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["buying", "buys", "bought", "pills", "discount",
"discounted", "university", "universal", "flies"]
for w in words:
print(f" {w:>15} → {stemmer.stem(w)}")
# buying → buy
# buys → buy
# bought → bought
# pills → pill
# discount → discount
# discounted → discount
# university → univers
# universal → univers
# flies → fli
Look at what happened. "Buying", "buys" → "buy" — good. "Discount", "discounted" → "discount" — good. But "bought" stays as "bought" because the stemmer doesn't know irregular English verbs. And "university" and "universal" both collapse to "univers" — but a university and something universal are completely different concepts. That's over-stemming: collapsing words that shouldn't be grouped. Meanwhile "flies" becomes "fli", which isn't a real word. And "alumnus" and "alumni" won't stem to the same root because Latin plurals don't follow English suffix rules. That's under-stemming: failing to group words that should be together.
Stemming is fast — it's raw string manipulation, no model loading, no dictionary lookups. That speed makes it attractive for search engines processing millions of documents. The price is accuracy. For our spam filter, stemming would work well enough — "buying" and "buy" don't need to be distinguished. But for tasks needing precision, we want something sharper.
Lemmatization: The Scalpel
If stemming is a meat cleaver, lemmatization is a scalpel. Instead of blindly hacking off suffixes, it uses a dictionary (or morphological analysis) to return the actual base form of a word — its lemma. "Better" lemmatizes to "good". "Ran" becomes "run". "Mice" becomes "mouse". A stemmer would never get these right because they require knowing the word, not the suffix pattern.
The catch is that lemmatization often needs to know a word's part of speech to choose the right lemma. The word "saw" is a noun (a cutting tool) or a verb (past tense of "see"). Without POS context, the lemmatizer guesses. With it, verb-"saw" correctly becomes "see" while noun-"saw" stays "saw". spaCy handles this automatically — its pipeline runs POS tagging before lemmatization. NLTK's WordNet lemmatizer requires you to pass the POS tag yourself.
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('wordnet', quiet=True)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
test_words = [
("better", "a"), # adjective
("mice", "n"), # noun
("saw", "v"), # verb
("flies", "n"), # noun
("operating", "v"), # verb
("studies", "n"), # noun
]
print(f" {'Word':>12} {'Stemmed':>12} {'Lemmatized':>12}")
print(f" {'─'*12} {'─'*12} {'─'*12}")
for word, pos in test_words:
print(f" {word:>12} {stemmer.stem(word):>12} "
f"{lemmatizer.lemmatize(word, pos=pos):>12}")
# Word Stemmed Lemmatized
# ──────────── ──────────── ────────────
# better better good
# mice mice mouse
# saw saw see
# flies fli fly
# operating oper operate
# studies studi study
The differences are stark. The stemmer produces fragments that aren't real words — "oper", "studi", "fli." The lemmatizer produces actual English words, but it needs that POS hint to do it. "Better" → "good" is something the stemmer could never achieve because it requires knowing that "better" is the comparative form of the adjective "good." That's vocabulary knowledge, not string manipulation.
For our spam filter, either approach works — the vocabulary is simple enough that stemming's errors won't matter much. For a medical NLP system where "operating" and "operation" need to stay distinct, lemmatization is worth the slower speed: a stemmer collapses both to "oper", while a lemmatizer keeps "operate" and "operation" apart. For modern transformer models (BERT, GPT), you need neither — these models use subword tokenizers like BPE and WordPiece that handle morphological variation implicitly. The model learns that "running" and "run" are related from context alone.
Rest Stop
Congratulations on making it this far. You can stop here if you want.
You now have a mental model of the entire text cleaning toolkit: normalization brings order to messy text, tokenization breaks it into pieces, stop word removal filters out noise (when appropriate), and stemming or lemmatization collapses word variations. That's the mise en place — the prep work. With these tools, you can take any raw document and produce a clean list of meaningful tokens.
That's useful, but it doesn't tell the complete story. We still have the fundamental problem we started with: a model needs numbers, not lists of words. The next few sections tackle how we convert those clean tokens into numerical vectors that a model can actually process — starting with the simplest possible approach and building up to something surprisingly powerful.
If the discomfort of not knowing what comes after the prep work is nagging at you, read on.
Bag of Words: Counting Our Way to Numbers
Let's return to our spam filter's five emails, now cleaned and tokenized. We need to convert them into numbers. The absolute simplest way to do this is to count words.
Our kitchen analogy returns: if the earlier normalization steps were the mise en place, this is where we actually start cooking. We're going to build a vocabulary — a card catalog for our library of words, where every unique word gets its own numbered slot.
Across all five emails, suppose we have these unique tokens (after lowercasing and removing punctuation): {buy, cheap, pills, now, meeting, tomorrow, won, free, tickets, claim, review, quarterly, report, discount}. That's 14 unique words. Each email will become a vector of length 14 — one slot per vocabulary word.
The Bag of Words model fills each slot with a count of how many times that word appears in the email. Email 5 ("discount pills buy buy buy") gets a 3 in the "buy" slot, a 1 for "pills", a 1 for "discount", and zeros everywhere else. Email 2 ("meeting at 3pm tomorrow") — after stop word and number removal — might get a 1 for "meeting" and a 1 for "tomorrow", zeros everywhere else.
That's the entire algorithm. Count words. Ignore everything else — word order, grammar, sentence structure, tone, meaning. It's called "bag of words" because you're treating the document like a bag: you dump all the words in, shake it up, and all you know is what fell out and how many of each.
"The dog bit the man" and "the man bit the dog" produce identical BoW vectors. That feels like a serious flaw, and it is. We'll address it later.
from sklearn.feature_extraction.text import CountVectorizer
emails = [
"buy cheap pills now",
"meeting tomorrow",
"won free tickets claim now",
"review the quarterly report",
"discount pills buy buy buy",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
print("Vocabulary:", list(vectorizer.get_feature_names_out()))
print()
for i, email in enumerate(emails):
print(f" Email {i+1}: {X.toarray()[i]}")
The CountVectorizer handles vocabulary construction and counting in one step. It returns a sparse matrix — a special data structure that stores nonzero values efficiently. For our tiny example, we call .toarray() to see the full matrix. In production with vocabularies of 50,000+ words, you keep it sparse. A typical document uses 200 of those 50,000 words, meaning 99.6% of entries are zero. Storing all those zeros as a dense array would be wasteful.
A binary BoW variant replaces all counts with 0 or 1 — presence instead of frequency. For short texts like tweets, whether a word appears once or three times matters less than whether it appears at all.
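In sklearn this is a single flag on the vectorizer we already built (reusing the emails list above):
binary_vectorizer = CountVectorizer(binary=True)
Xb = binary_vectorizer.fit_transform(emails)
print(Xb.toarray()[4])
# Email 5's three counts of "buy" collapse to a single 1 in the "buy" slot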
So we've turned text into numbers. That's real progress. But there's a glaring problem: BoW treats every word equally. In our spam filter, the word "now" appears in both a spam email and a legitimate one. It's not very useful for distinguishing them. Meanwhile, "pills" appears only in spam — it's a strong signal. BoW doesn't know the difference. To it, "now" and "pills" are both worth one count.
TF-IDF: Making Counts Intelligent
This is where things get interesting. We want a representation that says: "If a word appears in nearly every email, it's probably not useful for telling them apart. If a word appears in only one or two emails, it's a strong signal for what makes those emails unique."
TF-IDF — Term Frequency times Inverse Document Frequency — formalizes exactly this intuition. It takes the raw count from BoW and multiplies it by a factor that rewards rare words and penalizes common ones.
Term Frequency
The first half is straightforward. Term Frequency, TF(t, d), measures how often term t appears in document d. In its simplest form, it's the raw count — what BoW already gives us. "Buy" appears 3 times in Email 5, so TF("buy", Email 5) = 3. Some variants divide by the total number of words in the document to normalize for length, but the raw count is the most common starting point.
Inverse Document Frequency
The second half is where the magic lives. Inverse Document Frequency, IDF(t), measures how rare a term is across the entire collection of documents (the corpus). The formula:
IDF(t) = log(N / df(t))
where N is the total number of documents and df(t) is the number of documents containing term t.
Let's trace through this with our five emails. N = 5.
Consider the word "pills". It appears in Email 1 and Email 5, so df("pills") = 2. IDF("pills") = log(5/2) ≈ 0.916. That's a meaningful positive number — "pills" is somewhat rare, so it gets a decent weight.
Now consider a word like "now", which appears in Emails 1 and 3. df("now") = 2. IDF("now") = log(5/2) ≈ 0.916. Same rarity, same weight. So far, IDF can't tell these apart.
But imagine a word that appeared in all five emails. df = 5, so IDF = log(5/5) = log(1) = 0. The word gets completely zeroed out. It carries no information about what distinguishes one email from another. This is the key insight: IDF automatically identifies and suppresses words that don't help.
Now imagine a word that appears in only one email — say "quarterly" in Email 4. df("quarterly") = 1. IDF = log(5/1) ≈ 1.609. The highest weight of all. That makes sense: "quarterly" uniquely identifies Email 4. It's the most informative word in our corpus.
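Those numbers are quick to verify by hand with the plain formula (not sklearn's smoothed variant, which we'll meet later):
import math
N = 5
for term, df in [("pills", 2), ("now", 2), ("quarterly", 1)]:
    print(f" IDF({term}) = log({N}/{df}) ≈ {math.log(N / df):.3f}")
# IDF(pills) = log(5/2) ≈ 0.916
# IDF(now) = log(5/2) ≈ 0.916
# IDF(quarterly) = log(5/1) ≈ 1.609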
Why the Logarithm?
I'm still developing my full intuition for why the log is there, but here's the clearest way I can explain it. Without the log, a word appearing in 1 out of 1,000 documents would get a weight of 1,000. A word in 1 out of 10,000 documents would get 10,000 — ten times more. But is that second word really ten times more informative? Not really. The relationship between rarity and informativeness is not linear. Going from "appears in 50% of documents" to "appears in 10% of documents" is a much bigger jump in usefulness than going from "appears in 0.1%" to "appears in 0.02%."
The logarithm captures this diminishing return. It's the same function that shows up in information theory — Shannon's self-information is defined as −log(P), the negative log of an event's probability. IDF is measuring something analogous: the surprise, or information content, of encountering a particular word. Common words carry no surprise. Rare words carry a lot. The log is the natural way to express that.
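The numbers make the dampening visible:
import math
for share in [0.5, 0.1, 0.001, 0.0002]:
    print(f" appears in {share:.2%} of docs → IDF ≈ {math.log(1 / share):.2f}")
# appears in 50.00% of docs → IDF ≈ 0.69
# appears in 10.00% of docs → IDF ≈ 2.30
# appears in 0.10% of docs → IDF ≈ 6.91
# appears in 0.02% of docs → IDF ≈ 8.52
Halving a word's document frequency always adds the same constant to its weight, no matter how rare it already is. That's the diminishing return in action.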
The Combined Score
The final TF-IDF score combines both halves:
TF-IDF(t, d) = TF(t, d) × IDF(t)
A word scores high when it's frequent in this specific document (high TF) but rare across the whole corpus (high IDF). That's the kind of word that tells you what this document is about. "Buy" appears 3 times in Email 5 (high TF) and only in 2 of 5 emails (decent IDF). Its TF-IDF in Email 5 = 3 × 0.916 ≈ 2.75. That's a strong signal. Meanwhile a word appearing everywhere would get TF-IDF = anything × 0 = 0, regardless of how often it appears.
Let's see this in action with our emails:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
emails = [
"buy cheap pills now",
"meeting tomorrow",
"won free tickets claim now",
"review the quarterly report",
"discount pills buy buy buy",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(emails)
# Which words have the highest TF-IDF in each email?
feature_names = tfidf.get_feature_names_out()
for i, email in enumerate(emails):
row = X.toarray()[i]
top_idx = row.argsort()[-3:][::-1] # top 3 words
top_words = [(feature_names[j], round(row[j], 3)) for j in top_idx if row[j] > 0]
print(f" Email {i+1}: {top_words}")
# Email 1: [('cheap', 0.582), ('pills', 0.47), ('now', 0.47)]  ("buy" ties at 0.47; order among ties is arbitrary)
# Email 2: [('tomorrow', 0.707), ('meeting', 0.707)]
# Email 3: [('won', 0.464), ('tickets', 0.464), ('free', 0.464)]  ("claim" ties at 0.464)
# Email 4: [('the', 0.5), ('review', 0.5), ('report', 0.5)]  ("quarterly" ties at 0.5)
# Email 5: [('buy', 0.883), ('discount', 0.365), ('pills', 0.294)]
Look at the top words for each email. TF-IDF has automatically surfaced the most characterizing terms: "cheap" and "buy" for the spam emails, "meeting" and "tomorrow" for the calendar email, "review" and "report" for the work email. (In a five-document corpus even "the" scores high, because it happens to appear in just one email; with thousands of documents, its document frequency would push its weight toward zero.) Without us telling it anything about spam, TF-IDF has learned which words matter.
Measuring Similarity with Cosine
Once we have TF-IDF vectors, we can measure how similar two documents are. The standard measure is cosine similarity — the cosine of the angle between two vectors. A value of 1.0 means the vectors point in the same direction (same topic). A value of 0.0 means they're perpendicular (completely unrelated).
Why cosine instead of Euclidean distance? Because cosine is length-invariant. A 5,000-word article and a 50-word abstract about the same topic have very different vector magnitudes, but they point in a similar direction. Cosine captures that shared direction. Euclidean distance would say they're far apart because one vector is much longer. In our spam filter, a short "buy pills" and a long "buy cheap pills now buy them today" should be recognized as similar. Cosine does this; Euclidean doesn't.
sim = cosine_similarity(X)
print("Email 1 vs Email 5:", round(sim[0][4], 3))
# ~0.55 — both mention "buy" and "pills"
print("Email 1 vs Email 4:", round(sim[0][3], 3))
# ~0.0 — no shared vocabulary at all
print("Email 2 vs Email 4:", round(sim[1][3], 3))
# ~0.0 — different topics entirely
Email 1 and Email 5 share vocabulary about buying pills, so they have positive similarity. The spam emails cluster together, and the legitimate emails are far from them. A classifier could draw a line between these clusters — and that's exactly how TF-IDF + logistic regression works in practice.
If you compute TF-IDF by hand using the formula above and compare with sklearn's output, the numbers won't match. That's because TfidfVectorizer uses a smoothed IDF: log((1 + N) / (1 + df(t))) + 1, which prevents division by zero and never fully eliminates a term. It also L2-normalizes each document vector to unit length, which means the dot product of two vectors directly gives their cosine similarity. The conceptual idea is identical — sklearn adds engineering guardrails.
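You can check that guardrail yourself. This sketch reuses the tfidf object and emails list fitted above:
import numpy as np
N = len(emails)                                  # 5 documents
df_pills = 2                                     # "pills" appears in Emails 1 and 5
hand_idf = np.log((1 + N) / (1 + df_pills)) + 1  # sklearn's smoothed formula
idx = list(tfidf.get_feature_names_out()).index("pills")
print(round(hand_idf, 4), round(tfidf.idf_[idx], 4))
# 1.6931 1.6931  (hand computation matches the fitted vectorizer)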
N-grams: A Partial Fix for Word Order
The bag-of-words assumption — that word order doesn't matter — has been bothering us since we introduced it. "The dog bit the man" and "the man bit the dog" produce identical vectors. For our spam filter, "not spam" and "spam not" would look the same.
N-grams offer a partial repair. Instead of treating individual words (called unigrams) as our only vocabulary units, we also include consecutive pairs (bigrams) or triples (trigrams). Our vocabulary grows to include entries like "buy cheap", "cheap pills", "not spam." Now "not spam" is its own vocabulary item, distinct from "spam" alone.
from sklearn.feature_extraction.text import CountVectorizer
emails = [
"buy cheap pills now",
"meeting tomorrow",
"won free tickets claim now",
]
unigram = CountVectorizer(ngram_range=(1, 1))
bigram = CountVectorizer(ngram_range=(1, 2))
trigram = CountVectorizer(ngram_range=(1, 3))
X1 = unigram.fit_transform(emails)
X2 = bigram.fit_transform(emails)
X3 = trigram.fit_transform(emails)
print(f" Unigrams only: {X1.shape[1]} features")
print(f" Uni + Bigrams: {X2.shape[1]} features")
print(f" Uni + Bi + Tri: {X3.shape[1]} features")
The tradeoff is vocabulary explosion. If you have 50,000 unigrams, adding bigrams might push you to 500,000 features. Trigrams push into the millions. Each document vector gets correspondingly longer and sparser. At some point you're spending more storage on mostly-zero vectors than you're gaining in expressive power.
There's also character n-grams — sequences of characters rather than words. A character trigram model for "spam" would include "spa", "pam." These are surprisingly useful for language detection and for handling misspellings — "v1agra" doesn't match "viagra" at the word level, but it shares many character trigrams. Spammers know how to dodge word-level filters. Character n-grams make that harder.
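A sketch with sklearn's character analyzer makes the overlap visible:
from sklearn.feature_extraction.text import CountVectorizer
char_vec = CountVectorizer(analyzer="char", ngram_range=(3, 3))
Xc = char_vec.fit_transform(["viagra", "v1agra"])
print(char_vec.get_feature_names_out())
# ['1ag' 'agr' 'gra' 'iag' 'v1a' 'via']
print(int((Xc.toarray().min(axis=0) > 0).sum()), "shared trigrams")
# 2 shared trigrams ('agr' and 'gra')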
N-grams are a bandage, not a cure. They capture local word order within a window of 2-3 words, but they can't handle long-range dependencies. "I would not recommend this product" has two words between "not" and "product" — already beyond any bigram's or trigram's reach. For that, you need models that genuinely understand sequence — and that's where embeddings and transformers take over.
Domain-Specific Cleaning: Where Rules Change
Every new domain humbles you. The pipeline that works for news articles will fail on tweets, crash on medical records, and produce nonsense on legal briefs. Our kitchen mise en place changes depending on the cuisine.
Social Media
Tweets and posts are a different beast. They contain @mentions, #hashtags, URLs, emojis, slang ("u" for "you", "gr8" for "great"), and creative spelling ("sooooo goood"). A standard tokenizer will mangle all of this. NLTK provides a TweetTokenizer specifically designed for this domain — it preserves hashtags, handles repeated characters, and keeps emoticons intact.
from nltk.tokenize import TweetTokenizer
tweet = "@realUser This is sooooo good!!! 😊 #NLP #winning"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tokenizer.tokenize(tweet))
# ['This', 'is', 'sooo', 'good', '!', '!', '!', '😊', '#NLP', '#winning']
Notice reduce_len=True collapses "sooooo" to "sooo" (three repeated characters max). The emoji is preserved as a token — for sentiment analysis, 😊 carries more signal than most words. Hashtags stay intact because they often encode topic information: #NLP tells you what the tweet is about.
Decisions pile up quickly. Do you keep hashtags as-is or segment them ("MachineLearning" → "machine learning")? Do you translate emojis to text descriptions ("😊" → "smiling face") or keep them as symbols? Do you expand slang or leave it? Each choice depends on your downstream task, and there's rarely a single right answer.
Medical Text
Clinical notes are a minefield of abbreviations, shorthand, and domain jargon. "Pt" means "patient." "q.d." means "once daily" (from the Latin quaque die). "PRN" means "as needed." Standard NLP tools don't know any of this. A tokenizer that splits on periods will destroy "q.d." and "b.i.d." Lowercasing loses the distinction between "AIDS" (the disease) and "aids" (helps). Medical NLP often requires specialized tokenizers and abbreviation dictionaries — tools like scispaCy, a spaCy model trained on biomedical text.
Legal Text
Legal documents use section references ("§ 230"), case citations ("Brown v. Board of Education, 347 U.S. 483 (1954)"), Latin phrases ("habeas corpus", "amicus curiae"), and sentences that run for paragraphs. A sentence tokenizer trained on news articles will choke on a 200-word sentence with embedded citations. Legal NLP pipelines often need custom sentence splitters that understand citation patterns and parenthetical references.
The meta-lesson: before building any text pipeline, spend time reading your actual data. Look at the edge cases. Find the abbreviations, the formatting quirks, the domain-specific patterns. The pipeline must be designed for the data it will actually see, not for the clean example data in a tutorial.
Putting It All Together
Let's build a complete pipeline for our spam filter, from raw text to a trained classifier. This is where every piece we've built comes together into something practical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Our tiny labeled dataset (in production, you'd have thousands)
emails = [
"buy cheap pills now",
"meeting at 3pm tomorrow",
"you won FREE tickets claim NOW",
"can you review the quarterly report",
"DISCOUNT pills buy buy buy",
"lunch plans for friday",
"earn money fast click here",
"project deadline extended to next week",
"free trial limited offer act now",
"team standup moved to 10am",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] # 1 = spam, 0 = not spam
# The pipeline: TF-IDF → Logistic Regression
pipe = Pipeline([
("tfidf", TfidfVectorizer(
lowercase=True, # normalize case
stop_words="english", # remove English stop words
ngram_range=(1, 2), # unigrams + bigrams
max_features=1000, # cap vocabulary size
)),
("clf", LogisticRegression()),
])
pipe.fit(emails, labels)
# Test on new emails
test_emails = [
"buy discount products today",
"can we reschedule the meeting",
"free money click now",
]
predictions = pipe.predict(test_emails)
for email, pred in zip(test_emails, predictions):
print(f" {'SPAM' if pred == 1 else 'HAM ':>4} ← {email}")
The Pipeline chains the TF-IDF transformation and the classifier into a single object. When we call pipe.fit(), it fits the vectorizer on the training text and then trains the classifier on the resulting vectors. When we call pipe.predict(), new text flows through the same normalization, tokenization, and vectorization steps before reaching the classifier. Nothing leaks. Nothing is inconsistent.
In production, you'd add more steps: custom preprocessing functions, hyperparameter tuning with GridSearchCV, evaluation on a proper test set. But the skeleton is always the same: raw text → transformations → numbers → model. That's the text pipeline.
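A minimal sketch of what that tuning step might look like on our toy data. The grid values here are illustrative choices, not recommendations:
from sklearn.model_selection import GridSearchCV
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # try with and without bigrams
    "clf__C": [0.1, 1.0, 10.0],              # regularization strength
}
search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1")
search.fit(emails, labels)
print(search.best_params_)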
I still occasionally get tripped up by a new dataset that breaks assumptions I didn't know I'd made — a corpus with HTML tags I forgot to strip, or Unicode characters my normalizer didn't handle. The pipeline is never truly "done." It evolves as you learn more about your data.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with a pile of raw, messy emails and the fundamental problem that computers need numbers, not words. We built up the pipeline piece by piece: normalization to tame the chaos, tokenization to split text into atoms, stop word removal (when appropriate) to cut noise, stemming and lemmatization to collapse word variations. Then we crossed into the numerical world with Bag of Words — counting words into vectors — and refined it with TF-IDF, which uses the logarithmic insight from information theory to automatically highlight the words that matter. We patched the word-order blind spot with n-grams, confronted the messy reality of domain-specific text, and wired everything into a working sklearn pipeline.
My hope is that the next time you start an NLP project, instead of skipping straight to the model and wondering why your sentiment classifier thinks "not good" is positive, you'll spend an hour on the preprocessing, building the mise en place with a clear sense of what every step does and why it's there. The dishes aren't glamorous, but they're what make the meal possible.
These classical representations — BoW and TF-IDF — have real limitations. They can't understand that "happy" and "joyful" mean the same thing. They can't capture long-range word dependencies. They produce sparse, high-dimensional vectors that models struggle with. Those limitations are exactly what motivated the development of dense word embeddings (Word2Vec, GloVe) and contextual representations (ELMo, BERT, GPT) — which is where we're headed next.
Resources
A handful of resources I found genuinely useful while going down this rabbit hole:
- Speech and Language Processing by Jurafsky & Martin (3rd edition, free online) — the definitive NLP textbook. Chapters 2 and 6 cover text normalization and TF-IDF with the depth of a 900-page book. Wildly comprehensive.
- scikit-learn's text feature extraction docs — the CountVectorizer and TfidfVectorizer API docs are surprisingly well-written and include the exact mathematical formulas with all the smoothing details. Better than most tutorials.
- spaCy's linguistic features guide (spacy.io/usage/linguistic-features) — walks through tokenization, POS tagging, and lemmatization with interactive examples. Insightful for understanding what a real NLP pipeline does under the hood.
- "The Classical Language Model" by Dan Jurafsky (Stanford CS124 lectures, on YouTube) — if you want the information-theoretic foundation behind IDF and why the log is there, Jurafsky explains it better than anyone. Unforgettable.
- NLTK Book (nltk.org/book) — free, hands-on, and covers every preprocessing step with working Python code. Chapter 3 on processing raw text is the O.G. tutorial on this topic.