The NLP Task Landscape
I avoided mapping out the NLP task landscape for an embarrassingly long time. Every time a new paper dropped — some new benchmark score, some new leaderboard entry — I'd skim the abstract, nod along, and quietly hope nobody asked me to explain the difference between token classification and sequence classification in an interview. I knew the words. I could not draw the picture. Finally the discomfort of not knowing what connects all these tasks under the hood grew too great. Here is that dive.
Natural language processing is a collection of tasks — not one task with many names. Each task asks a different question about text, and each question demands a different shape of output from a model. Some tasks label an entire document. Some label every single token. Some generate new text entirely. Understanding which shape goes with which question is the skeleton key to the whole field. It's what lets you look at a new business problem and immediately know which architecture to reach for.
Before we start: we're going to walk through classification, entity recognition, parsing, entailment, question answering, summarization, coreference, relation extraction, and the benchmarks that tie them all together. You don't need to have any prior knowledge of these tasks. We'll build each one from a concrete toy example, and the same running scenario will thread through all of them. We'll add concepts one piece at a time, with explanation.
This isn't a short journey, but I hope you'll be glad you came.
The Three Shapes of NLP
Our Running Example: ShopBot
Text Classification — Sorting Mail into Bins
Sentiment Analysis: The Trickiest Bin
Named Entity Recognition — Highlighting the Nouns That Matter
BIO Tagging: A Coloring Scheme for Tokens
Span-Based NER and SpanBERT
POS Tagging and Dependency Parsing — The Grammar Skeleton
Rest Stop
Semantic Similarity — Measuring Meaning Distance
Textual Entailment and Natural Language Inference
Question Answering — Pointing vs. Generating
Summarization — Pruning vs. Rewriting
Coreference Resolution — The Glue of Document Understanding
Relation Extraction — Building Knowledge from Text
The Benchmarks: GLUE, SuperGLUE, and SQuAD
Wrap-Up
Resources
The Three Shapes of NLP
Here's the insight that took me far too long to absorb: nearly every NLP task falls into one of three shapes, and the shape determines the architecture. I think of them as sorting, coloring, and rewriting.
Sequence classification takes an entire input — a sentence, a paragraph, a whole email — and assigns it a single label. Spam or not spam. Positive or negative. Topic A, B, or C. The model reads the whole thing, produces one vector (typically from a special [CLS] token in a transformer), and maps that vector to a label through a classification head. One input, one output. That's sorting.
Token classification assigns a label to every single token in the input. The sentence goes in, and out comes a tag for each word: this one is a person name, this one is a location, this one is a verb. The model produces a hidden state for each token, and each hidden state gets its own classification head. One input, many outputs — one per word. That's coloring.
Sequence-to-sequence (seq2seq) takes an input sequence and generates a completely new output sequence. The output can be shorter, longer, or the same length. Translation, summarization, and abstractive question answering all live here. An encoder reads the input; a decoder writes the output, one token at a time. That's rewriting.
I'll be honest — when I first internalized this three-way split, half the confusion I had about NLP evaporated overnight. Sentiment analysis? Sorting. NER? Coloring. Summarization? Rewriting. Every task in this chapter maps to one of these three shapes. Keep that in mind as we go — it's the compass for the whole journey.
But a compass alone doesn't show you the terrain. Let's build a map.
Our Running Example: ShopBot
Imagine we're building ShopBot, a customer support system for a small online store that sells three things: headphones, laptops, and phone cases. Customers send messages like these:
"My laptop screen is cracked and I want a refund."
"These headphones are amazing! Best purchase ever."
"I ordered a phone case from Austin, TX and it hasn't arrived."
"Does the warranty cover water damage?"
"The laptop was great but the delivery took forever."
Five messages. Five different things ShopBot needs to do with them: route them to the right team, detect sentiment, find the product and location names, answer questions, and summarize long complaint threads. Each of those needs is a different NLP task. Each task demands a different shape of model output. We'll keep coming back to these five messages as we work through the landscape.
Text Classification — Sorting Mail into Bins
The first thing ShopBot needs to do with an incoming message is decide where it goes. Is this a complaint? A product question? A shipping inquiry? A compliment? That's text classification — the most fundamental NLP task. One document in, one label out.
Think of it like a mailroom. Letters arrive. A person reads each one, stamps it with a category, and drops it into the corresponding bin. The letter doesn't change. It gets a label.
Let's trace through a tiny example. Suppose ShopBot has three bins: billing, shipping, and product. Our message is "My laptop screen is cracked and I want a refund." A classifier needs to read this and output a single label. The classical way is to count words. We build a bag-of-words vector — a list of word counts, one entry per word in our vocabulary — and feed it to a classifier like logistic regression or Naive Bayes. Words like "refund" and "cracked" push toward billing. Words like "arrived" and "shipping" would push toward shipping. The counts become votes.
A fancier version uses TF-IDF (term frequency–inverse document frequency), which downweights common words like "and" and "I" that appear everywhere and upweights rare, informative words like "refund" and "cracked." The idea: a word that appears in every document tells you nothing about which bin a document belongs to. A word that appears in only a few documents is a strong signal.
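To make that concrete, here is a minimal sketch using scikit-learn, with a tiny hypothetical training set standing in for real labeled tickets (a real system would need far more examples):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set: two examples per bin.
train_texts = [
    "My laptop screen is cracked and I want a refund.",
    "Please refund my order, I was charged twice.",
    "My package hasn't arrived yet.",
    "When will my order ship?",
    "Does this laptop come with a charger?",
    "Are these headphones noise-cancelling?",
]
train_labels = ["billing", "billing", "shipping", "shipping", "product", "product"]

# TF-IDF turns each message into a weighted word-count vector;
# logistic regression turns those counts into votes for a bin.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)
print(clf.predict(["I want my money back for the cracked case."]))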
The limitation is that bag-of-words throws away word order. "The laptop was great but the delivery was terrible" and "The delivery was great but the laptop was terrible" produce identical vectors despite meaning opposite things. For years, people tolerated this. Then convolutional and recurrent neural networks entered the picture, encoding word sequences as vectors that preserve order. TextCNN, introduced by Yoon Kim in 2014, slides small filters over sequences of word embeddings — filters of width 3, 4, and 5 words — to capture local phrase patterns like "want a refund" or "screen is cracked."
The modern approach collapses the problem even further. Fine-tune a pretrained transformer — BERT, RoBERTa, or similar — on your labeled data. The transformer reads the entire message, and its [CLS] token (a special token prepended to the input) accumulates a representation of the whole sequence. A single linear layer on top of that token maps it to your label set. With a few hundred labeled examples, this beats the classical approaches by a wide margin.
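Here is a minimal sketch of that setup with the HuggingFace transformers library, showing just the forward pass. The three-label bin set is our hypothetical ShopBot scheme, and the classification head is freshly initialized, so it still needs fine-tuning on labeled examples before its outputs mean anything:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels=3 for our hypothetical billing/shipping/product bins; the new
# classification head starts out random and must be fine-tuned before use.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

inputs = tokenizer("My laptop screen is cracked and I want a refund.",
                   return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 3): one score per bin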
And when you have zero labeled data? Zero-shot classification with a large language model. You write a prompt: "Classify this customer message as billing, shipping, or product: 'My laptop screen is cracked and I want a refund.'" The model gives you an answer. It won't match a fine-tuned specialist, but it requires no training data at all. That tradeoff — accuracy versus setup cost — is one you'll make constantly in production NLP.
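With the transformers pipeline, that prompt-style classification is one call. The model named here is one common choice, an NLI model (we'll see why entailment models power zero-shot classification later in this chapter):

from transformers import pipeline

# Zero-shot classification: no ShopBot-specific training data required.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "My laptop screen is cracked and I want a refund.",
    candidate_labels=["billing", "shipping", "product"],
)
print(result["labels"][0])  # the highest-scoring label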
Text classification is the mailroom metaphor. But some mail requires reading the emotion between the lines, not the topic on the label. That's where sentiment analysis comes in.
Sentiment Analysis: The Trickiest Bin
Sentiment analysis is text classification wearing a more nuanced costume. Instead of routing a message to a department, you're gauging the emotional temperature: is this customer happy, angry, or neutral?
Take our ShopBot messages. "These headphones are amazing! Best purchase ever." — positive. "My laptop screen is cracked and I want a refund." — negative. Those are easy. Now try: "The laptop was great but the delivery took forever." That single message carries positive sentiment toward the product and negative sentiment toward the shipping experience. This is aspect-based sentiment analysis — detecting sentiment toward specific aspects within the same text. Same document, two different emotions.
The classical approach leans on a sentiment lexicon — a dictionary that assigns polarity scores to individual words. VADER (Valence Aware Dictionary and sEntiment Reasoner) is the most widely used English lexicon. It knows that "amazing" is positive, "cracked" is negative, and that capitalization intensifies things ("GREAT" hits harder than "great"). But lexicons break on sarcasm. "Oh wonderful, my package is lost again" — every lexicon sees "wonderful" and smiles. Humans don't.
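You can see both the strength and the failure mode in a few lines (this sketch assumes the vaderSentiment package is installed):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# "compound" is a normalized score in [-1, 1]: positive, negative, or near zero.
print(analyzer.polarity_scores("These headphones are amazing! Best purchase ever."))
# Sarcasm trips the lexicon: "wonderful" pulls the score positive even though
# a human reads this message as a complaint.
print(analyzer.polarity_scores("Oh wonderful, my package is lost again"))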
I still get tripped up by sarcasm detection in my own projects. Models do too. Negation is another trap — "not bad" is positive despite containing "bad." And domain shift is real: a model trained on movie reviews learns that "unpredictable" is praise (for a thriller), but a restaurant review model learns it's a complaint (for a chef). The words are the same; the meaning flips based on context the model has never seen.
Modern sentiment analysis fine-tunes a transformer on labeled data for your specific domain. That domain-specificity is the key insight. A general-purpose sentiment model is a starting point, not a finished product. Always evaluate on data from your domain before shipping.
Sorting and sentiment tell us about the whole message. But ShopBot also needs to pick out specific names and places within that message. That's a different shape entirely.
Named Entity Recognition — Highlighting the Nouns That Matter
Consider our third ShopBot message: "I ordered a phone case from Austin, TX and it hasn't arrived." ShopBot needs to know that "phone case" is a product, and "Austin, TX" is a location. It's not enough to know the topic of the message — we need to reach inside the sentence and tag specific words.
This is Named Entity Recognition (NER), and it's a token classification task. Every token gets its own label: person, organization, location, product, date, or — for the majority of tokens — nothing at all.
Here's the thing that confused me for a long time: if every token gets a label, how do you mark multi-word entities? "Austin" and "TX" are two tokens, but they form one location. "Phone" and "case" are two tokens forming one product mention. You need a scheme that tells you where an entity starts and where it continues.
BIO Tagging: A Coloring Scheme for Tokens
The answer is BIO tagging (also called IOB tagging). Every token gets one of three kinds of tags:
B-XXX means "this token is the beginning of an entity of type XXX." I-XXX means "this token is inside an entity that already started." O means "this token is outside any entity — it's not part of a named entity."
Let's trace through our ShopBot message:
Token Tag
───────── ──────
I O
ordered O
a O
phone B-PROD
case I-PROD
from O
Austin B-LOC
, I-LOC
TX I-LOC
and O
it O
hasn't O
arrived O
"Phone" gets B-PROD because it begins a product entity. "Case" gets I-PROD because it continues that same entity. "Austin" starts a new location entity (B-LOC), and the comma and "TX" continue it (I-LOC). Everything else is O.
Why do we need the B tag at all? Why not tag everything as I-LOC? Because adjacent entities of the same type would merge. Imagine a shipping note that reads "Austin Dallas" — two cities side by side with no word between them. With I tags alone, both tokens would look like one continuous location. The B tag on "Dallas" signals: this is a new entity, separate from the one before it.
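Decoding BIO tags back into entity spans is a small exercise that makes the scheme click. A minimal sketch:

def bio_to_spans(tokens, tags):
    """Collect (entity_type, token_list) spans from a BIO tag sequence."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins here
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(token)      # the current entity continues
        else:                             # an O tag (or stray I-) ends any entity
            current = None
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["I", "ordered", "a", "phone", "case", "from", "Austin", ",", "TX"]
tags   = ["O", "O", "O", "B-PROD", "I-PROD", "O", "B-LOC", "I-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# [('PROD', 'phone case'), ('LOC', 'Austin , TX')]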
Classical NER systems used Conditional Random Fields (CRFs) with hand-crafted features — is the word capitalized? Does it appear in a gazetteer of city names? What are the surrounding words? CRFs are powerful because they model transition probabilities between tags. They learn constraints like "I-PER cannot follow B-LOC" — a person tag can't continue a location tag. At inference time, the Viterbi algorithm finds the globally optimal tag sequence, respecting all these constraints at once.
The neural breakthrough was the BiLSTM-CRF — a bidirectional LSTM that reads the sentence in both directions, building context-aware representations for each token, followed by a CRF layer that enforces valid tag transitions. This was the dominant NER architecture from roughly 2015 to 2018. Modern NER fine-tunes BERT: feed the sentence through the transformer, take each token's hidden state, and classify it through a linear layer. Some practitioners still add a CRF on top of BERT; others find that BERT's contextual representations are rich enough that invalid tag transitions rarely appear even without one.
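In practice, running a pretrained NER model is a few lines with the transformers pipeline. The default checkpoint is trained on news-style entities (persons, organizations, locations), so it won't know our custom PROD type without fine-tuning:

from transformers import pipeline

ner = pipeline("token-classification", aggregation_strategy="simple")
# aggregation_strategy="simple" merges B-/I- word pieces back into whole entities.
for ent in ner("I ordered a phone case from Austin, TX and it hasn't arrived."):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))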
But BIO tagging has a blind spot.
Span-Based NER and SpanBERT
Consider the entity "Bank of America." It's an organization. But "America" inside it is also a location. BIO tagging can assign each token only one tag, so it can't represent both entities simultaneously. This is the nested NER problem — entities that live inside other entities.
Span-based approaches solve this by changing the question. Instead of asking "what tag does each token get?", they ask "for every possible span of tokens, is this span an entity?" The model enumerates all candidate spans up to a maximum length — say, 10 tokens — and classifies each span independently. A span can be an entity even if it overlaps with another span that's also an entity.
SpanBERT, introduced by Joshi et al. in 2019, was pretrained with a span-level objective: instead of masking individual tokens like BERT, it masks contiguous spans and trains the model to predict the masked content using the tokens at the span boundaries. This gives SpanBERT particularly strong representations for contiguous text spans, making it a natural fit for span-based NER (and for extractive QA and coreference resolution, as we'll see later).
The tradeoff is computational cost. For a sentence with n tokens, there are O(n²) possible spans. For short sentences, that's fine. For long documents, you need pruning strategies — only consider spans that score above a threshold, or limit the maximum span length. But the ability to handle nested entities makes span-based approaches essential for domains like biomedicine, where nested mentions are the norm (a gene mentioned inside a protein complex mentioned inside a pathway).
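The enumeration itself is trivial; the cost is in scoring every candidate. A sketch of the candidate generation, just to show where the O(n²) comes from:

def candidate_spans(tokens, max_len=10):
    """All contiguous spans up to max_len tokens, as (start, end) index pairs."""
    n = len(tokens)
    return [(i, j) for i in range(n) for j in range(i + 1, min(i + max_len, n) + 1)]

tokens = "I ordered a phone case from Austin , TX".split()
spans = candidate_spans(tokens)
print(len(tokens), "tokens ->", len(spans), "candidate spans to score")  # 9 -> 45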
NER tells us what a word is. But language has more structure than that. Words play grammatical roles, and those roles form a skeleton that holds the sentence together.
POS Tagging and Dependency Parsing — The Grammar Skeleton
Part-of-speech (POS) tagging is another token classification task: every word gets a grammatical label. Noun, verb, adjective, preposition, determiner. Take "My laptop screen is cracked." The tags are:
Token POS
──────── ────
My PRON
laptop NOUN
screen NOUN
is AUX
cracked ADJ
The Universal POS Tagset defines 17 coarse-grained tags that work across languages: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, and a handful of others. Finer-grained tagsets like the Penn Treebank set distinguish between past-tense verbs (VBD), gerunds (VBG), third-person singular present (VBZ), and so on — 45 tags in all. The universal tagset trades granularity for cross-lingual consistency.
POS tagging sounds like a solved problem — modern taggers achieve 97%+ accuracy — but that remaining 3% hides all the interesting cases. "Book" is a noun in "I read a book" and a verb in "Book a flight." "Running" is a verb in "She is running" and a noun in "Running is fun." The ambiguous cases are exactly the ones that trip up downstream tasks.
Dependency parsing goes one level deeper. It doesn't ask "what part of speech is this word?" but rather "which word does this word depend on, and what's the relationship?" Every word in a sentence (except the root, usually the main verb) has exactly one head — the word it modifies or is governed by. The relationship between a word and its head is called a dependency relation.
Let's parse "My laptop screen is cracked":
cracked ← root (the main verb, anchor of the sentence)
├── screen ← nsubj (the subject of "cracked")
│ ├── My ← nmod:poss (possessive modifier of "screen")
│ └── laptop ← compound (compound modifier of "screen")
└── is ← cop (copula — the linking verb)
The tree shows that "screen" is the subject of "cracked," "My" possessively modifies "screen," and "laptop" is a compound modifier of "screen." The Universal Dependencies framework standardizes these relation labels across 100+ languages, making it possible to build parsers that work across languages with the same annotation scheme.
Why does any of this matter for ShopBot? Because dependency parsing lets you extract structured meaning. If a customer writes "The headphones from your store have terrible sound quality," parsing reveals that "terrible" modifies "quality" (not "headphones" or "store"), and that "sound quality" is the object of "have," whose subject is "headphones." Without that structure, you might associate "terrible" with the wrong noun — and route the complaint to the wrong team.
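spaCy makes this kind of extraction a few lines (this assumes the small English model has been installed with python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The headphones from your store have terrible sound quality.")

# Token-level view: part of speech, dependency relation, and head word.
for token in doc:
    print(f"{token.text:<10} {token.pos_:<6} {token.dep_:<10} head={token.head.text}")

# Pull out which noun each adjective actually modifies:
for token in doc:
    if token.dep_ == "amod":  # adjectival modifier relation
        print(token.text, "modifies", token.head.text)  # terrible modifies quality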
I'll be honest — I spent a long time thinking POS tagging and parsing were relics of an era before transformers could handle everything end-to-end. And for many production tasks, that's true. But in low-resource languages, in domain-specific applications where you lack training data, and in systems that need explainable structure (legal tech, medical NLP), these tasks remain essential components. Libraries like spaCy ship fast, accurate parsers that you can bolt on to any pipeline.
Rest Stop
Congratulations on making it this far. Take a breath if you need one. You can stop here and already have a solid mental model.
Here's what you now know: NLP tasks come in three shapes — sorting (sequence classification), coloring (token classification), and rewriting (seq2seq). Text classification and sentiment analysis sort whole documents. NER and POS tagging color individual tokens. You know how BIO tagging works, why the B tag exists, and how span-based approaches handle nested entities. You've seen how dependency parsing reveals the grammatical skeleton of a sentence.
That's a real foundation. If someone asks you in an interview "what's the difference between sequence classification and token classification?", you can draw it on a whiteboard with ShopBot examples.
But the landscape has more territory. We haven't tackled the tasks that involve comparing two texts, answering questions, or compressing long documents. We also haven't talked about the benchmarks that the entire field uses to measure progress. Those are coming next.
If you want the short version: semantic similarity measures how close two sentences are in meaning. Textual entailment asks whether one sentence logically follows from another. Question answering either points to an answer in a passage or generates one from scratch. Summarization does the same split — extract sentences or generate new ones. There. You're 70% of the way there.
But if the nagging feeling of "how does that actually work?" is pulling at you, read on.
Semantic Similarity — Measuring Meaning Distance
ShopBot gets two tickets:
Ticket A: "My headphones stopped working after one week."
Ticket B: "The earbuds I bought died within seven days."
These share almost no words, but they describe the same problem. A keyword-matching system would say these are unrelated. A human would say they're near-duplicates. Semantic textual similarity (STS) tries to give machines the human answer — a continuous score from 0 (completely unrelated) to 5 (identical in meaning).
The naive approach: compute a word embedding (Word2Vec, GloVe) for each word, average them into a single vector per sentence, and compute cosine similarity between the two vectors. This works better than you'd expect, but averaging destroys word order. "Dog bites man" and "Man bites dog" produce identical sentence vectors. That's unsatisfying.
Sentence-BERT (SBERT), introduced by Reimers and Gurevych in 2019, was the breakthrough that changed how I think about text similarity. SBERT fine-tunes BERT with a Siamese architecture: feed both sentences through the same BERT encoder, pool each output into a single vector (typically by averaging all token embeddings), and train with a contrastive or regression objective. The resulting model produces dense sentence embeddings in a single forward pass. Measuring similarity then becomes a cosine similarity between two vectors — which takes microseconds, even for millions of pairs.
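With the sentence-transformers library this is a handful of lines. The checkpoint named here, all-MiniLM-L6-v2, is one small, commonly used choice; any SBERT-style model works the same way:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "My headphones stopped working after one week.",
    "The earbuds I bought died within seven days.",
])
# Cosine similarity close to 1.0 means near-duplicates, despite few shared words.
print(util.cos_sim(embeddings[0], embeddings[1]))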
Another way to think about it: SBERT gives every sentence an address in a high-dimensional space. Sentences with similar meanings live at nearby addresses. Sentences with different meanings live far apart. Finding similar documents becomes a nearest-neighbor lookup — the same kind of operation that powers image similarity search, but for text.
The STS Benchmark is the standard evaluation dataset: sentence pairs from captions, news headlines, and forum posts, each scored by human annotators. Models are evaluated by how well their predicted similarity scores correlate with the human scores (Spearman or Pearson correlation). State-of-the-art models achieve correlations above 0.90, which sounds good — but I've found that real-world performance often falls behind benchmark numbers, especially on domain-specific text that doesn't look like the training data. Always benchmark on your own data.
Measuring how similar two sentences are is useful. But sometimes the question isn't "how similar?" but rather "does one follow from the other?" That's a fundamentally different question, and it has its own task.
Textual Entailment and Natural Language Inference
A customer writes to ShopBot: "I bought this laptop last Tuesday and the screen is cracked." The warranty policy says: "Products purchased within 30 days are eligible for free replacement." Does the policy cover this customer?
This is natural language inference (NLI), also called textual entailment. You have two pieces of text — a premise and a hypothesis — and you need to decide: does the premise entail the hypothesis, contradict it, or neither?
Let's make it concrete with three tiny examples:
Premise: "The package was delivered on Monday."
Hypothesis: "The customer received a delivery."
Label: ENTAILMENT (if the package was delivered, a delivery happened)
Premise: "The package was delivered on Monday."
Hypothesis: "The package has not been shipped yet."
Label: CONTRADICTION (delivered and not shipped can't both be true)
Premise: "The package was delivered on Monday."
Hypothesis: "The customer was satisfied with the product."
Label: NEUTRAL (delivery says nothing about satisfaction)
NLI is a sequence classification task — but with a twist. The input is a pair of sentences, not one sentence. You concatenate the premise and hypothesis with a separator token, feed the pair through a transformer, and classify the [CLS] token into one of three labels: entailment, contradiction, or neutral.
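A minimal sketch with a publicly available MNLI-tuned model (roberta-large-mnli is one standard choice; reading the label names off the model config avoids hardcoding their order):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# Premise and hypothesis are passed as a sentence pair; the tokenizer
# inserts the separator token between them.
inputs = tokenizer(
    "The package was delivered on Monday.",
    "The package has not been shipped yet.",
    return_tensors="pt",
)
probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
print({model.config.id2label[i]: round(float(p), 3) for i, p in enumerate(probs)})
# Expect CONTRADICTION to dominate for this pair.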
What makes NLI powerful — and this took me a while to appreciate — is that it's a building block for other tasks. Need zero-shot text classification? Frame it as entailment. For each candidate label, create a hypothesis: "This message is about shipping." Feed it with the message as the premise. The label that scores highest on entailment wins. This is how many zero-shot classification systems work under the hood — they're running NLI in disguise.
The major datasets for NLI are SNLI (Stanford Natural Language Inference, 570K sentence pairs) and MultiNLI (Multi-Genre NLI, 433K pairs across ten genres). MultiNLI is one of the tasks in the GLUE benchmark, which we'll get to at the end.
The hard cases in NLI mirror the hard cases in human reasoning. Numerical reasoning ("The train arrives at 3:15" — does it entail "The train arrives in the afternoon"?), world knowledge ("The painting is by Picasso" — does it entail "The painting is cubist"?), and multi-step inference all remain challenging. I'm still developing my intuition for where models fail at entailment, and honestly, so is the field.
Question Answering — Pointing vs. Generating
A customer asks ShopBot: "Does the warranty cover water damage?" ShopBot has a paragraph of warranty text to search through. How should it answer?
There are two fundamentally different approaches, and confusing them is a common source of bugs in production systems.
Extractive QA points to the answer. The model reads the question and a passage, then identifies a contiguous span in the passage that answers the question. The answer is always a verbatim quote — the model never generates new words. Think of it like a highlighter: the model highlights the relevant span in the text and says "here's your answer."
Let's walk through the mechanics. The model receives the question and the passage concatenated together:
[CLS] Does the warranty cover water damage? [SEP] Our standard warranty
covers manufacturing defects for 12 months. Water damage and physical
damage are not covered under any warranty plan. [SEP]
Two separate linear heads sit on top of the transformer. One predicts the start position of the answer span. The other predicts the end position. Each head produces a score for every token in the passage. Softmax converts the scores into probabilities. The model picks the span with the highest combined start-end probability — in this case, "Water damage and physical damage are not covered under any warranty plan."
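Here's what that looks like with the transformers question-answering pipeline. The default checkpoint is a SQuAD-tuned model, and it may well return a tighter span than the full sentence:

from transformers import pipeline

qa = pipeline("question-answering")
result = qa(
    question="Does the warranty cover water damage?",
    context=(
        "Our standard warranty covers manufacturing defects for 12 months. "
        "Water damage and physical damage are not covered under any warranty plan."
    ),
)
# The result includes the answer text plus start/end character offsets
# back into the passage: provenance for free.
print(result["answer"], result["start"], result["end"])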
The SQuAD dataset (Stanford Question Answering Dataset) is the benchmark that popularized this approach. SQuAD 1.1 contains 100K+ question-answer pairs where every question has an answer in the passage. SQuAD 2.0 added a twist: some questions are unanswerable, and the model must learn to say "I don't know" — which turns out to be harder than finding the answer when it exists.
Abstractive QA generates the answer in its own words. Instead of pointing to a span, the model reads the question and context, then produces new text via an encoder-decoder architecture (T5, BART, or a large language model). The answer might paraphrase, combine information from multiple sentences, or express the same fact differently.
The tradeoff is trust versus fluency. Extractive QA gives you provenance — you can always point to exactly where the answer came from. Abstractive QA gives you natural, conversational answers — but it can hallucinate, generating fluent text that isn't supported by the source. In legal or medical applications where traceability matters, extractive QA is the safer choice. For chatbots and conversational interfaces where naturalness matters more, abstractive approaches are the default.
Open-domain QA goes further: there's no passage provided at all. The model must find the answer from a large corpus — all of Wikipedia, or your company's entire knowledge base. The standard architecture is a retrieve-then-read pipeline: a retriever fetches the most relevant passages, then a reader extracts or generates an answer from those passages. This is the pattern behind retrieval-augmented generation (RAG), which we cover in more depth in a later section.
Extractive QA is "pointing" — a span classification task that predicts start and end positions. Abstractive QA is "rewriting" — a seq2seq task that generates new text. Same question, different architecture shapes.
Summarization — Pruning vs. Rewriting
ShopBot has a long complaint thread — fifteen back-and-forth messages between a customer and two support agents. A manager needs a three-sentence summary. The same extractive-vs.-abstractive split from QA appears here, and for the same reason.
Extractive summarization selects the most important sentences from the original document and concatenates them. The text is never rewritten — only pruned. The classic algorithm is TextRank, which works by analogy to Google's PageRank. Build a graph where each sentence is a node. Connect sentences with edges weighted by their similarity (cosine similarity of TF-IDF vectors is the traditional choice). Run a PageRank-like algorithm to score each sentence by its importance — sentences that are similar to many other important sentences get high scores. Pick the top-N sentences. That's your summary.
I find the analogy useful: PageRank treats web pages as important if many important pages link to them. TextRank treats sentences as important if many important sentences are similar to them. The logic is identical; the domain is different.
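A compact sketch of TextRank, using scikit-learn for the similarity matrix and networkx for the PageRank step (the four sentences are a made-up stand-in for a real thread):

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The customer reported a cracked laptop screen on arrival.",
    "Support offered a replacement laptop under warranty.",
    "The replacement shipped two days later.",
    "The customer asked about headphone compatibility in passing.",
]

# Edge weights: pairwise TF-IDF cosine similarity between sentences.
tfidf = TfidfVectorizer().fit_transform(sentences)
graph = nx.from_numpy_array(cosine_similarity(tfidf))
scores = nx.pagerank(graph)

# Keep the top-2 sentences, restored to their original order.
top = sorted(sorted(scores, key=scores.get, reverse=True)[:2])
print(" ".join(sentences[i] for i in top))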
Abstractive summarization generates new text that captures the document's meaning. This is a seq2seq task: an encoder reads the document, and a decoder generates the summary token by token. BART was pretrained by corrupting text (deleting, shuffling, masking sentences) and learning to reconstruct the original — a pretraining objective that directly mirrors the skill needed for summarization. Pegasus went further: it masked entire sentences during pretraining and trained the model to generate them, which is even closer to the downstream task.
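With a pretrained checkpoint, abstractive summarization is again a single pipeline call. The model named here, facebook/bart-large-cnn, is BART fine-tuned on news summaries, so its style on support threads may be imperfect without further fine-tuning:

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
thread = (
    "Customer: I bought a laptop last week and it arrived with a cracked screen. "
    "Agent: I'm sorry about that. We'll send you a replacement. "
    "Customer: Thanks. When will it arrive? "
    "Agent: The replacement ships within two business days."
)
print(summarizer(thread, max_length=40, min_length=10, do_sample=False)[0]["summary_text"])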
Evaluation uses ROUGE scores (Recall-Oriented Understudy for Gisting Evaluation). ROUGE-1 measures unigram overlap between the generated summary and a reference summary. ROUGE-2 measures bigram overlap. ROUGE-L measures the longest common subsequence. But ROUGE has a well-known limitation: two summaries can express the same meaning with entirely different words and score poorly. Newer metrics like BERTScore compare summaries using embedding similarity rather than word overlap, which better captures semantic equivalence.
In practice, LLM-based summarization is now the default for many applications. But for processing millions of documents — say, summarizing every customer support thread in a company's history — a fine-tuned BART model at 400M parameters is orders of magnitude cheaper than API calls to a 100B+ parameter model. The architecture choice always comes back to the tradeoff between quality, cost, and scale.
Coreference Resolution — The Glue of Document Understanding
Here's a ShopBot support thread that illustrates a subtle problem:
Customer: "I bought a laptop last week. It arrived with a cracked screen."
Agent: "I'm sorry about that. We'll send you a replacement."
Customer: "When will it arrive?"
The word "it" appears three times. The first "it" refers to the laptop. The second "it" — in the agent's implicit context — refers to the replacement. The third "it" refers to the replacement, not the laptop. A human reads this and instantly connects the dots. A machine needs to do the same to answer "When will it arrive?" correctly.
Coreference resolution is the task of figuring out which mentions in a text refer to the same real-world entity. "The laptop," "it" (first occurrence), and "a laptop" are one cluster. "A replacement" and "it" (third occurrence) are another cluster. The task has two stages: mention detection (find all noun phrases, pronouns, and named entities that could refer to something) and mention clustering (group the ones that refer to the same thing).
Classical systems used hand-crafted features: does the pronoun's gender match the candidate antecedent? Do the numbers agree (singular vs. plural)? How far apart are they in the text? These feature-based systems were brittle. "The CEO" and "she" co-refer only if you know the CEO is female — information that might not appear anywhere in the local context.
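To see how brittle this is, here is a toy resolver built on just two of those features, recency and number agreement. It is purely illustrative; real feature-based systems used dozens of features:

PRONOUNS = {"it": "sg", "he": "sg", "she": "sg", "they": "pl"}

def resolve_pronouns(mentions):
    """mentions: (text, number) pairs in document order, e.g. ("a laptop", "sg").
    Links each pronoun to the nearest preceding non-pronoun mention
    whose grammatical number agrees."""
    links = []
    for i, (text, number) in enumerate(mentions):
        if text.lower() in PRONOUNS:
            for j in range(i - 1, -1, -1):
                cand_text, cand_number = mentions[j]
                if cand_text.lower() not in PRONOUNS and cand_number == number:
                    links.append((text, "->", cand_text))
                    break
    return links

mentions = [("a laptop", "sg"), ("it", "sg"), ("a replacement", "sg"), ("it", "sg")]
print(resolve_pronouns(mentions))
# Recency happens to get both "it"s right here, but the resolver has no idea
# what the words mean; reorder the mentions and it confidently links the wrong pair.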
The landmark neural approach is the end-to-end model by Lee et al. (2017). Instead of using a pipeline — first detect mentions, then cluster them — this model does both jointly. It considers every possible span in the document as a potential mention, scores each span's "mentionness" (how likely is this span to refer to an entity?), and for each mention, ranks all possible antecedents. The key innovation: because mention detection and clustering happen in one pass, errors in mention detection don't propagate and compound. The model uses span representations — which is why SpanBERT, with its span-focused pretraining, became the preferred backbone for coreference.
Why does coreference matter for ShopBot? Because almost every other task depends on it. Summarization can't produce a coherent summary if it doesn't know that "it" and "the laptop" refer to the same thing. Question answering can't connect information spread across sentences. Relation extraction can't build a knowledge graph from a document where key entities are referred to by pronouns instead of names. Coreference resolution is the glue that holds document-level understanding together.
I'll be honest — coreference is one of those tasks where I read the papers and nod along, then try to implement it and discover that the devil lives in the details. The scoring metrics alone (MUC, B³, CEAF, LEA) each measure different aspects of cluster quality, and they can disagree with each other. What keeps me humble about coreference is that beneath high-level explanations like the one I just gave, the fine-grained decisions about which spans to cluster remain genuinely hard, even for state-of-the-art models.
Relation Extraction — Building Knowledge from Text
NER tells us that "Austin, TX" is a location and "phone case" is a product. But it doesn't tell us the relationship between them. Relation extraction fills that gap: given entities in a sentence, identify the semantic relationship that connects them.
From our ShopBot message "I ordered a phone case from Austin, TX and it hasn't arrived," a relation extraction system should produce the triple: ("phone case", ordered_from, "Austin, TX"). This is how knowledge graphs are built — every fact in Wikidata or Google's Knowledge Graph was either curated by hand or extracted from text by a system like this.
The simplest approach is supervised relation extraction: label a training set with entity pairs and their relationships, then train a classifier. The features typically include the entities themselves, the words between them, their dependency path in the parse tree, and their entity types. A person and a location are more likely linked by "born_in" than by "manufactured_by." The downside: you can only extract relationship types you've labeled in training, and labeled data is expensive.
Distant supervision is a clever workaround that I find elegant despite its messiness. You take an existing knowledge base (Wikidata, for example), find all entity pairs that have a known relationship, then search a large text corpus for sentences that mention both entities. You assume — often incorrectly, but usefully on average — that those sentences express the known relationship. This gives you noisy but abundant training data without any manual labeling. The noise is the price of scale.
Modern neural approaches encode the sentence with a transformer and insert special marker tokens around entities to help the model attend to their boundaries:
Input: [CLS] I ordered a [E1] phone case [/E1] from [E2] Austin, TX [/E2] [SEP]
The model sees exactly where each entity starts and ends,
attends to the tokens between and around them, and classifies
the relationship between E1 and E2.
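A sketch of how those markers get wired in with transformers. The marker strings and the four-way relation set are hypothetical, and the classification head is untrained until you fine-tune on labeled triples:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Register the entity markers so they aren't split into word pieces.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]}
)

# Hypothetical 4-way relation set: ordered_from, shipped_to, made_by, no_relation.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4
)
model.resize_token_embeddings(len(tokenizer))  # make room for the new markers

text = "I ordered a [E1] phone case [/E1] from [E2] Austin, TX [/E2]"
logits = model(**tokenizer(text, return_tensors="pt")).logits  # one score per relation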
Relation extraction is sequence classification with a twist: instead of classifying the whole document, you classify the relationship between a specific pair of entities within it. The shape is still "one input, one label" — but the input is entity-pair-aware.
The Benchmarks: GLUE, SuperGLUE, and SQuAD
The tasks we've walked through aren't measured in isolation. The field converged on standardized benchmarks to compare models across tasks — and understanding these benchmarks is essential because nearly every NLP paper you read will reference them.
GLUE (General Language Understanding Evaluation), introduced in 2018, is a collection of nine tasks:
Task    What it tests                                              Shape
─────   ────────────────────────────────────────────────────────   ────────────────────────────
CoLA    Linguistic acceptability (is this grammar okay?)           Sequence classification
SST-2   Sentiment (positive/negative)                              Sequence classification
MRPC    Paraphrase detection (same meaning?)                       Sentence-pair classification
QQP     Question equivalence (same question?)                      Sentence-pair classification
STS-B   Semantic similarity (how similar, 0-5?)                    Sentence-pair regression
MNLI    Natural language inference (3-way)                         Sentence-pair classification
QNLI    QA as entailment (does the sentence contain the answer?)   Sentence-pair classification
RTE     Textual entailment (2-way)                                 Sentence-pair classification
WNLI    Winograd schema (coreference)                              Sentence-pair classification
Notice the pattern: most GLUE tasks are sequence classification or sentence-pair classification — sorting tasks. GLUE was designed to test language understanding, and classification is the cleanest way to measure understanding without the confounds of text generation.
BERT demolished the GLUE leaderboard when it was released in 2018. Within a year, models surpassed the estimated human performance baseline. The benchmark had become too easy.
SuperGLUE, released in 2019, raised the bar with harder tasks: BoolQ (yes/no reading comprehension), CB (commitment bank — a nuanced 3-way entailment task with very few training examples), COPA (causal reasoning), WiC (word-in-context — does the same word have the same meaning in two different sentences?), WSC (Winograd Schema Challenge — coreference requiring world knowledge), MultiRC (multi-sentence reading comprehension), and ReCoRD (reading comprehension with commonsense reasoning). SuperGLUE tasks require deeper reasoning: commonsense, causal thinking, and world knowledge — things that surface-level pattern matching can't solve.
SQuAD (Stanford Question Answering Dataset) stands alone as the benchmark for extractive QA. Version 1.1 has 100K+ question-answer pairs where every question is answerable from the passage. Version 2.0 mixed in 50K+ unanswerable questions — forcing models to learn when to say "I don't know." SQuAD scores are reported as exact match (EM, the percentage of predictions that match the ground truth exactly) and F1 (the average word-level overlap between prediction and ground truth).
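The two metrics are simple enough to sketch in a few lines. Note that the official SQuAD evaluation script also strips punctuation and articles before comparing; this version skips that normalization:

from collections import Counter

def exact_match(prediction, gold):
    return prediction.strip().lower() == gold.strip().lower()

def squad_f1(prediction, gold):
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Word-level overlap, counting duplicates via multiset intersection.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(squad_f1("water damage is not covered",
               "not covered under any warranty plan"))  # partial credit, ~0.36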
I should note: benchmarks have real limitations. High GLUE scores don't mean a model understands language in any deep sense — models can exploit dataset artifacts, spurious correlations, and annotation biases. SuperGLUE is harder, but still falls short of testing the kind of reasoning humans do effortlessly. The benchmarks measure a specific slice of capability, not general understanding. They're useful as thermometers, not as definitions of intelligence.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with three shapes — sorting, coloring, and rewriting — and used them as a compass to navigate the entire NLP task landscape. We built ShopBot and traced each task through it: text classification sorted messages into bins, sentiment analysis read emotional temperature, NER highlighted product and location names with BIO tags, POS tagging and dependency parsing exposed grammatical structure, semantic similarity measured meaning distance between duplicate tickets, NLI determined whether policy text entails a customer's claim, extractive QA pointed to answer spans while abstractive QA generated them, summarization pruned or rewrote long threads, coreference resolution tracked pronouns across a conversation, and relation extraction built structured knowledge from unstructured text. We closed with the benchmarks — GLUE, SuperGLUE, and SQuAD — that measure progress across all of these.
My hope is that the next time you encounter a new NLP problem — whether it's routing customer messages, extracting medical entities, or building a question-answering system — instead of guessing at architectures, you'll recognize the shape of the task, know which benchmarks to reference, and have a solid mental model of what's happening under the hood.
Resources
Jurafsky & Martin, Speech and Language Processing (3rd edition draft) — The most comprehensive NLP textbook, freely available online. Covers every task in this chapter and more. If you want the deep academic treatment, start here.
The GLUE Benchmark paper (Wang et al., 2018) — The original paper that defined multi-task evaluation for NLP. Read this to understand what each task measures and why it was chosen.
Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers" (2018) — The O.G. paper that changed how every task in this chapter is approached. Essential reading.
Joshi et al., "SpanBERT: Improving Pre-training by Representing and Predicting Spans" (2019) — The span-focused pretraining paper. Insightful for understanding why span representations matter for NER, QA, and coreference.
Reimers & Gurevych, "Sentence-BERT" (2019) — The paper that made sentence embeddings practical. Wildly useful if you're building semantic search or similarity systems.
HuggingFace NLP Course (huggingface.co/learn/nlp-course) — Hands-on, code-first walkthroughs of every major NLP task using the transformers library. Invaluable for building practical intuition.