Retrieval-Augmented Generation

Chapter 11: Natural Language Processing
Hallucination · Retrieval · Chunking · Reranking · Vector Search · Evaluation

I put off building a RAG system for an embarrassingly long time. Every time someone asked me why their chatbot was making up facts, I'd mutter something about "grounding" and "retrieval pipelines" and then quietly change the subject. I knew the rough shape — search some documents, stuff them into the prompt — but I didn't actually understand what was happening under the hood. Why did some chunks retrieve well while others didn't? What was a vector database actually doing? Why did reranking make such a dramatic difference? Finally the discomfort of not knowing grew too great, so I dove in. Here is that dive.

Retrieval-Augmented Generation — RAG — was introduced by Lewis et al. in a 2020 paper that combined a neural retriever with a seq2seq generator, trained end-to-end. The core idea is deceptively straightforward: before asking a language model to answer a question, first go find the relevant documents and hand them over. The model reads the documents instead of guessing from memory. In practice today, most production RAG systems use off-the-shelf embedding models and separate LLMs, and the real engineering challenge is everything in between.

Before we start, a heads-up. We're going to be talking about embeddings, similarity metrics, approximate nearest-neighbor search, and some information retrieval theory, but you don't need to know any of it beforehand. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

The journey ahead

Why LLMs hallucinate
The retrieve-then-generate idea
Our running example: the company wiki bot
Turning documents into numbers
Similarity: cosine, dot product, and the geometry of meaning
Keyword search and why it still matters
Hybrid search: getting the best of both
Chopping documents into chunks
Vector databases and the HNSW trick
Rest stop
The reranking stage: bi-encoders vs. cross-encoders
Building the prompt with retrieved context
Advanced RAG: HyDE, Self-RAG, and Corrective RAG
Evaluating RAG: the RAGAS framework
Production realities: latency, cost, and freshness
Wrap-up
Resources

Why LLMs Hallucinate

To understand why RAG exists, we first need to understand the problem it solves. And the problem is hallucination — a language model producing text that sounds confident and plausible but is factually wrong.

Here's the root cause. A language model is a next-token predictor. Given the tokens so far, it picks the most probable next token. It has been trained on billions of documents, so it has absorbed an enormous amount of world knowledge into its weights. But it learned that knowledge the way you might learn trivia at a loud party — catching fragments of conversation, absorbing patterns, occasionally mixing up who said what. It doesn't have an internal database it can query. It doesn't have a citation index. It has statistical patterns compressed into floating-point numbers.

Imagine you ask the model: "What was Acme Corp's Q3 revenue?" The model has never seen Acme Corp's financial data — it wasn't in the training set. But the model can't say "I don't know" the way a lookup table would return null. It's a text-completion engine. Given the pattern "Acme Corp's Q3 revenue was $", the most probable next token is... some number. So it produces one. Confidently. With decimal places.

There are three specific flavors of hallucination that matter for practical systems. First, knowledge cutoff: the model was trained on data up to some date, and everything after that is invisible. Second, knowledge confusion: the model has seen similar-looking facts about different entities and blurs them together — it might give you Microsoft's revenue when you asked about Acme's. Third, confabulation: when the model has no relevant knowledge at all, it generates plausible-sounding fiction, sometimes even fabricating citations to papers that don't exist.

I'll be honest — the first time I saw a model invent a citation with a fake DOI, real-looking author names, and a plausible journal title, it shook me. That's when I realized this wasn't a minor nuisance. It was a fundamental architectural limitation.

The Retrieve-Then-Generate Idea

So how do you fix a system that guesses when it should look things up? You give it something to look things up in.

The core insight of RAG is to split the problem into two stages. When a user asks a question, the system first retrieves relevant documents from a knowledge base — your company wiki, your product docs, your research papers, whatever you have. Then it hands those documents to the language model as additional context in the prompt, and the model generates an answer grounded in what it was given to read.

Think of it like the difference between a closed-book exam and an open-book exam. In a closed-book exam (a vanilla LLM), you have to recall everything from memory, and if your memory is fuzzy, you guess. In an open-book exam (RAG), you can flip to the relevant page, read it, and then write your answer. You might still misinterpret what you read, but at least you're working from real source material instead of hazy recollections.

We'll use this open-book analogy throughout. It keeps coming back.

Our Running Example: The Company Wiki Bot

To make all of this concrete, let's build something. Imagine we're building an internal Q&A bot for a small company. The company has a wiki with three pages:

Page 1 — "Refund Policy": "Customers can request a full refund within 30 days of purchase. After 30 days, only store credit is offered. Digital products are non-refundable once the download link has been accessed."

Page 2 — "Shipping Info": "Standard shipping takes 5–7 business days. Express shipping takes 1–2 business days and costs an additional $15. International orders may incur customs fees."

Page 3 — "Product Specs": "The Widget Pro weighs 340g, has a 6-inch display, and runs on a 4000mAh battery. It supports USB-C charging and is rated IP67 for water resistance."

Three pages. Three topics. A user types: "Can I get my money back for the Widget Pro?" Our job is to find the right page (the refund policy), hand it to the language model, and get an accurate answer. That's the whole game. Everything that follows — embeddings, vector search, chunking, reranking — is about playing this game well at scale, when you have not three pages but three million.

Turning Documents into Numbers

To find which wiki page is relevant to the user's question, we need a way to measure similarity between a question and a document. And to measure similarity, we need to convert text into numbers — specifically, into dense vectors called embeddings.

An embedding model (also called a bi-encoder) is a neural network that takes a piece of text and produces a fixed-length vector of floating-point numbers — maybe 384 numbers, maybe 1024, depending on the model. The key property: texts that mean similar things end up with similar vectors. "Can I get my money back?" and "What is your refund policy?" would produce vectors that point in roughly the same direction, even though they share almost no words.

Let's trace through our wiki bot. We take each of our three wiki pages and run them through the embedding model. Out come three vectors — let's call them v_refund, v_shipping, and v_product. These get stored in our index. When the user asks "Can I get my money back for the Widget Pro?", we run that question through the same model to get v_query. Then we compare v_query against all three document vectors and pick the closest one.

These models are trained using contrastive learning. During training, the model sees pairs of texts: a question and its correct answer passage (a positive pair), plus a bunch of unrelated passages (negative pairs). The training objective pushes the question embedding closer to the correct passage embedding and further from the wrong ones. After millions of these examples, the model learns a general notion of "these two texts are about the same thing."
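
To make that objective concrete, here's a minimal numpy sketch of an in-batch contrastive loss — a simplified InfoNCE-style form. Real embedding models use variations of this, and the temperature value here is just illustrative.

import numpy as np

def in_batch_contrastive_loss(query_embs, passage_embs, temperature=0.05):
    """Query i should match passage i; every other passage in the batch is a negative."""
    # Normalize so dot products are cosine similarities
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)

    sims = (q @ p.T) / temperature  # (batch, batch) similarity matrix
    probs = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)  # row-wise softmax
    # Cross-entropy against the diagonal — the correct passage for each query
    return -np.mean(np.log(np.diag(probs)))

Training drives the diagonal (correct pairs) up and everything else down; after enough batches, related texts end up near each other in the vector space.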

Popular embedding models include BAAI/bge-large-en-v1.5, intfloat/e5-large-v2, and OpenAI's text-embedding-3-small. The MTEB leaderboard (Massive Text Embedding Benchmark) ranks models across retrieval, classification, and clustering tasks. But here's something I had to learn the hard way: MTEB scores don't always predict performance on your specific domain. A model that's #1 on general benchmarks might underperform on medical or legal text. Always evaluate on your own data.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Our three wiki pages
pages = [
    "Customers can request a full refund within 30 days of purchase.",
    "Standard shipping takes 5-7 business days. Express costs $15.",
    "The Widget Pro weighs 340g with a 6-inch display and USB-C.",
]

# Embed all pages (done once, offline)
page_vectors = model.encode(pages, normalize_embeddings=True)

# User asks a question
question = "Can I get my money back?"
q_vector = model.encode([question], normalize_embeddings=True)

# Find the most similar page
scores = q_vector @ page_vectors.T
best_idx = np.argmax(scores[0])
print(f"Best match (score {scores[0][best_idx]:.3f}): {pages[best_idx]}")

That @ operator is matrix multiplication — in this case, it computes the dot product between the query vector and each page vector. Because we normalized the vectors to unit length, the dot product is identical to cosine similarity. More on that in a moment.

Similarity: Cosine, Dot Product, and the Geometry of Meaning

We've been casually saying "find the closest vector," but what does "closest" mean in a 384-dimensional space? There are three common ways to measure it, and the differences matter more than you might expect.

Cosine similarity measures the angle between two vectors, ignoring their lengths. If two vectors point in the same direction, cosine similarity is 1.0 — perfect match. If they're perpendicular (completely unrelated), it's 0.0. If they point in opposite directions, it's −1.0. The formula divides the dot product by the product of the two vector lengths, which effectively normalizes both vectors. This means a short document and a long document can still match a query equally well — cosine cares about direction, not magnitude.

Dot product is cosine similarity without the normalization. It measures both direction and magnitude. If you've already normalized your vectors to unit length (which most embedding models do, or which you can do yourself), dot product and cosine similarity produce identical rankings. The dot product is faster to compute because you skip the division.

Euclidean distance measures straight-line distance between two points in the vector space. Smaller distance means more similar. It's sensitive to both direction and magnitude, and it's commonly used in clustering but less common in retrieval because most embedding models are designed to be compared by angle, not distance.

Here's the practical takeaway: if your embedding model outputs normalized vectors (most modern ones do), use dot product — it's the fastest and produces the same results as cosine similarity. If your vectors aren't normalized and you care about semantic similarity regardless of document length, use cosine. Match your metric to how your model was trained.

Let's see this with our wiki bot:

# All three metrics on our example
query_vec = q_vector[0]
refund_vec = page_vectors[0]

# Dot product (fast, identical to cosine when normalized)
dot = np.dot(query_vec, refund_vec)

# Cosine similarity (explicit)
cosine = dot / (np.linalg.norm(query_vec) * np.linalg.norm(refund_vec))

# Euclidean distance (lower = more similar)
euclidean = np.linalg.norm(query_vec - refund_vec)

print(f"Dot product:  {dot:.4f}")
print(f"Cosine sim:   {cosine:.4f}")   # Same as dot for normalized vectors
print(f"Euclidean:    {euclidean:.4f}") # Lower = closer

For normalized vectors, dot product and cosine will be identical down to floating-point precision. That's the whole reason most retrieval systems normalize their embeddings — it lets them use the cheaper operation.

Keyword Search and Why It Still Matters

Dense retrieval with embeddings is powerful, but it has a blind spot that I didn't appreciate until I saw it fail in production.

Imagine a user asks: "What is error code ERR-4072?" The embedding model converts this into a vector that captures the general semantic meaning — something about errors. But the critical piece of information is the exact string ERR-4072. The embedding model may not have seen this specific code during training. It might retrieve documents about errors in general, missing the one document that contains the exact code.

BM25 is the classic keyword-matching algorithm that has powered search engines for decades. It scores documents based on three factors: how often the query terms appear in the document (term frequency), how rare those terms are across all documents (inverse document frequency — rare terms get more weight), and a normalization factor for document length. No neural networks. No GPU. No embeddings. It counts words, but it counts them cleverly.
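
To make those three factors concrete, here's a minimal from-scratch sketch of the Okapi BM25 score for a single document. The k1 and b values are the usual defaults, and the +1 inside the log is the common smoothing used by Lucene-style implementations.

import math

def bm25_score(query_terms, doc_terms, all_docs, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query with Okapi BM25."""
    N = len(all_docs)
    avgdl = sum(len(d) for d in all_docs) / N
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)                               # term frequency in this doc
        df = sum(1 for d in all_docs if term in d)               # how many docs contain the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)          # rare terms get more weight
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)   # document length normalization
        score += idf * (tf * (k1 + 1)) / denom
    return score

wiki = [p.lower().split() for p in pages]  # the three wiki pages from earlier
print(bm25_score("refund 30 days".split(), wiki[0], wiki))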

For our wiki bot, BM25 would match "money back" to the refund page because those exact words appear there. It would match "ERR-4072" to whatever document contains that exact string. The weakness? BM25 can't handle synonyms or paraphrases. "Reimbursement" and "refund" are the same concept, but BM25 treats them as completely unrelated tokens because they share no characters.

I'll be honest — when I first started building RAG systems, I dismissed BM25 as a relic. "We have neural embeddings now, why would anyone count words?" It took a production failure involving part numbers and error codes to teach me that exact-match retrieval isn't optional. It's complementary.

from rank_bm25 import BM25Okapi

docs = [
    "Customers can request a full refund within 30 days",
    "Standard shipping takes 5-7 business days",
    "The Widget Pro weighs 340g with USB-C charging",
]

tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

query = "refund policy 30 days"
scores = bm25.get_scores(query.lower().split())
best = scores.argmax()
print(f"BM25 best match: {docs[best]}")

Hybrid Search: Getting the Best of Both

So dense retrieval catches semantic matches that BM25 misses, and BM25 catches exact keyword matches that dense models miss. The obvious next question: can we use both?

Yes. That's hybrid search, and it works remarkably well.

The standard technique for merging two ranked lists is Reciprocal Rank Fusion (RRF). For each document that appears in either list, compute a score of 1 / (k + rank), where rank is the document's 1-based position in that list and k is a constant (typically 60) that controls how much credit later-ranked documents get. Sum the RRF scores across both lists, and re-sort.

Back to our wiki bot. Suppose a user asks "Can I return the Widget Pro?" BM25 might rank the product specs page highest (because "Widget Pro" appears there) and the refund page second. Dense retrieval might rank the refund page highest (because "return" is semantically close to "refund") and the product page second. RRF merges these and the refund page — which appears near the top in both lists — gets the highest combined score. The right answer floats up.

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge multiple ranked lists. Each list is [doc_id, ...]."""
    scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            # enumerate is 0-based, so rank + 1 gives the 1-based position used in the formula
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking  = ["product", "refund", "shipping"]
dense_ranking = ["refund", "product", "shipping"]

merged = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(merged)  # ['refund', 'product', 'shipping'] — refund wins

The beauty of RRF is that it doesn't require the scores from each retriever to be on the same scale. BM25 scores might range from 0 to 25; dense cosine scores range from −1 to 1. RRF only looks at rank positions, so it sidesteps the calibration problem entirely.

But hybrid search has its own limitation. We've been assuming our documents are small enough to embed as single vectors. What happens when your wiki pages are not three sentences but three thousand words?

Chopping Documents into Chunks

Here's the tension. Embedding models compress an entire passage into a single vector — say, 384 numbers. If you feed a three-word sentence, those 384 numbers capture its meaning well. If you feed a 10,000-word document, the meaning gets diluted across too many topics, and the single vector becomes a vague average of everything in the document. It's like summarizing an entire book with one sentence — you lose the details that make retrieval precise.

The solution is chunking: splitting documents into smaller pieces before embedding. Each chunk gets its own vector, and retrieval happens at the chunk level. When our wiki bot eventually grows from three pages to three thousand pages, and those pages are long, chunking is how we maintain retrieval quality.

Let's expand our wiki. Suppose the refund policy page grows to five paragraphs covering different scenarios — cash refunds, store credit, digital products, subscription cancellations, and international returns. If we embed the whole page as one vector, a question about "digital product refund" might not match strongly because the vector is a blend of all five topics. If we chunk it into five pieces, one per paragraph, the "digital products are non-refundable" chunk matches the question precisely.

There are three main chunking strategies, each with a tradeoff I wish someone had spelled out for me earlier.

Fixed-size chunking splits text every N tokens — typically 256 to 512 — with an overlap of 50 to 100 tokens between consecutive chunks. The overlap ensures that a sentence straddling a chunk boundary appears in full in at least one chunk. It's predictable and fast, but it's also dumb. It will happily split mid-sentence, creating chunks that start with "...and is rated IP67 for water resistance" with no preceding context about what "it" refers to.
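
A minimal sketch of fixed-size chunking with overlap, using whitespace tokens as a stand-in for real tokenizer tokens:

def fixed_size_chunks(text, chunk_size=256, overlap=50):
    """Split into chunks of chunk_size tokens, sharing `overlap` tokens between neighbors."""
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[start:start + chunk_size])
            for start in range(0, len(tokens), step)]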

Recursive splitting is smarter. It tries to split on paragraph boundaries first. If a chunk is still too large, it falls back to splitting on sentence boundaries, then word boundaries, then character boundaries. This produces chunks that respect the document's natural structure. Most real systems use some variant of this approach.

Semantic chunking goes further. It computes embeddings for each sentence, then measures cosine similarity between consecutive sentences. Wherever the similarity drops sharply — indicating a topic shift — it places a split. The resulting chunks are topically coherent, not arbitrary. The cost: you need an extra embedding pass over every sentence, which is slower and more complex.

# Recursive splitting — the workhorse approach
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,      # max characters per chunk
    chunk_overlap=30,    # overlap between consecutive chunks
    separators=["\n\n", "\n", ". ", " ", ""]
)

long_page = """Customers can request a full refund within 30 days of purchase.
After 30 days, only store credit is offered.

Digital products are non-refundable once the download link has been accessed.
Subscription services can be cancelled at any time for a prorated refund.

International returns must be shipped within 14 days.
The customer is responsible for return shipping costs."""

chunks = splitter.split_text(long_page)
for i, c in enumerate(chunks):
    print(f"Chunk {i}: {c[:70]}...")

The separators list tells the splitter to try paragraph breaks first (\n\n), then line breaks, then sentence boundaries, then spaces, and finally individual characters as a last resort. The chunk_overlap parameter controls how many characters bleed from one chunk into the next, providing continuity.
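
For comparison, here's a minimal sketch of the semantic chunking idea: split wherever the similarity between consecutive sentences drops. The naive sentence splitting and the 0.6 threshold are illustrative choices, not tuned values.

def semantic_chunks(sentences, embed_model, threshold=0.6):
    """Group consecutive sentences, starting a new chunk when similarity drops below threshold."""
    embs = embed_model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(embs[i] @ embs[i - 1])  # cosine, since vectors are normalized
        if similarity < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = [s.strip() for s in long_page.replace("\n", " ").split(". ") if s.strip()]
print(semantic_chunks(sentences, model))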

I'm still developing my intuition for the right chunk size. The conventional wisdom says 256–512 tokens, but I've seen projects where 128 tokens worked better for precise factual retrieval and 1024 tokens worked better for summarization tasks. The honest answer is: experiment. Chunk size is one of those hyperparameters that nobody has a clean theory for.

Vector Databases and the HNSW Trick

With three wiki pages, we can compare the query vector against every page vector with a simple matrix multiplication. With three million chunks? That brute-force approach takes too long. We need a data structure that can find the nearest vectors without checking every single one.

Enter vector databases — specialized storage systems optimized for fast approximate nearest-neighbor (ANN) search. The "approximate" is important: they trade a tiny amount of accuracy for a massive speedup. Instead of guaranteed "this is the closest vector," they give you "this is very likely among the top few closest vectors," and they do it in milliseconds instead of seconds.

The most popular indexing algorithm behind the scenes is HNSW — Hierarchical Navigable Small World graphs. The analogy that helped it click for me: imagine searching for a person in a city. You don't check every house. You first take a highway to the right neighborhood (coarse, long-range navigation), then take local streets to the right block (medium-range), then walk door to door (fine-grained). HNSW builds this multi-scale navigation into a graph structure.

Concretely, HNSW organizes vectors into multiple layers. The top layer has very few nodes connected by long-range links — highways. Each lower layer has more nodes with shorter-range links. To search, you start at the top layer and greedily walk toward the query vector, hopping to whichever neighbor is closest. When you can't get any closer at this layer, you drop down a level and continue. At the bottom layer — where every vector lives — you do a careful local search among the nearest neighbors. The result is search that scales logarithmically with the number of vectors, not linearly.

This is conceptually similar to a skip list, if you've seen those. Top layers let you skip large distances. Bottom layers let you refine.
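
To make this concrete, here's a minimal sketch of building and querying an HNSW index with FAISS, reusing page_vectors and q_vector from the embedding example. With unit-normalized vectors, FAISS's default L2 distance gives the same ranking as cosine similarity.

import faiss

dim = page_vectors.shape[1]
index = faiss.IndexHNSWFlat(dim, 32)   # 32 = max links per node per layer
index.hnsw.efSearch = 64               # higher = more accurate, slower search

index.add(page_vectors.astype(np.float32))   # build the graph (done once, offline)

distances, ids = index.search(q_vector.astype(np.float32), 2)
print(ids[0])   # indices of the approximate nearest neighbors, best first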

The major vector databases in the ecosystem:

FAISS (from Meta) is a library, not a database. It runs on a single machine, supports both CPU and GPU, and gives you fine-grained control over index types (HNSW, IVF, product quantization). It's the right choice when you need raw speed and are comfortable managing your own infrastructure.

Chroma is lightweight, Python-native, and designed for prototyping and small-to-medium workloads. It runs in-process — no server to set up. Start here when you're exploring.

Pinecone is a managed cloud service. You upload vectors through an API; they handle indexing, scaling, and reliability. The tradeoff is vendor lock-in and cost, but the operational simplicity is compelling for production teams that don't want to manage infrastructure.

Weaviate is open-source and supports hybrid search (dense + keyword) natively. It also has a GraphQL API and built-in support for calling embedding models during ingestion, which reduces plumbing.

Qdrant is open-source, written in Rust, and focuses on performance. It supports filtering during search (e.g., "find the nearest vector, but only among documents from the last 30 days"), which is useful in production where you often need metadata filters alongside similarity.

I'll share something that took me too long to internalize: the choice of vector database matters far less than the quality of your embeddings and chunks. I've seen teams spend weeks evaluating databases and then use a default chunking strategy that lost them more retrieval quality than any database choice would recover. Get the fundamentals right first.

Rest Stop

Congratulations on making it this far. You can stop here if you want.

Here's the mental model you now have: a RAG system takes a user's question, converts it to a vector, searches a database of pre-embedded document chunks for the most similar ones, and hands those chunks to a language model as context for generating an answer. You understand that dense retrieval (embeddings) catches semantic matches, BM25 catches exact keyword matches, and hybrid search combines both. You know that documents get split into chunks so each chunk's embedding is focused, and that vector databases use HNSW to search millions of chunks in milliseconds.

That mental model is genuinely useful. You could build a working RAG system with what you know so far, and it would handle the majority of straightforward Q&A use cases.

It doesn't tell the complete story, though. The chunks that the retriever returns aren't always the best ones — they're the ones with the closest embeddings, which isn't the same thing. There's a technique called reranking that dramatically improves retrieval quality, and there are advanced patterns that make RAG systems self-correcting. There's also the uncomfortable question of how you evaluate whether any of this is working.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

The Reranking Stage: Bi-Encoders vs. Cross-Encoders

Here's a limitation of the retrieval approach we've built so far that bothered me for a while before I understood what was going on.

Our embedding model is a bi-encoder. It encodes the query and each document independently — they never "see" each other during encoding. The query becomes a vector. The document becomes a vector. We compare the two vectors with a dot product. This is what makes retrieval fast: we can pre-compute all the document vectors offline and only compute one query vector at search time.

But this independence is also a limitation. The bi-encoder has to compress the entire meaning of the query into a fixed vector and the entire meaning of the document into another fixed vector, and then hope that the dot product between those two compressed representations captures whether the document actually answers the question. For many cases, it works well enough. For subtle cases, it misses.

Consider our wiki bot. The query "Is the Widget Pro waterproof enough for swimming?" might retrieve the product specs chunk (which mentions "IP67 water resistance"). But IP67 means "protected against temporary immersion up to 1 meter" — it's not rated for swimming. A bi-encoder might rank this chunk highly because the vectors for "waterproof" and "water resistance" are close. But whether IP67 is sufficient for swimming requires understanding the interaction between the query's intent and the document's specific details. A bi-encoder can't do that because it encoded them separately.

A cross-encoder takes a different approach. Instead of encoding the query and document separately, it feeds them both into a single transformer, concatenated with a [SEP] token in between. The transformer's attention mechanism can now attend to every query token while reading every document token, and vice versa. The output is a single relevance score.

This is dramatically more accurate. The cross-encoder can understand that "waterproof enough for swimming" requires more than IP67. It can understand that "money back" and "full refund within 30 days" are answering the same question. It captures nuances that dot products between independent embeddings miss.

The tradeoff: speed. A cross-encoder must process the full (query, document) pair through the transformer for every candidate. You can't pre-compute anything because the representation depends on the specific query. If you have a million documents, running a cross-encoder on all of them is impossibly slow.

The solution is a two-stage pipeline. Stage 1: use a bi-encoder (or BM25, or hybrid search) to quickly retrieve the top 50–100 candidates. Stage 2: feed each of those 50–100 candidates through a cross-encoder with the query, get a fine-grained relevance score for each, and re-sort. The final top 3–5 chunks go into the prompt.

This gives you cross-encoder accuracy with near-bi-encoder latency. The cross-encoder only runs on 50–100 documents instead of millions — a manageable workload.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Can I get a refund for a digital product?"
# Suppose retrieval returned these 4 candidates:
candidates = [
    "Digital products are non-refundable once the download link is accessed.",
    "Customers can request a full refund within 30 days of purchase.",
    "The Widget Pro weighs 340g with a 6-inch display.",
    "International returns must be shipped within 14 days.",
]

pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"  [{score:.3f}] {doc}")
# The digital products chunk rises to the top — correctly.

The cross-encoder understands that "digital product refund" is best answered by the chunk about digital products being non-refundable, even though the general refund chunk also mentions refunds. That distinction is the kind of fine-grained understanding that bi-encoders struggle with.
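
Putting the two stages together, here's a minimal sketch of the full retrieve-then-rerank flow, reusing the bi-encoder (model), the reranker, and the query and candidates from above; the candidate counts are illustrative.

def retrieve_and_rerank(query, documents, bi_encoder, cross_encoder,
                        top_k_retrieve=50, top_k_final=3):
    """Stage 1: cheap bi-encoder retrieval. Stage 2: accurate cross-encoder rerank."""
    doc_vecs = bi_encoder.encode(documents, normalize_embeddings=True)  # offline in practice
    q_vec = bi_encoder.encode([query], normalize_embeddings=True)

    # Stage 1: keep only the top candidates by dot product
    sims = (q_vec @ doc_vecs.T)[0]
    candidate_ids = sims.argsort()[::-1][:top_k_retrieve]

    # Stage 2: rerank just those candidates with the cross-encoder
    pairs = [[query, documents[i]] for i in candidate_ids]
    rerank_scores = cross_encoder.predict(pairs)
    reranked = sorted(zip(rerank_scores, candidate_ids), reverse=True)
    return [documents[i] for _, i in reranked[:top_k_final]]

top_chunks = retrieve_and_rerank(query, candidates, model, reranker)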

Building the Prompt with Retrieved Context

We've found the best chunks. Now we need to hand them to the language model in a way that actually helps. This is prompt construction, and it has its own set of gotchas.

The basic structure is straightforward: a system instruction, the retrieved context, and the user's question.

def build_rag_prompt(question, retrieved_chunks):
    context = "\n\n".join(retrieved_chunks)
    return f"""Answer the question based on the context below.
If the context doesn't contain the answer, say "I don't know."

Context:
{context}

Question: {question}

Answer:"""

chunks = [
    "Digital products are non-refundable once the download link is accessed.",
    "Customers can request a full refund within 30 days of purchase.",
]
prompt = build_rag_prompt("Can I get a refund for a digital product?", chunks)

That "If the context doesn't contain the answer, say I don't know" instruction is more important than it looks. Without it, the model will often fall back to its parametric knowledge — the very behavior RAG is supposed to prevent. You're telling the model: "I gave you an open book. If the answer isn't in the book, admit it rather than guessing."

There's a subtle failure mode here that researchers call the "lost in the middle" problem. When you stuff many chunks into the context, the model pays the most attention to chunks at the beginning and end, and tends to ignore chunks in the middle. A 2023 paper by Liu et al. showed this systematically: model performance degrades when the relevant information is buried in the middle of long contexts, even when the context window is large enough to hold everything.

The practical fix: put the most relevant chunk first. This is another reason reranking matters — not only does it find the best chunks, it determines the order in which they appear in the prompt. The chunk with the highest cross-encoder score goes at the top of the context, where the model is most likely to attend to it.
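
In code, that ordering is just a sort on the reranker scores before joining chunks into the context — a small sketch reusing the scores and candidates from the reranking example:

ranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
prompt = build_rag_prompt("Can I get a refund for a digital product?", ranked[:3])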

Advanced RAG: HyDE, Self-RAG, and Corrective RAG

The pipeline we've built — chunk, embed, retrieve, rerank, generate — works well for straightforward questions. But it has failure modes that become apparent with more complex queries. Three advanced patterns address the most common ones.

HyDE: Hypothetical Document Embeddings

Sometimes the user's query is too short or too vague for the embedding model to produce a useful vector. "Widget returns" is a legitimate query, but it's so terse that its embedding might not land close to the detailed refund policy chunk.

HyDE addresses this with a clever trick. Before embedding the query, you first ask an LLM to generate a hypothetical answer — what a perfect document responding to this query might look like. Then you embed that hypothetical answer and use it as the search vector instead of the raw query. Because the hypothetical answer is written in the same style and vocabulary as actual documents, its embedding tends to be closer to relevant chunks.

For our wiki bot, the query "Widget returns" might produce a hypothetical answer like "The return policy for Widget Pro products allows customers to request a refund within 30 days of purchase..." This hypothetical document, when embedded, lands much closer to the actual refund policy chunk than the raw two-word query would.

The cost is one extra LLM call per query, plus a risk: if the LLM generates a misleading hypothetical answer, it could steer retrieval in the wrong direction. In practice, HyDE helps most with short, vague queries and helps least with queries that are already detailed and specific.
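
Here's a minimal sketch of HyDE, reusing model and page_vectors from earlier and using the OpenAI chat API as the hypothetical-answer generator. The model name and prompt wording are illustrative assumptions; any instruction-following LLM would do.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hyde_search(query, doc_vectors, embed_model, llm_model="gpt-4o-mini"):
    """Generate a hypothetical answer, then search with its embedding instead of the query's."""
    response = client.chat.completions.create(
        model=llm_model,
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
    )
    hypothetical = response.choices[0].message.content
    h_vec = embed_model.encode([hypothetical], normalize_embeddings=True)
    scores = (h_vec @ doc_vectors.T)[0]
    return scores.argsort()[::-1]  # document indices, best match first

ranking = hyde_search("Widget returns", page_vectors, model)
print(pages[ranking[0]])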

Self-RAG

Self-RAG (Asai et al., 2023) teaches the model to decide when it actually needs retrieval. Instead of always retrieving — even for questions the model can answer from its own knowledge — Self-RAG trains the model to generate special reflection tokens. These tokens indicate whether retrieval is needed, whether the retrieved documents are relevant, and whether the generated answer is faithful to the retrieved context.

The idea is that retrieval is not always helpful. If you ask "What is 2+2?", retrieving documents about arithmetic doesn't help and adds latency. Self-RAG lets the model say "I'm confident, no retrieval needed" for easy questions and "I need to look this up" for knowledge-intensive ones.

I'm still developing my intuition for when Self-RAG outperforms standard RAG in practice. The research results are promising, but deploying it requires either fine-tuning the model with the reflection tokens or using a separate classifier to decide when to retrieve. It adds complexity to the pipeline that may not be worth it for simpler use cases.

Corrective RAG (CRAG)

Corrective RAG (Yan et al., 2024) tackles a different failure mode: what happens when the retrieved documents aren't actually relevant? Standard RAG has no mechanism to detect this — it retrieves the top-k chunks and feeds them to the model regardless of quality.

CRAG adds a lightweight evaluator after retrieval. This evaluator scores the retrieved documents for relevance and routes the system down one of three paths. If the documents are confidently relevant, proceed with standard RAG. If they're ambiguous, supplement them with a web search for additional context. If they're confidently irrelevant — meaning the knowledge base doesn't contain the answer — fall back to the LLM's own knowledge or admit "I don't know."

Going back to our open-book analogy: CRAG is like a student who, after flipping to a page, pauses to ask "Wait, is this page actually about the question?" before starting to write. Without that check, the student might write a beautiful answer about the wrong topic.
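
Here's a minimal sketch of the CRAG-style routing decision, using our cross-encoder reranker as the relevance evaluator. The sigmoid squashing and the 0.8/0.3 thresholds are illustrative assumptions, not the paper's exact recipe, and would need tuning on a validation set.

import math

def corrective_route(query, retrieved_docs, evaluator, high=0.8, low=0.3):
    """Score the retrieved docs for relevance and pick one of CRAG's three paths."""
    logits = evaluator.predict([[query, doc] for doc in retrieved_docs])
    best = 1 / (1 + math.exp(-float(max(logits))))  # squash the top logit to 0-1
    if best >= high:
        return "use_retrieved"       # confident: proceed with standard RAG
    if best >= low:
        return "augment_with_web"    # ambiguous: supplement with a web search
    return "fallback"                # irrelevant: model knowledge or "I don't know"

print(corrective_route("Can I return a digital product?", candidates, reranker))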

Evaluating RAG: The RAGAS Framework

Here's the uncomfortable truth about RAG systems: it's surprisingly hard to tell whether yours is working well. The pipeline has multiple components — retriever, reranker, generator — and any of them can fail independently. An end-to-end metric like "percentage of correct answers" doesn't tell you where the problem is.

The RAGAS framework (Retrieval Augmented Generation Assessment) decomposes evaluation into four metrics that isolate different failure modes:

Faithfulness measures whether the generated answer is supported by the retrieved context. If the model says "refunds are available within 60 days" but the retrieved chunk says "30 days," faithfulness is low. This catches hallucination — the model ignoring or contradicting the context it was given.

Answer Relevance measures whether the generated answer actually addresses the user's question. If the user asks about refunds and the model talks about shipping, answer relevance is low — even if the answer is faithful to the context (maybe it was given the wrong context).

Context Precision measures whether the retrieved documents are relevant to the question. If 4 out of 5 retrieved chunks are about shipping when the question is about refunds, context precision is low. This is a retrieval problem, not a generation problem.

Context Recall measures whether all the relevant documents were retrieved. If the knowledge base contains three chunks about the refund policy but only one was retrieved, context recall is low. The model might give an incomplete answer because it's missing context.

The diagnostic power comes from combining these metrics. If faithfulness is high but answer relevance is low, the model is correctly reading the context but the context doesn't contain the answer — fix your retriever. If context precision is high but faithfulness is low, you retrieved the right documents but the model is ignoring them — fix your prompt or try a better model. Each combination points to a different part of the pipeline.

Let me trace through a failure diagnosis with our wiki bot. Suppose a user asks "Can I return a digital product?" and the system answers "Yes, within 30 days." The correct answer is "No, digital products are non-refundable." RAGAS would show: context recall is low (the system retrieved the general refund chunk but missed the digital products chunk), which caused the model to generate an answer based on incomplete context. The fix: improve chunking so the digital products paragraph is its own chunk, or improve retrieval so it surfaces.

# Conceptual RAGAS evaluation — not runnable without the ragas library,
# but this shows the structure of one evaluation sample. (This is the post-fix
# case: retrieval now surfaces the digital-products chunk, so every metric is high.)
evaluation_sample = {
    "question": "Can I return a digital product?",
    "answer": "Digital products are non-refundable once downloaded.",
    "contexts": [
        "Digital products are non-refundable once the download link is accessed."
    ],
    "ground_truth": "No, digital products cannot be refunded after download."
}
# RAGAS would score:
#   Faithfulness: HIGH (answer matches context)
#   Answer Relevance: HIGH (answer addresses the question)
#   Context Precision: HIGH (retrieved chunk is relevant)
#   Context Recall: HIGH (the key fact was retrieved)

Production Realities: Latency, Cost, and Freshness

Building a RAG prototype in a notebook is one thing. Running it in production with real users is a different beast entirely. Three concerns dominate.

Latency

Every stage adds time. Embedding the query: 5–20ms. Vector search: 5–50ms for HNSW. Cross-encoder reranking on 50 candidates: 100–500ms. LLM generation: 500–3000ms. Add them up and a single query might take 1–4 seconds, most of which is the LLM generation step.

The levers you have: cache frequent queries and their results, pre-compute document embeddings (never embed at query time), use a smaller cross-encoder or skip reranking for latency-sensitive applications, and stream the LLM output so the user sees tokens appearing while the model is still generating. Streaming alone transforms the perceived latency because the user starts reading immediately instead of staring at a spinner.

Cost

The cost equation has three components. Embedding your document corpus is a one-time cost (plus re-embedding when documents change). Vector database hosting is an ongoing cost that scales with the number of vectors. LLM generation is a per-query cost that scales with context length — and RAG queries have longer contexts than non-RAG queries because you're stuffing retrieved chunks into the prompt.

The expensive trap: stuffing too many chunks into the context. Each additional chunk increases the number of input tokens the LLM processes, which increases cost and latency. More context also triggers the lost-in-the-middle problem. The practical sweet spot for most systems is 3–5 highly relevant chunks, not 20 mediocre ones. Better reranking is cheaper than more tokens.

Freshness

When a document in your knowledge base changes, the embeddings in your vector database become stale. If someone updates the refund policy from 30 days to 14 days, but the old embedding is still in the index, the system will retrieve and present the outdated policy.

The simplest approach: re-embed changed documents immediately on update and replace their vectors in the database. For systems where documents change infrequently, a nightly batch re-indexing job is often sufficient. For rapidly changing data (news, stock prices, live dashboards), you may need to bypass the vector database entirely and use real-time search over the current state, or combine cached embeddings with a freshness check that verifies the source document hasn't changed since it was embedded.
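
Here's a minimal sketch of that hash-based freshness check; the in-memory dict stands in for whatever vector database you actually use.

import hashlib

vector_index = {}  # doc_id -> {"hash": ..., "vector": ...}; stand-in for a real vector DB

def upsert_if_changed(doc_id, text, embed_model):
    """Re-embed a document only when its content has actually changed."""
    content_hash = hashlib.sha256(text.encode()).hexdigest()
    entry = vector_index.get(doc_id)
    if entry and entry["hash"] == content_hash:
        return False  # unchanged: keep the existing vector
    vector_index[doc_id] = {
        "hash": content_hash,
        "vector": embed_model.encode([text], normalize_embeddings=True)[0],
    }
    return True  # re-embedded and replaced

upsert_if_changed("refund-policy", "Customers can request a full refund within 14 days.", model)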

Freshness is the main advantage RAG has over fine-tuning. Updating a fine-tuned model requires retraining, which takes hours or days and costs significant compute. Updating a RAG knowledge base requires re-embedding the changed documents, which takes seconds. This is why many production systems use both: a fine-tuned model for stable, domain-specific behavior, augmented with RAG for knowledge that changes.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a deceptively straightforward problem — language models guess when they should look things up — and built the entire retrieval-augmented generation pipeline from that single observation. We converted documents to embeddings, measured similarity with cosine and dot product, searched with both keywords and neural vectors, merged results with RRF, chopped documents into focused chunks, indexed them in vector databases using HNSW, refined results with cross-encoder reranking, constructed prompts that keep the model grounded, explored advanced patterns like HyDE and Corrective RAG for handling edge cases, evaluated the whole system with RAGAS metrics that isolate exactly what's failing, and confronted the production realities of latency, cost, and freshness.

My hope is that the next time you see a chatbot confidently hallucinating, or someone waves their hands about "RAG pipelines," instead of nodding along, you'll have a pretty clear picture of every stage — from the user's question to the embedding model to the HNSW index to the reranker to the final prompt — and a good sense of which knob to turn when something breaks.

Resources

Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020) — The O.G. RAG paper. Introduced the end-to-end retriever-generator architecture.

Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey" (2024) — A comprehensive survey covering naive, advanced, and modular RAG paradigms. Wildly helpful for understanding the landscape.

Liu et al., "Lost in the Middle" (2023) — The paper that systematically showed how LLMs ignore information buried in the middle of long contexts. Changed how I think about chunk ordering in prompts.

RAGAS Documentation (ragas.io) — The go-to framework for evaluating RAG systems. The diagnostic approach of decomposing into faithfulness, relevance, precision, and recall is insightful.

MTEB Leaderboard (huggingface.co/spaces/mteb/leaderboard) — The benchmark for comparing embedding models. Essential for choosing a retrieval model, but always validate on your own data.

Malkov & Yashunin, "Efficient and Robust Approximate Nearest Neighbor using Hierarchical Navigable Small World Graphs" (2018) — The HNSW paper. If you want to understand what your vector database is actually doing under the hood, this is the source.