RAG & Semantic Search

Chapter 12: Large Language Models — Advanced RAG Patterns & Production Pipelines

I avoided going deep on RAG for longer than I’d like to admit. Every time someone mentioned “retrieval-augmented generation,” I’d nod along, thinking I understood it. You chunk documents, embed them, search over vectors, paste the results into a prompt. How hard could it be? Then I tried to build one that actually worked well on messy, real-world documents—internal company wikis, engineering runbooks, product specs full of tables and acronyms—and realized I understood the outline but not the substance. The naive version was easy. Making it good was a different problem entirely. Here is that dive.

Retrieval-Augmented Generation (RAG) was introduced by Lewis et al. at Meta in 2020, combining a retrieval component with a sequence-to-sequence generator. The core idea: instead of relying on what an LLM memorized during training, you fetch relevant documents at inference time and feed them directly into the prompt. By 2024, RAG had become one of the most widely deployed architecture patterns for production LLM systems, used everywhere from customer support bots to internal search tools to coding assistants.

Before we start, a heads-up. We’re going to be talking about embedding spaces, vector similarity, approximate nearest neighbors, knowledge graphs, and evaluation frameworks. You don’t need to know any of it beforehand. We’ll add what we need, one piece at a time.

This isn’t a short journey, but I hope you’ll be glad you came.

The path ahead:

Why parametric memory fails

Embeddings as semantic coordinates

Chunking — the art of slicing documents

Vector search and the nearest neighbor problem

The naive RAG pipeline, end to end

Rest stop and an off-ramp

Where naive RAG breaks down

Hybrid retrieval — sparse meets dense

Reranking with cross-encoders

ColBERT and late interaction

Advanced query strategies — HyDE, step-back, query rewriting

Multi-hop RAG

GraphRAG — knowledge graphs meet retrieval

Agentic RAG

Long-context windows vs RAG

Fine-tuning embedding models

Evaluating RAG — the RAGAS framework

Production patterns

Resources and credits

Why Parametric Memory Fails

To understand why RAG matters, we need to feel the pain of not having it.

Imagine you’re building a company knowledge assistant. Let’s call it AskIntern. Employees type questions like “What’s our parental leave policy?” or “How do I roll back a deployment in staging?” and AskIntern answers. Your first instinct might be: use a large language model. GPT-4, Claude, Llama—any of them. The model knows a lot. Maybe it knows enough.

It doesn’t. And the reasons are fundamental, not fixable by scaling up.

The first problem is hallucination. Ask the model about your company’s parental leave policy and it will generate a confident, detailed, and entirely fictional answer. It will sound like it read your employee handbook. It didn’t. The model has no concept of “I don’t know”—it always produces something, and the something often looks indistinguishable from truth.

The second problem is knowledge cutoff. The model’s understanding of the world froze when training ended. Your engineering team changed the deployment process last Tuesday. The model still thinks you’re on the old system. Fine-tuning to update knowledge is expensive, slow, and introduces its own hallucination risks—the model “sort of remembers” rather than directly reading a source.

The third problem is private data. Your HR policies, Confluence pages, Slack threads, proprietary runbooks—none of that was in the training set. The model literally cannot answer questions about your domain-specific data, no matter how large it is.

These three problems share a common root. The LLM stores everything it knows inside its weights—what researchers call parametric memory. Think of it like trying to answer every question from memory alone, without ever being allowed to look anything up. You might get close for general knowledge, but the moment someone asks about a specific document you read last month, you’re guessing.

RAG says: stop guessing. Go look it up first. Before the model generates an answer, retrieve the relevant documents from an external knowledge base and inject them directly into the prompt. Now the LLM isn’t guessing from parametric memory—it’s reading the source material and synthesizing an answer. This is non-parametric memory—knowledge that lives outside the model, updatable at any time, citable, and verifiable.

The analogy I keep coming back to: imagine a librarian. You ask a question. A bad librarian tries to answer from memory. A good librarian says “hold on,” walks into the stacks, pulls the three most relevant books, reads the relevant passages, and then answers you. RAG turns the LLM into the good librarian. We’ll keep returning to this analogy because it scales surprisingly well as the architecture gets more sophisticated.

Embeddings as Semantic Coordinates

If RAG’s big idea is “look up relevant documents before answering,” we need a way to define what “relevant” means. And for that, we need embeddings.

An embedding is a vector—a list of numbers—that represents the meaning of a piece of text. Not the words themselves, but what those words mean. The sentence “What is our parental leave policy?” and “How many weeks of leave do new parents get?” use completely different words, but an embedding model maps them to nearly identical vectors because the intent is the same.

Think of embeddings as GPS coordinates in meaning-space. Every piece of text gets a location. Texts with similar meanings end up at nearby coordinates. Texts about unrelated topics end up far apart. This sounds like a metaphor, but it's literally how the math works. Embedding models like text-embedding-3-small from OpenAI or all-MiniLM-L6-v2 from sentence-transformers produce vectors with hundreds or thousands of dimensions (all-MiniLM-L6-v2 outputs 384, text-embedding-3-small outputs 1536). Each dimension captures some facet of meaning, and the distance between two vectors tells you how semantically similar they are.

Let’s make this concrete with our AskIntern example. Suppose we have three documents in our company knowledge base:

Doc A: "New parents receive 16 weeks of paid parental leave..."
Doc B: "To roll back a staging deployment, run: kubectl rollout undo..."
Doc C: "Our Q3 revenue exceeded projections by 12%..."

After running each through an embedding model, we get three vectors. The actual numbers aren’t meaningful to humans—what matters is the distances between them. If someone asks “What’s our parental leave policy?” the query embedding will be close to Doc A’s embedding and far from Doc B and Doc C. We measure this distance using cosine similarity—the cosine of the angle between two vectors. A cosine similarity of 1.0 means identical direction (identical meaning). Zero means orthogonal (unrelated). Negative means opposite.
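
To make the distances tangible, here's a minimal sketch using the all-MiniLM-L6-v2 model mentioned above (assuming the sentence-transformers package is installed); the exact scores vary by model, but the ordering is what matters.

# Cosine similarity between a query and our three documents
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "New parents receive 16 weeks of paid parental leave...",           # Doc A
    "To roll back a staging deployment, run: kubectl rollout undo...",  # Doc B
    "Our Q3 revenue exceeded projections by 12%...",                    # Doc C
]
query = "What's our parental leave policy?"

doc_vecs = model.encode(docs)       # one vector per document
query_vec = model.encode(query)     # one vector for the query

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for label, vec in zip(["Doc A", "Doc B", "Doc C"], doc_vecs):
    print(label, round(cosine(query_vec, vec), 2))
# Expect Doc A to score far above Doc B and Doc C.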

This is the foundation of semantic search: convert text to vectors, then find the vectors closest to the query. No keyword matching required. The query “maternity leave duration” finds Doc A even though the word “maternity” never appears in it. That’s the power of embeddings over traditional keyword search like BM25—and also, as we’ll see later, its weakness. We’ll come back to why keyword search still has a role to play.

Chunking — The Art of Slicing Documents

Here’s where the first sharp edge appears. You have a 50-page employee handbook. You can’t embed the entire thing as one vector—embedding models have token limits (typically 512 tokens for older models, 8192 for newer ones), and even if you could embed the whole document, the resulting vector would be a vague average of all the topics within it. A question about parental leave would match weakly because the handbook also discusses expense reports, dress codes, and office hours. The signal drowns in noise.

So you chunk it. You split the document into smaller pieces, each focused enough that its embedding captures a specific topic. Each chunk gets its own vector. Now a query about parental leave can find the specific paragraphs that discuss it.

The question that haunted me early on: how big should the chunks be? I still don’t have a universal answer, and I’m not sure one exists. But here’s what I’ve found through trial and error.

Let’s walk through three chunking strategies with a concrete example. Imagine our employee handbook has this passage:

## Parental Leave Policy

New parents receive 16 weeks of paid parental leave, beginning 
on the date of birth or adoption. Leave may be taken consecutively 
or split into two blocks within the first year. 

To request leave, submit Form PL-7 to HR at least 30 days before 
your expected start date. Emergency situations are handled on a 
case-by-case basis.

## Return to Work

Upon returning, employees may request a flexible schedule for 
up to 8 weeks. Contact your manager and HR to arrange this.

The first approach is fixed-size chunking. Pick a token count—say, 200 tokens—and split rigidly at that boundary. It’s fast and deterministic. But it might slice right between “submit Form PL-7 to HR” and “at least 30 days before your expected start date.” Now neither chunk contains the complete instruction. The fix: add overlap. Each chunk includes the last 50 tokens of the previous chunk, so context bleeds across the boundary. This helps, but it’s a bandage over a fundamental problem—the splitting has no awareness of meaning.
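
Here's what the fixed-size strategy looks like in a few lines of Python, a sketch that splits on whitespace as a stand-in for a real tokenizer:

# Fixed-size chunking with overlap (whitespace "tokens" stand in
# for real tokenizer tokens)
def fixed_size_chunks(text, chunk_size=200, overlap=50):
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break   # last window already reached the end of the document
    return chunks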

The second approach is recursive character splitting. This is what LangChain’s RecursiveCharacterTextSplitter does. It tries to split on paragraph breaks first. If chunks are still too large, it splits on sentences. If still too large, on words. This respects natural document structure better than fixed-size, and it’s the most common approach in practice.

The third approach is semantic chunking. Instead of splitting by character count or structure, you embed consecutive sentences and split when the embedding similarity between adjacent sentences drops below a threshold. The idea: if two consecutive sentences have very different embeddings, they’re probably about different topics, and that’s a natural boundary. This is the most sophisticated approach and often produces the best results, but it’s slower and more complex to implement.
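
A rough sketch of the semantic approach, assuming sentence-transformers and a naive split on ". " as the sentence boundary; a real implementation would use a proper sentence tokenizer and tune the threshold per corpus.

# Semantic chunking: start a new chunk where adjacent sentences diverge
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(text, threshold=0.5):
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    vecs = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = np.dot(vecs[i - 1], vecs[i]) / (
            np.linalg.norm(vecs[i - 1]) * np.linalg.norm(vecs[i]))
        if sim < threshold:      # low similarity = likely topic boundary
            chunks.append(". ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(". ".join(current))
    return chunks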

For our handbook example, semantic chunking would keep the parental leave section together and split cleanly before “Return to Work”—exactly what we want. Fixed-size chunking might butcher it.

There’s a tension here that never fully resolves. Smaller chunks give more precise retrieval—you find exactly the paragraph you need. But they lose context. A chunk that says “Leave may be taken consecutively or split into two blocks” is useless without knowing which leave we’re talking about. Larger chunks preserve context but dilute the embedding. I’ll be honest—I still occasionally get the chunk size wrong on new projects, and the only reliable fix is to look at what’s actually being retrieved and adjust. We’ll see a more elegant solution—parent document retrieval—later.

Vector Search and the Nearest Neighbor Problem

We now have chunks as vectors. We have a query as a vector. We need to find the chunks whose vectors are closest to the query vector. In our AskIntern example with three documents, this is trivial—compute three cosine similarities and pick the highest. But real knowledge bases have millions of chunks. Computing similarity against every single one is too slow for interactive use. If each comparison takes 1 microsecond and you have 10 million chunks, that’s 10 seconds per query. Unacceptable.

This is the approximate nearest neighbor (ANN) problem, and it has its own rich literature. The most popular algorithm in production is HNSW—Hierarchical Navigable Small World graphs. The intuition: imagine a social network where everyone is connected to a few friends, and friends-of-friends form a navigable web. To find someone in the network, you don’t check everyone—you start at an entry node, hop to the neighbor most similar to your target, and keep hopping. Higher-level “express lanes” let you skip across the graph quickly, then you descend to finer-grained layers for precision. HNSW doesn’t guarantee finding the absolute nearest neighbor, but in practice it finds a neighbor that’s 95%+ as good in a fraction of the time.

FAISS (Facebook AI Similarity Search) is the most popular library for this. It implements several ANN algorithms including HNSW and IVF (Inverted File Index), supports GPU acceleration, and can handle billions of vectors. For prototyping, you can keep everything in memory with a flat index. For production, you add quantization to compress vectors and trade a small amount of accuracy for massive memory savings.
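
A minimal FAISS sketch to ground this, using a flat (exact) index and normalized vectors so that inner product equals cosine similarity; the vectors here are random placeholders for real chunk embeddings.

# FAISS flat index: exact search, the baseline before any ANN tricks
import faiss
import numpy as np

dim = 384                                      # e.g. all-MiniLM-L6-v2 output size
chunk_vectors = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(chunk_vectors)              # normalize so inner product == cosine

index = faiss.IndexFlatIP(dim)                 # IP = inner product
index.add(chunk_vectors)

query_vec = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 5)       # top-5 nearest chunks
# Swapping in faiss.IndexHNSWFlat(dim, 32) trades exactness for speed at scale.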

Beyond FAISS, the vector database ecosystem has exploded. Pinecone offers a fully managed service where you never think about infrastructure. Qdrant and Weaviate are open-source alternatives you can self-host. pgvector adds vector search to PostgreSQL, which is appealing if your team already runs Postgres and doesn’t want another database. Chroma is lightweight and popular for prototyping. The choice depends on your scale, ops preferences, and whether you want to manage infrastructure. For most teams starting out, Chroma or FAISS for development, Qdrant or Pinecone for production.

The Naive RAG Pipeline, End to End

With all the pieces in place, let’s wire together a complete RAG system for AskIntern. The architecture has two pipelines that are important to keep separate in your head: offline (indexing) and online (querying).

The offline pipeline runs once, or on a schedule when documents change. You take your raw documents—PDFs, Markdown files, Confluence exports, whatever—and parse them into clean text. Then you chunk that text using one of the strategies we discussed. Each chunk gets embedded into a vector. Those vectors, along with the original chunk text and metadata (source document, page number, section heading), get stored in a vector database.

# Offline pipeline (pseudocode)
documents = load_documents("./company-docs/")
chunks = []
for doc in documents:
    chunks.extend(split_into_chunks(doc, chunk_size=512, overlap=50))

embeddings = embedding_model.encode([c.text for c in chunks])
vector_db.add(embeddings, metadata=[c.metadata for c in chunks])

The online pipeline runs for every user query. The user asks a question. You embed the question using the same embedding model. You search the vector database for the top-K most similar chunks (typically K=3 to K=10). You construct a prompt that includes those chunks as context, along with the user’s question. You send that prompt to the LLM. The LLM generates an answer grounded in the retrieved context.

# Online pipeline (pseudocode)
query = "What's our parental leave policy?"
query_embedding = embedding_model.encode(query)

retrieved_chunks = vector_db.search(query_embedding, top_k=5)

prompt = f"""Answer the question based on the context below.

Context:
{format_chunks(retrieved_chunks)}

Question: {query}
Answer:"""

answer = llm.generate(prompt)

Our librarian analogy maps perfectly. The offline pipeline is the librarian organizing books on shelves (indexing). The online pipeline is the librarian hearing a question, walking to the right shelves (vector search), pulling the relevant books (top-K retrieval), reading the relevant pages (context injection), and answering (generation). Going back to our AskIntern example: when an employee asks about parental leave, the vector search finds the chunks about parental leave policy, those chunks get injected into the prompt, and the LLM synthesizes a natural-language answer citing the specific policy details.

That’s naive RAG. It works. For many use cases, it works surprisingly well. And for our AskIntern prototype with a few hundred well-structured documents, this might be all you need.

Rest Stop — An Off-Ramp

Congratulations on making it this far. You can stop here if you want.

You now have a mental model of the complete RAG architecture: documents get chunked, chunks get embedded into vectors, vectors get stored in a searchable index, queries get embedded the same way, nearest-neighbor search finds relevant chunks, and those chunks get stuffed into a prompt for the LLM to synthesize an answer.

That mental model is genuinely useful. You could build a working RAG system right now with those concepts. Many production systems are not much more than this.

But it doesn’t tell the whole story. What happens when the user’s question is ambiguous and the retrieved chunks are wrong? What about questions that require synthesizing information from multiple documents? What if your documents are interconnected in ways that flat vector search can’t capture? What about evaluating whether your RAG system is actually giving good answers?

The short version: hybrid retrieval, reranking, and better query strategies fix most naive RAG failures. GraphRAG and agentic RAG handle the rest. RAGAS gives you metrics. With that, you’re 60% of the way there.

But if the discomfort of not knowing what’s underneath is nagging at you, read on.

Where Naive RAG Breaks Down

Our AskIntern prototype is live. Employees are using it. And the bug reports are rolling in.

Failure mode 1: vocabulary mismatch. An engineer asks “how do I revert a deploy?” The runbook says “roll back a deployment.” Dense embeddings might catch this—“revert” and “roll back” are semantically similar. But what if someone asks about “PL-7”? That’s the form number from the parental leave policy. Embeddings have no idea what “PL-7” means—it’s not a semantic concept, it’s a specific term. A keyword search would find it instantly. Dense retrieval fails on exact terms, acronyms, and identifiers.

Failure mode 2: wrong chunks, confidently served. Someone asks “Can I split my parental leave?” The system retrieves chunks about flexible return-to-work schedules instead of the leave-splitting policy. The cosine similarity scores are high—both chunks are about parental leave—but the specific information needed is in a different chunk. The top-K results are semantically similar but not actually relevant.

Failure mode 3: multi-hop questions. An employee asks “If I take parental leave and then request flexible return, how much total time can I be on a modified schedule?” The answer requires combining information from two different sections of the handbook. No single chunk contains the complete answer. Naive RAG retrieves some relevant chunks but misses the connection between them.

Failure mode 4: ambiguous queries. “What are the policies?” Policies about what? Without context, the system retrieves a random assortment of policy documents. The query is too vague for vector search to disambiguate.

Each of these failures motivates a specific improvement. Let’s work through them.

Hybrid Retrieval — Sparse Meets Dense

Failure mode 1—the vocabulary mismatch—has an elegant fix: don’t choose between keyword search and semantic search. Use both.

BM25 is the classic keyword retrieval algorithm. It’s been the backbone of search engines since the 1990s. BM25 counts how often query terms appear in a document, adjusted for document length and term rarity. It’s excellent at finding exact matches, specific terms, and identifiers. When someone searches for “PL-7,” BM25 finds it because it literally matches the string. No embeddings needed.

Dense retrieval with embeddings is excellent at understanding meaning. “Revert a deploy” matches “roll back a deployment” because the embeddings capture semantic equivalence. But it struggles with exact terms and rare identifiers.
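
Here's a sketch of the two retrievers running side by side, assuming the rank_bm25 package for the sparse half and the same embedding model as before; the two ranked lists then feed into the fusion step below.

# Sparse (BM25) and dense retrieval, side by side
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [c.text for c in chunks]              # chunks from the offline pipeline
query = "How do I revert a deploy?"

bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())

model = SentenceTransformer("all-MiniLM-L6-v2")
dense_scores = util.cos_sim(model.encode(query), model.encode(corpus))[0]

# Rank each list separately, then fuse with Reciprocal Rank Fusion (below).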

Hybrid retrieval runs both in parallel. You get a ranked list from BM25 and a ranked list from dense search. The question is: how do you combine them? The most common approach is Reciprocal Rank Fusion (RRF). For each document, take its rank in each list, compute 1/(k + rank) for a constant k (typically 60), and sum across lists. Documents that appear high in both lists get the highest combined score. Documents that appear high in one list but not the other still get credit.

# Reciprocal Rank Fusion
bm25_ranks  = {"chunk_42": 1, "chunk_17": 2, "chunk_91": 3}
dense_ranks = {"chunk_17": 1, "chunk_42": 3, "chunk_55": 2}

k = 60
rrf_scores = {}
for doc_id in set(bm25_ranks) | set(dense_ranks):
    score = 0.0
    if doc_id in bm25_ranks:
        score += 1.0 / (k + bm25_ranks[doc_id])
    if doc_id in dense_ranks:
        score += 1.0 / (k + dense_ranks[doc_id])
    rrf_scores[doc_id] = score

# chunk_17: 1/(60+2) + 1/(60+1) = 0.0161 + 0.0164 = 0.0325 (top!)
# chunk_42: 1/(60+1) + 1/(60+3) = 0.0164 + 0.0159 = 0.0323
# chunk_55: 0        + 1/(60+2) = 0.0161

In practice, hybrid retrieval with RRF consistently outperforms either BM25 or dense retrieval alone across benchmarks. It’s one of those rare free improvements that has almost no downside—you’re adding a BM25 index (which is fast and cheap) and a fusion step (which is trivial). For AskIntern, this means the search for “PL-7” now works via BM25 even though dense retrieval misses it, and the search for “revert a deploy” still works via dense retrieval even though BM25 misses it.

The limitation: hybrid retrieval gives you better recall—more relevant chunks in the top results—but the ranking within those results can still be noisy. We need a way to re-sort the retrieved chunks by true relevance. That’s where reranking comes in.

Reranking with Cross-Encoders

Here’s a nuance that took me a while to internalize. The embedding models we use for retrieval—the ones that produce vector representations—are bi-encoders. They encode the query and each document independently, then compare using cosine similarity. This is what makes them fast: you can precompute document embeddings offline and only compute the query embedding at search time. But the independence is also a limitation. The bi-encoder never sees the query and document together. It can’t reason about their interaction.

A cross-encoder takes a different approach. It feeds the query and a document as a single concatenated input into a transformer, which outputs a relevance score directly. Because the model sees both texts at once, it can attend across them—noticing that “split my leave” in the query relates to “taken consecutively or split into two blocks” in the document. Cross-encoders are substantially more accurate than bi-encoders for relevance scoring.

The catch: cross-encoders are slow. You can’t precompute anything because the score depends on the specific (query, document) pair. Running a cross-encoder against 10 million chunks is not feasible. But running it against the top 20 chunks that a fast bi-encoder already retrieved? That takes milliseconds. This is the retrieve-then-rerank pattern: use a fast bi-encoder (or hybrid retrieval) to get the top 20–50 candidates, then use a cross-encoder to re-sort those candidates by true relevance, and take the top 5.

# Retrieve-then-rerank pipeline
candidates = hybrid_search(query, top_k=30)  # fast, broad

# Cross-encoder re-scores each candidate
reranked = cross_encoder.rank(
    query=query,
    documents=[c.text for c in candidates]
)

top_chunks = reranked[:5]  # feed these to the LLM

Popular cross-encoder rerankers include Cohere Rerank, cross-encoder/ms-marco-MiniLM-L-6-v2 from sentence-transformers, and Jina Reranker. Adding a reranker is often the single cheapest accuracy improvement you can make to a RAG pipeline: it typically adds 50–100ms of latency and requires no infrastructure changes. For AskIntern, the difference is dramatic: the user asking “Can I split my parental leave?” now gets the correct chunk about leave splitting ranked first, instead of the vaguely related chunk about flexible schedules.
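
Concretely, with the sentence-transformers cross-encoder named above (a sketch; the scores are unnormalized relevance scores, higher is better):

# Reranking the hybrid-search candidates with a cross-encoder
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Can I split my parental leave?"
pairs = [(query, c.text) for c in candidates]   # candidates from hybrid search
scores = reranker.predict(pairs)                # one relevance score per pair

reranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
top_chunks = [c for c, _ in reranked[:5]]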

But there’s still a spectrum between fast-but-shallow bi-encoders and accurate-but-slow cross-encoders. What if we could find a middle ground?

ColBERT and Late Interaction

I’ll be honest—when I first read about ColBERT, I didn’t immediately see why it mattered. We had bi-encoders for speed and cross-encoders for accuracy. What was the gap? The gap, it turns out, is significant at scale.

ColBERT (Contextualized Late Interaction over BERT) was introduced by Omar Khattab and Matei Zaharia in 2020, and it occupies a genuinely interesting middle ground. Instead of compressing an entire document into a single vector (like a bi-encoder) or processing query and document together (like a cross-encoder), ColBERT keeps per-token embeddings for both.

Here’s how it works, step by step. A query with 8 tokens gets encoded into 8 separate vectors. A document with 200 tokens gets encoded into 200 separate vectors. The document token vectors can be precomputed and stored, same as with a bi-encoder. At query time, you compute the 8 query token vectors, then score against each document using a mechanism called MaxSim.

MaxSim works like this: for each query token, find the document token it’s most similar to (maximum cosine similarity). Then sum those maximum similarities across all query tokens.

Score(Q, D) = Σ over query tokens qi of: max over doc tokens dj of (qi · dj)

Example with a tiny query ["parental", "leave", "split"]:

"parental" → max similarity with any doc token → 0.92 (matches "parental")
"leave"    → max similarity with any doc token → 0.89 (matches "leave")
"split"    → max similarity with any doc token → 0.85 (matches "split...two blocks")

Score = 0.92 + 0.89 + 0.85 = 2.66

The beauty of this: every query token independently finds its best match in the document. This means ColBERT captures fine-grained token-level interactions (like a cross-encoder) while still allowing document embeddings to be precomputed (like a bi-encoder). The “late” in “late interaction” refers to the fact that query and document tokens only interact at the final scoring step, not during encoding.
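
The scoring step is simple enough to write out directly; a numpy sketch, assuming the per-token vectors are already L2-normalized so dot products are cosine similarities.

# ColBERT-style MaxSim scoring
import numpy as np

def maxsim(query_token_vecs, doc_token_vecs):
    # sim[i, j] = similarity between query token i and document token j
    sim = query_token_vecs @ doc_token_vecs.T
    # each query token keeps its single best-matching document token,
    # then the maxima are summed into one relevance score
    return float(sim.max(axis=1).sum())

# q: (8, dim) for an 8-token query, d: (200, dim) for a 200-token document
# score = maxsim(q, d)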

I’m still developing my intuition for when ColBERT is worth the added complexity versus a bi-encoder plus a cross-encoder reranker. The empirical evidence suggests ColBERT shines when you need both speed and accuracy at large scale—millions of documents where reranking the top-K is too slow or too expensive. For AskIntern with a few thousand documents, a reranker is probably sufficient. For a customer-facing search engine over millions of pages, ColBERT’s approach starts to look very attractive.

Advanced Query Strategies — HyDE, Step-Back, Query Rewriting

We’ve been improving the retrieval and ranking side. Now let’s improve what we’re actually searching for—the query itself. Failure mode 4 was ambiguous queries. But even non-ambiguous queries can underperform because there’s a fundamental asymmetry in RAG: the user’s question is short and informal, while the documents are long and formal. Embedding a question doesn’t always land near the embedding of its answer.

Query rewriting is the most straightforward fix. Before retrieval, you ask the LLM to reformulate the user’s query into something more precise. “What about the last CEO?” becomes “Who was the most recent CEO of the company, and when did they serve?” In a conversational system, query rewriting also resolves pronouns: “What about their salary?” becomes “What was the salary of [person from previous turn]?” This is cheap—one quick LLM call—and it measurably improves retrieval quality.

HyDE (Hypothetical Document Embeddings) takes a more surprising approach. Instead of searching with the query, you ask the LLM to generate a hypothetical answer—a fake document that would answer the question if it existed. Then you embed that hypothetical answer and search for real documents similar to it.

Why does this work? Because the hypothetical answer lives in “document space”—it uses the same vocabulary, structure, and style as the actual documents in your knowledge base. The query “What is our parental leave policy?” might not embed close to the actual policy document. But a hypothetical answer like “Our parental leave policy provides 16 weeks of paid leave for new parents...” will embed very close to the real document because it shares vocabulary and structure. Going back to our GPS analogy: HyDE teleports you from “question-space” to “document-space” before you start searching.
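
In pseudocode, matching the pipelines above (the prompt wording is just an illustration):

# HyDE (pseudocode): search with a hypothetical answer, not the raw query
hypothetical = llm.generate(
    "Write a short passage that would answer the following question, "
    "in the style of an employee handbook:\n\n" + query
)

hyde_embedding = embedding_model.encode(hypothetical)
retrieved_chunks = vector_db.search(hyde_embedding, top_k=5)
# Generation then uses the *original* query plus these retrieved chunks.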

Step-back prompting handles complex questions by decomposing them. Instead of searching for the original question directly, the LLM “steps back” and asks: what sub-questions do I need to answer first? “If I take parental leave and then request flexible return, how much total time can I be on a modified schedule?” becomes two sub-questions: “How long is parental leave?” and “How long is the flexible return period?” Each sub-question gets its own retrieval pass. The results are combined and the LLM synthesizes the final answer from all retrieved context.

These three strategies can be combined. In a sophisticated pipeline, you might rewrite the query for clarity, generate a HyDE document for better embedding, and decompose into sub-queries for complex questions. Each adds one LLM call of latency, so the tradeoff is quality versus speed.

Multi-Hop RAG

Step-back prompting decomposes a question into sub-questions before retrieval. Multi-hop RAG is the more general pattern: retrieve, reason, and then decide whether you need to retrieve again.

Here’s a concrete example for AskIntern. An employee asks: “Does the parental leave policy apply to contractors in the London office?” The first retrieval finds the parental leave policy. But the policy says “applies to all full-time employees.” Now we need a second question: “Are London-based contractors classified as full-time employees?” A second retrieval finds the contractor classification document. Now the LLM has both pieces and can answer: “No, London-based contractors are classified as part-time consultants and the parental leave policy does not apply to them.”

This iterative retrieve-reason-retrieve pattern is multi-hop RAG. Each “hop” is a retrieval step informed by what was learned in the previous step. The key insight: the LLM doesn’t know what it needs until it sees what it got. The first retrieval reveals a knowledge gap, which triggers the second retrieval.

Implementation-wise, this is a loop. Retrieve. Feed context to the LLM. Ask the LLM: “Do you have enough information to answer, or do you need to look up something else?” If it needs more, extract the follow-up query and retrieve again. Repeat until the LLM is satisfied or you hit a maximum hop count (to prevent infinite loops).
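
In pseudocode, continuing with the same components as before and a hard cap of three hops:

# Multi-hop retrieval loop (pseudocode)
context, current_query = [], query
for hop in range(3):
    context.extend(vector_db.search(embedding_model.encode(current_query), top_k=5))
    decision = llm.generate(
        f"Context:\n{format_chunks(context)}\n\n"
        f"Question: {query}\n"
        "If the context is sufficient, reply ANSWER: <your answer>.\n"
        "If something is missing, reply SEARCH: <follow-up query>."
    )
    if decision.startswith("ANSWER:"):
        break
    current_query = decision.split("SEARCH:", 1)[-1].strip()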

The limitation of multi-hop RAG is latency. Each hop adds a retrieval step plus an LLM inference step. Two hops might add 2–4 seconds. For time-sensitive applications, you need to decide whether the accuracy improvement is worth the delay. For AskIntern, where employees are willing to wait a few seconds for a thorough answer, multi-hop is usually worth it.

GraphRAG — Knowledge Graphs Meet Retrieval

Vector search treats every chunk as an independent point in embedding space. It has no concept of relationships between chunks. But documents are not isolated—they reference each other, share entities, and form a web of connected knowledge. Our employee handbook references the contractor classification document, which references the London office guidelines, which references UK employment law. These connections matter.

GraphRAG, introduced by Microsoft Research in 2024, addresses this by building a knowledge graph over your documents before retrieval. The offline pipeline extracts entities (people, policies, teams, products) and relationships (references, depends-on, applies-to) from every document, then organizes them into a graph. The graph is further analyzed using community detection algorithms, which cluster related entities into communities. Each community gets a summary—a natural-language description of what that cluster of entities is about.

At query time, GraphRAG can do two things that naive RAG cannot. For local queries (“What is the parental leave policy?”), it works like standard RAG but with graph-informed retrieval—fetching not only directly relevant chunks but also connected entities and their context. For global queries (“What are the main themes across all our HR policies?”), it uses the community summaries. No amount of top-K vector search can answer a global query, because the answer isn’t in any single chunk—it’s in the relationships between them.

My favorite thing about GraphRAG is that, aside from the high-level intuition I described, no one is completely certain how much the graph structure helps versus naive retrieval across different query types. The Microsoft paper showed dramatic improvements on global questions about datasets like news articles, where relationships between entities are dense. On more straightforward factoid questions, the improvement is smaller. The cost is significant—building the graph requires many LLM calls during indexing, and the graph adds complexity to maintain. For AskIntern, GraphRAG makes sense if employees frequently ask synthesizing questions like “What are the differences between our policies for US and UK employees?” For straightforward lookups, it might be overkill.

Agentic RAG

All the RAG patterns we’ve discussed so far follow a fixed pipeline. The developer decides in advance: do hybrid search, rerank, generate. But what if the LLM itself could decide what kind of retrieval to do, how many hops to take, or whether to retrieve at all?

In agentic RAG, the LLM acts as the orchestrator. Given a query, it decides: Should I search the vector database? Should I run a SQL query against structured data? Should I call a web API? Should I retrieve more after seeing the initial results? The LLM is not a passive generator waiting to be fed context—it’s an active agent that chooses its own retrieval strategy.

Back to our librarian analogy: the naive RAG librarian always walks to the same shelf and pulls the top 5 books. The agentic RAG librarian thinks first. “This is a question about recent policy changes—I should check the latest policy updates section. And since it mentions contractors, I should also check the contractor classification guide. Oh, and the user mentioned ‘London office’—I should verify if UK-specific policies override the global ones.” The agent decides the retrieval strategy on the fly.

A related pattern is self-corrective RAG (variants go by names like Self-RAG and CRAG). After generating an answer, the system reflects: “Is this answer fully supported by the retrieved context? Did I actually address the user’s question?” If the self-check fails, it automatically triggers a new retrieval with a reformulated query and regenerates. This loop catches the cases where initial retrieval missed the right chunks—the system notices its own failure and corrects it.

The frameworks that make this practical include LangChain (with its agent executors and tool-calling), LlamaIndex (with its query engine abstractions), and more recently, raw function-calling APIs from OpenAI, Anthropic, and Google. The ReAct pattern (Reason + Act) is the most common architecture: the LLM alternates between reasoning (“I need to find information about X”) and acting (calling a retrieval tool) in a loop.
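
A minimal sketch of that loop in pseudocode; the tool names and the llm.choose_tool helper are hypothetical stand-ins for whatever function-calling interface you use.

# Agentic RAG loop (pseudocode): the LLM picks its own retrieval tool each step
tools = {
    "search_policies": lambda q: vector_db.search(embedding_model.encode(q), top_k=5),
    "search_runbooks": lambda q: runbook_db.search(q),    # hypothetical second index
    "lookup_employee": lambda name: hr_api.get(name),     # hypothetical structured source
}

history = [query]
for step in range(5):                               # hard cap on agent steps
    action, arg = llm.choose_tool(history, tools)   # reason about what's missing, then act
    if action == "final_answer":
        break
    history.append(tools[action](arg))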

The tradeoff: agentic RAG is more flexible and handles a wider range of queries, but it’s less predictable, harder to debug, and more expensive (each agent step is an LLM call). For AskIntern, a good heuristic is: start with a fixed pipeline, add agentic behavior only for the query types where the fixed pipeline consistently fails.

Long-Context Windows vs RAG

Here is a question I hear constantly, and the answer has been shifting. Gemini 1.5 Pro supports 1 million tokens. Claude supports 200K. GPT-4 Turbo supports 128K. If you can fit your entire knowledge base into the context window, why bother with RAG at all?

The appeal is obvious. No chunking, no embeddings, no vector databases, no retrieval pipeline. Paste everything in and ask your question. For small knowledge bases—say, under 50 pages of text—this can actually work well, and the simplicity is hard to argue with.

But there are real problems. The first is the “lost in the middle” phenomenon. Research has shown that LLMs pay the most attention to information at the beginning and end of the context window, with significantly degraded recall for information in the middle. If your answer is buried in page 30 of a 100-page context, the model might miss it entirely. This isn’t a minor effect—recall can drop by 20–30% for information in the middle versus the edges.

The second problem is cost. Token pricing is linear. Processing 1 million tokens of context for every query is roughly a thousand times more expensive than processing 5 retrieved chunks of 200 tokens each. At scale, this difference is enormous.

The third problem is latency. Processing a million tokens takes time. The first token of the response might not appear for 10–30 seconds. With RAG, retrieval takes milliseconds and the model only has to read a few thousand tokens of context, so responses start almost immediately.

The fourth problem is scale. Even 1 million tokens has limits. Many enterprise knowledge bases are tens of millions of tokens. You cannot fit them in any context window.

So when does long context beat RAG? When the information is tightly interrelated—a single long document where cross-references matter, like a legal contract or a codebase. When you need the model to understand the global structure of a document, not answer factoid questions. When your knowledge base is small enough to fit comfortably.

The emerging best practice is a hybrid: use RAG to retrieve the most relevant chunks, then feed those chunks (not the entire corpus) into a moderately large context window. This gives you the precision of retrieval with the cross-referencing ability of larger context. For AskIntern, the answer is clear: the knowledge base is too large and too dynamic for context stuffing. RAG wins.

Fine-Tuning Embedding Models

So far, we’ve been using off-the-shelf embedding models. They work well for general-purpose text. But AskIntern has a vocabulary full of internal acronyms (PL-7, SRE-L2, Q3-OKR), product names, and domain jargon that generic embedding models have never seen. The embeddings for these terms will be essentially random—the model doesn’t know that “SRE-L2” is about site reliability engineering.

Fine-tuning an embedding model on your domain data teaches it your vocabulary and the semantic relationships in your corpus. The approach uses contrastive learning: you provide pairs of (query, relevant passage) and the model learns to embed them close together, while pushing irrelevant passages apart.

The most popular training loss for this is MultipleNegativesRankingLoss from the sentence-transformers library. For each (query, positive passage) pair in a batch, every other positive passage in the batch acts as a negative example. This is efficient because you don’t need to explicitly mine negative examples—the batch provides them automatically.

# Fine-tuning an embedding model (sentence-transformers)
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Domain-specific (query, relevant_passage) pairs
train_examples = [
    InputExample(texts=["parental leave policy", 
                         "New parents receive 16 weeks..."]),
    InputExample(texts=["PL-7 form", 
                         "Submit Form PL-7 to HR..."]),
    InputExample(texts=["rollback staging", 
                         "kubectl rollout undo deployment..."]),
]

train_dataloader = DataLoader(train_examples, batch_size=32, shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100
)

The hardest part isn’t the training—it’s creating the training data. You need (query, relevant passage) pairs that reflect real user questions. Options include mining from search logs, generating synthetic queries with an LLM, or manual curation. Hard negatives matter enormously: passages that are topically similar but not the correct answer. A chunk about “flexible return-to-work schedules” is a hard negative for the query “Can I split my parental leave?” Training with hard negatives forces the model to make finer-grained distinctions.

For parameter-efficient fine-tuning, LoRA (Low-Rank Adaptation) works well here too—you can adapt an embedding model to your domain without retraining all parameters, making it practical even with limited GPU resources.

A practical guideline: start with off-the-shelf embeddings. Measure retrieval quality. If retrieval is the bottleneck (and not chunking or ranking), fine-tune. Check the MTEB leaderboard (Massive Text Embedding Benchmark) for the current best general-purpose models before investing in fine-tuning—sometimes a better base model beats a fine-tuned weaker one.

Evaluating RAG — The RAGAS Framework

I’ll be honest—evaluating RAG systems felt more subjective than I expected. You can look at answers and say “that seems right” or “that missed the point,” but you need numbers to make systematic improvements. The RAGAS (Retrieval Augmented Generation Assessment) framework, introduced in 2023, provides exactly that.

RAGAS defines four metrics that together give you a complete picture of RAG system quality. Each metric isolates a different component of the pipeline, which is essential for debugging.

Faithfulness measures whether the generated answer is supported by the retrieved context. If the LLM says “You get 20 weeks of leave” but the retrieved context says “16 weeks,” faithfulness is low. This catches hallucination—the model making up information not present in the context. RAGAS measures this by breaking the answer into individual claims and checking whether each claim can be inferred from the retrieved passages.

Answer relevance measures whether the answer actually addresses the user’s question. An answer might be faithful to the context (everything it says is true) but not relevant (it talks about the wrong topic). RAGAS measures this by generating hypothetical questions from the answer and checking how similar they are to the original question—if the answer is relevant, the generated questions should be close to the original.

Context precision measures how much of the retrieved context is actually useful. If you retrieved 5 chunks but only 1 is relevant, context precision is low. High-ranked irrelevant chunks are penalized more heavily. This is a diagnostic for your retrieval pipeline: are you fetching too much noise?

Context recall measures whether the retrieved context contains all the information needed to answer the question. If the answer requires facts from two documents but you only retrieved one, context recall is low. This tells you whether your retrieval is missing relevant chunks.

# Evaluating with RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.metrics import context_precision, context_recall

results = evaluate(
    dataset=eval_dataset,  # questions, contexts, answers, ground_truth
    metrics=[faithfulness, answer_relevancy, 
             context_precision, context_recall]
)

# Example output:
# faithfulness:       0.89  (good - answers are grounded)
# answer_relevancy:   0.92  (good - answers address the questions)
# context_precision:  0.67  (medium - too much irrelevant context)
# context_recall:     0.74  (medium - missing some relevant chunks)

The power of these four metrics is in the diagnostic story they tell. High faithfulness but low context recall means: the LLM is honest about what it reads, but retrieval is missing relevant documents—fix your chunking or embedding model. Low faithfulness but high context recall means: the right documents are being retrieved, but the LLM is hallucinating beyond them—tighten your prompt or use a more instruction-following model. Low context precision means: you’re retrieving too much noise—add a reranker or reduce K.

Beyond RAGAS, retrieval-specific metrics from information retrieval are still useful. Recall@K measures whether the correct document appears in the top K results. MRR (Mean Reciprocal Rank) measures how high the correct document ranks on average. These are easier to compute (they don’t require an LLM) and useful for rapid iteration on the retrieval component specifically.
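
Both fit in a few lines; a sketch, assuming each evaluation item records the ID of the known-relevant chunk and the ranked list of IDs the retriever returned.

# Recall@K and MRR over an evaluation set
def recall_at_k(results, k=5):
    hits = sum(1 for r in results if r["relevant_id"] in r["retrieved_ids"][:k])
    return hits / len(results)

def mean_reciprocal_rank(results):
    total = 0.0
    for r in results:
        if r["relevant_id"] in r["retrieved_ids"]:
            rank = r["retrieved_ids"].index(r["relevant_id"]) + 1
            total += 1.0 / rank
    return total / len(results)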

Production Patterns

Building a demo RAG system takes a day. Making it production-ready takes months. Here are the patterns that close the gap, based on what I’ve seen work and fail in real deployments.

Parent document retrieval addresses the chunk-size dilemma we struggled with earlier. Remember the tension: small chunks give precise retrieval but lose context. Parent document retrieval resolves this by maintaining two levels. You chunk your documents into small pieces for embedding and retrieval. But each small chunk stores a reference to its “parent”—the larger section or full document it came from. When a small chunk is retrieved, you fetch the parent and send that to the LLM instead. You get the retrieval precision of small chunks with the contextual richness of large chunks. LlamaIndex calls this a “hierarchical index.” LangChain has a ParentDocumentRetriever.
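
The core of the pattern fits in a few lines; a sketch in the pseudocode style used earlier, assuming each small chunk's metadata carries a parent_id pointing back to the full section it came from.

# Parent document retrieval (pseudocode): search small chunks, read big sections
parent_store = {s.id: s.text for s in sections}         # full sections, keyed by id

child_chunks = vector_db.search(query_embedding, top_k=10)

# Deduplicate parents so the LLM sees each section only once
parent_ids = dict.fromkeys(c.metadata["parent_id"] for c in child_chunks)
context = [parent_store[pid] for pid in parent_ids]      # send these to the LLM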

Metadata filtering adds structured constraints to vector search. Instead of searching all chunks, you filter first: only search chunks from the HR policy section, or only chunks updated after January 2024, or only chunks tagged as “engineering.” This dramatically improves precision for queries that have a clear scope. For AskIntern, if someone from the London office asks about leave policy, filtering to UK-relevant documents before vector search eliminates most irrelevant results.

Source citation is non-negotiable in production. Every answer should include references to the specific documents and chunks it was generated from. This lets users verify the answer and builds trust. The implementation is straightforward: pass chunk metadata (document name, page, URL) through the pipeline and include it in the prompt template so the LLM can reference sources.

Feedback loops are how your RAG system gets better over time. Track which answers users accept, which they report as incorrect, which they follow up on. Use this data to identify retrieval failures, create fine-tuning pairs for your embedding model, and surface documents that need updating. The best production RAG systems have a flywheel: user feedback improves retrieval, which improves answers, which increases trust, which increases usage.

The tooling landscape in 2024–2025 centers on two frameworks. LangChain is the Swiss Army knife—it supports every retriever, every vector store, every LLM, and every chain pattern you can imagine. It’s flexible but can feel over-abstracted. LlamaIndex is more opinionated and focused specifically on data indexing and retrieval, with excellent support for hierarchical indexing, knowledge graphs, and structured data. Both are maturing rapidly. For quick prototyping, either works. For production, evaluate based on your specific needs: LlamaIndex if your challenge is data indexing complexity, LangChain if your challenge is orchestration complexity.

Debug retrieval before generation. When AskIntern gives a bad answer, 80% of the time the problem is retrieval, not generation. Before tweaking your prompt or switching models, print out the retrieved chunks. Read them yourself. Ask: “Could I answer this question from these chunks alone?” If not, the LLM cannot either. Fix retrieval first. Always.

Wrap-Up

If you’re still with me, thank you. I hope it was worth it.

We started with the fundamental problem—LLMs that hallucinate, freeze at their training cutoff, and can’t see your private data. We built our way up from embeddings as semantic coordinates, through chunking strategies, vector search, and a naive RAG pipeline that actually works for many cases. Then we pushed further: hybrid retrieval to combine the strengths of keywords and semantics, cross-encoder reranking for precision, ColBERT for the middle ground at scale, query rewriting and HyDE to bridge the gap between questions and documents, multi-hop retrieval for complex questions, GraphRAG for interconnected knowledge, and agentic RAG where the LLM decides its own retrieval strategy. We covered when long-context windows compete with RAG and when they don’t. We learned to fine-tune embedding models for domain-specific vocabularies and to evaluate our systems with RAGAS metrics that tell a diagnostic story.

My hope is that the next time you need to build a system that answers questions over documents—whether it’s an internal knowledge assistant like AskIntern, a customer support bot, or a research tool—instead of treating RAG as a black box where you paste things together from a tutorial, you’ll have a clear mental model of every stage in the pipeline, the tradeoffs at each decision point, and the tools to diagnose and fix what goes wrong.

Resources and Credits

Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (2020) — The O.G. RAG paper from Meta. Reads well and the architecture diagram alone is worth the click.

Khattab & Zaharia, “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT” (2020) — The paper that introduced late interaction. Omar Khattab went on to build DSPy, which is also worth investigating.

Microsoft Research, “From Local to Global: A Graph RAG Approach to Query-Focused Summarization” (2024) — The GraphRAG paper. The community detection approach is genuinely clever.

Es et al., “RAGAS: Automated Evaluation of Retrieval Augmented Generation” (2023) — Insightful framework that makes RAG evaluation practical instead of vibes-based.

Gao et al., “Precise Zero-Shot Dense Retrieval without Relevance Labels” (2022) — The HyDE paper. The idea of searching with a hypothetical answer is one of those “why didn’t I think of that” moments.

The MTEB Leaderboard (Hugging Face) — The definitive ranking of embedding models. Check it before committing to an embedding model—the landscape changes monthly.