Once the baseline works, the next challenge appears quickly. Good answers become inconsistent across query styles. Some questions are phrased semantically. Others depend on exact terms, acronyms, or product names. A single retrieval method usually misses one of those groups.
Vector search is excellent at meaning-level similarity. It struggles when exact wording matters. Questions about incident codes, legal clauses, or version tags can fail even when the answer exists in the source text.
This is where lexical retrieval helps. BM25 scores documents by how often a query term appears in them and how rare that term is across the corpus. In practice, semantic retrieval and lexical retrieval fail in different ways, which is exactly why they work better together than alone.
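To make the scoring concrete, here is a minimal, self-contained BM25 sketch. The function name, the toy documents, and the incident code `E42` are illustrative, not from the original post; production systems would use an engine or library with a proper tokenizer rather than whitespace splitting.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with classic BM25.

    docs: list of token lists. Returns one score per doc.
    k1 dampens term-frequency saturation; b controls length normalization.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each query term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(s)
    return scores

docs = [
    "error code E42 raised during deploy".split(),
    "general guidance on deployment practices".split(),
]
# A rare exact term like "E42" scores only the doc that contains it,
# which is exactly the case where pure vector search tends to miss.
print(bm25_scores(["E42"], docs))
```

Note how the score is driven entirely by exact term overlap: a semantically close paraphrase that never mentions `E42` gets zero, which is the complementary failure mode to vector search.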
The usual pattern is simple. Run semantic retrieval and BM25 in parallel. Merge both rankings with a fusion rule. A common choice is reciprocal rank fusion because it rewards chunks that rank well across multiple methods.
You do not need complex orchestration at first. A small top-k list from each retriever plus a clean merge already improves coverage on hard queries.
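The merge step above can be sketched in a few lines. Reciprocal rank fusion needs only the rank positions from each retriever, no score calibration; the doc ids and the two example rankings are illustrative, and `k=60` is the smoothing constant commonly used in the RRF literature.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids via reciprocal rank fusion.

    rankings: list of lists, each ordered best-first.
    Each doc earns 1 / (k + rank) per list it appears in, so items
    ranked well by multiple retrievers float to the top.
    Returns doc ids sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d7"]   # top-k from vector search
lexical  = ["d1", "d9", "d3"]   # top-k from BM25
print(reciprocal_rank_fusion([semantic, lexical]))
# → ['d1', 'd3', 'd9', 'd7']
```

`d1` wins because it ranks reasonably well in both lists, even though neither retriever put it first; that cross-method agreement is the property RRF rewards.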
Hybrid retrieval gives a better candidate pool. Reranking decides which of those candidates are truly central to the user question. This step is helpful when several chunks look relevant but only one or two actually answer the question.
The tradeoff is latency. You add an extra model call. Keep reranking focused on a small candidate set so precision improves without creating unnecessary delay.
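A reranking pass over a bounded pool can be sketched as below. The structure is the point here: in a real system `score_fn` would be a cross-encoder model call, so the word-overlap scorer, the function names, and the example candidates are all stand-ins for illustration.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-order a small candidate pool by a query-passage relevance score.

    score_fn stands in for a cross-encoder call; keeping the pool small
    (e.g. 10-20 chunks) bounds the extra latency this step adds.
    """
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def overlap_score(query, passage):
    """Toy relevance score: fraction of query words found in the passage.
    A production system would call a cross-encoder model here instead."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q)

candidates = [
    "how to rotate API keys safely",
    "incident E42 root cause and fix",
    "quarterly revenue summary",
]
print(rerank("root cause of incident E42", candidates, overlap_score, top_n=2))
```

Because only the short candidate list passes through the expensive scorer, the retrievers can stay fast and broad while the reranker supplies the final precision.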
Chunking always removes some context from the original document flow. A passage can look vague in isolation even though it is clear in the full report.
Contextual retrieval addresses that by adding a brief situating note to each chunk before indexing. The goal is not to rewrite the chunk. The goal is to preserve document-level meaning so retrieval has better signals.
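A minimal sketch of that preprocessing step follows. In practice the situating note is often generated by an LLM that sees the full document; the template used here, along with the function name and the example report title, is a simple stand-in to show where the note attaches.

```python
def contextualize(chunk, doc_title, section):
    """Prepend a brief situating note so the chunk carries document-level
    context into the index. The chunk text itself is left untouched;
    only a short prefix is added before embedding and indexing.
    """
    note = f"From '{doc_title}', section '{section}':"
    return f"{note} {chunk}"

chunk = "Latency rose 40% after the rollout."
# In isolation this sentence is ambiguous; the prefix restores the
# document-level context the chunker stripped away.
print(contextualize(chunk, "Q3 Incident Report", "Deploy timeline"))
```

The prefixed version now matches queries like "Q3 incident latency" that the bare chunk would miss, without altering the passage the model eventually quotes from.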
Measuring retrieval and generation separately is the most important habit in production RAG. If retrieval quality is weak, generation quality will be unstable no matter how polished the prompt looks.
A practical evaluation loop tracks both layers: retrieval quality, typically as recall against a set of labeled relevant chunks, and generation quality, typically as groundedness of the answer in the retrieved context.
When these are measured together, decisions become clear. If recall is high but groundedness is low, generation needs work. If groundedness is high only when recall is high, retrieval is still the main bottleneck.
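Both metrics reduce to simple ratios once the judgments exist. The sketch below assumes a labeled set of relevant chunk ids for recall and a per-claim support verdict for groundedness; in practice those verdicts usually come from an LLM judge, which the booleans here merely stand in for.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of known-relevant chunk ids found in the top-k retrieved."""
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(relevant)

def groundedness(answer_claims, supported):
    """Fraction of answer claims judged supported by retrieved context.

    `supported` holds one verdict per claim, typically from an LLM
    judge; plain booleans stand in for that call here."""
    return sum(supported) / len(answer_claims)

retrieved = ["c4", "c1", "c9", "c2"]
print(recall_at_k(retrieved, relevant={"c1", "c2"}, k=3))   # → 0.5
print(groundedness(["claim A", "claim B"], [True, False]))  # → 0.5
```

Tracking the two numbers side by side per query is what makes the diagnosis above possible: a recall drop points at the index and retrievers, a groundedness drop at prompting and generation.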
Every quality gain has a cost profile. Hybrid retrieval adds index maintenance. Reranking adds latency. Contextualization adds preprocessing time. This is normal. The objective is not maximum complexity. The objective is stable quality at acceptable cost and speed.
The takeaway from Part 2 is this. Production-grade RAG is built from layered retrieval decisions and honest evaluation. Once those are in place, answers become reliable instead of occasionally impressive.
Back to Part 1: Build a Solid Baseline