
Your RAG Pipeline Is Retrieving Garbage — Here's How to Fix It with Hybrid Search and Reranking

You know RAG can fail. But do you know how to actually fix it? Beyond the basics — hybrid search, cross-encoder reranking, query decomposition, and contextual retrieval explained with real examples.

thousandmiles-ai-admin · 10 min read

You've diagnosed the problem. Your RAG app retrieves related chunks, not relevant ones. Now let's actually fix it.


You Fixed the Obvious Stuff. It Still Doesn't Work.

So you've read the intro guides. You know RAG can hallucinate, that chunking matters, that the LLM can ignore context. You've improved your chunks, added overlap, maybe even played with different embedding models. And yet — your retrieval still returns frustratingly wrong results for certain queries.

A user asks: "What's our SSO login endpoint?" Your vector search returns five chunks about login flows, OAuth explanations, and authentication best practices. All semantically related to "login." None containing the actual endpoint URL. The answer was sitting in a configuration doc that used the exact term "SSO" — but your embedding model didn't treat "SSO" as a strong enough signal because it's just a three-letter acronym in a sea of words.

This is where basic RAG breaks down and advanced techniques start earning their keep. The gap between "demo-quality" and "production-quality" RAG is almost entirely about retrieval precision — and that's what this post is about.

Why Should You Care?

If you're building any RAG system that needs to be reliable — not just impressive in a demo — these techniques are non-negotiable. Hybrid search, reranking, query decomposition, and contextual retrieval aren't experimental ideas. They're the standard toolkit that production RAG systems use in 2026. And in AI interviews, understanding why vanilla vector search fails and what to do about it is a much stronger signal than just knowing how to set up a basic pipeline.

The Core Problem: Semantic Similarity Is Not Relevance

Let's get precise about what's going wrong. When you embed a query and search for similar vectors, you're measuring semantic similarity — how close the meaning is in embedding space. But "similar meaning" and "actually answers the question" are not the same thing.

Embeddings are great at broad topic matching. They're terrible at exact token matching. Ask for "SSO endpoint" and embeddings will find paragraphs about authentication in general. They'll miss the one line in a config file that says sso_login_url: https://auth.example.com/sso. The embedding model doesn't give special weight to "SSO" — it treats it as part of a general "authentication" cluster.

This is why relying on vector search alone leaves a measurable gap. Benchmarks of hybrid retrieval report relevance improvements of roughly 35% over pure vector search, and the gap widens when queries contain acronyms, proper nouns, version numbers, or exact phrases.


The advanced retrieval pipeline: vector search + keyword search, fused together, then reranked. Each stage filters out more noise.

Fix 1: Hybrid Search — Best of Both Worlds

The idea is simple: don't choose between semantic search and keyword search. Use both, and merge the results.

Vector search (semantic) finds chunks that are about the same topic, even if they use different words. Great for "explain how authentication works" type queries.

Keyword search (BM25) finds chunks that contain the exact words in your query. Great for "SSO endpoint" or "error code 403" type queries — where the exact token matters more than the general meaning.
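To make the keyword side concrete, here's a minimal BM25 scorer, a self-contained sketch with invented documents and naive whitespace tokenization. Production systems use a library or the vector database's built-in BM25 rather than hand-rolling this:

```python
import math

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query using BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {}                                  # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] = df.get(term, 0) + 1
    scores = []
    for d in docs:
        score = 0.0
        for term in query_tokens:
            tf = d.count(term)               # term frequency in this doc
            if tf == 0:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores

docs = [
    "oauth flows and login best practices".split(),
    "sso_login_url config with sso endpoint".split(),
    "general authentication architecture overview".split(),
]
print(bm25_scores("sso endpoint".split(), docs))
```

Only the second document contains the exact tokens "sso" and "endpoint", so it's the only one with a nonzero score, which is exactly the exact-match behavior embeddings fail to provide.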

Reciprocal Rank Fusion (RRF) merges the two ranked lists into one. The math is straightforward: each document gets a score based on its position in both lists, and the combined score determines the final ranking. A document that ranks high in both lists gets boosted to the top. A document that ranks high in one but is missing from the other still appears, just lower.
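In code, RRF is only a few lines. This sketch uses made-up document IDs and the commonly used smoothing constant k=60:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs into one.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["oauth_guide", "sso_config", "login_flow"]   # semantic ranking
keyword_hits = ["sso_config", "error_codes"]                  # BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# "sso_config" wins: it ranks high in both lists, so its scores add up
```

Note that "error_codes" and "login_flow" each appear in only one list, yet they still make the fused ranking, just lower down, which matches the behavior described above.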

The beauty of this approach is that it covers both failure modes. If the answer uses the exact terms from the query, keyword search catches it. If the answer uses synonyms or paraphrases, semantic search catches it. Together, they're much more robust than either alone.

Most vector databases in 2026 support hybrid search natively — Pinecone, Weaviate, Qdrant, and Milvus all have built-in BM25 + vector fusion. You don't need to build this from scratch.

Fix 2: Reranking — The Second Pass That Changes Everything

Here's a dirty secret about retrieval: the first pass is fast but dumb. Whether it's vector search, BM25, or hybrid, the initial retrieval uses lightweight models that prioritize speed over precision. They're designed to scan millions of chunks quickly and return a rough top-100 or top-50.

Reranking is the second pass — slower, smarter, and applied only to that smaller candidate set. Instead of comparing embeddings (which are compressed representations), a reranker uses a cross-encoder that reads the full query and the full chunk together, side by side, and scores how relevant the chunk actually is to the specific question.

Think of it like this: the first pass is like scanning a library shelf by title. The reranker is like actually opening each book and reading the first page to see if it's what you need.

Cross-encoder rerankers consistently improve precision in production RAG systems. Models like BAAI's bge-reranker series are commonly used — they're small enough to run quickly on the candidate set but smart enough to catch relevance nuances that embeddings miss.
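The rerank stage itself is just "score each (query, chunk) pair jointly, keep the best." The sketch below uses a toy token-overlap scorer as a stand-in for a real cross-encoder such as a bge-reranker model; the candidate chunks are invented:

```python
def rerank(query, chunks, score_fn, top_k=3):
    """Second pass: score each (query, chunk) pair together, keep the
    top_k. In production, score_fn would call a cross-encoder model."""
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_k]

def toy_overlap(query, chunk):
    """Stand-in scorer: count shared lowercase tokens. A real
    cross-encoder reads both texts with full attention instead."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

candidates = [
    "An overview of OAuth 2.0 authorization flows",
    "Best practices for secure authentication",
    "Token expiry and refresh strategies",
    "sso_login_url: https://auth.example.com/sso is the sso login endpoint",
]
top = rerank("what is the sso login endpoint", candidates, toy_overlap, top_k=1)
# The config chunk, buried at position 4 in the candidate list, is promoted to 1
```

The key design point is that the scorer sees the full query and full chunk side by side, rather than two independently compressed embeddings.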

The practical impact: chunks that were buried at position 8 or 10 in the initial retrieval get promoted to position 1 or 2 after reranking, because the cross-encoder recognizes they actually answer the question. And the noise — chunks that were semantically similar but not relevant — gets pushed down or filtered out.


Reranking in action: the correct chunk (SSO endpoint config) was buried at position 4. After reranking, it's at position 1.

Fix 3: Query Decomposition — Breaking Complex Questions Apart

Some questions can't be answered from a single chunk, no matter how good your retrieval is. "How does our current pricing compare to what we offered last quarter?" requires information from at least two documents — the current pricing page and the Q3 archive.

Standard RAG embeds the entire question and searches for similar chunks. But what chunk is "similar" to a comparison question? Neither the current pricing doc nor the Q3 doc is a great match for the full query — they're each only half the answer.

Query decomposition breaks the original question into sub-questions: "What is the current pricing?" and "What was the Q3 pricing?" Each sub-question gets its own retrieval pass, finding the most relevant chunks for that specific piece. The results are then combined and sent to the LLM, which now has all the evidence it needs to make the comparison.
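The flow above can be sketched in a few lines. Here `decompose` and `retrieve` are hypothetical stand-ins: a real system would use an LLM call for the first and a vector or hybrid index for the second, and the pricing snippets are invented:

```python
def answer_multi_hop(question, decompose, retrieve, top_k=2):
    """One retrieval pass per sub-question; merge the evidence
    (deduplicated, order-preserving) for the final LLM call."""
    evidence = []
    for sub_q in decompose(question):
        for chunk in retrieve(sub_q)[:top_k]:
            if chunk not in evidence:
                evidence.append(chunk)
    return evidence

# Toy stand-ins for the pricing-comparison example:
def decompose(question):
    return ["What is the current pricing?", "What was the Q3 pricing?"]

def retrieve(sub_q):
    corpus = {
        "current": ["Current pricing: Pro plan is $49/seat/month."],
        "Q3": ["Q3 archive: Pro plan was $39/seat/month."],
    }
    return corpus["Q3" if "Q3" in sub_q else "current"]

evidence = answer_multi_hop("How does our current pricing compare to Q3?",
                            decompose, retrieve)
# The LLM now sees both pricing chunks and can make the comparison
```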

This is also where Corrective RAG (CRAG) comes in — a pattern where the system evaluates the quality of retrieved evidence before passing it to the LLM. If the initial retrieval doesn't look good enough, CRAG can re-trigger retrieval with a reformulated query or decompose the question into simpler sub-queries. It's a self-healing mechanism that catches retrieval failures before they become generation failures.
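A minimal sketch of the corrective loop, where `grade` and `reformulate` are hypothetical hooks (typically an LLM-as-judge and an LLM rewrite step, respectively):

```python
def corrective_retrieve(query, retrieve, grade, reformulate, max_tries=3):
    """CRAG-style loop: retrieve, grade the evidence, and if it looks
    weak, reformulate the query and try again."""
    chunks = retrieve(query)
    for _ in range(max_tries - 1):
        if grade(query, chunks):
            break
        query = reformulate(query)
        chunks = retrieve(query)
    return chunks

# Toy stand-ins: the first query misses, the reformulated one hits.
def retrieve(q):
    return ["sso config chunk"] if "single sign-on" in q else ["generic auth chunk"]

def grade(q, chunks):
    return any("sso" in c for c in chunks)

def reformulate(q):
    return q + " single sign-on"

result = corrective_retrieve("login endpoint", retrieve, grade, reformulate)
# First pass fails the grade; the reformulated query recovers the right chunk
```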

Fix 4: Contextual Retrieval — Give Each Chunk Its Memory

Here's a fundamental problem with chunking: when you split a document into pieces, each piece loses its relationship to the whole. A chunk might say "this policy applies to all enterprise clients" — but the chunk doesn't know what "this policy" refers to, because that context was in a previous section that ended up in a different chunk.

Contextual retrieval, inspired by an approach Anthropic published, solves this by enriching each chunk before it goes into the vector database. For every chunk, you run a small LLM call that reads the full source document and generates 3–4 sentences of context: what document this chunk is from, what section it belongs to, and what the surrounding content discusses.

This context gets prepended to the chunk before embedding. So instead of embedding "this policy applies to all enterprise clients," you embed "From the Enterprise Agreement v2.3, Section 4 — Liability Clauses: this policy applies to all enterprise clients." The embedding now captures both the specific content and its place in the broader document.
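As a sketch, the indexing-time enrichment looks like this. Here `generate_context` is a stand-in for the small LLM call (a real implementation would prompt a model with the full document plus the chunk), and the document text is invented:

```python
def contextualize_chunk(chunk, full_doc, generate_context):
    """Prepend generated context so the embedding captures both the
    chunk and its place in the document. Runs once, at indexing time."""
    context = generate_context(full_doc, chunk)
    return f"{context}\n{chunk}"

# Toy stand-in for the LLM call that situates the chunk:
def generate_context(full_doc, chunk):
    return "From the Enterprise Agreement v2.3, Section 4, Liability Clauses:"

enriched = contextualize_chunk(
    "This policy applies to all enterprise clients.",
    full_doc="...entire agreement text...",
    generate_context=generate_context,
)
# `enriched`, not the bare chunk, is what gets embedded and indexed
```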

The trade-off is cost — you're making an LLM call for every chunk during the indexing phase. But it's a one-time cost that dramatically improves retrieval quality, especially for documents with lots of cross-references, pronouns, and implicit context.

Mistakes That Bite — What Goes Wrong with Advanced RAG

"More retrieval stages = always better." Not necessarily. Every stage adds latency. A hybrid search + reranker pipeline is great, but adding query decomposition on top of that for every query is overkill. Use decomposition selectively — detect when a query is multi-hop (comparison questions, temporal queries, multi-entity queries) and only decompose those.

"I'll just increase top-K to get more chunks." This is the retrieval equivalent of turning up the volume when you can't understand someone. Retrieving 20 chunks instead of 5 means the LLM has to wade through 4x more content, most of it noise. The "lost in the middle" problem gets worse with more chunks. Instead of retrieving more, retrieve better — with hybrid search and reranking.

"Reranking is too slow for production." Modern cross-encoder rerankers process 50 chunks in under 100ms. You're reranking a small candidate set, not the entire database. The latency cost is minimal compared to the quality improvement. If latency is truly critical, distilled reranker models offer 80% of the quality improvement at 2x the speed.

Now Go Break Something — Where to Go from Here

If you want to level up your RAG pipeline, here's a practical path:

  • Start with hybrid search. Most vector databases support it natively now. Switch from pure vector search to hybrid and measure the difference on your own queries. The improvement on exact-match queries will be immediate and obvious.
  • Add a reranker. The Cohere Rerank API offers a free tier. Or use the open-source bge-reranker models from Hugging Face. Plug it in as a second stage after retrieval and compare the top-K before and after.
  • Try RAGAS for evaluation. RAGAS (on GitHub) gives you metrics like context precision, context recall, faithfulness, and answer relevancy. You can't improve what you don't measure.
  • Explore LangSmith or Arize Phoenix for tracing — they let you see exactly which chunks were retrieved, how they were ranked, and what the LLM did with them. Invaluable for debugging.
  • Search for "contextual retrieval Anthropic" to read the original approach — it's a clear writeup with practical implementation details.

Remember that SSO endpoint query that returned five chunks about authentication in general? With hybrid search, BM25 catches the exact "SSO" token. With reranking, the config doc gets promoted to position 1. With contextual retrieval, the chunk knows it's from the deployment configuration guide. The answer was always in your data — you just needed a smarter way to find it.