RAG beyond embedding: the techniques that move quality numbers
Vector search alone is not enough for production RAG. The techniques that consistently improve retrieval quality — hybrid search, rerankers, query routing, synthetic eval data — and when each one is worth the engineering cost.
The default RAG architecture — embed documents, embed query, return top-k by cosine similarity — is the baseline. It is also the version that produces the demos that fall apart in production. Real retrieval systems combine several techniques, each addressing a specific failure mode of the baseline. This post walks through the four that consistently move quality numbers and when each one is worth the engineering cost.
The premise: if you cannot measure your retrieval quality with an evaluation harness, you cannot tell whether any of these techniques is helping. Build the harness first.
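To make that concrete: a minimal sketch of such a harness, assuming an eval set of (query, expected document id) pairs and a retrieve() function standing in for whatever pipeline is under test. Recall@k and MRR are reasonable first metrics; swap in whatever matters for your use case.

```python
# Minimal retrieval eval harness: recall@k and MRR over (query, expected_doc_id) pairs.

def recall_at_k(ranked: list[str], expected: str, k: int) -> float:
    return 1.0 if expected in ranked[:k] else 0.0

def reciprocal_rank(ranked: list[str], expected: str) -> float:
    return 1.0 / (ranked.index(expected) + 1) if expected in ranked else 0.0

def evaluate(retrieve, eval_set: list[tuple[str, str]], k: int = 5) -> dict[str, float]:
    recalls, rrs = [], []
    for query, expected_doc_id in eval_set:
        ranked = retrieve(query)                  # ranked list of doc ids, best first
        recalls.append(recall_at_k(ranked, expected_doc_id, k))
        rrs.append(reciprocal_rank(ranked, expected_doc_id))
    return {f"recall@{k}": sum(recalls) / len(recalls), "mrr": sum(rrs) / len(rrs)}
```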
1. Hybrid search (BM25 + dense)
Dense vector search captures semantic similarity but loses lexical precision. The query “Acme V3 firmware checksum error” gets matched on “firmware” and “error” semantically, but a sparse keyword search would correctly anchor on the exact tokens “Acme V3” and “checksum.” BM25 handles those cases. Dense vectors handle the cases where the user phrases the query differently from the document.
The fix is to run both and merge:
- Index documents with both BM25 and dense embeddings
- Run the query against both indexes in parallel
- Merge results with reciprocal rank fusion (RRF) or weighted score combination; a minimal RRF sketch follows
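Here is that merge step as a sketch, assuming each input is a ranked list of document ids (one from BM25, one from dense retrieval). The constant k=60 is the commonly used default, not something to tune first.

```python
from collections import defaultdict

def rrf_merge(ranked_lists: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Reciprocal rank fusion: each doc scores the sum of 1 / (k + rank) across lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:                     # e.g. [bm25_ids, dense_ids]
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# merged = rrf_merge([bm25_results, dense_results])
```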
When it helps: any corpus with technical terms, product names, IDs, codes, or anywhere users search with specific tokens. Most production corpora.
When it doesn’t: pure conversational corpora with no technical vocabulary. Rare in practice.
Engineering cost: low. Adding a BM25 index alongside a vector store is a one-day job with most modern retrieval libraries.
2. Rerankers
Top-k retrieval surfaces the candidates — usually 20 to 50 documents per query. A reranker takes those candidates and reorders them with a more expensive model that scores (query, document) pairs directly. The first stage is recall-optimised; the reranker is precision-optimised.
The reranker matters because the top result is what the LLM sees. A retrieval system with the right document at rank 12 is functionally broken. A reranker that promotes that document to rank 1 is the difference.
Common choices:
- Cross-encoder rerankers (Cohere Rerank, BGE-reranker, etc.) — high quality, ~100ms per query
- LLM-as-reranker — highest quality, slowest, easiest to start with for prototyping
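As one concrete instance of the cross-encoder option, a sketch using the sentence-transformers CrossEncoder wrapper with a BGE-reranker checkpoint; the model name and top_n are illustrative choices, and the candidates are whatever your first stage returned.

```python
from sentence_transformers import CrossEncoder

# The cross-encoder scores each (query, document) pair jointly, unlike the
# bi-encoder embeddings used for first-stage retrieval.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```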
When it helps: any case where top-1 accuracy matters more than top-k coverage. Most question-answering and citation use cases.
When it doesn’t: cases where you genuinely need diversity in the top-k (some recommendation use cases).
Engineering cost: medium. Adds latency you have to account for. Rerankers have their own quality variance, so you should evaluate them on your data, not trust the published benchmarks.
3. Query routing
A single retrieval pipeline assumes all queries are the same shape. They are not. The query “what is our refund policy” wants a single authoritative document. The query “summarise customer complaints from last quarter” wants many documents aggregated. The query “who is the account owner for X” wants a structured database lookup, not retrieval at all.
Query routing uses an LLM to classify the query and route it to the appropriate retrieval strategy (a minimal routing sketch follows the list):
- Specific document lookup → high-precision retrieval, low k
- Aggregation queries → broader retrieval, higher k, possibly map-reduce
- Structured data queries → SQL or API call, no retrieval
- Out-of-scope → refuse cleanly
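A minimal sketch of the router itself, assuming an LLM classification call; the route labels mirror the list above, and the prompt and model name are placeholders to adapt, not recommendations.

```python
from openai import OpenAI

client = OpenAI()

ROUTES = {"lookup", "aggregate", "structured", "out_of_scope"}

ROUTER_PROMPT = """Classify the query into exactly one label:
lookup (one specific document answers it), aggregate (summarise many documents),
structured (needs a database or API lookup), out_of_scope (refuse).
Answer with the label only.

Query: {query}"""

def route(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                              # placeholder model choice
        messages=[{"role": "user", "content": ROUTER_PROMPT.format(query=query)}],
        temperature=0,
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in ROUTES else "lookup"         # fall back to the cheapest safe route
```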
When it helps: when your retrieval system serves a heterogeneous query distribution. Most production systems do.
When it doesn’t: narrow internal tools where every query is the same shape.
Engineering cost: medium. The routing classifier needs its own evaluation set. Mis-routes are expensive.
4. Synthetic eval data
The hardest part of evaluating RAG is having enough realistic queries to evaluate against. Hand-writing a hundred queries is slow and biased toward what you can imagine; production logs may not exist yet, or may not cover the cases you need to test.
Synthetic data generation: take your documents, generate plausible queries that those documents would answer, and use those (query, expected-document) pairs as your eval set. An LLM does the generation; you sample-check the results.
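A sketch of that generation loop, again assuming an LLM call; the prompt and model name are illustrative, and the manual sample-check is the part not to skip.

```python
from openai import OpenAI

client = OpenAI()

GEN_PROMPT = """You are creating evaluation data for a retrieval system.
Write {n} realistic user queries for which the document below is the best answer.
Return one query per line, with no numbering.

Document:
{document}"""

def synthetic_queries(doc_id: str, document: str, n: int = 3) -> list[tuple[str, str]]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                              # placeholder model choice
        messages=[{"role": "user", "content": GEN_PROMPT.format(n=n, document=document)}],
    )
    lines = resp.choices[0].message.content.splitlines()
    queries = [q.strip() for q in lines if q.strip()]
    return [(q, doc_id) for q in queries]                 # (query, expected-document) pairs
```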
This is not a replacement for real production queries. It is a way to bootstrap an eval set when you do not yet have one, and to systematically cover document types or topics that your existing eval set under-represents.
When it helps: greenfield RAG systems with no production logs. Coverage gaps in existing eval sets.
When it doesn’t: when you already have rich production query logs and the bottleneck is labelling, not query generation.
Engineering cost: low. A few hours to write the generator. The hidden cost is the diligence required to sample-check the generated queries — bad synthetic data is worse than no synthetic data.
What to skip
Two techniques that get more attention than they deserve, relative to the four above:
- Fancier embedding models alone. Upgrading from `text-embedding-3-small` to `text-embedding-3-large` typically moves the quality number a few percent. Hybrid + reranker moves it ten or twenty.
- Bigger context windows. Stuffing more retrieved chunks into the prompt has diminishing returns past the first few; the failure mode shifts from “no relevant context” to “relevant context buried in noise.” Better retrieval beats bigger windows.
The shape of the work
For most production RAG systems, the order of operations is:
1. Ship the dumb baseline (dense top-k).
2. Build the eval harness.
3. Add hybrid search. Measure.
4. Add a reranker. Measure.
5. If your query distribution is heterogeneous, add routing. Measure.
6. Use synthetic data to expand eval coverage to under-tested cases.
Each step pays for itself in the eval suite or it does not get shipped.