RAG beyond embedding: the techniques that move quality numbers
Vector search alone is not enough for production RAG. The techniques that consistently improve retrieval quality — hybrid search, rerankers, query routing, synthetic eval data — and when each one is worth the engineering cost.
The default RAG architecture — embed documents, embed query, return top-k by cosine similarity — is the baseline. It is also the version that produces the demos that fall apart in production. Real retrieval systems combine several techniques, each addressing a specific failure mode of the baseline. This post walks through the four that consistently move quality numbers and when each one is worth the engineering cost.
The premise: if you cannot measure your retrieval quality with an evaluation harness, you cannot tell whether any of these techniques is helping. Build the harness first.
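To make that concrete: a minimal sketch of such a harness, assuming an eval set of (query, expected document id) pairs and a retrieve() function standing in for whatever pipeline is under test. Recall@k and MRR are reasonable first metrics; swap in whatever matters for your use case.

```python
# Minimal retrieval eval harness: recall@k and MRR over (query, expected_doc_id) pairs.

def recall_at_k(ranked: list[str], expected: str, k: int) -> float:
    return 1.0 if expected in ranked[:k] else 0.0

def reciprocal_rank(ranked: list[str], expected: str) -> float:
    return 1.0 / (ranked.index(expected) + 1) if expected in ranked else 0.0

def evaluate(retrieve, eval_set: list[tuple[str, str]], k: int = 5) -> dict[str, float]:
    recalls, rrs = [], []
    for query, expected_doc_id in eval_set:
        ranked = retrieve(query)                  # ranked list of doc ids, best first
        recalls.append(recall_at_k(ranked, expected_doc_id, k))
        rrs.append(reciprocal_rank(ranked, expected_doc_id))
    return {f"recall@{k}": sum(recalls) / len(recalls), "mrr": sum(rrs) / len(rrs)}
```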
1. Hybrid search (BM25 + dense)
Dense vector search captures semantic similarity but loses lexical precision. The query “Acme V3 firmware checksum error” gets matched on “firmware” and “error” semantically, but a sparse keyword search would correctly anchor on the exact tokens “Acme V3” and “checksum.” BM25 handles those cases. Dense vectors handle the cases where the user phrases the query differently from the document.
The fix is to run both and merge:
- Index documents with both BM25 and dense embeddings
- Run the query against both indexes in parallel
- Merge results with reciprocal rank fusion (RRF) or weighted score combination; a minimal RRF sketch follows
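Here is that merge step as a sketch, assuming each input is a ranked list of document ids (one from BM25, one from dense retrieval). The constant k=60 is the commonly used default, not something to tune first.

```python
from collections import defaultdict

def rrf_merge(ranked_lists: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Reciprocal rank fusion: each doc scores the sum of 1 / (k + rank) across lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:                     # e.g. [bm25_ids, dense_ids]
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# merged = rrf_merge([bm25_results, dense_results])
```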
When it helps: any corpus with technical terms, product names, IDs, codes, or anywhere users search with specific tokens. Most production corpora.
When it doesn’t: pure conversational corpora with no technical vocabulary. Rare in practice.
Engineering cost: low. Adding a BM25 index alongside a vector store is a one-day job with most modern retrieval libraries.
2. Rerankers
Top-k retrieval surfaces the candidates — usually 20 to 50 documents per query. A reranker takes those candidates and reorders them with a more expensive model that scores (query, document) pairs directly. The first stage is recall-optimised; the reranker is precision-optimised.
The reranker matters because the top result is what the LLM sees. A retrieval system with the right document at rank 12 is functionally broken. A reranker that promotes that document to rank 1 is the difference.
Common choices:
- Cross-encoder rerankers (Cohere Rerank, BGE-reranker, etc.) — high quality, ~100ms per query
- LLM-as-reranker — highest quality, slowest, easiest to start with for prototyping
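As one concrete instance of the cross-encoder option, a sketch using the sentence-transformers CrossEncoder wrapper with a BGE-reranker checkpoint; the model name and top_n are illustrative choices, and the candidates are whatever your first stage returned.

```python
from sentence_transformers import CrossEncoder

# The cross-encoder scores each (query, document) pair jointly, unlike the
# bi-encoder embeddings used for first-stage retrieval.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```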
When it helps: any case where top-1 accuracy matters more than top-k coverage. Most question-answering and citation use cases.
When it doesn’t: cases where you genuinely need diversity in the top-k (some recommendation use cases).
Engineering cost: medium. Adds latency you have to account for. Rerankers have their own quality variance, so you should evaluate them on your data, not trust the published benchmarks.
3. Query routing
A single retrieval pipeline assumes all queries are the same shape. They are not. The query “what is our refund policy” wants a single authoritative document. The query “summarise customer complaints from last quarter” wants many documents aggregated. The query “who is the account owner for X” wants a structured database lookup, not retrieval at all.
Query routing uses an LLM to classify the query and route it to the appropriate retrieval strategy (a minimal routing sketch follows the list):
- Specific document lookup → high-precision retrieval, low k
- Aggregation queries → broader retrieval, higher k, possibly map-reduce
- Structured data queries → SQL or API call, no retrieval
- Out-of-scope → refuse cleanly
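A minimal sketch of the router itself, assuming an LLM classification call; the route labels mirror the list above, and the prompt and model name are placeholders to adapt, not recommendations.

```python
from openai import OpenAI

client = OpenAI()

ROUTES = {"lookup", "aggregate", "structured", "out_of_scope"}

ROUTER_PROMPT = """Classify the query into exactly one label:
lookup (one specific document answers it), aggregate (summarise many documents),
structured (needs a database or API lookup), out_of_scope (refuse).
Answer with the label only.

Query: {query}"""

def route(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                              # placeholder model choice
        messages=[{"role": "user", "content": ROUTER_PROMPT.format(query=query)}],
        temperature=0,
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in ROUTES else "lookup"         # fall back to the cheapest safe route
```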
When it helps: when your retrieval system serves a heterogeneous query distribution. Most production systems do.
When it doesn’t: narrow internal tools where every query is the same shape.
Engineering cost: medium. The routing classifier needs its own evaluation set. Mis-routes are expensive.
4. Synthetic eval data
The hardest part of evaluating RAG is having enough realistic queries to evaluate against. Hand-writing a hundred queries is slow and biased toward what you can imagine; production logs may not exist yet, or may not cover the cases you need to test.
Synthetic data generation: take your documents, generate plausible queries that those documents would answer, and use those (query, expected-document) pairs as your eval set. An LLM does the generation; you sample-check the results.
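A sketch of that generation loop, again assuming an LLM call; the prompt and model name are illustrative, and the manual sample-check is the part not to skip.

```python
from openai import OpenAI

client = OpenAI()

GEN_PROMPT = """You are creating evaluation data for a retrieval system.
Write {n} realistic user queries for which the document below is the best answer.
Return one query per line, with no numbering.

Document:
{document}"""

def synthetic_queries(doc_id: str, document: str, n: int = 3) -> list[tuple[str, str]]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                              # placeholder model choice
        messages=[{"role": "user", "content": GEN_PROMPT.format(n=n, document=document)}],
    )
    lines = resp.choices[0].message.content.splitlines()
    queries = [q.strip() for q in lines if q.strip()]
    return [(q, doc_id) for q in queries]                 # (query, expected-document) pairs
```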
This is not a replacement for real production queries. It is a way to bootstrap an eval set when you do not yet have one, and to systematically cover document types or topics that your existing eval set under-represents.
When it helps: greenfield RAG systems with no production logs. Coverage gaps in existing eval sets.
When it doesn’t: when you already have rich production query logs and the bottleneck is labelling, not query generation.
Engineering cost: low. A few hours to write the generator. The hidden cost is the diligence required to sample-check the generated queries — bad synthetic data is worse than no synthetic data.
What to skip
Two techniques that get more attention than they deserve, relative to the four above:
- Fancier embedding models alone. Upgrading from `text-embedding-3-small` to `text-embedding-3-large` typically moves the quality number a few percent. Hybrid + reranker moves it ten or twenty.
- Bigger context windows. Stuffing more retrieved chunks into the prompt has diminishing returns past the first few; the failure mode shifts from “no relevant context” to “relevant context buried in noise.” Better retrieval beats bigger windows.
The shape of the work
For most production RAG systems, the order of operations is:
1. Ship the dumb baseline (dense top-k).
2. Build the eval harness.
3. Add hybrid search. Measure.
4. Add a reranker. Measure.
5. If your query distribution is heterogeneous, add routing. Measure.
6. Use synthetic data to expand eval coverage to under-tested cases.
Each step pays for itself in the eval suite or it does not get shipped.