Hybrid Search and Reranking

Pure dense (vector) retrieval is the default in most RAG tutorials and the wrong default in most real systems. The moment your corpus contains identifiers, code symbols, error messages, product SKUs, or proper nouns the user might type verbatim, lexical match outperforms semantic similarity — and the embedding model has no idea why.

The fix is hybrid search (BM25 + dense, fused) followed by a cross-encoder reranker over the top ~50 candidates. This page walks through why each layer exists, the math of Reciprocal Rank Fusion, the reranker landscape in 2026, and a runnable end-to-end pipeline.



1. Why Pure Dense Retrieval Underperforms

Dense embeddings encode meaning. They are excellent at:

- Paraphrase and synonymy ("cancel my plan" retrieves "terminate subscription")
- Conceptual queries with little or no keyword overlap with the answer text
- Tolerating typos, morphology, and cross-lingual phrasing

They are bad at:

- Exact identifiers: product SKUs, error codes, version numbers
- Code symbols, API names, and error messages pasted verbatim
- Rare proper nouns and citations the model under-weights relative to common words

BM25 is exact lexical match with smart weighting. It catches all of the above. The two are complementary, not competing.


2. BM25 Refresher

BM25 (Robertson et al., 1994) scores a document d against a query q as a sum over query terms of:


score(q, d) = sum over terms t in q of:
    IDF(t) * f(t, d) * (k1 + 1) / (f(t, d) + k1 * (1 - b + b * |d| / avgdl))
# f(t, d): frequency of t in d; |d|: doc length in tokens; avgdl: mean doc length.
# k1 (typically 1.2-2.0) controls term-frequency saturation; b (typically 0.75)
# controls document-length normalization.

# In-process BM25 with rank_bm25 — fine up to a few hundred thousand chunks.
from rank_bm25 import BM25Okapi

corpus = ["the quick brown fox", "lazy dogs sleep", "foxes are quick and brown", ...]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized, k1=1.5, b=0.75)

query = "quick fox"
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: -scores[i])[:10]
  

For real corpora use OpenSearch/Elasticsearch (production-grade BM25 with analyzers) or Tantivy/Meilisearch. rank_bm25 is good for prototypes and small in-memory indexes.
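To make k1 and b concrete, here is the scoring formula implemented from scratch. This sketch uses the Lucene-style non-negative idf variant, which differs slightly from rank_bm25's idf, so scores won't match BM25Okapi exactly:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with the BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)            # docs containing t
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rare terms score higher
        f = tf[t]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [d.lower().split() for d in
          ["the quick brown fox", "lazy dogs sleep", "foxes are quick and brown"]]
scores = [bm25_score("quick fox".split(), d, corpus) for d in corpus]
# scores[0] is highest: it alone contains both terms, and "fox" is the rarest
```

Note how idf does the heavy lifting: "fox" appears in one document, so matching it is worth roughly twice as much as matching "quick", which appears in two.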


3. Hybrid Search with Reciprocal Rank Fusion

Run BM25 and dense retrieval in parallel, fuse the two ranked lists. The dominant fusion method is Reciprocal Rank Fusion (RRF, Cormack et al., 2009) because it requires no score calibration between the two systems — just the ranks.

RRF score for document d:


RRF(d) = sum over each retriever r of:  1 / (k + rank_r(d))
# k is a constant, conventionally 60. Higher k flattens the curve.
  

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """rankings: a list of ranked doc-id lists, one per retriever."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

bm25_top = ["d_19", "d_03", "d_42", "d_07", "d_88"]   # ranks 1..5
dense_top = ["d_03", "d_88", "d_19", "d_91", "d_55"]
fused = reciprocal_rank_fusion([bm25_top, dense_top], k=60)
# d_03 and d_19 win because they appear high in both lists.
  

Why k = 60: it's the value the original RRF paper used and it's been sticky for over fifteen years. In practice, any k between 10 and 100 makes very little difference; tune it only after you've tuned everything else.
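The flatness claim is easy to spot-check on the toy lists above; this is a self-contained copy of the fusion function, run at three values of k:

```python
def rrf(rankings: list[list[str]], k: int) -> list[str]:
    """Reciprocal Rank Fusion over multiple ranked doc-id lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return [d for d, _ in sorted(scores.items(), key=lambda x: -x[1])]

bm25_top = ["d_19", "d_03", "d_42", "d_07", "d_88"]
dense_top = ["d_03", "d_88", "d_19", "d_91", "d_55"]
top3 = {k: rrf([bm25_top, dense_top], k)[:3] for k in (10, 60, 100)}
# identical top-3 for k = 10, 60, 100: ["d_03", "d_19", "d_88"]
```

A 10x change in k leaves the top of the fused list untouched; the ordering only starts to wobble far down the tail, where it rarely matters.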


4. Cross-Encoder Rerankers

A bi-encoder (your embedding model) encodes query and doc separately, so similarity is a dot product over precomputed vectors — fast but loses interaction. A cross-encoder takes [query] [SEP] [doc] as a single input and outputs a relevance score, so the attention layers see both at once. Far more accurate per pair, far too slow to run over the full corpus — perfect for reranking the top 50-200 candidates from hybrid retrieval.

Reranker landscape in 2026:

Model                      | Type                        | Notes
BAAI/bge-reranker-v2-m3    | Cross-encoder, multilingual | Open weights, the default reranker for self-hosted stacks. ~568M params, runs on CPU in a pinch but much happier on GPU.
BAAI/bge-reranker-v2-gemma | LLM-based reranker          | Higher quality, larger (2B); use when latency budget allows.
Cohere Rerank v3           | Hosted cross-encoder API    | Strong English + multilingual quality, no infra to run, ~100ms for 100 docs. Pay-per-call.
Voyage rerank-2            | Hosted                      | Solid alternative to Cohere; competitive on long-context reranking.
ColBERTv2 / PLAID          | Late-interaction            | Per-token vectors with MaxSim scoring. Higher recall than bi-encoders, faster than cross-encoders. PLAID is the production-tuned index.
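ColBERT's MaxSim operator is simple enough to sketch in numpy; the toy vectors below are made up purely to show the shape of the computation:

```python
import numpy as np

def maxsim_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """Late interaction: for each query token vector, take the best
    similarity over all doc token vectors, then sum over query tokens."""
    sim = q_tokens @ d_tokens.T          # (n_q, n_d) token-level similarities
    return float(sim.max(axis=1).sum())  # MaxSim: best doc token per query token

# Toy unit-ish vectors: two query tokens, three doc tokens, dim 4.
q = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
d = np.array([[0.9, 0.0, 0.0, 0.0],
              [0.0, 0.8, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
maxsim_score(q, d)  # 0.9 + 0.8 = 1.7
```

Because doc token vectors are precomputed offline, query time is one small encoder pass plus this matrix multiply; that is where the "most of the accuracy at a fraction of the latency" tradeoff comes from.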

# bge-reranker-v2-m3 with sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

query = "How do I cancel my subscription?"
candidates = [
    "Account termination procedure: from Settings, choose Close Account...",
    "Subscriptions auto-renew unless cancelled 24 hours before the period ends...",
    "Refunds are processed within 5-10 business days...",
]
scores = reranker.predict([(query, c) for c in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
  

# Cohere Rerank v3 — hosted, no GPU required.
import cohere
co = cohere.Client()

resp = co.rerank(
    model="rerank-english-v3.0",
    query="How do I cancel my subscription?",
    documents=candidates,
    top_n=5,
)
for r in resp.results:
    print(r.index, r.relevance_score)
  

5. When to Rerank Top-50 to Top-5

The rule of thumb that survives benchmarks: retrieve 50, rerank to 5. The reasoning:

- 50 is deep enough that recall@50 is high on most corpora; the right answer is usually somewhere in the bag.
- 50 pairs fit in one cross-encoder batch, so the rerank stage stays around 100-150 ms on a GPU.
- 5 is tight enough that the generator sees mostly signal; pushing 50 raw chunks into the prompt buries the answer in noise.

Tune the numbers per workload. If your corpus is small (under 10k chunks), retrieve 100 and rerank to 10. If your context window is tight, rerank to 3.


6. Latency Tradeoffs

Approximate budgets on a single A10G GPU, 512-token chunks:

Stage                                    | Latency (50 candidates) | Notes
BM25 (OpenSearch, 1M docs)               | ~10-30 ms               | Network-bound, scales horizontally.
Dense (HNSW, 1M vectors, dim 1024)       | ~5-15 ms                | Memory-bound; pgvector or Qdrant.
RRF fusion                               | <1 ms                   | Pure Python set math.
bge-reranker-v2-m3 (50 pairs)            | ~80-150 ms              | Batched; fp16 helps.
Cohere Rerank API (50 docs)              | ~80-200 ms              | Network round-trip dominates.
LLM-based reranker (bge-gemma, 50 pairs) | ~400-900 ms             | Quality up, latency up.

For sub-200ms RAG end-to-end, drop the reranker or shrink to top-20 candidates. For sub-500ms, the bge-m3 cross-encoder fits comfortably. The LLM reranker only fits if generation is the bottleneck anyway.
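Checking whether a stack fits a budget is simple arithmetic; a sketch using rough midpoints of the estimates above (these are the table's ballpark figures, not measurements):

```python
# Midpoints of the rough per-stage budgets above (estimates, not measurements).
STAGE_MS = {"bm25": 20, "dense": 10, "rrf": 1,
            "bge_m3_rerank": 115, "cohere_api": 140, "llm_rerank": 650}

def pipeline_ms(stages: list[str]) -> int:
    """Sequential latency; BM25 and dense can actually run in parallel."""
    return sum(STAGE_MS[s] for s in stages)

pipeline_ms(["bm25", "dense", "rrf"])                   # 31: fits a sub-200ms budget
pipeline_ms(["bm25", "dense", "rrf", "bge_m3_rerank"])  # 146: fits sub-500ms
```

Running BM25 and dense concurrently shaves another ~10 ms off the sequential numbers, which is why retrieval is rarely the bottleneck; the reranker is.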


7. Metadata Filtering as a Third Signal

Most chunks have structured metadata: source doc, author, date, product line, language, access level. Use it as a hard filter before ranking, not as a soft signal after.


# Example: Qdrant with metadata filter applied at retrieval time.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient("localhost", port=6333)

results = client.search(
    collection_name="docs",
    query_vector=embed(query),   # embed() = your dense embedding function (not shown)
    query_filter=Filter(must=[
        FieldCondition(key="product", match=MatchValue(value="enterprise")),
        FieldCondition(key="published_ts", range=Range(gte=1735689600)),  # >= 2025-01-01
        FieldCondition(key="access_level", match=MatchValue(value="public")),
    ]),
    limit=50,
)
  

Filtering at retrieval time keeps the reranker honest (no wasted compute on chunks the user can't see) and removes whole categories of "right answer, wrong audience" failures.
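The same discipline applies with any store; a pure-Python sketch of filtering candidate ids before they reach the fuser or reranker (the `meta` dict and its field names are hypothetical stand-ins for whatever your metadata store returns):

```python
def hard_filter(candidate_ids: list[str], meta: dict, *,
                product: str, min_ts: int, access: str = "public") -> list[str]:
    """Apply the metadata filter BEFORE ranking: excluded chunks never
    reach the fuser or the reranker, so no compute is wasted on them."""
    return [c for c in candidate_ids
            if meta[c]["product"] == product
            and meta[c]["published_ts"] >= min_ts
            and meta[c]["access_level"] == access]

meta = {
    "d_03": {"product": "enterprise", "published_ts": 1740000000, "access_level": "public"},
    "d_19": {"product": "consumer",   "published_ts": 1750000000, "access_level": "public"},
    "d_88": {"product": "enterprise", "published_ts": 1700000000, "access_level": "public"},
}
hard_filter(["d_03", "d_19", "d_88"], meta, product="enterprise", min_ts=1735689600)
# ["d_03"]: d_19 is the wrong product, d_88 predates the cutoff
```

In production you want the store to do this (as in the Qdrant example above) so filtering happens inside the index, but the semantics are exactly this list comprehension.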


8. Full Pipeline: BM25 + Dense + Rerank


pip install rank_bm25 sentence-transformers qdrant-client
  

"""
End-to-end hybrid retrieval + cross-encoder rerank.
Components:
  - rank_bm25 for lexical
  - sentence-transformers (BAAI/bge-large-en-v1.5) for dense
  - bge-reranker-v2-m3 for reranking
"""
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# ---------- 1. Index time ----------
docs: list[str] = load_corpus()                 # one string per chunk
doc_ids = [f"d_{i}" for i in range(len(docs))]

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
dense_index = embedder.encode(docs, normalize_embeddings=True, batch_size=64)

bm25 = BM25Okapi([d.lower().split() for d in docs])

# ---------- 2. Query time ----------
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return [doc_id for doc_id, _ in sorted(scores.items(), key=lambda x: -x[1])]

def hybrid_search(query: str, top_k_retrieve: int = 50, top_k_rerank: int = 5):
    # 2a. BM25
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_top = [doc_ids[i] for i in np.argsort(-bm25_scores)[:top_k_retrieve]]

    # 2b. Dense
    q_vec = embedder.encode(query, normalize_embeddings=True)
    dense_scores = dense_index @ q_vec
    dense_top = [doc_ids[i] for i in np.argsort(-dense_scores)[:top_k_retrieve]]

    # 2c. Fuse
    fused = reciprocal_rank_fusion([bm25_top, dense_top], k=60)[:top_k_retrieve]

    # 2d. Rerank (a dict lookup avoids repeated O(n) list.index() scans)
    id_to_doc = dict(zip(doc_ids, docs))
    pairs = [(query, id_to_doc[d]) for d in fused]
    rerank_scores = reranker.predict(pairs, batch_size=32)
    reranked = sorted(zip(fused, rerank_scores), key=lambda x: -x[1])[:top_k_rerank]

    return [(doc_id, float(score), id_to_doc[doc_id])
            for doc_id, score in reranked]

for doc_id, score, text in hybrid_search("how do I cancel my subscription?"):
    print(f"{doc_id}\t{score:.3f}\t{text[:80]}")
  

9. Tuning Notes

Roughly in order of impact:

- Retrieval and rerank depths (top_k_retrieve, top_k_rerank) are the biggest lever; start at retrieve 50, rerank to 5, and adjust per workload as in section 5.
- If BM25 alone finds documents the dense side misses, change the embedding model (domain-specific or fine-tuned) before touching anything else.
- BM25's k1 and b: the 1.5 / 0.75 defaults are fine for most corpora, and with fixed-size chunks length normalization (b) matters little.
- RRF's k comes last, if at all; values between 10 and 100 barely change the ordering.

Measure recall@50 and nDCG@5 separately after each change so you know which stage a tweak actually moved.


Common Interview Questions:

What is Reciprocal Rank Fusion and why is it the default?

RRF assigns each document a score of 1 / (k + rank) in each ranked list it appears in, then sums across lists. The k constant (typically 60) damps the influence of very high ranks so a position-1 hit doesn't drown out a strong position-5 from the other retriever. The killer property: it operates on ranks, not raw scores, so you don't need to calibrate BM25's tf-idf magnitudes against cosine similarity per corpus. It's parameter-light, works across arbitrarily many retrievers, and consistently beats hand-tuned score-weighted fusion in published benchmarks.

Why hybrid search instead of pure dense?

Dense embeddings are great at paraphrase ("can the lessee assign?" matches "assignment by tenant") but under-weight rare exact tokens — product SKUs, proper nouns, statute citations, version numbers. BM25 nails those by construction (idf rewards rarity). Hybrid covers both failure modes: dense recovers the semantic miss, sparse recovers the keyword miss. On most enterprise corpora hybrid + RRF outperforms either alone by 5–15 points of recall@10. Pure dense is fine for clean conversational data; for messy real-world text, hybrid is the default.

When does adding a re-ranker actually help?

Reranking helps when first-stage recall is good but precision-at-top is bad — the answer is in the bag of 50, just not in position 1. Cross-encoders like bge-reranker or Cohere Rerank score (query, doc) pairs jointly and routinely move nDCG@5 up 10–20 points over bi-encoder retrieval alone. They don't help if recall@50 is already terrible (no amount of reranking finds a missing document) or if your generator is robust to noise (a long-context model that handles 20 chunks well doesn't need a tight top-3). Measure recall@50 and nDCG@5 separately to know which problem you have.
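The two metrics that answer separates, recall@k and nDCG@k, in a minimal binary-relevance sketch:

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs present anywhere in the top-k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain over top-k vs the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

retrieved = ["d_03", "d_42", "d_19", "d_07", "d_88"]   # first-stage output
relevant = {"d_19", "d_88"}
recall_at_k(retrieved, relevant, 5)   # 1.0: both relevant docs are in the bag
ndcg_at_k(retrieved, relevant, 5)     # ~0.54: but they sit low; a reranker would help
```

High recall with low nDCG is exactly the "answer is in the bag of 50, just not in position 1" case where reranking pays off; low recall means the problem is upstream of the reranker.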

What's the latency cost of reranking and how do you contain it?

A cross-encoder reranks ~50 chunks at ~20–50ms per chunk on CPU, ~5–10ms on GPU — so naively, 500ms–2s added to query latency. Containment: batch all 50 pairs in a single forward pass (most rerankers support it); cap rerank input at top-50 not top-200; cache (query, doc_id) pairs in an LRU because users ask the same questions; serve the reranker on a dedicated GPU pod with vLLM or TEI rather than co-locating on the API server. For latency-critical paths, swap the cross-encoder for a smaller distilled reranker (MiniLM-based, ~2–5ms per pair).

ColBERT vs cross-encoder reranker — when would you pick which?

Cross-encoder is the most accurate but slowest — it processes the full (query, doc) text jointly per pair. Use it when you have a fixed top-50 to rerank and can amortize the cost. ColBERT is a late-interaction model: it embeds query tokens and doc tokens separately, then computes MaxSim across token-level vectors. Pre-compute doc embeddings offline, run only the query encoder + similarity at search time. ColBERT gives you most of the cross-encoder's accuracy at a fraction of the latency, but the index is 10–100x larger than a sentence embedding index. Pick ColBERT when first-stage retrieval itself needs to be high-fidelity; cross-encoder when you only need to refine a small candidate set.

How do you debug "the right document wasn't retrieved"?

Decompose by stage. Confirm the document is actually in the index (most common cause: the chunker dropped or split the relevant paragraph). Run BM25 alone, dense alone, hybrid alone — if BM25 finds it but dense doesn't, the embedding model is wrong for the domain (try a domain-specific or fine-tuned embedder). If dense finds it but reranker buries it, the cross-encoder is over-weighting surface form — check rerank scores on the missing doc vs the ones above it. If neither finds it, the chunk strategy fragmented the answer across boundaries; raise overlap or chunk on semantic units. Keep a "lost queries" eval set and test each fix against it.
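That stage-by-stage decomposition is easy to script if you log the candidate list each stage emits; toy data and hypothetical stage names below:

```python
def diagnose_lost_query(gold_id: str, stage_outputs: dict[str, list[str]]) -> str:
    """Return the first stage whose candidate list misses a known-relevant doc.
    stage_outputs maps stage name -> ranked doc-id list, in pipeline order
    (dicts preserve insertion order in Python 3.7+)."""
    for stage, ids in stage_outputs.items():
        if gold_id not in ids:
            return stage
    return "survived all stages"

stages = {
    "bm25@50":  ["d_19", "d_03", "d_42"],
    "dense@50": ["d_03", "d_88"],
    "fused@50": ["d_03", "d_19", "d_88", "d_42"],
    "rerank@5": ["d_03", "d_88"],
}
diagnose_lost_query("d_19", stages)  # "dense@50": the embedder missed it
diagnose_lost_query("d_03", stages)  # "survived all stages"
```

Run this over the whole "lost queries" eval set and tally which stage each failure lands in; the histogram tells you whether to fix the embedder, the fuser, the reranker, or the chunker.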
