Hybrid Search and Reranking

Pure dense (vector) retrieval is the default in most RAG tutorials and the wrong default in most real systems. The moment your corpus contains identifiers, code symbols, error messages, product SKUs, or proper nouns the user might type verbatim, lexical match outperforms semantic similarity — and the embedding model has no idea why.

The fix is hybrid search (BM25 + dense, fused) followed by a cross-encoder reranker over the top ~50 candidates. This page walks through why each layer exists, the math of Reciprocal Rank Fusion, the reranker landscape in 2026, and a runnable end-to-end pipeline.



1. Why Pure Dense Retrieval Underperforms

Dense embeddings encode meaning. They are excellent at:

- Paraphrase and synonymy ("cancel my plan" retrieves "terminate subscription")
- Conceptual queries with little or no keyword overlap with the answer text
- Tolerating typos, morphology, and cross-lingual phrasing

They are bad at:

- Exact identifiers: product SKUs, error codes, version numbers
- Code symbols, API names, and error messages pasted verbatim
- Rare proper nouns and citations the model under-weights relative to common words

BM25 is exact lexical match with smart weighting. It catches all of the above. The two are complementary, not competing.


2. BM25 Refresher

BM25 (Robertson et al., 1994) scores a document d against a query q as a sum over query terms of:


score(q, d) = sum over terms t in q of:
    IDF(t) * f(t, d) * (k1 + 1) / (f(t, d) + k1 * (1 - b + b * |d| / avgdl))
# f(t, d): frequency of t in d; |d|: doc length in tokens; avgdl: mean doc length.
# k1 (typically 1.2-2.0) controls term-frequency saturation; b (typically 0.75)
# controls document-length normalization.

# In-process BM25 with rank_bm25 — fine up to a few hundred thousand chunks.
from rank_bm25 import BM25Okapi

corpus = ["the quick brown fox", "lazy dogs sleep", "foxes are quick and brown", ...]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized, k1=1.5, b=0.75)

query = "quick fox"
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: -scores[i])[:10]
  

For real corpora use OpenSearch/Elasticsearch (production-grade BM25 with analyzers) or Tantivy/Meilisearch. rank_bm25 is good for prototypes and small in-memory indexes.
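To make k1 and b concrete, here is the scoring formula implemented from scratch. This sketch uses the Lucene-style non-negative idf variant, which differs slightly from rank_bm25's idf, so scores won't match BM25Okapi exactly:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with the BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)            # docs containing t
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rare terms score higher
        f = tf[t]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [d.lower().split() for d in
          ["the quick brown fox", "lazy dogs sleep", "foxes are quick and brown"]]
scores = [bm25_score("quick fox".split(), d, corpus) for d in corpus]
# scores[0] is highest: it alone contains both terms, and "fox" is the rarest
```

Note how idf does the heavy lifting: "fox" appears in one document, so matching it is worth roughly twice as much as matching "quick", which appears in two.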


3. Hybrid Search with Reciprocal Rank Fusion

Run BM25 and dense retrieval in parallel, fuse the two ranked lists. The dominant fusion method is Reciprocal Rank Fusion (RRF, Cormack et al., 2009) because it requires no score calibration between the two systems — just the ranks.

RRF score for document d:


RRF(d) = sum over each retriever r of:  1 / (k + rank_r(d))
# k is a constant, conventionally 60. Higher k flattens the curve.
  

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """rankings: a list of ranked doc-id lists, one per retriever."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

bm25_top = ["d_19", "d_03", "d_42", "d_07", "d_88"]   # ranks 1..5
dense_top = ["d_03", "d_88", "d_19", "d_91", "d_55"]
fused = reciprocal_rank_fusion([bm25_top, dense_top], k=60)
# d_03 and d_19 win because they appear high in both lists.
  

Why k = 60: it's the value the original RRF paper used and it's been sticky for over fifteen years. In practice, any k between 10 and 100 makes very little difference; tune it only after you've tuned everything else.
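The flatness claim is easy to spot-check on the toy lists above; this is a self-contained copy of the fusion function, run at three values of k:

```python
def rrf(rankings: list[list[str]], k: int) -> list[str]:
    """Reciprocal Rank Fusion over multiple ranked doc-id lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return [d for d, _ in sorted(scores.items(), key=lambda x: -x[1])]

bm25_top = ["d_19", "d_03", "d_42", "d_07", "d_88"]
dense_top = ["d_03", "d_88", "d_19", "d_91", "d_55"]
top3 = {k: rrf([bm25_top, dense_top], k)[:3] for k in (10, 60, 100)}
# identical top-3 for k = 10, 60, 100: ["d_03", "d_19", "d_88"]
```

A 10x change in k leaves the top of the fused list untouched; the ordering only starts to wobble far down the tail, where it rarely matters.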


4. Cross-Encoder Rerankers

A bi-encoder (your embedding model) encodes query and doc separately, so similarity is a dot product over precomputed vectors — fast but loses interaction. A cross-encoder takes [query] [SEP] [doc] as a single input and outputs a relevance score, so the attention layers see both at once. Far more accurate per pair, far too slow to run over the full corpus — perfect for reranking the top 50-200 candidates from hybrid retrieval.

Reranker landscape in 2026:

Model                      | Type                        | Notes
BAAI/bge-reranker-v2-m3    | Cross-encoder, multilingual | Open weights, the default reranker for self-hosted stacks. ~568M params, runs on CPU in a pinch but much happier on GPU.
BAAI/bge-reranker-v2-gemma | LLM-based reranker          | Higher quality, larger (2B); use when latency budget allows.
Cohere Rerank v3           | Hosted cross-encoder API    | Strong English + multilingual quality, no infra to run, ~100ms for 100 docs. Pay-per-call.
Voyage rerank-2            | Hosted                      | Solid alternative to Cohere; competitive on long-context reranking.
ColBERTv2 / PLAID          | Late-interaction            | Per-token vectors with MaxSim scoring. Higher recall than bi-encoders, faster than cross-encoders. PLAID is the production-tuned index.
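ColBERT's MaxSim operator is simple enough to sketch in numpy; the toy vectors below are made up purely to show the shape of the computation:

```python
import numpy as np

def maxsim_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """Late interaction: for each query token vector, take the best
    similarity over all doc token vectors, then sum over query tokens."""
    sim = q_tokens @ d_tokens.T          # (n_q, n_d) token-level similarities
    return float(sim.max(axis=1).sum())  # MaxSim: best doc token per query token

# Toy unit-ish vectors: two query tokens, three doc tokens, dim 4.
q = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
d = np.array([[0.9, 0.0, 0.0, 0.0],
              [0.0, 0.8, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
maxsim_score(q, d)  # 0.9 + 0.8 = 1.7
```

Because doc token vectors are precomputed offline, query time is one small encoder pass plus this matrix multiply; that is where the "most of the accuracy at a fraction of the latency" tradeoff comes from.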

# bge-reranker-v2-m3 with sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

query = "How do I cancel my subscription?"
candidates = [
    "Account termination procedure: from Settings, choose Close Account...",
    "Subscriptions auto-renew unless cancelled 24 hours before the period ends...",
    "Refunds are processed within 5-10 business days...",
]
scores = reranker.predict([(query, c) for c in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
  

# Cohere Rerank v3 — hosted, no GPU required.
import cohere
co = cohere.Client()

resp = co.rerank(
    model="rerank-english-v3.0",
    query="How do I cancel my subscription?",
    documents=candidates,
    top_n=5,
)
for r in resp.results:
    print(r.index, r.relevance_score)
  

5. When to Rerank Top-50 to Top-5

The rule of thumb that survives benchmarks: retrieve 50, rerank to 5. The reasoning:

- 50 is deep enough that recall@50 is high on most corpora; the right answer is usually somewhere in the bag.
- 50 pairs fit in one cross-encoder batch, so the rerank stage stays around 100-150 ms on a GPU.
- 5 is tight enough that the generator sees mostly signal; pushing 50 raw chunks into the prompt buries the answer in noise.

Tune the numbers per workload. If your corpus is small (under 10k chunks), retrieve 100 and rerank to 10. If your context window is tight, rerank to 3.


6. Latency Tradeoffs

Approximate budgets on a single A10G GPU, 512-token chunks:

Stage                                    | Latency (50 candidates) | Notes
BM25 (OpenSearch, 1M docs)               | ~10-30 ms               | Network-bound, scales horizontally.
Dense (HNSW, 1M vectors, dim 1024)       | ~5-15 ms                | Memory-bound; pgvector or Qdrant.
RRF fusion                               | <1 ms                   | Pure Python set math.
bge-reranker-v2-m3 (50 pairs)            | ~80-150 ms              | Batched; fp16 helps.
Cohere Rerank API (50 docs)              | ~80-200 ms              | Network round-trip dominates.
LLM-based reranker (bge-gemma, 50 pairs) | ~400-900 ms             | Quality up, latency up.

For sub-200ms RAG end-to-end, drop the reranker or shrink to top-20 candidates. For sub-500ms, the bge-m3 cross-encoder fits comfortably. The LLM reranker only fits if generation is the bottleneck anyway.
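Checking whether a stack fits a budget is simple arithmetic; a sketch using rough midpoints of the estimates above (these are the table's ballpark figures, not measurements):

```python
# Midpoints of the rough per-stage budgets above (estimates, not measurements).
STAGE_MS = {"bm25": 20, "dense": 10, "rrf": 1,
            "bge_m3_rerank": 115, "cohere_api": 140, "llm_rerank": 650}

def pipeline_ms(stages: list[str]) -> int:
    """Sequential latency; BM25 and dense can actually run in parallel."""
    return sum(STAGE_MS[s] for s in stages)

pipeline_ms(["bm25", "dense", "rrf"])                   # 31: fits a sub-200ms budget
pipeline_ms(["bm25", "dense", "rrf", "bge_m3_rerank"])  # 146: fits sub-500ms
```

Running BM25 and dense concurrently shaves another ~10 ms off the sequential numbers, which is why retrieval is rarely the bottleneck; the reranker is.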


7. Metadata Filtering as a Third Signal

Most chunks have structured metadata: source doc, author, date, product line, language, access level. Use it as a hard filter before ranking, not as a soft signal after.


# Example: Qdrant with metadata filter applied at retrieval time.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient("localhost", port=6333)

results = client.search(
    collection_name="docs",
    query_vector=embed(query),   # embed() = your dense embedding function (not shown)
    query_filter=Filter(must=[
        FieldCondition(key="product", match=MatchValue(value="enterprise")),
        FieldCondition(key="published_ts", range=Range(gte=1735689600)),  # >= 2025-01-01
        FieldCondition(key="access_level", match=MatchValue(value="public")),
    ]),
    limit=50,
)
  

Filtering at retrieval time keeps the reranker honest (no wasted compute on chunks the user can't see) and removes whole categories of "right answer, wrong audience" failures.
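The same discipline applies with any store; a pure-Python sketch of filtering candidate ids before they reach the fuser or reranker (the `meta` dict and its field names are hypothetical stand-ins for whatever your metadata store returns):

```python
def hard_filter(candidate_ids: list[str], meta: dict, *,
                product: str, min_ts: int, access: str = "public") -> list[str]:
    """Apply the metadata filter BEFORE ranking: excluded chunks never
    reach the fuser or the reranker, so no compute is wasted on them."""
    return [c for c in candidate_ids
            if meta[c]["product"] == product
            and meta[c]["published_ts"] >= min_ts
            and meta[c]["access_level"] == access]

meta = {
    "d_03": {"product": "enterprise", "published_ts": 1740000000, "access_level": "public"},
    "d_19": {"product": "consumer",   "published_ts": 1750000000, "access_level": "public"},
    "d_88": {"product": "enterprise", "published_ts": 1700000000, "access_level": "public"},
}
hard_filter(["d_03", "d_19", "d_88"], meta, product="enterprise", min_ts=1735689600)
# ["d_03"]: d_19 is the wrong product, d_88 predates the cutoff
```

In production you want the store to do this (as in the Qdrant example above) so filtering happens inside the index, but the semantics are exactly this list comprehension.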


8. Full Pipeline: BM25 + Dense + Rerank


pip install rank_bm25 sentence-transformers qdrant-client
  

"""
End-to-end hybrid retrieval + cross-encoder rerank.
Components:
  - rank_bm25 for lexical
  - sentence-transformers (BAAI/bge-large-en-v1.5) for dense
  - bge-reranker-v2-m3 for reranking
"""
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# ---------- 1. Index time ----------
docs: list[str] = load_corpus()                 # one string per chunk
doc_ids = [f"d_{i}" for i in range(len(docs))]

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
dense_index = embedder.encode(docs, normalize_embeddings=True, batch_size=64)

bm25 = BM25Okapi([d.lower().split() for d in docs])

# ---------- 2. Query time ----------
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return [doc_id for doc_id, _ in sorted(scores.items(), key=lambda x: -x[1])]

def hybrid_search(query: str, top_k_retrieve: int = 50, top_k_rerank: int = 5):
    # 2a. BM25
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_top = [doc_ids[i] for i in np.argsort(-bm25_scores)[:top_k_retrieve]]

    # 2b. Dense
    q_vec = embedder.encode(query, normalize_embeddings=True)
    dense_scores = dense_index @ q_vec
    dense_top = [doc_ids[i] for i in np.argsort(-dense_scores)[:top_k_retrieve]]

    # 2c. Fuse
    fused = reciprocal_rank_fusion([bm25_top, dense_top], k=60)[:top_k_retrieve]

    # 2d. Rerank (a dict lookup avoids repeated O(n) list.index() scans)
    id_to_doc = dict(zip(doc_ids, docs))
    pairs = [(query, id_to_doc[d]) for d in fused]
    rerank_scores = reranker.predict(pairs, batch_size=32)
    reranked = sorted(zip(fused, rerank_scores), key=lambda x: -x[1])[:top_k_rerank]

    return [(doc_id, float(score), id_to_doc[doc_id])
            for doc_id, score in reranked]

for doc_id, score, text in hybrid_search("how do I cancel my subscription?"):
    print(f"{doc_id}\t{score:.3f}\t{text[:80]}")
  

9. Tuning Notes

Roughly in order of impact:

- Retrieval and rerank depths (top_k_retrieve, top_k_rerank) are the biggest lever; start at retrieve 50, rerank to 5, and adjust per workload as in section 5.
- If BM25 alone finds documents the dense side misses, change the embedding model (domain-specific or fine-tuned) before touching anything else.
- BM25's k1 and b: the 1.5 / 0.75 defaults are fine for most corpora, and with fixed-size chunks length normalization (b) matters little.
- RRF's k comes last, if at all; values between 10 and 100 barely change the ordering.

Measure recall@50 and nDCG@5 separately after each change so you know which stage a tweak actually moved.


Common Interview Questions:

What is Reciprocal Rank Fusion and why is it the default?

RRF assigns each document a score of 1 / (k + rank) in each ranked list it appears in, then sums across lists. The k constant (typically 60) damps the influence of very high ranks so a position-1 hit doesn't drown out a strong position-5 from the other retriever. The killer property: it operates on ranks, not raw scores, so you don't need to calibrate BM25's tf-idf magnitudes against cosine similarity per corpus. It's parameter-light, works across arbitrarily many retrievers, and consistently beats hand-tuned score-weighted fusion in published benchmarks.

Why hybrid search instead of pure dense?

Dense embeddings are great at paraphrase ("can the lessee assign?" matches "assignment by tenant") but under-weight rare exact tokens — product SKUs, proper nouns, statute citations, version numbers. BM25 nails those by construction (idf rewards rarity). Hybrid covers both failure modes: dense recovers the semantic miss, sparse recovers the keyword miss. On most enterprise corpora hybrid + RRF outperforms either alone by 5–15 points of recall@10. Pure dense is fine for clean conversational data; for messy real-world text, hybrid is the default.

When does adding a re-ranker actually help?

Reranking helps when first-stage recall is good but precision-at-top is bad — the answer is in the bag of 50, just not in position 1. Cross-encoders like bge-reranker or Cohere Rerank score (query, doc) pairs jointly and routinely move nDCG@5 up 10–20 points over bi-encoder retrieval alone. They don't help if recall@50 is already terrible (no amount of reranking finds a missing document) or if your generator is robust to noise (a long-context model that handles 20 chunks well doesn't need a tight top-3). Measure recall@50 and nDCG@5 separately to know which problem you have.
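The two metrics that answer separates, recall@k and nDCG@k, in a minimal binary-relevance sketch:

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs present anywhere in the top-k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain over top-k vs the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

retrieved = ["d_03", "d_42", "d_19", "d_07", "d_88"]   # first-stage output
relevant = {"d_19", "d_88"}
recall_at_k(retrieved, relevant, 5)   # 1.0: both relevant docs are in the bag
ndcg_at_k(retrieved, relevant, 5)     # ~0.54: but they sit low; a reranker would help
```

High recall with low nDCG is exactly the "answer is in the bag of 50, just not in position 1" case where reranking pays off; low recall means the problem is upstream of the reranker.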

What's the latency cost of reranking and how do you contain it?

A cross-encoder reranks ~50 chunks at ~20–50ms per chunk on CPU, ~5–10ms on GPU — so naively, 500ms–2s added to query latency. Containment: batch all 50 pairs in a single forward pass (most rerankers support it); cap rerank input at top-50 not top-200; cache (query, doc_id) pairs in an LRU because users ask the same questions; serve the reranker on a dedicated GPU pod with vLLM or TEI rather than co-locating on the API server. For latency-critical paths, swap the cross-encoder for a smaller distilled reranker (MiniLM-based, ~2–5ms per pair).

ColBERT vs cross-encoder reranker — when would you pick which?

Cross-encoder is the most accurate but slowest — it processes the full (query, doc) text jointly per pair. Use it when you have a fixed top-50 to rerank and can amortize the cost. ColBERT is a late-interaction model: it embeds query tokens and doc tokens separately, then computes MaxSim across token-level vectors. Pre-compute doc embeddings offline, run only the query encoder + similarity at search time. ColBERT gives you most of the cross-encoder's accuracy at a fraction of the latency, but the index is 10–100x larger than a sentence embedding index. Pick ColBERT when first-stage retrieval itself needs to be high-fidelity; cross-encoder when you only need to refine a small candidate set.

How do you debug "the right document wasn't retrieved"?

Decompose by stage. Confirm the document is actually in the index (most common cause: the chunker dropped or split the relevant paragraph). Run BM25 alone, dense alone, hybrid alone — if BM25 finds it but dense doesn't, the embedding model is wrong for the domain (try a domain-specific or fine-tuned embedder). If dense finds it but reranker buries it, the cross-encoder is over-weighting surface form — check rerank scores on the missing doc vs the ones above it. If neither finds it, the chunk strategy fragmented the answer across boundaries; raise overlap or chunk on semantic units. Keep a "lost queries" eval set and test each fix against it.
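That stage-by-stage decomposition is easy to script if you log the candidate list each stage emits; toy data and hypothetical stage names below:

```python
def diagnose_lost_query(gold_id: str, stage_outputs: dict[str, list[str]]) -> str:
    """Return the first stage whose candidate list misses a known-relevant doc.
    stage_outputs maps stage name -> ranked doc-id list, in pipeline order
    (dicts preserve insertion order in Python 3.7+)."""
    for stage, ids in stage_outputs.items():
        if gold_id not in ids:
            return stage
    return "survived all stages"

stages = {
    "bm25@50":  ["d_19", "d_03", "d_42"],
    "dense@50": ["d_03", "d_88"],
    "fused@50": ["d_03", "d_19", "d_88", "d_42"],
    "rerank@5": ["d_03", "d_88"],
}
diagnose_lost_query("d_19", stages)  # "dense@50": the embedder missed it
diagnose_lost_query("d_03", stages)  # "survived all stages"
```

Run this over the whole "lost queries" eval set and tally which stage each failure lands in; the histogram tells you whether to fix the embedder, the fuser, the reranker, or the chunker.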
