The Art of Combining Opinions

BM25 finds exact word matches. Vector search finds semantic similarity. Each has blind spots the other covers.

The obvious next question: why not use both?

Modern search systems do. They run BM25 and vector search in parallel, then combine the results. But combining ranked lists is harder than it sounds. The technique that makes it work - Reciprocal Rank Fusion - is elegant enough to be worth understanding on its own.

The Problem With Scores

Let’s say you search for “authentication flow” and get these results:

BM25 Results:

meeting-notes.md (score: 12.4)
auth-design.md (score: 8.7)
api-spec.md (score: 6.2)

Vector Results:

auth-design.md (score: 0.89)
login-flow.md (score: 0.84)
meeting-notes.md (score: 0.71)

Which document is most relevant overall?

The naive approach is averaging scores. But that doesn’t work - BM25 scores and cosine similarity scores are completely different things. A BM25 score of 12.4 doesn’t mean the same thing as a vector score of 0.89. They’re not on the same scale, they’re not measuring the same thing, and adding them together is meaningless.

You could try normalizing - convert both to 0-1 ranges and then combine. But normalization is tricky. What range do you normalize to? How do you handle documents that only appear in one list? The choices are arbitrary and affect results in ways that are hard to predict.

Reciprocal Rank Fusion sidesteps all of this by ignoring scores entirely. It only looks at ranks.

Mismatched measurement scales giving way to a clean positional podium

Ranks, Not Scores

RRF’s insight: position in a ranked list tells you something, regardless of the scoring function that produced it.

If a document is ranked #1 by BM25, that means BM25 thinks it’s the most relevant. If the same document is ranked #2 by vector search, that means vector search thinks it’s nearly the most relevant. We don’t need to understand or compare the scores - the ranks carry the signal.

RRF computes a combined score based purely on positions:

RRF(document) = sum of 1/(k + rank) for each list

Where k is a constant (typically 60) and rank is the document’s position in each list.

Let’s work through the example:

meeting-notes.md:

BM25 rank 1: 1/(60+1) = 0.0164
Vector rank 3: 1/(60+3) = 0.0159
RRF: 0.0323

auth-design.md:

BM25 rank 2: 1/(60+2) = 0.0161
Vector rank 1: 1/(60+1) = 0.0164
RRF: 0.0325

api-spec.md:

BM25 rank 3: 1/(60+3) = 0.0159
Vector rank: not present = 0
RRF: 0.0159

login-flow.md:

BM25 rank: not present = 0
Vector rank 2: 1/(60+2) = 0.0161
RRF: 0.0161

Combined ranking:

auth-design.md (0.0325)
meeting-notes.md (0.0323)
login-flow.md (0.0161)
api-spec.md (0.0159)

auth-design.md wins because it ranked highly in both lists. meeting-notes.md ranked #1 in BM25 but only #3 in vectors, so it comes second. Documents appearing in only one list score lowest.

Why k=60?

The constant k controls how much rank position matters.

With a small k (say, 1):

Rank 1: 1/(1+1) = 0.50
Rank 2: 1/(1+2) = 0.33
The gap between #1 and #2 is huge

With a large k (say, 60):

Rank 1: 1/(60+1) = 0.0164
Rank 2: 1/(60+2) = 0.0161
The gap between #1 and #2 is small

A larger k means positions matter less relative to appearing in multiple lists. Being #1 in one list isn’t much better than being #2. But appearing in both lists, even at mediocre ranks, beats appearing at the top of only one.

The original RRF paper (2009) found k=60 worked well empirically across various datasets. It’s become the standard default. The intuition: it rewards consensus across methods while still giving some credit to top positions.

Query Expansion: Casting a Wider Net

RRF combines results from different search methods. But you can go further by generating variations of the query itself.

Instead of just searching for “authentication flow,” you use a small language model to generate alternative phrasings. Maybe “user login process” or “identity verification steps.”

Then you run four searches, not two:

Original query → BM25
Original query → vectors
Expanded query → BM25
Expanded query → vectors

All four result lists feed into RRF. The original query gets weighted higher to keep it dominant, but the expansion helps catch documents using different terminology.

This addresses a limitation I mentioned in the BM25 post: you might not remember the exact words used. If your notes say “login” but you search for “authentication,” the expansion might generate “login” and catch those documents.

Query expansion is essentially automated synonym matching, powered by a language model that understands how concepts can be rephrased.

Why Hybrid Beats Pure

You might wonder: if vector search understands meaning, why bother with BM25 at all?

Because they have different failure modes.

BM25 fails when:

You use different words than the document (“auth” vs “authentication”)
You’re searching for concepts, not terms
You can’t remember exact phrasing

Vector search fails when:

You want exact matches (error messages, function names)
The terms are rare or domain-specific
The document is tangentially related but not actually relevant (high similarity, low relevance)

Consider searching for “ECONNREFUSED timeout.” BM25 will find documents containing that exact error string. Vector search might return documents about network errors generally - related, but not specifically about ECONNREFUSED.

Or consider searching for “how we handle user sessions.” Vector search will find documents about session management even if they never use the word “session.” BM25 would miss them if the terminology differs.

Hybrid search gets both. RRF ensures documents matching both methods rise to the top, while documents matching only one method still appear - just ranked lower.

Putting It Together

A hybrid retrieval pipeline looks like this:

Query expansion: Generate variations of the original query
Parallel retrieval: Run all queries against both BM25 and vector indexes
RRF fusion: Combine all ranked lists into one
Return results: Ranked by combined RRF score

The whole thing can run locally. The expansion model, the embedding model, the BM25 index - all on your machine. No API calls, no network latency, no data leaving your control.

For most searches, this is what you want. It’s slightly slower than pure BM25 (model inference takes time), but the results are substantially better for anything beyond exact-match queries.

RRF Beyond Search

RRF is useful anywhere you need to combine ranked lists from different sources. Election aggregation, recommendation systems, meta-analysis of studies. The principle is the same: ranks carry information even when scores don’t compare.

The elegance is in what RRF ignores. It doesn’t try to understand the scoring functions. It doesn’t normalize or calibrate. It just observes: this document ranked highly according to multiple independent methods. That’s a signal worth trusting.

It’s a reminder that sophisticated problems sometimes have simple solutions. Combining search rankings could involve complex machine learning, learned weighting schemes, elaborate normalization procedures. Or it could involve adding up reciprocals of positions. The simple approach often wins.

The Takeaway

BM25 and vector search answer different questions. BM25 asks “does this document contain these words?” Vector search asks “is this document about the same thing?” Hybrid search asks both.

RRF combines their answers without needing to understand or compare their scores. Documents that both methods like rise to the top. Documents that only one method likes still appear, just lower.

Query expansion catches terminology mismatches by searching for variations you didn’t think of.

Together, these techniques turn keyword search and semantic search from competing approaches into complementary ones. The result is search that handles both precise technical queries and vague conceptual ones - the best of both worlds.