The Second Opinion

BM25 matches keywords. Vector search matches meaning. Reciprocal Rank Fusion combines them. But all three share a limitation: they work from preprocessed representations, not the actual content.
Re-ranking is different. It’s where a language model actually reads the results and judges whether they’re relevant.
This is the piece that makes modern search feel almost magical. Retrieval finds candidates quickly but makes mistakes. Re-ranking catches those mistakes.
Fast and Shallow vs Slow and Deep
BM25 and vector search are fast because they use precomputed indexes. BM25 looks up terms in an inverted index. Vector search compares embeddings that were computed during indexing. Neither method reads the actual document at query time.
This is a tradeoff. Speed requires simplification. The indexes capture something about each document, but not everything. A vector embedding compresses a document into a few hundred numbers. Nuance gets lost.
The result: retrieval methods sometimes return documents that seem relevant but aren’t. The embedding for “coffee machine maintenance” might be close to “coffee brewing techniques” because both involve coffee. But if you’re searching for maintenance guides, brewing techniques aren’t helpful.
This is where re-ranking comes in. A re-ranker actually reads the candidate documents and the query, then makes a judgment: is this document actually relevant to what the user asked?
How Re-ranking Works
A re-ranker is a small language model trained specifically to judge relevance. It’s not a general chatbot - it’s a specialist. Models like Qwen3-Reranker or BGE-Reranker are typically 500MB-1GB.
For each candidate document, the re-ranker receives:
- The original query
- The document content (or the matching chunk)
It outputs a simple judgment: yes or no, with a confidence score. “Yes, this document answers the query” or “No, it doesn’t.”
The confidence scores let the system adjust rankings. A high-confidence “yes” boosts a document up. A high-confidence “no” pushes it down. Uncertain judgments have less effect.
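As a minimal sketch of how that adjustment might work: the `rerank_judgment` function below is a hypothetical stand-in for a real model call (a cross-encoder such as BGE-Reranker would go there), and the additive boost/demote scheme is one possible choice, not the only one.

```python
def rerank_judgment(query: str, doc: str) -> tuple[bool, float]:
    """Stand-in for a re-ranker model: returns (is_relevant, confidence).

    A real system would run a cross-encoder here; this stub just checks
    whether the query's key term appears in the document.
    """
    relevant = "maintenance" in doc if "maintenance" in query else True
    return relevant, 0.9

def adjusted_score(base: float, relevant: bool, confidence: float) -> float:
    # A confident "yes" boosts the document; a confident "no" demotes it.
    # Low-confidence judgments move the score only slightly.
    direction = 1.0 if relevant else -1.0
    return base + direction * confidence

query = "coffee machine maintenance"
# Two candidates that tied in retrieval: one relevant, one merely on-topic.
docs = {
    "descaling guide for espresso machine maintenance": 0.5,
    "pour-over coffee brewing techniques": 0.5,
}
ranked = sorted(
    docs,
    key=lambda d: adjusted_score(docs[d], *rerank_judgment(query, d)),
    reverse=True,
)
print(ranked[0])  # the maintenance guide now ranks first
```

The point is the shape of the interface: the re-ranker emits a label plus a confidence, and the confidence scales how far the document moves.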
This is fundamentally different from retrieval. BM25 and vectors compute similarity metrics - statistical measures that correlate with relevance. The re-ranker makes a semantic judgment - it understands language well enough to assess whether the content actually addresses the query.
Two-Stage Retrieval
You might wonder: if the re-ranker is so good at judging relevance, why not use it for everything?
Because it’s slow.
Running a language model takes time - even a small one. If you have 10,000 documents, running the re-ranker on all of them would take far too long. BM25 can search 10,000 documents in milliseconds. The re-ranker might take seconds per document.
The solution is two-stage retrieval:
- Stage 1 (retrieval): Use fast methods (BM25, vectors, RRF) to find the top 50-100 candidates
- Stage 2 (re-ranking): Use the slow but accurate re-ranker to judge those candidates
This is the classic speed-accuracy tradeoff. Stage 1 casts a wide net quickly, accepting some false positives. Stage 2 filters carefully, using more expensive computation on a much smaller set.
The key assumption: if a document is truly relevant, stage 1 will probably find it. The retrieval methods don’t need to be perfect - they just need to not miss good results. Stage 1 handles recall; stage 2 handles precision.
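The two stages can be sketched in a few lines. Both scoring functions here are illustrative stubs (cheap word overlap instead of a real index, length-normalized overlap instead of a real model); the structure is what matters: the expensive stage only ever sees the short candidate list.

```python
def fast_retrieve(query: str, corpus: list[str], k: int = 50) -> list[str]:
    """Stage 1: cheap keyword-overlap score over the whole corpus."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def slow_rerank(query: str, candidates: list[str]) -> list[str]:
    """Stage 2: pretend-expensive judgment, run only on the candidates."""
    terms = set(query.lower().split())
    def judge(doc: str) -> float:
        overlap = len(terms & set(doc.lower().split()))
        return overlap / max(len(doc.split()), 1)
    return sorted(candidates, key=judge, reverse=True)

corpus = [
    "coffee machine maintenance and descaling",
    "coffee brewing techniques for pour-over",
    "fixing a refused network connection",
]
candidates = fast_retrieve("coffee machine maintenance", corpus, k=2)
results = slow_rerank("coffee machine maintenance", candidates)
print(results[0])  # the maintenance doc wins
```

Note that `slow_rerank` never touches the third document: stage 1 already filtered it out, so stage 2's cost scales with the candidate count, not the corpus size.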
Position-Aware Blending
Sophisticated systems don’t simply replace retrieval scores with re-ranker scores. They blend them, with the blend ratio depending on position.
| Position | Retrieval Weight | Re-ranker Weight |
|---|---|---|
| 1-3 | 75% | 25% |
| 4-10 | 60% | 40% |
| 11+ | 40% | 60% |
Why different weights at different positions?
The top results from RRF are strong signals - documents that ranked highly in multiple retrieval methods. If both BM25 and vectors agree a document is relevant, it probably is. The re-ranker might disagree, but retrieval has a strong track record at the top. So the blend favors retrieval.
Lower positions are less certain. Maybe a document ranked #50 in BM25 but #5 in vectors. RRF puts it somewhere in the middle, but we’re not confident. Here, the re-ranker’s judgment matters more. It can rescue genuinely relevant documents that retrieval undervalued, or demote false positives that slipped through.
This is a hedge. We don’t fully trust either signal, so we blend them. The blend shifts based on how confident we are in the retrieval signal at each position.
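The weight table above translates directly into a small blending function. This assumes both scores have already been normalized to a comparable range; the cutoffs and weights are the ones from the table, not universal constants.

```python
def blend(position: int, retrieval_score: float, rerank_score: float) -> float:
    """Blend retrieval and re-ranker scores with position-dependent weights."""
    if position <= 3:
        w_retrieval = 0.75   # top results: trust agreement between retrievers
    elif position <= 10:
        w_retrieval = 0.60
    else:
        w_retrieval = 0.40   # deep results: let the re-ranker rescue or demote
    return w_retrieval * retrieval_score + (1 - w_retrieval) * rerank_score

# A confident re-ranker "no" (0.1) hurts a deep result far more than a top one.
print(blend(1, 0.9, 0.1))    # ≈ 0.70
print(blend(15, 0.9, 0.1))   # ≈ 0.42
```

The same re-ranker disagreement that barely dents a top-3 document can push a position-15 document most of the way down, which is exactly the hedge the table describes.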
The Full Pipeline
A complete hybrid search pipeline looks like this:
- Query expansion: Generate variations of the query using a local model
- Parallel retrieval: Run all queries against both BM25 and vector indexes
- RRF fusion: Combine all ranked lists into unified scores
- Re-ranking: Run the re-ranker on top candidates
- Position-aware blending: Combine retrieval and re-ranker scores
- Return results: Final ranked list
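The fusion stage in the middle of that pipeline is just the Reciprocal Rank Fusion formula: each ranked list contributes 1/(k + rank) per document, so items near the top of any list score highly, and items appearing in several lists accumulate score. The sketch below uses k = 60, the constant from the original RRF paper.

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists into one with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            # Each appearance adds 1/(k + rank); k damps the very top ranks.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]
fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused[0])  # "doc_a": ranked 1st and 2nd, edging out doc_c (3rd and 1st)
```

Because RRF works only on ranks, the BM25 and vector scores never need to share a scale, which is what makes it a convenient glue layer between stages 1 and 2.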
Each stage serves a purpose:
- Query expansion catches terminology mismatches
- BM25 catches exact keyword matches
- Vectors catch semantic similarity
- RRF combines signals from different methods
- Re-ranking filters false positives and promotes true relevance
- Position-aware blending hedges between retrieval and re-ranker confidence
It’s a lot of machinery for a search command. But the result is search that handles both “ECONNREFUSED error” (exact match) and “why is networking broken” (semantic query) gracefully.
Running Locally
All of this can run on your machine. A typical setup uses three models:
| Model | Size | Purpose |
|---|---|---|
| Embedding model | ~300MB | Creates vector embeddings |
| Re-ranker | ~640MB | Judges relevance |
| Query expansion | ~1.1GB | Generates query variations |
Total: about 2GB. Not tiny, but manageable. I’ve written before about small models and what they can do. These are specialized tools - not general chatbots, but experts at their narrow tasks. They run on CPU if needed, though GPU acceleration helps.
The local-first approach matters. Your documents don’t leave your machine. There’s no API to pay for, no rate limits, no service to go down. The search is as reliable as your laptop.
QMD is one implementation of this pipeline. It assembles BM25, vectors, RRF, and re-ranking into a coherent tool that runs entirely locally. But the concepts apply broadly - any modern search system worth its salt uses some combination of these techniques.
The Takeaway
Modern search is layered. Fast methods find candidates. Slow methods judge them. Fusion combines multiple signals. Each layer compensates for the limitations of the others.
BM25 is fast but only matches words. Vectors understand meaning but can be fooled by surface similarity. RRF combines them but can’t tell if a document actually answers the question. Re-ranking can, but it’s too slow to run on everything.
Together, they form a pipeline that’s greater than the sum of its parts. You ask a question in natural language, and documents appear ranked by genuine relevance - not just keyword density or embedding distance, but something closer to “does this actually help?”
The techniques aren’t new. BM25 is from 1994. Vector search has been around for years. RRF was published in 2009. Re-ranking with language models is newer but well-established. The insight is that these old ideas, thoughtfully assembled, produce something that feels like magic.
That’s often how useful software gets made. The components exist; someone just needs to put them together. Understanding what each piece does - and why - helps you appreciate the engineering, and maybe build something of your own.