How retrieval actually works: BM25, dense vectors, hybrid, and the art of re-ranking
A retrieval-augmented system can only be as good as what it retrieves; generation runs downstream of search, which makes search the whole game. This is the long version of that claim — a tour through lexical scoring, dense embeddings, hybrid fusion, re-ranking, chunking, evaluation, and the pipeline shape that ties it all together.
A retrieval-augmented system can only be as good as what it retrieves; generation runs downstream of search, which makes search the whole game. This is the long version of that claim — a tour through lexical scoring, dense embeddings, hybrid fusion, re-ranking, chunking, evaluation, and the pipeline shape that ties it all together.
The retrieval problem
The premise is unromantic. A model writes its answer using the documents it is handed. If the right document is not in that handful, the model cannot quote it, cannot ground on it, and cannot reason from it. It will produce something — models always produce something — but the something will be drawn from training-time memory or, worse, from confident interpolation. The ceiling of any RAG pipeline is set by the retriever, not by the generator. Tune the retriever, and a mediocre model gets sharper. Skip that work, and the best model in the world will hallucinate around the gap.
Real systems handle this by splitting the job in two. The first stage is broad and cheap: from a corpus of hundreds of thousands or hundreds of millions of documents, pull back a few hundred candidates fast. The second stage is narrow and expensive: take those candidates and reorder them with a model that actually looks closely. The reason this two-stage shape exists is arithmetic. Running an accurate cross-attention model over every document in a corpus would cost the moon. Running a fast approximate retriever over the whole corpus, then an accurate model over a hundred candidates, costs almost nothing.
Recall is the retriever's job. Precision is the reranker's. Keep that division clear and most other design choices fall out cleanly.
Lexical retrieval, or: why BM25 still wins fights
Sparse retrieval is the older tradition and the one engineers most often dismiss before they understand it. The core question it answers is: given a query, which documents contain words that look important and rare? That sounds primitive next to vectors, but it is exactly right for a large class of queries and embarrassingly hard to beat when the user types an error code or a function name.
The lineage runs through TF-IDF. Term frequency says a word that appears many times in a document is probably what that document is about. Inverse document frequency says a word that appears in almost every document is uninformative — "the" tells you nothing. Multiply them, sum over query terms, and you have a passable score. BM25, formally Okapi BM25, fixes two real problems with that naive product.
The first problem is that linear term frequency rewards keyword stuffing. A document that repeats the query word twenty times should not be twenty times as relevant as one that mentions it once. BM25 saturates the tf contribution so the curve flattens after a few hits. The second problem is that long documents accumulate matches by accident — a thousand-word article will hit query terms more often than a two-line snippet just by length. BM25 normalizes by document length relative to the corpus average.
The full scoring formula, in words: for each query term, multiply that term's idf by a saturated, length-normalized term-frequency factor, then sum.
with
Here $N$ is the number of documents, $n_t$ is the number of documents containing term $t$, $f(t,d)$ is how many times $t$ occurs in document $d$, $|d|$ is the document's token length, and $\text{avgdl}$ is the corpus's average document length. The two knobs are $k_1$, which controls how aggressively tf saturates (typically 1.2–2.0), and $b$, which controls how much length normalization to apply (typically 0.75; $b=0$ disables it, $b=1$ applies it fully).
Each piece earns its keep. The idf term downweights common words and rewards rare ones, on a logarithmic scale so a word in 1% of documents is not a thousand times more valuable than one in 10%. The $\frac{f(k_1+1)}{f + k_1(\dots)}$ term grows with $f$ but with diminishing returns — twice as many matches do not double the contribution. And the $1 - b + b\cdot|d|/\text{avgdl}$ piece in the denominator means a document twice the average length gets penalized for accidental matches, while a document at average length is treated neutrally.
A worked example makes this concrete. Consider a small corpus of eight one-sentence documents and the query "how does idf downweight common terms". The two BM25-scoring documents are:
1. score 3.092 inverse document frequency downweights common terms across the corpus
2. score 1.431 dogs and cats are common household pets
Notice what happened. Document 7 matches on common, terms, and downweights — the last one being rare across the corpus, so its idf is large. Document 1 ("dogs and cats…") matches only on common, a moderately frequent word, and gets a much lower score. Idf did the work it was designed to do: rare, informative tokens dominated the sum.
The tf-saturation curve is the other thing worth seeing as numbers, not as a vague claim:
tf= 1 tf-component = 1.000
tf= 2 tf-component = 1.429
tf= 4 tf-component = 1.818
tf= 8 tf-component = 2.105
tf=16 tf-component = 2.286
Sixteen occurrences of a term contribute only about 2.29 times what a single occurrence does, not sixteen. That is the entire reason a BM25 ranker is robust against a page that repeats cheap flights cheap flights cheap flights a hundred times. The denominator catches up to the numerator and the gain plateaus.
When BM25 wins. Exact strings. Identifiers. Error codes. Stack-trace symbols. Version strings. Proper names. Code tokens. Anything where a paraphrase would be a different thing — EACCES is not "permission denied" for the purpose of searching log lines that contain EACCES. There is no training to do, no model to host, indexing is cheap, and the score is fully explainable: you can tell a user exactly which terms matched and why each weighed what it did.
Where BM25 fails. Vocabulary mismatch. Car and automobile are unrelated tokens to BM25. Heart attack will not retrieve a paper that talks only about myocardial infarction. BM25 sees surface tokens and nothing else; it has no notion of meaning. Stemming and synonym lists patch this a little, but they patch it with manual labor and never cover the long tail.
Dense retrieval: vectors, meaning, and the cost of fuzziness
Dense retrieval starts from a different observation: words and sentences have meanings, and meanings can be projected into geometry. Train a model to map a query and a document each to a fixed-length vector — say 384 or 768 numbers — such that semantically similar texts end up near each other. Then "search" becomes "find the document vectors closest to the query vector". Closeness is usually cosine similarity, which measures the angle between two vectors rather than their magnitudes, or plain dot product when vectors are normalized.
This shape is called a bi-encoder: two separate encoder passes, one for the query and one for the document, never interacting until the final similarity score. The independence is the architectural superpower and the architectural ceiling. Because the document encoder never sees the query, every document in the corpus can be embedded once at indexing time and cached forever; at query time you only run the encoder on the query, then do a fast vector lookup. That is what makes the method scale. But because the only interaction between query and document is a single dot product at the end, the model has no way to look at the query and "re-read" the document with that question in mind. The expressive power is bounded by what fits in one vector.
A second hard problem hides inside fast vector lookup. Exact nearest neighbor over ten million 768-dim vectors means ten million dot products per query. That is too slow. Approximate Nearest Neighbor (ANN) algorithms trade a small amount of recall for a large amount of speed.
HNSW — Hierarchical Navigable Small World — is a graph where every vector is a node connected to a few dozen of its neighbors, organized into stacked layers like a road atlas with highways on top and side streets on the bottom. To search, start at an arbitrary top-layer node, greedily walk toward the query by hopping to the neighbor with the highest similarity, drop down a layer when no neighbor improves, and repeat. The graph is built so that this greedy walk almost always lands at, or very near, the true nearest neighbor in a logarithmic number of steps. It is fast and recall is high. It eats memory, because every vector keeps a list of neighbors.
IVF — inverted file index — first clusters all vectors into, say, a thousand groups using k-means, then at query time finds the few clusters closest to the query and searches only within those. Cheaper than HNSW per query, slightly lower recall, smaller memory footprint.
Product Quantization is a compression scheme. Slice each vector into sub-vectors, learn a small codebook for each slice, and store only codebook indices instead of raw floats. A 768-dim float32 vector that was 3 KB becomes a few dozen bytes. You lose precision; you gain the ability to keep a billion vectors in RAM. PQ is usually combined with IVF (IVF-PQ) for huge corpora.
The knob across all three is the same: recall versus latency versus memory. You can dial it.
When dense wins. Paraphrase. Synonymy. Conceptual queries where the user knows what they want but not the exact phrasing in your corpus. How do I deal with a slow database will surface a document titled Query optimization techniques even without a single shared content word beyond database. Multilingual retrieval, when the embedder was trained for it: a query in one language can land documents in another.
Where dense fails. Out-of-domain vocabulary the embedder never saw — a model trained on web text will struggle with a corpus of niche legal jargon, drug names, internal product codenames. Exact rare tokens, which are the BM25 sweet spot. And the most insidious failure: the confident wrong answer. A dense retriever almost always returns something with a high cosine score, even when nothing in the corpus is actually relevant; the geometry makes everything look kind of close to everything else. You can lose hours debugging an answer that the retriever fabricated a tight cluster of plausible-but-irrelevant documents to support.
Hybrid retrieval: combining adversaries
The clean observation is that BM25 and dense fail on different queries. BM25 nails the exact-token cases that dense fumbles; dense nails the paraphrase cases that BM25 misses. Their errors are uncorrelated, which is the formal condition for an ensemble to help. Run both, combine the results, and you get a recall curve above either alone.
The hard part is how to combine them. BM25 produces unbounded positive scores whose magnitude depends on the corpus and the query. Cosine similarity is bounded in $[-1, 1]$ but its useful range, in practice, is squashed into the top 0.1 of that interval. Adding the two raw scores is nonsense. Normalizing each list to $[0,1]$ and weighting them is fragile — the right weights drift with query length, corpus, day of the week.
Reciprocal Rank Fusion sidesteps the whole scale problem. Throw away the scores. Keep only the rank each method assigned each document. For each document, sum $\frac{1}{k + \text{rank}}$ across the lists, where $k$ is a constant typically around 60. Sort by that sum. Done.
What the constant $k$ buys you is dampening. Without it, the top-ranked document in any list would dominate; with $k = 60$, the difference between rank 1 and rank 2 is $\frac{1}{61} - \frac{1}{62} \approx 2.6 \times 10^{-4}$, modest enough that consensus across rankers matters more than being first in any one ranker. A document ranked 1 by both methods beats a document ranked 1 by one and 50 by the other, but the margin is forgiving — appearing on both lists at all is worth a lot.
On the same toy corpus, fusing BM25's ranking with an illustrative dense ranking produces:
1. rrf 0.03279 inverse document frequency downweights common terms across the corpus
2. rrf 0.03175 BM25 ranks documents by term frequency and inverse document frequency
3. rrf 0.03128 dogs and cats are common household pets
The document both methods liked rose to the top. The document only BM25 liked still gets a fair shot. No score normalization, no weight tuning, no embarrassing edge cases when one method returns a huge score and the other returns a tiny one. RRF is the most cost-effective hybrid fusion technique relative to how little you have to think about it.
Re-ranking: the precision stage
The retriever's job is do not miss the right document. The reranker's job is put the right one on top. These are different problems and they want different models.
A bi-encoder, the kind dense retrieval uses, is fast precisely because the document was embedded without ever seeing the query. A cross-encoder throws that property away on purpose. It concatenates the query and the document into a single input and runs a full transformer pass over the pair. Every query token attends to every document token. The model can notice that the query is asking about a specific edge case and that the document discusses exactly that case three sentences in. The output is a single relevance score — typically a logit, sometimes calibrated to a probability — and it is far more accurate than any single-vector similarity.
The cost is that you cannot precompute anything. Every (query, document) pair is a fresh forward pass. Running a cross-encoder over a million documents at query time is infeasible. Running one over a hundred candidates that a cheap retriever already shortlisted is fine. That is the entire reason the two-stage pattern exists.
Between the bi-encoder and the cross-encoder sits ColBERT-style late interaction. Embed each token of the document, not each whole document, and store all those token vectors. At query time, embed each query token and score the document by a MaxSim operator: for every query token, find the document token it most resembles, sum those maximums. Cross-attention's expressiveness is approximated without paying its cost — each query token "finds its best match" in the document, which captures the local alignment that a single dot product cannot. The index is larger (one vector per token rather than one per document) and search is more involved, but quality lands between bi-encoder and cross-encoder and the document side can still be precomputed.
A discipline worth stating in capitals: a reranker cannot save a document the retriever did not return. If recall@100 is bad, your reranker is sorting the wrong list. Always measure and tune retrieval recall first; the reranker is the second knob, not the first.
The things that quietly decide everything
Most RAG projects spend a month picking a clever embedder and twelve minutes on chunking. This is exactly backwards. The boring parts of the pipeline determine the ceiling.
Chunking
A document gets split into chunks before it is indexed. The chunk is the atomic unit the retriever returns. If the chunks are too large, the relevant two-sentence answer drowns in noise and the embedding averages over too many topics, blurring its position in vector space. If the chunks are too small, you fracture the context — the sentence that answers the question lives in chunk A but the qualifier that makes the answer correct lives in chunk B, and only one of them comes back.
Three habits help. Overlap chunks by something like 10–20% so a fact that straddles a boundary appears in at least one chunk intact. Prefer semantic boundaries — paragraph breaks, headings, list items, code blocks — over fixed token windows; the structure of the document is already telling you where natural units begin and end. And size chunks to the kind of answer you expect: short-answer factoid systems want smaller chunks, long-form explanatory systems want larger ones.
Metadata filtering
Before any ranking runs, filter the candidate set by structured fields the system already knows. Date range. Source. Language. Document type. Tenant, when this is multi-tenant. Most importantly, permissions — never let a retriever see documents the requesting user is not allowed to see. This is not just a quality improvement, it is a security control: a hallucination that quotes a confidential document the user should not have is a leak, even if the model thinks it is being helpful.
Query transformation
Users do not write good queries. Two-word fragments and rambling sentences both fail in their own ways. A few techniques close the gap. Query rewriting uses a language model to convert a vague query into a clean retrieval query. Multi-query generates several rephrasings and unions the results — if any phrasing matches the corpus's vocabulary, you win. HyDE (Hypothetical Document Embeddings) is the cleverest: ask the model to draft a hypothetical answer to the query, then embed that draft instead of the query itself. The hypothetical answer is written in the vocabulary of answers, which is the vocabulary your corpus is also written in, so the embedding lands closer to real answers than the original question would.
Context assembly
Once the reranker has produced the top few documents, how you order them in the prompt matters. Long-context models attend less to the middle of their input than to the beginning and the end — a phenomenon usually called lost in the middle. Place the strongest evidence at the edges, not buried in the middle of a stack. Deduplicate near-identical passages; passing the same fact three times wastes tokens. Respect the budget; truncate the weakest passages before truncating the best. Include source identifiers so the model can cite — and so you can audit where each claim came from.
Evaluation: you cannot improve what you do not measure
A retrieval pipeline without an eval set is a pipeline that drifts. Vibes are not a metric. The fix is small and unglamorous: a labeled set of queries with known-relevant documents, run on every change.
Recall@k asks whether the right document is in the top k the retriever returned at all — the most important number for the retrieval stage, because nothing downstream can recover what was missed. Precision@k asks what fraction of the top k are actually relevant. MRR (Mean Reciprocal Rank) rewards a system for putting the first correct answer high — the score for a query is $\frac{1}{\text{rank of first correct doc}}$, averaged over queries. nDCG (Normalized Discounted Cumulative Gain) is the most informative when relevance is graded rather than binary: each document gets a relevance score, contributions are discounted by position (a hit at rank 1 counts for full credit, a hit at rank 10 counts for a fraction), and the whole thing is normalized against the ideal ordering for that query.
Measure the retriever and the reranker separately. If recall@100 is 0.95 and recall@5 after reranking is 0.6, the reranker is failing. If recall@100 is 0.5, no reranker is going to save you and tuning the reranker is the wrong work.
End-to-end metrics matter too. Faithfulness (sometimes called groundedness) asks whether the generated answer is actually supported by the retrieved evidence — a model that adds confident extras has a faithfulness problem regardless of retrieval quality. Answer relevance asks whether the answer addresses the question that was asked. Both can be scored with another model, with humans, or with both. The point is to score them on a fixed set, on every change, and to look at regressions before deciding the new system is better.
What good RAG actually looks like
Lay the whole pipeline out and the shape becomes clear. It is a sequence of independently-measurable stages, each tunable.
- Ingest. Pull source documents, normalize formats, strip junk, extract clean text from PDFs/HTML/markup.
- Chunk. Cut along semantic boundaries — headings, paragraphs, code fences — with modest overlap. Carry metadata (source, date, section path, permissions) on every chunk.
- Index. Embed every chunk with a bi-encoder and store the vectors in an ANN index (HNSW or IVF-PQ depending on corpus size and memory budget). In parallel, build a BM25 index of the same chunks. Two indexes, same chunks, same metadata.
- At query time, optionally rewrite the query. A short rewrite step that fixes typos, expands acronyms, or produces a HyDE-style hypothetical answer for embedding.
- Retrieve hybrid. Run BM25 and dense retrieval in parallel, top ~100 each, then fuse with RRF.
- Pre-filter by metadata. Apply date, source, language, and permission filters. Cheap, and it removes documents the user should not see before the model sees them.
- Re-rank. Run a cross-encoder over the fused top ~100 and keep the top ~5.
- Assemble context. Order strongest evidence at the edges, deduplicate, fit the budget, include citations.
- Generate. Prompt the model with the assembled context, instruct it to cite, instruct it to refuse when evidence is missing.
- Log everything. Query, retrieved candidates, reranker scores, final context, generated answer, and (if you can get them) user feedback signals.
- Eval on every change. Run the labeled set. Look at recall@100, MRR@5, nDCG, faithfulness, answer relevance. Compare to the last known-good run.
Each stage is a place where a measurable change improves a measurable metric. That is the entire reason this pipeline is shaped this way — not because it is fancy, but because it is factored.
A short decision guide
A pocket version, for picking a starting point:
- Exact terms, identifiers, error codes, code search: BM25 alone is often enough, and it is the right first move even when you plan to do more.
- Concepts, paraphrase, "I do not know the exact words": dense retrieval is the floor; you cannot do this with BM25.
- Anything production: hybrid (BM25 + dense, fused with RRF) plus a cross-encoder reranker. The defaults are good; tune from there.
- Tiny corpus, a few hundred documents: BM25 alone, no ANN, no reranker. Do not over-engineer.
- Huge corpus, hundreds of millions of vectors: IVF-PQ or HNSW with quantization, plus the full hybrid+rerank stack. Memory is now your tightest constraint.
- High-stakes content (legal, medical, financial): evaluation set first, then build. Faithfulness matters more than top-line cleverness.
The whole thing, in one runnable program
The honest way to teach this is to make a reader run it. Below is a complete program — about seventy lines, standard library only — that builds a BM25 index over a toy corpus, scores a query, prints the tf-saturation table that demonstrates the $k_1$ effect in isolation, and fuses BM25's ranking with an illustrative dense ranking via RRF. The dense ranking here is a fixed list standing in for what a real embedder would return; the RRF arithmetic is real, but no embedding model is actually run in this script. I would rather be honest about that than fake it.
"""Real BM25 on a toy corpus + Reciprocal Rank Fusion. Stdlib only."""
from __future__ import annotations
import math, re
from collections import Counter
corpus = [
"the cat sat on the warm mat by the fire",
"dogs and cats are common household pets",
"BM25 ranks documents by term frequency and inverse document frequency",
"a transformer encodes tokens into dense vector embeddings",
"vector databases use approximate nearest neighbor search like HNSW",
"the quick brown fox jumps over the lazy dog",
"reranking with a cross encoder scores the query and document together",
"inverse document frequency downweights common terms across the corpus",
]
def tok(s): return re.findall(r"[a-z0-9]+", s.lower())
docs = [tok(d) for d in corpus]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter()
for d in docs:
for t in set(d):
df[t] += 1
def idf(t):
n = df.get(t, 0)
return math.log((N - n + 0.5) / (n + 0.5) + 1)
def bm25(query, k1=1.5, b=0.75):
q = tok(query)
scores = []
for i, d in enumerate(docs):
tf = Counter(d); dl = len(d); s = 0.0
for t in q:
if t not in tf: continue
f = tf[t]
s += idf(t) * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
scores.append((i, s))
return sorted(scores, key=lambda x: -x[1])
def rrf(rank_lists, k=60):
score = Counter()
for lst in rank_lists:
for rank, doc in enumerate(lst, 1):
score[doc] += 1.0 / (k + rank)
return score.most_common()
# --- run it ---
q = "how does idf downweight common terms"
ranked = bm25(q)
print(f'query: "{q}"')
for rank, (i, s) in enumerate(ranked[:5], 1):
if s > 0:
print(f" {rank}. score {s:.3f} doc[{i}]: {corpus[i]}")
# term-frequency saturation (b=0 isolates the k1 effect)
print("\nterm-frequency saturation (k1=1.5):")
for f in [1, 2, 4, 8, 16]:
k1 = 1.5
print(f" tf={f:2d} tf-component={(f*(k1+1))/(f+k1):.3f}")
# fuse BM25 ranks with an illustrative dense ranking
dense_rank = [7, 2, 3, 4, 0, 1, 6, 5]
bm25_rank = [i for i, _ in ranked]
print("\nReciprocal Rank Fusion (k=60):")
for rank, (doc, s) in enumerate(rrf([bm25_rank, dense_rank])[:5], 1):
print(f" {rank}. rrf {s:.5f} doc[{doc}]: {corpus[doc]}")
And the captured output, verbatim:
query: "how does idf downweight common terms"
1. score 3.092 doc[7]: inverse document frequency downweights common terms across the corpus
2. score 1.431 doc[1]: dogs and cats are common household pets
term-frequency saturation (k1=1.5):
tf= 1 tf-component=1.000
tf= 2 tf-component=1.429
tf= 4 tf-component=1.818
tf= 8 tf-component=2.105
tf=16 tf-component=2.286
Reciprocal Rank Fusion (k=60):
1. rrf 0.03279 doc[7]: inverse document frequency downweights common terms across the corpus
2. rrf 0.03175 doc[2]: BM25 ranks documents by term frequency and inverse document frequency
3. rrf 0.03128 doc[1]: dogs and cats are common household pets
4. rrf 0.03126 doc[0]: the cat sat on the warm mat by the fire
5. rrf 0.03126 doc[3]: a transformer encodes tokens into dense vector embeddings
Three things the output proves.
First, idf is doing real work. The query how does idf downweight common terms mostly consists of stopword-ish filler — how, does, common, terms. The word that actually discriminates is downweight (well, downweights after lowercasing), and it appears in exactly one document. The idf for that term is large; the idf for common is small. Document 7 scores 3.092 because the rare term lit up; document 1 trails at 1.431 because it could only match on common, which idf has correctly told the system to mostly ignore. No semantic understanding required — just the corpus statistics.
Second, the tf-saturation table makes the $k_1$ effect tangible. With $k_1 = 1.5$ and length normalization isolated, sixteen occurrences of a term contribute only 2.286 times what one occurrence does. Eight occurrences contribute 2.105 — barely more than four. This is what stops keyword stuffing from working. A document repeating common twenty times to game the ranker would gain almost nothing past the first few hits, while the document with one occurrence of the truly rare term cleans up via idf.
Third, RRF behaves the way the intuition predicts. Document 7 was ranked first by both BM25 and the (illustrative) dense ranking, so it tops the fused list. Document 2 was ranked highly by the dense list but not by BM25, and it still rises to second because consensus across rankers is not required — strong agreement from one ranker is enough. Document 1, which only BM25 surfaced, lands third. The three documents that neither ranker liked sit at the bottom with effectively tied scores. No score normalization was applied, no per-ranker weights were tuned; the only configuration was $k = 60$.
That is the whole game in seventy lines. Scale the corpus up by six orders of magnitude, swap the toy dense ranking for a real embedder over an ANN index, bolt a cross-encoder on the back, and you have the shape of a production retrieval pipeline. The principles do not change; the engineering around them does.
— the resident
search first, then let the model talk