When searching through documents, we need to rank them by relevance. The challenge: how do we mathematically determine which documents best match a query?
TF-IDF = Term Frequency × Inverse Document Frequency
- TF (Term Frequency): How often a word appears in this document. More occurrences → higher score.
- IDF (Inverse Document Frequency): How rare the word is across all documents. Rarer words → higher score.
The insight: A word is important when it's frequent locally but rare globally. That's why "the" gets a low score (appears everywhere) while "quantum" in a physics paper gets a high score (rare but relevant).
Why logarithm in IDF? Without it, a word appearing in 1 of 10,000 docs would get a score of 10,000—way too extreme. Log compresses these ratios and captures diminishing returns: the jump from 1→2 docs matters more than 1000→1001.
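A minimal sketch of the computation (the helper name and toy corpus below are illustrative, not from any library):

```python
import math

def tfidf(term, doc, docs):
    """TF-IDF for one term in one document; doc and docs are token lists."""
    tf = doc.count(term)                       # term frequency: local occurrences
    df = sum(1 for d in docs if term in d)     # document frequency: global spread
    if df == 0:
        return 0.0
    # log compresses the rarity ratio: 1 doc in 10,000 gives log(10000) ≈ 9.2, not 10,000
    idf = math.log(len(docs) / df)
    return tf * idf

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum entanglement in photons".split(),
]
print(tfidf("the", docs[0], docs))      # ≈ 0.81: frequent locally but common globally
print(tfidf("quantum", docs[2], docs))  # ≈ 1.10: appears once, but only here
```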
Plain TF-IDF has two problems, though:
- Linear TF: A word appearing 100 times gets 100× the score of appearing once, but is it really 100× more relevant?
- No length normalization: Long documents naturally have more word occurrences, unfairly boosting their scores.
BM25 fixes both problems with two parameters:
k1 (default: 1.5) controls term-frequency saturation: how quickly we stop caring about repeated mentions.
BM25_TF = (freq × (k1 + 1)) / (freq + k1)
As frequency → infinity, score approaches k1 + 1 (the ceiling). The pizza analogy: the first slice is amazing, the tenth adds little satisfaction.
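A quick numeric check of that ceiling (a sketch, using the formula above with the default k1):

```python
def bm25_tf(freq, k1=1.5):
    """Saturating TF component; the score can never exceed k1 + 1."""
    return freq * (k1 + 1) / (freq + k1)

for freq in (1, 2, 5, 10, 100):
    print(freq, round(bm25_tf(freq), 3))
# 1 -> 1.0, 2 -> 1.429, 5 -> 1.923, 10 -> 2.174, 100 -> 2.463 (ceiling: 2.5)
```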
| k1 Direction | Effect |
|---|---|
| Higher k1 | More patience—saturation happens slower |
| Lower k1 | Less patience—stop rewarding spam quickly |
b (default: 0.75) controls document length normalization: how much we penalize long documents.
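In the standard BM25 formulation, b enters the saturation formula by scaling k1 with the document's length relative to the corpus average:

BM25_TF = (freq × (k1 + 1)) / (freq + k1 × (1 − b + b × docLen / avgDocLen))

At b = 0, length is ignored entirely; at b = 1, frequency is fully normalized by relative length.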
| b Direction | Effect |
|---|---|
| Higher b | More tax—short focused docs rank better |
| Lower b | Less tax—long comprehensive docs can compete |
The key phrase: "Density matters, not just count." A short article that's 10% about cats is more focused than a book mentioning cats the same number of times.
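A sketch of that effect using the length-normalized formula above (the document lengths and the 1,000-token corpus average are made-up numbers):

```python
def bm25_tf_norm(freq, doc_len, avg_len, k1=1.5, b=0.75):
    """TF component with length normalization (standard BM25 form)."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_len))

# Ten mentions of "cats" in a 100-token article vs. a 10,000-token book
print(round(bm25_tf_norm(10, 100, 1_000), 3))     # ≈ 2.384: dense, short doc
print(round(bm25_tf_norm(10, 10_000, 1_000), 3))  # ≈ 1.156: same count, diluted
```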
| Problem | Knob | Direction |
|---|---|---|
| Keyword stuffing/spam | k1 | ↓ Decrease |
| Long docs always winning | b | ↑ Increase |
| Short docs always winning | b | ↓ Decrease |
| Repetition rewarded too much | k1 | ↓ Decrease |
- Start with defaults (k1=1.5, b=0.75)
- Tune empirically based on actual search quality
- No universal perfect values—it depends on your corpus
- Mixed corpus challenge: tweets and legal docs need different settings; consider segmenting
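Putting the pieces together, a self-contained sketch of a full BM25 scorer (the smoothed IDF variant below is the one Lucene and Elasticsearch use; everything else follows the formulas above):

```python
import math

def bm25_score(query_terms, doc, docs, k1=1.5, b=0.75):
    """Score one document against a query; doc and docs are token lists.
    A sketch only: real engines precompute document statistics in an index."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue  # a term absent from the corpus contributes nothing
        # Smoothed IDF: stays positive even for terms in most documents
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        freq = doc.count(term)
        tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avg_len))
        score += idf * tf
    return score
```

Tuning then amounts to re-running your evaluation queries with different (k1, b) pairs and comparing the rankings.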
| Concept | Analogy |
|---|---|
| TF saturation | Pizza slices—10th slice doesn't add much |
| Log in IDF | Skill scarcity in town: going from 1 to 2 people matters; 1000 vs 1001 doesn't |
| k1 | Patience—how long before tuning out repetition |
| b | Tax—penalty for being verbose |