When searching through documents, we need to rank them by relevance. The challenge: how do we mathematically determine which documents best match a query?
TF-IDF = Term Frequency × Inverse Document Frequency
- TF (Term Frequency): How often a word appears in this document. More occurrences → higher score.
- IDF (Inverse Document Frequency): How rare the word is across all documents. Rarer words → higher score.
The insight: A word is important when it's frequent locally but rare globally. That's why "the" gets a low score (appears everywhere) while "quantum" in a physics paper gets a high score (rare but relevant).
Why logarithm in IDF? Without it, a word appearing in 1 of 10,000 docs would get a score of 10,000—way too extreme. Log compresses these ratios and captures diminishing returns: the jump from 1→2 docs matters more than 1000→1001.
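A minimal sketch of the computation (the helper name and toy corpus below are illustrative, not from any library):

```python
import math

def tfidf(term, doc, docs):
    """TF-IDF for one term in one document; doc and docs are token lists."""
    tf = doc.count(term)                       # term frequency: local occurrences
    df = sum(1 for d in docs if term in d)     # document frequency: global spread
    if df == 0:
        return 0.0
    # log compresses the rarity ratio: 1 doc in 10,000 gives log(10000) ≈ 9.2, not 10,000
    idf = math.log(len(docs) / df)
    return tf * idf

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum entanglement in photons".split(),
]
print(tfidf("the", docs[0], docs))      # ≈ 0.81: frequent locally but common globally
print(tfidf("quantum", docs[2], docs))  # ≈ 1.10: appears once, but only here
```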
Plain TF-IDF has two problems, though:
- Linear TF: A word appearing 100 times gets 100× the score of appearing once, but is it really 100× more relevant?
- No length normalization: Long documents naturally have more word occurrences, unfairly boosting their scores.
BM25 fixes both problems with two parameters:
k1 (default: 1.5) controls term-frequency saturation: how quickly we stop caring about repeated mentions.
BM25_TF = (freq × (k1 + 1)) / (freq + k1)
As frequency → infinity, score approaches k1 + 1 (the ceiling). The pizza analogy: the first slice is amazing, the tenth adds little satisfaction.
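A quick numeric check of that ceiling (a sketch, using the formula above with the default k1):

```python
def bm25_tf(freq, k1=1.5):
    """Saturating TF component; the score can never exceed k1 + 1."""
    return freq * (k1 + 1) / (freq + k1)

for freq in (1, 2, 5, 10, 100):
    print(freq, round(bm25_tf(freq), 3))
# 1 -> 1.0, 2 -> 1.429, 5 -> 1.923, 10 -> 2.174, 100 -> 2.463 (ceiling: 2.5)
```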
| k1 Direction | Effect |
|---|---|
| Higher k1 | More patience—saturation happens slower |
| Lower k1 | Less patience—stop rewarding spam quickly |
b (default: 0.75) controls document length normalization: how much we penalize long documents.
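In the standard BM25 formulation, b enters the saturation formula by scaling k1 with the document's length relative to the corpus average:

BM25_TF = (freq × (k1 + 1)) / (freq + k1 × (1 − b + b × docLen / avgDocLen))

At b = 0, length is ignored entirely; at b = 1, frequency is fully normalized by relative length.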
| b Direction | Effect |
|---|---|
| Higher b | More tax—short focused docs rank better |
| Lower b | Less tax—long comprehensive docs can compete |
The key phrase: "Density matters, not just count." A short article that's 10% about cats is more focused than a book mentioning cats the same number of times.
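A sketch of that effect using the length-normalized formula above (the document lengths and the 1,000-token corpus average are made-up numbers):

```python
def bm25_tf_norm(freq, doc_len, avg_len, k1=1.5, b=0.75):
    """TF component with length normalization (standard BM25 form)."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_len))

# Ten mentions of "cats" in a 100-token article vs. a 10,000-token book
print(round(bm25_tf_norm(10, 100, 1_000), 3))     # ≈ 2.384: dense, short doc
print(round(bm25_tf_norm(10, 10_000, 1_000), 3))  # ≈ 1.156: same count, diluted
```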
| Problem | Knob | Direction |
|---|---|---|
| Keyword stuffing/spam | k1 | ↓ Decrease |
| Long docs always winning | b | ↑ Increase |
| Short docs always winning | b | ↓ Decrease |
| Repetition rewarded too much | k1 | ↓ Decrease |
- Start with defaults (k1=1.5, b=0.75)
- Tune empirically based on actual search quality
- No universal perfect values—it depends on your corpus
- Mixed corpus challenge: tweets and legal docs need different settings; consider segmenting
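Putting the pieces together, a self-contained sketch of a full BM25 scorer (the smoothed IDF variant below is the one Lucene and Elasticsearch use; everything else follows the formulas above):

```python
import math

def bm25_score(query_terms, doc, docs, k1=1.5, b=0.75):
    """Score one document against a query; doc and docs are token lists.
    A sketch only: real engines precompute document statistics in an index."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue  # a term absent from the corpus contributes nothing
        # Smoothed IDF: stays positive even for terms in most documents
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        freq = doc.count(term)
        tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avg_len))
        score += idf * tf
    return score
```

Tuning then amounts to re-running your evaluation queries with different (k1, b) pairs and comparing the rankings.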
| Concept | Analogy |
|---|---|
| TF saturation | Pizza slices—10th slice doesn't add much |
| Log in IDF | Skill scarcity in town: going from 1 to 2 people matters; 1000 vs 1001 doesn't |
| k1 | Patience—how long before tuning out repetition |
| b | Tax—penalty for being verbose |