BM25-TFIDF Practical Approach

BM25 & TF-IDF: What Future Me Needs to Remember

The Core Problem

When searching through documents, we need to rank them by relevance. The challenge: how do we mathematically determine which documents best match a query?

TF-IDF: The Foundation

TF-IDF = Term Frequency × Inverse Document Frequency

  • TF (Term Frequency): How often a word appears in this document. More occurrences → higher score.
  • IDF (Inverse Document Frequency): How rare the word is across all documents. Rarer words → higher score.

The insight: A word is important when it's frequent locally but rare globally. That's why "the" gets a low score (appears everywhere) while "quantum" in a physics paper gets a high score (rare but relevant).

Why logarithm in IDF? Without it, a word appearing in 1 of 10,000 docs would get a score of 10,000—way too extreme. Log compresses these ratios and captures diminishing returns: the jump from 1→2 docs matters more than 1000→1001.
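To make this concrete, here is a minimal TF-IDF sketch in plain Python (the toy corpus and the unsmoothed log(N/df) IDF are illustrative assumptions; libraries like scikit-learn use smoothed variants):

    import math

    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "quantum entanglement in photonic systems",
    ]
    tokenized = [d.split() for d in docs]
    N = len(tokenized)

    def idf(term):
        # Rarer across the corpus -> higher weight; log compresses the ratio.
        df = sum(1 for d in tokenized if term in d)
        return math.log(N / df) if df else 0.0

    def tf_idf(term, doc_tokens):
        # Raw (linear) term frequency times inverse document frequency.
        return doc_tokens.count(term) * idf(term)

    # "the" appears in 2 of 3 docs -> low IDF; "quantum" in 1 of 3 -> high IDF.
    for term in ("the", "quantum"):
        print(term, [round(tf_idf(term, d), 3) for d in tokenized])

Note the linear TF here: doubling the count doubles the score. That is exactly the first limitation below.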

TF-IDF's Limitations

  1. Linear TF: A word appearing 100 times gets 100× the score of appearing once—but is it really 100× more relevant?
  2. No length normalization: Long documents naturally have more word occurrences, unfairly boosting their scores.

BM25: The Improvement

BM25 fixes both problems with two parameters:

Parameter k1: "Patience with Repetition"

Default: 1.5

Controls term frequency saturation—how quickly we stop caring about repeated mentions.

BM25_TF = (freq × (k1 + 1)) / (freq + k1)

As frequency → infinity, score approaches k1 + 1 (the ceiling). The pizza analogy: the first slice is amazing, the tenth adds little satisfaction.

k1 Direction   Effect
Higher k1      More patience: saturation happens slower
Lower k1       Less patience: stops rewarding spam quickly
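A quick sanity check of the saturation curve in plain Python (the frequency values are arbitrary examples):

    def bm25_tf(freq, k1):
        # Saturating TF component: approaches the ceiling k1 + 1 as freq grows.
        return freq * (k1 + 1) / (freq + k1)

    for k1 in (0.5, 1.5):
        print(f"k1={k1}:", [round(bm25_tf(f, k1), 2) for f in (1, 2, 5, 10, 100)])
    # k1=0.5 flattens almost immediately (less patience);
    # k1=1.5 keeps rewarding repeats a little longer (more patience).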

Parameter b: "Tax on Length"

Default: 0.75

Controls document length normalization—how much we penalize long documents.

b Direction   Effect
Higher b      More tax: short, focused docs rank better
Lower b       Less tax: long, comprehensive docs can compete

The key phrase: "Density matters, not just count." A short article that's 10% about cats is more focused than a book mentioning cats the same number of times.
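Putting k1 and b together, here is a sketch of the per-term BM25 weight with length normalization (a simplified Okapi BM25; the corpus statistics below are made up, and the plain log(N/df) IDF is an assumption, since real implementations use smoothed variants):

    import math

    def bm25_weight(freq, doc_len, avg_len, n_docs, df, k1=1.5, b=0.75):
        # b scales the length "tax": longer-than-average docs get discounted.
        norm = 1 - b + b * (doc_len / avg_len)
        tf_part = freq * (k1 + 1) / (freq + k1 * norm)
        # Simplified IDF: rarer terms get higher weight.
        return tf_part * math.log(n_docs / df)

    # Same raw count of "cat" (5 mentions) in a 100-word article vs a
    # 10,000-word book, with an average document length of 1,000 words.
    print(bm25_weight(5, 100, 1000, n_docs=1000, df=50))     # short, dense doc
    print(bm25_weight(5, 10_000, 1000, n_docs=1000, df=50))  # long doc, same count

The short document scores roughly three times higher despite the identical raw count: density wins.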


Tuning Cheat Sheet

Problem                        Knob   Direction
Keyword stuffing / spam        k1     ↓ Decrease
Long docs always winning       b      ↑ Increase
Short docs always winning      b      ↓ Decrease
Repetition rewarded too much   k1     ↓ Decrease

Practical Wisdom

  1. Start with the defaults (k1 = 1.5, b = 0.75)
  2. Tune empirically against measured search quality, not intuition (see the sketch below)
  3. There are no universally perfect values; the right settings depend on your corpus
  4. Mixed corpus challenge: tweets and legal docs need different settings, so consider segmenting by document type
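If you want to experiment, the rank_bm25 package exposes k1 and b directly (a minimal sketch; the toy corpus and query are made up, and in practice you would score each setting against real relevance judgments rather than eyeballing output):

    from rank_bm25 import BM25Okapi  # pip install rank-bm25

    corpus = [
        "cats are great pets",
        "dogs are loyal pets",
        "a long book about animals that mentions cats once among many others",
        "quantum physics lecture notes",
        "gardening tips for spring",
        "recipes for quick weeknight dinners",
    ]
    tokenized = [doc.split() for doc in corpus]

    # Start from the defaults, then nudge one knob at a time.
    for k1, b in [(1.5, 0.75), (1.2, 0.9)]:
        bm25 = BM25Okapi(tokenized, k1=k1, b=b)
        print((k1, b), bm25.get_scores("cats".split()).round(2))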

Sticky Analogies to Remember

Concept         Analogy
TF saturation   Pizza slices: the tenth slice doesn't add much
Log in IDF      Being the only person in town with a skill matters; going from 1,000 to 1,001 barely does
k1              Patience: how long before you tune out repetition
b               Tax: the penalty for being verbose
