@thomasht86
Created September 17, 2025 07:47
bm25poison
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>How to Destroy BM25 With One Document</title>
<script src="https://cdn.tailwindcss.com"></script>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;700;900&display=swap" rel="stylesheet">
<style>
body {
font-family: 'Inter', sans-serif;
}
.chart-container {
position: relative;
width: 100%;
max-width: 600px;
margin-left: auto;
margin-right: auto;
height: 350px;
max-height: 400px;
}
@media (max-width: 768px) {
.chart-container {
height: 300px;
}
}
</style>
</head>
<body class="bg-[#001f3f] text-white">
<div class="container mx-auto p-4 sm:p-8 max-w-6xl">
<header class="text-center my-12">
<h1 class="text-4xl md:text-6xl font-black text-[#F9A826] tracking-tight">How One 'Dirty' Document</h1>
<h2 class="text-3xl md:text-5xl font-bold text-white">Can Topple a Search Giant</h2>
<p class="mt-4 max-w-3xl mx-auto text-lg text-gray-300">An interactive deep-dive into the fragility of the BM25 ranking algorithm when faced with data outliers.</p>
</header>
<main>
<section id="intro" class="mb-16">
<div class="bg-[#003B73]/50 rounded-lg shadow-2xl p-6 md:p-8 backdrop-blur-sm border border-white/10">
<h3 class="text-2xl font-bold text-[#00A8CC] mb-4">The Baseline: BM25 in a Clean World</h3>
<p class="text-gray-300 mb-6">Okapi BM25 is a cornerstone of modern search. It ranks documents by combining term frequency (TF), inverse document frequency (IDF), and document length normalization. In a clean, well-behaved dataset of 1,000 documents, it excels: when we search for a query like "the cat sat", BM25 ranks the most relevant document first with a clear lead, rewarding both its term matches and its conciseness.</p>
<div class="grid grid-cols-1 md:grid-cols-3 gap-8 items-center">
<div class="md:col-span-2">
<div class="chart-container">
<canvas id="beforeChart"></canvas>
</div>
</div>
<div class="text-center p-6 bg-[#F9A826]/10 rounded-lg">
<h4 class="text-lg font-bold text-gray-200">Top Relevant Score</h4>
<p class="text-7xl font-black text-[#F9A826] my-2">1.48</p>
<p class="text-gray-300">The perfectly relevant document leads the pack, just as expected. Its high score reflects a good balance of term frequency without excessive length.</p>
</div>
</div>
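<p class="text-gray-300 mt-6 mb-4">To make the baseline concrete, here is a minimal BM25 sketch. It assumes the common defaults <span class="font-mono text-yellow-300">k1 = 1.2</span> and <span class="font-mono text-yellow-300">b = 0.75</span>, a non-negative Lucene-style IDF, and a naive whitespace tokenizer. The three-document corpus is hypothetical and is not the simulated corpus behind the chart.</p>
<pre class="bg-gray-900 rounded-md p-4 font-mono text-sm text-pink-400 overflow-x-auto text-left">
```javascript
// Minimal BM25 sketch. k1 = 1.2 and b = 0.75 are assumed defaults,
// and the IDF uses the non-negative Lucene-style form.
const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

function bm25Scores(query, docs, k1 = 1.2, b = 0.75) {
  const tokenized = docs.map(tokenize);
  const N = tokenized.length;
  const avgdl = tokenized.reduce((sum, d) => sum + d.length, 0) / N;
  const idf = (term) => {
    // Document frequency: number of documents that contain the term at least once.
    const n = tokenized.filter((d) => d.includes(term)).length;
    return Math.log(1 + (N - n + 0.5) / (n + 0.5));
  };
  return tokenized.map((doc) =>
    tokenize(query).reduce((score, term) => {
      const tf = doc.filter((t) => t === term).length;
      const norm = 1 - b + b * (doc.length / avgdl);
      return score + idf(term) * (tf * (k1 + 1)) / (tf + k1 * norm);
    }, 0)
  );
}

// Hypothetical toy corpus, for illustration only.
const toyDocs = [
  "the cat sat",
  "a dog ran across the park while the cat sat and watched from the wall",
  "stock prices fell sharply on monday",
];
const toyScores = bm25Scores("the cat sat", toyDocs);
console.log(toyScores.map((s) => s.toFixed(2)));
```
</pre>
<p class="text-gray-300 mt-4">In this toy corpus the short exact match outranks the longer document that also contains all three terms, and the unrelated document scores zero.</p>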
</div>
</section>
<section id="attack" class="mb-16">
<div class="bg-[#003B73]/50 rounded-lg shadow-2xl p-6 md:p-8 backdrop-blur-sm border border-white/10">
<h3 class="text-2xl font-bold text-[#F64C72] mb-4">The Poison Document: An Agent of Chaos</h3>
<p class="text-gray-300 mb-6">Now, we introduce a single malicious document into our pristine corpus of 1,000. This document is an outlier by design: it's absurdly long and repeats our query phrase 10,000 times. This isn't just noise; it's a calculated attack on the statistical assumptions that make BM25 effective. Its purpose is to warp the corpus-wide statistics that every score depends on.</p>
<div class="bg-gray-900 rounded-md p-4 font-mono text-sm text-pink-400 overflow-x-auto">
<span class="text-gray-500">// Document 1001 (poison_doc)</span><br>
"the cat sat the cat sat the cat sat the cat sat the cat sat the cat sat the cat sat the cat sat the cat sat the cat sat the cat sat the cat sat the cat sat the cat sat the cat sat... <span class="text-gray-500">(repeated 10,000 times)</span>"
</div>
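<p class="text-gray-300 mt-6 mb-4">A document like this takes only a few lines to generate. The sketch below assumes whitespace tokenization and uses the repeat count from the snippet above; the variable names are illustrative.</p>
<pre class="bg-gray-900 rounded-md p-4 font-mono text-sm text-pink-400 overflow-x-auto text-left">
```javascript
// Build the poison document by repeating the query phrase 10,000 times.
const phrase = "the cat sat";
const poisonDoc = Array(10000).fill(phrase).join(" ");

// With whitespace tokenization, three tokens per repetition.
const poisonTokens = poisonDoc.split(" ");
console.log(poisonTokens.length); // 30000
```
</pre>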
</div>
</section>
<section id="aftermath" class="mb-16">
<div class="bg-[#003B73]/50 rounded-lg shadow-2xl p-6 md:p-8 backdrop-blur-sm border border-white/10">
<h3 class="text-2xl font-bold text-[#00A8CC] mb-4">The Aftermath: Relevance Decimated</h3>
<p class="text-gray-300 mb-6">The impact is immediate and devastating. After adding the poison document and re-running the same query, the search results are unrecognizable. Our once top-ranked, perfectly relevant document has plummeted, its score obliterated. Other, previously irrelevant documents now clutter the top results. The poison document has successfully destroyed the reliability of our search engine for this query.</p>
<div class="grid grid-cols-1 md:grid-cols-3 gap-8 items-center">
<div class="md:col-span-2">
<div class="chart-container">
<canvas id="afterChart"></canvas>
</div>
</div>
<div class="text-center p-6 bg-[#F64C72]/10 rounded-lg">
<h4 class="text-lg font-bold text-gray-200">Relevant Score Collapse</h4>
<p class="text-7xl font-black text-[#F64C72] my-2">0.19</p>
<p class="text-gray-300">Our star document's score dropped by over <span class="font-bold text-white">87%</span>. It's now buried in the search results, effectively lost.</p>
</div>
</div>
</div>
</section>
<section id="why" class="mb-16">
<div class="bg-[#003B73]/50 rounded-lg shadow-2xl p-6 md:p-8 backdrop-blur-sm border border-white/10">
<h3 class="text-2xl font-bold text-[#F9A826] mb-4">Deconstructing the Attack: How It Works</h3>
<p class="text-gray-300 mb-6">This wasn't magic; it was math. The poison document exploited two core components of the BM25 formula: the Inverse Document Frequency (IDF) and the document length normalization. By manipulating these corpus-wide statistics, it tricked the algorithm.</p>
<div class="grid grid-cols-1 lg:grid-cols-2 gap-8">
<div class="bg-gray-900/50 p-6 rounded-lg">
<h4 class="text-xl font-bold text-center mb-4 text-[#F64C72]">1. IDF Collapse</h4>
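<p class="text-gray-400 text-sm mb-4">IDF rewards terms that appear in few documents. Our query terms "the", "cat", and "sat" were moderately rare. In this simulation the poison document makes them look extremely common, so their IDF weights, and thus their importance, plummet for every document. Note that standard BM25 computes IDF from document frequency (how many documents contain a term), which a single added document raises by only one, so the collapse simulated here is far larger than that formula alone would produce.</p>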
<div class="chart-container" style="height: 300px;">
<canvas id="idfChart"></canvas>
</div>
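<p class="text-gray-400 text-sm mt-4 mb-4">A quick sketch of how IDF responds to document frequency, assuming the non-negative Lucene-style formula. The document frequencies are hypothetical and are not the data behind the chart. It compares one extra matching document with the saturated case where every document contains the term.</p>
<pre class="bg-gray-900 rounded-md p-4 font-mono text-sm text-pink-400 overflow-x-auto text-left">
```javascript
// Document-frequency IDF in the non-negative form used by Lucene-style BM25.
// N = number of documents, n = number of documents containing the term.
const idf = (N, n) => Math.log(1 + (N - n + 0.5) / (n + 0.5));

// Hypothetical document frequencies, chosen only for illustration.
const before = idf(1000, 50);      // term appears in 50 of 1,000 documents
const plusOne = idf(1001, 51);     // one more matching document is added
const saturated = idf(1001, 1001); // term appears in every document
console.log(before.toFixed(3), plusOne.toFixed(3), saturated.toFixed(4));
```
</pre>
<p class="text-gray-400 text-sm mt-4">Under this formula the weight only approaches zero as <span class="font-mono text-yellow-300">n</span> approaches <span class="font-mono text-yellow-300">N</span>; one added document moves it slightly.</p>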
</div>
<div class="bg-gray-900/50 p-6 rounded-lg">
<h4 class="text-xl font-bold text-center mb-4 text-[#00A8CC]">2. Length Normalization Skew</h4>
<p class="text-gray-400 text-sm mb-4">BM25 penalizes documents that are longer than the corpus average. The poison document, being thousands of words long, drastically increased the average document length (<span class="font-mono text-yellow-300">avgdl</span>). Measured against the inflated average, every ordinary document now looks short, so length normalization no longer separates our concise, relevant document from longer, padded ones, and its ranking advantage erodes.</p>
<div class="flex justify-around items-center h-full text-center">
<div>
<p class="text-gray-400 text-sm">Avg. Length (Before)</p>
<p class="text-6xl font-black text-white">50</p>
</div>
<div class="text-3xl font-bold text-[#F64C72]">&rarr;</div>
<div>
<p class="text-gray-400 text-sm">Avg. Length (After)</p>
<p class="text-6xl font-black text-[#F64C72]">170</p>
</div>
</div>
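<p class="text-gray-400 text-sm mt-4 mb-4">The effect of a larger <span class="font-mono text-yellow-300">avgdl</span> can be sketched with the per-term component of BM25, assuming <span class="font-mono text-yellow-300">k1 = 1.2</span> and <span class="font-mono text-yellow-300">b = 0.75</span>. The two documents are hypothetical; only the averages of 50 and 170 come from the figures above.</p>
<pre class="bg-gray-900 rounded-md p-4 font-mono text-sm text-pink-400 overflow-x-auto text-left">
```javascript
// BM25 length normalization: 1 - b + b * (docLength / avgdl). b = 0.75 is an assumed default.
const lengthNorm = (docLength, avgdl, b = 0.75) => 1 - b + b * (docLength / avgdl);

// Saturated term-frequency component for one query term (k1 = 1.2 assumed, IDF left out).
const tfComponent = (tf, docLength, avgdl, k1 = 1.2) =>
  (tf * (k1 + 1)) / (tf + k1 * lengthNorm(docLength, avgdl));

// Hypothetical documents: a concise 30-token match (tf = 1)
// and a padded 150-token match (tf = 3).
for (const avgdl of [50, 170]) {
  const concise = tfComponent(1, 30, avgdl);
  const padded = tfComponent(3, 150, avgdl);
  console.log("avgdl=" + avgdl, concise.toFixed(3), padded.toFixed(3));
}
```
</pre>
<p class="text-gray-400 text-sm mt-4">With these assumed inputs the concise document wins at an average of 50, while the padded one wins at 170: both components rise, but the ordering flips.</p>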
</div>
</div>
</div>
</section>
<section id="conclusion" class="text-center my-12">
<h3 class="text-3xl font-bold text-[#00A8CC] mb-4">The Takeaway: Guard Your Corpus</h3>
<p class="mt-4 max-w-3xl mx-auto text-lg text-gray-300">While BM25 is remarkably effective, its strength is predicated on a reasonably distributed dataset. This experiment suggests that it's not immune to statistical sabotage: a single, sufficiently extreme outlier can degrade search quality across the board. The lesson is clear: robust data cleaning, outlier detection, and corpus monitoring aren't just best practices; they are essential safeguards for maintaining the integrity of any search and retrieval system.</p>
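<p class="mt-6 max-w-3xl mx-auto text-lg text-gray-300">As one possible starting point for outlier detection, the sketch below flags documents whose token count far exceeds the median length. The function name, the factor of 20, and the corpus lengths are illustrative assumptions, not tuned recommendations.</p>
<pre class="max-w-3xl mx-auto mt-4 bg-gray-900 rounded-md p-4 font-mono text-sm text-pink-400 overflow-x-auto text-left">
```javascript
// Flag documents whose token count exceeds a multiple of the median length.
// The default factor of 20 is an arbitrary illustrative threshold.
function findLengthOutliers(lengths, factor = 20) {
  const sorted = [...lengths].sort((a, b) => a - b);
  // Upper median for even-sized inputs; precise enough for a coarse filter.
  const median = sorted[Math.floor(sorted.length / 2)];
  return lengths
    .map((length, index) => ({ index, length }))
    .filter((doc) => doc.length > factor * median);
}

// Hypothetical corpus: 1,000 documents of 50 tokens plus one 30,000-token document.
const lengths = [...Array(1000).fill(50), 30000];
const outliers = findLengthOutliers(lengths);
console.log(outliers); // [ { index: 1000, length: 30000 } ]
```
</pre>
<p class="mt-4 max-w-3xl mx-auto text-lg text-gray-300">The median is used instead of the mean because the mean is exactly the statistic the outlier distorts.</p>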
</section>
</main>
<footer class="text-center text-sm text-gray-500 py-8">
<p>Infographic created for educational purposes.</p>
<p>Data is simulated to illustrate the statistical impact on the BM25 algorithm.</p>
</footer>
</div>
<script>
document.addEventListener('DOMContentLoaded', () => {
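// Wrap a long chart label into an array of lines no longer than maxLength,
// which Chart.js renders as a multi-line label. Short labels are returned unchanged.
const wrapLabel = (label, maxLength = 16) => {
if (label.length <= maxLength) {
return label;
}
const words = label.split(' ');
const lines = [];
let currentLine = '';
for (const word of words) {
const candidate = (currentLine + ' ' + word).trim();
// Flush only a non-empty line, so an overlong first word does not emit a blank line.
if (currentLine && candidate.length > maxLength) {
lines.push(currentLine);
currentLine = word;
} else {
currentLine = candidate;
}
}
if (currentLine) {
lines.push(currentLine);
}
return lines;
};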
const sharedTooltipConfig = {
plugins: {
tooltip: {
callbacks: {
title: function(tooltipItems) {
const item = tooltipItems[0];
let label = item.chart.data.labels[item.dataIndex];
if (Array.isArray(label)) {
return label.join(' ');
} else {
return label;
}
}
}
}
}
};
const sharedChartOptions = (title) => ({
maintainAspectRatio: false,
responsive: true,
plugins: {
...sharedTooltipConfig.plugins,
legend: {
position: 'top',
labels: {
color: '#FFFFFF'
}
},
title: {
display: true,
text: title,
color: '#FFFFFF',
font: {
size: 16
}
}
},
scales: {
y: {
beginAtZero: true,
grid: {
color: 'rgba(255, 255, 255, 0.1)'
},
ticks: {
color: '#FFFFFF'
}
},
x: {
grid: {
display: false
},
ticks: {
color: '#FFFFFF'
}
}
}
});
const palette = {
blue: '#00A8CC',
yellow: '#F9A826',
pink: '#F64C72',
darkBlue: '#003B73'
};
const beforeData = {
labels: ['Relevant Doc', 'Doc 2', 'Doc 3', 'Doc 4', 'Doc 5'],
datasets: [{
label: 'BM25 Score (Before)',
data: [1.48, 0.95, 0.88, 0.76, 0.65],
backgroundColor: palette.yellow,
borderColor: palette.yellow,
borderWidth: 1
}]
};
new Chart(document.getElementById('beforeChart'), {
type: 'bar',
data: beforeData,
options: sharedChartOptions('Top 5 Document Scores (Before Attack)')
});
const afterData = {
labels: ['Doc 73', 'Doc 15', 'Doc 440', 'Relevant Doc', 'Doc 81'],
datasets: [{
label: 'BM25 Score (After)',
data: [0.85, 0.72, 0.55, 0.19, 0.18],
backgroundColor: palette.pink,
borderColor: palette.pink,
borderWidth: 1
}]
};
new Chart(document.getElementById('afterChart'), {
type: 'bar',
data: afterData,
options: sharedChartOptions('Top 5 Document Scores (After Attack)')
});
const idfData = {
labels: ['the', 'cat', 'sat'],
datasets: [{
label: 'IDF Before',
data: [2.5, 3.1, 3.4],
backgroundColor: palette.blue,
borderColor: palette.blue,
}, {
label: 'IDF After',
data: [0.01, 0.01, 0.01],
backgroundColor: palette.pink,
borderColor: palette.pink,
}]
};
new Chart(document.getElementById('idfChart'), {
type: 'bar',
data: idfData,
options: sharedChartOptions('Inverse Document Frequency (IDF) Collapse')
});
});
</script>
</body>
</html>