You know that moment when you ask your agent to check the data, and before you can tell it not to eat the ocean, it enters Compaction.
Yeah. We've all been there.
- 🤯 Huge database dumps with unpredictable line sizes
- 💸 Token budgets disappearing into massive single-line JSON blobs
- 🔍 No quick way to see WHERE the chonky lines are hiding
- ⚠️ Agents choking on files you thought were reasonable
A blazingly fast AWK script that gives you X-ray vision into your files' byte distribution.
# chmod +x ./line_histogram.awk
./line_histogram.awk huge_export.jsonl
File: huge_export.jsonl
Total bytes: 2847392
Total lines: 1000
Bucket Distribution:
Line Range | Bytes | Distribution
─────────────────┼──────────────┼──────────────────────────────────────────
1-100 | 4890 | ██
101-200 | 5234 | ██
201-300 | 5832 | ██
301-400 | 6128 | ██
401-500 | 385927 | ████████████████████████████████████████
501-600 | 5892 | ██
601-700 | 5234 | ██
701-800 | 6891 | ██
801-900 | 5328 | ██
901-1000 | 4982 | ██
─────────────────┼──────────────┼──────────────────────────────────────────
Boom. Line 450 is the chonky one: the 401-500 bucket is hauling roughly 70 times the bytes of its neighbours. Skip that line and you claw back the better part of 380KB in token cost.
See the byte distribution across your file in 10 neat buckets. Spot the bloat instantly.
./line_histogram.awk myfile.txt
Found a problem line? Extract it without loading the whole file into memory.
# Extract line 450 (the chonky one)
./line_histogram.awk -v mode=extract -v line=450 huge_export.jsonl
# Extract lines 100-200 for inspection
./line_histogram.awk -v mode=extract -v start=100 -v end=200 data.jsonl
Yes, those -v bits look odd, but they're needed: that's how awk takes variable assignments. Who knew! (Hint: the AI)
If your system has AWK (it does), you're good to go. No npm install, no pip install, no Docker containers. Just pure, unadulterated shell goodness.
Processes multi-GB files in seconds. AWK was built for this.
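Don't take my word for it; time it against your own worst offender (no benchmark numbers promised, every machine is different):
time ./line_histogram.awk your_biggest_export.jsonl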
- Profile before you prompt: Know if that export file is safe to feed your agent
- Smart sampling: Extract representative line ranges instead of the whole file
- Debug token explosions: "Why did my context window fill up?" → histogram shows a 500KB line
- Spot malformed CSVs: One line with 10,000 columns? Histogram shows it
- Log file analysis: Find the log entries that are suspiciously huge
- Database export QA: Verify export structure before importing elsewhere
- Config file sanity checks: Spot embedded certificates or secrets bloating configs
- Debug log truncation: See which lines are hitting your logger's size limits
- Kafka message profiling: Histogram message sizes before they hit your pipeline
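For the "profile before you prompt" workflow, a pre-flight check might look something like this (the file name and line 450 are made up for illustration):
# 1. Where is the bloat?
./line_histogram.awk export.jsonl
# 2. Pull the suspicious range out for a human look.
./line_histogram.awk -v mode=extract -v start=440 -v end=460 export.jsonl | less
# 3. Hand the agent everything except the monster line.
awk 'NR != 450' export.jsonl > export.trimmed.jsonl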
line_histogram.awk — profile files by line size distribution or extract specific lines
./line_histogram.awk [options] <file>
-v mode=histogram (default) Show byte distribution across 10 buckets
-v mode=extract Extract specific line(s)
-v line=N Extract single line N (requires mode=extract)
-v start=X -v end=Y Extract lines X through Y (requires mode=extract)
-v outfile=FILE Write output to FILE instead of stdout
Divides the file into 10 equal-sized buckets by line number and shows the byte distribution:
- Bucket 1: Lines 1-10% → X bytes
- Bucket 2: Lines 11-20% → Y bytes
- ...and so on
The visual histogram uses █ blocks scaled to the bucket with the most bytes.
Special cases:
- Files ≤10 lines: Each line gets its own bucket
- Remainder lines: Absorbed into bucket 10
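If you're curious how the bucketing could work, here's a rough sketch of the idea as a throwaway awk one-off. It is not the real script, and it assumes length() counts bytes (true in the C locale): remember each line's size, split lines into 10 buckets, scale the bars to the fattest bucket.
awk '
{
    len[NR] = length($0) + 1          # +1 for the newline awk strips off
}
END {
    if (NR == 0) exit
    buckets = (NR < 10) ? NR : 10     # files with <=10 lines: one bucket per line
    per = int(NR / buckets)           # lines per bucket; leftovers land in the last bucket
    for (i = 1; i <= NR; i++) {
        b = int((i - 1) / per) + 1
        if (b > buckets) b = buckets  # remainder absorbed into bucket 10
        bytes[b] += len[i]
    }
    for (b = 1; b <= buckets; b++) if (bytes[b] > max) max = bytes[b]
    for (b = 1; b <= buckets; b++) {
        bar = ""
        n = (max > 0) ? int(40 * bytes[b] / max) : 0   # bars scaled to the fattest bucket
        for (j = 0; j < n; j++) bar = bar "█"
        printf "bucket %2d | %10d | %s\n", b, bytes[b], bar
    }
}' yourfile.txt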
Pull specific lines without loading the entire file into your editor:
# Single line
./line_histogram.awk -v mode=extract -v line=42 file.txt
# Range
./line_histogram.awk -v mode=extract -v start=100 -v end=200 file.txt
Exit codes:
- 0: Success
- 1: Error (invalid line number, bad range, missing parameters)
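The extract trick, in miniature (assumed behaviour, not the script itself): print only the wanted lines, then bail out so the rest of the file is never read.
awk -v start=100 -v end=200 'NR >= start && NR <= end { print } NR > end { exit }' file.txt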
Example 1: Quick file profile
./line_histogram.awk database_dump.jsonl
Example 2: Extract suspicious line for inspection
./line_histogram.awk -v mode=extract -v line=523 data.csv > suspicious_line.txt
Example 3: Sample middle section of large file
./line_histogram.awk -v mode=extract -v start=5000 -v end=5100 bigfile.log | less
Example 4: Save histogram to file
./line_histogram.awk -v outfile=analysis.txt huge_file.jsonl
Not sure if it works? We've got you covered with a visual test suite:
# Generate test patterns
./generate_test_files.sh
# Run all tests
./run_tests.sh
The test suite generates files with known patterns:
- 📈 Triangle up/down: Ascending/descending line sizes
- 📦 Square: Uniform line lengths
- 🌙 Semicircle: sqrt curve distribution
- 🔔 Bell curve: Gaussian distribution
- 📍 Spike: One massive line in a sea of tiny ones
- 🎯 Edge cases: Empty files, single lines, exactly 10 lines
Watch the histograms match the patterns. It's oddly satisfying.
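Want to roll a pattern by hand? Here's a hypothetical stand-in for the spike case (not what generate_test_files.sh actually does): 999 tiny lines and one ~100KB monster at line 500.
awk 'BEGIN {
    for (i = 1; i <= 1000; i++) {
        if (i == 500) { s = ""; while (length(s) < 100000) s = s "xxxxxxxxxx"; print s }
        else print "tiny line " i
    }
}' > spike_test.txt
./line_histogram.awk spike_test.txt   # bucket 5 should dwarf the rest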
# Clone or download
curl -O https://gist.github.com/... # (your gist URL here)
# Make executable
chmod +x line_histogram.awk
# Optional: Add to PATH
cp line_histogram.awk ~/bin/line_histogram.awk
Or just run it directly:
awk -f line_histogram.awk yourfile.txt
Born from frustration with AI agents eating context windows on mystery files. Sometimes you just need to know: "Is this file safe to feed my agent, or will line 847 consume my entire token budget?" Because that is obviously how you think and act. You are definitely not a data engineer up at 2am, typing in SCREAMING ALL CAPS because your LLM got into a crash loop trying to evaluate a JSONL extract that didn't fit into context. That is not you, no. Me neither.
MIT or Public Domain. Use it, abuse it, put it in production, whatever. No warranty implied—if it deletes your files, that's on you (though it only reads, so you're probably fine).
It's AWK. If you can make it better, you're a wizard. PRs welcome, though you'd have to set up a repo first and I can't be bothered; just fork the gist and be done with it.
Made with ❤️ and frustration by someone who spent too many tokens on line 523 of a JSONL file.
Now go profile your files like a pro. 📊✨