@simbo1905
Last active January 14, 2026 15:57
Line Histogram - Profile files by line size distribution. Perfect for AI agents, data engineers, and anyone who needs to know if line 847 will eat their entire context window.

📊 Line Histogram — The File Profiler You Didn't Know You Needed

What If You Could See Your Data Before It Floods Your Context Window?

You know that moment: you ask your agent to check the data, and before you can tell it not to eat the ocean, it enters Compaction.

Yeah. We've all been there.

The Problem

  • 🤯 Huge database dumps with unpredictable line sizes
  • 💸 Token budgets disappearing into massive single-line JSON blobs
  • 🔍 No quick way to see WHERE the chonky lines are hiding
  • ⚠️ Agents choking on files you thought were reasonable

The Solution: line_histogram.awk

A blazingly fast AWK script that gives you X-ray vision into your files' byte distribution.

# chmod +x ./line_histogram.awk
./line_histogram.awk huge_export.jsonl
File: huge_export.jsonl
Total bytes: 2847392
Total lines: 1000

Bucket Distribution:

Line Range      | Bytes        | Distribution
─────────────────┼──────────────┼──────────────────────────────────────────
1-100           |         4890 | ██
101-200         |         5234 | ██
201-300         |         5832 | ██
301-400         |         6128 | ██
401-500         |       385927 | ████████████████████████████████████████
501-600         |         5892 | ██
601-700         |         5234 | ██
701-800         |         6891 | ██
801-900         |         5328 | ██
901-1000        |         4982 | ██
─────────────────┼──────────────┼──────────────────────────────────────────

Boom. Line 450 is eating 99% of your file. Skip it and save yourself 2.8MB of token cost.


🚀 Features That Actually Matter

1. Histogram Mode (Default)

See the byte distribution across your file in 10 neat buckets. Spot the bloat instantly.

./line_histogram.awk myfile.txt

2. Surgical Line Extraction

Found a problem line? Pull it out without opening the whole file in your editor.

# Extract line 450 (the chonky one)
./line_histogram.awk -v mode=extract -v line=450 huge_export.jsonl

# Extract lines 100-200 for inspection
./line_histogram.awk -v mode=extract -v start=100 -v end=200 data.jsonl

Yes yes, those -v bits look odd, but yes yes, they are needed: that's how AWK takes variable assignments on the command line, who knew! (Hint: the AI.)
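For the curious, `-v name=value` is standard AWK: it assigns a variable before the program runs, which is the portable way to pass parameters into a script. A quick sketch (illustrative values, not tied to the gist):

```shell
# -v assigns an AWK variable before the program starts executing,
# so it is already set inside the BEGIN block.
awk -v mode=extract -v line=450 'BEGIN { print "mode=" mode ", line=" line }'
# Output: mode=extract, line=450
```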

3. Zero Dependencies

If your system has AWK (it does), you're good to go. No npm install, no pip install, no Docker containers. Just pure, unadulterated shell goodness.

4. Stupid Fast

Processes multi-GB files in seconds. AWK was built for this.


💡 Use Cases That'll Make You Look Like a Genius

For AI Agent Wranglers

  • Profile before you prompt: Know if that export file is safe to feed your agent
  • Smart sampling: Extract representative line ranges instead of the whole file
  • Debug token explosions: "Why did my context window fill up?" → histogram shows a 500KB line

For Data Engineers

  • Spot malformed CSVs: One line with 10,000 columns? Histogram shows it
  • Log file analysis: Find the log entries that are suspiciously huge
  • Database export QA: Verify export structure before importing elsewhere

For DevOps/SRE

  • Config file sanity checks: Spot embedded certificates or secrets bloating configs
  • Debug log truncation: See which lines are hitting your logger's size limits
  • Kafka message profiling: Histogram message sizes before they hit your pipeline

📖 Pseudo Man Page (The Details)

NAME

line_histogram.awk — profile files by line size distribution or extract specific lines

SYNOPSIS

./line_histogram.awk [options] <file>

OPTIONS

-v mode=histogram     (default) Show byte distribution across 10 buckets
-v mode=extract       Extract specific line(s)
-v line=N             Extract single line N (requires mode=extract)
-v start=X -v end=Y   Extract lines X through Y (requires mode=extract)
-v outfile=FILE       Write output to FILE instead of stdout

MODES

Histogram Mode (Default)

Divides the file into 10 equal-sized buckets by line number and shows the byte distribution:

  • Bucket 1: Lines 1-10% → X bytes
  • Bucket 2: Lines 11-20% → Y bytes
  • ...and so on

The visual histogram uses █ blocks scaled to the bucket with the most bytes.
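The scaling is plain proportional math: each bar is the bucket's bytes divided by the biggest bucket's bytes, times 40, truncated to an integer. A sketch of the formula (illustrative numbers, mirroring the calculation in the script):

```shell
# Bar width = int(bucket_bytes / max_bytes * 40).
awk 'BEGIN {
    max_bytes = 400000
    print int((400000 / max_bytes) * 40)   # dominant bucket: full 40 blocks
    print int((100000 / max_bytes) * 40)   # quarter-sized bucket: 10 blocks
    print int((5000   / max_bytes) * 40)   # tiny bucket: truncates to 0
}'
```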

Special cases:

  • Files ≤10 lines: Each line gets its own bucket
  • Remainder lines: Absorbed into bucket 10
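The remainder handling falls out of the bucket index math. A sketch of the 10-bucket case (same arithmetic as the script's END block, run on a hypothetical 95-line file):

```shell
# For a 95-line file: bucket_size = int(95/10) = 9, so the naive index
# overflows past 10 for the last few lines and gets clamped into bucket 10.
awk 'BEGIN {
    total_lines = 95
    bucket_size = int(total_lines / 10)      # 9 lines per bucket
    for (ln = 80; ln <= 95; ln += 5) {
        b = int((ln - 1) / bucket_size) + 1  # naive bucket index
        if (b > 10) b = 10                   # remainder absorbed into bucket 10
        printf "line %d -> bucket %d\n", ln, b
    }
}'
# Output: lines 85, 90, and 95 all land in bucket 10
```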

Extract Mode

Pull specific lines without loading the entire file into your editor:

# Single line
./line_histogram.awk -v mode=extract -v line=42 file.txt

# Range
./line_histogram.awk -v mode=extract -v start=100 -v end=200 file.txt

EXIT STATUS

  • 0: Success
  • 1: Error (invalid line number, bad range, missing parameters)
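AWK's `exit N` propagates to the shell, so the script composes with `&&`, `||`, and `if` in pipelines. A minimal demonstration of the mechanism (a standalone sketch, not the script itself):

```shell
# exit 1 inside an AWK program becomes the process exit status.
awk 'BEGIN { print "Error: line out of range" > "/dev/stderr"; exit 1 }'
echo "exit status: $?"
# Prints: exit status: 1
```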

EXAMPLES

Example 1: Quick file profile

./line_histogram.awk database_dump.jsonl

Example 2: Extract suspicious line for inspection

./line_histogram.awk -v mode=extract -v line=523 data.csv > suspicious_line.txt

Example 3: Sample middle section of large file

./line_histogram.awk -v mode=extract -v start=5000 -v end=5100 bigfile.log | less

Example 4: Save histogram to file

./line_histogram.awk -v outfile=analysis.txt huge_file.jsonl
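If all you want is the single worst offender rather than the full distribution, a complementary AWK one-liner does the trick (an ad-hoc sketch, not part of the gist; the sample file path is made up):

```shell
# Build a sample file with one 500-byte line among short ones,
# then report the longest line's number and length.
printf 'short\n%s\nshort\n' "$(head -c 500 /dev/zero | tr '\0' 'X')" > /tmp/sample.txt
awk '{ if (length($0) > max) { max = length($0); ln = NR } }
     END { printf "longest: line %d (%d bytes)\n", ln, max }' /tmp/sample.txt
# Output: longest: line 2 (500 bytes)
```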

🧪 Testing Suite Included

Not sure if it works? We've got you covered with a visual test suite:

# Generate test patterns
./generate_test_files.sh

# Run all tests
./run_tests.sh

The test suite generates files with known patterns:

  • 📈 Triangle up/down: Ascending/descending line sizes
  • 📦 Square: Uniform line lengths
  • 🌙 Semicircle: sqrt curve distribution
  • 🔔 Bell curve: Gaussian distribution
  • 📍 Spike: One massive line in a sea of tiny ones
  • 🎯 Edge cases: Empty files, single lines, exactly 10 lines

Watch the histograms match the patterns. It's oddly satisfying.


⚡ Installation

# Clone or download
curl -O https://gist.github.com/...  # (your gist URL here)

# Make executable
chmod +x line_histogram.awk

# Optional: Add to PATH
cp line_histogram.awk ~/bin/line_histogram.awk

Or just run it directly:

awk -f line_histogram.awk yourfile.txt

🎯 Why This Exists

Born from frustration with AI agents eating context windows on mystery files. Sometimes you just need to know: "Is this file safe to feed my agent, or will line 847 consume my entire token budget?" Because that is obviously how you think and act. You are not a data engineer up at 2am typing in SCREAMING ALL CAPS because your LLM got into a crash loop trying to evaluate a JSONL extract that didn't fit into context. That is definitely not you, no. Me neither.


📜 License

MIT or Public Domain. Use it, abuse it, put it in production, whatever. No warranty implied—if it deletes your files, that's on you (though it only reads, so you're probably fine).


🤝 Contributing

It's AWK. If you can make it better, you're a wizard. PRs welcome, although you would need to set up a repo and I can't be bothered; just fork the gist and be done with it.


Made with ❤️ and frustration by someone who spent too many tokens on line 523 of a JSONL file.

Now go profile your files like a pro. 📊✨

#!/bin/bash
# generate_test_files.sh - Generate ASCII art test patterns for histogram testing
#
# Creates files where line lengths follow specific patterns so the histogram
# visually matches the pattern shape.
TESTDIR="test_files"
mkdir -p "$TESTDIR"
NUM_LINES=100 # Gives us 10 lines per bucket nicely
echo "Generating test files in $TESTDIR/"
# ─────────────────────────────────────────────────────────────────
# TRIANGLE UP: Lines get progressively longer
# Should show histogram ramping up left to right
# ─────────────────────────────────────────────────────────────────
echo " triangle_up.txt - ascending line lengths"
: > "$TESTDIR/triangle_up.txt"
for i in $(seq 1 $NUM_LINES); do
    # Line length = i (1 to 100 chars)
    printf '%*s\n' "$i" '' | tr ' ' 'X' >> "$TESTDIR/triangle_up.txt"
done
# ─────────────────────────────────────────────────────────────────
# TRIANGLE DOWN: Lines get progressively shorter
# Should show histogram ramping down left to right
# ─────────────────────────────────────────────────────────────────
echo " triangle_down.txt - descending line lengths"
: > "$TESTDIR/triangle_down.txt"
for i in $(seq 1 $NUM_LINES); do
    # Line length = (NUM_LINES - i + 1)
    len=$((NUM_LINES - i + 1))
    printf '%*s\n' "$len" '' | tr ' ' 'X' >> "$TESTDIR/triangle_down.txt"
done
# ─────────────────────────────────────────────────────────────────
# SQUARE: All lines same length
# Should show flat histogram (all buckets equal)
# ─────────────────────────────────────────────────────────────────
echo " square.txt - uniform line lengths"
: > "$TESTDIR/square.txt"
for i in $(seq 1 $NUM_LINES); do
    printf '%50s\n' '' | tr ' ' 'X' >> "$TESTDIR/square.txt"
done
# ─────────────────────────────────────────────────────────────────
# SEMICIRCLE: Lines follow sqrt curve (half circle)
# Should show histogram that's tall in middle, shorter at edges
# Using formula: width = sqrt(1 - ((x-0.5)*2)^2) scaled
# ─────────────────────────────────────────────────────────────────
echo " semicircle.txt - semicircle line lengths"
: > "$TESTDIR/semicircle.txt"
for i in $(seq 1 $NUM_LINES); do
    # Normalize i to -1..1 range, then compute sqrt(1-x^2)
    # awk for floating point math
    len=$(awk -v i="$i" -v n="$NUM_LINES" 'BEGIN {
        x = (i - 1) / (n - 1)    # 0 to 1
        x = x * 2 - 1            # -1 to 1
        if (x*x > 1) { print 1 }
        else { print int(sqrt(1 - x*x) * 100) + 1 }
    }')
    printf '%*s\n' "$len" '' | tr ' ' 'X' >> "$TESTDIR/semicircle.txt"
done
# ─────────────────────────────────────────────────────────────────
# BELL CURVE: Gaussian distribution
# Should show histogram with peak in middle
# ─────────────────────────────────────────────────────────────────
echo " bell_curve.txt - gaussian line lengths"
: > "$TESTDIR/bell_curve.txt"
for i in $(seq 1 $NUM_LINES); do
    # Gaussian: exp(-(x-mean)^2 / (2*sigma^2))
    len=$(awk -v i="$i" -v n="$NUM_LINES" 'BEGIN {
        x = (i - 1) / (n - 1)    # 0 to 1
        x = x * 6 - 3            # -3 to 3 (3 sigma range)
        g = exp(-(x*x) / 2)      # gaussian
        print int(g * 100) + 1
    }')
    printf '%*s\n' "$len" '' | tr ' ' 'X' >> "$TESTDIR/bell_curve.txt"
done
# ─────────────────────────────────────────────────────────────────
# EDGE CASE: Tiny file (3 lines)
# Should show 3 buckets with data, 7 with zeros
# ─────────────────────────────────────────────────────────────────
echo " tiny.txt - only 3 lines"
cat > "$TESTDIR/tiny.txt" << 'EOF'
short
medium medium medium
long long long long long long long long
EOF
# ─────────────────────────────────────────────────────────────────
# EDGE CASE: Single huge line
# Should show all bytes in bucket 1
# ─────────────────────────────────────────────────────────────────
echo " single_huge.txt - one massive line"
printf '%10000s\n' '' | tr ' ' 'X' > "$TESTDIR/single_huge.txt"
# ─────────────────────────────────────────────────────────────────
# EDGE CASE: Empty file
# ─────────────────────────────────────────────────────────────────
echo " empty.txt - zero bytes"
: > "$TESTDIR/empty.txt"
# ─────────────────────────────────────────────────────────────────
# EDGE CASE: Exactly 10 lines
# Each line should be its own bucket
# ─────────────────────────────────────────────────────────────────
echo " exact_10.txt - exactly 10 lines, varying sizes"
: > "$TESTDIR/exact_10.txt"
for i in 10 20 30 40 50 60 70 80 90 100; do
    printf '%*s\n' "$i" '' | tr ' ' 'X' >> "$TESTDIR/exact_10.txt"
done
# ─────────────────────────────────────────────────────────────────
# SPIKE: All short except one huge line in the middle
# Should show spike in bucket 5
# ─────────────────────────────────────────────────────────────────
echo " spike.txt - all short except huge line at position 50"
: > "$TESTDIR/spike.txt"
for i in $(seq 1 $NUM_LINES); do
    if [ "$i" -eq 50 ]; then
        printf '%5000s\n' '' | tr ' ' 'X' >> "$TESTDIR/spike.txt"
    else
        printf '%5s\n' '' | tr ' ' 'X' >> "$TESTDIR/spike.txt"
    fi
done
echo ""
echo "Generated files:"
ls -la "$TESTDIR/"
#!/usr/bin/awk -f
# line_histogram.awk - Profile a file by line sizes, or extract lines, so agents
# don't flood their context inspecting database extracts such as JSONL files.
#
# Usage:
# ./line_histogram.awk <file> # Histogram mode (default)
# ./line_histogram.awk -v mode=extract -v line=5 <file> # Extract single line
# ./line_histogram.awk -v mode=extract -v start=10 -v end=20 <file> # Extract range
#
# Modes:
# histogram (default) - Show byte distribution across 10 buckets
# extract - Extract specific line(s) with -v line=N or -v start=X -v end=Y
BEGIN {
    # Default to histogram mode if not specified
    if (mode == "") mode = "histogram"
    total_bytes = 0
    total_lines = 0
    # Determine output destination
    if (outfile == "") {
        out = "/dev/stdout"
    } else {
        out = outfile
    }
}
{
    total_lines++
    line_sizes[total_lines] = length($0)  # length() counts characters, excluding the newline
    lines[total_lines] = $0               # buffered so extract mode can replay ranges
    total_bytes += line_sizes[total_lines]
}
END {
    # Handle extract mode
    if (mode == "extract") {
        if (line != "") {
            # Single line extraction
            if (line >= 1 && line <= total_lines) {
                print lines[line]
            } else {
                print "Error: line " line " out of range (1-" total_lines ")" > "/dev/stderr"
                exit 1
            }
        } else if (start != "" && end != "") {
            # Range extraction
            if (start < 1) start = 1
            if (end > total_lines) end = total_lines
            if (start > end) {
                print "Error: start " start " > end " end > "/dev/stderr"
                exit 1
            }
            for (i = start; i <= end; i++) {
                print lines[i]
            }
        } else {
            print "Error: extract mode requires -v line=N or -v start=X -v end=Y" > "/dev/stderr"
            exit 1
        }
        exit 0
    }
    # Histogram mode (default)
    print "File: " FILENAME > out
    print "Total bytes: " total_bytes > out
    print "Total lines: " total_lines > out
    print "" > out
    print "Bucket Distribution:" > out
    print "" > out
    if (total_lines == 0) {
        print "Empty file" > out
        exit 0
    }
    # Determine bucket count and size
    if (total_lines <= 10) {
        num_buckets = total_lines
        bucket_size = 1
    } else {
        num_buckets = 10
        bucket_size = int(total_lines / 10)
        # Remainder lines are absorbed into the last bucket
    }
    # Initialize bucket byte counts
    for (i = 1; i <= 10; i++) {
        bucket_bytes[i] = 0
    }
    # Assign lines to buckets and sum bytes
    for (line_num = 1; line_num <= total_lines; line_num++) {
        if (total_lines <= 10) {
            bucket = line_num
        } else {
            bucket = int((line_num - 1) / bucket_size) + 1
            if (bucket > 10) bucket = 10  # Last bucket catches remainder
        }
        bucket_bytes[bucket] += line_sizes[line_num]
    }
    # Find max bytes for scaling the visual histogram
    max_bytes = 0
    for (i = 1; i <= num_buckets; i++) {
        if (bucket_bytes[i] > max_bytes) max_bytes = bucket_bytes[i]
    }
    # Print histogram header and table
    printf "%-15s | %-12s | %-40s\n", "Line Range", "Bytes", "Distribution" > out
    print "─────────────────┼──────────────┼──────────────────────────────────────────" > out
    # Print each bucket
    for (i = 1; i <= 10; i++) {
        if (total_lines <= 10) {
            if (i <= total_lines) {
                start_line = i
                end_line = i
            } else {
                start_line = 0
                end_line = 0
            }
        } else {
            start_line = (i - 1) * bucket_size + 1
            if (i == 10) {
                end_line = total_lines
            } else {
                end_line = i * bucket_size
            }
        }
        # Format line range
        if (start_line == 0) {
            range = sprintf("%7s", "-")
        } else if (start_line == end_line) {
            range = sprintf("%7d", start_line)
        } else {
            range = sprintf("%d-%d", start_line, end_line)
        }
        # Calculate bar length (max 40 chars)
        if (max_bytes > 0) {
            bar_len = int((bucket_bytes[i] / max_bytes) * 40)
        } else {
            bar_len = 0
        }
        # Build bar
        bar = ""
        for (j = 1; j <= bar_len; j++) {
            bar = bar "█"
        }
        # Print bucket line
        printf "%-15s | %12d | %s\n", range, bucket_bytes[i], bar > out
    }
    print "─────────────────┼──────────────┼──────────────────────────────────────────" > out
}
#!/bin/bash
# run_tests.sh - Run histogram tests and display results
#
# This runs the awk histogram against each test file and shows
# the visual output. You can eyeball whether the histogram shape
# matches the expected pattern.
AWK_SCRIPT="./line_histogram.awk"
TESTDIR="test_files"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
BLUE='\033[0;34m'
BOLD='\033[1m'
NC='\033[0m' # No Color
echo "═══════════════════════════════════════════════════════════════════"
echo " LINE HISTOGRAM TEST SUITE "
echo "═══════════════════════════════════════════════════════════════════"
echo ""
run_test() {
    local file="$1"
    local expected_shape="$2"
    local description="$3"
    echo -e "${BOLD}${BLUE}TEST: $(basename "$file")${NC}"
    echo -e "Expected shape: ${GREEN}$expected_shape${NC}"
    echo "Description: $description"
    echo ""
    if [ ! -f "$file" ]; then
        echo -e "${RED}ERROR: File not found${NC}"
        echo ""
        return 1
    fi
    awk -f "$AWK_SCRIPT" "$file"
    echo ""
    echo "───────────────────────────────────────────────────────────────────"
    echo ""
}
# Generate test files first
echo "Generating test files..."
bash generate_test_files.sh
echo ""
echo "═══════════════════════════════════════════════════════════════════"
echo ""
# Run tests with expected shapes
run_test "$TESTDIR/triangle_up.txt" \
    "RAMP UP ↗" \
    "Lines grow from 1 to 100 chars. Histogram should show bars increasing left to right."
run_test "$TESTDIR/triangle_down.txt" \
    "RAMP DOWN ↘" \
    "Lines shrink from 100 to 1 chars. Histogram should show bars decreasing left to right."
run_test "$TESTDIR/square.txt" \
    "FLAT ═══" \
    "All lines are 50 chars. Histogram should show all bars equal height."
run_test "$TESTDIR/semicircle.txt" \
    "SEMICIRCLE ⌒" \
    "Line lengths follow sqrt(1-x²). Histogram should show arch - tall middle, low ends."
run_test "$TESTDIR/bell_curve.txt" \
    "BELL CURVE 🔔" \
    "Gaussian distribution. Histogram should show peak in middle, tapering to edges."
run_test "$TESTDIR/spike.txt" \
    "SPIKE IN MIDDLE ⬆" \
    "One 5000-char line at position 50, rest are 5 chars. Bucket 5 should dominate."
run_test "$TESTDIR/tiny.txt" \
    "3 BUCKETS ONLY" \
    "Only 3 lines. Should show 3 buckets with data, 7 with zeros/dashes."
run_test "$TESTDIR/exact_10.txt" \
    "STAIRCASE UP ▁▂▃▄▅▆▇█" \
    "Exactly 10 lines, lengths 10,20,30...100. Each line is its own bucket, ascending."
run_test "$TESTDIR/single_huge.txt" \
    "ALL IN BUCKET 1" \
    "Single 10000-char line. All bytes should be in first bucket only."
run_test "$TESTDIR/empty.txt" \
    "EMPTY FILE" \
    "Zero bytes, zero lines. Should handle gracefully."
echo "═══════════════════════════════════════════════════════════════════"
echo " TEST SUITE COMPLETE "
echo "═══════════════════════════════════════════════════════════════════"