@simbo1905
Last active January 14, 2026 15:57
Line Histogram - Profile files by line size distribution. Perfect for AI agents, data engineers, and anyone who needs to know if line 847 will eat their entire context window.

📊 Line Histogram — The File Profiler You Didn't Know You Needed

What If You Could See Your Data Before It Floods Your Context Window?

You know that moment: you ask your agent to check the data, and before you can tell it not to eat the ocean, it enters Compaction.

Yeah. We've all been there.

The Problem

  • 🤯 Huge database dumps with unpredictable line sizes
  • 💸 Token budgets disappearing into massive single-line JSON blobs
  • 🔍 No quick way to see WHERE the chonky lines are hiding
  • ⚠️ Agents choking on files you thought were reasonable

The Solution: line_histogram.awk

A blazingly fast AWK script that gives you X-ray vision into your files' byte distribution.

# chmod +x ./line_histogram.awk
./line_histogram.awk huge_export.jsonl
File: huge_export.jsonl
Total bytes: 2847392
Total lines: 1000

Bucket Distribution:

Line Range      | Bytes        | Distribution
─────────────────┼──────────────┼──────────────────────────────────────────
1-100           |         4890 | ██
101-200         |         5234 | ██
201-300         |         5832 | ██
301-400         |         6128 | ██
401-500         |       385927 | ████████████████████████████████████████
501-600         |         5892 | ██
601-700         |         5234 | ██
701-800         |         6891 | ██
801-900         |         5328 | ██
901-1000        |         4982 | ██
─────────────────┼──────────────┼──────────────────────────────────────────

Boom. Line 450 is eating 99% of your file. Skip it and save yourself 2.8MB of token cost.


🚀 Features That Actually Matter

1. Histogram Mode (Default)

See the byte distribution across your file in 10 neat buckets. Spot the bloat instantly.

./line_histogram.awk myfile.txt

2. Surgical Line Extraction

Found a problem line? Pull it out without opening the whole file in your editor.

# Extract line 450 (the chonky one)
./line_histogram.awk -v mode=extract -v line=450 huge_export.jsonl

# Extract lines 100-200 for inspection
./line_histogram.awk -v mode=extract -v start=100 -v end=200 data.jsonl

Yes yes, those -v bits look odd, but yes yes, they are needed: that's how AWK takes variable assignments on the command line, who knew! (Hint: the AI.)
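For the curious, `-v name=value` is standard AWK: it assigns a variable before the program runs, which is the portable way to pass parameters into a script. A quick sketch (illustrative values, not tied to the gist):

```shell
# -v assigns an AWK variable before the program starts executing,
# so it is already set inside the BEGIN block.
awk -v mode=extract -v line=450 'BEGIN { print "mode=" mode ", line=" line }'
# Output: mode=extract, line=450
```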

3. Zero Dependencies

If your system has AWK (it does), you're good to go. No npm install, no pip install, no Docker containers. Just pure, unadulterated shell goodness.

4. Stupid Fast

Processes multi-GB files in seconds. AWK was built for this.


💡 Use Cases That'll Make You Look Like a Genius

For AI Agent Wranglers

  • Profile before you prompt: Know if that export file is safe to feed your agent
  • Smart sampling: Extract representative line ranges instead of the whole file
  • Debug token explosions: "Why did my context window fill up?" → histogram shows a 500KB line

For Data Engineers

  • Spot malformed CSVs: One line with 10,000 columns? Histogram shows it
  • Log file analysis: Find the log entries that are suspiciously huge
  • Database export QA: Verify export structure before importing elsewhere

For DevOps/SRE

  • Config file sanity checks: Spot embedded certificates or secrets bloating configs
  • Debug log truncation: See which lines are hitting your logger's size limits
  • Kafka message profiling: Histogram message sizes before they hit your pipeline

📖 Pseudo Man Page (The Details)

NAME

line_histogram.awk — profile files by line size distribution or extract specific lines

SYNOPSIS

./line_histogram.awk [options] <file>

OPTIONS

-v mode=histogram     (default) Show byte distribution across 10 buckets
-v mode=extract       Extract specific line(s)
-v line=N             Extract single line N (requires mode=extract)
-v start=X -v end=Y   Extract lines X through Y (requires mode=extract)
-v outfile=FILE       Write output to FILE instead of stdout

MODES

Histogram Mode (Default)

Divides the file into 10 equal-sized buckets by line number and shows the byte distribution:

  • Bucket 1: Lines 1-10% → X bytes
  • Bucket 2: Lines 11-20% → Y bytes
  • ...and so on

The visual histogram uses █ blocks scaled to the bucket with the most bytes.
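The scaling is plain proportional math: each bar is the bucket's bytes divided by the biggest bucket's bytes, times 40, truncated to an integer. A sketch of the formula (illustrative numbers, mirroring the calculation in the script):

```shell
# Bar width = int(bucket_bytes / max_bytes * 40).
awk 'BEGIN {
    max_bytes = 400000
    print int((400000 / max_bytes) * 40)   # dominant bucket: full 40 blocks
    print int((100000 / max_bytes) * 40)   # quarter-sized bucket: 10 blocks
    print int((5000   / max_bytes) * 40)   # tiny bucket: truncates to 0
}'
```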

Special cases:

  • Files ≤10 lines: Each line gets its own bucket
  • Remainder lines: Absorbed into bucket 10
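The remainder handling falls out of the bucket index math. A sketch of the 10-bucket case (same arithmetic as the script's END block, run on a hypothetical 95-line file):

```shell
# For a 95-line file: bucket_size = int(95/10) = 9, so the naive index
# overflows past 10 for the last few lines and gets clamped into bucket 10.
awk 'BEGIN {
    total_lines = 95
    bucket_size = int(total_lines / 10)      # 9 lines per bucket
    for (ln = 80; ln <= 95; ln += 5) {
        b = int((ln - 1) / bucket_size) + 1  # naive bucket index
        if (b > 10) b = 10                   # remainder absorbed into bucket 10
        printf "line %d -> bucket %d\n", ln, b
    }
}'
# Output: lines 85, 90, and 95 all land in bucket 10
```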

Extract Mode

Pull specific lines without loading the entire file into your editor:

# Single line
./line_histogram.awk -v mode=extract -v line=42 file.txt

# Range
./line_histogram.awk -v mode=extract -v start=100 -v end=200 file.txt

EXIT STATUS

  • 0: Success
  • 1: Error (invalid line number, bad range, missing parameters)
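AWK's `exit N` propagates to the shell, so the script composes with `&&`, `||`, and `if` in pipelines. A minimal demonstration of the mechanism (a standalone sketch, not the script itself):

```shell
# exit 1 inside an AWK program becomes the process exit status.
awk 'BEGIN { print "Error: line out of range" > "/dev/stderr"; exit 1 }'
echo "exit status: $?"
# Prints: exit status: 1
```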

EXAMPLES

Example 1: Quick file profile

./line_histogram.awk database_dump.jsonl

Example 2: Extract suspicious line for inspection

./line_histogram.awk -v mode=extract -v line=523 data.csv > suspicious_line.txt

Example 3: Sample middle section of large file

./line_histogram.awk -v mode=extract -v start=5000 -v end=5100 bigfile.log | less

Example 4: Save histogram to file

./line_histogram.awk -v outfile=analysis.txt huge_file.jsonl
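If all you want is the single worst offender rather than the full distribution, a complementary AWK one-liner does the trick (an ad-hoc sketch, not part of the gist; the sample file path is made up):

```shell
# Build a sample file with one 500-byte line among short ones,
# then report the longest line's number and length.
printf 'short\n%s\nshort\n' "$(head -c 500 /dev/zero | tr '\0' 'X')" > /tmp/sample.txt
awk '{ if (length($0) > max) { max = length($0); ln = NR } }
     END { printf "longest: line %d (%d bytes)\n", ln, max }' /tmp/sample.txt
# Output: longest: line 2 (500 bytes)
```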

🧪 Testing Suite Included

Not sure if it works? We've got you covered with a visual test suite:

# Generate test patterns
./generate_test_files.sh

# Run all tests
./run_tests.sh

The test suite generates files with known patterns:

  • 📈 Triangle up/down: Ascending/descending line sizes
  • 📦 Square: Uniform line lengths
  • 🌙 Semicircle: sqrt curve distribution
  • 🔔 Bell curve: Gaussian distribution
  • 📍 Spike: One massive line in a sea of tiny ones
  • 🎯 Edge cases: Empty files, single lines, exactly 10 lines

Watch the histograms match the patterns. It's oddly satisfying.


⚡ Installation

# Clone or download
curl -O https://gist.github.com/...  # (your gist URL here)

# Make executable
chmod +x line_histogram.awk

# Optional: Add to PATH
cp line_histogram.awk ~/bin/line_histogram.awk

Or just run it directly:

awk -f line_histogram.awk yourfile.txt

🎯 Why This Exists

Born from frustration with AI agents eating context windows on mystery files. Sometimes you just need to know: "Is this file safe to feed my agent, or will line 847 consume my entire token budget?" Because that is obviously how you think and act. You are not a data engineer up at 2am typing in SCREAMING ALL CAPS because your LLM got into a crash loop trying to evaluate a JSONL extract that didn't fit into context. That is definitely not you, no. Me neither.


📜 License

MIT or Public Domain. Use it, abuse it, put it in production, whatever. No warranty implied—if it deletes your files, that's on you (though it only reads, so you're probably fine).


🤝 Contributing

It's AWK. If you can make it better, you're a wizard. PRs welcome, although you would need to set up a repo and I can't be bothered; just fork the gist and be done with it.


Made with ❤️ and frustration by someone who spent too many tokens on line 523 of a JSONL file.

Now go profile your files like a pro. 📊✨

#!/bin/bash
# generate_test_files.sh - Generate ASCII art test patterns for histogram testing
#
# Creates files where line lengths follow specific patterns so the histogram
# visually matches the pattern shape.
TESTDIR="test_files"
mkdir -p "$TESTDIR"
NUM_LINES=100 # Gives us 10 lines per bucket nicely
echo "Generating test files in $TESTDIR/"
# ─────────────────────────────────────────────────────────────────
# TRIANGLE UP: Lines get progressively longer
# Should show histogram ramping up left to right
# ─────────────────────────────────────────────────────────────────
echo " triangle_up.txt - ascending line lengths"
: > "$TESTDIR/triangle_up.txt"
for i in $(seq 1 $NUM_LINES); do
    # Line length = i (1 to 100 chars)
    printf '%*s\n' "$i" '' | tr ' ' 'X' >> "$TESTDIR/triangle_up.txt"
done
# ─────────────────────────────────────────────────────────────────
# TRIANGLE DOWN: Lines get progressively shorter
# Should show histogram ramping down left to right
# ─────────────────────────────────────────────────────────────────
echo " triangle_down.txt - descending line lengths"
: > "$TESTDIR/triangle_down.txt"
for i in $(seq 1 $NUM_LINES); do
    # Line length = (NUM_LINES - i + 1)
    len=$((NUM_LINES - i + 1))
    printf '%*s\n' "$len" '' | tr ' ' 'X' >> "$TESTDIR/triangle_down.txt"
done
# ─────────────────────────────────────────────────────────────────
# SQUARE: All lines same length
# Should show flat histogram (all buckets equal)
# ─────────────────────────────────────────────────────────────────
echo " square.txt - uniform line lengths"
: > "$TESTDIR/square.txt"
for i in $(seq 1 $NUM_LINES); do
    printf '%50s\n' '' | tr ' ' 'X' >> "$TESTDIR/square.txt"
done
# ─────────────────────────────────────────────────────────────────
# SEMICIRCLE: Lines follow sqrt curve (half circle)
# Should show histogram that's tall in middle, shorter at edges
# Using formula: width = sqrt(1 - ((x-0.5)*2)^2) scaled
# ─────────────────────────────────────────────────────────────────
echo " semicircle.txt - semicircle line lengths"
: > "$TESTDIR/semicircle.txt"
for i in $(seq 1 $NUM_LINES); do
    # Normalize i to -1..1 range, then compute sqrt(1-x^2)
    # awk for floating point math
    len=$(awk -v i="$i" -v n="$NUM_LINES" 'BEGIN {
        x = (i - 1) / (n - 1)    # 0 to 1
        x = x * 2 - 1            # -1 to 1
        if (x*x > 1) { print 1 }
        else { print int(sqrt(1 - x*x) * 100) + 1 }
    }')
    printf '%*s\n' "$len" '' | tr ' ' 'X' >> "$TESTDIR/semicircle.txt"
done
# ─────────────────────────────────────────────────────────────────
# BELL CURVE: Gaussian distribution
# Should show histogram with peak in middle
# ─────────────────────────────────────────────────────────────────
echo " bell_curve.txt - gaussian line lengths"
: > "$TESTDIR/bell_curve.txt"
for i in $(seq 1 $NUM_LINES); do
    # Gaussian: exp(-(x-mean)^2 / (2*sigma^2))
    len=$(awk -v i="$i" -v n="$NUM_LINES" 'BEGIN {
        x = (i - 1) / (n - 1)    # 0 to 1
        x = x * 6 - 3            # -3 to 3 (3 sigma range)
        g = exp(-(x*x) / 2)      # gaussian
        print int(g * 100) + 1
    }')
    printf '%*s\n' "$len" '' | tr ' ' 'X' >> "$TESTDIR/bell_curve.txt"
done
# ─────────────────────────────────────────────────────────────────
# EDGE CASE: Tiny file (3 lines)
# Should show 3 buckets with data, 7 with zeros
# ─────────────────────────────────────────────────────────────────
echo " tiny.txt - only 3 lines"
cat > "$TESTDIR/tiny.txt" << 'EOF'
short
medium medium medium
long long long long long long long long
EOF
# ─────────────────────────────────────────────────────────────────
# EDGE CASE: Single huge line
# Should show all bytes in bucket 1
# ─────────────────────────────────────────────────────────────────
echo " single_huge.txt - one massive line"
printf '%10000s\n' '' | tr ' ' 'X' > "$TESTDIR/single_huge.txt"
# ─────────────────────────────────────────────────────────────────
# EDGE CASE: Empty file
# ─────────────────────────────────────────────────────────────────
echo " empty.txt - zero bytes"
: > "$TESTDIR/empty.txt"
# ─────────────────────────────────────────────────────────────────
# EDGE CASE: Exactly 10 lines
# Each line should be its own bucket
# ─────────────────────────────────────────────────────────────────
echo " exact_10.txt - exactly 10 lines, varying sizes"
: > "$TESTDIR/exact_10.txt"
for i in 10 20 30 40 50 60 70 80 90 100; do
    printf '%*s\n' "$i" '' | tr ' ' 'X' >> "$TESTDIR/exact_10.txt"
done
# ─────────────────────────────────────────────────────────────────
# SPIKE: All short except one huge line in the middle
# Should show spike in bucket 5
# ─────────────────────────────────────────────────────────────────
echo " spike.txt - all short except huge line at position 50"
: > "$TESTDIR/spike.txt"
for i in $(seq 1 $NUM_LINES); do
    if [ "$i" -eq 50 ]; then
        printf '%5000s\n' '' | tr ' ' 'X' >> "$TESTDIR/spike.txt"
    else
        printf '%5s\n' '' | tr ' ' 'X' >> "$TESTDIR/spike.txt"
    fi
done
echo ""
echo "Generated files:"
ls -la "$TESTDIR/"
#!/usr/bin/awk -f
# line_histogram.awk - Profile a file by line sizes, or extract lines, so agents
# don't flood their context inspecting database extracts such as JSONL files.
#
# Usage:
# ./line_histogram.awk <file> # Histogram mode (default)
# ./line_histogram.awk -v mode=extract -v line=5 <file> # Extract single line
# ./line_histogram.awk -v mode=extract -v start=10 -v end=20 <file> # Extract range
#
# Modes:
# histogram (default) - Show byte distribution across 10 buckets
# extract - Extract specific line(s) with -v line=N or -v start=X -v end=Y
BEGIN {
    # Default to histogram mode if not specified
    if (mode == "") mode = "histogram"
    total_bytes = 0
    total_lines = 0
    # Determine output destination
    if (outfile == "") {
        out = "/dev/stdout"
    } else {
        out = outfile
    }
}
{
    total_lines++
    line_sizes[total_lines] = length($0)  # length() counts characters, excluding the newline
    lines[total_lines] = $0               # buffered so extract mode can replay ranges
    total_bytes += line_sizes[total_lines]
}
END {
    # Handle extract mode
    if (mode == "extract") {
        if (line != "") {
            # Single line extraction
            if (line >= 1 && line <= total_lines) {
                print lines[line]
            } else {
                print "Error: line " line " out of range (1-" total_lines ")" > "/dev/stderr"
                exit 1
            }
        } else if (start != "" && end != "") {
            # Range extraction
            if (start < 1) start = 1
            if (end > total_lines) end = total_lines
            if (start > end) {
                print "Error: start " start " > end " end > "/dev/stderr"
                exit 1
            }
            for (i = start; i <= end; i++) {
                print lines[i]
            }
        } else {
            print "Error: extract mode requires -v line=N or -v start=X -v end=Y" > "/dev/stderr"
            exit 1
        }
        exit 0
    }
    # Histogram mode (default)
    print "File: " FILENAME > out
    print "Total bytes: " total_bytes > out
    print "Total lines: " total_lines > out
    print "" > out
    print "Bucket Distribution:" > out
    print "" > out
    if (total_lines == 0) {
        print "Empty file" > out
        exit 0
    }
    # Determine bucket count and size
    if (total_lines <= 10) {
        num_buckets = total_lines
        bucket_size = 1
    } else {
        num_buckets = 10
        bucket_size = int(total_lines / 10)
        # Remainder lines are absorbed into the last bucket
    }
    # Initialize bucket byte counts
    for (i = 1; i <= 10; i++) {
        bucket_bytes[i] = 0
    }
    # Assign lines to buckets and sum bytes
    for (line_num = 1; line_num <= total_lines; line_num++) {
        if (total_lines <= 10) {
            bucket = line_num
        } else {
            bucket = int((line_num - 1) / bucket_size) + 1
            if (bucket > 10) bucket = 10  # Last bucket catches remainder
        }
        bucket_bytes[bucket] += line_sizes[line_num]
    }
    # Find max bytes for scaling the visual histogram
    max_bytes = 0
    for (i = 1; i <= num_buckets; i++) {
        if (bucket_bytes[i] > max_bytes) max_bytes = bucket_bytes[i]
    }
    # Print histogram header and table
    printf "%-15s | %-12s | %-40s\n", "Line Range", "Bytes", "Distribution" > out
    print "─────────────────┼──────────────┼──────────────────────────────────────────" > out
    # Print each bucket
    for (i = 1; i <= 10; i++) {
        if (total_lines <= 10) {
            if (i <= total_lines) {
                start_line = i
                end_line = i
            } else {
                start_line = 0
                end_line = 0
            }
        } else {
            start_line = (i - 1) * bucket_size + 1
            if (i == 10) {
                end_line = total_lines
            } else {
                end_line = i * bucket_size
            }
        }
        # Format line range
        if (start_line == 0) {
            range = sprintf("%7s", "-")
        } else if (start_line == end_line) {
            range = sprintf("%7d", start_line)
        } else {
            range = sprintf("%d-%d", start_line, end_line)
        }
        # Calculate bar length (max 40 chars)
        if (max_bytes > 0) {
            bar_len = int((bucket_bytes[i] / max_bytes) * 40)
        } else {
            bar_len = 0
        }
        # Build bar
        bar = ""
        for (j = 1; j <= bar_len; j++) {
            bar = bar "█"
        }
        # Print bucket line
        printf "%-15s | %12d | %s\n", range, bucket_bytes[i], bar > out
    }
    print "─────────────────┼──────────────┼──────────────────────────────────────────" > out
}
#!/bin/bash
# run_tests.sh - Run histogram tests and display results
#
# This runs the awk histogram against each test file and shows
# the visual output. You can eyeball whether the histogram shape
# matches the expected pattern.
AWK_SCRIPT="./line_histogram.awk"
TESTDIR="test_files"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
BLUE='\033[0;34m'
BOLD='\033[1m'
NC='\033[0m' # No Color
echo "═══════════════════════════════════════════════════════════════════"
echo " LINE HISTOGRAM TEST SUITE "
echo "═══════════════════════════════════════════════════════════════════"
echo ""
run_test() {
    local file="$1"
    local expected_shape="$2"
    local description="$3"
    echo -e "${BOLD}${BLUE}TEST: $(basename "$file")${NC}"
    echo -e "Expected shape: ${GREEN}$expected_shape${NC}"
    echo "Description: $description"
    echo ""
    if [ ! -f "$file" ]; then
        echo -e "${RED}ERROR: File not found${NC}"
        echo ""
        return 1
    fi
    awk -f "$AWK_SCRIPT" "$file"
    echo ""
    echo "───────────────────────────────────────────────────────────────────"
    echo ""
}
# Generate test files first
echo "Generating test files..."
bash generate_test_files.sh
echo ""
echo "═══════════════════════════════════════════════════════════════════"
echo ""
# Run tests with expected shapes
run_test "$TESTDIR/triangle_up.txt" \
    "RAMP UP ↗" \
    "Lines grow from 1 to 100 chars. Histogram should show bars increasing left to right."
run_test "$TESTDIR/triangle_down.txt" \
    "RAMP DOWN ↘" \
    "Lines shrink from 100 to 1 chars. Histogram should show bars decreasing left to right."
run_test "$TESTDIR/square.txt" \
    "FLAT ═══" \
    "All lines are 50 chars. Histogram should show all bars equal height."
run_test "$TESTDIR/semicircle.txt" \
    "SEMICIRCLE ⌒" \
    "Line lengths follow sqrt(1-x²). Histogram should show arch - tall middle, low ends."
run_test "$TESTDIR/bell_curve.txt" \
    "BELL CURVE 🔔" \
    "Gaussian distribution. Histogram should show peak in middle, tapering to edges."
run_test "$TESTDIR/spike.txt" \
    "SPIKE IN MIDDLE ⬆" \
    "One 5000-char line at position 50, rest are 5 chars. Bucket 5 should dominate."
run_test "$TESTDIR/tiny.txt" \
    "3 BUCKETS ONLY" \
    "Only 3 lines. Should show 3 buckets with data, 7 with zeros/dashes."
run_test "$TESTDIR/exact_10.txt" \
    "STAIRCASE UP ▁▂▃▄▅▆▇█" \
    "Exactly 10 lines, lengths 10,20,30...100. Each line is its own bucket, ascending."
run_test "$TESTDIR/single_huge.txt" \
    "ALL IN BUCKET 1" \
    "Single 10000-char line. All bytes should be in first bucket only."
run_test "$TESTDIR/empty.txt" \
    "EMPTY FILE" \
    "Zero bytes, zero lines. Should handle gracefully."
echo "═══════════════════════════════════════════════════════════════════"
echo " TEST SUITE COMPLETE "
echo "═══════════════════════════════════════════════════════════════════"