
@mrpollo
Created March 11, 2026 03:44

Flight Review /browse Search Hang — Analysis & Recommendations

Root Cause

browse.py lines 88-95 fetch all 353k public logs from SQLite into Python memory on every request. With no search term this is fast, because only the paginated slice (10-25 rows) is processed further. With a search term, the handler calls get_columns_from_tuple() on every single row (string formatting, airframe lookups, datetime conversions, HTML generation) and then does Python string matching: 353k iterations of expensive Python work per keystroke. The database has zero indexes.
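The hot path can be sketched as a toy example. The function and column names below are illustrative stand-ins, not the actual browse.py code:

```python
def format_columns(row):
    # Stand-in for get_columns_from_tuple(): per-row formatting work
    log_id, description = row
    return [f"#{log_id}", description.strip()]

def slow_search(rows, term):
    """O(n): formats EVERY public log, then string-matches in Python."""
    term = term.lower()
    matches = []
    for row in rows:                 # 353k iterations on the real database
        cols = format_columns(row)   # expensive formatting happens BEFORE filtering
        if any(term in c.lower() for c in cols):
            matches.append(cols)
    return matches
```

The formatting work is wasted for every row the filter then discards, which is almost all of them.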


Perspective 1: AWS Principal Engineer

Immediate (no code changes, 30 minutes)

Add SQLite indexes on the server:

sqlite3 /home/ubuntu/logs.sqlite
CREATE INDEX IF NOT EXISTS idx_logs_public_source ON Logs(Public, Source);
CREATE INDEX IF NOT EXISTS idx_logs_date ON Logs(Date DESC);
ANALYZE;

This speeds up the SQL query but does not fix the hang — the bottleneck is Python iterating 353k rows, not SQLite.
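To confirm the planner actually picks the new index up, check the query plan. This is a sketch against a throwaway in-memory copy of the schema, with column names assumed from the queries above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Logs (Id INTEGER PRIMARY KEY, Public INT, Source TEXT, Date TEXT)")
con.execute("CREATE INDEX IF NOT EXISTS idx_logs_public_source ON Logs(Public, Source)")

# EXPLAIN QUERY PLAN reports SEARCH ... USING INDEX when the index is used,
# versus SCAN Logs for a full table scan
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT Id FROM Logs WHERE Public = 1 AND Source = 'webui'"
).fetchall()
plan_text = " ".join(str(row) for row in plan)
```

Run the same EXPLAIN QUERY PLAN against the production queries on the server to make sure none of them still say SCAN.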

Nginx response caching (the real win):

proxy_cache_path /var/cache/nginx/browse levels=1:2 keys_zone=browse_cache:10m max_size=500m inactive=5m;

location /browse_data_retrieval {
    proxy_cache browse_cache;
    proxy_cache_valid 200 2m;
    proxy_cache_key "$request_uri";
    proxy_cache_use_stale error timeout updating http_500 http_502 http_503;
    proxy_cache_lock on;              # Only ONE request hits Tornado, others wait for cache
    proxy_cache_lock_timeout 30s;
    add_header X-Cache-Status $upstream_cache_status;
    proxy_pass http://127.0.0.1:5100;
}

proxy_cache_lock on is the critical directive — when 5 users search for "pixhawk", only one request hits Tornado. Repeat searches within 2 minutes are instant. Zero cost.

Add rate limiting:

limit_req_zone $binary_remote_addr zone=browse_search:10m rate=2r/s;

location /browse_data_retrieval {
    limit_req zone=browse_search burst=5 nodelay;
    # ... cache config above ...
}

Prevents rapid keystrokes from queuing up multiple 353k-row scans.

Short-term bridge (1-2 days, buys months)

Pre-computed browse JSON via cron — same pattern as the existing /dbinfo fix:

  1. Cron job dumps browse data to a static JSON file (adapt from tools/generate_dbinfo_json.py)
  2. Upload to S3, serve via CloudFront
  3. Switch DataTables to serverSide: false — client-side search/pagination handles 353k rows in milliseconds in the browser

Estimated JSON size: ~5-8MB gzipped. The browse data changes only on upload, not on every request.
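A minimal sketch of such a cron script. Table and column names here are assumptions; the real version should be adapted from tools/generate_dbinfo_json.py:

```python
import json
import sqlite3

def dump_browse_json(db_path: str, out_path: str) -> int:
    """Dump all public logs to a static JSON file DataTables can consume."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT Id, Date, Description FROM Logs "
        "WHERE Public = 1 ORDER BY Date DESC"
    ).fetchall()
    con.close()
    payload = {"data": [list(r) for r in rows]}  # DataTables' default ajax shape
    with open(out_path, "w") as f:
        json.dump(payload, f, separators=(",", ":"))  # compact; gzip at nginx/S3
    return len(rows)
```

The cron job would call this, gzip the output, and push it to S3; the browse page then loads it once and lets DataTables filter client-side.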

Long-term replacement architecture

CloudFront
    ├── /api/*  → API Gateway → Lambda (Python) → DynamoDB (metadata)
    │                                             → S3 (log files)
    └── /*      → S3 (static React/Vue SPA)
  • DynamoDB over RDS: flat table of log metadata, simple key-value access. On-demand pricing at 353k items ≈ $1-3/month.
  • Lambda over ECS: stateless, short-lived API calls. Effectively free at this traffic level.
  • Static SPA: client-side plotting (Plotly.js), client-side search/filtering. No server to manage.
  • Total new cost: ~$5/month vs current EC2 at $50-100/month.

Perspective 2: Dagar Review

Can we fix without code changes? No. Indexes don't help because the bottleneck is Python iterating 353k rows, not SQLite. Nginx caching is a band-aid — first search still hangs and blocks a Tornado worker for 30+ seconds.

If we must change code, smallest diff? Push search into SQL with LIKE + LIMIT/OFFSET. ~30 lines changed. But every line written for this app is a line thrown away.
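The core of that diff can be sketched as follows. The schema and searchable columns are assumptions, and a real patch should also escape LIKE wildcards in the user's term:

```python
import sqlite3

def search_logs(con: sqlite3.Connection, term: str, page: int, per_page: int = 25):
    """Push filtering and pagination into SQLite instead of Python."""
    like = f"%{term}%"
    return con.execute(
        "SELECT Id, Date, Description FROM Logs "
        "WHERE Public = 1 AND (Description LIKE ? OR Source LIKE ?) "
        "ORDER BY Date DESC LIMIT ? OFFSET ?",
        (like, like, per_page, page * per_page),
    ).fetchall()
```

Parameterized queries keep it injection-safe, and SQLite's LIKE is ASCII case-insensitive by default, matching the current Python behavior for most terms.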

What NOT to do:

  • FTS5 — over-engineering for a dying app
  • Redis/caching layers — nginx cache is sufficient
  • Refactoring handlers into proper classes
  • Adding new search features
  • "Just in case" indexes on columns nobody queries

The actual recommendation: Remove the search box. One line of HTML. The browse page without search works fine with pagination. If someone needs search, point them at the dbinfo JSON. If the team insists on keeping search, the SQL LIKE change is ~30 lines, but why invest in throwaway code?

Option 3: signal.alarm(5) to kill searches that take too long. Ugly but prevents hangs from killing the workers. No functional change, just a timeout.
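A sketch of that guard. Caveats: SIGALRM is Unix-only and only works on the main thread, and the Tornado handler would need to catch the exception and return an error response:

```python
import signal

class SearchTimeout(Exception):
    pass

def _raise_timeout(signum, frame):
    raise SearchTimeout()

def run_with_timeout(seconds, fn, *args, **kwargs):
    """Abort fn() with SearchTimeout if it runs longer than `seconds`."""
    old_handler = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # like alarm(), but accepts floats
    try:
        return fn(*args, **kwargs)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)    # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

Usage would be roughly `run_with_timeout(5.0, do_search, term)` inside the handler, mapping SearchTimeout to an HTTP error.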


Perspective 3: PX4 Open Source Maintainer

Community impact

Search is not a niche feature:

  • Developers debugging regressions need to search by vehicle UUID, git SHA, software version (dagar filed issue #121 for this)
  • Airframe maintainers search by hardware/airframe name
  • Self-hosters with fleets (issue #257) reported this exact hang causing server-wide denial of service — had to hard-reboot multiple times
  • CI/testing workflows — a blocked search request ties up Tornado's event loop and affects plot viewing for all users

Existing community history

  • Issue #257 (2023): "Site-wide denial of service in search box" — self-hoster reported this. Partial fix (5ced4b0) removed per-row os.path.exists() but didn't address the core O(n) Python iteration
  • Issue #121: dagar requested search by UUID, hardware, software — acknowledged but the performance issue was never fixed
  • PRs #316, #317 (2025): Added hash/tag search modes, improved what you can search for but not how
  • No formal RFC or repo exists for a Flight Review replacement. The "eventually replace" sentiment has not materialized

Recommendation

Fix the current app AND plan the replacement. These are not mutually exclusive.

The fix is a focused PR: push search into SQL with LIKE + LIMIT/OFFSET, add CREATE INDEX IF NOT EXISTS to setup_db.py. ~100 lines total. It:

  • Fixes a bug reported twice by the community (#257, #121)
  • Benefits every deployment from single-user to 353k logs
  • Requires zero new dependencies
  • Self-hosters get indexes automatically via setup_db.py migration
  • Buys time to plan a proper replacement
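The migration piece can be sketched as below. The actual setup_db.py structure may differ; the point is that CREATE INDEX IF NOT EXISTS makes re-runs safe, so existing self-hosted databases pick the indexes up on their next start:

```python
import sqlite3

BROWSE_INDEXES = (
    "CREATE INDEX IF NOT EXISTS idx_logs_public_source ON Logs(Public, Source)",
    "CREATE INDEX IF NOT EXISTS idx_logs_date ON Logs(Date DESC)",
)

def ensure_browse_indexes(con: sqlite3.Connection) -> None:
    """Safe to run on every startup; no-op when the indexes already exist."""
    for stmt in BROWSE_INDEXES:
        con.execute(stmt)
    con.commit()
```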

For the replacement, preserve:

  • Upload API (QGroundControl, upload_log.py, CI all depend on it)
  • /plot_app?log=ID URL scheme (thousands of links in GitHub issues, Discuss, Slack)
  • Self-hosting with Docker

For the replacement, change:

  • Separate API from UI (REST backend + static SPA frontend)
  • Proper database with indexes/FTS (or DynamoDB for serverless)
  • Storage abstraction (local disk, S3, GCS) configurable per deployment

Summary of Options

| Option | Code Change | Time | Risk | Fixes Search |
|---|---|---|---|---|
| Nginx cache + rate limit | None (infra only) | 30 min | Low | Partially (first hit still hangs) |
| Add SQLite indexes on server | None (DB only) | 10 min | Low | No (bottleneck is Python) |
| Remove search box | 1 line HTML | 5 min | None | N/A (removes feature) |
| SQL LIKE + LIMIT/OFFSET | ~30-100 lines | 1-2 days | Low | Yes |
| Pre-computed JSON + client-side DataTables | Template change + cron script | 1-2 days | Low | Yes |
| Full replacement (DynamoDB + Lambda + SPA) | Full rewrite | Months | Medium | Yes |

Recommended Path

  1. Today: Nginx caching + rate limiting on /browse_data_retrieval (stops the bleeding)
  2. This week: Decide between SQL fix (minimal code, benefits self-hosters) or pre-computed JSON (zero server load, production-only)
  3. This quarter: Start the replacement architecture conversation with the community — there's no RFC yet, and there should be