
@mrpollo
Last active March 11, 2026 23:29
Flight Review Next: Architecture & Implementation Plan - Replacement for logs.px4.io

Flight Review Next: Architecture & Implementation Plan

Executive Summary

Flight Review (logs.px4.io) has served the PX4 community for nearly 10 years since its first commit in October 2016. Built on Tornado/Bokeh/SQLite with an S3 FUSE mount, it processes 350k+ ULog flight logs but suffers from significant infrastructure pain: every page view re-parses the raw ULog file to regenerate all 35+ Bokeh plots, the S3 FUSE mount is fragile, the SQLite database locks under concurrent access, and the Python/Bokeh stack makes the frontend difficult to extend. This document proposes a ground-up replacement designed for performance, extensibility, and operational simplicity at any scale.


Current State Assessment

What Exists Today

  • Age: 9.5 years, 664 commits, actively maintained but receiving only reliability fixes
  • Stack: Python 3.11, Tornado (async web server), Bokeh 3.8.2 (plotting), SQLite (WAL mode), Bootstrap 5, Leaflet (maps), Cesium.js (3D)
  • Infrastructure: Single EC2 instance (15GB RAM, 19GB disk), nginx reverse proxy, S3 FUSE mount at /data_s3, 2 Tornado worker processes
  • Database: SQLite at ~219MB storing only scalar metadata (14 fields per log in LogsGenerated table). No time-series data stored.
  • Processing model: Every page view triggers a full ULog parse (~1-5s for typical logs) and regeneration of all plots. An in-memory LRU cache of parsed ULog objects provides some relief.
  • ULog topics loaded: ~45 of 100+ available topics per log file
  • Plots generated: 35+ Bokeh plots including time-series, FFT spectrograms, PSD analysis, GPS maps, parameter tables, and system diagnostics
  • S3 storage: ~350k ULog files in s3://px4-flight-review/flight_review/log_files/

Key Pain Points

  1. Performance: Re-parsing ULog files on every page view is the #1 bottleneck. A 50MB log takes 1-5 seconds to parse in Python before any plots render.
  2. S3 FUSE mount: s3fs is fragile, adds latency, creates kernel-level failure modes, and doesn't support concurrent access patterns well.
  3. SQLite concurrency: Single-writer limitation causes lock contention with 2 worker processes (mitigated by WAL mode but not eliminated).
  4. Bokeh server-side rendering: Plots are generated server-side in Python, creating CPU-heavy page loads and making the frontend hard to extend.
  5. Monolithic architecture: Upload, processing, storage, and visualization are tightly coupled in a single process.
  6. No caching layer: Computed plot data is discarded after each request (aside from in-memory ULog cache).
  7. Single-instance deployment: No horizontal scaling, no container support, manual deployment via shell script.
  8. No authentication: The public instance is fully open; no mechanism for private deployments.

What Works Well (Keep These)

  • ULog processing pipeline: The analysis logic (35+ plot types, PID analysis, FFT/PSD spectrograms, vibration analysis) represents years of domain expertise
  • S3 as primary storage: Object storage for raw logs is the right pattern
  • Browse/search/statistics pages: The metadata-driven browse experience is useful
  • CloudFront CDN: Recently added for /dbinfo endpoint, works well
  • Overview image generation: Static PNG map thumbnails for browse page

Design Principles

  1. Process once, serve many: Parse and analyze each ULog file exactly once at upload time. Store results. Never re-parse for viewing.
  2. S3 API, not FUSE: Use the S3 SDK directly for all object storage operations. Pre-signed URLs for client-side downloads.
  3. Client-side rendering: Ship processed data to the browser; let the client render plots. Server serves JSON, not HTML.
  4. Plugin-friendly visualization: Make it trivial to add new plot types, key facts, and analysis modules without touching core code.
  5. Scale-agnostic deployment: Same codebase runs as a single Docker container for a team of 5 or as a distributed service for 350k+ logs.
  6. Offline-capable processing: The ULog processing engine should work as a standalone CLI tool, not just as part of the web service.

Recommended Architecture

                                    ┌─────────────────────┐
                                    │   CloudFront CDN    │
                                    │  (static assets +   │
                                    │   pre-signed S3)    │
                                    └──────────┬──────────┘
                                               │
┌──────────────┐                    ┌──────────▼──────────┐
│   Browser    │◄───────────────────│    API Gateway /    │
│              │     REST + WS      │    Reverse Proxy    │
│  SPA Client  │───────────────────►│       (nginx)       │
└──────────────┘                    └──────────┬──────────┘
                                               │
                          ┌────────────────────┼────────────────────┐
                          │                    │                    │
                ┌─────────▼────────┐  ┌────────▼────────┐  ┌────────▼───────┐
                │   API Service    │  │  Upload Worker  │  │  Processing    │
                │  (Rust / Axum)   │  │  (async upload  │  │  Worker(s)     │
                │                  │  │   + S3 put)     │  │  (ULog parse   │
                │  - Auth          │  │                 │  │   + analysis)  │
                │  - Browse/Search │  └────────┬────────┘  └────────┬───────┘
                │  - Serve plot    │           │                    │
                │    data (JSON)   │           │                    │
                │  - Pre-signed    │      ┌────▼────────────────────▼────┐
                │    S3 URLs       │      │          S3 Bucket           │
                └─────────┬────────┘      │                              │
                          │               │  /raw/{id}.ulg               │
                          │               │  /processed/{id}.json.zst    │
                    ┌─────▼──────┐        │  /thumbnails/{id}.png        │
                    │ PostgreSQL │        │  /cache/{id}/plots.json      │
                    │            │        └──────────────────────────────┘
                    │ - Logs     │
                    │ - Metadata │
                    │ - Summary  │
                    │   stats    │
                    │ - Users    │
                    │ - Tokens   │
                    └────────────┘

Component Breakdown

1. Backend: Rust with Axum

Why Rust: ULog parsing is CPU-bound binary processing -- exactly where Rust excels. A Rust backend can parse a 50MB ULog file in ~100-500ms vs 1-5s in Python (10x improvement). The single-binary deployment model eliminates Python dependency management headaches. Axum provides async HTTP with tower middleware for auth, rate limiting, and observability.

Why not other options:

  • Python (FastAPI): Would perpetuate the parsing performance problem. Fine for the API layer but not for processing.
  • Go: Good alternative, but lacks the zero-cost abstractions that make binary parsing ergonomic. No existing ULog parser ecosystem.
  • Node.js: Wrong tool for CPU-bound binary processing.

Key crates:

  • axum -- HTTP framework
  • aws-sdk-s3 -- S3 API (native, not FUSE)
  • sqlx -- PostgreSQL async driver
  • serde / serde_json -- serialization
  • tokio -- async runtime
  • ULog parser: Either port pyulog to Rust, extend ulog-rs/yule_log, or write a new one using mavsim-viewer's C parser as reference (439 LOC, clean architecture, already handles all message types)
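Whichever parser route is taken, the binary work starts with the 16-byte ULog file header. A std-only sketch of that first validation step, following the published ULog format (7-byte magic, version byte, little-endian start timestamp in microseconds) -- illustrative only, since a real parser continues into the definition and data sections:

```rust
#[derive(Debug, PartialEq)]
struct UlogHeader {
    version: u8,
    start_timestamp_us: u64,
}

// "ULog" followed by 0x01 0x12 0x35, per the ULog file format spec.
const ULOG_MAGIC: [u8; 7] = [0x55, 0x4C, 0x6F, 0x67, 0x01, 0x12, 0x35];

fn parse_ulog_header(bytes: &[u8]) -> Result<UlogHeader, String> {
    if bytes.len() < 16 {
        return Err("file shorter than 16-byte ULog header".into());
    }
    if bytes[..7] != ULOG_MAGIC {
        return Err("bad magic: not a ULog file".into());
    }
    let version = bytes[7];
    let mut ts = [0u8; 8];
    ts.copy_from_slice(&bytes[8..16]);
    Ok(UlogHeader {
        version,
        start_timestamp_us: u64::from_le_bytes(ts),
    })
}
```

Because only the first 16 bytes are needed, the same check can run in the upload path as soon as the first chunk arrives, rejecting non-ULog files before the full transfer completes.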

API surface:

POST   /api/upload              -- Upload ULog file (multipart)
GET    /api/logs                -- Browse/search/filter logs
GET    /api/logs/{id}           -- Log metadata + summary stats
GET    /api/logs/{id}/plots     -- Pre-computed plot data (JSON)
GET    /api/logs/{id}/download  -- Pre-signed S3 URL for raw .ulg
GET    /api/logs/{id}/pid       -- PID analysis data
GET    /api/logs/{id}/3d        -- 3D trajectory data
GET    /api/stats               -- Aggregate statistics
POST   /api/auth/login          -- Authentication (optional)
GET    /api/health              -- Health check

2. Database: PostgreSQL

Why PostgreSQL over alternatives:

  • Not TimescaleDB/InfluxDB/ClickHouse: We are NOT storing time-series data in the database. The raw ULog stays in S3, and processed plot data goes to S3 as compressed JSON. The database stores only scalar metadata, summary statistics, and user/auth data -- a perfect fit for vanilla PostgreSQL.
  • Not SQLite: Concurrent write access from multiple workers, proper connection pooling, full-text search, JSONB for flexible metadata, and proven horizontal scaling path.
  • Not DuckDB: Interesting for analytical queries but overkill for metadata storage. Could be used as an embedded engine in the processing worker for on-the-fly analysis, but PostgreSQL covers the primary need.

Schema (simplified):

-- users comes first so that logs.owner_id can reference it
CREATE TABLE users (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email           TEXT UNIQUE,
    password_hash   TEXT,
    role            TEXT DEFAULT 'user',  -- user, admin
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE logs (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    -- Upload metadata
    title           TEXT,
    description     TEXT,
    original_filename TEXT,
    uploaded_at     TIMESTAMPTZ DEFAULT NOW(),
    source          TEXT,        -- "webui", "qgc", "cli"
    email           TEXT,
    public          BOOLEAN DEFAULT TRUE,
    allow_analysis  BOOLEAN DEFAULT TRUE,
    -- User-provided context
    wind_speed      SMALLINT,
    rating          TEXT,
    feedback        TEXT,
    video_url       TEXT,
    error_labels    TEXT,
    -- Processing status
    status          TEXT DEFAULT 'pending',  -- pending, processing, ready, failed
    processed_at    TIMESTAMPTZ,
    -- S3 references
    s3_raw_key      TEXT NOT NULL,
    s3_processed_key TEXT,
    s3_thumbnail_key TEXT,
    -- Access control
    token           TEXT UNIQUE,
    owner_id        UUID REFERENCES users(id)
);

CREATE TABLE log_metadata (
    log_id          UUID PRIMARY KEY REFERENCES logs(id),
    -- Extracted from ULog (computed once at processing time)
    duration_s      INTEGER,
    mav_type        TEXT,
    estimator       TEXT,
    autostart_id    INTEGER,
    hardware        TEXT,
    software_version TEXT,
    software_hash   TEXT,
    vehicle_uuid    TEXT,
    start_time_utc  TIMESTAMPTZ,
    -- Error/warning counts
    num_errors      INTEGER DEFAULT 0,
    num_warnings    INTEGER DEFAULT 0,
    has_hardfault   BOOLEAN DEFAULT FALSE,
    file_corrupted  BOOLEAN DEFAULT FALSE,
    -- Flight modes
    flight_modes    JSONB,       -- [{mode: "POSCTL", duration_s: 120}, ...]
    -- Summary statistics (previously computed on every page view)
    total_distance_m    REAL,
    max_altitude_diff_m REAL,
    avg_speed_ms        REAL,
    max_speed_ms        REAL,
    max_speed_horiz_ms  REAL,
    max_speed_up_ms     REAL,
    max_speed_down_ms   REAL,
    max_tilt_deg        REAL,
    max_rotation_dps    REAL,
    avg_current_a       REAL,
    max_current_a       REAL,
    -- Vibration summary
    max_vibe_level      REAL,
    vibe_status         TEXT,    -- "good", "warning", "critical"
    -- GPS quality summary
    avg_satellites      REAL,
    min_fix_type        SMALLINT,
    -- Dropout summary
    dropout_count       INTEGER,
    dropout_total_ms    INTEGER,
    -- Searchable metadata (JSONB for flexibility)
    parameters          JSONB,   -- Non-default parameters
    info_messages       JSONB,   -- Key info messages
    -- Full-text search
    search_vector       TSVECTOR
);

CREATE TABLE vehicles (
    uuid            TEXT PRIMARY KEY,
    name            TEXT,
    total_flight_time_s BIGINT DEFAULT 0,
    latest_log_id   UUID REFERENCES logs(id)
);

3. Storage: S3 via API

Upload flow:

  1. Client requests a pre-signed upload URL from the API
  2. Client uploads directly to S3 (bypasses backend for large files)
  3. Client notifies API that upload is complete
  4. API enqueues processing job

For small deployments: MinIO provides S3-compatible object storage that runs alongside the app in a single Docker Compose setup.

Storage layout:

s3://bucket/
├── raw/
│   └── {log_id}.ulg                    # Original ULog file (immutable)
├── processed/
│   └── {log_id}.json.zst               # All plot data, compressed (~200-500KB)
├── thumbnails/
│   └── {log_id}.png                    # Overview map image
└── exports/
    └── {log_id}.kml                    # Optional KML export
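To keep this layout from being re-derived ad hoc in each service, one helper can own the key scheme. A small sketch -- the `ArtifactKind`/`s3_key` names are illustrative, not an existing API:

```rust
// Single source of truth for the S3 object layout shown above.
enum ArtifactKind {
    Raw,
    Processed,
    Thumbnail,
    KmlExport,
}

fn s3_key(log_id: &str, kind: ArtifactKind) -> String {
    match kind {
        ArtifactKind::Raw => format!("raw/{log_id}.ulg"),
        ArtifactKind::Processed => format!("processed/{log_id}.json.zst"),
        ArtifactKind::Thumbnail => format!("thumbnails/{log_id}.png"),
        ArtifactKind::KmlExport => format!("exports/{log_id}.kml"),
    }
}
```

The API service, upload worker, and processing worker would all call this one function, so a layout change is a one-line edit.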

Processed data format (the key innovation -- compute once, serve forever):

{
  "version": 1,
  "log_id": "abc-123",
  "computed_at": "2026-03-11T22:00:00Z",
  "plots": {
    "attitude_roll": {
      "type": "timeseries",
      "title": "Roll Angle",
      "unit": "deg",
      "series": [
        {"label": "Estimated", "timestamps": [...], "values": [...], "color": "#1f77b4"},
        {"label": "Setpoint", "timestamps": [...], "values": [...], "color": "#ff7f0e"}
      ],
      "flight_modes": [{"start": 0.0, "end": 12.5, "mode": "MANUAL"}, ...],
      "annotations": [{"time": 5.2, "text": "Param change: MC_ROLL_P=6.5"}]
    },
    "fft_actuator_controls": {
      "type": "spectrogram",
      "title": "Actuator Controls FFT",
      "frequencies": [...],
      "magnitudes": [...],
      "markers": [{"freq": 80, "label": "MC_DTERM_CUTOFF"}]
    },
    "gps_track": {
      "type": "map",
      "coordinates": [[lat, lon, alt], ...],
      "flight_mode_segments": [...]
    },
    "trajectory_3d": {
      "type": "trajectory",
      "positions": [[x, y, z], ...],
      "quaternions": [[w, x, y, z], ...],
      "timestamps": [...],
      "vehicle_type": "quadrotor"
    }
  },
  "key_facts": {
    "vibration": {"status": "good", "max_level": 3.2, "unit": "m/s^2"},
    "gps_quality": {"avg_sats": 14, "min_fix": 3},
    "battery": {"voltage_start": 16.2, "voltage_end": 14.8, "mah_used": 1200},
    "flight_modes": [{"mode": "POSCTL", "duration_s": 120, "pct": 80}],
    "errors": [],
    "warnings": ["High vibration on IMU 2"]
  },
  "tables": {
    "parameters": [...],
    "messages": [...],
    "perf_counters": [...]
  }
}

This pre-computed JSON eliminates the need to ever re-parse the ULog for viewing. At ~200-500KB compressed (vs 2-90MB raw ULog), it's fast to fetch and cheap to store. For 350k logs, total cache size would be ~70-175GB -- trivial for S3.
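The client dispatches rendering on each plot's "type" discriminator, so the processing side should reject unknown kinds at write time rather than ship them to browsers. A minimal std-only sketch of that strict mapping (names illustrative):

```rust
// Strict mapping of the processed-JSON "type" field to renderer kinds.
#[derive(Debug, PartialEq, Clone, Copy)]
enum PlotKind {
    Timeseries,
    Spectrogram,
    Map,
    Trajectory,
}

fn parse_plot_kind(s: &str) -> Option<PlotKind> {
    match s {
        "timeseries" => Some(PlotKind::Timeseries),
        "spectrogram" => Some(PlotKind::Spectrogram),
        "map" => Some(PlotKind::Map),
        "trajectory" => Some(PlotKind::Trajectory),
        _ => None, // unknown kind: fail the processing run, don't emit it
    }
}
```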

4. Frontend: React + TypeScript SPA

Why React: Largest ecosystem for data visualization components, strong TypeScript support, and the most contributors will be familiar with it. Vue or Svelte are viable alternatives but React maximizes contributor pool for an open-source project.

Charting: Apache ECharts

After evaluating options:

Library          Bundle Size             Max Points  GPU Accel      Extensibility  Community
uPlot            45KB                    10M+        No (Canvas2D)  Low            Small
Apache ECharts   300KB (tree-shakeable)  10M+        Yes (WebGL)    High           Very large
Plotly.js        3.5MB                   100K        Limited        Medium         Large
D3               90KB                    Varies      Manual         Very high      Very large

Apache ECharts wins because:

  • WebGL renderer handles millions of points smoothly (critical for high-rate IMU/FIFO data)
  • Built-in support for linked/synchronized time axes across multiple plots
  • Native support for spectrograms, heatmaps, and scatter plots
  • Large-array optimization with sampling and progressive rendering
  • Extensive theming and customization
  • Tree-shakeable: only import the chart types you use
  • Active development with strong community (Apache Foundation)

uPlot is faster and lighter but lacks spectrogram support and has limited extensibility. Plotly.js is too heavy and struggles with >100K points.

3D Flight Replay: Three.js Web Component

Drawing from mavsim-viewer's architecture (which cleanly separates data processing from rendering):

  • Port mavsim-viewer's ULog replay engine logic (~470 LOC) to TypeScript
  • Use Three.js for 3D rendering (most mature WebGL library)
  • Port the vehicle model registry (8 models across 6 types) as glTF assets
  • Implement the dead-reckoning interpolation for smooth 60fps playback
  • Trail rendering with speed-based coloring (ring buffer, adaptive sampling)
  • Chase camera + FPV camera modes
  • Expose as a <flight-replay> web component or React component

The mavsim-viewer C codebase provides exact specifications for:

  • Coordinate transforms (NED to rendering frame)
  • Quaternion handling and interpolation
  • Flight mode transition tracking (up to 256 changes)
  • Playback controls (0.25x to 16x speed, seek, loop)
  • Trail sampling parameters (1800 points, 16ms interval, 1cm distance threshold)

5. Plugin / Extension System

Borrowing from PlotJuggler's architecture (which supports 20+ plugins across data loaders, transforms, and visualizers), the frontend should support a simple plugin registry:

// Plugin definition
interface FlightReviewPlugin {
  id: string;
  name: string;
  version: string;
  // What data this plugin needs from the processed JSON
  requiredPlots?: string[];
  requiredKeyFacts?: string[];
  // Components
  panels?: PanelPlugin[];        // Full panel in the plot area
  keyFacts?: KeyFactPlugin[];    // Cards in the summary section
  transforms?: TransformPlugin[]; // Client-side data transforms
}

interface PanelPlugin {
  id: string;
  title: string;
  category: string;  // "attitude", "position", "power", "sensors", "custom"
  component: React.ComponentType<{data: PlotData, config: any}>;
  // Optional: server-side processing hint
  processorId?: string;
}

interface KeyFactPlugin {
  id: string;
  title: string;
  component: React.ComponentType<{facts: KeyFacts}>;
  priority: number;  // Display order
}

// Registration
registerPlugin({
  id: "vibration-analysis",
  name: "Vibration Analysis",
  panels: [{
    id: "vibe-spectrum",
    title: "Vibration Spectrum",
    category: "sensors",
    component: VibrationSpectrumPanel,
  }],
  keyFacts: [{
    id: "vibe-summary",
    title: "Vibration Health",
    component: VibrationSummaryCard,
    priority: 10,
  }],
});

For the backend processing pipeline, a similar plugin system allows adding new analysis modules:

// Backend processing plugin trait
trait AnalysisPlugin: Send + Sync {
    fn id(&self) -> &str;
    fn name(&self) -> &str;
    /// Which ULog topics this plugin needs
    fn required_topics(&self) -> &[&str];
    /// Process ULog data and return plot data + key facts
    fn process(&self, ulog: &ParsedULog) -> Result<PluginOutput>;
}

struct PluginOutput {
    plots: HashMap<String, PlotData>,
    key_facts: HashMap<String, serde_json::Value>,
    tables: HashMap<String, TableData>,
}

Built-in plugins would cover all current flight review functionality (attitude, position, power, sensors, FFT, PID analysis, etc.), and the community could add new ones without modifying core code.
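To make the trait concrete, here is a self-contained toy plugin with the ULog and serde types stubbed out as plain std types. Every name here is illustrative of the shape, not the real pipeline API:

```rust
use std::collections::HashMap;

// Stub: topic name -> series of (timestamp_s, value) samples.
struct ParsedUlog {
    topics: HashMap<String, Vec<(f64, f64)>>,
}

struct PluginOutput {
    key_facts: HashMap<String, String>,
}

trait AnalysisPlugin {
    fn id(&self) -> &str;
    fn required_topics(&self) -> &[&str];
    fn process(&self, ulog: &ParsedUlog) -> Result<PluginOutput, String>;
}

// Toy plugin: report the maximum value seen on one topic as a key fact.
struct MaxTiltPlugin;

impl AnalysisPlugin for MaxTiltPlugin {
    fn id(&self) -> &str { "max-tilt" }
    fn required_topics(&self) -> &[&str] { &["vehicle_attitude"] }

    fn process(&self, ulog: &ParsedUlog) -> Result<PluginOutput, String> {
        let series = ulog
            .topics
            .get("vehicle_attitude")
            .ok_or("missing topic vehicle_attitude")?;
        let max_tilt = series.iter().map(|&(_, v)| v).fold(f64::MIN, f64::max);
        let mut key_facts = HashMap::new();
        key_facts.insert("max_tilt_deg".into(), format!("{max_tilt:.1}"));
        Ok(PluginOutput { key_facts })
    }
}
```

The worker would hold a `Vec<Box<dyn AnalysisPlugin>>`, skip plugins whose required topics are absent from the log, and merge the outputs into the processed JSON.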

6. Authentication & Multi-tenancy

For the public Dronecode instance: Anonymous uploads continue as today, with optional user accounts for managing your own logs.

For private deployments: Simple auth with configurable backends:

# config.yaml
auth:
  enabled: true
  provider: "local"           # local, oidc, ldap
  require_login_to_view: true
  require_login_to_upload: true
  # For OIDC (Google, GitHub, Okta, etc.)
  oidc:
    issuer: "https://accounts.google.com"
    client_id: "..."
    client_secret: "..."

Implementation: JWT-based session tokens. The users table is optional -- when auth is disabled, the system behaves exactly like today's public instance.

7. Deployment

Single-container deployment (small teams):

# docker-compose.yml
services:
  flight-review:
    image: ghcr.io/px4/flight-review-next:latest
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: "postgres://fr:fr@db/flight_review"
      S3_ENDPOINT: "http://minio:9000"
      S3_BUCKET: "flight-review"
      S3_ACCESS_KEY: "minioadmin"
      S3_SECRET_KEY: "minioadmin"
    depends_on:
      - db
      - minio

  db:
    image: postgres:16-alpine
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: flight_review
      POSTGRES_USER: fr
      POSTGRES_PASSWORD: fr

  minio:
    image: minio/minio
    command: server /data
    volumes:
      - s3data:/data

volumes:
  pgdata:
  s3data:

One docker compose up and you have a fully functional private instance. No nginx, no FUSE mounts, no shell scripts.

Production deployment (Dronecode scale):

  • Same containers, but PostgreSQL on RDS, S3 on AWS, and multiple API/worker replicas behind an ALB
  • Horizontal scaling: add more processing workers for upload bursts
  • CloudFront for static assets and pre-signed S3 URLs
  • Optional: Redis/SQS for job queue (or use PostgreSQL LISTEN/NOTIFY for simplicity)

Processing Pipeline Detail

Upload Flow

Client                    API                      S3                    Worker
  │                        │                       │                       │
  │── POST /api/upload ───►│                       │                       │
  │◄── presigned URL ──────│                       │                       │
  │                        │                       │                       │
  │── PUT (direct S3) ────────────────────────────►│                       │
  │                        │                       │                       │
  │── POST /api/upload/ ──►│                       │                       │
  │   complete             │                       │                       │
  │   {s3_key, metadata}   │                       │                       │
  │                        │── INSERT log ────────►│ (PostgreSQL)          │
  │                        │── enqueue job ───────────────────────────────►│
  │◄── 202 {log_id} ───────│                       │                       │
  │                        │                       │                       │
  │                        │                       │     ┌────────────┐    │
  │                        │                       │◄────│ Download   │    │
  │                        │                       │     │ raw .ulg   │    │
  │                        │                       │     │            │    │
  │                        │                       │     │ Parse ULog │    │
  │                        │                       │     │            │    │
  │                        │                       │     │ Run all    │    │
  │                        │                       │     │ analysis   │    │
  │                        │                       │     │ plugins    │    │
  │                        │                       │     │            │    │
  │                        │                       │◄────│ Upload     │    │
  │                        │                       │     │ processed  │    │
  │                        │                       │     │ JSON + PNG │    │
  │                        │                       │     └────────────┘    │
  │                        │◄── UPDATE status='ready' ─────────────────────│
  │                        │                       │                       │

Processing Steps (per log)

  1. Download raw ULog from S3 (~2-90MB, up to ~2.7GB for extreme cases like 15-hour flights)
  2. Parse ULog header, definitions, subscriptions (streaming parser for large files)
  3. Extract metadata: vehicle type, hardware, software, parameters, info messages
  4. Compute summary statistics: distance, speed, altitude, tilt, current, vibration levels
  5. Generate time-series plot data: For each of the ~35 plot types, extract the relevant topic data, apply transforms (unit conversion, filtering, FFT), and produce downsampled series at multiple resolution tiers
  6. Generate spectrogram data: FFT/PSD for actuator controls, angular velocity, angular acceleration
  7. Generate map data: GPS coordinates with flight mode segments
  8. Generate 3D trajectory data: Positions, quaternions, timestamps for the flight replay component
  9. Generate overview thumbnail: Static map image (can use server-side rendering or delegate to a headless browser)
  10. Compress and upload processed JSON (zstd compression) + thumbnail to S3
  11. Update database with metadata, summary stats, and status='ready'
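The steps above walk each log through the status lifecycle from the schema (pending → processing → ready/failed). A minimal transition guard, std-only and illustrative -- note the Failed → Pending edge is an assumed retry path, not something the plan above specifies; in practice this check would live in the UPDATE query's WHERE clause:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum LogStatus {
    Pending,
    Processing,
    Ready,
    Failed,
}

// Legal moves through the processing lifecycle.
fn can_transition(from: LogStatus, to: LogStatus) -> bool {
    use LogStatus::*;
    matches!(
        (from, to),
        (Pending, Processing)
            | (Processing, Ready)
            | (Processing, Failed)
            | (Failed, Pending) // assumed retry edge
    )
}
```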

Total processing time target: <5 seconds for a typical 10-minute flight log. For very long logs (1h+), <30 seconds. For extreme logs (15h), <2 minutes.

Handling Very Large Logs (Up to 15 Hours)

The largest known log in the current dataset is a 15-hour flight. At ~50 KB/s default logging rate, this produces a ~2.7 GB ULog file with ~173 million data points across ~45 topics (13.5M points per 250Hz topic). This has significant implications:

Processing worker requirements:

  • The Rust ULog parser MUST use streaming/mmap parsing -- loading 2.7 GB entirely into RAM is unacceptable
  • Processing worker memory budget: 4 GB max, enforced via configurable limit
  • Configurable max file size (default 5 GB) to reject pathological inputs
  • Large logs should be priority-queued separately to avoid blocking the worker for short uploads
  • Processing time scales roughly linearly: ~10-20s in Rust for a 15h log (vs 2-5 minutes in Python)

Current system has zero guards for large logs:

  • Nginx limit: 100 MB (client_max_body_size), Tornado buffer: 300 MB -- both would reject even a 1-hour log
  • No memory guards in pyulog parsing (loads everything into RAM)
  • LRU cache of 8 parsed ULog objects has no size-in-bytes awareness -- eight 15h logs would OOM the server
  • Downsampling uses naive every-Nth-sample decimation, not LTTB
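The cache failure mode in particular is cheap to rule out in the new design: evict by total bytes, not entry count. A std-only sketch of a byte-budgeted LRU (illustrative; a real cache would also index entries by key and refresh recency on reads):

```rust
use std::collections::VecDeque;

// LRU with a byte budget: inserting evicts oldest entries until the
// new entry fits, so eight 15-hour logs can never exhaust memory.
struct ByteLru {
    budget_bytes: usize,
    used_bytes: usize,
    entries: VecDeque<(String, usize)>, // (log id, parsed size), oldest first
}

impl ByteLru {
    fn new(budget_bytes: usize) -> Self {
        Self { budget_bytes, used_bytes: 0, entries: VecDeque::new() }
    }

    fn insert(&mut self, key: &str, size_bytes: usize) {
        while self.used_bytes + size_bytes > self.budget_bytes {
            match self.entries.pop_front() {
                Some((_, evicted)) => self.used_bytes -= evicted,
                // Single entry larger than the whole budget: don't cache it.
                None => return,
            }
        }
        self.entries.push_back((key.to_string(), size_bytes));
        self.used_bytes += size_bytes;
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}
```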

Upload Paths & UX

Important context: The majority of users upload through the web form on the website, not via QGroundControl auto-upload. The upload flow must prioritize the web UI experience:

  1. Web form upload (primary): Multipart POST directly to the backend API. For files under ~100 MB (the vast majority of uploads), this is simple and fast. For large files (>100 MB), use chunked upload with progress indication.
  2. QGroundControl auto-upload (secondary): Must maintain API compatibility with the current QGC upload endpoint format.
  3. Pre-signed S3 upload (for very large files only): For files >500 MB, the API can optionally provide a pre-signed S3 URL for direct upload, bypassing the backend. This is an optimization, not the default path.

The web upload form should show:

  • Upload progress bar with speed and ETA
  • File validation (is it a valid ULog?) as soon as the header bytes arrive
  • Processing status ("Uploading... → Processing... → Ready") with live updates via SSE or polling
  • Link to the log page as soon as processing completes

Downsampling Strategy

Raw ULog data can have millions of points (e.g., sensor_combined at 250Hz for 15 hours = 13.5M points per axis). The processed JSON should contain intelligently downsampled data at multiple resolution tiers:

  • LTTB (Largest Triangle Three Buckets): Preserves visual shape while aggressively reducing point count. The point budget scales with log duration:
    • Logs up to 10 minutes: 4,000 points per series
    • Longer logs: min(max(4000, duration_minutes * 35), 30000) points
    • 15-hour log: hits the 30,000-point cap per series (~5 MB per series, ~75 MB total uncompressed, ~12 MB compressed)
  • Hierarchical tiers (pre-computed, stored in S3):
    • Tier 1 (overview): LTTB-downsampled as above -- used for initial page load
    • Tier 2 (medium zoom): 10x the overview point count, capped at 200K per series
    • Tier 3 (full resolution): on-demand endpoint that reads the raw ULog for a specific time range
  • Full resolution on demand: For zoomed-in views, the client requests a specific time range at full resolution from a secondary endpoint. This avoids storing full-res data (which would be ~400-600 MB compressed for a 15h log) while still supporting deep inspection.
  • FFT/PSD data: Store at native resolution (frequency domain is already compact).
  • Map coordinates: Downsample to ~1000 points using Ramer-Douglas-Peucker.
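The point budget and the Tier 1 LTTB pass can be sketched in a few dozen lines of std-only Rust. This is an illustrative port of the standard LTTB algorithm over (timestamp, value) samples, not tuned production code:

```rust
// Budget: 4,000 points up to 10 minutes, then min(max(4000, min*35), 30000).
fn point_budget(duration_minutes: f64) -> usize {
    if duration_minutes <= 10.0 {
        return 4_000;
    }
    (duration_minutes * 35.0).max(4_000.0).min(30_000.0) as usize
}

/// Largest Triangle Three Buckets: keep `threshold` visually representative
/// points, always retaining the first and last samples.
fn lttb(data: &[(f64, f64)], threshold: usize) -> Vec<(f64, f64)> {
    let n = data.len();
    if threshold >= n || threshold < 3 {
        return data.to_vec();
    }
    let mut out = Vec::with_capacity(threshold);
    out.push(data[0]);
    // Bucket width between the fixed first and last points.
    let every = (n - 2) as f64 / (threshold - 2) as f64;
    let mut a = 0usize; // index of the previously selected point
    for i in 0..threshold - 2 {
        // Average of the *next* bucket serves as the third triangle vertex.
        let next_start = ((i + 1) as f64 * every) as usize + 1;
        let next_end = (((i + 2) as f64 * every) as usize + 1).min(n);
        let next = &data[next_start..next_end];
        let avg_t = next.iter().map(|p| p.0).sum::<f64>() / next.len() as f64;
        let avg_v = next.iter().map(|p| p.1).sum::<f64>() / next.len() as f64;
        // Keep the point in the current bucket with the largest triangle area.
        let start = (i as f64 * every) as usize + 1;
        let end = (((i + 1) as f64 * every) as usize + 1).min(n - 1);
        let (pa_t, pa_v) = data[a];
        let mut best = start;
        let mut best_area = -1.0;
        for j in start..end {
            let area = ((pa_t - avg_t) * (data[j].1 - pa_v)
                - (pa_t - data[j].0) * (avg_v - pa_v))
                .abs();
            if area > best_area {
                best_area = area;
                best = j;
            }
        }
        out.push(data[best]);
        a = best;
    }
    out.push(data[n - 1]);
    out
}
```

Because LTTB maximizes triangle area against the previously kept point, isolated spikes (exactly the samples a flight-log reviewer cares about) survive even aggressive reduction, unlike every-Nth decimation.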

Migration Strategy

Phase 1: Core Infrastructure (Months 1-2)

  • Rust backend with Axum: health check, upload, S3 integration (API-based)
  • PostgreSQL schema and migrations
  • ULog parser in Rust (port from pyulog/mavsim-viewer reference)
  • Basic processing pipeline: parse ULog, extract metadata, store to DB
  • Docker Compose setup with MinIO
  • CI/CD pipeline

Phase 2: Processing Engine (Months 2-4)

  • Port all 35+ plot types from configured_plots.py to Rust analysis plugins
  • Implement FFT/PSD analysis (use rustfft crate)
  • Implement summary statistics computation
  • Pre-computed JSON generation with LTTB downsampling
  • Overview thumbnail generation
  • Processing worker with job queue

Phase 3: Frontend MVP (Months 3-5)

  • React SPA with TypeScript
  • Browse/search page with filtering
  • Log detail page with all core plots (ECharts)
  • Synchronized time axes across plots
  • Flight mode background coloring
  • GPS map view (Leaflet or Mapbox GL)
  • Parameter table, logged messages
  • Responsive design

Phase 4: Advanced Features (Months 5-7)

  • 3D flight replay component (Three.js, ported from mavsim-viewer)
  • PID analysis page
  • Plugin system for frontend panels and key facts
  • Authentication system (local + OIDC)
  • Full-resolution zoom endpoint
  • KML export
  • Statistics/analytics page
  • Dark mode

Phase 5: Production Migration (Months 7-8)

  • Bulk re-process existing 350k logs (parallel workers on AWS)
  • Data migration from SQLite to PostgreSQL
  • DNS cutover with nginx redirect for old URLs
  • Monitoring and alerting setup
  • Documentation and contributor guide

Parallel Workstream: Data Migration

The 350k existing logs can be re-processed in parallel. At 5 seconds per log with 10 workers, this takes ~48 hours. The migration can run alongside the old system, with a read-only bridge serving old logs until re-processing completes.


Key Design Decisions Borrowed from PlotJuggler

PlotJuggler's architecture (C++/Qt, 20+ plugins, handles millions of points at 60fps) provides several patterns worth adopting:

  1. Lazy range computation: Don't compute min/max for all data upfront. Cache ranges and invalidate on data change. Critical for responsive zoom/pan.
  2. Deque-based storage with dirty flags: PlotJuggler uses std::deque with lazy range caching. The web equivalent: typed arrays with cached bounds, recomputed only when the visible window changes.
  3. Plugin architecture: PlotJuggler's DataLoader, DataStreamer, TransformFunction, and StatePublisher interfaces cleanly separate concerns. Our AnalysisPlugin (backend) and PanelPlugin (frontend) follow the same pattern.
  4. Transform composition: PlotJuggler supports chaining transforms (derivative -> moving average -> outlier removal). ECharts supports client-side transforms, and the backend can pre-compute common ones.
  5. Group-based organization: PlotJuggler groups related series (e.g., all IMU measurements) with shared visibility controls. The frontend should do the same.
  6. WASM plugin potential: PlotJuggler is experimenting with WASM plugins. A future version of Flight Review Next could support user-provided WASM analysis modules that run in the browser.

Key Design Decisions Borrowed from mavsim-viewer

mavsim-viewer's clean C architecture (~5,560 LOC total) provides exact specifications for the 3D replay component:

  1. Data source abstraction: Polymorphic data_source_t with vtable. Port directly to TypeScript abstract class with ReplayDataSource and potential LiveDataSource implementations.
  2. Dead-reckoning interpolation: Essential for smooth 60fps playback from 5-10Hz position data. Linear interpolation: pos = pos_last + vel * dt.
  3. Adaptive trail sampling: Ring buffer of 1800 points, sampled at 16ms intervals with 1cm minimum distance. Prevents memory bloat while maintaining visual fidelity.
  4. Speed ribbon coloring: Trail colored by speed (blue=slow, green=medium, red=fast). Normalized against running max speed.
  5. Seek index: Sparse timestamp index (1 entry per second) enables O(log n) seeking in large logs. Build during initial parse.
  6. Vehicle model registry: 8 models across 6 types with per-model scale and orientation offsets. Ship as glTF assets for the web version.
  7. Camera modes: Chase (orbit around vehicle) and FPV (vehicle-mounted gimbal). Both transfer directly to Three.js camera controls.
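Items 2 and 5 above can be sketched in TypeScript. This is a hedged illustration of the described patterns, not ported mavsim-viewer code; the `Sample` shape and function names are assumptions made for this sketch.

```typescript
// Dead-reckoning and sparse seek index, as described above (names hypothetical).
interface Sample {
  t: number;                       // seconds
  pos: [number, number, number];   // meters, local frame
  vel: [number, number, number];   // m/s
}

/** Dead-reckon from the last received sample: pos = pos_last + vel * dt. */
function deadReckon(last: Sample, t: number): [number, number, number] {
  const dt = t - last.t;
  return [
    last.pos[0] + last.vel[0] * dt,
    last.pos[1] + last.vel[1] * dt,
    last.pos[2] + last.vel[2] * dt,
  ];
}

/** Sparse seek index: one sample offset per second of log time. */
function buildSeekIndex(samples: Sample[]): number[] {
  const index: number[] = [];
  let nextSecond = 0;
  for (let i = 0; i < samples.length; i++) {
    if (samples[i].t >= nextSecond) {
      index.push(i);
      nextSecond = Math.floor(samples[i].t) + 1;
    }
  }
  return index;
}

/** O(log n) seek: largest index entry whose timestamp is <= t. */
function seek(index: number[], samples: Sample[], t: number): number {
  let lo = 0;
  let hi = index.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (samples[index[mid]].t <= t) lo = mid;
    else hi = mid - 1;
  }
  return index[lo];
}
```

With 5-10Hz position data, `deadReckon` fills the gaps between samples at the 60fps render rate, and `seek` jumps the playback cursor without scanning the whole log.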

Frontend Component Architecture

<App>
├── <Header>
│   ├── <SearchBar>
│   └── <UserMenu>
├── <Routes>
│   ├── <BrowsePage>
│   │   ├── <FilterSidebar>
│   │   ├── <LogGrid>
│   │   │   └── <LogCard> (thumbnail, key facts, duration, vehicle type)
│   │   └── <Pagination>
│   ├── <LogDetailPage>
│   │   ├── <KeyFactsBar>
│   │   │   ├── <VibrationCard>
│   │   │   ├── <GPSQualityCard>
│   │   │   ├── <BatteryCard>
│   │   │   ├── <FlightModesCard>
│   │   │   └── <PluginKeyFactCards...>
│   │   ├── <InfoTable>
│   │   ├── <PlotContainer>
│   │   │   ├── <TimeSeriesPlot>       (ECharts, synchronized axes)
│   │   │   ├── <SpectrogramPlot>      (ECharts heatmap)
│   │   │   ├── <MapPanel>             (Leaflet/Mapbox)
│   │   │   ├── <FlightReplay3D>       (Three.js)
│   │   │   └── <PluginPanels...>
│   │   ├── <ParameterTable>
│   │   ├── <MessagesTable>
│   │   └── <CollapsibleSections>
│   │       ├── <PerfCounters>
│   │       └── <BootConsole>
│   ├── <PIDAnalysisPage>
│   ├── <StatisticsPage>
│   └── <UploadPage>
└── <Footer>

Key UX Improvements Over Current Flight Review

  1. Instant page loads: Pre-computed data loads in <500ms vs 3-10 seconds today
  2. Synchronized cursors: Hover on one plot, see the corresponding time on all plots and the 3D view
  3. Key facts dashboard: At-a-glance vibration health, GPS quality, battery status, flight modes -- visible immediately without scrolling through 35 plots
  4. Collapsible plot categories: Users see what they care about first (attitude, position, power) and can expand advanced sections (FFT, PSD, estimator flags)
  5. 3D flight replay: Interactive replay with playback controls, not just a static 3D trajectory view
  6. Deep linking: Every plot section has a URL hash for sharing specific views
  7. Mobile-responsive: Card-based layout that works on tablets and phones
  8. Dark mode: Because developers love dark mode

Comparison with Alternatives

Foxglove

Foxglove is a commercial robotics visualization platform that supports ULog. It's excellent for interactive exploration but:

  • Commercial product (free tier has limits)
  • Not self-hostable (cloud-only for team features)
  • General-purpose (not PX4-specific key facts and analysis)
  • No community-driven analysis logic (FFT cutoff markers, vibration thresholds, PID analysis)

Flight Review Next would complement Foxglove: users who want deep PX4-specific analysis use Flight Review; users who want general-purpose exploration can export to Foxglove.

PlotJuggler

Excellent desktop tool but:

  • Desktop-only (no web sharing)
  • No persistent storage or team collaboration
  • No PX4-specific key facts or summary statistics
  • No automated analysis pipeline

Flight Review Next would serve a different need: cloud-first, shareable, with automated PX4-specific analysis.

Grafana-Based Approach: Full Analysis

A Grafana-based solution was evaluated as an alternative to building a custom frontend. Two variants were considered: (A) using Grafana as-is with existing panels, and (B) building custom Grafana panels for the missing visualization types.

What Grafana provides out of the box

  • Built-in time-series panels are polished and performant (~60% of Flight Review's plots)
  • Synchronized crosshairs across all panels work natively (Single/All tooltip modes)
  • Dashboard JSON model + provisioning API: one JSON template serves all logs via ?var-log_id=XXX
  • Geomap panel handles GPS tracks with route layers
  • Annotation system can represent flight mode changes as colored regions
  • Table panel handles parameter tables and logged messages
  • Built-in auth with OAuth2, LDAP, SAML, org-based multi-tenancy, role-based permissions
  • Dashboard sharing, snapshot export, alerting
  • Battle-tested: 67.5k GitHub stars, 25M+ users
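The dashboard-as-code point above can be sketched as a dashboard JSON fragment: one template serves every log by parameterizing a `log_id` variable, supplied via `?var-log_id=XXX`. The panel shape is abbreviated and the table/column names in the query are illustrative assumptions, not a real schema.

```json
{
  "templating": {
    "list": [
      { "name": "log_id", "type": "constant", "query": "" }
    ]
  },
  "panels": [
    {
      "title": "Roll Angle",
      "type": "timeseries",
      "targets": [
        {
          "rawSql": "SELECT time, roll, roll_setpoint FROM attitude WHERE log_id = '$log_id' ORDER BY time"
        }
      ]
    }
  ]
}
```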

What's missing (would need custom panels)

| Visualization | Grafana Status | Custom Panel Effort |
|---|---|---|
| FFT with filter cutoff markers | No panel exists | Medium (2-3 weeks). TypeScript panel, FFT data pre-computed server-side |
| PSD Spectrogram | No panel (heatmap is for histograms) | Medium-Hard (3-4 weeks). WebGL heatmap with freq/time axes |
| PID step response | Nothing close | Hard (4-6 weeks). Wiener deconvolution results, Bode plots |
| 3D flight trajectory | One limited community plugin | Medium (3-4 weeks). Three.js panel with vehicle replay |
| Key facts dashboard | Stat panels exist but clunky | Easy (1 week). Custom panel with cards layout |

Total custom panel development: ~13-18 weeks (3-4 months) for the missing visualizations.

Option A: Grafana as-is (rejected)

Using only built-in panels means losing FFT, spectrograms, PID analysis, and 3D trajectory -- the features that differentiate Flight Review. Rejected.

Option B: Grafana + Custom Panels (viable alternative)

Build 4-5 custom Grafana panel plugins and use Grafana as the entire visualization layer.

Architecture:

┌──────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Custom App  │     │     Grafana      │     │   Backend API   │
│  (Browse,    │────►│  (All plotting)  │────►│  (Rust/Axum)    │
│   Upload,    │     │                  │     │                 │
│   Key Facts) │     │  Built-in panels │     │  - ULog parse   │
│              │     │  + Custom panels │     │  - TSDB ingest  │
│  React SPA   │     │  - FFT panel     │     │  - S3 storage   │
└──────────────┘     │  - Spectrogram   │     └────────┬────────┘
                     │  - PID panel     │              │
                     │  - 3D replay     │     ┌────────▼────────┐
                     └────────┬─────────┘     │  TimescaleDB    │
                              └──────────────►│  (time-series)  │
                                              └─────────────────┘

Pros:

  1. Massive head start on time-series. ~25 time-series plots work out of the box. Cursor sync, zoom, pan, legend, annotations -- all free.
  2. Dashboard-as-code. One JSON template serves all logs. No React component tree for the plotting layer.
  3. Auth is solved. Grafana's built-in OAuth2, LDAP, and org-based multi-tenancy cover both the public instance and private deployments.
  4. Familiar to operations teams. Many orgs already run Grafana. Flight log dashboards are a natural extension.
  5. Panel plugin SDK is mature. TypeScript + React, well-documented, hot reload.
  6. Community contribution model. People can contribute Grafana panel plugins without touching the core backend.

Cons:

  1. Requires a TSDB. Grafana queries a datasource, not JSON files. Parsed ULog data must be ingested into TimescaleDB. On-demand ingestion adds 3-10 seconds cold-start per log. Pre-ingesting 350k logs: ~16TB compressed (impractical). For a 15-hour log, on-demand ingest means writing ~173M data points before the dashboard renders.
  2. Two+ services always. Grafana + TSDB + custom app. Kills the "single binary on a Raspberry Pi" deployment tier.
  3. Embedding UX friction. Browse app links to Grafana dashboards. Looks like two different apps. Grafana Cloud does NOT support embedding; only self-hosted OSS.
  4. 35+ panels = heavy. Each panel fires independent queries. 120+ queries to TimescaleDB on page load for a 15-hour log.
  5. Plugin maintenance burden. Grafana's plugin API changes between major versions (~2/year). Custom panels need ongoing testing.
  6. No offline/static export. Pre-computed JSON can generate static HTML reports. Grafana requires a live server.

Decision Framework

| Factor | Custom Frontend (React + ECharts) | Grafana + Custom Panels |
|---|---|---|
| Time to MVP | 8-10 weeks (build everything) | 6-8 weeks (time-series free, build custom panels + ingest) |
| Time-series quality | Good (ECharts is solid) | Excellent (Grafana is best-in-class) |
| FFT/Spectrogram | Build in ECharts (~2 weeks) | Build as Grafana plugin (~3-4 weeks, more boilerplate) |
| Deployment simplicity | Single binary possible | Always needs Grafana + TSDB (min 3 services) |
| Small team / Pi / air-gap | Works everywhere | Impractical |
| Large deployment (Dronecode) | More custom code to maintain | Leverages Grafana's maturity |
| Auth | Must build | Free |
| 15-hour log handling | Pre-computed JSON, instant load | TSDB ingest of 173M points, cold-start latency |
| Contributor model | Fork + PR | Separate plugin repos |
| UX cohesion | Fully cohesive | Two-app feel |

Recommendation: Build Both Paths, Share the Backend

Rather than choosing one, the Rust backend API can serve data two ways:

  1. REST/JSON endpoint (GET /api/logs/{id}/plots) → consumed by the custom React frontend (default)
  2. TimescaleDB ingestion (on-demand) → consumed by Grafana's PostgreSQL/TimescaleDB datasource (optional)

The custom React frontend is the default for all deployment tiers. Grafana dashboards are an optional, documented alternative for organizations that already run Grafana. Same backend, same processing pipeline, same data -- just different consumers.

Custom Grafana panels (FFT, spectrogram, PID, 3D) can be developed as community contributions since they are standalone plugins with no coupling to the core app. This is a natural contribution path for organizations already invested in Grafana.

Where Grafana is definitely used: As the monitoring dashboard for Flight Review's own infrastructure (API latency, queue depth, error rates, S3 metrics).


Resource Estimates

Compute (Dronecode production instance)

| Component | CPU | RAM | Instances |
|---|---|---|---|
| API service | 1 vCPU | 512MB | 2 |
| Processing worker | 2 vCPU | 2GB | 2-4 |
| PostgreSQL | 2 vCPU | 4GB | 1 (RDS) |
| Total | 8-12 vCPU | 9-13GB | - |

Comparable to current single-instance (15GB RAM) but with much better utilization.

Storage (350k logs)

| Type | Size | Cost/month |
|---|---|---|
| Raw ULog files (existing) | ~5TB | ~$115 (S3) |
| Processed JSON cache | ~100GB | ~$2.30 (S3) |
| Thumbnails | ~10GB | ~$0.23 (S3) |
| PostgreSQL | ~5GB | ~$15 (RDS db.t3.medium) |
| Total | ~5.1TB | ~$133/month |

Small team deployment

A team with 100 logs needs: 1 container (~512MB RAM), embedded PostgreSQL or SQLite-mode, MinIO or local disk. Runs on a $5/month VPS or a Raspberry Pi.


Open Questions for Community Input

  1. Backwards compatibility: Should the new system maintain URL compatibility with review.px4.io/plot_app/s/... paths? (Recommend: yes, via nginx redirects)
  2. API stability: Should we publish an API spec that third-party tools (QGroundControl, MAVSDK) can depend on? (Recommend: yes, OpenAPI 3.0)
  3. Real-time streaming: Should the 3D replay support live MAVLink streaming in addition to log replay? (mavsim-viewer already supports this pattern via the data source abstraction)
  4. Multi-log comparison: Should the UI support overlaying multiple flights for comparison? (PlotJuggler supports this natively)
  5. Community analysis plugins: Should we provide a plugin marketplace or registry? (Recommend: start with a plugins/ directory in the repo, evolve later)
  6. Retention policy: Should old logs be auto-archived to S3 Glacier after N months? (Recommend: yes, configurable)

Risks and Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Rust ULog parser doesn't match pyulog feature parity | Processing gaps | Use pyulog as reference test suite; validate against 1000+ real logs |
| ECharts can't handle spectrogram data well | Visual quality | Fallback to custom WebGL renderer for spectrograms |
| 3D replay performance in browser | Poor mobile experience | Make 3D replay opt-in, lazy-loaded |
| Migration disrupts 350k existing users | Lost links, broken bookmarks | Maintain old URLs via redirects for 1 year |
| Community doesn't adopt plugin system | Low extensibility | Build all current features as core plugins; system works without external plugins |
| PostgreSQL is overkill for small deployments | Complex setup | Support embedded SQLite mode via feature flag for single-user deployments |

Success Metrics

  1. Page load time: <1 second for log detail page (vs 3-10s today)
  2. Upload-to-viewable: <10 seconds of processing (today logs are viewable immediately after upload, but every view takes 3-10 seconds to render)
  3. Deployment ease: docker compose up for a working instance
  4. Plugin count: 5+ community-contributed plugins within first year
  5. Feature parity: All 35+ current plot types available at launch
  6. Mobile usability: Fully functional on tablet, viewable on phone

Appendix A: Current Flight Review Plot Inventory

All of these must be ported to the new system:

| # | Plot | Source Topics | Type |
|---|---|---|---|
| 1 | 2D Position (XY) | vehicle_local_position | Scatter |
| 2 | GPS Map | vehicle_gps_position | Map (Leaflet) |
| 3 | Altitude | vehicle_gps_position, vehicle_air_data, vehicle_local_position | Time-series |
| 4 | Roll Angle | vehicle_attitude, vehicle_attitude_setpoint | Time-series |
| 5 | Pitch Angle | vehicle_attitude, vehicle_attitude_setpoint | Time-series |
| 6 | Yaw Angle | vehicle_attitude, vehicle_attitude_setpoint | Time-series |
| 7-9 | Roll/Pitch/Yaw Rate | vehicle_angular_velocity, vehicle_rates_setpoint | Time-series |
| 10-12 | Local Position X/Y/Z | vehicle_local_position, vehicle_local_position_setpoint | Time-series |
| 13 | Velocity | vehicle_local_position | Time-series |
| 14-18 | Visual Odometry (5) | vehicle_visual_odometry | Time-series |
| 19 | Airspeed | airspeed, airspeed_validated | Time-series |
| 20 | TECS | tecs_status | Time-series |
| 21 | Manual Control | manual_control_setpoint, manual_control_switches | Time-series |
| 22 | Actuator Controls | actuator_controls_0, vehicle_thrust_setpoint | Time-series |
| 23-25 | FFT (3 types) | Derived from actuator_controls, angular_velocity | Spectrogram |
| 26 | Actuator Controls 1 | actuator_controls_1 | Time-series |
| 27 | Motor/Servo Outputs | actuator_motors, actuator_servos | Time-series |
| 28 | ESC RPM | esc_status | Time-series |
| 29 | Raw Acceleration | sensor_combined | Time-series |
| 30 | Vibration Metrics | vehicle_imu_status | Time-series |
| 31-33 | PSD Spectrograms (3) | Derived | Spectrogram |
| 34 | Raw Gyroscope | sensor_combined | Time-series |
| 35-36 | FIFO Accel/Gyro (per IMU) | sensor_accel_fifo, sensor_gyro_fifo | Time-series + Spectrogram |
| 37 | Raw Magnetometer | vehicle_magnetometer | Time-series |
| 38 | Distance Sensor | distance_sensor | Time-series |
| 39-40 | GPS Quality (2) | vehicle_gps_position | Time-series |
| 41 | Thrust-Mag Correlation | battery_status, vehicle_magnetometer | Time-series |
| 42 | Power | battery_status, system_power | Time-series |
| 43 | Temperature | Various (baro, accel, battery, ESC) | Time-series |
| 44 | Estimator Flags | estimator_status | Time-series (binary) |
| 45 | Failsafe Flags | failsafe_flags | Time-series (binary) |
| 46 | CPU & RAM | cpuload | Time-series |
| 47 | Sampling Regularity | sensor_combined, estimator_status | Time-series |

Plus: Non-default parameters table, logged messages table, hardfault card, corrupt log warning, perf counters, boot console, PID analysis page, 3D trajectory view.

Appendix B: Technology Summary

| Component | Choice | Rationale |
|---|---|---|
| Backend language | Rust | 10x faster ULog parsing, single binary, memory safety |
| Web framework | Axum | Async, tower middleware, strong ecosystem |
| Database | PostgreSQL | Concurrent access, JSONB, full-text search, proven at scale |
| Object storage | S3 API (aws-sdk-s3) | Direct API, no FUSE. MinIO for self-hosted |
| Frontend framework | React + TypeScript | Largest ecosystem, best for plugin system |
| Charting | Apache ECharts | WebGL, millions of points, spectrograms, synchronized axes |
| 3D visualization | Three.js | Most mature WebGL library, ported from mavsim-viewer |
| Maps | Leaflet or Mapbox GL JS | Flight track with mode coloring |
| Auth | JWT + OIDC | Simple for small, scalable for large |
| Deployment | Docker Compose (small), K8s/ECS (large) | Single docker compose up to full cloud |
| ULog parser | Custom Rust (reference: mavsim-viewer C + pyulog) | Native performance, streaming support |
| Job queue | PostgreSQL LISTEN/NOTIFY (simple) or SQS (scale) | No extra infrastructure for small deployments |
| CDN | CloudFront | Already in use, serves static assets + pre-signed URLs |
| FFT | rustfft (backend), custom (frontend) | High-performance spectral analysis |
| Compression | zstd | Best ratio/speed tradeoff for processed JSON |

Appendix C: Review Feedback & Plan Adjustments

This plan was reviewed from three perspectives: an open-source maintainer, a small-team private deployer, and a DevOps engineer running the production instance. Their feedback surfaced critical gaps and led to the adjustments below.

Review 1: Open-Source Maintainer Perspective

Key concerns raised:

  1. Rust vs Python for contributor accessibility. The PX4 ecosystem is primarily C++ and Python. The domain logic in configured_plots.py (1,165 lines of vibration thresholds, FFT cutoff markers, PID heuristics) was written by flight controller engineers who know Python, not Rust. Rewriting this in Rust risks losing contributors who maintain the analysis logic that is Flight Review's actual value.

  2. Scope is unrealistic at 8 months. A ground-up rewrite (new language, new DB, new frontend, new charting, 3D replay, plugin system, auth, migration) is 12-18 months minimum for a small OSS team. The history of open-source v2 rewrites is littered with projects that never shipped.

  3. Plugin system is over-engineered. Flight Review has had very few external contributors adding new plot types. A formal plugin API adds abstraction overhead, versioning, and API stability commitments without demonstrated demand. Clean code structure is sufficient.

  4. Migration risk is understated. QGroundControl's upload endpoint is a hard API contract not addressed in the plan. URL compatibility for existing links is critical. No rollback plan exists.

  5. Incremental migration recommended. Add process-once caching to the current Python app first, then build a new React frontend, then backfill 350k logs. This delivers instant page loads in 1-2 months with near-zero migration risk.

Response and adjustments:

  • Rust stays as the primary language. This is a deliberate choice by the project stakeholders who want to move away from Python and invest in Rust. The "process once" architecture means the parsing speed advantage still matters for upload processing and bulk migration. More importantly, Rust's type system, memory safety, and single-binary deployment are long-term wins. The PX4 ecosystem is increasingly multilingual (Rust UAVCAN, Rust MAVLink libraries, Auterion's px4-ulog-rs). The analysis domain logic will be ported methodically with test coverage against real logs.

  • Scope is reduced for v1. The following are cut from the initial release:

    • Plugin system → Internal module pattern only, no public plugin API
    • 3D flight replay → Phase 2 feature, current Cesium.js 3D view maintained
    • OIDC authentication → Simple token/password auth only in v1
    • PID analysis page → Phase 2
    • Dark mode → Phase 2
    • Multi-log comparison → Phase 2
    • Real-time MAVLink streaming → Not in scope
  • QGroundControl upload API compatibility is mandatory. The upload endpoint must accept the same multipart POST format QGC uses today. Document this as a hard requirement in Phase 1.

  • Incremental strategy adopted partially. The React frontend can be developed and deployed alongside the old Bokeh frontend during transition. New uploads get processed; old logs get a "legacy view" link until re-processed.

Review 2: Small Team / Private Deployment Perspective

Key concerns raised:

  1. Three containers is too many. PostgreSQL + MinIO + app triples the operational surface for a team with a few hundred logs. Named Docker volumes are not portable.

  2. PostgreSQL is overkill. SQLite with WAL handles the current 350k-log production instance. A team with hundreds of logs will never stress SQLite.

  3. MinIO is unnecessary. For 10GB of logs, local disk with direct file serving is simpler and sufficient. MinIO recommends 4GB RAM minimum.

  4. Auth is harder than shown. OIDC requires registering OAuth apps, stable domains, HTTPS, and debugging opaque token errors. Teams just want a password.

  5. Raspberry Pi / small VPS not viable. PostgreSQL eats 200-400MB idle, MinIO needs 4GB, processing workers need 2GB. A 1GB VPS can't run this.

  6. Air-gapped deployments not addressed. Many commercial/defense drone teams operate without internet. Map tiles, frontend assets, and auth all assume connectivity.

  7. Missing features for private use: Log organization (folders/tags), batch upload, flight comparison, export/reporting, storage quotas, authorization model (who sees what).

Response and adjustments:

  • Single-container mode is the default deployment. The architecture now explicitly supports three deployment tiers:

    | Tier | Components | Storage | Database | Auth | Target |
    |---|---|---|---|---|---|
    | Minimal | Single binary | Local disk | Embedded SQLite | Password list | Teams, Pi, VPS |
    | Standard | Docker Compose (2 containers) | Local disk or MinIO | PostgreSQL | Password or OIDC | Growing teams |
    | Production | ECS/K8s (N containers) | AWS S3 | RDS PostgreSQL | OIDC | Dronecode scale |
  • SQLite is first-class, not a fallback. The data access layer abstracts both SQLite and PostgreSQL equally. SQLite is the default; PostgreSQL is the documented upgrade path when concurrent write throughput becomes a measured problem.

  • Local disk storage is the default. STORAGE_BACKEND=local stores ULog files in ./data/logs/ and the app serves them directly. S3 backend is opt-in for cloud deployments. No MinIO required for simple setups.

  • Simple auth added. auth.provider: "password" with a static list of username:bcrypt pairs in the config file. Zero external dependencies. OIDC is documented as an upgrade, not the starting point.

  • Bind mounts, not named volumes. Docker Compose uses ./data:/app/data so backup is tar czf backup.tar.gz ./data/.

  • Air-gap mode added to requirements. All frontend assets bundled in the Docker image. Map tile URL configurable (defaults to OSM, can point to self-hosted tile server). Docker images published as .tar artifacts alongside registry images. Multi-arch builds (amd64 + arm64).

  • Batch upload and log tagging added to v1 scope. These are essential for real field workflows.
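The Standard-tier adjustments above (bind mounts, local disk storage, password auth) could look like the following docker-compose.yml. This is a hypothetical sketch: the image name, environment variable names, and ports are illustrative assumptions, not a published configuration.

```yaml
# Hypothetical Standard-tier compose file (all names illustrative).
services:
  app:
    image: px4/flight-review-next:latest   # assumed image name
    ports:
      - "8080:8080"
    environment:
      STORAGE_BACKEND: local               # local disk, no MinIO needed
      DATABASE_URL: postgres://fr:fr@db:5432/flightreview
      AUTH_PROVIDER: password              # static username:bcrypt list in config
    volumes:
      - ./data:/app/data                   # bind mount: backup = tar czf backup.tar.gz ./data
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: fr
      POSTGRES_PASSWORD: fr
      POSTGRES_DB: flightreview
    volumes:
      - ./data/postgres:/var/lib/postgresql/data
```

The bind mounts keep everything under `./data/`, so the whole deployment can be backed up or moved as a single directory.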

Review 3: DevOps / Production Operations Perspective

Key concerns raised:

  1. Actual S3 data is 617k files / 14.8TB, not 350k / 5TB. The plan's estimates are off by nearly 3x. Bulk migration is ~3-5 days, not 48 hours, and costs ~$1,300+ in S3 transfer.

  2. Cost is 10-40% higher than current setup ($530-650/month vs ~$470/month), not comparable. Stakeholders should know this upfront.

  3. No observability story. Monitoring should be Phase 1, not Phase 5. The plan has zero detail on metrics, alerting, or structured logging.

  4. Job queue needs persistence. PostgreSQL LISTEN/NOTIFY loses messages if no worker is listening. Need a table-backed queue with SELECT ... FOR UPDATE SKIP LOCKED.

  5. No disaster recovery plan. No RTO/RPO targets, no restore testing, no secrets management.

  6. No SSL/TLS mentioned. Currently Let's Encrypt + nginx.

  7. Pre-signed URLs expire. If a user opens a page and comes back 2 hours later, download links are dead.

  8. The Rust ULog parser doesn't exist yet. The entire plan depends on it. Build and validate it first.

  9. Proof-of-concept with 1,000 real logs needed in Phase 2, not Phase 5. Measure actual parse times, failure rates, and processed JSON sizes before committing to full migration.

Response and adjustments:

  • Data inventory corrected. The plan now uses 617k files / 14.8TB as the baseline. Migration estimates updated to 3-5 days with 10 workers, ~$1,500 in S3 costs.

  • Cost transparency added. Estimated steady-state cost is $530-650/month, roughly 15-35% higher than current. The trade-off is dramatically better performance, reliability, and operability. The current system's cost will increase anyway as data grows and S3 FUSE becomes more painful.

  • Observability is Phase 1. Minimum from day one:

    • Structured JSON logging via tracing + tracing-subscriber
    • Health endpoint checking PostgreSQL, S3 connectivity, and queue depth
    • Prometheus metrics: request latency (p50/p95/p99), error rates, queue length, processing time
    • Alerting on queue backlog >100, error rate >1%, API p99 >5s
    • CloudWatch Logs integration for ECS deployment
  • Job queue redesigned. PostgreSQL-backed with a processing_jobs table, SELECT ... FOR UPDATE SKIP LOCKED for reliable dequeue, LISTEN/NOTIFY as wake-up signal only. Dead letter handling for failed jobs. Configurable retry count.

  • Disaster recovery defined:

    • RDS: Automated daily snapshots, 35-day retention, point-in-time recovery. Test restore quarterly.
    • S3: Versioning enabled for raw ULog files. Processed JSON is regenerable.
    • RTO: 4 hours. RPO: 24 hours.
    • Secrets in AWS Secrets Manager.
  • SSL/TLS: ACM certificate on ALB for production. Caddy with automatic HTTPS for Docker Compose deployments.

  • Pre-signed URLs: Generate fresh on each API call with 1-hour expiry. Do not cache server-side.

  • ULog parser is the critical path. Phase 1 now explicitly starts with building and validating the Rust ULog parser against a corpus of 1,000+ real logs before any other work begins. This is the go/no-go gate for the project.

  • Proof-of-concept migration in Phase 2. Process 1,000 representative logs, measure parse times, failure rates, memory usage, and processed JSON sizes. Use results to refine bulk migration plan.
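The job-queue redesign above hinges on one query: a worker atomically claims the oldest queued job with `FOR UPDATE SKIP LOCKED`, so concurrent workers never grab the same row and `LISTEN/NOTIFY` is only a wake-up signal. The sketch below embeds that SQL as a constant; the `processing_jobs` table and column names are illustrative assumptions, not a final schema.

```typescript
// Dequeue statement a worker would run against PostgreSQL (schema hypothetical).
// LISTEN/NOTIFY only wakes the worker; this query is the source of truth.
const DEQUEUE_SQL = `
  UPDATE processing_jobs
     SET status = 'running', started_at = now(), attempts = attempts + 1
   WHERE id = (
         SELECT id
           FROM processing_jobs
          WHERE status = 'queued'
          ORDER BY created_at
          LIMIT 1
          FOR UPDATE SKIP LOCKED
   )
  RETURNING id, log_id;
`;

// Jobs whose attempts exceed the configured retry count would be moved to a
// dead-letter status by a similar UPDATE, per the dead-letter handling above.
```

If no worker is listening when a job is inserted, the row simply waits in the table; a periodic poll plus the NOTIFY wake-up means nothing is lost, unlike raw LISTEN/NOTIFY.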

Revised Timeline

| Phase | Duration | Deliverable |
|---|---|---|
| 0: ULog Parser | 6-8 weeks | Rust ULog parser validated against 1,000+ real logs. Go/no-go gate. |
| 1: Core Backend | 8-10 weeks | Axum API, PostgreSQL, S3 integration, processing pipeline, observability, QGC-compatible upload |
| 2: Frontend MVP | 8-10 weeks | React SPA with all 35+ plot types, browse/search, GPS map, batch upload, log tagging |
| 3: Migration | 4-6 weeks | Proof-of-concept with 1,000 logs, then bulk migration, dual-running with old system |
| 4: Cutover | 2-4 weeks | DNS cutover, URL redirects, monitoring stabilization |
| 5: Phase 2 Features | Ongoing | 3D replay, PID analysis, OIDC, flight comparison, dark mode |

Total to production: ~8-10 months (vs original 8 months). More realistic, with an explicit go/no-go gate at week 6-8.

Revised Deployment Tiers

┌─────────────────────────────────────────────────────────────────┐
│ MINIMAL: Single binary, SQLite, local disk, password auth       │
│                                                                 │
│   $ ./flight-review-next --data-dir ./data                      │
│                                                                 │
│   Perfect for: Raspberry Pi, laptop, small VPS, air-gapped      │
│   Requirements: 512MB RAM, 1 CPU, Linux/macOS/Windows           │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ STANDARD: Docker Compose, PostgreSQL, local disk or S3          │
│                                                                 │
│   $ docker compose up                                           │
│                                                                 │
│   Perfect for: Teams of 5-50, office server, cloud VPS          │
│   Requirements: 2GB RAM, 2 CPU                                  │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ PRODUCTION: ECS/K8s, RDS, S3, CloudFront, multiple workers      │
│                                                                 │
│   Perfect for: Dronecode (350k+ logs), large organizations      │
│   Requirements: See resource estimates                          │
└─────────────────────────────────────────────────────────────────┘