@jmealo
Created March 6, 2026 16:47

42 Red Herrings: Notification-Sender Queue Backup (March 5-6, 2026)

Two incidents, one root cause chain, 42 documented false signals across 2 days of investigation.

Actual root cause: AAA API at 8 replicas (some crash-looping) couldn't serve permissions queries from intel-requests-api fast enough. This caused cascading request queuing through intel-requests-api and incidents-api, starving notification-sender of API capacity. SQL queries were sub-millisecond throughout.


March 5: 29 Red Herrings (5-hour investigation)

Missing Observability (flying blind)

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 1 | Datadog down | No APM, no application metrics, no alerts for the critical path | Entire APM platform was non-operational during the incident |
| 2 | Stale Vector config | Logs being dropped or missing | LLM agent deployed a 6-month-old branch; logs dual-shipped in the wrong format, breaking log-based metric queries |
| 3 | Vector CPU throttling | Log gaps = missing evidence | Silently dropping up to 12.12k events/sec on dense nodes; fix deployed before the main incident, but morning logs incomplete |
| 4 | No pg_stat_statements on app DB | Could identify slow queries | Still can't determine per-query execution times; root cause analysis blocked |
| 5 | No per-operation timing in notification-sender | Could see what's slow | Had to deploy structured logging (MR #42) mid-incident to get any visibility |
| 6 | No queue depth or consumer throughput alerts | Would have been alerted earlier | RabbitMQCriticalBacklog fired 2.5 hours after buildup started |
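Row 6 is the cheapest fix of the bunch: a queue-depth alert that fires on sustained backlog rather than a single spike. A minimal sketch of that threshold-plus-duration logic (function name, parameters, and thresholds are hypothetical, not from the incident tooling):

```python
def should_alert(depth_samples, threshold, sustained):
    """Fire only when queue depth exceeds `threshold` for `sustained`
    consecutive samples, so a transient spike does not page anyone."""
    run = 0
    for depth in depth_samples:
        run = run + 1 if depth > threshold else 0
        if run >= sustained:
            return True
    return False
```

The trade-off is the sustain window: with 2.5 hours of buildup before the existing alert fired, even a conservative window of a few samples would have paged far earlier.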

Corrupted/Misleading Metrics

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 7 | gisual-prometheus-client multi-replica overwrite | Dashboard showing accurate data | Multi-replica bug caused pods to overwrite each other's metrics |
| 8 | Pushgateway push mode | rate() on pushgateway metrics is reliable | Each pod overwrote the others; rate() produced nonsense |
| 9 | RabbitMQ dual-environment labels | Queue depth metrics are accurate | kubectx switch during deploy_all.sh created duplicate time series with a staging label for production queues |
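Row 8's failure mode is easy to reproduce in miniature. Pushgateway keys pushed metrics by job plus grouping labels and keeps only the last write, so replicas pushing under an identical key silently clobber each other. A dictionary sketch of that last-write-wins behavior (a model only, not the real implementation):

```python
# Metrics are keyed by (job, grouping labels); last write wins.
store = {}

def push(job, grouping_labels, value):
    store[(job, frozenset(grouping_labels.items()))] = value

# Two replicas push under the same grouping key: only one survives.
push("notification-sender", {}, 100)
push("notification-sender", {}, 7)   # replica B overwrites replica A

# Fix: include a per-pod label in the grouping key so series stay distinct.
push("notification-sender", {"pod": "a"}, 100)
push("notification-sender", {"pod": "b"}, 7)
```

This is also why rate() produced nonsense: the surviving series jumps between unrelated replicas' counter values, so its derivative is meaningless.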

False Signals

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 10 | "rpc-error" floods | Errors indicate a problem | Distributed lock working as designed; lock losers publish to the rpc-error exchange for retry |
| 11 | "notification-sender not processing anything" | Consumer is stuck/dead | Actually processing at ~500 msgs/5min; 10x slower than normal, but not zero |
| 12 | CTO queue drain = recovery | Queue dropped from 20K to 36; problem solved | False recovery signal; queue re-accumulated to 4.4K after the drain |
| 13 | CommandLink webhook 302s | Webhook delivery broken | Expired auth credentials; a completely separate issue |
| 14 | Envoy/Dragonfly chart confusion | Dragonfly eviction broke Redis | redis-mirror-envoy proxies Azure Redis (not Dragonfly); a shared Helm chart caused unrelated envoy rollouts |
| 15 | Node autoscaler cascading noise | Infrastructure instability | Normal pod movements; CoreDNS scaling 13->9->13 proportional to node count |
| 16 | CoreDNS NXDOMAIN errors | DNS is broken (79% NXDOMAIN) | Normal ndots:5 search-domain walking; p95 latency stable at 0.24ms |
| 17 | notification-sender metrics deployment timing | The deploy caused the issue | The metrics deployment was an observability improvement, not the cause |
| 18 | Public-api p95 spike at 8:55 PM | API latency correlated with the queue backup | A single 5-minute window; false temporal correlation |
| 19 | archive-tagger shutdown at 9:40 PM | Could be related to DB pressure | Unrelated; added noise during the investigation |
| 20 | VictoriaLogs "retention" confusion | Logs being deleted | Actually a log format change from the stale Vector config, not retention |
| 21 | Long-running query durations as "established fact" | Queries taking 21-28 minutes | Code-analysis estimates treated as observations; the max active query was actually 6.25s |
| 22 | incidents-api 401 Unauthorized | Auth system broken | Unrelated to the queue backup |
| 23 | RabbitMQ nack rejections | Message processing failures | Unclear whether from the CTO queue drain or a separate issue; not causal |
| 24 | Charter webhook latency overestimate | Webhooks taking 20-40s | 20-40s was total processing time, not webhook delivery; the actual webhook POST was <1s |
| 25 | LOG_TIMING_ENABLED forgotten on first deploy | Structured logging deployed | Had to cancel and redeploy; wasted ~30 minutes |
| 26 | gisual-log-utils extras merge bug | Structured logging ready to deploy | A library bug blocked deployment; required a fix chain across log-utils -> amqp-consumer -> notification-sender |
| 27 | Decommissioned power-outage-usca redeployed by LLM | What is this, and why is it consuming resources? | The LLM agent checked out a 6-month-old branch and deployed a decommissioned ML model to production |
| 28 | 10 consumer services deployed to feat/metrics branches | Could be causing instability | Metrics instrumentation deploys; not causal, but added to the noise floor |
| 29 | ApiWritePathErrors alert silenced for tuning | A potential early warning masked | Silenced Mar 4-11 for alert tuning; could have provided a signal |
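Row 16's NXDOMAIN flood is what a resolver configured with ndots:5 (the Kubernetes default) is supposed to produce: any name with fewer than five dots is first tried against every search domain, and each miss is a normal NXDOMAIN, not a DNS failure. A sketch of the candidate-query expansion (search domains below are illustrative):

```python
def candidate_queries(name, search_domains, ndots=5):
    """Model of resolv.conf search behavior: names with fewer than
    `ndots` dots walk the search list before the absolute query, so
    each search-domain miss is an expected NXDOMAIN."""
    if name.count(".") >= ndots:
        return [name + "."]
    return [f"{name}.{d}." for d in search_domains] + [name + "."]
```

For an external name like api.example.com with three cluster search domains, three of four lookups are guaranteed NXDOMAINs, which is how a healthy cluster reaches a "79% NXDOMAIN" rate.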

March 6: 13 Red Herrings (6-hour investigation)

Database Optimization Dead End

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 30 | Recursive CTE query is the bottleneck | WITH RECURSIVE dependency CTE is the #1 bottleneck (5-7s API time); needs a max_distance limit and query optimization | CTE executes in 0.05-0.16ms and returns 0-13 rows. Zero optimization needed. |
| 31 | All SQL queries need optimization | Internal API latency = slow database queries; dispatched 4 parallel agents to analyze SQL across intel-requests-api, incidents-api, and assets-api | Every query in the hot path is sub-millisecond. The database is not the problem. |
| 32 | Missing composite index on dependencies table | (blocker_type, blocker_id, dependent_type, dependent_id) index needed; OR condition not using indexes efficiently | Existing partial indexes dep_blocker_live_idx and dep_dependent_live_idx work perfectly; 3-10 buffer hits per query |
| 33 | Unbounded recursion depth | max_distance parameter never passed = unbounded graph traversal causing exponential blowup under Charter-scale load | Recursion returns 0-13 rows in <1ms; depth is irrelevant |
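Rows 30 and 33 can be pictured with a plain traversal: a WITH RECURSIVE dependency CTE is equivalent to a breadth-first walk of the dependencies graph, and the max_distance parameter, had it been needed, would simply cap the walk's depth. A Python sketch of that shape (the real query is SQL; the structure below is an assumption, with names echoing the notes):

```python
def dependents(graph, root, max_distance=None):
    """Breadth-first walk mirroring a recursive dependency CTE.
    `max_distance` bounds traversal depth; per the incident data the
    walk only ever returned 0-13 rows, so the bound was irrelevant."""
    seen, frontier, depth = {root}, [root], 0
    out = []
    while frontier and (max_distance is None or depth < max_distance):
        nxt = []
        for node in frontier:
            for dep in graph.get(node, []):
                if dep not in seen:
                    seen.add(dep)
                    out.append(dep)
                    nxt.append(dep)
        frontier, depth = nxt, depth + 1
    return out
```

The lesson of row 30 is that this structure only blows up on deep or wide graphs; on 0-13 rows, no amount of query tuning could have recovered the missing 5-7 seconds.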

Wrong Layer / Wrong Service

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 34 | Thundering herd at the DB level | 75 pods querying the same data = DB contention and connection pool exhaustion on the Azure Flex app DB | The DB handles the load fine (sub-ms); the thundering herd manifests at the API HTTP layer, not the DB |
| 35 | Azure Flex app DB connection pool saturation | intel-requests-api and incidents-api saturating the app database | The saturated database is the aaa DB, not app; we traced the wrong database |
| 36 | PgBouncer queuing on Azure Flex | Queries waiting in the PgBouncer queue before reaching PostgreSQL | PgBouncer on the app DB is fine; the bottleneck is the AAA API's connection pool to the aaa DB |
| 37 | CNPG connection metrics | Started querying CNPG pg-common connection stats | intel-requests-api and incidents-api connect to Azure Flexible Server, not CNPG; wrong database backend entirely |
| 38 | _incorporate_infra_metadata HTTP calls | Every GET to intel-requests-api triggers HTTP calls to AAA for organization data and user email lookup | Has an early exit when asset_size_class and infra_type are present in metadata; the actual slow path is the permissions check from intel-searcher's user-token requests |
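Row 38's early exit, sketched in Python (the function and field names follow the incident notes, but the body is an assumption about the shape of the code, not a copy of it):

```python
def incorporate_infra_metadata(record, fetch_from_aaa):
    """Skip the AAA round-trip when metadata is already populated.
    `fetch_from_aaa` stands in for the HTTP call to the AAA API."""
    meta = record.get("metadata", {})
    if meta.get("asset_size_class") and meta.get("infra_type"):
        return record                             # early exit: no AAA call
    meta.update(fetch_from_aaa(record["id"]))     # slow path: AAA round-trip
    record["metadata"] = meta
    return record
```

Because most records arrive with both fields populated, this path contributed little AAA traffic, which is why suspecting it was a dead end.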

Misattribution

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 39 | require_permission is a no-op for all callers | Stated that permissions were fully bypassed via allow_service_access_tokens for all requests | Only bypassed for service tokens (notification-sender). Intel-searcher uses user tokens, triggering a full AAA permissions HTTP call on every request; this is what saturates AAA |
| 40 | Robin's MR #44 fixes the thundering herd | Described MR #44 as addressing the 75-pod redundant-fetch problem | #44 only caches on send failure for retry; no cache entry exists on first delivery. All 3 open MRs (#44, #45, #46) are retry-only; none address the thundering herd |
| 41 | Zero-row EXPLAIN results as representative | Used intel_request_ids from the slow timing logs for EXPLAIN ANALYZE | Those specific IDs had no dependents (0 rows in the base case); not representative, though this accidentally proved SQL wasn't the issue |
| 42 | ServiceNow 500s contributing to queue backup | Initial error scan flagged 33 ServiceNow Script Evaluation Exception errors | External issue on ServiceNow's GOCO scripted REST API, auto-retried via the rpc-error exchange; completely unrelated to the queue backup |
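Row 39's distinction, sketched (names follow the incident notes; the control flow is an assumption): the bypass applies only to service tokens, so every user-token request still pays for an AAA round-trip.

```python
def require_permission(token, check_with_aaa, allow_service_access_tokens=True):
    """Service tokens (e.g. notification-sender) skip the AAA call;
    user tokens (e.g. intel-searcher) hit AAA on every request.
    `check_with_aaa` stands in for the permissions HTTP call."""
    if allow_service_access_tokens and token["type"] == "service":
        return True                  # bypass: no AAA HTTP call
    return check_with_aaa(token)     # user token: full permissions check
```

Reading only the bypass branch is what produced the "no-op for all callers" conclusion; the user-token branch is the one that saturated AAA.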

Pattern Analysis

By Category (across both days)

| Category | March 5 | March 6 | Total |
|---|---|---|---|
| Missing observability / flying blind | 6 | 0 | 6 |
| Corrupted or misleading metrics | 3 | 0 | 3 |
| Wrong service / wrong layer | 0 | 5 | 5 |
| Database optimization dead end | 0 | 4 | 4 |
| Misattribution / wrong conclusion | 0 | 4 | 4 |
| False temporal correlation | 4 | 1 | 5 |
| Unrelated concurrent issues | 7 | 0 | 7 |
| Infrastructure noise | 4 | 1 | 5 |
| Tooling / deployment failures | 3 | 0 | 3 |
| **Total** | **29** | **13** | **42** |

What Improved Between March 5 and March 6

  • Structured logging was in place (deployed during March 5) — per-operation timing available from the start
  • No concurrent infrastructure events — no Dragonfly eviction, Reloader cascade, Vector drops, or kubectx switches
  • Prometheus metrics working — pushgateway fixed, gisual-prometheus-client fixed
  • VictoriaLogs reliable — Vector CPU limits removed, stale config fixed
  • Fewer red herrings (13 vs 29) — cleaner signal, though still spent 4+ hours on SQL optimization before checking AAA

What Didn't Improve

  • Still no pg_stat_statements on app DB — couldn't verify query performance without manual EXPLAIN ANALYZE
  • Still no queue depth alerts — discovered the backup from logs, not alerts
  • Bottom-up investigation approach — started with SQL (bottom of stack) instead of tracing from the API layer down. If we'd checked intel-requests-api request_time logs first and traced upstream to AAA, we'd have found the root cause in 30 minutes instead of 6 hours.

The Meta-Lesson

Both incidents share the same anti-pattern: optimizing what you can measure instead of measuring what's slow.

  • March 5: No per-operation timing → deployed structured logging mid-incident → found internal API latency
  • March 6: Had timing → focused on SQL (easy to EXPLAIN ANALYZE) → spent 4 hours on sub-millisecond queries → finally checked API-side logs → found AAA in 30 minutes

The correct investigation order is: trace the request from the outside in, not from the database up. Check the API's request_time first, then trace which upstream call is slow, then check that service's logs, and so on until you find where time is actually spent.
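That outside-in order can be mechanized: with cumulative request_time at each layer of the call chain, the hop whose self time (its total minus its upstream's total) dominates is where to look next. A sketch with illustrative numbers (the layer names and timings below are hypothetical):

```python
def slowest_hop(trace):
    """`trace` is an outside-in list of (layer, cumulative seconds).
    Each hop's self time is its total minus its upstream's total;
    the largest self time marks where the investigation should go next."""
    self_times = {}
    for i, (layer, total) in enumerate(trace):
        upstream_total = trace[i + 1][1] if i + 1 < len(trace) else 0.0
        self_times[layer] = total - upstream_total
    return max(self_times, key=self_times.get)
```

Applied to March 6, this discipline points straight past the sub-millisecond database and at the AAA layer in one pass.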
