Two incidents, one root cause chain, 42 documented false signals across 2 days of investigation.
Actual root cause: AAA API at 8 replicas (some crash-looping) couldn't serve permissions queries from intel-requests-api fast enough. This caused cascading request queuing through intel-requests-api and incidents-api, starving notification-sender of API capacity. SQL queries were sub-millisecond throughout.
| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 1 | Datadog down | No APM, no application metrics, no alerts for the critical path | Entire APM platform was non-operational during the incident |
| 2 | Stale Vector config | Logs being dropped or missing | LLM agent deployed 6-month-old branch; logs dual-shipped in wrong format, broke log-based metric queries |
| 3 | Vector CPU throttling | Log gaps = missing evidence | Silently dropping up to 12.12k events/sec on dense nodes; fix deployed before main incident but morning logs incomplete |
| 4 | No pg_stat_statements on app DB | Could identify slow queries | Still can't determine per-query execution times; root cause analysis blocked |
| 5 | No per-operation timing in notification-sender | Could see what's slow | Had to deploy structured logging (MR #42) mid-incident to get any visibility |
| 6 | No queue depth or consumer throughput alerts | Would have been alerted earlier | RabbitMQCriticalBacklog fired 2.5 hours after buildup started |
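Row 5's mid-incident fix (MR #42) amounted to per-operation structured timing. A minimal sketch, assuming a JSON-lines log format and the `LOG_TIMING_ENABLED` flag mentioned later in row 25; the operation name is illustrative, not notification-sender's real internals:

```python
# Minimal per-operation timing sketch. LOG_TIMING_ENABLED mirrors the env
# flag from red herring #25; everything else here is illustrative.
import json
import logging
import os
import time

logger = logging.getLogger("timing")
TIMING_ENABLED = os.environ.get("LOG_TIMING_ENABLED", "false").lower() == "true"

def timed_call(operation: str, fn, *args, **kwargs):
    """Run fn and build a structured timing record for it."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    record = {
        "op": operation,
        "duration_ms": round((time.monotonic() - start) * 1000, 2),
    }
    if TIMING_ENABLED:
        # One JSON line per operation, queryable downstream (e.g. in VictoriaLogs).
        logger.info(json.dumps(record))
    return result, record

# Usage: wrap each suspect step so slow spans show up per operation.
_, rec = timed_call("fetch_permissions", time.sleep, 0.01)
```

With this in place, "could see what's slow" stops being a guess: each operation's duration is a queryable field rather than an inference from code reading.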
| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 7 | gisual-prometheus-client multi-replica overwrite | Dashboard showing accurate data | Multi-replica bug caused pods to overwrite each other's metrics |
| 8 | Pushgateway push mode | rate() on pushgateway metrics is reliable | Each pod overwrote the others; rate() produced nonsense |
| 9 | RabbitMQ dual-environment labels | Queue depth metrics are accurate | kubectx switch during deploy_all.sh created duplicate time series with staging label for production queues |
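Rows 7 and 8 both fall out of Pushgateway's replace-on-push semantics. A toy model (plain dicts standing in for the gateway; service name illustrative) of why replicas sharing a grouping key clobber each other, and why a per-pod grouping key fixes it:

```python
# Toy model of Pushgateway semantics: a push REPLACES everything stored
# under its grouping key, so replicas sharing a key clobber each other
# and rate() over the surviving series is meaningless.
def push(gateway: dict, grouping_key: tuple, metrics: dict) -> None:
    gateway[grouping_key] = metrics  # replace, not merge

gateway = {}

# Two replicas pushing under the same job-only key: last writer wins.
push(gateway, ("job", "notification-sender"), {"sent_total": 120})
push(gateway, ("job", "notification-sender"), {"sent_total": 45})
assert gateway[("job", "notification-sender")]["sent_total"] == 45  # the 120 is gone

# Fix: include the pod name in the grouping key so series stay distinct.
push(gateway, ("job", "notification-sender", "pod", "a"), {"sent_total": 120})
push(gateway, ("job", "notification-sender", "pod", "b"), {"sent_total": 45})
total = sum(m["sent_total"] for k, m in gateway.items() if "pod" in k)
assert total == 165  # both replicas' counters survive
```

The real client-side fix is the same idea: push with a grouping key that includes the pod name (or scrape pull-style instead), so counters from different replicas never share a slot.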
| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 10 | "rpc-error" floods | Errors indicate a problem | Distributed lock working as designed — lock losers publish to rpc-error exchange for retry |
| 11 | "notification-sender not processing anything" | Consumer is stuck/dead | Actually processing at ~500 msgs/5min — 10x slower than normal but not zero |
| 12 | CTO queue drain = recovery | Queue dropped from 20K to 36 — problem solved | False recovery signal; queue re-accumulated to 4.4K after drain |
| 13 | CommandLink webhook 302s | Webhook delivery broken | Expired auth credentials, completely separate issue |
| 14 | Envoy/Dragonfly chart confusion | Dragonfly eviction broke Redis | redis-mirror-envoy proxies Azure Redis (not Dragonfly); shared Helm chart caused unrelated envoy rollouts |
| 15 | Node autoscaler cascading noise | Infrastructure instability | Normal pod movements; CoreDNS scaling 13->9->13 proportional to node count |
| 16 | CoreDNS NXDOMAIN errors | DNS is broken (79% NXDOMAIN) | Normal ndots:5 search domain walking behavior; p95 latency stable at 0.24ms |
| 17 | notification-sender metrics deployment timing | The deploy caused the issue | Metrics deployment was observability improvement, not the cause |
| 18 | Public-api p95 spike at 8:55 PM | API latency correlated with queue backup | Single 5-minute window; false temporal correlation |
| 19 | archive-tagger shutdown at 9:40 PM | Could be related to DB pressure | Unrelated; added noise during investigation |
| 20 | VictoriaLogs "retention" confusion | Logs being deleted | Actually a log format change from stale Vector config, not retention |
| 21 | Long-running query durations as "established fact" | Queries taking 21-28 minutes | Code analysis estimates treated as observations; max active query was actually 6.25s |
| 22 | incidents-api 401 Unauthorized | Auth system broken | Unrelated to queue backup |
| 23 | RabbitMQ nack rejections | Message processing failures | Unclear if from CTO queue drain or separate issue; not causal |
| 24 | Charter webhook latency overestimate | Webhooks taking 20-40s | 20-40s was total processing time, not webhook delivery; actual webhook POST was <1s |
| 25 | LOG_TIMING_ENABLED forgotten on first deploy | Structured logging deployed | Had to cancel and redeploy — wasted ~30 minutes |
| 26 | gisual-log-utils extras merge bug | Structured logging ready to deploy | Library bug blocked deployment; required fix chain across log-utils -> amqp-consumer -> notification-sender |
| 27 | Decommissioned power-outage-usca redeployed by LLM | Why is this consuming resources? | LLM agent checked out 6-month-old branch and deployed decommissioned ML model to production |
| 28 | 10 consumer services deployed to feat/metrics branches | Could be causing instability | Metrics instrumentation deploys; not causal but added to noise floor |
| 29 | ApiWritePathErrors alert silenced for tuning | Potential early warning masked | Silenced Mar 4-11 for alert tuning; could have provided an earlier signal |
| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 30 | Recursive CTE query is the bottleneck | WITH RECURSIVE dependency CTE is the #1 bottleneck (5-7s API time), needs max_distance limit and query optimization | CTE executes in 0.05-0.16ms. Returns 0-13 rows. Zero optimization needed. |
| 31 | All SQL queries need optimization | Internal API latency = slow database queries; dispatched 4 parallel agents to analyze SQL across intel-requests-api, incidents-api, assets-api | Every query in the hot path is sub-millisecond. The database is not the problem. |
| 32 | Missing composite index on dependencies table | (blocker_type, blocker_id, dependent_type, dependent_id) index needed; OR-condition not using indexes efficiently | Existing partial indexes dep_blocker_live_idx and dep_dependent_live_idx work perfectly. 3-10 buffer hits per query. |
| 33 | Unbounded recursion depth | max_distance parameter never passed = unbounded graph traversal causing exponential blowup under Charter-scale load | Recursion returns 0-13 rows in <1ms. Depth is irrelevant. |
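Rows 30 and 33 hinge on the same fact: a recursive traversal is only expensive when fan-out is large. A self-contained sketch of the pattern, using SQLite for portability (the real table and CTE live in intel-requests-api on PostgreSQL; schema and data here are invented):

```python
# Sketch of a recursive dependency traversal over a tiny graph. With
# fan-out this small, the CTE terminates in a handful of rows and a depth
# limit (max_distance) would change nothing. Schema/data are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dependencies (blocker_id INTEGER, dependent_id INTEGER);
    INSERT INTO dependencies VALUES (1, 2), (2, 3), (2, 4);
""")

rows = db.execute("""
    WITH RECURSIVE deps(id, distance) AS (
        SELECT dependent_id, 1 FROM dependencies WHERE blocker_id = ?
        UNION
        SELECT d.dependent_id, deps.distance + 1
        FROM dependencies d JOIN deps ON d.blocker_id = deps.id
    )
    SELECT id, distance FROM deps ORDER BY id
""", (1,)).fetchall()
# A handful of rows, not an exponential blowup.
```

The same shape against the production table returned 0-13 rows in under a millisecond, which is why no index or depth cap would have moved the needle.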
| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 34 | Thundering herd at the DB level | 75 pods querying same data = DB contention and connection pool exhaustion on Azure Flex app DB | DB handles the load fine (sub-ms). The thundering herd manifests at the API HTTP layer, not the DB. |
| 35 | Azure Flex app DB connection pool saturation | intel-requests-api and incidents-api saturating the app database | The saturated database is the aaa DB, not app. Traced the wrong database. |
| 36 | PgBouncer queuing on Azure Flex | Queries waiting in PgBouncer queue before reaching PostgreSQL | PgBouncer on the app DB is fine. The bottleneck is the AAA API's connection pool to the aaa DB. |
| 37 | CNPG connection metrics | Started querying CNPG pg-common connection stats | intel-requests-api and incidents-api connect to Azure Flexible Server, not CNPG. Wrong database backend entirely. |
| 38 | _incorporate_infra_metadata HTTP calls | Every GET to intel-requests-api triggers HTTP calls to AAA for organization data + user email lookup | Has an early exit when asset_size_class and infra_type are present in metadata. The actual slow path is the permissions check from intel-searcher's user-token requests. |
| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 39 | require_permission is a no-op for all callers | Stated permissions were fully bypassed via allow_service_access_tokens for all requests | Only bypassed for service tokens (notification-sender). Intel-searcher uses user tokens, triggering full AAA permissions HTTP call on every request — this is what saturates AAA. |
| 40 | Robin's MR #44 fixes the thundering herd | Described MR #44 as addressing the 75-pod redundant fetch problem | #44 only caches on send failure for retry. No cache entry exists on first delivery. All 3 open MRs (#44, #45, #46) are retry-only; none address the thundering herd. |
| 41 | Zero-row EXPLAIN results as representative | Used intel_request_ids from slow timing logs for EXPLAIN ANALYZE | Those specific IDs had no dependents (0 rows in base case). Not representative — though this accidentally proved SQL wasn't the issue. |
| 42 | ServiceNow 500s contributing to queue backup | Initial error scan flagged 33 ServiceNow Script Evaluation Exception errors | External issue on ServiceNow's GOCO scripted REST API. Auto-retried via rpc-error exchange. Completely unrelated to queue backup. |
| Category | March 5 | March 6 | Total |
|---|---|---|---|
| Missing observability / flying blind | 6 | 0 | 6 |
| Corrupted or misleading metrics | 3 | 0 | 3 |
| Wrong service / wrong layer | 0 | 5 | 5 |
| Database optimization dead end | 0 | 4 | 4 |
| Misattribution / wrong conclusion | 0 | 4 | 4 |
| False temporal correlation | 4 | 1 | 5 |
| Unrelated concurrent issues | 7 | 0 | 7 |
| Infrastructure noise | 4 | 1 | 5 |
| Tooling / deployment failures | 3 | 0 | 3 |
| Total | 29 | 13 | 42 |
- Structured logging was in place (deployed during March 5) — per-operation timing available from the start
- No concurrent infrastructure events — no Dragonfly eviction, Reloader cascade, Vector drops, or kubectx switches
- Prometheus metrics working — pushgateway fixed, gisual-prometheus-client fixed
- VictoriaLogs reliable — Vector CPU limits removed, stale config fixed
- Fewer red herrings (13 vs 29) — cleaner signal, though still spent 4+ hours on SQL optimization before checking AAA
- Still no pg_stat_statements on app DB — couldn't verify query performance without manual EXPLAIN ANALYZE
- Still no queue depth alerts — discovered the backup from logs, not alerts
- Bottom-up investigation approach — started with SQL (bottom of stack) instead of tracing from the API layer down. If we'd checked intel-requests-api request_time logs first and traced upstream to AAA, we'd have found the root cause in 30 minutes instead of 6 hours.
Both incidents share the same anti-pattern: optimizing what you can measure instead of measuring what's slow.
- March 5: No per-operation timing → deployed structured logging mid-incident → found internal API latency
- March 6: Had timing → focused on SQL (easy to EXPLAIN ANALYZE) → spent 4 hours on sub-millisecond queries → finally checked API-side logs → found AAA in 30 minutes
The correct investigation order is: trace the request from the outside in, not from the database up. Check the API's request_time first, then trace which upstream call is slow, then check that service's logs, and so on until you find where time is actually spent.
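That order can be mechanized. A small triage sketch: given structured timing records (the record shape is an assumption modeled on the request_time logs described above), find the operation where the time actually goes before reading a single query plan:

```python
# Outside-in triage sketch: rank operations by total time spent, using
# structured timing records. The record shape is assumed, modeled on the
# request_time logs described in the writeup; the data below is invented.
def dominant_operation(records: list[dict]) -> tuple[str, float]:
    """Return (operation, total_ms) for the op consuming the most time."""
    totals: dict[str, float] = {}
    for r in records:
        totals[r["op"]] = totals.get(r["op"], 0.0) + r["duration_ms"]
    return max(totals.items(), key=lambda kv: kv[1])

records = [
    {"op": "sql_query", "duration_ms": 0.4},
    {"op": "sql_query", "duration_ms": 0.7},
    {"op": "aaa_permissions_call", "duration_ms": 5200.0},
    {"op": "aaa_permissions_call", "duration_ms": 6100.0},
]
op, total_ms = dominant_operation(records)
```

Run against real timing logs, this would have flagged the AAA permissions call immediately and skipped the four hours spent on sub-millisecond SQL.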