@jmealo
Created March 6, 2026 16:47

42 Red Herrings: Notification-Sender Queue Backup (March 5-6, 2026)

Two incidents, one root cause chain, 42 documented false signals across 2 days of investigation.

Actual root cause: AAA API at 8 replicas (some crash-looping) couldn't serve permissions queries from intel-requests-api fast enough. This caused cascading request queuing through intel-requests-api and incidents-api, starving notification-sender of API capacity. SQL queries were sub-millisecond throughout.


March 5: 29 Red Herrings (5-hour investigation)

Missing Observability (flying blind)

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 1 | Datadog down | No APM, no application metrics, no alerts for the critical path | Entire APM platform was non-operational during the incident |
| 2 | Stale Vector config | Logs being dropped or missing | LLM agent deployed a 6-month-old branch; logs dual-shipped in the wrong format, breaking log-based metric queries |
| 3 | Vector CPU throttling | Log gaps = missing evidence | Silently dropping up to 12.12k events/sec on dense nodes; fix deployed before the main incident, but morning logs incomplete |
| 4 | No pg_stat_statements on app DB | Could identify slow queries | Still can't determine per-query execution times; root cause analysis blocked |
| 5 | No per-operation timing in notification-sender | Could see what's slow | Had to deploy structured logging (MR #42) mid-incident to get any visibility |
| 6 | No queue depth or consumer throughput alerts | Would have been alerted earlier | RabbitMQCriticalBacklog fired 2.5 hours after buildup started |
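Row 6 is the cheapest fix of the bunch: a queue-depth alert that fires on sustained backlog rather than a single spike. A minimal sketch of that threshold-plus-duration logic (function name, parameters, and thresholds are hypothetical, not from the incident tooling):

```python
def should_alert(depth_samples, threshold, sustained):
    """Fire only when queue depth exceeds `threshold` for `sustained`
    consecutive samples, so a transient spike does not page anyone."""
    run = 0
    for depth in depth_samples:
        run = run + 1 if depth > threshold else 0
        if run >= sustained:
            return True
    return False
```

The trade-off is the sustain window: with 2.5 hours of buildup before the existing alert fired, even a conservative window of a few samples would have paged far earlier.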

Corrupted/Misleading Metrics

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 7 | gisual-prometheus-client multi-replica overwrite | Dashboard showing accurate data | Multi-replica bug caused pods to overwrite each other's metrics |
| 8 | Pushgateway push mode | rate() on pushgateway metrics is reliable | Each pod overwrote the others; rate() produced nonsense |
| 9 | RabbitMQ dual-environment labels | Queue depth metrics are accurate | kubectx switch during deploy_all.sh created duplicate time series with a staging label for production queues |
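Row 8's failure mode is easy to reproduce in miniature. Pushgateway keys pushed metrics by job plus grouping labels and keeps only the last write, so replicas pushing under an identical key silently clobber each other. A dictionary sketch of that last-write-wins behavior (a model only, not the real implementation):

```python
# Metrics are keyed by (job, grouping labels); last write wins.
store = {}

def push(job, grouping_labels, value):
    store[(job, frozenset(grouping_labels.items()))] = value

# Two replicas push under the same grouping key: only one survives.
push("notification-sender", {}, 100)
push("notification-sender", {}, 7)   # replica B overwrites replica A

# Fix: include a per-pod label in the grouping key so series stay distinct.
push("notification-sender", {"pod": "a"}, 100)
push("notification-sender", {"pod": "b"}, 7)
```

This is also why rate() produced nonsense: the surviving series jumps between unrelated replicas' counter values, so its derivative is meaningless.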

False Signals

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 10 | "rpc-error" floods | Errors indicate a problem | Distributed lock working as designed; lock losers publish to the rpc-error exchange for retry |
| 11 | "notification-sender not processing anything" | Consumer is stuck/dead | Actually processing at ~500 msgs/5min; 10x slower than normal, but not zero |
| 12 | CTO queue drain = recovery | Queue dropped from 20K to 36; problem solved | False recovery signal; queue re-accumulated to 4.4K after the drain |
| 13 | CommandLink webhook 302s | Webhook delivery broken | Expired auth credentials; a completely separate issue |
| 14 | Envoy/Dragonfly chart confusion | Dragonfly eviction broke Redis | redis-mirror-envoy proxies Azure Redis (not Dragonfly); a shared Helm chart caused unrelated envoy rollouts |
| 15 | Node autoscaler cascading noise | Infrastructure instability | Normal pod movements; CoreDNS scaling 13->9->13 proportional to node count |
| 16 | CoreDNS NXDOMAIN errors | DNS is broken (79% NXDOMAIN) | Normal ndots:5 search-domain walking; p95 latency stable at 0.24ms |
| 17 | notification-sender metrics deployment timing | The deploy caused the issue | The metrics deployment was an observability improvement, not the cause |
| 18 | Public-api p95 spike at 8:55 PM | API latency correlated with the queue backup | A single 5-minute window; false temporal correlation |
| 19 | archive-tagger shutdown at 9:40 PM | Could be related to DB pressure | Unrelated; added noise during the investigation |
| 20 | VictoriaLogs "retention" confusion | Logs being deleted | Actually a log format change from the stale Vector config, not retention |
| 21 | Long-running query durations as "established fact" | Queries taking 21-28 minutes | Code-analysis estimates treated as observations; the max active query was actually 6.25s |
| 22 | incidents-api 401 Unauthorized | Auth system broken | Unrelated to the queue backup |
| 23 | RabbitMQ nack rejections | Message processing failures | Unclear whether from the CTO queue drain or a separate issue; not causal |
| 24 | Charter webhook latency overestimate | Webhooks taking 20-40s | 20-40s was total processing time, not webhook delivery; the actual webhook POST was <1s |
| 25 | LOG_TIMING_ENABLED forgotten on first deploy | Structured logging deployed | Had to cancel and redeploy; wasted ~30 minutes |
| 26 | gisual-log-utils extras merge bug | Structured logging ready to deploy | A library bug blocked deployment; required a fix chain across log-utils -> amqp-consumer -> notification-sender |
| 27 | Decommissioned power-outage-usca redeployed by LLM | What is this, and why is it consuming resources? | The LLM agent checked out a 6-month-old branch and deployed a decommissioned ML model to production |
| 28 | 10 consumer services deployed to feat/metrics branches | Could be causing instability | Metrics instrumentation deploys; not causal, but added to the noise floor |
| 29 | ApiWritePathErrors alert silenced for tuning | A potential early warning masked | Silenced Mar 4-11 for alert tuning; could have provided a signal |
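Row 16's NXDOMAIN flood is what a resolver configured with ndots:5 (the Kubernetes default) is supposed to produce: any name with fewer than five dots is first tried against every search domain, and each miss is a normal NXDOMAIN, not a DNS failure. A sketch of the candidate-query expansion (search domains below are illustrative):

```python
def candidate_queries(name, search_domains, ndots=5):
    """Model of resolv.conf search behavior: names with fewer than
    `ndots` dots walk the search list before the absolute query, so
    each search-domain miss is an expected NXDOMAIN."""
    if name.count(".") >= ndots:
        return [name + "."]
    return [f"{name}.{d}." for d in search_domains] + [name + "."]
```

For an external name like api.example.com with three cluster search domains, three of four lookups are guaranteed NXDOMAINs, which is how a healthy cluster reaches a "79% NXDOMAIN" rate.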

March 6: 13 Red Herrings (6-hour investigation)

Database Optimization Dead End

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 30 | Recursive CTE query is the bottleneck | WITH RECURSIVE dependency CTE is the #1 bottleneck (5-7s API time); needs a max_distance limit and query optimization | CTE executes in 0.05-0.16ms and returns 0-13 rows. Zero optimization needed. |
| 31 | All SQL queries need optimization | Internal API latency = slow database queries; dispatched 4 parallel agents to analyze SQL across intel-requests-api, incidents-api, and assets-api | Every query in the hot path is sub-millisecond. The database is not the problem. |
| 32 | Missing composite index on dependencies table | (blocker_type, blocker_id, dependent_type, dependent_id) index needed; OR condition not using indexes efficiently | Existing partial indexes dep_blocker_live_idx and dep_dependent_live_idx work perfectly; 3-10 buffer hits per query |
| 33 | Unbounded recursion depth | max_distance parameter never passed = unbounded graph traversal causing exponential blowup under Charter-scale load | Recursion returns 0-13 rows in <1ms; depth is irrelevant |
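Rows 30 and 33 can be pictured with a plain traversal: a WITH RECURSIVE dependency CTE is equivalent to a breadth-first walk of the dependencies graph, and the max_distance parameter, had it been needed, would simply cap the walk's depth. A Python sketch of that shape (the real query is SQL; the structure below is an assumption, with names echoing the notes):

```python
def dependents(graph, root, max_distance=None):
    """Breadth-first walk mirroring a recursive dependency CTE.
    `max_distance` bounds traversal depth; per the incident data the
    walk only ever returned 0-13 rows, so the bound was irrelevant."""
    seen, frontier, depth = {root}, [root], 0
    out = []
    while frontier and (max_distance is None or depth < max_distance):
        nxt = []
        for node in frontier:
            for dep in graph.get(node, []):
                if dep not in seen:
                    seen.add(dep)
                    out.append(dep)
                    nxt.append(dep)
        frontier, depth = nxt, depth + 1
    return out
```

The lesson of row 30 is that this structure only blows up on deep or wide graphs; on 0-13 rows, no amount of query tuning could have recovered the missing 5-7 seconds.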

Wrong Layer / Wrong Service

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 34 | Thundering herd at the DB level | 75 pods querying the same data = DB contention and connection pool exhaustion on the Azure Flex app DB | The DB handles the load fine (sub-ms); the thundering herd manifests at the API HTTP layer, not the DB |
| 35 | Azure Flex app DB connection pool saturation | intel-requests-api and incidents-api saturating the app database | The saturated database is the aaa DB, not app; we traced the wrong database |
| 36 | PgBouncer queuing on Azure Flex | Queries waiting in the PgBouncer queue before reaching PostgreSQL | PgBouncer on the app DB is fine; the bottleneck is the AAA API's connection pool to the aaa DB |
| 37 | CNPG connection metrics | Started querying CNPG pg-common connection stats | intel-requests-api and incidents-api connect to Azure Flexible Server, not CNPG; wrong database backend entirely |
| 38 | _incorporate_infra_metadata HTTP calls | Every GET to intel-requests-api triggers HTTP calls to AAA for organization data and user email lookup | Has an early exit when asset_size_class and infra_type are present in metadata; the actual slow path is the permissions check from intel-searcher's user-token requests |
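Row 38's early exit, sketched in Python (the function and field names follow the incident notes, but the body is an assumption about the shape of the code, not a copy of it):

```python
def incorporate_infra_metadata(record, fetch_from_aaa):
    """Skip the AAA round-trip when metadata is already populated.
    `fetch_from_aaa` stands in for the HTTP call to the AAA API."""
    meta = record.get("metadata", {})
    if meta.get("asset_size_class") and meta.get("infra_type"):
        return record                             # early exit: no AAA call
    meta.update(fetch_from_aaa(record["id"]))     # slow path: AAA round-trip
    record["metadata"] = meta
    return record
```

Because most records arrive with both fields populated, this path contributed little AAA traffic, which is why suspecting it was a dead end.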

Misattribution

| # | Red Herring | What We Thought | Reality |
|---|---|---|---|
| 39 | require_permission is a no-op for all callers | Stated that permissions were fully bypassed via allow_service_access_tokens for all requests | Only bypassed for service tokens (notification-sender). Intel-searcher uses user tokens, triggering a full AAA permissions HTTP call on every request; this is what saturates AAA |
| 40 | Robin's MR #44 fixes the thundering herd | Described MR #44 as addressing the 75-pod redundant-fetch problem | #44 only caches on send failure for retry; no cache entry exists on first delivery. All 3 open MRs (#44, #45, #46) are retry-only; none address the thundering herd |
| 41 | Zero-row EXPLAIN results as representative | Used intel_request_ids from the slow timing logs for EXPLAIN ANALYZE | Those specific IDs had no dependents (0 rows in the base case); not representative, though this accidentally proved SQL wasn't the issue |
| 42 | ServiceNow 500s contributing to queue backup | Initial error scan flagged 33 ServiceNow Script Evaluation Exception errors | External issue on ServiceNow's GOCO scripted REST API, auto-retried via the rpc-error exchange; completely unrelated to the queue backup |
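Row 39's distinction, sketched (names follow the incident notes; the control flow is an assumption): the bypass applies only to service tokens, so every user-token request still pays for an AAA round-trip.

```python
def require_permission(token, check_with_aaa, allow_service_access_tokens=True):
    """Service tokens (e.g. notification-sender) skip the AAA call;
    user tokens (e.g. intel-searcher) hit AAA on every request.
    `check_with_aaa` stands in for the permissions HTTP call."""
    if allow_service_access_tokens and token["type"] == "service":
        return True                  # bypass: no AAA HTTP call
    return check_with_aaa(token)     # user token: full permissions check
```

Reading only the bypass branch is what produced the "no-op for all callers" conclusion; the user-token branch is the one that saturated AAA.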

Pattern Analysis

By Category (across both days)

| Category | March 5 | March 6 | Total |
|---|---|---|---|
| Missing observability / flying blind | 6 | 0 | 6 |
| Corrupted or misleading metrics | 3 | 0 | 3 |
| Wrong service / wrong layer | 0 | 5 | 5 |
| Database optimization dead end | 0 | 4 | 4 |
| Misattribution / wrong conclusion | 0 | 4 | 4 |
| False temporal correlation | 4 | 1 | 5 |
| Unrelated concurrent issues | 7 | 0 | 7 |
| Infrastructure noise | 4 | 1 | 5 |
| Tooling / deployment failures | 3 | 0 | 3 |
| **Total** | **29** | **13** | **42** |

What Improved Between March 5 and March 6

  • Structured logging was in place (deployed during March 5) — per-operation timing available from the start
  • No concurrent infrastructure events — no Dragonfly eviction, Reloader cascade, Vector drops, or kubectx switches
  • Prometheus metrics working — pushgateway fixed, gisual-prometheus-client fixed
  • VictoriaLogs reliable — Vector CPU limits removed, stale config fixed
  • Fewer red herrings (13 vs 29) — cleaner signal, though still spent 4+ hours on SQL optimization before checking AAA

What Didn't Improve

  • Still no pg_stat_statements on app DB — couldn't verify query performance without manual EXPLAIN ANALYZE
  • Still no queue depth alerts — discovered the backup from logs, not alerts
  • Bottom-up investigation approach — started with SQL (bottom of stack) instead of tracing from the API layer down. If we'd checked intel-requests-api request_time logs first and traced upstream to AAA, we'd have found the root cause in 30 minutes instead of 6 hours.

The Meta-Lesson

Both incidents share the same anti-pattern: optimizing what you can measure instead of measuring what's slow.

  • March 5: No per-operation timing → deployed structured logging mid-incident → found internal API latency
  • March 6: Had timing → focused on SQL (easy to EXPLAIN ANALYZE) → spent 4 hours on sub-millisecond queries → finally checked API-side logs → found AAA in 30 minutes

The correct investigation order is: trace the request from the outside in, not from the database up. Check the API's request_time first, then trace which upstream call is slow, then check that service's logs, and so on until you find where time is actually spent.
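That outside-in order can be mechanized: with cumulative request_time at each layer of the call chain, the hop whose self time (its total minus its upstream's total) dominates is where to look next. A sketch with illustrative numbers (the layer names and timings below are hypothetical):

```python
def slowest_hop(trace):
    """`trace` is an outside-in list of (layer, cumulative seconds).
    Each hop's self time is its total minus its upstream's total;
    the largest self time marks where the investigation should go next."""
    self_times = {}
    for i, (layer, total) in enumerate(trace):
        upstream_total = trace[i + 1][1] if i + 1 < len(trace) else 0.0
        self_times[layer] = total - upstream_total
    return max(self_times, key=self_times.get)
```

Applied to March 6, this discipline points straight past the sub-millisecond database and at the AAA layer in one pass.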
