@denniswon
Created March 6, 2026 09:42
Reliability Analysis & Plan Review
---
★ Insight ─────────────────────────────────────
This is a thorough first-pass reliability audit by a new team member. The approach — analysis doc identifying gaps, then a plan
doc with actionable tickets — is solid. I'm reviewing for factual accuracy against the actual codebase, priority correctness, and
any gaps the author may have missed.
─────────────────────────────────────────────────
---
Overall Assessment
The analysis is well-structured, mostly accurate, and shows genuine depth of understanding for someone new to the codebase. The
two-tier monitoring model (Terraform for oncall, Datadog UI for dev) is particularly mature thinking. The priority ordering is
sensible for testnet-prod readiness.
However, there are several factual inaccuracies, missing context, and a few items I'd challenge on priority or approach.
---
Factual Accuracy Check
Confirmed Accurate
┌───────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────┐
│ Claim │ Verified │
├───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────┤
│ Health check is shallow │ Correct. Returns {"status": "healthy", "timestamp": ...} unconditionally — no DB, Redis, or │
│ (handlers.rs:68-73) │ operator pool checks. │
├───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────┤
│ No OpenTelemetry/Datadog APM │ Correct. Zero .rs or .toml files reference opentelemetry, tracing-opentelemetry, or datadog │
│ tracing │ trace exporters. DD_APM_ENABLED=true is set but no trace data flows from the app. │
├───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────┤
│ CancellationToken for │ Correct. Used across 15+ files in gateway, challenger, transporter, chain-watcher, │
│ graceful shutdown │ aggregator. Well-implemented. │
├───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────┤
│ Circuit breaker exists and is │ Correct. crates/gateway/src/task/circuit_breaker.rs exists with proper 3-state │
│ well-implemented │ (closed/open/half-open) implementation. │
├───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────┤
│ Channel resilience exists │ Correct. crates/gateway/src/task/channel_resilience.rs exists, used in notification.rs for │
│ │ webhook delivery with retry + backoff. │
├───────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────┤
│ Error classification │ Correct. crates/gateway/src/rpc/api/tx_worker.rs has is_retryable logic for transaction │
│ (transient vs permanent) │ submission. │
└───────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────┘
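For context on the pattern the table credits, a minimal closed/open/half-open breaker looks roughly like this — a generic Python sketch for illustration, not the actual crates/gateway/src/task/circuit_breaker.rs implementation (thresholds and timeouts are made up):

```python
import time

class CircuitBreaker:
    """Minimal three-state (closed/open/half-open) circuit breaker sketch."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout_s:
                self.state = "half-open"   # allow a single trial request
            else:
                raise RuntimeError("circuit open")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # a half-open failure, or too many closed failures, opens the circuit
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"          # any success fully closes the circuit
            return result

cb = CircuitBreaker(failure_threshold=2, reset_timeout_s=60)
def boom():
    raise ConnectionError("rpc down")
for _ in range(2):
    try:
        cb.call(boom)
    except ConnectionError:
        pass
print(cb.state)  # open
```

The value of the three-state design is the half-open trial: after the reset timeout, exactly one request probes the dependency instead of a thundering herd.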
Inaccurate or Misleading
Claim: "150+ Prometheus metrics"
Issue: Overstated. I count 69 describe_*! macro invocations in crates/metrics/src/lib.rs. Some metrics have label dimensions
(chain_id, status, etc.) which create multiple time series per metric, but the doc says "150+ metrics definitions" — this should
say ~70 metric definitions (which generate many more time series with labels). The difference matters for the Terraform monitors
proposal — you're monitoring ~70 distinct metric names, not 150+.
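A quick illustration of why ~70 definitions still produce far more time series — the metric name and label values below are hypothetical, not taken from crates/metrics:

```python
from itertools import product

# One metric definition with two label dimensions (illustrative values).
chain_ids = ["1", "8453", "42161"]   # 3 chains
statuses = ["success", "failure"]    # 2 outcomes

# Every unique label combination is a distinct time series in Prometheus/Datadog.
series = [
    {"metric": "tx_submission_total", "chain_id": c, "status": s}
    for c, s in product(chain_ids, statuses)
]
print(len(series))  # 6 series from a single definition
```

Monitors, however, are written against metric *names*, so ~70 is the number that matters for scoping the Terraform work.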
────────────────────────────────────────
Claim: "90+ metrics" in Executive Summary vs "150+ metrics" later
Issue: Internal inconsistency. The doc says "90+" in Section 1 and "150+" in Section 5. Pick one (the actual number is ~69
definitions).
────────────────────────────────────────
Claim: "Worker queue config at gateway.toml:82-89"
Issue: Needs verification. The DLQ doc quotes worker_count = 4 from gateway.toml, but the actual worker queue pattern in the
codebase uses TransactionSubmissionWorker (in tx_worker.rs) which is sequential by design for nonce safety (see performance.md).
The doc seems to conflate the transaction submission worker with a general "worker queue" — these are different things with
different failure semantics.
Cannot Verify (deploy repo not available locally)
The following claims reference newton-prover-avs-deploy which isn't on this machine. The line numbers and code snippets in the doc
look plausible, but I can't independently confirm:
- Termination protection disabled (base_stack.py:33)
- ALB health check 200-499 (gateway_service.py:136)
- min_healthy_percent=25 (ecs_service.py:306)
- Testnet Datadog disabled (environments.py:169-170, 213-214)
- Stagef Redis single-node
- SNS topics unwired
- Sidecar images unpinned
Question for author: Can you share access to the deploy repo so these can be verified? Or confirm these line numbers are current
as of the analysis date?
---
P0 Items — Critical Assessment
1. Datadog Monitors as Code (Terraform) — AGREE P0
Good call. The two-tier model (Terraform for oncall, UI for dev) is pragmatic and avoids the common trap of trying to codify
everything. The ~25-30 monitor target is reasonable.
Concern: The Terraform implementation plan proposes S3 + DynamoDB state backend. For a team new to Terraform, this adds
operational overhead. Consider using Terraform Cloud free tier initially — it handles state locking and provides a UI for plan
review.
Suggestion: Start with 5-10 critical monitors, not 25-30. For testnet-prod readiness, you need: ECS task count, gateway P99
latency, quorum failures, circuit breaker state, and ALB 5xx. The rest can come in a fast follow.
2. Enable Datadog for Testnet — AGREE P0
This is a no-brainer. The risk assessment ("None to application behavior") is correct. One-line config change per environment.
Question: The doc says redis_secret_arn=None for testnet too. Is this because testnet doesn't use Redis, or because it's another
missing config? If testnet-prod runs 10 gateways and the gateway uses Redis for caching/dedup/chain-watcher filtering, missing
Redis would mean those features are silently disabled.
3. Deep Health Check Endpoint — AGREE P0, with caveats
The proposed /health/ready design is solid. The check list (DB, Redis, operator pool, worker queue) is correct.
Caveats:
- Redis check should be non-fatal. The gateway's multi-tier cache (Moka L1 + Redis L2) is designed to degrade gracefully — if
Redis is down, L1 cache still works. Making Redis failure = unhealthy would cause unnecessary instance rotation. Use a degraded
status instead.
- Operator pool check needs multichain awareness. In multichain mode, there's a per-chain ChainService with its own operator_pool.
The health check should verify operators exist for each configured chain, not just check a global pool. This is a nuance the doc
misses.
- Add timeout to dependency probes. A hanging DB connection check could block the health endpoint itself. Each probe should have a
2-3 second timeout.
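The caveats above can be combined into a small readiness sketch — Python for illustration only (the real endpoint would be Rust in the gateway); the probe names, the 2-second timeout, and the degraded/unhealthy classification are assumptions:

```python
import concurrent.futures as cf

PROBE_TIMEOUT_S = 2.0  # per-probe timeout so one hung dependency can't stall /health/ready

def check_readiness(probes, non_fatal=frozenset({"redis"})):
    """Run dependency probes concurrently and classify overall status.

    probes: dict of name -> zero-arg callable returning True when healthy.
    Failures confined to non-fatal probes yield "degraded" instead of "unhealthy".
    """
    results = {}
    with cf.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in probes.items()}
        for name, fut in futures.items():
            try:
                results[name] = bool(fut.result(timeout=PROBE_TIMEOUT_S))
            except Exception:          # probe error or timeout counts as a failure
                results[name] = False
    failed = {name for name, ok in results.items() if not ok}
    if not failed:
        return "healthy", results
    if failed <= non_fatal:
        return "degraded", results     # e.g. Redis down but L1 (Moka) cache still serves
    return "unhealthy", results

# Per-chain operator pool checks are just additional probes, one per configured chain:
status, detail = check_readiness({
    "db": lambda: True,
    "redis": lambda: False,            # simulate a Redis outage
    "operators_chain_1": lambda: True,
    "operators_chain_8453": lambda: True,
})
print(status)  # degraded
```

The key design choice: "degraded" still returns HTTP 200 to the ALB (no instance rotation), while "unhealthy" returns 503.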
Missing from the proposal: The doc correctly identifies the ALB healthy_http_codes="200-499" problem but the /health/ready fix
only partially addresses it. The ALB change should be a separate item — changing from 200-499 to 200 affects ALL responses to the
health path, not just readiness logic.
4. Enable Termination Protection — AGREE P0
Trivially correct. One-line fix, zero risk. Should have been done already.
---
P1 Items — Assessment
5. Verify ALB 4xx Responses — AGREE P1, but the Plan doc softened this too much
The Analysis doc correctly calls this out as a problem. The Plan doc says "This is by design (4xx's can be noisy from normal
client errors)" — but that's the health check HTTP codes, not the ALB target codes. A health check returning 4xx should NOT be
considered healthy. The original analysis was right: change to 200 (or 200-299).
6. Deployment Notifications — AGREE P1
Low effort, good visibility. No concerns.
7. OpenTelemetry APM Tracing — AGREE P1, but the approach needs refinement
The doc proposes opentelemetry-datadog crate, but this has been deprecated in favor of the OTLP exporter with Datadog agent
intake. The Datadog agent already supports OTLP on port 4318 (HTTP) or 4317 (gRPC).
Better approach:
opentelemetry = { version = "0.22", features = ["trace"] }
opentelemetry-otlp = "0.15"
tracing-opentelemetry = "0.23"
This is more future-proof and avoids the deprecated opentelemetry-datadog crate.
8. Increase min_healthy_percent — AGREE P1
Going from 25% to 100% is the right target for zero-downtime rolling updates with 10 gateways. But 100% requires
max_healthy_percent > 100 (typically 200%) to allow new tasks to start before old ones drain. The doc doesn't mention this —
ensure max_healthy_percent is set correctly or you'll get stuck deploys.
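A quick sanity check of the capacity math — per my reading of the ECS deployment configuration docs, the minimum healthy count rounds up and the maximum rounds down; the fleet sizes are illustrative:

```python
import math

def ecs_deploy_bounds(desired, min_healthy_pct, max_healthy_pct):
    """Task-count window ECS must stay within during a rolling update."""
    lower = math.ceil(desired * min_healthy_pct / 100)   # must keep running
    upper = math.floor(desired * max_healthy_pct / 100)  # may launch up to
    return lower, upper

# Current config: 10 gateways at min 25% -> ECS may drop to 3 running tasks.
print(ecs_deploy_bounds(10, 25, 400))   # (3, 40)
# Proposed: min 100% with max 200% -> new tasks start before old ones drain.
print(ecs_deploy_bounds(10, 100, 200))  # (10, 20)
```

With min at 100% and max at 100%, the window collapses to exactly `desired` tasks and the deploy cannot make progress, which is the stuck-deploy failure mode described above.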
9. Pin Sidecar Image Versions — AGREE P1
Correct and low-effort. No concerns.
---
P2 Items — Assessment
10. Dead Letter Queue — PARTIALLY AGREE P2
The concept is right, but the doc's framing needs correction:
- The "worker queue" in Newton is primarily the TransactionSubmissionWorker — it processes on-chain transaction submissions
sequentially for nonce safety. Failed transactions after retry are logged with full context (tx hash, error, task_id). This isn't
"silently dropped" — it's logged and counted via the tx_submission_exhausted_total metric.
- A Redis DLQ for failed transaction submissions is useful for replay, but the doc should be specific about which worker queue it
targets. The gateway has multiple async processing paths (task broadcast, BLS aggregation, tx submission, webhook delivery) and
they have different failure semantics.
Recommendation: Scope this to transaction submission failures specifically. A Redis list with serialized SubmissionTask payloads +
failure reason would enable manual replay via CLI.
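A hedged sketch of that record shape and replay flow, using an in-memory deque to stand in for a Redis list (LPUSH/RPOP semantics); the field names (task_id, payload, reason, attempts) are illustrative, not the actual SubmissionTask schema:

```python
import json
from collections import deque

dlq = deque()  # stand-in for a Redis list key like "dlq:tx_submission"

def push_failed_submission(task_id, payload, reason, attempts):
    """Persist a failed submission for later manual replay (LPUSH equivalent)."""
    record = json.dumps({
        "task_id": task_id,
        "payload": payload,     # serialized SubmissionTask would go here
        "reason": reason,
        "attempts": attempts,
    })
    dlq.appendleft(record)

def replay_oldest():
    """CLI replay path: pop the oldest record and return it (RPOP equivalent)."""
    if dlq:
        return json.loads(dlq.pop())
    return None

push_failed_submission("task-42", {"to": "0xabc"}, "nonce too low", 5)
rec = replay_oldest()
print(rec["task_id"])  # task-42
```

Keeping the failure reason and attempt count in the record makes triage possible before replay, rather than blindly resubmitting.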
11. AWS WAF on ALB — AGREE P2
Standard practice. The "count mode first, then block" rollout strategy is correct.
12. Expose DB Connection Pool Metrics — AGREE P2
The deadpool-postgres pool has .status() which returns pool_size, available, waiting. Exposing these as Prometheus gauges is
straightforward. Good item but correctly P2.
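As a sketch of the exposition step — the field names follow the pool status described above, and the output format is illustrative Prometheus text exposition, not wired to the actual metrics crate:

```python
def pool_status_gauges(status: dict, prefix: str = "db_pool") -> list[str]:
    """Render a pool status snapshot as Prometheus-style gauge lines."""
    return [f"{prefix}_{key} {value}" for key, value in sorted(status.items())]

# Snapshot shape mirroring what a pool .status() call might report (values made up).
snapshot = {"size": 16, "available": 12, "waiting": 0}
for line in pool_status_gauges(snapshot):
    print(line)
# db_pool_available 12
# db_pool_size 16
# db_pool_waiting 0
```

A nonzero `waiting` gauge is the early-warning signal here: requests queueing for a connection precede latency spikes.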
---
What's Missing from Both Documents
1. Multichain-Specific Reliability Concerns (should be P1)
The analysis doesn't address multichain mode, which is the default, first-class mode for the gateway. Key gaps:
- Per-chain health: A gateway can be healthy for chain A but unhealthy for chain B (e.g., chain B's RPC is down). The health check
needs per-chain status.
- Cross-chain data isolation: PR #419 (recently merged) added chain_id columns to DB tables. The reliability plan should verify
this migration completed and test for cross-chain data leakage.
- Chain watcher reliability: Each chain watcher is single-instance (the doc mentions challenger/watcher redundancy as P2, but
doesn't separate these — chain watcher failure means missed on-chain events which is arguably more critical than challenger
failure since challenges have a 100-block window).
2. Redis as a Single Point of Failure in the Watcher Path (should be P1)
The chain watcher uses Redis SET for deduplication (SISMEMBER). If Redis is unavailable, the watcher can't filter
gateway-originated tasks from direct on-chain tasks. This could cause duplicate processing. The doc mentions Redis failover for
prod but doesn't flag this specific failure mode.
3. Nonce Management Across 10 Gateways (should be documented)
The analysis notes "each gateway needs a unique signer key" for nonce management but doesn't flag the operational complexity: if a
gateway crashes mid-transaction, its nonce can get stuck. There's no documented recovery procedure. This should at least be a
runbook item.
4. BLS Key Rotation / Recovery (not mentioned)
If an operator's BLS key is compromised or lost, there's no documented key rotation procedure. This is more security than
reliability, but key loss = permanent operator loss.
5. Database Backup Verification (mentioned but not actioned)
The analysis notes "No automated backups visible in the deploy code" and "RDS appears to be managed outside CDK" but neither doc
proposes verifying that backups exist and are tested. This should be P0 — a production database with no verified backups is a
critical risk.
---
Questions for the Author
1. Worker queue specificity: Which worker queue(s) does the DLQ proposal target? The TransactionSubmissionWorker in tx_worker.rs?
The webhook notification queue? Both?
2. Redis in testnet: Is redis_secret_arn=None for testnet because Redis isn't deployed, or because the secret isn't configured? If
Redis isn't deployed, several gateway features (L2 cache, chain watcher dedup, request dedup) are silently disabled.
3. RDS ownership: You note RDS isn't in CDK. Where is it managed? Is there a separate Terraform/CloudFormation stack? Are
automated backups confirmed? This needs to be resolved before testnet-prod load testing.
4. Metric count: Can you recount? I see ~69 describe_*! invocations in crates/metrics/src/lib.rs, not 90+ or 150+. The docs
reference "150+ Prometheus metrics" in one place and "90+" in another.
5. opentelemetry-datadog vs OTLP: The OTel proposal uses the deprecated opentelemetry-datadog crate. Are you aware of the OTLP
alternative via the Datadog agent? This is the officially recommended path.
6. Multichain health: The health check proposal doesn't account for per-chain health in multichain mode. Was this an intentional
simplification for v1?
---
Recommended Critical-Only Scope for Testnet-Prod
I agree with the author's recommendation to focus on P0 for testnet-prod. Here's my suggested ordering:
1. Enable termination protection — 5 minutes, zero risk, do it now
2. Enable Datadog for testnet — 1 config change, enables all subsequent monitoring
3. Deep health check — /health/ready with DB + Redis (degraded) + operator pool (per-chain) checks
4. 5-10 critical Datadog monitors — ECS health, gateway P99, quorum failures, circuit breaker, ALB 5xx (defer the full Terraform
module; start with Datadog UI for speed, formalize in Terraform afterward)
Item 4 is a pragmatic deviation from the plan. Standing up a full Terraform + S3 backend + CI/CD pipeline for monitors is "medium
effort" as the doc says, but for testnet-prod urgency, creating 5 monitors in the Datadog UI takes 30 minutes. Formalize them in
Terraform as a fast follow.
★ Insight ─────────────────────────────────────
- The two-tier monitoring model (Terraform for oncall-paging, Datadog UI for dev) is a mature pattern that avoids the common
"codify everything or nothing" trap. This is one of the strongest parts of the proposal.
- The analysis correctly identifies the "blind testnet-prod" as the highest-risk gap — running at mainnet scale with zero
observability means you can't validate releases before mainnet.
- The biggest gap in the analysis is multichain awareness. Newton's gateway is multichain-first (all gateways go through
from_multi_chain_config), but the reliability analysis treats it as single-chain. The health check, operator pool monitoring, and
chain watcher redundancy all need per-chain thinking.
─────────────────────────────────────────────────
@denniswon
Author


Updated Review — Deploy Repo Claims Verified

All deploy repo claims are now confirmed against actual code. Here's the complete verification summary:

Deploy Repo — All Claims Confirmed

┌─────┬──────────────────────────┬──────────┬─────────────────────────────────────────────────────────────────────────────────┐
│ # │ Claim │ Verdict │ Evidence │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 1 │ Termination protection │ ACCURATE │ base_stack.py:33 — termination_protection=False with TODO comment │
│ │ disabled │ │ │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 2 │ ALB health check accepts │ ACCURATE │ gateway_service.py:136 — healthy_http_codes="200-499". Also confirmed on │
│ │ 200-499 │ │ operator: operator_service.py:161 │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 3 │ min_healthy_percent=25 │ ACCURATE │ ecs_service.py:306 — min_healthy_percent=25. Note: max_healthy_percent=400 is │
│ │ │ │ already set, which I missed in my original review │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ │ │ │ environments.py:169,213 — both testnet-stagef and testnet-prod have │
│ 4 │ Testnet lacks Datadog │ ACCURATE │ datadog_secret_arn=None. Mainnet environments require it (line 360-361 raises │
│ │ │ │ ValueError) │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 5 │ Stagef Redis single-node │ ACCURATE │ data_stack.py:112-114 — num_clusters=1, multi_az=False for non-prod │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 6 │ SNS topics not wired │ ACCURATE │ monitoring_stack.py:33-34 — alarm_email=None, slack_webhook_url=None defaults. │
│ │ │ │ Topics are created but have no subscribers │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 7 │ No Terraform monitors │ ACCURATE │ Zero .tf or .tfvars files anywhere in the deploy repo │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 8 │ Datadog sidecar config │ ACCURATE │ ecs_service.py:168-207 — full Datadog agent sidecar with Prometheus scraping, │
│ │ │ │ APM, logs │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 9 │ Gateway auto-scaling │ ACCURATE │ gateway_service.py:222 — enable_auto_scaling=False with max_capacity=1 per │
│ │ disabled │ │ gateway instance │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 10 │ Single operator security │ ACCURATE │ network_stack.py:135-142 — one OperatorSecurityGroup shared by all operators. │
│ │ group │ │ Per-operator ALB SGs exist separately but the task SG is shared │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 11 │ No RDS in CDK │ ACCURATE │ data_stack.py contains only Redis. Zero RDS/Aurora references │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ 12 │ Deployment circuit │ ACCURATE │ ecs_service.py:305 — circuit_breaker=DeploymentCircuitBreaker(rollback=True) │
│ │ breaker enabled │ │ │
├─────┼──────────────────────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ │ │ │ monitoring_stack.py:120-121 — │
│ 13 │ Sidecar images unpinned │ ACCURATE │ AGENT_IMAGE="public.ecr.aws/datadog/agent:latest", │
│ │ │ │ FLUENT_BIT_IMAGE="amazon/aws-for-fluent-bit:stable" │
└─────┴──────────────────────────┴──────────┴─────────────────────────────────────────────────────────────────────────────────┘

AVS Repo — Verification Summary

┌─────┬──────────────────────────────────┬────────────┬──────────────────────────────────────────────────────────────────┐
│ # │ Claim │ Verdict │ Evidence │
├─────┼──────────────────────────────────┼────────────┼──────────────────────────────────────────────────────────────────┤
│ 1 │ Shallow health check │ ACCURATE │ handlers.rs:68-73 — returns {"status":"healthy"} unconditionally │
├─────┼──────────────────────────────────┼────────────┼──────────────────────────────────────────────────────────────────┤
│ 2 │ No OTel/APM tracing │ ACCURATE │ Zero files with opentelemetry in any .rs or .toml │
├─────┼──────────────────────────────────┼────────────┼──────────────────────────────────────────────────────────────────┤
│ 3 │ "150+ metrics" │ OVERSTATED │ 69 describe_*! macro invocations. Not 90+ or 150+ │
├─────┼──────────────────────────────────┼────────────┼──────────────────────────────────────────────────────────────────┤
│ 4 │ CancellationToken shutdown │ ACCURATE │ Used across 15 files in all services │
├─────┼──────────────────────────────────┼────────────┼──────────────────────────────────────────────────────────────────┤
│ 5 │ Circuit breaker well-implemented │ ACCURATE │ Full 3-state implementation in circuit_breaker.rs │
├─────┼──────────────────────────────────┼────────────┼──────────────────────────────────────────────────────────────────┤
│ 6 │ Channel resilience │ ACCURATE │ channel_resilience.rs exists, used by notification webhooks │
├─────┼──────────────────────────────────┼────────────┼──────────────────────────────────────────────────────────────────┤
│ 7 │ Error classification │ ACCURATE │ tx_worker.rs has retryable error classification │
└─────┴──────────────────────────────────┴────────────┴──────────────────────────────────────────────────────────────────┘

Corrections to My Initial Review

  1. max_healthy_percent=400 — I flagged that increasing min_healthy_percent to 100% requires setting max_healthy_percent > 100.
    It's already set to 400%, so the author's proposal to just change min_healthy_percent will work without additional changes. Good.
  2. Operator security group nuance — The analysis says "single operator security group shared across all operators" and calls it a
    risk. Looking at the code more carefully: there's one shared task-level SG (operator), but each operator gets its own ALB SG
    (operator_alb_1, operator_alb_2, etc.). The shared task SG is standard practice for services with identical network requirements.
    The analysis slightly overstates the risk — a misconfiguration would affect all operators' outbound/inbound at the task level, but
    ALB-level access control is already per-operator. I'd downgrade this from a concern to a documentation item.
  3. redis_secret_arn=None for testnet — Confirmed alongside datadog_secret_arn=None. This means testnet environments have no Redis
    either, which means L2 cache, chain watcher dedup, and request dedup are all disabled in testnet. The analysis mentions this but
    doesn't call it out as a separate risk. For testnet-prod load testing, this is significant — you're testing a different code path
    than mainnet will use.

Additional Finding: healthy_http_codes="200-499" on Operators Too

The analysis only mentions the gateway ALB health check, but operator_service.py:161 has the same healthy_http_codes="200-499"
pattern. The Plan doc's health check fix should cover both gateway and operator ALBs.


Final Verdict

This is solid work for a new hire's first reliability audit. The factual claims are almost entirely accurate (the metrics count
being the main exception). The priority ordering is sensible. The two-tier monitoring model shows mature operational thinking.

For testnet-prod readiness, I agree with focusing on P0 items with my earlier recommended ordering:

  1. Enable termination protection (5 min, do now)
  2. Enable Datadog for testnet (config change)
  3. Deep health check with per-chain awareness
  4. Start with 5-10 critical monitors in Datadog UI, formalize in Terraform as fast follow

Key items to add to the plan:

  • Fix healthy_http_codes on operators too, not just gateway
  • Address redis_secret_arn=None for testnet (this affects load testing validity)
  • Add multichain-aware health checks (per-chain operator pool, per-chain RPC status)
  • Correct the metrics count (69 definitions, not 150+)
  • Use OTLP exporter instead of deprecated opentelemetry-datadog crate for the P1 APM tracing item

★ Insight ─────────────────────────────────────

  • The deploy repo is well-structured — clean separation between config/environments.py (env-specific values), cdk_constructs/
    (reusable CDK constructs), and stacks/ (stack composition). This makes reliability improvements straightforward to implement
    without touching unrelated code.
  • The datadog_secret_arn pattern is clever: mainnet environments require it (ValueError if None), while testnet environments
    silently degrade. This was intentional cost savings, but the analysis correctly identifies that testnet-prod has outgrown this —
    running at mainnet scale means it needs mainnet-level observability.
  • The max_healthy_percent=400 setting is forward-thinking — it allows ECS to spin up 4x the desired count during deployments.
    Combined with raising min_healthy_percent to 100%, this enables true zero-downtime rolling updates.
    ─────────────────────────────────────────────────

Both background agents have completed and confirmed my findings. All 13 deploy repo claims: ACCURATE. AVS repo: 8/10 accurate, 2
inaccurate (metrics count).

The background agents also surfaced one additional detail I should note:

Redis graceful degradation actually exists — the agent found that mark_task_seen_by_watcher() in handler/mod.rs holds an
Option<redis::aio::ConnectionManager> and issues Redis writes fire-and-forget, emitting tracing::warn! on failure. So the analysis's
claim under Section 5e ("unclear if cache bypass happens when Redis fails") is inaccurate — the gateway already handles Redis
failures gracefully in the watcher path. However, the L2 cache path (Moka + Redis hybrid) should also be verified for graceful
degradation. This is a minor correction to the analysis.
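That fire-and-forget pattern can be sketched as follows — a Python stand-in for illustration; mark_task_seen and the stub class are hypothetical, modeling the behavior attributed to mark_task_seen_by_watcher():

```python
import logging

logger = logging.getLogger("watcher")

def mark_task_seen(redis_conn, task_hash) -> bool:
    """Best-effort dedup write: never fails the task path on Redis errors."""
    if redis_conn is None:              # Redis not configured (e.g. testnet)
        return False
    try:
        redis_conn.sadd("watcher:seen", task_hash)
        return True
    except Exception as exc:
        # degrade gracefully: log at warn level, keep processing the task
        logger.warning("dedup write failed, continuing: %s", exc)
        return False

class FlakyRedis:                       # stand-in that always fails
    def sadd(self, *_):
        raise ConnectionError("redis down")

print(mark_task_seen(None, "0xdead"))          # False: no Redis configured
print(mark_task_seen(FlakyRedis(), "0xdead"))  # False: failure swallowed, logged
```

The trade-off is explicit: when the write fails, the dedup record is lost, so a duplicate task may be processed later — availability over exactness, which is the right call for this path.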

Worker queue DLQ verification — confirmed that tx_worker.rs:224-232 drops tasks after max retries with an error!() log but no
persistence. The analysis is correct here, though as I noted earlier, framing this as a "worker queue" is somewhat misleading —
it's specifically the transaction submission worker, not a general task queue.

@denniswon
Author

★ Insight ─────────────────────────────────────
All Notion review comments have been posted across 7 pages. The review covers both code-verified corrections (metric count, Redis
graceful degradation, operator SG) and architectural guidance (multichain awareness, OTLP over deprecated crate, DLQ scoping).
─────────────────────────────────────────────────

Comment Summary

Analysis doc (4 comments):

  1. Redis testnet gap — redis_secret_arn=None blocks deep health checks and DLQ
  2. Redis graceful degradation correction — gateway already handles Redis failure via an Option<redis::aio::ConnectionManager>
  3. Operator SG nuance — per-operator ALB SGs exist alongside shared task SG
  4. Missing multichain awareness — per-chain ChainService needs per-chain health granularity

Plan doc (1 comment):
5. Overall review — recommended P0 ordering, missing items (Redis testnet, multichain, nonce management), metric count correction,
operator ALB gap, OTLP recommendation

Sub-pages (7 comments):
6. Deep Health Check — Redis non-fatal, multichain operator pools, probe timeouts, liveness vs readiness
7. OTel APM — use opentelemetry-otlp not deprecated opentelemetry-datadog, with config snippets
8. DLQ — scope to TransactionSubmissionWorker in tx_worker.rs, Redis list sufficient for testnet
9. Datadog Monitors — start with 5-10 UI monitors (not 25-30 Terraform), consider TF Cloud free tier
10. Enable Datadog Testnet — pin agent/fluent-bit image tags, Redis metrics won't exist yet
11. Termination Protection — easiest P0 win, also check CF stack update policies for stateful resources
12. Monitoring Workflow — defer full runbook to P1, set alarm_email/slack_webhook_url first

Total: 12 comments across all pages.
