Reliability Analysis & Plan Review
---
★ Insight ─────────────────────────────────────
This is a thorough first-pass reliability audit by a new team member. The approach — an analysis doc identifying gaps, then a plan doc with actionable tickets — is solid. I'm reviewing for factual accuracy against the actual codebase, priority correctness, and any gaps the author may have missed.
─────────────────────────────────────────────────
---
Overall Assessment
The analysis is well-structured, mostly accurate, and shows genuine depth of understanding for someone new to the codebase. The two-tier monitoring model (Terraform for oncall, Datadog UI for dev) is particularly mature thinking. The priority ordering is sensible for testnet-prod readiness.
However, there are several factual inaccuracies, missing context, and a few items I'd challenge on priority or approach.
---
Factual Accuracy Check
Confirmed Accurate
| Claim | Verified |
| --- | --- |
| Health check is shallow (handlers.rs:68-73) | Correct. Returns `{"status": "healthy", "timestamp": ...}` unconditionally — no DB, Redis, or operator pool checks. |
| No OpenTelemetry/Datadog APM tracing | Correct. Zero .rs or .toml files reference opentelemetry, tracing-opentelemetry, or datadog trace exporters. DD_APM_ENABLED=true is set but no trace data flows from the app. |
| CancellationToken for graceful shutdown | Correct. Used across 15+ files in gateway, challenger, transporter, chain-watcher, aggregator. Well-implemented. |
| Circuit breaker exists and is well-implemented | Correct. crates/gateway/src/task/circuit_breaker.rs exists with a proper 3-state (closed/open/half-open) implementation. |
| Channel resilience exists | Correct. crates/gateway/src/task/channel_resilience.rs exists, used in notification.rs for webhook delivery with retry + backoff. |
| Error classification (transient vs permanent) | Correct. crates/gateway/src/rpc/api/tx_worker.rs has is_retryable logic for transaction submission. |
Inaccurate or Misleading
Claim: "150+ Prometheus metrics"
Issue: Overstated. I count 69 describe_*! macro invocations in crates/metrics/src/lib.rs. Some metrics have label dimensions (chain_id, status, etc.) that create multiple time series per metric, but the doc says "150+ metrics definitions" — it should say ~70 metric definitions (which generate many more time series once labels are applied). The difference matters for the Terraform monitors proposal: you're monitoring ~70 distinct metric names, not 150+.
────────────────────────────────────────
Claim: "90+ metrics" in the Executive Summary vs "150+ metrics" later
Issue: Internal inconsistency. The doc says "90+" in Section 1 and "150+" in Section 5. Pick one (the actual number is ~69 definitions).
────────────────────────────────────────
Claim: "Worker queue config at gateway.toml:82-89"
Issue: Needs verification. The DLQ doc quotes worker_count = 4 from gateway.toml, but the actual worker queue pattern in the codebase uses TransactionSubmissionWorker (in tx_worker.rs), which is sequential by design for nonce safety (see performance.md). The doc seems to conflate the transaction submission worker with a general "worker queue" — these are different things with different failure semantics.
Cannot Verify (deploy repo not available locally)
The following claims reference newton-prover-avs-deploy, which isn't on this machine. The line numbers and code snippets in the doc look plausible, but I can't independently confirm:
- Termination protection disabled (base_stack.py:33)
- ALB health check 200-499 (gateway_service.py:136)
- min_healthy_percent=25 (ecs_service.py:306)
- Testnet Datadog disabled (environments.py:169-170, 213-214)
- Stagef Redis single-node
- SNS topics unwired
- Sidecar images unpinned
Question for author: Can you share access to the deploy repo so these can be verified? Or confirm these line numbers are current as of the analysis date?
---
P0 Items — Critical Assessment
1. Datadog Monitors as Code (Terraform) — AGREE P0
Good call. The two-tier model (Terraform for oncall, UI for dev) is pragmatic and avoids the common trap of trying to codify everything. The ~25-30 monitor target is reasonable.
Concern: The Terraform implementation plan proposes an S3 + DynamoDB state backend. For a team new to Terraform, this adds operational overhead. Consider using the Terraform Cloud free tier initially — it handles state locking and provides a UI for plan review.
Suggestion: Start with 5-10 critical monitors, not 25-30. For testnet-prod readiness, you need: ECS task count, gateway P99 latency, quorum failures, circuit breaker state, and ALB 5xx. The rest can come in a fast follow.
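For concreteness, one of those starter monitors could look like the Terraform sketch below. This is purely illustrative: the metric name, scope tag, and threshold are placeholders I invented, not names from the codebase or the existing Datadog setup.

```hcl
resource "datadog_monitor" "gateway_p99_latency" {
  name    = "[testnet-prod] Gateway P99 latency elevated"
  type    = "metric alert"
  message = "Gateway P99 latency above threshold for 5m. @slack-oncall"

  # Placeholder metric name; substitute the real Prometheus metric
  # exported by crates/metrics once Datadog ingestion is enabled.
  query = "avg(last_5m):p99:gateway.request.duration{env:testnet-prod} > 2"

  monitor_thresholds {
    critical = 2
  }
}
```

The same resource shape covers the other four starter monitors (ECS task count, quorum failures, circuit breaker state, ALB 5xx), so formalizing the UI-created versions later is mostly a copy-and-import exercise.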
2. Enable Datadog for Testnet — AGREE P0
This is a no-brainer. The risk assessment ("None to application behavior") is correct. One-line config change per environment.
Question: The doc says redis_secret_arn=None for testnet too. Is this because testnet doesn't use Redis, or because it's another missing config? If testnet-prod runs 10 gateways and the gateway uses Redis for caching/dedup/chain-watcher filtering, missing Redis would mean those features are silently disabled.
3. Deep Health Check Endpoint — AGREE P0, with caveats
The proposed /health/ready design is solid. The check list (DB, Redis, operator pool, worker queue) is correct.
Caveats:
- Redis check should be non-fatal. The gateway's multi-tier cache (Moka L1 + Redis L2) is designed to degrade gracefully — if Redis is down, the L1 cache still works. Making Redis failure = unhealthy would cause unnecessary instance rotation. Use a degraded status instead.
- Operator pool check needs multichain awareness. In multichain mode, there's a per-chain ChainService with its own operator_pool. The health check should verify operators exist for each configured chain, not just check a global pool. This is a nuance the doc misses.
- Add timeouts to dependency probes. A hanging DB connection check could block the health endpoint itself. Each probe should have a 2-3 second timeout.
Missing from the proposal: The doc correctly identifies the ALB healthy_http_codes="200-499" problem, but the /health/ready fix only partially addresses it. The ALB change should be a separate item — changing from 200-499 to 200 affects ALL responses to the health path, not just readiness logic.
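To make the degraded-Redis semantics concrete, here is a minimal aggregation sketch. It is an illustration only: the probe set and the rule that only Redis maps to a degraded state are my assumptions about the design, and each probe is assumed to have already been run under its own 2-3 second timeout.

```rust
/// Outcome of a single dependency probe, each already bounded by a timeout.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Probe {
    Db(bool),        // Postgres reachable?
    Redis(bool),     // Redis reachable? (non-fatal: L1 cache still works)
    Operators(bool), // operator pool non-empty for every configured chain?
}

#[derive(Debug, PartialEq)]
enum Readiness {
    Ready,    // serve traffic
    Degraded, // serve traffic, but surface the degradation (e.g. Redis down)
    Unready,  // the ALB should rotate this instance
}

/// Aggregate probe results: DB or operator-pool failure is fatal,
/// Redis failure only degrades.
fn aggregate(probes: &[Probe]) -> Readiness {
    let mut degraded = false;
    for p in probes {
        match p {
            Probe::Db(false) | Probe::Operators(false) => return Readiness::Unready,
            Probe::Redis(false) => degraded = true,
            _ => {}
        }
    }
    if degraded { Readiness::Degraded } else { Readiness::Ready }
}
```

A /health/ready handler built on this would return 200 for both Ready and Degraded (with the status in the body) and 503 only for Unready, so the ALB rotates instances only on genuinely fatal failures.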
4. Enable Termination Protection — AGREE P0
Trivially correct. One-line fix, zero risk. Should have been done already.
---
P1 Items — Assessment
5. Verify ALB 4xx Responses — AGREE P1, but the Plan doc softened this too much
The Analysis doc correctly calls this out as a problem. The Plan doc says "This is by design (4xx's can be noisy from normal client errors)" — but that rationale applies to regular application traffic, not to the health check's accepted target codes. A health check returning 4xx should NOT be considered healthy. The original analysis was right: change to 200 (or 200-299).
6. Deployment Notifications — AGREE P1
Low effort, good visibility. No concerns.
7. OpenTelemetry APM Tracing — AGREE P1, but the approach needs refinement
The doc proposes the opentelemetry-datadog crate, but it has been deprecated in favor of the OTLP exporter with Datadog agent intake. The Datadog agent already supports OTLP on port 4318 (HTTP) or 4317 (gRPC).
Better approach:

```toml
opentelemetry = { version = "0.22", features = ["trace"] }
opentelemetry-otlp = "0.15"
tracing-opentelemetry = "0.23"
```

This is more future-proof and avoids the deprecated opentelemetry-datadog crate.
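Wiring-wise, the agent's OTLP intake has to be switched on and the app pointed at it. For the containerized agent this is typically done with environment variables like the following (a sketch; verify the exact variable names against the Datadog agent version pinned in the deploy repo):

```
# On the Datadog agent sidecar: enable OTLP intake
DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT=0.0.0.0:4318

# On the gateway container: point the standard OTel exporter at the agent
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
```

OTEL_EXPORTER_OTLP_ENDPOINT is the standard OTel SDK environment variable, so no Datadog-specific code is needed on the application side.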
8. Increase min_healthy_percent — AGREE P1
Going from 25% to 100% is the right target for zero-downtime rolling updates with 10 gateways. But 100% requires max_healthy_percent > 100 (typically 200%) so that new tasks can start before old ones drain. The doc doesn't mention this — ensure max_healthy_percent is set correctly or you'll get stuck deploys.
9. Pin Sidecar Image Versions — AGREE P1
Correct and low-effort. No concerns.
---
P2 Items — Assessment
10. Dead Letter Queue — PARTIALLY AGREE P2
The concept is right, but the doc's framing needs correction:
- The "worker queue" in Newton is primarily the TransactionSubmissionWorker — it processes on-chain transaction submissions sequentially for nonce safety. Transactions that still fail after retries are logged with full context (tx hash, error, task_id) and counted via the tx_submission_exhausted_total metric. This isn't "silently dropped."
- A Redis DLQ for failed transaction submissions is useful for replay, but the doc should be specific about which worker queue it targets. The gateway has multiple async processing paths (task broadcast, BLS aggregation, tx submission, webhook delivery), and they have different failure semantics.
Recommendation: Scope this to transaction submission failures specifically. A Redis list with serialized SubmissionTask payloads + failure reason would enable manual replay via CLI.
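The scoped-down version could be as small as the sketch below. Everything here is assumed for illustration: the DlqEntry fields are stand-ins for the real SubmissionTask, and the in-memory VecDeque stands in for a Redis list (LPUSH on failure, RPOP from a replay CLI).

```rust
use std::collections::VecDeque;

/// What gets persisted when retries are exhausted. Field names are
/// illustrative, not the actual struct in tx_worker.rs.
#[derive(Debug, Clone, PartialEq)]
struct DlqEntry {
    task_id: String,
    tx_payload: Vec<u8>,
    failure_reason: String,
    attempts: u32,
}

/// Stand-in for a Redis list such as "gateway:tx:dlq". In production
/// this would be a Redis client pushing serialized entries.
#[derive(Default)]
struct Dlq {
    entries: VecDeque<DlqEntry>,
}

impl Dlq {
    /// Called once retries are exhausted, alongside the existing
    /// log line and tx_submission_exhausted_total increment.
    fn push(&mut self, entry: DlqEntry) {
        self.entries.push_back(entry);
    }

    /// Manual replay path for a CLI: take the oldest failed task.
    fn pop_for_replay(&mut self) -> Option<DlqEntry> {
        self.entries.pop_front()
    }
}
```

Because the worker is sequential anyway, a single list with FIFO replay is enough for testnet; no competing-consumer semantics are needed.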
11. AWS WAF on ALB — AGREE P2
Standard practice. The "count mode first, then block" rollout strategy is correct.
12. Expose DB Connection Pool Metrics — AGREE P2
The deadpool-postgres pool has .status(), which returns pool size, available, and waiting counts. Exposing these as Prometheus gauges is straightforward. Good item, but correctly P2.
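The mapping is mechanical, along these lines. The PoolStatus struct here mirrors deadpool's Status by assumption (verify the field names against the deadpool version in Cargo.lock), and the gauge names are placeholders:

```rust
/// Snapshot of pool state; mirrors deadpool's `Status` (field names
/// assumed; check the actual struct for the pinned deadpool version).
struct PoolStatus {
    max_size: usize,
    size: usize,
    available: usize,
    waiting: usize,
}

/// Map a status snapshot to (gauge_name, value) pairs, ready to hand
/// to whatever gauge API crates/metrics exposes.
fn pool_gauges(s: &PoolStatus) -> Vec<(&'static str, f64)> {
    vec![
        ("db_pool_max_size", s.max_size as f64),
        ("db_pool_size", s.size as f64),
        ("db_pool_available", s.available as f64),
        ("db_pool_waiting", s.waiting as f64),
        // Derived: connections currently checked out.
        ("db_pool_in_use", (s.size - s.available) as f64),
    ]
}
```

Sampling this on a timer (or on each /metrics scrape) gives early warning of pool exhaustion, which is exactly the failure mode a shallow health check misses.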
---
What's Missing from Both Documents
1. Multichain-Specific Reliability Concerns (should be P1)
The analysis doesn't address multichain mode, which is the default, first-class mode for the gateway. Key gaps:
- Per-chain health: A gateway can be healthy for chain A but unhealthy for chain B (e.g., chain B's RPC is down). The health check needs per-chain status.
- Cross-chain data isolation: PR #419 (recently merged) added chain_id columns to DB tables. The reliability plan should verify this migration completed and test for cross-chain data leakage.
- Chain watcher reliability: Each chain watcher is single-instance. The doc mentions challenger/watcher redundancy as P2 but doesn't separate the two — a chain watcher failure means missed on-chain events, which is arguably more critical than a challenger failure since challenges have a 100-block window.
2. Redis as a Single Point of Failure in the Watcher Path (should be P1)
The chain watcher uses a Redis SET for deduplication (SISMEMBER). If Redis is unavailable, the watcher can't filter gateway-originated tasks from direct on-chain tasks, which could cause duplicate processing. The doc mentions Redis failover for prod but doesn't flag this specific failure mode.
3. Nonce Management Across 10 Gateways (should be documented)
The analysis notes "each gateway needs a unique signer key" for nonce management but doesn't flag the operational complexity: if a gateway crashes mid-transaction, its nonce can get stuck. There's no documented recovery procedure. This should at least be a runbook item.
4. BLS Key Rotation / Recovery (not mentioned)
If an operator's BLS key is compromised or lost, there's no documented key rotation procedure. This is more security than reliability, but key loss = permanent operator loss.
5. Database Backup Verification (mentioned but not actioned)
The analysis notes "No automated backups visible in the deploy code" and "RDS appears to be managed outside CDK," but neither doc proposes verifying that backups exist and are tested. This should be P0 — a production database with no verified backups is a critical risk.
---
Questions for the Author
1. Worker queue specificity: Which worker queue(s) does the DLQ proposal target? The TransactionSubmissionWorker in tx_worker.rs? The webhook notification queue? Both?
2. Redis in testnet: Is redis_secret_arn=None for testnet because Redis isn't deployed, or because the secret isn't configured? If Redis isn't deployed, several gateway features (L2 cache, chain watcher dedup, request dedup) are silently disabled.
3. RDS ownership: You note RDS isn't in CDK. Where is it managed? Is there a separate Terraform/CloudFormation stack? Are automated backups confirmed? This needs to be resolved before testnet-prod load testing.
4. Metric count: Can you recount? I see ~69 describe_*! invocations in crates/metrics/src/lib.rs, not 90+ or 150+. The docs reference "150+ Prometheus metrics" in one place and "90+" in another.
5. opentelemetry-datadog vs OTLP: The OTel proposal uses the deprecated opentelemetry-datadog crate. Are you aware of the OTLP alternative via the Datadog agent? It is the officially recommended path.
6. Multichain health: The health check proposal doesn't account for per-chain health in multichain mode. Was this an intentional simplification for v1?
---
Recommended Critical-Only Scope for Testnet-Prod
I agree with the author's recommendation to focus on P0 for testnet-prod. Here's my suggested ordering:
1. Enable termination protection — 5 minutes, zero risk, do it now
2. Enable Datadog for testnet — 1 config change, enables all subsequent monitoring
3. Deep health check — /health/ready with DB + Redis (degraded) + operator pool (per-chain) checks
4. 5-10 critical Datadog monitors — ECS health, gateway P99, quorum failures, circuit breaker, ALB 5xx (defer the full Terraform module; start with the Datadog UI for speed, formalize in Terraform afterward)
Item 4 is a pragmatic deviation from the plan. Standing up a full Terraform + S3 backend + CI/CD pipeline for monitors is "medium effort" as the doc says, but given testnet-prod urgency, creating 5 monitors in the Datadog UI takes 30 minutes. Formalize them in Terraform as a fast follow.
★ Insight ─────────────────────────────────────
- The two-tier monitoring model (Terraform for oncall-paging, Datadog UI for dev) is a mature pattern that avoids the common "codify everything or nothing" trap. This is one of the strongest parts of the proposal.
- The analysis correctly identifies the "blind testnet-prod" as the highest-risk gap — running at mainnet scale with zero observability means you can't validate releases before mainnet.
- The biggest gap in the analysis is multichain awareness. Newton's gateway is multichain-first (all gateways go through from_multi_chain_config), but the reliability analysis treats it as single-chain. The health check, operator pool monitoring, and chain watcher redundancy all need per-chain thinking.
─────────────────────────────────────────────────
★ Insight ─────────────────────────────────────
All Notion review comments have been posted across 7 pages. The review covers both code-verified corrections (metric count, Redis
graceful degradation, operator SG) and architectural guidance (multichain awareness, OTLP over deprecated crate, DLQ scoping).
─────────────────────────────────────────────────
Comment Summary
Analysis doc (4 comments):
Plan doc (1 comment):
5. Overall review — recommended P0 ordering, missing items (Redis testnet, multichain, nonce management), metric count correction,
operator ALB gap, OTLP recommendation
Sub-pages (7 comments):
6. Deep Health Check — Redis non-fatal, multichain operator pools, probe timeouts, liveness vs readiness
7. OTel APM — use opentelemetry-otlp not deprecated opentelemetry-datadog, with config snippets
8. DLQ — scope to TransactionSubmissionWorker in tx_worker.rs, Redis list sufficient for testnet
9. Datadog Monitors — start with 5-10 UI monitors (not 25-30 Terraform), consider TF Cloud free tier
10. Enable Datadog Testnet — pin agent/fluent-bit image tags, Redis metrics won't exist yet
11. Termination Protection — easiest P0 win, also check CF stack update policies for stateful resources
12. Monitoring Workflow — defer full runbook to P1, set alarm_email/slack_webhook_url first
Total: 12 comments across all pages.