@mostlyfine
Created September 3, 2025 10:36
kiro example

Design Document

Overview

The ECS External Monitoring System is a serverless, cost-efficient solution that performs automated health checks on external services. It runs TypeScript monitoring code as scheduled AWS ECS Fargate tasks, stores configuration in S3, keeps secrets in AWS Secrets Manager, and delivers alerts by email, Slack, and webhook. The architecture is designed to scale to hundreds of monitoring targets while keeping operational costs low.

Architecture

graph TB
    subgraph "Configuration & Secrets"
        S3[S3 Bucket<br/>Config Files]
        SM[Secrets Manager<br/>Credentials]
    end
    
    subgraph "ECS Cluster"
        ER[EventBridge Rule<br/>1-minute schedule]
        ET[ECS Task<br/>Fargate]
        FB[Fluent Bit<br/>Log Router]
    end
    
    subgraph "Monitoring & Alerting"
        CWM[CloudWatch Metrics]
        CWL[CloudWatch Logs]
        CWA[CloudWatch Alarms]
        SNS[SNS Topics]
        Lambda[Lambda Functions<br/>Alert Handlers]
    end
    
    subgraph "External Services"
        ES1[External Service 1]
        ES2[External Service 2]
        ESN[External Service N]
    end
    
    subgraph "Alert Destinations"
        Email[Email]
        Slack[Slack]
        Webhook[Webhook]
    end
    
    ER --> ET
    ET --> S3
    ET --> SM
    ET --> ES1
    ET --> ES2
    ET --> ESN
    ET --> CWM
    ET --> FB
    FB --> CWL
    CWM --> CWA
    CWA --> SNS
    SNS --> Lambda
    Lambda --> Email
    Lambda --> Slack
    Lambda --> Webhook

Components and Interfaces

1. Configuration Management

S3 Configuration Store

  • Purpose: Centralized storage for monitoring configurations
  • Structure:
    # monitoring-config.yaml
    global:
      defaultTimeout: 30
      defaultRetries: 3
      defaultUserAgent: "ECS-Monitor/1.0"
      alertThresholds:
        errorRate: 0.1
        responseTime: 5000
    
    targets:
      - name: "api-service"
        url: "https://api.example.com/health"
        expectedStatusCodes: [200, 201]
        timeout: 10
        retries: 2
        userAgent: "Custom-Monitor/1.0"
        interval: 60
        alertThresholds:
          errorRate: 0.05
          responseTime: 2000
      - name: "web-service"
        url: "https://web.example.com"
        expectedStatusCodes: [200]
        timeout: 15
        retries: 1
    
    alerts:
      email:
        enabled: true
        recipients: ["ops@example.com"]
      slack:
        enabled: true
        webhookSecretName: "slack-webhook-url"
        channel: "#alerts"
      webhook:
        enabled: true
        urlSecretName: "custom-webhook-url"
        headers:
          Authorization: "Bearer ${webhook-token}"

Configuration Loader Interface

interface MonitoringConfig {
  global: GlobalConfig;
  targets: MonitoringTarget[];
  alerts: AlertConfig;
}

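interface MonitoringTarget {
  name: string;
  url: string;
  expectedStatusCodes: number[];
  // The fields below may be omitted in YAML (see "web-service" above);
  // missing values fall back to the global defaults.
  timeout?: number;
  retries?: number;
  userAgent?: string;
  interval?: number;
  alertThresholds?: Partial<ThresholdConfig>;
}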
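
A minimal sketch of how per-target settings could be merged with the global defaults from the YAML above. RawTarget, ResolvedTarget, Defaults, and resolveTarget are hypothetical names introduced for illustration, and interval is left out because the sample configuration defines no global default for it.

```typescript
// Hypothetical shapes: a target as written in YAML (optional overrides)
// and the fully resolved form the monitoring engine would consume.
interface Thresholds {
  errorRate: number;    // 0.0 to 1.0
  responseTime: number; // milliseconds
}

interface Defaults {
  defaultTimeout: number;
  defaultRetries: number;
  defaultUserAgent: string;
  alertThresholds: Thresholds;
}

interface RawTarget {
  name: string;
  url: string;
  expectedStatusCodes: number[];
  timeout?: number;
  retries?: number;
  userAgent?: string;
  alertThresholds?: Partial<Thresholds>;
}

interface ResolvedTarget {
  name: string;
  url: string;
  expectedStatusCodes: number[];
  timeout: number;
  retries: number;
  userAgent: string;
  alertThresholds: Thresholds;
}

// Fill every omitted per-target field from the global section.
function resolveTarget(raw: RawTarget, global: Defaults): ResolvedTarget {
  return {
    name: raw.name,
    url: raw.url,
    expectedStatusCodes: raw.expectedStatusCodes,
    timeout: raw.timeout ?? global.defaultTimeout,
    retries: raw.retries ?? global.defaultRetries,
    userAgent: raw.userAgent ?? global.defaultUserAgent,
    // Per-target thresholds override the global ones key by key.
    alertThresholds: { ...global.alertThresholds, ...raw.alertThresholds },
  };
}

// Values copied from the sample monitoring-config.yaml.
const globalDefaults: Defaults = {
  defaultTimeout: 30,
  defaultRetries: 3,
  defaultUserAgent: 'ECS-Monitor/1.0',
  alertThresholds: { errorRate: 0.1, responseTime: 5000 },
};

const webService = resolveTarget(
  { name: 'web-service', url: 'https://web.example.com', expectedStatusCodes: [200], timeout: 15, retries: 1 },
  globalDefaults,
);
// webService.userAgent is 'ECS-Monitor/1.0'; webService.timeout stays 15.
```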

2. Monitoring Engine

Parallel Execution Manager

  • Purpose: Orchestrates concurrent monitoring of multiple targets
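  • Implementation: Uses TypeScript async/await with Promise.all for parallel execution; each check catches its own errors (or Promise.allSettled is used) so that one failing target cannot reject the whole batch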
  • Resource Management: Implements connection pooling and request queuing to prevent resource exhaustion
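
The request queuing mentioned above can be sketched as a small worker pool. This is one possible approach under stated assumptions, not the design's actual implementation: runWithLimit and Settled are hypothetical names, and the pool replaces a single Promise.all over every target with a fixed number of workers that each catch their own errors.

```typescript
// Outcome of one task: either a value or the error it failed with.
type Settled<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Run `task` over all items with at most `limit` in flight at once.
// A rejected task is recorded, not rethrown, so one failing target
// cannot abort the others.
async function runWithLimit<I, O>(
  items: I[],
  limit: number,
  task: (item: I) => Promise<O>,
): Promise<Settled<O>[]> {
  const results: Settled<O>[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      // Claiming an index is safe without locks: this line runs
      // synchronously between awaits on a single thread.
      const index = next++;
      try {
        results[index] = { ok: true, value: await task(items[index]) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as `items`
}
```

A caller would pass the target list, a concurrency limit sized to the task's memory and socket budget, and a function such as checkTarget.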

HTTP Client Interface

interface MonitoringResult {
  target: string;
  timestamp: Date;
  success: boolean;
  statusCode?: number;
  responseTime: number;
  error?: string;
  retryCount: number;
}

interface HttpMonitor {
  checkTarget(target: MonitoringTarget): Promise<MonitoringResult>;
  checkMultipleTargets(targets: MonitoringTarget[]): Promise<MonitoringResult[]>;
}
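
A hedged sketch of what a checkTarget implementation might look like. The SendRequest transport is a hypothetical stand-in for a real HTTP client (fetch, axios, or similar) so that the retry and status-code logic stays testable; timeouts and the backoff delay between attempts are deliberately left out.

```typescript
interface Target {
  name: string;
  url: string;
  expectedStatusCodes: number[];
  retries: number;
}

interface CheckResult {
  target: string;
  timestamp: Date;
  success: boolean;
  statusCode?: number;
  responseTime: number; // milliseconds, measured for the last attempt
  error?: string;
  retryCount: number;
}

// Hypothetical transport: resolves with the HTTP status code and rejects
// on network failure or timeout.
type SendRequest = (url: string) => Promise<number>;

async function checkTarget(target: Target, send: SendRequest): Promise<CheckResult> {
  let retryCount = 0;
  for (;;) {
    const started = Date.now();
    let statusCode: number | undefined;
    let error: string | undefined;
    try {
      statusCode = await send(target.url);
      // Any of the configured codes counts as success.
      if (!target.expectedStatusCodes.includes(statusCode)) {
        error = `unexpected status ${statusCode}`;
      }
    } catch (e) {
      error = e instanceof Error ? e.message : String(e);
    }
    const success = error === undefined;
    if (success || retryCount >= target.retries) {
      return {
        target: target.name,
        timestamp: new Date(started),
        success,
        statusCode,
        responseTime: Date.now() - started,
        error,
        retryCount,
      };
    }
    retryCount++; // a backoff delay would be awaited here
  }
}
```

Because checkTarget always resolves with a result instead of rejecting, a failed target shows up as success: false rather than as an exception in the caller.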

3. Metrics and Alerting

CloudWatch Metrics Publisher

  • Custom Metrics:
    • ExternalMonitor/ResponseTime (milliseconds)
    • ExternalMonitor/SuccessRate (percentage)
    • ExternalMonitor/ErrorCount (count)
    • ExternalMonitor/StatusCode (value)
  • Dimensions: TargetName, Environment

Alert Manager

interface AlertManager {
  evaluateThresholds(results: MonitoringResult[]): AlertEvent[];
  sendAlerts(events: AlertEvent[]): Promise<void>;
}

interface AlertEvent {
  target: string;
  severity: 'warning' | 'critical';
  message: string;
  timestamp: Date;
  metrics: Record<string, number>;
}
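
As an illustration of evaluateThresholds, the sketch below groups results by target and compares the observed error rate and mean response time with the configured limits. Two things here are assumptions rather than part of the design: the thresholds are passed in as a second argument keyed by target name, and an error-rate breach maps to 'critical' while a response-time breach maps to 'warning'.

```typescript
interface Result {
  target: string;
  success: boolean;
  responseTime: number; // milliseconds
}

interface Limits {
  errorRate: number;    // 0.0 to 1.0
  responseTime: number; // milliseconds
}

interface Alert {
  target: string;
  severity: 'warning' | 'critical';
  message: string;
  timestamp: Date;
  metrics: Record<string, number>;
}

function evaluateThresholds(
  results: Result[],
  limitsByTarget: Record<string, Limits>,
  now: Date = new Date(),
): Alert[] {
  // Group results by target name.
  const byTarget = new Map<string, Result[]>();
  for (const r of results) {
    const group = byTarget.get(r.target) ?? [];
    group.push(r);
    byTarget.set(r.target, group);
  }

  const alerts: Alert[] = [];
  byTarget.forEach((group, target) => {
    const limits = limitsByTarget[target];
    if (!limits) return; // no thresholds configured for this target

    const failures = group.filter((r) => !r.success).length;
    const errorRate = failures / group.length;
    const avgResponseTime =
      group.reduce((sum, r) => sum + r.responseTime, 0) / group.length;
    const metrics = { errorRate, avgResponseTime };

    if (errorRate > limits.errorRate) {
      alerts.push({
        target,
        severity: 'critical',
        message: `error rate ${errorRate} exceeds ${limits.errorRate}`,
        timestamp: now,
        metrics,
      });
    } else if (avgResponseTime > limits.responseTime) {
      alerts.push({
        target,
        severity: 'warning',
        message: `average response time ${avgResponseTime} ms exceeds ${limits.responseTime} ms`,
        timestamp: now,
        metrics,
      });
    }
  });
  return alerts;
}
```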

4. Infrastructure Components

ECS Task Definition

  • CPU: 256 (0.25 vCPU) - optimized for I/O bound operations
  • Memory: 512 MB - sufficient for TypeScript runtime and parallel HTTP requests
  • Network Mode: awsvpc for security group isolation
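  • Execution Role: Minimal permissions for ECS itself to pull the image from ECR and start the containers
  • Task Role: Minimal permissions for the running containers: S3, Secrets Manager, CloudWatch metrics, and CloudWatch Logs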

EventBridge Scheduled Rule

  • Schedule: Configurable schedule expression, rate or cron (default: rate(1 minute))
  • Target: ECS Task with specific task definition
  • Dead Letter Queue: For failed task executions

Data Models

Configuration Schema

// Global configuration
interface GlobalConfig {
  defaultTimeout: number;
  defaultRetries: number;
  defaultUserAgent: string;
  alertThresholds: ThresholdConfig;
  debug: boolean;
  dryRun: boolean;
}

// Threshold configuration
interface ThresholdConfig {
  errorRate: number;        // 0.0 to 1.0
  responseTime: number;     // milliseconds
  consecutiveFailures: number;
}

// Alert configuration
interface AlertConfig {
  email: EmailConfig;
  slack: SlackConfig;
  webhook: WebhookConfig;
}

interface EmailConfig {
  enabled: boolean;
  recipients: string[];
  snsTopicArn?: string;
}

interface SlackConfig {
  enabled: boolean;
  webhookSecretName: string;
  channel: string;
  username?: string;
}

interface WebhookConfig {
  enabled: boolean;
  urlSecretName: string;
  headers: Record<string, string>;
  method: 'POST' | 'PUT';
}

Metrics Data Model

interface MetricData {
  metricName: string;
  value: number;
  unit: string;
  dimensions: Record<string, string>;
  timestamp: Date;
}

interface CloudWatchMetricBatch {
  namespace: string;
  metrics: MetricData[];
}
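
The following sketch shows one way monitoring results could be turned into MetricData entries and split into batches. toMetrics and toBatches are hypothetical helpers. The batch size is a parameter on purpose: the real per-request limit should be taken from the current PutMetricData documentation rather than from this example. SuccessRate is omitted because it is an aggregate over several results.

```typescript
interface CheckResult {
  target: string;
  timestamp: Date;
  success: boolean;
  statusCode?: number;
  responseTime: number; // milliseconds
}

interface MetricData {
  metricName: string;
  value: number;
  unit: string;
  dimensions: Record<string, string>;
  timestamp: Date;
}

interface CloudWatchMetricBatch {
  namespace: string;
  metrics: MetricData[];
}

// One result becomes two or three data points sharing the same dimensions.
function toMetrics(result: CheckResult, environment: string): MetricData[] {
  const common = {
    dimensions: { TargetName: result.target, Environment: environment },
    timestamp: result.timestamp,
  };
  const metrics: MetricData[] = [
    { metricName: 'ResponseTime', value: result.responseTime, unit: 'Milliseconds', ...common },
    { metricName: 'ErrorCount', value: result.success ? 0 : 1, unit: 'Count', ...common },
  ];
  if (result.statusCode !== undefined) {
    metrics.push({ metricName: 'StatusCode', value: result.statusCode, unit: 'None', ...common });
  }
  return metrics;
}

// Flatten all data points and cut them into batches of at most maxPerBatch.
function toBatches(
  results: CheckResult[],
  environment: string,
  maxPerBatch: number,
  namespace = 'ExternalMonitor',
): CloudWatchMetricBatch[] {
  if (maxPerBatch < 1) throw new RangeError('maxPerBatch must be at least 1');
  const all: MetricData[] = [];
  for (const r of results) all.push(...toMetrics(r, environment));

  const batches: CloudWatchMetricBatch[] = [];
  for (let i = 0; i < all.length; i += maxPerBatch) {
    batches.push({ namespace, metrics: all.slice(i, i + maxPerBatch) });
  }
  return batches;
}
```

Each batch would then map to a single publish call, so the number of API requests grows with the number of batches rather than the number of targets.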

Error Handling

Retry Strategy

  • Exponential Backoff: Base delay of 1 second, maximum of 30 seconds
  • Jitter: Random delay component to prevent thundering herd
  • Circuit Breaker: Temporary failure isolation for consistently failing targets
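
The backoff and jitter bullets can be sketched as a single delay function. This assumes the "equal jitter" variant (half of the capped delay is fixed, half is random), which is only one of several common jitter schemes; backoffDelayMs is a hypothetical name, and the random source is injectable so the behavior can be tested deterministically.

```typescript
// Delay before retry number `attempt` (0-based), in milliseconds.
// The exponential delay is capped at maxMs, then half of it is randomized.
function backoffDelayMs(
  attempt: number,
  baseMs = 1000,
  maxMs = 30000,
  random: () => number = Math.random,
): number {
  const capped = Math.min(maxMs, baseMs * 2 ** attempt);
  const half = capped / 2;
  return Math.floor(half + random() * half);
}

// With the defaults, retry 0 waits between 500 and 1000 ms,
// and late retries wait between 15000 and 30000 ms.
```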

Error Categories

  1. Network Errors: DNS resolution, connection timeouts, network unreachable
  2. HTTP Errors: 4xx/5xx status codes, invalid responses
  3. Configuration Errors: Invalid YAML, missing required fields
  4. AWS Service Errors: CloudWatch API limits, S3 access denied, Secrets Manager failures
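
A rough sketch of how failures might be sorted into the four categories above. The network codes are standard Node.js system error codes and the AWS names are common service error names; ConfigValidationError is a hypothetical error name for the configuration loader, and the Failure shape is an assumption about what the caller has extracted from the thrown error.

```typescript
type ErrorCategory = 'network' | 'http' | 'configuration' | 'aws' | 'unknown';

// Assumed minimal view of a failure, extracted by the caller.
interface Failure {
  code?: string;       // e.g. Node.js system error code
  name?: string;       // e.g. error class or AWS error name
  statusCode?: number; // HTTP status, if a response was received
}

const NETWORK_CODES = new Set([
  'ENOTFOUND', 'EAI_AGAIN', 'ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT', 'ENETUNREACH',
]);
const AWS_ERROR_NAMES = new Set([
  'AccessDenied', 'AccessDeniedException', 'ThrottlingException', 'ResourceNotFoundException',
]);
const CONFIG_ERROR_NAMES = new Set(['ConfigValidationError']); // hypothetical

function categorize(failure: Failure): ErrorCategory {
  if (failure.name !== undefined && CONFIG_ERROR_NAMES.has(failure.name)) return 'configuration';
  if (failure.code !== undefined && NETWORK_CODES.has(failure.code)) return 'network';
  // AWS errors are checked before plain HTTP errors because they may
  // also carry an HTTP status code.
  if (failure.name !== undefined && AWS_ERROR_NAMES.has(failure.name)) return 'aws';
  if (failure.statusCode !== undefined && failure.statusCode >= 400) return 'http';
  return 'unknown';
}
```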

Graceful Degradation

  • Partial Failures: Continue monitoring other targets if some fail
  • Configuration Fallbacks: Use default values for missing configuration
  • Alert Fallbacks: Log to CloudWatch if alert delivery fails

Testing Strategy

Unit Testing

  • HTTP Client: Mock external services, test retry logic, timeout handling
  • Configuration Loader: Test YAML parsing, validation, default value application
  • Metrics Publisher: Mock CloudWatch API, test batch operations
  • Alert Manager: Test threshold evaluation, notification formatting

Integration Testing

  • End-to-End: Deploy to test environment, verify complete workflow
  • AWS Services: Test actual CloudWatch, S3, and Secrets Manager integration
  • Docker Container: Verify local execution, dry-run mode, debug output

Load Testing

  • Parallel Execution: Test with 100+ concurrent monitoring targets
  • Resource Utilization: Monitor CPU, memory, and network usage
  • Cost Analysis: Measure CloudWatch API calls, data transfer costs

Local Development

  • Docker Compose: Local stack with LocalStack for AWS services
  • Mock Services: HTTP servers returning various status codes and delays
  • Configuration Validation: CLI tool for validating YAML configurations

Security Considerations

IAM Permissions (Principle of Least Privilege)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::monitoring-config-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:*:*:secret:monitoring/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "cloudwatch:namespace": "ExternalMonitor"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:/ecs/external-monitor:*"
    }
  ]
}

Network Security

  • VPC Configuration: ECS tasks run in private subnets with NAT Gateway for outbound access
  • Security Groups: Restrictive inbound rules, outbound HTTPS only
  • Secrets Encryption: All secrets encrypted at rest and in transit

Cost Optimization

Resource Efficiency

  • Fargate Spot: Use Fargate Spot capacity for non-critical monitoring (up to 70% cost savings)
  • Task Sizing: Right-sized CPU/memory allocation based on actual usage
  • Connection Pooling: Reuse HTTP connections to reduce overhead

CloudWatch Optimization

  • Metric Batching: Send metrics in batches to reduce API calls
  • Custom Metrics: Use high-resolution metrics only when necessary
  • Log Retention: Configurable retention periods (default: 30 days)

Scaling Strategy

  • Horizontal Scaling: Multiple smaller tasks instead of single large task
  • Target Grouping: Batch multiple targets per task execution
  • Adaptive Scheduling: Adjust frequency based on target criticality

Cost Monitoring

  • Budget Alerts: CloudWatch billing alarms for cost overruns
  • Usage Metrics: Track API calls, data transfer, and compute costs
  • Cost Attribution: Tag resources for cost allocation by project/team

Requirements Document

Introduction

This feature implements an automated external monitoring system that uses AWS ECS scheduled tasks to perform health checks on external services. The system runs TypeScript-based monitoring tasks in parallel, collects metrics, sends them to CloudWatch, and triggers alerts via multiple channels (email, Slack, webhook) based on configurable thresholds. The entire infrastructure is provisioned using Terraform with terraform-aws-modules, and the system is designed to scale to hundreds of monitoring targets without a proportional increase in cost.

Requirements

Requirement 1

User Story: As a DevOps engineer, I want to schedule ECS tasks to run external monitoring checks every minute, so that I can continuously monitor the health of external services.

Acceptance Criteria

  1. WHEN the system is deployed THEN ECS scheduled tasks SHALL execute every 1 minute by default
  2. WHEN a monitoring task runs THEN it SHALL execute TypeScript code in parallel for multiple monitoring targets
  3. WHEN the monitoring interval is configured THEN the system SHALL respect the custom interval setting
  4. WHEN ECS tasks are scheduled THEN they SHALL use the minimum required resources to optimize costs

Requirement 2

User Story: As a system administrator, I want to configure monitoring targets via YAML files stored in S3, so that I can easily manage and update monitoring configurations without redeploying the system.

Acceptance Criteria

  1. WHEN the system starts THEN it SHALL download configuration files from S3
  2. WHEN configuration includes URL, expected status codes, user agent, timeout, retry count, alert thresholds, and monitoring intervals THEN the system SHALL apply these settings
  3. WHEN expected status codes are specified as multiple values THEN the system SHALL accept any of the specified codes as successful
  4. WHEN configuration files are updated in S3 THEN the next ECS task execution SHALL use the updated configuration
  5. WHEN configuration files are malformed THEN the system SHALL log errors and continue with default settings where possible

Requirement 3

User Story: As a monitoring operator, I want the system to send metrics to CloudWatch and trigger alerts via multiple channels, so that I can be notified of service issues through my preferred communication methods.

Acceptance Criteria

  1. WHEN monitoring checks complete THEN the system SHALL send metrics to CloudWatch Metrics
  2. WHEN metrics exceed configured thresholds THEN the system SHALL send alerts to configured channels
  3. WHEN alert channels are configured THEN the system SHALL support email, Slack, and webhook notifications
  4. WHEN alert thresholds are breached THEN notifications SHALL include relevant monitoring details
  5. WHEN multiple alert channels are configured THEN the system SHALL send notifications to all configured channels

Requirement 4

User Story: As a developer, I want to run the monitoring system locally with a dry-run option, so that I can test configurations and debug issues without affecting production systems.

Acceptance Criteria

  1. WHEN a Docker container is built THEN it SHALL support local execution
  2. WHEN debug option is enabled THEN the system SHALL output detailed logs to stdout
  3. WHEN dry-run mode is enabled THEN the system SHALL perform monitoring checks but not send metrics or alerts
  4. WHEN running locally THEN the system SHALL use the same configuration format as production
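
One way to satisfy the dry-run criterion is to wrap the metric and alert senders so that checks still run but nothing leaves the process. The sketch below is an assumption about structure, not the planned implementation: Sink and withDryRun are hypothetical names, and the same wrapper would be applied to both the metrics publisher and the alert channels.

```typescript
// Anything that delivers a batch of items somewhere (metrics, alerts).
interface Sink<T> {
  send(items: T[]): Promise<void>;
}

// In dry-run mode, return a sink that only logs what it would have sent.
// Otherwise return the real sink unchanged.
function withDryRun<T>(
  real: Sink<T>,
  dryRun: boolean,
  log: (line: string) => void = console.log,
): Sink<T> {
  if (!dryRun) return real;
  return {
    async send(items: T[]): Promise<void> {
      log(`[dry-run] skipped sending ${items.length} item(s)`);
    },
  };
}
```

The monitoring code then calls send in both modes, so dry-run exercises the same code path as production up to the point of delivery.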

Requirement 5

User Story: As a security-conscious administrator, I want authentication credentials managed through AWS Secrets Manager, so that sensitive information is stored securely and accessed with minimal privileges.

Acceptance Criteria

  1. WHEN the system requires authentication credentials THEN it SHALL retrieve them from AWS Secrets Manager
  2. WHEN accessing AWS services THEN the system SHALL use IAM roles with minimum required permissions
  3. WHEN credentials are rotated in Secrets Manager THEN the system SHALL automatically use updated credentials
  4. WHEN unauthorized access is attempted THEN the system SHALL fail gracefully with appropriate error logging

Requirement 6

User Story: As an infrastructure engineer, I want the entire system provisioned using Terraform with terraform-aws-modules, so that infrastructure is reproducible and follows best practices.

Acceptance Criteria

  1. WHEN deploying the system THEN Terraform SHALL provision all required AWS resources
  2. WHEN using terraform-aws-modules THEN the system SHALL leverage community-maintained, best-practice modules
  3. WHEN project-specific values need customization THEN they SHALL be exposed as Terraform variables
  4. WHEN infrastructure is destroyed THEN Terraform SHALL cleanly remove all provisioned resources
  5. WHEN multiple environments are needed THEN the same Terraform code SHALL support different configurations

Requirement 7

User Story: As a system operator, I want comprehensive logging sent to CloudWatch Logs via Fluent Bit, so that I can troubleshoot issues and maintain audit trails.

Acceptance Criteria

  1. WHEN ECS tasks execute THEN logs SHALL be collected by Fluent Bit
  2. WHEN logs are collected THEN they SHALL be sent to CloudWatch Logs
  3. WHEN log retention is configured THEN CloudWatch Logs SHALL respect the retention settings
  4. WHEN debugging is enabled THEN additional detailed logs SHALL be generated
  5. WHEN log parsing fails THEN Fluent Bit SHALL continue operating and log the parsing errors

Requirement 8

User Story: As a cost-conscious organization, I want the system to scale efficiently to hundreds of monitoring targets without proportional cost increases, so that monitoring costs remain manageable as we grow.

Acceptance Criteria

  1. WHEN monitoring targets increase to hundreds THEN the system SHALL maintain efficient resource utilization
  2. WHEN parallel execution is implemented THEN it SHALL optimize for both speed and resource consumption
  3. WHEN ECS tasks scale THEN they SHALL use appropriate task sizes (CPU and memory) and scaling policies
  4. WHEN CloudWatch metrics volume increases THEN costs SHALL scale sub-linearly with the number of targets
  5. WHEN the system is idle THEN it SHALL consume minimal resources

Requirement 9

User Story: As a monitoring administrator, I want flexible configuration options for timeouts, retries, and user agents, so that I can customize monitoring behavior for different types of services.

Acceptance Criteria

  1. WHEN timeout values are configured THEN the system SHALL respect per-target timeout settings
  2. WHEN retry counts are specified THEN the system SHALL attempt the configured number of retries on failures
  3. WHEN custom user agents are defined THEN HTTP requests SHALL use the specified user agent strings
  4. WHEN default values are not overridden THEN the system SHALL use sensible defaults for all configuration options
  5. WHEN configuration validation fails THEN the system SHALL provide clear error messages indicating the invalid settings
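
To illustrate criterion 5, here is a minimal hand-written validator that reports each invalid setting with its path. It is a sketch only: validateTarget is a hypothetical helper, the accepted ranges are assumptions, and a schema library would normally replace this code.

```typescript
// Returns one message per invalid field; an empty array means valid.
function validateTarget(target: Record<string, unknown>, index: number): string[] {
  const errors: string[] = [];
  const where = `targets[${index}]`;
  const { name, url, expectedStatusCodes: codes, timeout, retries } = target;

  if (typeof name !== 'string' || name.trim() === '') {
    errors.push(`${where}.name: must be a non-empty string`);
  }
  if (typeof url !== 'string' || !/^https?:\/\/\S+$/i.test(url)) {
    errors.push(`${where}.url: must be an http or https URL`);
  }
  if (
    !Array.isArray(codes) ||
    codes.length === 0 ||
    !codes.every((c) => Number.isInteger(c) && c >= 100 && c <= 599)
  ) {
    errors.push(`${where}.expectedStatusCodes: must be a non-empty list of integers from 100 to 599`);
  }
  // Optional fields are only checked when present.
  if (timeout !== undefined && (typeof timeout !== 'number' || !(timeout > 0))) {
    errors.push(`${where}.timeout: must be a positive number`);
  }
  if (retries !== undefined && (typeof retries !== 'number' || !Number.isInteger(retries) || retries < 0)) {
    errors.push(`${where}.retries: must be a non-negative integer`);
  }
  return errors;
}
```

For example, a target at index 2 with an empty name yields the message "targets[2].name: must be a non-empty string".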

Implementation Plan

  • 1. Set up project structure and core TypeScript configuration

    • Create directory structure for src/, tests/, terraform/, docker/
    • Initialize TypeScript project with tsconfig.json and package.json
    • Configure ESLint, Prettier, and Jest for code quality and testing
    • Requirements: 1.2, 4.1
  • 2. Implement configuration management system

    • 2.1 Create YAML configuration schema and TypeScript interfaces

      • Define MonitoringConfig, MonitoringTarget, AlertConfig interfaces
      • Implement YAML schema validation using joi or similar library
      • Create default configuration values and validation rules
      • Requirements: 2.2, 2.5, 9.4
    • 2.2 Implement S3 configuration loader

      • Write S3Client wrapper for downloading configuration files
      • Implement configuration parsing and validation logic
      • Add error handling for malformed configurations with fallback to defaults
      • Create unit tests for configuration loading and validation
      • Requirements: 2.1, 2.4, 2.5
  • 3. Implement AWS Secrets Manager integration

    • 3.1 Create secrets manager client wrapper
      • Implement SecretManager class for retrieving credentials
      • Add caching mechanism for secrets to reduce API calls
      • Implement error handling for missing or inaccessible secrets
      • Write unit tests with mocked AWS SDK calls
      • Requirements: 5.1, 5.3, 5.4
  • 4. Implement HTTP monitoring engine

    • 4.1 Create HTTP client with retry and timeout logic

      • Implement HttpMonitor class with configurable timeout and retry
      • Add exponential backoff with jitter for retry strategy
      • Implement connection pooling using axios or similar HTTP client
      • Create unit tests for timeout, retry, and error scenarios
      • Requirements: 1.2, 9.1, 9.2, 9.3
    • 4.2 Implement parallel monitoring execution

      • Create ParallelMonitor class for concurrent target monitoring
      • Implement Promise.all-based parallel execution with error isolation
      • Add resource management to prevent memory/connection exhaustion
      • Write unit tests for parallel execution and error handling
      • Requirements: 1.2, 8.2, 8.3
    • 4.3 Implement monitoring result collection and processing

      • Create MonitoringResult interface and result aggregation logic
      • Implement success rate calculation and response time metrics
      • Add result validation and error categorization
      • Write unit tests for result processing and metric calculation
      • Requirements: 1.2, 3.1, 8.1
  • 5. Implement CloudWatch metrics integration

    • 5.1 Create CloudWatch metrics publisher

      • Implement MetricsPublisher class for sending custom metrics
      • Add metric batching to optimize CloudWatch API usage
      • Implement proper error handling for CloudWatch API failures
      • Create unit tests with mocked CloudWatch SDK calls
      • Requirements: 3.1, 8.4
    • 5.2 Implement metric data transformation

      • Create functions to convert MonitoringResult to CloudWatch metrics
      • Implement proper metric dimensions and namespacing
      • Add metric aggregation for batch publishing
      • Write unit tests for metric transformation and batching
      • Requirements: 3.1, 8.4
  • 6. Implement alerting system

    • 6.1 Create threshold evaluation engine

      • Implement AlertManager class for evaluating monitoring results against thresholds
      • Add support for error rate, response time, and consecutive failure thresholds
      • Implement alert event generation with proper severity levels
      • Write unit tests for threshold evaluation logic
      • Requirements: 3.2, 3.4
    • 6.2 Implement multi-channel alert delivery

      • Create alert handlers for email (SNS), Slack webhook, and custom webhook
      • Implement alert message formatting for each channel type
      • Add error handling and fallback logging for failed alert delivery
      • Write unit tests for alert formatting and delivery logic
      • Requirements: 3.3, 3.5
  • 7. Implement main application orchestrator

    • 7.1 Create main application entry point

      • Implement App class that orchestrates configuration loading, monitoring, and alerting
      • Add command-line argument parsing for debug and dry-run modes
      • Implement proper error handling and graceful shutdown
      • Write integration tests for complete monitoring workflow
      • Requirements: 1.1, 4.2, 4.3
    • 7.2 Add debug and dry-run functionality

      • Implement debug logging with detailed stdout output
      • Add dry-run mode that performs checks without sending metrics or alerts
      • Create environment variable configuration for runtime options
      • Write unit tests for debug and dry-run mode behavior
      • Requirements: 4.2, 4.3
  • 8. Create Docker container configuration

    • 8.1 Write Dockerfile for TypeScript application

      • Create multi-stage Dockerfile with Node.js runtime
      • Implement proper layer caching for dependencies
      • Add non-root user for security best practices
      • Configure proper signal handling for graceful shutdown
      • Requirements: 4.1
    • 8.2 Add Fluent Bit logging configuration

      • Create Fluent Bit configuration for CloudWatch Logs integration
      • Implement log parsing and structured logging format
      • Add log filtering and routing based on log levels
      • Configure proper error handling for log delivery failures
      • Requirements: 7.1, 7.2, 7.4
    • 8.3 Create Docker Compose for local development

      • Implement docker-compose.yml with LocalStack for AWS services
      • Add mock HTTP services for testing monitoring targets
      • Configure volume mounts for configuration files and logs
      • Create local development documentation and scripts
      • Requirements: 4.1, 4.4
  • 9. Implement Terraform infrastructure

    • 9.1 Create ECS cluster and task definition

      • Use terraform-aws-modules for ECS cluster creation
      • Implement ECS task definition with proper resource allocation
      • Configure Fargate launch type with networking and security groups
      • Add variables for customizable CPU, memory, and scaling parameters
      • Requirements: 6.1, 6.2, 6.3
    • 9.2 Create EventBridge scheduled rule

      • Implement EventBridge rule with configurable schedule expression (rate or cron)
      • Configure ECS task target with proper IAM role assignment
      • Add dead letter queue for failed task executions
      • Create variables for schedule customization
      • Requirements: 1.1, 1.3, 6.4
    • 9.3 Create S3 bucket for configuration storage

      • Use terraform-aws-modules for S3 bucket creation
      • Implement proper bucket policies and encryption settings
      • Configure versioning and lifecycle policies for cost optimization
      • Add variables for bucket naming and retention policies
      • Requirements: 2.1, 6.3, 6.4
    • 9.4 Create IAM roles and policies

      • Implement ECS task execution role and task role with minimum required permissions
      • Create IAM policies for S3, Secrets Manager, and CloudWatch access
      • Add condition-based policies for enhanced security
      • Configure proper resource ARN restrictions
      • Requirements: 5.2, 6.3
    • 9.5 Create CloudWatch resources

      • Implement CloudWatch Log Group for ECS task logs
      • Create CloudWatch Alarms for monitoring system health
      • Configure SNS topics for alert delivery
      • Add variables for log retention and alarm thresholds
      • Requirements: 3.1, 3.2, 7.3
    • 9.6 Create Secrets Manager resources

      • Implement Secrets Manager secrets for alert webhook URLs and tokens
      • Configure proper encryption and access policies
      • Add rotation configuration for enhanced security
      • Create variables for secret naming and rotation schedules
      • Requirements: 5.1, 5.3
  • 10. Create comprehensive test suite

    • 10.1 Implement unit tests for all core components

      • Write unit tests for configuration loading, HTTP monitoring, metrics publishing
      • Add unit tests for alert evaluation and delivery mechanisms
      • Implement mocking for all AWS SDK calls and external HTTP requests
      • Achieve minimum 90% code coverage across all modules
      • Requirements: All requirements for quality assurance
    • 10.2 Create integration tests

      • Implement end-to-end tests using LocalStack for AWS services
      • Add integration tests for complete monitoring workflow
      • Create performance tests for parallel execution with 100+ targets
      • Write tests for error scenarios and graceful degradation
      • Requirements: 8.1, 8.2
  • 11. Create deployment and operational documentation

    • 11.1 Write deployment documentation

      • Create README with setup instructions and prerequisites
      • Document Terraform variable configuration and customization options
      • Add troubleshooting guide for common deployment issues
      • Create operational runbooks for monitoring and maintenance
      • Requirements: 6.4
    • 11.2 Create configuration examples and templates

      • Provide sample YAML configuration files for different use cases
      • Create Terraform tfvars examples for different environments
      • Add Docker run examples for local testing and development
      • Document best practices for scaling and cost optimization
      • Requirements: 2.2, 8.1, 8.5