mostlyfine/design.md

## design.md

      
            
          
    Design Document

Overview

The ECS External Monitoring System runs automated health checks against external services without dedicated servers. Scheduled AWS ECS Fargate tasks run TypeScript monitoring code, load their configuration from S3, read secrets from AWS Secrets Manager, and raise alerts by email, Slack, and webhook. The architecture is designed to scale to hundreds of monitoring targets while keeping operational costs low.
Architecture


      graph TB
    subgraph "Configuration & Secrets"
        S3[S3 Bucket<br/>Config Files]
        SM[Secrets Manager<br/>Credentials]
    end
    
    subgraph "ECS Cluster"
        ER[EventBridge Rule<br/>1-minute schedule]
        ET[ECS Task<br/>Fargate]
        FB[Fluent Bit<br/>Log Router]
    end
    
    subgraph "Monitoring & Alerting"
        CWM[CloudWatch Metrics]
        CWL[CloudWatch Logs]
        CWA[CloudWatch Alarms]
        SNS[SNS Topics]
        Lambda[Lambda Functions<br/>Alert Handlers]
    end
    
    subgraph "External Services"
        ES1[External Service 1]
        ES2[External Service 2]
        ESN[External Service N]
    end
    
    subgraph "Alert Destinations"
        Email[Email]
        Slack[Slack]
        Webhook[Webhook]
    end
    
    ER --> ET
    ET --> S3
    ET --> SM
    ET --> ES1
    ET --> ES2
    ET --> ESN
    ET --> CWM
    FB --> CWL
    CWM --> CWA
    CWA --> SNS
    SNS --> Lambda
    Lambda --> Email
    Lambda --> Slack
    Lambda --> Webhook

    

  
Components and Interfaces

1. Configuration Management

S3 Configuration Store

Purpose: Centralized storage for monitoring configurations
Structure:
# monitoring-config.yaml
global:
  defaultTimeout: 30
  defaultRetries: 3
  defaultUserAgent: "ECS-Monitor/1.0"
  alertThresholds:
    errorRate: 0.1
    responseTime: 5000

targets:
  - name: "api-service"
    url: "https://api.example.com/health"
    expectedStatusCodes: [200, 201]
    timeout: 10
    retries: 2
    userAgent: "Custom-Monitor/1.0"
    interval: 60
    alertThresholds:
      errorRate: 0.05
      responseTime: 2000
  - name: "web-service"
    url: "https://web.example.com"
    expectedStatusCodes: [200]
    timeout: 15
    retries: 1

alerts:
  email:
    enabled: true
    recipients: ["ops@example.com"]
  slack:
    enabled: true
    webhookSecretName: "slack-webhook-url"
    channel: "#alerts"
  webhook:
    enabled: true
    urlSecretName: "custom-webhook-url"
    headers:
      Authorization: "Bearer ${webhook-token}"


Configuration Loader Interface
interface MonitoringConfig {
  global: GlobalConfig;
  targets: MonitoringTarget[];
  alerts: AlertConfig;
}

interface MonitoringTarget {
  name: string;
  url: string;
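  expectedStatusCodes: number[]; // any listed code counts as a success
  timeout: number;               // presumably seconds (the sample config uses 10 to 30)
  retries: number;
  userAgent: string;
  interval: number;              // presumably seconds between checks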
  alertThresholds: ThresholdConfig;
}
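
The sample YAML leaves some per-target fields out (the web-service entry has no userAgent, interval, or alertThresholds), while MonitoringTarget requires all of them. One way to bridge that gap is to merge each raw target with the global defaults before use. The following is a minimal sketch under assumptions: `RawTarget`, `ResolvedTarget`, and `resolveTarget` are hypothetical names, and the fallbacks of `[200]` for status codes and `60` for the interval are illustrative choices that do not come from the design.

```typescript
interface Thresholds {
  errorRate: number;
  responseTime: number;
}

interface Defaults {
  defaultTimeout: number;
  defaultRetries: number;
  defaultUserAgent: string;
  alertThresholds: Thresholds;
}

// Hypothetical shape of a target as written in YAML: only name and url are mandatory.
interface RawTarget {
  name: string;
  url: string;
  expectedStatusCodes?: number[];
  timeout?: number;
  retries?: number;
  userAgent?: string;
  interval?: number;
  alertThresholds?: Partial<Thresholds>;
}

// Hypothetical shape after defaults are applied: every field is present.
interface ResolvedTarget {
  name: string;
  url: string;
  expectedStatusCodes: number[];
  timeout: number;
  retries: number;
  userAgent: string;
  interval: number;
  alertThresholds: Thresholds;
}

function resolveTarget(raw: RawTarget, defaults: Defaults): ResolvedTarget {
  return {
    name: raw.name,
    url: raw.url,
    expectedStatusCodes: raw.expectedStatusCodes ?? [200], // assumed fallback
    timeout: raw.timeout ?? defaults.defaultTimeout,
    retries: raw.retries ?? defaults.defaultRetries,
    userAgent: raw.userAgent ?? defaults.defaultUserAgent,
    interval: raw.interval ?? 60, // assumed fallback, matching the 1-minute schedule
    // Per-target thresholds override the global ones field by field.
    alertThresholds: { ...defaults.alertThresholds, ...raw.alertThresholds },
  };
}
```

With the sample configuration, the web-service target would keep its own timeout of 15 and pick up the global user agent and thresholds.
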
2. Monitoring Engine

Parallel Execution Manager

Purpose: Orchestrates concurrent monitoring of multiple targets
Implementation: Uses TypeScript async/await with Promise.all for parallel execution; each check handles its own errors so that one failing target does not reject the whole batch
Resource Management: Implements connection pooling and request queuing to prevent resource exhaustion
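
A bare Promise.all rejects as soon as any one promise rejects, and it starts every check at once. The sketch below shows one possible way to get both error isolation and a cap on in-flight requests: a small worker pool that converts a thrown error into a failed outcome. `runWithLimit` and `Outcome` are hypothetical names, and the `check` callback stands in for whatever HTTP check the monitor performs.

```typescript
type Outcome<R> = { ok: true; value: R } | { ok: false; error: string };

// Runs `check` for every item with at most `limit` checks in flight.
// Outcomes are returned in the same order as `items`.
async function runWithLimit<T, R>(
  items: T[],
  limit: number,
  check: (item: T) => Promise<R>,
): Promise<Outcome<R>[]> {
  const outcomes = new Array<Outcome<R>>(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      // No race here: nothing awaits between the bounds check and the increment.
      const index = next++;
      const item = items[index] as T;
      try {
        outcomes[index] = { ok: true, value: await check(item) };
      } catch (err) {
        // A failing target becomes a failed outcome instead of rejecting the batch.
        outcomes[index] = { ok: false, error: err instanceof Error ? err.message : String(err) };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return outcomes;
}
```

A suitable limit depends on the task's memory and the connection pool size, so it is left as a parameter.
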

HTTP Client Interface
interface MonitoringResult {
  target: string;
  timestamp: Date;
  success: boolean;
  statusCode?: number;
  responseTime: number;
  error?: string;
  retryCount: number;
}

interface HttpMonitor {
  checkTarget(target: MonitoringTarget): Promise<MonitoringResult>;
  checkMultipleTargets(targets: MonitoringTarget[]): Promise<MonitoringResult[]>;
}
3. Metrics and Alerting

CloudWatch Metrics Publisher

Custom Metrics:

ExternalMonitor/ResponseTime (milliseconds)
ExternalMonitor/SuccessRate (percentage)
ExternalMonitor/ErrorCount (count)
ExternalMonitor/StatusCode (value)


Dimensions: TargetName, Environment
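
To make the mapping from a check result to these metrics concrete, here is a minimal sketch. It assumes that `ExternalMonitor` is the namespace and that the part after the slash is the metric name; the unit strings are standard CloudWatch units, and reporting a single check as 100 or 0 percent is an illustrative choice. `toMetricData` is a hypothetical helper.

```typescript
interface MonitoringResult {
  target: string;
  timestamp: Date;
  success: boolean;
  statusCode?: number;
  responseTime: number;
  error?: string;
  retryCount: number;
}

interface MetricData {
  metricName: string;
  value: number;
  unit: string;
  dimensions: Record<string, string>;
  timestamp: Date;
}

// Converts one check result into the custom metrics listed above.
function toMetricData(result: MonitoringResult, environment: string): MetricData[] {
  const base = {
    dimensions: { TargetName: result.target, Environment: environment },
    timestamp: result.timestamp,
  };
  const metrics: MetricData[] = [
    { ...base, metricName: "ResponseTime", value: result.responseTime, unit: "Milliseconds" },
    { ...base, metricName: "SuccessRate", value: result.success ? 100 : 0, unit: "Percent" },
    { ...base, metricName: "ErrorCount", value: result.success ? 0 : 1, unit: "Count" },
  ];
  // A network-level failure has no status code, so that metric is skipped.
  if (result.statusCode !== undefined) {
    metrics.push({ ...base, metricName: "StatusCode", value: result.statusCode, unit: "None" });
  }
  return metrics;
}
```
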

Alert Manager
interface AlertManager {
  evaluateThresholds(results: MonitoringResult[]): AlertEvent[];
  sendAlerts(events: AlertEvent[]): Promise<void>;
}

interface AlertEvent {
  target: string;
  severity: 'warning' | 'critical';
  message: string;
  timestamp: Date;
  metrics: Record<string, number>;
}
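
As an illustration of what evaluateThresholds might do, the sketch below groups results by target, computes the error rate and mean response time, and compares them with that target's thresholds. The severity mapping (error rate breach is critical, slow responses are a warning) is an assumption, as is passing thresholds in as a lookup keyed by target name; the interfaces are trimmed to the fields the sketch needs.

```typescript
interface CheckResult {
  target: string;
  success: boolean;
  responseTime: number;
}

interface Limits {
  errorRate: number;    // 0.0 to 1.0
  responseTime: number; // milliseconds
}

interface AlertEvent {
  target: string;
  severity: "warning" | "critical";
  message: string;
  timestamp: Date;
  metrics: Record<string, number>;
}

function evaluateThresholds(
  results: CheckResult[],
  limitsByTarget: Record<string, Limits>,
  now: Date = new Date(),
): AlertEvent[] {
  // Group results per target, preserving first-seen order.
  const byTarget = new Map<string, CheckResult[]>();
  for (const r of results) {
    const group = byTarget.get(r.target) ?? [];
    group.push(r);
    byTarget.set(r.target, group);
  }

  const events: AlertEvent[] = [];
  byTarget.forEach((group, target) => {
    const limits = limitsByTarget[target];
    if (limits === undefined) return; // no thresholds configured for this target
    const errorRate = group.filter((r) => !r.success).length / group.length;
    const avgResponseTime = group.reduce((sum, r) => sum + r.responseTime, 0) / group.length;
    const metrics = { errorRate, avgResponseTime };
    if (errorRate > limits.errorRate) {
      events.push({ target, severity: "critical", message: `error rate ${errorRate} exceeds ${limits.errorRate}`, timestamp: now, metrics });
    } else if (avgResponseTime > limits.responseTime) {
      events.push({ target, severity: "warning", message: `mean response time ${avgResponseTime} ms exceeds ${limits.responseTime} ms`, timestamp: now, metrics });
    }
  });
  return events;
}
```
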
4. Infrastructure Components

ECS Task Definition

CPU: 256 (0.25 vCPU) - optimized for I/O bound operations
Memory: 512 MB - sufficient for TypeScript runtime and parallel HTTP requests
Network Mode: awsvpc for security group isolation
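IAM Roles: Minimal permissions throughout. The task execution role covers ECR image pulls; the task role covers the CloudWatch, S3, and Secrets Manager calls made by the monitoring code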

EventBridge Scheduled Rule

Schedule: Configurable schedule expression, either cron or rate (default: rate(1 minute))
Target: ECS Task with specific task definition
Dead Letter Queue: For failed task executions
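
EventBridge rate expressions use a singular unit when the value is 1 and a plural unit otherwise, which is easy to get wrong when the interval is a variable. A small helper along these lines could build the expression; `rateExpression` is a hypothetical name and not part of the design.

```typescript
type RateUnit = "minute" | "hour" | "day";

// Builds an EventBridge rate expression such as "rate(1 minute)" or "rate(5 minutes)".
function rateExpression(value: number, unit: RateUnit): string {
  if (!Number.isInteger(value) || value < 1) {
    throw new RangeError(`rate value must be a positive integer, got ${value}`);
  }
  const suffix = value === 1 ? unit : `${unit}s`;
  return `rate(${value} ${suffix})`;
}
```
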

Data Models

Configuration Schema

// Global configuration
interface GlobalConfig {
  defaultTimeout: number;
  defaultRetries: number;
  defaultUserAgent: string;
  alertThresholds: ThresholdConfig;
  debug: boolean;
  dryRun: boolean;
}

// Threshold configuration
interface ThresholdConfig {
  errorRate: number;        // 0.0 to 1.0
  responseTime: number;     // milliseconds
  consecutiveFailures: number;
}

// Alert configuration
interface AlertConfig {
  email: EmailConfig;
  slack: SlackConfig;
  webhook: WebhookConfig;
}

interface EmailConfig {
  enabled: boolean;
  recipients: string[];
  snsTopicArn?: string;
}

interface SlackConfig {
  enabled: boolean;
  webhookSecretName: string;
  channel: string;
  username?: string;
}

interface WebhookConfig {
  enabled: boolean;
  urlSecretName: string;
  headers: Record<string, string>;
  method: 'POST' | 'PUT';
}
Metrics Data Model

interface MetricData {
  metricName: string;
  value: number;
  unit: string;
  dimensions: Record<string, string>;
  timestamp: Date;
}

interface CloudWatchMetricBatch {
  namespace: string;
  metrics: MetricData[];
}
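
CloudWatch caps how many metrics a single PutMetricData request may carry, and that cap has changed over time, so the sketch below takes the batch size as a parameter instead of hard-coding it. It shows one way to split a list of MetricData into CloudWatchMetricBatch values; `toBatches` is a hypothetical helper.

```typescript
interface MetricData {
  metricName: string;
  value: number;
  unit: string;
  dimensions: Record<string, string>;
  timestamp: Date;
}

interface CloudWatchMetricBatch {
  namespace: string;
  metrics: MetricData[];
}

// Splits metrics into batches of at most `maxPerRequest` entries each.
function toBatches(
  namespace: string,
  metrics: MetricData[],
  maxPerRequest: number,
): CloudWatchMetricBatch[] {
  if (!Number.isInteger(maxPerRequest) || maxPerRequest < 1) {
    throw new RangeError("maxPerRequest must be a positive integer");
  }
  const batches: CloudWatchMetricBatch[] = [];
  for (let i = 0; i < metrics.length; i += maxPerRequest) {
    batches.push({ namespace, metrics: metrics.slice(i, i + maxPerRequest) });
  }
  return batches;
}
```
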
Error Handling

Retry Strategy


Exponential Backoff: Base delay of 1 second, maximum of 30 seconds
Jitter: Random delay component to prevent thundering herd
Circuit Breaker: Temporary failure isolation for consistently failing targets
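
The backoff rule can be sketched as a pure function of the attempt number. This version uses the "equal jitter" variant, which is an assumption since the design only says a random component is added: half of the capped exponential delay is fixed and the other half is random. The random source is injectable so the function can be tested deterministically; `backoffDelayMs` is a hypothetical name.

```typescript
// Returns the delay in milliseconds before retry number `attempt` (0-based).
// The exponential ceiling is baseMs * 2^attempt, capped at maxMs.
// The result lies in [ceiling / 2, ceiling).
function backoffDelayMs(
  attempt: number,
  random: () => number = Math.random,
  baseMs: number = 1000,
  maxMs: number = 30000,
): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  const half = ceiling / 2;
  return Math.floor(half + random() * half);
}
```

With the defaults, the first retry waits between 0.5 and 1 second, and from the fifth retry onward the wait stays between 15 and 30 seconds.
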

Error Categories


Network Errors: DNS resolution, connection timeouts, network unreachable
HTTP Errors: 4xx/5xx status codes, invalid responses
Configuration Errors: Invalid YAML, missing required fields
AWS Service Errors: CloudWatch API limits, S3 access denied, Secrets Manager failures
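
One way to sort failures into these four categories is to inspect a few well-known fields on the error. The sketch below is illustrative only: the Node.js network error codes are real, but the list is not exhaustive, the AWS error names are a small sample, and `ConfigError` is a hypothetical name for whatever the configuration loader throws.

```typescript
type ErrorCategory = "network" | "http" | "configuration" | "aws" | "unknown";

// Minimal view of a failure; real errors carry more fields.
interface Failure {
  code?: string;       // Node.js system error code, if any
  name?: string;       // error class or AWS error name, if any
  statusCode?: number; // HTTP status of the monitored target, if a response arrived
}

const NETWORK_CODES = new Set(["ENOTFOUND", "EAI_AGAIN", "ECONNREFUSED", "ECONNRESET", "ETIMEDOUT", "ENETUNREACH"]);
const AWS_ERROR_NAMES = new Set(["AccessDenied", "ThrottlingException", "ResourceNotFoundException"]);

function categorize(failure: Failure): ErrorCategory {
  if (failure.code !== undefined && NETWORK_CODES.has(failure.code)) return "network";
  if (failure.name !== undefined && AWS_ERROR_NAMES.has(failure.name)) return "aws";
  if (failure.name === "ConfigError") return "configuration"; // hypothetical loader error
  if (failure.statusCode !== undefined && failure.statusCode >= 400) return "http";
  return "unknown";
}
```
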

Graceful Degradation


Partial Failures: Continue monitoring other targets if some fail
Configuration Fallbacks: Use default values for missing configuration
Alert Fallbacks: Log to CloudWatch if alert delivery fails
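
The alert fallback can be sketched as follows: every channel is attempted independently, and a failed delivery is written to a log sink instead of aborting the remaining channels. `Channel`, `deliverAll`, and the injected `logFallback` callback are hypothetical; in the deployed system the log sink would be whatever ends up in CloudWatch Logs.

```typescript
interface Channel {
  name: string;
  send: (message: string) => Promise<void>;
}

// Sends `message` to every channel; returns the names of channels that failed.
async function deliverAll(
  message: string,
  channels: Channel[],
  logFallback: (line: string) => void,
): Promise<string[]> {
  const failed: string[] = [];
  await Promise.all(
    channels.map(async (channel) => {
      try {
        await channel.send(message);
      } catch (err) {
        const reason = err instanceof Error ? err.message : String(err);
        failed.push(channel.name);
        // Keep the alert text in the log so it is not lost with the delivery.
        logFallback(`alert delivery via ${channel.name} failed (${reason}): ${message}`);
      }
    }),
  );
  return failed;
}
```
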

Testing Strategy

Unit Testing


HTTP Client: Mock external services, test retry logic, timeout handling
Configuration Loader: Test YAML parsing, validation, default value application
Metrics Publisher: Mock CloudWatch API, test batch operations
Alert Manager: Test threshold evaluation, notification formatting

Integration Testing


End-to-End: Deploy to test environment, verify complete workflow
AWS Services: Test actual CloudWatch, S3, and Secrets Manager integration
Docker Container: Verify local execution, dry-run mode, debug output

Load Testing


Parallel Execution: Test with 100+ concurrent monitoring targets
Resource Utilization: Monitor CPU, memory, and network usage
Cost Analysis: Measure CloudWatch API calls, data transfer costs

Local Development


Docker Compose: Local stack with LocalStack for AWS services
Mock Services: HTTP servers returning various status codes and delays
Configuration Validation: CLI tool for validating YAML configurations

Security Considerations

IAM Permissions (Principle of Least Privilege)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::monitoring-config-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:*:*:secret:monitoring/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "cloudwatch:namespace": "ExternalMonitor"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:/ecs/external-monitor:*"
    }
  ]
}
Network Security


VPC Configuration: ECS tasks run in private subnets with NAT Gateway for outbound access
Security Groups: Restrictive inbound rules, outbound HTTPS only
Secrets Encryption: All secrets encrypted at rest and in transit

Cost Optimization

Resource Efficiency


Fargate Spot: Use Fargate Spot capacity for non-critical monitoring (AWS advertises discounts of up to 70%)
Task Sizing: Right-sized CPU/memory allocation based on actual usage
Connection Pooling: Reuse HTTP connections to reduce overhead

CloudWatch Optimization


Metric Batching: Send metrics in batches to reduce API calls
Custom Metrics: Use high-resolution metrics only when necessary
Log Retention: Configurable retention periods (default: 30 days)

Scaling Strategy


Horizontal Scaling: Multiple smaller tasks instead of single large task
Target Grouping: Batch multiple targets per task execution
Adaptive Scheduling: Adjust frequency based on target criticality

Cost Monitoring


Budget Alerts: CloudWatch billing alarms for cost overruns
Usage Metrics: Track API calls, data transfer, and compute costs
Cost Attribution: Tag resources for cost allocation by project/team


## requirements.md

      
            
          
    Requirements Document

Introduction

This feature implements an automated external monitoring system that uses AWS ECS scheduled tasks to perform health checks on external services. The system runs TypeScript-based monitoring tasks in parallel, collects metrics, sends them to CloudWatch, and triggers alerts via multiple channels (email, Slack, webhook) based on configurable thresholds. The entire infrastructure is provisioned using Terraform with aws-terraform-modules, and the system is designed to scale efficiently to hundreds of monitoring targets without significant cost increases.
Requirements

Requirement 1

User Story: As a DevOps engineer, I want to schedule ECS tasks to run external monitoring checks every minute, so that I can continuously monitor the health of external services.
Acceptance Criteria


WHEN the system is deployed THEN ECS scheduled tasks SHALL execute every 1 minute by default
WHEN a monitoring task runs THEN it SHALL execute TypeScript code in parallel for multiple monitoring targets
WHEN the monitoring interval is configured THEN the system SHALL respect the custom interval setting
WHEN ECS tasks are scheduled THEN they SHALL use the minimum required resources to optimize costs

Requirement 2

User Story: As a system administrator, I want to configure monitoring targets via YAML files stored in S3, so that I can easily manage and update monitoring configurations without redeploying the system.
Acceptance Criteria


WHEN the system starts THEN it SHALL download configuration files from S3
WHEN configuration includes URL, expected status codes, user agent, timeout, retry count, alert thresholds, and monitoring intervals THEN the system SHALL apply these settings
WHEN expected status codes are specified as multiple values THEN the system SHALL accept any of the specified codes as successful
WHEN configuration files are updated in S3 THEN the next ECS task execution SHALL use the updated configuration
WHEN configuration files are malformed THEN the system SHALL log errors and continue with default settings where possible
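
The third criterion, accepting any of several status codes, comes down to a membership test. The sketch below adds one assumption that is not in the requirements: a missing or empty list falls back to `[200]`. `isExpectedStatus` is a hypothetical helper.

```typescript
// Returns true when `statusCode` is one of the codes the target treats as success.
function isExpectedStatus(statusCode: number, expected?: number[]): boolean {
  // Assumed fallback: with no codes configured, only 200 counts as success.
  const codes = expected !== undefined && expected.length > 0 ? expected : [200];
  return codes.includes(statusCode);
}
```

For the api-service target in the sample configuration, both 200 and 201 would count as success, while 204 would not.
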

Requirement 3

User Story: As a monitoring operator, I want the system to send metrics to CloudWatch and trigger alerts via multiple channels, so that I can be notified of service issues through my preferred communication methods.
Acceptance Criteria


WHEN monitoring checks complete THEN the system SHALL send metrics to CloudWatch Metrics
WHEN metrics exceed configured thresholds THEN the system SHALL send alerts to configured channels
WHEN alert channels are configured THEN the system SHALL support email, Slack, and webhook notifications
WHEN alert thresholds are breached THEN notifications SHALL include relevant monitoring details
WHEN multiple alert channels are configured THEN the system SHALL send notifications to all configured channels

Requirement 4

User Story: As a developer, I want to run the monitoring system locally with a dry-run option, so that I can test configurations and debug issues without affecting production systems.
Acceptance Criteria


WHEN a Docker container is built THEN it SHALL support local execution
WHEN debug option is enabled THEN the system SHALL output detailed logs to stdout
WHEN dry-run mode is enabled THEN the system SHALL perform monitoring checks but not send metrics or alerts
WHEN running locally THEN the system SHALL use the same configuration format as production
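
Dry-run mode can be kept out of the monitoring logic by wrapping the component that has side effects. In the sketch below, checks still run, but the publisher handed to the rest of the code only logs what it would have sent. `Publisher` and `withDryRun` are hypothetical names, and the same pattern would apply to the alert sender.

```typescript
interface Publisher {
  publish(metricName: string, value: number): void;
}

// Returns the real publisher, or a stand-in that only logs when dry-run is on.
function withDryRun(
  real: Publisher,
  dryRun: boolean,
  log: (line: string) => void,
): Publisher {
  if (!dryRun) return real;
  return {
    publish(metricName: string, value: number): void {
      log(`[dry-run] would publish ${metricName}=${value}`);
    },
  };
}
```
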

Requirement 5

User Story: As a security-conscious administrator, I want authentication credentials managed through AWS Secrets Manager, so that sensitive information is stored securely and accessed with minimal privileges.
Acceptance Criteria


WHEN the system requires authentication credentials THEN it SHALL retrieve them from AWS Secrets Manager
WHEN accessing AWS services THEN the system SHALL use IAM roles with minimum required permissions
WHEN credentials are rotated in Secrets Manager THEN the system SHALL automatically use updated credentials
WHEN unauthorized access is attempted THEN the system SHALL fail gracefully with appropriate error logging
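
Because each scheduled task starts as a fresh process, rotated credentials are picked up on the next run as long as secrets are read at startup. If secrets are also cached within a run to reduce API calls, a time-to-live keeps the cache from outliving a rotation. The sketch below shows that idea with an injected fetch function and clock; `createSecretCache` is a hypothetical helper, and the fetch function stands in for the actual Secrets Manager call.

```typescript
// Returns a lookup function that caches each secret for `ttlMs` milliseconds.
function createSecretCache(
  fetchSecret: (name: string) => Promise<string>,
  ttlMs: number,
  now: () => number = () => Date.now(),
): (name: string) => Promise<string> {
  const entries = new Map<string, { value: string; expiresAt: number }>();

  return async function getSecret(name: string): Promise<string> {
    const cached = entries.get(name);
    if (cached !== undefined && cached.expiresAt > now()) {
      return cached.value; // still fresh
    }
    // Missing or expired: fetch again so a rotated value is picked up.
    const value = await fetchSecret(name);
    entries.set(name, { value, expiresAt: now() + ttlMs });
    return value;
  };
}
```
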

Requirement 6

User Story: As an infrastructure engineer, I want the entire system provisioned using Terraform with aws-terraform-modules, so that infrastructure is reproducible and follows best practices.
Acceptance Criteria


WHEN deploying the system THEN Terraform SHALL provision all required AWS resources
WHEN using aws-terraform-modules THEN the system SHALL leverage community-maintained, best-practice modules
WHEN project-specific values need customization THEN they SHALL be exposed as Terraform variables
WHEN infrastructure is destroyed THEN Terraform SHALL cleanly remove all provisioned resources
WHEN multiple environments are needed THEN the same Terraform code SHALL support different configurations

Requirement 7

User Story: As a system operator, I want comprehensive logging sent to CloudWatch Logs via Fluent Bit, so that I can troubleshoot issues and maintain audit trails.
Acceptance Criteria


WHEN ECS tasks execute THEN logs SHALL be collected by Fluent Bit
WHEN logs are collected THEN they SHALL be sent to CloudWatch Logs
WHEN log retention is configured THEN CloudWatch Logs SHALL respect the retention settings
WHEN debugging is enabled THEN additional detailed logs SHALL be generated
WHEN log parsing fails THEN Fluent Bit SHALL continue operating and log the parsing errors

Requirement 8

User Story: As a cost-conscious organization, I want the system to scale efficiently to hundreds of monitoring targets without proportional cost increases, so that monitoring costs remain manageable as we grow.
Acceptance Criteria


WHEN monitoring targets increase to hundreds THEN the system SHALL maintain efficient resource utilization
WHEN parallel execution is implemented THEN it SHALL optimize for both speed and resource consumption
WHEN ECS tasks scale THEN they SHALL use appropriate Fargate task sizes and scaling policies
WHEN CloudWatch metrics volume increases THEN costs SHALL scale sub-linearly with the number of targets
WHEN the system is idle THEN it SHALL consume minimal resources

Requirement 9

User Story: As a monitoring administrator, I want flexible configuration options for timeouts, retries, and user agents, so that I can customize monitoring behavior for different types of services.
Acceptance Criteria


WHEN timeout values are configured THEN the system SHALL respect per-target timeout settings
WHEN retry counts are specified THEN the system SHALL attempt the configured number of retries on failures
WHEN custom user agents are defined THEN HTTP requests SHALL use the specified user agent strings
WHEN default values are not overridden THEN the system SHALL use sensible defaults for all configuration options
WHEN configuration validation fails THEN the system SHALL provide clear error messages indicating the invalid settings
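
For the last criterion, a validator that returns one message per problem, each naming the offending field, tends to be easier to act on than a single pass/fail. The sketch below checks a handful of fields by hand; the rules (http or https URL, positive timeout, status codes between 100 and 599) are illustrative assumptions, and a schema library could replace it. `validateTarget` is a hypothetical helper.

```typescript
// Returns a list of human-readable problems; an empty list means the target is valid.
function validateTarget(target: Record<string, unknown>, index: number): string[] {
  const errors: string[] = [];
  const path = `targets[${index}]`;
  const name = target["name"];
  const url = target["url"];
  const timeout = target["timeout"];
  const codes = target["expectedStatusCodes"];

  if (typeof name !== "string" || name.trim() === "") {
    errors.push(`${path}.name: expected a non-empty string`);
  }
  if (typeof url !== "string" || !/^https?:\/\/\S+$/.test(url)) {
    errors.push(`${path}.url: expected an http(s) URL, got ${JSON.stringify(url)}`);
  }
  // Optional fields are only checked when present.
  if (timeout !== undefined && (typeof timeout !== "number" || !(timeout > 0))) {
    errors.push(`${path}.timeout: expected a positive number, got ${JSON.stringify(timeout)}`);
  }
  if (codes !== undefined) {
    const valid =
      Array.isArray(codes) &&
      codes.length > 0 &&
      codes.every((c) => Number.isInteger(c) && c >= 100 && c <= 599);
    if (!valid) {
      errors.push(`${path}.expectedStatusCodes: expected a non-empty list of integers from 100 to 599`);
    }
  }
  return errors;
}
```
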


## tasks.md

      
            
          
    Implementation Plan


 1. Set up project structure and core TypeScript configuration

Create directory structure for src/, tests/, terraform/, docker/
Initialize TypeScript project with tsconfig.json and package.json
Configure ESLint, Prettier, and Jest for code quality and testing
Requirements: 1.2, 4.1


 2. Implement configuration management system


 2.1 Create YAML configuration schema and TypeScript interfaces

Define MonitoringConfig, MonitoringTarget, AlertConfig interfaces
Implement YAML schema validation using joi or similar library
Create default configuration values and validation rules
Requirements: 2.2, 2.5, 9.4


 2.2 Implement S3 configuration loader

Write S3Client wrapper for downloading configuration files
Implement configuration parsing and validation logic
Add error handling for malformed configurations with fallback to defaults
Create unit tests for configuration loading and validation
Requirements: 2.1, 2.4, 2.5


 3. Implement AWS Secrets Manager integration

 3.1 Create secrets manager client wrapper

Implement SecretManager class for retrieving credentials
Add caching mechanism for secrets to reduce API calls
Implement error handling for missing or inaccessible secrets
Write unit tests with mocked AWS SDK calls
Requirements: 5.1, 5.3, 5.4


 4. Implement HTTP monitoring engine


 4.1 Create HTTP client with retry and timeout logic

Implement HttpMonitor class with configurable timeout and retry
Add exponential backoff with jitter for retry strategy
Implement connection pooling using axios or similar HTTP client
Create unit tests for timeout, retry, and error scenarios
Requirements: 1.2, 9.1, 9.2, 9.3


 4.2 Implement parallel monitoring execution

Create ParallelMonitor class for concurrent target monitoring
Implement Promise.all-based parallel execution with error isolation
Add resource management to prevent memory/connection exhaustion
Write unit tests for parallel execution and error handling
Requirements: 1.2, 8.2, 8.3


 4.3 Implement monitoring result collection and processing

Create MonitoringResult interface and result aggregation logic
Implement success rate calculation and response time metrics
Add result validation and error categorization
Write unit tests for result processing and metric calculation
Requirements: 1.2, 3.1, 8.1
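
The aggregation step in task 4.3 could look roughly like this: a pure function that reduces a batch of results to a success rate and response time figures. `summarize` and `RunSummary` are hypothetical names, and returning zeros for an empty batch is an assumption; callers would need to treat `total === 0` as "no data" so it is not mistaken for an outage.

```typescript
interface CheckOutcome {
  success: boolean;
  responseTime: number; // milliseconds
}

interface RunSummary {
  total: number;
  successRate: number;     // percentage, 0 to 100
  avgResponseTime: number; // milliseconds
  maxResponseTime: number; // milliseconds
}

function summarize(results: CheckOutcome[]): RunSummary {
  if (results.length === 0) {
    // Assumed convention: an empty batch yields zeros; check `total` before alerting.
    return { total: 0, successRate: 0, avgResponseTime: 0, maxResponseTime: 0 };
  }
  const successes = results.filter((r) => r.success).length;
  const totalTime = results.reduce((sum, r) => sum + r.responseTime, 0);
  const maxTime = results.reduce((max, r) => Math.max(max, r.responseTime), 0);
  return {
    total: results.length,
    successRate: (successes / results.length) * 100,
    avgResponseTime: totalTime / results.length,
    maxResponseTime: maxTime,
  };
}
```
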


 5. Implement CloudWatch metrics integration


 5.1 Create CloudWatch metrics publisher

Implement MetricsPublisher class for sending custom metrics
Add metric batching to optimize CloudWatch API usage
Implement proper error handling for CloudWatch API failures
Create unit tests with mocked CloudWatch SDK calls
Requirements: 3.1, 8.4


 5.2 Implement metric data transformation

Create functions to convert MonitoringResult to CloudWatch metrics
Implement proper metric dimensions and namespacing
Add metric aggregation for batch publishing
Write unit tests for metric transformation and batching
Requirements: 3.1, 8.4


 6. Implement alerting system


 6.1 Create threshold evaluation engine

Implement AlertManager class for evaluating monitoring results against thresholds
Add support for error rate, response time, and consecutive failure thresholds
Implement alert event generation with proper severity levels
Write unit tests for threshold evaluation logic
Requirements: 3.2, 3.4


 6.2 Implement multi-channel alert delivery

Create alert handlers for email (SNS), Slack webhook, and custom webhook
Implement alert message formatting for each channel type
Add error handling and fallback logging for failed alert delivery
Write unit tests for alert formatting and delivery logic
Requirements: 3.3, 3.5


 7. Implement main application orchestrator


 7.1 Create main application entry point

Implement App class that orchestrates configuration loading, monitoring, and alerting
Add command-line argument parsing for debug and dry-run modes
Implement proper error handling and graceful shutdown
Write integration tests for complete monitoring workflow
Requirements: 1.1, 4.2, 4.3


 7.2 Add debug and dry-run functionality

Implement debug logging with detailed stdout output
Add dry-run mode that performs checks without sending metrics or alerts
Create environment variable configuration for runtime options
Write unit tests for debug and dry-run mode behavior
Requirements: 4.2, 4.3
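
Tasks 7.1 and 7.2 mention both command-line arguments and environment variables for the debug and dry-run switches. A minimal sketch of combining the two follows. The flag names `--debug` and `--dry-run` and the variables `MONITOR_DEBUG` and `MONITOR_DRY_RUN` are hypothetical, since the plan does not name them.

```typescript
interface RuntimeOptions {
  debug: boolean;
  dryRun: boolean;
}

// Treats "1" and "true" (any letter case) as enabled; anything else is disabled.
function isEnabled(value: string | undefined): boolean {
  return value !== undefined && (value === "1" || value.toLowerCase() === "true");
}

// A flag on the command line or a truthy environment variable turns an option on.
function parseRuntimeOptions(
  argv: string[],
  env: Record<string, string | undefined>,
): RuntimeOptions {
  return {
    debug: argv.includes("--debug") || isEnabled(env["MONITOR_DEBUG"]),
    dryRun: argv.includes("--dry-run") || isEnabled(env["MONITOR_DRY_RUN"]),
  };
}
```

In a Node.js entry point this would typically be called with `process.argv.slice(2)` and `process.env`.
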


 8. Create Docker container configuration


 8.1 Write Dockerfile for TypeScript application

Create multi-stage Dockerfile with Node.js runtime
Implement proper layer caching for dependencies
Add non-root user for security best practices
Configure proper signal handling for graceful shutdown
Requirements: 4.1


 8.2 Add Fluent Bit logging configuration

Create Fluent Bit configuration for CloudWatch Logs integration
Implement log parsing and structured logging format
Add log filtering and routing based on log levels
Configure proper error handling for log delivery failures
Requirements: 7.1, 7.2, 7.4


 8.3 Create Docker Compose for local development

Implement docker-compose.yml with LocalStack for AWS services
Add mock HTTP services for testing monitoring targets
Configure volume mounts for configuration files and logs
Create local development documentation and scripts
Requirements: 4.1, 4.4


 9. Implement Terraform infrastructure


 9.1 Create ECS cluster and task definition

Use aws-terraform-modules for ECS cluster creation
Implement ECS task definition with proper resource allocation
Configure Fargate launch type with networking and security groups
Add variables for customizable CPU, memory, and scaling parameters
Requirements: 6.1, 6.2, 6.3


 9.2 Create EventBridge scheduled rule

Implement EventBridge rule with configurable schedule expression (cron or rate)
Configure ECS task target with proper IAM role assignment
Add dead letter queue for failed task executions
Create variables for schedule customization
Requirements: 1.1, 1.3, 6.4


 9.3 Create S3 bucket for configuration storage

Use aws-terraform-modules for S3 bucket creation
Implement proper bucket policies and encryption settings
Configure versioning and lifecycle policies for cost optimization
Add variables for bucket naming and retention policies
Requirements: 2.1, 6.3, 6.4


 9.4 Create IAM roles and policies

Implement ECS task execution role with minimum required permissions
Create IAM policies for S3, Secrets Manager, and CloudWatch access
Add condition-based policies for enhanced security
Configure proper resource ARN restrictions
Requirements: 5.2, 6.3


 9.5 Create CloudWatch resources

Implement CloudWatch Log Group for ECS task logs
Create CloudWatch Alarms for monitoring system health
Configure SNS topics for alert delivery
Add variables for log retention and alarm thresholds
Requirements: 3.1, 3.2, 7.3


 9.6 Create Secrets Manager resources

Implement Secrets Manager secrets for alert webhook URLs and tokens
Configure proper encryption and access policies
Add rotation configuration for enhanced security
Create variables for secret naming and rotation schedules
Requirements: 5.1, 5.3


 10. Create comprehensive test suite


 10.1 Implement unit tests for all core components

Write unit tests for configuration loading, HTTP monitoring, metrics publishing
Add unit tests for alert evaluation and delivery mechanisms
Implement mocking for all AWS SDK calls and external HTTP requests
Achieve minimum 90% code coverage across all modules
Requirements: All requirements for quality assurance


 10.2 Create integration tests

Implement end-to-end tests using LocalStack for AWS services
Add integration tests for complete monitoring workflow
Create performance tests for parallel execution with 100+ targets
Write tests for error scenarios and graceful degradation
Requirements: 8.1, 8.2


 11. Create deployment and operational documentation


 11.1 Write deployment documentation

Create README with setup instructions and prerequisites
Document Terraform variable configuration and customization options
Add troubleshooting guide for common deployment issues
Create operational runbooks for monitoring and maintenance
Requirements: 6.4


 11.2 Create configuration examples and templates

Provide sample YAML configuration files for different use cases
Create Terraform tfvars examples for different environments
Add Docker run examples for local testing and development
Document best practices for scaling and cost optimization
Requirements: 2.2, 8.1, 8.5