The ECS External Monitoring System is a serverless, cost-efficient solution that runs automated health checks against external services. It uses AWS ECS Fargate scheduled tasks to run TypeScript-based monitoring code, stores configuration in S3, manages secrets through AWS Secrets Manager, and delivers alerts by email, Slack, and webhook. The architecture is designed to scale to hundreds of monitoring targets while keeping operational costs low.
graph TB
subgraph "Configuration & Secrets"
S3[S3 Bucket<br/>Config Files]
SM[Secrets Manager<br/>Credentials]
end
subgraph "ECS Cluster"
ER[EventBridge Rule<br/>1-minute schedule]
ET[ECS Task<br/>Fargate]
FB[Fluent Bit<br/>Log Router]
end
subgraph "Monitoring & Alerting"
CWM[CloudWatch Metrics]
CWL[CloudWatch Logs]
CWA[CloudWatch Alarms]
SNS[SNS Topics]
Lambda[Lambda Functions<br/>Alert Handlers]
end
subgraph "External Services"
ES1[External Service 1]
ES2[External Service 2]
ESN[External Service N]
end
subgraph "Alert Destinations"
Email[Email]
Slack[Slack]
Webhook[Webhook]
end
ER --> ET
ET --> S3
ET --> SM
ET --> ES1
ET --> ES2
ET --> ESN
ET --> CWM
FB --> CWL
CWM --> CWA
CWA --> SNS
SNS --> Lambda
Lambda --> Email
Lambda --> Slack
Lambda --> Webhook
S3 Configuration Store
- Purpose: Centralized storage for monitoring configurations
- Structure:
# monitoring-config.yaml
global:
  defaultTimeout: 30
  defaultRetries: 3
  defaultUserAgent: "ECS-Monitor/1.0"
  alertThresholds:
    errorRate: 0.1
    responseTime: 5000
targets:
  - name: "api-service"
    url: "https://api.example.com/health"
    expectedStatusCodes: [200, 201]
    timeout: 10
    retries: 2
    userAgent: "Custom-Monitor/1.0"
    interval: 60
    alertThresholds:
      errorRate: 0.05
      responseTime: 2000
  - name: "web-service"
    url: "https://web.example.com"
    expectedStatusCodes: [200]
    timeout: 15
    retries: 1
alerts:
  email:
    enabled: true
    recipients: ["ops@example.com"]
  slack:
    enabled: true
    webhookSecretName: "slack-webhook-url"
    channel: "#alerts"
  webhook:
    enabled: true
    urlSecretName: "custom-webhook-url"
    headers:
      Authorization: "Bearer ${webhook-token}"
Configuration Loader Interface
interface MonitoringConfig {
global: GlobalConfig;
targets: MonitoringTarget[];
alerts: AlertConfig;
}
interface MonitoringTarget {
name: string;
url: string;
expectedStatusCodes: number[];
timeout: number;
retries: number;
userAgent: string;
interval: number;
alertThresholds: ThresholdConfig;
}
Parallel Execution Manager
- Purpose: Orchestrates concurrent monitoring of multiple targets
- Implementation: Uses TypeScript async/await with Promise.all for parallel execution
- Resource Management: Implements connection pooling and request queuing to prevent resource exhaustion
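The parallel execution and request queuing described above can be sketched as a small worker pool. This is a minimal illustration, not the system's actual implementation: `checkAll`, `Checker`, the `limit` parameter, and the trimmed `Target`/`Result` shapes are hypothetical names introduced here. It also assumes a rejected check should become a failed result rather than abort the batch, which plain `Promise.all` over all targets would not do.

```typescript
// Hypothetical, trimmed-down shapes; the document's MonitoringTarget and
// MonitoringResult interfaces carry more fields.
interface Target { name: string; url: string; }
interface Result { target: string; success: boolean; responseTime: number; error?: string; }

type Checker = (target: Target) => Promise<Result>;

// Runs `check` for every target with at most `limit` checks in flight.
// Results are returned in the same order as `targets`.
async function checkAll(targets: Target[], check: Checker, limit: number): Promise<Result[]> {
  const results: Result[] = new Array(targets.length);
  let next = 0; // index of the next target waiting in the queue

  async function worker(): Promise<void> {
    while (next < targets.length) {
      const index = next++; // no race: nothing else runs between read and increment
      const target = targets[index];
      const started = Date.now();
      try {
        results[index] = await check(target);
      } catch (err) {
        // Assumption: a thrown error is recorded as a failed check so the
        // remaining targets are still monitored.
        results[index] = {
          target: target.name,
          success: false,
          responseTime: Date.now() - started,
          error: err instanceof Error ? err.message : String(err),
        };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, targets.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers); // workers never reject, so this waits for all of them
  return results;
}
```

A caller would pass the real HTTP check as `check` and pick `limit` from the task's CPU and memory budget; the value is a tuning knob, not something the design above prescribes.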
HTTP Client Interface
interface MonitoringResult {
target: string;
timestamp: Date;
success: boolean;
statusCode?: number;
responseTime: number;
error?: string;
retryCount: number;
}
interface HttpMonitor {
checkTarget(target: MonitoringTarget): Promise<MonitoringResult>;
checkMultipleTargets(targets: MonitoringTarget[]): Promise<MonitoringResult[]>;
}
CloudWatch Metrics Publisher
- Custom Metrics:
  - ExternalMonitor/ResponseTime (milliseconds)
  - ExternalMonitor/SuccessRate (percentage)
  - ExternalMonitor/ErrorCount (count)
  - ExternalMonitor/StatusCode (value)
- Dimensions: TargetName, Environment
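As a rough sketch of how one check result might map onto the metrics and dimensions listed above, and how the resulting metrics could be grouped for batched publishing: `toMetrics` and `chunk` are hypothetical helpers, the per-check encoding of SuccessRate (100 or 0) and ErrorCount (0 or 1) is an assumption, and the batch size is left as a parameter because it should be checked against the current CloudWatch request limits.

```typescript
// Trimmed copies of the document's interfaces, limited to the fields used here.
interface MonitoringResult {
  target: string;
  timestamp: Date;
  success: boolean;
  statusCode?: number;
  responseTime: number;
}
interface MetricData {
  metricName: string;
  value: number;
  unit: string;
  dimensions: Record<string, string>;
  timestamp: Date;
}

// Converts one check result into the four custom metrics.
function toMetrics(result: MonitoringResult, environment: string): MetricData[] {
  const base = {
    dimensions: { TargetName: result.target, Environment: environment },
    timestamp: result.timestamp,
  };
  const metrics: MetricData[] = [
    { ...base, metricName: "ResponseTime", value: result.responseTime, unit: "Milliseconds" },
    // Assumption: each check reports 100 or 0 so the average over a period is a percentage.
    { ...base, metricName: "SuccessRate", value: result.success ? 100 : 0, unit: "Percent" },
    { ...base, metricName: "ErrorCount", value: result.success ? 0 : 1, unit: "Count" },
  ];
  // StatusCode is only known when the request produced an HTTP response.
  if (result.statusCode !== undefined) {
    metrics.push({ ...base, metricName: "StatusCode", value: result.statusCode, unit: "None" });
  }
  return metrics;
}

// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch would then be sent under the ExternalMonitor namespace in a single publish call, which is what keeps the API call count low as the number of targets grows.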
Alert Manager
interface AlertManager {
evaluateThresholds(results: MonitoringResult[]): AlertEvent[];
sendAlerts(events: AlertEvent[]): Promise<void>;
}
interface AlertEvent {
target: string;
severity: 'warning' | 'critical';
message: string;
timestamp: Date;
metrics: Record<string, number>;
}
ECS Task Definition
- CPU: 256 (0.25 vCPU) - optimized for I/O bound operations
- Memory: 512 MB - sufficient for TypeScript runtime and parallel HTTP requests
- Network Mode: awsvpc for security group isolation
- Execution Role: Minimal permissions for ECR, CloudWatch, S3, and Secrets Manager
EventBridge Scheduled Rule
- Schedule: Configurable schedule expression (default: rate(1 minute))
- Target: ECS Task with specific task definition
- Dead Letter Queue: For failed task executions
// Global configuration
interface GlobalConfig {
defaultTimeout: number;
defaultRetries: number;
defaultUserAgent: string;
alertThresholds: ThresholdConfig;
debug: boolean;
dryRun: boolean;
}
// Threshold configuration
interface ThresholdConfig {
errorRate: number; // 0.0 to 1.0
responseTime: number; // milliseconds
consecutiveFailures: number;
}
// Alert configuration
interface AlertConfig {
email: EmailConfig;
slack: SlackConfig;
webhook: WebhookConfig;
}
interface EmailConfig {
enabled: boolean;
recipients: string[];
snsTopicArn?: string;
}
interface SlackConfig {
enabled: boolean;
webhookSecretName: string;
channel: string;
username?: string;
}
interface WebhookConfig {
enabled: boolean;
urlSecretName: string;
headers: Record<string, string>;
method: 'POST' | 'PUT';
}
interface MetricData {
metricName: string;
value: number;
unit: string;
dimensions: Record<string, string>;
timestamp: Date;
}
interface CloudWatchMetricBatch {
namespace: string;
metrics: MetricData[];
}
- Exponential Backoff: Base delay of 1 second, maximum of 30 seconds
- Jitter: Random delay component to prevent thundering herd
- Circuit Breaker: Temporary failure isolation for consistently failing targets
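The backoff and jitter rules above could look roughly like the following. The 1 second base and 30 second cap come from the list; everything else is an assumption made for illustration: the helper names `backoffDelayMs` and `withRetry`, the doubling factor, and the choice to keep half of each delay fixed and randomize the other half. The circuit breaker is not covered here.

```typescript
const BASE_DELAY_MS = 1000;  // base delay of 1 second
const MAX_DELAY_MS = 30000;  // maximum of 30 seconds

// Delay before retry number `attempt` (zero-based).
// Assumption: the delay doubles per attempt up to the cap, and half of it is
// randomized so that many tasks retrying at once spread out.
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, attempt));
  const half = ceiling / 2;
  return Math.floor(half + random() * half);
}

// Runs `operation`, retrying up to `retries` more times after a failure.
// `sleep` is injectable so tests do not have to wait for real timers.
async function withRetry<T>(
  operation: () => Promise<T>,
  retries: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // Only wait if another attempt will follow.
      if (attempt < retries) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

With these constants the delay reaches the 30 second cap from the sixth attempt onward (attempt index 5), so a target's `retries` value mostly determines how long a single check can take in the worst case.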
- Network Errors: DNS resolution, connection timeouts, network unreachable
- HTTP Errors: 4xx/5xx status codes, invalid responses
- Configuration Errors: Invalid YAML, missing required fields
- AWS Service Errors: CloudWatch API limits, S3 access denied, Secrets Manager failures
- Partial Failures: Continue monitoring other targets if some fail
- Configuration Fallbacks: Use default values for missing configuration
- Alert Fallbacks: Log to CloudWatch if alert delivery fails
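The configuration fallback rule above can be illustrated with a small resolver that fills per-target gaps from the global section. This is a sketch under assumptions: `RawTarget`, `ResolvedTarget`, and `resolveTarget` are hypothetical names, `ThresholdConfig` is trimmed to two fields, and the fallbacks for `expectedStatusCodes` ([200]) and `interval` (60 seconds) are guesses, since the configuration shown in this document defines no global default for them.

```typescript
// Trimmed version of the document's ThresholdConfig (consecutiveFailures omitted).
interface ThresholdConfig { errorRate: number; responseTime: number; }

interface GlobalConfig {
  defaultTimeout: number;
  defaultRetries: number;
  defaultUserAgent: string;
  alertThresholds: ThresholdConfig;
}

// A target as it may appear in the YAML file: only name and url are mandatory.
interface RawTarget {
  name: string;
  url: string;
  expectedStatusCodes?: number[];
  timeout?: number;
  retries?: number;
  userAgent?: string;
  interval?: number;
  alertThresholds?: Partial<ThresholdConfig>;
}

// A target with every field filled in, ready for the monitor to use.
interface ResolvedTarget {
  name: string;
  url: string;
  expectedStatusCodes: number[];
  timeout: number;
  retries: number;
  userAgent: string;
  interval: number;
  alertThresholds: ThresholdConfig;
}

function resolveTarget(raw: RawTarget, globals: GlobalConfig): ResolvedTarget {
  return {
    name: raw.name,
    url: raw.url,
    expectedStatusCodes: raw.expectedStatusCodes ?? [200], // assumed fallback
    timeout: raw.timeout ?? globals.defaultTimeout,
    retries: raw.retries ?? globals.defaultRetries,
    userAgent: raw.userAgent ?? globals.defaultUserAgent,
    interval: raw.interval ?? 60, // assumed fallback, matching the 1-minute schedule
    // Per-target thresholds override the global ones field by field.
    alertThresholds: { ...globals.alertThresholds, ...raw.alertThresholds },
  };
}
```

Applied to the sample configuration, "web-service" would inherit the global user agent and thresholds, while "api-service" keeps its own stricter values.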
- HTTP Client: Mock external services, test retry logic, timeout handling
- Configuration Loader: Test YAML parsing, validation, default value application
- Metrics Publisher: Mock CloudWatch API, test batch operations
- Alert Manager: Test threshold evaluation, notification formatting
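The Alert Manager tests listed above are easiest to write against a pure threshold-evaluation function. The following is one possible shape, not the system's actual logic. It differs from the `AlertManager` interface in this document by taking the thresholds and the current time as explicit parameters, and the severity mapping (error-rate breach is critical, response-time breach is a warning) is an assumption.

```typescript
// Trimmed copies of the document's interfaces, limited to the fields used here.
interface MonitoringResult { target: string; success: boolean; responseTime: number; }
interface ThresholdConfig { errorRate: number; responseTime: number; }
interface AlertEvent {
  target: string;
  severity: 'warning' | 'critical';
  message: string;
  timestamp: Date;
  metrics: Record<string, number>;
}

function evaluateThresholds(
  results: MonitoringResult[],
  thresholds: Map<string, ThresholdConfig>, // keyed by target name (assumed)
  now: Date = new Date(),
): AlertEvent[] {
  // Group the results of this run by target.
  const grouped = new Map<string, MonitoringResult[]>();
  for (const result of results) {
    const group = grouped.get(result.target) ?? [];
    group.push(result);
    grouped.set(result.target, group);
  }

  const events: AlertEvent[] = [];
  grouped.forEach((group, target) => {
    const limits = thresholds.get(target);
    if (!limits) return; // no thresholds configured for this target

    const failures = group.filter((r) => !r.success).length;
    const errorRate = failures / group.length;
    // Average over all checks, failed ones included.
    const avgResponseTime = group.reduce((sum, r) => sum + r.responseTime, 0) / group.length;
    const metrics = { errorRate, avgResponseTime };

    if (errorRate > limits.errorRate) {
      events.push({
        target,
        severity: 'critical',
        message: `${target}: error rate ${errorRate} exceeds ${limits.errorRate}`,
        timestamp: now,
        metrics,
      });
    } else if (avgResponseTime > limits.responseTime) {
      events.push({
        target,
        severity: 'warning',
        message: `${target}: average response time ${avgResponseTime} ms exceeds ${limits.responseTime} ms`,
        timestamp: now,
        metrics,
      });
    }
  });
  return events;
}
```

Because the function has no I/O and takes the clock as an argument, a unit test can feed it fixed results and compare the returned events directly.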
- End-to-End: Deploy to test environment, verify complete workflow
- AWS Services: Test actual CloudWatch, S3, and Secrets Manager integration
- Docker Container: Verify local execution, dry-run mode, debug output
- Parallel Execution: Test with 100+ concurrent monitoring targets
- Resource Utilization: Monitor CPU, memory, and network usage
- Cost Analysis: Measure CloudWatch API calls, data transfer costs
- Docker Compose: Local stack with LocalStack for AWS services
- Mock Services: HTTP servers returning various status codes and delays
- Configuration Validation: CLI tool for validating YAML configurations
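The core of such a validation tool could be a function that takes the already-parsed configuration and returns a list of problems. The sketch below is illustrative only: `validateConfig` is a hypothetical name, YAML parsing is left to whatever parser the project uses, and the specific rules (non-empty unique names, http(s) URLs, positive timeouts, status codes between 100 and 599) are assumptions about what "valid" should mean.

```typescript
// Returns human-readable problems; an empty array means the config passed.
function validateConfig(config: unknown): string[] {
  if (typeof config !== "object" || config === null) {
    return ["config must be an object"];
  }
  const targets = (config as { targets?: unknown }).targets;
  if (!Array.isArray(targets) || targets.length === 0) {
    return ["targets must be a non-empty list"];
  }

  const errors: string[] = [];
  const seenNames = new Set<string>();

  targets.forEach((raw: unknown, i: number) => {
    const label = `targets[${i}]`;
    if (typeof raw !== "object" || raw === null) {
      errors.push(`${label} must be an object`);
      return;
    }
    const target = raw as Record<string, unknown>;
    const name = target["name"];
    const url = target["url"];
    const timeout = target["timeout"];
    const codes = target["expectedStatusCodes"];

    if (typeof name !== "string" || name === "") {
      errors.push(`${label}.name is required`);
    } else if (seenNames.has(name)) {
      errors.push(`${label}.name "${name}" is duplicated`);
    } else {
      seenNames.add(name);
    }

    if (typeof url !== "string" || !/^https?:\/\//.test(url)) {
      errors.push(`${label}.url must be an http(s) URL`);
    }

    if (timeout !== undefined && (typeof timeout !== "number" || timeout <= 0)) {
      errors.push(`${label}.timeout must be a positive number`);
    }

    if (codes !== undefined) {
      const valid =
        Array.isArray(codes) &&
        codes.every((c) => typeof c === "number" && Number.isInteger(c) && c >= 100 && c <= 599);
      if (!valid) {
        errors.push(`${label}.expectedStatusCodes must be a list of HTTP status codes`);
      }
    }
  });
  return errors;
}
```

A CLI wrapper would print each message and exit with a non-zero status when the list is not empty, which makes the check usable in CI before a configuration is uploaded to S3.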
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::monitoring-config-bucket/*"
},
{
"Effect": "Allow",
"Action": [
"secretsmanager:GetSecretValue"
],
"Resource": "arn:aws:secretsmanager:*:*:secret:monitoring/*"
},
{
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricData"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"cloudwatch:namespace": "ExternalMonitor"
}
}
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "arn:aws:logs:*:*:log-group:/ecs/external-monitor:*"
}
]
}
- VPC Configuration: ECS tasks run in private subnets with NAT Gateway for outbound access
- Security Groups: Restrictive inbound rules, outbound HTTPS only
- Secrets Encryption: All secrets encrypted at rest and in transit
- Fargate Spot: Use Spot instances for non-critical monitoring (up to 70% cost savings)
- Task Sizing: Right-sized CPU/memory allocation based on actual usage
- Connection Pooling: Reuse HTTP connections to reduce overhead
- Metric Batching: Send metrics in batches to reduce API calls
- Custom Metrics: Use high-resolution metrics only when necessary
- Log Retention: Configurable retention periods (default: 30 days)
- Horizontal Scaling: Multiple smaller tasks instead of single large task
- Target Grouping: Batch multiple targets per task execution
- Adaptive Scheduling: Adjust frequency based on target criticality
- Budget Alerts: CloudWatch billing alarms for cost overruns
- Usage Metrics: Track API calls, data transfer, and compute costs
- Cost Attribution: Tag resources for cost allocation by project/team
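One way the adaptive scheduling idea above could work with a fixed 1-minute schedule is for each task run to check only the targets whose `interval` is due. The sketch below is an assumption-heavy illustration: `isDue` and `dueTargets` are hypothetical helpers, `interval` is taken to be in seconds, and runs are assumed to align with wall-clock minutes so that no state needs to be stored between runs.

```typescript
// Only the fields needed to decide whether a target is due.
interface ScheduledTarget { name: string; interval: number; } // interval in seconds (assumed)

const TICK_SECONDS = 60; // the task is assumed to start once per minute

// A target is due when the current minute index is a multiple of its interval
// expressed in ticks. Intervals shorter than one tick are checked every run.
function isDue(target: ScheduledTarget, now: Date): boolean {
  const ticksPerCheck = Math.max(1, Math.round(target.interval / TICK_SECONDS));
  const currentTick = Math.floor(now.getTime() / 1000 / TICK_SECONDS);
  return currentTick % ticksPerCheck === 0;
}

// Filters the configured targets down to those that should be checked in this run.
function dueTargets<T extends ScheduledTarget>(targets: T[], now: Date): T[] {
  return targets.filter((target) => isDue(target, now));
}
```

Critical targets would keep `interval: 60` and be checked on every run, while less critical ones with, say, `interval: 300` would be checked on every fifth run, which reduces outbound requests and published metrics without changing the schedule itself.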