@adamseoul
Created February 2, 2026 20:48
Brand Radar AWS Architecture - Fargate, DynamoDB, Auto-scaling

Brand Radar - AWS Architecture

Overview

Brand Radar is deployed on AWS using a serverless/container architecture designed for scalability and reliability. The system handles long-running AI analysis (up to 10 minutes) through an async job queue pattern.

Reliability & Self-Healing

This architecture is designed to recover without manual intervention. Unlike the previous EC2/Nginx setup, which required reboots and manual monitoring, recovery and scaling in this infrastructure are automated:

Frontend (S3 + CloudFront)

  • S3: Designed for 99.999999999% (11 nines) durability, so data loss is extremely unlikely
  • CloudFront: Global CDN with automatic failover between edge locations
  • No servers to reboot - just static file serving

Backend (Fargate)

  • Health checks: ALB checks /providers every 30 seconds
  • Auto-replacement: If a task becomes unhealthy, ECS automatically:
    1. Drains connections from the failing task
    2. Launches a new task (~30 seconds to start)
    3. Routes traffic to the healthy task
  • No manual intervention required - ECS handles all recovery automatically

Auto-Scaling

  • Trigger: CPU utilization > 70%
  • Scale out: New task launches in ~30 seconds
  • Scale in: Cooldown of 5 minutes before removing tasks
  • Capacity: 1-40 tasks (the maximum can be raised, subject to account quotas)

Job Queue (DynamoDB)

  • Serverless: No servers to manage
  • Auto-scaling: On-demand capacity adjusts to traffic automatically, within account and table quotas
  • Multi-AZ: Data replicated across availability zones
  • TTL: Jobs expire 1 hour after creation and are then deleted in the background (self-cleaning)

What This Means

Previous (EC2) | Current (Fargate)
Server could crash and need a manual reboot | Tasks auto-replaced in ~30 seconds
Memory leaks required restarts | Tasks that run out of memory are stopped and replaced automatically
Manual scaling | Auto-scales based on load
Single point of failure | Multiple tasks across AZs when scaled out
SSH access needed for debugging | CloudWatch logs, no SSH needed

Architecture Diagram

                                    ┌─────────────────────────────────────┐
                                    │           USERS                      │
                                    └─────────────────┬───────────────────┘
                                                      │
                                                      ▼
                                    ┌─────────────────────────────────────┐
                                    │         CLOUDFRONT                   │
                                    │    dmkh1ti2sgxc5.cloudfront.net     │
                                    │    (SSL termination, CDN caching)    │
                                    └───────────┬─────────────┬───────────┘
                                                │             │
                               /assets/*        │             │  /api/*
                               /index.html      │             │
                                                ▼             ▼
                    ┌───────────────────────────┐   ┌─────────────────────────────┐
                    │           S3              │   │           ALB               │
                    │  brandradar-frontend-prod │   │    brandradar-alb           │
                    │  (Static React frontend)  │   │    (10 min idle timeout)    │
                    └───────────────────────────┘   └──────────────┬──────────────┘
                                                                   │
                                                                   ▼
                                                   ┌─────────────────────────────┐
                                                   │      ECS FARGATE            │
                                                   │   brandradar-cluster        │
                                                   │   (Node.js/Express API)     │
                                                   │   - Auto-scales 1-N tasks   │
                                                   │   - 512 CPU, 1GB RAM each   │
                                                   └──────────────┬──────────────┘
                                                                  │
                                      ┌───────────────────────────┼───────────────────────────┐
                                      │                           │                           │
                                      ▼                           ▼                           ▼
                          ┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
                          │     DynamoDB      │     │   CloudWatch      │     │   External APIs   │
                          │  brandradar-jobs  │     │      Logs         │     │  - Gemini AI      │
                          │  (Job queue)      │     │   /ecs/brandradar │     │  - Perplexity     │
                          │  TTL: 1 hour      │     │                   │     │  - Reddit         │
                          └───────────────────┘     └───────────────────┘     │  - GDELT News     │
                                                                              │  - Apify (Amazon) │
                                                                              └───────────────────┘

Components

CloudFront Distribution (E31I20YH8FYL81)

  • Domain: dmkh1ti2sgxc5.cloudfront.net
  • SSL: AWS-managed certificate
  • Origin Timeout: 60 seconds (default max without quota increase)
  • Cache Behaviors:
    • /api/* → ALB origin (no caching)
    • Default → S3 origin (cached)
  • Function: Strips /api prefix before forwarding to ALB
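
The prefix rewrite can be illustrated with a small shell function. This is only a sketch of the path mapping: the real logic runs as a CloudFront Function (JavaScript), and the helper name strip_api_prefix is hypothetical.

```shell
# Map a viewer-facing path to the path the ALB receives.
# Only /api and /api/* are rewritten; everything else passes through.
strip_api_prefix() {
  case "$1" in
    /api)   echo "/" ;;
    /api/*) echo "${1#/api}" ;;
    *)      echo "$1" ;;
  esac
}

# Example: strip_api_prefix /api/analyze/async   prints /analyze/async
```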

S3 Bucket (brandradar-frontend-prod)

  • Purpose: Static React frontend hosting
  • Region: us-east-2
  • Contents: Built React app (index.html, JS, CSS, assets)

Application Load Balancer (brandradar-alb)

  • Region: us-east-2
  • Idle Timeout: 600 seconds (10 minutes)
  • Target Group: brandradar-tg (health check: GET /providers)
  • Security Group: Allows inbound 80 from CloudFront
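
To see what the ALB currently thinks of each task, something like the following could be used. It is a sketch that assumes AWS CLI credentials with read access to Elastic Load Balancing; the function name target_health is hypothetical, while the target group name and region come from this document.

```shell
# List the health state of each target registered with brandradar-tg.
target_health() {
  local region="us-east-2" tg_arn
  # Resolve the target group ARN from its name.
  tg_arn="$(aws elbv2 describe-target-groups \
    --names brandradar-tg \
    --query 'TargetGroups[0].TargetGroupArn' \
    --output text --region "$region")" || return 1
  # Print one "<target id> <state>" pair per line (healthy, unhealthy, draining, ...).
  aws elbv2 describe-target-health \
    --target-group-arn "$tg_arn" \
    --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' \
    --output text --region "$region"
}
```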

ECS Fargate Cluster (brandradar-cluster)

  • Service: brandradar-api
  • Task Definition: brandradar-api:2
  • Resources: 512 CPU, 1024 MB RAM per task
  • Desired Count: 1 (auto-scales based on demand)
  • Task Role: brandradar-task-role (DynamoDB access)
  • Execution Role: ecsTaskExecutionRole (ECR, CloudWatch)

DynamoDB Table (brandradar-jobs)

  • Purpose: Async job queue for long-running analyses
  • Primary Key: jobId (String)
  • TTL: Enabled on ttl attribute (items expire 1 hour after creation; DynamoDB deletes expired items in the background, so deletion can lag expiry)
  • Billing: On-demand (pay per request)
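
The one-hour TTL comes down to writing an epoch-seconds timestamp into the ttl attribute when a job is created. The API presumably does this through the AWS SDK; the CLI sketch below shows the same idea. The status attribute and the helper names are assumptions, while jobId and ttl come from the table description above.

```shell
JOB_TTL_SECONDS=3600

# Epoch seconds at which a job created at time $1 (default: now) expires.
job_expiry_epoch() {
  local created="${1:-$(date +%s)}"
  echo $(( created + JOB_TTL_SECONDS ))
}

# Write a pending job. DynamoDB TTL expects a Number attribute in epoch seconds.
create_job() {
  local job_id="$1"
  aws dynamodb put-item \
    --table-name brandradar-jobs \
    --item "{\"jobId\":{\"S\":\"${job_id}\"},\"status\":{\"S\":\"pending\"},\"ttl\":{\"N\":\"$(job_expiry_epoch)\"}}" \
    --region us-east-2
}
```

Because expired items may linger briefly before DynamoDB removes them, a reader that cares about exact expiry would compare ttl against the current time itself.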

ECR Repository (brandradar)

  • Purpose: Docker image storage
  • Region: us-east-2
  • Image: Node.js 20 + tsx + server code

Async Job Flow

The async pattern was implemented to handle AI analyses that can take up to 10 minutes (especially with Amazon data).

Flow Diagram

Frontend                    CloudFront/ALB                Fargate                    DynamoDB
   │                              │                          │                          │
   │ POST /api/analyze/async      │                          │                          │
   │─────────────────────────────>│                          │                          │
   │                              │ POST /analyze/async      │                          │
   │                              │─────────────────────────>│                          │
   │                              │                          │ Create job               │
   │                              │                          │─────────────────────────>│
   │                              │                          │                          │
   │                              │      {job_id: "xxx"}     │                          │
   │<─────────────────────────────│<─────────────────────────│                          │
   │                              │                          │                          │
   │ GET /api/analyze/status/xxx  │                          │ (background processing)  │
   │─────────────────────────────>│                          │──────────────────────────│
   │                              │─────────────────────────>│ Update progress          │
   │                              │                          │─────────────────────────>│
   │                              │      {progress: 50%}     │                          │
   │<─────────────────────────────│<─────────────────────────│                          │
   │                              │                          │                          │
   │ (repeat polling every 4s)    │                          │                          │
   │                              │                          │                          │
   │ GET /api/analyze/status/xxx  │                          │ Get job status           │
   │─────────────────────────────>│─────────────────────────>│<─────────────────────────│
   │                              │   {status: "completed"}  │                          │
   │<─────────────────────────────│<─────────────────────────│                          │
   │                              │                          │                          │
   │ GET /api/analyze/results/xxx │                          │ Get job result           │
   │─────────────────────────────>│─────────────────────────>│<─────────────────────────│
   │                              │      {analysis data}     │                          │
   │<─────────────────────────────│<─────────────────────────│                          │
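
The client side of this flow can be sketched in shell. Endpoint paths, the X-API-Key header, the job_id and status fields, and the 4-second interval come from this document; the helper names, MAX_POLLS, and the sed-based JSON parsing are assumptions made for illustration.

```shell
API_BASE="${API_BASE:-https://dmkh1ti2sgxc5.cloudfront.net/api}"
POLL_SECONDS="${POLL_SECONDS:-4}"
MAX_POLLS="${MAX_POLLS:-150}"   # 150 polls x 4 s = 10 minutes

# Extract a string field from a flat JSON object (jq is sturdier if available).
json_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

# POST the caller-supplied JSON body and print the new job id.
submit_job() {
  curl -sS -X POST "${API_BASE}/analyze/async" \
    -H "X-API-Key: ${BRANDRADAR_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$1" | json_field job_id
}

# GET the status document for one job.
fetch_status() {
  curl -sS -H "X-API-Key: ${BRANDRADAR_API_KEY}" "${API_BASE}/analyze/status/$1"
}

# Poll until the job is completed (exit 0), failed (exit 1), or we give up (exit 2).
wait_for_job() {
  local job_id="$1" status i
  for (( i = 0; i < MAX_POLLS; i++ )); do
    status="$(fetch_status "$job_id" | json_field status)"
    case "$status" in
      completed) echo "$status"; return 0 ;;
      failed)    echo "$status"; return 1 ;;
    esac
    sleep "$POLL_SECONDS"
  done
  echo "timeout"; return 2
}

# Usage: job_id="$(submit_job "$body")" && wait_for_job "$job_id"
# then GET ${API_BASE}/analyze/results/${job_id} for the analysis data.
```

Each request returns quickly, so no single call comes near the 60-second CloudFront origin timeout even when the analysis itself runs for minutes.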

API Endpoints

Endpoint | Method | Purpose
/api/analyze/async | POST | Start new analysis job
/api/analyze/status/:jobId | GET | Get job status + progress
/api/analyze/results/:jobId | GET | Get completed job results
/api/analyze | POST | Synchronous analysis (legacy)
/api/providers | GET | Health check + provider status

Job States

Status | Progress | Description
pending | 0% | Job created, waiting to process
processing | 5-95% | Analysis in progress
completed | 100% | Analysis complete, results available
failed | 0% | Analysis failed, error available

Progress Stages

Progress | Stage
10% | Gathering brand data...
15% | Fetching Reddit discussions...
40% | Analyzing brand signals...
50% | Running AI analysis...
85% | Compiling results...
95% | Finalizing report...
100% | Analysis complete!

Security

API Authentication

  • Header: X-API-Key: <api-key>
  • Key: Stored in Fargate environment variable BRANDRADAR_API_KEY

CORS

  • Locked to specific origins:
    • https://brandcommerceradar.ai
    • https://www.brandcommerceradar.ai
    • https://dmkh1ti2sgxc5.cloudfront.net
    • http://localhost:5173 (dev)

Rate Limiting

  • General endpoints: 30 requests/minute/IP
  • AI endpoints: 10 requests/minute/IP

IAM Roles

  • ecsTaskExecutionRole: ECR pull, CloudWatch logs
  • brandradar-task-role: DynamoDB read/write to brandradar-jobs table

Monitoring

CloudWatch Alarms

Alarm | Metric | Threshold | SNS Topic
BrandRadar-CloudFront-5xx-Errors | 5xxErrorRate | >5% | brandradar-cloudfront-alerts (us-east-1)
BrandRadar-ALB-5xx-Errors | HTTPCode_ELB_5XX_Count | >5/5min | brandradar-alerts (us-east-2)
BrandRadar-504-Timeouts | HTTPCode_Target_5XX_Count | >3/5min | brandradar-alerts (us-east-2)
BrandRadar-Unhealthy-Targets | UnHealthyHostCount | >=1 | brandradar-alerts (us-east-2)
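
A quick way to check whether an alarm is firing is sketched below. It assumes each alarm lives in the same region as its SNS topic (the CloudFront alarm in us-east-1, the rest in us-east-2); the function name alarm_state is hypothetical.

```shell
# Print the current state (OK, ALARM, or INSUFFICIENT_DATA) of one alarm.
alarm_state() {
  local name="$1" region="us-east-2"
  # CloudFront metrics are published in us-east-1.
  case "$name" in
    BrandRadar-CloudFront-*) region="us-east-1" ;;
  esac
  aws cloudwatch describe-alarms \
    --alarm-names "$name" \
    --query 'MetricAlarms[0].StateValue' \
    --output text --region "$region"
}

# Example: alarm_state BrandRadar-Unhealthy-Targets
```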

SNS Topics

  • us-east-1: arn:aws:sns:us-east-1:895778612904:brandradar-cloudfront-alerts
  • us-east-2: arn:aws:sns:us-east-2:895778612904:brandradar-alerts

To subscribe to alerts:

aws sns subscribe --topic-arn arn:aws:sns:us-east-2:895778612904:brandradar-alerts \
  --protocol email --notification-endpoint your@email.com --region us-east-2

Logs

  • Location: CloudWatch Logs /ecs/brandradar-api
  • Region: us-east-2
  • Retention: Default (logs are kept indefinitely unless a retention policy is set)
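
Since debugging now goes through CloudWatch instead of SSH, the helpers below sketch the two most likely log operations. The tail command needs AWS CLI v2, and the 30-day retention value is illustrative rather than a setting taken from this deployment; both function names are hypothetical.

```shell
# Follow recent API logs, starting from a relative time such as 10m or 1h.
tail_api_logs() {
  aws logs tail /ecs/brandradar-api --since "${1:-10m}" --follow --region us-east-2
}

# Optionally cap retention instead of keeping logs forever.
set_log_retention() {
  aws logs put-retention-policy \
    --log-group-name /ecs/brandradar-api \
    --retention-in-days "${1:-30}" \
    --region us-east-2
}
```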

Deployment

Deploy Backend (Fargate)

# Build Docker image
cd BC-APP-AI-LOCAL
docker build -t brandradar:latest -f Dockerfile .

# Login to ECR
aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin 895778612904.dkr.ecr.us-east-2.amazonaws.com

# Push to ECR
docker tag brandradar:latest 895778612904.dkr.ecr.us-east-2.amazonaws.com/brandradar:latest
docker push 895778612904.dkr.ecr.us-east-2.amazonaws.com/brandradar:latest

# Update task definition (if changed)
aws ecs register-task-definition --cli-input-json file://fargate/task-definition.json --region us-east-2

# Deploy
aws ecs update-service --cluster brandradar-cluster --service brandradar-api --force-new-deployment --region us-east-2
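
The update-service call returns immediately, before the new tasks are healthy. A follow-up step along these lines could confirm the rollout; it is a sketch, and the function name wait_for_deploy is hypothetical.

```shell
# Block until the rolling deployment settles, then print its status.
# `aws ecs wait services-stable` polls every 15 seconds, up to 40 attempts.
wait_for_deploy() {
  aws ecs wait services-stable \
    --cluster brandradar-cluster --services brandradar-api --region us-east-2 || return 1
  aws ecs describe-services \
    --cluster brandradar-cluster --services brandradar-api \
    --query 'services[0].deployments[].[status,rolloutState,runningCount]' \
    --output text --region us-east-2
}
```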

Deploy Frontend (S3 + CloudFront)

# Build frontend
cd web
npm run build

# Upload to S3
aws s3 sync dist s3://brandradar-frontend-prod --delete --region us-east-2

# Invalidate CloudFront cache
aws cloudfront create-invalidation --distribution-id E31I20YH8FYL81 --paths "/*"

Environment Variables

Fargate Task

Variable | Description
NODE_ENV | production
PORT | 4000
AWS_REGION | us-east-2
GEMINI_API_KEY | Google Gemini API key
PERPLEXITY_API_KEY | Perplexity AI API key
APIFY_API_KEY | Apify API key (Amazon scraping)
BRANDRADAR_API_KEY | Backend API authentication key

Scaling

Current Configuration (Auto-Scaling ENABLED)

  • Fargate: 1-40 tasks, auto-scales on CPU
  • DynamoDB: On-demand (auto-scales automatically)
  • ALB: Distributes load across all healthy Fargate tasks

Auto-Scaling Policy (Active)

Setting | Value
Metric | ECSServiceAverageCPUUtilization
Target | 70%
Min Capacity | 1 task
Max Capacity | 40 tasks
Scale Out Cooldown | 60 seconds
Scale In Cooldown | 300 seconds (5 min)
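
For reference, a target-tracking policy with these values could be expressed as below. This is a sketch of an equivalent configuration, not a dump of the deployed policy; the policy name and function names are placeholders. Min and max capacity are set separately through register-scalable-target.

```shell
# Print a target-tracking configuration matching the settings above.
scaling_policy_json() {
  cat <<'EOF'
{
  "TargetValue": 70.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleOutCooldown": 60,
  "ScaleInCooldown": 300
}
EOF
}

# Attach the configuration to the ECS service.
apply_scaling_policy() {
  aws application-autoscaling put-scaling-policy \
    --service-namespace ecs \
    --resource-id service/brandradar-cluster/brandradar-api \
    --scalable-dimension ecs:service:DesiredCount \
    --policy-name brandradar-cpu-target-tracking \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration "$(scaling_policy_json)" \
    --region us-east-2
}
```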

Capacity Estimates (Concurrent Analyses)

Each Fargate task (512 CPU, 1GB RAM) can handle approximately 2-3 concurrent brand analyses. Most of the analysis time is spent waiting for AI API responses (I/O), so Node.js can juggle multiple requests efficiently.

Tasks | Concurrent Analyses | Typical Use Case
1 | 2-3 | Normal usage, demos
2 | 4-6 | Small team reviewing brands
5 | 10-15 | Busy period, multiple teams
10 | 20-30 | High traffic, enterprise
20 | 40-60 | Large campaign
40 | 80-120 | Maximum configured capacity
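
The table is just the 2-3 analyses per task estimate multiplied out. A small sketch of that arithmetic, with hypothetical helper names, makes it easy to size other task counts:

```shell
# Concurrent-analysis range for N tasks, using the 2-3 per task estimate.
capacity_range() {
  local tasks="$1"
  echo "$(( tasks * 2 ))-$(( tasks * 3 ))"
}

# Tasks needed for a given number of concurrent analyses, sized on the
# conservative figure of 2 per task and rounded up.
tasks_needed() {
  local analyses="$1"
  echo $(( (analyses + 1) / 2 ))
}

# Example: capacity_range 10   prints 20-30
```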

Important: What "Concurrent Analyses" Means

This is NOT the same as website visitors.

  • Website visitors: The frontend is static files on S3 + CloudFront (a global CDN). It can handle millions of visitors without any impact. Someone browsing the site, reading results, or filling out the form costs essentially nothing.

  • Concurrent analyses: This counts analyses that are currently running. Each analysis takes approximately 1 minute to complete. Because they're not instant, analyses can build up.

How analyses build up: If 5 people click "Run Analysis" over a 2-minute window, you could have 5 analyses running at once - not because they clicked simultaneously, but because each one takes about a minute to finish.

Email campaign scenario: You send an email blast to 200 people. Over 10 minutes, 30 of them visit the site and click "Run Analysis". With ~1 minute per analysis, that averages about 3 running at once if the clicks are spread evenly, but clicks usually cluster right after the email lands, so you might briefly have 10-15 running. The system auto-scales to handle this.
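
The expected load in a scenario like this can be estimated with Little's law: average concurrency equals arrival rate times duration. The sketch below uses a hypothetical helper name and assumes arrivals are spread evenly over the window, so real bursts will run higher than the figure it prints.

```shell
# Average number of analyses running at once.
# Arguments: number of analyses, window in minutes, minutes per analysis.
avg_concurrency() {
  awk -v n="$1" -v window="$2" -v dur="$3" \
    'BEGIN { printf "%.1f\n", n / window * dur }'
}

# 30 analyses spread over 10 minutes, 1 minute each:
#   avg_concurrency 30 10 1   prints 3.0
# The same 30 analyses squeezed into 2 minutes:
#   avg_concurrency 30 2 1    prints 15.0
```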

Why Max Capacity is Set to 40 Tasks

The 40-task limit (80-120 concurrent analyses) provides safety overhead. It costs nothing to have this headroom - you only pay for tasks that actually run.

This protects against Murphy's Law scenarios: you send out an email, 20+ people click through and run analyses within minutes. The system scales up automatically, handles the load, then scales back down when traffic subsides.

Cost Model - Pay Only For What You Use

Important: Setting max capacity to 40 tasks does NOT cost anything by itself. You only pay for tasks that are actually running.

  • Idle/Low traffic: 1 task runs = ~$15/month
  • Moderate traffic: 2-3 tasks = ~$30-45/month
  • High traffic: Scales up as needed, scales back down automatically

The system starts with 1 task and only launches more when CPU exceeds 70%. When traffic drops, extra tasks are terminated after 5 minutes. You're never paying for capacity you're not using.

Note: These are concurrent analyses running simultaneously. Users browsing the site, viewing results, or filling out forms don't consume significant resources - only the actual "Run Analysis" action does.

How it works:

  1. Single task handles normal load (2-3 concurrent analyses)
  2. If CPU exceeds 70%, a new task launches (~30 seconds)
  3. ALB automatically routes traffic to both tasks
  4. DynamoDB job queue works across all tasks (shared state)
  5. When load drops, tasks scale back down after 5 minutes

Pre-Warming for Expected High Traffic

If you know you'll have a big demo or email campaign, you can pre-warm by increasing the minimum:

# Before the event: ensure at least 5 tasks are ready
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/brandradar-cluster/brandradar-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 5 \
  --max-capacity 40 \
  --region us-east-2

# After the event: scale back to normal
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/brandradar-cluster/brandradar-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 1 \
  --max-capacity 40 \
  --region us-east-2
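
After changing the minimum, it may be worth confirming what is actually registered. A sketch, with a hypothetical function name:

```shell
# Print the min and max capacity currently registered for the service.
scaling_bounds() {
  aws application-autoscaling describe-scalable-targets \
    --service-namespace ecs \
    --resource-ids service/brandradar-cluster/brandradar-api \
    --query 'ScalableTargets[0].[MinCapacity,MaxCapacity]' \
    --output text --region us-east-2
}
```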

Costs (Estimated Monthly)

Base Cost (Normal Traffic - 1 Task)

Service | Usage | Cost
Fargate | 1 task x 730 hours | ~$15
ALB | 730 hours + LCU | ~$20
CloudFront | 100 GB transfer | ~$10
S3 | <1 GB storage | ~$0.10
DynamoDB | On-demand | ~$0.25
CloudWatch | Logs + Alarms | ~$5
Total | | ~$50/month

Scaling Costs

Fargate charges ~$15/month per task. You only pay for tasks while they're running:

Scenario | Tasks Running | Additional Cost
Normal traffic | 1 task | $0 (base cost)
Moderate spike | 2-3 tasks for a few hours | under $1
Heavy traffic all month | 5 tasks | ~$60 extra
Sustained heavy load all month | 20 tasks | ~$285 extra
Maximum configured capacity all month | 40 tasks | ~$585 extra

Key point: A max capacity of 40 doesn't cost anything until tasks actually launch. The system auto-scales up when needed and back down when traffic drops. Most months you'll pay the base ~$50.
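
The scaling costs follow directly from the ~$15 per task per month estimate. The sketch below reproduces that arithmetic; the helper names are hypothetical, and the rate is this document's estimate rather than a quoted AWS price.

```shell
COST_PER_TASK=15   # USD per task per month (730 hours), per the estimate above

# Extra monthly cost of running N tasks all month, relative to one task.
extra_monthly_cost() {
  local tasks="$1"
  echo $(( (tasks - 1) * COST_PER_TASK ))
}

# Cost of a short spike: extra tasks x hours x hourly rate.
spike_cost() {
  awk -v extra="$1" -v hours="$2" -v rate="$COST_PER_TASK" \
    'BEGIN { printf "%.2f\n", extra * hours * rate / 730 }'
}

# Example: extra_monthly_cost 5   prints 60
# Example: spike_cost 2 4         prints 0.16 (2 extra tasks for 4 hours)
```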

DNS Migration (Future)

When ready to point brandcommerceradar.ai to CloudFront:

  1. Create ACM certificate for brandcommerceradar.ai in us-east-1
  2. Validate certificate via DNS
  3. Add certificate to CloudFront distribution
  4. Add brandcommerceradar.ai as alternate domain name
  5. Update DNS to point to CloudFront: dmkh1ti2sgxc5.cloudfront.net
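
Steps 1 and 2 could look roughly like this with the AWS CLI. It is a sketch: including the www name is an assumption based on the CORS origin list, and both function names are hypothetical. CloudFront only accepts ACM certificates issued in us-east-1, which is why the region differs from the rest of the stack.

```shell
# Step 1: request a DNS-validated certificate; prints the certificate ARN.
request_cert() {
  aws acm request-certificate \
    --domain-name brandcommerceradar.ai \
    --subject-alternative-names www.brandcommerceradar.ai \
    --validation-method DNS \
    --region us-east-1
}

# Step 2: print the CNAME record(s) to create at the DNS provider.
cert_validation_record() {
  aws acm describe-certificate \
    --certificate-arn "$1" \
    --query 'Certificate.DomainValidationOptions[].ResourceRecord.[Name,Type,Value]' \
    --output text --region us-east-1
}
```

One thing to check before step 5: an apex domain generally cannot be a CNAME, so pointing brandcommerceradar.ai itself at CloudFront may require a DNS provider with ALIAS or ANAME support.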

Domain is currently at GoDaddy (ns41.domaincontrol.com).
