Brand Radar - AWS Architecture

Overview

Brand Radar is deployed on AWS using a serverless/container architecture designed for scalability and reliability. The system handles long-running AI analysis (up to 10 minutes) through an async job queue pattern.
Reliability & Self-Healing

This architecture is designed to recover without manual intervention. Unlike the previous EC2/Nginx setup, which required reboots and manual monitoring, this infrastructure is fully automated:
Frontend (S3 + CloudFront)


S3: Designed for 99.999999999% (11 nines) durability - data loss is extremely unlikely
CloudFront: Global CDN with automatic failover between edge locations
No servers to reboot - just static file serving

Backend (Fargate)


Health checks: ALB checks /providers every 30 seconds
Auto-replacement: If a task becomes unhealthy, ECS automatically drains connections from the failing task, launches a new task (~30 seconds to start), and routes traffic to the healthy task
No manual intervention required - ECS handles all recovery automatically

Auto-Scaling


Trigger: CPU utilization > 70%
Scale out: New task launches in ~30 seconds
Scale in: Cooldown of 5 minutes before removing tasks
Capacity: 1-40 tasks (configurable up to 100+)

Job Queue (DynamoDB)


Serverless: No servers to manage
Auto-scaling: On-demand capacity follows request volume, with no capacity planning needed
Multi-AZ: Data replicated across availability zones
TTL: Jobs expire after 1 hour and are then deleted automatically (self-cleaning)

What This Means


| Previous (EC2) | Current (Fargate) |
| --- | --- |
| Server could crash and needed a manual reboot | Unhealthy tasks auto-replaced in ~30 seconds |
| Memory leaks required restarts | A task that runs out of memory is stopped and replaced with a fresh container |
| Manual scaling | Auto-scales based on load |
| Single point of failure | Multiple tasks across AZs |
| SSH access needed for debugging | CloudWatch logs, no SSH needed |


Architecture Diagram

                                    ┌─────────────────────────────────────┐
                                    │           USERS                      │
                                    └─────────────────┬───────────────────┘
                                                      │
                                                      ▼
                                    ┌─────────────────────────────────────┐
                                    │         CLOUDFRONT                   │
                                    │    dmkh1ti2sgxc5.cloudfront.net     │
                                    │    (SSL termination, CDN caching)    │
                                    └───────────┬─────────────┬───────────┘
                                                │             │
                               /assets/*        │             │  /api/*
                               /index.html      │             │
                                                ▼             ▼
                    ┌───────────────────────────┐   ┌─────────────────────────────┐
                    │           S3              │   │           ALB               │
                    │  brandradar-frontend-prod │   │    brandradar-alb           │
                    │  (Static React frontend)  │   │    (10 min idle timeout)    │
                    └───────────────────────────┘   └──────────────┬──────────────┘
                                                                   │
                                                                   ▼
                                                   ┌─────────────────────────────┐
                                                   │      ECS FARGATE            │
                                                   │   brandradar-cluster        │
                                                   │   (Node.js/Express API)     │
                                                   │   - Auto-scales 1-N tasks   │
                                                   │   - 512 CPU, 1GB RAM each   │
                                                   └──────────────┬──────────────┘
                                                                  │
                                      ┌───────────────────────────┼───────────────────────────┐
                                      │                           │                           │
                                      ▼                           ▼                           ▼
                          ┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
                          │     DynamoDB      │     │   CloudWatch      │     │   External APIs   │
                          │  brandradar-jobs  │     │      Logs         │     │  - Gemini AI      │
                          │  (Job queue)      │     │   /ecs/brandradar │     │  - Perplexity     │
                          │  TTL: 1 hour      │     │                   │     │  - Reddit         │
                          └───────────────────┘     └───────────────────┘     │  - GDELT News     │
                                                                              │  - Apify (Amazon) │
                                                                              └───────────────────┘

Components

CloudFront Distribution (E31I20YH8FYL81)


Domain: dmkh1ti2sgxc5.cloudfront.net
SSL: AWS-managed certificate
Origin Timeout: 60 seconds (the maximum without a quota increase)
Cache Behaviors:

/api/* → ALB origin (no caching)
Default → S3 origin (cached)


Function: Strips /api prefix before forwarding to ALB
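
The prefix rewrite described above can be sketched as a small pure function. This is an illustration only: the actual CloudFront Function is not shown in this document, and the name `stripApiPrefix` is hypothetical.

```typescript
// Hypothetical helper mirroring the "/api" prefix rewrite; not the deployed function.
const API_PREFIX = "/api";

function stripApiPrefix(uri: string): string {
  // "/api" on its own maps to the backend root.
  if (uri === API_PREFIX) return "/";
  // Only rewrite when "/api" is a full path segment, so "/apiary" is left alone.
  if (uri.indexOf(API_PREFIX + "/") === 0) return uri.slice(API_PREFIX.length);
  // Anything else (static assets, index.html) is returned unchanged.
  return uri;
}
```

With this rule, `/api/providers` reaches the ALB as `/providers`, which matches the health check path configured on the target group.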

S3 Bucket (brandradar-frontend-prod)


Purpose: Static React frontend hosting
Region: us-east-2
Contents: Built React app (index.html, JS, CSS, assets)

Application Load Balancer (brandradar-alb)


Region: us-east-2
Idle Timeout: 600 seconds (10 minutes)
Target Group: brandradar-tg (health check: GET /providers)
Security Group: Allows inbound port 80 (HTTP) from CloudFront

ECS Fargate Cluster (brandradar-cluster)


Service: brandradar-api
Task Definition: brandradar-api:2
Resources: 512 CPU units (0.5 vCPU), 1024 MB RAM per task
Desired Count: 1 (auto-scales based on demand)
Task Role: brandradar-task-role (DynamoDB access)
Execution Role: ecsTaskExecutionRole (ECR, CloudWatch)

DynamoDB Table (brandradar-jobs)


Purpose: Async job queue for long-running analyses
Primary Key: jobId (String)
TTL: Enabled on ttl attribute (jobs expire 1 hour after creation; DynamoDB deletes expired items in the background, so removal can lag the expiry time)
Billing: On-demand (pay per request)
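
DynamoDB TTL expects the attribute to hold a Unix timestamp in seconds, stored as a Number. A minimal sketch of how a job item might set it follows. The item shape and the helper names (`newJobItem`, `isExpired`) are assumptions for illustration; only `jobId`, the `ttl` attribute, and the 1-hour lifetime come from this document.

```typescript
// Assumed item shape; the real table may store more fields.
interface JobItem {
  jobId: string;
  status: "pending";
  progress: number;
  ttl: number; // Unix epoch seconds, as DynamoDB TTL requires
}

const JOB_LIFETIME_SECONDS = 60 * 60; // 1 hour

function ttlEpochSeconds(nowMs: number, lifetimeSeconds: number = JOB_LIFETIME_SECONDS): number {
  // Date.now() is in milliseconds; TTL must be in whole seconds.
  return Math.floor(nowMs / 1000) + lifetimeSeconds;
}

function newJobItem(jobId: string, nowMs: number): JobItem {
  return { jobId, status: "pending", progress: 0, ttl: ttlEpochSeconds(nowMs) };
}

// Expired items can linger until the background sweep removes them,
// so readers may want to treat them as gone.
function isExpired(item: JobItem, nowMs: number): boolean {
  return item.ttl <= Math.floor(nowMs / 1000);
}
```

Checking `isExpired` on read is optional, but it keeps the API from returning a job that is past its lifetime and not yet swept.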

ECR Repository (brandradar)


Purpose: Docker image storage
Region: us-east-2
Image: Node.js 20 + tsx + server code

Async Job Flow

The async pattern was implemented to handle AI analyses that can take up to 10 minutes (especially with Amazon data).
Flow Diagram

Frontend                    CloudFront/ALB                Fargate                    DynamoDB
   │                              │                          │                          │
   │ POST /api/analyze/async      │                          │                          │
   │─────────────────────────────>│                          │                          │
   │                              │ POST /analyze/async      │                          │
   │                              │─────────────────────────>│                          │
   │                              │                          │ Create job               │
   │                              │                          │─────────────────────────>│
   │                              │                          │                          │
   │                              │      {job_id: "xxx"}     │                          │
   │<─────────────────────────────│<─────────────────────────│                          │
   │                              │                          │                          │
   │ GET /api/analyze/status/xxx  │                          │ (background processing)  │
   │─────────────────────────────>│                          │──────────────────────────│
   │                              │─────────────────────────>│ Update progress          │
   │                              │                          │─────────────────────────>│
   │                              │      {progress: 50%}     │                          │
   │<─────────────────────────────│<─────────────────────────│                          │
   │                              │                          │                          │
   │ (repeat polling every 4s)    │                          │                          │
   │                              │                          │                          │
   │ GET /api/analyze/status/xxx  │                          │ Get job status           │
   │─────────────────────────────>│─────────────────────────>│<─────────────────────────│
   │                              │   {status: "completed"}  │                          │
   │<─────────────────────────────│<─────────────────────────│                          │
   │                              │                          │                          │
   │ GET /api/analyze/results/xxx │                          │ Get job result           │
   │─────────────────────────────>│─────────────────────────>│<─────────────────────────│
   │                              │      {analysis data}     │                          │
   │<─────────────────────────────│<─────────────────────────│                          │
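
The polling loop in the diagram can be sketched as below. It is a hedged illustration, not the frontend's actual code: the response shape, the `waitForJob` name, and the injected `getStatus`/`sleep` functions are assumptions. The 4-second interval comes from the diagram, and the default cap of 150 polls corresponds to the 10-minute upper bound.

```typescript
type JobStatus = "pending" | "processing" | "completed" | "failed";

// Assumed shape of the status endpoint's JSON body.
interface StatusResponse {
  status: JobStatus;
  progress: number;
}

// Dependencies are injected so the loop can run against a real HTTP client or a fake.
interface PollDeps {
  getStatus: (jobId: string) => Promise<StatusResponse>;
  sleep: (ms: number) => Promise<void>;
}

async function waitForJob(
  jobId: string,
  deps: PollDeps,
  intervalMs: number = 4000,
  maxPolls: number = 150, // 150 polls x 4 s = 10 minutes
): Promise<JobStatus> {
  for (let i = 0; i < maxPolls; i++) {
    const { status } = await deps.getStatus(jobId);
    // Stop on either terminal state; the caller then fetches results or the error.
    if (status === "completed" || status === "failed") return status;
    await deps.sleep(intervalMs);
  }
  throw new Error("Job " + jobId + " did not finish within the polling budget");
}
```

Because each status request returns immediately, no single HTTP request has to outlive the 60-second CloudFront origin timeout.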

API Endpoints


| Endpoint | Method | Purpose |
| --- | --- | --- |
| `/api/analyze/async` | POST | Start new analysis job |
| `/api/analyze/status/:jobId` | GET | Get job status + progress |
| `/api/analyze/results/:jobId` | GET | Get completed job results |
| `/api/analyze` | POST | Synchronous analysis (legacy) |
| `/api/providers` | GET | Health check + provider status |


Job States


| Status | Progress | Description |
| --- | --- | --- |
| `pending` | 0% | Job created, waiting to process |
| `processing` | 5-95% | Analysis in progress |
| `completed` | 100% | Analysis complete, results available |
| `failed` | 0% | Analysis failed, error available |
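
The status/progress pairs above can be encoded as a small consistency check, for example to validate a job record before writing it. This is a sketch: the function names are hypothetical, and the server may enforce these rules differently or not at all.

```typescript
type JobStatus = "pending" | "processing" | "completed" | "failed";

// Returns true when the progress value is one the table allows for that status.
function progressIsValid(status: JobStatus, progress: number): boolean {
  switch (status) {
    case "pending":
      return progress === 0;
    case "processing":
      return progress >= 5 && progress <= 95;
    case "completed":
      return progress === 100;
    case "failed":
      return progress === 0;
  }
}

// Terminal states are the ones a polling client should stop on.
function isTerminal(status: JobStatus): boolean {
  return status === "completed" || status === "failed";
}
```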


Progress Stages


| Progress | Stage |
| --- | --- |
| 10% | Gathering brand data... |
| 15% | Fetching Reddit discussions... |
| 40% | Analyzing brand signals... |
| 50% | Running AI analysis... |
| 85% | Compiling results... |
| 95% | Finalizing report... |
| 100% | Analysis complete! |
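
One way a client could turn a raw progress number into the stage label is a threshold lookup over the stages listed above. This is a sketch under the assumption that a stage stays active until the next threshold is reached. The `stageFor` name is hypothetical, and the API may well return the label itself.

```typescript
// Thresholds and labels copied from the stage list; order matters (ascending).
const STAGES: Array<[number, string]> = [
  [10, "Gathering brand data..."],
  [15, "Fetching Reddit discussions..."],
  [40, "Analyzing brand signals..."],
  [50, "Running AI analysis..."],
  [85, "Compiling results..."],
  [95, "Finalizing report..."],
  [100, "Analysis complete!"],
];

// Returns the label of the highest stage reached, or null before the first stage.
function stageFor(progress: number): string | null {
  let label: string | null = null;
  for (const [threshold, text] of STAGES) {
    if (progress >= threshold) label = text;
  }
  return label;
}
```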


Security

API Authentication


Header: X-API-Key: <api-key>
Key: Stored in Fargate environment variable BRANDRADAR_API_KEY

CORS


Locked to specific origins:

https://brandcommerceradar.ai
https://www.brandcommerceradar.ai
https://dmkh1ti2sgxc5.cloudfront.net
http://localhost:5173 (dev)
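
An origin allowlist like the one above is usually enforced with an exact string match. The sketch below shows that check in isolation; it is an illustration, not the server's actual CORS configuration, and `isAllowedOrigin` is a hypothetical name.

```typescript
// Origins copied from the list above.
const ALLOWED_ORIGINS: string[] = [
  "https://brandcommerceradar.ai",
  "https://www.brandcommerceradar.ai",
  "https://dmkh1ti2sgxc5.cloudfront.net",
  "http://localhost:5173", // dev only
];

// Exact match only: no prefix, suffix, or substring matching,
// so look-alike hosts such as "https://brandcommerceradar.ai.evil.com" are rejected.
function isAllowedOrigin(origin: string): boolean {
  return ALLOWED_ORIGINS.indexOf(origin) !== -1;
}
```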


Rate Limiting


General endpoints: 30 requests/minute/IP
AI endpoints: 10 requests/minute/IP
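
For illustration, a per-IP limit of this kind can be implemented as a fixed-window counter. The sketch below is an assumption about the technique, not the server's real limiter: the document does not say which library or algorithm is used, and the `FixedWindowLimiter` class is hypothetical. State is held in memory, so each Fargate task would count separately.

```typescript
interface WindowState {
  start: number; // window start, in ms
  count: number; // requests seen in this window
}

class FixedWindowLimiter {
  private windows = new Map<string, WindowState>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the IP is over its limit.
  allow(ip: string, nowMs: number): boolean {
    const w = this.windows.get(ip);
    if (w === undefined || nowMs - w.start >= this.windowMs) {
      // First request from this IP, or the previous window has ended.
      this.windows.set(ip, { start: nowMs, count: 1 });
      return true;
    }
    if (w.count < this.limit) {
      w.count += 1;
      return true;
    }
    return false;
  }
}

// The two limits quoted above.
const generalLimiter = new FixedWindowLimiter(30);
const aiLimiter = new FixedWindowLimiter(10);
```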

IAM Roles


ecsTaskExecutionRole: ECR pull, CloudWatch logs
brandradar-task-role: DynamoDB read/write to brandradar-jobs table

Monitoring

CloudWatch Alarms


| Alarm | Metric | Threshold | SNS Topic |
| --- | --- | --- | --- |
| BrandRadar-CloudFront-5xx-Errors | 5xxErrorRate | >5% | brandradar-cloudfront-alerts (us-east-1) |
| BrandRadar-ALB-5xx-Errors | HTTPCode_ELB_5XX_Count | >5/5min | brandradar-alerts (us-east-2) |
| BrandRadar-504-Timeouts | HTTPCode_Target_5XX_Count | >3/5min | brandradar-alerts (us-east-2) |
| BrandRadar-Unhealthy-Targets | UnHealthyHostCount | >=1 | brandradar-alerts (us-east-2) |


SNS Topics


us-east-1: arn:aws:sns:us-east-1:895778612904:brandradar-cloudfront-alerts
us-east-2: arn:aws:sns:us-east-2:895778612904:brandradar-alerts

To subscribe to alerts:
aws sns subscribe --topic-arn arn:aws:sns:us-east-2:895778612904:brandradar-alerts \
  --protocol email --notification-endpoint your@email.com --region us-east-2

Logs


Location: CloudWatch Logs /ecs/brandradar-api
Region: us-east-2
Retention: Default (logs never expire unless a retention period is set)

Deployment

Deploy Backend (Fargate)

# Build Docker image
cd BC-APP-AI-LOCAL
docker build -t brandradar:latest -f Dockerfile .

# Login to ECR
aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin 895778612904.dkr.ecr.us-east-2.amazonaws.com

# Push to ECR
docker tag brandradar:latest 895778612904.dkr.ecr.us-east-2.amazonaws.com/brandradar:latest
docker push 895778612904.dkr.ecr.us-east-2.amazonaws.com/brandradar:latest

# Update task definition (if changed)
aws ecs register-task-definition --cli-input-json file://fargate/task-definition.json --region us-east-2

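# Deploy (forces new tasks that pull the :latest image).
# If you registered a new task definition revision above, also pass
# --task-definition brandradar-api so the service switches to that revision.
aws ecs update-service --cluster brandradar-cluster --service brandradar-api --force-new-deployment --region us-east-2

Deploy Frontend (S3 + CloudFront)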

# Build frontend
cd web
npm run build

# Upload to S3
aws s3 sync dist s3://brandradar-frontend-prod --delete --region us-east-2

# Invalidate CloudFront cache
aws cloudfront create-invalidation --distribution-id E31I20YH8FYL81 --paths "/*"

Environment Variables

Fargate Task


| Variable | Value / Description |
| --- | --- |
| `NODE_ENV` | production |
| `PORT` | 4000 |
| `AWS_REGION` | us-east-2 |
| `GEMINI_API_KEY` | Google Gemini API key |
| `PERPLEXITY_API_KEY` | Perplexity AI API key |
| `APIFY_API_KEY` | Apify API key (Amazon scraping) |
| `BRANDRADAR_API_KEY` | Backend API authentication key |


Scaling

Current Configuration (Auto-Scaling ENABLED)


Fargate: 1-40 tasks, auto-scales on CPU
DynamoDB: On-demand (auto-scales automatically)
ALB: Distributes load across all healthy Fargate tasks

Auto-Scaling Policy (Active)


| Setting | Value |
| --- | --- |
| Metric | ECSServiceAverageCPUUtilization |
| Target | 70% |
| Min Capacity | 1 task |
| Max Capacity | 40 tasks |
| Scale Out Cooldown | 60 seconds |
| Scale In Cooldown | 300 seconds (5 min) |
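
As a rough mental model of target tracking, the desired task count moves in proportion to how far average CPU is from the target, clamped to the configured minimum and maximum. The sketch below is a simplification for reasoning about the policy, not AWS's actual algorithm. It ignores cooldowns and alarm evaluation periods, and `desiredTasks` is a hypothetical name.

```typescript
const TARGET_CPU_PERCENT = 70;
const MIN_TASKS = 1;
const MAX_TASKS = 40;

// Simplified proportional model: keep average CPU near the target.
function desiredTasks(
  currentTasks: number,
  avgCpuPercent: number,
  targetPercent: number = TARGET_CPU_PERCENT,
  minTasks: number = MIN_TASKS,
  maxTasks: number = MAX_TASKS,
): number {
  const proportional = Math.ceil((currentTasks * avgCpuPercent) / targetPercent);
  // Never go below the minimum or above the maximum configured capacity.
  return Math.min(maxTasks, Math.max(minTasks, proportional));
}
```

Under this model, one task at 90% CPU scales out to two, and four tasks averaging 35% could scale in to two once the 5-minute scale-in cooldown has passed.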


Capacity Estimates (Concurrent Analyses)

Each Fargate task (512 CPU, 1GB RAM) can handle approximately 2-3 concurrent brand analyses. Most of the analysis time is spent waiting for AI API responses (I/O), so Node.js can juggle multiple requests efficiently.


| Tasks | Concurrent Analyses | Typical Use Case |
| --- | --- | --- |
| 1 | 2-3 | Normal usage, demos |
| 2 | 4-6 | Small team reviewing brands |
| 5 | 10-15 | Busy period, multiple teams |
| 10 | 20-30 | High traffic, enterprise |
| 20 | 40-60 | Large campaign |
| 40 | 80-120 | Maximum configured capacity |
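
The table follows directly from the 2-3 analyses per task estimate. The sketch below reproduces it and inverts it to estimate how many tasks a given load needs. The function names are hypothetical, and the per-task figures are this document's estimates rather than measured limits.

```typescript
const PER_TASK_LOW = 2; // conservative analyses per task
const PER_TASK_HIGH = 3; // optimistic analyses per task
const MAX_TASKS = 40; // configured maximum capacity

// Concurrent analyses a given number of tasks can handle, as [low, high].
function concurrencyRange(tasks: number): [number, number] {
  return [tasks * PER_TASK_LOW, tasks * PER_TASK_HIGH];
}

// Tasks needed for a target concurrency, clamped to 1..MAX_TASKS.
// Defaults to the conservative per-task figure.
function tasksNeeded(concurrentAnalyses: number, perTask: number = PER_TASK_LOW): number {
  const needed = Math.ceil(concurrentAnalyses / perTask);
  return Math.min(MAX_TASKS, Math.max(1, needed));
}
```

For example, 15 concurrent analyses need between 5 tasks (at 3 per task) and 8 tasks (at 2 per task).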


Important: What "Concurrent Analyses" Means

This is NOT the same as website visitors.


Website visitors: The frontend is static files on S3 + CloudFront (a global CDN). It can handle millions of visitors without any impact. Someone browsing the site, reading results, or filling out the form costs essentially nothing.


Concurrent analyses: This counts analyses that are currently running. Each analysis typically takes about 1 minute to complete (the slowest, such as those with Amazon data, can take up to 10 minutes). Because they're not instant, analyses can build up.


How analyses build up: If 5 people click "Run Analysis" over a 2-minute window, several analyses can be running at once - not because they clicked simultaneously, but because each one takes about a minute to finish.

Email campaign scenario: You send an email blast to 200 people. Over 10 minutes, 30 of them visit the site and click "Run Analysis". With ~1 minute per analysis, that averages about 3 running at once; if most of the clicks land in the first minute or two, you might briefly have 10-15 running. The system auto-scales to handle this.
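
The build-up described above is an instance of Little's law: average concurrency equals arrival rate times average duration. A minimal sketch of that arithmetic follows. The `averageConcurrency` name is hypothetical, and the result is a steady-state average, so short bursts of clicks can push the instantaneous number well above it.

```typescript
// Little's law: L = arrival rate x time in system.
// arrivals: analyses started during the window
// windowMinutes: length of the window
// durationMinutes: average time one analysis takes
function averageConcurrency(arrivals: number, windowMinutes: number, durationMinutes: number): number {
  const arrivalsPerMinute = arrivals / windowMinutes;
  return arrivalsPerMinute * durationMinutes;
}

// Email campaign scenario: 30 analyses over 10 minutes, ~1 minute each.
const campaignAverage = averageConcurrency(30, 10, 1);
```

Sizing for the peak rather than the average is why the configured maximum leaves generous headroom.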
Why Max Capacity is Set to 40 Tasks

The 40-task limit (80-120 concurrent analyses) provides safety overhead. It costs nothing to have this headroom - you only pay for tasks that actually run.
This protects against Murphy's Law scenarios: you send out an email, 20+ people click through and run analyses within minutes. The system scales up automatically, handles the load, then scales back down when traffic subsides.
Cost Model - Pay Only For What You Use

Important: Setting max capacity to 40 tasks does NOT cost anything by itself. You only pay for tasks that are actually running.

Idle/Low traffic: 1 task runs = ~$15/month
Moderate traffic: 2-3 tasks = ~$30-45/month
High traffic: Scales up as needed, scales back down automatically

The system starts with 1 task and only launches more when CPU exceeds 70%. When traffic drops, extra tasks are terminated after 5 minutes. You're never paying for capacity you're not using.
Note: These are concurrent analyses running simultaneously. Users browsing the site, viewing results, or filling out forms don't consume significant resources - only the actual "Run Analysis" action does.
How it works:

Single task handles normal load (2-3 concurrent analyses)
If CPU exceeds 70%, a new task launches (~30 seconds)
ALB automatically routes traffic to both tasks
DynamoDB job queue works across all tasks (shared state)
When load drops, tasks scale back down after 5 minutes

Pre-Warming for Expected High Traffic

If you know you'll have a big demo or email campaign, you can pre-warm by increasing the minimum:
# Before the event: ensure at least 5 tasks are ready
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/brandradar-cluster/brandradar-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 5 \
  --max-capacity 40 \
  --region us-east-2

# After the event: scale back to normal
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/brandradar-cluster/brandradar-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 1 \
  --max-capacity 40 \
  --region us-east-2

Costs (Estimated Monthly)

Base Cost (Normal Traffic - 1 Task)


| Service | Usage | Cost |
| --- | --- | --- |
| Fargate | 1 task x 730 hours | ~$15 |
| ALB | 730 hours + LCU | ~$20 |
| CloudFront | 100 GB transfer | ~$10 |
| S3 | <1 GB storage | ~$0.10 |
| DynamoDB | On-demand | ~$0.25 |
| CloudWatch | Logs + Alarms | ~$5 |
| Total | | ~$50/month |


Scaling Costs

Fargate charges ~$15/month per task. You only pay for tasks while they're running:


| Scenario | Tasks Running | Additional Cost |
| --- | --- | --- |
| Normal traffic | 1 task | $0 (base cost) |
| Moderate spike | 2-3 tasks for a few hours | ~$1-2 |
| Heavy traffic sustained all month | 5 tasks | ~$60 extra |
| Very heavy traffic sustained all month | 20 tasks | ~$285 extra |
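
The sustained-load figures above follow from the ~$15 per task-month rate applied to every task beyond the first. The sketch below shows that arithmetic. The function name is hypothetical, and the rate is this document's estimate, not a quoted AWS price.

```typescript
const COST_PER_TASK_MONTH = 15; // USD, estimate used throughout this document
const HOURS_PER_MONTH = 730;

// Cost of the tasks beyond the always-on first task, for the hours they run.
function additionalCost(tasksRunning: number, hoursRunning: number): number {
  const extraTasks = Math.max(0, tasksRunning - 1);
  return extraTasks * COST_PER_TASK_MONTH * (hoursRunning / HOURS_PER_MONTH);
}

const fiveTasksAllMonth = additionalCost(5, HOURS_PER_MONTH); // 4 extra tasks
const twentyTasksAllMonth = additionalCost(20, HOURS_PER_MONTH); // 19 extra tasks
```

Because the cost scales with hours actually run, a short spike costs only a small fraction of these monthly figures.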


Key point: Max capacity of 40 doesn't cost anything until tasks actually launch. The system auto-scales up when needed and back down when traffic drops. Most months you'll pay the base ~$50.
DNS Migration (Future)

When ready to point brandcommerceradar.ai to CloudFront:

Create ACM certificate for brandcommerceradar.ai in us-east-1
Validate certificate via DNS
Add certificate to CloudFront distribution
Add brandcommerceradar.ai as alternate domain name
Update DNS to point to CloudFront: dmkh1ti2sgxc5.cloudfront.net

Domain is currently at GoDaddy (ns41.domaincontrol.com).