I committed a critical fix at 5:51 PM. Pushed to GitHub. CI passed. Green checkmark.
At 6:15 PM, the bug was still happening.
Why? The Docker containers were still running the old code.
This post is about a lesson every developer learns eventually, but that hits differently when AI is writing your code: Committed ≠ Running.
And more importantly: "Deployment successful" ≠ "System working."
What you think happens:

- Commit code
- Push to GitHub
- GitHub Actions builds new image
- ECS pulls new image
- New code runs

What actually happens:

- Commit code ✅
- Push to GitHub ✅
- GitHub Actions builds new image ✅
- ECS keeps running the old image ❌
- Old code runs ❌
Why: A running Docker container is pinned to the image it started from. It doesn't restart when you commit code, and it doesn't auto-pull new images. It runs what it's running until you tell it to stop.
```bash
# After commit + push
docker-compose restart backend celery_worker celery_beat

# Or in production (ECS)
aws ecs update-service --cluster trading-prod \
  --service backend --force-new-deployment
```

The missing step: Restart. Always. Every time.
After multiple "why isn't my fix working" incidents, I codified this process:
```bash
git add backend/app/services/execution.py
git commit -m "fix: Stop-loss monitoring event loop issue"
git push origin develop
```

Standard stuff. This triggers CI.
GitHub Actions runs:
- Unit tests
- Integration tests
- Coverage check (≥80%)
- Linting (black + ruff)
- Type checking (mypy)
- Security scan
Green checkmark = safe to deploy. But not "deployed."
The critical step everyone skips.
```bash
# Local (docker-compose)
docker-compose restart <service-name>

# Production (ECS)
./scripts/restart_service.sh <service-name>
```

Service mapping (which services to restart):
| Changed Files | Restart |
|---|---|
| `backend/app/**/*.py` | backend |
| `backend/app/services/*_tasks.py` | celery_worker, celery_beat |
| `backend/app/agents/**/*.py` | celery_worker, celery_beat |
| `frontend/**/*.tsx` | frontend |
| `backend/app/db/models.py` | backend, celery_worker, celery_beat |
| `.env` or env vars | ALL services |
Rule of thumb: When in doubt, restart backend, celery_worker, and celery_beat.
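The mapping above can be expressed as a small helper. This is a sketch, not production code: the paths and service names mirror this project's layout, and the `fnmatch` patterns (where `*` matches across `/`) are an assumption you'd adjust to your own repo.

```python
from fnmatch import fnmatch

# (pattern, services) pairs mirroring the table above; more specific first.
# Paths and service names are illustrative -- adapt to your own layout.
RESTART_MAP = [
    ("backend/app/services/*_tasks.py", {"celery_worker", "celery_beat"}),
    ("backend/app/agents/**/*.py", {"celery_worker", "celery_beat"}),
    ("backend/app/db/models.py", {"backend", "celery_worker", "celery_beat"}),
    ("backend/app/**/*.py", {"backend"}),
    ("frontend/**/*.tsx", {"frontend"}),
    (".env", {"ALL"}),  # env var changes restart everything
]

def services_to_restart(changed_files):
    """Map changed file paths to the set of services that must restart."""
    services = set()
    for path in changed_files:
        for pattern, targets in RESTART_MAP:
            if fnmatch(path, pattern):
                services |= targets
                break
        else:
            # Unknown file: apply the rule of thumb.
            services |= {"backend", "celery_worker", "celery_beat"}
    return services
```

Wire something like this into a post-commit hook or deploy script and the "which services?" question answers itself.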
This is where AI agents fail most often.
The agent will say: "Deployed successfully! The fix is live."
What actually happened: Service restarted. But did the fix work? Check.
```bash
# Check logs for errors
docker-compose logs backend --tail=50 | grep -i error
docker-compose logs celery_worker --tail=50 | grep -i error

# Verify the NEW code path is executing
docker-compose logs celery_worker --tail=100 | grep "stop-loss check"

# Check that the specific fix is working
# Example: If you fixed event loop errors, verify no event loop errors in logs
docker-compose logs celery_worker --tail=200 | grep "Future attached to different loop"
# Should return nothing if fixed
```

The verification policy:
After ANY deployment, verify it's working BEFORE reporting success.
- Deploy change
- Check logs for errors within 2-5 minutes
- Verify the NEW code path is being executed (not fallback)
- THEN say "deployed and verified working"
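The steps above can be sketched as one helper that gates the "deployed and verified" claim. A hypothetical `verify_deployment` (not part of this codebase): feed it the captured output of `docker-compose logs`, plus the log line your fix is supposed to emit.

```python
import re

def verify_deployment(log_text, new_path_marker,
                      error_pattern=r"error|exception|traceback"):
    """Return (ok, reason). ok only if logs are clean AND the new code path ran."""
    errors = [line for line in log_text.splitlines()
              if re.search(error_pattern, line, re.IGNORECASE)]
    if errors:
        return False, f"{len(errors)} error line(s) in logs, first: {errors[0]!r}"
    if new_path_marker not in log_text:
        return False, f"marker {new_path_marker!r} never appeared; is the new code running?"
    return True, "deployed and verified working"
```

The key design choice: the marker check. Clean logs alone prove nothing; the old code also logged cleanly.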
Even after verification, check again 30 minutes later:
```bash
# Quick health check
curl https://api.example.com/health
# or
docker-compose exec backend python -c "from app.services.broker import broker_service; import asyncio; print(asyncio.run(broker_service.get_market_clock()))"

# Check metrics (if available)
# - BUY:SELL ratio (should be ~2:1, not 100:0)
# - Task success rates
# - Position closure rate
```

This one cost me 1.5 hours on a Monday morning.
Sunday night (Jan 7, 2026, 9:00 PM): Prepared for Monday trading. Checked ECS services (running), SageMaker endpoints (active), latest premarket screening run (successful).
Monday morning (Jan 8, 2026, 9:07 AM): 23 minutes before market open, I discover all EventBridge trading rules are DISABLED.
No premarket screening ran. No execution scheduled. No position monitoring. The system would have sat idle all day.
Jan 7 commit (014e03e): a Terraform change set `enable_live_trading = false` by default, for safety during RL model setup.
```hcl
# infrastructure/main.tf
variable "enable_live_trading" {
  default = false # ← ADDED FOR SAFETY
}

resource "aws_cloudwatch_event_rule" "execute_strategies" {
  name                = "trading-prod-execute-active-strategies"
  schedule_expression = "cron(*/5 9-16 ? * MON-FRI *)"
  is_enabled          = var.enable_live_trading # ← DISABLES ALL TRADING
}
```

The Terraform apply ran successfully. No errors. Rules disabled.
I never checked EventBridge status in my pre-market checklist.
```bash
# Check rule status
aws events list-rules --name-prefix trading-prod \
  --query 'Rules[].{Name:Name,State:State}' --output table
# Expected: All ENABLED
# Actual: All DISABLED

# Enable via Terraform
# infrastructure/prod.tfvars
enable_live_trading = true

# Apply
cd infrastructure && terraform apply -var-file=prod.tfvars -auto-approve

# Verify
aws events list-rules --name-prefix trading-prod \
  --query 'Rules[?State==`ENABLED`].Name'
```

Time to fix: 18 minutes (9:07 AM - 9:25 AM).
Time to discover if I hadn't checked: 6.5 hours (full trading day lost).
Now, every Monday morning (or any day after infrastructure changes):
```markdown
# docs/runbooks/trading-morning-go-no-go.md

## 8:25 AM ET — Confirm scheduled tasks will run

CRITICAL: Verify EventBridge rules are ENABLED

    aws events list-rules --name-prefix trading-prod \
      --query 'Rules[?contains(Name, `premarket`) || contains(Name, `execute`) || contains(Name, `stop-loss`)].{Name:Name,State:State}' \
      --output table

Expected: All rules show State = ENABLED
If ANY rule shows DISABLED, this is a HARD NO-GO
```

The rule: Never assume scheduled tasks are enabled. Always check.
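Once you've parsed the `aws events list-rules` output into a name-to-state dict, the go/no-go decision is a pure function. A sketch with hypothetical names (the required prefixes match this system's rules; substitute your own):

```python
def go_no_go(rule_states, required_prefixes=("premarket", "execute", "stop-loss")):
    """rule_states: {rule_name: state}. NO-GO if any required rule is missing or not ENABLED."""
    problems = []
    for prefix in required_prefixes:
        matching = {name: state for name, state in rule_states.items() if prefix in name}
        if not matching:
            problems.append(f"no rule matching '{prefix}'")
        problems += [f"{name} is {state}"
                     for name, state in matching.items() if state != "ENABLED"]
    return ("GO" if not problems else "NO-GO", problems)
```

Treating "rule is missing entirely" as a failure is deliberate: a rule deleted by a refactor should be just as much a NO-GO as one that is disabled.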
This scenario has burned me multiple times.
The wrong answer:

User: "How is trading going today?"
AI Agent: "Great! Portfolio is up 0.5%, 12 trades executed, 8 winners, 4 losers. S&P is flat so we're outperforming."

Reality: The RL model failed at 9:45 AM. Every trade after that used the fallback heuristic. The agent reported portfolio stats without checking system health.
The right answer:

User: "How is trading going today?"
AI Agent:
- FIRST: Check worker logs for errors (last 30 min):
  `docker-compose logs celery_worker --since=30m | grep -i error`
- SECOND: Verify expected code paths are executing:
  `docker-compose logs celery_worker --tail=100 | grep "RL portfolio decision"`
  (Should see "RL model made 5 decisions", not "RL unavailable, using fallback".)
- THIRD: Report metrics:
  - Portfolio P&L
  - Trade count (BUY:SELL ratio should be reasonable)
  - Win rate
  - System health (tasks succeeding, model being used)
NEVER: Blame "market conditions" or "volatility" without first verifying your system is healthy.
The policy:
If you just deployed something, the FIRST thing to check is whether YOUR deployment is working. Not portfolio numbers, not market conditions. YOUR code. Is it running? Is it erroring?
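The policy fits in a few lines of code. A hypothetical template for a status-reporting agent (both callables are assumptions you would wire to your own log and metric sources):

```python
def answer_status_question(check_system_health, get_portfolio_metrics):
    """Template for 'how is trading going?': health check FIRST, metrics second.

    check_system_health() -> (ok: bool, detail: str)
    get_portfolio_metrics() -> str
    """
    ok, detail = check_system_health()
    if not ok:
        # Refuse to report metrics from a possibly-broken system.
        return f"NOT healthy: {detail}. Fix this before trusting any metrics."
    return f"System healthy ({detail}). {get_portfolio_metrics()}"
```

The structure is the point: the metrics call is unreachable until the health check passes, so the agent cannot report "up 0.5%" from a system running on fallback code.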
Our CI pipeline enforces quality gates and automates deployments.
```yaml
# .github/workflows/backend-tests.yml
name: Backend Tests
on:
  push:
    branches: [develop, main, 'feature/**']
    paths: ['backend/**']
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          cd backend
          pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/unit/ -v --tb=short --maxfail=5
      - name: Run integration tests
        run: pytest tests/integration/ -v --tb=short --maxfail=3
      - name: Run all tests with coverage
        run: |
          pytest tests/ --cov=app --cov-fail-under=80 \
            --cov-report=xml --cov-report=term-missing -v
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v4
        with:
          files: ./backend/coverage.xml
          flags: backend
          fail_ci_if_error: false
```

What this catches: Code that doesn't compile, tests that fail, coverage regressions.
What this doesn't catch: Deployment issues, service restart failures, production-only bugs.
```yaml
  lint:
    runs-on: ubuntu-latest
    steps:
      - name: Auto-fix code formatting
        run: |
          black app/ tests/
          ruff check --fix app/ tests/
      - name: Commit auto-fixes
        if: github.ref == 'refs/heads/develop'
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add .
          if ! git diff --staged --quiet; then
            git commit -m "style: Auto-fix linting issues"
            git push
          fi
```

Why auto-commit: Developers shouldn't get distracted by style when tests fail. Auto-fix and move on.
```yaml
# .github/workflows/terraform-deploy.yml (simplified)
name: Terraform Deploy
on:
  push:
    branches: [main]
    paths: ['infrastructure/**']
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
      - name: Terraform Init
        run: terraform init
      - name: Terraform Plan
        run: terraform plan -var-file=prod.tfvars -out=plan.out
      - name: Terraform Apply
        run: terraform apply plan.out
      - name: Force ECS service update
        run: |
          aws ecs update-service --cluster trading-prod \
            --service backend --force-new-deployment
          aws ecs update-service --cluster trading-prod \
            --service celery-worker --force-new-deployment
```

What this does: Applies infrastructure changes, forces service redeployment.
What this doesn't do: Verify the deployment worked. That's on you.
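CI can't do the verification for you, but a small post-deploy poller can take over where the pipeline stops. A sketch (hypothetical helper; `fetch` is any callable you wire to the service's health endpoint, returning `(status_code, payload_dict)`, e.g. a thin `urllib` or `requests` wrapper):

```python
import time

def wait_until_healthy(fetch, timeout_s=120.0, interval_s=5.0):
    """Poll a health endpoint after deploy until healthy or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            status, payload = fetch()
            if status == 200 and payload.get("status") == "ok":
                return True
        except OSError:
            pass  # service still restarting; keep polling
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Call it as the last step of a deploy script and fail the script (or trigger rollback) when it returns `False`.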
CI catches code-level issues. Runbooks catch operational issues.
```markdown
# docs/runbooks/trading-morning-go-no-go.md

## Timeline

### 8:00 AM ET — Baseline health
- CloudWatch: No alarms paging
- ECS services: All healthy (backend, celery-worker, celery-beat)

### 8:25 AM ET — Confirm scheduled tasks
- EventBridge rules: ENABLED
- Latest premarket run: Successful (check logs)

### 8:45 AM ET — Verify SageMaker endpoints
- RL endpoint: InService
- Prescreening endpoint: InService (if enabled)

### 9:25 AM ET — Final GO/NO-GO decision
- API health endpoint: 200 OK
- Worker health: No errors in last 30 min
- Buying power: > $0 (no stale orders blocking)

If all checks pass: **GO**
If any check fails: **NO-GO** (investigate)
```

And the hotfix runbook (`docs/runbooks/hotfix-deployment.md`):

1. Create hotfix branch from `main`
2. Make fix, test locally
3. Push, wait for CI
4. Deploy to production:
   ```bash
   ./scripts/deploy_hotfix.sh <service-name>
   ```
5. Verify deployment:
   - Check logs for errors (5 min)
   - Verify fix is working (specific check for the bug)
   - Monitor for regressions (30 min)
6. If verification fails, roll back immediately:
   ```bash
   ./scripts/rollback_service.sh <service-name>
   ```
## The Deployment Checklist
Every deployment, every time:
- [ ] Commit and push
- [ ] CI passes (green checkmark)
- [ ] Identify affected services
- [ ] Restart affected services
- [ ] Check logs for errors (2-5 min)
- [ ] Verify NEW code path is executing
- [ ] Check metrics (BUY:SELL ratio, task success)
- [ ] Monitor for regressions (30 min)
**Only then** can you say "deployed and verified working."
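The checklist above can even be enforced in code. A sketch (hypothetical step names; in practice each step would be checked automatically where possible, e.g. by `verify_deployment`-style helpers):

```python
# Steps mirror the checklist above, in order.
DEPLOY_CHECKLIST = [
    "commit_and_push", "ci_green", "services_identified", "services_restarted",
    "logs_clean", "new_code_path_verified", "metrics_sane", "regression_watch_done",
]

def deployment_status(completed):
    """Only report success when every checklist step is done."""
    missing = [step for step in DEPLOY_CHECKLIST if step not in completed]
    if missing:
        return f"NOT verified -- outstanding: {', '.join(missing)}"
    return "deployed and verified working"
```

An AI agent (or deploy script) that can only emit the success string through this function is structurally prevented from saying "deployed successfully" after step two.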
## Key Takeaways
1. **Committed ≠ Running**: Always restart affected services after code changes.
2. **"Deployment successful" ≠ "System working"**: Verify with logs and metrics.
3. **Check outcomes, not status**: "Task scheduled" ≠ "task succeeding." Check worker logs.
4. **Pre-flight checklists**: Verify system health before critical operations (trading day, deployments).
5. **CI catches code issues, runbooks catch operational issues**: Both are necessary.
6. **When asked "how's it going?"**: Check system health first, then report metrics.
7. **Auto-fix style issues**: Don't make developers deal with formatting when tests fail.
## Your Turn: Create Your Deployment Checklist
Take your most critical deployment (production API, cron job, ML model) and write a checklist:
1. What needs to restart?
2. How do you verify it worked?
3. What metrics should you check?
4. What's the rollback procedure?
Commit it. Follow it. Every time.
In the next post, I'll tell you about the ML model that was "live" for 6 days but never made a single production decision—and how to prevent it.
---
*This is Post 5 of an 8-part series on building a full-stack AI trading application with LLM coding agents. Next: The ML Model Was 'Live' for 6 Days—It Never Made a Single Decision.*