Celery beat logs:

```
[2025-11-18 09:32:00] Sending task: check_stop_loss_take_profit
[2025-11-18 09:34:00] Sending task: check_stop_loss_take_profit
[2025-11-18 09:36:00] Sending task: check_stop_loss_take_profit
```
Me: "Stop-loss monitoring is running every 2 minutes. ✅"
Reality: Task scheduled ✅, task invoked ✅, task failed ❌. Every. Single. Time.
This post is about the operational assumptions that bit me hardest—and the patterns that finally made the system reliable.
When you see a task scheduled, your brain says: "It's running."
- Celery beat (scheduler) logs: "Sending task X"
- Your assumption: task X is working
- Reality: the scheduler sends, the worker receives, and the worker might fail. Check worker logs, not scheduler logs.

- Scheduler logs tell you: "I tried to run this"
- Worker logs tell you: "I actually ran this" (or "I crashed trying")
- Outcome metrics tell you: "This had the intended effect"
The hierarchy:
- Outcomes (positions closed, trades executed) ← Trust this
- Worker logs (task succeeded/failed) ← Check this
- Scheduler logs (task scheduled) ← Ignore this
```shell
# WRONG: Check scheduler
docker-compose logs celery_beat | grep check_stop_loss

# RIGHT: Check worker
docker-compose logs celery_worker | grep check_stop_loss | tail -20

# BEST: Check outcome
psql $DATABASE_URL -c "SELECT COUNT(*) FROM trades WHERE close_reason = 'stop_loss' AND DATE(closed_at) = CURRENT_DATE;"
```

The metrics that would have caught this:
- BUY:SELL ratio: Should be ~2:1, not 100:0
- Oldest open position age: Should be < 3 days, not 7+ days
- Task success rate: Should be 100%, not 0%
```python
# This worked in dev (single worker run)
engine = create_async_engine(DATABASE_URL)

@shared_task
def my_celery_task():
    return asyncio.run(_async_impl())

async def _async_impl():
    async with get_session() as db:  # ← Uses global engine
        result = await db.execute(...)
        return result
```

What happened:
- First task run: Creates event loop A, engine binds to loop A ✅
- Second task run: Creates event loop B, tries to use engine (still bound to loop A) ❌
Error: RuntimeError: Future attached to different loop
Why it's sneaky: Works fine if you run it once. Fails on the second invocation.
The fix:

```python
@shared_task(name="my_task")
def my_celery_task():
    async def _dispose_and_execute():
        from app.services.celery_db import dispose_celery_engine, get_celery_engine
        # Dispose old engine (if one exists, bound to a previous loop)
        await dispose_celery_engine()
        # Create a new engine in the current event loop
        get_celery_engine()
        # Now run your task
        return await _async_implementation()
    return asyncio.run(_dispose_and_execute())
```

Key: every Celery task that uses the async DB must dispose and recreate the engine.
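The post doesn't show `app/services/celery_db.py` itself, so here's a hedged sketch of what such a dispose/recreate helper pair could look like. `make_engine` is a stand-in for SQLAlchemy's `create_async_engine`, which keeps the sketch dependency-free:

```python
# Hypothetical sketch of app/services/celery_db.py (not shown in the
# original post). _engine is the per-process cache that goes stale
# whenever asyncio.run() creates a fresh event loop.
import asyncio

_engine = None

def get_celery_engine(make_engine=object):
    """Return the cached engine, creating one under the current loop if needed."""
    global _engine
    if _engine is None:
        _engine = make_engine()  # real code: create_async_engine(DATABASE_URL)
    return _engine

async def dispose_celery_engine():
    """Drop the engine created under a previous event loop, if any."""
    global _engine
    if _engine is not None:
        # With a real AsyncEngine this would be: await _engine.dispose()
        _engine = None

async def task_body():
    # The dispose-then-recreate dance from the task above
    await dispose_celery_engine()
    return get_celery_engine()
```

Calling `asyncio.run(task_body())` twice yields two distinct engines, one per loop, which is what prevents the "different loop" error.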
Pattern applied: 8 Celery tasks fixed, zero event loop errors since.
```python
async def check_stop_loss():
    async with db.begin():  # Start transaction
        # Get 100 open positions
        positions = await db.execute(select(Trade).where(...))
        # Make 100 broker API calls (300 ms each = 30 seconds total)
        for position in positions:
            quote = await broker.get_quote(position.symbol)  # ← HOLDING TRANSACTION
            if should_close(position, quote):
                await close_position(position)
        await db.commit()  # ← Transaction held for 30-50 seconds
```

What happened:
- Transaction starts
- 100 API calls execute (30-50 seconds)
- Connection held during all calls
- Connection pool exhausts
- Other tasks can't get connections
```
asyncpg.exceptions._base.InterfaceError: cannot perform operation: another operation is in progress
```
Why it's sneaky: Works fine with 3 positions (0.9 seconds). Breaks with 100 positions (30 seconds).
```python
async def check_stop_loss():
    # Step 1: Get data and COMMIT immediately
    positions = await db.execute(select(Trade).where(...))
    strategies = await db.execute(select(Strategy).where(...))
    await db.commit()  # ← CRITICAL: release the connection

    # Step 2: Batch API calls (outside any transaction)
    symbols = unique([p.symbol for p in positions])
    prices = {}
    for symbol in symbols:  # 50 unique symbols, not 100 positions
        prices[symbol] = await broker.get_quote(symbol)

    # Step 3: Determine which positions to close
    trades_to_close = []
    for position in positions:
        if should_close(position, prices[position.symbol]):
            trades_to_close.append(position)

    # Step 4: Close each position in a separate short transaction
    for position in trades_to_close:
        async with db.begin():  # commits automatically on exit
            position = await db.execute(select(Trade).where(...))  # Fresh query
            await close_position(position)
```

Key principles:
- Commit early: Get data → commit → do work
- Batch API calls: 50 unique symbols, not 100 positions (50% reduction)
- Small transactions: Many 1-second transactions, not one 30-second transaction
Result: Transaction duration < 1 second each, connection pool never exhausts.
User: "Monitor trading throughout the day."
Me (wrong approach):
```shell
# Started background process in shell
nohup python monitor_trading.py &
```

Problem: This process is bound to my terminal session. When I close the terminal (or the LLM conversation ends), the process dies.
User discovers (12 hours later): "The monitoring stopped."
Ask yourself: Will this need to run when the conversation ends?
| User Says | They Mean | Solution |
|---|---|---|
| "Monitor throughout the day" | Autonomous, persistent | Cron job or system service |
| "Keep checking" | Autonomous, persistent | Cron job or system service |
| "Run this periodically" | Autonomous, persistent | Cron job or system service |
| "Run this now" | Session-bound, one-time | Background process OK |
The fix: Cron job.
```shell
# Add to crontab
*/15 9-16 * * 1-5 /usr/bin/python /path/to/monitor_trading.py >> /var/log/monitor.log 2>&1
```

Why: Cron jobs persist. They run even when you're not in a session. They're what "autonomous" actually means.
Scenario: Just deployed a fix for stop-loss monitoring.
User: "How is trading going?"
Me (wrong): "Portfolio is up 0.5% today! 12 trades executed, 8 winners, 4 losers."
Reality: The fix I just deployed didn't work. Stop-loss still broken. I reported metrics without checking system health.
User: "How is trading going?"
Me (correct):
- FIRST: Check worker logs for errors (last 30 min)

  ```shell
  docker-compose logs celery_worker --since=30m | grep -i error
  ```

- SECOND: Verify expected code paths are executing

  ```shell
  docker-compose logs celery_worker --tail=100 | grep "stop-loss check"
  # Should see: "Stop-loss check complete: X positions checked"
  ```

- THIRD: Report metrics
  - Portfolio P&L
  - Trade count
  - Win rate
  - System health (tasks succeeding, no errors)
Never blame "market conditions" without first verifying your system is healthy.
The policy:
If you just deployed something, the FIRST thing to check is whether YOUR deployment is working. Not portfolio numbers. YOUR code. Is it running? Is it erroring?
The bug:
```python
universe = await get_all_nyse_nasdaq_symbols()  # Returns alphabetically!
sampled = universe[:500]  # Takes first 500 = all A-stocks
```

Result: Prescreening only considered stocks starting with "A". All 5 positions: AAL, ABEV, ALHC, ALT, ANGI.
Fix: `random.shuffle(universe)` before sampling.
Lesson: Always shuffle before sampling. External APIs return sorted data.
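A minimal sketch of the difference (the symbols here are illustrative, and the seed exists only to keep the example reproducible):

```python
import random

# A sorted universe, the way many listing APIs return it
universe = sorted(["AAL", "ABEV", "GOOG", "MSFT", "NVDA", "TSLA", "XOM", "ZTS"])

# WRONG: slicing a sorted list biases the sample toward early letters
biased = universe[:3]  # ['AAL', 'ABEV', 'GOOG']

# RIGHT: shuffle a copy first, then slice
rng = random.Random(7)
shuffled = list(universe)
rng.shuffle(shuffled)
unbiased = shuffled[:3]

# random.sample collapses shuffle-then-slice into one call
also_unbiased = rng.sample(universe, 3)
```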
The bug:
```python
# Order object contains UUID
order = broker.submit_market_order(...)  # Returns {"id": UUID(...)}

# Direct insert into JSON column
await db.execute(
    insert(DecisionRun).values(execution_meta=order)  # ← UUIDs not JSON serializable
)
```

Error:

```
TypeError: Object of type UUID is not JSON serializable
```
Fix: Sanitize before insert.
```python
def _sanitize_for_json(obj):
    """Recursively convert UUID → string."""
    if isinstance(obj, UUID):
        return str(obj)
    elif isinstance(obj, dict):
        return {k: _sanitize_for_json(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [_sanitize_for_json(v) for v in obj]
    return obj

order_sanitized = _sanitize_for_json(order)
await db.execute(insert(DecisionRun).values(execution_meta=order_sanitized))
```

Lesson: Always sanitize external data before JSON serialization.
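UUIDs aren't the only offender: broker payloads commonly carry datetimes and Decimals too. A hedged extension of the sanitizer (the datetime and Decimal branches are my addition, not part of the original fix):

```python
import json
from datetime import date, datetime
from decimal import Decimal
from uuid import UUID, uuid4

def sanitize_for_json(obj):
    """Recursively convert UUIDs, datetimes, and Decimals to JSON-safe types."""
    if isinstance(obj, UUID):
        return str(obj)
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return float(obj)
    if isinstance(obj, dict):
        return {k: sanitize_for_json(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [sanitize_for_json(v) for v in obj]
    return obj

# Illustrative payload with all three problem types
order = {"id": uuid4(), "filled_at": datetime(2025, 11, 18, 9, 32), "qty": Decimal("10")}
json.dumps(sanitize_for_json(order))  # no TypeError
```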
The bug: Changed Terraform config to enable_live_trading = false for safety. Forgot to re-enable before trading day.
Result: 9:07 AM Monday morning (23 min before market open), discovered all trading rules were DISABLED. Would have missed entire trading day.
Fix: Pre-market checklist.
```shell
# docs/runbooks/trading-morning-go-no-go.md
aws events list-rules --name-prefix trading-prod \
  --query 'Rules[].{Name:Name,State:State}' --output table
# Expected: All ENABLED
# If ANY rule shows DISABLED → HARD NO-GO
```

Lesson: Never assume scheduled tasks are enabled. Always check.
Before: Event loop errors every 2 minutes
After: Zero event loop errors across 8 tasks
Code: See Lesson 2 above.
Before: 30-50 second transactions, connection pool exhaustion, deadlocks
After: < 1 second transactions, zero pool exhaustion
Code: See Lesson 3 above.
Before: 100 positions → 100 API calls
After: 100 positions, 50 unique symbols → 50 API calls (50% reduction)
```python
# Before
for position in positions:
    price = await broker.get_quote(position.symbol)

# After
symbols = unique([p.symbol for p in positions])
prices = {s: await broker.get_quote(s) for s in symbols}
for position in positions:
    price = prices[position.symbol]
```

Before: Check scheduler logs ("task scheduled")
After: Check outcome metrics (positions closed, BUY:SELL ratio)
Before: Background processes that die when the session ends
After: Cron jobs that persist
After the stop-loss failure, we added three monitoring services:

Task monitoring:
- Check all registered tasks
- Verify recent execution success
- Alert on failure rate > 10%
- Track task execution times

Trading metrics:
- BUY:SELL ratio (alert if > 5:1 for > 1 hour)
- Open position ages (alert if > 3 days)
- Unrealized P&L tracking
- Strategy execution stats

Database monitoring:
- Connection pool status
- Slow query detection (> 1 second)
- Database error rates
- Table sizes and growth

Result: The stop-loss failure would have been detected within 15 minutes (by the BUY:SELL ratio alert) instead of 7 days.
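As a concrete example, the BUY:SELL check from the trading-metrics service can be sketched in a few lines (the trade records and threshold here are illustrative, not the production schema):

```python
# Hedged sketch of the BUY:SELL ratio alert described above.
def buy_sell_ratio_alert(trades, threshold=5.0):
    """Return True if the BUY:SELL ratio over the window exceeds threshold."""
    buys = sum(1 for t in trades if t["side"] == "BUY")
    sells = sum(1 for t in trades if t["side"] == "SELL")
    if sells == 0:
        # All buys and no sells is exactly the broken stop-loss signature
        return buys > 0
    return buys / sells > threshold

healthy = [{"side": "BUY"}] * 8 + [{"side": "SELL"}] * 4  # 2:1 ratio -> no alert
broken = [{"side": "BUY"}] * 12                           # 12:0 ratio -> alert
```

Run on a sliding one-hour window of fills, this fires on the very pattern that went unnoticed for a week.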
- "Scheduled" ≠ "Working": Check worker logs and outcome metrics, not scheduler logs.
- Async + DB needs event loop management: Dispose and recreate the engine in each Celery task.
- Commit before external calls: Never hold transactions during API calls. Batch by symbol.
- Persistent ≠ background process: Use cron jobs for "throughout the day" tasks.
- Verify before reporting: Check system health first, then report metrics. Never blame external factors without checking internals.
- Batch API calls: Reduce 100 calls to 50 by deduplicating symbols.
- Metrics > logs: BUY:SELL ratio, position ages, and task success rates catch failures faster than logs.
- Shuffle before sampling: External APIs return sorted data.
- Sanitize before JSON: UUIDs, datetimes, and other types need conversion.
- Pre-flight checklists: Verify EventBridge rules, service health, and buying power before critical operations.
Pick your most critical workflow and add:
- Outcome metric: What should this produce? (trades executed, emails sent, reports generated)
- Health check: Is the system actually running this? (worker logs, task success rate)
- Pre-flight check: Before critical operations, verify prerequisites (rules enabled, services healthy, resources available)
Assume nothing. Verify everything. Metrics and runbooks are first-class deliverables.
In the final post, I'll show you proof that all these lessons resulted in a working system: a week of paper trading performance vs S&P 500.
This is Post 7 of an 8-part series on building a full-stack AI trading application with LLM coding agents. Next: Proof It Works: A Week of Paper Trading Performance.