Celery beat logs:

```
[2025-11-18 09:32:00] Sending task: check_stop_loss_take_profit
[2025-11-18 09:34:00] Sending task: check_stop_loss_take_profit
[2025-11-18 09:36:00] Sending task: check_stop_loss_take_profit
```
Me: "Stop-loss monitoring is running every 2 minutes. ✅"
Reality: Task scheduled ✅, task invoked ✅, task failed ❌. Every. Single. Time.
This post is about the operational assumptions that bit me hardest—and the patterns that finally made the system reliable.
When you see a task scheduled, your brain says: "It's running."
- Celery beat (scheduler) logs: "Sending task X"
- Your assumption: task X is working
- Reality: the scheduler sends, the worker receives, and the worker might fail. Check worker logs, not scheduler logs.

- Scheduler logs tell you: "I tried to run this"
- Worker logs tell you: "I actually ran this" (or "I crashed trying")
- Outcome metrics tell you: "This had the intended effect"
The hierarchy:
- Outcomes (positions closed, trades executed) ← Trust this
- Worker logs (task succeeded/failed) ← Check this
- Scheduler logs (task scheduled) ← Ignore this
```shell
# WRONG: Check scheduler
docker-compose logs celery_beat | grep check_stop_loss

# RIGHT: Check worker
docker-compose logs celery_worker | grep check_stop_loss | tail -20

# BEST: Check outcome
psql $DATABASE_URL -c "SELECT COUNT(*) FROM trades WHERE close_reason = 'stop_loss' AND DATE(closed_at) = CURRENT_DATE;"
```

The metrics that would have caught this:
- BUY:SELL ratio: Should be ~2:1, not 100:0
- Oldest open position age: Should be < 3 days, not 7+ days
- Task success rate: Should be 100%, not 0%
```python
# This worked in dev (single worker run)
engine = create_async_engine(DATABASE_URL)

@shared_task
def my_celery_task():
    return asyncio.run(_async_impl())

async def _async_impl():
    async with get_session() as db:  # ← Uses global engine
        result = await db.execute(...)
        return result
```

What happened:
- First task run: Creates event loop A, engine binds to loop A ✅
- Second task run: Creates event loop B, tries to use engine (still bound to loop A) ❌
Error: RuntimeError: Future attached to different loop
Why it's sneaky: Works fine if you run it once. Fails on the second invocation.
The fix:

```python
@shared_task(name="my_task")
def my_celery_task():
    async def _dispose_and_execute():
        from app.services.celery_db import dispose_celery_engine, get_celery_engine
        # Dispose old engine (if one exists, bound to a previous loop)
        await dispose_celery_engine()
        # Create a new engine in the current event loop
        get_celery_engine()
        # Now run your task
        return await _async_implementation()
    return asyncio.run(_dispose_and_execute())
```

Key: every Celery task that uses the async DB must dispose and recreate the engine.
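The post doesn't show `app/services/celery_db.py` itself, so here's a hedged sketch of what such a dispose/recreate helper pair could look like. `make_engine` is a stand-in for SQLAlchemy's `create_async_engine`, which keeps the sketch dependency-free:

```python
# Hypothetical sketch of app/services/celery_db.py (not shown in the
# original post). _engine is the per-process cache that goes stale
# whenever asyncio.run() creates a fresh event loop.
import asyncio

_engine = None

def get_celery_engine(make_engine=object):
    """Return the cached engine, creating one under the current loop if needed."""
    global _engine
    if _engine is None:
        _engine = make_engine()  # real code: create_async_engine(DATABASE_URL)
    return _engine

async def dispose_celery_engine():
    """Drop the engine created under a previous event loop, if any."""
    global _engine
    if _engine is not None:
        # With a real AsyncEngine this would be: await _engine.dispose()
        _engine = None

async def task_body():
    # The dispose-then-recreate dance from the task above
    await dispose_celery_engine()
    return get_celery_engine()
```

Calling `asyncio.run(task_body())` twice yields two distinct engines, one per loop, which is what prevents the "different loop" error.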
Pattern applied: 8 Celery tasks fixed, zero event loop errors since.
```python
async def check_stop_loss():
    async with db.begin():  # Start transaction
        # Get 100 open positions
        positions = await db.execute(select(Trade).where(...))
        # Make 100 broker API calls (300 ms each = 30 seconds total)
        for position in positions:
            quote = await broker.get_quote(position.symbol)  # ← HOLDING TRANSACTION
            if should_close(position, quote):
                await close_position(position)
        await db.commit()  # ← Transaction held for 30-50 seconds
```

What happened:
- Transaction starts
- 100 API calls execute (30-50 seconds)
- Connection held during all calls
- Connection pool exhausts
- Other tasks can't get connections
```
asyncpg.exceptions._base.InterfaceError: cannot perform operation: another operation is in progress
```
Why it's sneaky: Works fine with 3 positions (0.9 seconds). Breaks with 100 positions (30 seconds).
```python
async def check_stop_loss():
    # Step 1: Get data and COMMIT immediately
    positions = await db.execute(select(Trade).where(...))
    strategies = await db.execute(select(Strategy).where(...))
    await db.commit()  # ← CRITICAL: release the connection

    # Step 2: Batch API calls (outside any transaction)
    symbols = unique([p.symbol for p in positions])
    prices = {}
    for symbol in symbols:  # 50 unique symbols, not 100 positions
        prices[symbol] = await broker.get_quote(symbol)

    # Step 3: Determine which positions to close
    trades_to_close = []
    for position in positions:
        if should_close(position, prices[position.symbol]):
            trades_to_close.append(position)

    # Step 4: Close each position in a separate short transaction
    for position in trades_to_close:
        async with db.begin():  # commits automatically on exit
            position = await db.execute(select(Trade).where(...))  # Fresh query
            await close_position(position)
```

Key principles:
- Commit early: Get data → commit → do work
- Batch API calls: 50 unique symbols, not 100 positions (50% reduction)
- Small transactions: Many 1-second transactions, not one 30-second transaction
Result: Transaction duration < 1 second each, connection pool never exhausts.
User: "Monitor trading throughout the day."
Me (wrong approach):
```shell
# Started background process in shell
nohup python monitor_trading.py &
```

Problem: This process is bound to my terminal session. When I close the terminal (or the LLM conversation ends), the process dies.
User discovers (12 hours later): "The monitoring stopped."
Ask yourself: Will this need to run when the conversation ends?
| User Says | They Mean | Solution |
|---|---|---|
| "Monitor throughout the day" | Autonomous, persistent | Cron job or system service |
| "Keep checking" | Autonomous, persistent | Cron job or system service |
| "Run this periodically" | Autonomous, persistent | Cron job or system service |
| "Run this now" | Session-bound, one-time | Background process OK |
The fix: Cron job.
```shell
# Add to crontab
*/15 9-16 * * 1-5 /usr/bin/python /path/to/monitor_trading.py >> /var/log/monitor.log 2>&1
```

Why: Cron jobs persist. They run even when you're not in a session. They're what "autonomous" actually means.
Scenario: Just deployed a fix for stop-loss monitoring.
User: "How is trading going?"
Me (wrong): "Portfolio is up 0.5% today! 12 trades executed, 8 winners, 4 losers."
Reality: The fix I just deployed didn't work. Stop-loss still broken. I reported metrics without checking system health.
User: "How is trading going?"
Me (correct):
- FIRST: Check worker logs for errors (last 30 min)

  ```shell
  docker-compose logs celery_worker --since=30m | grep -i error
  ```

- SECOND: Verify expected code paths are executing

  ```shell
  docker-compose logs celery_worker --tail=100 | grep "stop-loss check"
  # Should see: "Stop-loss check complete: X positions checked"
  ```

- THIRD: Report metrics
  - Portfolio P&L
  - Trade count
  - Win rate
  - System health (tasks succeeding, no errors)
Never blame "market conditions" without first verifying your system is healthy.
The policy:
If you just deployed something, the FIRST thing to check is whether YOUR deployment is working. Not portfolio numbers. YOUR code. Is it running? Is it erroring?
The bug:
```python
universe = await get_all_nyse_nasdaq_symbols()  # Returns alphabetically!
sampled = universe[:500]  # Takes first 500 = all A-stocks
```

Result: Prescreening only considered stocks starting with "A". All 5 positions: AAL, ABEV, ALHC, ALT, ANGI.
Fix: `random.shuffle(universe)` before sampling.
Lesson: Always shuffle before sampling. External APIs return sorted data.
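A minimal sketch of the difference (the symbols here are illustrative, and the seed exists only to keep the example reproducible):

```python
import random

# A sorted universe, the way many listing APIs return it
universe = sorted(["AAL", "ABEV", "GOOG", "MSFT", "NVDA", "TSLA", "XOM", "ZTS"])

# WRONG: slicing a sorted list biases the sample toward early letters
biased = universe[:3]  # ['AAL', 'ABEV', 'GOOG']

# RIGHT: shuffle a copy first, then slice
rng = random.Random(7)
shuffled = list(universe)
rng.shuffle(shuffled)
unbiased = shuffled[:3]

# random.sample collapses shuffle-then-slice into one call
also_unbiased = rng.sample(universe, 3)
```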
The bug:
```python
# Order object contains UUID
order = broker.submit_market_order(...)  # Returns {"id": UUID(...)}

# Direct insert into JSON column
await db.execute(
    insert(DecisionRun).values(execution_meta=order)  # ← UUIDs not JSON serializable
)
```

Error:

```
TypeError: Object of type UUID is not JSON serializable
```
Fix: Sanitize before insert.
```python
def _sanitize_for_json(obj):
    """Recursively convert UUID → string."""
    if isinstance(obj, UUID):
        return str(obj)
    elif isinstance(obj, dict):
        return {k: _sanitize_for_json(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [_sanitize_for_json(v) for v in obj]
    return obj

order_sanitized = _sanitize_for_json(order)
await db.execute(insert(DecisionRun).values(execution_meta=order_sanitized))
```

Lesson: Always sanitize external data before JSON serialization.
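UUIDs aren't the only offender: broker payloads commonly carry datetimes and Decimals too. A hedged extension of the sanitizer (the datetime and Decimal branches are my addition, not part of the original fix):

```python
import json
from datetime import date, datetime
from decimal import Decimal
from uuid import UUID, uuid4

def sanitize_for_json(obj):
    """Recursively convert UUIDs, datetimes, and Decimals to JSON-safe types."""
    if isinstance(obj, UUID):
        return str(obj)
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return float(obj)
    if isinstance(obj, dict):
        return {k: sanitize_for_json(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [sanitize_for_json(v) for v in obj]
    return obj

# Illustrative payload with all three problem types
order = {"id": uuid4(), "filled_at": datetime(2025, 11, 18, 9, 32), "qty": Decimal("10")}
json.dumps(sanitize_for_json(order))  # no TypeError
```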
The bug: Changed Terraform config to enable_live_trading = false for safety. Forgot to re-enable before trading day.
Result: 9:07 AM Monday morning (23 min before market open), discovered all trading rules were DISABLED. Would have missed entire trading day.
Fix: Pre-market checklist.
```shell
# docs/runbooks/trading-morning-go-no-go.md
aws events list-rules --name-prefix trading-prod \
  --query 'Rules[].{Name:Name,State:State}' --output table
# Expected: All ENABLED
# If ANY rule shows DISABLED → HARD NO-GO
```

Lesson: Never assume scheduled tasks are enabled. Always check.
Before: Event loop errors every 2 minutes
After: Zero event loop errors across 8 tasks
Code: See Lesson 2 above.
Before: 30-50 second transactions, connection pool exhaustion, deadlocks
After: < 1 second transactions, zero pool exhaustion
Code: See Lesson 3 above.
Before: 100 positions → 100 API calls
After: 100 positions, 50 unique symbols → 50 API calls (50% reduction)
```python
# Before
for position in positions:
    price = await broker.get_quote(position.symbol)

# After
symbols = unique([p.symbol for p in positions])
prices = {s: await broker.get_quote(s) for s in symbols}
for position in positions:
    price = prices[position.symbol]
```

Before: Check scheduler logs ("task scheduled")
After: Check outcome metrics (positions closed, BUY:SELL ratio)
Before: Background processes that die when the session ends
After: Cron jobs that persist
After the stop-loss failure, we added three monitoring services:

Task monitoring:
- Check all registered tasks
- Verify recent execution success
- Alert on failure rate > 10%
- Track task execution times

Trading metrics:
- BUY:SELL ratio (alert if > 5:1 for > 1 hour)
- Open position ages (alert if > 3 days)
- Unrealized P&L tracking
- Strategy execution stats

Database monitoring:
- Connection pool status
- Slow query detection (> 1 second)
- Database error rates
- Table sizes and growth

Result: The stop-loss failure would have been detected within 15 minutes (by the BUY:SELL ratio alert) instead of 7 days.
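As a concrete example, the BUY:SELL check from the trading-metrics service can be sketched in a few lines (the trade records and threshold here are illustrative, not the production schema):

```python
# Hedged sketch of the BUY:SELL ratio alert described above.
def buy_sell_ratio_alert(trades, threshold=5.0):
    """Return True if the BUY:SELL ratio over the window exceeds threshold."""
    buys = sum(1 for t in trades if t["side"] == "BUY")
    sells = sum(1 for t in trades if t["side"] == "SELL")
    if sells == 0:
        # All buys and no sells is exactly the broken stop-loss signature
        return buys > 0
    return buys / sells > threshold

healthy = [{"side": "BUY"}] * 8 + [{"side": "SELL"}] * 4  # 2:1 ratio -> no alert
broken = [{"side": "BUY"}] * 12                           # 12:0 ratio -> alert
```

Run on a sliding one-hour window of fills, this fires on the very pattern that went unnoticed for a week.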
- "Scheduled" ≠ "Working": Check worker logs and outcome metrics, not scheduler logs.
- Async + DB needs event loop management: Dispose and recreate the engine in each Celery task.
- Commit before external calls: Never hold transactions during API calls. Batch by symbol.
- Persistent ≠ background process: Use cron jobs for "throughout the day" tasks.
- Verify before reporting: Check system health first, then report metrics. Never blame external factors without checking internals.
- Batch API calls: Reduce 100 calls to 50 by deduplicating symbols.
- Metrics > logs: BUY:SELL ratio, position ages, and task success rates catch failures faster than logs.
- Shuffle before sampling: External APIs return sorted data.
- Sanitize before JSON: UUIDs, datetimes, and other types need conversion.
- Pre-flight checklists: Verify EventBridge rules, service health, and buying power before critical operations.
Pick your most critical workflow and add:
- Outcome metric: What should this produce? (trades executed, emails sent, reports generated)
- Health check: Is the system actually running this? (worker logs, task success rate)
- Pre-flight check: Before critical operations, verify prerequisites (rules enabled, services healthy, resources available)
Assume nothing. Verify everything. Metrics and runbooks are first-class deliverables.
In the final post, I'll show you proof that all these lessons resulted in a working system: a week of paper trading performance vs S&P 500.
This is Post 7 of an 8-part series on building a full-stack AI trading application with LLM coding agents. Next: Proof It Works: A Week of Paper Trading Performance.