@ttarler
Last active February 2, 2026 19:37
Medium Post 2: From Multi-Agent Chaos to a Single Execution Path


[Image: V1 vs V2 Architecture Comparison]

I Built Something Too Clever for My Own Good

Here's a confession: when I started this project, I was way more interested in building something impressive than something that worked.

Multiple LLM agents coordinating decisions. A dozen strategies running in parallel. Complex fallback hierarchies. HuggingFace endpoints for model inference. It was beautiful on the whiteboard.

It was hell in production.

What I Originally Built

Picture this: a prescreening agent talks to a portfolio agent, which consults a risk agent, which coordinates with the execution layer, while ten different trading strategies fight for airtime. Meanwhile, three levels of fallbacks kick in whenever something fails (and something always fails).

The components looked something like this:

  • LLM Agents: Prescreening, portfolio allocation, and risk management agents all chatting through a central orchestrator
  • Multi-Strategy Execution: Ten strategies running concurrently, each with their own logic for when to buy and sell
  • HuggingFace Endpoints: Custom LLM deployments for generating decisions in real-time
  • Fallback Cascade: If agent A fails, try agent B. If B fails, try rule-based logic. If that fails, just give up (quietly)
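That cascade deserves a closer look, because it's where the silent failures came from. Here's a minimal sketch of the pattern (the agent classes and names are hypothetical stand-ins, not the original code): each layer swallows the previous layer's exception, so when everything fails, the caller just gets nothing.

```python
import asyncio

class FailingAgent:
    """Stand-in for an LLM agent whose endpoint is down."""
    async def decide(self, symbol):
        raise RuntimeError("endpoint unavailable")

llm_agent_a = FailingAgent()  # hypothetical primary agent
llm_agent_b = FailingAgent()  # hypothetical backup agent

def rule_based_decide(symbol):
    raise RuntimeError("rules misconfigured")

async def decide(symbol):
    # Layer 1: primary agent
    try:
        return await llm_agent_a.decide(symbol)
    except Exception:
        pass
    # Layer 2: backup agent
    try:
        return await llm_agent_b.decide(symbol)
    except Exception:
        pass
    # Layer 3: rule-based logic
    try:
        return rule_based_decide(symbol)
    except Exception:
        return None  # every layer failed and nobody was told

print(asyncio.run(decide("TSLA")))  # -> None: total failure, zero noise
```

Three layers of try/except means three chances for an error to vanish before it reaches a log line.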

I told myself this was sophisticated. I told myself this was what a "real" trading system should look like.

I was wrong.

The Reality Check

Three weeks in, I'm staring at logs at midnight trying to figure out why the system bought TSLA. Not whether it was a good decision—just trying to trace which damn agent made it.

Was it the prescreening agent's recommendation? The portfolio agent overriding that? The risk agent stepping in? Some strategy I forgot I activated? One of the fallback paths?

The answer was usually "yes."

Beyond the debugging nightmare:

Latency killed me. Each LLM API call takes 500ms to 2 seconds. Chain three agents together and you're looking at decision times measured in seconds, not milliseconds. In trading, that's forever.

Non-determinism made testing pointless. Same inputs, different outputs. Same market conditions, different trades. How do you write a test for that? You don't. You just pray.

The fallback cascade was a lie. I thought "graceful degradation" would save me. Instead, I got silent failures. The system would degrade so gracefully that I wouldn't notice it had stopped working for a week. (That happened. More on that in post 6.)

Operations was a part-time job. Five agents, ten strategies, three fallback levels. That's 150 code paths. Every deployment was a roll of the dice.

The Moment Everything Changed

The breaking point came after a seven-day outage. Stop-loss monitoring had silently stopped running. The scheduled task was there, marked as active, looking perfectly healthy in the dashboard. It just... wasn't executing.

That night I asked myself: What's the stupidest architecture that could actually work?

One path. One decision. No agents. No parallel anything.

What It Looks Like Now

The current system is almost boring:

  1. Premarket screening identifies candidates at 4 AM
  2. RL model (or deterministic fallback) makes buy/sell decisions
  3. Broker executes the orders
  4. Monitor watches for stop-loss and take-profit

That's it. Here's the actual flow:

async def execute_active_strategies(strategy_id: int):
    # Get today's candidates
    watchlist = await get_todays_watchlist()
    
    # RL model decides (or fall back to deterministic logic)
    try:
        decisions = await rl_portfolio_service.make_decisions(
            candidates=watchlist,
            current_positions=await get_open_positions(),
            cash_available=await get_buying_power()
        )
    except Exception as e:
        logger.warning(f"RL unavailable: {e}")
        decisions = await fallback_strategy_service.make_decisions(watchlist)
    
    # Execute
    for decision in decisions:
        await broker.place_order(
            symbol=decision['symbol'],
            quantity=decision['quantity'],
            side=decision['action']
        )

When something goes wrong now, I know exactly where to look. One path means one answer to "what happened?"

The Fallback Isn't Really a Fallback

I want to be clear about something: the "fallback" here isn't a complex decision-making system. It's a simple decision tree that reads prescreening scores and applies basic rules.

async def make_decisions(candidates, strategy_risk_params, cash_available):
    decisions = []
    
    for symbol in candidates:
        scores = await get_prescreening_scores(symbol)
        
        # Dead simple: high buy probability, low sell probability = buy
        if scores['buy_probability'] > 0.7 and scores['sell_probability'] < 0.3:
            decisions.append({
                'action': 'buy',
                'symbol': symbol,
                'quantity': _calculate_position_size(cash_available),
                'stop_loss': strategy_risk_params.stop_loss_pct
            })
    
    return decisions

The active strategy just holds risk parameters—stop-loss percentage, position sizing limits, take-profit targets. It's a config container, not a decision-maker.
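A sketch of what that config container can look like, assuming a frozen dataclass (the class name, field names, and default values here are illustrative, not the actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StrategyRiskParams:
    """Hypothetical shape of the 'config container' strategy:
    it holds numbers, it makes no decisions."""
    stop_loss_pct: float = 0.05    # exit if down 5%
    take_profit_pct: float = 0.10  # exit if up 10%
    max_position_pct: float = 0.20 # no position over 20% of buying power

params = StrategyRiskParams()
print(params.stop_loss_pct)  # -> 0.05
```

Freezing it is deliberate: risk parameters should change through a deployment, not through a mutation buried in some code path.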

Position Monitoring: Learning from Failure

The position monitor runs every couple of minutes and does exactly one thing well:

async def check_stop_loss_take_profit():
    # Query positions
    open_trades = (await db.execute(
        select(Trade).where(Trade.status == 'open')
    )).scalars().all()
    await db.commit()  # Release DB before API calls
    
    # Batch the price lookups before touching positions
    symbols = [t.symbol for t in open_trades]
    prices = {s: await broker.get_latest_quote(s) for s in symbols}
    
    # Check each position
    for trade in open_trades:
        pnl_pct = (prices[trade.symbol] - trade.entry_price) / trade.entry_price
        
        if pnl_pct <= -trade.stop_loss_pct:
            await close_position(trade, reason='stop_loss')
        elif pnl_pct >= trade.take_profit_pct:
            await close_position(trade, reason='take_profit')

That await db.commit() before the API calls? Learned that one the hard way. Holding database connections open while waiting for external APIs is a great way to exhaust your connection pool.
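The exit check in that loop reduces to a pure function, which is exactly what makes it easy to reason about. A sketch with the thresholds pulled out (the function name is mine, not from the codebase):

```python
def exit_reason(entry_price, current_price, stop_loss_pct, take_profit_pct):
    """Return 'stop_loss', 'take_profit', or None for a position."""
    pnl_pct = (current_price - entry_price) / entry_price
    if pnl_pct <= -stop_loss_pct:
        return 'stop_loss'
    if pnl_pct >= take_profit_pct:
        return 'take_profit'
    return None

# Bought at $100 with a 5% stop: $94 is a -6% move, so the stop fires.
print(exit_reason(100.0, 94.0, 0.05, 0.10))   # -> stop_loss
print(exit_reason(100.0, 111.0, 0.05, 0.10))  # -> take_profit
print(exit_reason(100.0, 101.0, 0.05, 0.10))  # -> None
```

No database, no broker, no awaits: just arithmetic you can test in milliseconds.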

The Infrastructure (It's Normal)

Nothing exotic here:

  • Backend: ECS Fargate running FastAPI
  • Workers: Celery for scheduled tasks
  • Database: Aurora PostgreSQL
  • Cache: ElastiCache (Redis)
  • ML: SageMaker for training and inference
  • Scheduling: EventBridge to trigger everything
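For a sense of what "EventBridge triggers everything" means in practice, here are illustrative schedule expressions for the two recurring jobs the post describes. The post only states "4 AM" for screening and "every couple of minutes" for the monitor, so the exact cron expressions and the UTC offset are assumptions:

```python
# Hypothetical EventBridge schedule expressions (EventBridge cron fields are
# minutes hours day-of-month month day-of-week year, all in UTC).
SCHEDULES = {
    "premarket_screening": "cron(0 8 ? * MON-FRI *)",  # ~4 AM ET, ignoring DST
    "position_monitor":    "rate(2 minutes)",
}

for name, expr in SCHEDULES.items():
    print(f"{name}: {expr}")
```

Each expression maps to one EventBridge rule whose target kicks off the corresponding Celery task.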

The deployment is GitHub Actions pushing to ECR, then rolling ECS updates. Every deployment includes a manual check of the logs to verify things are actually working. (Yes, I verify. Every time. I've been burned too many times.)

Why Simple Actually Works

Debugging is trivial now. When a trade happens, I check one log line: "RL made decision" or "fallback made decision." Done.

Testing is possible again. I can write an end-to-end test that goes from watchlist to broker call. It runs in seconds. It's deterministic.
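A sketch of what such a test can look like, with the decision rule inlined and the broker stubbed out (every name here is a hypothetical stand-in, not the real test suite):

```python
# Deterministic pipeline test: fixed prescreening scores in, exact orders out.
SCORES = {
    "AAPL": {"buy_probability": 0.82, "sell_probability": 0.10},  # should buy
    "TSLA": {"buy_probability": 0.55, "sell_probability": 0.40},  # should skip
}

class FakeBroker:
    def __init__(self):
        self.orders = []
    def place_order(self, symbol, quantity, side):
        self.orders.append((symbol, quantity, side))

def run_pipeline(watchlist, broker, quantity=10):
    # Same rule as the fallback: high buy probability, low sell probability
    for symbol in watchlist:
        s = SCORES[symbol]
        if s["buy_probability"] > 0.7 and s["sell_probability"] < 0.3:
            broker.place_order(symbol, quantity, "buy")

broker = FakeBroker()
run_pipeline(["AAPL", "TSLA"], broker)
print(broker.orders)  # -> [('AAPL', 10, 'buy')]
```

Same scores in, same orders out, every single run. Try writing that assertion against three chained LLM agents.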

Operations went from constant firefighting to occasional monitoring. I check dashboards maybe twice a day instead of babysitting alerts constantly.

Performance improved, weirdly enough. A single SageMaker call takes 200ms. The old multi-agent coordination took 2-3 seconds minimum. Simpler is faster.

What I Actually Learned

[Image: The Simplification Journey]

Sophistication is a trap. The most impressive architecture is often the hardest to run. I was so focused on building something that looked like "real AI" that I forgot the goal was to make money trading, not to impress other engineers.

Start with the dumbest thing that could work. If your ML model isn't ready, use rules. If your rules aren't ready, just log what you would have done. Get the pipeline working first. Make it smart later.

Every code path you add is a code path you have to maintain, test, monitor, and debug at 2 AM when it breaks. I went from 150 code paths to 2. My stress levels dropped accordingly.

When something works reliably, resist the urge to complicate it. I kept waiting for the moment when the simple approach would fail and I'd need the sophistication. That moment never came. The simple path just... keeps working.

Next Up

In the next post, I'll cover how I manage LLM context so Claude doesn't forget all these hard-won lessons every time a conversation ends. Turns out that's its own challenge.


This is part 2 of an 8-part series on building a trading system with AI coding agents. Part 1: I Built a Full-Stack AI Trading App with LLMs—Here's What I Learned
