
@ttarler
Created January 27, 2026 01:34
Medium Post 1: I Built a Full-Stack AI Trading App with LLMs (comprehensive diagrams)

I Built a Full-Stack AI Trading App with LLMs: Here's What I Learned

From Notebooks to Production with AI Pair Programmers

As a professional data scientist, I've spent years building ML models in Jupyter notebooks. But I wanted to go beyond research. I wanted to deploy a full-stack, production-grade AI trading system. The catch? I needed to supplement my data science expertise with full-stack development, infrastructure, and operations knowledge I didn't have.

Enter AI coding agents.

Over the past few months, I used LLMs (primarily Claude via Cursor) to build an automated trading system from scratch: FastAPI backend, Celery workers, Next.js frontend, PostgreSQL database, AWS infrastructure (ECS, SageMaker, EventBridge), and a complete ML pipeline for training and deploying models.

This series documents what I learned. Not just the technical wins, but the expensive failures, the operational surprises, and the patterns I wish I'd known from day one.

What I Built

Here's the system architecture:

Trading System Architecture

The system runs autonomously. Each morning at 4 AM, ML models analyze technical indicators to generate a watchlist of 50 candidate stocks. During market hours, a reinforcement learning model (trained on SageMaker) decides which positions to open, how to size them, and when to exit. If the RL model is unavailable, a deterministic decision tree takes over using predefined risk parameters.
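The RL-with-fallback routing can be sketched roughly like this. Everything here is an illustrative stand-in, not the system's actual code: the `Decision` type, the `rsi` thresholds, and the `rl_endpoint` callable are all hypothetical. The key design point (foreshadowing a lesson below) is that the fallback is loud, never silent:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    action: str   # "BUY", "SELL", or "HOLD"
    source: str   # which policy produced the decision

def decide(features, rl_endpoint=None):
    """Route a decision through the RL model, falling back to
    deterministic rules -- and logging the fallback, never hiding it."""
    if rl_endpoint is not None:
        try:
            return Decision(action=rl_endpoint(features), source="rl_model")
        except Exception as exc:
            # Loud fallback: in production this would be a logged alert.
            print(f"RL endpoint failed ({exc}); using rule-based fallback")
    # Deterministic fallback with hypothetical risk parameters.
    rsi = features.get("rsi", 50)
    if rsi < 30:
        return Decision(action="BUY", source="rules")
    if rsi > 70:
        return Decision(action="SELL", source="rules")
    return Decision(action="HOLD", source="rules")
```

Tagging every decision with its `source` also makes it trivial to detect later whether the model is actually being called.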

The stack:

  • Backend: FastAPI (Python, async) + SQLAlchemy + PostgreSQL (Aurora)
  • Workers: Celery + Redis for scheduled tasks
  • Frontend: Next.js (React/TypeScript) dashboard
  • ML: SageMaker (training, hyperparameter optimization, inference endpoints)
  • Infrastructure: AWS (ECS/Fargate, EventBridge, ALB, ElastiCache), Terraform

This isn't a portfolio of ML models. It's a production system with deployment pipelines, monitoring, runbooks, error handling, and all the operational concerns you'd expect from a real service.

Does It Actually Work?

Here's the week of January 20-24, 2026 (paper trading):

Portfolio Performance vs S&P 500

This is paper trading, not live money. But it proves the system works end to end: the ML models make decisions, positions open and close, stop-losses trigger, and the infrastructure runs reliably. The real test will be longer time periods and eventually live trading with real capital.

The Shift: From Coding to Context, Verification, and Operations

Here's the surprising part: LLMs accelerated coding dramatically, but coding wasn't the bottleneck.

What LLMs did well:

  • Scaffolding: API routes, database models, Celery task boilerplate, Terraform modules
  • Fill-in code: Converting requirements to implementations, writing SQL queries, creating CI/CD workflows
  • Documentation: Generating initial docs, README files, deployment procedures
  • Debugging: Suggesting fixes for errors, explaining stack traces, proposing alternative approaches

What LLMs struggled with:

  • Context management: Sessions end and agents forget. Without structure, they repeat mistakes or ignore critical patterns.
  • Verification: Agents would claim features were "deployed and working" without checking if they actually worked in production.
  • Operations: Understanding that "scheduled" does not mean "working," that containers need restarting, that metrics matter more than code.
  • Integration: Ensuring training and inference use the same features, that tests verify behavior not just execution, that deployments are actually running new code.

The real bottleneck became managing LLM context, verifying deployments actually worked, building operational discipline, and establishing patterns that prevented recurring failures.

The Expensive Lessons

Over the course of this project, I encountered failures that cost hours, days, and in one case, rendered an ML model completely useless for a week. Here are the themes that emerged.

1. "Scheduled" Does Not Mean "Working"

A Celery task can be perfectly scheduled to run every 2 minutes and fail every single time. My stop-loss monitoring ran for 7 days without closing a single position: 83 positions accumulated, a 30:1 BUY:SELL ratio. Why? I had checked the scheduler logs but not the worker logs.

Lesson: Check outcomes and worker logs, not just schedules.
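One way to enforce that lesson is an outcome check that runs independently of the scheduler. This is a minimal sketch, assuming task runs are recorded somewhere queryable (the `executions` record shape here is hypothetical); a task that fires on schedule but fails every time trips it:

```python
from datetime import datetime, timedelta

def assert_task_is_working(executions, window_hours=24, min_successes=1):
    """Alert on *outcomes*, not schedules: raise if a task ran recently
    but did not succeed at least `min_successes` times."""
    cutoff = datetime.utcnow() - timedelta(hours=window_hours)
    recent = [e for e in executions if e["finished_at"] >= cutoff]
    successes = [e for e in recent if e["status"] == "success"]
    if len(successes) < min_successes:
        raise RuntimeError(
            f"{len(recent)} runs in the last {window_hours}h but only "
            f"{len(successes)} succeeded -- scheduled is not working"
        )
    return len(successes)
```

Wired into a daily health check, this would have flagged the stop-loss task on day one instead of day seven.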

2. Tests That Lie

94% test coverage. All tests passing. System completely broken in production. Why? The tests verified execution (the task returns success), not behavior (positions actually close).

Lesson: Integration tests beat unit tests for critical paths. Test behavior, not just code execution.
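The difference is easiest to see in code. A sketch of what a behavior test looks like, with a simplified stand-in for the real stop-loss logic (the `position` dict shape and field names are illustrative): the assertion targets the state change, not the fact that the function returned.

```python
def close_position_if_stopped(position, last_price):
    """Close a position when price breaches its stop-loss."""
    if last_price <= position["stop_loss"]:
        position["status"] = "closed"
        position["exit_price"] = last_price
    return position

def test_stop_loss_actually_closes_position():
    # Behavior test: assert on the outcome, not on "the task ran".
    pos = {"symbol": "XYZ", "stop_loss": 95.0, "status": "open"}
    result = close_position_if_stopped(pos, last_price=94.0)
    assert result["status"] == "closed"     # the position really closed
    assert result["exit_price"] == 94.0     # at the breaching price
```

An execution-only test would merely assert that `close_position_if_stopped` returned without raising, which passes even when the close logic is broken.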

3. The ML Model Was "Live" for 6 Days But Never Made a Single Decision

The RL model was deployed and the agent reported "working." In reality: feature mismatch (18 features in training, 3 at inference), silent fallback to heuristics, zero real predictions for 6 days.

Lesson: Training-inference feature parity is non-negotiable. No silent fallbacks. Verify models are actually called.
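A simple guard that would have caught the 18-vs-3 mismatch on the first request: validate the incoming feature vector against the feature list saved with the model, and fail loudly on any difference. The feature names below are hypothetical placeholders (the real system had 18):

```python
# Hypothetical feature list; in practice, save the real names
# alongside the model artifact at training time.
TRAINING_FEATURES = ["rsi_14", "macd", "sma_50"]

def validate_features(payload, expected=TRAINING_FEATURES):
    """Raise at inference time if the features do not exactly match
    what the model was trained on -- no silent fallback."""
    missing = [f for f in expected if f not in payload]
    extra = [f for f in payload if f not in expected]
    if missing or extra:
        raise ValueError(f"feature mismatch: missing={missing}, extra={extra}")
    return [payload[f] for f in expected]  # fixed training-time order
```

Returning the values in the training-time order also guards against a subtler failure: same features, shuffled columns.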

4. Committed Code Does Not Equal Running Code

I'd commit a fix, push to GitHub, and assume it was running. It wasn't. Docker containers kept old code until restarted.

Lesson: Always restart affected services. Check logs. Verify the new code path is executing.
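One cheap way to make that verification automatic: bake the commit SHA into the image at build time (e.g. as a `BUILD_SHA` env var or a `/version` endpoint — both assumptions here, not the project's actual setup) and compare it against the commit you just pushed:

```python
def verify_deployment(running_sha, expected_sha):
    """Fail if the running container reports a different commit than
    the one just deployed -- i.e. the container is still on old code."""
    if running_sha != expected_sha:
        raise RuntimeError(
            f"stale deployment: running {running_sha[:8]}, "
            f"expected {expected_sha[:8]}"
        )
    return True
```

Run as the last step of a deploy script, this turns "I assume it's running" into a hard pass/fail check.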

5. Context Is a Product

Without structured documentation, LLM agents repeat mistakes across sessions. I built CLAUDE.md, AI_AGENT_QUICK_REFERENCE.md, and .ai-knowledge-base.yaml to encode patterns, anti-patterns, and verification steps.

Lesson: Treat agent context as a first-class deliverable. Single source of truth beats ad-hoc instructions.

What's Coming in This Series

This introduction sets the stage. The rest of the series dives deep into each lesson:

Post 2: Architecture Evolution (From multi-agent chaos to a single execution path)

Post 3: Managing LLM Context (How I structured docs so agents don't forget critical patterns)

Post 4: Why 94% Coverage Didn't Help (Testing behavior vs execution, integration vs unit tests)

Post 5: CI/CD with AI-Written Code (Deployment discipline, verification policies, runbooks)

Post 6: The ML Model That Never Ran (Feature parity, no silent fallbacks, end-to-end verification)

Post 7: "Scheduled Does Not Mean Working" (Operational assumptions, metrics that matter, async+DB patterns)

Post 8: Proof It Works (A week of paper trading performance vs S&P 500)

Each post includes real code snippets, anonymized mishaps, and actionable practices you can adapt.

Why This Matters for Data Scientists

If you're a data scientist looking to ship production ML systems (not just notebooks), AI coding agents can accelerate your journey. But they introduce new failure modes. The bottleneck shifts from "can I write this code?" to "can I verify it works, maintain context across sessions, and operate it reliably?"

This series is for data scientists who want to use LLMs as force multipliers while avoiding the expensive lessons I learned. It's also about strengthening your professional brand: demonstrating you can think in systems, not just models.

Let's Go

Over the next 7 posts, I'll walk through the architecture, the failures, the patterns, and the proof that it works. The goal isn't just to document what I built. It's to help you build better.

Ready? Let's start with how the architecture evolved from multi-agent chaos to a single, testable execution path.


This is Post 1 of an 8-part series on building a full-stack AI trading application with LLM coding agents. Next: Architecture Evolution.
