Seven posts on architecture, context management, testing, deployment, ML failures, and operational assumptions.
But here's the real question: Does the system actually work?
In this final post, I'll show you a week of paper trading performance—the proof that all those lessons, fixes, and patterns resulted in a functioning trading system.
Important disclaimer: This is paper trading, not live trading with real money. Paper trading means:
- No slippage (you get the quoted price)
- No emotional pressure (it's not real money)
- No market impact (your orders don't affect prices)
- No execution delays (fills are instant)
Why it still matters: Paper trading proves the system works end-to-end. The architecture executes, the ML models make decisions, positions open and close, stop-losses trigger, and the system operates reliably. Real trading adds complexity, but you can't get to real trading without first proving the system works in paper.
Also: One week is not statistically significant. This is a proof of concept, not a performance guarantee.
Trading Flow: V2 execution path
- Premarket Screening (4:00 AM ET): ML prescreening models + technical indicators → watchlist of 50 stocks
- RL Portfolio Decision (9:30 AM - 4:00 PM ET, every 5 min): SageMaker DDPG model allocates capital across candidates + existing positions
- Deterministic Fallback: Decision tree/random forest when RL unavailable
- Position Monitoring (every 1-2 min): Stop-loss (5%) and take-profit (10%) checks
- Cash Management: Auto-liquidate worst performers if buying power insufficient
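The RL-with-deterministic-fallback step above can be sketched as a small decision function. This is a minimal illustration, not the actual codebase: `rl_infer` and `fallback_infer` are hypothetical callables standing in for the SageMaker endpoint and the decision-tree fallback.

```python
from dataclasses import dataclass

@dataclass
class Allocation:
    symbol: str
    weight: float        # fraction of portfolio capital to allocate
    trained_model: bool  # True when the RL model produced the decision

def decide_allocations(candidates, rl_infer, fallback_infer):
    """Try the RL model first; on any failure, fall back to the
    deterministic model and flag the decision accordingly."""
    try:
        weights = rl_infer(candidates)
        trained = True
    except Exception:
        weights = fallback_infer(candidates)
        trained = False
    return [Allocation(sym, w, trained) for sym, w in weights.items()]
```

Tagging each decision with `trained_model` is what makes the "RL Model Usage %" health metric later in this post measurable at all.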
Starting Capital: $200,000 (paper)
Risk Parameters:
- Max position size: 25% of portfolio
- Stop-loss: 5% below entry
- Take-profit: 10% above entry
- Max daily loss: $1,000
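The risk parameters above translate naturally into a frozen config object plus a position-size guard. A minimal sketch with the numbers from this post; the names are illustrative, not from the real system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RiskParams:
    max_position_pct: float = 0.25   # max 25% of portfolio in one position
    stop_loss_pct: float = 0.05      # exit 5% below entry
    take_profit_pct: float = 0.10    # exit 10% above entry
    max_daily_loss: float = 1_000.0  # halt new buys past this daily loss

def max_order_value(params: RiskParams, equity: float,
                    current_position_value: float) -> float:
    """Largest BUY (in dollars) that keeps a position under the size cap."""
    cap = params.max_position_pct * equity
    return max(0.0, cap - current_position_value)
```

With $200,000 equity and a $30,000 existing position, the cap is $50,000, so at most $20,000 more can be bought.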
Note: The actual performance data from Jan 20-24 should be inserted here from trading logs or database queries. The template below shows the structure.
Portfolio Performance:
- Starting Equity: $200,000
- Ending Equity: $[TBD]
- Total Return: [TBD]%
- S&P 500 Return (same period): [TBD]%
- Alpha: [TBD]% (portfolio return - S&P 500 return)
Trade Statistics:
- Total Trades: [TBD] (BUY + SELL)
- BUY Orders: [TBD]
- SELL Orders: [TBD]
- Win Rate: [TBD]% (profitable trades / total closed trades)
- Average Gain (winners): [TBD]%
- Average Loss (losers): [TBD]%
System Health:
- RL Model Usage: [TBD]% (vs fallback)
- Task Success Rate: [TBD]% (Celery workers)
- Uptime: [TBD]% (no critical outages)
- BUY:SELL Ratio: [TBD]:1 (balanced, not extreme)
Market Conditions: [Describe market - volatile/calm/trending]
System Behavior:
- Premarket screening: [X] candidates identified
- RL model decisions: [Y] positions opened
- Stop-losses triggered: [Z]
- Take-profits triggered: [W]
Performance:
- Day P&L: [TBD]%
- Trades: [X] BUY, [Y] SELL
- Notable: [Any interesting system behavior or trades]
Market Conditions: [Describe]
System Behavior:
- Premarket screening: [X] candidates
- RL model decisions: [Y] positions
- Cash management: [If triggered, describe liquidation]
Performance:
- Day P&L: [TBD]%
- Trades: [X] BUY, [Y] SELL
- Notable: [Any interesting events]
[Same structure as Mon/Tue]
[Same structure]
[Same structure]
Chart/Visualization: Portfolio value vs SPY (S&P 500 ETF) over the week
Key Observations:
- Outperformance days: [List days where portfolio beat S&P 500]
- Underperformance days: [List days where portfolio lagged]
- Correlation: [Low/medium/high - how closely did portfolio track the market?]
- Volatility: [Was portfolio more/less volatile than S&P 500?]
Analysis:
- [Discuss what worked: Did RL model find good opportunities? Did stop-losses protect downside?]
- [Discuss what didn't: Any losses? Why? Market regime mismatch?]
This is where the operational lessons paid off.
Celery Task Health:
- execute_active_strategies: [X] runs, [Y]% success rate
- check_stop_loss_take_profit: [X] runs, [Y]% success rate
- update_open_positions: [X] runs, [Y]% success rate
RL Model Performance:
- Inference requests: [X]
- Successful inferences: [Y] (trained_model=True)
- Fallback usage: [Z] (should be 0 or very low)
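The RL usage metric above can be pulled straight from structured logs. A sketch assuming log lines carry a `trained_model=` field, as described earlier in the series; the log format here is an assumption, not the system's actual format:

```python
def rl_usage_rate(log_lines):
    """Fraction of inference decisions served by the trained RL model,
    scanned from log lines containing a trained_model= marker.
    Returns None when no decisions were logged."""
    decisions = [line for line in log_lines if "trained_model=" in line]
    if not decisions:
        return None
    trained = sum("trained_model=True" in line for line in decisions)
    return trained / len(decisions)
```

A rate below 1.0 means the fallback fired, which is exactly the silent-failure mode an earlier post in this series was about.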
Database Health:
- Connection pool utilization: < 80% (healthy)
- Transaction durations: < 1 second (healthy)
- No "operation in progress" errors
Zero Critical Incidents:
- No event loop errors
- No connection pool exhaustion
- No silent RL model failures
- No stop-loss monitoring failures
The validation: All the patterns from previous posts (dispose/recreate engine, commit before API calls, verification policies) resulted in a reliable system.
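The commit-before-API-calls pattern referenced above is worth showing concretely. A minimal sketch: `session` stands in for any object with `add()`/`commit()` (e.g. a SQLAlchemy `Session`), and `broker_submit` is a hypothetical broker call, not a real API.

```python
def place_order(session, order, broker_submit):
    """Commit-before-API-call: persist the order row first, so a broker
    timeout can never leave a submitted order with no database record."""
    session.add(order)
    session.commit()      # durable record exists even if the next line raises
    broker_submit(order)  # external call happens only after the commit
```

The ordering is the whole point: if the broker call fails, you still have a row to reconcile against, instead of a phantom fill.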
The DDPG model, trained via 20 hyperparameter tuning jobs, successfully:
- Allocated capital across multiple positions
- [Adapted to market conditions / maintained discipline]
- [Specific example of good decision]
Key metric: trained_model=True on [X]% of decisions (target: 100%)
Stop-losses triggered [X] times, protecting downside:
- [Example: SYMBOL down 6%, stop-loss at 5% limited loss]
- Average loss when stopped out: [Y]% (close to 5% target)
Key metric: Positions closed within expected range of stop-loss percentage
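The stop-loss/take-profit check itself is a small pure function over entry and current price. A sketch for long positions, using this post's 5%/10% defaults; the function name is illustrative:

```python
def exit_signal(entry_price, current_price,
                stop_loss_pct=0.05, take_profit_pct=0.10):
    """Return 'STOP_LOSS', 'TAKE_PROFIT', or None for a long position."""
    change = (current_price - entry_price) / entry_price
    if change <= -stop_loss_pct:
        return "STOP_LOSS"
    if change >= take_profit_pct:
        return "TAKE_PROFIT"
    return None
```

Because checks run every 1-2 minutes rather than tick-by-tick, actual exits land slightly past the thresholds, which is why "average loss close to the 5% target" is the metric rather than exactly 5%.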
When buying power insufficient, system auto-liquidated worst performers:
- [X] instances of cash management triggered
- [Freed $Y to execute Z new buys]
Key metric: Never failed to execute BUY due to insufficient cash
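The worst-performer liquidation logic can be sketched as a greedy selection: rank open positions by unrealized P&L and sell from the bottom until enough buying power is freed. Illustrative only; the real system's data model will differ.

```python
def positions_to_liquidate(positions, cash_needed):
    """Pick worst performers (lowest unrealized P&L %) until enough buying
    power is freed. `positions` maps symbol -> (market_value, pnl_pct)."""
    ranked = sorted(positions.items(), key=lambda kv: kv[1][1])  # worst first
    freed, chosen = 0.0, []
    for symbol, (value, _pnl) in ranked:
        if freed >= cash_needed:
            break
        chosen.append(symbol)
        freed += value
    return chosen
```

Selling losers first doubles as a crude risk control: the positions most likely to hit their stop-loss anyway are the ones converted to cash.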
Zero critical outages. Tasks ran every interval without failure.
Key metric: 99.9%+ uptime, [X]% task success rate
[If portfolio underperformed on high-volatility days, discuss why. Did RL model struggle? Did stop-losses trigger too early?]
[If system bought stocks that immediately reversed, discuss. Was prescreening too aggressive? Did RL model overfit?]
[If positions held too short/long, discuss. Should take-profit be adjusted? Should RL model factor in holding costs?]
This week of paper trading proves:
- Architecture works: V2 flow executes reliably from premarket to position monitoring
- ML models integrate: RL model makes real decisions, fallback works when needed
- Operations are solid: No critical failures, tasks run on schedule, monitoring catches issues
- Risk controls work: Stop-losses trigger, position sizing respects limits, cash management functions
This is what "done" looks like for a data scientist building production ML systems:
- Not just "model trains"
- Not just "API returns 200"
- But: Deployed, monitored, and producing results
Even if paper trading. Even if one week. Even if not perfect.
The system works end-to-end, and I can verify it works through logs, metrics, and outcomes.
- More paper trading: 30-90 days to establish baseline performance
- Regime testing: How does system perform in high volatility? Low volatility? Bear markets?
- Stress testing: What happens with 100+ positions? Connection failures? API outages?
- Risk refinement: Tune stop-loss, take-profit, position sizing based on observed behavior
- RL model iteration: Retrain with new data, try different reward functions
- Prescreening enhancement: Add more features, try ensemble models
- Infrastructure scaling: Move to Aurora for database, add read replicas
- Monitoring dashboards: Real-time visualization of system health and performance
Only when:
- Paper performance consistent over 3+ months
- Stress tests pass
- All P1 risk controls implemented (broker-level stop-losses, live trading toggle with confirmation)
- Capital allocated is money I can afford to lose
And even then: Start with small capital ($1,000-$5,000), scale gradually based on performance.
- Paper trading proves the system works: Architecture, ML models, operations, risk controls—all functional end-to-end.
- One week isn't statistically significant: This is proof of concept, not performance guarantee.
- System reliability matters more than returns: A system that works consistently beats a system that crashes, even if the latter has higher theoretical returns.
- Operational discipline pays off: All the lessons (dispose/recreate, commit before API calls, verification policies) resulted in zero critical failures.
- "Done" = deployed + monitored + producing results: Not just "code works" but "system operates reliably."
- Verification is continuous: Even after this week, continue monitoring. New market conditions reveal new edge cases.
- AI coding agents accelerated this, but verification was on me: LLMs wrote much of the code, but I had to verify it worked, monitor it, and encode the lessons.
When I started this project, I was a data scientist comfortable with notebooks but inexperienced with production systems.
AI coding agents (Claude via Cursor) accelerated my journey dramatically:
- Scaffolded the API, workers, frontend, infrastructure
- Generated Celery tasks, database models, CI/CD workflows
- Proposed fixes when bugs arose
But the real work was:
- Managing agent context so patterns don't repeat mistakes
- Verifying deployments actually worked (not just "deployed successfully")
- Building operational discipline (runbooks, checklists, monitoring)
- Ensuring ML models were integrated and verified (not just trained)
- Learning that "scheduled" ≠ "working" and "committed" ≠ "running"
The bottleneck shifted from "can I write this code?" to "can I verify it works and operate it reliably?"
This series documented that shift—and the patterns that made the difference.
If you're a data scientist looking to ship production ML systems, AI coding agents can get you 80% of the way there. The last 20%—verification, operations, monitoring—is on you.
But it's learnable. And this series is the map.
Did the system work for one week of paper trading? Yes.
Is it ready for live trading? Not yet. More testing needed.
Was it worth building? Absolutely. I went from notebooks to a deployed, monitored, operational trading system in a few months—something that would have taken years without AI assistance.
More importantly: I learned how to build production ML systems. The patterns, the verification discipline, the operational thinking—these transfer to any ML project, not just trading.
That's the real win.
This is Post 8 (final) of an 8-part series on building a full-stack AI trading application with LLM coding agents.
The Full Series:
- I Built a Full-Stack AI Trading App with LLMs—Here's What I Learned
- From Multi-Agent Chaos to a Single Execution Path
- Managing LLM Context So Your AI Coworker Doesn't Forget
- Why 94% Test Coverage Didn't Stop Our Trading System From Failing
- CI/CD and Deployment When AI Writes the Code
- The ML Model Was 'Live' for 6 Days—It Never Made a Single Decision
- "Scheduled ≠ Working" and Other Expensive Assumptions
- Proof It Works: A Week of Paper Trading Performance (this post)
Connect with me: [Your LinkedIn/Twitter]
Want to discuss?: [Your email or contact method]
Interested in the code?: [If you plan to open-source, link here]