@ttarler
Last active March 9, 2026 02:33

Building the ML Pipeline: From LSTM Experiments to Production RL

Two-Stage ML Architecture

The ML System That Actually Trades

Go read any ML trading tutorial online. It's always the same: download some OHLCV data, train a model, show a backtest chart that goes up and to the right, collect Medium claps.

Nobody talks about what happens when you try to run that model every single day. When you need to retrain monthly without breaking the thing that's already trading. When the feature your model was trained on doesn't exist at inference time. When your processing job takes 8 hours and your data is stale by the time training starts.

That's what this post is about. Not the model—the machine around the model.

Two-Stage Architecture: Why One Model Wasn't Enough

My first attempt was the obvious one. One model. Market data goes in, trading decisions come out. Simple.

It didn't work. At all.

The problem is that "should I trade this stock?" and "how should I size this position?" are completely different questions. Scanning 8,000 symbols for the 50 worth considering is a classification problem. Random forests eat that for breakfast. But deciding how much to buy given your current holdings, cash, and risk limits? That's a sequential decision problem. Reinforcement learning fits way better.

So I split it:

┌─────────────────────────────────────────────────────────────────────┐
│  PRESCREENING MODELS (Random Forest + XGBoost)                      │
│  Input: 8,000+ symbols with 40+ technical features                  │
│  Output: buy_probability, sell_probability for each symbol          │
│  Job: Filter to top 50 candidates                                   │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│  RL PORTFOLIO MODEL (PPO/SAC)                                       │
│  Input: 50 candidates + current holdings + prescreening scores      │
│  Output: BUY / HOLD / SELL action for each position                 │
│  Job: Optimal portfolio allocation                                  │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
                          Order Execution

The trick that makes this work: prescreening scores become input features to the RL model. The RL model doesn't blindly trust the prescreening—it takes those signals and weighs them against technical indicators and current position state before making the final call.

From LLM Agents to ML Models


Before I built any of this, the system was using something way more "impressive"—LLM agents making trading decisions. Three of them, no less:

  • Prescreening Agent: Analyzed market data, decided what to research
  • Portfolio Agent: Made allocation decisions based on agent discussions
  • Risk Agent: Reviewed decisions for risk compliance

Sounded like a hedge fund. In reality it was three LLMs arguing with each other at 2 seconds per API call while the market moved on without them.

Latency was 3-6 seconds per decision. Nondeterministic—same inputs, different trades. Debugging was "which of the three agents decided to buy TSLA?" (Usually the answer was "all of them, sort of.") Costs were $0.01-0.05 per decision in API tokens.

Replacing LLM agents with ML models: 200ms decisions, deterministic, $0.0001 per decision. Same quality or better, since the ML model is actually trained on outcomes instead of vibes.

Metric              LLM Agents               ML Models
Latency             3-6 seconds              200ms
Determinism         Nope                     Yes
Cost per decision   $0.01-0.05               ~$0.0001
Testability         Good luck                Standard pytest
Debugging           "Which agent decided?"   Check one output

The LSTM Detour

The Evolution: From LSTM to Two-Stage ML

I wasted weeks on LSTMs. The logic seemed airtight: stock prices are time series, LSTMs handle sequences, therefore LSTMs predict stocks. QED.

Trained one on OHLCV data. Backtests showed 70%+ accuracy on the holdout. Deployed it. Completely useless live.

The problem was that I was asking the wrong question. My LSTM was answering "Will AAPL go up tomorrow?" when what I actually needed was "Should I buy AAPL vs GOOGL vs MSFT today?" Those are fundamentally different. LSTMs are great at the first one. For the second—cross-sectional ranking with tabular features—tree models just work better. They train in minutes instead of hours and you can actually see which features matter.

The LSTM code is still in the repo, commented out. A monument to premature sophistication.

Feature Engineering: 40+ Indicators and Some Fundamentals

The prescreening models chew through 40+ features. I'm not going to pretend I designed this set on day one—most came from the classic loop of "add features, retrain, check if AUC goes up."

The usual suspects:

  • Price vs moving averages (SMA20, SMA50, SMA200)
  • Momentum (RSI, rate of change, 5/10/20-day momentum)
  • Trend strength (MACD, ADX, directional movement)
  • Volume (relative volume, OBV slope, VWAP distance)
  • Volatility (10/20/30-day, Bollinger bands, ATR, and my favorite—vol of vol, which is a solid regime change indicator)
  • Market context (VIX level, VIX change, regime encoding, relative strength vs SPY)

The newer stuff—fundamentals (6 features):

  • market_cap_log, beta, pe_ratio_norm
  • debt_to_equity_norm, roe_norm, revenue_growth

These came from a specific frustration: the model kept treating technically similar stocks the same way, even when one was a profitable mega-cap and the other was a cash-burning small-cap. Pure technicals can't tell you that. We pull the data from Financial Modeling Prep's API, normalize it, and cache results in S3 with a 30-day TTL. A dedicated pipeline stage grabs sector mappings once and writes a shared sector_map.json so all 8+ processing partitions reuse it instead of each one hitting the API.

The RL Model: Making It Trainable

Early RL experiments used weight-based allocation: model outputs a weight per candidate, normalize to sum to 1. Problem: if the model shifts weights, positions vanish. Stock you bought yesterday at 5% allocation? Today the model says 0%. You sell at a loss because the model forgot about you.

Version 2 uses explicit actions:

# V1: Weight-based (whipsaw city)
# Model outputs: [0.4, 0.3, 0.2, 0.1, 0.0] → allocate 40% to AAPL, etc.
# Problem: Holdings vanish if not in top weights

# V2: Action-based (much better)
# Model outputs: [-0.8, 0.5, 0.1, -0.5, 0.7] in range [-1, 1]
# Interpretation:
#   > 0.3  → BUY (strength proportional to value)
#   < -0.3 → SELL (only if currently holding)
#   else   → HOLD (do nothing)

Now the model has to explicitly say "sell this." Positions stick around until the model actually wants them gone.
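A minimal sketch of the V2 action mapping (the function name and exact thresholds here are illustrative, not necessarily what's in the repo):

```python
def interpret_action(action: float, currently_holding: bool,
                     buy_threshold: float = 0.3,
                     sell_threshold: float = -0.3) -> tuple[str, float]:
    """Map a raw model output in [-1, 1] to a trade decision.

    Returns (decision, strength). SELL only applies to existing
    positions, so a negative score on a stock we don't own is a no-op.
    """
    if action > buy_threshold:
        return ("BUY", action)            # strength proportional to value
    if action < sell_threshold and currently_holding:
        return ("SELL", abs(action))
    return ("HOLD", 0.0)
```

The asymmetry is the point: an unwanted candidate is simply never bought, while an existing position survives until the model emits an explicit sell.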

The model sees 27 features per candidate (grew from 21 recently):

RL_MODEL_FEATURES = [
    # Technical features (16)
    "return", "log_return",
    "sma_fast", "sma_slow", "ema_fast", "ema_slow",
    "ret_roll", "vol_roll", "vol_30",
    "rsi_14", "macd_line", "macd_signal",
    "bb_upper", "bb_lower", "atr_14", "mom_10",

    # Prescreening scores (2) — from the RF/XGBoost model
    "buy_probability", "sell_probability",

    # Intraday context (3)
    "opening_gap_pct", "intraday_momentum_1h", "relative_strength_vs_spy",

    # Fundamental features (6) — added recently
    "market_cap_log", "beta", "pe_ratio_norm",
    "debt_to_equity_norm", "roe_norm", "revenue_growth",
]

Configurable Architecture

I got tired of hard-coding [256, 256] hidden layers and hoping that was close enough. The network architecture—hidden sizes, layer norm, activations—is now part of the HPO search space. The optimizer explores [128,128], [256,128], [256,256], and [256,256,128] as categorical options. Architecture metadata gets saved to the checkpoint so inference can reconstruct the exact same topology. One less thing to get wrong at deploy time.
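The idea can be sketched like this (helper name is mine; the real checkpoint format will differ):

```python
def build_mlp_spec(input_dim: int, hidden_sizes: tuple, output_dim: int):
    """Expand a hidden-size tuple into (in, out) pairs, one per linear layer.

    Because `hidden_sizes` is saved in the checkpoint metadata, inference
    can call this with the same values and rebuild an identical topology.
    """
    dims = [input_dim, *hidden_sizes, output_dim]
    return list(zip(dims[:-1], dims[1:]))

# The categorical options the HPO search explores
ARCH_CHOICES = [(128, 128), (256, 128), (256, 256), (256, 256, 128)]
```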

Reward Engineering


If you train an RL model with raw P&L as the reward, you get a model that chases short-term wins and blows up on drawdowns. Ask me how I know.

I added three reward shaping components:

  • Drawdown penalty: Quadratic. Gets ugly fast when the portfolio dips below threshold. The model learns that avoiding big losses matters more than chasing small wins.
  • Holding period bonus: Encourages holding winners longer. Without this, the model over-trades like crazy—every iteration is an opportunity to "lock in gains" that would have been bigger if it just waited.
  • Momentum alignment: Small bonus for trading with the trend. Penalizes fighting the tape.

All three have HPO-tunable coefficients. I don't hand-tune reward functions anymore. SageMaker finds the balance.
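The three shaping terms look roughly like this (coefficient values here are placeholders; in the real system they come out of HPO):

```python
def shaped_reward(step_pnl: float, drawdown: float, holding_days: int,
                  action_sign: int, trend_sign: int,
                  dd_threshold: float = 0.05, dd_weight: float = 10.0,
                  hold_bonus: float = 0.001, mom_weight: float = 0.01) -> float:
    """Raw P&L plus three shaping components."""
    reward = step_pnl
    # 1. Quadratic drawdown penalty: gets ugly fast past the threshold
    if drawdown > dd_threshold:
        reward -= dd_weight * (drawdown - dd_threshold) ** 2
    # 2. Holding-period bonus: rewards letting winners run (capped)
    if step_pnl > 0:
        reward += hold_bonus * min(holding_days, 20)
    # 3. Momentum alignment: small bonus with the trend, penalty against it
    if action_sign != 0:
        reward += mom_weight * (1 if action_sign == trend_sign else -1)
    return reward
```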

Training Infrastructure: SageMaker + Bayesian HPO

Prescreening (RF + XGBoost):

  • Bayesian HPO, 20 trials per job
  • Metric: out-of-time AUC-ROC on a chronological 20% holdout
  • No inner cross-validation. I learned this one the hard way—adding 5-fold CV inside each HPO trial meant 12 model fits per trial instead of 2. Low-volatility trials were taking 60+ minutes each. Ripped it out; now each trial takes about 5 minutes. HPO's own Bayesian search handles exploration just fine.

RL (PPO):

  • PyTorch + Ray RLlib in a custom container
  • HPO searches over learning rate, clip parameter, entropy coefficient, all the reward coefficients, and hidden layer sizes
  • Backtest evaluation on 60-day validation window (doubled from 30—more data points means less noisy metric estimation)
  • 30-60 minutes per trial with 8 parallel environment runners on ml.g4dn.16xlarge

Warm start was a big recent win. Both prescreening and RL HPO can now start from previous training runs. When a new job kicks off, it auto-discovers up to 5 recent parent jobs and does transfer learning from their best weights. Monthly retraining used to start from scratch every time. Now the model picks up roughly where it left off—converges much faster.

Each job outputs model artifacts to S3, metrics to SageMaker Experiments, feature importance for interpretability, and architecture metadata so inference knows how to rebuild the network.

HPO jobs run in parallel. One per regime (low/moderate/high volatility) and model type (buy/sell). A typical run fires off 6+ jobs at once.
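The fan-out is just a regime-by-model-type product; the naming convention below matches the endpoint names used at inference time (e.g. prescreening-buy-moderate), though the exact job-launching code is not shown here:

```python
from itertools import product

REGIMES = ("low", "moderate", "high")   # VIX buckets
MODEL_TYPES = ("buy", "sell")           # one model per signal direction

def hpo_job_names() -> list[str]:
    """One HPO job per (model type, regime) pair: 6 jobs per run."""
    return [f"prescreening-{m}-{r}" for m, r in product(MODEL_TYPES, REGIMES)]
```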

The Sortino Trap: Why the Objective Metric Matters More Than the Model

This one took embarrassingly long to figure out.

The RL HPO was optimizing for backtest_sortino—the Sortino ratio measured over a 30-day backtest window. Sounds reasonable. Sortino penalizes downside volatility while rewarding returns. What's not to like?

Everything, it turns out.

Problem 1: The denominator gaming. If the model holds 95% cash and makes one tiny trade that doesn't lose money, the downside deviation is essentially zero. Division by near-zero gives you a Sortino ratio of 25,000. HPO sees that and thinks it found the Holy Grail. In reality it found a model that does nothing.

Problem 2: No risk-free rate. The original Sortino calculation used r < 0 as the downside threshold. But with T-bills yielding ~5%, a model returning 2% annualized is losing money in real terms. It shouldn't get credit for "no downside."

Problem 3: 30 data points. Sortino estimated from 30 daily returns is basically noise. The confidence interval is wider than the estimate itself.

The fix was a composite objective metric:

# New HPO objective: requires BOTH alpha AND risk-adjustment
composite = excess_return * min(max(sortino, 0), 3.0)

Where excess_return is portfolio return minus SPY return over the same period. This kills three birds:

  • Cash-hoarding model? SPY goes up, you don't. Negative excess_return. Composite is negative. Dead.
  • Great Sortino but no alpha? ~0 * sortino = ~0. Also dead.
  • Gaming the Sortino denominator? Capped at 3.0. Can't get infinite credit for zero downside.

I also fixed the Sortino calculation itself—added the risk-free rate to the downside threshold, used proper sample standard deviation (ddof=1), set a realistic floor of 1% annualized downside deviation (instead of 1e-6), and hard-capped the ratio at 10.0. And extended the backtest window from 30 to 60 days—twice the data for metric estimation.
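Putting the fixes together, the corrected metric looks roughly like this (a sketch from the description above, not the repo's actual code; the 5% risk-free rate is an assumption):

```python
import numpy as np

TRADING_DAYS = 252

def capped_sortino(daily_returns, rf_annual=0.05,
                   floor_annual=0.01, cap=10.0) -> float:
    """Sortino with a risk-free downside threshold, sample std (ddof=1),
    a 1%-annualized floor on downside deviation, and a hard cap."""
    r = np.asarray(daily_returns, dtype=float)
    excess = r - rf_annual / TRADING_DAYS
    downside = excess[excess < 0]
    floor_daily = floor_annual / np.sqrt(TRADING_DAYS)
    dd = downside.std(ddof=1) if downside.size > 1 else floor_daily
    dd = max(dd, floor_daily)              # no near-zero denominators
    return float(min(excess.mean() / dd, cap))

def composite_objective(portfolio_return, spy_return, sortino) -> float:
    """HPO objective: requires BOTH alpha AND risk-adjustment."""
    excess = portfolio_return - spy_return
    return excess * min(max(sortino, 0.0), 3.0)
```

Note how the cash-hoarding failure mode dies twice: the risk-free threshold makes near-zero returns count as downside, and the composite zeroes out anything without alpha over SPY.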

The deployment gates got stricter too. A model now needs:

  • Non-negative Sortino and excess return (existing)
  • Buy ratio >= 5% (must actually generate buy signals—no cash hoarders)
  • Backtest return >= 0.5% (must produce meaningful absolute returns)

Taming the HPO Search Space

Here's a fun fact about Bayesian optimization: it needs roughly 10x as many trials as search dimensions before it's meaningfully better than random search. My HPO was searching 24 hyperparameter dimensions with 40 trials. That's only 1.7 trials per dimension. The Gaussian Process surrogate model couldn't build a useful response surface; it was basically throwing darts blindfolded.

The fix was surgical: audit every hyperparameter, keep the ones that actually move the needle, fix the rest at sensible defaults.

12 parameters I stopped searching (fixed as static):

  • gamma=0.99 — standard for long horizons, searching [0.95, 0.999] added noise
  • vf-loss-coeff=0.5 — PPO default, never mattered for trading performance
  • num-sgd-iter=10 — standard, not worth a dimension
  • All the narrow-range reward coefficients (extreme-penalty, entropy-weight, vol-sensitivity, churn-penalty, holding-bonus-weight, momentum-alignment-weight) — ranges like [0.0001, 0.005] are too tight for the optimizer to exploit
  • Structural constraints (max-sector-pct, min-hold-for-bonus) — not learning parameters

12 parameters I kept searching:

  • The heavy hitters: lr, clip-param, entropy-coeff, reward-scale, risk-aversion
  • Portfolio construction: max-position-pct, stop-loss-pct, take-profit-pct, target-portfolio-vol
  • Risk management: drawdown-threshold, drawdown-penalty-weight
  • Architecture: hidden-sizes (categorical—moved from static to searchable)

With 12 dimensions and 30 trials (reduced from 40, which saves ~$48/run), the ratio is up to 2.5 trials per dimension. The Bayesian optimizer can now actually learn which regions of hyperparameter space produce good models.

Parallelizing RL Training

The ml.g4dn.16xlarge instance has 64 vCPUs, a beefy GPU, and 64GB of RAM. We were using 2 of those CPUs for environment rollout collection. Two. Out of sixty-four.

PPO is on-policy: the GPU sits idle waiting for environment data between each training step. With 2 workers collecting rollouts, that's a lot of idle GPU time.

Bumped num_workers from 2 to 8 and scaled train_batch_size from 4,000 to 8,000 steps to match. Each worker loads the full multi-asset feature matrix (~200MB), so 8 workers use about 1.6GB, well within the 64GB budget. Going higher risks memory contention, and RLlib's coordination overhead means the speedup stops scaling proportionally anyway.

Expected result: 2-4x faster per training iteration, cutting trial time from 1.5-2 hours down to 30-60 minutes. Combined with reducing total trials from 40 to 30, total HPO wall-clock time should drop by roughly half.

The Full Pipeline: 7 Stages

7-Stage ML Pipeline

What started as a "fetch data and train a model" script turned into a 7-stage orchestrated pipeline. I'm not going to pretend I designed all 7 stages on a whiteboard before writing code. This grew organically as I kept discovering things that needed to happen before training, or after training, or between two things that I thought were adjacent but weren't.

1. Backfill        → Fetch new OHLCV data from Alpaca (incremental)
2. Export OHLCV    → Aurora DB → S3 (because training reads from S3, not the DB)
3. Feature Comp    → Compute technical features from raw OHLCV
4. Processing      → Generate labeled training data with VIX/regime context
5. HPO             → Bayesian hyperparameter optimization for prescreening
6. Score Gen       → Champion models generate buy/sell probabilities
7. RL Pipeline     → Train RL model on pre-scored data

Each stage depends on the previous one finishing. A Lambda orchestrator tracks state in SSM Parameter Store, EventBridge triggers transitions. If stage 3 fails, stages 4-7 don't run and I get an SNS alert. That sounds simple, but getting it to actually work reliably was its own project (more on that below).

Some decisions worth explaining:

Why does "Export OHLCV" exist? Because backfill writes to Aurora DB (the live trading system reads from there) but SageMaker Processing reads from S3 (because that's what SageMaker does). Without this bridge stage, training uses whatever stale S3 data was lying around. I lost a week to this once—ran a full pipeline, got weird results, eventually realized the "freshly processed" data was from four months ago.

Columnar feature serialization. Features used to be stored as JSON blobs—one string per row. Deserializing 7 million rows of json.loads() added 3-4 minutes per HPO trial. Multiply that by 20 trials and six jobs, and you're burning an extra hour just parsing strings. Now we write individual float columns (feat_*) as Parquet. Old JSON data still works via fallback, but the fast path is the default.
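The read-side fallback can be sketched like this (field names like "features" and the feat_* prefix follow the description above; the real reader works on Parquet batches, not single dicts):

```python
import json

def explode_features(record: dict) -> dict:
    """Normalize one row to flat feat_* float columns.

    New data already arrives columnar (feat_rsi_14=55.0, ...). Legacy rows
    carry a single JSON blob under "features"; decode it once and fan the
    values out, so downstream code only ever sees the columnar shape.
    """
    blob = record.pop("features", None)
    if blob is not None:                   # legacy JSON path (slow)
        for name, value in json.loads(blob).items():
            record[f"feat_{name}"] = float(value)
    return record
```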

Sector data gets fetched once. A dedicated stage grabs sector mappings from FMP and writes them to S3 with a 30-day cache. All downstream processing partitions read from that single file. Before this, every partition was independently hammering the API. Eight partitions, same API, same data. Not my finest design.

Getting Processing from 8 Hours to 45 Minutes

Processing Speedup

Computing 40+ indicators across 8,000 symbols was taking forever. I threw three things at it:

Vectorized the indicators. The original code was computing rolling means, standard deviations, OBV, true range, and returns with Python for-loops. Row by row. On millions of data points. Rewriting them as pd.Series.rolling() calls and NumPy vector ops (np.sign(), np.cumsum(), np.diff(), np.maximum()) was the kind of change that makes you wonder why you didn't do it first.

For example, the old OBV calculation:

# Before: Python for-loop (slow)
result = np.zeros_like(close)
for i in range(1, len(close)):
    if close[i] > close[i-1]:
        result[i] = result[i-1] + volume[i]
    elif close[i] < close[i-1]:
        result[i] = result[i-1] - volume[i]
    else:
        result[i] = result[i-1]

# After: one line of NumPy (fast)
result = np.cumsum(np.sign(np.diff(close, prepend=close[0])) * volume)

Same result, roughly 50x faster on a large array.

Multiprocessing. SageMaker ml.m5.2xlarge instances have 8 vCPUs. I was using one. Added a multiprocessing.Pool with 6 workers per instance, shared the read-only VIX and SPY data via initializer, and let imap_unordered chew through symbol batches. Instant 4-5x on each box.
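The shape of that setup, with a stand-in for the real indicator computation (the initializer trick avoids pickling the shared VIX/SPY data into every task):

```python
import multiprocessing as mp

_SHARED = {}  # read-only market context, populated once per worker

def _init_worker(vix_series, spy_series):
    """Runs once in each worker process at pool startup."""
    _SHARED["vix"] = vix_series
    _SHARED["spy"] = spy_series

def _process_symbol(symbol):
    # stand-in for computing 40+ indicators against shared context
    return symbol, len(_SHARED["spy"])

def process_symbols(symbols, vix_series, spy_series, workers=2):
    with mp.Pool(workers, initializer=_init_worker,
                 initargs=(vix_series, spy_series)) as pool:
        # imap_unordered: collect results as workers finish, in any order
        return dict(pool.imap_unordered(_process_symbol, symbols))
```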

Doubled the boxes. Went from 8 to 16 SageMaker Processing instances. The math works out cost-neutral: 16 boxes for 1.5 hours costs the same as 8 boxes for 3 hours. Half the wall-clock time for the same bill.

Net result: ~10x end-to-end. What used to be an overnight job now finishes while I'm still drinking coffee.

Decoupling Scripts from Containers

This was maybe the highest-ROI infrastructure change I made. For weeks, every time I changed a line in a training script, I had to wait for a full Docker container rebuild through CI/CD. Ten to fifteen minutes minimum. When you're iterating on a training script and need to test a fix, that feedback loop will drive you insane.

The fix: a bootstrap.sh entrypoint that downloads the latest scripts from S3 when the container starts. A small GitHub Actions workflow pushes script changes to S3 in about 30 seconds. Containers only rebuild when actual dependencies change—requirements.txt, Dockerfiles, base images.

Lambda orchestrators pass SCRIPT_S3_BUCKET and SCRIPT_VERSION to each job. Terraform pins the version for production. Twenty times faster iteration on script changes. Should have done this on day one.

Pipeline Reliability: The Race Condition Chapter

Running a 7-stage pipeline with EventBridge triggers exposed concurrency bugs I didn't see coming.

The double-trigger problem. A SageMaker Processing job with 16 instances doesn't fire one completion event. It fires sixteen. Each partition completing sends its own EventBridge event, which means sixteen concurrent Lambda invocations all racing to advance the pipeline. Solution: re-read SSM state at the top of every stage transition and bail out if someone else already moved it forward.
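The bail-out logic, with a plain dict standing in for SSM Parameter Store (a fully race-proof version also wants a conditional write on the store side, which a dict can't demonstrate):

```python
STAGES = ["backfill", "export_ohlcv", "feature_comp", "processing",
          "hpo", "score_gen", "rl_pipeline"]

def try_advance(state_store: dict, completed_stage: str) -> bool:
    """Advance the pipeline exactly once per stage completion.

    Sixteen partition-completion events mean sixteen racing invocations.
    Re-reading state at the top and bailing if it already moved keeps
    the transition idempotent.
    """
    current = state_store["stage"]         # re-read, don't trust the caller
    if current != completed_stage:
        return False                       # someone else already advanced
    idx = STAGES.index(completed_stage)
    if idx + 1 < len(STAGES):
        state_store["stage"] = STAGES[idx + 1]
    return True
```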

The stale-state bug. The orchestrator would read pipeline state from SSM, compute new config flags, then call start_backfill_job(). But that function also reads SSM state internally—and it was reading the old state before the caller had a chance to save the updates. So it would overwrite the caller's flags with stale values. Fix: save state before calling sub-functions, not after. Classic race condition, annoying to track down.

Missing incremental parameters. The export stage wasn't getting START_DATE for incremental runs, so it defaulted to exporting everything every time. Months of data, every run. Fix: pass a 90-day lookback date when running incrementally.

None of this is glamorous. But it's the difference between a pipeline that quietly works and one that randomly launches duplicate training jobs at 3 AM.

The Prescreening Data Leakage Bug

Here's a subtle one. The prescreening training script does an 80/20 chronological split for out-of-time validation. But the DataFrame was built by concatenating parquet files from S3, and the concatenation order depended on filesystem enumeration—not timestamps. Some rows from the middle of the time period were ending up in the "holdout" set while future data leaked into the training set.

The fix was a one-liner: sort by signal_date before splitting. But finding it required staring at the data pipeline for a while and asking "why do the OOT metrics look too good?"
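In context, the fix looks like this (a sketch of the split, not the repo's exact code):

```python
def out_of_time_split(rows, date_key="signal_date", train_frac=0.8):
    """Chronological 80/20 split. The sort is the whole fix: without it,
    row order is whatever S3/filesystem enumeration happened to return,
    and future rows leak into the training set."""
    rows = sorted(rows, key=lambda r: r[date_key])
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]
```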

Regime-Aware Models

One model for all market conditions is a bad idea. I train separate prescreening models for three volatility regimes:

  • Low VIX (< 15): Momentum signals carry more weight
  • Moderate VIX (15-25): Mean reversion matters more
  • High VIX (> 25): Everything gets defensive, smaller positions

At inference time, check VIX, route to the right model:

def get_regime() -> str:
    vix = get_vix_level()
    if vix < 15:
        return "low"
    elif vix < 25:
        return "moderate"
    else:
        return "high"

endpoint = f"prescreening-{action}-{get_regime()}"  # e.g., prescreening-buy-moderate

This added 3-5% AUC over a single model. Worth the complexity.

Metrics I Actually Watch

Prescreening:

  • Out-of-time AUC-ROC on the chronological holdout. Handles class imbalance, threshold-agnostic.
  • Precision at top 50—are the picks actually good ones?
  • Deployment gate: AUC > 0.70 or the model doesn't ship.

RL:

  • Composite score: excess return over SPY times capped Sortino ratio
  • Max drawdown and recovery time
  • Deployment gates: non-negative excess return, Sortino >= 0, buy ratio >= 5%, backtest return >= 0.5%

In production:

  • Daily P&L vs SPY
  • Win rate
  • BUY/HOLD/SELL distribution (if it's 90% HOLD, something is wrong)
  • Inference latency (must stay under 200ms)
  • Per-stage pipeline timing—so I know when something is taking longer than it should

If win rate drops below 40% or the system trails SPY by more than 5% over a week, I get an alert.

The Feature Parity Disaster

This one cost me 6 days of the model doing absolutely nothing.

Training script expected 18 features. Inference code was computing 3 features and filling the other 15 with zeros. The model received garbage, produced garbage, fell back to heuristics, and nobody noticed because the heuristics also produce trades. Just worse ones.

The fix:

# features.py — SINGLE SOURCE OF TRUTH
RL_MODEL_FEATURES = [
    "return", "log_return", "sma_fast", ...  # all 27 features
]

# Training imports this
from app.ml.features import RL_MODEL_FEATURES

# Inference imports the same thing
from app.ml.features import RL_MODEL_FEATURES

# And validates at runtime
missing = [f for f in RL_MODEL_FEATURES if f not in computed_features]
if len(missing) > 0:
    raise RuntimeError(f"Missing features: {missing}")  # FAIL FAST

Boring. Effective. Would have saved me a week if I'd done it from the start.

This same lesson came back when I expanded from 21 to 27 features (adding fundamentals). New features need backward-compatible fallbacks for old training data that doesn't have them—fill with neutral defaults and log a warning, but don't crash. Feature expansion has to be additive. Break nothing, or break everything.
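The fill-and-warn path can be sketched like this (the default values shown are illustrative, not the repo's actual neutral defaults):

```python
import logging

log = logging.getLogger(__name__)

# Neutral fill values for features that old training data lacks
NEUTRAL_DEFAULTS = {
    "market_cap_log": 0.0, "beta": 1.0, "pe_ratio_norm": 0.0,
    "debt_to_equity_norm": 0.0, "roe_norm": 0.0, "revenue_growth": 0.0,
}

def backfill_missing_features(row: dict, feature_list: list) -> dict:
    """Additive feature expansion: fill with defaults, warn, never crash."""
    missing = [f for f in feature_list if f not in row]
    for name in missing:
        row[name] = NEUTRAL_DEFAULTS.get(name, 0.0)
    if missing:
        log.warning("Filled %d missing features with defaults: %s",
                    len(missing), missing)
    return row
```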

Where It Stands Now

  • Prescreening AUC: 0.73-0.79 depending on regime
  • RL Backtest Return: 15-20% on validation periods
  • Live Win Rate: ~55-60%
  • Outperformance vs SPY: +2-5% monthly (paper trading)
  • Pipeline end-to-end: ~1.5 hours (down from 8+)
  • Script deployment: ~30 seconds (down from 15 minutes)
  • HPO search space: 12 dimensions (down from 24)
  • HPO trial time: 30-60 min (down from 1.5-2 hours)

Not hedge-fund numbers. But consistent, fully automated, and getting better each cycle. Warm starts mean each monthly retraining doesn't start from zero, and the composite objective metric means HPO actually finds models that make money—not just models that avoid losing it.

What I'd Do Differently

Start dumber. I wasted weeks on LSTMs when Random Forest + basic features would have gotten 70% of the result in a tenth of the time.

Obsess over feature parity from day one. Assertions everywhere. Log feature counts on every inference call. Diff training inputs vs inference inputs. The silent mismatches will ruin you.

Build monitoring before you need it. I built dashboards after things broke. Should have built them before.

Don't start with Docker-baked scripts. The S3 bootstrap pattern should have been the default from day one. Waiting 15 minutes per iteration during active development is insane in retrospect.

Don't add cross-validation inside HPO trials. Let HPO manage the search. Adding 5-fold CV inside each trial cost me 5-6x compute for zero meaningful improvement. Just use an out-of-time holdout and move on.

Question your objective metric early. I ran dozens of HPO jobs optimizing Sortino before realizing it was rewarding the wrong behavior. The metric you optimize is the behavior you get. If your metric rewards doing nothing, your model will do nothing—very efficiently.

Audit your search space dimensionality. Bayesian optimization in 24 dimensions with 40 trials is random search with extra steps. Cut the dimensions first, tune the model second.

Test with production data volumes early. Training on 100 rows works great. Training on 7 million rows OOMs at 3 AM and you wake up to a Slack message that says "all jobs failed."

Next Up

The ML pipeline works. But "works in backtests" and "works in production" are different sentences with different meanings. Next post: why 94% test coverage didn't stop a catastrophic production failure—and what kind of testing actually matters.


This is part 3 of an 8-part series on building a trading system with AI coding agents. Previous: From Multi-Agent Chaos to a Single Execution Path | Next: Why 94% Test Coverage Didn't Help
