@ttarler
Last active March 3, 2026 21:35

The Ensemble That Actually Trades: How PPO, SAC, and TD3 Became One Decisioning System

End-to-End Signal Flow

The Problem With One Model

I spent weeks training a single PPO agent to make portfolio decisions. It worked well in calm markets. Then VIX spiked from 14 to 22 and the model froze — its learned policy didn't generalize to a volatility regime it hadn't trained in.

This isn't a novel observation. Any quantitative researcher who's deployed a single model to trade across regimes has hit this wall. The issue isn't the algorithm. It's that markets have structural breaks — volatility regimes, sector rotations, macro shocks — and no single model captures all of them.

So I built an ensemble. Not the textbook kind where you average three identical models with different random seeds. I mean a multi-layered system where fundamentals, technicals, ML prescreening, and reinforcement learning each contribute an independent signal, and a regime-aware decision engine merges them into a single conviction score.

This post is about how that system works, the experiments that shaped it, and three days of paper trading that proved the architecture generates real alpha.

How Information Flows Through the System

The entire pipeline is a funnel. 8,000+ symbols come in. A handful of buy/sell/hold decisions come out. Every layer in between adds a different kind of intelligence.

Raw Market Data (OHLCV, 8,000+ symbols)
    │
    ▼
Feature Engineering (40+ indicators)
    │
    ├── Momentum: RSI, MACD, rate of change
    ├── Volatility: ATR, Bollinger Bands, vol-of-vol
    ├── Trend: SMA crossovers, ADX, directional movement
    ├── Volume: relative volume, OBV slope, VWAP distance
    └── Market context: VIX level, VIX change, relative strength vs SPY
    │
    ▼
VIX Regime Detection → determines WHICH models run, HOW signals are weighted
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  PRESCREENING (Random Forest + XGBoost)                 │
│  6 models: buy/sell × low/moderate/high volatility      │
│  Input: 40+ technical + 6 fundamental features          │
│  Output: buy_probability, sell_probability per symbol    │
│  8,000 symbols → 50 candidates                          │
└─────────────────────────────────────────────────────────┘
    │  prescreening scores become RL input features
    ▼
┌─────────────────────────────────────────────────────────┐
│  RL INNER ENSEMBLE (PPO + SAC + TD3)                    │
│  3 algorithms, each trained via Bayesian HPO            │
│  Input: 44 features per stock (technicals + scores      │
│         + portfolio state + fundamentals)                │
│  Output: continuous action [-1, 1] per stock            │
│  Assembly: composite-weighted (SAC 40%, TD3 32%,        │
│            PPO 28%)                                     │
└─────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  OUTER ENSEMBLE DECISION ENGINE                         │
│  4 signal sources: Prescreening, RL, Technical,         │
│                    Fundamental                          │
│  Regime-dependent weights                               │
│  Agreement scaling + coverage penalty                   │
│  Output: BUY / HOLD / SELL + conviction score           │
└─────────────────────────────────────────────────────────┘
    │
    ▼
Portfolio Optimizer (Max-Sharpe, Ledoit-Wolf covariance)
    │
    ▼
Risk Controls → Execution

The key design principle: each layer uses the same underlying data differently. Prescreening uses raw features for cross-sectional classification ("is this stock worth looking at?"). RL uses them as part of a sequential decision ("given my current portfolio, should I buy more?"). The outer ensemble uses them as independent corroborating signals ("do multiple perspectives agree?").

This isn't redundancy. It's triangulation.

Volatility Regimes: The System's Nervous System

VIX regime detection isn't a feature. It's the control plane. It determines which prescreening models run, how the RL signal is weighted, how large positions can be, and whether trading happens at all.

Training: One Pipeline, Multiple Regimes

The prescreening stage trains six separate models — one for each combination of action (buy, sell) and volatility regime (low, moderate, high). At training time, each HPO job filters its data to the relevant regime's VIX range, so the buy-low-vol model only sees calm-market data and the sell-high-vol model only sees turbulent data.
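As a sketch, the regime filtering step amounts to masking each HPO job's training data to one VIX band. The band edges match the regime table later in the post; the function and field names here are illustrative, not the project's actual code.

```python
# Illustrative sketch of regime-filtered training data.
# REGIME_BANDS and filter_to_regime are assumed names, not the project's code.
REGIME_BANDS = {
    "low_volatility": (0.0, 15.0),
    "moderate_volatility": (15.0, 20.0),
    "high_volatility": (20.0, 30.0),
}

def filter_to_regime(rows, regime):
    """Keep only rows whose VIX falls inside the regime's [lo, hi) band."""
    lo, hi = REGIME_BANDS[regime]
    return [r for r in rows if lo <= r["vix"] < hi]

rows = [{"symbol": "AAPL", "vix": 13.2}, {"symbol": "AAPL", "vix": 24.7}]
calm = filter_to_regime(rows, "low_volatility")       # only the VIX-13.2 row
turbulent = filter_to_regime(rows, "high_volatility") # only the VIX-24.7 row
```

The buy-low-vol model trains only on `calm`-style rows; the sell-high-vol model only on `turbulent`-style rows.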

RL models train across all regimes but receive VIX as a context feature. The reward function shapes behavior differently depending on volatility — drawdown penalties scale with the regime, so the model learns to be more conservative when vol is elevated.

Inference: VIX Drives Everything

At inference time, the system checks VIX and routes accordingly:

def classify_regime(self, vix: float) -> str:
    if vix < 15.0:          # VIX_LOW
        return "low_volatility"
    elif vix < 20.0:        # VIX_MODERATE
        return "moderate_volatility"
    elif vix < 30.0:        # VIX_HIGH
        return "high_volatility"
    else:                   # VIX_EXTREME
        return "extreme_volatility"

The regime then cascades through the entire decision stack:

VIX Regime Bands

Regime     VIX Range   Position Sizing   Max Invest %   Cash Reserve   Behavior
Low        < 15        1.0x              80%            5%             Full deployment, trust RL
Moderate   15–20       0.75x             60%            7%             Balanced signals
High       20–30       0.50x             40%            10%            Defensive, smaller positions
Extreme    ≥ 30        0.0x              0%             15%            Trading halted

Position sizing isn't just a discrete step function. There's a continuous scaling layer underneath:

def max_position_pct(self, vix: float) -> float:
    """Position size shrinks smoothly with VIX."""
    raw = self.pos_base - self.pos_slope * max(0.0, vix - self.vol_anchor)
    return max(self.pos_floor, raw)

This smooth decay avoids cliff effects. VIX going from 14.9 to 15.1 doesn't trigger an abrupt regime change at the position level — the continuous function handles the transition.
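With illustrative parameter values (`pos_base`, `pos_slope`, `vol_anchor`, and `pos_floor` are tuned in the real system; the numbers below are assumptions), the decay looks like this:

```python
def max_position_pct(vix, pos_base=0.05, pos_slope=0.002,
                     vol_anchor=15.0, pos_floor=0.01):
    """Position cap shrinks linearly with VIX above the anchor.
    Parameter values here are illustrative, not the tuned production values."""
    raw = pos_base - pos_slope * max(0.0, vix - vol_anchor)
    return max(pos_floor, raw)

max_position_pct(14.9)  # 0.05  — at/below the anchor, full base size
max_position_pct(15.1)  # 0.0498 — crossing the band edge barely moves the cap
max_position_pct(50.0)  # 0.01  — floor kicks in during a true vol spike
```

Crossing VIX 15 changes the cap by basis points, not a regime-sized step — the discrete bands and the continuous scaler compose.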

The VIX 25 Trap

For weeks, I had VIX ≥ 25 classified as EXTREME_VOLATILITY, which halts all trading. The problem: VIX hovering around 24–25 during earnings season isn't a crisis. It's normal elevated volatility. The portfolio was frozen with zero buys, sells, or rebalancing for extended periods — not because the market was crashing, but because my thresholds were miscalibrated.

The solution was straightforward: VIX 20–30 is HIGH_VOLATILITY (trade at 50% sizing, defensive posture), and only VIX ≥ 30 is EXTREME (halt). This single threshold change — moving the extreme cutoff from 25 to 30 — allowed the portfolio to rotate during normal elevated-vol periods instead of sitting frozen.

Fundamentals and Technicals: Multi-Layer Inputs

Multi-Layer Data Usage

The same fundamental and technical data feeds into multiple pipeline stages, but each stage uses it differently.

Fundamentals

Six features pulled from Financial Modeling Prep's API: market_cap_log, beta, pe_ratio_norm, debt_to_equity_norm, roe_norm, revenue_growth. Cached in S3 with a 30-day TTL. A dedicated pipeline stage grabs sector mappings once and writes a shared sector_map.json so all processing partitions reuse it instead of each one independently hitting the API.

Where fundamentals appear:

  • Prescreening features: Raw values are part of the 40+ feature vector. The RF/XGBoost models learn that a small-cap with high debt and negative ROE is a riskier buy than a profitable mega-cap — something pure technicals can't distinguish.
  • RL observation space: Part of the 44-feature input per stock. The RL model sees fundamentals alongside technicals and prescreening scores when deciding portfolio allocation.
  • Outer ensemble signal: A standalone 4th signal source that combines PE ratio, ROE, revenue growth, and debt-to-equity into a normalized [-1, 1] composite. This gives the ensemble a fundamentals-based vote independent of the ML models.

Technicals

Six indicators computed as a standalone composite for the outer ensemble: RSI (reversal detection), MACD (trend momentum), Bollinger %B (mean reversion), SMA crossover (trend direction), price momentum, and volume confirmation.

These same indicators also flow into prescreening features and RL observations, but in the outer ensemble they operate as an independent signal source. This proved critical on Feb 27: technical signals of +0.46 to +0.50 overrode prescreening's sell recommendations (-0.79 to -0.85) on stocks like FIGS (+24%), INDO (+23%), and BATL (+33%). The prescreening model wanted to sell all holdings; the technical signal recognized they were momentum stocks still trending. The ensemble held them. That's triangulation working as designed.

The Inner Ensemble: PPO + SAC + TD3

Inner Ensemble Architecture

Three reinforcement learning algorithms, each trained independently via Bayesian HPO on SageMaker, assembled into a weighted ensemble for inference.

Why These Three

PPO (Proximal Policy Optimization): On-policy, stable, clipped updates. Uses curriculum learning — the buy/sell threshold ramps from 0.05 to 0.15 over 220 training iterations, letting the model first learn "what is a good stock" before requiring decisive actions. Reliable but sample-hungry.

SAC (Soft Actor-Critic): Off-policy, entropy-regularized, sample-efficient. Explores more aggressively because the entropy bonus prevents premature convergence. Got the highest backtest composite score (1.302 vs PPO's 0.914 and TD3's 1.042) in the most recent HPO run. No curriculum learning — off-policy replay buffers mix curriculum stages, which defeats the purpose.

TD3 (Twin Delayed DDPG): Deterministic policy, twin critics to reduce Q-value overestimation. Produces precise, non-stochastic action values. Good at avoiding the "optimistic bias" that plagues single-critic methods in noisy financial environments.

Architecture

All three share the same network topology — a StockEncoder that processes each stock's 44 features independently, followed by a StockParallelActor that combines per-stock encodings with a global context vector (mean-pooled across all stocks). This makes the network permutation-equivariant: the decision for AAPL doesn't depend on AAPL being in slot 3 vs slot 17.

Input: 1,100 dimensions (25 stocks × 44 features). Output: continuous action in [-1, 1] per stock, squashed through tanh.
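The permutation-equivariance property is easy to demonstrate with a toy NumPy version of the encode-then-mean-pool pattern (random weights and illustrative layer sizes — this is not the production network): shuffling the stock order shuffles the actions identically.

```python
import numpy as np

def stock_parallel_actor(obs, w_enc, w_act):
    """Toy sketch of the StockEncoder + StockParallelActor pattern.
    obs: (n_stocks, n_features). Each stock is encoded independently, then
    concatenated with a mean-pooled global context shared by all stocks."""
    enc = np.tanh(obs @ w_enc)                         # per-stock encoding
    ctx = enc.mean(axis=0, keepdims=True)              # order-invariant context
    joint = np.concatenate([enc, np.repeat(ctx, enc.shape[0], axis=0)], axis=1)
    return np.tanh(joint @ w_act).ravel()              # action in [-1, 1] per stock

rng = np.random.default_rng(0)
obs = rng.normal(size=(25, 44))           # 25 stocks × 44 features
w_enc = 0.1 * rng.normal(size=(44, 16))   # illustrative encoder width
w_act = 0.1 * rng.normal(size=(32, 1))    # 16 encoding dims + 16 context dims
actions = stock_parallel_actor(obs, w_enc, w_act)

perm = rng.permutation(25)
shuffled = stock_parallel_actor(obs[perm], w_enc, w_act)
# shuffled == actions[perm]: AAPL's action is the same in slot 3 or slot 17
```

Because the context is a mean over stocks, it is identical under any ordering, so each stock's action depends only on its own features plus that shared summary.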

Assembly

assemble_ensemble.py takes the best checkpoint from each algorithm's HPO run and combines them:

import torch
import torch.nn as nn


class EnsemblePolicy(nn.Module):
    """Weighted average of multiple policy networks."""

    def __init__(self, models: dict[str, tuple[nn.Module, float]]):
        super().__init__()
        self.models = models  # name -> (policy network, backtest-derived weight)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        total_weight = sum(w for _, w in self.models.values())
        weighted_sum = None
        for name, (policy, weight) in self.models.items():
            with torch.no_grad():
                action = policy(obs)
            if weighted_sum is None:
                weighted_sum = action * (weight / total_weight)
            else:
                weighted_sum = weighted_sum + action * (weight / total_weight)
        return weighted_sum

Weights are proportional to backtest composite scores: SAC 40%, TD3 32%, PPO 28%. Composite score = excess return over SPY × capped Sortino ratio — this prevents cash-hoarding models from gaming the metric (a lesson from an earlier post in this series).
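The weighting rule itself is just score normalization — a minimal sketch (function name assumed) reproduces the 40/32/28 split from the composite scores above:

```python
def composite_weights(scores):
    """Normalize backtest composite scores into ensemble weights."""
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

weights = composite_weights({"sac": 1.302, "td3": 1.042, "ppo": 0.914})
# sac ≈ 0.40, td3 ≈ 0.32, ppo ≈ 0.28
```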

Action Interpretation

The ensemble outputs a continuous value per stock. Three zones:

BUY_THRESHOLD = 0.15     # Action > 0.15   → BUY
SELL_THRESHOLD = -0.15   # Action < -0.15  → SELL
                         # [-0.15, 0.15]   → HOLD

For stocks already in the portfolio, the hold zone isn't wasted. A mildly positive action (say, +0.08) on an owned stock emits a HOLD-AS-BUY signal — the model isn't enthusiastic enough to cross the buy threshold, but it's not negative either. This signal gets forwarded to the outer ensemble with reduced strength, preventing signal loss for held positions.
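A sketch of the zone logic, including the hold-as-buy forwarding. The 0.5 attenuation factor is an assumed placeholder for "reduced strength" — the actual factor isn't stated in this post.

```python
BUY_THRESHOLD, SELL_THRESHOLD = 0.15, -0.15

def interpret_action(action, held):
    """Map a continuous RL action to a discrete signal (illustrative sketch)."""
    if action > BUY_THRESHOLD:
        return "BUY", action
    if action < SELL_THRESHOLD:
        return "SELL", action
    if held and action > 0:
        # Mildly positive action on an owned stock: forward a weaker
        # hold-as-buy signal instead of discarding it.
        return "HOLD_AS_BUY", action * 0.5
    return "HOLD", 0.0

interpret_action(0.08, held=True)   # ("HOLD_AS_BUY", 0.04)
interpret_action(0.08, held=False)  # ("HOLD", 0.0)
```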

The Outer Ensemble: Where Signals Merge

Outer Ensemble Decision Engine

This is where four independent signal sources converge into a single decision. The EnsembleDecisionEngine takes prescreening scores, RL actions, technical indicators, and fundamental metrics — normalizes each to [-1, 1], applies regime-dependent weights, and produces a conviction score.

Regime-Dependent Weights

The intuition: in calm markets, the RL model has the highest signal-to-noise ratio (it was trained on relatively stable data). In volatile markets, technicals and prescreening carry more weight because RL struggles with distribution shift.

DEFAULT_REGIME_WEIGHTS = {
    "low_volatility": {
        "prescreening": 0.30, "rl": 0.40,
        "technical": 0.20,    "fundamental": 0.10,
    },
    "moderate_volatility": {
        "prescreening": 0.35, "rl": 0.30,
        "technical": 0.25,    "fundamental": 0.10,
    },
    "high_volatility": {
        "prescreening": 0.35, "rl": 0.20,
        "technical": 0.35,    "fundamental": 0.10,
    },
    "extreme_volatility": {
        "prescreening": 0.40, "rl": 0.10,
        "technical": 0.40,    "fundamental": 0.10,
    },
}

These defaults are also HPO-tunable. The moderate-volatility weights serve as the base; other regimes are auto-generated by scaling RL up (in low vol) or down (in high vol), with technicals moving in the opposite direction. HPO tunes the base, and the regime variants follow.
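A minimal sketch of that auto-generation, assuming a simple shift-and-renormalize rule (the actual shift sizes are HPO-tuned; the ±0.10 below is illustrative, and it doesn't reproduce the defaults exactly):

```python
def derive_regime_weights(base, rl_shift):
    """Shift weight between 'rl' and 'technical', then renormalize (sketch)."""
    w = dict(base)
    w["rl"] = max(0.0, w["rl"] + rl_shift)
    w["technical"] = max(0.0, w["technical"] - rl_shift)
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

# The moderate-volatility weights serve as the tunable base.
base = {"prescreening": 0.35, "rl": 0.30, "technical": 0.25, "fundamental": 0.10}
low_vol = derive_regime_weights(base, +0.10)   # RL up in calm markets
high_vol = derive_regime_weights(base, -0.10)  # RL down when vol is elevated
```

HPO only has to search the four base weights; the regime variants follow mechanically.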

Agreement Ratio

The most important post-weighting step. After computing a raw weighted conviction, the engine asks: "Do the signals actually agree?"

@staticmethod
def _compute_agreement(signal_map: dict[str, float]) -> float:
    """What fraction of active signals agree on direction?"""
    directions = []
    for val in signal_map.values():
        if val > 0.01:
            directions.append(1)     # Bullish
        elif val < -0.01:
            directions.append(-1)    # Bearish
        # Near-zero signals excluded

    if not directions:
        return 0.5  # All neutral

    majority_dir = 1 if sum(directions) > 0 else -1
    agree_count = sum(1 for d in directions if d == majority_dir)
    return agree_count / len(directions)

Agreement then scales the conviction: conviction *= (floor + (1 - floor) * agreement_ratio). With a floor of 0.50, unanimous agreement passes full conviction through, while a 50-50 split reduces it by 25%.

This prevents overconfident decisions when signals conflict. A stock might have a strong prescreening buy score (+0.80) but a negative RL signal (-0.10) and flat technicals. The weighted average might still be positive, but the low agreement ratio dampens it. The position ends up smaller, which is the right outcome when signals disagree.
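The scaling rule in code form, with the floor of 0.50 described above:

```python
def scale_conviction(conviction, agreement, floor=0.5):
    """Dampen conviction when signal sources disagree on direction."""
    return conviction * (floor + (1.0 - floor) * agreement)

scale_conviction(0.8, 1.0)  # 0.8 — unanimous agreement passes through
scale_conviction(0.8, 0.5)  # 0.6 — a 50-50 split cuts conviction by 25%
```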

A Real Decision

From the March 2 trading session (16:23 UTC cycle):

Symbol   Fund     Prescreening   RL       Technical   Conviction   Agreement   Action
DPRO     +0.214   +0.793         +0.252   +0.230      0.356        1.00        BUY
ELDN     +0.172   +0.799          0.000   +0.358      0.307        1.00        BUY
FIGS*    —        -0.789         -0.015   +0.502      -0.122       0.50        HOLD

*FIGS is from the Feb 27 session, before the fundamental signal was active — which is why it has no Fund value.

DPRO had all four signals positive, agreement of 1.0, conviction of 0.356. Strong unanimous buy. The system bought it.

FIGS is the more interesting case. Prescreening recommended selling hard (-0.789), but the technical signal was strongly positive (+0.502) — it was a momentum stock that had already rallied 24% on the day. The disagreement dropped agreement to 0.50, and conviction landed at -0.122 — just inside the hold zone. The system held instead of selling. FIGS went on to gain another 4.8% the following session.

That's the ensemble working as designed. No single signal source would have gotten FIGS right. Prescreening alone would have sold. Technicals alone would have bought more. The ensemble found the middle ground: hold what you have, don't add, don't sell.

The Experimentation Loop

Getting three RL algorithms to produce useful trading signals isn't a "train once, deploy, done" process. It's iterative experimentation — tuning hyperparameters, discovering algorithm-specific training dynamics, validating that the assembled ensemble behaves differently from its individual components, and calibrating the parameters between model output and trade execution. Over three weeks, I ran 90+ HPO trials, 20+ sandbox experiments, and migrated the entire training framework. Here's what that looked like.

The Scale of the Search

Each algorithm gets 30 Bayesian HPO trials on SageMaker, each running up to 4 hours on a T4 GPU instance. That's 90 trials total for the three-algorithm ensemble, plus additional runs whenever I changed the search space or reward function.

The objective metric — backtest_composite = excess_return × min(sortino, 3.0) — was itself the result of experimentation. Early iterations used raw portfolio return, which rewarded models that made a few lucky bets with catastrophic drawdown risk. Sortino ratio alone rewarded models that hoarded cash and never traded. The composite metric requires both alpha generation (excess return over SPY) and risk discipline (downside-adjusted returns), with Sortino capped at 3.0 to prevent gaming.
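The composite metric is a one-liner; a sketch (function name matches the metric described above) shows why both cash hoarders and lucky gamblers score poorly:

```python
def backtest_composite(portfolio_return, spy_return, sortino, cap=3.0):
    """Excess return over SPY times a capped Sortino ratio."""
    return (portfolio_return - spy_return) * min(sortino, cap)

backtest_composite(0.10, 0.05, 2.0)   # 0.10 — alpha with risk discipline
backtest_composite(0.10, 0.05, 10.0)  # 0.15 — Sortino capped at 3.0, can't be gamed
backtest_composite(0.00, 0.00, 99.0)  # 0.0  — cash hoarder: pristine Sortino, zero score
```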

Curriculum Learning: The Most Important Discovery

The single most impactful finding was that curriculum learning is algorithm-specific. I spent days assuming a universal curriculum would work for all three algorithms. It doesn't.

PPO benefits enormously from a 9-stage curriculum. The buy/sell threshold ramps from 0.05 to 0.15 over ~220 training iterations, letting the model first learn "what constitutes a good stock" before requiring decisive actions. Simultaneously, the episode window expands (200 → 400 → 800 → full history), observation noise decays (0.03 → 0.0 standard deviation), and the learning rate anneals (0.001 → 0.0003 → 0.0001). This staged approach produced stable convergence where a flat curriculum did not.

SAC is the opposite. Every sandbox run with curriculum enabled showed declining reward trends. The reason: SAC's entropy auto-tuning adjusts the exploration-exploitation tradeoff continuously, and staged reward changes interfere with that mechanism. When I disabled curriculum entirely and let SAC's entropy coefficient self-regulate, it produced its first positive reward trend — and ultimately achieved the highest composite score in the ensemble (1.302).

TD3 tolerates curriculum but needs wider stage spacing than PPO. Where PPO transitions every 25-50 iterations (25/50/75/100/130/160/190/220), TD3 needs 60/120/180/240/300/370/440/520. The deterministic policy updates are more sensitive to abrupt reward signal changes.
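The algorithm-specific spacing can be expressed as one ramp function over different stage schedules (a sketch; the list and function names are assumed, and the real curriculum also varies episode window, noise, and learning rate):

```python
# Stage boundaries (training iterations) from the post: 8 transitions → 9 stages.
PPO_STAGES = [25, 50, 75, 100, 130, 160, 190, 220]
TD3_STAGES = [60, 120, 180, 240, 300, 370, 440, 520]

def action_threshold(iteration, stages, start=0.05, end=0.15):
    """Ramp the buy/sell threshold one step per completed curriculum stage."""
    done = sum(1 for s in stages if iteration >= s)
    return start + (end - start) * done / len(stages)

action_threshold(0, PPO_STAGES)     # 0.05   — permissive early stages
action_threshold(220, PPO_STAGES)   # 0.15   — PPO reaches the final threshold
action_threshold(220, TD3_STAGES)   # 0.0875 — TD3 is still mid-curriculum at 220
```

The same iteration count leaves TD3 far earlier in its ramp, which is the point: its deterministic updates need more time between reward-signal changes.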

Reward Function Design

The initial reward function had an asymmetric treatment of gains and losses — risk penalties applied more aggressively to drawdowns than gains were rewarded. The model converged to a trivial solution: sell everything and hold cash. Zero drawdown, zero alpha, but no penalties either.

The solution required several adjustments working together: symmetric risk treatment (gains and losses weighted proportionally), a 50x boost to momentum alignment weight (rewarding the model for trading in the direction of the trend), and a 10x activity bonus (penalizing the model for doing nothing when opportunities exist). I also reduced the cash penalty to allow exploration in early curriculum stages — the model needs permission to hold cash while it's still learning what a good trade looks like.

The final reward mode is a 3-component "sharpe" function: Differential Sharpe Ratio (captures risk-adjusted return per step), transaction costs (0.1% per trade, prevents churn), and a concentration penalty (discourages loading into a single position). The 11-component "full" reward was too noisy for Bayesian HPO to optimize effectively.

Sandbox Validation

Before committing to a full HPO run (90 trials × 4 hours = significant compute cost), I built a two-tier sandbox for rapid configuration validation. A local sandbox runs a single trial in ~2-5 minutes and produces a GO/CAUTION/NO-GO verdict based on convergence metrics. Only configurations that pass the sandbox proceed to full SageMaker HPO.

This saved substantial compute. Over 20+ sandbox experiments, I validated the regularization settings that the production models now use:

Parameter              Initial    Final      Why
Dropout                0.10       0.15       Reduced overfitting to historical patterns
Weight decay           1e-4       5e-4       Better generalization on holdout period
Observation noise      0.02 std   0.03 std   Forces robustness to noisy market data
PPO epochs per batch   10         3          Prevents importance sampling ratio blowup
PPO batch size         5,000      4,000      Better gradient signal-to-noise

HPO Search Range Calibration

Even with well-tuned training dynamics, the HPO search ranges themselves needed calibration based on the characteristics of the prescreened stock universe.

Stop-loss thresholds: The Bayesian optimizer found an "optimal" stop-loss of 2.13%. Mathematically sound — it minimizes expected drawdown in the historical data. Practically useless — the small-cap momentum stocks that the prescreening model selects routinely swing 10-15% intraday. A 2.13% stop-loss triggers on virtually every position within hours. Raising the floor to 5% eliminated spurious stop-outs while maintaining meaningful downside protection.

Position sizing: The search range for max_position_pct spans [0.03, 0.10]. At 3%, the model can't express strong convictions. At 10%, a single bad position can significantly impact the portfolio. The HPO consistently converged to ~5%, which produced realistic 100-200% annualized returns on the holdout period.

Entropy coefficient (SAC): Rather than tuning entropy directly, SAC uses target_entropy_scale in [0.3, 0.7], which sets the target entropy relative to the action dimension. The default of 1.0 (target = -25 for 25 stocks) was too exploratory — action magnitudes stayed too small to cross the ±0.15 buy/sell threshold. Scaling to 0.5 (target = -12.5) produced decisive actions while maintaining sufficient exploration.
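The arithmetic behind the target entropy — SAC's conventional default target is minus the action dimension, and the scale shrinks it:

```python
def sac_target_entropy(action_dim, scale=1.0):
    """SAC's default target entropy is -action_dim; the scale shrinks it."""
    return -float(action_dim) * scale

sac_target_entropy(25)             # -25.0 — default: too exploratory here
sac_target_entropy(25, scale=0.5)  # -12.5 — decisive actions, still exploring
```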

Best Trial Results

After the full HPO sweep, the three best models and their composite scores:

Algorithm   Composite Score   Key Hyperparameters                                         Weight
SAC         1.302             target_entropy_scale=0.5, no curriculum                     40%
TD3         1.042             policy_delay=3, target_noise=0.3, wide curriculum           32%
PPO         0.914             clip=0.14, entropy=0.0019, gamma=0.91, 9-stage curriculum   28%

SAC's dominance came from two factors: sample efficiency (off-policy replay buffer reuses experience) and entropy regularization (prevents premature convergence to suboptimal policies). PPO's lower score reflects its on-policy nature — it requires more data to reach the same level of policy quality, and 30 HPO trials with 220 iterations each may not be enough to fully converge.

Integration Validation

Training models that produce good backtest composites is necessary but not sufficient. The assembled ensemble needs to produce non-degenerate signals when running end-to-end against live market data.

Validating the full inference pipeline revealed a format mismatch between the inference endpoint's output and the decision engine's expected input — the RL signal was silently dropped until the two were aligned. The composite weights from HPO metadata also needed explicit passthrough to the assembly step; without it, the system defaulted to equal weights (33.3% each) instead of the performance-weighted allocation. I ran both configurations against the Feb 27 session: composite-weighted conviction scores were 5-8% higher on the stocks that ultimately performed best (momentum names like FIGS, BATL, INDO), confirming the weighting scheme adds signal.

Adding fundamentals as a 4th signal source required standardizing the data format from Financial Modeling Prep's API. The March 2 session was the first with all four signals active, and the difference was material: DPRO's conviction of 0.356 with agreement ratio 1.0 (all four sources positive) was the highest-confidence signal the system had generated.

Execution Parameter Optimization

The ensemble's signal quality is only as valuable as what the execution layer does with it. Tuning execution parameters — rebalance frequency, minimum trade size, position trim cooldown — matters as much as any model hyperparameter.

On Feb 26, the execution layer's rebalance parameters were too aggressive, creating unnecessary position churn that destroyed $1,113 in value that the ensemble had correctly identified. Tightening position management parameters — adding a cooldown between rebalances, setting a minimum trade size of $25, and reducing the rebalance frequency — produced dramatic improvement: roundtrip losses went from -$1,113 (Feb 26) to -$49 (Feb 27) to +$1,369 (March 2). The ensemble's signals didn't change between those sessions. The execution configuration did.

The Takeaway

Each experiment followed the same loop: hypothesize → configure → run against live data → measure at every layer → iterate. Most of the alpha improvement across these three weeks didn't come from retraining models. It came from discovering algorithm-specific training dynamics (curriculum helps PPO, hurts SAC), calibrating HPO search ranges to match the actual trading universe, and tuning the parameters between model output and trade execution.

This is the gap that most ML tutorials skip. The model is maybe 40% of the system. The other 60% is everything surrounding it — reward design, curriculum scheduling, search range calibration, ensemble assembly, signal integration, execution parameters. Getting that right requires running the full pipeline end-to-end and measuring at every layer.

Three Days of Proof

The real test of an ensemble system isn't the architecture diagram. It's the P&L.

I have three days of paper trading data where the ensemble was live with progressive configuration refinements: Feb 26, Feb 27, and March 2. Each day tells a different part of the story.

3-Day Performance

Feb 26: First Full Day

Metric             Value
Portfolio Return   +0.58% (+$586)
SPY                -0.56%
Alpha              +1.14%

The first day of paper trading with the ensemble live. On a day where the S&P 500 fell 0.56%, the portfolio gained 0.58% — an alpha of 1.14%. The prescreening layer identified small-cap momentum stocks that held up well during the broader selloff.

The execution layer's rebalance parameters were too aggressive — positions were being trimmed too frequently with no cooldown between rebalances, creating unnecessary churn. Roundtrip losses were -$1,113. Without the excessive rebalancing, the day's P&L would have been closer to +$1,700. This identified execution parameter tuning as the next experiment to run.

Feb 27: Overnight Gaps and Ensemble Value

Metric             Value
Portfolio Return   +9.98% (+$10,046)
SPY                -0.48%
Alpha              +10.46%

The RL ensemble identified strong small-cap positions on Feb 26 that gapped up dramatically overnight: FIGS +24%, BDMD +29%, BATL +33%, CDIO +31%. The headline number — nearly +10% in a single day — was largely driven by these overnight moves, with intraday trading roughly flat.

Two things mattered beyond the raw return:

  1. Execution parameters tightened: Position management parameters were adjusted between sessions — minimum trade size, rebalance cooldown, and trim frequency. Roundtrip losses dropped from -$1,113 to -$49, a 96% reduction.
  2. The ensemble demonstrated its value: Technical signals of +0.46 to +0.50 overrode prescreening's sell recommendations (-0.79 to -0.85) on the momentum stocks. Prescreening wanted to sell all holdings; the technical signal recognized they were still trending. The ensemble held, and the portfolio captured the full overnight gap.

March 2: First Intraday Alpha

Metric             Value
Portfolio Return   +1.57% (+$1,738)
SPY                +0.05%
Alpha              +1.52%

This was the day that validated the full architecture. The portfolio started the session $803 below the prior close (an overnight gap down). The ensemble recovered that deficit and added $2,541 during market hours — the first day of genuine intraday alpha generation.

Roundtrip trades turned positive: +$1,369 in realized gains. Average winner jumped from $7.68 to $112.87. The 4-signal ensemble — now with fundamentals contributing as a standalone signal source — produced higher-conviction decisions. DPRO had conviction 0.356 with all four signals positive (unanimous buy). The system was generating alpha from its own decisions, not just riding overnight gaps.

The Trend Line

Metric           Feb 26    Feb 27    Mar 2     Trend
Alpha vs SPY     +1.14%    +10.46%   +1.52%    Consistent
Roundtrip P&L    -$1,113   -$49      +$1,369   Optimization → value
Win Rate         70.5%     60.9%     68.7%     Stabilizing
Profit Factor    5.19      1.83      3.34      Recovering
Avg Winner       $7.68     $6.67     $112.87   Growing
Signals Active   3         3         4         Full coverage

The progression tells the story. Each day was better than the last — not because the models were retrained, but because each experiment refined the inference configuration and made the ensemble architecture more complete. By March 2, all four signal sources were contributing, execution wasn't destroying value, and the system was making decisions that generated real P&L.

Layer-by-Layer Value Attribution

A Note on Documentation

While building the ensemble system, I also built a comprehensive documentation system for the entire trading platform — 4 perspectives (Architecture, Risk, Operations, System Guide), YAML frontmatter for programmatic discovery, and an "I need to..." lookup table that makes it trivial to find the right document for any task.

On the same day we recalibrated the VIX regime thresholds and the portfolio optimizer, we updated 9 documentation files to reflect the new behavior. Documentation isn't an afterthought in this system — it's part of every model update.

I'll go deeper on the documentation system in the next post, including why we built it, how it prevents the kind of knowledge loss that contributed to our $78K incident, and how it works with AI coding agents.

What I'd Do Differently

Start with an ensemble from day one. I wasted weeks trying to make a single PPO agent work across all market conditions before accepting that one algorithm isn't enough. The ensemble architecture should have been the first design, not the last resort.

Run end-to-end inference validation before live trading. Verify that every signal source produces non-zero, non-degenerate outputs across a representative sample of symbols. A format mismatch between the RL output and the decision engine silenced 35.6% of the ensemble's decision weight for a full trading day. Unit tests on individual components passed fine — the gap was only visible when running the complete inference pipeline end-to-end.

VIX thresholds need stress testing, not just backtesting. I backtested the prescreening and RL models extensively but never stress-tested the regime thresholds themselves. VIX 25 as "extreme" seemed reasonable on paper. In practice, it froze the portfolio during normal earnings-season volatility.

Decompose P&L by signal source. Total portfolio return hides where alpha is generated and where it's destroyed. On Feb 26, the ensemble's stock selection generated +$1,700 of value that the execution layer's aggressive rebalancing eroded to +$586. Without signal-level attribution, I would have questioned the models instead of the execution parameters.

The inference system is the product. Training gets all the attention, but inference — pulling in VIX, routing to the right prescreening model, computing RL features, normalizing signals, applying regime weights, checking agreement, sizing positions — is where the value is actually created. A perfectly trained model with a broken inference pipeline produces nothing.


This is Post 5 of an 8-part series on building a full-stack AI trading application with LLM coding agents. Previous: Why 94% Test Coverage Didn't Stop Our Trading System From Failing | Next: The Documentation System That Prevents Knowledge Loss
