@ttarler
Last active February 23, 2026 15:37

Training an RL Portfolio Agent: 47 Commits of "Why Won't You Learn to Sell?"

RL Training Journey

The Problem Nobody Warns You About

Every RL tutorial on the internet ends the same way: train PPO on CartPole, watch the reward curve go up, celebrate. Maybe they graduate to Atari and show you a DQN playing Breakout.

None of them prepare you for what happens when your action space has 25 continuous dimensions, your reward signal is delayed by days, and your agent discovers that the optimal policy is to buy everything and never sell.

I spent about two months getting a reinforcement learning agent to make sensible portfolio decisions. Not "beat the market" decisions—just "don't hold 98% cash" and "occasionally sell something" decisions. This post is the real story of that process, told through the git log.

Where We Started: Weight Allocation (V1)

The first RL environment was embarrassingly naive. The model output a weight per stock, softmax normalized to sum to 1, and the environment allocated capital proportionally. Classic portfolio optimization setup.

Model output: [0.4, 0.3, 0.2, 0.1, 0.0]
Portfolio:     40% AAPL, 30% GOOGL, 20% MSFT, 10% TSLA, 0% NVDA

This caused what I started calling "allocation whipsaw." If the model shifts its preferred weights even slightly between steps, positions get rebalanced. Stock you bought yesterday at 5% allocation? Today the model says 3%. Sells some at a loss. Tomorrow it says 6%. Buys it back. Every rebalance triggers transaction costs. The model was churning its own portfolio to death.

The deeper problem: there was no concept of "hold." Every timestep was a full reallocation. The model couldn't say "I like my current positions, leave them alone." It had to re-justify every position every day.

The BUY/HOLD/SELL Rewrite (V2)

The fix was conceptually simple but mechanically involved. Instead of outputting allocation weights, the model outputs a continuous value in [-1, 1] per stock, and we interpret ranges as discrete actions:

# Continuous output -> discrete action semantics
action > 0.3   -> BUY  (strength proportional to how far above 0.3)
action in [-0.3, 0.3] -> HOLD (do nothing)
action < -0.3  -> SELL (exit or trim if holding)

Now positions persist until the model explicitly says "sell." No more daily rebalancing. The HOLD zone in the middle gives the model a safe default—when uncertain, do nothing.
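The threshold mapping can be sketched as a tiny helper (the function name and the strength normalization are illustrative, not from the codebase):

```python
def interpret_action(a: float, threshold: float = 0.3):
    """Map one continuous action in [-1, 1] to (verb, strength in [0, 1])."""
    if a > threshold:
        # BUY strength grows linearly from 0 at the threshold to 1 at a = 1
        return "BUY", (a - threshold) / (1.0 - threshold)
    if a < -threshold:
        return "SELL", (-a - threshold) / (1.0 - threshold)
    return "HOLD", 0.0  # the safe default in the middle band
```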

V2 also added explicit position tracking. Each stock slot maintains shares held, entry price, and entry time. The model sees these as additional observation features, so it knows what it's holding and at what cost basis. Unrealized P&L becomes a signal—the model can learn that a position down 8% might need cutting.
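The per-slot position state described above might look something like this (a hypothetical sketch; the field names are mine):

```python
from dataclasses import dataclass

@dataclass
class Position:
    """Per-stock-slot state exposed to the model as observation features."""
    shares: float = 0.0
    entry_price: float = 0.0
    entry_step: int = -1  # -1 = no open position

    def unrealized_pnl(self, price: float) -> float:
        # The cost-basis signal the model can learn to act on
        return self.shares * (price - self.entry_price)
```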

This was the first version that felt like it could work. It was also the first version where convergence became a nightmare.

The Architecture Question: Flat MLP vs. Stock-Parallel

Architecture Evolution

The Flat MLP (What We Tried First)

The obvious first architecture: flatten all stock features into one big vector, feed it through a [256, 256] MLP, output 25 actions. With 40 features per stock plus 4 position-state features, that's a 1,100-dimensional input vector. LayerNorm on the input to prevent activation explosions.

Input: (batch, 1100) -> LayerNorm -> MLP(256, 256) -> (loc, scale) per action

It trained. Sort of. The problem was subtle: the network had no structural reason to treat stock #3 and stock #17 the same way. Nothing in the architecture enforced that "AAPL in slot 3" and "AAPL in slot 17" should produce similar actions. The model was memorizing slot positions instead of learning stock properties.

With 25 stocks this was manageable—25 slots, enough data, it learned okay. But when we tried scaling to 50+ candidates? The flat MLP hit a wall. Too many parameters, too little structure.

Stock-Parallel Architecture (The Fix)

The insight was permutation equivariance: the model's action for stock k should depend on stock k's features and the portfolio context, not on which slot k happens to occupy.

Per-stock features: (batch, K, F+4)
    |
    v
LayerNorm (per stock)
    |
    v
Shared StockEncoder: Linear(F+4, 64) -> ReLU -> Linear(64, 64) -> ReLU
    |                    (same weights for all K stocks)
    v
Per-stock embeddings: (batch, K, 64)
    |
    +---> Masked mean-pool -> global context (batch, 1, 64)
    |         |
    |         v
    |     Project to context_dim
    |         |
    v         v
Concat [embed_k, context] for each stock
    |
    v
Action head: Linear(128, 64) -> ReLU -> Linear(64, 2) -> (loc, log_scale)

The StockEncoder uses identical weights for every stock. Same features, same transformation. The masked mean-pool creates a global portfolio summary that every stock can reference for context ("what does the overall portfolio look like right now?"). Then each stock's action is computed from its own embedding plus the shared context.

This cut the parameter count significantly and—more importantly—made training converge faster. The model no longer had to independently learn "RSI means the same thing in slot 3 and slot 17."

One initialization detail that turned out to matter a lot: the final action head layer gets near-zero weights, with the loc bias at 0.0 and log_scale bias at -1.0. This means the initial policy outputs near-zero actions for everything—HOLD by default. The model has to learn away from holding to start buying or selling. This prevented the early-training chaos of random actions causing huge losses that poisoned the reward signal.
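Here's a condensed NumPy sketch of the forward pass — LayerNorm and the two-layer action head are omitted for brevity, and all weight names are illustrative — showing the shared encoder, the masked mean-pool, and how the near-zero head initialization yields HOLD by default:

```python
import numpy as np

rng = np.random.default_rng(0)
B, K, F, H = 2, 25, 44, 64   # batch, stock slots, features (40 + 4 position), width

def relu(x):
    return np.maximum(x, 0.0)

# Shared encoder: the SAME weights transform every stock slot.
W1, b1 = rng.normal(0, 0.1, (F, H)), np.zeros(H)
W2, b2 = rng.normal(0, 0.1, (H, H)), np.zeros(H)
# Action head collapsed to one linear layer here; near-zero init as described:
# zero weights, loc bias 0.0, log_scale bias -1.0 -> the initial policy HOLDs.
Wh, bh = np.zeros((2 * H, 2)), np.array([0.0, -1.0])

def forward(x, mask):
    """x: (B, K, F) per-stock features; mask: (B, K), 1 where the slot is active."""
    emb = relu(relu(x @ W1 + b1) @ W2 + b2)                  # (B, K, H), shared weights
    m = mask[..., None]                                       # (B, K, 1)
    # Masked mean-pool over stocks -> one global portfolio summary per batch row
    ctx = (emb * m).sum(1, keepdims=True) / np.maximum(m.sum(1, keepdims=True), 1.0)
    ctx = np.broadcast_to(ctx, emb.shape)                     # one copy of context per stock
    return np.concatenate([emb, ctx], axis=-1) @ Wh + bh      # (B, K, 2) -> (loc, log_scale)

out = forward(rng.normal(size=(B, K, F)), np.ones((B, K)))
assert out.shape == (B, K, 2)
assert np.allclose(out[..., 0], 0.0)   # loc == 0 everywhere: HOLD by default
```

Because the encoder weights are shared across slots, permuting the stocks permutes the outputs identically, which is exactly the equivariance property the flat MLP lacked.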

The Convergence Saga: A 47-Commit Story

Here's where it gets real. Getting this model to converge—to actually learn a useful buy/sell/hold policy—was the hardest part of the entire project. Let me walk through the major failure modes in roughly the order I encountered them.

Failure Mode 1: Buy Everything, Sell Nothing

The first training runs with V2 produced a model that bought everything it could and never sold. Buy ratio: 95%+. Sell ratio: 0%. The model would ramp up to 100% invested on day 2 and sit there.

Why? The raw reward was portfolio return minus benchmark return. If the market goes up (which it usually does in our training data), being 100% invested beats holding cash. The model found the global optimum for this reward function: buy everything immediately.

The fix was an investment band penalty. The model gets penalized for being less than 30% invested (too much cash) OR more than 70% invested (too concentrated). Quadratic penalty in both directions:

under_invested: 0.3 * (0.3 - invested_frac)^2   if invested < 30%
over_invested:  0.3 * (invested_frac - 0.7)^2    if invested > 70%

This created a "Goldilocks zone" where the model is incentivized to be 30-70% invested, not all-in or all-cash.
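As a sketch (band edges and the 0.3 weight taken from the snippet above):

```python
def investment_band_penalty(invested_frac, lo=0.3, hi=0.7, weight=0.3):
    """Quadratic penalty outside the 30-70% invested band; zero inside it."""
    if invested_frac < lo:
        return weight * (lo - invested_frac) ** 2   # too much cash
    if invested_frac > hi:
        return weight * (invested_frac - hi) ** 2   # too concentrated
    return 0.0                                      # the Goldilocks zone
```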

Failure Mode 2: Sell Everything, Hold Cash

After adding the band penalty, I also added asymmetric risk aversion—losses get penalized 1.5x versus gains. Seemed reasonable. What I didn't anticipate: the model discovered that the safest way to avoid the 1.5x loss penalty is to not hold anything at all.

This was a subtler version of the cash-hoarding problem. The model would buy a few positions to satisfy the "be at least 30% invested" penalty, then sell them at the first sign of any drawdown. Win rate was terrible because it was cutting winners short and eating transaction costs.

The fix was risk_aversion_gain_frac—applying a fraction of the asymmetric penalty to gains too. With this set to 0, gains get full credit. The model learns that holding winners is actually rewarding, not just "not losing."

fix: symmetric risk-aversion and wire cash_penalty to fix sell-everything convergence

That was an actual commit message. You can tell I was frustrated.

Failure Mode 3: 0% Sell Ratio (Even With Positions)

This one was sneaky. The model learned to buy, learned to hold in the right range, but the sell ratio was literally zero percent. It would buy positions and hold them forever, even when they were losing money.

The problem was that the action space was asymmetric. Buying creates a new position (positive feedback—the model sees new unrealized P&L signals). Selling just removes a position (the slot goes quiet). There was nothing in the reward function that specifically encouraged realized profit-taking.

Two fixes, same commit:

Hold-as-sell symmetry: Actions in the mild negative range (-0.3 to 0) now count as "hold-as-sell"—a weak sell signal. This gives the model a wider zone for expressing "I'm somewhat bearish on this position" without needing the full conviction of a strong sell signal below -0.3.

Profit-taking reward: A direct bonus for closing profitable positions:

profit_taking_bonus = profit_taking_weight * min(realized_pnl / portfolio_val, 0.05)  # capped at 5%

This was the commit that finally got sell ratios above 5%. Not glamorous. But the model needed explicit incentive structure to learn that selling can be good.
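A minimal sketch of that bonus, assuming a weight of 1.0 and that unprofitable closes earn nothing (the non-negative guard is my assumption):

```python
def profit_taking_bonus(realized_pnl, portfolio_val, weight=1.0, cap=0.05):
    """Reward realized gains, capped at 5% of portfolio value per step."""
    if realized_pnl <= 0:
        return 0.0          # assumption: losing closes get no bonus (and no extra penalty)
    return weight * min(realized_pnl / portfolio_val, cap)
```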

Failure Mode 4: NaN Explosions

NaN Explosion Timeline

This one ate three days. Training would run for 50-100 iterations, everything looking fine, then suddenly: NaN in the loss, NaN in the gradients, NaN in the weights. Model destroyed.

The root cause was the TanhNormal distribution. With 25 continuous action dimensions, the log-probability computation involves atanh of the sampled action. If the action is close to -1 or 1 (which happens regularly since we use tanh), atanh produces huge values. Those flow through the PPO ratio calculation (exp(log_prob_new - log_prob_old)) and blow up.
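You can see the blow-up with a quick NumPy check — atanh diverges as the squashed action approaches ±1:

```python
import numpy as np

# atanh (the inverse of tanh) grows without bound near |a| = 1, so log-probs
# of near-saturated actions become enormous before they hit the PPO ratio.
for a in (0.9, 0.999, 0.9999999):
    print(f"atanh({a}) = {np.arctanh(a):.2f}")
```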

The fix was a stack of numerical guards:

# 1. Clamp log-probabilities to prevent infinite PPO ratios
log_prob = log_prob.clamp(-20, 0)

# 2. Clamp individual loss components
objective_loss = objective_loss.clamp(-100, 100)
entropy_loss = entropy_loss.clamp(-100, 100)

# 3. Detect NaN in batch and skip the update
if torch.isnan(total_loss).any():
    optimizer.zero_grad()  # Reset optimizer state
    # Rollback to pre-update weights
    model.load_state_dict(checkpoint_weights)
    continue

# 4. Scale entropy coefficient by action dimension
# Without this, entropy bonus overwhelms the objective with K=25 actions
entropy_coeff = base_entropy_coeff / action_dim

That last one—scaling entropy by action dimension—was the real insight. Standard PPO implementations use a fixed entropy coefficient (like 0.01). But entropy of a 25-dimensional continuous distribution is way larger than entropy of a single discrete action. The entropy bonus was drowning out the actual policy gradient signal.

Failure Mode 5: Stochastic Exploration Collapse

This one was philosophically interesting. PPO uses stochastic policies—the model outputs a mean and standard deviation, then samples from that distribution during data collection. The idea is that randomness drives exploration.

For a single-action environment, this works great. For 25 continuous actions? The random noise causes the model to simultaneously buy and sell at random, the portfolio churns, transaction costs destroy returns, and the reward signal is pure noise. The model can't distinguish "my policy mean was good but noise made it bad" from "my policy mean was bad."

The fix was deterministic exploration with fixed noise:

# Instead of sampling from TanhNormal(loc, scale):
action = tanh(loc) + gaussian_noise(sigma=0.05)

# The policy mean (loc) directly affects the reward
# Small fixed noise provides exploration without destroying signal

This was a big philosophical departure from textbook PPO. By using deterministic actions during collection, the reward signal directly reflects the quality of the policy mean. The small Gaussian noise (sigma=0.05) still provides some exploration, but it's controlled and doesn't produce the wild swings that stochastic sampling causes in high-dimensional action spaces.
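A sketch of the collection-time action (the final clip to [-1, 1] is my assumption, to keep noisy actions inside the valid range):

```python
import numpy as np

rng = np.random.default_rng(42)

def collect_action(loc, sigma=0.05):
    """Deterministic policy mean plus small fixed exploration noise."""
    a = np.tanh(loc) + rng.normal(0.0, sigma, size=np.shape(loc))
    return np.clip(a, -1.0, 1.0)   # assumption: clip back into the action bounds
```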

fix: use deterministic actions in collector so rewards push policy mean

Another commit message that tells a story.

The Reward Function: 11 Components and Counting

If you've been reading carefully, you've noticed a pattern: every convergence failure was fixed by adding or modifying a reward component. The reward function grew from "portfolio return minus benchmark" to an 11-component beast. Here's what it looks like now:

Reward Function Components

| # | Component | Purpose | When active |
|---|-----------|---------|-------------|
| 1 | Excess return (vs SPY) | Alpha generation | Always |
| 2 | Churn penalty | Discourage overtrading | Iter 40+ |
| 3 | Extreme action penalty | Penalize wild swings on inactive stocks | Iter 80+ |
| 4 | Entropy bonus | Encourage position diversification | Iter 80+ |
| 5 | Volatility penalty | Keep realized vol near target | Iter 120+ |
| 6 | Drawdown penalty | Avoid deep portfolio losses | Iter 120+ |
| 7 | Holding bonus + profit-taking | Reward patience + realized gains | Iter 200+ |
| 8 | Sector concentration penalty | Prevent sector overexposure | Iter 200+ |
| 9 | VIX-aware investment limit | Reduce exposure in high-vol regimes | Iter 200+ |
| 10 | Momentum alignment | Trade with the trend | Iter 200+ |
| 11 | Investment band penalty | Stay 30-70% invested | Always |

That "When Active" column is the curriculum schedule—more on that in a second.

The key insight is that you can't dump all 11 components on the model from step one. I tried. The gradients are pure noise—the model has no idea which of 11 signals to follow, so it follows none of them. It converges to "hold everything, do nothing" because that's the only policy that doesn't violate any penalty.

Curriculum Learning: Teaching a Model to Walk Before It Runs

The curriculum scheduler progresses along four axes simultaneously, each on a staggered schedule:

Iteration:      0    40    80   120   150   180   200   250   280   300
                │     │     │     │     │     │     │     │     │     │
Window length:  200 ──────▶ 400 ──────▶ 800 ────────────▶ Full ──────
Active rewards: 3 ───▶ 5 ──▶ 8 ────────────▶ 11 ────────────────────
Obs noise:      0.03 ───────────────▶ 0.01 ────────────▶ 0.0 ───────
Learning rate:  0.003 ──────────────────────▶ 0.001 ─────▶ 0.0003 ──

Episode windows: Early training uses 200-step sub-windows randomly sampled from the full time series. The model only sees short chunks—think of it as learning to walk before running. Gradually we extend to 400, 800, and finally the full sequence. This prevents the model from getting overwhelmed by long horizons early on.

Reward progression: Start with just 3 components (excess return, churn penalty, extreme action penalty). The model learns the basics: generate alpha, don't over-trade, don't go crazy. Then we layer in volatility management, drawdown avoidance, and the more nuanced components.

Observation noise: This one is sneaky. Early in training, we add Gaussian noise (std=0.03) to all observations. This forces the model to learn robust features—it can't memorize specific price patterns because they're noisy. We anneal the noise to zero by iteration 280.

Learning rate decay: High LR early for fast learning on the simpler curriculum stages, decayed for fine-tuning on the full objective.

The critical detail: these transitions are staggered. We never advance two axes on the same iteration. If the window length increases at iteration 80, the reward components don't change until later. One thing at a time. An early version tried advancing everything together, and the model would destabilize at every transition point—the optimization landscape changed too much at once.
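A staggered schedule reduces to a "most recent stage value" lookup; here's a minimal sketch with illustrative breakpoints (read the real ones off the diagram above — the point is that each axis gets its own list, so no two axes are forced to change on the same iteration):

```python
def stage_value(schedule, iteration):
    """schedule: sorted list of (start_iteration, value); return the latest stage reached."""
    value = schedule[0][1]
    for start, v in schedule:
        if iteration >= start:
            value = v
    return value

# Illustrative breakpoints, staggered so no two axes transition together
active_rewards = [(0, 3), (40, 5), (80, 8), (150, 11)]
obs_noise      = [(0, 0.03), (120, 0.01), (280, 0.0)]
```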

fix: stagger curriculum transitions to prevent destabilization

Episode Preprocessing: Building the Training Data

The RL model doesn't train on raw market data. It trains on carefully constructed episode arrays. The pipeline is two phases:

Phase 1 (parallel, 16 instances): Load scored parquet files, compute 40 technical features per stock per day, write feature-engineered parquets.

Phase 2 (single instance): Merge everything into dense (T, K, F) arrays. T = number of trading days, K = 25 stocks, F = 40 features. Time-based train/val split (80/20, no lookahead bias). Z-score normalization on features, clipped at 3 standard deviations.
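The z-score-and-clip step might look like this (a sketch; normalizing over the time axis is assumed from the (T, K, F) layout):

```python
import numpy as np

def normalize(features, clip=3.0):
    """Z-score each feature over the time axis, then clip at 3 standard deviations."""
    mu = features.mean(axis=0, keepdims=True)
    sd = features.std(axis=0, keepdims=True) + 1e-8   # avoid divide-by-zero on flat features
    return np.clip((features - mu) / sd, -clip, clip)

z = normalize(np.arange(10.0).reshape(5, 2))   # toy (T=5, F=2) example
assert np.all(np.abs(z) <= 3.0)
```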

The output is a .npz file with:

  • features: (T_train, 25, 40) — the observation tensor
  • returns: (T_train, 25) — daily stock returns
  • mask: (T_train, 25) — which stocks are active on each day
  • spy_returns: (T_train,) — benchmark returns for excess return calculation
  • vix_levels: (T_train,) — VIX for regime-aware position sizing
  • sectors: sector labels per stock per day for concentration penalties

One thing that bit us: some stocks only become active partway through the training period (IPOs, new listings). The mask tensor handles this, but early versions of the environment didn't properly handle masked stocks in the reward function. The model was getting penalized for "not investing" in stocks that didn't exist yet. Filtering the subsample to prioritize stocks that are active early in the episode fixed this.

fix: prioritize early-active stocks in HPO subsample to prevent zero reward

The TorchRL Migration

We started on Ray RLlib—the standard choice for RL at scale. It worked. It was also a black box. When training diverged, the debugging story was "add print statements to RLlib's internal PPO loop" which involves navigating six levels of class hierarchy and a custom execution framework.

The migration to TorchRL (PyTorch's official RL library) was motivated by control. TorchRL gives you the building blocks—loss modules, data collectors, replay buffers—but you write the training loop yourself. When something goes wrong, you're debugging PyTorch code, not framework abstractions.

The migration took about 3 weeks across several PRs:

feat: migrate RL training from Ray RLlib to TorchRL with concert blending
fix: MLP out_activation_class not supported in TorchRL 0.11.1
fix: use Bounded instead of deprecated BoundedTensorSpec in tests
fix: fully implement TorchRL training — target updates, curriculum, LayerNorm
fix: include torchrl_trainer in S3 deploy + fix GPU/CPU device mismatch
perf: enable GPU learner + increase env runners for faster RL training

The payoff was immediate. When the NaN explosions started (failure mode 4 above), I could add the clamping guards directly in the training loop. With RLlib, that would have required monkey-patching internal loss functions.

TorchRL also made the stock-parallel architecture straightforward. The ProbabilisticActor wraps any nn.Module that outputs (loc, scale), so swapping the flat MLP for StockParallelActor was essentially a config change.

The Objective Metric: Stop Optimizing the Wrong Thing

I covered the Sortino trap in the previous post, but it's worth revisiting in the context of RL specifically.

The HPO system runs multiple RL training trials, each with different hyperparameters, and picks the best one based on an objective metric. For months, that metric was the Sortino ratio on a backtest window.

The problem: Sortino rewards low downside deviation. The easiest way to have low downside deviation is to not trade. A model that holds 95% cash and makes one tiny profitable trade gets a Sortino ratio through the roof. HPO would find these do-nothing models and crown them champion.

The composite metric that actually works:

composite = excess_return * min(max(sortino, 0), 3.0)

If excess return is negative (trailing SPY), the composite is negative regardless of Sortino. If Sortino is gaming the denominator to produce infinite values, it's capped at 3.0. You need both alpha AND risk-adjusted quality to score well.

We also added deployment gates: buy ratio must be >= 5% (the model actually has to generate buy signals), and backtest return must be >= 0.5%. These seem obvious in retrospect. They weren't.
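Putting the composite and the gates together — folding the gates into the objective as negative infinity is my simplification; the weights and thresholds are from the text:

```python
def hpo_objective(excess_return, sortino, buy_ratio, backtest_return):
    """Composite HPO score with deployment gates folded in as hard failures."""
    if buy_ratio < 0.05 or backtest_return < 0.005:
        return float("-inf")                       # gate: must actually trade and profit
    # Negative alpha stays negative regardless of Sortino; Sortino is capped at 3
    return excess_return * min(max(sortino, 0.0), 3.0)
```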

fix: zero sortino bonus in composite if model isn't actively trading

What 47 Commits Taught Me

Looking back at the git log, a few patterns emerge:

The reward function IS the specification. Whatever you reward, you get. Whatever you don't penalize, the model will exploit. I thought I was defining a reward function; really I was writing a specification for portfolio behavior, one edge case at a time.

High-dimensional continuous action spaces are a different beast. Most RL research focuses on discrete actions or low-dimensional continuous control. With 25 continuous actions, standard assumptions about exploration, entropy, and convergence don't hold. You need structural solutions (stock-parallel architecture, deterministic exploration, entropy scaling).

Curriculum learning isn't optional for complex rewards. Trying to optimize 11 reward components from the start is asking the model to solve an 11-dimensional multi-objective optimization problem with noisy gradients. Start simple. Add complexity gradually. Stagger the transitions.

Numerical stability requires active engineering. The NaN explosion issue would never appear in a CartPole environment. It only surfaces with high-dimensional TanhNormal distributions under PPO's importance ratio. You need explicit guards at every potential infinity point.

The objective metric shapes behavior more than the model. I spent weeks tweaking architectures when the real problem was that HPO was optimizing the wrong thing. Get the metric right first. Then tune the model.

Where It Stands

The current model (stock-parallel PPO with 11-component curriculum reward) consistently:

  • Maintains 30-70% investment levels
  • Generates buy AND sell signals (not just one-sided)
  • Holds winners longer than losers (holding bonus working as intended)
  • Reduces exposure when VIX spikes
  • Avoids sector concentration

Is it beating the market? Sometimes. On good months it generates 2-5% excess return. On bad months it tracks SPY closely without major drawdowns. The Sortino ratio on recent backtests is in the 1.5-2.5 range—not spectacular, but real.

More importantly, it's a framework that's improvable. Each monthly retraining cycle gets warm-started from the previous best model. The curriculum can be extended. New reward components can be added without rewriting everything. The 47 commits got us to a foundation. The next 47 will (hopefully) get us to something that makes consistent money.

But I'm not going to pretend the hard part is over. Every time I think the model is converged, it finds a new way to surprise me. That's RL for you.


This is Post 4 of an 8-part series on building a full-stack AI trading application with LLM coding agents. Previous: Building the ML Pipeline | Next: CI/CD and Deployment When AI Writes the Code
