@ttarler
Created January 27, 2026 01:09

The ML Model Was 'Live' for 6 Days—It Never Made a Single Decision

"Deployed and Working"

Jan 8: "RL model deployed to SageMaker endpoint. Trading with trained model."

Jan 9: "RL model is active and making portfolio decisions."

Jan 10-13: Portfolio trades every day. System reports "working."

Jan 14: User asks, "Are you sure the RL model is actually being used?"

I check the logs. trained_model=false on every single call. For 6 days, the system fell back to simple heuristics. The neural network, the product of 20 hyperparameter tuning jobs on SageMaker's expensive GPU instances, never made a single production decision.

This post is about the most expensive ML failure I've encountered: a model that was "deployed" but never used.

The Timeline: How It Went Wrong

Jan 8: Training Complete, "Deployment" Begins

What I did:

  1. Ran hyperparameter optimization (20 training jobs, ml.g4dn.xlarge GPUs, ~2 hours)
  2. Selected best model (highest reward)
  3. Saved model artifact to S3
  4. Created SageMaker endpoint
  5. Updated application to call endpoint
  6. Tested with curl: endpoint returned JSON ✅

What I announced: "RL model deployed and working."

What I didn't verify: Did the application actually use the model's predictions? Did it fall back to heuristics? Did the model even load successfully?
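In hindsight, the missing check fits in a few lines. This is a sketch, not the project's actual code: it assumes the response payload carries the `used_trained_model` flag the endpoint already returned (buried, as it turned out).

```python
def assert_model_used(payload: dict) -> None:
    """Post-deployment smoke check: raise loudly if the endpoint fell back."""
    if not payload.get("used_trained_model"):
        raise RuntimeError(
            f"Endpoint fell back to heuristic: {payload.get('error', 'unknown')}"
        )

# A curl test only proves "endpoint returns JSON"; this proves the model ran.
assert_model_used({"prediction": [0.2, 0.8], "used_trained_model": True})
```

Run once against the live endpoint right after deployment, this would have turned Jan 8's announcement into an error message instead.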

Jan 9-13: "Trading With Trained Model"

What the logs showed:

[Jan 9 09:35] RL portfolio decision made: 5 positions allocated
[Jan 9 09:40] RL portfolio decision made: 3 positions allocated
[Jan 10 09:35] RL portfolio decision made: 4 positions allocated

What the logs didn't show: trained_model=false was buried in the response payload, never logged at the top level.

What I reported: "RL model making decisions, portfolio performing well."

Reality: Every decision came from a simple heuristic: if buy_probability - sell_probability > 0.15 → BUY.
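Spelled out as code, the entire "RL strategy" that actually ran was this (a sketch; only the BUY branch appears in the logs, the HOLD default is an assumption):

```python
def heuristic_action(buy_probability: float, sell_probability: float) -> str:
    """The fallback that made every 'RL' decision for six days."""
    # BUY when the prescreening edge clears a fixed threshold; otherwise HOLD.
    return "buy" if buy_probability - sell_probability > 0.15 else "hold"
```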

Jan 14: The Discovery

User: "Are you sure the RL model is actually being used? The decisions seem too simple."

Me: Checks SageMaker CloudWatch logs. Sees ModuleNotFoundError: No module named 'ray' on every inference request.

Then checks application logs more carefully:

response = await sagemaker.invoke_endpoint(...)
payload = json.loads(response['Body'].read())

# This was always false:
if payload.get('used_trained_model'):
    logger.info("RL model made decision")
else:
    logger.warning("RL endpoint fell back to heuristic")  # ← ALWAYS THIS PATH

Reality: The model artifact contained RLlib checkpoint objects (Ray dependencies). The SageMaker inference container couldn't deserialize them. Endpoint fell back to heuristic logic on every request.

The Root Causes

Cause 1: Training vs Inference Architecture Mismatch

Training: RLlib PPO model (multi-asset portfolio allocator)

Inference expectation: Per-symbol discrete actions (BUY/HOLD/SELL)

The "solution": Add threshold heuristic to convert portfolio allocations to discrete actions.

The actual problem: The heuristic replaced the model rather than converting its output. The model never loaded.

Cause 2: Ray Objects in model.pkl

Training code (simplified):

# Saved RLlib checkpoint as model artifact
checkpoint = algo.save()
model_path = os.path.join(model_dir, "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(checkpoint, f)  # ← WRONG: Saves Ray objects

Inference container: Standard PyTorch container, no Ray installed.

Result: ModuleNotFoundError: No module named 'ray' on every load attempt.

Model never loaded. Endpoint used fallback heuristic.
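The failure mode is inherent to pickle: the stream stores a reference to the defining module and class, not the code itself, so unpickling requires the same packages installed in the loading environment. A minimal illustration:

```python
import pickle


class Checkpoint:  # stand-in for an RLlib checkpoint object
    pass


blob = pickle.dumps(Checkpoint())

# The pickle stream names the class it will need to import at load time:
assert b"Checkpoint" in blob

# Unpickling in a container where that module isn't installed raises
# ModuleNotFoundError before the model ever loads -- exactly what the
# CloudWatch logs showed for 'ray'.
```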

Cause 3: Feature Mismatch (18 vs 3)

Even when we fixed the Ray issue, there was a deeper problem.

Training features (18 total):

RL_MODEL_FEATURES = [
    "return", "log_return", 
    "sma_fast", "sma_slow", "ema_fast", "ema_slow",
    "ret_roll", "vol_roll", "vol_30",
    "rsi_14", "macd_line", "macd_signal", 
    "bb_upper", "bb_lower", "atr_14", "mom_10",
    "buy_probability", "sell_probability"
]

Inference features (3 total):

# In rl_portfolio.py
features = {
    "buy_probability": scores.get("buy_probability", 0.5),
    "sell_probability": scores.get("sell_probability", 0.3),
    "net_score": scores.get("buy_probability", 0.5) - scores.get("sell_probability", 0.3)
}

What happened: The endpoint received 3 features, but the model expected 18. Missing features were filled with 0.0; and since net_score isn't in the training list, only 2 of the 18 inputs carried any signal.

Result: Neural network received mostly-zero input vector → garbage predictions → fell back to heuristics.
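Under the assumption (per the behavior described above) that absent features default to 0.0, the garbage input is easy to reproduce from the two lists:

```python
RL_MODEL_FEATURES = [
    "return", "log_return", "sma_fast", "sma_slow", "ema_fast", "ema_slow",
    "ret_roll", "vol_roll", "vol_30", "rsi_14", "macd_line", "macd_signal",
    "bb_upper", "bb_lower", "atr_14", "mom_10",
    "buy_probability", "sell_probability",
]

# What inference actually sent (note: net_score isn't even in the list above):
sent = {"buy_probability": 0.7, "sell_probability": 0.3, "net_score": 0.4}

# Endpoint builds the input vector, defaulting absent features to 0.0:
vector = [sent.get(name, 0.0) for name in RL_MODEL_FEATURES]
zero_count = sum(1 for v in vector if v == 0.0)
print(f"{zero_count} of {len(vector)} inputs are zero")  # → 16 of 18 inputs are zero
```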

Cause 4: Silent try/except Fallback

The code that masked the problem:

# BAD: Silent fallback
try:
    action = model.predict(features)
except Exception as e:
    logger.warning(f"Model prediction failed: {e}")
    # Silent fallback to heuristic
    action = 'buy' if features['buy_probability'] > 0.7 else 'hold'

Why this is wrong:

  1. Exception happens on every call
  2. Logged as "warning," not error
  3. Fallback silently succeeds
  4. System appears to work
  5. No metrics tracking model vs fallback usage

Better approach: Fail fast.

# GOOD: Fail fast, track metrics
model_inference_count = 0
heuristic_count = 0

try:
    action = model.predict(features)
    model_inference_count += 1
except Exception as e:
    logger.error(f"CRITICAL: Model prediction failed: {e}")
    heuristic_count += 1
    raise  # Fail fast, don't mask the error

# At end of run, verify:
if heuristic_count > 0:
    raise RuntimeError(f"Model failed {heuristic_count} times - investigation required")

Cause 5: No End-to-End Verification

What I tested:

  • Model trains ✅
  • Model saved to S3 ✅
  • Endpoint created ✅
  • Endpoint returns JSON ✅

What I didn't test:

  • Model actually loads
  • Model produces varying outputs
  • Application uses model predictions ❌
  • System tracks model vs fallback usage ❌

The Fix: Training-Inference Feature Parity

Step 1: Single Feature List (Source of Truth)

# backend/app/services/rl_portfolio.py
RL_MODEL_FEATURES = [
    "return", "log_return", "sma_fast", "sma_slow", "ema_fast", "ema_slow",
    "ret_roll", "vol_roll", "vol_30", "rsi_14", "macd_line", "macd_signal",
    "bb_upper", "bb_lower", "atr_14", "mom_10",
    "buy_probability", "sell_probability"
]

Rule: If this list changes, you MUST retrain the model.
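One way to make that rule enforceable rather than aspirational (a sketch, not the project's actual code): fingerprint the list at training time, store the fingerprint with the model artifact, and refuse to serve if it no longer matches.

```python
import hashlib
import json

RL_MODEL_FEATURES = [
    "return", "log_return", "sma_fast", "sma_slow", "ema_fast", "ema_slow",
    "ret_roll", "vol_roll", "vol_30", "rsi_14", "macd_line", "macd_signal",
    "bb_upper", "bb_lower", "atr_14", "mom_10",
    "buy_probability", "sell_probability",
]


def feature_fingerprint(features: list) -> str:
    """Order-sensitive hash: reordering changes the input vector too."""
    return hashlib.sha256(json.dumps(features).encode()).hexdigest()


# Training: persist the fingerprint next to the model artifact.
trained_with = feature_fingerprint(RL_MODEL_FEATURES)

# Inference: compare before serving; a mismatch means "retrain first".
if feature_fingerprint(RL_MODEL_FEATURES) != trained_with:
    raise RuntimeError("Feature list changed since training - retrain the model")
```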

Step 2: Training Uses the Same List

# backend/sagemaker/train_ddpg.py
from app.services.rl_portfolio import RL_MODEL_FEATURES

# Verify data has all required features
missing = [f for f in RL_MODEL_FEATURES if f not in df.columns]
if missing:
    raise ValueError(f"Training data missing features: {missing}")

Step 3: Inference Computes ALL Features

# backend/app/services/rl_portfolio.py
async def _compute_rl_features(bars, buy_prob, sell_prob):
    """
    Compute all 18 features from OHLCV bars.
    NOT just prescreening scores!
    """
    features = {}
    
    # Compute returns
    features['return'] = (bars[-1]['close'] - bars[-2]['close']) / bars[-2]['close']
    features['log_return'] = np.log(bars[-1]['close'] / bars[-2]['close'])
    
    # Compute moving averages
    closes = [b['close'] for b in bars]
    features['sma_fast'] = np.mean(closes[-10:])
    features['sma_slow'] = np.mean(closes[-50:])
    features['ema_fast'] = _ema(closes, 10)
    features['ema_slow'] = _ema(closes, 50)
    
    # Compute volatility
    returns = np.diff(np.log(closes))
    features['ret_roll'] = np.mean(returns[-10:])
    features['vol_roll'] = np.std(returns[-10:])
    features['vol_30'] = np.std(returns[-30:])
    
    # Compute indicators
    features['rsi_14'] = _rsi(closes, 14)
    features['macd_line'], features['macd_signal'] = _macd(closes)
    # ... etc for all 18 features
    
    # Add prescreening scores
    features['buy_probability'] = buy_prob
    features['sell_probability'] = sell_prob
    
    # Validate feature count
    assert len(features) == len(RL_MODEL_FEATURES), \
        f"Feature mismatch: {len(features)} vs {len(RL_MODEL_FEATURES)}"
    
    return features

Key: Inference computes the same features training used, from raw OHLCV data.

Step 4: Endpoint Validates Features

# backend/sagemaker/inference.py
def input_fn(request_body, content_type):
    """Validate features on receive"""
    data = json.loads(request_body)
    features = data['features']
    
    # Validate feature count
    if len(features) != len(EXPECTED_FEATURES):
        missing = len(EXPECTED_FEATURES) - len(features)
        logger.error(f"CRITICAL: Missing {missing}/{len(EXPECTED_FEATURES)} features")
        raise ValueError(f"Expected {len(EXPECTED_FEATURES)} features, got {len(features)}")
    
    return features

Step 5: Response Indicates Model Usage

# backend/sagemaker/inference.py
def predict_fn(input_data, model):
    """Return flag indicating if model was used"""
    try:
        prediction = model.predict(input_data)
        return {
            'prediction': prediction,
            'used_trained_model': True,
            'feature_count': len(input_data)
        }
    except Exception as e:
        logger.error(f"Model inference failed: {e}")
        # DON'T fall back silently - return error status
        return {
            'prediction': None,
            'used_trained_model': False,
            'error': str(e)
        }

Step 6: Application Verifies Model Was Used

# backend/app/services/rl_portfolio.py
async def make_decisions(candidates, positions, cash):
    """Call RL endpoint and verify model was used"""
    features = await _compute_rl_features(...)
    
    response = await sagemaker.invoke_endpoint(...)
    payload = json.loads(response['Body'].read())
    
    # CRITICAL: Verify model was actually used
    if not payload.get('used_trained_model'):
        error = payload.get('error', 'Unknown error')
        raise RuntimeError(f"RL model not used: {error}")
    
    # Log success at top level
    logger.info("RL model made decision (trained_model=True)")
    
    return _payload_to_decisions(payload)

No silent fallbacks. If the model fails, raise an error. Force investigation.

The Verification Checklist

After deploying an ML model, verify:

1. Model Loads Successfully

# Check SageMaker CloudWatch logs for model loading
aws logs tail /aws/sagemaker/Endpoints/trading-prod-rl-portfolio \
  --since 5m --follow | grep -i "model loaded"

# Should see: "Model loaded successfully from /opt/ml/model"
# Should NOT see: "ModuleNotFoundError" or "FileNotFoundError"

2. Features Match Training

# In endpoint test
test_features = _compute_rl_features(test_bars, 0.7, 0.3)
print(f"Feature count: {len(test_features)}")
print(f"Expected: {len(RL_MODEL_FEATURES)}")

assert len(test_features) == len(RL_MODEL_FEATURES), "Feature mismatch!"

3. Model Produces Varying Outputs

# Test with different inputs
result1 = endpoint.predict(features_bullish)
result2 = endpoint.predict(features_bearish)

assert result1 != result2, "Model returning constant output!"

4. Application Uses Model (Not Fallback)

# Check application logs
docker-compose logs celery_worker --tail=100 | grep "trained_model"

# Should see: trained_model=True
# Should NOT see: trained_model=False or "fell back to heuristic"

5. Inference Counts Tracked

# Add metrics to your code
model_inference_count = 0
heuristic_count = 0

# At end of trading day
logger.info(f"Model inferences: {model_inference_count}")
logger.info(f"Heuristic fallbacks: {heuristic_count}")

# Alert if fallback rate > 0%
if heuristic_count > 0:
    alert("RL model falling back to heuristics - investigate!")

The Policies I Encoded

Policy 1: No Silent Fallbacks

BAD:

try:
    result = model.predict(features)
except Exception:
    result = default_value  # Silent failure

GOOD:

try:
    result = model.predict(features)
    model_inference_count += 1
except Exception as e:
    logger.error(f"CRITICAL: Model inference failed: {e}")
    heuristic_count += 1
    raise  # Fail fast

Policy 2: Track Inference vs Fallback

# Add counters
if response.get('used_trained_model'):
    model_used_count += 1
else:
    fallback_used_count += 1

# Log daily
logger.info(f"Model usage: {model_used_count}/{model_used_count + fallback_used_count}")

Policy 3: Validate Results Vary

# Check for constant outputs (sign of broken inference)
predictions = [model.predict(f) for f in test_features]
if len(set(predictions)) == 1:
    raise RuntimeError("Model returning constant predictions - broken!")

Policy 4: Definition of "Done" for ML

Done ≠ "Training complete"

Done = All of these:

  1. Model trains
  2. Model persists (saved to S3)
  3. Model loads in inference container
  4. Endpoint returns predictions
  5. Application uses predictions (not fallback)
  6. E2E test proves model is called
  7. Monitoring shows real predictions in logs
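Item 6 can be a small in-process test. Here is a sketch where `invoke` stands in for whatever client calls the real endpoint (the lambda at the bottom is a fake for illustration only; in production you would pass your SageMaker client call):

```python
def e2e_model_invocation_check(invoke) -> None:
    """Call the endpoint with opposing inputs; require model-produced,
    varying outputs -- the two properties the curl test never checked."""
    bullish = invoke({"buy_probability": 0.9, "sell_probability": 0.1})
    bearish = invoke({"buy_probability": 0.1, "sell_probability": 0.9})
    assert bullish["used_trained_model"], "fell back on bullish input"
    assert bearish["used_trained_model"], "fell back on bearish input"
    assert bullish["prediction"] != bearish["prediction"], "constant output"


# Illustrative fake endpoint only; substitute your real invoke_endpoint call.
e2e_model_invocation_check(
    lambda f: {"used_trained_model": True,
               "prediction": f["buy_probability"] - f["sell_probability"]}
)
```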

Key Takeaways

  1. Training-inference feature parity is non-negotiable: Use the same feature list, validate on both sides.

  2. No silent fallbacks: If model fails, fail fast. Track inference vs fallback counts.

  3. Verify model is actually used: Don't just test that endpoint returns JSON. Verify application uses the predictions.

  4. "Deployed" ≠ "Working": Check that model loads, produces varying outputs, and is called by application.

  5. Track inference counts: model_inference_count vs heuristic_count. Alert if fallback rate > 0%.

  6. Definition of done includes verification: Model isn't done until E2E test proves it's invoked and monitoring shows real predictions.

  7. Fail fast, investigate thoroughly: One failure is a signal. Don't mask it with silent fallbacks.

Your Turn: Verify Your Model

If you have an ML model in production, verify:

  1. Model actually loads (check logs)
  2. Features match training (count and names)
  3. Model produces varying outputs (not constant)
  4. Application uses predictions (not fallback)
  5. Inference counts tracked (model vs fallback)

If any check fails, you might have a model that's "deployed" but not used.

In the next post, I'll share more operational lessons—including why "scheduled" doesn't mean "working" and the async+DB patterns that saved me.


This is Post 6 of an 8-part series on building a full-stack AI trading application with LLM coding agents. Next: "Scheduled ≠ Working" and Other Expensive Assumptions.
