Jan 8: "RL model deployed to SageMaker endpoint. Trading with trained model."
Jan 9: "RL model is active and making portfolio decisions."
Jan 10-13: Portfolio trades every day. System reports "working."
Jan 14: User asks, "Are you sure the RL model is actually being used?"
I check the logs. trained_model=false on every single call. For six days, the system had silently fallen back to simple heuristics. The trained neural network—selected from 20 hyperparameter tuning jobs on SageMaker's expensive GPU instances—never made a single production decision.
This post is about the most expensive ML failure I've encountered: a model that was "deployed" but never used.
What I did:
- Ran hyperparameter optimization (20 training jobs, ml.g4dn.xlarge GPUs, ~2 hours)
- Selected best model (highest reward)
- Saved model artifact to S3
- Created SageMaker endpoint
- Updated application to call endpoint
- Tested with curl: endpoint returned JSON ✅
What I announced: "RL model deployed and working."
What I didn't verify: Did the application actually use the model's predictions? Did it fall back to heuristics? Did the model even load successfully?
What the logs showed:
```
[Jan 9 09:35] RL portfolio decision made: 5 positions allocated
[Jan 9 09:40] RL portfolio decision made: 3 positions allocated
[Jan 10 09:35] RL portfolio decision made: 4 positions allocated
```
What the logs didn't show: trained_model=false was buried in the response payload, never logged at the top level.
What I reported: "RL model making decisions, portfolio performing well."
Reality: Every decision came from a simple heuristic: if buy_probability - sell_probability > 0.15 → BUY.
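That heuristic was nothing more than a threshold on the prescreening scores. A minimal sketch of what was actually running (the function name is mine, and the symmetric SELL branch is my assumption—the post only states the BUY rule):

```python
def decide(buy_probability: float, sell_probability: float,
           threshold: float = 0.15) -> str:
    """Threshold heuristic: act on the gap between prescreening scores."""
    net = buy_probability - sell_probability
    if net > threshold:
        return "BUY"
    if net < -threshold:   # assumed symmetric; not stated in the original rule
        return "SELL"
    return "HOLD"
```

No neural network required—which is exactly why the decisions "seemed too simple."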
User: "Are you sure the RL model is actually being used? The decisions seem too simple."
Me: Checks SageMaker CloudWatch logs. Sees ModuleNotFoundError: No module named 'ray' on every inference request.
Then checks application logs more carefully:
```python
response = await sagemaker.invoke_endpoint(...)
payload = json.loads(response['Body'].read())

# This was always false:
if payload.get('used_trained_model'):
    logger.info("RL model made decision")
else:
    logger.warning("RL endpoint fell back to heuristic")  # ← ALWAYS THIS PATH
```

Reality: The model artifact contained RLlib checkpoint objects (Ray dependencies). The SageMaker inference container couldn't deserialize them. The endpoint fell back to heuristic logic on every request.
Training: RLlib PPO model (multi-asset portfolio allocator)
Inference expectation: Per-symbol discrete actions (BUY/HOLD/SELL)
The "solution": Add threshold heuristic to convert portfolio allocations to discrete actions.
The actual problem: The heuristic replaced the model rather than converting its output. The model never loaded.
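For contrast, here's what a genuine conversion layer might look like: thresholding the model's per-symbol target weights into discrete actions, so the model's output still drives the decision. This is a sketch; the function name and the `min_delta` threshold are illustrative, not the production code:

```python
def allocations_to_actions(allocations: dict[str, float],
                           current_weights: dict[str, float],
                           min_delta: float = 0.02) -> dict[str, str]:
    """Convert the model's target portfolio weights into per-symbol
    BUY/HOLD/SELL actions - the model's output is the input here."""
    actions = {}
    for symbol, target in allocations.items():
        delta = target - current_weights.get(symbol, 0.0)
        if delta > min_delta:
            actions[symbol] = "BUY"    # model wants a larger position
        elif delta < -min_delta:
            actions[symbol] = "SELL"   # model wants a smaller position
        else:
            actions[symbol] = "HOLD"
    return actions
```

The difference matters: this layer is downstream of an inference call, not a substitute for one.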
Training code (simplified):
```python
# Saved RLlib checkpoint as model artifact
checkpoint = algo.save()
model_path = os.path.join(model_dir, "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(checkpoint, f)  # ← WRONG: saves Ray objects
```

Inference container: Standard PyTorch container, no Ray installed.
Result: ModuleNotFoundError: No module named 'ray' on every load attempt.
Model never loaded. Endpoint used fallback heuristic.
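The eventual fix was to export plain weights instead of pickling the Ray checkpoint object—RLlib policies expose `get_weights()`, which returns ordinary NumPy arrays. A sketch with illustrative function names (the exact production code differed):

```python
import os
import pickle
import numpy as np

def save_portable_weights(weights: dict, model_dir: str) -> str:
    """Save plain NumPy weights - loadable in any container, no Ray needed.
    `weights` would come from e.g. algo.get_policy().get_weights() at train time."""
    model_path = os.path.join(model_dir, "model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(weights, f)  # dict of np.ndarray - no Ray classes inside
    return model_path

def load_portable_weights(model_path: str) -> dict:
    """Inference side: unpickling succeeds even without Ray installed."""
    with open(model_path, "rb") as f:
        return pickle.load(f)
```

The principle: the artifact format must be deserializable by the inference container's dependencies, not the training container's.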
Even when we fixed the Ray issue, there was a deeper problem.
Training features (18 total):
```python
RL_MODEL_FEATURES = [
    "return", "log_return",
    "sma_fast", "sma_slow", "ema_fast", "ema_slow",
    "ret_roll", "vol_roll", "vol_30",
    "rsi_14", "macd_line", "macd_signal",
    "bb_upper", "bb_lower", "atr_14", "mom_10",
    "buy_probability", "sell_probability",
]
```

Inference features (3 total):
```python
# In rl_portfolio.py
features = {
    "buy_probability": scores.get("buy_probability", 0.5),
    "sell_probability": scores.get("sell_probability", 0.3),
    "net_score": scores.get("buy_probability", 0.5) - scores.get("sell_probability", 0.3),
}
```

What happened: The endpoint received 3 features while the model expected 18. The 15 missing features were filled with 0.0.
Result: Neural network received mostly-zero input vector → garbage predictions → fell back to heuristics.
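To see why zero-filling is so destructive, consider a toy linear layer: with 15 of 18 inputs zeroed, most of the learned weights never contribute, and the output collapses to whatever the three surviving features plus the bias produce. A sketch, not the actual network:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=18)        # toy "trained" weights over 18 features
b = 0.1

full = rng.normal(size=18)     # all 18 features present
partial = full.copy()
partial[3:] = 0.0              # 15 missing features silently zero-filled

out_full = W @ full + b        # what training-time inputs would produce
out_partial = W @ partial + b  # only 3 of 18 weights contribute

print(out_full, out_partial)   # the two outputs diverge
```

The model wasn't "wrong"—it was being fed inputs from a distribution it had never seen.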
The code that masked the problem:
```python
# BAD: Silent fallback
try:
    action = model.predict(features)
except Exception as e:
    logger.warning(f"Model prediction failed: {e}")
    # Silent fallback to heuristic
    action = 'buy' if features['buy_probability'] > 0.7 else 'hold'
```

Why this is wrong:
- Exception happens on every call
- Logged as "warning," not error
- Fallback silently succeeds
- System appears to work
- No metrics tracking model vs fallback usage
Better approach: Fail fast.
```python
# GOOD: Fail fast, track metrics
model_inference_count = 0
heuristic_count = 0

try:
    action = model.predict(features)
    model_inference_count += 1
except Exception as e:
    logger.error(f"CRITICAL: Model prediction failed: {e}")
    heuristic_count += 1
    raise  # Fail fast, don't mask the error

# At the end of the run, verify:
if heuristic_count > 0:
    raise RuntimeError(f"Model failed {heuristic_count} times - investigation required")
```

What I tested:
- Model trains ✅
- Model saved to S3 ✅
- Endpoint created ✅
- Endpoint returns JSON ✅
What I didn't test:
- Model actually loads ❌
- Model produces varying outputs ❌
- Application uses model predictions ❌
- System tracks model vs fallback usage ❌
```python
# backend/app/services/rl_portfolio.py
RL_MODEL_FEATURES = [
    "return", "log_return", "sma_fast", "sma_slow", "ema_fast", "ema_slow",
    "ret_roll", "vol_roll", "vol_30", "rsi_14", "macd_line", "macd_signal",
    "bb_upper", "bb_lower", "atr_14", "mom_10",
    "buy_probability", "sell_probability",
]
```

Rule: If this list changes, you MUST retrain the model.
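One way to enforce that rule mechanically (a sketch; the function names are mine) is to fingerprint the ordered feature list, ship the hash inside the model artifact at train time, and refuse to serve if inference's list doesn't match:

```python
import hashlib
import json

def feature_fingerprint(feature_names: list[str]) -> str:
    """Stable hash of the ordered feature list."""
    payload = json.dumps(feature_names).encode()
    return hashlib.sha256(payload).hexdigest()

# Train time: store feature_fingerprint(RL_MODEL_FEATURES) next to the weights.
# Inference time: recompute and compare before serving anything.
def check_feature_contract(saved_fp: str, feature_names: list[str]) -> None:
    current = feature_fingerprint(feature_names)
    if current != saved_fp:
        raise RuntimeError(
            f"Feature list changed since training "
            f"({saved_fp[:8]} != {current[:8]}) - retrain the model"
        )
```

A stale model then fails loudly at startup instead of silently serving garbage.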
```python
# backend/sagemaker/train_ddpg.py
from app.services.rl_portfolio import RL_MODEL_FEATURES

# Verify the training data has all required features
missing = [f for f in RL_MODEL_FEATURES if f not in df.columns]
if missing:
    raise ValueError(f"Training data missing features: {missing}")
```

```python
# backend/app/services/rl_portfolio.py
async def _compute_rl_features(bars, buy_prob, sell_prob):
    """
    Compute all 18 features from OHLCV bars.
    NOT just prescreening scores!
    """
    features = {}

    # Returns
    features['return'] = (bars[-1]['close'] - bars[-2]['close']) / bars[-2]['close']
    features['log_return'] = np.log(bars[-1]['close'] / bars[-2]['close'])

    # Moving averages
    closes = [b['close'] for b in bars]
    features['sma_fast'] = np.mean(closes[-10:])
    features['sma_slow'] = np.mean(closes[-50:])
    features['ema_fast'] = _ema(closes, 10)
    features['ema_slow'] = _ema(closes, 50)

    # Volatility
    returns = np.diff(np.log(closes))
    features['ret_roll'] = np.mean(returns[-10:])
    features['vol_roll'] = np.std(returns[-10:])
    features['vol_30'] = np.std(returns[-30:])

    # Indicators
    features['rsi_14'] = _rsi(closes, 14)
    features['macd_line'], features['macd_signal'] = _macd(closes)
    # ... etc. for all 18 features

    # Prescreening scores
    features['buy_probability'] = buy_prob
    features['sell_probability'] = sell_prob

    # Validate feature count
    assert len(features) == len(RL_MODEL_FEATURES), \
        f"Feature mismatch: {len(features)} vs {len(RL_MODEL_FEATURES)}"

    return features
```

Key: Inference computes the same features training used, from raw OHLCV data.
```python
# backend/sagemaker/inference.py
def input_fn(request_body, content_type):
    """Validate features on receipt"""
    data = json.loads(request_body)
    features = data['features']

    # Validate feature count
    if len(features) != len(EXPECTED_FEATURES):
        missing = len(EXPECTED_FEATURES) - len(features)
        logger.error(f"CRITICAL: Missing {missing}/{len(EXPECTED_FEATURES)} features")
        raise ValueError(f"Expected {len(EXPECTED_FEATURES)} features, got {len(features)}")

    return features
```

```python
# backend/sagemaker/inference.py
def predict_fn(input_data, model):
    """Return a flag indicating whether the model was used"""
    try:
        prediction = model.predict(input_data)
        return {
            'prediction': prediction,
            'used_trained_model': True,
            'feature_count': len(input_data),
        }
    except Exception as e:
        logger.error(f"Model inference failed: {e}")
        # DON'T fall back silently - return an error status
        return {
            'prediction': None,
            'used_trained_model': False,
            'error': str(e),
        }
```

```python
# backend/app/services/rl_portfolio.py
async def make_decisions(candidates, positions, cash):
    """Call the RL endpoint and verify the model was used"""
    features = await _compute_rl_features(...)
    response = await sagemaker.invoke_endpoint(...)
    payload = json.loads(response['Body'].read())

    # CRITICAL: Verify the model was actually used
    if not payload.get('used_trained_model'):
        error = payload.get('error', 'Unknown error')
        raise RuntimeError(f"RL model not used: {error}")

    # Log success at the top level
    logger.info("RL model made decision (trained_model=True)")

    return _payload_to_decisions(payload)
```

No silent fallbacks. If the model fails, raise an error. Force investigation.
After deploying an ML model, verify:
```bash
# Check SageMaker CloudWatch logs for model loading
aws logs tail /aws/sagemaker/Endpoints/trading-prod-rl-portfolio \
  --since 5m --follow | grep -i "model loaded"

# Should see: "Model loaded successfully from /opt/ml/model"
# Should NOT see: "ModuleNotFoundError" or "FileNotFoundError"
```

```python
# In the endpoint test
test_features = _compute_rl_features(test_bars, 0.7, 0.3)
print(f"Feature count: {len(test_features)}")
print(f"Expected: {len(RL_MODEL_FEATURES)}")
assert len(test_features) == len(RL_MODEL_FEATURES), "Feature mismatch!"
```

```python
# Test with different inputs
result1 = endpoint.predict(features_bullish)
result2 = endpoint.predict(features_bearish)
assert result1 != result2, "Model returning constant output!"
```

```bash
# Check application logs
docker-compose logs celery_worker --tail=100 | grep "trained_model"

# Should see: trained_model=True
# Should NOT see: trained_model=False or "fell back to heuristic"
```

```python
# Add metrics to your code
model_inference_count = 0
heuristic_count = 0

# At the end of the trading day
logger.info(f"Model inferences: {model_inference_count}")
logger.info(f"Heuristic fallbacks: {heuristic_count}")

# Alert if the fallback rate > 0%
if heuristic_count > 0:
    alert("RL model falling back to heuristics - investigate!")
```

BAD:

```python
try:
    result = model.predict(features)
except Exception:
    result = default_value  # Silent failure
```

GOOD:

```python
try:
    result = model.predict(features)
    model_inference_count += 1
except Exception as e:
    logger.error(f"CRITICAL: Model inference failed: {e}")
    heuristic_count += 1
    raise  # Fail fast
```

```python
# Add counters
if response.get('used_trained_model'):
    model_used_count += 1
else:
    fallback_used_count += 1

# Log daily
logger.info(f"Model usage: {model_used_count}/{model_used_count + fallback_used_count}")
```

```python
# Check for constant outputs (a sign of broken inference)
predictions = [model.predict(f) for f in test_features]
if len(set(predictions)) == 1:
    raise RuntimeError("Model returning constant predictions - broken!")
```

Done ≠ "Training complete"
Done = All of these:
- Model trains
- Model persists (saved to S3)
- Model loads in inference container
- Endpoint returns predictions
- Application uses predictions (not fallback)
- E2E test proves model is called
- Monitoring shows real predictions in logs
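That checklist can be wired into a single E2E smoke test that must pass before "deployed" is announced. This is a sketch against a hypothetical `endpoint` client, reusing the `used_trained_model` flag described above:

```python
def smoke_test_endpoint(endpoint, features_bullish: dict, features_bearish: dict) -> None:
    """Fail loudly if the model isn't genuinely serving predictions."""
    r1 = endpoint.predict(features_bullish)
    r2 = endpoint.predict(features_bearish)

    # 1. The trained model, not the fallback, answered
    assert r1["used_trained_model"], f"Fallback in use: {r1.get('error')}"
    assert r2["used_trained_model"], f"Fallback in use: {r2.get('error')}"

    # 2. Outputs vary with input (constant output = broken inference)
    assert r1["prediction"] != r2["prediction"], "Constant output - model likely not loaded"
```

Run it in CI against a staging endpoint; a green check here is what earns the word "deployed."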
- Training-inference feature parity is non-negotiable: Use the same feature list, validate on both sides.
- No silent fallbacks: If the model fails, fail fast. Track inference vs fallback counts.
- Verify the model is actually used: Don't just test that the endpoint returns JSON. Verify the application uses the predictions.
- "Deployed" ≠ "Working": Check that the model loads, produces varying outputs, and is called by the application.
- Track inference counts: model_inference_count vs heuristic_count. Alert if the fallback rate > 0%.
- Definition of done includes verification: A model isn't done until an E2E test proves it's invoked and monitoring shows real predictions.
- Fail fast, investigate thoroughly: One failure is a signal. Don't mask it with silent fallbacks.
If you have an ML model in production, verify:
- Model actually loads (check logs)
- Features match training (count and names)
- Model produces varying outputs (not constant)
- Application uses predictions (not fallback)
- Inference counts tracked (model vs fallback)
If any check fails, you might have a model that's "deployed" but not used.
In the next post, I'll share more operational lessons—including why "scheduled" doesn't mean "working" and the async+DB patterns that saved me.
This is Post 6 of an 8-part series on building a full-stack AI trading application with LLM coding agents. Next: "Scheduled ≠ Working" and Other Expensive Assumptions.