@ttarler
Created January 27, 2026 01:09

The ML Model Was 'Live' for 6 Days—It Never Made a Single Decision

"Deployed and Working"

Jan 8: "RL model deployed to SageMaker endpoint. Trading with trained model."

Jan 9: "RL model is active and making portfolio decisions."

Jan 10-13: Portfolio trades every day. System reports "working."

Jan 14: User asks, "Are you sure the RL model is actually being used?"

I check the logs. trained_model=false on every single call. For 6 days, the system fell back to simple heuristics. The neural network, the product of 20 hyperparameter tuning jobs on SageMaker's expensive GPU instances, never made a single production decision.

This post is about the most expensive ML failure I've encountered: a model that was "deployed" but never used.

The Timeline: How It Went Wrong

Jan 8: Training Complete, "Deployment" Begins

What I did:

  1. Ran hyperparameter optimization (20 training jobs, ml.g4dn.xlarge GPUs, ~2 hours)
  2. Selected best model (highest reward)
  3. Saved model artifact to S3
  4. Created SageMaker endpoint
  5. Updated application to call endpoint
  6. Tested with curl: endpoint returned JSON ✅

What I announced: "RL model deployed and working."

What I didn't verify: Did the application actually use the model's predictions? Did it fall back to heuristics? Did the model even load successfully?
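In hindsight, the missing check fits in a few lines. This is a sketch, not the project's actual code: it assumes the response payload carries the `used_trained_model` flag the endpoint already returned (buried, as it turned out).

```python
def assert_model_used(payload: dict) -> None:
    """Post-deployment smoke check: raise loudly if the endpoint fell back."""
    if not payload.get("used_trained_model"):
        raise RuntimeError(
            f"Endpoint fell back to heuristic: {payload.get('error', 'unknown')}"
        )

# A curl test only proves "endpoint returns JSON"; this proves the model ran.
assert_model_used({"prediction": [0.2, 0.8], "used_trained_model": True})
```

Run once against the live endpoint right after deployment, this would have turned Jan 8's announcement into an error message instead.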

Jan 9-13: "Trading With Trained Model"

What the logs showed:

[Jan 9 09:35] RL portfolio decision made: 5 positions allocated
[Jan 9 09:40] RL portfolio decision made: 3 positions allocated
[Jan 10 09:35] RL portfolio decision made: 4 positions allocated

What the logs didn't show: trained_model=false was buried in the response payload, never logged at the top level.

What I reported: "RL model making decisions, portfolio performing well."

Reality: Every decision came from a simple heuristic: if buy_probability - sell_probability > 0.15 → BUY.
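Spelled out as code, the entire "RL strategy" that actually ran was this (a sketch; only the BUY branch appears in the logs, the HOLD default is an assumption):

```python
def heuristic_action(buy_probability: float, sell_probability: float) -> str:
    """The fallback that made every 'RL' decision for six days."""
    # BUY when the prescreening edge clears a fixed threshold; otherwise HOLD.
    return "buy" if buy_probability - sell_probability > 0.15 else "hold"
```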

Jan 14: The Discovery

User: "Are you sure the RL model is actually being used? The decisions seem too simple."

Me: Checks SageMaker CloudWatch logs. Sees ModuleNotFoundError: No module named 'ray' on every inference request.

Then checks application logs more carefully:

response = await sagemaker.invoke_endpoint(...)
payload = json.loads(response['Body'].read())

# This was always false:
if payload.get('used_trained_model'):
    logger.info("RL model made decision")
else:
    logger.warning("RL endpoint fell back to heuristic")  # ← ALWAYS THIS PATH

Reality: The model artifact contained RLlib checkpoint objects (Ray dependencies). The SageMaker inference container couldn't deserialize them. Endpoint fell back to heuristic logic on every request.

The Root Causes

Cause 1: Training vs Inference Architecture Mismatch

Training: RLlib PPO model (multi-asset portfolio allocator)

Inference expectation: Per-symbol discrete actions (BUY/HOLD/SELL)

The "solution": Add threshold heuristic to convert portfolio allocations to discrete actions.

The actual problem: The heuristic replaced the model rather than converting its output. The model never loaded.

Cause 2: Ray Objects in model.pkl

Training code (simplified):

# Saved RLlib checkpoint as model artifact
checkpoint = algo.save()
model_path = os.path.join(model_dir, "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(checkpoint, f)  # ← WRONG: Saves Ray objects

Inference container: Standard PyTorch container, no Ray installed.

Result: ModuleNotFoundError: No module named 'ray' on every load attempt.

Model never loaded. Endpoint used fallback heuristic.
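The failure mode is inherent to pickle: the stream stores a reference to the defining module and class, not the code itself, so unpickling requires the same packages installed in the loading environment. A minimal illustration:

```python
import pickle


class Checkpoint:  # stand-in for an RLlib checkpoint object
    pass


blob = pickle.dumps(Checkpoint())

# The pickle stream names the class it will need to import at load time:
assert b"Checkpoint" in blob

# Unpickling in a container where that module isn't installed raises
# ModuleNotFoundError before the model ever loads -- exactly what the
# CloudWatch logs showed for 'ray'.
```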

Cause 3: Feature Mismatch (18 vs 3)

Even when we fixed the Ray issue, there was a deeper problem.

Training features (18 total):

RL_MODEL_FEATURES = [
    "return", "log_return", 
    "sma_fast", "sma_slow", "ema_fast", "ema_slow",
    "ret_roll", "vol_roll", "vol_30",
    "rsi_14", "macd_line", "macd_signal", 
    "bb_upper", "bb_lower", "atr_14", "mom_10",
    "buy_probability", "sell_probability"
]

Inference features (3 total):

# In rl_portfolio.py
features = {
    "buy_probability": scores.get("buy_probability", 0.5),
    "sell_probability": scores.get("sell_probability", 0.3),
    "net_score": scores.get("buy_probability", 0.5) - scores.get("sell_probability", 0.3)
}

What happened: The endpoint received 3 features, but the model expected 18. Missing features were filled with 0.0; and since net_score isn't in the training list, only 2 of the 18 inputs carried any signal.

Result: Neural network received mostly-zero input vector → garbage predictions → fell back to heuristics.
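Under the assumption (per the behavior described above) that absent features default to 0.0, the garbage input is easy to reproduce from the two lists:

```python
RL_MODEL_FEATURES = [
    "return", "log_return", "sma_fast", "sma_slow", "ema_fast", "ema_slow",
    "ret_roll", "vol_roll", "vol_30", "rsi_14", "macd_line", "macd_signal",
    "bb_upper", "bb_lower", "atr_14", "mom_10",
    "buy_probability", "sell_probability",
]

# What inference actually sent (note: net_score isn't even in the list above):
sent = {"buy_probability": 0.7, "sell_probability": 0.3, "net_score": 0.4}

# Endpoint builds the input vector, defaulting absent features to 0.0:
vector = [sent.get(name, 0.0) for name in RL_MODEL_FEATURES]
zero_count = sum(1 for v in vector if v == 0.0)
print(f"{zero_count} of {len(vector)} inputs are zero")  # → 16 of 18 inputs are zero
```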

Cause 4: Silent try/except Fallback

The code that masked the problem:

# BAD: Silent fallback
try:
    action = model.predict(features)
except Exception as e:
    logger.warning(f"Model prediction failed: {e}")
    # Silent fallback to heuristic
    action = 'buy' if features['buy_probability'] > 0.7 else 'hold'

Why this is wrong:

  1. Exception happens on every call
  2. Logged as "warning," not error
  3. Fallback silently succeeds
  4. System appears to work
  5. No metrics tracking model vs fallback usage

Better approach: Fail fast.

# GOOD: Fail fast, track metrics
model_inference_count = 0
heuristic_count = 0

try:
    action = model.predict(features)
    model_inference_count += 1
except Exception as e:
    logger.error(f"CRITICAL: Model prediction failed: {e}")
    heuristic_count += 1
    raise  # Fail fast, don't mask the error

# At end of run, verify:
if heuristic_count > 0:
    raise RuntimeError(f"Model failed {heuristic_count} times - investigation required")

Cause 5: No End-to-End Verification

What I tested:

  • Model trains ✅
  • Model saved to S3 ✅
  • Endpoint created ✅
  • Endpoint returns JSON ✅

What I didn't test:

  • Model actually loads
  • Model produces varying outputs
  • Application uses model predictions ❌
  • System tracks model vs fallback usage ❌

The Fix: Training-Inference Feature Parity

Step 1: Single Feature List (Source of Truth)

# backend/app/services/rl_portfolio.py
RL_MODEL_FEATURES = [
    "return", "log_return", "sma_fast", "sma_slow", "ema_fast", "ema_slow",
    "ret_roll", "vol_roll", "vol_30", "rsi_14", "macd_line", "macd_signal",
    "bb_upper", "bb_lower", "atr_14", "mom_10",
    "buy_probability", "sell_probability"
]

Rule: If this list changes, you MUST retrain the model.
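One way to make that rule enforceable rather than aspirational (a sketch, not the project's actual code): fingerprint the list at training time, store the fingerprint with the model artifact, and refuse to serve if it no longer matches.

```python
import hashlib
import json

RL_MODEL_FEATURES = [
    "return", "log_return", "sma_fast", "sma_slow", "ema_fast", "ema_slow",
    "ret_roll", "vol_roll", "vol_30", "rsi_14", "macd_line", "macd_signal",
    "bb_upper", "bb_lower", "atr_14", "mom_10",
    "buy_probability", "sell_probability",
]


def feature_fingerprint(features: list) -> str:
    """Order-sensitive hash: reordering changes the input vector too."""
    return hashlib.sha256(json.dumps(features).encode()).hexdigest()


# Training: persist the fingerprint next to the model artifact.
trained_with = feature_fingerprint(RL_MODEL_FEATURES)

# Inference: compare before serving; a mismatch means "retrain first".
if feature_fingerprint(RL_MODEL_FEATURES) != trained_with:
    raise RuntimeError("Feature list changed since training - retrain the model")
```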

Step 2: Training Uses the Same List

# backend/sagemaker/train_ddpg.py
from app.services.rl_portfolio import RL_MODEL_FEATURES

# Verify data has all required features
missing = [f for f in RL_MODEL_FEATURES if f not in df.columns]
if missing:
    raise ValueError(f"Training data missing features: {missing}")

Step 3: Inference Computes ALL Features

# backend/app/services/rl_portfolio.py
async def _compute_rl_features(bars, buy_prob, sell_prob):
    """
    Compute all 18 features from OHLCV bars.
    NOT just prescreening scores!
    """
    features = {}
    
    # Compute returns
    features['return'] = (bars[-1]['close'] - bars[-2]['close']) / bars[-2]['close']
    features['log_return'] = np.log(bars[-1]['close'] / bars[-2]['close'])
    
    # Compute moving averages
    closes = [b['close'] for b in bars]
    features['sma_fast'] = np.mean(closes[-10:])
    features['sma_slow'] = np.mean(closes[-50:])
    features['ema_fast'] = _ema(closes, 10)
    features['ema_slow'] = _ema(closes, 50)
    
    # Compute volatility
    returns = np.diff(np.log(closes))
    features['ret_roll'] = np.mean(returns[-10:])
    features['vol_roll'] = np.std(returns[-10:])
    features['vol_30'] = np.std(returns[-30:])
    
    # Compute indicators
    features['rsi_14'] = _rsi(closes, 14)
    features['macd_line'], features['macd_signal'] = _macd(closes)
    # ... etc for all 18 features
    
    # Add prescreening scores
    features['buy_probability'] = buy_prob
    features['sell_probability'] = sell_prob
    
    # Validate feature count
    assert len(features) == len(RL_MODEL_FEATURES), \
        f"Feature mismatch: {len(features)} vs {len(RL_MODEL_FEATURES)}"
    
    return features

Key: Inference computes the same features training used, from raw OHLCV data.

Step 4: Endpoint Validates Features

# backend/sagemaker/inference.py
def input_fn(request_body, content_type):
    """Validate features on receive"""
    data = json.loads(request_body)
    features = data['features']
    
    # Validate feature count
    if len(features) != len(EXPECTED_FEATURES):
        missing = len(EXPECTED_FEATURES) - len(features)
        logger.error(f"CRITICAL: Missing {missing}/{len(EXPECTED_FEATURES)} features")
        raise ValueError(f"Expected {len(EXPECTED_FEATURES)} features, got {len(features)}")
    
    return features

Step 5: Response Indicates Model Usage

# backend/sagemaker/inference.py
def predict_fn(input_data, model):
    """Return flag indicating if model was used"""
    try:
        prediction = model.predict(input_data)
        return {
            'prediction': prediction,
            'used_trained_model': True,
            'feature_count': len(input_data)
        }
    except Exception as e:
        logger.error(f"Model inference failed: {e}")
        # DON'T fall back silently - return error status
        return {
            'prediction': None,
            'used_trained_model': False,
            'error': str(e)
        }

Step 6: Application Verifies Model Was Used

# backend/app/services/rl_portfolio.py
async def make_decisions(candidates, positions, cash):
    """Call RL endpoint and verify model was used"""
    features = await _compute_rl_features(...)
    
    response = await sagemaker.invoke_endpoint(...)
    payload = json.loads(response['Body'].read())
    
    # CRITICAL: Verify model was actually used
    if not payload.get('used_trained_model'):
        error = payload.get('error', 'Unknown error')
        raise RuntimeError(f"RL model not used: {error}")
    
    # Log success at top level
    logger.info("RL model made decision (trained_model=True)")
    
    return _payload_to_decisions(payload)

No silent fallbacks. If the model fails, raise an error. Force investigation.

The Verification Checklist

After deploying an ML model, verify:

1. Model Loads Successfully

# Check SageMaker CloudWatch logs for model loading
aws logs tail /aws/sagemaker/Endpoints/trading-prod-rl-portfolio \
  --since 5m --follow | grep -i "model loaded"

# Should see: "Model loaded successfully from /opt/ml/model"
# Should NOT see: "ModuleNotFoundError" or "FileNotFoundError"

2. Features Match Training

# In endpoint test
test_features = _compute_rl_features(test_bars, 0.7, 0.3)
print(f"Feature count: {len(test_features)}")
print(f"Expected: {len(RL_MODEL_FEATURES)}")

assert len(test_features) == len(RL_MODEL_FEATURES), "Feature mismatch!"

3. Model Produces Varying Outputs

# Test with different inputs
result1 = endpoint.predict(features_bullish)
result2 = endpoint.predict(features_bearish)

assert result1 != result2, "Model returning constant output!"

4. Application Uses Model (Not Fallback)

# Check application logs
docker-compose logs celery_worker --tail=100 | grep "trained_model"

# Should see: trained_model=True
# Should NOT see: trained_model=False or "fell back to heuristic"

5. Inference Counts Tracked

# Add metrics to your code
model_inference_count = 0
heuristic_count = 0

# At end of trading day
logger.info(f"Model inferences: {model_inference_count}")
logger.info(f"Heuristic fallbacks: {heuristic_count}")

# Alert if fallback rate > 0%
if heuristic_count > 0:
    alert("RL model falling back to heuristics - investigate!")

The Policies I Encoded

Policy 1: No Silent Fallbacks

BAD:

try:
    result = model.predict(features)
except Exception:
    result = default_value  # Silent failure

GOOD:

try:
    result = model.predict(features)
    model_inference_count += 1
except Exception as e:
    logger.error(f"CRITICAL: Model inference failed: {e}")
    heuristic_count += 1
    raise  # Fail fast

Policy 2: Track Inference vs Fallback

# Add counters
if response.get('used_trained_model'):
    model_used_count += 1
else:
    fallback_used_count += 1

# Log daily
logger.info(f"Model usage: {model_used_count}/{model_used_count + fallback_used_count}")

Policy 3: Validate Results Vary

# Check for constant outputs (sign of broken inference)
predictions = [model.predict(f) for f in test_features]
if len(set(predictions)) == 1:
    raise RuntimeError("Model returning constant predictions - broken!")

Policy 4: Definition of "Done" for ML

Done ≠ "Training complete"

Done = All of these:

  1. Model trains
  2. Model persists (saved to S3)
  3. Model loads in inference container
  4. Endpoint returns predictions
  5. Application uses predictions (not fallback)
  6. E2E test proves model is called
  7. Monitoring shows real predictions in logs
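Item 6 can be a small in-process test. Here is a sketch where `invoke` stands in for whatever client calls the real endpoint (the lambda at the bottom is a fake for illustration only; in production you would pass your SageMaker client call):

```python
def e2e_model_invocation_check(invoke) -> None:
    """Call the endpoint with opposing inputs; require model-produced,
    varying outputs -- the two properties the curl test never checked."""
    bullish = invoke({"buy_probability": 0.9, "sell_probability": 0.1})
    bearish = invoke({"buy_probability": 0.1, "sell_probability": 0.9})
    assert bullish["used_trained_model"], "fell back on bullish input"
    assert bearish["used_trained_model"], "fell back on bearish input"
    assert bullish["prediction"] != bearish["prediction"], "constant output"


# Illustrative fake endpoint only; substitute your real invoke_endpoint call.
e2e_model_invocation_check(
    lambda f: {"used_trained_model": True,
               "prediction": f["buy_probability"] - f["sell_probability"]}
)
```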

Key Takeaways

  1. Training-inference feature parity is non-negotiable: Use the same feature list, validate on both sides.

  2. No silent fallbacks: If model fails, fail fast. Track inference vs fallback counts.

  3. Verify model is actually used: Don't just test that endpoint returns JSON. Verify application uses the predictions.

  4. "Deployed" ≠ "Working": Check that model loads, produces varying outputs, and is called by application.

  5. Track inference counts: model_inference_count vs heuristic_count. Alert if fallback rate > 0%.

  6. Definition of done includes verification: Model isn't done until E2E test proves it's invoked and monitoring shows real predictions.

  7. Fail fast, investigate thoroughly: One failure is a signal. Don't mask it with silent fallbacks.

Your Turn: Verify Your Model

If you have an ML model in production, verify:

  1. Model actually loads (check logs)
  2. Features match training (count and names)
  3. Model produces varying outputs (not constant)
  4. Application uses predictions (not fallback)
  5. Inference counts tracked (model vs fallback)

If any check fails, you might have a model that's "deployed" but not used.

In the next post, I'll share more operational lessons—including why "scheduled" doesn't mean "working" and the async+DB patterns that saved me.


This is Post 6 of an 8-part series on building a full-stack AI trading application with LLM coding agents. Next: "Scheduled ≠ Working" and Other Expensive Assumptions.
