- **Target Audience:** Dev Engineers
- **Goal:** Understand the dataset creation pipeline and training process
- **Context:** Part of the LoudEcho online learning bandit system
- Big Picture: Where This Fits
- The Problem We're Solving
- Dataset Creation Pipeline
- Training Process
- Key Code Snippets
- Results & Performance
- Integration with Bandit System
## Big Picture: Where This Fits

The memorability scorer is one component in a multi-armed bandit system for ad creative optimization:
```
                     ONLINE LEARNING BANDIT

┌────────────┐    ┌──────────────┐    ┌───────────────┐
│    DSPy    │───▶│  Candidate   │───▶│ Memorability  │
│ Generator  │    │ Ad Creatives │    │    Scorer     │
└────────────┘    └──────────────┘    └───────┬───────┘
                                              │
                                              ▼
                  ┌───────────────────────────────────┐
                  │ Scoring Function:                 │
                  │                                   │
                  │   Score = w₁·Quality(C)           │
                  │         + w₂·Performance(P)       │
                  │         + w₃·Novelty              │
                  │         - penalties               │
                  └─────────────────┬─────────────────┘
                                    │
                                    ▼
                       ┌────────────────────────┐
                       │ Ad Selection &         │
                       │ Serving Decision       │
                       └───────────┬────────────┘
                                   │
                                   ▼
                       ┌────────────────────┐
                       │ Real Engagement    │
                       │ Data (CTR, etc.)   │
                       └───────────┬────────┘
                                   │
                                   ▼
                       ┌────────────────────┐
                       │ Feedback Loop:     │
                       │ Update Weights     │
                       │ & Retrain Models   │
                       └────────────────────┘
```
The Memorability Scorer predicts Quality(C) - how good an ad creative is in a given context.
## The Problem We're Solving

Challenge: Given an ad creative and an article context, predict how memorable and effective the ad will be.
Why it matters:
- Manual review doesn't scale (thousands of ad variants)
- Post-hoc metrics (CTR) come too late for real-time decisions
- Need to rank candidates BEFORE serving to users
Solution approach:
- Train a "teacher model" (GPT-5.2 Vision) to score thousands of ad-article pairs
- Use those scores to train a fast, cheap XGBoost model for production
- Deploy the XGBoost model for real-time scoring
## Dataset Creation Pipeline

This was 90% of the effort and cost (several hundred USD in API calls). Here's the full pipeline:
```
                     DATASET CREATION PIPELINE

 Step 1: Image Deduplication
     Pinterest + Twitter ad images (10K+)
         ↓ perceptual hashing (aHash: 32×32 grayscale)
     7,258 unique ads

 Step 2: Synthetic Article Generation
     For each ad image:
       - GPT-5.2 Vision analyzes the ad
       - Generates contextually relevant article
       - 1:1 mapping (ad_00001 ↔ article_00001)
         ↓
     7,258 synthetic articles

 Step 3: Ad Feature Extraction
     Extract M features (15 dimensions):
       - Visual: faces, clarity, clutter, contrast
       - Text: OCR, copy quality, concreteness
       - Creative: twist present, resolves fast
       - CLIP embeddings (512-dim)

 Step 4: Article Feature Extraction
     Extract article features (4 dimensions):
       - Topic category
       - Named entities
       - Sentiment valence
       - Emotional arousal
       - Text embeddings (512-dim)

 Step 5: Generate Negative Samples
     For each ad:
       → 1 positive pair (original synthetic match)
       → 2 random negatives (any mismatched article)
       → 1 safe contrast (different topic, similar tone)
         ↓
     29,032 total pairs
       - 7,258 positive (25%)
       - 14,516 random negatives (50%)
       - 7,258 safe contrast (25%)

 Step 6: Compute Pair Features
     Combine ad + article → F features (6 dimensions):
       - sim_adtext_article (cosine similarity)
       - sim_adimage_article (CLIP similarity)
       - entity_overlap_rate (Jaccard)
       - sentiment_alignment (distance)
       - topic_match (binary)
       - contrast (inverse similarity)

 Step 7: Teacher Scoring
     For each of 29,032 pairs, GPT-5.2 Vision evaluates:
       • Ad memorability & originality
       • Message clarity (< 2 sec?)
       • Emotional engagement
       • Contextual relevance
     Outputs: score (1-10) + reasoning
         ↓
     29,032 labeled training examples
     Cost: ~$0.02/pair × 29K = $580+

 Step 8: Consolidate Dataset
     Merge all data into a single file:
       - Images (as bytes)
       - Ad text + article text
       - 21 features (15 M + 6 F)
       - Teacher scores + reasoning
         ↓
     memorability_dataset_consolidated.parquet (8.1 GB)
```
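The Step 6 pair features reduce to a handful of vector and set operations. Below is a minimal sketch, assuming embeddings, entity sets, sentiment, and topic have already been extracted per ad and article; the dict field names are illustrative, not the pipeline's actual schema.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard(a: set, b: set) -> float:
    """Entity overlap rate (Jaccard index); 0.0 when both sets are empty."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def pair_features(ad: dict, article: dict) -> dict:
    """Compute the six F features for one ad-article pair."""
    sim_text = cosine_sim(ad["text_emb"], article["text_emb"])
    return {
        "sim_adtext_article": sim_text,
        # Cross-modal: ad image embedding vs. article text embedding
        "sim_adimage_article": cosine_sim(ad["image_emb"], article["text_emb"]),
        "entity_overlap_rate": jaccard(ad["entities"], article["entities"]),
        # Distance, per the diagram above: smaller = better aligned
        "sentiment_alignment": abs(ad["sentiment"] - article["sentiment"]),
        "topic_match": float(ad["topic"] == article["topic"]),
        "contrast": 1.0 - sim_text,  # inverse similarity
    }
```

In the real pipeline these six values are appended to the 15 M features to form the 21-column training matrix.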
Why synthetic articles instead of real ones?
- Real ad-article pairs have selection bias (ads are already chosen to fit)
- Synthetic articles let us control the distribution of positive/negative pairs
- We can generate "safe contrast" pairs (same sentiment, different topic) to teach the model nuance
Why negative samples?
- A model trained only on positive pairs can't discriminate
- Random negatives teach "this ad doesn't fit this article"
- Safe contrast negatives teach subtler distinctions (not just topic matching)
Why teacher-student approach?
- GPT-5.2 Vision is expensive ($0.02/pair) and slow (120s timeout)
- XGBoost is cheap ($0.0001/pair) and fast (< 1ms)
- Trade-off: upfront labeling cost → cheap inference forever
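The trade-off can be made concrete with back-of-the-envelope arithmetic, using the per-pair costs quoted above (illustrative, not billing-exact):

```python
# Break-even point for the teacher-student trade-off.
teacher_cost_per_pair = 0.02    # GPT-5.2 Vision, USD
student_cost_per_pair = 0.0001  # XGBoost inference, USD
upfront_labeling = 29_032 * teacher_cost_per_pair  # one-time dataset cost, ~$580

# Each production score done by the student instead of the teacher saves
# the cost difference, so the labeling spend pays for itself after:
break_even_calls = upfront_labeling / (teacher_cost_per_pair - student_cost_per_pair)
print(round(break_even_calls))  # → 29178
```

In other words, roughly one dataset's worth of production traffic amortizes the entire labeling bill, ignoring the (much larger) latency win.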
## Training Process

Once we have the labeled dataset, training is straightforward:
```
                XGBOOST TRAINING PIPELINE

 1. Load consolidated dataset
        ↓
 2. Split by ad_id (prevent leakage)
      - Train: 80% of unique ads
      - Test:  20% of unique ads
        ↓
 3. Encode categorical features
      - face_emotion: LabelEncoder
        ↓
 4. Train XGBoost regressor
      - Target: teacher_score_1_10
      - Features: 21 (15 M + 6 F)
      - Hyperparameters:
          • n_estimators: 300 (early stopped at 215)
          • max_depth: 6
          • learning_rate: 0.05
          • L1 regularization: 1
          • L2 regularization: 10
        ↓
 5. Evaluate on test set
      - MAE: 0.605 (6% error on 10-point scale)
      - R²:  0.785 (explains 78.5% of variance)
        ↓
 6. Save model + encoders
      - baseline_xgboost_model.pkl
        ↓
```
Considered alternatives:
- Neural network: Requires more data, harder to interpret, overkill for 21 features
- Linear regression: Too simple, can't capture feature interactions
- Random Forest: Similar to XGBoost but slower and less accurate
Why XGBoost won:
- ✓ Handles tabular data with mixed types (categorical + numeric) out of the box
- ✓ Built-in regularization (L1, L2) prevents overfitting
- ✓ Feature importance scores for interpretability
- ✓ Fast training (< 5 min on CPU)
- ✓ Blazing fast inference (< 1ms per prediction)
- ✓ Industry standard for Kaggle/production regression tasks
Key design decisions:
- GroupShuffleSplit by ad_id: Prevents data leakage (otherwise the same ad would appear in both train and test)
- Early stopping: Prevents overfitting by stopping at 215 trees when validation MAE plateaus
- Regression, not ranking: We want absolute scores, not just relative ordering (for now)
## Key Code Snippets

File: `ImageDeduplicator.py:24-37`

This was critical to avoid wasting money labeling duplicate ads.
```python
import hashlib

import cv2

# Method of the deduplicator class; shown standalone here.
@staticmethod
def compute_image_hash(image_path: str) -> str | None:
    """
    Compute aHash for image.

    Algorithm: 32x32 grayscale → compare to mean → binary hash → MD5
    """
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        return None
    resized_image = cv2.resize(image, (32, 32))
    avg_pixel_value = resized_image.mean()
    hash_str = ''.join('1' if pixel > avg_pixel_value else '0'
                       for pixel in resized_image.flatten())
    return hashlib.md5(hash_str.encode()).hexdigest()
```

Why this works:

- Resizing to 32×32 normalizes for size variations
- Comparing to mean creates a perceptual fingerprint
- Near-identical images (different compression, minor edits) get the same hash
- Result: Reduced 10K+ images → 7,258 unique ads
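The dedup pass itself is then a first-seen-wins scan over the hashes. This driver loop is a hypothetical sketch (the real module's loop may differ); `hash_fn` is whatever produces the hash, e.g. `compute_image_hash` above.

```python
def deduplicate(image_paths, hash_fn):
    """Keep the first image seen per hash; skip unreadable files
    (hash_fn returns None) and later duplicates."""
    seen, unique = set(), []
    for path in image_paths:
        h = hash_fn(path)
        if h is None or h in seen:
            continue  # duplicate or unreadable → skip
        seen.add(h)
        unique.append(path)
    return unique
```

Run over the raw Pinterest + Twitter dump, this is the step that takes 10K+ images down to the 7,258 unique ads quoted above.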
File: `GenerateNegativeSamples.py:70-153`

Teaching the model to discriminate requires careful negative sampling.
```python
import random

from tqdm import tqdm

# Note: `all_article_ids` and `articles_by_id` are built earlier in this
# method (lines not shown in this excerpt).
def generate_pairs(self):
    """Generate positive and negative pairs"""
    pairs = []
    for _, ad_row in tqdm(self.df_ads.iterrows(), total=len(self.df_ads)):
        ad_id = ad_row['ad_id']
        ad_topic = ad_row.get('ad_topic', 'other')
        ad_sentiment = ad_row.get('copy_emotion_valence', 0.0)

        # 1. Positive pair (original 1:1 synthetic pairing)
        positive_article_id = ad_id  # ad_00001 ↔ article_00001
        pairs.append({
            'ad_id': ad_id,
            'article_id': positive_article_id,
            'pair_type': 'positive'
        })

        # 2. Random negatives (any mismatched article)
        other_article_ids = [a_id for a_id in all_article_ids
                             if a_id != positive_article_id]
        random_article_ids = random.sample(other_article_ids,
                                           self.random_negatives_per_ad)
        for article_id in random_article_ids:
            pairs.append({
                'ad_id': ad_id,
                'article_id': article_id,
                'pair_type': 'random_negative'
            })

        # 3. Safe contrast negatives (different topic, similar sentiment)
        contrast_candidates = [
            a_id for a_id in other_article_ids
            if articles_by_id[a_id].get('topic_category') != ad_topic  # Different topic
            and abs(articles_by_id[a_id].get('sentiment_valence', 0.0)
                    - ad_sentiment) < 0.5  # Similar sentiment
        ]
        contrast_article_ids = random.sample(
            contrast_candidates,
            min(self.contrast_negatives_per_ad, len(contrast_candidates))
        )
        for article_id in contrast_article_ids:
            pairs.append({
                'ad_id': ad_id,
                'article_id': article_id,
                'pair_type': 'safe_contrast'
            })
```

Why three types of negatives?

- Random negatives: Easy examples (car ad × cooking article = bad)
- Safe contrast: Hard examples (car ad × travel article = maybe okay?)
- Teaches nuance: Model learns contextual fit, not just topic matching
File: `ConsolidateDataset.py:61-198`

With 8.1 GB of data including images, we need batch processing to avoid OOM.
```python
import glob
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from tqdm import tqdm

# DATA_DIR, OUTPUT_FILE, and load_image_bytes() are defined earlier in the file.

def consolidate():
    """Merge all data into single consolidated file (memory-efficient batching)"""
    BATCH_SIZE = 1000  # Process 1000 rows at a time
    TEMP_DIR = f"{DATA_DIR}/temp_batches"
    os.makedirs(TEMP_DIR, exist_ok=True)

    # Load metadata (no images yet)
    df_pairs = pd.read_csv(f"{DATA_DIR}/pairs.csv")
    df_features = pd.read_csv(f"{DATA_DIR}/features_full.csv")
    df_scores = pd.read_csv(f"{DATA_DIR}/teacher_scores.csv")

    # Merge all metadata
    df = df_pairs.merge(df_features, on=['ad_id', 'article_id'])
    df = df.merge(df_scores, on=['ad_id', 'article_id'])

    # Process in batches (load images only for current batch)
    num_batches = (len(df) + BATCH_SIZE - 1) // BATCH_SIZE
    for batch_idx in tqdm(range(num_batches), desc="Processing batches"):
        start_idx = batch_idx * BATCH_SIZE
        end_idx = min((batch_idx + 1) * BATCH_SIZE, len(df))
        df_batch = df.iloc[start_idx:end_idx].copy()

        # Load images for this batch only
        image_bytes_list = []
        for image_path in df_batch['image_path']:
            image_bytes_list.append(load_image_bytes(image_path))
        df_batch['image_bytes'] = image_bytes_list

        # Save batch
        df_batch.to_parquet(f"{TEMP_DIR}/batch_{batch_idx:04d}.parquet",
                            compression='snappy')

    # Merge batches efficiently using PyArrow
    batch_files = sorted(glob.glob(f"{TEMP_DIR}/batch_*.parquet"))
    tables = [pq.read_table(f) for f in batch_files]
    combined_table = pa.concat_tables(tables)
    pq.write_table(combined_table, OUTPUT_FILE, compression='snappy')
```

Why batch processing?

- Loading 29K images (8.1 GB) into memory at once = OOM crash
- Batch size of 1000 = manageable memory (~280 MB per batch)
- PyArrow for efficient Parquet I/O (3-5× faster than pandas alone)
## Results & Performance

| Metric | Value | Interpretation |
|---|---|---|
| MAE | 0.605 | Predictions within Β±0.6 points on 10-point scale (6% error) |
| RMSE | 0.842 | Slightly higher due to outliers, still strong |
| RΒ² | 0.785 | Model explains 78.5% of score variance |

By pair type:

| Pair Type | MAE | Mean True Score | Mean Predicted |
|---|---|---|---|
| Positive | 0.682 | 4.93 | 4.67 |
| Random Negative | 0.587 | 3.09 | 3.14 |
| Safe Contrast | 0.562 | 3.05 | 3.10 |
Key insight: Model correctly separates positive pairs from negatives by ~1.6 points, which is what matters for ranking in production.
Top features by importance:

- contrast (32.9%) - Inverse text similarity; ads benefit from differentiation
- sim_adtext_article (24.0%) - Text-article semantic alignment
- is_ad (15.3%) - Ad classification confidence
- clarity (10.4%) - Message clarity at a glance
- twist_resolves_fast (3.2%) - Creative twist resolution speed
Insight: Contextual fit features (contrast + similarity + entities) dominate at 58.7% combined importance.
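The percentages above come from the trained model's importance scores. A small sketch of how to pull them, assuming `model` is the fitted XGBoost regressor and `feature_cols` holds the 21 feature names (the helper itself is illustrative):

```python
import numpy as np

def top_importances(model, feature_cols, k=5):
    """Return the k highest-importance features as (name, percent) pairs.

    XGBoost's feature_importances_ array is normalized to sum to 1.0,
    so multiplying by 100 gives the percentages reported above.
    """
    imp = np.asarray(model.feature_importances_)
    order = np.argsort(imp)[::-1][:k]  # indices sorted by descending importance
    return [(feature_cols[i], round(float(imp[i]) * 100, 1)) for i in order]
```

Works with any estimator exposing `feature_importances_`, so the same helper applies if the baseline is ever swapped for a random forest.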
## Integration with Bandit System

```
              BANDIT SYSTEM: AD SELECTION FLOW

 1. Generate candidates
      DSPy + GPT-5.2 → 10-20 ad variants
         ↓
 2. Extract features for each candidate
      M features (15): Visual + Copy
      F features (6):  Context fit
         ↓
 3. Score with memorability model
      XGBoost prediction: score_1_10
      Latency: < 1ms per candidate
         ↓
 4. Compute overall score
      Final = w₁·memorability
            + w₂·predicted_CTR
            + w₃·novelty
            - duplicate_penalty
            - brand_safety_penalty
         ↓
 5. Select & serve best candidate
      Thompson sampling or UCB policy
         ↓
 6. Collect feedback
      Actual CTR, dwell time, conversions
         ↓
 7. Update weights (monthly)
      Optimize w₁, w₂, w₃ based on real performance data
```
Production Inference:

```
┌────────────────────┐
│   Ad Candidate     │
│   + Article Text   │
└─────────┬──────────┘
          │
          ▼
┌──────────────────────────────┐
│ Feature Extraction (cached)  │
│  - M features: GPT-5.2 cache │
│  - F features: compute live  │
│ Latency: ~50ms if cached     │
└─────────┬────────────────────┘
          │
          ▼
┌──────────────────────────────┐
│ XGBoost Model (.pkl)         │
│  - 21 features → score       │
│ Latency: < 1ms               │
│ Cost: $0.0001 per call       │
└─────────┬────────────────────┘
          │
          ▼
┌──────────────────────────────┐
│ Score: 1-10                  │
│ (feeds into bandit scorer)   │
└──────────────────────────────┘
```
Cost comparison:
- Before: GPT-5.2 scoring = $0.02/ad + 120s latency → not feasible for real-time
- After: XGBoost scoring = $0.0001/ad + < 1ms latency → production ready
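The serving-side call can be as small as the sketch below. The bundle layout (model + per-column LabelEncoders in one pickle) and the function signature are assumptions for illustration, not the deployed interface.

```python
import numpy as np

def score_candidate(bundle, features, feature_order):
    """Score one ad-article candidate on the 1-10 scale.

    bundle:        {"model": fitted regressor, "encoders": {col: LabelEncoder}}
    features:      dict of the 21 M+F values, categoricals still raw strings
    feature_order: column order the model was trained with
    """
    model, encoders = bundle["model"], bundle["encoders"]
    row = dict(features)
    for col, enc in encoders.items():       # e.g. face_emotion → integer code
        row[col] = enc.transform([row[col]])[0]
    x = np.array([[row[c] for c in feature_order]])
    return float(model.predict(x)[0])
```

In production the bundle would be unpickled once at startup (from `baseline_xgboost_model.pkl`) and reused per request, keeping the per-call cost at the < 1ms / $0.0001 figures above.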

Key takeaways:

- Perceptual hashing saved $200+ by deduplicating before API calls
- Synthetic article generation let us control the training distribution
- Teacher-student paradigm traded upfront cost for fast inference
- Batch processing + resume capability made the pipeline robust to failures
- XGBoost was the right choice for tabular data with 21 features
- Dataset creation took 90% of time and money
  - 29K API calls to GPT-5.2 Vision ($580+)
  - Incremental saves + resume logic to handle timeouts
  - Memory-efficient batching to avoid OOM
- Preventing data leakage
  - Must split by `ad_id`, not row-level (same ad appears in multiple pairs)
  - Careful validation of positive pair alignment
- Balancing negative sample types
  - Too many random negatives = model learns trivial patterns
  - Safe contrast negatives = harder to generate but critical for nuance

Next steps:

- Replace teacher scores with real CTR data once we have enough traffic
- Add brand safety dimension (currently missing from MΓF framework)
- Experiment with learning-to-rank instead of regression
- Fine-tune embeddings instead of using off-the-shelf CLIP/OpenAI

Project at a glance:

| Metric | Value |
|---|---|
| Unique Ads | 7,258 |
| Synthetic Articles | 7,258 |
| Total Pairs | 29,032 |
| Features | 21 (15 M + 6 F) |
| Teacher Scores | 29,032 |
| Dataset Size | 8.1 GB |
| Training Time | < 5 minutes |
| Total Cost | ~$600 (mostly GPT-5.2 API) |
- Dataset: `albertbn/ad-memorability-scorer-v0` (HuggingFace, private)
- Model: `baseline_xgboost_model.pkl` (8 MB)
- Code: `/Labs/memorability/` (8 Python modules + utilities)
- Framework: Based on LoudEcho Creative Quality Framework (6 dimensions)
- Gist: Context Ad Learning Research
Questions? Ping the team or check the code in `/Labs/memorability/`

Generated for internal dev team walkthrough • February 2026