Native advertising effectiveness depends critically on both the intrinsic quality of the ad creative and its contextual fit with the surrounding article. Automated quality assessment remains challenging, however; in practice, evaluation relies primarily on post-hoc engagement metrics or expensive manual review. This project develops a machine learning model that predicts ad memorability scores (1-10 scale) by combining multimodal creative analysis with contextual alignment features.
We trained an XGBoost regression model on 29,032 ad-article pairs scored by GPT-5.2 vision, achieving strong performance (MAE=0.605, R²=0.785). The model is built on 21 interpretable features spanning visual attention, copy quality, and contextual fit, enabling real-time quality prediction for ad placement optimization.
The key innovation is the decomposition of memorability into intrinsic ad properties (M features) and contextual fit metrics (F features), allowing the model to evaluate both "is this ad good?" and "does it fit this article?" simultaneously.
```
┌────────────────────┐          ┌────────────────────┐
│      Ad Image      │          │    Article Text    │
└─────────┬──────────┘          └─────────┬──────────┘
          ▼                               ▼
┌────────────────────┐          ┌────────────────────┐
│  GPT-5.2 Vision    │          │  GPT-5.2 Text +    │
│  + CLIP (512-d)    │          │  Embeddings        │
└─────────┬──────────┘          └─────────┬──────────┘
          │ 15 M Features                 │ 4 Article Features
          │ (clarity, faces,              │ (topic, entities,
          │  OCR, twist, ...)             │  sentiment, arousal)
          └───────────────┬───────────────┘
                          ▼
                ┌────────────────────┐
                │  Compute Pair      │──► 6 F Features
                │  Features (F)      │    (similarity, overlap, alignment, ...)
                └─────────┬──────────┘
                          ▼
                ┌────────────────────┐
                │      XGBoost       │──► Score: 1-10
                │      Regressor     │
                └────────────────────┘
```
Overall Scoring Function:
Score(1-10) = f(M, F)
where:
- M = (m₁, m₂, ..., m₁₅) ∈ ℝ¹⁵ represents Memorability features (ad intrinsic quality)
- F = (f₁, f₂, ..., f₆) ∈ ℝ⁶ represents Contextual Fit features (ad-article alignment)
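To make the decomposition concrete, here is a minimal scoring sketch in Python: the 15 M values and 6 F values are concatenated into a single 21-dimensional input for the regressor. The zero placeholder vectors and the stand-in model fitted on random data are purely illustrative so the snippet runs on its own; they are not the project's actual artifacts.

```python
import numpy as np
import xgboost as xgb

# Minimal sketch of Score = f(M, F). In the real pipeline, m and f come from
# the extraction stages above and the model is the trained regressor
# described in the Model Configuration section.
m = np.zeros(15)                                   # 15 intrinsic ad (M) features
f = np.zeros(6)                                    # 6 contextual fit (F) features
x = np.concatenate([m, f]).reshape(1, -1)          # one 21-dim example

rng = np.random.default_rng(0)
model = xgb.XGBRegressor(n_estimators=20, max_depth=3)
model.fit(rng.random((64, 21)), rng.uniform(1, 10, size=64))   # stand-in fit

score = float(np.clip(model.predict(x)[0], 1.0, 10.0))         # keep on the 1-10 scale
print(f"predicted memorability: {score:.2f}")
```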
Key Pair Features:
Cosine similarity between embeddings:
f_sim_text = (e_ad · e_article) / (||e_ad|| × ||e_article||)
Entity overlap (Jaccard similarity):
f_entity_overlap = |E_ad ∩ E_article| / |E_ad ∪ E_article|
Sentiment alignment:
f_sentiment_align = 1 - |v_ad - v_article| where v ∈ [-1, 1]
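These three formulas translate directly into code. The sketch below is a plain restatement of them, assuming the embeddings, entity sets, and valence scores have already been extracted; the function names and toy inputs are illustrative.

```python
import numpy as np

def sim_text(e_ad: np.ndarray, e_article: np.ndarray) -> float:
    """Cosine similarity between ad and article embeddings (f_sim_text)."""
    return float(e_ad @ e_article / (np.linalg.norm(e_ad) * np.linalg.norm(e_article)))

def entity_overlap(ents_ad: set, ents_article: set) -> float:
    """Jaccard similarity of extracted entity sets (f_entity_overlap)."""
    union = ents_ad | ents_article
    return len(ents_ad & ents_article) / len(union) if union else 0.0

def sentiment_align(v_ad: float, v_article: float) -> float:
    """1 - |v_ad - v_article| with valences in [-1, 1] (f_sentiment_align)."""
    return 1.0 - abs(v_ad - v_article)

# Toy values for illustration only
print(sim_text(np.ones(512), np.ones(512)))                    # 1.0: identical direction
print(entity_overlap({"tesla", "model s"}, {"tesla", "bmw"}))  # 1/3
print(sentiment_align(0.6, 0.4))                               # 0.8
```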
Teacher Scoring Criteria (Reference Framework):
High Quality (8-10):
- Memorable & original (clever twist)
- Clear in < 2 seconds
- Emotionally engaging
- Contextually relevant
Mediocre (4-6):
- Generic, cliché
- Takes > 2 seconds to understand
- Forgettable
Low Quality (1-3):
- Boring, confusing
- Irrelevant to context
- Poor fit
| Metric | Value |
|---|---|
| Unique Ads | 7,258 |
| Synthetic Articles | 7,258 |
| Total Pairs | 29,032 |
| - Positive pairs | 7,258 (25%) |
| - Random negatives | 14,516 (50%) |
| - Safe contrast | 7,258 (25%) |
| Features | 21 (15 M + 6 F) |
| Teacher Scores (GPT-5.2) | 29,032 |
| Dataset Size | 8.1 GB |
M Features (15) - Ad Intrinsic Quality:
- `is_ad` - Classification confidence (0-1)
- `ocr_word_count` - Visible text word count
- `ocr_legible` - Text legibility (0-1)
- `face_count` - Number of faces
- `face_emotion` - Dominant emotion or "none"
- `clarity` - Message clarity at a glance (0-1)
- `clutter` - Visual clutter (0-1, lower = cleaner)
- `contrast_m` - Visual contrast (0-1)
- `color_palette_size` - Count of dominant colors
- `copy_word_count` - Ad copy word count
- `copy_concrete` - Concreteness (0-1)
- `copy_emotion_valence` - Emotional tone (-1 to 1)
- `copy_arousal` - Emotional intensity (0-1)
- `twist_present` - Has clever twist (bool)
- `twist_resolves_fast` - Twist resolves quickly (0-1)
F Features (6) - Contextual Fit:
- `sim_adtext_article` - Cosine similarity (ad text ↔ article)
- `sim_adimage_article` - Cosine similarity (ad image ↔ article)
- `entity_overlap_rate` - Jaccard similarity of entities
- `sentiment_alignment` - 1 - |ad_valence - article_valence|
- `topic_match` - 1.0 if topics match, else 0.0
- `contrast` - 1 - sim_adtext_article
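For reference, the 21 input names above can be collected into ordered column lists. The ordering shown here is an assumption for illustration (the parquet schema may differ), and categorical fields such as `face_emotion` would need encoding before reaching XGBoost.

```python
# Feature names as used throughout this card; ordering is illustrative,
# not necessarily the column order in the released parquet file.
M_FEATURES = [
    "is_ad", "ocr_word_count", "ocr_legible", "face_count", "face_emotion",
    "clarity", "clutter", "contrast_m", "color_palette_size",
    "copy_word_count", "copy_concrete", "copy_emotion_valence",
    "copy_arousal", "twist_present", "twist_resolves_fast",
]
F_FEATURES = [
    "sim_adtext_article", "sim_adimage_article", "entity_overlap_rate",
    "sentiment_alignment", "topic_match", "contrast",
]
ALL_FEATURES = M_FEATURES + F_FEATURES   # 21 columns fed to XGBoost
assert len(ALL_FEATURES) == 21
```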
By Pair Type:
- Positive pairs: μ = 4.77, σ = 2.25 (range: 1.2-9.3)
- Random negatives: μ = 3.01, σ = 1.34 (range: 1.0-8.6)
- Safe contrast: μ = 3.00, σ = 1.31 (range: 1.0-8.2)
Key Observation: The teacher model cleanly separates positive pairs (μ = 4.77) from both negative types (μ ≈ 3.0) by a ~1.8-point gap, suggesting reliable ground-truth labels.
| Metric | Value |
|---|---|
| MAE (Mean Absolute Error) | 0.605 |
| RMSE | 0.842 |
| R² | 0.785 |
Interpretation:
- MAE of 0.605 means predictions differ from teacher scores by about 0.6 points on average (roughly 6% of the 10-point scale)
- R² of 0.785 indicates model explains 78.5% of score variance
- These metrics are strong for ad quality prediction where human judgments are inherently noisy
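These metrics correspond to standard scikit-learn calls on the held-out split. The sketch below assumes arrays of teacher scores (`y_true`) and model predictions (`y_pred`) are already in hand; the numbers in the example call are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """MAE / RMSE / R² as reported in the table above."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": r2_score(y_true, y_pred),
    }

# Toy check with made-up numbers (not the project's test split)
print(report(np.array([5.0, 3.0, 7.5]), np.array([4.6, 3.4, 7.0])))
```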
| Pair Type | MAE | Mean True Score | Mean Predicted |
|---|---|---|---|
| Positive | 0.682 | 4.93 | 4.67 |
| Random Negative | 0.587 | 3.09 | 3.14 |
| Safe Contrast | 0.562 | 3.05 | 3.10 |
Key Takeaway: The model preserves a ~1.5-point separation between positive and negative pairs in its predictions (4.67 vs ~3.1), maintaining the ranking quality needed for ad placement optimization.
Top 10 Features by Importance:
- contrast (32.9%) - Inverse text similarity; ads benefit from differentiation
- sim_adtext_article (24.0%) - Text-article semantic alignment
- is_ad (15.3%) - Ad classification confidence
- clarity (10.4%) - Message clarity at a glance
- twist_resolves_fast (3.2%) - Creative twist resolution speed
- entity_overlap_rate (1.8%) - Shared entities (brands, products, people)
- topic_match (1.7%) - Coarse topic alignment
- twist_present (1.7%) - Creative twist presence
- copy_word_count (1.2%) - Ad copy length
- clutter (1.1%) - Visual clutter level
Insight: Contextual fit features (contrast, similarity, entity overlap) dominate importance (58.7% combined), validating the M+F decomposition approach.
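Rankings like this can be read directly off the fitted regressor's importance attribute. The sketch assumes the trained `model` and the `ALL_FEATURES` column list from the earlier sketches; in recent xgboost versions the sklearn wrapper reports gain-based fractions comparable to the percentages above.

```python
import pandas as pd

# Assumes `model` is the fitted XGBRegressor and ALL_FEATURES is the column
# list from the sketches above; feature_importances_ values sum to 1.0.
importances = pd.Series(model.feature_importances_, index=ALL_FEATURES)
print(importances.sort_values(ascending=False).head(10).round(3))
```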
Predicted Score: 5.67/10
Ad Copy: "It takes 3.1 seconds to read this ad. The same time it takes a Model S to go from 0 to 60mph. TESLA"
Article Context:
"Why Do Men Like Sports Cars So Much?"
It's become almost a cliche over the years…men love sports cars and high-performance cars. There's a feeling that goes along with punching a gas pedal...
Feature Highlights:
- M Features: clarity=0.92, twist_present=true, twist_resolves_fast=0.95, copy_concrete=0.90
- F Features: sim_adtext_article=0.24, sentiment_alignment=0.95, contrast=0.76
- Why it scores highest: Creative twist linking reading time to 0-60mph acceleration, high clarity, concrete copy, aligns with sports car theme
Predicted Score: 4.94/10
Ad Copy: "It takes 3.1 seconds to read this ad. The same time it takes a Model S to go from 0 to 60mph. TESLA"
Article Context:
"120 profitable blog niche ideas and how to pick the right one"
So, you want to create a blog that will draw in tons of readers and eventually make you a profit...
Feature Highlights:
- M Features: clarity=0.95, twist_present=true, copy_concrete=0.90
- F Features: sim_adtext_article=0.09, topic_match=0.0, contrast=0.91
- Why mid-range: Strong ad creative but weak contextual fit (blogging ≠ cars), low similarity scores
Predicted Score: 3.99/10
Ad Copy: "AS COMFORTABLE AS YOUR FAVORITE SHOW SHOP SCRUBS UA Uniform Advantage"
Article Context:
"5 Best Netflix Shows to Watch on New Year's Day: 'Stranger Things' and More"
So you don't have any plans on New Year's Day — that's not a big deal! There are plenty of shows to keep you company...
Feature Highlights:
- M Features: clarity=0.92, twist_present=false, copy_arousal=0.28
- F Features: sim_adtext_article=0.13, entity_overlap=0.0, topic_match=0.0
- Why it underperforms: Metaphorical connection ("comfortable as show") doesn't translate to strong contextual fit, no entities overlap, generic positioning
Predicted Score: 3.77/10
Ad Copy: "AS COMFORTABLE AS YOUR FAVORITE SHOW SHOP SCRUBS UA Uniform Advantage"
Article Context:
"Elon Musk plans 'high-volume production' of Neuralink brain chips..."
Elon Musk's Neuralink startup develops brain-chip implants...
Feature Highlights:
- M Features: clarity=0.95, twist_present=false, copy_arousal=0.25
- F Features: sim_adtext_article=-0.03 (negative!), entity_overlap=0.0, topic_match=0.0
- Why it fails: Complete contextual mismatch (medical scrubs ≠ brain chips), negative text similarity, no shared entities, sentiment misalignment
- Algorithm: XGBoost Regressor
- Hyperparameters: 215 trees (early stopped from max 300), depth=6, learning_rate=0.05
- Regularization: L1=1, L2=10, min_child_weight=10
- Training: 80/20 split with GroupShuffleSplit on ad_id (prevents data leakage)
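A minimal training sketch consistent with this configuration, assuming a dataframe `df` that already contains the 21 feature columns (categoricals encoded), a `score` column with teacher scores, and `ad_id`. The random seed and early-stopping patience are illustrative choices, not values from the actual run.

```python
import xgboost as xgb
from sklearn.model_selection import GroupShuffleSplit

X, y, groups = df[ALL_FEATURES], df["score"], df["ad_id"]

# Group the split by ad_id so the same ad never appears in both train and test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups))

model = xgb.XGBRegressor(
    n_estimators=300,          # cap; early stopping landed on ~215 trees
    max_depth=6,
    learning_rate=0.05,
    reg_alpha=1,               # L1
    reg_lambda=10,             # L2
    min_child_weight=10,
    early_stopping_rounds=50,  # illustrative patience
)
# Here the held-out fold doubles as the early-stopping eval set; the actual
# pipeline may use a separate validation split.
model.fit(
    X.iloc[train_idx], y.iloc[train_idx],
    eval_set=[(X.iloc[test_idx], y.iloc[test_idx])],
    verbose=False,
)
```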
- GPT-5.2 Vision API: M features extraction from ad images (downsampled to 1024×1024, JPEG q=85)
- GPT-5.2 Text API: Article features (topic, entities, sentiment, arousal)
- CLIP: openai/clip-vit-base-patch32 for 512-dim image embeddings
- Text Embeddings: text-embedding-3-small (512-dim) for ad text and articles
- Timeouts: 120s per API call with automatic retry logic
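The embeddings feeding the similarity features can be produced with the stock libraries named above. This is a sketch only: batching, retries, and the GPT-5.2 feature-extraction calls are omitted, and it assumes an `OPENAI_API_KEY` is configured.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from openai import OpenAI

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def image_embedding(path: str) -> torch.Tensor:
    """512-dim CLIP image embedding for an ad creative."""
    inputs = proc(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)[0]

def text_embedding(text: str) -> list[float]:
    """512-dim text embedding for ad copy or article text."""
    resp = client.embeddings.create(
        model="text-embedding-3-small", input=text, dimensions=512
    )
    return resp.data[0].embedding
```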
- Full feature extraction: ~$0.02 per ad-article pair (2× GPT-5.2 calls + embeddings)
- Inference only (with cached features): ~$0.0001 per pair (model prediction only)
- Training dataset cost: Several hundred USD for the 29K pairs (roughly 29,032 × $0.02 ≈ $580 at the per-pair rate above)
- HuggingFace: `albertbn/ad-memorability-scorer-v0` (private dataset)
- Contents: Consolidated parquet file (8.1 GB) with images as binary bytes + all features + scores
- Access: Contact for permissions
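Once access is granted, the consolidated parquet can be pulled and read with standard tooling. The filename below is a guess, since the repo layout isn't documented here.

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Requires accepted access to the private dataset and a valid HF token.
# "data.parquet" is a hypothetical filename; check the repo for the real one.
path = hf_hub_download(
    repo_id="albertbn/ad-memorability-scorer-v0",
    filename="data.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(path)
print(df.shape)  # expected: 29,032 rows of features, scores, and metadata
```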
- Fine-tune on real engagement data: Replace teacher scores with actual CTR/viewability metrics
- Add brand safety dimension (S factor): Extend to full M × F × S formula from original concept
- Multi-task learning: Jointly predict score + generate natural language reasoning
- Feature ablation studies: Identify minimal feature set for real-time inference
- Video ad support: Extend to temporal/motion features for video creatives
- Pairwise ranking objective: Switch from regression to learning-to-rank for better separation
- Multimodal fusion: End-to-end neural model replacing hand-crafted features
- Personalization: Condition on user demographics/interests for individualized scoring
If you use this dataset or model, please cite:
```bibtex
@dataset{albertbn_ad_memorability_v0,
  title={Ad Memorability Scorer: Context-Aware Quality Prediction},
  author={FrameAI},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/albertbn/ad-memorability-scorer-v0}
}
```

Contact: For dataset access, collaboration, or questions about this work, please reach out via GitHub or HuggingFace.
Generated with GPT-5.2 teacher scoring and XGBoost baseline regression • January 2026



