- **Target Audience:** Dev Engineers
- **Goal:** Understand the dataset creation pipeline and training process
- **Context:** Part of the LoudEcho online learning bandit system
- Big Picture: Where This Fits
- The Problem We're Solving
- Dataset Creation Pipeline
- Training Process
- Key Code Snippets
- Results & Performance
- Integration with Bandit System
## Big Picture: Where This Fits

The memorability scorer is one component in a multi-armed bandit system for ad creative optimization:
```
                     ONLINE LEARNING BANDIT

┌────────────┐    ┌──────────────┐    ┌───────────────┐
│    DSPy    │───▶│  Candidate   │───▶│ Memorability  │
│ Generator  │    │ Ad Creatives │    │    Scorer     │
└────────────┘    └──────────────┘    └───────┬───────┘
                                              │
                                              ▼
                  ┌───────────────────────────────────┐
                  │ Scoring Function:                 │
                  │                                   │
                  │   Score = w₁·Quality(C)           │
                  │         + w₂·Performance(P)       │
                  │         + w₃·Novelty              │
                  │         - penalties               │
                  └─────────────────┬─────────────────┘
                                    │
                                    ▼
                       ┌────────────────────────┐
                       │ Ad Selection &         │
                       │ Serving Decision       │
                       └───────────┬────────────┘
                                   │
                                   ▼
                       ┌────────────────────┐
                       │ Real Engagement    │
                       │ Data (CTR, etc.)   │
                       └───────────┬────────┘
                                   │
                                   ▼
                       ┌────────────────────┐
                       │ Feedback Loop:     │
                       │ Update Weights     │
                       │ & Retrain Models   │
                       └────────────────────┘
```
The Memorability Scorer predicts Quality(C) - how good an ad creative is in a given context.
## The Problem We're Solving

Challenge: Given an ad creative and an article context, predict how memorable and effective the ad will be.
Why it matters:
- Manual review doesn't scale (thousands of ad variants)
- Post-hoc metrics (CTR) come too late for real-time decisions
- Need to rank candidates BEFORE serving to users
Solution approach:
- Train a "teacher model" (GPT-5.2 Vision) to score thousands of ad-article pairs
- Use those scores to train a fast, cheap XGBoost model for production
- Deploy the XGBoost model for real-time scoring
## Dataset Creation Pipeline

This was 90% of the effort and cost (several hundred USD in API calls). Here's the full pipeline:
```
                     DATASET CREATION PIPELINE

 Step 1: Image Deduplication
     Pinterest + Twitter ad images (10K+)
         ↓ perceptual hashing (aHash: 32×32 grayscale)
     7,258 unique ads

 Step 2: Synthetic Article Generation
     For each ad image:
       - GPT-5.2 Vision analyzes the ad
       - Generates contextually relevant article
       - 1:1 mapping (ad_00001 ↔ article_00001)
         ↓
     7,258 synthetic articles

 Step 3: Ad Feature Extraction
     Extract M features (15 dimensions):
       - Visual: faces, clarity, clutter, contrast
       - Text: OCR, copy quality, concreteness
       - Creative: twist present, resolves fast
       - CLIP embeddings (512-dim)

 Step 4: Article Feature Extraction
     Extract article features (4 dimensions):
       - Topic category
       - Named entities
       - Sentiment valence
       - Emotional arousal
       - Text embeddings (512-dim)

 Step 5: Generate Negative Samples
     For each ad:
       → 1 positive pair (original synthetic match)
       → 2 random negatives (any mismatched article)
       → 1 safe contrast (different topic, similar tone)
         ↓
     29,032 total pairs
       - 7,258 positive (25%)
       - 14,516 random negatives (50%)
       - 7,258 safe contrast (25%)

 Step 6: Compute Pair Features
     Combine ad + article → F features (6 dimensions):
       - sim_adtext_article (cosine similarity)
       - sim_adimage_article (CLIP similarity)
       - entity_overlap_rate (Jaccard)
       - sentiment_alignment (distance)
       - topic_match (binary)
       - contrast (inverse similarity)

 Step 7: Teacher Scoring
     For each of 29,032 pairs, GPT-5.2 Vision evaluates:
       • Ad memorability & originality
       • Message clarity (< 2 sec?)
       • Emotional engagement
       • Contextual relevance
     Outputs: score (1-10) + reasoning
         ↓
     29,032 labeled training examples
     Cost: ~$0.02/pair × 29K = $580+

 Step 8: Consolidate Dataset
     Merge all data into a single file:
       - Images (as bytes)
       - Ad text + article text
       - 21 features (15 M + 6 F)
       - Teacher scores + reasoning
         ↓
     memorability_dataset_consolidated.parquet (8.1 GB)
```
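The Step 6 pair features reduce to a handful of vector and set operations. Below is a minimal sketch, assuming embeddings, entity sets, sentiment, and topic have already been extracted per ad and article; the dict field names are illustrative, not the pipeline's actual schema.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard(a: set, b: set) -> float:
    """Entity overlap rate (Jaccard index); 0.0 when both sets are empty."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def pair_features(ad: dict, article: dict) -> dict:
    """Compute the six F features for one ad-article pair."""
    sim_text = cosine_sim(ad["text_emb"], article["text_emb"])
    return {
        "sim_adtext_article": sim_text,
        # Cross-modal: ad image embedding vs. article text embedding
        "sim_adimage_article": cosine_sim(ad["image_emb"], article["text_emb"]),
        "entity_overlap_rate": jaccard(ad["entities"], article["entities"]),
        # Distance, per the diagram above: smaller = better aligned
        "sentiment_alignment": abs(ad["sentiment"] - article["sentiment"]),
        "topic_match": float(ad["topic"] == article["topic"]),
        "contrast": 1.0 - sim_text,  # inverse similarity
    }
```

In the real pipeline these six values are appended to the 15 M features to form the 21-column training matrix.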
Why synthetic articles instead of real ones?
- Real ad-article pairs have selection bias (ads are already chosen to fit)
- Synthetic articles let us control the distribution of positive/negative pairs
- We can generate "safe contrast" pairs (same sentiment, different topic) to teach the model nuance
Why negative samples?
- A model trained only on positive pairs can't discriminate
- Random negatives teach "this ad doesn't fit this article"
- Safe contrast negatives teach subtler distinctions (not just topic matching)
Why teacher-student approach?
- GPT-5.2 Vision is expensive ($0.02/pair) and slow (120s timeout)
- XGBoost is cheap ($0.0001/pair) and fast (< 1ms)
- Trade-off: upfront labeling cost → cheap inference forever
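The trade-off can be made concrete with back-of-the-envelope arithmetic, using the per-pair costs quoted above (illustrative, not billing-exact):

```python
# Break-even point for the teacher-student trade-off.
teacher_cost_per_pair = 0.02    # GPT-5.2 Vision, USD
student_cost_per_pair = 0.0001  # XGBoost inference, USD
upfront_labeling = 29_032 * teacher_cost_per_pair  # one-time dataset cost, ~$580

# Each production score done by the student instead of the teacher saves
# the cost difference, so the labeling spend pays for itself after:
break_even_calls = upfront_labeling / (teacher_cost_per_pair - student_cost_per_pair)
print(round(break_even_calls))  # → 29178
```

In other words, roughly one dataset's worth of production traffic amortizes the entire labeling bill, ignoring the (much larger) latency win.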
## Training Process

Once we have the labeled dataset, training is straightforward:
```
                XGBOOST TRAINING PIPELINE

 1. Load consolidated dataset
        ↓
 2. Split by ad_id (prevent leakage)
      - Train: 80% of unique ads
      - Test:  20% of unique ads
        ↓
 3. Encode categorical features
      - face_emotion: LabelEncoder
        ↓
 4. Train XGBoost regressor
      - Target: teacher_score_1_10
      - Features: 21 (15 M + 6 F)
      - Hyperparameters:
          • n_estimators: 300 (early stopped at 215)
          • max_depth: 6
          • learning_rate: 0.05
          • L1 regularization: 1
          • L2 regularization: 10
        ↓
 5. Evaluate on test set
      - MAE: 0.605 (6% error on 10-point scale)
      - R²:  0.785 (explains 78.5% of variance)
        ↓
 6. Save model + encoders
      - baseline_xgboost_model.pkl
        ↓
```
Considered alternatives:
- Neural network: Requires more data, harder to interpret, overkill for 21 features
- Linear regression: Too simple, can't capture feature interactions
- Random Forest: Similar to XGBoost but slower and less accurate
Why XGBoost won:
- ✓ Handles tabular data with mixed types (categorical + numeric) out of the box
- ✓ Built-in regularization (L1, L2) prevents overfitting
- ✓ Feature importance scores for interpretability
- ✓ Fast training (< 5 min on CPU)
- ✓ Blazing fast inference (< 1ms per prediction)
- ✓ Industry standard for Kaggle/production regression tasks
Key design decisions:
- GroupShuffleSplit by ad_id: Prevents data leakage (otherwise the same ad would appear in both train and test)
- Early stopping: Prevents overfitting by stopping at 215 trees when validation MAE plateaus
- Regression, not ranking: We want absolute scores, not just relative ordering (for now)
## Key Code Snippets

File: `ImageDeduplicator.py:24-37`

This was critical to avoid wasting money labeling duplicate ads.
```python
import hashlib

import cv2

# Method of the deduplicator class; shown standalone here.
@staticmethod
def compute_image_hash(image_path: str) -> str | None:
    """
    Compute aHash for image.

    Algorithm: 32x32 grayscale → compare to mean → binary hash → MD5
    """
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        return None
    resized_image = cv2.resize(image, (32, 32))
    avg_pixel_value = resized_image.mean()
    hash_str = ''.join('1' if pixel > avg_pixel_value else '0'
                       for pixel in resized_image.flatten())
    return hashlib.md5(hash_str.encode()).hexdigest()
```

Why this works:

- Resizing to 32×32 normalizes for size variations
- Comparing to mean creates a perceptual fingerprint
- Near-identical images (different compression, minor edits) get the same hash
- Result: Reduced 10K+ images → 7,258 unique ads
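The dedup pass itself is then a first-seen-wins scan over the hashes. This driver loop is a hypothetical sketch (the real module's loop may differ); `hash_fn` is whatever produces the hash, e.g. `compute_image_hash` above.

```python
def deduplicate(image_paths, hash_fn):
    """Keep the first image seen per hash; skip unreadable files
    (hash_fn returns None) and later duplicates."""
    seen, unique = set(), []
    for path in image_paths:
        h = hash_fn(path)
        if h is None or h in seen:
            continue  # duplicate or unreadable → skip
        seen.add(h)
        unique.append(path)
    return unique
```

Run over the raw Pinterest + Twitter dump, this is the step that takes 10K+ images down to the 7,258 unique ads quoted above.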
File: `GenerateNegativeSamples.py:70-153`

Teaching the model to discriminate requires careful negative sampling.
```python
import random

from tqdm import tqdm

# Note: `all_article_ids` and `articles_by_id` are built earlier in this
# method (lines not shown in this excerpt).
def generate_pairs(self):
    """Generate positive and negative pairs"""
    pairs = []
    for _, ad_row in tqdm(self.df_ads.iterrows(), total=len(self.df_ads)):
        ad_id = ad_row['ad_id']
        ad_topic = ad_row.get('ad_topic', 'other')
        ad_sentiment = ad_row.get('copy_emotion_valence', 0.0)

        # 1. Positive pair (original 1:1 synthetic pairing)
        positive_article_id = ad_id  # ad_00001 ↔ article_00001
        pairs.append({
            'ad_id': ad_id,
            'article_id': positive_article_id,
            'pair_type': 'positive'
        })

        # 2. Random negatives (any mismatched article)
        other_article_ids = [a_id for a_id in all_article_ids
                             if a_id != positive_article_id]
        random_article_ids = random.sample(other_article_ids,
                                           self.random_negatives_per_ad)
        for article_id in random_article_ids:
            pairs.append({
                'ad_id': ad_id,
                'article_id': article_id,
                'pair_type': 'random_negative'
            })

        # 3. Safe contrast negatives (different topic, similar sentiment)
        contrast_candidates = [
            a_id for a_id in other_article_ids
            if articles_by_id[a_id].get('topic_category') != ad_topic  # Different topic
            and abs(articles_by_id[a_id].get('sentiment_valence', 0.0)
                    - ad_sentiment) < 0.5  # Similar sentiment
        ]
        contrast_article_ids = random.sample(
            contrast_candidates,
            min(self.contrast_negatives_per_ad, len(contrast_candidates))
        )
        for article_id in contrast_article_ids:
            pairs.append({
                'ad_id': ad_id,
                'article_id': article_id,
                'pair_type': 'safe_contrast'
            })
```

Why three types of negatives?

- Random negatives: Easy examples (car ad × cooking article = bad)
- Safe contrast: Hard examples (car ad × travel article = maybe okay?)
- Teaches nuance: Model learns contextual fit, not just topic matching
File: `ConsolidateDataset.py:61-198`

With 8.1 GB of data including images, we need batch processing to avoid OOM.
```python
import glob
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from tqdm import tqdm

# DATA_DIR, OUTPUT_FILE, and load_image_bytes() are defined earlier in the file.

def consolidate():
    """Merge all data into single consolidated file (memory-efficient batching)"""
    BATCH_SIZE = 1000  # Process 1000 rows at a time
    TEMP_DIR = f"{DATA_DIR}/temp_batches"
    os.makedirs(TEMP_DIR, exist_ok=True)

    # Load metadata (no images yet)
    df_pairs = pd.read_csv(f"{DATA_DIR}/pairs.csv")
    df_features = pd.read_csv(f"{DATA_DIR}/features_full.csv")
    df_scores = pd.read_csv(f"{DATA_DIR}/teacher_scores.csv")

    # Merge all metadata
    df = df_pairs.merge(df_features, on=['ad_id', 'article_id'])
    df = df.merge(df_scores, on=['ad_id', 'article_id'])

    # Process in batches (load images only for current batch)
    num_batches = (len(df) + BATCH_SIZE - 1) // BATCH_SIZE
    for batch_idx in tqdm(range(num_batches), desc="Processing batches"):
        start_idx = batch_idx * BATCH_SIZE
        end_idx = min((batch_idx + 1) * BATCH_SIZE, len(df))
        df_batch = df.iloc[start_idx:end_idx].copy()

        # Load images for this batch only
        image_bytes_list = []
        for image_path in df_batch['image_path']:
            image_bytes_list.append(load_image_bytes(image_path))
        df_batch['image_bytes'] = image_bytes_list

        # Save batch
        df_batch.to_parquet(f"{TEMP_DIR}/batch_{batch_idx:04d}.parquet",
                            compression='snappy')

    # Merge batches efficiently using PyArrow
    batch_files = sorted(glob.glob(f"{TEMP_DIR}/batch_*.parquet"))
    tables = [pq.read_table(f) for f in batch_files]
    combined_table = pa.concat_tables(tables)
    pq.write_table(combined_table, OUTPUT_FILE, compression='snappy')
```

Why batch processing?

- Loading 29K images (8.1 GB) into memory at once = OOM crash
- Batch size of 1000 = manageable memory (~280 MB per batch)
- PyArrow for efficient Parquet I/O (3-5× faster than pandas alone)
## Results & Performance

| Metric | Value | Interpretation |
|---|---|---|
| MAE | 0.605 | Predictions within Β±0.6 points on 10-point scale (6% error) |
| RMSE | 0.842 | Slightly higher due to outliers, still strong |
| RΒ² | 0.785 | Model explains 78.5% of score variance |

By pair type:

| Pair Type | MAE | Mean True Score | Mean Predicted |
|---|---|---|---|
| Positive | 0.682 | 4.93 | 4.67 |
| Random Negative | 0.587 | 3.09 | 3.14 |
| Safe Contrast | 0.562 | 3.05 | 3.10 |
Key insight: Model correctly separates positive pairs from negatives by ~1.6 points, which is what matters for ranking in production.
Top features by importance:

- contrast (32.9%) - Inverse text similarity; ads benefit from differentiation
- sim_adtext_article (24.0%) - Text-article semantic alignment
- is_ad (15.3%) - Ad classification confidence
- clarity (10.4%) - Message clarity at a glance
- twist_resolves_fast (3.2%) - Creative twist resolution speed
Insight: Contextual fit features (contrast + similarity + entities) dominate at 58.7% combined importance.
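The percentages above come from the trained model's importance scores. A small sketch of how to pull them, assuming `model` is the fitted XGBoost regressor and `feature_cols` holds the 21 feature names (the helper itself is illustrative):

```python
import numpy as np

def top_importances(model, feature_cols, k=5):
    """Return the k highest-importance features as (name, percent) pairs.

    XGBoost's feature_importances_ array is normalized to sum to 1.0,
    so multiplying by 100 gives the percentages reported above.
    """
    imp = np.asarray(model.feature_importances_)
    order = np.argsort(imp)[::-1][:k]  # indices sorted by descending importance
    return [(feature_cols[i], round(float(imp[i]) * 100, 1)) for i in order]
```

Works with any estimator exposing `feature_importances_`, so the same helper applies if the baseline is ever swapped for a random forest.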
## Integration with Bandit System

```
              BANDIT SYSTEM: AD SELECTION FLOW

 1. Generate candidates
      DSPy + GPT-5.2 → 10-20 ad variants
         ↓
 2. Extract features for each candidate
      M features (15): Visual + Copy
      F features (6):  Context fit
         ↓
 3. Score with memorability model
      XGBoost prediction: score_1_10
      Latency: < 1ms per candidate
         ↓
 4. Compute overall score
      Final = w₁·memorability
            + w₂·predicted_CTR
            + w₃·novelty
            - duplicate_penalty
            - brand_safety_penalty
         ↓
 5. Select & serve best candidate
      Thompson sampling or UCB policy
         ↓
 6. Collect feedback
      Actual CTR, dwell time, conversions
         ↓
 7. Update weights (monthly)
      Optimize w₁, w₂, w₃ based on real performance data
```
Production Inference:

```
┌────────────────────┐
│   Ad Candidate     │
│   + Article Text   │
└─────────┬──────────┘
          │
          ▼
┌──────────────────────────────┐
│ Feature Extraction (cached)  │
│  - M features: GPT-5.2 cache │
│  - F features: compute live  │
│ Latency: ~50ms if cached     │
└─────────┬────────────────────┘
          │
          ▼
┌──────────────────────────────┐
│ XGBoost Model (.pkl)         │
│  - 21 features → score       │
│ Latency: < 1ms               │
│ Cost: $0.0001 per call       │
└─────────┬────────────────────┘
          │
          ▼
┌──────────────────────────────┐
│ Score: 1-10                  │
│ (feeds into bandit scorer)   │
└──────────────────────────────┘
```
Cost comparison:
- Before: GPT-5.2 scoring = $0.02/ad + 120s latency → not feasible for real-time
- After: XGBoost scoring = $0.0001/ad + < 1ms latency → production ready
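The serving-side call can be as small as the sketch below. The bundle layout (model + per-column LabelEncoders in one pickle) and the function signature are assumptions for illustration, not the deployed interface.

```python
import numpy as np

def score_candidate(bundle, features, feature_order):
    """Score one ad-article candidate on the 1-10 scale.

    bundle:        {"model": fitted regressor, "encoders": {col: LabelEncoder}}
    features:      dict of the 21 M+F values, categoricals still raw strings
    feature_order: column order the model was trained with
    """
    model, encoders = bundle["model"], bundle["encoders"]
    row = dict(features)
    for col, enc in encoders.items():       # e.g. face_emotion → integer code
        row[col] = enc.transform([row[col]])[0]
    x = np.array([[row[c] for c in feature_order]])
    return float(model.predict(x)[0])
```

In production the bundle would be unpickled once at startup (from `baseline_xgboost_model.pkl`) and reused per request, keeping the per-call cost at the < 1ms / $0.0001 figures above.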

Key takeaways:

- Perceptual hashing saved $200+ by deduplicating before API calls
- Synthetic article generation let us control the training distribution
- Teacher-student paradigm traded upfront cost for fast inference
- Batch processing + resume capability made the pipeline robust to failures
- XGBoost was the right choice for tabular data with 21 features
- Dataset creation took 90% of time and money
  - 29K API calls to GPT-5.2 Vision ($580+)
  - Incremental saves + resume logic to handle timeouts
  - Memory-efficient batching to avoid OOM
- Preventing data leakage
  - Must split by `ad_id`, not row-level (same ad appears in multiple pairs)
  - Careful validation of positive pair alignment
- Balancing negative sample types
  - Too many random negatives = model learns trivial patterns
  - Safe contrast negatives = harder to generate but critical for nuance

Next steps:

- Replace teacher scores with real CTR data once we have enough traffic
- Add brand safety dimension (currently missing from MΓF framework)
- Experiment with learning-to-rank instead of regression
- Fine-tune embeddings instead of using off-the-shelf CLIP/OpenAI

Project at a glance:

| Metric | Value |
|---|---|
| Unique Ads | 7,258 |
| Synthetic Articles | 7,258 |
| Total Pairs | 29,032 |
| Features | 21 (15 M + 6 F) |
| Teacher Scores | 29,032 |
| Dataset Size | 8.1 GB |
| Training Time | < 5 minutes |
| Total Cost | ~$600 (mostly GPT-5.2 API) |
- Dataset: `albertbn/ad-memorability-scorer-v0` (HuggingFace, private)
- Model: `baseline_xgboost_model.pkl` (8 MB)
- Code: `/Labs/memorability/` (8 Python modules + utilities)
- Framework: Based on LoudEcho Creative Quality Framework (6 dimensions)
- Gist: Context Ad Learning Research
Questions? Ping the team or check the code in `/Labs/memorability/`

Generated for internal dev team walkthrough • February 2026