Albert Bentov • 2026-01-18
Goal: Present a rigorous, data-driven framework for continuous ad creative optimization
We optimize ad creative selection to maximize performance + creative quality under hard safety constraints, using human feedback and logged metrics to continuously improve.
Objective function:

$$\text{Score}(x,a) \;=\; w_p\,\hat{P}(x,a) \;+\; w_c\,\hat{C}(x,a) \;+\; w_o\,\text{Novelty}(a) \;-\; w_d\,\text{Dup}(a), \qquad a^*(x) \;=\; \arg\max_a \text{Score}(x,a)$$

Subject to:

$$\hat{C}(x,a) \;\ge\; \tau_{\text{safety}} \quad \text{(hard safety gate)}$$

Where:

- $x$ = context (brand brief, article content, user features)
- $a$ = candidate creative (tagline, image, layout)
- $\hat{P}(x,a)$ = predicted performance (CTR/CVR) — learned from logs
- $\hat{C}(x,a)$ = predicted creative quality (approval probability) — learned from human feedback
- $\text{Novelty}(a)$ = min cosine distance to served history — computed
- $\text{Dup}(a)$ = binary duplicate indicator — computed
- $w_p, w_c, w_o, w_d$ = tunable weights; $\tau_{\text{safety}}$ = safety-gate threshold
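To make the objective concrete, here is a minimal Python sketch of the score computation. The weight values and the `novelty`/`score` helper names are illustrative assumptions, not the production implementation.

```python
import numpy as np

def novelty(candidate_emb: np.ndarray, served_embs: np.ndarray) -> float:
    """Min cosine distance between a candidate embedding and the served history."""
    if served_embs.size == 0:
        return 1.0  # nothing served yet: maximally novel
    sims = served_embs @ candidate_emb / (
        np.linalg.norm(served_embs, axis=1) * np.linalg.norm(candidate_emb) + 1e-9
    )
    return float(1.0 - sims.max())  # min distance = 1 - max similarity

def score(p_hat: float, c_hat: float, nov: float, dup: bool,
          w_p: float = 1.0, w_c: float = 1.0, w_o: float = 0.2, w_d: float = 5.0) -> float:
    """Reranker objective: reward predicted performance, quality, and novelty;
    penalize duplicates. Weight values here are placeholders, tuned in Loop 4."""
    return w_p * p_hat + w_c * c_hat + w_o * nov - w_d * float(dup)
```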
Request (campaign, article_url)
↓
[1] Return default ad (instant)
↓ (async)
[2] Generate N candidates: π_θ(a|x) ← DSPy (learned prompt/program)
↓
[3] Safety gate: filter unsafe ← Ĉ threshold (learned)
↓
[4] Score candidates: Score(x,a) ← Ĉ + P̂ + novelty (learned + computed)
↓
[5] Explore/exploit: ε-greedy selection ← Log propensity π(a|x)
↓
[6] Serve winner + log (propensity, features, outcome)
↓
[7] Learning loops: retrain Ĉ, P̂, π_θ weekly/daily
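Steps [2]–[6] can be read as one selection routine. The sketch below assumes hypothetical helpers (`generate_candidates`, `c_hat`, `p_hat`, `novelty_of`, `is_dup`, `epsilon_greedy`, `log_impression`) and ignores the async plumbing around the instant default in step [1].

```python
def select_creative(x, anchors, tau_safety=0.5, n_candidates=8):
    # [2] Generate N delta-edited candidates from the compiled DSPy program
    candidates = generate_candidates(x, anchors, n=n_candidates)

    # [3] Safety gate: drop anything below the learned approval threshold
    safe = [a for a in candidates if c_hat(x, a) >= tau_safety]
    if not safe:
        return None  # keep the default anchor already served in step [1]

    # [4] Score survivors with the learned + computed reranker objective
    scored = [(a, score(p_hat(x, a), c_hat(x, a), novelty_of(a), is_dup(a)))
              for a in safe]

    # [5] Explore/exploit selection; returns the creative and its propensity
    winner, propensity = epsilon_greedy(scored, epsilon=0.1, top_k=3)

    # [6] Log everything needed for off-policy learning later
    log_impression(x, winner, propensity)
    return winner
```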
Key insight: This mirrors how code assistants work—generate candidates, run tests (safety gates), pick best, learn from feedback.
| Component | Purpose | Input | Output | Training Data | Method | Artifact |
|---|---|---|---|---|---|---|
| $\hat{C}$ (quality model) | Predict human approval | brand, article, creative | approval probability | 1224 labels (512 approved, 712 rejected) | Fine-tuned LLM (OpenAI) | quality_model_vK.bin |
| $\hat{P}$ (performance model) | Predict performance | article, creative | predicted CTR/CVR | Impression logs (K clicks / N imps) | Weighted logistic regression | perf_model_vK.bin |
| $\pi_\theta$ (generator) | Generate candidates | brand, article, anchor | N creative variants | 512 approved examples | DSPy compile (prompt optimization) | dspy_program_vK.json |
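As one concrete reading of the performance-model row, a propensity-weighted logistic regression could be fit as below; the feature construction (concatenated embeddings) and the clipping threshold are assumptions, not the deployed recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_perf_model(X: np.ndarray, clicked: np.ndarray, propensity: np.ndarray) -> LogisticRegression:
    """Fit P-hat as a weighted logistic regression over impression logs.

    X:          features per impression, e.g. article embedding concatenated with creative embedding
    clicked:    0/1 click outcomes
    propensity: pi_0(a|x) logged at serve time; inverse-propensity weights
                de-bias the fit toward what exploration would have shown.
    """
    weights = 1.0 / np.clip(propensity, 1e-3, None)  # clip to cap weight variance
    model = LogisticRegression(max_iter=1000)
    model.fit(X, clicked, sample_weight=weights)
    return model  # model.predict_proba(X_new)[:, 1] gives the CTR estimate
```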
Instead of generating creatives from scratch (prone to hallucination), we use delta editing:
Anchors are human-curated, on-brand baseline ads. Deltas preserve meaning/CTA while incorporating safe article cues. This keeps creative "controllable."
DSPy Pipeline:

- Strategist: $(\text{brand}, \text{article}) \rightarrow (\text{angle}, \text{avoid-list})$ — runs once
- DeltaRewrite: $(\text{anchor}, \text{angle}) \rightarrow a_i$ — runs N times
- Critic (optional): $a_i \rightarrow a_i'$ — fixes/improves
DSPy's compile() optimizes instructions and few-shot examples offline to maximize approval rate on validation set.
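A minimal sketch of what such a program could look like in DSPy; the signature field names, the placeholder metric, the trainset, and the model string are illustrative assumptions rather than the deployed configuration.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model choice

class Strategist(dspy.Signature):
    """Pick a contextual angle for the ad and list topics to avoid."""
    brand_brief = dspy.InputField()
    article_summary = dspy.InputField()
    angle = dspy.OutputField()
    avoid_list = dspy.OutputField()

class DeltaRewrite(dspy.Signature):
    """Rewrite the anchor tagline to use the angle, preserving meaning and CTA."""
    anchor = dspy.InputField()
    angle = dspy.InputField()
    avoid_list = dspy.InputField()
    variant = dspy.OutputField()

class CreativeProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.strategist = dspy.ChainOfThought(Strategist)
        self.rewrite = dspy.Predict(DeltaRewrite)

    def forward(self, brand_brief, article_summary, anchor):
        plan = self.strategist(brand_brief=brand_brief, article_summary=article_summary)
        return self.rewrite(anchor=anchor, angle=plan.angle, avoid_list=plan.avoid_list)

def approval_metric(gold, pred, trace=None):
    # placeholder metric: the real one scores pred.variant with the learned C-hat
    return float(bool(pred.variant) and pred.variant.strip() != gold.anchor.strip())

# trainset: dspy.Example objects built from approved creatives, each carrying
# brand_brief, article_summary, anchor (inputs) and the approved variant (label).
optimizer = dspy.BootstrapFewShot(metric=approval_metric)
compiled_program = optimizer.compile(CreativeProgram(), trainset=trainset)
```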
When retraining models, we need unbiased estimates from logged data. Without propensities, naive CTR is biased (old policy always picked high-scoring ads).
Logged data: for each impression we record $(x_i, a_i, p_i, r_i)$, where $p_i = \pi_0(a_i \mid x_i)$ is the propensity under the logging policy and $r_i$ is the observed reward (click/conversion).

IPS estimator for a new (deterministic) policy $\pi$:

$$\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathbb{1}[\pi(x_i) = a_i]}{p_i}\, r_i$$

Why it works: it upweights rare actions that the logging policy seldom served, so clicks on explored variants are not drowned out by exploit traffic.
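A direct translation of the estimator, assuming the new policy is deterministic (the indicator-match form used in the worked example at the end of this document):

```python
import numpy as np

def ips_estimate(rewards, logged_actions, logged_propensities, new_policy_actions) -> float:
    """Inverse propensity scoring estimate of a new policy's value.

    Per logged impression i:
      rewards[i]             r_i (e.g. click 0/1)
      logged_actions[i]      a_i actually served by the old policy
      logged_propensities[i] p_i = pi_0(a_i | x_i)
      new_policy_actions[i]  the action the new policy would serve for x_i
    """
    r = np.asarray(rewards, dtype=float)
    p = np.asarray(logged_propensities, dtype=float)
    match = np.asarray(logged_actions) == np.asarray(new_policy_actions)
    return float(np.mean(np.where(match, r / p, 0.0)))
```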
Implementation:
- 90%: serve argmax (exploit)
- 10%: sample from top-K (explore)
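A sketch of the ε-greedy step that also returns the propensity to log. The top-K truncation mirrors the 90/10 split above; the exact propensity accounting (the argmax can also be drawn during exploration) is an assumption about how we would implement it.

```python
import random

def epsilon_greedy(scored, epsilon=0.1, top_k=3):
    """Pick a creative and return (creative, propensity) for logging.

    scored: list of (candidate, score) pairs that already passed the safety gate.
    With prob 1 - epsilon serve the argmax; with prob epsilon serve a uniform
    draw from the top_k. The propensity is the total probability of the served
    candidate under this rule, which is exactly what IPS needs later.
    """
    ranked = sorted(scored, key=lambda cs: cs[1], reverse=True)
    top = [c for c, _ in ranked[:top_k]]
    best = top[0]
    chosen = random.choice(top) if random.random() < epsilon else best
    propensity = epsilon / len(top) + (1.0 - epsilon if chosen == best else 0.0)
    return chosen, propensity
```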
| Field | Type | Purpose |
|---|---|---|
| request_id | UUID | Unique request identifier |
| timestamp | datetime | When served |
| campaign_id, brand_id | str | Campaign context |
| article_url, article_embedding | str, vector(256) | Article context |
| anchor_id, candidate_id | str | Which anchor → which variant |
| creative_text | str | Actual served tagline |
| propensity | float | $\pi(a\mid x)$ under the logging policy (needed for IPS) |
| policy_version | str | Generator + reranker versions |
| impression, viewable, click, conversion | bool | Outcome metrics |
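For reference, the same schema as a typed record on the application side (a sketch; field names mirror the table above, with the embedding stored as a pgvector column in Postgres):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ImpressionLog:
    """One impression-log row; written at serve time, outcome fields filled as events arrive."""
    request_id: str
    timestamp: datetime
    campaign_id: str
    brand_id: str
    article_url: str
    article_embedding: list[float]  # 256-d vector (pgvector)
    anchor_id: str
    candidate_id: str
    creative_text: str
    propensity: float               # pi(a|x) at serve time, needed for IPS
    policy_version: str
    impression: bool
    viewable: bool
    click: bool
    conversion: bool
```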
| Field | Type | Purpose |
|---|---|---|
| candidate_id | str | Links to impression log |
| approved | bool | Binary label |
| brand_fit, context_fit, originality, clarity | float | Optional rubric scores |
Current status: 1224 labeled examples since 2025-12-20 (512 approved, 712 rejected) — sufficient to train an initial $\hat{C}$.
| Loop | What Learns | Training Data | Retrain Trigger | Purpose |
|---|---|---|---|---|
| Loop 1 | Safety classifier | Known-safe vs unsafe articles | One-time (stable) | Block unsafe articles early |
| Loop 2 | $\hat{C}$ (quality model) | Human approval labels | Weekly or +50 labels | Predict approval probability |
| Loop 3 | $\hat{P}$ (performance model) | Impression logs (IPS weighted) | Daily/weekly | Predict CTR/CVR |
| Loop 4 | Reranker weights | Offline eval (IPS on dev set) | Monthly | Tune $w_p, w_c, w_o, w_d$ |
| Loop 5 | DSPy program $\pi_\theta$ | Approved examples + metric | Monthly or +100 approved | Optimize prompts/demos |
All models improve as data grows. This is the core self-learning mechanism.
1. Grounding
- Anchors provide safe baselines (not hallucinating from scratch)
- Article embeddings ground generation in context
- Delta edits keep creative "controllable"
2. Learned Models (Not Zero-Shot Prompting)
- $\hat{C}$: trained on YOUR approval data (1224 examples), not generic prompts
- $\hat{P}$: trained on YOUR click data, not guessing
- DSPy: optimizes prompts on YOUR metric, not hand-tuning
3. Unbiased Learning (Not Naive A/B)
- IPS corrects for exploration bias
- Can evaluate new policies offline before deployment
- Propensity logging enables counterfactual reasoning
Mathematical guarantee: the IPS estimator is unbiased if the logged propensities are correct and the logging policy has full support, i.e. $\pi_0(a \mid x) > 0$ for every action the new policy would take.
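For reference, the standard one-step argument (per context $x$, with $r(x,a)$ the expected reward):

$$\mathbb{E}_{a \sim \pi_0(\cdot\mid x)}\!\left[\frac{\mathbb{1}[\pi(x)=a]}{\pi_0(a\mid x)}\, r(x,a)\right] = \sum_{a}\pi_0(a\mid x)\,\frac{\mathbb{1}[\pi(x)=a]}{\pi_0(a\mid x)}\, r(x,a) = r(x,\pi(x)),$$

so averaging over logged contexts recovers $V(\pi)$, provided $\pi_0(\pi(x)\mid x) > 0$ for all $x$.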
- ✅ Infrastructure: Postgres + pgvector, impression logging, Slack annotation workflow
- ✅ Data: 1224 labeled examples (type=2 approved, type=-1 rejected)
- ✅ Embeddings: 256d OpenAI embeddings for articles + ads
- ✅ Baseline: Anchor similarity search + Bernoulli sampling (P1, P2)
- Phase 1 (Week 1-2): Train $\hat{C}$ on existing labels, deploy as safety gate + scorer
- Phase 2 (Week 3-4): Train $\hat{P}$ on impression logs, add to reranker
- Phase 3 (Week 5-6): Implement DSPy pipeline, compile on approved examples
- Phase 4 (Month 2+): Add propensity logging, IPS evaluation, weekly retraining
- Rubric definition: What specific aspects should human reviewers score? (brand-fit, originality, clarity, safety)
- Exploration rate: Start with $\epsilon=0.1$ (10% explore) or more conservative?
- Counterfactual logging: Log all N candidates + scores (enables richer learning, costs storage)?
- Model serving: Fine-tune on OpenAI (fast, managed) or self-host (control, cost)?
Prompt/Program Optimization:
- DSPy (Stanford NLP): Compiling declarative language model calls into pipelines. arXiv:2310.03714
- OPRO (Google DeepMind): LLMs as optimizers. arXiv:2309.03409
Preference Learning:
- DPO (Stanford): Direct Preference Optimization. arXiv:2305.18290
Off-Policy Evaluation:
- Doubly Robust OPE (Dudik et al.): Unbiased evaluation from logged bandit feedback. arXiv:1103.4601
Agentic Workflows:
- ReAct (Google): Reason + Act agents with self-correction. arXiv:2210.03629
Ad Optimization:
- AdBooster/GenCO (Meta, 2025): Generative creative optimization with multi-instance reward. arXiv:2508.09730
Scenario: a new reranker policy (after training $\hat{C}$ and $\hat{P}$), evaluated offline on five logged baseline impressions.

| User | Baseline Showed | Baseline Propensity | Click? | New Policy Wants | Match? | IPS Weight | Weighted Reward |
|---|---|---|---|---|---|---|---|
| 1 | anchor (default) | 0.9 | 0 | contextual | ✗ | - | - |
| 2 | anchor (default) | 0.9 | 1 | anchor | ✓ | 1/0.9 = 1.11 | 1.11 |
| 3 | explore variant | 0.1 | 1 | explore variant | ✓ | 1/0.1 = 10.0 | 10.0 |
| 4 | anchor (default) | 0.9 | 0 | contextual | ✗ | - | - |
| 5 | anchor (default) | 0.9 | 0 | anchor | ✓ | 1.11 | 0 |
Interpretation: User 3's click gets 10× weight because baseline rarely explored that variant. IPS correctly attributes value to actions the new policy would take.
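Plugging the five rows above into the IPS estimator sketched earlier (no new data, just the arithmetic):

```python
# (baseline propensity, click, new policy matches served action) per row of the table
logged = [(0.9, 0, False), (0.9, 1, True), (0.1, 1, True), (0.9, 0, False), (0.9, 0, True)]

v_hat = sum(click / p if match else 0.0 for p, click, match in logged) / len(logged)
print(round(v_hat, 3))  # 2.222 -- a high-variance estimate at n=5; it tightens as logs accumulate
```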
End of Document