Self-Learning Contextual Ads Research Pitch

Self-Learning Contextual Ads: Multi-Objective Optimization via Logged Bandit Feedback

Albert Bentov • 2026-01-18
Goal: Present a rigorous, data-driven framework for continuous ad creative optimization


1. Problem Statement

We optimize ad creative selection to jointly maximize predicted performance and creative quality under hard safety constraints, continuously improving from human feedback and logged outcome metrics.

Objective function:

$$ \text{Score}(x, a) = w_p \cdot \hat{P}(x,a) + w_c \cdot \hat{C}(x,a) + w_o \cdot \text{Novelty}(a) - w_d \cdot \text{Dup}(a) $$

Subject to: $\text{Safety}(x, a) \leq \tau$ (hard constraint, must pass)

Where:

  • $x$ = context (brand brief, article content, user features)
  • $a$ = candidate creative (tagline, image, layout)
  • $\hat{P}(x,a)$ = predicted performance (CTR/CVR) — learned from logs
  • $\hat{C}(x,a)$ = predicted creative quality (approval probability) — learned from human feedback
  • $\text{Novelty}(a)$ = min cosine distance to served history — computed
  • $\text{Dup}(a)$ = binary duplicate indicator — computed
  • $w_p, w_c, w_o, w_d$ = tunable weights
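
A minimal sketch of how this score could be computed in Python, assuming $\hat{P}$ and $\hat{C}$ have already been evaluated for the candidate and creatives are compared via embeddings; the function and weight values are illustrative, not the production reranker:

```python
import numpy as np

def reranker_score(p_hat: float, c_hat: float, cand_emb: np.ndarray,
                   served_embs: np.ndarray, is_dup: bool,
                   w_p: float = 1.0, w_c: float = 1.0,
                   w_o: float = 0.2, w_d: float = 1.0) -> float:
    """Score(x, a) = w_p*P_hat + w_c*C_hat + w_o*Novelty - w_d*Dup."""
    if served_embs.size == 0:
        novelty = 1.0  # nothing served yet: treat the candidate as maximally novel
    else:
        # cosine similarity to every previously served creative
        sims = served_embs @ cand_emb / (
            np.linalg.norm(served_embs, axis=1) * np.linalg.norm(cand_emb) + 1e-9)
        novelty = float(np.min(1.0 - sims))  # min cosine distance to served history
    return w_p * p_hat + w_c * c_hat + w_o * novelty - w_d * float(is_dup)
```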

2. System Architecture

Request (campaign, article_url)
    ↓
[1] Return default ad (instant)
    ↓ (async)
[2] Generate N candidates: π_θ(a|x)           ← DSPy (learned prompt/program)
    ↓
[3] Safety gate: filter unsafe                ← Ĉ threshold (learned)
    ↓
[4] Score candidates: Score(x,a)              ← Ĉ + P̂ + novelty (learned + computed)
    ↓
[5] Explore/exploit: ε-greedy selection       ← Log propensity π(a|x)
    ↓
[6] Serve winner + log (propensity, features, outcome)
    ↓
[7] Learning loops: retrain Ĉ, P̂, π_θ weekly/daily

Key insight: This mirrors how code assistants work—generate candidates, run tests (safety gates), pick best, learn from feedback.
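
As a concrete illustration of steps [2]–[6], here is a hedged Python sketch of the async path; `build_context`, `generate_candidates`, `is_safe`, `score`, `epsilon_greedy_select`, `log_impression`, and `default_anchor` are hypothetical placeholders for the components described in the following sections:

```python
def serve_contextual_ad(campaign, article_url, n_candidates=8, epsilon=0.1):
    """Async path: generate -> safety-filter -> score -> explore/exploit -> log."""
    x = build_context(campaign, article_url)                  # brand brief + article features
    candidates = generate_candidates(x, n=n_candidates)       # [2] pi_theta(a|x), DSPy program
    safe = [a for a in candidates if is_safe(x, a)]           # [3] hard safety gate
    if not safe:
        return default_anchor(campaign)                       # fall back to the default ad
    scores = [score(x, a) for a in safe]                      # [4] performance + quality + novelty
    idx, propensity = epsilon_greedy_select(scores, epsilon)  # [5] explore/exploit
    winner = safe[idx]
    log_impression(x, winner, propensity, policy_version="vK")  # [6] enables IPS retraining
    return winner
```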


3. Mathematical Formulation

3.1 Three Learned Components

| Component | Purpose | Input | Output | Training Data | Method | Artifact |
|---|---|---|---|---|---|---|
| $\hat{C}(x,a)$ | Predict human approval | brand, article, creative | $P(\text{approved}) \in [0,1]$ | 1224 labels (512 approved, 712 rejected) | Fine-tuned LLM (OpenAI) | quality_model_vK.bin |
| $\hat{P}(x,a)$ | Predict performance | article, creative | $P(\text{click}) \in [0,1]$ | Impression logs (K clicks / N imps) | Weighted logistic regression | perf_model_vK.bin |
| $\pi_\theta(a \mid x)$ | Generate candidates | brand, article, anchor | N creative variants | 512 approved examples | DSPy compile (prompt optimization) | dspy_program_vK.json |

3.2 Anchor-Delta Generation Strategy

Instead of generating creatives from scratch (prone to hallucination), we use delta editing:

$$ a = \text{anchor} + \text{small context-aware edit} $$

Anchors are human-curated, on-brand baseline ads. Deltas preserve meaning/CTA while incorporating safe article cues. This keeps creative "controllable."

DSPy Pipeline:

  1. Strategist: $(\text{brand}, \text{article}) \rightarrow (\text{angle}, \text{avoid-list})$ — runs once
  2. DeltaRewrite: $(\text{anchor}, \text{angle}) \rightarrow a_i$ — runs N times
  3. Critic (optional): $a_i \rightarrow a_i'$ — fixes/improves

DSPy's compile() optimizes instructions and few-shot examples offline to maximize the approval rate on a validation set.
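
A minimal DSPy sketch of the DeltaRewrite step and its offline compilation, assuming DSPy 2.x; `approval_metric`, `c_hat_approval`, and `approved_trainset` are hypothetical placeholders (the metric would call the trained $\hat{C}$ model), and an LM must be configured per your DSPy version:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# dspy.configure(lm=...)  # configure an OpenAI (or other) LM per your DSPy version

class DeltaRewrite(dspy.Signature):
    """Apply a small, on-brand, context-aware edit to an anchor tagline."""
    anchor = dspy.InputField(desc="human-curated baseline tagline")
    angle = dspy.InputField(desc="article-derived angle from the Strategist step")
    avoid = dspy.InputField(desc="phrases or claims to avoid")
    variant = dspy.OutputField(desc="edited tagline preserving meaning and CTA")

rewrite = dspy.Predict(DeltaRewrite)

def approval_metric(example, pred, trace=None):
    # Hypothetical: score the produced variant with the trained quality model C_hat
    return c_hat_approval(example.anchor, pred.variant) > 0.5

optimizer = BootstrapFewShot(metric=approval_metric)
compiled = optimizer.compile(rewrite, trainset=approved_trainset)
compiled.save("dspy_program_vK.json")   # the artifact referenced in Section 3.1
```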

3.3 Inverse Propensity Scoring (IPS)

When retraining models, we need unbiased estimates from logged data. Without propensities, naive CTR comparisons are biased: the old policy systematically served the ads it already scored highly, so those ads dominate the logs.

Logged data: For each impression $i$, we observe $(x_i, a_i, r_i, \pi_{\text{log}}(a_i|x_i))$

IPS estimator for new policy $\pi_{\text{new}}$:

$$ \hat{V}_{\text{IPS}}(\pi_{\text{new}}) = \frac{1}{N} \sum_{i=1}^{N} \frac{r_i \cdot \mathbb{1}[a_i = \pi_{\text{new}}(x_i)]}{\pi_{\text{log}}(a_i|x_i)} $$

Why it works: the estimator upweights rare logged actions that $\pi_{\text{new}}$ would take, which allows offline evaluation of a new policy before deploying it.

Implementation: $\epsilon$-greedy exploration ($\epsilon=0.1$):

  • 90%: serve argmax (exploit)
  • 10%: sample from top-K (explore)

$$ \pi(a|x) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{K} & \text{if } a = \arg\max_{a'} \text{Score}(x,a') \\ \frac{\epsilon}{K} & \text{if } a \in \text{top-}K \text{ and } a \neq \arg\max_{a'} \text{Score}(x,a') \\ 0 & \text{otherwise} \end{cases} $$
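
The selection rule and the IPS estimator are small enough to sketch directly; this is a minimal illustration (not the production implementation), with NumPy as the only dependency:

```python
import numpy as np

def epsilon_greedy_select(scores, epsilon=0.1, k=5, rng=None):
    """Pick a candidate index and return its logged propensity pi(a|x)."""
    rng = rng or np.random.default_rng()
    order = np.argsort(scores)[::-1]
    top_k, best = order[:k], order[0]
    chosen = rng.choice(top_k) if rng.random() < epsilon else best  # explore vs exploit
    propensity = (1 - epsilon + epsilon / k) if chosen == best else epsilon / k
    return int(chosen), float(propensity)

def ips_value(rewards, propensities, matches):
    """IPS estimate of a new policy's value from logged (r_i, pi_log, match) tuples."""
    r, p, m = map(np.asarray, (rewards, propensities, matches))
    return float(np.mean(m * r / p))
```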


4. Data Schema & Logging

4.1 Impression Log (required for learning)

| Field | Type | Purpose |
|---|---|---|
| request_id | UUID | Unique request identifier |
| timestamp | datetime | When served |
| campaign_id, brand_id | str | Campaign context |
| article_url, article_embedding | str, vector(256) | Article context |
| anchor_id, candidate_id | str | Which anchor → which variant |
| creative_text | str | Actual served tagline |
| propensity | float | $\pi(a \mid x)$ — critical for IPS |
| policy_version | str | Generator + reranker versions |
| impression, viewable, click, conversion | bool | Outcome metrics |
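
For concreteness, a hedged Python sketch of the same record as a dataclass (types are illustrative; in Postgres the embedding is a pgvector column):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ImpressionLog:
    # One row per served creative; propensity and policy_version are what make
    # later IPS / counterfactual evaluation possible.
    request_id: str
    timestamp: datetime
    campaign_id: str
    brand_id: str
    article_url: str
    article_embedding: list[float]   # 256-d article embedding (vector(256) in Postgres)
    anchor_id: str
    candidate_id: str
    creative_text: str
    propensity: float                # pi(a|x) under the logging policy
    policy_version: str              # generator + reranker versions
    impression: bool = True
    viewable: bool = False
    click: bool = False
    conversion: bool = False
```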

4.2 Human Feedback (required for $\hat{C}$)

| Field | Type | Purpose |
|---|---|---|
| candidate_id | str | Links to impression log |
| approved | bool | Binary label |
| brand_fit, context_fit, originality, clarity | float $\in [0,1]$ | Optional rubric scores |

Current status: 1224 labeled examples since 2025-12-20 (512 approved, 712 rejected) — sufficient to train an initial $\hat{C}$ model.


5. Five Learning Loops

| Loop | What Learns | Training Data | Retrain Trigger | Purpose |
|---|---|---|---|---|
| Loop 1 | Safety classifier | Known-safe vs unsafe articles | One-time (stable) | Block unsafe articles early |
| Loop 2 | $\hat{C}$ (quality) | Human approval labels | Weekly or +50 labels | Predict $P(\text{approved})$ |
| Loop 3 | $\hat{P}$ (performance) | Impression logs (IPS weighted) | Daily/weekly | Predict $P(\text{click})$ |
| Loop 4 | Reranker weights | Offline eval (IPS on dev set) | Monthly | Tune $w_p, w_c, w_o, w_d$ |
| Loop 5 | DSPy $\pi_\theta$ | Approved examples + metric | Monthly or +100 approved | Optimize prompts/demos |

All models improve as data grows. This is the core self-learning mechanism.
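
A small sketch of how the retrain triggers for Loops 2 and 5 could be checked; the thresholds come from the table above, while the function names and cron-style usage are assumptions:

```python
from datetime import datetime, timedelta

def should_retrain_quality(new_labels: int, last_trained: datetime,
                           label_threshold: int = 50,
                           max_age: timedelta = timedelta(days=7)) -> bool:
    """Loop 2: retrain C_hat weekly or after +50 new human labels."""
    return new_labels >= label_threshold or datetime.utcnow() - last_trained >= max_age

def should_recompile_generator(new_approved: int, last_compiled: datetime,
                               approved_threshold: int = 100,
                               max_age: timedelta = timedelta(days=30)) -> bool:
    """Loop 5: recompile the DSPy program monthly or after +100 approved examples."""
    return new_approved >= approved_threshold or datetime.utcnow() - last_compiled >= max_age
```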


6. Why This Works (vs. Prompt Engineering Hype)

1. Grounding

  • Anchors provide safe baselines (not hallucinating from scratch)
  • Article embeddings ground generation in context
  • Delta edits keep creative "controllable"

2. Learned Models (Not Zero-Shot Prompting)

  • $\hat{C}$: Trained on YOUR approval data (1224 examples), not generic prompts
  • $\hat{P}$: Trained on YOUR click data, not guessing
  • DSPy: Optimizes prompts on YOUR metric, not hand-tuning

3. Unbiased Learning (Not Naive A/B)

  • IPS corrects for exploration bias
  • Can evaluate new policies offline before deployment
  • Propensity logging enables counterfactual reasoning

Mathematical guarantee: the IPS estimator is unbiased if $\pi_{\text{log}}(a|x) > 0$ for all $(x,a)$ that $\pi_{\text{new}}$ would choose (i.e., sufficient exploration).
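
As a quick check (assuming the positivity condition above, a deterministic $\pi_{\text{new}}$, and writing $r(x,a)$ for the expected reward), the importance weight exactly cancels the logging probability for each context $x$:

$$ \mathbb{E}_{a \sim \pi_{\text{log}}(\cdot|x)}\left[\frac{r(x,a) \cdot \mathbb{1}[a = \pi_{\text{new}}(x)]}{\pi_{\text{log}}(a|x)}\right] = \sum_{a} \pi_{\text{log}}(a|x) \cdot \frac{r(x,a) \cdot \mathbb{1}[a = \pi_{\text{new}}(x)]}{\pi_{\text{log}}(a|x)} = r(x, \pi_{\text{new}}(x)) $$

Averaging over contexts then recovers the value of $\pi_{\text{new}}$, which is why propensity logging matters.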


7. Current Status & Next Steps

What We Have (End of 2025)

  • ✅ Infrastructure: Postgres + pgvector, impression logging, Slack annotation workflow
  • ✅ Data: 1224 labeled examples (type=2 approved, type=-1 rejected)
  • ✅ Embeddings: 256d OpenAI embeddings for articles + ads
  • ✅ Baseline: Anchor similarity search + Bernoulli sampling (P1, P2)

What We Need (Q1 2026)

  • Phase 1 (Week 1-2): Train $\hat{C}$ on existing labels, deploy as safety gate + scorer
  • Phase 2 (Week 3-4): Train $\hat{P}$ on impression logs, add to reranker
  • Phase 3 (Week 5-6): Implement DSPy pipeline, compile on approved examples
  • Phase 4 (Month 2+): Add propensity logging, IPS evaluation, weekly retraining

Open Questions for Discussion

  1. Rubric definition: What specific aspects should human reviewers score? (brand-fit, originality, clarity, safety)
  2. Exploration rate: Start with $\epsilon=0.1$ (10% explore) or more conservative?
  3. Counterfactual logging: Log all N candidates + scores (enables richer learning, costs storage)?
  4. Model serving: Fine-tune on OpenAI (fast, managed) or self-host (control, cost)?

8. Key References

Prompt/Program Optimization:

Preference Learning:

Off-Policy Evaluation:

  • Doubly Robust OPE (Dudik et al.): Unbiased evaluation from logged bandit feedback. arXiv:1103.4601

Agentic Workflows:

Ad Optimization:

  • AdBooster/GenCO (Meta, 2025): Generative creative optimization with multi-instance reward. arXiv:2508.09730

Appendix: Quantitative Example

Scenario: New reranker policy (after training $\hat{C}$ + $\hat{P}$) vs baseline.

| User | Baseline Showed | Baseline $\pi$ | Click? | New Policy Wants | Match? | IPS Weight | Weighted Reward |
|---|---|---|---|---|---|---|---|
| 1 | anchor (default) | 0.9 | 0 | contextual | No | - | - |
| 2 | anchor (default) | 0.9 | 1 | anchor | Yes | 1/0.9 = 1.11 | 1.11 |
| 3 | explore variant | 0.1 | 1 | explore variant | Yes | 1/0.1 = 10.0 | 10.0 |
| 4 | anchor (default) | 0.9 | 0 | contextual | No | - | - |
| 5 | anchor (default) | 0.9 | 0 | anchor | Yes | 1.11 | 0 |

$$ \hat{V}_{\text{IPS}} = \frac{1}{5}(1.11 + 10.0 + 0) = 2.22 $$

Interpretation: User 3's click gets 10× weight because baseline rarely explored that variant. IPS correctly attributes value to actions the new policy would take.
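
The same estimate can be reproduced in a few lines of Python (values copied from the table above):

```python
# (user, reward, logged propensity, does the new policy pick the same action?)
logged = [
    (1, 0, 0.9, False),   # baseline showed anchor; new policy wants contextual
    (2, 1, 0.9, True),
    (3, 1, 0.1, True),    # rarely-explored variant -> 10x importance weight
    (4, 0, 0.9, False),
    (5, 0, 0.9, True),
]
v_ips = sum(r / p for _, r, p, match in logged if match) / len(logged)
print(f"IPS value estimate: {v_ips:.2f}")   # -> 2.22
```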


End of Document
