Self-Learning Contextual Ads Research Pitch

Self-Learning Contextual Ads: Multi-Objective Optimization via Logged Bandit Feedback

Albert Bentov • 2026-01-18
Goal: Present a rigorous, data-driven framework for continuous ad creative optimization


1. Problem Statement

We optimize ad creative selection to jointly maximize predicted performance and creative quality under hard safety constraints, continuously improving from human feedback and logged outcome metrics.

Objective function:

$$ \text{Score}(x, a) = w_p \cdot \hat{P}(x,a) + w_c \cdot \hat{C}(x,a) + w_o \cdot \text{Novelty}(a) - w_d \cdot \text{Dup}(a) $$

Subject to: $\text{Safety}(x, a) \leq \tau$ (hard constraint, must pass)

Where:

  • $x$ = context (brand brief, article content, user features)
  • $a$ = candidate creative (tagline, image, layout)
  • $\hat{P}(x,a)$ = predicted performance (CTR/CVR) — learned from logs
  • $\hat{C}(x,a)$ = predicted creative quality (approval probability) — learned from human feedback
  • $\text{Novelty}(a)$ = min cosine distance to served history — computed
  • $\text{Dup}(a)$ = binary duplicate indicator — computed
  • $w_p, w_c, w_o, w_d$ = tunable weights
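
A minimal sketch of how this score could be computed in Python, assuming $\hat{P}$ and $\hat{C}$ have already been evaluated for the candidate and creatives are compared via embeddings; the function and weight values are illustrative, not the production reranker:

```python
import numpy as np

def reranker_score(p_hat: float, c_hat: float, cand_emb: np.ndarray,
                   served_embs: np.ndarray, is_dup: bool,
                   w_p: float = 1.0, w_c: float = 1.0,
                   w_o: float = 0.2, w_d: float = 1.0) -> float:
    """Score(x, a) = w_p*P_hat + w_c*C_hat + w_o*Novelty - w_d*Dup."""
    if served_embs.size == 0:
        novelty = 1.0  # nothing served yet: treat the candidate as maximally novel
    else:
        # cosine similarity to every previously served creative
        sims = served_embs @ cand_emb / (
            np.linalg.norm(served_embs, axis=1) * np.linalg.norm(cand_emb) + 1e-9)
        novelty = float(np.min(1.0 - sims))  # min cosine distance to served history
    return w_p * p_hat + w_c * c_hat + w_o * novelty - w_d * float(is_dup)
```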

2. System Architecture

Request (campaign, article_url)
    ↓
[1] Return default ad (instant)
    ↓ (async)
[2] Generate N candidates: π_θ(a|x)           ← DSPy (learned prompt/program)
    ↓
[3] Safety gate: filter unsafe                ← Ĉ threshold (learned)
    ↓
[4] Score candidates: Score(x,a)              ← Ĉ + P̂ + novelty (learned + computed)
    ↓
[5] Explore/exploit: ε-greedy selection       ← Log propensity π(a|x)
    ↓
[6] Serve winner + log (propensity, features, outcome)
    ↓
[7] Learning loops: retrain Ĉ, P̂, π_θ weekly/daily

Key insight: This mirrors how code assistants work—generate candidates, run tests (safety gates), pick best, learn from feedback.
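
As a concrete illustration of steps [2]–[6], here is a hedged Python sketch of the async path; `build_context`, `generate_candidates`, `is_safe`, `score`, `epsilon_greedy_select`, `log_impression`, and `default_anchor` are hypothetical placeholders for the components described in the following sections:

```python
def serve_contextual_ad(campaign, article_url, n_candidates=8, epsilon=0.1):
    """Async path: generate -> safety-filter -> score -> explore/exploit -> log."""
    x = build_context(campaign, article_url)                  # brand brief + article features
    candidates = generate_candidates(x, n=n_candidates)       # [2] pi_theta(a|x), DSPy program
    safe = [a for a in candidates if is_safe(x, a)]           # [3] hard safety gate
    if not safe:
        return default_anchor(campaign)                       # fall back to the default ad
    scores = [score(x, a) for a in safe]                      # [4] performance + quality + novelty
    idx, propensity = epsilon_greedy_select(scores, epsilon)  # [5] explore/exploit
    winner = safe[idx]
    log_impression(x, winner, propensity, policy_version="vK")  # [6] enables IPS retraining
    return winner
```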


3. Mathematical Formulation

3.1 Three Learned Components

| Component | Purpose | Input | Output | Training Data | Method | Artifact |
|---|---|---|---|---|---|---|
| $\hat{C}(x,a)$ | Predict human approval | brand, article, creative | $P(\text{approved}) \in [0,1]$ | 1224 labels (512 approved, 712 rejected) | Fine-tuned LLM (OpenAI) | quality_model_vK.bin |
| $\hat{P}(x,a)$ | Predict performance | article, creative | $P(\text{click}) \in [0,1]$ | Impression logs (K clicks / N imps) | Weighted logistic regression | perf_model_vK.bin |
| $\pi_\theta(a \mid x)$ | Generate candidates | brand, article, anchor | N creative variants | 512 approved examples | DSPy compile (prompt optimization) | dspy_program_vK.json |

3.2 Anchor-Delta Generation Strategy

Instead of generating creatives from scratch (prone to hallucination), we use delta editing:

$$ a = \text{anchor} + \text{small context-aware edit} $$

Anchors are human-curated, on-brand baseline ads. Deltas preserve meaning/CTA while incorporating safe article cues. This keeps creative "controllable."

DSPy Pipeline:

  1. Strategist: $(\text{brand}, \text{article}) \rightarrow (\text{angle}, \text{avoid-list})$ — runs once
  2. DeltaRewrite: $(\text{anchor}, \text{angle}) \rightarrow a_i$ — runs N times
  3. Critic (optional): $a_i \rightarrow a_i'$ — fixes/improves

DSPy's compile() optimizes instructions and few-shot examples offline to maximize the approval rate on a validation set.
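
A minimal DSPy sketch of the DeltaRewrite step and its offline compilation, assuming DSPy 2.x; `approval_metric`, `c_hat_approval`, and `approved_trainset` are hypothetical placeholders (the metric would call the trained $\hat{C}$ model), and an LM must be configured per your DSPy version:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# dspy.configure(lm=...)  # configure an OpenAI (or other) LM per your DSPy version

class DeltaRewrite(dspy.Signature):
    """Apply a small, on-brand, context-aware edit to an anchor tagline."""
    anchor = dspy.InputField(desc="human-curated baseline tagline")
    angle = dspy.InputField(desc="article-derived angle from the Strategist step")
    avoid = dspy.InputField(desc="phrases or claims to avoid")
    variant = dspy.OutputField(desc="edited tagline preserving meaning and CTA")

rewrite = dspy.Predict(DeltaRewrite)

def approval_metric(example, pred, trace=None):
    # Hypothetical: score the produced variant with the trained quality model C_hat
    return c_hat_approval(example.anchor, pred.variant) > 0.5

optimizer = BootstrapFewShot(metric=approval_metric)
compiled = optimizer.compile(rewrite, trainset=approved_trainset)
compiled.save("dspy_program_vK.json")   # the artifact referenced in Section 3.1
```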

3.3 Inverse Propensity Scoring (IPS)

When retraining models, we need unbiased estimates from logged data. Without propensities, naive CTR comparisons are biased: the old policy systematically served the ads it already scored highly, so those ads dominate the logs.

Logged data: For each impression $i$, we observe $(x_i, a_i, r_i, \pi_{\text{log}}(a_i|x_i))$

IPS estimator for new policy $\pi_{\text{new}}$:

$$ \hat{V}_{\text{IPS}}(\pi_{\text{new}}) = \frac{1}{N} \sum_{i=1}^{N} \frac{r_i \cdot \mathbb{1}[a_i = \pi_{\text{new}}(x_i)]}{\pi_{\text{log}}(a_i|x_i)} $$

Why it works: the estimator upweights rare logged actions that $\pi_{\text{new}}$ would take, which allows offline evaluation of a new policy before deploying it.

Implementation: $\epsilon$-greedy exploration ($\epsilon=0.1$):

  • 90%: serve argmax (exploit)
  • 10%: sample from top-K (explore)

$$ \pi(a|x) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{K} & \text{if } a = \arg\max_{a'} \text{Score}(x,a') \\ \frac{\epsilon}{K} & \text{if } a \in \text{top-}K \text{ and } a \neq \arg\max_{a'} \text{Score}(x,a') \\ 0 & \text{otherwise} \end{cases} $$
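
The selection rule and the IPS estimator are small enough to sketch directly; this is a minimal illustration (not the production implementation), with NumPy as the only dependency:

```python
import numpy as np

def epsilon_greedy_select(scores, epsilon=0.1, k=5, rng=None):
    """Pick a candidate index and return its logged propensity pi(a|x)."""
    rng = rng or np.random.default_rng()
    order = np.argsort(scores)[::-1]
    top_k, best = order[:k], order[0]
    chosen = rng.choice(top_k) if rng.random() < epsilon else best  # explore vs exploit
    propensity = (1 - epsilon + epsilon / k) if chosen == best else epsilon / k
    return int(chosen), float(propensity)

def ips_value(rewards, propensities, matches):
    """IPS estimate of a new policy's value from logged (r_i, pi_log, match) tuples."""
    r, p, m = map(np.asarray, (rewards, propensities, matches))
    return float(np.mean(m * r / p))
```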


4. Data Schema & Logging

4.1 Impression Log (required for learning)

| Field | Type | Purpose |
|---|---|---|
| request_id | UUID | Unique request identifier |
| timestamp | datetime | When served |
| campaign_id, brand_id | str | Campaign context |
| article_url, article_embedding | str, vector(256) | Article context |
| anchor_id, candidate_id | str | Which anchor → which variant |
| creative_text | str | Actual served tagline |
| propensity | float | $\pi(a \mid x)$ — critical for IPS |
| policy_version | str | Generator + reranker versions |
| impression, viewable, click, conversion | bool | Outcome metrics |
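
For concreteness, a hedged Python sketch of the same record as a dataclass (types are illustrative; in Postgres the embedding is a pgvector column):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ImpressionLog:
    # One row per served creative; propensity and policy_version are what make
    # later IPS / counterfactual evaluation possible.
    request_id: str
    timestamp: datetime
    campaign_id: str
    brand_id: str
    article_url: str
    article_embedding: list[float]   # 256-d article embedding (vector(256) in Postgres)
    anchor_id: str
    candidate_id: str
    creative_text: str
    propensity: float                # pi(a|x) under the logging policy
    policy_version: str              # generator + reranker versions
    impression: bool = True
    viewable: bool = False
    click: bool = False
    conversion: bool = False
```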

4.2 Human Feedback (required for $\hat{C}$)

| Field | Type | Purpose |
|---|---|---|
| candidate_id | str | Links to impression log |
| approved | bool | Binary label |
| brand_fit, context_fit, originality, clarity | float $\in [0,1]$ | Optional rubric scores |

Current status: 1224 labeled examples since 2025-12-20 (512 approved, 712 rejected) — sufficient to train an initial $\hat{C}$ model.


5. Five Learning Loops

| Loop | What Learns | Training Data | Retrain Trigger | Purpose |
|---|---|---|---|---|
| Loop 1 | Safety classifier | Known-safe vs unsafe articles | One-time (stable) | Block unsafe articles early |
| Loop 2 | $\hat{C}$ (quality) | Human approval labels | Weekly or +50 labels | Predict $P(\text{approved})$ |
| Loop 3 | $\hat{P}$ (performance) | Impression logs (IPS weighted) | Daily/weekly | Predict $P(\text{click})$ |
| Loop 4 | Reranker weights | Offline eval (IPS on dev set) | Monthly | Tune $w_p, w_c, w_o, w_d$ |
| Loop 5 | DSPy $\pi_\theta$ | Approved examples + metric | Monthly or +100 approved | Optimize prompts/demos |

All models improve as data grows. This is the core self-learning mechanism.
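
A small sketch of how the retrain triggers for Loops 2 and 5 could be checked; the thresholds come from the table above, while the function names and cron-style usage are assumptions:

```python
from datetime import datetime, timedelta

def should_retrain_quality(new_labels: int, last_trained: datetime,
                           label_threshold: int = 50,
                           max_age: timedelta = timedelta(days=7)) -> bool:
    """Loop 2: retrain C_hat weekly or after +50 new human labels."""
    return new_labels >= label_threshold or datetime.utcnow() - last_trained >= max_age

def should_recompile_generator(new_approved: int, last_compiled: datetime,
                               approved_threshold: int = 100,
                               max_age: timedelta = timedelta(days=30)) -> bool:
    """Loop 5: recompile the DSPy program monthly or after +100 approved examples."""
    return new_approved >= approved_threshold or datetime.utcnow() - last_compiled >= max_age
```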


6. Why This Works (vs. Prompt Engineering Hype)

1. Grounding

  • Anchors provide safe baselines (not hallucinating from scratch)
  • Article embeddings ground generation in context
  • Delta edits keep creative "controllable"

2. Learned Models (Not Zero-Shot Prompting)

  • $\hat{C}$: Trained on YOUR approval data (1224 examples), not generic prompts
  • $\hat{P}$: Trained on YOUR click data, not guessing
  • DSPy: Optimizes prompts on YOUR metric, not hand-tuning

3. Unbiased Learning (Not Naive A/B)

  • IPS corrects for exploration bias
  • Can evaluate new policies offline before deployment
  • Propensity logging enables counterfactual reasoning

Mathematical guarantee: the IPS estimator is unbiased if $\pi_{\text{log}}(a|x) > 0$ for all $(x,a)$ that $\pi_{\text{new}}$ would choose (i.e., sufficient exploration).
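
As a quick check (assuming the positivity condition above, a deterministic $\pi_{\text{new}}$, and writing $r(x,a)$ for the expected reward), the importance weight exactly cancels the logging probability for each context $x$:

$$ \mathbb{E}_{a \sim \pi_{\text{log}}(\cdot|x)}\left[\frac{r(x,a) \cdot \mathbb{1}[a = \pi_{\text{new}}(x)]}{\pi_{\text{log}}(a|x)}\right] = \sum_{a} \pi_{\text{log}}(a|x) \cdot \frac{r(x,a) \cdot \mathbb{1}[a = \pi_{\text{new}}(x)]}{\pi_{\text{log}}(a|x)} = r(x, \pi_{\text{new}}(x)) $$

Averaging over contexts then recovers the value of $\pi_{\text{new}}$, which is why propensity logging matters.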


7. Current Status & Next Steps

What We Have (End of 2025)

  • ✅ Infrastructure: Postgres + pgvector, impression logging, Slack annotation workflow
  • ✅ Data: 1224 labeled examples (type=2 approved, type=-1 rejected)
  • ✅ Embeddings: 256d OpenAI embeddings for articles + ads
  • ✅ Baseline: Anchor similarity search + Bernoulli sampling (P1, P2)

What We Need (Q1 2026)

  • Phase 1 (Week 1-2): Train $\hat{C}$ on existing labels, deploy as safety gate + scorer
  • Phase 2 (Week 3-4): Train $\hat{P}$ on impression logs, add to reranker
  • Phase 3 (Week 5-6): Implement DSPy pipeline, compile on approved examples
  • Phase 4 (Month 2+): Add propensity logging, IPS evaluation, weekly retraining

Open Questions for Discussion

  1. Rubric definition: What specific aspects should human reviewers score? (brand-fit, originality, clarity, safety)
  2. Exploration rate: Start with $\epsilon=0.1$ (10% explore) or more conservative?
  3. Counterfactual logging: Log all N candidates + scores (enables richer learning, costs storage)?
  4. Model serving: Fine-tune on OpenAI (fast, managed) or self-host (control, cost)?

8. Key References

Prompt/Program Optimization:

Preference Learning:

Off-Policy Evaluation:

  • Doubly Robust OPE (Dudik et al.): Unbiased evaluation from logged bandit feedback. arXiv:1103.4601

Agentic Workflows:

Ad Optimization:

  • AdBooster/GenCO (Meta, 2025): Generative creative optimization with multi-instance reward. arXiv:2508.09730

Appendix: Quantitative Example

Scenario: New reranker policy (after training $\hat{C}$ + $\hat{P}$) vs baseline.

| User | Baseline Showed | Baseline $\pi$ | Click? | New Policy Wants | Match? | IPS Weight | Weighted Reward |
|---|---|---|---|---|---|---|---|
| 1 | anchor (default) | 0.9 | 0 | contextual | No | - | - |
| 2 | anchor (default) | 0.9 | 1 | anchor | Yes | 1/0.9 = 1.11 | 1.11 |
| 3 | explore variant | 0.1 | 1 | explore variant | Yes | 1/0.1 = 10.0 | 10.0 |
| 4 | anchor (default) | 0.9 | 0 | contextual | No | - | - |
| 5 | anchor (default) | 0.9 | 0 | anchor | Yes | 1.11 | 0 |

$$ \hat{V}_{\text{IPS}} = \frac{1}{5}(1.11 + 10.0 + 0) = 2.22 $$

Interpretation: User 3's click gets 10× weight because baseline rarely explored that variant. IPS correctly attributes value to actions the new policy would take.
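
The same estimate can be reproduced in a few lines of Python (values copied from the table above):

```python
# (user, reward, logged propensity, does the new policy pick the same action?)
logged = [
    (1, 0, 0.9, False),   # baseline showed anchor; new policy wants contextual
    (2, 1, 0.9, True),
    (3, 1, 0.1, True),    # rarely-explored variant -> 10x importance weight
    (4, 0, 0.9, False),
    (5, 0, 0.9, True),
]
v_ips = sum(r / p for _, r, p, match in logged if match) / len(logged)
print(f"IPS value estimate: {v_ips:.2f}")   # -> 2.22
```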


End of Document
