Date: 2026-02-26
Pipeline: Two-shot (battery + discovery + judge) — 3 LLM calls per entity
Branch: feat/two-shot
Each entity gets 3 LLM calls:
- Battery — system prompt (~800 tokens) + entity JSON (~500-2000 tokens) → response (~500-1500 tokens)
- Discovery — system prompt (~1000 tokens, includes existing pairs) + entity JSON → response (~500-1500 tokens)
- Judge — system prompt (~400 tokens) + all pairs + source data → response (~200-500 tokens)
Token estimates are based on our 2-entity-per-domain validation runs (83 total pairs across 10 entities). Actual costs will vary with entity data richness — allocations with long abstracts cost more per entity than affinity groups with thin data.
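The per-entity flow described above can be sketched as follows. This is an illustrative skeleton, not the actual pipeline code: the function names, prompt constants, and the stubbed `call_llm` responses are all assumptions.

```python
import json

# Illustrative prompt constants; the real system prompts are roughly
# ~800 (battery), ~1000 (discovery), and ~400 (judge) tokens.
BATTERY_PROMPT = "Answer the fixed question battery for this entity."
DISCOVERY_PROMPT = "Propose additional pairs not already covered."
JUDGE_PROMPT = "Score each pair for citation validity and confidence."

def call_llm(system_prompt: str, user_payload: str) -> str:
    """Stub standing in for the real LLM client; returns canned JSON."""
    return json.dumps({"pairs": [], "scores": []})

def process_entity(entity: dict, existing_pairs: list) -> dict:
    payload = json.dumps(entity)  # entity JSON, ~500-2000 tokens in practice

    # Call 1: battery extraction over a fixed question set.
    battery = json.loads(call_llm(BATTERY_PROMPT, payload))

    # Call 2: open-ended discovery; the prompt includes existing pairs
    # so the model does not duplicate them.
    discovery_prompt = DISCOVERY_PROMPT + "\n" + json.dumps(existing_pairs)
    discovery = json.loads(call_llm(discovery_prompt, payload))

    pairs = battery["pairs"] + discovery["pairs"]

    # Call 3: judge scores all pairs against the source data.
    judge_payload = json.dumps({"pairs": pairs, "source": entity})
    judge = json.loads(call_llm(JUDGE_PROMPT, judge_payload))

    return {"pairs": pairs, "judge": judge}
```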
gpt-4o-mini pricing (as of Feb 2026): $0.15/1M input tokens, $0.60/1M output tokens
| Domain | Entities | Est. Pairs | Input Tokens | Output Tokens | Est. Cost |
|---|---|---|---|---|---|
| Compute Resources | 23 | ~345 | ~230K | ~92K | ~$0.09 |
| Software Discovery | 1,404 | ~19,000 | ~14M | ~5.6M | ~$5.46 |
| Affinity Groups | 55 | ~670 | ~440K | ~165K | ~$0.17 |
| Allocations | 5,440 | ~79,000 | ~54M | ~22M | ~$21.30 |
| NSF Awards | 10,000+ | ~145,000 | ~100M | ~40M | ~$39.00 |
| Total | ~17K | ~244K | ~169M | ~68M | ~$66 |
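The cost column follows directly from the token columns and the gpt-4o-mini pricing above. A minimal check of the arithmetic:

```python
# Recompute the cost column from the token estimates at gpt-4o-mini pricing.
PRICE_IN, PRICE_OUT = 0.15, 0.60  # $ per 1M tokens

domains = {
    # domain: (input_tokens, output_tokens)
    "compute-resources":  (0.23e6, 0.092e6),
    "software-discovery": (14e6,   5.6e6),
    "affinity-groups":    (0.44e6, 0.165e6),
    "allocations":        (54e6,   22e6),
    "nsf-awards":         (100e6,  40e6),
}

def cost(tokens_in: float, tokens_out: float) -> float:
    return tokens_in / 1e6 * PRICE_IN + tokens_out / 1e6 * PRICE_OUT

total = sum(cost(i, o) for i, o in domains.values())
print(f"${total:.2f}")  # → $66.01
```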
- Pairs-per-entity estimate: Based on validation runs averaging ~8 pairs/entity for data-rich domains (compute, software, allocations, nsf-awards) and ~5 for thin domains (affinity-groups). Battery produces 4-7, discovery adds 2-5.
- Token-per-entity estimate: ~10K input tokens (across 3 calls) and ~4K output tokens. Allocations and NSF entities with long abstracts skew higher.
- This is cheaper than the earlier estimate (~$203), which incorrectly doubled the single-pass cost. The two extraction calls share the same entity JSON payload, and the judge call is small.
| Model | Input $/1M | Output $/1M | Est. Full-Run Cost | Notes |
|---|---|---|---|---|
| gpt-4o-mini | $0.15 | $0.60 | ~$66 | Default. Good quality for extraction + judge. |
| claude-haiku | $0.25 | $1.25 | ~$127 | Slightly more expensive. Good alternative. |
| gpt-4o | $2.50 | $10.00 | ~$1,100 | Overkill for most entities. |
| claude-sonnet | $3.00 | $15.00 | ~$1,500 | Overkill for most entities. |
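The same arithmetic applied at each model's listed pricing reproduces the full-run column, using the ~169M input / ~68M output totals from the domain table:

```python
TOKENS_IN, TOKENS_OUT = 169e6, 68e6  # full-run totals from the domain table

models = {
    # model: ($/1M input, $/1M output)
    "gpt-4o-mini":   (0.15, 0.60),
    "claude-haiku":  (0.25, 1.25),
    "gpt-4o":        (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
}

costs = {name: TOKENS_IN / 1e6 * p_in + TOKENS_OUT / 1e6 * p_out
         for name, (p_in, p_out) in models.items()}

for name, c in costs.items():
    print(f"{name}: ~${c:,.0f}")
```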
The judge call can run on a cheaper model than extraction, since it only scores pairs rather than generating them. With gpt-4o-mini in every role, the cost splits as:
| Role | Model | Cost Share |
|---|---|---|
| Battery + Discovery | gpt-4o-mini | ~$55 |
| Judge | gpt-4o-mini | ~$11 |
| Total | — | ~$66 |
Or pair a stronger extraction model with a cheap judge:
| Role | Model | Cost Share |
|---|---|---|
| Battery + Discovery | gpt-4o | ~$920 |
| Judge | gpt-4o-mini | ~$11 |
| Total | — | ~$931 |
With --incremental, unchanged entities are skipped entirely (hash-based change detection). Only entities whose upstream data changed get re-extracted. In practice:
- First run: Full cost (~$66 with gpt-4o-mini)
- Subsequent runs: Cost proportional to % of entities that changed. If 5% of entities change, cost is ~$3.30.
- Cache stores: entity hash + all pairs + judge scores. No LLM calls needed for cache hits.
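A minimal sketch of the hash-based change detection, assuming a canonical-JSON hash of each entity's upstream data; the cache layout and field names are assumptions, not the actual --incremental implementation:

```python
import hashlib
import json

def entity_hash(entity: dict) -> str:
    """Stable hash of an entity's upstream data (sorted keys for determinism)."""
    canonical = json.dumps(entity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def process_incremental(entities, cache, extract):
    """Re-extract only entities whose hash changed; serve the rest from cache."""
    results = {}
    for ent in entities:
        h = entity_hash(ent)
        hit = cache.get(ent["id"])
        if hit is not None and hit["hash"] == h:
            results[ent["id"]] = hit  # cache hit: zero LLM calls
        else:
            pairs, scores = extract(ent)  # 3 LLM calls (battery, discovery, judge)
            cache[ent["id"]] = {"hash": h, "pairs": pairs, "judge_scores": scores}
            results[ent["id"]] = cache[ent["id"]]
    return results
```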
Actual numbers from a 2-entity-per-domain run with gpt-4o-mini:
| Domain | Entities | Pairs | Avg Pairs/Entity | Avg Judge Confidence |
|---|---|---|---|---|
| compute-resources | 2 | 17 | 8.5 | 0.93 |
| software-discovery | 2 | 18 | 9.0 | 0.90 |
| allocations | 2 | 19 | 9.5 | 0.90 |
| nsf-awards | 2 | 18 | 9.0 | 0.88 |
| affinity-groups | 2 | 8 | 4.0 | 0.95 |
| comparisons | — | 3 | — | — |
| Total | 10 | 83 | 8.3 | 0.91 |
Citation validity was 100%, and every pair was scored suggested_decision: "approved".