Standard benchmarks and leaderboards measure an AI model's raw capability, but they don't tell the whole story. They ignore crucial, real-world factors that every developer and business must face: your budget, your time, and the actual cost when a model fails and a human has to intervene.
This document presents a simple framework to move beyond abstract accuracy and calculate the total expected monetary cost of using an AI model for a specific task. By factoring in API fees, the probability of success, the cost of retries, and the financial impact of complete failure, you can make a rational, data-driven decision. It helps you answer the real question: which model provides the most value for your specific workflow, not just the one with the highest benchmark score.
| Symbol | Meaning | Typical unit |
|---|---|---|
| p | first-attempt success probability | 0–1 |
| c | cost of the first attempt (API $ + your prompting/inspection time monetised at v) | $ |
| r | max retries you are willing to pay for (≥ 0) | integer |
| f | cost of one retry (API $ + your time $; often f ≈ c) | $ |
| F | fallback cost if every AI attempt fails (human dev, rewrite, etc.) | $ |
| t | extra hours you lose on final failure (context-switching, re-spec, hand-off …) | h |
| v | value of your time | $/h |
- The probability of success p is independent and constant for each attempt.
- The cost of each retry f is constant.
Total probability that the model succeeds within the allowed r + 1 attempts:

$$P_{\text{success}} = 1 - (1-p)^{r+1}$$
We always pay c once; each retry is paid only if the preceding attempt failed, and here we condition on the run eventually succeeding.
Expected number of retries given eventual success:

$$E[R \mid \text{succ}] = \frac{1-p}{p} \cdot \frac{1 - (1-p)^{r+1} - (r+1) \cdot p \cdot (1-p)^{r}}{P_{\text{success}}}$$
(Derivation note: This is derived from the mean of a geometric distribution, conditional on success occurring within the first r+1 trials.)
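This conditional mean is easy to sanity-check with a quick Monte Carlo simulation. The sketch below (function name and sample count are illustrative) simulates the retry policy directly and averages the retry count over the runs that succeed:

```python
import random

def mean_retries_given_success(p, r, n=200_000, seed=0):
    """Simulate up to r+1 Bernoulli(p) attempts per run; average the number
    of retries (attempts after the first) over runs that eventually succeed."""
    rng = random.Random(seed)
    retries, successes = 0, 0
    for _ in range(n):
        for attempt in range(r + 1):
            if rng.random() < p:      # this attempt succeeds
                retries += attempt    # attempts before success = retries used
                successes += 1
                break
    return retries / successes

# With p = 0.6 and r = 2 this lands near the closed-form value of about 0.462.
print(mean_retries_given_success(0.6, 2))
```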
Hence, the expected cost given eventual success:

$$C_{\text{success}} = c + f \cdot E[R \mid \text{succ}]$$
On final failure, we have made all r + 1 attempts (the first attempt plus r retries), then pay the fallback penalty and the monetised lost time. The cost is therefore:

$$C_{\text{fail}} = c + r f + F + t v$$

Combining the two branches gives the expected cost of the task:

$$\text{EV}_{\text{task}} = P_{\text{success}} \cdot C_{\text{success}} + (1 - P_{\text{success}}) \cdot C_{\text{fail}}$$
A worked example with the following inputs:

| p | r | c | f | F | t | v |
|---|---|---|---|---|---|---|
| 0.6 | 2 | $1 | $1 | $10 | 1 h | $50 h⁻¹ |
- $$P_{\text{success}} = 1 - 0.4^{3} = 0.936$$
- $$E[R \mid \text{succ}] = \frac{0.4}{0.6} \cdot \frac{1 - 0.4^{3} - 3 \cdot 0.6 \cdot 0.4^{2}}{0.936} \approx 0.462 \text{ retries}$$
- $$C_{\text{success}} = 1 + 1 \cdot 0.462 \approx \$1.46$$
- $$C_{\text{fail}} = 1 + 2 \cdot 1 + 10 + 1 \cdot 50 = \$63$$
- $$\text{EV}_{\text{task}} = 0.936 \cdot 1.46 + (1 - 0.936) \cdot 63 \approx \$5.40$$
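The whole calculation collapses into one small function. A Python sketch (function and parameter names are illustrative) that reproduces the numbers above:

```python
def ev_task(p, r, c, f, F, t, v):
    """Expected $ cost of one task run through the retry-then-fallback policy."""
    q = 1 - p
    p_success = 1 - q ** (r + 1)
    # Mean number of retries, conditional on succeeding within r+1 attempts.
    e_retries = (q / p) * (1 - q ** (r + 1) - (r + 1) * p * q ** r) / p_success
    c_success = c + f * e_retries
    c_fail = c + r * f + F + t * v
    return p_success * c_success + (1 - p_success) * c_fail

# Worked example from the table above:
print(round(ev_task(p=0.6, r=2, c=1, f=1, F=10, t=1, v=50), 2))  # → 5.4
```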
If you want to give the AI only as many retries as fit inside the time a human would spend, set your maximum retries r to:

$$r = \left\lfloor \frac{T_{\text{human}}}{T_{\text{AI}}} \right\rfloor - 1$$
where:

- $$T_{\text{human}}$$ = human hours for the task
- $$T_{\text{AI}}$$ = AI hours per attempt (wall-clock + your inspection time)
Then plug this r into the EV formula above.
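Under the reading that all r + 1 attempts must fit inside the human time budget, the retry cap can be computed as, for example (function name is illustrative):

```python
import math

def retry_budget(t_human, t_ai):
    """Largest r such that all r+1 AI attempts fit inside the human time budget,
    clamped at zero when even one attempt exceeds it."""
    return max(0, math.floor(t_human / t_ai) - 1)

# A 2 h human task with 0.5 h per AI attempt: 4 attempts fit, so r = 3.
print(retry_budget(2.0, 0.5))  # → 3
```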
- Pick 30–50 representative tasks.
- For each task/model pair, record:
  - pass/fail on first try → p
  - API $ + minutes you spent, monetised at v → c
  - retry cost f (usually ≈ c)
  - did you stop early? → the real r used
  - the fallback you would actually pay → F
  - extra hours lost if all attempts failed → t
- Average EV_task across tasks → EV_model.
- Choose the model with the lowest EV_model.
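The final aggregation step can be sketched as follows, with hypothetical per-task EVs (the model names and dollar figures are made up for illustration):

```python
# Per-task EV_task values (in $), already computed with the EV formula.
ev_by_model = {
    "model_a": [5.40, 4.10, 7.25],
    "model_b": [6.80, 3.90, 6.10],
}

# Average across tasks to get EV_model, then pick the cheapest model.
ev_model = {m: sum(evs) / len(evs) for m, evs in ev_by_model.items()}
best = min(ev_model, key=ev_model.get)
print(best, round(ev_model[best], 2))  # → model_a 5.58
```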
Leaderboards optimise for accuracy under infinite patience; your wallet optimises for accuracy under finite money and time. The EV formula converts both dimensions into a single currency ($) and lets you decide whether a 4% absolute accuracy boost is worth a 3× cost increase.
Compute: “probability it works” × “what you pay when it works” + “probability it bombs” × “what you pay when it bombs”
Pick the model whose average $ across your own tasks is lowest.