Standard benchmarks and leaderboards measure an AI model's raw capability, but they don't tell the whole story. They ignore crucial, real-world factors that every developer and business must face: your budget, your time, and the actual cost when a model fails and a human has to intervene.
This document presents a simple framework to move beyond abstract accuracy and calculate the total expected monetary cost of using an AI model for a specific task. By factoring in API fees, the probability of success, the cost of retries, and the financial impact of complete failure, you can make a rational, data-driven decision. It helps you answer the real question: which model provides the most value for your specific workflow, not just the one with the highest benchmark score.
| Symbol | Meaning | Typical unit |
|---|---|---|
| p | first-attempt success probability | 0–1 |
| c | cost of the first attempt (API $ + your prompting/inspection time monetised at v) | $ |
| r | max retries you are willing to pay for (≥ 0) | integer |
| f | cost of one retry (API $ + your time $; often f ≈ c) | $ |
| F | fallback cost if every AI attempt fails (human dev, rewrite, etc.) | $ |
| t | extra hours you lose on final failure (context-switching, re-spec, hand-off …) | h |
| v | value of your time | $/h |
- The probability of success p is independent and constant for each attempt.
- The cost of each retry f is constant.
Total probability that the model succeeds within the allowed r + 1 attempts:

$$P_{\text{success}} = 1 - (1-p)^{r+1}$$
We always pay c once; each retry is paid only if the preceding attempt failed, and here we condition on the run eventually succeeding.
Expected number of retries given eventual success:

$$E[R \mid \text{succ}] = \frac{1-p}{p} \cdot \frac{1 - (1-p)^{r+1} - (r+1) \cdot p \cdot (1-p)^{r}}{P_{\text{success}}}$$
(Derivation note: This is derived from the mean of a geometric distribution, conditional on success occurring within the first r+1 trials.)
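This conditional mean is easy to sanity-check with a quick Monte Carlo simulation. The sketch below (function name and sample count are illustrative) simulates the retry policy directly and averages the retry count over the runs that succeed:

```python
import random

def mean_retries_given_success(p, r, n=200_000, seed=0):
    """Simulate up to r+1 Bernoulli(p) attempts per run; average the number
    of retries (attempts after the first) over runs that eventually succeed."""
    rng = random.Random(seed)
    retries, successes = 0, 0
    for _ in range(n):
        for attempt in range(r + 1):
            if rng.random() < p:      # this attempt succeeds
                retries += attempt    # attempts before success = retries used
                successes += 1
                break
    return retries / successes

# With p = 0.6 and r = 2 this lands near the closed-form value of about 0.462.
print(mean_retries_given_success(0.6, 2))
```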
Hence, the expected cost given eventual success:

$$C_{\text{success}} = c + f \cdot E[R \mid \text{succ}]$$
On final failure, we have made all r + 1 attempts (the first attempt plus r retries), then pay the fallback penalty and the monetised lost time. The cost is therefore:

$$C_{\text{fail}} = c + r f + F + t v$$

Combining the two branches gives the expected cost of the task:

$$\text{EV}_{\text{task}} = P_{\text{success}} \cdot C_{\text{success}} + (1 - P_{\text{success}}) \cdot C_{\text{fail}}$$
A worked example with the following inputs:

| p | r | c | f | F | t | v |
|---|---|---|---|---|---|---|
| 0.6 | 2 | $1 | $1 | $10 | 1 h | $50 h⁻¹ |
- $$P_{\text{success}} = 1 - 0.4^{3} = 0.936$$
- $$E[R \mid \text{succ}] = \frac{0.4}{0.6} \cdot \frac{1 - 0.4^{3} - 3 \cdot 0.6 \cdot 0.4^{2}}{0.936} \approx 0.462 \text{ retries}$$
- $$C_{\text{success}} = 1 + 1 \cdot 0.462 \approx \$1.46$$
- $$C_{\text{fail}} = 1 + 2 \cdot 1 + 10 + 1 \cdot 50 = \$63$$
- $$\text{EV}_{\text{task}} = 0.936 \cdot 1.46 + (1 - 0.936) \cdot 63 \approx \$5.40$$
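The whole calculation collapses into one small function. A Python sketch (function and parameter names are illustrative) that reproduces the numbers above:

```python
def ev_task(p, r, c, f, F, t, v):
    """Expected $ cost of one task run through the retry-then-fallback policy."""
    q = 1 - p
    p_success = 1 - q ** (r + 1)
    # Mean number of retries, conditional on succeeding within r+1 attempts.
    e_retries = (q / p) * (1 - q ** (r + 1) - (r + 1) * p * q ** r) / p_success
    c_success = c + f * e_retries
    c_fail = c + r * f + F + t * v
    return p_success * c_success + (1 - p_success) * c_fail

# Worked example from the table above:
print(round(ev_task(p=0.6, r=2, c=1, f=1, F=10, t=1, v=50), 2))  # → 5.4
```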
If you want to give the AI only as many retries as fit inside the time a human would spend, set your maximum retries r to:

$$r = \left\lfloor \frac{T_{\text{human}}}{T_{\text{AI}}} \right\rfloor - 1$$
where:

- $$T_{\text{human}}$$ = human hours for the task
- $$T_{\text{AI}}$$ = AI hours per attempt (wall-clock + your inspection time)
Then plug this r into the EV formula above.
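Under the reading that all r + 1 attempts must fit inside the human time budget, the retry cap can be computed as, for example (function name is illustrative):

```python
import math

def retry_budget(t_human, t_ai):
    """Largest r such that all r+1 AI attempts fit inside the human time budget,
    clamped at zero when even one attempt exceeds it."""
    return max(0, math.floor(t_human / t_ai) - 1)

# A 2 h human task with 0.5 h per AI attempt: 4 attempts fit, so r = 3.
print(retry_budget(2.0, 0.5))  # → 3
```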
- Pick 30–50 representative tasks.
- For each task/model pair, record:
  - pass/fail on first try → p
  - API $ + minutes you spent, monetised at v → c
  - retry cost f (usually ≈ c)
  - did you stop early? → the real r used
  - the fallback you would actually pay → F
  - extra hours lost if all attempts failed → t
- Average EV_task across tasks → EV_model.
- Choose the model with the lowest EV_model.
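The final aggregation step can be sketched as follows, with hypothetical per-task EVs (the model names and dollar figures are made up for illustration):

```python
# Per-task EV_task values (in $), already computed with the EV formula.
ev_by_model = {
    "model_a": [5.40, 4.10, 7.25],
    "model_b": [6.80, 3.90, 6.10],
}

# Average across tasks to get EV_model, then pick the cheapest model.
ev_model = {m: sum(evs) / len(evs) for m, evs in ev_by_model.items()}
best = min(ev_model, key=ev_model.get)
print(best, round(ev_model[best], 2))  # → model_a 5.58
```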
Leaderboards optimise for accuracy under infinite patience; your wallet optimises for accuracy under finite money and time. The EV formula converts both dimensions into a single currency ($) and lets you decide whether a 4% absolute accuracy boost is worth a 3× cost increase.
Compute: “probability it works” × “what you pay when it works” + “probability it bombs” × “what you pay when it bombs”
Pick the model whose average $ across your own tasks is lowest.