@alkimiadev
Created September 11, 2025 01:27
Choosing AI Models: A Cost-Benefit Analysis Framework


Standard benchmarks and leaderboards measure an AI model's raw capability, but they don't tell the whole story. They ignore crucial, real-world factors that every developer and business must face: your budget, your time, and the actual cost when a model fails and a human has to intervene.

This document presents a simple framework to move beyond abstract accuracy and calculate the total expected monetary cost of using an AI model for a specific task. By factoring in API fees, the probability of success, the cost of retries, and the financial impact of complete failure, you can make a rational, data-driven decision. It helps you answer the real question: which model provides the most value for your specific workflow, not just the one with the highest benchmark score.

Variable definitions (per task)

| Symbol | Meaning | Typical unit |
|---|---|---|
| p | first-attempt success probability | 0–1 |
| c | cost of the first attempt (API $ + your prompting/inspection time monetised at v) | $ |
| r | max retries you are willing to pay for (≥ 0) | integer |
| f | cost of one retry (API $ + your time $; often f ≈ c) | $ |
| F | fallback cost if every AI attempt fails (human dev, rewrite, etc.) | $ |
| t | extra hours you lose on final failure (context-switching, re-spec, hand-off …) | h |
| v | value of your time | $/h |

Underlying Assumptions

  • The probability of success p is independent and constant for each attempt.
  • The cost of each retry f is constant.

Core probabilities

Total probability that the model succeeds within the allowed r+1 attempts: $$P_{\text{success}} = 1 - (1 - p)^{r+1}$$

Expected cost when the model eventually succeeds

We always pay c once. Each retry is paid only after a preceding failed attempt, and the expectation below is conditioned on the model eventually succeeding within the allowed r+1 attempts.

Expected number of retries given eventual success: $$E[R \mid \text{succ}] = \frac{1 - p}{p} \cdot \frac{1 - (1 - p)^{r+1} - (r+1)\,p\,(1 - p)^r}{1 - (1 - p)^{r+1}}$$

(Derivation note: This is derived from the mean of a geometric distribution, conditional on success occurring within the first r+1 trials.)

Hence, the expected cost given eventual success: $$C_{\text{success}} = c + f \cdot E[R \mid \text{succ}]$$
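As a quick numerical cross-check of the closed form, it can be compared against a direct enumeration over the attempt on which success occurs (a sketch; the function names are my own, not from the original):

```python
def expected_retries_given_success(p: float, r: int) -> float:
    """Closed-form E[R | success] from the formula above."""
    q = 1.0 - p
    p_succ = 1.0 - q ** (r + 1)
    return (q / p) * (1.0 - q ** (r + 1) - (r + 1) * p * q ** r) / p_succ


def expected_retries_direct(p: float, r: int) -> float:
    """Direct enumeration: success on attempt k implies k - 1 paid retries."""
    q = 1.0 - p
    p_succ = 1.0 - q ** (r + 1)
    return sum((k - 1) * q ** (k - 1) * p for k in range(1, r + 2)) / p_succ
```

Both agree to machine precision; for p = 0.6 and r = 2 they give roughly 0.462 conditional retries.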

Expected cost when the model finally fails

On final failure, we have made all r+1 attempts (the first attempt plus r retries), then pay the fallback penalty.

The cost is therefore: $$C_{\text{fail}} = c + r\,f + F + t\,v$$

Total expected monetary cost of the task

$$\text{EV}_{\text{task}} = P_{\text{success}} \cdot C_{\text{success}} + (1 - P_{\text{success}}) \cdot C_{\text{fail}}$$
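The whole calculation fits in a few lines; this Python sketch simply transcribes the formulas above (the function name `ev_task` is illustrative, not from the original):

```python
def ev_task(p: float, r: int, c: float, f: float,
            F: float, t: float, v: float) -> float:
    """Total expected monetary cost of one task."""
    q = 1.0 - p
    p_succ = 1.0 - q ** (r + 1)                 # P(success within r+1 attempts)
    e_retries = (q / p) * (1.0 - q ** (r + 1)
                           - (r + 1) * p * q ** r) / p_succ
    c_succ = c + f * e_retries                  # expected cost given success
    c_fail = c + r * f + F + t * v              # cost on final failure
    return p_succ * c_succ + (1.0 - p_succ) * c_fail
```

Note that for r = 0 the conditional-retries term vanishes, so the function reduces to the single-shot sanity check below.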

Sanity-check: single-shot (no retries, r = 0)

$$\text{EV}_{\text{task}} = p \cdot c + (1 - p)(c + F + t\,v) = c + (1 - p)(F + t\,v)$$

Worked micro-example

| p | r | c | f | F | t | v |
|---|---|---|---|---|---|---|
| 0.6 | 2 | $1 | $1 | $10 | 1 h | $50 h⁻¹ |
  • $$P_{\text{success}} = 1 - 0.4^{3} = 0.936$$
  • $$E[R \mid \text{succ}] = \frac{0.4}{0.6} \cdot \frac{1 - 0.4^{3} - 3 \cdot 0.6 \cdot 0.4^{2}}{0.936} \approx 0.462$$ retries
  • $$C_{\text{success}} = 1 + 1 \cdot 0.462 \approx \$1.46$$
  • $$C_{\text{fail}} = 1 + 2 \cdot 1 + 10 + 1 \cdot 50 = \$63$$
  • $$\text{EV}_{\text{task}} = 0.936 \cdot 1.46 + (1 - 0.936) \cdot 63 \approx \$5.40$$
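These numbers can also be sanity-checked by simulation under the i.i.d.-attempts assumption stated earlier (a rough Monte Carlo sketch; names and the sample size are my own choices):

```python
import random

def simulate_ev(p, r, c, f, F, t, v, n_tasks=500_000, seed=42):
    """Monte Carlo estimate of EV_task under i.i.d. Bernoulli(p) attempts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_tasks):
        cost = c                          # the first attempt is always paid
        succeeded = rng.random() < p
        retries = 0
        while not succeeded and retries < r:
            retries += 1
            cost += f                     # pay one retry after each failure
            succeeded = rng.random() < p
        if not succeeded:
            cost += F + t * v             # fallback after exhausting retries
        total += cost
    return total / n_tasks
```

With the table's inputs the estimate hovers around $5.40, agreeing with the closed-form result.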

Time-Constrained Retry Budget (optional extension)

If you want to give the AI only as many retries as fit inside the time a human would spend, set your maximum retries r to: $$r_{\text{max}} = \left\lfloor \frac{T_{\text{human}}}{T_{\text{AI}}} \right\rfloor - 1$$

where:

  • $$T_{\text{human}}$$ = human hours for the task
  • $$T_{\text{AI}}$$ = AI hours per attempt (wall-clock + your inspection time)

Then plug $$r = r_{\text{max}}$$ into the formulas above to obtain a time-fair EV comparison.
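The floor computation is a one-liner; this sketch clamps at zero so the first attempt is always allowed (the function name is illustrative):

```python
import math

def max_retries_for_time_budget(t_human: float, t_ai: float) -> int:
    """r_max = floor(T_human / T_AI) - 1, clamped at zero retries."""
    return max(0, math.floor(t_human / t_ai) - 1)
```

For example, a task budgeted at 2 human-hours with 0.25 AI-hours per attempt allows r_max = 7 retries (8 attempts total).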

How to collect the numbers in practice

  1. Pick 30–50 representative tasks.
  2. For each task / model pair record:
    • pass/fail on first try → p
    • API $ + minutes you spent → c
    • retry cost f (usually ≈ c)
    • did you stop early? → real r used
    • fallback you would actually pay → F
    • extra hours lost if all failed → t
  3. Average EV_task across tasks → EV_model.
  4. Choose the model with the lowest EV_model.
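Steps 2–4 amount to a small aggregation over your measurement log. In this sketch the records are made-up placeholders (model names and numbers are hypothetical), and `ev_task` transcribes the EV formula from earlier:

```python
def ev_task(p, r, c, f, F, t, v):
    """Expected monetary cost of one task."""
    q = 1.0 - p
    p_succ = 1.0 - q ** (r + 1)
    e_ret = (q / p) * (1.0 - q ** (r + 1) - (r + 1) * p * q ** r) / p_succ
    return p_succ * (c + f * e_ret) + (1.0 - p_succ) * (c + r * f + F + t * v)

# Hypothetical measurement log: one record per (model, task) pair.
records = [
    {"model": "model-a", "p": 0.8, "r": 2, "c": 0.50, "f": 0.50, "F": 10.0, "t": 0.5, "v": 50.0},
    {"model": "model-a", "p": 0.7, "r": 2, "c": 0.40, "f": 0.40, "F": 15.0, "t": 1.0, "v": 50.0},
    {"model": "model-b", "p": 0.9, "r": 1, "c": 2.00, "f": 2.00, "F": 10.0, "t": 0.5, "v": 50.0},
]

def ev_per_model(records):
    """Average EV_task per model; pick the model minimising this."""
    sums, counts = {}, {}
    for rec in records:
        m = rec["model"]
        args = {k: rec[k] for k in ("p", "r", "c", "f", "F", "t", "v")}
        sums[m] = sums.get(m, 0.0) + ev_task(**args)
        counts[m] = counts.get(m, 0) + 1
    return {m: sums[m] / counts[m] for m in sums}

avgs = ev_per_model(records)
best_model = min(avgs, key=avgs.get)
```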

Why this beats leaderboards

Leaderboards optimise for accuracy under infinite patience; your wallet optimises for accuracy under finite money and time. The EV formula converts both dimensions into a single currency ($) and lets you decide whether a 4% absolute accuracy boost is worth a 3× cost increase.

TL;DR

Compute: “probability it works” × “what you pay when it works” + “probability it bombs” × “what you pay when it bombs”

Pick the model whose average $ across your own tasks is lowest.
