Taken from inspect_evals @ ef181cdc95dc2d92b0f9dff8b6c0756b42128b78, with our modifications:
- don't pass in temperatures or grader models (we have our own defaults/policies)
- use tool calling for more structured output in the scorer (theirs was extremely brittle)
- only calculate the percentage correct over all questions (it's the only metric we can compare with other benchmarks)
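
As a rough illustration of the last two modifications, here is a minimal sketch in plain Python of reading a grade from a tool call's JSON arguments and then computing the percentage correct over all questions. This is not the inspect_evals or inspect_ai API: `GRADE_TOOL_SCHEMA`, `submit_grade`, `parse_grade`, and `percent_correct` are hypothetical names made up for this example, and the single boolean `correct` field is an assumption about what the grader returns.

```python
import json

# Hypothetical tool definition the grader model would be asked to call.
# The name and fields are illustrative assumptions, not the real scorer's.
GRADE_TOOL_SCHEMA = {
    "name": "submit_grade",
    "description": "Record whether the submitted answer is correct.",
    "parameters": {
        "type": "object",
        "properties": {"correct": {"type": "boolean"}},
        "required": ["correct"],
    },
}


def parse_grade(tool_arguments: str) -> bool:
    """Parse the grader's tool-call arguments (a JSON string) into a bool.

    Raises ValueError on anything that is not an object with a boolean
    'correct' field, instead of guessing from free-form text.
    """
    args = json.loads(tool_arguments)
    if not isinstance(args, dict):
        raise ValueError(f"expected a JSON object, got {type(args).__name__}")
    correct = args.get("correct")
    if not isinstance(correct, bool):
        raise ValueError(f"expected boolean 'correct', got {correct!r}")
    return correct


def percent_correct(grades: list[bool]) -> float:
    """Percentage of questions graded correct, over all questions."""
    if not grades:
        raise ValueError("no grades to aggregate")
    return 100.0 * sum(grades) / len(grades)
```

Because the grade arrives as typed JSON rather than free text, a malformed grader response fails loudly in `parse_grade` instead of being silently misread, which is the kind of brittleness the tool-calling change is meant to avoid.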