@nibzard
Last active February 25, 2026 17:23
Benchmark analysis: run 20260225_151602_384

Benchmark Narrative Report

Run ID: 20260225_151602_384

Source files

  • results/summary/20260225_151602_384/metrics.jsonl
  • results/summary/20260225_151602_384/metrics.csv

Executive summary

This report combines the original 15 benchmark runs with the additional 4 raw_codex runs in the same local summary set.

  • Total runs: 19
  • Window: 2026-02-25 15:18:41.446Z to 2026-02-25 17:08:55.905Z UTC
  • Scenarios covered: 4

Methodology

  1. Parse every line in metrics.jsonl as JSON and use the same run set as the CSV counterpart.
  2. Define meaningful reliability as the share of a scenario's runs with meaningful_success=true.
  3. For speed: compute duration median (p50) and p95 from each scenario's run set.
  4. For effort: report mean command_executions and mean action_count per scenario.
  5. Keep judge metadata (judge_verdict, judge_status) from the same rows; do not reinterpret outside existing judgments.
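The aggregation steps above can be sketched as follows. This is a minimal illustration, not the script that produced this report: the keys `scenario` and `duration_ms` are assumed names for the grouping and duration fields, and the nearest-rank percentile is an assumption chosen because it reproduces the p50 and p95 values reported in the aggregation table. The keys `meaningful_success`, `command_executions`, and `action_count` are the ones named in the methodology.

```python
import json
import math
from collections import defaultdict
from statistics import mean


def nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(jsonl_lines):
    """Group parsed rows by scenario and compute the reliability, speed, and effort metrics."""
    by_scenario = defaultdict(list)
    for line in jsonl_lines:
        if line.strip():  # skip blank lines
            row = json.loads(line)
            by_scenario[row["scenario"]].append(row)  # "scenario" is an assumed key

    summary = {}
    for scenario, rows in by_scenario.items():
        durations = [r["duration_ms"] for r in rows]  # "duration_ms" is an assumed key
        summary[scenario] = {
            "runs": len(rows),
            "success_rate": sum(1 for r in rows if r["meaningful_success"]) / len(rows),
            "p50_ms": nearest_rank(durations, 50),
            "p95_ms": nearest_rank(durations, 95),
            "mean_commands": mean(r["command_executions"] for r in rows),
            "mean_actions": mean(r["action_count"] for r in rows),
        }
    return summary
```

Calling `aggregate(open(path))` on a JSONL file yields one summary dict per scenario. With only four or five runs per scenario, the nearest-rank p95 is simply the slowest run.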

Run table (all runs)

| Scenario | Rep | Run ID | Type | Status | Meaningful | Quality | Duration (ms) | Duration (s) | Result | Matched | Commands | Actions | Timed out | Judge verdict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| steel_ai_cli | 1 | 20260225_151602_412_steel_ai_cli_booking_com_r1_i1_a1 | cli | success | yes | 0.8000 | 159,003 | 159.00 | 5 | 5 | 34 | 5 | no | |
| steel_ai_cli | 2 | 20260225_151841_446_steel_ai_cli_booking_com_r2_i1_a1 | cli | success | yes | 0.9333 | 492,442 | 492.44 | 6 | 6 | 81 | 3 | no | |
| steel_ai_cli | 3 | 20260225_152653_894_steel_ai_cli_booking_com_r3_i1_a1 | cli | success | yes | 0.8333 | 437,313 | 437.31 | 6 | 6 | 102 | 3 | no | |
| steel_ai_cli | 4 | 20260225_153411_211_steel_ai_cli_booking_com_r4_i1_a1 | cli | success | yes | 0.7600 | 42,728 | 42.73 | 5 | 5 | 21 | 8 | no | |
| steel_ai_cli | 5 | 20260225_153453_944_steel_ai_cli_booking_com_r5_i1_a1 | cli | success | yes | 0.8000 | 59,684 | 59.68 | 3 | 3 | 22 | 3 | no | |
| steel_browsing_skill | 1 | 20260225_153553_632_steel_browsing_skill_booking_com_r1_i1_a1 | skill | success | yes | 0.8667 | 34,174 | 34.17 | 3 | 3 | 21 | 6 | no | |
| steel_browsing_skill | 2 | 20260225_153627_810_steel_browsing_skill_booking_com_r2_i1_a1 | skill | success | yes | 0.7600 | 200,573 | 200.57 | 5 | 5 | 56 | 3 | no | |
| steel_browsing_skill | 3 | 20260225_153948_386_steel_browsing_skill_booking_com_r3_i1_a1 | skill | success | yes | 0.8667 | 907,403 | 907.40 | 3 | 3 | 114 | 12 | no | |
| steel_browsing_skill | 4 | 20260225_155455_808_steel_browsing_skill_booking_com_r4_i1_a1 | skill | success | yes | 0.8800 | 287,288 | 287.29 | 5 | 5 | 50 | 8 | no | |
| steel_browsing_skill | 5 | 20260225_155943_100_steel_browsing_skill_booking_com_r5_i1_a1 | skill | success | yes | 0.9500 | 312,279 | 312.28 | 4 | 4 | 46 | 17 | no | |
| steel_cli_huss_agent_cli | 1 | 20260225_160455_382_steel_cli_huss_agent_cli_booking_com_r1_i1_a1 | cli | success | yes | 0.9333 | 246,217 | 246.22 | 3 | 3 | 81 | 12 | no | |
| steel_cli_huss_agent_cli | 2 | 20260225_160901_614_steel_cli_huss_agent_cli_booking_com_r2_i1_a1 | cli | success | yes | 0.9333 | 207,686 | 207.69 | 3 | 3 | 71 | 21 | no | |
| steel_cli_huss_agent_cli | 3 | 20260225_161229_306_steel_cli_huss_agent_cli_booking_com_r3_i1_a1 | cli | success | yes | 0.9000 | 105,381 | 105.38 | 2 | 2 | 41 | 3 | no | |
| steel_cli_huss_agent_cli | 4 | 20260225_161414_692_steel_cli_huss_agent_cli_booking_com_r4_i1_a1 | cli | success | yes | 0.9000 | 99,474 | 99.47 | 2 | 2 | 40 | 3 | no | |
| steel_cli_huss_agent_cli | 5 | 20260225_161554_169_steel_cli_huss_agent_cli_booking_com_r5_i1_a1 | cli | success | yes | 0.9000 | 72,296 | 72.30 | 2 | 2 | 53 | 3 | no | |
| raw_codex | 1 | 20260225_165227_203_raw_codex_booking_com_r1_i1_a1 | raw | timeout | no | 0.0000 | 0 | 0.00 | 0 | 0 | 64 | 5 | yes | unclear |
| raw_codex | 2 | 20260225_165531_266_raw_codex_booking_com_r2_i1_a1 | raw | success | yes | 0.9333 | 95,294 | 95.29 | 3 | 3 | 54 | 4 | no | unclear |
| raw_codex | 3 | 20260225_165709_436_raw_codex_booking_com_r3_i1_a1 | raw | success | yes | 0.8800 | 41,270 | 41.27 | 5 | 5 | 48 | 4 | no | true |
| raw_codex | 4 | 20260225_165753_474_raw_codex_booking_com_r4_i1_a1 | raw | success | yes | 0.7200 | 34,898 | 34.90 | 5 | 5 | 51 | 5 | no | unclear |

Scenario aggregation

| Scenario | Type | Runs | Success rate | Mean quality | Quality range | p50 duration (ms) | p95 duration (ms) | Mean duration (ms) | Timeout rate | Mean commands | Mean actions | Mean results |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| raw_codex | raw | 4 | 75.0% | 0.6333 | 0.0000-0.9333 | 34,898 | 95,294 | 42,865.5 | 25.0% | 54.25 | 4.50 | 3.25 |
| steel_ai_cli | cli | 5 | 100.0% | 0.8253 | 0.7600-0.9333 | 159,003 | 492,442 | 238,234.0 | 0.0% | 52.00 | 4.40 | 5.00 |
| steel_browsing_skill | skill | 5 | 100.0% | 0.8647 | 0.7600-0.9500 | 287,288 | 907,403 | 348,343.4 | 0.0% | 57.40 | 9.20 | 4.00 |
| steel_cli_huss_agent_cli | cli | 5 | 100.0% | 0.9133 | 0.9000-0.9333 | 105,381 | 246,217 | 146,210.8 | 0.0% | 57.20 | 8.40 | 2.40 |

Comparative findings

  1. Reliability is high across the three dedicated scenarios (100% meaningful success in this sample); raw_codex is mixed, with one of its four runs timing out (25% timeout rate, 75% success rate).
  2. Speed-wise, the p50 ranking is raw_codex (34,898 ms) < steel_cli_huss_agent_cli (105,381 ms) < steel_ai_cli (159,003 ms) < steel_browsing_skill (287,288 ms); p95 follows the same order, from 95,294 ms for raw_codex to 907,403 ms for steel_browsing_skill.
  3. Quality ranking by mean: steel_cli_huss_agent_cli (0.9133) is best, then steel_browsing_skill (0.8647), then steel_ai_cli (0.8253), with raw_codex last (0.6333, because the timed-out run counts as 0.0000).
  4. Effort: steel_browsing_skill has the highest mean action count (9.20) and command volume (57.40); steel_ai_cli is the lightest (52.00 commands, 4.40 actions) but ran long on reps 2 and 3 (492.44 s and 437.31 s).
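To make the effect of the timeout on the raw_codex averages concrete, the sketch below recomputes mean quality and mean duration with and without the timed-out run. The tuples are copied from the run table; the helper name `summarize` and the dictionary keys are illustrative only.

```python
from statistics import mean

# (quality, duration_ms, timed_out) for the four raw_codex rows in the run table
raw_codex_runs = [
    (0.0000, 0, True),
    (0.9333, 95_294, False),
    (0.8800, 41_270, False),
    (0.7200, 34_898, False),
]


def summarize(runs):
    """Mean quality (rounded to 4 decimals) and mean duration for a list of runs."""
    return {
        "mean_quality": round(mean(q for q, _, _ in runs), 4),
        "mean_duration_ms": mean(d for _, d, _ in runs),
    }


with_timeout = summarize(raw_codex_runs)
completed_only = summarize([run for run in raw_codex_runs if not run[2]])
```

Dropping the timed-out run raises mean quality from 0.6333 to 0.8444 and mean duration from 42,865.5 ms to 57,154 ms. Because the timeout is recorded with a duration of 0 ms, it lowers raw_codex's duration figures as well as its quality, so its speed advantage is somewhat overstated when the timeout is included.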

General narrative

Across these 19 runs, the benchmark is behaving like a practical quality-vs-stability trade study rather than a strict pass/fail exercise. The cleanest signal is that quality is highest and tightest in steel_cli_huss_agent_cli (mean 0.9133, range 0.9000-0.9333), while run-to-run variance is concentrated in steel_ai_cli and especially steel_browsing_skill. All three dedicated scenarios reached 100% meaningful success, so reliability only separates them from raw_codex. In this dataset, raw_codex is useful as a stress signal for behavioral consistency rather than a production-ready lane.

My reading by scenario:

  • steel_cli_huss_agent_cli: best overall balance here. It is the highest-quality bucket (mean 0.9133) while remaining stable and fully meaningful on all five reps. It also has the lowest median latency among the dedicated scenarios (p50 105,381 ms), which makes it the best default pick when both quality and speed matter.
  • steel_browsing_skill: strong quality ceiling (0.9500 max rep) but highest command/action footprint and the largest latency tail (907,403 ms at p95). Use this when result richness is the top goal and long-tail runtime is acceptable.
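  • steel_ai_cli: decent quality (mean 0.8253) with wide rep-to-rep variance in this run set: quality ranges from 0.7600 to 0.9333 and duration from 42.73 s to 492.44 s. It has the lightest effort footprint (52.00 commands and 4.40 actions on average) and can be efficient, but its behavior is less predictable than steel_cli_huss_agent_cli.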
  • raw_codex: a mixed experiment channel in this report. It shows both fast successful runs and one timeout (rep 1: not meaningful, quality 0.0000, recorded duration 0 ms), so the mean quality is dragged down by an unresolved execution path. It is useful for identifying where workflow constraints and interpretation rules are still affecting consistency.

The common pattern is straightforward: if you need dependable, ranked results with low ambiguity, pick steel_cli_huss_agent_cli; if you need potentially richer browsing depth and can tolerate run-time variance, use steel_browsing_skill; if you need a constrained comparison against unconstrained behavior, keep raw_codex in scope but treat it as a separate reliability tier until repeats are stabilized.

Recommendations

  1. Keep this report as the canonical comparison until all scenarios have equal repetition depth.
  2. For raw_codex, re-run a full 5-rep block and review the judge metadata (three of the four current verdicts are "unclear") to reduce ambiguity when comparing it against the other scenarios.
  3. Publish scenario-level stability bands (p50/p95) after each refresh so table issues can be caught immediately.
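One way to make the stability bands in recommendation 3 easy to scan is to report the p95/p50 ratio per scenario and flag wide bands. The sketch below uses the p50 and p95 values from the scenario aggregation table; the `max_ratio` threshold of 3.0 and the function names are hypothetical choices for illustration, not part of the benchmark tooling.

```python
# (p50_ms, p95_ms) per scenario, copied from the scenario aggregation table
bands = {
    "raw_codex": (34_898, 95_294),
    "steel_ai_cli": (159_003, 492_442),
    "steel_browsing_skill": (287_288, 907_403),
    "steel_cli_huss_agent_cli": (105_381, 246_217),
}


def tail_ratio(p50_ms, p95_ms):
    """How many times slower the tail (p95) is than the median (p50)."""
    return p95_ms / p50_ms


def flag_wide_bands(bands, max_ratio=3.0):
    """Return scenario names whose p95/p50 ratio exceeds max_ratio (a hypothetical threshold)."""
    return sorted(
        name for name, (p50_ms, p95_ms) in bands.items()
        if tail_ratio(p50_ms, p95_ms) > max_ratio
    )


wide = flag_wide_bands(bands)
```

With this threshold, steel_ai_cli and steel_browsing_skill are flagged, which matches the variance pattern described in the narrative. With only four or five runs per scenario the p95 is effectively the slowest run, so these ratios should be read as rough indicators until more repetitions are available.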