Run ID: 20260225_151602_384
Source files
- `results/summary/20260225_151602_384/metrics.jsonl`
- `results/summary/20260225_151602_384/metrics.csv`
This report combines the original 15 benchmark runs with the additional 4 raw_codex runs in the same local summary set.
- Total runs: 19
- Window: 2026-02-25 15:18:41.446Z to 2026-02-25 17:08:55.905Z UTC
- Scenarios covered: 4
- Parse every line in `metrics.jsonl` as JSON and use the same run set as the CSV counterpart.
- Define meaningful reliability as `meaningful_success=true`.
- For speed: compute duration median (p50) and p95 from each scenario's run set.
- For effort: report mean `command_executions` and mean `action_count` per scenario.
- Keep judge metadata (`judge_verdict`, `judge_status`) from the same rows; do not reinterpret outside existing judgments.
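The aggregation steps above can be sketched roughly as follows. This is a minimal illustration, not the report's actual tooling: the `scenario` key and the helper names `load_runs` and `summarize` are assumptions, while `meaningful_success`, `command_executions`, and `action_count` are the fields named in the list.

```python
import json
from collections import defaultdict
from statistics import mean


def load_runs(path):
    """Parse a JSONL file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(runs):
    """Group runs by scenario and compute reliability and effort means."""
    by_scenario = defaultdict(list)
    for run in runs:
        # "scenario" is an assumed key name; adjust to the real schema.
        by_scenario[run["scenario"]].append(run)

    summary = {}
    for name, rows in by_scenario.items():
        summary[name] = {
            "runs": len(rows),
            # Reliability: share of rows with meaningful_success == true.
            "success_rate": mean(1.0 if r["meaningful_success"] else 0.0 for r in rows),
            # Effort: per-scenario means of the two counters.
            "mean_commands": mean(r["command_executions"] for r in rows),
            "mean_actions": mean(r["action_count"] for r in rows),
        }
    return summary
```

A typical call would be `summarize(load_runs("results/summary/20260225_151602_384/metrics.jsonl"))`, assuming each line carries the keys used above.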
| Scenario | Rep | Run ID | Type | Status | Meaningful | Quality | Duration (ms) | Duration (s) | Result | Matched | Commands | Actions | Timed out | Judge verdict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| steel_ai_cli | 1 | 20260225_151602_412_steel_ai_cli_booking_com_r1_i1_a1 | cli | success | yes | 0.8000 | 159,003 | 159.00 | 5 | 5 | 34 | 5 | no | |
| steel_ai_cli | 2 | 20260225_151841_446_steel_ai_cli_booking_com_r2_i1_a1 | cli | success | yes | 0.9333 | 492,442 | 492.44 | 6 | 6 | 81 | 3 | no | |
| steel_ai_cli | 3 | 20260225_152653_894_steel_ai_cli_booking_com_r3_i1_a1 | cli | success | yes | 0.8333 | 437,313 | 437.31 | 6 | 6 | 102 | 3 | no | |
| steel_ai_cli | 4 | 20260225_153411_211_steel_ai_cli_booking_com_r4_i1_a1 | cli | success | yes | 0.7600 | 42,728 | 42.73 | 5 | 5 | 21 | 8 | no | |
| steel_ai_cli | 5 | 20260225_153453_944_steel_ai_cli_booking_com_r5_i1_a1 | cli | success | yes | 0.8000 | 59,684 | 59.68 | 3 | 3 | 22 | 3 | no | |
| steel_browsing_skill | 1 | 20260225_153553_632_steel_browsing_skill_booking_com_r1_i1_a1 | skill | success | yes | 0.8667 | 34,174 | 34.17 | 3 | 3 | 21 | 6 | no | |
| steel_browsing_skill | 2 | 20260225_153627_810_steel_browsing_skill_booking_com_r2_i1_a1 | skill | success | yes | 0.7600 | 200,573 | 200.57 | 5 | 5 | 56 | 3 | no | |
| steel_browsing_skill | 3 | 20260225_153948_386_steel_browsing_skill_booking_com_r3_i1_a1 | skill | success | yes | 0.8667 | 907,403 | 907.40 | 3 | 3 | 114 | 12 | no | |
| steel_browsing_skill | 4 | 20260225_155455_808_steel_browsing_skill_booking_com_r4_i1_a1 | skill | success | yes | 0.8800 | 287,288 | 287.29 | 5 | 5 | 50 | 8 | no | |
| steel_browsing_skill | 5 | 20260225_155943_100_steel_browsing_skill_booking_com_r5_i1_a1 | skill | success | yes | 0.9500 | 312,279 | 312.28 | 4 | 4 | 46 | 17 | no | |
| steel_cli_huss_agent_cli | 1 | 20260225_160455_382_steel_cli_huss_agent_cli_booking_com_r1_i1_a1 | cli | success | yes | 0.9333 | 246,217 | 246.22 | 3 | 3 | 81 | 12 | no | |
| steel_cli_huss_agent_cli | 2 | 20260225_160901_614_steel_cli_huss_agent_cli_booking_com_r2_i1_a1 | cli | success | yes | 0.9333 | 207,686 | 207.69 | 3 | 3 | 71 | 21 | no | |
| steel_cli_huss_agent_cli | 3 | 20260225_161229_306_steel_cli_huss_agent_cli_booking_com_r3_i1_a1 | cli | success | yes | 0.9000 | 105,381 | 105.38 | 2 | 2 | 41 | 3 | no | |
| steel_cli_huss_agent_cli | 4 | 20260225_161414_692_steel_cli_huss_agent_cli_booking_com_r4_i1_a1 | cli | success | yes | 0.9000 | 99,474 | 99.47 | 2 | 2 | 40 | 3 | no | |
| steel_cli_huss_agent_cli | 5 | 20260225_161554_169_steel_cli_huss_agent_cli_booking_com_r5_i1_a1 | cli | success | yes | 0.9000 | 72,296 | 72.30 | 2 | 2 | 53 | 3 | no | |
| raw_codex | 1 | 20260225_165227_203_raw_codex_booking_com_r1_i1_a1 | raw | timeout | no | 0.0000 | 0 | 0.00 | 0 | 0 | 64 | 5 | yes | unclear |
| raw_codex | 2 | 20260225_165531_266_raw_codex_booking_com_r2_i1_a1 | raw | success | yes | 0.9333 | 95,294 | 95.29 | 3 | 3 | 54 | 4 | no | unclear |
| raw_codex | 3 | 20260225_165709_436_raw_codex_booking_com_r3_i1_a1 | raw | success | yes | 0.8800 | 41,270 | 41.27 | 5 | 5 | 48 | 4 | no | true |
| raw_codex | 4 | 20260225_165753_474_raw_codex_booking_com_r4_i1_a1 | raw | success | yes | 0.7200 | 34,898 | 34.90 | 5 | 5 | 51 | 5 | no | unclear |
| Scenario | Type | Runs | Success rate | Mean quality | Quality range | p50 duration (ms) | p95 duration (ms) | Mean duration (ms) | Timeout rate | Mean commands | Mean actions | Mean results |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| raw_codex | raw | 4 | 75.0% | 0.6333 | 0.0000-0.9333 | 34,898 | 95,294 | 42,865.5 | 25.0% | 54.25 | 4.50 | 3.25 |
| steel_ai_cli | cli | 5 | 100.0% | 0.8253 | 0.7600-0.9333 | 159,003 | 492,442 | 238,234.0 | 0.0% | 52.00 | 4.40 | 5.00 |
| steel_browsing_skill | skill | 5 | 100.0% | 0.8647 | 0.7600-0.9500 | 287,288 | 907,403 | 348,343.4 | 0.0% | 57.40 | 9.20 | 4.00 |
| steel_cli_huss_agent_cli | cli | 5 | 100.0% | 0.9133 | 0.9000-0.9333 | 105,381 | 246,217 | 146,210.8 | 0.0% | 57.20 | 8.40 | 2.40 |
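The p50 and p95 columns are consistent with a nearest-rank percentile over each scenario's per-run durations. That is an inference from the numbers, not something the report states, and the helper name `nearest_rank` is hypothetical. A sketch that reproduces the table values under that assumption:

```python
import math


def nearest_rank(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Durations in ms, copied from the per-run table (the raw_codex timeout is recorded as 0).
durations_ms = {
    "raw_codex": [0, 95294, 41270, 34898],
    "steel_ai_cli": [159003, 492442, 437313, 42728, 59684],
    "steel_browsing_skill": [34174, 200573, 907403, 287288, 312279],
    "steel_cli_huss_agent_cli": [246217, 207686, 105381, 99474, 72296],
}

# (p50, p95) per scenario.
bands = {
    name: (nearest_rank(vals, 50), nearest_rank(vals, 95))
    for name, vals in durations_ms.items()
}

for name, (p50, p95) in bands.items():
    print(f"{name}: p50={p50:,} ms, p95={p95:,} ms")
```

With four or five runs per scenario, nearest-rank p95 is simply the slowest run. An interpolated median would give 38,084 ms for `raw_codex` instead of 34,898 ms, so the choice of method matters at this sample size.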
- Reliability is high across the dedicated scenarios (100% meaningful success in this sample); `raw_codex` is mixed because one of its four runs timed out (25% timeout rate).
- Speed-wise, p50 ranking is `raw_codex` < `steel_cli_huss_agent_cli` < `steel_ai_cli` < `steel_browsing_skill`; p95 is lower for `raw_codex` and `steel_cli_huss_agent_cli` than for `steel_browsing_skill`.
- Quality ranking by mean: `steel_cli_huss_agent_cli` (0.9133) is best, then `steel_browsing_skill` (0.8647), then `steel_ai_cli` (0.8253), with `raw_codex` below (0.6333, because the timed-out run counts as 0.0000).
- Effort: `steel_browsing_skill` has the highest action footprint and command volume; `steel_ai_cli` is lighter but slower than expected on rep 2 (492,442 ms).
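To make the effect of the timeout on the `raw_codex` quality mean concrete, the arithmetic can be checked with a short sketch. The quality values come from the per-run table; the variable names are illustrative only.

```python
from statistics import mean

# Quality score per rep for raw_codex, copied from the per-run table.
raw_codex_quality = {1: 0.0000, 2: 0.9333, 3: 0.8800, 4: 0.7200}
timed_out_reps = {1}  # rep 1 has status "timeout"

# Mean over all four runs, as reported in the summary table.
mean_all = mean(raw_codex_quality.values())

# Mean over completed runs only, for comparison.
mean_completed = mean(
    q for rep, q in raw_codex_quality.items() if rep not in timed_out_reps
)

print(f"all runs: {mean_all:.4f}, completed runs only: {mean_completed:.4f}")
```

Excluding the timed-out rep raises the mean from 0.6333 to about 0.8444, which would place `raw_codex` between `steel_ai_cli` and `steel_browsing_skill` on quality alone. With only three completed runs, that comparison is indicative at best.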
Across these 19 runs, the benchmark is behaving like a practical quality-vs-stability trade study rather than a strict pass/fail exercise. All three dedicated scenarios reached 100% meaningful success, so the cleanest signal is in quality: it is highest and most consistent in steel_cli_huss_agent_cli, while run-to-run variance is concentrated in steel_ai_cli and especially steel_browsing_skill. In this dataset, raw_codex is useful as a stress signal for behavioral consistency rather than a production-ready lane.
My reading by scenario:
- `steel_cli_huss_agent_cli`: best overall balance here. It is the highest-quality bucket (mean 0.9133) while remaining stable and fully meaningful on all five reps. It also has the lowest median latency among the dedicated scenarios (p50 105,381 ms), which makes it the best default pick when both quality and speed matter.
- `steel_browsing_skill`: strong quality ceiling (0.9500 max rep) but highest command/action footprint and the largest latency tail (907,403 ms at p95). Use this when result richness is the top goal and long-tail runtime is acceptable.
- `steel_ai_cli`: decent quality with wide rep-to-rep variance in this run set (quality 0.7600 to 0.9333, commands 21 to 102). It is lightweight in effort terms and can be efficient, but its behavior is less predictable than that of `steel_cli_huss_agent_cli`.
- `raw_codex`: a mixed experiment channel in this report. It shows both fast successful runs and one timeout (rep 1), so the mean quality is dragged down by an unresolved execution path. It is useful for identifying where workflow constraints and interpretation rules are still affecting consistency.
The common pattern is straightforward: if you need dependable, ranked results with low ambiguity, pick steel_cli_huss_agent_cli; if you need potentially richer browsing depth and can tolerate run-time variance, use steel_browsing_skill; if you need a constrained comparison against unconstrained behavior, keep raw_codex in scope but treat it as a separate reliability tier until repeats are stabilized.
- Keep this narrative as the canonical comparison table set until all scenarios have equal repetition depth.
- For `raw_codex`, re-run a full 5-rep block and confirm judge metadata quality to reduce ambiguity from cross-scenario overlap.
- Publish scenario-level stability bands (p50/p95) after each refresh so table issues can be caught immediately.