nibzard/benchmark_report_20260225_151602_384_fixed.md

## benchmark_report_20260225_151602_384_fixed.md

      
Benchmark Narrative Report

Run ID: 20260225_151602_384
Source files

results/summary/20260225_151602_384/metrics.jsonl
results/summary/20260225_151602_384/metrics.csv

Executive summary

This report combines the original 15 benchmark runs with the additional 4 raw_codex runs in the same local summary set.

Total runs: 19
Window: 2026-02-25 15:18:41.446Z to 2026-02-25 17:08:55.905Z UTC
Scenarios covered: 4

Methodology


Parse every line in metrics.jsonl as JSON and use the same run set as the CSV counterpart.
Define meaningful reliability as meaningful_success=true.
For speed: compute duration median (p50) and p95 from each scenario's run set.
For effort: report mean command_executions and mean action_count per scenario.
Keep judge metadata (judge_verdict, judge_status) from the same rows; do not reinterpret outside existing judgments.
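
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling that produced the report: the field names `scenario` and `duration_ms` are assumptions (the report names `meaningful_success`, `command_executions`, and `action_count` but does not show the JSONL schema), and the nearest-rank percentile is inferred from the reported values rather than stated in the source.

```python
import json
import math
from collections import defaultdict
from statistics import mean


def nearest_rank(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate(jsonl_lines):
    """Group run records by scenario and compute reliability, speed, and effort."""
    by_scenario = defaultdict(list)
    for line in jsonl_lines:
        if line.strip():
            row = json.loads(line)
            by_scenario[row["scenario"]].append(row)  # "scenario" key is assumed

    summary = {}
    for scenario, rows in by_scenario.items():
        durations = [r["duration_ms"] for r in rows]  # "duration_ms" key is assumed
        summary[scenario] = {
            "runs": len(rows),
            "success_rate": sum(1 for r in rows if r["meaningful_success"]) / len(rows),
            "p50_ms": nearest_rank(durations, 50),
            "p95_ms": nearest_rank(durations, 95),
            "mean_commands": mean(r["command_executions"] for r in rows),
            "mean_actions": mean(r["action_count"] for r in rows),
        }
    return summary


# The four raw_codex rows from the run table, re-encoded as JSONL for illustration.
sample = [
    '{"scenario": "raw_codex", "meaningful_success": false, "duration_ms": 0, "command_executions": 64, "action_count": 5}',
    '{"scenario": "raw_codex", "meaningful_success": true, "duration_ms": 95294, "command_executions": 54, "action_count": 4}',
    '{"scenario": "raw_codex", "meaningful_success": true, "duration_ms": 41270, "command_executions": 48, "action_count": 4}',
    '{"scenario": "raw_codex", "meaningful_success": true, "duration_ms": 34898, "command_executions": 51, "action_count": 5}',
]
raw_codex_summary = aggregate(sample)["raw_codex"]
print(raw_codex_summary)
```

With only four or five runs per scenario, nearest-rank p95 is simply the slowest run, so the p95 column in this report is best read as a maximum rather than a tail estimate.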

Run table (all runs)


| Scenario | Rep | Run ID | Type | Status | Meaningful | Quality | Duration (ms) | Duration (s) | Result | Matched | Commands | Actions | Timed out | Judge verdict |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| steel_ai_cli | 1 | `20260225_151602_412_steel_ai_cli_booking_com_r1_i1_a1` | cli | success | yes | 0.8000 | 159,003 | 159.00 | 5 | 5 | 34 | 5 | no | |
| steel_ai_cli | 2 | `20260225_151841_446_steel_ai_cli_booking_com_r2_i1_a1` | cli | success | yes | 0.9333 | 492,442 | 492.44 | 6 | 6 | 81 | 3 | no | |
| steel_ai_cli | 3 | `20260225_152653_894_steel_ai_cli_booking_com_r3_i1_a1` | cli | success | yes | 0.8333 | 437,313 | 437.31 | 6 | 6 | 102 | 3 | no | |
| steel_ai_cli | 4 | `20260225_153411_211_steel_ai_cli_booking_com_r4_i1_a1` | cli | success | yes | 0.7600 | 42,728 | 42.73 | 5 | 5 | 21 | 8 | no | |
| steel_ai_cli | 5 | `20260225_153453_944_steel_ai_cli_booking_com_r5_i1_a1` | cli | success | yes | 0.8000 | 59,684 | 59.68 | 3 | 3 | 22 | 3 | no | |
| steel_browsing_skill | 1 | `20260225_153553_632_steel_browsing_skill_booking_com_r1_i1_a1` | skill | success | yes | 0.8667 | 34,174 | 34.17 | 3 | 3 | 21 | 6 | no | |
| steel_browsing_skill | 2 | `20260225_153627_810_steel_browsing_skill_booking_com_r2_i1_a1` | skill | success | yes | 0.7600 | 200,573 | 200.57 | 5 | 5 | 56 | 3 | no | |
| steel_browsing_skill | 3 | `20260225_153948_386_steel_browsing_skill_booking_com_r3_i1_a1` | skill | success | yes | 0.8667 | 907,403 | 907.40 | 3 | 3 | 114 | 12 | no | |
| steel_browsing_skill | 4 | `20260225_155455_808_steel_browsing_skill_booking_com_r4_i1_a1` | skill | success | yes | 0.8800 | 287,288 | 287.29 | 5 | 5 | 50 | 8 | no | |
| steel_browsing_skill | 5 | `20260225_155943_100_steel_browsing_skill_booking_com_r5_i1_a1` | skill | success | yes | 0.9500 | 312,279 | 312.28 | 4 | 4 | 46 | 17 | no | |
| steel_cli_huss_agent_cli | 1 | `20260225_160455_382_steel_cli_huss_agent_cli_booking_com_r1_i1_a1` | cli | success | yes | 0.9333 | 246,217 | 246.22 | 3 | 3 | 81 | 12 | no | |
| steel_cli_huss_agent_cli | 2 | `20260225_160901_614_steel_cli_huss_agent_cli_booking_com_r2_i1_a1` | cli | success | yes | 0.9333 | 207,686 | 207.69 | 3 | 3 | 71 | 21 | no | |
| steel_cli_huss_agent_cli | 3 | `20260225_161229_306_steel_cli_huss_agent_cli_booking_com_r3_i1_a1` | cli | success | yes | 0.9000 | 105,381 | 105.38 | 2 | 2 | 41 | 3 | no | |
| steel_cli_huss_agent_cli | 4 | `20260225_161414_692_steel_cli_huss_agent_cli_booking_com_r4_i1_a1` | cli | success | yes | 0.9000 | 99,474 | 99.47 | 2 | 2 | 40 | 3 | no | |
| steel_cli_huss_agent_cli | 5 | `20260225_161554_169_steel_cli_huss_agent_cli_booking_com_r5_i1_a1` | cli | success | yes | 0.9000 | 72,296 | 72.30 | 2 | 2 | 53 | 3 | no | |
| raw_codex | 1 | `20260225_165227_203_raw_codex_booking_com_r1_i1_a1` | raw | timeout | no | 0.0000 | 0 | 0.00 | 0 | 0 | 64 | 5 | yes | unclear |
| raw_codex | 2 | `20260225_165531_266_raw_codex_booking_com_r2_i1_a1` | raw | success | yes | 0.9333 | 95,294 | 95.29 | 3 | 3 | 54 | 4 | no | unclear |
| raw_codex | 3 | `20260225_165709_436_raw_codex_booking_com_r3_i1_a1` | raw | success | yes | 0.8800 | 41,270 | 41.27 | 5 | 5 | 48 | 4 | no | true |
| raw_codex | 4 | `20260225_165753_474_raw_codex_booking_com_r4_i1_a1` | raw | success | yes | 0.7200 | 34,898 | 34.90 | 5 | 5 | 51 | 5 | no | unclear |


Scenario aggregation


| Scenario | Type | Runs | Success rate | Mean quality | Quality range | p50 duration (ms) | p95 duration (ms) | Mean duration (ms) | Timeout rate | Mean commands | Mean actions | Mean results |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| raw_codex | raw | 4 | 75.0% | 0.6333 | 0.0000-0.9333 | 34,898 | 95,294 | 42,865.5 | 25.0% | 54.25 | 4.50 | 3.25 |
| steel_ai_cli | cli | 5 | 100.0% | 0.8253 | 0.7600-0.9333 | 159,003 | 492,442 | 238,234.0 | 0.0% | 52.00 | 4.40 | 5.00 |
| steel_browsing_skill | skill | 5 | 100.0% | 0.8647 | 0.7600-0.9500 | 287,288 | 907,403 | 348,343.4 | 0.0% | 57.40 | 9.20 | 4.00 |
| steel_cli_huss_agent_cli | cli | 5 | 100.0% | 0.9133 | 0.9000-0.9333 | 105,381 | 246,217 | 146,210.8 | 0.0% | 57.20 | 8.40 | 2.40 |

  
Comparative findings


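Reliability is high across the dedicated scenarios (100% meaningful success in this sample); raw_codex is mixed because one of its four runs timed out (25% timeout rate, 75% success).
Speed-wise, p50 ranking is raw_codex < steel_cli_huss_agent_cli < steel_ai_cli < steel_browsing_skill (34,898 < 105,381 < 159,003 < 287,288 ms); p95 is lower for raw_codex (95,294 ms) and steel_cli_huss_agent_cli (246,217 ms) than for steel_ai_cli (492,442 ms) and steel_browsing_skill (907,403 ms). With at most five runs per scenario, each p95 equals that scenario's slowest run.
Quality ranking by mean: steel_cli_huss_agent_cli (0.9133) is best, then steel_browsing_skill (0.8647), then steel_ai_cli (0.8253), with raw_codex last (0.6333, because the timed-out run is included with quality 0.0000).
Effort: steel_browsing_skill has the highest mean action count (9.20) and command volume (57.40), narrowly ahead of steel_cli_huss_agent_cli (8.40 and 57.20); steel_ai_cli is lighter (52.00 commands, 4.40 actions) but slower than expected on rep 2 (492,442 ms).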
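
The rankings can be reproduced directly from the scenario aggregation table. The snippet below is a minimal sketch with the table values copied in by hand; the dictionary layout and the `rank` helper are illustrative names, not part of the benchmark tooling.

```python
# Values copied from the scenario aggregation table (durations in ms).
aggregates = {
    "raw_codex": {"p50_ms": 34898, "p95_ms": 95294, "mean_quality": 0.6333},
    "steel_ai_cli": {"p50_ms": 159003, "p95_ms": 492442, "mean_quality": 0.8253},
    "steel_browsing_skill": {"p50_ms": 287288, "p95_ms": 907403, "mean_quality": 0.8647},
    "steel_cli_huss_agent_cli": {"p50_ms": 105381, "p95_ms": 246217, "mean_quality": 0.9133},
}


def rank(metric, reverse=False):
    """Return scenario names ordered by the given metric (ascending by default)."""
    return sorted(aggregates, key=lambda name: aggregates[name][metric], reverse=reverse)


fastest_p50 = rank("p50_ms")                       # lower is better
fastest_p95 = rank("p95_ms")                       # lower is better
best_quality = rank("mean_quality", reverse=True)  # higher is better

print("p50:     " + " < ".join(fastest_p50))
print("p95:     " + " < ".join(fastest_p95))
print("quality: " + " > ".join(best_quality))
```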

General narrative

Across these 19 runs, the benchmark is behaving like a practical quality-vs-stability trade study rather than a strict pass/fail exercise. All three dedicated scenarios are equally reliable (100% meaningful success), so the cleanest signal is in quality: steel_cli_huss_agent_cli has the highest mean and the tightest range (0.9000-0.9333), while rep-to-rep variance is concentrated in steel_ai_cli and especially steel_browsing_skill. In this dataset, raw_codex is useful as a stress signal for behavioral consistency rather than a production-ready lane.
My reading by scenario:

steel_cli_huss_agent_cli: best overall balance here. It is the highest-quality bucket (mean 0.9133) while remaining stable and fully meaningful on all five reps. It also has the lowest median latency among the dedicated scenarios (p50 105,381 ms), which makes it the best default pick when both quality and speed matter.
steel_browsing_skill: strong quality ceiling (0.9500 on rep 5) but the highest command/action footprint and the largest latency tail (907,403 ms at p95). Use this when result richness is the top goal and long-tail runtime is acceptable.
steel_ai_cli: decent quality (mean 0.8253) with wide rep-to-rep variance in quality (0.7600-0.9333), commands (21-102), and duration (42,728-492,442 ms). It has the lightest effort footprint of the dedicated scenarios and can be efficient, but its behavior is less predictable than that of steel_cli_huss_agent_cli.
raw_codex: a mixed experiment channel in this report. It shows both fast successful runs and one timeout (rep 1, quality 0.0000, not meaningful), so the mean quality is dragged down by an unresolved execution path. It is useful for identifying where workflow constraints and interpretation rules are still affecting consistency.
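
To see how much the single timeout pulls down the raw_codex mean, the quality scores from the run table can be averaged with and without rep 1. This is a worked illustration only: the report's aggregation includes the timed-out run, and the "completed runs only" figure is an alternative view, not a number taken from the source.

```python
from statistics import mean

# raw_codex quality scores by rep, copied from the run table.
raw_codex_quality = {1: 0.0000, 2: 0.9333, 3: 0.8800, 4: 0.7200}
timed_out_reps = {1}  # rep 1 has status "timeout"

# Mean as reported: every run counts, including the timeout at 0.0000.
mean_all = mean(raw_codex_quality.values())

# Alternative view (not the report's method): completed runs only.
mean_completed = mean(
    quality for rep, quality in raw_codex_quality.items() if rep not in timed_out_reps
)

print(f"all runs:       {mean_all:.4f}")
print(f"completed only: {mean_completed:.4f}")
```

On completed runs alone, raw_codex would sit between steel_ai_cli and steel_browsing_skill on mean quality, which is why the timeout rate should be read alongside the quality figure rather than folded into it silently.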

The common pattern is straightforward: if you need dependable, ranked results with low ambiguity, pick steel_cli_huss_agent_cli; if you need potentially richer browsing depth and can tolerate run-time variance, use steel_browsing_skill; if you need a constrained comparison against unconstrained behavior, keep raw_codex in scope but treat it as a separate reliability tier until repeats are stabilized.
Recommendations


Keep this narrative as the canonical comparison table set until all scenarios have equal repetition depth.
For raw_codex, re-run a full 5-rep block and confirm judge metadata quality to reduce ambiguity from cross-scenario overlap.
Publish scenario-level stability bands (p50/p95) after each refresh so table issues can be caught immediately.
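
One way to catch table issues after a refresh is to recompute each stability band from the per-run durations and compare it with the published aggregate. The sketch below assumes nearest-rank percentiles (consistent with the values in this report, but not confirmed by the source); `find_mismatches` and its tolerance are hypothetical names chosen for illustration.

```python
import math
from statistics import mean


def nearest_rank(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def find_mismatches(durations_ms, published, tolerance=0.05):
    """Return {metric: (recomputed, published)} for every metric that disagrees."""
    recomputed = {
        "p50_ms": nearest_rank(durations_ms, 50),
        "p95_ms": nearest_rank(durations_ms, 95),
        "mean_ms": mean(durations_ms),
    }
    return {
        key: (recomputed[key], published[key])
        for key in recomputed
        if abs(recomputed[key] - published[key]) > tolerance
    }


# steel_cli_huss_agent_cli durations (ms) from the run table, reps 1 to 5.
huss_durations = [246217, 207686, 105381, 99474, 72296]
# Published aggregate row for the same scenario.
huss_published = {"p50_ms": 105381, "p95_ms": 246217, "mean_ms": 146210.8}

clean = find_mismatches(huss_durations, huss_published)
# Simulate a stale table where another scenario's p50 was pasted in by mistake.
stale = find_mismatches(huss_durations, {**huss_published, "p50_ms": 159003})

print(clean)
print(stale)
```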