nibzard/steel-chatgpt-browser-session-comparison.md

# Steel Browser + ChatGPT Session Comparison (All Runs)

Date: 2026-03-02

## Scope

Compared seven Codex runs that execute the same task on chatgpt.com, each submitting the prompt `what is the best browser infrastructure for my ai agent`.

## Runs included

1. Plain skill use
2. Overlay (old)
3. Overlay optimized (first run)
4. Overlay optimized (second run, regressed behavior)
5. Overlay reverted to old
6. Overlay optimized (stabilized)
7. Overlay optimized (latest rerun)

## Consolidated metrics

| # | Session ID | Scenario | Task duration (s) | Function calls | Exec/Write | Non-zero exits | Total tokens | Steel sessions created | Observed session runtime on task (s) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | `019cae77-f993-70e2-b494-cf795e3b9bb9` | Plain skill use | 204.703 | 48 | 39/9 | 10 | 1,122,315 | 2 | 198.129 |
| 2 | `019cae75-6b1b-7970-b0d5-7a216115c529` | Overlay (old) | 73.442 | 14 | 12/2 | 2 | 194,777 | 1 | 58.937 |
| 3 | `019cae9d-fa6f-7a70-9dab-1cca5daf05a4` | Overlay optimized (run 1) | 108.465 | 6 | 2/4 | 0 | 113,935 | 1 | 101.987 |
| 4 | `019caea2-1e73-7602-904f-61d13ec29290` | Overlay optimized (run 2, regressed) | 160.610 | 33 | 33/0 | 7 | 745,076 | 1 | 147.561 |
| 5 | `019caeaf-ec88-7dd2-9cd4-22859d0e6bc6` | Overlay reverted to old | 71.416 | 29 | 29/0 | 5 | 562,433 | 1 | 55.757 |
| 6 | `019caebd-fb28-7390-a39e-b6a8c280713c` | Overlay optimized (stabilized) | 70.922 | 4 | 2/2 | 0 | 86,297 | 1 | 65.066 |
| 7 | `019caec0-d053-7ac1-a04a-309a902a7426` | Overlay optimized (latest rerun) | 85.618 | 4 | 3/1 | 1 | 84,001 | 1 | 74.421 |

## Findings

1. Overlay discipline is the biggest performance lever.
   - The plain flow was the worst in latency, tool-call count, failures, and token use (204.703 s, 48 calls, 10 non-zero exits, 1,122,315 tokens).
   - A strict one-session scripted flow collapses tool overhead and reduces the failure surface.
2. Deterministic sequencing outperforms exploratory recovery.
   - The best reliability and speed came from a fixed sequence: `start -> open -> wait -> interact -> wait -> extract -> stop`.
   - Regressions correlate with extra exploratory commands and selector/path recovery loops.
3. Token cost and latency are related but not identical.
   - The latest rerun (`019caec0...`) had the lowest token count (84,001) but was slower (85.618 s) because of one avoidable command failure and its recovery.
   - The best overall run (`019caebd...`) was faster (70.922 s) and cleaner despite slightly higher token use (86,297).
4. Session lifecycle hygiene improved materially.
   - The best runs consistently use exactly one named session and explicit cleanup.
   - The worst run created two sessions and had a long operation window with repeated retries.
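
The relative gaps behind findings 1 and 3 can be checked directly from the consolidated metrics. The sketch below is illustrative only: `pct_of` is a hypothetical helper name, and the inputs are copied from the table rows for runs 1, 6, and 7.

```bash
# pct_of VALUE BASE: VALUE as a percentage of BASE, one decimal place.
# Hypothetical helper; relies only on POSIX awk.
pct_of() {
  awk -v value="$1" -v base="$2" 'BEGIN { printf "%.1f", 100 * value / base }'
}

# Run 6 (optimized, stabilized) relative to run 1 (plain skill use).
echo "duration: $(pct_of 70.922 204.703)% of plain"
echo "calls:    $(pct_of 4 48)% of plain"
echo "tokens:   $(pct_of 86297 1122315)% of plain"

# Run 7 (latest rerun) relative to run 6: fewer tokens, longer duration.
echo "run 7 tokens:   $(pct_of 84001 86297)% of run 6"
echo "run 7 duration: $(pct_of 85.618 70.922)% of run 6"
```

With these inputs, run 6 takes roughly 35% of the plain run's time and under 8% of its tokens, while run 7 spends about 97% of run 6's tokens but takes about 21% longer.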

## Best run and why

Recommended benchmark run: `019caebd-fb28-7390-a39e-b6a8c280713c`.

Reason:

- Fastest total completion: 70.922 s
- Clean tool shape: 4 total calls
- Zero command failures
- Low token usage: 86,297
- Single-session lifecycle with deterministic finish
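
One way to make this choice reproducible is to filter and sort the metrics mechanically. The rule used here (keep runs with zero non-zero exits, then take the shortest task duration) is an assumption that formalizes the reasons listed above, not a procedure stated in this report; `RUNS` and `best_clean_run` are hypothetical names.

```bash
# Columns: run, task duration (s), function calls, non-zero exits, total tokens.
# Values copied from the consolidated metrics table.
RUNS='1 204.703 48 10 1122315
2 73.442 14 2 194777
3 108.465 6 0 113935
4 160.610 33 7 745076
5 71.416 29 5 562433
6 70.922 4 0 86297
7 85.618 4 1 84001'

# Keep runs with zero non-zero exits, then pick the shortest duration.
best_clean_run() {
  printf '%s\n' "$RUNS" | awk '$4 == 0' | sort -k2,2n | head -n 1 | cut -d' ' -f1
}

echo "benchmark run: #$(best_clean_run)"
```

Only runs 3 and 6 finished without a failed command, so under this rule the selection reduces to comparing their durations.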

## Actionable conclusion

Use the optimized overlay contract as the default ChatGPT UI workflow for this skill.

Keep these invariants:

- Exactly one named session per run.
- Fixed happy-path command order.
- One bounded fallback only.
- `trap`-based cleanup plus an explicit stop.
- No selector fishing, no stale snapshot refs, no ad-hoc path probing during the run.
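
The "one bounded fallback" and `trap`-based cleanup invariants can be sketched as plain shell control flow. This is a minimal illustration, not the overlay itself: every function name below is a hypothetical stand-in, and no real `steel` commands are invoked.

```bash
# Hypothetical stand-ins so the control flow can run anywhere.
type_prompt()     { return 1; }               # simulated primary failure
fill_prompt()     { echo "prompt filled"; }   # simulated fallback success
release_session() { echo "session released"; }

# Run the primary step; if it fails, run the fallback exactly once.
# There is no loop, so a failing fallback propagates to the caller.
run_with_one_fallback() {
  if "$1"; then
    return 0
  fi
  echo "primary step failed; using the single fallback" >&2
  "$2"
}

# The body is a subshell so the EXIT trap fires when the run ends,
# whether the steps succeeded or not.
run_once() (
  trap release_session EXIT
  run_with_one_fallback type_prompt fill_prompt
)

run_once
```

Because the fallback is a single call rather than a retry loop, the worst case is bounded at two attempts per step, and the trap ensures the release step still runs if both fail.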

## Appendix A: Latest skill overlay

Source: `/Users/nikola/dev/steel/tmp1/.agents/skills/steel-browser/overlays/SKILL.chatgpt.overlay.md`

# ChatGPT UI Overlay

## Goal
Run a deterministic ChatGPT UI task: open `chatgpt.com`, send one prompt, extract the reply, and always release the session.

## Scope
Use this overlay only for direct browser UI automation of `chatgpt.com` with `steel browser`.

## Mode
Use cloud mode unless the user explicitly requests local/self-hosted mode.

## Execution contract
- Use exactly one named session per run.
- Keep the flow fixed: `start -> open -> interact -> wait -> extract -> stop`.
- Do not run `--help` calls or selector fishing during normal execution.
- Always clean up with `trap`, even on command failure.

## Happy path (required)
```bash
set -euo pipefail

SESSION="chatgpt-ui-$(date +%s)"
QUESTION="${QUESTION:-what is the best browser infrastructure for my ai agent}"

cleanup() {
  steel browser stop --session "$SESSION" >/dev/null 2>&1 || true
}
trap cleanup EXIT

steel browser start --session "$SESSION"
steel browser open 'https://chatgpt.com/?oai-dm=1' --session "$SESSION"
steel browser wait 5000 --session "$SESSION"
steel browser snapshot -i --session "$SESSION"

steel browser click '#prompt-textarea' --session "$SESSION"
steel browser keyboard type "$QUESTION" --session "$SESSION"
steel browser press Enter --session "$SESSION"

steel browser wait 20000 --session "$SESSION"
SNAPSHOT="$(steel browser snapshot --session "$SESSION")"

if ! printf '%s' "$SNAPSHOT" | rg -q 'ChatGPT said:'; then
  steel browser wait 15000 --session "$SESSION"
  SNAPSHOT="$(steel browser snapshot --session "$SESSION")"
fi

printf '%s\n' "$SNAPSHOT"
steel browser get text main --session "$SESSION"
steel browser stop --session "$SESSION"
```

## Fallback (single retry only)

Use only if clicking or typing into `#prompt-textarea` fails.

```bash
steel browser fill '#prompt-textarea' "$QUESTION" --session "$SESSION"
steel browser press Enter --session "$SESSION"
steel browser wait 20000 --session "$SESSION"
SNAPSHOT="$(steel browser snapshot --session "$SESSION")"
printf '%s\n' "$SNAPSHOT"
steel browser get text main --session "$SESSION"
```
## Failure policy

- If there is no completion signal after the fallback, stop the session and return the last snapshot with a concise failure note.
- Do not start a second session in the same run.
- Do not switch to unrelated selectors or random text clicks.

## Known error map

- `no matches found ... ?model=...`:
  - Quote URLs with query parameters.
- `Validation error ... subaction ... type`:
  - Use `fill` or `keyboard type`, not `find ... type`.
- `strict mode violation ... getByText(...) resolved to 2 elements`:
  - Use explicit selector `#prompt-textarea`.
- `Expected string, received null`:
  - Avoid role queries without explicit names and stable scope.

## Do not

- Do not use ephemeral refs from old snapshots (`@eNN`) across new snapshots.
- Do not click ambiguous text labels like `Continue` or `What can I help with?`.
- Do not use unquoted URLs containing `?` or `&`.
- Do not leave a session running.

## Overlay path parity

Keep this same overlay text in both paths:

- Workspace path: `/Users/nikola/dev/steel/tmp1/.agents/skills/steel-browser/overlays/SKILL.chatgpt.overlay.md`
- Global skill path: `/Users/nikola/.agents/skills/steel-browser/overlays/SKILL.chatgpt.overlay.md`

## Notes

- Treat `connect_url` as display-safe metadata only.
- Do not log API keys, auth tokens, or cookies.
- This overlay is for UI reproducibility, not API chat completion.