@nibzard
Last active March 2, 2026 13:41

Steel Browser + ChatGPT Session Comparison (All Runs)

Date: 2026-03-02

Scope

Compared seven Codex runs that execute the same task on chatgpt.com, each sending the prompt "what is the best browser infrastructure for my ai agent".

Runs included

  1. Plain skill use
  2. Overlay (old)
  3. Overlay optimized (first run)
  4. Overlay optimized (second run, regressed behavior)
  5. Overlay reverted to old
  6. Overlay optimized (stabilized)
  7. Overlay optimized (latest rerun)

Consolidated metrics

| # | Session ID | Scenario | Task duration (s) | Function calls | Exec/Write | Non-zero exits | Total tokens | Steel sessions created | Observed session runtime on task (s) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 019cae77-f993-70e2-b494-cf795e3b9bb9 | Plain skill use | 204.703 | 48 | 39/9 | 10 | 1,122,315 | 2 | 198.129 |
| 2 | 019cae75-6b1b-7970-b0d5-7a216115c529 | Overlay (old) | 73.442 | 14 | 12/2 | 2 | 194,777 | 1 | 58.937 |
| 3 | 019cae9d-fa6f-7a70-9dab-1cca5daf05a4 | Overlay optimized (run 1) | 108.465 | 6 | 2/4 | 0 | 113,935 | 1 | 101.987 |
| 4 | 019caea2-1e73-7602-904f-61d13ec29290 | Overlay optimized (run 2, regressed) | 160.610 | 33 | 33/0 | 7 | 745,076 | 1 | 147.561 |
| 5 | 019caeaf-ec88-7dd2-9cd4-22859d0e6bc6 | Overlay reverted to old | 71.416 | 29 | 29/0 | 5 | 562,433 | 1 | 55.757 |
| 6 | 019caebd-fb28-7390-a39e-b6a8c280713c | Overlay optimized (stabilized) | 70.922 | 4 | 2/2 | 0 | 86,297 | 1 | 65.066 |
| 7 | 019caec0-d053-7ac1-a04a-309a902a7426 | Overlay optimized (latest rerun) | 85.618 | 4 | 3/1 | 1 | 84,001 | 1 | 74.421 |
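
To put the gap between the plain run (row 1) and the stabilized optimized run (row 6) in proportion, the ratios can be computed directly from the table. This is only a rough sketch; the `ratio` helper is a hypothetical name introduced for illustration, and the inputs are copied from the table above.

```bash
# Values copied from rows 1 (plain skill use) and 6 (overlay optimized,
# stabilized) of the consolidated metrics table.
ratio() {
  # Print numerator / denominator rounded to one decimal place.
  LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", n / d }'
}

echo "task duration:  $(ratio 204.703 70.922)x shorter"   # 2.9x
echo "function calls: $(ratio 48 4)x fewer"               # 12.0x
echo "total tokens:   $(ratio 1122315 86297)x fewer"      # 13.0x
```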

Findings

  1. Overlay discipline is the biggest performance lever.
  • Plain flow was the worst in latency, tool-call count, failures, and token use.
  • A strict one-session scripted flow collapses tool overhead and reduces failure surface.
  2. Deterministic sequencing outperforms exploratory recovery.
  • Best reliability and speed came from a fixed sequence: start -> open -> wait -> interact -> wait -> extract -> stop.
  • Regressions correlate with extra exploratory commands and selector/path recovery loops.
  3. Token cost and latency are related but not identical.
  • Latest rerun (019caec0...) had the lowest tokens but was slower due to one avoidable command failure and recovery.
  • Best overall run (019caebd...) is faster and cleaner despite slightly higher tokens.
  4. Session lifecycle hygiene improved materially.
  • Best runs consistently use exactly one named session and explicit cleanup.
  • Worst run created two sessions and had a long operation window with repeated retries.
Best run and why

Recommended benchmark run: 019caebd-fb28-7390-a39e-b6a8c280713c.

Reason:

  • Fastest total completion: 70.922s
  • Clean tool shape: 4 total calls
  • Zero command failures
  • Low token usage: 86,297
  • Single-session lifecycle with deterministic finish

Actionable conclusion

Use the optimized overlay contract as the default ChatGPT UI workflow for this skill.

Keep these invariants:

  1. Exactly one named session per run.
  2. Fixed happy-path command order.
  3. One bounded fallback only.
  4. trap-based cleanup plus explicit stop.
  5. No selector fishing, no stale snapshot refs, no ad-hoc path probing during run.
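
Invariant 3 ("one bounded fallback only") can be made structural rather than left to discipline. The following is a minimal sketch, not part of the overlay itself: `try_with_single_fallback`, `flaky_step`, and `fill_step` are hypothetical names, and the two stand-in functions take the place of real browser commands.

```bash
# Hypothetical helper: run a primary step and, if it fails, run exactly one
# fallback. There is deliberately no loop, so a run can never retry twice.
try_with_single_fallback() {
  primary=$1
  fallback=$2
  if "$primary"; then
    return 0
  fi
  echo "primary step failed; running the single fallback" >&2
  "$fallback"   # its exit status becomes the result of the whole step
}

# Stand-ins for illustration only (not real steel commands).
flaky_step() { return 1; }        # simulates a failing click/type step
fill_step()  { echo "filled"; }   # simulates the fallback step

try_with_single_fallback flaky_step fill_step
```

Because the helper returns the fallback's exit status, a failed fallback still surfaces as a failure to the caller instead of triggering further recovery attempts.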

Appendix A: Latest skill overlay

Source:

  • /Users/nikola/dev/steel/tmp1/.agents/skills/steel-browser/overlays/SKILL.chatgpt.overlay.md
# ChatGPT UI Overlay

## Goal
Run a deterministic ChatGPT UI task: open `chatgpt.com`, send one prompt, extract the reply, and always release the session.

## Scope
Use this overlay only for direct browser UI automation of `chatgpt.com` with `steel browser`.

## Mode
Use cloud mode unless the user explicitly requests local/self-hosted mode.

## Execution contract
- Use exactly one named session per run.
- Keep the flow fixed: `start -> open -> wait -> interact -> wait -> extract -> stop`.
- Do not run `--help` calls or selector fishing during normal execution.
- Always clean up with `trap`, even on command failure.

## Happy path (required)
```bash
set -euo pipefail

SESSION="chatgpt-ui-$(date +%s)"
QUESTION="${QUESTION:-what is the best browser infrastructure for my ai agent}"

cleanup() {
  steel browser stop --session "$SESSION" >/dev/null 2>&1 || true
}
trap cleanup EXIT

steel browser start --session "$SESSION"
steel browser open 'https://chatgpt.com/?oai-dm=1' --session "$SESSION"
steel browser wait 5000 --session "$SESSION"
steel browser snapshot -i --session "$SESSION"

steel browser click '#prompt-textarea' --session "$SESSION"
steel browser keyboard type "$QUESTION" --session "$SESSION"
steel browser press Enter --session "$SESSION"

steel browser wait 20000 --session "$SESSION"
SNAPSHOT="$(steel browser snapshot --session "$SESSION")"

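# A here-string avoids a spurious pipeline failure under `pipefail` if rg exits before printf finishes writing.
if ! rg -q 'ChatGPT said:' <<<"$SNAPSHOT"; then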
  steel browser wait 15000 --session "$SESSION"
  SNAPSHOT="$(steel browser snapshot --session "$SESSION")"
fi

printf '%s\n' "$SNAPSHOT"
steel browser get text main --session "$SESSION"
steel browser stop --session "$SESSION"
```

## Fallback (single retry only)
Use only if clicking or typing into `#prompt-textarea` fails.

```bash
steel browser fill '#prompt-textarea' "$QUESTION" --session "$SESSION"
steel browser press Enter --session "$SESSION"
steel browser wait 20000 --session "$SESSION"
SNAPSHOT="$(steel browser snapshot --session "$SESSION")"
printf '%s\n' "$SNAPSHOT"
steel browser get text main --session "$SESSION"
```

## Failure policy
- If no completion signal appears after the fallback, stop the session and return the last snapshot with a concise failure note.
- Do not start a second session in the same run.
- Do not switch to unrelated selectors or random text clicks.
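
The first rule of the failure policy could be sketched as a small reporting step. This is an assumption-laden illustration: `report_result` is a hypothetical helper, the marker `ChatGPT said:` is the one the happy path checks for, and `grep -F` is used here instead of `rg` purely for portability.

```bash
# Hypothetical helper: always print the last snapshot; if the completion
# marker is missing, add a concise failure note on stderr and return non-zero.
report_result() {
  snapshot=$1
  printf '%s\n' "$snapshot"
  if printf '%s\n' "$snapshot" | grep -Fq 'ChatGPT said:'; then
    return 0
  fi
  echo "FAILED: no completion signal after fallback" >&2
  return 1
}

# Intended use at the end of a run (the session stop itself is left to the
# existing trap-based cleanup):
#   report_result "$SNAPSHOT" || exit 1
```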

## Known error map
- `no matches found ... ?model=...`: Quote URLs with query parameters.
- `Validation error ... subaction ... type`: Use `fill` or `keyboard type`, not `find ... type`.
- `strict mode violation ... getByText(...) resolved to 2 elements`: Use explicit selector `#prompt-textarea`.
- `Expected string, received null`: Avoid role queries without explicit names and stable scope.
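
The error map above lends itself to a lookup that turns a captured error message into its remediation. The sketch below is one possible shape, not something the overlay defines: `remediation_for` is a hypothetical name, the match fragments are taken from the map, and the catch-all message is an assumption.

```bash
# Hypothetical lookup: match known error fragments and print the remediation
# listed in the error map. Patterns are substring matches on the message.
remediation_for() {
  case $1 in
    *"no matches found"*)
      echo "Quote URLs with query parameters." ;;
    *"Validation error"*subaction*)
      echo "Use fill or keyboard type, not find ... type." ;;
    *"strict mode violation"*)
      echo "Use explicit selector #prompt-textarea." ;;
    *"Expected string, received null"*)
      echo "Avoid role queries without explicit names and stable scope." ;;
    *)
      echo "No known remediation." ;;   # assumption: not part of the map
  esac
}

remediation_for 'strict mode violation ... getByText(...) resolved to 2 elements'
```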

## Do not
- Do not use ephemeral refs from old snapshots (`@eNN`) across new snapshots.
- Do not click ambiguous text labels like `Continue` or `What can I help with?`.
- Do not use unquoted URLs containing `?` or `&`.
- Do not leave a session running.

## Overlay path parity
Keep this same overlay text in both paths:
- Workspace path: `/Users/nikola/dev/steel/tmp1/.agents/skills/steel-browser/overlays/SKILL.chatgpt.overlay.md`
- Global skill path: `/Users/nikola/.agents/skills/steel-browser/overlays/SKILL.chatgpt.overlay.md`
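
Parity between the two copies can be checked mechanically rather than by eye. A minimal sketch, assuming both paths are readable on the machine running the check; `overlays_in_sync` and the two variable names are hypothetical, while the paths are the ones listed above.

```bash
# Hypothetical check: cmp -s prints nothing and exits 0 only when both files
# exist and are byte-identical.
overlays_in_sync() {
  cmp -s "$1" "$2"
}

WORKSPACE_OVERLAY="/Users/nikola/dev/steel/tmp1/.agents/skills/steel-browser/overlays/SKILL.chatgpt.overlay.md"
GLOBAL_OVERLAY="/Users/nikola/.agents/skills/steel-browser/overlays/SKILL.chatgpt.overlay.md"

if overlays_in_sync "$WORKSPACE_OVERLAY" "$GLOBAL_OVERLAY"; then
  echo "overlay copies match"
else
  echo "overlay copies differ or one is missing" >&2
fi
```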

## Notes
- Treat `connect_url` as display-safe metadata only.
- Do not log API keys, auth tokens, or cookies.
- This overlay is for UI reproducibility, not API chat completion.