Date: 2026-03-02
Compared seven Codex runs that execute the same task on `chatgpt.com`: sending the prompt `what is the best browser infrastructure for my ai agent`. The scenarios were:
- Plain skill use
- Overlay (old)
- Overlay optimized (first run)
- Overlay optimized (second run, regressed behavior)
- Overlay reverted to old
- Overlay optimized (stabilized)
- Overlay optimized (latest rerun)
| # | Session ID | Scenario | Task duration (s) | Function calls | Exec/Write | Non-zero exits | Total tokens | Steel sessions created | Observed session runtime on task (s) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 019cae77-f993-70e2-b494-cf795e3b9bb9 | Plain skill use | 204.703 | 48 | 39/9 | 10 | 1,122,315 | 2 | 198.129 |
| 2 | 019cae75-6b1b-7970-b0d5-7a216115c529 | Overlay (old) | 73.442 | 14 | 12/2 | 2 | 194,777 | 1 | 58.937 |
| 3 | 019cae9d-fa6f-7a70-9dab-1cca5daf05a4 | Overlay optimized (run 1) | 108.465 | 6 | 2/4 | 0 | 113,935 | 1 | 101.987 |
| 4 | 019caea2-1e73-7602-904f-61d13ec29290 | Overlay optimized (run 2, regressed) | 160.610 | 33 | 33/0 | 7 | 745,076 | 1 | 147.561 |
| 5 | 019caeaf-ec88-7dd2-9cd4-22859d0e6bc6 | Overlay reverted to old | 71.416 | 29 | 29/0 | 5 | 562,433 | 1 | 55.757 |
| 6 | 019caebd-fb28-7390-a39e-b6a8c280713c | Overlay optimized (stabilized) | 70.922 | 4 | 2/2 | 0 | 86,297 | 1 | 65.066 |
| 7 | 019caec0-d053-7ac1-a04a-309a902a7426 | Overlay optimized (latest rerun) | 85.618 | 4 | 3/1 | 1 | 84,001 | 1 | 74.421 |
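To compare the rows at a glance, the following sketch expresses each run's duration and token count as a percentage of the plain-skill baseline (run 1). The figures are copied from the table; the function names `runs_tsv` and `relative_to_baseline` and the tab-separated layout are illustrative assumptions, not part of the benchmark tooling.

```bash
#!/usr/bin/env bash
# Figures copied from the table: run number, task duration (s), total tokens.
runs_tsv() {
  printf '%s\t%s\t%s\n' \
    1 204.703 1122315 \
    2 73.442 194777 \
    3 108.465 113935 \
    4 160.610 745076 \
    5 71.416 562433 \
    6 70.922 86297 \
    7 85.618 84001
}

# Print each run's duration and tokens as a percentage of the first row (run 1).
relative_to_baseline() {
  awk -F'\t' '
    NR == 1 { base_s = $2; base_tok = $3 }
    { printf "%d\t%.1f%%\t%.1f%%\n", $1, 100 * $2 / base_s, 100 * $3 / base_tok }
  '
}

runs_tsv | relative_to_baseline
```

On these figures, run 6 lands at roughly 35% of the baseline duration and under 8% of the baseline token count.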
- Overlay discipline is the biggest performance lever.
- Plain flow was the worst in latency, tool-call count, failures, and token use.
- A strict one-session scripted flow collapses tool overhead and reduces failure surface.
- Deterministic sequencing outperforms exploratory recovery.
- Best reliability and speed came from the fixed sequence `start -> open -> wait -> interact -> wait -> extract -> stop`.
- Regressions correlate with extra exploratory commands and selector/path recovery loops.
- Token cost and latency are related but not identical.
- Latest rerun (`019caec0...`) had the lowest tokens but was slower due to one avoidable command failure and recovery.
- Best overall run (`019caebd...`) is faster and cleaner despite slightly higher tokens.
- Session lifecycle hygiene improved materially.
- Best runs consistently use exactly one named session and explicit cleanup.
- Worst run created two sessions and had a long operation window with repeated retries.
Recommended benchmark run: 019caebd-fb28-7390-a39e-b6a8c280713c.
Reason:
- Fastest total completion: `70.922s`
- Clean tool shape: `4` total calls
- Zero command failures
- Low token usage: `86,297`
- Single-session lifecycle with deterministic finish
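The selection reasoning above can be sketched as a filter followed by a sort: keep only runs with zero non-zero exits, then take the fastest. The data is copied from the results table (session IDs shortened to their first segment); the function names `candidates` and `pick_benchmark` are hypothetical, and this is one possible way to encode the criteria rather than the procedure actually used.

```bash
#!/usr/bin/env bash
# Columns: session ID prefix, duration (s), function calls, non-zero exits, total tokens.
candidates() {
  cat <<'EOF'
019cae77 204.703 48 10 1122315
019cae75 73.442 14 2 194777
019cae9d 108.465 6 0 113935
019caea2 160.610 33 7 745076
019caeaf 71.416 29 5 562433
019caebd 70.922 4 0 86297
019caec0 85.618 4 1 84001
EOF
}

# Keep runs with zero non-zero exits, sort by duration, print the fastest ID.
pick_benchmark() {
  awk '$4 == 0' | sort -k2,2n | head -n 1 | awk '{ print $1 }'
}

candidates | pick_benchmark
```

Only two runs pass the zero-failure filter, and `019caebd` is the faster of the two, which matches the recommendation.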
Use the optimized overlay contract as the default ChatGPT UI workflow for this skill.
Keep these invariants:
- Exactly one named session per run.
- Fixed happy-path command order.
- One bounded fallback only.
- `trap`-based cleanup plus explicit `stop`.
- No selector fishing, no stale snapshot refs, no ad-hoc path probing during run.
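The "one bounded fallback" invariant can be illustrated with a small wrapper that tries a primary step, runs a single fallback on failure, and never loops. This is a minimal sketch: `run_with_one_fallback`, `type_prompt`, and `fill_prompt` are hypothetical names, and the two step functions are stand-ins that simulate a failure and a success instead of calling the browser CLI.

```bash
#!/usr/bin/env bash
# Run a primary step; if it fails, run exactly one fallback and stop there.
# Both arguments are names of shell functions or commands.
run_with_one_fallback() {
  local primary="$1" fallback="$2"
  if "$primary"; then
    return 0
  fi
  echo "primary step failed; trying the single fallback" >&2
  "$fallback"   # no loop: a second failure propagates to the caller
}

# Stand-in steps for illustration only.
type_prompt() { return 1; }        # simulates the primary input path failing
fill_prompt() { echo "filled"; }   # simulates the fallback succeeding

run_with_one_fallback type_prompt fill_prompt
```

Because the wrapper has no retry loop, the worst case is two attempts per step, which keeps the call count and the failure surface bounded.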
Source: `/Users/nikola/dev/steel/tmp1/.agents/skills/steel-browser/overlays/SKILL.chatgpt.overlay.md`
# ChatGPT UI Overlay
## Goal
Run a deterministic ChatGPT UI task: open `chatgpt.com`, send one prompt, extract the reply, and always release the session.
## Scope
Use this overlay only for direct browser UI automation of `chatgpt.com` with `steel browser`.
## Mode
Use cloud mode unless the user explicitly requests local/self-hosted mode.
## Execution contract
- Use exactly one named session per run.
- Keep the flow fixed: `start -> open -> wait -> interact -> wait -> extract -> stop`.
- Do not run `--help` calls or selector fishing during normal execution.
- Always clean up with `trap`, even on command failure.
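The cleanup rule relies on a standard shell behavior: an `EXIT` trap fires even when `set -e` aborts the script on a failed command. The sketch below demonstrates this in a child `bash` process so the simulated failure cannot terminate the calling shell. `release_session` is a stand-in for the real stop command, and the variable names are illustrative.

```bash
#!/usr/bin/env bash
# The demo script runs in a child bash process; its failure stays contained.
demo_script='
set -euo pipefail
release_session() { echo "cleanup ran"; }   # stand-in for the real stop command
trap release_session EXIT
echo "step ok"
false                # simulated command failure: the script aborts here
echo "never reached"
'

# Capture the child output and remember its exit status if it is non-zero.
demo_output="$(bash -c "$demo_script")" || demo_status=$?
printf '%s\n' "$demo_output"
echo "exit status: ${demo_status:-0}"
```

The captured output contains `step ok` followed by `cleanup ran`, and the child exits non-zero, so the failure still reaches the caller while the cleanup has already run.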
## Happy path (required)
```bash
set -euo pipefail
SESSION="chatgpt-ui-$(date +%s)"
QUESTION="${QUESTION:-what is the best browser infrastructure for my ai agent}"
cleanup() {
  steel browser stop --session "$SESSION" >/dev/null 2>&1 || true
}
trap cleanup EXIT
steel browser start --session "$SESSION"
steel browser open 'https://chatgpt.com/?oai-dm=1' --session "$SESSION"
steel browser wait 5000 --session "$SESSION"
steel browser snapshot -i --session "$SESSION"
steel browser click '#prompt-textarea' --session "$SESSION"
steel browser keyboard type "$QUESTION" --session "$SESSION"
steel browser press Enter --session "$SESSION"
steel browser wait 20000 --session "$SESSION"
SNAPSHOT="$(steel browser snapshot --session "$SESSION")"
if ! printf '%s' "$SNAPSHOT" | rg -q 'ChatGPT said:'; then
steel browser wait 15000 --session "$SESSION"
SNAPSHOT="$(steel browser snapshot --session "$SESSION")"
fi
printf '%s\n' "$SNAPSHOT"
steel browser get text main --session "$SESSION"
steel browser stop --session "$SESSION"
```
Use this fallback only if clicking or typing into `#prompt-textarea` fails.
```bash
steel browser fill '#prompt-textarea' "$QUESTION" --session "$SESSION"
steel browser press Enter --session "$SESSION"
steel browser wait 20000 --session "$SESSION"
SNAPSHOT="$(steel browser snapshot --session "$SESSION")"
printf '%s\n' "$SNAPSHOT"
steel browser get text main --session "$SESSION"
```
- If there is no completion signal after the fallback, stop the session and return the last snapshot with a concise failure note.
- Do not start a second session in the same run.
- Do not switch to unrelated selectors or random text clicks.
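The failure-handling rule can be sketched as a small reporting helper: always emit the last snapshot, and attach a short failure note when the completion marker is absent. The marker string `ChatGPT said:` comes from the happy path; the function name `report_snapshot` and the wording of the note are assumptions, and `grep` is used here in place of `rg` only for portability.

```bash
#!/usr/bin/env bash
# Print the snapshot; return non-zero with a short note on stderr when the
# completion marker is missing.
report_snapshot() {
  local snapshot="$1"
  printf '%s\n' "$snapshot"
  if grep -q 'ChatGPT said:' <<<"$snapshot"; then
    return 0
  fi
  echo "FAILED: no completion signal after fallback" >&2
  return 1
}

# Example with a snapshot that contains the marker.
report_snapshot 'ChatGPT said: example reply'
```

Keeping the note on stderr leaves stdout as the raw snapshot, so a caller can still inspect the page state after a failed run.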
- `no matches found ... ?model=...`: Quote URLs with query parameters.
- `Validation error ... subaction ... type`: Use `fill` or `keyboard type`, not `find ... type`.
- `strict mode violation ... getByText(...) resolved to 2 elements`: Use explicit selector `#prompt-textarea`.
- `Expected string, received null`: Avoid role queries without explicit names and stable scope.
- Do not use ephemeral refs from old snapshots (`@eNN`) across new snapshots.
- Do not click ambiguous text labels like `Continue` or `What can I help with?`.
- Do not use unquoted URLs containing `?` or `&`.
- Do not leave a session running.
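The URL-quoting rule is easy to check mechanically before a run. The sketch below scans command text on stdin for a URL that starts directly after whitespace (no opening quote) and contains `?` or `&`. The function name `check_url_quoting` and the regular expression are assumptions; the pattern is a heuristic for this one anti-pattern, not a full shell parser, and the `model=example` value in the demo is made up for illustration.

```bash
#!/usr/bin/env bash
# Read command lines on stdin; list lines with an unquoted URL that carries
# query characters, and return non-zero if any are found.
check_url_quoting() {
  local pattern="(^|[[:space:]])https?://[^[:space:]'\"]*[?&]"
  if grep -nE "$pattern"; then
    return 1
  fi
  return 0
}

# Demo: the first line is flagged, the quoted second line is not.
check_url_quoting <<'EOF' || echo "fix the lines listed above"
steel browser open https://chatgpt.com/?model=example --session demo
steel browser open 'https://chatgpt.com/?oai-dm=1' --session demo
EOF
```

Unquoted `?` can trigger filename globbing (the source of the `no matches found` error in shells such as zsh), and an unquoted `&` ends the command and backgrounds it, so quoting the whole URL avoids both.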
Keep this same overlay text in both paths:
- Workspace path: `/Users/nikola/dev/steel/tmp1/.agents/skills/steel-browser/overlays/SKILL.chatgpt.overlay.md`
- Global skill path: `/Users/nikola/.agents/skills/steel-browser/overlays/SKILL.chatgpt.overlay.md`
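One way to confirm the two copies have not drifted is a byte-for-byte comparison with `cmp -s`. This is a minimal sketch: the function name `overlays_in_sync` is hypothetical, and the two paths are passed in as arguments and not hard-coded.

```bash
#!/usr/bin/env bash
# Succeed only when both overlay copies are byte-identical.
overlays_in_sync() {
  local workspace_copy="$1" global_copy="$2"
  if cmp -s "$workspace_copy" "$global_copy"; then
    echo "in sync"
  else
    echo "overlay copies differ (or one is missing)" >&2
    return 1
  fi
}
```

Call it with the workspace path and the global skill path as the two arguments; a non-zero exit means one copy needs to be refreshed from the other.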
- Treat `connect_url` as display-safe metadata only.
- Do not log API keys, auth tokens, or cookies.
- This overlay is for UI reproducibility, not API chat completion.