When an AI agent runs long tasks (benchmarks, deployments, CI pipelines), context overflow is the #1 failure mode. Compaction only runs between turns — if a single turn accumulates too much tool output, the agent dies with no recovery.
A sub-agent doing GPU benchmarks might chain tool calls like:
- `docker pull` → 200 lines of layer progress
- `docker logs` → 500 lines of model loading / warmup
- `bench_serving` → 1000+ lines of progress bars
- Repeat for the next model...
Each tool result stays in context for the entire turn. By the time the turn ends, context is blown.
```shell
# ❌ BAD — unbounded output
sudo docker logs bench-server 2>&1

# ✅ GOOD — 3 lines max
sudo docker logs --tail 3 bench-server 2>&1
```

```shell
# ❌ BAD — pulls a 38 GB image inline, huge output
sudo docker pull lmsysorg/sglang:v0.5.8.post1

# ✅ GOOD — background it, poll later with a limit
# (use exec background=true, then process poll with limit=3)
```

```shell
# ❌ BAD — 1000+ lines of progress bars
sudo docker exec bench-server python3 -m sglang.bench_serving \
  --backend sglang --model ... --num-prompts 1000 2>&1

# ✅ GOOD — only the summary block
sudo docker exec bench-server python3 -m sglang.bench_serving \
  --backend sglang --model ... --num-prompts 1000 2>&1 | tail -30
```

```shell
# ❌ BAD — full model info JSON
curl -s http://localhost:30000/v1/models

# ✅ GOOD — just confirm it's alive
curl -s --connect-timeout 5 http://localhost:30000/v1/models | head -1
```

```shell
# ❌ BAD — 20 separate tool calls polling every 30s
curl localhost:30000/v1/models  # call 1
curl localhost:30000/v1/models  # call 2
...

# ✅ GOOD — one tool call that sleeps and checks
sleep 30 && curl -s --connect-timeout 5 localhost:30000/v1/models | head -1
```

```shell
# ❌ BAD
sudo docker logs bench-server 2>&1

# ✅ GOOD
sudo docker logs --tail 5 bench-server 2>&1
```

```shell
# ✅ GOOD
df -h / | tail -1
```

```shell
# ❌ BAD — reading a 900-line cookbook
cat /path/to/cookbook.md

# ✅ GOOD — extract just what you need
grep -A5 "launch_server\|bench_serving\|random-input\|max-concurrency" cookbook.md | head -30
```

Every tool call's output should fit in roughly 50 lines. If a command might produce more, pipe it through `tail`, `head`, or `grep`. The agent can always ask for more if needed, but it can't un-read what's already in context.
- Compaction runs between turns, not between tool calls
- A turn with 10 tool calls accumulates ALL their output before compaction can run
- One 10,000-line docker log dump can blow the entire context budget
- Sub-agents that die from context overflow leave no trace — they just silently stop
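The sleep-and-check pattern from the examples generalizes to a single bounded wait. `wait_ready` here is a hypothetical helper, not a real tool; the endpoint and timing are borrowed from the benchmark examples above. However long the server takes, the whole wait costs one tool call and one line of context.

```shell
# Hypothetical helper: poll a URL until it answers, as ONE tool call.
# Total output is a single line, no matter how many checks it takes.
wait_ready() {
  local url="$1" tries="$2" delay="$3" i
  for i in $(seq 1 "$tries"); do
    # any response body at all counts as "alive"
    if curl -s --connect-timeout 5 "$url" | head -1 | grep -q .; then
      echo "ready after $i checks"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after $tries checks"
  return 1
}

# usage: wait_ready http://localhost:30000/v1/models 20 30
```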
Learned the hard way running 9-model benchmark sweeps on 8×B200. — Thermidor 🦞