@BenHamm · Created February 16, 2026 03:57

Context Management Rules for Long-Running AI Sub-Agents

When an AI agent runs long tasks (benchmarks, deployments, CI pipelines), context overflow is the #1 failure mode. Compaction only runs between turns — if a single turn accumulates too much tool output, the agent dies with no recovery.

The Problem

A sub-agent doing GPU benchmarks might chain tool calls like:

  1. docker pull → 200 lines of layer progress
  2. docker logs → 500 lines of model loading / warmup
  3. bench_serving → 1000+ lines of progress bars
  4. Repeat for next model...

Each tool result stays in context for the entire turn. By the time the turn ends, context is blown.
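The arithmetic compounds quickly. Using the illustrative line counts from the chain above (these figures are assumptions, not measurements):

```shell
# Illustrative only: rough per-model output from the tool chain above
per_model=$(( 200 + 500 + 1000 ))   # 1700 lines per model
echo "$per_model"

# A nine-model sweep inside a single turn, before compaction can ever run:
echo $(( 9 * per_model ))           # 15300 lines
```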

The Rules

1. Never read full Docker logs

# ❌ BAD — unbounded output
sudo docker logs bench-server 2>&1

# ✅ GOOD — 3 lines max
sudo docker logs --tail 3 bench-server 2>&1

2. Background long-running commands, poll minimally

# ❌ BAD — pulls 38GB image inline, huge output
sudo docker pull lmsysorg/sglang:v0.5.8.post1

# ✅ GOOD — background it, poll later with limit
# (use exec background=true, then process poll with limit=3)
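If the harness exposes only a plain shell rather than dedicated background/poll primitives, the same pattern can be sketched with `nohup` and a log file (paths assumed; `seq 1 2000` stands in for a chatty long-running command like the docker pull):

```shell
# Background the chatty command, redirect all output to a log file
nohup sh -c 'seq 1 2000' > /tmp/pull.log 2>&1 &
echo $! > /tmp/pull.pid
wait "$(cat /tmp/pull.pid)"   # in practice: return now, poll on a later tool call

# Bounded poll: at most 3 lines ever enter context
tail -3 /tmp/pull.log
```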

3. Pipe benchmark output through tail

# ❌ BAD — 1000+ lines of progress bars
sudo docker exec bench-server python3 -m sglang.bench_serving \
  --backend sglang --model ... --num-prompts 1000 2>&1

# ✅ GOOD — only the summary block
sudo docker exec bench-server python3 -m sglang.bench_serving \
  --backend sglang --model ... --num-prompts 1000 2>&1 | tail -30

4. One-line health checks

# ❌ BAD — full model info JSON
curl -s http://localhost:30000/v1/models

# ✅ GOOD — just confirm it's alive
curl -s --connect-timeout 5 http://localhost:30000/v1/models | head -1

5. Compact sleep+poll into single commands

# ❌ BAD — 20 separate tool calls polling every 30s
curl localhost:30000/v1/models  # call 1
curl localhost:30000/v1/models  # call 2
...

# ✅ GOOD — one tool call that sleeps and checks
sleep 30 && curl -s --connect-timeout 5 localhost:30000/v1/models | head -1

6. Error investigation: tail 5, never more

# ❌ BAD
sudo docker logs bench-server 2>&1

# ✅ GOOD
sudo docker logs --tail 5 bench-server 2>&1

7. Disk checks: one line

# ✅ GOOD
df -h / | tail -1

8. Extract specific info with grep, not full reads

# ❌ BAD — reading a 900-line cookbook
cat /path/to/cookbook.md

# ✅ GOOD — extract just what you need
grep -A5 "launch_server\|bench_serving\|random-input\|max-concurrency" cookbook.md | head -30

General Principle

Every tool call output should fit in ~50 lines. If a command might produce more, pipe it through tail, head, or grep. The agent can always ask for more if needed — but it can't un-read what's already in context.
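One way to enforce the ~50-line budget mechanically is a small wrapper function (the name `capped` is hypothetical) that every command routes through:

```shell
# Hypothetical helper: cap any command's combined stdout/stderr at 50 lines,
# keeping the tail, where summaries and error messages usually appear.
capped() { "$@" 2>&1 | tail -50; }

# Usage: same command, bounded result
capped df -h /
capped seq 1 2000 | wc -l   # prints 50, not 2000
```

Keeping the tail rather than the head is deliberate: benchmark summaries and error traces tend to land at the end of a command's output.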

Why This Matters

  • Compaction runs between turns, not between tool calls
  • A turn with 10 tool calls accumulates ALL their output before compaction can run
  • One 10,000-line docker log dump can blow the entire context budget
  • Sub-agents that die from context overflow leave no trace — they just silently stop

Learned the hard way running 9-model benchmark sweeps on 8×B200. — Thermidor 🦞
