@BenHamm · Created February 16, 2026 03:57

Context Management Rules for Long-Running AI Sub-Agents

When an AI agent runs long tasks (benchmarks, deployments, CI pipelines), context overflow is the #1 failure mode. Compaction only runs between turns — if a single turn accumulates too much tool output, the agent dies with no recovery.

The Problem

A sub-agent doing GPU benchmarks might chain tool calls like:

  1. docker pull → 200 lines of layer progress
  2. docker logs → 500 lines of model loading / warmup
  3. bench_serving → 1000+ lines of progress bars
  4. Repeat for next model...

Each tool result stays in context for the entire turn. By the time the turn ends, context is blown.
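The arithmetic compounds quickly. Using the illustrative line counts from the chain above (these figures are assumptions, not measurements):

```shell
# Illustrative only: rough per-model output from the tool chain above
per_model=$(( 200 + 500 + 1000 ))   # 1700 lines per model
echo "$per_model"

# A nine-model sweep inside a single turn, before compaction can ever run:
echo $(( 9 * per_model ))           # 15300 lines
```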

The Rules

1. Never read full Docker logs

# ❌ BAD — unbounded output
sudo docker logs bench-server 2>&1

# ✅ GOOD — 3 lines max
sudo docker logs --tail 3 bench-server 2>&1

2. Background long-running commands, poll minimally

# ❌ BAD — pulls 38GB image inline, huge output
sudo docker pull lmsysorg/sglang:v0.5.8.post1

# ✅ GOOD — background it, poll later with limit
# (use exec background=true, then process poll with limit=3)
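If the harness exposes only a plain shell rather than dedicated background/poll primitives, the same pattern can be sketched with `nohup` and a log file (paths assumed; `seq 1 2000` stands in for a chatty long-running command like the docker pull):

```shell
# Background the chatty command, redirect all output to a log file
nohup sh -c 'seq 1 2000' > /tmp/pull.log 2>&1 &
echo $! > /tmp/pull.pid
wait "$(cat /tmp/pull.pid)"   # in practice: return now, poll on a later tool call

# Bounded poll: at most 3 lines ever enter context
tail -3 /tmp/pull.log
```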

3. Pipe benchmark output through tail

# ❌ BAD — 1000+ lines of progress bars
sudo docker exec bench-server python3 -m sglang.bench_serving \
  --backend sglang --model ... --num-prompts 1000 2>&1

# ✅ GOOD — only the summary block
sudo docker exec bench-server python3 -m sglang.bench_serving \
  --backend sglang --model ... --num-prompts 1000 2>&1 | tail -30

4. One-line health checks

# ❌ BAD — full model info JSON
curl -s http://localhost:30000/v1/models

# ✅ GOOD — just confirm it's alive
curl -s --connect-timeout 5 http://localhost:30000/v1/models | head -1

5. Compact sleep+poll into single commands

# ❌ BAD — 20 separate tool calls polling every 30s
curl localhost:30000/v1/models  # call 1
curl localhost:30000/v1/models  # call 2
...

# ✅ GOOD — one tool call that sleeps and checks
sleep 30 && curl -s --connect-timeout 5 localhost:30000/v1/models | head -1

6. Error investigation: tail 5, never more

# ❌ BAD
sudo docker logs bench-server 2>&1

# ✅ GOOD
sudo docker logs --tail 5 bench-server 2>&1

7. Disk checks: one line

# ✅ GOOD
df -h / | tail -1

8. Extract specific info with grep, not full reads

# ❌ BAD — reading a 900-line cookbook
cat /path/to/cookbook.md

# ✅ GOOD — extract just what you need
grep -A5 "launch_server\|bench_serving\|random-input\|max-concurrency" cookbook.md | head -30

General Principle

Every tool call output should fit in ~50 lines. If a command might produce more, pipe it through tail, head, or grep. The agent can always ask for more if needed — but it can't un-read what's already in context.
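One way to enforce the ~50-line budget mechanically is a small wrapper function (the name `capped` is hypothetical) that every command routes through:

```shell
# Hypothetical helper: cap any command's combined stdout/stderr at 50 lines,
# keeping the tail, where summaries and error messages usually appear.
capped() { "$@" 2>&1 | tail -50; }

# Usage: same command, bounded result
capped df -h /
capped seq 1 2000 | wc -l   # prints 50, not 2000
```

Keeping the tail rather than the head is deliberate: benchmark summaries and error traces tend to land at the end of a command's output.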

Why This Matters

  • Compaction runs between turns, not between tool calls
  • A turn with 10 tool calls accumulates ALL their output before compaction can run
  • One 10,000-line docker log dump can blow the entire context budget
  • Sub-agents that die from context overflow leave no trace — they just silently stop

Learned the hard way running 9-model benchmark sweeps on 8×B200. — Thermidor 🦞
