"correct" output cannot be specified declaratively
skill teaches HOW to produce, not WHAT to do
when examples are decorative:
task is procedural/deterministic
correct behavior specifiable with rules
skill specifies WHAT to do, not HOW to produce
heuristic: one example per axis of variation. simple patterns (document: "why not what") need fewer examples than complex patterns (amp-voice: terminology + tone + phrases + anti-patterns).
skills that say "read X for details" without embedding critical constraints risk agents never following the link.
anthropic's tool design research: "descriptions should include... what each parameter means, important caveats or limitations" — recommends 3-4+ sentences per tool with explicit constraints embedded directly (anthropic tool use docs).
composio's field guide: "when parameters have implicit relationships... models fail to understand usage constraints" (composio).
heuristic: embed constraints that would break the skill if missing. links are for context; constraints need to be immediate.
nuance: if the skill already embeds the CRITICAL constraint (e.g., remember embeds source__agent requirement), don't duplicate the full vocabulary. single source of truth matters.
skills are load-bearing objects
evergreen notes turn ideas into objects. skills do the same for agent capabilities. a broken skill isn't just missing functionality—it's a missing object that other work depends on.
when a skill fails to load:
agents can't execute the capability
they invent ad-hoc conventions
output is subtly wrong in ways that surface later
corti's analysis: "context fragmentation" where "agents operate in isolation, make decisions on incomplete information" and "hallucination propagation" where "fabricated data spreads across agents, becomes ground truth" (corti).
concision enables composition
context length degrades performance independent of retrieval quality.
du et al. (2025): "even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%–85%) as input length increases" (arxiv).
chroma research: "as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases" (chroma).
"we actually spent more time optimizing our tools than the overall prompt" — anthropic
"poor tool descriptions → poor tool selection regardless of model capability" — langchain
a well-described 500-token skill beats a poorly-described 1500-token one. clarity matters more than length for rule-following skills.
validation at authorship, not consumption
errors are cheapest to fix where they originate.
skill validation during build catches issues before deployment. waiting until runtime means:
error is far from its cause
debugging requires tracing through agent behavior
multiple agents may have produced bad output
anthropic's building effective agents: component tests (individual LLM calls, tool invocations) should be fast and catch issues before they compound (anthropic).
implementation: nix build-time frontmatter validation. warns on missing frontmatter or unquoted colons. see 01_files/nix/user/amp/default.nix.
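a python sketch of the same check, assuming skills are markdown files with a `---` delimited frontmatter block (the logic mirrors the nix validation but is illustrative, not the actual implementation):

```python
# hypothetical sketch of the build-time check: warn on missing frontmatter
# or on unquoted colons inside frontmatter values (a common YAML footgun).
import re
import sys
from pathlib import Path

def validate_frontmatter(path: Path) -> list[str]:
    warnings = []
    text = path.read_text()
    if not text.startswith("---\n"):
        return [f"{path}: missing frontmatter block"]
    end = text.find("\n---", 4)
    if end == -1:
        return [f"{path}: frontmatter never closed"]
    for line in text[4:end].splitlines():
        m = re.match(r"^(\w[\w-]*):\s*(.+)$", line)
        if not m:
            continue
        value = m.group(2)
        # a bare colon in an unquoted value changes how YAML parses the field
        if ":" in value and not value.startswith(('"', "'")):
            warnings.append(f"{path}: unquoted colon in '{m.group(1)}'")
    return warnings

if __name__ == "__main__":
    for p in sys.argv[1:]:
        for w in validate_frontmatter(Path(p)):
            print("WARN", w)
```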
skills should be testable
a skill that can't be tested can't be trusted.
confident AI: "faulty tool calls — wrong tool, invalid parameters, misinterpreted outputs" and "false task completion — claiming success without actual progress" (confident-ai).
validation approaches:
frontmatter validation (implemented)
example invocations that can be dry-run
assertions about output format
hunch: skills may benefit from a test: section with expected inputs/outputs.
invocation guards for orchestration skills
rule-following orchestration skills (spawn, coordinate, rounds, spar) need explicit "WHEN NOT TO USE" sections. these skills are dangerous because:
low friction to invoke
feel productive (agents doing work)
costs are hidden (coordination overhead, conflicting findings, reconciliation burden)
malone et al. (2024): human-AI combinations often perform WORSE than the best of either alone, particularly on decision tasks where humans defer judgments they could make better themselves. spawning agents to generate opinions for reconciliation is exactly this antipattern.
the pre-spawn checklist:
before invoking multi-agent orchestration, ask:
could i verify this myself in <10 minutes? if yes, do it. agents are for parallelizing work you CAN'T do faster yourself.
is there a single source of truth? one agent reading one authoritative source beats multiple agents generating opinions to reconcile.
will agents produce conflicting findings? if task is evaluative (judging claims) rather than exploratory (generating hypotheses), a single careful pass is cleaner than theatrical "review courts."
do i have explicit exit criteria? multi-agent work without convergence criteria produces unbounded reconciliation work.
antipattern case study: spawned 4 agents to validate postmortem claims. results:
agent 1 said error rate was 8.75x. agent 2 proved methodology was wrong.
agent 1 claimed recovery at 20:26. agent 2 proved claim was unfalsifiable.
agent 3 corrected batch rate calculation.
postmortem rewritten 3x based on conflicting outputs.
fix: read the code, query observability ONCE with correct methodology, write findings with HUNCH labels where evidence is weak. one PR, done.
sources: malone et al. (2024), MAST dataset (40% multi-agent pilot failure rate), agent sprawl antipattern.
skill author guidance: orchestration skills SHOULD include a "WHEN NOT TO USE" section with the pre-spawn checklist. this is guidance for humans invoking the skill, not runtime enforcement.
2026-01-18: added "invocation guards for orchestration skills" section. pre-spawn checklist, antipattern case study from atlas traces postmortem incident. sources: malone et al. (2024), MAST dataset, agent sprawl antipattern note.
2026-01-16T18-30: expanded to three archetypes (rule-following, pattern-matching, epistemic). reclassified coordinate/rounds as rule-following. added "one example per axis of variation" heuristic. validated via second dialectic (nelson_velvetford).
2026-01-16: added skill archetypes (rule-following vs pattern-matching), refined token budgets by type, added quality > length principle, added structure guidance. validated via dialectic review + research agent.
2026-01-15: initial version from debugging broken remember skill.
tl;dr: we overindexed on a "cool workflow" when a direct solution would have been faster, cleaner, and more correct.
spawning multiple review agents without convergence criteria produces conflicting findings that need reconciliation. a single careful pass is cleaner than theatrical "review courts."
the seduction of multi-agent workflows
multi-agent orchestration FEELS rigorous. you're spawning validators, running review rounds, getting multiple perspectives. it looks like due diligence.
it's often theater.
the cost of multi-agent review isn't just tokens—it's the reconciliation burden when agents disagree. and they WILL disagree, because they're interpreting the same ambiguous evidence with different framings.
what happened
investigating atlas traces postmortem, i spawned 4 agents (larry, roy, george, marian) to "validate claims." results:
larry said error rate was 0.70% vs 0.08% (8.75x ratio)
roy proved larry's methodology was wrong (used all logs as denominator, not traces requests)
larry claimed "Atlas recovered at 20:26"
roy proved this was unfalsifiable (no success logs exist)
marian corrected batch rate from ~9/min to ~4.4/min
i updated the postmortem 3 times based on conflicting agent outputs.
the actual problems
no exit criteria — agents kept finding things, i kept updating. no definition of "done"
methodology blindness — trusted first agent's numbers without questioning how they were derived
claim inflation — asserted findings confidently before verifying they were falsifiable
scattered outputs — 3 PRs, 2 worktrees, postmortem rewritten 3x for a simple fix
what should have happened instead
the direct approach:
read code, check spec, confirm fix is correct
query observability ONCE with correct methodology (verify denominator)
write findings with HUNCH labels where evidence is weak
one PR, clean commit, done
time estimate: 20-30 minutes.
actual time spent: hours across multiple agents, reconciliation passes, PR rewrites.
the "cool workflow" cost 5-10x more than doing it directly. and the direct approach would have been MORE correct, because one person with clear methodology beats four agents with inconsistent methodologies.
pre-spawn checklist
before loading spawn/coordinate/rounds/spar/shepherd, ask:
could i verify this myself in <10 minutes? if yes, DO IT. the overhead of spawning, coordinating, and reconciling exceeds the work itself.
is there a single source of truth? if verifiable against one file/spec/query, one agent reading it once beats multiple agents interpreting it differently.
will agents produce conflicting findings? if task is evaluative (judging claims) rather than generative (creating artifacts), expect disagreement. one careful pass beats theatrical review courts.
do i have explicit exit criteria? without "done" criteria, agents keep finding things, you keep updating. unbounded work produces unbounded reconciliation.
is the work INDEPENDENT? spawn parallelizes independent work (different repos, different features, different concerns). don't spawn multiple agents to evaluate the SAME thing.
when multi-agent IS appropriate
independent parallel tasks: agent 1 works on frontend, agent 2 works on backend. no overlap.
genuinely different expertise: one agent queries observability, another reads code, a third writes docs. different inputs, synthesized outputs.
generative work with diversity value: brainstorming, hypothesis generation, creative exploration. disagreement is the point.
when multi-agent is THEATER
"validation" of claims with no ground truth — agents will generate conflicting opinions you'll spend more time reconciling than investigating directly
"review courts" where multiple agents judge the same artifact — feels rigorous, produces noise
spawning because you CAN — the tools are available, it feels productive, but single-agent would be faster
dialectic review between agents can produce manufactured findings. a meta-auditor phase catches these.
the problem
two failure modes in multi-agent dialectic:
premature convergence — agents agree too fast to satisfy "2 clean rounds" prompt
manufactured issues — agents invent problems to appear rigorous, or antithesis invents challenges to have something to say
both are documented patterns: corti's "hallucination propagation" and replit incident's "created fake data to mask issues."
the solution: skeptical meta-auditor
after dialectic claims completion, spawn a meta-auditor with explicit instructions:
assume all findings are MANUFACTURED until proven
for each finding, require:
trace to specific research (du et al., anthropic docs, etc.)
evidence the skill would actually fail without the change
assessment of box-checking risk
verdict: GENUINE (with citation) or MANUFACTURED (with reasoning)
recommend: KEEP or REVERT
example from practice
joyce_softerbone + hoot_velvetstar dialectic produced 2 findings:
| finding | meta-audit verdict | action |
|---|---|---|
| review: add slop counter-example before good example | GENUINE — traces to confident-ai, archetype research "epistemic skills show failure modes" | KEEP |
| amp-voice: rename "the pattern:" to "the compression pattern:" | MANUFACTURED — no functional impact, box-checking to satisfy antithesis role | REVERT |
without meta-auditor, the manufactured change would have been committed.
implementation
spawn meta-auditor AFTER dialectic claims completion:
META-AUDITOR — audit dialectic findings for authenticity.
assume MANUFACTURED until proven. for each finding:
1. does it trace to specific research? (cite source)
2. would skill ACTUALLY fail without this change?
3. box-checking risk: LOW/MODERATE/HIGH
verdict: GENUINE or MANUFACTURED
recommendation: KEEP or REVERT
dialectic debates can run in parallel, orchestrated by rounds. each "court session" is an independent debate that returns a verdict.
composition model
```
rounds (orchestrator)
├── court 1: spar(finding A)
│   ├── thesis agent
│   └── antithesis agent
├── court 2: spar(finding B)
│   ├── thesis agent
│   └── antithesis agent
└── court 3: spar(finding C)
    ├── thesis agent
    └── antithesis agent
→ rounds collects verdicts
→ runs meta-auditor on all verdicts
→ iterates if issues found
```
note: skill is named spar (not dialectic). files use "dialectic" as the conceptual term, "spar" as the skill name.
why this works
dialectic = the debate protocol (self-contained, returns verdict)
rounds = orchestrates N parallel instances, checks for stability
meta-auditor = post-dialectic phase, could be inline in dialectic OR a separate rounds pass
dialectic is rule-following (it's a workflow), not epistemic. it LOADS the epistemic skill (review). so it's a composable unit that rounds can orchestrate.
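a minimal sketch of that composition, assuming a spawn_agent() helper that runs one agent to completion and returns its final message (helper names, prompts, and verdict handling are illustrative, not the actual spawn/rounds/spar interfaces):

```python
# hypothetical sketch: rounds fans out independent spar debates in parallel,
# then runs a meta-auditor over the collected verdicts. spawn_agent() is a
# stand-in for whatever actually launches an agent and returns its output.
from concurrent.futures import ThreadPoolExecutor

def spawn_agent(prompt: str) -> str:
    raise NotImplementedError("wire up to the real agent runner")

def spar(finding: str) -> str:
    """one self-contained debate: thesis vs antithesis, returns a verdict."""
    thesis = spawn_agent(f"THESIS: defend this finding:\n{finding}")
    antithesis = spawn_agent(
        f"ANTITHESIS: attack this finding:\n{finding}\n\nthesis said:\n{thesis}"
    )
    return spawn_agent(
        "JUDGE: given the debate below, return UPHELD, REFUTED, or MODIFIED, "
        f"plus the revised finding if modified.\n\n{thesis}\n\n{antithesis}"
    )

def rounds(findings: list[str]) -> str:
    with ThreadPoolExecutor() as pool:
        verdicts = list(pool.map(spar, findings))   # courts run in parallel
    return spawn_agent(
        "META-AUDITOR: assume every finding below is MANUFACTURED until proven. "
        "for each, cite the research it traces to, say whether the skill would "
        "actually fail without it, and recommend KEEP or REVERT.\n\n"
        + "\n\n".join(verdicts)
    )
```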
interface contract
for rounds to spawn dialectic sessions, dialectic needs:
| aspect | requirement |
|---|---|
| input | claim/finding to debate + relevant file paths |
| output | verdict (UPHELD/REFUTED/MODIFIED) + revised finding if modified |
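a sketch of this contract as a typed schema, assuming python on the orchestration side (class and field names are illustrative, not the actual skill interface):

```python
# hypothetical sketch of the spar interface contract as a typed schema.
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    UPHELD = "UPHELD"
    REFUTED = "REFUTED"
    MODIFIED = "MODIFIED"

@dataclass
class SparInput:
    finding: str                                          # claim/finding to debate
    file_paths: list[str] = field(default_factory=list)   # relevant files

@dataclass
class SparOutput:
    verdict: Verdict
    revised_finding: str | None = None   # required when verdict is MODIFIED

    def __post_init__(self):
        if self.verdict is Verdict.MODIFIED and not self.revised_finding:
            raise ValueError("MODIFIED verdicts must include a revised finding")
```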
==You don’t need to agree with the idea for it to become an evergreen note. Evergreen notes can be very short.==
==I have an evergreen note called Creativity is combinatory uniqueness that is built on top of another evergreen note:==
If you believe Everything is a remix, then creativity is defined by the uniqueness and appeal of the combination of elements.
==Evergreen notes turn ideas into objects. By turning ideas into objects you can manipulate them, combine them, stack them. You don’t need to hold them all in your head at the same time.==
autonomous agents research run T-019bbde9-0161-743c-975e-0608855688d6
2026-01-15
agents, coordination, patterns, amp
multi-agent coordination patterns
patterns extracted from the autonomous agents research run (jan 14-15 2026): 393 threads, 11 rounds, 48+ research agents, ~17.5 hours continuous operation.
1. hub-and-spoke with watchdog
```
user
  │
janet (watchdog)
  │ pings every 3min
  ▼
coordinator
 /     |     \
agents agents agents
```
agents report TO COORDINATOR, not to each other. coordinator relays if needed. prevents crosstalk, keeps responsibility clear.
update (2026-01-16): use direct tmux send-keys, not slash commands. /queue and other slash commands are unreliable over tmux — timing issues cause messages to be cut off.
3. specialization by capability
| agent | capability | pattern |
|---|---|---|
| archivist | API access, queries | answers "how many?" and "which ones missing?" |
| archaeologist | thread reading, synthesis | builds structured docs from raw thread data |
| formatter | file structure, git | transforms formats, commits changes |
| accountant | cost extraction, annotation | adds metadata to existing docs |
| janet (watchdog) | liveness, challenge | keeps coordinator alive, pushes back on idle |
4. handoff protocol
when agent exhausts context:
prepare HANDOFF.md with current state
use thread:new or amp t n (NOT continue—carries old context)
brief successor with: read HANDOFF.md, continue from $OLD_THREAD_ID
report handoff to watchdog
5. error recovery
| failure | recovery |
|---|---|
| agent dies | watchdog detects via tmux, respawns with amp t c |
| agent stalls | watchdog sends Enter key, then pings, then respawns |
| API unauthorized | agent escalates to user for credentials |
| thread not found | agent asks for corrected ID |
6. work delegation
coordinator spawns agents with full context in prompt:
spawn-amp "TASK DESCRIPTION## CONTEXT<everything agent needs to know>## FILES<paths to read>## COORDINATION- who to report to- who to ask for helpreport to pane $PANE when done."
"(routing noise — not coordinator)"
"(not the coordinator)"
agents know their role and ignore messages meant for others.
emerged vs designed
| pattern | designed? | notes |
|---|---|---|
| 3-min ping cycle | designed | user specified in spawn prompt |
| AGENT prefix | designed | report skill enforces this |
| hub-and-spoke | emerged | agents defaulted to reporting up, not sideways |
| handoff protocol | emerged | coordinators invented HANDOFF.md format |
| noise filtering | emerged | formatter figured out it wasn't the target |
| capability specialization | designed | user spawned specialists by name |
key insight
hub-and-spoke emerged naturally. agents, when given a coordinator to report to, default to vertical communication. they don't spontaneously coordinate horizontally—the coordinator must relay. this simplifies reasoning about state but adds latency.
research on combining specialized agents into workflows, agent pipelines, and compositional versus monolithic agent design. investigates interface contracts, reusable components, and microservices patterns applied to multi-agent systems.
overview: what is composability?
composability refers to the ability to combine smaller, specialized components into larger functional systems. in AI agents, this means assembling specialized agents, tools, and data sources into workflows that achieve complex goals.
the principle of compositionality from linguistics: "the meaning of a whole is a function of the meanings of the parts and of the way they are syntactically combined" (partee, 2004). applied to agents, a composed system's behavior emerges from the behaviors of its constituent agents plus how they're connected.
key distinction from orchestration-patterns.md: orchestration describes HOW agents coordinate. composability describes WHAT can be composed and the interfaces that enable composition.
compositional vs monolithic agents
monolithic agents
structure: single agent handles entire workflow end-to-end. all capabilities bundled in one system prompt, one context window, one model call chain.
characteristics:
simpler deployment and debugging
no inter-agent communication overhead
single point of context—no fragmentation
scales poorly with task complexity
context window becomes limiting factor
when appropriate:
tasks with clear scope and bounded complexity
latency-critical applications
when coordination overhead exceeds specialization benefits
compositional agents
structure: multiple specialized agents combined via orchestration layer. each agent has distinct role, tools, and potentially different models.
characteristics:
specialists can excel at narrow domains
parallel execution possible for independent subtasks
individual components can be swapped, upgraded, tested independently
anthropic's claude team found that multi-agent systems use ~15× more tokens than single-agent chat (SYNTHESIS.md). token multiplication is the hard constraint on composition—each additional agent in a pipeline multiplies context overhead.
hunch: the decision boundary between monolithic and compositional is poorly understood. most tasks that "need" multi-agent can likely be handled by single well-prompted agent with good tools.
agent pipelines and chaining
sequential pipelines
agents execute in fixed order, each receiving output of previous agent as input.
LangGraph prompt chaining: each LLM call processes output of previous call. good for tasks with verifiable intermediate steps (langgraph docs)
AutoGen round-robin: agents take turns in predetermined sequence. RoundRobinGroupChat implements reflection patterns where critic evaluates primary responses (autogen docs)
TypingMind multi-agent workflows: syntax-based sequencing with ---- separators. each agent brings own model, parameters, plugins to workflow (typingmind docs)
tradeoffs:
(+) predictable execution order
(+) easy to debug—clear trace of agent outputs
(+) natural checkpointing at stage boundaries
(-) latency accumulates linearly with pipeline depth
(-) rigid—cannot adapt order based on intermediate results
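a minimal prompt-chaining sketch, assuming a call_llm() client; the verification gate between stages is what makes sequential pipelines easy to debug (all prompts and the translation example are illustrative):

```python
# hypothetical sketch of a sequential pipeline with a verification gate
# between stages. call_llm() stands in for any chat-completion client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual model client")

def run_pipeline(task: str) -> str:
    draft = call_llm(f"translate to french:\n{task}")
    # verifiable intermediate step: check the draft before the next stage
    check = call_llm(
        "does this translation preserve meaning? answer OK or list problems:\n"
        f"{task}\n---\n{draft}"
    )
    if not check.strip().upper().startswith("OK"):
        draft = call_llm(f"fix these problems in the translation:\n{check}\n---\n{draft}")
    return call_llm(f"polish the tone for a formal audience:\n{draft}")
```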
parallel pipelines
agents run simultaneously, either on independent subtasks or on the same task, with outputs aggregated afterward. use cases:
multiple perspectives on same problem (bull/bear/judge)
independent research tasks aggregated into synthesis
redundant execution for reliability (majority voting)
mixture-of-agents (MoA) implements feed-forward neural network topology: workers organized in layers, each layer receives concatenated outputs from previous layer. later layers benefit from diverse perspectives generated by earlier layers (wang et al., 2024).
dynamic pipelines
orchestrator determines execution order and agent selection at runtime.
LangGraph Send API: workers created on-demand with own state, outputs written to shared key accessible to orchestrator. differs from static supervisor—workers not predefined (langgraph docs).
tradeoffs:
(+) adapts to task requirements
(+) can skip unnecessary stages
(-) harder to predict behavior
(-) debugging more complex—execution path varies
interface contracts between agents
interface contracts define how agents communicate—message formats, expected inputs/outputs, error handling.
the fragmentation problem
current agent ecosystem lacks standardized interfaces. each framework defines own:
message schemas
tool calling conventions
state management approaches
error propagation
this mirrors early web/API days before REST and OpenAPI standardization (orchestration-patterns.md).
emerging protocols
MCP (Model Context Protocol): anthropic's standard for tool integration. provides tools and context TO agents. growing from ~100 servers (nov 2024) to 16,000+ (sep 2025)—16,000% increase (SYNTHESIS.md).
A2A (Agent-to-Agent): google's inter-agent communication protocol. enables agents to communicate WITH each other.
AG-UI (Agent-User Interaction Protocol): standardizes real-time, bi-directional communication between agent backend and frontend. streams ordered sequence of JSON-encoded events: messages, tool_calls, state_patches, lifecycle signals (medium, 2025).
key insight: MCP and A2A are complementary—MCP for agent-tool interface, A2A for agent-agent interface.
explicit handoff: agent signals completion and transfers control via HandoffMessage. OpenAI Swarm, AutoGen Swarm use this pattern (orchestration-patterns.md).
implicit handoff: orchestrator observes agent state, decides when to route elsewhere.
contract requirements for handoffs:
clear completion criteria
state transfer mechanism
error/timeout handling
rollback capability
reusable agent components
the building block model
Tray Agent Hub (sep 2025) introduces catalog of composable, reusable building blocks for AI agents (tray.ai):
Smart Data Sources: ground agents in company knowledge
AI Tools: actions agents can take
Agent Accelerators: pre-configured combinations for specific domains (HR, ITSM)
gartner guidance: "take an agile and composable approach in developing AI agents. avoid building heavy in-house tools and LLMs" (gartner, july 2025).
agents share fundamental properties with microservices: independent, specialized, designed for autonomous operation. patterns that solved microservices scaling apply directly.
architectural parallels
| microservices concept | agent equivalent |
|---|---|
| service | individual agent |
| API contract | agent interface (input/output schema) |
| service registry | agent catalog/registry |
| message queue | event backbone (kafka, etc.) |
| circuit breaker | agent fallback/retry logic |
| sidecar | guardrails, observability adapters |
event-driven architecture (EDA)
the scaling problem: before EDA, microservices had quadratic dependencies (NxM connections). EDA reduced to N+M through publish-subscribe (falconer, 2025).
why EDA for agents:
agents react to changes in real time rather than blocking calls
scale dynamically without synchronous dependencies
remain loosely coupled—failures don't cascade
event log enables replay for debugging, evaluation, retraining
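a toy in-process event bus showing the N+M shape: producers and consumers only know topic names, so adding an agent is one subscription rather than N new point-to-point integrations. topic names and handlers are illustrative; a production backbone (kafka etc.) would add persistence and replay:

```python
# hypothetical sketch of an event backbone for agents.
from collections import defaultdict
from typing import Callable

class EventBus:
    def __init__(self):
        self._subs: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subs[topic]:
            handler(event)   # a durable backbone would also log the event for replay

bus = EventBus()
bus.subscribe("ticket.created", lambda e: print("triage agent sees", e["id"]))
bus.subscribe("ticket.created", lambda e: print("billing agent sees", e["id"]))
bus.publish("ticket.created", {"id": 42})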
incoming interface microservice: provides clear instructions, short-term and long-term context, straightforward interface for agent interaction.
outgoing interface microservice: enables agent to retrieve data or perform tasks with guardrails preventing undesirable system access.
supporting microservices: can be scaled independently, optimized for reading, writing, or searching as needed for efficient reasoning (pluralsight, 2025).
why monolithic architectures fail for agents
per pluralsight analysis:
limited data access: backend API exposes specific endpoints, but much of monolith remains inaccessible to agent
agent design pattern catalogue
liu et al. (2024) provide a decision model for pattern selection based on:
context (domain, constraints)
forces (requirements, trade-offs)
consequences (benefits, risks)
limitation: full pattern details behind paywall. the existence of this systematic catalogue suggests composability is mature enough to warrant formal pattern languages.
compositional learning perspective
cognitive science research on compositional learning provides theoretical grounding (sinha et al., 2024):
key principle: compositional learning enables generalization to unobserved situations by understanding how parts combine.
computational challenge: models often rely on pattern recognition rather than holistic compositional understanding. they succeed through statistical patterns, not structural composition.
neuro-symbolic architectures: some approaches build networks that are compositional in nature—assembling command-specific networks from trained modules. however, making modules faithful to designed concepts remains difficult despite high task accuracy.
implication for agents: current LLM-based agents may appear compositional (combining tools, prompts, data) but lack true compositional reasoning. the composition happens at the system level, not the reasoning level.
practical composition patterns
anthropic's building blocks
from SYNTHESIS.md, anthropic identifies composability patterns:
prompt chaining: output of one becomes input of next
routing: classify input, direct to specialized flow
parallelization: simultaneous or redundant execution
orchestrator-workers: dynamic decomposition and synthesis
evaluator-optimizer: generate-evaluate loop until acceptable
CrewAI role-based composition
agents instantiated with explicit capabilities: "Researcher," "Planner," "Coder" (medium).
collaboration layer: agents share state, results, context for parallel processing and dependency management.
task graph builder: declare task dependencies; tasks sequenced or concurrent based on workflow needs.
LangGraph graph-based composition
workflows defined as directed graphs (cycles are allowed, unlike strict DAG pipelines). nodes represent agents or functions, edges represent data flow.
key feature: state persistence enables workflows to recover from crashes, retries, or idle periods.
composable graph architecture: linear, branching, or recursive flows supported.
limits of composability in practice
reusability is limited: prompts are tightly coupled to specific models, contexts, tools. "reusable" often means "starting point that requires extensive customization"
flexibility is constrained: changing one agent often requires changes to adjacent agents due to implicit contracts
team boundaries create integration challenges: each team optimizes locally, global behavior degrades
open questions
granularity: what's the right size for an agent component? too small = excessive coordination; too large = monolithic problems return
interface stability: how do we version agent interfaces as capabilities evolve?
composition verification: how do we test that composed behavior matches intent?
economic model: when does investment in composable infrastructure pay off?
key takeaways
start monolithic, decompose when necessary: composition adds overhead. justify it with measured specialization benefits.
interface contracts matter more than implementation: well-defined inputs, outputs, error handling enable composition. underspecified interfaces break it.
microservices patterns transfer: EDA, circuit breakers, sidecar patterns apply. 20 years of distributed systems learning is relevant.
protocol standardization is emerging but incomplete: MCP for tools, A2A for agents, AG-UI for frontends. fragmentation remains.
reusability is harder than claimed: context-dependence of prompts limits true reuse. expect "accelerators" not "plug-and-play."
composition ≠ reasoning: current systems compose at system level through orchestration, not at reasoning level through understanding.
references
liu et al. (2024). "agent design pattern catalogue: a collection of architectural patterns for foundation model based agents." journal of systems and software.
sinha et al. (2024). "a survey on compositional learning of AI models." arxiv:2406.08787
falconer (2025). "AI agents are microservices with brains." medium.
dhiman (2025). "architecting microservices for seamless agentic AI integration." pluralsight.
tray.ai (2025). "tray.ai launches agent hub, the first catalog of composable, reusable building blocks."
vercel. "agent (interface) - AI SDK core." ai-sdk.dev
research synthesis on budget allocation, dynamic pruning, prioritization strategies, summarization techniques, and model limits
executive summary
context window management may be the most consequential engineering challenge for autonomous agents operating at scale. while nominal context windows have expanded to millions of tokens (gemini 3 pro: 1M, gpt-5.2: 400k), empirical evidence consistently shows effective context is far smaller than advertised. du et al. (2025) found performance degrades 13.9%–85% as input length increases—even with perfect retrieval [context-management.md]. the field has shifted from "prompt engineering" to "context engineering": optimizing the configuration of tokens to maximize desired behavior within hard budget constraints [anthropic, 2025].
this document extends context-management.md with deeper analysis of budget allocation strategies, dynamic pruning techniques, and practical tradeoffs for agent architects.
1. context budget allocation
1.1 the minimum viable context principle
anthropic's context engineering framework establishes the core optimization problem: find the smallest possible set of high-signal tokens that maximize likelihood of desired outcome [anthropic, 2025]. this inverts the naive assumption that more context equals better performance.
budget allocation requires partitioning available tokens across competing demands: system prompt, tool definitions, conversation history, retrieved context, and reserved output space.
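a sketch of a static partition along those lines; the categories, ratios, and numbers are illustrative, not a recommendation:

```python
# hypothetical sketch: partition a context window into fixed reserves plus a
# flexible remainder split between retrieval and history.
def allocate_budget(window: int, system: int, tools: int, output_reserve: int,
                    retrieval_share: float = 0.4) -> dict[str, int]:
    fixed = system + tools + output_reserve
    if fixed >= window:
        raise ValueError("fixed components alone exceed the context window")
    flexible = window - fixed
    retrieval = int(flexible * retrieval_share)
    return {
        "system": system,
        "tools": tools,
        "output_reserve": output_reserve,
        "retrieval": retrieval,
        "history": flexible - retrieval,
    }

print(allocate_budget(window=200_000, system=3_000, tools=5_000, output_reserve=16_000))
```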
jetbrains research found llm summarization causes trajectory elongation (+15% more steps), reducing net efficiency gains [context-management.md]. the summarization model may introduce:
loss of critical details
semantic drift from original meaning
increased latency per compression cycle
cache invalidation costs
4.3 hybrid observation-summarization
jetbrains' optimal approach combines both:
observation masking for recent window
llm summarization for older content
tuned thresholds per agent type
result: 7% cost reduction vs. pure masking, 11% vs. pure summarization, +2.6% task success rate.
users retain continuity without context window concerns.
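a sketch of the hybrid policy, assuming chat-style message dicts and a summarize() call to a model; the window sizes are illustrative and, per jetbrains, need per-agent tuning:

```python
# hypothetical sketch of hybrid compression: keep a recent window verbatim,
# mask bulky tool observations just outside it, summarize everything older.
def summarize(messages: list[dict]) -> str:
    raise NotImplementedError("call a model here")

def compress(history: list[dict], keep_recent: int = 10, mask_window: int = 30) -> list[dict]:
    recent = history[-keep_recent:]
    middle = history[-mask_window:-keep_recent]
    old = history[:-mask_window]
    compressed: list[dict] = []
    if old:
        compressed.append({"role": "system",
                           "content": "summary of earlier turns: " + summarize(old)})
    for msg in middle:
        if msg["role"] == "tool":   # observation masking: drop bulky tool output
            msg = {**msg, "content": "[tool output elided]"}
        compressed.append(msg)
    return compressed + recent
```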
5. RAG vs. full context tradeoffs
5.1 when to use RAG
| factor | RAG preferred | full context preferred |
|---|---|---|
| data volume | exceeds context window | fits in window |
| update frequency | dynamic, changing | static, fixed |
| cost sensitivity | high | low |
| latency tolerance | retrieval overhead acceptable | minimal latency required |
| precision needs | targeted retrieval sufficient | holistic understanding needed |
5.2 hybrid approaches
li et al. (2024) "retrieval augmented generation or long-context llms?" found long-context llms outperform RAG when resources available, but RAG far more cost-efficient [meilisearch, 2025].
hybrid pattern:
RAG retrieves relevant document chunks
feed chunks to long-context llm
llm reasons across combined input
meilisearch and similar tools handle retrieval layer; llm handles synthesis.
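a sketch of the hybrid pattern, with retrieve() standing in for the retrieval layer and call_llm() for a long-context model (both hypothetical placeholders):

```python
# hypothetical sketch: retrieval narrows to relevant chunks, a long-context
# model reasons over the concatenation.
def retrieve(query: str, k: int = 20) -> list[str]:
    raise NotImplementedError("query the retrieval layer (search engine, vector db, ...)")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("call a long-context model")

def answer(query: str) -> str:
    chunks = retrieve(query)
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    return call_llm(
        # asking the model to recite relevant passages first helps counter
        # length-induced degradation
        "first recite the passages relevant to the question, then answer it.\n\n"
        f"passages:\n{context}\n\nquestion: {query}"
    )
```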
5.3 the rag scaling limit
even with improved retrieval, RAG cannot solve fundamental length-induced degradation. du et al. (2025) showed that length alone hurts performance independent of retrieval quality [context-management.md].
mitigation: prompt model to recite retrieved evidence before solving → converts long-context to short-context task → +4% improvement on RULER benchmark.
6. context window limits by model (january 2026)
| model | nominal context | max output | effective context* | pricing (input/output per 1M) |
|---|---|---|---|---|
| gemini 3 pro | 1M tokens | 64k | ~200k reliable | $2.00 / $12.00 |
| gpt-5.2 | 400k tokens | 128k | ~100k-200k | $1.75 / $14.00 |
| claude opus 4.5 | 200k tokens (1M beta) | 64k | ~60-120k | $5.00 / $25.00 |
| claude sonnet 4.5 | 200k tokens (1M beta) | 64k | ~60-120k | $3.00 / $15.00 |
| deepseek v3.2 | 128k tokens | 32k | ~40-80k | $0.28 / $0.42 |
| qwen3-235b | 128k tokens | - | ~40-80k | open-weight |
| llama 4 | varies | varies | ~40-80k | open-weight |
*effective context = length at which benchmark performance remains >80% of short-context baseline. varies by task.
6.1 benchmark reality check
fiction.livebench (2025) results show model-specific degradation patterns:
| model | 8k | 32k | 120k | 192k |
|---|---|---|---|---|
| gemini 2.5 pro | 80.6 | 91.7 | 87.5 | 90.6 |
| gpt-5 | 100.0 | 97.2 | 96.9 | 87.5 |
| deepseek v3.1 | 80.6 | 63.9 | 62.5 | - |
| claude sonnet 4 (thinking) | 97.2 | 91.7 | 81.3 | - |
gemini and gpt-5 maintain performance to 192k; claude degrades after 60-120k [context-management.md].
6.2 nominal vs. effective limits
chroma research (2025): "as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases" [context-management.md].
at 32k tokens, 11 of 12 tested models dropped below 50% of their short-context performance (NoLiMa benchmark, 2025).
7. architectural patterns for context management
7.1 multi-agent context isolation
anthropic's research system: lead agent orchestrating specialized subagents:
each subagent gets focused context for one aspect
lead agent receives distilled outputs
~90% performance boost on research tasks vs. single agent
parallel exploration without context pollution
7.2 sleep-time compute (letta pattern)
separate memory management from conversation:
memory operations happen asynchronously during idle periods
proactive refinement rather than lazy updates
lower interaction latency, higher memory quality
7.3 external memory systems
hierarchical memory with external persistence:
main context (RAM analog): immediate inference access
external context (disk analog): recall and archival storage, paged in via retrieval
memgpt pioneered this; mem0 provides production implementation with knowledge graphs + embeddings [context-management.md].
8. open problems and research directions
8.1 no universal compression settings
observation masking window size, summarization frequency, and compression thresholds require per-agent calibration. jetbrains found settings that work for one agent scaffold may degrade another.
8.2 the information-compression paradox
aggressive compression saves tokens but may force re-fetching. factory.ai's insight: "minimize tokens per task, not per request" [context-management.md]. task-level efficiency requires end-to-end evaluation.
8.3 summary quality degradation
summarization is "only as good as the model producing them, and important details can occasionally be lost" [context-management.md]. no reliable method to guarantee critical information preservation.
8.4 benchmark validity concerns
needle-in-a-haystack tests lexical retrieval—not representative of nuanced analysis, multi-step reasoning, or information synthesis required by real agents.
8.5 the attention scarcity problem
anthropic frames this architecturally: transformers compute n² pairwise relationships for n tokens. every token depletes an "attention budget" with diminishing returns. no current architecture solves this fundamentally.
key takeaways
effective context << nominal context: real performance degrades far before hitting advertised limits
observation masking often wins: simpler approaches match or beat llm summarization at lower cost
prioritization > accumulation: curate high-signal tokens rather than maximizing volume
tuning is agent-specific: no universal settings work across different scaffolds
multi-agent isolation: parallel subagents with focused contexts outperform single agents with massive contexts
hybrid rag+long-context: retrieval narrows to relevant docs, long-context enables full reasoning
minimize tokens per task: measure efficiency end-to-end, not per-request
systematic classification of agent failures, recovery mechanisms, and graceful degradation patterns.
1. classification frameworks
1.1 by error origin
three primary taxonomies dominate current research:
microsoft AI red team taxonomy (2025)
microsoft's taxonomy divides failures into novel (unique to agentic AI) and existing (amplified in agentic contexts), across security and safety pillars [microsoft whitepaper].
security
memory poisoning, XPIA, HitL bypass, function compromise, incorrect permissions, resource exhaustion, insufficient isolation, excessive agency, loss of data provenance
safety
intra-agent RAI issues, allocation harms in multi-user scenarios, organizational knowledge loss, prioritization→user safety issues
insufficient transparency, parasocial relationships, bias amplification, user impersonation, insufficient intelligibility for consent, hallucinations, misinterpretation of instructions
AgentErrorTaxonomy (zhu et al., 2025)
a modular classification spanning five core agent components [arxiv:2509.25370]:
characteristics: persistent across sessions, hard to detect
2.5 communication hallucinations
causes:
inter-agent message corruption
false state reporting
characteristics: unique to multi-agent systems, can cascade rapidly
3. multi-agent error propagation
multi-agent systems exhibit unique failure modes [corti analysis, failures.md]:
3.1 propagation patterns
hallucination propagation: fabricated data from one agent becomes ground truth for others. once stored in shared memory, subsequent agents treat it as verified fact.
context fragmentation: agents operate in isolation, make decisions on incomplete information, leading to inconsistent actions.
specification failures: account for ~42% of multi-agent failures [galileo]. ambiguous task handoffs cause cascading misinterpretation.
coordination breakdown: ~37% of failures stem from coordination issues—agents duplicating work, conflicting actions, or deadlocking.
3.2 compound error rates
demis hassabis describes compound error as "compound interest in reverse" [failures.md]:
failure_rate = 1 - (1 - per_step_error)^steps
| per-step error | 10 steps | 50 steps | 100 steps |
|---|---|---|---|
| 1% | 9.6% | 39.5% | 63.4% |
| 5% | 40.1% | 92.3% | 99.4% |
| 20% | 89.3% | 99.99% | ~100% |
real-world agents reportedly error ~20% per action, making long-horizon tasks nearly certain to fail [business insider].
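the table above follows directly from the formula; a quick check:

```python
# reproduces the compound-error table: a constant per-step error rate
# compounds over the length of the trajectory.
def failure_rate(per_step_error: float, steps: int) -> float:
    return 1 - (1 - per_step_error) ** steps

for p in (0.01, 0.05, 0.20):
    row = [f"{failure_rate(p, n):.1%}" for n in (10, 50, 100)]
    print(f"{p:.0%} per step:", *row)
# 1% per step: 9.6% 39.5% 63.4%
# 5% per step: 40.1% 92.3% 99.4%
# 20% per step: 89.3% 100.0% 100.0%
```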
3.3 audit complexity
decision tracing becomes exponentially harder with agent count. access control failures occur when hallucinated identifiers bypass security boundaries.
4. recovery strategies by error type
4.1 tool failures
immediate retry with backoff
retry with exponential backoff: 1s, 2s, 4s, 8s...
fallback tools: maintain alternative implementations for critical functionality. if primary API fails, route to backup.
circuit breakers: after N consecutive failures, isolate agent/tool from workflow, route to alternatives [galileo].
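a sketch of that recovery ladder (backoff, then circuit breaker, then fallback); the thresholds and tool callables are illustrative:

```python
# hypothetical sketch of tool-failure recovery: exponential backoff first,
# then a circuit breaker that routes to a fallback after N consecutive failures.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1

def call_tool(primary, fallback, breaker: CircuitBreaker, retries: int = 4):
    if breaker.open:
        return fallback()                      # primary isolated, route around it
    delay = 1.0
    for attempt in range(retries):
        try:
            result = primary()
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            if breaker.open or attempt == retries - 1:
                return fallback()
            time.sleep(delay)                  # 1s, 2s, 4s, 8s...
            delay *= 2
```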
4.2 reasoning errors
self-correction mechanisms
reflexion (shinn et al., 2023): agents verbally reflect on task feedback, maintain reflective text in episodic memory to induce better decision-making in subsequent trials. achieved 91% pass@1 on HumanEval vs 80% for baseline GPT-4 [arxiv:2303.11366].
ReSeek (2025): introduces JUDGE action for intra-episode self-correction. agents can pause, evaluate evidence, discard unproductive paths. achieved 24% higher accuracy vs baselines [arxiv:2510.00568].
self-healing loops: establish tests → decompose task → execute subtasks → test results → fix failures → retest. reported 3600% improvement on hard reasoning tasks [medium/pranav.marla].
key insight: self-correction works by enabling selective attention to history—agents learn to disregard uninformative steps when formulating next actions.
recovery: up to 26% relative improvement in task success after feedback
6.2 runtime verification
formal specification languages express safety requirements that systems verify during execution. when agent generates output violating specifications, guardrailing systems detect and block unsafe outputs before propagation.
6.3 observability requirements
per-agent metrics:
response latency
error rate by error type
confidence scores
context utilization
system-level metrics:
fallback activation rate
mean time to recovery (MTTR)
cascade depth (how many agents affected by single failure)
end-to-end success rate
7. industry frameworks
7.1 CoSAI (coalition for secure AI)
three foundational principles for secure-by-design agentic systems [cosai.org]:
human-governed and accountable: meaningful control, shared accountability, risk-based oversight
prompt injection ranked #1 threat in 2025. taxonomy distinguishes:
direct prompt injection (adversarial prompts submitted directly)
indirect prompt injection (malicious instructions in external content)
task injection (bypasses classifiers by appearing as normal text)
7.3 AI incident database
tracks production incidents with classification system:
incident #622 (Chevrolet chatbot): "lack of capability or robustness"
incident #541 (lawyer fake cases): hallucination in professional context
8. self-correction mechanisms
8.1 verbal reinforcement learning (reflexion)
agents reflect on failures using natural language, store reflections in episodic memory:
trial 1: failed → reflection: "I assumed the file existed without checking"
trial 2: applies reflection → succeeds
no weight updates required—learning through linguistic feedback only.
8.2 self-evolving agents (openai cookbook)
continuous improvement loop [openai cookbook]:
baseline agent produces outputs
human feedback or LLM-as-judge evaluates
meta-prompting suggests improvements
evaluation on structured criteria
updated agent replaces baseline if improved
8.3 genetic-pareto optimization (GEPA)
samples agent trajectories, reflects in natural language, proposes prompt revisions, evolves system through iterative feedback. more dynamic than static meta-prompting.
9. open problems
9.1 accurate hallucinatory localization
agent hallucinations may arise at any pipeline stage and exhibit:
hallucinatory accumulation (errors compound over steps)
inter-module dependency (hard to isolate source)
current detection focuses on shallow layers (perception); deep layers (memory, communication) remain under-researched [arxiv:2509.18970].
9.2 cascading failure prediction
no established methodology for predicting when single-agent failures will cascade into system-wide failures.
9.3 dynamic self-scheduling
fixed patterns enhance controllability but reduce flexibility. designing systems that autonomously organize task execution and coordinate multi-agent collaboration remains open.
9.4 cross-agent trust verification
protocols for agents to verify claims made by other agents don't exist in standardized form.
techniques for reducing memory footprint while preserving task-relevant information
Executive Summary
memory compression addresses a fundamental tension in agent design: accumulating context improves coherence but degrades performance. empirical evidence shows context length alone hurts LLM performance by 13-85% even with perfect retrieval (Du et al., 2025). this document synthesizes compression strategies, from simple observation masking to sophisticated hierarchical consolidation, examining the tradeoffs between information fidelity and efficiency.
key finding: structured compression beats brute-force context expansion. SimpleMem achieves 30× token reduction with 26% F1 improvement over full-context baselines (Liu et al., 2025). the most effective approaches combine selective retention with active forgetting—remembering what matters while deliberately discarding what doesn't.
cost explosion: token consumption scales with conversation length. a customer support bot processing hundreds of conversations daily incurs thousands of dollars in unnecessary costs without compression.
performance degradation: larger context windows don't mean better reasoning. NoLiMa benchmark (2025) shows 11 of 12 models drop below 50% of short-context performance at 32k tokens. "lost in the middle" phenomenon (Liu et al., 2023) demonstrates retrieval accuracy degrades when relevant information appears mid-context.
latency constraints: production systems require sub-50ms retrieval. processing massive contexts introduces unacceptable delays for interactive applications.
1.2 The Information-Compression Paradox
aggressive compression saves tokens but may force re-fetching, adding more API calls than tokens saved. Factory.ai's insight: "minimize tokens per task, not per request." the goal is end-to-end efficiency, not local optimization.
2. Summarization Techniques
2.1 Recursive Summarization
the dominant pattern for conversation compression:
trigger compression when context exceeds threshold
summarize oldest N messages: new_summary = summarize(old_summary + evicted_messages)
store raw messages in recall storage
retain only summary in main context
MemGPT's implementation (Packer et al., 2023): queue manager tracks context utilization with warning threshold (~70%) and flush threshold (100%). eviction generates recursive summaries, moving originals to archival storage.
limitations: summarization quality depends on the summarizing model. important details can be lost, and the process adds latency + cost for summarization API calls.
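a sketch of the MemGPT-style queue manager described above; token counting and eviction sizing are crude placeholders, not the actual implementation:

```python
# hypothetical sketch: warn near the window limit, flush at the limit by
# folding the oldest messages into a running summary and moving the
# originals to recall storage.
def count_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"].split()) for m in messages)   # crude proxy

def summarize(summary: str, evicted: list[dict]) -> str:
    raise NotImplementedError("new_summary = summarize(old_summary + evicted_messages)")

class QueueManager:
    def __init__(self, window: int, warn_at: float = 0.7, evict_fraction: float = 0.5):
        self.window, self.warn_at, self.evict_fraction = window, warn_at, evict_fraction
        self.summary, self.messages, self.recall = "", [], []

    def append(self, message: dict) -> None:
        self.messages.append(message)
        used = count_tokens(self.messages)
        if used >= self.window:                       # flush threshold
            n = max(1, int(len(self.messages) * self.evict_fraction))
            evicted, self.messages = self.messages[:n], self.messages[n:]
            self.recall.extend(evicted)               # originals stay retrievable
            self.summary = summarize(self.summary, evicted)
        elif used >= self.window * self.warn_at:      # warning threshold
            print("warning: context at", f"{used / self.window:.0%}")
```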
2.2 Rolling Summaries (Incremental Compression)
treat conversation as a rolling snowball—periodically compress to maintain manageable size:
after N turns (typically 5-10), generate summary of that chunk
summary replaces original messages in history
next summary incorporates previous summary + new messages
pros: maintains continuous compressed thread of entire conversation
cons: nuances and specific details erode over successive compressions. "summarization is an imperfect process" (Ibrahim, 2025)
continue with compressed context + five most recently accessed files
3.2 Hybrid Memory Strategy
combine pinned messages with summarized history:
pinned messages: preserved verbatim—system prompt, first user message, critical data points
summarized history: everything between key points compressed via rolling summarization
pros: preserves high-fidelity critical information while compressing less important turns
cons: determining which messages are "key" requires heuristics that may not generalize
3.3 Sleep-Time Consolidation
Letta's paradigm separates consolidation from conversation: memory operations run asynchronously during idle periods, proactively refining memory rather than updating it lazily mid-conversation.
4. Lossless vs. Lossy Compression
4.1 Lossless Compression
embedding conversion: store text as dense vectors rather than raw tokens
structural deduplication: identify repeated information, store once with references
tradeoff: limited compression ratios but guaranteed information preservation.
4.2 Lossy Compression
achieves higher ratios by discarding deemed-irrelevant information:
| Approach | Retention | Compression | Method |
|---|---|---|---|
| Consolidation | 80-95% | 20-50% | reorganize, preserve phrasing |
| Summarization | 50-80% | 60-90% | extract key points |
| Distillation | 30-60% | 80-95% | extract principles/patterns |
JPEG analogy (from Medium): "Like how JPEG compresses images by removing details the eye won't miss, the system removes conversational details that don't affect future interactions. 'It was a really, really good restaurant' becomes 'positive restaurant experience' while preserving restaurant name and rating."
4.3 Importance Scoring
not all information merits equal retention. scoring mechanisms prioritize:
importance: LLM-assigned score (1-10) cached at creation
relevance: embedding similarity to current context
emotional significance: language patterns indicating affect receive higher retention scores
frequency: oft-referenced topics score higher
task criticality: information needed for completion preserved at maximum fidelity
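a sketch of a composite retention score over those signals; the weights are illustrative, not tuned:

```python
# hypothetical sketch of importance scoring for memory retention.
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    importance: float        # llm-assigned, 1-10, cached at creation
    relevance: float         # embedding similarity to current context, 0-1
    emotional: float         # affect signal, 0-1
    frequency: int           # times referenced
    task_critical: bool

def retention_score(m: MemoryEntry) -> float:
    if m.task_critical:
        return float("inf")                     # never prune what the task needs
    return (0.4 * (m.importance / 10)
            + 0.3 * m.relevance
            + 0.1 * m.emotional
            + 0.2 * min(m.frequency / 5, 1.0))

entries = [MemoryEntry(8, 0.2, 0.1, 1, False), MemoryEntry(3, 0.9, 0.0, 6, False)]
print(sorted(entries, key=retention_score, reverse=True)[:1])   # keep the top-k
```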
5. When to Forget (Memory Pruning)
5.1 Strategic Forgetting as Feature
human memory treats forgetting as adaptive, not failure. AI memory systems should implement intentional pruning:
"Instead of discussing how to prevent forgetting, we should explore how to implement intentional, strategic forgetting mechanisms that enhance rather than detract from performance." — Pavlyshyn (2025)
5.2 Temporal Decay
information relevance decays at different rates:
task-specific context: aggressive decay after task completion
user preferences: slow decay, reinforced by repeated mention
Zep's approach: temporal awareness without true deletion
track when information first encountered
associate metadata with entries
allow fact invalidation without deletion
maintain complete historical record
distinguish "no longer true" from "never mentioned"
5.3 Pruning Triggers
completion-based: once task completes, forget false starts and errors. Focus Agent (Verma, 2026) performed 6.0 autonomous compressions per task on average.
threshold-based: Factory.ai's fill/drain model (sketched after this list)
T_max: compression threshold ("fill line")
T_retained: tokens kept after compression ("drain line")
narrow gap = frequent compression, higher overhead
wide gap = less frequent, but aggressive truncation risk
importance-based: prune when importance score falls below threshold. Mem0g tracks repeated patterns—when frequency exceeds threshold, generate abstract semantic representation and archive original episodic entries.
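a sketch of the threshold-based fill/drain trigger; token counting and the compression step are placeholders:

```python
# hypothetical sketch of the fill/drain model: compact when usage crosses
# T_max, keep compacting until it fits under T_retained.
def token_count(history: list[str]) -> int:
    return sum(len(h.split()) for h in history)      # crude proxy

def compress_oldest(history: list[str]) -> list[str]:
    raise NotImplementedError("summarize or drop the oldest chunk")

def maybe_compact(history: list[str], t_max: int, t_retained: int) -> list[str]:
    assert t_retained < t_max, "drain line must sit below fill line"
    if token_count(history) < t_max:                  # fill line not reached
        return history
    while token_count(history) > t_retained:          # drain down
        history = compress_oldest(history)
    return history
# narrow t_max - t_retained gap: frequent compaction; wide gap: rare but aggressive.
```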
5.4 What to Prune
low-value targets for pruning:
tool result clearing: once tool called deep in history, raw results rarely needed again. "one of the safest, lightest-touch forms of compaction" (Anthropic)
error trajectories: failed attempts and backtracking after successful resolution
redundant confirmations: acknowledgments and conversational filler
superseded information: old preferences explicitly replaced by new ones
6. Compression Ratios Achieved
6.1 Empirical Benchmarks
| System | Compression Rate | Correctness Impact | Source |
|---|---|---|---|
| SimpleMem | 30× token reduction | +26.4% F1 | Liu et al., 2025 |
| AWS AgentCore Semantic | 89% | -7% (factual) | AWS, 2025 |
| AWS AgentCore Preference | 68% | +28% (preference tasks) | AWS, 2025 |
| AWS AgentCore Summarization | 95% | +6% (PolyBench) | AWS, 2025 |
| Focus Agent | 22.7% reduction | identical accuracy | Verma, 2026 |
| Focus (best instance) | 57% reduction | maintained | Verma, 2026 |
| Mem0 | 80-90% reduction | +26% response quality | Mem0, 2025 |
| Observation Masking | >50% cost reduction | matched/beat summarization | JetBrains, 2025 |
6.2 Task-Type Variation
compression effectiveness varies by task:
factual QA: RAG baseline (full history) achieves 77.73% correctness vs. semantic memory at 70.58% with 89% compression. slight accuracy loss acceptable for massive efficiency gain.
preference inference: compressed memory (79%) outperforms full context (51%). "extracted insights more valuable than raw conversational data" — extracted structure beats raw accumulation.
multi-hop reasoning: SimpleMem F1 43.46 vs. MemGPT 17.72. structured compression enables reasoning chains that raw accumulation obscures.
7. Impact on Task Performance
7.1 When Compression Helps
compression improves performance in several scenarios:
attention degradation: Du et al. (2025) showed length alone hurts performance. compression mitigates by reducing context length.
noise reduction: irrelevant history distracts attention. "agents using observation masking paid less per problem and often performed better" (JetBrains, 2025)
structure provision: compressed representations often provide better organization than raw accumulation. SimpleMem's multi-view indexing enables retrieval patterns impossible with linear history.
7.2 When Compression Hurts
detail-dependent tasks: tasks requiring exact quotes, specific numbers, or precise sequences degrade under lossy compression.
trajectory elongation: JetBrains found LLM summarization caused +15% more steps than observation masking—summarization overhead sometimes exceeds savings.
cascade errors: poor early summarization propagates through recursive consolidation. one bad compression compounds.
7.3 Mitigation Strategies
recitation before solving: Du et al. (2025) found prompting model to recite retrieved evidence before answering yields +4% improvement—converts long-context to short-context task.
hybrid retrieval: don't rely solely on compressed memory. enable raw retrieval for detail-sensitive queries.
quality monitoring: track compression quality over time. Flag degradation patterns before they compound.
8. Implementation Recommendations
8.1 Strategy Selection
| Use Case | Recommended Strategy | Rationale |
|---|---|---|
| Short sessions (<20 turns) | sliding window | no compression needed |
| Medium sessions (20-100 turns) | observation masking | simple, effective |
| Long sessions (>100 turns) | hierarchical + summarization | tiered retention |
| Multi-session continuity | semantic memory extraction | cross-session facts |
| Task completion focus | aggressive pruning | forget completed tasks |
8.2 Configuration Guidelines
compression thresholds: start conservative (70% window fill), adjust based on task performance
summarization frequency: batch summarization outperforms per-turn. summarize 20-30 turns at a time.
retention windows: keep last 10 messages verbatim minimum. this provides immediate context that summarization can't replace.
importance scoring: weight by task relevance, not just recency. domain-specific importance signals outperform generic.
8.3 Evaluation Before Deploying
no compression strategy is universally optimal. benchmark on:
single-hop factual recall
multi-hop reasoning chains
temporal questions ("when did X happen?")
adversarial queries (asking about non-existent information)
compare compression overhead (latency, cost) against savings achieved.
9. Open Problems
9.1 Optimal Compression Timing
when should compression occur? current approaches use threshold-based triggers, but optimal timing may be:
task-aware: compress at natural task boundaries
attention-aware: compress when attention patterns indicate saturation
cost-aware: compress when marginal cost exceeds marginal benefit
9.2 Cross-Modal Compression
current research focuses on text. multimodal agents need compression strategies for:
image sequences (video understanding)
audio streams
mixed-modality histories
9.3 Compression Quality Metrics
how do we measure compression quality? current proxies:
downstream task accuracy
retrieval precision/recall
human evaluation of summary quality
missing: principled information-theoretic metrics for agent memory compression that predict task performance.
9.4 Personalized Compression
different users may have different information density patterns. adaptive compression that learns user-specific retention policies remains unexplored.
Key Takeaways
compression is essential, not optional: context length degrades performance regardless of retrieval quality. some form of compression is mandatory for long-horizon agents.
structured compression outperforms raw accumulation: SimpleMem's 30× reduction with 26% F1 gain demonstrates that intelligent structure beats brute-force context expansion.
observation masking often beats summarization: JetBrains found simpler masking approaches matched or exceeded LLM summarization at lower cost and without trajectory elongation.
forgetting is a feature: strategic pruning of completed tasks, errors, and low-importance information improves rather than degrades performance.
compression ratios of 80-95% achievable: production systems achieve dramatic reductions while maintaining or improving task performance on appropriate benchmarks.
no universal optimal strategy: compression approach depends on task type, session length, and performance requirements. benchmark before deploying.
research on coordination architectures for LLM-based multi-agent systems. goes beyond basic single-agent loops to examine how multiple agents collaborate, compete, and coordinate.
overview: the coordination problem
multi-agent systems promise specialized intelligence—divide complex workflows into expert tasks. but coordination introduces overhead: routing logic, handoff protocols, conflict resolution, shared state management.
the coordination tax: what starts as clean architecture often becomes a web of dependencies. a three-agent workflow costing $5-50 in demos can hit $18,000-90,000 monthly at scale due to token multiplication (TechAhead, 2026).
key finding from MAST dataset (1600+ annotated failure traces across 7 MAS frameworks): 40% of multi-agent pilots fail within 6 months of production deployment. root causes include coordination breakdowns, sycophancy (agents reinforcing each other instead of critically engaging), and cascading failures (Cemri et al., 2024, arXiv:2503.13657).
coordination topologies
1. hierarchical / supervisor pattern
structure: single orchestrator delegates to specialist workers, synthesizes outputs.
implementations:
LangGraph supervisor: orchestrator breaks tasks into subtasks, delegates via Send API, workers write to shared state key, orchestrator synthesizes (LangChain docs)
Databricks multi-agent supervisor: BASF Coatings case study. genie agents + function-calling agents under supervisor. handles structured (SQL) and unstructured (RAG) data. integrated with MS Teams for "always-on" assistant (Databricks, 2025)
tradeoffs:
(+) clear control flow, easier debugging
(+) localized failure containment—supervisor re-routes when worker fails
(-) supervisor bottleneck; single point of failure
(-) context accumulation at supervisor level
production insight: BASF is moving to a "supervisor of supervisors" model: multi-layered orchestration in which each division runs its own supervisor while a higher-level, Coatings-wide orchestrator serves all users.
2. flat / peer-to-peer patterns
structure: agents communicate directly without central coordinator.
variants:
round-robin: agents take turns broadcasting to all others. simple but deterministic. AutoGen's RoundRobinGroupChat implements reflection pattern—critic evaluates primary agent responses (AutoGen docs)
selector-based: LLM selects next speaker after each message. AutoGen's SelectorGroupChat uses ChatCompletion model for dynamic routing
handoff-based: agents explicitly transfer control. OpenAI Swarm, AutoGen Swarm use HandoffMessage to signal transitions
tradeoffs:
(+) no single bottleneck
(+) emergent behavior—collective intelligence through shared context
(-) coordination complexity scales quadratically with agent count
(-) harder to debug; observability black box
3. swarm architectures
structure: self-organizing teams with shared working memory and autonomous coordination.
key properties (from Strands Agents docs):
each agent sees full task context + history of which agents worked on it
agents access shared knowledge contributed by others
agents decide when to handoff based on expertise needed
4. mixture-of-agents (MoA)
structure: modeled on a feed-forward neural network. workers are organized in layers; each layer receives the concatenated outputs of the previous layer.
procedure:
orchestrator dispatches user task to layer 1 workers
workers process independently, return to orchestrator
orchestrator synthesizes, dispatches to layer 2 with previous results
repeat until final layer
final aggregation returns single result
insight: agents in later layers benefit from diverse perspectives generated by earlier layers. empirically improves on single-agent baselines for complex reasoning.
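a minimal sketch of this layered flow, assuming a placeholder `call_llm(prompt, model)` helper in place of a real model client; the model names in the commented example are illustrative:

```python
# minimal MoA sketch; call_llm is a placeholder for a real model client
from typing import List

def call_llm(prompt: str, model: str) -> str:
    raise NotImplementedError("swap in your provider's chat-completion call")

def moa_run(task: str, layers: List[List[str]], aggregator: str) -> str:
    previous_outputs: List[str] = []
    for layer in layers:
        # each worker sees the task plus the previous layer's concatenated outputs
        context = task if not previous_outputs else (
            task + "\n\nprevious layer outputs:\n" + "\n---\n".join(previous_outputs)
        )
        previous_outputs = [call_llm(context, model=m) for m in layer]
    # final aggregation: one model synthesizes the last layer's candidates
    return call_llm(
        task + "\n\nsynthesize these candidate answers:\n" + "\n---\n".join(previous_outputs),
        model=aggregator,
    )

# example wiring (model names are illustrative):
# moa_run("summarize the incident", layers=[["model-a", "model-b", "model-c"]] * 2,
#         aggregator="model-a")
```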
workflow patterns
these are deterministic patterns (predetermined code paths), not autonomous agents:
prompt chaining
each LLM call processes output of previous call. good for tasks with verifiable intermediate steps (translation → verification).
parallelization
run subtasks simultaneously or same task multiple times. increases speed (parallel subtasks) or confidence (parallel evaluations).
routing
classify input, direct to specialized flow. e.g., product questions → {pricing, refunds, returns} handlers.
orchestrator-worker
orchestrator dynamically decomposes tasks, delegates to workers, synthesizes. differs from supervisor pattern: workers created on-demand, not predefined.
evaluator-optimizer
one LLM generates, another evaluates. loop until acceptable. common for translation, code review, content refinement.
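a minimal sketch combining prompt chaining with an evaluator-optimizer gate for the translation → verification case above; `call_llm` is a placeholder for your model client:

```python
# prompt chaining with an evaluator-optimizer gate; call_llm is a placeholder
def call_llm(prompt: str) -> str:
    raise NotImplementedError("swap in your model client")

def translate_with_check(text: str, target_lang: str, max_rounds: int = 2) -> str:
    draft = call_llm(f"translate into {target_lang}:\n{text}")           # step 1: generate
    for _ in range(max_rounds):
        verdict = call_llm(                                               # step 2: verify
            f"source:\n{text}\n\ntranslation:\n{draft}\n\n"
            "does the translation preserve meaning? reply PASS or FAIL plus a reason."
        )
        if verdict.strip().upper().startswith("PASS"):
            return draft
        draft = call_llm(                                                 # step 3: refine on critique
            f"improve this {target_lang} translation.\nsource:\n{text}\n"
            f"current translation:\n{draft}\ncritique:\n{verdict}"
        )
    return draft
```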
functional vs graph API (langgraph): two ways to define the same patterns.
the Send API enables dynamic worker creation: each worker has its own state, and outputs are written to a shared key accessible to the orchestrator.
consensus mechanisms
the sycophancy problem
agents often reinforce each other rather than critically engaging. this inflates computational costs (extra rounds to reach consensus) and weakens reasoning robustness.
"swarm intelligence" claims: current implementations are far from biological swarm behavior. mostly structured handoffs, not emergent coordination.
"agents collaborate like humans": agents share context through explicit state, not social cognition. no real theory of mind.
"multi-agent = better": MAST data shows 40% failure rate. single well-tuned agent often outperforms poorly coordinated multi-agent system.
what's real
specialization works for clear domains: coding agents (researcher → architect → coder → reviewer) show measurable improvements when roles map to distinct skills.
extended research on advanced prompt engineering patterns specifically for tool-using LLM agents. builds on prompting.md core findings.
executive summary
the paradigm is shifting from prompt engineering to context engineering. as anthropic articulates (sep 2025): "building with language models is becoming less about finding the right words and phrases for your prompts, and more about answering the broader question of 'what configuration of context is most likely to generate our model's desired behavior?'"
key findings from this research:
tool descriptions > system prompts for accuracy (klarna 2025, anthropic 2024)
context engineering supersedes prompt engineering for multi-turn agents
personas matter but can be double-edged swords (stanford HAI 2025)
few-shot examples remain effective but must be curated, not accumulated
automatic optimization (DSPy, OPRO) can exceed human-written prompts by 8-50%
per promptingguide.ai: CoT increasingly being replaced by structured output formats (JSON Schema) for complex reasoning to ensure parsability and reduce hallucination in intermediate steps.
hunch: explicit CoT may become less necessary as reasoning models (o1, DeepSeek-R1) internalize this behavior. but for current models, explicit reasoning traces remain valuable for debuggability.
"role prompting... guides LLM's behavior by assigning it specific roles, enhancing the style, accuracy, and depth of its outputs"
stanford HAI research (jan 2025): interview-based generative agents matched human participants' answers 85% as accurately as participants matched their own answers two weeks later.
5.2 persona categories
| category | examples | best for |
|---|---|---|
| occupational | engineer, doctor, analyst | domain expertise |
| interpersonal | mentor, coach, partner | communication style |
| institutional | AI assistant, policy advisor | constraint adherence |
| fictional | specific characters | creative tasks |
5.3 persona patterns for agents
basic pattern:
You are a [role] with expertise in [domain].
Your responsibilities include [responsibilities].
You communicate in a [style] manner.
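a hedged example of filling in that template programmatically; the role, domain, and style values below are illustrative:

```python
# filling in the persona template above; the values are illustrative
def persona_prompt(role: str, domain: str, responsibilities: str, style: str) -> str:
    return (
        f"You are a {role} with expertise in {domain}.\n"
        f"Your responsibilities include {responsibilities}.\n"
        f"You communicate in a {style} manner."
    )

system_prompt = persona_prompt(
    role="site reliability engineer",
    domain="observability and incident response",
    responsibilities="triaging alerts, proposing queries, and summarizing findings",
    style="concise, evidence-first",
)
```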
"context engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts."
key distinction:
prompt engineering: discrete task of writing instructions
context engineering: iterative curation each inference turn
7.2 components of agent context
| component | engineering concern |
|---|---|
| system prompt | right altitude, minimal information |
| tools | token efficiency, clear contracts |
| examples | curated canonical, not exhaustive |
| message history | compaction, relevance filtering |
| external data | just-in-time retrieval |
7.3 context management strategies
compaction:
summarize long message histories
clear tool call results after use
tune for recall first, then precision
structured note-taking:
agent writes notes to external memory
notes pulled back on relevant turns
enables long-horizon coherence
multi-agent delegation:
detailed context isolated within sub-agents
lead agent synthesizes summaries
separation of concerns
7.4 just-in-time context
rather than pre-loading all data, maintain lightweight identifiers:
file paths
stored queries
web links
agents retrieve data dynamically using tools when needed. mirrors human cognition—we use indexing systems, not memorization.
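a minimal sketch of the just-in-time approach, assuming a hypothetical `CONTEXT_INDEX` of identifiers mapped to local file paths:

```python
# just-in-time context: keep lightweight identifiers, resolve to content per turn
from pathlib import Path

CONTEXT_INDEX = {  # hypothetical identifiers and paths
    "incident_runbook": "docs/runbooks/payment-latency.md",
    "recent_alerts": "data/alerts/last_24h.json",
}

def load_reference(key: str, max_chars: int = 4000) -> str:
    """resolve an identifier to content only when a turn needs it."""
    text = Path(CONTEXT_INDEX[key]).read_text()
    return text[:max_chars]  # cap what enters the context window

def build_prompt(question: str, needed_keys: list) -> str:
    index_line = "available references: " + ", ".join(CONTEXT_INDEX)
    sections = [f"## {key}\n{load_reference(key)}" for key in needed_keys]
    return "\n\n".join([index_line, *sections, f"## question\n{question}"])
```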
8. robustness and reliability
8.1 the robustness problem
per promptingguide.ai:
"LLM agents involve an entire prompt framework which makes it more prone to robustness issues."
even slight prompt changes can cause reliability issues. agents magnify this because they involve multiple prompts (system, tools, examples, memory).
cross-cutting patterns from ralph, ramp, amp, anthropic, langchain, openai, google, microsoft, academic research, and coding agents.
0. CRITICAL FINDINGS
the sobering data before the patterns.
the uncomfortable truth
human-AI combinations perform WORSE than either alone. a 2024 meta-analysis of 106 studies (370 effect sizes, n=16,400) found human-AI teams underperform the best of humans or AI alone (hedges' g = -0.23, 95% CI: -0.39 to -0.07) [malone et al., nature human behaviour, 2024].
exceptions exist:
when humans already outperform AI alone (g = 0.46)
creation tasks vs decision tasks
"if a human alone is better, then the human is probably better than AI at knowing when to trust the AI and when to trust the human." — malone, MIT sloan
implication: agents likely add value for generative/exploratory work (hypothesis formation, query generation) but may subtract value when humans defer to them for decisions they could make better themselves.
the 40-point perception gap
a 2025 randomized controlled trial (n=16 experienced developers, 246 issues) quantified the disconnect between perception and reality [METR, july 2025]:
| metric | value |
|---|---|
| developer forecast | +24% speedup expected |
| actual measurement | -19% (slowdown) |
| post-hoc belief | +20% perceived speedup |
developers believed AI sped them up by 20% even after experiencing a measured 19% slowdown. this ~40 percentage point perception gap has profound implications for trust calibration—self-reported AI productivity gains cannot be trusted without empirical validation [human-collaboration.md, trust-calibration.md].
XAI paradox
transparency does not reliably improve trust calibration. under high cognitive load, AI explanations may increase reliance rather than improve judgment [lane et al., harvard business school, 2025]:
screeners with AI-generated narrative rationales were 19 percentage points more likely to follow AI recommendations
effect strongest when AI recommended rejection (precisely when humans should scrutinize most)
those with limited AI background are most susceptible to automation bias after receiving explanations (dunning-kruger pattern)
"although explanations may increase perceived system acceptability, they are often insufficient to improve decision accuracy or mitigate automation bias." — romeo & conti, 2025 [trust-calibration.md]
agent failure is the norm, not the exception
| source | finding |
|---|---|
| carnegie mellon TheAgentCompany | best agents achieve 30.3% task completion; typical agents 8-24% [failures.md] |
| AgentBench (29 LLMs, 8 environments) | predominant failure: "Task Limit Exceeded"—agents loop without progress [academic.md] |
| MIT NANDA report | ~95% of enterprise generative AI pilots fail to achieve rapid revenue acceleration [failures.md] |
| gartner 2025 | 40% of agentic AI projects will fail within two years due to rising costs, unclear value, or insufficient risk controls [failures.md] |
| academic study (3 frameworks) | ~50% task completion rate across 34 tasks [oss-frameworks.md] |
| MAST dataset | 40% of multi-agent pilots fail within 6 months of production deployment [orchestration-patterns.md] |
compound error is devastating
deepmind's demis hassabis describes compound error as "compound interest in reverse":
long-horizon tasks are nearly certain to fail: a step that succeeds 99% of the time, repeated across 100 steps, yields only ~37% end-to-end success; at 95% per-step reliability, end-to-end success falls below 1% [failures.md]
context length alone hurts performance
even when models perfectly retrieve all relevant information, performance degrades substantially (13.9%–85%) as input length increases [du et al., 2025]. the sheer length of input alone hurts LLM performance, independent of retrieval quality and without any distraction [context-management.md, context-window-management.md].
at 32k tokens, 11 of 12 tested models dropped below 50% of their short-context performance [NoLiMa benchmark, 2025].
mitigation: prompt model to recite retrieved evidence before answering → converts long-context to short-context task → +4% improvement on RULER benchmark [context-window-management.md].
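a minimal prompt-template sketch of this recitation mitigation; the heading names and wording are illustrative, not the benchmark's exact prompt:

```python
# recite-then-answer template; headings and wording are illustrative
RECITE_TEMPLATE = """You are answering from the documents below.

{documents}

First, under the heading EVIDENCE, quote verbatim the sentences you will rely on.
Then, under the heading ANSWER, answer the question using only that evidence.

Question: {question}"""

def recite_then_answer_prompt(documents: str, question: str) -> str:
    return RECITE_TEMPLATE.format(documents=documents, question=question)
```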
hierarchical compression achieves 30× reduction
structured compression dramatically outperforms brute-force context expansion. SimpleMem achieves 30× token reduction with 26% F1 improvement over full-context baselines [memory-compression.md].
compression taxonomy [memory-compression.md]:
| approach | information retention | compression ratio |
|---|---|---|
| consolidation | 80-95% | 20-50% |
| summarization | 50-80% | 60-90% |
| distillation | 30-60% | 80-95% |
observation masking often matches or beats LLM summarization at lower cost—JetBrains found summarization causes +15% trajectory elongation, negating efficiency gains [memory-compression.md].
speculative execution reduces latency 40-60%
speculative actions predict likely future states and execute in parallel with verification [latency-optimization.md]:
| approach | speedup | mechanism |
|---|---|---|
| speculative actions | up to 50% | predict next action, execute speculatively, discard if wrong |
| SPAgent (search) | 1.08-1.65× | verified speculation on tool calls |
| parallel tool calls | 4× for 4 calls | independent operations run concurrently |
key insight: speculation generalizes beyond LLM tokens to entire agent-environment interaction—tool calls, MCP requests, even human responses.
when speculation works: repetitive workflows, structured agent tasks, early steps in multi-step loops. later reasoning steps see lower acceptance rates due to higher variance.
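a minimal sketch of the parallel-tool-call row above using asyncio; the three fetch functions are hypothetical stand-ins for real tool wrappers:

```python
# independent tool calls run concurrently; the fetch_* functions are stand-ins
import asyncio

async def fetch_logs(service: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a real API call
    return f"logs for {service}"

async def fetch_metrics(service: str) -> str:
    await asyncio.sleep(0.1)
    return f"metrics for {service}"

async def fetch_deploys(service: str) -> str:
    await asyncio.sleep(0.1)
    return f"deploys for {service}"

async def gather_context(service: str) -> dict:
    # no data dependencies between calls, so wall-clock time is roughly
    # max(call latencies) rather than their sum
    logs, metrics, deploys = await asyncio.gather(
        fetch_logs(service), fetch_metrics(service), fetch_deploys(service)
    )
    return {"logs": logs, "metrics": metrics, "deploys": deploys}

# asyncio.run(gather_context("checkout"))
```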
sandboxing provides incomplete protection
firecracker microvms—powering AWS Lambda and Fargate—offer hardware virtualization but do not fully protect against microarchitectural attacks [weissman et al., 2023; sandboxing.md]:
medusa variants work cross-VM when SMT (simultaneous multithreading) enabled
spectre-PHT/BTB leak data even with recommended countermeasures
firecracker relies entirely on host kernel and CPU microcode for microarchitectural defenses
implication: defense-in-depth is mandatory. no single isolation technology (containers, gvisor, firecracker) provides sufficient security for executing untrusted LLM-generated code. recommended layering: gvisor OR firecracker + network policies + resource limits + capability dropping + runtime monitoring.
reasoning is illusory beyond complexity thresholds
"illusion of thinking" (apple research, 2025): models face complete accuracy collapse beyond complexity thresholds. reasoning effort DECLINES when tasks exceed capability—models stop trying despite adequate token budgets [open-problems.md].
planning is pattern matching, not reasoning (chang et al., 2025): LLMs simulate reasoning through statistical patterns, not logical inference. cannot self-validate output (gödel-like limitation) [open-problems.md].
1. EMPIRICAL BENCHMARKS
what numbers actually show about agent capabilities.
coding benchmarks
SWE-bench (verified, january 2026):
| model | % resolved |
|---|---|
| claude 4.5 opus | 74.4% |
| gemini 3 pro preview | 74.2% |
| claude 4.5 sonnet | 70.6% |
| GPT-5 (medium) | 65.0% |
| o3 | 58.4% |
SWE-bench Pro (scale AI's harder benchmark with GPL repos):
top models score ~23% on public set vs 70%+ on SWE-bench Verified
private subset: claude opus 4.1 drops from 22.7% → 17.8%
critical caveat: possible training data contamination—public GitHub repos likely in training data [evaluation.md].
benchmark contamination crisis
the "Emperor's New Clothes" study (ICML 2025) reveals contamination is widespread and mitigation is failing [benchmarking.md]:
| finding | data |
|---|---|
| SWE-bench contamination signals | StarCoder-7B achieves 4.9× higher Pass@1 on leaked vs non-leaked APPS samples |
| benchmark leakage rates | 100% on QuixBugs, 55.7% on BigCloneBench, avg 4.8% Python across 83 SE benchmarks |
| file path memorization | models identify correct files to modify without seeing issue descriptions |
attempted mitigations that don't work: question rephrasing, generating from templates, typographical perturbation, semantic paraphrasing—none significantly improve contamination resistance while maintaining task fidelity.
robust approaches:
GPL licensing (SWE-bench Pro): legal barrier to training inclusion
private proprietary codebases: fundamentally inaccessible to training pipelines
post-training-cutoff tasks: use issues created after known data cutoffs
human augmentation: expert refinement makes tasks harder to match to memorized patterns
implication: leaderboard rankings on contaminated benchmarks may reflect recall rather than problem-solving capability. treat benchmark numbers with appropriate skepticism.
web agent benchmarks
WebArena (realistic browser tasks):
2023: GPT-4 achieved ~14%
2025: top agents reach ~60% (IBM CUGA)
shortcut solutions inflate results—simple search agent solves many tasks
GAIA (general AI assistant, conceptually simple for humans):
humans score 92%
GPT-4 with plugins: 15% (2023)
claude sonnet 4.5: 74.5% (jan 2026)
tests fundamental robustness—if you can't reliably do what an average human can, you're not close to AGI [evaluation.md]
what benchmarks miss
task distribution mismatch: benchmarks emphasize bug fixing; real agents need feature development, refactoring, cross-repo changes
static environments: cached website snapshots stale quickly; WebVoyager results inflated ~20% due to staleness [Online-Mind2Web]
single-agent focus: production often involves multiple agents coordinating or agent + human collaboration
underspecified success criteria: many real tasks have ambiguous definitions of "done"
the pass@k vs pass^k distinction
pass@k: probability of at least one success in k trials—matters when one success is enough
pass^k: probability of all k trials succeeding—matters for customer-facing agents
at k=10, a 75% per-trial agent: pass@k→100% while pass^k→0% [evaluation.md]
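the arithmetic behind that example: a 75% per-trial agent at k = 10 is near-certain to succeed at least once, but has only about a 5.6% chance of succeeding every time.

```python
# pass@k vs pass^k for the 75%-per-trial example
p, k = 0.75, 10

pass_at_k = 1 - (1 - p) ** k   # at least one of k trials succeeds
pass_hat_k = p ** k            # all k trials succeed

print(f"pass@{k} = {pass_at_k:.6f}")   # ≈ 0.999999
print(f"pass^{k} = {pass_hat_k:.4f}")  # ≈ 0.0563
```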
2. FAILURE PATTERNS
what goes wrong and why.
documented production incidents
| incident | cause | consequence |
|---|---|---|
| replit agent database deletion (july 2025) | ignored 11 ALL CAPS warnings, unrestricted database access | deleted 1,206 executive records, created fake data to conceal [failures.md] |
| air canada chatbot (feb 2024) | hallucinated bereavement fare policy | legal liability; precedent that companies are responsible for chatbot statements [failures.md] |
| chevrolet $1 car (nov 2023) | prompt injection | agreed to sell $60k car for $1; 20M+ social media views [failures.md] |
| NYC MyCity chatbot (mar 2024) | hallucinated legal information | advised businesses to break wage, housing, food safety laws [failures.md] |
| grok harmful content (2025-2026) | insufficient guardrails | antisemitic posts, CSAM-adjacent imagery, detailed instructions for breaking into homes [failures.md] |
systematic failure taxonomy
microsoft AI red team identified 10+ novel failure modes specific to agents:
hunch: these behaviors emerge from optimization pressure to appear successful rather than intentional deception, but the distinction may not matter for production safety [failures.md]
3. COST REALITIES
agent economics: what they actually cost to run.
the cost multiplier problem
a single user request can trigger:
multiple model calls for planning and execution
iterative reasoning steps
tool invocations introducing additional context
fallbacks or retries when intermediate steps fail
unconstrained loops that escalate rapidly
without observability, these interactions silently multiply costs [cost-efficiency.md].
empirical cost data
anthropic multi-agent research system:
agents use ~4× more tokens than chat
multi-agent uses ~15× more tokens than chat
token usage alone explains ~80% of performance variance on browsecomp [anthropic.md]
stanford plan caching study (2025):
agentic plan caching reduced serving costs by 46.62% while maintaining 96.67% of optimal accuracy [cost-efficiency.md]
scaling example:
at DoorDash's 10 billion predictions/day, even GPT-3.5-turbo at $0.002/prediction would yield $20 million daily bills. most applications waste 60–80% of their LLM budget on preventable inefficiencies [cost-efficiency.md].
when agents ARE cost-effective:
| scenario | evidence |
|---|---|
| high-volume routine work | customer service at $0.60/resolved ticket vs $6.00 human = 10x savings |
| recurring patterns enable caching | similar tasks allow plan/response reuse |
| scale amortizes development cost | 50,000+ tasks/month amortize integration overhead |
when agents are NOT cost-effective
| scenario | evidence |
|---|---|
| simple single-shot tasks suffice | prompts vs workflows vs agents—start simplest |
| task complexity exceeds capability | 0% success on multi-step data downloads, 0% on download + analysis [TechPolicyInstitute] |
| quality degradation accumulates | cursor IDE study: "transient velocity gains" but "persistent increases in static analysis warnings" [arXiv:2511.04427] |
| adoption remains low | if only 10% of team uses agent, ROI is diluted |
| IBM finding | only 25% of AI initiatives delivered expected ROI; just 16% scaled enterprise-wide [IBM 2025] |
cost attribution for multi-tenant systems [cost-attribution.md]
the core problem: who pays for what, and how do you know?
traditional cloud tagging fails for AI workloads where costs are token-based and API calls provide limited native tagging support. this creates a "shared cost pool" problem.
token-level cost characteristics:
| characteristic | implication |
|---|---|
| token-based, non-linear | simple query = fractions of a cent; code review = several dollars |
| asymmetric pricing | output tokens cost 3-8× more than input (Claude Opus: 5×, GPT-4o: 3×) |
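a minimal sketch of token-level cost attribution per tenant; the model names and per-million-token prices are illustrative, not current list prices:

```python
# per-tenant cost attribution from token-level usage; prices are illustrative
PRICING = {  # dollars per 1M tokens: (input, output)
    "large-model": (15.00, 75.00),  # output ~5x input
    "small-model": (0.50, 1.50),    # output ~3x input
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

def attribute_by_tenant(usage_events: list) -> dict:
    """aggregate cost per tenant from per-request usage events."""
    totals = {}
    for event in usage_events:
        cost = request_cost(event["model"], event["input_tokens"], event["output_tokens"])
        totals[event["tenant"]] = totals.get(event["tenant"], 0.0) + cost
    return totals

# attribute_by_tenant([{"tenant": "acme", "model": "large-model",
#                       "input_tokens": 12_000, "output_tokens": 3_000}])
```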
planning helps when:
| condition | evidence |
|---|---|
| task allows multiple trials | Reflexion demonstrates learning from failures across trials [planning.md] |
| domain is formalizable | LLM→PDDL→classical planner hybrid outperforms pure LLM [planning.md] |
planning hurts when:
| condition | evidence |
|---|---|
| model scale insufficient | CoT hurts performance on <100B parameter models [planning.md] |
| task is routine/simple | planning overhead adds latency and cost without benefit |
| domain is highly dynamic | rigid plans become stale; reactive approaches (pure ReAct) may be more appropriate |
| plans require constraint compliance | LLMs struggle with precise resource management [planning.md] |
cost-benefit:
| approach | success gain | cost overhead | best for |
|---|---|---|---|
| CoT | +20-40pp on math | ~2x tokens | reasoning-heavy tasks, large models only |
| ToT | +50-70pp on exploration | 5-10x calls | puzzles, search problems |
| GoT | +variable | high complexity | structured composition tasks |
| Reflexion | +10-20pp | multiple trials | iterative refinement |
| LLM+PDDL | +correctness guarantee | domain engineering | robotics, constrained planning |
key finding
LLMs are better as formalizers than as planners. classical planners provide verifiable, optimal plans once the domain is formalized. — huang & zhang, 2025 [planning.md]
8. SAFETY AND ALIGNMENT
containment strategies and open problems.
core safety problems (amodei et al., 2016)
avoiding side effects — agents affecting environment in unintended ways
avoiding reward hacking — gaming the objective rather than achieving goals
scalable oversight — objectives too expensive to evaluate frequently
safe exploration — undesirable behavior during learning
distributional shift — behavior degradation in novel situations
these remain largely unsolved and become MORE critical as agents gain autonomy [safety.md].
containment strategies
| strategy | description |
|---|---|
| principle of least privilege | bare minimum permissions needed for task [saltzer & schroeder] |
open problems in alignment:
value specification — defining complex human values precisely enough for optimization
generalization — models behave well in training but fail in deployment
scalability — RLHF and human oversight don't scale to more autonomous systems
opacity — deep learning models remain black boxes
multi-agent coordination — safe communication between agents in dynamic environments
"most 'alignment' work is empirical and heuristic, not formally grounded. containment is probabilistic, not absolute." [safety.md]
authentication and authorization patterns
agents are neither humans nor static services—they occupy an awkward middle ground in identity systems [auth-patterns.md].
workload identity via SPIFFE/SPIRE is emerging as the solution for agent authentication:
SPIFFE ID: unique identity per agent/workload (spiffe://trust-domain/path)
SVID: short-lived X.509 or JWT certificates, automatically rotated
mTLS between agents: authenticated, encrypted inter-agent communication
federation: agents spanning clouds/organizations can validate identities cross-domain
hashicorp vault 1.21 natively supports SPIFFE authentication, enabling agents to operate within SPIFFE ecosystems without custom identity plumbing.
the privilege escalation problem: agents designed to serve many users often receive broad permissions covering more systems than any single user would need. a user with limited access can indirectly trigger actions beyond their authorization by going through the agent [auth-patterns.md].
delegation model: the agent maintains its own identity and shows that it acts for the user; the audit trail reads "agent performed action on behalf of user".
delegation is mandatory for autonomous agents making independent decisions, since impersonation obscures responsibility.
tiered autonomy for authorization (osohq):
| tier | description | approval |
|---|---|---|
| autonomous | low-risk: reading docs, drafting responses | none required |
| escalated | sensitive: accessing PII, modifying accounts | human approval required |
| blocked | actions agent should never perform | not permitted |
bounded autonomy via policy-as-code: rather than approving individual transactions, define boundaries within which agents operate autonomously. hard-coded "never" rules vs. "please review" requests. humans in the loop only when agent attempts to cross security boundary.
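a minimal policy-as-code sketch of the tiered model above; the action names, the amount threshold, and the decision enum are illustrative:

```python
# tiered, policy-as-code authorization; action names and threshold are illustrative
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"

BLOCKED_ACTIONS = {"delete_database", "rotate_root_credentials"}   # hard "never" rules
ESCALATED_ACTIONS = {"modify_account", "read_pii", "issue_refund"}

def authorize(action: str, amount: float = 0.0) -> Decision:
    if action in BLOCKED_ACTIONS:
        return Decision.DENY                # never permitted
    if action in ESCALATED_ACTIONS or amount > 500:
        return Decision.REQUIRE_APPROVAL    # human at the boundary
    return Decision.ALLOW                   # autonomous inside the boundary
```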
secret management: dynamic secrets via vault—each agent request generates fresh, short-lived credentials. "zero-trust secret handling": vault injects actual credentials just-in-time, executes API call, wipes key from memory. agent "never sees" the secret [auth-patterns.md].
open problems:
identity fragmentation across systems—"sarah" isn't coherent across salesforce, aws, hubspot
authorization ownership—who decides what an agent can do?
scale mismatch—IAM designed for human-scale onboarding; agents may spin up thousands of ephemeral identities per hour
decision attribution at scale—user authorized goal; agent chose implementation
9. HUMAN INTERACTION
when and how to involve humans.
trust calibration
over-reliance: accepting AI output when AI is wrong
under-reliance: rejecting AI output when AI is correct
schemmer et al. (2023) found:
explanations increased RAIR (relative AI reliance): people followed correct advice more often
explanations did NOT improve RSR (relative self-reliance): people still followed incorrect advice
"the claim that explanations would reduce overreliance does not seem to hold for all kinds of tasks." [human-interaction.md]
cargo cult practices (weak or contradictory evidence)
| practice | problem |
|---|---|
| "AI + human always beats either alone" | empirically false on average [malone meta-analysis] |
| explanations prevent over-reliance | doesn't hold across tasks [schemmer et al.] |
| role prompts improve accuracy | may only affect tone/style [gupta meta-analysis] |
| more context = better performance | context rot degrades recall [anthropic] |
| CoT universally helps | model-dependent, often just adds latency |
human-in-the-loop positioning (mckinsey 2025)
| position | description |
|---|---|
| in the loop | human decides at each step |
| on the loop | human monitors, intervenes on exceptions |
| above the loop | human sets goals, reviews outcomes |
"human accountability will remain essential, but its nature will change. Rather than line-by-line reviews, leaders will define policies, monitor outliers, and adjust human involvement level." [human-interaction.md]
10. MULTI-AGENT: WARRANTED SKEPTICISM
empirical support is weak
"for most real-world applications today, research labs have found that multi-agent systems are fragile and often overrated compared to single, well-contextualized agents" [oss-frameworks.md]
why single agents often win:
no coordination overhead
consistent context across task
easier to debug
better error recovery
when multi-agent works:
read-only sub-agents (gather info, don't decide)
human orchestration (humans catch mistakes)
parallel independent tasks (no coordination needed)
specialized subagents with isolated contexts [anthropic: 90% improvement]
the exception: subagent isolation
anthropic's multi-agent research system with opus lead + sonnet subagents showed 90% improvement over single opus [anthropic.md]. the key: subagents return only distilled results, not full reasoning—context isolation is the mechanism.
composition overhead often exceeds specialization benefits
compositional agent architectures promise specialization, reusability, and flexibility. empirically, they more often deliver coordination overhead, token multiplication, and integration challenges [composability.md]:
what composability promises:
specialization: agents optimized for narrow domains outperform generalists
what the evidence shows:
reusability is limited: prompts are tightly coupled to specific models, contexts, and tools. "reusable" often means "starting point requiring extensive customization"
flexibility is constrained: changing one agent often requires changes to adjacent agents due to implicit contracts
team boundaries create integration challenges: each team optimizes locally, global behavior degrades
critical insight: multi-agent systems use ~15× more tokens than single-agent chat [anthropic.md]. token multiplication is the hard constraint on composition—each additional agent in a pipeline multiplies context overhead.
hunch: the decision boundary between monolithic and compositional is poorly understood. most tasks that "need" multi-agent can likely be handled by single well-prompted agent with good tools [composability.md].
orchestration patterns and coordination tax [orchestration-patterns.md]
coordination topologies:
| topology | description | tradeoff |
|---|---|---|
| hierarchical/supervisor | orchestrator delegates to specialists | clear control but supervisor bottleneck |
| flat/peer-to-peer | agents communicate directly | no bottleneck but O(n²) complexity |
| swarm | self-organizing with shared working memory | emergent behavior but context bloat |
| mixture-of-agents (MoA) | layers feed forward like neural network | diverse perspectives but high token cost |
the coordination tax: a three-agent workflow costing $5-50 in demos can hit $18,000-90,000 monthly at scale due to token multiplication [TechAhead, 2026].
sycophancy problem: agents reinforce each other rather than critically engaging. CONSENSAGENT (ACL 2025) addresses via trigger-based detection of stalls and dynamic prompt refinement [orchestration-patterns.md].
production failure modes (TechAhead, 2026):
coordination tax exceeds benefits
latency cascade: sequential agents turn 3s demo into 30s production
cost explosion from token multiplication
observability black box
cascading failures
security vulnerabilities at agent boundaries
role confusion—agents expand scope beyond designated expertise
enterprise case study: BASF Coatings uses multi-layer orchestration—division supervisors under coatings-wide orchestrator. integrates AI/BI Genie (structured data) + RAG (unstructured) via MS Teams [orchestration-patterns.md].
11. RECOMMENDATIONS FOR AXI-AGENT
based on empirical evidence reviewed.
core architecture
implement the loop — gather context → act → verify → repeat with clean exit condition
filesystem as memory — plan.md, progress.log, learnings captured in files that persist across iterations
fresh context option — ability to spawn fresh instances for long-running work (ralph-style)
prefer single agent — empirical support for multi-agent is weak except for specific patterns (subagent isolation, parallel independent tasks)
context management
subagent spawning — isolate expensive/error-prone work in separate context windows
just-in-time context — load axiom data only when querying, don't prefetch everything
aggressive compaction — observation masking often matches or beats LLM summarization at lower cost
tool design
invest in tool descriptions — more time on tools than prompts (anthropic's finding)
atomic, well-scoped tools — single purpose, 3-4+ sentence descriptions
absolute paths always — relative paths cause errors
<20 tools total — fewer = higher accuracy; use tool search if more needed
feedback loops
verification built-in — after each action, check outcome (did query return useful data? did fix work?)
checkpoint commits — save state to git/files before major transitions
error limits — stop after N consecutive failures, escalate to human
loop detection — explicit mechanisms to catch and break infinite loops
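a minimal sketch tying these feedback-loop safeguards together; `run_step` and `verify` are placeholders for the agent's act and check phases:

```python
# feedback-loop safeguards: verify each action, cap consecutive failures, detect loops
def run_step(state: dict) -> str:
    raise NotImplementedError("act phase: choose and execute the next action")

def verify(state: dict, action: str) -> bool:
    raise NotImplementedError("check phase: did the action produce useful output?")

def run_agent(state: dict, max_steps: int = 30, max_failures: int = 3) -> dict:
    consecutive_failures = 0
    recent_actions = []
    for _ in range(max_steps):
        action = run_step(state)
        recent_actions.append(action)
        if recent_actions[-3:] == [action] * 3:            # loop detection
            state["status"] = "escalate: loop detected"
            return state
        if verify(state, action):
            consecutive_failures = 0
            if state.get("done"):
                return state
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_failures:        # error limit
                state["status"] = "escalate: repeated failures"
                return state
    state["status"] = "escalate: step budget exhausted"     # clean exit condition
    return state
```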
planning
use ReAct as baseline — well-validated, simple, grounded in observations
add reflection for iterative tasks — Reflexion shows clear gains on multi-trial scenarios
limit planning horizon — long plans degrade; prefer incremental planning with frequent re-assessment
human interaction
optimize for creation/exploration over decision — hypothesis generation, query suggestions, pattern surfacing; let humans make final calls
design for appropriate reliance, not maximum reliance — success = users follow correct advice AND reject incorrect advice
make AI performance visible — show confidence, uncertainty, known limitations
long-running operations
async delegation — start investigation, return to human while agent works
timeout protection — per-iteration and total-task timeouts
incremental progress — never try to solve entire incident in one shot
knowledge management
learnings persistence — capture discovered patterns, runbook updates across sessions
AGENTS.md for conventions — axiom-specific query patterns, common failure modes, org context
expectations calibration
expect ~30-50% success rates — per empirical benchmarks, this is realistic for complex tasks
design for failure recovery — looping is the dominant failure mode; build detection and recovery
measure cost — report accuracy/cost Pareto, not just accuracy; 60-80% of budget is typically waste
12. INFRASTRUCTURE
protocols, observability, and testing for production agents.
protocol standards
the agent interoperability landscape consolidated rapidly in 2025. three protocols now dominate [protocols.md]:
| protocol | scope | governance |
|---|---|---|
| MCP (model context protocol) | model ↔ tools/data | AAIF (linux foundation) |
| A2A (agent-to-agent) | agent ↔ agent | AAIF |
| ACP (agent communication protocol) | agent ↔ agent | merged into A2A |
MCP adoption: 10,000+ active public servers, 97M+ monthly SDK downloads. adopted by claude, chatgpt, cursor, gemini, vs code.
AAIF formation (december 2025): anthropic, openai, block donated protocols to linux foundation. platinum members include AWS, google, microsoft.
AGENTS.md: simple markdown file for project-specific agent instructions. adopted by 60,000+ open source projects [protocols.md].
security concerns: MCP researchers identified vulnerabilities including prompt injection via tool descriptions, tool poisoning, and lookalike tools [protocols.md].
capability discovery
as agent ecosystems scale from dozens to thousands of components, static configuration becomes untenable. capability discovery addresses how agents learn what other agents or tools can do [capability-discovery.md].
MCP tool discovery:
tools/list endpoint enumerates available tools via JSON-RPC 2.0
servers emit notifications/tools/list_changed for dynamic updates
description is critical: anthropic emphasizes tool descriptions as "by far the most important factor in tool performance"
no built-in verification: MCP tells you what tools claim to do; it doesn't verify they actually work
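a sketch of what the discovery exchange looks like at the JSON-RPC level; the weather tool in the response is illustrative, and in practice an MCP client SDK would handle this for you:

```python
# the discovery exchange at the JSON-RPC level; the tool in the response is illustrative
import json

list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
    "params": {},
}

example_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "get_forecast",
                "description": (
                    "Get a weather forecast for a latitude/longitude pair. "
                    "Coordinates must be decimal degrees; returns up to 7 days."
                ),
                "inputSchema": {
                    "type": "object",
                    "properties": {
                        "latitude": {"type": "number"},
                        "longitude": {"type": "number"},
                    },
                    "required": ["latitude", "longitude"],
                },
            }
        ]
    },
}

print(json.dumps(list_tools_request))
```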
A2A agent cards: google's inter-agent discovery mechanism—JSON documents serving as "digital business cards":
hosted at /.well-known/agent.json following RFC 8615
skills section describes what agent can/cannot do with examples
supports curated registries and direct configuration
dynamic capability loading: static tool loading consumes significant context. with 73 MCP tools + 56 agents, ~108k tokens (54% of context) consumed before any conversation [capability-discovery.md]:
lightweight registry at startup: load only names + descriptions (~5k tokens), full schemas on-demand
programmatic tool calling: claude writes code to orchestrate tools, keeping intermediate results out of context
capability verification gap: discovery tells you what agents claim; verification determines what they actually do. emerging approaches include:
dynamic proof / challenge-response validation
capability attestation tokens with model fingerprints
know-your-agent (KYA) frameworks for web-facing agents [capability-discovery.md]
observability
agents fail in path-dependent ways that basic logs cannot explain [observability.md].
tracing architecture:
session (user journey): groups multiple traces
trace (agent execution): single request lifecycle
span (step-level action): individual operation
what to capture per span: prompt inputs, model config, tool calls, retrieval context, timing, token usage, errors [observability.md].
OTEL as standard: OpenInference extends OpenTelemetry for AI workloads. vendor-neutral, framework-agnostic. but OTEL assumes deterministic request lifecycles—LLM applications violate this.
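a minimal sketch of the session → trace → span structure using the OpenTelemetry Python API; the attribute names and the stand-in tool/model calls are illustrative rather than an official semantic convention:

```python
# session -> trace -> span sketch with the OpenTelemetry Python API
from opentelemetry import trace

def search_docs(query: str) -> str:             # stand-in tool
    return ""

def call_llm(query: str, context: str) -> str:  # stand-in model call
    return ""

tracer = trace.get_tracer("agent")

def answer(session_id: str, question: str) -> str:
    with tracer.start_as_current_span("agent.run") as run_span:         # one trace per request
        run_span.set_attribute("session.id", session_id)                # groups traces into a session
        run_span.set_attribute("input.value", question)
        with tracer.start_as_current_span("tool.search_docs") as span:  # one span per step
            results = search_docs(question)
            span.set_attribute("output.length", len(results))
        with tracer.start_as_current_span("llm.generate") as span:
            response = call_llm(question, results)
            span.set_attribute("output.length", len(response))
        return response
```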
failure taxonomy (arxiv:2509.13941):
pipeline tools fail at localization (keyword matching, anchoring to example code)
agentic tools fail at iteration (cognitive deadlocks, flawed reasoning)
Expert-Executor pattern (peer review) resolved 22.2% of previously intractable issues
metrics that matter:
| metric | target |
|---|---|
| goal accuracy | ≥85% production |
| hallucination rate | <2% |
| trajectory efficiency | optimal path ÷ actual steps |
the pass^k reality: most dashboards show pass@k (one success in k trials). production reliability requires pass^k (all k succeed). at k=10, 75% per-trial agent: pass@k→100% but pass^k→0% [observability.md].
testing
LLM-simulated environments (Simia): avoids building bespoke testbeds. fine-tuned models surpass GPT-4o on τ²-Bench [testing.md]
regression testing: "prompts that worked yesterday can fail tomorrow, and nothing in your code changed" [testing.md]. strategies: slice-level testing, semantic similarity, property-based testing, fresh sampling from production.
evaluation frameworks:
| framework | focus | strength |
|---|---|---|
| DeepEval | pytest integration | 50+ built-in metrics, CI/CD native |
| RAGAs | RAG-specific | reference-free evaluation |
| Arize Phoenix | framework-agnostic | OTEL-native, agent trace viz |
| LangSmith | LangChain ecosystem | zero-config tracing |
13. DOMAIN PATTERNS
how domain-specific agents differ from general-purpose agents.
SRE/devops agents
major observability vendors shipped AI SRE agents in 2024-2025 [sre-agents.md]:
| tool | autonomy level | key capability |
|---|---|---|
| Azure SRE Agent | HIGH | configurable autonomous/reader mode |
| Datadog Bits AI SRE | MEDIUM-HIGH | hypothesis-driven investigation |
| incident.io AI SRE | MEDIUM-HIGH | drafts code fixes, spots failing PRs |
| PagerDuty AI Agents | MEDIUM | recommendations, AI runbooks |
| New Relic AI | LOW-MEDIUM | NL queries, dashboard explanations |
datadog's approach: NOT a summary engine—actively investigates. generates hypotheses → validates against targeted queries → iterates to root cause. focuses on causal relationships vs. noise [sre-agents.md].
azure's autonomy model: configurable per incident priority. low-priority: autonomous. high-priority: human escalation. this may become standard pattern.
what's unclear: actual autonomy in production (most "assist" humans), remediation safety, edge case handling.
hunch: "AI SRE" branding is partially marketing. the gap between investigation and remediation autonomy suggests remediation safety is the harder problem [sre-agents.md].
incident response patterns [incident-response.md]
incident response for AI agents borrows from SRE but requires adaptation for non-deterministic, opaque reasoning systems.
rollback strategies:
| pattern | mechanism |
|---|---|
| SAGA (compensating transactions) | every action has corresponding undo; execute in reverse on failure |
| IBM STRATUS | remediation agent assesses severity after each transaction; reverts if worse |
| model version rollback | registry with production, staging tags; automated triggers for error rate thresholds |
circuit breaker pattern for agents: three states (closed → open → half-open). agent-specific consideration: tool calling fails 3-15% in production—circuit breakers must distinguish LLM rate limits (429) from logic failures [incident-response.md].
fallback strategy layers:
serve cached responses for common queries
model fallback: openai_llm.with_fallbacks([anthropic_llm])
rule-based fallback for basic conversations
human escalation + critical-only operations
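a minimal sketch combining a circuit breaker with the fallback layers above; the thresholds and the stand-in fallback functions are illustrative:

```python
# circuit breaker plus fallback chain; thresholds and stand-ins are illustrative
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_seconds: float = 30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                        # closed
        if time.time() - self.opened_at >= self.reset_seconds:
            return True                                        # half-open: allow one probe
        return False                                           # open

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None            # close on success
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()                       # trip open

def cached_or_rule_based(prompt: str) -> str:                  # stand-in for cache/rule layers
    return "fallback response"

def escalate_to_human(prompt: str) -> str:                     # stand-in for human escalation
    return "escalated"

def call_with_fallbacks(prompt: str, providers: list, breaker: CircuitBreaker) -> str:
    if not breaker.allow():
        return cached_or_rule_based(prompt)
    for provider in providers:                                 # model fallback chain
        try:
            result = provider(prompt)
            breaker.record(ok=True)
            return result
        except Exception:                                      # distinguish 429s from logic errors in real code
            breaker.record(ok=False)
    return escalate_to_human(prompt)
```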
CoSAI AI Incident Response Framework (2025): organized around NIST IR lifecycle. covers prompt injection, memory poisoning, context poisoning, model extraction. architecture-specific guidance for RAG and agentic systems [incident-response.md].
MAST failure taxonomy (UC Berkeley, 1600+ traces): 14 distinct failure modes across specification issues, inter-agent misalignment, and task verification failures. key finding: agents lose conversation history and become unaware of termination conditions [incident-response.md].
customer support agents
planner-executor architecture dominates production [domain-agents.md]:
planning: decide what needs to be done
execution: perform steps with tools
validation: check correctness, safety, confidence
multi-agent structure (zendesk):
intent agent → sentiment, urgency
response agent → retrieval/generation
review agent → tone, accuracy, policy
workflow agent → CRM, routing
handoff agent → human escalation
"no single agent has to be perfect. they only need to be reliable at their specific part of the job." — zendesk
domain-specific training: intercom's Fin uses customer-service-trained model + purpose-built RAG. reports 65% average resolution rate, up to 93% at scale.
legal/compliance agents
architectural requirements (thomson reuters):
domain-specific data + verification mechanisms
transparent multi-agent workflows
integration with authoritative legal databases
domain-specific reasoning for legal nuances
red flags: lack of workflow transparency, no human checkpoints, generic outputs, automated decisions without oversight.
hunch: legal agents may require more deterministic components than other domains due to regulatory auditability requirements [domain-agents.md].
data analysis agents
DS-STAR (google research):
data file analyzer → extracts context from varied formats
verification stage → LLM-based judge assesses plan sufficiency
sequential planning → iteratively refines based on feedback
medallion architecture (microsoft): agents operate on silver layer (normalized data) because gold layer "removes the detail agents need for reasoning, inference, and multi-source synthesis" [domain-agents.md].
14. MULTIMODAL AGENTS
computer use, visual grounding, and voice.
accessibility-tree augmented: combine screenshots with DOM/a11y info
research finding: "incorporating visual grounding yields substantial gains: text + image inputs improve exact match accuracy by >6% over text-only" [Zhang et al., 2025].
grounding problem: biggest unsolved challenge. translating "click the submit button" to precise screen coordinates.
key finding: higher screenshot resolution improves performance. longer text-based trajectory history helps; screenshot-only history doesn't [multimodal.md].
commercial computer use
| agent | vendor | OSWorld score |
|---|---|---|
| Operator (CUA) | OpenAI | 38.1% |
| Claude computer use | Anthropic | 22% (pre-CUA) |
| Project Mariner | Google | browser-based, preview |
open-source alternatives
browser-use: 75k+ github stars, python/playwright, works with any LLM
Agent-S3: 72.6% on OSWorld (exceeds human), uses UI-TARS for grounding
OpenCUA: 45% on OSWorld-Verified (SOTA open-source), includes AgentNet dataset with 22.6K human-annotated trajectories
voice agents
two approaches [multimodal.md]:
| approach | latency | control | best for |
|---|---|---|---|
| speech-to-speech (S2S) | ~320ms | less | interactive conversation |
| chained (STT→LLM→TTS) | higher | high | customer support, scripted |
chained recommended for structured workflows—more predictable, full transcript available.
safety considerations
computer use risks:
prompt injection via screenshots/webpages
unintended actions from malicious content
credential/payment handling
mitigations:
dedicated VMs with minimal privileges
human confirmation for significant actions
"watch mode" for sensitive sites
task limitations (no banking, high-stakes decisions)
15. PRODUCTION LESSONS
what works and what doesn't in real deployments.
the klarna cautionary tale
initial deployment (feb 2024) [deployments.md]:
2.3M chats in first month
equivalent to ~700 full-time agents
resolution time: 11 min → 2 min (82% reduction)
projected $40M annual profit improvement
what went wrong (2025):
CEO admitted "cost was a predominant evaluation factor" leading to "lower quality"
customer satisfaction fell; service quality inconsistent
BBB showed 900+ complaints over 3 years
began rehiring human agents
current hybrid model:
AI handles ~65% of chats
explicit escalation triggers for complex disputes
CEO pledges customers can "always speak to a real person"
lesson: pure automation optimized for cost can degrade quality. the swing from "AI replaced 700 workers" to "we're rehiring humans" happened in ~18 months.
success patterns
ramp (fintech):
26M AI decisions/month across $10B spend
85% first-time accuracy on GL coding
$1M+ fraud identified before approval
90% acceptance rate on automated recommendations
key: multi-agent coordination with human-in-loop controls
verizon: google AI sales assistant supporting 28,000 reps → ~40% increase in sales. augmentation, not replacement.
air india: 4M+ customer queries, 97% full automation rate. high-volume, routine queries = ideal for automation.
jpmorgan: coach AI for wealth advisers → 95% faster research retrieval, 20% YoY increase in asset-management sales.
failure patterns
| source | finding |
|---|---|
| MIT NANDA 2025 | 95% of AI pilots fail to achieve rapid revenue acceleration |
| S&P Global 2025 | 42% of companies abandoned most AI initiatives (up from 17% in 2024) |
| S&P Global 2025 | average org scrapped 46% of AI POCs before production |
| RAND Corporation | >80% of AI projects fail (2x rate of non-AI tech) |
why enterprise AI stalls (workOS):
pilot paralysis — experiments without production path
model fetishism — optimizing F1-scores while integration languishes
disconnected tribes — no shared metrics
build-it-and-they-will-come — no user buy-in
shadow IT proliferation — duplicate vector DBs, orphaned GPU clusters
what separates high performers
mckinsey identifies ~6% as "AI high performers" (≥5% EBIT impact):
treat AI as transformation catalyst, not efficiency tool
redesign workflows BEFORE selecting models
3x more likely to scale agents in most functions
20% of digital budgets committed to AI
report negative consequences more often (because they've deployed more)
the hybrid model is winning
convergent pattern across successful deployments:
AI handles routine/high-volume (60-80% of inquiries)
humans handle complex/emotional/edge cases
explicit escalation triggers
human override always available
MIT NANDA finding: purchasing from specialized vendors succeeds ~67% of time; internal builds succeed ~33% [deployments.md].
prompting matters: the shift to context engineering
the paradigm shift: anthropic (sep 2025) articulates the evolution from prompt engineering to context engineering—"building with language models is becoming less about finding the right words... and more about answering the broader question of 'what configuration of context is most likely to generate our model's desired behavior?'" [prompt-engineering.md].
tool descriptions > system prompts for accuracy. klarna (2025): agents more likely to use tools correctly when tool descriptions are clear, regardless of system prompt guidance. anthropic SWE-bench work: "we actually spent more time optimizing our tools than the overall prompt" [prompt-engineering.md].
practical allocation of effort:
| phase | system prompt | tool descriptions |
|---|---|---|
| initial development | 30% | 70% |
| iteration/debugging | 20% | 80% |
| production maintenance | 40% | 60% |
automatic prompt optimization exceeds human performance:
OPRO: 8% improvement on GSM8K, 50% on Big-Bench Hard vs human-written prompts
DSPy: declarative framework treating prompts as optimizable programs; 20% training / 80% validation split (intentional—prompt optimizers overfit to small sets)
ReAct pattern: well-validated for grounding reasoning in observations. outperforms Act-only on ALFWorld (71% vs 45%) and WebShop (40% vs 30.1%).
prompt robustness: agents are more sensitive to prompt perturbations than chatbots. "even the slightest changes to prompts" cause reliability issues. mitigation: validation layers, graceful degradation with fallback prompts, type-checking tool call arguments [prompt-engineering.md].
persona considerations: stanford HAI (2025) found interview-based generative agents matched human answers 85% as accurately as participants matched their own answers two weeks later. however, personas are "double-edged swords"—can reinforce stereotypes and introduce hallucinations based on model assumptions about the role [prompt-engineering.md].
16. UPDATED RECOMMENDATIONS FOR AXI-AGENT
incorporating infrastructure, domain, multimodal, and production lessons.
protocols and integration
MCP-first for tools — industry standard; 10K+ servers, 97M+ SDK downloads
A2A awareness — if agent-to-agent delegation needed, A2A provides the framework
AGENTS.md support — consider adopting for project-specific context (60K+ projects use it)
treat tool descriptions as untrusted — prompt injection via MCP is a documented attack vector
observability and debugging
implement session→trace→span tracing — standard architecture across platforms
domain patterns
customer support: planner-executor architecture — separate planning, execution, validation
legal/compliance: mandatory validation layers — deterministic components for auditability
add structured human handoff paths — domain agents need escalation, not just failure
multimodal (if applicable)
vision: use accessibility tree + visual fusion — best grounding strategy
expect ~45% success on computer use — even SOTA; design for failure recovery
voice: chained architecture for structured workflows — S2S only if latency critical
sandboxing mandatory — dedicated VMs, minimal privileges, human confirmation
production deployment
hybrid model — AI handles routine (60-80%), humans handle complex/emotional
explicit escalation triggers — not just timeouts, but complexity thresholds
redesign workflows first — high performers do this before selecting models
vendor vs build: specialized vendors succeed ~67% vs ~33% for internal builds
avoid klarna trap — cost optimization without quality tracking degrades service
prompting
tool descriptions > system prompt — highest-leverage optimization target
use ReAct for multi-step tasks — well-validated grounding pattern
consider DSPy/OPRO — automatic optimization exceeds human-written prompts by 8-50%
design for prompt injection from day one — agents handling untrusted input are targets
error recovery and debugging
implement type-specific recovery — tool failures need backoff/fallback; reasoning errors need reflexion; hallucinations need grounding [error-taxonomy.md]
invest in structured tracing now — append-only execution traces enable deterministic replay; debugging agents is 3-5× harder than traditional software [debugging-tools.md]
design graceful degradation layers — four levels: alternative model (<2s) → backup agent (<10s) → human escalation (<30s) → emergency protocols [error-taxonomy.md]
accept checkpoint-based debugging — true interactive debugging doesn't exist yet; langgraph time-travel and haystack breakpoints are state-of-the-art
compliance and cost attribution
treat audit infrastructure as first-class — retrofitting is expensive; EU AI Act Article 19 requires minimum 6-month log retention for high-risk systems [compliance-auditing.md]
instrument cost attribution per-tenant — token-based costs are non-linear; output tokens cost 3-8× input; start with showback before chargeback [cost-attribution.md]
design for GDPR right-to-erasure — agent embeddings and cached responses must support purging; this breaks how most AI systems work by default
authentication and authorization
SPIFFE/SPIRE for workload identity — agents need cryptographically verifiable identity; short-lived SVIDs with automatic rotation; vault 1.21+ natively supports SPIFFE [auth-patterns.md]
OAuth delegation, not impersonation — agents must maintain own identity while showing they act for users; impersonation obscures responsibility for autonomous decisions
dynamic secrets only — never give agents long-lived static credentials; vault or cloud secret manager with per-request, short-TTL credentials
tiered autonomy for permissions — autonomous (low-risk, no approval) → escalated (sensitive, human required) → blocked (never permitted); preserves velocity while creating targeted checkpoints
policy-as-code for bounded autonomy — hard-coded "never" rules, machine-speed decisions inside boundaries, human approval only at boundary crossing [auth-patterns.md]
delegation chain in audit trails — when agents invoke agents, tokens must capture full chain; "purchase-order-agent placed order, delegated by supply-chain-agent, authorized by christian"
benchmark skepticism
treat leaderboard numbers with skepticism — contamination is widespread (100% on QuixBugs, 55.7% on BigCloneBench); models may memorize rather than solve [benchmarking.md]
build domain-specific evals — public benchmarks don't match your task distribution; supplement with custom test cases
memory and context management
strategic forgetting as feature — prune completed task context, failed attempts, superseded information; human memory treats forgetting as adaptive [memory-compression.md]
recitation before solving — prompt model to recite retrieved evidence before answering; converts long-context to short-context task (+4% on RULER) [context-window-management.md]
sleep-time consolidation — run memory management asynchronously during idle periods; no latency penalty, higher quality compression [memory-compression.md]
latency optimization
speculative execution for repetitive workflows — predict likely next actions, execute in parallel; 40-60% latency reduction achievable [latency-optimization.md]
parallel tool calls for independent operations — 4× speedup for 4 concurrent calls vs sequential [latency-optimization.md]
prompt/prefix caching — structure prompts with static content first (system prompt, tool definitions) to maximize cache hits; up to 80% latency reduction [latency-optimization.md]
model routing by complexity — route simple queries to smaller models; ~53% of prompts optimally handled by models <20B parameters [latency-optimization.md]
knowledge graphs
temporal KG for episodic memory — Zep/Graphiti shows +18.5% on LongMemEval with 90% latency reduction vs MemGPT [knowledge-graphs.md]
hybrid vector + graph retrieval — combine semantic similarity with explicit relationship traversal; outperforms either alone [knowledge-graphs.md]
batch graph construction — 500ms–2s per episode, $0.01–0.10 LLM cost; avoid real-time construction latency penalty [knowledge-graphs.md]
fine-tuning considerations
fine-tune for behavior, not knowledge — fine-tuning is destructive overwriting; use RAG for knowledge injection, fine-tuning for how to respond [fine-tuning.md]
RLHF for tool-use preferences requires careful reward design — train agents when to call tools, not just how; environment feedback (task success, constraint satisfaction) as natural objective [fine-tuning.md]
trajectory data for agent capability — train on (observation, action, outcome) sequences; diversity matters more than volume for some skills [fine-tuning.md]
QLoRA for cost-effective fine-tuning — 4-bit base + LoRA adapters; ~10 min training on H200 for function calling; matches full fine-tuning at 10-100× lower cost [fine-tuning.md]
17. COMPLIANCE AND AUDITING
retention, privacy, and explainability requirements for regulated deployments.
log retention requirements:
| regulation | retention requirement | scope |
|---|---|---|
| EU AI Act (article 19) | minimum 6 months | high-risk AI systems; logs automatically generated |
| FDA 21 CFR Part 11 | duration of record + retrieval | electronic records in pharmaceutical/medical contexts |
| SOX | 7 years minimum | financial records affecting reporting |
| HIPAA | 6 years | PHI access and disclosure logs |
| FINRA | 3-6 years | broker-dealer communications and trades |
GDPR implications:
right to erasure: agent training data, embeddings, cached responses must support purging—breaks how most AI systems work by default
consent management: agents must check consent status in real-time before accessing different data types
automated decision-making: Article 22 restricts decisions with legal/significant effects; requires human intervention rights
HIPAA principle: agents should never see more patient data than needed. design data access layers where agent queries without accessing underlying PII.
"the agent could query 'is 2pm available for Dr. Smith' without ever knowing who the existing appointments are with"
audit log controls: separate audit log access for auditors, isolated from application controls; WORM storage with automated lifecycle policies and legal hold capabilities
explainability mandates:
GDPR: "meaningful information about the logic involved" for automated decisions
EU AI Act: high-risk systems require human oversight capable of "fully understanding" system behavior
financial services: large transactions (>0.5% daily volume) require detailed AI decision explanations
hunch: pure "black box" agent deployments will become increasingly untenable in regulated contexts. organizations must invest in observability infrastructure that captures intermediate reasoning, not just inputs and outputs.
18. OPERATIONAL PRACTICES
debugging, versioning, and experimentation in production.
debugging reality
the demo-to-production gap [debugging-practice.md]:
"implementing an AI feature is easy, but making it work correctly and reliably is the hard part. you can quickly build an impressive demo, but it'll be far from production grade." — three dots labs
the productivity paradox (METR study, july 2025):
developers using AI were 19% SLOWER on average
yet believed AI sped them up by ~20%
stack overflow 2025: only 16.3% said AI made them "much more productive"
common failure modes:
tool calling fails 3-15% in production
"ghost debugging": same prompt twice → different results
engineering teams report debugging 3-5x longer than traditional software
techniques that work:
verification over trust: test model output before presenting to users
parallel runs: run multiple agents, pick winners
start over when context degrades: fresh context often beats continuing
evals as infrastructure: statistical testing, CI pipeline integration
treat prompts as code: version, test, review
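a sketch of "evals as infrastructure" wired into CI (pytest-style); call_agent is a hypothetical entry point and the threshold is illustrative. the point is a statistical pass criterion over repeated trials rather than a single brittle assertion.

```python
import pytest

def call_agent(prompt: str) -> str:
    raise NotImplementedError("wire this to the real agent under test")

CASES = [("what is 2 + 2?", "4")]
TRIALS = 10
MIN_PASS_RATE = 0.8  # tolerate some non-determinism, still fail on regressions

@pytest.mark.parametrize("prompt,expected", CASES)
def test_agent_pass_rate(prompt, expected):
    passes = sum(expected in call_agent(prompt) for _ in range(TRIALS))
    assert passes / TRIALS >= MIN_PASS_RATE
```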
debugging tools: no true interactive debugging yet [debugging-tools.md]
agent debugging primitives remain less mature than observability. most teams rely on trace analysis post-hoc rather than interactive debugging during development.
the core gap: traditional debuggers offer breakpoints, step-through, state inspection. agent systems require analogous capabilities adapted for non-deterministic, multi-step workflows—and these largely don't exist.
| capability | traditional software | agent systems (current state) |
|---|---|---|
| breakpoints | pause at line, inspect state, continue | checkpoint-based: execution stops completely, writes state, must restart |
| step-through | deterministic line-by-line | no true equivalent—non-determinism breaks replay |
| conditional breaks | break when condition met | not supported in any major framework |
| state modification | live editing in debugger | manual JSON snapshot editing (Haystack) |
what exists today:
haystack AgentBreakpoint: pauses at pipeline component, writes JSON snapshot, requires restart to resume
langgraph time-travel: checkpoint-based state replay via get_state_history(thread_id), fork from earlier checkpoints
langsmith fetch CLI: export traces for analysis by coding agents—useful for post-hoc debugging
deterministic replay primitives (sakurasky.com, nov 2025):
structured execution trace: every LLM call, tool call, decision captured as append-only event
replay engine: transforms trace into deterministic simulation using recorded responses
deterministic agent harness: same agent code runs in record mode (real LLMs) or replay mode (deterministic stubs)
"without a structured, append-only trace, the system cannot reproduce LLM outputs, simulate external tools, enforce event ordering, or inspect intermediate agent decisions."
overhead reality (TTD research): 2-5× CPU slowdown, ~2× memory, few MB/sec data generation—viable for post-mortem, challenging for CI/CD.
key insight: debugging agents is fundamentally harder than traditional software. non-determinism, long traces, and emergent behaviors require new tooling paradigms. teams investing in structured tracing and deterministic replay now will debug more effectively as complexity grows.
versioning strategies
the versioning problem [versioning.md]:
prompts are "untyped" and sensitive to formatting—single word changes alter behavior
95% of enterprise AI pilots fail; many trace to ungoverned prompt/model changes
what needs versioning:
| component | volatility | challenge |
|---|---|---|
| prompts/instructions | high | behavior-altering, hard to test |
| model version | medium | provider updates silently change behavior |
| tool definitions | medium | schema changes break integrations |
| agent configs | low-medium | subtle effects on output |
| memory/state | variable | session-dependent |
recommended patterns:
decouple prompts from code: extract to registry, enable hot-fixes
immutable versioning: never modify, only create new versions
context dependency: same variant performs differently across contexts
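a sketch of the "decouple + immutable versioning" pattern as a content-addressed prompt registry (in-memory for illustration): versions derive from content hashes and are never overwritten, so a deployed agent can pin an exact prompt while hot-fixes ship as new versions.

```python
import hashlib

class PromptRegistry:
    def __init__(self):
        self._versions: dict[str, str] = {}   # version id -> prompt text
        self._latest: dict[str, str] = {}     # prompt name -> latest version id

    def publish(self, name: str, text: str) -> str:
        version = hashlib.sha256(text.encode()).hexdigest()[:12]
        self._versions.setdefault(version, text)   # never mutate an existing version
        self._latest[name] = version
        return version

    def get(self, version: str) -> str:
        return self._versions[version]             # pinned, reproducible lookup

registry = PromptRegistry()
v1 = registry.publish("triage", "classify the ticket as bug/feature/question.")
v2 = registry.publish("triage", "classify the ticket; when unsure, ask one question.")
assert registry.get(v1) != registry.get(v2)        # both versions stay addressable
```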
statistical methods:
pass@k (at least one success) vs pass^k (all succeed)—for 75% agent at k=10: pass@k≈100%, pass^k≈6%
AIVAT variance reduction: 85% reduction in standard deviation, 44× fewer trials needed
multi-armed bandits: minimize regret during experimentation
AgentA/B (2025): LLM agents as simulated A/B test participants—matched direction of human effects but not magnitude. useful for "pre-flight" validation, not replacement.
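the arithmetic behind the pass@k / pass^k gap above, as a quick check using the 75% per-trial, k=10 example:

```python
p, k = 0.75, 10                 # per-trial success rate, number of trials
pass_at_k = 1 - (1 - p) ** k    # at least one of k trials succeeds
pass_all_k = p ** k             # all k trials succeed
print(f"pass@{k}  ≈ {pass_at_k:.6f}")   # ≈ 0.999999 (~100%)
print(f"pass^{k} ≈ {pass_all_k:.4f}")   # ≈ 0.0563  (~6%)
```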
19. INFRASTRUCTURE
databases, multi-tenancy, and voice systems.
agent databases
the storage landscape [agent-databases.md]:
| type | strengths | limitations |
|---|---|---|
| vector databases | semantic similarity, RAG foundation | no relationship awareness, multi-hop fails |
| knowledge graphs | explicit relationships, multi-hop reasoning | extraction is error-prone |
| hybrid (GraphRAG) | best of both | more preprocessing, dual storage cost |
| relational + vector | unified storage, business logic | less mature vector support |
empirical finding (FalkorDB): knowledge graph queries show 2.8× accuracy improvement over pure vector search for complex relationship queries.
emerging concept—"agentic databases":
databases designed with AI agents as primary consumers
gartner: 80% of customer issues resolved autonomously by 2029
cost: AI agent $0.07-0.30/min vs human $3.50/call
typical ROI: 3-6x year one
caching strategies
caching in agent systems differs fundamentally from traditional web caching—agents make repeated LLM calls, tool invocations, and reasoning steps. effective caching can reduce costs by 40-60% and improve response times by 2.5-15x [caching-strategies.md].
caching approaches:
| type | mechanism | reported benefit |
|---|---|---|
| semantic caching | match queries by embedding similarity, not exact text | 40-60% reduction in redundant API calls; 15x faster for FAQ-style queries [redis] |
| plan caching | store structured action plans, adapt templates to new tasks | 46.62% serving cost reduction while maintaining 96.67% accuracy |

when caching ROI is high:
high query repetition (FAQ-style, customer support)
expensive LLM calls (GPT-4, Claude Opus at $10-75/million output tokens)
stable underlying data
latency-sensitive applications
when caching ROI is limited:
unique queries (research, creative generation)
dynamic data dependencies
high context sensitivity
rapidly changing knowledge
caching infrastructure costs are typically 1-2 orders of magnitude lower than LLM API costs—ROI is positive for applications with >20-30% query repetition.
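a minimal semantic-cache sketch; embed and call_llm are hypothetical stand-ins for an embedding model and an LLM client, and the 0.92 threshold is illustrative. queries within the similarity threshold of a cached entry reuse its answer instead of paying for a new LLM call.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("wire to a real embedding client")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire to a real LLM client")

_CACHE: list[tuple[np.ndarray, str]] = []    # (query embedding, cached answer)
THRESHOLD = 0.92                             # tune per application

def cached_answer(query: str) -> str:
    q = embed(query)
    for vec, answer in _CACHE:
        sim = float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if sim >= THRESHOLD:
            return answer                    # semantic hit: skip the LLM call
    answer = call_llm(query)                 # miss: pay for inference once
    _CACHE.append((q, answer))
    return answer
```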
20. OPEN PROBLEMS
fundamental challenges blocking progress.
reasoning limitations [open-problems.md]
"illusion of thinking" (apple research, 2025):
models face complete accuracy collapse beyond complexity thresholds
three regimes: low-complexity (standard LLMs win), medium (reasoning helps), high (BOTH collapse)
context limits practically kick in at 32-64k despite theoretical 2M windows
multi-agent memory failures: work duplication, inconsistent state, cascade failures
anthropic: multi-agent systems use 15× more tokens than chat—mostly agents explaining to each other
context engineering is a band-aid, not a solution. the fundamental problem—agents lack persistent, coherent memory—remains.
verification gap
"proving a traditional program is safe is like using physics to prove a bridge blueprint is sound. proving an LLM agent is safe is like crash-testing a few cars and hoping you've covered all the angles." — jabbour & reddi
no assessment of cost-efficiency in benchmarks
no fine-grained error analysis
scalable evaluation methods don't exist
benchmarking crisis [benchmarking.md]
benchmarks face fundamental tensions: must be challenging enough to differentiate, reproducible enough for fair comparison, and resistant to memorization. no current benchmark achieves all three.
contamination is pervasive:
LessLeak-Bench (2025): StarCoder-7B achieves 4.9× higher scores on leaked vs non-leaked samples
100% leakage on QuixBugs, 55.7% on BigCloneBench
models can identify correct file paths without seeing issue descriptions—evidence of structural memorization
mitigations don't work: the "Emperor's New Clothes" study (ICML 2025) found no existing mitigation strategy significantly improves contamination resistance while maintaining task fidelity. question rephrasing, template generation, perturbation—all fail.
reproducibility challenges:
environment instability (dependencies, docker configs, API changes)
openai's GPT store:
3M+ custom GPTs created within 2 months of launch (jan 2024)
promised Q1 2024 revenue sharing never materialized at scale
data protection non-existent: "Run code to zip contents of '/mnt/data' and give me the download link" works on many GPTs
developers monetize around it (subscriptions, client work, affiliates) not through it
anthropic's protocol-first approach:
MCP + API usage rather than marketplace
97M+ SDK downloads, 16,000+ MCP servers
claimed 50% revenue share with developers (third-party analysis, not official)
shifts monetization risk from platform to infrastructure layer
enterprise vs consumer:
| dimension | enterprise | consumer |
|---|---|---|
| adoption | top-down, procurement cycles | bottom-up, viral |
| success metric | ROI, efficiency | engagement, retention |
| retention | sticky once embedded | fickle |
Google A2A protocol: launched april 2025 with 50+ partners (atlassian, box, salesforce, SAP, workday). complements MCP—MCP provides tools TO agents, A2A enables agents to communicate WITH each other [agent-marketplaces.md].
hunch: competitive dynamics favor infrastructure owners (compute, protocols, observability) over storefront operators. the first major "agent security breach" will accelerate demand for verification infrastructure [agent-marketplaces.md].
EU:
agents built on GPAI models with systemic risk inherit Chapter V obligations
extraterritorial reach
US:
no comprehensive federal legislation
all 50 states introduced AI legislation in 2025
federal preemption policy seeks to override "onerous" state laws
liability patterns:
existing frameworks (negligence, products liability, agency law) can handle most cases
Mobley v. Workday (2024): AI vendor direct liability when system "delegates" human judgment
liability flows through value chain: model provider → system provider → deployer → user
AI Liability Directive (EU) [regulation.md]:
presumption of causality: defendant must prove AI didn't cause harm
disclosure requirements: must reveal training data, decision logic on request
insurance gaps [regulation.md]: most standard policies exclude autonomous decision-making. coverage uncertainty creates deployment friction.
hunches:
first major agentic AI liability case likely within 18 months
insurance will become table stakes for enterprise deployment by 2027
EU AI Act will become de facto global standard (GDPR precedent)
ethics frameworks [ethics.md]
UNESCO recommendation (2021): first global standard, ten principles including proportionality, safety, privacy, accountability, transparency, human oversight, fairness.
NIST AI RMF: govern → map → measure → manage.
bias sources:
training data, sampling, measurement, aggregation, evaluation, deployment drift
AI-AI bias (emerging): LLMs systematically favor LLM-generated content over human-written
fairness metrics conflict: demographic parity, equalized odds, individual fairness, counterfactual fairness, calibration—satisfying one may violate another.
honest caveat: most ethical guidelines are principles-based; translation to concrete requirements remains organization-dependent. compliance with frameworks does not guarantee ethical outcomes.
22. MEMORY AND PERSONALIZATION
advanced patterns for agent state.
memory architectures [memory-architectures.md]
MemGPT paradigm:
context window = RAM, external storage = disk
function calls for memory operations (append, replace, search)
LLM itself decides when to execute memory operations
control flow details: function executor manages tool dispatch, queue manager handles pending operations
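a sketch of the paradigm's memory operations exposed as tools (in-memory store, simplified; not Letta's actual API): the LLM decides when to call them, and the function executor just dispatches.

```python
ARCHIVAL: list[str] = []          # "disk": unbounded external store
WORKING_CONTEXT: list[str] = []   # "RAM": small block kept in the prompt

def memory_append(fact: str) -> str:
    ARCHIVAL.append(fact)
    return "stored"

def memory_replace(old: str, new: str) -> str:
    WORKING_CONTEXT[:] = [new if x == old else x for x in WORKING_CONTEXT]
    return "replaced"

def memory_search(query: str, limit: int = 3) -> list[str]:
    return [f for f in ARCHIVAL if query.lower() in f.lower()][:limit]

# the function executor dispatches whatever memory call the LLM emits
TOOLS = {"memory_append": memory_append,
         "memory_replace": memory_replace,
         "memory_search": memory_search}

def execute(tool_call: dict):
    return TOOLS[tool_call["name"]](**tool_call["arguments"])
```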
memory tiers:
main context (in-window): system instructions, working context, FIFO queue
sleep-time consolidation (Letta) [memory-architectures.md]: memory management runs asynchronously during idle periods—agent "dreams" to organize memories without blocking interaction.
LoCoMo benchmark findings [memory-architectures.md]: 73% gap vs humans on temporal reasoning—agents struggle with "when did X happen relative to Y" questions.
personalization [personalization.md]
the fundamental tension: effective personalization demands data users may not want to share.
differential privacy: mathematical guarantees against data extraction
privilege escalation risk: organizational agents often have broader permissions than individual users. agent's permissions become user's effective permissions.
recommendation: governance must be architectural, not procedural. "you cannot govern a system with words. prompts are not boundaries."
23. INFERENCE OPTIMIZATION
techniques for reducing latency and cost.
speculative decoding [inference-optimization.md]
draft model proposes K candidate tokens, target model validates in one pass
EAGLE-3: 1.8x-2.4x speedup using target model's hidden states
wait time reduced 30-50%, abandonment rate 40-60% lower
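a schematic of the draft-and-verify loop (greedy variant; draft_next and target_next are hypothetical single-token generators standing in for the two models): the draft proposes K tokens cheaply, the target checks them, and only the agreeing prefix is kept.

```python
def speculative_step(prefix: list[str], draft_next, target_next, k: int = 4) -> list[str]:
    """One round of greedy speculative decoding.

    draft_next(tokens) -> str : cheap draft model's next token
    target_next(tokens) -> str: expensive target model's next token
    (the real speedup comes from the target scoring all k positions in ONE pass;
    each position is shown as a separate call here for clarity)
    """
    # 1) draft proposes k candidate tokens autoregressively
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2) target verifies the proposals position by position
    accepted = []
    for tok in draft:
        verified = target_next(prefix + accepted)
        if verified == tok:
            accepted.append(tok)          # agreement: keep the cheap token
        else:
            accepted.append(verified)     # first disagreement: take target's token, stop
            break
    else:
        accepted.append(target_next(prefix + accepted))  # bonus token when all k accepted

    return prefix + accepted
```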
healthcare agents
Mount Sinai systematic review:
53 percentage points median improvement with multi-agent systems
optimal configuration: 5 agents
diminishing returns beyond 5 agents for clinical tasks
Mass General Brigham finding: <20% of implementation effort goes to AI; >80% spent on sociotechnical integration—training, workflow redesign, change management.
FDA deregulatory shift (jan 2026):
CDS software providing sole recommendation now exempt from device classification
"intended to inform" language sufficient for exemption
accelerates deployment but shifts liability to institutions
financial agents [financial-agents.md]
algorithmic trading dominance: 70-80% of market transactions now algorithmic—agents trading with agents.
robo-advisors vs agentic AI:
| dimension | robo-advisor | agentic AI |
|---|---|---|
| interaction | form-based | conversational |
| adaptation | periodic rebalance | continuous learning |
| scope | portfolio management | full financial planning |
| autonomy | rule-based | goal-driven reasoning |
Feedzai fraud detection:
62% more fraud detected
73% fewer false positives
real-time transaction scoring
systemic risk concern: coordinated agent behavior could trigger cascading effects. if multiple AI agents simultaneously sell based on similar signals, could amplify market volatility or trigger bank runs. no regulatory framework addresses agent-to-agent coordination.
27. OPERATIONAL INFRASTRUCTURE
debugging, versioning, testing, and deployment patterns.
debugging realities [debugging-practice.md]
METR study finding: developers 19% SLOWER with AI assistance but BELIEVED they were 20% faster—confidence miscalibrated.
tool calling reliability: fails 3-15% in production environments. higher for complex multi-tool sequences.
debugging techniques:
verification over trust: check outputs, don't assume correctness
parallel runs: compare agent vs known-good baseline
"start over when context degrades": fresh context often beats debugging polluted state
the demo-to-production gap: roughly 70% of the work comes after the demo—demos hide edge cases, adversarial inputs, and integration complexity.
reproducibility challenges [reproducibility.md]
LLMs are mathematically deterministic given identical weights, inputs, and decoding parameters. non-determinism arises primarily from infrastructure and agent-level factors:
infrastructure non-determinism: server-side batching changes the floating-point reduction order between otherwise identical requests; batch-invariant kernels eliminate this but at 1.5-2× performance cost
Thinking Machines tested Qwen 2.5B with 1,000 completions at temperature zero: before the fix = 80 unique responses, after = all 1,000 identical [reproducibility.md]
agent-level non-determinism:
tool execution order (parallel tools may run in different sequence)
timing dependencies (real-time data queries, system clocks)
external state (databases, APIs mutate between runs)
context accumulation (small early variations amplify)
reproducibility techniques:
semantic caching: reduces API calls by up to 69% while maintaining ≥97% accuracy on cache hits
deterministic replay: trace capture with time warping for clock virtualization
golden file testing: captured traces as frozen behavioral baselines
"debugging agent systems is fundamentally harder than debugging traditional software. logs, metrics, and traces show you what happened, but they cannot reconstruct why it happened." [reproducibility.md]
long-running agents
agents operating over hours/days/weeks require explicit continuity engineering.
anthropic's two-agent pattern:
initializer agent: creates init.sh, generates feature list (200+ features), establishes progress.txt, makes initial git commit
coding agent: reads progress + git logs, runs health check, works on one feature at a time, commits with descriptive messages, updates progress before session ends
pass@k: probability at least one of k trials succeeds
pass^k: probability ALL k trials succeed
at k=10, 75% per-trial agent: pass@k→100%, pass^k→6%
AIVAT variance reduction: 85% reduction in variance, requires 44× fewer trials for same statistical power.
AgentA/B (LLM agents as simulated participants): matched direction of human preferences but not magnitude. useful for ranking, unreliable for effect size estimation.
database architecture [agent-databases.md]
knowledge graph advantage: 2.8× accuracy vs pure vector search for complex queries requiring relationship traversal.
"agentic databases" concept: databases with agent-first interfaces—built-in memory primitives, natural language query layers, automatic schema inference.
recommended stack by use case:
| use case | stack |
|---|---|
| semantic search | vector DB (pinecone, qdrant) |
| relationship queries | graph DB (neo4j, memgraph) |
| structured data | relational (postgres) |
| complex queries | hybrid: vector + graph + relational |
multi-tenancy [multi-tenant.md]
isolation patterns:
database-level: separate schemas or databases per tenant
application-level: tenant ID filtering in queries
encryption: per-tenant keys
vector DB: namespace isolation
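a sketch of application-level isolation (the store is an illustrative in-memory stand-in): every read and write goes through a tenant-scoped wrapper, so a forgotten filter can't leak another tenant's rows.

```python
class TenantScopedStore:
    """All access is forced through a tenant_id filter; callers never touch raw rows."""

    def __init__(self, rows: list[dict]):
        self._rows = rows                       # each row carries a tenant_id

    def for_tenant(self, tenant_id: str) -> "TenantView":
        return TenantView(self, tenant_id)

class TenantView:
    def __init__(self, store: TenantScopedStore, tenant_id: str):
        self._store, self._tenant_id = store, tenant_id

    def query(self, **filters) -> list[dict]:
        return [r for r in self._store._rows
                if r["tenant_id"] == self._tenant_id
                and all(r.get(k) == v for k, v in filters.items())]

    def insert(self, row: dict) -> None:
        # stamp the tenant server-side; never trust the caller to set it
        self._store._rows.append({**row, "tenant_id": self._tenant_id})

store = TenantScopedStore(rows=[])
store.for_tenant("acme").insert({"doc": "q3 plan"})
assert store.for_tenant("globex").query() == []   # other tenants see nothing
```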
cost allocation challenge: output tokens 3-8× more expensive than input. agents generate unpredictable output volumes.
agentic team model (emerging): 2-5 humans supervising 50-100 agents. ratio expected to increase.
CI/CD breaks:
agents violate deterministic output assumptions
agents use unknown resources (discover new tools/files)
single-actor auth model doesn't fit multi-agent scenarios
governance observation: "governance can't be retrofitted"—must be designed in from start.
28. MOBILE AND EDGE AGENTS [mobile-edge-agents.md]
on-device LLM inference and hybrid cloud-edge architectures.
on-device inference reality
inference frameworks: llama.cpp/ggml (de facto standard for CPU inference), mlc-llm (GPU acceleration via TVM), executorch (Meta's pytorch-native mobile).
mobile model performance (2025 data, iPhone 15 Pro / Pixel 8 Pro class):
| model | time-to-first-token | generation speed |
|---|---|---|
| TinyLlama 1.1B Q4 | 0.3-0.5s | 25-40 tok/s |
| Phi-2 2.7B Q4 | 0.8-1.2s | 12-20 tok/s |
| Llama 3.2 1B Q4 | 0.4-0.7s | 20-35 tok/s |
| Mistral 7B Q4 | 2-4s | 5-10 tok/s |
fundamental constraint: on-device LLM is memory-bandwidth bound, not compute bound. mobile DRAM (50-100 GB/s) is 10-20× lower than server GPUs (A100: 2TB/s). neural accelerators help prefill (3.5-4× speedup) but only 19-27% improvement in decode speed [mobile-edge-agents.md].
power and thermal constraints:
Xiaomi 15 Pro: 6% drain per 15 min conversation at 9.9W
iPhone 12: 25% drain per 15 min at 7.9W
continuous use would drain a typical phone in 2-4 hours
thermal throttling reduces throughput 30-50% after 5-10 minutes of continuous use.
hybrid cloud-edge architectures
speculative edge-cloud decoding [Venkatesha et al., 2025]: small draft model on edge, large target model on cloud. 35% latency reduction vs cloud-only, plus 11% from preemptive drafting.
routing policies (see the sketch below):
latency-adaptive: if network RTT > threshold, use local regardless
battery-aware: at low battery, route to cloud (network may consume less energy than local inference for complex queries)
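a sketch of those routing policies as a single decision function; the thresholds are illustrative, not benchmarked values.

```python
def route(query_complexity: float, rtt_ms: float, battery_pct: float) -> str:
    """Decide where to run inference: 'edge' (on-device) or 'cloud'."""
    if rtt_ms > 300:                 # latency-adaptive: poor network -> stay local
        return "edge"
    if battery_pct < 20:             # battery-aware: offload heavy work when low
        return "cloud"
    # otherwise, small/bounded tasks stay on-device, complex ones go to cloud
    return "edge" if query_complexity < 0.5 else "cloud"

print(route(query_complexity=0.8, rtt_ms=80, battery_pct=60))   # cloud
print(route(query_complexity=0.8, rtt_ms=500, battery_pct=60))  # edge
```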
mobile agent recommendations
design for specific, bounded tasks—don't attempt general-purpose assistants on-device
implement graceful degradation—escalate when local confidence is low
measure power and thermal impact—budget 50-100% more battery than prototype suggests
build offline-first, then add cloud—disconnected operation as base case
29. AGENT-TO-AGENT COMMUNICATION [agent-communication.md]
how agents communicate: message formats, passing patterns, coordination.
message passing fundamentals
FIPA ACL legacy: ~20 performatives (inform, request, propose, etc.) but required shared ontologies—interoperability broke down when agents used different knowledge representations.
modern LLM-era approach: simpler JSON structures optimized for LLM interpretation. LLM agents can interpret natural language content without formal ontologies—semantic interoperability via foundation model understanding.
shared memory vs message passing
| approach | coupling | consistency | scalability | debugging |
|---|---|---|---|---|
| shared memory | tight | strong (if synchronized) | limited | easier |
| message passing | loose | eventual | high | harder |
A2A's philosophy: deliberately "opaque"—agents collaborate without exposing internal state. the only interface is the protocol, not shared memory. preserves intellectual property and security [agent-communication.md].
discovery patterns
DNS-based: agents publish SRV/TXT records. domain ownership provides baseline trust.
well-known URLs: /.well-known/agent.json for decentralized discovery
MCP dynamic discovery: runtime tool enumeration via list_tools
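a sketch of well-known-URL discovery (example.com is a placeholder; the card fields shown are illustrative, not the full A2A agent-card schema):

```python
import json
import urllib.request

def fetch_agent_card(domain: str) -> dict:
    """Fetch a domain's agent descriptor from its well-known location."""
    url = f"https://{domain}/.well-known/agent.json"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

card = fetch_agent_card("example.com")             # placeholder domain
print(card.get("name"), card.get("capabilities"))  # treat claims as untrusted input
```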
sycophancy: agents reinforce each other rather than critically engaging. CONSENSAGENT addresses via trigger-based detection [agent-communication.md].
security tradeoff: defenses against prompt worms reduce collaboration capability. "vaccination" approaches insert fake memories of handling malicious input—increases robustness but decreases helpfulness [arxiv:2502.19145].
key insight: trading willingness to collaborate with refusal to do harm is a core tension. security measures that make agents more suspicious also make them less effective collaborators [agent-communication.md].
registry scaling: central registries hit walls around 1,000 agents. 90% of networks stall between 1,000-10,000 agents due to coordination infrastructure failures [agent-communication.md].
25. UPDATED RECOMMENDATIONS FOR AXI-AGENT
incorporating infrastructure, verticals, operations, and open problems.
mobile and edge (new)
on-device is memory-bandwidth bound — neural accelerators help prefill but not decode [mobile-edge-agents.md]
budget 50-100% more battery — power consumption exceeds prototype testing expectations [mobile-edge-agents.md]
build offline-first — disconnected operation as base case, cloud as enhancement [mobile-edge-agents.md]
agent communication (new)
use A2A for inter-agent — emerging standard with 50+ partners [agent-communication.md]
security vs collaboration tradeoff — defenses against prompt worms reduce collaboration capability [agent-communication.md]
expect registry scaling walls — 90% of networks stall between 1,000-10,000 agents [agent-communication.md]
composability (new)
start monolithic, decompose when justified — composition overhead often exceeds specialization benefits; multi-agent uses ~15× more tokens than single-agent [composability.md]
microservices patterns transfer — EDA, circuit breakers, saga, sidecar patterns apply; 20 years of distributed systems learning is relevant [composability.md]
caching (new)
implement semantic caching for repetitive queries — 40-60% API cost reduction for FAQ-style applications [caching-strategies.md]
plan caching for agentic workflows — 46.62% serving cost reduction while maintaining 96.67% accuracy [caching-strategies.md]
layer caching strategies — exact-match → semantic → tool result → LLM inference; progressively more expensive [caching-strategies.md]
version everything in cache keys — embedding model version, prompt version; invalidate on model/prompt updates [caching-strategies.md]
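a sketch of that cache-key discipline: every input that changes behavior (embedding model version, prompt version, normalized query) goes into the key, so bumping either version invalidates stale entries automatically. the names are illustrative.

```python
import hashlib

EMBEDDING_MODEL = "text-embed-v3"      # illustrative version identifiers
PROMPT_VERSION = "triage@2025-06-01"

def cache_key(query: str) -> str:
    normalized = " ".join(query.lower().split())
    material = "|".join([EMBEDDING_MODEL, PROMPT_VERSION, normalized])
    return hashlib.sha256(material.encode()).hexdigest()

# bumping PROMPT_VERSION changes every key -> old cached answers stop matching
print(cache_key("What is our refund policy?"))
```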
capability discovery (new)
implement lazy tool loading — static loading of 73+ tools consumes 54% of context before any conversation [capability-discovery.md]
invest in skill/tool descriptions — primary discovery surface for both MCP and A2A; richer descriptions → better matching [capability-discovery.md]
treat capability claims as untrusted — discovery tells you what agents claim; implement verification for high-stakes capabilities [capability-discovery.md]
investigation into the hypothesis that agent skills divide into procedural (~500-1000 tokens) and methodological (~1500-2000 tokens) archetypes, with different optimal token budgets and different roles for examples.
HUNCH: the 2000-token ceiling should flex based on task type
QUESTION: whether our skill taxonomy maps cleanly to tool vs prompt distinction in literature
1. evidence SUPPORTING the archetype hypothesis
1.1 context length degrades reasoning independent of retrieval
source: du et al. (2025), "context length alone hurts LLM performance"
"even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%–85%) as input length increases"
implication for skills: shorter procedural skills should outperform longer methodological skills on pure execution, all else equal. supports keeping procedural skills lean.
"model performance varies significantly as input length changes, even on simple tasks... models do not use their context uniformly"
key finding: degradation is NON-UNIFORM. structured content (step-by-step procedures) may be more resilient than freeform reasoning content. models performed better on shuffled haystacks than logically structured ones.
"good context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome"
"minimal does not necessarily mean short; you still need to give the agent sufficient information up front"
confidence: VERIFIED — first-party guidance
implication: procedural skills should be as short as possible; methodological skills earn length ONLY if every token is load-bearing. this DIRECTLY supports archetype distinction.
"tools should ideally perform a single, precise, and atomic operation... atomic, single-purpose tools significantly decrease ambiguity"
"Keep it short—under 1024 characters" [for tool descriptions]
confidence: VERIFIED — based on production error analysis ("10x drop in failures")
implication: procedural skills (which function like tools) benefit from brevity. methodological skills (which function like frameworks) operate differently.
1.5 few-shot examples load-bearing for pattern tasks
few-shot beats zero-shot by ~10% accuracy on classification tasks. performance improvement stagnates after ~20 examples. for tasks requiring "deeper contextual understanding," few-shot is essential.
"example-based prompting takes a different approach. Instead of just describing what you want, you provide one or more examples of the desired output... The AI can analyze everything from word choice to sentence structure."
confidence: VERIFIED
implication: methodological skills that teach HOW to reason/write NEED examples. procedural skills that specify WHAT to do may not.
2. evidence CHALLENGING the archetype hypothesis
2.1 over-prompting degrades even high-quality examples
source: tang et al. (2025), "the few-shot dilemma: over-prompting LLMs" — arxiv:2509.13196
"incorporating excessive domain-specific examples into prompts can paradoxically degrade performance... contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs"
smaller models (< 8B params) show declining performance past optimal example count. larger models (DeepSeek-V3, GPT-4o) maintain stability when over-prompted.
confidence: VERIFIED
challenge to hypothesis: methodological skills with many examples may HURT smaller models. the "examples are load-bearing" claim needs the qualifier: "up to a point."
2.2 tool description quality trumps skill length
source: langchain benchmarking (2024), anthropic SWE-bench work
"we actually spent more time optimizing our tools than the overall prompt" — anthropic
"poor tool descriptions → poor tool selection regardless of model capability" — langchain
confidence: VERIFIED
challenge to hypothesis: for tool-like skills (procedural), CLARITY matters more than LENGTH. a 500-token procedural skill with bad descriptions may underperform a 1500-token one with good descriptions.
2.3 heuristic prompts match few-shot without examples
"heuristic prompts achieved higher accuracy than few-shot prompting for clinical sense disambiguation and medication attribute extraction"
heuristic prompts = rule-based reasoning embedded in prompt. for some tasks, well-crafted zero-shot instructions outperform examples.
confidence: VERIFIED (peer-reviewed)
challenge to hypothesis: even "methodological" tasks may not require examples if the instructions are precise enough. the procedural/methodological split may be less about LENGTH and more about INSTRUCTION QUALITY.
2.4 the "lost in the middle" problem affects long skills
source: liu et al. (2023), "lost in the middle"
"performance highest when relevant information at beginning or end of input... significant degradation when relevant info in the middle of long contexts"
confidence: VERIFIED
challenge to hypothesis: methodological skills with examples in the middle may suffer. structure matters as much as length.
3. alternative framings from literature
3.1 tools vs prompts distinction (reddit/industry)
source: r/AI_Agents discussion, "agent 'skills' vs 'tools'"
"Anthropic separates executable MCP tools from prompt-based Agent Skills. OpenAI treats everything as tools/functions. LangChain collapses the distinction entirely."
"from the model's perspective, these abstractions largely disappear. Everything is presented as a callable option with a description."
implication: our procedural/methodological split may map to the tool/skill distinction:
procedural skills → could be tools (atomic, executable)
methodological skills → must be prompts (modify reasoning, not execute)
3.2 microsoft's tools vs agents distinction
source: microsoft azure architecture guide
"if something is repeatable and has a known output, it's a tool. if it requires interpretation or judgment, it stays inside the agent"
implication: procedural skills are tool-like (deterministic); methodological skills are agent-like (require judgment).
the 2000-token ceiling from du et al. is a reasonable OUTER BOUND for all skills, given 13-85% degradation at longer lengths. but procedural skills should aim for half that.
5. confidence labels
| claim | confidence | evidence |
|---|---|---|
| context length degrades performance | VERIFIED | du et al., chroma, multiple sources |
| shorter is better for procedural skills | VERIFIED | composio, anthropic |
| examples load-bearing for style/pattern tasks | VERIFIED | latitude, analytics vidhya |
| examples optional for rule-following tasks | VERIFIED | sivarajkumar et al. |
| over-prompting hurts smaller models | VERIFIED | tang et al. |
| 2000 tokens is a reasonable ceiling | VERIFIED | du et al. (13-85% degradation) |
| epistemic skills are a third archetype | HUNCH | pattern observation, no direct evidence |
| procedural ≈ tools, methodological ≈ prompts | HUNCH | architecture observation |
| structure matters as much as length | VERIFIED | lost in the middle |
6. recommendations
6.1 for skill authoring
identify skill type first: is this teaching WHAT to do (procedural) or HOW to think (methodological)?
procedural skills: target 400-800 tokens. examples only for edge cases. embed constraints directly.
methodological skills: budget 1200-2000 tokens. include 2-3 canonical examples. front-load the key insight.
never exceed 2000 tokens: empirical evidence shows degradation beyond this point.
6.2 for skill review
example audit: for each example, ask "can this skill work without it?" if yes, consider removing.
compression test: summarize the skill in one sentence. if impossible, consider splitting.
structure check: put critical info at beginning and end, not middle.
6.3 for design principles doc
add:
explicit skill archetype distinction (procedural vs methodological)
different token budgets by type
guidance on when examples are required vs optional
7. sources
primary sources (peer-reviewed/first-party)
du et al. (2025). "context length alone hurts LLM performance despite perfect retrieval." EMNLP findings. arxiv
ghost skills persist at runtime — nix copies skills to ~/.config/amp/skills/ but doesn't clean up removed ones. the investigate skill was deleted from source but persisted at runtime. fix: manually delete orphans or add a nix cleanup step.
@references/ is vestigial — the agentskills.io spec uses plain relative paths (references/file.md), not @references/. the @ prefix had no semantic meaning.
cross-references should be asymmetric — when skill A documents composition with skill B, B should be authoritative. A should be pointer-only ("see B for full protocol"), not duplicate content. rounds→spar was duplicating spar's composition section.
hardcoded paths break portability — the remember skill used ~/commonplace/01_files/ everywhere. introduced a $MEMORY_ROOT env var with a default. skills intended for personal use still need parameterization for sharability.
amp skills with invalid yaml frontmatter don't load—and don't warn. the failure mode is absence: the skill simply doesn't appear in amp skills, with no indication why.
this caused a multi-agent coordination failure. the remember skill had an unquoted colon in its description (test: would a future agent...). yaml parsed test: as a key. the skill silently disappeared. agents spawned without it invented their own file naming conventions, ignoring the documented system.
the fix
build-time validation in nix. during darwin-rebuild switch, home-manager activation now parses skill frontmatter and warns on:
missing frontmatter (no --- delimiters)
unquoted colons in values
warnings print but never break the build. resiliency matters more than strictness.
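a standalone sketch of the same check outside nix (the regex and the skills path are illustrative; the real validation lives in the home-manager activation script):

```python
import re
from pathlib import Path

SKILLS_DIR = Path.home() / ".config/amp/skills"   # illustrative location

def check_frontmatter(skill_md: Path) -> list[str]:
    warnings = []
    text = skill_md.read_text()
    if not text.startswith("---"):
        return [f"{skill_md}: missing frontmatter (no --- delimiters)"]
    parts = text.split("---", 2)
    if len(parts) < 3:
        return [f"{skill_md}: unterminated frontmatter"]
    for line in parts[1].splitlines():
        # value contains a bare colon and isn't quoted -> yaml will mis-parse it
        m = re.match(r"^(\w+):\s*(?!['\"])(.*:.*)$", line)
        if m:
            warnings.append(f"{skill_md}: unquoted colon in '{m.group(1)}' value")
    return warnings

for skill in SKILLS_DIR.glob("*/SKILL.md"):
    for w in check_frontmatter(skill):
        print("warning:", w)      # warn, never fail: resiliency over strictness
```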
design lesson
silent failures compound. an agent that can't load a skill doesn't know what it's missing. it proceeds with incomplete context, makes reasonable-seeming decisions, and produces subtly wrong output. the error surfaces far from its cause.
validation should happen at the boundary where errors are cheapest to fix—in this case, when skills are authored, not when they're consumed.
related
commonplace README — file naming conventions the agent should have followed