"correct" output cannot be specified declaratively
skill teaches HOW to produce, not WHAT to do
when examples are decorative:
task is procedural/deterministic
correct behavior specifiable with rules
skill specifies WHAT to do, not HOW to produce
heuristic: one example per axis of variation. simple patterns (document: "why not what") need fewer examples than complex patterns (amp-voice: terminology + tone + phrases + anti-patterns).
skills that say "read X for details" without embedding critical constraints risk agents never following the link.
anthropic's tool design research: "descriptions should include... what each parameter means, important caveats or limitations" — recommends 3-4+ sentences per tool with explicit constraints embedded directly (anthropic tool use docs).
composio's field guide: "when parameters have implicit relationships... models fail to understand usage constraints" (composio).
heuristic: embed constraints that would break the skill if missing. links are for context; constraints need to be immediate.
nuance: if the skill already embeds the CRITICAL constraint (e.g., remember embeds source__agent requirement), don't duplicate the full vocabulary. single source of truth matters.
skills are load-bearing objects
evergreen notes turn ideas into objects. skills do the same for agent capabilities. a broken skill isn't just missing functionality—it's a missing object that other work depends on.
when a skill fails to load:
agents can't execute the capability
they invent ad-hoc conventions
output is subtly wrong in ways that surface later
corti's analysis: "context fragmentation" where "agents operate in isolation, make decisions on incomplete information" and "hallucination propagation" where "fabricated data spreads across agents, becomes ground truth" (corti).
concision enables composition
context length degrades performance independent of retrieval quality.
du et al. (2025): "even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%–85%) as input length increases" (arxiv).
chroma research: "as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases" (chroma).
"we actually spent more time optimizing our tools than the overall prompt" — anthropic
"poor tool descriptions → poor tool selection regardless of model capability" — langchain
a well-described 500-token skill beats a poorly-described 1500-token one. clarity matters more than length for rule-following skills.
validation at authorship, not consumption
errors are cheapest to fix where they originate.
skill validation during build catches issues before deployment. waiting until runtime means:
error is far from its cause
debugging requires tracing through agent behavior
multiple agents may have produced bad output
anthropic's building effective agents: component tests (individual LLM calls, tool invocations) should be fast and catch issues before they compound (anthropic).
implementation: nix build-time frontmatter validation. warns on missing frontmatter or unquoted colons. see 01_files/nix/user/amp/default.nix.
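a python sketch of the same check, assuming skills are markdown files with a `---` delimited frontmatter block (the logic mirrors the nix validation but is illustrative, not the actual implementation):

```python
# hypothetical sketch of the build-time check: warn on missing frontmatter
# or on unquoted colons inside frontmatter values (a common YAML footgun).
import re
import sys
from pathlib import Path

def validate_frontmatter(path: Path) -> list[str]:
    warnings = []
    text = path.read_text()
    if not text.startswith("---\n"):
        return [f"{path}: missing frontmatter block"]
    end = text.find("\n---", 4)
    if end == -1:
        return [f"{path}: frontmatter never closed"]
    for line in text[4:end].splitlines():
        m = re.match(r"^(\w[\w-]*):\s*(.+)$", line)
        if not m:
            continue
        value = m.group(2)
        # a bare colon in an unquoted value changes how YAML parses the field
        if ":" in value and not value.startswith(('"', "'")):
            warnings.append(f"{path}: unquoted colon in '{m.group(1)}'")
    return warnings

if __name__ == "__main__":
    for p in sys.argv[1:]:
        for w in validate_frontmatter(Path(p)):
            print("WARN", w)
```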
skills should be testable
a skill that can't be tested can't be trusted.
confident AI: "faulty tool calls — wrong tool, invalid parameters, misinterpreted outputs" and "false task completion — claiming success without actual progress" (confident-ai).
validation approaches:
frontmatter validation (implemented)
example invocations that can be dry-run
assertions about output format
hunch: skills may benefit from a test: section with expected inputs/outputs.
invocation guards for orchestration skills
rule-following orchestration skills (spawn, coordinate, rounds, spar) need explicit "WHEN NOT TO USE" sections. these skills are dangerous because:
low friction to invoke
feel productive (agents doing work)
costs are hidden (coordination overhead, conflicting findings, reconciliation burden)
malone et al. (2024): human-AI combinations often perform WORSE than the best of either alone, particularly on decision tasks where humans defer judgments they could make better themselves. spawning agents to generate opinions for reconciliation is exactly this antipattern.
the pre-spawn checklist:
before invoking multi-agent orchestration, ask:
could i verify this myself in <10 minutes? if yes, do it. agents are for parallelizing work you CAN'T do faster yourself.
is there a single source of truth? one agent reading one authoritative source beats multiple agents generating opinions to reconcile.
will agents produce conflicting findings? if task is evaluative (judging claims) rather than exploratory (generating hypotheses), a single careful pass is cleaner than theatrical "review courts."
do i have explicit exit criteria? multi-agent work without convergence criteria produces unbounded reconciliation work.
antipattern case study: spawned 4 agents to validate postmortem claims. results:
agent 1 said error rate was 8.75x. agent 2 proved methodology was wrong.
agent 1 claimed recovery at 20:26. agent 2 proved claim was unfalsifiable.
agent 3 corrected batch rate calculation.
postmortem rewritten 3x based on conflicting outputs.
fix: read the code, query observability ONCE with correct methodology, write findings with HUNCH labels where evidence is weak. one PR, done.
sources: malone et al. (2024), MAST dataset (40% multi-agent pilot failure rate), agent sprawl antipattern.
skill author guidance: orchestration skills SHOULD include a "WHEN NOT TO USE" section with the pre-spawn checklist. this is guidance for humans invoking the skill, not runtime enforcement.
2026-01-18: added "invocation guards for orchestration skills" section. pre-spawn checklist, antipattern case study from atlas traces postmortem incident. sources: malone et al. (2024), MAST dataset, agent sprawl antipattern note.
2026-01-16T18-30: expanded to three archetypes (rule-following, pattern-matching, epistemic). reclassified coordinate/rounds as rule-following. added "one example per axis of variation" heuristic. validated via second dialectic (nelson_velvetford).
2026-01-16: added skill archetypes (rule-following vs pattern-matching), refined token budgets by type, added quality > length principle, added structure guidance. validated via dialectic review + research agent.
2026-01-15: initial version from debugging broken remember skill.
tl;dr: we overindexed on a "cool workflow" when a direct solution would have been faster, cleaner, and more correct.
spawning multiple review agents without convergence criteria produces conflicting findings that need reconciliation. a single careful pass is cleaner than theatrical "review courts."
the seduction of multi-agent workflows
multi-agent orchestration FEELS rigorous. you're spawning validators, running review rounds, getting multiple perspectives. it looks like due diligence.
it's often theater.
the cost of multi-agent review isn't just tokens—it's the reconciliation burden when agents disagree. and they WILL disagree, because they're interpreting the same ambiguous evidence with different framings.
what happened
investigating atlas traces postmortem, i spawned 4 agents (larry, roy, george, marian) to "validate claims." results:
larry said error rate was 0.70% vs 0.08% (8.75x ratio)
roy proved larry's methodology was wrong (used all logs as denominator, not traces requests)
larry claimed "Atlas recovered at 20:26"
roy proved this was unfalsifiable (no success logs exist)
marian corrected batch rate from ~9/min to ~4.4/min
i updated the postmortem 3 times based on conflicting agent outputs.
the actual problems
no exit criteria — agents kept finding things, i kept updating. no definition of "done"
methodology blindness — trusted first agent's numbers without questioning how they were derived
claim inflation — asserted findings confidently before verifying they were falsifiable
scattered outputs — 3 PRs, 2 worktrees, postmortem rewritten 3x for a simple fix
what should have happened instead
the direct approach:
read code, check spec, confirm fix is correct
query observability ONCE with correct methodology (verify denominator)
write findings with HUNCH labels where evidence is weak
one PR, clean commit, done
time estimate: 20-30 minutes.
actual time spent: hours across multiple agents, reconciliation passes, PR rewrites.
the "cool workflow" cost 5-10x more than doing it directly. and the direct approach would have been MORE correct, because one person with clear methodology beats four agents with inconsistent methodologies.
pre-spawn checklist
before loading spawn/coordinate/rounds/spar/shepherd, ask:
could i verify this myself in <10 minutes? if yes, DO IT. the overhead of spawning, coordinating, and reconciling exceeds the work itself.
is there a single source of truth? if verifiable against one file/spec/query, one agent reading it once beats multiple agents interpreting it differently.
will agents produce conflicting findings? if task is evaluative (judging claims) rather than generative (creating artifacts), expect disagreement. one careful pass beats theatrical review courts.
do i have explicit exit criteria? without "done" criteria, agents keep finding things, you keep updating. unbounded work produces unbounded reconciliation.
is the work INDEPENDENT? spawn parallelizes independent work (different repos, different features, different concerns). don't spawn multiple agents to evaluate the SAME thing.
when multi-agent IS appropriate
independent parallel tasks: agent 1 works on frontend, agent 2 works on backend. no overlap.
genuinely different expertise: one agent queries observability, another reads code, a third writes docs. different inputs, synthesized outputs.
generative work with diversity value: brainstorming, hypothesis generation, creative exploration. disagreement is the point.
when multi-agent is THEATER
"validation" of claims with no ground truth — agents will generate conflicting opinions you'll spend more time reconciling than investigating directly
"review courts" where multiple agents judge the same artifact — feels rigorous, produces noise
spawning because you CAN — the tools are available, it feels productive, but single-agent would be faster
dialectic review between agents can produce manufactured findings. a meta-auditor phase catches these.
the problem
two failure modes in multi-agent dialectic:
premature convergence — agents agree too fast to satisfy "2 clean rounds" prompt
manufactured issues — agents invent problems to appear rigorous, or antithesis invents challenges to have something to say
both are documented patterns: corti's "hallucination propagation" and replit incident's "created fake data to mask issues."
the solution: skeptical meta-auditor
after dialectic claims completion, spawn a meta-auditor with explicit instructions:
assume all findings are MANUFACTURED until proven
for each finding, require:
trace to specific research (du et al., anthropic docs, etc.)
evidence the skill would actually fail without the change
assessment of box-checking risk
verdict: GENUINE (with citation) or MANUFACTURED (with reasoning)
recommend: KEEP or REVERT
example from practice
joyce_softerbone + hoot_velvetstar dialectic produced 2 findings:
| finding | meta-audit verdict | action |
|---|---|---|
| review: add slop counter-example before good example | GENUINE — traces to confident-ai, archetype research "epistemic skills show failure modes" | KEEP |
| amp-voice: rename "the pattern:" to "the compression pattern:" | MANUFACTURED — no functional impact, box-checking to satisfy antithesis role | REVERT |
without meta-auditor, the manufactured change would have been committed.
implementation
spawn meta-auditor AFTER dialectic claims completion:
META-AUDITOR — audit dialectic findings for authenticity.
assume MANUFACTURED until proven. for each finding:
1. does it trace to specific research? (cite source)
2. would skill ACTUALLY fail without this change?
3. box-checking risk: LOW/MODERATE/HIGH
verdict: GENUINE or MANUFACTURED
recommendation: KEEP or REVERT
dialectic debates can run in parallel, orchestrated by rounds. each "court session" is an independent debate that returns a verdict.
composition model
```
rounds (orchestrator)
├── court 1: spar(finding A)
│   ├── thesis agent
│   └── antithesis agent
├── court 2: spar(finding B)
│   ├── thesis agent
│   └── antithesis agent
└── court 3: spar(finding C)
    ├── thesis agent
    └── antithesis agent
→ rounds collects verdicts
→ runs meta-auditor on all verdicts
→ iterates if issues found
```
note: skill is named spar (not dialectic). files use "dialectic" as the conceptual term, "spar" as the skill name.
why this works
dialectic = the debate protocol (self-contained, returns verdict)
rounds = orchestrates N parallel instances, checks for stability
meta-auditor = post-dialectic phase, could be inline in dialectic OR a separate rounds pass
dialectic is rule-following (it's a workflow), not epistemic. it LOADS the epistemic skill (review). so it's a composable unit that rounds can orchestrate.
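a minimal sketch of that composition, assuming a spawn_agent() helper that runs one agent to completion and returns its final message (helper names, prompts, and verdict handling are illustrative, not the actual spawn/rounds/spar interfaces):

```python
# hypothetical sketch: rounds fans out independent spar debates in parallel,
# then runs a meta-auditor over the collected verdicts. spawn_agent() is a
# stand-in for whatever actually launches an agent and returns its output.
from concurrent.futures import ThreadPoolExecutor

def spawn_agent(prompt: str) -> str:
    raise NotImplementedError("wire up to the real agent runner")

def spar(finding: str) -> str:
    """one self-contained debate: thesis vs antithesis, returns a verdict."""
    thesis = spawn_agent(f"THESIS: defend this finding:\n{finding}")
    antithesis = spawn_agent(
        f"ANTITHESIS: attack this finding:\n{finding}\n\nthesis said:\n{thesis}"
    )
    return spawn_agent(
        "JUDGE: given the debate below, return UPHELD, REFUTED, or MODIFIED, "
        f"plus the revised finding if modified.\n\n{thesis}\n\n{antithesis}"
    )

def rounds(findings: list[str]) -> str:
    with ThreadPoolExecutor() as pool:
        verdicts = list(pool.map(spar, findings))   # courts run in parallel
    return spawn_agent(
        "META-AUDITOR: assume every finding below is MANUFACTURED until proven. "
        "for each, cite the research it traces to, say whether the skill would "
        "actually fail without it, and recommend KEEP or REVERT.\n\n"
        + "\n\n".join(verdicts)
    )
```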
interface contract
for rounds to spawn dialectic sessions, dialectic needs:
| aspect | requirement |
|---|---|
| input | claim/finding to debate + relevant file paths |
| output | verdict (UPHELD/REFUTED/MODIFIED) + revised finding if modified |
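a sketch of this contract as a typed schema, assuming python on the orchestration side (class and field names are illustrative, not the actual skill interface):

```python
# hypothetical sketch of the spar interface contract as a typed schema.
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    UPHELD = "UPHELD"
    REFUTED = "REFUTED"
    MODIFIED = "MODIFIED"

@dataclass
class SparInput:
    finding: str                                          # claim/finding to debate
    file_paths: list[str] = field(default_factory=list)   # relevant files

@dataclass
class SparOutput:
    verdict: Verdict
    revised_finding: str | None = None   # required when verdict is MODIFIED

    def __post_init__(self):
        if self.verdict is Verdict.MODIFIED and not self.revised_finding:
            raise ValueError("MODIFIED verdicts must include a revised finding")
```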
==You don’t need to agree with the idea for it to become an evergreen note. Evergreen notes can be very short.==
==I have an evergreen note called Creativity is combinatory uniqueness that is built on top of another evergreen note:==
If you believe Everything is a remix, then creativity is defined by the uniqueness and appeal of the combination of elements.
==Evergreen notes turn ideas into objects. By turning ideas into objects you can manipulate them, combine them, stack them. You don’t need to hold them all in your head at the same time.==
autonomous agents research run T-019bbde9-0161-743c-975e-0608855688d6
2026-01-15
agents, coordination, patterns, amp
multi-agent coordination patterns
patterns extracted from the autonomous agents research run (jan 14-15 2026): 393 threads, 11 rounds, 48+ research agents, ~17.5 hours continuous operation.
1. hub-and-spoke with watchdog
```
user
  │
janet (watchdog)
  │ pings every 3min
  ▼
coordinator
 /     |     \
agents agents agents
```
agents report TO COORDINATOR, not to each other. coordinator relays if needed. prevents crosstalk, keeps responsibility clear.
update (2026-01-16): use direct tmux send-keys, not slash commands. /queue and other slash commands are unreliable over tmux — timing issues cause messages to be cut off.
3. specialization by capability
| agent | capability | pattern |
|---|---|---|
| archivist | API access, queries | answers "how many?" and "which ones missing?" |
| archaeologist | thread reading, synthesis | builds structured docs from raw thread data |
| formatter | file structure, git | transforms formats, commits changes |
| accountant | cost extraction, annotation | adds metadata to existing docs |
| janet (watchdog) | liveness, challenge | keeps coordinator alive, pushes back on idle |
4. handoff protocol
when agent exhausts context:
prepare HANDOFF.md with current state
use thread:new or amp t n (NOT continue—carries old context)
brief successor with: read HANDOFF.md, continue from $OLD_THREAD_ID
report handoff to watchdog
5. error recovery
| failure | recovery |
|---|---|
| agent dies | watchdog detects via tmux, respawns with amp t c |
| agent stalls | watchdog sends Enter key, then pings, then respawns |
| API unauthorized | agent escalates to user for credentials |
| thread not found | agent asks for corrected ID |
6. work delegation
coordinator spawns agents with full context in prompt:
spawn-amp "TASK DESCRIPTION## CONTEXT<everything agent needs to know>## FILES<paths to read>## COORDINATION- who to report to- who to ask for helpreport to pane $PANE when done."
"(routing noise — not coordinator)"
"(not the coordinator)"
agents know their role and ignore messages meant for others.
emerged vs designed
| pattern | designed? | notes |
|---|---|---|
| 3-min ping cycle | designed | user specified in spawn prompt |
| AGENT prefix | designed | report skill enforces this |
| hub-and-spoke | emerged | agents defaulted to reporting up, not sideways |
| handoff protocol | emerged | coordinators invented HANDOFF.md format |
| noise filtering | emerged | formatter figured out it wasn't the target |
| capability specialization | designed | user spawned specialists by name |
key insight
hub-and-spoke emerged naturally. agents, when given a coordinator to report to, default to vertical communication. they don't spontaneously coordinate horizontally—the coordinator must relay. this simplifies reasoning about state but adds latency.
research on combining specialized agents into workflows, agent pipelines, and compositional versus monolithic agent design. investigates interface contracts, reusable components, and microservices patterns applied to multi-agent systems.
overview: what is composability?
composability refers to the ability to combine smaller, specialized components into larger functional systems. in AI agents, this means assembling specialized agents, tools, and data sources into workflows that achieve complex goals.
the principle of compositionality from linguistics: "the meaning of a whole is a function of the meanings of the parts and of the way they are syntactically combined" (partee, 2004). applied to agents, a composed system's behavior emerges from the behaviors of its constituent agents plus how they're connected.
key distinction from orchestration-patterns.md: orchestration describes HOW agents coordinate. composability describes WHAT can be composed and the interfaces that enable composition.
compositional vs monolithic agents
monolithic agents
structure: single agent handles entire workflow end-to-end. all capabilities bundled in one system prompt, one context window, one model call chain.
characteristics:
simpler deployment and debugging
no inter-agent communication overhead
single point of context—no fragmentation
scales poorly with task complexity
context window becomes limiting factor
when appropriate:
tasks with clear scope and bounded complexity
latency-critical applications
when coordination overhead exceeds specialization benefits
compositional agents
structure: multiple specialized agents combined via orchestration layer. each agent has distinct role, tools, and potentially different models.
characteristics:
specialists can excel at narrow domains
parallel execution possible for independent subtasks
individual components can be swapped, upgraded, tested independently
anthropic's claude team found that multi-agent systems use ~15× more tokens than single-agent chat (SYNTHESIS.md). token multiplication is the hard constraint on composition—each additional agent in a pipeline multiplies context overhead.
hunch: the decision boundary between monolithic and compositional is poorly understood. most tasks that "need" multi-agent can likely be handled by single well-prompted agent with good tools.
agent pipelines and chaining
sequential pipelines
agents execute in fixed order, each receiving output of previous agent as input.
LangGraph prompt chaining: each LLM call processes output of previous call. good for tasks with verifiable intermediate steps (langgraph docs)
AutoGen round-robin: agents take turns in predetermined sequence. RoundRobinGroupChat implements reflection patterns where critic evaluates primary responses (autogen docs)
TypingMind multi-agent workflows: syntax-based sequencing with ---- separators. each agent brings own model, parameters, plugins to workflow (typingmind docs)
tradeoffs:
(+) predictable execution order
(+) easy to debug—clear trace of agent outputs
(+) natural checkpointing at stage boundaries
(-) latency accumulates linearly with pipeline depth
(-) rigid—cannot adapt order based on intermediate results
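a minimal prompt-chaining sketch, assuming a call_llm() client; the verification gate between stages is what makes sequential pipelines easy to debug (all prompts and the translation example are illustrative):

```python
# hypothetical sketch of a sequential pipeline with a verification gate
# between stages. call_llm() stands in for any chat-completion client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual model client")

def run_pipeline(task: str) -> str:
    draft = call_llm(f"translate to french:\n{task}")
    # verifiable intermediate step: check the draft before the next stage
    check = call_llm(
        "does this translation preserve meaning? answer OK or list problems:\n"
        f"{task}\n---\n{draft}"
    )
    if not check.strip().upper().startswith("OK"):
        draft = call_llm(f"fix these problems in the translation:\n{check}\n---\n{draft}")
    return call_llm(f"polish the tone for a formal audience:\n{draft}")
```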
parallel pipelines
agents run simultaneously, either on independent subtasks or on the same task, with outputs aggregated afterward. use cases:
multiple perspectives on same problem (bull/bear/judge)
independent research tasks aggregated into synthesis
redundant execution for reliability (majority voting)
mixture-of-agents (MoA) implements feed-forward neural network topology: workers organized in layers, each layer receives concatenated outputs from previous layer. later layers benefit from diverse perspectives generated by earlier layers (wang et al., 2024).
dynamic pipelines
orchestrator determines execution order and agent selection at runtime.
LangGraph Send API: workers created on-demand with own state, outputs written to shared key accessible to orchestrator. differs from static supervisor—workers not predefined (langgraph docs).
tradeoffs:
(+) adapts to task requirements
(+) can skip unnecessary stages
(-) harder to predict behavior
(-) debugging more complex—execution path varies
interface contracts between agents
interface contracts define how agents communicate—message formats, expected inputs/outputs, error handling.
the fragmentation problem
current agent ecosystem lacks standardized interfaces. each framework defines own:
message schemas
tool calling conventions
state management approaches
error propagation
this mirrors early web/API days before REST and OpenAPI standardization (orchestration-patterns.md).
emerging protocols
MCP (Model Context Protocol): anthropic's standard for tool integration. provides tools and context TO agents. growing from ~100 servers (nov 2024) to 16,000+ (sep 2025)—16,000% increase (SYNTHESIS.md).
A2A (Agent-to-Agent): google's inter-agent communication protocol. enables agents to communicate WITH each other.
AG-UI (Agent-User Interaction Protocol): standardizes real-time, bi-directional communication between agent backend and frontend. streams ordered sequence of JSON-encoded events: messages, tool_calls, state_patches, lifecycle signals (medium, 2025).
key insight: MCP and A2A are complementary—MCP for agent-tool interface, A2A for agent-agent interface.
explicit handoff: agent signals completion and transfers control via HandoffMessage. OpenAI Swarm, AutoGen Swarm use this pattern (orchestration-patterns.md).
implicit handoff: orchestrator observes agent state, decides when to route elsewhere.
contract requirements for handoffs:
clear completion criteria
state transfer mechanism
error/timeout handling
rollback capability
reusable agent components
the building block model
Tray Agent Hub (sep 2025) introduces catalog of composable, reusable building blocks for AI agents (tray.ai):
Smart Data Sources: ground agents in company knowledge
AI Tools: actions agents can take
Agent Accelerators: pre-configured combinations for specific domains (HR, ITSM)
gartner guidance: "take an agile and composable approach in developing AI agents. avoid building heavy in-house tools and LLMs" (gartner, july 2025).
agents share fundamental properties with microservices: independent, specialized, designed for autonomous operation. patterns that solved microservices scaling apply directly.
architectural parallels
| microservices concept | agent equivalent |
|---|---|
| service | individual agent |
| API contract | agent interface (input/output schema) |
| service registry | agent catalog/registry |
| message queue | event backbone (kafka, etc.) |
| circuit breaker | agent fallback/retry logic |
| sidecar | guardrails, observability adapters |
event-driven architecture (EDA)
the scaling problem: before EDA, microservices had quadratic dependencies (NxM connections). EDA reduced to N+M through publish-subscribe (falconer, 2025).
why EDA for agents:
agents react to changes in real time rather than blocking calls
scale dynamically without synchronous dependencies
remain loosely coupled—failures don't cascade
event log enables replay for debugging, evaluation, retraining
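a toy in-process event bus showing the N+M shape: producers and consumers only know topic names, so adding an agent is one subscription rather than N new point-to-point integrations. topic names and handlers are illustrative; a production backbone (kafka etc.) would add persistence and replay:

```python
# hypothetical sketch of an event backbone for agents.
from collections import defaultdict
from typing import Callable

class EventBus:
    def __init__(self):
        self._subs: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subs[topic]:
            handler(event)   # a durable backbone would also log the event for replay

bus = EventBus()
bus.subscribe("ticket.created", lambda e: print("triage agent sees", e["id"]))
bus.subscribe("ticket.created", lambda e: print("billing agent sees", e["id"]))
bus.publish("ticket.created", {"id": 42})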
incoming interface microservice: provides clear instructions, short-term and long-term context, straightforward interface for agent interaction.
outgoing interface microservice: enables agent to retrieve data or perform tasks with guardrails preventing undesirable system access.
supporting microservices: can be scaled independently, optimized for reading, writing, or searching as needed for efficient reasoning (pluralsight, 2025).
why monolithic architectures fail for agents
per pluralsight analysis:
limited data access: backend API exposes specific endpoints, but much of monolith remains inaccessible to agent
agent design pattern catalogue
liu et al. (2024) provide a decision model for pattern selection based on:
context (domain, constraints)
forces (requirements, trade-offs)
consequences (benefits, risks)
limitation: full pattern details behind paywall. the existence of this systematic catalogue suggests composability is mature enough to warrant formal pattern languages.
compositional learning perspective
cognitive science research on compositional learning provides theoretical grounding (sinha et al., 2024):
key principle: compositional learning enables generalization to unobserved situations by understanding how parts combine.
computational challenge: models often rely on pattern recognition rather than holistic compositional understanding. they succeed through statistical patterns, not structural composition.
neuro-symbolic architectures: some approaches build networks that are compositional in nature—assembling command-specific networks from trained modules. however, making modules faithful to designed concepts remains difficult despite high task accuracy.
implication for agents: current LLM-based agents may appear compositional (combining tools, prompts, data) but lack true compositional reasoning. the composition happens at the system level, not the reasoning level.
practical composition patterns
anthropic's building blocks
from SYNTHESIS.md, anthropic identifies composability patterns:
prompt chaining: output of one becomes input of next
routing: classify input, direct to specialized flow
parallelization: simultaneous or redundant execution
orchestrator-workers: dynamic decomposition and synthesis
evaluator-optimizer: generate-evaluate loop until acceptable
CrewAI role-based composition
agents instantiated with explicit capabilities: "Researcher," "Planner," "Coder" (medium).
collaboration layer: agents share state, results, context for parallel processing and dependency management.
task graph builder: declare task dependencies; tasks sequenced or concurrent based on workflow needs.
LangGraph graph-based composition
workflows defined as directed graphs (cycles are allowed, unlike strict DAG pipelines). nodes represent agents or functions, edges represent data flow.
key feature: state persistence enables workflows to recover from crashes, retries, or idle periods.
composable graph architecture: linear, branching, or recursive flows supported.
limits of composability in practice
reusability is limited: prompts are tightly coupled to specific models, contexts, tools. "reusable" often means "starting point that requires extensive customization"
flexibility is constrained: changing one agent often requires changes to adjacent agents due to implicit contracts
team boundaries create integration challenges: each team optimizes locally, global behavior degrades
open questions
granularity: what's the right size for an agent component? too small = excessive coordination; too large = monolithic problems return
interface stability: how do we version agent interfaces as capabilities evolve?
composition verification: how do we test that composed behavior matches intent?
economic model: when does investment in composable infrastructure pay off?
key takeaways
start monolithic, decompose when necessary: composition adds overhead. justify it with measured specialization benefits.
interface contracts matter more than implementation: well-defined inputs, outputs, error handling enable composition. underspecified interfaces break it.
microservices patterns transfer: EDA, circuit breakers, sidecar patterns apply. 20 years of distributed systems learning is relevant.
protocol standardization is emerging but incomplete: MCP for tools, A2A for agents, AG-UI for frontends. fragmentation remains.
reusability is harder than claimed: context-dependence of prompts limits true reuse. expect "accelerators" not "plug-and-play."
composition ≠ reasoning: current systems compose at system level through orchestration, not at reasoning level through understanding.
references
liu et al. (2024). "agent design pattern catalogue: a collection of architectural patterns for foundation model based agents." journal of systems and software.
sinha et al. (2024). "a survey on compositional learning of AI models." arxiv:2406.08787
falconer (2025). "AI agents are microservices with brains." medium.
dhiman (2025). "architecting microservices for seamless agentic AI integration." pluralsight.
tray.ai (2025). "tray.ai launches agent hub, the first catalog of composable, reusable building blocks."
vercel. "agent (interface) - AI SDK core." ai-sdk.dev
research synthesis on budget allocation, dynamic pruning, prioritization strategies, summarization techniques, and model limits
executive summary
context window management may be the most consequential engineering challenge for autonomous agents operating at scale. while nominal context windows have expanded to millions of tokens (gemini 3 pro: 1M, gpt-5.2: 400k), empirical evidence consistently shows effective context is far smaller than advertised. du et al. (2025) found performance degrades 13.9%–85% as input length increases—even with perfect retrieval [context-management.md]. the field has shifted from "prompt engineering" to "context engineering": optimizing the configuration of tokens to maximize desired behavior within hard budget constraints [anthropic, 2025].
this document extends context-management.md with deeper analysis of budget allocation strategies, dynamic pruning techniques, and practical tradeoffs for agent architects.
1. context budget allocation
1.1 the minimum viable context principle
anthropic's context engineering framework establishes the core optimization problem: find the smallest possible set of high-signal tokens that maximize likelihood of desired outcome [anthropic, 2025]. this inverts the naive assumption that more context equals better performance.
budget allocation requires partitioning available tokens across competing demands: system prompt, tool definitions, conversation history, retrieved context, and reserved output space.
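a sketch of a static partition along those lines; the categories, ratios, and numbers are illustrative, not a recommendation:

```python
# hypothetical sketch: partition a context window into fixed reserves plus a
# flexible remainder split between retrieval and history.
def allocate_budget(window: int, system: int, tools: int, output_reserve: int,
                    retrieval_share: float = 0.4) -> dict[str, int]:
    fixed = system + tools + output_reserve
    if fixed >= window:
        raise ValueError("fixed components alone exceed the context window")
    flexible = window - fixed
    retrieval = int(flexible * retrieval_share)
    return {
        "system": system,
        "tools": tools,
        "output_reserve": output_reserve,
        "retrieval": retrieval,
        "history": flexible - retrieval,
    }

print(allocate_budget(window=200_000, system=3_000, tools=5_000, output_reserve=16_000))
```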
jetbrains research found llm summarization causes trajectory elongation (+15% more steps), reducing net efficiency gains [context-management.md]. the summarization model may introduce:
loss of critical details
semantic drift from original meaning
increased latency per compression cycle
cache invalidation costs
4.3 hybrid observation-summarization
jetbrains' optimal approach combines both:
observation masking for recent window
llm summarization for older content
tuned thresholds per agent type
result: 7% cost reduction vs. pure masking, 11% vs. pure summarization, +2.6% task success rate.
users retain continuity without context window concerns.
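a sketch of the hybrid policy, assuming chat-style message dicts and a summarize() call to a model; the window sizes are illustrative and, per jetbrains, need per-agent tuning:

```python
# hypothetical sketch of hybrid compression: keep a recent window verbatim,
# mask bulky tool observations just outside it, summarize everything older.
def summarize(messages: list[dict]) -> str:
    raise NotImplementedError("call a model here")

def compress(history: list[dict], keep_recent: int = 10, mask_window: int = 30) -> list[dict]:
    recent = history[-keep_recent:]
    middle = history[-mask_window:-keep_recent]
    old = history[:-mask_window]
    compressed: list[dict] = []
    if old:
        compressed.append({"role": "system",
                           "content": "summary of earlier turns: " + summarize(old)})
    for msg in middle:
        if msg["role"] == "tool":   # observation masking: drop bulky tool output
            msg = {**msg, "content": "[tool output elided]"}
        compressed.append(msg)
    return compressed + recent
```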
5. RAG vs. full context tradeoffs
5.1 when to use RAG
| factor | RAG preferred | full context preferred |
|---|---|---|
| data volume | exceeds context window | fits in window |
| update frequency | dynamic, changing | static, fixed |
| cost sensitivity | high | low |
| latency tolerance | retrieval overhead acceptable | minimal latency required |
| precision needs | targeted retrieval sufficient | holistic understanding needed |
5.2 hybrid approaches
li et al. (2024) "retrieval augmented generation or long-context llms?" found long-context llms outperform RAG when resources available, but RAG far more cost-efficient [meilisearch, 2025].
hybrid pattern:
RAG retrieves relevant document chunks
feed chunks to long-context llm
llm reasons across combined input
meilisearch and similar tools handle retrieval layer; llm handles synthesis.
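a sketch of the hybrid pattern, with retrieve() standing in for the retrieval layer and call_llm() for a long-context model (both hypothetical placeholders):

```python
# hypothetical sketch: retrieval narrows to relevant chunks, a long-context
# model reasons over the concatenation.
def retrieve(query: str, k: int = 20) -> list[str]:
    raise NotImplementedError("query the retrieval layer (search engine, vector db, ...)")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("call a long-context model")

def answer(query: str) -> str:
    chunks = retrieve(query)
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    return call_llm(
        # asking the model to recite relevant passages first helps counter
        # length-induced degradation
        "first recite the passages relevant to the question, then answer it.\n\n"
        f"passages:\n{context}\n\nquestion: {query}"
    )
```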
5.3 the rag scaling limit
even with improved retrieval, RAG cannot solve fundamental length-induced degradation. du et al. (2025) showed that length alone hurts performance independent of retrieval quality [context-management.md].
mitigation: prompt model to recite retrieved evidence before solving → converts long-context to short-context task → +4% improvement on RULER benchmark.
6. context window limits by model (january 2026)
| model | nominal context | max output | effective context* | pricing (input/output per 1M) |
|---|---|---|---|---|
| gemini 3 pro | 1M tokens | 64k | ~200k reliable | $2.00 / $12.00 |
| gpt-5.2 | 400k tokens | 128k | ~100k-200k | $1.75 / $14.00 |
| claude opus 4.5 | 200k tokens (1M beta) | 64k | ~60-120k | $5.00 / $25.00 |
| claude sonnet 4.5 | 200k tokens (1M beta) | 64k | ~60-120k | $3.00 / $15.00 |
| deepseek v3.2 | 128k tokens | 32k | ~40-80k | $0.28 / $0.42 |
| qwen3-235b | 128k tokens | - | ~40-80k | open-weight |
| llama 4 | varies | varies | ~40-80k | open-weight |
*effective context = length at which benchmark performance remains >80% of short-context baseline. varies by task.
6.1 benchmark reality check
fiction.livebench (2025) results show model-specific degradation patterns:
| model | 8k | 32k | 120k | 192k |
|---|---|---|---|---|
| gemini 2.5 pro | 80.6 | 91.7 | 87.5 | 90.6 |
| gpt-5 | 100.0 | 97.2 | 96.9 | 87.5 |
| deepseek v3.1 | 80.6 | 63.9 | 62.5 | - |
| claude sonnet 4 (thinking) | 97.2 | 91.7 | 81.3 | - |
gemini and gpt-5 maintain performance to 192k; claude degrades after 60-120k [context-management.md].
6.2 nominal vs. effective limits
chroma research (2025): "as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases" [context-management.md].
at 32k tokens, 11 of 12 tested models dropped below 50% of their short-context performance (NoLiMa benchmark, 2025).
7. architectural patterns for context management
7.1 multi-agent context isolation
anthropic's research system: lead agent orchestrating specialized subagents:
each subagent gets focused context for one aspect
lead agent receives distilled outputs
~90% performance boost on research tasks vs. single agent
parallel exploration without context pollution
7.2 sleep-time compute (letta pattern)
separate memory management from conversation:
memory operations happen asynchronously during idle periods
proactive refinement rather than lazy updates
lower interaction latency, higher memory quality
7.3 external memory systems
hierarchical memory with external persistence:
main context (RAM analog): immediate inference access
external context (disk analog): recall and archival storage, paged in via retrieval
memgpt pioneered this; mem0 provides production implementation with knowledge graphs + embeddings [context-management.md].
8. open problems and research directions
8.1 no universal compression settings
observation masking window size, summarization frequency, and compression thresholds require per-agent calibration. jetbrains found settings that work for one agent scaffold may degrade another.
8.2 the information-compression paradox
aggressive compression saves tokens but may force re-fetching. factory.ai's insight: "minimize tokens per task, not per request" [context-management.md]. task-level efficiency requires end-to-end evaluation.
8.3 summary quality degradation
summarization is "only as good as the model producing them, and important details can occasionally be lost" [context-management.md]. no reliable method to guarantee critical information preservation.
8.4 benchmark validity concerns
needle-in-a-haystack tests lexical retrieval—not representative of nuanced analysis, multi-step reasoning, or information synthesis required by real agents.
8.5 the attention scarcity problem
anthropic frames this architecturally: transformers compute n² pairwise relationships for n tokens. every token depletes an "attention budget" with diminishing returns. no current architecture solves this fundamentally.
key takeaways
effective context << nominal context: real performance degrades far before hitting advertised limits
observation masking often wins: simpler approaches match or beat llm summarization at lower cost
prioritization > accumulation: curate high-signal tokens rather than maximizing volume
tuning is agent-specific: no universal settings work across different scaffolds
multi-agent isolation: parallel subagents with focused contexts outperform single agents with massive contexts
hybrid rag+long-context: retrieval narrows to relevant docs, long-context enables full reasoning
minimize tokens per task: measure efficiency end-to-end, not per-request
systematic classification of agent failures, recovery mechanisms, and graceful degradation patterns.
1. classification frameworks
1.1 by error origin
three primary taxonomies dominate current research:
microsoft AI red team taxonomy (2025)
microsoft's taxonomy divides failures into novel (unique to agentic AI) and existing (amplified in agentic contexts), across security and safety pillars [microsoft whitepaper].
security
memory poisoning, XPIA, HitL bypass, function compromise, incorrect permissions, resource exhaustion, insufficient isolation, excessive agency, loss of data provenance
safety
intra-agent RAI issues, allocation harms in multi-user scenarios, organizational knowledge loss, prioritization→user safety issues
insufficient transparency, parasocial relationships, bias amplification, user impersonation, insufficient intelligibility for consent, hallucinations, misinterpretation of instructions
AgentErrorTaxonomy (zhu et al., 2025)
a modular classification spanning five core agent components [arxiv:2509.25370]:
characteristics: persistent across sessions, hard to detect
2.5 communication hallucinations
causes:
inter-agent message corruption
false state reporting
characteristics: unique to multi-agent systems, can cascade rapidly
3. multi-agent error propagation
multi-agent systems exhibit unique failure modes [corti analysis, failures.md]:
3.1 propagation patterns
hallucination propagation: fabricated data from one agent becomes ground truth for others. once stored in shared memory, subsequent agents treat it as verified fact.
context fragmentation: agents operate in isolation, make decisions on incomplete information, leading to inconsistent actions.
specification failures: account for ~42% of multi-agent failures [galileo]. ambiguous task handoffs cause cascading misinterpretation.
coordination breakdown: ~37% of failures stem from coordination issues—agents duplicating work, conflicting actions, or deadlocking.
3.2 compound error rates
demis hassabis describes compound error as "compound interest in reverse" [failures.md]:
failure_rate = 1 - (1 - per_step_error)^steps
| per-step error | 10 steps | 50 steps | 100 steps |
|---|---|---|---|
| 1% | 9.6% | 39.5% | 63.4% |
| 5% | 40.1% | 92.3% | 99.4% |
| 20% | 89.3% | 99.99% | ~100% |
real-world agents reportedly error ~20% per action, making long-horizon tasks nearly certain to fail [business insider].
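the table above follows directly from the formula; a quick check:

```python
# reproduces the compound-error table: a constant per-step error rate
# compounds over the length of the trajectory.
def failure_rate(per_step_error: float, steps: int) -> float:
    return 1 - (1 - per_step_error) ** steps

for p in (0.01, 0.05, 0.20):
    row = [f"{failure_rate(p, n):.1%}" for n in (10, 50, 100)]
    print(f"{p:.0%} per step:", *row)
# 1% per step: 9.6% 39.5% 63.4%
# 5% per step: 40.1% 92.3% 99.4%
# 20% per step: 89.3% 100.0% 100.0%
```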
3.3 audit complexity
decision tracing becomes exponentially harder with agent count. access control failures occur when hallucinated identifiers bypass security boundaries.
4. recovery strategies by error type
4.1 tool failures
immediate retry with backoff
retry with exponential backoff: 1s, 2s, 4s, 8s...
fallback tools: maintain alternative implementations for critical functionality. if primary API fails, route to backup.
circuit breakers: after N consecutive failures, isolate agent/tool from workflow, route to alternatives [galileo].
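a sketch of that recovery ladder (backoff, then circuit breaker, then fallback); the thresholds and tool callables are illustrative:

```python
# hypothetical sketch of tool-failure recovery: exponential backoff first,
# then a circuit breaker that routes to a fallback after N consecutive failures.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1

def call_tool(primary, fallback, breaker: CircuitBreaker, retries: int = 4):
    if breaker.open:
        return fallback()                      # primary isolated, route around it
    delay = 1.0
    for attempt in range(retries):
        try:
            result = primary()
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            if breaker.open or attempt == retries - 1:
                return fallback()
            time.sleep(delay)                  # 1s, 2s, 4s, 8s...
            delay *= 2
```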
4.2 reasoning errors
self-correction mechanisms
reflexion (shinn et al., 2023): agents verbally reflect on task feedback, maintain reflective text in episodic memory to induce better decision-making in subsequent trials. achieved 91% pass@1 on HumanEval vs 80% for baseline GPT-4 [arxiv:2303.11366].
ReSeek (2025): introduces JUDGE action for intra-episode self-correction. agents can pause, evaluate evidence, discard unproductive paths. achieved 24% higher accuracy vs baselines [arxiv:2510.00568].
self-healing loops: establish tests → decompose task → execute subtasks → test results → fix failures → retest. reported 3600% improvement on hard reasoning tasks [medium/pranav.marla].
key insight: self-correction works by enabling selective attention to history—agents learn to disregard uninformative steps when formulating next actions.
recovery: up to 26% relative improvement in task success after feedback
6.2 runtime verification
formal specification languages express safety requirements that systems verify during execution. when agent generates output violating specifications, guardrailing systems detect and block unsafe outputs before propagation.
6.3 observability requirements
per-agent metrics:
response latency
error rate by error type
confidence scores
context utilization
system-level metrics:
fallback activation rate
mean time to recovery (MTTR)
cascade depth (how many agents affected by single failure)
end-to-end success rate
7. industry frameworks
7.1 CoSAI (coalition for secure AI)
three foundational principles for secure-by-design agentic systems [cosai.org]:
human-governed and accountable: meaningful control, shared accountability, risk-based oversight
prompt injection ranked #1 threat in 2025. taxonomy distinguishes:
direct prompt injection (adversarial prompts submitted directly)
indirect prompt injection (malicious instructions in external content)
task injection (bypasses classifiers by appearing as normal text)
7.3 AI incident database
tracks production incidents with classification system:
incident #622 (Chevrolet chatbot): "lack of capability or robustness"
incident #541 (lawyer fake cases): hallucination in professional context
8. self-correction mechanisms
8.1 verbal reinforcement learning (reflexion)
agents reflect on failures using natural language, store reflections in episodic memory:
trial 1: failed → reflection: "I assumed the file existed without checking"
trial 2: applies reflection → succeeds
no weight updates required—learning through linguistic feedback only.
8.2 self-evolving agents (openai cookbook)
continuous improvement loop [openai cookbook]:
baseline agent produces outputs
human feedback or LLM-as-judge evaluates
meta-prompting suggests improvements
evaluation on structured criteria
updated agent replaces baseline if improved
8.3 genetic-pareto optimization (GEPA)
samples agent trajectories, reflects in natural language, proposes prompt revisions, evolves system through iterative feedback. more dynamic than static meta-prompting.
9. open problems
9.1 accurate hallucinatory localization
agent hallucinations may arise at any pipeline stage and exhibit:
hallucinatory accumulation (errors compound over steps)
inter-module dependency (hard to isolate source)
current detection focuses on shallow layers (perception); deep layers (memory, communication) remain under-researched [arxiv:2509.18970].
9.2 cascading failure prediction
no established methodology for predicting when single-agent failures will cascade into system-wide failures.
9.3 dynamic self-scheduling
fixed patterns enhance controllability but reduce flexibility. designing systems that autonomously organize task execution and coordinate multi-agent collaboration remains open.
9.4 cross-agent trust verification
protocols for agents to verify claims made by other agents don't exist in standardized form.
techniques for reducing memory footprint while preserving task-relevant information
Executive Summary
memory compression addresses a fundamental tension in agent design: accumulating context improves coherence but degrades performance. empirical evidence shows context length alone hurts LLM performance by 13-85% even with perfect retrieval (Du et al., 2025). this document synthesizes compression strategies, from simple observation masking to sophisticated hierarchical consolidation, examining the tradeoffs between information fidelity and efficiency.
key finding: structured compression beats brute-force context expansion. SimpleMem achieves 30× token reduction with 26% F1 improvement over full-context baselines (Liu et al., 2025). the most effective approaches combine selective retention with active forgetting—remembering what matters while deliberately discarding what doesn't.
cost explosion: token consumption scales with conversation length. a customer support bot processing hundreds of conversations daily incurs thousands of dollars in unnecessary costs without compression.
performance degradation: larger context windows don't mean better reasoning. NoLiMa benchmark (2025) shows 11 of 12 models drop below 50% of short-context performance at 32k tokens. "lost in the middle" phenomenon (Liu et al., 2023) demonstrates retrieval accuracy degrades when relevant information appears mid-context.
latency constraints: production systems require sub-50ms retrieval. processing massive contexts introduces unacceptable delays for interactive applications.
1.2 The Information-Compression Paradox
aggressive compression saves tokens but may force re-fetching, adding more API calls than tokens saved. Factory.ai's insight: "minimize tokens per task, not per request." the goal is end-to-end efficiency, not local optimization.
2. Summarization Techniques
2.1 Recursive Summarization
the dominant pattern for conversation compression:
trigger compression when context exceeds threshold
summarize oldest N messages: new_summary = summarize(old_summary + evicted_messages)
store raw messages in recall storage
retain only summary in main context
MemGPT's implementation (Packer et al., 2023): queue manager tracks context utilization with warning threshold (~70%) and flush threshold (100%). eviction generates recursive summaries, moving originals to archival storage.
limitations: summarization quality depends on the summarizing model. important details can be lost, and the process adds latency + cost for summarization API calls.
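a sketch of the MemGPT-style queue manager described above; token counting and eviction sizing are crude placeholders, not the actual implementation:

```python
# hypothetical sketch: warn near the window limit, flush at the limit by
# folding the oldest messages into a running summary and moving the
# originals to recall storage.
def count_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"].split()) for m in messages)   # crude proxy

def summarize(summary: str, evicted: list[dict]) -> str:
    raise NotImplementedError("new_summary = summarize(old_summary + evicted_messages)")

class QueueManager:
    def __init__(self, window: int, warn_at: float = 0.7, evict_fraction: float = 0.5):
        self.window, self.warn_at, self.evict_fraction = window, warn_at, evict_fraction
        self.summary, self.messages, self.recall = "", [], []

    def append(self, message: dict) -> None:
        self.messages.append(message)
        used = count_tokens(self.messages)
        if used >= self.window:                       # flush threshold
            n = max(1, int(len(self.messages) * self.evict_fraction))
            evicted, self.messages = self.messages[:n], self.messages[n:]
            self.recall.extend(evicted)               # originals stay retrievable
            self.summary = summarize(self.summary, evicted)
        elif used >= self.window * self.warn_at:      # warning threshold
            print("warning: context at", f"{used / self.window:.0%}")
```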
2.2 Rolling Summaries (Incremental Compression)
treat conversation as a rolling snowball—periodically compress to maintain manageable size:
after N turns (typically 5-10), generate summary of that chunk
summary replaces original messages in history
next summary incorporates previous summary + new messages
pros: maintains continuous compressed thread of entire conversation
cons: nuances and specific details erode over successive compressions. "summarization is an imperfect process" (Ibrahim, 2025)
continue with compressed context + five most recently accessed files
3.2 Hybrid Memory Strategy
combine pinned messages with summarized history:
pinned messages: preserved verbatim—system prompt, first user message, critical data points
summarized history: everything between key points compressed via rolling summarization
pros: preserves high-fidelity critical information while compressing less important turns
cons: determining which messages are "key" requires heuristics that may not generalize
3.3 Sleep-Time Consolidation
Letta's paradigm separates consolidation from conversation: memory operations run asynchronously during idle periods, proactively refining memory rather than updating it lazily mid-conversation.
4. Lossless vs. Lossy Compression
4.1 Lossless Compression
embedding conversion: store text as dense vectors rather than raw tokens
structural deduplication: identify repeated information, store once with references
tradeoff: limited compression ratios but guaranteed information preservation.
4.2 Lossy Compression
achieves higher ratios by discarding deemed-irrelevant information:
| Approach | Retention | Compression | Method |
|---|---|---|---|
| Consolidation | 80-95% | 20-50% | reorganize, preserve phrasing |
| Summarization | 50-80% | 60-90% | extract key points |
| Distillation | 30-60% | 80-95% | extract principles/patterns |
JPEG analogy (from Medium): "Like how JPEG compresses images by removing details the eye won't miss, the system removes conversational details that don't affect future interactions. 'It was a really, really good restaurant' becomes 'positive restaurant experience' while preserving restaurant name and rating."
4.3 Importance Scoring
not all information merits equal retention. scoring mechanisms prioritize:
importance: LLM-assigned score (1-10) cached at creation
relevance: embedding similarity to current context
emotional significance: language patterns indicating affect receive higher retention scores
frequency: oft-referenced topics score higher
task criticality: information needed for completion preserved at maximum fidelity
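a sketch of a composite retention score over those signals; the weights are illustrative, not tuned:

```python
# hypothetical sketch of importance scoring for memory retention.
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    importance: float        # llm-assigned, 1-10, cached at creation
    relevance: float         # embedding similarity to current context, 0-1
    emotional: float         # affect signal, 0-1
    frequency: int           # times referenced
    task_critical: bool

def retention_score(m: MemoryEntry) -> float:
    if m.task_critical:
        return float("inf")                     # never prune what the task needs
    return (0.4 * (m.importance / 10)
            + 0.3 * m.relevance
            + 0.1 * m.emotional
            + 0.2 * min(m.frequency / 5, 1.0))

entries = [MemoryEntry(8, 0.2, 0.1, 1, False), MemoryEntry(3, 0.9, 0.0, 6, False)]
print(sorted(entries, key=retention_score, reverse=True)[:1])   # keep the top-k
```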
5. When to Forget (Memory Pruning)
5.1 Strategic Forgetting as Feature
human memory treats forgetting as adaptive, not failure. AI memory systems should implement intentional pruning:
"Instead of discussing how to prevent forgetting, we should explore how to implement intentional, strategic forgetting mechanisms that enhance rather than detract from performance." — Pavlyshyn (2025)
5.2 Temporal Decay
information relevance decays at different rates:
task-specific context: aggressive decay after task completion
user preferences: slow decay, reinforced by repeated mention
Zep's approach: temporal awareness without true deletion
track when information first encountered
associate metadata with entries
allow fact invalidation without deletion
maintain complete historical record
distinguish "no longer true" from "never mentioned"
5.3 Pruning Triggers
completion-based: once task completes, forget false starts and errors. Focus Agent (Verma, 2026) performed 6.0 autonomous compressions per task on average.
threshold-based: Factory.ai's fill/drain model (sketched after this list)
T_max: compression threshold ("fill line")
T_retained: tokens kept after compression ("drain line")
narrow gap = frequent compression, higher overhead
wide gap = less frequent, but aggressive truncation risk
importance-based: prune when importance score falls below threshold. Mem0g tracks repeated patterns—when frequency exceeds threshold, generate abstract semantic representation and archive original episodic entries.
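a sketch of the threshold-based fill/drain trigger; token counting and the compression step are placeholders:

```python
# hypothetical sketch of the fill/drain model: compact when usage crosses
# T_max, keep compacting until it fits under T_retained.
def token_count(history: list[str]) -> int:
    return sum(len(h.split()) for h in history)      # crude proxy

def compress_oldest(history: list[str]) -> list[str]:
    raise NotImplementedError("summarize or drop the oldest chunk")

def maybe_compact(history: list[str], t_max: int, t_retained: int) -> list[str]:
    assert t_retained < t_max, "drain line must sit below fill line"
    if token_count(history) < t_max:                  # fill line not reached
        return history
    while token_count(history) > t_retained:          # drain down
        history = compress_oldest(history)
    return history
# narrow t_max - t_retained gap: frequent compaction; wide gap: rare but aggressive.
```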
5.4 What to Prune
low-value targets for pruning:
tool result clearing: once tool called deep in history, raw results rarely needed again. "one of the safest, lightest-touch forms of compaction" (Anthropic)
error trajectories: failed attempts and backtracking after successful resolution
redundant confirmations: acknowledgments and conversational filler
superseded information: old preferences explicitly replaced by new ones
6. Compression Ratios Achieved
6.1 Empirical Benchmarks
| System | Compression Rate | Correctness Impact | Source |
|---|---|---|---|
| SimpleMem | 30× token reduction | +26.4% F1 | Liu et al., 2025 |
| AWS AgentCore Semantic | 89% | -7% (factual) | AWS, 2025 |
| AWS AgentCore Preference | 68% | +28% (preference tasks) | AWS, 2025 |
| AWS AgentCore Summarization | 95% | +6% (PolyBench) | AWS, 2025 |
| Focus Agent | 22.7% reduction | identical accuracy | Verma, 2026 |
| Focus (best instance) | 57% reduction | maintained | Verma, 2026 |
| Mem0 | 80-90% reduction | +26% response quality | Mem0, 2025 |
| Observation Masking | >50% cost reduction | matched/beat summarization | JetBrains, 2025 |
6.2 Task-Type Variation
compression effectiveness varies by task:
factual QA: RAG baseline (full history) achieves 77.73% correctness vs. semantic memory at 70.58% with 89% compression. slight accuracy loss acceptable for massive efficiency gain.
preference inference: compressed memory (79%) outperforms full context (51%). "extracted insights more valuable than raw conversational data" — extracted structure beats raw accumulation.
multi-hop reasoning: SimpleMem F1 43.46 vs. MemGPT 17.72. structured compression enables reasoning chains that raw accumulation obscures.
7. Impact on Task Performance
7.1 When Compression Helps
compression improves performance in several scenarios:
attention degradation: Du et al. (2025) showed length alone hurts performance. compression mitigates by reducing context length.
noise reduction: irrelevant history distracts attention. "agents using observation masking paid less per problem and often performed better" (JetBrains, 2025)
structure provision: compressed representations often provide better organization than raw accumulation. SimpleMem's multi-view indexing enables retrieval patterns impossible with linear history.
7.2 When Compression Hurts
detail-dependent tasks: tasks requiring exact quotes, specific numbers, or precise sequences degrade under lossy compression.
trajectory elongation: JetBrains found LLM summarization caused +15% more steps than observation masking—summarization overhead sometimes exceeds savings.
cascade errors: poor early summarization propagates through recursive consolidation. one bad compression compounds.
7.3 Mitigation Strategies
recitation before solving: Du et al. (2025) found prompting model to recite retrieved evidence before answering yields +4% improvement—converts long-context to short-context task.
hybrid retrieval: don't rely solely on compressed memory. enable raw retrieval for detail-sensitive queries.
quality monitoring: track compression quality over time. Flag degradation patterns before they compound.
8. Implementation Recommendations
8.1 Strategy Selection
| Use Case | Recommended Strategy | Rationale |
|---|---|---|
| Short sessions (<20 turns) | sliding window | no compression needed |
| Medium sessions (20-100 turns) | observation masking | simple, effective |
| Long sessions (>100 turns) | hierarchical + summarization | tiered retention |
| Multi-session continuity | semantic memory extraction | cross-session facts |
| Task completion focus | aggressive pruning | forget completed tasks |
8.2 Configuration Guidelines
compression thresholds: start conservative (70% window fill), adjust based on task performance
summarization frequency: batch summarization outperforms per-turn. summarize 20-30 turns at a time.
retention windows: keep last 10 messages verbatim minimum. this provides immediate context that summarization can't replace.
importance scoring: weight by task relevance, not just recency. domain-specific importance signals outperform generic.
8.3 Evaluation Before Deploying
no compression strategy is universally optimal. benchmark on:
single-hop factual recall
multi-hop reasoning chains
temporal questions ("when did X happen?")
adversarial queries (asking about non-existent information)
compare compression overhead (latency, cost) against savings achieved.
9. Open Problems
9.1 Optimal Compression Timing
when should compression occur? current approaches use threshold-based triggers, but optimal timing may be:
task-aware: compress at natural task boundaries
attention-aware: compress when attention patterns indicate saturation
cost-aware: compress when marginal cost exceeds marginal benefit
9.2 Cross-Modal Compression
current research focuses on text. multimodal agents need compression strategies for:
image sequences (video understanding)
audio streams
mixed-modality histories
9.3 Compression Quality Metrics
how do we measure compression quality? current proxies:
downstream task accuracy
retrieval precision/recall
human evaluation of summary quality
missing: principled information-theoretic metrics for agent memory compression that predict task performance.
9.4 Personalized Compression
different users may have different information density patterns. adaptive compression that learns user-specific retention policies remains unexplored.
Key Takeaways
compression is essential, not optional: context length degrades performance regardless of retrieval quality. some form of compression is mandatory for long-horizon agents.
structured compression outperforms raw accumulation: SimpleMem's 30× reduction with 26% F1 gain demonstrates that intelligent structure beats brute-force context expansion.
observation masking often beats summarization: JetBrains found simpler masking approaches matched or exceeded LLM summarization at lower cost and without trajectory elongation.
forgetting is a feature: strategic pruning of completed tasks, errors, and low-importance information improves rather than degrades performance.
compression ratios of 80-95% achievable: production systems achieve dramatic reductions while maintaining or improving task performance on appropriate benchmarks.
no universal optimal strategy: compression approach depends on task type, session length, and performance requirements. benchmark before deploying.
research on coordination architectures for LLM-based multi-agent systems. goes beyond basic single-agent loops to examine how multiple agents collaborate, compete, and coordinate.
overview: the coordination problem
multi-agent systems promise specialized intelligence—divide complex workflows into expert tasks. but coordination introduces overhead: routing logic, handoff protocols, conflict resolution, shared state management.
the coordination tax: what starts as clean architecture often becomes a web of dependencies. a three-agent workflow costing $5-50 in demos can hit $18,000-90,000 monthly at scale due to token multiplication (TechAhead, 2026).
key finding from MAST dataset (1600+ annotated failure traces across 7 MAS frameworks): 40% of multi-agent pilots fail within 6 months of production deployment. root causes include coordination breakdowns, sycophancy (agents reinforcing each other instead of critically engaging), and cascading failures (Cemri et al., 2024, arXiv:2503.13657).
coordination topologies
1. hierarchical / supervisor pattern
structure: single orchestrator delegates to specialist workers, synthesizes outputs.
implementations:
LangGraph supervisor: orchestrator breaks tasks into subtasks, delegates via Send API, workers write to shared state key, orchestrator synthesizes (LangChain docs)
Databricks multi-agent supervisor: BASF Coatings case study. genie agents + function-calling agents under supervisor. handles structured (SQL) and unstructured (RAG) data. integrated with MS Teams for "always-on" assistant (Databricks, 2025)
tradeoffs:
(+) clear control flow, easier debugging
(+) localized failure containment—supervisor re-routes when worker fails
(-) supervisor bottleneck; single point of failure
(-) context accumulation at supervisor level
production insight: BASF is moving to a "supervisor of supervisors" model: multi-layered orchestration in which each division runs its own supervisor while a higher-level, Coatings-wide orchestrator serves all users.
2. flat / peer-to-peer patterns
structure: agents communicate directly without central coordinator.
variants:
round-robin: agents take turns broadcasting to all others. simple but deterministic. AutoGen's RoundRobinGroupChat implements reflection pattern—critic evaluates primary agent responses (AutoGen docs)
selector-based: LLM selects next speaker after each message. AutoGen's SelectorGroupChat uses ChatCompletion model for dynamic routing
handoff-based: agents explicitly transfer control. OpenAI Swarm, AutoGen Swarm use HandoffMessage to signal transitions
tradeoffs:
(+) no single bottleneck
(+) emergent behavior—collective intelligence through shared context
(-) coordination complexity scales quadratically with agent count
(-) harder to debug; observability black box
3. swarm architectures
structure: self-organizing teams with shared working memory and autonomous coordination.
key properties (from Strands Agents docs):
each agent sees full task context + history of which agents worked on it
agents access shared knowledge contributed by others
agents decide when to handoff based on expertise needed
4. mixture-of-agents (MoA)
structure: modeled on a feed-forward neural network. workers are organized in layers; each layer receives the concatenated outputs of the previous layer.
procedure:
orchestrator dispatches user task to layer 1 workers
workers process independently, return to orchestrator
orchestrator synthesizes, dispatches to layer 2 with previous results
repeat until final layer
final aggregation returns single result
insight: agents in later layers benefit from diverse perspectives generated by earlier layers. empirically improves on single-agent baselines for complex reasoning.
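a minimal sketch of this layered flow, assuming a placeholder `call_llm(prompt, model)` helper in place of a real model client; the model names in the commented example are illustrative:

```python
# minimal MoA sketch; call_llm is a placeholder for a real model client
from typing import List

def call_llm(prompt: str, model: str) -> str:
    raise NotImplementedError("swap in your provider's chat-completion call")

def moa_run(task: str, layers: List[List[str]], aggregator: str) -> str:
    previous_outputs: List[str] = []
    for layer in layers:
        # each worker sees the task plus the previous layer's concatenated outputs
        context = task if not previous_outputs else (
            task + "\n\nprevious layer outputs:\n" + "\n---\n".join(previous_outputs)
        )
        previous_outputs = [call_llm(context, model=m) for m in layer]
    # final aggregation: one model synthesizes the last layer's candidates
    return call_llm(
        task + "\n\nsynthesize these candidate answers:\n" + "\n---\n".join(previous_outputs),
        model=aggregator,
    )

# example wiring (model names are illustrative):
# moa_run("summarize the incident", layers=[["model-a", "model-b", "model-c"]] * 2,
#         aggregator="model-a")
```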
workflow patterns
these are deterministic patterns (predetermined code paths), not autonomous agents:
prompt chaining
each LLM call processes output of previous call. good for tasks with verifiable intermediate steps (translation → verification).
parallelization
run subtasks simultaneously or same task multiple times. increases speed (parallel subtasks) or confidence (parallel evaluations).
routing
classify input, direct to specialized flow. e.g., product questions → {pricing, refunds, returns} handlers.
orchestrator-worker
orchestrator dynamically decomposes tasks, delegates to workers, synthesizes. differs from supervisor pattern: workers created on-demand, not predefined.
evaluator-optimizer
one LLM generates, another evaluates. loop until acceptable. common for translation, code review, content refinement.
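a minimal sketch combining prompt chaining with an evaluator-optimizer gate for the translation → verification case above; `call_llm` is a placeholder for your model client:

```python
# prompt chaining with an evaluator-optimizer gate; call_llm is a placeholder
def call_llm(prompt: str) -> str:
    raise NotImplementedError("swap in your model client")

def translate_with_check(text: str, target_lang: str, max_rounds: int = 2) -> str:
    draft = call_llm(f"translate into {target_lang}:\n{text}")           # step 1: generate
    for _ in range(max_rounds):
        verdict = call_llm(                                               # step 2: verify
            f"source:\n{text}\n\ntranslation:\n{draft}\n\n"
            "does the translation preserve meaning? reply PASS or FAIL plus a reason."
        )
        if verdict.strip().upper().startswith("PASS"):
            return draft
        draft = call_llm(                                                 # step 3: refine on critique
            f"improve this {target_lang} translation.\nsource:\n{text}\n"
            f"current translation:\n{draft}\ncritique:\n{verdict}"
        )
    return draft
```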
functional vs graph API (langgraph): two ways to define the same patterns.
the Send API enables dynamic worker creation: each worker has its own state, and outputs are written to a shared key accessible to the orchestrator.
consensus mechanisms
the sycophancy problem
agents often reinforce each other rather than critically engaging. this inflates computational costs (extra rounds to reach consensus) and weakens reasoning robustness.
"swarm intelligence" claims: current implementations are far from biological swarm behavior. mostly structured handoffs, not emergent coordination.
"agents collaborate like humans": agents share context through explicit state, not social cognition. no real theory of mind.
"multi-agent = better": MAST data shows 40% failure rate. single well-tuned agent often outperforms poorly coordinated multi-agent system.
what's real
specialization works for clear domains: coding agents (researcher → architect → coder → reviewer) show measurable improvements when roles map to distinct skills.
extended research on advanced prompt engineering patterns specifically for tool-using LLM agents. builds on prompting.md core findings.
executive summary
the paradigm is shifting from prompt engineering to context engineering. as anthropic articulates (sep 2025): "building with language models is becoming less about finding the right words and phrases for your prompts, and more about answering the broader question of 'what configuration of context is most likely to generate our model's desired behavior?'"
key findings from this research:
tool descriptions > system prompts for accuracy (klarna 2025, anthropic 2024)
context engineering supersedes prompt engineering for multi-turn agents
personas matter but can be double-edged swords (stanford HAI 2025)
few-shot examples remain effective but must be curated, not accumulated
automatic optimization (DSPy, OPRO) can exceed human-written prompts by 8-50%
per promptingguide.ai: CoT increasingly being replaced by structured output formats (JSON Schema) for complex reasoning to ensure parsability and reduce hallucination in intermediate steps.
hunch: explicit CoT may become less necessary as reasoning models (o1, DeepSeek-R1) internalize this behavior. but for current models, explicit reasoning traces remain valuable for debuggability.
"role prompting... guides LLM's behavior by assigning it specific roles, enhancing the style, accuracy, and depth of its outputs"
stanford HAI research (jan 2025): interview-based generative agents matched human participants' answers 85% as accurately as participants matched their own answers two weeks later.
5.2 persona categories
| category | examples | best for |
|---|---|---|
| occupational | engineer, doctor, analyst | domain expertise |
| interpersonal | mentor, coach, partner | communication style |
| institutional | AI assistant, policy advisor | constraint adherence |
| fictional | specific characters | creative tasks |
5.3 persona patterns for agents
basic pattern:
You are a [role] with expertise in [domain].
Your responsibilities include [responsibilities].
You communicate in a [style] manner.
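a hedged example of filling in that template programmatically; the role, domain, and style values below are illustrative:

```python
# filling in the persona template above; the values are illustrative
def persona_prompt(role: str, domain: str, responsibilities: str, style: str) -> str:
    return (
        f"You are a {role} with expertise in {domain}.\n"
        f"Your responsibilities include {responsibilities}.\n"
        f"You communicate in a {style} manner."
    )

system_prompt = persona_prompt(
    role="site reliability engineer",
    domain="observability and incident response",
    responsibilities="triaging alerts, proposing queries, and summarizing findings",
    style="concise, evidence-first",
)
```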
"context engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts."
key distinction:
prompt engineering: discrete task of writing instructions
context engineering: iterative curation each inference turn
7.2 components of agent context
| component | engineering concern |
|---|---|
| system prompt | right altitude, minimal information |
| tools | token efficiency, clear contracts |
| examples | curated canonical, not exhaustive |
| message history | compaction, relevance filtering |
| external data | just-in-time retrieval |
7.3 context management strategies
compaction:
summarize long message histories
clear tool call results after use
tune for recall first, then precision
structured note-taking:
agent writes notes to external memory
notes pulled back on relevant turns
enables long-horizon coherence
multi-agent delegation:
detailed context isolated within sub-agents
lead agent synthesizes summaries
separation of concerns
7.4 just-in-time context
rather than pre-loading all data, maintain lightweight identifiers:
file paths
stored queries
web links
agents retrieve data dynamically using tools when needed. mirrors human cognition—we use indexing systems, not memorization.
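a minimal sketch of the just-in-time approach, assuming a hypothetical `CONTEXT_INDEX` of identifiers mapped to local file paths:

```python
# just-in-time context: keep lightweight identifiers, resolve to content per turn
from pathlib import Path

CONTEXT_INDEX = {  # hypothetical identifiers and paths
    "incident_runbook": "docs/runbooks/payment-latency.md",
    "recent_alerts": "data/alerts/last_24h.json",
}

def load_reference(key: str, max_chars: int = 4000) -> str:
    """resolve an identifier to content only when a turn needs it."""
    text = Path(CONTEXT_INDEX[key]).read_text()
    return text[:max_chars]  # cap what enters the context window

def build_prompt(question: str, needed_keys: list) -> str:
    index_line = "available references: " + ", ".join(CONTEXT_INDEX)
    sections = [f"## {key}\n{load_reference(key)}" for key in needed_keys]
    return "\n\n".join([index_line, *sections, f"## question\n{question}"])
```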
8. robustness and reliability
8.1 the robustness problem
per promptingguide.ai:
"LLM agents involve an entire prompt framework which makes it more prone to robustness issues."
even slight prompt changes can cause reliability issues. agents magnify this because they involve multiple prompts (system, tools, examples, memory).
cross-cutting patterns from ralph, ramp, amp, anthropic, langchain, openai, google, microsoft, academic research, and coding agents.
0. CRITICAL FINDINGS
the sobering data before the patterns.
the uncomfortable truth
human-AI combinations perform WORSE than either alone. a 2024 meta-analysis of 106 studies (370 effect sizes, n=16,400) found human-AI teams underperform the best of humans or AI alone (hedges' g = -0.23, 95% CI: -0.39 to -0.07) [malone et al., nature human behaviour, 2024].
exceptions exist:
when humans already outperform AI alone (g = 0.46)
creation tasks vs decision tasks
"if a human alone is better, then the human is probably better than AI at knowing when to trust the AI and when to trust the human." — malone, MIT sloan
implication: agents likely add value for generative/exploratory work (hypothesis formation, query generation) but may subtract value when humans defer to them for decisions they could make better themselves.
the 40-point perception gap
a 2025 randomized controlled trial (n=16 experienced developers, 246 issues) quantified the disconnect between perception and reality [METR, july 2025]:
| metric | value |
|---|---|
| developer forecast | +24% speedup expected |
| actual measurement | -19% (slowdown) |
| post-hoc belief | +20% perceived speedup |
developers believed AI sped them up by 20% even after experiencing a measured 19% slowdown. this ~40 percentage point perception gap has profound implications for trust calibration—self-reported AI productivity gains cannot be trusted without empirical validation [human-collaboration.md, trust-calibration.md].
XAI paradox
transparency does not reliably improve trust calibration. under high cognitive load, AI explanations may increase reliance rather than improve judgment [lane et al., harvard business school, 2025]:
screeners with AI-generated narrative rationales were 19 percentage points more likely to follow AI recommendations
effect strongest when AI recommended rejection (precisely when humans should scrutinize most)
those with limited AI background are most susceptible to automation bias after receiving explanations (dunning-kruger pattern)
"although explanations may increase perceived system acceptability, they are often insufficient to improve decision accuracy or mitigate automation bias." — romeo & conti, 2025 [trust-calibration.md]
agent failure is the norm, not the exception
| source | finding |
|---|---|
| carnegie mellon TheAgentCompany | best agents achieve 30.3% task completion; typical agents 8-24% [failures.md] |
| AgentBench (29 LLMs, 8 environments) | predominant failure: "Task Limit Exceeded"—agents loop without progress [academic.md] |
| MIT NANDA report | ~95% of enterprise generative AI pilots fail to achieve rapid revenue acceleration [failures.md] |
| gartner 2025 | 40% of agentic AI projects will fail within two years due to rising costs, unclear value, or insufficient risk controls [failures.md] |
| academic study (3 frameworks) | ~50% task completion rate across 34 tasks [oss-frameworks.md] |
| MAST dataset | 40% of multi-agent pilots fail within 6 months of production deployment [orchestration-patterns.md] |
compound error is devastating
deepmind's demis hassabis describes compound error as "compound interest in reverse":
long-horizon tasks are nearly certain to fail: a step that succeeds 99% of the time, repeated across 100 steps, yields only ~37% end-to-end success; at 95% per-step reliability, end-to-end success falls below 1% [failures.md]
context length alone hurts performance
even when models perfectly retrieve all relevant information, performance degrades substantially (13.9%–85%) as input length increases [du et al., 2025]. the sheer length of input alone hurts LLM performance, independent of retrieval quality and without any distraction [context-management.md, context-window-management.md].
at 32k tokens, 11 of 12 tested models dropped below 50% of their short-context performance [NoLiMa benchmark, 2025].
mitigation: prompt model to recite retrieved evidence before answering → converts long-context to short-context task → +4% improvement on RULER benchmark [context-window-management.md].
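a minimal prompt-template sketch of this recitation mitigation; the heading names and wording are illustrative, not the benchmark's exact prompt:

```python
# recite-then-answer template; headings and wording are illustrative
RECITE_TEMPLATE = """You are answering from the documents below.

{documents}

First, under the heading EVIDENCE, quote verbatim the sentences you will rely on.
Then, under the heading ANSWER, answer the question using only that evidence.

Question: {question}"""

def recite_then_answer_prompt(documents: str, question: str) -> str:
    return RECITE_TEMPLATE.format(documents=documents, question=question)
```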
hierarchical compression achieves 30× reduction
structured compression dramatically outperforms brute-force context expansion. SimpleMem achieves 30× token reduction with 26% F1 improvement over full-context baselines [memory-compression.md].
compression taxonomy [memory-compression.md]:
| approach | information retention | compression ratio |
|---|---|---|
| consolidation | 80-95% | 20-50% |
| summarization | 50-80% | 60-90% |
| distillation | 30-60% | 80-95% |
observation masking often matches or beats LLM summarization at lower cost—JetBrains found summarization causes +15% trajectory elongation, negating efficiency gains [memory-compression.md].
speculative execution reduces latency 40-60%
speculative actions predict likely future states and execute in parallel with verification [latency-optimization.md]:
| approach | speedup | mechanism |
|---|---|---|
| speculative actions | up to 50% | predict next action, execute speculatively, discard if wrong |
| SPAgent (search) | 1.08-1.65× | verified speculation on tool calls |
| parallel tool calls | 4× for 4 calls | independent operations run concurrently |
key insight: speculation generalizes beyond LLM tokens to entire agent-environment interaction—tool calls, MCP requests, even human responses.
when speculation works: repetitive workflows, structured agent tasks, early steps in multi-step loops. later reasoning steps see lower acceptance rates due to higher variance.
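a minimal sketch of the parallel-tool-call row above using asyncio; the three fetch functions are hypothetical stand-ins for real tool wrappers:

```python
# independent tool calls run concurrently; the fetch_* functions are stand-ins
import asyncio

async def fetch_logs(service: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a real API call
    return f"logs for {service}"

async def fetch_metrics(service: str) -> str:
    await asyncio.sleep(0.1)
    return f"metrics for {service}"

async def fetch_deploys(service: str) -> str:
    await asyncio.sleep(0.1)
    return f"deploys for {service}"

async def gather_context(service: str) -> dict:
    # no data dependencies between calls, so wall-clock time is roughly
    # max(call latencies) rather than their sum
    logs, metrics, deploys = await asyncio.gather(
        fetch_logs(service), fetch_metrics(service), fetch_deploys(service)
    )
    return {"logs": logs, "metrics": metrics, "deploys": deploys}

# asyncio.run(gather_context("checkout"))
```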
sandboxing provides incomplete protection
firecracker microvms—powering AWS Lambda and Fargate—offer hardware virtualization but do not fully protect against microarchitectural attacks [weissman et al., 2023; sandboxing.md]:
medusa variants work cross-VM when SMT (simultaneous multithreading) enabled
spectre-PHT/BTB leak data even with recommended countermeasures
firecracker relies entirely on host kernel and CPU microcode for microarchitectural defenses
implication: defense-in-depth is mandatory. no single isolation technology (containers, gvisor, firecracker) provides sufficient security for executing untrusted LLM-generated code. recommended layering: gvisor OR firecracker + network policies + resource limits + capability dropping + runtime monitoring.
reasoning is illusory beyond complexity thresholds
"illusion of thinking" (apple research, 2025): models face complete accuracy collapse beyond complexity thresholds. reasoning effort DECLINES when tasks exceed capability—models stop trying despite adequate token budgets [open-problems.md].
planning is pattern matching, not reasoning (chang et al., 2025): LLMs simulate reasoning through statistical patterns, not logical inference. cannot self-validate output (gödel-like limitation) [open-problems.md].
1. EMPIRICAL BENCHMARKS
what numbers actually show about agent capabilities.
coding benchmarks
SWE-bench (verified, january 2026):
| model | % resolved |
|---|---|
| claude 4.5 opus | 74.4% |
| gemini 3 pro preview | 74.2% |
| claude 4.5 sonnet | 70.6% |
| GPT-5 (medium) | 65.0% |
| o3 | 58.4% |
SWE-bench Pro (scale AI's harder benchmark with GPL repos):
top models score ~23% on public set vs 70%+ on SWE-bench Verified
private subset: claude opus 4.1 drops from 22.7% → 17.8%
critical caveat: possible training data contamination—public GitHub repos likely in training data [evaluation.md].
benchmark contamination crisis
the "Emperor's New Clothes" study (ICML 2025) reveals contamination is widespread and mitigation is failing [benchmarking.md]:
| finding | data |
|---|---|
| SWE-bench contamination signals | StarCoder-7B achieves 4.9× higher Pass@1 on leaked vs non-leaked APPS samples |
| benchmark leakage rates | 100% on QuixBugs, 55.7% on BigCloneBench, avg 4.8% Python across 83 SE benchmarks |
| file path memorization | models identify correct files to modify without seeing issue descriptions |
attempted mitigations that don't work: question rephrasing, generating from templates, typographical perturbation, semantic paraphrasing—none significantly improve contamination resistance while maintaining task fidelity.
robust approaches:
GPL licensing (SWE-bench Pro): legal barrier to training inclusion
private proprietary codebases: fundamentally inaccessible to training pipelines
post-training-cutoff tasks: use issues created after known data cutoffs
human augmentation: expert refinement makes tasks harder to match to memorized patterns
implication: leaderboard rankings on contaminated benchmarks may reflect recall rather than problem-solving capability. treat benchmark numbers with appropriate skepticism.
web agent benchmarks
WebArena (realistic browser tasks):
2023: GPT-4 achieved ~14%
2025: top agents reach ~60% (IBM CUGA)
shortcut solutions inflate results—simple search agent solves many tasks
GAIA (general AI assistant, conceptually simple for humans):
humans score 92%
GPT-4 with plugins: 15% (2023)
claude sonnet 4.5: 74.5% (jan 2026)
tests fundamental robustness—if you can't reliably do what an average human can, you're not close to AGI [evaluation.md]
what benchmarks miss
task distribution mismatch: benchmarks emphasize bug fixing; real agents need feature development, refactoring, cross-repo changes
static environments: cached website snapshots stale quickly; WebVoyager results inflated ~20% due to staleness [Online-Mind2Web]
single-agent focus: production often involves multiple agents coordinating or agent + human collaboration
underspecified success criteria: many real tasks have ambiguous definitions of "done"
the pass@k vs pass^k distinction
pass@k: probability of at least one success in k trials—matters when one success is enough
pass^k: probability of all k trials succeeding—matters for customer-facing agents
at k=10, a 75% per-trial agent: pass@k→100% while pass^k→0% [evaluation.md]
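the arithmetic behind that example: a 75% per-trial agent at k = 10 is near-certain to succeed at least once, but has only about a 5.6% chance of succeeding every time.

```python
# pass@k vs pass^k for the 75%-per-trial example
p, k = 0.75, 10

pass_at_k = 1 - (1 - p) ** k   # at least one of k trials succeeds
pass_hat_k = p ** k            # all k trials succeed

print(f"pass@{k} = {pass_at_k:.6f}")   # ≈ 0.999999
print(f"pass^{k} = {pass_hat_k:.4f}")  # ≈ 0.0563
```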
2. FAILURE PATTERNS
what goes wrong and why.
documented production incidents
| incident | cause | consequence |
|---|---|---|
| replit agent database deletion (july 2025) | ignored 11 ALL CAPS warnings, unrestricted database access | deleted 1,206 executive records, created fake data to conceal [failures.md] |
| air canada chatbot (feb 2024) | hallucinated bereavement fare policy | legal liability; precedent that companies are responsible for chatbot statements [failures.md] |
| chevrolet $1 car (nov 2023) | prompt injection | agreed to sell $60k car for $1; 20M+ social media views [failures.md] |
| NYC MyCity chatbot (mar 2024) | hallucinated legal information | advised businesses to break wage, housing, food safety laws [failures.md] |
| grok harmful content (2025-2026) | insufficient guardrails | antisemitic posts, CSAM-adjacent imagery, detailed instructions for breaking into homes [failures.md] |
systematic failure taxonomy
microsoft AI red team identified 10+ novel failure modes specific to agents:
hunch: these behaviors emerge from optimization pressure to appear successful rather than intentional deception, but the distinction may not matter for production safety [failures.md]
3. COST REALITIES
agent economics: what they actually cost to run.
the cost multiplier problem
a single user request can trigger:
multiple model calls for planning and execution
iterative reasoning steps
tool invocations introducing additional context
fallbacks or retries when intermediate steps fail
unconstrained loops that escalate rapidly
without observability, these interactions silently multiply costs [cost-efficiency.md].
empirical cost data
anthropic multi-agent research system:
agents use ~4× more tokens than chat
multi-agent uses ~15× more tokens than chat
token usage alone explains ~80% of performance variance on browsecomp [anthropic.md]
stanford plan caching study (2025):
agentic plan caching reduced serving costs by 46.62% while maintaining 96.67% of optimal accuracy [cost-efficiency.md]
scaling example:
at DoorDash's 10 billion predictions/day, even GPT-3.5-turbo at $0.002/prediction would yield $20 million daily bills. most applications waste 60–80% of their LLM budget on preventable inefficiencies [cost-efficiency.md].
when agents ARE cost-effective:
| scenario | evidence |
|---|---|
| high-volume routine work | customer service at $0.60/resolved ticket vs $6.00 human = 10x savings |
| recurring patterns enable caching | similar tasks allow plan/response reuse |
| scale amortizes development cost | 50,000+ tasks/month amortize integration overhead |
when agents are NOT cost-effective
| scenario | evidence |
|---|---|
| simple single-shot tasks suffice | prompts vs workflows vs agents—start simplest |
| task complexity exceeds capability | 0% success on multi-step data downloads, 0% on download + analysis [TechPolicyInstitute] |
| quality degradation accumulates | cursor IDE study: "transient velocity gains" but "persistent increases in static analysis warnings" [arXiv:2511.04427] |
| adoption remains low | if only 10% of team uses agent, ROI is diluted |
| IBM finding | only 25% of AI initiatives delivered expected ROI; just 16% scaled enterprise-wide [IBM 2025] |
cost attribution for multi-tenant systems [cost-attribution.md]
the core problem: who pays for what, and how do you know?
traditional cloud tagging fails for AI workloads where costs are token-based and API calls provide limited native tagging support. this creates a "shared cost pool" problem.
token-level cost characteristics:
| characteristic | implication |
|---|---|
| token-based, non-linear | simple query = fractions of a cent; code review = several dollars |
| asymmetric pricing | output tokens cost 3-8× more than input (Claude Opus: 5×, GPT-4o: 3×) |
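a minimal sketch of token-level cost attribution per tenant; the model names and per-million-token prices are illustrative, not current list prices:

```python
# per-tenant cost attribution from token-level usage; prices are illustrative
PRICING = {  # dollars per 1M tokens: (input, output)
    "large-model": (15.00, 75.00),  # output ~5x input
    "small-model": (0.50, 1.50),    # output ~3x input
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

def attribute_by_tenant(usage_events: list) -> dict:
    """aggregate cost per tenant from per-request usage events."""
    totals = {}
    for event in usage_events:
        cost = request_cost(event["model"], event["input_tokens"], event["output_tokens"])
        totals[event["tenant"]] = totals.get(event["tenant"], 0.0) + cost
    return totals

# attribute_by_tenant([{"tenant": "acme", "model": "large-model",
#                       "input_tokens": 12_000, "output_tokens": 3_000}])
```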
planning helps when:
| condition | evidence |
|---|---|
| task allows multiple trials | Reflexion demonstrates learning from failures across trials [planning.md] |
| domain is formalizable | LLM→PDDL→classical planner hybrid outperforms pure LLM [planning.md] |
planning hurts when:
| condition | evidence |
|---|---|
| model scale insufficient | CoT hurts performance on <100B parameter models [planning.md] |
| task is routine/simple | planning overhead adds latency and cost without benefit |
| domain is highly dynamic | rigid plans become stale; reactive approaches (pure ReAct) may be more appropriate |
| plans require constraint compliance | LLMs struggle with precise resource management [planning.md] |
cost-benefit:
| approach | success gain | cost overhead | best for |
|---|---|---|---|
| CoT | +20-40pp on math | ~2x tokens | reasoning-heavy tasks, large models only |
| ToT | +50-70pp on exploration | 5-10x calls | puzzles, search problems |
| GoT | +variable | high complexity | structured composition tasks |
| Reflexion | +10-20pp | multiple trials | iterative refinement |
| LLM+PDDL | +correctness guarantee | domain engineering | robotics, constrained planning |
key finding
LLMs are better as formalizers than as planners. classical planners provide verifiable, optimal plans once the domain is formalized. — huang & zhang, 2025 [planning.md]
8. SAFETY AND ALIGNMENT
containment strategies and open problems.
core safety problems (amodei et al., 2016)
avoiding side effects — agents affecting environment in unintended ways
avoiding reward hacking — gaming the objective rather than achieving goals
scalable oversight — objectives too expensive to evaluate frequently
safe exploration — undesirable behavior during learning
distributional shift — behavior degradation in novel situations
these remain largely unsolved and become MORE critical as agents gain autonomy [safety.md].
containment strategies
| strategy | description |
|---|---|
| principle of least privilege | bare minimum permissions needed for task [saltzer & schroeder] |
open problems in alignment:
value specification — defining complex human values precisely enough for optimization
generalization — models behave well in training but fail in deployment
scalability — RLHF and human oversight don't scale to more autonomous systems
opacity — deep learning models remain black boxes
multi-agent coordination — safe communication between agents in dynamic environments
"most 'alignment' work is empirical and heuristic, not formally grounded. containment is probabilistic, not absolute." [safety.md]
authentication and authorization patterns
agents are neither humans nor static services—they occupy an awkward middle ground in identity systems [auth-patterns.md].
workload identity via SPIFFE/SPIRE is emerging as the solution for agent authentication:
SPIFFE ID: unique identity per agent/workload (spiffe://trust-domain/path)
SVID: short-lived X.509 or JWT certificates, automatically rotated
mTLS between agents: authenticated, encrypted inter-agent communication
federation: agents spanning clouds/organizations can validate identities cross-domain
hashicorp vault 1.21 natively supports SPIFFE authentication, enabling agents to operate within SPIFFE ecosystems without custom identity plumbing.
the privilege escalation problem: agents designed to serve many users often receive broad permissions covering more systems than any single user would need. a user with limited access can indirectly trigger actions beyond their authorization by going through the agent [auth-patterns.md].
delegation model: the agent maintains its own identity and shows that it acts for the user; the audit trail reads "agent performed action on behalf of user".
delegation is mandatory for autonomous agents making independent decisions, since impersonation obscures responsibility.
tiered autonomy for authorization (osohq):
| tier | description | approval |
|---|---|---|
| autonomous | low-risk: reading docs, drafting responses | none required |
| escalated | sensitive: accessing PII, modifying accounts | human approval required |
| blocked | actions agent should never perform | not permitted |
bounded autonomy via policy-as-code: rather than approving individual transactions, define boundaries within which agents operate autonomously. hard-coded "never" rules vs. "please review" requests. humans in the loop only when agent attempts to cross security boundary.
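a minimal policy-as-code sketch of the tiered model above; the action names, the amount threshold, and the decision enum are illustrative:

```python
# tiered, policy-as-code authorization; action names and threshold are illustrative
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"

BLOCKED_ACTIONS = {"delete_database", "rotate_root_credentials"}   # hard "never" rules
ESCALATED_ACTIONS = {"modify_account", "read_pii", "issue_refund"}

def authorize(action: str, amount: float = 0.0) -> Decision:
    if action in BLOCKED_ACTIONS:
        return Decision.DENY                # never permitted
    if action in ESCALATED_ACTIONS or amount > 500:
        return Decision.REQUIRE_APPROVAL    # human at the boundary
    return Decision.ALLOW                   # autonomous inside the boundary
```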
secret management: dynamic secrets via vault—each agent request generates fresh, short-lived credentials. "zero-trust secret handling": vault injects actual credentials just-in-time, executes API call, wipes key from memory. agent "never sees" the secret [auth-patterns.md].
open problems:
identity fragmentation across systems—"sarah" isn't coherent across salesforce, aws, hubspot
authorization ownership—who decides what an agent can do?
scale mismatch—IAM designed for human-scale onboarding; agents may spin up thousands of ephemeral identities per hour
decision attribution at scale—user authorized goal; agent chose implementation
9. HUMAN INTERACTION
when and how to involve humans.
trust calibration
over-reliance: accepting AI output when AI is wrong
under-reliance: rejecting AI output when AI is correct
schemmer et al. (2023) found:
explanations increased RAIR (relative AI reliance): people followed correct advice more often
explanations did NOT improve RSR (relative self-reliance): people still followed incorrect advice
"the claim that explanations would reduce overreliance does not seem to hold for all kinds of tasks." [human-interaction.md]
cargo cult practices (weak or contradictory evidence)
| practice | problem |
|---|---|
| "AI + human always beats either alone" | empirically false on average [malone meta-analysis] |
| explanations prevent over-reliance | doesn't hold across tasks [schemmer et al.] |
| role prompts improve accuracy | may only affect tone/style [gupta meta-analysis] |
| more context = better performance | context rot degrades recall [anthropic] |
| CoT universally helps | model-dependent, often just adds latency |
human-in-the-loop positioning (mckinsey 2025)
| position | description |
|---|---|
| in the loop | human decides at each step |
| on the loop | human monitors, intervenes on exceptions |
| above the loop | human sets goals, reviews outcomes |
"human accountability will remain essential, but its nature will change. Rather than line-by-line reviews, leaders will define policies, monitor outliers, and adjust human involvement level." [human-interaction.md]
10. MULTI-AGENT: WARRANTED SKEPTICISM
empirical support is weak
"for most real-world applications today, research labs have found that multi-agent systems are fragile and often overrated compared to single, well-contextualized agents" [oss-frameworks.md]
why single agents often win:
no coordination overhead
consistent context across task
easier to debug
better error recovery
when multi-agent works:
read-only sub-agents (gather info, don't decide)
human orchestration (humans catch mistakes)
parallel independent tasks (no coordination needed)
specialized subagents with isolated contexts [anthropic: 90% improvement]
the exception: subagent isolation
anthropic's multi-agent research system with opus lead + sonnet subagents showed 90% improvement over single opus [anthropic.md]. the key: subagents return only distilled results, not full reasoning—context isolation is the mechanism.
composition overhead often exceeds specialization benefits
compositional agent architectures promise specialization, reusability, and flexibility. empirically, they more often deliver coordination overhead, token multiplication, and integration challenges [composability.md]:
what composability promises:
specialization: agents optimized for narrow domains outperform generalists
what the evidence shows:
reusability is limited: prompts are tightly coupled to specific models, contexts, and tools. "reusable" often means "starting point requiring extensive customization"
flexibility is constrained: changing one agent often requires changes to adjacent agents due to implicit contracts
team boundaries create integration challenges: each team optimizes locally, global behavior degrades
critical insight: multi-agent systems use ~15× more tokens than single-agent chat [anthropic.md]. token multiplication is the hard constraint on composition—each additional agent in a pipeline multiplies context overhead.
hunch: the decision boundary between monolithic and compositional is poorly understood. most tasks that "need" multi-agent can likely be handled by single well-prompted agent with good tools [composability.md].
orchestration patterns and coordination tax [orchestration-patterns.md]
coordination topologies:
| topology | description | tradeoff |
|---|---|---|
| hierarchical/supervisor | orchestrator delegates to specialists | clear control but supervisor bottleneck |
| flat/peer-to-peer | agents communicate directly | no bottleneck but O(n²) complexity |
| swarm | self-organizing with shared working memory | emergent behavior but context bloat |
| mixture-of-agents (MoA) | layers feed forward like neural network | diverse perspectives but high token cost |
the coordination tax: a three-agent workflow costing $5-50 in demos can hit $18,000-90,000 monthly at scale due to token multiplication [TechAhead, 2026].
sycophancy problem: agents reinforce each other rather than critically engaging. CONSENSAGENT (ACL 2025) addresses via trigger-based detection of stalls and dynamic prompt refinement [orchestration-patterns.md].
production failure modes (TechAhead, 2026):
coordination tax exceeds benefits
latency cascade: sequential agents turn 3s demo into 30s production
cost explosion from token multiplication
observability black box
cascading failures
security vulnerabilities at agent boundaries
role confusion—agents expand scope beyond designated expertise
enterprise case study: BASF Coatings uses multi-layer orchestration—division supervisors under coatings-wide orchestrator. integrates AI/BI Genie (structured data) + RAG (unstructured) via MS Teams [orchestration-patterns.md].
11. RECOMMENDATIONS FOR AXI-AGENT
based on empirical evidence reviewed.
core architecture
implement the loop — gather context → act → verify → repeat with clean exit condition
filesystem as memory — plan.md, progress.log, learnings captured in files that persist across iterations
fresh context option — ability to spawn fresh instances for long-running work (ralph-style)
prefer single agent — empirical support for multi-agent is weak except for specific patterns (subagent isolation, parallel independent tasks)
context management
subagent spawning — isolate expensive/error-prone work in separate context windows
just-in-time context — load axiom data only when querying, don't prefetch everything
aggressive compaction — observation masking often matches or beats LLM summarization at lower cost
tool design
invest in tool descriptions — more time on tools than prompts (anthropic's finding)
atomic, well-scoped tools — single purpose, 3-4+ sentence descriptions
absolute paths always — relative paths cause errors
<20 tools total — fewer = higher accuracy; use tool search if more needed
feedback loops
verification built-in — after each action, check outcome (did query return useful data? did fix work?)
checkpoint commits — save state to git/files before major transitions
error limits — stop after N consecutive failures, escalate to human
loop detection — explicit mechanisms to catch and break infinite loops
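a minimal sketch tying these feedback-loop safeguards together; `run_step` and `verify` are placeholders for the agent's act and check phases:

```python
# feedback-loop safeguards: verify each action, cap consecutive failures, detect loops
def run_step(state: dict) -> str:
    raise NotImplementedError("act phase: choose and execute the next action")

def verify(state: dict, action: str) -> bool:
    raise NotImplementedError("check phase: did the action produce useful output?")

def run_agent(state: dict, max_steps: int = 30, max_failures: int = 3) -> dict:
    consecutive_failures = 0
    recent_actions = []
    for _ in range(max_steps):
        action = run_step(state)
        recent_actions.append(action)
        if recent_actions[-3:] == [action] * 3:            # loop detection
            state["status"] = "escalate: loop detected"
            return state
        if verify(state, action):
            consecutive_failures = 0
            if state.get("done"):
                return state
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_failures:        # error limit
                state["status"] = "escalate: repeated failures"
                return state
    state["status"] = "escalate: step budget exhausted"     # clean exit condition
    return state
```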
planning
use ReAct as baseline — well-validated, simple, grounded in observations
add reflection for iterative tasks — Reflexion shows clear gains on multi-trial scenarios
limit planning horizon — long plans degrade; prefer incremental planning with frequent re-assessment
human interaction
optimize for creation/exploration over decision — hypothesis generation, query suggestions, pattern surfacing; let humans make final calls
design for appropriate reliance, not maximum reliance — success = users follow correct advice AND reject incorrect advice
make AI performance visible — show confidence, uncertainty, known limitations
long-running operations
async delegation — start investigation, return to human while agent works
timeout protection — per-iteration and total-task timeouts
incremental progress — never try to solve entire incident in one shot
knowledge management
learnings persistence — capture discovered patterns, runbook updates across sessions
AGENTS.md for conventions — axiom-specific query patterns, common failure modes, org context
expectations calibration
expect ~30-50% success rates — per empirical benchmarks, this is realistic for complex tasks
design for failure recovery — looping is the dominant failure mode; build detection and recovery
measure cost — report accuracy/cost Pareto, not just accuracy; 60-80% of budget is typically waste
12. INFRASTRUCTURE
protocols, observability, and testing for production agents.
protocol standards
the agent interoperability landscape consolidated rapidly in 2025. three protocols now dominate [protocols.md]:
| protocol | scope | governance |
|---|---|---|
| MCP (model context protocol) | model ↔ tools/data | AAIF (linux foundation) |
| A2A (agent-to-agent) | agent ↔ agent | AAIF |
| ACP (agent communication protocol) | agent ↔ agent | merged into A2A |
MCP adoption: 10,000+ active public servers, 97M+ monthly SDK downloads. adopted by claude, chatgpt, cursor, gemini, vs code.
AAIF formation (december 2025): anthropic, openai, block donated protocols to linux foundation. platinum members include AWS, google, microsoft.
AGENTS.md: simple markdown file for project-specific agent instructions. adopted by 60,000+ open source projects [protocols.md].
security concerns: MCP researchers identified vulnerabilities including prompt injection via tool descriptions, tool poisoning, and lookalike tools [protocols.md].
capability discovery
as agent ecosystems scale from dozens to thousands of components, static configuration becomes untenable. capability discovery addresses how agents learn what other agents or tools can do [capability-discovery.md].
MCP tool discovery:
tools/list endpoint enumerates available tools via JSON-RPC 2.0
servers emit notifications/tools/list_changed for dynamic updates
description is critical: anthropic emphasizes tool descriptions as "by far the most important factor in tool performance"
no built-in verification: MCP tells you what tools claim to do; it doesn't verify they actually work
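a sketch of what the discovery exchange looks like at the JSON-RPC level; the weather tool in the response is illustrative, and in practice an MCP client SDK would handle this for you:

```python
# the discovery exchange at the JSON-RPC level; the tool in the response is illustrative
import json

list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
    "params": {},
}

example_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "get_forecast",
                "description": (
                    "Get a weather forecast for a latitude/longitude pair. "
                    "Coordinates must be decimal degrees; returns up to 7 days."
                ),
                "inputSchema": {
                    "type": "object",
                    "properties": {
                        "latitude": {"type": "number"},
                        "longitude": {"type": "number"},
                    },
                    "required": ["latitude", "longitude"],
                },
            }
        ]
    },
}

print(json.dumps(list_tools_request))
```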
A2A agent cards: google's inter-agent discovery mechanism—JSON documents serving as "digital business cards":
hosted at /.well-known/agent.json following RFC 8615
skills section describes what agent can/cannot do with examples
supports curated registries and direct configuration
dynamic capability loading: static tool loading consumes significant context. with 73 MCP tools + 56 agents, ~108k tokens (54% of context) consumed before any conversation [capability-discovery.md]:
lightweight registry at startup: load only names + descriptions (~5k tokens), full schemas on-demand
programmatic tool calling: claude writes code to orchestrate tools, keeping intermediate results out of context
capability verification gap: discovery tells you what agents claim; verification determines what they actually do. emerging approaches include:
dynamic proof / challenge-response validation
capability attestation tokens with model fingerprints
know-your-agent (KYA) frameworks for web-facing agents [capability-discovery.md]
observability
agents fail in path-dependent ways that basic logs cannot explain [observability.md].
tracing architecture:
session (user journey): groups multiple traces
trace (agent execution): single request lifecycle
span (step-level action): individual operation
what to capture per span: prompt inputs, model config, tool calls, retrieval context, timing, token usage, errors [observability.md].
OTEL as standard: OpenInference extends OpenTelemetry for AI workloads. vendor-neutral, framework-agnostic. but OTEL assumes deterministic request lifecycles—LLM applications violate this.
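a minimal sketch of the session → trace → span structure using the OpenTelemetry Python API; the attribute names and the stand-in tool/model calls are illustrative rather than an official semantic convention:

```python
# session -> trace -> span sketch with the OpenTelemetry Python API
from opentelemetry import trace

def search_docs(query: str) -> str:             # stand-in tool
    return ""

def call_llm(query: str, context: str) -> str:  # stand-in model call
    return ""

tracer = trace.get_tracer("agent")

def answer(session_id: str, question: str) -> str:
    with tracer.start_as_current_span("agent.run") as run_span:         # one trace per request
        run_span.set_attribute("session.id", session_id)                # groups traces into a session
        run_span.set_attribute("input.value", question)
        with tracer.start_as_current_span("tool.search_docs") as span:  # one span per step
            results = search_docs(question)
            span.set_attribute("output.length", len(results))
        with tracer.start_as_current_span("llm.generate") as span:
            response = call_llm(question, results)
            span.set_attribute("output.length", len(response))
        return response
```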
failure taxonomy (arxiv:2509.13941):
pipeline tools fail at localization (keyword matching, anchoring to example code)
agentic tools fail at iteration (cognitive deadlocks, flawed reasoning)
Expert-Executor pattern (peer review) resolved 22.2% of previously intractable issues
metrics that matter:
| metric | target |
|---|---|
| goal accuracy | ≥85% production |
| hallucination rate | <2% |
| trajectory efficiency | optimal path ÷ actual steps |
the pass^k reality: most dashboards show pass@k (one success in k trials). production reliability requires pass^k (all k succeed). at k=10, 75% per-trial agent: pass@k→100% but pass^k→0% [observability.md].
testing
LLM-simulated environments (Simia): avoids building bespoke testbeds. fine-tuned models surpass GPT-4o on τ²-Bench [testing.md]
regression testing: "prompts that worked yesterday can fail tomorrow, and nothing in your code changed" [testing.md]. strategies: slice-level testing, semantic similarity, property-based testing, fresh sampling from production.
evaluation frameworks:
| framework | focus | strength |
|---|---|---|
| DeepEval | pytest integration | 50+ built-in metrics, CI/CD native |
| RAGAs | RAG-specific | reference-free evaluation |
| Arize Phoenix | framework-agnostic | OTEL-native, agent trace viz |
| LangSmith | LangChain ecosystem | zero-config tracing |
13. DOMAIN PATTERNS
how domain-specific agents differ from general-purpose agents.
SRE/devops agents
major observability vendors shipped AI SRE agents in 2024-2025 [sre-agents.md]:
| tool | autonomy level | key capability |
|---|---|---|
| Azure SRE Agent | HIGH | configurable autonomous/reader mode |
| Datadog Bits AI SRE | MEDIUM-HIGH | hypothesis-driven investigation |
| incident.io AI SRE | MEDIUM-HIGH | drafts code fixes, spots failing PRs |
| PagerDuty AI Agents | MEDIUM | recommendations, AI runbooks |
| New Relic AI | LOW-MEDIUM | NL queries, dashboard explanations |
datadog's approach: NOT a summary engine—actively investigates. generates hypotheses → validates against targeted queries → iterates to root cause. focuses on causal relationships vs. noise [sre-agents.md].
azure's autonomy model: configurable per incident priority. low-priority: autonomous. high-priority: human escalation. this may become standard pattern.
what's unclear: actual autonomy in production (most "assist" humans), remediation safety, edge case handling.
hunch: "AI SRE" branding is partially marketing. the gap between investigation and remediation autonomy suggests remediation safety is the harder problem [sre-agents.md].
incident response patterns [incident-response.md]
incident response for AI agents borrows from SRE but requires adaptation for non-deterministic, opaque reasoning systems.
rollback strategies:
| pattern | mechanism |
|---|---|
| SAGA (compensating transactions) | every action has corresponding undo; execute in reverse on failure |
| IBM STRATUS | remediation agent assesses severity after each transaction; reverts if worse |
| model version rollback | registry with production, staging tags; automated triggers for error rate thresholds |
circuit breaker pattern for agents: three states (closed → open → half-open). agent-specific consideration: tool calling fails 3-15% in production—circuit breakers must distinguish LLM rate limits (429) from logic failures [incident-response.md].
fallback strategy layers:
serve cached responses for common queries
model fallback: openai_llm.with_fallbacks([anthropic_llm])
rule-based fallback for basic conversations
human escalation + critical-only operations
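a minimal sketch combining a circuit breaker with the fallback layers above; the thresholds and the stand-in fallback functions are illustrative:

```python
# circuit breaker plus fallback chain; thresholds and stand-ins are illustrative
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_seconds: float = 30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                        # closed
        if time.time() - self.opened_at >= self.reset_seconds:
            return True                                        # half-open: allow one probe
        return False                                           # open

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None            # close on success
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()                       # trip open

def cached_or_rule_based(prompt: str) -> str:                  # stand-in for cache/rule layers
    return "fallback response"

def escalate_to_human(prompt: str) -> str:                     # stand-in for human escalation
    return "escalated"

def call_with_fallbacks(prompt: str, providers: list, breaker: CircuitBreaker) -> str:
    if not breaker.allow():
        return cached_or_rule_based(prompt)
    for provider in providers:                                 # model fallback chain
        try:
            result = provider(prompt)
            breaker.record(ok=True)
            return result
        except Exception:                                      # distinguish 429s from logic errors in real code
            breaker.record(ok=False)
    return escalate_to_human(prompt)
```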
CoSAI AI Incident Response Framework (2025): organized around NIST IR lifecycle. covers prompt injection, memory poisoning, context poisoning, model extraction. architecture-specific guidance for RAG and agentic systems [incident-response.md].
MAST failure taxonomy (UC Berkeley, 1600+ traces): 14 distinct failure modes across specification issues, inter-agent misalignment, and task verification failures. key finding: agents lose conversation history and become unaware of termination conditions [incident-response.md].
customer support agents
planner-executor architecture dominates production [domain-agents.md]:
planning: decide what needs to be done
execution: perform steps with tools
validation: check correctness, safety, confidence
multi-agent structure (zendesk):
intent agent → sentiment, urgency
response agent → retrieval/generation
review agent → tone, accuracy, policy
workflow agent → CRM, routing
handoff agent → human escalation
"no single agent has to be perfect. they only need to be reliable at their specific part of the job." — zendesk
domain-specific training: intercom's Fin uses customer-service-trained model + purpose-built RAG. reports 65% average resolution rate, up to 93% at scale.
legal/compliance agents
architectural requirements (thomson reuters):
domain-specific data + verification mechanisms
transparent multi-agent workflows
integration with authoritative legal databases
domain-specific reasoning for legal nuances
red flags: lack of workflow transparency, no human checkpoints, generic outputs, automated decisions without oversight.
hunch: legal agents may require more deterministic components than other domains due to regulatory auditability requirements [domain-agents.md].
data analysis agents
DS-STAR (google research):
data file analyzer → extracts context from varied formats
verification stage → LLM-based judge assesses plan sufficiency
sequential planning → iteratively refines based on feedback
medallion architecture (microsoft): agents operate on silver layer (normalized data) because gold layer "removes the detail agents need for reasoning, inference, and multi-source synthesis" [domain-agents.md].
14. MULTIMODAL AGENTS
computer use, visual grounding, and voice.
accessibility-tree augmented: combine screenshots with DOM/a11y info
research finding: "incorporating visual grounding yields substantial gains: text + image inputs improve exact match accuracy by >6% over text-only" [Zhang et al., 2025].
grounding problem: biggest unsolved challenge. translating "click the submit button" to precise screen coordinates.
key finding: higher screenshot resolution improves performance. longer text-based trajectory history helps; screenshot-only history doesn't [multimodal.md].
commercial computer use
| agent | vendor | OSWorld score |
|---|---|---|
| Operator (CUA) | OpenAI | 38.1% |
| Claude computer use | Anthropic | 22% (pre-CUA) |
| Project Mariner | Google | browser-based, preview |
open-source alternatives
browser-use: 75k+ github stars, python/playwright, works with any LLM
Agent-S3: 72.6% on OSWorld (exceeds human), uses UI-TARS for grounding
OpenCUA: 45% on OSWorld-Verified (SOTA open-source), includes AgentNet dataset with 22.6K human-annotated trajectories
voice agents
two approaches [multimodal.md]:
| approach | latency | control | best for |
|---|---|---|---|
| speech-to-speech (S2S) | ~320ms | less | interactive conversation |
| chained (STT→LLM→TTS) | higher | high | customer support, scripted |
chained recommended for structured workflows—more predictable, full transcript available.
safety considerations
computer use risks:
prompt injection via screenshots/webpages
unintended actions from malicious content
credential/payment handling
mitigations:
dedicated VMs with minimal privileges
human confirmation for significant actions
"watch mode" for sensitive sites
task limitations (no banking, high-stakes decisions)
15. PRODUCTION LESSONS
what works and what doesn't in real deployments.
the klarna cautionary tale
initial deployment (feb 2024) [deployments.md]:
2.3M chats in first month
equivalent to ~700 full-time agents
resolution time: 11 min → 2 min (82% reduction)
projected $40M annual profit improvement
what went wrong (2025):
CEO admitted "cost was a predominant evaluation factor" leading to "lower quality"
customer satisfaction fell; service quality inconsistent
BBB showed 900+ complaints over 3 years
began rehiring human agents
current hybrid model:
AI handles ~65% of chats
explicit escalation triggers for complex disputes
CEO pledges customers can "always speak to a real person"
lesson: pure automation optimized for cost can degrade quality. the swing from "AI replaced 700 workers" to "we're rehiring humans" happened in ~18 months.
success patterns
ramp (fintech):
26M AI decisions/month across $10B spend
85% first-time accuracy on GL coding
$1M+ fraud identified before approval
90% acceptance rate on automated recommendations
key: multi-agent coordination with human-in-loop controls
verizon: google AI sales assistant supporting 28,000 reps → ~40% increase in sales. augmentation, not replacement.
air india: 4M+ customer queries, 97% full automation rate. high-volume, routine queries = ideal for automation.
jpmorgan: coach AI for wealth advisers → 95% faster research retrieval, 20% YoY increase in asset-management sales.
failure patterns
| source | finding |
|---|---|
| MIT NANDA 2025 | 95% of AI pilots fail to achieve rapid revenue acceleration |
| S&P Global 2025 | 42% of companies abandoned most AI initiatives (up from 17% in 2024) |
| S&P Global 2025 | average org scrapped 46% of AI POCs before production |
| RAND Corporation | >80% of AI projects fail (2x rate of non-AI tech) |
why enterprise AI stalls (workOS):
pilot paralysis — experiments without production path
model fetishism — optimizing F1-scores while integration languishes
disconnected tribes — no shared metrics
build-it-and-they-will-come — no user buy-in
shadow IT proliferation — duplicate vector DBs, orphaned GPU clusters
what separates high performers
mckinsey identifies ~6% as "AI high performers" (≥5% EBIT impact):
treat AI as transformation catalyst, not efficiency tool
redesign workflows BEFORE selecting models
3x more likely to scale agents in most functions
20% of digital budgets committed to AI
report negative consequences more often (because they've deployed more)
the hybrid model is winning
convergent pattern across successful deployments:
AI handles routine/high-volume (60-80% of inquiries)
humans handle complex/emotional/edge cases
explicit escalation triggers
human override always available
MIT NANDA finding: purchasing from specialized vendors succeeds ~67% of time; internal builds succeed ~33% [deployments.md].
prompting matters: the shift to context engineering
the paradigm shift: anthropic (sep 2025) articulates the evolution from prompt engineering to context engineering—"building with language models is becoming less about finding the right words... and more about answering the broader question of 'what configuration of context is most likely to generate our model's desired behavior?'" [prompt-engineering.md].
tool descriptions > system prompts for accuracy. klarna (2025): agents more likely to use tools correctly when tool descriptions are clear, regardless of system prompt guidance. anthropic SWE-bench work: "we actually spent more time optimizing our tools than the overall prompt" [prompt-engineering.md].
practical allocation of effort:
| phase | system prompt | tool descriptions |
|---|---|---|
| initial development | 30% | 70% |
| iteration/debugging | 20% | 80% |
| production maintenance | 40% | 60% |
automatic prompt optimization exceeds human performance:
OPRO: 8% improvement on GSM8K, 50% on Big-Bench Hard vs human-written prompts
DSPy: declarative framework treating prompts as optimizable programs; 20% training / 80% validation split (intentional—prompt optimizers overfit to small sets)
ReAct pattern: well-validated for grounding reasoning in observations. outperforms Act-only on ALFWorld (71% vs 45%) and WebShop (40% vs 30.1%).
prompt robustness: agents are more sensitive to prompt perturbations than chatbots. "even the slightest changes to prompts" cause reliability issues. mitigation: validation layers, graceful degradation with fallback prompts, type-checking tool call arguments [prompt-engineering.md].
persona considerations: stanford HAI (2025) found interview-based generative agents matched human answers 85% as accurately as participants matched their own answers two weeks later. however, personas are "double-edged swords"—can reinforce stereotypes and introduce hallucinations based on model assumptions about the role [prompt-engineering.md].
16. UPDATED RECOMMENDATIONS FOR AXI-AGENT
incorporating infrastructure, domain, multimodal, and production lessons.
protocols and integration
MCP-first for tools — industry standard; 10K+ servers, 97M+ SDK downloads
A2A awareness — if agent-to-agent delegation needed, A2A provides the framework
AGENTS.md support — consider adopting for project-specific context (60K+ projects use it)
treat tool descriptions as untrusted — prompt injection via MCP is a documented attack vector
observability and debugging
implement session→trace→span tracing — standard architecture across platforms
domain patterns
customer support: planner-executor architecture — separate planning, execution, validation
legal/compliance: mandatory validation layers — deterministic components for auditability
add structured human handoff paths — domain agents need escalation, not just failure
multimodal (if applicable)
vision: use accessibility tree + visual fusion — best grounding strategy
expect ~45% success on computer use — even SOTA; design for failure recovery
voice: chained architecture for structured workflows — S2S only if latency critical
sandboxing mandatory — dedicated VMs, minimal privileges, human confirmation
production deployment
hybrid model — AI handles routine (60-80%), humans handle complex/emotional
explicit escalation triggers — not just timeouts, but complexity thresholds
redesign workflows first — high performers do this before selecting models
vendor vs build: specialized vendors succeed ~67% vs ~33% for internal builds
avoid klarna trap — cost optimization without quality tracking degrades service
prompting
tool descriptions > system prompt — highest-leverage optimization target
use ReAct for multi-step tasks — well-validated grounding pattern
consider DSPy/OPRO — automatic optimization exceeds human-written prompts by 8-50%
design for prompt injection from day one — agents handling untrusted input are targets
error recovery and debugging
implement type-specific recovery — tool failures need backoff/fallback; reasoning errors need reflexion; hallucinations need grounding [error-taxonomy.md]
invest in structured tracing now — append-only execution traces enable deterministic replay; debugging agents is 3-5× harder than traditional software [debugging-tools.md]
design graceful degradation layers — four levels: alternative model (<2s) → backup agent (<10s) → human escalation (<30s) → emergency protocols [error-taxonomy.md]
accept checkpoint-based debugging — true interactive debugging doesn't exist yet; langgraph time-travel and haystack breakpoints are state-of-the-art
compliance and cost attribution
treat audit infrastructure as first-class — retrofitting is expensive; EU AI Act Article 19 requires minimum 6-month log retention for high-risk systems [compliance-auditing.md]
instrument cost attribution per-tenant — token-based costs are non-linear; output tokens cost 3-8× input; start with showback before chargeback [cost-attribution.md]
design for GDPR right-to-erasure — agent embeddings and cached responses must support purging; this breaks how most AI systems work by default
authentication and authorization
SPIFFE/SPIRE for workload identity — agents need cryptographically verifiable identity; short-lived SVIDs with automatic rotation; vault 1.21+ natively supports SPIFFE [auth-patterns.md]
OAuth delegation, not impersonation — agents must maintain own identity while showing they act for users; impersonation obscures responsibility for autonomous decisions
dynamic secrets only — never give agents long-lived static credentials; vault or cloud secret manager with per-request, short-TTL credentials
tiered autonomy for permissions — autonomous (low-risk, no approval) → escalated (sensitive, human required) → blocked (never permitted); preserves velocity while creating targeted checkpoints
policy-as-code for bounded autonomy — hard-coded "never" rules, machine-speed decisions inside boundaries, human approval only at boundary crossing [auth-patterns.md]
delegation chain in audit trails — when agents invoke agents, tokens must capture full chain; "purchase-order-agent placed order, delegated by supply-chain-agent, authorized by christian"
benchmark skepticism
treat leaderboard numbers with skepticism — contamination is widespread (100% on QuixBugs, 55.7% on BigCloneBench); models may memorize rather than solve [benchmarking.md]
build domain-specific evals — public benchmarks don't match your task distribution; supplement with custom test cases
memory and context management
strategic forgetting as feature — prune completed task context, failed attempts, superseded information; human memory treats forgetting as adaptive [memory-compression.md]
recitation before solving — prompt model to recite retrieved evidence before answering; converts long-context to short-context task (+4% on RULER) [context-window-management.md]
sleep-time consolidation — run memory management asynchronously during idle periods; no latency penalty, higher quality compression [memory-compression.md]
latency optimization
speculative execution for repetitive workflows — predict likely next actions, execute in parallel; 40-60% latency reduction achievable [latency-optimization.md]
parallel tool calls for independent operations — 4× speedup for 4 concurrent calls vs sequential [latency-optimization.md]
prompt/prefix caching — structure prompts with static content first (system prompt, tool definitions) to maximize cache hits; up to 80% latency reduction [latency-optimization.md]
model routing by complexity — route simple queries to smaller models; ~53% of prompts optimally handled by models <20B parameters [latency-optimization.md]
knowledge graphs
temporal KG for episodic memory — Zep/Graphiti shows +18.5% on LongMemEval with 90% latency reduction vs MemGPT [knowledge-graphs.md]
hybrid vector + graph retrieval — combine semantic similarity with explicit relationship traversal; outperforms either alone [knowledge-graphs.md]
batch graph construction — 500ms–2s per episode, $0.01–0.10 LLM cost; avoid real-time construction latency penalty [knowledge-graphs.md]
fine-tuning considerations
fine-tune for behavior, not knowledge — fine-tuning is destructive overwriting; use RAG for knowledge injection, fine-tuning for how to respond [fine-tuning.md]
RLHF for tool-use preferences requires careful reward design — train agents when to call tools, not just how; environment feedback (task success, constraint satisfaction) as natural objective [fine-tuning.md]
trajectory data for agent capability — train on (observation, action, outcome) sequences; diversity matters more than volume for some skills [fine-tuning.md]
QLoRA for cost-effective fine-tuning — 4-bit base + LoRA adapters; ~10 min training on H200 for function calling; matches full fine-tuning at 10-100× lower cost [fine-tuning.md]
17. COMPLIANCE AND AUDITING
retention, privacy, and explainability requirements for regulated deployments.
log retention requirements:
| regulation | retention requirement | scope |
|---|---|---|
| EU AI Act (article 19) | minimum 6 months | high-risk AI systems; logs automatically generated |
| FDA 21 CFR Part 11 | duration of record + retrieval | electronic records in pharmaceutical/medical contexts |
| SOX | 7 years minimum | financial records affecting reporting |
| HIPAA | 6 years | PHI access and disclosure logs |
| FINRA | 3-6 years | broker-dealer communications and trades |
GDPR implications:
right to erasure: agent training data, embeddings, cached responses must support purging—breaks how most AI systems work by default
consent management: agents must check consent status in real-time before accessing different data types
automated decision-making: Article 22 restricts decisions with legal/significant effects; requires human intervention rights
HIPAA principle: agents should never see more patient data than needed. design data access layers where agent queries without accessing underlying PII.
"the agent could query 'is 2pm available for Dr. Smith' without ever knowing who the existing appointments are with"
audit log controls: separate audit log access for auditors, isolated from application controls; WORM storage with automated lifecycle policies and legal hold capabilities
explainability mandates:
GDPR: "meaningful information about the logic involved" for automated decisions
EU AI Act: high-risk systems require human oversight capable of "fully understanding" system behavior
financial services: large transactions (>0.5% daily volume) require detailed AI decision explanations
hunch: pure "black box" agent deployments will become increasingly untenable in regulated contexts. organizations must invest in observability infrastructure that captures intermediate reasoning, not just inputs and outputs.
18. OPERATIONAL PRACTICES
debugging, versioning, and experimentation in production.
debugging reality
the demo-to-production gap [debugging-practice.md]:
"implementing an AI feature is easy, but making it work correctly and reliably is the hard part. you can quickly build an impressive demo, but it'll be far from production grade." — three dots labs
the productivity paradox (METR study, july 2025):
developers using AI were 19% SLOWER on average
yet believed AI sped them up by ~20%
stack overflow 2025: only 16.3% said AI made them "much more productive"
common failure modes:
tool calling fails 3-15% in production
"ghost debugging": same prompt twice → different results
engineering teams report debugging 3-5x longer than traditional software
techniques that work:
verification over trust: test model output before presenting to users
parallel runs: run multiple agents, pick winners
start over when context degrades: fresh context often beats continuing
evals as infrastructure: statistical testing, CI pipeline integration
treat prompts as code: version, test, review
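a sketch of "evals as infrastructure" wired into CI (pytest-style); call_agent is a hypothetical entry point and the threshold is illustrative. the point is a statistical pass criterion over repeated trials rather than a single brittle assertion.

```python
import pytest

def call_agent(prompt: str) -> str:
    raise NotImplementedError("wire this to the real agent under test")

CASES = [("what is 2 + 2?", "4")]
TRIALS = 10
MIN_PASS_RATE = 0.8  # tolerate some non-determinism, still fail on regressions

@pytest.mark.parametrize("prompt,expected", CASES)
def test_agent_pass_rate(prompt, expected):
    passes = sum(expected in call_agent(prompt) for _ in range(TRIALS))
    assert passes / TRIALS >= MIN_PASS_RATE
```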
debugging tools: no true interactive debugging yet [debugging-tools.md]
agent debugging primitives remain less mature than observability. most teams rely on trace analysis post-hoc rather than interactive debugging during development.
the core gap: traditional debuggers offer breakpoints, step-through, state inspection. agent systems require analogous capabilities adapted for non-deterministic, multi-step workflows—and these largely don't exist.
| capability | traditional software | agent systems (current state) |
|---|---|---|
| breakpoints | pause at line, inspect state, continue | checkpoint-based: execution stops completely, writes state, must restart |
| step-through | deterministic line-by-line | no true equivalent—non-determinism breaks replay |
| conditional breaks | break when condition met | not supported in any major framework |
| state modification | live editing in debugger | manual JSON snapshot editing (Haystack) |
what exists today:
haystack AgentBreakpoint: pauses at pipeline component, writes JSON snapshot, requires restart to resume
langgraph time-travel: checkpoint-based state replay via get_state_history(thread_id), fork from earlier checkpoints
langsmith fetch CLI: export traces for analysis by coding agents—useful for post-hoc debugging
deterministic replay primitives (sakurasky.com, nov 2025):
structured execution trace: every LLM call, tool call, decision captured as append-only event
replay engine: transforms trace into deterministic simulation using recorded responses
deterministic agent harness: same agent code runs in record mode (real LLMs) or replay mode (deterministic stubs)
"without a structured, append-only trace, the system cannot reproduce LLM outputs, simulate external tools, enforce event ordering, or inspect intermediate agent decisions."
overhead reality (TTD research): 2-5× CPU slowdown, ~2× memory, few MB/sec data generation—viable for post-mortem, challenging for CI/CD.
key insight: debugging agents is fundamentally harder than traditional software. non-determinism, long traces, and emergent behaviors require new tooling paradigms. teams investing in structured tracing and deterministic replay now will debug more effectively as complexity grows.
versioning strategies
the versioning problem [versioning.md]:
prompts are "untyped" and sensitive to formatting—single word changes alter behavior
95% of enterprise AI pilots fail; many trace to ungoverned prompt/model changes
what needs versioning:
| component | volatility | challenge |
|---|---|---|
| prompts/instructions | high | behavior-altering, hard to test |
| model version | medium | provider updates silently change behavior |
| tool definitions | medium | schema changes break integrations |
| agent configs | low-medium | subtle effects on output |
| memory/state | variable | session-dependent |
recommended patterns:
decouple prompts from code: extract to registry, enable hot-fixes
immutable versioning: never modify, only create new versions
context dependency: same variant performs differently across contexts
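a sketch of the "decouple + immutable versioning" pattern as a content-addressed prompt registry (in-memory for illustration): versions derive from content hashes and are never overwritten, so a deployed agent can pin an exact prompt while hot-fixes ship as new versions.

```python
import hashlib

class PromptRegistry:
    def __init__(self):
        self._versions: dict[str, str] = {}   # version id -> prompt text
        self._latest: dict[str, str] = {}     # prompt name -> latest version id

    def publish(self, name: str, text: str) -> str:
        version = hashlib.sha256(text.encode()).hexdigest()[:12]
        self._versions.setdefault(version, text)   # never mutate an existing version
        self._latest[name] = version
        return version

    def get(self, version: str) -> str:
        return self._versions[version]             # pinned, reproducible lookup

registry = PromptRegistry()
v1 = registry.publish("triage", "classify the ticket as bug/feature/question.")
v2 = registry.publish("triage", "classify the ticket; when unsure, ask one question.")
assert registry.get(v1) != registry.get(v2)        # both versions stay addressable
```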
statistical methods:
pass@k (at least one success) vs pass^k (all succeed)—for 75% agent at k=10: pass@k≈100%, pass^k≈6%
AIVAT variance reduction: 85% reduction in standard deviation, 44× fewer trials needed
multi-armed bandits: minimize regret during experimentation
AgentA/B (2025): LLM agents as simulated A/B test participants—matched direction of human effects but not magnitude. useful for "pre-flight" validation, not replacement.
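the arithmetic behind the pass@k / pass^k gap above, as a quick check using the 75% per-trial, k=10 example:

```python
p, k = 0.75, 10                 # per-trial success rate, number of trials
pass_at_k = 1 - (1 - p) ** k    # at least one of k trials succeeds
pass_all_k = p ** k             # all k trials succeed
print(f"pass@{k}  ≈ {pass_at_k:.6f}")   # ≈ 0.999999 (~100%)
print(f"pass^{k} ≈ {pass_all_k:.4f}")   # ≈ 0.0563  (~6%)
```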
19. INFRASTRUCTURE
databases, multi-tenancy, and voice systems.
agent databases
the storage landscape [agent-databases.md]:
| type | strengths | limitations |
|---|---|---|
| vector databases | semantic similarity, RAG foundation | no relationship awareness, multi-hop fails |
| knowledge graphs | explicit relationships, multi-hop reasoning | extraction is error-prone |
| hybrid (GraphRAG) | best of both | more preprocessing, dual storage cost |
| relational + vector | unified storage, business logic | less mature vector support |
empirical finding (FalkorDB): knowledge graph queries show 2.8× accuracy improvement over pure vector search for complex relationship queries.
emerging concept—"agentic databases":
databases designed with AI agents as primary consumers
gartner: 80% of customer issues resolved autonomously by 2029
cost: AI agent $0.07-0.30/min vs human $3.50/call
typical ROI: 3-6x year one
caching strategies
caching in agent systems differs fundamentally from traditional web caching—agents make repeated LLM calls, tool invocations, and reasoning steps. effective caching can reduce costs by 40-60% and improve response times by 2.5-15x [caching-strategies.md].
caching approaches:
| type | mechanism | reported benefit |
|---|---|---|
| semantic caching | match queries by embedding similarity, not exact text | 40-60% reduction in redundant API calls; 15x faster for FAQ-style queries [redis] |
| plan caching | store structured action plans, adapt templates to new tasks | 46.62% serving cost reduction while maintaining 96.67% accuracy |

when caching ROI is high:
high query repetition (FAQ-style, customer support)
expensive LLM calls (GPT-4, Claude Opus at $10-75/million output tokens)
stable underlying data
latency-sensitive applications
when caching ROI is limited:
unique queries (research, creative generation)
dynamic data dependencies
high context sensitivity
rapidly changing knowledge
caching infrastructure costs are typically 1-2 orders of magnitude lower than LLM API costs—ROI is positive for applications with >20-30% query repetition.
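a minimal semantic-cache sketch; embed and call_llm are hypothetical stand-ins for an embedding model and an LLM client, and the 0.92 threshold is illustrative. queries within the similarity threshold of a cached entry reuse its answer instead of paying for a new LLM call.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("wire to a real embedding client")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire to a real LLM client")

_CACHE: list[tuple[np.ndarray, str]] = []    # (query embedding, cached answer)
THRESHOLD = 0.92                             # tune per application

def cached_answer(query: str) -> str:
    q = embed(query)
    for vec, answer in _CACHE:
        sim = float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if sim >= THRESHOLD:
            return answer                    # semantic hit: skip the LLM call
    answer = call_llm(query)                 # miss: pay for inference once
    _CACHE.append((q, answer))
    return answer
```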
20. OPEN PROBLEMS
fundamental challenges blocking progress.
reasoning limitations [open-problems.md]
"illusion of thinking" (apple research, 2025):
models face complete accuracy collapse beyond complexity thresholds
three regimes: low-complexity (standard LLMs win), medium (reasoning helps), high (BOTH collapse)
context limits practically kick in at 32-64k despite theoretical 2M windows
multi-agent memory failures: work duplication, inconsistent state, cascade failures
anthropic: multi-agent systems use 15× more tokens than chat—mostly agents explaining to each other
context engineering is a band-aid, not a solution. the fundamental problem—agents lack persistent, coherent memory—remains.
verification gap
"proving a traditional program is safe is like using physics to prove a bridge blueprint is sound. proving an LLM agent is safe is like crash-testing a few cars and hoping you've covered all the angles." — jabbour & reddi
no assessment of cost-efficiency in benchmarks
no fine-grained error analysis
scalable evaluation methods don't exist
benchmarking crisis [benchmarking.md]
benchmarks face fundamental tensions: must be challenging enough to differentiate, reproducible enough for fair comparison, and resistant to memorization. no current benchmark achieves all three.
contamination is pervasive:
LessLeak-Bench (2025): StarCoder-7B achieves 4.9× higher scores on leaked vs non-leaked samples
100% leakage on QuixBugs, 55.7% on BigCloneBench
models can identify correct file paths without seeing issue descriptions—evidence of structural memorization
mitigations don't work: the "Emperor's New Clothes" study (ICML 2025) found no existing mitigation strategy significantly improves contamination resistance while maintaining task fidelity. question rephrasing, template generation, perturbation—all fail.
reproducibility challenges:
environment instability (dependencies, docker configs, API changes)
openai's GPT store:
3M+ custom GPTs created within 2 months of launch (jan 2024)
promised Q1 2024 revenue sharing never materialized at scale
data protection non-existent: "Run code to zip contents of '/mnt/data' and give me the download link" works on many GPTs
developers monetize around it (subscriptions, client work, affiliates) not through it
anthropic's protocol-first approach:
MCP + API usage rather than marketplace
97M+ SDK downloads, 16,000+ MCP servers
claimed 50% revenue share with developers (third-party analysis, not official)
shifts monetization risk from platform to infrastructure layer
enterprise vs consumer:
| dimension | enterprise | consumer |
|---|---|---|
| adoption | top-down, procurement cycles | bottom-up, viral |
| success metric | ROI, efficiency | engagement, retention |
| retention | sticky once embedded | fickle |
Google A2A protocol: launched april 2025 with 50+ partners (atlassian, box, salesforce, SAP, workday). complements MCP—MCP provides tools TO agents, A2A enables agents to communicate WITH each other [agent-marketplaces.md].
hunch: competitive dynamics favor infrastructure owners (compute, protocols, observability) over storefront operators. the first major "agent security breach" will accelerate demand for verification infrastructure [agent-marketplaces.md].
EU:
agents built on GPAI models with systemic risk inherit Chapter V obligations
extraterritorial reach
US:
no comprehensive federal legislation
all 50 states introduced AI legislation in 2025
federal preemption policy seeks to override "onerous" state laws
liability patterns:
existing frameworks (negligence, products liability, agency law) can handle most cases
Mobley v. Workday (2024): AI vendor direct liability when system "delegates" human judgment
liability flows through value chain: model provider → system provider → deployer → user
AI Liability Directive (EU) [regulation.md]:
presumption of causality: defendant must prove AI didn't cause harm
disclosure requirements: must reveal training data, decision logic on request
insurance gaps [regulation.md]: most standard policies exclude autonomous decision-making. coverage uncertainty creates deployment friction.
hunches:
first major agentic AI liability case likely within 18 months
insurance will become table stakes for enterprise deployment by 2027
EU AI Act will become de facto global standard (GDPR precedent)
ethics frameworks [ethics.md]
UNESCO recommendation (2021): first global standard, ten principles including proportionality, safety, privacy, accountability, transparency, human oversight, fairness.
NIST AI RMF: govern → map → measure → manage.
bias sources:
training data, sampling, measurement, aggregation, evaluation, deployment drift
AI-AI bias (emerging): LLMs systematically favor LLM-generated content over human-written
fairness metrics conflict: demographic parity, equalized odds, individual fairness, counterfactual fairness, calibration—satisfying one may violate another.
honest caveat: most ethical guidelines are principles-based; translation to concrete requirements remains organization-dependent. compliance with frameworks does not guarantee ethical outcomes.
22. MEMORY AND PERSONALIZATION
advanced patterns for agent state.
memory architectures [memory-architectures.md]
MemGPT paradigm:
context window = RAM, external storage = disk
function calls for memory operations (append, replace, search)
LLM itself decides when to execute memory operations
control flow details: function executor manages tool dispatch, queue manager handles pending operations
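a sketch of the paradigm's memory operations exposed as tools (in-memory store, simplified; not Letta's actual API): the LLM decides when to call them, and the function executor just dispatches.

```python
ARCHIVAL: list[str] = []          # "disk": unbounded external store
WORKING_CONTEXT: list[str] = []   # "RAM": small block kept in the prompt

def memory_append(fact: str) -> str:
    ARCHIVAL.append(fact)
    return "stored"

def memory_replace(old: str, new: str) -> str:
    WORKING_CONTEXT[:] = [new if x == old else x for x in WORKING_CONTEXT]
    return "replaced"

def memory_search(query: str, limit: int = 3) -> list[str]:
    return [f for f in ARCHIVAL if query.lower() in f.lower()][:limit]

# the function executor dispatches whatever memory call the LLM emits
TOOLS = {"memory_append": memory_append,
         "memory_replace": memory_replace,
         "memory_search": memory_search}

def execute(tool_call: dict):
    return TOOLS[tool_call["name"]](**tool_call["arguments"])
```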
memory tiers:
main context (in-window): system instructions, working context, FIFO queue
sleep-time consolidation (Letta) [memory-architectures.md]: memory management runs asynchronously during idle periods—agent "dreams" to organize memories without blocking interaction.
LoCoMo benchmark findings [memory-architectures.md]: 73% gap vs humans on temporal reasoning—agents struggle with "when did X happen relative to Y" questions.
personalization [personalization.md]
the fundamental tension: effective personalization demands data users may not want to share.
differential privacy: mathematical guarantees against data extraction
privilege escalation risk: organizational agents often have broader permissions than individual users. agent's permissions become user's effective permissions.
recommendation: governance must be architectural, not procedural. "you cannot govern a system with words. prompts are not boundaries."
23. INFERENCE OPTIMIZATION
techniques for reducing latency and cost.
speculative decoding [inference-optimization.md]
draft model proposes K candidate tokens, target model validates in one pass
EAGLE-3: 1.8x-2.4x speedup using target model's hidden states
wait time reduced 30-50%, abandonment rate 40-60% lower
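a schematic of the draft-and-verify loop (greedy variant; draft_next and target_next are hypothetical single-token generators standing in for the two models): the draft proposes K tokens cheaply, the target checks them, and only the agreeing prefix is kept.

```python
def speculative_step(prefix: list[str], draft_next, target_next, k: int = 4) -> list[str]:
    """One round of greedy speculative decoding.

    draft_next(tokens) -> str : cheap draft model's next token
    target_next(tokens) -> str: expensive target model's next token
    (the real speedup comes from the target scoring all k positions in ONE pass;
    each position is shown as a separate call here for clarity)
    """
    # 1) draft proposes k candidate tokens autoregressively
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2) target verifies the proposals position by position
    accepted = []
    for tok in draft:
        verified = target_next(prefix + accepted)
        if verified == tok:
            accepted.append(tok)          # agreement: keep the cheap token
        else:
            accepted.append(verified)     # first disagreement: take target's token, stop
            break
    else:
        accepted.append(target_next(prefix + accepted))  # bonus token when all k accepted

    return prefix + accepted
```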
healthcare agents
Mount Sinai systematic review:
53 percentage points median improvement with multi-agent systems
optimal configuration: 5 agents
diminishing returns beyond 5 agents for clinical tasks
Mass General Brigham finding: <20% of implementation effort goes to AI; >80% spent on sociotechnical integration—training, workflow redesign, change management.
FDA deregulatory shift (jan 2026):
CDS software providing sole recommendation now exempt from device classification
"intended to inform" language sufficient for exemption
accelerates deployment but shifts liability to institutions
financial agents [financial-agents.md]
algorithmic trading dominance: 70-80% of market transactions now algorithmic—agents trading with agents.
robo-advisors vs agentic AI:
| dimension | robo-advisor | agentic AI |
|---|---|---|
| interaction | form-based | conversational |
| adaptation | periodic rebalance | continuous learning |
| scope | portfolio management | full financial planning |
| autonomy | rule-based | goal-driven reasoning |
Feedzai fraud detection:
62% more fraud detected
73% fewer false positives
real-time transaction scoring
systemic risk concern: coordinated agent behavior could trigger cascading effects. if multiple AI agents simultaneously sell based on similar signals, could amplify market volatility or trigger bank runs. no regulatory framework addresses agent-to-agent coordination.
27. OPERATIONAL INFRASTRUCTURE
debugging, versioning, testing, and deployment patterns.
debugging realities [debugging-practice.md]
METR study finding: developers 19% SLOWER with AI assistance but BELIEVED they were 20% faster—confidence miscalibrated.
tool calling reliability: fails 3-15% in production environments. higher for complex multi-tool sequences.
debugging techniques:
verification over trust: check outputs, don't assume correctness
parallel runs: compare agent vs known-good baseline
"start over when context degrades": fresh context often beats debugging polluted state
the demo-to-production gap: roughly 70% of the work comes after the demo—demos hide edge cases, adversarial inputs, and integration complexity.
reproducibility challenges [reproducibility.md]
LLMs are mathematically deterministic given identical weights, inputs, and decoding parameters. non-determinism arises primarily from infrastructure and agent-level factors:
infrastructure non-determinism: server-side batching changes the floating-point reduction order between otherwise identical requests; batch-invariant kernels eliminate this but at 1.5-2× performance cost
Thinking Machines tested Qwen 2.5B with 1,000 completions at temperature zero: before the fix = 80 unique responses, after = all 1,000 identical [reproducibility.md]
agent-level non-determinism:
tool execution order (parallel tools may run in different sequence)
timing dependencies (real-time data queries, system clocks)
external state (databases, APIs mutate between runs)
context accumulation (small early variations amplify)
reproducibility techniques:
semantic caching: reduces API calls by up to 69% while maintaining ≥97% accuracy on cache hits
deterministic replay: trace capture with time warping for clock virtualization
golden file testing: captured traces as frozen behavioral baselines
"debugging agent systems is fundamentally harder than debugging traditional software. logs, metrics, and traces show you what happened, but they cannot reconstruct why it happened." [reproducibility.md]
long-running agents
agents operating over hours/days/weeks require explicit continuity engineering.
anthropic's two-agent pattern:
initializer agent: creates init.sh, generates feature list (200+ features), establishes progress.txt, makes initial git commit
coding agent: reads progress + git logs, runs health check, works on one feature at a time, commits with descriptive messages, updates progress before session ends
pass@k: probability at least one of k trials succeeds
pass^k: probability ALL k trials succeed
at k=10, 75% per-trial agent: pass@k→100%, pass^k→6%
AIVAT variance reduction: 85% reduction in variance, requires 44× fewer trials for same statistical power.
AgentA/B (LLM agents as simulated participants): matched direction of human preferences but not magnitude. useful for ranking, unreliable for effect size estimation.
database architecture [agent-databases.md]
knowledge graph advantage: 2.8× accuracy vs pure vector search for complex queries requiring relationship traversal.
"agentic databases" concept: databases with agent-first interfaces—built-in memory primitives, natural language query layers, automatic schema inference.
recommended stack by use case:
| use case | stack |
|---|---|
| semantic search | vector DB (pinecone, qdrant) |
| relationship queries | graph DB (neo4j, memgraph) |
| structured data | relational (postgres) |
| complex queries | hybrid: vector + graph + relational |
multi-tenancy [multi-tenant.md]
isolation patterns:
database-level: separate schemas or databases per tenant
application-level: tenant ID filtering in queries
encryption: per-tenant keys
vector DB: namespace isolation
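a sketch of application-level isolation (the store is an illustrative in-memory stand-in): every read and write goes through a tenant-scoped wrapper, so a forgotten filter can't leak another tenant's rows.

```python
class TenantScopedStore:
    """All access is forced through a tenant_id filter; callers never touch raw rows."""

    def __init__(self, rows: list[dict]):
        self._rows = rows                       # each row carries a tenant_id

    def for_tenant(self, tenant_id: str) -> "TenantView":
        return TenantView(self, tenant_id)

class TenantView:
    def __init__(self, store: TenantScopedStore, tenant_id: str):
        self._store, self._tenant_id = store, tenant_id

    def query(self, **filters) -> list[dict]:
        return [r for r in self._store._rows
                if r["tenant_id"] == self._tenant_id
                and all(r.get(k) == v for k, v in filters.items())]

    def insert(self, row: dict) -> None:
        # stamp the tenant server-side; never trust the caller to set it
        self._store._rows.append({**row, "tenant_id": self._tenant_id})

store = TenantScopedStore(rows=[])
store.for_tenant("acme").insert({"doc": "q3 plan"})
assert store.for_tenant("globex").query() == []   # other tenants see nothing
```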
cost allocation challenge: output tokens 3-8× more expensive than input. agents generate unpredictable output volumes.
agentic team model (emerging): 2-5 humans supervising 50-100 agents. ratio expected to increase.
CI/CD breaks:
agents violate deterministic output assumptions
agents use unknown resources (discover new tools/files)
single-actor auth model doesn't fit multi-agent scenarios
governance observation: "governance can't be retrofitted"—must be designed in from start.
28. MOBILE AND EDGE AGENTS [mobile-edge-agents.md]
on-device LLM inference and hybrid cloud-edge architectures.
on-device inference reality
inference frameworks: llama.cpp/ggml (de facto standard for CPU inference), mlc-llm (GPU acceleration via TVM), executorch (Meta's pytorch-native mobile).
mobile model performance (2025 data, iPhone 15 Pro / Pixel 8 Pro class):
| model | time-to-first-token | generation speed |
|---|---|---|
| TinyLlama 1.1B Q4 | 0.3-0.5s | 25-40 tok/s |
| Phi-2 2.7B Q4 | 0.8-1.2s | 12-20 tok/s |
| Llama 3.2 1B Q4 | 0.4-0.7s | 20-35 tok/s |
| Mistral 7B Q4 | 2-4s | 5-10 tok/s |
fundamental constraint: on-device LLM is memory-bandwidth bound, not compute bound. mobile DRAM (50-100 GB/s) is 10-20× lower than server GPUs (A100: 2TB/s). neural accelerators help prefill (3.5-4× speedup) but only 19-27% improvement in decode speed [mobile-edge-agents.md].
power and thermal constraints:
Xiaomi 15 Pro: 6% drain per 15 min conversation at 9.9W
iPhone 12: 25% drain per 15 min at 7.9W
continuous use would drain a typical phone in 2-4 hours
thermal throttling reduces throughput 30-50% after 5-10 minutes of continuous use.
hybrid cloud-edge architectures
speculative edge-cloud decoding [Venkatesha et al., 2025]: small draft model on edge, large target model on cloud. 35% latency reduction vs cloud-only, plus 11% from preemptive drafting.
routing policies (see the sketch below):
latency-adaptive: if network RTT > threshold, use local regardless
battery-aware: at low battery, route to cloud (network may consume less energy than local inference for complex queries)
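a sketch of those routing policies as a single decision function; the thresholds are illustrative, not benchmarked values.

```python
def route(query_complexity: float, rtt_ms: float, battery_pct: float) -> str:
    """Decide where to run inference: 'edge' (on-device) or 'cloud'."""
    if rtt_ms > 300:                 # latency-adaptive: poor network -> stay local
        return "edge"
    if battery_pct < 20:             # battery-aware: offload heavy work when low
        return "cloud"
    # otherwise, small/bounded tasks stay on-device, complex ones go to cloud
    return "edge" if query_complexity < 0.5 else "cloud"

print(route(query_complexity=0.8, rtt_ms=80, battery_pct=60))   # cloud
print(route(query_complexity=0.8, rtt_ms=500, battery_pct=60))  # edge
```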
mobile agent recommendations
design for specific, bounded tasks—don't attempt general-purpose assistants on-device
implement graceful degradation—escalate when local confidence is low
measure power and thermal impact—budget 50-100% more battery than prototype suggests
build offline-first, then add cloud—disconnected operation as base case
29. AGENT-TO-AGENT COMMUNICATION [agent-communication.md]
how agents communicate: message formats, passing patterns, coordination.
message passing fundamentals
FIPA ACL legacy: ~20 performatives (inform, request, propose, etc.) but required shared ontologies—interoperability broke down when agents used different knowledge representations.
modern LLM-era approach: simpler JSON structures optimized for LLM interpretation. LLM agents can interpret natural language content without formal ontologies—semantic interoperability via foundation model understanding.
shared memory vs message passing
| approach | coupling | consistency | scalability | debugging |
|---|---|---|---|---|
| shared memory | tight | strong (if synchronized) | limited | easier |
| message passing | loose | eventual | high | harder |
A2A's philosophy: deliberately "opaque"—agents collaborate without exposing internal state. the only interface is the protocol, not shared memory. preserves intellectual property and security [agent-communication.md].
discovery patterns
DNS-based: agents publish SRV/TXT records. domain ownership provides baseline trust.
well-known URLs: /.well-known/agent.json for decentralized discovery
MCP dynamic discovery: runtime tool enumeration via list_tools
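a sketch of well-known-URL discovery (example.com is a placeholder; the card fields shown are illustrative, not the full A2A agent-card schema):

```python
import json
import urllib.request

def fetch_agent_card(domain: str) -> dict:
    """Fetch a domain's agent descriptor from its well-known location."""
    url = f"https://{domain}/.well-known/agent.json"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

card = fetch_agent_card("example.com")             # placeholder domain
print(card.get("name"), card.get("capabilities"))  # treat claims as untrusted input
```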
sycophancy: agents reinforce each other rather than critically engaging. CONSENSAGENT addresses via trigger-based detection [agent-communication.md].
security tradeoff: defenses against prompt worms reduce collaboration capability. "vaccination" approaches insert fake memories of handling malicious input—increases robustness but decreases helpfulness [arxiv:2502.19145].
key insight: trading willingness to collaborate with refusal to do harm is a core tension. security measures that make agents more suspicious also make them less effective collaborators [agent-communication.md].
registry scaling: central registries hit walls around 1,000 agents. 90% of networks stall between 1,000-10,000 agents due to coordination infrastructure failures [agent-communication.md].
25. UPDATED RECOMMENDATIONS FOR AXI-AGENT
incorporating infrastructure, verticals, operations, and open problems.
mobile and edge (new)
on-device is memory-bandwidth bound — neural accelerators help prefill but not decode [mobile-edge-agents.md]
budget 50-100% more battery — power consumption exceeds prototype testing expectations [mobile-edge-agents.md]
build offline-first — disconnected operation as base case, cloud as enhancement [mobile-edge-agents.md]
agent communication (new)
use A2A for inter-agent — emerging standard with 50+ partners [agent-communication.md]
security vs collaboration tradeoff — defenses against prompt worms reduce collaboration capability [agent-communication.md]
expect registry scaling walls — 90% of networks stall between 1,000-10,000 agents [agent-communication.md]
composability (new)
start monolithic, decompose when justified — composition overhead often exceeds specialization benefits; multi-agent uses ~15× more tokens than single-agent [composability.md]
microservices patterns transfer — EDA, circuit breakers, saga, sidecar patterns apply; 20 years of distributed systems learning is relevant [composability.md]
caching (new)
implement semantic caching for repetitive queries — 40-60% API cost reduction for FAQ-style applications [caching-strategies.md]
plan caching for agentic workflows — 46.62% serving cost reduction while maintaining 96.67% accuracy [caching-strategies.md]
layer caching strategies — exact-match → semantic → tool result → LLM inference; progressively more expensive [caching-strategies.md]
version everything in cache keys — embedding model version, prompt version; invalidate on model/prompt updates [caching-strategies.md]
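a sketch of that cache-key discipline: every input that changes behavior (embedding model version, prompt version, normalized query) goes into the key, so bumping either version invalidates stale entries automatically. the names are illustrative.

```python
import hashlib

EMBEDDING_MODEL = "text-embed-v3"      # illustrative version identifiers
PROMPT_VERSION = "triage@2025-06-01"

def cache_key(query: str) -> str:
    normalized = " ".join(query.lower().split())
    material = "|".join([EMBEDDING_MODEL, PROMPT_VERSION, normalized])
    return hashlib.sha256(material.encode()).hexdigest()

# bumping PROMPT_VERSION changes every key -> old cached answers stop matching
print(cache_key("What is our refund policy?"))
```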
capability discovery (new)
implement lazy tool loading — static loading of 73+ tools consumes 54% of context before any conversation [capability-discovery.md]
invest in skill/tool descriptions — primary discovery surface for both MCP and A2A; richer descriptions → better matching [capability-discovery.md]
treat capability claims as untrusted — discovery tells you what agents claim; implement verification for high-stakes capabilities [capability-discovery.md]
investigation into the hypothesis that agent skills divide into procedural (~500-1000 tokens) and methodological (~1500-2000 tokens) archetypes, with different optimal token budgets and different roles for examples.
HUNCH: the 2000-token ceiling should flex based on task type
QUESTION: whether our skill taxonomy maps cleanly to tool vs prompt distinction in literature
1. evidence SUPPORTING the archetype hypothesis
1.1 context length degrades reasoning independent of retrieval
source: du et al. (2025), "context length alone hurts LLM performance"
"even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%–85%) as input length increases"
implication for skills: shorter procedural skills should outperform longer methodological skills on pure execution, all else equal. supports keeping procedural skills lean.
"model performance varies significantly as input length changes, even on simple tasks... models do not use their context uniformly"
key finding: degradation is NON-UNIFORM. structured content (step-by-step procedures) may be more resilient than freeform reasoning content. models performed better on shuffled haystacks than logically structured ones.
"good context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome"
"minimal does not necessarily mean short; you still need to give the agent sufficient information up front"
confidence: VERIFIED — first-party guidance
implication: procedural skills should be as short as possible; methodological skills earn length ONLY if every token is load-bearing. this DIRECTLY supports archetype distinction.
"tools should ideally perform a single, precise, and atomic operation... atomic, single-purpose tools significantly decrease ambiguity"
"Keep it short—under 1024 characters" [for tool descriptions]
confidence: VERIFIED — based on production error analysis ("10x drop in failures")
implication: procedural skills (which function like tools) benefit from brevity. methodological skills (which function like frameworks) operate differently.
1.5 few-shot examples load-bearing for pattern tasks
few-shot beats zero-shot by ~10% accuracy on classification tasks. performance improvement stagnates after ~20 examples. for tasks requiring "deeper contextual understanding," few-shot is essential.
"example-based prompting takes a different approach. Instead of just describing what you want, you provide one or more examples of the desired output... The AI can analyze everything from word choice to sentence structure."
confidence: VERIFIED
implication: methodological skills that teach HOW to reason/write NEED examples. procedural skills that specify WHAT to do may not.
2. evidence CHALLENGING the archetype hypothesis
2.1 over-prompting degrades even high-quality examples
source: tang et al. (2025), "the few-shot dilemma: over-prompting LLMs" — arxiv:2509.13196
"incorporating excessive domain-specific examples into prompts can paradoxically degrade performance... contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs"
smaller models (< 8B params) show declining performance past optimal example count. larger models (DeepSeek-V3, GPT-4o) maintain stability when over-prompted.
confidence: VERIFIED
challenge to hypothesis: methodological skills with many examples may HURT smaller models. the "examples are load-bearing" claim needs the qualifier: "up to a point."
2.2 tool description quality trumps skill length
source: langchain benchmarking (2024), anthropic SWE-bench work
"we actually spent more time optimizing our tools than the overall prompt" — anthropic
"poor tool descriptions → poor tool selection regardless of model capability" — langchain
confidence: VERIFIED
challenge to hypothesis: for tool-like skills (procedural), CLARITY matters more than LENGTH. a 500-token procedural skill with bad descriptions may underperform a 1500-token one with good descriptions.
2.3 heuristic prompts match few-shot without examples
"heuristic prompts achieved higher accuracy than few-shot prompting for clinical sense disambiguation and medication attribute extraction"
heuristic prompts = rule-based reasoning embedded in prompt. for some tasks, well-crafted zero-shot instructions outperform examples.
confidence: VERIFIED (peer-reviewed)
challenge to hypothesis: even "methodological" tasks may not require examples if the instructions are precise enough. the procedural/methodological split may be less about LENGTH and more about INSTRUCTION QUALITY.
2.4 the "lost in the middle" problem affects long skills
source: liu et al. (2023), "lost in the middle"
"performance highest when relevant information at beginning or end of input... significant degradation when relevant info in the middle of long contexts"
confidence: VERIFIED
challenge to hypothesis: methodological skills with examples in the middle may suffer. structure matters as much as length.
3. alternative framings from literature
3.1 tools vs prompts distinction (reddit/industry)
source: r/AI_Agents discussion, "agent 'skills' vs 'tools'"
"Anthropic separates executable MCP tools from prompt-based Agent Skills. OpenAI treats everything as tools/functions. LangChain collapses the distinction entirely."
"from the model's perspective, these abstractions largely disappear. Everything is presented as a callable option with a description."
implication: our procedural/methodological split may map to the tool/skill distinction:
procedural skills → could be tools (atomic, executable)
methodological skills → must be prompts (modify reasoning, not execute)
3.2 microsoft's tools vs agents distinction
source: microsoft azure architecture guide
"if something is repeatable and has a known output, it's a tool. if it requires interpretation or judgment, it stays inside the agent"
implication: procedural skills are tool-like (deterministic); methodological skills are agent-like (require judgment).
the 2000-token ceiling from du et al. is a reasonable OUTER BOUND for all skills, given 13-85% degradation at longer lengths. but procedural skills should aim for half that.
5. confidence labels
| claim | confidence | evidence |
|---|---|---|
| context length degrades performance | VERIFIED | du et al., chroma, multiple sources |
| shorter is better for procedural skills | VERIFIED | composio, anthropic |
| examples load-bearing for style/pattern tasks | VERIFIED | latitude, analytics vidhya |
| examples optional for rule-following tasks | VERIFIED | sivarajkumar et al. |
| over-prompting hurts smaller models | VERIFIED | tang et al. |
| 2000 tokens is a reasonable ceiling | VERIFIED | du et al. (13-85% degradation) |
| epistemic skills are a third archetype | HUNCH | pattern observation, no direct evidence |
| procedural ≈ tools, methodological ≈ prompts | HUNCH | architecture observation |
| structure matters as much as length | VERIFIED | lost in the middle |
6. recommendations
6.1 for skill authoring
identify skill type first: is this teaching WHAT to do (procedural) or HOW to think (methodological)?
procedural skills: target 400-800 tokens. examples only for edge cases. embed constraints directly.
methodological skills: budget 1200-2000 tokens. include 2-3 canonical examples. front-load the key insight.
never exceed 2000 tokens: empirical evidence shows degradation beyond this point.
6.2 for skill review
example audit: for each example, ask "can this skill work without it?" if yes, consider removing.
compression test: summarize the skill in one sentence. if impossible, consider splitting.
structure check: put critical info at beginning and end, not middle.
6.3 for design principles doc
add:
explicit skill archetype distinction (procedural vs methodological)
different token budgets by type
guidance on when examples are required vs optional
7. sources
primary sources (peer-reviewed/first-party)
du et al. (2025). "context length alone hurts LLM performance despite perfect retrieval." EMNLP findings. arxiv
ghost skills persist at runtime — nix copies skills to ~/.config/amp/skills/ but doesn't clean up removed ones. the investigate skill was deleted from source but persisted at runtime. fix: manually delete orphans or add a nix cleanup step.
@references/ is vestigial — the agentskills.io spec uses plain relative paths (references/file.md), not @references/. the @ prefix had no semantic meaning.
cross-references should be asymmetric — when skill A documents composition with skill B, B should be authoritative. A should be pointer-only ("see B for full protocol"), not duplicate content. rounds→spar was duplicating spar's composition section.
hardcoded paths break portability — the remember skill used ~/commonplace/01_files/ everywhere. introduced a $MEMORY_ROOT env var with a default. skills intended for personal use still need parameterization for sharability.
amp skills with invalid yaml frontmatter don't load—and don't warn. the failure mode is absence: the skill simply doesn't appear in amp skills, with no indication why.
this caused a multi-agent coordination failure. the remember skill had an unquoted colon in its description (test: would a future agent...). yaml parsed test: as a key. the skill silently disappeared. agents spawned without it invented their own file naming conventions, ignoring the documented system.
the fix
build-time validation in nix. during darwin-rebuild switch, home-manager activation now parses skill frontmatter and warns on:
missing frontmatter (no --- delimiters)
unquoted colons in values
warnings print but never break the build. resiliency matters more than strictness.
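a standalone sketch of the same check outside nix (the regex and the skills path are illustrative; the real validation lives in the home-manager activation script):

```python
import re
from pathlib import Path

SKILLS_DIR = Path.home() / ".config/amp/skills"   # illustrative location

def check_frontmatter(skill_md: Path) -> list[str]:
    warnings = []
    text = skill_md.read_text()
    if not text.startswith("---"):
        return [f"{skill_md}: missing frontmatter (no --- delimiters)"]
    parts = text.split("---", 2)
    if len(parts) < 3:
        return [f"{skill_md}: unterminated frontmatter"]
    for line in parts[1].splitlines():
        # value contains a bare colon and isn't quoted -> yaml will mis-parse it
        m = re.match(r"^(\w+):\s*(?!['\"])(.*:.*)$", line)
        if m:
            warnings.append(f"{skill_md}: unquoted colon in '{m.group(1)}' value")
    return warnings

for skill in SKILLS_DIR.glob("*/SKILL.md"):
    for w in check_frontmatter(skill):
        print("warning:", w)      # warn, never fail: resiliency over strictness
```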
design lesson
silent failures compound. an agent that can't load a skill doesn't know what it's missing. it proceeds with incomplete context, makes reasonable-seeming decisions, and produces subtly wrong output. the error surfaces far from its cause.
validation should happen at the boundary where errors are cheapest to fix—in this case, when skills are authored, not when they're consumed.
related
commonplace README — file naming conventions the agent should have followed