This is an early-stage architecture blueprint for building a modern agentic harness from the ground up in 2026. It is not a product spec and not a vendor pitch. It is a practical plan for how to structure the runtime, state model, context system, tool layer, subagent orchestration, approvals, protocols, and observability so the harness stays useful as models, tools, and deployment surfaces evolve.
This blueprint synthesizes the uploaded source material with current official documentation from Anthropic, LangChain/LangGraph, Moonshot Kimi CLI, MCP, ACP, and A2A. Where the sources conflict, the plan below favors patterns that appear repeatedly across multiple production systems or formal protocols. [U1][U2][U3][U4][U5][E1][E7][E15][E18][E24][E26]
If you only remember six things, remember these:
- The harness matters more than the loop. The model-tool loop is now commodity. Differentiation comes from context engineering, durable state, policy enforcement, externalized memory, and protocol design. [U1][U2][E1][E5][E13]
- Design around cache stability first. Prompt caching is not a small optimization. It changes your entire architecture: stable prompt prefix, append-only history, fixed tool catalog per session, and state transitions modeled as messages or mode flags rather than prompt rewrites. [U2][U5][E7][E27]
- Treat the filesystem and artifact store as working memory. Large tool outputs, notes, plans, recovered state, and handoffs should live outside the model context and be referenced by handles. This is the only reliable way to scale beyond short tasks without drowning the model in its own history. [U1][U2][U4][U5][E1][E3][E14]
- Keep the built-in action space small and stable. Start with a compact set of high-leverage primitives: file ops, search/read, code execution or shell, planning/tasks, subagent delegation, and structured user elicitation. Add new tools only when they improve control, guardrails, concurrency, observability, or UX. [U1][U2][U5][E17]
- Use subagents for context isolation, not because "multi-agent" sounds advanced. Start with a single agent. Add subagents only when you need parallel exploration, specialized prompts/tooling, separate context windows, or explicit ownership boundaries. [U1][U2][U4][E6][E14][E15]
- Put guardrails in the runtime, not the prompt. The model should never be the only enforcement layer. Destructive tools, secret access, network egress, and external writes need deterministic policy checks, approval gates, sandbox restrictions, and audit trails. [U2][U4][U5][E12][E17][E25]
| Area | Recommended default | Why |
|---|---|---|
| Core runtime | Durable state machine / graph runtime with checkpointing | Long-running work needs pause/resume, replay, fault tolerance, and human interruption support. [E1][E5] |
| Session history | Append-only event log plus typed state snapshot | Best for cache stability, replay, auditability, and deterministic recovery. [U3][U4][E7] |
| Working memory | Artifact-first filesystem plus metadata store | Offloads context, preserves recoverability, supports handoffs and resumption. [U1][U4][E1][E3] |
| Built-in tools | Narrow, namespaced primitives | Small stable action space improves selection quality and keeps cache-friendly prefixes. [U1][U2][U5][E17] |
| Large data handling | Programmatic tool calling or sandboxed code execution | Keeps intermediate data out of the model context and reduces round trips. [U2][U3][U4][U5][E10][E11] |
| Planning | Task graph or structured todo primitive | Acts as attention control and coordination state, not as a workflow engine by itself. [U2][U5][E1][E16] |
| Multi-agent | Orchestrator-worker subagents by default | Strongest payoff for context isolation and parallel work with manageable complexity. [U1][U2][E6][E15] |
| Protocols | MCP for tools, ACP for IDE/client surfaces, A2A for remote agent-to-agent delegation | Clean separation of concerns and future-proof interoperability. [U3][U4][E18][E24][E25][E26] |
| Human collaboration | Structured question tool plus approval policies | Faster, more deterministic than plain-text back-and-forth. [U2][U4][U5][E12][E19] |
| Security | Sandbox + policy engine + audit log | Practical trust boundary for real tool use. [U2][U4][E3][E12][E25] |
A modern harness should treat the LLM as the control plane for reasoning and planning, while the rest of the system handles state, execution, storage, approvals, transport, and observability. This is a neuro-symbolic split in practice, not in theory. The more you can move determinism, memory, and policy into the harness, the more reliable the overall system becomes. [U3][E1][E5][E13]
The harness should therefore be built around five stable layers:
- Execution runtime - the event loop, session manager, checkpointing, and recovery.
- Context system - prompt layout, artifact references, compaction, and cache discipline.
- Capability surface - built-in tools, external tools, skills, and subagents.
- Governance layer - approvals, hooks, allow/deny policy, sandboxing, provenance.
- Surface/protocol adapters - CLI, IDE, web UI, ACP, MCP, and optionally A2A.
```mermaid
flowchart LR
    U[User or Calling System] --> S[Surface Layer: CLI / IDE / Web / API]
    S --> A[Client Adapter / ACP / REST]
    A --> R[Agent Runtime]
    R --> ST[Session State + Checkpoints]
    R --> PL[Plan / Task Graph]
    R --> EV[Typed Event Bus]
    R --> PO[Policy Engine + Approvals + Hooks]
    R --> TR[Tool Router]
    TR --> BT[Built-in Tools]
    TR --> MCP[MCP Connector]
    TR --> SX[Sandbox / Code Execution / PTC]
    TR --> FS[Artifact Store / Virtual Filesystem]
    R --> SG[Subagent Manager]
    SG --> R
    R --> MEM[Long-term Memory / AGENTS.md / Conventions]
    R --> A2A[A2A Adapter for Remote Agents]
    EV --> OBS[Tracing / Metrics / Replay / Eval Harness]
```
- The runtime owns the loop, state, and control.
- The tool router is the gateway to all side effects.
- The artifact store / virtual filesystem is the main external memory substrate.
- The subagent manager is a specialization mechanism and a context pressure valve.
- The policy engine is the governance boundary.
- The surface layer is replaceable; the engine should not depend on a specific UI. Kimi's Wire-style decoupling and ACP both reinforce this design. [U1][E18][E23][E24]
Your runtime should manage:
- session creation and resume
- append-only message/event history
- deterministic state snapshots
- step execution and retry policy
- compaction and fresh-window restarts
- human pause/resume
- subagent spawning and result capture
- cancellation and timeout handling
- audit and replay
This is exactly why LangGraph emphasizes durable execution and persistence through checkpoints, and why Kimi CLI persists not just the conversation but also approvals, dynamic subagents, and added workspace directories across resume. [U1][U4][E1][E5][E20]
A session is the resumable unit for a conversation or job.
```json
{
  "session_id": "sess_...",
  "thread_id": "thread_...",
  "created_at": "2026-03-01T12:00:00Z",
  "mode": "execute",
  "model_profile": "orchestrator-default",
  "tool_catalog_version": "v1",
  "approval_mode": "ask",
  "context_state": {
    "compacted": false,
    "recent_summary_ref": "artifact://..."
  }
}
```

A task is the coordination object.
```json
{
  "task_id": "task_...",
  "title": "Draft migration plan",
  "status": "in_progress",
  "owner": "main-agent",
  "dependencies": ["task_1", "task_2"],
  "blockers": [],
  "artifact_refs": ["artifact://plan.md"],
  "updated_at": "2026-03-01T12:10:00Z"
}
```

Artifacts are durable outputs and large intermediate objects.
```json
{
  "artifact_id": "artifact_...",
  "uri": "artifact://reports/search-results-01.json",
  "mime_type": "application/json",
  "summary": "Search results for Azure pricing pages",
  "sha256": "....",
  "source": {
    "tool": "web.search",
    "session_id": "sess_...",
    "step_id": "step_..."
  }
}
```

A step is a single model decision or tool execution event.
```json
{
  "step_id": "step_...",
  "kind": "tool_call",
  "tool_name": "fs.read",
  "status": "completed",
  "started_at": "2026-03-01T12:05:00Z",
  "ended_at": "2026-03-01T12:05:01Z",
  "artifact_refs": [],
  "error": null
}
```

- Checkpoint before and after external side effects. LangGraph's checkpointers and Kimi's per-step session persistence both support this general rule. [E5][E20]
- Make replay deterministic. Wrap side effects in tasks or tool execution envelopes so replay does not accidentally re-run destructive actions. LangGraph explicitly recommends deterministic/idempotent design for durable execution. [E5]
- Store both typed state and event history. State gives you fast resume; event history gives you replay, analytics, and debugging.
- Use cancellation as a first-class primitive. ACP and A2A both model cancellation/interrupt flows, so your internal runtime should too. [E24][E26]
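These rules can be sketched as a thin execution envelope around side effects. `Checkpointer` and its in-memory store here are illustrative stand-ins, not a LangGraph or Kimi API:

```python
import hashlib
import json

class Checkpointer:
    """Illustrative in-memory checkpoint store keyed by step id."""
    def __init__(self):
        self.completed = {}  # step_id -> recorded outcome

    def run_step(self, step_id, tool_call, execute):
        # Replay safety: a completed step returns its recorded result
        # instead of re-running the side effect.
        if step_id in self.completed:
            return self.completed[step_id]
        # Checkpoint before the side effect: record intent as an input hash.
        intent = hashlib.sha256(
            json.dumps(tool_call, sort_keys=True).encode()
        ).hexdigest()
        result = execute(tool_call)  # the only place side effects happen
        # Checkpoint after: persist the outcome so replay is deterministic.
        self.completed[step_id] = {"intent": intent, "result": result}
        return self.completed[step_id]

sent = []
def send_email(call):
    sent.append(call)  # stand-in for an external side effect
    return "sent"

cp = Checkpointer()
first = cp.run_step("step_1", {"to": "ops"}, send_email)
replay = cp.run_step("step_1", {"to": "ops"}, send_email)  # no second send
```

Durable stores would persist `completed` to disk or a database; the idempotency property is what matters, not the storage backend.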
Treat context as a scarce, actively managed resource. Context engineering is not "prompt polish"; it is the main systems problem in long-running agents. [U2][U3][U4][U5][E14]
The most cache-friendly ordering is:
- static system prompt and always-present tool stubs
- project memory / AGENTS.md / conventions
- session-level state summary
- recent messages and tool results
- latest user turn
Anthropic's prompt caching docs are explicit: the cache hierarchy is tools -> system -> messages, and changes to tools invalidate the whole cache below that point. [U2][U5][E7]
- Do not add or remove tools mid-session unless you are willing to lose cache locality. Prefer deferred loading or masks. [U2][U5][E7][E9]
- Do not switch models mid-session for trivial reasons. If you need a different model, spawn a subagent or fresh worker session. [U2][U5]
- Do not rewrite the system prompt for dynamic state changes. Send reminders or state updates as messages. [U2][U5][E7]
- Keep serialization deterministic. Even small ordering changes can break cache reuse. [U2][U5]
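The ordering and don'ts above can be made mechanical with a single request-assembly function. The `cache_control` payload shape mirrors Anthropic's prompt-caching API; the helper itself is an illustrative sketch, not a library function:

```python
def build_request(tool_catalog, system_prompt, project_memory, history, user_turn):
    """Assemble a request in the tools -> system -> messages order."""
    tools = [dict(t) for t in tool_catalog]  # fixed catalog, never mutated mid-session
    tools[-1]["cache_control"] = {"type": "ephemeral"}  # breakpoint after tools
    system = [
        {"type": "text", "text": system_prompt},
        {"type": "text", "text": project_memory,
         "cache_control": {"type": "ephemeral"}},  # breakpoint after static prefix
    ]
    # History is append-only: new turns go on the end; old turns are never edited.
    messages = history + [{"role": "user", "content": user_turn}]
    return {"tools": tools, "system": system, "messages": messages}

req = build_request(
    [{"name": "fs.read"}, {"name": "fs.write"}],
    "You are the harness orchestrator.",
    "AGENTS.md conventions go here.",
    [{"role": "user", "content": "hi"}],
    "List the repo files.",
)
```

Because everything before the last breakpoint is byte-identical across turns, each call after the first reuses the cached prefix.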
The best synthesis across the sources is a four-level policy:
Tool results should already be concise, typed, and artifact-backed. This reduces the need for later cleanup. Anthropic's tool-writing guidance strongly favors returning meaningful but token-efficient context. [E17]
If a tool returns a large object, write the full result to an artifact and return only:
- a short summary
- the first few relevant lines or rows
- a stable artifact handle
- metadata (size, type, location, provenance)
This is how Deep Agents handles large results conceptually, and it should be your baseline pattern. [U1][U2][E1]
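A minimal eviction sketch, with the threshold and envelope fields as assumptions to tune per deployment:

```python
import hashlib
import pathlib
import tempfile

EVICT_THRESHOLD = 8_000  # characters as a rough token proxy; tune per model

def envelope(tool_name, result_text, artifact_dir):
    """Evict large results to an artifact file and hand the model only a
    summary, a short preview, and a stable handle. Illustrative sketch."""
    if len(result_text) <= EVICT_THRESHOLD:
        return {"status": "ok", "inline": result_text}
    digest = hashlib.sha256(result_text.encode()).hexdigest()[:16]
    path = pathlib.Path(artifact_dir) / f"{tool_name}-{digest}.txt"
    path.write_text(result_text)
    return {
        "status": "ok",
        "summary": f"{tool_name} returned {len(result_text)} chars; full result evicted",
        "preview": result_text.splitlines()[:3],
        "artifact_refs": [f"artifact://{path.name}"],
        "metadata": {"bytes": len(result_text.encode()), "mime_type": "text/plain"},
    }

with tempfile.TemporaryDirectory() as d:
    small = envelope("web.fetch", "short result", d)
    big = envelope("web.fetch", "line\n" * 5_000, d)
```

The content-hash in the filename gives a stable, deduplicatable handle that survives replays.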
When context approaches roughly 80-90% of the safe usable window, rewrite old bulky tool inputs or old results into references. Kimi exposes reserved_context_size and triggers auto-compaction before hard failure; Deep Agents uses proactive summarization middleware. [U1][U4][E18][E20]
When the window is still too full, summarize older history into:
- current goal
- state achieved so far
- open tasks
- key decisions and assumptions
- artifact refs
- next recommended step
Anthropic now recommends server-side compaction for long-running workflows where available. [E8][E27]
When state has been externalized well, a brand new context window can be better than compaction. Anthropic explicitly recommends considering fresh restarts when the model can rediscover state from files, tests, progress notes, and git history. [E30][E16]
Use compaction when:
- the interaction is conversational and continuity matters
- the task depends on subtle negotiation or conversational nuance
- the model must preserve recent reasoning state inline
Use fresh-window restart when:
- the task has strong externalized state in files/artifacts
- work has clear milestone checkpoints
- the model can cheaply reconstruct state from `AGENTS.md`, `progress.txt`, `tests.json`, git history, or task objects [E16][E30]
Long tasks drift. Use explicit progress recitation:
- rewrite plan/task state regularly
- keep "what matters now" near the end of context
- maintain separate structured task state and freeform progress notes
Anthropic's long-running harness guidance found that structured feature/test files plus progress logs improve continuity across context windows; Manus-inspired notes in your uploaded docs make the same point through the todo.md recitation pattern. [U2][U5][E16][E30]
Do not hide failures from the model. Keep failed actions, stack traces, and rejection reasons in the trace unless they are sensitive and must be redacted. This is one of the highest leverage patterns in the uploaded materials because it turns the agent's own mistakes into in-session learning signal. [U2][U5]
Repetitive action-observation traces can accidentally become few-shot examples that the model blindly mimics. If the system processes many similar items, vary serialization templates slightly and break repetitive rhythms when safe. This is an underappreciated pattern from the user materials that is worth testing in evals. [U2][U5]
Make the filesystem or virtual artifact store the default working-memory substrate for the harness. This should not be an optional afterthought. [U1][U2][U4][E1][E3]
It solves four different problems at once:
- context overflow - large content moves out of the prompt
- recoverability - old information stays addressable
- handoffs - subagents and resumed sessions can share state through files/artifacts
- human inspectability - operators can inspect what the agent actually saw or produced
Deep Agents exposes a pluggable filesystem surface with backends for in-state memory, local disk, durable store, and sandboxes. That is exactly the kind of abstraction boundary you want. [E1][E3]
Use three memory layers:
- recent messages
- active plan
- recent tool outcomes
- approval state
- subagent registry
This should live in your checkpointed runtime state.
- large tool outputs
- cached fetch results
- scratch data
- notes
- progress logs
- structured test/status files
This should live in the filesystem or object store.
- persistent project conventions
- reusable operator notes
- AGENTS.md or equivalent
- recurring policies
- stable user or org preferences
Deep Agents supports AGENTS.md-style memory files and store-backed memory; Kimi persists session-specific state; Anthropic's docs increasingly assume the filesystem is a first-class rediscovery layer. [E2][E3][E20][E30]
At minimum, standardize these:
- `AGENTS.md` - durable project conventions and working agreements
- `progress.txt` or `progress.md` - human-readable progress log
- `tasks.json` - structured task graph
- `artifacts/` - large intermediate outputs
- `plans/` - plan drafts and checkpoints
- `reports/` - final or semi-final deliverables
- `run/` - ephemeral execution outputs if you want explicit scratch space
Do not stuff entire codebases or knowledge bases into the prompt. Give the agent strong search/read capabilities and let it build context just in time. This is a repeating lesson in your uploaded materials and aligns with Anthropic's recent context-engineering guidance. [U2][U5][E14][E30]
Start with a deliberately small, namespaced core set:
- `fs.list`
- `fs.read`
- `fs.write`
- `fs.edit`
- `fs.search` (glob/grep or both)
- `exec.run` or `code.run`
- `task.set` / `task.update`
- `agent.delegate`
- `user.ask`
- `web.search`
- `web.fetch`
You do not need 50 first-party tools to start. Deep Agents and Kimi both converge on a compact, high-leverage set, and Anthropic's tooling guidance reinforces a high bar for tool creation. [U1][U2][U5][E1][E18][E19][E17]
Promote an action out of shell/code execution only if at least one of these is true:
- UX needs special rendering. Example: structured questions should become a UI panel, not free text. [U2][U5][E12][E19]
- Guardrails need a deterministic checkpoint. Example: file edits may require stale-read checks or path validation. [U2][E17]
- Concurrency or transaction semantics matter. Example: read-only actions can parallelize; writes should serialize. [U2]
- Observability needs a clean event boundary. Example: you want exact metrics for search calls or DB writes. [U2][E17]
- Approval policy differs by action class. Example: `exec.run` may require review while `fs.read` does not. [U2][E12]
If none of those apply, consider leaving the action in the shell/code sandbox.
Keep the 3-5 most common tools loaded and stable. Anthropic's tool search docs recommend keeping the most frequently used tools non-deferred. [E9]
All rarely used or verbose tool schemas should be discoverable but not loaded by default. Anthropic's tool search and Claude Code MCP docs show the right pattern: stable stubs plus deferred loading. [E9][E18]
Use namespaces and families:
- `fs.*`
- `web.*`
- `db.*`
- `exec.*`
- `task.*`
- `agent.*`
- `user.*`
This helps humans, logs, policy rules, and model-side action masking.
Every tool result should come back in a consistent envelope:
```json
{
  "status": "ok",
  "summary": "2 matching files found",
  "structured": { "matches": 2 },
  "artifact_refs": ["artifact://search/matches.json"],
  "preview": ["src/app.py:12", "tests/test_app.py:48"]
}
```

This is the single best way to keep tool outputs readable, cache-friendly, and observable.
Anthropic's tool-writing guidance is worth adopting directly:
- choose the right tools to implement
- use namespacing
- return meaningful context
- optimize for token efficiency
- prompt-engineer tool descriptions/specs
- run real evaluations for tool quality, not just unit tests [E17]
Treat code execution as a default capability, not a niche feature for "coding agents". In modern agent systems, code is the best intermediate representation for filtering, joining, transforming, validating, and compressing data before it ever reaches the model context. [U2][U3][U4][U5][E10][E11]
Use PTC or equivalent self-managed sandbox orchestration when the agent needs to:
- loop over many entities
- transform large result sets
- batch many API/tool calls
- validate tabular or numeric output
- filter web search/fetch results
- early-terminate on condition checks
- build structured artifacts from noisy data
Anthropic's programmatic tool calling docs are explicit that the main advantage is keeping intermediate results inside the execution container instead of sending them back through the model at every step. [E10]
```mermaid
sequenceDiagram
    participant Runtime
    participant Model
    participant Sandbox
    participant ToolHandler
    Runtime->>Model: prompt + callable tool specs
    Model-->>Runtime: code execution request
    Runtime->>Sandbox: run generated code
    Sandbox->>ToolHandler: typed tool invocation
    ToolHandler-->>Sandbox: tool result
    Sandbox->>ToolHandler: next invocation if needed
    ToolHandler-->>Sandbox: result
    Sandbox-->>Runtime: final stdout/result + artifact refs
    Runtime->>Model: compact final result
```
Anthropic's docs outline three general implementation modes: client-side execution, self-managed sandboxed execution, and managed execution. The managed option is easy, but self-managed execution gives you stronger control over network policy, data retention, package policy, and compliance. [E10]
Recommendation for your own harness:
Build an abstraction that can support both:
- Managed provider PTC when you want convenience or benchmark leverage
- Self-managed sandbox when you need enterprise control
Your sandbox should support:
- strict filesystem root
- CPU / memory / wall-clock limits
- optional no-network mode
- explicit egress allowlists
- package install policy
- per-run provenance
- container/session reuse policy
- artifact mounting
Anthropic's managed PTC uses session-scoped containers with idle expiration; do not hard-code their TTL, but do copy the idea of session-scoped containers with explicit reuse handles. [E10]
If you use provider-managed code execution, verify retention and privacy rules. Anthropic explicitly notes that code execution and PTC are not covered by ZDR arrangements. [E10][E11]
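A policy object makes these sandbox requirements concrete. The field names below are assumptions for illustration, not a specific sandbox product's configuration schema:

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

@dataclass
class SandboxPolicy:
    """Illustrative sandbox policy; field names are assumptions."""
    fs_root: str
    cpu_seconds: int = 60
    memory_mb: int = 512
    wall_clock_seconds: int = 300
    network: bool = False                      # egress off by default
    egress_allowlist: set = field(default_factory=set)
    allow_package_install: bool = False

    def egress_allowed(self, url: str) -> bool:
        # Explicit allowlist check; no network means no egress at all.
        if not self.network:
            return False
        return urlparse(url).hostname in self.egress_allowlist

policy = SandboxPolicy(fs_root="/work", network=True,
                       egress_allowlist={"pypi.org", "api.example.com"})
```

The real enforcement lives in the container runtime and network layer; the policy object is the single declarative source those layers read from.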
The plan object is not there because the harness cannot sequence steps in code. It is there because it helps the model stay oriented, and later it helps multiple agents coordinate without relying on implicit conversational memory. [U1][U2][U5][E1]
Start with a lightweight task.set / task.update or write_todos-style primitive that:
- rewrites the current plan in full
- tracks statuses
- links to artifact refs
- captures blockers and next step
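A full-rewrite primitive in this style can be sketched in a few lines; the class and status markers here are illustrative, not the Deep Agents `write_todos` implementation:

```python
class TaskBoard:
    """Minimal write_todos-style primitive: each call rewrites the whole
    plan, which doubles as recitation for the model. Illustrative."""
    def __init__(self):
        self.tasks = []

    def task_set(self, tasks):
        # Full rewrite keeps the plan coherent near the end of context
        # and avoids stale partial edits.
        self.tasks = tasks
        return self.render()

    def render(self):
        mark = {"done": "x", "in_progress": "~", "pending": " "}
        return "\n".join(f"[{mark[t['status']]}] {t['title']}" for t in self.tasks)

board = TaskBoard()
out = board.task_set([
    {"title": "Draft migration plan", "status": "in_progress",
     "artifact_refs": ["artifact://plan.md"], "blockers": []},
    {"title": "Review with operator", "status": "pending",
     "artifact_refs": [], "blockers": ["Draft migration plan"]},
])
```

The rendered plan is what gets echoed back into the conversation; the structured list is what the runtime persists and checkpoints.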
Once work spans multiple sessions or subagents, upgrade to a richer task object with:
- dependencies
- blockers
- owner agent
- parent/child relationships
- artifact refs
- timestamps
- versioning
- optional human assignee
This follows the "Todos to Tasks" evolution in Claude-related materials and the general direction in your uploaded docs. [U2][U5]
- structured: `tasks.json`, `tests.json`, milestone statuses
- unstructured: `progress.txt`, operator notes, rationale logs
Anthropic's long-running harness article reports that JSON works better than Markdown for structured test and feature state because the model is less likely to rewrite it casually. [E16][E30]
Use a simple state machine:
`queued`, `ready`, `in_progress`, `blocked`, `awaiting_user`, `done`, `canceled`, `failed`
This makes approvals, subagent delegation, and resume behavior much easier to reason about.
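A small transition table makes those states enforceable in the runtime. The allowed edges below are one illustrative reading of the state set, not a mandated policy:

```python
# Illustrative transition table for a task state machine.
TRANSITIONS = {
    "queued": {"ready", "canceled"},
    "ready": {"in_progress", "canceled"},
    "in_progress": {"blocked", "awaiting_user", "done", "failed", "canceled"},
    "blocked": {"ready", "canceled"},
    "awaiting_user": {"in_progress", "canceled"},
    "failed": {"ready"},   # assumption: failed tasks may be re-queued
    "done": set(),
    "canceled": set(),
}

def transition(state, new_state):
    """Reject illegal state changes deterministically, outside the model."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Because the table lives in the harness, neither the model nor a subagent can accidentally move a `done` task back to `in_progress`.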
LangChain's 2026 guidance is exactly right here: many tasks are best handled by a single agent with good tools, and you should start there. Multi-agent systems add complexity, latency, and token cost. Anthropic's own research notes that multi-agent systems can use dramatically more tokens than chat or single-agent flows, so they need to earn their keep. [E6][E15]
Use subagents when you need one or more of the following:
- context isolation from exploratory work
- specialization by prompt or toolset
- different model/cost profile
- parallel execution on independent branches
- separate ownership or maintenance boundaries
These benefits are consistent across Deep Agents, Kimi CLI, Anthropic research, and Claude Code. [U1][U2][U4][E6][E14][E15][E28]
Your first multi-agent pattern should be orchestrator-worker, not peer-to-peer.
- main agent holds user contract and task-level state
- worker agents get narrow briefs and isolated contexts
- workers return only final outputs plus artifact refs
- workers do not share conversational history directly
Anthropic's research system, Deep Agents, and the best parts of Kimi CLI all point here. [U1][U2][U3][E14][E15]
- General-purpose worker: same model/tools as parent, used mainly for context isolation. Deep Agents and Claude Code both have this concept. [E14][E28]
- Explore / research worker: read/search-heavy, often read-only, optimized for discovery. Claude Code's built-in Explore/Plan subagents show the value of read-only researcher roles. [E28]
- Specialist worker: narrow domain or tool scope, such as `sql-analyst`, `release-engineer`, `security-auditor`.
- Fast/cheap worker: lower-cost model for focused subtasks when the orchestrator uses a premium model. [E28]
- no recursive subagent spawning by default
- allow parallel fan-out for independent tasks
- pass a concise structured brief, not raw chat history
- return summary + structured result + artifact refs
- enforce explicit tool restrictions per subagent
- cap concurrent workers and total token budget
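The brief-in, summary-out contract can be sketched as follows; `SubagentBrief` and `delegate` are hypothetical names for illustration:

```python
from dataclasses import dataclass

@dataclass
class SubagentBrief:
    """Illustrative worker brief; pass this instead of raw chat history."""
    objective: str
    inputs: list          # artifact refs, not inlined content
    allowed_tools: list
    budget_tokens: int
    return_contract: str  # what the worker must hand back

def delegate(brief, run_worker):
    # Workers see only the brief and return a summary plus artifact refs.
    result = run_worker(brief)
    if not {"summary", "artifact_refs"} <= set(result):
        raise ValueError("worker violated the return contract")
    return result

def fake_worker(brief):
    # Stand-in for a real worker loop with its own isolated context.
    return {"summary": f"done: {brief.objective}",
            "artifact_refs": ["artifact://out.md"]}

res = delegate(
    SubagentBrief(
        objective="survey pricing pages",
        inputs=["artifact://urls.json"],
        allowed_tools=["web.search", "web.fetch"],
        budget_tokens=50_000,
        return_contract="summary + artifact refs",
    ),
    fake_worker,
)
```

Enforcing the return contract in the harness keeps the orchestrator's context clean even when a worker rambles.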
Use the LangChain four-pattern framing (skills, subagents, handoffs, router, on top of the single-agent baseline) because it is genuinely useful:
| Pattern | Use when | Tradeoff |
|---|---|---|
| Single agent | Most tasks at the start | Simplest, easiest to debug |
| Skills | One agent needs many latent capabilities | Loaded context accumulates over time [E6] |
| Subagents | Need context isolation, specialization, or parallel work | Extra orchestration call(s) [E6][E15] |
| Handoffs | Need sequential stage-based conversations | More stateful and harder to reason about [E6] |
| Router | Need stateless fan-out and synthesis across domains | Repeated routing overhead for conversations [E6] |
Anthropic reports that multi-agent systems can use around 15x the tokens of chat interactions in their research setting. That does not mean "avoid multi-agent"; it means use it where the value of the task justifies the extra capacity. [E15]
Skills are not just "prompt fragments". They are a disciplined way to add latent expertise without bloating the always-loaded system prompt or tool catalog.
Deep Agents and Kimi both support a pattern where the agent learns only the skill name, path, and description up front, then loads the full SKILL.md only when relevant. Anthropic's skills and tool-search tooling point in the same general direction: load detail on demand, not by default. [U1][U2][U4][E2][E9][E21]
Skills solve three problems:
- token control - the full instructions are not always in context
- distributed ownership - different teams can own different skills
- capability growth without tool sprawl - many workflows can be added as instructions rather than first-class tools
Use a directory-based format:
```
skills/
  release-engineering/
    SKILL.md
    templates/
    scripts/
    references/
```
Include:
- YAML frontmatter with `name`, `description`, optional allowed tools
- task framing
- decision rules
- required artifacts/templates
- examples
- links to scripts or reference files
Kimi's skill discovery hierarchy and Deep Agents' progressive loading are both worth copying. [E2][E21]
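As a concrete shape, a `SKILL.md` might start like this. The frontmatter keys, skill name, and referenced paths are illustrative assumptions, not a fixed schema shared by Kimi and Deep Agents:

```markdown
---
name: release-engineering
description: How to cut, verify, and announce a release for this repo.
allowed-tools: exec.run, fs.read, fs.write
---

# Release engineering

## When to use
Use this skill when the user asks to prepare, tag, or publish a release.

## Decision rules
- Never tag from a dirty working tree.
- Run the full test suite before bumping the version.

## Templates
See templates/release-notes.md for the announcement format.
```

Only the `name` and `description` are loaded up front; the body and templates load on demand.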
Put something in a skill when it is mainly:
- domain knowledge
- workflow guidance
- operational policy
- templates and examples
- "how to use these other tools correctly"
Put something in a tool when it is mainly:
- a capability requiring execution
- a stateful operation
- a side effect
- a special approval/control boundary
In 2026, serious agents are still collaborative systems. Anthropic's own product framing and both open-source harnesses assume supervision, questions, and approval gates are normal parts of the runtime. [U4][E12][E13]
Implement user.ask / AskUserQuestion as a synchronous pause point that returns structured options. Do not rely on the model to emit parseable markdown questions in free text. Both Kimi and Anthropic's Agent SDK now formalize this pattern. [U2][U5][E12][E19]
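For illustration, a `user.ask` elicitation payload could look like the following; every field name here is an assumption rather than the Agent SDK's or Kimi's actual schema:

```json
{
  "question": "Which environment should the migration target first?",
  "options": [
    { "id": "staging", "label": "Staging", "detail": "Safe, slower signal" },
    { "id": "prod-canary", "label": "Prod canary", "detail": "Faster signal, higher risk" }
  ],
  "allow_free_text": true,
  "blocking": true
}
```

The runtime pauses the turn, renders the options in the surface layer, and resumes with the selected `id` as a structured tool result.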
Use four layers of decisioning:
- hooks - custom code before execution
- static rules - allow/deny/ask policies by tool, path, domain, secret class
- permission mode - coarse session mode like `ask`, `accept_edits`, `bypass_in_sandbox`
- runtime callback / operator prompt - human final decision when needed
This mirrors Anthropic's documented permission evaluation pipeline and is a robust general model. [E12]
A useful default:
- `read` tier: `fs.read`, `fs.list`, `web.search`
- `soft write` tier: `fs.write` in working dir, artifact creation
- `hard write` tier: shell commands, network POSTs, DB mutations, secrets access
- `destructive` tier: delete, force-push, schema migrations, production actions
Only the first tier should be auto-allowed by default outside sandboxes.
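A static-rules layer for these tiers fits in a lookup table plus one decision function; the tier assignments and sandbox carve-out below are illustrative defaults, not a fixed policy:

```python
# Illustrative static policy: map tools to tiers and decide allow/ask.
TIERS = {
    "fs.read": "read", "fs.list": "read", "web.search": "read",
    "fs.write": "soft_write",
    "exec.run": "hard_write", "db.write": "hard_write",
    "fs.delete": "destructive", "db.migrate": "destructive",
}
AUTO_ALLOWED = {"read"}  # only the read tier auto-runs outside a sandbox

def decide(tool_name, in_sandbox=False, session_allows=frozenset()):
    tier = TIERS.get(tool_name, "hard_write")  # unknown tools escalate, never relax
    if tier in AUTO_ALLOWED or tool_name in session_allows:
        return "allow"
    if in_sandbox and tier in {"soft_write", "hard_write"}:
        return "allow"  # sandbox absorbs write risk, but never destructive ops
    return "ask"
```

Note the default for unknown tools: anything unmapped escalates to a gated tier rather than silently running.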
Persist "allow for this session" decisions. Kimi restores these on resume, and the pattern is very operator-friendly. [E19][E20]
Use deterministic hooks for:
- secret redaction
- allowlist/denylist checks
- provenance logging
- path normalization
- budget limits
- post-tool content scrubbing
- automatic artifact persistence
This is more reliable than repeatedly asking the model to remember rules.
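A hook pipeline of this kind is just ordered function application around the tool call; the hook names and secret patterns below are assumptions for illustration, not a specific harness API:

```python
import re

# Toy secret patterns; real deployments use a proper secret scanner.
SECRET_PATTERN = re.compile(r"(sk-[A-Za-z0-9]{8,}|AKIA[A-Z0-9]{16})")

def redact_secrets(text):
    return SECRET_PATTERN.sub("[REDACTED]", text)

def deny_dotenv(tool, args):
    if args.get("path", "").endswith(".env"):
        raise PermissionError("policy: .env access denied")
    return args

def run_tool(tool, args, pre_hooks, post_hooks, execute):
    for hook in pre_hooks:   # deterministic checks before any side effect
        args = hook(tool, args)
    raw = execute(tool, args)
    for hook in post_hooks:  # scrub before the result reaches model context
        raw = hook(raw)
    return raw

out = run_tool("fs.read", {"path": "config.yaml"},
               [deny_dotenv], [redact_secrets],
               lambda tool, args: "api_key: sk-abcdef123456")
```

Because the hooks run in the harness, the redaction happens even when the model forgets, or is instructed by injected content, to skip it.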
Adopt protocol boundaries that map cleanly to the three different relationships in an agent system:
- agent to tool/resource -> MCP
- client/editor/UI to local or remote agent runtime -> ACP
- agent to remote agent -> A2A
Do not force one protocol to do all three jobs. [U3][U4][E18][E24][E25][E26]
MCP is the open standard for connecting agents to external tools and data sources. Both Anthropic and Kimi explicitly position it this way, and the official MCP spec emphasizes user consent, tool safety, and trust boundaries. [E18][E22][E25]
- stdio and HTTP transports
- per-server auth handling
- tool discovery and deferred loading for large catalogs
- consistent tool namespacing
- explicit trust and approval model
The MCP spec warns that tool descriptions and annotations should be considered untrusted unless they come from trusted servers. That means MCP server output and metadata should go through the same policy scrutiny as tool results. [E25]
ACP standardizes communication between coding agents and client applications such as IDEs. It uses JSON-RPC 2.0 and models initialization, session setup, prompt turns, updates, and cancellation. Kimi CLI supports ACP; Deep Agents ships ACP integration; the protocol is now solid enough to treat as the default IDE/editor adapter target. [E23][E24]
If your runtime already has:
- typed events
- session IDs
- prompt/turn boundaries
- permission request events
- file operation events
then ACP is mostly an adapter problem, not a redesign problem.
Your internal event bus should look ACP-like even if you do not expose ACP immediately.
A2A is the right layer for remote agent-to-agent delegation and task exchange. The official docs now describe it as the common language for agent interoperability, with task objects, streaming, push notifications, and Agent Cards for discovery. [E26]
- Agent Cards for remote capability discovery
- task-based remote execution
- artifacts as outputs
- SSE streaming for long-running tasks
- push notifications for disconnected clients or jobs
A2A explicitly separates messages from artifacts, and says results should generally be returned as task artifacts rather than chat messages. That is highly compatible with the artifact-first design recommended in this blueprint. [E26]
```mermaid
flowchart TD
    CORE[Core Runtime]
    CORE --> MCPA[MCP Adapter]
    CORE --> ACPA[ACP Adapter]
    CORE --> A2AA[A2A Adapter]
    MCPA --> TOOLS[External Tools and Data Sources]
    ACPA --> SURFACES[IDE / CLI / Web Clients]
    A2AA --> REMOTE[Remote Agents]
```
MCP is already growing beyond plain tool calls into UI-capable extensions. Even if you do not adopt these immediately, design your content/event model so tool outputs can eventually include richer UI payloads without breaking the core runtime. [E25]
Assume the model is competent but not trustworthy enough to be your only control plane. Boundaries must be enforced in tools, policy engine, sandbox, network layer, and storage layer. Deep Agents states this bluntly: enforce boundaries at the tool/sandbox level, not by expecting the LLM to self-police. [U2][E3]
- absolute-root enforcement
- path normalization
- symlink and traversal defense
- restricted write scopes
- artifact-only zones vs source zones
Deep Agents and Kimi both hardened file path handling over time, which is exactly what you should expect to do too. [E3][E23]
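The traversal and symlink defenses reduce to one rule: resolve first, then containment-check against the root. A minimal sketch:

```python
import pathlib

def resolve_in_root(root, user_path):
    """Illustrative traversal defense: resolve symlinks and '..' first,
    then verify the result is still inside the sandbox root."""
    root_p = pathlib.Path(root).resolve()
    candidate = (root_p / user_path).resolve()
    if candidate != root_p and root_p not in candidate.parents:
        raise PermissionError(f"path escapes sandbox root: {user_path}")
    return candidate

ok = resolve_in_root("/tmp", "notes/plan.md")
```

Checking the resolved path, rather than the raw string, is what defeats `../` chains and symlinks that point outside the root.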
- isolated container or VM
- network egress off by default
- package install restrictions
- resource quotas
- process timeout
- syscall restrictions if you control the sandbox
- never expose `.env` or raw secret stores without policy checks
- inject secrets into tools only when needed
- log secret access as events
- redact secrets from tool results before they hit the model context
- treat fetched web pages, MCP output, and external docs as untrusted
- strip or neutralize obvious "ignore prior instructions" prompt injections
- separate raw fetched content from normalized summaries
- require human approval for sensitive actions even if the model was instructed by external content
Every external action should leave:
- who/what requested it
- inputs
- approval path
- outputs/artifact refs
- timestamps
- hashes where relevant
Define at least three trust zones:
- Model context zone - reasoning buffer, low trust
- Execution zone - sandboxed code/tools, medium trust with policy control
- Operator and system-of-record zone - approvals, secrets, production integrations, high trust
This framing makes it easier to reason about what data can flow where.
You need visibility into:
- model calls
- tool calls
- approval waits
- compaction events
- subagent fan-out
- artifact creation
- resume/replay behavior
- failure and rollback paths
A Kimi-style typed event bus and LangSmith/LangGraph-style traces are both strong inspirations here. [U1][E1][E23]
- session resume success rate
- step failure rate
- retry rate per step
- cancellation rate
- mean steps per completed task
- prompt cache hit rate
- uncached token share
- compaction frequency
- average artifact bytes per task
- percentage of tool results evicted to artifacts
- tool selection accuracy
- average tool count considered per step
- deferred-tool load rate
- tool latency
- percentage of tool results requiring follow-up reads
- approvals per task
- time waiting for approval
- AskUserQuestion frequency
- operator override rate
- subagent spawn rate
- parallel branch count
- orchestration overhead
- percent of tasks where subagents improved latency or quality
- token cost of orchestration vs direct execution
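Most of these metrics can be derived offline from the append-only event log rather than instrumented separately. A hypothetical sketch, assuming event kinds like `model_call` and `step_failed` and per-call token counts in the payload (your event taxonomy may differ):

```python
import json

def harness_metrics(event_log_lines):
    """Derive a few headline metrics from an append-only JSONL event log."""
    counts = {}
    cached = uncached = 0
    for line in event_log_lines:
        ev = json.loads(line)
        counts[ev["kind"]] = counts.get(ev["kind"], 0) + 1
        if ev["kind"] == "model_call":
            cached += ev.get("cached_tokens", 0)
            uncached += ev.get("uncached_tokens", 0)
    total = cached + uncached
    return {
        "step_failure_rate": counts.get("step_failed", 0) / max(counts.get("tool_call", 1), 1),
        "cache_hit_rate": cached / total if total else 0.0,
        "compactions": counts.get("compaction", 0),
    }
```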
Run three layers of evaluation:
- Unit tests for tools and policies - deterministic correctness.
- Task-level replay evals - fixed prompts and expected artifacts or state transitions.
- Long-horizon harness evals - resume after interruption, compaction correctness, approval branching, artifact recovery, and subagent handoffs.
Anthropic's tooling guidance strongly recommends evaluating tools with agents rather than assuming normal unit tests are enough. [E17]
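For the task-level replay layer, asserting on artifacts and state transitions rather than on prose keeps evals deterministic. A small helper sketch with illustrative names:

```python
def check_replay(expected: dict, produced: dict) -> list:
    """Compare expected artifacts (path -> required substring) against the
    artifacts a replayed run actually produced; returns a list of failures.

    Names and the substring-matching convention are illustrative; real
    suites may match on hashes, schemas, or state-machine transitions.
    """
    failures = []
    for path, needle in expected.items():
        if path not in produced:
            failures.append(f"missing artifact: {path}")
        elif needle not in produced[path]:
            failures.append(f"artifact {path} lacks expected content")
    return failures
```

The long-horizon layer then reuses the same check after injecting interruptions: kill the run mid-task, resume from the checkpoint, and require the same end state.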
```
agent-harness/
  runtime/
    loop/
    sessions/
    checkpoints/
    events/
    subagents/
  context/
    compaction/
    caching/
    artifact_refs/
    serializers/
  tools/
    builtin/
    wrappers/
    policies/
    schemas/
  protocols/
    mcp/
    acp/
    a2a/
  sandboxes/
    local/
    remote/
    execution_bridge/
  memory/
    conventions/
    task_store/
    artifact_store/
  skills/
    builtin/
    project/
  ui/
    cli/
    web/
    ide/
  evals/
    task_suites/
    replay/
    regression/
  docs/
    ARCHITECTURE.md
    AGENTS.md
```
This layout:
- isolates stable interfaces from fast-changing prompts and skills
- keeps protocol adapters separate from core runtime
- makes the artifact/memory system visible as a first-class subsystem
- makes evaluation a product surface, not an afterthought
These are starting defaults, not eternal constants.
| Knob | Recommended start | Rationale |
|---|---|---|
| `max_steps_per_turn` | 50-100 | Kimi defaults to 100, a reasonable ceiling for serious tasks. [E18] |
| `max_retries_per_step` | 2-3 | Enough for transient failures without hiding real problems. [E18] |
| `reserved_context_tokens` | 20-25% of window, or about 50k on 200-262k models | Leaves space for output + compaction prompts before hard failure. [U4][E18] |
| Large-result eviction threshold | 8k-16k tokens equivalent | High enough to keep small outputs inline, low enough to prevent bloat. |
| Subagent nesting | Off by default | Prevents delegation loops and hidden costs. [U1][E28] |
| Deferred-tool loading | Enable when tool schema mass is large or tool count exceeds comfortable selection range | Protects context budget and selection quality. [E9][E18] |
| Approval mode | Ask for writes, shell, network egress, deletes | Sensible default trust boundary. [E12][E19] |
| Artifact persistence | Always on for search/fetch/code outputs | Enables recovery, replay, and long-horizon work. |
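These knobs translate directly into a config object. The sketch below mirrors the table's starting defaults; the `HarnessConfig` name and field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class HarnessConfig:
    """Starting defaults from the table above (all values tunable)."""
    max_steps_per_turn: int = 100
    max_retries_per_step: int = 3
    reserved_context_tokens: int = 50_000     # ~20-25% of a 200k-262k window
    eviction_threshold_tokens: int = 12_000   # large results go to artifacts
    allow_subagent_nesting: bool = False      # off until delegation is well-instrumented
    deferred_tool_loading: bool = False       # enable when the catalog grows
    approval_required: tuple = ("write", "shell", "network", "delete")
    artifact_persistence: bool = True         # always persist search/fetch/code outputs
```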
Lock these decisions before heavy coding:
- runtime model (graph/checkpoint runtime vs hand-rolled)
- artifact store format and URI scheme
- sandbox strategy
- policy engine architecture
- protocol boundaries
- session and task schemas
Exit criteria:
- written state schema
- event taxonomy
- tool naming convention
- security boundary map
Build:
- append-only event log
- checkpointed session runtime
- file/artifact store
- minimal built-in tool set
- approval engine
- typed traces
- basic compaction / artifact eviction
- CLI or API surface
Do not build yet:
- remote A2A
- dynamic subagents
- huge MCP catalogs
- exotic skills
Exit criteria:
- 30-100 step tasks complete reliably
- resume works
- approval pause/resume works
- large outputs are artifactized
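The "large outputs are artifactized" criterion can be met with a simple eviction wrapper around tool results. A sketch assuming a filesystem artifact store and a rough 4-chars-per-token ratio (both assumptions; tune against your tokenizer and store):

```python
import hashlib
import os

def evict_if_large(result: str, artifact_dir: str, threshold_chars: int = 48_000) -> str:
    """Replace oversized tool output with a content-addressed artifact handle.

    threshold_chars of ~48k roughly tracks the 8k-16k token eviction range
    at ~4 chars/token. Small results pass through unchanged.
    """
    if len(result) <= threshold_chars:
        return result
    digest = hashlib.sha256(result.encode()).hexdigest()
    path = os.path.join(artifact_dir, f"{digest}.txt")
    with open(path, "w") as f:
        f.write(result)
    head = result[:500]
    return (f"[evicted to artifact://{digest}; {len(result)} chars total]\n"
            f"First 500 chars:\n{head}")
```

The model sees only the handle plus a preview; a `read` tool can page through the artifact on demand, which is what keeps long-horizon tasks from drowning in their own history.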
Add:
- stable prompt layout and cache instrumentation
- deferred tool loading / tool search
- skills
- AGENTS.md memory
- better compaction / restart heuristics
- eval suite for long-horizon tasks
Exit criteria:
- cache hit rate is measurable and stable
- long-context tasks avoid runaway token growth
- new domain knowledge can be added through skills without tool sprawl
Add:
- general-purpose subagent
- explore/research subagent
- explicit handoff contract
- ACP adapter
- MCP connector hardening
- parallel branch limits and metrics
Exit criteria:
- subagents materially reduce token bloat or latency on chosen workloads
- IDE integration works through ACP or equivalent
- MCP tool injection risks are contained by policy
Add:
- remote sandboxes
- stricter egress policies
- A2A adapter for remote specialists
- richer audit and compliance
- advanced rollback and replay
- operator workflows for approval queues
Exit criteria:
- clear trust zone boundaries
- disaster recovery / replay story
- remote delegation works without leaking internal state
- middleware composition
- pluggable filesystem backends
- built-in general-purpose subagent
- skill loading with progressive disclosure
- harness framing: planning + filesystem + subagents + context management [U1][E1][E2][E3]
- checkpoint-based durable execution
- persistence threads
- human-in-the-loop pause/resume
- idempotent task boundaries for replay [E5]
- session state persistence beyond chat history
- explicit loop-control knobs like `reserved_context_size`
- structured AskUserQuestion UX
- event/protocol decoupling via Wire-style messages
- config-driven agent definitions and inheritance [U1][U4][E18][E19][E20][E21][E23]
- prompt caching discipline
- tool search with deferred loading
- programmatic tool calling
- approval callbacks and layered permission logic
- initializer-agent pattern for long-running work
- fresh-window restarts when externalized state is strong [U2][U5][E7][E8][E9][E10][E12][E16][E30]
- context engineering as the primary systems problem
- mask, do not remove, when constraining actions
- keep errors in context
- use files to manipulate attention and preserve recoverability [U2][U3][U5]
- MCP for tool interoperability
- ACP for editor/client interoperability
- A2A for remote agent interoperability and task exchange [U3][U4][E24][E25][E26]
- Treating the agent as "just a prompt". This fails as soon as tasks span many steps, tools, or sessions. [U2][U3][E1]
- Dynamically rewriting the toolset mid-session. Breaks cache locality and confuses state. [U2][U5][E7]
- Letting raw tool outputs flood the context. Artifactize early. [U1][U4][E10]
- Using multi-agent because it sounds advanced. Use it only when context isolation or parallelism is clearly valuable. [E6][E15]
- Relying on the model to self-police risky actions. Put enforcement in policy and sandbox layers. [U4][E12][E25]
- Returning verbose, human-style tool blobs. Return structured, compact outputs plus artifact refs. [E17]
- Hiding errors from the model. Preserve error evidence unless redaction is necessary. [U2][U5]
- Overloading the system prompt with all possible instructions. Use skills, tool search, and read/search tools instead. [U2][U4][E9][E21]
- Skipping evaluation of the harness itself. Tool tests are not enough; evaluate resumption, compaction, approvals, and delegation. [E17]
- Conflating protocols. MCP, ACP, and A2A solve different problems. Use each where it fits. [E24][E25][E26]
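To make the verbose-tool-blob antipattern concrete, compare a human-style blob with a structured result plus an artifact handle. All values here are illustrative:

```python
# Anti-pattern: a verbose, human-style blob the model must re-read in full
# on every subsequent turn.
verbose = ("I searched the codebase and found 3 matches! The first one is in "
           "src/loop.py on line 42, where... (thousands more characters)")

# Better: structured, compact output plus a handle for the full payload.
compact = {
    "tool": "grep",
    "matches": 3,
    "top_hits": ["src/loop.py:42", "src/tools.py:17", "src/state.py:88"],
    "full_results": "artifact://sha256:ab12...",  # handle, not payload
}
```

The compact form costs a few dozen tokens per turn instead of thousands, and the artifact handle keeps the full evidence recoverable on demand.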
If I were starting this harness tomorrow, I would build the following first version:
- a durable orchestrator runtime with checkpointing
- append-only JSONL event log
- Postgres or SQLite for session/task metadata
- object store or filesystem artifact store with content-addressed blobs
- a stable tool catalog with namespaces
- file read/write/edit/search
- web search/fetch
- sandboxed code execution
- task graph primitive
- structured user question tool
- subagent delegation tool
- static prompt prefix
- AGENTS.md and project memory loading
- immediate artifactization of large outputs
- reserved output budget
- compaction + fresh restart policy
- pre-tool hooks
- policy rules
- approval callbacks
- sandbox allowlists
- provenance log
- one CLI or API first
- internal typed event bus
- ACP adapter second
- web UI later
- skills
- deferred tool loading / tool search
- general-purpose subagent
- specialist subagents
- MCP expansion
- A2A remote workers only after local harness is stable
That combination gives you a system that is already "modern agentic" by 2026 standards without overcommitting to every new trend.
The best 2026 harness is not the one with the most tools, the most agents, or the biggest context window. It is the one that:
- preserves a stable cached prefix
- externalizes state to artifacts and files
- exposes a small, ergonomic action space
- uses code execution to compress data before the model sees it
- delegates work into isolated contexts only when needed
- enforces safety in deterministic runtime layers
- can resume, replay, inspect, and audit every important step
That is the throughline across the materials you shared and the current official docs. Build that foundation first. Everything else - skills, richer protocols, remote swarms, UI polish, domain specialization - compounds on top of it.
- [U1] `compass_artifact_wf-b9580dc8-f513-4de3-bf7a-7e6dbb6d5df8_text_markdown.md` - user-provided synthesis on architectural patterns for modern agentic systems.
- [U2] `modern-agent-architecture-guide.md` - user-provided guide synthesizing Claude Code, Manus, Deep Agents, and Kimi CLI patterns.
- [U3] `Building a 2026 Agentic System.docx` - user-provided architectural blueprint with neuro-symbolic framing, protocol sections, and implementation guidance.
- [U4] `Starter Plan for a Modern Agentic System in 2026.pdf` - user-provided starter blueprint focused on runtime primitives, context engineering, and milestones.
- [U5] `agent_ideas.docx` - user-provided notes including Thariq, Lance Martin, Manus, and Deep Agents excerpts.
- [E1] LangChain Docs - "Deep Agents overview"
  https://docs.langchain.com/oss/python/deepagents/overview
- [E2] LangChain Docs - "Customize Deep Agents"
  https://docs.langchain.com/oss/python/deepagents/customization
- [E3] LangChain Docs - "Backends"
  https://docs.langchain.com/oss/python/deepagents/backends
- [E4] LangChain Docs - "Deep Agents CLI"
  https://docs.langchain.com/oss/python/deepagents/cli/overview
- [E5] LangChain / LangGraph Docs - "Durable execution", "Persistence", and "LangGraph overview"
  https://docs.langchain.com/oss/python/langgraph/durable-execution
  https://docs.langchain.com/oss/python/langgraph/persistence
  https://docs.langchain.com/oss/python/langgraph/overview
- [E6] LangChain Blog - "Choosing the Right Multi-Agent Architecture"
  https://blog.langchain.com/choosing-the-right-multi-agent-architecture/
- [E7] Anthropic Docs - "Prompt caching"
  https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- [E8] Anthropic Docs - "Compaction"
  https://platform.claude.com/docs/en/build-with-claude/compaction
- [E9] Anthropic Docs - "Tool search tool"
  https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool
- [E10] Anthropic Docs - "Programmatic tool calling"
  https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling
- [E11] Anthropic Docs - "Web search tool", "Web fetch tool", and "Code execution tool"
  https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool
  https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-fetch-tool
  https://platform.claude.com/docs/en/agents-and-tools/tool-use/code-execution-tool
- [E12] Anthropic Docs - "Handle approvals and user input" and "Configure permissions"
  https://platform.claude.com/docs/en/agent-sdk/user-input
  https://platform.claude.com/docs/en/agent-sdk/permissions
- [E13] Anthropic Docs - "Agent SDK overview"
  https://platform.claude.com/docs/en/agent-sdk/overview
- [E14] Anthropic Engineering - "Effective context engineering for AI agents"
  https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- [E15] Anthropic Engineering - "How we built our multi-agent research system"
  https://www.anthropic.com/engineering/multi-agent-research-system
- [E16] Anthropic Engineering - "Effective harnesses for long-running agents"
  https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
- [E17] Anthropic Engineering - "Writing effective tools for agents - with agents"
  https://www.anthropic.com/engineering/writing-tools-for-agents
- [E18] Moonshot Kimi CLI Docs - "Config Files"
  https://moonshotai.github.io/kimi-cli/en/configuration/config-files.html
- [E19] Moonshot Kimi CLI Docs - "Interaction and Input"
  https://moonshotai.github.io/kimi-cli/en/guides/interaction.html
- [E20] Moonshot Kimi CLI Docs - "Sessions and Context"
  https://moonshotai.github.io/kimi-cli/en/guides/sessions.html
- [E21] Moonshot Kimi CLI Docs - "Agent Skills"
  https://moonshotai.github.io/kimi-cli/en/customization/skills.html
- [E22] Moonshot Kimi CLI Docs - "Model Context Protocol"
  https://moonshotai.github.io/kimi-cli/en/customization/mcp.html
- [E23] Moonshot Kimi CLI Docs - "Wire mode" and changelog notes
  https://moonshotai.github.io/kimi-cli/en/customization/wire-mode.html
  https://moonshotai.github.io/kimi-cli/en/release-notes/changelog.html
- [E24] Agent Client Protocol - Introduction and Protocol Overview
  https://agentclientprotocol.com/
  https://agentclientprotocol.com/protocol/overview
- [E25] Model Context Protocol - official specification
  https://modelcontextprotocol.io/specification/2025-11-25
- [E26] Agent2Agent (A2A) Protocol - official docs and specification
  https://a2a-protocol.org/latest/
  https://a2a-protocol.org/latest/topics/agent-discovery/
  https://a2a-protocol.org/latest/topics/streaming-and-async/
  https://a2a-protocol.org/latest/specification/
- [E27] Anthropic release notes - February 2026 platform changes
  https://platform.claude.com/docs/en/release-notes/overview
- [E28] Claude Code Docs - "Create custom subagents"
  https://code.claude.com/docs/en/sub-agents
- [E29] Anthropic Research - "Building Effective AI Agents"
  https://www.anthropic.com/research/building-effective-agents
- [E30] Anthropic Docs - "Claude prompting best practices"
  https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices