Modern Agent Harness Blueprint 2026

Blueprint for a Modern Agentic Harness in 2026

What this document is

This is an early-stage architecture blueprint for building a modern agentic harness from the ground up in 2026. It is not a product spec and not a vendor pitch. It is a practical plan for how to structure the runtime, state model, context system, tool layer, subagent orchestration, approvals, protocols, and observability so the harness stays useful as models, tools, and deployment surfaces evolve.

This blueprint synthesizes the uploaded source material with current official documentation from Anthropic, LangChain/LangGraph, Moonshot Kimi CLI, MCP, ACP, and A2A. Where the sources conflict, the plan below favors patterns that appear repeatedly across multiple production systems or formal protocols. [U1][U2][U3][U4][U5][E1][E7][E15][E18][E24][E26]


Executive summary

If you only remember six things, remember these:

  1. The harness matters more than the loop. The model-tool loop is now commodity. Differentiation comes from context engineering, durable state, policy enforcement, externalized memory, and protocol design. [U1][U2][E1][E5][E13]

  2. Design around cache stability first. Prompt caching is not a small optimization. It changes your entire architecture: stable prompt prefix, append-only history, fixed tool catalog per session, and state transitions modeled as messages or mode flags rather than prompt rewrites. [U2][U5][E7][E27]

  3. Treat the filesystem and artifact store as working memory. Large tool outputs, notes, plans, recovered state, and handoffs should live outside the model context and be referenced by handles. This is the only reliable way to scale beyond short tasks without drowning the model in its own history. [U1][U2][U4][U5][E1][E3][E14]

  4. Keep the built-in action space small and stable. Start with a compact set of high-leverage primitives: file ops, search/read, code execution or shell, planning/tasks, subagent delegation, and structured user elicitation. Add new tools only when they improve control, guardrails, concurrency, observability, or UX. [U1][U2][U5][E17]

  5. Use subagents for context isolation, not because "multi-agent" sounds advanced. Start with a single agent. Add subagents only when you need parallel exploration, specialized prompts/tooling, separate context windows, or explicit ownership boundaries. [U1][U2][U4][E6][E14][E15]

  6. Put guardrails in the runtime, not the prompt. The model should never be the only enforcement layer. Destructive tools, secret access, network egress, and external writes need deterministic policy checks, approval gates, sandbox restrictions, and audit trails. [U2][U4][U5][E12][E17][E25]


Decision summary

| Area | Recommended default | Why |
| --- | --- | --- |
| Core runtime | Durable state machine / graph runtime with checkpointing | Long-running work needs pause/resume, replay, fault tolerance, and human interruption support. [E1][E5] |
| Session history | Append-only event log plus typed state snapshot | Best for cache stability, replay, auditability, and deterministic recovery. [U3][U4][E7] |
| Working memory | Artifact-first filesystem plus metadata store | Offloads context, preserves recoverability, supports handoffs and resumption. [U1][U4][E1][E3] |
| Built-in tools | Narrow, namespaced primitives | Small stable action space improves selection quality and keeps cache-friendly prefixes. [U1][U2][U5][E17] |
| Large data handling | Programmatic tool calling or sandboxed code execution | Keeps intermediate data out of the model context and reduces round trips. [U2][U3][U4][U5][E10][E11] |
| Planning | Task graph or structured todo primitive | Acts as attention control and coordination state, not as a workflow engine by itself. [U2][U5][E1][E16] |
| Multi-agent | Orchestrator-worker subagents by default | Strongest payoff for context isolation and parallel work with manageable complexity. [U1][U2][E6][E15] |
| Protocols | MCP for tools, ACP for IDE/client surfaces, A2A for remote agent-to-agent delegation | Clean separation of concerns and future-proof interoperability. [U3][U4][E18][E24][E25][E26] |
| Human collaboration | Structured question tool plus approval policies | Faster, more deterministic than plain-text back-and-forth. [U2][U4][U5][E12][E19] |
| Security | Sandbox + policy engine + audit log | Practical trust boundary for real tool use. [U2][U4][E3][E12][E25] |

The architectural thesis

A modern harness should treat the LLM as the control plane for reasoning and planning, while the rest of the system handles state, execution, storage, approvals, transport, and observability. This is a neuro-symbolic split in practice, not in theory. The more you can move determinism, memory, and policy into the harness, the more reliable the overall system becomes. [U3][E1][E5][E13]

The harness should therefore be built around five stable layers:

  1. Execution runtime - the event loop, session manager, checkpointing, and recovery.
  2. Context system - prompt layout, artifact references, compaction, and cache discipline.
  3. Capability surface - built-in tools, external tools, skills, and subagents.
  4. Governance layer - approvals, hooks, allow/deny policy, sandboxing, provenance.
  5. Surface/protocol adapters - CLI, IDE, web UI, ACP, MCP, and optionally A2A.

Reference architecture

flowchart LR
    U[User or Calling System] --> S[Surface Layer: CLI / IDE / Web / API]
    S --> A[Client Adapter / ACP / REST]

    A --> R[Agent Runtime]
    R --> ST[Session State + Checkpoints]
    R --> PL[Plan / Task Graph]
    R --> EV[Typed Event Bus]
    R --> PO[Policy Engine + Approvals + Hooks]

    R --> TR[Tool Router]
    TR --> BT[Built-in Tools]
    TR --> MCP[MCP Connector]
    TR --> SX[Sandbox / Code Execution / PTC]
    TR --> FS[Artifact Store / Virtual Filesystem]

    R --> SG[Subagent Manager]
    SG --> R

    R --> MEM[Long-term Memory / AGENTS.md / Conventions]
    R --> A2A[A2A Adapter for Remote Agents]
    EV --> OBS[Tracing / Metrics / Replay / Eval Harness]

Core idea of the diagram

  • The runtime owns the loop, state, and control.
  • The tool router is the gateway to all side effects.
  • The artifact store / virtual filesystem is the main external memory substrate.
  • The subagent manager is a specialization mechanism and a context pressure valve.
  • The policy engine is the governance boundary.
  • The surface layer is replaceable; the engine should not depend on a specific UI. Kimi's Wire-style decoupling and ACP both reinforce this design. [U1][E18][E23][E24]

1. Runtime and state model

What to build

Your runtime should manage:

  • session creation and resume
  • append-only message/event history
  • deterministic state snapshots
  • step execution and retry policy
  • compaction and fresh-window restarts
  • human pause/resume
  • subagent spawning and result capture
  • cancellation and timeout handling
  • audit and replay

This is exactly why LangGraph emphasizes durable execution and persistence through checkpoints, and why Kimi CLI persists not just the conversation but also approvals, dynamic subagents, and added workspace directories across resume. [U1][U4][E1][E5][E20]

Recommended state entities

Session

A session is the resumable unit for a conversation or job.

{
  "session_id": "sess_...",
  "thread_id": "thread_...",
  "created_at": "2026-03-01T12:00:00Z",
  "mode": "execute",
  "model_profile": "orchestrator-default",
  "tool_catalog_version": "v1",
  "approval_mode": "ask",
  "context_state": {
    "compacted": false,
    "recent_summary_ref": "artifact://..."
  }
}

Task

A task is the coordination object.

{
  "task_id": "task_...",
  "title": "Draft migration plan",
  "status": "in_progress",
  "owner": "main-agent",
  "dependencies": ["task_1", "task_2"],
  "blockers": [],
  "artifact_refs": ["artifact://plan.md"],
  "updated_at": "2026-03-01T12:10:00Z"
}

Artifact

Artifacts are durable outputs and large intermediate objects.

{
  "artifact_id": "artifact_...",
  "uri": "artifact://reports/search-results-01.json",
  "mime_type": "application/json",
  "summary": "Search results for Azure pricing pages",
  "sha256": "....",
  "source": {
    "tool": "web.search",
    "session_id": "sess_...",
    "step_id": "step_..."
  }
}

Step

A step is a single model decision or tool execution event.

{
  "step_id": "step_...",
  "kind": "tool_call",
  "tool_name": "fs.read",
  "status": "completed",
  "started_at": "2026-03-01T12:05:00Z",
  "ended_at": "2026-03-01T12:05:01Z",
  "artifact_refs": [],
  "error": null
}

Runtime rules

  1. Checkpoint before and after external side effects. LangGraph's checkpointers and Kimi's per-step session persistence both support this general rule. [E5][E20]
  2. Make replay deterministic. Wrap side effects in tasks or tool execution envelopes so replay does not accidentally re-run destructive actions. LangGraph explicitly recommends deterministic/idempotent design for durable execution. [E5]
  3. Store both typed state and event history. State gives you fast resume; event history gives you replay, analytics, and debugging.
  4. Use cancellation as a first-class primitive. ACP and A2A both model cancellation/interrupt flows, so your internal runtime should too. [E24][E26]
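Rules 1 and 2 above can be sketched as a single execution envelope. This is a minimal in-memory illustration; the `CheckpointStore` and `run_step` names are hypothetical, and a real backend would be a durable checkpointer (e.g. a LangGraph checkpointer or Kimi-style per-step persistence):

```python
import json

class CheckpointStore:
    """In-memory stand-in for a durable checkpoint backend (hypothetical API)."""
    def __init__(self):
        self.checkpoints = []
        self.completed_steps = {}  # step_id -> recorded tool result

    def save(self, label, state):
        self.checkpoints.append((label, json.dumps(state)))

def run_step(store, state, step_id, tool_fn, *args):
    """Checkpoint before and after an external side effect; on replay,
    return the recorded result instead of re-running the tool, so replay
    cannot re-fire a destructive action."""
    if step_id in store.completed_steps:          # replay path
        return store.completed_steps[step_id]
    store.save(f"pre:{step_id}", state)           # checkpoint before
    result = tool_fn(*args)                       # the actual side effect
    store.completed_steps[step_id] = result
    store.save(f"post:{step_id}", state)          # checkpoint after
    return result

# Usage: a second call with the same step_id does not re-run the tool.
calls = []
def fake_write(path):
    calls.append(path)
    return {"status": "ok", "path": path}

store = CheckpointStore()
state = {"session_id": "sess_1"}
r1 = run_step(store, state, "step_1", fake_write, "plan.md")
r2 = run_step(store, state, "step_1", fake_write, "plan.md")  # replayed
```

The replay guard is what makes rule 2 hold: determinism comes from the recorded envelope, not from hoping the tool is naturally idempotent.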

2. Context engineering and cache-first design

Design principle

Treat context as a scarce, actively managed resource. Context engineering is not "prompt polish"; it is the main systems problem in long-running agents. [U2][U3][U4][U5][E14]

Stable prompt layout

The most cache-friendly ordering is:

  1. static system prompt and always-present tool stubs
  2. project memory / AGENTS.md / conventions
  3. session-level state summary
  4. recent messages and tool results
  5. latest user turn

Anthropic's prompt caching docs are explicit: the cache hierarchy is tools -> system -> messages, and changes to tools invalidate the whole cache below that point. [U2][U5][E7]

Rules that follow from caching

  • Do not add or remove tools mid-session unless you are willing to lose cache locality. Prefer deferred loading or masks. [U2][U5][E7][E9]
  • Do not switch models mid-session for trivial reasons. If you need a different model, spawn a subagent or fresh worker session. [U2][U5]
  • Do not rewrite the system prompt for dynamic state changes. Send reminders or state updates as messages. [U2][U5][E7]
  • Keep serialization deterministic. Even small ordering changes can break cache reuse. [U2][U5]
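These rules reduce to one invariant: the request prefix (tools, then system) must be byte-identical across turns, with only appended messages changing. A sketch, assuming an Anthropic-style request shape; the `cache_control` field mirrors the prompt-caching API but should be treated as illustrative rather than authoritative:

```python
def build_request(tool_catalog, system_prompt, project_memory, history, user_turn):
    """Assemble a request whose prefix (tools + system) is stable across
    turns, so only the appended messages miss the cache."""
    return {
        "tools": tool_catalog,  # fixed catalog for the whole session
        "system": [
            {"type": "text", "text": system_prompt},
            # Cache breakpoint after the last stable block (illustrative shape).
            {"type": "text", "text": project_memory,
             "cache_control": {"type": "ephemeral"}},
        ],
        # History is append-only; dynamic state goes here as messages,
        # never into the system prompt.
        "messages": history + [{"role": "user", "content": user_turn}],
    }

# Two consecutive turns share an identical prefix: only messages grow.
tools = [{"name": "fs.read"}]
h = []
r1 = build_request(tools, "You are an agent.", "AGENTS.md contents", h, "hi")
h = r1["messages"] + [{"role": "assistant", "content": "hello"}]
r2 = build_request(tools, "You are an agent.", "AGENTS.md contents", h, "next")
```

If a state change must reach the model, append it as a message or mode flag; rewriting `system_prompt` would invalidate everything below the tools block.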

Multi-tier context management plan

The best synthesis across the sources is a five-tier policy (Tier 0 through Tier 4):

Tier 0 - Structured outputs by default

Tool results should already be concise, typed, and artifact-backed. This reduces the need for later cleanup. Anthropic's tool-writing guidance strongly favors returning meaningful but token-efficient context. [E17]

Tier 1 - Immediate large-result eviction

If a tool returns a large object, write the full result to an artifact and return only:

  • a short summary
  • the first few relevant lines or rows
  • a stable artifact handle
  • metadata (size, type, location, provenance)

This is how Deep Agents handles large results conceptually, and it should be your baseline pattern. [U1][U2][E1]
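A minimal sketch of this eviction wrapper, with a hypothetical in-memory artifact store standing in for real storage; the threshold and field names are assumptions, not a standard:

```python
import hashlib

ARTIFACTS = {}  # stand-in for a real artifact store

def evict_large_result(tool_name, result_text, max_inline=500, preview_lines=3):
    """Tier 1: if a tool result is large, persist the full payload as an
    artifact and return only a summary, a short preview, and a handle."""
    if len(result_text) <= max_inline:
        return {"status": "ok", "inline": result_text}
    digest = hashlib.sha256(result_text.encode()).hexdigest()
    uri = f"artifact://{tool_name}/{digest[:12]}.txt"
    ARTIFACTS[uri] = result_text          # full payload stays out of context
    lines = result_text.splitlines()
    return {
        "status": "ok",
        "summary": f"{tool_name} returned {len(result_text)} chars "
                   f"({len(lines)} lines)",
        "preview": lines[:preview_lines], # first few relevant lines
        "artifact_ref": uri,              # stable handle for later reads
        "meta": {"size": len(result_text), "sha256": digest},
    }

big = "\n".join(f"row {i}" for i in range(1000))
env = evict_large_result("web.fetch", big)
```

The handle lets the model (or a subagent) re-read exactly the slice it needs later, instead of carrying the whole payload forward turn after turn.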

Tier 2 - Deferred input/result eviction

When context approaches roughly 80-90% of the safe usable window, rewrite old bulky tool inputs or old results into references. Kimi exposes reserved_context_size and triggers auto-compaction before hard failure; Deep Agents uses proactive summarization middleware. [U1][U4][E18][E20]

Tier 3 - Compaction / summarization

When the window is still too full, summarize older history into:

  • current goal
  • state achieved so far
  • open tasks
  • key decisions and assumptions
  • artifact refs
  • next recommended step

Anthropic now recommends server-side compaction for long-running workflows where available. [E8][E27]

Tier 4 - Fresh-window restart

When state has been externalized well, a brand new context window can be better than compaction. Anthropic explicitly recommends considering fresh restarts when the model can rediscover state from files, tests, progress notes, and git history. [E30][E16]

When to compact vs when to restart fresh

Use compaction when:

  • the interaction is conversational and continuity matters
  • the task depends on subtle negotiation or conversational nuance
  • the model must preserve recent reasoning state inline

Use fresh-window restart when:

  • the task has strong externalized state in files/artifacts
  • work has clear milestone checkpoints
  • the model can cheaply reconstruct state from AGENTS.md, progress.txt, tests.json, git history, or task objects [E16][E30]
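When restarting fresh, the runtime can assemble an orientation message from externalized state rather than carrying history over. A sketch using the file conventions recommended later in this blueprint (AGENTS.md, progress.txt, tasks.json); the function name and exact message layout are assumptions:

```python
import json
from pathlib import Path

def bootstrap_fresh_window(workdir):
    """Rebuild a compact orientation message for a brand-new context
    window from externalized state on disk."""
    root = Path(workdir)
    parts = []
    agents = root / "AGENTS.md"
    if agents.exists():
        parts.append("## Project conventions\n" + agents.read_text())
    progress = root / "progress.txt"
    if progress.exists():
        # Only the tail of the progress log: recent state matters most.
        tail = progress.read_text().splitlines()[-20:]
        parts.append("## Recent progress\n" + "\n".join(tail))
    tasks = root / "tasks.json"
    if tasks.exists():
        # Carry only open tasks into the fresh window.
        open_tasks = [t for t in json.loads(tasks.read_text())
                      if t["status"] != "done"]
        parts.append("## Open tasks\n" + json.dumps(open_tasks, indent=2))
    return "\n\n".join(parts)
```

The better your Tier 1/Tier 2 discipline, the more often this path beats compaction, because the files already contain everything a fresh window needs.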

Attention anchoring

Long tasks drift. Use explicit progress recitation:

  • rewrite plan/task state regularly
  • keep "what matters now" near the end of context
  • maintain separate structured task state and freeform progress notes

Anthropic's long-running harness guidance found that structured feature/test files plus progress logs improve continuity across context windows; Manus-inspired notes in your uploaded docs make the same point through the todo.md recitation pattern. [U2][U5][E16][E30]

Error preservation

Do not hide failures from the model. Keep failed actions, stack traces, and rejection reasons in the trace unless they are sensitive and must be redacted. This is one of the highest leverage patterns in the uploaded materials because it turns the agent's own mistakes into in-session learning signal. [U2][U5]

Context diversity

Repetitive action-observation traces can accidentally become few-shot examples that the model blindly mimics. If the system processes many similar items, vary serialization templates slightly and break repetitive rhythms when safe. This is an underappreciated pattern from the user materials that is worth testing in evals. [U2][U5]


3. Filesystem, artifacts, and memory

Recommendation

Make the filesystem or virtual artifact store the default working-memory substrate for the harness. This should not be an optional afterthought. [U1][U2][U4][E1][E3]

Why this matters

It solves four different problems at once:

  1. context overflow - large content moves out of the prompt
  2. recoverability - old information stays addressable
  3. handoffs - subagents and resumed sessions can share state through files/artifacts
  4. human inspectability - operators can inspect what the agent actually saw or produced

Deep Agents exposes a pluggable filesystem surface with backends for in-state memory, local disk, durable store, and sandboxes. That is exactly the kind of abstraction boundary you want. [E1][E3]

Memory layers

Use three memory layers:

A. Short-term session memory

  • recent messages
  • active plan
  • recent tool outcomes
  • approval state
  • subagent registry

This should live in your checkpointed runtime state.

B. Working memory / artifact memory

  • large tool outputs
  • cached fetch results
  • scratch data
  • notes
  • progress logs
  • structured test/status files

This should live in the filesystem or object store.

C. Long-term memory

  • persistent project conventions
  • reusable operator notes
  • AGENTS.md or equivalent
  • recurring policies
  • stable user or org preferences

Deep Agents supports AGENTS.md-style memory files and store-backed memory; Kimi persists session-specific state; Anthropic's docs increasingly assume the filesystem is a first-class rediscovery layer. [E2][E3][E20][E30]

Recommended file conventions

At minimum, standardize these:

  • AGENTS.md - durable project conventions and working agreements
  • progress.txt or progress.md - human-readable progress log
  • tasks.json - structured task graph
  • artifacts/ - large intermediate outputs
  • plans/ - plan drafts and checkpoints
  • reports/ - final or semi-final deliverables
  • run/ - ephemeral execution outputs if you want explicit scratch space

Prefer "search and read" over massive preload

Do not stuff entire codebases or knowledge bases into the prompt. Give the agent strong search/read capabilities and let it build context just in time. This is a repeating lesson in your uploaded materials and aligns with Anthropic's recent context-engineering guidance. [U2][U5][E14][E30]


4. Tool and action-space design

Default built-in primitive set

Start with a deliberately small, namespaced core set:

  • fs.list
  • fs.read
  • fs.write
  • fs.edit
  • fs.search (glob/grep or both)
  • exec.run or code.run
  • task.set / task.update
  • agent.delegate
  • user.ask
  • web.search
  • web.fetch

You do not need 50 first-party tools to start. Deep Agents and Kimi both converge on a compact, high-leverage set, and Anthropic's tooling guidance reinforces a high bar for tool creation. [U1][U2][U5][E1][E18][E19][E17]

Criteria for promoting an action to a dedicated tool

Promote an action out of shell/code execution only if at least one of these is true:

  1. UX needs special rendering
    Example: structured questions should become a UI panel, not free text. [U2][U5][E12][E19]

  2. Guardrails need a deterministic checkpoint
    Example: file edits may require stale-read checks or path validation. [U2][E17]

  3. Concurrency or transaction semantics matter
    Example: read-only actions can parallelize; writes should serialize. [U2]

  4. Observability needs a clean event boundary
    Example: you want exact metrics for search calls or DB writes. [U2][E17]

  5. Approval policy differs by action class
    Example: exec.run may require review while fs.read does not. [U2][E12]

If none of those apply, consider leaving the action in the shell/code sandbox.

Tool catalog strategy

Always-present tools

Keep the 3-5 most common tools loaded and stable. Anthropic's tool search docs recommend keeping the most frequently used tools non-deferred. [E9]

Deferred tools

All rarely used or verbose tool schemas should be discoverable but not loaded by default. Anthropic's tool search and Claude Code MCP docs show the right pattern: stable stubs plus deferred loading. [E9][E18]

Tool naming

Use namespaces and families:

  • fs.*
  • web.*
  • db.*
  • exec.*
  • task.*
  • agent.*
  • user.*

This helps humans, logs, policy rules, and model-side action masking.

Tool output contract

Every tool result should come back in a consistent envelope:

{
  "status": "ok",
  "summary": "2 matching files found",
  "structured": { "matches": 2 },
  "artifact_refs": ["artifact://search/matches.json"],
  "preview": ["src/app.py:12", "tests/test_app.py:48"]
}

This is the single best way to keep tool outputs readable, cache-friendly, and observable.

Tool writing principles

Anthropic's tool-writing guidance is worth adopting directly:

  • choose the right tools to implement
  • use namespacing
  • return meaningful context
  • optimize for token efficiency
  • prompt-engineer tool descriptions/specs
  • run real evaluations for tool quality, not just unit tests [E17]

5. Code execution and programmatic tool calling

Recommendation

Treat code execution as a default capability, not a niche feature for "coding agents". In modern agent systems, code is the best intermediate representation for filtering, joining, transforming, validating, and compressing data before it ever reaches the model context. [U2][U3][U4][U5][E10][E11]

When to use programmatic tool calling (PTC)

Use PTC or equivalent self-managed sandbox orchestration when the agent needs to:

  • loop over many entities
  • transform large result sets
  • batch many API/tool calls
  • validate tabular or numeric output
  • filter web search/fetch results
  • early-terminate on condition checks
  • build structured artifacts from noisy data

Anthropic's programmatic tool calling docs are explicit that the main advantage is keeping intermediate results inside the execution container instead of sending them back through the model at every step. [E10]

Suggested architecture

sequenceDiagram
    participant Runtime
    participant Model
    participant Sandbox
    participant ToolHandler

    Runtime->>Model: prompt + callable tool specs
    Model-->>Runtime: code execution request
    Runtime->>Sandbox: run generated code
    Sandbox->>ToolHandler: typed tool invocation
    ToolHandler-->>Sandbox: tool result
    Sandbox->>ToolHandler: next invocation if needed
    ToolHandler-->>Sandbox: result
    Sandbox-->>Runtime: final stdout/result + artifact refs
    Runtime->>Model: compact final result

Managed vs self-managed PTC

Anthropic's docs outline three general implementation modes: client-side execution, self-managed sandboxed execution, and managed execution. The managed option is easy, but self-managed execution gives you stronger control over network policy, data retention, package policy, and compliance. [E10]

Recommendation for your own harness: build an abstraction that can support both:

  • Managed provider PTC when you want convenience or benchmark leverage
  • Self-managed sandbox when you need enterprise control
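The abstraction can be a small interface that hides which backend executes the code, so policy (network, retention, compliance) varies per backend without touching the runtime. All class and method names here are hypothetical:

```python
from abc import ABC, abstractmethod

class CodeExecutor(ABC):
    """Common interface over managed-provider and self-managed execution."""
    @abstractmethod
    def run(self, code: str, session: str) -> dict: ...

class SelfManagedSandbox(CodeExecutor):
    def __init__(self, allow_network=False):
        self.allow_network = allow_network
    def run(self, code, session):
        # A real implementation would ship code into a container; here we
        # only record the policy decision for illustration.
        return {"backend": "self-managed", "network": self.allow_network,
                "session": session}

class ManagedProviderExecutor(CodeExecutor):
    def run(self, code, session):
        # Would call the provider's managed execution API.
        return {"backend": "managed", "session": session}

def pick_executor(compliance_required: bool) -> CodeExecutor:
    """Route to the self-managed sandbox when enterprise control is needed."""
    return SelfManagedSandbox() if compliance_required else ManagedProviderExecutor()
```

The runtime then calls `pick_executor(...)` per session, and the choice never leaks into prompts or tool schemas, preserving cache stability.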

Sandbox requirements

Your sandbox should support:

  • strict filesystem root
  • CPU / memory / wall-clock limits
  • optional no-network mode
  • explicit egress allowlists
  • package install policy
  • per-run provenance
  • container/session reuse policy
  • artifact mounting

Anthropic's managed PTC uses session-scoped containers with idle expiration; do not hard-code their TTL, but do copy the idea of session-scoped containers with explicit reuse handles. [E10]
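At minimum, the runtime contract for local execution is a filesystem root and a wall-clock limit. This sketch shows only that contract; real deployments should add OS-level isolation (containers, seccomp, network namespaces), which plain `subprocess` does not provide:

```python
import subprocess
import sys
from pathlib import Path

def run_sandboxed(cmd, root, timeout_s=30):
    """Run a command confined to a working root with a wall-clock limit.
    This is a contract sketch, not real isolation."""
    root = Path(root).resolve()
    proc = subprocess.run(
        cmd,
        cwd=root,                 # strict filesystem root as working dir
        capture_output=True,
        text=True,
        timeout=timeout_s,        # wall-clock limit; raises TimeoutExpired
    )
    return {"returncode": proc.returncode, "stdout": proc.stdout,
            "stderr": proc.stderr, "root": str(root)}

out = run_sandboxed([sys.executable, "-c", "print('ok')"], ".")
```

CPU/memory limits, egress allowlists, and package policy belong in the container layer below this interface, not in the agent loop.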

Governance note

If you use provider-managed code execution, verify retention and privacy rules. Anthropic explicitly notes that code execution and PTC are not covered by ZDR arrangements. [E10][E11]


6. Planning and the evolution from todos to tasks

Planning is for cognition first, coordination second

The plan object is not there because the harness cannot sequence steps in code. It is there because it helps the model stay oriented, and later it helps multiple agents coordinate without relying on implicit conversational memory. [U1][U2][U5][E1]

Starter recommendation

Start with a lightweight task.set / task.update or write_todos-style primitive that:

  • rewrites the current plan in full
  • tracks statuses
  • links to artifact refs
  • captures blockers and next step

Then evolve into a task graph

Once work spans multiple sessions or subagents, upgrade to a richer task object with:

  • dependencies
  • blockers
  • owner agent
  • parent/child relationships
  • artifact refs
  • timestamps
  • versioning
  • optional human assignee

This follows the "Todos to Tasks" evolution in Claude-related materials and the general direction in your uploaded docs. [U2][U5]

Recommendation: split structured and unstructured state

  • structured: tasks.json, tests.json, milestone statuses
  • unstructured: progress.txt, operator notes, rationale logs

Anthropic's long-running harness article reports that JSON works better than Markdown for structured test and feature state because the model is less likely to rewrite it casually. [E16][E30]

Suggested task states

Use a simple state machine:

  • queued
  • ready
  • in_progress
  • blocked
  • awaiting_user
  • done
  • canceled
  • failed

This makes approvals, subagent delegation, and resume behavior much easier to reason about.
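The state machine can be enforced as a transition table so illegal moves fail loudly in the runtime. The specific transition set below (including a `failed -> queued` retry path) is an assumption layered on the states listed above:

```python
# Allowed task transitions; anything else is rejected by the runtime.
TRANSITIONS = {
    "queued":        {"ready", "canceled"},
    "ready":         {"in_progress", "canceled"},
    "in_progress":   {"blocked", "awaiting_user", "done", "failed", "canceled"},
    "blocked":       {"ready", "canceled"},
    "awaiting_user": {"in_progress", "canceled"},
    "done":          set(),
    "canceled":      set(),
    "failed":        {"queued"},   # retry path (assumption, not in source)
}

def advance(task, new_status):
    """Apply a transition or raise; the model never mutates status directly."""
    if new_status not in TRANSITIONS[task["status"]]:
        raise ValueError(f"illegal transition {task['status']} -> {new_status}")
    task["status"] = new_status
    return task

t = {"task_id": "task_1", "status": "queued"}
advance(t, "ready")
advance(t, "in_progress")
advance(t, "done")
```

Keeping the table in code (not in the prompt) means resume, approvals, and delegation logic can all rely on statuses being internally consistent.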


7. Subagents and multi-agent patterns

First rule: do not start with many agents

LangChain's 2026 guidance is exactly right here: many tasks are best handled by a single agent with good tools, and you should start there. Multi-agent systems add complexity, latency, and token cost. Anthropic's own research notes that multi-agent systems can use dramatically more tokens than chat or single-agent flows, so they need to earn their keep. [E6][E15]

When subagents are worth it

Use subagents when you need one or more of the following:

  • context isolation from exploratory work
  • specialization by prompt or toolset
  • different model/cost profile
  • parallel execution on independent branches
  • separate ownership or maintenance boundaries

These benefits are consistent across Deep Agents, Kimi CLI, Anthropic research, and Claude Code. [U1][U2][U4][E6][E14][E15][E28]

Default pattern: orchestrator-worker

Your first multi-agent pattern should be orchestrator-worker, not peer-to-peer.

  • main agent holds user contract and task-level state
  • worker agents get narrow briefs and isolated contexts
  • workers return only final outputs plus artifact refs
  • workers do not share conversational history directly

Anthropic's research system, Deep Agents, and the best parts of Kimi CLI all point here. [U1][U2][U3][E14][E15]

Default subagent roles to support

  1. General-purpose worker
    Same model/tools as parent, used mainly for context isolation. Deep Agents and Claude Code both have this concept. [E14][E28]

  2. Explore / research worker
    Read/search-heavy, often read-only, optimized for discovery. Claude Code's built-in Explore/Plan subagents show the value of read-only researcher roles. [E28]

  3. Specialist worker
    Narrow domain or tool scope, such as sql-analyst, release-engineer, security-auditor.

  4. Fast/cheap worker
    Lower-cost model for focused subtasks when the orchestrator uses a premium model. [E28]

Recommended subagent rules

  • no recursive subagent spawning by default
  • allow parallel fan-out for independent tasks
  • pass a concise structured brief, not raw chat history
  • return summary + structured result + artifact refs
  • enforce explicit tool restrictions per subagent
  • cap concurrent workers and total token budget

Skills vs subagents vs handoffs vs routers

Use the LangChain four-pattern framing because it is genuinely useful:

| Pattern | Use when | Tradeoff |
| --- | --- | --- |
| Single agent | Most tasks at the start | Simplest, easiest to debug |
| Skills | One agent needs many latent capabilities | Loaded context accumulates over time [E6] |
| Subagents | Need context isolation, specialization, or parallel work | Extra orchestration call(s) [E6][E15] |
| Handoffs | Need sequential stage-based conversations | More stateful and harder to reason about [E6] |
| Router | Need stateless fan-out and synthesis across domains | Repeated routing overhead for conversations [E6] |

Economic guidance

Anthropic reports that multi-agent systems can use around 15x the tokens of chat interactions in their research setting. That does not mean "avoid multi-agent"; it means use it where the value of the task justifies the extra capacity. [E15]


8. Skills and progressive disclosure

What skills are for

Skills are not just "prompt fragments". They are a disciplined way to add latent expertise without bloating the always-loaded system prompt or tool catalog.

Deep Agents and Kimi both support a pattern where the agent learns only the skill name, path, and description up front, then loads the full SKILL.md only when relevant. Anthropic's skills and tool-search tooling point in the same general direction: load detail on demand, not by default. [U1][U2][U4][E2][E9][E21]

Why skills work

Skills solve three problems:

  1. token control - the full instructions are not always in context
  2. distributed ownership - different teams can own different skills
  3. capability growth without tool sprawl - many workflows can be added as instructions rather than first-class tools

Recommended skill format

Use a directory-based format:

skills/
  release-engineering/
    SKILL.md
    templates/
    scripts/
    references/

Include:

  • YAML frontmatter with name, description, optional allowed tools
  • task framing
  • decision rules
  • required artifacts/templates
  • examples
  • links to scripts or reference files

Kimi's skill discovery hierarchy and Deep Agents' progressive loading are both worth copying. [E2][E21]

What to put in a skill vs a tool

Put something in a skill when it is mainly:

  • domain knowledge
  • workflow guidance
  • operational policy
  • templates and examples
  • "how to use these other tools correctly"

Put something in a tool when it is mainly:

  • a capability requiring execution
  • a stateful operation
  • a side effect
  • a special approval/control boundary

9. Human collaboration and governance

Human-in-the-loop is not a failure mode

In 2026, serious agents are still collaborative systems. Anthropic's own product framing and both open-source harnesses assume supervision, questions, and approval gates are normal parts of the runtime. [U4][E12][E13]

Structured questions should be a first-class tool

Implement user.ask / AskUserQuestion as a synchronous pause point that returns structured options. Do not rely on the model to emit parseable markdown questions in free text. Both Kimi and Anthropic's Agent SDK now formalize this pattern. [U2][U5][E12][E19]

Approval engine design

Use four layers of decisioning:

  1. hooks - custom code before execution
  2. static rules - allow/deny/ask policies by tool, path, domain, secret class
  3. permission mode - coarse session mode like ask, accept_edits, bypass_in_sandbox
  4. runtime callback / operator prompt - human final decision when needed

This mirrors Anthropic's documented permission evaluation pipeline and is a robust general model. [E12]

Risk-tier your tools

A useful default:

  • read tier: fs.read, fs.list, web.search
  • soft write tier: fs.write in working dir, artifact creation
  • hard write tier: shell commands, network POSTs, DB mutations, secrets access
  • destructive tier: delete, force-push, schema migrations, production actions

Only the first tier should be auto-allowed by default outside sandboxes.
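Risk tiering becomes enforceable once tool names map deterministically to tiers. A sketch using glob patterns over the namespaced tool names from section 4; the specific pattern lists are assumptions, and unmatched tools fail closed to "ask":

```python
import fnmatch

# Most restrictive tiers are checked first so specific patterns win.
TIERS = [
    ("destructive", ["db.drop*", "git.force_push", "deploy.*"]),
    ("hard_write",  ["exec.*", "code.*", "db.*"]),
    ("soft_write",  ["fs.write", "fs.edit", "task.*"]),
    ("read",        ["fs.read", "fs.list", "fs.search", "web.search"]),
]
AUTO_ALLOW = {"read"}   # only the read tier is auto-allowed outside sandboxes

def decide(tool_name):
    """Return (tier, decision); unknown tools default to 'ask' (fail closed)."""
    for tier, patterns in TIERS:
        if any(fnmatch.fnmatch(tool_name, p) for p in patterns):
            return tier, ("allow" if tier in AUTO_ALLOW else "ask")
    return "unknown", "ask"
```

This lives in the policy engine, so the decision is identical whether the call came from the model, a subagent, or a replay.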

Session-scoped approval memory

Persist "allow for this session" decisions. Kimi restores these on resume, and the pattern is very operator-friendly. [E19][E20]

Hooks are where invariants live

Use deterministic hooks for:

  • secret redaction
  • allowlist/denylist checks
  • provenance logging
  • path normalization
  • budget limits
  • post-tool content scrubbing
  • automatic artifact persistence

This is more reliable than repeatedly asking the model to remember rules.
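A minimal sketch of such a hook pipeline: deterministic functions run before and after every tool call, each either transforming the event or raising. The hook names and the crude secret-matching pattern are illustrative only; real redaction needs a proper secret scanner:

```python
import re

def redact_secrets(event):
    """Post-tool hook: scrub obvious key-like values from tool output."""
    event["content"] = re.sub(
        r"(?i)(api[_-]?key\s*[:=]\s*)\S+",   # toy pattern, illustration only
        r"\1[REDACTED]",
        event["content"],
    )
    return event

def enforce_path_allowlist(event, roots=("/workspace",)):
    """Pre-tool hook: refuse file operations outside allowed roots."""
    path = event.get("path")
    if path and not any(path.startswith(r) for r in roots):
        raise PermissionError(f"path outside allowed roots: {path}")
    return event

PRE_TOOL_HOOKS = [enforce_path_allowlist]
POST_TOOL_HOOKS = [redact_secrets]

def run_hooks(hooks, event):
    """Apply hooks in order; any hook may veto by raising."""
    for hook in hooks:
        event = hook(event)
    return event

evt = run_hooks(POST_TOOL_HOOKS, {"content": "api_key = sk-abc123 done"})
```

Because hooks are plain code on the tool-call path, the invariants hold even when the model forgets, disagrees, or is being manipulated.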


10. Protocol layer: MCP, ACP, A2A

Design principle

Adopt protocol boundaries that map cleanly to the three different relationships in an agent system:

  • agent to tool/resource -> MCP
  • client/editor/UI to local or remote agent runtime -> ACP
  • agent to remote agent -> A2A

Do not force one protocol to do all three jobs. [U3][U4][E18][E24][E25][E26]

MCP - tool and resource integration

MCP is the open standard for connecting agents to external tools and data sources. Both Anthropic and Kimi explicitly position it this way, and the official MCP spec emphasizes user consent, tool safety, and trust boundaries. [E18][E22][E25]

What to adopt

  • stdio and HTTP transports
  • per-server auth handling
  • tool discovery and deferred loading for large catalogs
  • consistent tool namespacing
  • explicit trust and approval model

Security note

The MCP spec warns that tool descriptions and annotations should be considered untrusted unless they come from trusted servers. That means MCP server output and metadata should go through the same policy scrutiny as tool results. [E25]

ACP - client/editor integration

ACP standardizes communication between coding agents and client applications such as IDEs. It uses JSON-RPC 2.0 and models initialization, session setup, prompt turns, updates, and cancellation. Kimi CLI supports ACP; Deep Agents ships ACP integration; the protocol is now solid enough to treat as the default IDE/editor adapter target. [E23][E24]

Why this matters for your harness

If your runtime already has:

  • typed events
  • session IDs
  • prompt/turn boundaries
  • permission request events
  • file operation events

then ACP is mostly an adapter problem, not a redesign problem.

Internal design implication

Your internal event bus should look ACP-like even if you do not expose ACP immediately.
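As a sketch of what "ACP-like" means internally: typed events that map one-to-one onto JSON-RPC 2.0 notifications. The event classes and method strings below are illustrative placeholders, not ACP's actual wire types.

```python
from dataclasses import dataclass
from typing import Union

# Illustrative internal events; names are ours, not ACP's schema.
@dataclass(frozen=True)
class SessionStarted:
    session_id: str

@dataclass(frozen=True)
class TurnUpdate:
    session_id: str
    text: str

@dataclass(frozen=True)
class PermissionRequested:
    session_id: str
    action: str  # e.g. "fs.write:/src/main.py"

@dataclass(frozen=True)
class Cancelled:
    session_id: str

Event = Union[SessionStarted, TurnUpdate, PermissionRequested, Cancelled]

def to_jsonrpc(event: Event) -> dict:
    """Adapter sketch: map internal events onto JSON-RPC 2.0 notifications,
    the framing ACP uses. Method names here are placeholders."""
    method = {
        SessionStarted: "session/started",
        TurnUpdate: "session/update",
        PermissionRequested: "session/request_permission",
        Cancelled: "session/cancelled",
    }[type(event)]
    return {"jsonrpc": "2.0", "method": method, "params": vars(event)}
```

If the runtime already emits events shaped like this, the ACP adapter is a thin translation layer rather than a rework of the core loop.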

A2A - remote agent interoperability

A2A is the right layer for remote agent-to-agent delegation and task exchange. The official docs now describe it as the common language for agent interoperability, with task objects, streaming, push notifications, and Agent Cards for discovery. [E26]

What to adopt

  • Agent Cards for remote capability discovery
  • task-based remote execution
  • artifacts as outputs
  • SSE streaming for long-running tasks
  • push notifications for disconnected clients or jobs

Important protocol insight

A2A explicitly separates messages from artifacts, and says results should generally be returned as task artifacts rather than chat messages. That is highly compatible with the artifact-first design recommended in this blueprint. [E26]
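The message/artifact split can be sketched as below. Field names and the `artifact://` URI scheme are our illustrative assumptions, not the A2A wire schema.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    name: str
    uri: str         # handle into the artifact store
    media_type: str

@dataclass
class TaskResult:
    task_id: str
    status: str                                             # e.g. "completed"
    messages: list[str] = field(default_factory=list)       # short status chatter
    artifacts: list[Artifact] = field(default_factory=list) # the real outputs

def complete_with_report(task_id: str, report_uri: str) -> TaskResult:
    """Return substantive output as an artifact ref, not a chat blob."""
    return TaskResult(
        task_id=task_id,
        status="completed",
        messages=["Report generated; see artifact."],
        artifacts=[Artifact("report", report_uri, "text/markdown")],
    )
```

Keeping results in `artifacts` means downstream agents and UIs fetch them by handle instead of re-parsing chat text.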

Recommended layering model

flowchart TD
    CORE[Core Runtime]
    CORE --> MCPA[MCP Adapter]
    CORE --> ACPA[ACP Adapter]
    CORE --> A2AA[A2A Adapter]

    MCPA --> TOOLS[External Tools and Data Sources]
    ACPA --> SURFACES[IDE / CLI / Web Clients]
    A2AA --> REMOTE[Remote Agents]


Optional future-facing note

MCP is already growing beyond plain tool calls into UI-capable extensions. Even if you do not adopt these immediately, design your content/event model so tool outputs can eventually include richer UI payloads without breaking the core runtime. [E25]


11. Security architecture

Principle

Assume the model is competent but not trustworthy enough to be your only control plane. Boundaries must be enforced in tools, policy engine, sandbox, network layer, and storage layer. Deep Agents states this bluntly: enforce boundaries at the tool/sandbox level, not by expecting the LLM to self-police. [U2][E3]

Required controls

Filesystem controls

  • absolute-root enforcement
  • path normalization
  • symlink and traversal defense
  • restricted write scopes
  • artifact-only zones vs source zones

Deep Agents and Kimi both hardened file path handling over time, which is exactly what you should expect to do too. [E3][E23]

Sandbox controls

  • isolated container or VM
  • network egress off by default
  • package install restrictions
  • resource quotas
  • process timeout
  • syscall restrictions if you control the sandbox

Secret controls

  • never expose .env or raw secret stores without policy checks
  • inject secrets into tools only when needed
  • log secret access as events
  • redact secrets from tool results before they hit the model context

Tool output injection defense

  • treat fetched web pages, MCP output, and external docs as untrusted
  • strip or neutralize obvious "ignore prior instructions" prompt injections
  • separate raw fetched content from normalized summaries
  • require human approval for sensitive actions even if the model was instructed by external content

Audit and provenance

Every external action should leave:

  • who/what requested it
  • inputs
  • approval path
  • outputs/artifact refs
  • timestamps
  • hashes where relevant
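The audit fields above suggest a record shape like the following. The schema is illustrative; hashing inputs instead of storing them raw is one way to keep secrets out of the audit trail while preserving verifiability.

```python
import hashlib
import json
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class ProvenanceRecord:
    """One audit entry per external action; field names are ours."""
    requested_by: str              # agent, subagent, or operator id
    action: str
    inputs_hash: str               # hash, not raw inputs, which may hold secrets
    approval_path: str             # e.g. "auto-policy" or "operator:alice"
    artifact_refs: tuple[str, ...]
    timestamp: float

def record_action(requested_by: str, action: str, inputs: dict,
                  approval_path: str,
                  artifact_refs: tuple[str, ...]) -> ProvenanceRecord:
    # sort_keys makes the hash stable across dict orderings
    digest = hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode()
    ).hexdigest()
    return ProvenanceRecord(requested_by, action, digest, approval_path,
                            artifact_refs, time.time())
```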

Trust zones

Define at least three trust zones:

  1. Model context zone - reasoning buffer, low trust
  2. Execution zone - sandboxed code/tools, medium trust with policy control
  3. Operator and system-of-record zone - approvals, secrets, production integrations, high trust

This framing makes it easier to reason about what data can flow where.


12. Observability, evaluation, and product metrics

Instrument the harness, not just the model

You need visibility into:

  • model calls
  • tool calls
  • approval waits
  • compaction events
  • subagent fan-out
  • artifact creation
  • resume/replay behavior
  • failure and rollback paths

A Kimi-style typed event bus and LangSmith/LangGraph-style traces are both strong inspirations here. [U1][E1][E23]

Metrics to watch from day one

Core runtime health

  • session resume success rate
  • step failure rate
  • retry rate per step
  • cancellation rate
  • mean steps per completed task

Context economics

  • prompt cache hit rate
  • uncached token share
  • compaction frequency
  • average artifact bytes per task
  • percentage of tool results evicted to artifacts

Tooling quality

  • tool selection accuracy
  • average tool count considered per step
  • deferred-tool load rate
  • tool latency
  • percentage of tool results requiring follow-up reads

Human collaboration

  • approvals per task
  • time waiting for approval
  • AskUserQuestion frequency
  • operator override rate

Multi-agent efficiency

  • subagent spawn rate
  • parallel branch count
  • orchestration overhead
  • percent of tasks where subagents improved latency or quality
  • token cost of orchestration vs direct execution

Evaluation strategy

Run three layers of evaluation:

  1. Unit tests for tools and policies
    Deterministic correctness.

  2. Task-level replay evals
    Fixed prompts and expected artifacts or state transitions.

  3. Long-horizon harness evals
    Resume after interruption, compaction correctness, approval branching, artifact recovery, and subagent handoffs.

Anthropic's tooling guidance strongly recommends evaluating tools with agents rather than assuming normal unit tests are enough. [E17]


13. Recommended repo structure

agent-harness/
  runtime/
    loop/
    sessions/
    checkpoints/
    events/
    subagents/
  context/
    compaction/
    caching/
    artifact_refs/
    serializers/
  tools/
    builtin/
    wrappers/
    policies/
    schemas/
  protocols/
    mcp/
    acp/
    a2a/
  sandboxes/
    local/
    remote/
    execution_bridge/
  memory/
    conventions/
    task_store/
    artifact_store/
  skills/
    builtin/
    project/
  ui/
    cli/
    web/
    ide/
  evals/
    task_suites/
    replay/
    regression/
  docs/
    ARCHITECTURE.md
    AGENTS.md

Why this layout works

  • isolates stable interfaces from fast-changing prompts and skills
  • keeps protocol adapters separate from core runtime
  • makes the artifact/memory system visible as a first-class subsystem
  • makes evaluation a product surface, not an afterthought

14. Suggested starter defaults

These are starting defaults, not eternal constants.

| Knob | Recommended start | Rationale |
| --- | --- | --- |
| max_steps_per_turn | 50-100 | Kimi defaults to 100, which is a reasonable ceiling for serious tasks. [E18] |
| max_retries_per_step | 2-3 | Enough for transient failures without hiding real problems. [E18] |
| reserved_context_tokens | 20-25% of window, or about 50k on 200-262k models | Leaves space for output + compaction prompts before hard failure. [U4][E18] |
| Large-result eviction threshold | 8k-16k tokens equivalent | High enough to keep small outputs inline, low enough to prevent bloat. |
| Subagent nesting | Off by default | Prevents delegation loops and hidden costs. [U1][E28] |
| Deferred-tool loading | Enable when tool schema mass is large or tool count exceeds comfortable selection range | Protects context budget and selection quality. [E9][E18] |
| Approval mode | Ask for writes, shell, network egress, deletes | Sensible default trust boundary. [E12][E19] |
| Artifact persistence | Always on for search/fetch/code outputs | Enables recovery, replay, and long-horizon work. |
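The knobs above can be captured in a single config object so they are explicit and overridable per deployment. This is a sketch under our own naming; the field names, tool identifiers, and the 0.25 fraction are assumptions drawn from the table, not any framework's config schema.

```python
from dataclasses import dataclass

@dataclass
class HarnessDefaults:
    """Starter knobs as a config object; values are starting points,
    not constants, and field names are illustrative."""
    max_steps_per_turn: int = 100
    max_retries_per_step: int = 3
    reserved_context_fraction: float = 0.25   # ~50k tokens on a 200k window
    eviction_threshold_tokens: int = 8_000    # artifactize results above this
    allow_subagent_nesting: bool = False
    deferred_tool_loading: bool = False       # enable when catalogs grow
    approval_required_actions: frozenset[str] = frozenset(
        {"fs.write", "shell.exec", "net.egress", "fs.delete"})
    always_artifact_tools: frozenset[str] = frozenset(
        {"web.search", "web.fetch", "code.exec"})

def reserved_tokens(cfg: HarnessDefaults, window: int) -> int:
    """Tokens to hold back from the prompt budget for output + compaction."""
    return int(window * cfg.reserved_context_fraction)
```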

15. Implementation roadmap

Phase 0 - Architecture decisions (1-2 weeks)

Lock these decisions before heavy coding:

  • runtime model (graph/checkpoint runtime vs hand-rolled)
  • artifact store format and URI scheme
  • sandbox strategy
  • policy engine architecture
  • protocol boundaries
  • session and task schemas

Exit criteria

  • written state schema
  • event taxonomy
  • tool naming convention
  • security boundary map

Phase 1 - Single-agent durable harness (2-4 weeks)

Build:

  • append-only event log
  • checkpointed session runtime
  • file/artifact store
  • minimal built-in tool set
  • approval engine
  • typed traces
  • basic compaction / artifact eviction
  • CLI or API surface

Do not build yet

  • remote A2A
  • dynamic subagents
  • huge MCP catalogs
  • exotic skills

Exit criteria

  • 30-100 step tasks complete reliably
  • resume works
  • approval pause/resume works
  • large outputs are artifactized

Phase 2 - Context and capability scaling (2-4 weeks)

Add:

  • stable prompt layout and cache instrumentation
  • deferred tool loading / tool search
  • skills
  • AGENTS.md memory
  • better compaction / restart heuristics
  • eval suite for long-horizon tasks

Exit criteria

  • cache hit rate is measurable and stable
  • long-context tasks avoid runaway token growth
  • new domain knowledge can be added through skills without tool sprawl

Phase 3 - Subagents and protocol adapters (3-6 weeks)

Add:

  • general-purpose subagent
  • explore/research subagent
  • explicit handoff contract
  • ACP adapter
  • MCP connector hardening
  • parallel branch limits and metrics

Exit criteria

  • subagents materially reduce token bloat or latency on chosen workloads
  • IDE integration works through ACP or equivalent
  • MCP tool injection risks are contained by policy

Phase 4 - Enterprise hardening and remote delegation (4-8 weeks)

Add:

  • remote sandboxes
  • stricter egress policies
  • A2A adapter for remote specialists
  • richer audit and compliance
  • advanced rollback and replay
  • operator workflows for approval queues

Exit criteria

  • clear trust zone boundaries
  • disaster recovery / replay story
  • remote delegation works without leaking internal state

16. What to steal directly from each source

From Deep Agents

  • middleware composition
  • pluggable filesystem backends
  • built-in general-purpose subagent
  • skill loading with progressive disclosure
  • harness framing: planning + filesystem + subagents + context management [U1][E1][E2][E3]

From LangGraph

  • checkpoint-based durable execution
  • persistence threads
  • human-in-the-loop pause/resume
  • idempotent task boundaries for replay [E5]

From Kimi CLI

  • session state persistence beyond chat history
  • explicit loop-control knobs like reserved_context_size
  • structured AskUserQuestion UX
  • event/protocol decoupling via Wire-style messages
  • config-driven agent definitions and inheritance [U1][U4][E18][E19][E20][E21][E23]

From Anthropic / Claude patterns

  • prompt caching discipline
  • tool search with deferred loading
  • programmatic tool calling
  • approval callbacks and layered permission logic
  • initializer-agent pattern for long-running work
  • fresh-window restarts when externalized state is strong [U2][U5][E7][E8][E9][E10][E12][E16][E30]

From the Manus-style ideas captured in your materials

  • context engineering as the primary systems problem
  • mask, do not remove, when constraining actions
  • keep errors in context
  • use files to manipulate attention and preserve recoverability [U2][U3][U5]

From open protocols

  • MCP for tool interoperability
  • ACP for editor/client interoperability
  • A2A for remote agent interoperability and task exchange [U3][U4][E24][E25][E26]

17. Anti-patterns to avoid

  1. Treating the agent as "just a prompt".
    This fails as soon as tasks span many steps, tools, or sessions. [U2][U3][E1]

  2. Dynamically rewriting the toolset mid-session.
    Breaks cache locality and confuses state. [U2][U5][E7]

  3. Letting raw tool outputs flood the context.
    Artifactize early. [U1][U4][E10]

  4. Using multi-agent because it sounds advanced.
    Use it only when context isolation or parallelism is clearly valuable. [E6][E15]

  5. Relying on the model to self-police risky actions.
    Put enforcement in policy and sandbox layers. [U4][E12][E25]

  6. Returning verbose, human-style tool blobs.
    Return structured, compact outputs plus artifact refs. [E17]

  7. Hiding errors from the model.
    Preserve error evidence unless redaction is necessary. [U2][U5]

  8. Overloading the system prompt with all possible instructions.
    Use skills, tool search, and read/search tools instead. [U2][U4][E9][E21]

  9. Skipping evaluation of the harness itself.
    Tool tests are not enough; evaluate resumption, compaction, approvals, and delegation. [E17]

  10. Conflating protocols.
    MCP, ACP, and A2A solve different problems. Use each where it fits. [E24][E25][E26]


18. The concrete blueprint I would build

If I were starting this harness tomorrow, I would build the following first version:

Core

  • a durable orchestrator runtime with checkpointing
  • append-only JSONL event log
  • Postgres or SQLite for session/task metadata
  • object store or filesystem artifact store with content-addressed blobs
  • a stable tool catalog with namespaces

Built-in primitives

  • file read/write/edit/search
  • web search/fetch
  • sandboxed code execution
  • task graph primitive
  • structured user question tool
  • subagent delegation tool

Context system

  • static prompt prefix
  • AGENTS.md and project memory loading
  • immediate artifactization of large outputs
  • reserved output budget
  • compaction + fresh restart policy

Governance

  • pre-tool hooks
  • policy rules
  • approval callbacks
  • sandbox allowlists
  • provenance log

Surfaces

  • one CLI or API first
  • internal typed event bus
  • ACP adapter second
  • web UI later

Scale-up path

  • skills
  • deferred tool loading / tool search
  • general-purpose subagent
  • specialist subagents
  • MCP expansion
  • A2A remote workers only after local harness is stable

That combination gives you a system that is already "modern agentic" by 2026 standards without overcommitting to every new trend.


19. Final recommendation

The best 2026 harness is not the one with the most tools, the most agents, or the biggest context window. It is the one that:

  • preserves a stable cached prefix
  • externalizes state to artifacts and files
  • exposes a small, ergonomic action space
  • uses code execution to compress data before the model sees it
  • delegates work into isolated contexts only when needed
  • enforces safety in deterministic runtime layers
  • can resume, replay, inspect, and audit every important step

That is the throughline across the materials you shared and the current official docs. Build that foundation first. Everything else - skills, richer protocols, remote swarms, UI polish, domain specialization - compounds on top of it.


References

User-provided source documents

  • [U1] compass_artifact_wf-b9580dc8-f513-4de3-bf7a-7e6dbb6d5df8_text_markdown.md - user-provided synthesis on architectural patterns for modern agentic systems.
  • [U2] modern-agent-architecture-guide.md - user-provided guide synthesizing Claude Code, Manus, Deep Agents, and Kimi CLI patterns.
  • [U3] Building a 2026 Agentic System.docx - user-provided architectural blueprint with neuro-symbolic framing, protocol sections, and implementation guidance.
  • [U4] Starter Plan for a Modern Agentic System in 2026.pdf - user-provided starter blueprint focused on runtime primitives, context engineering, and milestones.
  • [U5] agent_ideas.docx - user-provided notes including Thariq, Lance Martin, Manus, and Deep Agents excerpts.

External references and official docs
