Modern Agent Harness Blueprint 2026

Blueprint for a Modern Agentic Harness in 2026

What this document is

This is an early-stage architecture blueprint for building a modern agentic harness from the ground up in 2026. It is not a product spec and not a vendor pitch. It is a practical plan for how to structure the runtime, state model, context system, tool layer, subagent orchestration, approvals, protocols, and observability so the harness stays useful as models, tools, and deployment surfaces evolve.

This blueprint synthesizes the uploaded source material with current official documentation from Anthropic, LangChain/LangGraph, Moonshot Kimi CLI, MCP, ACP, and A2A. Where the sources conflict, the plan below favors patterns that appear repeatedly across multiple production systems or formal protocols. [U1][U2][U3][U4][U5][E1][E7][E15][E18][E24][E26]


Executive summary

If you only remember six things, remember these:

  1. The harness matters more than the loop. The model-tool loop is now commodity. Differentiation comes from context engineering, durable state, policy enforcement, externalized memory, and protocol design. [U1][U2][E1][E5][E13]

  2. Design around cache stability first. Prompt caching is not a small optimization. It changes your entire architecture: stable prompt prefix, append-only history, fixed tool catalog per session, and state transitions modeled as messages or mode flags rather than prompt rewrites. [U2][U5][E7][E27]

  3. Treat the filesystem and artifact store as working memory. Large tool outputs, notes, plans, recovered state, and handoffs should live outside the model context and be referenced by handles. This is the only reliable way to scale beyond short tasks without drowning the model in its own history. [U1][U2][U4][U5][E1][E3][E14]

  4. Keep the built-in action space small and stable. Start with a compact set of high-leverage primitives: file ops, search/read, code execution or shell, planning/tasks, subagent delegation, and structured user elicitation. Add new tools only when they improve control, guardrails, concurrency, observability, or UX. [U1][U2][U5][E17]

  5. Use subagents for context isolation, not because "multi-agent" sounds advanced. Start with a single agent. Add subagents only when you need parallel exploration, specialized prompts/tooling, separate context windows, or explicit ownership boundaries. [U1][U2][U4][E6][E14][E15]

  6. Put guardrails in the runtime, not the prompt. The model should never be the only enforcement layer. Destructive tools, secret access, network egress, and external writes need deterministic policy checks, approval gates, sandbox restrictions, and audit trails. [U2][U4][U5][E12][E17][E25]


Decision summary

| Area | Recommended default | Why |
| --- | --- | --- |
| Core runtime | Durable state machine / graph runtime with checkpointing | Long-running work needs pause/resume, replay, fault tolerance, and human interruption support. [E1][E5] |
| Session history | Append-only event log plus typed state snapshot | Best for cache stability, replay, auditability, and deterministic recovery. [U3][U4][E7] |
| Working memory | Artifact-first filesystem plus metadata store | Offloads context, preserves recoverability, supports handoffs and resumption. [U1][U4][E1][E3] |
| Built-in tools | Narrow, namespaced primitives | Small stable action space improves selection quality and keeps cache-friendly prefixes. [U1][U2][U5][E17] |
| Large data handling | Programmatic tool calling or sandboxed code execution | Keeps intermediate data out of the model context and reduces round trips. [U2][U3][U4][U5][E10][E11] |
| Planning | Task graph or structured todo primitive | Acts as attention control and coordination state, not as a workflow engine by itself. [U2][U5][E1][E16] |
| Multi-agent | Orchestrator-worker subagents by default | Strongest payoff for context isolation and parallel work with manageable complexity. [U1][U2][E6][E15] |
| Protocols | MCP for tools, ACP for IDE/client surfaces, A2A for remote agent-to-agent delegation | Clean separation of concerns and future-proof interoperability. [U3][U4][E18][E24][E25][E26] |
| Human collaboration | Structured question tool plus approval policies | Faster, more deterministic than plain-text back-and-forth. [U2][U4][U5][E12][E19] |
| Security | Sandbox + policy engine + audit log | Practical trust boundary for real tool use. [U2][U4][E3][E12][E25] |

The architectural thesis

A modern harness should treat the LLM as the control plane for reasoning and planning, while the rest of the system handles state, execution, storage, approvals, transport, and observability. This is a neuro-symbolic split in practice, not in theory. The more you can move determinism, memory, and policy into the harness, the more reliable the overall system becomes. [U3][E1][E5][E13]

The harness should therefore be built around five stable layers:

  1. Execution runtime - the event loop, session manager, checkpointing, and recovery.
  2. Context system - prompt layout, artifact references, compaction, and cache discipline.
  3. Capability surface - built-in tools, external tools, skills, and subagents.
  4. Governance layer - approvals, hooks, allow/deny policy, sandboxing, provenance.
  5. Surface/protocol adapters - CLI, IDE, web UI, ACP, MCP, and optionally A2A.

Reference architecture

flowchart LR
    U[User or Calling System] --> S[Surface Layer: CLI / IDE / Web / API]
    S --> A[Client Adapter / ACP / REST]

    A --> R[Agent Runtime]
    R --> ST[Session State + Checkpoints]
    R --> PL[Plan / Task Graph]
    R --> EV[Typed Event Bus]
    R --> PO[Policy Engine + Approvals + Hooks]

    R --> TR[Tool Router]
    TR --> BT[Built-in Tools]
    TR --> MCP[MCP Connector]
    TR --> SX[Sandbox / Code Execution / PTC]
    TR --> FS[Artifact Store / Virtual Filesystem]

    R --> SG[Subagent Manager]
    SG --> R

    R --> MEM[Long-term Memory / AGENTS.md / Conventions]
    R --> A2A[A2A Adapter for Remote Agents]
    EV --> OBS[Tracing / Metrics / Replay / Eval Harness]

Core idea of the diagram

  • The runtime owns the loop, state, and control.
  • The tool router is the gateway to all side effects.
  • The artifact store / virtual filesystem is the main external memory substrate.
  • The subagent manager is a specialization mechanism and a context pressure valve.
  • The policy engine is the governance boundary.
  • The surface layer is replaceable; the engine should not depend on a specific UI. Kimi's Wire-style decoupling and ACP both reinforce this design. [U1][E18][E23][E24]

1. Runtime and state model

What to build

Your runtime should manage:

  • session creation and resume
  • append-only message/event history
  • deterministic state snapshots
  • step execution and retry policy
  • compaction and fresh-window restarts
  • human pause/resume
  • subagent spawning and result capture
  • cancellation and timeout handling
  • audit and replay

This is exactly why LangGraph emphasizes durable execution and persistence through checkpoints, and why Kimi CLI persists not just the conversation but also approvals, dynamic subagents, and added workspace directories across resume. [U1][U4][E1][E5][E20]

Recommended state entities

Session

A session is the resumable unit for a conversation or job.

{
  "session_id": "sess_...",
  "thread_id": "thread_...",
  "created_at": "2026-03-01T12:00:00Z",
  "mode": "execute",
  "model_profile": "orchestrator-default",
  "tool_catalog_version": "v1",
  "approval_mode": "ask",
  "context_state": {
    "compacted": false,
    "recent_summary_ref": "artifact://..."
  }
}

Task

A task is the coordination object.

{
  "task_id": "task_...",
  "title": "Draft migration plan",
  "status": "in_progress",
  "owner": "main-agent",
  "dependencies": ["task_1", "task_2"],
  "blockers": [],
  "artifact_refs": ["artifact://plan.md"],
  "updated_at": "2026-03-01T12:10:00Z"
}

Artifact

Artifacts are durable outputs and large intermediate objects.

{
  "artifact_id": "artifact_...",
  "uri": "artifact://reports/search-results-01.json",
  "mime_type": "application/json",
  "summary": "Search results for Azure pricing pages",
  "sha256": "....",
  "source": {
    "tool": "web.search",
    "session_id": "sess_...",
    "step_id": "step_..."
  }
}

Step

A step is a single model decision or tool execution event.

{
  "step_id": "step_...",
  "kind": "tool_call",
  "tool_name": "fs.read",
  "status": "completed",
  "started_at": "2026-03-01T12:05:00Z",
  "ended_at": "2026-03-01T12:05:01Z",
  "artifact_refs": [],
  "error": null
}

Runtime rules

  1. Checkpoint before and after external side effects. LangGraph's checkpointers and Kimi's per-step session persistence both support this general rule. [E5][E20]
  2. Make replay deterministic. Wrap side effects in tasks or tool execution envelopes so replay does not accidentally re-run destructive actions. LangGraph explicitly recommends deterministic/idempotent design for durable execution. [E5]
  3. Store both typed state and event history. State gives you fast resume; event history gives you replay, analytics, and debugging.
  4. Use cancellation as a first-class primitive. ACP and A2A both model cancellation/interrupt flows, so your internal runtime should too. [E24][E26]
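Rules 1 and 2 above can be sketched as a single execution envelope. This is a minimal in-memory illustration; the `CheckpointStore` and `run_step` names are hypothetical, and a real backend would be a durable checkpointer (e.g. a LangGraph checkpointer or Kimi-style per-step persistence):

```python
import json

class CheckpointStore:
    """In-memory stand-in for a durable checkpoint backend (hypothetical API)."""
    def __init__(self):
        self.checkpoints = []
        self.completed_steps = {}  # step_id -> recorded tool result

    def save(self, label, state):
        self.checkpoints.append((label, json.dumps(state)))

def run_step(store, state, step_id, tool_fn, *args):
    """Checkpoint before and after an external side effect; on replay,
    return the recorded result instead of re-running the tool, so replay
    cannot re-fire a destructive action."""
    if step_id in store.completed_steps:          # replay path
        return store.completed_steps[step_id]
    store.save(f"pre:{step_id}", state)           # checkpoint before
    result = tool_fn(*args)                       # the actual side effect
    store.completed_steps[step_id] = result
    store.save(f"post:{step_id}", state)          # checkpoint after
    return result

# Usage: a second call with the same step_id does not re-run the tool.
calls = []
def fake_write(path):
    calls.append(path)
    return {"status": "ok", "path": path}

store = CheckpointStore()
state = {"session_id": "sess_1"}
r1 = run_step(store, state, "step_1", fake_write, "plan.md")
r2 = run_step(store, state, "step_1", fake_write, "plan.md")  # replayed
```

The replay guard is what makes rule 2 hold: determinism comes from the recorded envelope, not from hoping the tool is naturally idempotent.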

2. Context engineering and cache-first design

Design principle

Treat context as a scarce, actively managed resource. Context engineering is not "prompt polish"; it is the main systems problem in long-running agents. [U2][U3][U4][U5][E14]

Stable prompt layout

The most cache-friendly ordering is:

  1. static system prompt and always-present tool stubs
  2. project memory / AGENTS.md / conventions
  3. session-level state summary
  4. recent messages and tool results
  5. latest user turn

Anthropic's prompt caching docs are explicit: the cache hierarchy is tools -> system -> messages, and changes to tools invalidate the whole cache below that point. [U2][U5][E7]

Rules that follow from caching

  • Do not add or remove tools mid-session unless you are willing to lose cache locality. Prefer deferred loading or masks. [U2][U5][E7][E9]
  • Do not switch models mid-session for trivial reasons. If you need a different model, spawn a subagent or fresh worker session. [U2][U5]
  • Do not rewrite the system prompt for dynamic state changes. Send reminders or state updates as messages. [U2][U5][E7]
  • Keep serialization deterministic. Even small ordering changes can break cache reuse. [U2][U5]
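These rules reduce to one invariant: the request prefix (tools, then system) must be byte-identical across turns, with only appended messages changing. A sketch, assuming an Anthropic-style request shape; the `cache_control` field mirrors the prompt-caching API but should be treated as illustrative rather than authoritative:

```python
def build_request(tool_catalog, system_prompt, project_memory, history, user_turn):
    """Assemble a request whose prefix (tools + system) is stable across
    turns, so only the appended messages miss the cache."""
    return {
        "tools": tool_catalog,  # fixed catalog for the whole session
        "system": [
            {"type": "text", "text": system_prompt},
            # Cache breakpoint after the last stable block (illustrative shape).
            {"type": "text", "text": project_memory,
             "cache_control": {"type": "ephemeral"}},
        ],
        # History is append-only; dynamic state goes here as messages,
        # never into the system prompt.
        "messages": history + [{"role": "user", "content": user_turn}],
    }

# Two consecutive turns share an identical prefix: only messages grow.
tools = [{"name": "fs.read"}]
h = []
r1 = build_request(tools, "You are an agent.", "AGENTS.md contents", h, "hi")
h = r1["messages"] + [{"role": "assistant", "content": "hello"}]
r2 = build_request(tools, "You are an agent.", "AGENTS.md contents", h, "next")
```

If a state change must reach the model, append it as a message or mode flag; rewriting `system_prompt` would invalidate everything below the tools block.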

Multi-tier context management plan

The best synthesis across the sources is a five-tier policy (Tier 0 through Tier 4):

Tier 0 - Structured outputs by default

Tool results should already be concise, typed, and artifact-backed. This reduces the need for later cleanup. Anthropic's tool-writing guidance strongly favors returning meaningful but token-efficient context. [E17]

Tier 1 - Immediate large-result eviction

If a tool returns a large object, write the full result to an artifact and return only:

  • a short summary
  • the first few relevant lines or rows
  • a stable artifact handle
  • metadata (size, type, location, provenance)

This is how Deep Agents handles large results conceptually, and it should be your baseline pattern. [U1][U2][E1]
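A minimal sketch of this eviction wrapper, with a hypothetical in-memory artifact store standing in for real storage; the threshold and field names are assumptions, not a standard:

```python
import hashlib

ARTIFACTS = {}  # stand-in for a real artifact store

def evict_large_result(tool_name, result_text, max_inline=500, preview_lines=3):
    """Tier 1: if a tool result is large, persist the full payload as an
    artifact and return only a summary, a short preview, and a handle."""
    if len(result_text) <= max_inline:
        return {"status": "ok", "inline": result_text}
    digest = hashlib.sha256(result_text.encode()).hexdigest()
    uri = f"artifact://{tool_name}/{digest[:12]}.txt"
    ARTIFACTS[uri] = result_text          # full payload stays out of context
    lines = result_text.splitlines()
    return {
        "status": "ok",
        "summary": f"{tool_name} returned {len(result_text)} chars "
                   f"({len(lines)} lines)",
        "preview": lines[:preview_lines], # first few relevant lines
        "artifact_ref": uri,              # stable handle for later reads
        "meta": {"size": len(result_text), "sha256": digest},
    }

big = "\n".join(f"row {i}" for i in range(1000))
env = evict_large_result("web.fetch", big)
```

The handle lets the model (or a subagent) re-read exactly the slice it needs later, instead of carrying the whole payload forward turn after turn.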

Tier 2 - Deferred input/result eviction

When context approaches roughly 80-90% of the safe usable window, rewrite old bulky tool inputs or old results into references. Kimi exposes reserved_context_size and triggers auto-compaction before hard failure; Deep Agents uses proactive summarization middleware. [U1][U4][E18][E20]

Tier 3 - Compaction / summarization

When the window is still too full, summarize older history into:

  • current goal
  • state achieved so far
  • open tasks
  • key decisions and assumptions
  • artifact refs
  • next recommended step

Anthropic now recommends server-side compaction for long-running workflows where available. [E8][E27]

Tier 4 - Fresh-window restart

When state has been externalized well, a brand new context window can be better than compaction. Anthropic explicitly recommends considering fresh restarts when the model can rediscover state from files, tests, progress notes, and git history. [E30][E16]

When to compact vs when to restart fresh

Use compaction when:

  • the interaction is conversational and continuity matters
  • the task depends on subtle negotiation or conversational nuance
  • the model must preserve recent reasoning state inline

Use fresh-window restart when:

  • the task has strong externalized state in files/artifacts
  • work has clear milestone checkpoints
  • the model can cheaply reconstruct state from AGENTS.md, progress.txt, tests.json, git history, or task objects [E16][E30]
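When restarting fresh, the runtime can assemble an orientation message from externalized state rather than carrying history over. A sketch using the file conventions recommended later in this blueprint (AGENTS.md, progress.txt, tasks.json); the function name and exact message layout are assumptions:

```python
import json
from pathlib import Path

def bootstrap_fresh_window(workdir):
    """Rebuild a compact orientation message for a brand-new context
    window from externalized state on disk."""
    root = Path(workdir)
    parts = []
    agents = root / "AGENTS.md"
    if agents.exists():
        parts.append("## Project conventions\n" + agents.read_text())
    progress = root / "progress.txt"
    if progress.exists():
        # Only the tail of the progress log: recent state matters most.
        tail = progress.read_text().splitlines()[-20:]
        parts.append("## Recent progress\n" + "\n".join(tail))
    tasks = root / "tasks.json"
    if tasks.exists():
        # Carry only open tasks into the fresh window.
        open_tasks = [t for t in json.loads(tasks.read_text())
                      if t["status"] != "done"]
        parts.append("## Open tasks\n" + json.dumps(open_tasks, indent=2))
    return "\n\n".join(parts)
```

The better your Tier 1/Tier 2 discipline, the more often this path beats compaction, because the files already contain everything a fresh window needs.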

Attention anchoring

Long tasks drift. Use explicit progress recitation:

  • rewrite plan/task state regularly
  • keep "what matters now" near the end of context
  • maintain separate structured task state and freeform progress notes

Anthropic's long-running harness guidance found that structured feature/test files plus progress logs improve continuity across context windows; Manus-inspired notes in your uploaded docs make the same point through the todo.md recitation pattern. [U2][U5][E16][E30]

Error preservation

Do not hide failures from the model. Keep failed actions, stack traces, and rejection reasons in the trace unless they are sensitive and must be redacted. This is one of the highest leverage patterns in the uploaded materials because it turns the agent's own mistakes into in-session learning signal. [U2][U5]

Context diversity

Repetitive action-observation traces can accidentally become few-shot examples that the model blindly mimics. If the system processes many similar items, vary serialization templates slightly and break repetitive rhythms when safe. This is an underappreciated pattern from the user materials that is worth testing in evals. [U2][U5]


3. Filesystem, artifacts, and memory

Recommendation

Make the filesystem or virtual artifact store the default working-memory substrate for the harness. This should not be an optional afterthought. [U1][U2][U4][E1][E3]

Why this matters

It solves four different problems at once:

  1. context overflow - large content moves out of the prompt
  2. recoverability - old information stays addressable
  3. handoffs - subagents and resumed sessions can share state through files/artifacts
  4. human inspectability - operators can inspect what the agent actually saw or produced

Deep Agents exposes a pluggable filesystem surface with backends for in-state memory, local disk, durable store, and sandboxes. That is exactly the kind of abstraction boundary you want. [E1][E3]

Memory layers

Use three memory layers:

A. Short-term session memory

  • recent messages
  • active plan
  • recent tool outcomes
  • approval state
  • subagent registry

This should live in your checkpointed runtime state.

B. Working memory / artifact memory

  • large tool outputs
  • cached fetch results
  • scratch data
  • notes
  • progress logs
  • structured test/status files

This should live in the filesystem or object store.

C. Long-term memory

  • persistent project conventions
  • reusable operator notes
  • AGENTS.md or equivalent
  • recurring policies
  • stable user or org preferences

Deep Agents supports AGENTS.md-style memory files and store-backed memory; Kimi persists session-specific state; Anthropic's docs increasingly assume the filesystem is a first-class rediscovery layer. [E2][E3][E20][E30]

Recommended file conventions

At minimum, standardize these:

  • AGENTS.md - durable project conventions and working agreements
  • progress.txt or progress.md - human-readable progress log
  • tasks.json - structured task graph
  • artifacts/ - large intermediate outputs
  • plans/ - plan drafts and checkpoints
  • reports/ - final or semi-final deliverables
  • run/ - ephemeral execution outputs if you want explicit scratch space

Prefer "search and read" over massive preload

Do not stuff entire codebases or knowledge bases into the prompt. Give the agent strong search/read capabilities and let it build context just in time. This is a repeating lesson in your uploaded materials and aligns with Anthropic's recent context-engineering guidance. [U2][U5][E14][E30]


4. Tool and action-space design

Default built-in primitive set

Start with a deliberately small, namespaced core set:

  • fs.list
  • fs.read
  • fs.write
  • fs.edit
  • fs.search (glob/grep or both)
  • exec.run or code.run
  • task.set / task.update
  • agent.delegate
  • user.ask
  • web.search
  • web.fetch

You do not need 50 first-party tools to start. Deep Agents and Kimi both converge on a compact, high-leverage set, and Anthropic's tooling guidance reinforces a high bar for tool creation. [U1][U2][U5][E1][E18][E19][E17]

Criteria for promoting an action to a dedicated tool

Promote an action out of shell/code execution only if at least one of these is true:

  1. UX needs special rendering
    Example: structured questions should become a UI panel, not free text. [U2][U5][E12][E19]

  2. Guardrails need a deterministic checkpoint
    Example: file edits may require stale-read checks or path validation. [U2][E17]

  3. Concurrency or transaction semantics matter
    Example: read-only actions can parallelize; writes should serialize. [U2]

  4. Observability needs a clean event boundary
    Example: you want exact metrics for search calls or DB writes. [U2][E17]

  5. Approval policy differs by action class
    Example: exec.run may require review while fs.read does not. [U2][E12]

If none of those apply, consider leaving the action in the shell/code sandbox.

Tool catalog strategy

Always-present tools

Keep the 3-5 most common tools loaded and stable. Anthropic's tool search docs recommend keeping the most frequently used tools non-deferred. [E9]

Deferred tools

All rarely used or verbose tool schemas should be discoverable but not loaded by default. Anthropic's tool search and Claude Code MCP docs show the right pattern: stable stubs plus deferred loading. [E9][E18]

Tool naming

Use namespaces and families:

  • fs.*
  • web.*
  • db.*
  • exec.*
  • task.*
  • agent.*
  • user.*

This helps humans, logs, policy rules, and model-side action masking.

Tool output contract

Every tool result should come back in a consistent envelope:

{
  "status": "ok",
  "summary": "2 matching files found",
  "structured": { "matches": 2 },
  "artifact_refs": ["artifact://search/matches.json"],
  "preview": ["src/app.py:12", "tests/test_app.py:48"]
}

This is the single best way to keep tool outputs readable, cache-friendly, and observable.

Tool writing principles

Anthropic's tool-writing guidance is worth adopting directly:

  • choose the right tools to implement
  • use namespacing
  • return meaningful context
  • optimize for token efficiency
  • prompt-engineer tool descriptions/specs
  • run real evaluations for tool quality, not just unit tests [E17]

5. Code execution and programmatic tool calling

Recommendation

Treat code execution as a default capability, not a niche feature for "coding agents". In modern agent systems, code is the best intermediate representation for filtering, joining, transforming, validating, and compressing data before it ever reaches the model context. [U2][U3][U4][U5][E10][E11]

When to use programmatic tool calling (PTC)

Use PTC or equivalent self-managed sandbox orchestration when the agent needs to:

  • loop over many entities
  • transform large result sets
  • batch many API/tool calls
  • validate tabular or numeric output
  • filter web search/fetch results
  • early-terminate on condition checks
  • build structured artifacts from noisy data

Anthropic's programmatic tool calling docs are explicit that the main advantage is keeping intermediate results inside the execution container instead of sending them back through the model at every step. [E10]

Suggested architecture

sequenceDiagram
    participant Runtime
    participant Model
    participant Sandbox
    participant ToolHandler

    Runtime->>Model: prompt + callable tool specs
    Model-->>Runtime: code execution request
    Runtime->>Sandbox: run generated code
    Sandbox->>ToolHandler: typed tool invocation
    ToolHandler-->>Sandbox: tool result
    Sandbox->>ToolHandler: next invocation if needed
    ToolHandler-->>Sandbox: result
    Sandbox-->>Runtime: final stdout/result + artifact refs
    Runtime->>Model: compact final result

Managed vs self-managed PTC

Anthropic's docs outline three general implementation modes: client-side execution, self-managed sandboxed execution, and managed execution. The managed option is easy, but self-managed execution gives you stronger control over network policy, data retention, package policy, and compliance. [E10]

Recommendation for your own harness: build an abstraction that can support both:

  • Managed provider PTC when you want convenience or benchmark leverage
  • Self-managed sandbox when you need enterprise control
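The abstraction can be a small interface that hides which backend executes the code, so policy (network, retention, compliance) varies per backend without touching the runtime. All class and method names here are hypothetical:

```python
from abc import ABC, abstractmethod

class CodeExecutor(ABC):
    """Common interface over managed-provider and self-managed execution."""
    @abstractmethod
    def run(self, code: str, session: str) -> dict: ...

class SelfManagedSandbox(CodeExecutor):
    def __init__(self, allow_network=False):
        self.allow_network = allow_network
    def run(self, code, session):
        # A real implementation would ship code into a container; here we
        # only record the policy decision for illustration.
        return {"backend": "self-managed", "network": self.allow_network,
                "session": session}

class ManagedProviderExecutor(CodeExecutor):
    def run(self, code, session):
        # Would call the provider's managed execution API.
        return {"backend": "managed", "session": session}

def pick_executor(compliance_required: bool) -> CodeExecutor:
    """Route to the self-managed sandbox when enterprise control is needed."""
    return SelfManagedSandbox() if compliance_required else ManagedProviderExecutor()
```

The runtime then calls `pick_executor(...)` per session, and the choice never leaks into prompts or tool schemas, preserving cache stability.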

Sandbox requirements

Your sandbox should support:

  • strict filesystem root
  • CPU / memory / wall-clock limits
  • optional no-network mode
  • explicit egress allowlists
  • package install policy
  • per-run provenance
  • container/session reuse policy
  • artifact mounting

Anthropic's managed PTC uses session-scoped containers with idle expiration; do not hard-code their TTL, but do copy the idea of session-scoped containers with explicit reuse handles. [E10]
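At minimum, the runtime contract for local execution is a filesystem root and a wall-clock limit. This sketch shows only that contract; real deployments should add OS-level isolation (containers, seccomp, network namespaces), which plain `subprocess` does not provide:

```python
import subprocess
import sys
from pathlib import Path

def run_sandboxed(cmd, root, timeout_s=30):
    """Run a command confined to a working root with a wall-clock limit.
    This is a contract sketch, not real isolation."""
    root = Path(root).resolve()
    proc = subprocess.run(
        cmd,
        cwd=root,                 # strict filesystem root as working dir
        capture_output=True,
        text=True,
        timeout=timeout_s,        # wall-clock limit; raises TimeoutExpired
    )
    return {"returncode": proc.returncode, "stdout": proc.stdout,
            "stderr": proc.stderr, "root": str(root)}

out = run_sandboxed([sys.executable, "-c", "print('ok')"], ".")
```

CPU/memory limits, egress allowlists, and package policy belong in the container layer below this interface, not in the agent loop.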

Governance note

If you use provider-managed code execution, verify retention and privacy rules. Anthropic explicitly notes that code execution and PTC are not covered by ZDR arrangements. [E10][E11]


6. Planning and the evolution from todos to tasks

Planning is for cognition first, coordination second

The plan object is not there because the harness cannot sequence steps in code. It is there because it helps the model stay oriented, and later it helps multiple agents coordinate without relying on implicit conversational memory. [U1][U2][U5][E1]

Starter recommendation

Start with a lightweight task.set / task.update or write_todos-style primitive that:

  • rewrites the current plan in full
  • tracks statuses
  • links to artifact refs
  • captures blockers and next step

Then evolve into a task graph

Once work spans multiple sessions or subagents, upgrade to a richer task object with:

  • dependencies
  • blockers
  • owner agent
  • parent/child relationships
  • artifact refs
  • timestamps
  • versioning
  • optional human assignee

This follows the "Todos to Tasks" evolution in Claude-related materials and the general direction in your uploaded docs. [U2][U5]

Recommendation: split structured and unstructured state

  • structured: tasks.json, tests.json, milestone statuses
  • unstructured: progress.txt, operator notes, rationale logs

Anthropic's long-running harness article reports that JSON works better than Markdown for structured test and feature state because the model is less likely to rewrite it casually. [E16][E30]

Suggested task states

Use a simple state machine:

  • queued
  • ready
  • in_progress
  • blocked
  • awaiting_user
  • done
  • canceled
  • failed

This makes approvals, subagent delegation, and resume behavior much easier to reason about.
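The state machine can be enforced as a transition table so illegal moves fail loudly in the runtime. The specific transition set below (including a `failed -> queued` retry path) is an assumption layered on the states listed above:

```python
# Allowed task transitions; anything else is rejected by the runtime.
TRANSITIONS = {
    "queued":        {"ready", "canceled"},
    "ready":         {"in_progress", "canceled"},
    "in_progress":   {"blocked", "awaiting_user", "done", "failed", "canceled"},
    "blocked":       {"ready", "canceled"},
    "awaiting_user": {"in_progress", "canceled"},
    "done":          set(),
    "canceled":      set(),
    "failed":        {"queued"},   # retry path (assumption, not in source)
}

def advance(task, new_status):
    """Apply a transition or raise; the model never mutates status directly."""
    if new_status not in TRANSITIONS[task["status"]]:
        raise ValueError(f"illegal transition {task['status']} -> {new_status}")
    task["status"] = new_status
    return task

t = {"task_id": "task_1", "status": "queued"}
advance(t, "ready")
advance(t, "in_progress")
advance(t, "done")
```

Keeping the table in code (not in the prompt) means resume, approvals, and delegation logic can all rely on statuses being internally consistent.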


7. Subagents and multi-agent patterns

First rule: do not start with many agents

LangChain's 2026 guidance is exactly right here: many tasks are best handled by a single agent with good tools, and you should start there. Multi-agent systems add complexity, latency, and token cost. Anthropic's own research notes that multi-agent systems can use dramatically more tokens than chat or single-agent flows, so they need to earn their keep. [E6][E15]

When subagents are worth it

Use subagents when you need one or more of the following:

  • context isolation from exploratory work
  • specialization by prompt or toolset
  • different model/cost profile
  • parallel execution on independent branches
  • separate ownership or maintenance boundaries

These benefits are consistent across Deep Agents, Kimi CLI, Anthropic research, and Claude Code. [U1][U2][U4][E6][E14][E15][E28]

Default pattern: orchestrator-worker

Your first multi-agent pattern should be orchestrator-worker, not peer-to-peer.

  • main agent holds user contract and task-level state
  • worker agents get narrow briefs and isolated contexts
  • workers return only final outputs plus artifact refs
  • workers do not share conversational history directly

Anthropic's research system, Deep Agents, and the best parts of Kimi CLI all point here. [U1][U2][U3][E14][E15]

Default subagent roles to support

  1. General-purpose worker
    Same model/tools as parent, used mainly for context isolation. Deep Agents and Claude Code both have this concept. [E14][E28]

  2. Explore / research worker
    Read/search-heavy, often read-only, optimized for discovery. Claude Code's built-in Explore/Plan subagents show the value of read-only researcher roles. [E28]

  3. Specialist worker
    Narrow domain or tool scope, such as sql-analyst, release-engineer, security-auditor.

  4. Fast/cheap worker
    Lower-cost model for focused subtasks when the orchestrator uses a premium model. [E28]

Recommended subagent rules

  • no recursive subagent spawning by default
  • allow parallel fan-out for independent tasks
  • pass a concise structured brief, not raw chat history
  • return summary + structured result + artifact refs
  • enforce explicit tool restrictions per subagent
  • cap concurrent workers and total token budget

Skills vs subagents vs handoffs vs routers

Use the LangChain four-pattern framing because it is genuinely useful:

| Pattern | Use when | Tradeoff |
| --- | --- | --- |
| Single agent | Most tasks at the start | Simplest, easiest to debug |
| Skills | One agent needs many latent capabilities | Loaded context accumulates over time [E6] |
| Subagents | Need context isolation, specialization, or parallel work | Extra orchestration call(s) [E6][E15] |
| Handoffs | Need sequential stage-based conversations | More stateful and harder to reason about [E6] |
| Router | Need stateless fan-out and synthesis across domains | Repeated routing overhead for conversations [E6] |

Economic guidance

Anthropic reports that multi-agent systems can use around 15x the tokens of chat interactions in their research setting. That does not mean "avoid multi-agent"; it means use it where the value of the task justifies the extra capacity. [E15]


8. Skills and progressive disclosure

What skills are for

Skills are not just "prompt fragments". They are a disciplined way to add latent expertise without bloating the always-loaded system prompt or tool catalog.

Deep Agents and Kimi both support a pattern where the agent learns only the skill name, path, and description up front, then loads the full SKILL.md only when relevant. Anthropic's skills and tool-search tooling point in the same general direction: load detail on demand, not by default. [U1][U2][U4][E2][E9][E21]

Why skills work

Skills solve three problems:

  1. token control - the full instructions are not always in context
  2. distributed ownership - different teams can own different skills
  3. capability growth without tool sprawl - many workflows can be added as instructions rather than first-class tools

Recommended skill format

Use a directory-based format:

skills/
  release-engineering/
    SKILL.md
    templates/
    scripts/
    references/

Include:

  • YAML frontmatter with name, description, optional allowed tools
  • task framing
  • decision rules
  • required artifacts/templates
  • examples
  • links to scripts or reference files

Kimi's skill discovery hierarchy and Deep Agents' progressive loading are both worth copying. [E2][E21]

What to put in a skill vs a tool

Put something in a skill when it is mainly:

  • domain knowledge
  • workflow guidance
  • operational policy
  • templates and examples
  • "how to use these other tools correctly"

Put something in a tool when it is mainly:

  • a capability requiring execution
  • a stateful operation
  • a side effect
  • a special approval/control boundary

9. Human collaboration and governance

Human-in-the-loop is not a failure mode

In 2026, serious agents are still collaborative systems. Anthropic's own product framing and both open-source harnesses assume supervision, questions, and approval gates are normal parts of the runtime. [U4][E12][E13]

Structured questions should be a first-class tool

Implement user.ask / AskUserQuestion as a synchronous pause point that returns structured options. Do not rely on the model to emit parseable markdown questions in free text. Both Kimi and Anthropic's Agent SDK now formalize this pattern. [U2][U5][E12][E19]

Approval engine design

Use four layers of decisioning:

  1. hooks - custom code before execution
  2. static rules - allow/deny/ask policies by tool, path, domain, secret class
  3. permission mode - coarse session mode like ask, accept_edits, bypass_in_sandbox
  4. runtime callback / operator prompt - human final decision when needed

This mirrors Anthropic's documented permission evaluation pipeline and is a robust general model. [E12]

Risk-tier your tools

A useful default:

  • read tier: fs.read, fs.list, web.search
  • soft write tier: fs.write in working dir, artifact creation
  • hard write tier: shell commands, network POSTs, DB mutations, secrets access
  • destructive tier: delete, force-push, schema migrations, production actions

Only the first tier should be auto-allowed by default outside sandboxes.
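Risk tiering becomes enforceable once tool names map deterministically to tiers. A sketch using glob patterns over the namespaced tool names from section 4; the specific pattern lists are assumptions, and unmatched tools fail closed to "ask":

```python
import fnmatch

# Most restrictive tiers are checked first so specific patterns win.
TIERS = [
    ("destructive", ["db.drop*", "git.force_push", "deploy.*"]),
    ("hard_write",  ["exec.*", "code.*", "db.*"]),
    ("soft_write",  ["fs.write", "fs.edit", "task.*"]),
    ("read",        ["fs.read", "fs.list", "fs.search", "web.search"]),
]
AUTO_ALLOW = {"read"}   # only the read tier is auto-allowed outside sandboxes

def decide(tool_name):
    """Return (tier, decision); unknown tools default to 'ask' (fail closed)."""
    for tier, patterns in TIERS:
        if any(fnmatch.fnmatch(tool_name, p) for p in patterns):
            return tier, ("allow" if tier in AUTO_ALLOW else "ask")
    return "unknown", "ask"
```

This lives in the policy engine, so the decision is identical whether the call came from the model, a subagent, or a replay.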

Session-scoped approval memory

Persist "allow for this session" decisions. Kimi restores these on resume, and the pattern is very operator-friendly. [E19][E20]

Hooks are where invariants live

Use deterministic hooks for:

  • secret redaction
  • allowlist/denylist checks
  • provenance logging
  • path normalization
  • budget limits
  • post-tool content scrubbing
  • automatic artifact persistence

This is more reliable than repeatedly asking the model to remember rules.
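A minimal sketch of such a hook pipeline: deterministic functions run before and after every tool call, each either transforming the event or raising. The hook names and the crude secret-matching pattern are illustrative only; real redaction needs a proper secret scanner:

```python
import re

def redact_secrets(event):
    """Post-tool hook: scrub obvious key-like values from tool output."""
    event["content"] = re.sub(
        r"(?i)(api[_-]?key\s*[:=]\s*)\S+",   # toy pattern, illustration only
        r"\1[REDACTED]",
        event["content"],
    )
    return event

def enforce_path_allowlist(event, roots=("/workspace",)):
    """Pre-tool hook: refuse file operations outside allowed roots."""
    path = event.get("path")
    if path and not any(path.startswith(r) for r in roots):
        raise PermissionError(f"path outside allowed roots: {path}")
    return event

PRE_TOOL_HOOKS = [enforce_path_allowlist]
POST_TOOL_HOOKS = [redact_secrets]

def run_hooks(hooks, event):
    """Apply hooks in order; any hook may veto by raising."""
    for hook in hooks:
        event = hook(event)
    return event

evt = run_hooks(POST_TOOL_HOOKS, {"content": "api_key = sk-abc123 done"})
```

Because hooks are plain code on the tool-call path, the invariants hold even when the model forgets, disagrees, or is being manipulated.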


10. Protocol layer: MCP, ACP, A2A

Design principle

Adopt protocol boundaries that map cleanly to the three different relationships in an agent system:

  • agent to tool/resource -> MCP
  • client/editor/UI to local or remote agent runtime -> ACP
  • agent to remote agent -> A2A

Do not force one protocol to do all three jobs. [U3][U4][E18][E24][E25][E26]

MCP - tool and resource integration

MCP is the open standard for connecting agents to external tools and data sources. Both Anthropic and Kimi explicitly position it this way, and the official MCP spec emphasizes user consent, tool safety, and trust boundaries. [E18][E22][E25]

What to adopt

  • stdio and HTTP transports
  • per-server auth handling
  • tool discovery and deferred loading for large catalogs
  • consistent tool namespacing
  • explicit trust and approval model

Security note

The MCP spec warns that tool descriptions and annotations should be considered untrusted unless they come from trusted servers. That means MCP server output and metadata should go through the same policy scrutiny as tool results. [E25]

ACP - client/editor integration

ACP standardizes communication between coding agents and client applications such as IDEs. It uses JSON-RPC 2.0 and models initialization, session setup, prompt turns, updates, and cancellation. Kimi CLI supports ACP; Deep Agents ships ACP integration; the protocol is now solid enough to treat as the default IDE/editor adapter target. [E23][E24]

Why this matters for your harness

If your runtime already has:

  • typed events
  • session IDs
  • prompt/turn boundaries
  • permission request events
  • file operation events

then ACP is mostly an adapter problem, not a redesign problem.

Internal design implication

Your internal event bus should look ACP-like even if you do not expose ACP immediately.
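As a sketch of what "ACP-like" means internally: typed events that map one-to-one onto JSON-RPC 2.0 notifications. The event classes and method strings below are illustrative placeholders, not ACP's actual wire types.

```python
from dataclasses import dataclass
from typing import Union

# Illustrative internal events; names are ours, not ACP's schema.
@dataclass(frozen=True)
class SessionStarted:
    session_id: str

@dataclass(frozen=True)
class TurnUpdate:
    session_id: str
    text: str

@dataclass(frozen=True)
class PermissionRequested:
    session_id: str
    action: str  # e.g. "fs.write:/src/main.py"

@dataclass(frozen=True)
class Cancelled:
    session_id: str

Event = Union[SessionStarted, TurnUpdate, PermissionRequested, Cancelled]

def to_jsonrpc(event: Event) -> dict:
    """Adapter sketch: map internal events onto JSON-RPC 2.0 notifications,
    the framing ACP uses. Method names here are placeholders."""
    method = {
        SessionStarted: "session/started",
        TurnUpdate: "session/update",
        PermissionRequested: "session/request_permission",
        Cancelled: "session/cancelled",
    }[type(event)]
    return {"jsonrpc": "2.0", "method": method, "params": vars(event)}
```

If the runtime already emits events shaped like this, the ACP adapter is a thin translation layer rather than a rework of the core loop.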

A2A - remote agent interoperability

A2A is the right layer for remote agent-to-agent delegation and task exchange. The official docs now describe it as the common language for agent interoperability, with task objects, streaming, push notifications, and Agent Cards for discovery. [E26]

What to adopt

  • Agent Cards for remote capability discovery
  • task-based remote execution
  • artifacts as outputs
  • SSE streaming for long-running tasks
  • push notifications for disconnected clients or jobs

Important protocol insight

A2A explicitly separates messages from artifacts, and says results should generally be returned as task artifacts rather than chat messages. That is highly compatible with the artifact-first design recommended in this blueprint. [E26]
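The message/artifact split can be sketched as below. Field names and the `artifact://` URI scheme are our illustrative assumptions, not the A2A wire schema.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    name: str
    uri: str         # handle into the artifact store
    media_type: str

@dataclass
class TaskResult:
    task_id: str
    status: str                                             # e.g. "completed"
    messages: list[str] = field(default_factory=list)       # short status chatter
    artifacts: list[Artifact] = field(default_factory=list) # the real outputs

def complete_with_report(task_id: str, report_uri: str) -> TaskResult:
    """Return substantive output as an artifact ref, not a chat blob."""
    return TaskResult(
        task_id=task_id,
        status="completed",
        messages=["Report generated; see artifact."],
        artifacts=[Artifact("report", report_uri, "text/markdown")],
    )
```

Keeping results in `artifacts` means downstream agents and UIs fetch them by handle instead of re-parsing chat text.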

Recommended layering model

flowchart TD
    CORE[Core Runtime]
    CORE --> MCPA[MCP Adapter]
    CORE --> ACPA[ACP Adapter]
    CORE --> A2AA[A2A Adapter]

    MCPA --> TOOLS[External Tools and Data Sources]
    ACPA --> SURFACES[IDE / CLI / Web Clients]
    A2AA --> REMOTE[Remote Agents]


Optional future-facing note

MCP is already growing beyond plain tool calls into UI-capable extensions. Even if you do not adopt these immediately, design your content/event model so tool outputs can eventually include richer UI payloads without breaking the core runtime. [E25]


11. Security architecture

Principle

Assume the model is competent but not trustworthy enough to be your only control plane. Boundaries must be enforced in tools, policy engine, sandbox, network layer, and storage layer. Deep Agents states this bluntly: enforce boundaries at the tool/sandbox level, not by expecting the LLM to self-police. [U2][E3]

Required controls

Filesystem controls

  • absolute-root enforcement
  • path normalization
  • symlink and traversal defense
  • restricted write scopes
  • artifact-only zones vs source zones

Deep Agents and Kimi both hardened file path handling over time, which is exactly what you should expect to do too. [E3][E23]

Sandbox controls

  • isolated container or VM
  • network egress off by default
  • package install restrictions
  • resource quotas
  • process timeout
  • syscall restrictions if you control the sandbox

Secret controls

  • never expose .env or raw secret stores without policy checks
  • inject secrets into tools only when needed
  • log secret access as events
  • redact secrets from tool results before they hit the model context

Tool output injection defense

  • treat fetched web pages, MCP output, and external docs as untrusted
  • strip or neutralize obvious "ignore prior instructions" prompt injections
  • separate raw fetched content from normalized summaries
  • require human approval for sensitive actions even if the model was instructed by external content

Audit and provenance

Every external action should leave:

  • who/what requested it
  • inputs
  • approval path
  • outputs/artifact refs
  • timestamps
  • hashes where relevant
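The audit fields above suggest a record shape like the following. The schema is illustrative; hashing inputs instead of storing them raw is one way to keep secrets out of the audit trail while preserving verifiability.

```python
import hashlib
import json
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class ProvenanceRecord:
    """One audit entry per external action; field names are ours."""
    requested_by: str              # agent, subagent, or operator id
    action: str
    inputs_hash: str               # hash, not raw inputs, which may hold secrets
    approval_path: str             # e.g. "auto-policy" or "operator:alice"
    artifact_refs: tuple[str, ...]
    timestamp: float

def record_action(requested_by: str, action: str, inputs: dict,
                  approval_path: str,
                  artifact_refs: tuple[str, ...]) -> ProvenanceRecord:
    # sort_keys makes the hash stable across dict orderings
    digest = hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode()
    ).hexdigest()
    return ProvenanceRecord(requested_by, action, digest, approval_path,
                            artifact_refs, time.time())
```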

Trust zones

Define at least three trust zones:

  1. Model context zone - reasoning buffer, low trust
  2. Execution zone - sandboxed code/tools, medium trust with policy control
  3. Operator and system-of-record zone - approvals, secrets, production integrations, high trust

This framing makes it easier to reason about what data can flow where.


12. Observability, evaluation, and product metrics

Instrument the harness, not just the model

You need visibility into:

  • model calls
  • tool calls
  • approval waits
  • compaction events
  • subagent fan-out
  • artifact creation
  • resume/replay behavior
  • failure and rollback paths

A Kimi-style typed event bus and LangSmith/LangGraph-style traces are both strong inspirations here. [U1][E1][E23]

Metrics to watch from day one

Core runtime health

  • session resume success rate
  • step failure rate
  • retry rate per step
  • cancellation rate
  • mean steps per completed task

Context economics

  • prompt cache hit rate
  • uncached token share
  • compaction frequency
  • average artifact bytes per task
  • percentage of tool results evicted to artifacts

Tooling quality

  • tool selection accuracy
  • average tool count considered per step
  • deferred-tool load rate
  • tool latency
  • percentage of tool results requiring follow-up reads

Human collaboration

  • approvals per task
  • time waiting for approval
  • AskUserQuestion frequency
  • operator override rate

Multi-agent efficiency

  • subagent spawn rate
  • parallel branch count
  • orchestration overhead
  • percent of tasks where subagents improved latency or quality
  • token cost of orchestration vs direct execution

Evaluation strategy

Run three layers of evaluation:

  1. Unit tests for tools and policies
    Deterministic correctness.

  2. Task-level replay evals
    Fixed prompts and expected artifacts or state transitions.

  3. Long-horizon harness evals
    Resume after interruption, compaction correctness, approval branching, artifact recovery, and subagent handoffs.

Anthropic's tooling guidance strongly recommends evaluating tools with agents rather than assuming normal unit tests are enough. [E17]


13. Recommended repo structure

agent-harness/
  runtime/
    loop/
    sessions/
    checkpoints/
    events/
    subagents/
  context/
    compaction/
    caching/
    artifact_refs/
    serializers/
  tools/
    builtin/
    wrappers/
    policies/
    schemas/
  protocols/
    mcp/
    acp/
    a2a/
  sandboxes/
    local/
    remote/
    execution_bridge/
  memory/
    conventions/
    task_store/
    artifact_store/
  skills/
    builtin/
    project/
  ui/
    cli/
    web/
    ide/
  evals/
    task_suites/
    replay/
    regression/
  docs/
    ARCHITECTURE.md
    AGENTS.md

Why this layout works

  • isolates stable interfaces from fast-changing prompts and skills
  • keeps protocol adapters separate from core runtime
  • makes the artifact/memory system visible as a first-class subsystem
  • makes evaluation a product surface, not an afterthought

14. Suggested starter defaults

These are starting defaults, not eternal constants.

| Knob | Recommended start | Rationale |
| --- | --- | --- |
| max_steps_per_turn | 50-100 | Kimi defaults to 100, which is a reasonable ceiling for serious tasks. [E18] |
| max_retries_per_step | 2-3 | Enough for transient failures without hiding real problems. [E18] |
| reserved_context_tokens | 20-25% of window, or about 50k on 200-262k models | Leaves space for output + compaction prompts before hard failure. [U4][E18] |
| Large-result eviction threshold | 8k-16k tokens equivalent | High enough to keep small outputs inline, low enough to prevent bloat. |
| Subagent nesting | Off by default | Prevents delegation loops and hidden costs. [U1][E28] |
| Deferred-tool loading | Enable when tool schema mass is large or tool count exceeds comfortable selection range | Protects context budget and selection quality. [E9][E18] |
| Approval mode | Ask for writes, shell, network egress, deletes | Sensible default trust boundary. [E12][E19] |
| Artifact persistence | Always on for search/fetch/code outputs | Enables recovery, replay, and long-horizon work. |
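The knobs above can be captured in a single config object so they are explicit and overridable per deployment. This is a sketch under our own naming; the field names, tool identifiers, and the 0.25 fraction are assumptions drawn from the table, not any framework's config schema.

```python
from dataclasses import dataclass

@dataclass
class HarnessDefaults:
    """Starter knobs as a config object; values are starting points,
    not constants, and field names are illustrative."""
    max_steps_per_turn: int = 100
    max_retries_per_step: int = 3
    reserved_context_fraction: float = 0.25   # ~50k tokens on a 200k window
    eviction_threshold_tokens: int = 8_000    # artifactize results above this
    allow_subagent_nesting: bool = False
    deferred_tool_loading: bool = False       # enable when catalogs grow
    approval_required_actions: frozenset[str] = frozenset(
        {"fs.write", "shell.exec", "net.egress", "fs.delete"})
    always_artifact_tools: frozenset[str] = frozenset(
        {"web.search", "web.fetch", "code.exec"})

def reserved_tokens(cfg: HarnessDefaults, window: int) -> int:
    """Tokens to hold back from the prompt budget for output + compaction."""
    return int(window * cfg.reserved_context_fraction)
```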

15. Implementation roadmap

Phase 0 - Architecture decisions (1-2 weeks)

Lock these decisions before heavy coding:

  • runtime model (graph/checkpoint runtime vs hand-rolled)
  • artifact store format and URI scheme
  • sandbox strategy
  • policy engine architecture
  • protocol boundaries
  • session and task schemas

Exit criteria

  • written state schema
  • event taxonomy
  • tool naming convention
  • security boundary map

Phase 1 - Single-agent durable harness (2-4 weeks)

Build:

  • append-only event log
  • checkpointed session runtime
  • file/artifact store
  • minimal built-in tool set
  • approval engine
  • typed traces
  • basic compaction / artifact eviction
  • CLI or API surface

Do not build yet

  • remote A2A
  • dynamic subagents
  • huge MCP catalogs
  • exotic skills

Exit criteria

  • 30-100 step tasks complete reliably
  • resume works
  • approval pause/resume works
  • large outputs are artifactized

Phase 2 - Context and capability scaling (2-4 weeks)

Add:

  • stable prompt layout and cache instrumentation
  • deferred tool loading / tool search
  • skills
  • AGENTS.md memory
  • better compaction / restart heuristics
  • eval suite for long-horizon tasks

Exit criteria

  • cache hit rate is measurable and stable
  • long-context tasks avoid runaway token growth
  • new domain knowledge can be added through skills without tool sprawl

Phase 3 - Subagents and protocol adapters (3-6 weeks)

Add:

  • general-purpose subagent
  • explore/research subagent
  • explicit handoff contract
  • ACP adapter
  • MCP connector hardening
  • parallel branch limits and metrics

Exit criteria

  • subagents materially reduce token bloat or latency on chosen workloads
  • IDE integration works through ACP or equivalent
  • MCP tool injection risks are contained by policy

Phase 4 - Enterprise hardening and remote delegation (4-8 weeks)

Add:

  • remote sandboxes
  • stricter egress policies
  • A2A adapter for remote specialists
  • richer audit and compliance
  • advanced rollback and replay
  • operator workflows for approval queues

Exit criteria

  • clear trust zone boundaries
  • disaster recovery / replay story
  • remote delegation works without leaking internal state

16. What to steal directly from each source

From Deep Agents

  • middleware composition
  • pluggable filesystem backends
  • built-in general-purpose subagent
  • skill loading with progressive disclosure
  • harness framing: planning + filesystem + subagents + context management [U1][E1][E2][E3]

From LangGraph

  • checkpoint-based durable execution
  • persistence threads
  • human-in-the-loop pause/resume
  • idempotent task boundaries for replay [E5]

From Kimi CLI

  • session state persistence beyond chat history
  • explicit loop-control knobs like reserved_context_size
  • structured AskUserQuestion UX
  • event/protocol decoupling via Wire-style messages
  • config-driven agent definitions and inheritance [U1][U4][E18][E19][E20][E21][E23]

From Anthropic / Claude patterns

  • prompt caching discipline
  • tool search with deferred loading
  • programmatic tool calling
  • approval callbacks and layered permission logic
  • initializer-agent pattern for long-running work
  • fresh-window restarts when externalized state is strong [U2][U5][E7][E8][E9][E10][E12][E16][E30]

From the Manus-style ideas captured in your materials

  • context engineering as the primary systems problem
  • mask, do not remove, when constraining actions
  • keep errors in context
  • use files to manipulate attention and preserve recoverability [U2][U3][U5]

From open protocols

  • MCP for tool interoperability
  • ACP for editor/client interoperability
  • A2A for remote agent interoperability and task exchange [U3][U4][E24][E25][E26]

17. Anti-patterns to avoid

  1. Treating the agent as "just a prompt".
    This fails as soon as tasks span many steps, tools, or sessions. [U2][U3][E1]

  2. Dynamically rewriting the toolset mid-session.
    Breaks cache locality and confuses state. [U2][U5][E7]

  3. Letting raw tool outputs flood the context.
    Artifactize early. [U1][U4][E10]

  4. Using multi-agent because it sounds advanced.
    Use it only when context isolation or parallelism is clearly valuable. [E6][E15]

  5. Relying on the model to self-police risky actions.
    Put enforcement in policy and sandbox layers. [U4][E12][E25]

  6. Returning verbose, human-style tool blobs.
    Return structured, compact outputs plus artifact refs. [E17]

  7. Hiding errors from the model.
    Preserve error evidence unless redaction is necessary. [U2][U5]

  8. Overloading the system prompt with all possible instructions.
    Use skills, tool search, and read/search tools instead. [U2][U4][E9][E21]

  9. Skipping evaluation of the harness itself.
    Tool tests are not enough; evaluate resumption, compaction, approvals, and delegation. [E17]

  10. Conflating protocols.
    MCP, ACP, and A2A solve different problems. Use each where it fits. [E24][E25][E26]


18. The concrete blueprint I would build

If I were starting this harness tomorrow, I would build the following first version:

Core

  • a durable orchestrator runtime with checkpointing
  • append-only JSONL event log
  • Postgres or SQLite for session/task metadata
  • object store or filesystem artifact store with content-addressed blobs
  • a stable tool catalog with namespaces

Built-in primitives

  • file read/write/edit/search
  • web search/fetch
  • sandboxed code execution
  • task graph primitive
  • structured user question tool
  • subagent delegation tool

Context system

  • static prompt prefix
  • AGENTS.md and project memory loading
  • immediate artifactization of large outputs
  • reserved output budget
  • compaction + fresh restart policy

Governance

  • pre-tool hooks
  • policy rules
  • approval callbacks
  • sandbox allowlists
  • provenance log

Surfaces

  • one CLI or API first
  • internal typed event bus
  • ACP adapter second
  • web UI later

Scale-up path

  • skills
  • deferred tool loading / tool search
  • general-purpose subagent
  • specialist subagents
  • MCP expansion
  • A2A remote workers only after local harness is stable

That combination gives you a system that is already "modern agentic" by 2026 standards without overcommitting to every new trend.


19. Final recommendation

The best 2026 harness is not the one with the most tools, the most agents, or the biggest context window. It is the one that:

  • preserves a stable cached prefix
  • externalizes state to artifacts and files
  • exposes a small, ergonomic action space
  • uses code execution to compress data before the model sees it
  • delegates work into isolated contexts only when needed
  • enforces safety in deterministic runtime layers
  • can resume, replay, inspect, and audit every important step

That is the throughline across the materials you shared and the current official docs. Build that foundation first. Everything else - skills, richer protocols, remote swarms, UI polish, domain specialization - compounds on top of it.


References

User-provided source documents

  • [U1] compass_artifact_wf-b9580dc8-f513-4de3-bf7a-7e6dbb6d5df8_text_markdown.md - user-provided synthesis on architectural patterns for modern agentic systems.
  • [U2] modern-agent-architecture-guide.md - user-provided guide synthesizing Claude Code, Manus, Deep Agents, and Kimi CLI patterns.
  • [U3] Building a 2026 Agentic System.docx - user-provided architectural blueprint with neuro-symbolic framing, protocol sections, and implementation guidance.
  • [U4] Starter Plan for a Modern Agentic System in 2026.pdf - user-provided starter blueprint focused on runtime primitives, context engineering, and milestones.
  • [U5] agent_ideas.docx - user-provided notes including Thariq, Lance Martin, Manus, and Deep Agents excerpts.

External references and official docs
