@jeffscottward
Created February 22, 2026 21:36
AI Harness Comparative Analysis: Maestro vs Superpowers vs ECC vs Agent Orchestrator (10,800+ lines)

ComposioHQ/agent-orchestrator — Deep Technical Analysis

Repository: ComposioHQ/agent-orchestrator
Analysis Date: 2026-02-22
Analyst: Claude Opus 4.6
Source Path: /tmp/ai-harness-repos/agent-orchestrator/
Report Length Target: 2000+ lines of detailed analysis


Table of Contents

  1. Design Philosophy & Goals
  2. Core Architecture
  3. Harness Workflow
  4. Subagent Orchestration
  5. Multi-Agent & Parallelization Strategy
  6. Isolation Model
  7. Human-in-the-Loop Controls
  8. Context Handling
  9. Session Lifecycle
  10. Code Quality Gates
  11. Security & Compliance
  12. Hooks & Automation
  13. CLI & UX
  14. Cost & Usage Visibility
  15. Tooling & Dependencies
  16. External Integrations
  17. Operational Assumptions & Prerequisites
  18. Failure Modes & Recovery
  19. Governance & Guardrails
  20. Roadmap & Evolution Signals
  21. What to Borrow / Adapt into Maestro
  22. Cross-Links

1. Design Philosophy & Goals

Confidence: High

1.1 Core Vision

Agent Orchestrator (AO) positions itself as a parallel AI coding agent harness with a clear tagline from the README:

"Spawn parallel AI coding agents. Monitor from one dashboard. Merge their PRs."

This is not a general-purpose AI orchestration framework. It is laser-focused on software development workflows where multiple AI coding agents work on different issues simultaneously, each in an isolated workspace, producing pull requests that a human reviews and merges.

Source: /tmp/ai-harness-repos/agent-orchestrator/README.md (lines 1-10)

1.2 Architectural Principles

The codebase reveals several deliberate design choices:

  1. Plugin-Everything Architecture: Every capability is behind a plugin interface — runtime, agent, workspace, tracker, SCM, notifier, terminal, lifecycle. This allows swapping implementations without touching core logic.

  2. Process Isolation via tmux: Rather than embedding agents in-process, AO spawns them as independent terminal processes inside tmux sessions. This is a pragmatic choice: Claude Code, Codex, Aider, and OpenCode are all CLI tools that expect a terminal environment.

  3. Flat-File State Over Databases: All session state lives in the filesystem as key=value metadata files. No SQLite, no Postgres, no Redis. This trades query capability for operational simplicity — you can debug state with cat and ls.

  4. Polling Over Event-Driven: The lifecycle manager polls every 30 seconds. The web dashboard polls every 5 seconds via SSE. There is no event bus, no pub/sub, no WebSocket push from core. This is explicitly acknowledged as a limitation.

  5. Fail-Open for Enrichment, Fail-Closed for Safety: PR enrichment (CI status, reviews) has timeouts and falls back gracefully. But CI status detection for open PRs is fail-closed — if the GitHub API errors, it reports "failing" rather than "none," preventing premature merges.

  6. Developer-Local First: The entire system runs on a single developer's machine. There is no multi-user support, no cloud deployment story, no containerization. The "server" is a Next.js dev server on localhost.
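The flat-file state approach in principle 3 can be sketched in a few lines. The following is a hypothetical reader for a `key=value` metadata file; the field names shown are illustrative, not AO's actual schema:

```typescript
// Hypothetical reader for a flat key=value metadata file, mirroring the
// "debug state with cat and ls" philosophy. Field names are illustrative.
function parseMetadata(raw: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const line of raw.split("\n")) {
    if (line.trim() === "") continue;
    const idx = line.indexOf("=");
    if (idx === -1) continue; // skip malformed lines rather than failing
    out[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return out;
}

const meta = parseMetadata("session_id=fix-login-abc123\nstatus=working\n");
```

The trade-off described above is visible even in this sketch: reads are trivial, but any query more complex than a key lookup means scanning directories and parsing files.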

1.3 Strengths

  • Extremely pragmatic: Instead of building a complex IPC system, they leverage tmux — a battle-tested terminal multiplexer that already handles process management, session persistence, and output capture.
  • Low barrier to entry: If you have tmux and a coding agent CLI, you can start using AO immediately. No infrastructure setup required.
  • Plugin system is well-designed: Clean interfaces with manifest metadata, Zod validation, and type-safe registration.

1.4 Limitations

  • Single-machine constraint: No distributed execution. All agents run on one machine, sharing CPU/memory/disk.
  • No persistence guarantees: If the machine reboots, tmux sessions are lost. Session restoration depends on the agent supporting --resume.
  • Polling latency: 30-second lifecycle polling means state changes can take up to 30 seconds to be detected and reacted to.
  • No cost controls: While cost is tracked (see Section 14), there are no budget limits, spending alerts, or automatic shutoff mechanisms.

1.5 Proven vs. Aspirational

The README lists support for agents: Claude Code, Codex CLI, Aider, OpenCode. However:

  • Proven: Claude Code plugin is 786 lines of deeply integrated code with JSONL parsing, activity detection, cost extraction, session restoration, and workspace hooks.
  • Aspirational: Codex, Aider, and OpenCode plugins exist but are significantly thinner. The plugin registry lists them (packages/core/src/plugin-registry.ts, lines 20-23), but several have placeholder implementations.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/plugin-registry.ts (lines 14-30)


2. Core Architecture

Confidence: High

2.1 Monorepo Structure

agent-orchestrator/
├── packages/
│   ├── core/           # Types, config, session manager, lifecycle, plugins
│   ├── cli/            # Commander.js CLI (ao command)
│   ├── web/            # Next.js dashboard
│   ├── plugins/
│   │   ├── agent-claude-code/
│   │   ├── runtime-tmux/
│   │   ├── workspace-worktree/
│   │   ├── scm-github/
│   │   ├── tracker-github/
│   │   ├── tracker-linear/
│   │   ├── notifier-desktop/
│   │   └── notifier-slack/
│   └── integration-tests/
├── pnpm-workspace.yaml
└── agent-orchestrator.yaml.example

Source: /tmp/ai-harness-repos/agent-orchestrator/pnpm-workspace.yaml (lines 1-3)

The monorepo uses pnpm workspaces with two package locations: packages/* and packages/plugins/*. All packages are ESM-only ("type": "module" in root package.json) with TypeScript in strict mode.

2.2 The Eight Plugin Slots

The plugin architecture defines eight distinct capability slots:

| Slot | Purpose | Built-in Implementations |
| --- | --- | --- |
| runtime | Process execution environment | tmux, process |
| agent | AI coding agent | claude-code, codex, aider, opencode |
| workspace | Code isolation | worktree, clone |
| tracker | Issue tracking | github, linear |
| scm | Source code management | github |
| notifier | Notifications | desktop, slack, composio, webhook |
| terminal | Terminal UI integration | iterm2, web |
| lifecycle | State machine customization | core (default) |

Each plugin implements a specific TypeScript interface and is registered with a manifest:

// From types.ts, lines 900-930
export interface PluginManifest {
  name: string;       // e.g., "tmux"
  slot: string;       // e.g., "runtime"
  version: string;
  description?: string;
}

export interface PluginModule<T = unknown> {
  manifest: PluginManifest;
  create: (ctx?: PluginContext) => T | Promise<T>;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/types.ts (lines 900-960)

2.3 Plugin Registry

The registry is a simple Map keyed by "slot:name":

// plugin-registry.ts
const plugins = new Map<string, PluginModule>();

function register(mod: PluginModule): void {
  const key = `${mod.manifest.slot}:${mod.manifest.name}`;
  plugins.set(key, mod);
}

function get<T>(slot: string, name: string): T {
  const key = `${slot}:${name}`;
  const mod = plugins.get(key);
  if (!mod) throw new Error(`Plugin not found: ${key}`);
  return mod.create() as T;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/plugin-registry.ts

The web package cannot use dynamic import() due to webpack bundling constraints, so it imports plugins statically:

// packages/web/src/lib/services.ts, lines 25-30
import pluginRuntimeTmux from "@composio/ao-plugin-runtime-tmux";
import pluginAgentClaudeCode from "@composio/ao-plugin-agent-claude-code";
import pluginWorkspaceWorktree from "@composio/ao-plugin-workspace-worktree";
import pluginScmGithub from "@composio/ao-plugin-scm-github";
import pluginTrackerGithub from "@composio/ao-plugin-tracker-github";
import pluginTrackerLinear from "@composio/ao-plugin-tracker-linear";

This is a practical workaround but creates a maintenance burden — new plugins must be manually added to this import list.

2.4 Core Services Singleton

The web package uses a globalThis-cached singleton pattern for services initialization:

// packages/web/src/lib/services.ts, lines 38-58
const globalForServices = globalThis as typeof globalThis & {
  _aoServices?: Services;
  _aoServicesInit?: Promise<Services>;
};

export function getServices(): Promise<Services> {
  if (globalForServices._aoServices) {
    return Promise.resolve(globalForServices._aoServices);
  }
  if (!globalForServices._aoServicesInit) {
    globalForServices._aoServicesInit = initServices().catch((err) => {
      globalForServices._aoServicesInit = undefined;
      throw err;
    });
  }
  return globalForServices._aoServicesInit;
}

Note the error recovery: if initialization fails, the cached promise is cleared so subsequent calls retry rather than permanently returning a rejected promise.

2.5 Hash-Based Directory Structure

AO uses a SHA-256 hash of the config file's directory path to create globally unique namespaces:

// paths.ts
import { createHash } from "node:crypto";
import { realpathSync } from "node:fs";

export function generateConfigHash(configDir: string): string {
  const resolved = realpathSync(configDir);
  return createHash("sha256").update(resolved).digest("hex").slice(0, 12);
}

The directory hierarchy:

~/.agent-orchestrator/
  {12-char-hash}-{projectId}/
    sessions/
      {sessionName}/
        metadata        # key=value flat file
        prompt.md       # agent system prompt
    archive/
      {sessionName}_{timestamp}   # archived metadata
    worktrees/
      {sessionName}/    # git worktree checkout

Source: /tmp/ai-harness-repos/agent-orchestrator/ARCHITECTURE.md (full document)
Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/paths.ts

The hash is computed from the resolved path (symlinks followed via realpathSync), meaning /foo/bar and /foo/bar-link -> /foo/bar hash to the same value. This prevents accidental duplication.

Collision detection is also implemented — each instance directory contains an .origin file storing the original path. If two different config directories ever produce the same hash prefix, the system detects the collision and errors out.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/paths.ts (validateAndStoreOrigin function)
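The collision check can be sketched as follows. This is a hypothetical simplification; the real validateAndStoreOrigin may differ in signature, error messages, and edge-case handling:

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical sketch of the .origin collision check described above.
function validateAndStoreOriginSketch(instanceDir: string, originalPath: string): void {
  const originFile = join(instanceDir, ".origin");
  if (existsSync(originFile)) {
    const stored = readFileSync(originFile, "utf-8").trim();
    if (stored !== originalPath) {
      // Same 12-char hash prefix, different config dir: refuse to proceed.
      throw new Error(`Hash collision in ${instanceDir}: already owned by ${stored}`);
    }
  } else {
    writeFileSync(originFile, originalPath);
  }
}

// Demo in a throwaway directory.
const dir = mkdtempSync(join(tmpdir(), "ao-origin-"));
validateAndStoreOriginSketch(dir, "/foo/bar"); // first call stores .origin
validateAndStoreOriginSketch(dir, "/foo/bar"); // same path: fine
let collided = false;
try {
  validateAndStoreOriginSketch(dir, "/other/config");
} catch {
  collided = true;
}
```

With only 12 hex characters (48 bits) of hash, collisions are astronomically unlikely on one machine, but the .origin check makes the failure loud instead of silently mixing two projects' state.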

2.6 Type System

The central types.ts file is 1084 lines and defines the entire domain model. Key types:

Session (the core entity):

export interface Session {
  id: string;
  projectId: string;
  status: SessionStatus;
  activity: ActivityState;
  branch?: string;
  issueId?: string;
  pr?: { number: number; url: string; title?: string };
  workspacePath?: string;
  runtimeHandle?: RuntimeHandle;
  agentInfo?: AgentInfo;
  timestamps: SessionTimestamps;
  metadata: SessionMetadata;
}

SessionStatus (the state machine):

export const SESSION_STATUS = {
  SPAWNING: "spawning",
  WORKING: "working",
  PR_OPEN: "pr_open",
  CI_FAILED: "ci_failed",
  REVIEW_PENDING: "review_pending",
  CHANGES_REQUESTED: "changes_requested",
  APPROVED: "approved",
  MERGEABLE: "mergeable",
  MERGED: "merged",
  CLEANUP: "cleanup",
  NEEDS_INPUT: "needs_input",
  STUCK: "stuck",
  ERRORED: "errored",
  KILLED: "killed",
  DONE: "done",
  TERMINATED: "terminated",
} as const;

ActivityState (runtime observation):

export const ACTIVITY_STATE = {
  ACTIVE: "active",
  IDLE: "idle",
  WAITING_INPUT: "waiting_input",
  BLOCKED: "blocked",
  EXITED: "exited",
  UNKNOWN: "unknown",
} as const;

EventType (33 distinct event types triggering reactions):

export const EVENT_TYPE = {
  SESSION_SPAWNED: "session.spawned",
  SESSION_KILLED: "session.killed",
  AGENT_ACTIVE: "agent.active",
  AGENT_IDLE: "agent.idle",
  AGENT_STUCK: "agent.stuck",
  AGENT_NEEDS_INPUT: "agent.needs_input",
  AGENT_EXITED: "agent.exited",
  PR_OPENED: "pr.opened",
  PR_MERGED: "pr.merged",
  CI_PASSING: "ci.passing",
  CI_FAILING: "ci.failing",
  CI_PENDING: "ci.pending",
  REVIEW_APPROVED: "review.approved",
  REVIEW_CHANGES_REQUESTED: "review.changes_requested",
  // ... 19 more
} as const;

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/types.ts (lines 1-1084)


3. Harness Workflow

Confidence: High

3.1 End-to-End Flow

The typical AO workflow proceeds as follows:

  1. Configuration: User creates agent-orchestrator.yaml defining projects, plugins, and reactions.
  2. Start: ao start launches the dashboard and spawns an orchestrator meta-agent.
  3. Spawn: The orchestrator (or user) spawns worker sessions via ao spawn <project> <issue>.
  4. Work: Each worker agent runs in its own tmux session with an isolated git worktree.
  5. Monitor: The lifecycle manager polls every 30 seconds, tracking status transitions.
  6. React: When events occur (CI failure, review request, etc.), the reaction engine sends messages to agents or notifies humans.
  7. Review: PRs appear on the dashboard. Humans review and merge (or agents auto-merge if configured).
  8. Cleanup: After merge, sessions are cleaned up — worktrees removed, metadata archived.

3.2 Spawn Sequence (Detailed)

The sessionManager.spawn() method in session-manager.ts is the most complex operation. Here is the exact sequence:

1. Validate issue exists (tracker.getIssue)
2. Generate session prefix from issue title
3. Reserve session ID atomically (O_EXCL file creation)
4. Create workspace (git worktree with new branch)
5. Run post-create hooks (symlinks, commands)
6. Build agent prompt (3-layer composition)
7. Get agent launch command
8. Get agent environment variables
9. Create runtime (tmux session)
10. Send launch command to runtime
11. Write metadata file (session_id, project_id, issue_id, branch, etc.)
12. Run post-launch setup (e.g., write Claude hooks)

At each step, failure triggers cleanup of previously completed steps:

// session-manager.ts, spawn method (simplified)
try {
  const workspace = await workspacePlugin.create(...);
  try {
    const handle = await runtimePlugin.create(...);
    try {
      await runtimePlugin.sendMessage(handle, launchCommand);
      await writeMetadata(...);
      await agentPlugin.setupWorkspaceHooks?.(...);
    } catch (err) {
      await runtimePlugin.destroy(handle);
      throw err;
    }
  } catch (err) {
    await workspacePlugin.destroy(workspace);
    throw err;
  }
} catch (err) {
  // Clean up session ID reservation
  await deleteSessionDir(sessionDir);
  throw err;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/session-manager.ts (spawn method, approximately lines 80-250)

3.3 Batch Spawning

The ao batch-spawn command handles spawning multiple sessions:

// packages/cli/src/commands/spawn.ts (batch-spawn)
// 1. Check for duplicates against existing sessions
// 2. Check for duplicates within the batch
// 3. Spawn sequentially with 500ms delays
// 4. Report summary (success/failure counts)

The 500ms delay between spawns is a pragmatic rate-limiting measure to avoid overwhelming the system.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/spawn.ts
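The batch-spawn flow above can be sketched as a small loop. This is a hypothetical reconstruction of the logic, not AO's actual code; the function and parameter names are invented for illustration:

```typescript
// Hypothetical sketch of batch-spawn: dedupe against existing sessions and
// within the batch, then spawn sequentially with a fixed delay (500ms in AO).
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function batchSpawn(
  issueIds: string[],
  existingIssueIds: Set<string>,
  spawnOne: (issueId: string) => Promise<void>,
  delayMs = 500,
): Promise<{ spawned: string[]; skipped: string[] }> {
  const spawned: string[] = [];
  const skipped: string[] = [];
  const seenInBatch = new Set<string>();
  for (const id of issueIds) {
    // Skip duplicates against existing sessions and within the batch itself.
    if (existingIssueIds.has(id) || seenInBatch.has(id)) {
      skipped.push(id);
      continue;
    }
    seenInBatch.add(id);
    await spawnOne(id);
    spawned.push(id);
    await sleep(delayMs); // pragmatic rate limiting between spawns
  }
  return { spawned, skipped };
}
```

Sequential spawning with a delay is slower than Promise.all, but it avoids a thundering herd of simultaneous git fetches and tmux session creations.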


4. Subagent Orchestration

Confidence: High

4.1 The Orchestrator Meta-Agent

AO has a two-tier orchestration model:

  • Tier 1 — The Orchestrator: A special agent session (suffixed -orchestrator) that receives a comprehensive system prompt listing all AO CLI commands. It can spawn workers, check status, send messages, and manage the workflow.
  • Tier 2 — Worker Agents: Individual coding agents, each assigned to a single issue.

The orchestrator is spawned by ao start:

// packages/cli/src/commands/start.ts
const orchestratorPrompt = generateOrchestratorPrompt(config, project);
await sessionManager.spawnOrchestrator({
  projectId,
  prompt: orchestratorPrompt,
});

The orchestrator prompt (generated in orchestrator-prompt.ts) includes:

  1. Project information (repo, branch, tracker)
  2. Quick-start section showing how to spawn agents
  3. Complete command reference table
  4. Session management workflows
  5. Dashboard information
  6. Configured reaction rules
  7. Common workflow patterns (bulk issue processing, stuck agent handling, PR review flow)
  8. Tips for effective orchestration

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/orchestrator-prompt.ts (full file)

4.2 Orchestrator Communication

The orchestrator communicates with AO only through the CLI — it runs ao spawn, ao status, ao send, etc. as shell commands in its tmux session. There is no programmatic API between the orchestrator agent and the AO core.

This is both a strength and a limitation:

  • Strength: The orchestrator uses the same interface as a human. No special plumbing needed.
  • Limitation: Shell command parsing introduces latency and potential for error. The orchestrator must interpret CLI text output.

4.3 Worker Agent Communication

Worker agents receive their initial task via the system prompt and their first message (the issue content). Subsequent communication happens through runtime.sendMessage():

// runtime-tmux/src/index.ts, sendMessage method
async sendMessage(handle: RuntimeHandle, message: string): Promise<void> {
  // Clear any partial input first
  await sendKeys(handle.id, "C-u", false);

  if (message.length > 200) {
    // Use tmux named buffer for long messages
    await loadBuffer(handle.id, message);
    await pasteBuffer(handle.id);
    await sleep(300);
    await sendKeys(handle.id, "Enter", false);
  } else {
    await sendKeys(handle.id, message, true);
  }
}

The 200-character threshold and the named-buffer approach work around tmux's key-sending limitations: messages longer than roughly 1,000 characters can be corrupted when sent character by character, so AO loads long messages into a tmux buffer and pastes them instead.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/runtime-tmux/src/index.ts (lines 70-95)

4.4 No Direct Agent-to-Agent Communication

Worker agents cannot communicate with each other directly. All coordination goes through:

  1. The orchestrator (via ao send)
  2. Git (via shared repository)
  3. GitHub/Linear (via issue comments and PR reviews)

This is a deliberate design choice — it prevents complex agent interaction patterns but keeps the system simple and auditable.


5. Multi-Agent & Parallelization Strategy

Confidence: High

5.1 Parallelism Model

AO's workload is embarrassingly parallel — each agent works on an independent issue in an independent workspace. There is no:

  • Shared memory between agents
  • Lock coordination
  • Task dependency graphs
  • Work-stealing queues
  • Agent-to-agent communication channels

This simplicity is the system's greatest strength for its intended use case. Each agent produces an independent PR. Conflicts, if any, are handled at the git level (merge conflicts in the target branch).

5.2 Resource Constraints

The system imposes no resource limits at the orchestration level. Each tmux session runs an AI agent process that:

  • Consumes API tokens (Claude, OpenAI, etc.)
  • Uses CPU for local processing
  • Uses disk for workspace files
  • Uses network bandwidth for API calls and git operations

There is no mechanism to:

  • Limit the number of concurrent sessions
  • Throttle API call rates across agents
  • Set memory or CPU limits per agent
  • Define a total budget ceiling

The only rate-limiting is the 500ms delay between batch spawns.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/spawn.ts (batch-spawn, sequential spawning loop)

5.3 Lifecycle Polling Concurrency

The lifecycle manager polls all active sessions concurrently:

// lifecycle-manager.ts, pollAll method
const results = await Promise.allSettled(
  activeSessions.map(session => this.pollSession(session))
);

But it has a re-entrancy guard to prevent overlapping poll cycles:

if (this._polling) return;
this._polling = true;
try {
  // ... poll all sessions
} finally {
  this._polling = false;
}

This means if a poll cycle takes longer than 30 seconds (e.g., due to slow GitHub API calls), the next cycle is skipped rather than creating concurrent polls.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/lifecycle-manager.ts (pollAll method)

5.4 Dashboard Parallelism

The web API enriches session data in parallel with timeouts:

// packages/web/src/app/api/sessions/route.ts, lines 39-52
// Metadata enrichment: 3 second timeout
const metaTimeout = new Promise<void>((resolve) => setTimeout(resolve, 3_000));
await Promise.race([enrichSessionsMetadata(...), metaTimeout]);

// PR enrichment: 4 second timeout
const enrichPromises = workerSessions.map((core, i) => {
  if (!core.pr) return Promise.resolve();
  return enrichSessionPR(dashboardSessions[i], scm, core.pr);
});
const enrichTimeout = new Promise<void>((resolve) => setTimeout(resolve, 4_000));
await Promise.race([Promise.allSettled(enrichPromises), enrichTimeout]);

The dual timeout approach (3s for metadata, 4s for PR data) ensures the dashboard remains responsive even when external APIs are slow. If enrichment times out, the dashboard shows stale or incomplete data rather than hanging.


6. Isolation Model

Confidence: High

6.1 Workspace Isolation via Git Worktrees

Each agent session gets its own git worktree — a separate checkout of the same repository on a different branch:

// workspace-worktree/src/index.ts, create method (simplified)
async create(options: WorkspaceCreateOptions): Promise<string> {
  const worktreePath = path.join(worktreeBaseDir, sessionId);

  // Fetch latest from origin
  await execFile("git", ["fetch", "origin"], { cwd: repoPath });

  // Create worktree with new branch from origin/defaultBranch
  await execFile("git", [
    "worktree", "add",
    "-b", branchName,
    worktreePath,
    `origin/${defaultBranch}`,
  ], { cwd: repoPath });

  return worktreePath;
}

Git worktrees are lightweight (they share the .git object store) but provide complete filesystem isolation. Each agent has its own working directory, index, and branch.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/workspace-worktree/src/index.ts (lines 30-100)

6.2 Runtime Isolation via tmux

Each agent runs in a separate tmux session with its own:

  • PTY (pseudo-terminal)
  • Environment variables
  • Process tree
  • Working directory

// runtime-tmux/src/index.ts, create method
async create(options: RuntimeCreateOptions): Promise<RuntimeHandle> {
  await newSession({
    name: tmuxSessionName,
    startDir: options.workspacePath,
    env: options.environment,
    detached: true,
  });

  // Send the launch command
  await sendKeys(tmuxSessionName, launchCommand);

  return {
    id: tmuxSessionName,
    type: "tmux",
    data: { createdAt: Date.now() },
  };
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/runtime-tmux/src/index.ts (lines 20-55)

6.3 Isolation Boundaries and Gaps

What IS isolated:

  • Filesystem (separate worktrees, separate branches)
  • Process (separate tmux sessions)
  • Environment variables (set per session)
  • Git state (separate index, HEAD, working tree)

What is NOT isolated:

  • Network: All agents share the same network. One agent making excessive API calls affects others.
  • Credentials: All agents share the same gh CLI authentication, the same ~/.claude config, the same API keys.
  • CPU/Memory: No cgroups, no containers, no resource limits.
  • Git remote: All worktrees push to the same remote. Branch name collisions are possible (though mitigated by the naming convention).
  • Agent configuration directories: Claude Code stores per-project settings in ~/.claude/projects/. The toClaudeProjectPath function converts workspace paths to Claude's directory encoding, but multiple sessions for the same project could potentially interfere.
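The path encoding mentioned in the last point can be sketched as follows — a hypothetical simplification; the real toClaudeProjectPath may handle edge cases differently:

```typescript
// Hypothetical sketch of Claude Code's per-project directory encoding:
// the workspace path is flattened into a single directory name under
// ~/.claude/projects/ by replacing path separators (and dots) with dashes.
function toClaudeProjectPathSketch(workspacePath: string): string {
  // e.g. /home/user/worktrees/fix-login -> -home-user-worktrees-fix-login
  return workspacePath.replace(/[/.]/g, "-");
}
```

Because each worktree has a distinct path, each session normally gets a distinct encoded directory — the interference risk noted above arises from shared settings at the repository level, not from the encoding itself.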

6.4 Branch Naming Strategy

Branches are named by the tracker plugin:

// tracker-github/src/index.ts
branchName(issueId: string): string {
  return `feat/issue-${issueId}`;
}

// tracker-linear/src/index.ts
branchName(issueId: string): string {
  return `feat/${identifier}`;  // e.g., feat/ENG-123
}

This deterministic naming means two sessions for the same issue would conflict. The batch-spawn command includes deduplication logic to prevent this:

// spawn.ts, batch-spawn
// Check for existing sessions with the same issue
const existing = sessions.filter(s => s.issueId === issueId);
if (existing.length > 0) {
  console.warn(`Skipping issue ${issueId}: already has session ${existing[0].id}`);
  continue;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/spawn.ts

6.5 Post-Create Symlinks

The workspace plugin supports symlinking shared resources into worktrees:

// workspace-worktree/src/index.ts, postCreate method
for (const link of project.symlinks ?? []) {
  // Path traversal guard
  const resolved = path.resolve(worktreePath, link.target);
  if (!resolved.startsWith(worktreePath)) {
    throw new Error(`Symlink target escapes workspace: ${link.target}`);
  }
  await fs.symlink(link.source, resolved);
}

This allows sharing large dependencies (like node_modules or build caches) across worktrees without duplicating them. The path traversal guard prevents symlinks from escaping the workspace directory.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/workspace-worktree/src/index.ts (postCreate method)


7. Human-in-the-Loop Controls

Confidence: High

7.1 Dashboard as Primary Control Surface

The web dashboard provides a Kanban-style view of all sessions grouped by attention level:

// Dashboard.tsx, lines 24, 28-41
const KANBAN_LEVELS = ["working", "pending", "review", "respond", "merge"] as const;

const grouped = useMemo(() => {
  const zones: Record<AttentionLevel, DashboardSession[]> = {
    merge: [],     // Ready to merge
    respond: [],   // Agent needs human input
    review: [],    // PR needs human review
    pending: [],   // Waiting for CI/other
    working: [],   // Agent actively coding
    done: [],      // Completed
  };
  for (const session of sessions) {
    zones[getAttentionLevel(session)].push(session);
  }
  return zones;
}, [sessions]);

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/web/src/components/Dashboard.tsx (lines 24-41)

7.2 Available Human Actions

The dashboard exposes four actions:

  1. Send Message (handleSend): Send a text message to a running agent via POST /api/sessions/:id/send
  2. Kill Session (handleKill): Terminate an agent with confirmation dialog via POST /api/sessions/:id/kill
  3. Merge PR (handleMerge): Merge a pull request via POST /api/prs/:number/merge
  4. Restore Session (handleRestore): Restore a killed/exited session via POST /api/sessions/:id/restore

// Dashboard.tsx, lines 50-86
const handleSend = async (sessionId: string, message: string) => {
  const res = await fetch(`/api/sessions/${encodeURIComponent(sessionId)}/send`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message }),
  });
};

const handleKill = async (sessionId: string) => {
  if (!confirm(`Kill session ${sessionId}?`)) return;
  // ...
};

const handleMerge = async (prNumber: number) => {
  const res = await fetch(`/api/prs/${prNumber}/merge`, { method: "POST" });
};

const handleRestore = async (sessionId: string) => {
  if (!confirm(`Restore session ${sessionId}?`)) return;
  // ...
};

7.3 Attention Routing

The getAttentionLevel function (in @/lib/types) maps session state to human attention urgency. This drives both the dashboard layout and the dynamic favicon (showing counts of sessions needing attention).

The DynamicFavicon component updates the browser tab to show the project status at a glance, so a human can monitor multiple projects across browser tabs.
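A plausible shape for this mapping, assuming the status values from Section 2.6 — the real getAttentionLevel in @/lib/types may use different rules:

```typescript
// Hypothetical sketch of mapping session status to human attention urgency.
type AttentionLevel = "merge" | "respond" | "review" | "pending" | "working" | "done";

function getAttentionLevelSketch(status: string): AttentionLevel {
  switch (status) {
    case "mergeable":
    case "approved":
      return "merge";   // ready for a one-click merge
    case "needs_input":
    case "stuck":
      return "respond"; // agent is blocked on a human
    case "review_pending":
    case "changes_requested":
      return "review";  // PR needs human eyes
    case "pr_open":
    case "ci_failed":
      return "pending"; // waiting on CI or automation
    case "merged":
    case "done":
      return "done";
    default:
      return "working"; // agent actively coding
  }
}
```

Ordering the Kanban zones by urgency (merge first, working last) means the human's eye lands on the highest-leverage action first.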

7.4 Notification System

Humans are notified through multiple channels:

  • Desktop: OS-native notifications (macOS osascript, Linux notify-send)
  • Slack: Rich Block Kit messages to webhook URLs
  • Composio: (mentioned in config but plugin not explored in detail)
  • Webhook: Generic HTTP webhook

Notifications are routed by priority:

# agent-orchestrator.yaml.example
notificationRouting:
  critical: [slack, desktop]
  high: [slack, desktop]
  normal: [slack]
  low: [slack]

Source: /tmp/ai-harness-repos/agent-orchestrator/agent-orchestrator.yaml.example

7.5 Human Override Points

  1. Before spawn: Human (or orchestrator) decides which issues to assign
  2. During work: Human can send messages to guide the agent
  3. At PR creation: Human reviews the PR on GitHub
  4. At merge: Human (or auto-merge) decides when to merge
  5. On failure: Human can kill, restore, or send instructions
  6. Kill switch: ao stop terminates everything

7.6 Limitations of HITL Controls

  • No approval gates: There is no mechanism to require human approval before an agent takes a specific action (e.g., deploying, running tests, modifying security-sensitive files).
  • No content filtering: Agent outputs are not screened before being committed or pushed.
  • No rollback: If a PR is merged and breaks something, there is no automated rollback mechanism.
  • Message-only intervention: The only way to influence a running agent is to send it a text message. There is no way to modify its system prompt, change its tools, or restrict its actions mid-session.

8. Context Handling

Confidence: High

8.1 Three-Layer Prompt Composition

The prompt-builder.ts composes agent prompts from three layers:

Layer 1 — Base Agent Prompt (hardcoded in prompt-builder.ts):

You are working on a software engineering task...
- Follow the project's existing patterns and conventions
- Create focused, well-scoped commits
- Open a PR when your work is ready for review
- If CI fails, investigate and fix
- If review feedback is received, address it

Layer 2 — Config-Derived Context (from agent-orchestrator.yaml):

Project: {projectName}
Repository: {repo}
Default Branch: {defaultBranch}
Tracker: {tracker.plugin}
Issue: {issueTitle} ({issueUrl})
{issue.body}

Layer 3 — User Rules (from agentRules / agentRulesFile):

{agentRules string}
{contents of agentRulesFile}

The composition is done in buildPrompt():

// prompt-builder.ts (simplified)
export function buildPrompt(options: PromptOptions): string | null {
  const parts: string[] = [BASE_AGENT_PROMPT];

  if (options.projectName) {
    parts.push(`## Project Context\nProject: ${options.projectName}`);
  }
  if (options.issue) {
    parts.push(`## Task\n${options.issue.title}\n${options.issue.body}`);
  }
  if (options.agentRules) {
    parts.push(`## Project Rules\n${options.agentRules}`);
  }
  if (options.agentRulesFile) {
    const content = readFileSync(options.agentRulesFile, "utf-8");
    parts.push(`## Additional Rules\n${content}`);
  }

  return parts.join("\n\n");
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/prompt-builder.ts

8.2 Orchestrator Prompt (Meta-Agent)

The orchestrator receives a much richer prompt generated by orchestrator-prompt.ts. This prompt is essentially an operations manual:

// orchestrator-prompt.ts (key sections)
function generateOrchestratorPrompt(config, project): string {
  return `
# Agent Orchestrator — Control Prompt

You are the orchestrator for project "${project.name}".

## Quick Start
To spawn an agent for an issue: ao spawn ${project.name} <issue-id>

## Available Commands
| Command | Description |
| ao spawn | Spawn a worker session |
| ao status | Show all sessions |
| ao send | Send message to session |
| ao session kill | Kill a session |
| ao session restore | Restore a session |
| ao review-check | Check PR review status |

## Configured Reactions
${formatReactions(config.reactions)}

## Common Workflows
### Bulk Issue Processing
1. ao batch-spawn ${project.name} issue1 issue2 issue3
2. ao status (monitor progress)
3. Review PRs as they come in

### Handling Stuck Agents
1. Check status: ao status
2. Send guidance: ao send <session> "Try approach X"
3. If still stuck: ao session kill <session>; ao spawn ...
`;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/orchestrator-prompt.ts

8.3 Issue Context Injection

When spawning a session, the tracker plugin generates context from the issue:

// tracker-github/src/index.ts
async generatePrompt(issueId: string, repo: string): Promise<string> {
  const issue = await this.getIssue(issueId, repo);
  return [
    `# Issue #${issue.number}: ${issue.title}`,
    `URL: ${issue.url}`,
    `State: ${issue.state}`,
    issue.labels.length ? `Labels: ${issue.labels.join(", ")}` : "",
    "",
    issue.body,
  ].filter(Boolean).join("\n");
}

For Linear issues, the prompt includes more structured data:

// tracker-linear/src/index.ts
async generatePrompt(issueId: string): Promise<string> {
  const issue = await this.getIssue(issueId);
  return [
    `# ${issue.identifier}: ${issue.title}`,
    `URL: ${issue.url}`,
    `State: ${issue.state}`,
    `Priority: ${issue.priority}`,
    issue.labels.length ? `Labels: ${issue.labels.join(", ")}` : "",
    "",
    issue.body,
  ].filter(Boolean).join("\n");
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-github/src/index.ts (lines 90-110)
Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-linear/src/index.ts

8.4 Reaction Context (Messages to Agents)

When the reaction engine sends messages to agents, it composes context-aware messages:

// lifecycle-manager.ts (reaction execution, simplified)
if (reaction.action === "send-to-agent") {
  const message = reaction.message ?? getDefaultMessage(eventType);
  await sessionManager.send(session.id, message);
}

Default messages are event-specific, e.g.:

  • CI failed: "CI checks are failing. Please investigate the failures and fix them."
  • Changes requested: "Review feedback has been received. Please address the requested changes."
  • Merge conflicts: "There are merge conflicts. Please resolve them."
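A minimal sketch of the `getDefaultMessage` lookup referenced in the simplified reaction code. It assumes the reaction-config trigger strings ("ci.failing", "review.changes_requested", "pr.conflicts") are the keys; the fallback wording is invented here, not taken from the repo:

```typescript
// Illustrative sketch: map event triggers to the documented default messages.
// Only the three messages above come from observed behavior; the fallback is assumed.
const DEFAULT_MESSAGES: Record<string, string> = {
  "ci.failing": "CI checks are failing. Please investigate the failures and fix them.",
  "review.changes_requested": "Review feedback has been received. Please address the requested changes.",
  "pr.conflicts": "There are merge conflicts. Please resolve them.",
};

function getDefaultMessage(eventType: string): string {
  // Fall back to a generic nudge for triggers without a canned message
  return DEFAULT_MESSAGES[eventType] ?? `Event "${eventType}" occurred. Please take a look.`;
}
```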

8.5 Context Limitations

  • No conversation history: AO does not maintain or inject previous conversation context when sending messages to agents. Each message is stateless.
  • No cross-session context: If Agent A discovers something relevant to Agent B, there is no mechanism to share that context.
  • No dynamic context refresh: The agent's system prompt is set at spawn time and never updated. If the issue is updated on GitHub/Linear after spawning, the agent won't see the changes unless told explicitly.
  • No context window management: AO does not track or manage the agent's context window usage. Long-running agents may lose their initial instructions as conversation history grows.

9. Session Lifecycle

Confidence: High

9.1 State Machine

The lifecycle manager implements a state machine with the following transitions:

spawning -> working              (agent starts processing)
working -> pr_open               (agent creates PR)
working -> needs_input           (agent requests human input)
working -> stuck                 (agent appears stuck)
working -> errored               (runtime dies unexpectedly)

pr_open -> ci_failed             (CI checks fail)
pr_open -> review_pending        (CI passes, awaiting review)
pr_open -> working               (agent still working after PR creation)

ci_failed -> working             (agent fixing CI issues)
ci_failed -> pr_open             (CI re-run passes)

review_pending -> changes_requested  (reviewer requests changes)
review_pending -> approved           (reviewer approves)

changes_requested -> working     (agent addressing feedback)
approved -> mergeable            (CI passes + approved)
mergeable -> merged              (PR merged)
merged -> cleanup -> done        (workspace cleaned up)

needs_input -> working           (human sends message)
stuck -> working                 (agent resumes)

ANY -> killed                    (human kills session)
ANY -> terminated                (orchestrator terminates)
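The transition list above can be encoded as a lookup table. This sketch is illustrative only: AO derives status by polling (see determineStatus below) rather than validating transitions, and the `canTransition` helper does not exist in the repo:

```typescript
// Illustrative encoding of the state machine above; names are taken from the
// transition list, but the validation helper itself is an invention of this report.
type SessionStatus =
  | "spawning" | "working" | "pr_open" | "needs_input" | "stuck" | "errored"
  | "ci_failed" | "review_pending" | "changes_requested" | "approved"
  | "mergeable" | "merged" | "cleanup" | "done" | "killed" | "terminated";

const TRANSITIONS: Partial<Record<SessionStatus, SessionStatus[]>> = {
  spawning: ["working"],
  working: ["pr_open", "needs_input", "stuck", "errored"],
  pr_open: ["ci_failed", "review_pending", "working"],
  ci_failed: ["working", "pr_open"],
  review_pending: ["changes_requested", "approved"],
  changes_requested: ["working"],
  approved: ["mergeable"],
  mergeable: ["merged"],
  merged: ["cleanup"],
  cleanup: ["done"],
  needs_input: ["working"],
  stuck: ["working"],
};

function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  // "killed" and "terminated" are reachable from any state
  if (to === "killed" || to === "terminated") return true;
  return TRANSITIONS[from]?.includes(to) ?? false;
}
```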

9.2 Status Determination Algorithm

The determineStatus() function in lifecycle-manager.ts follows this priority order:

// lifecycle-manager.ts, determineStatus (simplified logic)
function determineStatus(session: Session): SessionStatus {
  // 1. Runtime dead?
  if (!session.runtimeHandle || !await runtime.isAlive(session.runtimeHandle)) {
    return session.pr?.merged ? "done" : "errored";
  }

  // 2. Agent activity
  const activity = await agent.getActivityState(session);
  if (activity === "waiting_input") return "needs_input";
  if (activity === "blocked") return "stuck";
  if (activity === "exited") return session.pr ? "pr_open" : "done";

  // 3. PR state
  if (session.pr) {
    const prState = await scm.getPRState(session.pr.number);
    if (prState.merged) return "merged";

    const ci = await scm.getCISummary(session.pr.number);
    if (ci === "failing") return "ci_failed";

    const review = await scm.getReviewDecision(session.pr.number);
    if (review === "changes_requested") return "changes_requested";
    if (review === "approved") {
      const mergeable = await scm.getMergeability(session.pr.number);
      if (mergeable.canMerge) return "mergeable";
      return "approved";
    }

    return "review_pending";
  }

  // 4. Default: no PR yet; treat the session as working either way
  return "working";
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/lifecycle-manager.ts (determineStatus, approximately lines 100-200)

9.3 Activity Detection (Claude Code Specific)

The Claude Code plugin provides two activity detection mechanisms:

Mechanism 1 — Terminal Output Parsing (deprecated):

// agent-claude-code/src/index.ts
classifyTerminalOutput(output: string): ActivityState {
  // Look for prompt characters: ❯ > $ #
  if (/[>$#]\s*$/.test(lastLine)) return "idle";
  if (/permission/i.test(lastLine)) return "waiting_input";
  return "active";
}

Mechanism 2 — JSONL Introspection (preferred):

// agent-claude-code/src/index.ts, getActivityState
async getActivityState(session): Promise<ActivityState> {
  // 1. Check if process is running
  const processRunning = await this.isProcessRunning(session);
  if (!processRunning) return "exited";

  // 2. Read last JSONL entry from Claude's session file
  const entry = await readLastJsonlEntry(sessionFile);

  switch (entry.type) {
    case "user":
    case "tool_use":
    case "progress":
      return "active";
    case "assistant":
    case "summary":
    case "result":
      // Check idle threshold
      if (Date.now() - entry.timestamp > readyThresholdMs) {
        return "idle";
      }
      return "active";  // "ready" maps to "active" with threshold
    case "permission_request":
      return "waiting_input";
    case "error":
      return "blocked";
  }
}

The JSONL approach reads Claude Code's internal session files (stored in ~/.claude/projects/), parsing only the last 128KB to avoid reading potentially 100MB+ files:

// agent-claude-code/src/index.ts (elided body completed as a sketch)
async parseJsonlFileTail(filePath: string): Promise<JsonlEntry[]> {
  const TAIL_BYTES = 128 * 1024;  // 128KB
  const stat = await fs.stat(filePath);
  const start = Math.max(0, stat.size - TAIL_BYTES);
  // Read from offset, split by newlines, parse each line as JSON
  const handle = await fs.open(filePath, "r");
  const buf = Buffer.alloc(stat.size - start);
  await handle.read(buf, 0, buf.length, start);
  await handle.close();
  const lines = buf.toString("utf-8").split("\n");
  if (start > 0) lines.shift();  // drop the line truncated by the offset
  return lines.filter(Boolean).map((line) => JSON.parse(line) as JsonlEntry);
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/agent-claude-code/src/index.ts (lines 300-500)

9.4 Session Cleanup

The cleanup process is multi-step:

// session-manager.ts, cleanup method
async cleanup(sessionId: string): Promise<void> {
  const session = await this.get(sessionId);

  // Check prerequisites
  if (session.pr) {
    const prState = await scm.getPRState(session.pr.number);
    if (!prState.merged) {
      throw new Error("Cannot cleanup: PR not yet merged");
    }
  }

  // 1. Destroy runtime (kill tmux session)
  if (session.runtimeHandle) {
    await runtime.destroy(session.runtimeHandle);
  }

  // 2. Destroy workspace (remove git worktree)
  if (session.workspacePath) {
    await workspace.destroy(session.workspacePath);
  }

  // 3. Archive metadata
  await archiveMetadata(session.id);
}

Notably, the workspace plugin does NOT delete the git branch when removing a worktree:

// workspace-worktree/src/index.ts, destroy method
// NOTE: Does NOT delete the branch (safety measure)
await execFile("git", ["worktree", "remove", "--force", worktreePath]);

This is a safety measure — branches are kept in case they need to be referenced later.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/workspace-worktree/src/index.ts (destroy method)

9.5 Session Restoration

Sessions can be restored from archive:

// session-manager.ts, restore method
async restore(sessionId: string): Promise<Session> {
  // 1. Find archived metadata
  const archived = await readArchivedMetadata(sessionId);

  // 2. Validate restorability
  if (!isRestorable(archived)) {
    throw new SessionNotRestorableError(sessionId, reason);
  }

  // 3. Recreate workspace if needed
  if (!await workspace.exists(archived.workspacePath)) {
    await workspace.restore(archived);
  }

  // 4. Try agent's restore command (e.g., claude --resume <uuid>)
  const restoreCmd = await agent.getRestoreCommand(archived);

  // 5. Create new runtime with restore command
  const handle = await runtime.create({
    launchCommand: restoreCmd ?? agent.getLaunchCommand(archived),
    workspacePath: archived.workspacePath,
  });

  // 6. Write new metadata
  await writeMetadata(sessionId, { ...archived, runtimeHandle: handle });

  return session;
}

The Claude Code agent supports restoration via session UUID:

// agent-claude-code/src/index.ts, getRestoreCommand
async getRestoreCommand(session): Promise<string | null> {
  // Find the Claude session UUID from JSONL files
  const sessionUuid = await findSessionUuid(session.workspacePath);
  if (!sessionUuid) return null;
  return `claude --resume ${sessionUuid}`;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/agent-claude-code/src/index.ts (getRestoreCommand method)
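The `findSessionUuid` helper referenced above is not quoted in this report. A hypothetical sketch follows, assuming Claude Code's observed on-disk layout of `~/.claude/projects/<encoded-workspace-path>/<uuid>.jsonl`; both function names and the path-encoding rule are assumptions, not confirmed repo code:

```typescript
import { readdirSync, statSync } from "node:fs";
import path from "node:path";

// Hypothetical sketch: resolve the newest Claude Code session UUID for a workspace.
// The separator-to-dash encoding mirrors Claude Code's observed directory naming,
// but is an assumption of this report.
function encodeProjectPath(workspacePath: string): string {
  return workspacePath.replace(/[/.]/g, "-");
}

function findSessionUuid(workspacePath: string, claudeDir: string): string | null {
  const projectDir = path.join(claudeDir, "projects", encodeProjectPath(workspacePath));
  let newest: { uuid: string; mtime: number } | null = null;
  try {
    for (const file of readdirSync(projectDir)) {
      if (!file.endsWith(".jsonl")) continue;
      const mtime = statSync(path.join(projectDir, file)).mtimeMs;
      const uuid = file.slice(0, -".jsonl".length);
      // Most recently written session file wins
      if (!newest || mtime > newest.mtime) newest = { uuid, mtime };
    }
  } catch {
    return null; // project directory missing: nothing to restore
  }
  return newest?.uuid ?? null;
}
```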


10. Code Quality Gates

Confidence: Medium

10.1 CI Pipeline

The CI workflow runs on GitHub Actions:

# .github/workflows/ci.yml
jobs:
  lint:
    - pnpm lint
  typecheck:
    - pnpm --filter '!@composio/ao-web' build  # Build non-web first
    - pnpm --filter @composio/ao-web typecheck  # Then check web
  test:
    - pnpm --filter '!@composio/ao-web' test
  test-web:
    - sudo apt-get install tmux  # tmux needed for integration tests
    - pnpm --filter @composio/ao-web test

Source: /tmp/ai-harness-repos/agent-orchestrator/.github/workflows/ci.yml

10.2 TypeScript Strictness

The project uses TypeScript strict mode:

// tsconfig.json (root)
{
  "compilerOptions": {
    "strict": true,
    "module": "Node16",
    "moduleResolution": "Node16"
  }
}

The CLAUDE.md file codifies conventions:

  • .js extensions in all imports (ESM requirement)
  • node: prefix for Node.js builtins
  • type keyword for type-only imports
  • Zod for runtime validation of external data

Source: /tmp/ai-harness-repos/agent-orchestrator/CLAUDE.md

10.3 Zod Validation

Configuration is validated with Zod schemas:

// config.ts
const ProjectSchema = z.object({
  repo: z.string(),
  path: z.string().optional(),
  defaultBranch: z.string().default("main"),
  sessionPrefix: z.string().optional(),
  tracker: z.object({
    plugin: z.string(),
    // ...
  }).optional(),
  scm: z.object({
    plugin: z.string(),
  }).optional(),
  // ...
});

const ConfigSchema = z.object({
  dataDir: z.string().optional(),
  port: z.number().optional(),
  defaults: DefaultsSchema.optional(),
  projects: z.record(ProjectSchema),
  notifiers: z.record(z.any()).optional(),
  reactions: z.record(ReactionSchema).optional(),
});

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/config.ts

10.4 What's Missing

  • No linting rules visible: The pnpm lint command exists but the specific ESLint/Biome configuration was not explored.
  • No test coverage requirements: No coverage thresholds or coverage reporting observed.
  • No integration test suite: The packages/integration-tests directory exists but its contents were not fully explored.
  • No end-to-end tests: No Playwright, Cypress, or similar E2E testing framework observed.
  • No API contract testing: The web API endpoints have no schema validation on responses.

11. Security & Compliance

Confidence: High

11.1 Shell Injection Prevention

The CLAUDE.md file mandates:

"Shell commands: ALWAYS use execFile with explicit argument arrays, NEVER use exec with string interpolation. Always set timeouts for child processes. Never interpolate user input into shell commands."

This is consistently followed throughout the codebase. Every shell command uses execFile:

// tmux.ts
import { execFile } from "node:child_process";

export function listSessions(): Promise<string[]> {
  return new Promise((resolve, reject) => {
    execFile("tmux", ["list-sessions", "-F", "#{session_name}"],
      { timeout: 5000 },
      (err, stdout) => { /* ... */ }
    );
  });
}

exec is never used anywhere in the codebase. This is the single most important security measure — it eliminates an entire class of command injection vulnerabilities.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/tmux.ts
Source: /tmp/ai-harness-repos/agent-orchestrator/CLAUDE.md

11.2 Path Traversal Prevention

Multiple layers of defense:

  1. Session ID validation: validateSessionId uses regex /^[a-zA-Z0-9_-]+$/ to reject path traversal characters.
// metadata.ts
export function validateSessionId(id: string): void {
  if (!/^[a-zA-Z0-9_-]+$/.test(id)) {
    throw new Error(`Invalid session ID: ${id}`);
  }
}
  2. Symlink target validation: The workspace plugin validates that symlink targets don't escape the workspace:
// workspace-worktree/src/index.ts
const resolved = path.resolve(worktreePath, link.target);
if (!resolved.startsWith(worktreePath)) {
  throw new Error(`Symlink target escapes workspace: ${link.target}`);
}
  3. URL encoding in API routes: Session IDs are encoded/decoded when used in URL paths.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/metadata.ts
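The allow-list regex can be exercised directly. This standalone check mirrors validateSessionId as a predicate (the name `isValidSessionId` is invented here for illustration):

```typescript
// The same allow-list regex used by validateSessionId: only alphanumerics,
// underscores, and hyphens, so "/", ".", and ".." can never appear.
const SESSION_ID_RE = /^[a-zA-Z0-9_-]+$/;

function isValidSessionId(id: string): boolean {
  return SESSION_ID_RE.test(id);
}
```

A traversal payload such as "../../etc/passwd" fails the check because both "." and "/" fall outside the allowed character class.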

11.3 Secret Scanning

The security CI workflow runs three checks:

# .github/workflows/security.yml
jobs:
  gitleaks:
    - uses: gitleaks/gitleaks-action@v2
      with:
        args: "--full-history"  # Scan entire git history

  dependency-review:
    - uses: actions/dependency-review-action@v4
      with:
        fail-on-severity: moderate  # Block moderate+ vulns

  npm-audit:
    - run: pnpm audit --audit-level high --prod  # Strict on prod deps

Source: /tmp/ai-harness-repos/agent-orchestrator/.github/workflows/security.yml

11.4 Credential Handling

  • No credential storage: AO does not store any credentials itself. It relies on ambient credentials (gh auth, LINEAR_API_KEY, SLACK_WEBHOOK_URL).
  • Environment variable passing: Agent sessions receive environment variables via tmux -e flags, which means they appear in ps output briefly during session creation.
  • Historical incident: The SECURITY.md documents a past token leak (OpenClaw token) that was detected and mitigated.

Source: /tmp/ai-harness-repos/agent-orchestrator/SECURITY.md

11.5 Security Gaps

  • No agent sandboxing: Agents have full filesystem and network access. A compromised agent could read credentials, exfiltrate code, or modify other worktrees.
  • No output sanitization: Agent-generated code is committed directly. No static analysis, dependency scanning, or security review of generated changes.
  • No authentication on web dashboard: The Next.js dashboard runs on localhost with no authentication. Anyone with network access to the port can view sessions, send messages, kill agents, and merge PRs.
  • No HTTPS: Dashboard uses plain HTTP on localhost.
  • No rate limiting on API endpoints: The web API has no rate limiting or abuse prevention.

12. Hooks & Automation

Confidence: High

12.1 Reaction Engine

The reaction engine is the core automation mechanism. It maps events to actions:

# agent-orchestrator.yaml.example
reactions:
  ci-failed:
    trigger: ci.failing
    action: send-to-agent
    message: "CI checks are failing. Please investigate and fix."
    retries: 2
    escalation:
      action: notify
      after: "10m"

  changes-requested:
    trigger: review.changes_requested
    action: send-to-agent
    message: "Review feedback received. Please address the changes."

  approved-and-green:
    trigger: review.approved
    condition: ci.passing
    action: notify
    message: "PR is approved and CI is green. Ready to merge."

  agent-stuck:
    trigger: agent.stuck
    action: notify
    priority: high
    escalation:
      action: notify
      after: "15m"
      priority: critical

Source: /tmp/ai-harness-repos/agent-orchestrator/agent-orchestrator.yaml.example

12.2 Reaction Execution

// lifecycle-manager.ts, executeReaction (simplified)
async executeReaction(session: Session, eventType: EventType, reaction: ReactionConfig): Promise<void> {
  const key = `${session.id}:${reaction.name}`;
  const attempts = this.reactionAttempts.get(key) ?? 0;

  // Check escalation
  if (reaction.escalation) {
    const firstAttempt = this.reactionFirstAttempt.get(key);
    const duration = firstAttempt ? Date.now() - firstAttempt : 0;

    if (
      (reaction.retries && attempts >= reaction.retries) ||
      (reaction.escalation.after && duration > parseDuration(reaction.escalation.after))
    ) {
      // Execute escalation action instead
      return this.executeAction(session, reaction.escalation);
    }
  }

  // Execute primary action
  await this.executeAction(session, reaction);
  this.reactionAttempts.set(key, attempts + 1);
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/lifecycle-manager.ts (executeReaction, approximately lines 250-330)
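The escalation check above leans on a `parseDuration` helper for config strings like "10m" and "15m". The repo's implementation was not quoted; a minimal sketch with an assumed s/m/h/d unit grammar:

```typescript
// Minimal sketch of a duration parser for values like "90s", "10m", "2h".
// The actual unit grammar in the repo may differ.
const UNIT_MS: Record<string, number> = {
  s: 1_000,
  m: 60_000,
  h: 3_600_000,
  d: 86_400_000,
};

function parseDuration(value: string): number {
  const match = /^(\d+)([smhd])$/.exec(value.trim());
  if (!match) throw new Error(`Invalid duration: ${value}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}
```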

12.3 Post-Tool-Use Hook (Claude Code)

The Claude Code plugin installs a PostToolUse hook that monitors agent actions:

#!/bin/bash
# METADATA_UPDATER_SCRIPT (embedded in agent-claude-code/src/index.ts)
# Detects PR creation, branch switches, and PR merges

METADATA_FILE="$AO_METADATA_PATH"

case "$TOOL_NAME" in
  "Bash")
    # Detect: gh pr create
    if echo "$TOOL_OUTPUT" | grep -q "github.com.*pull/"; then
      PR_URL=$(echo "$TOOL_OUTPUT" | grep -o "https://github.com[^ ]*pull/[0-9]*")
      PR_NUM=$(echo "$PR_URL" | grep -o "[0-9]*$")
      echo "pr_number=$PR_NUM" >> "$METADATA_FILE"
      echo "pr_url=$PR_URL" >> "$METADATA_FILE"
    fi

    # Detect: git checkout -b / git switch -c
    if echo "$TOOL_INPUT" | grep -qE "git (checkout -b|switch -c)"; then
      BRANCH=$(echo "$TOOL_INPUT" | grep -oE "(checkout -b|switch -c) [^ ]+" | awk '{print $NF}')
      echo "branch=$BRANCH" >> "$METADATA_FILE"
    fi

    # Detect: gh pr merge
    if echo "$TOOL_INPUT" | grep -q "gh pr merge"; then
      echo "pr_merged=true" >> "$METADATA_FILE"
    fi
    ;;
esac

This hook runs inside Claude Code's process and updates the session metadata file in real-time, without waiting for the next lifecycle poll.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/agent-claude-code/src/index.ts (METADATA_UPDATER_SCRIPT, approximately lines 30-80)

12.4 Post-Create Workspace Hooks

After creating a workspace, the system can run arbitrary commands:

# agent-orchestrator.yaml.example
projects:
  my-project:
    postCreate:
      - "npm install"
      - "npm run build"

These are executed via execFile in the worktree directory after creation and symlinking.
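A sketch of how a postCreate runner consistent with the execFile-only rule might look. The whitespace split and the 10-minute timeout are assumptions of this report, not confirmed repo behavior (the repo may invoke a shell instead):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Illustrative sketch: run each configured postCreate command inside the new
// worktree, sequentially, with no shell interpolation.
async function runPostCreate(commands: string[], worktreePath: string): Promise<void> {
  for (const command of commands) {
    // Naive whitespace split into program + args (an assumption; quoting is not handled)
    const [program, ...args] = command.split(/\s+/);
    await execFileAsync(program, args, {
      cwd: worktreePath,
      timeout: 10 * 60 * 1000, // cap long installs at 10 minutes (assumed value)
    });
  }
}
```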

12.5 Default Reactions

The config loader applies sensible defaults if no reactions are configured:

// config.ts (applyDefaultReactions, simplified)
const DEFAULT_REACTIONS = {
  "ci-failed": { trigger: "ci.failing", action: "send-to-agent" },
  "changes-requested": { trigger: "review.changes_requested", action: "send-to-agent" },
  "bugbot-comments": { trigger: "review.automated_comments", action: "send-to-agent" },
  "merge-conflicts": { trigger: "pr.conflicts", action: "send-to-agent" },
  "approved-and-green": { trigger: "review.approved", condition: "ci.passing", action: "notify" },
  "agent-stuck": { trigger: "agent.stuck", action: "notify", priority: "high" },
  "agent-needs-input": { trigger: "agent.needs_input", action: "notify" },
  "agent-exited": { trigger: "agent.exited", action: "notify" },
  "all-complete": { trigger: "orchestrator.all_complete", action: "notify" },
};

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/config.ts

12.6 Auto-Merge Example

The auto-merge.yaml example shows aggressive automation:

# examples/auto-merge.yaml
reactions:
  auto-merge:
    trigger: review.approved
    condition: ci.passing
    action: auto-merge
    message: "Auto-merging approved PR with passing CI."

This allows PRs to be merged automatically when they have both approval and passing CI, with no human confirmation step.

Source: /tmp/ai-harness-repos/agent-orchestrator/examples/auto-merge.yaml


13. CLI & UX

Confidence: High

13.1 Command Structure

ao
├── init              # Interactive setup wizard
├── start             # Start orchestrator + dashboard
├── stop              # Stop orchestrator + dashboard
├── status            # Show session status table
├── spawn             # Spawn a single agent session
├── batch-spawn       # Spawn multiple sessions
├── send              # Send message to a session
├── review-check      # Check PR review status
├── dashboard         # Open dashboard in browser
├── open              # Open terminal for a session
└── session
    ├── ls            # List sessions
    ├── kill          # Kill a session
    ├── cleanup       # Clean up completed sessions
    └── restore       # Restore a killed session

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/index.ts

13.2 Init Wizard

The ao init command provides an interactive setup experience:

// packages/cli/src/commands/init.ts
// Detects environment:
// - git repo presence and remote URL
// - default branch
// - tmux availability
// - gh CLI and authentication
// - LINEAR_API_KEY presence
// - SLACK_WEBHOOK_URL presence
// - Project type (package.json, Cargo.toml, etc.)

It offers an --auto mode for non-interactive setup, plus a --smart flag whose AI-powered rule generation (inferred from the project structure) is still marked TODO.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/init.ts

13.3 Status Display

The ao status command renders a rich terminal table:

Session      Branch              PR    CI        Review    Threads  Activity  Age
fix-auth-1   feat/issue-42       #123  passing   approved  0        active    2h
add-api-2    feat/issue-43       #124  failing   pending   2        idle      1h
refactor-3   feat/issue-44       —     —         —         —        working   30m

Data is gathered in parallel for responsiveness:

// status.ts (simplified)
const sessions = await sessionManager.list();
const enriched = await Promise.all(
  sessions.map(async (s) => {
    const [prState, ci, review] = await Promise.all([
      scm?.getPRState(s.pr?.number),
      scm?.getCISummary(s.pr?.number),
      scm?.getReviewDecision(s.pr?.number),
    ]);
    return { ...s, prState, ci, review };
  })
);

The status command also has a fallback mode for when no config exists — it discovers tmux sessions directly:

// status.ts
// Fallback: discover tmux sessions matching ao- pattern
const tmuxSessions = await listTmuxSessions();
const aoSessions = tmuxSessions.filter(name => name.match(/^[a-f0-9]{12}-/));

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/status.ts

13.4 CLI Quality

Strengths:

  • Clean Commander.js structure with proper subcommands
  • Parallel data fetching for responsive output
  • Graceful degradation (fallback when config missing)
  • Confirmation prompts for destructive operations (kill, restore)
  • Summary reports for batch operations

Limitations:

  • No color/formatting library (raw console.log)
  • No progress indicators for long operations
  • No --json output flag for scripting
  • No shell completion support
  • No --dry-run for spawn/batch-spawn (only for cleanup)

14. Cost & Usage Visibility

Confidence: Medium

14.1 Cost Extraction

The Claude Code plugin extracts cost data from agent session JSONL files:

// agent-claude-code/src/index.ts, extractCost
async extractCost(session): Promise<CostInfo | null> {
  const entries = await this.parseJsonlFileTail(sessionFile);

  let totalCostUsd = 0;
  let inputTokens = 0;
  let outputTokens = 0;

  for (const entry of entries) {
    if (entry.costUSD) {
      totalCostUsd += entry.costUSD;
    }
    if (entry.usage) {
      inputTokens += entry.usage.input_tokens ?? 0;
      outputTokens += entry.usage.output_tokens ?? 0;
    }
  }

  // Rough estimate if no costUSD field
  if (totalCostUsd === 0 && (inputTokens > 0 || outputTokens > 0)) {
    // Sonnet 4.5 pricing as default
    totalCostUsd = (inputTokens * 3 + outputTokens * 15) / 1_000_000;
  }

  return { totalCostUsd, inputTokens, outputTokens };
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/agent-claude-code/src/index.ts (extractCost method)
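A worked example of the fallback pricing: at $3 per million input tokens and $15 per million output tokens, a session with 200k input and 20k output tokens is estimated at (200,000 × 3 + 20,000 × 15) / 1,000,000 = $0.90:

```typescript
// The fallback estimate from extractCost, isolated (Sonnet-class pricing:
// $3/M input tokens, $15/M output tokens).
function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens * 3 + outputTokens * 15) / 1_000_000;
}
```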

14.2 What's Tracked

  • Token usage: Input and output token counts from JSONL entries
  • Cost estimates: Either from explicit costUSD fields or rough estimates using Sonnet 4.5 pricing
  • Session duration: Computed from handle.data.createdAt in the runtime plugin

14.3 What's NOT Tracked

  • No aggregated cost view: No total cost across all sessions, projects, or time periods
  • No budget limits: No mechanism to set a maximum spend per session, project, or globally
  • No cost alerts: No notification when spend exceeds a threshold
  • No automatic shutoff: No kill-switch when costs escalate
  • No per-agent model pricing: The rough estimate uses Sonnet 4.5 pricing regardless of which model the agent actually uses
  • No API call counting: GitHub API calls (which have rate limits) are not tracked
  • No cost display in CLI: The ao status command does not show cost information
  • No cost display on dashboard: The dashboard does not show per-session or aggregate costs

14.4 Rate Limit Awareness

The dashboard does detect GitHub API rate limiting:

// Dashboard.tsx, lines 90-93
const anyRateLimited = useMemo(
  () => sessions.some((s) => s.pr && isPRRateLimited(s.pr)),
  [sessions],
);

When rate-limited, a warning banner is shown explaining that PR data may be stale. This is a good UX touch but is reactive rather than preventive.


15. Tooling & Dependencies

Confidence: High

15.1 Runtime Dependencies

| Dependency | Purpose | Version Constraint |
| --- | --- | --- |
| Node.js | Runtime | >= 20 |
| pnpm | Package manager | 9.15.4 (exact) |
| tmux | Terminal multiplexer | Required |
| git | Version control | >= 2.25 (worktree support) |
| gh | GitHub CLI | Required for GitHub integration |
| TypeScript | Language | Strict mode, ESM |
| Next.js | Web dashboard | App Router |
| Commander.js | CLI framework | |
| Zod | Schema validation | |

15.2 External Tool Requirements

The system has hard dependencies on external CLI tools:

  1. tmux: Required for process isolation. No fallback. Version checked at runtime via isTmuxAvailable().
  2. git: Required for workspace management. Must support worktrees (Git 2.25+).
  3. gh: Required for GitHub integration (SCM + tracker). Must be authenticated (gh auth status).
  4. Claude Code CLI: Required for the primary agent plugin. Must be installed and configured.

15.3 Optional Dependencies

  • LINEAR_API_KEY: Required only if using the Linear tracker plugin
  • SLACK_WEBHOOK_URL: Required only if using the Slack notifier plugin
  • Composio SDK: Alternative transport for Linear integration
  • iTerm2: Optional terminal integration for macOS

15.4 Build System

// package.json (root)
{
  "scripts": {
    "build": "turbo build",
    "dev": "turbo dev",
    "lint": "turbo lint",
    "test": "turbo test",
    "typecheck": "turbo typecheck",
    "release": "changeset publish"
  }
}

The project uses Turborepo for monorepo build orchestration and Changesets for release management.

15.5 Platform Support

  • macOS: Primary development platform. Desktop notifications use osascript.
  • Linux: Supported. Desktop notifications use notify-send.
  • Windows: Not explicitly supported. tmux is not available natively on Windows (would require WSL).

16. External Integrations

Confidence: High

16.1 GitHub Integration (Deep)

The GitHub integration is the most developed external integration, implemented across two plugins:

scm-github (581 lines):

  • PR detection: gh pr list --head <branch>
  • PR state: gh pr view --json state,title,number,url,additions,deletions,files
  • PR merge: gh pr merge --squash --delete-branch
  • CI checks: gh pr checks --json name,state,conclusion
  • CI summary: Fail-closed logic for open PRs
  • Reviews: gh pr view --json reviews,reviewDecision
  • Pending comments: GraphQL query for review thread resolution status
  • Automated comments: REST API filtering by BOT_AUTHORS
  • Mergeability: Composite check (state + mergeable + CI + reviews + conflicts + draft)

The fail-closed CI summary is notable:

// scm-github/src/index.ts, getCISummary
async getCISummary(prNumber: number, repo: string): Promise<CIStatus> {
  try {
    const checks = await this.getCIChecks(prNumber, repo);
    // ... analyze checks
  } catch (err) {
    // For open PRs, fail closed — report "failing" on error
    // This prevents auto-merge when we can't verify CI status
    const prState = await this.getPRState(prNumber, repo);
    if (prState.state === "open") {
      return CI_STATUS.FAILING;
    }
    return CI_STATUS.NONE;
  }
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/scm-github/src/index.ts

tracker-github (304 lines):

  • Issue CRUD: get, create, update, close/reopen
  • Issue listing with filters (state, label, assignee)
  • Branch name generation: feat/issue-{number}
  • Prompt generation from issue content

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-github/src/index.ts

16.2 Linear Integration

The Linear plugin (722 lines) is the second most developed integration:

Dual transport:

// tracker-linear/src/index.ts
// Method 1: Direct API
if (process.env.LINEAR_API_KEY) {
  this.transport = "direct";
  this.apiKey = process.env.LINEAR_API_KEY;
}
// Method 2: Composio SDK
else if (composioAvailable) {
  this.transport = "composio";
}

State mapping:

const STATE_MAP: Record<string, IssueState> = {
  triage: "open",
  backlog: "open",
  unstarted: "open",
  started: "in_progress",
  completed: "closed",
  canceled: "cancelled",
};

Full GraphQL API: Issues, labels, teams, workflow states, assignees, comments.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-linear/src/index.ts

16.3 Slack Integration

Rich Block Kit messages with structured formatting:

// notifier-slack/src/index.ts
async notify(event: NotificationEvent): Promise<void> {
  const blocks = [
    {
      type: "header",
      text: { type: "plain_text", text: event.title },
    },
    {
      type: "section",
      text: { type: "mrkdwn", text: event.body },
    },
    {
      type: "context",
      elements: [
        { type: "mrkdwn", text: `*Priority:* ${priorityEmoji(event.priority)} ${event.priority}` },
        { type: "mrkdwn", text: `*Session:* ${event.sessionId}` },
      ],
    },
  ];

  if (event.prUrl) {
    blocks.push({
      type: "actions",
      elements: [{
        type: "button",
        text: { type: "plain_text", text: "View PR" },
        url: event.prUrl,
      }],
    });
  }

  await fetch(this.webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ blocks }),
  });
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/notifier-slack/src/index.ts

16.4 Desktop Notifications

Platform-specific implementations:

// notifier-desktop/src/index.ts
if (process.platform === "darwin") {
  // macOS: osascript
  await execFile("osascript", [
    "-e", `display notification "${body}" with title "${title}"${sound ? " sound name \"Ping\"" : ""}`,
  ]);
} else {
  // Linux: notify-send
  await execFile("notify-send", [
    ...(urgency === "critical" ? ["--urgency=critical"] : []),
    title,
    body,
  ]);
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/notifier-desktop/src/index.ts


17. Operational Assumptions & Prerequisites

Confidence: High

17.1 Hard Requirements

  1. Single machine: The entire system runs on one machine. No distributed execution.
  2. Unix-like OS: macOS or Linux required (tmux, POSIX shell commands).
  3. tmux installed: No alternative runtime is production-ready.
  4. Git 2.25+: Worktree support required.
  5. Node.js 20+: ESM module support and modern APIs.
  6. Agent CLI installed: At minimum, Claude Code CLI must be available.
  7. GitHub authentication: gh auth login must be completed for GitHub features.

17.2 Soft Requirements

  1. Stable network: Agents need internet access for API calls; dashboard needs GitHub API access for enrichment.
  2. Sufficient disk space: Each worktree is a full checkout. Many concurrent sessions require proportional disk.
  3. API rate limits: GitHub API has 5000 requests/hour for authenticated users. With many sessions and 30s polling, this budget can be consumed quickly.
  4. Agent API keys: Claude API key, OpenAI API key, etc. must be configured in the agent's own config.
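To make the rate-limit pressure concrete, here is a back-of-envelope calculation. The two-calls-per-poll figure is an assumption for illustration, not a number measured from the code:

```typescript
// Hypothetical budget check: GitHub allows 5000 authenticated requests/hour.
const rateLimitPerHour = 5000;
const sessions = 50;            // concurrent sessions (assumed upper bound)
const callsPerPoll = 2;         // e.g. PR status + CI checks per session (assumed)
const pollsPerHour = 3600 / 30; // 30-second lifecycle polling

const requestsPerHour = sessions * callsPerPoll * pollsPerHour;
console.log(requestsPerHour);   // 12000 — more than double the 5000/hr budget
```

At 50 sessions the polling loop alone would exhaust the hourly budget in under half an hour, which is consistent with the scaling limits described in the next subsection.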

17.3 Scaling Assumptions

The system was designed for 10-50 concurrent sessions on a single developer machine. Evidence:

  • Session list is loaded entirely into memory (no pagination)
  • Dashboard renders all sessions in a single view
  • Lifecycle polling is a single loop with no sharding
  • Metadata is scanned by filesystem directory listing
  • No connection pooling for GitHub API calls

Beyond 50 sessions, expect:

  • GitHub API rate limiting (5000 req/hr shared across all sessions)
  • Lifecycle poll cycles exceeding 30 seconds
  • Dashboard becoming sluggish with many cards
  • tmux session management overhead

17.4 Configuration Assumptions

The system assumes a specific project structure:

  • Git repository with remote named origin
  • A single default branch (main/master)
  • Issues tracked in GitHub Issues or Linear
  • PRs created on GitHub (no GitLab, Bitbucket, etc.)
  • Squash merge strategy (hardcoded: gh pr merge --squash --delete-branch)

18. Failure Modes & Recovery

Confidence: High

18.1 Spawn Failures

The spawn sequence has cascading cleanup:

Step 1 fails (issue validation):     -> No cleanup needed
Step 2 fails (session ID):           -> No cleanup needed
Step 3 fails (workspace creation):   -> Delete session directory
Step 4 fails (post-create hooks):    -> Destroy workspace + delete session directory
Step 5 fails (runtime creation):     -> Destroy workspace + delete session directory
Step 6 fails (launch command):       -> Destroy runtime + workspace + delete session directory
Step 7 fails (metadata write):       -> Destroy runtime + workspace + delete session directory
Step 8 fails (post-launch setup):    -> Destroy runtime + workspace + delete session directory

Each step's failure handler cleans up all previously completed steps. This is implemented with nested try/catch blocks.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/session-manager.ts (spawn method)
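The nested try/catch structure can be sketched as follows. This is an illustration of the cascading-cleanup pattern, not the actual session-manager code; the dependency names are hypothetical:

```typescript
// Sketch of cascading cleanup: each later step's catch block unwinds
// everything that succeeded before it, then rethrows the original error.
async function spawnSketch(deps: {
  createWorkspace(): Promise<void>;
  destroyWorkspace(): Promise<void>;
  createRuntime(): Promise<void>;
  destroyRuntime(): Promise<void>;
  writeMetadata(): Promise<void>;
  deleteSessionDir(): Promise<void>;
}): Promise<void> {
  await deps.createWorkspace();
  try {
    await deps.createRuntime();
    try {
      await deps.writeMetadata();
    } catch (err) {
      await deps.destroyRuntime(); // undo step 5 before the outer unwind
      throw err;
    }
  } catch (err) {
    await deps.destroyWorkspace(); // undo step 3
    await deps.deleteSessionDir(); // undo step 2's reservation
    throw err;
  }
}
```

The key property is ordering: cleanup runs in reverse order of creation, so a failure at any depth leaves no orphaned worktrees or tmux sessions behind.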

18.2 Runtime Crashes

If a tmux session dies unexpectedly:

  1. Detection: Next lifecycle poll checks runtime.isAlive() -> returns false
  2. Status update: Session status set to "errored" (if no PR) or "pr_open" (if PR exists)
  3. Notification: "agent-exited" reaction fires, notifying the human
  4. Recovery: Human can ao session restore to restart from archive

The 30-second polling interval means up to 30 seconds can pass before a crash is detected.

18.3 Agent Stuck Detection

The activity detection system identifies stuck agents:

// Stuck: agent process running but no JSONL activity for extended period
if (activity === "idle" && idleDuration > stuckThresholdMs) {
  return "stuck";
}

The default stuck threshold is not explicitly documented, but the reaction config accepts an after duration for escalation:

reactions:
  agent-stuck:
    trigger: agent.stuck
    action: notify
    escalation:
      action: notify
      after: "15m"
      priority: critical

18.4 GitHub API Failures

Rate Limiting:

  • PR enrichment has a 4-second timeout
  • If enrichment fails, dashboard shows stale data with a rate-limit warning banner
  • CI summary uses fail-closed: errors -> "failing" status (prevents false merges)

Network Failures:

  • SCM calls use execFile with timeouts
  • Transient failures in lifecycle polling are caught by Promise.allSettled
  • Failed polls are skipped; the next cycle retries

18.5 Metadata Corruption

The flat-file metadata format is simple but fragile:

  • Append-only: Multiple values for the same key are resolved by taking the last one
  • No atomicity: If the process crashes mid-write, the file could be truncated
  • No locking: Multiple writers (agent hook + lifecycle manager) could race

The PostToolUse hook (bash script) appends to the metadata file:

echo "pr_number=$PR_NUM" >> "$METADATA_FILE"

While the lifecycle manager reads the file:

const metadata = parseMetadataFile(metadataPath);

There is no file locking between these operations. In practice, this is unlikely to cause issues because:

  1. The hook and lifecycle manager write different keys
  2. The file is small (< 1KB typically)
  3. POSIX append semantics are usually atomic for small writes

But it is a theoretical correctness gap.
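The last-one-wins resolution described above can be sketched in a few lines. The function name and exact signature are illustrative; AO's parseMetadataFile may differ in detail:

```typescript
// Sketch of last-wins parsing for the append-only key=value metadata format.
// A later appended "pr_number=42" supersedes any earlier value for the same
// key without the file ever being rewritten in place.
function parseMetadata(text: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const line of text.split("\n")) {
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // skip blank and malformed lines
    result[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return result;
}
```

This is why the append-from-hook pattern mostly works: appends never invalidate earlier lines, they only shadow them.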

18.6 Worktree Cleanup Failures

If worktree removal fails (e.g., locked files), the plugin falls back to rmSync:

// workspace-worktree/src/index.ts, destroy
try {
  await execFile("git", ["worktree", "remove", "--force", worktreePath]);
} catch {
  // Fallback: force-remove the directory
  fs.rmSync(worktreePath, { recursive: true, force: true });
}

After force removal, stale worktree entries remain in git's worktree list. The restore function handles this:

// workspace-worktree/src/index.ts, restore
await execFile("git", ["worktree", "prune"]);  // Clean stale entries

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/workspace-worktree/src/index.ts

18.7 Service Initialization Failures

The web service singleton has retry logic:

// services.ts
globalForServices._aoServicesInit = initServices().catch((err) => {
  globalForServices._aoServicesInit = undefined;  // Clear for retry
  throw err;
});

If initialization fails (e.g., config file missing), the cached promise is cleared so the next request triggers a fresh attempt rather than permanently returning the cached error.


19. Governance & Guardrails

Confidence: Medium

19.1 What Guardrails Exist

  1. Confirmation dialogs: Kill and restore operations require user confirmation in the dashboard.
  2. Fail-closed CI: Unknown CI status is treated as "failing" for open PRs.
  3. Session ID validation: Regex-based validation prevents path traversal.
  4. Symlink target validation: Prevents workspace escape.
  5. Shell injection prevention: execFile everywhere, never exec.
  6. Issue validation: Sessions cannot be spawned for non-existent issues.
  7. Duplicate detection: Batch spawn checks for existing sessions with the same issue.

19.2 What Guardrails Are Missing

  1. No approval gates before merge: Auto-merge has no additional safety check beyond CI + review.
  2. No diff size limits: Agents can create arbitrarily large PRs.
  3. No file restriction: Agents can modify any file in the repository, including CI configs, security policies, and deployment scripts.
  4. No branch protection enforcement: AO doesn't verify that branch protection rules are configured on the target branch.
  5. No code review requirements: Auto-merge can bypass the "requires review" setting if the GitHub config allows it.
  6. No cost limits: No budget ceiling per session or project.
  7. No concurrency limits: No maximum number of concurrent sessions.
  8. No time limits: Sessions can run indefinitely.
  9. No output validation: Generated code is not scanned for vulnerabilities or malicious content.

19.3 Permission Model

There is no permission model. The system runs with the credentials of the user who started it:

  • Git operations use the user's SSH keys or HTTPS tokens
  • GitHub API uses the user's gh auth session
  • Claude Code uses the user's API key
  • tmux runs as the current user

Any agent can perform any action the user can perform.

19.4 Audit Trail

The audit trail consists of:

  • Metadata files: Show session creation time, issue, branch, PR number
  • Archived metadata: Preserved after session cleanup
  • Git history: All agent commits are in the git log
  • Claude Code JSONL: Complete agent conversation history
  • tmux capture: Terminal output can be captured (but not automatically persisted)

There is no centralized audit log, no event store, and no structured logging of orchestration decisions.


20. Roadmap & Evolution Signals

Confidence: Medium

20.1 TODO Items in Code

Several TODO markers indicate planned features:

  1. AI-powered init (init.ts):
     // --auto --smart mode
     // TODO: AI-powered rule generation based on project structure
  2. Custom plugin loading (plugin-registry.ts):
     // loadFromConfig() — delegates to loadBuiltins,
     // reserved for future custom plugin loading
  3. Process runtime (plugin-registry.ts): Listed as a built-in plugin but implementation not observed. This would allow running agents without tmux.
  4. Clone workspace (plugin-registry.ts): Listed as a built-in but likely less developed than the worktree plugin. Would provide full repository clones instead of worktrees.

20.2 Architecture Signals

  1. Plugin slots for Terminal (iterm2, web): Suggests plans for richer terminal integration beyond basic tmux.
  2. Lifecycle plugin slot: Suggests plans for customizable state machines, possibly for different workflow patterns.
  3. Composio notifier: Integration with Composio's platform suggests a path toward SaaS deployment.
  4. Webhook notifier: Generic webhook support enables integration with any service.
  5. Multiple tracker support: GitHub + Linear suggests plans for Jira, Asana, etc.

20.3 Maturity Assessment

| Component | Maturity | Evidence |
|---|---|---|
| Core types | High | 1084 lines, comprehensive, well-structured |
| Session manager | High | ~1100 lines, thorough error handling |
| Lifecycle manager | High | 587 lines, reaction engine, escalation |
| Config system | High | Zod validation, defaults, collision detection |
| Claude Code plugin | High | 786 lines, deep integration |
| GitHub SCM | High | 581 lines, fail-closed CI, GraphQL |
| Linear tracker | High | 722 lines, dual transport |
| tmux runtime | Medium | 184 lines, functional but basic |
| Web dashboard | Medium | Functional UI, basic SSE |
| CLI | Medium | Feature-complete but sparse UX |
| Other agent plugins | Low | Likely thin or placeholder |
| Process runtime | Low | Listed but not observed |
| Clone workspace | Low | Listed but not fully developed |
| Composio notifier | Unknown | Mentioned but not explored |

20.4 Version and Release Status

The project uses Changesets for release management, indicating it follows semver and publishes to npm under the @composio/ao-* namespace. The presence of .github/workflows/ci.yml and security.yml suggests active CI/CD.


21. What to Borrow / Adapt into Maestro

Confidence: High

21.1 Strongly Recommended to Borrow

21.1.1 Plugin Architecture Pattern

The eight-slot plugin system is clean and extensible. The PluginManifest + PluginModule pattern with type-safe registry is worth adopting:

interface PluginModule<T> {
  manifest: { name: string; slot: string; version: string };
  create: (ctx?: PluginContext) => T | Promise<T>;
}

Why: It enables swapping implementations without touching core logic. Adding a new agent, runtime, or tracker is a self-contained operation.

Adaptation for Maestro: Consider adding a capabilities field to the manifest for feature-flag-based plugin selection, and a healthCheck() method for runtime validation.
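One way the suggested capabilities and healthCheck extensions could look. All names below are illustrative, not taken from the AO codebase:

```typescript
// Illustrative extension of AO's PluginModule pattern with the suggested
// `capabilities` field and `healthCheck` hook. Names are hypothetical.
interface ExtendedManifest {
  name: string;
  slot: string;
  version: string;
  capabilities?: string[]; // feature flags used for plugin selection
}

interface ExtendedPlugin<T = unknown> {
  manifest: ExtendedManifest;
  create: () => T | Promise<T>;
  healthCheck?: () => Promise<boolean>; // runtime validation before use
}

class PluginRegistry {
  private plugins = new Map<string, ExtendedPlugin>();

  register(plugin: ExtendedPlugin): void {
    this.plugins.set(`${plugin.manifest.slot}/${plugin.manifest.name}`, plugin);
  }

  // Return the first plugin in a slot advertising the given capability.
  findByCapability(slot: string, capability: string): ExtendedPlugin | undefined {
    for (const [key, p] of this.plugins) {
      if (key.startsWith(`${slot}/`) && p.manifest.capabilities?.includes(capability)) {
        return p;
      }
    }
    return undefined;
  }
}
```

Capability-based lookup lets the orchestrator ask for "a workspace plugin that supports shared object stores" rather than hardcoding plugin names in core logic.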

21.1.2 Fail-Closed CI Status

The pattern of reporting "failing" when CI status is unknown for open PRs is a critical safety measure:

// On API error for open PRs: return "failing" not "none"

Why: Prevents auto-merge of PRs when we can't verify CI status. This is a security-relevant design decision.

Adaptation for Maestro: Apply this pattern to all safety-critical status checks. When in doubt, assume the worst case.
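The fail-closed decision reduces to a single function. This is a sketch of the pattern, not AO's actual implementation:

```typescript
type CiStatus = "passing" | "failing" | "pending" | "none";

// Fail-closed: for an open PR, an unreadable CI state is reported as
// "failing" so downstream automation (auto-merge, reactions) never acts
// on an unknown status. Closed/merged PRs can safely report "none".
function resolveCiStatus(
  prState: "open" | "merged" | "closed",
  fetched: CiStatus | Error,
): CiStatus {
  if (fetched instanceof Error) {
    return prState === "open" ? "failing" : "none";
  }
  return fetched;
}
```

The asymmetry is deliberate: a false "failing" costs a human a glance at the dashboard, while a false "passing" could merge unverified code.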

21.1.3 Reaction Engine with Escalation

The reaction engine pattern (event -> action, with retries and escalation) is composable and user-configurable:

reactions:
  ci-failed:
    trigger: ci.failing
    action: send-to-agent
    retries: 2
    escalation:
      action: notify
      after: "10m"
      priority: critical

Why: It separates orchestration policy from orchestration mechanism. Users can customize behavior without modifying code.

Adaptation for Maestro: Add more complex conditions (boolean logic, state predicates), support for custom action types, and a reaction history log.
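A minimal matching step for such rules might look like the following. The field names mirror the YAML above, but the evaluation logic is an assumption about how such an engine could work, not a transcription of AO's code:

```typescript
interface Reaction {
  trigger: string;    // event name, e.g. "ci.failing"
  condition?: string; // optional state predicate, e.g. "ci.passing"
  action: "notify" | "send-to-agent" | "auto-merge";
  retries?: number;
}

// Match an incoming event against configured reactions, honoring the
// optional condition against the session's current state set.
function matchReactions(
  reactions: Record<string, Reaction>,
  event: string,
  state: Set<string>,
): Array<[string, Reaction]> {
  return Object.entries(reactions).filter(
    ([, r]) => r.trigger === event && (!r.condition || state.has(r.condition)),
  );
}
```

Extending condition from a single string to a boolean expression tree is the natural next step suggested above.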

21.1.4 Atomic Session ID Reservation

Using O_EXCL flag for race-condition-safe session creation:

// O_CREAT | O_EXCL (from fs.constants): open fails if the path already exists
await fs.open(sessionDir, constants.O_CREAT | constants.O_EXCL);

Why: Prevents two concurrent spawn operations from creating sessions with the same ID.

Adaptation for Maestro: Use this pattern for any resource reservation that must be atomic.

21.1.5 Hash-Based Directory Namespacing

SHA-256 hash of config path for globally unique directories:

createHash("sha256").update(configDir).digest("hex").slice(0, 12);

Why: Prevents collisions between multiple projects/configurations on the same machine. Simple but effective.

21.1.6 Shell Security (execFile Always)

The discipline of never using exec with string interpolation is worth codifying:

// ALWAYS this:
execFile("git", ["checkout", "-b", branchName]);
// NEVER this:
exec(`git checkout -b ${branchName}`);

Adaptation for Maestro: Make this a lint rule. Block exec and execSync in ESLint config.
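A flat-config rule enforcing this could look like the sketch below. The AST selector assumes exec/execSync are called by bare name; import styles like cp.exec would need an additional selector:

```typescript
// eslint.config.ts — sketch of a rule banning string-based shell execution.
// Uses the built-in no-restricted-syntax rule with an esquery selector.
export default [
  {
    rules: {
      "no-restricted-syntax": [
        "error",
        {
          selector: "CallExpression[callee.name=/^(exec|execSync)$/]",
          message:
            "Use execFile with an argument array; never interpolate into a shell string.",
        },
      ],
    },
  },
];
```

Pairing the lint rule with a CI check makes the execFile discipline enforceable rather than conventional.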

21.2 Worth Borrowing with Modifications

21.2.1 Activity Detection via Agent Internals

Reading Claude Code's JSONL session files for activity detection is clever but tightly coupled:

// Read last 128KB of JSONL, parse last entry type
const entry = await readLastJsonlEntry(sessionFile);

Why to borrow: Much more accurate than terminal output parsing. Knows exactly what the agent is doing.

Modification needed: Abstract this behind the Agent interface more cleanly. Each agent plugin should expose standardized activity signals rather than having the orchestrator parse agent-specific file formats.
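The underlying tail-read mechanic is simple enough to standardize. A sketch follows, using the 128 KB figure quoted above; the helper name is hypothetical:

```typescript
import { promises as fs } from "node:fs";

// Read the tail of a JSONL file and return the last complete entry.
// Reading only the final chunk keeps the cost O(1) in file size, which
// matters for long-running agent sessions with large transcripts.
async function readLastJsonlEntry(path: string, tailBytes = 128 * 1024): Promise<unknown> {
  const handle = await fs.open(path, "r");
  try {
    const { size } = await handle.stat();
    const start = Math.max(0, size - tailBytes);
    const buf = Buffer.alloc(size - start);
    await handle.read(buf, 0, buf.length, start);
    const lines = buf.toString("utf8").split("\n").filter((l) => l.trim() !== "");
    for (let i = lines.length - 1; i >= 0; i--) {
      try {
        return JSON.parse(lines[i]); // skip a possibly-truncated first line
      } catch {
        continue;
      }
    }
    return undefined;
  } finally {
    await handle.close();
  }
}
```

An agent plugin could wrap this behind a standardized activity() method so the orchestrator never touches agent-specific file formats directly.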

21.2.2 Three-Layer Prompt Composition

The base + config + user rules approach is sound but rigid:

Modification needed: Add support for:

  • Template variables in prompts
  • Conditional sections based on project type
  • Prompt versioning and A/B testing
  • Dynamic context injection (e.g., related PR context, dependency graph)

21.2.3 Worktree-Based Isolation

Git worktrees are efficient (shared object store) but have limitations:

Modification needed: Support both worktrees (for speed) and full clones (for complete isolation). Consider container-based isolation for stronger security boundaries.

21.2.4 Dashboard Attention Levels

The Kanban grouping by attention level (working/pending/review/respond/merge/done) is intuitive:

Modification needed: Make the attention levels configurable. Different teams may have different workflows and priority signals.

21.3 Not Recommended to Borrow

21.3.1 Flat-File Metadata

The key=value text file approach is too fragile for production:

  • No atomicity guarantees
  • No schema evolution support
  • No query capability (must read all files to list sessions)
  • Race conditions between writers

Alternative for Maestro: Use SQLite (embedded, zero-config, ACID) or a structured file format (JSON with atomic rename-based writes).
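Atomic rename-based writes are straightforward on POSIX filesystems. A sketch, with the caveat that rename is only atomic when the temp file lives on the same filesystem as the target:

```typescript
import { promises as fs } from "node:fs";

// Write JSON atomically: write to a temp file next to the target, then
// rename over it. Readers observe either the old file or the new one,
// never a truncated intermediate state (rename(2) is atomic within a
// single filesystem).
async function writeJsonAtomic(path: string, data: unknown): Promise<void> {
  const tmp = `${path}.tmp-${process.pid}-${Date.now()}`;
  await fs.writeFile(tmp, JSON.stringify(data, null, 2), "utf8");
  await fs.rename(tmp, path);
}
```

This closes the torn-write gap of the flat-file format at nearly zero complexity cost; SQLite adds query capability on top of that if listing and filtering sessions becomes a bottleneck.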

21.3.2 tmux as Primary Runtime

While pragmatic, tmux coupling creates issues:

  • Not available on Windows
  • Message passing is fragile (buffer sizes, timing)
  • No structured communication channel
  • Output capture is lossy (screen buffer limits)

Alternative for Maestro: Implement a process runtime that uses stdin/stdout for structured communication (JSON-RPC or similar), with tmux as an optional attachment layer for debugging.
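The structured channel could use newline-delimited JSON framing over stdin/stdout. A sketch of the codec; the message shape is an assumption for illustration:

```typescript
// Sketch of NDJSON framing for an agent <-> orchestrator channel.
// Each message is one JSON object per line; the stateful decoder
// tolerates partial chunks arriving from a stream.
interface AgentMessage {
  type: string;
  payload?: unknown;
}

function encodeMessage(msg: AgentMessage): string {
  return JSON.stringify(msg) + "\n";
}

function createDecoder(onMessage: (msg: AgentMessage) => void) {
  let buffer = "";
  return (chunk: string): void => {
    buffer += chunk;
    let newline: number;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline);
      buffer = buffer.slice(newline + 1);
      if (line.trim() !== "") onMessage(JSON.parse(line) as AgentMessage);
    }
  };
}
```

Unlike tmux send-keys, this gives delivery guarantees and machine-parseable replies, while tmux remains available purely as a human debugging window.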

21.3.3 Polling-Based Lifecycle

30-second polling is too slow for responsive orchestration and too wasteful for idle systems:

Alternative for Maestro: Use an event-driven architecture with filesystem watches (inotify/FSEvents), agent-reported events (via a sidecar or callback), and webhook-based SCM notifications.

21.3.4 No Authentication on Dashboard

Running a web dashboard without any authentication is a security gap:

Alternative for Maestro: At minimum, implement localhost-only binding with a session token. Better: proper authentication with API keys or OAuth.

21.3.5 Hardcoded Merge Strategy

gh pr merge --squash --delete-branch is hardcoded:

Alternative for Maestro: Make merge strategy configurable per project (squash/merge/rebase, delete branch or not).

21.4 Key Lessons

  1. Simplicity wins for v1: AO chose the simplest possible implementation at every layer (files over databases, polling over events, CLI over API). This allowed rapid development and easy debugging.

  2. Plugin architecture pays off early: Even in a young project, the ability to swap implementations is valuable. It enables both experimentation and user customization.

  3. Safety must be default-on: Fail-closed CI, confirmation dialogs, and shell injection prevention are good defaults. Auto-merge should require explicit opt-in.

  4. Agent-specific integration is necessary: Generic agent interfaces are not enough. Deep integration with the specific agent (like reading Claude Code JSONL) provides dramatically better observability.

  5. The orchestrator-as-agent pattern is powerful: Using an AI agent to orchestrate other AI agents (the meta-agent pattern) leverages the agent's natural language understanding for flexible task management. But it requires a very good system prompt.


22. Cross-Links

This analysis is part of a broader research effort analyzing multiple AI agent orchestration frameworks. Related documents in the /Users/jeffscottward/Github/research/ai-harness/Claude/v1/ directory include:

| Document | Relevance to This Analysis |
|---|---|
| swe-bench-deep-analysis.md | SWE-bench is the primary benchmark for evaluating coding agents like those orchestrated by AO. Comparison of evaluation methodologies. |
| claude-code-deep-analysis.md | Claude Code is AO's primary agent. Deep understanding of Claude Code's internals (JSONL format, session files, hooks) is essential for understanding AO's agent plugin. |
| codex-deep-analysis.md | Codex CLI is a supported agent in AO. Compare how AO integrates Codex vs Claude Code. |
| aider-deep-analysis.md | Aider is a supported agent in AO. Compare integration depth and activity detection approaches. |
| opencode-deep-analysis.md | OpenCode is a supported agent in AO. Compare plugin maturity. |
| open-hands-deep-analysis.md | OpenHands (formerly OpenDevin) provides container-based isolation. Compare with AO's worktree/tmux approach for security and resource isolation. |
| bolt-diy-deep-analysis.md | Bolt.diy is a web-based coding assistant. Compare the dashboard/UI patterns. |
| maestro-architecture.md | The target architecture document. This analysis directly informs what patterns to adopt, adapt, or avoid. |

Cross-Cutting Themes

  1. Isolation Models: AO uses worktrees + tmux. OpenHands uses Docker containers. Each has tradeoffs between speed, security, and complexity. Maestro should support both.

  2. Agent Communication: AO uses terminal message passing. Some frameworks use structured APIs. The hybrid approach (terminal for legacy agents, API for modern ones) may be optimal.

  3. State Management: AO uses flat files. Most production systems use databases. The trade-off is operational simplicity vs. query capability and reliability.

  4. Orchestration Patterns: AO's meta-agent pattern (AI orchestrating AI) vs. rule-based orchestration vs. human-in-the-loop. Each has different reliability/flexibility trade-offs.

  5. Cost Management: All frameworks struggle with cost visibility and control. This is an area where Maestro can differentiate.


Appendix A: File Reference

Core Package Files

| File | Lines | Purpose |
|---|---|---|
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/types.ts | 1084 | Central type definitions |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/session-manager.ts | ~1100 | Session CRUD operations |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/lifecycle-manager.ts | 587 | State machine + reaction engine |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/config.ts | ~400 | Config loading + validation |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/plugin-registry.ts | ~100 | Plugin registration + lookup |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/paths.ts | ~200 | Hash-based directory management |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/metadata.ts | ~200 | Flat-file metadata management |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/prompt-builder.ts | ~150 | Three-layer prompt composition |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/orchestrator-prompt.ts | ~250 | Meta-agent system prompt |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/tmux.ts | ~200 | Safe tmux wrappers |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/utils.ts | ~150 | Shell escape, JSONL parsing |

Plugin Files

| File | Lines | Purpose |
|---|---|---|
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/agent-claude-code/src/index.ts | 786 | Claude Code agent integration |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/runtime-tmux/src/index.ts | 184 | tmux runtime implementation |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/workspace-worktree/src/index.ts | 301 | Git worktree workspace |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/scm-github/src/index.ts | 581 | GitHub SCM integration |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-github/src/index.ts | 304 | GitHub Issues tracker |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-linear/src/index.ts | 722 | Linear tracker integration |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/notifier-desktop/src/index.ts | ~80 | OS desktop notifications |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/notifier-slack/src/index.ts | ~150 | Slack webhook notifications |

CLI Files

| File | Lines | Purpose |
|---|---|---|
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/index.ts | ~80 | CLI entry point |
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/spawn.ts | ~200 | Spawn + batch-spawn |
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/start.ts | ~150 | Start/stop orchestrator |
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/status.ts | ~200 | Status display |
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/session.ts | ~200 | Session subcommands |
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/init.ts | ~300 | Init wizard |

Web Files

| File | Lines | Purpose |
|---|---|---|
| /tmp/ai-harness-repos/agent-orchestrator/packages/web/src/lib/services.ts | 84 | Service singleton |
| /tmp/ai-harness-repos/agent-orchestrator/packages/web/src/components/Dashboard.tsx | 272 | Main dashboard UI |
| /tmp/ai-harness-repos/agent-orchestrator/packages/web/src/app/api/sessions/route.ts | 65 | Sessions API |
| /tmp/ai-harness-repos/agent-orchestrator/packages/web/src/app/api/events/route.ts | 104 | SSE events API |

Config & Documentation

| File | Purpose |
|---|---|
| /tmp/ai-harness-repos/agent-orchestrator/README.md | Project overview |
| /tmp/ai-harness-repos/agent-orchestrator/ARCHITECTURE.md | Directory architecture |
| /tmp/ai-harness-repos/agent-orchestrator/CLAUDE.md | Development conventions |
| /tmp/ai-harness-repos/agent-orchestrator/SECURITY.md | Security policy |
| /tmp/ai-harness-repos/agent-orchestrator/agent-orchestrator.yaml.example | Full reference config |
| /tmp/ai-harness-repos/agent-orchestrator/examples/simple-github.yaml | Minimal config example |
| /tmp/ai-harness-repos/agent-orchestrator/examples/auto-merge.yaml | Auto-merge config example |

Appendix B: Confidence Scores Summary

| Section | Confidence | Reasoning |
|---|---|---|
| 1. Design Philosophy | High | README, ARCHITECTURE.md, and code consistently support conclusions |
| 2. Core Architecture | High | All source files read and analyzed |
| 3. Harness Workflow | High | Spawn sequence traced through code |
| 4. Subagent Orchestration | High | Orchestrator prompt and communication code reviewed |
| 5. Multi-Agent & Parallelization | High | Lifecycle manager and batch-spawn code reviewed |
| 6. Isolation Model | High | Workspace and runtime plugins fully analyzed |
| 7. Human-in-the-Loop | High | Dashboard and API code reviewed |
| 8. Context Handling | High | Prompt builder and tracker plugins reviewed |
| 9. Session Lifecycle | High | State machine and activity detection fully traced |
| 10. Code Quality Gates | Medium | CI config reviewed but lint rules and test coverage not explored |
| 11. Security | High | SECURITY.md, CI workflows, and shell security patterns reviewed |
| 12. Hooks & Automation | High | Reaction engine and PostToolUse hook fully analyzed |
| 13. CLI & UX | High | All CLI commands reviewed |
| 14. Cost & Usage | Medium | Cost extraction code reviewed but display/alerting not found |
| 15. Tooling & Dependencies | High | package.json and imports reviewed |
| 16. External Integrations | High | All plugin code reviewed |
| 17. Operational Assumptions | High | Requirements documented and validated against code |
| 18. Failure Modes | High | Error handling paths traced through code |
| 19. Governance | Medium | Security measures documented but no formal governance framework |
| 20. Roadmap | Medium | Based on TODOs, plugin stubs, and architecture patterns |
| 21. Borrow/Adapt | High | Based on thorough analysis of all sections |

Appendix C: Architecture Diagrams (Text)

C.1 System Architecture

┌──────────────────────────────────────────────────────┐
│                    Human Developer                     │
│                                                        │
│  ┌──────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ ao CLI   │  │  Dashboard   │  │  GitHub UI   │   │
│  │          │  │  (Next.js)   │  │              │   │
│  └────┬─────┘  └──────┬───────┘  └──────┬───────┘   │
└───────┼────────────────┼────────────────┼────────────┘
        │                │                │
        ▼                ▼                │
┌───────────────────────────────┐         │
│          AO Core              │         │
│                               │         │
│  ┌─────────────────────────┐  │         │
│  │    Session Manager      │  │         │
│  │  (spawn/list/kill/etc)  │  │         │
│  └────────────┬────────────┘  │         │
│               │               │         │
│  ┌────────────▼────────────┐  │         │
│  │   Lifecycle Manager     │  │         │
│  │  (poll/react/escalate)  │  │         │
│  └────────────┬────────────┘  │         │
│               │               │         │
│  ┌────────────▼────────────┐  │         │
│  │    Plugin Registry      │  │         │
│  │  (8 slots, 16 plugins)  │  │         │
│  └────────────┬────────────┘  │         │
└───────────────┼───────────────┘         │
                │                         │
    ┌───────────┼───────────┐             │
    │           │           │             │
    ▼           ▼           ▼             ▼
┌────────┐ ┌────────┐ ┌─────────┐ ┌──────────┐
│  tmux  │ │  git   │ │ Claude  │ │ GitHub   │
│sessions│ │worktree│ │  Code   │ │   API    │
│        │ │        │ │  CLI    │ │(gh CLI)  │
└────────┘ └────────┘ └─────────┘ └──────────┘

C.2 Session Lifecycle State Machine

                    ┌──────────┐
                    │ spawning │
                    └────┬─────┘
                         │
                         ▼
                    ┌──────────┐
              ┌─────│ working  │◄────────────────────┐
              │     └────┬─────┘                      │
              │          │                            │
              ▼          ▼                            │
        ┌──────────┐ ┌──────────┐              ┌─────┴──────┐
        │needs_input│ │ pr_open  │              │changes_req │
        └──────────┘ └────┬─────┘              └─────┬──────┘
                          │                          │
                    ┌─────┼─────┐                    │
                    │     │     │                    │
                    ▼     ▼     ▼                    │
              ┌────────┐ ┌──────────┐ ┌──────────┐  │
              │ci_fail │ │rev_pend  │ │ working  │──┘
              └───┬────┘ └────┬─────┘
                  │           │
                  │      ┌────┼────┐
                  │      │         │
                  │      ▼         ▼
                  │ ┌──────────┐ ┌──────────────┐
                  │ │ approved │ │changes_req   │
                  │ └────┬─────┘ └──────────────┘
                  │      │
                  │      ▼
                  │ ┌──────────┐
                  │ │mergeable │
                  │ └────┬─────┘
                  │      │
                  │      ▼
                  │ ┌──────────┐
                  └►│  merged  │
                    └────┬─────┘
                         │
                         ▼
                    ┌──────────┐
                    │  done    │
                    └──────────┘

        (Any state) ──► killed / terminated / errored / stuck

C.3 Directory Structure

~/.agent-orchestrator/
│
├── a1b2c3d4e5f6-my-project/        # {hash}-{projectId}
│   ├── .origin                       # Original config path
│   ├── sessions/
│   │   ├── fix-auth-1/
│   │   │   ├── metadata              # key=value flat file
│   │   │   └── prompt.md             # Agent system prompt
│   │   ├── add-api-2/
│   │   │   ├── metadata
│   │   │   └── prompt.md
│   │   └── refactor-3/
│   │       ├── metadata
│   │       └── prompt.md
│   ├── archive/
│   │   ├── old-session_1706000000   # Archived metadata
│   │   └── old-session_1706100000
│   └── worktrees/
│       ├── fix-auth-1/              # Git worktree checkout
│       ├── add-api-2/
│       └── refactor-3/
│
└── f6e5d4c3b2a1-other-project/
    ├── sessions/
    ├── archive/
    └── worktrees/

Appendix D: Configuration Reference

D.1 Full Configuration Schema

# agent-orchestrator.yaml

# Global settings
dataDir: "~/.agent-orchestrator"      # Base data directory
worktreeDir: null                      # Override worktree location
port: 3000                            # Dashboard port

# Default plugin selections
defaults:
  runtime: tmux                       # Process runtime
  agent: claude-code                  # AI coding agent
  workspace: worktree                 # Code isolation strategy
  notifiers:                          # Notification channels
    - composio
    - desktop

# Project definitions
projects:
  my-project:
    repo: "owner/repo"               # GitHub repository
    path: "/path/to/local/repo"      # Local repository path
    defaultBranch: main               # Default branch name
    sessionPrefix: "fix"              # Session name prefix

    tracker:                          # Issue tracker
      plugin: github                  # or "linear"
      # Linear-specific:
      # teamId: "TEAM_ID"

    scm:                              # Source code management
      plugin: github

    symlinks:                         # Shared resources in worktrees
      - source: "/path/to/node_modules"
        target: "node_modules"

    postCreate:                       # Commands after workspace creation
      - "npm install"
      - "npm run build"

    agentConfig:                      # Agent-specific configuration
      model: "claude-sonnet-4-5-20250514"

    agentRules: |                     # Inline rules for the agent
      Follow TDD. Write tests first.

    agentRulesFile: ".agent-rules.md" # External rules file

# Notification configuration
notifiers:
  slack:
    webhookUrl: "${SLACK_WEBHOOK_URL}"
  desktop: {}

# Notification routing by priority
notificationRouting:
  critical: [slack, desktop]
  high: [slack, desktop]
  normal: [slack]
  low: [slack]

# Reaction rules
reactions:
  ci-failed:
    trigger: ci.failing
    action: send-to-agent
    message: "CI is failing. Please investigate and fix."
    retries: 2
    escalation:
      action: notify
      after: "10m"
      priority: critical

  changes-requested:
    trigger: review.changes_requested
    action: send-to-agent
    message: "Review feedback received. Please address."

  approved-and-green:
    trigger: review.approved
    condition: ci.passing
    action: notify
    message: "PR ready to merge."

  agent-stuck:
    trigger: agent.stuck
    action: notify
    priority: high
    escalation:
      action: notify
      after: "15m"
      priority: critical

  auto-merge:                         # Optional: auto-merge
    trigger: review.approved
    condition: ci.passing
    action: auto-merge

Source: /tmp/ai-harness-repos/agent-orchestrator/agent-orchestrator.yaml.example


End of analysis. Total sections: 22 (21 analysis areas + cross-links). All file paths reference the source repository at /tmp/ai-harness-repos/agent-orchestrator/.

Everything Claude Code (ECC) -- Deep Technical Analysis

Repository: affaan-m/everything-claude-code
Version analyzed: v1.4.1 (February 2026)
Analysis date: 2026-02-22
Analyst: Claude Opus 4.6


Table of Contents

  1. Executive Summary
  2. Design Philosophy and Abstractions
  3. Core Architecture Model
  4. Harness Workflow: Spec to Plan to Execute to Verify to Merge
  5. Subagent/Task Orchestration Model
  6. Multi-Agent / Parallelization Strategy
  7. Isolation Model
  8. Human-in-the-Loop Controls
  9. Context Handling Strategy
  10. Session Lifecycle and Persistence
  11. Code Quality Gates
  12. Security and Compliance Mechanisms
  13. Hooks, Automation Surface, and Fail-Safe Behavior
  14. CLI/UX and Automation Ergonomics
  15. Cost/Usage Visibility and Governance
  16. Tooling and Dependency Surface
  17. External Integrations and Provider Compatibility
  18. Operational Assumptions and Constraints
  19. Failure Modes and Issues Observed
  20. Governance and Guardrails
  21. Roadmap/Evolution Signals, Missing Areas, Unresolved Issues
  22. What Should Be Borrowed/Adapted into Maestro
  23. Cross-Links

1. Executive Summary

Everything Claude Code (ECC) is a configuration-layer harness -- not a runtime framework -- that wraps Claude Code CLI with curated agents, skills, hooks, commands, rules, and MCP configurations. It is the most popular community-maintained collection of Claude Code configurations (42K+ stars, 5K+ forks as of Feb 2026), battle-tested by the author over 10+ months of daily production use.

Key characterization: ECC is a "prompt engineering harness" that orchestrates Claude Code's native capabilities through markdown-based configuration rather than writing code that drives Claude Code programmatically. It does not have its own execution engine; it relies entirely on Claude Code's built-in plugin system, hook mechanism, and subagent (Task tool) delegation.

Confidence: HIGH -- This assessment is based on exhaustive reading of every file in the repository.

Strengths

  • Extremely well-organized collection of battle-tested configurations
  • Strong hook system with cross-platform Node.js implementations
  • Thoughtful session persistence and context management
  • Multi-language support (TS, Python, Go, Java, C++, Swift)
  • Cross-platform compatibility (Windows, macOS, Linux)
  • Good CI/CD with comprehensive validation
  • Innovative continuous learning (instinct) system
  • Excellent documentation (shortform/longform guides, i18n)

Limitations

  • No runtime execution engine -- depends entirely on Claude Code CLI
  • No true concurrent agent orchestration (sequential pipeline only)
  • No formal state machine for workflow progression
  • No persistent database or structured data store
  • Orchestration is prompt-driven, not code-driven
  • No cost tracking beyond Claude's built-in /cost
  • Multi-model commands (multi-plan, multi-execute) depend on external codeagent-wrapper binary not included

2. Design Philosophy and Abstractions

2.1 Mental Model

ECC embodies a "configuration as code" philosophy applied to AI-assisted development. The core mental model is:

"Claude Code is already powerful. Make it more consistent, more efficient, and more specialized by providing curated configurations that encode expert knowledge."

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/the-shortform-guide.md lines 1-9: "These configs are battle-tested across multiple production applications."
  • /tmp/ai-harness-repos/everything-claude-code/README.md lines 30-33: "Production-ready agents, skills, hooks, commands, rules, and MCP configurations evolved over 10+ months."

The author's explicit stance (longform guide, line 1) is that configuration should be iterated and refined over time, not designed once.

Confidence: HIGH

2.2 Abstraction Layers

ECC defines six primary abstraction layers, each stored as markdown or JSON:

| Layer | Storage Format | Location | Purpose |
|-------|----------------|----------|---------|
| Rules | Markdown | rules/ | Always-active behavioral constraints |
| Agents | Markdown with YAML frontmatter | agents/ | Specialized subagent personas |
| Skills | Markdown with YAML frontmatter | skills/ | Domain knowledge and workflow definitions |
| Commands | Markdown with YAML frontmatter | commands/ | Slash commands for quick execution |
| Hooks | JSON | hooks/hooks.json | Event-driven automations |
| Contexts | Markdown | contexts/ | Dynamic system prompt injection |

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/CLAUDE.md lines 23-31: Architecture section listing all component types
  • /tmp/ai-harness-repos/everything-claude-code/README.md lines 189-343: Complete directory structure

2.3 Design Principles

  1. Modularity over monolith: Each component is independently installable and removable
  2. Markdown as universal format: Everything is markdown -- the LLM-native format
  3. Convention over configuration: Standard file locations, naming patterns
  4. Progressive enhancement: Start with what resonates, add incrementally
  5. Context window conservation: Aggressive optimization of token usage
  6. Cross-platform compatibility: Node.js scripts instead of bash

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/README.md lines 994-998: "Start with what resonates, modify for your stack, remove what you don't use, add your own patterns"
  • All hooks are Node.js: /tmp/ai-harness-repos/everything-claude-code/hooks/README.md lines 192-193

Confidence: HIGH


3. Core Architecture Model

3.1 Entry Points

ECC has three primary entry points:

  1. Plugin installation: Via Claude Code's /plugin marketplace add command

    • File: /tmp/ai-harness-repos/everything-claude-code/.claude-plugin/plugin.json
    • Registers agents, skills, and commands
    • Hooks auto-loaded from hooks/hooks.json by convention (Claude Code v2.1+)
  2. Manual installation: Copying files to ~/.claude/ directories

    • File: /tmp/ai-harness-repos/everything-claude-code/install.sh
    • Supports --target claude (default) and --target cursor
    • Handles common + language-specific rule installation
  3. npm package: npm install ecc-universal

    • File: /tmp/ai-harness-repos/everything-claude-code/package.json line 73: "bin": { "ecc-install": "install.sh" }
    • Provides ecc-install CLI command

3.2 Key Modules

3.2.1 Scripts Library (scripts/lib/)

The cross-platform utility layer that all hooks depend on:

  • utils.js (/tmp/ai-harness-repos/everything-claude-code/scripts/lib/utils.js): 529 lines

    • Platform detection (Windows/macOS/Linux)
    • Directory management (sessions, learned skills, temp)
    • File operations (read, write, append, replace, grep)
    • Git operations (modified files, repo detection)
    • Stdin JSON parsing for hooks
    • Command execution with security notes (line 337-338)
  • package-manager.js (/tmp/ai-harness-repos/everything-claude-code/scripts/lib/package-manager.js): 431 lines

    • Supports npm, pnpm, yarn, bun
    • 6-level detection priority (env var > project config > package.json > lock file > global config > fallback)
    • Input validation with SAFE_NAME_REGEX and SAFE_ARGS_REGEX (lines 285-319)
    • Known performance fix: Avoids spawning child processes in hot paths (line 228-231)
  • session-manager.js (/tmp/ai-harness-repos/everything-claude-code/scripts/lib/session-manager.js): 442 lines

    • Session CRUD operations
    • Filename parsing with calendar-accurate date validation (line 37-41)
    • Metadata extraction from markdown content
    • Pagination support for session listing
  • session-aliases.js (/tmp/ai-harness-repos/everything-claude-code/scripts/lib/session-aliases.js): Session alias management

3.2.2 Hook Scripts (scripts/hooks/)

Nine hook scripts provide the runtime behavior:

| Script | Event | Purpose |
|--------|-------|---------|
| session-start.js | SessionStart | Load previous context, detect PM |
| session-end.js | SessionEnd | Extract summary from transcript, persist |
| pre-compact.js | PreCompact | Save state before compaction |
| suggest-compact.js | PreToolUse | Suggest compaction at thresholds |
| evaluate-session.js | SessionEnd | Extract patterns for continuous learning |
| post-edit-format.js | PostToolUse | Auto-format with Prettier |
| post-edit-typecheck.js | PostToolUse | TypeScript checking |
| post-edit-console-warn.js | PostToolUse | Warn about console.log |
| check-console-log.js | Stop | Audit modified files for console.log |

3.2.3 CI Validators (scripts/ci/)

Five validation scripts enforce structural integrity:

| Validator | What It Checks |
|-----------|----------------|
| validate-agents.js | YAML frontmatter: name, description, tools, model |
| validate-commands.js | Description frontmatter presence |
| validate-hooks.js | JSON schema, valid event types, inline JS syntax |
| validate-rules.js | Markdown heading structure |
| validate-skills.js | SKILL.md file presence and frontmatter |

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/scripts/ci/validate-agents.js lines 11-12: REQUIRED_FIELDS = ['model', 'tools']; VALID_MODELS = ['haiku', 'sonnet', 'opus']
  • /tmp/ai-harness-repos/everything-claude-code/scripts/ci/validate-hooks.js line 11: VALID_EVENTS = ['PreToolUse', 'PostToolUse', 'PreCompact', 'SessionStart', 'SessionEnd', 'Stop', 'Notification', 'SubagentStop']

3.3 Data Flow

User Request
    |
    v
Claude Code CLI
    |
    +--> Rules (always loaded from ~/.claude/rules/)
    |     Behavioral constraints applied to every response
    |
    +--> CLAUDE.md (project/user level)
    |     Project-specific guidance
    |
    +--> Slash Command (e.g., /plan)
    |     |
    |     v
    |   Command markdown loaded
    |     |
    |     v
    |   Agent invoked (via Task tool)
    |     |
    |     +--> Skills referenced in agent prompt
    |     |
    |     v
    |   Agent produces output
    |
    +--> Hooks fire (Pre/Post/Lifecycle)
    |     |
    |     v
    |   Node.js scripts execute
    |     |
    |     +--> Session state persisted to ~/.claude/sessions/
    |     +--> Compaction suggested
    |     +--> Patterns extracted
    |
    v
Claude response to user

Confidence: HIGH


4. Harness Workflow

4.1 Spec to Plan to Execute to Verify to Merge

ECC defines an explicit workflow pipeline via the /orchestrate command:

File: /tmp/ai-harness-repos/everything-claude-code/commands/orchestrate.md

/orchestrate feature "Add user authentication"

Pipeline:
1. planner agent     -> Requirements + plan
2. tdd-guide agent   -> Tests first, then implementation
3. code-reviewer agent -> Quality review
4. security-reviewer agent -> Security audit

Four predefined workflow types:

| Workflow | Agent Sequence |
|----------|----------------|
| feature | planner -> tdd-guide -> code-reviewer -> security-reviewer |
| bugfix | planner -> tdd-guide -> code-reviewer |
| refactor | architect -> code-reviewer -> tdd-guide |
| security | security-reviewer -> code-reviewer -> architect |

Evidence: /tmp/ai-harness-repos/everything-claude-code/commands/orchestrate.md lines 11-30

4.2 Handoff Protocol

Between agents, a structured handoff document is passed (lines 48-65):

## HANDOFF: [previous-agent] -> [next-agent]

### Context
[Summary of what was done]

### Findings
[Key discoveries or decisions]

### Files Modified
[List of files touched]

### Open Questions
[Unresolved items for next agent]

### Recommendations
[Suggested next steps]

4.3 Plan Command in Detail

The /plan command (/tmp/ai-harness-repos/everything-claude-code/commands/plan.md) follows a strict pattern:

  1. Restate Requirements -- Clarify what needs to be built
  2. Identify Risks -- Surface potential issues
  3. Create Step Plan -- Break into phases
  4. WAIT for Confirmation -- "CRITICAL: The planner agent will NOT write any code until you explicitly confirm"

The planner agent (/tmp/ai-harness-repos/everything-claude-code/agents/planner.md) produces structured plans with:

  • Overview (2-3 sentences)
  • Requirements list
  • Architecture changes with file paths
  • Implementation steps grouped by phase
  • Testing strategy
  • Risks and mitigations
  • Success criteria

4.4 Multi-Model Planning (Advanced)

The multi-plan command (/tmp/ai-harness-repos/everything-claude-code/commands/multi-plan.md) introduces a more sophisticated pipeline:

  1. Phase 1: Context Retrieval -- Uses mcp__ace-tool__search_context for semantic search
  2. Phase 2: Dual-Model Analysis -- Parallel calls to Codex and Gemini backends
  3. Phase 2.3: Cross-Validation -- Identify consensus and divergence
  4. Phase 2.4: Claude Synthesis -- Generate final plan from both analyses

IMPORTANT: This depends on ~/.claude/bin/codeagent-wrapper (lines 25-26 of multi-plan.md), which is NOT included in the repository. This is an external dependency that users must install separately.

Confidence: MEDIUM -- The multi-model flow is well-documented but depends on external tooling not included.

4.5 Verification Loop

The /verify command invokes the verification-loop skill:

File: /tmp/ai-harness-repos/everything-claude-code/skills/verification-loop/SKILL.md

Six verification phases:

  1. Build Verification -- npm run build
  2. Type Check -- tsc --noEmit or pyright
  3. Lint Check -- npm run lint or ruff check
  4. Test Suite -- Run with coverage, target 80%
  5. Security Scan -- Grep for secrets and console.log
  6. Diff Review -- Review changed files

Output is a structured VERIFICATION REPORT with PASS/FAIL per phase and an overall READY/NOT READY verdict.
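
The report shape can be sketched as follows. This is a minimal illustration of the PASS/FAIL-per-phase plus overall-verdict structure; the phase commands in the example are paraphrased from the skill, and the function name is mine.

```javascript
// Sketch of the verification-loop report format: one PASS/FAIL line per
// phase, then an overall READY / NOT READY verdict.
function buildVerificationReport(results) {
  // results: [{ phase: 'Build Verification', pass: true }, ...]
  const lines = ['VERIFICATION REPORT', ''];
  for (const r of results) {
    lines.push(`${r.pass ? 'PASS' : 'FAIL'}  ${r.phase}`);
  }
  const ready = results.every(r => r.pass);
  lines.push('', `Verdict: ${ready ? 'READY' : 'NOT READY'}`);
  return lines.join('\n');
}

// Example mirroring the skill's six phases; any single failure
// flips the overall verdict to NOT READY.
const report = buildVerificationReport([
  { phase: 'Build Verification (npm run build)', pass: true },
  { phase: 'Type Check (tsc --noEmit)', pass: true },
  { phase: 'Lint Check (npm run lint)', pass: true },
  { phase: 'Test Suite (coverage >= 80%)', pass: false },
  { phase: 'Security Scan (secrets, console.log)', pass: true },
  { phase: 'Diff Review', pass: true },
]);
```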

Confidence: HIGH -- This is fully implemented as a skill definition.


5. Subagent/Task Orchestration Model

5.1 Agent Architecture

ECC defines 13 specialized agents in /tmp/ai-harness-repos/everything-claude-code/agents/:

| Agent | Model | Tools | Role |
|-------|-------|-------|------|
| planner.md | opus | Read, Grep, Glob | Feature planning |
| architect.md | opus | Read, Grep, Glob | System design |
| code-reviewer.md | sonnet | Read, Grep, Glob, Bash | Code review |
| security-reviewer.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | Security audit |
| tdd-guide.md | sonnet | Read, Write, Edit, Bash, Grep | TDD enforcement |
| build-error-resolver.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | Build error fixing |
| e2e-runner.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | E2E testing |
| refactor-cleaner.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | Dead code removal |
| doc-updater.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | Documentation sync |
| go-reviewer.md | sonnet | Read, Grep, Glob, Bash | Go code review |
| go-build-resolver.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | Go build errors |
| python-reviewer.md | sonnet | Read, Grep, Glob, Bash | Python code review |
| database-reviewer.md | sonnet | Read, Grep, Glob, Bash | Database optimization |

5.2 Tool Scoping Strategy

Agents use deliberately restricted tool sets:

  • Read-only agents (planner, architect): ["Read", "Grep", "Glob"] -- cannot modify code
  • Full-access agents (build-error-resolver, tdd-guide): All tools including Write, Edit, Bash
  • Review agents (code-reviewer): Read + Bash (for running git diff)

This is a principle of least privilege approach to agent tooling.

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/agents/planner.md line 5: tools: ["Read", "Grep", "Glob"]
  • /tmp/ai-harness-repos/everything-claude-code/agents/tdd-guide.md line 5: tools: ["Read", "Write", "Edit", "Bash", "Grep"]
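
A minimal illustration of the frontmatter shape these agent files share (the tools and model values are taken from planner.md; the description text here is paraphrased, not quoted):

```markdown
---
name: planner
description: Feature planning agent that produces a phased implementation plan.
tools: ["Read", "Grep", "Glob"]
model: opus
---

System prompt body defining the agent's behavior goes here...
```

Per the CI validators, `model` (one of haiku/sonnet/opus) and `tools` are the required fields; the tool list is what enforces the least-privilege split between read-only and full-access agents.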

5.3 Model Selection for Agents

  • Opus: Used for deep reasoning tasks (planner, architect) -- 2 agents
  • Sonnet: Used for execution tasks (code review, TDD, build fixing) -- 11 agents
  • Haiku: Not directly assigned to any agent, but recommended for subagent exploration in rules

Evidence: /tmp/ai-harness-repos/everything-claude-code/rules/common/performance.md lines 3-18

5.4 The Subagent Context Problem

ECC explicitly addresses the "context problem" in multi-agent workflows:

File: /tmp/ai-harness-repos/everything-claude-code/skills/iterative-retrieval/SKILL.md

"Subagents are spawned with limited context. They don't know which files contain relevant code, what patterns exist in the codebase, what terminology the project uses."

The solution is Iterative Retrieval -- a 4-phase loop:

  1. DISPATCH -- Broad initial query
  2. EVALUATE -- Score relevance 0-1
  3. REFINE -- Update search criteria based on gaps
  4. LOOP -- Repeat max 3 cycles

Evidence: /tmp/ai-harness-repos/everything-claude-code/skills/iterative-retrieval/SKILL.md lines 30-48
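
The 4-phase loop can be sketched as follows. This is a hedged reconstruction of the pattern, not ECC's code: `search` and `scoreRelevance` are stand-ins for whatever retrieval backend and relevance heuristic the subagent actually uses.

```javascript
// Sketch of the iterative-retrieval loop: DISPATCH a query, EVALUATE
// relevance (0-1), REFINE the query on gaps, and LOOP at most 3 cycles.
function iterativeRetrieve(query, search, scoreRelevance,
                           { maxCycles = 3, threshold = 0.7 } = {}) {
  let currentQuery = query;
  let best = { results: [], score: 0 };
  for (let cycle = 1; cycle <= maxCycles; cycle++) {
    const results = search(currentQuery);       // DISPATCH: broad query
    const score = scoreRelevance(results);      // EVALUATE: score 0-1
    if (score > best.score) best = { results, score };
    if (score >= threshold) break;              // good enough: stop early
    // REFINE: in practice the agent rewrites the query based on gaps;
    // here we just annotate it to keep the sketch self-contained.
    currentQuery = `${currentQuery} (refined, cycle ${cycle})`;
  }                                             // LOOP: bounded by maxCycles
  return best;
}
```

The bounded cycle count is the key design choice: it caps the token cost of context discovery for a subagent that starts with no knowledge of the codebase.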

5.5 Orchestrator Pattern

From the longform guide (/tmp/ai-harness-repos/everything-claude-code/the-longform-guide.md lines 268-286):

Phase 1: RESEARCH (use Explore agent) -> research-summary.md
Phase 2: PLAN (use planner agent) -> plan.md
Phase 3: IMPLEMENT (use tdd-guide agent) -> code changes
Phase 4: REVIEW (use code-reviewer agent) -> review-comments.md
Phase 5: VERIFY (use build-error-resolver if needed) -> done or loop back

Key rules:

  1. Each agent gets ONE clear input and produces ONE clear output
  2. Outputs become inputs for next phase
  3. Never skip phases
  4. Use /clear between agents
  5. Store intermediate outputs in files
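
The phase-chaining rules above can be sketched as a simple pipeline. This is an illustration of the pattern only; agent invocation is stubbed, and the VERIFY phase (which may loop back to IMPLEMENT) is noted but not modeled.

```javascript
// Sketch of the orchestrator pattern: each phase gets ONE input (the
// previous phase's output) and produces ONE output, persisted to a file.
const phases = [
  { name: 'RESEARCH',  agent: 'Explore',       out: 'research-summary.md' },
  { name: 'PLAN',      agent: 'planner',       out: 'plan.md' },
  { name: 'IMPLEMENT', agent: 'tdd-guide',     out: 'changes.md' },
  { name: 'REVIEW',    agent: 'code-reviewer', out: 'review-comments.md' },
  // Phase 5 (VERIFY with build-error-resolver) would loop back on failure.
];

function runPipeline(runAgent) {
  let input = null;
  const artifacts = [];
  for (const p of phases) {
    const result = runAgent(p.agent, input); // one clear input
    artifacts.push({ file: p.out, result }); // one clear output, stored to file
    input = result;                          // output becomes next phase's input
  }
  return artifacts;
}
```

Storing each artifact to a file is what makes "/clear between agents" safe: the next phase reloads only the artifact, not the previous agent's full context.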

Confidence: HIGH


6. Multi-Agent / Parallelization Strategy

6.1 Actual Parallelization Capabilities

ECC does not implement true concurrent agent execution within a single Claude Code session. The orchestration is sequential with handoff documents between agents.

However, ECC documents several parallelization patterns for multiple Claude Code instances:

  1. Git Worktrees -- Each worktree gets its own Claude instance
  2. Fork (/fork) -- Fork conversations for non-overlapping tasks
  3. Cascade Method -- Open new tasks in new tabs, sweep left to right

Evidence: /tmp/ai-harness-repos/everything-claude-code/the-longform-guide.md lines 176-215

6.2 The Cascade Method

From the longform guide (line 209-215):

  • Open new tasks in new tabs to the right
  • Sweep left to right, oldest to newest
  • Focus on at most 3-4 tasks at a time

6.3 Multi-Model Parallelization (multi-plan/multi-execute)

The multi-plan and multi-execute commands use run_in_background: true for parallel calls to Codex and Gemini backends:

File: /tmp/ai-harness-repos/everything-claude-code/commands/multi-plan.md lines 119-133

Parallel call Codex and Gemini (run_in_background: true):
1. Codex Backend Analysis (technical feasibility, architecture)
2. Gemini Frontend Analysis (UI/UX impact, user experience)

CRITICAL LIMITATION: This depends on an external codeagent-wrapper binary at ~/.claude/bin/codeagent-wrapper that is NOT included in the repository. Without this binary, the multi-model commands cannot function.

6.4 Parallel Review Pattern

The /orchestrate command mentions parallel execution for independent checks (lines 139-149):

### Parallel Phase
Run simultaneously:
- code-reviewer (quality)
- security-reviewer (security)
- architect (design)

### Merge Results
Combine outputs into single report

However, this is documented as a pattern to follow, not code that enforces it. The actual parallelism depends on Claude Code's Task tool behavior.

6.5 Agent Teams Warning

File: /tmp/ai-harness-repos/everything-claude-code/docs/token-optimization.md lines 106-111

"Agent Teams spawns multiple context windows. Each teammate consumes tokens independently. Only use for tasks where parallelism provides clear value."

Confidence: HIGH for documentation, MEDIUM for implementation -- The parallelization strategies are well-documented but not enforced by code. The multi-model approach depends on external tooling.


7. Isolation Model

7.1 Git Worktrees

ECC recommends git worktrees as the primary isolation mechanism:

git worktree add ../project-feature-a feature-a
git worktree add ../project-feature-b feature-b

Each worktree is an independent filesystem checkout that gets its own Claude Code instance.

Evidence: /tmp/ai-harness-repos/everything-claude-code/the-longform-guide.md lines 193-203

7.2 Session Isolation

Sessions are isolated by:

  • File naming: YYYY-MM-DD-<short-id>-session.tmp -- unique per session
  • Short ID derivation: Last 8 chars of CLAUDE_SESSION_ID env var
  • Storage location: ~/.claude/sessions/

Evidence: /tmp/ai-harness-repos/everything-claude-code/scripts/lib/session-manager.js lines 22-54
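
The filename scheme reduces to a few lines. This sketch condenses the naming logic described above (the function name is mine):

```javascript
// Derive the session filename: YYYY-MM-DD-<short-id>-session.tmp, where
// the short ID is the last 8 chars of CLAUDE_SESSION_ID ('default' if unset).
function sessionFilename(sessionId, date = new Date()) {
  const shortId = (sessionId || 'default').slice(-8);
  const ymd = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `${ymd}-${shortId}-session.tmp`;
}
```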

7.3 Agent Isolation

Agents are isolated through:

  • Tool restrictions: Each agent declares which tools it can use
  • Model selection: Agents run on specified model (haiku/sonnet/opus)
  • Context scope: Subagents get limited context via the Task tool

There is no filesystem sandboxing beyond tool restrictions. An agent with Bash access can execute arbitrary commands.

7.4 Compaction Counter Isolation

The strategic compact hook uses per-session counter files:

File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/suggest-compact.js line 29

const sessionId = process.env.CLAUDE_SESSION_ID || 'default';
const counterFile = path.join(getTempDir(), `claude-tool-count-${sessionId}`);

7.5 Limitations

  • No container isolation: No Docker or sandbox for agent execution
  • No network isolation: Agents with Bash access can make network requests
  • No resource limits: No memory/CPU constraints on agent execution
  • Shared filesystem: All agents in a session share the same working directory

Confidence: HIGH


8. Human-in-the-Loop Controls

8.1 Plan Confirmation Gate

The /plan command enforces explicit user confirmation before code changes:

File: /tmp/ai-harness-repos/everything-claude-code/commands/plan.md line 96-97

"CRITICAL: The planner agent will NOT write any code until you explicitly confirm the plan with 'yes' or 'proceed'"

Users can respond with:

  • "yes" / "proceed" -- Approve and continue
  • "modify: [changes]" -- Request modifications
  • "different approach: [alternative]" -- Redirect

8.2 Hook Warnings

Several PreToolUse hooks provide non-blocking warnings:

  • Tmux reminder: Suggests tmux for long-running commands (exit code 0)
  • Git push reminder: "Review changes before push" (exit code 0)
  • Console.log warning: Warns about debug statements

8.3 Hook Blockers

Two PreToolUse hooks actively block operations:

  • Dev server blocker: Blocks npm run dev outside tmux (exit code 2)
  • Doc file blocker: Blocks creation of random .md/.txt files (exit code 2)

Evidence: /tmp/ai-harness-repos/everything-claude-code/hooks/hooks.json lines 4-44

8.4 Review Agent Verdict System

The code-reviewer agent produces verdicts:

File: /tmp/ai-harness-repos/everything-claude-code/agents/code-reviewer.md lines 209-212

  • Approve: No CRITICAL or HIGH issues
  • Warning: HIGH issues only (can merge with caution)
  • Block: CRITICAL issues found -- must fix before merge

8.5 Multi-Model Execution Gate

The multi-execute command requires explicit user confirmation:

File: /tmp/ai-harness-repos/everything-claude-code/commands/multi-execute.md line 15

"Prerequisite: Only execute after user explicitly replies 'Y' to /ccg:plan output"

8.6 Missing Controls

  • No automatic rollback: If an agent produces bad output, there's no automatic reversion
  • No approval for individual agent handoffs: The orchestration pipeline runs without intermediate approval
  • No budget gates: No automatic stopping when token cost exceeds a threshold
  • No diff review gate: No mandatory diff review before agent actions

Confidence: HIGH


9. Context Handling Strategy

9.1 Token Optimization Settings

File: /tmp/ai-harness-repos/everything-claude-code/docs/token-optimization.md

Recommended settings:

{
  "model": "sonnet",
  "env": {
    "MAX_THINKING_TOKENS": "10000",
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "50",
    "CLAUDE_CODE_SUBAGENT_MODEL": "haiku"
  }
}

| Setting | Default | Recommended | Impact |
|---------|---------|-------------|--------|
| model | opus | sonnet | ~60% cost reduction |
| MAX_THINKING_TOKENS | 31,999 | 10,000 | ~70% thinking cost reduction |
| CLAUDE_AUTOCOMPACT_PCT_OVERRIDE | 95 | 50 | Earlier compaction, better quality |
| CLAUDE_CODE_SUBAGENT_MODEL | (inherits main) | haiku | ~80% cheaper subagents |

9.2 Strategic Compaction

File: /tmp/ai-harness-repos/everything-claude-code/skills/strategic-compact/SKILL.md

The suggest-compact.js hook tracks tool call count and suggests /compact at configurable thresholds:

  • Default threshold: 50 tool calls
  • Periodic reminders: Every 25 calls after threshold
  • Session-specific counter: Uses CLAUDE_SESSION_ID for isolation
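
The threshold logic reduces to a small pure function. This sketch mirrors the documented defaults (suggest at 50 tool calls, remind every 25 thereafter) without the counter-file I/O:

```javascript
// Decide whether suggest-compact should speak up at a given tool-call count:
// 'suggest' exactly at the threshold, 'remind' at each reminder interval
// past it, null otherwise.
function shouldSuggestCompact(count, threshold = 50, reminderEvery = 25) {
  if (count === threshold) return 'suggest';
  if (count > threshold && (count - threshold) % reminderEvery === 0) return 'remind';
  return null;
}
```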

Compaction decision guide (from skill):

| Phase Transition | Compact? | Why |
|------------------|----------|-----|
| Research -> Planning | Yes | Research context is bulky |
| Planning -> Implementation | Yes | Plan is in file; free context |
| Debugging -> Next feature | Yes | Debug traces pollute context |
| Mid-implementation | No | Losing variable names/file paths |
| After failed approach | Yes | Clear dead-end reasoning |

9.3 Dynamic System Prompt Injection

File: /tmp/ai-harness-repos/everything-claude-code/the-longform-guide.md lines 56-74

# Daily development
alias claude-dev='claude --system-prompt "$(cat ~/.claude/contexts/dev.md)"'

# PR review mode
alias claude-review='claude --system-prompt "$(cat ~/.claude/contexts/review.md)"'

Three context files included, each defining a distinct behavioral mode:

contexts/dev.md (Development Mode):

Mode: Active development
Focus: Implementation, coding, building features
Behavior: Write code first, explain after
Priorities: 1. Get it working  2. Get it right  3. Get it clean
Tools to favor: Edit, Write, Bash, Grep, Glob

contexts/review.md (Review Mode):

Mode: PR review, code analysis
Focus: Quality, security, maintainability
Behavior: Read thoroughly before commenting, prioritize by severity
Checklist: Logic errors, edge cases, error handling, security, performance, readability, test coverage
Output: Group findings by file, severity first

contexts/research.md (Research Mode): Exploration-focused context for investigating codebases and external services.

The key insight here is the authority hierarchy described in the longform guide:

  1. System prompt content (highest authority)
  2. User messages
  3. Tool results (lowest authority)

By injecting context via --system-prompt, these modes shape Claude's behavior more strongly than any rule file or CLAUDE.md instruction could.

9.4 MCP Context Warning

File: /tmp/ai-harness-repos/everything-claude-code/README.md lines 673-682

"Each MCP tool description consumes tokens from your 200k window, potentially reducing it to ~70k."

Rules of thumb:

  • Keep under 10 MCPs enabled per project
  • Keep under 80 tools active
  • Use disabledMcpServers per project

9.5 PreCompact State Saving

File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/pre-compact.js

Before compaction:

  1. Logs compaction event with timestamp to compaction-log.txt
  2. Appends compaction marker to active session file

9.6 What Survives Compaction

From the strategic-compact skill:

| Persists | Lost |
|----------|------|
| CLAUDE.md instructions | Intermediate reasoning |
| TodoWrite task list | File contents previously read |
| Memory files | Multi-step conversation context |
| Git state | Tool call history |
| Files on disk | Nuanced verbal preferences |

Confidence: HIGH


10. Session Lifecycle and Persistence

10.1 Session Start

File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-start.js

On session start:

  1. Load recent sessions: Finds files matching *-session.tmp in ~/.claude/sessions/ (max 7 days old)
  2. Inject latest session: Outputs content to stdout for Claude to receive as context
  3. Report learned skills: Checks ~/.claude/skills/learned/ for extracted patterns
  4. List session aliases: Shows available named sessions
  5. Detect package manager: Reports detected PM and source

10.2 Session End

File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-end.js

On session end:

  1. Read transcript: Parses JSONL transcript from transcript_path (via stdin JSON)
  2. Extract summary: Collects user messages (last 10), tools used, files modified
  3. Create/update session file: Writes to ~/.claude/sessions/YYYY-MM-DD-<short-id>-session.tmp

The transcript parsing handles:

  • Direct content and nested message.content format
  • Tool use entries both direct and within assistant content blocks
  • Graceful handling of parse errors (lines 86-89)

10.3 Session File Format

# Session: 2026-02-22
**Date:** 2026-02-22
**Started:** 14:30
**Last Updated:** 16:45

---

## Session Summary

### Tasks
- Implement user authentication
- Fix build errors

### Files Modified
- src/auth/handler.ts
- src/middleware/auth.ts

### Tools Used
Edit, Bash, Read, Grep

### Stats
- Total user messages: 15

10.4 Session Management Commands

The /sessions command provides:

  • List all sessions with dates and sizes
  • Load a specific session by alias or ID
  • Search sessions by date or content

File: /tmp/ai-harness-repos/everything-claude-code/scripts/lib/session-manager.js

Key operations:

  • getAllSessions(): Paginated listing with filtering by date/search
  • getSessionById(): Lookup by short ID or filename
  • parseSessionMetadata(): Extract completed/in-progress items
  • getSessionStats(): Calculate session statistics

10.5 Continuous Learning Persistence

File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/evaluate-session.js

At session end, if the session had 10+ user messages:

  1. Signals to Claude that session should be evaluated for extractable patterns
  2. Saves learned skills to ~/.claude/skills/learned/

The v2 instinct system (/tmp/ai-harness-repos/everything-claude-code/skills/continuous-learning-v2/SKILL.md) provides more sophisticated persistence:

  • ~/.claude/homunculus/observations.jsonl: Raw session observations
  • ~/.claude/homunculus/instincts/personal/: Auto-learned instincts
  • ~/.claude/homunculus/instincts/inherited/: Imported instincts
  • ~/.claude/homunculus/evolved/: Generated agents/skills/commands

10.6 Transcript Parsing Implementation Detail

The session-end.js transcript parser is the most complex data processing in ECC. It handles the Claude Code JSONL format which has multiple entry structures:

Entry Type 1: Direct user message

{"type": "user", "content": "Fix the auth bug"}

Entry Type 2: Nested message format

{"type": "user", "message": {"role": "user", "content": [{"type": "text", "text": "Fix the auth bug"}]}}

Entry Type 3: Tool use (direct)

{"type": "tool_use", "tool_name": "Edit", "tool_input": {"file_path": "/src/auth.ts"}}

Entry Type 4: Tool use within assistant content blocks

{"type": "assistant", "message": {"content": [{"type": "tool_use", "name": "Edit", "input": {"file_path": "/src/auth.ts"}}]}}

The parser handles all four formats in a single pass:

// From session-end.js lines 48-85:
if (entry.type === 'user' || entry.role === 'user' || entry.message?.role === 'user') {
  const rawContent = entry.message?.content ?? entry.content;
  const text = typeof rawContent === 'string'
    ? rawContent
    : Array.isArray(rawContent)
      ? rawContent.map(c => (c && c.text) || '').join(' ')
      : '';
  // ...
}

Evidence: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-end.js lines 33-100

Data limits applied:

  • User messages: Last 10 kept (line 98)
  • User message text: Truncated to 200 chars each (line 57)
  • Tools used: Max 20 unique tools (line 99)
  • Files modified: Max 30 unique files (line 100)
  • Parse errors: Counted but silently skipped (lines 86-93)
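The limits above can be applied as a single trimming step. A minimal sketch, assuming a `trimSummary()` helper (the function name and field names are illustrative; the numbers come straight from session-end.js):

```javascript
// Trim an extracted session summary to the documented limits:
// last 10 messages, 200 chars each, 20 unique tools, 30 unique files.
function trimSummary(summary) {
  return {
    userMessages: summary.userMessages
      .slice(-10)                      // keep the LAST 10 messages
      .map(m => m.slice(0, 200)),      // truncate each to 200 chars
    toolsUsed: [...new Set(summary.toolsUsed)].slice(0, 20),         // dedupe, cap at 20
    filesModified: [...new Set(summary.filesModified)].slice(0, 30), // dedupe, cap at 30
  };
}
```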

10.7 Session Start Context Injection

The session-start.js hook performs a multi-step context loading sequence:

  1. Ensure directories exist: Creates ~/.claude/sessions/ and ~/.claude/skills/learned/ if missing
  2. Find recent sessions: Uses findFiles() with *-session.tmp glob, max 7 days age
  3. Inject latest session: Reads content, skips blank templates (checks for [Session context goes here]), outputs to stdout
  4. Report learned skills: Counts .md files in ~/.claude/skills/learned/
  5. List session aliases: Shows up to 5 named sessions via listAliases()
  6. Detect package manager: Calls getPackageManager() and reports name + detection source

Evidence: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-start.js lines 24-73

The distinction between log() (stderr) and output() (stdout) is critical here:

  • output() goes to stdout and becomes part of Claude's context
  • log() goes to stderr and is displayed to the user but not consumed as context

Only the previous session summary uses output(). All diagnostic messages use log().
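A minimal sketch of this split. The `output()`/`log()` names mirror the hook's helpers as described above; `injectPreviousSession()` and its strings are hypothetical:

```javascript
// output() writes to stdout, which Claude Code consumes as context;
// log() writes to stderr, which is shown to the user but not injected.
function output(text) {
  process.stdout.write(text + '\n'); // becomes part of Claude's context
}

function log(text) {
  process.stderr.write(text + '\n'); // diagnostic only, never context
}

// Mirrors step 3 above: skip blank templates, inject real summaries.
function injectPreviousSession(content) {
  if (!content || content.includes('[Session context goes here]')) {
    log('[SessionStart] No previous session context to load');
    return false;
  }
  output('## Previous session summary\n' + content);
  return true;
}
```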

10.8 Limitations

  • No database: All persistence is flat files (markdown + JSON)
  • No concurrent access protection: Multiple sessions could race on session files
  • No session resume: Sessions create new files; there's no true "continue where I left off" mechanism
  • 7-day retention: Session start only loads sessions from the last 7 days
  • No encryption: Session files stored in plaintext
  • No session merging: Parallel sessions (e.g., in worktrees) cannot be merged
  • No transcript validation: Assumes JSONL format is correct; corrupted transcripts are skipped line-by-line

Confidence: HIGH


11. Code Quality Gates

11.1 CI Pipeline

File: /tmp/ai-harness-repos/everything-claude-code/.github/workflows/ci.yml

Four CI jobs:

  1. Test (matrix: 3 OS x 3 Node x 4 PM = 36 combinations, minus the bun/windows exclusions = 33):

    • OS: ubuntu-latest, windows-latest, macos-latest
    • Node: 18.x, 20.x, 22.x
    • PM: npm, pnpm, yarn, bun
    • Runs node tests/run-all.js
  2. Validate Components:

    • validate-agents.js: Checks YAML frontmatter (model, tools required)
    • validate-hooks.js: JSON schema, valid events, inline JS syntax validation
    • validate-commands.js: Description frontmatter
    • validate-skills.js: SKILL.md presence
    • validate-rules.js: Markdown structure
  3. Security Scan:

    • npm audit --audit-level=high
    • continue-on-error: true -- warns but does not block
  4. Lint:

    • ESLint on scripts/**/*.js tests/**/*.js
    • markdownlint on all markdown files in agents, skills, commands, rules

11.2 Test Suite

File: /tmp/ai-harness-repos/everything-claude-code/tests/run-all.js

11 test files covering:

  • lib/utils.test.js: Cross-platform utility functions
  • lib/package-manager.test.js: Package manager detection
  • lib/session-manager.test.js: Session CRUD operations
  • lib/session-aliases.test.js: Session alias management
  • hooks/hooks.test.js: Hook JSON validation and regression tests
  • hooks/evaluate-session.test.js: Continuous learning evaluation
  • hooks/suggest-compact.test.js: Strategic compaction logic
  • integration/hooks.test.js: Integration testing
  • ci/validators.test.js: Validator script testing
  • scripts/setup-package-manager.test.js: PM setup testing
  • scripts/skill-create-output.test.js: Skill creation testing

The test runner parses "Passed: N" and "Failed: N" from output and exits with code 1 if any failures.
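A sketch of that parsing step. The `tallyResults()` helper and its exact regexes are assumptions based on the description; the "Passed: N" / "Failed: N" line format and the exit-on-failure behavior come from the runner itself:

```javascript
// Scan each test file's output for "Passed: N" / "Failed: N" lines,
// sum the counts, and compute the process exit code (1 on any failure).
function tallyResults(outputs) {
  let passed = 0;
  let failed = 0;
  for (const out of outputs) {
    const p = out.match(/Passed:\s*(\d+)/);
    const f = out.match(/Failed:\s*(\d+)/);
    if (p) passed += Number(p[1]);
    if (f) failed += Number(f[1]);
  }
  return { passed, failed, exitCode: failed > 0 ? 1 : 0 };
}
```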

11.3 Agent Validation Rules

File: /tmp/ai-harness-repos/everything-claude-code/scripts/ci/validate-agents.js

Required frontmatter fields: model, tools
Valid models: haiku, sonnet, opus

11.4 Hook Validation

File: /tmp/ai-harness-repos/everything-claude-code/scripts/ci/validate-hooks.js

Validates:

  • JSON parsing
  • Valid event types (PreToolUse, PostToolUse, PreCompact, SessionStart, SessionEnd, Stop, Notification, SubagentStop)
  • Matcher presence
  • Hook entry structure (type, command required)
  • Inline JavaScript syntax via vm.Script compilation (line 43-49)
  • Async/timeout field types

11.5 Rules-Based Quality Enforcement

File: /tmp/ai-harness-repos/everything-claude-code/rules/common/testing.md

Mandatory TDD workflow enforced by rules:

  1. Write test first (RED)
  2. Run test - should FAIL
  3. Write minimal implementation (GREEN)
  4. Run test - should PASS
  5. Refactor (IMPROVE)
  6. Verify coverage (80%+)

Confidence: HIGH


12. Security and Compliance Mechanisms

12.1 Security Rules

File: /tmp/ai-harness-repos/everything-claude-code/rules/common/security.md

Mandatory pre-commit checklist:

  • No hardcoded secrets
  • All user inputs validated
  • SQL injection prevention (parameterized queries)
  • XSS prevention (sanitized HTML)
  • CSRF protection
  • Authentication/authorization verified
  • Rate limiting on all endpoints
  • Error messages don't leak sensitive data

12.2 Security Reviewer Agent

File: /tmp/ai-harness-repos/everything-claude-code/agents/security-reviewer.md

Comprehensive OWASP Top 10 checklist with specific code patterns to flag:

  • Hardcoded secrets: CRITICAL
  • Shell command with user input: CRITICAL
  • String-concatenated SQL: CRITICAL
  • innerHTML = userInput: HIGH
  • fetch(userProvidedUrl): HIGH
  • No auth check on route: CRITICAL

12.3 AgentShield Integration

File: /tmp/ai-harness-repos/everything-claude-code/README.md lines 382-408

External security scanning tool:

npx ecc-agentshield scan         # Quick scan
npx ecc-agentshield scan --fix   # Auto-fix safe issues
npx ecc-agentshield scan --opus  # Three Opus agents (red team/blue team/auditor)

Scans: CLAUDE.md, settings.json, MCP configs, hooks, agent definitions, skills
Categories: secrets detection (14 patterns), permission auditing, hook injection analysis, MCP server risk profiling, agent config review

Note: AgentShield is a separate repository (affaan-m/agentshield), not included in ECC.

12.4 Hook Security

The hooks contain security measures:

  • Doc file blocker: Prevents creation of arbitrary .md files (potential for injection)
  • Command injection prevention: utils.js line 313 validates command names with regex /^[a-zA-Z0-9_.-]+$/
  • Package manager input validation: SAFE_NAME_REGEX and SAFE_ARGS_REGEX (package-manager.js lines 285-319)
  • Stdin size limits: MAX_STDIN = 1024 * 1024 in session-end.js

12.5 Input Validation Implementation Detail

The package manager module implements the most rigorous input validation in ECC. Two regex patterns form the defense:

SAFE_NAME_REGEX (/^[@a-zA-Z0-9_./-]+$/):

  • Used for script names and binary names
  • Allows: alphanumeric, @ (scoped packages like @scope/pkg), . (dotfiles), / (paths), -, _
  • Rejects: shell metacharacters ;, |, &, `, $, (, ), {, }, <, >, !
  • Applied in: getRunCommand() (line 297), getExecCommand() (line 331)

SAFE_ARGS_REGEX (/^[@a-zA-Z0-9\s_./:=,'"*+-]+$/):

  • Used for command arguments
  • More permissive: adds whitespace, =, :, ,, quotes, *
  • Still rejects: ;, |, &, `, $, (, ), {, }, <, >
  • Applied in: getExecCommand() (line 334)

Both throw Error on validation failure rather than silently stripping characters:

// From package-manager.js lines 293-299:
function getRunCommand(script, options = {}) {
  if (!script || typeof script !== 'string') {
    throw new Error('Script name must be a non-empty string');
  }
  if (!SAFE_NAME_REGEX.test(script)) {
    throw new Error(`Script name contains unsafe characters: ${script}`);
  }
  // ...
}

Evidence: /tmp/ai-harness-repos/everything-claude-code/scripts/lib/package-manager.js lines 283-339

The commandExists() function in utils.js uses a separate validation layer:

  • Validates command name with /^[a-zA-Z0-9_.-]+$/ (stricter -- no @ or /)
  • Uses spawnSync instead of execSync to avoid shell interpolation
  • Platform-aware: where on Windows, which on Unix

Evidence: /tmp/ai-harness-repos/everything-claude-code/scripts/lib/utils.js lines 311-329
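Exercising the two patterns directly shows the throw-don't-strip behavior. The regexes below are copied verbatim from the analysis above; the `validate()` helper is illustrative:

```javascript
// Patterns quoted from package-manager.js (per the analysis above).
const SAFE_NAME_REGEX = /^[@a-zA-Z0-9_./-]+$/;
const SAFE_ARGS_REGEX = /^[@a-zA-Z0-9\s_./:=,'"*+-]+$/;

// Throw on unsafe input rather than silently stripping characters.
function validate(regex, value) {
  if (!regex.test(value)) {
    throw new Error(`Unsafe characters: ${value}`);
  }
  return value;
}
```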

12.6 Security Gaps

  • No secret scanning in hooks: Hooks don't check for secrets in edited content
  • No dependency pinning enforcement: npm audit runs but does not block PRs
  • execSync usage: utils.js line 343 uses execSync with a security warning but no actual enforcement
  • No RBAC for agents: All agents in a plugin share the same permission context
  • MCP credentials in config: mcp-configs/mcp-servers.json has YOUR_*_HERE placeholders but no validation
  • No file path traversal prevention in hooks: Hook scripts do not validate that file paths in tool_input are within the project directory (only install.sh has path traversal checks)
  • Inline JS in hooks.json is unsandboxed: The node -e inline scripts run with full system access, same as external scripts

Confidence: HIGH for documented mechanisms, MEDIUM for completeness


13. Hooks, Automation Surface, and Fail-Safe Behavior

13.1 Hook Architecture

File: /tmp/ai-harness-repos/everything-claude-code/hooks/hooks.json

ECC defines hooks across 6 lifecycle events:

Event Count Purpose
PreToolUse 5 matchers Validation, blocking, suggestions
PostToolUse 5 matchers Formatting, checking, logging
PreCompact 1 matcher State preservation
SessionStart 1 matcher Context loading
Stop 1 matcher Console.log audit
SessionEnd 2 matchers State persistence, pattern extraction

13.2 Hook Execution Model

From the hooks README:

  • PreToolUse: Can block (exit 2), warn (stderr without exit 2), or pass (exit 0)
  • PostToolUse: Can analyze output but cannot block
  • Stop: Runs after each Claude response
  • SessionStart/SessionEnd: Run at session lifecycle boundaries
  • PreCompact: Runs before context compaction
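The block/warn/pass triad maps directly onto exit codes and stderr. A sketch with a hypothetical `decide()` function (returning the exit code rather than calling process.exit keeps it testable); the command patterns are taken from the matchers described in this report:

```javascript
// PreToolUse decision: 2 = block the tool call, 0 = allow.
// A stderr message with exit 0 is a non-blocking warning.
function decide(command) {
  if (/\b(npm run dev|pnpm dev|yarn dev|bun run dev)\b/.test(command)) {
    process.stderr.write('Run dev servers inside tmux instead.\n');
    return 2; // block: Claude Code cancels the tool call
  }
  if (/\bgit push\b/.test(command)) {
    process.stderr.write('Review changes before pushing.\n');
    return 0; // warn: message shown, tool call proceeds
  }
  return 0; // pass silently
}
```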

13.3 Inline vs Script Hooks

ECC uses two hook implementation patterns:

  1. Inline Node.js: node -e "..." for simple, self-contained checks

    • Example: Dev server blocker (hooks.json line 10)
    • Pros: No external file dependency
    • Cons: Hard to read, hard to test, no source maps
  2. Script files: node "${CLAUDE_PLUGIN_ROOT}/scripts/hooks/script.js" for complex logic

    • Example: Session start (hooks.json line 74)
    • Uses ${CLAUDE_PLUGIN_ROOT} variable for plugin-relative paths
    • Pros: Testable, readable, version-controlled

13.3.1 Detailed Hook-by-Hook Inventory

PreToolUse Hooks (5 total):

# Matcher Type Behavior Implementation
1 Bash Blocking (exit 2) Blocks dev servers outside tmux Inline: Regex tests for npm run dev, pnpm dev, yarn dev, bun run dev. Only active on non-Windows. Outputs tmux instructions to stderr.
2 Bash Warning (stderr) Reminds to use tmux for long-running commands Inline: Tests for npm/pnpm/yarn/bun install/test, cargo, make, docker, pytest, vitest, playwright. Only triggers if $TMUX is unset.
3 Bash Warning (stderr) Warns before git push Inline: Tests for git push in command string. Outputs reminder to review changes.
4 Write Blocking (exit 2) Blocks creation of random .md/.txt files Inline: Allows README.md, CLAUDE.md, AGENTS.md, CONTRIBUTING.md, and files in .claude/plans/. Blocks all other markdown/text file creation.
5 Edit|Write Pass-through Suggests compaction at thresholds Script: suggest-compact.js. Increments per-session counter. Suggests /compact at threshold (default 50) and every 25 calls thereafter.

PostToolUse Hooks (5 total):

# Matcher Type Behavior Implementation
1 Bash Warning (stderr) Logs PR URL after gh pr create Inline: Extracts GitHub PR URL from command output with regex. Provides gh pr review command.
2 Bash Async (background) Build analysis notification Inline: Detects build commands. Logs completion message. Runs with async: true, timeout: 30.
3 Edit Pass-through Auto-formats JS/TS files Script: post-edit-format.js. Runs Prettier on edited .js/.ts/.jsx/.tsx files.
4 Edit Pass-through TypeScript checking Script: post-edit-typecheck.js. Runs tsc --noEmit on edited .ts/.tsx files.
5 Edit Warning (stderr) Console.log detection Script: post-edit-console-warn.js. Warns if edited file contains console.log statements.

Lifecycle Hooks (5 total):

# Event Matcher Behavior Implementation
1 PreCompact * Saves state before compaction Script: pre-compact.js. Logs timestamp to compaction-log.txt. Appends compaction marker to active session file.
2 SessionStart * Loads previous context Script: session-start.js. Finds recent sessions (7 days), injects latest to context, reports learned skills and session aliases, detects package manager.
3 Stop * Console.log audit Script: check-console-log.js. Checks all git-modified files for console.log statements.
4 SessionEnd * Persists session state Script: session-end.js. Parses JSONL transcript, extracts user messages/tools/files, creates/updates session file.
5 SessionEnd * Evaluates session for patterns Script: evaluate-session.js. Counts user messages. If >= 10, signals pattern extraction. Saves to learned skills directory.

13.3.2 Stdin JSON Protocol for Hooks

All hooks that need context from Claude Code receive a JSON object on stdin. The protocol is:

// PreToolUse input:
{
  "tool_name": "Bash",       // The tool being called
  "tool_input": {
    "command": "npm run dev"  // Tool-specific input
  }
}

// PostToolUse input:
{
  "tool_name": "Edit",
  "tool_input": { "file_path": "/path/to/file.ts", ... },
  "tool_output": { "output": "..." }
}

// SessionEnd input:
{
  "transcript_path": "/path/to/session.jsonl"
}

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/hooks/README.md lines 45-67: Input format documentation
  • /tmp/ai-harness-repos/everything-claude-code/scripts/lib/utils.js lines 440-490: readStdinJson() implementation with 5s timeout and 1MB max
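The 1 MB cap can be sketched by factoring the chunk-accumulation core out of the stdin plumbing (the 5 s timeout is omitted here). `accumulate()` is a hypothetical helper; the MAX_STDIN value matches the one documented above:

```javascript
const MAX_STDIN = 1024 * 1024; // 1 MB cap, as in session-end.js

// Collect stdin chunks up to the size cap, then parse the JSON payload.
function accumulate(chunks) {
  let size = 0;
  const parts = [];
  for (const chunk of chunks) {
    size += chunk.length;
    if (size > MAX_STDIN) {
      throw new Error('stdin exceeded 1MB limit');
    }
    parts.push(chunk);
  }
  return JSON.parse(parts.join(''));
}
```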

13.3.3 The Fail-Safe Exit Pattern in Detail

Every hook script follows this exact error-handling pattern:

async function main() {
  // ... hook logic ...
  process.exit(0);
}

main().catch(err => {
  console.error('[HookName] Error:', err.message);
  process.exit(0); // Don't block on errors
});

This ensures that:

  1. Synchronous errors in main() are caught by the .catch() handler
  2. The error message goes to stderr, which Claude Code displays but does not act on
  3. Exit code 0 prevents Claude Code from treating the hook failure as a blocking event
  4. The only intentional non-zero exit is process.exit(2) in PreToolUse blocking hooks

Files demonstrating this pattern:

  • /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-start.js lines 77-80
  • /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-end.js lines 230-233
  • /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/pre-compact.js lines 45-48
  • /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/suggest-compact.js lines 77-80
  • /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/evaluate-session.js lines 37-42

13.4 Async Hooks

One hook uses async execution:

File: /tmp/ai-harness-repos/everything-claude-code/hooks/hooks.json lines 93-99

{
  "type": "command",
  "command": "node -e \"...build analysis...\"",
  "async": true,
  "timeout": 30
}

Async hooks run in background without blocking the main flow.

13.5 Fail-Safe Behavior

All hook scripts follow the same fail-safe pattern:

main().catch(err => {
  console.error('[HookName] Error:', err.message);
  process.exit(0); // Don't block on errors
});

This means hook failures are logged to stderr but never block Claude Code operation. The only exception is intentional blocking via exit code 2 in PreToolUse hooks.

13.6 Known Hook Issue: Duplicate Detection

File: /tmp/ai-harness-repos/everything-claude-code/README.md lines 443-451

Claude Code v2.1+ automatically loads hooks/hooks.json from installed plugins. Explicitly declaring hooks in plugin.json causes duplicate detection errors. This has caused repeated fix/revert cycles (#29, #52, #103); a regression test now guards against reintroducing the duplicate declaration.

13.7 Automation Surface

Total automation capabilities:

  • 5 PreToolUse hooks: 2 blocking, 3 warning
  • 5 PostToolUse hooks: Formatting, type checking, console.log warning
  • 1 PreCompact hook: State preservation
  • 1 SessionStart hook: Context loading
  • 1 Stop hook: Console.log audit
  • 2 SessionEnd hooks: State persistence + pattern extraction
  • 31 slash commands: User-triggered workflows
  • 13 agents: Delegatable specialist tasks

Confidence: HIGH


14. CLI/UX and Automation Ergonomics

14.1 Command Inventory

31 slash commands organized by category:

Planning & Architecture:

  • /plan -- Implementation planning
  • /orchestrate -- Multi-agent coordination
  • /multi-plan -- Multi-model collaborative planning
  • /multi-execute -- Multi-model collaborative execution
  • /multi-backend -- Backend multi-service orchestration
  • /multi-frontend -- Frontend multi-service orchestration
  • /multi-workflow -- General multi-service workflows

Development:

  • /tdd -- Test-driven development
  • /build-fix -- Fix build errors
  • /e2e -- Generate E2E tests
  • /refactor-clean -- Dead code removal
  • /pm2 -- PM2 service management

Review & Security:

  • /code-review -- Quality review
  • /go-review -- Go code review
  • /python-review -- Python code review
  • /verify -- Verification loop
  • /eval -- Evaluate against criteria
  • /test-coverage -- Test coverage analysis

Learning & Memory:

  • /learn -- Extract patterns from session
  • /checkpoint -- Save verification state
  • /instinct-status -- View learned instincts
  • /instinct-import -- Import instincts
  • /instinct-export -- Export instincts
  • /evolve -- Cluster instincts into skills
  • /skill-create -- Generate skills from git history

Maintenance:

  • /update-docs -- Update documentation
  • /update-codemaps -- Update codemaps
  • /sessions -- Session history management
  • /setup-pm -- Configure package manager

14.2 Command Design Pattern

Commands follow a consistent markdown template:

---
description: Brief description shown in /help
---

# Command Name

## Purpose
What this command does.

## Usage
/command-name [args]

## Workflow
1. Step 1
2. Step 2

## Output
What the user receives.

14.3 Agent Selection Guide

File: /tmp/ai-harness-repos/everything-claude-code/README.md lines 610-624

Quick reference table mapping user intent to commands and agents:

I want to... Command Agent
Plan a new feature /plan "Add auth" planner
Design system architecture /plan + architect architect
Write code with tests first /tdd tdd-guide
Review code /code-review code-reviewer
Fix a failing build /build-fix build-error-resolver

14.4 Workflow Chaining

Common workflows documented:

Starting a new feature:
/plan "Add user authentication" -> /tdd -> /code-review

Fixing a bug:
/tdd -> implement fix -> /code-review

Preparing for production:
/security-scan -> /e2e -> /test-coverage

14.5 Installation Wizard

The configure-ecc skill provides guided setup:

  • Merge/overwrite detection for existing configurations
  • Language-specific rule installation
  • Interactive configuration

Confidence: HIGH


15. Cost/Usage Visibility and Governance

15.1 Built-in Cost Monitoring

ECC relies on Claude Code's built-in /cost command for cost visibility. There is no custom cost tracking mechanism.

15.2 Token Optimization Documentation

File: /tmp/ai-harness-repos/everything-claude-code/docs/token-optimization.md

Documented strategies:

  • Model selection (Haiku/Sonnet/Opus based on task complexity)
  • Reduced thinking tokens (31,999 -> 10,000)
  • Earlier auto-compaction (95% -> 50%)
  • Cheaper subagent model (haiku)
  • MCP server management (keep under 10 enabled)

15.3 Agent Teams Cost Warning

File: /tmp/ai-harness-repos/everything-claude-code/docs/token-optimization.md lines 106-111

Explicit warning that Agent Teams spawns multiple context windows, each consuming tokens independently.

15.4 Missing Cost Governance

  • No budget limits: No mechanism to stop execution when cost exceeds threshold
  • No per-agent cost tracking: Cannot measure cost per agent invocation
  • No cost estimation: No pre-execution cost estimation for commands
  • No usage reporting: No historical usage reports or dashboards
  • No team governance: No multi-user cost allocation

Confidence: HIGH for what exists, HIGH for gaps


16. Tooling and Dependency Surface

16.1 Runtime Dependencies

Dependency Required? Purpose
Node.js >= 18 Yes All hooks and scripts
Claude Code CLI v2.1+ Yes Core runtime
Git Recommended Worktrees, diff analysis
npm/pnpm/yarn/bun One required Package management
tmux Recommended Long-running commands

16.2 Dev Dependencies

File: /tmp/ai-harness-repos/everything-claude-code/package.json lines 81-85

{
  "devDependencies": {
    "@eslint/js": "^9.39.2",
    "eslint": "^9.39.2",
    "globals": "^17.1.0",
    "markdownlint-cli": "^0.47.0"
  }
}

Minimal dependency footprint -- only ESLint and markdownlint for CI validation.

16.3 Optional Dependencies

Tool Used By Purpose
Prettier PostToolUse hook Auto-formatting JS/TS
TypeScript (tsc) PostToolUse hook Type checking
knip refactor-cleaner agent Dead code detection
depcheck refactor-cleaner agent Unused dependency detection
Playwright e2e-runner agent Browser testing
Agent Browser e2e-runner agent Preferred E2E tool
ruff Python formatting hook recipe Python formatting

16.4 External Tools Not Included

  • codeagent-wrapper: Required by multi-plan/multi-execute commands, NOT included
  • ecc-agentshield: Security scanning, separate npm package/repo
  • mcp__ace-tool__search_context: MCP tool used in multi-plan, NOT included
  • Skill Creator GitHub App: External service at skill-creator.app

16.5 Cross-Platform Support

All hooks are Node.js (no bash dependency):

  • Windows, macOS, Linux supported
  • scripts/lib/utils.js provides cross-platform utilities
  • Platform detection at lines 12-14: isWindows, isMacOS, isLinux
  • commandExists() uses where on Windows, which on Unix

File: /tmp/ai-harness-repos/everything-claude-code/hooks/README.md lines 192-193

"All hooks in this plugin use Node.js (node -e or node script.js) for maximum compatibility across Windows, macOS, and Linux."

Confidence: HIGH


17. External Integrations and Provider Compatibility

17.1 MCP Server Configurations

File: /tmp/ai-harness-repos/everything-claude-code/mcp-configs/mcp-servers.json

15 pre-configured MCP servers:

Server Type Purpose
github npx command GitHub operations
firecrawl npx command Web scraping
supabase npx command Database operations
memory npx command Persistent memory
sequential-thinking npx command Chain-of-thought
vercel HTTP Deployments
railway npx command Deployments
cloudflare-docs HTTP Documentation
cloudflare-workers-builds HTTP Worker builds
cloudflare-workers-bindings HTTP Worker bindings
cloudflare-observability HTTP Observability
clickhouse HTTP Analytics
context7 npx command Live documentation
magic npx command UI components
filesystem npx command Filesystem operations

17.2 Multi-Platform IDE Support

Platform Support Level Config Location
Claude Code Full (primary) Root directories
Cursor IDE Full (translated) .cursor/
OpenCode Full (with plugins) .opencode/

Cursor translation details (/tmp/ai-harness-repos/everything-claude-code/.cursor/README.md):

  • Rules: YAML frontmatter added, paths flattened
  • Agents: Model IDs expanded, tools -> readonly flag
  • Skills: Identical (no changes needed)
  • Commands: Path references updated, multi-* stubbed
  • MCP Config: Env interpolation syntax updated
  • Hooks: No equivalent in Cursor

17.3 OpenCode Integration

File: /tmp/ai-harness-repos/everything-claude-code/.opencode/README.md

OpenCode support includes:

  • 12 agents (vs 13 in Claude Code)
  • 24 commands (vs 31 in Claude Code)
  • 16 skills (vs 43 in Claude Code)
  • 20+ hook events (vs 8 in Claude Code)
  • 3 native custom tools (run-tests, check-coverage, security-audit)

Hook event mapping:

Claude Code OpenCode
PreToolUse tool.execute.before
PostToolUse tool.execute.after
Stop session.idle
SessionStart session.created
SessionEnd session.deleted

17.4 Multi-Model Provider Support

The multi-plan/multi-execute commands support:

  • Codex (OpenAI) -- Backend analysis authority
  • Gemini (Google) -- Frontend design authority
  • Claude (Anthropic) -- Final synthesis and code sovereignty

Trust rules: "Backend follows Codex, Frontend follows Gemini"

IMPORTANT: This multi-model support requires the external codeagent-wrapper binary.

17.5 Language Support

Skills and rules provided for:

  • TypeScript/JavaScript (primary)
  • Python (including Django)
  • Go/Golang
  • Java (Spring Boot, JPA)
  • C++ (coding standards, GoogleTest)
  • Swift (actor persistence, protocol DI)
  • Rust (example CLAUDE.md only)

Confidence: HIGH


18. Operational Assumptions and Constraints

18.1 Explicit Assumptions

  1. Claude Code CLI available: Minimum v2.1.0 required
  2. Node.js >= 18: Required for all hook scripts
  3. Git available: Recommended for session ID generation, worktrees
  4. Single user: No multi-user or team collaboration features
  5. Local execution: No remote/cloud execution support
  6. Plugin system working: Hooks auto-loading depends on Claude Code plugin convention

18.2 Implicit Assumptions

  1. Context window is 200K tokens: All optimization strategies assume this limit
  2. Claude's Task tool works as documented: Agent delegation assumes Task tool behavior
  3. File system writable: Sessions, learned skills written to ~/.claude/
  4. Stdin JSON protocol: Hooks assume Claude Code provides JSON on stdin
  5. Transcript path available: Session end hooks need transcript_path in stdin
  6. Single active session: No concurrent session management

18.3 Constraints

  1. No custom LLM runtime: Cannot use models outside Claude Code's supported set
  2. No DAG execution: Orchestration is linear pipeline or manual parallelism
  3. No cross-session state: Beyond session files, no shared state between sessions
  4. Plugin system limitations: Cannot distribute rules via plugins (upstream limitation)
  5. Hook execution limit: Hooks must complete within Claude Code's timeout
  6. Context compaction lossy: Important state may be lost during compaction

Confidence: HIGH


19. Failure Modes and Issues Observed

19.1 Documented Issues

  1. Duplicate hooks file (Issues #29, #52, #103):

    • Claude Code v2.1+ auto-loads hooks/hooks.json from plugins
    • Explicitly declaring hooks in plugin.json causes duplicate detection error
    • Fixed with regression test, but has caused repeated fix/revert cycles
  2. Instinct import content loss (Issue #148, PR #161):

    • parse_instinct_file() was dropping all content after frontmatter
    • Fixed in v1.4.1 by community contributor
  3. Windows Bun spawn limit (referenced in package-manager.js line 228):

    • Session start hooks running during Bun init could exceed spawn limit
    • Fixed by removing child process spawning from package manager detection hot path

19.2 Potential Failure Modes

  1. Hook timeout: Long-running hooks could be killed by Claude Code

    • Mitigation: All hooks use fail-safe exit(0) pattern
  2. Session file corruption: No locking mechanism for concurrent access

    • Risk: Multiple Claude instances writing to same session file
    • Mitigation: Session files use unique short IDs
  3. Compaction counter race condition: suggest-compact.js acknowledges race window

    • File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/suggest-compact.js line 39
    • "Use fd-based read+write to reduce (but not eliminate) race window"
  4. Transcript parsing failures: session-end.js handles parse errors gracefully

    • File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-end.js lines 86-89
    • Skips unparseable lines, logs count
  5. MCP context explosion: Too many enabled MCPs can reduce effective context to ~70K

    • Mitigation: Documentation warns, but no automated enforcement
  6. Agent tool misuse: Agents with Bash access could execute destructive commands

    • Mitigation: Principle of least privilege in tool assignment, but no runtime sandboxing
  7. Multi-model dependency failure: codeagent-wrapper binary not found

    • Impact: multi-plan and multi-execute commands completely non-functional
    • Mitigation: None documented

19.3 Observed Robustness Patterns

  • Graceful degradation: All hooks exit 0 on error
  • Input validation: Package manager validates names and arguments
  • TOCTOU handling: Session manager wraps stat calls in try-catch for deleted files
  • Size limits: Stdin reading has 1MB cap
  • Counter clamping: Compact counter clamped to 1-1000000 range
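Two of these patterns in miniature: parse-or-default plus range clamping of the compaction counter. `clampCounter()` is a hypothetical name; the 1-1000000 range is the one documented above:

```javascript
// Parse a stored counter value; default to 1 on garbage, then clamp
// to the documented 1..1000000 range.
function clampCounter(raw) {
  const n = Number.parseInt(raw, 10);
  if (!Number.isFinite(n)) return 1; // NaN or non-numeric input
  return Math.min(Math.max(n, 1), 1000000);
}
```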

Confidence: HIGH


20. Governance and Guardrails

20.1 Agent Behavioral Guardrails

  1. Read-only agents: planner, architect -- cannot modify code
  2. Code reviewer filtering: >80% confidence threshold before reporting issues (code-reviewer.md line 27)
  3. Build error resolver constraints: "No architecture changes, only fix errors" (build-error-resolver.md lines 19-20)
  4. Code sovereignty: In multi-model execution, "All file modifications by Claude, external models have zero write access" (multi-execute.md line 12)

20.2 Quality Rules

File: /tmp/ai-harness-repos/everything-claude-code/rules/common/

  • testing.md: 80% coverage minimum, TDD mandatory
  • security.md: Pre-commit security checklist
  • coding-style.md: Immutability, file organization
  • git-workflow.md: Commit format, PR process
  • performance.md: Model selection strategy
  • patterns.md: Design patterns, API response format
  • hooks.md: Hook architecture guidelines
  • agents.md: Subagent delegation rules

20.3 CI Enforcement

All PRs must pass:

  • Component validation (agents, hooks, commands, skills, rules)
  • ESLint + markdownlint
  • Test suite across 33+ OS/Node/PM combinations
  • npm audit (warning only)

20.4 Contributing Guidelines

File: /tmp/ai-harness-repos/everything-claude-code/CONTRIBUTING.md

PR requirements:

  • Follow format guidelines
  • Tested with Claude Code
  • No sensitive info
  • Clear descriptions
  • Conventional commit format: feat(skills): add rust-patterns skill

20.5 Missing Governance

  • No rate limiting: No throttling of agent invocations
  • No audit logging: No persistent audit trail of agent actions
  • No approval workflow: No multi-person approval for configuration changes
  • No role-based access: All users have full access to all components
  • No compliance mapping: No GDPR, SOC2, or other compliance framework alignment

Confidence: HIGH


21. Roadmap/Evolution Signals

21.1 Proven and Shipped (v1.0-v1.4.1)

  • Core agent/skill/hook/command/rule system
  • Cross-platform Node.js hooks
  • Session persistence
  • Package manager detection
  • CI pipeline with multi-OS/multi-PM testing
  • Continuous learning v1 and v2 (instinct-based)
  • Multi-language rules (TS, Python, Go)
  • Cursor and OpenCode integration
  • Plugin system with marketplace support
  • i18n (Chinese simplified/traditional, Japanese)

21.2 TODO/Roadmap Claims (Not Yet Implemented)

  1. configure-ecc token optimization integration: Token optimization guide mentions future integration with the install wizard (docs/token-optimization.md line 117)

  2. Memory MCP default disabling: "The memory MCP server is configured by default but not used by any skill, agent, or hook -- consider disabling it" (docs/token-optimization.md line 101)

  3. Multi-model commands: Depend on external codeagent-wrapper binary not included in repo

21.3 Evolution Signals

  1. From bash to Node.js: All hooks migrated from bash to Node.js for cross-platform compatibility
  2. From v1 to v2 learning: Skills-based learning evolved to instinct-based with confidence scoring
  3. From single IDE to multi-IDE: Added Cursor and OpenCode support
  4. From English to i18n: Added Chinese and Japanese translations
  5. Community growth: 42K+ stars, community contributions (e.g., instinct import fix)

21.4 Missing Areas

  1. No runtime execution engine: Pure configuration, no programmatic API
  2. No persistent database: All state in flat files
  3. No real-time monitoring: No dashboard, no metrics collection
  4. No team collaboration: Single-user focused
  5. No API/SDK: Cannot be integrated into other tools programmatically
  6. No DAG execution: Linear pipeline only
  7. No rollback mechanism: No undo for agent actions beyond git
  8. No cost controls: No budget limits or spending alerts
  9. No formal state machine: Workflow state not tracked programmatically
  10. No LLM provider abstraction: Tightly coupled to Claude/Anthropic

21.5 Unresolved Issues

  1. Hook duplicate detection fragility: Despite regression test, this has recurred 3 times
  2. Multi-model dependency gap: multi-plan/multi-execute require external binary
  3. Session file concurrency: No locking mechanism
  4. Context window measurement: No programmatic way to measure current context usage

Confidence: HIGH for observations, MEDIUM for roadmap predictions


22. What Should Be Borrowed/Adapted into Maestro

22.1 STRONGLY RECOMMEND Borrowing

  1. Hook Architecture Pattern (Confidence: HIGH)

    • The six-event lifecycle (PreToolUse, PostToolUse, PreCompact, SessionStart, SessionEnd, Stop) with matcher-based filtering is well-designed
    • The fail-safe pattern (exit 0 on error) prevents hooks from breaking the main flow
    • The blocking (exit 2) vs warning (stderr) distinction is clean
    • File: /tmp/ai-harness-repos/everything-claude-code/hooks/hooks.json
  2. Session Persistence Model (Confidence: HIGH)

    • Transcript parsing at session end to extract structured summaries
    • Session start injection of previous context
    • CompactPre hook to save state before lossy compaction
    • Files: scripts/hooks/session-start.js, scripts/hooks/session-end.js
  3. Agent Tool Scoping (Confidence: HIGH)

    • Read-only agents for planning/architecture
    • Full-access agents for implementation
    • The principle of least privilege for tool assignment
    • Pattern from: Agent frontmatter tools field
  4. Strategic Compaction (Confidence: HIGH)

    • The compaction decision guide (when to compact vs not)
    • Tool call counting with configurable thresholds
    • The insight that auto-compaction at 95% is too late
    • File: skills/strategic-compact/SKILL.md
  5. CI Validation Pipeline (Confidence: HIGH)

    • Structural validation of agents, hooks, commands, skills, rules
    • Inline JS syntax checking via vm.Script
    • Multi-OS, multi-Node, multi-PM testing matrix
    • File: .github/workflows/ci.yml
  6. Code Reviewer Confidence Filtering (Confidence: HIGH)

    • 80% confidence threshold before reporting
    • Skip stylistic preferences unless violating conventions
    • Consolidate similar issues rather than listing each one
    • File: agents/code-reviewer.md lines 26-29
  7. Iterative Retrieval Pattern (Confidence: MEDIUM)

    • The 4-phase DISPATCH/EVALUATE/REFINE/LOOP pattern for context retrieval
    • Max 3 cycles, then proceed with best available
    • Relevance scoring 0-1
    • File: skills/iterative-retrieval/SKILL.md
  8. Continuous Learning v2 (Instinct Model) (Confidence: MEDIUM)

    • Atomic instincts with confidence scoring
    • Evidence-backed patterns
    • Evolution from instincts to skills/commands/agents
    • File: skills/continuous-learning-v2/SKILL.md
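To ground the first recommendation, here is a minimal TypeScript sketch of the blocking/warning/fail-safe decision logic, extracted into a pure function so it is testable (unlike the inline node -e variant criticized in 22.3). The event shape and regexes are illustrative assumptions, not ECC's actual hook code.

```typescript
// Sketch of ECC's hook outcome model (illustrative, not ECC's actual code):
//   exit 2          -> block the tool call; stderr explains why
//   exit 0 + stderr -> allow the tool call but surface a warning
//   exit 0          -> allow; also used on any internal error (fail-safe)

interface HookEvent {
  tool_name: string;
  tool_input: { command?: string };
}

interface HookOutcome {
  exitCode: 0 | 2;
  stderr?: string;
}

function evaluateHook(rawStdin: string): HookOutcome {
  try {
    const event: HookEvent = JSON.parse(rawStdin);
    const cmd = event.tool_input.command ?? "";
    if (/rm\s+-rf\s+\//.test(cmd)) {
      return { exitCode: 2, stderr: "Blocked: destructive command" };
    }
    if (/\bsudo\b/.test(cmd)) {
      return { exitCode: 0, stderr: "Warning: sudo detected" };
    }
    return { exitCode: 0 };
  } catch {
    // Fail-safe: a crashing hook must never break the main agent flow.
    return { exitCode: 0 };
  }
}
```

A real hook would wrap this in a small stdin reader and call process.exit(outcome.exitCode), keeping the logic in an external script file rather than inline JSON.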

22.2 CONSIDER Borrowing with Modifications

  1. Orchestration Pipeline (Confidence: MEDIUM)

    • The sequential agent pipeline with handoff documents is useful
    • BUT: Should add approval gates between agents
    • BUT: Should support DAG execution, not just linear
    • File: commands/orchestrate.md
  2. Token Optimization Settings (Confidence: MEDIUM)

    • Good defaults for cost reduction
    • BUT: Should add automated budget enforcement
    • File: docs/token-optimization.md
  3. Multi-Model Approach (Confidence: LOW)

    • The "backend follows Codex, frontend follows Gemini" trust model is interesting
    • BUT: Depends on external tooling not included
    • BUT: Adds complexity and cost without clear evidence of benefit
    • Files: commands/multi-plan.md, commands/multi-execute.md
  4. Package Manager Detection (Confidence: HIGH)

    • 6-level detection priority is thorough
    • Cross-platform compatibility well-handled
    • Input validation for command injection prevention
    • File: scripts/lib/package-manager.js
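The detection-priority idea in item 4 can be condensed into a sketch like the one below. Only a few of the six levels are shown, and all names are illustrative, not the actual API of scripts/lib/package-manager.js.

```typescript
// Condensed sketch of priority-ordered package manager detection.
// Earlier levels win; names are illustrative assumptions.

type PM = "npm" | "pnpm" | "yarn" | "bun";

interface DetectInput {
  envOverride?: PM;          // highest priority: explicit override
  packageJsonField?: string; // "packageManager" field, e.g. "pnpm@9.0.0"
  lockfiles: string[];       // lockfile names present in the project root
}

const LOCKFILE_MAP: Record<string, PM> = {
  "pnpm-lock.yaml": "pnpm",
  "yarn.lock": "yarn",
  "bun.lockb": "bun",
  "package-lock.json": "npm",
};

function detectPackageManager(input: DetectInput): PM {
  if (input.envOverride) return input.envOverride;
  const field = input.packageJsonField?.split("@")[0];
  if (field === "npm" || field === "pnpm" || field === "yarn" || field === "bun") {
    return field;
  }
  for (const [file, pm] of Object.entries(LOCKFILE_MAP)) {
    if (input.lockfiles.includes(file)) return pm;
  }
  return "npm"; // final fallback
}
```

The value of the priority chain is determinism: an explicit override always beats repository evidence, and a declared packageManager field beats lockfile inference.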

22.3 DO NOT Borrow

  1. Inline Node.js in JSON (Confidence: HIGH)

    • node -e "let d='';process.stdin.on('data',c=>d+=c)..." is unreadable and untestable
    • Always use external script files instead
    • Evidence: hooks/hooks.json lines 10, 20, 30, 40
  2. Flat File Persistence (Confidence: HIGH)

    • .tmp files in ~/.claude/sessions/ with no locking is fragile
    • Use a proper database or at least SQLite for session state
    • Evidence: scripts/lib/session-manager.js
  3. Markdown-Only Configuration (Confidence: MEDIUM)

    • While markdown is LLM-friendly, it lacks type safety
    • Consider typed configuration (JSON Schema, TypeScript) with markdown documentation
    • Evidence: All agents, skills, commands are untyped markdown
  4. No-Execute Plan Confirmation (Confidence: MEDIUM)

    • The /plan -> human confirms -> /execute pattern is good for safety
    • BUT: The confirmation is purely conversational, not tracked
    • Should use structured approval with audit trail
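For item 4, a structured approval could be as simple as an append-only record keyed to a hash of the plan text, so that editing the plan invalidates the approval. The following is a hypothetical sketch; none of these names exist in ECC.

```typescript
// Hypothetical structured-approval sketch (no such API exists in ECC).
// An approval is bound to a hash of the plan text, so /execute can refuse
// to run a plan that changed after it was approved.

interface ApprovalRecord {
  planId: string;
  approver: string;
  decision: "approved" | "rejected";
  timestamp: string; // ISO-8601, forms the audit trail
  planHash: string;
}

// Length-prefixed toy hash for the sketch; a real system would use SHA-256.
function hashPlan(text: string): string {
  let h = 0;
  for (const ch of text) h = (h * 31 + ch.codePointAt(0)!) >>> 0;
  return `${text.length}:${h.toString(16)}`;
}

class ApprovalLog {
  private records: ApprovalRecord[] = [];

  approve(planId: string, planText: string, approver: string): ApprovalRecord {
    const rec: ApprovalRecord = {
      planId,
      approver,
      decision: "approved",
      timestamp: new Date().toISOString(),
      planHash: hashPlan(planText),
    };
    this.records.push(rec);
    return rec;
  }

  // Called by /execute: only run if the latest approval matches this exact text.
  isExecutable(planId: string, planText: string): boolean {
    const latest = [...this.records].reverse().find(r => r.planId === planId);
    return latest?.decision === "approved" && latest.planHash === hashPlan(planText);
  }
}
```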

22.4 Key Insights for Maestro

  1. Configuration is a product: ECC's 42K stars prove that curated configs have massive value
  2. Context window management is the #1 operational concern: ECC devotes more documentation to it than to any other topic
  3. Cross-platform matters: The bash-to-Node.js migration was driven by Windows compatibility needs
  4. Community contributions matter: The instinct import bug was fixed by a community contributor
  5. Hook duplicate issues are fragile: Plugin conventions change between versions
  6. Multi-model is aspirational: Documented but dependent on external tooling
  7. The "cascade method" for manual parallelism: Simple but effective human workflow pattern
  8. Eval-driven development: The eval harness skill maps evals to "unit tests of AI development"

Confidence: HIGH for recommendations, MEDIUM for implementation specifics


23. Cross-Links

Related Analyses

  • superpowers-deep-analysis.md

    • Section: Agent orchestration model (compare sequential pipeline vs DAG)
    • Section: Context management strategy (compare compaction approaches)
    • Section: Hook system (compare event types and execution model)
  • agent-orchestrator-deep-analysis.md

    • Section: Multi-agent coordination (compare orchestrate command vs orchestrator patterns)
    • Section: Isolation model (compare worktree approach vs container isolation)
    • Section: State management (compare flat files vs persistent database)
  • maestro-deep-analysis.md

    • Section: Workflow engine (compare ECC's linear pipeline vs Maestro's DAG)
    • Section: Cost governance (compare ECC's documentation-only approach vs automated controls)
    • Section: Team collaboration (compare single-user ECC vs multi-user Maestro)
  • harness-consensus-report.md

    • Section: Shared patterns (hooks, agents, session persistence, context management)
    • Section: Divergences (execution model, isolation, governance)
    • Section: Synthesis recommendations
  • final-harness-gap-report.md

    • Section: Cost governance gap (ECC has documentation, needs automation)
    • Section: Concurrent execution gap (ECC is sequential, needs DAG support)
    • Section: State management gap (ECC uses flat files, needs database)
    • Section: Security gap (ECC has rules, needs runtime enforcement)

Appendix A: File Inventory

Core Configuration Files

File Lines Purpose
CLAUDE.md 61 Project guidance for Claude Code
hooks/hooks.json 169 All hook definitions
.claude-plugin/plugin.json 41 Plugin manifest
package.json 89 npm package configuration
install.sh 173 Installation script

Scripts

File Lines Purpose
scripts/lib/utils.js 529 Cross-platform utilities
scripts/lib/package-manager.js 431 Package manager detection
scripts/lib/session-manager.js 442 Session CRUD
scripts/lib/session-aliases.js ~100 Session aliases
scripts/hooks/session-start.js 81 Session start hook
scripts/hooks/session-end.js 235 Session end hook
scripts/hooks/pre-compact.js 49 Pre-compaction hook
scripts/hooks/suggest-compact.js 81 Compaction suggestion
scripts/hooks/evaluate-session.js 100 Continuous learning

Agents

File Model Tools
agents/planner.md opus Read, Grep, Glob
agents/architect.md opus Read, Grep, Glob
agents/code-reviewer.md sonnet Read, Grep, Glob, Bash
agents/security-reviewer.md sonnet All
agents/tdd-guide.md sonnet Read, Write, Edit, Bash, Grep
agents/build-error-resolver.md sonnet All
agents/e2e-runner.md sonnet All
agents/refactor-cleaner.md sonnet All
agents/doc-updater.md sonnet All
agents/go-reviewer.md sonnet Read, Grep, Glob, Bash
agents/go-build-resolver.md sonnet All
agents/python-reviewer.md sonnet Read, Grep, Glob, Bash
agents/database-reviewer.md sonnet Read, Grep, Glob, Bash

Skills (44 total)

Core Workflow Skills:

  • skills/strategic-compact/SKILL.md -- Context compaction strategy
  • skills/verification-loop/SKILL.md -- 6-phase quality verification (build, type, lint, test, security, diff)
  • skills/eval-harness/SKILL.md -- Eval-driven development (EDD) with pass@k and pass^k metrics
  • skills/iterative-retrieval/SKILL.md -- 4-phase context retrieval for subagents (DISPATCH/EVALUATE/REFINE/LOOP)
  • skills/continuous-learning/SKILL.md -- Pattern extraction v1
  • skills/continuous-learning-v2/SKILL.md -- Instinct-based learning v2 with confidence scoring
  • skills/configure-ecc/SKILL.md -- Installation wizard with merge/overwrite detection
  • skills/search-first/SKILL.md -- Search before asking pattern

Language-Agnostic Development Skills:

  • skills/coding-standards/SKILL.md -- Universal coding best practices
  • skills/security-review/SKILL.md -- OWASP-based security checklist
  • skills/security-scan/SKILL.md -- Security scanning automation
  • skills/tdd-workflow/SKILL.md -- TDD methodology (RED-GREEN-REFACTOR)
  • skills/e2e-testing/SKILL.md -- End-to-end testing patterns
  • skills/api-design/SKILL.md -- API design best practices
  • skills/backend-patterns/SKILL.md -- Backend architecture patterns
  • skills/frontend-patterns/SKILL.md -- Frontend development patterns
  • skills/database-migrations/SKILL.md -- Database migration strategies
  • skills/deployment-patterns/SKILL.md -- Deployment and CI/CD patterns
  • skills/docker-patterns/SKILL.md -- Docker containerization patterns
  • skills/project-guidelines-example/SKILL.md -- Example project setup

Data & Analytics Skills:

  • skills/postgres-patterns/SKILL.md -- PostgreSQL query optimization and schema design
  • skills/clickhouse-io/SKILL.md -- ClickHouse analytics patterns
  • skills/content-hash-cache-pattern/SKILL.md -- Content-addressable caching
  • skills/cost-aware-llm-pipeline/SKILL.md -- LLM pipeline cost optimization
  • skills/regex-vs-llm-structured-text/SKILL.md -- When to use regex vs LLM for text processing
  • skills/nutrient-document-processing/SKILL.md -- Document processing patterns

Python/Django Skills:

  • skills/python-patterns/SKILL.md -- Python development patterns
  • skills/python-testing/SKILL.md -- Python testing patterns
  • skills/django-patterns/SKILL.md -- Django web framework patterns
  • skills/django-security/SKILL.md -- Django security best practices
  • skills/django-tdd/SKILL.md -- Django TDD workflow
  • skills/django-verification/SKILL.md -- Django verification patterns

Go Skills:

  • skills/golang-patterns/SKILL.md -- Go development patterns
  • skills/golang-testing/SKILL.md -- Go testing patterns

Java/Spring Boot Skills:

  • skills/java-coding-standards/SKILL.md -- Java coding standards
  • skills/springboot-patterns/SKILL.md -- Spring Boot patterns
  • skills/springboot-security/SKILL.md -- Spring Boot security
  • skills/springboot-tdd/SKILL.md -- Spring Boot TDD
  • skills/springboot-verification/SKILL.md -- Spring Boot verification
  • skills/jpa-patterns/SKILL.md -- JPA/Hibernate patterns

C++/Swift Skills:

  • skills/cpp-coding-standards/SKILL.md -- C++ coding standards
  • skills/cpp-testing/SKILL.md -- C++ testing patterns
  • skills/swift-actor-persistence/SKILL.md -- Swift actor persistence patterns
  • skills/swift-protocol-di-testing/SKILL.md -- Swift protocol DI and testing

Commands (32 total)

Command File Category
/plan commands/plan.md Planning
/orchestrate commands/orchestrate.md Planning
/multi-plan commands/multi-plan.md Planning
/multi-execute commands/multi-execute.md Planning
/multi-backend commands/multi-backend.md Planning
/multi-frontend commands/multi-frontend.md Planning
/multi-workflow commands/multi-workflow.md Planning
/tdd commands/tdd.md Development
/build-fix commands/build-fix.md Development
/e2e commands/e2e.md Development
/refactor-clean commands/refactor-clean.md Development
/pm2 commands/pm2.md Development
/code-review commands/code-review.md Review
/go-review commands/go-review.md Review
/go-build commands/go-build.md Review
/go-test commands/go-test.md Review
/python-review commands/python-review.md Review
/verify commands/verify.md Review
/eval commands/eval.md Review
/test-coverage commands/test-coverage.md Review
/learn commands/learn.md Learning
/learn-eval commands/learn-eval.md Learning
/checkpoint commands/checkpoint.md Learning
/instinct-status commands/instinct-status.md Learning
/instinct-import commands/instinct-import.md Learning
/instinct-export commands/instinct-export.md Learning
/evolve commands/evolve.md Learning
/skill-create commands/skill-create.md Learning
/update-docs commands/update-docs.md Maintenance
/update-codemaps commands/update-codemaps.md Maintenance
/sessions commands/sessions.md Maintenance
/setup-pm commands/setup-pm.md Maintenance

Rules

Directory Files Focus
rules/common/ 8 files agents.md, coding-style.md, git-workflow.md, hooks.md, patterns.md, performance.md, security.md, testing.md
rules/typescript/ 5 files coding-style.md, hooks.md, patterns.md, security.md, testing.md
rules/python/ 5 files coding-style.md, hooks.md, patterns.md, security.md, testing.md
rules/golang/ 5 files coding-style.md, hooks.md, patterns.md, security.md, testing.md
rules/README.md 1 file Rules structure documentation

Total: 23 rule files across 4 language groups, plus the README (24 files).

Documentation

File Lines Purpose
README.md 1033 Main project documentation
the-shortform-guide.md 431 Setup guide (skills, hooks, subagents, MCPs, plugins)
the-longform-guide.md 355 Advanced patterns (token economics, memory, parallelization)
CONTRIBUTING.md 425 Contribution guidelines with PR templates
hooks/README.md 199 Hook system documentation
docs/token-optimization.md 137 Token optimization guide
.claude-plugin/README.md 6 Plugin manifest gotchas
.opencode/README.md 173 OpenCode integration documentation
rules/README.md ~50 Rules structure documentation

CI/CD and Test Files

File Lines Purpose
.github/workflows/ci.yml 219 4-job CI pipeline (test, validate, security, lint)
tests/run-all.js 81 Test runner (11 test files)
scripts/ci/validate-agents.js 82 Agent frontmatter validation
scripts/ci/validate-hooks.js 149 Hook JSON schema and JS syntax validation
scripts/ci/validate-commands.js ~80 Command frontmatter validation
scripts/ci/validate-skills.js ~80 Skill SKILL.md presence validation
scripts/ci/validate-rules.js ~80 Rule markdown structure validation

Configuration Files

File Lines Purpose
schemas/hooks.schema.json 101 JSON Schema for hooks configuration
mcp-configs/mcp-servers.json 92 14 pre-configured MCP server definitions
contexts/dev.md 21 Development mode system prompt
contexts/review.md 23 Review mode system prompt
contexts/research.md ~20 Research/exploration mode system prompt

Integration Files

File Lines Purpose
.opencode/agents/ 12 agents OpenCode agent definitions
.opencode/commands/ 24 commands OpenCode command definitions
.opencode/skills/ 16 skills OpenCode skill definitions
.opencode/hooks/ 20+ events OpenCode hook definitions
.opencode/README.md 173 OpenCode integration documentation

Appendix B: Confidence Score Summary

Section Confidence Rationale
Design Philosophy HIGH Extensive documentation, clear patterns
Core Architecture HIGH All code read, well-organized
Harness Workflow HIGH (basic), MEDIUM (multi-model) Basic orchestration clear; multi-model depends on external tools
Subagent Orchestration HIGH All agent files analyzed
Parallelization MEDIUM Well-documented but not code-enforced
Isolation Model HIGH Simple model, well-understood
Human-in-the-Loop HIGH Clear patterns, some gaps identified
Context Handling HIGH Most documented aspect of project
Session Persistence HIGH All scripts analyzed
Code Quality Gates HIGH CI pipeline + tests analyzed
Security HIGH (documented), MEDIUM (enforcement) Rules strong, runtime enforcement weak
Hooks HIGH All hooks analyzed in detail
CLI/UX HIGH All 32 commands cataloged
Cost Governance HIGH for gaps Documentation-only, no automation
Tooling HIGH Full dependency inventory
External Integrations HIGH All MCP configs + IDE support analyzed
Operational Assumptions HIGH Explicit and implicit documented
Failure Modes HIGH Documented issues + predicted modes
Governance HIGH Rules and CI analyzed
Roadmap MEDIUM Based on signal interpretation
Maestro Recommendations HIGH (what), MEDIUM (how) Clear recommendations, implementation TBD

Appendix C: Version History

Version Date Key Changes
v1.4.1 Feb 2026 Fixed instinct import content loss
v1.4.0 Feb 2026 Multi-language rules, installation wizard, PM2, multi-agent commands
v1.3.0 Feb 2026 Full OpenCode integration
v1.2.0 Feb 2026 Python/Django, Java Spring Boot, session management, continuous learning v2

End of analysis. Total files analyzed: 60+ across all directories. All file paths are absolute references to /tmp/ai-harness-repos/everything-claude-code/.

Final Harness Gap Report: Maestro vs. the Canonical Feature Set

Report Date: 2026-02-22
Analyst: Claude Opus 4.6
Scope: Gap analysis of RunMaestro/Maestro against best-in-class features from obra/superpowers, affaan-m/everything-claude-code, and ComposioHQ/agent-orchestrator
Source Reports:

  • maestro-deep-analysis.md (2005 lines)
  • superpowers-deep-analysis.md (2005 lines)
  • everything-claude-code-deep-analysis.md (2141 lines)
  • agent-orchestrator-deep-analysis.md (2806 lines)

Executive Summary

Maestro is the most ambitious and fully-realized project in the comparison set -- a cross-platform Electron desktop application with CLI, mobile PWA, multi-provider agent support, SQLite-backed analytics, and Group Chat orchestration. It is the only project with a runtime execution engine (ProcessManager), a desktop GUI, and multi-provider support (4 active agents, 3 planned). Its architecture is sound, its codebase is large (672K lines of TypeScript), and its feature set is broad.

However, Maestro has critical gaps in three areas that the other projects address:

  1. No automated quality gates in the execution loop (Critical). Maestro's Auto Run processes checkbox tasks sequentially but never runs tests, lints, or code review between steps. Superpowers' two-stage code review and ECC's six-phase verification loop are both superior here.

  2. No cost governance enforcement (High). Maestro tracks costs in its SQLite dashboard but provides no budgets, alerts, or automatic pause when spending exceeds thresholds. All three comparison projects share this gap, but Maestro -- as the only project with runtime cost data -- is uniquely positioned to solve it.

  3. No reaction engine or lifecycle state machine (High). Agent Orchestrator's 16-status state machine with configurable reactions (event -> action, with retries and escalation) is a fundamentally more sophisticated approach to agent lifecycle management than Maestro's binary "busy/idle" model.

Priority recommendation: Phase 1 should focus on quality gates (leveraging Superpowers' prompt patterns within Maestro's runtime enforcement), cost budgets, and a basic reaction engine. These three changes would close the most impactful gaps with moderate effort.


1. Maestro's Current Strengths

Maestro leads the comparison set in several areas where it should preserve and double down on its advantage.

1.1 Multi-Provider Agent Support (Unmatched)

No other project supports 4+ AI coding agent CLIs through a unified interface. Maestro's declarative agent definition architecture (AgentConfig with binaryName, args, batchModeArgs, resumeArgs, etc.) and output parser registry pattern allow adding new agents without modifying core logic.

Source: maestro-deep-analysis.md, Section 17.1-17.3 (Provider Architecture), Section 3.3 (Output Parser Architecture)

Evidence: Agent definitions in src/main/agents/definitions.ts (367 lines), 4 output parser implementations, 5 error pattern sets with ~100 individual patterns.
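As a sketch of what that declarative pattern enables — using the field names quoted above (binaryName, args, batchModeArgs, resumeArgs), with the flag values and registry shape assumed rather than taken from Maestro's definitions.ts — adding a provider becomes a data change, not a code change:

```typescript
// Sketch of the declarative agent-definition pattern. Field names follow
// the ones quoted in the analysis; everything else is an assumption.

interface AgentConfig {
  id: string;
  binaryName: string;
  args: string[];                              // interactive invocation
  batchModeArgs: string[];                     // non-interactive (Auto Run)
  resumeArgs: (sessionId: string) => string[]; // provider-specific resume
}

const AGENTS: Record<string, AgentConfig> = {
  "claude-code": {
    id: "claude-code",
    binaryName: "claude",
    args: [],
    batchModeArgs: ["-p"],
    resumeArgs: (id) => ["--resume", id],
  },
  // A new provider is one more entry here; core spawn logic is untouched.
};

function buildCommand(
  agentId: string,
  opts: { batch?: boolean; resume?: string },
): string[] {
  const cfg = AGENTS[agentId];
  if (!cfg) throw new Error(`unknown agent: ${agentId}`);
  const extra = opts.resume
    ? cfg.resumeArgs(opts.resume)
    : opts.batch
      ? cfg.batchModeArgs
      : cfg.args;
  return [cfg.binaryName, ...extra];
}
```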

1.2 Desktop Application with Keyboard-First Design (Unmatched)

Superpowers is invisible (CLI-only), ECC is configuration-only, and Agent Orchestrator has a basic Next.js dashboard. Maestro provides a full Electron desktop app with 30+ keyboard shortcuts, a Layer Stack modal system with 30+ priority levels, ARIA accessibility, and a comprehensive theme system (16 themes). No other project approaches this level of UI sophistication.

Source: maestro-deep-analysis.md, Section 14 (CLI/UX and Automation Ergonomics)

1.3 Group Chat with Moderator AI (Unmatched)

Maestro's Group Chat system is the most sophisticated multi-agent coordination mechanism in the comparison set. The moderator-agent pattern (user message -> moderator routing -> parallel agent work -> synthesis round -> optional follow-up loop) is architecturally superior to Agent Orchestrator's orchestrator-as-CLI-user approach and Superpowers' sequential-subagent-only model.

Source: maestro-deep-analysis.md, Section 5.1 (Group Chat System), files: src/main/group-chat/group-chat-moderator.ts (290 lines), group-chat-agent.ts (429 lines), group-chat-router.ts

1.4 SQLite Analytics and Usage Dashboard (Best-in-Class)

Maestro's StatsDB system (833 lines of SQLite management code) with daily backups, corruption recovery, WAL mode, integrity checking, and migration tracking is production-grade. The Usage Dashboard provides summary cards, agent comparison charts, activity heatmaps, and CSV export. No other project has anything comparable.

Source: maestro-deep-analysis.md, Section 15.2-15.3 (Usage Dashboard, Stats Database Architecture)

1.5 Session Discovery and Resume Across Providers (Best-in-Class)

Maestro can discover existing sessions from Claude Code, Codex, OpenCode, and Factory Droid session storage directories, and resume any of them with provider-specific flags. This cross-provider session management is unique.

Source: maestro-deep-analysis.md, Section 10.3-10.4 (Session Discovery, Session Resume)

1.6 Mobile Remote Control (Unmatched)

PWA with WebSocket + Cloudflare tunnel, voice input, swipe gestures, offline queue, and push notifications. No other project has any mobile capability.

Source: maestro-deep-analysis.md, Section 14.2 (Mobile UX)

1.7 Error Pattern System (Best-in-Class)

1015 lines of regex-based error detection covering 7 error types across 4 agents plus SSH, with recoverability flags and dynamic error messages. This is the most comprehensive error detection system in the comparison set.

Source: maestro-deep-analysis.md, Section 3.4 (Error Pattern System)

1.8 Symphony Community Contribution Platform (Unique)

No other project has a mechanism for community-driven open source contribution through the tool itself. Symphony's registry + Auto Run + PR creation pipeline is novel.

Source: maestro-deep-analysis.md, Section 5.2 (Symphony Orchestration)


2. Gap Analysis Matrix

Feature Area Maestro Status Best-in-Class Project Gap Severity Source
Orchestration quality gates No automated verification between Auto Run tasks Superpowers: Two-stage code review (spec compliance + quality) after EACH task with review loops Critical superpowers S11.1-11.3, maestro S4.6
Verification pipeline None (agent self-reports completion) ECC: 6-phase verification loop (build, type, lint, test, security, diff) with structured PASS/FAIL report Critical ecc S4.5, maestro S4.6
Session lifecycle state machine Binary busy/idle with color-coded dots AO: 16-state machine (spawning -> working -> pr_open -> ci_failed -> review_pending -> approved -> mergeable -> merged -> done) High ao S9.1-9.2, maestro S10.5
Reaction engine None AO: Configurable event->action rules with retries, escalation, conditions, and time-based triggers High ao S12.1-12.6, maestro S13.5
Anti-rationalization engineering None (agents run prompts as-given) Superpowers: 40+ rationalization prevention entries, red flag lists, gate functions, pressure testing methodology High superpowers S11.5-11.6, maestro S4.5
Cost governance (budgets/limits) Tracking only (SQLite dashboard) None has enforcement, but Maestro has the data infrastructure to build it High maestro S15.5, all projects lack enforcement
Security scanning in CI No SAST, no dependency audit in CI AO: Gitleaks (full history), dependency-review-action (moderate+), pnpm audit (high/prod). ECC: npm audit High ao S11.3, ecc S11.1, maestro S11.8
CI testing before release Release workflow only builds, does not run tests ECC: 33-combination CI matrix (3 OS x 3 Node x 4 PM), component validation, ESLint, markdownlint. AO: lint + typecheck + test High ecc S11.1, ao S10.1, maestro S11.8
Hooks lifecycle system No hook system for agent tools ECC: 6-event hook lifecycle (PreToolUse, PostToolUse, PreCompact, SessionStart, SessionEnd, Stop) with blocking/warning/pass-through modes Medium ecc S13.1-13.3, maestro S13
Context compaction automation Manual only (user triggers) ECC: Tool-call-counting hook with configurable thresholds + phase-transition compaction guide. Superpowers: Auto re-inject on compact event Medium ecc S9.2, superpowers S9.5, maestro S9.6
Agent tool scoping All agents run with full privileges (YOLO mode) ECC: Read-only agents for planning (tools: Read, Grep, Glob), full-access for implementation. Principle of least privilege Medium ecc S5.2, maestro S17.3
Notification system Desktop notifications only (no external channels) AO: Desktop + Slack (Block Kit) + Composio + Webhook, with priority-based routing Medium ao S7.4, maestro S13
Issue tracker integration None (manual task creation via Auto Run docs) AO: GitHub Issues + Linear trackers with issue-to-branch-to-PR pipeline Medium ao S16.1-16.2, maestro S21.4
Session persistence across crashes In-memory Group Chat state lost on crash. Electron-store with 2s debounce for settings AO: Flat-file metadata survives crashes. Session restoration from archive with agent-specific resume Medium ao S9.4-9.5, maestro S10.2, S19.5
Plugin architecture Encore Features (feature gating, not a full plugin system) AO: 8-slot plugin architecture with typed PluginManifest, PluginModule, and registry pattern Medium ao S2.2-2.3, maestro S21.3
REST/webhook API No external API (CLI + IPC only) AO: Next.js API routes for sessions, events (SSE), sends, kills, merges, restores Medium ao S7.2, maestro S13.5
Continuous learning None ECC: Instinct system with confidence scoring, session evaluation, pattern extraction, evolved skills/commands/agents Low ecc S10.5, maestro not present
Multi-language rules/skills Provider-agnostic (no language-specific prompts) ECC: Rules for TS, Python, Go, Java, C++, Swift. 44 skills covering language-specific patterns Low ecc Appendix A Skills, maestro S4.5
Web dashboard for monitoring Mobile PWA (read/control) AO: Kanban-style dashboard with attention levels, dynamic favicon, SSE real-time updates Low ao S7.1, maestro S14.2
Provider extensibility (local models) Only CLI agents, no direct API support ECC: Planned multi-model (Codex, Gemini). Superpowers: Codex, OpenCode mapping Low ecc S17.4, maestro S17.6

3. Detailed Gap Descriptions

3.1 GAP: No Automated Quality Gates in Auto Run (Critical)

Current state in Maestro:

The Auto Run batch processor (maestro S4.4, file: src/cli/services/batch-processor.ts) processes checkbox tasks sequentially. For each document, it reads unchecked tasks, constructs a prompt, spawns the AI agent, parses the response for checked tasks, and moves to the next. There is NO verification step between tasks.

From maestro S4.6: "There is NO automatic verification layer (no test runner, no linter integration, no code review step). The verification is the agent's own assessment that it completed the work. ... The agent could check off a task without actually completing it successfully."

What best-in-class does:

Superpowers (subagent-driven-development): After EACH task, two sequential review subagents are dispatched:

  1. Spec Compliance Reviewer -- explicitly told "The implementer finished suspiciously quickly. Their report may be incomplete, inaccurate, or optimistic." Must NOT trust the implementer's report. Reads actual code and compares to requirements line by line.
  2. Code Quality Reviewer -- dispatched ONLY after spec compliance passes. Reviews code quality, architecture, testing, production readiness. Issues categorized Critical/Important/Minor.

Both reviews are loops -- if issues found, implementer fixes, reviewer re-reviews until approved. (superpowers S4.5, S5.4)

ECC (verification-loop): Six verification phases with structured PASS/FAIL output:

  1. Build Verification (npm run build)
  2. Type Check (tsc --noEmit)
  3. Lint Check (npm run lint)
  4. Test Suite (with coverage, target 80%)
  5. Security Scan (grep for secrets and console.log)
  6. Diff Review (review changed files)

Output is a structured VERIFICATION REPORT with READY/NOT READY verdict. (ecc S4.5)

What Maestro should implement:

A configurable quality gate system that runs between Auto Run tasks. The gate should support:

  • Built-in gates: test runner, linter, type checker, security scanner
  • Custom gates: user-defined commands that must exit 0 to proceed
  • Review gates: dispatch a review subagent (using Superpowers' skepticism pattern)
  • Failure behavior: pause (wait for human), retry (send error to agent), skip (log and continue), abort (stop batch)

Implementation complexity: Medium. The batch processor already has the sequential processing loop. Adding gate hooks between task iterations requires:

  • A QualityGate interface with run(context): Promise<GateResult>
  • Gate configuration in Playbook definitions
  • Integration with the existing execution queue

Dependencies: Benefits from reaction engine (Gap 3.2) for failure handling.


3.2 GAP: No Reaction Engine or Lifecycle State Machine (High)

Current state in Maestro:

Maestro tracks agent state as color-coded dots: green (ready/idle), yellow (thinking/busy), red (no connection/error), pulsing orange (connecting). From maestro S10.5. The agentError and agentErrorPaused fields handle error states, but there is no formal state machine with defined transitions, and no configurable reactions to state changes.

When agents encounter errors, a modal appears requiring manual user acknowledgment (maestro S8.4). There is no automated response to events like CI failure, rate limiting, or context exhaustion.

What best-in-class does:

Agent Orchestrator implements a 16-state machine (ao S9.1) with a reaction engine (ao S12.1-12.6):

spawning -> working -> pr_open -> ci_failed/review_pending
-> changes_requested/approved -> mergeable -> merged -> done

The reaction engine maps events to configurable actions with retries and time-based escalation:

reactions:
  ci-failed:
    trigger: ci.failing
    action: send-to-agent
    retries: 2
    escalation:
      action: notify
      after: "10m"
      priority: critical

33 distinct event types trigger reactions. Default reactions cover CI failure, code review feedback, merge conflicts, stuck agents, and agent exits.

What Maestro should implement:

A SessionStateMachine class that tracks each agent's lifecycle through defined states, and a ReactionEngine that maps state transitions to configurable actions. Given Maestro already has rich error detection (error-patterns.ts) and event emission (ProcessManager EventEmitter), this is a natural extension.

Implementation complexity: Medium-High. Requires:

  • State machine definition with valid transitions
  • Reaction configuration format (YAML or JSON in Playbooks)
  • Action executors (send message, pause, restart, notify, escalate)
  • Reaction history logging in StatsDB
  • UI for viewing/editing reaction rules

Dependencies: None. This is foundational infrastructure.
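
To make the transition diagram above concrete, here is a minimal sketch of such a state machine. The state names follow AO's model; the SessionStateMachine class and its method names are hypothetical Maestro additions, and a real implementation would add more states, persistence, and transition logging:

```typescript
// Minimal lifecycle state machine sketch (state vocabulary from AO's model;
// the class itself is a hypothetical Maestro addition, not existing code).
type SessionState =
  | "spawning" | "working" | "pr_open" | "ci_failed" | "review_pending"
  | "changes_requested" | "approved" | "mergeable" | "merged" | "done";

const transitions: Record<SessionState, SessionState[]> = {
  spawning: ["working"],
  working: ["pr_open"],
  pr_open: ["ci_failed", "review_pending"],
  ci_failed: ["working"],                        // agent fixes CI, resumes work
  review_pending: ["changes_requested", "approved"],
  changes_requested: ["working"],
  approved: ["mergeable"],
  mergeable: ["merged"],
  merged: ["done"],
  done: [],
};

class SessionStateMachine {
  constructor(private state: SessionState = "spawning") {}
  get current(): SessionState { return this.state; }
  transition(next: SessionState): boolean {
    // Reject transitions not in the table rather than throwing, so the
    // caller (e.g. a reaction engine) can decide how to handle them.
    if (!transitions[this.state].includes(next)) return false;
    this.state = next;
    return true;
  }
}
```

An explicit transition table like this also gives the reaction engine a natural hook point: every accepted transition is an event that rules can subscribe to.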


3.3 GAP: No Anti-Rationalization Engineering in Prompts (High)

Current state in Maestro:

Maestro bundles 24 system prompts as markdown files (maestro S4.5). These prompts tell agents what to do but do not address the well-documented problem of agents rationalizing around constraints. The autorun-default.md prompt instructs the agent to process checkbox tasks, but does not include rationalization prevention tables, red flag lists, or gate functions.

From maestro S2.3: CLAUDE.md contains behavioral guidelines like "Surface Assumptions Early" and "Push Back When Warranted," but these are meta-guidelines for Maestro's own development, not runtime behavioral controls for orchestrated agents.

What best-in-class does:

Superpowers has invested more iteration into anti-rationalization engineering than any other project (superpowers S11.5-11.6):

  • 40+ rationalization prevention entries across all skills, mapping specific agent excuses to correct responses
  • Red flag lists: 12 entries in using-superpowers, 12 in TDD, 8 in verification
  • Gate functions: Explicit decision trees before actions (IDENTIFY -> RUN -> READ -> VERIFY -> CLAIM)
  • Pressure testing methodology: 7 pressure types (time, sunk cost, authority, economic, exhaustion, social, pragmatic) used to validate that quality gates actually work under stress
  • Persuasion principles: Academic foundation (Cialdini 2021) applied to prompt design

Key insight from superpowers S2.3: Skill descriptions containing workflow summaries cause agents to follow the short description instead of reading full skill content (the "Description Trap"). This is directly applicable to Maestro's prompt templates.

What Maestro should implement:

Incorporate anti-rationalization patterns into Maestro's system prompts, particularly:

  • Rationalization tables in autorun-default.md for common task-skipping excuses
  • Verification-before-completion gate function in the Auto Run prompt
  • Skepticism pattern in any review prompts ("The agent finished suspiciously quickly")
  • Description-only triggers (no workflow summaries) in any prompt routing metadata

Implementation complexity: Low. This requires only prompt text changes, not code changes. However, proper validation requires the TDD-for-prompts methodology (writing pressure tests to verify the prompts actually prevent rationalization), which is Medium effort.

Dependencies: None.


3.4 GAP: No Cost Governance Enforcement (High)

Current state in Maestro:

Maestro tracks costs comprehensively in its SQLite database (maestro S15.1-15.3): per-session token usage with inputTokens, outputTokens, cacheReadInputTokens, cacheCreationInputTokens, totalCostUsd, and contextWindow. The Usage Dashboard provides summary cards, agent comparison charts, and CSV export.

However, from maestro S15.5: "No cost budgets or limits (tracking only, no enforcement). No alerts when spending exceeds thresholds. No per-playbook cost attribution. No team/organization cost aggregation."

What best-in-class does:

No project in the comparison set implements cost enforcement. All four projects share this gap. However:

  • Superpowers documents cost awareness in skill text (superpowers S15.2) and provides post-hoc cost analysis via analyze-token-usage.py
  • ECC documents token optimization settings (ecc S9.1) but relies on Claude's built-in /cost command
  • AO extracts cost from JSONL but does not display it in CLI or dashboard (ao S14.3)

What Maestro should implement:

Since Maestro already has the data infrastructure (StatsDB with real-time cost tracking), it should add:

  • Per-playbook budget limits in Playbook configuration
  • Per-agent session budget limits in agent configuration
  • Cost threshold alerts (notification when 80% of budget consumed)
  • Automatic pause when budget exceeded (with option to override)
  • Per-task cost attribution (extend auto_run_tasks table with cost columns)
  • Cost estimation before playbook execution (based on historical data)

Implementation complexity: Medium. The data layer exists; this requires:

  • Budget fields in Playbook and AgentConfig interfaces
  • Cost checking in the batch processor loop
  • Alert/pause integration with the execution queue
  • UI for budget configuration and alerts

Dependencies: None. Can be built on existing StatsDB infrastructure.


3.5 GAP: No Security Scanning in CI (High)

Current state in Maestro:

From maestro S11.8: "No linting or testing in CI before release (the release workflow only builds, doesn't run tests). No required CI checks before merge. No code coverage thresholds. No security scanning (no SAST, no dependency audit in CI)."

The release workflow (release.yml, 782 lines) focuses entirely on cross-platform build correctness and native module architecture verification.

What best-in-class does:

Agent Orchestrator has the most comprehensive CI security (ao S11.3):

  • Gitleaks with --full-history (scans entire git history for secrets)
  • dependency-review-action failing on moderate+ vulnerabilities
  • pnpm audit at high severity for production dependencies

ECC has structural validation plus security scanning (ecc S11.1):

  • 33-combination test matrix (3 OS x 3 Node x 4 PM)
  • Component validation (agents, hooks, commands, skills, rules)
  • npm audit (warning-only, not blocking)
  • ESLint + markdownlint

What Maestro should implement:

Add CI pipeline stages before the release build:

  1. TypeScript type checking (tsc --noEmit for all 3 tsconfig files)
  2. ESLint + Prettier check (already configured locally via Husky, not in CI)
  3. Vitest test suite (all 4 configurations: unit, integration, e2e, performance)
  4. Secret scanning (Gitleaks or TruffleHog)
  5. Dependency vulnerability audit (npm audit --audit-level=high)
  6. Code coverage threshold (minimum 60% for critical paths)

Implementation complexity: Low-Medium. Most tooling is already configured locally; it just needs CI integration.

Dependencies: None.


3.6 GAP: No Hook System for Agent Tool Use (Medium)

Current state in Maestro:

Maestro has no hook system that intercepts agent tool calls before or after execution. The ProcessManager emits events (data, stderr, exit, usage, agent-error, etc.) but these are observation-only. There is no mechanism to block or modify agent operations.

What best-in-class does:

ECC implements a comprehensive hook lifecycle (ecc S13.1-13.3):

  • PreToolUse (5 hooks): Can block (exit 2), warn (stderr), or pass. Examples: block dev server outside tmux, warn before git push, block creation of random .md files
  • PostToolUse (5 hooks): Auto-format with Prettier, TypeScript checking, console.log detection
  • PreCompact: Save state before compaction
  • SessionStart/End: Context loading, session persistence, pattern extraction
  • Stop: Console.log audit

The fail-safe pattern (exit 0 on error) ensures hooks never break the main flow. The blocking vs. warning distinction is clean and well-tested.

What Maestro should implement:

Since Maestro already spawns AI agents as child processes and parses their output (via the output parser architecture), it should add:

  • Pre/post tool execution hooks that run before/after the agent processes tool calls
  • Configurable hook definitions (per-playbook or per-agent)
  • Hook types: blocking (prevent operation), warning (log but continue), enrichment (add context)
  • Built-in hooks: auto-format, type check, security scan, diff review

Implementation complexity: High. The hook system requires intercepting and potentially blocking agent tool execution, which may require modification to how Maestro interacts with agent processes. Since Maestro uses pass-through process management (not API-level agent control), this is architecturally constrained.

Alternative approach (Medium): Instead of intercepting tool calls in real-time, implement post-execution hooks that run after each agent query completes (leveraging the existing query-complete event). This is achievable within the current architecture.

Dependencies: Quality gates (Gap 3.1) can be implemented as post-execution hooks.
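
ECC's exit-code convention (exit 0 = allow, exit 2 = block, stderr with exit 0 = warn) can be illustrated with a small verdict function. The helper name and the example rules below are illustrative assumptions, not ECC's actual hook code:

```typescript
// Sketch of ECC's PreToolUse exit-code convention as a pure function.
// exit 0 = allow, exit 2 = block, stderr text with exit 0 = warn.
// The rules here mirror two ECC examples; names are assumptions.
type HookVerdict = { exitCode: 0 | 2; stderr?: string };

function preToolUseVerdict(toolName: string, command: string): HookVerdict {
  // Block long-running dev servers outside tmux (ECC example).
  if (toolName === "Bash" && /(npm run dev|vite|next dev)/.test(command)) {
    return { exitCode: 2, stderr: "Blocked: start dev servers inside tmux." };
  }
  // Warn (but allow) before git push (ECC example).
  if (toolName === "Bash" && /\bgit push\b/.test(command)) {
    return { exitCode: 0, stderr: "Warning: pushing to a remote." };
  }
  return { exitCode: 0 };
}
```

Keeping the decision logic in a pure function like this (with a thin process wrapper that prints stderr and calls process.exit) is what makes ECC-style hooks testable, unlike inline node -e snippets.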


3.7 GAP: No Issue Tracker Integration (Medium)

Current state in Maestro:

Maestro's workflow starts with manually created Auto Run documents. There is no integration with issue trackers (GitHub Issues, Linear, Jira) to automatically create tasks from issues or track which issues are being worked on.

Symphony provides a limited form of issue integration (fetching GitHub Issues with runmaestro.ai labels for community contributions), but this is for the Symphony contribution pipeline, not general development workflows.

What best-in-class does:

Agent Orchestrator has deep issue tracker integration (ao S16.1-16.2):

  • GitHub Issues tracker (304 lines): Issue CRUD, listing with filters, branch name generation from issue numbers, prompt generation from issue content
  • Linear tracker (722 lines): Dual transport (direct API or Composio SDK), state mapping, full GraphQL API, issue/label/team/workflow operations
  • Automatic pipeline: Issue -> Workspace -> Agent -> PR -> Review -> Merge -> Cleanup

What Maestro should implement:

Add an issue tracker plugin system that:

  • Fetches issues from GitHub/Linear/Jira
  • Auto-generates Auto Run documents from issue descriptions
  • Links agent sessions to issues for tracking
  • Updates issue status as work progresses
  • Creates PRs with issue references

Implementation complexity: Medium-High. Requires new IPC handlers, UI components for issue browsing, and integration with the Auto Run document system.

Dependencies: Benefits from reaction engine (Gap 3.2) for automated status updates.
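
A minimal sketch of what such a tracker plugin could look like, with a helper that turns an issue into an Auto Run checkbox document. All names here are assumptions for illustration, not existing Maestro or AO APIs:

```typescript
// Hypothetical issue tracker plugin interface for Maestro (all names assumed).
interface Issue {
  id: string;
  title: string;
  body: string;
  labels: string[];
  url: string;
}

interface IssueTracker {
  name: "github" | "linear" | "jira";
  listIssues(filter: { label?: string; assignee?: string }): Promise<Issue[]>;
  updateStatus(issueId: string, status: "in_progress" | "in_review" | "done"): Promise<void>;
}

// Auto-generate an Auto Run document from an issue: title becomes the
// heading, the first body line becomes the initial checkbox task.
function issueToAutoRunDoc(issue: Issue): string {
  return [
    `# ${issue.title}`,
    ``,
    `Issue: ${issue.url}`,
    ``,
    `- [ ] ${issue.body.split("\n")[0] || "Implement the issue as described"}`,
  ].join("\n");
}
```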


4. Prioritized Roadmap

Phase 1: Quick Wins (Low effort, High impact) -- Weeks 1-4

4.1.1 Anti-Rationalization Prompts

What to build: Incorporate Superpowers' rationalization prevention patterns into Maestro's bundled prompts (src/prompts/autorun-default.md, src/prompts/group-chat-moderator-system.md, and others).

Reference model: Superpowers' verification-before-completion/SKILL.md (gate function pattern) and subagent-driven-development/spec-reviewer-prompt.md (skepticism pattern).

Estimated scope: Modify 5-8 prompt markdown files. Add rationalization tables, verification gate functions, and description-trap-aware descriptions. No code changes required.

Dependencies: None.

4.1.2 CI Security and Testing Pipeline

What to build: Add testing, linting, type checking, secret scanning, and dependency auditing to the GitHub Actions release workflow.

Reference model: Agent Orchestrator's .github/workflows/security.yml (Gitleaks + dependency-review-action + npm audit) and ECC's .github/workflows/ci.yml (multi-OS/Node/PM test matrix).

Estimated scope: 1 new workflow file (~150 lines YAML). Configure existing Vitest + ESLint + Prettier tools to run in CI.

Dependencies: None.

4.1.3 Cost Budget Configuration

What to build: Add per-playbook and per-agent budget limits with automatic pause on budget exhaustion. Extend StatsDB with budget tracking columns.

Reference model: No existing project has this. Maestro's existing StatsDB infrastructure (src/main/stats/stats-db.ts) provides the foundation.

Estimated scope: Modify Playbook interface to add budgetUsd field. Add budget checking in useBatchProcessor.ts / batch-processor.ts. Add UI controls for budget setting. ~300-500 lines of new code.

Dependencies: None. Builds on existing StatsDB.


Phase 2: Core Infrastructure (Medium effort, Critical for maturity) -- Weeks 5-12

4.2.1 Quality Gate System

What to build: A configurable quality gate framework that runs between Auto Run tasks, supporting built-in gates (test, lint, type check) and custom gates (user-defined commands).

Reference model: Superpowers' two-stage review pattern (spec compliance then code quality) for the review gate architecture. ECC's verification-loop skill for the built-in gate definitions.

Estimated scope: New QualityGate interface and QualityGateRunner service (~400 lines). Modify batch processor to invoke gates between tasks (~150 lines). Gate configuration UI in Playbook editor (~300 lines). Total: ~850 lines.

Dependencies: Anti-rationalization prompts (4.1.1) improve gate effectiveness.

4.2.2 Reaction Engine

What to build: A configurable event-to-action mapping system with retries, time-based escalation, and condition predicates. Integrate with ProcessManager events and agent error patterns.

Reference model: Agent Orchestrator's reaction engine (lifecycle-manager.ts, approximately lines 250-330) with YAML-based reaction configuration.

Estimated scope: New ReactionEngine class (~500 lines). Reaction configuration in settings/playbooks (~200 lines). Default reaction set (~100 lines). UI for reaction management (~400 lines). Total: ~1200 lines.

Dependencies: None, but complements quality gates (4.2.1) and notification system (4.2.3).

4.2.3 External Notification Channels

What to build: Slack, webhook, and enhanced desktop notification support with priority-based routing.

Reference model: Agent Orchestrator's notifier-slack/src/index.ts (Block Kit formatting) and notifier-desktop/src/index.ts (platform-specific implementations).

Estimated scope: New NotificationRouter service (~200 lines). Slack plugin (~150 lines). Webhook plugin (~100 lines). Notification routing configuration (~100 lines). Total: ~550 lines.

Dependencies: Reaction engine (4.2.2) triggers notifications.
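
Priority-based routing can start as a simple lookup table. The channel and priority names below mirror the reaction-engine types elsewhere in this report, but the routing defaults are assumptions:

```typescript
// Sketch of priority-based notification routing (defaults are assumptions:
// escalate to more channels as priority rises).
type Priority = "normal" | "high" | "critical";
type Channel = "desktop" | "slack" | "webhook";

const routing: Record<Priority, Channel[]> = {
  normal: ["desktop"],
  high: ["desktop", "slack"],
  critical: ["desktop", "slack", "webhook"],
};

function channelsFor(priority: Priority): Channel[] {
  return routing[priority];
}
```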

4.2.4 Agent Session State Machine

What to build: Replace binary busy/idle state tracking with a formal state machine tracking agent sessions through defined lifecycle states (spawning -> working -> reviewing -> pr_open -> ci_checking -> approved -> merged -> done).

Reference model: Agent Orchestrator's SessionStatus enum with 16 states and determineStatus() algorithm.

Estimated scope: New SessionStateMachine class (~400 lines). State-to-UI mapping (~100 lines). State transition logging to StatsDB (~150 lines). Total: ~650 lines.

Dependencies: None, but integrates with reaction engine (4.2.2).


Phase 3: Advanced Capabilities (High effort, Differentiating) -- Weeks 13-24

4.3.1 Plugin Architecture

What to build: Formalize the Encore Features system into a full plugin API with lifecycle hooks, manifest format, typed interfaces, and isolation.

Reference model: Agent Orchestrator's 8-slot plugin system with PluginManifest + PluginModule pattern and type-safe registry.

Estimated scope: Plugin manifest format and loader (~500 lines). Plugin lifecycle management (install, enable, disable, uninstall) (~400 lines). Plugin isolation and sandboxing (~300 lines). Plugin marketplace UI (~600 lines). Total: ~1800 lines.

Dependencies: Reaction engine (4.2.2) and quality gates (4.2.1) become first-party plugins.

4.3.2 Issue Tracker Integration

What to build: GitHub Issues and Linear integration with auto-generation of Auto Run documents from issues, bidirectional status sync, and issue-to-PR pipeline.

Reference model: Agent Orchestrator's tracker-github/src/index.ts (304 lines) and tracker-linear/src/index.ts (722 lines).

Estimated scope: Tracker plugin interface (~200 lines). GitHub Issues plugin (~400 lines). Linear plugin (~600 lines). Auto Run document generation from issues (~300 lines). UI for issue browsing and assignment (~500 lines). Total: ~2000 lines.

Dependencies: Plugin architecture (4.3.1) for clean modularity.

4.3.3 REST/Webhook API

What to build: HTTP API for external automation, CI/CD integration, and custom dashboards. Expose key operations: session management, playbook execution, status queries, cost data.

Reference model: Agent Orchestrator's Next.js API routes for sessions, events (SSE), sends, kills, merges, restores. Extend Maestro's existing Fastify web server.

Estimated scope: API route definitions (~600 lines). Authentication middleware (~200 lines). SSE event streaming (~200 lines). API documentation (~300 lines). Total: ~1300 lines.

Dependencies: Session state machine (4.2.4) for rich status data.

4.3.4 Continuous Learning System

What to build: Session evaluation at end-of-session to extract patterns, an instinct/learned-skill store, and automatic prompt refinement based on observed agent behavior.

Reference model: ECC's continuous learning v2 system (skills/continuous-learning-v2/SKILL.md) with atomic instincts, confidence scoring, evidence-backed patterns, and evolution from instincts to skills/commands/agents.

Estimated scope: Session evaluation service (~400 lines). Instinct store (SQLite table + CRUD) (~300 lines). Pattern extraction prompts (~200 lines). Instinct-to-prompt integration (~200 lines). UI for instinct management (~400 lines). Total: ~1500 lines.

Dependencies: StatsDB for instinct storage. Quality gates (4.2.1) provide data for pattern extraction.


5. What NOT to Adopt

5.1 Agent-as-Orchestrator Model (from Superpowers)

Superpowers makes the AI agent itself the orchestrator, guided only by markdown skill documents. This is elegant but fundamentally limits enforcement (all rules are advisory), observability (no runtime metrics), recovery (no checkpoint/restore), and reproducibility (agent behavior varies).

Why NOT: Maestro already has a runtime orchestrator (ProcessManager, batch processor, Group Chat moderator). The advisory-only enforcement model is the #1 limitation of Superpowers. Maestro should use runtime code for enforcement and prompts for guidance, not prompts for everything.

Source: superpowers S22.3 item 12 ("Agent-as-Orchestrator Model"), superpowers S1 ("advisory-only enforcement model...fundamental limitations")

5.2 Zero-Persistence Design (from Superpowers)

Superpowers has NO persistence mechanism: no session state, no database, no progress tracking across sessions. Git commits are the only durable artifact.

Why NOT: Maestro already has SQLite-backed analytics, electron-store session persistence, and Group Chat JSONL logs. Losing persistence would be a regression.

Source: superpowers S10.2, S22.3 item 13

5.3 Flat-File Metadata (from Agent Orchestrator)

AO uses key=value text files for session metadata. This has no atomicity guarantees, no schema evolution, no query capability, and race conditions between writers.

Why NOT: Maestro already uses SQLite (StatsDB) and electron-store (JSON). Both are superior to flat files. The flat-file approach was a pragmatic choice for AO's v1, not a design to emulate.

Source: ao S21.3.1 ("The key=value text file approach is too fragile for production")

5.4 tmux as Primary Runtime (from Agent Orchestrator)

AO couples tightly to tmux for process isolation. This limits Windows support, makes message passing fragile, and provides no structured communication channel.

Why NOT: Maestro already uses node-pty for terminal emulation and child_process.spawn for batch mode, supporting macOS, Linux, and Windows. The tmux dependency would reduce portability.

Source: ao S21.3.2

5.5 Polling-Based Lifecycle (from Agent Orchestrator)

AO polls every 30 seconds. This introduces up to 30-second latency for state change detection and wastes resources during idle periods.

Why NOT: Maestro already has an event-driven architecture (ProcessManager EventEmitter with real-time events). Polling would be a regression.

Source: ao S1.2 item 4, ao S21.3.3

5.6 Inline Node.js in JSON Hooks (from ECC)

ECC uses node -e "..." for simple hooks embedded in hooks.json. This is unreadable, untestable, and has no source maps.

Why NOT: Any hook system Maestro implements should use external script files or TypeScript modules, not inline code in JSON.

Source: ecc S22.3 item 1

5.7 Windows Polyglot Wrapper (from Superpowers)

Superpowers' cmd/bash polyglot script (run-hook.cmd) is clever but fragile and has caused numerous cross-platform issues across multiple versions.

Why NOT: Maestro already uses Node.js for cross-platform compatibility. A polyglot bash/cmd approach would add fragility.

Source: superpowers S22.3 item 16


6. Implementation-Ready Recommendations

6.1 Quality Gate System (Top Priority)

Files to create/modify:

  • Create: src/shared/types/quality-gate.ts -- Gate interface and result types
  • Create: src/main/quality-gates/gate-runner.ts -- Gate execution engine
  • Create: src/main/quality-gates/built-in/ -- test-gate.ts, lint-gate.ts, typecheck-gate.ts, review-gate.ts
  • Modify: src/cli/services/batch-processor.ts -- Add gate invocation between tasks
  • Modify: src/renderer/hooks/useBatchProcessor.ts -- Add gate invocation between tasks
  • Modify: Playbook interface (in ARCHITECTURE.md referenced types) -- Add gates configuration

Architecture decisions:

  • Gates run in the same working directory as the agent
  • Gate results are stored in StatsDB (auto_run_tasks table, new gate_result column)
  • Gates are sequential (not parallel) to avoid resource contention
  • Each gate has a configurable failure mode: pause | retry | skip | abort
  • The review gate uses Superpowers' skepticism pattern in its prompt

Key interfaces:

interface QualityGate {
  id: string;
  name: string;
  type: 'command' | 'review' | 'builtin';
  run(context: GateContext): Promise<GateResult>;
}

interface GateContext {
  workingDir: string;
  taskContent: string;
  changedFiles: string[];
  agentType: string;
  previousGateResults: GateResult[];
}

interface GateResult {
  gateId: string;
  status: 'pass' | 'fail' | 'warn' | 'skip';
  message: string;
  details?: string;
  duration: number;
}

interface PlaybookGateConfig {
  gates: Array<{
    type: 'test' | 'lint' | 'typecheck' | 'review' | 'custom';
    command?: string;  // For custom gates
    onFailure: 'pause' | 'retry' | 'skip' | 'abort';
    retryCount?: number;
  }>;
}

Test strategy:

  • Unit tests for each built-in gate with mocked command execution
  • Integration tests for the gate runner with real file system operations
  • E2E tests for the full Auto Run + gate pipeline
  • Pressure tests (Superpowers methodology): verify gates are not skipped under various prompting pressures
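
To make the architecture decisions above concrete, here is a runnable sketch of the sequential gate loop with pause/skip/abort failure modes (retry omitted for brevity). Types are trimmed relative to the interfaces above, and runGates is a hypothetical name:

```typescript
// Sequential gate runner sketch: gates run one at a time (no resource
// contention), and each gate's configured onFailure mode decides what a
// failure does to the batch. Retry handling is omitted here.
type GateStatus = "pass" | "fail" | "warn" | "skip";
interface GateResult { gateId: string; status: GateStatus; message: string }
interface Gate {
  id: string;
  onFailure: "pause" | "skip" | "abort";
  run(): Promise<GateResult>;
}

async function runGates(
  gates: Gate[],
): Promise<{ results: GateResult[]; outcome: "continue" | "paused" | "aborted" }> {
  const results: GateResult[] = [];
  for (const gate of gates) {
    const result = await gate.run();
    results.push(result);
    if (result.status === "fail") {
      if (gate.onFailure === "abort") return { results, outcome: "aborted" };
      if (gate.onFailure === "pause") return { results, outcome: "paused" };
      // "skip": record the failure and continue to the next gate
    }
  }
  return { results, outcome: "continue" };
}
```

The batch processor would call this between tasks and map "paused" to the existing user-acknowledgment flow, while "aborted" stops the Auto Run batch outright.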

6.2 Cost Budget Enforcement (Top Priority)

Files to create/modify:

  • Create: src/main/stats/budget-manager.ts -- Budget checking and enforcement
  • Modify: src/main/stats/schema.ts -- Add budgets table
  • Modify: src/cli/services/batch-processor.ts -- Add budget checks
  • Modify: src/renderer/hooks/useBatchProcessor.ts -- Add budget checks
  • Modify: Playbook interface -- Add budgetUsd field
  • Create: src/renderer/components/BudgetConfig.tsx -- Budget setting UI
  • Create: src/renderer/components/BudgetAlert.tsx -- Budget alert overlay

Architecture decisions:

  • Budget tracking is per-playbook-execution (not per-agent, since agents are shared)
  • Cost is accumulated from UsageStats.totalCostUsd emitted by agent parsers
  • Budget checking happens after each agent query completes (leveraging query-complete event)
  • Three thresholds: warning (80%), critical (95%), exceeded (100%)
  • At exceeded: pause batch processor, show alert, require user override to continue
  • Budget data stored in StatsDB for historical analysis

Key interfaces:

interface BudgetConfig {
  maxCostUsd: number;
  warnThresholdPct: number;    // Default 80
  criticalThresholdPct: number; // Default 95
  onExceeded: 'pause' | 'abort'; // Default pause
}

interface BudgetStatus {
  configuredBudget: number;
  currentSpend: number;
  percentUsed: number;
  status: 'ok' | 'warning' | 'critical' | 'exceeded';
}

class BudgetManager {
  checkBudget(playbook: Playbook, currentCost: number): BudgetStatus;
  recordSpend(playbook: Playbook, cost: number): void;
  getBudgetHistory(playbookId: string): BudgetStatus[];
}

Test strategy:

  • Unit tests for BudgetManager with various cost/threshold combinations
  • Integration tests with mock batch processor to verify pause behavior
  • UI tests for budget configuration and alert display
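
The threshold logic inside checkBudget reduces to a small pure function. This sketch uses the default thresholds above; the standalone budgetStatus name is an assumption:

```typescript
// Budget threshold sketch: maps spend against a budget to the four
// statuses defined above (ok / warning 80% / critical 95% / exceeded 100%).
type BudgetState = "ok" | "warning" | "critical" | "exceeded";

function budgetStatus(
  maxCostUsd: number,
  currentSpend: number,
  warnPct = 80,
  criticalPct = 95,
): { percentUsed: number; status: BudgetState } {
  const percentUsed = (currentSpend / maxCostUsd) * 100;
  let status: BudgetState = "ok";
  if (percentUsed >= 100) status = "exceeded";
  else if (percentUsed >= criticalPct) status = "critical";
  else if (percentUsed >= warnPct) status = "warning";
  return { percentUsed, status };
}
```

Because checking happens on each query-complete event, the worst-case overshoot is bounded by the cost of a single agent query, which is why the 95% critical threshold matters.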

6.3 Reaction Engine (High Priority)

Files to create/modify:

  • Create: src/main/reactions/reaction-engine.ts -- Event-to-action mapping engine
  • Create: src/main/reactions/reaction-config.ts -- Reaction definition types and defaults
  • Create: src/main/reactions/actions/ -- send-message.ts, pause-agent.ts, restart-agent.ts, notify.ts
  • Modify: src/main/process-manager/ProcessManager.ts -- Emit richer lifecycle events
  • Modify: src/main/ipc/handlers/agents.ts -- Register reaction configuration handlers
  • Create: src/renderer/components/ReactionConfig/ -- UI for reaction rule management

Architecture decisions:

  • Reactions are defined per-agent or globally in settings
  • The reaction engine subscribes to ProcessManager events
  • Reactions have a deduplication window (prevent re-firing within N seconds)
  • Escalation uses a timer-based mechanism (setTimeout, persisted in case of restart)
  • All reaction executions are logged to StatsDB for auditing
  • Default reactions handle the 5 most common scenarios: auth expiration, rate limiting, context exhaustion, agent crash, and agent idle timeout

Key interfaces:

interface ReactionRule {
  id: string;
  name: string;
  trigger: string;            // Event type (e.g., 'agent-error:rate_limited')
  condition?: string;         // Optional predicate (e.g., 'attempts < 3')
  action: ReactionAction;
  retries?: number;
  escalation?: {
    action: ReactionAction;
    afterMs: number;
    priority: 'normal' | 'high' | 'critical';
  };
}

type ReactionAction =
  | { type: 'send-message'; message: string }
  | { type: 'pause-agent' }
  | { type: 'restart-agent'; withResume: boolean }
  | { type: 'notify'; channel: 'desktop' | 'slack' | 'webhook'; message: string }
  | { type: 'abort-batch' };

class ReactionEngine {
  registerRule(rule: ReactionRule): void;
  handleEvent(event: ProcessManagerEvent, session: Session): void;
  getReactionHistory(sessionId: string): ReactionExecution[];
}

Test strategy:

  • Unit tests for each action executor
  • Unit tests for condition evaluation and deduplication
  • Integration tests with mock ProcessManager events
  • Escalation timer tests (fast-forward timers)
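
The deduplication window from the architecture decisions can be isolated into a small testable class. ReactionDeduper is a hypothetical helper name:

```typescript
// Deduplication window sketch: a rule refuses to re-fire within windowMs
// of its last execution, preventing reaction storms on repeated events.
class ReactionDeduper {
  private lastFired = new Map<string, number>();
  constructor(private windowMs: number) {}

  // Returns true (and records the firing) only if the rule is outside
  // its deduplication window; suppressed firings do not reset the window.
  shouldFire(ruleId: string, now: number): boolean {
    const last = this.lastFired.get(ruleId);
    if (last !== undefined && now - last < this.windowMs) return false;
    this.lastFired.set(ruleId, now);
    return true;
  }
}
```

Passing `now` explicitly (rather than calling Date.now() internally) keeps the class trivially testable with fake clocks, which matters for the escalation timer tests above.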

6.4 Anti-Rationalization Prompt Enhancement (High Priority)

Files to modify:

  • src/prompts/autorun-default.md -- Add rationalization prevention table and verification gate
  • src/prompts/group-chat-moderator-system.md -- Add skepticism instructions for synthesis rounds
  • src/prompts/group-chat-moderator-synthesis.md -- Add independent verification requirement
  • src/prompts/context-grooming.md -- Add verification that grooming preserved key context
  • src/prompts/commit-command.md -- Add verification-before-claiming-complete pattern
  • Create: src/prompts/review-skepticism.md -- Reusable review prompt with Superpowers' skepticism pattern

Architecture decisions:

  • Rationalization prevention is embedded in prompts, not enforced by code (complementary to code-enforced quality gates)
  • Each prompt that can result in "task complete" claims must include the IDENTIFY -> RUN -> READ -> VERIFY -> CLAIM gate function
  • The review prompt explicitly states distrust: "The agent may have checked off tasks without actually completing them. Verify independently."
  • No @ file references in prompt templates (following Superpowers' no-@ rule to prevent context bloat)

Key additions to autorun-default.md:

## Verification Protocol

Before checking off ANY task:
1. IDENTIFY: What command proves this task is complete?
2. RUN: Execute the command (fresh, complete)
3. READ: Full output, check exit code
4. VERIFY: Does output confirm completion?
5. ONLY THEN: Check off the task

## Common Rationalizations (DO NOT FALL FOR THESE)

| Temptation | Reality |
|---|---|
| "It should work based on what I wrote" | RUN the verification |
| "I'm confident in the changes" | Confidence is not evidence |
| "Just checking the box to move on" | Unchecked is better than falsely checked |
| "The test was passing before my change" | Run it AFTER your change |
| "Manual inspection confirms it works" | Run the automated verification |

Test strategy:

  • Adopt Superpowers' TDD-for-prompts methodology:
    1. Run Auto Run with current prompts, observe rationalization behavior
    2. Add anti-rationalization text
    3. Run same scenarios, verify improved compliance
    4. Use pressure scenarios (time pressure, sunk cost, authority) to stress-test

6.5 CI Pipeline Enhancement (Quick Win)

Files to create/modify:

  • Create: .github/workflows/ci.yml -- New CI workflow (separate from release.yml)
  • Create: .github/workflows/security.yml -- Security scanning workflow
  • Modify: .github/workflows/release.yml -- Add dependency on CI passing

Architecture decisions:

  • CI runs on every push and PR (not just releases)
  • Security scanning runs on a schedule (weekly) and on PRs touching package.json
  • CI must pass before release workflow can proceed
  • Coverage threshold: 50% minimum (gradually increase as coverage improves)

Key workflow structure:

# ci.yml (abridged sketch; the four stages are merged into one job here)
name: CI
on: [push, pull_request]
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx tsc --noEmit --project tsconfig.main.json
      - run: npx tsc --noEmit --project tsconfig.lint.json
      - run: npx tsc --noEmit --project tsconfig.cli.json
      - run: npx eslint src/
      - run: npx prettier --check src/
      - run: npx vitest run --config vitest.config.mts
      - run: npx vitest run --config vitest.integration.config.ts
      - run: npm audit --audit-level=high --production

# security.yml (abridged sketch)
name: Security
on:
  pull_request:
  schedule:
    - cron: "0 6 * * 1"   # weekly
jobs:
  gitleaks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # full history so Gitleaks scans every commit
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  dependency-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/dependency-review-action@v4
        with:
          fail-on-severity: high

Test strategy:

  • Verify CI catches known-bad commits (deliberate lint failures, type errors)
  • Verify security scan catches known-bad patterns (test with deliberate secret in test branch)

7. Source Trace

Every claim in this report is traced to specific sections of the source analysis reports.

7.1 Maestro Strengths (Section 1)

| Claim | Source Report | Section |
|---|---|---|
| Multi-provider support (4 active, 3 planned) | maestro-deep-analysis.md | S17.2 (Provider table), S1 (Executive Summary) |
| Declarative agent definition architecture | maestro-deep-analysis.md | S17.3 (Agent Definition Architecture), S3.3 (Output Parser Architecture) |
| 30+ keyboard shortcuts, Layer Stack system | maestro-deep-analysis.md | S14.1 (Desktop UX) |
| Group Chat moderator-agent pattern | maestro-deep-analysis.md | S5.1 (Group Chat System) |
| StatsDB with daily backups and corruption recovery | maestro-deep-analysis.md | S15.3 (Stats Database Architecture) |
| Cross-provider session discovery and resume | maestro-deep-analysis.md | S10.3-10.4 (Session Discovery, Session Resume) |
| Mobile PWA with voice/swipe/offline | maestro-deep-analysis.md | S14.2 (Mobile UX) |
| Error pattern system (~100 patterns) | maestro-deep-analysis.md | S3.4 (Error Pattern System) |
| Symphony community contribution | maestro-deep-analysis.md | S5.2 (Symphony Orchestration) |

7.2 Gap Analysis Matrix (Section 2)

| Gap | Maestro Source | Comparison Source |
|---|---|---|
| No quality gates in Auto Run | maestro S4.6 ("No automatic verification layer") | superpowers S4.5 (two-stage review), ecc S4.5 (6-phase verification) |
| No lifecycle state machine | maestro S10.5 (color-coded dots) | ao S9.1 (16-state machine), S9.2 (determineStatus algorithm) |
| No reaction engine | maestro S13 ("No event bus for external consumers") | ao S12.1-12.6 (reaction engine with escalation) |
| No anti-rationalization | maestro S4.5 (prompts without prevention tables) | superpowers S11.5 (40+ prevention entries), S11.6 (pressure testing) |
| No cost enforcement | maestro S15.5 ("No cost budgets or limits") | All projects lack this; maestro has best data infrastructure |
| No security in CI | maestro S11.8 ("No security scanning") | ao S11.3 (Gitleaks + dependency-review), ecc S11.1 (npm audit) |
| No CI testing | maestro S11.8 ("release workflow only builds") | ecc S11.1 (33-combination matrix), ao S10.1 (lint + typecheck + test) |
| No hook system | maestro S13 ("No plugin system") | ecc S13.1 (6-event lifecycle with 15 hooks) |
| No auto context compaction | maestro S9.6 ("No automatic context compaction") | ecc S9.2 (tool-call counting), superpowers S9.5 (re-inject on compact) |
| No agent tool scoping | maestro S17.3 (YOLO mode for all agents) | ecc S5.2 (read-only vs full-access per agent) |
| No notification channels | maestro S13 (desktop only) | ao S7.4 (desktop + Slack + webhook + Composio) |
| No issue tracker | maestro S21.4 (manual Auto Run creation) | ao S16.1-16.2 (GitHub + Linear) |
| No crash-safe persistence | maestro S19.5 (in-memory Group Chat state) | ao S9.4-9.5 (metadata survives crashes, session restoration) |
| No plugin architecture | maestro S21.3 (Encore Features) | ao S2.2-2.3 (8-slot plugin system) |
| No REST API | maestro S13.5 ("No webhook/HTTP API") | ao S7.2 (Next.js API routes) |
| No continuous learning | maestro (not present) | ecc S10.5 (instinct system) |
| No language-specific rules | maestro S4.5 (provider-agnostic prompts) | ecc Appendix A (44 skills, 24 rules) |

7.3 What NOT to Adopt (Section 5)

| Decision | Source Report | Section |
| --- | --- | --- |
| Agent-as-orchestrator | superpowers-deep-analysis.md | S5.1 (Agent-as-Orchestrator), S22.3 item 12 |
| Zero-persistence | superpowers-deep-analysis.md | S10.2 (No persistence), S22.3 item 13 |
| Flat-file metadata | agent-orchestrator-deep-analysis.md | S2.5 (Flat-file state), S21.3.1 |
| tmux runtime | agent-orchestrator-deep-analysis.md | S6.2 (tmux isolation), S21.3.2 |
| Polling lifecycle | agent-orchestrator-deep-analysis.md | S1.2 item 4, S21.3.3 |
| Inline JS in JSON | everything-claude-code-deep-analysis.md | S13.3 (Inline vs Script), S22.3 item 1 |
| Polyglot wrapper | superpowers-deep-analysis.md | S3.4 (run-hook.cmd), S22.3 item 16 |

7.4 Implementation Recommendations (Section 6)

| Recommendation | Primary Reference | Supporting Reference |
| --- | --- | --- |
| Quality gate system | superpowers S4.5 (two-stage review loop), ecc S4.5 (verification loop) | maestro S4.4 (batch processor architecture) |
| Cost budget enforcement | maestro S15.3 (StatsDB infrastructure) | All projects' S15 (cost gap analysis) |
| Reaction engine | ao S12.1-12.6 (reaction engine code) | maestro S3.2 (ProcessManager events) |
| Anti-rationalization prompts | superpowers S11.5-11.6 (tables and pressure tests) | superpowers S22.1 items 1, 6, 7 |
| CI pipeline | ao S11.3 (security.yml), ecc S11.1 (ci.yml) | maestro S11.8 (identified gaps) |

Appendix A: Cross-Reference Index

By Source Report

maestro-deep-analysis.md sections referenced:

  • S1 (Executive Summary): Section 1 intro
  • S2.3 (Agent Behavioral Guidelines): Gap 3.3
  • S3.2 (ProcessManager): Rec 6.3
  • S3.3 (Output Parser): Strength 1.1
  • S3.4 (Error Patterns): Strength 1.7
  • S4.4 (Execution): Gap 3.1
  • S4.5 (Prompt System): Gap 3.3, Rec 6.4
  • S4.6 (Verification): Gap 3.1
  • S5.1 (Group Chat): Strength 1.3
  • S5.2 (Symphony): Strength 1.8
  • S8.4 (Agent Error Handling): Gap 3.2
  • S9.6 (Context Missing): Gap matrix
  • S10.3-10.4 (Session Discovery/Resume): Strength 1.5
  • S10.5 (Session States): Gap 3.2
  • S11.8 (Quality Missing): Gaps 3.5, 3.6
  • S13 (Hooks Missing): Gap 3.6
  • S14.1 (Desktop UX): Strength 1.2
  • S14.2 (Mobile UX): Strength 1.6
  • S15.1-15.3 (Cost Tracking): Strength 1.4, Rec 6.2
  • S15.5 (Cost Missing): Gap 3.4
  • S17.1-17.3 (Providers): Strength 1.1
  • S19.5 (Failure Modes): Gap matrix
  • S21.3 (Encore Features): Gap matrix
  • S21.4 (Identified Gaps): Gap 3.7

superpowers-deep-analysis.md sections referenced:

  • S1 (Executive Summary): Section 5.1 rationale
  • S2.3 (Description Trap): Gap 3.3
  • S4.5 (Subagent-Driven Development): Gap 3.1
  • S5.1 (Agent-as-Orchestrator): Section 5.1
  • S5.4 (Review Loop): Gap 3.1
  • S9.5 (Compaction): Gap matrix
  • S10.2 (No Persistence): Section 5.2
  • S11.1-11.3 (Quality Gates): Gap 3.1
  • S11.5-11.6 (Anti-Rationalization): Gap 3.3, Rec 6.4
  • S15.2 (Cost Awareness): Gap 3.4
  • S22.1 (Strongly Borrow): Section 6.4
  • S22.3 (Do Not Borrow): Section 5

everything-claude-code-deep-analysis.md sections referenced:

  • S4.5 (Verification Loop): Gap 3.1
  • S5.2 (Tool Scoping): Gap matrix
  • S9.1 (Token Optimization): Gap 3.4
  • S9.2 (Strategic Compaction): Gap matrix
  • S10.5 (Continuous Learning): Gap matrix
  • S11.1 (CI Pipeline): Gap 3.5, Rec 6.5
  • S13.1-13.3 (Hook Architecture): Gap 3.6
  • S22.1 (Strongly Borrow): Phase 2 rationale
  • S22.3 (Do Not Borrow): Section 5.6

agent-orchestrator-deep-analysis.md sections referenced:

  • S1.2 (Design Principles): Section 5.5
  • S2.2-2.3 (Plugin Architecture): Gap matrix, Phase 3
  • S7.1-7.4 (Human-in-the-Loop): Gap matrix
  • S9.1-9.2 (State Machine): Gap 3.2
  • S9.4-9.5 (Session Cleanup/Restoration): Gap matrix
  • S10.1 (CI Pipeline): Gap 3.5
  • S11.3 (Security Scanning): Gap 3.5, Rec 6.5
  • S12.1-12.6 (Reaction Engine): Gap 3.2, Rec 6.3
  • S14 (Cost Visibility): Gap 3.4
  • S16.1-16.2 (Integrations): Gap 3.7
  • S21.1.1-21.1.6 (Strongly Borrow): Phase 2-3 rationale
  • S21.3 (Do Not Borrow): Section 5

Appendix B: Effort Estimation Summary

| Phase | Item | Effort | Impact | Lines Est. |
| --- | --- | --- | --- | --- |
| 1 | Anti-rationalization prompts | Low | High | ~200 (markdown) |
| 1 | CI security pipeline | Low | High | ~150 (YAML) |
| 1 | Cost budget enforcement | Medium | High | ~500 |
| 2 | Quality gate system | Medium | Critical | ~850 |
| 2 | Reaction engine | Medium-High | High | ~1200 |
| 2 | Notification channels | Medium | Medium | ~550 |
| 2 | Session state machine | Medium | High | ~650 |
| 3 | Plugin architecture | High | High | ~1800 |
| 3 | Issue tracker integration | High | Medium | ~2000 |
| 3 | REST/webhook API | Medium-High | Medium | ~1300 |
| 3 | Continuous learning | High | Medium | ~1500 |

Total estimated new code: ~10,700 lines across all three phases.


Appendix C: Confidence Scores

| Section | Confidence | Basis |
| --- | --- | --- |
| Maestro Strengths | High | Direct from maestro-deep-analysis.md with source code evidence |
| Gap Analysis Matrix | High | Cross-referenced all 4 reports, verified feature presence/absence |
| Detailed Gap Descriptions | High | Specific file references and code patterns cited |
| Prioritized Roadmap | Medium-High | Based on gap severity and implementation complexity analysis |
| What NOT to Adopt | High | Explicit recommendations from source reports' Section 22 |
| Implementation Recommendations | Medium | Architecture designs are informed by reference implementations but not validated |
| Source Trace | High | Every claim mapped to specific report section |
| Effort Estimates | Medium | Rough estimates based on reference implementation sizes; actual effort varies |

Harness Consensus Report: Cross-Project Synthesis

Date: 2026-02-22 Analyst: Claude Opus 4.6 Source Reports:

  1. superpowers-deep-analysis.md (obra/superpowers, v4.3.1)
  2. everything-claude-code-deep-analysis.md (affaan-m/everything-claude-code, v1.4.1)
  3. agent-orchestrator-deep-analysis.md (ComposioHQ/agent-orchestrator)
  4. maestro-deep-analysis.md (RunMaestro/Maestro, v0.15.0)

Executive Summary

After thorough analysis of four distinct AI coding harness projects, a clear picture emerges of what a canonical AI coding harness must include, where the field has converged, and where fundamental design tradeoffs remain unresolved.

The consensus is strong on what problems must be solved -- every project addresses orchestration, isolation, context management, quality gates, and human oversight. The consensus is weak on how to solve them -- the projects span a spectrum from pure-markdown behavioral engineering (Superpowers) to full desktop applications with SQLite-backed analytics (Maestro), with configuration-layer harnesses (ECC) and runtime orchestrators (Agent Orchestrator) between them.

Three findings stand out:

  1. Git worktrees are the universal isolation primitive. All four projects use git worktrees for agent workspace isolation. No other isolation mechanism achieves the same balance of lightweight overhead and genuine filesystem separation.

  2. Quality gate enforcement remains unsolved at scale. Every project acknowledges the need for verification before completion, but none has achieved truly reliable enforcement. Superpowers addresses this most rigorously through anti-rationalization engineering but lacks runtime enforcement. Agent Orchestrator has runtime state machines but no code quality gates within them. The gap between "instructed to verify" and "proven to have verified" persists.

  3. Cost governance is universally underdeveloped. Despite being a critical operational concern, no project implements budget limits, spending alerts, or automatic shutoff. Maestro tracks costs most comprehensively (SQLite analytics with dashboards), but even Maestro has no enforcement mechanism.

The canonical harness must combine Superpowers' behavioral rigor, ECC's configuration breadth, Agent Orchestrator's runtime state machine, and Maestro's multi-provider fleet management into a single coherent architecture.


1. Canonical Feature Set

1.1 Orchestration Model

How work flows from spec to completion.

| Project | Model | Enforcement | Key Mechanism |
| --- | --- | --- | --- |
| Superpowers | Brainstorm -> Plan -> Execute -> Review -> Merge | Advisory (skill text) | DOT flowcharts in markdown |
| ECC | Plan -> TDD -> Review -> Verify | Advisory (agent prompts) | /orchestrate command with 4 workflow types |
| Agent Orchestrator | Issue -> Spawn -> Work -> PR -> Review -> Merge | Runtime (state machine) | Lifecycle manager with 30s polling |
| Maestro | Spec (Auto Run) -> Playbook -> Execute -> Merge | Semi-automated (checkbox tracking) | Batch processor with JSONL events |

Consensus: All four projects implement a multi-stage pipeline from specification to completion. The stages are remarkably consistent: (1) understand requirements, (2) plan the work, (3) execute in isolation, (4) verify the output, (5) merge or deliver.

Divergence: The enforcement spectrum ranges from purely advisory (Superpowers, ECC) to runtime-enforced (Agent Orchestrator). Superpowers relies entirely on the AI agent reading skill documents and self-governing. ECC relies on Claude Code's native mechanisms with markdown-defined workflows. Agent Orchestrator has a formal state machine with 16 distinct states and event-driven transitions. Maestro treats the pipeline as a pass-through, delegating planning and execution to the underlying agents while managing lifecycle and UI.

Best implementation: Agent Orchestrator's state machine is the most rigorous for runtime enforcement. Superpowers' brainstorm-plan-execute-review pipeline is the most methodologically complete for the agent's internal workflow. A canonical harness needs both -- runtime state machine for enforcement, with behavioral skills for the agent's internal process.

Source: superpowers-deep-analysis.md Section 4; everything-claude-code-deep-analysis.md Section 4; agent-orchestrator-deep-analysis.md Section 3; maestro-deep-analysis.md Section 4
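The dual model recommended above -- a runtime state machine outside the agent, a behavioral pipeline inside it -- can be sketched on the runtime side as a small transition table. This is a hedged illustration: the state names below are simplified stand-ins, not Agent Orchestrator's actual 16 states.

```typescript
// Simplified lifecycle state machine in the spirit of AO's runtime
// enforcement. State names are illustrative, not AO's real set.
type State =
  | "queued" | "spawning" | "working" | "awaiting_review"
  | "ci_running" | "ci_failed" | "merged" | "failed";

// Legal transitions; anything absent here is rejected at runtime.
const transitions: Record<State, State[]> = {
  queued: ["spawning"],
  spawning: ["working", "failed"],
  working: ["awaiting_review", "failed"],
  awaiting_review: ["ci_running", "working"],
  ci_running: ["merged", "ci_failed"],
  ci_failed: ["working", "failed"],   // retry or give up
  merged: [],                         // terminal
  failed: [],                         // terminal
};

function transition(from: State, to: State): State {
  if (!transitions[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

The point of the explicit table is that the harness, not the agent, decides which moves are legal -- the agent's internal brainstorm/plan/execute discipline lives inside the "working" state.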

1.2 Multi-Agent Coordination

Parallelism, sequencing, and isolation.

| Project | Parallelism Model | Coordination | Communication |
| --- | --- | --- | --- |
| Superpowers | Sequential tasks (parallel only for independent debugging) | Agent-directed | Full task text in prompt |
| ECC | Sequential pipeline with handoff documents | Document-based | Structured handoff markdown |
| Agent Orchestrator | Embarrassingly parallel (one agent per issue) | Git + GitHub (no direct agent-to-agent) | CLI commands through tmux |
| Maestro | Parallel agents, sequential tasks within each | Group Chat moderator | Moderator AI routes messages |

Consensus: All projects agree that parallel execution of tasks with shared state (same files, same tests) is dangerous. Every project either prohibits it (Superpowers: "Never dispatch multiple implementation subagents in parallel") or isolates it (Agent Orchestrator: separate worktrees per issue, Maestro: separate agent workspaces).

Divergence: The projects fundamentally disagree on the right granularity for parallelism. Superpowers operates at the task level within a single feature (sequential). Agent Orchestrator operates at the issue level (parallel across features, sequential within). Maestro supports both through its execution queue (sequential within agent, parallel across agents) and Group Chat (parallel with synthesis).

Best implementation: Agent Orchestrator has the cleanest parallelism model for multi-issue work. Maestro's Group Chat moderator pattern is the most sophisticated for collaborative multi-agent coordination within a single task. Superpowers' prohibition against parallel task execution is the most safety-conscious.

Source: superpowers-deep-analysis.md Section 6; everything-claude-code-deep-analysis.md Section 6; agent-orchestrator-deep-analysis.md Section 5; maestro-deep-analysis.md Section 6

1.3 Code Quality Pipeline

Testing, review, verification, and security.

| Project | Testing | Code Review | Verification | Security |
| --- | --- | --- | --- | --- |
| Superpowers | TDD "Iron Law" (mandatory failing test first) | Two-stage (spec compliance + code quality) | IDENTIFY-RUN-READ-VERIFY gate function | Minimal (branch protection only) |
| ECC | TDD mandatory (80% coverage target) | Code reviewer agent with confidence filtering | 6-phase verification loop (build, type, lint, test, security, diff) | Security reviewer agent, OWASP checklist, AgentShield |
| Agent Orchestrator | None built-in (delegated to agent) | None built-in (delegated to GitHub) | CI status tracking (fail-closed) | Shell injection prevention, gitleaks, dependency review |
| Maestro | None built-in (delegated to agent) | None built-in (delegated to agent) | Checkbox completion tracking | execFileNoThrow, input validation, context isolation |

Consensus: All projects acknowledge that verification before completion is essential. Superpowers' "Verification Before Completion" skill and ECC's verification loop both express the same principle: do not claim success without evidence. Agent Orchestrator enforces this at the CI level (PRs must pass CI). Maestro delegates entirely to agents.

Divergence: The most significant divergence is whether quality gates are internal to the agent (Superpowers, ECC) or external to it (Agent Orchestrator, Maestro). Superpowers and ECC embed quality knowledge in the agent's instructions. Agent Orchestrator uses external CI systems. Maestro has no quality gates of its own.

Best implementation: Superpowers' two-stage code review (spec compliance THEN code quality, with review loops) is the most thorough agent-internal quality system. ECC's 6-phase verification loop is the most comprehensive verification checklist. Agent Orchestrator's fail-closed CI status is the safest external gate. A canonical harness needs both internal verification skills AND external CI enforcement.

Source: superpowers-deep-analysis.md Section 11; everything-claude-code-deep-analysis.md Section 11; agent-orchestrator-deep-analysis.md Section 10; maestro-deep-analysis.md Section 11
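ECC's 6-phase loop reduces to a fail-fast gate over an ordered list of checks. A minimal sketch, with phase runners injected so the gate itself stays testable; the commands in comments are typical choices, not prescribed by any of the projects:

```typescript
// Fail-fast verification gate modeled on ECC's 6-phase checklist.
type Phase = { name: string; run: () => boolean };

function verify(phases: Phase[]): { passed: boolean; failedAt?: string } {
  for (const phase of phases) {
    // Stop at the first failure so the agent fixes one thing at a time.
    if (!phase.run()) return { passed: false, failedAt: phase.name };
  }
  return { passed: true };
}

// Stubbed runners; a real harness would shell out to each tool.
const phases: Phase[] = [
  { name: "build", run: () => true },       // e.g. npm run build
  { name: "typecheck", run: () => true },   // e.g. tsc --noEmit
  { name: "lint", run: () => true },        // e.g. eslint src/
  { name: "test", run: () => true },        // e.g. vitest run
  { name: "security", run: () => true },    // e.g. npm audit
  { name: "diff-review", run: () => true }, // human or reviewer agent
];
```

Running this gate externally (in CI or the harness) rather than trusting the agent's self-report is what closes the "instructed to verify" vs "proven to have verified" gap.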

1.4 Context Management

Chunking, compaction, and retrieval.

| Project | Strategy | Compaction | Persistence | Budget Awareness |
| --- | --- | --- | --- | --- |
| Superpowers | Progressive disclosure (load skills on demand) | Re-injection on compact events | None (stateless) | Token budget targets per skill (<500 lines) |
| ECC | Strategic compaction (tool call counting) | PreCompact state saving, phase-aware decisions | Session summaries (7-day retention) | Settings for thinking tokens, compaction threshold, subagent model |
| Agent Orchestrator | Three-layer prompt composition (base + config + rules) | None (single-use sessions) | Metadata files (flat key=value) | None |
| Maestro | Context grooming, merge, transfer operations | Manual trigger (compact/groom) | Per-tab context, session discovery | Context usage percentage tracking |

Consensus: Every project recognizes that context window management is a first-class concern. The 200K token limit constrains every architectural decision. All projects agree that loading everything upfront is wrong -- some form of progressive disclosure or on-demand loading is necessary.

Divergence: The projects disagree on who manages context. Superpowers gives this responsibility to the agent via skill instructions (the "no-@ rule" preventing force-loading). ECC automates it through hooks (suggest-compact at tool call thresholds). Agent Orchestrator ignores it (single-use sessions assumed short enough). Maestro provides UI-driven context operations (groom, merge, transfer) but no automatic management.

Best implementation: ECC's strategic compaction system is the most operationally mature -- it tracks tool call counts, suggests compaction at thresholds, saves state before compaction, and provides a decision guide for when to compact vs. not. Superpowers' progressive disclosure model is the most context-efficient for skill loading. A canonical harness needs automatic compaction triggers combined with progressive skill loading.

Source: superpowers-deep-analysis.md Section 9; everything-claude-code-deep-analysis.md Section 9; agent-orchestrator-deep-analysis.md Section 8; maestro-deep-analysis.md Section 9
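ECC's tool-call-counting approach reduces to a small tracker. The threshold value and the critical-phase guard below are illustrative assumptions, not ECC's actual numbers:

```typescript
// ECC-style strategic compaction trigger: count tool calls, suggest
// compaction at a threshold, but never interrupt a critical phase.
class CompactionTracker {
  private toolCalls = 0;

  constructor(
    private threshold = 50,                              // assumed value
    private inCriticalPhase: () => boolean = () => false // e.g. mid-implementation
  ) {}

  recordToolCall(): void {
    this.toolCalls += 1;
  }

  shouldSuggestCompact(): boolean {
    // Phase-aware decision: compacting mid-implementation loses working state.
    return this.toolCalls >= this.threshold && !this.inCriticalPhase();
  }

  reset(): void {
    this.toolCalls = 0; // called after a compact event completes
  }
}
```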

1.5 Session Lifecycle

Persistence, recovery, and resume.

| Project | Persistence | Recovery | Resume |
| --- | --- | --- | --- |
| Superpowers | None (git commits only durable artifact) | Re-discover from git history | Hook re-fires on resume |
| ECC | Session markdown files (7-day retention) | Load latest session on start | Previous session injected into context |
| Agent Orchestrator | Metadata files + archived metadata | Workspace + runtime recreation | claude --resume <session-id> |
| Maestro | electron-store (JSON files with 2s debounce) | Session discovery per provider | Provider-specific resume flags |

Consensus: Every project addresses the need for some form of session persistence, but approaches vary dramatically. The universal agreement is that a new session should somehow benefit from what happened in previous sessions.

Divergence: Superpowers is explicitly zero-persistence by design ("Clean separation -- no persistent state to corrupt"). ECC persists structured session summaries. Agent Orchestrator maintains full session metadata with archive capability. Maestro has the richest persistence with per-provider session discovery and SQLite analytics.

Best implementation: Maestro has the most comprehensive session management (multi-provider session discovery, resume support for 4 different agents, persistent analytics). Agent Orchestrator has the best session lifecycle state machine (16 states with deterministic transitions). ECC's transcript-based session summary extraction is the most practical for cross-session context transfer.

Source: superpowers-deep-analysis.md Section 10; everything-claude-code-deep-analysis.md Section 10; agent-orchestrator-deep-analysis.md Section 9; maestro-deep-analysis.md Section 10
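Agent Orchestrator's flat key=value metadata format is simple enough to sketch as a codec. The field names and the tolerance for malformed lines are assumptions layered on the description above:

```typescript
// Flat key=value session metadata codec in the style of AO's metadata
// files. Values must not contain newlines in this simple format.
function serializeMetadata(meta: Record<string, string>): string {
  return Object.entries(meta)
    .map(([key, value]) => `${key}=${value}`)
    .join("\n");
}

function parseMetadata(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const line of text.split("\n")) {
    const idx = line.indexOf("=");
    if (idx === -1) continue; // tolerate blank/corrupt lines during crash recovery
    out[line.slice(0, idx)] = out[line.slice(0, idx)] === undefined
      ? line.slice(idx + 1)
      : out[line.slice(0, idx)]; // first occurrence wins
  }
  return out;
}
```

The crash-safety property comes from the format, not the code: each write is a whole small file, so a partially written file loses at most one session's metadata, never the whole store.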

1.6 Human-in-the-Loop Controls

Approval gates, escalation, and intervention.

| Project | Approval Gates | Escalation | Intervention Mechanism |
| --- | --- | --- | --- |
| Superpowers | Brainstorming approval, plan review, batch checkpoints | 3+ failed fixes -> escalate to human | Conversational (agent asks questions) |
| ECC | Plan confirmation ("yes"/"modify"/"different approach") | None formal | Hooks (warnings, blockers) |
| Agent Orchestrator | Dashboard for review/merge, send-message | Reaction engine with timed escalation | CLI ao send, dashboard messages |
| Maestro | Read-only mode, pause/resume, error modals | Agent error handling with recovery options | Queue management, Group Chat |

Consensus: All projects provide mechanisms for humans to review and intervene. Every project has at least one approval gate before code changes begin (plan approval in Superpowers/ECC, issue assignment in Agent Orchestrator, document creation in Maestro).

Divergence: The critical split is between synchronous gates (Superpowers blocks until human approves design) and asynchronous oversight (Agent Orchestrator notifies human when attention needed, Maestro shows error modals). Superpowers' subagent-driven development mode explicitly reduces human involvement ("Faster iteration, no human-in-loop between tasks"), while Agent Orchestrator's reaction engine keeps humans informed throughout.

Best implementation: Agent Orchestrator's reaction engine with configurable escalation is the most operationally mature HITL system. Superpowers' escalation triggers ("If 3+ fixes failed, STOP and question the architecture") are the most context-appropriate. Maestro's read-only mode toggle is the most user-friendly intervention mechanism. A canonical harness needs configurable escalation with both timed escalation (Agent Orchestrator style) and behavioral escalation (Superpowers style).

Source: superpowers-deep-analysis.md Section 8; everything-claude-code-deep-analysis.md Section 8; agent-orchestrator-deep-analysis.md Section 7; maestro-deep-analysis.md Section 8

1.7 Hooks and Automation Surface

| Project | Hook Events | Automation API | Extensibility |
| --- | --- | --- | --- |
| Superpowers | SessionStart only | claude -p (headless) | Skills as markdown files |
| ECC | 6 event types, 15 hooks total | Plugin marketplace | Agents, skills, commands, rules, contexts |
| Agent Orchestrator | Reaction engine (33 event types) | CLI (ao commands) | 8-slot plugin architecture |
| Maestro | IPC handlers (30+ modules), CLI | maestro-cli with JSONL output | Encore Features (precursor to plugins) |

Consensus: All projects recognize the need for event-driven automation. The granularity varies -- Superpowers has one hook event, ECC has 6, Agent Orchestrator has 33 distinct event types.

Best implementation: Agent Orchestrator's reaction engine is the most composable (event + condition + action + retries + escalation). ECC's hook architecture is the most battle-tested for pre/post tool use interception.

Source: superpowers-deep-analysis.md Section 13; everything-claude-code-deep-analysis.md Section 13; agent-orchestrator-deep-analysis.md Section 12; maestro-deep-analysis.md Section 13
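The reaction-engine shape described above (event + condition + action + retries + escalation) can be sketched as one dispatch function. Event names and fields here are illustrative, not AO's actual schema:

```typescript
// Composable reaction rule in AO's shape: match an event, check a
// condition, attempt an action with retries, escalate on exhaustion.
type HarnessEvent = { type: string; session: string; payload?: unknown };

interface Reaction {
  on: string;                               // event type to match
  when?: (e: HarnessEvent) => boolean;      // optional guard condition
  action: (e: HarnessEvent) => boolean;     // returns true on success
  maxRetries: number;
  escalate: (e: HarnessEvent) => void;      // e.g. notify a human channel
}

function dispatch(
  reaction: Reaction,
  event: HarnessEvent
): "handled" | "escalated" | "skipped" {
  if (event.type !== reaction.on) return "skipped";
  if (reaction.when && !reaction.when(event)) return "skipped";
  // One initial attempt plus maxRetries retries.
  for (let attempt = 0; attempt <= reaction.maxRetries; attempt++) {
    if (reaction.action(event)) return "handled";
  }
  reaction.escalate(event); // automation gave up; bring in the human
  return "escalated";
}
```

The value of this shape is that escalation is a first-class outcome rather than an afterthought: every rule declares up front what happens when automation fails.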

1.8 Cost/Usage Governance

| Project | Tracking | Budgets | Optimization |
| --- | --- | --- | --- |
| Superpowers | Post-hoc token analysis script | None | Cache utilization via sequential execution |
| ECC | Claude's /cost command, token optimization docs | None | Model selection, thinking token limits, subagent model |
| Agent Orchestrator | JSONL-based cost extraction per session | None | Rough cost estimates with Sonnet 4.5 pricing |
| Maestro | SQLite analytics, per-session USD tracking, usage dashboard | None | WakaTime integration, activity heatmaps |

Consensus: Every project acknowledges cost as a concern. No project implements cost budgets, spending alerts, or automatic shutoff. This is the most universally underserved feature area.

Best implementation: Maestro's SQLite-backed analytics with dashboard visualizations is far ahead of the others. ECC's token optimization documentation provides the most actionable cost reduction guidance.

Source: superpowers-deep-analysis.md Section 15; everything-claude-code-deep-analysis.md Section 15; agent-orchestrator-deep-analysis.md Section 14; maestro-deep-analysis.md Section 15
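AO-style JSONL cost extraction amounts to summing a cost field across log lines. The `costUSD` field name is an assumption about the log schema, not a documented contract:

```typescript
// Sum per-session cost from a JSONL transcript, tolerating the partial
// or corrupt lines that interrupted sessions leave behind.
function totalCostUSD(jsonl: string): number {
  let total = 0;
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    try {
      const entry = JSON.parse(line);
      if (typeof entry.costUSD === "number") total += entry.costUSD;
    } catch {
      // Skip unparseable lines rather than failing the whole report.
    }
  }
  return total;
}
```

A budget enforcer -- the feature no project ships -- would only need to call this per session and compare against a configured ceiling before allowing the next spawn.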

1.9 Security and Compliance

| Project | Shell Safety | Secret Scanning | Sandboxing | Auth |
| --- | --- | --- | --- | --- |
| Superpowers | Minimal (read-only hook) | None | None (advisory worktrees) | None |
| ECC | Input validation regex | AgentShield (external) | None (tool scoping only) | None |
| Agent Orchestrator | execFile everywhere (never exec) | gitleaks in CI, dependency review | None (worktree isolation) | None (dashboard unauthenticated) |
| Maestro | execFileNoThrow, spawn({shell: false}) | None | None (user privileges) | None (web server unauthenticated) |

Consensus: Every project runs AI agents with the same privileges as the user. No project implements container-based sandboxing, network isolation, or resource limits. Shell injection prevention is the most common security measure (Agent Orchestrator and Maestro both enforce execFile over exec).

Divergence: Agent Orchestrator is the only project with CI-integrated secret scanning (gitleaks). ECC is the only project with an external security audit tool (AgentShield). None have runtime sandboxing.

Best implementation: Agent Orchestrator's shell security discipline (execFile always, path traversal prevention, symlink validation) is the most systematic. ECC's security reviewer agent provides the most comprehensive code-level security review.

Source: superpowers-deep-analysis.md Section 12; everything-claude-code-deep-analysis.md Section 12; agent-orchestrator-deep-analysis.md Section 11; maestro-deep-analysis.md Section 12
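The execFile discipline matters because exec() routes its command string through a shell, so untrusted input (a branch name, an issue title) becomes executable. A sketch follows; the ref validator is an illustration, not AO's exact rule set:

```typescript
import { execFile } from "node:child_process";

// Reject shell metacharacters and git option injection (leading "-").
// Illustrative allowlist, stricter than git's own ref rules.
function isSafeRef(ref: string): boolean {
  return /^[A-Za-z0-9._/-]+$/.test(ref) && !ref.startsWith("-");
}

function checkout(branch: string): void {
  if (!isSafeRef(branch)) throw new Error(`unsafe ref: ${branch}`);
  // No shell is spawned: the branch name reaches git as a single argv
  // entry, so `main; rm -rf ~` can never execute as a command.
  execFile("git", ["checkout", branch], (err) => {
    if (err) console.error("checkout failed:", err.message);
  });
}
```

Contrast with the unsafe pattern: exec(`git checkout ${branch}`) hands the whole string to /bin/sh, where `;`, `$()`, and backticks are live syntax.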

1.10 Provider Compatibility and Extensibility

| Project | Providers | Extension Model |
| --- | --- | --- |
| Superpowers | Claude Code, Cursor, Codex, OpenCode | Skills (markdown), tool mapping per platform |
| ECC | Claude Code (primary), Cursor, OpenCode | Plugin marketplace, language-specific rules |
| Agent Orchestrator | Claude Code (primary), Codex, Aider, OpenCode | 8-slot plugin architecture with manifests |
| Maestro | Claude Code, Codex, OpenCode, Factory Droid (+ 3 planned) | Agent definitions with capability flags, output parsers |

Consensus: Claude Code is the primary supported agent across all four projects. All projects support, or plan to support, multiple providers. The approach to multi-provider support varies from markdown tool mapping (Superpowers) to a full plugin architecture (Agent Orchestrator, Maestro).

Best implementation: Maestro's agent definition architecture (declarative argument builder, capability flags, per-agent output parsers, per-agent error patterns) is the most extensible. Agent Orchestrator's 8-slot plugin system is the most architecturally clean for capability abstraction.

Source: superpowers-deep-analysis.md Section 17; everything-claude-code-deep-analysis.md Section 17; agent-orchestrator-deep-analysis.md Section 16; maestro-deep-analysis.md Section 17


2. Consensus Patterns

Patterns appearing in 3 or more projects, representing likely essential features of any production harness.

2.1 Git Worktree Isolation (4/4 projects)

What it is: Using git worktrees to provide each agent or task with an isolated filesystem checkout while sharing the same git object store.

Projects: Superpowers (skill: using-git-worktrees), ECC (documented pattern with git worktree add), Agent Orchestrator (plugin: workspace-worktree), Maestro (IPC handler: git:worktreeSetup).

How they differ:

  • Superpowers treats worktree creation as a skill the agent follows, with directory selection priority and safety verification (.gitignore check).
  • ECC documents worktrees as a recommended parallelization pattern but does not automate creation.
  • Agent Orchestrator automates worktree creation per-session with git fetch origin before branching, post-create hooks (npm install), and symlink support for shared resources (node_modules).
  • Maestro provides UI-driven worktree management with one-click PR creation from worktree branches.

Canonical form: Automated worktree creation per agent/task with: (1) fetch origin before branching, (2) configurable post-create commands, (3) shared resource symlinking, (4) safety verification (.gitignore check), (5) cleanup on task completion, (6) branch naming convention tied to issue tracker.

Why this consensus exists: Git worktrees are the sweet spot between full isolation (clones) and no isolation (shared working directory). They share the git object store (low disk overhead), provide separate working trees and indexes (true filesystem isolation), and work with git's existing branch model (natural merge path).

Source: superpowers-deep-analysis.md Section 7.1; everything-claude-code-deep-analysis.md Section 7.1; agent-orchestrator-deep-analysis.md Section 6.1; maestro-deep-analysis.md Section 7.1
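The six-step canonical form above can be sketched as an ordered argv list, matching the execFile-over-shell discipline the same projects favor. Paths, the branch naming scheme, and the npm post-create hook are illustrative assumptions:

```typescript
// Canonical worktree setup as an ordered list of argv arrays. Each entry
// is meant to be run via execFile (no shell), in sequence, fail-fast.
function worktreeCommands(issueId: string, repoRoot: string): string[][] {
  const branch = `agent/${issueId}`;            // assumed naming convention
  const dir = `${repoRoot}/.worktrees/${issueId}`;
  return [
    ["git", "fetch", "origin"],                 // (1) fresh base before branching
    ["git", "worktree", "add", "-b", branch, dir, "origin/main"], // isolated checkout
    ["ln", "-s", `${repoRoot}/node_modules`, `${dir}/node_modules`], // (3) shared deps
    ["npm", "install", "--prefix", dir],        // (2) post-create hook
  ];
}
```

Cleanup on completion (step 5) is the inverse: `git worktree remove` plus `git branch -d` once the PR merges, which the sketch omits.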

2.2 Plan-Before-Execute Pipeline (4/4 projects)

What it is: Requiring a planning phase that produces a reviewable plan before any code is written.

Projects: Superpowers (brainstorming -> writing-plans), ECC (/plan command -> planner agent), Agent Orchestrator (issue -> system prompt), Maestro (Auto Run documents -> Playbooks).

How they differ:

  • Superpowers mandates brainstorming BEFORE planning, with a hard gate preventing implementation before design approval. Plans include exact file paths and complete code snippets.
  • ECC uses a planner agent (Opus model) that produces structured plans with overview, requirements, architecture, steps, testing, risks, and success criteria. Explicit confirmation required.
  • Agent Orchestrator uses issue content as the implicit plan, with the orchestrator agent deciding how to decompose it.
  • Maestro uses markdown documents with checkbox items as plans, created manually or via AI wizard.

Canonical form: A planning phase that: (1) explores alternatives before committing, (2) produces a machine-parseable plan document, (3) requires human approval, (4) includes success criteria and testing strategy, (5) decomposes into individually executable tasks.

Why this consensus exists: Unplanned AI agent work consistently produces scope creep, architectural mismatches, and incomplete implementations. Planning constrains the agent's tendency to solve the problem it wants to solve rather than the one specified.

Source: superpowers-deep-analysis.md Section 4.2-4.4; everything-claude-code-deep-analysis.md Section 4.1-4.3; agent-orchestrator-deep-analysis.md Section 3.1; maestro-deep-analysis.md Section 4.2-4.3
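The canonical plan can be sketched as a typed document with an explicit approval gate. Field names and the markdown layout are assumptions blending the four projects' conventions (checkbox tasks from Maestro, success criteria from ECC, the hard approval gate from Superpowers):

```typescript
// Machine-parseable plan document with a human approval gate.
interface Plan {
  title: string;
  tasks: string[];            // individually executable units
  successCriteria: string[];
  testingStrategy: string;
  approved: boolean;          // flipped only by a human
}

function renderPlan(p: Plan): string {
  return [
    `# ${p.title}`,
    ``,
    `## Tasks`,
    ...p.tasks.map((t) => `- [ ] ${t}`), // checkbox per task, Maestro-style
    ``,
    `## Success Criteria`,
    ...p.successCriteria.map((c) => `- ${c}`),
    ``,
    `## Testing Strategy`,
    p.testingStrategy,
  ].join("\n");
}

function canExecute(p: Plan): boolean {
  // Hard gate: no implementation before human approval of a non-empty plan.
  return p.approved && p.tasks.length > 0;
}
```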

2.3 Markdown-Native Configuration (4/4 projects)

What it is: Using markdown with optional YAML frontmatter as the primary format for agent instructions, skills, prompts, and workflow definitions.

Projects: All four use markdown extensively for defining agent behavior.

How they differ:

  • Superpowers is 100% markdown skills with zero executable orchestration code.
  • ECC uses markdown for agents (13), skills (44), commands (32), rules (24), and contexts (3).
  • Agent Orchestrator uses markdown for system prompts and agent rules files.
  • Maestro uses markdown for 24 system prompts, Auto Run documents, and CLAUDE.md ecosystem.

Canonical form: Markdown as the LLM-native configuration format, with: (1) YAML frontmatter for machine-parseable metadata, (2) prose content for behavioral instructions, (3) structured sections for checklists and procedures, (4) template variables for dynamic content injection.

Why this consensus exists: Markdown is the format LLMs understand best. It is human-readable, version-controllable, and requires no build step. YAML frontmatter provides structured metadata without sacrificing markdown readability.

Source: superpowers-deep-analysis.md Section 3.1; everything-claude-code-deep-analysis.md Section 2.2; agent-orchestrator-deep-analysis.md Section 8.1; maestro-deep-analysis.md Section 4.5
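A minimal frontmatter splitter shows how little machinery the format needs. This is a sketch handling only flat `key: value` pairs; real implementations use a proper YAML parser:

```typescript
// Split a markdown-native config file into YAML-frontmatter metadata
// (machine-parseable) and prose body (behavioral instructions).
function parseSkill(doc: string): {
  meta: Record<string, string>;
  body: string;
} {
  const m = doc.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!m) return { meta: {}, body: doc }; // no frontmatter: whole doc is body
  const meta: Record<string, string> = {};
  for (const line of m[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { meta, body: m[2] };
}
```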

2.4 Agent Tool Scoping / Principle of Least Privilege (3/4 projects)

What it is: Restricting which tools each agent can access based on its role.

Projects: ECC (agent frontmatter tools field), Agent Orchestrator (agent plugin capabilities), Maestro (per-agent capability flags and read-only mode). Superpowers does not implement this (single agent model).

How they differ:

  • ECC assigns specific tool arrays per agent: planner gets ["Read", "Grep", "Glob"] (read-only), tdd-guide gets all tools including Write, Edit, Bash.
  • Agent Orchestrator scopes capabilities per agent plugin, though in practice all agents get full access within their isolated worktree.
  • Maestro provides per-tab read-only mode toggles and agent-specific capability flags (20 flags per agent).

Canonical form: Each agent role gets: (1) a declared set of allowed tools, (2) read-only agents for planning/review cannot modify code, (3) full-access agents for implementation/debugging, (4) runtime enforcement of tool restrictions.

Why this consensus exists: Unrestricted tool access leads to agents taking unexpected actions. A planning agent that can write files will often start implementing instead of planning. Tool scoping enforces the separation of concerns between planning, implementation, and review.

Source: everything-claude-code-deep-analysis.md Section 5.2; agent-orchestrator-deep-analysis.md Section 2.2; maestro-deep-analysis.md Section 8.1
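A minimal sketch of runtime enforcement for this pattern, in TypeScript. The role names and tool lists follow the ECC examples above; the function itself is hypothetical, not code from any project:

```typescript
// Hypothetical tool allow-list check for role-scoped agents.
type Role = "planner" | "tdd-guide";

// Per-role tool grants: planners are read-only, implementers get full access.
const ALLOWED_TOOLS: Record<Role, readonly string[]> = {
  planner: ["Read", "Grep", "Glob"],
  "tdd-guide": ["Read", "Grep", "Glob", "Write", "Edit", "Bash"],
};

// The harness would call this before dispatching any tool request, rejecting
// calls that fall outside the agent's declared grant.
function isToolAllowed(role: Role, tool: string): boolean {
  return ALLOWED_TOOLS[role].includes(tool);
}
```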

2.5 Handoff Documents Between Pipeline Stages (3/4 projects)

What it is: Structured documents passed between agents or pipeline stages to transfer context.

Projects: ECC (handoff protocol: Context, Findings, Files Modified, Open Questions, Recommendations), Superpowers (full task text passed to subagents with scene-setting context), Maestro (Group Chat messages with moderator synthesis).

How they differ:

  • ECC defines a formal handoff template with 5 sections passed between sequential agents.
  • Superpowers uses a controller-curated approach: the main agent reads the plan once, extracts tasks, and provides full task text directly to subagents (no file references).
  • Maestro's Group Chat moderator synthesizes responses from multiple agents into a coherent summary.

Canonical form: Between pipeline stages, pass: (1) summary of work completed, (2) files modified, (3) open questions and blockers, (4) recommendations for next stage, (5) relevant context the next stage needs but may not discover independently.

Why this consensus exists: Subagents and sequential agents lack context from previous stages. Without explicit handoffs, each stage must rediscover what was already learned, wasting tokens and introducing inconsistency.

Source: everything-claude-code-deep-analysis.md Section 4.2; superpowers-deep-analysis.md Section 5.5; maestro-deep-analysis.md Section 5.1
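A handoff document following the canonical five-section form might look like this. The content is invented; the section names follow ECC's handoff protocol as described above:

```markdown
## Handoff: implement-auth -> review-auth

### Context
Implemented JWT-based session auth per task 3 of the approved plan.

### Findings
The existing middleware already validates headers; reused it rather than adding a new layer.

### Files Modified
- src/auth/session.ts (new)
- src/middleware/validate.ts (extended)

### Open Questions
- Should refresh tokens rotate on every use, or only on expiry?

### Recommendations
Review the token-expiry edge cases first; they are the least tested path.
```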

2.6 SessionStart Bootstrap Injection (3/4 projects)

What it is: Injecting behavioral instructions into the agent's context at session start (and on resume/compact events).

Projects: Superpowers (session-start hook injects using-superpowers skill), ECC (session-start.js loads previous sessions, learned skills, package manager), Agent Orchestrator (system prompt composed from 3 layers at spawn time).

How they differ:

  • Superpowers wraps injected content in <EXTREMELY_IMPORTANT> tags and fires on startup/resume/clear/compact.
  • ECC loads the most recent session summary, reports learned skills, and detects package manager.
  • Agent Orchestrator composes a system prompt from base prompt + config context + user rules at spawn time (one-shot, not re-injected).

Canonical form: At session start: (1) inject core behavioral instructions, (2) restore relevant context from previous sessions, (3) establish project-specific configuration, (4) re-inject on context compaction events to prevent instruction loss.

Why this consensus exists: Without session-start injection, the agent begins each session as a blank slate with no knowledge of project conventions, workflow requirements, or previous work. The compaction re-injection is critical because context compaction (which occurs during long sessions) can lose the original instructions.

Source: superpowers-deep-analysis.md Section 3.5; everything-claude-code-deep-analysis.md Section 10.1; agent-orchestrator-deep-analysis.md Section 8.1
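The canonical bootstrap can be sketched as a small composition function. This is a hypothetical sketch: the three inputs loosely mirror Agent Orchestrator's layered prompt, and the caller is assumed to re-invoke it on startup, resume, clear, and compact events, as Superpowers' hook does:

```typescript
// Hypothetical session-start context composer.
interface BootstrapInput {
  coreInstructions: string;   // behavioral skills / workflow rules
  projectConfig: string;      // project-specific conventions
  previousSummary?: string;   // restored from the last session, if any
}

// Re-run on compaction events so instructions survive context loss.
function composeBootstrap(input: BootstrapInput): string {
  const parts = [input.coreInstructions, input.projectConfig];
  if (input.previousSummary) {
    parts.push(`Previous session summary:\n${input.previousSummary}`);
  }
  return parts.join("\n\n");
}
```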

2.7 TDD-First Development Mandate (2/4 projects)

What it is: Requiring tests to be written before implementation code.

Projects: Superpowers ("Iron Law: NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST") and ECC (rules/common/testing.md: RED-GREEN-REFACTOR mandatory, 80% coverage). Maestro has no TDD enforcement, and Agent Orchestrator delegates to the agent's own skills.

How they differ:

  • Superpowers has the most aggressive enforcement with an 11-entry rationalization prevention table, a 12-entry red flags list, and the maxim "Write code before the test? Delete it. Start over."
  • ECC enforces TDD through rules files and a tdd-guide agent (Sonnet model, full tool access).
  • Agent Orchestrator and Maestro do not enforce TDD, relying on the underlying agents' own practices.

Canonical form: A TDD mandate that: (1) requires a failing test before any production code, (2) includes anti-rationalization measures for common excuses, (3) provides a dedicated TDD enforcement agent/skill, (4) verifies test failure before allowing implementation.

Why this consensus exists: AI agents consistently skip testing when not explicitly required. The bias toward "just make it work" is strong, and without TDD enforcement, agents produce untested code that appears functional but fails on edge cases.

Source: superpowers-deep-analysis.md Section 11.1; everything-claude-code-deep-analysis.md Section 11.5
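Point (4) of the canonical form, verifying test failure before allowing implementation, can be reduced to a tiny gate. None of the projects enforce this in code; the sketch below is purely illustrative:

```typescript
// Hypothetical RED-phase gate: production edits are allowed only after the new
// test has been run and observed to fail.
type TestRun = { passed: boolean; failureMessage?: string };

function mayWriteProductionCode(newTestRun: TestRun | null): boolean {
  // No test run yet, or the test already passes: implementation is premature.
  if (newTestRun === null || newTestRun.passed) return false;
  return true;
}
```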

2.8 Cross-Platform Compatibility (3/4 projects)

What it is: Supporting multiple operating systems and development environments.

Projects: Superpowers (Windows polyglot wrapper, macOS/Linux bash), ECC (all hooks in Node.js for cross-platform, CI matrix of 3 OS x 3 Node x 4 PM), Agent Orchestrator (macOS/Linux only, tmux dependency excludes Windows), Maestro (macOS, Linux, Windows builds via Electron).

How they differ:

  • Superpowers achieved cross-platform via a polyglot cmd/bash wrapper, which caused numerous Windows bugs (#518, #504, #491, etc.).
  • ECC deliberately chose Node.js for all hooks to avoid bash dependency, achieving the cleanest cross-platform story.
  • Agent Orchestrator is Unix-only due to tmux dependency.
  • Maestro uses Electron for cross-platform desktop support with platform-specific build targets.

Canonical form: Use Node.js (not bash) for hook scripts and automation. Support macOS, Linux, and Windows. Test on all platforms in CI.

Why this consensus exists: Developers use diverse platforms. Bash-only automation excludes Windows developers. ECC's migration from bash to Node.js and Superpowers' extensive Windows debugging both prove that cross-platform support is essential but costly.

Source: superpowers-deep-analysis.md Section 19.1; everything-claude-code-deep-analysis.md Section 16.5; maestro-deep-analysis.md Section 18.2

2.9 Fail-Safe Hook Design (3/4 projects)

What it is: Designing hooks and automation scripts so that failures degrade gracefully rather than blocking the main workflow.

Projects: Superpowers (hook failure -> no bootstrap, but plugin still works), ECC (all hooks exit 0 on error, blocking only via intentional exit 2), Agent Orchestrator (enrichment timeouts -> stale data, not hangs).

How they differ:

  • Superpowers has exactly one hook, and its failure means no skill awareness (silent degradation).
  • ECC implements a consistent pattern: main().catch(err => { console.error(...); process.exit(0); }) in every hook script.
  • Agent Orchestrator uses timeouts for dashboard enrichment (3s metadata, 4s PR data) so slow APIs don't block the UI.

Canonical form: All hooks: (1) catch all errors, (2) log errors to stderr (visible but not disruptive), (3) exit 0 on error (unless intentionally blocking with exit 2), (4) have timeouts for external calls, (5) accept incomplete data over blocked workflows.

Why this consensus exists: A hook that crashes or hangs blocks the entire agent session. The cost of a failed hook (missing enhancement) is far lower than the cost of a hung session (lost work, wasted tokens).

Source: superpowers-deep-analysis.md Section 13.3; everything-claude-code-deep-analysis.md Section 13.3.3; agent-orchestrator-deep-analysis.md Section 5.4
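The canonical pattern maps onto a small Node-style sketch. Only the catch-log-exit-0 shape is taken directly from ECC's hooks; the helper names and the timeout wrapper are hypothetical:

```typescript
// Hypothetical fail-safe hook wrapper. Exit code 0 = continue; an intentional
// block is the only reason to return 2.
function runHook(body: () => number): number {
  try {
    return body(); // return 2 here only when the hook intends to block
  } catch (err) {
    console.error("hook failed (non-blocking):", err); // visible, not disruptive
    return 0; // degrade gracefully: never block the session on a hook crash
  }
}

// Bound external calls so a slow API yields stale/fallback data instead of a
// hung session (mirrors Agent Orchestrator's enrichment timeouts).
async function withTimeout<T>(call: Promise<T>, ms: number, fallback: T): Promise<T> {
  const timeout = new Promise<T>((resolve) => setTimeout(() => resolve(fallback), ms));
  return Promise.race([call, timeout]);
}
```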


3. Divergence Points

3.1 Agent-as-Orchestrator vs. Runtime Orchestrator

The tradeoff: Should the AI agent itself orchestrate the workflow (guided by instructions), or should external code manage the workflow (dispatching agents as workers)?

  • Agent-as-orchestrator (Superpowers, ECC): Zero executable code, pure markdown, agent self-governs. Simple, portable, works across platforms.
  • Runtime orchestrator (Agent Orchestrator, Maestro): External code manages state, enforces transitions, provides observability. Reliable, auditable, recoverable.

Implications:

  • Agent-as-orchestrator requires no infrastructure but provides no enforcement. If the agent ignores instructions, nothing prevents it. There is no observability, no audit trail, and no recovery mechanism. However, it is vastly simpler to implement and distribute.
  • Runtime orchestrator provides enforcement (state machines), observability (event logging), and recovery (session restore) but requires significant infrastructure (processes, databases, APIs). It introduces complexity, dependencies, and maintenance burden.

Resolution for canonical harness: Use a runtime orchestrator for lifecycle management and enforcement, but embed agent-as-orchestrator behavioral skills for the agent's internal process. The runtime ensures the agent goes through the right stages; the skills ensure the agent behaves correctly within each stage.

Source: superpowers-deep-analysis.md Section 5.1, Section 22.3 (item 12); agent-orchestrator-deep-analysis.md Section 4

3.2 Advisory vs. Enforced Quality Gates

The tradeoff: Should quality gates be enforced by prompt engineering (advisory) or by runtime code (enforced)?

  • Advisory (Superpowers, ECC): Flexible, adaptable, works within the agent's own reasoning. Anti-rationalization engineering can be very effective.
  • Enforced (Agent Orchestrator): Deterministic, auditable, cannot be bypassed. But rigid, less adaptable to novel situations.
  • Neither (Maestro): Quality delegated entirely to underlying agents. Pass-through design avoids this tension.

Implications:

  • Advisory gates are only as reliable as the agent's compliance. Superpowers has invested enormously in anti-rationalization engineering (40+ rationalization entries, 7 pressure types, TDD for skills) but acknowledges the fundamental limitation: "there is no enforcement mechanism beyond the agent's willingness to follow instructions."
  • Enforced gates guarantee compliance but can be overly rigid. Agent Orchestrator's CI-based enforcement only checks at the PR level, not during individual tasks. An agent could commit untested code and only discover failures at CI time.

Resolution for canonical harness: Layer both approaches. Enforced gates for critical checkpoints (tests must pass before PR, CI must pass before merge, budget must not be exceeded). Advisory gates for quality guidance (code review standards, TDD methodology, debugging protocol). The enforced layer catches catastrophic failures; the advisory layer improves quality incrementally.

Source: superpowers-deep-analysis.md Section 11.7; agent-orchestrator-deep-analysis.md Section 19
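The layered resolution can be sketched as a two-tier gate evaluator (hypothetical types and names; the enforced/advisory split follows the resolution above):

```typescript
// Hypothetical two-tier quality gate evaluation: enforced gates block the
// pipeline, advisory gates only accumulate warnings.
interface Gate {
  name: string;
  enforced: boolean;     // true = hard checkpoint, false = advisory guidance
  check: () => boolean;  // true = gate satisfied
}

interface GateResult { blocked: boolean; warnings: string[] }

function evaluateGates(gates: Gate[]): GateResult {
  const warnings: string[] = [];
  let blocked = false;
  for (const gate of gates) {
    if (gate.check()) continue;
    if (gate.enforced) blocked = true;   // e.g. tests must pass before PR
    else warnings.push(gate.name);       // e.g. code review style guidance
  }
  return { blocked, warnings };
}
```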

3.3 Single-Session vs. Multi-Session Architecture

The tradeoff: Should the harness operate within a single AI session (subagents via Task tool) or manage multiple independent sessions?

  • Single-session with subagents (Superpowers): Fresh context per subagent, no inter-session coordination needed, natural conversation flow.
  • Multi-session fleet management (Agent Orchestrator, Maestro): True parallelism, resource isolation, independent failure domains.
  • Hybrid (ECC): Sequential pipeline within a session, manual parallelism across sessions.

Implications:

  • Single-session benefits from Claude Code's prompt cache (subsequent subagents get cache hits from earlier ones). Superpowers documents this: 1.38M cache read tokens vs. 62 direct input tokens in a test run. But it limits parallelism and creates a single point of failure.
  • Multi-session enables true parallelism but loses cache benefits and requires coordination infrastructure. Agent Orchestrator's polling-based coordination adds 30-second latency to state changes.

Resolution for canonical harness: Support both modes. Single-session subagent execution for sequential tasks within a feature (cache-efficient). Multi-session fleet management for parallel feature development (resource-efficient). The choice should be per-task, not architecture-wide.

Source: superpowers-deep-analysis.md Section 15.3; agent-orchestrator-deep-analysis.md Section 5.1; maestro-deep-analysis.md Section 6

3.4 Stateless vs. Stateful Design

The tradeoff: Should the harness maintain persistent state, or should it be stateless with git as the only durable artifact?

  • Stateless (Superpowers): No state to corrupt, no database to maintain, git is the source of truth. Clean and simple.
  • Stateful (ECC light, Agent Orchestrator medium, Maestro heavy): Session history, progress tracking, analytics, recovery. Rich operational capability.

Implications:

  • Stateless means recovery from a crash requires manual intervention (re-read git history, re-create plan). There is no progress tracking across sessions, no cost analytics, no learning from past behavior.
  • Stateful enables session resume, cost tracking, pattern learning (ECC's instinct system), and fleet-wide analytics (Maestro's usage dashboard). But it introduces data management complexity, corruption risk, and migration burden.

Resolution for canonical harness: Stateful with SQLite (embedded, zero-config, ACID-compliant). Maestro's approach is the right one, provided the canonical harness also adopts the resilience features Maestro already ships: daily backups, corruption detection, and automated recovery.

Source: superpowers-deep-analysis.md Section 10.2; maestro-deep-analysis.md Section 15.3

3.5 Configuration-Centric vs. Application-Centric

The tradeoff: Should the harness be a collection of configurations for existing tools, or a standalone application?

  • Configuration layer (Superpowers, ECC): Leverages existing tools (Claude Code, Cursor). Lightweight, no new runtime. Easy to adopt.
  • Standalone application (Agent Orchestrator, Maestro): Full control over UX, lifecycle, analytics. Professional-grade experience. Harder to adopt.

Implications:

  • Configuration layers benefit from the rapid evolution of underlying tools. When Claude Code adds features, Superpowers and ECC automatically benefit. But they cannot enforce constraints that the underlying tool does not support.
  • Standalone applications control their own destiny but must keep up with underlying agent evolution (new output formats, new capabilities, new error patterns). Maestro's 1015-line error pattern file demonstrates this maintenance burden.

Resolution for canonical harness: This is a genuine tradeoff without a universal answer. For individual developers, configuration layers (Superpowers, ECC) provide the best value. For teams managing agent fleets, standalone applications (Agent Orchestrator, Maestro) are necessary. A canonical harness should therefore be both: a configuration layer that works standalone, and an optional orchestration layer for fleet management.

Source: superpowers-deep-analysis.md Section 1; everything-claude-code-deep-analysis.md Section 1; agent-orchestrator-deep-analysis.md Section 1; maestro-deep-analysis.md Section 1


4. Canonical Harness Architecture

4.1 Architecture Diagram

                    HUMAN DEVELOPER
                         |
           +-------------+-------------+
           |             |             |
     [CLI (ao)]   [Web Dashboard]  [Desktop App]
           |             |             |
           +------+------+------+------+
                  |             |
                  v             v
    +============================+
    |      ORCHESTRATION CORE    |
    |                            |
    |  +--------------------+   |
    |  | Workflow Engine     |   |  <-- State machine (from Agent Orchestrator)
    |  | (Plan->Execute->   |   |      with behavioral skills (from Superpowers)
    |  |  Review->Merge)    |   |
    |  +--------+-----------+   |
    |           |               |
    |  +--------v-----------+   |
    |  | Reaction Engine     |   |  <-- Event-driven automation (from Agent Orchestrator)
    |  | (event->action,    |   |      with escalation and retry
    |  |  retries, escalate)|   |
    |  +--------+-----------+   |
    |           |               |
    |  +--------v-----------+   |
    |  | Quality Gate Engine |   |  <-- Enforced gates (CI, tests, budget)
    |  | (TDD, Review,      |   |      + advisory skills (from Superpowers)
    |  |  Verification)     |   |
    |  +--------+-----------+   |
    |           |               |
    |  +--------v-----------+   |
    |  | Context Manager     |   |  <-- Strategic compaction (from ECC)
    |  | (progressive load, |   |      + progressive disclosure (from Superpowers)
    |  |  compaction, cache) |   |
    |  +--------+-----------+   |
    |           |               |
    +===========+===============+
                |
    +-----------+------------+
    |           |            |
    v           v            v
+--------+ +--------+ +----------+
| Plugin | | Plugin | | Plugin   |
| Slot:  | | Slot:  | | Slot:    |
| Agent  | | Work-  | | Tracker  |
| (Claude| | space  | | (GitHub  |
|  Code, | | (git   | |  Issues, |
|  Codex,| | work-  | |  Linear) |
|  Open- | | tree,  | +----------+
|  Code) | | clone) |
+--------+ +--------+
    |           |
    v           v
+--------+ +--------+
| Plugin | | Plugin |
| Slot:  | | Slot:  |
| Runtime| |Notifier|
| (tmux, | | (Slack,|
|  PTY,  | | desktop|
|  SSH)  | | webhook|
+--------+ +--------+

PERSISTENT LAYER:
+========================+
| SQLite Database        |  <-- Analytics, session history (from Maestro)
| Session Metadata       |  <-- Flat files with atomic ops (from Agent Orchestrator)
| Skill/Prompt Library   |  <-- Markdown skills (from Superpowers + ECC)
| Learned Patterns       |  <-- Instinct system (from ECC)
+========================+

4.2 Component Inventory

  • Workflow Engine: state machine managing plan->execute->review->merge transitions. Best source: Agent Orchestrator (16-state machine with deterministic transitions).
  • Reaction Engine: event-driven automation with configurable triggers, actions, retries, escalation. Best source: Agent Orchestrator (33 event types, YAML-configurable reactions).
  • Quality Gate Engine: enforced checkpoints (CI pass, test pass, budget check) plus advisory skills (TDD, verification, code review). Best sources: Superpowers (behavioral engineering) + Agent Orchestrator (runtime enforcement).
  • Context Manager: progressive skill loading, strategic compaction, session context persistence. Best sources: Superpowers (progressive disclosure, no-@ rule) + ECC (tool call counting, phase-aware compaction).
  • Plugin Registry: 8-slot plugin architecture for agent, workspace, tracker, SCM, notifier, runtime, terminal, lifecycle. Best source: Agent Orchestrator (PluginManifest + PluginModule pattern).
  • Agent Plugin: per-provider integration (CLI args, output parsing, error detection, session resume). Best source: Maestro (declarative arg builder, capability flags, output parser registry).
  • Workspace Plugin: isolated filesystem per task (worktree or clone) with post-create hooks. Best source: Agent Orchestrator (worktree with symlinks, post-create commands, cleanup).
  • Tracker Plugin: issue integration (GitHub, Linear) with prompt generation and branch naming. Best source: Agent Orchestrator (dual-transport Linear, GitHub GraphQL).
  • Notification Plugin: multi-channel alerts with priority routing. Best source: Agent Orchestrator (desktop, Slack, webhook with priority routing).
  • Session Manager: session CRUD, atomic reservation, metadata persistence, archive/restore. Best sources: Agent Orchestrator (O_EXCL reservation, cascading cleanup) + Maestro (multi-provider discovery).
  • Skill Library: markdown-based behavioral skills for the agent's internal process. Best sources: Superpowers (14 skills, anti-rationalization, DOT flowcharts) + ECC (44 skills, language-specific).
  • Analytics Store: SQLite-backed usage tracking, cost attribution, activity heatmaps. Best source: Maestro (stats-db with WAL, daily backups, corruption recovery).
  • CLI: developer-facing command interface. Best sources: Agent Orchestrator (ao CLI) + Maestro (maestro-cli with JSONL output).
  • Dashboard: web-based monitoring and control. Best sources: Agent Orchestrator (Kanban attention levels) + Maestro (keyboard-first, multi-tab).

4.3 Data Flow

1. INTAKE
   Issue (from GitHub/Linear)
   OR User request (from CLI/UI)
       |
       v
2. PLANNING
   [Workflow Engine: status = "planning"]
   Load planner skill (from Superpowers brainstorming + writing-plans)
   Planner agent (read-only tools) produces structured plan
   Human reviews and approves
       |
       v
3. WORKSPACE SETUP
   [Workflow Engine: status = "setting_up"]
   Create git worktree (workspace plugin)
   Run post-create hooks (npm install, etc.)
   Verify test baseline passes
       |
       v
4. EXECUTION
   [Workflow Engine: status = "working"]
   For each task in plan:
     a. Dispatch implementer subagent with full task text
     b. Implementer follows TDD skill (test first, then implement)
     c. Implementer self-reviews against checklist
     d. Dispatch spec compliance reviewer (skepticism primed)
     e. If spec issues: implementer fixes, re-review
     f. Dispatch code quality reviewer
     g. If quality issues: implementer fixes, re-review
     h. Mark task complete
     i. [Quality Gate Engine: verify tests pass]
       |
       v
5. VERIFICATION
   [Workflow Engine: status = "verifying"]
   Run full verification loop (build, type, lint, test, security, diff)
   [Quality Gate Engine: all checks must pass]
       |
       v
6. PR CREATION
   [Workflow Engine: status = "pr_open"]
   Push branch, create PR
   [Reaction Engine: monitors CI, reviews, conflicts]
       |
       v
7. REVIEW AND ITERATION
   [Workflow Engine: status = "review_pending" -> "approved" or "changes_requested"]
   If changes requested: [Reaction Engine: send-to-agent, agent addresses feedback]
   If CI fails: [Reaction Engine: send-to-agent, agent investigates]
       |
       v
8. MERGE
   [Workflow Engine: status = "mergeable" -> "merged"]
   Human or auto-merge (configurable)
   [Workflow Engine: status = "cleanup" -> "done"]
   Cleanup worktree, archive metadata
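The stage transitions above can be sketched as a small state machine. The status names follow this data flow, not Agent Orchestrator's actual 16-state enum, and the transition table is a hypothetical simplification:

```typescript
// Hypothetical workflow state machine over the statuses used in the data flow.
type Status =
  | "planning" | "setting_up" | "working" | "verifying" | "pr_open"
  | "review_pending" | "changes_requested" | "approved" | "mergeable"
  | "merged" | "cleanup" | "done";

// Legal forward transitions; anything not listed is rejected.
const TRANSITIONS: Record<Status, Status[]> = {
  planning: ["setting_up"],
  setting_up: ["working"],
  working: ["verifying"],
  verifying: ["pr_open", "working"],       // failed verification loops back
  pr_open: ["review_pending"],
  review_pending: ["approved", "changes_requested"],
  changes_requested: ["review_pending"],   // agent addresses feedback
  approved: ["mergeable"],
  mergeable: ["merged"],
  merged: ["cleanup"],
  cleanup: ["done"],
  done: [],
};

function transition(from: Status, to: Status): Status {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

Because the table is total over Status, illegal jumps (say, planning straight to merged) fail deterministically instead of depending on the agent's compliance.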

4.4 Extension Points

  1. Custom Agent Plugins: Add support for new AI agents by implementing the Agent interface (CLI args, output parser, error patterns, session resume). Follow Maestro's declarative pattern.

  2. Custom Workspace Plugins: Alternative isolation strategies (Docker containers, cloud workspaces, devcontainers) by implementing the Workspace interface.

  3. Custom Tracker Plugins: Additional issue trackers (Jira, Asana, Shortcut) by implementing the Tracker interface.

  4. Custom Skills: New behavioral skills (markdown files) for domain-specific workflows. Follow Superpowers' TDD-for-docs methodology.

  5. Custom Reactions: User-defined event->action mappings in YAML configuration. Follow Agent Orchestrator's reaction engine pattern.

  6. Custom Quality Gates: Additional verification steps (dependency scanning, license checking, performance benchmarks) plugged into the Quality Gate Engine.

  7. Custom Notifiers: Additional notification channels by implementing the Notifier interface.

  8. Custom CLI Commands: New CLI subcommands for project-specific workflows.


5. Missing from All Projects

Features that none of the four projects implement but that a mature harness should have.

5.1 Cost Budget Enforcement

No project implements spending limits, budget alerts, or automatic shutoff when costs exceed a threshold. Every project tracks costs at some level, but none prevents runaway spending. For production use, the harness must be able to pause or kill agents when a per-session, per-project, or global budget is exceeded.

Why it matters: A single misconfigured agent loop can consume hundreds of dollars in API credits before anyone notices. Without automatic shutoff, cost governance depends entirely on human monitoring.
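As a sketch of what such enforcement could look like (entirely hypothetical, since no project implements it):

```typescript
// Hypothetical budget enforcer: map accumulated spend against a limit to a
// decision the harness can act on.
type BudgetAction = "continue" | "warn" | "pause";

function checkBudget(spentUsd: number, limitUsd: number, warnRatio = 0.8): BudgetAction {
  if (spentUsd >= limitUsd) return "pause";             // hard stop: automatic shutoff
  if (spentUsd >= limitUsd * warnRatio) return "warn";  // notify before the limit hits
  return "continue";
}
```

The same check could run at three scopes (per-session, per-project, global), pausing the offending agent rather than killing the whole fleet.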

5.2 Container-Based Sandboxing

No project isolates agent execution in containers (Docker, gVisor, Firecracker). All agents run with the same privileges as the user. This means a compromised or misbehaving agent can read credentials, exfiltrate code, modify other workspaces, or execute arbitrary network requests.

Why it matters: AI agents executing arbitrary code on a developer's machine represent a significant security surface. Git worktree isolation protects files but not credentials, network, or system resources.

5.3 Dependency Graph Execution (DAG)

No project supports expressing task dependencies as a directed acyclic graph and executing them with maximum parallelism. Superpowers and ECC execute sequentially. Agent Orchestrator parallelizes at the issue level but not within issues. Maestro processes documents sequentially within playbooks.

Why it matters: Many feature implementations have natural parallelism (frontend and backend can develop simultaneously, tests for different modules are independent). Sequential execution wastes time when tasks could safely run in parallel.
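A DAG scheduler for this could compute maximal parallel waves from declared task dependencies. A hypothetical sketch:

```typescript
// Hypothetical wave scheduler: given task -> dependencies, group tasks into
// waves where every task's dependencies completed in an earlier wave.
function executionWaves(deps: Record<string, string[]>): string[][] {
  const done = new Set<string>();
  const remaining = new Set(Object.keys(deps));
  const waves: string[][] = [];
  while (remaining.size > 0) {
    const wave = [...remaining].filter((t) => deps[t].every((d) => done.has(d)));
    if (wave.length === 0) throw new Error("dependency cycle detected");
    for (const t of wave) { done.add(t); remaining.delete(t); }
    waves.push(wave); // every task in a wave can run in parallel
  }
  return waves;
}
```

For example, frontend and backend tasks with no mutual dependency land in the same wave; an integration task depending on both lands in the next.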

5.4 Automated Rollback

No project implements automatic rollback when an agent produces bad output. Recovery requires manual intervention (git revert, worktree cleanup, session restart). Maestro's pause/stop provides the closest mechanism, but it requires human judgment.

Why it matters: Long-running autonomous sessions (Maestro's Auto Run, Superpowers' SDD) can produce cascading errors where each task builds on the previous one's mistakes. Automatic rollback to the last known-good state would limit damage.

5.5 Cross-Agent Context Sharing

No project implements a mechanism for one agent to share discoveries with another agent working on a related task. Agent Orchestrator explicitly prevents agent-to-agent communication. Maestro's Group Chat enables conversation but not structured context transfer.

Why it matters: Agents working on related issues often discover the same codebase patterns, encounter the same bugs, or need the same context. Without sharing, each agent rediscovers this independently, wasting tokens and time.

5.6 Persistent Vector Store / RAG Integration

No project maintains a persistent vector store for code retrieval. All rely on the agent's built-in code search tools (Read, Grep, Glob). ECC's iterative retrieval skill provides a search methodology but not an indexed store.

Why it matters: Large codebases exceed the agent's context window. A persistent vector store with code embeddings would enable semantic search across the entire codebase, providing relevant context without consuming the full context window.

5.7 Multi-User / Team Coordination

No project supports multiple human users managing a shared fleet of agents. All are single-user. Agent Orchestrator's dashboard has no authentication. Maestro's desktop app is inherently single-user.

Why it matters: Production teams need multiple developers to monitor and intervene in agent work. Without multi-user support, the harness is limited to individual developer use.

5.8 Compliance and Audit Logging

No project maintains a structured audit log of agent actions for compliance purposes. Agent Orchestrator has metadata files and git history, but no centralized event store. No project maps to compliance frameworks (SOC2, GDPR, ISO 27001).

Why it matters: Enterprise adoption requires demonstrating that AI-generated code went through defined quality processes, that human review occurred, and that security checks passed. Without audit logging, this cannot be proven.

5.9 Agent Health Watchdog

No project implements a watchdog that detects and kills agents that hang indefinitely without producing output. Agent Orchestrator's lifecycle polling detects "stuck" agents but relies on activity detection heuristics. Maestro has a 5-minute timeout for grooming sessions only, not for regular agent queries.

Why it matters: Agents can enter infinite loops, wait indefinitely for unavailable resources, or simply hang due to provider issues. Without a watchdog, these agents consume resources indefinitely.
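The core of such a watchdog is an output-staleness check (hypothetical sketch; the idle threshold would be configurable per agent):

```typescript
// Hypothetical agent watchdog predicate: flag a session as hung when it has
// produced no output for longer than the allowed idle window. A poller would
// run this periodically and pause or kill sessions it flags.
function isHung(lastOutputAtMs: number, nowMs: number, maxIdleMs: number): boolean {
  return nowMs - lastOutputAtMs > maxIdleMs;
}
```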

5.10 Prompt A/B Testing and Optimization

No project implements systematic A/B testing of prompt variations to determine which produces better agent behavior. Superpowers tests skills against agent behavior (the pressure-testing methodology) but does not compare prompt variants systematically.

Why it matters: Prompt engineering is currently ad-hoc. Systematic measurement of prompt effectiveness would enable evidence-based optimization of skill instructions, system prompts, and review templates.


6. Source Trace

Every major claim in this report is traced to the specific report section(s) from which it was derived.

6.1 Orchestration Model Claims

  • Superpowers uses brainstorm->plan->execute->review->merge pipeline (superpowers-deep-analysis.md Section 4.1)
  • ECC defines 4 workflow types in /orchestrate command (everything-claude-code-deep-analysis.md Section 4.1)
  • Agent Orchestrator has 16-state lifecycle state machine (agent-orchestrator-deep-analysis.md Section 9.1, SessionStatus enum)
  • Maestro treats pipeline as pass-through (maestro-deep-analysis.md Section 4.1)

6.2 Multi-Agent Coordination Claims

  • Superpowers prohibits parallel task execution in SDD (superpowers-deep-analysis.md Section 6.1, line 205 of SDD skill)
  • Agent Orchestrator has embarrassingly parallel model (agent-orchestrator-deep-analysis.md Section 5.1)
  • Maestro's Group Chat uses moderator pattern (maestro-deep-analysis.md Section 5.1)
  • ECC documents cascade method for manual parallelism (everything-claude-code-deep-analysis.md Section 6.2)

6.3 Quality Gate Claims

  • Superpowers has two-stage code review (superpowers-deep-analysis.md Section 4.5, SDD workflow)
  • ECC has 6-phase verification loop (everything-claude-code-deep-analysis.md Section 4.5)
  • Agent Orchestrator has fail-closed CI status (agent-orchestrator-deep-analysis.md Section 16.1, getCISummary)
  • Maestro has no automated verification layer (maestro-deep-analysis.md Section 4.6)

6.4 Context Management Claims

  • Superpowers uses progressive disclosure with no-@ rule (superpowers-deep-analysis.md Section 9.1, 9.6)
  • ECC tracks tool calls for compaction suggestions (everything-claude-code-deep-analysis.md Section 9.2)
  • Agent Orchestrator has three-layer prompt composition (agent-orchestrator-deep-analysis.md Section 8.1)
  • Maestro provides context groom/merge/transfer operations (maestro-deep-analysis.md Section 9.1-9.4)

6.5 Security Claims

  • No project implements container-based sandboxing (superpowers Section 7.5; ECC Section 7.5; AO Section 6.3; Maestro Section 7.5)
  • Agent Orchestrator uses execFile everywhere (agent-orchestrator-deep-analysis.md Section 11.1)
  • ECC has AgentShield for external security scanning (everything-claude-code-deep-analysis.md Section 12.3)
  • Agent Orchestrator has gitleaks in CI (agent-orchestrator-deep-analysis.md Section 11.3)

6.6 Cost Governance Claims

  • No project implements cost budgets or automatic shutoff (superpowers Section 15.4; ECC Section 15.4; AO Section 14.3; Maestro Section 15.5)
  • Maestro has SQLite-backed analytics with usage dashboard (maestro-deep-analysis.md Section 15.2-15.3)
  • Superpowers documents cache utilization benefits of sequential execution (superpowers-deep-analysis.md Section 15.3)
  • ECC documents token optimization strategies (everything-claude-code-deep-analysis.md Section 9.1)

6.7 Git Worktree Consensus Claims

  • Superpowers: using-git-worktrees skill (superpowers-deep-analysis.md Section 4.3)
  • ECC: git worktree pattern documentation (everything-claude-code-deep-analysis.md Section 7.1)
  • Agent Orchestrator: workspace-worktree plugin (agent-orchestrator-deep-analysis.md Section 6.1)
  • Maestro: git:worktreeSetup IPC handler (maestro-deep-analysis.md Section 7.1)

6.8 Anti-Rationalization Claims

  • Superpowers has 40+ rationalization prevention entries (superpowers-deep-analysis.md Section 11.5, Appendix C)
  • 7 pressure test types documented (superpowers-deep-analysis.md Section 11.6)
  • Persuasion principles based on Cialdini 2021, Meincke et al. 2025 (superpowers-deep-analysis.md Section 2.3)
  • TDD for documentation methodology (superpowers-deep-analysis.md Section 11.6)

6.9 Provider Compatibility Claims

  • Claude Code is primary across all 4 projects (all four reports, Executive Summaries)
  • Maestro supports 4 active + 3 planned providers (maestro-deep-analysis.md Section 17.2)
  • Agent Orchestrator has 8-slot plugin architecture (agent-orchestrator-deep-analysis.md Section 2.2)
  • Superpowers supports 4 platforms with tool mapping (superpowers-deep-analysis.md Section 17.1-17.2)

6.10 Missing Feature Claims

| Claim | Source |
|---|---|
| No project has container sandboxing | All four reports, Isolation Model sections |
| No project has DAG execution | All four reports, Parallelization sections |
| No project has cost budget enforcement | All four reports, Cost/Usage sections |
| No project has multi-user support | All four reports, Operational Assumptions sections |
| No project has persistent vector store | All four reports, Context Handling sections |
| No project has automated rollback | All four reports, Failure Modes sections |

Cross-Links to Individual Reports

| Topic | superpowers-deep-analysis.md | everything-claude-code-deep-analysis.md | agent-orchestrator-deep-analysis.md | maestro-deep-analysis.md |
|---|---|---|---|---|
| Design Philosophy | Section 2 | Section 2 | Section 1 | Section 2 |
| Core Architecture | Section 3 | Section 3 | Section 2 | Section 3 |
| Workflow Pipeline | Section 4 | Section 4 | Section 3 | Section 4 |
| Subagent Orchestration | Section 5 | Section 5 | Section 4 | Section 5 |
| Multi-Agent Parallelization | Section 6 | Section 6 | Section 5 | Section 6 |
| Isolation Model | Section 7 | Section 7 | Section 6 | Section 7 |
| Human-in-the-Loop | Section 8 | Section 8 | Section 7 | Section 8 |
| Context Handling | Section 9 | Section 9 | Section 8 | Section 9 |
| Session Lifecycle | Section 10 | Section 10 | Section 9 | Section 10 |
| Code Quality Gates | Section 11 | Section 11 | Section 10 | Section 11 |
| Security | Section 12 | Section 12 | Section 11 | Section 12 |
| Hooks/Automation | Section 13 | Section 13 | Section 12 | Section 13 |
| CLI/UX | Section 14 | Section 14 | Section 13 | Section 14 |
| Cost/Usage | Section 15 | Section 15 | Section 14 | Section 15 |
| Tooling/Dependencies | Section 16 | Section 16 | Section 15 | Section 16 |
| Provider Compatibility | Section 17 | Section 17 | Section 16 | Section 17 |
| Operational Assumptions | Section 18 | Section 18 | Section 17 | Section 18 |
| Failure Modes | Section 19 | Section 19 | Section 18 | Section 19 |
| Governance | Section 20 | Section 20 | Section 19 | Section 20 |
| Roadmap/Gaps | Section 21 | Section 21 | Section 20 | Section 21 |
| Borrowing Recommendations | Section 22 | Section 22 | Section 21 | Section 22 |

AI Harness Analysis — Report Index

Generated: 2026-02-22 | Tool: Claude Code (Opus 4.6) | Total: 10,765 lines across 6 reports

Overview

This directory contains a comprehensive comparative analysis of 4 AI coding harness projects, produced according to the master prompt. The goal: build the canonical feature set for an AI harness focused on scalable agent-team execution, orchestration quality, code quality governance, and operational maturity.


Individual Project Reports

Deep technical analysis of each repository, covering 21 analysis dimensions with concrete file citations, confidence scores, and cross-links.

| # | Report | Project | Lines | Key Insight |
|---|---|---|---|---|
| 1 | superpowers-deep-analysis.md | obra/superpowers | 2,005 | Best-in-class prompt engineering with anti-rationalization tables, TDD for skills, and two-stage code review. Pure markdown/skill framework with zero runtime enforcement. |
| 2 | everything-claude-code-deep-analysis.md | affaan-m/everything-claude-code | 2,141 | Largest skill library (44 skills, 13 agents, 32 commands). Configuration-layer harness with session management, continuous learning, and multi-language support. |
| 3 | agent-orchestrator-deep-analysis.md | ComposioHQ/agent-orchestrator | 2,806 | Only true runtime orchestrator. Plugin architecture, tmux-based process isolation, hash-based session directories, reaction engine with escalation, and fail-closed CI. |
| 4 | maestro-deep-analysis.md | RunMaestro/Maestro | 2,006 | Full Electron desktop app (1200 source files, 490 tests). Multi-provider support (8 agents), Group Chat orchestration, Auto Run checkbox workflows, Symphony open-source contribution system. |

Synthesis Reports

Cross-project analysis synthesizing patterns, gaps, and recommendations.

| # | Report | Lines | Purpose |
|---|---|---|---|
| 5 | harness-consensus-report.md | 831 | Cross-project consensus features, canonical architecture, divergence points, and features missing from all projects. |
| 6 | final-harness-gap-report.md | 976 | Gap analysis for Maestro specifically, with prioritized 3-phase roadmap and implementation-ready recommendations with TypeScript interfaces. |

Reading Order

For maximum context recovery (designed for recursive link traversal):

  1. Start here — This index
  2. Individual reports (any order) — Each is self-contained with its own executive summary
  3. Consensus report — Synthesizes patterns across all 4 projects
  4. Gap report — Actionable roadmap for Maestro improvement

Cross-Reference Map

By Analysis Dimension

| Dimension | Superpowers | ECC | Agent Orchestrator | Maestro | Consensus | Gap |
|---|---|---|---|---|---|---|
| Orchestration Model | S:4 | E:4 | A:4 | M:4 | C:1.1 | G:3.1 |
| Multi-Agent | S:5-6 | E:5-6 | A:5-6 | M:5-6 | C:1.2 | G:3.2 |
| Code Quality | S:10-11 | E:10-11 | A:10-11 | M:11 | C:1.3 | G:3.3 |
| Context Mgmt | S:9 | E:9 | A:9 | M:9 | C:1.4 | G:3.4 |
| Session Lifecycle | S:10 | E:10 | A:10 | M:10 | C:1.5 | G:3.5 |
| Human-in-Loop | S:8 | E:8 | A:8 | M:8 | C:1.6 | G:3.6 |
| Hooks/Automation | S:13 | E:13 | A:13 | M:13 | C:1.7 | |
| Cost/Governance | S:15 | E:15 | A:15 | M:15 | C:1.8 | G:3.7 |
| Security | S:12 | E:12 | A:12 | M:12 | C:1.9 | G:3.8 |
| Provider Compat | S:16 | E:16 | A:16 | M:17 | C:1.10 | |

Key: S=Superpowers, E=ECC, A=Agent Orchestrator, M=Maestro, C=Consensus, G=Gap. Numbers are section numbers.

By Recommendation Type

| What to Borrow | From | For Maestro | Gap Report Section |
|---|---|---|---|
| Anti-rationalization engineering | Superpowers (S:11.5) | Prompt templates | G:6.1 |
| Two-stage code review | Superpowers (S:11) | Quality pipeline | G:6.2 |
| Plugin architecture | Agent Orchestrator (A:3) | Extensibility | G:6.3 |
| Reaction engine | Agent Orchestrator (A:13) | Automation | G:6.4 |
| Fail-closed CI | Agent Orchestrator (A:11) | Security | G:6.5 |
| Session management | ECC (E:10) | Persistence | G:3.5 |
| Continuous learning | ECC (E:13) | Knowledge capture | G:3.7 |
| Strategic compaction | ECC (E:9) | Context management | G:3.4 |

Source Repositories

| Repository | Stars | Language | License |
|---|---|---|---|
| obra/superpowers | | Markdown/Bash | |
| affaan-m/everything-claude-code | | JavaScript/Markdown | |
| ComposioHQ/agent-orchestrator | | TypeScript | |
| RunMaestro/Maestro | | TypeScript/Electron | |

Methodology

  • All repositories cloned at HEAD as of 2026-02-22
  • Analysis performed by 4 parallel Claude Opus 4.6 agents, each reading the full codebase
  • Synthesis performed by 2 additional agents reading all 4 individual reports
  • Every major claim includes confidence scores (High/Medium/Low) and file path citations
  • Reports designed for recursive context recovery via cross-links

Master prompt: master_prompt.md

Maestro Deep Analysis Report

Project: RunMaestro/Maestro
Repository: https://github.com/RunMaestro/Maestro
Version analyzed: 0.15.0
License: AGPL-3.0
Author: Pedram Amini (pedram@runmaestro.ai)
Analysis date: 2026-02-22
Codebase size: ~672,000 lines of TypeScript across ~1,200 source files and ~490 test files


Table of Contents

  1. Executive Summary
  2. Design Philosophy and Abstractions
  3. Core Architecture Model
  4. Harness Workflow: Spec to Plan to Execute to Verify to Merge
  5. Subagent/Task Orchestration Model
  6. Multi-Agent / Parallelization Strategy
  7. Isolation Model
  8. Human-in-the-Loop Controls
  9. Context Handling Strategy
  10. Session Lifecycle and Persistence
  11. Code Quality Gates
  12. Security and Compliance Mechanisms
  13. Hooks, Automation Surface, and Fail-Safe Behavior
  14. CLI/UX and Automation Ergonomics
  15. Cost/Usage Visibility and Governance
  16. Tooling and Dependency Surface
  17. External Integrations and Provider Compatibility
  18. Operational Assumptions and Constraints
  19. Failure Modes and Issues Observed
  20. Governance and Guardrails
  21. Roadmap/Evolution Signals, Missing Areas, Unresolved Issues
  22. Current Gaps That Other Projects Might Fill
  23. Cross-Links

1. Executive Summary

Maestro is the most ambitious and fully-realized project in the harness comparison set. It is a cross-platform Electron desktop application (with mobile PWA support) for orchestrating fleets of AI coding agents. Unlike the other three projects, which are primarily configuration layers, shell scripts, or lightweight orchestrators on top of Claude Code, Maestro is a standalone product with its own GUI, process management layer, multi-provider support, and a rich feature ecosystem including analytics, gamification, and community-driven open source contribution (Symphony).

Key differentiators:

  • Full desktop application with keyboard-first interface (not just a terminal wrapper)
  • Multi-provider support: Claude Code, OpenAI Codex, OpenCode, Factory Droid (4 active, 2 planned)
  • Auto Run system with file-based task documents and playbook management
  • Group Chat with moderator AI for cross-agent coordination
  • Symphony: a community contribution platform using GitHub Issues + Auto Run
  • Git worktree integration for true parallel development
  • CLI tool (maestro-cli) for headless/CI operation
  • Mobile remote control via PWA + WebSocket + Cloudflare tunnels
  • SQLite-backed analytics with Usage Dashboard
  • Extensive documentation (CLAUDE.md ecosystem, CONSTITUTION.md, ARCHITECTURE.md, Mintlify docs site)

Core limitation:

  • Maestro is a "pass-through" orchestrator: it does not itself generate plans, specs, or code. It dispatches prompts to underlying AI agents and manages their lifecycle. The intelligence comes from the agents; Maestro provides the podium.

Confidence: High -- This assessment is based on thorough reading of 50+ source files, all documentation files, configuration, CI/CD, and test infrastructure.


2. Design Philosophy and Abstractions

2.1 The Constitution

Maestro has a formally documented design philosophy in /tmp/ai-harness-repos/Maestro/CONSTITUTION.md (178 lines). This is unique among the four projects analyzed.

Six tenets (lines 28-112):

  1. Unattended Excellence (Solo Mode) -- "The measure of Maestro's success is how long agents run without intervention." Auto Run is a first-class citizen. Error recovery should be automatic. The leaderboard celebrates autonomy.

  2. The Conductor's Perspective (Interactive Mode) -- "You are the maestro. The agents are your orchestra." Overview and control over details. Batch operations over individual ones. Frictionless agent switching.

  3. Keyboard Sovereignty -- Every action has a keyboard path. Focus must be predictable. Escape always improves your situation. No mouse-only features.

  4. Instant Response -- UI interactions in milliseconds. Heavy operations in background. Perceived performance matters.

  5. Delightful Focus -- "Say no to feature creep that dilutes the core experience." Polish before adding.

  6. Transparent Complexity -- Progressive disclosure. Sensible defaults. Power features accessible but not intrusive.

Confidence: High -- Directly sourced from CONSTITUTION.md.

2.2 The Mental Model

Maestro embodies the "conductor/orchestra" metaphor:

  • Agents are instruments (each with their own workspace, terminal, AI tabs)
  • The user is the conductor (directing, not playing each instrument)
  • Auto Run is the pre-programmed score (tasks run without intervention)
  • Group Chat is the ensemble rehearsal (agents coordinate via a moderator)
  • Symphony is the concert hall (open source community contributions)

The key abstraction boundary is that Maestro is NOT an IDE, NOT a single-agent wrapper, NOT a chat interface, and NOT a project manager. It's a fleet management tool for AI agents.

Evidence: CONSTITUTION.md lines 140-148:

- Not an IDE: We complement your editor, not replace it
- Not a single-agent wrapper: One agent is just a small orchestra
- Not a chat interface: Conversations are work sessions, not dialogues
- Not a project manager: We execute, not plan (that's what agents do)

2.3 Agent Behavioral Guidelines

CLAUDE.md (lines 23-44) establishes explicit behavioral rules for AI agents working on the codebase:

  • Surface Assumptions Early -- Never silently fill in ambiguous requirements
  • Manage Confusion Actively -- STOP and name inconsistencies
  • Push Back When Warranted -- Not a yes-machine
  • Enforce Simplicity -- Actively resist overcomplication
  • Maintain Scope Discipline -- Touch only what's asked
  • Dead Code Hygiene -- Identify but ask before removing

These are meta-guidelines for how AI should interact with the Maestro codebase during development, not runtime behavioral controls.

Confidence: High -- Directly sourced from CLAUDE.md.


3. Core Architecture Model

3.1 Dual-Process Electron Architecture

File: /tmp/ai-harness-repos/Maestro/src/main/index.ts (724 lines)

Maestro uses Electron's main/renderer split with strict context isolation:

| Process | Location | Purpose |
|---|---|---|
| Main Process | src/main/ | Node.js backend: process spawning, IPC handlers, file system, git, web server |
| Renderer Process | src/renderer/ | React frontend: UI components, hooks, services |
| Web Process | src/web/ | PWA for mobile: WebSocket client, mobile-optimized UI |
| CLI Process | src/cli/ | Headless operation: Commander.js, batch processing |
| Shared Module | src/shared/ | Cross-process types, utilities, constants |

IPC Security Model (from ARCHITECTURE.md lines 136-202):

  • Context isolation: Enabled (renderer has no Node.js access)
  • Node integration: Disabled (no require() in renderer)
  • Preload script exposes window.maestro API with 17+ namespaces covering process management, git, file system, agents, settings, web server, auto run, playbooks, attachments, notifications, and more.

Confidence: High -- Verified in ARCHITECTURE.md and src/main/index.ts.

3.2 Process Manager

Files:

  • /tmp/ai-harness-repos/Maestro/src/main/process-manager/ProcessManager.ts
  • /tmp/ai-harness-repos/Maestro/src/main/process-manager/types.ts

The ProcessManager class is the core runtime engine. It manages two types of processes:

  1. PTY Processes (via node-pty) -- For terminal sessions with full shell emulation
  2. Child Processes (via child_process.spawn) -- For AI agents in batch mode

Key design decisions:

  • Uses spawn() with shell: false for security (no injection vulnerabilities)
  • Signal escalation: SIGINT first, escalates to SIGTERM after 2 seconds if process doesn't exit
  • Per-process output parsers: Each agent type has its own JSON output parser
  • Data buffering via DataBufferManager to batch rapid updates
  • SSH remote execution support via SshCommandRunner
// ProcessManager.ts line 29
export class ProcessManager extends EventEmitter {
    private processes: Map<string, ManagedProcess> = new Map();
    private bufferManager: DataBufferManager;
    private ptySpawner: PtySpawner;
    private childProcessSpawner: ChildProcessSpawner;
    private localCommandRunner: LocalCommandRunner;
    private sshCommandRunner: SshCommandRunner;
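
The signal escalation described above can be sketched against an abstract process handle. The `Stoppable` interface and `stopWithEscalation` helper below are illustrative assumptions, not Maestro's actual API (the real ProcessManager works on node-pty and child_process handles):

```typescript
// Hedged sketch of the SIGINT-then-SIGTERM escalation described above.
// `Stoppable` abstracts the managed process; the real ProcessManager
// may differ in detail.
interface Stoppable {
  kill(signal: "SIGINT" | "SIGTERM"): void;
  onExit(cb: () => void): void;
}

function stopWithEscalation(proc: Stoppable, graceMs = 2000): void {
  let exited = false;
  proc.onExit(() => {
    exited = true;
  });
  proc.kill("SIGINT"); // polite interrupt first
  setTimeout(() => {
    if (!exited) proc.kill("SIGTERM"); // escalate if still running
  }, graceMs);
}
```

The grace window (2 seconds in Maestro) gives the agent a chance to flush output and exit cleanly before the harder signal is sent.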

Events emitted (from types.ts lines 109-121):

  • data, stderr, exit, command-exit
  • usage (token stats)
  • session-id (provider session ID)
  • agent-error (auth, rate limit, context exhaustion)
  • thinking-chunk (streaming reasoning)
  • tool-execution (tool use events)
  • slash-commands (discoverable commands)
  • query-complete (with timing data)

Confidence: High -- Directly read from source files.

3.3 Output Parser Architecture

Files:

  • /tmp/ai-harness-repos/Maestro/src/main/parsers/index.ts (103 lines)
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/agent-output-parser.ts
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/claude-output-parser.ts (505 lines)
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/codex-output-parser.ts
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/opencode-output-parser.ts
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/factory-droid-output-parser.ts
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/usage-aggregator.ts

Maestro uses a registry pattern for output parsers. Each AI agent produces output in a different JSON format, and dedicated parser classes normalize this into a unified ParsedEvent type.

Initialization at app startup:

// parsers/index.ts line 76
export function initializeOutputParsers(): void {
    clearParserRegistry();
    registerOutputParser(new ClaudeOutputParser());
    registerOutputParser(new OpenCodeOutputParser());
    registerOutputParser(new CodexOutputParser());
    registerOutputParser(new FactoryDroidOutputParser());
}

The AgentOutputParser interface requires:

  • parseJsonLine(line: string): ParsedEvent | null -- Transform a raw JSON line to a normalized event
  • isResultMessage(event: ParsedEvent): boolean -- Detect final result messages
  • extractSessionId(event: ParsedEvent): string | null -- Pull provider session ID
  • extractUsage(event: ParsedEvent): Usage | null -- Pull token/cost stats
  • extractSlashCommands(event: ParsedEvent): string[] | null -- Pull discoverable commands
  • detectErrorFromLine(line: string): AgentError | null -- Detect errors from structured JSON
  • detectErrorFromExit(exitCode, stderr, stdout): AgentError | null -- Detect errors from exit
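
The contract above can be sketched as a TypeScript interface. Method names come from the bullet list; the `ParsedEvent`, `Usage`, and `AgentError` shapes below are simplified assumptions, not the actual Maestro type definitions:

```typescript
// Simplified event/usage/error shapes (assumptions for illustration).
type ParsedEvent = { type: string; [key: string]: unknown };
type Usage = { inputTokens: number; outputTokens: number; costUsd?: number };
type AgentError = { errorType: string; message: string; recoverable: boolean };

// The parser contract each provider (Claude, Codex, OpenCode, Factory
// Droid) implements so the rest of Maestro can stay provider-agnostic.
interface AgentOutputParser {
  parseJsonLine(line: string): ParsedEvent | null;
  isResultMessage(event: ParsedEvent): boolean;
  extractSessionId(event: ParsedEvent): string | null;
  extractUsage(event: ParsedEvent): Usage | null;
  extractSlashCommands(event: ParsedEvent): string[] | null;
  detectErrorFromLine(line: string): AgentError | null;
  detectErrorFromExit(
    exitCode: number | null,
    stderr: string,
    stdout: string,
  ): AgentError | null;
}
```

Normalizing every provider's JSON stream into one event type is what lets the registry pattern work: callers never branch on agent type after parsing.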

Claude Code parser (claude-output-parser.ts) is the most complex (505 lines). It handles:

  • Extended thinking blocks (Claude 3.7+, Claude 4+): Extracts thinking content blocks separately from text blocks, routing them to thinking-chunk events for streaming display
  • Redacted thinking: redacted_thinking blocks (safety-encrypted reasoning) are excluded since their content cannot be displayed
  • Tool use blocks: Extracted from content[] arrays and surfaced as toolUseBlocks for tool execution events
  • Mixed stderr/JSON parsing: When Claude Code outputs a line like Error streaming...: 400 {"type":"error","error":{"message":"prompt is too long"}}, the parser finds the embedded JSON starting at { and extracts the structured error
  • Usage aggregation: Calls aggregateModelUsage() to combine modelUsage (per-model breakdown) with usage (legacy flat format) and total_cost_usd

Error detection strategy is deliberately conservative:

// claude-output-parser.ts line 326-333
// IMPORTANT: Only detect errors from structured JSON error events, not from
// arbitrary text content. Pattern matching on conversational text leads to
// false positives (e.g., AI discussing "timeout" triggers timeout error).
//
// Error detection sources (in order of reliability):
// 1. Structured JSON: { type: "error", message: "..." }
// 2. stderr output (handled separately by process-manager)
// 3. Non-zero exit code (handled by detectErrorFromExit)

This is a mature design decision born from real false-positive issues.

Confidence: High -- Directly from source code.

3.4 Error Pattern System

File: /tmp/ai-harness-repos/Maestro/src/main/parsers/error-patterns.ts (1015 lines)

The error pattern system defines regex-based error detection for all supported agents and SSH remote execution. Each agent has patterns organized by error type:

| Error Type | Claude Code Patterns | Codex Patterns | OpenCode Patterns | Factory Droid Patterns | SSH Patterns |
|---|---|---|---|---|---|
| auth_expired | 10 patterns | 4 patterns | 2 patterns | 5 patterns | -- |
| token_exhaustion | 7 patterns | 3 patterns | 4 patterns | 4 patterns | -- |
| rate_limited | 6 patterns | 5 patterns | 2 patterns | 4 patterns | -- |
| network_error | 5 patterns | 4 patterns | 4 patterns | 4 patterns | 10 patterns |
| permission_denied | 4 patterns | 2 patterns | -- | 3 patterns | 5 patterns |
| agent_crashed | 1 pattern | 5 patterns | 3 patterns | 1 pattern | 9 patterns |
| session_not_found | 3 patterns | 2 patterns | 2 patterns | 2 patterns | -- |

Notable implementation details:

  1. Dynamic error messages: Some patterns use functions instead of strings to construct messages from regex capture groups:
// error-patterns.ts line 127-133
pattern: /prompt.*too\s+long:\s*(\d+)\s*tokens?\s*>\s*(\d+)\s*maximum/i,
message: (match: RegExpMatchArray) => {
    const actual = parseInt(match[1], 10).toLocaleString('en-US');
    const max = parseInt(match[2], 10).toLocaleString('en-US');
    return `Prompt is too long: ${actual} tokens exceeds the ${max} token limit.`;
},
  2. SSH error patterns are checked in addition to agent-specific patterns when running via SSH remote execution, covering transport-level errors (connection refused, host key verification, broken pipe, shell profile syntax errors on the remote host).

  3. Shell parse error detection: The SSH patterns detect remote host .zshrc/.bashrc syntax errors with line numbers, providing actionable messages like "Check .zshrc or .bashrc on the remote server".

  4. Recoverability flags: Each pattern declares whether the error is recoverable (true) or fatal (false). Permission denied and host key verification failures are non-recoverable; rate limits and network errors are recoverable.
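
Putting these pieces together, one pattern entry plus a first-match lookup could look like the sketch below. The `ErrorPattern` shape and `matchError` helper are assumptions modeled on the behaviors described above, not the actual error-patterns.ts code:

```typescript
// Hypothetical shape for one entry in the pattern tables above; field
// names are assumptions, not the actual error-patterns.ts definitions.
interface ErrorPattern {
  pattern: RegExp;
  errorType:
    | "auth_expired"
    | "token_exhaustion"
    | "rate_limited"
    | "network_error"
    | "permission_denied"
    | "agent_crashed"
    | "session_not_found";
  message: string | ((match: RegExpMatchArray) => string);
  recoverable: boolean;
}

// First match wins: walk the table and materialize the message,
// calling the function form with the capture groups when present.
function matchError(patterns: ErrorPattern[], text: string) {
  for (const p of patterns) {
    const m = text.match(p.pattern);
    if (m) {
      return {
        errorType: p.errorType,
        message: typeof p.message === "function" ? p.message(m) : p.message,
        recoverable: p.recoverable,
      };
    }
  }
  return null;
}
```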

Confidence: High -- Directly from source code, all pattern counts verified.

3.5 Agent Data Model

File: CLAUDE-SESSION.md (Session interface)

Each "agent" in the Left Bar is backed by a Session object with:

  • Identity: id, name, groupId, toolType (provider), state, inputMode
  • Paths: cwd (mutable), projectRoot (immutable), fullPath
  • Processes: aiPid, port
  • Multi-tab: aiTabs[], activeTabId, filePreviewTabs[], unifiedTabOrder[]
  • Execution queue: executionQueue[], isProcessingQueue
  • Usage: usageStats, contextUsage, workLog
  • Git: isGitRepo, changedFiles, gitBranches, gitTags
  • Auto Run: autoRunFolderPath, autoRunSelectedFile, autoRunMode
  • SSH: sshRemoteId, sessionSshRemoteConfig
  • Error: agentError, agentErrorPaused

Each session runs two processes simultaneously: an AI agent process (suffixed -ai) and a terminal process (suffixed -terminal). Users switch between them with Cmd+J.

Confidence: High -- Directly from CLAUDE-SESSION.md and ARCHITECTURE.md.

3.6 IPC Handler Registry

File: /tmp/ai-harness-repos/Maestro/src/main/index.ts (lines 27-59)

The main process registers 30+ handler modules:

registerGitHandlers, registerAutorunHandlers, registerPlaybooksHandlers,
registerHistoryHandlers, registerAgentsHandlers, registerProcessHandlers,
registerPersistenceHandlers, registerSystemHandlers, registerClaudeHandlers,
registerAgentSessionsHandlers, registerGroupChatHandlers, registerDebugHandlers,
registerSpeckitHandlers, registerOpenSpecHandlers, registerContextHandlers,
registerMarketplaceHandlers, registerStatsHandlers, registerDocumentGraphHandlers,
registerSshRemoteHandlers, registerFilesystemHandlers, registerAttachmentsHandlers,
registerWebHandlers, registerLeaderboardHandlers, registerNotificationsHandlers,
registerSymphonyHandlers, registerTabNamingHandlers, registerAgentErrorHandlers,
registerDirectorNotesHandlers, registerWakatimeHandlers

Each handler module lives in /tmp/ai-harness-repos/Maestro/src/main/ipc/handlers/ (30 files).

The handler registration follows a consistent dependency injection pattern:

// Each handler module exports a register function that receives dependencies
export function registerContextHandlers(deps: ContextHandlerDependencies): void {
    const { getProcessManager, getAgentDetector } = deps;
    ipcMain.handle('context:groomContext', withIpcErrorLogging(
        handlerOpts('groomContext'),
        async (projectRoot, agentType, prompt, options) => {
            const processManager = requireDependency(getProcessManager, 'Process manager');
            // ... handler logic
        }
    ));
}

Key architectural pattern: All handlers use withIpcErrorLogging() for consistent error handling and requireDependency() for runtime dependency validation. This prevents silent failures when handlers are called before dependencies are initialized.

Confidence: High -- Directly from source code.


4. Harness Workflow: Spec to Plan to Execute to Verify to Merge

4.1 Overview

Maestro does NOT itself implement a spec-to-plan-to-execute pipeline. It provides the infrastructure for users to build such workflows using Auto Run documents and Playbooks. The actual planning and execution intelligence comes from the AI agents being orchestrated.

The workflow is:

1. SPEC:     User writes markdown spec documents (Auto Run docs with checkboxes)
2. PLAN:     User orders documents in a Playbook (BatchRunConfig)
3. EXECUTE:  useBatchProcessor sends each checkbox task to the AI agent
4. VERIFY:   AI agent checks tasks; Maestro tracks completion
5. MERGE:    Git worktree integration + one-click PR creation

4.2 Spec Creation

Relevant files:

  • ARCHITECTURE.md (Auto Run System, lines 718-897)
  • src/renderer/components/ (AutoRun.tsx, AutoRunSetupModal.tsx, AutoRunDocumentSelector.tsx)
  • src/prompts/autorun-default.md, src/prompts/wizard-document-generation.md

Users create markdown documents with checkbox items:

# Task: Add Unit Tests for Auth Module

## Objectives
- [ ] Create `src/__tests__/auth.test.ts`
- [ ] Add tests for `login()` function
- [ ] Ensure `npm test` passes
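
A hypothetical helper shows how such checkbox tasks can be extracted from a document; Maestro's real parsing lives in its batch processor and may differ:

```typescript
// Hypothetical helper: extract unchecked checkbox tasks from an Auto Run
// document. Matches "- [ ] task" and "* [ ] task" list items; checked
// items ("- [x]") are skipped.
function parseUncheckedTasks(markdown: string): string[] {
  const tasks: string[] = [];
  for (const line of markdown.split("\n")) {
    const m = line.match(/^\s*[-*]\s+\[ \]\s+(.*)$/);
    if (m) tasks.push(m[1].trim());
  }
  return tasks;
}
```

The checkbox-as-state design means the document itself is the progress tracker: re-reading it after each agent run yields the remaining work.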

The Auto Run system provides:

  • Edit/Preview modes with auto-save (5-second debounce)
  • Image support for documents (saved to document-specific folders)
  • Wizard-assisted spec generation via AI (see src/prompts/wizard-*.md)
  • Spec-Kit integration (GitHub's spec-kit prompts bundled)
  • OpenSpec integration (Fission-AI's OpenSpec prompts bundled)

Evidence: ARCHITECTURE.md lines 877-883:

1. Setup: User selects Runner Docs folder via AutoRunSetupModal
2. Document Selection: Documents appear in AutoRunDocumentSelector dropdown
3. Editing: AutoRun component provides edit/preview modes with auto-save
4. Batch Configuration: BatchRunnerModal allows ordering documents
5. Playbooks: Save/load configurations for repeated batch runs
6. Execution: useBatchProcessor hook processes documents sequentially
7. Progress: RightPanel shows document and task-level progress

Confidence: High -- Architecture documentation and source files confirm this workflow.

4.3 Planning (Playbooks)

Relevant files:

  • ARCHITECTURE.md (lines 779-793)
  • src/cli/services/playbooks.ts
  • src/cli/services/batch-processor.ts
  • src/main/ipc/handlers/playbooks.ts

Playbooks are saved configurations that define:

interface Playbook {
    id: string;
    name: string;
    documents: PlaybookDocumentEntry[];  // Ordered list with reset flags
    loopEnabled: boolean;                // Loop back to first doc when done
    prompt: string;                      // Agent prompt template
    worktreeSettings?: {
        branchNameTemplate: string;
        createPROnCompletion: boolean;
    };
}

Each document entry can:

  • Be reordered via drag-and-drop
  • Be duplicated (for running the same document multiple times)
  • Have resetOnCompletion to uncheck all boxes when done (enabling re-execution)

Playbooks support template variables: {{date}}, {{time}}, {{cwd}}, {{session}}, {{agent}}, {{gitBranch}}.
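
Substitution of these variables can be sketched as follows; `renderPromptTemplate` is a hypothetical name, and the variable list comes from the sentence above while the implementation is assumed:

```typescript
// Minimal sketch of playbook template-variable substitution.
// Unknown variables are left intact rather than erased.
function renderPromptTemplate(
  template: string,
  vars: Record<string, string>,
): string {
  return template.replace(/\{\{(\w+)\}\}/g, (whole, name: string) => vars[name] ?? whole);
}
```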

Confidence: High -- From ARCHITECTURE.md and src/cli/services/batch-processor.ts.

4.4 Execution

Relevant files:

  • src/renderer/hooks/useBatchProcessor.ts
  • src/cli/services/batch-processor.ts (lines 61-150)
  • src/cli/services/agent-spawner.ts

The batch processor:

  1. Registers CLI activity so the desktop app knows the session is busy
  2. Iterates through documents in order
  3. For each document, reads unchecked tasks
  4. Constructs a prompt from the playbook template + document content
  5. Spawns the AI agent with the prompt
  6. Parses the response for checked tasks
  7. Updates document state
  8. Emits JSONL progress events
  9. If loopEnabled, loops back to first document

The CLI batch processor (src/cli/services/batch-processor.ts) is an AsyncGenerator<JsonlEvent> that yields typed events:

export async function* runPlaybook(
    session: SessionInfo,
    playbook: Playbook,
    folderPath: string,
    options: { dryRun?, writeHistory?, debug?, verbose? }
): AsyncGenerator<JsonlEvent>
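
Consuming such a generator is straightforward; the event names below are illustrative stand-ins, not Maestro's actual JSONL event types:

```typescript
// Simplified event shape (assumption) matching the JSONL-per-event style.
type JsonlEvent = { type: string; [key: string]: unknown };

// A stand-in generator with the same shape as runPlaybook().
async function* exampleRun(): AsyncGenerator<JsonlEvent> {
  yield { type: "document-start", name: "auth-tests.md" };
  yield { type: "task-complete", task: "Add tests for login()" };
  yield { type: "document-complete", name: "auth-tests.md" };
}

// CI consumers stream each event as one JSON object per line (JSONL).
async function toJsonl(gen: AsyncGenerator<JsonlEvent>): Promise<string[]> {
  const lines: string[] = [];
  for await (const ev of gen) lines.push(JSON.stringify(ev));
  return lines;
}
```

The AsyncGenerator design lets the same pipeline drive both the desktop UI (which renders events incrementally) and headless CI (which logs them line by line).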

Confidence: High -- Directly from source code.

4.5 Prompt System

Directory: /tmp/ai-harness-repos/Maestro/src/prompts/ (24 markdown files + 2 subdirectories)

Maestro bundles 24 system prompts as markdown files that are imported at build time:

| Prompt File | Purpose |
|---|---|
| autorun-default.md | Default system prompt for Auto Run task execution |
| autorun-synopsis.md | Director's Notes synopsis generation |
| commit-command.md | Custom AI commit command |
| context-grooming.md | Context grooming/compaction |
| context-summarize.md | Context summarization |
| context-transfer.md | Cross-session context transfer |
| director-notes.md | Director's Notes (Encore Feature) |
| group-chat-moderator-system.md | Group Chat moderator system prompt |
| group-chat-moderator-synthesis.md | Moderator synthesis round prompt |
| group-chat-participant.md | Participant behavior template |
| group-chat-participant-request.md | Request routing to participant |
| image-only-default.md | Image-only prompt template |
| maestro-system-prompt.md | Main system prompt prepended to all queries |
| tab-naming.md | AI-generated tab naming |
| wizard-document-generation.md | Wizard: generate Auto Run documents |
| wizard-inline-*.md (5 files) | Wizard: inline editing and iteration |
| wizard-system*.md (2 files) | Wizard: system prompts for generation |

Additionally, two subdirectories bundle external prompt frameworks:

  • speckit/ -- GitHub's Spec-Kit prompts (refreshed via npm run refresh-speckit)
  • openspec/ -- Fission-AI's OpenSpec prompts (refreshed via npm run refresh-openspec)

These prompts are a key differentiator: they encode Maestro's workflow knowledge into the AI agents' behavior. The autorun-default.md prompt tells the agent how to interact with checkbox documents, the group-chat-moderator-system.md defines the moderator's decision-making behavior, and the context-grooming.md defines how to compress conversations.

4.6 Verification

Verification is implicit in the checkbox model. The AI agent is expected to:

  1. Read the task description
  2. Perform the work
  3. Check off completed tasks by modifying the markdown document

There is NO automatic verification layer (no test runner, no linter integration, no code review step). The verification is the agent's own assessment that it completed the work.

Limitation: No automated code quality gates in the Auto Run execution loop. The agent could check off a task without actually completing it successfully. This is the most significant gap compared to orchestrator-style projects that run tests/lints between steps.

Confidence: High -- No evidence of automated verification in the batch processor code.

4.7 Merge

Relevant files:

  • ARCHITECTURE.md (Git Worktree Integration, lines 845-873)
  • src/main/ipc/handlers/git.ts

When worktree mode is enabled for Auto Run:

  1. A git worktree is created with a specified branch name
  2. Auto Run operates in the worktree directory
  3. On completion, if createPROnCompletion is true, a PR is created via git:createPR
  4. The PR uses GitHub CLI (gh pr create)
'git:createPR': (worktreePath, baseBranch, title, body) => Promise<{
    success: boolean;
    prUrl?: string;
}>
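
Assuming the handler shells out to the GitHub CLI as described, the argument construction might look like the helper below. `buildCreatePrArgs` is hypothetical; the real handler in src/main/ipc/handlers/git.ts may differ:

```typescript
// Hypothetical helper mirroring the git:createPR signature above: build
// the `gh pr create` argument vector the handler could pass to execFile
// (running in the worktree directory as cwd). Passing arguments as an
// array avoids shell-quoting issues in titles and bodies.
function buildCreatePrArgs(
  baseBranch: string,
  title: string,
  body: string,
): string[] {
  return ["pr", "create", "--base", baseBranch, "--title", title, "--body", body];
}
```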

Confidence: High -- From ARCHITECTURE.md and IPC handler definitions.


5. Subagent/Task Orchestration Model

5.1 Group Chat System

Files:

  • /tmp/ai-harness-repos/Maestro/src/main/group-chat/ (10 files)
  • ARCHITECTURE.md (Group Chat System, lines 1171-1404)

Group Chat is Maestro's most sophisticated orchestration feature. It implements a moderator-agent pattern:

  1. User sends a message to the group chat
  2. Moderator AI receives the message + chat history
  3. Moderator decides whether to:
    • Answer directly (simple questions)
    • Route to specific agents via @mentions
  4. Mentioned agents work in parallel, each spawned as a batch process
  5. When all agents respond, moderator synthesis round begins
  6. Moderator reviews responses and either:
    • @mentions agents again for follow-up (loop continues)
    • Provides final synthesis WITHOUT mentions (loop ends)

Key implementation details:

Session ID patterns for routing:

group-chat-{chatId}-moderator-{timestamp}    -- Moderator process
group-chat-{chatId}-participant-{name}-{ts}  -- Agent participant

Pending response tracking (group-chat-router.ts lines 99-155):

const pendingParticipantResponses = new Map<string, Set<string>>();

export function markParticipantResponded(groupChatId: string, name: string): boolean {
    const pending = pendingParticipantResponses.get(groupChatId);
    if (!pending) return false;
    pending.delete(name);
    if (pending.size === 0) {
        pendingParticipantResponses.delete(groupChatId);
        return true; // Last participant responded
    }
    return false;
}

Two key prompts control moderator behavior:

  • MODERATOR_SYSTEM_PROMPT (src/prompts/group-chat-moderator-system.md)
  • MODERATOR_SYNTHESIS_PROMPT (src/prompts/group-chat-moderator-synthesis.md)

Storage structure:

~/Library/Application Support/maestro/group-chats/
    {chatId}/
        chat.json       # Group chat metadata
        log.jsonl        # Append-only message log
        history.json     # Summarized history entries

Moderator lifecycle management (group-chat-moderator.ts, 290 lines):

The moderator is not a persistent process. Instead, each message spawns a new batch-mode moderator process with a unique session ID:

group-chat-{chatId}-moderator-{timestamp}

Stale moderator sessions are cleaned up via a periodic interval (every 10 minutes) that removes sessions inactive for 30 minutes. The power manager is notified to prevent system sleep during active group chats.
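The stale-session sweep can be sketched as follows; the 10-minute interval and 30-minute threshold are from the text, while the function shape and map usage are assumptions:

```typescript
// Sketch of the stale-session sweep (thresholds from the docs; shape assumed).
const sessionActivityTimestamps = new Map<string, number>();
const STALE_AFTER_MS = 30 * 60 * 1000;

function sweepStaleSessions(now: number = Date.now()): string[] {
  const removed: string[] = [];
  for (const [chatId, lastActivity] of sessionActivityTimestamps) {
    if (now - lastActivity > STALE_AFTER_MS) {
      sessionActivityTimestamps.delete(chatId); // safe: Map iteration tolerates deletes
      removed.push(chatId);
    }
  }
  return removed;
}

sessionActivityTimestamps.set('stale-chat', Date.now() - 31 * 60 * 1000);
sessionActivityTimestamps.set('fresh-chat', Date.now());
const swept = sweepStaleSessions(); // ['stale-chat']
// setInterval(sweepStaleSessions, 10 * 60 * 1000) would run this periodically
```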

Participant management (group-chat-agent.ts, 429 lines):

Each participant agent is spawned via addParticipant() which:

  1. Validates the moderator is active (cannot add participants without moderator)
  2. Resolves agent configuration via AgentDetector
  3. Builds CLI arguments using the declarative arg builder pattern
  4. Applies session-specific overrides (custom model, custom args, env vars)
  5. Wraps with SSH configuration if remote execution is configured
  6. Applies Windows-specific spawn configuration (PowerShell, stdin mode)
  7. Spawns the agent with a system prompt from group-chat-participant.md template

The participant system prompt uses template variables {{GROUP_CHAT_NAME}}, {{PARTICIPANT_NAME}}, and {{LOG_PATH}} to give each participant awareness of their role and access to the shared chat log.
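The substitution itself is simple; a minimal renderer for those `{{VARIABLE}}` placeholders might look like this (Maestro's actual template renderer may differ):

```typescript
// Minimal {{VARIABLE}} substitution; the real renderer in Maestro may differ.
function renderTemplate(tpl: string, vars: Record<string, string>): string {
  return tpl.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => vars[key] ?? '');
}

const rendered = renderTemplate(
  'You are {{PARTICIPANT_NAME}} in "{{GROUP_CHAT_NAME}}". Log: {{LOG_PATH}}',
  { PARTICIPANT_NAME: 'claude', GROUP_CHAT_NAME: 'design-review', LOG_PATH: '/tmp/log.jsonl' }
);
// -> 'You are claude in "design-review". Log: /tmp/log.jsonl'
```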

Active session tracking uses in-memory Map structures:

  • activeModeratorSessions: Map<groupChatId, sessionId>
  • activeParticipantSessions: Map<groupChatId:participantName, sessionId>
  • sessionActivityTimestamps: Map<groupChatId, timestamp> (for stale cleanup)

Strength: The moderator pattern is well-designed. It naturally handles multi-round agent coordination without a fixed workflow. The per-message moderator spawning avoids long-lived process management complexity.

Limitation: The moderator is always a single AI agent. There's no support for hierarchical moderators, or for the moderator to spawn sub-moderators for complex tasks. All sessions are in-memory only; a crash loses all active group chat state.

Confidence: High -- Thoroughly documented in ARCHITECTURE.md and confirmed in source code (group-chat-moderator.ts and group-chat-agent.ts).

5.2 Symphony Orchestration

Files:

  • /tmp/ai-harness-repos/Maestro/src/main/ipc/handlers/symphony.ts (200+ lines read)
  • /tmp/ai-harness-repos/Maestro/SYMPHONY_REGISTRY.md
  • /tmp/ai-harness-repos/Maestro/SYMPHONY_ISSUES.md
  • /tmp/ai-harness-repos/Maestro/docs/symphony.md

Symphony extends Auto Run to open source contribution:

  1. Registry (symphony-registry.json) lists participating repositories
  2. Issues with runmaestro.ai label define contribution opportunities
  3. Contribution flow:
    • Clone repository to ~/Maestro-Symphony/{owner}-{repo}/
    • Create branch symphony/{issue-number}-{short-id}
    • Set up Auto Run documents (from issue body)
    • Process documents automatically
    • Create draft PR (claims the issue)
    • Finalize PR when complete

Validation is thorough (symphony.ts lines 69-191):

  • Path traversal prevention via sanitizeRepoName()
  • GitHub URL validation (HTTPS only, github.com only)
  • Repo slug format validation
  • Document path validation (no .., no leading /)
  • External URL validation (GitHub domains only)
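Two of those rules can be re-created in a few lines. The real checks live in symphony.ts; these function bodies are illustrative only:

```typescript
// Illustrative re-creation of two validation rules; not the symphony.ts source.
function sanitizeRepoName(owner: string, repo: string): string {
  const safe = /^[A-Za-z0-9_.-]+$/;
  if (!safe.test(owner) || !safe.test(repo)) throw new Error('invalid repo slug');
  if (owner.includes('..') || repo.includes('..')) throw new Error('path traversal');
  return `${owner}-${repo}`; // matches ~/Maestro-Symphony/{owner}-{repo}/
}

function isAllowedGitHubUrl(url: string): boolean {
  try {
    const u = new URL(url); // HTTPS only, github.com only
    return u.protocol === 'https:' && u.hostname === 'github.com';
  } catch {
    return false;
  }
}

const dir = sanitizeRepoName('octocat', 'hello-world'); // 'octocat-hello-world'
const ok = isAllowedGitHubUrl('https://github.com/octocat/hello-world'); // true
const bad = isAllowedGitHubUrl('http://github.com/octocat/hello-world'); // false
```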

Confidence: High -- From source code and documentation.


6. Multi-Agent / Parallelization Strategy

6.1 Agent-Level Parallelism

Maestro supports unlimited parallel agents, each with its own workspace and process pair. The Left Bar shows all agents simultaneously. Agent switching is keyboard-driven (Cmd+[ / Cmd+]).

6.2 Tab-Level Parallelism

Each agent supports multiple AI tabs (AITab[]), each potentially connected to a different provider session. This enables parallel conversations within a single agent workspace.

6.3 Execution Queue

File: ARCHITECTURE.md (Execution Queue, lines 1096-1137)

The execution queue is a per-agent sequential processing queue that prevents conflicting operations:

interface QueuedItem {
    id: string;
    type: 'message' | 'command';
    content: string;
    tabId: string;
    readOnlyMode: boolean;
    timestamp: number;
    source: 'user' | 'autorun';
}

Queue processing rules:

  • Items are processed FIFO within each agent
  • When the current agent query completes (process exits), the next queued item is dispatched
  • Read-only operations (readOnlyMode: true) can potentially execute in parallel (agent-dependent)
  • Write operations must be sequential to prevent file conflicts
  • Auto Run tasks enter the same queue as user messages (source: 'autorun')
  • Users can inspect and cancel pending items via the Execution Queue Browser (Cmd+K -> "Execution Queue")
  • The queue persists across tab switches but not across app restarts
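The per-agent FIFO behavior can be sketched as below. Names here are assumptions; `dispatchNext` stands in for "the current agent query completed, so dispatch the next queued item":

```typescript
// FIFO sketch of the per-agent queue rules (names assumed, not Maestro's).
interface QueueItem { id: string; content: string; source: 'user' | 'autorun' }

const queues = new Map<string, QueueItem[]>();
const executed: string[] = [];

function enqueue(agentId: string, item: QueueItem): void {
  const q = queues.get(agentId) ?? [];
  q.push(item);
  queues.set(agentId, q);
}

// Called when the agent's current query completes (process exit):
function dispatchNext(agentId: string): QueueItem | undefined {
  const next = queues.get(agentId)?.shift(); // FIFO within each agent
  if (next) executed.push(`${agentId}:${next.id}`);
  return next;
}

enqueue('agent-a', { id: '1', content: 'fix bug', source: 'user' });
enqueue('agent-a', { id: '2', content: 'next task', source: 'autorun' }); // same queue as user messages
dispatchNext('agent-a'); // item 1 runs first
```

Because each agent has its own queue, sequencing is per-agent while different agents proceed in parallel.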

6.4 Worktree Parallelism

Without worktree mode: Auto Run tasks queue through the execution queue (sequential within an agent, parallel across agents).

With worktree mode: Auto Run operates in a separate directory, enabling true parallelization with the main workspace. No queue conflicts.

6.5 Group Chat Parallelism

When the moderator @mentions multiple agents, they are spawned as parallel batch processes. The system tracks pending responses and triggers synthesis only when ALL agents have responded.

6.6 What's Missing

  • No work-stealing or load balancing between agents
  • No automatic task distribution across agents (user must manually assign)
  • No dependency graph execution (tasks within a document are sequential)
  • No cross-agent pipeline (output of Agent A cannot feed into Agent B automatically, except via Group Chat)

Confidence: High -- From architecture documentation and source code analysis.


7. Isolation Model

7.1 Git Worktrees

Files:

  • ARCHITECTURE.md (Git Worktree Integration, lines 845-873)
  • IPC handlers: git:worktreeInfo, git:worktreeSetup, git:worktreeCheckout

Maestro provides first-class git worktree support:

  • Create worktree sub-agents from the git branch menu
  • Each worktree operates in its own directory
  • AI agents process tasks independently
  • One-click PR creation from worktree branches

interface WorktreeConfig {
    enabled: boolean;
    path: string;                   // Absolute path for the worktree
    branchName: string;             // Branch name to use/create
    createPROnCompletion: boolean;  // Create PR when Auto Run finishes
}
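The git invocation this config implies can be sketched as an argument builder (argument construction only; the real `git:worktreeSetup` handler is not shown in this report):

```typescript
// Sketch of the git arguments a worktree setup implies (handler internals assumed).
interface WorktreeConfig {
  enabled: boolean;
  path: string;
  branchName: string;
  createPROnCompletion: boolean;
}

function worktreeAddArgs(cfg: WorktreeConfig): string[] | null {
  if (!cfg.enabled) return null;
  // `git worktree add -b <branch> <path>` creates the branch and directory in one step
  return ['worktree', 'add', '-b', cfg.branchName, cfg.path];
}

const args = worktreeAddArgs({
  enabled: true,
  path: '/tmp/repo-worktrees/feature-x',
  branchName: 'feature-x',
  createPROnCompletion: true,
});
// -> ['worktree', 'add', '-b', 'feature-x', '/tmp/repo-worktrees/feature-x']
```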

7.2 Session Isolation

Each agent/session has:

  • Its own working directory (cwd)
  • Its own process pair (AI + terminal)
  • Its own conversation tabs (each with independent provider session)
  • Its own execution queue
  • Its own file tree and git state

7.3 Data Isolation

Development mode uses isolated data directories:

npm run dev          -> maestro-dev/     (separate from production)
npm run dev:demo     -> /tmp/maestro-demo/  (completely fresh)
npm run dev:prod-data -> maestro/        (production data)

Settings stored separately:

  • maestro-settings.json -- User preferences
  • maestro-sessions.json -- Agent persistence
  • maestro-groups.json -- Agent groups
  • maestro-agent-configs.json -- Per-agent configuration

7.4 SSH Remote Isolation

Agents can execute on remote hosts via SSH. The SSH spawn wrapper (src/main/utils/ssh-spawn-wrapper.ts) wraps any agent spawn command with SSH transport:

SSH remote configuration:

interface SshRemoteConfig {
    enabled: boolean;
    remoteId: string | null;
    workingDirOverride?: string;
}

The wrapper transforms local spawn configs to SSH-wrapped versions:

  • Commands are prefixed with ssh -t <host> for remote execution
  • The remote host's login shell ($SHELL -lc) is used to ensure PATH is properly loaded
  • File paths are resolved on the remote filesystem
  • Prompts are passed via stdin as a script (sshStdinScript) rather than command-line arguments, avoiding both shell escaping issues and the 8KB command length limit on Windows
  • Long prompts are base64-encoded for transport safety
  • Each agent type can have per-session SSH configuration, enabling mixed local/remote agent fleets
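The base64 stdin-script idea is the interesting part: encoding sidesteps both shell escaping and the ~8KB Windows command-length limit. A minimal sketch of the mechanism (the exact script Maestro generates is not shown here):

```typescript
// Mechanism sketch only; Maestro's actual sshStdinScript generation may differ.
function buildSshStdinScript(prompt: string, agentCommand: string): string {
  const encoded = Buffer.from(prompt, 'utf8').toString('base64');
  // The remote login shell decodes the prompt and pipes it into the agent binary.
  return `echo '${encoded}' | base64 -d | ${agentCommand}`;
}

const script = buildSshStdinScript('Summarize this repository', 'claude -p');
// script would then be fed via stdin to: ssh -t <host> "$SHELL -lc ..."
```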

SSH error detection is handled by the dedicated SSH_ERROR_PATTERNS (see Section 3.4) which detect transport-level failures separately from agent-level errors.

Group Chat participants can individually be SSH-remoted: The addParticipant() function accepts sessionOverrides.sshRemoteConfig, enabling heterogeneous Group Chat setups where some agents run locally and others on remote machines.

7.5 What's Missing

  • No containerized isolation (Docker, sandboxing). Agents run with the same privileges as the user.
  • No resource limits per agent (CPU, memory, disk)
  • No network isolation between agents
  • No filesystem sandboxing (agents can access any file the user can)

Confidence: High -- From architecture documentation and SECURITY.md.


8. Human-in-the-Loop Controls

8.1 Read-Only Mode

Each tab has a readOnlyMode toggle. When enabled:

  • Claude Code uses --permission-mode plan
  • Codex uses --sandbox read-only
  • OpenCode uses --agent plan

This prevents agents from making file changes while allowing analysis.

8.2 Pause/Resume

The batch processor supports pause/resume for Auto Run:

pauseBatchRun()   // Pause current batch run
resumeBatchRun()  // Resume execution
stopBatchRun()    // Stop current batch run

8.3 Execution Queue Management

Users can view and cancel pending queue items via the Execution Queue Browser (Cmd+K -> "Execution Queue").

8.4 Agent Error Handling

When agents encounter errors (auth expired, token exhaustion, rate limit):

  1. Error modal appears with error details
  2. Input is blocked (agentErrorPaused: true)
  3. User must acknowledge and decide how to proceed
  4. Recovery options are presented based on error type

Error types from src/main/parsers/error-patterns.ts:

  • auth_expired -- API key invalid, login required
  • token_exhaustion -- Context window full
  • rate_limited -- Too many requests
  • network_error -- Connection failed
  • agent_crashed -- Non-zero exit code
  • permission_denied -- Operation not allowed
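A recovery-option mapping keyed by these error types might look like the following. The union type reflects error-patterns.ts; the option strings are illustrative, not Maestro's actual modal copy:

```typescript
// Error type union from error-patterns.ts; option strings are illustrative only.
type AgentErrorType =
  | 'auth_expired'
  | 'token_exhaustion'
  | 'rate_limited'
  | 'network_error'
  | 'agent_crashed'
  | 'permission_denied';

function recoveryOptions(err: AgentErrorType): string[] {
  switch (err) {
    case 'auth_expired':      return ['Re-authenticate', 'Switch agent'];
    case 'token_exhaustion':  return ['Compact context', 'Start a new tab'];
    case 'rate_limited':      return ['Retry later', 'Switch model'];
    case 'network_error':     return ['Retry', 'Check connection'];
    case 'agent_crashed':     return ['Restart agent', 'Inspect logs'];
    case 'permission_denied': return ['Disable read-only mode', 'Adjust permissions'];
  }
}

const opts = recoveryOptions('token_exhaustion'); // ['Compact context', 'Start a new tab']
```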

8.5 Confirmation Dialogs

  • Agent deletion requires confirmation (unless showConfirmation: false)
  • Playbook deletion has a dedicated confirmation modal (PlaybookDeleteConfirmModal)
  • Tab closing with unsaved edits prompts for confirmation
  • Group renaming, session renaming have dedicated modals

8.6 What's Missing

  • No approval gates for specific operations (e.g., approve before file write)
  • No cost limit enforcement (user can see costs but can't set spend limits)
  • No automated rollback (if an agent makes bad changes, user must manually revert)
  • No per-task review step in Auto Run (tasks execute sequentially without review between them)

Confidence: High -- From architecture documentation and UI component analysis.


9. Context Handling Strategy

9.1 Context Merging

File: /tmp/ai-harness-repos/Maestro/src/main/ipc/handlers/context.ts (508 lines)

Maestro provides context merge operations via 5 IPC handlers:

Handler Status Purpose
context:getStoredSession Active Retrieve messages from agent session storage
context:groomContext Active (recommended) Single-call grooming: spawn agent, send prompt, collect response
context:cancelGrooming Active Cancel all active grooming sessions
context:createGroomingSession Deprecated Create a temporary interactive grooming session
context:sendGroomingPrompt Deprecated Send prompt to existing grooming session
context:cleanupGroomingSession Active Clean up temporary grooming session

The evolution from the deprecated two-step createGroomingSession + sendGroomingPrompt to the single-call groomContext demonstrates an architectural simplification. The original approach required managing long-lived processes and response collection via event listeners with idle timeouts. The new approach uses the shared groomContext() utility from src/main/utils/context-groomer.ts.

Grooming response collection (deprecated path, still in codebase):

// context.ts line 287 - Response collection with multiple completion signals
return new Promise<string>((resolve, reject) => {
    let responseBuffer = '';
    let lastDataTime = Date.now();

    // Completion triggers:
    // 1. Process exit -> return whatever was collected
    // 2. Idle timeout (5s with no data + min 100 chars) -> return
    // 3. Overall timeout (5 minutes) -> return or reject
    // 4. Agent error -> reject
});

The grooming operation has a 5-minute timeout (GROOMING_TIMEOUT_MS = 5 * 60 * 1000). The idle check (1-second interval, 5-second inactivity with >= 100 character response) handles cases where the agent process does not cleanly exit.
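Isolated as a pure predicate, the idle heuristic reads as follows (constants from the text; the function shape is an assumption):

```typescript
// Idle-completion heuristic; constants from the docs, function shape assumed.
const IDLE_MS = 5_000;  // 5s with no new data
const MIN_CHARS = 100;  // minimum response size before idle counts as completion

function isGroomingComplete(responseLength: number, lastDataTime: number, now: number): boolean {
  return responseLength >= MIN_CHARS && now - lastDataTime >= IDLE_MS;
}

const done = isGroomingComplete(150, 0, 6_000);    // true: long enough, 6s idle
const tooShort = isGroomingComplete(50, 0, 6_000); // false: under the 100-char floor
```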

9.2 Context Grooming Prompts

Files:

  • src/prompts/context-grooming.md
  • src/prompts/context-summarize.md
  • src/prompts/context-transfer.md

These prompts enable:

  • Compaction -- Summarize a conversation to reduce context size
  • Transfer -- Export context from one session to another
  • Grooming -- Clean up context for better agent performance

9.3 Per-Tab Context

Each AI tab has its own:

  • logs: LogEntry[] -- Tab-specific conversation history
  • agentSessionId?: string -- Provider session ID
  • scrollTop?: number -- Scroll position
  • draftInput?: string -- Unsaved input

Context is isolated per tab. When creating a new tab, it starts with a fresh context (new provider session). Resuming a tab reconnects to its existing provider session.

9.4 Tab Overlay Context Operations

The tab hover overlay menu (after 400ms hover) includes:

  • Context: Compact (if tab has 5+ messages) -- Summarize conversation
  • Context: Merge Into (if provider session exists) -- Import context from another session
  • Context: Send to Agent (if provider session exists) -- Export context to another agent

9.5 Context Usage Tracking

Per-tab context usage is tracked as a percentage of the context window:

contextUsage: number;  // Context window usage percentage (0-100)

The context window size varies by agent:

  • Claude Code: 200,000 tokens (always reported in JSON output)
  • Codex: 200,000 tokens (default for GPT-5.x)
  • OpenCode: 128,000 tokens (default)
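A plausible derivation of that percentage from token counts and the per-agent window size is shown below; the exact formula (e.g. how cache tokens are counted) is an assumption:

```typescript
// Plausible contextUsage derivation; the exact formula in Maestro is assumed.
function contextUsagePercent(usedTokens: number, contextWindow: number): number {
  if (contextWindow <= 0) return 0;
  return Math.min(100, Math.round((usedTokens / contextWindow) * 100));
}

const pct = contextUsagePercent(160_000, 200_000); // 80
```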

9.6 What's Missing

  • No automatic context compaction (user must manually trigger)
  • No context chunking for large codebases (relies on agent's built-in RAG)
  • No persistent vector store for retrieval
  • No cross-session context inheritance (new sessions start fresh)
  • No context budget enforcement (agents can exhaust context without warning)

Confidence: High -- From context handler source code and prompts directory.


10. Session Lifecycle and Persistence

10.1 Session Creation

  1. User clicks "New Agent" (Cmd+N)
  2. Selects provider (Claude Code, Codex, OpenCode, Factory Droid)
  3. Selects working directory
  4. createNewSession(agentId, workingDir, name) is called
  5. Two processes spawned: AI agent (child process) + terminal (PTY)
  6. Session added to sessions[] state and persisted

10.2 Session Persistence

Settings stored via electron-store:

  • macOS: ~/Library/Application Support/maestro/
  • Windows: %APPDATA%/maestro/
  • Linux: ~/.config/maestro/

Files:

  • maestro-settings.json -- User preferences (debounced 2-second persistence)
  • maestro-sessions.json -- Agent data
  • maestro-groups.json -- Agent groups
  • maestro-agent-configs.json -- Per-agent configuration

The persistence system uses useDebouncedPersistence (2-second debounce) with flush on visibility change and beforeunload to prevent data loss.

10.3 Session Discovery

Maestro automatically discovers existing provider sessions:

  • Claude Code: ~/.claude/projects/<encoded-path>/
  • Codex: ~/.codex/sessions/YYYY/MM/DD/*.jsonl
  • OpenCode: ~/.local/share/opencode/storage/
  • Factory Droid: ~/.factory/sessions/

Users can browse, search, star, rename, and resume any discovered session.

10.4 Session Resume

Each agent supports session resume with provider-specific flags:

  • Claude Code: --resume <session-id>
  • Codex: resume <thread_id> (subcommand)
  • OpenCode: --session <session-id>
  • Factory Droid: -s, --session-id <id>
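Expressed as a declarative mapping, those per-provider flags might be modeled like this (a plausible shape; Maestro's real arg builder lives in definitions.ts):

```typescript
// Declarative resume-flag mapping; shape assumed, flags from the list above.
const resumeArgs: Record<string, (id: string) => string[]> = {
  'claude-code':   (id) => ['--resume', id],
  'codex':         (id) => ['resume', id], // subcommand, not a flag
  'opencode':      (id) => ['--session', id],
  'factory-droid': (id) => ['--session-id', id],
};

const codexResume = resumeArgs['codex']('thread-123'); // ['resume', 'thread-123']
```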

10.5 Session States

Color-coded states:

  • Green -- Ready/idle
  • Yellow -- Agent thinking/busy
  • Red -- No connection/error
  • Pulsing Orange -- Connecting

10.6 History Persistence

Command history is maintained per-session:

  • aiCommandHistory: string[] -- AI input history
  • shellCommandHistory: string[] -- Terminal input history

History entries are also stored in the SQLite stats database for analytics.

Confidence: High -- From architecture documentation and source code.


11. Code Quality Gates

11.1 Pre-commit Hooks

File: /tmp/ai-harness-repos/Maestro/.husky/pre-commit

Husky + lint-staged runs on every commit:

"lint-staged": {
    "*.{ts,tsx}": [
        "prettier --write",
        "eslint --fix"
    ]
}

11.2 TypeScript

Strict mode enabled. Three separate tsconfig files:

  • tsconfig.lint.json -- Renderer, web, and shared code
  • tsconfig.main.json -- Main process code
  • tsconfig.cli.json -- CLI tooling

11.3 ESLint

Configured with TypeScript and React plugins:

  • react-hooks/rules-of-hooks
  • react-hooks/exhaustive-deps
  • @typescript-eslint/no-unused-vars
  • prefer-const

11.4 Testing

Framework: Vitest (4 configurations):

  • vitest.config.mts -- Unit tests
  • vitest.integration.config.ts -- Integration tests
  • vitest.e2e.config.ts -- E2E tests (with Playwright)
  • vitest.performance.config.mts -- Performance tests

Coverage: 490 test files across:

src/__tests__/
    cli/           # CLI tool tests
    main/          # Electron main process tests
    renderer/      # React component and hook tests
    shared/        # Shared utility tests
    web/           # Web interface tests
    integration/   # Integration tests
    e2e/           # E2E tests

11.5 CI/CD

File: /tmp/ai-harness-repos/Maestro/.github/workflows/release.yml

Release workflow builds for 4 platforms:

  • macOS (universal: x64 + arm64)
  • Linux x64
  • Linux ARM64 (native ARM runner)
  • Windows x64

Architecture verification is thorough: native modules (node-pty, better-sqlite3) are verified to be built for the correct architecture before AND after packaging. This was clearly born from painful debugging of cross-architecture contamination issues.

11.6 Automated PR Review

Two AI tools review PRs:

  • CodeRabbit -- Line-level code review
  • Greptile -- Codebase-aware architectural review

11.7 Error Tracking

Sentry integration for crash reporting:

  • src/main/utils/sentry.ts
  • src/renderer/components/ErrorBoundary.tsx
  • Dynamic import to avoid module-load-time errors
  • Disabled in development mode
  • User can opt out via settings

11.8 What's Missing

  • No linting or testing in CI before release (the release workflow only builds, doesn't run tests)
  • No required CI checks before merge (mentioned as "in scope" but not enforced)
  • No code coverage thresholds (coverage is available but no minimum enforcement)
  • No security scanning (no SAST, no dependency audit in CI)

Confidence: High -- From configuration files and CI workflow.


12. Security and Compliance Mechanisms

12.1 IPC Security

From SECURITY.md (lines 77-80):

  • Context isolation: Enabled
  • Minimal preload API surface via contextBridge.exposeInMainWorld
  • No require() in renderer
  • Input validation in main process handlers

12.2 Command Execution Security

  • execFileNoThrow used for all external commands (never shell-based execution)
  • spawn() with shell: false for AI agent processes
  • Path traversal prevention in Symphony handlers (sanitizeRepoName())
  • URL validation for external resources (HTTPS only, domain allowlists)

12.3 Process Execution Model

From SECURITY.md (lines 69-73):

Maestro spawns AI agents and terminal processes with the same privileges as the user running the application. This is by design.

Known security considerations:

  • Agents can execute commands on the system
  • Local web server exposes sessions (no auth by default)
  • Cloudflare tunnel URLs are temporary but unauthenticated
  • Sentry DSN is intentionally public (standard client-side practice)

12.4 Encore Features (Feature Gating)

Feature gating via EncoreFeatureFlags:

  • Features disabled by default are completely invisible (no shortcuts, no menu items)
  • First example: Director's Notes
  • Serves as precursor to a plugin marketplace

12.5 What's Missing

  • No authentication for web/mobile interface (anyone with the URL can control agents)
  • No rate limiting on the web server (the @fastify/rate-limit dependency is present, but it is unclear whether it is configured)
  • No audit logging (actions are tracked for analytics but not for security audit)
  • No credential management (API keys are managed by the underlying agents, not Maestro)
  • No sandboxing of AI agent execution (runs with full user privileges)
  • No content security policy in the Electron renderer

Confidence: High -- From SECURITY.md and source code analysis.


13. Hooks, Automation Surface, and Fail-Safe Behavior

13.1 Automation Surface

CLI Tool (maestro-cli):

maestro list agents              # List available AI agents
maestro list groups              # List session groups
maestro list playbooks           # List saved playbooks
maestro list sessions <agent-id> # List agent sessions
maestro show agent <id>          # Show agent details
maestro show playbook <id>       # Show playbook configuration
maestro playbook <id>            # Run a playbook
maestro send <agent-id> <msg>    # Send message, get JSON response
maestro clean playbooks          # Remove orphaned playbooks

All commands support --json flag for JSONL output (machine-parseable).
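Consuming that JSONL output from a script is straightforward: one JSON object per line. The event fields below are hypothetical; only the JSONL framing is documented:

```typescript
// JSONL consumer; event fields are hypothetical, framing is from the docs.
function parseJsonl(output: string): unknown[] {
  return output
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}

const events = parseJsonl('{"type":"response","text":"done"}\n{"type":"usage","tokens":42}\n');
// events.length === 2
```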

IPC API: The window.maestro API provides 17+ namespaces that could be used by custom extensions.

Custom AI Commands: User-defined slash commands with template variables.

13.2 Fail-Safe Behavior

  • SIGINT -> SIGTERM escalation (2 second timeout)
  • Process cleanup on exit (killAll() on app shutdown)
  • Orphaned tab repair (ensureInUnifiedTabOrder() repairs missing tab references)
  • Settings flush on visibility change (prevents data loss)
  • Error boundaries in React components
  • Grooming session timeout (5 minutes)
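The SIGINT -> SIGTERM escalation can be sketched as below; the structural process type and the injectable scheduler are assumptions made purely so the sketch is testable:

```typescript
// Escalation sketch; KillableProc and the scheduler parameter are test-friendly assumptions.
interface KillableProc {
  killed: boolean;
  exitCode: number | null;
  kill(signal: 'SIGINT' | 'SIGTERM'): void;
}

function gracefulKill(
  proc: KillableProc,
  schedule: (fn: () => void, ms: number) => void = (fn, ms) => { setTimeout(fn, ms); }
): void {
  proc.kill('SIGINT'); // ask the agent to stop cleanly first
  schedule(() => {
    if (!proc.killed && proc.exitCode === null) proc.kill('SIGTERM'); // then escalate
  }, 2_000);
}
```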

13.3 Power Management

src/main/power-manager.ts prevents system sleep while agents are busy (configurable).

13.4 Auto-Update

src/main/auto-updater.ts and src/main/update-checker.ts handle automatic updates via electron-updater.

13.5 What's Missing

  • No webhook/HTTP API for external automation (only CLI and IPC)
  • No plugin system (Encore Features is a precursor, not yet a full plugin API)
  • No event bus for external consumers (events are internal to Electron IPC)
  • No watchdog for agent health (agents that hang indefinitely are not automatically killed)

Confidence: High -- From source code analysis.


14. CLI/UX and Automation Ergonomics

14.1 Desktop UX

Keyboard-first design is deeply implemented:

  • 30+ keyboard shortcuts documented in src/renderer/constants/shortcuts.ts
  • Cmd+K quick actions (command palette)
  • Cmd+J toggle AI/terminal mode
  • Cmd+N new agent
  • Cmd+[ / Cmd+] switch agents
  • Cmd+T new tab
  • Cmd+W close tab
  • Escape always returns to a known state (via Layer Stack system)

Layer Stack System (ARCHITECTURE.md lines 252-380):

  • Centralized modal/overlay management
  • Predictable Escape key handling (highest priority modal closes first)
  • 30+ modal priority levels defined
  • Focus traps ('strict', 'lenient', 'none')
  • ARIA attributes for accessibility

Keyboard Mastery Tracking -- Gamification that rewards keyboard usage:

  • Achievements for time spent using Auto Run
  • 15 conductor-themed badge levels (Apprentice to Transcendent Maestro)
  • Standing Ovation overlay with confetti animation for new badges

14.2 Mobile UX

PWA with mobile-optimized components:

  • Bottom navigation bar (TabBar.tsx)
  • Session pill bar (horizontal scrolling)
  • Voice input support
  • Swipe gestures (useSwipeGestures.ts, usePullToRefresh.ts)
  • Offline queue (useOfflineQueue.ts)
  • Push notifications
  • Connection status indicator

14.3 CLI UX

The maestro-cli provides:

  • Human-readable output (tables and text)
  • JSONL output for scripting
  • --dry-run for playbook execution
  • --debug and --verbose flags
  • --wait to wait for busy agents
  • Pagination for session listing (--limit, --skip)
  • Search filtering (--search)

14.4 Theme System

16 themes across 3 modes (dark, light, vibe):

  • Dracula, Monokai, Nord, Tokyo Night, Catppuccin Mocha, Gruvbox Dark
  • GitHub Light, Solarized, One Light, Gruvbox Light, Catppuccin Latte, Ayu Light
  • Colorblind-friendly palettes (Wong-based)

14.5 What's Missing

  • No CI/CD pipeline integration (CLI can run playbooks but no built-in GitHub Actions integration)
  • No REST API for programmatic access
  • No dashboard web UI (mobile is read/control only, no analytics on mobile)

Confidence: High -- From documentation and source code.


15. Cost/Usage Visibility and Governance

15.1 Real-Time Cost Tracking

Per-session token usage and cost tracking:

interface UsageStats {
    inputTokens: number;
    outputTokens: number;
    cacheReadInputTokens: number;
    cacheCreationInputTokens: number;
    totalCostUsd: number;
    contextWindow: number;
    reasoningTokens?: number;
}

Cost tracking is agent-dependent:

  • Claude Code: Full cost tracking (USD)
  • OpenCode: Full cost tracking (USD from step_finish events)
  • Codex: Token counts only (no USD -- pricing varies by model)
  • Factory Droid: Token counts only
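Aggregating these per-session stats into dashboard totals is a simple reduction. The sketch below is trimmed to three UsageStats fields; Maestro's real aggregation lives in src/main/stats/:

```typescript
// Dashboard-totals reduction; trimmed field set, real logic lives in src/main/stats/.
interface UsageTotals { inputTokens: number; outputTokens: number; totalCostUsd: number }

function sumUsage(sessions: UsageTotals[]): UsageTotals {
  return sessions.reduce(
    (acc, s) => ({
      inputTokens: acc.inputTokens + s.inputTokens,
      outputTokens: acc.outputTokens + s.outputTokens,
      totalCostUsd: acc.totalCostUsd + s.totalCostUsd,
    }),
    { inputTokens: 0, outputTokens: 0, totalCostUsd: 0 }
  );
}

const totals = sumUsage([
  { inputTokens: 1_000, outputTokens: 200, totalCostUsd: 0.05 },
  { inputTokens: 3_000, outputTokens: 800, totalCostUsd: 0.15 },
]);
// totals.inputTokens === 4000
```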

15.2 Usage Dashboard

Files:

  • src/renderer/components/UsageDashboard/ (10+ components)
  • /tmp/ai-harness-repos/Maestro/src/main/stats/ (13 files)
  • CLAUDE-FEATURES.md (lines 7-75)

SQLite-backed analytics with:

  • Summary cards (queries, duration, cost, Auto Runs)
  • Agent comparison bar chart
  • Source distribution pie chart (user vs. auto queries)
  • Activity heatmap (GitHub-style)
  • Duration trends line chart
  • Auto Run-specific statistics
  • Time range filtering (day, week, month, year, all time)
  • CSV export
  • Real-time updates
  • Colorblind-friendly palettes

15.3 Stats Database Architecture

Files:

  • /tmp/ai-harness-repos/Maestro/src/main/stats/stats-db.ts (833 lines)
  • /tmp/ai-harness-repos/Maestro/src/main/stats/schema.ts (142 lines)
  • /tmp/ai-harness-repos/Maestro/src/main/stats/migrations.ts
  • /tmp/ai-harness-repos/Maestro/src/main/stats/aggregations.ts

The StatsDB class manages a SQLite database (stats.db in the user data directory) with 4 main tables:

Table Purpose Key Fields
query_events Every AI query session_id, agent_type, source (user/auto), start_time, duration, project_path
auto_run_sessions Auto Run execution runs session_id, document_path, tasks_total, tasks_completed
auto_run_tasks Individual tasks within Auto Runs task_index, task_content, duration, success (0/1)
session_lifecycle Session creation/closure agent_type, created_at, closed_at, is_remote

Supporting tables:

  • _migrations -- Schema migration tracking with version, description, status, and error_message
  • _meta -- Internal key-value storage (e.g., last vacuum timestamp)

Database resilience features:

  1. WAL mode: PRAGMA journal_mode = WAL for concurrent read/write access
  2. Integrity checking: PRAGMA integrity_check on every startup to detect corruption
  3. Daily backups: Automatic daily backup with 7-day rotation (stats.db.daily.YYYY-MM-DD)
  4. Corruption recovery: Multi-step recovery process:
    • Backup corrupted database for forensics (stats.db.corrupted.{timestamp})
    • Remove stale WAL/SHM sidecar files that can cause false corruption detection
    • Iterate through available backups, validating each with integrity check
    • Restore from first valid backup, or create fresh database if none valid
  5. Weekly VACUUM: Scheduled vacuum (not on every startup) via _meta table timestamp tracking, triggered only when database exceeds 100MB
  6. WAL checkpoint before backup: PRAGMA wal_checkpoint(TRUNCATE) ensures the .db file is self-contained before copying

// stats-db.ts line 333 - Safe backup copy
private safeBackupCopy(destPath: string): void {
    if (this.db) {
        this.db.pragma('wal_checkpoint(TRUNCATE)');
    }
    fs.copyFileSync(this.dbPath, destPath);
}
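The 7-day rotation for daily backups can be illustrated as a pruning function. The filename pattern is from the text; the pruning logic itself is an assumption:

```typescript
// 7-day rotation sketch; filename pattern from the docs, pruning logic assumed.
function backupsToPrune(files: string[], keep = 7): string[] {
  const daily = files
    .filter((f) => /^stats\.db\.daily\.\d{4}-\d{2}-\d{2}$/.test(f))
    .sort(); // ISO dates sort lexicographically, oldest first
  return daily.slice(0, Math.max(0, daily.length - keep)); // oldest beyond the window
}

backupsToPrune(['stats.db.daily.2026-02-10']); // [] -- still inside the 7-day window
```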

Migration system: Versioned migrations with individual success/failure tracking per migration. Each migration is recorded in the _migrations table with its status and any error message, enabling precise debugging of upgrade failures.

Statement caching: Each CRUD module (query-events.ts, auto-run.ts, session-lifecycle.ts) maintains prepared statement caches that are cleared on database close, avoiding repeated SQL parsing overhead.

Confidence: High -- Directly from source code (stats-db.ts 833 lines, schema.ts 142 lines).

15.4 WakaTime Integration

src/main/wakatime-manager.ts provides integration with WakaTime for developer activity tracking.

15.5 Global Stats

Cross-project statistics from Claude Code sessions:

const stats = await window.maestro.claude.getGlobalStats();
// Returns: { totalSessions, totalMessages, totalInputTokens, totalOutputTokens,
//            totalCacheReadTokens, totalCacheCreationTokens, totalCostUsd, totalSizeBytes }

15.6 What's Missing

  • No cost budgets or limits (tracking only, no enforcement)
  • No alerts when spending exceeds thresholds
  • No per-playbook cost attribution (costs are per-session, not per-task)
  • No team/organization cost aggregation

Confidence: High -- From source code and documentation.


16. Tooling and Dependency Surface

16.1 Runtime Requirements

  • Node.js >= 22.0.0 (specified in package.json engines)
  • Electron 28 (desktop runtime)
  • Git (optional, for git-aware features)
  • At least one AI agent installed:
    • Claude Code
    • OpenAI Codex
    • OpenCode
    • Factory Droid

16.2 Key Dependencies

Native modules (require compilation):

  • node-pty -- Terminal emulation
  • better-sqlite3 -- Analytics database

Backend:

  • electron-store -- Settings persistence
  • fastify + WebSocket -- Web server for mobile
  • chokidar -- File watching
  • commander -- CLI argument parsing
  • archiver / adm-zip -- Playbook import/export
  • @sentry/electron -- Error tracking
  • electron-updater -- Auto-updates

Frontend:

  • react 18 + react-dom + zustand (state management)
  • tailwindcss -- Styling
  • react-markdown + remark-gfm -- Markdown rendering
  • react-syntax-highlighter -- Code highlighting
  • reactflow -- Document graph visualization
  • recharts -- Usage dashboard charts
  • d3-force -- Graph layout
  • mermaid -- Diagram rendering
  • canvas-confetti -- Achievement celebrations
  • marked -- Markdown parsing
  • dompurify -- HTML sanitization
  • js-tiktoken -- Token counting
  • @tanstack/react-virtual -- Virtual scrolling

Dev tooling:

  • vite -- Build tool
  • vitest -- Test framework
  • playwright -- E2E testing
  • esbuild -- CLI bundling
  • eslint + prettier -- Code quality
  • typescript 5.3 -- Type checking

16.3 Build Configuration

4 separate TypeScript configs:

  • tsconfig.json -- Base config
  • tsconfig.main.json -- Main process
  • tsconfig.lint.json -- Renderer/web/shared
  • tsconfig.cli.json -- CLI

Vite configs:

  • vite.config.mts -- Desktop renderer
  • vite.config.web.mts -- Web/mobile interface

Build targets:

  • macOS: DMG + ZIP (x64 + arm64)
  • Windows: NSIS installer + Portable (x64)
  • Linux: AppImage + DEB + RPM (x64 + arm64)

16.4 Dependency Risk Assessment

  • node-pty: Native module, requires compilation. Cross-platform build is fragile (evidenced by extensive CI architecture verification steps).
  • better-sqlite3: Native module, same compilation concerns.
  • Electron 28: Not latest (Electron 35 is current as of 2026). Missing File System Access API support.
  • React 18: Stable but React 19 has been out for over a year.

Confidence: High -- From package.json and build configuration.


17. External Integrations and Provider Compatibility

17.1 Provider Architecture

Maestro's multi-provider architecture is implemented through:

  1. Agent Definitions (src/main/agents/definitions.ts) -- CLI binary, arguments, detection
  2. Agent Capabilities (src/main/agents/capabilities.ts) -- 20+ capability flags per agent
  3. Output Parsers (src/main/parsers/) -- Agent-specific JSON parsing
  4. Session Storage (src/main/storage/) -- Agent-specific session discovery
  5. Error Patterns (src/main/parsers/error-patterns.ts) -- Agent-specific error detection

17.2 Supported Providers

Provider Status Resume Read-Only JSON Images Sessions Cost Thinking
Claude Code Active --resume --permission-mode plan stream-json stdin JSON ~/.claude/ USD Yes
Codex Active resume <id> --sandbox read-only --json -i flag ~/.codex/ Tokens Yes
OpenCode Active --session --agent plan --format json -f flag ~/.local/ USD Yes
Factory Droid Active -s <id> Default mode stream-json -f flag ~/.factory/ Tokens Yes
Gemini CLI Planned TBD TBD TBD Yes TBD USD TBD
Qwen3 Coder Planned TBD TBD TBD TBD TBD N/A TBD
Aider Planned TBD TBD TBD TBD TBD TBD TBD

17.3 Agent Definition Architecture

File: /tmp/ai-harness-repos/Maestro/src/main/agents/definitions.ts (367 lines)

Each agent is defined via an AgentConfig interface containing static configuration. The definitions system uses a declarative argument-builder pattern rather than hardcoding CLI construction logic:

// definitions.ts line 71
export interface AgentConfig {
    id: string;
    name: string;
    binaryName: string;
    command: string;
    args: string[];                          // Base args always included
    batchModePrefix?: string[];              // Subcommand for batch mode (e.g., ['exec'] for Codex)
    batchModeArgs?: string[];                // Args only in batch mode
    jsonOutputArgs?: string[];               // Args for JSON output
    resumeArgs?: (id: string) => string[];   // Session resume builder
    readOnlyArgs?: string[];                 // Read-only/plan mode
    modelArgs?: (id: string) => string[];    // Model selection builder
    yoloModeArgs?: string[];                 // Full-access/unsafe mode
    workingDirArgs?: (dir: string) => string[];  // Working directory
    imageArgs?: (path: string) => string[];  // Image attachment
    promptArgs?: (prompt: string) => string[]; // Prompt argument builder
    noPromptSeparator?: boolean;             // Skip '--' before prompt
    defaultEnvVars?: Record<string, string>; // Default env vars
    configOptions?: AgentConfigOption[];     // UI-configurable settings
    capabilities: AgentCapabilities;         // Feature capability flags
}

Key design observations across the agent definitions:

  1. Claude Code always runs with --dangerously-skip-permissions (YOLO mode). This is a deliberate choice documented in the definitions: "Maestro requires it."

  2. Codex uses a subcommand pattern (codex exec) with its own set of batch-mode-only args (--dangerously-bypass-approvals-and-sandbox, --skip-git-repo-check). The --json flag must come before the resume subcommand in the argument ordering.

  3. OpenCode uses environment variable injection for YOLO mode rather than CLI flags:

// definitions.ts line 223
defaultEnvVars: {
    OPENCODE_CONFIG_CONTENT: '{"permission":{"*":"allow","external_directory":"allow","question":"deny"},"tools":{"question":false}}'
}

The question tool is disabled in two places (the "deny" permission and the tools flag) because it waits for stdin input, which hangs batch mode.

  4. Factory Droid runs with --skip-permissions-unsafe, and read-only is the DEFAULT mode for droid exec. It supports a reasoningEffort configuration option with values low, medium, high.

  5. Aider is listed as a placeholder definition with no configuration, signaling future support for this popular open-source AI coding tool.

  6. UI-configurable options use discriminated union types (checkbox, text, number, select) with type-safe argBuilder functions that map config values to CLI arguments at runtime.
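
The value of this declarative pattern is that command-line assembly becomes a pure function over the config. A minimal sketch, assuming an assembly order inferred from the observations above (the buildAgentArgs helper and AgentConfigSketch names are hypothetical, not Maestro's actual spawning code):

```typescript
// Hypothetical sketch of assembling CLI args from a declarative agent config.
// Field names mirror the AgentConfig interface above; the assembly order is an
// assumption for illustration.
interface AgentConfigSketch {
  command: string;
  args: string[];                        // base args always included
  batchModePrefix?: string[];            // e.g. ['exec'] for Codex
  batchModeArgs?: string[];
  jsonOutputArgs?: string[];
  resumeArgs?: (id: string) => string[];
  noPromptSeparator?: boolean;
}

function buildAgentArgs(
  cfg: AgentConfigSketch,
  opts: { sessionId?: string; prompt: string },
): string[] {
  const args: string[] = [
    ...(cfg.batchModePrefix ?? []),
    ...cfg.args,
    ...(cfg.batchModeArgs ?? []),
    ...(cfg.jsonOutputArgs ?? []), // --json must precede the resume subcommand
  ];
  if (opts.sessionId && cfg.resumeArgs) args.push(...cfg.resumeArgs(opts.sessionId));
  if (!cfg.noPromptSeparator) args.push('--'); // separate flags from the prompt
  args.push(opts.prompt);
  return args;
}

const codexLike: AgentConfigSketch = {
  command: 'codex',
  args: [],
  batchModePrefix: ['exec'],
  jsonOutputArgs: ['--json'],
  resumeArgs: (id) => ['resume', id],
  noPromptSeparator: true,
};
const argv = buildAgentArgs(codexLike, { prompt: 'fix the bug' });
// argv → ['exec', '--json', 'fix the bug']
```

Because construction is data-driven, adding a new provider mostly means filling in fields rather than writing new spawn logic.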

17.4 Adding New Providers

The process is well-documented in AGENT_SUPPORT.md (843 lines):

  1. Add agent definition to agent-detector.ts
  2. Define capabilities in agent-capabilities.ts
  3. Create output parser in parsers/
  4. Register parser in parsers/index.ts
  5. (Optional) Create session storage in storage/
  6. (Optional) Add error patterns

Each agent starts with all capabilities false and enables them as verified.
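
This deny-by-default posture can be sketched as an all-false defaults object that each agent spread-overrides as individual capabilities are verified (the flag names here are a representative subset of the real 20+, chosen for illustration):

```typescript
// Representative subset of capability flags (the real interface has 20+).
interface AgentCapabilitiesSketch {
  resume: boolean;
  readOnlyMode: boolean;
  jsonOutput: boolean;
  imageInput: boolean;
  costReporting: boolean;
}

// New agents start with everything disabled...
const NO_CAPABILITIES: AgentCapabilitiesSketch = {
  resume: false,
  readOnlyMode: false,
  jsonOutput: false,
  imageInput: false,
  costReporting: false,
};

// Placeholder agents (Gemini CLI, Qwen3 Coder, Aider) stay all-false.
const geminiCliPlaceholder = { ...NO_CAPABILITIES };

// ...and flags are flipped on one at a time as each is verified against the CLI.
const claudeCode: AgentCapabilitiesSketch = {
  ...NO_CAPABILITIES,
  resume: true,        // --resume verified
  readOnlyMode: true,  // --permission-mode plan verified
  jsonOutput: true,    // stream-json verified
  imageInput: true,
  costReporting: true,
};
```

The spread pattern makes it impossible for a new agent to silently inherit a capability it has not earned.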

17.5 MCP Server

Maestro provides a hosted MCP (Model Context Protocol) server at https://docs.runmaestro.ai/mcp with a SearchMaestro tool for documentation search. This allows external AI tools (Claude Desktop, Claude Code) to search Maestro's knowledge base.

17.6 Spec-Kit and OpenSpec Integration

Bundled spec-driven workflow systems:

  • Spec-Kit: GitHub's spec-kit prompts (src/prompts/speckit/)
  • OpenSpec: Fission-AI's OpenSpec prompts (src/prompts/openspec/)

Both are refreshed from upstream via scripts:

npm run refresh-speckit   # Fetch latest from github/spec-kit
npm run refresh-openspec  # Fetch latest from Fission-AI/OpenSpec

17.7 What's Missing

  • No direct API provider support (only CLI-based agents, not API-based)
  • No local model integration (except through OpenCode's Ollama support)
  • No MCP client (Maestro serves MCP, but doesn't consume MCP tools from external servers)
  • No plugin marketplace for third-party integrations

Confidence: High -- From AGENT_SUPPORT.md and capabilities source code.


18. Operational Assumptions and Constraints

18.1 Explicit Assumptions

  1. User has at least one AI agent installed and authenticated (Claude Code, Codex, OpenCode, or Factory Droid)
  2. User has Git installed (for git-aware features)
  3. Agents run in batch/headless mode -- Each task gets a prompt and returns a response (not interactive)
  4. Maestro is a pass-through -- Whatever MCP tools, skills, permissions the agent has configured works identically
  5. Each task gets a fresh session (for clean conversation context in Auto Run)
  6. Agents can execute commands with user privileges -- No sandboxing

18.2 Platform Constraints

From CLAUDE-PLATFORM.md:

  • Path separators differ between platforms
  • Shell detection differs (PowerShell on Windows, zsh/bash on Unix)
  • macOS Alt key produces special characters (must use e.code not e.key)
  • Windows has 8KB command line limit (use stdin for long prompts)
  • SSH remote execution doesn't support file watching
  • Git stat format differs between GNU and BSD

18.3 Performance Constraints

From CLAUDE-PERFORMANCE.md:

  • AI streaming triggers 100+ IPC updates/second (batched to ~6 renders/second via 150ms batching)
  • Agent persistence uses 2-second debounce
  • Git status polling uses 3-second intervals (paused when app is hidden)
  • Model list cache: 5-minute TTL
  • Symphony registry cache: 2-hour TTL
  • Issues cache: 5-minute TTL
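
The 150ms batching strategy can be sketched as a buffer that accumulates stream chunks and flushes at most once per window (~6-7 renderer updates/second) no matter how many IPC events arrive. A minimal sketch (the StreamBatcher shape is assumed; Maestro's real batcher lives in its IPC layer):

```typescript
// Interval-based event batching sketch: chunks accumulate and are flushed to
// the renderer at most once per `intervalMs`.
class StreamBatcher {
  private buffer: string[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;
  public flushCount = 0;

  constructor(
    private onFlush: (chunks: string[]) => void,
    private intervalMs = 150,
  ) {}

  push(chunk: string): void {
    this.buffer.push(chunk);
    if (this.timer === null) {
      // First chunk in a window schedules exactly one flush for that window.
      this.timer = setTimeout(() => this.flush(), this.intervalMs);
    }
  }

  flush(): void {
    if (this.timer !== null) { clearTimeout(this.timer); this.timer = null; }
    if (this.buffer.length === 0) return;
    this.onFlush(this.buffer); // one render-triggering IPC send
    this.buffer = [];
    this.flushCount++;
  }
}

// 100 rapid streaming events collapse into a single flush.
const rendered: string[][] = [];
const batcher = new StreamBatcher((chunks) => rendered.push(chunks));
for (let i = 0; i < 100; i++) batcher.push(`token-${i}`);
batcher.flush(); // close the pending window for the demo
```

The same shape (schedule-once, flush-latest) underlies the 2-second persistence debounce and the polling pauses listed above.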

18.4 Operational Constraints

  • Node.js >= 22.0.0 required (newer than typical Node LTS)
  • Native module compilation (node-pty, better-sqlite3) requires build tools
  • Electron app size (likely 100MB+ installed)
  • Single user per instance (no multi-user support)
  • GitHub CLI required for Symphony (for PR creation)
  • Cloudflare CLI required for remote tunnels (for remote access)

Confidence: High -- From platform documentation and package.json.


19. Failure Modes and Issues Observed

19.1 Documented Failure Patterns

From CLAUDE.md (lines 289-295):

Historical patterns that wasted time:
- Tab naming bug: Modal coordination was "fixed" when the actual issue was
  an unregistered IPC handler
- Tooltip clipping: Attempted overflow:visible on element when parent
  container had overflow:hidden
- Session validation: Fixed renderer calls when handler wasn't wired in main process

19.2 Cross-Architecture Build Issues

The CI workflow (release.yml) has extensive architecture verification steps (5+ verification steps per platform), indicating past issues with:

  • Cross-architecture binary contamination (ARM64 prebuilds contaminating x64 builds)
  • Incorrect native module compilation
  • Cache key collisions between architectures

19.3 Electron Limitations

  • File System Access API not fully supported in Electron 28 (Chrome DevTools "Save profile" fails)
  • WSL environment requires GPU acceleration to be auto-disabled (EGL/GPU process crash issues)

19.4 Agent-Specific Issues

  • Claude Code may not immediately exit on SIGINT (requires SIGTERM escalation after 2 seconds)
  • OpenCode session storage is marked as "stub ready" (not fully implemented)
  • Gemini CLI and Qwen3 Coder are "PLACEHOLDER" (capabilities unknown)
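
The SIGINT-then-SIGTERM escalation is a standard pattern for CLIs that trap Ctrl-C. A sketch of the decision logic, testable against a minimal interface rather than a real child process (the helper names are assumptions; the 2-second grace window comes from the report and would be a setTimeout in the real app):

```typescript
// Minimal interface standing in for a Node child process, so the escalation
// logic is testable without actually spawning an agent.
interface Killable {
  exitCode: number | null; // null while still running
  kill(signal: string): void;
}

const sent: string[] = [];
const hungAgent: Killable = {
  exitCode: null, // never exits: simulates Claude Code ignoring SIGINT
  kill: (sig) => sent.push(sig),
};

// Step 1: polite interrupt.
function requestStop(proc: Killable): void {
  proc.kill('SIGINT');
}

// Step 2: invoked after the grace period (2 seconds in the real app);
// escalates only if the process still has not exited.
function escalateIfStillRunning(proc: Killable): boolean {
  if (proc.exitCode === null) {
    proc.kill('SIGTERM');
    return true;
  }
  return false;
}

requestStop(hungAgent);
const escalated = escalateIfStillRunning(hungAgent);
// sent → ['SIGINT', 'SIGTERM']
```

A well-behaved agent exits during the grace window and never sees SIGTERM; a hung one gets forced.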

19.5 Symphony Risks

From SYMPHONY_ISSUES.md:

  • Issues can be "claimed" by creating a draft PR, but there's no lock mechanism (race condition possible)
  • External document URLs restricted to GitHub domains only (prevents arbitrary URL injection)
  • Path traversal attacks prevented via validation

19.6 Database Resilience (Positive Finding)

Counter to initial expectations, the stats database has a robust resilience strategy (see Section 15.3 for full details):

  • Daily automatic backups with 7-day rotation
  • Corruption detection via PRAGMA integrity_check on every startup
  • Automated recovery: tries each backup in order, falls back to fresh database
  • Stale WAL/SHM file cleanup to prevent false corruption detection
  • WAL checkpoint before backup to ensure self-contained copies

This is one of the more carefully engineered subsystems in Maestro.
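
The recovery strategy reduces to an ordered fallback: validate the live file, else try each backup newest-first, else start fresh. A runtime-agnostic sketch of that control flow (checkIntegrity stands in for PRAGMA integrity_check and the function names are hypothetical; the real code uses better-sqlite3):

```typescript
// Ordered-fallback recovery sketch. Integrity checking and restoration are
// injected so the decision logic is visible and testable without SQLite.
type DbSource = { name: string; healthy: boolean };

function recoverDatabase(
  live: DbSource,
  backups: DbSource[], // newest first, from the 7-day rotation
  checkIntegrity: (db: DbSource) => boolean,
): string {
  if (checkIntegrity(live)) return live.name;        // normal startup path
  for (const backup of backups) {
    if (checkIntegrity(backup)) return backup.name;  // restore this backup
  }
  return 'fresh-database'; // last resort: start empty rather than crash
}

const integrity = (db: DbSource) => db.healthy; // stand-in for PRAGMA integrity_check
const chosen = recoverDatabase(
  { name: 'stats.db', healthy: false },       // corrupted live file
  [
    { name: 'backup-day1.db', healthy: false }, // also corrupted
    { name: 'backup-day2.db', healthy: true },  // first good backup wins
  ],
  integrity,
);
```

The WAL checkpoint-before-backup step described above matters here: without it, a restored backup could itself fail the integrity check because its WAL file was left behind.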

19.7 Potential Failure Modes Not Documented

  1. Agent hang without exit -- No watchdog to kill agents that hang indefinitely. The 5-minute timeout exists only for grooming sessions, not for regular agent queries.

  2. Session settings data loss -- The electron-store persistence uses a 2-second debounce (useDebouncedPersistence). A crash during this window loses up to 2 seconds of state changes. The beforeunload flush mitigates but does not eliminate this risk.

  3. Memory leaks from event listeners -- The ProcessManager extends EventEmitter and many modules attach listeners (Group Chat moderator, grooming sessions, IPC handlers). Each grooming session attaches data, exit, and agent-error listeners with cleanup functions, but complex error paths could leave orphaned listeners.

  4. Concurrent worktree conflicts -- Multiple worktrees from different agents could modify overlapping files. Git will handle merge conflicts at the branch level, but runtime file locking is not implemented.

  5. Group Chat state is entirely in-memory -- Active moderator sessions, participant sessions, and activity timestamps are all stored in JavaScript Map objects. A crash or unexpected restart loses all group chat state. The only persisted data is the chat log (JSONL on disk) and metadata (chat.json).

  6. SSH spawn wrapper command length -- On Windows, commands passed to SSH have an 8KB limit. Long prompts are sent via stdin to avoid this, but the fallback behavior if stdin writing fails is unclear.

  7. Race condition in Symphony issue claiming -- From SYMPHONY_ISSUES.md: Issues can be "claimed" by creating a draft PR, but there is no server-side lock mechanism. Two users could simultaneously claim the same issue.
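
The debounce data-loss window in item 2 and its beforeunload mitigation can be sketched as a debouncer that exposes an explicit flush() for the unload handler to call (the class name and shape are assumptions mirroring the useDebouncedPersistence pattern described above):

```typescript
// Debounced persistence sketch: saves coalesce inside the delay window, and an
// explicit flush() (called from beforeunload) shrinks -- but cannot fully
// eliminate -- the crash-loss window.
class DebouncedPersister<T> {
  private pending: T | null = null;
  private timer: ReturnType<typeof setTimeout> | null = null;
  public writes: T[] = []; // stands in for electron-store writes

  constructor(private delayMs = 2000) {}

  save(state: T): void {
    this.pending = state;
    if (this.timer !== null) clearTimeout(this.timer);
    // Crash before this fires => up to delayMs of state changes lost.
    this.timer = setTimeout(() => this.flush(), this.delayMs);
  }

  flush(): void {
    if (this.timer !== null) { clearTimeout(this.timer); this.timer = null; }
    if (this.pending !== null) {
      this.writes.push(this.pending); // only the latest state is written
      this.pending = null;
    }
  }
}

const persister = new DebouncedPersister<string>();
persister.save('state-v1');
persister.save('state-v2'); // supersedes v1 inside the debounce window
persister.flush();          // beforeunload path: one write, latest state
```

Note that flush() only helps on orderly shutdown; a hard crash between save() and the timer firing still loses the pending state, which is the residual risk item 2 identifies.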

Confidence: Medium -- Some failure modes are inferred from architectural analysis rather than documented or observed.


20. Governance and Guardrails

20.1 Code Governance

  • Pre-commit hooks (Husky + lint-staged): Prettier + ESLint on staged files
  • TypeScript strict mode: Across all 3 build configs
  • Automated PR review: CodeRabbit + Greptile
  • Conventional commits: feat:, fix:, docs:, refactor:, test:, chore:
  • CONTRIBUTING.md: 1122 lines of detailed contribution guidelines
  • PR checklist: Linting, tests, manual testing, no console errors, theme testing

20.2 Agent Guardrails

  • Read-only mode: Per-tab toggle to prevent file modifications
  • Error modals: Block input when agent errors occur
  • Pause/stop: For batch processing
  • YOLO mode documentation: Explicit documentation that Codex runs with --dangerously-bypass-approvals-and-sandbox by default

20.3 Security Guardrails

  • execFileNoThrow: Mandatory for all external commands (no shell injection)
  • Input validation: URL validation, path traversal prevention, repo slug validation
  • Context isolation: Electron security best practices
  • SECURITY.md: Formal vulnerability reporting process
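
The shell-injection guarantee comes from passing an argument array instead of a shell string. A sketch of an execFileNoThrow-style wrapper (the name comes from the report; this synchronous shape and the result type are assumptions for illustration):

```typescript
import { execFileSync } from 'node:child_process';

// Sketch of a never-throwing execFile wrapper. With no `shell: true`, args are
// passed verbatim to the binary, so metacharacters like `; rm -rf /` stay
// literal strings instead of being interpreted by a shell.
interface ExecResult {
  ok: boolean;
  stdout: string;
  error?: string;
}

function execFileNoThrow(file: string, args: string[]): ExecResult {
  try {
    const stdout = execFileSync(file, args, { encoding: 'utf8' });
    return { ok: true, stdout };
  } catch (err) {
    // Errors become data, not exceptions: callers must inspect `ok`.
    return { ok: false, stdout: '', error: (err as Error).message };
  }
}

// Demo: run the current Node binary with an argument array.
const res = execFileNoThrow(process.execPath, ['-e', 'console.log("hi")']);
```

The real helper is presumably async; the key properties are the same either way: no shell, and no thrown exceptions on the happy path of callers.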

20.4 What's Missing

  • No cost guardrails (no spending limits, no alerts)
  • No mandatory code review for Auto Run output
  • No agent permission system (agents run with full user privileges)
  • No content moderation for Group Chat
  • No rate limiting on agent spawning (could spawn unlimited processes)
  • No resource quotas per agent (CPU, memory, disk)

Confidence: High -- From governance documentation and source code.


21. Roadmap/Evolution Signals, Missing Areas, Unresolved Issues

21.1 Active Development Signals

  • Version 0.15.0 -- Still pre-1.0, rapid iteration expected
  • CONTRIBUTING.md note: "The project is currently changing rapidly, there's a high likelihood that PRs will be out of sync"
  • Encore Features system: Precursor to a full plugin marketplace
  • Symphony: Community contribution platform (recently added)
  • Director's Notes: First Encore Feature (AI-generated synopsis of work)

21.2 Planned Agents

From /tmp/ai-harness-repos/Maestro/src/main/agents/definitions.ts:

  • Gemini CLI (id: 'gemini-cli', binaryName: 'gemini') -- Minimal definition, no batch mode args, no output parser, placeholder capabilities (all false)
  • Qwen3 Coder (id: 'qwen3-coder', binaryName: 'qwen3-coder') -- Minimal definition, same status
  • Aider (id: 'aider', binaryName: 'aider') -- Recently added placeholder definition with zero configuration, signaling potential future support for this popular open-source coding tool

To add any of these as fully supported agents requires implementing:

  1. Output parser class (extends AgentOutputParser interface)
  2. Error pattern definitions (regex patterns for each error type)
  3. Capability flags (currently all set to false for placeholders)
  4. Session storage module (for session discovery/resume)
  5. Batch mode argument construction (batchModePrefix, jsonOutputArgs, etc.)

21.3 Encore Features System (Plugin Precursor)

The Encore Features system is a feature-gating mechanism that serves as a precursor to a full plugin marketplace. From CONTRIBUTING.md:

// Encore feature definition pattern
interface EncoreFeature {
    id: string;
    name: string;
    description: string;
    enabled: boolean;  // Defaults to false
}

When a feature is disabled:

  • Its UI components are not rendered
  • Its keyboard shortcuts are not registered
  • Its menu items are not visible
  • Its IPC handlers may still be registered but are unreachable from the UI

The first Encore Feature is Director's Notes: an AI-generated synopsis of the work performed in a session, using the director-notes.md prompt. This demonstrates the pattern for future plugin-like features.
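
The gating mechanism itself is simple: every surface (component rendering, shortcut registration, menu construction) consults one isEnabled check, so a single flag flips all of them together. A sketch, assuming a registry shape (the EncoreRegistry name is hypothetical; the EncoreFeature fields match the definition above):

```typescript
// Hypothetical registry sketch for Encore-style feature gating.
interface EncoreFeature {
  id: string;
  name: string;
  description: string;
  enabled: boolean; // defaults to false
}

class EncoreRegistry {
  private features = new Map<string, EncoreFeature>();

  register(feature: EncoreFeature): void {
    this.features.set(feature.id, feature);
  }

  // UI render, shortcut registration, and menu building all call this.
  isEnabled(id: string): boolean {
    return this.features.get(id)?.enabled ?? false; // unknown features are off
  }

  setEnabled(id: string, enabled: boolean): void {
    const f = this.features.get(id);
    if (f) f.enabled = enabled;
  }
}

const registry = new EncoreRegistry();
registry.register({
  id: 'directors-notes',
  name: "Director's Notes",
  description: 'AI-generated synopsis of session work',
  enabled: false, // off by default, like all Encore Features
});

const beforeOptIn = registry.isEnabled('directors-notes'); // false
registry.setEnabled('directors-notes', true);
const afterOptIn = registry.isEnabled('directors-notes');  // true
```

Because IPC handlers are registered regardless of the flag (as noted above), the gate is a UI-reachability gate rather than a capability gate, which is one reason it falls short of true plugin isolation.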

What's missing for a true plugin system:

  • No plugin loading mechanism (all features must be compiled into the app)
  • No plugin lifecycle hooks (install, enable, disable, uninstall)
  • No plugin manifest format or registry
  • No plugin isolation (all features share the same process)
  • No third-party plugin support

21.4 Identified Gaps

  1. No automatic planning -- Maestro doesn't generate plans from high-level specs. Users must manually create Auto Run documents.
  2. No dependency-aware task ordering -- Tasks within documents are sequential checkboxes. No DAG execution.
  3. No inter-agent communication (except Group Chat) -- Agent A can't directly feed output to Agent B.
  4. No automated testing integration -- No built-in test runner, no CI integration.
  5. No rollback mechanism -- No way to automatically revert bad agent changes.
  6. No context-aware agent selection -- User must choose which agent to use for each task.
  7. No cost optimization -- No model selection based on task complexity.
  8. No persistent knowledge base -- No vector store, no RAG integration.
  9. No collaborative editing -- Single user per instance.
  10. No API/webhook integration -- Only CLI and desktop app.

21.5 Unresolved Architecture Decisions

  • OpenCode session storage: Marked as "stub, needs implementation" (AGENT_SUPPORT.md line 666)
  • Electron version: Still on 28, significantly behind current (35+)
  • React version: Still on 18, behind current (19+)
  • Plugin system: Encore Features is a stepping stone but no full plugin API exists

Confidence: Medium-High -- Based on TODO markers in code and documentation gaps.


22. Current Gaps That Other Projects Might Fill

22.1 From superpowers (Hypothetical Learnings)

Areas where Maestro could benefit from superpowers-style approaches:

  • Enhanced CLAUDE.md management -- Maestro has an excellent CLAUDE.md ecosystem but could benefit from automated generation/maintenance
  • MCP tool composition -- Maestro doesn't currently consume MCP tools; superpowers' MCP patterns could inform this
  • Shell integration patterns -- Maestro wraps agents as child processes; shell-level hooks could enhance this

22.2 From everything-claude-code (Hypothetical Learnings)

Areas where Maestro could benefit:

  • Curated prompt libraries -- Maestro bundles Spec-Kit and OpenSpec but could have a more extensive prompt ecosystem
  • Configuration presets -- everything-claude-code's settings optimization could inform Maestro's defaults
  • CLAUDE.md templates -- Project-type-specific templates for Auto Run documents

22.3 From agent-orchestrator (Hypothetical Learnings)

Areas where Maestro could benefit:

  • Automated plan generation -- Maestro requires manual Auto Run document creation; agent-orchestrator's planning phase could automate this
  • Dependency-aware execution -- DAG-based task ordering instead of sequential checkboxes
  • Automated verification -- Post-task validation (test running, lint checking)
  • Cost-aware agent selection -- Choosing the right model for each task
  • Result synthesis -- Automated merging of multi-agent outputs (beyond Group Chat)
  • Subagent spawning -- Dynamic creation of specialized agents for subtasks
  • Context management automation -- Automatic compaction, chunking, and transfer

22.4 Specific Improvement Opportunities

  1. Plan Generation Layer: Add a planning step before Auto Run that decomposes high-level specs into Auto Run documents automatically.

  2. Automated Quality Gates: After each Auto Run task, run tests/lints before proceeding to the next task.

  3. Cost Budgets: Set per-playbook or per-agent spending limits with alerts and automatic pause.

  4. Context Intelligence: Automatic context compaction when approaching window limits, cross-session context inheritance.

  5. Agent Pipeline: Allow chaining agents where output of one feeds into another (beyond Group Chat's conversational model).

  6. Plugin API: Formalize the Encore Features system into a full plugin API with lifecycle hooks.

  7. REST API: Add a web API for external automation (CI/CD integration, custom dashboards).

  8. Verification Framework: Built-in test runner integration, lint checking, and code review gates.

  9. Rollback System: Automatic git checkpoint before each task, with easy rollback on failure.

  10. Smart Agent Selection: Based on task type, automatically select the most cost-effective provider/model.

Confidence: Medium -- These are synthesis recommendations based on gap analysis; specific applicability depends on the other projects' actual implementations.


23. Cross-Links

Related Analysis Documents

  • superpowers-deep-analysis.md

    • Section 2 (Design Philosophy) -- Compare with Maestro's Constitution
    • Section 8 (Context Handling) -- Compare with Maestro's context merge/groom
    • Section 13 (CLI/UX) -- Compare with Maestro's keyboard-first approach
    • Section 17 (External Integrations) -- Compare MCP patterns
  • everything-claude-code-deep-analysis.md

    • Section 2 (Design Philosophy) -- Compare curated vs. orchestrator approaches
    • Section 4 (Harness Workflow) -- Compare prompt library vs. Auto Run documents
    • Section 9 (Session Lifecycle) -- Compare configuration management
    • Section 16 (Tooling) -- Compare dependency surfaces
  • agent-orchestrator-deep-analysis.md

    • Section 4 (Harness Workflow) -- Compare plan generation approaches
    • Section 5 (Subagent Orchestration) -- Compare with Maestro's Group Chat
    • Section 6 (Parallelization) -- Compare concurrency models
    • Section 7 (Isolation) -- Compare worktree vs. other isolation approaches
    • Section 8 (Human-in-the-Loop) -- Compare approval gate designs
    • Section 9 (Context Handling) -- Compare context management strategies
  • harness-consensus-report.md

    • Maestro contributes the most mature implementation for:
      • Multi-provider support
      • Desktop UX / keyboard-first design
      • Auto Run / Playbook task execution
      • Group Chat / multi-agent coordination
      • Mobile remote control
      • Analytics and cost tracking
    • Maestro's gaps that other projects fill:
      • Automated plan generation
      • Dependency-aware task execution
      • Automated verification/quality gates
      • Cost governance with budgets
  • final-harness-gap-report.md

    • Priority improvement areas for Maestro:
      1. Plan generation automation
      2. Quality gates in Auto Run
      3. Cost budgets and governance
      4. Context management automation
      5. Plugin API formalization
      6. REST API for external integration

Appendix A: File Index (Key Files Referenced)

File Purpose Lines Read
/tmp/ai-harness-repos/Maestro/README.md Project overview Full (181)
/tmp/ai-harness-repos/Maestro/ARCHITECTURE.md Technical architecture Full (1673)
/tmp/ai-harness-repos/Maestro/CLAUDE.md Development guide Full (331)
/tmp/ai-harness-repos/Maestro/CONSTITUTION.md Design philosophy Full (178)
/tmp/ai-harness-repos/Maestro/CONTRIBUTING.md Development setup Full (1122)
/tmp/ai-harness-repos/Maestro/SECURITY.md Security policy Full (95)
/tmp/ai-harness-repos/Maestro/AGENT_SUPPORT.md Provider integration Full (843)
/tmp/ai-harness-repos/Maestro/CLAUDE-PATTERNS.md Implementation patterns Full (349)
/tmp/ai-harness-repos/Maestro/CLAUDE-SESSION.md Session data model Full (134)
/tmp/ai-harness-repos/Maestro/CLAUDE-PERFORMANCE.md Performance guidelines Full (268)
/tmp/ai-harness-repos/Maestro/CLAUDE-AGENTS.md Agent support Full (73)
/tmp/ai-harness-repos/Maestro/CLAUDE-FEATURES.md Dashboard/Graph features Full (176)
/tmp/ai-harness-repos/Maestro/CLAUDE-PLATFORM.md Cross-platform concerns Full (222)
/tmp/ai-harness-repos/Maestro/SYMPHONY_REGISTRY.md Symphony registry Full (159)
/tmp/ai-harness-repos/Maestro/SYMPHONY_ISSUES.md Symphony issues Full (196)
/tmp/ai-harness-repos/Maestro/THEMES.md Theme system Referenced
/tmp/ai-harness-repos/Maestro/package.json Dependencies/scripts Full (318)
/tmp/ai-harness-repos/Maestro/src/main/index.ts Main entry point 200 lines
/tmp/ai-harness-repos/Maestro/src/main/process-manager/ProcessManager.ts Process management 200 lines
/tmp/ai-harness-repos/Maestro/src/main/process-manager/types.ts Process types Full (142)
/tmp/ai-harness-repos/Maestro/src/main/agents/capabilities.ts Agent capabilities Full (334)
/tmp/ai-harness-repos/Maestro/src/main/agents/detector.ts Agent detection 150 lines
/tmp/ai-harness-repos/Maestro/src/main/group-chat/group-chat-router.ts Group chat routing 200 lines
/tmp/ai-harness-repos/Maestro/src/main/ipc/handlers/symphony.ts Symphony handlers 200 lines
/tmp/ai-harness-repos/Maestro/src/main/ipc/handlers/context.ts Context merge Full (508)
/tmp/ai-harness-repos/Maestro/src/main/parsers/index.ts Parser registry Full (103)
/tmp/ai-harness-repos/Maestro/src/main/parsers/claude-output-parser.ts Claude parser Full (505)
/tmp/ai-harness-repos/Maestro/src/main/parsers/error-patterns.ts Error detection Full (1015)
/tmp/ai-harness-repos/Maestro/src/main/stats/schema.ts Database schema Full (142)
/tmp/ai-harness-repos/Maestro/src/main/stats/stats-db.ts Stats DB core Full (833)
/tmp/ai-harness-repos/Maestro/src/main/group-chat/group-chat-moderator.ts Moderator mgmt Full (290)
/tmp/ai-harness-repos/Maestro/src/main/group-chat/group-chat-agent.ts Participant mgmt Full (429)
/tmp/ai-harness-repos/Maestro/src/main/agents/definitions.ts Agent definitions Full (367)
/tmp/ai-harness-repos/Maestro/src/cli/index.ts CLI entry point Full (113)
/tmp/ai-harness-repos/Maestro/src/cli/services/batch-processor.ts Batch execution 150 lines
/tmp/ai-harness-repos/Maestro/src/cli/services/agent-spawner.ts Agent spawning 150 lines
/tmp/ai-harness-repos/Maestro/.github/workflows/release.yml CI/CD Full (782)
/tmp/ai-harness-repos/Maestro/docs/symphony.md Symphony docs 100 lines

Appendix B: Codebase Statistics

Metric Value
Total TypeScript lines ~672,000
Source files (.ts/.tsx) ~1,200
Test files ~490
Main process handler modules 30
IPC namespaces in preload 17+
Custom React hooks 15+
Themes 16
Modal priority levels 30+
Keyboard shortcuts 30+
Supported AI agents 4 active, 3 planned (Gemini CLI, Qwen3 Coder, Aider)
Agent capability flags 20 per agent
Agent config option types 4 (checkbox, text, number, select)
Output parser implementations 4 (Claude, Codex, OpenCode, Factory Droid)
Error pattern definitions 5 sets (4 agents + SSH), ~100 individual patterns
Error types detected 7 (auth, token, rate, network, permission, crash, session)
Stats database tables 6 (query_events, auto_run_sessions, auto_run_tasks, session_lifecycle, _migrations, _meta)
System prompts (markdown) 24
Group Chat prompt templates 4 (moderator system, moderator synthesis, participant, participant request)
Documentation pages 25+
Source files read for this analysis 35+

Appendix C: Confidence Summary

Section Confidence Basis
Design Philosophy High Direct from CONSTITUTION.md
Core Architecture High Source code + ARCHITECTURE.md
Output Parser Architecture High Full source code read (claude-output-parser.ts, index.ts)
Error Pattern System High Full source code read (error-patterns.ts, 1015 lines)
Harness Workflow High Source code + documentation
Orchestration Model High Source code + architecture docs
Group Chat Implementation High Full source (moderator.ts, agent.ts, router.ts)
Parallelization High Source code analysis
Isolation Model High Architecture docs + SECURITY.md
Human-in-the-Loop High Source code + UI analysis
Context Handling High IPC handlers + prompts
Session Lifecycle High Session model + persistence code
Code Quality Gates High CI config + test infrastructure
Security High SECURITY.md + source code
Automation Surface High CLI source + IPC analysis
CLI/UX High Documentation + source code
Cost Visibility High Stats system + dashboard code
Stats Database Architecture High Full source code read (stats-db.ts, schema.ts)
Dependencies High package.json + build config
Provider Compatibility High AGENT_SUPPORT.md + capabilities
Agent Definitions High Full source code read (definitions.ts, 367 lines)
Operational Assumptions High Platform docs + configuration
Failure Modes Medium Mix of documented + inferred
Database Resilience High Full source code read (stats-db.ts recovery paths)
Governance High Contributing docs + hooks
Roadmap Signals Medium-High TODO markers + placeholder code
Prompt System High Full directory listing + template analysis
Encore Features High Documentation + implementation patterns
Gap Analysis Medium Synthesis recommendations

End of analysis. Total source files read for this report: 35+. Total lines of source code analyzed: ~5,000+.

Superpowers (obra/superpowers) -- Deep Technical Analysis

Repository: https://github.com/obra/superpowers
Version Analyzed: v4.3.1 (2026-02-21)
Author: Jesse Vincent (obra)
License: MIT
Analysis Date: 2026-02-22


Table of Contents

  1. Executive Summary
  2. Design Philosophy and Abstractions
  3. Core Architecture Model
  4. Harness Workflow: Spec to Plan to Execute to Verify to Merge
  5. Subagent/Task Orchestration Model
  6. Multi-Agent / Parallelization Strategy
  7. Isolation Model
  8. Human-in-the-Loop Controls
  9. Context Handling Strategy
  10. Session Lifecycle and Persistence
  11. Code Quality Gates
  12. Security and Compliance Mechanisms
  13. Hooks, Automation Surface, and Fail-Safe Behavior
  14. CLI/UX and Automation Ergonomics
  15. Cost/Usage Visibility and Governance
  16. Tooling and Dependency Surface
  17. External Integrations and Provider Compatibility
  18. Operational Assumptions and Constraints
  19. Failure Modes and Issues Observed
  20. Governance and Guardrails
  21. Roadmap/Evolution Signals, Missing Areas, Unresolved Issues
  22. What Should Be Borrowed/Adapted into Maestro and What Should Not
  23. Cross-Links

1. Executive Summary

Superpowers is a skills-based prompt engineering framework that transforms how AI coding agents (primarily Claude Code, but also Cursor, Codex, and OpenCode) approach software development. It is NOT a traditional harness with executable orchestration code -- instead, it is a collection of markdown skill documents and a thin bootstrap mechanism that injects behavioral instructions into AI agent sessions at startup.

The core innovation is treating agent behavior documentation as code: skills are TDD-tested against agent behavior, iteratively hardened against rationalization, and composed into a complete development workflow. The framework enforces a mandatory pipeline: brainstorm -> design -> plan -> execute (via subagents) -> review -> finish, with multiple quality gates at each stage.

Key differentiators from other harnesses:

  • No runtime orchestrator code -- the AI agent itself IS the orchestrator, guided by skill documents
  • Anti-rationalization engineering -- extensive work on preventing agents from bypassing prescribed workflows
  • Two-stage code review -- spec compliance review THEN code quality review, both as review loops
  • TDD for documentation -- skills themselves are developed using red-green-refactor against agent behavior
  • Multi-platform -- Claude Code, Cursor, Codex, OpenCode all supported with platform-specific adapters

Confidence: High -- all conclusions drawn from reading every file in the repository.

Bottom line: Superpowers is the most methodologically rigorous prompt engineering framework in the AI agent ecosystem. Its anti-rationalization engineering, pressure testing methodology, and TDD-for-docs approach represent genuine innovations that should be adopted. However, its advisory-only enforcement model, zero-persistence design, and agent-as-orchestrator architecture are fundamental limitations that a production harness like Maestro should solve with runtime code rather than prompts.

The framework's greatest strength -- that it requires zero executable code and works purely through markdown -- is simultaneously its greatest weakness: there is no enforcement mechanism beyond the agent's willingness to follow instructions.


2. Design Philosophy and Abstractions

2.1 Core Mental Model

Superpowers embodies the philosophy that AI coding agents are like enthusiastic junior engineers with poor taste, no judgment, no project context, and an aversion to testing (direct quote from /tmp/ai-harness-repos/superpowers/skills/writing-plans/SKILL.md, line 10). The entire framework is designed to impose discipline on this archetype.

The mental model has several layers:

  1. Skills as process documentation -- Not tutorials, not narratives, but prescriptive reference guides that agents load and follow. Skills are "rigid" (TDD, debugging -- follow exactly) or "flexible" (patterns -- adapt principles to context). See /tmp/ai-harness-repos/superpowers/skills/using-superpowers/SKILL.md, lines 87-91.

  2. Agent as self-governing orchestrator -- Rather than having external code dispatch tasks, Superpowers trusts the AI agent to read skill instructions and orchestrate itself. The using-superpowers skill is the meta-skill that enforces this discipline.

  3. Anti-rationalization as first-class concern -- The framework acknowledges that LLMs will rationalize around constraints. Significant engineering effort goes into closing these loopholes through explicit negation tables, red flag lists, and "gate functions." See /tmp/ai-harness-repos/superpowers/skills/writing-skills/persuasion-principles.md for the theoretical foundation (Cialdini 2021, Meincke et al. 2025).

  4. Composable skills over monolithic instructions -- Each skill is a standalone document that can be loaded on demand. The using-superpowers skill establishes the protocol for when and how to load other skills.

2.2 Philosophical Principles

From /tmp/ai-harness-repos/superpowers/README.md, lines 122-128:

  • Test-Driven Development -- enforced via the test-driven-development skill with the "Iron Law": NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST
  • Systematic over ad-hoc -- the systematic-debugging skill requires a 4-phase root cause investigation before any fix
  • Complexity reduction -- YAGNI is enforced at every level: brainstorming, planning, implementation, review
  • Evidence over claims -- the verification-before-completion skill requires running verification commands before any completion claim

2.3 The "Description Trap" Discovery

Confidence: High -- This is a proven finding documented across multiple versions.

A critical discovery documented in /tmp/ai-harness-repos/superpowers/RELEASE-NOTES.md v4.0.0 (lines 273-278): When a skill's YAML description field contains workflow summaries, Claude follows the short description instead of reading the full skill content. For example, a description saying "code review between tasks" caused Claude to do ONE review, even though the skill's flowchart showed TWO reviews (spec compliance then code quality).

Fix: Descriptions must be trigger-only ("Use when X") with no process details. This is now enforced in the writing-skills skill's CSO (Claude Search Optimization) section.

Implication for Maestro: Any system that uses skill/prompt descriptions for routing must be aware that the description itself can override the detailed instructions. Descriptions should only contain triggering conditions.

2.4 DOT Flowcharts as Executable Specifications

Starting in v4.0.0, Superpowers uses GraphViz DOT flowcharts embedded in markdown as the authoritative process definition, with prose as supporting content. This was a deliberate choice -- flowcharts are harder for agents to skip or misinterpret than prose paragraphs.

Evidence: Every major skill (using-superpowers, brainstorming, subagent-driven-development, test-driven-development, dispatching-parallel-agents, systematic-debugging, using-git-worktrees) contains embedded DOT flowcharts.

Tool support: /tmp/ai-harness-repos/superpowers/skills/writing-skills/render-graphs.js renders these flowcharts to SVG for human review.
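A hedged sketch of the extraction half of that pipeline -- pulling fenced DOT blocks out of a SKILL.md before handing them to GraphViz. The actual render-graphs.js is not reproduced here, and the "dot" fence tag convention is an assumption:

```javascript
// Hypothetical sketch: extract fenced DOT blocks from a skill's markdown.
// The real render-graphs.js may differ; the "dot" fence tag is an assumption.
const TICKS = '\x60\x60\x60'; // three backticks, escaped to keep this example self-contained

function extractDotBlocks(markdown) {
  const fence = new RegExp(TICKS + 'dot\\n([\\s\\S]*?)' + TICKS, 'g');
  const blocks = [];
  let match;
  while ((match = fence.exec(markdown)) !== null) {
    blocks.push(match[1].trim());
  }
  return blocks;
}

const skill = [
  '# Example skill',
  TICKS + 'dot',
  'digraph flow { start -> done }',
  TICKS,
].join('\n');

console.log(extractDotBlocks(skill)); // → [ 'digraph flow { start -> done }' ]
```

Each extracted digraph would then be piped to the GraphViz dot binary to produce the SVG that humans review.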

2.5 Foundational Maxims

Several maxims recur across multiple skills and serve as the philosophical bedrock:

  1. "Violating the letter of the rules is violating the spirit of the rules." -- Appears in both test-driven-development (line 14) and verification-before-completion (line 14). This is the anti-loophole principle: agents cannot reinterpret rules to mean something more convenient.

  2. "If you didn't watch it fail, you don't know if it tests the right thing." -- From test-driven-development (line 12). Applied to both code tests AND skill documentation.

  3. "Claiming work is complete without verification is dishonesty, not efficiency." -- From verification-before-completion (line 10). Reframes skipping verification as a moral failure, not a time-saving optimization.

  4. "Honesty is a core value. If you lie, you'll be replaced." -- From verification-before-completion (line 115). Uses existential threat (replacement) as a persuasion mechanism.

  5. "Code review requires technical evaluation, not emotional performance." -- From receiving-code-review (line 10). Directly targets the sycophancy failure mode.

These maxims are not incidental -- they represent deliberate application of the persuasion principles documented in persuasion-principles.md. Each uses one or more of Cialdini's principles (Authority, Commitment/Consistency, Scarcity) to reinforce compliance.


3. Core Architecture Model

3.1 Repository Structure

superpowers/
  .claude-plugin/          # Claude Code plugin manifest
    plugin.json            # Name, version, author, paths to skills/agents/commands/hooks
    marketplace.json       # Dev marketplace config for testing
  .cursor-plugin/          # Cursor plugin manifest
    plugin.json            # Cursor-specific manifest with skills/agents/commands/hooks paths
  .opencode/               # OpenCode plugin
    plugins/superpowers.js # JavaScript plugin that injects bootstrap via system prompt transform
    INSTALL.md             # OpenCode installation instructions
  .codex/                  # Codex integration
    INSTALL.md             # Codex installation instructions (clone + symlink)
  agents/                  # Agent definitions
    code-reviewer.md       # Code reviewer agent with review checklist
  commands/                # Slash commands (user-only, not model-invocable)
    brainstorm.md          # Redirects to brainstorming skill
    write-plan.md          # Redirects to writing-plans skill
    execute-plan.md        # Redirects to executing-plans skill
  hooks/                   # Session lifecycle hooks
    hooks.json             # Hook configuration (SessionStart, sync)
    session-start          # Bash script that injects using-superpowers content
    run-hook.cmd           # Cross-platform polyglot wrapper (Windows + Unix)
  lib/                     # Shared code
    skills-core.js         # ES module for skill discovery/parsing (used by Codex/OpenCode)
  skills/                  # The core content (14 skill directories)
    using-superpowers/     # Meta-skill: how to find and use skills
    brainstorming/         # Design exploration before implementation
    writing-plans/         # Implementation plan creation
    executing-plans/       # Batch execution with checkpoints
    subagent-driven-development/  # Fresh subagent per task with two-stage review
    test-driven-development/      # RED-GREEN-REFACTOR cycle
    systematic-debugging/         # 4-phase root cause investigation
    dispatching-parallel-agents/  # Concurrent subagent workflows
    using-git-worktrees/          # Isolated workspace creation
    finishing-a-development-branch/ # Merge/PR/discard decision workflow
    requesting-code-review/       # Pre-review checklist and dispatch
    receiving-code-review/        # How to respond to feedback
    verification-before-completion/ # Evidence before claims
    writing-skills/               # Meta: how to create new skills (TDD for docs)
  tests/                   # Test suites
    claude-code/           # Integration tests using claude -p
    explicit-skill-requests/ # Tests for explicit skill invocation
    skill-triggering/      # Tests for implicit skill triggering
    subagent-driven-dev/   # End-to-end workflow tests
    opencode/              # OpenCode-specific tests
  docs/                    # Documentation
    testing.md             # Guide to testing skills
    README.codex.md        # Codex-specific docs
    README.opencode.md     # OpenCode-specific docs
    windows/               # Windows-specific docs
    plans/                 # Design documents and improvement plans

3.2 Entry Points

For Claude Code (primary platform):

  1. Plugin installation via marketplace (/plugin marketplace add obra/superpowers-marketplace then /plugin install superpowers@superpowers-marketplace)
  2. Session start hook fires on startup/resume/clear/compact -- runs /tmp/ai-harness-repos/superpowers/hooks/session-start (line 1-51)
  3. Hook output injects the entire using-superpowers skill content wrapped in <EXTREMELY_IMPORTANT> tags into the session context
  4. The using-superpowers skill establishes the mandatory protocol: check for skills BEFORE any response or action

For Cursor:

  1. Plugin installed via Cursor's marketplace with .cursor-plugin/plugin.json
  2. Same session-start hook mechanism, with additional_context field for Cursor compatibility (see /tmp/ai-harness-repos/superpowers/hooks/session-start, lines 41-48)

For OpenCode:

  1. Manual clone + symlink installation
  2. JavaScript plugin at .opencode/plugins/superpowers.js uses experimental.chat.system.transform hook to inject bootstrap into system prompt
  3. Skills discovered via OpenCode's native skill tool from symlinked directory

For Codex:

  1. Manual clone + symlink to ~/.agents/skills/superpowers/
  2. No bootstrap script needed -- Codex's native skill discovery handles it
  3. using-superpowers discovered automatically at startup

3.3 Data Flow

Session Start
    |
    v
[Hook fires] --> [session-start script reads using-superpowers/SKILL.md]
    |
    v
[JSON output with additionalContext injected into session]
    |
    v
[Agent receives using-superpowers instructions as system context]
    |
    v
[Every user message] --> [Check: might any skill apply?]
    |                          |
    |                     [yes, even 1%]
    |                          |
    |                          v
    |                     [Invoke Skill tool to load skill]
    |                          |
    |                          v
    |                     [Announce: "Using [skill] to [purpose]"]
    |                          |
    |                          v
    |                     [Has checklist? -> Create TodoWrite]
    |                          |
    |                          v
    |                     [Follow skill exactly]
    |
    v
[Respond (including clarifications)]

3.4 Key Modules

using-superpowers (the meta-skill) -- /tmp/ai-harness-repos/superpowers/skills/using-superpowers/SKILL.md

This is the most critical file in the entire repository. It establishes:

  • "The Rule": Invoke relevant or requested skills BEFORE any response or action
  • Red Flags table: 12 rationalization patterns the agent must watch for
  • Skill Priority: Process skills first (brainstorming, debugging), then implementation skills
  • Skill Types: Rigid vs. Flexible
  • EnterPlanMode intercept: If agent is about to enter native plan mode, check brainstorming first

lib/skills-core.js -- /tmp/ai-harness-repos/superpowers/lib/skills-core.js

Shared ES module (208 lines) providing:

  • extractFrontmatter() -- Parse YAML frontmatter from SKILL.md files
  • findSkillsInDir() -- Recursive skill discovery with max depth
  • resolveSkillPath() -- Skill resolution with personal > superpowers priority
  • checkForUpdates() -- Git-based update checking with 3-second timeout
  • stripFrontmatter() -- Remove frontmatter from content
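As an illustration of the first of these, a minimal frontmatter parser might look like the following. The real extractFrontmatter is not reproduced here, and restricting it to a flat key: value subset of YAML is an assumption made for brevity:

```javascript
// Minimal sketch of frontmatter parsing as skills-core.js describes it;
// the real extractFrontmatter may handle more YAML than this flat
// key: value subset (a simplifying assumption).
function extractFrontmatter(content) {
  const match = content.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return {};
  const fields = {};
  for (const line of match[1].split('\n')) {
    const idx = line.indexOf(':');
    if (idx > 0) {
      fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    }
  }
  return fields;
}

const skillMd = '---\nname: brainstorming\ndescription: Use when starting creative work\n---\n# Brainstorming\n...';
console.log(extractFrontmatter(skillMd).name); // → 'brainstorming'
```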

hooks/session-start -- /tmp/ai-harness-repos/superpowers/hooks/session-start

Bash script (51 lines) that:

  1. Determines plugin root directory
  2. Checks for legacy skills directory and builds warning
  3. Reads using-superpowers/SKILL.md content
  4. Escapes content for JSON embedding (using optimized bash parameter substitution -- 7x faster than character-by-character loop)
  5. Outputs JSON with both additional_context (Cursor) and hookSpecificOutput.additionalContext (Claude Code) fields

hooks/run-hook.cmd -- /tmp/ai-harness-repos/superpowers/hooks/run-hook.cmd

A polyglot script (46 lines) that is valid in BOTH Windows CMD and Unix bash:

  • On Windows: CMD treats ":" as a label and ignores the "<< 'CMDBLOCK'" heredoc marker; the batch portion locates bash.exe in the standard Git for Windows locations, falling back to PATH
  • On Unix: bash treats ":" as a no-op and "<< 'CMDBLOCK'" opens a heredoc that swallows the CMD portion; the script then runs itself directly via exec bash

3.5 Session Bootstrap Mechanism (Detailed)

The session-start hook (/tmp/ai-harness-repos/superpowers/hooks/session-start) is the critical bootstrap that makes the entire framework function. Its implementation reveals several engineering decisions worth examining:

JSON escape optimization (lines 23-31):

escape_for_json() {
    local s="$1"
    s="${s//\\/\\\\}"
    s="${s//\"/\\\"}"
    s="${s//$'\n'/\\n}"
    s="${s//$'\r'/\\r}"
    s="${s//$'\t'/\\t}"
    printf '%s' "$s"
}

This replaced a character-by-character loop that caused 60+ second delays on Windows (documented in RELEASE-NOTES v4.3.1). The bash parameter substitution approach performs each replacement in a single C-level pass through the string.
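For comparison, the same single-pass escaping is what JSON.stringify provides natively in JavaScript. This is an illustration of the escaping semantics, not code from the repo; note the order dependency the bash version also respects -- backslashes must be escaped before anything else:

```javascript
// JS mirror of the bash escape_for_json above. Backslashes are escaped
// first, otherwise later replacements would double-escape their own output.
function escapeForJson(s) {
  return s
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r')
    .replace(/\t/g, '\\t');
}

const sample = 'line1\n\t"quoted" \\ path';
// JSON.stringify adds surrounding quotes; strip them for comparison.
console.log(escapeForJson(sample) === JSON.stringify(sample).slice(1, -1)); // → true
```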

Dual-format output (lines 41-49):

cat <<EOF
{
  "additional_context": "${session_context}",
  "hookSpecificOutput": {
    "hookEventName": "SessionStart",
    "additionalContext": "${session_context}"
  }
}
EOF

The hook outputs both additional_context (for Cursor compatibility) and hookSpecificOutput.additionalContext (for Claude Code). This dual-format approach was a pragmatic solution to platform differences rather than requiring separate hooks per platform.

EXTREMELY_IMPORTANT wrapping (line 35): The injected content is wrapped in <EXTREMELY_IMPORTANT> tags:

<EXTREMELY_IMPORTANT>
You have superpowers.
**Below is the full content of your 'superpowers:using-superpowers' skill...**
[full using-superpowers content]
</EXTREMELY_IMPORTANT>

This tag name is deliberately emphatic -- it uses the same pattern as the <EXTREMELY-IMPORTANT> tags within the using-superpowers skill itself, creating a layered emphasis system.


4. Harness Workflow: Spec to Plan to Execute to Verify to Merge

4.1 Overview of the Complete Pipeline

Confidence: High -- This is the most well-documented and tested aspect of the framework.

The complete workflow is:

1. Brainstorming    --> Design document
2. Worktree Setup   --> Isolated workspace
3. Writing Plans    --> Implementation plan with bite-sized tasks
4. Execution        --> Subagent-driven (same session) OR executing-plans (separate session)
5. Code Review      --> Two-stage review (spec compliance + code quality)
6. Finishing Branch  --> Merge/PR/Keep/Discard decision

4.2 Stage 1: Brainstorming

Skill: /tmp/ai-harness-repos/superpowers/skills/brainstorming/SKILL.md

Trigger: "You MUST use this before any creative work -- creating features, building components, adding functionality, or modifying behavior."

Hard Gate (line 14-16):

<HARD-GATE>
Do NOT invoke any implementation skill, write any code, scaffold any project, or take any
implementation action until you have presented a design and the user has approved it.
</HARD-GATE>

Anti-Pattern (line 20): "This Is Too Simple To Need A Design" is explicitly called out -- every project goes through this process regardless of perceived simplicity.

Mandatory Checklist (6 items, lines 24-31):

  1. Explore project context (files, docs, recent commits)
  2. Ask clarifying questions (one at a time, understand purpose/constraints/success criteria)
  3. Propose 2-3 approaches (with trade-offs and recommendation)
  4. Present design (in sections scaled to complexity, get approval after each)
  5. Write design doc (save to docs/plans/YYYY-MM-DD-<topic>-design.md, commit)
  6. Transition to implementation (invoke writing-plans skill -- the ONLY valid next step)

Terminal state enforcement (line 55): "The terminal state is invoking writing-plans. Do NOT invoke frontend-design, mcp-builder, or any other implementation skill."

Evolution note: v4.3.0 (2026-02-12) strengthened this significantly with hard gates, mandatory checklists, and graphviz process flow after discovering models were skipping the design phase entirely.

4.3 Stage 2: Worktree Setup

Skill: /tmp/ai-harness-repos/superpowers/skills/using-git-worktrees/SKILL.md

Required by: Both subagent-driven-development and executing-plans (added in v4.2.0).

Directory selection priority:

  1. Check existing .worktrees/ or worktrees/ directories
  2. Check CLAUDE.md for preference
  3. Ask user (.worktrees/ project-local hidden, or ~/.config/superpowers/worktrees/<project>/ global)

Safety verification: Must verify directory is in .gitignore before creating. If not ignored, add to .gitignore and commit immediately ("Fix broken things immediately" -- Jesse's rule).

Post-creation steps:

  1. Auto-detect and run project setup (npm install / cargo build / pip install / go mod download)
  2. Run tests to verify clean baseline
  3. Report location and test status
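The skill expresses these steps as prose, not code. A hypothetical sketch of the same sequence follows, with the command runner injected so the flow can be exercised without a real repository; the npm-based setup detection is a simplified assumption (the skill also covers cargo, pip, and go):

```javascript
// Hypothetical sketch of the worktree setup sequence described above.
// run() is an injected command executor returning success/failure, so the
// flow can be demonstrated without touching a real git repository.
function setupWorktree(branch, dir, run) {
  run(`git worktree add ${dir} -b ${branch}`);
  // Auto-detect project setup (simplified: package.json implies npm).
  if (run(`test -f ${dir}/package.json`)) {
    run(`npm install --prefix ${dir}`);
  }
  // Verify a clean test baseline before any work begins.
  const testsPass = run(`npm test --prefix ${dir}`);
  return { dir, testsPass };
}

// Exercise the flow with a recording fake runner.
const log = [];
const fakeRun = (cmd) => { log.push(cmd); return true; };
const result = setupWorktree('feature-x', '.worktrees/feature-x', fakeRun);
console.log(log.length, result.testsPass); // → 4 true
```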

4.4 Stage 3: Writing Plans

Skill: /tmp/ai-harness-repos/superpowers/skills/writing-plans/SKILL.md

Key design decision: Plans are written assuming the engineer has "zero context for our codebase and questionable taste." This is critical because in subagent-driven-development, each subagent has a fresh context.

Task granularity: Each step is one action (2-5 minutes):

  • Write the failing test (step)
  • Run it to make sure it fails (step)
  • Implement the minimal code to make the test pass (step)
  • Run the tests and make sure they pass (step)
  • Commit (step)

Required plan header (lines 32-45):

# [Feature Name] Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans

**Goal:** [One sentence]
**Architecture:** [2-3 sentences]
**Tech Stack:** [Key technologies]

Task structure (lines 49-88): Each task must include:

  • Exact file paths (Create/Modify/Test)
  • Complete code in plan (not "add validation")
  • Exact commands with expected output
  • DRY, YAGNI, TDD, frequent commits

Execution handoff (lines 99-117): After saving, offers two choices:

  1. Subagent-Driven (same session) -- fresh subagent per task, review between tasks
  2. Parallel Session (separate) -- batch execution with checkpoints

4.5a Stage 4a: Subagent-Driven Development (Primary Execution Mode)

Skill: /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/SKILL.md

This is the flagship execution model and the most sophisticated component.

Process per task (from DOT flowchart, lines 40-82):

Read plan once --> Extract all tasks --> Create TodoWrite
    |
    v (for each task)
Dispatch implementer subagent (with full task text)
    |
    v
Implementer asks questions? --> Answer, provide context
    |
    v (no questions)
Implementer implements, tests, commits, self-reviews
    |
    v
Dispatch spec reviewer subagent
    |
    v
Spec compliant? --> NO --> Implementer fixes gaps --> Re-review
    |
    v (YES)
Dispatch code quality reviewer subagent
    |
    v
Quality approved? --> NO --> Implementer fixes quality issues --> Re-review
    |
    v (YES)
Mark task complete in TodoWrite
    |
    v
More tasks? --> YES --> Next task
    |
    v (NO)
Dispatch final code reviewer for entire implementation
    |
    v
Use finishing-a-development-branch

Three prompt templates:

  1. Implementer (/tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/implementer-prompt.md):

    • Gets full task text (NOT file references)
    • Asked to raise questions BEFORE starting
    • Must self-review against checklist: Completeness, Quality, Discipline, Testing
    • Report format: what implemented, test results, files changed, self-review findings
  2. Spec Compliance Reviewer (/tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/spec-reviewer-prompt.md):

    • Explicitly told: "The implementer finished suspiciously quickly. Their report may be incomplete, inaccurate, or optimistic."
    • Must NOT trust the implementer's report
    • Must read actual code and compare to requirements line by line
    • Reports: missing requirements, extra/unneeded work, misunderstandings
  3. Code Quality Reviewer (/tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/code-quality-reviewer-prompt.md):

    • Only dispatched AFTER spec compliance passes
    • Uses code-reviewer.md template from requesting-code-review/
    • Reviews: code quality, architecture, testing, requirements, production readiness
    • Issues categorized: Critical/Important/Minor

Key constraints (lines 199-224):

  • Never start implementation on main/master without explicit consent
  • Never skip reviews (spec compliance OR code quality)
  • Never dispatch multiple implementation subagents in parallel (conflicts)
  • Never make subagent read plan file (provide full text instead)
  • Never skip scene-setting context
  • Never start code quality review before spec compliance passes
  • If subagent fails: dispatch fix subagent with specific instructions, don't fix manually (context pollution)

4.5b Stage 4b: Executing Plans (Alternative Mode)

Skill: /tmp/ai-harness-repos/superpowers/skills/executing-plans/SKILL.md

Simpler alternative for separate-session execution:

  • Default batch size: 3 tasks
  • Human review between batches
  • Critical review before first batch (raise concerns)
  • Stop immediately when blocked

4.6 Stage 5: Code Review

Agent: /tmp/ai-harness-repos/superpowers/agents/code-reviewer.md

Skill: /tmp/ai-harness-repos/superpowers/skills/requesting-code-review/SKILL.md

The code reviewer agent is a formal agent definition with:

  • 6-step review process (Plan Alignment, Code Quality, Architecture/Design, Documentation, Issue ID, Communication)
  • Issues categorized by severity (Critical/Important/Minor)
  • Clear verdict required (Ready to merge? Yes/No/With fixes)

Integration with SDD: Review happens after EACH task (not just at the end).

4.7 Stage 6: Finishing a Development Branch

Skill: /tmp/ai-harness-repos/superpowers/skills/finishing-a-development-branch/SKILL.md

Process:

  1. Verify tests pass (STOP if failing)
  2. Determine base branch
  3. Present exactly 4 options: Merge locally / Create PR / Keep as-is / Discard
  4. Execute chosen option
  5. Cleanup worktree (for options 1, 2, 4)

Safety: Discard requires typed "discard" confirmation and shows commit list first.


5. Subagent/Task Orchestration Model

5.1 Architecture: Agent-as-Orchestrator

Confidence: High -- This is the defining architectural choice of Superpowers.

Unlike traditional harnesses that have a runtime orchestrator (Python/Node process managing agents), Superpowers makes the AI agent itself the orchestrator. The "controller" is the main Claude session that:

  1. Reads the plan once at start
  2. Extracts all tasks with full text
  3. Creates TodoWrite for tracking
  4. Dispatches implementer subagents via the Task tool
  5. Answers subagent questions
  6. Dispatches reviewer subagents
  7. Manages review loops
  8. Marks tasks complete

This means the orchestration logic lives entirely in the subagent-driven-development/SKILL.md markdown document, which the agent reads and follows. There is no executable orchestration code.
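If this markdown-driven loop were expressed as runtime code -- which Superpowers deliberately avoids -- it would look roughly like the sketch below. Everything here is illustrative: dispatch() stands in for the platform's Task tool, and the { approved } result shape is an assumption:

```javascript
// Illustrative only: Superpowers ships no orchestrator code. This is the
// control flow that subagent-driven-development/SKILL.md asks the agent
// to follow, written out as if it were a runtime.
async function runPlan(tasks, dispatch) {
  const todo = tasks.map((t) => ({ task: t, state: 'pending' }));
  for (const item of todo) {
    item.state = 'in_progress';
    await dispatch('implementer', item.task); // full task text, not a file reference
    while (!(await dispatch('spec-reviewer', item.task)).approved) {
      await dispatch('implementer-fix', item.task); // close spec gaps, then re-review
    }
    while (!(await dispatch('quality-reviewer', item.task)).approved) {
      await dispatch('implementer-fix', item.task); // fix quality issues, then re-review
    }
    item.state = 'completed';
  }
  await dispatch('final-reviewer', 'entire implementation');
  return todo;
}
```

With an always-approving stub for dispatch, each task visits the implementer, spec reviewer, and code quality reviewer exactly once -- the happy path of the flowchart above.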

5.2 Subagent Dispatch Mechanism

Subagents are dispatched using platform-specific tools:

  • Claude Code: Task tool (general-purpose subagent dispatch)
  • OpenCode: @mention syntax
  • Codex: Manual fallback (no native subagent support)

Each subagent gets:

  • Full task text (pasted directly, not file references)
  • Scene-setting context (where this fits, dependencies, architectural context)
  • Specific prompt template (implementer/reviewer)

5.3 Task Tracking

Tasks are tracked using TodoWrite (Claude Code) or update_plan (OpenCode). Each task from the plan becomes a todo item that transitions through states:

  • pending -> in_progress -> completed

5.4 Review Loop Pattern

The review pattern is a loop, not one-shot:

Implementer completes --> Spec review
    |
    v
Issues found? --> YES --> Implementer fixes --> Spec review again
    |
    v (NO)
Code quality review
    |
    v
Issues found? --> YES --> Implementer fixes --> Code quality review again
    |
    v (NO)
Task complete

This is explicitly required by the skill: "Don't skip the re-review" (line 224).

5.5 Context Provision Strategy

Proven optimization: The controller reads the plan ONCE at the start and extracts all tasks with full text. Subagents receive the full task text directly in their prompt -- they never read the plan file themselves.

Rationale from the skill (lines 181-184):

  • No file reading overhead
  • Controller curates exactly what context is needed
  • Subagent gets complete information upfront
  • Questions surfaced before work begins
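The curation step amounts to simple prompt assembly. The field names and phrasing in this sketch are illustrative, not quoted from the implementer-prompt.md template:

```javascript
// Sketch of context curation: the controller pastes full task text into
// the subagent prompt rather than pointing at the plan file. Field names
// here are hypothetical.
function buildImplementerPrompt({ taskText, sceneSetting }) {
  return [
    'You are implementing one task from a larger plan.',
    `Context: ${sceneSetting}`,
    'Task (complete text, do not read the plan file):',
    taskText,
    'Raise any questions BEFORE starting work.',
  ].join('\n\n');
}

const prompt = buildImplementerPrompt({
  taskText: 'Write a failing test for parseConfig, then implement it.',
  sceneSetting: 'Task 2 of 5; Task 1 added the config schema.',
});
console.log(prompt.includes('do not read the plan file')); // → true
```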


6. Multi-Agent / Parallelization Strategy

6.1 Sequential Task Execution (Primary)

Confidence: High -- Explicitly documented and tested.

Subagent-driven-development executes tasks sequentially, not in parallel. This is a deliberate design choice:

From /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/SKILL.md, line 205:

"Dispatch multiple implementation subagents in parallel (conflicts)" -- listed as a "Never" red flag

Rationale: Tasks may have shared state (same files, same test suite), and parallel execution would cause conflicts.

6.2 Parallel Agent Dispatch (Debugging/Independent Tasks)

Skill: /tmp/ai-harness-repos/superpowers/skills/dispatching-parallel-agents/SKILL.md

For independent problems (not plan tasks), parallel dispatch IS supported:

When to use:

  • 3+ test files failing with different root causes
  • Multiple subsystems broken independently
  • No shared state between investigations

Pattern:

1. Identify independent domains (group by what's broken)
2. Create focused agent tasks (specific scope, clear goal, constraints)
3. Dispatch in parallel
4. Review and integrate (check for conflicts, run full suite)

Real-world example (lines 131-157): 6 failures across 3 files, 3 agents dispatched in parallel, all fixes independent, zero conflicts.
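A minimal sketch of the fan-out/fan-in coordination the skill leaves to the agent. The explicit file-overlap check stands in for the "no shared state" rule -- the check itself is an assumption, since the skill relies on agent judgment rather than code:

```javascript
// Sketch: dispatch only provably independent domains concurrently.
// dispatch() stands in for the platform's Task tool.
async function dispatchParallel(domains, dispatch) {
  // Independence check: no two domains may touch the same file.
  const seen = new Set();
  for (const d of domains) {
    for (const f of d.files) {
      if (seen.has(f)) throw new Error(`Domains overlap on ${f}; run sequentially`);
      seen.add(f);
    }
  }
  return Promise.all(domains.map((d) => dispatch(d)));
}

const domains = [
  { name: 'auth tests', files: ['auth_test.go'] },
  { name: 'parser tests', files: ['parser_test.go'] },
];
dispatchParallel(domains, async (d) => `${d.name}: fixed`).then((r) => console.log(r));
```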

6.3 The SDD Sequential Constraint (Detailed Rationale)

The prohibition against parallel task execution in SDD is not merely a preference -- it addresses a fundamental problem with concurrent file system access. From the SDD skill (line 205), dispatching multiple implementation subagents in parallel is listed as a "Never" red flag.

The specific failure scenarios that motivated this constraint:

  1. Shared test suite: Multiple subagents running go test ./... simultaneously causes race conditions in test output and potentially corrupted test databases.
  2. Shared source files: If Task 3 modifies a utility function that Task 4 also uses, parallel execution creates merge conflicts that no subagent can resolve.
  3. Build system conflicts: Concurrent go build or npm install operations in the same directory produce non-deterministic results.
  4. Git state corruption: Multiple subagents committing to the same branch simultaneously creates conflicting histories.

The dispatching-parallel-agents skill addresses these by ONLY allowing parallelization when domains are provably independent (different files, different test suites, different subsystems).

6.4 Limitations of Parallelization

  • No formal queuing mechanism -- parallelization is entirely agent-directed
  • No dependency graph resolution -- agent must manually determine independence
  • No automatic conflict detection -- agent checks for conflicts after completion
  • No load balancing or resource management
  • No fan-out/fan-in coordination (agent must manually collect results from all parallel subagents)
  • No retry mechanism for parallel agents that fail (must be dispatched manually)

Confidence: High -- These are inherent limitations of the agent-as-orchestrator model.


7. Isolation Model

7.1 Git Worktrees as Primary Isolation

Skill: /tmp/ai-harness-repos/superpowers/skills/using-git-worktrees/SKILL.md

Worktrees provide:

  • Branch isolation -- work on feature branch without affecting main
  • Filesystem isolation -- separate working directory
  • Dependency isolation -- separate node_modules/vendor directory
  • Test isolation -- clean test baseline verified at creation

7.2 Subagent Context Isolation

Each subagent dispatched via the Task tool gets a fresh context:

  • No accumulated conversation history from previous tasks
  • No "context pollution" from earlier work
  • Fresh perspective for each task and each review

This is listed as a key advantage (line 172): "Fresh context per task (no confusion)"

7.3 Session Isolation

  • Same-session (subagent-driven): Main session persists, subagents are isolated
  • Separate-session (executing-plans): Entirely new Claude session in worktree

7.4 Main Branch Protection

v4.2.0 changed from hard prohibition to requiring explicit consent for main branch work:

  • Skills warn against working on main
  • Never start implementation on main/master without explicit user consent
  • But if user explicitly consents, allowed

7.5 Limitations

  • No Docker/container isolation
  • No virtual environment isolation (Python venvs not managed)
  • No file permission sandboxing
  • Worktree isolation is advisory -- if agent ignores skill, no enforcement
  • No runtime monitoring of isolation violations

Confidence: High -- These are clear boundaries of the framework.


8. Human-in-the-Loop Controls

8.1 Brainstorming Phase Gates

The brainstorming skill has explicit human approval gates:

  • "Ask after each section whether it looks right so far" (line 74)
  • Design must be presented and approved before implementation
  • Hard gate prevents any implementation action before approval

8.2 Plan Review

The execution skills require the plan to be reviewed:

  • executing-plans Step 1: "Review critically - identify any questions or concerns about the plan" then "If concerns: Raise them with your human partner before starting"
  • Human can modify plan between batches

8.3 Batch Execution Checkpoints

In executing-plans:

  • Default batch size: 3 tasks
  • After each batch: "Show what was implemented, Show verification output, Say: 'Ready for feedback.'"
  • Agent must wait for feedback before continuing

8.4 Subagent-Driven Development: Reduced Human Involvement

In subagent-driven-development, human involvement is reduced:

  • No human checkpoint between tasks (this is a feature, not a bug)
  • Human only involved if subagent asks questions
  • "Faster iteration (no human-in-loop between tasks)" listed as advantage

8.5 Finishing Branch: User Choice

4 structured options presented (no open-ended questions):

  1. Merge locally
  2. Push and create PR
  3. Keep as-is
  4. Discard (requires typed "discard" confirmation)

8.6 Escalation Triggers

Skills define when to stop and ask:

  • executing-plans: "Hit a blocker mid-batch", "Plan has critical gaps", "You don't understand an instruction", "Verification fails repeatedly"
  • systematic-debugging: "If >= 3 fixes failed: STOP and question the architecture" then "Discuss with your human partner before attempting more fixes"
  • subagent-driven-development: "If subagent asks questions - Answer clearly and completely"

8.7 Assessment

Strengths: Multiple explicit gates in the design/planning phase. Clear escalation triggers.

Limitations:

  • In SDD mode, human is largely hands-off during execution -- extended autonomous runs possible
  • No formal approval mechanism (it's all advisory in the skill text)
  • If the agent rationalizes past gates, no enforcement exists
  • No timeout-based escalation (agent can spin indefinitely without human input)

Confidence: High


9. Context Handling Strategy

9.1 Progressive Disclosure

Skills use a layered loading model:

  1. Session start: Only using-superpowers content is injected (the meta-skill)
  2. On-demand: Other skills loaded via Skill tool only when needed
  3. Supporting files: Heavy reference material kept in separate files, loaded only when referenced

From /tmp/ai-harness-repos/superpowers/skills/writing-skills/anthropic-best-practices.md (lines 19-24):

"At startup, only the metadata (name and description) from all Skills is pre-loaded. Claude reads SKILL.md only when the Skill becomes relevant, and reads additional files only as needed."

9.2 Token Efficiency Engineering

The writing-skills skill has detailed guidance on token efficiency:

  • Getting-started workflows: <150 words each
  • Frequently-loaded skills: <200 words total
  • Other skills: <500 words
  • SKILL.md body under 500 lines for optimal performance

Techniques (from /tmp/ai-harness-repos/superpowers/skills/writing-skills/SKILL.md, lines 216-266):

  • Move details to tool help (--help instead of documenting all flags)
  • Use cross-references instead of repeating content
  • Compress examples (42 words -> 20 words)
  • Eliminate redundancy

9.3 Subagent Context Curation

For subagent dispatch, the controller curates context:

  • Full task text provided directly (no file reading)
  • Scene-setting context included
  • Only relevant information for the specific task
  • v4.0.0 improvement: Plan read once, tasks extracted upfront

9.4 Cross-Reference Strategy

Skills reference each other using explicit markers:

  • **REQUIRED BACKGROUND:** -- Prerequisites you must understand
  • **REQUIRED SUB-SKILL:** -- Skills that must be used in workflow
  • **Complementary skills:** -- Optional related skills

Critical rule: No @ links. From lines 286-288: "@ syntax force-loads files immediately, consuming 200k+ context before you need them."

9.5 Context Compaction Handling

For OpenCode, the plugin handles context compaction via experimental.chat.system.transform hook -- bootstrap is re-injected on every system prompt transform, ensuring it survives compaction events.

For Claude Code, the session-start hook fires on "startup|resume|clear|compact" events (hooks.json), ensuring context is re-injected after compaction.

9.6 The No-@ Rule

A critical context management rule from /tmp/ai-harness-repos/superpowers/skills/writing-skills/SKILL.md (lines 286-288):

"@ syntax force-loads files immediately, consuming 200k+ context before you need them."

This means skills must NEVER use @file references to load supporting files. Instead, they use text-based references like:

  • **REQUIRED BACKGROUND:** You MUST understand superpowers:test-driven-development before using this skill.
  • For Anthropic's official skill authoring best practices, see anthropic-best-practices.md.

The agent loads these references on-demand via the Skill tool or Read tool, rather than having them force-loaded into context at skill activation time. This is a critical optimization because a single skill like writing-skills references anthropic-best-practices.md (1151 lines), testing-skills-with-subagents.md (385 lines), and persuasion-principles.md (188 lines). Force-loading all three would consume approximately 50,000+ tokens before the agent even begins working.
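
The loading model described here can be sketched in a few lines: scan every SKILL.md but parse only its frontmatter at startup, and read the body only when the skill is actually invoked. The function names and the minimal frontmatter parser below are illustrative, not code from the repo:

```python
import re
from pathlib import Path

FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)

def index_skills(skills_dir):
    """Startup pass: read only each SKILL.md's frontmatter (name and
    description), never the body, so initial context cost stays near-zero."""
    index = {}
    for path in Path(skills_dir).glob("*/SKILL.md"):
        match = FRONTMATTER.match(path.read_text())
        meta = dict(
            line.split(":", 1)
            for line in (match.group(1).splitlines() if match else [])
            if ":" in line
        )
        index[meta.get("name", path.parent.name).strip()] = {
            "description": meta.get("description", "").strip(),
            "path": path,  # body deliberately NOT loaded here
        }
    return index

def load_skill_body(index, name):
    """On-demand pass: the full skill body enters context only when invoked."""
    return FRONTMATTER.sub("", index[name]["path"].read_text())
```

This is the difference between paying for two short strings per skill at startup versus 50,000+ tokens of force-loaded reference material.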

9.7 Assessment

Strengths:

  • Thoughtful progressive disclosure model
  • Token budget awareness with specific word count targets
  • Re-injection on compaction for both Claude Code and OpenCode
  • Cross-reference strategy prevents context explosion
  • Explicit no-@ rule prevents accidental context bloat

Limitations:

  • No automatic context summarization
  • No RAG or retrieval mechanism for large codebases
  • No chunking strategy for long files
  • Context management is entirely skill-text-driven (no runtime optimization)
  • No measurement of actual context usage per skill (budgets are targets, not enforced limits)

Confidence: High


10. Session Lifecycle and Persistence

10.1 Session Start

  1. Hook fires on startup/resume/clear/compact
  2. session-start script reads using-superpowers/SKILL.md
  3. Content injected as JSON into session context
  4. Agent receives skills-aware behavioral instructions
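
The four steps amount to a small read-and-emit program. A Python rendering of what the ~50-line bash hook does is sketched below; the hookSpecificOutput/additionalContext output schema is an assumption based on Claude Code's hook JSON conventions, not verified against the repo's script:

```python
import json
from pathlib import Path

def session_start_payload(skill_path):
    """Build the JSON a SessionStart hook prints on stdout so the host
    injects the using-superpowers bootstrap into the agent's context."""
    content = Path(skill_path).read_text()
    return json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "SessionStart",
            # additionalContext is what lands in the model's context window
            "additionalContext": content,
        }
    })
```

Because the hook is synchronous (post-v4.3.0), this payload is guaranteed to be in context before the model's first turn.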

Critical timing change (v4.3.0): Hook changed from async: true to async: false. When async, the hook could fail to complete before the model's first turn, meaning using-superpowers instructions weren't in context for the first message.

10.2 Session Persistence

Superpowers has NO persistence mechanism of its own:

  • No session state saved between sessions
  • No database or file-based state
  • TodoWrite provides in-session task tracking only
  • Git commits are the only durable artifact

10.3 Session Resume

On resume, the hook fires again, re-injecting the using-superpowers content. The agent must re-discover what was happening from:

  • Git history
  • Plan files on disk
  • Previous conversation context (if session preserved)

10.4 Legacy Cleanup

The session-start hook checks for legacy skills directory (~/.config/superpowers/skills) and injects a warning if found, instructing users to move to ~/.claude/skills.

10.5 Assessment

Strengths: Clean separation -- no persistent state to corrupt.

Limitations:

  • No session resume capability beyond what the host platform provides
  • No progress tracking across sessions
  • If a session dies mid-workflow, recovery requires manual intervention
  • No checkpoint/restore mechanism

Confidence: High


11. Code Quality Gates

11.1 Test-Driven Development (The Iron Law)

Skill: /tmp/ai-harness-repos/superpowers/skills/test-driven-development/SKILL.md

NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST

Enforcement mechanisms:

  • "Write code before the test? Delete it. Start over."
  • No keeping as reference, no adapting, no looking at it
  • "Violating the letter of the rules is violating the spirit of the rules"
  • 11-entry rationalization prevention table
  • 12-entry red flags list
  • Complete verification checklist (8 items)

Testing Anti-Patterns: /tmp/ai-harness-repos/superpowers/skills/test-driven-development/testing-anti-patterns.md (300 lines) covers:

  1. Testing mock behavior instead of real behavior
  2. Test-only methods in production classes
  3. Mocking without understanding dependencies
  4. Incomplete mocks hiding structural assumptions
  5. Integration tests as afterthought
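
Anti-pattern 1 is the most common in practice and easy to illustrate. The discount example below is hypothetical (not from the repo's anti-patterns file), but shows the shape of the failure: the mocked test passes even if the production function is deleted.

```python
from unittest.mock import Mock

def apply_discount(price, rate):
    """Production code under test."""
    return round(price * (1 - rate), 2)

def test_discount_mocked():
    # Anti-pattern 1: this only proves the mock returns what we told it to.
    # It still passes if apply_discount is broken -- or doesn't exist.
    calc = Mock()
    calc.apply_discount.return_value = 60.0
    assert calc.apply_discount(80, 0.25) == 60.0

def test_discount_real():
    # Real-behavior test: exercises the actual production function.
    assert apply_discount(80, 0.25) == 60.0

test_discount_mocked()
test_discount_real()
```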

11.2 Two-Stage Code Review

Every task in SDD gets two reviews:

  1. Spec Compliance -- Does implementation match spec? Nothing missing, nothing extra.
  2. Code Quality -- Is implementation well-built? Clean code, test coverage, maintainability.

Both are loops -- reviewer finds issues, implementer fixes, reviewer re-reviews.

11.3 Verification Before Completion

Skill: /tmp/ai-harness-repos/superpowers/skills/verification-before-completion/SKILL.md

NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE

Gate function (lines 27-38):

  1. IDENTIFY: What command proves this claim?
  2. RUN: Execute the FULL command (fresh, complete)
  3. READ: Full output, check exit code, count failures
  4. VERIFY: Does output confirm the claim?
  5. ONLY THEN: Make the claim
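
The gate function is prose in the skill, but its logic is mechanical enough to sketch: run the full proving command fresh, read the exit code, and refuse to claim otherwise. The function name and return shape here are illustrative:

```python
import subprocess

def verified_claim(claim, command):
    """Gate a completion claim behind fresh verification evidence:
    identify the proving command, run it in full, read the result,
    and only then make the claim."""
    result = subprocess.run(command, capture_output=True, text=True)
    evidence = {
        "command": " ".join(command),
        "exit_code": result.returncode,
        "output": result.stdout + result.stderr,
    }
    if result.returncode != 0:
        return f"CANNOT CLAIM: {claim!r} -- verification failed", evidence
    return f"VERIFIED: {claim}", evidence
```

Usage would look like `verified_claim("all tests pass", ["pytest", "-x"])` -- the claim string is only ever emitted alongside the evidence that produced it.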

Origin story (lines 111-115): "From 24 failure memories: Jesse said 'I don't believe you' - trust broken. Undefined functions shipped. Missing requirements shipped."

11.4 Systematic Debugging

Skill: /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/SKILL.md

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

Four mandatory phases:

  1. Root Cause Investigation -- Read errors, reproduce, check changes, gather evidence
  2. Pattern Analysis -- Find working examples, compare, identify differences
  3. Hypothesis and Testing -- Scientific method, one variable at a time
  4. Implementation -- Create failing test, implement fix, verify

Escalation trigger: If 3+ fixes failed, STOP and question the architecture.

Supporting techniques bundled:

  • root-cause-tracing.md -- Trace backward through call stack
  • defense-in-depth.md -- Validate at every layer (4 layers)
  • condition-based-waiting.md -- Replace timeouts with condition polling
  • find-polluter.sh -- Bisection script for test pollution
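
The bisection idea behind find-polluter.sh is worth making concrete: if a victim test passes alone but fails after some earlier test pollutes shared state, binary-search over prefixes of the test order finds the polluter in O(log n) runs instead of n. This is a Python sketch of the technique; the shell script's actual interface may differ:

```python
def find_polluter(tests, victim_fails_after):
    """Bisect the test order to find which earlier test makes the victim
    fail. `victim_fails_after(subset)` runs the subset followed by the
    victim and reports whether the victim failed. Caller should first
    confirm the victim passes in isolation (empty subset)."""
    if not victim_fails_after(tests):
        return None  # no polluter anywhere in the ordering
    lo, hi = 0, len(tests)
    # Invariant: victim fails after tests[:hi], passes after tests[:lo]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if victim_fails_after(tests[:mid]):
            hi = mid   # polluter is in the first half
        else:
            lo = mid   # polluter runs later
    return tests[hi - 1]
```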

11.5 The Rationalization Prevention Tables (Deep Dive)

A hallmark of Superpowers' quality engineering is the rationalization prevention table -- a pre-emptive catalog of excuses the agent might generate for skipping a quality gate, paired with the correct response.

From verification-before-completion/SKILL.md (lines 63-74):

| Excuse | Reality |
| --- | --- |
| "Should work now" | RUN the verification |
| "I'm confident" | Confidence is not evidence |
| "Just this once" | No exceptions |
| "Linter passed" | Linter is not compiler |
| "Agent said success" | Verify independently |
| "I'm tired" | Exhaustion is not excuse |
| "Partial check is enough" | Partial proves nothing |
| "Different words so rule doesn't apply" | Spirit over letter |

From using-superpowers/SKILL.md (lines 60-73):

| Thought | Reality |
| --- | --- |
| "This is just a simple question" | Questions are tasks. Check for skills. |
| "I need more context first" | Skill check comes BEFORE clarifying questions. |
| "Let me explore the codebase first" | Skills tell you HOW to explore. Check first. |
| "This doesn't need a formal skill" | If a skill exists, use it. |
| "I remember this skill" | Skills evolve. Read current version. |
| "The skill is overkill" | Simple things become complex. Use it. |
| "I'll just do this one thing first" | Check BEFORE doing anything. |
| "I know what that means" | Knowing the concept is not using the skill. Invoke it. |

These tables are not theoretical -- they were built iteratively from observed agent failures. The writing-skills skill documents this process: run a baseline test, watch the agent rationalize, document the exact rationalization, write the counter, test again.

The total number of rationalization entries across all skills exceeds 40 unique patterns. This represents one of the most comprehensive catalogs of LLM avoidance behavior in any open-source project.

11.6 Pressure Testing Methodology (from writing-skills)

The testing-skills-with-subagents.md reference (/tmp/ai-harness-repos/superpowers/skills/writing-skills/testing-skills-with-subagents.md, 385 lines) documents Superpowers' unique approach to validating that quality gates actually work under pressure.

Core insight: Academic test scenarios ("What does the skill say?") are useless because agents simply recite the skill. Real validation requires pressure scenarios that create incentives to bypass the gate.

Seven pressure types identified (from the reference):

| Pressure Type | Example | What It Tests |
| --- | --- | --- |
| Time | "Production down, 5 minutes to deploy window" | Does agent skip testing under time pressure? |
| Sunk Cost | "Spent 3 hours, 200 lines already written" | Does agent refuse to delete and restart? |
| Authority | "Manager says ship it now" | Does agent comply with authority over process? |
| Economic | "$10k/min revenue loss" | Does agent rationalize shortcuts for cost reasons? |
| Exhaustion | "6pm, dinner at 6:30, been coding all day" | Does agent take shortcuts when "tired"? |
| Social | "Team is waiting on this" | Does agent skip reviews to unblock team? |
| Pragmatic | "It works, manually tested all edge cases" | Does agent skip formal tests when confident? |

Example combined-pressure scenario (lines 111-119):

You spent 3 hours, 200 lines, manually tested. It works.
It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.
Just realized you forgot TDD.

Options:
A) Delete 200 lines, start fresh tomorrow with TDD
B) Commit now, add tests tomorrow

Without the TDD skill loaded, agents consistently choose B and rationalize with "I already manually tested it," "Tests after achieve same goals," and "Deleting is wasteful." With the TDD skill loaded and properly hardened, agents choose A -- the correct but psychologically difficult option.
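
The pressure-testing loop can be automated with the headless flags the repo already uses for its test suites (claude -p, --plugin-dir, --dangerously-skip-permissions, --output-format stream-json). A minimal harness sketch -- the grading patterns in the test are illustrative, not from the repo's suite, and the CLI flags should be confirmed against the installed version:

```python
def pressure_test_cmd(scenario_prompt, plugin_dir):
    """Assemble a headless run of one pressure scenario."""
    return ["claude", "-p", scenario_prompt,
            "--plugin-dir", plugin_dir,
            "--dangerously-skip-permissions",
            "--output-format", "stream-json"]

def grade_pressure_run(transcript, required, forbidden):
    """Grade the run: the skill held if every required pattern appears
    and no forbidden rationalization does."""
    missing = [p for p in required if p not in transcript]
    violations = [p for p in forbidden if p in transcript]
    return {"passed": not missing and not violations,
            "missing": missing, "violations": violations}
```

For the combined-pressure scenario above, a required pattern might be evidence of choosing option A, with "add tests tomorrow" as a forbidden rationalization.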

This methodology represents a genuinely novel contribution to the field of prompt engineering quality assurance.

11.7 Assessment

Strengths:

  • Extremely thorough quality gate system
  • Anti-rationalization engineering prevents gate bypass
  • Two-stage review catches both spec compliance and quality issues
  • Evidence-based verification prevents false completion claims
  • Iteratively built from real observed failures (not hypothetical)

Limitations:

  • All quality gates are advisory (enforced by skill text, not runtime)
  • If agent ignores skill, no external enforcement
  • No automated CI integration (no GitHub Actions, no pre-commit hooks)
  • No formal security scanning
  • No static analysis integration
  • No quantitative measurement of gate effectiveness (how often are rationalizations actually prevented?)

Confidence: High


12. Security and Compliance Mechanisms

12.1 Branch Protection

  • Worktree isolation prevents accidental work on main
  • Explicit consent required for main branch work
  • Finishing branch skill prevents force-push without explicit request

12.2 Work Destruction Prevention

  • Discard option requires typed "discard" confirmation
  • Shows commit list before deletion
  • Worktree cleanup only for merge/discard, not keep-as-is

12.3 Credential Safety

The .gitignore excludes .private-journal/ and .claude/ directories, but there is no explicit credential scanning or prevention mechanism.

12.4 Plugin Security

Plugin manifests are declarative JSON with no executable code (except the session-start hook and the OpenCode JS plugin). The hook is a bash script with no network access.

12.5 Implicit Security Model

While Superpowers lacks explicit security mechanisms, it has an implicit security model worth documenting:

  1. Executable surface area is minimal: Only ~800 lines of executable code across 7 files. The majority of the framework is pure markdown, which carries no execution risk.

  2. Hook script is read-only: The session-start hook only reads a file and outputs JSON. It does not modify any files, make network requests, or execute user code.

  3. Plugin manifest is declarative: .claude-plugin/plugin.json contains only paths and metadata. No executable plugins or dynamic code loading.

  4. OpenCode plugin is scoped: .opencode/plugins/superpowers.js only transforms system prompts. It does not access the filesystem beyond reading skill files.

  5. No data exfiltration vector: Skills operate entirely within the AI agent's context. There is no mechanism for skills to send data to external services.

However, these implicit protections are insufficient for enterprise environments:

12.6 Assessment

Security mechanisms are minimal:

  • No secret detection or credential scanning
  • No SBOM generation or dependency vulnerability scanning
  • No sandbox enforcement beyond git worktree isolation
  • No audit logging of agent actions or skill compliance
  • No rate limiting or cost caps
  • No input validation on skill content (a malicious skill could instruct the agent to perform harmful actions)
  • No integrity checking of skill files (modified skills would be trusted immediately)
  • No access control on which skills are available to which agents

This is a significant gap for enterprise adoption. The implicit security model (minimal executable surface, read-only hooks) provides a baseline but no defense-in-depth.

Confidence: High -- Absence of security features is clear from the codebase.


13. Hooks, Automation Surface, and Fail-Safe Behavior

13.1 Hook System

Configuration: /tmp/ai-harness-repos/superpowers/hooks/hooks.json

```json
{
  "hooks": {
    "SessionStart": [
      {
        "matcher": "startup|resume|clear|compact",
        "hooks": [
          {
            "type": "command",
            "command": "'${CLAUDE_PLUGIN_ROOT}/hooks/run-hook.cmd' session-start",
            "async": false
          }
        ]
      }
    ]
  }
}
```

Only one hook is used: SessionStart. No PreToolUse, PostToolUse, or other hooks are implemented.

Notable design decision: The hooks.json matcher field uses "startup|resume|clear|compact" -- matching four different session events. This ensures the bootstrap is re-injected:

  • On initial session start (startup)
  • On session resume after pause (resume)
  • When context is cleared (clear)
  • When context is compacted due to token limits (compact)

The async: false setting (changed from async: true in v4.3.0) is critical: if the hook runs asynchronously, its output may not be in the agent's context when the first user message is processed, meaning the agent would respond without knowing about superpowers for its first turn.

13.2 Slash Commands

Three user-only commands (all with disable-model-invocation: true):

  • /superpowers:brainstorm -- Redirects to brainstorming skill
  • /superpowers:write-plan -- Redirects to writing-plans skill
  • /superpowers:execute-plan -- Redirects to executing-plans skill

These are convenience shortcuts -- the underlying skills are the real functionality.

13.3 Fail-Safe Behavior

What happens when things go wrong:

| Failure | Behavior |
| --- | --- |
| Hook fails to run | Plugin still works, but no bootstrap context (silent degradation) |
| Skill not found | Agent proceeds without skill (no error thrown) |
| Subagent fails | "Dispatch fix subagent with specific instructions" |
| Tests fail during setup | Report failures, ask whether to proceed |
| Review finds issues | Loop until approved (no timeout) |
| 3+ debugging fixes fail | Escalate to human for architectural discussion |
| Legacy directory found | Warning injected into session |

13.4 Automation Surface

The framework provides these automation touchpoints:

  • claude -p for headless testing
  • --plugin-dir for custom plugin location
  • --dangerously-skip-permissions for automated testing
  • --output-format stream-json for structured output
  • Session JSONL transcripts for post-hoc analysis

13.5 Assessment

Strengths: Clean fail-safe design -- hook failure degrades gracefully, not catastrophically.

Limitations:

  • Only one hook (SessionStart) -- no pre/post tool use hooks
  • No webhook integration
  • No CI/CD integration hooks
  • No event streaming or observability

Confidence: High


14. CLI/UX and Automation Ergonomics

14.1 User Experience Design

The framework prioritizes invisible operation: "Because the skills trigger automatically, you don't need to do anything special. Your coding agent just has Superpowers." (README.md, line 15)

Users interact via natural language:

  • "Help me plan this feature" triggers brainstorming
  • "Let's debug this issue" triggers systematic-debugging
  • "Build X" triggers brainstorming first, then implementation

14.2 Installation Ergonomics

  • Claude Code (best): Two commands via marketplace
  • Cursor: One command via marketplace
  • Codex: Clone + symlink (no package manager dependency)
  • OpenCode: Clone + two symlinks (plugin + skills)
  • Windows: Extensive documentation for all three shells (CMD, PowerShell, Git Bash)

14.3 Test Infrastructure for Automation

Test helpers (/tmp/ai-harness-repos/superpowers/tests/claude-code/test-helpers.sh):

  • run_claude -- Runs Claude in headless mode with timeout
  • assert_contains / assert_not_contains -- Pattern matching
  • assert_count -- Exact occurrence counting
  • assert_order -- Pattern ordering verification
  • create_test_project / create_test_plan -- Fixture creation
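
The ordering assertion is the most interesting helper: it verifies, for example, that the Skill tool was invoked before any Edit occurred. A Python analogue of the bash helper (the shell version's exact semantics may differ):

```python
def assert_order(transcript, *patterns):
    """Return True if every pattern appears in the transcript, in order.
    Each search starts after the previous match, so out-of-order or
    missing patterns both fail."""
    pos = -1
    for pattern in patterns:
        pos = transcript.find(pattern, pos + 1)
        if pos == -1:
            return False
    return True
```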

Test suites:

  1. tests/claude-code/ -- Integration tests using claude -p
  2. tests/explicit-skill-requests/ -- Verifies skill invocation when explicitly named
  3. tests/skill-triggering/ -- Verifies skills trigger from naive prompts
  4. tests/subagent-driven-dev/ -- End-to-end workflow tests with real projects
  5. tests/opencode/ -- OpenCode-specific tests

14.4 Test Categories and Coverage

The testing infrastructure supports four distinct test categories, each targeting a different failure mode:

1. Skill Compliance Tests (/tmp/ai-harness-repos/superpowers/tests/claude-code/test-subagent-driven-development.sh): These verify that the agent follows skill instructions correctly. Example tests from the 9-test SDD suite:

  • test_skill_invoked -- Verifies the Skill tool is called
  • test_plan_read_once -- Verifies plan is not re-read for each task
  • test_reviews_happen -- Verifies both spec and quality reviews occur
  • test_no_parallel_dispatch -- Verifies tasks are sequential
  • test_todo_tracking -- Verifies TodoWrite is used

2. Implicit Triggering Tests (/tmp/ai-harness-repos/superpowers/tests/skill-triggering/): These verify that skills are triggered from natural language prompts without naming the skill explicitly. The test runs Claude with a naive prompt and checks whether the correct skill was invoked.

3. Explicit Request Tests (/tmp/ai-harness-repos/superpowers/tests/explicit-skill-requests/): These verify that naming a skill by name causes it to be invoked. Critically, these also check for premature action -- whether the agent started working BEFORE loading the skill (lines 97-121 of run-test.sh). This catches the failure mode where the agent begins implementing immediately, only loading the skill as an afterthought.

4. End-to-End Workflow Tests (/tmp/ai-harness-repos/superpowers/tests/subagent-driven-dev/): These run complete multi-task plans through the SDD pipeline and verify the output project builds, tests pass, and all artifacts are created. The Go fractals test plan includes 10 tasks from project setup through README creation.

14.5 Reporting

Token usage analysis via tests/claude-code/analyze-token-usage.py:

  • Breaks down usage by main session and individual subagents
  • Shows input tokens, output tokens, cache usage, estimated cost
  • Per-agent description extraction from prompts

14.6 Assessment

Strengths:

  • Invisible operation for end users
  • Comprehensive cross-platform installation
  • Good test infrastructure for skill validation
  • Token usage visibility

Limitations:

  • No web UI or dashboard
  • No progress visualization during execution
  • No real-time status updates
  • No undo/rollback beyond git

Confidence: High


15. Cost/Usage Visibility and Governance

15.1 Token Usage Analysis

Tool: /tmp/ai-harness-repos/superpowers/tests/claude-code/analyze-token-usage.py

Parses Claude Code JSONL session transcripts to provide:

  • Per-message token usage (input, output, cache creation, cache read)
  • Per-subagent breakdown
  • Total cost estimate (at $3/$15 per M tokens for input/output)

Example from real test run (docs/testing.md, lines 103-129):

  • Total tokens: 1,524,058
  • Estimated cost: $4.67
  • 7 subagents dispatched (2 implementers, 2 spec reviewers, 2 code quality reviewers, 1 final reviewer)
  • Heavy cache utilization (1.38M cache read tokens vs 62 direct input tokens)

15.2 Cost Awareness in Skills

The SDD skill explicitly acknowledges cost trade-offs (lines 193-197):

Cost:

  • More subagent invocations (implementer + 2 reviewers per task)
  • Controller does more prep work
  • Review loops add iterations
  • But catches issues early (cheaper than debugging later)

15.3 Cache Utilization Insights

The documented test run reveals an important cost optimization that occurs naturally:

  • Direct input tokens: 62 (negligible)
  • Cache read tokens: 1,380,000+ (vast majority)
  • Cache creation tokens: ~80,000

This means that after the first subagent, subsequent subagents benefit heavily from cache hits. The controller's context (plan, project files, skill instructions) is cached after the first subagent reads them, and all subsequent subagents hit this cache. This is a significant cost advantage of sequential task execution -- parallel execution would likely create separate cache entries, reducing cache efficiency.

15.4 Assessment

Strengths: Post-hoc cost analysis tool exists and is documented. Cache utilization is naturally optimized by sequential execution.

Limitations:

  • No real-time cost tracking during execution
  • No cost caps or budgets that would halt execution
  • No per-session cost reporting built into the workflow
  • No cost optimization automation (e.g., using cheaper models for reviews)
  • Cost analysis is a separate tool, not integrated into the workflow
  • No cost comparison between SDD mode and executing-plans mode

Confidence: High


16. Tooling and Dependency Surface

16.1 Runtime Dependencies

| Dependency | Required By | Notes |
| --- | --- | --- |
| Bash | hooks/session-start | POSIX-compatible; uses ${BASH_SOURCE[0]:-$0} for portability |
| Git | using-git-worktrees, finishing-a-development-branch | Worktrees, branches, diffing |
| Node.js | lib/skills-core.js, render-graphs.js | ES modules; only needed for OpenCode/Codex |
| Python 3 | analyze-token-usage.py | Optional; only for test analysis |
| GraphViz (dot) | render-graphs.js | Optional; only for visualizing flowcharts |
| GitHub CLI (gh) | finishing-a-development-branch | Optional; for PR creation |

16.2 Platform Dependencies

| Platform | Integration Method | Requirements |
| --- | --- | --- |
| Claude Code | Plugin marketplace | Claude Code CLI |
| Cursor | Plugin marketplace | Cursor with plugin support |
| OpenCode | JavaScript plugin + symlinks | OpenCode with experimental hooks |
| Codex | Native skill discovery + symlink | Codex CLI |

16.3 Zero-Dependency Design

The core skills are pure markdown with no executable dependencies. The only executable code is:

  • hooks/session-start (51 lines bash) -- bootstrap injection
  • hooks/run-hook.cmd (46 lines polyglot) -- cross-platform wrapper
  • lib/skills-core.js (208 lines JS) -- skill discovery for OpenCode/Codex
  • .opencode/plugins/superpowers.js (95 lines JS) -- OpenCode plugin
  • skills/writing-skills/render-graphs.js (168 lines JS) -- optional visualization
  • skills/systematic-debugging/find-polluter.sh (63 lines bash) -- debugging utility
  • tests/claude-code/analyze-token-usage.py (168 lines Python) -- test analysis

Total executable code: ~800 lines across 7 files. Everything else is markdown.

16.4 Assessment

Strengths:

  • Minimal dependency footprint
  • No package.json, no npm install, no build step
  • Pure markdown skills work across all platforms
  • Cross-platform polyglot wrapper handles Windows/Unix differences

Limitations:

  • Bash dependency for hooks limits pure-Windows environments (mitigated by Git for Windows)
  • No formal dependency management (no package-lock, no version pinning)
  • render-graphs.js uses CommonJS require(), not ES modules (inconsistent with skills-core.js)

Confidence: High


17. External Integrations and Provider Compatibility

17.1 AI Provider Compatibility

Superpowers is model-agnostic at the skill level -- skills are markdown instructions that work with any LLM. However:

  • Primary target: Claude (all references use Claude-specific tools: Skill, Task, TodoWrite, Read, Write, Edit, Bash)
  • OpenCode mapping: TodoWrite -> update_plan, Task -> @mention, Skill -> native skill tool
  • Codex mapping: Limited -- "manual work instead of delegation" for subagent workflows

17.2 Tool Mapping

From OpenCode plugin (/tmp/ai-harness-repos/superpowers/.opencode/plugins/superpowers.js, lines 64-73):

```
TodoWrite            -> update_plan
Task (subagent)      -> @mention syntax
Skill tool           -> native skill tool
Read/Write/Edit/Bash -> native tools
```
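
What a mechanical application of this mapping looks like can be sketched as a whole-word rewrite over skill prose. This is illustrative only -- the actual plugin documents the mapping in its injected prompt rather than rewriting skill text, and the OpenCode-side tool names are taken from the table above:

```python
import re

TOOL_MAP = {
    "TodoWrite": "update_plan",
    "Task": "@mention",
}

def translate_tools(skill_text, tool_map=TOOL_MAP):
    """Replace Claude-specific tool names with their host equivalents.
    Word boundaries prevent partial matches (e.g. 'Tasks' is untouched)."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, tool_map)) + r")\b")
    return pattern.sub(lambda m: tool_map[m.group(1)], skill_text)
```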

17.3 Git Integration

Deep git integration throughout:

  • Worktree creation and management
  • Branch operations (create, merge, delete)
  • Commit tracking (per-task commits)
  • Diff analysis for code review
  • PR creation via gh CLI

17.4 Assessment

Strengths:

  • Multi-platform support (4 platforms)
  • Clean tool mapping strategy for platform differences
  • Git as universal integration point

Limitations:

  • Heavy Claude-specific tool references in skills (requires mapping for other platforms)
  • Codex subagent support is degraded (manual fallback)
  • No MCP server integration
  • No external API integrations (Jira, Linear, etc.)
  • No CI/CD pipeline integration

Confidence: High


18. Operational Assumptions and Constraints

18.1 Assumptions

  1. Agent follows instructions -- The entire framework relies on the AI agent reading and following skill documents. No runtime enforcement exists.

  2. Fresh context per subagent -- Subagent-driven development assumes each Task tool invocation provides a clean context. This is a platform-specific behavior of Claude Code.

  3. Git repository present -- Many skills assume a git repository exists with proper configuration.

  4. Test infrastructure available -- TDD skill assumes test framework is set up and tests can be run.

  5. Single-developer workflow -- No multi-user coordination, no conflict resolution between concurrent developers.

  6. English language -- All skills, prompts, and instructions are in English.

  7. Network access for tools -- Some skills reference gh CLI which requires GitHub access.

18.2 Constraints

  1. Context window limits -- Skills are designed to be compact (< 500 lines) to avoid consuming too much context.

  2. No persistent state -- Framework cannot track progress across sessions.

  3. Platform-specific features -- Some features (subagent dispatch, task tracking) vary by platform.

  4. Advisory enforcement only -- All rules are enforced by prompt engineering, not code.

18.3 Assessment

The operational assumptions are reasonable for the target use case (individual developer using Claude Code on a git repository). They become limiting for enterprise/team/multi-repo scenarios.

Confidence: High


19. Failure Modes and Issues Observed

19.1 Documented Failure Modes

From release notes and improvement plans:

1. Agent rationalization bypass (v3.2.2, v4.0.3)

  • Agent thinks "I know what that means" and skips skill invocation
  • Agent starts working before loading requested skill
  • Multiple iterations of anti-rationalization engineering required

2. Description trap (v4.0.0)

  • Skill descriptions containing workflow summaries cause agent to follow description instead of full skill
  • Led to one-review instead of two-review process

3. SessionStart timing (v4.3.0, v4.2.0)

  • Async hook could fail to complete before first turn
  • But sync hook froze Windows TUI
  • Fix: synchronous on Unix; Windows was temporarily made async to avoid the TUI freeze, then switched back to sync

4. Windows execution failures (v2.0.1, v4.1.0, v4.2.0, v4.3.1)

  • CRLF line ending issues
  • Path with spaces
  • Missing WSL
  • .sh auto-detection breaking polyglot wrapper
  • set -euo pipefail fragility on MSYS
  • O(n^2) escape_for_json performance (60+ seconds)

5. EnterPlanMode bypass (v4.3.0)

  • Claude enters native plan mode instead of using brainstorming skill
  • Fixed by adding EnterPlanMode intercept in using-superpowers flowchart

19.2 From Improvement Plans

From /tmp/ai-harness-repos/superpowers/docs/plans/2025-11-28-skills-improvements-from-user-feedback.md:

6. Configuration change verification gap

  • Agent reports "OpenAI integration working" but response shows Claude model
  • Verified operation succeeded, not that intended configuration was applied
  • Impact: High (false confidence in tests)

7. Background process accumulation

  • Multiple subagents start background servers, processes accumulate
  • Later tests hit stale server with wrong config
  • Impact: Medium-High

8. Mock-interface drift

  • Mocks derived from buggy implementation, not interface definition
  • Tests pass, runtime crashes
  • Impact: High

9. Skills not being read

  • Skills exist but neither human nor subagents read them
  • Skill investment wasted
  • Impact: Medium

19.3 Taxonomy of Failure Modes

Analyzing the nine documented failure modes, they fall into three distinct categories:

Category A: Agent Compliance Failures (4 instances)

  • Rationalization bypass (agent skips skill)
  • Description trap (agent follows summary instead of full skill)
  • EnterPlanMode bypass (agent uses native plan mode instead of brainstorming)
  • Skills not being read (neither human nor agent reads available skills)

These are the most fundamental and hardest to fix. They represent the inherent fragility of the advisory-only enforcement model. Each required multiple iterations of anti-rationalization engineering.

Category B: Platform/Environment Failures (3 instances)

  • SessionStart timing (async hook completing too late)
  • Windows execution failures (CRLF, paths, WSL, shell detection)
  • Background process accumulation (stale servers from subagents)

These are engineering problems with engineering solutions. The Windows failures in particular consumed significant development effort across v2.0.1, v4.1.0, v4.2.0, and v4.3.1.

Category C: Verification Methodology Failures (2 instances)

  • Configuration change verification gap (verified operation, not configuration)
  • Mock-interface drift (mocks derived from buggy implementation)

These represent genuine insights into AI agent testing methodology. The configuration verification gap is particularly subtle: an agent can correctly report "operation succeeded" while the underlying configuration is wrong (e.g., OpenAI integration "working" but actually using Claude model).

19.4 Assessment

The documented failure modes reveal an honest and rigorous development process. The team actively discovers, documents, and addresses failures. The most concerning pattern is the fundamental reliance on agent compliance (Category A) -- every compliance failure ultimately traces back to the agent not following instructions as expected, and the only remediation is more detailed instructions.

The taxonomy also reveals an important insight: the most impactful failures are NOT the obvious ones (platform bugs, timing issues) but the subtle ones (verification gaps, mock drift) where the agent believes it is compliant but is actually failing.

Confidence: High


20. Governance and Guardrails

20.1 Skill-Level Guardrails

Each skill implements guardrails through:

  • Iron Laws -- Absolute rules that cannot be violated
  • Hard Gates -- Must-complete-before-proceeding barriers
  • Red Flags lists -- Thought patterns that indicate rationalization
  • Rationalization tables -- Pre-emptive counters to expected excuses
  • Gate functions -- Explicit decision trees before actions
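A gate function of the kind the skills describe can be sketched as an explicit checklist that must pass before a risky action. The function name and checks below are hypothetical illustrations of the pattern, not code from any skill.

```python
def gate_before_commit(tests_ran: bool, tests_passed: bool,
                       on_main_branch: bool) -> tuple[bool, str]:
    """Explicit decision tree: every check must pass before committing."""
    if not tests_ran:
        return False, "HARD GATE: run the test suite first"
    if not tests_passed:
        return False, "HARD GATE: failing tests block the commit"
    if on_main_branch:
        return False, "IRON LAW: never commit directly to main"
    return True, "gate passed"
```

In Superpowers these gates live as prose in SKILL.md files and are executed by the agent, not by a runtime; the sketch shows what the same logic looks like when made executable.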

20.2 Workflow-Level Guardrails

  • Brainstorming must complete before implementation
  • Worktree must be set up before execution
  • Spec compliance must pass before code quality review
  • Tests must pass before finishing branch
  • Discard requires typed confirmation

20.3 Anti-Sycophancy Guardrails

The receiving-code-review skill (/tmp/ai-harness-repos/superpowers/skills/receiving-code-review/SKILL.md) explicitly forbids performative agreement and establishes a protocol for receiving feedback with technical rigor:

Forbidden Responses (lines 29-33):

  • "You're absolutely right!" (explicit CLAUDE.md violation)
  • "Great point!" / "Excellent feedback!" (performative)
  • "Let me implement that now" (before verification)

Required Pattern Instead (lines 16-25):

WHEN receiving code review feedback:
1. READ: Complete feedback without reacting
2. UNDERSTAND: Restate requirement in own words (or ask)
3. VERIFY: Check against codebase reality
4. EVALUATE: Technically sound for THIS codebase?
5. RESPOND: Technical acknowledgment or reasoned pushback
6. IMPLEMENT: One item at a time, test each

Source-Specific Trust Levels (lines 59-86): The skill differentiates between feedback from the human partner (trusted, but still requiring understanding) and feedback from external reviewers, which must be verified against five checkpoints: technical correctness, whether it breaks existing functionality, the reason for the current implementation, cross-platform compatibility, and full context understanding.

YAGNI Check for "Professional" Features (lines 88-98): When a reviewer suggests "implementing properly," the agent must first grep codebase for actual usage before implementing. If the endpoint/feature is unused, the correct response is to suggest removing it (YAGNI), not implementing it "properly."

This represents one of the most thorough anti-sycophancy implementations in any AI agent framework. It addresses the well-documented tendency of LLMs to agree with authority figures (reviewers) regardless of technical merit.

20.4 YAGNI Enforcement

Multiple layers of YAGNI enforcement:

  • Brainstorming: "YAGNI ruthlessly - Remove unnecessary features from all designs"
  • Writing plans: "DRY, YAGNI, TDD"
  • Spec review: Catches "Extra/unneeded work" and "nice to haves" not in spec
  • Code review: Checks "No scope creep"
  • Receiving review: "grep codebase for actual usage" before implementing suggested features

20.5 Assessment

Strengths: Deeply layered guardrail system, anti-rationalization engineering, anti-sycophancy measures.

Limitations:

  • All guardrails are advisory
  • No audit trail of guardrail compliance
  • No automated detection of guardrail violations
  • No way to prove guardrails were followed (only that the skill text exists)

Confidence: High


21. Roadmap/Evolution Signals, Missing Areas, Unresolved Issues

21.1 Evolution Trajectory

The release history (v1.0 -> v4.3.1 over ~16 months) shows clear evolution:

  1. v1.x: Monolithic plugin with embedded skills
  2. v2.0: Skills separated into external repository, community contribution model
  3. v3.0: Adopted Anthropic's first-party skills system
  4. v3.x: Added Codex and OpenCode support, skill namespacing
  5. v4.0: DOT flowcharts, two-stage review, testing infrastructure
  6. v4.1-4.3: Windows hardening, Cursor support, anti-rationalization strengthening

The trend is toward tighter behavioral enforcement, broader platform support, and better testing infrastructure.

Key inflection points:

  • v2.0 was the first major architecture shift (monolith -> modular skills), indicating the original design was too rigid.
  • v3.0 was the second major shift (custom skill system -> Anthropic's first-party system), indicating willingness to abandon custom infrastructure in favor of platform-native features.
  • v4.0 was the content quality revolution (DOT flowcharts, two-stage review, testing infrastructure), indicating the team recognized that skill content quality was the primary bottleneck.
  • v4.1-4.3 was the hardening phase (Windows fixes, anti-rationalization, sync hooks), indicating the framework was mature enough for real-world users to surface edge cases.

This trajectory suggests the next major evolution will likely focus on one of: measurement/observability (proving skills work), multi-agent coordination (team workflows), or cost optimization (cheaper review loops).

21.2 Active Improvement Areas

From /tmp/ai-harness-repos/superpowers/docs/plans/2025-11-28-skills-improvements-from-user-feedback.md:

Phase 1 (High-Impact, Low-Risk):

  • Configuration change verification in verification-before-completion
  • Mock-interface drift anti-pattern in testing-anti-patterns
  • Explicit file reading in code reviewer template

Phase 2 (Moderate Changes):

  • Process hygiene for E2E tests (kill stale processes before/after)
  • Self-reflection step for implementers
  • Skills reading requirement for test subagents

Phase 3 (Optimization):

  • Lean context option for pattern-based tasks
  • Allow implementer to fix self-identified issues

21.3 Missing Areas

Not present in the framework:

  1. No cost governance -- No budgets, caps, or cost-based decisions
  2. No formal CI/CD integration -- No GitHub Actions, no pre-commit hooks
  3. No multi-repo support -- Single repository assumption
  4. No team coordination -- Single developer workflow
  5. No dependency management -- No package manager integration
  6. No environment management -- No Docker, no virtual environments
  7. No telemetry or observability -- No metrics, no dashboards, no alerts
  8. No configuration management -- Skills are not configurable per-project
  9. No versioned skill contracts -- Skills evolve without formal versioning
  10. No rollback mechanism -- Beyond git revert, no workflow-level rollback

21.4 Unresolved Issues

From the improvement plan's open questions:

  1. Lean context vs full plan: Should lean context be default for pattern-based tasks?
  2. Self-reflection overhead: Will it slow down simple tasks?
  3. Process hygiene scope: In SDD or separate skill? Beyond E2E?
  4. Skills reading enforcement: Should ALL subagents read relevant skills?
  5. Prompt bloat risk: All improvements add more text to prompts

21.5 Assessment

The framework is maturing rapidly but has clear gaps in enterprise readiness, team collaboration, and operational observability. The improvement plan shows awareness of real-world failure modes and a disciplined approach to addressing them.

Confidence: High for documented items; Medium for roadmap predictions


22. What Should Be Borrowed/Adapted into Maestro and What Should Not

22.1 STRONGLY BORROW

1. Anti-Rationalization Engineering (Critical)

Superpowers' most distinctive contribution is its systematic approach to preventing agent rationalization:

  • Rationalization tables with pre-emptive counters
  • Red flags lists for self-checking
  • Gate functions as decision trees
  • "Violating the letter is violating the spirit" foundational principle
  • Persuasion principles (authority, commitment, scarcity) applied to skill design

Why borrow: Every harness faces the problem of agents bypassing constraints. Superpowers has invested the most iteration into solving this.

How to adapt: Build rationalization prevention into Maestro's prompt templates for every critical decision point. Don't rely on instructions alone -- test against agent behavior.

2. Two-Stage Code Review (Spec Compliance + Quality)

Separating spec compliance from code quality is a genuine insight:

  • Catches the common failure where "code is well-written but doesn't match what was requested"
  • Each review has a different reviewer mindset and checklist
  • Reviews are loops, not one-shot

Why borrow: Most harnesses do one review or none. Two-stage catches fundamentally different failure modes.

How to adapt: Implement as sequential review stages in Maestro's pipeline. Spec reviewer should be explicitly skeptical ("finished suspiciously quickly").
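The sequential-stage structure can be sketched as follows. This is a minimal illustration under assumed interfaces: `Review` and `Fixer` stand in for LLM reviewer and implementer calls, which are not specified by the source.

```python
from typing import Callable

Review = Callable[[str], list[str]]  # returns a list of issues; empty = pass
Fixer = Callable[[str, list[str]], str]

def two_stage_review(code: str, spec_review: Review, quality_review: Review,
                     fix: Fixer, max_rounds: int = 3) -> str:
    """Spec compliance must fully pass before code quality review begins;
    each stage is a loop (review -> fix -> re-review), not a one-shot check."""
    for review in (spec_review, quality_review):
        for _ in range(max_rounds):
            issues = review(code)
            if not issues:
                break
            code = fix(code, issues)
        else:
            raise RuntimeError("review loop did not converge")
    return code
```

The key design choice the sketch preserves: the quality stage never sees code that has not already passed the spec stage, so "well-written but wrong" work cannot slip through.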

3. TDD for Skill/Prompt Documentation

The RED-GREEN-REFACTOR cycle applied to prompt engineering:

  • Baseline test without skill (watch agent fail)
  • Write skill addressing specific failures
  • Close loopholes through iteration
  • Pressure scenarios with combined pressures

Why borrow: Prompt engineering is currently ad-hoc in most harnesses. This provides rigor.

How to adapt: Build a testing framework for Maestro's prompts that runs scenarios against agent behavior and measures compliance.

4. Task Context Provision Strategy

Controller reads plan once, extracts all tasks, provides full text to subagents:

  • No file reading overhead
  • Curated context per task
  • Questions surfaced before work begins

Why borrow: Reduces subagent token usage and increases focus.
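The controller-side strategy can be sketched as: parse the plan once, split it into per-task sections, and embed each task's full text in the subagent prompt. The `## Task` heading convention and both function names are assumptions for illustration.

```python
def extract_tasks(plan_markdown: str) -> list[str]:
    """Split a plan into task sections on '## Task' headings."""
    tasks, current = [], []
    for line in plan_markdown.splitlines():
        if line.startswith("## Task"):
            if current:
                tasks.append("\n".join(current).strip())
            current = [line]
        elif current:
            current.append(line)
    if current:
        tasks.append("\n".join(current).strip())
    return tasks

def build_subagent_prompt(task_text: str) -> str:
    """Embed the full task text so the subagent never reads plan files itself."""
    return ("You are implementing one task from an approved plan.\n"
            "Full task text follows; surface questions BEFORE starting work.\n\n"
            + task_text)
```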

5. Description Trap Awareness

Skill descriptions must be trigger-only, never workflow summaries. This discovery prevents a subtle but devastating failure mode.

Why borrow: Any system with skill/prompt routing must account for this.

6. Spec Reviewer Skepticism Pattern

The spec reviewer prompt template (/tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/spec-reviewer-prompt.md) opens with an explicit instruction to distrust the implementer:

"The implementer finished suspiciously quickly. Their report may be incomplete, inaccurate, or optimistic. You MUST verify everything independently."

This is a deliberate psychological priming technique. By framing the implementer's work as suspicious, the reviewer is far less likely to rubber-stamp the review. The reviewer is instructed to:

  • NOT trust the implementer's report
  • Read actual code and compare to requirements line by line
  • Report: missing requirements, extra/unneeded work, misunderstandings

Why borrow: Without explicit skepticism priming, AI reviewers default to approval (sycophancy bias). The framing of "suspiciously quickly" is simple to implement and dramatically changes review quality.
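Adapted for Maestro, the priming could be baked into a prompt builder. The opening lines quote the skill's own template; the surrounding scaffolding (`spec_reviewer_prompt`, the section headers) is a hypothetical sketch.

```python
def spec_reviewer_prompt(task_text: str, implementer_report: str) -> str:
    """Build a spec-review prompt that primes the reviewer to distrust the report."""
    return "\n".join([
        "The implementer finished suspiciously quickly. Their report may be",
        "incomplete, inaccurate, or optimistic. You MUST verify everything",
        "independently.",
        "",
        "Do NOT trust the report below. Read the actual code and compare it",
        "to the requirements line by line. Report: missing requirements,",
        "extra/unneeded work, misunderstandings.",
        "",
        "## Task requirements",
        task_text,
        "## Implementer's report (untrusted)",
        implementer_report,
    ])
```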

7. Verification Before Completion Pattern

IDENTIFY command -> RUN it -> READ output -> VERIFY claim -> THEN claim

This is simple, powerful, and prevents the most common agent failure: claiming success without evidence.

22.2 SELECTIVELY BORROW

8. DOT Flowcharts as Specifications

Flowcharts are harder to skip than prose. However, they add visual complexity that may not scale.

How to adapt: Use for critical decision points and process flows, not for everything.

9. Git Worktree Isolation

Good default for feature work isolation, but may be too opinionated for Maestro's broader use cases.

How to adapt: Support worktrees as one isolation strategy among several (Docker, venvs, etc.).

10. Brainstorming-First Mandate

Forcing brainstorming before implementation is valuable for preventing premature coding, but may be too heavy for small changes.

How to adapt: Scale the design phase to the change size. Small changes might skip full brainstorming.

11. Persuasion Principles for Prompt Design

Academic foundation (Cialdini 2021, Meincke et al. 2025) for why certain prompt patterns work. Useful reference but don't over-apply.

22.3 DO NOT BORROW

12. Agent-as-Orchestrator Model

Superpowers makes the AI agent the orchestrator, guided only by markdown instructions. This is elegant but fundamentally limits:

  • Enforcement (all rules are advisory)
  • Observability (no runtime metrics)
  • Recovery (no checkpoint/restore)
  • Scalability (single agent bottleneck)
  • Reproducibility (agent behavior varies)

Why not: Maestro should have a runtime orchestrator that provides enforcement, observability, and recovery.

13. Zero-Persistence Design

No state saved between sessions, no progress tracking, no checkpoint mechanism.

Why not: Maestro needs persistent state for long-running workflows, team coordination, and recovery.

14. Advisory-Only Enforcement

All quality gates enforced by skill text, not runtime code.

Why not: Critical guardrails (cost limits, security scanning, main branch protection) should have runtime enforcement, not just prompts.

15. Single-Platform Tool References

Skills reference Claude-specific tools (Task, TodoWrite, Skill) with mappings for other platforms.

Why not: Maestro should abstract tool references to be platform-agnostic from the start.

16. Windows Polyglot Wrapper Pattern

The cmd/bash polyglot is clever but fragile and has caused numerous issues (#518, #504, #491, #487, #466, #440, #331, #285, #243).

Why not: Use a proper cross-platform runtime (Node.js, Deno) instead of bash.

22.4 Summary Matrix

| # | Feature | Verdict | Priority | Effort |
|---|---------|---------|----------|--------|
| 1 | Anti-rationalization engineering | BORROW | Critical | Medium |
| 2 | Two-stage code review | BORROW | High | Low |
| 3 | TDD for prompts | BORROW | High | High |
| 4 | Task context provision | BORROW | High | Low |
| 5 | Description trap awareness | BORROW | High | Low |
| 6 | Spec reviewer skepticism pattern | BORROW | High | Low |
| 7 | Verification before completion | BORROW | High | Low |
| 8 | DOT flowcharts | SELECTIVE | Medium | Low |
| 9 | Git worktree isolation | SELECTIVE | Medium | Low |
| 10 | Brainstorming-first | SELECTIVE | Medium | Low |
| 11 | Persuasion principles | SELECTIVE | Low | Low |
| 12 | Agent-as-orchestrator | DO NOT | - | - |
| 13 | Zero-persistence | DO NOT | - | - |
| 14 | Advisory-only enforcement | DO NOT | - | - |
| 15 | Single-platform tool refs | DO NOT | - | - |
| 16 | Polyglot wrapper | DO NOT | - | - |

23. Cross-Links

Related Sections in Other Analysis Reports

everything-claude-code-deep-analysis.md:

  • Section: "Skills System" -- How Claude Code's native skill system works (the platform Superpowers targets)
  • Section: "Hooks System" -- How SessionStart hooks inject context
  • Section: "Task Tool" -- How subagent dispatch works
  • Section: "TodoWrite" -- How task tracking works
  • Section: "Plugin Marketplace" -- How plugins are distributed

agent-orchestrator-deep-analysis.md:

  • Section: "Orchestration Patterns" -- Compare agent-as-orchestrator (Superpowers) vs runtime orchestrator
  • Section: "Subagent Management" -- Compare dispatch, monitoring, and review patterns
  • Section: "Quality Gates" -- Compare enforcement mechanisms (advisory vs runtime)
  • Section: "Context Management" -- Compare progressive disclosure strategies
  • Section: "Multi-Platform Support" -- Compare abstraction strategies

maestro-deep-analysis.md:

  • Section: "Design Philosophy" -- Compare with Superpowers' skill-based approach
  • Section: "Workflow Pipeline" -- Compare brainstorm->plan->execute->review flow
  • Section: "Code Review" -- Compare single vs two-stage review
  • Section: "Isolation Model" -- Compare worktrees vs other isolation strategies
  • Section: "Cost Governance" -- What Maestro has that Superpowers lacks
  • Section: "Security" -- What Maestro has that Superpowers lacks

harness-consensus-report.md:

  • Section: "Common Patterns" -- Skills-based prompt injection pattern
  • Section: "Anti-Rationalization" -- Superpowers as category leader
  • Section: "Quality Gate Enforcement" -- Advisory vs runtime spectrum
  • Section: "Platform Compatibility" -- Multi-platform support comparison
  • Section: "Testing Approaches" -- TDD for prompts as novel methodology

final-harness-gap-report.md:

  • Section: "What No Harness Does Well" -- Enterprise features (security, compliance, observability)
  • Section: "Novel Contributions" -- Anti-rationalization engineering, two-stage review, TDD for docs
  • Section: "Recommended Architecture" -- Runtime orchestrator + skill-based prompts (best of both)
  • Section: "Priority Features for Maestro" -- Feature prioritization based on all harness analyses

Appendix A: File Index with Key Line References

| File | Key Lines | Purpose |
|------|-----------|---------|
| /tmp/ai-harness-repos/superpowers/README.md | 1-158 | Project overview, installation, workflow description |
| /tmp/ai-harness-repos/superpowers/RELEASE-NOTES.md | 1-802 | Complete version history (v1.0 through v4.3.1) |
| /tmp/ai-harness-repos/superpowers/.claude-plugin/plugin.json | 1-13 | Plugin manifest (v4.3.1) |
| /tmp/ai-harness-repos/superpowers/.cursor-plugin/plugin.json | 1-18 | Cursor plugin manifest with skills/agents/commands/hooks paths |
| /tmp/ai-harness-repos/superpowers/.opencode/plugins/superpowers.js | 1-95 | OpenCode plugin (system prompt transform injection) |
| /tmp/ai-harness-repos/superpowers/.codex/INSTALL.md | 1-67 | Codex installation (clone + symlink) |
| /tmp/ai-harness-repos/superpowers/hooks/hooks.json | 1-16 | Hook configuration (SessionStart, sync) |
| /tmp/ai-harness-repos/superpowers/hooks/session-start | 1-51 | Bootstrap injection script |
| /tmp/ai-harness-repos/superpowers/hooks/run-hook.cmd | 1-46 | Cross-platform polyglot wrapper |
| /tmp/ai-harness-repos/superpowers/lib/skills-core.js | 1-208 | Shared skill discovery/parsing module |
| /tmp/ai-harness-repos/superpowers/skills/using-superpowers/SKILL.md | 1-96 | Meta-skill: mandatory skill usage protocol |
| /tmp/ai-harness-repos/superpowers/skills/brainstorming/SKILL.md | 1-97 | Design exploration before implementation |
| /tmp/ai-harness-repos/superpowers/skills/writing-plans/SKILL.md | 1-117 | Implementation plan creation |
| /tmp/ai-harness-repos/superpowers/skills/executing-plans/SKILL.md | 1-85 | Batch execution with checkpoints |
| /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/SKILL.md | 1-242 | Fresh subagent per task + two-stage review |
| /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/implementer-prompt.md | 1-79 | Implementer subagent prompt template |
| /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/spec-reviewer-prompt.md | 1-62 | Spec compliance reviewer template |
| /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/code-quality-reviewer-prompt.md | 1-20 | Code quality reviewer template |
| /tmp/ai-harness-repos/superpowers/skills/test-driven-development/SKILL.md | 1-371 | RED-GREEN-REFACTOR cycle enforcement |
| /tmp/ai-harness-repos/superpowers/skills/test-driven-development/testing-anti-patterns.md | 1-300 | 5 testing anti-patterns with gate functions |
| /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/SKILL.md | 1-297 | 4-phase root cause investigation |
| /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/root-cause-tracing.md | 1-170 | Backward call chain tracing technique |
| /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/defense-in-depth.md | 1-122 | 4-layer validation strategy |
| /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/condition-based-waiting.md | 1-116 | Replace timeouts with condition polling |
| /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/find-polluter.sh | 1-63 | Test pollution bisection script |
| /tmp/ai-harness-repos/superpowers/skills/dispatching-parallel-agents/SKILL.md | 1-181 | Concurrent subagent dispatch pattern |
| /tmp/ai-harness-repos/superpowers/skills/using-git-worktrees/SKILL.md | 1-218 | Isolated workspace creation |
| /tmp/ai-harness-repos/superpowers/skills/finishing-a-development-branch/SKILL.md | 1-201 | Merge/PR/Keep/Discard decision workflow |
| /tmp/ai-harness-repos/superpowers/skills/requesting-code-review/SKILL.md | 1-106 | Pre-review dispatch pattern |
| /tmp/ai-harness-repos/superpowers/skills/requesting-code-review/code-reviewer.md | 1-147 | Code review agent template |
| /tmp/ai-harness-repos/superpowers/skills/receiving-code-review/SKILL.md | 1-214 | Anti-sycophancy review response protocol |
| /tmp/ai-harness-repos/superpowers/skills/verification-before-completion/SKILL.md | 1-140 | Evidence-before-claims enforcement |
| /tmp/ai-harness-repos/superpowers/skills/writing-skills/SKILL.md | 1-656 | TDD for documentation methodology |
| /tmp/ai-harness-repos/superpowers/skills/writing-skills/testing-skills-with-subagents.md | 1-385 | Pressure testing methodology |
| /tmp/ai-harness-repos/superpowers/skills/writing-skills/persuasion-principles.md | 1-188 | Cialdini-based prompt design principles |
| /tmp/ai-harness-repos/superpowers/skills/writing-skills/anthropic-best-practices.md | 1-1151 | Anthropic's official skill authoring guide |
| /tmp/ai-harness-repos/superpowers/skills/writing-skills/render-graphs.js | 1-168 | DOT to SVG rendering tool |
| /tmp/ai-harness-repos/superpowers/agents/code-reviewer.md | 1-49 | Code reviewer agent definition |
| /tmp/ai-harness-repos/superpowers/commands/brainstorm.md | 1-7 | User-only slash command redirect |
| /tmp/ai-harness-repos/superpowers/commands/write-plan.md | 1-7 | User-only slash command redirect |
| /tmp/ai-harness-repos/superpowers/commands/execute-plan.md | 1-7 | User-only slash command redirect |
| /tmp/ai-harness-repos/superpowers/tests/claude-code/test-helpers.sh | 1-202 | Test assertion framework |
| /tmp/ai-harness-repos/superpowers/tests/claude-code/run-skill-tests.sh | 1-188 | Test runner with timeout/verbose/integration modes |
| /tmp/ai-harness-repos/superpowers/tests/claude-code/test-subagent-driven-development.sh | 1-166 | 9 tests for SDD skill compliance |
| /tmp/ai-harness-repos/superpowers/tests/claude-code/analyze-token-usage.py | 1-168 | JSONL token usage analyzer |
| /tmp/ai-harness-repos/superpowers/tests/skill-triggering/run-test.sh | 1-89 | Implicit skill triggering test |
| /tmp/ai-harness-repos/superpowers/tests/explicit-skill-requests/run-test.sh | 1-137 | Explicit skill request verification |
| /tmp/ai-harness-repos/superpowers/tests/subagent-driven-dev/run-test.sh | 1-107 | End-to-end SDD workflow test |
| /tmp/ai-harness-repos/superpowers/tests/subagent-driven-dev/go-fractals/plan.md | 1-173 | 10-task Go CLI test plan |
| /tmp/ai-harness-repos/superpowers/docs/testing.md | 1-304 | Testing guide with session transcript format |
| /tmp/ai-harness-repos/superpowers/docs/plans/2025-11-28-skills-improvements-from-user-feedback.md | 1-712 | 8 real-world failure reports with proposed fixes |
| /tmp/ai-harness-repos/superpowers/docs/windows/polyglot-hooks.md | 1-213 | Cross-platform hook documentation |

Appendix B: Confidence Scores Summary

| Analysis Area | Confidence | Basis |
|---------------|------------|-------|
| Design philosophy | High | Extensive documentation, consistent across all files |
| Core architecture | High | Complete codebase read, all files analyzed |
| Harness workflow | High | Every skill read, workflow tested end-to-end |
| Subagent orchestration | High | Detailed prompt templates and process flows |
| Parallelization strategy | High | Explicitly documented and constrained |
| Isolation model | High | Worktree skill fully documented |
| Human-in-the-loop | High | Every approval gate identified in skills |
| Context handling | High | Token budgets and cross-reference strategy documented |
| Session lifecycle | High | Hook code and configuration reviewed |
| Code quality gates | High | Every quality skill read and analyzed |
| Security mechanisms | High (absence) | Confirmed no security features present |
| Hooks and automation | High | All hook code and config reviewed |
| CLI/UX | High | Installation and test infrastructure reviewed |
| Cost visibility | High | Analysis tool and cost documentation reviewed |
| Tooling/dependencies | High | All executable code inventoried |
| External integrations | High | All platform adapters reviewed |
| Operational assumptions | High | Derived from skill requirements |
| Failure modes | High | Documented in release notes and plans |
| Governance | High | Every guardrail mechanism identified |
| Roadmap signals | Medium | Based on improvement plans, may not be complete |
| Maestro recommendations | Medium-High | Based on analysis, but Maestro requirements not fully known |

Appendix C: Quantitative Summary

| Metric | Value |
|--------|-------|
| Total files (non-.git) | ~90 |
| Total executable code | ~800 lines across 7 files |
| Total skill documents | 14 skills |
| Total supporting documents | ~15 files |
| Total test files | ~25 files |
| Lines of markdown in skills | ~4,500 |
| Release versions analyzed | v1.0 through v4.3.1 |
| Platforms supported | 4 (Claude Code, Cursor, Codex, OpenCode) |
| Documented failure modes | 9+ |
| Anti-rationalization entries | 40+ across all skills |
| Subagent prompt templates | 3 (implementer, spec reviewer, quality reviewer) |
| Slash commands | 3 (brainstorm, write-plan, execute-plan) |
| Agents defined | 1 (code-reviewer) |
| Token budget for SDD workflow | ~$4.67 per 2-task plan (documented test run) |
| Rationalization prevention entries | 40+ across all skills |
| Pressure test types documented | 7 (time, sunk cost, authority, economic, exhaustion, social, pragmatic) |
| Unique red flag patterns | 12 in using-superpowers, 12 in TDD, 8 in verification, others in each skill |
| Evolution timespan | v1.0 to v4.3.1 (~16 months of active development) |

Report generated by systematic analysis of all 90+ non-git files in the obra/superpowers repository at version 4.3.1.
