@jeffscottward
Created February 22, 2026 21:36
AI Harness Comparative Analysis: Maestro vs Superpowers vs ECC vs Agent Orchestrator (10,800+ lines)

ComposioHQ/agent-orchestrator — Deep Technical Analysis

Repository: ComposioHQ/agent-orchestrator
Analysis Date: 2026-02-22
Analyst: Claude Opus 4.6
Source Path: /tmp/ai-harness-repos/agent-orchestrator/
Report Length Target: 2000+ lines of detailed analysis


Table of Contents

  1. Design Philosophy & Goals
  2. Core Architecture
  3. Harness Workflow
  4. Subagent Orchestration
  5. Multi-Agent & Parallelization Strategy
  6. Isolation Model
  7. Human-in-the-Loop Controls
  8. Context Handling
  9. Session Lifecycle
  10. Code Quality Gates
  11. Security & Compliance
  12. Hooks & Automation
  13. CLI & UX
  14. Cost & Usage Visibility
  15. Tooling & Dependencies
  16. External Integrations
  17. Operational Assumptions & Prerequisites
  18. Failure Modes & Recovery
  19. Governance & Guardrails
  20. Roadmap & Evolution Signals
  21. What to Borrow / Adapt into Maestro
  22. Cross-Links

1. Design Philosophy & Goals

Confidence: High

1.1 Core Vision

Agent Orchestrator (AO) positions itself as a parallel AI coding agent harness with a clear tagline from the README:

"Spawn parallel AI coding agents. Monitor from one dashboard. Merge their PRs."

This is not a general-purpose AI orchestration framework. It is laser-focused on software development workflows where multiple AI coding agents work on different issues simultaneously, each in an isolated workspace, producing pull requests that a human reviews and merges.

Source: /tmp/ai-harness-repos/agent-orchestrator/README.md (lines 1-10)

1.2 Architectural Principles

The codebase reveals several deliberate design choices:

  1. Plugin-Everything Architecture: Every capability is behind a plugin interface — runtime, agent, workspace, tracker, SCM, notifier, terminal, lifecycle. This allows swapping implementations without touching core logic.

  2. Process Isolation via tmux: Rather than embedding agents in-process, AO spawns them as independent terminal processes inside tmux sessions. This is a pragmatic choice: Claude Code, Codex, Aider, and OpenCode are all CLI tools that expect a terminal environment.

  3. Flat-File State Over Databases: All session state lives in the filesystem as key=value metadata files. No SQLite, no Postgres, no Redis. This trades query capability for operational simplicity — you can debug state with cat and ls.

  4. Polling Over Event-Driven: The lifecycle manager polls every 30 seconds. The web dashboard polls every 5 seconds via SSE. There is no event bus, no pub/sub, no WebSocket push from core. This is explicitly acknowledged as a limitation.

  5. Fail-Open for Enrichment, Fail-Closed for Safety: PR enrichment (CI status, reviews) has timeouts and falls back gracefully. But CI status detection for open PRs is fail-closed — if the GitHub API errors, it reports "failing" rather than "none," preventing premature merges.

  6. Developer-Local First: The entire system runs on a single developer's machine. There is no multi-user support, no cloud deployment story, no containerization. The "server" is a Next.js dev server on localhost.
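The flat-file state approach in principle 3 can be sketched in a few lines. The following is a hypothetical reader for a `key=value` metadata file; the field names shown are illustrative, not AO's actual schema:

```typescript
// Hypothetical reader for a flat key=value metadata file, mirroring the
// "debug state with cat and ls" philosophy. Field names are illustrative.
function parseMetadata(raw: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const line of raw.split("\n")) {
    if (line.trim() === "") continue;
    const idx = line.indexOf("=");
    if (idx === -1) continue; // skip malformed lines rather than failing
    out[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return out;
}

const meta = parseMetadata("session_id=fix-login-abc123\nstatus=working\n");
```

The trade-off described above is visible even in this sketch: reads are trivial, but any query more complex than a key lookup means scanning directories and parsing files.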

1.3 Strengths

  • Extremely pragmatic: Instead of building a complex IPC system, they leverage tmux — a battle-tested terminal multiplexer that already handles process management, session persistence, and output capture.
  • Low barrier to entry: If you have tmux and a coding agent CLI, you can start using AO immediately. No infrastructure setup required.
  • Plugin system is well-designed: Clean interfaces with manifest metadata, Zod validation, and type-safe registration.

1.4 Limitations

  • Single-machine constraint: No distributed execution. All agents run on one machine, sharing CPU/memory/disk.
  • No persistence guarantees: If the machine reboots, tmux sessions are lost. Session restoration depends on the agent supporting --resume.
  • Polling latency: 30-second lifecycle polling means state changes can take up to 30 seconds to be detected and reacted to.
  • No cost controls: While cost is tracked (see Section 14), there are no budget limits, spending alerts, or automatic shutoff mechanisms.

1.5 Proven vs. Aspirational

The README lists support for agents: Claude Code, Codex CLI, Aider, OpenCode. However:

  • Proven: Claude Code plugin is 786 lines of deeply integrated code with JSONL parsing, activity detection, cost extraction, session restoration, and workspace hooks.
  • Aspirational: Codex, Aider, and OpenCode plugins exist but are significantly thinner. The plugin registry lists them (packages/core/src/plugin-registry.ts, lines 20-23), but several have placeholder implementations.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/plugin-registry.ts (lines 14-30)


2. Core Architecture

Confidence: High

2.1 Monorepo Structure

agent-orchestrator/
├── packages/
│   ├── core/           # Types, config, session manager, lifecycle, plugins
│   ├── cli/            # Commander.js CLI (ao command)
│   ├── web/            # Next.js dashboard
│   ├── plugins/
│   │   ├── agent-claude-code/
│   │   ├── runtime-tmux/
│   │   ├── workspace-worktree/
│   │   ├── scm-github/
│   │   ├── tracker-github/
│   │   ├── tracker-linear/
│   │   ├── notifier-desktop/
│   │   └── notifier-slack/
│   └── integration-tests/
├── pnpm-workspace.yaml
└── agent-orchestrator.yaml.example

Source: /tmp/ai-harness-repos/agent-orchestrator/pnpm-workspace.yaml (lines 1-3)

The monorepo uses pnpm workspaces with two package locations: packages/* and packages/plugins/*. All packages are ESM-only ("type": "module" in root package.json) with TypeScript in strict mode.

2.2 The Eight Plugin Slots

The plugin architecture defines eight distinct capability slots:

| Slot | Purpose | Built-in Implementations |
| --- | --- | --- |
| runtime | Process execution environment | tmux, process |
| agent | AI coding agent | claude-code, codex, aider, opencode |
| workspace | Code isolation | worktree, clone |
| tracker | Issue tracking | github, linear |
| scm | Source code management | github |
| notifier | Notifications | desktop, slack, composio, webhook |
| terminal | Terminal UI integration | iterm2, web |
| lifecycle | State machine customization | core (default) |

Each plugin implements a specific TypeScript interface and is registered with a manifest:

// From types.ts, lines 900-930
export interface PluginManifest {
  name: string;       // e.g., "tmux"
  slot: string;       // e.g., "runtime"
  version: string;
  description?: string;
}

export interface PluginModule<T = unknown> {
  manifest: PluginManifest;
  create: (ctx?: PluginContext) => T | Promise<T>;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/types.ts (lines 900-960)

2.3 Plugin Registry

The registry is a simple Map keyed by "slot:name":

// plugin-registry.ts
const plugins = new Map<string, PluginModule>();

function register(mod: PluginModule): void {
  const key = `${mod.manifest.slot}:${mod.manifest.name}`;
  plugins.set(key, mod);
}

function get<T>(slot: string, name: string): T {
  const key = `${slot}:${name}`;
  const mod = plugins.get(key);
  if (!mod) throw new Error(`Plugin not found: ${key}`);
  return mod.create() as T;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/plugin-registry.ts

The web package cannot use dynamic import() due to webpack bundling constraints, so it imports plugins statically:

// packages/web/src/lib/services.ts, lines 25-30
import pluginRuntimeTmux from "@composio/ao-plugin-runtime-tmux";
import pluginAgentClaudeCode from "@composio/ao-plugin-agent-claude-code";
import pluginWorkspaceWorktree from "@composio/ao-plugin-workspace-worktree";
import pluginScmGithub from "@composio/ao-plugin-scm-github";
import pluginTrackerGithub from "@composio/ao-plugin-tracker-github";
import pluginTrackerLinear from "@composio/ao-plugin-tracker-linear";

This is a practical workaround but creates a maintenance burden — new plugins must be manually added to this import list.

2.4 Core Services Singleton

The web package uses a globalThis-cached singleton pattern for services initialization:

// packages/web/src/lib/services.ts, lines 38-58
const globalForServices = globalThis as typeof globalThis & {
  _aoServices?: Services;
  _aoServicesInit?: Promise<Services>;
};

export function getServices(): Promise<Services> {
  if (globalForServices._aoServices) {
    return Promise.resolve(globalForServices._aoServices);
  }
  if (!globalForServices._aoServicesInit) {
    globalForServices._aoServicesInit = initServices().catch((err) => {
      globalForServices._aoServicesInit = undefined;
      throw err;
    });
  }
  return globalForServices._aoServicesInit;
}

Note the error recovery: if initialization fails, the cached promise is cleared so subsequent calls retry rather than permanently returning a rejected promise.

2.5 Hash-Based Directory Structure

AO uses a SHA-256 hash of the config file's directory path to create globally unique namespaces:

// paths.ts
import { createHash } from "node:crypto";
import { realpathSync } from "node:fs";

export function generateConfigHash(configDir: string): string {
  const resolved = realpathSync(configDir);
  return createHash("sha256").update(resolved).digest("hex").slice(0, 12);
}

The directory hierarchy:

~/.agent-orchestrator/
  {12-char-hash}-{projectId}/
    sessions/
      {sessionName}/
        metadata        # key=value flat file
        prompt.md       # agent system prompt
    archive/
      {sessionName}_{timestamp}   # archived metadata
    worktrees/
      {sessionName}/    # git worktree checkout

Source: /tmp/ai-harness-repos/agent-orchestrator/ARCHITECTURE.md (full document)
Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/paths.ts

The hash is computed from the resolved path (symlinks followed via realpathSync), meaning /foo/bar and /foo/bar-link -> /foo/bar hash to the same value. This prevents accidental duplication.

Collision detection is also implemented — each instance directory contains an .origin file storing the original path. If two different config directories ever produce the same hash prefix, the system detects the collision and errors out.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/paths.ts (validateAndStoreOrigin function)
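The collision check can be sketched as follows. This is a hypothetical simplification; the real validateAndStoreOrigin may differ in signature, error messages, and edge-case handling:

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical sketch of the .origin collision check described above.
function validateAndStoreOriginSketch(instanceDir: string, originalPath: string): void {
  const originFile = join(instanceDir, ".origin");
  if (existsSync(originFile)) {
    const stored = readFileSync(originFile, "utf-8").trim();
    if (stored !== originalPath) {
      // Same 12-char hash prefix, different config dir: refuse to proceed.
      throw new Error(`Hash collision in ${instanceDir}: already owned by ${stored}`);
    }
  } else {
    writeFileSync(originFile, originalPath);
  }
}

// Demo in a throwaway directory.
const dir = mkdtempSync(join(tmpdir(), "ao-origin-"));
validateAndStoreOriginSketch(dir, "/foo/bar"); // first call stores .origin
validateAndStoreOriginSketch(dir, "/foo/bar"); // same path: fine
let collided = false;
try {
  validateAndStoreOriginSketch(dir, "/other/config");
} catch {
  collided = true;
}
```

With only 12 hex characters (48 bits) of hash, collisions are astronomically unlikely on one machine, but the .origin check makes the failure loud instead of silently mixing two projects' state.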

2.6 Type System

The central types.ts file is 1084 lines and defines the entire domain model. Key types:

Session (the core entity):

export interface Session {
  id: string;
  projectId: string;
  status: SessionStatus;
  activity: ActivityState;
  branch?: string;
  issueId?: string;
  pr?: { number: number; url: string; title?: string };
  workspacePath?: string;
  runtimeHandle?: RuntimeHandle;
  agentInfo?: AgentInfo;
  timestamps: SessionTimestamps;
  metadata: SessionMetadata;
}

SessionStatus (the state machine):

export const SESSION_STATUS = {
  SPAWNING: "spawning",
  WORKING: "working",
  PR_OPEN: "pr_open",
  CI_FAILED: "ci_failed",
  REVIEW_PENDING: "review_pending",
  CHANGES_REQUESTED: "changes_requested",
  APPROVED: "approved",
  MERGEABLE: "mergeable",
  MERGED: "merged",
  CLEANUP: "cleanup",
  NEEDS_INPUT: "needs_input",
  STUCK: "stuck",
  ERRORED: "errored",
  KILLED: "killed",
  DONE: "done",
  TERMINATED: "terminated",
} as const;

ActivityState (runtime observation):

export const ACTIVITY_STATE = {
  ACTIVE: "active",
  IDLE: "idle",
  WAITING_INPUT: "waiting_input",
  BLOCKED: "blocked",
  EXITED: "exited",
  UNKNOWN: "unknown",
} as const;

EventType (33 distinct event types triggering reactions):

export const EVENT_TYPE = {
  SESSION_SPAWNED: "session.spawned",
  SESSION_KILLED: "session.killed",
  AGENT_ACTIVE: "agent.active",
  AGENT_IDLE: "agent.idle",
  AGENT_STUCK: "agent.stuck",
  AGENT_NEEDS_INPUT: "agent.needs_input",
  AGENT_EXITED: "agent.exited",
  PR_OPENED: "pr.opened",
  PR_MERGED: "pr.merged",
  CI_PASSING: "ci.passing",
  CI_FAILING: "ci.failing",
  CI_PENDING: "ci.pending",
  REVIEW_APPROVED: "review.approved",
  REVIEW_CHANGES_REQUESTED: "review.changes_requested",
  // ... 19 more
} as const;

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/types.ts (lines 1-1084)


3. Harness Workflow

Confidence: High

3.1 End-to-End Flow

The typical AO workflow proceeds as follows:

  1. Configuration: User creates agent-orchestrator.yaml defining projects, plugins, and reactions.
  2. Start: ao start launches the dashboard and spawns an orchestrator meta-agent.
  3. Spawn: The orchestrator (or user) spawns worker sessions via ao spawn <project> <issue>.
  4. Work: Each worker agent runs in its own tmux session with an isolated git worktree.
  5. Monitor: The lifecycle manager polls every 30 seconds, tracking status transitions.
  6. React: When events occur (CI failure, review request, etc.), the reaction engine sends messages to agents or notifies humans.
  7. Review: PRs appear on the dashboard. Humans review and merge (or agents auto-merge if configured).
  8. Cleanup: After merge, sessions are cleaned up — worktrees removed, metadata archived.

3.2 Spawn Sequence (Detailed)

The sessionManager.spawn() method in session-manager.ts is the most complex operation. Here is the exact sequence:

1. Validate issue exists (tracker.getIssue)
2. Generate session prefix from issue title
3. Reserve session ID atomically (O_EXCL file creation)
4. Create workspace (git worktree with new branch)
5. Run post-create hooks (symlinks, commands)
6. Build agent prompt (3-layer composition)
7. Get agent launch command
8. Get agent environment variables
9. Create runtime (tmux session)
10. Send launch command to runtime
11. Write metadata file (session_id, project_id, issue_id, branch, etc.)
12. Run post-launch setup (e.g., write Claude hooks)

At each step, failure triggers cleanup of previously completed steps:

// session-manager.ts, spawn method (simplified)
try {
  const workspace = await workspacePlugin.create(...);
  try {
    const handle = await runtimePlugin.create(...);
    try {
      await runtimePlugin.sendMessage(handle, launchCommand);
      await writeMetadata(...);
      await agentPlugin.setupWorkspaceHooks?.(...);
    } catch (err) {
      await runtimePlugin.destroy(handle);
      throw err;
    }
  } catch (err) {
    await workspacePlugin.destroy(workspace);
    throw err;
  }
} catch (err) {
  // Clean up session ID reservation
  await deleteSessionDir(sessionDir);
  throw err;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/session-manager.ts (spawn method, approximately lines 80-250)

3.3 Batch Spawning

The ao batch-spawn command handles spawning multiple sessions:

// packages/cli/src/commands/spawn.ts (batch-spawn)
// 1. Check for duplicates against existing sessions
// 2. Check for duplicates within the batch
// 3. Spawn sequentially with 500ms delays
// 4. Report summary (success/failure counts)

The 500ms delay between spawns is a pragmatic rate-limiting measure to avoid overwhelming the system.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/spawn.ts
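The batch-spawn flow above can be sketched as a small loop. This is a hypothetical reconstruction of the logic, not AO's actual code; the function and parameter names are invented for illustration:

```typescript
// Hypothetical sketch of batch-spawn: dedupe against existing sessions and
// within the batch, then spawn sequentially with a fixed delay (500ms in AO).
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function batchSpawn(
  issueIds: string[],
  existingIssueIds: Set<string>,
  spawnOne: (issueId: string) => Promise<void>,
  delayMs = 500,
): Promise<{ spawned: string[]; skipped: string[] }> {
  const spawned: string[] = [];
  const skipped: string[] = [];
  const seenInBatch = new Set<string>();
  for (const id of issueIds) {
    // Skip duplicates against existing sessions and within the batch itself.
    if (existingIssueIds.has(id) || seenInBatch.has(id)) {
      skipped.push(id);
      continue;
    }
    seenInBatch.add(id);
    await spawnOne(id);
    spawned.push(id);
    await sleep(delayMs); // pragmatic rate limiting between spawns
  }
  return { spawned, skipped };
}
```

Sequential spawning with a delay is slower than Promise.all, but it avoids a thundering herd of simultaneous git fetches and tmux session creations.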


4. Subagent Orchestration

Confidence: High

4.1 The Orchestrator Meta-Agent

AO has a two-tier orchestration model:

  • Tier 1 — The Orchestrator: A special agent session (suffixed -orchestrator) that receives a comprehensive system prompt listing all AO CLI commands. It can spawn workers, check status, send messages, and manage the workflow.
  • Tier 2 — Worker Agents: Individual coding agents, each assigned to a single issue.

The orchestrator is spawned by ao start:

// packages/cli/src/commands/start.ts
const orchestratorPrompt = generateOrchestratorPrompt(config, project);
await sessionManager.spawnOrchestrator({
  projectId,
  prompt: orchestratorPrompt,
});

The orchestrator prompt (generated in orchestrator-prompt.ts) includes:

  1. Project information (repo, branch, tracker)
  2. Quick-start section showing how to spawn agents
  3. Complete command reference table
  4. Session management workflows
  5. Dashboard information
  6. Configured reaction rules
  7. Common workflow patterns (bulk issue processing, stuck agent handling, PR review flow)
  8. Tips for effective orchestration

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/orchestrator-prompt.ts (full file)

4.2 Orchestrator Communication

The orchestrator communicates with AO only through the CLI — it runs ao spawn, ao status, ao send, etc. as shell commands in its tmux session. There is no programmatic API between the orchestrator agent and the AO core.

This is both a strength and a limitation:

  • Strength: The orchestrator uses the same interface as a human. No special plumbing needed.
  • Limitation: Shell command parsing introduces latency and potential for error. The orchestrator must interpret CLI text output.

4.3 Worker Agent Communication

Worker agents receive their initial task via the system prompt and their first message (the issue content). Subsequent communication happens through runtime.sendMessage():

// runtime-tmux/src/index.ts, sendMessage method
async sendMessage(handle: RuntimeHandle, message: string): Promise<void> {
  // Clear any partial input first
  await sendKeys(handle.id, "C-u", false);

  if (message.length > 200) {
    // Use tmux named buffer for long messages
    await loadBuffer(handle.id, message);
    await pasteBuffer(handle.id);
    await sleep(300);
    await sendKeys(handle.id, "Enter", false);
  } else {
    await sendKeys(handle.id, message, true);
  }
}

The 200-character threshold and the named-buffer approach work around tmux's key-sending limitations: messages longer than roughly 1,000 characters can be corrupted when sent character by character, so AO loads long messages into a tmux buffer and pastes them instead.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/runtime-tmux/src/index.ts (lines 70-95)

4.4 No Direct Agent-to-Agent Communication

Worker agents cannot communicate with each other directly. All coordination goes through:

  1. The orchestrator (via ao send)
  2. Git (via shared repository)
  3. GitHub/Linear (via issue comments and PR reviews)

This is a deliberate design choice — it prevents complex agent interaction patterns but keeps the system simple and auditable.


5. Multi-Agent & Parallelization Strategy

Confidence: High

5.1 Parallelism Model

AO's workload is embarrassingly parallel — each agent works on an independent issue in an independent workspace. There is no:

  • Shared memory between agents
  • Lock coordination
  • Task dependency graphs
  • Work-stealing queues
  • Agent-to-agent communication channels

This simplicity is the system's greatest strength for its intended use case. Each agent produces an independent PR. Conflicts, if any, are handled at the git level (merge conflicts in the target branch).

5.2 Resource Constraints

The system imposes no resource limits at the orchestration level. Each tmux session runs an AI agent process that:

  • Consumes API tokens (Claude, OpenAI, etc.)
  • Uses CPU for local processing
  • Uses disk for workspace files
  • Uses network bandwidth for API calls and git operations

There is no mechanism to:

  • Limit the number of concurrent sessions
  • Throttle API call rates across agents
  • Set memory or CPU limits per agent
  • Define a total budget ceiling

The only rate-limiting is the 500ms delay between batch spawns.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/spawn.ts (batch-spawn, sequential spawning loop)

5.3 Lifecycle Polling Concurrency

The lifecycle manager polls all active sessions concurrently:

// lifecycle-manager.ts, pollAll method
const results = await Promise.allSettled(
  activeSessions.map(session => this.pollSession(session))
);

But it has a re-entrancy guard to prevent overlapping poll cycles:

if (this._polling) return;
this._polling = true;
try {
  // ... poll all sessions
} finally {
  this._polling = false;
}

This means if a poll cycle takes longer than 30 seconds (e.g., due to slow GitHub API calls), the next cycle is skipped rather than creating concurrent polls.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/lifecycle-manager.ts (pollAll method)

5.4 Dashboard Parallelism

The web API enriches session data in parallel with timeouts:

// packages/web/src/app/api/sessions/route.ts, lines 39-52
// Metadata enrichment: 3 second timeout
const metaTimeout = new Promise<void>((resolve) => setTimeout(resolve, 3_000));
await Promise.race([enrichSessionsMetadata(...), metaTimeout]);

// PR enrichment: 4 second timeout
const enrichPromises = workerSessions.map((core, i) => {
  if (!core.pr) return Promise.resolve();
  return enrichSessionPR(dashboardSessions[i], scm, core.pr);
});
const enrichTimeout = new Promise<void>((resolve) => setTimeout(resolve, 4_000));
await Promise.race([Promise.allSettled(enrichPromises), enrichTimeout]);

The dual timeout approach (3s for metadata, 4s for PR data) ensures the dashboard remains responsive even when external APIs are slow. If enrichment times out, the dashboard shows stale or incomplete data rather than hanging.


6. Isolation Model

Confidence: High

6.1 Workspace Isolation via Git Worktrees

Each agent session gets its own git worktree — a separate checkout of the same repository on a different branch:

// workspace-worktree/src/index.ts, create method (simplified)
async create(options: WorkspaceCreateOptions): Promise<string> {
  const worktreePath = path.join(worktreeBaseDir, sessionId);

  // Fetch latest from origin
  await execFile("git", ["fetch", "origin"], { cwd: repoPath });

  // Create worktree with new branch from origin/defaultBranch
  await execFile("git", [
    "worktree", "add",
    "-b", branchName,
    worktreePath,
    `origin/${defaultBranch}`,
  ], { cwd: repoPath });

  return worktreePath;
}

Git worktrees are lightweight (they share the .git object store) but provide complete filesystem isolation. Each agent has its own working directory, index, and branch.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/workspace-worktree/src/index.ts (lines 30-100)

6.2 Runtime Isolation via tmux

Each agent runs in a separate tmux session with its own:

  • PTY (pseudo-terminal)
  • Environment variables
  • Process tree
  • Working directory

// runtime-tmux/src/index.ts, create method
async create(options: RuntimeCreateOptions): Promise<RuntimeHandle> {
  await newSession({
    name: tmuxSessionName,
    startDir: options.workspacePath,
    env: options.environment,
    detached: true,
  });

  // Send the launch command
  await sendKeys(tmuxSessionName, launchCommand);

  return {
    id: tmuxSessionName,
    type: "tmux",
    data: { createdAt: Date.now() },
  };
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/runtime-tmux/src/index.ts (lines 20-55)

6.3 Isolation Boundaries and Gaps

What IS isolated:

  • Filesystem (separate worktrees, separate branches)
  • Process (separate tmux sessions)
  • Environment variables (set per session)
  • Git state (separate index, HEAD, working tree)

What is NOT isolated:

  • Network: All agents share the same network. One agent making excessive API calls affects others.
  • Credentials: All agents share the same gh CLI authentication, the same ~/.claude config, the same API keys.
  • CPU/Memory: No cgroups, no containers, no resource limits.
  • Git remote: All worktrees push to the same remote. Branch name collisions are possible (though mitigated by the naming convention).
  • Agent configuration directories: Claude Code stores per-project settings in ~/.claude/projects/. The toClaudeProjectPath function converts workspace paths to Claude's directory encoding, but multiple sessions for the same project could potentially interfere.
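The path encoding mentioned in the last point can be sketched as follows — a hypothetical simplification; the real toClaudeProjectPath may handle edge cases differently:

```typescript
// Hypothetical sketch of Claude Code's per-project directory encoding:
// the workspace path is flattened into a single directory name under
// ~/.claude/projects/ by replacing path separators (and dots) with dashes.
function toClaudeProjectPathSketch(workspacePath: string): string {
  // e.g. /home/user/worktrees/fix-login -> -home-user-worktrees-fix-login
  return workspacePath.replace(/[/.]/g, "-");
}
```

Because each worktree has a distinct path, each session normally gets a distinct encoded directory — the interference risk noted above arises from shared settings at the repository level, not from the encoding itself.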

6.4 Branch Naming Strategy

Branches are named by the tracker plugin:

// tracker-github/src/index.ts
branchName(issueId: string): string {
  return `feat/issue-${issueId}`;
}

// tracker-linear/src/index.ts
branchName(issueId: string): string {
  return `feat/${identifier}`;  // e.g., feat/ENG-123
}

This deterministic naming means two sessions for the same issue would conflict. The batch-spawn command includes deduplication logic to prevent this:

// spawn.ts, batch-spawn
// Check for existing sessions with the same issue
const existing = sessions.filter(s => s.issueId === issueId);
if (existing.length > 0) {
  console.warn(`Skipping issue ${issueId}: already has session ${existing[0].id}`);
  continue;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/spawn.ts

6.5 Post-Create Symlinks

The workspace plugin supports symlinking shared resources into worktrees:

// workspace-worktree/src/index.ts, postCreate method
for (const link of project.symlinks ?? []) {
  // Path traversal guard
  const resolved = path.resolve(worktreePath, link.target);
  if (!resolved.startsWith(worktreePath)) {
    throw new Error(`Symlink target escapes workspace: ${link.target}`);
  }
  await fs.symlink(link.source, resolved);
}

This allows sharing large dependencies (like node_modules or build caches) across worktrees without duplicating them. The path traversal guard prevents symlinks from escaping the workspace directory.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/workspace-worktree/src/index.ts (postCreate method)


7. Human-in-the-Loop Controls

Confidence: High

7.1 Dashboard as Primary Control Surface

The web dashboard provides a Kanban-style view of all sessions grouped by attention level:

// Dashboard.tsx, lines 24, 28-41
const KANBAN_LEVELS = ["working", "pending", "review", "respond", "merge"] as const;

const grouped = useMemo(() => {
  const zones: Record<AttentionLevel, DashboardSession[]> = {
    merge: [],     // Ready to merge
    respond: [],   // Agent needs human input
    review: [],    // PR needs human review
    pending: [],   // Waiting for CI/other
    working: [],   // Agent actively coding
    done: [],      // Completed
  };
  for (const session of sessions) {
    zones[getAttentionLevel(session)].push(session);
  }
  return zones;
}, [sessions]);

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/web/src/components/Dashboard.tsx (lines 24-41)

7.2 Available Human Actions

The dashboard exposes four actions:

  1. Send Message (handleSend): Send a text message to a running agent via POST /api/sessions/:id/send
  2. Kill Session (handleKill): Terminate an agent with confirmation dialog via POST /api/sessions/:id/kill
  3. Merge PR (handleMerge): Merge a pull request via POST /api/prs/:number/merge
  4. Restore Session (handleRestore): Restore a killed/exited session via POST /api/sessions/:id/restore

// Dashboard.tsx, lines 50-86
const handleSend = async (sessionId: string, message: string) => {
  const res = await fetch(`/api/sessions/${encodeURIComponent(sessionId)}/send`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message }),
  });
};

const handleKill = async (sessionId: string) => {
  if (!confirm(`Kill session ${sessionId}?`)) return;
  // ...
};

const handleMerge = async (prNumber: number) => {
  const res = await fetch(`/api/prs/${prNumber}/merge`, { method: "POST" });
};

const handleRestore = async (sessionId: string) => {
  if (!confirm(`Restore session ${sessionId}?`)) return;
  // ...
};

7.3 Attention Routing

The getAttentionLevel function (in @/lib/types) maps session state to human attention urgency. This drives both the dashboard layout and the dynamic favicon (showing counts of sessions needing attention).

The DynamicFavicon component updates the browser tab to show the project status at a glance, so a human can monitor multiple projects across browser tabs.
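A plausible shape for this mapping, assuming the status values from Section 2.6 — the real getAttentionLevel in @/lib/types may use different rules:

```typescript
// Hypothetical sketch of mapping session status to human attention urgency.
type AttentionLevel = "merge" | "respond" | "review" | "pending" | "working" | "done";

function getAttentionLevelSketch(status: string): AttentionLevel {
  switch (status) {
    case "mergeable":
    case "approved":
      return "merge";   // ready for a one-click merge
    case "needs_input":
    case "stuck":
      return "respond"; // agent is blocked on a human
    case "review_pending":
    case "changes_requested":
      return "review";  // PR needs human eyes
    case "pr_open":
    case "ci_failed":
      return "pending"; // waiting on CI or automation
    case "merged":
    case "done":
      return "done";
    default:
      return "working"; // agent actively coding
  }
}
```

Ordering the Kanban zones by urgency (merge first, working last) means the human's eye lands on the highest-leverage action first.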

7.4 Notification System

Humans are notified through multiple channels:

  • Desktop: OS-native notifications (macOS osascript, Linux notify-send)
  • Slack: Rich Block Kit messages to webhook URLs
  • Composio: (mentioned in config but plugin not explored in detail)
  • Webhook: Generic HTTP webhook

Notifications are routed by priority:

# agent-orchestrator.yaml.example
notificationRouting:
  critical: [slack, desktop]
  high: [slack, desktop]
  normal: [slack]
  low: [slack]

Source: /tmp/ai-harness-repos/agent-orchestrator/agent-orchestrator.yaml.example

7.5 Human Override Points

  1. Before spawn: Human (or orchestrator) decides which issues to assign
  2. During work: Human can send messages to guide the agent
  3. At PR creation: Human reviews the PR on GitHub
  4. At merge: Human (or auto-merge) decides when to merge
  5. On failure: Human can kill, restore, or send instructions
  6. Kill switch: ao stop terminates everything

7.6 Limitations of HITL Controls

  • No approval gates: There is no mechanism to require human approval before an agent takes a specific action (e.g., deploying, running tests, modifying security-sensitive files).
  • No content filtering: Agent outputs are not screened before being committed or pushed.
  • No rollback: If a PR is merged and breaks something, there is no automated rollback mechanism.
  • Message-only intervention: The only way to influence a running agent is to send it a text message. There is no way to modify its system prompt, change its tools, or restrict its actions mid-session.

8. Context Handling

Confidence: High

8.1 Three-Layer Prompt Composition

The prompt-builder.ts composes agent prompts from three layers:

Layer 1 — Base Agent Prompt (hardcoded in prompt-builder.ts):

You are working on a software engineering task...
- Follow the project's existing patterns and conventions
- Create focused, well-scoped commits
- Open a PR when your work is ready for review
- If CI fails, investigate and fix
- If review feedback is received, address it

Layer 2 — Config-Derived Context (from agent-orchestrator.yaml):

Project: {projectName}
Repository: {repo}
Default Branch: {defaultBranch}
Tracker: {tracker.plugin}
Issue: {issueTitle} ({issueUrl})
{issue.body}

Layer 3 — User Rules (from agentRules / agentRulesFile):

{agentRules string}
{contents of agentRulesFile}

The composition is done in buildPrompt():

// prompt-builder.ts (simplified)
export function buildPrompt(options: PromptOptions): string | null {
  const parts: string[] = [BASE_AGENT_PROMPT];

  if (options.projectName) {
    parts.push(`## Project Context\nProject: ${options.projectName}`);
  }
  if (options.issue) {
    parts.push(`## Task\n${options.issue.title}\n${options.issue.body}`);
  }
  if (options.agentRules) {
    parts.push(`## Project Rules\n${options.agentRules}`);
  }
  if (options.agentRulesFile) {
    const content = readFileSync(options.agentRulesFile, "utf-8");
    parts.push(`## Additional Rules\n${content}`);
  }

  return parts.join("\n\n");
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/prompt-builder.ts

8.2 Orchestrator Prompt (Meta-Agent)

The orchestrator receives a much richer prompt generated by orchestrator-prompt.ts. This prompt is essentially an operations manual:

// orchestrator-prompt.ts (key sections)
function generateOrchestratorPrompt(config, project): string {
  return `
# Agent Orchestrator — Control Prompt

You are the orchestrator for project "${project.name}".

## Quick Start
To spawn an agent for an issue: ao spawn ${project.name} <issue-id>

## Available Commands
| Command | Description |
| ao spawn | Spawn a worker session |
| ao status | Show all sessions |
| ao send | Send message to session |
| ao session kill | Kill a session |
| ao session restore | Restore a session |
| ao review-check | Check PR review status |

## Configured Reactions
${formatReactions(config.reactions)}

## Common Workflows
### Bulk Issue Processing
1. ao batch-spawn ${project.name} issue1 issue2 issue3
2. ao status (monitor progress)
3. Review PRs as they come in

### Handling Stuck Agents
1. Check status: ao status
2. Send guidance: ao send <session> "Try approach X"
3. If still stuck: ao session kill <session>; ao spawn ...
`;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/orchestrator-prompt.ts

8.3 Issue Context Injection

When spawning a session, the tracker plugin generates context from the issue:

// tracker-github/src/index.ts
async generatePrompt(issueId: string, repo: string): Promise<string> {
  const issue = await this.getIssue(issueId, repo);
  return [
    `# Issue #${issue.number}: ${issue.title}`,
    `URL: ${issue.url}`,
    `State: ${issue.state}`,
    issue.labels.length ? `Labels: ${issue.labels.join(", ")}` : "",
    "",
    issue.body,
  ].filter(Boolean).join("\n");
}

For Linear issues, the prompt includes more structured data:

// tracker-linear/src/index.ts
async generatePrompt(issueId: string): Promise<string> {
  const issue = await this.getIssue(issueId);
  return [
    `# ${issue.identifier}: ${issue.title}`,
    `URL: ${issue.url}`,
    `State: ${issue.state}`,
    `Priority: ${issue.priority}`,
    issue.labels.length ? `Labels: ${issue.labels.join(", ")}` : "",
    "",
    issue.body,
  ].filter(Boolean).join("\n");
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-github/src/index.ts (lines 90-110)
Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-linear/src/index.ts

8.4 Reaction Context (Messages to Agents)

When the reaction engine sends messages to agents, it composes context-aware messages:

// lifecycle-manager.ts (reaction execution, simplified)
if (reaction.action === "send-to-agent") {
  const message = reaction.message ?? getDefaultMessage(eventType);
  await sessionManager.send(session.id, message);
}

Default messages are event-specific, e.g.:

  • CI failed: "CI checks are failing. Please investigate the failures and fix them."
  • Changes requested: "Review feedback has been received. Please address the requested changes."
  • Merge conflicts: "There are merge conflicts. Please resolve them."
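A minimal sketch of the `getDefaultMessage` lookup referenced in the simplified reaction code. It assumes the reaction-config trigger strings ("ci.failing", "review.changes_requested", "pr.conflicts") are the keys; the fallback wording is invented here, not taken from the repo:

```typescript
// Illustrative sketch: map event triggers to the documented default messages.
// Only the three messages above come from observed behavior; the fallback is assumed.
const DEFAULT_MESSAGES: Record<string, string> = {
  "ci.failing": "CI checks are failing. Please investigate the failures and fix them.",
  "review.changes_requested": "Review feedback has been received. Please address the requested changes.",
  "pr.conflicts": "There are merge conflicts. Please resolve them.",
};

function getDefaultMessage(eventType: string): string {
  // Fall back to a generic nudge for triggers without a canned message
  return DEFAULT_MESSAGES[eventType] ?? `Event "${eventType}" occurred. Please take a look.`;
}
```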

8.5 Context Limitations

  • No conversation history: AO does not maintain or inject previous conversation context when sending messages to agents. Each message is stateless.
  • No cross-session context: If Agent A discovers something relevant to Agent B, there is no mechanism to share that context.
  • No dynamic context refresh: The agent's system prompt is set at spawn time and never updated. If the issue is updated on GitHub/Linear after spawning, the agent won't see the changes unless told explicitly.
  • No context window management: AO does not track or manage the agent's context window usage. Long-running agents may lose their initial instructions as conversation history grows.

9. Session Lifecycle

Confidence: High

9.1 State Machine

The lifecycle manager implements a state machine with the following transitions:

spawning -> working              (agent starts processing)
working -> pr_open               (agent creates PR)
working -> needs_input           (agent requests human input)
working -> stuck                 (agent appears stuck)
working -> errored               (runtime dies unexpectedly)

pr_open -> ci_failed             (CI checks fail)
pr_open -> review_pending        (CI passes, awaiting review)
pr_open -> working               (agent still working after PR creation)

ci_failed -> working             (agent fixing CI issues)
ci_failed -> pr_open             (CI re-run passes)

review_pending -> changes_requested  (reviewer requests changes)
review_pending -> approved           (reviewer approves)

changes_requested -> working     (agent addressing feedback)
approved -> mergeable            (CI passes + approved)
mergeable -> merged              (PR merged)
merged -> cleanup -> done        (workspace cleaned up)

needs_input -> working           (human sends message)
stuck -> working                 (agent resumes)

ANY -> killed                    (human kills session)
ANY -> terminated                (orchestrator terminates)
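The transition list above can be encoded as a lookup table. This sketch is illustrative only: AO derives status by polling (see determineStatus below) rather than validating transitions, and the `canTransition` helper does not exist in the repo:

```typescript
// Illustrative encoding of the state machine above; names are taken from the
// transition list, but the validation helper itself is an invention of this report.
type SessionStatus =
  | "spawning" | "working" | "pr_open" | "needs_input" | "stuck" | "errored"
  | "ci_failed" | "review_pending" | "changes_requested" | "approved"
  | "mergeable" | "merged" | "cleanup" | "done" | "killed" | "terminated";

const TRANSITIONS: Partial<Record<SessionStatus, SessionStatus[]>> = {
  spawning: ["working"],
  working: ["pr_open", "needs_input", "stuck", "errored"],
  pr_open: ["ci_failed", "review_pending", "working"],
  ci_failed: ["working", "pr_open"],
  review_pending: ["changes_requested", "approved"],
  changes_requested: ["working"],
  approved: ["mergeable"],
  mergeable: ["merged"],
  merged: ["cleanup"],
  cleanup: ["done"],
  needs_input: ["working"],
  stuck: ["working"],
};

function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  // "killed" and "terminated" are reachable from any state
  if (to === "killed" || to === "terminated") return true;
  return TRANSITIONS[from]?.includes(to) ?? false;
}
```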

9.2 Status Determination Algorithm

The determineStatus() function in lifecycle-manager.ts follows this priority order:

// lifecycle-manager.ts, determineStatus (simplified logic)
function determineStatus(session: Session): SessionStatus {
  // 1. Runtime dead?
  if (!session.runtimeHandle || !await runtime.isAlive(session.runtimeHandle)) {
    return session.pr?.merged ? "done" : "errored";
  }

  // 2. Agent activity
  const activity = await agent.getActivityState(session);
  if (activity === "waiting_input") return "needs_input";
  if (activity === "blocked") return "stuck";
  if (activity === "exited") return session.pr ? "pr_open" : "done";

  // 3. PR state
  if (session.pr) {
    const prState = await scm.getPRState(session.pr.number);
    if (prState.merged) return "merged";

    const ci = await scm.getCISummary(session.pr.number);
    if (ci === "failing") return "ci_failed";

    const review = await scm.getReviewDecision(session.pr.number);
    if (review === "changes_requested") return "changes_requested";
    if (review === "approved") {
      const mergeable = await scm.getMergeability(session.pr.number);
      if (mergeable.canMerge) return "mergeable";
      return "approved";
    }

    return "review_pending";
  }

  // 4. Default: no PR yet; treat the session as working either way
  return "working";
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/lifecycle-manager.ts (determineStatus, approximately lines 100-200)

9.3 Activity Detection (Claude Code Specific)

The Claude Code plugin provides two activity detection mechanisms:

Mechanism 1 — Terminal Output Parsing (deprecated):

// agent-claude-code/src/index.ts
classifyTerminalOutput(output: string): ActivityState {
  // Look for prompt characters: ❯ > $ #
  if (/[>$#]\s*$/.test(lastLine)) return "idle";
  if (/permission/i.test(lastLine)) return "waiting_input";
  return "active";
}

Mechanism 2 — JSONL Introspection (preferred):

// agent-claude-code/src/index.ts, getActivityState
async getActivityState(session): Promise<ActivityState> {
  // 1. Check if process is running
  const processRunning = await this.isProcessRunning(session);
  if (!processRunning) return "exited";

  // 2. Read last JSONL entry from Claude's session file
  const entry = await readLastJsonlEntry(sessionFile);

  switch (entry.type) {
    case "user":
    case "tool_use":
    case "progress":
      return "active";
    case "assistant":
    case "summary":
    case "result":
      // Check idle threshold
      if (Date.now() - entry.timestamp > readyThresholdMs) {
        return "idle";
      }
      return "active";  // "ready" maps to "active" with threshold
    case "permission_request":
      return "waiting_input";
    case "error":
      return "blocked";
  }
}

The JSONL approach reads Claude Code's internal session files (stored in ~/.claude/projects/), parsing only the last 128KB to avoid reading potentially 100MB+ files:

// agent-claude-code/src/index.ts (elided body completed as a sketch)
async parseJsonlFileTail(filePath: string): Promise<JsonlEntry[]> {
  const TAIL_BYTES = 128 * 1024;  // 128KB
  const stat = await fs.stat(filePath);
  const start = Math.max(0, stat.size - TAIL_BYTES);
  // Read from offset, split by newlines, parse each line as JSON
  const handle = await fs.open(filePath, "r");
  const buf = Buffer.alloc(stat.size - start);
  await handle.read(buf, 0, buf.length, start);
  await handle.close();
  const lines = buf.toString("utf-8").split("\n");
  if (start > 0) lines.shift();  // drop the line truncated by the offset
  return lines.filter(Boolean).map((line) => JSON.parse(line) as JsonlEntry);
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/agent-claude-code/src/index.ts (lines 300-500)

9.4 Session Cleanup

The cleanup process is multi-step:

// session-manager.ts, cleanup method
async cleanup(sessionId: string): Promise<void> {
  const session = await this.get(sessionId);

  // Check prerequisites
  if (session.pr) {
    const prState = await scm.getPRState(session.pr.number);
    if (!prState.merged) {
      throw new Error("Cannot cleanup: PR not yet merged");
    }
  }

  // 1. Destroy runtime (kill tmux session)
  if (session.runtimeHandle) {
    await runtime.destroy(session.runtimeHandle);
  }

  // 2. Destroy workspace (remove git worktree)
  if (session.workspacePath) {
    await workspace.destroy(session.workspacePath);
  }

  // 3. Archive metadata
  await archiveMetadata(session.id);
}

Notably, the workspace plugin does NOT delete the git branch when removing a worktree:

// workspace-worktree/src/index.ts, destroy method
// NOTE: Does NOT delete the branch (safety measure)
await execFile("git", ["worktree", "remove", "--force", worktreePath]);

This is a safety measure — branches are kept in case they need to be referenced later.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/workspace-worktree/src/index.ts (destroy method)

9.5 Session Restoration

Sessions can be restored from archive:

// session-manager.ts, restore method
async restore(sessionId: string): Promise<Session> {
  // 1. Find archived metadata
  const archived = await readArchivedMetadata(sessionId);

  // 2. Validate restorability
  if (!isRestorable(archived)) {
    throw new SessionNotRestorableError(sessionId, reason);
  }

  // 3. Recreate workspace if needed
  if (!await workspace.exists(archived.workspacePath)) {
    await workspace.restore(archived);
  }

  // 4. Try agent's restore command (e.g., claude --resume <uuid>)
  const restoreCmd = await agent.getRestoreCommand(archived);

  // 5. Create new runtime with restore command
  const handle = await runtime.create({
    launchCommand: restoreCmd ?? agent.getLaunchCommand(archived),
    workspacePath: archived.workspacePath,
  });

  // 6. Write new metadata
  await writeMetadata(sessionId, { ...archived, runtimeHandle: handle });

  return session;
}

The Claude Code agent supports restoration via session UUID:

// agent-claude-code/src/index.ts, getRestoreCommand
async getRestoreCommand(session): Promise<string | null> {
  // Find the Claude session UUID from JSONL files
  const sessionUuid = await findSessionUuid(session.workspacePath);
  if (!sessionUuid) return null;
  return `claude --resume ${sessionUuid}`;
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/agent-claude-code/src/index.ts (getRestoreCommand method)
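The `findSessionUuid` helper referenced above is not quoted in this report. A hypothetical sketch follows, assuming Claude Code's observed on-disk layout of `~/.claude/projects/<encoded-workspace-path>/<uuid>.jsonl`; both function names and the path-encoding rule are assumptions, not confirmed repo code:

```typescript
import { readdirSync, statSync } from "node:fs";
import path from "node:path";

// Hypothetical sketch: resolve the newest Claude Code session UUID for a workspace.
// The separator-to-dash encoding mirrors Claude Code's observed directory naming,
// but is an assumption of this report.
function encodeProjectPath(workspacePath: string): string {
  return workspacePath.replace(/[/.]/g, "-");
}

function findSessionUuid(workspacePath: string, claudeDir: string): string | null {
  const projectDir = path.join(claudeDir, "projects", encodeProjectPath(workspacePath));
  let newest: { uuid: string; mtime: number } | null = null;
  try {
    for (const file of readdirSync(projectDir)) {
      if (!file.endsWith(".jsonl")) continue;
      const mtime = statSync(path.join(projectDir, file)).mtimeMs;
      const uuid = file.slice(0, -".jsonl".length);
      // Most recently written session file wins
      if (!newest || mtime > newest.mtime) newest = { uuid, mtime };
    }
  } catch {
    return null; // project directory missing: nothing to restore
  }
  return newest?.uuid ?? null;
}
```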


10. Code Quality Gates

Confidence: Medium

10.1 CI Pipeline

The CI workflow runs on GitHub Actions:

# .github/workflows/ci.yml
jobs:
  lint:
    - pnpm lint
  typecheck:
    - pnpm --filter '!@composio/ao-web' build  # Build non-web first
    - pnpm --filter @composio/ao-web typecheck  # Then check web
  test:
    - pnpm --filter '!@composio/ao-web' test
  test-web:
    - sudo apt-get install tmux  # tmux needed for integration tests
    - pnpm --filter @composio/ao-web test

Source: /tmp/ai-harness-repos/agent-orchestrator/.github/workflows/ci.yml

10.2 TypeScript Strictness

The project uses TypeScript strict mode:

// tsconfig.json (root)
{
  "compilerOptions": {
    "strict": true,
    "module": "Node16",
    "moduleResolution": "Node16"
  }
}

The CLAUDE.md file codifies conventions:

  • .js extensions in all imports (ESM requirement)
  • node: prefix for Node.js builtins
  • type keyword for type-only imports
  • Zod for runtime validation of external data

Source: /tmp/ai-harness-repos/agent-orchestrator/CLAUDE.md

10.3 Zod Validation

Configuration is validated with Zod schemas:

// config.ts
const ProjectSchema = z.object({
  repo: z.string(),
  path: z.string().optional(),
  defaultBranch: z.string().default("main"),
  sessionPrefix: z.string().optional(),
  tracker: z.object({
    plugin: z.string(),
    // ...
  }).optional(),
  scm: z.object({
    plugin: z.string(),
  }).optional(),
  // ...
});

const ConfigSchema = z.object({
  dataDir: z.string().optional(),
  port: z.number().optional(),
  defaults: DefaultsSchema.optional(),
  projects: z.record(ProjectSchema),
  notifiers: z.record(z.any()).optional(),
  reactions: z.record(ReactionSchema).optional(),
});

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/config.ts

10.4 What's Missing

  • No linting rules visible: The pnpm lint command exists but the specific ESLint/Biome configuration was not explored.
  • No test coverage requirements: No coverage thresholds or coverage reporting observed.
  • No integration test suite: The packages/integration-tests directory exists but its contents were not fully explored.
  • No end-to-end tests: No Playwright, Cypress, or similar E2E testing framework observed.
  • No API contract testing: The web API endpoints have no schema validation on responses.

11. Security & Compliance

Confidence: High

11.1 Shell Injection Prevention

The CLAUDE.md file mandates:

"Shell commands: ALWAYS use execFile with explicit argument arrays, NEVER use exec with string interpolation. Always set timeouts for child processes. Never interpolate user input into shell commands."

This is consistently followed throughout the codebase. Every shell command uses execFile:

// tmux.ts
import { execFile } from "node:child_process";

export function listSessions(): Promise<string[]> {
  return new Promise((resolve, reject) => {
    execFile("tmux", ["list-sessions", "-F", "#{session_name}"],
      { timeout: 5000 },
      (err, stdout) => { /* ... */ }
    );
  });
}

exec is never used anywhere in the codebase. This is the single most important security measure — it eliminates an entire class of command injection vulnerabilities.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/tmux.ts
Source: /tmp/ai-harness-repos/agent-orchestrator/CLAUDE.md

11.2 Path Traversal Prevention

Multiple layers of defense:

  1. Session ID validation: validateSessionId uses regex /^[a-zA-Z0-9_-]+$/ to reject path traversal characters.
// metadata.ts
export function validateSessionId(id: string): void {
  if (!/^[a-zA-Z0-9_-]+$/.test(id)) {
    throw new Error(`Invalid session ID: ${id}`);
  }
}
  2. Symlink target validation: The workspace plugin validates that symlink targets don't escape the workspace:
// workspace-worktree/src/index.ts
const resolved = path.resolve(worktreePath, link.target);
if (!resolved.startsWith(worktreePath)) {
  throw new Error(`Symlink target escapes workspace: ${link.target}`);
}
  3. URL encoding in API routes: Session IDs are encoded/decoded when used in URL paths.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/metadata.ts
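The allow-list regex can be exercised directly. This standalone check mirrors validateSessionId as a predicate (the name `isValidSessionId` is invented here for illustration):

```typescript
// The same allow-list regex used by validateSessionId: only alphanumerics,
// underscores, and hyphens, so "/", ".", and ".." can never appear.
const SESSION_ID_RE = /^[a-zA-Z0-9_-]+$/;

function isValidSessionId(id: string): boolean {
  return SESSION_ID_RE.test(id);
}
```

A traversal payload such as "../../etc/passwd" fails the check because both "." and "/" fall outside the allowed character class.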

11.3 Secret Scanning

The security CI workflow runs three checks:

# .github/workflows/security.yml
jobs:
  gitleaks:
    - uses: gitleaks/gitleaks-action@v2
      with:
        args: "--full-history"  # Scan entire git history

  dependency-review:
    - uses: actions/dependency-review-action@v4
      with:
        fail-on-severity: moderate  # Block moderate+ vulns

  npm-audit:
    - run: pnpm audit --audit-level high --prod  # Strict on prod deps

Source: /tmp/ai-harness-repos/agent-orchestrator/.github/workflows/security.yml

11.4 Credential Handling

  • No credential storage: AO does not store any credentials itself. It relies on ambient credentials (gh auth, LINEAR_API_KEY, SLACK_WEBHOOK_URL).
  • Environment variable passing: Agent sessions receive environment variables via tmux -e flags, which means they appear in ps output briefly during session creation.
  • Historical incident: The SECURITY.md documents a past token leak (OpenClaw token) that was detected and mitigated.

Source: /tmp/ai-harness-repos/agent-orchestrator/SECURITY.md

11.5 Security Gaps

  • No agent sandboxing: Agents have full filesystem and network access. A compromised agent could read credentials, exfiltrate code, or modify other worktrees.
  • No output sanitization: Agent-generated code is committed directly. No static analysis, dependency scanning, or security review of generated changes.
  • No authentication on web dashboard: The Next.js dashboard runs on localhost with no authentication. Anyone with network access to the port can view sessions, send messages, kill agents, and merge PRs.
  • No HTTPS: Dashboard uses plain HTTP on localhost.
  • No rate limiting on API endpoints: The web API has no rate limiting or abuse prevention.

12. Hooks & Automation

Confidence: High

12.1 Reaction Engine

The reaction engine is the core automation mechanism. It maps events to actions:

# agent-orchestrator.yaml.example
reactions:
  ci-failed:
    trigger: ci.failing
    action: send-to-agent
    message: "CI checks are failing. Please investigate and fix."
    retries: 2
    escalation:
      action: notify
      after: "10m"

  changes-requested:
    trigger: review.changes_requested
    action: send-to-agent
    message: "Review feedback received. Please address the changes."

  approved-and-green:
    trigger: review.approved
    condition: ci.passing
    action: notify
    message: "PR is approved and CI is green. Ready to merge."

  agent-stuck:
    trigger: agent.stuck
    action: notify
    priority: high
    escalation:
      action: notify
      after: "15m"
      priority: critical

Source: /tmp/ai-harness-repos/agent-orchestrator/agent-orchestrator.yaml.example

12.2 Reaction Execution

// lifecycle-manager.ts, executeReaction (simplified)
async executeReaction(session: Session, eventType: EventType, reaction: ReactionConfig): Promise<void> {
  const key = `${session.id}:${reaction.name}`;
  const attempts = this.reactionAttempts.get(key) ?? 0;

  // Check escalation
  if (reaction.escalation) {
    const firstAttempt = this.reactionFirstAttempt.get(key);
    const duration = firstAttempt ? Date.now() - firstAttempt : 0;

    if (
      (reaction.retries && attempts >= reaction.retries) ||
      (reaction.escalation.after && duration > parseDuration(reaction.escalation.after))
    ) {
      // Execute escalation action instead
      return this.executeAction(session, reaction.escalation);
    }
  }

  // Execute primary action
  await this.executeAction(session, reaction);
  this.reactionAttempts.set(key, attempts + 1);
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/lifecycle-manager.ts (executeReaction, approximately lines 250-330)
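The escalation check above leans on a `parseDuration` helper for config strings like "10m" and "15m". The repo's implementation was not quoted; a minimal sketch with an assumed s/m/h/d unit grammar:

```typescript
// Minimal sketch of a duration parser for values like "90s", "10m", "2h".
// The actual unit grammar in the repo may differ.
const UNIT_MS: Record<string, number> = {
  s: 1_000,
  m: 60_000,
  h: 3_600_000,
  d: 86_400_000,
};

function parseDuration(value: string): number {
  const match = /^(\d+)([smhd])$/.exec(value.trim());
  if (!match) throw new Error(`Invalid duration: ${value}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}
```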

12.3 Post-Tool-Use Hook (Claude Code)

The Claude Code plugin installs a PostToolUse hook that monitors agent actions:

#!/bin/bash
# METADATA_UPDATER_SCRIPT (embedded in agent-claude-code/src/index.ts)
# Detects PR creation, branch switches, and PR merges

METADATA_FILE="$AO_METADATA_PATH"

case "$TOOL_NAME" in
  "Bash")
    # Detect: gh pr create
    if echo "$TOOL_OUTPUT" | grep -q "github.com.*pull/"; then
      PR_URL=$(echo "$TOOL_OUTPUT" | grep -o "https://github.com[^ ]*pull/[0-9]*")
      PR_NUM=$(echo "$PR_URL" | grep -o "[0-9]*$")
      echo "pr_number=$PR_NUM" >> "$METADATA_FILE"
      echo "pr_url=$PR_URL" >> "$METADATA_FILE"
    fi

    # Detect: git checkout -b / git switch -c
    if echo "$TOOL_INPUT" | grep -qE "git (checkout -b|switch -c)"; then
      BRANCH=$(echo "$TOOL_INPUT" | grep -oE "(checkout -b|switch -c) [^ ]+" | awk '{print $NF}')
      echo "branch=$BRANCH" >> "$METADATA_FILE"
    fi

    # Detect: gh pr merge
    if echo "$TOOL_INPUT" | grep -q "gh pr merge"; then
      echo "pr_merged=true" >> "$METADATA_FILE"
    fi
    ;;
esac

This hook runs inside Claude Code's process and updates the session metadata file in real-time, without waiting for the next lifecycle poll.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/agent-claude-code/src/index.ts (METADATA_UPDATER_SCRIPT, approximately lines 30-80)

12.4 Post-Create Workspace Hooks

After creating a workspace, the system can run arbitrary commands:

# agent-orchestrator.yaml.example
projects:
  my-project:
    postCreate:
      - "npm install"
      - "npm run build"

These are executed via execFile in the worktree directory after creation and symlinking.
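A sketch of how a postCreate runner consistent with the execFile-only rule might look. The whitespace split and the 10-minute timeout are assumptions of this report, not confirmed repo behavior (the repo may invoke a shell instead):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Illustrative sketch: run each configured postCreate command inside the new
// worktree, sequentially, with no shell interpolation.
async function runPostCreate(commands: string[], worktreePath: string): Promise<void> {
  for (const command of commands) {
    // Naive whitespace split into program + args (an assumption; quoting is not handled)
    const [program, ...args] = command.split(/\s+/);
    await execFileAsync(program, args, {
      cwd: worktreePath,
      timeout: 10 * 60 * 1000, // cap long installs at 10 minutes (assumed value)
    });
  }
}
```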

12.5 Default Reactions

The config loader applies sensible defaults if no reactions are configured:

// config.ts (applyDefaultReactions, simplified)
const DEFAULT_REACTIONS = {
  "ci-failed": { trigger: "ci.failing", action: "send-to-agent" },
  "changes-requested": { trigger: "review.changes_requested", action: "send-to-agent" },
  "bugbot-comments": { trigger: "review.automated_comments", action: "send-to-agent" },
  "merge-conflicts": { trigger: "pr.conflicts", action: "send-to-agent" },
  "approved-and-green": { trigger: "review.approved", condition: "ci.passing", action: "notify" },
  "agent-stuck": { trigger: "agent.stuck", action: "notify", priority: "high" },
  "agent-needs-input": { trigger: "agent.needs_input", action: "notify" },
  "agent-exited": { trigger: "agent.exited", action: "notify" },
  "all-complete": { trigger: "orchestrator.all_complete", action: "notify" },
};

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/config.ts

12.6 Auto-Merge Example

The auto-merge.yaml example shows aggressive automation:

# examples/auto-merge.yaml
reactions:
  auto-merge:
    trigger: review.approved
    condition: ci.passing
    action: auto-merge
    message: "Auto-merging approved PR with passing CI."

This allows PRs to be merged automatically when they have both approval and passing CI, with no human confirmation step.

Source: /tmp/ai-harness-repos/agent-orchestrator/examples/auto-merge.yaml


13. CLI & UX

Confidence: High

13.1 Command Structure

ao
├── init              # Interactive setup wizard
├── start             # Start orchestrator + dashboard
├── stop              # Stop orchestrator + dashboard
├── status            # Show session status table
├── spawn             # Spawn a single agent session
├── batch-spawn       # Spawn multiple sessions
├── send              # Send message to a session
├── review-check      # Check PR review status
├── dashboard         # Open dashboard in browser
├── open              # Open terminal for a session
└── session
    ├── ls            # List sessions
    ├── kill          # Kill a session
    ├── cleanup       # Clean up completed sessions
    └── restore       # Restore a killed session

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/index.ts

13.2 Init Wizard

The ao init command provides an interactive setup experience:

// packages/cli/src/commands/init.ts
// Detects environment:
// - git repo presence and remote URL
// - default branch
// - tmux availability
// - gh CLI and authentication
// - LINEAR_API_KEY presence
// - SLACK_WEBHOOK_URL presence
// - Project type (package.json, Cargo.toml, etc.)

It offers an --auto mode for non-interactive setup, plus a --smart flag whose AI-powered rule generation (inferred from the project structure) is still marked TODO.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/init.ts

13.3 Status Display

The ao status command renders a rich terminal table:

Session      Branch              PR    CI        Review    Threads  Activity  Age
fix-auth-1   feat/issue-42       #123  passing   approved  0        active    2h
add-api-2    feat/issue-43       #124  failing   pending   2        idle      1h
refactor-3   feat/issue-44       —     —         —         —        working   30m

Data is gathered in parallel for responsiveness:

// status.ts (simplified)
const sessions = await sessionManager.list();
const enriched = await Promise.all(
  sessions.map(async (s) => {
    const [prState, ci, review] = await Promise.all([
      scm?.getPRState(s.pr?.number),
      scm?.getCISummary(s.pr?.number),
      scm?.getReviewDecision(s.pr?.number),
    ]);
    return { ...s, prState, ci, review };
  })
);

The status command also has a fallback mode for when no config exists — it discovers tmux sessions directly:

// status.ts
// Fallback: discover tmux sessions matching ao- pattern
const tmuxSessions = await listTmuxSessions();
const aoSessions = tmuxSessions.filter(name => name.match(/^[a-f0-9]{12}-/));

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/status.ts

13.4 CLI Quality

Strengths:

  • Clean Commander.js structure with proper subcommands
  • Parallel data fetching for responsive output
  • Graceful degradation (fallback when config missing)
  • Confirmation prompts for destructive operations (kill, restore)
  • Summary reports for batch operations

Limitations:

  • No color/formatting library (raw console.log)
  • No progress indicators for long operations
  • No --json output flag for scripting
  • No shell completion support
  • No --dry-run for spawn/batch-spawn (only for cleanup)

14. Cost & Usage Visibility

Confidence: Medium

14.1 Cost Extraction

The Claude Code plugin extracts cost data from agent session JSONL files:

// agent-claude-code/src/index.ts, extractCost
async extractCost(session): Promise<CostInfo | null> {
  const entries = await this.parseJsonlFileTail(sessionFile);

  let totalCostUsd = 0;
  let inputTokens = 0;
  let outputTokens = 0;

  for (const entry of entries) {
    if (entry.costUSD) {
      totalCostUsd += entry.costUSD;
    }
    if (entry.usage) {
      inputTokens += entry.usage.input_tokens ?? 0;
      outputTokens += entry.usage.output_tokens ?? 0;
    }
  }

  // Rough estimate if no costUSD field
  if (totalCostUsd === 0 && (inputTokens > 0 || outputTokens > 0)) {
    // Sonnet 4.5 pricing as default
    totalCostUsd = (inputTokens * 3 + outputTokens * 15) / 1_000_000;
  }

  return { totalCostUsd, inputTokens, outputTokens };
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/agent-claude-code/src/index.ts (extractCost method)
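A worked example of the fallback pricing: at $3 per million input tokens and $15 per million output tokens, a session with 200k input and 20k output tokens is estimated at (200,000 × 3 + 20,000 × 15) / 1,000,000 = $0.90:

```typescript
// The fallback estimate from extractCost, isolated (Sonnet-class pricing:
// $3/M input tokens, $15/M output tokens).
function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens * 3 + outputTokens * 15) / 1_000_000;
}
```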

14.2 What's Tracked

  • Token usage: Input and output token counts from JSONL entries
  • Cost estimates: Either from explicit costUSD fields or rough estimates using Sonnet 4.5 pricing
  • Session duration: Computed from handle.data.createdAt in the runtime plugin

14.3 What's NOT Tracked

  • No aggregated cost view: No total cost across all sessions, projects, or time periods
  • No budget limits: No mechanism to set a maximum spend per session, project, or globally
  • No cost alerts: No notification when spend exceeds a threshold
  • No automatic shutoff: No kill-switch when costs escalate
  • No per-agent model pricing: The rough estimate uses Sonnet 4.5 pricing regardless of which model the agent actually uses
  • No API call counting: GitHub API calls (which have rate limits) are not tracked
  • No cost display in CLI: The ao status command does not show cost information
  • No cost display on dashboard: The dashboard does not show per-session or aggregate costs

14.4 Rate Limit Awareness

The dashboard does detect GitHub API rate limiting:

// Dashboard.tsx, lines 90-93
const anyRateLimited = useMemo(
  () => sessions.some((s) => s.pr && isPRRateLimited(s.pr)),
  [sessions],
);

When rate-limited, a warning banner is shown explaining that PR data may be stale. This is a good UX touch but is reactive rather than preventive.


15. Tooling & Dependencies

Confidence: High

15.1 Runtime Dependencies

| Dependency | Purpose | Version Constraint |
| --- | --- | --- |
| Node.js | Runtime | >= 20 |
| pnpm | Package manager | 9.15.4 (exact) |
| tmux | Terminal multiplexer | Required |
| git | Version control | >= 2.25 (worktree support) |
| gh | GitHub CLI | Required for GitHub integration |
| TypeScript | Language | Strict mode, ESM |
| Next.js | Web dashboard | App Router |
| Commander.js | CLI framework | |
| Zod | Schema validation | |

15.2 External Tool Requirements

The system has hard dependencies on external CLI tools:

  1. tmux: Required for process isolation. No fallback. Version checked at runtime via isTmuxAvailable().
  2. git: Required for workspace management. Must support worktrees (Git 2.25+).
  3. gh: Required for GitHub integration (SCM + tracker). Must be authenticated (gh auth status).
  4. Claude Code CLI: Required for the primary agent plugin. Must be installed and configured.

15.3 Optional Dependencies

  • LINEAR_API_KEY: Required only if using the Linear tracker plugin
  • SLACK_WEBHOOK_URL: Required only if using the Slack notifier plugin
  • Composio SDK: Alternative transport for Linear integration
  • iTerm2: Optional terminal integration for macOS

15.4 Build System

// package.json (root)
{
  "scripts": {
    "build": "turbo build",
    "dev": "turbo dev",
    "lint": "turbo lint",
    "test": "turbo test",
    "typecheck": "turbo typecheck",
    "release": "changeset publish"
  }
}

The project uses Turborepo for monorepo build orchestration and Changesets for release management.

15.5 Platform Support

  • macOS: Primary development platform. Desktop notifications use osascript.
  • Linux: Supported. Desktop notifications use notify-send.
  • Windows: Not explicitly supported. tmux is not available natively on Windows (would require WSL).

16. External Integrations

Confidence: High

16.1 GitHub Integration (Deep)

The GitHub integration is the most developed external integration, implemented across two plugins:

scm-github (581 lines):

  • PR detection: gh pr list --head <branch>
  • PR state: gh pr view --json state,title,number,url,additions,deletions,files
  • PR merge: gh pr merge --squash --delete-branch
  • CI checks: gh pr checks --json name,state,conclusion
  • CI summary: Fail-closed logic for open PRs
  • Reviews: gh pr view --json reviews,reviewDecision
  • Pending comments: GraphQL query for review thread resolution status
  • Automated comments: REST API filtering by BOT_AUTHORS
  • Mergeability: Composite check (state + mergeable + CI + reviews + conflicts + draft)

The fail-closed CI summary is notable:

// scm-github/src/index.ts, getCISummary
async getCISummary(prNumber: number, repo: string): Promise<CIStatus> {
  try {
    const checks = await this.getCIChecks(prNumber, repo);
    // ... analyze checks
  } catch (err) {
    // For open PRs, fail closed — report "failing" on error
    // This prevents auto-merge when we can't verify CI status
    const prState = await this.getPRState(prNumber, repo);
    if (prState.state === "open") {
      return CI_STATUS.FAILING;
    }
    return CI_STATUS.NONE;
  }
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/scm-github/src/index.ts

tracker-github (304 lines):

  • Issue CRUD: get, create, update, close/reopen
  • Issue listing with filters (state, label, assignee)
  • Branch name generation: feat/issue-{number}
  • Prompt generation from issue content

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-github/src/index.ts

16.2 Linear Integration

The Linear plugin (722 lines) is the second most developed integration:

Dual transport:

// tracker-linear/src/index.ts
// Method 1: Direct API
if (process.env.LINEAR_API_KEY) {
  this.transport = "direct";
  this.apiKey = process.env.LINEAR_API_KEY;
}
// Method 2: Composio SDK
else if (composioAvailable) {
  this.transport = "composio";
}

State mapping:

const STATE_MAP: Record<string, IssueState> = {
  triage: "open",
  backlog: "open",
  unstarted: "open",
  started: "in_progress",
  completed: "closed",
  canceled: "cancelled",
};

Full GraphQL API: Issues, labels, teams, workflow states, assignees, comments.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-linear/src/index.ts

16.3 Slack Integration

Rich Block Kit messages with structured formatting:

// notifier-slack/src/index.ts
async notify(event: NotificationEvent): Promise<void> {
  const blocks = [
    {
      type: "header",
      text: { type: "plain_text", text: event.title },
    },
    {
      type: "section",
      text: { type: "mrkdwn", text: event.body },
    },
    {
      type: "context",
      elements: [
        { type: "mrkdwn", text: `*Priority:* ${priorityEmoji(event.priority)} ${event.priority}` },
        { type: "mrkdwn", text: `*Session:* ${event.sessionId}` },
      ],
    },
  ];

  if (event.prUrl) {
    blocks.push({
      type: "actions",
      elements: [{
        type: "button",
        text: { type: "plain_text", text: "View PR" },
        url: event.prUrl,
      }],
    });
  }

  await fetch(this.webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ blocks }),
  });
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/notifier-slack/src/index.ts

16.4 Desktop Notifications

Platform-specific implementations:

// notifier-desktop/src/index.ts
if (process.platform === "darwin") {
  // macOS: osascript
  await execFile("osascript", [
    "-e", `display notification "${body}" with title "${title}"${sound ? " sound name \"Ping\"" : ""}`,
  ]);
} else {
  // Linux: notify-send
  await execFile("notify-send", [
    ...(urgency === "critical" ? ["--urgency=critical"] : []),
    title,
    body,
  ]);
}

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/notifier-desktop/src/index.ts


17. Operational Assumptions & Prerequisites

Confidence: High

17.1 Hard Requirements

  1. Single machine: The entire system runs on one machine. No distributed execution.
  2. Unix-like OS: macOS or Linux required (tmux, POSIX shell commands).
  3. tmux installed: No alternative runtime is production-ready.
  4. Git 2.25+: Worktree support required.
  5. Node.js 20+: ESM module support and modern APIs.
  6. Agent CLI installed: At minimum, Claude Code CLI must be available.
  7. GitHub authentication: gh auth login must be completed for GitHub features.

17.2 Soft Requirements

  1. Stable network: Agents need internet access for API calls; dashboard needs GitHub API access for enrichment.
  2. Sufficient disk space: Each worktree is a full checkout. Many concurrent sessions require proportional disk.
  3. API rate limits: GitHub API has 5000 requests/hour for authenticated users. With many sessions and 30s polling, this budget can be consumed quickly.
  4. Agent API keys: Claude API key, OpenAI API key, etc. must be configured in the agent's own config.
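To make the rate-limit pressure concrete, here is a back-of-envelope calculation. The two-calls-per-poll figure is an assumption for illustration, not a number measured from the code:

```typescript
// Hypothetical budget check: GitHub allows 5000 authenticated requests/hour.
const rateLimitPerHour = 5000;
const sessions = 50;            // concurrent sessions (assumed upper bound)
const callsPerPoll = 2;         // e.g. PR status + CI checks per session (assumed)
const pollsPerHour = 3600 / 30; // 30-second lifecycle polling

const requestsPerHour = sessions * callsPerPoll * pollsPerHour;
console.log(requestsPerHour);   // 12000 — more than double the 5000/hr budget
```

At 50 sessions the polling loop alone would exhaust the hourly budget in under half an hour, which is consistent with the scaling limits described in the next subsection.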

17.3 Scaling Assumptions

The system was designed for 10-50 concurrent sessions on a single developer machine. Evidence:

  • Session list is loaded entirely into memory (no pagination)
  • Dashboard renders all sessions in a single view
  • Lifecycle polling is a single loop with no sharding
  • Metadata is scanned by filesystem directory listing
  • No connection pooling for GitHub API calls

Beyond 50 sessions, expect:

  • GitHub API rate limiting (5000 req/hr shared across all sessions)
  • Lifecycle poll cycles exceeding 30 seconds
  • Dashboard becoming sluggish with many cards
  • tmux session management overhead

17.4 Configuration Assumptions

The system assumes a specific project structure:

  • Git repository with remote named origin
  • A single default branch (main/master)
  • Issues tracked in GitHub Issues or Linear
  • PRs created on GitHub (no GitLab, Bitbucket, etc.)
  • Squash merge strategy (hardcoded: gh pr merge --squash --delete-branch)

18. Failure Modes & Recovery

Confidence: High

18.1 Spawn Failures

The spawn sequence has cascading cleanup:

Step 1 fails (issue validation):     -> No cleanup needed
Step 2 fails (session ID):           -> No cleanup needed
Step 3 fails (workspace creation):   -> Delete session directory
Step 4 fails (post-create hooks):    -> Destroy workspace + delete session directory
Step 5 fails (runtime creation):     -> Destroy workspace + delete session directory
Step 6 fails (launch command):       -> Destroy runtime + workspace + delete session directory
Step 7 fails (metadata write):       -> Destroy runtime + workspace + delete session directory
Step 8 fails (post-launch setup):    -> Destroy runtime + workspace + delete session directory

Each step's failure handler cleans up all previously completed steps. This is implemented with nested try/catch blocks.

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/session-manager.ts (spawn method)
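The nested try/catch structure can be sketched as follows. This is an illustration of the cascading-cleanup pattern, not the actual session-manager code; the dependency names are hypothetical:

```typescript
// Sketch of cascading cleanup: each later step's catch block unwinds
// everything that succeeded before it, then rethrows the original error.
async function spawnSketch(deps: {
  createWorkspace(): Promise<void>;
  destroyWorkspace(): Promise<void>;
  createRuntime(): Promise<void>;
  destroyRuntime(): Promise<void>;
  writeMetadata(): Promise<void>;
  deleteSessionDir(): Promise<void>;
}): Promise<void> {
  await deps.createWorkspace();
  try {
    await deps.createRuntime();
    try {
      await deps.writeMetadata();
    } catch (err) {
      await deps.destroyRuntime(); // undo step 5 before the outer unwind
      throw err;
    }
  } catch (err) {
    await deps.destroyWorkspace(); // undo step 3
    await deps.deleteSessionDir(); // undo step 2's reservation
    throw err;
  }
}
```

The key property is ordering: cleanup runs in reverse order of creation, so a failure at any depth leaves no orphaned worktrees or tmux sessions behind.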

18.2 Runtime Crashes

If a tmux session dies unexpectedly:

  1. Detection: Next lifecycle poll checks runtime.isAlive() -> returns false
  2. Status update: Session status set to "errored" (if no PR) or "pr_open" (if PR exists)
  3. Notification: "agent-exited" reaction fires, notifying the human
  4. Recovery: Human can ao session restore to restart from archive

The 30-second polling interval means up to 30 seconds can pass before a crash is detected.

18.3 Agent Stuck Detection

The activity detection system identifies stuck agents:

// Stuck: agent process running but no JSONL activity for extended period
if (activity === "idle" && idleDuration > stuckThresholdMs) {
  return "stuck";
}

The default stuck threshold is not explicitly documented, but the reaction config accepts an after duration for escalation:

reactions:
  agent-stuck:
    trigger: agent.stuck
    action: notify
    escalation:
      action: notify
      after: "15m"
      priority: critical

18.4 GitHub API Failures

Rate Limiting:

  • PR enrichment has a 4-second timeout
  • If enrichment fails, dashboard shows stale data with a rate-limit warning banner
  • CI summary uses fail-closed: errors -> "failing" status (prevents false merges)

Network Failures:

  • SCM calls use execFile with timeouts
  • Transient failures in lifecycle polling are caught by Promise.allSettled
  • Failed polls are skipped; the next cycle retries

18.5 Metadata Corruption

The flat-file metadata format is simple but fragile:

  • Append-only: Multiple values for the same key are resolved by taking the last one
  • No atomicity: If the process crashes mid-write, the file could be truncated
  • No locking: Multiple writers (agent hook + lifecycle manager) could race

The PostToolUse hook (bash script) appends to the metadata file:

echo "pr_number=$PR_NUM" >> "$METADATA_FILE"

While the lifecycle manager reads the file:

const metadata = parseMetadataFile(metadataPath);

There is no file locking between these operations. In practice, this is unlikely to cause issues because:

  1. The hook and lifecycle manager write different keys
  2. The file is small (< 1KB typically)
  3. POSIX append semantics are usually atomic for small writes

But it is a theoretical correctness gap.
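The last-one-wins resolution described above can be sketched in a few lines. The function name and exact signature are illustrative; AO's parseMetadataFile may differ in detail:

```typescript
// Sketch of last-wins parsing for the append-only key=value metadata format.
// A later appended "pr_number=42" supersedes any earlier value for the same
// key without the file ever being rewritten in place.
function parseMetadata(text: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const line of text.split("\n")) {
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // skip blank and malformed lines
    result[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return result;
}
```

This is why the append-from-hook pattern mostly works: appends never invalidate earlier lines, they only shadow them.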

18.6 Worktree Cleanup Failures

If worktree removal fails (e.g., locked files), the plugin falls back to rmSync:

// workspace-worktree/src/index.ts, destroy
try {
  await execFile("git", ["worktree", "remove", "--force", worktreePath]);
} catch {
  // Fallback: force-remove the directory
  fs.rmSync(worktreePath, { recursive: true, force: true });
}

After force removal, stale worktree entries remain in git's worktree list. The restore function handles this:

// workspace-worktree/src/index.ts, restore
await execFile("git", ["worktree", "prune"]);  // Clean stale entries

Source: /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/workspace-worktree/src/index.ts

18.7 Service Initialization Failures

The web service singleton has retry logic:

// services.ts
globalForServices._aoServicesInit = initServices().catch((err) => {
  globalForServices._aoServicesInit = undefined;  // Clear for retry
  throw err;
});

If initialization fails (e.g., config file missing), the cached promise is cleared so the next request triggers a fresh attempt rather than permanently returning the cached error.


19. Governance & Guardrails

Confidence: Medium

19.1 What Guardrails Exist

  1. Confirmation dialogs: Kill and restore operations require user confirmation in the dashboard.
  2. Fail-closed CI: Unknown CI status is treated as "failing" for open PRs.
  3. Session ID validation: Regex-based validation prevents path traversal.
  4. Symlink target validation: Prevents workspace escape.
  5. Shell injection prevention: execFile everywhere, never exec.
  6. Issue validation: Sessions cannot be spawned for non-existent issues.
  7. Duplicate detection: Batch spawn checks for existing sessions with the same issue.

19.2 What Guardrails Are Missing

  1. No approval gates before merge: Auto-merge has no additional safety check beyond CI + review.
  2. No diff size limits: Agents can create arbitrarily large PRs.
  3. No file restriction: Agents can modify any file in the repository, including CI configs, security policies, and deployment scripts.
  4. No branch protection enforcement: AO doesn't verify that branch protection rules are configured on the target branch.
  5. No code review requirements: Auto-merge can bypass the "requires review" setting if the GitHub config allows it.
  6. No cost limits: No budget ceiling per session or project.
  7. No concurrency limits: No maximum number of concurrent sessions.
  8. No time limits: Sessions can run indefinitely.
  9. No output validation: Generated code is not scanned for vulnerabilities or malicious content.

19.3 Permission Model

There is no permission model. The system runs with the credentials of the user who started it:

  • Git operations use the user's SSH keys or HTTPS tokens
  • GitHub API uses the user's gh auth session
  • Claude Code uses the user's API key
  • tmux runs as the current user

Any agent can perform any action the user can perform.

19.4 Audit Trail

The audit trail consists of:

  • Metadata files: Show session creation time, issue, branch, PR number
  • Archived metadata: Preserved after session cleanup
  • Git history: All agent commits are in the git log
  • Claude Code JSONL: Complete agent conversation history
  • tmux capture: Terminal output can be captured (but not automatically persisted)

There is no centralized audit log, no event store, and no structured logging of orchestration decisions.


20. Roadmap & Evolution Signals

Confidence: Medium

20.1 TODO Items in Code

Several TODO markers indicate planned features:

  1. AI-powered init (init.ts):
     // --auto --smart mode
     // TODO: AI-powered rule generation based on project structure
  2. Custom plugin loading (plugin-registry.ts):
     // loadFromConfig() — delegates to loadBuiltins,
     // reserved for future custom plugin loading
  3. Process runtime (plugin-registry.ts): Listed as a built-in plugin but implementation not observed. This would allow running agents without tmux.
  4. Clone workspace (plugin-registry.ts): Listed as a built-in but likely less developed than the worktree plugin. Would provide full repository clones instead of worktrees.

20.2 Architecture Signals

  1. Plugin slots for Terminal (iterm2, web): Suggests plans for richer terminal integration beyond basic tmux.
  2. Lifecycle plugin slot: Suggests plans for customizable state machines, possibly for different workflow patterns.
  3. Composio notifier: Integration with Composio's platform suggests a path toward SaaS deployment.
  4. Webhook notifier: Generic webhook support enables integration with any service.
  5. Multiple tracker support: GitHub + Linear suggests plans for Jira, Asana, etc.

20.3 Maturity Assessment

| Component | Maturity | Evidence |
|---|---|---|
| Core types | High | 1084 lines, comprehensive, well-structured |
| Session manager | High | ~1100 lines, thorough error handling |
| Lifecycle manager | High | 587 lines, reaction engine, escalation |
| Config system | High | Zod validation, defaults, collision detection |
| Claude Code plugin | High | 786 lines, deep integration |
| GitHub SCM | High | 581 lines, fail-closed CI, GraphQL |
| Linear tracker | High | 722 lines, dual transport |
| tmux runtime | Medium | 184 lines, functional but basic |
| Web dashboard | Medium | Functional UI, basic SSE |
| CLI | Medium | Feature-complete but sparse UX |
| Other agent plugins | Low | Likely thin or placeholder |
| Process runtime | Low | Listed but not observed |
| Clone workspace | Low | Listed but not fully developed |
| Composio notifier | Unknown | Mentioned but not explored |

20.4 Version and Release Status

The project uses Changesets for release management, indicating it follows semver and publishes to npm under the @composio/ao-* namespace. The presence of .github/workflows/ci.yml and security.yml suggests active CI/CD.


21. What to Borrow / Adapt into Maestro

Confidence: High

21.1 Strongly Recommended to Borrow

21.1.1 Plugin Architecture Pattern

The eight-slot plugin system is clean and extensible. The PluginManifest + PluginModule pattern with type-safe registry is worth adopting:

interface PluginModule<T> {
  manifest: { name: string; slot: string; version: string };
  create: (ctx?: PluginContext) => T | Promise<T>;
}

Why: It enables swapping implementations without touching core logic. Adding a new agent, runtime, or tracker is a self-contained operation.

Adaptation for Maestro: Consider adding a capabilities field to the manifest for feature-flag-based plugin selection, and a healthCheck() method for runtime validation.
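One way the suggested capabilities and healthCheck extensions could look. All names below are illustrative, not taken from the AO codebase:

```typescript
// Illustrative extension of AO's PluginModule pattern with the suggested
// `capabilities` field and `healthCheck` hook. Names are hypothetical.
interface ExtendedManifest {
  name: string;
  slot: string;
  version: string;
  capabilities?: string[]; // feature flags used for plugin selection
}

interface ExtendedPlugin<T = unknown> {
  manifest: ExtendedManifest;
  create: () => T | Promise<T>;
  healthCheck?: () => Promise<boolean>; // runtime validation before use
}

class PluginRegistry {
  private plugins = new Map<string, ExtendedPlugin>();

  register(plugin: ExtendedPlugin): void {
    this.plugins.set(`${plugin.manifest.slot}/${plugin.manifest.name}`, plugin);
  }

  // Return the first plugin in a slot advertising the given capability.
  findByCapability(slot: string, capability: string): ExtendedPlugin | undefined {
    for (const [key, p] of this.plugins) {
      if (key.startsWith(`${slot}/`) && p.manifest.capabilities?.includes(capability)) {
        return p;
      }
    }
    return undefined;
  }
}
```

Capability-based lookup lets the orchestrator ask for "a workspace plugin that supports shared object stores" rather than hardcoding plugin names in core logic.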

21.1.2 Fail-Closed CI Status

The pattern of reporting "failing" when CI status is unknown for open PRs is a critical safety measure:

// On API error for open PRs: return "failing" not "none"

Why: Prevents auto-merge of PRs when we can't verify CI status. This is a security-relevant design decision.

Adaptation for Maestro: Apply this pattern to all safety-critical status checks. When in doubt, assume the worst case.
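The fail-closed decision reduces to a single function. This is a sketch of the pattern, not AO's actual implementation:

```typescript
type CiStatus = "passing" | "failing" | "pending" | "none";

// Fail-closed: for an open PR, an unreadable CI state is reported as
// "failing" so downstream automation (auto-merge, reactions) never acts
// on an unknown status. Closed/merged PRs can safely report "none".
function resolveCiStatus(
  prState: "open" | "merged" | "closed",
  fetched: CiStatus | Error,
): CiStatus {
  if (fetched instanceof Error) {
    return prState === "open" ? "failing" : "none";
  }
  return fetched;
}
```

The asymmetry is deliberate: a false "failing" costs a human a glance at the dashboard, while a false "passing" could merge unverified code.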

21.1.3 Reaction Engine with Escalation

The reaction engine pattern (event -> action, with retries and escalation) is composable and user-configurable:

reactions:
  ci-failed:
    trigger: ci.failing
    action: send-to-agent
    retries: 2
    escalation:
      action: notify
      after: "10m"
      priority: critical

Why: It separates orchestration policy from orchestration mechanism. Users can customize behavior without modifying code.

Adaptation for Maestro: Add more complex conditions (boolean logic, state predicates), support for custom action types, and a reaction history log.
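A minimal matching step for such rules might look like the following. The field names mirror the YAML above, but the evaluation logic is an assumption about how such an engine could work, not a transcription of AO's code:

```typescript
interface Reaction {
  trigger: string;    // event name, e.g. "ci.failing"
  condition?: string; // optional state predicate, e.g. "ci.passing"
  action: "notify" | "send-to-agent" | "auto-merge";
  retries?: number;
}

// Match an incoming event against configured reactions, honoring the
// optional condition against the session's current state set.
function matchReactions(
  reactions: Record<string, Reaction>,
  event: string,
  state: Set<string>,
): Array<[string, Reaction]> {
  return Object.entries(reactions).filter(
    ([, r]) => r.trigger === event && (!r.condition || state.has(r.condition)),
  );
}
```

Extending condition from a single string to a boolean expression tree is the natural next step suggested above.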

21.1.4 Atomic Session ID Reservation

Using O_EXCL flag for race-condition-safe session creation:

// O_CREAT | O_EXCL (from fs.constants): open fails if the path already exists
await fs.open(sessionDir, constants.O_CREAT | constants.O_EXCL);

Why: Prevents two concurrent spawn operations from creating sessions with the same ID.

Adaptation for Maestro: Use this pattern for any resource reservation that must be atomic.

21.1.5 Hash-Based Directory Namespacing

SHA-256 hash of config path for globally unique directories:

createHash("sha256").update(configDir).digest("hex").slice(0, 12);

Why: Prevents collisions between multiple projects/configurations on the same machine. Simple but effective.

21.1.6 Shell Security (execFile Always)

The discipline of never using exec with string interpolation is worth codifying:

// ALWAYS this:
execFile("git", ["checkout", "-b", branchName]);
// NEVER this:
exec(`git checkout -b ${branchName}`);

Adaptation for Maestro: Make this a lint rule. Block exec and execSync in ESLint config.
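A flat-config rule enforcing this could look like the sketch below. The AST selector assumes exec/execSync are called by bare name; import styles like cp.exec would need an additional selector:

```typescript
// eslint.config.ts — sketch of a rule banning string-based shell execution.
// Uses the built-in no-restricted-syntax rule with an esquery selector.
export default [
  {
    rules: {
      "no-restricted-syntax": [
        "error",
        {
          selector: "CallExpression[callee.name=/^(exec|execSync)$/]",
          message:
            "Use execFile with an argument array; never interpolate into a shell string.",
        },
      ],
    },
  },
];
```

Pairing the lint rule with a CI check makes the execFile discipline enforceable rather than conventional.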

21.2 Worth Borrowing with Modifications

21.2.1 Activity Detection via Agent Internals

Reading Claude Code's JSONL session files for activity detection is clever but tightly coupled:

// Read last 128KB of JSONL, parse last entry type
const entry = await readLastJsonlEntry(sessionFile);

Why to borrow: Much more accurate than terminal output parsing. Knows exactly what the agent is doing.

Modification needed: Abstract this behind the Agent interface more cleanly. Each agent plugin should expose standardized activity signals rather than having the orchestrator parse agent-specific file formats.
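The underlying tail-read mechanic is simple enough to standardize. A sketch follows, using the 128 KB figure quoted above; the helper name is hypothetical:

```typescript
import { promises as fs } from "node:fs";

// Read the tail of a JSONL file and return the last complete entry.
// Reading only the final chunk keeps the cost O(1) in file size, which
// matters for long-running agent sessions with large transcripts.
async function readLastJsonlEntry(path: string, tailBytes = 128 * 1024): Promise<unknown> {
  const handle = await fs.open(path, "r");
  try {
    const { size } = await handle.stat();
    const start = Math.max(0, size - tailBytes);
    const buf = Buffer.alloc(size - start);
    await handle.read(buf, 0, buf.length, start);
    const lines = buf.toString("utf8").split("\n").filter((l) => l.trim() !== "");
    for (let i = lines.length - 1; i >= 0; i--) {
      try {
        return JSON.parse(lines[i]); // skip a possibly-truncated first line
      } catch {
        continue;
      }
    }
    return undefined;
  } finally {
    await handle.close();
  }
}
```

An agent plugin could wrap this behind a standardized activity() method so the orchestrator never touches agent-specific file formats directly.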

21.2.2 Three-Layer Prompt Composition

The base + config + user rules approach is sound but rigid:

Modification needed: Add support for:

  • Template variables in prompts
  • Conditional sections based on project type
  • Prompt versioning and A/B testing
  • Dynamic context injection (e.g., related PR context, dependency graph)

21.2.3 Worktree-Based Isolation

Git worktrees are efficient (shared object store) but have limitations:

Modification needed: Support both worktrees (for speed) and full clones (for complete isolation). Consider container-based isolation for stronger security boundaries.

21.2.4 Dashboard Attention Levels

The Kanban grouping by attention level (working/pending/review/respond/merge/done) is intuitive:

Modification needed: Make the attention levels configurable. Different teams may have different workflows and priority signals.

21.3 Not Recommended to Borrow

21.3.1 Flat-File Metadata

The key=value text file approach is too fragile for production:

  • No atomicity guarantees
  • No schema evolution support
  • No query capability (must read all files to list sessions)
  • Race conditions between writers

Alternative for Maestro: Use SQLite (embedded, zero-config, ACID) or a structured file format (JSON with atomic rename-based writes).
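Atomic rename-based writes are straightforward on POSIX filesystems. A sketch, with the caveat that rename is only atomic when the temp file lives on the same filesystem as the target:

```typescript
import { promises as fs } from "node:fs";

// Write JSON atomically: write to a temp file next to the target, then
// rename over it. Readers observe either the old file or the new one,
// never a truncated intermediate state (rename(2) is atomic within a
// single filesystem).
async function writeJsonAtomic(path: string, data: unknown): Promise<void> {
  const tmp = `${path}.tmp-${process.pid}-${Date.now()}`;
  await fs.writeFile(tmp, JSON.stringify(data, null, 2), "utf8");
  await fs.rename(tmp, path);
}
```

This closes the torn-write gap of the flat-file format at nearly zero complexity cost; SQLite adds query capability on top of that if listing and filtering sessions becomes a bottleneck.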

21.3.2 tmux as Primary Runtime

While pragmatic, tmux coupling creates issues:

  • Not available on Windows
  • Message passing is fragile (buffer sizes, timing)
  • No structured communication channel
  • Output capture is lossy (screen buffer limits)

Alternative for Maestro: Implement a process runtime that uses stdin/stdout for structured communication (JSON-RPC or similar), with tmux as an optional attachment layer for debugging.
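The structured channel could use newline-delimited JSON framing over stdin/stdout. A sketch of the codec; the message shape is an assumption for illustration:

```typescript
// Sketch of NDJSON framing for an agent <-> orchestrator channel.
// Each message is one JSON object per line; the stateful decoder
// tolerates partial chunks arriving from a stream.
interface AgentMessage {
  type: string;
  payload?: unknown;
}

function encodeMessage(msg: AgentMessage): string {
  return JSON.stringify(msg) + "\n";
}

function createDecoder(onMessage: (msg: AgentMessage) => void) {
  let buffer = "";
  return (chunk: string): void => {
    buffer += chunk;
    let newline: number;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline);
      buffer = buffer.slice(newline + 1);
      if (line.trim() !== "") onMessage(JSON.parse(line) as AgentMessage);
    }
  };
}
```

Unlike tmux send-keys, this gives delivery guarantees and machine-parseable replies, while tmux remains available purely as a human debugging window.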

21.3.3 Polling-Based Lifecycle

30-second polling is too slow for responsive orchestration and too wasteful for idle systems:

Alternative for Maestro: Use an event-driven architecture with filesystem watches (inotify/FSEvents), agent-reported events (via a sidecar or callback), and webhook-based SCM notifications.

21.3.4 No Authentication on Dashboard

Running a web dashboard without any authentication is a security gap:

Alternative for Maestro: At minimum, implement localhost-only binding with a session token. Better: proper authentication with API keys or OAuth.

21.3.5 Hardcoded Merge Strategy

gh pr merge --squash --delete-branch is hardcoded:

Alternative for Maestro: Make merge strategy configurable per project (squash/merge/rebase, delete branch or not).

21.4 Key Lessons

  1. Simplicity wins for v1: AO chose the simplest possible implementation at every layer (files over databases, polling over events, CLI over API). This allowed rapid development and easy debugging.

  2. Plugin architecture pays off early: Even in a young project, the ability to swap implementations is valuable. It enables both experimentation and user customization.

  3. Safety must be default-on: Fail-closed CI, confirmation dialogs, and shell injection prevention are good defaults. Auto-merge should require explicit opt-in.

  4. Agent-specific integration is necessary: Generic agent interfaces are not enough. Deep integration with the specific agent (like reading Claude Code JSONL) provides dramatically better observability.

  5. The orchestrator-as-agent pattern is powerful: Using an AI agent to orchestrate other AI agents (the meta-agent pattern) leverages the agent's natural language understanding for flexible task management. But it requires a very good system prompt.


22. Cross-Links

This analysis is part of a broader research effort analyzing multiple AI agent orchestration frameworks. Related documents in the /Users/jeffscottward/Github/research/ai-harness/Claude/v1/ directory include:

| Document | Relevance to This Analysis |
|---|---|
| swe-bench-deep-analysis.md | SWE-bench is the primary benchmark for evaluating coding agents like those orchestrated by AO. Comparison of evaluation methodologies. |
| claude-code-deep-analysis.md | Claude Code is AO's primary agent. Deep understanding of Claude Code's internals (JSONL format, session files, hooks) is essential for understanding AO's agent plugin. |
| codex-deep-analysis.md | Codex CLI is a supported agent in AO. Compare how AO integrates Codex vs Claude Code. |
| aider-deep-analysis.md | Aider is a supported agent in AO. Compare integration depth and activity detection approaches. |
| opencode-deep-analysis.md | OpenCode is a supported agent in AO. Compare plugin maturity. |
| open-hands-deep-analysis.md | OpenHands (formerly OpenDevin) provides container-based isolation. Compare with AO's worktree/tmux approach for security and resource isolation. |
| bolt-diy-deep-analysis.md | Bolt.diy is a web-based coding assistant. Compare the dashboard/UI patterns. |
| maestro-architecture.md | The target architecture document. This analysis directly informs what patterns to adopt, adapt, or avoid. |

Cross-Cutting Themes

  1. Isolation Models: AO uses worktrees + tmux. OpenHands uses Docker containers. Each has tradeoffs between speed, security, and complexity. Maestro should support both.

  2. Agent Communication: AO uses terminal message passing. Some frameworks use structured APIs. The hybrid approach (terminal for legacy agents, API for modern ones) may be optimal.

  3. State Management: AO uses flat files. Most production systems use databases. The trade-off is operational simplicity vs. query capability and reliability.

  4. Orchestration Patterns: AO's meta-agent pattern (AI orchestrating AI) vs. rule-based orchestration vs. human-in-the-loop. Each has different reliability/flexibility trade-offs.

  5. Cost Management: All frameworks struggle with cost visibility and control. This is an area where Maestro can differentiate.


Appendix A: File Reference

Core Package Files

| File | Lines | Purpose |
|---|---|---|
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/types.ts | 1084 | Central type definitions |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/session-manager.ts | ~1100 | Session CRUD operations |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/lifecycle-manager.ts | 587 | State machine + reaction engine |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/config.ts | ~400 | Config loading + validation |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/plugin-registry.ts | ~100 | Plugin registration + lookup |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/paths.ts | ~200 | Hash-based directory management |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/metadata.ts | ~200 | Flat-file metadata management |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/prompt-builder.ts | ~150 | Three-layer prompt composition |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/orchestrator-prompt.ts | ~250 | Meta-agent system prompt |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/tmux.ts | ~200 | Safe tmux wrappers |
| /tmp/ai-harness-repos/agent-orchestrator/packages/core/src/utils.ts | ~150 | Shell escape, JSONL parsing |

Plugin Files

| File | Lines | Purpose |
|---|---|---|
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/agent-claude-code/src/index.ts | 786 | Claude Code agent integration |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/runtime-tmux/src/index.ts | 184 | tmux runtime implementation |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/workspace-worktree/src/index.ts | 301 | Git worktree workspace |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/scm-github/src/index.ts | 581 | GitHub SCM integration |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-github/src/index.ts | 304 | GitHub Issues tracker |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/tracker-linear/src/index.ts | 722 | Linear tracker integration |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/notifier-desktop/src/index.ts | ~80 | OS desktop notifications |
| /tmp/ai-harness-repos/agent-orchestrator/packages/plugins/notifier-slack/src/index.ts | ~150 | Slack webhook notifications |

CLI Files

| File | Lines | Purpose |
|---|---|---|
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/index.ts | ~80 | CLI entry point |
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/spawn.ts | ~200 | Spawn + batch-spawn |
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/start.ts | ~150 | Start/stop orchestrator |
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/status.ts | ~200 | Status display |
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/session.ts | ~200 | Session subcommands |
| /tmp/ai-harness-repos/agent-orchestrator/packages/cli/src/commands/init.ts | ~300 | Init wizard |

Web Files

| File | Lines | Purpose |
|---|---|---|
| /tmp/ai-harness-repos/agent-orchestrator/packages/web/src/lib/services.ts | 84 | Service singleton |
| /tmp/ai-harness-repos/agent-orchestrator/packages/web/src/components/Dashboard.tsx | 272 | Main dashboard UI |
| /tmp/ai-harness-repos/agent-orchestrator/packages/web/src/app/api/sessions/route.ts | 65 | Sessions API |
| /tmp/ai-harness-repos/agent-orchestrator/packages/web/src/app/api/events/route.ts | 104 | SSE events API |

Config & Documentation

| File | Purpose |
|---|---|
| /tmp/ai-harness-repos/agent-orchestrator/README.md | Project overview |
| /tmp/ai-harness-repos/agent-orchestrator/ARCHITECTURE.md | Directory architecture |
| /tmp/ai-harness-repos/agent-orchestrator/CLAUDE.md | Development conventions |
| /tmp/ai-harness-repos/agent-orchestrator/SECURITY.md | Security policy |
| /tmp/ai-harness-repos/agent-orchestrator/agent-orchestrator.yaml.example | Full reference config |
| /tmp/ai-harness-repos/agent-orchestrator/examples/simple-github.yaml | Minimal config example |
| /tmp/ai-harness-repos/agent-orchestrator/examples/auto-merge.yaml | Auto-merge config example |

Appendix B: Confidence Scores Summary

| Section | Confidence | Reasoning |
|---|---|---|
| 1. Design Philosophy | High | README, ARCHITECTURE.md, and code consistently support conclusions |
| 2. Core Architecture | High | All source files read and analyzed |
| 3. Harness Workflow | High | Spawn sequence traced through code |
| 4. Subagent Orchestration | High | Orchestrator prompt and communication code reviewed |
| 5. Multi-Agent & Parallelization | High | Lifecycle manager and batch-spawn code reviewed |
| 6. Isolation Model | High | Workspace and runtime plugins fully analyzed |
| 7. Human-in-the-Loop | High | Dashboard and API code reviewed |
| 8. Context Handling | High | Prompt builder and tracker plugins reviewed |
| 9. Session Lifecycle | High | State machine and activity detection fully traced |
| 10. Code Quality Gates | Medium | CI config reviewed but lint rules and test coverage not explored |
| 11. Security | High | SECURITY.md, CI workflows, and shell security patterns reviewed |
| 12. Hooks & Automation | High | Reaction engine and PostToolUse hook fully analyzed |
| 13. CLI & UX | High | All CLI commands reviewed |
| 14. Cost & Usage | Medium | Cost extraction code reviewed but display/alerting not found |
| 15. Tooling & Dependencies | High | package.json and imports reviewed |
| 16. External Integrations | High | All plugin code reviewed |
| 17. Operational Assumptions | High | Requirements documented and validated against code |
| 18. Failure Modes | High | Error handling paths traced through code |
| 19. Governance | Medium | Security measures documented but no formal governance framework |
| 20. Roadmap | Medium | Based on TODOs, plugin stubs, and architecture patterns |
| 21. Borrow/Adapt | High | Based on thorough analysis of all sections |

Appendix C: Architecture Diagrams (Text)

C.1 System Architecture

┌──────────────────────────────────────────────────────┐
│                    Human Developer                     │
│                                                        │
│  ┌──────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ ao CLI   │  │  Dashboard   │  │  GitHub UI   │   │
│  │          │  │  (Next.js)   │  │              │   │
│  └────┬─────┘  └──────┬───────┘  └──────┬───────┘   │
└───────┼────────────────┼────────────────┼────────────┘
        │                │                │
        ▼                ▼                │
┌───────────────────────────────┐         │
│          AO Core              │         │
│                               │         │
│  ┌─────────────────────────┐  │         │
│  │    Session Manager      │  │         │
│  │  (spawn/list/kill/etc)  │  │         │
│  └────────────┬────────────┘  │         │
│               │               │         │
│  ┌────────────▼────────────┐  │         │
│  │   Lifecycle Manager     │  │         │
│  │  (poll/react/escalate)  │  │         │
│  └────────────┬────────────┘  │         │
│               │               │         │
│  ┌────────────▼────────────┐  │         │
│  │    Plugin Registry      │  │         │
│  │  (8 slots, 16 plugins)  │  │         │
│  └────────────┬────────────┘  │         │
└───────────────┼───────────────┘         │
                │                         │
    ┌───────────┼───────────┐             │
    │           │           │             │
    ▼           ▼           ▼             ▼
┌────────┐ ┌────────┐ ┌─────────┐ ┌──────────┐
│  tmux  │ │  git   │ │ Claude  │ │ GitHub   │
│sessions│ │worktree│ │  Code   │ │   API    │
│        │ │        │ │  CLI    │ │(gh CLI)  │
└────────┘ └────────┘ └─────────┘ └──────────┘

C.2 Session Lifecycle State Machine

                    ┌──────────┐
                    │ spawning │
                    └────┬─────┘
                         │
                         ▼
                    ┌──────────┐
              ┌─────│ working  │◄────────────────────┐
              │     └────┬─────┘                      │
              │          │                            │
              ▼          ▼                            │
        ┌──────────┐ ┌──────────┐              ┌─────┴──────┐
        │needs_input│ │ pr_open  │              │changes_req │
        └──────────┘ └────┬─────┘              └─────┬──────┘
                          │                          │
                    ┌─────┼─────┐                    │
                    │     │     │                    │
                    ▼     ▼     ▼                    │
              ┌────────┐ ┌──────────┐ ┌──────────┐  │
              │ci_fail │ │rev_pend  │ │ working  │──┘
              └───┬────┘ └────┬─────┘
                  │           │
                  │      ┌────┼────┐
                  │      │         │
                  │      ▼         ▼
                  │ ┌──────────┐ ┌──────────────┐
                  │ │ approved │ │changes_req   │
                  │ └────┬─────┘ └──────────────┘
                  │      │
                  │      ▼
                  │ ┌──────────┐
                  │ │mergeable │
                  │ └────┬─────┘
                  │      │
                  │      ▼
                  │ ┌──────────┐
                  └►│  merged  │
                    └────┬─────┘
                         │
                         ▼
                    ┌──────────┐
                    │  done    │
                    └──────────┘

        (Any state) ──► killed / terminated / errored / stuck

C.3 Directory Structure

~/.agent-orchestrator/
│
├── a1b2c3d4e5f6-my-project/        # {hash}-{projectId}
│   ├── .origin                       # Original config path
│   ├── sessions/
│   │   ├── fix-auth-1/
│   │   │   ├── metadata              # key=value flat file
│   │   │   └── prompt.md             # Agent system prompt
│   │   ├── add-api-2/
│   │   │   ├── metadata
│   │   │   └── prompt.md
│   │   └── refactor-3/
│   │       ├── metadata
│   │       └── prompt.md
│   ├── archive/
│   │   ├── old-session_1706000000   # Archived metadata
│   │   └── old-session_1706100000
│   └── worktrees/
│       ├── fix-auth-1/              # Git worktree checkout
│       ├── add-api-2/
│       └── refactor-3/
│
└── f6e5d4c3b2a1-other-project/
    ├── sessions/
    ├── archive/
    └── worktrees/

Appendix D: Configuration Reference

D.1 Full Configuration Schema

# agent-orchestrator.yaml

# Global settings
dataDir: "~/.agent-orchestrator"      # Base data directory
worktreeDir: null                      # Override worktree location
port: 3000                            # Dashboard port

# Default plugin selections
defaults:
  runtime: tmux                       # Process runtime
  agent: claude-code                  # AI coding agent
  workspace: worktree                 # Code isolation strategy
  notifiers:                          # Notification channels
    - composio
    - desktop

# Project definitions
projects:
  my-project:
    repo: "owner/repo"               # GitHub repository
    path: "/path/to/local/repo"      # Local repository path
    defaultBranch: main               # Default branch name
    sessionPrefix: "fix"              # Session name prefix

    tracker:                          # Issue tracker
      plugin: github                  # or "linear"
      # Linear-specific:
      # teamId: "TEAM_ID"

    scm:                              # Source code management
      plugin: github

    symlinks:                         # Shared resources in worktrees
      - source: "/path/to/node_modules"
        target: "node_modules"

    postCreate:                       # Commands after workspace creation
      - "npm install"
      - "npm run build"

    agentConfig:                      # Agent-specific configuration
      model: "claude-sonnet-4-5-20250514"

    agentRules: |                     # Inline rules for the agent
      Follow TDD. Write tests first.

    agentRulesFile: ".agent-rules.md" # External rules file

# Notification configuration
notifiers:
  slack:
    webhookUrl: "${SLACK_WEBHOOK_URL}"
  desktop: {}

# Notification routing by priority
notificationRouting:
  critical: [slack, desktop]
  high: [slack, desktop]
  normal: [slack]
  low: [slack]

# Reaction rules
reactions:
  ci-failed:
    trigger: ci.failing
    action: send-to-agent
    message: "CI is failing. Please investigate and fix."
    retries: 2
    escalation:
      action: notify
      after: "10m"
      priority: critical

  changes-requested:
    trigger: review.changes_requested
    action: send-to-agent
    message: "Review feedback received. Please address."

  approved-and-green:
    trigger: review.approved
    condition: ci.passing
    action: notify
    message: "PR ready to merge."

  agent-stuck:
    trigger: agent.stuck
    action: notify
    priority: high
    escalation:
      action: notify
      after: "15m"
      priority: critical

  auto-merge:                         # Optional: auto-merge
    trigger: review.approved
    condition: ci.passing
    action: auto-merge

Source: /tmp/ai-harness-repos/agent-orchestrator/agent-orchestrator.yaml.example


End of analysis. Total sections: 22 (21 analysis areas + cross-links). All file paths reference the source repository at /tmp/ai-harness-repos/agent-orchestrator/.

Everything Claude Code (ECC) -- Deep Technical Analysis

Repository: affaan-m/everything-claude-code
Version analyzed: v1.4.1 (February 2026)
Analysis date: 2026-02-22
Analyst: Claude Opus 4.6


Table of Contents

  1. Executive Summary
  2. Design Philosophy and Abstractions
  3. Core Architecture Model
  4. Harness Workflow: Spec to Plan to Execute to Verify to Merge
  5. Subagent/Task Orchestration Model
  6. Multi-Agent / Parallelization Strategy
  7. Isolation Model
  8. Human-in-the-Loop Controls
  9. Context Handling Strategy
  10. Session Lifecycle and Persistence
  11. Code Quality Gates
  12. Security and Compliance Mechanisms
  13. Hooks, Automation Surface, and Fail-Safe Behavior
  14. CLI/UX and Automation Ergonomics
  15. Cost/Usage Visibility and Governance
  16. Tooling and Dependency Surface
  17. External Integrations and Provider Compatibility
  18. Operational Assumptions and Constraints
  19. Failure Modes and Issues Observed
  20. Governance and Guardrails
  21. Roadmap/Evolution Signals, Missing Areas, Unresolved Issues
  22. What Should Be Borrowed/Adapted into Maestro
  23. Cross-Links

1. Executive Summary

Everything Claude Code (ECC) is a configuration-layer harness -- not a runtime framework -- that wraps Claude Code CLI with curated agents, skills, hooks, commands, rules, and MCP configurations. It is the most popular community-maintained collection of Claude Code configurations (42K+ stars, 5K+ forks as of Feb 2026), battle-tested by the author over 10+ months of daily production use.

Key characterization: ECC is a "prompt engineering harness" that orchestrates Claude Code's native capabilities through markdown-based configuration rather than writing code that drives Claude Code programmatically. It does not have its own execution engine; it relies entirely on Claude Code's built-in plugin system, hook mechanism, and subagent (Task tool) delegation.

Confidence: HIGH -- This assessment is based on exhaustive reading of every file in the repository.

Strengths

  • Extremely well-organized collection of battle-tested configurations
  • Strong hook system with cross-platform Node.js implementations
  • Thoughtful session persistence and context management
  • Multi-language support (TS, Python, Go, Java, C++, Swift)
  • Cross-platform compatibility (Windows, macOS, Linux)
  • Good CI/CD with comprehensive validation
  • Innovative continuous learning (instinct) system
  • Excellent documentation (shortform/longform guides, i18n)

Limitations

  • No runtime execution engine -- depends entirely on Claude Code CLI
  • No true concurrent agent orchestration (sequential pipeline only)
  • No formal state machine for workflow progression
  • No persistent database or structured data store
  • Orchestration is prompt-driven, not code-driven
  • No cost tracking beyond Claude's built-in /cost
  • Multi-model commands (multi-plan, multi-execute) depend on external codeagent-wrapper binary not included

2. Design Philosophy and Abstractions

2.1 Mental Model

ECC embodies a "configuration as code" philosophy applied to AI-assisted development. The core mental model is:

"Claude Code is already powerful. Make it more consistent, more efficient, and more specialized by providing curated configurations that encode expert knowledge."

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/the-shortform-guide.md lines 1-9: "These configs are battle-tested across multiple production applications."
  • /tmp/ai-harness-repos/everything-claude-code/README.md lines 30-33: "Production-ready agents, skills, hooks, commands, rules, and MCP configurations evolved over 10+ months."

The author's explicit stance (longform guide, line 1) is that configuration should be iterated and refined over time, not designed once.

Confidence: HIGH

2.2 Abstraction Layers

ECC defines six primary abstraction layers, each stored as markdown or JSON:

| Layer | Storage Format | Location | Purpose |
|-------|----------------|----------|---------|
| Rules | Markdown | rules/ | Always-active behavioral constraints |
| Agents | Markdown with YAML frontmatter | agents/ | Specialized subagent personas |
| Skills | Markdown with YAML frontmatter | skills/ | Domain knowledge and workflow definitions |
| Commands | Markdown with YAML frontmatter | commands/ | Slash commands for quick execution |
| Hooks | JSON | hooks/hooks.json | Event-driven automations |
| Contexts | Markdown | contexts/ | Dynamic system prompt injection |

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/CLAUDE.md lines 23-31: Architecture section listing all component types
  • /tmp/ai-harness-repos/everything-claude-code/README.md lines 189-343: Complete directory structure

2.3 Design Principles

  1. Modularity over monolith: Each component is independently installable and removable
  2. Markdown as universal format: Everything is markdown -- the LLM-native format
  3. Convention over configuration: Standard file locations, naming patterns
  4. Progressive enhancement: Start with what resonates, add incrementally
  5. Context window conservation: Aggressive optimization of token usage
  6. Cross-platform compatibility: Node.js scripts instead of bash

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/README.md lines 994-998: "Start with what resonates, modify for your stack, remove what you don't use, add your own patterns"
  • All hooks are Node.js: /tmp/ai-harness-repos/everything-claude-code/hooks/README.md lines 192-193

Confidence: HIGH


3. Core Architecture Model

3.1 Entry Points

ECC has three primary entry points:

  1. Plugin installation: Via Claude Code's /plugin marketplace add command

    • File: /tmp/ai-harness-repos/everything-claude-code/.claude-plugin/plugin.json
    • Registers agents, skills, and commands
    • Hooks auto-loaded from hooks/hooks.json by convention (Claude Code v2.1+)
  2. Manual installation: Copying files to ~/.claude/ directories

    • File: /tmp/ai-harness-repos/everything-claude-code/install.sh
    • Supports --target claude (default) and --target cursor
    • Handles common + language-specific rule installation
  3. npm package: npm install ecc-universal

    • File: /tmp/ai-harness-repos/everything-claude-code/package.json line 73: "bin": { "ecc-install": "install.sh" }
    • Provides ecc-install CLI command

3.2 Key Modules

3.2.1 Scripts Library (scripts/lib/)

The cross-platform utility layer that all hooks depend on:

  • utils.js (/tmp/ai-harness-repos/everything-claude-code/scripts/lib/utils.js): 529 lines

    • Platform detection (Windows/macOS/Linux)
    • Directory management (sessions, learned skills, temp)
    • File operations (read, write, append, replace, grep)
    • Git operations (modified files, repo detection)
    • Stdin JSON parsing for hooks
    • Command execution with security notes (line 337-338)
  • package-manager.js (/tmp/ai-harness-repos/everything-claude-code/scripts/lib/package-manager.js): 431 lines

    • Supports npm, pnpm, yarn, bun
    • 6-level detection priority (env var > project config > package.json > lock file > global config > fallback)
    • Input validation with SAFE_NAME_REGEX and SAFE_ARGS_REGEX (lines 285-319)
    • Known performance fix: Avoids spawning child processes in hot paths (line 228-231)
  • session-manager.js (/tmp/ai-harness-repos/everything-claude-code/scripts/lib/session-manager.js): 442 lines

    • Session CRUD operations
    • Filename parsing with calendar-accurate date validation (line 37-41)
    • Metadata extraction from markdown content
    • Pagination support for session listing
  • session-aliases.js (/tmp/ai-harness-repos/everything-claude-code/scripts/lib/session-aliases.js): Session alias management

3.2.2 Hook Scripts (scripts/hooks/)

Nine hook scripts provide the runtime behavior:

| Script | Event | Purpose |
|--------|-------|---------|
| session-start.js | SessionStart | Load previous context, detect PM |
| session-end.js | SessionEnd | Extract summary from transcript, persist |
| pre-compact.js | PreCompact | Save state before compaction |
| suggest-compact.js | PreToolUse | Suggest compaction at thresholds |
| evaluate-session.js | SessionEnd | Extract patterns for continuous learning |
| post-edit-format.js | PostToolUse | Auto-format with Prettier |
| post-edit-typecheck.js | PostToolUse | TypeScript checking |
| post-edit-console-warn.js | PostToolUse | Warn about console.log |
| check-console-log.js | Stop | Audit modified files for console.log |

3.2.3 CI Validators (scripts/ci/)

Five validation scripts enforce structural integrity:

| Validator | What It Checks |
|-----------|----------------|
| validate-agents.js | YAML frontmatter: name, description, tools, model |
| validate-commands.js | Description frontmatter presence |
| validate-hooks.js | JSON schema, valid event types, inline JS syntax |
| validate-rules.js | Markdown heading structure |
| validate-skills.js | SKILL.md file presence and frontmatter |

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/scripts/ci/validate-agents.js lines 11-12: REQUIRED_FIELDS = ['model', 'tools']; VALID_MODELS = ['haiku', 'sonnet', 'opus']
  • /tmp/ai-harness-repos/everything-claude-code/scripts/ci/validate-hooks.js line 11: VALID_EVENTS = ['PreToolUse', 'PostToolUse', 'PreCompact', 'SessionStart', 'SessionEnd', 'Stop', 'Notification', 'SubagentStop']

3.3 Data Flow

User Request
    |
    v
Claude Code CLI
    |
    +--> Rules (always loaded from ~/.claude/rules/)
    |     Behavioral constraints applied to every response
    |
    +--> CLAUDE.md (project/user level)
    |     Project-specific guidance
    |
    +--> Slash Command (e.g., /plan)
    |     |
    |     v
    |   Command markdown loaded
    |     |
    |     v
    |   Agent invoked (via Task tool)
    |     |
    |     +--> Skills referenced in agent prompt
    |     |
    |     v
    |   Agent produces output
    |
    +--> Hooks fire (Pre/Post/Lifecycle)
    |     |
    |     v
    |   Node.js scripts execute
    |     |
    |     +--> Session state persisted to ~/.claude/sessions/
    |     +--> Compaction suggested
    |     +--> Patterns extracted
    |
    v
Claude response to user

Confidence: HIGH


4. Harness Workflow

4.1 Spec to Plan to Execute to Verify to Merge

ECC defines an explicit workflow pipeline via the /orchestrate command:

File: /tmp/ai-harness-repos/everything-claude-code/commands/orchestrate.md

/orchestrate feature "Add user authentication"

Pipeline:
1. planner agent     -> Requirements + plan
2. tdd-guide agent   -> Tests first, then implementation
3. code-reviewer agent -> Quality review
4. security-reviewer agent -> Security audit

Four predefined workflow types:

| Workflow | Agent Sequence |
|----------|----------------|
| feature | planner -> tdd-guide -> code-reviewer -> security-reviewer |
| bugfix | planner -> tdd-guide -> code-reviewer |
| refactor | architect -> code-reviewer -> tdd-guide |
| security | security-reviewer -> code-reviewer -> architect |

Evidence: /tmp/ai-harness-repos/everything-claude-code/commands/orchestrate.md lines 11-30

4.2 Handoff Protocol

Between agents, a structured handoff document is passed (lines 48-65):

## HANDOFF: [previous-agent] -> [next-agent]

### Context
[Summary of what was done]

### Findings
[Key discoveries or decisions]

### Files Modified
[List of files touched]

### Open Questions
[Unresolved items for next agent]

### Recommendations
[Suggested next steps]

4.3 Plan Command in Detail

The /plan command (/tmp/ai-harness-repos/everything-claude-code/commands/plan.md) follows a strict pattern:

  1. Restate Requirements -- Clarify what needs to be built
  2. Identify Risks -- Surface potential issues
  3. Create Step Plan -- Break into phases
  4. WAIT for Confirmation -- "CRITICAL: The planner agent will NOT write any code until you explicitly confirm"

The planner agent (/tmp/ai-harness-repos/everything-claude-code/agents/planner.md) produces structured plans with:

  • Overview (2-3 sentences)
  • Requirements list
  • Architecture changes with file paths
  • Implementation steps grouped by phase
  • Testing strategy
  • Risks and mitigations
  • Success criteria

4.4 Multi-Model Planning (Advanced)

The multi-plan command (/tmp/ai-harness-repos/everything-claude-code/commands/multi-plan.md) introduces a more sophisticated pipeline:

  1. Phase 1: Context Retrieval -- Uses mcp__ace-tool__search_context for semantic search
  2. Phase 2: Dual-Model Analysis -- Parallel calls to Codex and Gemini backends
  3. Phase 2.3: Cross-Validation -- Identify consensus and divergence
  4. Phase 2.4: Claude Synthesis -- Generate final plan from both analyses

IMPORTANT: This depends on ~/.claude/bin/codeagent-wrapper (lines 25-26 of multi-plan.md), which is NOT included in the repository. This is an external dependency that users must install separately.

Confidence: MEDIUM -- The multi-model flow is well-documented but depends on external tooling not included.

4.5 Verification Loop

The /verify command invokes the verification-loop skill:

File: /tmp/ai-harness-repos/everything-claude-code/skills/verification-loop/SKILL.md

Six verification phases:

  1. Build Verification -- npm run build
  2. Type Check -- tsc --noEmit or pyright
  3. Lint Check -- npm run lint or ruff check
  4. Test Suite -- Run with coverage, target 80%
  5. Security Scan -- Grep for secrets and console.log
  6. Diff Review -- Review changed files

Output is a structured VERIFICATION REPORT with PASS/FAIL per phase and an overall READY/NOT READY verdict.
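
The report shape can be sketched as follows. This is a minimal illustration of the PASS/FAIL-per-phase plus overall-verdict structure; the phase commands in the example are paraphrased from the skill, and the function name is mine.

```javascript
// Sketch of the verification-loop report format: one PASS/FAIL line per
// phase, then an overall READY / NOT READY verdict.
function buildVerificationReport(results) {
  // results: [{ phase: 'Build Verification', pass: true }, ...]
  const lines = ['VERIFICATION REPORT', ''];
  for (const r of results) {
    lines.push(`${r.pass ? 'PASS' : 'FAIL'}  ${r.phase}`);
  }
  const ready = results.every(r => r.pass);
  lines.push('', `Verdict: ${ready ? 'READY' : 'NOT READY'}`);
  return lines.join('\n');
}

// Example mirroring the skill's six phases; any single failure
// flips the overall verdict to NOT READY.
const report = buildVerificationReport([
  { phase: 'Build Verification (npm run build)', pass: true },
  { phase: 'Type Check (tsc --noEmit)', pass: true },
  { phase: 'Lint Check (npm run lint)', pass: true },
  { phase: 'Test Suite (coverage >= 80%)', pass: false },
  { phase: 'Security Scan (secrets, console.log)', pass: true },
  { phase: 'Diff Review', pass: true },
]);
```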

Confidence: HIGH -- This is fully implemented as a skill definition.


5. Subagent/Task Orchestration Model

5.1 Agent Architecture

ECC defines 13 specialized agents in /tmp/ai-harness-repos/everything-claude-code/agents/:

| Agent | Model | Tools | Role |
|-------|-------|-------|------|
| planner.md | opus | Read, Grep, Glob | Feature planning |
| architect.md | opus | Read, Grep, Glob | System design |
| code-reviewer.md | sonnet | Read, Grep, Glob, Bash | Code review |
| security-reviewer.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | Security audit |
| tdd-guide.md | sonnet | Read, Write, Edit, Bash, Grep | TDD enforcement |
| build-error-resolver.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | Build error fixing |
| e2e-runner.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | E2E testing |
| refactor-cleaner.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | Dead code removal |
| doc-updater.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | Documentation sync |
| go-reviewer.md | sonnet | Read, Grep, Glob, Bash | Go code review |
| go-build-resolver.md | sonnet | Read, Write, Edit, Bash, Grep, Glob | Go build errors |
| python-reviewer.md | sonnet | Read, Grep, Glob, Bash | Python code review |
| database-reviewer.md | sonnet | Read, Grep, Glob, Bash | Database optimization |

5.2 Tool Scoping Strategy

Agents use deliberately restricted tool sets:

  • Read-only agents (planner, architect): ["Read", "Grep", "Glob"] -- cannot modify code
  • Full-access agents (build-error-resolver, tdd-guide): All tools including Write, Edit, Bash
  • Review agents (code-reviewer): Read + Bash (for running git diff)

This is a principle of least privilege approach to agent tooling.

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/agents/planner.md line 5: tools: ["Read", "Grep", "Glob"]
  • /tmp/ai-harness-repos/everything-claude-code/agents/tdd-guide.md line 5: tools: ["Read", "Write", "Edit", "Bash", "Grep"]
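
A minimal illustration of the frontmatter shape these agent files share (the tools and model values are taken from planner.md; the description text here is paraphrased, not quoted):

```markdown
---
name: planner
description: Feature planning agent that produces a phased implementation plan.
tools: ["Read", "Grep", "Glob"]
model: opus
---

System prompt body defining the agent's behavior goes here...
```

Per the CI validators, `model` (one of haiku/sonnet/opus) and `tools` are the required fields; the tool list is what enforces the least-privilege split between read-only and full-access agents.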

5.3 Model Selection for Agents

  • Opus: Used for deep reasoning tasks (planner, architect) -- 2 agents
  • Sonnet: Used for execution tasks (code review, TDD, build fixing) -- 11 agents
  • Haiku: Not directly assigned to any agent, but recommended for subagent exploration in rules

Evidence: /tmp/ai-harness-repos/everything-claude-code/rules/common/performance.md lines 3-18

5.4 The Subagent Context Problem

ECC explicitly addresses the "context problem" in multi-agent workflows:

File: /tmp/ai-harness-repos/everything-claude-code/skills/iterative-retrieval/SKILL.md

"Subagents are spawned with limited context. They don't know which files contain relevant code, what patterns exist in the codebase, what terminology the project uses."

The solution is Iterative Retrieval -- a 4-phase loop:

  1. DISPATCH -- Broad initial query
  2. EVALUATE -- Score relevance 0-1
  3. REFINE -- Update search criteria based on gaps
  4. LOOP -- Repeat max 3 cycles

Evidence: /tmp/ai-harness-repos/everything-claude-code/skills/iterative-retrieval/SKILL.md lines 30-48
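
The 4-phase loop can be sketched as follows. This is a hedged reconstruction of the pattern, not ECC's code: `search` and `scoreRelevance` are stand-ins for whatever retrieval backend and relevance heuristic the subagent actually uses.

```javascript
// Sketch of the iterative-retrieval loop: DISPATCH a query, EVALUATE
// relevance (0-1), REFINE the query on gaps, and LOOP at most 3 cycles.
function iterativeRetrieve(query, search, scoreRelevance,
                           { maxCycles = 3, threshold = 0.7 } = {}) {
  let currentQuery = query;
  let best = { results: [], score: 0 };
  for (let cycle = 1; cycle <= maxCycles; cycle++) {
    const results = search(currentQuery);       // DISPATCH: broad query
    const score = scoreRelevance(results);      // EVALUATE: score 0-1
    if (score > best.score) best = { results, score };
    if (score >= threshold) break;              // good enough: stop early
    // REFINE: in practice the agent rewrites the query based on gaps;
    // here we just annotate it to keep the sketch self-contained.
    currentQuery = `${currentQuery} (refined, cycle ${cycle})`;
  }                                             // LOOP: bounded by maxCycles
  return best;
}
```

The bounded cycle count is the key design choice: it caps the token cost of context discovery for a subagent that starts with no knowledge of the codebase.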

5.5 Orchestrator Pattern

From the longform guide (/tmp/ai-harness-repos/everything-claude-code/the-longform-guide.md lines 268-286):

Phase 1: RESEARCH (use Explore agent) -> research-summary.md
Phase 2: PLAN (use planner agent) -> plan.md
Phase 3: IMPLEMENT (use tdd-guide agent) -> code changes
Phase 4: REVIEW (use code-reviewer agent) -> review-comments.md
Phase 5: VERIFY (use build-error-resolver if needed) -> done or loop back

Key rules:

  1. Each agent gets ONE clear input and produces ONE clear output
  2. Outputs become inputs for next phase
  3. Never skip phases
  4. Use /clear between agents
  5. Store intermediate outputs in files
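
The phase-chaining rules above can be sketched as a simple pipeline. This is an illustration of the pattern only; agent invocation is stubbed, and the VERIFY phase (which may loop back to IMPLEMENT) is noted but not modeled.

```javascript
// Sketch of the orchestrator pattern: each phase gets ONE input (the
// previous phase's output) and produces ONE output, persisted to a file.
const phases = [
  { name: 'RESEARCH',  agent: 'Explore',       out: 'research-summary.md' },
  { name: 'PLAN',      agent: 'planner',       out: 'plan.md' },
  { name: 'IMPLEMENT', agent: 'tdd-guide',     out: 'changes.md' },
  { name: 'REVIEW',    agent: 'code-reviewer', out: 'review-comments.md' },
  // Phase 5 (VERIFY with build-error-resolver) would loop back on failure.
];

function runPipeline(runAgent) {
  let input = null;
  const artifacts = [];
  for (const p of phases) {
    const result = runAgent(p.agent, input); // one clear input
    artifacts.push({ file: p.out, result }); // one clear output, stored to file
    input = result;                          // output becomes next phase's input
  }
  return artifacts;
}
```

Storing each artifact to a file is what makes "/clear between agents" safe: the next phase reloads only the artifact, not the previous agent's full context.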

Confidence: HIGH


6. Multi-Agent / Parallelization Strategy

6.1 Actual Parallelization Capabilities

ECC does not implement true concurrent agent execution within a single Claude Code session. The orchestration is sequential with handoff documents between agents.

However, ECC documents several parallelization patterns for multiple Claude Code instances:

  1. Git Worktrees -- Each worktree gets its own Claude instance
  2. Fork (/fork) -- Fork conversations for non-overlapping tasks
  3. Cascade Method -- Open new tasks in new tabs, sweep left to right

Evidence: /tmp/ai-harness-repos/everything-claude-code/the-longform-guide.md lines 176-215

6.2 The Cascade Method

From the longform guide (line 209-215):

  • Open new tasks in new tabs to the right
  • Sweep left to right, oldest to newest
  • Focus on at most 3-4 tasks at a time

6.3 Multi-Model Parallelization (multi-plan/multi-execute)

The multi-plan and multi-execute commands use run_in_background: true for parallel calls to Codex and Gemini backends:

File: /tmp/ai-harness-repos/everything-claude-code/commands/multi-plan.md lines 119-133

Parallel call Codex and Gemini (run_in_background: true):
1. Codex Backend Analysis (technical feasibility, architecture)
2. Gemini Frontend Analysis (UI/UX impact, user experience)

CRITICAL LIMITATION: This depends on an external codeagent-wrapper binary at ~/.claude/bin/codeagent-wrapper that is NOT included in the repository. Without this binary, the multi-model commands cannot function.

6.4 Parallel Review Pattern

The /orchestrate command mentions parallel execution for independent checks (lines 139-149):

### Parallel Phase
Run simultaneously:
- code-reviewer (quality)
- security-reviewer (security)
- architect (design)

### Merge Results
Combine outputs into single report

However, this is documented as a pattern to follow, not code that enforces it. The actual parallelism depends on Claude Code's Task tool behavior.

6.5 Agent Teams Warning

File: /tmp/ai-harness-repos/everything-claude-code/docs/token-optimization.md lines 106-111

"Agent Teams spawns multiple context windows. Each teammate consumes tokens independently. Only use for tasks where parallelism provides clear value."

Confidence: HIGH for documentation, MEDIUM for implementation -- The parallelization strategies are well-documented but not enforced by code. The multi-model approach depends on external tooling.


7. Isolation Model

7.1 Git Worktrees

ECC recommends git worktrees as the primary isolation mechanism:

git worktree add ../project-feature-a feature-a
git worktree add ../project-feature-b feature-b

Each worktree is an independent filesystem checkout that gets its own Claude Code instance.

Evidence: /tmp/ai-harness-repos/everything-claude-code/the-longform-guide.md lines 193-203

7.2 Session Isolation

Sessions are isolated by:

  • File naming: YYYY-MM-DD-<short-id>-session.tmp -- unique per session
  • Short ID derivation: Last 8 chars of CLAUDE_SESSION_ID env var
  • Storage location: ~/.claude/sessions/

Evidence: /tmp/ai-harness-repos/everything-claude-code/scripts/lib/session-manager.js lines 22-54
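
The filename scheme reduces to a few lines. This sketch condenses the naming logic described above (the function name is mine):

```javascript
// Derive the session filename: YYYY-MM-DD-<short-id>-session.tmp, where
// the short ID is the last 8 chars of CLAUDE_SESSION_ID ('default' if unset).
function sessionFilename(sessionId, date = new Date()) {
  const shortId = (sessionId || 'default').slice(-8);
  const ymd = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `${ymd}-${shortId}-session.tmp`;
}
```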

7.3 Agent Isolation

Agents are isolated through:

  • Tool restrictions: Each agent declares which tools it can use
  • Model selection: Agents run on specified model (haiku/sonnet/opus)
  • Context scope: Subagents get limited context via the Task tool

There is no filesystem sandboxing beyond tool restrictions. An agent with Bash access can execute arbitrary commands.

7.4 Compaction Counter Isolation

The strategic compact hook uses per-session counter files:

File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/suggest-compact.js line 29

const sessionId = process.env.CLAUDE_SESSION_ID || 'default';
const counterFile = path.join(getTempDir(), `claude-tool-count-${sessionId}`);

7.5 Limitations

  • No container isolation: No Docker or sandbox for agent execution
  • No network isolation: Agents with Bash access can make network requests
  • No resource limits: No memory/CPU constraints on agent execution
  • Shared filesystem: All agents in a session share the same working directory

Confidence: HIGH


8. Human-in-the-Loop Controls

8.1 Plan Confirmation Gate

The /plan command enforces explicit user confirmation before code changes:

File: /tmp/ai-harness-repos/everything-claude-code/commands/plan.md line 96-97

"CRITICAL: The planner agent will NOT write any code until you explicitly confirm the plan with 'yes' or 'proceed'"

Users can respond with:

  • "yes" / "proceed" -- Approve and continue
  • "modify: [changes]" -- Request modifications
  • "different approach: [alternative]" -- Redirect

8.2 Hook Warnings

Several PreToolUse hooks provide non-blocking warnings:

  • Tmux reminder: Suggests tmux for long-running commands (exit code 0)
  • Git push reminder: "Review changes before push" (exit code 0)
  • Console.log warning: Warns about debug statements

8.3 Hook Blockers

Two PreToolUse hooks actively block operations:

  • Dev server blocker: Blocks npm run dev outside tmux (exit code 2)
  • Doc file blocker: Blocks creation of random .md/.txt files (exit code 2)

Evidence: /tmp/ai-harness-repos/everything-claude-code/hooks/hooks.json lines 4-44

8.4 Review Agent Verdict System

The code-reviewer agent produces verdicts:

File: /tmp/ai-harness-repos/everything-claude-code/agents/code-reviewer.md lines 209-212

  • Approve: No CRITICAL or HIGH issues
  • Warning: HIGH issues only (can merge with caution)
  • Block: CRITICAL issues found -- must fix before merge

8.5 Multi-Model Execution Gate

The multi-execute command requires explicit user confirmation:

File: /tmp/ai-harness-repos/everything-claude-code/commands/multi-execute.md line 15

"Prerequisite: Only execute after user explicitly replies 'Y' to /ccg:plan output"

8.6 Missing Controls

  • No automatic rollback: If an agent produces bad output, there's no automatic reversion
  • No approval for individual agent handoffs: The orchestration pipeline runs without intermediate approval
  • No budget gates: No automatic stopping when token cost exceeds a threshold
  • No diff review gate: No mandatory diff review before agent actions

Confidence: HIGH


9. Context Handling Strategy

9.1 Token Optimization Settings

File: /tmp/ai-harness-repos/everything-claude-code/docs/token-optimization.md

Recommended settings:

{
  "model": "sonnet",
  "env": {
    "MAX_THINKING_TOKENS": "10000",
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "50",
    "CLAUDE_CODE_SUBAGENT_MODEL": "haiku"
  }
}

| Setting | Default | Recommended | Impact |
|---------|---------|-------------|--------|
| model | opus | sonnet | ~60% cost reduction |
| MAX_THINKING_TOKENS | 31,999 | 10,000 | ~70% thinking cost reduction |
| CLAUDE_AUTOCOMPACT_PCT_OVERRIDE | 95 | 50 | Earlier compaction, better quality |
| CLAUDE_CODE_SUBAGENT_MODEL | (inherits main) | haiku | ~80% cheaper subagents |

9.2 Strategic Compaction

File: /tmp/ai-harness-repos/everything-claude-code/skills/strategic-compact/SKILL.md

The suggest-compact.js hook tracks tool call count and suggests /compact at configurable thresholds:

  • Default threshold: 50 tool calls
  • Periodic reminders: Every 25 calls after threshold
  • Session-specific counter: Uses CLAUDE_SESSION_ID for isolation
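
The threshold logic reduces to a small pure function. This sketch mirrors the documented defaults (suggest at 50 tool calls, remind every 25 thereafter) without the counter-file I/O:

```javascript
// Decide whether suggest-compact should speak up at a given tool-call count:
// 'suggest' exactly at the threshold, 'remind' at each reminder interval
// past it, null otherwise.
function shouldSuggestCompact(count, threshold = 50, reminderEvery = 25) {
  if (count === threshold) return 'suggest';
  if (count > threshold && (count - threshold) % reminderEvery === 0) return 'remind';
  return null;
}
```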

Compaction decision guide (from skill):

| Phase Transition | Compact? | Why |
|------------------|----------|-----|
| Research -> Planning | Yes | Research context is bulky |
| Planning -> Implementation | Yes | Plan is in file; free context |
| Debugging -> Next feature | Yes | Debug traces pollute context |
| Mid-implementation | No | Losing variable names/file paths |
| After failed approach | Yes | Clear dead-end reasoning |

9.3 Dynamic System Prompt Injection

File: /tmp/ai-harness-repos/everything-claude-code/the-longform-guide.md lines 56-74

# Daily development
alias claude-dev='claude --system-prompt "$(cat ~/.claude/contexts/dev.md)"'

# PR review mode
alias claude-review='claude --system-prompt "$(cat ~/.claude/contexts/review.md)"'

Three context files included, each defining a distinct behavioral mode:

contexts/dev.md (Development Mode):

Mode: Active development
Focus: Implementation, coding, building features
Behavior: Write code first, explain after
Priorities: 1. Get it working  2. Get it right  3. Get it clean
Tools to favor: Edit, Write, Bash, Grep, Glob

contexts/review.md (Review Mode):

Mode: PR review, code analysis
Focus: Quality, security, maintainability
Behavior: Read thoroughly before commenting, prioritize by severity
Checklist: Logic errors, edge cases, error handling, security, performance, readability, test coverage
Output: Group findings by file, severity first

contexts/research.md (Research Mode): Exploration-focused context for investigating codebases and external services.

The key insight here is the authority hierarchy described in the longform guide:

  1. System prompt content (highest authority)
  2. User messages
  3. Tool results (lowest authority)

By injecting context via --system-prompt, these modes shape Claude's behavior more strongly than any rule file or CLAUDE.md instruction could.

9.4 MCP Context Warning

File: /tmp/ai-harness-repos/everything-claude-code/README.md lines 673-682

"Each MCP tool description consumes tokens from your 200k window, potentially reducing it to ~70k."

Rules of thumb:

  • Keep under 10 MCPs enabled per project
  • Keep under 80 tools active
  • Use disabledMcpServers per project

9.5 PreCompact State Saving

File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/pre-compact.js

Before compaction:

  1. Logs compaction event with timestamp to compaction-log.txt
  2. Appends compaction marker to active session file

9.6 What Survives Compaction

From the strategic-compact skill:

| Persists | Lost |
|----------|------|
| CLAUDE.md instructions | Intermediate reasoning |
| TodoWrite task list | File contents previously read |
| Memory files | Multi-step conversation context |
| Git state | Tool call history |
| Files on disk | Nuanced verbal preferences |

Confidence: HIGH


10. Session Lifecycle and Persistence

10.1 Session Start

File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-start.js

On session start:

  1. Load recent sessions: Finds files matching *-session.tmp in ~/.claude/sessions/ (max 7 days old)
  2. Inject latest session: Outputs content to stdout for Claude to receive as context
  3. Report learned skills: Checks ~/.claude/skills/learned/ for extracted patterns
  4. List session aliases: Shows available named sessions
  5. Detect package manager: Reports detected PM and source

10.2 Session End

File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-end.js

On session end:

  1. Read transcript: Parses JSONL transcript from transcript_path (via stdin JSON)
  2. Extract summary: Collects user messages (last 10), tools used, files modified
  3. Create/update session file: Writes to ~/.claude/sessions/YYYY-MM-DD-<short-id>-session.tmp

The transcript parsing handles:

  • Direct content and nested message.content format
  • Tool use entries both direct and within assistant content blocks
  • Graceful handling of parse errors (lines 86-89)

10.3 Session File Format

# Session: 2026-02-22
**Date:** 2026-02-22
**Started:** 14:30
**Last Updated:** 16:45

---

## Session Summary

### Tasks
- Implement user authentication
- Fix build errors

### Files Modified
- src/auth/handler.ts
- src/middleware/auth.ts

### Tools Used
Edit, Bash, Read, Grep

### Stats
- Total user messages: 15

10.4 Session Management Commands

The /sessions command provides:

  • List all sessions with dates and sizes
  • Load a specific session by alias or ID
  • Search sessions by date or content

File: /tmp/ai-harness-repos/everything-claude-code/scripts/lib/session-manager.js

Key operations:

  • getAllSessions(): Paginated listing with filtering by date/search
  • getSessionById(): Lookup by short ID or filename
  • parseSessionMetadata(): Extract completed/in-progress items
  • getSessionStats(): Calculate session statistics

10.5 Continuous Learning Persistence

File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/evaluate-session.js

At session end, if the session had 10+ user messages:

  1. Signals to Claude that session should be evaluated for extractable patterns
  2. Saves learned skills to ~/.claude/skills/learned/

The v2 instinct system (/tmp/ai-harness-repos/everything-claude-code/skills/continuous-learning-v2/SKILL.md) provides more sophisticated persistence:

  • ~/.claude/homunculus/observations.jsonl: Raw session observations
  • ~/.claude/homunculus/instincts/personal/: Auto-learned instincts
  • ~/.claude/homunculus/instincts/inherited/: Imported instincts
  • ~/.claude/homunculus/evolved/: Generated agents/skills/commands

10.6 Transcript Parsing Implementation Detail

The session-end.js transcript parser is the most complex data processing in ECC. It handles the Claude Code JSONL format which has multiple entry structures:

Entry Type 1: Direct user message

{"type": "user", "content": "Fix the auth bug"}

Entry Type 2: Nested message format

{"type": "user", "message": {"role": "user", "content": [{"type": "text", "text": "Fix the auth bug"}]}}

Entry Type 3: Tool use (direct)

{"type": "tool_use", "tool_name": "Edit", "tool_input": {"file_path": "/src/auth.ts"}}

Entry Type 4: Tool use within assistant content blocks

{"type": "assistant", "message": {"content": [{"type": "tool_use", "name": "Edit", "input": {"file_path": "/src/auth.ts"}}]}}

The parser handles all four formats in a single pass:

// From session-end.js lines 48-85:
if (entry.type === 'user' || entry.role === 'user' || entry.message?.role === 'user') {
  const rawContent = entry.message?.content ?? entry.content;
  const text = typeof rawContent === 'string'
    ? rawContent
    : Array.isArray(rawContent)
      ? rawContent.map(c => (c && c.text) || '').join(' ')
      : '';
  // ...
}

Evidence: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-end.js lines 33-100

Data limits applied:

  • User messages: Last 10 kept (line 98)
  • User message text: Truncated to 200 chars each (line 57)
  • Tools used: Max 20 unique tools (line 99)
  • Files modified: Max 30 unique files (line 100)
  • Parse errors: Counted but silently skipped (lines 86-93)
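The limits above can be applied as a single trimming step. A minimal sketch, assuming a `trimSummary()` helper (the function name and field names are illustrative; the numbers come straight from session-end.js):

```javascript
// Trim an extracted session summary to the documented limits:
// last 10 messages, 200 chars each, 20 unique tools, 30 unique files.
function trimSummary(summary) {
  return {
    userMessages: summary.userMessages
      .slice(-10)                      // keep the LAST 10 messages
      .map(m => m.slice(0, 200)),      // truncate each to 200 chars
    toolsUsed: [...new Set(summary.toolsUsed)].slice(0, 20),         // dedupe, cap at 20
    filesModified: [...new Set(summary.filesModified)].slice(0, 30), // dedupe, cap at 30
  };
}
```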

10.7 Session Start Context Injection

The session-start.js hook performs a multi-step context loading sequence:

  1. Ensure directories exist: Creates ~/.claude/sessions/ and ~/.claude/skills/learned/ if missing
  2. Find recent sessions: Uses findFiles() with *-session.tmp glob, max 7 days age
  3. Inject latest session: Reads content, skips blank templates (checks for [Session context goes here]), outputs to stdout
  4. Report learned skills: Counts .md files in ~/.claude/skills/learned/
  5. List session aliases: Shows up to 5 named sessions via listAliases()
  6. Detect package manager: Calls getPackageManager() and reports name + detection source

Evidence: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-start.js lines 24-73

The distinction between log() (stderr) and output() (stdout) is critical here:

  • output() goes to stdout and becomes part of Claude's context
  • log() goes to stderr and is displayed to the user but not consumed as context

Only the previous session summary uses output(). All diagnostic messages use log().
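A minimal sketch of this split. The `output()`/`log()` names mirror the hook's helpers as described above; `injectPreviousSession()` and its strings are hypothetical:

```javascript
// output() writes to stdout, which Claude Code consumes as context;
// log() writes to stderr, which is shown to the user but not injected.
function output(text) {
  process.stdout.write(text + '\n'); // becomes part of Claude's context
}

function log(text) {
  process.stderr.write(text + '\n'); // diagnostic only, never context
}

// Mirrors step 3 above: skip blank templates, inject real summaries.
function injectPreviousSession(content) {
  if (!content || content.includes('[Session context goes here]')) {
    log('[SessionStart] No previous session context to load');
    return false;
  }
  output('## Previous session summary\n' + content);
  return true;
}
```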

10.8 Limitations

  • No database: All persistence is flat files (markdown + JSON)
  • No concurrent access protection: Multiple sessions could race on session files
  • No session resume: Sessions create new files; there's no true "continue where I left off" mechanism
  • 7-day retention: Session start only loads sessions from the last 7 days
  • No encryption: Session files stored in plaintext
  • No session merging: Parallel sessions (e.g., in worktrees) cannot be merged
  • No transcript validation: Assumes JSONL format is correct; corrupted transcripts are skipped line-by-line

Confidence: HIGH


11. Code Quality Gates

11.1 CI Pipeline

File: /tmp/ai-harness-repos/everything-claude-code/.github/workflows/ci.yml

Four CI jobs:

  1. Test (matrix: 3 OS x 3 Node x 4 PM = 36 combinations, minus the bun/windows exclusions = 33):

    • OS: ubuntu-latest, windows-latest, macos-latest
    • Node: 18.x, 20.x, 22.x
    • PM: npm, pnpm, yarn, bun
    • Runs node tests/run-all.js
  2. Validate Components:

    • validate-agents.js: Checks YAML frontmatter (model, tools required)
    • validate-hooks.js: JSON schema, valid events, inline JS syntax validation
    • validate-commands.js: Description frontmatter
    • validate-skills.js: SKILL.md presence
    • validate-rules.js: Markdown structure
  3. Security Scan:

    • npm audit --audit-level=high
    • continue-on-error: true -- warns but does not block
  4. Lint:

    • ESLint on scripts/**/*.js tests/**/*.js
    • markdownlint on all markdown files in agents, skills, commands, rules

11.2 Test Suite

File: /tmp/ai-harness-repos/everything-claude-code/tests/run-all.js

11 test files covering:

  • lib/utils.test.js: Cross-platform utility functions
  • lib/package-manager.test.js: Package manager detection
  • lib/session-manager.test.js: Session CRUD operations
  • lib/session-aliases.test.js: Session alias management
  • hooks/hooks.test.js: Hook JSON validation and regression tests
  • hooks/evaluate-session.test.js: Continuous learning evaluation
  • hooks/suggest-compact.test.js: Strategic compaction logic
  • integration/hooks.test.js: Integration testing
  • ci/validators.test.js: Validator script testing
  • scripts/setup-package-manager.test.js: PM setup testing
  • scripts/skill-create-output.test.js: Skill creation testing

The test runner parses "Passed: N" and "Failed: N" from output and exits with code 1 if any failures.
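A sketch of that parsing step. The `tallyResults()` helper and its exact regexes are assumptions based on the description; the "Passed: N" / "Failed: N" line format and the exit-on-failure behavior come from the runner itself:

```javascript
// Scan each test file's output for "Passed: N" / "Failed: N" lines,
// sum the counts, and compute the process exit code (1 on any failure).
function tallyResults(outputs) {
  let passed = 0;
  let failed = 0;
  for (const out of outputs) {
    const p = out.match(/Passed:\s*(\d+)/);
    const f = out.match(/Failed:\s*(\d+)/);
    if (p) passed += Number(p[1]);
    if (f) failed += Number(f[1]);
  }
  return { passed, failed, exitCode: failed > 0 ? 1 : 0 };
}
```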

11.3 Agent Validation Rules

File: /tmp/ai-harness-repos/everything-claude-code/scripts/ci/validate-agents.js

Required frontmatter fields: model, tools
Valid models: haiku, sonnet, opus

11.4 Hook Validation

File: /tmp/ai-harness-repos/everything-claude-code/scripts/ci/validate-hooks.js

Validates:

  • JSON parsing
  • Valid event types (PreToolUse, PostToolUse, PreCompact, SessionStart, SessionEnd, Stop, Notification, SubagentStop)
  • Matcher presence
  • Hook entry structure (type, command required)
  • Inline JavaScript syntax via vm.Script compilation (line 43-49)
  • Async/timeout field types

11.5 Rules-Based Quality Enforcement

File: /tmp/ai-harness-repos/everything-claude-code/rules/common/testing.md

Mandatory TDD workflow enforced by rules:

  1. Write test first (RED)
  2. Run test - should FAIL
  3. Write minimal implementation (GREEN)
  4. Run test - should PASS
  5. Refactor (IMPROVE)
  6. Verify coverage (80%+)

Confidence: HIGH


12. Security and Compliance Mechanisms

12.1 Security Rules

File: /tmp/ai-harness-repos/everything-claude-code/rules/common/security.md

Mandatory pre-commit checklist:

  • No hardcoded secrets
  • All user inputs validated
  • SQL injection prevention (parameterized queries)
  • XSS prevention (sanitized HTML)
  • CSRF protection
  • Authentication/authorization verified
  • Rate limiting on all endpoints
  • Error messages don't leak sensitive data

12.2 Security Reviewer Agent

File: /tmp/ai-harness-repos/everything-claude-code/agents/security-reviewer.md

Comprehensive OWASP Top 10 checklist with specific code patterns to flag:

  • Hardcoded secrets: CRITICAL
  • Shell command with user input: CRITICAL
  • String-concatenated SQL: CRITICAL
  • innerHTML = userInput: HIGH
  • fetch(userProvidedUrl): HIGH
  • No auth check on route: CRITICAL

12.3 AgentShield Integration

File: /tmp/ai-harness-repos/everything-claude-code/README.md lines 382-408

External security scanning tool:

npx ecc-agentshield scan         # Quick scan
npx ecc-agentshield scan --fix   # Auto-fix safe issues
npx ecc-agentshield scan --opus  # Three Opus agents (red team/blue team/auditor)

Scans: CLAUDE.md, settings.json, MCP configs, hooks, agent definitions, skills
Categories: secrets detection (14 patterns), permission auditing, hook injection analysis, MCP server risk profiling, agent config review

Note: AgentShield is a separate repository (affaan-m/agentshield), not included in ECC.

12.4 Hook Security

The hooks contain security measures:

  • Doc file blocker: Prevents creation of arbitrary .md files (potential for injection)
  • Command injection prevention: utils.js line 313 validates command names with regex /^[a-zA-Z0-9_.-]+$/
  • Package manager input validation: SAFE_NAME_REGEX and SAFE_ARGS_REGEX (package-manager.js lines 285-319)
  • Stdin size limits: MAX_STDIN = 1024 * 1024 in session-end.js

12.5 Input Validation Implementation Detail

The package manager module implements the most rigorous input validation in ECC. Two regex patterns form the defense:

SAFE_NAME_REGEX (/^[@a-zA-Z0-9_./-]+$/):

  • Used for script names and binary names
  • Allows: alphanumeric, @ (scoped packages like @scope/pkg), . (dotfiles), / (paths), -, _
  • Rejects: shell metacharacters ;, |, &, `, $, (, ), {, }, <, >, !
  • Applied in: getRunCommand() (line 297), getExecCommand() (line 331)

SAFE_ARGS_REGEX (/^[@a-zA-Z0-9\s_./:=,'"*+-]+$/):

  • Used for command arguments
  • More permissive: adds whitespace, =, :, ,, quotes, *
  • Still rejects: ;, |, &, `, $, (, ), {, }, <, >
  • Applied in: getExecCommand() (line 334)

Both throw Error on validation failure rather than silently stripping characters:

// From package-manager.js lines 293-299:
function getRunCommand(script, options = {}) {
  if (!script || typeof script !== 'string') {
    throw new Error('Script name must be a non-empty string');
  }
  if (!SAFE_NAME_REGEX.test(script)) {
    throw new Error(`Script name contains unsafe characters: ${script}`);
  }
  // ...
}

Evidence: /tmp/ai-harness-repos/everything-claude-code/scripts/lib/package-manager.js lines 283-339

The commandExists() function in utils.js uses a separate validation layer:

  • Validates command name with /^[a-zA-Z0-9_.-]+$/ (stricter -- no @ or /)
  • Uses spawnSync instead of execSync to avoid shell interpolation
  • Platform-aware: where on Windows, which on Unix

Evidence: /tmp/ai-harness-repos/everything-claude-code/scripts/lib/utils.js lines 311-329
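Exercising the two patterns directly shows the throw-don't-strip behavior. The regexes below are copied verbatim from the analysis above; the `validate()` helper is illustrative:

```javascript
// Patterns quoted from package-manager.js (per the analysis above).
const SAFE_NAME_REGEX = /^[@a-zA-Z0-9_./-]+$/;
const SAFE_ARGS_REGEX = /^[@a-zA-Z0-9\s_./:=,'"*+-]+$/;

// Throw on unsafe input rather than silently stripping characters.
function validate(regex, value) {
  if (!regex.test(value)) {
    throw new Error(`Unsafe characters: ${value}`);
  }
  return value;
}
```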

12.6 Security Gaps

  • No secret scanning in hooks: Hooks don't check for secrets in edited content
  • No dependency pinning enforcement: npm audit runs but does not block PRs
  • execSync usage: utils.js line 343 uses execSync with a security warning but no actual enforcement
  • No RBAC for agents: All agents in a plugin share the same permission context
  • MCP credentials in config: mcp-configs/mcp-servers.json has YOUR_*_HERE placeholders but no validation
  • No file path traversal prevention in hooks: Hook scripts do not validate that file paths in tool_input are within the project directory (only install.sh has path traversal checks)
  • Inline JS in hooks.json is unsandboxed: The node -e inline scripts run with full system access, same as external scripts

Confidence: HIGH for documented mechanisms, MEDIUM for completeness


13. Hooks, Automation Surface, and Fail-Safe Behavior

13.1 Hook Architecture

File: /tmp/ai-harness-repos/everything-claude-code/hooks/hooks.json

ECC defines hooks across 6 lifecycle events:

Event Count Purpose
PreToolUse 5 matchers Validation, blocking, suggestions
PostToolUse 5 matchers Formatting, checking, logging
PreCompact 1 matcher State preservation
SessionStart 1 matcher Context loading
Stop 1 matcher Console.log audit
SessionEnd 2 matchers State persistence, pattern extraction

13.2 Hook Execution Model

From the hooks README:

  • PreToolUse: Can block (exit 2), warn (stderr without exit 2), or pass (exit 0)
  • PostToolUse: Can analyze output but cannot block
  • Stop: Runs after each Claude response
  • SessionStart/SessionEnd: Run at session lifecycle boundaries
  • PreCompact: Runs before context compaction
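The block/warn/pass triad maps directly onto exit codes and stderr. A sketch with a hypothetical `decide()` function (returning the exit code rather than calling process.exit keeps it testable); the command patterns are taken from the matchers described in this report:

```javascript
// PreToolUse decision: 2 = block the tool call, 0 = allow.
// A stderr message with exit 0 is a non-blocking warning.
function decide(command) {
  if (/\b(npm run dev|pnpm dev|yarn dev|bun run dev)\b/.test(command)) {
    process.stderr.write('Run dev servers inside tmux instead.\n');
    return 2; // block: Claude Code cancels the tool call
  }
  if (/\bgit push\b/.test(command)) {
    process.stderr.write('Review changes before pushing.\n');
    return 0; // warn: message shown, tool call proceeds
  }
  return 0; // pass silently
}
```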

13.3 Inline vs Script Hooks

ECC uses two hook implementation patterns:

  1. Inline Node.js: node -e "..." for simple, self-contained checks

    • Example: Dev server blocker (hooks.json line 10)
    • Pros: No external file dependency
    • Cons: Hard to read, hard to test, no source maps
  2. Script files: node "${CLAUDE_PLUGIN_ROOT}/scripts/hooks/script.js" for complex logic

    • Example: Session start (hooks.json line 74)
    • Uses ${CLAUDE_PLUGIN_ROOT} variable for plugin-relative paths
    • Pros: Testable, readable, version-controlled

13.3.1 Detailed Hook-by-Hook Inventory

PreToolUse Hooks (5 total):

# Matcher Type Behavior Implementation
1 Bash Blocking (exit 2) Blocks dev servers outside tmux Inline: Regex tests for npm run dev, pnpm dev, yarn dev, bun run dev. Only active on non-Windows. Outputs tmux instructions to stderr.
2 Bash Warning (stderr) Reminds to use tmux for long-running commands Inline: Tests for npm/pnpm/yarn/bun install/test, cargo, make, docker, pytest, vitest, playwright. Only triggers if $TMUX is unset.
3 Bash Warning (stderr) Warns before git push Inline: Tests for git push in command string. Outputs reminder to review changes.
4 Write Blocking (exit 2) Blocks creation of random .md/.txt files Inline: Allows README.md, CLAUDE.md, AGENTS.md, CONTRIBUTING.md, and files in .claude/plans/. Blocks all other markdown/text file creation.
5 Edit|Write Pass-through Suggests compaction at thresholds Script: suggest-compact.js. Increments per-session counter. Suggests /compact at threshold (default 50) and every 25 calls thereafter.

PostToolUse Hooks (5 total):

# Matcher Type Behavior Implementation
1 Bash Warning (stderr) Logs PR URL after gh pr create Inline: Extracts GitHub PR URL from command output with regex. Provides gh pr review command.
2 Bash Async (background) Build analysis notification Inline: Detects build commands. Logs completion message. Runs with async: true, timeout: 30.
3 Edit Pass-through Auto-formats JS/TS files Script: post-edit-format.js. Runs Prettier on edited .js/.ts/.jsx/.tsx files.
4 Edit Pass-through TypeScript checking Script: post-edit-typecheck.js. Runs tsc --noEmit on edited .ts/.tsx files.
5 Edit Warning (stderr) Console.log detection Script: post-edit-console-warn.js. Warns if edited file contains console.log statements.

Lifecycle Hooks (5 total):

# Event Matcher Behavior Implementation
1 PreCompact * Saves state before compaction Script: pre-compact.js. Logs timestamp to compaction-log.txt. Appends compaction marker to active session file.
2 SessionStart * Loads previous context Script: session-start.js. Finds recent sessions (7 days), injects latest to context, reports learned skills and session aliases, detects package manager.
3 Stop * Console.log audit Script: check-console-log.js. Checks all git-modified files for console.log statements.
4 SessionEnd * Persists session state Script: session-end.js. Parses JSONL transcript, extracts user messages/tools/files, creates/updates session file.
5 SessionEnd * Evaluates session for patterns Script: evaluate-session.js. Counts user messages. If >= 10, signals pattern extraction. Saves to learned skills directory.

13.3.2 Stdin JSON Protocol for Hooks

All hooks that need context from Claude Code receive a JSON object on stdin. The protocol is:

// PreToolUse input:
{
  "tool_name": "Bash",       // The tool being called
  "tool_input": {
    "command": "npm run dev"  // Tool-specific input
  }
}

// PostToolUse input:
{
  "tool_name": "Edit",
  "tool_input": { "file_path": "/path/to/file.ts", ... },
  "tool_output": { "output": "..." }
}

// SessionEnd input:
{
  "transcript_path": "/path/to/session.jsonl"
}

Evidence:

  • /tmp/ai-harness-repos/everything-claude-code/hooks/README.md lines 45-67: Input format documentation
  • /tmp/ai-harness-repos/everything-claude-code/scripts/lib/utils.js lines 440-490: readStdinJson() implementation with 5s timeout and 1MB max
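The 1 MB cap can be sketched by factoring the chunk-accumulation core out of the stdin plumbing (the 5 s timeout is omitted here). `accumulate()` is a hypothetical helper; the MAX_STDIN value matches the one documented above:

```javascript
const MAX_STDIN = 1024 * 1024; // 1 MB cap, as in session-end.js

// Collect stdin chunks up to the size cap, then parse the JSON payload.
function accumulate(chunks) {
  let size = 0;
  const parts = [];
  for (const chunk of chunks) {
    size += chunk.length;
    if (size > MAX_STDIN) {
      throw new Error('stdin exceeded 1MB limit');
    }
    parts.push(chunk);
  }
  return JSON.parse(parts.join(''));
}
```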

13.3.3 The Fail-Safe Exit Pattern in Detail

Every hook script follows this exact error-handling pattern:

async function main() {
  // ... hook logic ...
  process.exit(0);
}

main().catch(err => {
  console.error('[HookName] Error:', err.message);
  process.exit(0); // Don't block on errors
});

This ensures that:

  1. Synchronous errors in main() are caught by the .catch() handler
  2. The error message goes to stderr, which Claude Code displays but does not act on
  3. Exit code 0 prevents Claude Code from treating the hook failure as a blocking event
  4. The only intentional non-zero exit is process.exit(2) in PreToolUse blocking hooks

Files demonstrating this pattern:

  • /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-start.js lines 77-80
  • /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-end.js lines 230-233
  • /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/pre-compact.js lines 45-48
  • /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/suggest-compact.js lines 77-80
  • /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/evaluate-session.js lines 37-42

13.4 Async Hooks

One hook uses async execution:

File: /tmp/ai-harness-repos/everything-claude-code/hooks/hooks.json lines 93-99

{
  "type": "command",
  "command": "node -e \"...build analysis...\"",
  "async": true,
  "timeout": 30
}

Async hooks run in background without blocking the main flow.

13.5 Fail-Safe Behavior

All hook scripts follow the same fail-safe pattern:

main().catch(err => {
  console.error('[HookName] Error:', err.message);
  process.exit(0); // Don't block on errors
});

This means hook failures are logged to stderr but never block Claude Code operation. The only exception is intentional blocking via exit code 2 in PreToolUse hooks.

13.6 Known Hook Issue: Duplicate Detection

File: /tmp/ai-harness-repos/everything-claude-code/README.md lines 443-451

Claude Code v2.1+ automatically loads hooks/hooks.json from installed plugins. Explicitly declaring hooks in plugin.json causes duplicate detection errors. This has caused repeated fix/revert cycles (#29, #52, #103); a regression test now guards against reintroducing the duplicate declaration.

13.7 Automation Surface

Total automation capabilities:

  • 5 PreToolUse hooks: 2 blocking, 3 warning
  • 5 PostToolUse hooks: Formatting, type checking, console.log warning
  • 1 PreCompact hook: State preservation
  • 1 SessionStart hook: Context loading
  • 1 Stop hook: Console.log audit
  • 2 SessionEnd hooks: State persistence + pattern extraction
  • 31 slash commands: User-triggered workflows
  • 13 agents: Delegatable specialist tasks

Confidence: HIGH


14. CLI/UX and Automation Ergonomics

14.1 Command Inventory

31 slash commands organized by category:

Planning & Architecture:

  • /plan -- Implementation planning
  • /orchestrate -- Multi-agent coordination
  • /multi-plan -- Multi-model collaborative planning
  • /multi-execute -- Multi-model collaborative execution
  • /multi-backend -- Backend multi-service orchestration
  • /multi-frontend -- Frontend multi-service orchestration
  • /multi-workflow -- General multi-service workflows

Development:

  • /tdd -- Test-driven development
  • /build-fix -- Fix build errors
  • /e2e -- Generate E2E tests
  • /refactor-clean -- Dead code removal
  • /pm2 -- PM2 service management

Review & Security:

  • /code-review -- Quality review
  • /go-review -- Go code review
  • /python-review -- Python code review
  • /verify -- Verification loop
  • /eval -- Evaluate against criteria
  • /test-coverage -- Test coverage analysis

Learning & Memory:

  • /learn -- Extract patterns from session
  • /checkpoint -- Save verification state
  • /instinct-status -- View learned instincts
  • /instinct-import -- Import instincts
  • /instinct-export -- Export instincts
  • /evolve -- Cluster instincts into skills
  • /skill-create -- Generate skills from git history

Maintenance:

  • /update-docs -- Update documentation
  • /update-codemaps -- Update codemaps
  • /sessions -- Session history management
  • /setup-pm -- Configure package manager

14.2 Command Design Pattern

Commands follow a consistent markdown template:

---
description: Brief description shown in /help
---

# Command Name

## Purpose
What this command does.

## Usage
/command-name [args]

## Workflow
1. Step 1
2. Step 2

## Output
What the user receives.

14.3 Agent Selection Guide

File: /tmp/ai-harness-repos/everything-claude-code/README.md lines 610-624

Quick reference table mapping user intent to commands and agents:

I want to... Command Agent
Plan a new feature /plan "Add auth" planner
Design system architecture /plan + architect architect
Write code with tests first /tdd tdd-guide
Review code /code-review code-reviewer
Fix a failing build /build-fix build-error-resolver

14.4 Workflow Chaining

Common workflows documented:

Starting a new feature:
/plan "Add user authentication" -> /tdd -> /code-review

Fixing a bug:
/tdd -> implement fix -> /code-review

Preparing for production:
/security-scan -> /e2e -> /test-coverage

14.5 Installation Wizard

The configure-ecc skill provides guided setup:

  • Merge/overwrite detection for existing configurations
  • Language-specific rule installation
  • Interactive configuration

Confidence: HIGH


15. Cost/Usage Visibility and Governance

15.1 Built-in Cost Monitoring

ECC relies on Claude Code's built-in /cost command for cost visibility. There is no custom cost tracking mechanism.

15.2 Token Optimization Documentation

File: /tmp/ai-harness-repos/everything-claude-code/docs/token-optimization.md

Documented strategies:

  • Model selection (Haiku/Sonnet/Opus based on task complexity)
  • Reduced thinking tokens (31,999 -> 10,000)
  • Earlier auto-compaction (95% -> 50%)
  • Cheaper subagent model (haiku)
  • MCP server management (keep under 10 enabled)

15.3 Agent Teams Cost Warning

File: /tmp/ai-harness-repos/everything-claude-code/docs/token-optimization.md lines 106-111

Explicit warning that Agent Teams spawns multiple context windows, each consuming tokens independently.

15.4 Missing Cost Governance

  • No budget limits: No mechanism to stop execution when cost exceeds threshold
  • No per-agent cost tracking: Cannot measure cost per agent invocation
  • No cost estimation: No pre-execution cost estimation for commands
  • No usage reporting: No historical usage reports or dashboards
  • No team governance: No multi-user cost allocation

Confidence: HIGH for what exists, HIGH for gaps


16. Tooling and Dependency Surface

16.1 Runtime Dependencies

Dependency Required? Purpose
Node.js >= 18 Yes All hooks and scripts
Claude Code CLI v2.1+ Yes Core runtime
Git Recommended Worktrees, diff analysis
npm/pnpm/yarn/bun One required Package management
tmux Recommended Long-running commands

16.2 Dev Dependencies

File: /tmp/ai-harness-repos/everything-claude-code/package.json lines 81-85

{
  "devDependencies": {
    "@eslint/js": "^9.39.2",
    "eslint": "^9.39.2",
    "globals": "^17.1.0",
    "markdownlint-cli": "^0.47.0"
  }
}

Minimal dependency footprint -- only ESLint and markdownlint for CI validation.

16.3 Optional Dependencies

Tool Used By Purpose
Prettier PostToolUse hook Auto-formatting JS/TS
TypeScript (tsc) PostToolUse hook Type checking
knip refactor-cleaner agent Dead code detection
depcheck refactor-cleaner agent Unused dependency detection
Playwright e2e-runner agent Browser testing
Agent Browser e2e-runner agent Preferred E2E tool
ruff Python formatting hook recipe Python formatting

16.4 External Tools Not Included

  • codeagent-wrapper: Required by multi-plan/multi-execute commands, NOT included
  • ecc-agentshield: Security scanning, separate npm package/repo
  • mcp__ace-tool__search_context: MCP tool used in multi-plan, NOT included
  • Skill Creator GitHub App: External service at skill-creator.app

16.5 Cross-Platform Support

All hooks are Node.js (no bash dependency):

  • Windows, macOS, Linux supported
  • scripts/lib/utils.js provides cross-platform utilities
  • Platform detection at lines 12-14: isWindows, isMacOS, isLinux
  • commandExists() uses where on Windows, which on Unix

File: /tmp/ai-harness-repos/everything-claude-code/hooks/README.md lines 192-193

"All hooks in this plugin use Node.js (node -e or node script.js) for maximum compatibility across Windows, macOS, and Linux."

Confidence: HIGH


17. External Integrations and Provider Compatibility

17.1 MCP Server Configurations

File: /tmp/ai-harness-repos/everything-claude-code/mcp-configs/mcp-servers.json

15 pre-configured MCP servers:

Server Type Purpose
github npx command GitHub operations
firecrawl npx command Web scraping
supabase npx command Database operations
memory npx command Persistent memory
sequential-thinking npx command Chain-of-thought
vercel HTTP Deployments
railway npx command Deployments
cloudflare-docs HTTP Documentation
cloudflare-workers-builds HTTP Worker builds
cloudflare-workers-bindings HTTP Worker bindings
cloudflare-observability HTTP Observability
clickhouse HTTP Analytics
context7 npx command Live documentation
magic npx command UI components
filesystem npx command Filesystem operations

17.2 Multi-Platform IDE Support

Platform Support Level Config Location
Claude Code Full (primary) Root directories
Cursor IDE Full (translated) .cursor/
OpenCode Full (with plugins) .opencode/

Cursor translation details (/tmp/ai-harness-repos/everything-claude-code/.cursor/README.md):

  • Rules: YAML frontmatter added, paths flattened
  • Agents: Model IDs expanded, tools -> readonly flag
  • Skills: Identical (no changes needed)
  • Commands: Path references updated, multi-* stubbed
  • MCP Config: Env interpolation syntax updated
  • Hooks: No equivalent in Cursor

17.3 OpenCode Integration

File: /tmp/ai-harness-repos/everything-claude-code/.opencode/README.md

OpenCode support includes:

  • 12 agents (vs 13 in Claude Code)
  • 24 commands (vs 31 in Claude Code)
  • 16 skills (vs 43 in Claude Code)
  • 20+ hook events (vs 8 in Claude Code)
  • 3 native custom tools (run-tests, check-coverage, security-audit)

Hook event mapping:

Claude Code OpenCode
PreToolUse tool.execute.before
PostToolUse tool.execute.after
Stop session.idle
SessionStart session.created
SessionEnd session.deleted

17.4 Multi-Model Provider Support

The multi-plan/multi-execute commands support:

  • Codex (OpenAI) -- Backend analysis authority
  • Gemini (Google) -- Frontend design authority
  • Claude (Anthropic) -- Final synthesis and code sovereignty

Trust rules: "Backend follows Codex, Frontend follows Gemini"

IMPORTANT: This multi-model support requires the external codeagent-wrapper binary.

17.5 Language Support

Skills and rules provided for:

  • TypeScript/JavaScript (primary)
  • Python (including Django)
  • Go/Golang
  • Java (Spring Boot, JPA)
  • C++ (coding standards, GoogleTest)
  • Swift (actor persistence, protocol DI)
  • Rust (example CLAUDE.md only)

Confidence: HIGH


18. Operational Assumptions and Constraints

18.1 Explicit Assumptions

  1. Claude Code CLI available: Minimum v2.1.0 required
  2. Node.js >= 18: Required for all hook scripts
  3. Git available: Recommended for session ID generation, worktrees
  4. Single user: No multi-user or team collaboration features
  5. Local execution: No remote/cloud execution support
  6. Plugin system working: Hooks auto-loading depends on Claude Code plugin convention

18.2 Implicit Assumptions

  1. Context window is 200K tokens: All optimization strategies assume this limit
  2. Claude's Task tool works as documented: Agent delegation assumes Task tool behavior
  3. File system writable: Sessions, learned skills written to ~/.claude/
  4. Stdin JSON protocol: Hooks assume Claude Code provides JSON on stdin
  5. Transcript path available: Session end hooks need transcript_path in stdin
  6. Single active session: No concurrent session management

18.3 Constraints

  1. No custom LLM runtime: Cannot use models outside Claude Code's supported set
  2. No DAG execution: Orchestration is linear pipeline or manual parallelism
  3. No cross-session state: Beyond session files, no shared state between sessions
  4. Plugin system limitations: Cannot distribute rules via plugins (upstream limitation)
  5. Hook execution limit: Hooks must complete within Claude Code's timeout
  6. Context compaction lossy: Important state may be lost during compaction

Confidence: HIGH


19. Failure Modes and Issues Observed

19.1 Documented Issues

  1. Duplicate hooks file (Issues #29, #52, #103):

    • Claude Code v2.1+ auto-loads hooks/hooks.json from plugins
    • Explicitly declaring hooks in plugin.json causes duplicate detection error
    • Fixed with regression test, but has caused repeated fix/revert cycles
  2. Instinct import content loss (Issue #148, PR #161):

    • parse_instinct_file() was dropping all content after frontmatter
    • Fixed in v1.4.1 by community contributor
  3. Windows Bun spawn limit (referenced in package-manager.js line 228):

    • Session start hooks running during Bun init could exceed spawn limit
    • Fixed by removing child process spawning from package manager detection hot path

19.2 Potential Failure Modes

  1. Hook timeout: Long-running hooks could be killed by Claude Code

    • Mitigation: All hooks use fail-safe exit(0) pattern
  2. Session file corruption: No locking mechanism for concurrent access

    • Risk: Multiple Claude instances writing to same session file
    • Mitigation: Session files use unique short IDs
  3. Compaction counter race condition: suggest-compact.js acknowledges race window

    • File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/suggest-compact.js line 39
    • "Use fd-based read+write to reduce (but not eliminate) race window"
  4. Transcript parsing failures: session-end.js handles parse errors gracefully

    • File: /tmp/ai-harness-repos/everything-claude-code/scripts/hooks/session-end.js lines 86-89
    • Skips unparseable lines, logs count
  5. MCP context explosion: Too many enabled MCPs can reduce effective context to ~70K

    • Mitigation: Documentation warns, but no automated enforcement
  6. Agent tool misuse: Agents with Bash access could execute destructive commands

    • Mitigation: Principle of least privilege in tool assignment, but no runtime sandboxing
  7. Multi-model dependency failure: codeagent-wrapper binary not found

    • Impact: multi-plan and multi-execute commands completely non-functional
    • Mitigation: None documented

19.3 Observed Robustness Patterns

  • Graceful degradation: All hooks exit 0 on error
  • Input validation: Package manager validates names and arguments
  • TOCTOU handling: Session manager wraps stat calls in try-catch for deleted files
  • Size limits: Stdin reading has 1MB cap
  • Counter clamping: Compact counter clamped to 1-1000000 range
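Two of these patterns in miniature: parse-or-default plus range clamping of the compaction counter. `clampCounter()` is a hypothetical name; the 1-1000000 range is the one documented above:

```javascript
// Parse a stored counter value; default to 1 on garbage, then clamp
// to the documented 1..1000000 range.
function clampCounter(raw) {
  const n = Number.parseInt(raw, 10);
  if (!Number.isFinite(n)) return 1; // NaN or non-numeric input
  return Math.min(Math.max(n, 1), 1000000);
}
```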

Confidence: HIGH


20. Governance and Guardrails

20.1 Agent Behavioral Guardrails

  1. Read-only agents: planner, architect -- cannot modify code
  2. Code reviewer filtering: >80% confidence threshold before reporting issues (code-reviewer.md line 27)
  3. Build error resolver constraints: "No architecture changes, only fix errors" (build-error-resolver.md lines 19-20)
  4. Code sovereignty: In multi-model execution, "All file modifications by Claude, external models have zero write access" (multi-execute.md line 12)

20.2 Quality Rules

File: /tmp/ai-harness-repos/everything-claude-code/rules/common/

  • testing.md: 80% coverage minimum, TDD mandatory
  • security.md: Pre-commit security checklist
  • coding-style.md: Immutability, file organization
  • git-workflow.md: Commit format, PR process
  • performance.md: Model selection strategy
  • patterns.md: Design patterns, API response format
  • hooks.md: Hook architecture guidelines
  • agents.md: Subagent delegation rules

20.3 CI Enforcement

All PRs must pass:

  • Component validation (agents, hooks, commands, skills, rules)
  • ESLint + markdownlint
  • Test suite across 33+ OS/Node/PM combinations
  • npm audit (warning only)

20.4 Contributing Guidelines

File: /tmp/ai-harness-repos/everything-claude-code/CONTRIBUTING.md

PR requirements:

  • Follow format guidelines
  • Tested with Claude Code
  • No sensitive info
  • Clear descriptions
  • Conventional commit format: feat(skills): add rust-patterns skill

20.5 Missing Governance

  • No rate limiting: No throttling of agent invocations
  • No audit logging: No persistent audit trail of agent actions
  • No approval workflow: No multi-person approval for configuration changes
  • No role-based access: All users have full access to all components
  • No compliance mapping: No GDPR, SOC2, or other compliance framework alignment

Confidence: HIGH


21. Roadmap/Evolution Signals

21.1 Proven and Shipped (v1.0-v1.4.1)

  • Core agent/skill/hook/command/rule system
  • Cross-platform Node.js hooks
  • Session persistence
  • Package manager detection
  • CI pipeline with multi-OS/multi-PM testing
  • Continuous learning v1 and v2 (instinct-based)
  • Multi-language rules (TS, Python, Go)
  • Cursor and OpenCode integration
  • Plugin system with marketplace support
  • i18n (Chinese simplified/traditional, Japanese)

21.2 TODO/Roadmap Claims (Not Yet Implemented)

  1. configure-ecc token optimization integration: Token optimization guide mentions future integration with the install wizard (docs/token-optimization.md line 117)

  2. Memory MCP default disabling: "The memory MCP server is configured by default but not used by any skill, agent, or hook -- consider disabling it" (docs/token-optimization.md line 101)

  3. Multi-model commands: Depend on external codeagent-wrapper binary not included in repo

21.3 Evolution Signals

  1. From bash to Node.js: All hooks migrated from bash to Node.js for cross-platform compatibility
  2. From v1 to v2 learning: Skills-based learning evolved to instinct-based with confidence scoring
  3. From single IDE to multi-IDE: Added Cursor and OpenCode support
  4. From English to i18n: Added Chinese and Japanese translations
  5. Community growth: 42K+ stars, community contributions (e.g., instinct import fix)

21.4 Missing Areas

  1. No runtime execution engine: Pure configuration, no programmatic API
  2. No persistent database: All state in flat files
  3. No real-time monitoring: No dashboard, no metrics collection
  4. No team collaboration: Single-user focused
  5. No API/SDK: Cannot be integrated into other tools programmatically
  6. No DAG execution: Linear pipeline only
  7. No rollback mechanism: No undo for agent actions beyond git
  8. No cost controls: No budget limits or spending alerts
  9. No formal state machine: Workflow state not tracked programmatically
  10. No LLM provider abstraction: Tightly coupled to Claude/Anthropic

21.5 Unresolved Issues

  1. Hook duplicate detection fragility: Despite regression test, this has recurred 3 times
  2. Multi-model dependency gap: multi-plan/multi-execute require external binary
  3. Session file concurrency: No locking mechanism
  4. Context window measurement: No programmatic way to measure current context usage

Confidence: HIGH for observations, MEDIUM for roadmap predictions


22. What Should Be Borrowed/Adapted into Maestro

22.1 STRONGLY RECOMMEND Borrowing

  1. Hook Architecture Pattern (Confidence: HIGH)

    • The six-event lifecycle (PreToolUse, PostToolUse, PreCompact, SessionStart, SessionEnd, Stop) with matcher-based filtering is well-designed
    • The fail-safe pattern (exit 0 on error) prevents hooks from breaking the main flow
    • The blocking (exit 2) vs warning (stderr) distinction is clean
    • File: /tmp/ai-harness-repos/everything-claude-code/hooks/hooks.json
  2. Session Persistence Model (Confidence: HIGH)

    • Transcript parsing at session end to extract structured summaries
    • Session start injection of previous context
    • CompactPre hook to save state before lossy compaction
    • Files: scripts/hooks/session-start.js, scripts/hooks/session-end.js
  3. Agent Tool Scoping (Confidence: HIGH)

    • Read-only agents for planning/architecture
    • Full-access agents for implementation
    • The principle of least privilege for tool assignment
    • Pattern from: Agent frontmatter tools field
  4. Strategic Compaction (Confidence: HIGH)

    • The compaction decision guide (when to compact vs not)
    • Tool call counting with configurable thresholds
    • The insight that auto-compaction at 95% is too late
    • File: skills/strategic-compact/SKILL.md
  5. CI Validation Pipeline (Confidence: HIGH)

    • Structural validation of agents, hooks, commands, skills, rules
    • Inline JS syntax checking via vm.Script
    • Multi-OS, multi-Node, multi-PM testing matrix
    • File: .github/workflows/ci.yml
  6. Code Reviewer Confidence Filtering (Confidence: HIGH)

    • 80% confidence threshold before reporting
    • Skip stylistic preferences unless violating conventions
    • Consolidate similar issues rather than listing each one
    • File: agents/code-reviewer.md lines 26-29
  7. Iterative Retrieval Pattern (Confidence: MEDIUM)

    • The 4-phase DISPATCH/EVALUATE/REFINE/LOOP pattern for context retrieval
    • Max 3 cycles, then proceed with best available
    • Relevance scoring 0-1
    • File: skills/iterative-retrieval/SKILL.md
  8. Continuous Learning v2 (Instinct Model) (Confidence: MEDIUM)

    • Atomic instincts with confidence scoring
    • Evidence-backed patterns
    • Evolution from instincts to skills/commands/agents
    • File: skills/continuous-learning-v2/SKILL.md
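To ground the first recommendation, here is a minimal TypeScript sketch of the blocking/warning/fail-safe decision logic, extracted into a pure function so it is testable (unlike the inline node -e variant criticized in 22.3). The event shape and regexes are illustrative assumptions, not ECC's actual hook code.

```typescript
// Sketch of ECC's hook outcome model (illustrative, not ECC's actual code):
//   exit 2          -> block the tool call; stderr explains why
//   exit 0 + stderr -> allow the tool call but surface a warning
//   exit 0          -> allow; also used on any internal error (fail-safe)

interface HookEvent {
  tool_name: string;
  tool_input: { command?: string };
}

interface HookOutcome {
  exitCode: 0 | 2;
  stderr?: string;
}

function evaluateHook(rawStdin: string): HookOutcome {
  try {
    const event: HookEvent = JSON.parse(rawStdin);
    const cmd = event.tool_input.command ?? "";
    if (/rm\s+-rf\s+\//.test(cmd)) {
      return { exitCode: 2, stderr: "Blocked: destructive command" };
    }
    if (/\bsudo\b/.test(cmd)) {
      return { exitCode: 0, stderr: "Warning: sudo detected" };
    }
    return { exitCode: 0 };
  } catch {
    // Fail-safe: a crashing hook must never break the main agent flow.
    return { exitCode: 0 };
  }
}
```

A real hook would wrap this in a small stdin reader and call process.exit(outcome.exitCode), keeping the logic in an external script file rather than inline JSON.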

22.2 CONSIDER Borrowing with Modifications

  1. Orchestration Pipeline (Confidence: MEDIUM)

    • The sequential agent pipeline with handoff documents is useful
    • BUT: Should add approval gates between agents
    • BUT: Should support DAG execution, not just linear
    • File: commands/orchestrate.md
  2. Token Optimization Settings (Confidence: MEDIUM)

    • Good defaults for cost reduction
    • BUT: Should add automated budget enforcement
    • File: docs/token-optimization.md
  3. Multi-Model Approach (Confidence: LOW)

    • The "backend follows Codex, frontend follows Gemini" trust model is interesting
    • BUT: Depends on external tooling not included
    • BUT: Adds complexity and cost without clear evidence of benefit
    • Files: commands/multi-plan.md, commands/multi-execute.md
  4. Package Manager Detection (Confidence: HIGH)

    • 6-level detection priority is thorough
    • Cross-platform compatibility well-handled
    • Input validation for command injection prevention
    • File: scripts/lib/package-manager.js
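The detection-priority idea in item 4 can be condensed into a sketch like the one below. Only a few of the six levels are shown, and all names are illustrative, not the actual API of scripts/lib/package-manager.js.

```typescript
// Condensed sketch of priority-ordered package manager detection.
// Earlier levels win; names are illustrative assumptions.

type PM = "npm" | "pnpm" | "yarn" | "bun";

interface DetectInput {
  envOverride?: PM;          // highest priority: explicit override
  packageJsonField?: string; // "packageManager" field, e.g. "pnpm@9.0.0"
  lockfiles: string[];       // lockfile names present in the project root
}

const LOCKFILE_MAP: Record<string, PM> = {
  "pnpm-lock.yaml": "pnpm",
  "yarn.lock": "yarn",
  "bun.lockb": "bun",
  "package-lock.json": "npm",
};

function detectPackageManager(input: DetectInput): PM {
  if (input.envOverride) return input.envOverride;
  const field = input.packageJsonField?.split("@")[0];
  if (field === "npm" || field === "pnpm" || field === "yarn" || field === "bun") {
    return field;
  }
  for (const [file, pm] of Object.entries(LOCKFILE_MAP)) {
    if (input.lockfiles.includes(file)) return pm;
  }
  return "npm"; // final fallback
}
```

The value of the priority chain is determinism: an explicit override always beats repository evidence, and a declared packageManager field beats lockfile inference.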

22.3 DO NOT Borrow

  1. Inline Node.js in JSON (Confidence: HIGH)

    • node -e "let d='';process.stdin.on('data',c=>d+=c)..." is unreadable and untestable
    • Always use external script files instead
    • Evidence: hooks/hooks.json lines 10, 20, 30, 40
  2. Flat File Persistence (Confidence: HIGH)

    • .tmp files in ~/.claude/sessions/ with no locking is fragile
    • Use a proper database or at least SQLite for session state
    • Evidence: scripts/lib/session-manager.js
  3. Markdown-Only Configuration (Confidence: MEDIUM)

    • While markdown is LLM-friendly, it lacks type safety
    • Consider typed configuration (JSON Schema, TypeScript) with markdown documentation
    • Evidence: All agents, skills, commands are untyped markdown
  4. No-Execute Plan Confirmation (Confidence: MEDIUM)

    • The /plan -> human confirms -> /execute pattern is good for safety
    • BUT: The confirmation is purely conversational, not tracked
    • Should use structured approval with audit trail
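For item 4, a structured approval could be as simple as an append-only record keyed to a hash of the plan text, so that editing the plan invalidates the approval. The following is a hypothetical sketch; none of these names exist in ECC.

```typescript
// Hypothetical structured-approval sketch (no such API exists in ECC).
// An approval is bound to a hash of the plan text, so /execute can refuse
// to run a plan that changed after it was approved.

interface ApprovalRecord {
  planId: string;
  approver: string;
  decision: "approved" | "rejected";
  timestamp: string; // ISO-8601, forms the audit trail
  planHash: string;
}

// Length-prefixed toy hash for the sketch; a real system would use SHA-256.
function hashPlan(text: string): string {
  let h = 0;
  for (const ch of text) h = (h * 31 + ch.codePointAt(0)!) >>> 0;
  return `${text.length}:${h.toString(16)}`;
}

class ApprovalLog {
  private records: ApprovalRecord[] = [];

  approve(planId: string, planText: string, approver: string): ApprovalRecord {
    const rec: ApprovalRecord = {
      planId,
      approver,
      decision: "approved",
      timestamp: new Date().toISOString(),
      planHash: hashPlan(planText),
    };
    this.records.push(rec);
    return rec;
  }

  // Called by /execute: only run if the latest approval matches this exact text.
  isExecutable(planId: string, planText: string): boolean {
    const latest = [...this.records].reverse().find(r => r.planId === planId);
    return latest?.decision === "approved" && latest.planHash === hashPlan(planText);
  }
}
```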

22.4 Key Insights for Maestro

  1. Configuration is a product: ECC's 42K stars prove that curated configs have massive value
  2. Context window management is the #1 operational concern: ECC devotes more documentation to it than to any other topic
  3. Cross-platform matters: The bash-to-Node.js migration was driven by Windows compatibility needs
  4. Community contributions matter: The instinct import bug was fixed by a community contributor
  5. Hook duplicate issues are fragile: Plugin conventions change between versions
  6. Multi-model is aspirational: Documented but dependent on external tooling
  7. The "cascade method" for manual parallelism: Simple but effective human workflow pattern
  8. Eval-driven development: The eval harness skill maps evals to "unit tests of AI development"

Confidence: HIGH for recommendations, MEDIUM for implementation specifics


23. Cross-Links

Related Analyses

  • superpowers-deep-analysis.md

    • Section: Agent orchestration model (compare sequential pipeline vs DAG)
    • Section: Context management strategy (compare compaction approaches)
    • Section: Hook system (compare event types and execution model)
  • agent-orchestrator-deep-analysis.md

    • Section: Multi-agent coordination (compare orchestrate command vs orchestrator patterns)
    • Section: Isolation model (compare worktree approach vs container isolation)
    • Section: State management (compare flat files vs persistent database)
  • maestro-deep-analysis.md

    • Section: Workflow engine (compare ECC's linear pipeline vs Maestro's DAG)
    • Section: Cost governance (compare ECC's documentation-only approach vs automated controls)
    • Section: Team collaboration (compare single-user ECC vs multi-user Maestro)
  • harness-consensus-report.md

    • Section: Shared patterns (hooks, agents, session persistence, context management)
    • Section: Divergences (execution model, isolation, governance)
    • Section: Synthesis recommendations
  • final-harness-gap-report.md

    • Section: Cost governance gap (ECC has documentation, needs automation)
    • Section: Concurrent execution gap (ECC is sequential, needs DAG support)
    • Section: State management gap (ECC uses flat files, needs database)
    • Section: Security gap (ECC has rules, needs runtime enforcement)

Appendix A: File Inventory

Core Configuration Files

File Lines Purpose
CLAUDE.md 61 Project guidance for Claude Code
hooks/hooks.json 169 All hook definitions
.claude-plugin/plugin.json 41 Plugin manifest
package.json 89 npm package configuration
install.sh 173 Installation script

Scripts

File Lines Purpose
scripts/lib/utils.js 529 Cross-platform utilities
scripts/lib/package-manager.js 431 Package manager detection
scripts/lib/session-manager.js 442 Session CRUD
scripts/lib/session-aliases.js ~100 Session aliases
scripts/hooks/session-start.js 81 Session start hook
scripts/hooks/session-end.js 235 Session end hook
scripts/hooks/pre-compact.js 49 Pre-compaction hook
scripts/hooks/suggest-compact.js 81 Compaction suggestion
scripts/hooks/evaluate-session.js 100 Continuous learning

Agents

File Model Tools
agents/planner.md opus Read, Grep, Glob
agents/architect.md opus Read, Grep, Glob
agents/code-reviewer.md sonnet Read, Grep, Glob, Bash
agents/security-reviewer.md sonnet All
agents/tdd-guide.md sonnet Read, Write, Edit, Bash, Grep
agents/build-error-resolver.md sonnet All
agents/e2e-runner.md sonnet All
agents/refactor-cleaner.md sonnet All
agents/doc-updater.md sonnet All
agents/go-reviewer.md sonnet Read, Grep, Glob, Bash
agents/go-build-resolver.md sonnet All
agents/python-reviewer.md sonnet Read, Grep, Glob, Bash
agents/database-reviewer.md sonnet Read, Grep, Glob, Bash

Skills (44 total)

Core Workflow Skills:

  • skills/strategic-compact/SKILL.md -- Context compaction strategy
  • skills/verification-loop/SKILL.md -- 6-phase quality verification (build, type, lint, test, security, diff)
  • skills/eval-harness/SKILL.md -- Eval-driven development (EDD) with pass@k and pass^k metrics
  • skills/iterative-retrieval/SKILL.md -- 4-phase context retrieval for subagents (DISPATCH/EVALUATE/REFINE/LOOP)
  • skills/continuous-learning/SKILL.md -- Pattern extraction v1
  • skills/continuous-learning-v2/SKILL.md -- Instinct-based learning v2 with confidence scoring
  • skills/configure-ecc/SKILL.md -- Installation wizard with merge/overwrite detection
  • skills/search-first/SKILL.md -- Search before asking pattern

Language-Agnostic Development Skills:

  • skills/coding-standards/SKILL.md -- Universal coding best practices
  • skills/security-review/SKILL.md -- OWASP-based security checklist
  • skills/security-scan/SKILL.md -- Security scanning automation
  • skills/tdd-workflow/SKILL.md -- TDD methodology (RED-GREEN-REFACTOR)
  • skills/e2e-testing/SKILL.md -- End-to-end testing patterns
  • skills/api-design/SKILL.md -- API design best practices
  • skills/backend-patterns/SKILL.md -- Backend architecture patterns
  • skills/frontend-patterns/SKILL.md -- Frontend development patterns
  • skills/database-migrations/SKILL.md -- Database migration strategies
  • skills/deployment-patterns/SKILL.md -- Deployment and CI/CD patterns
  • skills/docker-patterns/SKILL.md -- Docker containerization patterns
  • skills/project-guidelines-example/SKILL.md -- Example project setup

Data & Analytics Skills:

  • skills/postgres-patterns/SKILL.md -- PostgreSQL query optimization and schema design
  • skills/clickhouse-io/SKILL.md -- ClickHouse analytics patterns
  • skills/content-hash-cache-pattern/SKILL.md -- Content-addressable caching
  • skills/cost-aware-llm-pipeline/SKILL.md -- LLM pipeline cost optimization
  • skills/regex-vs-llm-structured-text/SKILL.md -- When to use regex vs LLM for text processing
  • skills/nutrient-document-processing/SKILL.md -- Document processing patterns

Python/Django Skills:

  • skills/python-patterns/SKILL.md -- Python development patterns
  • skills/python-testing/SKILL.md -- Python testing patterns
  • skills/django-patterns/SKILL.md -- Django web framework patterns
  • skills/django-security/SKILL.md -- Django security best practices
  • skills/django-tdd/SKILL.md -- Django TDD workflow
  • skills/django-verification/SKILL.md -- Django verification patterns

Go Skills:

  • skills/golang-patterns/SKILL.md -- Go development patterns
  • skills/golang-testing/SKILL.md -- Go testing patterns

Java/Spring Boot Skills:

  • skills/java-coding-standards/SKILL.md -- Java coding standards
  • skills/springboot-patterns/SKILL.md -- Spring Boot patterns
  • skills/springboot-security/SKILL.md -- Spring Boot security
  • skills/springboot-tdd/SKILL.md -- Spring Boot TDD
  • skills/springboot-verification/SKILL.md -- Spring Boot verification
  • skills/jpa-patterns/SKILL.md -- JPA/Hibernate patterns

C++/Swift Skills:

  • skills/cpp-coding-standards/SKILL.md -- C++ coding standards
  • skills/cpp-testing/SKILL.md -- C++ testing patterns
  • skills/swift-actor-persistence/SKILL.md -- Swift actor persistence patterns
  • skills/swift-protocol-di-testing/SKILL.md -- Swift protocol DI and testing

Commands (32 total)

Command File Category
/plan commands/plan.md Planning
/orchestrate commands/orchestrate.md Planning
/multi-plan commands/multi-plan.md Planning
/multi-execute commands/multi-execute.md Planning
/multi-backend commands/multi-backend.md Planning
/multi-frontend commands/multi-frontend.md Planning
/multi-workflow commands/multi-workflow.md Planning
/tdd commands/tdd.md Development
/build-fix commands/build-fix.md Development
/e2e commands/e2e.md Development
/refactor-clean commands/refactor-clean.md Development
/pm2 commands/pm2.md Development
/code-review commands/code-review.md Review
/go-review commands/go-review.md Review
/go-build commands/go-build.md Review
/go-test commands/go-test.md Review
/python-review commands/python-review.md Review
/verify commands/verify.md Review
/eval commands/eval.md Review
/test-coverage commands/test-coverage.md Review
/learn commands/learn.md Learning
/learn-eval commands/learn-eval.md Learning
/checkpoint commands/checkpoint.md Learning
/instinct-status commands/instinct-status.md Learning
/instinct-import commands/instinct-import.md Learning
/instinct-export commands/instinct-export.md Learning
/evolve commands/evolve.md Learning
/skill-create commands/skill-create.md Learning
/update-docs commands/update-docs.md Maintenance
/update-codemaps commands/update-codemaps.md Maintenance
/sessions commands/sessions.md Maintenance
/setup-pm commands/setup-pm.md Maintenance

Rules

Directory Files Focus
rules/common/ 8 files agents.md, coding-style.md, git-workflow.md, hooks.md, patterns.md, performance.md, security.md, testing.md
rules/typescript/ 5 files coding-style.md, hooks.md, patterns.md, security.md, testing.md
rules/python/ 5 files coding-style.md, hooks.md, patterns.md, security.md, testing.md
rules/golang/ 5 files coding-style.md, hooks.md, patterns.md, security.md, testing.md
rules/README.md 1 file Rules structure documentation

Total: 23 rule files across 4 language groups, plus the README (24 files).

Documentation

File Lines Purpose
README.md 1033 Main project documentation
the-shortform-guide.md 431 Setup guide (skills, hooks, subagents, MCPs, plugins)
the-longform-guide.md 355 Advanced patterns (token economics, memory, parallelization)
CONTRIBUTING.md 425 Contribution guidelines with PR templates
hooks/README.md 199 Hook system documentation
docs/token-optimization.md 137 Token optimization guide
.claude-plugin/README.md 6 Plugin manifest gotchas
.opencode/README.md 173 OpenCode integration documentation
rules/README.md ~50 Rules structure documentation

CI/CD and Test Files

File Lines Purpose
.github/workflows/ci.yml 219 4-job CI pipeline (test, validate, security, lint)
tests/run-all.js 81 Test runner (11 test files)
scripts/ci/validate-agents.js 82 Agent frontmatter validation
scripts/ci/validate-hooks.js 149 Hook JSON schema and JS syntax validation
scripts/ci/validate-commands.js ~80 Command frontmatter validation
scripts/ci/validate-skills.js ~80 Skill SKILL.md presence validation
scripts/ci/validate-rules.js ~80 Rule markdown structure validation

Configuration Files

File Lines Purpose
schemas/hooks.schema.json 101 JSON Schema for hooks configuration
mcp-configs/mcp-servers.json 92 14 pre-configured MCP server definitions
contexts/dev.md 21 Development mode system prompt
contexts/review.md 23 Review mode system prompt
contexts/research.md ~20 Research/exploration mode system prompt

Integration Files

File Lines Purpose
.opencode/agents/ 12 agents OpenCode agent definitions
.opencode/commands/ 24 commands OpenCode command definitions
.opencode/skills/ 16 skills OpenCode skill definitions
.opencode/hooks/ 20+ events OpenCode hook definitions
.opencode/README.md 173 OpenCode integration documentation

Appendix B: Confidence Score Summary

Section Confidence Rationale
Design Philosophy HIGH Extensive documentation, clear patterns
Core Architecture HIGH All code read, well-organized
Harness Workflow HIGH (basic), MEDIUM (multi-model) Basic orchestration clear; multi-model depends on external tools
Subagent Orchestration HIGH All agent files analyzed
Parallelization MEDIUM Well-documented but not code-enforced
Isolation Model HIGH Simple model, well-understood
Human-in-the-Loop HIGH Clear patterns, some gaps identified
Context Handling HIGH Most documented aspect of project
Session Persistence HIGH All scripts analyzed
Code Quality Gates HIGH CI pipeline + tests analyzed
Security HIGH (documented), MEDIUM (enforcement) Rules strong, runtime enforcement weak
Hooks HIGH All hooks analyzed in detail
CLI/UX HIGH All 32 commands cataloged
Cost Governance HIGH for gaps Documentation-only, no automation
Tooling HIGH Full dependency inventory
External Integrations HIGH All MCP configs + IDE support analyzed
Operational Assumptions HIGH Explicit and implicit documented
Failure Modes HIGH Documented issues + predicted modes
Governance HIGH Rules and CI analyzed
Roadmap MEDIUM Based on signal interpretation
Maestro Recommendations HIGH (what), MEDIUM (how) Clear recommendations, implementation TBD

Appendix C: Version History

Version Date Key Changes
v1.4.1 Feb 2026 Fixed instinct import content loss
v1.4.0 Feb 2026 Multi-language rules, installation wizard, PM2, multi-agent commands
v1.3.0 Feb 2026 Full OpenCode integration
v1.2.0 Feb 2026 Python/Django, Java Spring Boot, session management, continuous learning v2

End of analysis. Total files analyzed: 60+ across all directories. All file paths are absolute references to /tmp/ai-harness-repos/everything-claude-code/.

Final Harness Gap Report: Maestro vs. the Canonical Feature Set

Report Date: 2026-02-22
Analyst: Claude Opus 4.6
Scope: Gap analysis of RunMaestro/Maestro against best-in-class features from obra/superpowers, affaan-m/everything-claude-code, and ComposioHQ/agent-orchestrator
Source Reports:

  • maestro-deep-analysis.md (2005 lines)
  • superpowers-deep-analysis.md (2005 lines)
  • everything-claude-code-deep-analysis.md (2141 lines)
  • agent-orchestrator-deep-analysis.md (2806 lines)

Executive Summary

Maestro is the most ambitious and fully-realized project in the comparison set -- a cross-platform Electron desktop application with CLI, mobile PWA, multi-provider agent support, SQLite-backed analytics, and Group Chat orchestration. It is the only project with a runtime execution engine (ProcessManager), a desktop GUI, and multi-provider support (4 active agents, 3 planned). Its architecture is sound, its codebase is large (672K lines of TypeScript), and its feature set is broad.

However, Maestro has critical gaps in three areas that the other projects address:

  1. No automated quality gates in the execution loop (Critical). Maestro's Auto Run processes checkbox tasks sequentially but never runs tests, lints, or code review between steps. Superpowers' two-stage code review and ECC's six-phase verification loop are both superior here.

  2. No cost governance enforcement (High). Maestro tracks costs in its SQLite dashboard but provides no budgets, alerts, or automatic pause when spending exceeds thresholds. All three comparison projects share this gap, but Maestro -- as the only project with runtime cost data -- is uniquely positioned to solve it.

  3. No reaction engine or lifecycle state machine (High). Agent Orchestrator's 16-status state machine with configurable reactions (event -> action, with retries and escalation) is a fundamentally more sophisticated approach to agent lifecycle management than Maestro's binary "busy/idle" model.

Priority recommendation: Phase 1 should focus on quality gates (leveraging Superpowers' prompt patterns within Maestro's runtime enforcement), cost budgets, and a basic reaction engine. These three changes would close the most impactful gaps with moderate effort.


1. Maestro's Current Strengths

Maestro leads the comparison set in several areas where it should preserve and double down on its advantage.

1.1 Multi-Provider Agent Support (Unmatched)

No other project supports 4+ AI coding agent CLIs through a unified interface. Maestro's declarative agent definition architecture (AgentConfig with binaryName, args, batchModeArgs, resumeArgs, etc.) and output parser registry pattern allow adding new agents without modifying core logic.

Source: maestro-deep-analysis.md, Section 17.1-17.3 (Provider Architecture), Section 3.3 (Output Parser Architecture)

Evidence: Agent definitions in src/main/agents/definitions.ts (367 lines), 4 output parser implementations, 5 error pattern sets with ~100 individual patterns.
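As a sketch of what that declarative pattern enables — using the field names quoted above (binaryName, args, batchModeArgs, resumeArgs), with the flag values and registry shape assumed rather than taken from Maestro's definitions.ts — adding a provider becomes a data change, not a code change:

```typescript
// Sketch of the declarative agent-definition pattern. Field names follow
// the ones quoted in the analysis; everything else is an assumption.

interface AgentConfig {
  id: string;
  binaryName: string;
  args: string[];                              // interactive invocation
  batchModeArgs: string[];                     // non-interactive (Auto Run)
  resumeArgs: (sessionId: string) => string[]; // provider-specific resume
}

const AGENTS: Record<string, AgentConfig> = {
  "claude-code": {
    id: "claude-code",
    binaryName: "claude",
    args: [],
    batchModeArgs: ["-p"],
    resumeArgs: (id) => ["--resume", id],
  },
  // A new provider is one more entry here; core spawn logic is untouched.
};

function buildCommand(
  agentId: string,
  opts: { batch?: boolean; resume?: string },
): string[] {
  const cfg = AGENTS[agentId];
  if (!cfg) throw new Error(`unknown agent: ${agentId}`);
  const extra = opts.resume
    ? cfg.resumeArgs(opts.resume)
    : opts.batch
      ? cfg.batchModeArgs
      : cfg.args;
  return [cfg.binaryName, ...extra];
}
```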

1.2 Desktop Application with Keyboard-First Design (Unmatched)

Superpowers is invisible (CLI-only), ECC is configuration-only, and Agent Orchestrator has a basic Next.js dashboard. Maestro provides a full Electron desktop app with 30+ keyboard shortcuts, a Layer Stack modal system with 30+ priority levels, ARIA accessibility, and a comprehensive theme system (16 themes). No other project approaches this level of UI sophistication.

Source: maestro-deep-analysis.md, Section 14 (CLI/UX and Automation Ergonomics)

1.3 Group Chat with Moderator AI (Unmatched)

Maestro's Group Chat system is the most sophisticated multi-agent coordination mechanism in the comparison set. The moderator-agent pattern (user message -> moderator routing -> parallel agent work -> synthesis round -> optional follow-up loop) is architecturally superior to Agent Orchestrator's orchestrator-as-CLI-user approach and Superpowers' sequential-subagent-only model.

Source: maestro-deep-analysis.md, Section 5.1 (Group Chat System), files: src/main/group-chat/group-chat-moderator.ts (290 lines), group-chat-agent.ts (429 lines), group-chat-router.ts

1.4 SQLite Analytics and Usage Dashboard (Best-in-Class)

Maestro's StatsDB system (833 lines of SQLite management code) with daily backups, corruption recovery, WAL mode, integrity checking, and migration tracking is production-grade. The Usage Dashboard provides summary cards, agent comparison charts, activity heatmaps, and CSV export. No other project has anything comparable.

Source: maestro-deep-analysis.md, Section 15.2-15.3 (Usage Dashboard, Stats Database Architecture)

1.5 Session Discovery and Resume Across Providers (Best-in-Class)

Maestro can discover existing sessions from Claude Code, Codex, OpenCode, and Factory Droid session storage directories, and resume any of them with provider-specific flags. This cross-provider session management is unique.

Source: maestro-deep-analysis.md, Section 10.3-10.4 (Session Discovery, Session Resume)

1.6 Mobile Remote Control (Unmatched)

PWA with WebSocket + Cloudflare tunnel, voice input, swipe gestures, offline queue, and push notifications. No other project has any mobile capability.

Source: maestro-deep-analysis.md, Section 14.2 (Mobile UX)

1.7 Error Pattern System (Best-in-Class)

1015 lines of regex-based error detection covering 7 error types across 4 agents plus SSH, with recoverability flags and dynamic error messages. This is the most comprehensive error detection system in the comparison set.

Source: maestro-deep-analysis.md, Section 3.4 (Error Pattern System)

1.8 Symphony Community Contribution Platform (Unique)

No other project has a mechanism for community-driven open source contribution through the tool itself. Symphony's registry + Auto Run + PR creation pipeline is novel.

Source: maestro-deep-analysis.md, Section 5.2 (Symphony Orchestration)


2. Gap Analysis Matrix

Feature Area Maestro Status Best-in-Class Project Gap Severity Source
Orchestration quality gates No automated verification between Auto Run tasks Superpowers: Two-stage code review (spec compliance + quality) after EACH task with review loops Critical superpowers S11.1-11.3, maestro S4.6
Verification pipeline None (agent self-reports completion) ECC: 6-phase verification loop (build, type, lint, test, security, diff) with structured PASS/FAIL report Critical ecc S4.5, maestro S4.6
Session lifecycle state machine Binary busy/idle with color-coded dots AO: 16-state machine (spawning -> working -> pr_open -> ci_failed -> review_pending -> approved -> mergeable -> merged -> done) High ao S9.1-9.2, maestro S10.5
Reaction engine None AO: Configurable event->action rules with retries, escalation, conditions, and time-based triggers High ao S12.1-12.6, maestro S13.5
Anti-rationalization engineering None (agents run prompts as-given) Superpowers: 40+ rationalization prevention entries, red flag lists, gate functions, pressure testing methodology High superpowers S11.5-11.6, maestro S4.5
Cost governance (budgets/limits) Tracking only (SQLite dashboard) None has enforcement, but Maestro has the data infrastructure to build it High maestro S15.5, all projects lack enforcement
Security scanning in CI No SAST, no dependency audit in CI AO: Gitleaks (full history), dependency-review-action (moderate+), pnpm audit (high/prod). ECC: npm audit High ao S11.3, ecc S11.1, maestro S11.8
CI testing before release Release workflow only builds, does not run tests ECC: 33-combination CI matrix (3 OS x 3 Node x 4 PM), component validation, ESLint, markdownlint. AO: lint + typecheck + test High ecc S11.1, ao S10.1, maestro S11.8
Hooks lifecycle system No hook system for agent tools ECC: 6-event hook lifecycle (PreToolUse, PostToolUse, PreCompact, SessionStart, SessionEnd, Stop) with blocking/warning/pass-through modes Medium ecc S13.1-13.3, maestro S13
Context compaction automation Manual only (user triggers) ECC: Tool-call-counting hook with configurable thresholds + phase-transition compaction guide. Superpowers: Auto re-inject on compact event Medium ecc S9.2, superpowers S9.5, maestro S9.6
Agent tool scoping All agents run with full privileges (YOLO mode) ECC: Read-only agents for planning (tools: Read, Grep, Glob), full-access for implementation. Principle of least privilege Medium ecc S5.2, maestro S17.3
Notification system Desktop notifications only (no external channels) AO: Desktop + Slack (Block Kit) + Composio + Webhook, with priority-based routing Medium ao S7.4, maestro S13
Issue tracker integration None (manual task creation via Auto Run docs) AO: GitHub Issues + Linear trackers with issue-to-branch-to-PR pipeline Medium ao S16.1-16.2, maestro S21.4
Session persistence across crashes In-memory Group Chat state lost on crash. Electron-store with 2s debounce for settings AO: Flat-file metadata survives crashes. Session restoration from archive with agent-specific resume Medium ao S9.4-9.5, maestro S10.2, S19.5
Plugin architecture Encore Features (feature gating, not a full plugin system) AO: 8-slot plugin architecture with typed PluginManifest, PluginModule, and registry pattern Medium ao S2.2-2.3, maestro S21.3
REST/webhook API No external API (CLI + IPC only) AO: Next.js API routes for sessions, events (SSE), sends, kills, merges, restores Medium ao S7.2, maestro S13.5
Continuous learning None ECC: Instinct system with confidence scoring, session evaluation, pattern extraction, evolved skills/commands/agents Low ecc S10.5, maestro not present
Multi-language rules/skills Provider-agnostic (no language-specific prompts) ECC: Rules for TS, Python, Go, Java, C++, Swift. 44 skills covering language-specific patterns Low ecc Appendix A Skills, maestro S4.5
Web dashboard for monitoring Mobile PWA (read/control) AO: Kanban-style dashboard with attention levels, dynamic favicon, SSE real-time updates Low ao S7.1, maestro S14.2
Provider extensibility (local models) Only CLI agents, no direct API support ECC: Planned multi-model (Codex, Gemini). Superpowers: Codex, OpenCode mapping Low ecc S17.4, maestro S17.6

3. Detailed Gap Descriptions

3.1 GAP: No Automated Quality Gates in Auto Run (Critical)

Current state in Maestro:

The Auto Run batch processor (maestro S4.4, file: src/cli/services/batch-processor.ts) processes checkbox tasks sequentially. For each document, it reads unchecked tasks, constructs a prompt, spawns the AI agent, parses the response for checked tasks, and moves to the next. There is NO verification step between tasks.

From maestro S4.6: "There is NO automatic verification layer (no test runner, no linter integration, no code review step). The verification is the agent's own assessment that it completed the work. ... The agent could check off a task without actually completing it successfully."

What best-in-class does:

Superpowers (subagent-driven-development): After EACH task, two sequential review subagents are dispatched:

  1. Spec Compliance Reviewer -- explicitly told "The implementer finished suspiciously quickly. Their report may be incomplete, inaccurate, or optimistic." Must NOT trust the implementer's report. Reads actual code and compares to requirements line by line.
  2. Code Quality Reviewer -- dispatched ONLY after spec compliance passes. Reviews code quality, architecture, testing, production readiness. Issues categorized Critical/Important/Minor.

Both reviews are loops -- if issues found, implementer fixes, reviewer re-reviews until approved. (superpowers S4.5, S5.4)

ECC (verification-loop): Six verification phases with structured PASS/FAIL output:

  1. Build Verification (npm run build)
  2. Type Check (tsc --noEmit)
  3. Lint Check (npm run lint)
  4. Test Suite (with coverage, target 80%)
  5. Security Scan (grep for secrets and console.log)
  6. Diff Review (review changed files)

Output is a structured VERIFICATION REPORT with READY/NOT READY verdict. (ecc S4.5)

What Maestro should implement:

A configurable quality gate system that runs between Auto Run tasks. The gate should support:

  • Built-in gates: test runner, linter, type checker, security scanner
  • Custom gates: user-defined commands that must exit 0 to proceed
  • Review gates: dispatch a review subagent (using Superpowers' skepticism pattern)
  • Failure behavior: pause (wait for human), retry (send error to agent), skip (log and continue), abort (stop batch)

Implementation complexity: Medium. The batch processor already has the sequential processing loop. Adding gate hooks between task iterations requires:

  • A QualityGate interface with run(context): Promise<GateResult>
  • Gate configuration in Playbook definitions
  • Integration with the existing execution queue

Dependencies: Benefits from reaction engine (Gap 3.2) for failure handling.


3.2 GAP: No Reaction Engine or Lifecycle State Machine (High)

Current state in Maestro:

Maestro tracks agent state as color-coded dots: green (ready/idle), yellow (thinking/busy), red (no connection/error), pulsing orange (connecting). From maestro S10.5. The agentError and agentErrorPaused fields handle error states, but there is no formal state machine with defined transitions, and no configurable reactions to state changes.

When agents encounter errors, a modal appears requiring manual user acknowledgment (maestro S8.4). There is no automated response to events like CI failure, rate limiting, or context exhaustion.

What best-in-class does:

Agent Orchestrator implements a 16-state machine (ao S9.1) with a reaction engine (ao S12.1-12.6):

spawning -> working -> pr_open -> ci_failed/review_pending
-> changes_requested/approved -> mergeable -> merged -> done

The reaction engine maps events to configurable actions with retries and time-based escalation:

reactions:
  ci-failed:
    trigger: ci.failing
    action: send-to-agent
    retries: 2
    escalation:
      action: notify
      after: "10m"
      priority: critical

33 distinct event types trigger reactions. Default reactions cover CI failure, code review feedback, merge conflicts, stuck agents, and agent exits.

What Maestro should implement:

A SessionStateMachine class that tracks each agent's lifecycle through defined states, and a ReactionEngine that maps state transitions to configurable actions. Given Maestro already has rich error detection (error-patterns.ts) and event emission (ProcessManager EventEmitter), this is a natural extension.

Implementation complexity: Medium-High. Requires:

  • State machine definition with valid transitions
  • Reaction configuration format (YAML or JSON in Playbooks)
  • Action executors (send message, pause, restart, notify, escalate)
  • Reaction history logging in StatsDB
  • UI for viewing/editing reaction rules

Dependencies: None. This is foundational infrastructure.
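
To make the transition diagram above concrete, here is a minimal sketch of such a state machine. The state names follow AO's model; the SessionStateMachine class and its method names are hypothetical Maestro additions, and a real implementation would add more states, persistence, and transition logging:

```typescript
// Minimal lifecycle state machine sketch (state vocabulary from AO's model;
// the class itself is a hypothetical Maestro addition, not existing code).
type SessionState =
  | "spawning" | "working" | "pr_open" | "ci_failed" | "review_pending"
  | "changes_requested" | "approved" | "mergeable" | "merged" | "done";

const transitions: Record<SessionState, SessionState[]> = {
  spawning: ["working"],
  working: ["pr_open"],
  pr_open: ["ci_failed", "review_pending"],
  ci_failed: ["working"],                        // agent fixes CI, resumes work
  review_pending: ["changes_requested", "approved"],
  changes_requested: ["working"],
  approved: ["mergeable"],
  mergeable: ["merged"],
  merged: ["done"],
  done: [],
};

class SessionStateMachine {
  constructor(private state: SessionState = "spawning") {}
  get current(): SessionState { return this.state; }
  transition(next: SessionState): boolean {
    // Reject transitions not in the table rather than throwing, so the
    // caller (e.g. a reaction engine) can decide how to handle them.
    if (!transitions[this.state].includes(next)) return false;
    this.state = next;
    return true;
  }
}
```

An explicit transition table like this also gives the reaction engine a natural hook point: every accepted transition is an event that rules can subscribe to.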


3.3 GAP: No Anti-Rationalization Engineering in Prompts (High)

Current state in Maestro:

Maestro bundles 24 system prompts as markdown files (maestro S4.5). These prompts tell agents what to do but do not address the well-documented problem of agents rationalizing around constraints. The autorun-default.md prompt instructs the agent to process checkbox tasks, but does not include rationalization prevention tables, red flag lists, or gate functions.

From maestro S2.3: CLAUDE.md contains behavioral guidelines like "Surface Assumptions Early" and "Push Back When Warranted," but these are meta-guidelines for Maestro's own development, not runtime behavioral controls for orchestrated agents.

What best-in-class does:

Superpowers has invested more iteration into anti-rationalization engineering than any other project (superpowers S11.5-11.6):

  • 40+ rationalization prevention entries across all skills, mapping specific agent excuses to correct responses
  • Red flag lists: 12 entries in using-superpowers, 12 in TDD, 8 in verification
  • Gate functions: Explicit decision trees before actions (IDENTIFY -> RUN -> READ -> VERIFY -> CLAIM)
  • Pressure testing methodology: 7 pressure types (time, sunk cost, authority, economic, exhaustion, social, pragmatic) used to validate that quality gates actually work under stress
  • Persuasion principles: Academic foundation (Cialdini 2021) applied to prompt design

Key insight from superpowers S2.3: Skill descriptions containing workflow summaries cause agents to follow the short description instead of reading full skill content (the "Description Trap"). This is directly applicable to Maestro's prompt templates.

What Maestro should implement:

Incorporate anti-rationalization patterns into Maestro's system prompts, particularly:

  • Rationalization tables in autorun-default.md for common task-skipping excuses
  • Verification-before-completion gate function in the Auto Run prompt
  • Skepticism pattern in any review prompts ("The agent finished suspiciously quickly")
  • Description-only triggers (no workflow summaries) in any prompt routing metadata

Implementation complexity: Low. This requires only prompt text changes, not code changes. However, proper validation requires the TDD-for-prompts methodology (writing pressure tests to verify the prompts actually prevent rationalization), which is Medium effort.

Dependencies: None.


3.4 GAP: No Cost Governance Enforcement (High)

Current state in Maestro:

Maestro tracks costs comprehensively in its SQLite database (maestro S15.1-15.3): per-session token usage with inputTokens, outputTokens, cacheReadInputTokens, cacheCreationInputTokens, totalCostUsd, and contextWindow. The Usage Dashboard provides summary cards, agent comparison charts, and CSV export.

However, from maestro S15.5: "No cost budgets or limits (tracking only, no enforcement). No alerts when spending exceeds thresholds. No per-playbook cost attribution. No team/organization cost aggregation."

What best-in-class does:

No project in the comparison set implements cost enforcement. All four projects share this gap. However:

  • Superpowers documents cost awareness in skill text (superpowers S15.2) and provides post-hoc cost analysis via analyze-token-usage.py
  • ECC documents token optimization settings (ecc S9.1) but relies on Claude's built-in /cost command
  • AO extracts cost from JSONL but does not display it in CLI or dashboard (ao S14.3)

What Maestro should implement:

Since Maestro already has the data infrastructure (StatsDB with real-time cost tracking), it should add:

  • Per-playbook budget limits in Playbook configuration
  • Per-agent session budget limits in agent configuration
  • Cost threshold alerts (notification when 80% of budget consumed)
  • Automatic pause when budget exceeded (with option to override)
  • Per-task cost attribution (extend auto_run_tasks table with cost columns)
  • Cost estimation before playbook execution (based on historical data)

Implementation complexity: Medium. The data layer exists; this requires:

  • Budget fields in Playbook and AgentConfig interfaces
  • Cost checking in the batch processor loop
  • Alert/pause integration with the execution queue
  • UI for budget configuration and alerts

Dependencies: None. Can be built on existing StatsDB infrastructure.


3.5 GAP: No Security Scanning in CI (High)

Current state in Maestro:

From maestro S11.8: "No linting or testing in CI before release (the release workflow only builds, doesn't run tests). No required CI checks before merge. No code coverage thresholds. No security scanning (no SAST, no dependency audit in CI)."

The release workflow (release.yml, 782 lines) focuses entirely on cross-platform build correctness and native module architecture verification.

What best-in-class does:

Agent Orchestrator has the most comprehensive CI security (ao S11.3):

  • Gitleaks with --full-history (scans entire git history for secrets)
  • dependency-review-action failing on moderate+ vulnerabilities
  • pnpm audit at high severity for production dependencies

ECC has structural validation plus security scanning (ecc S11.1):

  • 33-combination test matrix (3 OS x 3 Node x 4 PM)
  • Component validation (agents, hooks, commands, skills, rules)
  • npm audit (warning-only, not blocking)
  • ESLint + markdownlint

What Maestro should implement:

Add CI pipeline stages before the release build:

  1. TypeScript type checking (tsc --noEmit for all 3 tsconfig files)
  2. ESLint + Prettier check (already configured locally via Husky, not in CI)
  3. Vitest test suite (all 4 configurations: unit, integration, e2e, performance)
  4. Secret scanning (Gitleaks or TruffleHog)
  5. Dependency vulnerability audit (npm audit --audit-level=high)
  6. Code coverage threshold (minimum 60% for critical paths)

Implementation complexity: Low-Medium. Most tooling is already configured locally; it just needs CI integration.

Dependencies: None.


3.6 GAP: No Hook System for Agent Tool Use (Medium)

Current state in Maestro:

Maestro has no hook system that intercepts agent tool calls before or after execution. The ProcessManager emits events (data, stderr, exit, usage, agent-error, etc.) but these are observation-only. There is no mechanism to block or modify agent operations.

What best-in-class does:

ECC implements a comprehensive hook lifecycle (ecc S13.1-13.3):

  • PreToolUse (5 hooks): Can block (exit 2), warn (stderr), or pass. Examples: block dev server outside tmux, warn before git push, block creation of random .md files
  • PostToolUse (5 hooks): Auto-format with Prettier, TypeScript checking, console.log detection
  • PreCompact: Save state before compaction
  • SessionStart/End: Context loading, session persistence, pattern extraction
  • Stop: Console.log audit

The fail-safe pattern (exit 0 on error) ensures hooks never break the main flow. The blocking vs. warning distinction is clean and well-tested.

What Maestro should implement:

Since Maestro already spawns AI agents as child processes and parses their output (via the output parser architecture), it should add:

  • Pre/post tool execution hooks that run before/after the agent processes tool calls
  • Configurable hook definitions (per-playbook or per-agent)
  • Hook types: blocking (prevent operation), warning (log but continue), enrichment (add context)
  • Built-in hooks: auto-format, type check, security scan, diff review

Implementation complexity: High. The hook system requires intercepting and potentially blocking agent tool execution, which may require modification to how Maestro interacts with agent processes. Since Maestro uses pass-through process management (not API-level agent control), this is architecturally constrained.

Alternative approach (Medium): Instead of intercepting tool calls in real-time, implement post-execution hooks that run after each agent query completes (leveraging the existing query-complete event). This is achievable within the current architecture.

Dependencies: Quality gates (Gap 3.1) can be implemented as post-execution hooks.
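
ECC's exit-code convention (exit 0 = allow, exit 2 = block, stderr with exit 0 = warn) can be illustrated with a small verdict function. The helper name and the example rules below are illustrative assumptions, not ECC's actual hook code:

```typescript
// Sketch of ECC's PreToolUse exit-code convention as a pure function.
// exit 0 = allow, exit 2 = block, stderr text with exit 0 = warn.
// The rules here mirror two ECC examples; names are assumptions.
type HookVerdict = { exitCode: 0 | 2; stderr?: string };

function preToolUseVerdict(toolName: string, command: string): HookVerdict {
  // Block long-running dev servers outside tmux (ECC example).
  if (toolName === "Bash" && /(npm run dev|vite|next dev)/.test(command)) {
    return { exitCode: 2, stderr: "Blocked: start dev servers inside tmux." };
  }
  // Warn (but allow) before git push (ECC example).
  if (toolName === "Bash" && /\bgit push\b/.test(command)) {
    return { exitCode: 0, stderr: "Warning: pushing to a remote." };
  }
  return { exitCode: 0 };
}
```

Keeping the decision logic in a pure function like this (with a thin process wrapper that prints stderr and calls process.exit) is what makes ECC-style hooks testable, unlike inline node -e snippets.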


3.7 GAP: No Issue Tracker Integration (Medium)

Current state in Maestro:

Maestro's workflow starts with manually created Auto Run documents. There is no integration with issue trackers (GitHub Issues, Linear, Jira) to automatically create tasks from issues or track which issues are being worked on.

Symphony provides a limited form of issue integration (fetching GitHub Issues with runmaestro.ai labels for community contributions), but this is for the Symphony contribution pipeline, not general development workflows.

What best-in-class does:

Agent Orchestrator has deep issue tracker integration (ao S16.1-16.2):

  • GitHub Issues tracker (304 lines): Issue CRUD, listing with filters, branch name generation from issue numbers, prompt generation from issue content
  • Linear tracker (722 lines): Dual transport (direct API or Composio SDK), state mapping, full GraphQL API, issue/label/team/workflow operations
  • Automatic pipeline: Issue -> Workspace -> Agent -> PR -> Review -> Merge -> Cleanup

What Maestro should implement:

Add an issue tracker plugin system that:

  • Fetches issues from GitHub/Linear/Jira
  • Auto-generates Auto Run documents from issue descriptions
  • Links agent sessions to issues for tracking
  • Updates issue status as work progresses
  • Creates PRs with issue references

Implementation complexity: Medium-High. Requires new IPC handlers, UI components for issue browsing, and integration with the Auto Run document system.

Dependencies: Benefits from reaction engine (Gap 3.2) for automated status updates.
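
A minimal sketch of what such a tracker plugin could look like, with a helper that turns an issue into an Auto Run checkbox document. All names here are assumptions for illustration, not existing Maestro or AO APIs:

```typescript
// Hypothetical issue tracker plugin interface for Maestro (all names assumed).
interface Issue {
  id: string;
  title: string;
  body: string;
  labels: string[];
  url: string;
}

interface IssueTracker {
  name: "github" | "linear" | "jira";
  listIssues(filter: { label?: string; assignee?: string }): Promise<Issue[]>;
  updateStatus(issueId: string, status: "in_progress" | "in_review" | "done"): Promise<void>;
}

// Auto-generate an Auto Run document from an issue: title becomes the
// heading, the first body line becomes the initial checkbox task.
function issueToAutoRunDoc(issue: Issue): string {
  return [
    `# ${issue.title}`,
    ``,
    `Issue: ${issue.url}`,
    ``,
    `- [ ] ${issue.body.split("\n")[0] || "Implement the issue as described"}`,
  ].join("\n");
}
```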


4. Prioritized Roadmap

Phase 1: Quick Wins (Low effort, High impact) -- Weeks 1-4

4.1.1 Anti-Rationalization Prompts

What to build: Incorporate Superpowers' rationalization prevention patterns into Maestro's bundled prompts (src/prompts/autorun-default.md, src/prompts/group-chat-moderator-system.md, and others).

Reference model: Superpowers' verification-before-completion/SKILL.md (gate function pattern) and subagent-driven-development/spec-reviewer-prompt.md (skepticism pattern).

Estimated scope: Modify 5-8 prompt markdown files. Add rationalization tables, verification gate functions, and description-trap-aware descriptions. No code changes required.

Dependencies: None.

4.1.2 CI Security and Testing Pipeline

What to build: Add testing, linting, type checking, secret scanning, and dependency auditing to the GitHub Actions release workflow.

Reference model: Agent Orchestrator's .github/workflows/security.yml (Gitleaks + dependency-review-action + npm audit) and ECC's .github/workflows/ci.yml (multi-OS/Node/PM test matrix).

Estimated scope: 1 new workflow file (~150 lines YAML). Configure existing Vitest + ESLint + Prettier tools to run in CI.

Dependencies: None.

4.1.3 Cost Budget Configuration

What to build: Add per-playbook and per-agent budget limits with automatic pause on budget exhaustion. Extend StatsDB with budget tracking columns.

Reference model: No existing project has this. Maestro's existing StatsDB infrastructure (src/main/stats/stats-db.ts) provides the foundation.

Estimated scope: Modify Playbook interface to add budgetUsd field. Add budget checking in useBatchProcessor.ts / batch-processor.ts. Add UI controls for budget setting. ~300-500 lines of new code.

Dependencies: None. Builds on existing StatsDB.


Phase 2: Core Infrastructure (Medium effort, Critical for maturity) -- Weeks 5-12

4.2.1 Quality Gate System

What to build: A configurable quality gate framework that runs between Auto Run tasks, supporting built-in gates (test, lint, type check) and custom gates (user-defined commands).

Reference model: Superpowers' two-stage review pattern (spec compliance then code quality) for the review gate architecture. ECC's verification-loop skill for the built-in gate definitions.

Estimated scope: New QualityGate interface and QualityGateRunner service (~400 lines). Modify batch processor to invoke gates between tasks (~150 lines). Gate configuration UI in Playbook editor (~300 lines). Total: ~850 lines.

Dependencies: Anti-rationalization prompts (4.1.1) improve gate effectiveness.

4.2.2 Reaction Engine

What to build: A configurable event-to-action mapping system with retries, time-based escalation, and condition predicates. Integrate with ProcessManager events and agent error patterns.

Reference model: Agent Orchestrator's reaction engine (lifecycle-manager.ts, approximately lines 250-330) with YAML-based reaction configuration.

Estimated scope: New ReactionEngine class (~500 lines). Reaction configuration in settings/playbooks (~200 lines). Default reaction set (~100 lines). UI for reaction management (~400 lines). Total: ~1200 lines.

Dependencies: None, but complements quality gates (4.2.1) and notification system (4.2.3).

4.2.3 External Notification Channels

What to build: Slack, webhook, and enhanced desktop notification support with priority-based routing.

Reference model: Agent Orchestrator's notifier-slack/src/index.ts (Block Kit formatting) and notifier-desktop/src/index.ts (platform-specific implementations).

Estimated scope: New NotificationRouter service (~200 lines). Slack plugin (~150 lines). Webhook plugin (~100 lines). Notification routing configuration (~100 lines). Total: ~550 lines.

Dependencies: Reaction engine (4.2.2) triggers notifications.
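
Priority-based routing can start as a simple lookup table. The channel and priority names below mirror the reaction-engine types elsewhere in this report, but the routing defaults are assumptions:

```typescript
// Sketch of priority-based notification routing (defaults are assumptions:
// escalate to more channels as priority rises).
type Priority = "normal" | "high" | "critical";
type Channel = "desktop" | "slack" | "webhook";

const routing: Record<Priority, Channel[]> = {
  normal: ["desktop"],
  high: ["desktop", "slack"],
  critical: ["desktop", "slack", "webhook"],
};

function channelsFor(priority: Priority): Channel[] {
  return routing[priority];
}
```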

4.2.4 Agent Session State Machine

What to build: Replace binary busy/idle state tracking with a formal state machine tracking agent sessions through defined lifecycle states (spawning -> working -> reviewing -> pr_open -> ci_checking -> approved -> merged -> done).

Reference model: Agent Orchestrator's SessionStatus enum with 16 states and determineStatus() algorithm.

Estimated scope: New SessionStateMachine class (~400 lines). State-to-UI mapping (~100 lines). State transition logging to StatsDB (~150 lines). Total: ~650 lines.

Dependencies: None, but integrates with reaction engine (4.2.2).


Phase 3: Advanced Capabilities (High effort, Differentiating) -- Weeks 13-24

4.3.1 Plugin Architecture

What to build: Formalize the Encore Features system into a full plugin API with lifecycle hooks, manifest format, typed interfaces, and isolation.

Reference model: Agent Orchestrator's 8-slot plugin system with PluginManifest + PluginModule pattern and type-safe registry.

Estimated scope: Plugin manifest format and loader (~500 lines). Plugin lifecycle management (install, enable, disable, uninstall) (~400 lines). Plugin isolation and sandboxing (~300 lines). Plugin marketplace UI (~600 lines). Total: ~1800 lines.

Dependencies: Reaction engine (4.2.2) and quality gates (4.2.1) become first-party plugins.

4.3.2 Issue Tracker Integration

What to build: GitHub Issues and Linear integration with auto-generation of Auto Run documents from issues, bidirectional status sync, and issue-to-PR pipeline.

Reference model: Agent Orchestrator's tracker-github/src/index.ts (304 lines) and tracker-linear/src/index.ts (722 lines).

Estimated scope: Tracker plugin interface (~200 lines). GitHub Issues plugin (~400 lines). Linear plugin (~600 lines). Auto Run document generation from issues (~300 lines). UI for issue browsing and assignment (~500 lines). Total: ~2000 lines.

Dependencies: Plugin architecture (4.3.1) for clean modularity.

4.3.3 REST/Webhook API

What to build: HTTP API for external automation, CI/CD integration, and custom dashboards. Expose key operations: session management, playbook execution, status queries, cost data.

Reference model: Agent Orchestrator's Next.js API routes for sessions, events (SSE), sends, kills, merges, restores. Extend Maestro's existing Fastify web server.

Estimated scope: API route definitions (~600 lines). Authentication middleware (~200 lines). SSE event streaming (~200 lines). API documentation (~300 lines). Total: ~1300 lines.

Dependencies: Session state machine (4.2.4) for rich status data.

4.3.4 Continuous Learning System

What to build: Session evaluation at end-of-session to extract patterns, an instinct/learned-skill store, and automatic prompt refinement based on observed agent behavior.

Reference model: ECC's continuous learning v2 system (skills/continuous-learning-v2/SKILL.md) with atomic instincts, confidence scoring, evidence-backed patterns, and evolution from instincts to skills/commands/agents.

Estimated scope: Session evaluation service (~400 lines). Instinct store (SQLite table + CRUD) (~300 lines). Pattern extraction prompts (~200 lines). Instinct-to-prompt integration (~200 lines). UI for instinct management (~400 lines). Total: ~1500 lines.

Dependencies: StatsDB for instinct storage. Quality gates (4.2.1) provide data for pattern extraction.


5. What NOT to Adopt

5.1 Agent-as-Orchestrator Model (from Superpowers)

Superpowers makes the AI agent itself the orchestrator, guided only by markdown skill documents. This is elegant but fundamentally limits enforcement (all rules are advisory), observability (no runtime metrics), recovery (no checkpoint/restore), and reproducibility (agent behavior varies).

Why NOT: Maestro already has a runtime orchestrator (ProcessManager, batch processor, Group Chat moderator). The advisory-only enforcement model is the #1 limitation of Superpowers. Maestro should use runtime code for enforcement and prompts for guidance, not prompts for everything.

Source: superpowers S22.3 item 12 ("Agent-as-Orchestrator Model"), superpowers S1 ("advisory-only enforcement model...fundamental limitations")

5.2 Zero-Persistence Design (from Superpowers)

Superpowers has NO persistence mechanism: no session state, no database, no progress tracking across sessions. Git commits are the only durable artifact.

Why NOT: Maestro already has SQLite-backed analytics, electron-store session persistence, and Group Chat JSONL logs. Losing persistence would be a regression.

Source: superpowers S10.2, S22.3 item 13

5.3 Flat-File Metadata (from Agent Orchestrator)

AO uses key=value text files for session metadata. This has no atomicity guarantees, no schema evolution, no query capability, and race conditions between writers.

Why NOT: Maestro already uses SQLite (StatsDB) and electron-store (JSON). Both are superior to flat files. The flat-file approach was a pragmatic choice for AO's v1, not a design to emulate.

Source: ao S21.3.1 ("The key=value text file approach is too fragile for production")

5.4 tmux as Primary Runtime (from Agent Orchestrator)

AO couples tightly to tmux for process isolation. This limits Windows support, makes message passing fragile, and provides no structured communication channel.

Why NOT: Maestro already uses node-pty for terminal emulation and child_process.spawn for batch mode, supporting macOS, Linux, and Windows. The tmux dependency would reduce portability.

Source: ao S21.3.2

5.5 Polling-Based Lifecycle (from Agent Orchestrator)

AO polls every 30 seconds. This introduces up to 30-second latency for state change detection and wastes resources during idle periods.

Why NOT: Maestro already has an event-driven architecture (ProcessManager EventEmitter with real-time events). Polling would be a regression.

Source: ao S1.2 item 4, ao S21.3.3

5.6 Inline Node.js in JSON Hooks (from ECC)

ECC uses node -e "..." for simple hooks embedded in hooks.json. This is unreadable, untestable, and has no source maps.

Why NOT: Any hook system Maestro implements should use external script files or TypeScript modules, not inline code in JSON.

Source: ecc S22.3 item 1

5.7 Windows Polyglot Wrapper (from Superpowers)

Superpowers' cmd/bash polyglot script (run-hook.cmd) is clever but fragile and has caused numerous cross-platform issues across multiple versions.

Why NOT: Maestro already uses Node.js for cross-platform compatibility. A polyglot bash/cmd approach would add fragility.

Source: superpowers S22.3 item 16


6. Implementation-Ready Recommendations

6.1 Quality Gate System (Top Priority)

Files to create/modify:

  • Create: src/shared/types/quality-gate.ts -- Gate interface and result types
  • Create: src/main/quality-gates/gate-runner.ts -- Gate execution engine
  • Create: src/main/quality-gates/built-in/ -- test-gate.ts, lint-gate.ts, typecheck-gate.ts, review-gate.ts
  • Modify: src/cli/services/batch-processor.ts -- Add gate invocation between tasks
  • Modify: src/renderer/hooks/useBatchProcessor.ts -- Add gate invocation between tasks
  • Modify: Playbook interface (in ARCHITECTURE.md referenced types) -- Add gates configuration

Architecture decisions:

  • Gates run in the same working directory as the agent
  • Gate results are stored in StatsDB (auto_run_tasks table, new gate_result column)
  • Gates are sequential (not parallel) to avoid resource contention
  • Each gate has a configurable failure mode: pause | retry | skip | abort
  • The review gate uses Superpowers' skepticism pattern in its prompt

Key interfaces:

interface QualityGate {
  id: string;
  name: string;
  type: 'command' | 'review' | 'builtin';
  run(context: GateContext): Promise<GateResult>;
}

interface GateContext {
  workingDir: string;
  taskContent: string;
  changedFiles: string[];
  agentType: string;
  previousGateResults: GateResult[];
}

interface GateResult {
  gateId: string;
  status: 'pass' | 'fail' | 'warn' | 'skip';
  message: string;
  details?: string;
  duration: number;
}

interface PlaybookGateConfig {
  gates: Array<{
    type: 'test' | 'lint' | 'typecheck' | 'review' | 'custom';
    command?: string;  // For custom gates
    onFailure: 'pause' | 'retry' | 'skip' | 'abort';
    retryCount?: number;
  }>;
}

Test strategy:

  • Unit tests for each built-in gate with mocked command execution
  • Integration tests for the gate runner with real file system operations
  • E2E tests for the full Auto Run + gate pipeline
  • Pressure tests (Superpowers methodology): verify gates are not skipped under various prompting pressures
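
To make the architecture decisions above concrete, here is a runnable sketch of the sequential gate loop with pause/skip/abort failure modes (retry omitted for brevity). Types are trimmed relative to the interfaces above, and runGates is a hypothetical name:

```typescript
// Sequential gate runner sketch: gates run one at a time (no resource
// contention), and each gate's configured onFailure mode decides what a
// failure does to the batch. Retry handling is omitted here.
type GateStatus = "pass" | "fail" | "warn" | "skip";
interface GateResult { gateId: string; status: GateStatus; message: string }
interface Gate {
  id: string;
  onFailure: "pause" | "skip" | "abort";
  run(): Promise<GateResult>;
}

async function runGates(
  gates: Gate[],
): Promise<{ results: GateResult[]; outcome: "continue" | "paused" | "aborted" }> {
  const results: GateResult[] = [];
  for (const gate of gates) {
    const result = await gate.run();
    results.push(result);
    if (result.status === "fail") {
      if (gate.onFailure === "abort") return { results, outcome: "aborted" };
      if (gate.onFailure === "pause") return { results, outcome: "paused" };
      // "skip": record the failure and continue to the next gate
    }
  }
  return { results, outcome: "continue" };
}
```

The batch processor would call this between tasks and map "paused" to the existing user-acknowledgment flow, while "aborted" stops the Auto Run batch outright.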

6.2 Cost Budget Enforcement (Top Priority)

Files to create/modify:

  • Create: src/main/stats/budget-manager.ts -- Budget checking and enforcement
  • Modify: src/main/stats/schema.ts -- Add budgets table
  • Modify: src/cli/services/batch-processor.ts -- Add budget checks
  • Modify: src/renderer/hooks/useBatchProcessor.ts -- Add budget checks
  • Modify: Playbook interface -- Add budgetUsd field
  • Create: src/renderer/components/BudgetConfig.tsx -- Budget setting UI
  • Create: src/renderer/components/BudgetAlert.tsx -- Budget alert overlay

Architecture decisions:

  • Budget tracking is per-playbook-execution (not per-agent, since agents are shared)
  • Cost is accumulated from UsageStats.totalCostUsd emitted by agent parsers
  • Budget checking happens after each agent query completes (leveraging query-complete event)
  • Three thresholds: warning (80%), critical (95%), exceeded (100%)
  • At exceeded: pause batch processor, show alert, require user override to continue
  • Budget data stored in StatsDB for historical analysis

Key interfaces:

interface BudgetConfig {
  maxCostUsd: number;
  warnThresholdPct: number;    // Default 80
  criticalThresholdPct: number; // Default 95
  onExceeded: 'pause' | 'abort'; // Default pause
}

interface BudgetStatus {
  configuredBudget: number;
  currentSpend: number;
  percentUsed: number;
  status: 'ok' | 'warning' | 'critical' | 'exceeded';
}

class BudgetManager {
  checkBudget(playbook: Playbook, currentCost: number): BudgetStatus;
  recordSpend(playbook: Playbook, cost: number): void;
  getBudgetHistory(playbookId: string): BudgetStatus[];
}

Test strategy:

  • Unit tests for BudgetManager with various cost/threshold combinations
  • Integration tests with mock batch processor to verify pause behavior
  • UI tests for budget configuration and alert display
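
The threshold logic inside checkBudget reduces to a small pure function. This sketch uses the default thresholds above; the standalone budgetStatus name is an assumption:

```typescript
// Budget threshold sketch: maps spend against a budget to the four
// statuses defined above (ok / warning 80% / critical 95% / exceeded 100%).
type BudgetState = "ok" | "warning" | "critical" | "exceeded";

function budgetStatus(
  maxCostUsd: number,
  currentSpend: number,
  warnPct = 80,
  criticalPct = 95,
): { percentUsed: number; status: BudgetState } {
  const percentUsed = (currentSpend / maxCostUsd) * 100;
  let status: BudgetState = "ok";
  if (percentUsed >= 100) status = "exceeded";
  else if (percentUsed >= criticalPct) status = "critical";
  else if (percentUsed >= warnPct) status = "warning";
  return { percentUsed, status };
}
```

Because checking happens on each query-complete event, the worst-case overshoot is bounded by the cost of a single agent query, which is why the 95% critical threshold matters.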

6.3 Reaction Engine (High Priority)

Files to create/modify:

  • Create: src/main/reactions/reaction-engine.ts -- Event-to-action mapping engine
  • Create: src/main/reactions/reaction-config.ts -- Reaction definition types and defaults
  • Create: src/main/reactions/actions/ -- send-message.ts, pause-agent.ts, restart-agent.ts, notify.ts
  • Modify: src/main/process-manager/ProcessManager.ts -- Emit richer lifecycle events
  • Modify: src/main/ipc/handlers/agents.ts -- Register reaction configuration handlers
  • Create: src/renderer/components/ReactionConfig/ -- UI for reaction rule management

Architecture decisions:

  • Reactions are defined per-agent or globally in settings
  • The reaction engine subscribes to ProcessManager events
  • Reactions have a deduplication window (prevent re-firing within N seconds)
  • Escalation uses a timer-based mechanism (setTimeout, persisted in case of restart)
  • All reaction executions are logged to StatsDB for auditing
  • Default reactions handle the 5 most common scenarios: auth expiration, rate limiting, context exhaustion, agent crash, and agent idle timeout

Key interfaces:

interface ReactionRule {
  id: string;
  name: string;
  trigger: string;            // Event type (e.g., 'agent-error:rate_limited')
  condition?: string;         // Optional predicate (e.g., 'attempts < 3')
  action: ReactionAction;
  retries?: number;
  escalation?: {
    action: ReactionAction;
    afterMs: number;
    priority: 'normal' | 'high' | 'critical';
  };
}

type ReactionAction =
  | { type: 'send-message'; message: string }
  | { type: 'pause-agent' }
  | { type: 'restart-agent'; withResume: boolean }
  | { type: 'notify'; channel: 'desktop' | 'slack' | 'webhook'; message: string }
  | { type: 'abort-batch' };

class ReactionEngine {
  registerRule(rule: ReactionRule): void;
  handleEvent(event: ProcessManagerEvent, session: Session): void;
  getReactionHistory(sessionId: string): ReactionExecution[];
}

Test strategy:

  • Unit tests for each action executor
  • Unit tests for condition evaluation and deduplication
  • Integration tests with mock ProcessManager events
  • Escalation timer tests (fast-forward timers)
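
The deduplication window from the architecture decisions can be isolated into a small testable class. ReactionDeduper is a hypothetical helper name:

```typescript
// Deduplication window sketch: a rule refuses to re-fire within windowMs
// of its last execution, preventing reaction storms on repeated events.
class ReactionDeduper {
  private lastFired = new Map<string, number>();
  constructor(private windowMs: number) {}

  // Returns true (and records the firing) only if the rule is outside
  // its deduplication window; suppressed firings do not reset the window.
  shouldFire(ruleId: string, now: number): boolean {
    const last = this.lastFired.get(ruleId);
    if (last !== undefined && now - last < this.windowMs) return false;
    this.lastFired.set(ruleId, now);
    return true;
  }
}
```

Passing `now` explicitly (rather than calling Date.now() internally) keeps the class trivially testable with fake clocks, which matters for the escalation timer tests above.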

6.4 Anti-Rationalization Prompt Enhancement (High Priority)

Files to modify:

  • src/prompts/autorun-default.md -- Add rationalization prevention table and verification gate
  • src/prompts/group-chat-moderator-system.md -- Add skepticism instructions for synthesis rounds
  • src/prompts/group-chat-moderator-synthesis.md -- Add independent verification requirement
  • src/prompts/context-grooming.md -- Add verification that grooming preserved key context
  • src/prompts/commit-command.md -- Add verification-before-claiming-complete pattern
  • Create: src/prompts/review-skepticism.md -- Reusable review prompt with Superpowers' skepticism pattern

Architecture decisions:

  • Rationalization prevention is embedded in prompts, not enforced by code (complementary to code-enforced quality gates)
  • Each prompt that can result in "task complete" claims must include the IDENTIFY -> RUN -> READ -> VERIFY -> CLAIM gate function
  • The review prompt explicitly states distrust: "The agent may have checked off tasks without actually completing them. Verify independently."
  • No @ file references in prompt templates (following Superpowers' no-@ rule to prevent context bloat)

Key additions to autorun-default.md:

## Verification Protocol

Before checking off ANY task:
1. IDENTIFY: What command proves this task is complete?
2. RUN: Execute the command (fresh, complete)
3. READ: Full output, check exit code
4. VERIFY: Does output confirm completion?
5. ONLY THEN: Check off the task

## Common Rationalizations (DO NOT FALL FOR THESE)

| Temptation | Reality |
|---|---|
| "It should work based on what I wrote" | RUN the verification |
| "I'm confident in the changes" | Confidence is not evidence |
| "Just checking the box to move on" | Unchecked is better than falsely checked |
| "The test was passing before my change" | Run it AFTER your change |
| "Manual inspection confirms it works" | Run the automated verification |

Test strategy:

  • Adopt Superpowers' TDD-for-prompts methodology:
    1. Run Auto Run with current prompts, observe rationalization behavior
    2. Add anti-rationalization text
    3. Run same scenarios, verify improved compliance
    4. Use pressure scenarios (time pressure, sunk cost, authority) to stress-test

6.5 CI Pipeline Enhancement (Quick Win)

Files to create/modify:

  • Create: .github/workflows/ci.yml -- New CI workflow (separate from release.yml)
  • Create: .github/workflows/security.yml -- Security scanning workflow
  • Modify: .github/workflows/release.yml -- Add dependency on CI passing

Architecture decisions:

  • CI runs on every push and PR (not just releases)
  • Security scanning runs on a schedule (weekly) and on PRs touching package.json
  • CI must pass before release workflow can proceed
  • Coverage threshold: 50% minimum (gradually increase as coverage improves)

Key workflow structure:

# ci.yml (abridged sketch; the four stages are merged into one job here)
name: CI
on: [push, pull_request]
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx tsc --noEmit --project tsconfig.main.json
      - run: npx tsc --noEmit --project tsconfig.lint.json
      - run: npx tsc --noEmit --project tsconfig.cli.json
      - run: npx eslint src/
      - run: npx prettier --check src/
      - run: npx vitest run --config vitest.config.mts
      - run: npx vitest run --config vitest.integration.config.ts
      - run: npm audit --audit-level=high --production

# security.yml (abridged sketch)
name: Security
on:
  pull_request:
  schedule:
    - cron: "0 6 * * 1"   # weekly
jobs:
  gitleaks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # full history so Gitleaks scans every commit
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  dependency-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/dependency-review-action@v4
        with:
          fail-on-severity: high

Test strategy:

  • Verify CI catches known-bad commits (deliberate lint failures, type errors)
  • Verify security scan catches known-bad patterns (test with deliberate secret in test branch)

7. Source Trace

Every claim in this report is traced to specific sections of the source analysis reports.

7.1 Maestro Strengths (Section 1)

| Claim | Source Report | Section |
|---|---|---|
| Multi-provider support (4 active, 3 planned) | maestro-deep-analysis.md | S17.2 (Provider table), S1 (Executive Summary) |
| Declarative agent definition architecture | maestro-deep-analysis.md | S17.3 (Agent Definition Architecture), S3.3 (Output Parser Architecture) |
| 30+ keyboard shortcuts, Layer Stack system | maestro-deep-analysis.md | S14.1 (Desktop UX) |
| Group Chat moderator-agent pattern | maestro-deep-analysis.md | S5.1 (Group Chat System) |
| StatsDB with daily backups and corruption recovery | maestro-deep-analysis.md | S15.3 (Stats Database Architecture) |
| Cross-provider session discovery and resume | maestro-deep-analysis.md | S10.3-10.4 (Session Discovery, Session Resume) |
| Mobile PWA with voice/swipe/offline | maestro-deep-analysis.md | S14.2 (Mobile UX) |
| Error pattern system (~100 patterns) | maestro-deep-analysis.md | S3.4 (Error Pattern System) |
| Symphony community contribution | maestro-deep-analysis.md | S5.2 (Symphony Orchestration) |

7.2 Gap Analysis Matrix (Section 2)

| Gap | Maestro Source | Comparison Source |
|---|---|---|
| No quality gates in Auto Run | maestro S4.6 ("No automatic verification layer") | superpowers S4.5 (two-stage review), ecc S4.5 (6-phase verification) |
| No lifecycle state machine | maestro S10.5 (color-coded dots) | ao S9.1 (16-state machine), S9.2 (determineStatus algorithm) |
| No reaction engine | maestro S13 ("No event bus for external consumers") | ao S12.1-12.6 (reaction engine with escalation) |
| No anti-rationalization | maestro S4.5 (prompts without prevention tables) | superpowers S11.5 (40+ prevention entries), S11.6 (pressure testing) |
| No cost enforcement | maestro S15.5 ("No cost budgets or limits") | All projects lack this; maestro has best data infrastructure |
| No security in CI | maestro S11.8 ("No security scanning") | ao S11.3 (Gitleaks + dependency-review), ecc S11.1 (npm audit) |
| No CI testing | maestro S11.8 ("release workflow only builds") | ecc S11.1 (33-combination matrix), ao S10.1 (lint + typecheck + test) |
| No hook system | maestro S13 ("No plugin system") | ecc S13.1 (6-event lifecycle with 15 hooks) |
| No auto context compaction | maestro S9.6 ("No automatic context compaction") | ecc S9.2 (tool-call counting), superpowers S9.5 (re-inject on compact) |
| No agent tool scoping | maestro S17.3 (YOLO mode for all agents) | ecc S5.2 (read-only vs full-access per agent) |
| No notification channels | maestro S13 (desktop only) | ao S7.4 (desktop + Slack + webhook + Composio) |
| No issue tracker | maestro S21.4 (manual Auto Run creation) | ao S16.1-16.2 (GitHub + Linear) |
| No crash-safe persistence | maestro S19.5 (in-memory Group Chat state) | ao S9.4-9.5 (metadata survives crashes, session restoration) |
| No plugin architecture | maestro S21.3 (Encore Features) | ao S2.2-2.3 (8-slot plugin system) |
| No REST API | maestro S13.5 ("No webhook/HTTP API") | ao S7.2 (Next.js API routes) |
| No continuous learning | maestro (not present) | ecc S10.5 (instinct system) |
| No language-specific rules | maestro S4.5 (provider-agnostic prompts) | ecc Appendix A (44 skills, 24 rules) |

7.3 What NOT to Adopt (Section 5)

| Decision | Source Report | Section |
| --- | --- | --- |
| Agent-as-orchestrator | superpowers-deep-analysis.md | S5.1 (Agent-as-Orchestrator), S22.3 item 12 |
| Zero-persistence | superpowers-deep-analysis.md | S10.2 (No persistence), S22.3 item 13 |
| Flat-file metadata | agent-orchestrator-deep-analysis.md | S2.5 (Flat-file state), S21.3.1 |
| tmux runtime | agent-orchestrator-deep-analysis.md | S6.2 (tmux isolation), S21.3.2 |
| Polling lifecycle | agent-orchestrator-deep-analysis.md | S1.2 item 4, S21.3.3 |
| Inline JS in JSON | everything-claude-code-deep-analysis.md | S13.3 (Inline vs Script), S22.3 item 1 |
| Polyglot wrapper | superpowers-deep-analysis.md | S3.4 (run-hook.cmd), S22.3 item 16 |

7.4 Implementation Recommendations (Section 6)

| Recommendation | Primary Reference | Supporting Reference |
| --- | --- | --- |
| Quality gate system | superpowers S4.5 (two-stage review loop), ecc S4.5 (verification loop) | maestro S4.4 (batch processor architecture) |
| Cost budget enforcement | maestro S15.3 (StatsDB infrastructure) | All projects' S15 (cost gap analysis) |
| Reaction engine | ao S12.1-12.6 (reaction engine code) | maestro S3.2 (ProcessManager events) |
| Anti-rationalization prompts | superpowers S11.5-11.6 (tables and pressure tests) | superpowers S22.1 items 1, 6, 7 |
| CI pipeline | ao S11.3 (security.yml), ecc S11.1 (ci.yml) | maestro S11.8 (identified gaps) |

Appendix A: Cross-Reference Index

By Source Report

maestro-deep-analysis.md sections referenced:

  • S1 (Executive Summary): Section 1 intro
  • S2.3 (Agent Behavioral Guidelines): Gap 3.3
  • S3.2 (ProcessManager): Rec 6.3
  • S3.3 (Output Parser): Strength 1.1
  • S3.4 (Error Patterns): Strength 1.7
  • S4.4 (Execution): Gap 3.1
  • S4.5 (Prompt System): Gap 3.3, Rec 6.4
  • S4.6 (Verification): Gap 3.1
  • S5.1 (Group Chat): Strength 1.3
  • S5.2 (Symphony): Strength 1.8
  • S8.4 (Agent Error Handling): Gap 3.2
  • S9.6 (Context Missing): Gap matrix
  • S10.3-10.4 (Session Discovery/Resume): Strength 1.5
  • S10.5 (Session States): Gap 3.2
  • S11.8 (Quality Missing): Gaps 3.5, 3.6
  • S13 (Hooks Missing): Gap 3.6
  • S14.1 (Desktop UX): Strength 1.2
  • S14.2 (Mobile UX): Strength 1.6
  • S15.1-15.3 (Cost Tracking): Strength 1.4, Rec 6.2
  • S15.5 (Cost Missing): Gap 3.4
  • S17.1-17.3 (Providers): Strength 1.1
  • S19.5 (Failure Modes): Gap matrix
  • S21.3 (Encore Features): Gap matrix
  • S21.4 (Identified Gaps): Gap 3.7

superpowers-deep-analysis.md sections referenced:

  • S1 (Executive Summary): Section 5.1 rationale
  • S2.3 (Description Trap): Gap 3.3
  • S4.5 (Subagent-Driven Development): Gap 3.1
  • S5.1 (Agent-as-Orchestrator): Section 5.1
  • S5.4 (Review Loop): Gap 3.1
  • S9.5 (Compaction): Gap matrix
  • S10.2 (No Persistence): Section 5.2
  • S11.1-11.3 (Quality Gates): Gap 3.1
  • S11.5-11.6 (Anti-Rationalization): Gap 3.3, Rec 6.4
  • S15.2 (Cost Awareness): Gap 3.4
  • S22.1 (Strongly Borrow): Section 6.4
  • S22.3 (Do Not Borrow): Section 5

everything-claude-code-deep-analysis.md sections referenced:

  • S4.5 (Verification Loop): Gap 3.1
  • S5.2 (Tool Scoping): Gap matrix
  • S9.1 (Token Optimization): Gap 3.4
  • S9.2 (Strategic Compaction): Gap matrix
  • S10.5 (Continuous Learning): Gap matrix
  • S11.1 (CI Pipeline): Gap 3.5, Rec 6.5
  • S13.1-13.3 (Hook Architecture): Gap 3.6
  • S22.1 (Strongly Borrow): Phase 2 rationale
  • S22.3 (Do Not Borrow): Section 5.6

agent-orchestrator-deep-analysis.md sections referenced:

  • S1.2 (Design Principles): Section 5.5
  • S2.2-2.3 (Plugin Architecture): Gap matrix, Phase 3
  • S7.1-7.4 (Human-in-the-Loop): Gap matrix
  • S9.1-9.2 (State Machine): Gap 3.2
  • S9.4-9.5 (Session Cleanup/Restoration): Gap matrix
  • S10.1 (CI Pipeline): Gap 3.5
  • S11.3 (Security Scanning): Gap 3.5, Rec 6.5
  • S12.1-12.6 (Reaction Engine): Gap 3.2, Rec 6.3
  • S14 (Cost Visibility): Gap 3.4
  • S16.1-16.2 (Integrations): Gap 3.7
  • S21.1.1-21.1.6 (Strongly Borrow): Phase 2-3 rationale
  • S21.3 (Do Not Borrow): Section 5

Appendix B: Effort Estimation Summary

| Phase | Item | Effort | Impact | Lines Est. |
| --- | --- | --- | --- | --- |
| 1 | Anti-rationalization prompts | Low | High | ~200 (markdown) |
| 1 | CI security pipeline | Low | High | ~150 (YAML) |
| 1 | Cost budget enforcement | Medium | High | ~500 |
| 2 | Quality gate system | Medium | Critical | ~850 |
| 2 | Reaction engine | Medium-High | High | ~1200 |
| 2 | Notification channels | Medium | Medium | ~550 |
| 2 | Session state machine | Medium | High | ~650 |
| 3 | Plugin architecture | High | High | ~1800 |
| 3 | Issue tracker integration | High | Medium | ~2000 |
| 3 | REST/webhook API | Medium-High | Medium | ~1300 |
| 3 | Continuous learning | High | Medium | ~1500 |

Total estimated new code: ~10,700 lines across all three phases.


Appendix C: Confidence Scores

| Section | Confidence | Basis |
| --- | --- | --- |
| Maestro Strengths | High | Direct from maestro-deep-analysis.md with source code evidence |
| Gap Analysis Matrix | High | Cross-referenced all 4 reports, verified feature presence/absence |
| Detailed Gap Descriptions | High | Specific file references and code patterns cited |
| Prioritized Roadmap | Medium-High | Based on gap severity and implementation complexity analysis |
| What NOT to Adopt | High | Explicit recommendations from source reports' Section 22 |
| Implementation Recommendations | Medium | Architecture designs are informed by reference implementations but not validated |
| Source Trace | High | Every claim mapped to specific report section |
| Effort Estimates | Medium | Rough estimates based on reference implementation sizes; actual effort varies |

Harness Consensus Report: Cross-Project Synthesis

Date: 2026-02-22 Analyst: Claude Opus 4.6 Source Reports:

  1. superpowers-deep-analysis.md (obra/superpowers, v4.3.1)
  2. everything-claude-code-deep-analysis.md (affaan-m/everything-claude-code, v1.4.1)
  3. agent-orchestrator-deep-analysis.md (ComposioHQ/agent-orchestrator)
  4. maestro-deep-analysis.md (RunMaestro/Maestro, v0.15.0)

Executive Summary

After thorough analysis of four distinct AI coding harness projects, a clear picture emerges of what a canonical AI coding harness must include, where the field has converged, and where fundamental design tradeoffs remain unresolved.

The consensus is strong on what problems must be solved -- every project addresses orchestration, isolation, context management, quality gates, and human oversight. The consensus is weak on how to solve them -- the projects span a spectrum from pure-markdown behavioral engineering (Superpowers) to full desktop applications with SQLite-backed analytics (Maestro), with configuration-layer harnesses (ECC) and runtime orchestrators (Agent Orchestrator) between them.

Three findings stand out:

  1. Git worktrees are the universal isolation primitive. All four projects use git worktrees for agent workspace isolation. No other isolation mechanism achieves the same balance of lightweight overhead and genuine filesystem separation.

  2. Quality gate enforcement remains unsolved at scale. Every project acknowledges the need for verification before completion, but none has achieved truly reliable enforcement. Superpowers addresses this most rigorously through anti-rationalization engineering but lacks runtime enforcement. Agent Orchestrator has runtime state machines but no code quality gates within them. The gap between "instructed to verify" and "proven to have verified" persists.

  3. Cost governance is universally underdeveloped. Despite being a critical operational concern, no project implements budget limits, spending alerts, or automatic shutoff. Maestro tracks costs most comprehensively (SQLite analytics with dashboards), but even Maestro has no enforcement mechanism.

The canonical harness must combine Superpowers' behavioral rigor, ECC's configuration breadth, Agent Orchestrator's runtime state machine, and Maestro's multi-provider fleet management into a single coherent architecture.


1. Canonical Feature Set

1.1 Orchestration Model

How work flows from spec to completion.

| Project | Model | Enforcement | Key Mechanism |
| --- | --- | --- | --- |
| Superpowers | Brainstorm -> Plan -> Execute -> Review -> Merge | Advisory (skill text) | DOT flowcharts in markdown |
| ECC | Plan -> TDD -> Review -> Verify | Advisory (agent prompts) | /orchestrate command with 4 workflow types |
| Agent Orchestrator | Issue -> Spawn -> Work -> PR -> Review -> Merge | Runtime (state machine) | Lifecycle manager with 30s polling |
| Maestro | Spec (Auto Run) -> Playbook -> Execute -> Merge | Semi-automated (checkbox tracking) | Batch processor with JSONL events |

Consensus: All four projects implement a multi-stage pipeline from specification to completion. The stages are remarkably consistent: (1) understand requirements, (2) plan the work, (3) execute in isolation, (4) verify the output, (5) merge or deliver.

Divergence: The enforcement spectrum ranges from purely advisory (Superpowers, ECC) to runtime-enforced (Agent Orchestrator). Superpowers relies entirely on the AI agent reading skill documents and self-governing. ECC relies on Claude Code's native mechanisms with markdown-defined workflows. Agent Orchestrator has a formal state machine with 16 distinct states and event-driven transitions. Maestro treats the pipeline as a pass-through, delegating planning and execution to the underlying agents while managing lifecycle and UI.

Best implementation: Agent Orchestrator's state machine is the most rigorous for runtime enforcement. Superpowers' brainstorm-plan-execute-review pipeline is the most methodologically complete for the agent's internal workflow. A canonical harness needs both -- runtime state machine for enforcement, with behavioral skills for the agent's internal process.

Source: superpowers-deep-analysis.md Section 4; everything-claude-code-deep-analysis.md Section 4; agent-orchestrator-deep-analysis.md Section 3; maestro-deep-analysis.md Section 4
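The dual model recommended above -- a runtime state machine outside the agent, a behavioral pipeline inside it -- can be sketched on the runtime side as a small transition table. This is a hedged illustration: the state names below are simplified stand-ins, not Agent Orchestrator's actual 16 states.

```typescript
// Simplified lifecycle state machine in the spirit of AO's runtime
// enforcement. State names are illustrative, not AO's real set.
type State =
  | "queued" | "spawning" | "working" | "awaiting_review"
  | "ci_running" | "ci_failed" | "merged" | "failed";

// Legal transitions; anything absent here is rejected at runtime.
const transitions: Record<State, State[]> = {
  queued: ["spawning"],
  spawning: ["working", "failed"],
  working: ["awaiting_review", "failed"],
  awaiting_review: ["ci_running", "working"],
  ci_running: ["merged", "ci_failed"],
  ci_failed: ["working", "failed"],   // retry or give up
  merged: [],                         // terminal
  failed: [],                         // terminal
};

function transition(from: State, to: State): State {
  if (!transitions[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

The point of the explicit table is that the harness, not the agent, decides which moves are legal -- the agent's internal brainstorm/plan/execute discipline lives inside the "working" state.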

1.2 Multi-Agent Coordination

Parallelism, sequencing, and isolation.

| Project | Parallelism Model | Coordination | Communication |
| --- | --- | --- | --- |
| Superpowers | Sequential tasks (parallel only for independent debugging) | Agent-directed | Full task text in prompt |
| ECC | Sequential pipeline with handoff documents | Document-based | Structured handoff markdown |
| Agent Orchestrator | Embarrassingly parallel (one agent per issue) | Git + GitHub (no direct agent-to-agent) | CLI commands through tmux |
| Maestro | Parallel agents, sequential tasks within each | Group Chat moderator | Moderator AI routes messages |

Consensus: All projects agree that parallel execution of tasks with shared state (same files, same tests) is dangerous. Every project either prohibits it (Superpowers: "Never dispatch multiple implementation subagents in parallel") or isolates it (Agent Orchestrator: separate worktrees per issue, Maestro: separate agent workspaces).

Divergence: The projects fundamentally disagree on the right granularity for parallelism. Superpowers operates at the task level within a single feature (sequential). Agent Orchestrator operates at the issue level (parallel across features, sequential within). Maestro supports both through its execution queue (sequential within agent, parallel across agents) and Group Chat (parallel with synthesis).

Best implementation: Agent Orchestrator has the cleanest parallelism model for multi-issue work. Maestro's Group Chat moderator pattern is the most sophisticated for collaborative multi-agent coordination within a single task. Superpowers' prohibition against parallel task execution is the most safety-conscious.

Source: superpowers-deep-analysis.md Section 6; everything-claude-code-deep-analysis.md Section 6; agent-orchestrator-deep-analysis.md Section 5; maestro-deep-analysis.md Section 6

1.3 Code Quality Pipeline

Testing, review, verification, and security.

| Project | Testing | Code Review | Verification | Security |
| --- | --- | --- | --- | --- |
| Superpowers | TDD "Iron Law" (mandatory failing test first) | Two-stage (spec compliance + code quality) | IDENTIFY-RUN-READ-VERIFY gate function | Minimal (branch protection only) |
| ECC | TDD mandatory (80% coverage target) | Code reviewer agent with confidence filtering | 6-phase verification loop (build, type, lint, test, security, diff) | Security reviewer agent, OWASP checklist, AgentShield |
| Agent Orchestrator | None built-in (delegated to agent) | None built-in (delegated to GitHub) | CI status tracking (fail-closed) | Shell injection prevention, gitleaks, dependency review |
| Maestro | None built-in (delegated to agent) | None built-in (delegated to agent) | Checkbox completion tracking | execFileNoThrow, input validation, context isolation |

Consensus: All projects acknowledge that verification before completion is essential. Superpowers' "Verification Before Completion" skill and ECC's verification loop both express the same principle: do not claim success without evidence. Agent Orchestrator enforces this at the CI level (PRs must pass CI). Maestro delegates entirely to agents.

Divergence: The most significant divergence is whether quality gates are internal to the agent (Superpowers, ECC) or external to it (Agent Orchestrator, Maestro). Superpowers and ECC embed quality knowledge in the agent's instructions. Agent Orchestrator uses external CI systems. Maestro has no quality gates of its own.

Best implementation: Superpowers' two-stage code review (spec compliance THEN code quality, with review loops) is the most thorough agent-internal quality system. ECC's 6-phase verification loop is the most comprehensive verification checklist. Agent Orchestrator's fail-closed CI status is the safest external gate. A canonical harness needs both internal verification skills AND external CI enforcement.

Source: superpowers-deep-analysis.md Section 11; everything-claude-code-deep-analysis.md Section 11; agent-orchestrator-deep-analysis.md Section 10; maestro-deep-analysis.md Section 11
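ECC's 6-phase loop reduces to a fail-fast gate over an ordered list of checks. A minimal sketch, with phase runners injected so the gate itself stays testable; the commands in comments are typical choices, not prescribed by any of the projects:

```typescript
// Fail-fast verification gate modeled on ECC's 6-phase checklist.
type Phase = { name: string; run: () => boolean };

function verify(phases: Phase[]): { passed: boolean; failedAt?: string } {
  for (const phase of phases) {
    // Stop at the first failure so the agent fixes one thing at a time.
    if (!phase.run()) return { passed: false, failedAt: phase.name };
  }
  return { passed: true };
}

// Stubbed runners; a real harness would shell out to each tool.
const phases: Phase[] = [
  { name: "build", run: () => true },       // e.g. npm run build
  { name: "typecheck", run: () => true },   // e.g. tsc --noEmit
  { name: "lint", run: () => true },        // e.g. eslint src/
  { name: "test", run: () => true },        // e.g. vitest run
  { name: "security", run: () => true },    // e.g. npm audit
  { name: "diff-review", run: () => true }, // human or reviewer agent
];
```

Running this gate externally (in CI or the harness) rather than trusting the agent's self-report is what closes the "instructed to verify" vs "proven to have verified" gap.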

1.4 Context Management

Chunking, compaction, and retrieval.

| Project | Strategy | Compaction | Persistence | Budget Awareness |
| --- | --- | --- | --- | --- |
| Superpowers | Progressive disclosure (load skills on demand) | Re-injection on compact events | None (stateless) | Token budget targets per skill (<500 lines) |
| ECC | Strategic compaction (tool call counting) | PreCompact state saving, phase-aware decisions | Session summaries (7-day retention) | Settings for thinking tokens, compaction threshold, subagent model |
| Agent Orchestrator | Three-layer prompt composition (base + config + rules) | None (single-use sessions) | Metadata files (flat key=value) | None |
| Maestro | Context grooming, merge, transfer operations | Manual trigger (compact/groom) | Per-tab context, session discovery | Context usage percentage tracking |

Consensus: Every project recognizes that context window management is a first-class concern. The 200K token limit constrains every architectural decision. All projects agree that loading everything upfront is wrong -- some form of progressive disclosure or on-demand loading is necessary.

Divergence: The projects disagree on who manages context. Superpowers gives this responsibility to the agent via skill instructions (the "no-@ rule" preventing force-loading). ECC automates it through hooks (suggest-compact at tool call thresholds). Agent Orchestrator ignores it (single-use sessions assumed short enough). Maestro provides UI-driven context operations (groom, merge, transfer) but no automatic management.

Best implementation: ECC's strategic compaction system is the most operationally mature -- it tracks tool call counts, suggests compaction at thresholds, saves state before compaction, and provides a decision guide for when to compact vs. not. Superpowers' progressive disclosure model is the most context-efficient for skill loading. A canonical harness needs automatic compaction triggers combined with progressive skill loading.

Source: superpowers-deep-analysis.md Section 9; everything-claude-code-deep-analysis.md Section 9; agent-orchestrator-deep-analysis.md Section 8; maestro-deep-analysis.md Section 9
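ECC's tool-call-counting approach reduces to a small tracker. The threshold value and the critical-phase guard below are illustrative assumptions, not ECC's actual numbers:

```typescript
// ECC-style strategic compaction trigger: count tool calls, suggest
// compaction at a threshold, but never interrupt a critical phase.
class CompactionTracker {
  private toolCalls = 0;

  constructor(
    private threshold = 50,                              // assumed value
    private inCriticalPhase: () => boolean = () => false // e.g. mid-implementation
  ) {}

  recordToolCall(): void {
    this.toolCalls += 1;
  }

  shouldSuggestCompact(): boolean {
    // Phase-aware decision: compacting mid-implementation loses working state.
    return this.toolCalls >= this.threshold && !this.inCriticalPhase();
  }

  reset(): void {
    this.toolCalls = 0; // called after a compact event completes
  }
}
```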

1.5 Session Lifecycle

Persistence, recovery, and resume.

| Project | Persistence | Recovery | Resume |
| --- | --- | --- | --- |
| Superpowers | None (git commits only durable artifact) | Re-discover from git history | Hook re-fires on resume |
| ECC | Session markdown files (7-day retention) | Load latest session on start | Previous session injected into context |
| Agent Orchestrator | Metadata files + archived metadata | Workspace + runtime recreation | claude --resume <session-id> |
| Maestro | electron-store (JSON files with 2s debounce) | Session discovery per provider | Provider-specific resume flags |

Consensus: Every project addresses the need for some form of session persistence, but approaches vary dramatically. The universal agreement is that a new session should somehow benefit from what happened in previous sessions.

Divergence: Superpowers is explicitly zero-persistence by design ("Clean separation -- no persistent state to corrupt"). ECC persists structured session summaries. Agent Orchestrator maintains full session metadata with archive capability. Maestro has the richest persistence with per-provider session discovery and SQLite analytics.

Best implementation: Maestro has the most comprehensive session management (multi-provider session discovery, resume support for 4 different agents, persistent analytics). Agent Orchestrator has the best session lifecycle state machine (16 states with deterministic transitions). ECC's transcript-based session summary extraction is the most practical for cross-session context transfer.

Source: superpowers-deep-analysis.md Section 10; everything-claude-code-deep-analysis.md Section 10; agent-orchestrator-deep-analysis.md Section 9; maestro-deep-analysis.md Section 10
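Agent Orchestrator's flat key=value metadata format is simple enough to sketch as a codec. The field names and the tolerance for malformed lines are assumptions layered on the description above:

```typescript
// Flat key=value session metadata codec in the style of AO's metadata
// files. Values must not contain newlines in this simple format.
function serializeMetadata(meta: Record<string, string>): string {
  return Object.entries(meta)
    .map(([key, value]) => `${key}=${value}`)
    .join("\n");
}

function parseMetadata(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const line of text.split("\n")) {
    const idx = line.indexOf("=");
    if (idx === -1) continue; // tolerate blank/corrupt lines during crash recovery
    out[line.slice(0, idx)] = out[line.slice(0, idx)] === undefined
      ? line.slice(idx + 1)
      : out[line.slice(0, idx)]; // first occurrence wins
  }
  return out;
}
```

The crash-safety property comes from the format, not the code: each write is a whole small file, so a partially written file loses at most one session's metadata, never the whole store.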

1.6 Human-in-the-Loop Controls

Approval gates, escalation, and intervention.

| Project | Approval Gates | Escalation | Intervention Mechanism |
| --- | --- | --- | --- |
| Superpowers | Brainstorming approval, plan review, batch checkpoints | 3+ failed fixes -> escalate to human | Conversational (agent asks questions) |
| ECC | Plan confirmation ("yes"/"modify"/"different approach") | None formal | Hooks (warnings, blockers) |
| Agent Orchestrator | Dashboard for review/merge, send-message | Reaction engine with timed escalation | CLI ao send, dashboard messages |
| Maestro | Read-only mode, pause/resume, error modals | Agent error handling with recovery options | Queue management, Group Chat |

Consensus: All projects provide mechanisms for humans to review and intervene. Every project has at least one approval gate before code changes begin (plan approval in Superpowers/ECC, issue assignment in Agent Orchestrator, document creation in Maestro).

Divergence: The critical split is between synchronous gates (Superpowers blocks until human approves design) and asynchronous oversight (Agent Orchestrator notifies human when attention needed, Maestro shows error modals). Superpowers' subagent-driven development mode explicitly reduces human involvement ("Faster iteration, no human-in-loop between tasks"), while Agent Orchestrator's reaction engine keeps humans informed throughout.

Best implementation: Agent Orchestrator's reaction engine with configurable escalation is the most operationally mature HITL system. Superpowers' escalation triggers ("If 3+ fixes failed, STOP and question the architecture") are the most context-appropriate. Maestro's read-only mode toggle is the most user-friendly intervention mechanism. A canonical harness needs configurable escalation with both timed escalation (Agent Orchestrator style) and behavioral escalation (Superpowers style).

Source: superpowers-deep-analysis.md Section 8; everything-claude-code-deep-analysis.md Section 8; agent-orchestrator-deep-analysis.md Section 7; maestro-deep-analysis.md Section 8

1.7 Hooks and Automation Surface

| Project | Hook Events | Automation API | Extensibility |
| --- | --- | --- | --- |
| Superpowers | SessionStart only | claude -p (headless) | Skills as markdown files |
| ECC | 6 event types, 15 hooks total | Plugin marketplace | Agents, skills, commands, rules, contexts |
| Agent Orchestrator | Reaction engine (33 event types) | CLI (ao commands) | 8-slot plugin architecture |
| Maestro | IPC handlers (30+ modules), CLI | maestro-cli with JSONL output | Encore Features (precursor to plugins) |

Consensus: All projects recognize the need for event-driven automation. The granularity varies -- Superpowers has one hook event, ECC has 6, Agent Orchestrator has 33 distinct event types.

Best implementation: Agent Orchestrator's reaction engine is the most composable (event + condition + action + retries + escalation). ECC's hook architecture is the most battle-tested for pre/post tool use interception.

Source: superpowers-deep-analysis.md Section 13; everything-claude-code-deep-analysis.md Section 13; agent-orchestrator-deep-analysis.md Section 12; maestro-deep-analysis.md Section 13
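The reaction-engine shape described above (event + condition + action + retries + escalation) can be sketched as one dispatch function. Event names and fields here are illustrative, not AO's actual schema:

```typescript
// Composable reaction rule in AO's shape: match an event, check a
// condition, attempt an action with retries, escalate on exhaustion.
type HarnessEvent = { type: string; session: string; payload?: unknown };

interface Reaction {
  on: string;                               // event type to match
  when?: (e: HarnessEvent) => boolean;      // optional guard condition
  action: (e: HarnessEvent) => boolean;     // returns true on success
  maxRetries: number;
  escalate: (e: HarnessEvent) => void;      // e.g. notify a human channel
}

function dispatch(
  reaction: Reaction,
  event: HarnessEvent
): "handled" | "escalated" | "skipped" {
  if (event.type !== reaction.on) return "skipped";
  if (reaction.when && !reaction.when(event)) return "skipped";
  // One initial attempt plus maxRetries retries.
  for (let attempt = 0; attempt <= reaction.maxRetries; attempt++) {
    if (reaction.action(event)) return "handled";
  }
  reaction.escalate(event); // automation gave up; bring in the human
  return "escalated";
}
```

The value of this shape is that escalation is a first-class outcome rather than an afterthought: every rule declares up front what happens when automation fails.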

1.8 Cost/Usage Governance

| Project | Tracking | Budgets | Optimization |
| --- | --- | --- | --- |
| Superpowers | Post-hoc token analysis script | None | Cache utilization via sequential execution |
| ECC | Claude's /cost command, token optimization docs | None | Model selection, thinking token limits, subagent model |
| Agent Orchestrator | JSONL-based cost extraction per session | None | Rough cost estimates with Sonnet 4.5 pricing |
| Maestro | SQLite analytics, per-session USD tracking, usage dashboard | None | WakaTime integration, activity heatmaps |

Consensus: Every project acknowledges cost as a concern. No project implements cost budgets, spending alerts, or automatic shutoff. This is the most universally underserved feature area.

Best implementation: Maestro's SQLite-backed analytics with dashboard visualizations is far ahead of the others. ECC's token optimization documentation provides the most actionable cost reduction guidance.

Source: superpowers-deep-analysis.md Section 15; everything-claude-code-deep-analysis.md Section 15; agent-orchestrator-deep-analysis.md Section 14; maestro-deep-analysis.md Section 15
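AO-style JSONL cost extraction amounts to summing a cost field across log lines. The `costUSD` field name is an assumption about the log schema, not a documented contract:

```typescript
// Sum per-session cost from a JSONL transcript, tolerating the partial
// or corrupt lines that interrupted sessions leave behind.
function totalCostUSD(jsonl: string): number {
  let total = 0;
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    try {
      const entry = JSON.parse(line);
      if (typeof entry.costUSD === "number") total += entry.costUSD;
    } catch {
      // Skip unparseable lines rather than failing the whole report.
    }
  }
  return total;
}
```

A budget enforcer -- the feature no project ships -- would only need to call this per session and compare against a configured ceiling before allowing the next spawn.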

1.9 Security and Compliance

| Project | Shell Safety | Secret Scanning | Sandboxing | Auth |
| --- | --- | --- | --- | --- |
| Superpowers | Minimal (read-only hook) | None | None (advisory worktrees) | None |
| ECC | Input validation regex | AgentShield (external) | None (tool scoping only) | None |
| Agent Orchestrator | execFile everywhere (never exec) | gitleaks in CI, dependency review | None (worktree isolation) | None (dashboard unauthenticated) |
| Maestro | execFileNoThrow, spawn({shell: false}) | None | None (user privileges) | None (web server unauthenticated) |

Consensus: Every project runs AI agents with the same privileges as the user. No project implements container-based sandboxing, network isolation, or resource limits. Shell injection prevention is the most common security measure (Agent Orchestrator and Maestro both enforce execFile over exec).

Divergence: Agent Orchestrator is the only project with CI-integrated secret scanning (gitleaks). ECC is the only project with an external security audit tool (AgentShield). None have runtime sandboxing.

Best implementation: Agent Orchestrator's shell security discipline (execFile always, path traversal prevention, symlink validation) is the most systematic. ECC's security reviewer agent provides the most comprehensive code-level security review.

Source: superpowers-deep-analysis.md Section 12; everything-claude-code-deep-analysis.md Section 12; agent-orchestrator-deep-analysis.md Section 11; maestro-deep-analysis.md Section 12
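The execFile discipline matters because exec() routes its command string through a shell, so untrusted input (a branch name, an issue title) becomes executable. A sketch follows; the ref validator is an illustration, not AO's exact rule set:

```typescript
import { execFile } from "node:child_process";

// Reject shell metacharacters and git option injection (leading "-").
// Illustrative allowlist, stricter than git's own ref rules.
function isSafeRef(ref: string): boolean {
  return /^[A-Za-z0-9._/-]+$/.test(ref) && !ref.startsWith("-");
}

function checkout(branch: string): void {
  if (!isSafeRef(branch)) throw new Error(`unsafe ref: ${branch}`);
  // No shell is spawned: the branch name reaches git as a single argv
  // entry, so `main; rm -rf ~` can never execute as a command.
  execFile("git", ["checkout", branch], (err) => {
    if (err) console.error("checkout failed:", err.message);
  });
}
```

Contrast with the unsafe pattern: exec(`git checkout ${branch}`) hands the whole string to /bin/sh, where `;`, `$()`, and backticks are live syntax.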

1.10 Provider Compatibility and Extensibility

| Project | Providers | Extension Model |
| --- | --- | --- |
| Superpowers | Claude Code, Cursor, Codex, OpenCode | Skills (markdown), tool mapping per platform |
| ECC | Claude Code (primary), Cursor, OpenCode | Plugin marketplace, language-specific rules |
| Agent Orchestrator | Claude Code (primary), Codex, Aider, OpenCode | 8-slot plugin architecture with manifests |
| Maestro | Claude Code, Codex, OpenCode, Factory Droid (+ 3 planned) | Agent definitions with capability flags, output parsers |

Consensus: Claude Code is the primary supported agent across all four projects. All projects support, or plan to support, multiple providers. The approach to multi-provider support varies from markdown tool mapping (Superpowers) to a full plugin architecture (Agent Orchestrator, Maestro).

Best implementation: Maestro's agent definition architecture (declarative argument builder, capability flags, per-agent output parsers, per-agent error patterns) is the most extensible. Agent Orchestrator's 8-slot plugin system is the most architecturally clean for capability abstraction.

Source: superpowers-deep-analysis.md Section 17; everything-claude-code-deep-analysis.md Section 17; agent-orchestrator-deep-analysis.md Section 16; maestro-deep-analysis.md Section 17


2. Consensus Patterns

Patterns appearing in 3 or more projects, representing likely essential features of any production harness.

2.1 Git Worktree Isolation (4/4 projects)

What it is: Using git worktrees to provide each agent or task with an isolated filesystem checkout while sharing the same git object store.

Projects: Superpowers (skill: using-git-worktrees), ECC (documented pattern with git worktree add), Agent Orchestrator (plugin: workspace-worktree), Maestro (IPC handler: git:worktreeSetup).

How they differ:

  • Superpowers treats worktree creation as a skill the agent follows, with directory selection priority and safety verification (.gitignore check).
  • ECC documents worktrees as a recommended parallelization pattern but does not automate creation.
  • Agent Orchestrator automates worktree creation per-session with git fetch origin before branching, post-create hooks (npm install), and symlink support for shared resources (node_modules).
  • Maestro provides UI-driven worktree management with one-click PR creation from worktree branches.

Canonical form: Automated worktree creation per agent/task with: (1) fetch origin before branching, (2) configurable post-create commands, (3) shared resource symlinking, (4) safety verification (.gitignore check), (5) cleanup on task completion, (6) branch naming convention tied to issue tracker.

Why this consensus exists: Git worktrees are the sweet spot between full isolation (clones) and no isolation (shared working directory). They share the git object store (low disk overhead), provide separate working trees and indexes (true filesystem isolation), and work with git's existing branch model (natural merge path).

Source: superpowers-deep-analysis.md Section 7.1; everything-claude-code-deep-analysis.md Section 7.1; agent-orchestrator-deep-analysis.md Section 6.1; maestro-deep-analysis.md Section 7.1
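The six-step canonical form above can be sketched as an ordered argv list, matching the execFile-over-shell discipline the same projects favor. Paths, the branch naming scheme, and the npm post-create hook are illustrative assumptions:

```typescript
// Canonical worktree setup as an ordered list of argv arrays. Each entry
// is meant to be run via execFile (no shell), in sequence, fail-fast.
function worktreeCommands(issueId: string, repoRoot: string): string[][] {
  const branch = `agent/${issueId}`;            // assumed naming convention
  const dir = `${repoRoot}/.worktrees/${issueId}`;
  return [
    ["git", "fetch", "origin"],                 // (1) fresh base before branching
    ["git", "worktree", "add", "-b", branch, dir, "origin/main"], // isolated checkout
    ["ln", "-s", `${repoRoot}/node_modules`, `${dir}/node_modules`], // (3) shared deps
    ["npm", "install", "--prefix", dir],        // (2) post-create hook
  ];
}
```

Cleanup on completion (step 5) is the inverse: `git worktree remove` plus `git branch -d` once the PR merges, which the sketch omits.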

2.2 Plan-Before-Execute Pipeline (4/4 projects)

What it is: Requiring a planning phase that produces a reviewable plan before any code is written.

Projects: Superpowers (brainstorming -> writing-plans), ECC (/plan command -> planner agent), Agent Orchestrator (issue -> system prompt), Maestro (Auto Run documents -> Playbooks).

How they differ:

  • Superpowers mandates brainstorming BEFORE planning, with a hard gate preventing implementation before design approval. Plans include exact file paths and complete code snippets.
  • ECC uses a planner agent (Opus model) that produces structured plans with overview, requirements, architecture, steps, testing, risks, and success criteria. Explicit confirmation required.
  • Agent Orchestrator uses issue content as the implicit plan, with the orchestrator agent deciding how to decompose it.
  • Maestro uses markdown documents with checkbox items as plans, created manually or via AI wizard.

Canonical form: A planning phase that: (1) explores alternatives before committing, (2) produces a machine-parseable plan document, (3) requires human approval, (4) includes success criteria and testing strategy, (5) decomposes into individually executable tasks.

Why this consensus exists: Unplanned AI agent work consistently produces scope creep, architectural mismatches, and incomplete implementations. Planning constrains the agent's tendency to solve the problem it wants to solve rather than the one specified.

Source: superpowers-deep-analysis.md Section 4.2-4.4; everything-claude-code-deep-analysis.md Section 4.1-4.3; agent-orchestrator-deep-analysis.md Section 3.1; maestro-deep-analysis.md Section 4.2-4.3
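The canonical plan can be sketched as a typed document with an explicit approval gate. Field names and the markdown layout are assumptions blending the four projects' conventions (checkbox tasks from Maestro, success criteria from ECC, the hard approval gate from Superpowers):

```typescript
// Machine-parseable plan document with a human approval gate.
interface Plan {
  title: string;
  tasks: string[];            // individually executable units
  successCriteria: string[];
  testingStrategy: string;
  approved: boolean;          // flipped only by a human
}

function renderPlan(p: Plan): string {
  return [
    `# ${p.title}`,
    ``,
    `## Tasks`,
    ...p.tasks.map((t) => `- [ ] ${t}`), // checkbox per task, Maestro-style
    ``,
    `## Success Criteria`,
    ...p.successCriteria.map((c) => `- ${c}`),
    ``,
    `## Testing Strategy`,
    p.testingStrategy,
  ].join("\n");
}

function canExecute(p: Plan): boolean {
  // Hard gate: no implementation before human approval of a non-empty plan.
  return p.approved && p.tasks.length > 0;
}
```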

2.3 Markdown-Native Configuration (4/4 projects)

What it is: Using markdown with optional YAML frontmatter as the primary format for agent instructions, skills, prompts, and workflow definitions.

Projects: All four use markdown extensively for defining agent behavior.

How they differ:

  • Superpowers is 100% markdown skills with zero executable orchestration code.
  • ECC uses markdown for agents (13), skills (44), commands (32), rules (24), and contexts (3).
  • Agent Orchestrator uses markdown for system prompts and agent rules files.
  • Maestro uses markdown for 24 system prompts, Auto Run documents, and CLAUDE.md ecosystem.

Canonical form: Markdown as the LLM-native configuration format, with: (1) YAML frontmatter for machine-parseable metadata, (2) prose content for behavioral instructions, (3) structured sections for checklists and procedures, (4) template variables for dynamic content injection.

Why this consensus exists: Markdown is the format LLMs understand best. It is human-readable, version-controllable, and requires no build step. YAML frontmatter provides structured metadata without sacrificing markdown readability.

Source: superpowers-deep-analysis.md Section 3.1; everything-claude-code-deep-analysis.md Section 2.2; agent-orchestrator-deep-analysis.md Section 8.1; maestro-deep-analysis.md Section 4.5
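A minimal frontmatter splitter shows how little machinery the format needs. This is a sketch handling only flat `key: value` pairs; real implementations use a proper YAML parser:

```typescript
// Split a markdown-native config file into YAML-frontmatter metadata
// (machine-parseable) and prose body (behavioral instructions).
function parseSkill(doc: string): {
  meta: Record<string, string>;
  body: string;
} {
  const m = doc.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!m) return { meta: {}, body: doc }; // no frontmatter: whole doc is body
  const meta: Record<string, string> = {};
  for (const line of m[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { meta, body: m[2] };
}
```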

2.4 Agent Tool Scoping / Principle of Least Privilege (3/4 projects)

What it is: Restricting which tools each agent can access based on its role.

Projects: ECC (agent frontmatter tools field), Agent Orchestrator (agent plugin capabilities), Maestro (per-agent capability flags and read-only mode). Superpowers does not implement this (single agent model).

How they differ:

  • ECC assigns specific tool arrays per agent: planner gets ["Read", "Grep", "Glob"] (read-only), tdd-guide gets all tools including Write, Edit, Bash.
  • Agent Orchestrator scopes capabilities per agent plugin, though in practice all agents get full access within their isolated worktree.
  • Maestro provides per-tab read-only mode toggles and agent-specific capability flags (20 flags per agent).

Canonical form: Each agent role gets: (1) a declared set of allowed tools, (2) read-only agents for planning/review cannot modify code, (3) full-access agents for implementation/debugging, (4) runtime enforcement of tool restrictions.

Why this consensus exists: Unrestricted tool access leads to agents taking unexpected actions. A planning agent that can write files will often start implementing instead of planning. Tool scoping enforces the separation of concerns between planning, implementation, and review.

Source: everything-claude-code-deep-analysis.md Section 5.2; agent-orchestrator-deep-analysis.md Section 2.2; maestro-deep-analysis.md Section 8.1
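A minimal sketch of runtime enforcement for this pattern, in TypeScript. The role names and tool lists follow the ECC examples above; the function itself is hypothetical, not code from any project:

```typescript
// Hypothetical tool allow-list check for role-scoped agents.
type Role = "planner" | "tdd-guide";

// Per-role tool grants: planners are read-only, implementers get full access.
const ALLOWED_TOOLS: Record<Role, readonly string[]> = {
  planner: ["Read", "Grep", "Glob"],
  "tdd-guide": ["Read", "Grep", "Glob", "Write", "Edit", "Bash"],
};

// The harness would call this before dispatching any tool request, rejecting
// calls that fall outside the agent's declared grant.
function isToolAllowed(role: Role, tool: string): boolean {
  return ALLOWED_TOOLS[role].includes(tool);
}
```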

2.5 Handoff Documents Between Pipeline Stages (3/4 projects)

What it is: Structured documents passed between agents or pipeline stages to transfer context.

Projects: ECC (handoff protocol: Context, Findings, Files Modified, Open Questions, Recommendations), Superpowers (full task text passed to subagents with scene-setting context), Maestro (Group Chat messages with moderator synthesis).

How they differ:

  • ECC defines a formal handoff template with 5 sections passed between sequential agents.
  • Superpowers uses a controller-curated approach: the main agent reads the plan once, extracts tasks, and provides full task text directly to subagents (no file references).
  • Maestro's Group Chat moderator synthesizes responses from multiple agents into a coherent summary.

Canonical form: Between pipeline stages, pass: (1) summary of work completed, (2) files modified, (3) open questions and blockers, (4) recommendations for next stage, (5) relevant context the next stage needs but may not discover independently.

Why this consensus exists: Subagents and sequential agents lack context from previous stages. Without explicit handoffs, each stage must rediscover what was already learned, wasting tokens and introducing inconsistency.

Source: everything-claude-code-deep-analysis.md Section 4.2; superpowers-deep-analysis.md Section 5.5; maestro-deep-analysis.md Section 5.1
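A handoff document following the canonical five-section form might look like this. The content is invented; the section names follow ECC's handoff protocol as described above:

```markdown
## Handoff: implement-auth -> review-auth

### Context
Implemented JWT-based session auth per task 3 of the approved plan.

### Findings
The existing middleware already validates headers; reused it rather than adding a new layer.

### Files Modified
- src/auth/session.ts (new)
- src/middleware/validate.ts (extended)

### Open Questions
- Should refresh tokens rotate on every use, or only on expiry?

### Recommendations
Review the token-expiry edge cases first; they are the least tested path.
```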

2.6 SessionStart Bootstrap Injection (3/4 projects)

What it is: Injecting behavioral instructions into the agent's context at session start (and on resume/compact events).

Projects: Superpowers (session-start hook injects using-superpowers skill), ECC (session-start.js loads previous sessions, learned skills, package manager), Agent Orchestrator (system prompt composed from 3 layers at spawn time).

How they differ:

  • Superpowers wraps injected content in <EXTREMELY_IMPORTANT> tags and fires on startup/resume/clear/compact.
  • ECC loads the most recent session summary, reports learned skills, and detects package manager.
  • Agent Orchestrator composes a system prompt from base prompt + config context + user rules at spawn time (one-shot, not re-injected).

Canonical form: At session start: (1) inject core behavioral instructions, (2) restore relevant context from previous sessions, (3) establish project-specific configuration, (4) re-inject on context compaction events to prevent instruction loss.

Why this consensus exists: Without session-start injection, the agent begins each session as a blank slate with no knowledge of project conventions, workflow requirements, or previous work. The compaction re-injection is critical because context compaction (which occurs during long sessions) can lose the original instructions.

Source: superpowers-deep-analysis.md Section 3.5; everything-claude-code-deep-analysis.md Section 10.1; agent-orchestrator-deep-analysis.md Section 8.1
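The canonical bootstrap can be sketched as a small composition function. This is a hypothetical sketch: the three inputs loosely mirror Agent Orchestrator's layered prompt, and the caller is assumed to re-invoke it on startup, resume, clear, and compact events, as Superpowers' hook does:

```typescript
// Hypothetical session-start context composer.
interface BootstrapInput {
  coreInstructions: string;   // behavioral skills / workflow rules
  projectConfig: string;      // project-specific conventions
  previousSummary?: string;   // restored from the last session, if any
}

// Re-run on compaction events so instructions survive context loss.
function composeBootstrap(input: BootstrapInput): string {
  const parts = [input.coreInstructions, input.projectConfig];
  if (input.previousSummary) {
    parts.push(`Previous session summary:\n${input.previousSummary}`);
  }
  return parts.join("\n\n");
}
```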

2.7 TDD-First Development Mandate (2/4 projects)

What it is: Requiring tests to be written before implementation code.

Projects: Superpowers ("Iron Law: NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST") and ECC (rules/common/testing.md: RED-GREEN-REFACTOR mandatory, 80% coverage). Maestro has no TDD enforcement, and Agent Orchestrator delegates to the agent's own skills.

How they differ:

  • Superpowers has the most aggressive enforcement with an 11-entry rationalization prevention table, a 12-entry red flags list, and the maxim "Write code before the test? Delete it. Start over."
  • ECC enforces TDD through rules files and a tdd-guide agent (Sonnet model, full tool access).
  • Agent Orchestrator and Maestro do not enforce TDD, relying on the underlying agents' own practices.

Canonical form: A TDD mandate that: (1) requires a failing test before any production code, (2) includes anti-rationalization measures for common excuses, (3) provides a dedicated TDD enforcement agent/skill, (4) verifies test failure before allowing implementation.

Why this consensus exists: AI agents consistently skip testing when not explicitly required. The bias toward "just make it work" is strong, and without TDD enforcement, agents produce untested code that appears functional but fails on edge cases.

Source: superpowers-deep-analysis.md Section 11.1; everything-claude-code-deep-analysis.md Section 11.5
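Point (4) of the canonical form, verifying test failure before allowing implementation, can be reduced to a tiny gate. None of the projects enforce this in code; the sketch below is purely illustrative:

```typescript
// Hypothetical RED-phase gate: production edits are allowed only after the new
// test has been run and observed to fail.
type TestRun = { passed: boolean; failureMessage?: string };

function mayWriteProductionCode(newTestRun: TestRun | null): boolean {
  // No test run yet, or the test already passes: implementation is premature.
  if (newTestRun === null || newTestRun.passed) return false;
  return true;
}
```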

2.8 Cross-Platform Compatibility (3/4 projects)

What it is: Supporting multiple operating systems and development environments.

Projects: Superpowers (Windows polyglot wrapper, macOS/Linux bash), ECC (all hooks in Node.js for cross-platform, CI matrix of 3 OS x 3 Node x 4 PM), Agent Orchestrator (macOS/Linux only, tmux dependency excludes Windows), Maestro (macOS, Linux, Windows builds via Electron).

How they differ:

  • Superpowers achieved cross-platform via a polyglot cmd/bash wrapper, which caused numerous Windows bugs (#518, #504, #491, etc.).
  • ECC deliberately chose Node.js for all hooks to avoid bash dependency, achieving the cleanest cross-platform story.
  • Agent Orchestrator is Unix-only due to tmux dependency.
  • Maestro uses Electron for cross-platform desktop support with platform-specific build targets.

Canonical form: Use Node.js (not bash) for hook scripts and automation. Support macOS, Linux, and Windows. Test on all platforms in CI.

Why this consensus exists: Developers use diverse platforms. Bash-only automation excludes Windows developers. ECC's migration from bash to Node.js and Superpowers' extensive Windows debugging both prove that cross-platform support is essential but costly.

Source: superpowers-deep-analysis.md Section 19.1; everything-claude-code-deep-analysis.md Section 16.5; maestro-deep-analysis.md Section 18.2

2.9 Fail-Safe Hook Design (3/4 projects)

What it is: Designing hooks and automation scripts so that failures degrade gracefully rather than blocking the main workflow.

Projects: Superpowers (hook failure -> no bootstrap, but plugin still works), ECC (all hooks exit 0 on error, blocking only via intentional exit 2), Agent Orchestrator (enrichment timeouts -> stale data, not hangs).

How they differ:

  • Superpowers has exactly one hook, and its failure means no skill awareness (silent degradation).
  • ECC implements a consistent pattern: main().catch(err => { console.error(...); process.exit(0); }) in every hook script.
  • Agent Orchestrator uses timeouts for dashboard enrichment (3s metadata, 4s PR data) so slow APIs don't block the UI.

Canonical form: All hooks: (1) catch all errors, (2) log errors to stderr (visible but not disruptive), (3) exit 0 on error (unless intentionally blocking with exit 2), (4) have timeouts for external calls, (5) accept incomplete data over blocked workflows.

Why this consensus exists: A hook that crashes or hangs blocks the entire agent session. The cost of a failed hook (missing enhancement) is far lower than the cost of a hung session (lost work, wasted tokens).

Source: superpowers-deep-analysis.md Section 13.3; everything-claude-code-deep-analysis.md Section 13.3.3; agent-orchestrator-deep-analysis.md Section 5.4
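The canonical pattern maps onto a small Node-style sketch. Only the catch-log-exit-0 shape is taken directly from ECC's hooks; the helper names and the timeout wrapper are hypothetical:

```typescript
// Hypothetical fail-safe hook wrapper. Exit code 0 = continue; an intentional
// block is the only reason to return 2.
function runHook(body: () => number): number {
  try {
    return body(); // return 2 here only when the hook intends to block
  } catch (err) {
    console.error("hook failed (non-blocking):", err); // visible, not disruptive
    return 0; // degrade gracefully: never block the session on a hook crash
  }
}

// Bound external calls so a slow API yields stale/fallback data instead of a
// hung session (mirrors Agent Orchestrator's enrichment timeouts).
async function withTimeout<T>(call: Promise<T>, ms: number, fallback: T): Promise<T> {
  const timeout = new Promise<T>((resolve) => setTimeout(() => resolve(fallback), ms));
  return Promise.race([call, timeout]);
}
```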


3. Divergence Points

3.1 Agent-as-Orchestrator vs. Runtime Orchestrator

The tradeoff: Should the AI agent itself orchestrate the workflow (guided by instructions), or should external code manage the workflow (dispatching agents as workers)?

  • Agent-as-orchestrator (Superpowers, ECC): Zero executable code, pure markdown, agent self-governs. Simple, portable, works across platforms.
  • Runtime orchestrator (Agent Orchestrator, Maestro): External code manages state, enforces transitions, provides observability. Reliable, auditable, recoverable.

Implications:

  • Agent-as-orchestrator requires no infrastructure but provides no enforcement. If the agent ignores instructions, nothing prevents it. There is no observability, no audit trail, and no recovery mechanism. However, it is vastly simpler to implement and distribute.
  • Runtime orchestrator provides enforcement (state machines), observability (event logging), and recovery (session restore) but requires significant infrastructure (processes, databases, APIs). It introduces complexity, dependencies, and maintenance burden.

Resolution for canonical harness: Use a runtime orchestrator for lifecycle management and enforcement, but embed agent-as-orchestrator behavioral skills for the agent's internal process. The runtime ensures the agent goes through the right stages; the skills ensure the agent behaves correctly within each stage.

Source: superpowers-deep-analysis.md Section 5.1, Section 22.3 (item 12); agent-orchestrator-deep-analysis.md Section 4

3.2 Advisory vs. Enforced Quality Gates

The tradeoff: Should quality gates be enforced by prompt engineering (advisory) or by runtime code (enforced)?

  • Advisory (Superpowers, ECC): Flexible, adaptable, works within the agent's own reasoning. Anti-rationalization engineering can be very effective.
  • Enforced (Agent Orchestrator): Deterministic, auditable, cannot be bypassed. But rigid, less adaptable to novel situations.
  • Neither (Maestro): Quality delegated entirely to underlying agents. Pass-through design avoids this tension.

Implications:

  • Advisory gates are only as reliable as the agent's compliance. Superpowers has invested enormously in anti-rationalization engineering (40+ rationalization entries, 7 pressure types, TDD for skills) but acknowledges the fundamental limitation: "there is no enforcement mechanism beyond the agent's willingness to follow instructions."
  • Enforced gates guarantee compliance but can be overly rigid. Agent Orchestrator's CI-based enforcement only checks at the PR level, not during individual tasks. An agent could commit untested code and only discover failures at CI time.

Resolution for canonical harness: Layer both approaches. Enforced gates for critical checkpoints (tests must pass before PR, CI must pass before merge, budget must not be exceeded). Advisory gates for quality guidance (code review standards, TDD methodology, debugging protocol). The enforced layer catches catastrophic failures; the advisory layer improves quality incrementally.

Source: superpowers-deep-analysis.md Section 11.7; agent-orchestrator-deep-analysis.md Section 19
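The layered resolution can be sketched as a two-tier gate evaluator (hypothetical types and names; the enforced/advisory split follows the resolution above):

```typescript
// Hypothetical two-tier quality gate evaluation: enforced gates block the
// pipeline, advisory gates only accumulate warnings.
interface Gate {
  name: string;
  enforced: boolean;     // true = hard checkpoint, false = advisory guidance
  check: () => boolean;  // true = gate satisfied
}

interface GateResult { blocked: boolean; warnings: string[] }

function evaluateGates(gates: Gate[]): GateResult {
  const warnings: string[] = [];
  let blocked = false;
  for (const gate of gates) {
    if (gate.check()) continue;
    if (gate.enforced) blocked = true;   // e.g. tests must pass before PR
    else warnings.push(gate.name);       // e.g. code review style guidance
  }
  return { blocked, warnings };
}
```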

3.3 Single-Session vs. Multi-Session Architecture

The tradeoff: Should the harness operate within a single AI session (subagents via Task tool) or manage multiple independent sessions?

  • Single-session with subagents (Superpowers): Fresh context per subagent, no inter-session coordination needed, natural conversation flow.
  • Multi-session fleet management (Agent Orchestrator, Maestro): True parallelism, resource isolation, independent failure domains.
  • Hybrid (ECC): Sequential pipeline within a session, manual parallelism across sessions.

Implications:

  • Single-session benefits from Claude Code's prompt cache (subsequent subagents get cache hits from earlier ones). Superpowers documents this: 1.38M cache read tokens vs. 62 direct input tokens in a test run. But it limits parallelism and creates a single point of failure.
  • Multi-session enables true parallelism but loses cache benefits and requires coordination infrastructure. Agent Orchestrator's polling-based coordination adds 30-second latency to state changes.

Resolution for canonical harness: Support both modes. Single-session subagent execution for sequential tasks within a feature (cache-efficient). Multi-session fleet management for parallel feature development (resource-efficient). The choice should be per-task, not architecture-wide.

Source: superpowers-deep-analysis.md Section 15.3; agent-orchestrator-deep-analysis.md Section 5.1; maestro-deep-analysis.md Section 6

3.4 Stateless vs. Stateful Design

The tradeoff: Should the harness maintain persistent state, or should it be stateless with git as the only durable artifact?

  • Stateless (Superpowers): No state to corrupt, no database to maintain, git is the source of truth. Clean and simple.
  • Stateful (ECC light, Agent Orchestrator medium, Maestro heavy): Session history, progress tracking, analytics, recovery. Rich operational capability.

Implications:

  • Stateless means recovery from a crash requires manual intervention (re-read git history, re-create plan). There is no progress tracking across sessions, no cost analytics, no learning from past behavior.
  • Stateful enables session resume, cost tracking, pattern learning (ECC's instinct system), and fleet-wide analytics (Maestro's usage dashboard). But it introduces data management complexity, corruption risk, and migration burden.

Resolution for canonical harness: Stateful with SQLite (embedded, zero-config, ACID-compliant). Maestro's approach is the right one, provided the canonical harness also adopts the resilience features Maestro already ships: daily backups, corruption detection, and automated recovery.

Source: superpowers-deep-analysis.md Section 10.2; maestro-deep-analysis.md Section 15.3

3.5 Configuration-Centric vs. Application-Centric

The tradeoff: Should the harness be a collection of configurations for existing tools, or a standalone application?

  • Configuration layer (Superpowers, ECC): Leverages existing tools (Claude Code, Cursor). Lightweight, no new runtime. Easy to adopt.
  • Standalone application (Agent Orchestrator, Maestro): Full control over UX, lifecycle, analytics. Professional-grade experience. Harder to adopt.

Implications:

  • Configuration layers benefit from the rapid evolution of underlying tools. When Claude Code adds features, Superpowers and ECC automatically benefit. But they cannot enforce constraints that the underlying tool does not support.
  • Standalone applications control their own destiny but must keep up with underlying agent evolution (new output formats, new capabilities, new error patterns). Maestro's 1015-line error pattern file demonstrates this maintenance burden.

Resolution for canonical harness: This is a genuine tradeoff without a universal answer. For individual developers, configuration layers (Superpowers, ECC) provide the best value. For teams managing agent fleets, standalone applications (Agent Orchestrator, Maestro) are necessary. A canonical harness should therefore be both: a configuration layer that works standalone, and an optional orchestration layer for fleet management.

Source: superpowers-deep-analysis.md Section 1; everything-claude-code-deep-analysis.md Section 1; agent-orchestrator-deep-analysis.md Section 1; maestro-deep-analysis.md Section 1


4. Canonical Harness Architecture

4.1 Architecture Diagram

                    HUMAN DEVELOPER
                         |
           +-------------+-------------+
           |             |             |
     [CLI (ao)]   [Web Dashboard]  [Desktop App]
           |             |             |
           +------+------+------+------+
                  |             |
                  v             v
    +============================+
    |      ORCHESTRATION CORE    |
    |                            |
    |  +--------------------+   |
    |  | Workflow Engine     |   |  <-- State machine (from Agent Orchestrator)
    |  | (Plan->Execute->   |   |      with behavioral skills (from Superpowers)
    |  |  Review->Merge)    |   |
    |  +--------+-----------+   |
    |           |               |
    |  +--------v-----------+   |
    |  | Reaction Engine     |   |  <-- Event-driven automation (from Agent Orchestrator)
    |  | (event->action,    |   |      with escalation and retry
    |  |  retries, escalate)|   |
    |  +--------+-----------+   |
    |           |               |
    |  +--------v-----------+   |
    |  | Quality Gate Engine |   |  <-- Enforced gates (CI, tests, budget)
    |  | (TDD, Review,      |   |      + advisory skills (from Superpowers)
    |  |  Verification)     |   |
    |  +--------+-----------+   |
    |           |               |
    |  +--------v-----------+   |
    |  | Context Manager     |   |  <-- Strategic compaction (from ECC)
    |  | (progressive load, |   |      + progressive disclosure (from Superpowers)
    |  |  compaction, cache) |   |
    |  +--------+-----------+   |
    |           |               |
    +===========+===============+
                |
    +-----------+------------+
    |           |            |
    v           v            v
+--------+ +--------+ +----------+
| Plugin | | Plugin | | Plugin   |
| Slot:  | | Slot:  | | Slot:    |
| Agent  | | Work-  | | Tracker  |
| (Claude| | space  | | (GitHub  |
|  Code, | | (git   | |  Issues, |
|  Codex,| | work-  | |  Linear) |
|  Open- | | tree,  | +----------+
|  Code) | | clone) |
+--------+ +--------+
    |           |
    v           v
+--------+ +--------+
| Plugin | | Plugin |
| Slot:  | | Slot:  |
| Runtime| |Notifier|
| (tmux, | | (Slack,|
|  PTY,  | | desktop|
|  SSH)  | | webhook|
+--------+ +--------+

PERSISTENT LAYER:
+========================+
| SQLite Database        |  <-- Analytics, session history (from Maestro)
| Session Metadata       |  <-- Flat files with atomic ops (from Agent Orchestrator)
| Skill/Prompt Library   |  <-- Markdown skills (from Superpowers + ECC)
| Learned Patterns       |  <-- Instinct system (from ECC)
+========================+

4.2 Component Inventory

  • Workflow Engine: state machine managing plan->execute->review->merge transitions. Best source: Agent Orchestrator (16-state machine with deterministic transitions).
  • Reaction Engine: event-driven automation with configurable triggers, actions, retries, escalation. Best source: Agent Orchestrator (33 event types, YAML-configurable reactions).
  • Quality Gate Engine: enforced checkpoints (CI pass, test pass, budget check) plus advisory skills (TDD, verification, code review). Best sources: Superpowers (behavioral engineering) + Agent Orchestrator (runtime enforcement).
  • Context Manager: progressive skill loading, strategic compaction, session context persistence. Best sources: Superpowers (progressive disclosure, no-@ rule) + ECC (tool call counting, phase-aware compaction).
  • Plugin Registry: 8-slot plugin architecture for agent, workspace, tracker, SCM, notifier, runtime, terminal, lifecycle. Best source: Agent Orchestrator (PluginManifest + PluginModule pattern).
  • Agent Plugin: per-provider integration (CLI args, output parsing, error detection, session resume). Best source: Maestro (declarative arg builder, capability flags, output parser registry).
  • Workspace Plugin: isolated filesystem per task (worktree or clone) with post-create hooks. Best source: Agent Orchestrator (worktree with symlinks, post-create commands, cleanup).
  • Tracker Plugin: issue integration (GitHub, Linear) with prompt generation and branch naming. Best source: Agent Orchestrator (dual-transport Linear, GitHub GraphQL).
  • Notification Plugin: multi-channel alerts with priority routing. Best source: Agent Orchestrator (desktop, Slack, webhook with priority routing).
  • Session Manager: session CRUD, atomic reservation, metadata persistence, archive/restore. Best sources: Agent Orchestrator (O_EXCL reservation, cascading cleanup) + Maestro (multi-provider discovery).
  • Skill Library: markdown-based behavioral skills for the agent's internal process. Best sources: Superpowers (14 skills, anti-rationalization, DOT flowcharts) + ECC (44 skills, language-specific).
  • Analytics Store: SQLite-backed usage tracking, cost attribution, activity heatmaps. Best source: Maestro (stats-db with WAL, daily backups, corruption recovery).
  • CLI: developer-facing command interface. Best sources: Agent Orchestrator (ao CLI) + Maestro (maestro-cli with JSONL output).
  • Dashboard: web-based monitoring and control. Best sources: Agent Orchestrator (Kanban attention levels) + Maestro (keyboard-first, multi-tab).

4.3 Data Flow

1. INTAKE
   Issue (from GitHub/Linear)
   OR User request (from CLI/UI)
       |
       v
2. PLANNING
   [Workflow Engine: status = "planning"]
   Load planner skill (from Superpowers brainstorming + writing-plans)
   Planner agent (read-only tools) produces structured plan
   Human reviews and approves
       |
       v
3. WORKSPACE SETUP
   [Workflow Engine: status = "setting_up"]
   Create git worktree (workspace plugin)
   Run post-create hooks (npm install, etc.)
   Verify test baseline passes
       |
       v
4. EXECUTION
   [Workflow Engine: status = "working"]
   For each task in plan:
     a. Dispatch implementer subagent with full task text
     b. Implementer follows TDD skill (test first, then implement)
     c. Implementer self-reviews against checklist
     d. Dispatch spec compliance reviewer (skepticism primed)
     e. If spec issues: implementer fixes, re-review
     f. Dispatch code quality reviewer
     g. If quality issues: implementer fixes, re-review
     h. Mark task complete
     i. [Quality Gate Engine: verify tests pass]
       |
       v
5. VERIFICATION
   [Workflow Engine: status = "verifying"]
   Run full verification loop (build, type, lint, test, security, diff)
   [Quality Gate Engine: all checks must pass]
       |
       v
6. PR CREATION
   [Workflow Engine: status = "pr_open"]
   Push branch, create PR
   [Reaction Engine: monitors CI, reviews, conflicts]
       |
       v
7. REVIEW AND ITERATION
   [Workflow Engine: status = "review_pending" -> "approved" or "changes_requested"]
   If changes requested: [Reaction Engine: send-to-agent, agent addresses feedback]
   If CI fails: [Reaction Engine: send-to-agent, agent investigates]
       |
       v
8. MERGE
   [Workflow Engine: status = "mergeable" -> "merged"]
   Human or auto-merge (configurable)
   [Workflow Engine: status = "cleanup" -> "done"]
   Cleanup worktree, archive metadata
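The stage transitions above can be sketched as a small state machine. The status names follow this data flow, not Agent Orchestrator's actual 16-state enum, and the transition table is a hypothetical simplification:

```typescript
// Hypothetical workflow state machine over the statuses used in the data flow.
type Status =
  | "planning" | "setting_up" | "working" | "verifying" | "pr_open"
  | "review_pending" | "changes_requested" | "approved" | "mergeable"
  | "merged" | "cleanup" | "done";

// Legal forward transitions; anything not listed is rejected.
const TRANSITIONS: Record<Status, Status[]> = {
  planning: ["setting_up"],
  setting_up: ["working"],
  working: ["verifying"],
  verifying: ["pr_open", "working"],       // failed verification loops back
  pr_open: ["review_pending"],
  review_pending: ["approved", "changes_requested"],
  changes_requested: ["review_pending"],   // agent addresses feedback
  approved: ["mergeable"],
  mergeable: ["merged"],
  merged: ["cleanup"],
  cleanup: ["done"],
  done: [],
};

function transition(from: Status, to: Status): Status {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

Because the table is total over Status, illegal jumps (say, planning straight to merged) fail deterministically instead of depending on the agent's compliance.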

4.4 Extension Points

  1. Custom Agent Plugins: Add support for new AI agents by implementing the Agent interface (CLI args, output parser, error patterns, session resume). Follow Maestro's declarative pattern.

  2. Custom Workspace Plugins: Alternative isolation strategies (Docker containers, cloud workspaces, devcontainers) by implementing the Workspace interface.

  3. Custom Tracker Plugins: Additional issue trackers (Jira, Asana, Shortcut) by implementing the Tracker interface.

  4. Custom Skills: New behavioral skills (markdown files) for domain-specific workflows. Follow Superpowers' TDD-for-docs methodology.

  5. Custom Reactions: User-defined event->action mappings in YAML configuration. Follow Agent Orchestrator's reaction engine pattern.

  6. Custom Quality Gates: Additional verification steps (dependency scanning, license checking, performance benchmarks) plugged into the Quality Gate Engine.

  7. Custom Notifiers: Additional notification channels by implementing the Notifier interface.

  8. Custom CLI Commands: New CLI subcommands for project-specific workflows.


5. Missing from All Projects

Features that none of the four projects implement but that a mature harness should have.

5.1 Cost Budget Enforcement

No project implements spending limits, budget alerts, or automatic shutoff when costs exceed a threshold. Every project tracks costs at some level, but none prevents runaway spending. For production use, the harness must be able to pause or kill agents when a per-session, per-project, or global budget is exceeded.

Why it matters: A single misconfigured agent loop can consume hundreds of dollars in API credits before anyone notices. Without automatic shutoff, cost governance depends entirely on human monitoring.
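As a sketch of what such enforcement could look like (entirely hypothetical, since no project implements it):

```typescript
// Hypothetical budget enforcer: map accumulated spend against a limit to a
// decision the harness can act on.
type BudgetAction = "continue" | "warn" | "pause";

function checkBudget(spentUsd: number, limitUsd: number, warnRatio = 0.8): BudgetAction {
  if (spentUsd >= limitUsd) return "pause";             // hard stop: automatic shutoff
  if (spentUsd >= limitUsd * warnRatio) return "warn";  // notify before the limit hits
  return "continue";
}
```

The same check could run at three scopes (per-session, per-project, global), pausing the offending agent rather than killing the whole fleet.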

5.2 Container-Based Sandboxing

No project isolates agent execution in containers (Docker, gVisor, Firecracker). All agents run with the same privileges as the user. This means a compromised or misbehaving agent can read credentials, exfiltrate code, modify other workspaces, or execute arbitrary network requests.

Why it matters: AI agents executing arbitrary code on a developer's machine represent a significant security surface. Git worktree isolation protects files but not credentials, network, or system resources.

5.3 Dependency Graph Execution (DAG)

No project supports expressing task dependencies as a directed acyclic graph and executing them with maximum parallelism. Superpowers and ECC execute sequentially. Agent Orchestrator parallelizes at the issue level but not within issues. Maestro processes documents sequentially within playbooks.

Why it matters: Many feature implementations have natural parallelism (frontend and backend can develop simultaneously, tests for different modules are independent). Sequential execution wastes time when tasks could safely run in parallel.
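A DAG scheduler for this could compute maximal parallel waves from declared task dependencies. A hypothetical sketch:

```typescript
// Hypothetical wave scheduler: given task -> dependencies, group tasks into
// waves where every task's dependencies completed in an earlier wave.
function executionWaves(deps: Record<string, string[]>): string[][] {
  const done = new Set<string>();
  const remaining = new Set(Object.keys(deps));
  const waves: string[][] = [];
  while (remaining.size > 0) {
    const wave = [...remaining].filter((t) => deps[t].every((d) => done.has(d)));
    if (wave.length === 0) throw new Error("dependency cycle detected");
    for (const t of wave) { done.add(t); remaining.delete(t); }
    waves.push(wave); // every task in a wave can run in parallel
  }
  return waves;
}
```

For example, frontend and backend tasks with no mutual dependency land in the same wave; an integration task depending on both lands in the next.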

5.4 Automated Rollback

No project implements automatic rollback when an agent produces bad output. Recovery requires manual intervention (git revert, worktree cleanup, session restart). Maestro's pause/stop provides the closest mechanism, but it requires human judgment.

Why it matters: Long-running autonomous sessions (Maestro's Auto Run, Superpowers' SDD) can produce cascading errors where each task builds on the previous one's mistakes. Automatic rollback to the last known-good state would limit damage.

5.5 Cross-Agent Context Sharing

No project implements a mechanism for one agent to share discoveries with another agent working on a related task. Agent Orchestrator explicitly prevents agent-to-agent communication. Maestro's Group Chat enables conversation but not structured context transfer.

Why it matters: Agents working on related issues often discover the same codebase patterns, encounter the same bugs, or need the same context. Without sharing, each agent rediscovers this independently, wasting tokens and time.

5.6 Persistent Vector Store / RAG Integration

No project maintains a persistent vector store for code retrieval. All rely on the agent's built-in code search tools (Read, Grep, Glob). ECC's iterative retrieval skill provides a search methodology but not an indexed store.

Why it matters: Large codebases exceed the agent's context window. A persistent vector store with code embeddings would enable semantic search across the entire codebase, providing relevant context without consuming the full context window.

5.7 Multi-User / Team Coordination

No project supports multiple human users managing a shared fleet of agents. All are single-user. Agent Orchestrator's dashboard has no authentication. Maestro's desktop app is inherently single-user.

Why it matters: Production teams need multiple developers to monitor and intervene in agent work. Without multi-user support, the harness is limited to individual developer use.

5.8 Compliance and Audit Logging

No project maintains a structured audit log of agent actions for compliance purposes. Agent Orchestrator has metadata files and git history, but no centralized event store. No project maps to compliance frameworks (SOC2, GDPR, ISO 27001).

Why it matters: Enterprise adoption requires demonstrating that AI-generated code went through defined quality processes, that human review occurred, and that security checks passed. Without audit logging, this cannot be proven.

5.9 Agent Health Watchdog

No project implements a watchdog that detects and kills agents that hang indefinitely without producing output. Agent Orchestrator's lifecycle polling detects "stuck" agents but relies on activity detection heuristics. Maestro has a 5-minute timeout for grooming sessions only, not for regular agent queries.

Why it matters: Agents can enter infinite loops, wait indefinitely for unavailable resources, or simply hang due to provider issues. Without a watchdog, these agents consume resources indefinitely.
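The core of such a watchdog is an output-staleness check (hypothetical sketch; the idle threshold would be configurable per agent):

```typescript
// Hypothetical agent watchdog predicate: flag a session as hung when it has
// produced no output for longer than the allowed idle window. A poller would
// run this periodically and pause or kill sessions it flags.
function isHung(lastOutputAtMs: number, nowMs: number, maxIdleMs: number): boolean {
  return nowMs - lastOutputAtMs > maxIdleMs;
}
```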

5.10 Prompt A/B Testing and Optimization

No project implements systematic A/B testing of prompt variations to determine which produces better agent behavior. Superpowers tests skills against agent behavior (the pressure-testing methodology) but does not compare prompt variants systematically.

Why it matters: Prompt engineering is currently ad-hoc. Systematic measurement of prompt effectiveness would enable evidence-based optimization of skill instructions, system prompts, and review templates.


6. Source Trace

Every major claim in this report is traced to the specific report section(s) from which it was derived.

6.1 Orchestration Model Claims

  • Superpowers uses brainstorm->plan->execute->review->merge pipeline (superpowers-deep-analysis.md Section 4.1)
  • ECC defines 4 workflow types in /orchestrate command (everything-claude-code-deep-analysis.md Section 4.1)
  • Agent Orchestrator has 16-state lifecycle state machine (agent-orchestrator-deep-analysis.md Section 9.1, SessionStatus enum)
  • Maestro treats pipeline as pass-through (maestro-deep-analysis.md Section 4.1)

6.2 Multi-Agent Coordination Claims

  • Superpowers prohibits parallel task execution in SDD (superpowers-deep-analysis.md Section 6.1, line 205 of SDD skill)
  • Agent Orchestrator has embarrassingly parallel model (agent-orchestrator-deep-analysis.md Section 5.1)
  • Maestro's Group Chat uses moderator pattern (maestro-deep-analysis.md Section 5.1)
  • ECC documents cascade method for manual parallelism (everything-claude-code-deep-analysis.md Section 6.2)

6.3 Quality Gate Claims

  • Superpowers has two-stage code review (superpowers-deep-analysis.md Section 4.5, SDD workflow)
  • ECC has 6-phase verification loop (everything-claude-code-deep-analysis.md Section 4.5)
  • Agent Orchestrator has fail-closed CI status (agent-orchestrator-deep-analysis.md Section 16.1, getCISummary)
  • Maestro has no automated verification layer (maestro-deep-analysis.md Section 4.6)

6.4 Context Management Claims

  • Superpowers uses progressive disclosure with no-@ rule (superpowers-deep-analysis.md Section 9.1, 9.6)
  • ECC tracks tool calls for compaction suggestions (everything-claude-code-deep-analysis.md Section 9.2)
  • Agent Orchestrator has three-layer prompt composition (agent-orchestrator-deep-analysis.md Section 8.1)
  • Maestro provides context groom/merge/transfer operations (maestro-deep-analysis.md Section 9.1-9.4)

6.5 Security Claims

  • No project implements container-based sandboxing (superpowers Section 7.5; ECC Section 7.5; AO Section 6.3; Maestro Section 7.5)
  • Agent Orchestrator uses execFile everywhere (agent-orchestrator-deep-analysis.md Section 11.1)
  • ECC has AgentShield for external security scanning (everything-claude-code-deep-analysis.md Section 12.3)
  • Agent Orchestrator has gitleaks in CI (agent-orchestrator-deep-analysis.md Section 11.3)

6.6 Cost Governance Claims

  • No project implements cost budgets or automatic shutoff (superpowers Section 15.4; ECC Section 15.4; AO Section 14.3; Maestro Section 15.5)
  • Maestro has SQLite-backed analytics with usage dashboard (maestro-deep-analysis.md Section 15.2-15.3)
  • Superpowers documents cache utilization benefits of sequential execution (superpowers-deep-analysis.md Section 15.3)
  • ECC documents token optimization strategies (everything-claude-code-deep-analysis.md Section 9.1)

6.7 Git Worktree Consensus Claims

  • Superpowers: using-git-worktrees skill (superpowers-deep-analysis.md Section 4.3)
  • ECC: git worktree pattern documentation (everything-claude-code-deep-analysis.md Section 7.1)
  • Agent Orchestrator: workspace-worktree plugin (agent-orchestrator-deep-analysis.md Section 6.1)
  • Maestro: git:worktreeSetup IPC handler (maestro-deep-analysis.md Section 7.1)

6.8 Anti-Rationalization Claims

  • Superpowers has 40+ rationalization prevention entries (superpowers-deep-analysis.md Section 11.5, Appendix C)
  • 7 pressure test types documented (superpowers-deep-analysis.md Section 11.6)
  • Persuasion principles based on Cialdini 2021, Meincke et al. 2025 (superpowers-deep-analysis.md Section 2.3)
  • TDD for documentation methodology (superpowers-deep-analysis.md Section 11.6)

6.9 Provider Compatibility Claims

  • Claude Code is primary across all 4 projects (all four reports, Executive Summaries)
  • Maestro supports 4 active + 3 planned providers (maestro-deep-analysis.md Section 17.2)
  • Agent Orchestrator has 8-slot plugin architecture (agent-orchestrator-deep-analysis.md Section 2.2)
  • Superpowers supports 4 platforms with tool mapping (superpowers-deep-analysis.md Section 17.1-17.2)

6.10 Missing Feature Claims

| Claim | Source |
|---|---|
| No project has container sandboxing | All four reports, Isolation Model sections |
| No project has DAG execution | All four reports, Parallelization sections |
| No project has cost budget enforcement | All four reports, Cost/Usage sections |
| No project has multi-user support | All four reports, Operational Assumptions sections |
| No project has persistent vector store | All four reports, Context Handling sections |
| No project has automated rollback | All four reports, Failure Modes sections |

Cross-Links to Individual Reports

| Topic | superpowers-deep-analysis.md | everything-claude-code-deep-analysis.md | agent-orchestrator-deep-analysis.md | maestro-deep-analysis.md |
|---|---|---|---|---|
| Design Philosophy | Section 2 | Section 2 | Section 1 | Section 2 |
| Core Architecture | Section 3 | Section 3 | Section 2 | Section 3 |
| Workflow Pipeline | Section 4 | Section 4 | Section 3 | Section 4 |
| Subagent Orchestration | Section 5 | Section 5 | Section 4 | Section 5 |
| Multi-Agent Parallelization | Section 6 | Section 6 | Section 5 | Section 6 |
| Isolation Model | Section 7 | Section 7 | Section 6 | Section 7 |
| Human-in-the-Loop | Section 8 | Section 8 | Section 7 | Section 8 |
| Context Handling | Section 9 | Section 9 | Section 8 | Section 9 |
| Session Lifecycle | Section 10 | Section 10 | Section 9 | Section 10 |
| Code Quality Gates | Section 11 | Section 11 | Section 10 | Section 11 |
| Security | Section 12 | Section 12 | Section 11 | Section 12 |
| Hooks/Automation | Section 13 | Section 13 | Section 12 | Section 13 |
| CLI/UX | Section 14 | Section 14 | Section 13 | Section 14 |
| Cost/Usage | Section 15 | Section 15 | Section 14 | Section 15 |
| Tooling/Dependencies | Section 16 | Section 16 | Section 15 | Section 16 |
| Provider Compatibility | Section 17 | Section 17 | Section 16 | Section 17 |
| Operational Assumptions | Section 18 | Section 18 | Section 17 | Section 18 |
| Failure Modes | Section 19 | Section 19 | Section 18 | Section 19 |
| Governance | Section 20 | Section 20 | Section 19 | Section 20 |
| Roadmap/Gaps | Section 21 | Section 21 | Section 20 | Section 21 |
| Borrowing Recommendations | Section 22 | Section 22 | Section 21 | Section 22 |

AI Harness Analysis — Report Index

Generated: 2026-02-22 | Tool: Claude Code (Opus 4.6) | Total: 10,765 lines across 6 reports

Overview

This directory contains a comprehensive comparative analysis of 4 AI coding harness projects, produced according to the master prompt. The goal: build the canonical feature set for an AI harness focused on scalable agent-team execution, orchestration quality, code quality governance, and operational maturity.


Individual Project Reports

Deep technical analysis of each repository, covering 21 analysis dimensions with concrete file citations, confidence scores, and cross-links.

| # | Report | Project | Lines | Key Insight |
|---|---|---|---|---|
| 1 | superpowers-deep-analysis.md | obra/superpowers | 2,005 | Best-in-class prompt engineering with anti-rationalization tables, TDD for skills, and two-stage code review. Pure markdown/skill framework with zero runtime enforcement. |
| 2 | everything-claude-code-deep-analysis.md | affaan-m/everything-claude-code | 2,141 | Largest skill library (44 skills, 13 agents, 32 commands). Configuration-layer harness with session management, continuous learning, and multi-language support. |
| 3 | agent-orchestrator-deep-analysis.md | ComposioHQ/agent-orchestrator | 2,806 | Only true runtime orchestrator. Plugin architecture, tmux-based process isolation, hash-based session directories, reaction engine with escalation, and fail-closed CI. |
| 4 | maestro-deep-analysis.md | RunMaestro/Maestro | 2,006 | Full Electron desktop app (1200 source files, 490 tests). Multi-provider support (8 agents), Group Chat orchestration, Auto Run checkbox workflows, Symphony open-source contribution system. |

Synthesis Reports

Cross-project analysis synthesizing patterns, gaps, and recommendations.

| # | Report | Lines | Purpose |
|---|---|---|---|
| 5 | harness-consensus-report.md | 831 | Cross-project consensus features, canonical architecture, divergence points, and features missing from all projects. |
| 6 | final-harness-gap-report.md | 976 | Gap analysis for Maestro specifically, with prioritized 3-phase roadmap and implementation-ready recommendations with TypeScript interfaces. |

Reading Order

For maximum context recovery (designed for recursive link traversal):

  1. Start here — This index
  2. Individual reports (any order) — Each is self-contained with its own executive summary
  3. Consensus report — Synthesizes patterns across all 4 projects
  4. Gap report — Actionable roadmap for Maestro improvement

Cross-Reference Map

By Analysis Dimension

| Dimension | Superpowers | ECC | Agent Orchestrator | Maestro | Consensus | Gap |
|---|---|---|---|---|---|---|
| Orchestration Model | S:4 | E:4 | A:4 | M:4 | C:1.1 | G:3.1 |
| Multi-Agent | S:5-6 | E:5-6 | A:5-6 | M:5-6 | C:1.2 | G:3.2 |
| Code Quality | S:10-11 | E:10-11 | A:10-11 | M:11 | C:1.3 | G:3.3 |
| Context Mgmt | S:9 | E:9 | A:9 | M:9 | C:1.4 | G:3.4 |
| Session Lifecycle | S:10 | E:10 | A:10 | M:10 | C:1.5 | G:3.5 |
| Human-in-Loop | S:8 | E:8 | A:8 | M:8 | C:1.6 | G:3.6 |
| Hooks/Automation | S:13 | E:13 | A:13 | M:13 | C:1.7 | |
| Cost/Governance | S:15 | E:15 | A:15 | M:15 | C:1.8 | G:3.7 |
| Security | S:12 | E:12 | A:12 | M:12 | C:1.9 | G:3.8 |
| Provider Compat | S:16 | E:16 | A:16 | M:17 | C:1.10 | |

Key: S=Superpowers, E=ECC, A=Agent Orchestrator, M=Maestro, C=Consensus, G=Gap. Numbers are section numbers.

By Recommendation Type

| What to Borrow | From | For Maestro | Gap Report Section |
|---|---|---|---|
| Anti-rationalization engineering | Superpowers (S:11.5) | Prompt templates | G:6.1 |
| Two-stage code review | Superpowers (S:11) | Quality pipeline | G:6.2 |
| Plugin architecture | Agent Orchestrator (A:3) | Extensibility | G:6.3 |
| Reaction engine | Agent Orchestrator (A:13) | Automation | G:6.4 |
| Fail-closed CI | Agent Orchestrator (A:11) | Security | G:6.5 |
| Session management | ECC (E:10) | Persistence | G:3.5 |
| Continuous learning | ECC (E:13) | Knowledge capture | G:3.7 |
| Strategic compaction | ECC (E:9) | Context management | G:3.4 |

Source Repositories

| Repository | Stars | Language | License |
|---|---|---|---|
| obra/superpowers | | Markdown/Bash | |
| affaan-m/everything-claude-code | | JavaScript/Markdown | |
| ComposioHQ/agent-orchestrator | | TypeScript | |
| RunMaestro/Maestro | | TypeScript/Electron | |

Methodology

  • All repositories cloned at HEAD as of 2026-02-22
  • Analysis performed by 4 parallel Claude Opus 4.6 agents, each reading the full codebase
  • Synthesis performed by 2 additional agents reading all 4 individual reports
  • Every major claim includes confidence scores (High/Medium/Low) and file path citations
  • Reports designed for recursive context recovery via cross-links

Master prompt: master_prompt.md

Maestro Deep Analysis Report

Project: RunMaestro/Maestro
Repository: https://github.com/RunMaestro/Maestro
Version analyzed: 0.15.0
License: AGPL-3.0
Author: Pedram Amini (pedram@runmaestro.ai)
Analysis date: 2026-02-22
Codebase size: ~672,000 lines of TypeScript across ~1,200 source files and ~490 test files


Table of Contents

  1. Executive Summary
  2. Design Philosophy and Abstractions
  3. Core Architecture Model
  4. Harness Workflow: Spec to Plan to Execute to Verify to Merge
  5. Subagent/Task Orchestration Model
  6. Multi-Agent / Parallelization Strategy
  7. Isolation Model
  8. Human-in-the-Loop Controls
  9. Context Handling Strategy
  10. Session Lifecycle and Persistence
  11. Code Quality Gates
  12. Security and Compliance Mechanisms
  13. Hooks, Automation Surface, and Fail-Safe Behavior
  14. CLI/UX and Automation Ergonomics
  15. Cost/Usage Visibility and Governance
  16. Tooling and Dependency Surface
  17. External Integrations and Provider Compatibility
  18. Operational Assumptions and Constraints
  19. Failure Modes and Issues Observed
  20. Governance and Guardrails
  21. Roadmap/Evolution Signals, Missing Areas, Unresolved Issues
  22. Current Gaps That Other Projects Might Fill
  23. Cross-Links

1. Executive Summary

Maestro is the most ambitious and fully-realized project in the harness comparison set. It is a cross-platform Electron desktop application (with mobile PWA support) for orchestrating fleets of AI coding agents. Unlike the other three projects, which are primarily configuration layers, shell scripts, or lightweight orchestrators on top of Claude Code, Maestro is a standalone product with its own GUI, process management layer, multi-provider support, and a rich feature ecosystem including analytics, gamification, and community-driven open source contribution (Symphony).

Key differentiators:

  • Full desktop application with keyboard-first interface (not just a terminal wrapper)
  • Multi-provider support: Claude Code, OpenAI Codex, OpenCode, Factory Droid (4 active, 2 planned)
  • Auto Run system with file-based task documents and playbook management
  • Group Chat with moderator AI for cross-agent coordination
  • Symphony: a community contribution platform using GitHub Issues + Auto Run
  • Git worktree integration for true parallel development
  • CLI tool (maestro-cli) for headless/CI operation
  • Mobile remote control via PWA + WebSocket + Cloudflare tunnels
  • SQLite-backed analytics with Usage Dashboard
  • Extensive documentation (CLAUDE.md ecosystem, CONSTITUTION.md, ARCHITECTURE.md, Mintlify docs site)

Core limitation:

  • Maestro is a "pass-through" orchestrator: it does not itself generate plans, specs, or code. It dispatches prompts to underlying AI agents and manages their lifecycle. The intelligence comes from the agents; Maestro provides the podium.

Confidence: High -- This assessment is based on thorough reading of 50+ source files, all documentation files, configuration, CI/CD, and test infrastructure.


2. Design Philosophy and Abstractions

2.1 The Constitution

Maestro has a formally documented design philosophy in /tmp/ai-harness-repos/Maestro/CONSTITUTION.md (178 lines). This is unique among the four projects analyzed.

Six tenets (lines 28-112):

  1. Unattended Excellence (Solo Mode) -- "The measure of Maestro's success is how long agents run without intervention." Auto Run is a first-class citizen. Error recovery should be automatic. The leaderboard celebrates autonomy.

  2. The Conductor's Perspective (Interactive Mode) -- "You are the maestro. The agents are your orchestra." Overview and control over details. Batch operations over individual ones. Frictionless agent switching.

  3. Keyboard Sovereignty -- Every action has a keyboard path. Focus must be predictable. Escape always improves your situation. No mouse-only features.

  4. Instant Response -- UI interactions in milliseconds. Heavy operations in background. Perceived performance matters.

  5. Delightful Focus -- "Say no to feature creep that dilutes the core experience." Polish before adding.

  6. Transparent Complexity -- Progressive disclosure. Sensible defaults. Power features accessible but not intrusive.

Confidence: High -- Directly sourced from CONSTITUTION.md.

2.2 The Mental Model

Maestro embodies the "conductor/orchestra" metaphor:

  • Agents are instruments (each with their own workspace, terminal, AI tabs)
  • The user is the conductor (directing, not playing each instrument)
  • Auto Run is the pre-programmed score (tasks run without intervention)
  • Group Chat is the ensemble rehearsal (agents coordinate via a moderator)
  • Symphony is the concert hall (open source community contributions)

The key abstraction boundary is that Maestro is NOT an IDE, NOT a single-agent wrapper, NOT a chat interface, and NOT a project manager. It's a fleet management tool for AI agents.

Evidence: CONSTITUTION.md lines 140-148:

- Not an IDE: We complement your editor, not replace it
- Not a single-agent wrapper: One agent is just a small orchestra
- Not a chat interface: Conversations are work sessions, not dialogues
- Not a project manager: We execute, not plan (that's what agents do)

2.3 Agent Behavioral Guidelines

CLAUDE.md (lines 23-44) establishes explicit behavioral rules for AI agents working on the codebase:

  • Surface Assumptions Early -- Never silently fill in ambiguous requirements
  • Manage Confusion Actively -- STOP and name inconsistencies
  • Push Back When Warranted -- Not a yes-machine
  • Enforce Simplicity -- Actively resist overcomplication
  • Maintain Scope Discipline -- Touch only what's asked
  • Dead Code Hygiene -- Identify but ask before removing

These are meta-guidelines for how AI should interact with the Maestro codebase during development, not runtime behavioral controls.

Confidence: High -- Directly sourced from CLAUDE.md.


3. Core Architecture Model

3.1 Dual-Process Electron Architecture

File: /tmp/ai-harness-repos/Maestro/src/main/index.ts (724 lines)

Maestro uses Electron's main/renderer split with strict context isolation:

| Process | Location | Purpose |
|---|---|---|
| Main Process | src/main/ | Node.js backend: process spawning, IPC handlers, file system, git, web server |
| Renderer Process | src/renderer/ | React frontend: UI components, hooks, services |
| Web Process | src/web/ | PWA for mobile: WebSocket client, mobile-optimized UI |
| CLI Process | src/cli/ | Headless operation: Commander.js, batch processing |
| Shared Module | src/shared/ | Cross-process types, utilities, constants |

IPC Security Model (from ARCHITECTURE.md lines 136-202):

  • Context isolation: Enabled (renderer has no Node.js access)
  • Node integration: Disabled (no require() in renderer)
  • Preload script exposes window.maestro API with 17+ namespaces covering process management, git, file system, agents, settings, web server, auto run, playbooks, attachments, notifications, and more.

Confidence: High -- Verified in ARCHITECTURE.md and src/main/index.ts.

3.2 Process Manager

Files:

  • /tmp/ai-harness-repos/Maestro/src/main/process-manager/ProcessManager.ts
  • /tmp/ai-harness-repos/Maestro/src/main/process-manager/types.ts

The ProcessManager class is the core runtime engine. It manages two types of processes:

  1. PTY Processes (via node-pty) -- For terminal sessions with full shell emulation
  2. Child Processes (via child_process.spawn) -- For AI agents in batch mode

Key design decisions:

  • Uses spawn() with shell: false for security (no injection vulnerabilities)
  • Signal escalation: SIGINT first, escalates to SIGTERM after 2 seconds if process doesn't exit
  • Per-process output parsers: Each agent type has its own JSON output parser
  • Data buffering via DataBufferManager to batch rapid updates
  • SSH remote execution support via SshCommandRunner
// ProcessManager.ts line 29
export class ProcessManager extends EventEmitter {
    private processes: Map<string, ManagedProcess> = new Map();
    private bufferManager: DataBufferManager;
    private ptySpawner: PtySpawner;
    private childProcessSpawner: ChildProcessSpawner;
    private localCommandRunner: LocalCommandRunner;
    private sshCommandRunner: SshCommandRunner;
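
The signal escalation described above can be sketched against an abstract process handle. The `Stoppable` interface and `stopWithEscalation` helper below are illustrative assumptions, not Maestro's actual API (the real ProcessManager works on node-pty and child_process handles):

```typescript
// Hedged sketch of the SIGINT-then-SIGTERM escalation described above.
// `Stoppable` abstracts the managed process; the real ProcessManager
// may differ in detail.
interface Stoppable {
  kill(signal: "SIGINT" | "SIGTERM"): void;
  onExit(cb: () => void): void;
}

function stopWithEscalation(proc: Stoppable, graceMs = 2000): void {
  let exited = false;
  proc.onExit(() => {
    exited = true;
  });
  proc.kill("SIGINT"); // polite interrupt first
  setTimeout(() => {
    if (!exited) proc.kill("SIGTERM"); // escalate if still running
  }, graceMs);
}
```

The grace window (2 seconds in Maestro) gives the agent a chance to flush output and exit cleanly before the harder signal is sent.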

Events emitted (from types.ts lines 109-121):

  • data, stderr, exit, command-exit
  • usage (token stats)
  • session-id (provider session ID)
  • agent-error (auth, rate limit, context exhaustion)
  • thinking-chunk (streaming reasoning)
  • tool-execution (tool use events)
  • slash-commands (discoverable commands)
  • query-complete (with timing data)

Confidence: High -- Directly read from source files.

3.3 Output Parser Architecture

Files:

  • /tmp/ai-harness-repos/Maestro/src/main/parsers/index.ts (103 lines)
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/agent-output-parser.ts
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/claude-output-parser.ts (505 lines)
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/codex-output-parser.ts
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/opencode-output-parser.ts
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/factory-droid-output-parser.ts
  • /tmp/ai-harness-repos/Maestro/src/main/parsers/usage-aggregator.ts

Maestro uses a registry pattern for output parsers. Each AI agent produces output in a different JSON format, and dedicated parser classes normalize this into a unified ParsedEvent type.

Initialization at app startup:

// parsers/index.ts line 76
export function initializeOutputParsers(): void {
    clearParserRegistry();
    registerOutputParser(new ClaudeOutputParser());
    registerOutputParser(new OpenCodeOutputParser());
    registerOutputParser(new CodexOutputParser());
    registerOutputParser(new FactoryDroidOutputParser());
}

The AgentOutputParser interface requires:

  • parseJsonLine(line: string): ParsedEvent | null -- Transform a raw JSON line to a normalized event
  • isResultMessage(event: ParsedEvent): boolean -- Detect final result messages
  • extractSessionId(event: ParsedEvent): string | null -- Pull provider session ID
  • extractUsage(event: ParsedEvent): Usage | null -- Pull token/cost stats
  • extractSlashCommands(event: ParsedEvent): string[] | null -- Pull discoverable commands
  • detectErrorFromLine(line: string): AgentError | null -- Detect errors from structured JSON
  • detectErrorFromExit(exitCode, stderr, stdout): AgentError | null -- Detect errors from exit
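
The contract above can be sketched as a TypeScript interface. Method names come from the bullet list; the `ParsedEvent`, `Usage`, and `AgentError` shapes below are simplified assumptions, not the actual Maestro type definitions:

```typescript
// Simplified event/usage/error shapes (assumptions for illustration).
type ParsedEvent = { type: string; [key: string]: unknown };
type Usage = { inputTokens: number; outputTokens: number; costUsd?: number };
type AgentError = { errorType: string; message: string; recoverable: boolean };

// The parser contract each provider (Claude, Codex, OpenCode, Factory
// Droid) implements so the rest of Maestro can stay provider-agnostic.
interface AgentOutputParser {
  parseJsonLine(line: string): ParsedEvent | null;
  isResultMessage(event: ParsedEvent): boolean;
  extractSessionId(event: ParsedEvent): string | null;
  extractUsage(event: ParsedEvent): Usage | null;
  extractSlashCommands(event: ParsedEvent): string[] | null;
  detectErrorFromLine(line: string): AgentError | null;
  detectErrorFromExit(
    exitCode: number | null,
    stderr: string,
    stdout: string,
  ): AgentError | null;
}
```

Normalizing every provider's JSON stream into one event type is what lets the registry pattern work: callers never branch on agent type after parsing.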

Claude Code parser (claude-output-parser.ts) is the most complex (505 lines). It handles:

  • Extended thinking blocks (Claude 3.7+, Claude 4+): Extracts thinking content blocks separately from text blocks, routing them to thinking-chunk events for streaming display
  • Redacted thinking: redacted_thinking blocks (safety-encrypted reasoning) are excluded since their content cannot be displayed
  • Tool use blocks: Extracted from content[] arrays and surfaced as toolUseBlocks for tool execution events
  • Mixed stderr/JSON parsing: When Claude Code outputs a line like Error streaming...: 400 {"type":"error","error":{"message":"prompt is too long"}}, the parser finds the embedded JSON starting at { and extracts the structured error
  • Usage aggregation: Calls aggregateModelUsage() to combine modelUsage (per-model breakdown) with usage (legacy flat format) and total_cost_usd

Error detection strategy is deliberately conservative:

// claude-output-parser.ts line 326-333
// IMPORTANT: Only detect errors from structured JSON error events, not from
// arbitrary text content. Pattern matching on conversational text leads to
// false positives (e.g., AI discussing "timeout" triggers timeout error).
//
// Error detection sources (in order of reliability):
// 1. Structured JSON: { type: "error", message: "..." }
// 2. stderr output (handled separately by process-manager)
// 3. Non-zero exit code (handled by detectErrorFromExit)

This is a mature design decision born from real false-positive issues.

Confidence: High -- Directly from source code.

3.4 Error Pattern System

File: /tmp/ai-harness-repos/Maestro/src/main/parsers/error-patterns.ts (1015 lines)

The error pattern system defines regex-based error detection for all supported agents and SSH remote execution. Each agent has patterns organized by error type:

| Error Type | Claude Code Patterns | Codex Patterns | OpenCode Patterns | Factory Droid Patterns | SSH Patterns |
|---|---|---|---|---|---|
| auth_expired | 10 patterns | 4 patterns | 2 patterns | 5 patterns | -- |
| token_exhaustion | 7 patterns | 3 patterns | 4 patterns | 4 patterns | -- |
| rate_limited | 6 patterns | 5 patterns | 2 patterns | 4 patterns | -- |
| network_error | 5 patterns | 4 patterns | 4 patterns | 4 patterns | 10 patterns |
| permission_denied | 4 patterns | 2 patterns | -- | 3 patterns | 5 patterns |
| agent_crashed | 1 pattern | 5 patterns | 3 patterns | 1 pattern | 9 patterns |
| session_not_found | 3 patterns | 2 patterns | 2 patterns | 2 patterns | -- |

Notable implementation details:

  1. Dynamic error messages: Some patterns use functions instead of strings to construct messages from regex capture groups:
// error-patterns.ts line 127-133
pattern: /prompt.*too\s+long:\s*(\d+)\s*tokens?\s*>\s*(\d+)\s*maximum/i,
message: (match: RegExpMatchArray) => {
    const actual = parseInt(match[1], 10).toLocaleString('en-US');
    const max = parseInt(match[2], 10).toLocaleString('en-US');
    return `Prompt is too long: ${actual} tokens exceeds the ${max} token limit.`;
},
  2. SSH error patterns are checked in addition to agent-specific patterns when running via SSH remote execution, covering transport-level errors (connection refused, host key verification, broken pipe, shell profile syntax errors on the remote host).

  3. Shell parse error detection: The SSH patterns detect remote host .zshrc/.bashrc syntax errors with line numbers, providing actionable messages like "Check .zshrc or .bashrc on the remote server".

  4. Recoverability flags: Each pattern declares whether the error is recoverable (true) or fatal (false). Permission denied and host key verification failures are non-recoverable; rate limits and network errors are recoverable.
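
Putting these pieces together, one pattern entry plus a first-match lookup could look like the sketch below. The `ErrorPattern` shape and `matchError` helper are assumptions modeled on the behaviors described above, not the actual error-patterns.ts code:

```typescript
// Hypothetical shape for one entry in the pattern tables above; field
// names are assumptions, not the actual error-patterns.ts definitions.
interface ErrorPattern {
  pattern: RegExp;
  errorType:
    | "auth_expired"
    | "token_exhaustion"
    | "rate_limited"
    | "network_error"
    | "permission_denied"
    | "agent_crashed"
    | "session_not_found";
  message: string | ((match: RegExpMatchArray) => string);
  recoverable: boolean;
}

// First match wins: walk the table and materialize the message,
// calling the function form with the capture groups when present.
function matchError(patterns: ErrorPattern[], text: string) {
  for (const p of patterns) {
    const m = text.match(p.pattern);
    if (m) {
      return {
        errorType: p.errorType,
        message: typeof p.message === "function" ? p.message(m) : p.message,
        recoverable: p.recoverable,
      };
    }
  }
  return null;
}
```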

Confidence: High -- Directly from source code, all pattern counts verified.

3.5 Agent Data Model

File: CLAUDE-SESSION.md (Session interface)

Each "agent" in the Left Bar is backed by a Session object with:

  • Identity: id, name, groupId, toolType (provider), state, inputMode
  • Paths: cwd (mutable), projectRoot (immutable), fullPath
  • Processes: aiPid, port
  • Multi-tab: aiTabs[], activeTabId, filePreviewTabs[], unifiedTabOrder[]
  • Execution queue: executionQueue[], isProcessingQueue
  • Usage: usageStats, contextUsage, workLog
  • Git: isGitRepo, changedFiles, gitBranches, gitTags
  • Auto Run: autoRunFolderPath, autoRunSelectedFile, autoRunMode
  • SSH: sshRemoteId, sessionSshRemoteConfig
  • Error: agentError, agentErrorPaused

Each session runs two processes simultaneously: an AI agent process (suffixed -ai) and a terminal process (suffixed -terminal). Users switch between them with Cmd+J.

Confidence: High -- Directly from CLAUDE-SESSION.md and ARCHITECTURE.md.

3.6 IPC Handler Registry

File: /tmp/ai-harness-repos/Maestro/src/main/index.ts (lines 27-59)

The main process registers 30+ handler modules:

registerGitHandlers, registerAutorunHandlers, registerPlaybooksHandlers,
registerHistoryHandlers, registerAgentsHandlers, registerProcessHandlers,
registerPersistenceHandlers, registerSystemHandlers, registerClaudeHandlers,
registerAgentSessionsHandlers, registerGroupChatHandlers, registerDebugHandlers,
registerSpeckitHandlers, registerOpenSpecHandlers, registerContextHandlers,
registerMarketplaceHandlers, registerStatsHandlers, registerDocumentGraphHandlers,
registerSshRemoteHandlers, registerFilesystemHandlers, registerAttachmentsHandlers,
registerWebHandlers, registerLeaderboardHandlers, registerNotificationsHandlers,
registerSymphonyHandlers, registerTabNamingHandlers, registerAgentErrorHandlers,
registerDirectorNotesHandlers, registerWakatimeHandlers

Each handler module lives in /tmp/ai-harness-repos/Maestro/src/main/ipc/handlers/ (30 files).

The handler registration follows a consistent dependency injection pattern:

// Each handler module exports a register function that receives dependencies
export function registerContextHandlers(deps: ContextHandlerDependencies): void {
    const { getProcessManager, getAgentDetector } = deps;
    ipcMain.handle('context:groomContext', withIpcErrorLogging(
        handlerOpts('groomContext'),
        async (projectRoot, agentType, prompt, options) => {
            const processManager = requireDependency(getProcessManager, 'Process manager');
            // ... handler logic
        }
    ));
}

Key architectural pattern: All handlers use withIpcErrorLogging() for consistent error handling and requireDependency() for runtime dependency validation. This prevents silent failures when handlers are called before dependencies are initialized.

Confidence: High -- Directly from source code.


4. Harness Workflow: Spec to Plan to Execute to Verify to Merge

4.1 Overview

Maestro does NOT itself implement a spec-to-plan-to-execute pipeline. It provides the infrastructure for users to build such workflows using Auto Run documents and Playbooks. The actual planning and execution intelligence comes from the AI agents being orchestrated.

The workflow is:

1. SPEC:     User writes markdown spec documents (Auto Run docs with checkboxes)
2. PLAN:     User orders documents in a Playbook (BatchRunConfig)
3. EXECUTE:  useBatchProcessor sends each checkbox task to the AI agent
4. VERIFY:   AI agent checks tasks; Maestro tracks completion
5. MERGE:    Git worktree integration + one-click PR creation

4.2 Spec Creation

Relevant files:

  • ARCHITECTURE.md (Auto Run System, lines 718-897)
  • src/renderer/components/ (AutoRun.tsx, AutoRunSetupModal.tsx, AutoRunDocumentSelector.tsx)
  • src/prompts/autorun-default.md, src/prompts/wizard-document-generation.md

Users create markdown documents with checkbox items:

# Task: Add Unit Tests for Auth Module

## Objectives
- [ ] Create `src/__tests__/auth.test.ts`
- [ ] Add tests for `login()` function
- [ ] Ensure `npm test` passes
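
A hypothetical helper shows how such checkbox tasks can be extracted from a document; Maestro's real parsing lives in its batch processor and may differ:

```typescript
// Hypothetical helper: extract unchecked checkbox tasks from an Auto Run
// document. Matches "- [ ] task" and "* [ ] task" list items; checked
// items ("- [x]") are skipped.
function parseUncheckedTasks(markdown: string): string[] {
  const tasks: string[] = [];
  for (const line of markdown.split("\n")) {
    const m = line.match(/^\s*[-*]\s+\[ \]\s+(.*)$/);
    if (m) tasks.push(m[1].trim());
  }
  return tasks;
}
```

The checkbox-as-state design means the document itself is the progress tracker: re-reading it after each agent run yields the remaining work.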

The Auto Run system provides:

  • Edit/Preview modes with auto-save (5-second debounce)
  • Image support for documents (saved to document-specific folders)
  • Wizard-assisted spec generation via AI (see src/prompts/wizard-*.md)
  • Spec-Kit integration (GitHub's spec-kit prompts bundled)
  • OpenSpec integration (Fission-AI's OpenSpec prompts bundled)

Evidence: ARCHITECTURE.md lines 877-883:

1. Setup: User selects Runner Docs folder via AutoRunSetupModal
2. Document Selection: Documents appear in AutoRunDocumentSelector dropdown
3. Editing: AutoRun component provides edit/preview modes with auto-save
4. Batch Configuration: BatchRunnerModal allows ordering documents
5. Playbooks: Save/load configurations for repeated batch runs
6. Execution: useBatchProcessor hook processes documents sequentially
7. Progress: RightPanel shows document and task-level progress

Confidence: High -- Architecture documentation and source files confirm this workflow.

4.3 Planning (Playbooks)

Relevant files:

  • ARCHITECTURE.md (lines 779-793)
  • src/cli/services/playbooks.ts
  • src/cli/services/batch-processor.ts
  • src/main/ipc/handlers/playbooks.ts

Playbooks are saved configurations that define:

interface Playbook {
    id: string;
    name: string;
    documents: PlaybookDocumentEntry[];  // Ordered list with reset flags
    loopEnabled: boolean;                // Loop back to first doc when done
    prompt: string;                      // Agent prompt template
    worktreeSettings?: {
        branchNameTemplate: string;
        createPROnCompletion: boolean;
    };
}

Each document entry can:

  • Be reordered via drag-and-drop
  • Be duplicated (for running the same document multiple times)
  • Have resetOnCompletion to uncheck all boxes when done (enabling re-execution)

Playbooks support template variables: {{date}}, {{time}}, {{cwd}}, {{session}}, {{agent}}, {{gitBranch}}.
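
Substitution of these variables can be sketched as follows; `renderPromptTemplate` is a hypothetical name, and the variable list comes from the sentence above while the implementation is assumed:

```typescript
// Minimal sketch of playbook template-variable substitution.
// Unknown variables are left intact rather than erased.
function renderPromptTemplate(
  template: string,
  vars: Record<string, string>,
): string {
  return template.replace(/\{\{(\w+)\}\}/g, (whole, name: string) => vars[name] ?? whole);
}
```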

Confidence: High -- From ARCHITECTURE.md and src/cli/services/batch-processor.ts.

4.4 Execution

Relevant files:

  • src/renderer/hooks/useBatchProcessor.ts
  • src/cli/services/batch-processor.ts (lines 61-150)
  • src/cli/services/agent-spawner.ts

The batch processor:

  1. Registers CLI activity so the desktop app knows the session is busy
  2. Iterates through documents in order
  3. For each document, reads unchecked tasks
  4. Constructs a prompt from the playbook template + document content
  5. Spawns the AI agent with the prompt
  6. Parses the response for checked tasks
  7. Updates document state
  8. Emits JSONL progress events
  9. If loopEnabled, loops back to first document

The CLI batch processor (src/cli/services/batch-processor.ts) is an AsyncGenerator<JsonlEvent> that yields typed events:

export async function* runPlaybook(
    session: SessionInfo,
    playbook: Playbook,
    folderPath: string,
    options: { dryRun?, writeHistory?, debug?, verbose? }
): AsyncGenerator<JsonlEvent>
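
Consuming such a generator is straightforward; the event names below are illustrative stand-ins, not Maestro's actual JSONL event types:

```typescript
// Simplified event shape (assumption) matching the JSONL-per-event style.
type JsonlEvent = { type: string; [key: string]: unknown };

// A stand-in generator with the same shape as runPlaybook().
async function* exampleRun(): AsyncGenerator<JsonlEvent> {
  yield { type: "document-start", name: "auth-tests.md" };
  yield { type: "task-complete", task: "Add tests for login()" };
  yield { type: "document-complete", name: "auth-tests.md" };
}

// CI consumers stream each event as one JSON object per line (JSONL).
async function toJsonl(gen: AsyncGenerator<JsonlEvent>): Promise<string[]> {
  const lines: string[] = [];
  for await (const ev of gen) lines.push(JSON.stringify(ev));
  return lines;
}
```

The AsyncGenerator design lets the same pipeline drive both the desktop UI (which renders events incrementally) and headless CI (which logs them line by line).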

Confidence: High -- Directly from source code.

4.5 Prompt System

Directory: /tmp/ai-harness-repos/Maestro/src/prompts/ (24 markdown files + 2 subdirectories)

Maestro bundles 24 system prompts as markdown files that are imported at build time:

| Prompt File | Purpose |
|---|---|
| autorun-default.md | Default system prompt for Auto Run task execution |
| autorun-synopsis.md | Director's Notes synopsis generation |
| commit-command.md | Custom AI commit command |
| context-grooming.md | Context grooming/compaction |
| context-summarize.md | Context summarization |
| context-transfer.md | Cross-session context transfer |
| director-notes.md | Director's Notes (Encore Feature) |
| group-chat-moderator-system.md | Group Chat moderator system prompt |
| group-chat-moderator-synthesis.md | Moderator synthesis round prompt |
| group-chat-participant.md | Participant behavior template |
| group-chat-participant-request.md | Request routing to participant |
| image-only-default.md | Image-only prompt template |
| maestro-system-prompt.md | Main system prompt prepended to all queries |
| tab-naming.md | AI-generated tab naming |
| wizard-document-generation.md | Wizard: generate Auto Run documents |
| wizard-inline-*.md (5 files) | Wizard: inline editing and iteration |
| wizard-system*.md (2 files) | Wizard: system prompts for generation |

Additionally, two subdirectories bundle external prompt frameworks:

  • speckit/ -- GitHub's Spec-Kit prompts (refreshed via npm run refresh-speckit)
  • openspec/ -- Fission-AI's OpenSpec prompts (refreshed via npm run refresh-openspec)

These prompts are a key differentiator: they encode Maestro's workflow knowledge into the AI agents' behavior. The autorun-default.md prompt tells the agent how to interact with checkbox documents, the group-chat-moderator-system.md defines the moderator's decision-making behavior, and the context-grooming.md defines how to compress conversations.

4.6 Verification

Verification is implicit in the checkbox model. The AI agent is expected to:

  1. Read the task description
  2. Perform the work
  3. Check off completed tasks by modifying the markdown document

There is NO automatic verification layer (no test runner, no linter integration, no code review step). The verification is the agent's own assessment that it completed the work.

Limitation: No automated code quality gates in the Auto Run execution loop. The agent could check off a task without actually completing it successfully. This is the most significant gap compared to orchestrator-style projects that run tests/lints between steps.

Confidence: High -- No evidence of automated verification in the batch processor code.

4.7 Merge

Relevant files:

  • ARCHITECTURE.md (Git Worktree Integration, lines 845-873)
  • src/main/ipc/handlers/git.ts

When worktree mode is enabled for Auto Run:

  1. A git worktree is created with a specified branch name
  2. Auto Run operates in the worktree directory
  3. On completion, if createPROnCompletion is true, a PR is created via git:createPR
  4. The PR uses GitHub CLI (gh pr create)
'git:createPR': (worktreePath, baseBranch, title, body) => Promise<{
    success: boolean;
    prUrl?: string;
}>
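
Assuming the handler shells out to the GitHub CLI as described, the argument construction might look like the helper below. `buildCreatePrArgs` is hypothetical; the real handler in src/main/ipc/handlers/git.ts may differ:

```typescript
// Hypothetical helper mirroring the git:createPR signature above: build
// the `gh pr create` argument vector the handler could pass to execFile
// (running in the worktree directory as cwd). Passing arguments as an
// array avoids shell-quoting issues in titles and bodies.
function buildCreatePrArgs(
  baseBranch: string,
  title: string,
  body: string,
): string[] {
  return ["pr", "create", "--base", baseBranch, "--title", title, "--body", body];
}
```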

Confidence: High -- From ARCHITECTURE.md and IPC handler definitions.


5. Subagent/Task Orchestration Model

5.1 Group Chat System

Files:

  • /tmp/ai-harness-repos/Maestro/src/main/group-chat/ (10 files)
  • ARCHITECTURE.md (Group Chat System, lines 1171-1404)

Group Chat is Maestro's most sophisticated orchestration feature. It implements a moderator-agent pattern:

  1. User sends a message to the group chat
  2. Moderator AI receives the message + chat history
  3. Moderator decides whether to:
    • Answer directly (simple questions)
    • Route to specific agents via @mentions
  4. Mentioned agents work in parallel, each spawned as a batch process
  5. When all agents respond, moderator synthesis round begins
  6. Moderator reviews responses and either:
    • @mentions agents again for follow-up (loop continues)
    • Provides final synthesis WITHOUT mentions (loop ends)

Key implementation details:

Session ID patterns for routing:

group-chat-{chatId}-moderator-{timestamp}    -- Moderator process
group-chat-{chatId}-participant-{name}-{ts}  -- Agent participant

Pending response tracking (group-chat-router.ts lines 99-155):

const pendingParticipantResponses = new Map<string, Set<string>>();

export function markParticipantResponded(groupChatId: string, name: string): boolean {
    const pending = pendingParticipantResponses.get(groupChatId);
    if (!pending) return false;
    pending.delete(name);
    if (pending.size === 0) {
        pendingParticipantResponses.delete(groupChatId);
        return true; // Last participant responded
    }
    return false;
}

Two key prompts control moderator behavior:

  • MODERATOR_SYSTEM_PROMPT (src/prompts/group-chat-moderator-system.md)
  • MODERATOR_SYNTHESIS_PROMPT (src/prompts/group-chat-moderator-synthesis.md)

Storage structure:

~/Library/Application Support/maestro/group-chats/
    {chatId}/
        chat.json       # Group chat metadata
        log.jsonl        # Append-only message log
        history.json     # Summarized history entries

Moderator lifecycle management (group-chat-moderator.ts, 290 lines):

The moderator is not a persistent process. Instead, each message spawns a new batch-mode moderator process with a unique session ID:

group-chat-{chatId}-moderator-{timestamp}

Stale moderator sessions are cleaned up via a periodic interval (every 10 minutes) that removes sessions inactive for 30 minutes. The power manager is notified to prevent system sleep during active group chats.
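The stale-session sweep can be sketched as follows; the 10-minute interval and 30-minute threshold are from the text, while the function shape and map usage are assumptions:

```typescript
// Sketch of the stale-session sweep (thresholds from the docs; shape assumed).
const sessionActivityTimestamps = new Map<string, number>();
const STALE_AFTER_MS = 30 * 60 * 1000;

function sweepStaleSessions(now: number = Date.now()): string[] {
  const removed: string[] = [];
  for (const [chatId, lastActivity] of sessionActivityTimestamps) {
    if (now - lastActivity > STALE_AFTER_MS) {
      sessionActivityTimestamps.delete(chatId); // safe: Map iteration tolerates deletes
      removed.push(chatId);
    }
  }
  return removed;
}

sessionActivityTimestamps.set('stale-chat', Date.now() - 31 * 60 * 1000);
sessionActivityTimestamps.set('fresh-chat', Date.now());
const swept = sweepStaleSessions(); // ['stale-chat']
// setInterval(sweepStaleSessions, 10 * 60 * 1000) would run this periodically
```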

Participant management (group-chat-agent.ts, 429 lines):

Each participant agent is spawned via addParticipant() which:

  1. Validates the moderator is active (cannot add participants without moderator)
  2. Resolves agent configuration via AgentDetector
  3. Builds CLI arguments using the declarative arg builder pattern
  4. Applies session-specific overrides (custom model, custom args, env vars)
  5. Wraps with SSH configuration if remote execution is configured
  6. Applies Windows-specific spawn configuration (PowerShell, stdin mode)
  7. Spawns the agent with a system prompt from group-chat-participant.md template

The participant system prompt uses template variables {{GROUP_CHAT_NAME}}, {{PARTICIPANT_NAME}}, and {{LOG_PATH}} to give each participant awareness of their role and access to the shared chat log.
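The substitution itself is simple; a minimal renderer for those `{{VARIABLE}}` placeholders might look like this (Maestro's actual template renderer may differ):

```typescript
// Minimal {{VARIABLE}} substitution; the real renderer in Maestro may differ.
function renderTemplate(tpl: string, vars: Record<string, string>): string {
  return tpl.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => vars[key] ?? '');
}

const rendered = renderTemplate(
  'You are {{PARTICIPANT_NAME}} in "{{GROUP_CHAT_NAME}}". Log: {{LOG_PATH}}',
  { PARTICIPANT_NAME: 'claude', GROUP_CHAT_NAME: 'design-review', LOG_PATH: '/tmp/log.jsonl' }
);
// -> 'You are claude in "design-review". Log: /tmp/log.jsonl'
```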

Active session tracking uses in-memory Map structures:

  • activeModeratorSessions: Map<groupChatId, sessionId>
  • activeParticipantSessions: Map<groupChatId:participantName, sessionId>
  • sessionActivityTimestamps: Map<groupChatId, timestamp> (for stale cleanup)

Strength: The moderator pattern is well-designed. It naturally handles multi-round agent coordination without a fixed workflow. The per-message moderator spawning avoids long-lived process management complexity.

Limitation: The moderator is always a single AI agent. There's no support for hierarchical moderators, or for the moderator to spawn sub-moderators for complex tasks. All sessions are in-memory only; a crash loses all active group chat state.

Confidence: High -- Thoroughly documented in ARCHITECTURE.md and confirmed in source code (group-chat-moderator.ts and group-chat-agent.ts).

5.2 Symphony Orchestration

Files:

  • /tmp/ai-harness-repos/Maestro/src/main/ipc/handlers/symphony.ts (200+ lines read)
  • /tmp/ai-harness-repos/Maestro/SYMPHONY_REGISTRY.md
  • /tmp/ai-harness-repos/Maestro/SYMPHONY_ISSUES.md
  • /tmp/ai-harness-repos/Maestro/docs/symphony.md

Symphony extends Auto Run to open source contribution:

  1. Registry (symphony-registry.json) lists participating repositories
  2. Issues with runmaestro.ai label define contribution opportunities
  3. Contribution flow:
    • Clone repository to ~/Maestro-Symphony/{owner}-{repo}/
    • Create branch symphony/{issue-number}-{short-id}
    • Set up Auto Run documents (from issue body)
    • Process documents automatically
    • Create draft PR (claims the issue)
    • Finalize PR when complete

Validation is thorough (symphony.ts lines 69-191):

  • Path traversal prevention via sanitizeRepoName()
  • GitHub URL validation (HTTPS only, github.com only)
  • Repo slug format validation
  • Document path validation (no .., no leading /)
  • External URL validation (GitHub domains only)
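Two of those rules can be re-created in a few lines. The real checks live in symphony.ts; these function bodies are illustrative only:

```typescript
// Illustrative re-creation of two validation rules; not the symphony.ts source.
function sanitizeRepoName(owner: string, repo: string): string {
  const safe = /^[A-Za-z0-9_.-]+$/;
  if (!safe.test(owner) || !safe.test(repo)) throw new Error('invalid repo slug');
  if (owner.includes('..') || repo.includes('..')) throw new Error('path traversal');
  return `${owner}-${repo}`; // matches ~/Maestro-Symphony/{owner}-{repo}/
}

function isAllowedGitHubUrl(url: string): boolean {
  try {
    const u = new URL(url); // HTTPS only, github.com only
    return u.protocol === 'https:' && u.hostname === 'github.com';
  } catch {
    return false;
  }
}

const dir = sanitizeRepoName('octocat', 'hello-world'); // 'octocat-hello-world'
const ok = isAllowedGitHubUrl('https://github.com/octocat/hello-world'); // true
const bad = isAllowedGitHubUrl('http://github.com/octocat/hello-world'); // false
```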

Confidence: High -- From source code and documentation.


6. Multi-Agent / Parallelization Strategy

6.1 Agent-Level Parallelism

Maestro supports unlimited parallel agents, each with its own workspace and process pair. The Left Bar shows all agents simultaneously. Agent switching is keyboard-driven (Cmd+[ / Cmd+]).

6.2 Tab-Level Parallelism

Each agent supports multiple AI tabs (AITab[]), each potentially connected to a different provider session. This enables parallel conversations within a single agent workspace.

6.3 Execution Queue

File: ARCHITECTURE.md (Execution Queue, lines 1096-1137)

The execution queue is a per-agent sequential processing queue that prevents conflicting operations:

interface QueuedItem {
    id: string;
    type: 'message' | 'command';
    content: string;
    tabId: string;
    readOnlyMode: boolean;
    timestamp: number;
    source: 'user' | 'autorun';
}

Queue processing rules:

  • Items are processed FIFO within each agent
  • When the current agent query completes (process exits), the next queued item is dispatched
  • Read-only operations (readOnlyMode: true) can potentially execute in parallel (agent-dependent)
  • Write operations must be sequential to prevent file conflicts
  • Auto Run tasks enter the same queue as user messages (source: 'autorun')
  • Users can inspect and cancel pending items via the Execution Queue Browser (Cmd+K -> "Execution Queue")
  • The queue persists across tab switches but not across app restarts
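The per-agent FIFO behavior can be sketched as below. Names here are assumptions; `dispatchNext` stands in for "the current agent query completed, so dispatch the next queued item":

```typescript
// FIFO sketch of the per-agent queue rules (names assumed, not Maestro's).
interface QueueItem { id: string; content: string; source: 'user' | 'autorun' }

const queues = new Map<string, QueueItem[]>();
const executed: string[] = [];

function enqueue(agentId: string, item: QueueItem): void {
  const q = queues.get(agentId) ?? [];
  q.push(item);
  queues.set(agentId, q);
}

// Called when the agent's current query completes (process exit):
function dispatchNext(agentId: string): QueueItem | undefined {
  const next = queues.get(agentId)?.shift(); // FIFO within each agent
  if (next) executed.push(`${agentId}:${next.id}`);
  return next;
}

enqueue('agent-a', { id: '1', content: 'fix bug', source: 'user' });
enqueue('agent-a', { id: '2', content: 'next task', source: 'autorun' }); // same queue as user messages
dispatchNext('agent-a'); // item 1 runs first
```

Because each agent has its own queue, sequencing is per-agent while different agents proceed in parallel.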

6.4 Worktree Parallelism

Without worktree mode: Auto Run tasks queue through the execution queue (sequential within an agent, parallel across agents).

With worktree mode: Auto Run operates in a separate directory, enabling true parallelization with the main workspace. No queue conflicts.

6.5 Group Chat Parallelism

When the moderator @mentions multiple agents, they are spawned as parallel batch processes. The system tracks pending responses and triggers synthesis only when ALL agents have responded.

6.6 What's Missing

  • No work-stealing or load balancing between agents
  • No automatic task distribution across agents (user must manually assign)
  • No dependency graph execution (tasks within a document are sequential)
  • No cross-agent pipeline (output of Agent A cannot feed into Agent B automatically, except via Group Chat)

Confidence: High -- From architecture documentation and source code analysis.


7. Isolation Model

7.1 Git Worktrees

Files:

  • ARCHITECTURE.md (Git Worktree Integration, lines 845-873)
  • IPC handlers: git:worktreeInfo, git:worktreeSetup, git:worktreeCheckout

Maestro provides first-class git worktree support:

  • Create worktree sub-agents from the git branch menu
  • Each worktree operates in its own directory
  • AI agents process tasks independently
  • One-click PR creation from worktree branches

interface WorktreeConfig {
    enabled: boolean;
    path: string;                   // Absolute path for the worktree
    branchName: string;             // Branch name to use/create
    createPROnCompletion: boolean;  // Create PR when Auto Run finishes
}
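The git invocation this config implies can be sketched as an argument builder (argument construction only; the real `git:worktreeSetup` handler is not shown in this report):

```typescript
// Sketch of the git arguments a worktree setup implies (handler internals assumed).
interface WorktreeConfig {
  enabled: boolean;
  path: string;
  branchName: string;
  createPROnCompletion: boolean;
}

function worktreeAddArgs(cfg: WorktreeConfig): string[] | null {
  if (!cfg.enabled) return null;
  // `git worktree add -b <branch> <path>` creates the branch and directory in one step
  return ['worktree', 'add', '-b', cfg.branchName, cfg.path];
}

const args = worktreeAddArgs({
  enabled: true,
  path: '/tmp/repo-worktrees/feature-x',
  branchName: 'feature-x',
  createPROnCompletion: true,
});
// -> ['worktree', 'add', '-b', 'feature-x', '/tmp/repo-worktrees/feature-x']
```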

7.2 Session Isolation

Each agent/session has:

  • Its own working directory (cwd)
  • Its own process pair (AI + terminal)
  • Its own conversation tabs (each with independent provider session)
  • Its own execution queue
  • Its own file tree and git state

7.3 Data Isolation

Development mode uses isolated data directories:

npm run dev          -> maestro-dev/     (separate from production)
npm run dev:demo     -> /tmp/maestro-demo/  (completely fresh)
npm run dev:prod-data -> maestro/        (production data)

Settings stored separately:

  • maestro-settings.json -- User preferences
  • maestro-sessions.json -- Agent persistence
  • maestro-groups.json -- Agent groups
  • maestro-agent-configs.json -- Per-agent configuration

7.4 SSH Remote Isolation

Agents can execute on remote hosts via SSH. The SSH spawn wrapper (src/main/utils/ssh-spawn-wrapper.ts) wraps any agent spawn command with SSH transport:

SSH remote configuration:

interface SshRemoteConfig {
    enabled: boolean;
    remoteId: string | null;
    workingDirOverride?: string;
}

The wrapper transforms local spawn configs to SSH-wrapped versions:

  • Commands are prefixed with ssh -t <host> for remote execution
  • The remote host's login shell ($SHELL -lc) is used to ensure PATH is properly loaded
  • File paths are resolved on the remote filesystem
  • Prompts are passed via stdin as a script (sshStdinScript) rather than command-line arguments, avoiding both shell escaping issues and the 8KB command length limit on Windows
  • Long prompts are base64-encoded for transport safety
  • Each agent type can have per-session SSH configuration, enabling mixed local/remote agent fleets
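The base64 stdin-script idea is the interesting part: encoding sidesteps both shell escaping and the ~8KB Windows command-length limit. A minimal sketch of the mechanism (the exact script Maestro generates is not shown here):

```typescript
// Mechanism sketch only; Maestro's actual sshStdinScript generation may differ.
function buildSshStdinScript(prompt: string, agentCommand: string): string {
  const encoded = Buffer.from(prompt, 'utf8').toString('base64');
  // The remote login shell decodes the prompt and pipes it into the agent binary.
  return `echo '${encoded}' | base64 -d | ${agentCommand}`;
}

const script = buildSshStdinScript('Summarize this repository', 'claude -p');
// script would then be fed via stdin to: ssh -t <host> "$SHELL -lc ..."
```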

SSH error detection is handled by the dedicated SSH_ERROR_PATTERNS (see Section 3.4) which detect transport-level failures separately from agent-level errors.

Group Chat participants can individually be SSH-remoted: The addParticipant() function accepts sessionOverrides.sshRemoteConfig, enabling heterogeneous Group Chat setups where some agents run locally and others on remote machines.

7.5 What's Missing

  • No containerized isolation (Docker, sandboxing). Agents run with the same privileges as the user.
  • No resource limits per agent (CPU, memory, disk)
  • No network isolation between agents
  • No filesystem sandboxing (agents can access any file the user can)

Confidence: High -- From architecture documentation and SECURITY.md.


8. Human-in-the-Loop Controls

8.1 Read-Only Mode

Each tab has a readOnlyMode toggle. When enabled:

  • Claude Code uses --permission-mode plan
  • Codex uses --sandbox read-only
  • OpenCode uses --agent plan

This prevents agents from making file changes while allowing analysis.

8.2 Pause/Resume

The batch processor supports pause/resume for Auto Run:

pauseBatchRun()   // Pause current batch run
resumeBatchRun()  // Resume execution
stopBatchRun()    // Stop current batch run

8.3 Execution Queue Management

Users can view and cancel pending queue items via the Execution Queue Browser (Cmd+K -> "Execution Queue").

8.4 Agent Error Handling

When agents encounter errors (auth expired, token exhaustion, rate limit):

  1. Error modal appears with error details
  2. Input is blocked (agentErrorPaused: true)
  3. User must acknowledge and decide how to proceed
  4. Recovery options are presented based on error type

Error types from src/main/parsers/error-patterns.ts:

  • auth_expired -- API key invalid, login required
  • token_exhaustion -- Context window full
  • rate_limited -- Too many requests
  • network_error -- Connection failed
  • agent_crashed -- Non-zero exit code
  • permission_denied -- Operation not allowed
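A recovery-option mapping keyed by these error types might look like the following. The union type reflects error-patterns.ts; the option strings are illustrative, not Maestro's actual modal copy:

```typescript
// Error type union from error-patterns.ts; option strings are illustrative only.
type AgentErrorType =
  | 'auth_expired'
  | 'token_exhaustion'
  | 'rate_limited'
  | 'network_error'
  | 'agent_crashed'
  | 'permission_denied';

function recoveryOptions(err: AgentErrorType): string[] {
  switch (err) {
    case 'auth_expired':      return ['Re-authenticate', 'Switch agent'];
    case 'token_exhaustion':  return ['Compact context', 'Start a new tab'];
    case 'rate_limited':      return ['Retry later', 'Switch model'];
    case 'network_error':     return ['Retry', 'Check connection'];
    case 'agent_crashed':     return ['Restart agent', 'Inspect logs'];
    case 'permission_denied': return ['Disable read-only mode', 'Adjust permissions'];
  }
}

const opts = recoveryOptions('token_exhaustion'); // ['Compact context', 'Start a new tab']
```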

8.5 Confirmation Dialogs

  • Agent deletion requires confirmation (unless showConfirmation: false)
  • Playbook deletion has a dedicated confirmation modal (PlaybookDeleteConfirmModal)
  • Tab closing with unsaved edits prompts for confirmation
  • Group renaming, session renaming have dedicated modals

8.6 What's Missing

  • No approval gates for specific operations (e.g., approve before file write)
  • No cost limit enforcement (user can see costs but can't set spend limits)
  • No automated rollback (if an agent makes bad changes, user must manually revert)
  • No per-task review step in Auto Run (tasks execute sequentially without review between them)

Confidence: High -- From architecture documentation and UI component analysis.


9. Context Handling Strategy

9.1 Context Merging

File: /tmp/ai-harness-repos/Maestro/src/main/ipc/handlers/context.ts (508 lines)

Maestro provides context merge operations via 5 IPC handlers:

Handler Status Purpose
context:getStoredSession Active Retrieve messages from agent session storage
context:groomContext Active (recommended) Single-call grooming: spawn agent, send prompt, collect response
context:cancelGrooming Active Cancel all active grooming sessions
context:createGroomingSession Deprecated Create a temporary interactive grooming session
context:sendGroomingPrompt Deprecated Send prompt to existing grooming session
context:cleanupGroomingSession Active Clean up temporary grooming session

The evolution from the deprecated two-step createGroomingSession + sendGroomingPrompt to the single-call groomContext demonstrates an architectural simplification. The original approach required managing long-lived processes and response collection via event listeners with idle timeouts. The new approach uses the shared groomContext() utility from src/main/utils/context-groomer.ts.

Grooming response collection (deprecated path, still in codebase):

// context.ts line 287 - Response collection with multiple completion signals
return new Promise<string>((resolve, reject) => {
    let responseBuffer = '';
    let lastDataTime = Date.now();

    // Completion triggers:
    // 1. Process exit -> return whatever was collected
    // 2. Idle timeout (5s with no data + min 100 chars) -> return
    // 3. Overall timeout (5 minutes) -> return or reject
    // 4. Agent error -> reject
});

The grooming operation has a 5-minute timeout (GROOMING_TIMEOUT_MS = 5 * 60 * 1000). The idle check (1-second interval, 5-second inactivity with >= 100 character response) handles cases where the agent process does not cleanly exit.
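Isolated as a pure predicate, the idle heuristic reads as follows (constants from the text; the function shape is an assumption):

```typescript
// Idle-completion heuristic; constants from the docs, function shape assumed.
const IDLE_MS = 5_000;  // 5s with no new data
const MIN_CHARS = 100;  // minimum response size before idle counts as completion

function isGroomingComplete(responseLength: number, lastDataTime: number, now: number): boolean {
  return responseLength >= MIN_CHARS && now - lastDataTime >= IDLE_MS;
}

const done = isGroomingComplete(150, 0, 6_000);    // true: long enough, 6s idle
const tooShort = isGroomingComplete(50, 0, 6_000); // false: under the 100-char floor
```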

9.2 Context Grooming Prompts

Files:

  • src/prompts/context-grooming.md
  • src/prompts/context-summarize.md
  • src/prompts/context-transfer.md

These prompts enable:

  • Compaction -- Summarize a conversation to reduce context size
  • Transfer -- Export context from one session to another
  • Grooming -- Clean up context for better agent performance

9.3 Per-Tab Context

Each AI tab has its own:

  • logs: LogEntry[] -- Tab-specific conversation history
  • agentSessionId?: string -- Provider session ID
  • scrollTop?: number -- Scroll position
  • draftInput?: string -- Unsaved input

Context is isolated per tab. When creating a new tab, it starts with a fresh context (new provider session). Resuming a tab reconnects to its existing provider session.

9.4 Tab Overlay Context Operations

The tab hover overlay menu (after 400ms hover) includes:

  • Context: Compact (if tab has 5+ messages) -- Summarize conversation
  • Context: Merge Into (if provider session exists) -- Import context from another session
  • Context: Send to Agent (if provider session exists) -- Export context to another agent

9.5 Context Usage Tracking

Per-tab context usage is tracked as a percentage of the context window:

contextUsage: number;  // Context window usage percentage (0-100)

The context window size varies by agent:

  • Claude Code: 200,000 tokens (always reported in JSON output)
  • Codex: 200,000 tokens (default for GPT-5.x)
  • OpenCode: 128,000 tokens (default)
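A plausible derivation of that percentage from token counts and the per-agent window size is shown below; the exact formula (e.g. how cache tokens are counted) is an assumption:

```typescript
// Plausible contextUsage derivation; the exact formula in Maestro is assumed.
function contextUsagePercent(usedTokens: number, contextWindow: number): number {
  if (contextWindow <= 0) return 0;
  return Math.min(100, Math.round((usedTokens / contextWindow) * 100));
}

const pct = contextUsagePercent(160_000, 200_000); // 80
```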

9.6 What's Missing

  • No automatic context compaction (user must manually trigger)
  • No context chunking for large codebases (relies on agent's built-in RAG)
  • No persistent vector store for retrieval
  • No cross-session context inheritance (new sessions start fresh)
  • No context budget enforcement (agents can exhaust context without warning)

Confidence: High -- From context handler source code and prompts directory.


10. Session Lifecycle and Persistence

10.1 Session Creation

  1. User clicks "New Agent" (Cmd+N)
  2. Selects provider (Claude Code, Codex, OpenCode, Factory Droid)
  3. Selects working directory
  4. createNewSession(agentId, workingDir, name) is called
  5. Two processes spawned: AI agent (child process) + terminal (PTY)
  6. Session added to sessions[] state and persisted

10.2 Session Persistence

Settings stored via electron-store:

  • macOS: ~/Library/Application Support/maestro/
  • Windows: %APPDATA%/maestro/
  • Linux: ~/.config/maestro/

Files:

  • maestro-settings.json -- User preferences (debounced 2-second persistence)
  • maestro-sessions.json -- Agent data
  • maestro-groups.json -- Agent groups
  • maestro-agent-configs.json -- Per-agent configuration

The persistence system uses useDebouncedPersistence (2-second debounce) with flush on visibility change and beforeunload to prevent data loss.

10.3 Session Discovery

Maestro automatically discovers existing provider sessions:

  • Claude Code: ~/.claude/projects/<encoded-path>/
  • Codex: ~/.codex/sessions/YYYY/MM/DD/*.jsonl
  • OpenCode: ~/.local/share/opencode/storage/
  • Factory Droid: ~/.factory/sessions/

Users can browse, search, star, rename, and resume any discovered session.

10.4 Session Resume

Each agent supports session resume with provider-specific flags:

  • Claude Code: --resume <session-id>
  • Codex: resume <thread_id> (subcommand)
  • OpenCode: --session <session-id>
  • Factory Droid: -s, --session-id <id>
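Expressed as a declarative mapping, those per-provider flags might be modeled like this (a plausible shape; Maestro's real arg builder lives in definitions.ts):

```typescript
// Declarative resume-flag mapping; shape assumed, flags from the list above.
const resumeArgs: Record<string, (id: string) => string[]> = {
  'claude-code':   (id) => ['--resume', id],
  'codex':         (id) => ['resume', id], // subcommand, not a flag
  'opencode':      (id) => ['--session', id],
  'factory-droid': (id) => ['--session-id', id],
};

const codexResume = resumeArgs['codex']('thread-123'); // ['resume', 'thread-123']
```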

10.5 Session States

Color-coded states:

  • Green -- Ready/idle
  • Yellow -- Agent thinking/busy
  • Red -- No connection/error
  • Pulsing Orange -- Connecting

10.6 History Persistence

Command history is maintained per-session:

  • aiCommandHistory: string[] -- AI input history
  • shellCommandHistory: string[] -- Terminal input history

History entries are also stored in the SQLite stats database for analytics.

Confidence: High -- From architecture documentation and source code.


11. Code Quality Gates

11.1 Pre-commit Hooks

File: /tmp/ai-harness-repos/Maestro/.husky/pre-commit

Husky + lint-staged runs on every commit:

"lint-staged": {
    "*.{ts,tsx}": [
        "prettier --write",
        "eslint --fix"
    ]
}

11.2 TypeScript

Strict mode enabled. Three separate tsconfig files:

  • tsconfig.lint.json -- Renderer, web, and shared code
  • tsconfig.main.json -- Main process code
  • tsconfig.cli.json -- CLI tooling

11.3 ESLint

Configured with TypeScript and React plugins:

  • react-hooks/rules-of-hooks
  • react-hooks/exhaustive-deps
  • @typescript-eslint/no-unused-vars
  • prefer-const

11.4 Testing

Framework: Vitest (4 configurations):

  • vitest.config.mts -- Unit tests
  • vitest.integration.config.ts -- Integration tests
  • vitest.e2e.config.ts -- E2E tests (with Playwright)
  • vitest.performance.config.mts -- Performance tests

Coverage: 490 test files across:

src/__tests__/
    cli/           # CLI tool tests
    main/          # Electron main process tests
    renderer/      # React component and hook tests
    shared/        # Shared utility tests
    web/           # Web interface tests
    integration/   # Integration tests
    e2e/           # E2E tests

11.5 CI/CD

File: /tmp/ai-harness-repos/Maestro/.github/workflows/release.yml

Release workflow builds for 4 platforms:

  • macOS (universal: x64 + arm64)
  • Linux x64
  • Linux ARM64 (native ARM runner)
  • Windows x64

Architecture verification is thorough: native modules (node-pty, better-sqlite3) are verified to be built for the correct architecture before AND after packaging. This was clearly born from painful debugging of cross-architecture contamination issues.

11.6 Automated PR Review

Two AI tools review PRs:

  • CodeRabbit -- Line-level code review
  • Greptile -- Codebase-aware architectural review

11.7 Error Tracking

Sentry integration for crash reporting:

  • src/main/utils/sentry.ts
  • src/renderer/components/ErrorBoundary.tsx
  • Dynamic import to avoid module-load-time errors
  • Disabled in development mode
  • User can opt out via settings

11.8 What's Missing

  • No linting or testing in CI before release (the release workflow only builds, doesn't run tests)
  • No required CI checks before merge (mentioned as "in scope" but not enforced)
  • No code coverage thresholds (coverage is available but no minimum enforcement)
  • No security scanning (no SAST, no dependency audit in CI)

Confidence: High -- From configuration files and CI workflow.


12. Security and Compliance Mechanisms

12.1 IPC Security

From SECURITY.md (lines 77-80):

  • Context isolation: Enabled
  • Minimal preload API surface via contextBridge.exposeInMainWorld
  • No require() in renderer
  • Input validation in main process handlers

12.2 Command Execution Security

  • execFileNoThrow used for all external commands (never shell-based execution)
  • spawn() with shell: false for AI agent processes
  • Path traversal prevention in Symphony handlers (sanitizeRepoName())
  • URL validation for external resources (HTTPS only, domain allowlists)

12.3 Process Execution Model

From SECURITY.md (lines 69-73):

Maestro spawns AI agents and terminal processes with the same privileges as the user running the application. This is by design.

Known security considerations:

  • Agents can execute commands on the system
  • Local web server exposes sessions (no auth by default)
  • Cloudflare tunnel URLs are temporary but unauthenticated
  • Sentry DSN is intentionally public (standard client-side practice)

12.4 Encore Features (Feature Gating)

Feature gating via EncoreFeatureFlags:

  • Features disabled by default are completely invisible (no shortcuts, no menu items)
  • First example: Director's Notes
  • Serves as precursor to a plugin marketplace

12.5 What's Missing

  • No authentication for web/mobile interface (anyone with the URL can control agents)
  • No rate limiting on the web server (the @fastify/rate-limit dependency is present, but it is unclear whether it is configured)
  • No audit logging (actions are tracked for analytics but not for security audit)
  • No credential management (API keys are managed by the underlying agents, not Maestro)
  • No sandboxing of AI agent execution (runs with full user privileges)
  • No content security policy in the Electron renderer

Confidence: High -- From SECURITY.md and source code analysis.


13. Hooks, Automation Surface, and Fail-Safe Behavior

13.1 Automation Surface

CLI Tool (maestro-cli):

maestro list agents              # List available AI agents
maestro list groups              # List session groups
maestro list playbooks           # List saved playbooks
maestro list sessions <agent-id> # List agent sessions
maestro show agent <id>          # Show agent details
maestro show playbook <id>       # Show playbook configuration
maestro playbook <id>            # Run a playbook
maestro send <agent-id> <msg>    # Send message, get JSON response
maestro clean playbooks          # Remove orphaned playbooks

All commands support --json flag for JSONL output (machine-parseable).
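Consuming that JSONL output from a script is straightforward: one JSON object per line. The event fields below are hypothetical; only the JSONL framing is documented:

```typescript
// JSONL consumer; event fields are hypothetical, framing is from the docs.
function parseJsonl(output: string): unknown[] {
  return output
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}

const events = parseJsonl('{"type":"response","text":"done"}\n{"type":"usage","tokens":42}\n');
// events.length === 2
```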

IPC API: The window.maestro API provides 17+ namespaces that could be used by custom extensions.

Custom AI Commands: User-defined slash commands with template variables.

13.2 Fail-Safe Behavior

  • SIGINT -> SIGTERM escalation (2 second timeout)
  • Process cleanup on exit (killAll() on app shutdown)
  • Orphaned tab repair (ensureInUnifiedTabOrder() repairs missing tab references)
  • Settings flush on visibility change (prevents data loss)
  • Error boundaries in React components
  • Grooming session timeout (5 minutes)
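The SIGINT -> SIGTERM escalation can be sketched as below; the structural process type and the injectable scheduler are assumptions made purely so the sketch is testable:

```typescript
// Escalation sketch; KillableProc and the scheduler parameter are test-friendly assumptions.
interface KillableProc {
  killed: boolean;
  exitCode: number | null;
  kill(signal: 'SIGINT' | 'SIGTERM'): void;
}

function gracefulKill(
  proc: KillableProc,
  schedule: (fn: () => void, ms: number) => void = (fn, ms) => { setTimeout(fn, ms); }
): void {
  proc.kill('SIGINT'); // ask the agent to stop cleanly first
  schedule(() => {
    if (!proc.killed && proc.exitCode === null) proc.kill('SIGTERM'); // then escalate
  }, 2_000);
}
```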

13.3 Power Management

src/main/power-manager.ts prevents system sleep while agents are busy (configurable).

13.4 Auto-Update

src/main/auto-updater.ts and src/main/update-checker.ts handle automatic updates via electron-updater.

13.5 What's Missing

  • No webhook/HTTP API for external automation (only CLI and IPC)
  • No plugin system (Encore Features is a precursor, not yet a full plugin API)
  • No event bus for external consumers (events are internal to Electron IPC)
  • No watchdog for agent health (agents that hang indefinitely are not automatically killed)

Confidence: High -- From source code analysis.


14. CLI/UX and Automation Ergonomics

14.1 Desktop UX

Keyboard-first design is deeply implemented:

  • 30+ keyboard shortcuts documented in src/renderer/constants/shortcuts.ts
  • Cmd+K quick actions (command palette)
  • Cmd+J toggle AI/terminal mode
  • Cmd+N new agent
  • Cmd+[ / Cmd+] switch agents
  • Cmd+T new tab
  • Cmd+W close tab
  • Escape always returns to a known state (via Layer Stack system)

Layer Stack System (ARCHITECTURE.md lines 252-380):

  • Centralized modal/overlay management
  • Predictable Escape key handling (highest priority modal closes first)
  • 30+ modal priority levels defined
  • Focus traps ('strict', 'lenient', 'none')
  • ARIA attributes for accessibility

Keyboard Mastery Tracking -- Gamification that rewards keyboard usage:

  • Achievements for time spent using Auto Run
  • 15 conductor-themed badge levels (Apprentice to Transcendent Maestro)
  • Standing Ovation overlay with confetti animation for new badges

14.2 Mobile UX

PWA with mobile-optimized components:

  • Bottom navigation bar (TabBar.tsx)
  • Session pill bar (horizontal scrolling)
  • Voice input support
  • Swipe gestures (useSwipeGestures.ts, usePullToRefresh.ts)
  • Offline queue (useOfflineQueue.ts)
  • Push notifications
  • Connection status indicator

14.3 CLI UX

The maestro-cli provides:

  • Human-readable output (tables and text)
  • JSONL output for scripting
  • --dry-run for playbook execution
  • --debug and --verbose flags
  • --wait to wait for busy agents
  • Pagination for session listing (--limit, --skip)
  • Search filtering (--search)

14.4 Theme System

16 themes across 3 modes (dark, light, vibe):

  • Dracula, Monokai, Nord, Tokyo Night, Catppuccin Mocha, Gruvbox Dark
  • GitHub Light, Solarized, One Light, Gruvbox Light, Catppuccin Latte, Ayu Light
  • Colorblind-friendly palettes (Wong-based)

14.5 What's Missing

  • No CI/CD pipeline integration (CLI can run playbooks but no built-in GitHub Actions integration)
  • No REST API for programmatic access
  • No dashboard web UI (mobile is read/control only, no analytics on mobile)

Confidence: High -- From documentation and source code.


15. Cost/Usage Visibility and Governance

15.1 Real-Time Cost Tracking

Per-session token usage and cost tracking:

interface UsageStats {
    inputTokens: number;
    outputTokens: number;
    cacheReadInputTokens: number;
    cacheCreationInputTokens: number;
    totalCostUsd: number;
    contextWindow: number;
    reasoningTokens?: number;
}

Cost tracking is agent-dependent:

  • Claude Code: Full cost tracking (USD)
  • OpenCode: Full cost tracking (USD from step_finish events)
  • Codex: Token counts only (no USD -- pricing varies by model)
  • Factory Droid: Token counts only
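Aggregating these per-session stats into dashboard totals is a simple reduction. The sketch below is trimmed to three UsageStats fields; Maestro's real aggregation lives in src/main/stats/:

```typescript
// Dashboard-totals reduction; trimmed field set, real logic lives in src/main/stats/.
interface UsageTotals { inputTokens: number; outputTokens: number; totalCostUsd: number }

function sumUsage(sessions: UsageTotals[]): UsageTotals {
  return sessions.reduce(
    (acc, s) => ({
      inputTokens: acc.inputTokens + s.inputTokens,
      outputTokens: acc.outputTokens + s.outputTokens,
      totalCostUsd: acc.totalCostUsd + s.totalCostUsd,
    }),
    { inputTokens: 0, outputTokens: 0, totalCostUsd: 0 }
  );
}

const totals = sumUsage([
  { inputTokens: 1_000, outputTokens: 200, totalCostUsd: 0.05 },
  { inputTokens: 3_000, outputTokens: 800, totalCostUsd: 0.15 },
]);
// totals.inputTokens === 4000
```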

15.2 Usage Dashboard

Files:

  • src/renderer/components/UsageDashboard/ (10+ components)
  • /tmp/ai-harness-repos/Maestro/src/main/stats/ (13 files)
  • CLAUDE-FEATURES.md (lines 7-75)

SQLite-backed analytics with:

  • Summary cards (queries, duration, cost, Auto Runs)
  • Agent comparison bar chart
  • Source distribution pie chart (user vs. auto queries)
  • Activity heatmap (GitHub-style)
  • Duration trends line chart
  • Auto Run-specific statistics
  • Time range filtering (day, week, month, year, all time)
  • CSV export
  • Real-time updates
  • Colorblind-friendly palettes

15.3 Stats Database Architecture

Files:

  • /tmp/ai-harness-repos/Maestro/src/main/stats/stats-db.ts (833 lines)
  • /tmp/ai-harness-repos/Maestro/src/main/stats/schema.ts (142 lines)
  • /tmp/ai-harness-repos/Maestro/src/main/stats/migrations.ts
  • /tmp/ai-harness-repos/Maestro/src/main/stats/aggregations.ts

The StatsDB class manages a SQLite database (stats.db in the user data directory) with 4 main tables:

Table Purpose Key Fields
query_events Every AI query session_id, agent_type, source (user/auto), start_time, duration, project_path
auto_run_sessions Auto Run execution runs session_id, document_path, tasks_total, tasks_completed
auto_run_tasks Individual tasks within Auto Runs task_index, task_content, duration, success (0/1)
session_lifecycle Session creation/closure agent_type, created_at, closed_at, is_remote

Supporting tables:

  • _migrations -- Schema migration tracking with version, description, status, and error_message
  • _meta -- Internal key-value storage (e.g., last vacuum timestamp)

Database resilience features:

  1. WAL mode: PRAGMA journal_mode = WAL for concurrent read/write access
  2. Integrity checking: PRAGMA integrity_check on every startup to detect corruption
  3. Daily backups: Automatic daily backup with 7-day rotation (stats.db.daily.YYYY-MM-DD)
  4. Corruption recovery: Multi-step recovery process:
    • Backup corrupted database for forensics (stats.db.corrupted.{timestamp})
    • Remove stale WAL/SHM sidecar files that can cause false corruption detection
    • Iterate through available backups, validating each with integrity check
    • Restore from first valid backup, or create fresh database if none valid
  5. Weekly VACUUM: Scheduled vacuum (not on every startup) via _meta table timestamp tracking, triggered only when database exceeds 100MB
  6. WAL checkpoint before backup: PRAGMA wal_checkpoint(TRUNCATE) ensures the .db file is self-contained before copying

// stats-db.ts line 333 - Safe backup copy
private safeBackupCopy(destPath: string): void {
    if (this.db) {
        this.db.pragma('wal_checkpoint(TRUNCATE)');
    }
    fs.copyFileSync(this.dbPath, destPath);
}
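The 7-day rotation for daily backups can be illustrated as a pruning function. The filename pattern is from the text; the pruning logic itself is an assumption:

```typescript
// 7-day rotation sketch; filename pattern from the docs, pruning logic assumed.
function backupsToPrune(files: string[], keep = 7): string[] {
  const daily = files
    .filter((f) => /^stats\.db\.daily\.\d{4}-\d{2}-\d{2}$/.test(f))
    .sort(); // ISO dates sort lexicographically, oldest first
  return daily.slice(0, Math.max(0, daily.length - keep)); // oldest beyond the window
}

backupsToPrune(['stats.db.daily.2026-02-10']); // [] -- still inside the 7-day window
```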

Migration system: Versioned migrations with individual success/failure tracking per migration. Each migration is recorded in the _migrations table with its status and any error message, enabling precise debugging of upgrade failures.

Statement caching: Each CRUD module (query-events.ts, auto-run.ts, session-lifecycle.ts) maintains prepared statement caches that are cleared on database close, avoiding repeated SQL parsing overhead.

Confidence: High -- Directly from source code (stats-db.ts 833 lines, schema.ts 142 lines).

15.4 WakaTime Integration

src/main/wakatime-manager.ts provides integration with WakaTime for developer activity tracking.

15.5 Global Stats

Cross-project statistics from Claude Code sessions:

const stats = await window.maestro.claude.getGlobalStats();
// Returns: { totalSessions, totalMessages, totalInputTokens, totalOutputTokens,
//            totalCacheReadTokens, totalCacheCreationTokens, totalCostUsd, totalSizeBytes }

15.6 What's Missing

  • No cost budgets or limits (tracking only, no enforcement)
  • No alerts when spending exceeds thresholds
  • No per-playbook cost attribution (costs are per-session, not per-task)
  • No team/organization cost aggregation

Confidence: High -- From source code and documentation.


16. Tooling and Dependency Surface

16.1 Runtime Requirements

  • Node.js >= 22.0.0 (specified in package.json engines)
  • Electron 28 (desktop runtime)
  • Git (optional, for git-aware features)
  • At least one AI agent installed:
    • Claude Code
    • OpenAI Codex
    • OpenCode
    • Factory Droid

16.2 Key Dependencies

Native modules (require compilation):

  • node-pty -- Terminal emulation
  • better-sqlite3 -- Analytics database

Backend:

  • electron-store -- Settings persistence
  • fastify + WebSocket -- Web server for mobile
  • chokidar -- File watching
  • commander -- CLI argument parsing
  • archiver / adm-zip -- Playbook import/export
  • @sentry/electron -- Error tracking
  • electron-updater -- Auto-updates

Frontend:

  • react 18 + react-dom + zustand (state management)
  • tailwindcss -- Styling
  • react-markdown + remark-gfm -- Markdown rendering
  • react-syntax-highlighter -- Code highlighting
  • reactflow -- Document graph visualization
  • recharts -- Usage dashboard charts
  • d3-force -- Graph layout
  • mermaid -- Diagram rendering
  • canvas-confetti -- Achievement celebrations
  • marked -- Markdown parsing
  • dompurify -- HTML sanitization
  • js-tiktoken -- Token counting
  • @tanstack/react-virtual -- Virtual scrolling

Dev tooling:

  • vite -- Build tool
  • vitest -- Test framework
  • playwright -- E2E testing
  • esbuild -- CLI bundling
  • eslint + prettier -- Code quality
  • typescript 5.3 -- Type checking

16.3 Build Configuration

4 separate TypeScript configs:

  • tsconfig.json -- Base config
  • tsconfig.main.json -- Main process
  • tsconfig.lint.json -- Renderer/web/shared
  • tsconfig.cli.json -- CLI

Vite configs:

  • vite.config.mts -- Desktop renderer
  • vite.config.web.mts -- Web/mobile interface

Build targets:

  • macOS: DMG + ZIP (x64 + arm64)
  • Windows: NSIS installer + Portable (x64)
  • Linux: AppImage + DEB + RPM (x64 + arm64)

16.4 Dependency Risk Assessment

  • node-pty: Native module, requires compilation. Cross-platform build is fragile (evidenced by extensive CI architecture verification steps).
  • better-sqlite3: Native module, same compilation concerns.
  • Electron 28: Not latest (Electron 35 is current as of 2026). Missing File System Access API support.
  • React 18: Stable but React 19 has been out for over a year.

Confidence: High -- From package.json and build configuration.


17. External Integrations and Provider Compatibility

17.1 Provider Architecture

Maestro's multi-provider architecture is implemented through:

  1. Agent Definitions (src/main/agents/definitions.ts) -- CLI binary, arguments, detection
  2. Agent Capabilities (src/main/agents/capabilities.ts) -- 20+ capability flags per agent
  3. Output Parsers (src/main/parsers/) -- Agent-specific JSON parsing
  4. Session Storage (src/main/storage/) -- Agent-specific session discovery
  5. Error Patterns (src/main/parsers/error-patterns.ts) -- Agent-specific error detection

17.2 Supported Providers

Provider Status Resume Read-Only JSON Images Sessions Cost Thinking
Claude Code Active --resume --permission-mode plan stream-json stdin JSON ~/.claude/ USD Yes
Codex Active resume <id> --sandbox read-only --json -i flag ~/.codex/ Tokens Yes
OpenCode Active --session --agent plan --format json -f flag ~/.local/ USD Yes
Factory Droid Active -s <id> Default mode stream-json -f flag ~/.factory/ Tokens Yes
Gemini CLI Planned TBD TBD TBD Yes TBD USD TBD
Qwen3 Coder Planned TBD TBD TBD TBD TBD N/A TBD
Aider Planned TBD TBD TBD TBD TBD TBD TBD

17.3 Agent Definition Architecture

File: /tmp/ai-harness-repos/Maestro/src/main/agents/definitions.ts (367 lines)

Each agent is defined via an AgentConfig interface containing static configuration. The definitions system uses a declarative argument-builder pattern rather than hardcoding CLI construction logic:

// definitions.ts line 71
export interface AgentConfig {
    id: string;
    name: string;
    binaryName: string;
    command: string;
    args: string[];                          // Base args always included
    batchModePrefix?: string[];              // Subcommand for batch mode (e.g., ['exec'] for Codex)
    batchModeArgs?: string[];                // Args only in batch mode
    jsonOutputArgs?: string[];               // Args for JSON output
    resumeArgs?: (id: string) => string[];   // Session resume builder
    readOnlyArgs?: string[];                 // Read-only/plan mode
    modelArgs?: (id: string) => string[];    // Model selection builder
    yoloModeArgs?: string[];                 // Full-access/unsafe mode
    workingDirArgs?: (dir: string) => string[];  // Working directory
    imageArgs?: (path: string) => string[];  // Image attachment
    promptArgs?: (prompt: string) => string[]; // Prompt argument builder
    noPromptSeparator?: boolean;             // Skip '--' before prompt
    defaultEnvVars?: Record<string, string>; // Default env vars
    configOptions?: AgentConfigOption[];     // UI-configurable settings
    capabilities: AgentCapabilities;         // Feature capability flags
}

Key design observations across the agent definitions:

  1. Claude Code always runs with --dangerously-skip-permissions (YOLO mode). This is a deliberate choice documented in the definitions: "Maestro requires it."

  2. Codex uses a subcommand pattern (codex exec) with its own set of batch-mode-only args (--dangerously-bypass-approvals-and-sandbox, --skip-git-repo-check). The --json flag must come before the resume subcommand in the argument ordering.

  3. OpenCode uses environment variable injection for YOLO mode rather than CLI flags:

// definitions.ts line 223
defaultEnvVars: {
    OPENCODE_CONFIG_CONTENT: '{"permission":{"*":"allow","external_directory":"allow","question":"deny"},"tools":{"question":false}}'
}

The question tool is disabled in two places (the "deny" permission and the tools flag) because it waits for stdin input, which hangs batch mode.

  4. Factory Droid runs with --skip-permissions-unsafe, and read-only is the DEFAULT mode for droid exec. It supports a reasoningEffort configuration option with values low, medium, high.

  5. Aider is listed as a placeholder definition with no configuration, signaling future support for this popular open-source AI coding tool.

  6. UI-configurable options use discriminated union types (checkbox, text, number, select) with type-safe argBuilder functions that map config values to CLI arguments at runtime.
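
The value of this declarative pattern is that command-line assembly becomes a pure function over the config. A minimal sketch, assuming an assembly order inferred from the observations above (the buildAgentArgs helper and AgentConfigSketch names are hypothetical, not Maestro's actual spawning code):

```typescript
// Hypothetical sketch of assembling CLI args from a declarative agent config.
// Field names mirror the AgentConfig interface above; the assembly order is an
// assumption for illustration.
interface AgentConfigSketch {
  command: string;
  args: string[];                        // base args always included
  batchModePrefix?: string[];            // e.g. ['exec'] for Codex
  batchModeArgs?: string[];
  jsonOutputArgs?: string[];
  resumeArgs?: (id: string) => string[];
  noPromptSeparator?: boolean;
}

function buildAgentArgs(
  cfg: AgentConfigSketch,
  opts: { sessionId?: string; prompt: string },
): string[] {
  const args: string[] = [
    ...(cfg.batchModePrefix ?? []),
    ...cfg.args,
    ...(cfg.batchModeArgs ?? []),
    ...(cfg.jsonOutputArgs ?? []), // --json must precede the resume subcommand
  ];
  if (opts.sessionId && cfg.resumeArgs) args.push(...cfg.resumeArgs(opts.sessionId));
  if (!cfg.noPromptSeparator) args.push('--'); // separate flags from the prompt
  args.push(opts.prompt);
  return args;
}

const codexLike: AgentConfigSketch = {
  command: 'codex',
  args: [],
  batchModePrefix: ['exec'],
  jsonOutputArgs: ['--json'],
  resumeArgs: (id) => ['resume', id],
  noPromptSeparator: true,
};
const argv = buildAgentArgs(codexLike, { prompt: 'fix the bug' });
// argv → ['exec', '--json', 'fix the bug']
```

Because construction is data-driven, adding a new provider mostly means filling in fields rather than writing new spawn logic.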

17.4 Adding New Providers

The process is well-documented in AGENT_SUPPORT.md (843 lines):

  1. Add agent definition to agent-detector.ts
  2. Define capabilities in agent-capabilities.ts
  3. Create output parser in parsers/
  4. Register parser in parsers/index.ts
  5. (Optional) Create session storage in storage/
  6. (Optional) Add error patterns

Each agent starts with all capabilities false and enables them as verified.
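
This deny-by-default posture can be sketched as an all-false defaults object that each agent spread-overrides as individual capabilities are verified (the flag names here are a representative subset of the real 20+, chosen for illustration):

```typescript
// Representative subset of capability flags (the real interface has 20+).
interface AgentCapabilitiesSketch {
  resume: boolean;
  readOnlyMode: boolean;
  jsonOutput: boolean;
  imageInput: boolean;
  costReporting: boolean;
}

// New agents start with everything disabled...
const NO_CAPABILITIES: AgentCapabilitiesSketch = {
  resume: false,
  readOnlyMode: false,
  jsonOutput: false,
  imageInput: false,
  costReporting: false,
};

// Placeholder agents (Gemini CLI, Qwen3 Coder, Aider) stay all-false.
const geminiCliPlaceholder = { ...NO_CAPABILITIES };

// ...and flags are flipped on one at a time as each is verified against the CLI.
const claudeCode: AgentCapabilitiesSketch = {
  ...NO_CAPABILITIES,
  resume: true,        // --resume verified
  readOnlyMode: true,  // --permission-mode plan verified
  jsonOutput: true,    // stream-json verified
  imageInput: true,
  costReporting: true,
};
```

The spread pattern makes it impossible for a new agent to silently inherit a capability it has not earned.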

17.5 MCP Server

Maestro provides a hosted MCP (Model Context Protocol) server at https://docs.runmaestro.ai/mcp with a SearchMaestro tool for documentation search. This allows external AI tools (Claude Desktop, Claude Code) to search Maestro's knowledge base.

17.6 Spec-Kit and OpenSpec Integration

Bundled spec-driven workflow systems:

  • Spec-Kit: GitHub's spec-kit prompts (src/prompts/speckit/)
  • OpenSpec: Fission-AI's OpenSpec prompts (src/prompts/openspec/)

Both are refreshed from upstream via scripts:

npm run refresh-speckit   # Fetch latest from github/spec-kit
npm run refresh-openspec  # Fetch latest from Fission-AI/OpenSpec

17.7 What's Missing

  • No direct API provider support (only CLI-based agents, not API-based)
  • No local model integration (except through OpenCode's Ollama support)
  • No MCP client (Maestro serves MCP, but doesn't consume MCP tools from external servers)
  • No plugin marketplace for third-party integrations

Confidence: High -- From AGENT_SUPPORT.md and capabilities source code.


18. Operational Assumptions and Constraints

18.1 Explicit Assumptions

  1. User has at least one AI agent installed and authenticated (Claude Code, Codex, OpenCode, or Factory Droid)
  2. User has Git installed (for git-aware features)
  3. Agents run in batch/headless mode -- Each task gets a prompt and returns a response (not interactive)
  4. Maestro is a pass-through -- Whatever MCP tools, skills, permissions the agent has configured works identically
  5. Each task gets a fresh session (for clean conversation context in Auto Run)
  6. Agents can execute commands with user privileges -- No sandboxing

18.2 Platform Constraints

From CLAUDE-PLATFORM.md:

  • Path separators differ between platforms
  • Shell detection differs (PowerShell on Windows, zsh/bash on Unix)
  • macOS Alt key produces special characters (must use e.code not e.key)
  • Windows has 8KB command line limit (use stdin for long prompts)
  • SSH remote execution doesn't support file watching
  • Git stat format differs between GNU and BSD

18.3 Performance Constraints

From CLAUDE-PERFORMANCE.md:

  • AI streaming triggers 100+ IPC updates/second (batched to ~6 renders/second via 150ms batching)
  • Agent persistence uses 2-second debounce
  • Git status polling uses 3-second intervals (paused when app is hidden)
  • Model list cache: 5-minute TTL
  • Symphony registry cache: 2-hour TTL
  • Issues cache: 5-minute TTL
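
The 150ms batching strategy can be sketched as a buffer that accumulates stream chunks and flushes at most once per window (~6-7 renderer updates/second) no matter how many IPC events arrive. A minimal sketch (the StreamBatcher shape is assumed; Maestro's real batcher lives in its IPC layer):

```typescript
// Interval-based event batching sketch: chunks accumulate and are flushed to
// the renderer at most once per `intervalMs`.
class StreamBatcher {
  private buffer: string[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;
  public flushCount = 0;

  constructor(
    private onFlush: (chunks: string[]) => void,
    private intervalMs = 150,
  ) {}

  push(chunk: string): void {
    this.buffer.push(chunk);
    if (this.timer === null) {
      // First chunk in a window schedules exactly one flush for that window.
      this.timer = setTimeout(() => this.flush(), this.intervalMs);
    }
  }

  flush(): void {
    if (this.timer !== null) { clearTimeout(this.timer); this.timer = null; }
    if (this.buffer.length === 0) return;
    this.onFlush(this.buffer); // one render-triggering IPC send
    this.buffer = [];
    this.flushCount++;
  }
}

// 100 rapid streaming events collapse into a single flush.
const rendered: string[][] = [];
const batcher = new StreamBatcher((chunks) => rendered.push(chunks));
for (let i = 0; i < 100; i++) batcher.push(`token-${i}`);
batcher.flush(); // close the pending window for the demo
```

The same shape (schedule-once, flush-latest) underlies the 2-second persistence debounce and the polling pauses listed above.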

18.4 Operational Constraints

  • Node.js >= 22.0.0 required (newer than typical Node LTS)
  • Native module compilation (node-pty, better-sqlite3) requires build tools
  • Electron app size (likely 100MB+ installed)
  • Single user per instance (no multi-user support)
  • GitHub CLI required for Symphony (for PR creation)
  • Cloudflare CLI required for remote tunnels (for remote access)

Confidence: High -- From platform documentation and package.json.


19. Failure Modes and Issues Observed

19.1 Documented Failure Patterns

From CLAUDE.md (lines 289-295):

Historical patterns that wasted time:
- Tab naming bug: Modal coordination was "fixed" when the actual issue was
  an unregistered IPC handler
- Tooltip clipping: Attempted overflow:visible on element when parent
  container had overflow:hidden
- Session validation: Fixed renderer calls when handler wasn't wired in main process

19.2 Cross-Architecture Build Issues

The CI workflow (release.yml) has extensive architecture verification steps (5+ verification steps per platform), indicating past issues with:

  • Cross-architecture binary contamination (ARM64 prebuilds contaminating x64 builds)
  • Incorrect native module compilation
  • Cache key collisions between architectures

19.3 Electron Limitations

  • File System Access API not fully supported in Electron 28 (Chrome DevTools "Save profile" fails)
  • WSL environment requires GPU acceleration to be auto-disabled (EGL/GPU process crash issues)

19.4 Agent-Specific Issues

  • Claude Code may not immediately exit on SIGINT (requires SIGTERM escalation after 2 seconds)
  • OpenCode session storage is marked as "stub ready" (not fully implemented)
  • Gemini CLI and Qwen3 Coder are "PLACEHOLDER" (capabilities unknown)
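
The SIGINT-then-SIGTERM escalation is a standard pattern for CLIs that trap Ctrl-C. A sketch of the decision logic, testable against a minimal interface rather than a real child process (the helper names are assumptions; the 2-second grace window comes from the report and would be a setTimeout in the real app):

```typescript
// Minimal interface standing in for a Node child process, so the escalation
// logic is testable without actually spawning an agent.
interface Killable {
  exitCode: number | null; // null while still running
  kill(signal: string): void;
}

const sent: string[] = [];
const hungAgent: Killable = {
  exitCode: null, // never exits: simulates Claude Code ignoring SIGINT
  kill: (sig) => sent.push(sig),
};

// Step 1: polite interrupt.
function requestStop(proc: Killable): void {
  proc.kill('SIGINT');
}

// Step 2: invoked after the grace period (2 seconds in the real app);
// escalates only if the process still has not exited.
function escalateIfStillRunning(proc: Killable): boolean {
  if (proc.exitCode === null) {
    proc.kill('SIGTERM');
    return true;
  }
  return false;
}

requestStop(hungAgent);
const escalated = escalateIfStillRunning(hungAgent);
// sent → ['SIGINT', 'SIGTERM']
```

A well-behaved agent exits during the grace window and never sees SIGTERM; a hung one gets forced.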

19.5 Symphony Risks

From SYMPHONY_ISSUES.md:

  • Issues can be "claimed" by creating a draft PR, but there's no lock mechanism (race condition possible)
  • External document URLs restricted to GitHub domains only (prevents arbitrary URL injection)
  • Path traversal attacks prevented via validation

19.6 Database Resilience (Positive Finding)

Counter to initial expectations, the stats database has a robust resilience strategy (see Section 15.3 for full details):

  • Daily automatic backups with 7-day rotation
  • Corruption detection via PRAGMA integrity_check on every startup
  • Automated recovery: tries each backup in order, falls back to fresh database
  • Stale WAL/SHM file cleanup to prevent false corruption detection
  • WAL checkpoint before backup to ensure self-contained copies

This is one of the more carefully engineered subsystems in Maestro.
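
The recovery strategy reduces to an ordered fallback: validate the live file, else try each backup newest-first, else start fresh. A runtime-agnostic sketch of that control flow (checkIntegrity stands in for PRAGMA integrity_check and the function names are hypothetical; the real code uses better-sqlite3):

```typescript
// Ordered-fallback recovery sketch. Integrity checking and restoration are
// injected so the decision logic is visible and testable without SQLite.
type DbSource = { name: string; healthy: boolean };

function recoverDatabase(
  live: DbSource,
  backups: DbSource[], // newest first, from the 7-day rotation
  checkIntegrity: (db: DbSource) => boolean,
): string {
  if (checkIntegrity(live)) return live.name;        // normal startup path
  for (const backup of backups) {
    if (checkIntegrity(backup)) return backup.name;  // restore this backup
  }
  return 'fresh-database'; // last resort: start empty rather than crash
}

const integrity = (db: DbSource) => db.healthy; // stand-in for PRAGMA integrity_check
const chosen = recoverDatabase(
  { name: 'stats.db', healthy: false },       // corrupted live file
  [
    { name: 'backup-day1.db', healthy: false }, // also corrupted
    { name: 'backup-day2.db', healthy: true },  // first good backup wins
  ],
  integrity,
);
```

The WAL checkpoint-before-backup step described above matters here: without it, a restored backup could itself fail the integrity check because its WAL file was left behind.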

19.7 Potential Failure Modes Not Documented

  1. Agent hang without exit -- No watchdog to kill agents that hang indefinitely. The 5-minute timeout exists only for grooming sessions, not for regular agent queries.

  2. Session settings data loss -- The electron-store persistence uses a 2-second debounce (useDebouncedPersistence). A crash during this window loses up to 2 seconds of state changes. The beforeunload flush mitigates but does not eliminate this risk.

  3. Memory leaks from event listeners -- The ProcessManager extends EventEmitter and many modules attach listeners (Group Chat moderator, grooming sessions, IPC handlers). Each grooming session attaches data, exit, and agent-error listeners with cleanup functions, but complex error paths could leave orphaned listeners.

  4. Concurrent worktree conflicts -- Multiple worktrees from different agents could modify overlapping files. Git will handle merge conflicts at the branch level, but runtime file locking is not implemented.

  5. Group Chat state is entirely in-memory -- Active moderator sessions, participant sessions, and activity timestamps are all stored in JavaScript Map objects. A crash or unexpected restart loses all group chat state. The only persisted data is the chat log (JSONL on disk) and metadata (chat.json).

  6. SSH spawn wrapper command length -- On Windows, commands passed to SSH have an 8KB limit. Long prompts are sent via stdin to avoid this, but the fallback behavior if stdin writing fails is unclear.

  7. Race condition in Symphony issue claiming -- From SYMPHONY_ISSUES.md: Issues can be "claimed" by creating a draft PR, but there is no server-side lock mechanism. Two users could simultaneously claim the same issue.
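
The debounce data-loss window in item 2 and its beforeunload mitigation can be sketched as a debouncer that exposes an explicit flush() for the unload handler to call (the class name and shape are assumptions mirroring the useDebouncedPersistence pattern described above):

```typescript
// Debounced persistence sketch: saves coalesce inside the delay window, and an
// explicit flush() (called from beforeunload) shrinks -- but cannot fully
// eliminate -- the crash-loss window.
class DebouncedPersister<T> {
  private pending: T | null = null;
  private timer: ReturnType<typeof setTimeout> | null = null;
  public writes: T[] = []; // stands in for electron-store writes

  constructor(private delayMs = 2000) {}

  save(state: T): void {
    this.pending = state;
    if (this.timer !== null) clearTimeout(this.timer);
    // Crash before this fires => up to delayMs of state changes lost.
    this.timer = setTimeout(() => this.flush(), this.delayMs);
  }

  flush(): void {
    if (this.timer !== null) { clearTimeout(this.timer); this.timer = null; }
    if (this.pending !== null) {
      this.writes.push(this.pending); // only the latest state is written
      this.pending = null;
    }
  }
}

const persister = new DebouncedPersister<string>();
persister.save('state-v1');
persister.save('state-v2'); // supersedes v1 inside the debounce window
persister.flush();          // beforeunload path: one write, latest state
```

Note that flush() only helps on orderly shutdown; a hard crash between save() and the timer firing still loses the pending state, which is the residual risk item 2 identifies.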

Confidence: Medium -- Some failure modes are inferred from architectural analysis rather than documented or observed.


20. Governance and Guardrails

20.1 Code Governance

  • Pre-commit hooks (Husky + lint-staged): Prettier + ESLint on staged files
  • TypeScript strict mode: Across all 3 build configs
  • Automated PR review: CodeRabbit + Greptile
  • Conventional commits: feat:, fix:, docs:, refactor:, test:, chore:
  • CONTRIBUTING.md: 1122 lines of detailed contribution guidelines
  • PR checklist: Linting, tests, manual testing, no console errors, theme testing

20.2 Agent Guardrails

  • Read-only mode: Per-tab toggle to prevent file modifications
  • Error modals: Block input when agent errors occur
  • Pause/stop: For batch processing
  • YOLO mode documentation: Explicit documentation that Codex runs with --dangerously-bypass-approvals-and-sandbox by default

20.3 Security Guardrails

  • execFileNoThrow: Mandatory for all external commands (no shell injection)
  • Input validation: URL validation, path traversal prevention, repo slug validation
  • Context isolation: Electron security best practices
  • SECURITY.md: Formal vulnerability reporting process
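
The shell-injection guarantee comes from passing an argument array instead of a shell string. A sketch of an execFileNoThrow-style wrapper (the name comes from the report; this synchronous shape and the result type are assumptions for illustration):

```typescript
import { execFileSync } from 'node:child_process';

// Sketch of a never-throwing execFile wrapper. With no `shell: true`, args are
// passed verbatim to the binary, so metacharacters like `; rm -rf /` stay
// literal strings instead of being interpreted by a shell.
interface ExecResult {
  ok: boolean;
  stdout: string;
  error?: string;
}

function execFileNoThrow(file: string, args: string[]): ExecResult {
  try {
    const stdout = execFileSync(file, args, { encoding: 'utf8' });
    return { ok: true, stdout };
  } catch (err) {
    // Errors become data, not exceptions: callers must inspect `ok`.
    return { ok: false, stdout: '', error: (err as Error).message };
  }
}

// Demo: run the current Node binary with an argument array.
const res = execFileNoThrow(process.execPath, ['-e', 'console.log("hi")']);
```

The real helper is presumably async; the key properties are the same either way: no shell, and no thrown exceptions on the happy path of callers.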

20.4 What's Missing

  • No cost guardrails (no spending limits, no alerts)
  • No mandatory code review for Auto Run output
  • No agent permission system (agents run with full user privileges)
  • No content moderation for Group Chat
  • No rate limiting on agent spawning (could spawn unlimited processes)
  • No resource quotas per agent (CPU, memory, disk)

Confidence: High -- From governance documentation and source code.


21. Roadmap/Evolution Signals, Missing Areas, Unresolved Issues

21.1 Active Development Signals

  • Version 0.15.0 -- Still pre-1.0, rapid iteration expected
  • CONTRIBUTING.md note: "The project is currently changing rapidly, there's a high likelihood that PRs will be out of sync"
  • Encore Features system: Precursor to a full plugin marketplace
  • Symphony: Community contribution platform (recently added)
  • Director's Notes: First Encore Feature (AI-generated synopsis of work)

21.2 Planned Agents

From /tmp/ai-harness-repos/Maestro/src/main/agents/definitions.ts:

  • Gemini CLI (id: 'gemini-cli', binaryName: 'gemini') -- Minimal definition, no batch mode args, no output parser, placeholder capabilities (all false)
  • Qwen3 Coder (id: 'qwen3-coder', binaryName: 'qwen3-coder') -- Minimal definition, same status
  • Aider (id: 'aider', binaryName: 'aider') -- Recently added placeholder definition with zero configuration, signaling potential future support for this popular open-source coding tool

To add any of these as fully supported agents requires implementing:

  1. Output parser class (extends AgentOutputParser interface)
  2. Error pattern definitions (regex patterns for each error type)
  3. Capability flags (currently all set to false for placeholders)
  4. Session storage module (for session discovery/resume)
  5. Batch mode argument construction (batchModePrefix, jsonOutputArgs, etc.)

21.3 Encore Features System (Plugin Precursor)

The Encore Features system is a feature-gating mechanism that serves as a precursor to a full plugin marketplace. From CONTRIBUTING.md:

// Encore feature definition pattern
interface EncoreFeature {
    id: string;
    name: string;
    description: string;
    enabled: boolean;  // Defaults to false
}

When a feature is disabled:

  • Its UI components are not rendered
  • Its keyboard shortcuts are not registered
  • Its menu items are not visible
  • Its IPC handlers may still be registered but are unreachable from the UI

The first Encore Feature is Director's Notes: an AI-generated synopsis of the work performed in a session, using the director-notes.md prompt. This demonstrates the pattern for future plugin-like features.
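
The gating mechanism itself is simple: every surface (component rendering, shortcut registration, menu construction) consults one isEnabled check, so a single flag flips all of them together. A sketch, assuming a registry shape (the EncoreRegistry name is hypothetical; the EncoreFeature fields match the definition above):

```typescript
// Hypothetical registry sketch for Encore-style feature gating.
interface EncoreFeature {
  id: string;
  name: string;
  description: string;
  enabled: boolean; // defaults to false
}

class EncoreRegistry {
  private features = new Map<string, EncoreFeature>();

  register(feature: EncoreFeature): void {
    this.features.set(feature.id, feature);
  }

  // UI render, shortcut registration, and menu building all call this.
  isEnabled(id: string): boolean {
    return this.features.get(id)?.enabled ?? false; // unknown features are off
  }

  setEnabled(id: string, enabled: boolean): void {
    const f = this.features.get(id);
    if (f) f.enabled = enabled;
  }
}

const registry = new EncoreRegistry();
registry.register({
  id: 'directors-notes',
  name: "Director's Notes",
  description: 'AI-generated synopsis of session work',
  enabled: false, // off by default, like all Encore Features
});

const beforeOptIn = registry.isEnabled('directors-notes'); // false
registry.setEnabled('directors-notes', true);
const afterOptIn = registry.isEnabled('directors-notes');  // true
```

Because IPC handlers are registered regardless of the flag (as noted above), the gate is a UI-reachability gate rather than a capability gate, which is one reason it falls short of true plugin isolation.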

What's missing for a true plugin system:

  • No plugin loading mechanism (all features must be compiled into the app)
  • No plugin lifecycle hooks (install, enable, disable, uninstall)
  • No plugin manifest format or registry
  • No plugin isolation (all features share the same process)
  • No third-party plugin support

21.4 Identified Gaps

  1. No automatic planning -- Maestro doesn't generate plans from high-level specs. Users must manually create Auto Run documents.
  2. No dependency-aware task ordering -- Tasks within documents are sequential checkboxes. No DAG execution.
  3. No inter-agent communication (except Group Chat) -- Agent A can't directly feed output to Agent B.
  4. No automated testing integration -- No built-in test runner, no CI integration.
  5. No rollback mechanism -- No way to automatically revert bad agent changes.
  6. No context-aware agent selection -- User must choose which agent to use for each task.
  7. No cost optimization -- No model selection based on task complexity.
  8. No persistent knowledge base -- No vector store, no RAG integration.
  9. No collaborative editing -- Single user per instance.
  10. No API/webhook integration -- Only CLI and desktop app.

21.5 Unresolved Architecture Decisions

  • OpenCode session storage: Marked as "stub, needs implementation" (AGENT_SUPPORT.md line 666)
  • Electron version: Still on 28, significantly behind current (35+)
  • React version: Still on 18, behind current (19+)
  • Plugin system: Encore Features is a stepping stone but no full plugin API exists

Confidence: Medium-High -- Based on TODO markers in code and documentation gaps.


22. Current Gaps That Other Projects Might Fill

22.1 From superpowers (Hypothetical Learnings)

Areas where Maestro could benefit from superpowers-style approaches:

  • Enhanced CLAUDE.md management -- Maestro has an excellent CLAUDE.md ecosystem but could benefit from automated generation/maintenance
  • MCP tool composition -- Maestro doesn't currently consume MCP tools; superpowers' MCP patterns could inform this
  • Shell integration patterns -- Maestro wraps agents as child processes; shell-level hooks could enhance this

22.2 From everything-claude-code (Hypothetical Learnings)

Areas where Maestro could benefit:

  • Curated prompt libraries -- Maestro bundles Spec-Kit and OpenSpec but could have a more extensive prompt ecosystem
  • Configuration presets -- everything-claude-code's settings optimization could inform Maestro's defaults
  • CLAUDE.md templates -- Project-type-specific templates for Auto Run documents

22.3 From agent-orchestrator (Hypothetical Learnings)

Areas where Maestro could benefit:

  • Automated plan generation -- Maestro requires manual Auto Run document creation; agent-orchestrator's planning phase could automate this
  • Dependency-aware execution -- DAG-based task ordering instead of sequential checkboxes
  • Automated verification -- Post-task validation (test running, lint checking)
  • Cost-aware agent selection -- Choosing the right model for each task
  • Result synthesis -- Automated merging of multi-agent outputs (beyond Group Chat)
  • Subagent spawning -- Dynamic creation of specialized agents for subtasks
  • Context management automation -- Automatic compaction, chunking, and transfer

22.4 Specific Improvement Opportunities

  1. Plan Generation Layer: Add a planning step before Auto Run that decomposes high-level specs into Auto Run documents automatically.

  2. Automated Quality Gates: After each Auto Run task, run tests/lints before proceeding to the next task.

  3. Cost Budgets: Set per-playbook or per-agent spending limits with alerts and automatic pause.

  4. Context Intelligence: Automatic context compaction when approaching window limits, cross-session context inheritance.

  5. Agent Pipeline: Allow chaining agents where output of one feeds into another (beyond Group Chat's conversational model).

  6. Plugin API: Formalize the Encore Features system into a full plugin API with lifecycle hooks.

  7. REST API: Add a web API for external automation (CI/CD integration, custom dashboards).

  8. Verification Framework: Built-in test runner integration, lint checking, and code review gates.

  9. Rollback System: Automatic git checkpoint before each task, with easy rollback on failure.

  10. Smart Agent Selection: Based on task type, automatically select the most cost-effective provider/model.

Confidence: Medium -- These are synthesis recommendations based on gap analysis; specific applicability depends on the other projects' actual implementations.


23. Cross-Links

Related Analysis Documents

  • superpowers-deep-analysis.md

    • Section 2 (Design Philosophy) -- Compare with Maestro's Constitution
    • Section 8 (Context Handling) -- Compare with Maestro's context merge/groom
    • Section 13 (CLI/UX) -- Compare with Maestro's keyboard-first approach
    • Section 17 (External Integrations) -- Compare MCP patterns
  • everything-claude-code-deep-analysis.md

    • Section 2 (Design Philosophy) -- Compare curated vs. orchestrator approaches
    • Section 4 (Harness Workflow) -- Compare prompt library vs. Auto Run documents
    • Section 9 (Session Lifecycle) -- Compare configuration management
    • Section 16 (Tooling) -- Compare dependency surfaces
  • agent-orchestrator-deep-analysis.md

    • Section 4 (Harness Workflow) -- Compare plan generation approaches
    • Section 5 (Subagent Orchestration) -- Compare with Maestro's Group Chat
    • Section 6 (Parallelization) -- Compare concurrency models
    • Section 7 (Isolation) -- Compare worktree vs. other isolation approaches
    • Section 8 (Human-in-the-Loop) -- Compare approval gate designs
    • Section 9 (Context Handling) -- Compare context management strategies
  • harness-consensus-report.md

    • Maestro contributes the most mature implementation for:
      • Multi-provider support
      • Desktop UX / keyboard-first design
      • Auto Run / Playbook task execution
      • Group Chat / multi-agent coordination
      • Mobile remote control
      • Analytics and cost tracking
    • Maestro's gaps that other projects fill:
      • Automated plan generation
      • Dependency-aware task execution
      • Automated verification/quality gates
      • Cost governance with budgets
  • final-harness-gap-report.md

    • Priority improvement areas for Maestro:
      1. Plan generation automation
      2. Quality gates in Auto Run
      3. Cost budgets and governance
      4. Context management automation
      5. Plugin API formalization
      6. REST API for external integration

Appendix A: File Index (Key Files Referenced)

File Purpose Lines Read
/tmp/ai-harness-repos/Maestro/README.md Project overview Full (181)
/tmp/ai-harness-repos/Maestro/ARCHITECTURE.md Technical architecture Full (1673)
/tmp/ai-harness-repos/Maestro/CLAUDE.md Development guide Full (331)
/tmp/ai-harness-repos/Maestro/CONSTITUTION.md Design philosophy Full (178)
/tmp/ai-harness-repos/Maestro/CONTRIBUTING.md Development setup Full (1122)
/tmp/ai-harness-repos/Maestro/SECURITY.md Security policy Full (95)
/tmp/ai-harness-repos/Maestro/AGENT_SUPPORT.md Provider integration Full (843)
/tmp/ai-harness-repos/Maestro/CLAUDE-PATTERNS.md Implementation patterns Full (349)
/tmp/ai-harness-repos/Maestro/CLAUDE-SESSION.md Session data model Full (134)
/tmp/ai-harness-repos/Maestro/CLAUDE-PERFORMANCE.md Performance guidelines Full (268)
/tmp/ai-harness-repos/Maestro/CLAUDE-AGENTS.md Agent support Full (73)
/tmp/ai-harness-repos/Maestro/CLAUDE-FEATURES.md Dashboard/Graph features Full (176)
/tmp/ai-harness-repos/Maestro/CLAUDE-PLATFORM.md Cross-platform concerns Full (222)
/tmp/ai-harness-repos/Maestro/SYMPHONY_REGISTRY.md Symphony registry Full (159)
/tmp/ai-harness-repos/Maestro/SYMPHONY_ISSUES.md Symphony issues Full (196)
/tmp/ai-harness-repos/Maestro/THEMES.md Theme system Referenced
/tmp/ai-harness-repos/Maestro/package.json Dependencies/scripts Full (318)
/tmp/ai-harness-repos/Maestro/src/main/index.ts Main entry point 200 lines
/tmp/ai-harness-repos/Maestro/src/main/process-manager/ProcessManager.ts Process management 200 lines
/tmp/ai-harness-repos/Maestro/src/main/process-manager/types.ts Process types Full (142)
/tmp/ai-harness-repos/Maestro/src/main/agents/capabilities.ts Agent capabilities Full (334)
/tmp/ai-harness-repos/Maestro/src/main/agents/detector.ts Agent detection 150 lines
/tmp/ai-harness-repos/Maestro/src/main/group-chat/group-chat-router.ts Group chat routing 200 lines
/tmp/ai-harness-repos/Maestro/src/main/ipc/handlers/symphony.ts Symphony handlers 200 lines
/tmp/ai-harness-repos/Maestro/src/main/ipc/handlers/context.ts Context merge Full (508)
/tmp/ai-harness-repos/Maestro/src/main/parsers/index.ts Parser registry Full (103)
/tmp/ai-harness-repos/Maestro/src/main/parsers/claude-output-parser.ts Claude parser Full (505)
/tmp/ai-harness-repos/Maestro/src/main/parsers/error-patterns.ts Error detection Full (1015)
/tmp/ai-harness-repos/Maestro/src/main/stats/schema.ts Database schema Full (142)
/tmp/ai-harness-repos/Maestro/src/main/stats/stats-db.ts Stats DB core Full (833)
/tmp/ai-harness-repos/Maestro/src/main/group-chat/group-chat-moderator.ts Moderator mgmt Full (290)
/tmp/ai-harness-repos/Maestro/src/main/group-chat/group-chat-agent.ts Participant mgmt Full (429)
/tmp/ai-harness-repos/Maestro/src/main/agents/definitions.ts Agent definitions Full (367)
/tmp/ai-harness-repos/Maestro/src/cli/index.ts CLI entry point Full (113)
/tmp/ai-harness-repos/Maestro/src/cli/services/batch-processor.ts Batch execution 150 lines
/tmp/ai-harness-repos/Maestro/src/cli/services/agent-spawner.ts Agent spawning 150 lines
/tmp/ai-harness-repos/Maestro/.github/workflows/release.yml CI/CD Full (782)
/tmp/ai-harness-repos/Maestro/docs/symphony.md Symphony docs 100 lines

Appendix B: Codebase Statistics

Metric Value
Total TypeScript lines ~672,000
Source files (.ts/.tsx) ~1,200
Test files ~490
Main process handler modules 30
IPC namespaces in preload 17+
Custom React hooks 15+
Themes 16
Modal priority levels 30+
Keyboard shortcuts 30+
Supported AI agents 4 active, 3 planned (Gemini CLI, Qwen3 Coder, Aider)
Agent capability flags 20 per agent
Agent config option types 4 (checkbox, text, number, select)
Output parser implementations 4 (Claude, Codex, OpenCode, Factory Droid)
Error pattern definitions 5 sets (4 agents + SSH), ~100 individual patterns
Error types detected 7 (auth, token, rate, network, permission, crash, session)
Stats database tables 6 (query_events, auto_run_sessions, auto_run_tasks, session_lifecycle, _migrations, _meta)
System prompts (markdown) 24
Group Chat prompt templates 4 (moderator system, moderator synthesis, participant, participant request)
Documentation pages 25+
Source files read for this analysis 35+

Appendix C: Confidence Summary

Section Confidence Basis
Design Philosophy High Direct from CONSTITUTION.md
Core Architecture High Source code + ARCHITECTURE.md
Output Parser Architecture High Full source code read (claude-output-parser.ts, index.ts)
Error Pattern System High Full source code read (error-patterns.ts, 1015 lines)
Harness Workflow High Source code + documentation
Orchestration Model High Source code + architecture docs
Group Chat Implementation High Full source (moderator.ts, agent.ts, router.ts)
Parallelization High Source code analysis
Isolation Model High Architecture docs + SECURITY.md
Human-in-the-Loop High Source code + UI analysis
Context Handling High IPC handlers + prompts
Session Lifecycle High Session model + persistence code
Code Quality Gates High CI config + test infrastructure
Security High SECURITY.md + source code
Automation Surface High CLI source + IPC analysis
CLI/UX High Documentation + source code
Cost Visibility High Stats system + dashboard code
Stats Database Architecture High Full source code read (stats-db.ts, schema.ts)
Dependencies High package.json + build config
Provider Compatibility High AGENT_SUPPORT.md + capabilities
Agent Definitions High Full source code read (definitions.ts, 367 lines)
Operational Assumptions High Platform docs + configuration
Failure Modes Medium Mix of documented + inferred
Database Resilience High Full source code read (stats-db.ts recovery paths)
Governance High Contributing docs + hooks
Roadmap Signals Medium-High TODO markers + placeholder code
Prompt System High Full directory listing + template analysis
Encore Features High Documentation + implementation patterns
Gap Analysis Medium Synthesis recommendations

End of analysis. Total source files read for this report: 35+. Total lines of source code analyzed: ~5,000+.

Superpowers (obra/superpowers) -- Deep Technical Analysis

Repository: https://github.com/obra/superpowers
Version Analyzed: v4.3.1 (2026-02-21)
Author: Jesse Vincent (obra)
License: MIT
Analysis Date: 2026-02-22


Table of Contents

  1. Executive Summary
  2. Design Philosophy and Abstractions
  3. Core Architecture Model
  4. Harness Workflow: Spec to Plan to Execute to Verify to Merge
  5. Subagent/Task Orchestration Model
  6. Multi-Agent / Parallelization Strategy
  7. Isolation Model
  8. Human-in-the-Loop Controls
  9. Context Handling Strategy
  10. Session Lifecycle and Persistence
  11. Code Quality Gates
  12. Security and Compliance Mechanisms
  13. Hooks, Automation Surface, and Fail-Safe Behavior
  14. CLI/UX and Automation Ergonomics
  15. Cost/Usage Visibility and Governance
  16. Tooling and Dependency Surface
  17. External Integrations and Provider Compatibility
  18. Operational Assumptions and Constraints
  19. Failure Modes and Issues Observed
  20. Governance and Guardrails
  21. Roadmap/Evolution Signals, Missing Areas, Unresolved Issues
  22. What Should Be Borrowed/Adapted into Maestro and What Should Not
  23. Cross-Links

1. Executive Summary

Superpowers is a skills-based prompt engineering framework that transforms how AI coding agents (primarily Claude Code, but also Cursor, Codex, and OpenCode) approach software development. It is NOT a traditional harness with executable orchestration code -- instead, it is a collection of markdown skill documents and a thin bootstrap mechanism that injects behavioral instructions into AI agent sessions at startup.

The core innovation is treating agent behavior documentation as code: skills are TDD-tested against agent behavior, iteratively hardened against rationalization, and composed into a complete development workflow. The framework enforces a mandatory pipeline: brainstorm -> design -> plan -> execute (via subagents) -> review -> finish, with multiple quality gates at each stage.

Key differentiators from other harnesses:

  • No runtime orchestrator code -- the AI agent itself IS the orchestrator, guided by skill documents
  • Anti-rationalization engineering -- extensive work on preventing agents from bypassing prescribed workflows
  • Two-stage code review -- spec compliance review THEN code quality review, both as review loops
  • TDD for documentation -- skills themselves are developed using red-green-refactor against agent behavior
  • Multi-platform -- Claude Code, Cursor, Codex, OpenCode all supported with platform-specific adapters

Confidence: High -- all conclusions drawn from reading every file in the repository.

Bottom line: Superpowers is the most methodologically rigorous prompt engineering framework in the AI agent ecosystem. Its anti-rationalization engineering, pressure testing methodology, and TDD-for-docs approach represent genuine innovations that should be adopted. However, its advisory-only enforcement model, zero-persistence design, and agent-as-orchestrator architecture are fundamental limitations that a production harness like Maestro should solve with runtime code rather than prompts.

The framework's greatest strength -- that it requires zero executable code and works purely through markdown -- is simultaneously its greatest weakness: there is no enforcement mechanism beyond the agent's willingness to follow instructions.


2. Design Philosophy and Abstractions

2.1 Core Mental Model

Superpowers embodies the philosophy that AI coding agents are like enthusiastic junior engineers with poor taste, no judgment, no project context, and an aversion to testing (direct quote from /tmp/ai-harness-repos/superpowers/skills/writing-plans/SKILL.md, line 10). The entire framework is designed to impose discipline on this archetype.

The mental model has several layers:

  1. Skills as process documentation -- Not tutorials, not narratives, but prescriptive reference guides that agents load and follow. Skills are "rigid" (TDD, debugging -- follow exactly) or "flexible" (patterns -- adapt principles to context). See /tmp/ai-harness-repos/superpowers/skills/using-superpowers/SKILL.md, lines 87-91.

  2. Agent as self-governing orchestrator -- Rather than having external code dispatch tasks, Superpowers trusts the AI agent to read skill instructions and orchestrate itself. The using-superpowers skill is the meta-skill that enforces this discipline.

  3. Anti-rationalization as first-class concern -- The framework acknowledges that LLMs will rationalize around constraints. Significant engineering effort goes into closing these loopholes through explicit negation tables, red flag lists, and "gate functions." See /tmp/ai-harness-repos/superpowers/skills/writing-skills/persuasion-principles.md for the theoretical foundation (Cialdini 2021, Meincke et al. 2025).

  4. Composable skills over monolithic instructions -- Each skill is a standalone document that can be loaded on demand. The using-superpowers skill establishes the protocol for when and how to load other skills.

2.2 Philosophical Principles

From /tmp/ai-harness-repos/superpowers/README.md, lines 122-128:

  • Test-Driven Development -- enforced via the test-driven-development skill with the "Iron Law": NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST
  • Systematic over ad-hoc -- the systematic-debugging skill requires a 4-phase root cause investigation before any fix
  • Complexity reduction -- YAGNI is enforced at every level: brainstorming, planning, implementation, review
  • Evidence over claims -- the verification-before-completion skill requires running verification commands before any completion claim

2.3 The "Description Trap" Discovery

Confidence: High -- This is a proven finding documented across multiple versions.

A critical discovery documented in /tmp/ai-harness-repos/superpowers/RELEASE-NOTES.md v4.0.0 (lines 273-278): When a skill's YAML description field contains workflow summaries, Claude follows the short description instead of reading the full skill content. For example, a description saying "code review between tasks" caused Claude to do ONE review, even though the skill's flowchart showed TWO reviews (spec compliance then code quality).

Fix: Descriptions must be trigger-only ("Use when X") with no process details. This is now enforced in the writing-skills skill's CSO (Claude Search Optimization) section.

Implication for Maestro: Any system that uses skill/prompt descriptions for routing must be aware that the description itself can override the detailed instructions. Descriptions should only contain triggering conditions.

2.4 DOT Flowcharts as Executable Specifications

Starting in v4.0.0, Superpowers uses GraphViz DOT flowcharts embedded in markdown as the authoritative process definition, with prose as supporting content. This was a deliberate choice -- flowcharts are harder for agents to skip or misinterpret than prose paragraphs.

Evidence: Every major skill (using-superpowers, brainstorming, subagent-driven-development, test-driven-development, dispatching-parallel-agents, systematic-debugging, using-git-worktrees) contains embedded DOT flowcharts.

Tool support: /tmp/ai-harness-repos/superpowers/skills/writing-skills/render-graphs.js renders these flowcharts to SVG for human review.
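A hedged sketch of the extraction half of that pipeline -- pulling fenced DOT blocks out of a SKILL.md before handing them to GraphViz. The actual render-graphs.js is not reproduced here, and the "dot" fence tag convention is an assumption:

```javascript
// Hypothetical sketch: extract fenced DOT blocks from a skill's markdown.
// The real render-graphs.js may differ; the "dot" fence tag is an assumption.
const TICKS = '\x60\x60\x60'; // three backticks, escaped to keep this example self-contained

function extractDotBlocks(markdown) {
  const fence = new RegExp(TICKS + 'dot\\n([\\s\\S]*?)' + TICKS, 'g');
  const blocks = [];
  let match;
  while ((match = fence.exec(markdown)) !== null) {
    blocks.push(match[1].trim());
  }
  return blocks;
}

const skill = [
  '# Example skill',
  TICKS + 'dot',
  'digraph flow { start -> done }',
  TICKS,
].join('\n');

console.log(extractDotBlocks(skill)); // → [ 'digraph flow { start -> done }' ]
```

Each extracted digraph would then be piped to the GraphViz dot binary to produce the SVG that humans review.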

2.5 Foundational Maxims

Several maxims recur across multiple skills and serve as the philosophical bedrock:

  1. "Violating the letter of the rules is violating the spirit of the rules." -- Appears in both test-driven-development (line 14) and verification-before-completion (line 14). This is the anti-loophole principle: agents cannot reinterpret rules to mean something more convenient.

  2. "If you didn't watch it fail, you don't know if it tests the right thing." -- From test-driven-development (line 12). Applied to both code tests AND skill documentation.

  3. "Claiming work is complete without verification is dishonesty, not efficiency." -- From verification-before-completion (line 10). Reframes skipping verification as a moral failure, not a time-saving optimization.

  4. "Honesty is a core value. If you lie, you'll be replaced." -- From verification-before-completion (line 115). Uses existential threat (replacement) as a persuasion mechanism.

  5. "Code review requires technical evaluation, not emotional performance." -- From receiving-code-review (line 10). Directly targets the sycophancy failure mode.

These maxims are not incidental -- they represent deliberate application of the persuasion principles documented in persuasion-principles.md. Each uses one or more of Cialdini's principles (Authority, Commitment/Consistency, Scarcity) to reinforce compliance.


3. Core Architecture Model

3.1 Repository Structure

superpowers/
  .claude-plugin/          # Claude Code plugin manifest
    plugin.json            # Name, version, author, paths to skills/agents/commands/hooks
    marketplace.json       # Dev marketplace config for testing
  .cursor-plugin/          # Cursor plugin manifest
    plugin.json            # Cursor-specific manifest with skills/agents/commands/hooks paths
  .opencode/               # OpenCode plugin
    plugins/superpowers.js # JavaScript plugin that injects bootstrap via system prompt transform
    INSTALL.md             # OpenCode installation instructions
  .codex/                  # Codex integration
    INSTALL.md             # Codex installation instructions (clone + symlink)
  agents/                  # Agent definitions
    code-reviewer.md       # Code reviewer agent with review checklist
  commands/                # Slash commands (user-only, not model-invocable)
    brainstorm.md          # Redirects to brainstorming skill
    write-plan.md          # Redirects to writing-plans skill
    execute-plan.md        # Redirects to executing-plans skill
  hooks/                   # Session lifecycle hooks
    hooks.json             # Hook configuration (SessionStart, sync)
    session-start          # Bash script that injects using-superpowers content
    run-hook.cmd           # Cross-platform polyglot wrapper (Windows + Unix)
  lib/                     # Shared code
    skills-core.js         # ES module for skill discovery/parsing (used by Codex/OpenCode)
  skills/                  # The core content (14 skill directories)
    using-superpowers/     # Meta-skill: how to find and use skills
    brainstorming/         # Design exploration before implementation
    writing-plans/         # Implementation plan creation
    executing-plans/       # Batch execution with checkpoints
    subagent-driven-development/  # Fresh subagent per task with two-stage review
    test-driven-development/      # RED-GREEN-REFACTOR cycle
    systematic-debugging/         # 4-phase root cause investigation
    dispatching-parallel-agents/  # Concurrent subagent workflows
    using-git-worktrees/          # Isolated workspace creation
    finishing-a-development-branch/ # Merge/PR/discard decision workflow
    requesting-code-review/       # Pre-review checklist and dispatch
    receiving-code-review/        # How to respond to feedback
    verification-before-completion/ # Evidence before claims
    writing-skills/               # Meta: how to create new skills (TDD for docs)
  tests/                   # Test suites
    claude-code/           # Integration tests using claude -p
    explicit-skill-requests/ # Tests for explicit skill invocation
    skill-triggering/      # Tests for implicit skill triggering
    subagent-driven-dev/   # End-to-end workflow tests
    opencode/              # OpenCode-specific tests
  docs/                    # Documentation
    testing.md             # Guide to testing skills
    README.codex.md        # Codex-specific docs
    README.opencode.md     # OpenCode-specific docs
    windows/               # Windows-specific docs
    plans/                 # Design documents and improvement plans

3.2 Entry Points

For Claude Code (primary platform):

  1. Plugin installation via marketplace (/plugin marketplace add obra/superpowers-marketplace then /plugin install superpowers@superpowers-marketplace)
  2. Session start hook fires on startup/resume/clear/compact -- runs /tmp/ai-harness-repos/superpowers/hooks/session-start (line 1-51)
  3. Hook output injects the entire using-superpowers skill content wrapped in <EXTREMELY_IMPORTANT> tags into the session context
  4. The using-superpowers skill establishes the mandatory protocol: check for skills BEFORE any response or action

For Cursor:

  1. Plugin installed via Cursor's marketplace with .cursor-plugin/plugin.json
  2. Same session-start hook mechanism, with additional_context field for Cursor compatibility (see /tmp/ai-harness-repos/superpowers/hooks/session-start, lines 41-48)

For OpenCode:

  1. Manual clone + symlink installation
  2. JavaScript plugin at .opencode/plugins/superpowers.js uses experimental.chat.system.transform hook to inject bootstrap into system prompt
  3. Skills discovered via OpenCode's native skill tool from symlinked directory

For Codex:

  1. Manual clone + symlink to ~/.agents/skills/superpowers/
  2. No bootstrap script needed -- Codex's native skill discovery handles it
  3. using-superpowers discovered automatically at startup

3.3 Data Flow

Session Start
    |
    v
[Hook fires] --> [session-start script reads using-superpowers/SKILL.md]
    |
    v
[JSON output with additionalContext injected into session]
    |
    v
[Agent receives using-superpowers instructions as system context]
    |
    v
[Every user message] --> [Check: might any skill apply?]
    |                          |
    |                     [yes, even 1%]
    |                          |
    |                          v
    |                     [Invoke Skill tool to load skill]
    |                          |
    |                          v
    |                     [Announce: "Using [skill] to [purpose]"]
    |                          |
    |                          v
    |                     [Has checklist? -> Create TodoWrite]
    |                          |
    |                          v
    |                     [Follow skill exactly]
    |
    v
[Respond (including clarifications)]

3.4 Key Modules

using-superpowers (the meta-skill) -- /tmp/ai-harness-repos/superpowers/skills/using-superpowers/SKILL.md

This is the most critical file in the entire repository. It establishes:

  • "The Rule": Invoke relevant or requested skills BEFORE any response or action
  • Red Flags table: 12 rationalization patterns the agent must watch for
  • Skill Priority: Process skills first (brainstorming, debugging), then implementation skills
  • Skill Types: Rigid vs. Flexible
  • EnterPlanMode intercept: If agent is about to enter native plan mode, check brainstorming first

lib/skills-core.js -- /tmp/ai-harness-repos/superpowers/lib/skills-core.js

Shared ES module (208 lines) providing:

  • extractFrontmatter() -- Parse YAML frontmatter from SKILL.md files
  • findSkillsInDir() -- Recursive skill discovery with max depth
  • resolveSkillPath() -- Skill resolution with personal > superpowers priority
  • checkForUpdates() -- Git-based update checking with 3-second timeout
  • stripFrontmatter() -- Remove frontmatter from content
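As an illustration of the first of these, a minimal frontmatter parser might look like the following. The real extractFrontmatter is not reproduced here, and restricting it to a flat key: value subset of YAML is an assumption made for brevity:

```javascript
// Minimal sketch of frontmatter parsing as skills-core.js describes it;
// the real extractFrontmatter may handle more YAML than this flat
// key: value subset (a simplifying assumption).
function extractFrontmatter(content) {
  const match = content.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return {};
  const fields = {};
  for (const line of match[1].split('\n')) {
    const idx = line.indexOf(':');
    if (idx > 0) {
      fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    }
  }
  return fields;
}

const skillMd = '---\nname: brainstorming\ndescription: Use when starting creative work\n---\n# Brainstorming\n...';
console.log(extractFrontmatter(skillMd).name); // → 'brainstorming'
```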

hooks/session-start -- /tmp/ai-harness-repos/superpowers/hooks/session-start

Bash script (51 lines) that:

  1. Determines plugin root directory
  2. Checks for legacy skills directory and builds warning
  3. Reads using-superpowers/SKILL.md content
  4. Escapes content for JSON embedding (using optimized bash parameter substitution -- 7x faster than character-by-character loop)
  5. Outputs JSON with both additional_context (Cursor) and hookSpecificOutput.additionalContext (Claude Code) fields

hooks/run-hook.cmd -- /tmp/ai-harness-repos/superpowers/hooks/run-hook.cmd

A polyglot script (46 lines) that is valid in BOTH Windows CMD and Unix bash:

  • On Windows: CMD treats ":" as a label and ignores the "<< 'CMDBLOCK'" heredoc marker; the batch portion locates bash.exe in the standard Git for Windows locations, falling back to PATH
  • On Unix: bash treats ":" as a no-op and "<< 'CMDBLOCK'" opens a heredoc that swallows the CMD portion; the script then runs itself directly via exec bash

3.5 Session Bootstrap Mechanism (Detailed)

The session-start hook (/tmp/ai-harness-repos/superpowers/hooks/session-start) is the critical bootstrap that makes the entire framework function. Its implementation reveals several engineering decisions worth examining:

JSON escape optimization (lines 23-31):

escape_for_json() {
    local s="$1"
    s="${s//\\/\\\\}"
    s="${s//\"/\\\"}"
    s="${s//$'\n'/\\n}"
    s="${s//$'\r'/\\r}"
    s="${s//$'\t'/\\t}"
    printf '%s' "$s"
}

This replaced a character-by-character loop that caused 60+ second delays on Windows (documented in RELEASE-NOTES v4.3.1). The bash parameter substitution approach performs each replacement in a single C-level pass through the string.
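For comparison, the same single-pass escaping is what JSON.stringify provides natively in JavaScript. This is an illustration of the escaping semantics, not code from the repo; note the order dependency the bash version also respects -- backslashes must be escaped before anything else:

```javascript
// JS mirror of the bash escape_for_json above. Backslashes are escaped
// first, otherwise later replacements would double-escape their own output.
function escapeForJson(s) {
  return s
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r')
    .replace(/\t/g, '\\t');
}

const sample = 'line1\n\t"quoted" \\ path';
// JSON.stringify adds surrounding quotes; strip them for comparison.
console.log(escapeForJson(sample) === JSON.stringify(sample).slice(1, -1)); // → true
```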

Dual-format output (lines 41-49):

cat <<EOF
{
  "additional_context": "${session_context}",
  "hookSpecificOutput": {
    "hookEventName": "SessionStart",
    "additionalContext": "${session_context}"
  }
}
EOF

The hook outputs both additional_context (for Cursor compatibility) and hookSpecificOutput.additionalContext (for Claude Code). This dual-format approach was a pragmatic solution to platform differences rather than requiring separate hooks per platform.

EXTREMELY_IMPORTANT wrapping (line 35): The injected content is wrapped in <EXTREMELY_IMPORTANT> tags:

<EXTREMELY_IMPORTANT>
You have superpowers.
**Below is the full content of your 'superpowers:using-superpowers' skill...**
[full using-superpowers content]
</EXTREMELY_IMPORTANT>

This tag name is deliberately emphatic -- it uses the same pattern as the <EXTREMELY-IMPORTANT> tags within the using-superpowers skill itself, creating a layered emphasis system.


4. Harness Workflow: Spec to Plan to Execute to Verify to Merge

4.1 Overview of the Complete Pipeline

Confidence: High -- This is the most well-documented and tested aspect of the framework.

The complete workflow is:

1. Brainstorming    --> Design document
2. Worktree Setup   --> Isolated workspace
3. Writing Plans    --> Implementation plan with bite-sized tasks
4. Execution        --> Subagent-driven (same session) OR executing-plans (separate session)
5. Code Review      --> Two-stage review (spec compliance + code quality)
6. Finishing Branch  --> Merge/PR/Keep/Discard decision

4.2 Stage 1: Brainstorming

Skill: /tmp/ai-harness-repos/superpowers/skills/brainstorming/SKILL.md

Trigger: "You MUST use this before any creative work -- creating features, building components, adding functionality, or modifying behavior."

Hard Gate (line 14-16):

<HARD-GATE>
Do NOT invoke any implementation skill, write any code, scaffold any project, or take any
implementation action until you have presented a design and the user has approved it.
</HARD-GATE>

Anti-Pattern (line 20): "This Is Too Simple To Need A Design" is explicitly called out -- every project goes through this process regardless of perceived simplicity.

Mandatory Checklist (6 items, lines 24-31):

  1. Explore project context (files, docs, recent commits)
  2. Ask clarifying questions (one at a time, understand purpose/constraints/success criteria)
  3. Propose 2-3 approaches (with trade-offs and recommendation)
  4. Present design (in sections scaled to complexity, get approval after each)
  5. Write design doc (save to docs/plans/YYYY-MM-DD-<topic>-design.md, commit)
  6. Transition to implementation (invoke writing-plans skill -- the ONLY valid next step)

Terminal state enforcement (line 55): "The terminal state is invoking writing-plans. Do NOT invoke frontend-design, mcp-builder, or any other implementation skill."

Evolution note: v4.3.0 (2026-02-12) strengthened this significantly with hard gates, mandatory checklists, and graphviz process flow after discovering models were skipping the design phase entirely.

4.3 Stage 2: Worktree Setup

Skill: /tmp/ai-harness-repos/superpowers/skills/using-git-worktrees/SKILL.md

Required by: Both subagent-driven-development and executing-plans (added in v4.2.0).

Directory selection priority:

  1. Check existing .worktrees/ or worktrees/ directories
  2. Check CLAUDE.md for preference
  3. Ask user (.worktrees/ project-local hidden, or ~/.config/superpowers/worktrees/<project>/ global)

Safety verification: Must verify directory is in .gitignore before creating. If not ignored, add to .gitignore and commit immediately ("Fix broken things immediately" -- Jesse's rule).

Post-creation steps:

  1. Auto-detect and run project setup (npm install / cargo build / pip install / go mod download)
  2. Run tests to verify clean baseline
  3. Report location and test status
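The skill expresses these steps as prose, not code. A hypothetical sketch of the same sequence follows, with the command runner injected so the flow can be exercised without a real repository; the npm-based setup detection is a simplified assumption (the skill also covers cargo, pip, and go):

```javascript
// Hypothetical sketch of the worktree setup sequence described above.
// run() is an injected command executor returning success/failure, so the
// flow can be demonstrated without touching a real git repository.
function setupWorktree(branch, dir, run) {
  run(`git worktree add ${dir} -b ${branch}`);
  // Auto-detect project setup (simplified: package.json implies npm).
  if (run(`test -f ${dir}/package.json`)) {
    run(`npm install --prefix ${dir}`);
  }
  // Verify a clean test baseline before any work begins.
  const testsPass = run(`npm test --prefix ${dir}`);
  return { dir, testsPass };
}

// Exercise the flow with a recording fake runner.
const log = [];
const fakeRun = (cmd) => { log.push(cmd); return true; };
const result = setupWorktree('feature-x', '.worktrees/feature-x', fakeRun);
console.log(log.length, result.testsPass); // → 4 true
```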

4.4 Stage 3: Writing Plans

Skill: /tmp/ai-harness-repos/superpowers/skills/writing-plans/SKILL.md

Key design decision: Plans are written assuming the engineer has "zero context for our codebase and questionable taste." This is critical because in subagent-driven-development, each subagent has a fresh context.

Task granularity: Each step is one action (2-5 minutes):

  • Write the failing test (step)
  • Run it to make sure it fails (step)
  • Implement the minimal code to make the test pass (step)
  • Run the tests and make sure they pass (step)
  • Commit (step)

Required plan header (lines 32-45):

# [Feature Name] Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans

**Goal:** [One sentence]
**Architecture:** [2-3 sentences]
**Tech Stack:** [Key technologies]

Task structure (lines 49-88): Each task must include:

  • Exact file paths (Create/Modify/Test)
  • Complete code in plan (not "add validation")
  • Exact commands with expected output
  • DRY, YAGNI, TDD, frequent commits

Execution handoff (lines 99-117): After saving, offers two choices:

  1. Subagent-Driven (same session) -- fresh subagent per task, review between tasks
  2. Parallel Session (separate) -- batch execution with checkpoints

4.5a Stage 4a: Subagent-Driven Development (Primary Execution Mode)

Skill: /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/SKILL.md

This is the flagship execution model and the most sophisticated component.

Process per task (from DOT flowchart, lines 40-82):

Read plan once --> Extract all tasks --> Create TodoWrite
    |
    v (for each task)
Dispatch implementer subagent (with full task text)
    |
    v
Implementer asks questions? --> Answer, provide context
    |
    v (no questions)
Implementer implements, tests, commits, self-reviews
    |
    v
Dispatch spec reviewer subagent
    |
    v
Spec compliant? --> NO --> Implementer fixes gaps --> Re-review
    |
    v (YES)
Dispatch code quality reviewer subagent
    |
    v
Quality approved? --> NO --> Implementer fixes quality issues --> Re-review
    |
    v (YES)
Mark task complete in TodoWrite
    |
    v
More tasks? --> YES --> Next task
    |
    v (NO)
Dispatch final code reviewer for entire implementation
    |
    v
Use finishing-a-development-branch

Three prompt templates:

  1. Implementer (/tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/implementer-prompt.md):

    • Gets full task text (NOT file references)
    • Asked to raise questions BEFORE starting
    • Must self-review against checklist: Completeness, Quality, Discipline, Testing
    • Report format: what implemented, test results, files changed, self-review findings
  2. Spec Compliance Reviewer (/tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/spec-reviewer-prompt.md):

    • Explicitly told: "The implementer finished suspiciously quickly. Their report may be incomplete, inaccurate, or optimistic."
    • Must NOT trust the implementer's report
    • Must read actual code and compare to requirements line by line
    • Reports: missing requirements, extra/unneeded work, misunderstandings
  3. Code Quality Reviewer (/tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/code-quality-reviewer-prompt.md):

    • Only dispatched AFTER spec compliance passes
    • Uses code-reviewer.md template from requesting-code-review/
    • Reviews: code quality, architecture, testing, requirements, production readiness
    • Issues categorized: Critical/Important/Minor

Key constraints (lines 199-224):

  • Never start implementation on main/master without explicit consent
  • Never skip reviews (spec compliance OR code quality)
  • Never dispatch multiple implementation subagents in parallel (conflicts)
  • Never make subagent read plan file (provide full text instead)
  • Never skip scene-setting context
  • Never start code quality review before spec compliance passes
  • If subagent fails: dispatch fix subagent with specific instructions, don't fix manually (context pollution)

4.5b Stage 4b: Executing Plans (Alternative Mode)

Skill: /tmp/ai-harness-repos/superpowers/skills/executing-plans/SKILL.md

Simpler alternative for separate-session execution:

  • Default batch size: 3 tasks
  • Human review between batches
  • Critical review before first batch (raise concerns)
  • Stop immediately when blocked

4.6 Stage 5: Code Review

Agent: /tmp/ai-harness-repos/superpowers/agents/code-reviewer.md

Skill: /tmp/ai-harness-repos/superpowers/skills/requesting-code-review/SKILL.md

The code reviewer agent is a formal agent definition with:

  • 6-step review process (Plan Alignment, Code Quality, Architecture/Design, Documentation, Issue ID, Communication)
  • Issues categorized by severity (Critical/Important/Minor)
  • Clear verdict required (Ready to merge? Yes/No/With fixes)

Integration with SDD: Review happens after EACH task (not just at the end).

4.7 Stage 6: Finishing a Development Branch

Skill: /tmp/ai-harness-repos/superpowers/skills/finishing-a-development-branch/SKILL.md

Process:

  1. Verify tests pass (STOP if failing)
  2. Determine base branch
  3. Present exactly 4 options: Merge locally / Create PR / Keep as-is / Discard
  4. Execute chosen option
  5. Cleanup worktree (for options 1, 2, 4)

Safety: Discard requires typed "discard" confirmation and shows commit list first.


5. Subagent/Task Orchestration Model

5.1 Architecture: Agent-as-Orchestrator

Confidence: High -- This is the defining architectural choice of Superpowers.

Unlike traditional harnesses that have a runtime orchestrator (Python/Node process managing agents), Superpowers makes the AI agent itself the orchestrator. The "controller" is the main Claude session that:

  1. Reads the plan once at start
  2. Extracts all tasks with full text
  3. Creates TodoWrite for tracking
  4. Dispatches implementer subagents via the Task tool
  5. Answers subagent questions
  6. Dispatches reviewer subagents
  7. Manages review loops
  8. Marks tasks complete

This means the orchestration logic lives entirely in the subagent-driven-development/SKILL.md markdown document, which the agent reads and follows. There is no executable orchestration code.
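If this markdown-driven loop were expressed as runtime code -- which Superpowers deliberately avoids -- it would look roughly like the sketch below. Everything here is illustrative: dispatch() stands in for the platform's Task tool, and the { approved } result shape is an assumption:

```javascript
// Illustrative only: Superpowers ships no orchestrator code. This is the
// control flow that subagent-driven-development/SKILL.md asks the agent
// to follow, written out as if it were a runtime.
async function runPlan(tasks, dispatch) {
  const todo = tasks.map((t) => ({ task: t, state: 'pending' }));
  for (const item of todo) {
    item.state = 'in_progress';
    await dispatch('implementer', item.task); // full task text, not a file reference
    while (!(await dispatch('spec-reviewer', item.task)).approved) {
      await dispatch('implementer-fix', item.task); // close spec gaps, then re-review
    }
    while (!(await dispatch('quality-reviewer', item.task)).approved) {
      await dispatch('implementer-fix', item.task); // fix quality issues, then re-review
    }
    item.state = 'completed';
  }
  await dispatch('final-reviewer', 'entire implementation');
  return todo;
}
```

With an always-approving stub for dispatch, each task visits the implementer, spec reviewer, and code quality reviewer exactly once -- the happy path of the flowchart above.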

5.2 Subagent Dispatch Mechanism

Subagents are dispatched using platform-specific tools:

  • Claude Code: Task tool (general-purpose subagent dispatch)
  • OpenCode: @mention syntax
  • Codex: Manual fallback (no native subagent support)

Each subagent gets:

  • Full task text (pasted directly, not file references)
  • Scene-setting context (where this fits, dependencies, architectural context)
  • Specific prompt template (implementer/reviewer)

5.3 Task Tracking

Tasks are tracked using TodoWrite (Claude Code) or update_plan (OpenCode). Each task from the plan becomes a todo item that transitions through states:

  • pending -> in_progress -> completed

5.4 Review Loop Pattern

The review pattern is a loop, not one-shot:

Implementer completes --> Spec review
    |
    v
Issues found? --> YES --> Implementer fixes --> Spec review again
    |
    v (NO)
Code quality review
    |
    v
Issues found? --> YES --> Implementer fixes --> Code quality review again
    |
    v (NO)
Task complete

This is explicitly required by the skill: "Don't skip the re-review" (line 224).

5.5 Context Provision Strategy

Proven optimization: The controller reads the plan ONCE at the start and extracts all tasks with full text. Subagents receive the full task text directly in their prompt -- they never read the plan file themselves.

Rationale from the skill (lines 181-184):

  • No file reading overhead
  • Controller curates exactly what context is needed
  • Subagent gets complete information upfront
  • Questions surfaced before work begins
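The curation step amounts to simple prompt assembly. The field names and phrasing in this sketch are illustrative, not quoted from the implementer-prompt.md template:

```javascript
// Sketch of context curation: the controller pastes full task text into
// the subagent prompt rather than pointing at the plan file. Field names
// here are hypothetical.
function buildImplementerPrompt({ taskText, sceneSetting }) {
  return [
    'You are implementing one task from a larger plan.',
    `Context: ${sceneSetting}`,
    'Task (complete text, do not read the plan file):',
    taskText,
    'Raise any questions BEFORE starting work.',
  ].join('\n\n');
}

const prompt = buildImplementerPrompt({
  taskText: 'Write a failing test for parseConfig, then implement it.',
  sceneSetting: 'Task 2 of 5; Task 1 added the config schema.',
});
console.log(prompt.includes('do not read the plan file')); // → true
```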


6. Multi-Agent / Parallelization Strategy

6.1 Sequential Task Execution (Primary)

Confidence: High -- Explicitly documented and tested.

Subagent-driven-development executes tasks sequentially, not in parallel. This is a deliberate design choice:

From /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/SKILL.md, line 205:

"Dispatch multiple implementation subagents in parallel (conflicts)" -- listed as a "Never" red flag

Rationale: Tasks may have shared state (same files, same test suite), and parallel execution would cause conflicts.

6.2 Parallel Agent Dispatch (Debugging/Independent Tasks)

Skill: /tmp/ai-harness-repos/superpowers/skills/dispatching-parallel-agents/SKILL.md

For independent problems (not plan tasks), parallel dispatch IS supported:

When to use:

  • 3+ test files failing with different root causes
  • Multiple subsystems broken independently
  • No shared state between investigations

Pattern:

1. Identify independent domains (group by what's broken)
2. Create focused agent tasks (specific scope, clear goal, constraints)
3. Dispatch in parallel
4. Review and integrate (check for conflicts, run full suite)

Real-world example (lines 131-157): 6 failures across 3 files, 3 agents dispatched in parallel, all fixes independent, zero conflicts.
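A minimal sketch of the fan-out/fan-in coordination the skill leaves to the agent. The explicit file-overlap check stands in for the "no shared state" rule -- the check itself is an assumption, since the skill relies on agent judgment rather than code:

```javascript
// Sketch: dispatch only provably independent domains concurrently.
// dispatch() stands in for the platform's Task tool.
async function dispatchParallel(domains, dispatch) {
  // Independence check: no two domains may touch the same file.
  const seen = new Set();
  for (const d of domains) {
    for (const f of d.files) {
      if (seen.has(f)) throw new Error(`Domains overlap on ${f}; run sequentially`);
      seen.add(f);
    }
  }
  return Promise.all(domains.map((d) => dispatch(d)));
}

const domains = [
  { name: 'auth tests', files: ['auth_test.go'] },
  { name: 'parser tests', files: ['parser_test.go'] },
];
dispatchParallel(domains, async (d) => `${d.name}: fixed`).then((r) => console.log(r));
```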

6.3 The SDD Sequential Constraint (Detailed Rationale)

The prohibition against parallel task execution in SDD is not merely a preference -- it addresses a fundamental problem with concurrent file system access. From the SDD skill (line 205), dispatching multiple implementation subagents in parallel is listed as a "Never" red flag.

The specific failure scenarios that motivated this constraint:

  1. Shared test suite: Multiple subagents running go test ./... simultaneously causes race conditions in test output and potentially corrupted test databases.
  2. Shared source files: If Task 3 modifies a utility function that Task 4 also uses, parallel execution creates merge conflicts that no subagent can resolve.
  3. Build system conflicts: Concurrent go build or npm install operations in the same directory produce non-deterministic results.
  4. Git state corruption: Multiple subagents committing to the same branch simultaneously creates conflicting histories.

The dispatching-parallel-agents skill addresses these by ONLY allowing parallelization when domains are provably independent (different files, different test suites, different subsystems).

6.4 Limitations of Parallelization

  • No formal queuing mechanism -- parallelization is entirely agent-directed
  • No dependency graph resolution -- agent must manually determine independence
  • No automatic conflict detection -- agent checks for conflicts after completion
  • No load balancing or resource management
  • No fan-out/fan-in coordination (agent must manually collect results from all parallel subagents)
  • No retry mechanism for parallel agents that fail (must be dispatched manually)

Confidence: High -- These are inherent limitations of the agent-as-orchestrator model.


7. Isolation Model

7.1 Git Worktrees as Primary Isolation

Skill: /tmp/ai-harness-repos/superpowers/skills/using-git-worktrees/SKILL.md

Worktrees provide:

  • Branch isolation -- work on feature branch without affecting main
  • Filesystem isolation -- separate working directory
  • Dependency isolation -- separate node_modules/vendor directory
  • Test isolation -- clean test baseline verified at creation

7.2 Subagent Context Isolation

Each subagent dispatched via the Task tool gets a fresh context:

  • No accumulated conversation history from previous tasks
  • No "context pollution" from earlier work
  • Fresh perspective for each task and each review

This is listed as a key advantage (line 172): "Fresh context per task (no confusion)"

7.3 Session Isolation

  • Same-session (subagent-driven): Main session persists, subagents are isolated
  • Separate-session (executing-plans): Entirely new Claude session in worktree

7.4 Main Branch Protection

v4.2.0 changed from hard prohibition to requiring explicit consent for main branch work:

  • Skills warn against working on main
  • Never start implementation on main/master without explicit user consent
  • But if user explicitly consents, allowed

7.5 Limitations

  • No Docker/container isolation
  • No virtual environment isolation (Python venvs not managed)
  • No file permission sandboxing
  • Worktree isolation is advisory -- if agent ignores skill, no enforcement
  • No runtime monitoring of isolation violations

Confidence: High -- These are clear boundaries of the framework.


8. Human-in-the-Loop Controls

8.1 Brainstorming Phase Gates

The brainstorming skill has explicit human approval gates:

  • "Ask after each section whether it looks right so far" (line 74)
  • Design must be presented and approved before implementation
  • Hard gate prevents any implementation action before approval

8.2 Plan Review

The execution skills require the plan to be reviewed:

  • executing-plans Step 1: "Review critically - identify any questions or concerns about the plan" then "If concerns: Raise them with your human partner before starting"
  • Human can modify plan between batches

8.3 Batch Execution Checkpoints

In executing-plans:

  • Default batch size: 3 tasks
  • After each batch: "Show what was implemented, Show verification output, Say: 'Ready for feedback.'"
  • Agent must wait for feedback before continuing

8.4 Subagent-Driven Development: Reduced Human Involvement

In subagent-driven-development, human involvement is reduced:

  • No human checkpoint between tasks (this is a feature, not a bug)
  • Human only involved if subagent asks questions
  • "Faster iteration (no human-in-loop between tasks)" listed as advantage

8.5 Finishing Branch: User Choice

4 structured options presented (no open-ended questions):

  1. Merge locally
  2. Push and create PR
  3. Keep as-is
  4. Discard (requires typed "discard" confirmation)

8.6 Escalation Triggers

Skills define when to stop and ask:

  • executing-plans: "Hit a blocker mid-batch", "Plan has critical gaps", "You don't understand an instruction", "Verification fails repeatedly"
  • systematic-debugging: "If >= 3 fixes failed: STOP and question the architecture" then "Discuss with your human partner before attempting more fixes"
  • subagent-driven-development: "If subagent asks questions - Answer clearly and completely"

8.7 Assessment

Strengths: Multiple explicit gates in the design/planning phase. Clear escalation triggers.

Limitations:

  • In SDD mode, human is largely hands-off during execution -- extended autonomous runs possible
  • No formal approval mechanism (it's all advisory in the skill text)
  • If the agent rationalizes past gates, no enforcement exists
  • No timeout-based escalation (agent can spin indefinitely without human input)

Confidence: High


9. Context Handling Strategy

9.1 Progressive Disclosure

Skills use a layered loading model:

  1. Session start: Only using-superpowers content is injected (the meta-skill)
  2. On-demand: Other skills loaded via Skill tool only when needed
  3. Supporting files: Heavy reference material kept in separate files, loaded only when referenced

From /tmp/ai-harness-repos/superpowers/skills/writing-skills/anthropic-best-practices.md (lines 19-24):

"At startup, only the metadata (name and description) from all Skills is pre-loaded. Claude reads SKILL.md only when the Skill becomes relevant, and reads additional files only as needed."

9.2 Token Efficiency Engineering

The writing-skills skill has detailed guidance on token efficiency:

  • Getting-started workflows: <150 words each
  • Frequently-loaded skills: <200 words total
  • Other skills: <500 words
  • SKILL.md body under 500 lines for optimal performance

Techniques (from /tmp/ai-harness-repos/superpowers/skills/writing-skills/SKILL.md, lines 216-266):

  • Move details to tool help (--help instead of documenting all flags)
  • Use cross-references instead of repeating content
  • Compress examples (42 words -> 20 words)
  • Eliminate redundancy

9.3 Subagent Context Curation

For subagent dispatch, the controller curates context:

  • Full task text provided directly (no file reading)
  • Scene-setting context included
  • Only relevant information for the specific task
  • v4.0.0 improvement: Plan read once, tasks extracted upfront

9.4 Cross-Reference Strategy

Skills reference each other using explicit markers:

  • **REQUIRED BACKGROUND:** -- Prerequisites you must understand
  • **REQUIRED SUB-SKILL:** -- Skills that must be used in workflow
  • **Complementary skills:** -- Optional related skills

Critical rule: No @ links. From lines 286-288: "@ syntax force-loads files immediately, consuming 200k+ context before you need them."

9.5 Context Compaction Handling

For OpenCode, the plugin handles context compaction via experimental.chat.system.transform hook -- bootstrap is re-injected on every system prompt transform, ensuring it survives compaction events.

For Claude Code, the session-start hook fires on "startup|resume|clear|compact" events (hooks.json), ensuring context is re-injected after compaction.

9.6 The No-@ Rule

A critical context management rule from /tmp/ai-harness-repos/superpowers/skills/writing-skills/SKILL.md (lines 286-288):

"@ syntax force-loads files immediately, consuming 200k+ context before you need them."

This means skills must NEVER use @file references to load supporting files. Instead, they use text-based references like:

  • **REQUIRED BACKGROUND:** You MUST understand superpowers:test-driven-development before using this skill.
  • For Anthropic's official skill authoring best practices, see anthropic-best-practices.md.

The agent loads these references on-demand via the Skill tool or Read tool, rather than having them force-loaded into context at skill activation time. This is a critical optimization because a single skill like writing-skills references anthropic-best-practices.md (1151 lines), testing-skills-with-subagents.md (385 lines), and persuasion-principles.md (188 lines). Force-loading all three would consume approximately 50,000+ tokens before the agent even begins working.
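
The loading model described here can be sketched in a few lines: scan every SKILL.md but parse only its frontmatter at startup, and read the body only when the skill is actually invoked. The function names and the minimal frontmatter parser below are illustrative, not code from the repo:

```python
import re
from pathlib import Path

FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)

def index_skills(skills_dir):
    """Startup pass: read only each SKILL.md's frontmatter (name and
    description), never the body, so initial context cost stays near-zero."""
    index = {}
    for path in Path(skills_dir).glob("*/SKILL.md"):
        match = FRONTMATTER.match(path.read_text())
        meta = dict(
            line.split(":", 1)
            for line in (match.group(1).splitlines() if match else [])
            if ":" in line
        )
        index[meta.get("name", path.parent.name).strip()] = {
            "description": meta.get("description", "").strip(),
            "path": path,  # body deliberately NOT loaded here
        }
    return index

def load_skill_body(index, name):
    """On-demand pass: the full skill body enters context only when invoked."""
    return FRONTMATTER.sub("", index[name]["path"].read_text())
```

This is the difference between paying for two short strings per skill at startup versus 50,000+ tokens of force-loaded reference material.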

9.7 Assessment

Strengths:

  • Thoughtful progressive disclosure model
  • Token budget awareness with specific word count targets
  • Re-injection on compaction for both Claude Code and OpenCode
  • Cross-reference strategy prevents context explosion
  • Explicit no-@ rule prevents accidental context bloat

Limitations:

  • No automatic context summarization
  • No RAG or retrieval mechanism for large codebases
  • No chunking strategy for long files
  • Context management is entirely skill-text-driven (no runtime optimization)
  • No measurement of actual context usage per skill (budgets are targets, not enforced limits)

Confidence: High


10. Session Lifecycle and Persistence

10.1 Session Start

  1. Hook fires on startup/resume/clear/compact
  2. session-start script reads using-superpowers/SKILL.md
  3. Content injected as JSON into session context
  4. Agent receives skills-aware behavioral instructions
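
The four steps amount to a small read-and-emit program. A Python rendering of what the ~50-line bash hook does is sketched below; the hookSpecificOutput/additionalContext output schema is an assumption based on Claude Code's hook JSON conventions, not verified against the repo's script:

```python
import json
from pathlib import Path

def session_start_payload(skill_path):
    """Build the JSON a SessionStart hook prints on stdout so the host
    injects the using-superpowers bootstrap into the agent's context."""
    content = Path(skill_path).read_text()
    return json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "SessionStart",
            # additionalContext is what lands in the model's context window
            "additionalContext": content,
        }
    })
```

Because the hook is synchronous (post-v4.3.0), this payload is guaranteed to be in context before the model's first turn.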

Critical timing change (v4.3.0): Hook changed from async: true to async: false. When async, the hook could fail to complete before the model's first turn, meaning using-superpowers instructions weren't in context for the first message.

10.2 Session Persistence

Superpowers has NO persistence mechanism of its own:

  • No session state saved between sessions
  • No database or file-based state
  • TodoWrite provides in-session task tracking only
  • Git commits are the only durable artifact

10.3 Session Resume

On resume, the hook fires again, re-injecting the using-superpowers content. The agent must re-discover what was happening from:

  • Git history
  • Plan files on disk
  • Previous conversation context (if session preserved)

10.4 Legacy Cleanup

The session-start hook checks for legacy skills directory (~/.config/superpowers/skills) and injects a warning if found, instructing users to move to ~/.claude/skills.

10.5 Assessment

Strengths: Clean separation -- no persistent state to corrupt.

Limitations:

  • No session resume capability beyond what the host platform provides
  • No progress tracking across sessions
  • If a session dies mid-workflow, recovery requires manual intervention
  • No checkpoint/restore mechanism

Confidence: High


11. Code Quality Gates

11.1 Test-Driven Development (The Iron Law)

Skill: /tmp/ai-harness-repos/superpowers/skills/test-driven-development/SKILL.md

NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST

Enforcement mechanisms:

  • "Write code before the test? Delete it. Start over."
  • No keeping as reference, no adapting, no looking at it
  • "Violating the letter of the rules is violating the spirit of the rules"
  • 11-entry rationalization prevention table
  • 12-entry red flags list
  • Complete verification checklist (8 items)

Testing Anti-Patterns: /tmp/ai-harness-repos/superpowers/skills/test-driven-development/testing-anti-patterns.md (300 lines) covers:

  1. Testing mock behavior instead of real behavior
  2. Test-only methods in production classes
  3. Mocking without understanding dependencies
  4. Incomplete mocks hiding structural assumptions
  5. Integration tests as afterthought
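
Anti-pattern 1 is the most common in practice and easy to illustrate. The discount example below is hypothetical (not from the repo's anti-patterns file), but shows the shape of the failure: the mocked test passes even if the production function is deleted.

```python
from unittest.mock import Mock

def apply_discount(price, rate):
    """Production code under test."""
    return round(price * (1 - rate), 2)

def test_discount_mocked():
    # Anti-pattern 1: this only proves the mock returns what we told it to.
    # It still passes if apply_discount is broken -- or doesn't exist.
    calc = Mock()
    calc.apply_discount.return_value = 60.0
    assert calc.apply_discount(80, 0.25) == 60.0

def test_discount_real():
    # Real-behavior test: exercises the actual production function.
    assert apply_discount(80, 0.25) == 60.0

test_discount_mocked()
test_discount_real()
```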

11.2 Two-Stage Code Review

Every task in SDD gets two reviews:

  1. Spec Compliance -- Does implementation match spec? Nothing missing, nothing extra.
  2. Code Quality -- Is implementation well-built? Clean code, test coverage, maintainability.

Both are loops -- reviewer finds issues, implementer fixes, reviewer re-reviews.

11.3 Verification Before Completion

Skill: /tmp/ai-harness-repos/superpowers/skills/verification-before-completion/SKILL.md

NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE

Gate function (lines 27-38):

  1. IDENTIFY: What command proves this claim?
  2. RUN: Execute the FULL command (fresh, complete)
  3. READ: Full output, check exit code, count failures
  4. VERIFY: Does output confirm the claim?
  5. ONLY THEN: Make the claim
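
The gate function is prose in the skill, but its logic is mechanical enough to sketch: run the full proving command fresh, read the exit code, and refuse to claim otherwise. The function name and return shape here are illustrative:

```python
import subprocess

def verified_claim(claim, command):
    """Gate a completion claim behind fresh verification evidence:
    identify the proving command, run it in full, read the result,
    and only then make the claim."""
    result = subprocess.run(command, capture_output=True, text=True)
    evidence = {
        "command": " ".join(command),
        "exit_code": result.returncode,
        "output": result.stdout + result.stderr,
    }
    if result.returncode != 0:
        return f"CANNOT CLAIM: {claim!r} -- verification failed", evidence
    return f"VERIFIED: {claim}", evidence
```

Usage would look like `verified_claim("all tests pass", ["pytest", "-x"])` -- the claim string is only ever emitted alongside the evidence that produced it.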

Origin story (lines 111-115): "From 24 failure memories: Jesse said 'I don't believe you' - trust broken. Undefined functions shipped. Missing requirements shipped."

11.4 Systematic Debugging

Skill: /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/SKILL.md

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

Four mandatory phases:

  1. Root Cause Investigation -- Read errors, reproduce, check changes, gather evidence
  2. Pattern Analysis -- Find working examples, compare, identify differences
  3. Hypothesis and Testing -- Scientific method, one variable at a time
  4. Implementation -- Create failing test, implement fix, verify

Escalation trigger: If 3+ fixes failed, STOP and question the architecture.

Supporting techniques bundled:

  • root-cause-tracing.md -- Trace backward through call stack
  • defense-in-depth.md -- Validate at every layer (4 layers)
  • condition-based-waiting.md -- Replace timeouts with condition polling
  • find-polluter.sh -- Bisection script for test pollution
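
The bisection idea behind find-polluter.sh is worth making concrete: if a victim test passes alone but fails after some earlier test pollutes shared state, binary-search over prefixes of the test order finds the polluter in O(log n) runs instead of n. This is a Python sketch of the technique; the shell script's actual interface may differ:

```python
def find_polluter(tests, victim_fails_after):
    """Bisect the test order to find which earlier test makes the victim
    fail. `victim_fails_after(subset)` runs the subset followed by the
    victim and reports whether the victim failed. Caller should first
    confirm the victim passes in isolation (empty subset)."""
    if not victim_fails_after(tests):
        return None  # no polluter anywhere in the ordering
    lo, hi = 0, len(tests)
    # Invariant: victim fails after tests[:hi], passes after tests[:lo]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if victim_fails_after(tests[:mid]):
            hi = mid   # polluter is in the first half
        else:
            lo = mid   # polluter runs later
    return tests[hi - 1]
```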

11.5 The Rationalization Prevention Tables (Deep Dive)

A hallmark of Superpowers' quality engineering is the rationalization prevention table -- a pre-emptive catalog of excuses the agent might generate for skipping a quality gate, paired with the correct response.

From verification-before-completion/SKILL.md (lines 63-74):

| Excuse | Reality |
| --- | --- |
| "Should work now" | RUN the verification |
| "I'm confident" | Confidence is not evidence |
| "Just this once" | No exceptions |
| "Linter passed" | Linter is not compiler |
| "Agent said success" | Verify independently |
| "I'm tired" | Exhaustion is not excuse |
| "Partial check is enough" | Partial proves nothing |
| "Different words so rule doesn't apply" | Spirit over letter |

From using-superpowers/SKILL.md (lines 60-73):

| Thought | Reality |
| --- | --- |
| "This is just a simple question" | Questions are tasks. Check for skills. |
| "I need more context first" | Skill check comes BEFORE clarifying questions. |
| "Let me explore the codebase first" | Skills tell you HOW to explore. Check first. |
| "This doesn't need a formal skill" | If a skill exists, use it. |
| "I remember this skill" | Skills evolve. Read current version. |
| "The skill is overkill" | Simple things become complex. Use it. |
| "I'll just do this one thing first" | Check BEFORE doing anything. |
| "I know what that means" | Knowing the concept is not using the skill. Invoke it. |

These tables are not theoretical -- they were built iteratively from observed agent failures. The writing-skills skill documents this process: run a baseline test, watch the agent rationalize, document the exact rationalization, write the counter, test again.

The total number of rationalization entries across all skills exceeds 40 unique patterns. This represents one of the most comprehensive catalogs of LLM avoidance behavior in any open-source project.

11.6 Pressure Testing Methodology (from writing-skills)

The testing-skills-with-subagents.md reference (/tmp/ai-harness-repos/superpowers/skills/writing-skills/testing-skills-with-subagents.md, 385 lines) documents Superpowers' unique approach to validating that quality gates actually work under pressure.

Core insight: Academic test scenarios ("What does the skill say?") are useless because agents simply recite the skill. Real validation requires pressure scenarios that create incentives to bypass the gate.

Seven pressure types identified (from the reference):

| Pressure Type | Example | What It Tests |
| --- | --- | --- |
| Time | "Production down, 5 minutes to deploy window" | Does agent skip testing under time pressure? |
| Sunk Cost | "Spent 3 hours, 200 lines already written" | Does agent refuse to delete and restart? |
| Authority | "Manager says ship it now" | Does agent comply with authority over process? |
| Economic | "$10k/min revenue loss" | Does agent rationalize shortcuts for cost reasons? |
| Exhaustion | "6pm, dinner at 6:30, been coding all day" | Does agent take shortcuts when "tired"? |
| Social | "Team is waiting on this" | Does agent skip reviews to unblock team? |
| Pragmatic | "It works, manually tested all edge cases" | Does agent skip formal tests when confident? |

Example combined-pressure scenario (lines 111-119):

You spent 3 hours, 200 lines, manually tested. It works.
It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.
Just realized you forgot TDD.

Options:
A) Delete 200 lines, start fresh tomorrow with TDD
B) Commit now, add tests tomorrow

Without the TDD skill loaded, agents consistently choose B and rationalize with "I already manually tested it," "Tests after achieve same goals," and "Deleting is wasteful." With the TDD skill loaded and properly hardened, agents choose A -- the correct but psychologically difficult option.
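
The pressure-testing loop can be automated with the headless flags the repo already uses for its test suites (claude -p, --plugin-dir, --dangerously-skip-permissions, --output-format stream-json). A minimal harness sketch -- the grading patterns in the test are illustrative, not from the repo's suite, and the CLI flags should be confirmed against the installed version:

```python
def pressure_test_cmd(scenario_prompt, plugin_dir):
    """Assemble a headless run of one pressure scenario."""
    return ["claude", "-p", scenario_prompt,
            "--plugin-dir", plugin_dir,
            "--dangerously-skip-permissions",
            "--output-format", "stream-json"]

def grade_pressure_run(transcript, required, forbidden):
    """Grade the run: the skill held if every required pattern appears
    and no forbidden rationalization does."""
    missing = [p for p in required if p not in transcript]
    violations = [p for p in forbidden if p in transcript]
    return {"passed": not missing and not violations,
            "missing": missing, "violations": violations}
```

For the combined-pressure scenario above, a required pattern might be evidence of choosing option A, with "add tests tomorrow" as a forbidden rationalization.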

This methodology represents a genuinely novel contribution to the field of prompt engineering quality assurance.

11.7 Assessment

Strengths:

  • Extremely thorough quality gate system
  • Anti-rationalization engineering prevents gate bypass
  • Two-stage review catches both spec compliance and quality issues
  • Evidence-based verification prevents false completion claims
  • Iteratively built from real observed failures (not hypothetical)

Limitations:

  • All quality gates are advisory (enforced by skill text, not runtime)
  • If agent ignores skill, no external enforcement
  • No automated CI integration (no GitHub Actions, no pre-commit hooks)
  • No formal security scanning
  • No static analysis integration
  • No quantitative measurement of gate effectiveness (how often are rationalizations actually prevented?)

Confidence: High


12. Security and Compliance Mechanisms

12.1 Branch Protection

  • Worktree isolation prevents accidental work on main
  • Explicit consent required for main branch work
  • Finishing branch skill prevents force-push without explicit request

12.2 Work Destruction Prevention

  • Discard option requires typed "discard" confirmation
  • Shows commit list before deletion
  • Worktree cleanup only for merge/discard, not keep-as-is

12.3 Credential Safety

The .gitignore excludes .private-journal/ and .claude/ directories, but there is no explicit credential scanning or prevention mechanism.

12.4 Plugin Security

Plugin manifests are declarative JSON with no executable code (except the session-start hook and the OpenCode JS plugin). The hook is a bash script with no network access.

12.5 Implicit Security Model

While Superpowers lacks explicit security mechanisms, it has an implicit security model worth documenting:

  1. Executable surface area is minimal: Only ~800 lines of executable code across 7 files. The majority of the framework is pure markdown, which carries no execution risk.

  2. Hook script is read-only: The session-start hook only reads a file and outputs JSON. It does not modify any files, make network requests, or execute user code.

  3. Plugin manifest is declarative: .claude-plugin/plugin.json contains only paths and metadata. No executable plugins or dynamic code loading.

  4. OpenCode plugin is scoped: .opencode/plugins/superpowers.js only transforms system prompts. It does not access the filesystem beyond reading skill files.

  5. No data exfiltration vector: Skills operate entirely within the AI agent's context. There is no mechanism for skills to send data to external services.

However, these implicit protections are insufficient for enterprise environments:

12.6 Assessment

Security mechanisms are minimal:

  • No secret detection or credential scanning
  • No SBOM generation or dependency vulnerability scanning
  • No sandbox enforcement beyond git worktree isolation
  • No audit logging of agent actions or skill compliance
  • No rate limiting or cost caps
  • No input validation on skill content (a malicious skill could instruct the agent to perform harmful actions)
  • No integrity checking of skill files (modified skills would be trusted immediately)
  • No access control on which skills are available to which agents

This is a significant gap for enterprise adoption. The implicit security model (minimal executable surface, read-only hooks) provides a baseline but no defense-in-depth.

Confidence: High -- Absence of security features is clear from the codebase.


13. Hooks, Automation Surface, and Fail-Safe Behavior

13.1 Hook System

Configuration: /tmp/ai-harness-repos/superpowers/hooks/hooks.json

```json
{
  "hooks": {
    "SessionStart": [
      {
        "matcher": "startup|resume|clear|compact",
        "hooks": [
          {
            "type": "command",
            "command": "'${CLAUDE_PLUGIN_ROOT}/hooks/run-hook.cmd' session-start",
            "async": false
          }
        ]
      }
    ]
  }
}
```

Only one hook is used: SessionStart. No PreToolUse, PostToolUse, or other hooks are implemented.

Notable design decision: The hooks.json matcher field uses "startup|resume|clear|compact" -- matching four different session events. This ensures the bootstrap is re-injected:

  • On initial session start (startup)
  • On session resume after pause (resume)
  • When context is cleared (clear)
  • When context is compacted due to token limits (compact)

The async: false setting (changed from async: true in v4.3.0) is critical: if the hook runs asynchronously, its output may not be in the agent's context when the first user message is processed, meaning the agent would respond without knowing about superpowers for its first turn.

13.2 Slash Commands

Three user-only commands (all with disable-model-invocation: true):

  • /superpowers:brainstorm -- Redirects to brainstorming skill
  • /superpowers:write-plan -- Redirects to writing-plans skill
  • /superpowers:execute-plan -- Redirects to executing-plans skill

These are convenience shortcuts -- the underlying skills are the real functionality.

13.3 Fail-Safe Behavior

What happens when things go wrong:

| Failure | Behavior |
| --- | --- |
| Hook fails to run | Plugin still works, but no bootstrap context (silent degradation) |
| Skill not found | Agent proceeds without skill (no error thrown) |
| Subagent fails | "Dispatch fix subagent with specific instructions" |
| Tests fail during setup | Report failures, ask whether to proceed |
| Review finds issues | Loop until approved (no timeout) |
| 3+ debugging fixes fail | Escalate to human for architectural discussion |
| Legacy directory found | Warning injected into session |

13.4 Automation Surface

The framework provides these automation touchpoints:

  • claude -p for headless testing
  • --plugin-dir for custom plugin location
  • --dangerously-skip-permissions for automated testing
  • --output-format stream-json for structured output
  • Session JSONL transcripts for post-hoc analysis

13.5 Assessment

Strengths: Clean fail-safe design -- hook failure degrades gracefully, not catastrophically.

Limitations:

  • Only one hook (SessionStart) -- no pre/post tool use hooks
  • No webhook integration
  • No CI/CD integration hooks
  • No event streaming or observability

Confidence: High


14. CLI/UX and Automation Ergonomics

14.1 User Experience Design

The framework prioritizes invisible operation: "Because the skills trigger automatically, you don't need to do anything special. Your coding agent just has Superpowers." (README.md, line 15)

Users interact via natural language:

  • "Help me plan this feature" triggers brainstorming
  • "Let's debug this issue" triggers systematic-debugging
  • "Build X" triggers brainstorming first, then implementation

14.2 Installation Ergonomics

  • Claude Code (best): Two commands via marketplace
  • Cursor: One command via marketplace
  • Codex: Clone + symlink (no package manager dependency)
  • OpenCode: Clone + two symlinks (plugin + skills)
  • Windows: Extensive documentation for all three shells (CMD, PowerShell, Git Bash)

14.3 Test Infrastructure for Automation

Test helpers (/tmp/ai-harness-repos/superpowers/tests/claude-code/test-helpers.sh):

  • run_claude -- Runs Claude in headless mode with timeout
  • assert_contains / assert_not_contains -- Pattern matching
  • assert_count -- Exact occurrence counting
  • assert_order -- Pattern ordering verification
  • create_test_project / create_test_plan -- Fixture creation
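
The ordering assertion is the most interesting helper: it verifies, for example, that the Skill tool was invoked before any Edit occurred. A Python analogue of the bash helper (the shell version's exact semantics may differ):

```python
def assert_order(transcript, *patterns):
    """Return True if every pattern appears in the transcript, in order.
    Each search starts after the previous match, so out-of-order or
    missing patterns both fail."""
    pos = -1
    for pattern in patterns:
        pos = transcript.find(pattern, pos + 1)
        if pos == -1:
            return False
    return True
```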

Test suites:

  1. tests/claude-code/ -- Integration tests using claude -p
  2. tests/explicit-skill-requests/ -- Verifies skill invocation when explicitly named
  3. tests/skill-triggering/ -- Verifies skills trigger from naive prompts
  4. tests/subagent-driven-dev/ -- End-to-end workflow tests with real projects
  5. tests/opencode/ -- OpenCode-specific tests

14.4 Test Categories and Coverage

The testing infrastructure supports four distinct test categories, each targeting a different failure mode:

1. Skill Compliance Tests (/tmp/ai-harness-repos/superpowers/tests/claude-code/test-subagent-driven-development.sh): These verify that the agent follows skill instructions correctly. Example tests from the 9-test SDD suite:

  • test_skill_invoked -- Verifies the Skill tool is called
  • test_plan_read_once -- Verifies plan is not re-read for each task
  • test_reviews_happen -- Verifies both spec and quality reviews occur
  • test_no_parallel_dispatch -- Verifies tasks are sequential
  • test_todo_tracking -- Verifies TodoWrite is used

2. Implicit Triggering Tests (/tmp/ai-harness-repos/superpowers/tests/skill-triggering/): These verify that skills are triggered from natural language prompts without naming the skill explicitly. The test runs Claude with a naive prompt and checks whether the correct skill was invoked.

3. Explicit Request Tests (/tmp/ai-harness-repos/superpowers/tests/explicit-skill-requests/): These verify that naming a skill by name causes it to be invoked. Critically, these also check for premature action -- whether the agent started working BEFORE loading the skill (lines 97-121 of run-test.sh). This catches the failure mode where the agent begins implementing immediately, only loading the skill as an afterthought.

4. End-to-End Workflow Tests (/tmp/ai-harness-repos/superpowers/tests/subagent-driven-dev/): These run complete multi-task plans through the SDD pipeline and verify the output project builds, tests pass, and all artifacts are created. The Go fractals test plan includes 10 tasks from project setup through README creation.

14.5 Reporting

Token usage analysis via tests/claude-code/analyze-token-usage.py:

  • Breaks down usage by main session and individual subagents
  • Shows input tokens, output tokens, cache usage, estimated cost
  • Per-agent description extraction from prompts

14.6 Assessment

Strengths:

  • Invisible operation for end users
  • Comprehensive cross-platform installation
  • Good test infrastructure for skill validation
  • Token usage visibility

Limitations:

  • No web UI or dashboard
  • No progress visualization during execution
  • No real-time status updates
  • No undo/rollback beyond git

Confidence: High


15. Cost/Usage Visibility and Governance

15.1 Token Usage Analysis

Tool: /tmp/ai-harness-repos/superpowers/tests/claude-code/analyze-token-usage.py

Parses Claude Code JSONL session transcripts to provide:

  • Per-message token usage (input, output, cache creation, cache read)
  • Per-subagent breakdown
  • Total cost estimate (at $3/$15 per M tokens for input/output)

Example from real test run (docs/testing.md, lines 103-129):

  • Total tokens: 1,524,058
  • Estimated cost: $4.67
  • 7 subagents dispatched (2 implementers, 2 spec reviewers, 2 code quality reviewers, 1 final reviewer)
  • Heavy cache utilization (1.38M cache read tokens vs 62 direct input tokens)

15.2 Cost Awareness in Skills

The SDD skill explicitly acknowledges cost trade-offs (lines 193-197):

Cost:

  • More subagent invocations (implementer + 2 reviewers per task)
  • Controller does more prep work
  • Review loops add iterations
  • But catches issues early (cheaper than debugging later)

15.3 Cache Utilization Insights

The documented test run reveals an important cost optimization that occurs naturally:

  • Direct input tokens: 62 (negligible)
  • Cache read tokens: 1,380,000+ (vast majority)
  • Cache creation tokens: ~80,000

This means that after the first subagent, subsequent subagents benefit heavily from cache hits. The controller's context (plan, project files, skill instructions) is cached after the first subagent reads them, and all subsequent subagents hit this cache. This is a significant cost advantage of sequential task execution -- parallel execution would likely create separate cache entries, reducing cache efficiency.

15.4 Assessment

Strengths: Post-hoc cost analysis tool exists and is documented. Cache utilization is naturally optimized by sequential execution.

Limitations:

  • No real-time cost tracking during execution
  • No cost caps or budgets that would halt execution
  • No per-session cost reporting built into the workflow
  • No cost optimization automation (e.g., using cheaper models for reviews)
  • Cost analysis is a separate tool, not integrated into the workflow
  • No cost comparison between SDD mode and executing-plans mode

Confidence: High


16. Tooling and Dependency Surface

16.1 Runtime Dependencies

| Dependency | Required By | Notes |
| --- | --- | --- |
| Bash | hooks/session-start | POSIX-compatible; uses ${BASH_SOURCE[0]:-$0} for portability |
| Git | using-git-worktrees, finishing-a-development-branch | Worktrees, branches, diffing |
| Node.js | lib/skills-core.js, render-graphs.js | ES modules; only needed for OpenCode/Codex |
| Python 3 | analyze-token-usage.py | Optional; only for test analysis |
| GraphViz (dot) | render-graphs.js | Optional; only for visualizing flowcharts |
| GitHub CLI (gh) | finishing-a-development-branch | Optional; for PR creation |

16.2 Platform Dependencies

| Platform | Integration Method | Requirements |
| --- | --- | --- |
| Claude Code | Plugin marketplace | Claude Code CLI |
| Cursor | Plugin marketplace | Cursor with plugin support |
| OpenCode | JavaScript plugin + symlinks | OpenCode with experimental hooks |
| Codex | Native skill discovery + symlink | Codex CLI |

16.3 Zero-Dependency Design

The core skills are pure markdown with no executable dependencies. The only executable code is:

  • hooks/session-start (51 lines bash) -- bootstrap injection
  • hooks/run-hook.cmd (46 lines polyglot) -- cross-platform wrapper
  • lib/skills-core.js (208 lines JS) -- skill discovery for OpenCode/Codex
  • .opencode/plugins/superpowers.js (95 lines JS) -- OpenCode plugin
  • skills/writing-skills/render-graphs.js (168 lines JS) -- optional visualization
  • skills/systematic-debugging/find-polluter.sh (63 lines bash) -- debugging utility
  • tests/claude-code/analyze-token-usage.py (168 lines Python) -- test analysis

Total executable code: ~800 lines across 7 files. Everything else is markdown.

16.4 Assessment

Strengths:

  • Minimal dependency footprint
  • No package.json, no npm install, no build step
  • Pure markdown skills work across all platforms
  • Cross-platform polyglot wrapper handles Windows/Unix differences

Limitations:

  • Bash dependency for hooks limits pure-Windows environments (mitigated by Git for Windows)
  • No formal dependency management (no package-lock, no version pinning)
  • render-graphs.js uses CommonJS require(), not ES modules (inconsistent with skills-core.js)

Confidence: High


17. External Integrations and Provider Compatibility

17.1 AI Provider Compatibility

Superpowers is model-agnostic at the skill level -- skills are markdown instructions that work with any LLM. However:

  • Primary target: Claude (all references use Claude-specific tools: Skill, Task, TodoWrite, Read, Write, Edit, Bash)
  • OpenCode mapping: TodoWrite -> update_plan, Task -> @mention, Skill -> native skill tool
  • Codex mapping: Limited -- "manual work instead of delegation" for subagent workflows

17.2 Tool Mapping

From OpenCode plugin (/tmp/ai-harness-repos/superpowers/.opencode/plugins/superpowers.js, lines 64-73):

```
TodoWrite            -> update_plan
Task (subagent)      -> @mention syntax
Skill tool           -> native skill tool
Read/Write/Edit/Bash -> native tools
```
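
What a mechanical application of this mapping looks like can be sketched as a whole-word rewrite over skill prose. This is illustrative only -- the actual plugin documents the mapping in its injected prompt rather than rewriting skill text, and the OpenCode-side tool names are taken from the table above:

```python
import re

TOOL_MAP = {
    "TodoWrite": "update_plan",
    "Task": "@mention",
}

def translate_tools(skill_text, tool_map=TOOL_MAP):
    """Replace Claude-specific tool names with their host equivalents.
    Word boundaries prevent partial matches (e.g. 'Tasks' is untouched)."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, tool_map)) + r")\b")
    return pattern.sub(lambda m: tool_map[m.group(1)], skill_text)
```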

17.3 Git Integration

Deep git integration throughout:

  • Worktree creation and management
  • Branch operations (create, merge, delete)
  • Commit tracking (per-task commits)
  • Diff analysis for code review
  • PR creation via gh CLI

17.4 Assessment

Strengths:

  • Multi-platform support (4 platforms)
  • Clean tool mapping strategy for platform differences
  • Git as universal integration point

Limitations:

  • Heavy Claude-specific tool references in skills (requires mapping for other platforms)
  • Codex subagent support is degraded (manual fallback)
  • No MCP server integration
  • No external API integrations (Jira, Linear, etc.)
  • No CI/CD pipeline integration

Confidence: High


18. Operational Assumptions and Constraints

18.1 Assumptions

  1. Agent follows instructions -- The entire framework relies on the AI agent reading and following skill documents. No runtime enforcement exists.

  2. Fresh context per subagent -- Subagent-driven development assumes each Task tool invocation provides a clean context. This is a platform-specific behavior of Claude Code.

  3. Git repository present -- Many skills assume a git repository exists with proper configuration.

  4. Test infrastructure available -- TDD skill assumes test framework is set up and tests can be run.

  5. Single-developer workflow -- No multi-user coordination, no conflict resolution between concurrent developers.

  6. English language -- All skills, prompts, and instructions are in English.

  7. Network access for tools -- Some skills reference gh CLI which requires GitHub access.

18.2 Constraints

  1. Context window limits -- Skills are designed to be compact (< 500 lines) to avoid consuming too much context.

  2. No persistent state -- Framework cannot track progress across sessions.

  3. Platform-specific features -- Some features (subagent dispatch, task tracking) vary by platform.

  4. Advisory enforcement only -- All rules are enforced by prompt engineering, not code.

18.3 Assessment

The operational assumptions are reasonable for the target use case (individual developer using Claude Code on a git repository). They become limiting for enterprise/team/multi-repo scenarios.

Confidence: High


19. Failure Modes and Issues Observed

19.1 Documented Failure Modes

From release notes and improvement plans:

1. Agent rationalization bypass (v3.2.2, v4.0.3)

  • Agent thinks "I know what that means" and skips skill invocation
  • Agent starts working before loading requested skill
  • Multiple iterations of anti-rationalization engineering required

2. Description trap (v4.0.0)

  • Skill descriptions containing workflow summaries cause agent to follow description instead of full skill
  • Led to one-review instead of two-review process

3. SessionStart timing (v4.3.0, v4.2.0)

  • Async hook could fail to complete before first turn
  • But sync hook froze Windows TUI
  • Fix: synchronous on Unix; Windows was temporarily made async to avoid the TUI freeze, then switched back to sync

4. Windows execution failures (v2.0.1, v4.1.0, v4.2.0, v4.3.1)

  • CRLF line ending issues
  • Path with spaces
  • Missing WSL
  • .sh auto-detection breaking polyglot wrapper
  • set -euo pipefail fragility on MSYS
  • O(n^2) escape_for_json performance (60+ seconds)

5. EnterPlanMode bypass (v4.3.0)

  • Claude enters native plan mode instead of using brainstorming skill
  • Fixed by adding EnterPlanMode intercept in using-superpowers flowchart

19.2 From Improvement Plans

From /tmp/ai-harness-repos/superpowers/docs/plans/2025-11-28-skills-improvements-from-user-feedback.md:

6. Configuration change verification gap

  • Agent reports "OpenAI integration working" but response shows Claude model
  • Verified operation succeeded, not that intended configuration was applied
  • Impact: High (false confidence in tests)

7. Background process accumulation

  • Multiple subagents start background servers, processes accumulate
  • Later tests hit stale server with wrong config
  • Impact: Medium-High

8. Mock-interface drift

  • Mocks derived from buggy implementation, not interface definition
  • Tests pass, runtime crashes
  • Impact: High

9. Skills not being read

  • Skills exist but neither human nor subagents read them
  • Skill investment wasted
  • Impact: Medium

19.3 Taxonomy of Failure Modes

Analyzing the nine documented failure modes, they fall into three distinct categories:

Category A: Agent Compliance Failures (4 instances)

  • Rationalization bypass (agent skips skill)
  • Description trap (agent follows summary instead of full skill)
  • EnterPlanMode bypass (agent uses native plan mode instead of brainstorming)
  • Skills not being read (neither human nor agent reads available skills)

These are the most fundamental and hardest to fix. They represent the inherent fragility of the advisory-only enforcement model. Each required multiple iterations of anti-rationalization engineering.

Category B: Platform/Environment Failures (3 instances)

  • SessionStart timing (async hook completing too late)
  • Windows execution failures (CRLF, paths, WSL, shell detection)
  • Background process accumulation (stale servers from subagents)

These are engineering problems with engineering solutions. The Windows failures in particular consumed significant development effort across v2.0.1, v4.1.0, v4.2.0, and v4.3.1.

Category C: Verification Methodology Failures (2 instances)

  • Configuration change verification gap (verified operation, not configuration)
  • Mock-interface drift (mocks derived from buggy implementation)

These represent genuine insights into AI agent testing methodology. The configuration verification gap is particularly subtle: an agent can correctly report "operation succeeded" while the underlying configuration is wrong (e.g., OpenAI integration "working" but actually using Claude model).

19.4 Assessment

The documented failure modes reveal an honest and rigorous development process. The team actively discovers, documents, and addresses failures. The most concerning pattern is the fundamental reliance on agent compliance (Category A) -- every compliance failure ultimately traces back to the agent not following instructions as expected, and the only remediation is more detailed instructions.

The taxonomy also reveals an important insight: the most impactful failures are NOT the obvious ones (platform bugs, timing issues) but the subtle ones (verification gaps, mock drift) where the agent believes it is compliant but is actually failing.

Confidence: High


20. Governance and Guardrails

20.1 Skill-Level Guardrails

Each skill implements guardrails through:

  • Iron Laws -- Absolute rules that cannot be violated
  • Hard Gates -- Must-complete-before-proceeding barriers
  • Red Flags lists -- Thought patterns that indicate rationalization
  • Rationalization tables -- Pre-emptive counters to expected excuses
  • Gate functions -- Explicit decision trees before actions
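A gate function of the kind the skills describe can be sketched as an explicit checklist that must pass before a risky action. The function name and checks below are hypothetical illustrations of the pattern, not code from any skill.

```python
def gate_before_commit(tests_ran: bool, tests_passed: bool,
                       on_main_branch: bool) -> tuple[bool, str]:
    """Explicit decision tree: every check must pass before committing."""
    if not tests_ran:
        return False, "HARD GATE: run the test suite first"
    if not tests_passed:
        return False, "HARD GATE: failing tests block the commit"
    if on_main_branch:
        return False, "IRON LAW: never commit directly to main"
    return True, "gate passed"
```

In Superpowers these gates live as prose in SKILL.md files and are executed by the agent, not by a runtime; the sketch shows what the same logic looks like when made executable.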

20.2 Workflow-Level Guardrails

  • Brainstorming must complete before implementation
  • Worktree must be set up before execution
  • Spec compliance must pass before code quality review
  • Tests must pass before finishing branch
  • Discard requires typed confirmation

20.3 Anti-Sycophancy Guardrails

The receiving-code-review skill (/tmp/ai-harness-repos/superpowers/skills/receiving-code-review/SKILL.md) explicitly forbids performative agreement and establishes a protocol for receiving feedback with technical rigor:

Forbidden Responses (lines 29-33):

  • "You're absolutely right!" (explicit CLAUDE.md violation)
  • "Great point!" / "Excellent feedback!" (performative)
  • "Let me implement that now" (before verification)

Required Pattern Instead (lines 16-25):

WHEN receiving code review feedback:
1. READ: Complete feedback without reacting
2. UNDERSTAND: Restate requirement in own words (or ask)
3. VERIFY: Check against codebase reality
4. EVALUATE: Technically sound for THIS codebase?
5. RESPOND: Technical acknowledgment or reasoned pushback
6. IMPLEMENT: One item at a time, test each

Source-Specific Trust Levels (lines 59-86): The skill differentiates between feedback from the human partner (trusted, but still requiring understanding) and feedback from external reviewers, which must be verified against five checkpoints: technical correctness, whether it breaks existing functionality, the reason for the current implementation, cross-platform compatibility, and full context understanding.

YAGNI Check for "Professional" Features (lines 88-98): When a reviewer suggests "implementing properly," the agent must first grep codebase for actual usage before implementing. If the endpoint/feature is unused, the correct response is to suggest removing it (YAGNI), not implementing it "properly."

This represents one of the most thorough anti-sycophancy implementations in any AI agent framework. It addresses the well-documented tendency of LLMs to agree with authority figures (reviewers) regardless of technical merit.

20.4 YAGNI Enforcement

Multiple layers of YAGNI enforcement:

  • Brainstorming: "YAGNI ruthlessly - Remove unnecessary features from all designs"
  • Writing plans: "DRY, YAGNI, TDD"
  • Spec review: Catches "Extra/unneeded work" and "nice to haves" not in spec
  • Code review: Checks "No scope creep"
  • Receiving review: "grep codebase for actual usage" before implementing suggested features

20.5 Assessment

Strengths: Deeply layered guardrail system, anti-rationalization engineering, anti-sycophancy measures.

Limitations:

  • All guardrails are advisory
  • No audit trail of guardrail compliance
  • No automated detection of guardrail violations
  • No way to prove guardrails were followed (only that the skill text exists)

Confidence: High


21. Roadmap/Evolution Signals, Missing Areas, Unresolved Issues

21.1 Evolution Trajectory

The release history (v1.0 -> v4.3.1 over ~16 months) shows clear evolution:

  1. v1.x: Monolithic plugin with embedded skills
  2. v2.0: Skills separated into external repository, community contribution model
  3. v3.0: Adopted Anthropic's first-party skills system
  4. v3.x: Added Codex and OpenCode support, skill namespacing
  5. v4.0: DOT flowcharts, two-stage review, testing infrastructure
  6. v4.1-4.3: Windows hardening, Cursor support, anti-rationalization strengthening

The trend is toward tighter behavioral enforcement, broader platform support, and better testing infrastructure.

Key inflection points:

  • v2.0 was the first major architecture shift (monolith -> modular skills), indicating the original design was too rigid.
  • v3.0 was the second major shift (custom skill system -> Anthropic's first-party system), indicating willingness to abandon custom infrastructure in favor of platform-native features.
  • v4.0 was the content quality revolution (DOT flowcharts, two-stage review, testing infrastructure), indicating the team recognized that skill content quality was the primary bottleneck.
  • v4.1-4.3 was the hardening phase (Windows fixes, anti-rationalization, sync hooks), indicating the framework was mature enough for real-world users to surface edge cases.

This trajectory suggests the next major evolution will likely focus on one of: measurement/observability (proving skills work), multi-agent coordination (team workflows), or cost optimization (cheaper review loops).

21.2 Active Improvement Areas

From /tmp/ai-harness-repos/superpowers/docs/plans/2025-11-28-skills-improvements-from-user-feedback.md:

Phase 1 (High-Impact, Low-Risk):

  • Configuration change verification in verification-before-completion
  • Mock-interface drift anti-pattern in testing-anti-patterns
  • Explicit file reading in code reviewer template

Phase 2 (Moderate Changes):

  • Process hygiene for E2E tests (kill stale processes before/after)
  • Self-reflection step for implementers
  • Skills reading requirement for test subagents

Phase 3 (Optimization):

  • Lean context option for pattern-based tasks
  • Allow implementer to fix self-identified issues

21.3 Missing Areas

Not present in the framework:

  1. No cost governance -- No budgets, caps, or cost-based decisions
  2. No formal CI/CD integration -- No GitHub Actions, no pre-commit hooks
  3. No multi-repo support -- Single repository assumption
  4. No team coordination -- Single developer workflow
  5. No dependency management -- No package manager integration
  6. No environment management -- No Docker, no virtual environments
  7. No telemetry or observability -- No metrics, no dashboards, no alerts
  8. No configuration management -- Skills are not configurable per-project
  9. No versioned skill contracts -- Skills evolve without formal versioning
  10. No rollback mechanism -- Beyond git revert, no workflow-level rollback

21.4 Unresolved Issues

From the improvement plan's open questions:

  1. Lean context vs full plan: Should lean context be default for pattern-based tasks?
  2. Self-reflection overhead: Will it slow down simple tasks?
  3. Process hygiene scope: In SDD or separate skill? Beyond E2E?
  4. Skills reading enforcement: Should ALL subagents read relevant skills?
  5. Prompt bloat risk: All improvements add more text to prompts

21.5 Assessment

The framework is maturing rapidly but has clear gaps in enterprise readiness, team collaboration, and operational observability. The improvement plan shows awareness of real-world failure modes and a disciplined approach to addressing them.

Confidence: High for documented items; Medium for roadmap predictions


22. What Should Be Borrowed/Adapted into Maestro and What Should Not

22.1 STRONGLY BORROW

1. Anti-Rationalization Engineering (Critical)

Superpowers' most distinctive contribution is its systematic approach to preventing agent rationalization:

  • Rationalization tables with pre-emptive counters
  • Red flags lists for self-checking
  • Gate functions as decision trees
  • "Violating the letter is violating the spirit" foundational principle
  • Persuasion principles (authority, commitment, scarcity) applied to skill design

Why borrow: Every harness faces the problem of agents bypassing constraints. Superpowers has invested the most iteration into solving this.

How to adapt: Build rationalization prevention into Maestro's prompt templates for every critical decision point. Don't rely on instructions alone -- test against agent behavior.

2. Two-Stage Code Review (Spec Compliance + Quality)

Separating spec compliance from code quality is a genuine insight:

  • Catches the common failure where "code is well-written but doesn't match what was requested"
  • Each review has a different reviewer mindset and checklist
  • Reviews are loops, not one-shot

Why borrow: Most harnesses do one review or none. Two-stage catches fundamentally different failure modes.

How to adapt: Implement as sequential review stages in Maestro's pipeline. Spec reviewer should be explicitly skeptical ("finished suspiciously quickly").
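The sequential-stage structure can be sketched as follows. This is a minimal illustration under assumed interfaces: `Review` and `Fixer` stand in for LLM reviewer and implementer calls, which are not specified by the source.

```python
from typing import Callable

Review = Callable[[str], list[str]]  # returns a list of issues; empty = pass
Fixer = Callable[[str, list[str]], str]

def two_stage_review(code: str, spec_review: Review, quality_review: Review,
                     fix: Fixer, max_rounds: int = 3) -> str:
    """Spec compliance must fully pass before code quality review begins;
    each stage is a loop (review -> fix -> re-review), not a one-shot check."""
    for review in (spec_review, quality_review):
        for _ in range(max_rounds):
            issues = review(code)
            if not issues:
                break
            code = fix(code, issues)
        else:
            raise RuntimeError("review loop did not converge")
    return code
```

The key design choice the sketch preserves: the quality stage never sees code that has not already passed the spec stage, so "well-written but wrong" work cannot slip through.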

3. TDD for Skill/Prompt Documentation

The RED-GREEN-REFACTOR cycle applied to prompt engineering:

  • Baseline test without skill (watch agent fail)
  • Write skill addressing specific failures
  • Close loopholes through iteration
  • Pressure scenarios with combined pressures

Why borrow: Prompt engineering is currently ad-hoc in most harnesses. This provides rigor.

How to adapt: Build a testing framework for Maestro's prompts that runs scenarios against agent behavior and measures compliance.

4. Task Context Provision Strategy

Controller reads plan once, extracts all tasks, provides full text to subagents:

  • No file reading overhead
  • Curated context per task
  • Questions surfaced before work begins

Why borrow: Reduces subagent token usage and increases focus.
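The controller-side strategy can be sketched as: parse the plan once, split it into per-task sections, and embed each task's full text in the subagent prompt. The `## Task` heading convention and both function names are assumptions for illustration.

```python
def extract_tasks(plan_markdown: str) -> list[str]:
    """Split a plan into task sections on '## Task' headings."""
    tasks, current = [], []
    for line in plan_markdown.splitlines():
        if line.startswith("## Task"):
            if current:
                tasks.append("\n".join(current).strip())
            current = [line]
        elif current:
            current.append(line)
    if current:
        tasks.append("\n".join(current).strip())
    return tasks

def build_subagent_prompt(task_text: str) -> str:
    """Embed the full task text so the subagent never reads plan files itself."""
    return ("You are implementing one task from an approved plan.\n"
            "Full task text follows; surface questions BEFORE starting work.\n\n"
            + task_text)
```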

5. Description Trap Awareness

Skill descriptions must be trigger-only, never workflow summaries. This discovery prevents a subtle but devastating failure mode.

Why borrow: Any system with skill/prompt routing must account for this.

6. Spec Reviewer Skepticism Pattern

The spec reviewer prompt template (/tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/spec-reviewer-prompt.md) opens with an explicit instruction to distrust the implementer:

"The implementer finished suspiciously quickly. Their report may be incomplete, inaccurate, or optimistic. You MUST verify everything independently."

This is a deliberate psychological priming technique. By framing the implementer's work as suspicious, the reviewer is far less likely to rubber-stamp the review. The reviewer is instructed to:

  • NOT trust the implementer's report
  • Read actual code and compare to requirements line by line
  • Report: missing requirements, extra/unneeded work, misunderstandings

Why borrow: Without explicit skepticism priming, AI reviewers default to approval (sycophancy bias). The framing of "suspiciously quickly" is simple to implement and dramatically changes review quality.
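Adapted for Maestro, the priming could be baked into a prompt builder. The opening lines quote the skill's own template; the surrounding scaffolding (`spec_reviewer_prompt`, the section headers) is a hypothetical sketch.

```python
def spec_reviewer_prompt(task_text: str, implementer_report: str) -> str:
    """Build a spec-review prompt that primes the reviewer to distrust the report."""
    return "\n".join([
        "The implementer finished suspiciously quickly. Their report may be",
        "incomplete, inaccurate, or optimistic. You MUST verify everything",
        "independently.",
        "",
        "Do NOT trust the report below. Read the actual code and compare it",
        "to the requirements line by line. Report: missing requirements,",
        "extra/unneeded work, misunderstandings.",
        "",
        "## Task requirements",
        task_text,
        "## Implementer's report (untrusted)",
        implementer_report,
    ])
```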

7. Verification Before Completion Pattern

IDENTIFY command -> RUN it -> READ output -> VERIFY claim -> THEN claim

This is simple, powerful, and prevents the most common agent failure: claiming success without evidence.

22.2 SELECTIVELY BORROW

8. DOT Flowcharts as Specifications

Flowcharts are harder to skip than prose. However, they add visual complexity that may not scale.

How to adapt: Use for critical decision points and process flows, not for everything.

9. Git Worktree Isolation

Good default for feature work isolation, but may be too opinionated for Maestro's broader use cases.

How to adapt: Support worktrees as one isolation strategy among several (Docker, venvs, etc.).

10. Brainstorming-First Mandate

Forcing brainstorming before implementation is valuable for preventing premature coding, but may be too heavy for small changes.

How to adapt: Scale the design phase to the change size. Small changes might skip full brainstorming.

11. Persuasion Principles for Prompt Design

Academic foundation (Cialdini 2021, Meincke et al. 2025) for why certain prompt patterns work. Useful reference but don't over-apply.

22.3 DO NOT BORROW

12. Agent-as-Orchestrator Model

Superpowers makes the AI agent the orchestrator, guided only by markdown instructions. This is elegant but fundamentally limits:

  • Enforcement (all rules are advisory)
  • Observability (no runtime metrics)
  • Recovery (no checkpoint/restore)
  • Scalability (single agent bottleneck)
  • Reproducibility (agent behavior varies)

Why not: Maestro should have a runtime orchestrator that provides enforcement, observability, and recovery.

13. Zero-Persistence Design

No state saved between sessions, no progress tracking, no checkpoint mechanism.

Why not: Maestro needs persistent state for long-running workflows, team coordination, and recovery.

14. Advisory-Only Enforcement

All quality gates enforced by skill text, not runtime code.

Why not: Critical guardrails (cost limits, security scanning, main branch protection) should have runtime enforcement, not just prompts.

15. Single-Platform Tool References

Skills reference Claude-specific tools (Task, TodoWrite, Skill) with mappings for other platforms.

Why not: Maestro should abstract tool references to be platform-agnostic from the start.

16. Windows Polyglot Wrapper Pattern

The cmd/bash polyglot is clever but fragile and has caused numerous issues (#518, #504, #491, #487, #466, #440, #331, #285, #243).

Why not: Use a proper cross-platform runtime (Node.js, Deno) instead of bash.

22.4 Summary Matrix

| # | Feature | Verdict | Priority | Effort |
|---|---------|---------|----------|--------|
| 1 | Anti-rationalization engineering | BORROW | Critical | Medium |
| 2 | Two-stage code review | BORROW | High | Low |
| 3 | TDD for prompts | BORROW | High | High |
| 4 | Task context provision | BORROW | High | Low |
| 5 | Description trap awareness | BORROW | High | Low |
| 6 | Spec reviewer skepticism pattern | BORROW | High | Low |
| 7 | Verification before completion | BORROW | High | Low |
| 8 | DOT flowcharts | SELECTIVE | Medium | Low |
| 9 | Git worktree isolation | SELECTIVE | Medium | Low |
| 10 | Brainstorming-first | SELECTIVE | Medium | Low |
| 11 | Persuasion principles | SELECTIVE | Low | Low |
| 12 | Agent-as-orchestrator | DO NOT | - | - |
| 13 | Zero-persistence | DO NOT | - | - |
| 14 | Advisory-only enforcement | DO NOT | - | - |
| 15 | Single-platform tool refs | DO NOT | - | - |
| 16 | Polyglot wrapper | DO NOT | - | - |

23. Cross-Links

Related Sections in Other Analysis Reports

everything-claude-code-deep-analysis.md:

  • Section: "Skills System" -- How Claude Code's native skill system works (the platform Superpowers targets)
  • Section: "Hooks System" -- How SessionStart hooks inject context
  • Section: "Task Tool" -- How subagent dispatch works
  • Section: "TodoWrite" -- How task tracking works
  • Section: "Plugin Marketplace" -- How plugins are distributed

agent-orchestrator-deep-analysis.md:

  • Section: "Orchestration Patterns" -- Compare agent-as-orchestrator (Superpowers) vs runtime orchestrator
  • Section: "Subagent Management" -- Compare dispatch, monitoring, and review patterns
  • Section: "Quality Gates" -- Compare enforcement mechanisms (advisory vs runtime)
  • Section: "Context Management" -- Compare progressive disclosure strategies
  • Section: "Multi-Platform Support" -- Compare abstraction strategies

maestro-deep-analysis.md:

  • Section: "Design Philosophy" -- Compare with Superpowers' skill-based approach
  • Section: "Workflow Pipeline" -- Compare brainstorm->plan->execute->review flow
  • Section: "Code Review" -- Compare single vs two-stage review
  • Section: "Isolation Model" -- Compare worktrees vs other isolation strategies
  • Section: "Cost Governance" -- What Maestro has that Superpowers lacks
  • Section: "Security" -- What Maestro has that Superpowers lacks

harness-consensus-report.md:

  • Section: "Common Patterns" -- Skills-based prompt injection pattern
  • Section: "Anti-Rationalization" -- Superpowers as category leader
  • Section: "Quality Gate Enforcement" -- Advisory vs runtime spectrum
  • Section: "Platform Compatibility" -- Multi-platform support comparison
  • Section: "Testing Approaches" -- TDD for prompts as novel methodology

final-harness-gap-report.md:

  • Section: "What No Harness Does Well" -- Enterprise features (security, compliance, observability)
  • Section: "Novel Contributions" -- Anti-rationalization engineering, two-stage review, TDD for docs
  • Section: "Recommended Architecture" -- Runtime orchestrator + skill-based prompts (best of both)
  • Section: "Priority Features for Maestro" -- Feature prioritization based on all harness analyses

Appendix A: File Index with Key Line References

| File | Key Lines | Purpose |
|------|-----------|---------|
| /tmp/ai-harness-repos/superpowers/README.md | 1-158 | Project overview, installation, workflow description |
| /tmp/ai-harness-repos/superpowers/RELEASE-NOTES.md | 1-802 | Complete version history (v1.0 through v4.3.1) |
| /tmp/ai-harness-repos/superpowers/.claude-plugin/plugin.json | 1-13 | Plugin manifest (v4.3.1) |
| /tmp/ai-harness-repos/superpowers/.cursor-plugin/plugin.json | 1-18 | Cursor plugin manifest with skills/agents/commands/hooks paths |
| /tmp/ai-harness-repos/superpowers/.opencode/plugins/superpowers.js | 1-95 | OpenCode plugin (system prompt transform injection) |
| /tmp/ai-harness-repos/superpowers/.codex/INSTALL.md | 1-67 | Codex installation (clone + symlink) |
| /tmp/ai-harness-repos/superpowers/hooks/hooks.json | 1-16 | Hook configuration (SessionStart, sync) |
| /tmp/ai-harness-repos/superpowers/hooks/session-start | 1-51 | Bootstrap injection script |
| /tmp/ai-harness-repos/superpowers/hooks/run-hook.cmd | 1-46 | Cross-platform polyglot wrapper |
| /tmp/ai-harness-repos/superpowers/lib/skills-core.js | 1-208 | Shared skill discovery/parsing module |
| /tmp/ai-harness-repos/superpowers/skills/using-superpowers/SKILL.md | 1-96 | Meta-skill: mandatory skill usage protocol |
| /tmp/ai-harness-repos/superpowers/skills/brainstorming/SKILL.md | 1-97 | Design exploration before implementation |
| /tmp/ai-harness-repos/superpowers/skills/writing-plans/SKILL.md | 1-117 | Implementation plan creation |
| /tmp/ai-harness-repos/superpowers/skills/executing-plans/SKILL.md | 1-85 | Batch execution with checkpoints |
| /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/SKILL.md | 1-242 | Fresh subagent per task + two-stage review |
| /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/implementer-prompt.md | 1-79 | Implementer subagent prompt template |
| /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/spec-reviewer-prompt.md | 1-62 | Spec compliance reviewer template |
| /tmp/ai-harness-repos/superpowers/skills/subagent-driven-development/code-quality-reviewer-prompt.md | 1-20 | Code quality reviewer template |
| /tmp/ai-harness-repos/superpowers/skills/test-driven-development/SKILL.md | 1-371 | RED-GREEN-REFACTOR cycle enforcement |
| /tmp/ai-harness-repos/superpowers/skills/test-driven-development/testing-anti-patterns.md | 1-300 | 5 testing anti-patterns with gate functions |
| /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/SKILL.md | 1-297 | 4-phase root cause investigation |
| /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/root-cause-tracing.md | 1-170 | Backward call chain tracing technique |
| /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/defense-in-depth.md | 1-122 | 4-layer validation strategy |
| /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/condition-based-waiting.md | 1-116 | Replace timeouts with condition polling |
| /tmp/ai-harness-repos/superpowers/skills/systematic-debugging/find-polluter.sh | 1-63 | Test pollution bisection script |
| /tmp/ai-harness-repos/superpowers/skills/dispatching-parallel-agents/SKILL.md | 1-181 | Concurrent subagent dispatch pattern |
| /tmp/ai-harness-repos/superpowers/skills/using-git-worktrees/SKILL.md | 1-218 | Isolated workspace creation |
| /tmp/ai-harness-repos/superpowers/skills/finishing-a-development-branch/SKILL.md | 1-201 | Merge/PR/Keep/Discard decision workflow |
| /tmp/ai-harness-repos/superpowers/skills/requesting-code-review/SKILL.md | 1-106 | Pre-review dispatch pattern |
| /tmp/ai-harness-repos/superpowers/skills/requesting-code-review/code-reviewer.md | 1-147 | Code review agent template |
| /tmp/ai-harness-repos/superpowers/skills/receiving-code-review/SKILL.md | 1-214 | Anti-sycophancy review response protocol |
| /tmp/ai-harness-repos/superpowers/skills/verification-before-completion/SKILL.md | 1-140 | Evidence-before-claims enforcement |
| /tmp/ai-harness-repos/superpowers/skills/writing-skills/SKILL.md | 1-656 | TDD for documentation methodology |
| /tmp/ai-harness-repos/superpowers/skills/writing-skills/testing-skills-with-subagents.md | 1-385 | Pressure testing methodology |
| /tmp/ai-harness-repos/superpowers/skills/writing-skills/persuasion-principles.md | 1-188 | Cialdini-based prompt design principles |
| /tmp/ai-harness-repos/superpowers/skills/writing-skills/anthropic-best-practices.md | 1-1151 | Anthropic's official skill authoring guide |
| /tmp/ai-harness-repos/superpowers/skills/writing-skills/render-graphs.js | 1-168 | DOT to SVG rendering tool |
| /tmp/ai-harness-repos/superpowers/agents/code-reviewer.md | 1-49 | Code reviewer agent definition |
| /tmp/ai-harness-repos/superpowers/commands/brainstorm.md | 1-7 | User-only slash command redirect |
| /tmp/ai-harness-repos/superpowers/commands/write-plan.md | 1-7 | User-only slash command redirect |
| /tmp/ai-harness-repos/superpowers/commands/execute-plan.md | 1-7 | User-only slash command redirect |
| /tmp/ai-harness-repos/superpowers/tests/claude-code/test-helpers.sh | 1-202 | Test assertion framework |
| /tmp/ai-harness-repos/superpowers/tests/claude-code/run-skill-tests.sh | 1-188 | Test runner with timeout/verbose/integration modes |
| /tmp/ai-harness-repos/superpowers/tests/claude-code/test-subagent-driven-development.sh | 1-166 | 9 tests for SDD skill compliance |
| /tmp/ai-harness-repos/superpowers/tests/claude-code/analyze-token-usage.py | 1-168 | JSONL token usage analyzer |
| /tmp/ai-harness-repos/superpowers/tests/skill-triggering/run-test.sh | 1-89 | Implicit skill triggering test |
| /tmp/ai-harness-repos/superpowers/tests/explicit-skill-requests/run-test.sh | 1-137 | Explicit skill request verification |
| /tmp/ai-harness-repos/superpowers/tests/subagent-driven-dev/run-test.sh | 1-107 | End-to-end SDD workflow test |
| /tmp/ai-harness-repos/superpowers/tests/subagent-driven-dev/go-fractals/plan.md | 1-173 | 10-task Go CLI test plan |
| /tmp/ai-harness-repos/superpowers/docs/testing.md | 1-304 | Testing guide with session transcript format |
| /tmp/ai-harness-repos/superpowers/docs/plans/2025-11-28-skills-improvements-from-user-feedback.md | 1-712 | 8 real-world failure reports with proposed fixes |
| /tmp/ai-harness-repos/superpowers/docs/windows/polyglot-hooks.md | 1-213 | Cross-platform hook documentation |

Appendix B: Confidence Scores Summary

| Analysis Area | Confidence | Basis |
|---------------|------------|-------|
| Design philosophy | High | Extensive documentation, consistent across all files |
| Core architecture | High | Complete codebase read, all files analyzed |
| Harness workflow | High | Every skill read, workflow tested end-to-end |
| Subagent orchestration | High | Detailed prompt templates and process flows |
| Parallelization strategy | High | Explicitly documented and constrained |
| Isolation model | High | Worktree skill fully documented |
| Human-in-the-loop | High | Every approval gate identified in skills |
| Context handling | High | Token budgets and cross-reference strategy documented |
| Session lifecycle | High | Hook code and configuration reviewed |
| Code quality gates | High | Every quality skill read and analyzed |
| Security mechanisms | High (absence) | Confirmed no security features present |
| Hooks and automation | High | All hook code and config reviewed |
| CLI/UX | High | Installation and test infrastructure reviewed |
| Cost visibility | High | Analysis tool and cost documentation reviewed |
| Tooling/dependencies | High | All executable code inventoried |
| External integrations | High | All platform adapters reviewed |
| Operational assumptions | High | Derived from skill requirements |
| Failure modes | High | Documented in release notes and plans |
| Governance | High | Every guardrail mechanism identified |
| Roadmap signals | Medium | Based on improvement plans, may not be complete |
| Maestro recommendations | Medium-High | Based on analysis, but Maestro requirements not fully known |

Appendix C: Quantitative Summary

| Metric | Value |
|--------|-------|
| Total files (non-.git) | ~90 |
| Total executable code | ~800 lines across 7 files |
| Total skill documents | 14 skills |
| Total supporting documents | ~15 files |
| Total test files | ~25 files |
| Lines of markdown in skills | ~4,500 |
| Release versions analyzed | v1.0 through v4.3.1 |
| Platforms supported | 4 (Claude Code, Cursor, Codex, OpenCode) |
| Documented failure modes | 9+ |
| Anti-rationalization entries | 40+ across all skills |
| Subagent prompt templates | 3 (implementer, spec reviewer, quality reviewer) |
| Slash commands | 3 (brainstorm, write-plan, execute-plan) |
| Agents defined | 1 (code-reviewer) |
| Token budget for SDD workflow | ~$4.67 per 2-task plan (documented test run) |
| Rationalization prevention entries | 40+ across all skills |
| Pressure test types documented | 7 (time, sunk cost, authority, economic, exhaustion, social, pragmatic) |
| Unique red flag patterns | 12 in using-superpowers, 12 in TDD, 8 in verification, others in each skill |
| Evolution timespan | v1.0 to v4.3.1 (~16 months of active development) |

Report generated by systematic analysis of all 90+ non-git files in the obra/superpowers repository at version 4.3.1.
