JacobFV/agi-binary-architecture.md

    AGI SDK Binary Architecture - Claude Code Pattern
AGI SDK Binary Architecture

This document describes the AGI driver binary and SDK architecture. The driver is a self-contained agent that captures screenshots, reasons with Claude, and executes actions autonomously. SDKs are thin event wrappers.
Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                        agi-api-driver                                    │
│                    (locally: ~/Code/agi-api-driver)                      │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │  SELF-CONTAINED AGENT DRIVER BINARY                               │  │
│  │                                                                   │  │
│  │  ┌─────────────┐  ┌──────────────┐  ┌────────────────────────┐  │  │
│  │  │  Executor    │  │  Agent/LLM   │  │  Environment           │  │  │
│  │  │  • State     │  │  • Claude    │  │  • Screenshot capture  │  │  │
│  │  │    machine   │  │    API       │  │  • Action execution    │  │  │
│  │  │  • Event     │  │  • Tools    │  │  • Screen size detect  │  │  │
│  │  │    emission  │  │  • Prompts  │  │  • DPI/scale factor    │  │  │
│  │  └─────────────┘  └──────────────┘  └────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                            ↓ CI compiles & publishes                    │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │  GitHub Releases: agi-driver-v1.0.0-{platform}                    │  │
│  │  darwin-arm64 | darwin-x64 | linux-x64 | windows-x64              │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
            ┌───────────────────────┼───────────────────────┐
            ↓                       ↓                       ↓
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│   agi-python     │  │    agi-node      │  │   agi-csharp     │
│                  │  │                  │  │                  │
│  THIN WRAPPER    │  │  THIN WRAPPER    │  │  THIN WRAPPER    │
│  • Spawn binary  │  │  • Spawn binary  │  │  • Spawn binary  │
│  • Event hooks   │  │  • Event hooks   │  │  • Event hooks   │
│  • Send commands │  │  • Send commands │  │  • Send commands │
│  (no platform    │  │  (no platform    │  │  (no platform    │
│   code needed)   │  │   code needed)   │  │   code needed)   │
└──────────────────┘  └──────────────────┘  └──────────────────┘

Two Operating Modes

Local Mode (Self-Contained) - mode: "local"

The driver runs autonomously on the local machine:
SDK sends:  {"command":"start","goal":"Open calculator","mode":"local"}
Driver:     1. Captures screenshot via Pillow/scrot
            2. Calls Claude API with screenshot + goal
            3. Emits thinking/action events (informational)
            4. Executes actions locally (JXA/PowerShell/xdotool)
            5. Waits 0.5s for screen to settle
            6. Captures next screenshot
            7. Repeat until finished/error
SDK:        Just listens to events. No platform code needed.
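The thin-wrapper role described above can be sketched in a few lines of Python. This is only an illustration, not the SDK's implementation: `run_driver` and `FAKE_DRIVER` are hypothetical names, and a small inline script stands in for the real `agi-driver` binary so the example runs without it. The start command and event shapes follow the protocol shown in this document.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the driver binary so the sketch runs anywhere.
# It emits "ready", reads one start command, then reports completion.
FAKE_DRIVER = """
import json, sys
print(json.dumps({"event": "ready", "version": "0.1.0", "protocol": "jsonl", "step": 0}), flush=True)
cmd = json.loads(sys.stdin.readline())
print(json.dumps({"event": "thinking", "text": "Goal: " + cmd["goal"], "step": 1}), flush=True)
print(json.dumps({"event": "finished", "reason": "completed", "summary": "done", "success": True, "step": 2}), flush=True)
"""


def run_driver(goal: str, argv: list[str]) -> list[dict]:
    """Spawn the driver, send a local-mode start command, collect events."""
    events: list[dict] = []
    with subprocess.Popen(
        argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    ) as proc:
        start = {"command": "start", "goal": goal, "mode": "local"}
        proc.stdin.write(json.dumps(start) + "\n")
        proc.stdin.flush()
        # One JSON object per stdout line; stop at a terminal event.
        for line in proc.stdout:
            event = json.loads(line)
            events.append(event)
            if event["event"] in ("finished", "error"):
                break
    return events


events = run_driver("Open calculator", [sys.executable, "-c", FAKE_DRIVER])
```

With the real binary, `argv` would presumably be the path to the downloaded `agi-driver` executable; the wrapper itself contains no platform-specific code.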

Legacy Mode (SDK-Driven) - mode: ""

The SDK manages screenshots and action execution (existing behavior):
SDK sends:  {"command":"start","goal":"...","screenshot":"base64...","screen_width":1920,"screen_height":1080}
Driver:     1. Receives screenshot from SDK
            2. Calls Claude API
            3. Emits action events
SDK:        4. Executes actions
            5. Captures screenshot
            6. Sends screenshot command
            7. Repeat

Architecture: Self-Contained Driver Binary

Key Changes from Previous Design


| Aspect | Before | After |
|---|---|---|
| Screenshot capture | SDK responsibility | Driver captures locally |
| Action execution | SDK responsibility | Driver executes locally |
| Screen detection | SDK responsibility | Driver detects automatically |
| SDK executor code | Required (600+ lines per SDK) | Deprecated (kept for backward compat) |
| SDK role | I/O adapter + executor | Pure event wrapper |
| Platform code | Duplicated in 3 SDKs | Single implementation in driver |

Environment Module

The driver includes a platform-aware environment module:
agi_driver/environment/
├── __init__.py     # Factory: create_environment("local")
├── base.py         # Abstract: BaseEnvironment
└── local.py        # LocalEnvironment - controls local machine

BaseEnvironment interface:
from abc import ABC

class BaseEnvironment(ABC):
    async def initialize(self) -> None: ...
    async def capture_screenshot(self) -> tuple[str, int, int]: ...  # (base64, width, height)
    async def execute_action(self, action: dict) -> bool: ...
    async def get_screen_size(self) -> tuple[int, int]: ...
    async def cleanup(self) -> None: ...

LocalEnvironment handles:

Screenshot: PIL.ImageGrab.grab() (macOS/Windows), scrot (Linux)
Clicks: JXA/CGEvent (macOS), PowerShell/user32.dll (Windows), xdotool (Linux)
Typing: JXA with JSON escaping (macOS), Base64+SendKeys (Windows), xdotool (Linux)
Keys: AppleScript key codes (macOS), SendKeys format (Windows), xdotool (Linux)
Scroll/Drag: Platform-specific implementations
DPI/Scale: NSScreen (macOS), Registry (Windows), GDK_SCALE (Linux)
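To make the interface concrete, here is a minimal sketch of an environment that satisfies it without touching a real screen. `FakeEnvironment`, the `"fake"` factory key, and the set of accepted action type names (other than `click`, which appears in the protocol examples) are assumptions made for illustration; the real factory maps `"local"` to `LocalEnvironment`.

```python
import asyncio
import base64
from abc import ABC, abstractmethod


class BaseEnvironment(ABC):
    """Interface from this section, restated so the sketch runs on its own."""

    @abstractmethod
    async def initialize(self) -> None: ...
    @abstractmethod
    async def capture_screenshot(self) -> tuple[str, int, int]: ...
    @abstractmethod
    async def execute_action(self, action: dict) -> bool: ...
    @abstractmethod
    async def get_screen_size(self) -> tuple[int, int]: ...
    @abstractmethod
    async def cleanup(self) -> None: ...


class FakeEnvironment(BaseEnvironment):
    """Hypothetical in-memory environment: records actions, touches no screen."""

    def __init__(self, width: int = 1920, height: int = 1080) -> None:
        self.width, self.height = width, height
        self.actions: list[dict] = []

    async def initialize(self) -> None:
        self.actions.clear()

    async def capture_screenshot(self) -> tuple[str, int, int]:
        # Placeholder bytes instead of a real image.
        data = base64.b64encode(b"fake-image").decode("ascii")
        return data, self.width, self.height

    async def execute_action(self, action: dict) -> bool:
        # Assumed action type names; only "click" is confirmed by the protocol.
        if action.get("type") not in {"click", "type", "key", "scroll", "drag"}:
            return False
        self.actions.append(action)
        return True

    async def get_screen_size(self) -> tuple[int, int]:
        return self.width, self.height

    async def cleanup(self) -> None:
        pass


def create_environment(kind: str) -> BaseEnvironment:
    # The "fake" key is invented for this sketch.
    if kind == "fake":
        return FakeEnvironment()
    raise ValueError(f"unknown environment: {kind!r}")


async def demo() -> tuple[bool, bool, int]:
    env = create_environment("fake")
    await env.initialize()
    ok = await env.execute_action({"type": "click", "x": 150, "y": 200})
    bad = await env.execute_action({"type": "teleport"})
    _, width, _ = await env.capture_screenshot()
    await env.cleanup()
    return ok, bad, width


result = asyncio.run(demo())
```

A stand-in like this is also a plausible way to exercise the executor loop in tests, since the loop only depends on the five interface methods.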

Event-Driven Protocol

Binary -> SDK (stdout events)

{"event":"ready","version":"0.1.0","protocol":"jsonl","step":0}
{"event":"state_change","state":"running","step":0}
{"event":"screenshot_captured","width":3024,"height":1964,"step":0}
{"event":"thinking","text":"I see the desktop with a dock at the bottom...","step":1}
{"event":"action","action":{"type":"click","x":150,"y":200},"step":1}
{"event":"screenshot_captured","width":3024,"height":1964,"step":1}
{"event":"confirm","action":{},"reason":"Delete this file?","step":2}
{"event":"ask_question","question":"What email should I use?","question_id":"q1","step":3}
{"event":"finished","reason":"completed","summary":"Opened calculator and computed 2+2=4","success":true,"step":10}
{"event":"error","message":"Model inference failed","code":"step_error","recoverable":true,"step":5}

New event: screenshot_captured - Emitted in local mode each time the driver captures a screenshot. It is a lightweight notification (no image data) that tells SDKs a step boundary occurred.

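On the SDK side, consuming this stream amounts to parsing one JSON object per line and routing it by its `event` field. The sketch below shows one way to do that; `EventDispatcher` is a hypothetical helper, and treating a recoverable `error` as non-terminal is an assumption rather than documented driver behavior.

```python
import json
from typing import Callable


class EventDispatcher:
    """Hypothetical helper: routes parsed JSON Lines events to handlers."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = {}

    def on(self, event_name: str, handler: Callable[[dict], None]) -> None:
        self._handlers.setdefault(event_name, []).append(handler)

    def feed(self, line: str) -> bool:
        """Parse one stdout line; return True once the run has ended."""
        line = line.strip()
        if not line:
            return False
        event = json.loads(line)
        for handler in self._handlers.get(event["event"], []):
            handler(event)
        if event["event"] == "finished":
            return True
        # Assumption: only non-recoverable errors end the run.
        return event["event"] == "error" and not event.get("recoverable", False)


dispatcher = EventDispatcher()
seen: list[str] = []
dispatcher.on("thinking", lambda e: seen.append(e["text"]))
dispatcher.on("action", lambda e: seen.append(e["action"]["type"]))

lines = [
    '{"event":"thinking","text":"I see the desktop","step":1}',
    '{"event":"action","action":{"type":"click","x":150,"y":200},"step":1}',
    '{"event":"error","message":"Model inference failed","code":"step_error","recoverable":true,"step":5}',
    '{"event":"finished","reason":"completed","summary":"ok","success":true,"step":10}',
]
ended = [dispatcher.feed(line) for line in lines]
```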
SDK -> Binary (stdin commands)

{"command":"start","session_id":"sess_abc","goal":"Open calculator","mode":"local"}
{"command":"start","session_id":"sess_def","goal":"Click login","screenshot":"base64...","screen_width":1920,"screen_height":1080}
{"command":"screenshot","data":"base64...","screen_width":1920,"screen_height":1080}
{"command":"pause"}
{"command":"resume"}
{"command":"stop","reason":"User cancelled"}
{"command":"confirm","approved":true,"message":""}
{"command":"answer","text":"user@example.com","question_id":"q1"}

StartCommand changes:

mode field added: "local" for autonomous operation, "" (empty string) for legacy
screenshot, screen_width, screen_height are ignored in local mode
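The two forms of the start command can be produced by a single helper. The following is a sketch under stated assumptions: `build_start_command` is a hypothetical function, and rejecting a legacy start that lacks screenshot fields is a choice made here, not something this document says the driver enforces.

```python
import json
from typing import Optional


def build_start_command(
    goal: str,
    session_id: str,
    mode: str = "",
    screenshot: Optional[str] = None,
    screen_width: Optional[int] = None,
    screen_height: Optional[int] = None,
) -> str:
    """Serialize a start command as one JSON Lines record."""
    command: dict = {"command": "start", "session_id": session_id, "goal": goal}
    if mode == "local":
        # Screenshot fields are ignored in local mode, so they are left out.
        command["mode"] = "local"
    elif mode == "":
        if screenshot is None or screen_width is None or screen_height is None:
            # Assumption: validate on the SDK side before sending.
            raise ValueError(
                "legacy mode needs screenshot, screen_width and screen_height"
            )
        command.update(
            screenshot=screenshot,
            screen_width=screen_width,
            screen_height=screen_height,
        )
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return json.dumps(command, separators=(",", ":"))


local_line = build_start_command("Open calculator", "sess_abc", mode="local")
legacy_line = build_start_command(
    "Click login",
    "sess_def",
    screenshot="base64...",
    screen_width=1920,
    screen_height=1080,
)
```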

State Machine

                                    ┌─────────────────────┐
                                    │                     │
                    start           ▼                     │
    ┌───────┐ ─────────────> ┌─────────────┐             │
    │ IDLE  │                │   RUNNING   │<────────────┤
    └───────┘                └─────────────┘             │
                                    │                     │
            ┌───────────────────────┼───────────────────┐ │
            │                       │                   │ │
            ▼                       ▼                   ▼ │
    ┌──────────────┐    ┌───────────────────┐   ┌────────────────┐
    │    PAUSED    │    │ WAITING_CONFIRM   │   │ WAITING_ANSWER │
    │              │    │                   │   │                │
    │  resume()    │    │   confirm(bool)   │   │   answer(str)  │
    └──────┬───────┘    └─────────┬─────────┘   └───────┬────────┘
           │                      │                     │
           └──────────────────────┴─────────────────────┘

    ANY STATE ──── stop() ────> STOPPED
    ANY STATE ──── error ─────> ERROR
    RUNNING ────── finish ───> FINISHED
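The diagram can be encoded as a transition table. This is a minimal sketch, not the contents of `state_machine.py`: the trigger names `need_confirm` and `need_answer`, the lowercase enum values, and the rule that terminal states accept no further transitions are assumptions introduced for illustration.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    RUNNING = "running"
    PAUSED = "paused"
    WAITING_CONFIRM = "waiting_confirm"
    WAITING_ANSWER = "waiting_answer"
    STOPPED = "stopped"
    ERROR = "error"
    FINISHED = "finished"


# Assumption: once a run has ended, nothing moves it again.
TERMINAL = {State.STOPPED, State.ERROR, State.FINISHED}

TRANSITIONS = {
    (State.IDLE, "start"): State.RUNNING,
    (State.RUNNING, "pause"): State.PAUSED,
    (State.RUNNING, "need_confirm"): State.WAITING_CONFIRM,
    (State.RUNNING, "need_answer"): State.WAITING_ANSWER,
    (State.RUNNING, "finish"): State.FINISHED,
    (State.PAUSED, "resume"): State.RUNNING,
    (State.WAITING_CONFIRM, "confirm"): State.RUNNING,
    (State.WAITING_ANSWER, "answer"): State.RUNNING,
}


def next_state(state: State, trigger: str) -> State:
    """Return the state reached from `state` on `trigger`, or raise ValueError."""
    if state in TERMINAL:
        raise ValueError(f"{state.name} is terminal")
    if trigger == "stop":  # ANY STATE -> STOPPED
        return State.STOPPED
    if trigger == "error":  # ANY STATE -> ERROR
        return State.ERROR
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(f"{trigger!r} is not valid in {state.name}") from None


state = State.IDLE
for trigger in ("start", "need_confirm", "confirm", "pause", "resume", "finish"):
    state = next_state(state, trigger)
```

Keeping the table as data makes it easy to check that every arrow in the diagram has exactly one entry.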

Repository Structure

agi-api-driver/

agi-api-driver/
├── src/agi_driver/
│   ├── __init__.py              # Package exports, version
│   ├── __main__.py              # CLI entry point
│   ├── executor.py              # Main execution loop (local + legacy modes)
│   ├── state_machine.py         # State enum and transitions
│   ├── agent/
│   │   ├── base.py              # BaseDriverAgent, AgentAction, StepResult
│   │   ├── desktop_agent.py     # DesktopAgent (Claude API integration)
│   │   ├── prompt.py            # System prompts
│   │   └── tools.py             # Desktop automation tool definitions
│   ├── environment/             # NEW: Self-contained environment
│   │   ├── __init__.py          # Factory: create_environment()
│   │   ├── base.py              # Abstract BaseEnvironment
│   │   └── local.py             # LocalEnvironment (screenshot + actions)
│   ├── llm/
│   │   └── anthropic.py         # Claude API client with retry
│   └── protocol/
│       ├── commands.py          # 7 command types (start now has mode)
│       ├── events.py            # 9 event types (+ screenshot_captured)
│       └── jsonl.py             # JSON Lines I/O
├── .github/workflows/
│   └── build-agi-driver.yml     # Cross-platform Nuitka build
└── pyproject.toml

SDK Integration

Python SDK - Local Mode (New, Simplified)

import asyncio

from agi import AgentDriver, DriverOptions


async def main() -> None:
    driver = AgentDriver(DriverOptions(mode="local"))

    driver.on_thinking(lambda t: print(f"Thinking: {t}"))
    driver.on_action(lambda a: print(f"Action: {a.type}"))  # Informational only

    # start() is awaited, so it has to run inside a coroutine
    result = await driver.start(goal="Open calculator and compute 2+2")
    print(f"Done: {result.summary}")


asyncio.run(main())

Node.js SDK - Local Mode

import { AgentDriver } from '@agi/sdk';

const driver = new AgentDriver({ mode: 'local' });

driver.on('thinking', (text) => console.log('Thinking:', text));
driver.on('action', (action) => console.log('Action:', action.type));

const result = await driver.start('Open calculator and compute 2+2');
console.log('Done:', result.summary);

C# SDK - Local Mode

using Agi.Driver;

var driver = new AgentDriver(new DriverOptions { Mode = "local" });

driver.OnThinking += async (text) => Console.WriteLine($"Thinking: {text}");
driver.OnAction += async (action) => Console.WriteLine($"Action: {action.Type}");

var result = await driver.StartAsync(goal: "Open calculator and compute 2+2");
Console.WriteLine($"Done: {result.Summary}");

Build & Distribution

Nuitka Compilation

cd src
python -m nuitka \
  --standalone --onefile \
  --output-filename=agi-driver \
  --include-package=agi_driver \
  --include-package=agi_driver.environment \
  --include-package=anthropic \
  --include-package=PIL \
  --include-package=pydantic \
  --lto=yes \
  --python-flag=no_site \
  -m agi_driver

Platform Matrix

| OS | Target | Binary |
|---|---|---|
| macOS 14 | darwin-arm64 | `agi-driver-darwin-arm64` |
| macOS 13 | darwin-x64 | `agi-driver-darwin-x64` |
| Ubuntu 22.04 | linux-x64 | `agi-driver-linux-x64` |
| Windows latest | windows-x64 | `agi-driver-windows-x64.exe` |

Dependencies Bundled in Binary


anthropic - Claude API client
Pillow - Screenshot capture
pydantic - Data validation

Autonomous Loop (Local Mode) Detail

┌────────────────────────────────────────────────────────────────┐
│                    AUTONOMOUS LOOP                              │
│                                                                │
│  1. Initialize LocalEnvironment                                │
│     - Detect screen size (system_profiler / powershell / xdpy) │
│     - Cache DPI scale factor                                   │
│                                                                │
│  2. Capture initial screenshot (PIL.ImageGrab / scrot)         │
│     └── Emit screenshot_captured event                         │
│                                                                │
│  3. Call agent.step(screenshot, goal)                           │
│     ├── Prepare image (resize to 1366x768 canvas, JPEG 85)    │
│     ├── Build messages (goal + history + screenshot)           │
│     ├── Call Claude API with desktop tools                     │
│     └── Process response (thinking, tool uses)                 │
│                                                                │
│  4. Emit thinking event                                        │
│                                                                │
│  5. Check control flow:                                        │
│     ├── finish → Emit finished, exit loop                      │
│     ├── confirm → Emit confirm, wait for stdin response        │
│     ├── ask_question → Emit ask_question, wait for stdin       │
│     └── actions → Continue to step 6                           │
│                                                                │
│  6. Execute actions on environment                             │
│     ├── Emit action event (informational)                      │
│     └── Call environment.execute_action()                      │
│                                                                │
│  7. Wait 0.5s settle delay                                     │
│                                                                │
│  8. Capture next screenshot                                    │
│     └── Emit screenshot_captured event                         │
│                                                                │
│  9. Go to step 3                                               │
│                                                                │
│  Background: stdin reader queues pause/stop/confirm/answer     │
│  commands for processing between steps                         │
└────────────────────────────────────────────────────────────────┘
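The numbered steps above can be sketched as a short asyncio loop. This is a simplified illustration, not the code in `executor.py`: `ScriptedAgent` and `RecordingEnvironment` are hypothetical stand-ins for `DesktopAgent` and `LocalEnvironment`, the shape of the step result dictionary is assumed, and the confirm, ask_question, and background stdin handling are left out to keep the example small.

```python
import asyncio
from typing import Callable


class ScriptedAgent:
    """Hypothetical stand-in for DesktopAgent: replays canned step results."""

    def __init__(self, results: list[dict]) -> None:
        self._results = iter(results)

    async def step(self, screenshot: str, goal: str) -> dict:
        return next(self._results)


class RecordingEnvironment:
    """Hypothetical environment that records actions instead of executing them."""

    def __init__(self) -> None:
        self.executed: list[dict] = []

    async def initialize(self) -> None:
        pass

    async def capture_screenshot(self) -> tuple[str, int, int]:
        return "", 1920, 1080  # empty placeholder instead of base64 image data

    async def execute_action(self, action: dict) -> bool:
        self.executed.append(action)
        return True


async def autonomous_loop(
    goal: str, agent, env, emit: Callable[[dict], None], settle_delay: float = 0.5
) -> None:
    async def capture(step: int) -> str:
        image, width, height = await env.capture_screenshot()
        emit({"event": "screenshot_captured", "width": width, "height": height, "step": step})
        return image

    await env.initialize()                                    # step 1
    step = 0
    screenshot = await capture(step)                          # step 2
    while True:
        step += 1
        result = await agent.step(screenshot, goal)           # step 3
        emit({"event": "thinking", "text": result.get("thinking", ""), "step": step})  # step 4
        if "finish" in result:                                # step 5 (finish only)
            emit({"event": "finished", "reason": "completed",
                  "summary": result["finish"], "success": True, "step": step})
            return
        for action in result.get("actions", []):              # step 6
            emit({"event": "action", "action": action, "step": step})
            await env.execute_action(action)
        await asyncio.sleep(settle_delay)                     # step 7
        screenshot = await capture(step)                      # step 8, then back to step 3


events: list[dict] = []
env = RecordingEnvironment()
agent = ScriptedAgent([
    {"thinking": "Calculator is closed", "actions": [{"type": "click", "x": 150, "y": 200}]},
    {"thinking": "Calculator is open", "finish": "Opened calculator"},
])
asyncio.run(autonomous_loop("Open calculator", agent, env, events.append, settle_delay=0))
```

The `emit` callback is where the real driver would write a JSON line to stdout; passing `events.append` here just collects the events in memory.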

Removed: SDK-Side Executor Modules

The following SDK-side executor modules have been removed. The driver binary handles all screenshot capture and action execution in local mode:


| SDK | Removed Module |
|---|---|
| Python | `agi.executor` (`execute_action`, `execute_actions`, `get_scale_factor`, `get_screen_size`) |
| Node.js | `src/executor.ts` (`executeAction`, `executeActions`, `getScaleFactor`, `getScreenSize`) |
| C# | `Agi.Executor` (`ExecuteAction`, `ExecuteActions`, `GetScaleFactor`, `GetScreenSize`) |

All platform-specific code now lives exclusively in the driver binary's environment/local.py.