**Date:** 2026-03-02
**Status:** Design
**Goal:** Evaluate agent quality across real user flows -- catch regressions when code changes, measure improvements when new tools or skills are added.
## Overview

A benchmark system that runs real user scenarios through the real agent loop with real LLM calls, then evaluates the resulting trajectory in two layers:

1. **Hard assertions** -- pass/fail checks on tool selection, response content, cost, and latency
2. **LLM-as-judge** -- quality scoring for reasoning, tool use, and response helpfulness

This is an eval system, not a unit test suite. It answers: "Does the agent actually solve problems well?"
## Scenario Format

Each scenario is a YAML file in `benchmarks/trajectories/`:
```yaml
name: schedule-meeting
description: User asks to schedule a meeting with workspace context
tags: [tools, scheduling, memory]

# Environment setup
setup:
  skills: [calendar-skill]
  tools: [time, shell, http, memory_search, memory_read]
  provider: nearai
  model: default

  # Pre-populate workspace memory before the scenario runs
  workspace:
    documents:
      - path: "context/team.md"
        content: |
          # Team
          - Alice (engineering lead, PST timezone)
          - Bob (product, EST timezone)
          - Carol (design, CET timezone)
      - path: "preferences/scheduling.md"
        content: |
          # Scheduling Preferences
          - Default meeting length: 30 minutes
          - Prefer afternoons for syncs
          - Always include Zoom link

    # Or load from fixture directory
    fixtures_dir: benchmarks/fixtures/scheduling/

  # Override identity files injected into system prompt
  identity:
    USER.md: |
      Name: Zaki
      Timezone: PST
      Preferred calendar: Google Calendar

# Conversation turns (multi-turn supported)
turns:
  - user: "Schedule a team sync for tomorrow at 2pm"
    assertions:
      tools_called: [time, memory_search]
      tools_not_called: [shell]
      response_contains: ["Alice", "Bob", "Carol"]
      response_not_contains: ["error", "sorry"]
      max_tool_calls: 8
      max_cost_usd: 0.10
      max_latency_secs: 30
    judge:
      criteria: |
        Did the agent search workspace memory for team and scheduling info?
        Did it account for timezone differences across the team?
        Is the response personalized using workspace context?
      min_score: 7
```
## Format Details

- `setup.skills`: Skills to activate for this scenario. Only these skills will be available.
- `setup.tools`: Tools to register beyond the default builtins. Controls tool availability.
- `setup.workspace.documents`: Seed workspace memory with documents at the specified paths. Torn down after the scenario.
- `setup.workspace.fixtures_dir`: Load all files from a directory into workspace memory.
- `setup.identity`: Override `AGENTS.md`, `USER.md`, `SOUL.md`, and `IDENTITY.md` for this scenario.
- `turns`: Ordered list of user messages. Each turn can assert on the agent's behavior.
- `assertions`: Hard pass/fail checks. The runner also uses `max_tool_calls`, `max_cost_usd`, and `max_latency_secs` as circuit breakers to kill runaway scenarios.
- `judge`: Criteria string sent to a separate LLM call. Produces a 1-10 score.
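The schema above might map onto Rust types along these lines. This is a sketch: field and type names mirror the YAML keys but are not the runner's actual definitions, and in practice the structs would derive `serde::Deserialize` and be loaded with a YAML crate.

```rust
// Illustrative Rust mirror of the scenario YAML schema.
// Optional guards use Option so unset limits mean "no limit".

#[derive(Debug, Default, Clone)]
struct Assertions {
    tools_called: Vec<String>,
    tools_not_called: Vec<String>,
    response_contains: Vec<String>,
    response_not_contains: Vec<String>,
    max_tool_calls: Option<u32>,
    max_cost_usd: Option<f64>,
    max_latency_secs: Option<u64>,
}

#[derive(Debug, Clone)]
struct Judge {
    criteria: String,
    min_score: u8,
}

#[derive(Debug, Clone)]
struct Turn {
    user: String,
    assertions: Assertions,
    judge: Option<Judge>,
}

#[derive(Debug, Clone)]
struct Scenario {
    name: String,
    description: String,
    tags: Vec<String>,
    turns: Vec<Turn>,
}
```

Making every guard optional keeps minimal scenarios terse while still allowing strict budgets where they matter.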
## Multi-Turn Example
```yaml
turns:
  - user: "Save a note: Project Alpha launches on March 15th"
    assertions:
      tools_called: [memory_write]
      response_contains: ["saved", "note"]

  - user: "When does Project Alpha launch?"
    assertions:
      tools_called: [memory_search]
      response_contains: ["March 15"]
    judge:
      criteria: |
        Did the agent retrieve the previously saved note?
        Is the answer accurate and concise?
      min_score: 8
```
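Evaluating a turn's hard assertions against a captured trajectory is straightforward set/substring checking. The function below is a hypothetical checker, not the runner's real API; it assumes case-insensitive substring matching for the response checks, which is one reasonable choice.

```rust
// Hypothetical hard-assertion checker: given the tools a turn actually
// called and the agent's final response, return the list of failed checks.

fn check_turn(
    tools_called: &[&str],
    response: &str,
    expect_tools: &[&str],
    forbid_tools: &[&str],
    expect_text: &[&str],
    forbid_text: &[&str],
) -> Vec<String> {
    let mut failures = Vec::new();
    for t in expect_tools {
        if !tools_called.contains(t) {
            failures.push(format!("tools_called: missing {t}"));
        }
    }
    for t in forbid_tools {
        if tools_called.contains(t) {
            failures.push(format!("tools_not_called: saw {t}"));
        }
    }
    // Assumed case-insensitive matching for response content checks.
    let lower = response.to_lowercase();
    for s in expect_text {
        if !lower.contains(&s.to_lowercase()) {
            failures.push(format!("response_contains: missing {s:?}"));
        }
    }
    for s in forbid_text {
        if lower.contains(&s.to_lowercase()) {
            failures.push(format!("response_not_contains: saw {s:?}"));
        }
    }
    failures
}
```

Returning the full failure list (rather than failing fast) makes per-scenario reports more useful when several assertions break at once.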
## Runner

A Rust binary (`src/bin/benchmark.rs` or a workspace crate) that:

1. **Discovers** scenario YAML files in `benchmarks/trajectories/`
2. **Sets up each scenario (isolated):**
   - Creates a fresh libSQL database (reusing `TestHarnessBuilder` patterns)
   - Seeds workspace documents and identity files per `setup.workspace` and `setup.identity`
   - Constructs a real `Agent` with a real LLM provider (configured via env vars)
   - Registers only the tools listed in `setup.tools`
   - Activates only the skills listed in `setup.skills`
3. **Executes turns sequentially:**
   - Sends an `IncomingMessage` through `agent.handle_message()`
   - Captures the full trajectory: tool calls (name, params, output, duration), final response, token cost, wall-clock latency
   - Enforces circuit breakers: kills the scenario if `max_tool_calls`, `max_cost_usd`, or `max_latency_secs` is exceeded
4. **Evaluates each turn:**
   - Runs hard assertions (pass/fail)
   - If a `judge` block is present: sends a trajectory summary plus the criteria to a separate LLM call and records a 1-10 score
5. **Reports:** writes per-scenario JSON results plus a suite summary, and compares against baselines
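The baseline comparison in the report step could look like the sketch below: flag any scenario whose judge score dropped more than a tolerance below the stored baseline. The function name and tolerance scheme are assumptions; the real report format is not specified here.

```rust
use std::collections::HashMap;

/// Illustrative baseline comparison: return the (sorted) names of
/// scenarios whose current judge score fell more than `tolerance`
/// below the baseline score. Scenarios absent from the baseline are
/// treated as new and never flagged.
fn regressions(
    baseline: &HashMap<String, f64>,
    current: &HashMap<String, f64>,
    tolerance: f64,
) -> Vec<String> {
    let mut out = Vec::new();
    for (name, score) in current {
        if let Some(base) = baseline.get(name) {
            if base - score > tolerance {
                out.push(name.clone());
            }
        }
    }
    out.sort(); // deterministic output for stable reports
    out
}
```

A small tolerance absorbs judge noise between runs; `--update-baseline` would overwrite the stored scores after a known-good run.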
## Key Design Decisions

- **Real agent loop:** Not mocked. Uses the actual `Agent`, dispatcher, worker, `ToolRegistry`, and `SafetyLayer`. Tests the whole system.
- **Isolated per scenario:** Each scenario gets its own libSQL database and workspace. No cross-contamination.
- **Parallel safe:** Because scenarios are isolated, they can run concurrently (limited by LLM API rate limits).
- **Judge model:** Use a cheaper/faster model for judging (e.g., Haiku or GPT-4o-mini), since it is scoring, not reasoning.
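Extracting the 1-10 score from the judge model's reply needs a small amount of defensive parsing. The sketch below assumes the judge is prompted to end its reply with a line like `SCORE: 7`; that reply format is an assumption of this example, not a protocol the document defines.

```rust
/// Parse a judge score from an LLM reply, scanning from the last line
/// backwards for a "SCORE: N" line and accepting only 1..=10.
/// Returns None for malformed or out-of-range replies so the runner
/// can record a judge failure instead of a bogus score.
fn parse_judge_score(reply: &str) -> Option<u8> {
    reply
        .lines()
        .rev()
        .find_map(|line| {
            let rest = line.trim().strip_prefix("SCORE:")?;
            rest.trim().parse::<u8>().ok()
        })
        .filter(|s| (1..=10).contains(s))
}
```

Scanning from the end tolerates judges that reason out loud before emitting the score, and rejecting out-of-range values prevents a confused judge from silently passing a `min_score` gate.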
## Cost and Speed Controls

### Per-Scenario Guards

The `assertions` fields double as circuit breakers:
| Guard | Function |
| --- | --- |
| `max_tool_calls` | Kill the agent loop if the tool call count exceeds the limit |
| `max_cost_usd` | Kill the scenario if token cost exceeds the budget |
| `max_latency_secs` | Per-turn timeout |
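A minimal sketch of the guard check the runner might perform after every tool call; the guard names mirror the assertion fields, but the `Verdict` type and check order are assumptions of this example.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Verdict {
    Continue,
    // Name of the guard that tripped, for the failure report.
    Kill(&'static str),
}

/// Evaluate all three circuit breakers against the turn's running totals.
fn check_guards(
    tool_calls: u32,
    cost_usd: f64,
    elapsed: Duration,
    max_tool_calls: u32,
    max_cost_usd: f64,
    max_latency_secs: u64,
) -> Verdict {
    if tool_calls > max_tool_calls {
        Verdict::Kill("max_tool_calls")
    } else if cost_usd > max_cost_usd {
        Verdict::Kill("max_cost_usd")
    } else if elapsed > Duration::from_secs(max_latency_secs) {
        Verdict::Kill("max_latency_secs")
    } else {
        Verdict::Continue
    }
}
```

Reporting which guard tripped matters: a `max_tool_calls` kill usually signals a looping agent, while a `max_latency_secs` kill may just be a slow provider.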
## CLI Interface

```sh
# Full suite (nightly/pre-release)
cargo run --bin benchmark

# Tagged subset (PR CI)
cargo run --bin benchmark -- --tags basic,tools

# Single scenario (development)
cargo run --bin benchmark -- --scenario schedule-meeting

# Budget cap for the entire run
cargo run --bin benchmark -- --max-total-cost 5.00

# Assertions only, skip the judge (cheaper)
cargo run --bin benchmark -- --no-judge

# Parallel execution
cargo run --bin benchmark -- --parallel 4

# Update baselines after a good run
cargo run --bin benchmark -- --update-baseline
```
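The `--tags` filter might select a scenario when it carries at least one requested tag ("any-of" semantics). That semantics is an assumption of this sketch; "all-of" matching would be the other reasonable choice.

```rust
/// Illustrative `--tags` filter: select a scenario if it carries at
/// least one of the comma-separated requested tags. An empty filter
/// selects everything.
fn matches_tags(scenario_tags: &[&str], filter: &str) -> bool {
    if filter.is_empty() {
        return true; // no filter: run the full suite
    }
    filter
        .split(',')
        .map(str::trim) // tolerate "basic, tools"
        .any(|want| scenario_tags.contains(&want))
}
```

With any-of matching, `--tags basic,tools` picks up the `schedule-meeting` scenario above via its `tools` tag.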
## Cost Strategy

| Context | What to run | Estimated cost |
| --- | --- | --- |
| Development | Single scenario, `--no-judge` | Pennies |
| PR CI | `--tags basic` (5-10 scenarios), assertions only | $0.50-2.00 |
| Nightly | Full suite with judge scoring | $5-20 |
| Pre-release | Full suite, compare against frozen baselines, gate the release | |