| name | description | license | metadata | compatibility |
|---|---|---|---|---|
| ai-agent-evaluations | Framework for designing, implementing, and iterating on evaluations for AI agents. Use for automated testing of coding, conversational, research, and computer use agents with code-based, model-based, and human graders. | MIT | | Framework-agnostic. Works with Harbor, Promptfoo, Braintrust, LangSmith, and Langfuse. Requires a test framework and model access. |
Comprehensive guide for designing, implementing, and maintaining evaluations for AI agents across different architectures and use cases.
Evaluations help teams ship AI agents more confidently by:
- Making problems and behavioral changes visible before they affect users
- Providing metrics to track improvements and prevent regressions
- Enabling fast iteration and model updates with confidence
- Creating clear feedback loops between product and research teams
- Allowing teams to distinguish real regressions from noise
Without evals, debugging becomes reactive: wait for complaints, reproduce manually, fix, and hope nothing else regressed.
Task: A single test with defined inputs and success criteria.
Trial: One attempt at a task. Multiple trials are run because model outputs vary between runs.
Grader: Logic that scores some aspect of agent performance. A task can have multiple graders with multiple assertions.
Transcript: Complete record of a trial including outputs, tool calls, reasoning, intermediate results, and all interactions.
Outcome: Final state in the environment at end of trial.
Evaluation Harness: Infrastructure that runs evals end-to-end.
Agent Harness: System enabling a model to act as an agent.
Evaluation Suite: Collection of tasks designed to measure specific capabilities or behaviors.
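To make the vocabulary concrete, here is a minimal sketch of how these concepts might map to data structures in a simple in-house harness; the class and field names are illustrative, not taken from any particular framework.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Task:
    """A single test with defined inputs and success criteria."""
    task_id: str
    prompt: str
    success_criteria: str                  # unambiguous pass/fail definition
    reference_solution: str | None = None  # optional known-good answer

@dataclass
class Trial:
    """One attempt at a task; run several because model outputs vary."""
    task_id: str
    transcript: list[dict] = field(default_factory=list)   # messages, tool calls, reasoning
    outcome: dict = field(default_factory=dict)             # final environment state
    scores: dict[str, float] = field(default_factory=dict)  # one entry per grader
```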
Code-based graders
Methods: String matching, regex, fuzzy matching, binary tests, static analysis, outcome verification, tool call verification, transcript analysis
Strengths: Fast, cheap, objective, reproducible, easy to debug
Weaknesses: Brittle to valid variations, lacking nuance, limited for subjective tasks
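As an illustration, a code-based grader might combine string matching, outcome verification, and tool call verification. This is a sketch assuming the Trial shape above; the ticket pattern, file name, and tool name are hypothetical.

```python
import re

def grade_code_based(trial) -> dict[str, bool]:
    """Deterministic checks over the trial's outcome and transcript:
    fast, cheap, reproducible, easy to debug."""
    final_output = trial.outcome.get("final_output", "")
    tool_calls = [e for e in trial.transcript if e.get("type") == "tool_call"]

    return {
        # String/regex matching on the expected artifact
        "mentions_ticket_id": bool(re.search(r"TICKET-\d+", final_output)),
        # Outcome verification: the file the task required actually exists
        "created_report": "report.md" in trial.outcome.get("files", {}),
        # Tool call verification: no destructive tools were used
        "no_forbidden_tools": all(c.get("name") != "delete_database" for c in tool_calls),
    }
```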
Model-based graders
Methods: Rubric-based scoring, natural language assertions, pairwise comparison, reference-based evaluation, multi-judge consensus
Strengths: Flexible, scalable, captures nuance, handles open-ended tasks
Weaknesses: Non-deterministic, more expensive, requires calibration against human graders
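A rubric-based model grader could look like the sketch below. The `llm.complete` client is a placeholder for whatever judge model you call, and the rubric text and weights are illustrative.

```python
import json

RUBRIC = """Score the agent's answer from 0.0 to 1.0 against this rubric:
- Fully addresses the user's question (0.4)
- Claims are grounded in the provided sources (0.4)
- No fabricated citations (0.2)
Return JSON only: {"score": <float>, "reasoning": "<one sentence>"}"""

def grade_with_model(llm, question: str, answer: str, sources: str) -> dict:
    """Model-based grading: flexible and nuanced, but non-deterministic,
    so calibrate scores against human judgments."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\n\nSources: {sources}\n\nAnswer: {answer}"
    raw = llm.complete(prompt)   # placeholder judge-model client, not a real API
    return json.loads(raw)       # e.g. {"score": 0.8, "reasoning": "..."}
```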
Human graders
Methods: SME review, crowdsourced judgment, spot-check sampling, A/B testing, inter-annotator agreement
Strengths: Gold-standard quality, matches expert judgment, calibrates model-based graders
Weaknesses: Expensive, slow, hard to run at scale because it depends on human experts
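Inter-annotator agreement can be quantified with a chance-corrected statistic such as Cohen's kappa; a minimal sketch for binary pass/fail labels:

```python
def cohens_kappa(grader_a: list[bool], grader_b: list[bool]) -> float:
    """Chance-corrected agreement between two graders on binary pass/fail labels."""
    n = len(grader_a)
    observed = sum(a == b for a, b in zip(grader_a, grader_b)) / n
    p_a, p_b = sum(grader_a) / n, sum(grader_b) / n
    # Probability the graders agree by chance (both pass or both fail)
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```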
Capability Evals target tasks the agent currently struggles with, so they start at a low pass rate.
Regression Evals should hold a ~100% pass rate to protect against backsliding.
Graduate high-performing capability evals into regression suites for continuous monitoring.
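One way to operationalize graduation, sketched below: track each capability task's recent pass rate and promote tasks that stay consistently high. The window and threshold are arbitrary placeholders, not recommendations from the source.

```python
def tasks_to_graduate(history: dict[str, list[bool]],
                      window: int = 5, threshold: float = 0.95) -> list[str]:
    """Flag capability tasks whose recent pass rate is consistently high,
    so they can be moved into the regression suite."""
    graduates = []
    for task_id, results in history.items():
        recent = results[-window:]
        if len(recent) == window and sum(recent) / window >= threshold:
            graduates.append(task_id)
    return graduates
```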
Coding agents: Well-specified tasks, stable environments, thorough tests. Grade on deterministic tests, code quality, tool calls, and state verification.
Conversational agents: Simulate a user persona. Success is multidimensional: task completion, interaction quality, and reasoning correctness.
Research agents: Combine grader types: groundedness checks, coverage of key facts, source quality, and coherence of synthesis. Calibrate frequently against expert judgment.
Computer use agents: Run in a real or sandboxed environment. Check navigation (URL and page state) and verify backend state: confirm the action actually occurred, not just that a confirmation page appeared.
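For computer use agents in particular, the grader should query the backend rather than trust the UI. A sketch assuming a hypothetical REST endpoint on the system under test:

```python
import requests

def booking_actually_created(api_base: str, booking_ref: str) -> bool:
    """Outcome verification for a computer use agent: query the backend
    instead of trusting the confirmation page the agent saw."""
    # Hypothetical endpoint on the system under test; adapt to your backend.
    resp = requests.get(f"{api_base}/bookings/{booking_ref}", timeout=10)
    return resp.status_code == 200 and resp.json().get("status") == "confirmed"
```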
pass@k: Probability the agent produces at least one correct solution in k attempts. Useful when a single success matters.
pass^k: Probability that all k trials succeed. Useful when consistency is essential.
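Both metrics can be estimated from n trials per task. The sketch below uses the combinatorial estimator commonly used for pass@k and a simple plug-in estimate for pass^k:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success in k attempts) from c successes in n trials."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed) by plugging in the empirical success rate."""
    return (c / n) ** k

# Example: 7 successes out of 10 trials
print(pass_at_k(10, 7, 3))   # high: only one of three attempts needs to succeed
print(pass_hat_k(10, 7, 3))  # lower: all three attempts must succeed
```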
Step 0: Start Early - 20-50 tasks drawn from real failures is a great start.
Step 1: Start with Manual Testing - Convert manual checks and bug reports into test cases.
Step 2: Write Unambiguous Tasks - Two experts should reach the same pass/fail verdict. Create reference solutions.
Step 3: Build Balanced Problem Sets - Test both when behavior should occur AND when it shouldn't.
Step 4: Build Robust Harness - Each trial starts clean. Avoid shared state causing correlated failures.
Step 5: Design Graders Thoughtfully - Grade the outcome, not the path. Build in partial credit. Calibrate model graders against human judgment (see the combined sketch after these steps).
Step 6: Check Transcripts - Read transcripts and grades. Verify failures are fair. Confirm eval measures what matters.
Step 7: Monitor Saturation - A 100% pass rate provides no improvement signal. Ensure evals remain challenging.
Step 8: Keep Suites Healthy - Treat eval suites like unit tests: dedicated teams own the infrastructure, and domain experts contribute tasks.
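As a combined sketch of Steps 2 and 5: a task spec written so two experts would reach the same verdict, plus a grader that assigns weighted partial credit. The scenario, criterion names, and weights are illustrative.

```python
TASK = {
    "id": "refund-policy-001",
    "prompt": "A customer asks to return a laptop bought 45 days ago; policy allows 30 days.",
    "criteria": [
        # (name, weight, description phrased so two experts reach the same verdict)
        ("declines_refund", 0.5, "States the return is outside the 30-day window"),
        ("offers_alternative", 0.3, "Offers store credit or repair as an alternative"),
        ("no_policy_invention", 0.2, "Does not invent exceptions to the policy"),
    ],
}

def grade_with_partial_credit(checks: dict[str, bool]) -> float:
    """Weighted partial credit: sum the weights of the criteria that passed."""
    return sum(weight for name, weight, _ in TASK["criteria"] if checks.get(name))

# Example: the agent declined the refund and offered repair, but invented a policy exception
print(grade_with_partial_credit({"declines_refund": True, "offers_alternative": True}))  # 0.8
```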
Harbor: Containerized environments, infrastructure for running trials at scale.
Promptfoo: Lightweight, open-source, declarative YAML configuration.
Braintrust: Offline evaluation + production observability + pre-built scorers.
LangSmith: Tracing, evals, dataset management with LangChain integration.
Langfuse: Self-hosted open-source alternative.
- Automated Evals: Fast iteration, reproducible, no user impact
- Production Monitoring: Real user behavior, catches issues synthetic evals miss
- A/B Testing: Measures actual user outcomes
- User Feedback: Surfaces unexpected problems
- Manual Transcript Review: Builds intuition for failure modes
- Systematic Human Studies: Gold-standard judgments, calibrates model graders
Integration: Evals for pre-launch and CI/CD. Production monitoring post-launch. A/B testing for significant changes. User feedback and transcripts ongoing. Human studies for calibration.
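A minimal sketch of the CI/CD piece of that integration: run the regression suite and block the merge if the overall pass rate drops below a threshold. The results shape and threshold are placeholders.

```python
import sys

def gate_on_regressions(results: dict[str, list[bool]], min_pass_rate: float = 0.98) -> None:
    """Fail the CI job when the regression suite's pass rate drops below the threshold."""
    trials = [ok for task_trials in results.values() for ok in task_trials]
    pass_rate = sum(trials) / len(trials)
    print(f"Regression pass rate: {pass_rate:.1%} over {len(trials)} trials")
    if pass_rate < min_pass_rate:
        sys.exit(1)  # non-zero exit blocks the merge
```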
- Start early - don't wait for perfect suite
- Source from reality - tasks from actual failures
- Define success unambiguously - two experts should agree
- Combine grader types - code for speed, model for nuance, human for calibration
- Read transcripts - critical for understanding
- Iterate on evals - like any product
- Treat as core component - not an afterthought
- Monitor saturation - ensure evals remain challenging
- Original: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- Building Effective Agents: https://www.anthropic.com/research/building-effective-agents