| name | description | license | metadata | compatibility |
|---|---|---|---|---|
| ai-agent-evaluations | Framework for designing, implementing, and iterating on evaluations for AI agents. Use for automated testing of coding, conversational, research, and computer use agents with code-based, model-based, and human graders. | MIT | | Framework-agnostic. Works with Harbor, Promptfoo, Braintrust, LangSmith, and Langfuse. Requires a test framework and model access. |
Comprehensive guide for designing, implementing, and maintaining evaluations for AI agents across different architectures and use cases.
Evaluations help teams ship AI agents more confidently by:
- Making problems and behavioral changes visible before they affect users
- Providing metrics to track improvements and prevent regressions
- Enabling fast iteration and model updates with confidence
- Creating clear feedback loops between product and research teams
- Allowing teams to distinguish real regressions from noise
Without evals, debugging becomes reactive: wait for complaints, reproduce manually, fix, and hope nothing else regressed.
Task: A single test with defined inputs and success criteria.
Trial: One attempt at a task. Multiple trials are run because model outputs vary between runs.
Grader: Logic that scores some aspect of agent performance. A task can have multiple graders with multiple assertions.
Transcript: Complete record of a trial including outputs, tool calls, reasoning, intermediate results, and all interactions.
Outcome: Final state in the environment at end of trial.
Evaluation Harness: Infrastructure that runs evals end-to-end.
Agent Harness: System enabling a model to act as an agent.
Evaluation Suite: Collection of tasks designed to measure specific capabilities or behaviors.
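To make the vocabulary concrete, here is a minimal sketch of how these concepts might map to data structures in a simple in-house harness; the class and field names are illustrative, not taken from any particular framework.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Task:
    """A single test with defined inputs and success criteria."""
    task_id: str
    prompt: str
    success_criteria: str                  # unambiguous pass/fail definition
    reference_solution: str | None = None  # optional known-good answer

@dataclass
class Trial:
    """One attempt at a task; run several because model outputs vary."""
    task_id: str
    transcript: list[dict] = field(default_factory=list)   # messages, tool calls, reasoning
    outcome: dict = field(default_factory=dict)             # final environment state
    scores: dict[str, float] = field(default_factory=dict)  # one entry per grader
```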
Code-based graders
Methods: String matching, regex, fuzzy matching, binary tests, static analysis, outcome verification, tool call verification, transcript analysis
Strengths: Fast, cheap, objective, reproducible, easy to debug
Weaknesses: Brittle to valid variations, lacking nuance, limited for subjective tasks
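As an illustration, a code-based grader might combine string matching, outcome verification, and tool call verification. This is a sketch assuming the Trial shape above; the ticket pattern, file name, and tool name are hypothetical.

```python
import re

def grade_code_based(trial) -> dict[str, bool]:
    """Deterministic checks over the trial's outcome and transcript:
    fast, cheap, reproducible, easy to debug."""
    final_output = trial.outcome.get("final_output", "")
    tool_calls = [e for e in trial.transcript if e.get("type") == "tool_call"]

    return {
        # String/regex matching on the expected artifact
        "mentions_ticket_id": bool(re.search(r"TICKET-\d+", final_output)),
        # Outcome verification: the file the task required actually exists
        "created_report": "report.md" in trial.outcome.get("files", {}),
        # Tool call verification: no destructive tools were used
        "no_forbidden_tools": all(c.get("name") != "delete_database" for c in tool_calls),
    }
```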
Model-based graders
Methods: Rubric-based scoring, natural language assertions, pairwise comparison, reference-based evaluation, multi-judge consensus
Strengths: Flexible, scalable, captures nuance, handles open-ended tasks
Weaknesses: Non-deterministic, more expensive, requires calibration against human graders
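A rubric-based model grader could look like the sketch below. The `llm.complete` client is a placeholder for whatever judge model you call, and the rubric text and weights are illustrative.

```python
import json

RUBRIC = """Score the agent's answer from 0.0 to 1.0 against this rubric:
- Fully addresses the user's question (0.4)
- Claims are grounded in the provided sources (0.4)
- No fabricated citations (0.2)
Return JSON only: {"score": <float>, "reasoning": "<one sentence>"}"""

def grade_with_model(llm, question: str, answer: str, sources: str) -> dict:
    """Model-based grading: flexible and nuanced, but non-deterministic,
    so calibrate scores against human judgments."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\n\nSources: {sources}\n\nAnswer: {answer}"
    raw = llm.complete(prompt)   # placeholder judge-model client, not a real API
    return json.loads(raw)       # e.g. {"score": 0.8, "reasoning": "..."}
```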
Human graders
Methods: SME review, crowdsourced judgment, spot-check sampling, A/B testing, inter-annotator agreement
Strengths: Gold-standard quality, matches expert judgment, calibrates model-based graders
Weaknesses: Expensive, slow, hard to run at scale because it depends on human experts
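Inter-annotator agreement can be quantified with a chance-corrected statistic such as Cohen's kappa; a minimal sketch for binary pass/fail labels:

```python
def cohens_kappa(grader_a: list[bool], grader_b: list[bool]) -> float:
    """Chance-corrected agreement between two graders on binary pass/fail labels."""
    n = len(grader_a)
    observed = sum(a == b for a, b in zip(grader_a, grader_b)) / n
    p_a, p_b = sum(grader_a) / n, sum(grader_b) / n
    # Probability the graders agree by chance (both pass or both fail)
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```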
Capability Evals target tasks the agent currently struggles with, so they start at a low pass rate.
Regression Evals should hold a ~100% pass rate to protect against backsliding.
Graduate high-performing capability evals into regression suites for continuous monitoring.
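One way to operationalize graduation, sketched below: track each capability task's recent pass rate and promote tasks that stay consistently high. The window and threshold are arbitrary placeholders, not recommendations from the source.

```python
def tasks_to_graduate(history: dict[str, list[bool]],
                      window: int = 5, threshold: float = 0.95) -> list[str]:
    """Flag capability tasks whose recent pass rate is consistently high,
    so they can be moved into the regression suite."""
    graduates = []
    for task_id, results in history.items():
        recent = results[-window:]
        if len(recent) == window and sum(recent) / window >= threshold:
            graduates.append(task_id)
    return graduates
```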
Coding agents: Well-specified tasks, stable environments, thorough tests. Grade on deterministic tests, code quality, tool calls, and state verification.
Conversational agents: Simulate a user persona. Success is multidimensional: task completion, interaction quality, and reasoning correctness.
Research agents: Combine grader types: groundedness checks, coverage of key facts, source quality, and coherence of synthesis. Calibrate frequently against expert judgment.
Computer use agents: Run in a real or sandboxed environment. Check navigation (URL and page state) and verify backend state: confirm the action actually occurred, not just that a confirmation page appeared.
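For computer use agents in particular, the grader should query the backend rather than trust the UI. A sketch assuming a hypothetical REST endpoint on the system under test:

```python
import requests

def booking_actually_created(api_base: str, booking_ref: str) -> bool:
    """Outcome verification for a computer use agent: query the backend
    instead of trusting the confirmation page the agent saw."""
    # Hypothetical endpoint on the system under test; adapt to your backend.
    resp = requests.get(f"{api_base}/bookings/{booking_ref}", timeout=10)
    return resp.status_code == 200 and resp.json().get("status") == "confirmed"
```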
pass@k: Probability the agent produces at least one correct solution in k attempts. Useful when a single success matters.
pass^k: Probability that all k trials succeed. Useful when consistency is essential.
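Both metrics can be estimated from n trials per task. The sketch below uses the combinatorial estimator commonly used for pass@k and a simple plug-in estimate for pass^k:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success in k attempts) from c successes in n trials."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed) by plugging in the empirical success rate."""
    return (c / n) ** k

# Example: 7 successes out of 10 trials
print(pass_at_k(10, 7, 3))   # high: only one of three attempts needs to succeed
print(pass_hat_k(10, 7, 3))  # lower: all three attempts must succeed
```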
Step 0: Start Early - 20-50 tasks drawn from real failures is a great start.
Step 1: Start with Manual Testing - Convert manual checks and bug reports into test cases.
Step 2: Write Unambiguous Tasks - Two experts should reach the same pass/fail verdict. Create reference solutions.
Step 3: Build Balanced Problem Sets - Test both when behavior should occur AND when it shouldn't.
Step 4: Build Robust Harness - Each trial starts clean. Avoid shared state causing correlated failures.
Step 5: Design Graders Thoughtfully - Grade the outcome, not the path. Build in partial credit. Calibrate model graders against human judgment (see the combined sketch after these steps).
Step 6: Check Transcripts - Read transcripts and grades. Verify failures are fair. Confirm eval measures what matters.
Step 7: Monitor Saturation - A 100% pass rate provides no improvement signal. Ensure evals remain challenging.
Step 8: Keep Suites Healthy - Treat eval suites like unit tests: dedicated teams own the infrastructure, and domain experts contribute tasks.
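As a combined sketch of Steps 2 and 5: a task spec written so two experts would reach the same verdict, plus a grader that assigns weighted partial credit. The scenario, criterion names, and weights are illustrative.

```python
TASK = {
    "id": "refund-policy-001",
    "prompt": "A customer asks to return a laptop bought 45 days ago; policy allows 30 days.",
    "criteria": [
        # (name, weight, description phrased so two experts reach the same verdict)
        ("declines_refund", 0.5, "States the return is outside the 30-day window"),
        ("offers_alternative", 0.3, "Offers store credit or repair as an alternative"),
        ("no_policy_invention", 0.2, "Does not invent exceptions to the policy"),
    ],
}

def grade_with_partial_credit(checks: dict[str, bool]) -> float:
    """Weighted partial credit: sum the weights of the criteria that passed."""
    return sum(weight for name, weight, _ in TASK["criteria"] if checks.get(name))

# Example: the agent declined the refund and offered repair, but invented a policy exception
print(grade_with_partial_credit({"declines_refund": True, "offers_alternative": True}))  # 0.8
```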
Harbor: Containerized environments, infrastructure for running trials at scale.
Promptfoo: Lightweight, open-source, declarative YAML configuration.
Braintrust: Offline evaluation + production observability + pre-built scorers.
LangSmith: Tracing, evals, dataset management with LangChain integration.
Langfuse: Self-hosted open-source alternative.
- Automated Evals: Fast iteration, reproducible, no user impact
- Production Monitoring: Real user behavior, catches issues synthetic evals miss
- A/B Testing: Measures actual user outcomes
- User Feedback: Surfaces unexpected problems
- Manual Transcript Review: Builds intuition for failure modes
- Systematic Human Studies: Gold-standard judgments, calibrates model graders
Integration: Evals for pre-launch and CI/CD. Production monitoring post-launch. A/B testing for significant changes. User feedback and transcripts ongoing. Human studies for calibration.
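A minimal sketch of the CI/CD piece of that integration: run the regression suite and block the merge if the overall pass rate drops below a threshold. The results shape and threshold are placeholders.

```python
import sys

def gate_on_regressions(results: dict[str, list[bool]], min_pass_rate: float = 0.98) -> None:
    """Fail the CI job when the regression suite's pass rate drops below the threshold."""
    trials = [ok for task_trials in results.values() for ok in task_trials]
    pass_rate = sum(trials) / len(trials)
    print(f"Regression pass rate: {pass_rate:.1%} over {len(trials)} trials")
    if pass_rate < min_pass_rate:
        sys.exit(1)  # non-zero exit blocks the merge
```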
- Start early - don't wait for perfect suite
- Source from reality - tasks from actual failures
- Define success unambiguously - two experts should agree
- Combine grader types - code for speed, model for nuance, human for calibration
- Read transcripts - critical for understanding
- Iterate on evals - like any product
- Treat as core component - not an afterthought
- Monitor saturation - ensure evals remain challenging
- Original: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- Building Effective Agents: https://www.anthropic.com/research/building-effective-agents