Demystifying evals for AI agents - Evaluation Framework Guide
name: ai-agent-evaluations
description: Framework for designing, implementing, and iterating on evaluations for AI agents. Use for automated testing of coding, conversational, research, and computer use agents with code-based, model-based, and human graders.
license: MIT
compatibility: Framework-agnostic. Works with Harbor, Promptfoo, Braintrust, LangSmith, and Langfuse. Requires test framework and model access.

AI Agent Evaluations Framework

Comprehensive guide for designing, implementing, and maintaining evaluations for AI agents across different architectures and use cases.

Why Build Evaluations?

Evaluations help teams ship AI agents more confidently by:

  • Making problems and behavioral changes visible before they affect users
  • Providing metrics to track improvements and prevent regressions
  • Enabling fast iteration and model updates with confidence
  • Creating clear feedback loops between product and research teams
  • Allowing teams to distinguish real regressions from noise

Without evals, debugging becomes reactive: wait for complaints, reproduce manually, fix, and hope nothing else regressed.

Core Concepts

Task: A single test with defined inputs and success criteria.

Trial: One attempt at a task. Multiple trials are run because model outputs vary between runs.

Grader: Logic that scores some aspect of agent performance. A task can have multiple graders with multiple assertions.

Transcript: Complete record of a trial including outputs, tool calls, reasoning, intermediate results, and all interactions.

Outcome: Final state of the environment at the end of a trial.

Evaluation Harness: Infrastructure that runs evals end-to-end.

Agent Harness: The system (tools, prompts, and control loop) that enables a model to act as an agent.

Evaluation Suite: Collection of tasks designed to measure specific capabilities or behaviors.
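
A minimal sketch of how these concepts might map onto data structures; the names and fields below are illustrative, not taken from any particular framework. Later sketches in this guide reuse these types.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    """A single test: defined inputs plus success criteria (its graders)."""
    task_id: str
    prompt: str
    graders: list[Callable[["Transcript"], "GradeResult"]]

@dataclass
class Transcript:
    """Complete record of one trial: outputs, tool calls, and intermediate steps."""
    messages: list[dict]          # e.g. {"role": "assistant", "content": "..."}
    tool_calls: list[dict]        # e.g. {"name": "bash", "input": "...", "output": "..."}
    final_output: str
    outcome: dict = field(default_factory=dict)   # final environment state

@dataclass
class GradeResult:
    grader_name: str
    passed: bool
    score: float                  # float rather than bool to allow partial credit
    notes: str = ""

@dataclass
class Trial:
    """One attempt at a task; multiple trials are run because outputs vary."""
    task: Task
    transcript: Transcript
    grades: list[GradeResult] = field(default_factory=list)
```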

Types of Graders

Code-Based Graders

Methods: String matching, regex, fuzzy matching, binary tests, static analysis, outcome verification, tool call verification, transcript analysis

Strengths: Fast, cheap, objective, reproducible, easy to debug

Weaknesses: Brittle to valid variations, lacking nuance, limited for subjective tasks
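
For example, a regex check on the final output and a tool-call check over the transcript can be written as small deterministic functions; this sketch reuses the illustrative Transcript and GradeResult types defined above:

```python
import re

def grade_regex(transcript: Transcript, pattern: str) -> GradeResult:
    """Pass if the final output matches an expected pattern."""
    passed = re.search(pattern, transcript.final_output) is not None
    return GradeResult("regex_match", passed, 1.0 if passed else 0.0)

def grade_tool_calls(transcript: Transcript, required: set[str],
                     forbidden: set[str]) -> GradeResult:
    """Pass if all required tools were called and no forbidden tool was used."""
    used = {call["name"] for call in transcript.tool_calls}
    passed = required <= used and not (forbidden & used)
    return GradeResult("tool_call_check", passed, 1.0 if passed else 0.0,
                       notes=f"tools used: {sorted(used)}")
```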

Model-Based Graders

Methods: Rubric-based scoring, natural language assertions, pairwise comparison, reference-based evaluation, multi-judge consensus

Strengths: Flexible, scalable, captures nuance, handles open-ended tasks

Weaknesses: Non-deterministic, more expensive, requires calibration with human graders
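
A rubric-based model grader might look like the sketch below. The call_model function is a placeholder for whatever model API you use, and the rubric and pass threshold are assumptions to be calibrated against human judgment.

```python
import json

def call_model(prompt: str) -> str:
    """Placeholder for your model API of choice; returns the model's text reply."""
    raise NotImplementedError

RUBRIC_PROMPT = """You are grading an AI agent's answer.
Rubric:
1. Does the answer address the user's request? (0-2)
2. Is every factual claim supported? (0-2)
3. Is the answer clear and well organized? (0-1)

Answer to grade:
{answer}

Reply with JSON only: {{"total": <int>, "reasoning": "<short>"}}"""

def grade_with_rubric(transcript: Transcript, pass_threshold: int = 4) -> GradeResult:
    reply = call_model(RUBRIC_PROMPT.format(answer=transcript.final_output))
    result = json.loads(reply)            # in practice, handle malformed JSON
    total = result["total"]
    return GradeResult("rubric_judge", total >= pass_threshold, total / 5,
                       notes=result.get("reasoning", ""))
```

Because the judge is itself non-deterministic, spot-check its grades against human graders before trusting it at scale.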

Human Graders

Methods: SME review, crowdsourced judgment, spot-check sampling, A/B testing, inter-annotator agreement

Strengths: Gold standard quality, matches expert judgment, calibrates model-based graders

Weaknesses: Expensive, slow, requires human experts at scale
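
Inter-annotator agreement is often summarized with Cohen's kappa, which corrects raw agreement for chance; a self-contained sketch for binary pass/fail labels:

```python
def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Agreement between two annotators on pass/fail labels, corrected for chance."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a_pass = sum(labels_a) / n
    p_b_pass = sum(labels_b) / n
    p_expected = p_a_pass * p_b_pass + (1 - p_a_pass) * (1 - p_b_pass)
    if p_expected == 1.0:        # both annotators label everything identically
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)
```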

Capability vs. Regression Evals

Capability Evals target tasks the agent struggles with, starting at a low pass rate.

Regression Evals should have ~100% pass rate to protect against backsliding.

Graduate high-performing capability evals to regression suites for continuous monitoring.
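
One illustrative way to automate the backsliding and graduation checks, assuming suites are distinguished by a naming prefix; the threshold values are assumptions, not prescriptions:

```python
def classify_suite_result(task_pass_rates: dict[str, float],
                          regression_floor: float = 0.98,
                          graduation_bar: float = 0.90) -> dict[str, list[str]]:
    """Flag regression tasks that dipped below the floor and capability tasks
    that now pass often enough to graduate into the regression suite."""
    report = {"regression_failures": [], "ready_to_graduate": []}
    for task_id, rate in task_pass_rates.items():
        if task_id.startswith("regression/") and rate < regression_floor:
            report["regression_failures"].append(task_id)
        elif task_id.startswith("capability/") and rate >= graduation_bar:
            report["ready_to_graduate"].append(task_id)
    return report
```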

Evaluating Agent Types

Coding Agents

Well-specified tasks, stable environments, thorough tests. Grade on deterministic tests, code quality, tool calls, and state verification.
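
A deterministic test grader is often just "run the project's test suite in the trial's working copy and check the exit code"; a sketch assuming a pytest-based project and the illustrative GradeResult type from above:

```python
import subprocess

def grade_by_tests(workdir: str, timeout_s: int = 300) -> GradeResult:
    """Run the task's test suite in the trial's working copy; pass iff tests pass.
    Swap in the project's own test command where needed."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=workdir, capture_output=True, text=True, timeout=timeout_s,
    )
    passed = proc.returncode == 0
    return GradeResult("test_suite", passed, 1.0 if passed else 0.0,
                       notes=proc.stdout[-2000:])   # keep the tail of the test output
```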

Conversational Agents

Simulate a user persona. Success is multidimensional: task completion, interaction quality, and reasoning correctness.
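
A persona simulation loop might look like the sketch below, where another model plays the user until its goal is met. Here agent_reply stands in for whatever interface your agent exposes, and call_model is the same placeholder as in the model-grader sketch.

```python
import json

PERSONA_PROMPT = """You are role-playing a customer: {persona}.
Goal: {goal}. Stay in character and write one short message at a time.
If your goal has been met, reply exactly DONE."""

def simulate_conversation(agent_reply, persona: str, goal: str,
                          max_turns: int = 8) -> list[dict]:
    """Drive the agent with a model-played user; returns the conversation history.
    agent_reply(history) -> str is a placeholder for your agent's interface."""
    history: list[dict] = []
    for _ in range(max_turns):
        user_msg = call_model(PERSONA_PROMPT.format(persona=persona, goal=goal)
                              + "\n\nConversation so far:\n" + json.dumps(history))
        if user_msg.strip() == "DONE":
            break
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": agent_reply(history)})
    return history
```

The resulting history can then be graded on each dimension separately (task completion, interaction quality, reasoning).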

Research Agents

Combine grader types: groundedness checks, coverage of key facts, source quality, and coherence of synthesis. Frequently calibrate against expert judgment.
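
Coverage of key facts is often easiest to check in code, with groundedness and coherence handled by a model grader like the rubric sketch above; an illustrative coverage grader with partial credit:

```python
def grade_coverage(transcript: Transcript, key_facts: list[str],
                   pass_threshold: float = 0.8) -> GradeResult:
    """Partial credit for how many required key facts appear in the final report."""
    report = transcript.final_output.lower()
    hits = [fact for fact in key_facts if fact.lower() in report]
    score = len(hits) / len(key_facts) if key_facts else 1.0
    return GradeResult("key_fact_coverage", score >= pass_threshold, score,
                       notes=f"covered {len(hits)}/{len(key_facts)} facts")
```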

Computer Use Agents

Run in a real or sandboxed environment. Check navigation via URL/page state, and verify backend state to confirm the action actually occurred rather than trusting a confirmation page.
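
Backend state verification might query the sandboxed application directly; the /orders endpoint and "confirmed" status below are hypothetical stand-ins for your application's own API:

```python
import requests  # third-party: pip install requests

def grade_backend_state(order_id: str, api_base: str) -> GradeResult:
    """Confirm the action actually happened by querying the sandboxed backend,
    rather than trusting the confirmation page the agent saw."""
    resp = requests.get(f"{api_base}/orders/{order_id}", timeout=10)
    passed = resp.status_code == 200 and resp.json().get("status") == "confirmed"
    return GradeResult("backend_state", passed, 1.0 if passed else 0.0,
                       notes=f"HTTP {resp.status_code}")
```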

Non-Determinism Metrics

pass@k: Probability that the agent produces at least one correct solution in k attempts. Useful when a single success matters.

pass^k: Probability that all k trials succeed. Useful when consistency is essential.
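
Both can be estimated from n recorded trials per task. The sketch below uses the standard unbiased estimator for pass@k (one minus the chance that a random sample of k trials contains no pass) and, for pass^k, assumes independent trials with per-trial success rate c/n:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success in k attempts) from c passes in n trials (n >= k)."""
    if n - c < k:                 # not enough failures to fill an all-fail sample of size k
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_power_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed), assuming independent trials with rate c/n."""
    return (c / n) ** k

# Example: 7 passes out of 10 recorded trials.
# pass_at_k(10, 7, 3) is about 0.99; pass_power_k(10, 7, 3) is about 0.34.
```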

Building Evals: 8-Step Roadmap

Step 0: Start Early - 20-50 tasks drawn from real failures is a great start.

Step 1: Start with Manual Testing - Convert manual checks and bug reports into test cases.

Step 2: Write Unambiguous Tasks - Two experts should reach the same pass/fail verdict. Create reference solutions.

Step 3: Build Balanced Problem Sets - Test both when a behavior should occur AND when it shouldn't.

Step 4: Build a Robust Harness - Each trial starts clean. Avoid shared state that causes correlated failures (see the harness sketch after these steps).

Step 5: Design Graders Thoughtfully - Grade the output, not the path. Build in partial credit. Calibrate model graders against humans.

Step 6: Check Transcripts - Read transcripts and grades. Verify failures are fair. Confirm the eval measures what matters.

Step 7: Monitor Saturation - A 100% pass rate provides no improvement signal. Ensure evals remain challenging.

Step 8: Keep Suites Healthy - Treat suites like unit tests: dedicated teams own the infrastructure, and domain experts contribute tasks.
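
The harness sketch referenced in Step 4: a minimal loop that gives every trial a clean environment, runs the task's graders, and aggregates per-task pass rates. Here make_clean_env, agent.run, and env.teardown are placeholders for your own infrastructure, and Task is the illustrative type from Core Concepts.

```python
import statistics

def run_suite(tasks: list[Task], agent, n_trials: int = 5) -> dict[str, float]:
    """Run each task n_trials times with no shared state and report pass rates."""
    pass_rates: dict[str, float] = {}
    for task in tasks:
        results = []
        for _ in range(n_trials):
            env = make_clean_env(task)             # placeholder: fresh env per trial
            try:
                transcript = agent.run(task, env)  # placeholder: returns a Transcript
                grades = [grader(transcript) for grader in task.graders]
                results.append(all(g.passed for g in grades))
            finally:
                env.teardown()                     # always clean up, even on failure
        pass_rates[task.task_id] = statistics.mean(results) if results else 0.0
    return pass_rates
```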

Eval Frameworks

Harbor: Containerized environments, infrastructure for running trials at scale.

Promptfoo: Lightweight, open-source, declarative YAML configuration.

Braintrust: Offline evaluation + production observability + pre-built scorers.

LangSmith: Tracing, evals, dataset management with LangChain integration.

Langfuse: Self-hosted open-source alternative.

Combining Evaluation Methods

  • Automated Evals: Fast iteration, reproducible, no user impact
  • Production Monitoring: Real user behavior, catches issues synthetic evals miss
  • A/B Testing: Measures actual user outcomes
  • User Feedback: Surfaces unexpected problems
  • Manual Transcript Review: Builds intuition for failure modes
  • Systematic Human Studies: Gold-standard judgments, calibrates model graders

Integration: Run automated evals pre-launch and in CI/CD; add production monitoring post-launch; A/B test significant changes; collect user feedback and review transcripts on an ongoing basis; use human studies for calibration.

Key Principles

  1. Start early - don't wait for perfect suite
  2. Source from reality - tasks from actual failures
  3. Define success unambiguously - two experts should agree
  4. Combine grader types - code for speed, model for nuance, human for calibration
  5. Read transcripts - critical for understanding
  6. Iterate on evals - like any product
  7. Treat as core component - not an afterthought
  8. Monitor saturation - ensure evals remain challenging
