This guide demonstrates how to evaluate LLM agents built with the Vercel AI SDK using Langfuse's evaluation framework. We'll walk through a 3-phase evaluation approach, moving from manual inspection to automated testing at scale.
Agents are systems operating in continuous loops where the LLM:
- Receives input
- Decides on an action (like calling external tools)
- Receives feedback from the environment
- Repeats until generating a final answer
This sequence is called a trace or trajectory.
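As a rough, framework-agnostic sketch, the loop can be written like this; callModel and executeTool are hypothetical stand-ins (passed in as parameters) for an LLM call and a tool invocation, so the snippet stays self-contained:
// Minimal agent-loop sketch. callModel and executeTool are hypothetical
// dependencies injected as parameters; they stand in for an LLM call and a
// tool invocation.
type AgentStep =
  | { type: 'final-answer'; text: string }
  | { type: 'tool-call'; toolName: string; args: unknown };

export async function agentLoop(
  input: string,
  callModel: (history: string[]) => Promise<AgentStep>,
  executeTool: (toolName: string, args: unknown) => Promise<unknown>
): Promise<{ output: string; trajectory: Array<{ toolName: string; args: unknown }> }> {
  const history = [input];
  const trajectory: Array<{ toolName: string; args: unknown }> = [];

  while (true) {
    const step = await callModel(history); // the LLM decides: answer or call a tool
    if (step.type === 'final-answer') {
      return { output: step.text, trajectory }; // loop ends with the final answer
    }
    trajectory.push({ toolName: step.toolName, args: step.args });
    const observation = await executeTool(step.toolName, step.args); // environment feedback
    history.push(JSON.stringify({ tool: step.toolName, observation })); // feed result back in
  }
}
The trajectory array collected here is exactly what the later evaluation phases inspect.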
Three persistent challenges emerge when building agents:
- Understanding behavior: What do agents actually do on real traffic?
- Specification: How do we properly specify correct behavior through prompts?
- Generalization: Do agents work beyond handpicked examples?
| Strategy | Description | Use Case |
|---|---|---|
| Final Response (Black-Box) | Compares agent output against expected facts | Correctness validation |
| Trajectory (Glass-Box) | Validates the sequence of tool calls | Process verification |
| Search Quality (White-Box) | Tests individual decision-making steps | Component testing |
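As an illustration, the three strategies can be read as three scoring functions over the same agent run. The helpers below are simple keyword heuristics and not part of the Langfuse API; the searchLangfuseDocs tool name matches the dataset used later in this guide, while its query argument name is an assumption:
// Illustrative scoring helpers for the three strategies; the data shapes
// mirror the AgentResult and expectedOutput structures used later on.
export interface RunUnderTest {
  output: string;
  toolCallHistory: Array<{ toolName: string; args: Record<string, unknown> }>;
}

// Final Response (black-box): fraction of expected facts mentioned in the answer.
export function scoreFinalResponse(run: RunUnderTest, expectedFacts: string[]): number {
  const hits = expectedFacts.filter((fact) =>
    run.output.toLowerCase().includes(fact.toLowerCase())
  );
  return hits.length / expectedFacts.length;
}

// Trajectory (glass-box): how much of the expected tool sequence appears in order.
export function scoreTrajectory(run: RunUnderTest, expectedTools: string[]): number {
  const actual = run.toolCallHistory.map((call) => call.toolName);
  let matched = 0;
  for (const tool of actual) {
    if (tool === expectedTools[matched]) matched++;
  }
  return matched / expectedTools.length;
}

// Search Quality (white-box): inspect a single decision, e.g. the query sent to
// the docs search. The `query` argument name is an assumption about the tool's schema.
export function scoreSearchQuery(run: RunUnderTest, expectedTerm: string): number {
  const search = run.toolCallHistory.find((call) => call.toolName === 'searchLangfuseDocs');
  if (!search) return 0;
  const query = String(search.args.query ?? '');
  return query.toLowerCase().includes(expectedTerm.toLowerCase()) ? 1 : 0;
}
With the evaluation strategies in mind, install the required packages: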
npm install ai @ai-sdk/openai @ai-sdk/mcp langfuse langfuse-vercel zod dotenv
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
Create a .env file with your credentials:
# Langfuse Configuration
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com # EU region
# LANGFUSE_HOST=https://us.cloud.langfuse.com # US region
# OpenAI Configuration
OPENAI_API_KEY=sk-proj-...
The AI SDK supports tracing via OpenTelemetry. With the LangfuseExporter, you can collect these traces in Langfuse.
// src/telemetry.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { LangfuseExporter } from 'langfuse-vercel';
export const sdk = new NodeSDK({
traceExporter: new LangfuseExporter(),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
console.log('Langfuse OpenTelemetry tracing enabled');
Here's how to create an agent that connects to the Langfuse Docs MCP server:
// src/agent.ts
import { createMCPClient } from '@ai-sdk/mcp';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import 'dotenv/config';
const LANGFUSE_MCP_URL = 'https://langfuse.com/api/mcp';
interface AgentConfig {
systemPrompt?: string;
model?: string;
}
interface AgentResult {
output: string;
toolCallHistory: Array<{ toolName: string; args: Record<string, unknown> }>;
}
export async function runAgent(
question: string,
config: AgentConfig = {}
): Promise<AgentResult> {
const {
systemPrompt = 'You are an expert on Langfuse. Answer questions accurately using the available tools.',
model = 'gpt-4o-mini',
} = config;
const toolCallHistory: Array<{ toolName: string; args: Record<string, unknown> }> = [];
// Connect to Langfuse Docs MCP server
const mcpClient = await createMCPClient({
transport: {
type: 'sse',
url: LANGFUSE_MCP_URL,
},
});
try {
// Get tools from MCP server
const mcpTools = await mcpClient.tools();
// Run the agent loop
const result = await generateText({
model: openai(model),
system: systemPrompt,
prompt: question,
tools: mcpTools,
maxSteps: 10, // Allow up to 10 tool call iterations
experimental_telemetry: {
isEnabled: true,
functionId: 'langfuse-docs-agent',
metadata: {
question,
systemPrompt,
model,
},
},
onStepFinish: async ({ toolCalls }) => {
// Track tool calls for trajectory evaluation
if (toolCalls) {
for (const toolCall of toolCalls) {
toolCallHistory.push({
toolName: toolCall.toolName,
args: toolCall.args as Record<string, unknown>,
});
}
}
},
});
return {
output: result.text,
toolCallHistory,
};
} finally {
await mcpClient.close();
}
}
Define test cases with expected outputs for evaluation:
// src/dataset.ts
import { Langfuse } from 'langfuse';
import 'dotenv/config';
const langfuse = new Langfuse();
interface TestCase {
input: { question: string };
expectedOutput: {
responseFacts: string[];
trajectory: string[];
searchTerm?: string;
};
}
const testCases: TestCase[] = [
{
input: { question: 'What is Langfuse?' },
expectedOutput: {
responseFacts: [
'Open Source LLM Engineering Platform',
'Product modules: Tracing, Evaluation and Prompt Management',
],
trajectory: ['getLangfuseOverview'],
},
},
{
input: { question: 'How to trace a TypeScript application with Langfuse?' },
expectedOutput: {
responseFacts: [
'AI SDK integration via OpenTelemetry',
'Use LangfuseExporter with experimental_telemetry',
],
trajectory: ['getLangfuseOverview', 'searchLangfuseDocs'],
searchTerm: 'TypeScript Tracing',
},
},
{
input: { question: 'How to connect to the Langfuse Docs MCP server?' },
expectedOutput: {
responseFacts: [
'Connect via the MCP server endpoint: https://langfuse.com/api/mcp',
'Transport protocol: streamableHttp or SSE',
],
trajectory: ['getLangfuseOverview'],
},
},
{
input: { question: 'How long are traces retained in Langfuse?' },
expectedOutput: {
responseFacts: [
'By default, traces are retained indefinitely',
'You can set custom data retention policy in the project settings',
],
trajectory: ['getLangfuseOverview', 'searchLangfuseDocs'],
searchTerm: 'Data retention',
},
},
];
export async function createDataset() {
const DATASET_NAME = 'ai-sdk-mcp-agent-evaluation';
// Create or get the dataset
const dataset = await langfuse.createDataset({
name: DATASET_NAME,
description: 'Evaluation dataset for AI SDK agent with Langfuse MCP tools',
});
// Add test cases
for (const testCase of testCases) {
await langfuse.createDatasetItem({
datasetName: DATASET_NAME,
input: testCase.input,
expectedOutput: testCase.expectedOutput,
});
}
console.log(`Dataset "${DATASET_NAME}" created with ${testCases.length} items`);
return dataset;
}
In the Langfuse UI, create evaluators for automated scoring:
The first evaluator judges the factual accuracy of the final response:
You are evaluating an AI agent's response for factual accuracy.
Expected facts the response should contain:
{{expected_output.responseFacts}}
Agent's response:
{{output}}
Score from 0-1 based on how many expected facts are present and accurate.
Return only a number between 0 and 1.
The second evaluator judges the tool-call trajectory:
You are evaluating an AI agent's tool usage trajectory.
Expected tool sequence:
{{expected_output.trajectory}}
Actual tool calls made:
{{metadata.toolCallHistory}}
Score from 0-1 based on:
- Did the agent use the expected tools?
- Was the order reasonable?
- Were unnecessary tools avoided?
Return only a number between 0 and 1.
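If you prefer to attach scores from code instead of (or alongside) UI-managed evaluators, the Langfuse SDK can also record scores against a trace by its id. A minimal sketch using a keyword heuristic rather than an LLM judge (the file name and score names are arbitrary):
// src/score.ts — sketch of programmatic scoring as a complement to
// UI-managed LLM-as-a-judge evaluators. Uses a keyword heuristic, not an LLM.
import { Langfuse } from 'langfuse';
import 'dotenv/config';

const langfuse = new Langfuse();

export async function scoreRun(
  traceId: string,
  output: string,
  toolCallHistory: Array<{ toolName: string }>,
  expected: { responseFacts: string[]; trajectory: string[] }
) {
  // Fraction of expected facts mentioned verbatim in the agent's answer.
  const factHits = expected.responseFacts.filter((fact) =>
    output.toLowerCase().includes(fact.toLowerCase())
  ).length;

  langfuse.score({
    traceId,
    name: 'fact-accuracy-heuristic',
    value: factHits / expected.responseFacts.length,
    comment: 'Keyword heuristic; use alongside the LLM-as-a-judge evaluator.',
  });

  // Fraction of expected tools that were called at least once.
  const calledTools = new Set(toolCallHistory.map((call) => call.toolName));
  const toolHits = expected.trajectory.filter((tool) => calledTools.has(tool)).length;

  langfuse.score({
    traceId,
    name: 'trajectory-heuristic',
    value: toolHits / expected.trajectory.length,
  });

  await langfuse.flushAsync();
}
In the experiment script below, the id of the created trace (trace.id) can be passed as traceId.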
Create a script to run experiments across different configurations:
// src/experiment.ts
import { Langfuse } from 'langfuse';
import { runAgent } from './agent';
import { sdk } from './telemetry';
import 'dotenv/config';
const langfuse = new Langfuse();
const DATASET_NAME = 'ai-sdk-mcp-agent-evaluation';
interface ExperimentConfig {
name: string;
description: string;
systemPrompt: string;
model: string;
}
const configurations: ExperimentConfig[] = [
{
name: 'baseline-gpt4o-mini',
description: 'Baseline with GPT-4o-mini',
systemPrompt:
'You are an expert on Langfuse. Answer user questions accurately and concisely using the available MCP tools. Cite sources when appropriate.',
model: 'gpt-4o-mini',
},
{
name: 'nudge-search-gpt4o-mini',
description: 'Nudge to search with GPT-4o-mini',
systemPrompt:
'You are an expert on Langfuse. Answer user questions accurately and concisely using the available MCP tools. Always cite sources when appropriate. When unsure, use getLangfuseOverview then search the docs. You can use these tools multiple times.',
model: 'gpt-4o-mini',
},
{
name: 'baseline-gpt4o',
description: 'Baseline with GPT-4o',
systemPrompt:
'You are an expert on Langfuse. Answer user questions accurately and concisely using the available MCP tools. Cite sources when appropriate.',
model: 'gpt-4o',
},
];
async function runExperiment(config: ExperimentConfig) {
console.log(`\nRunning experiment: ${config.name}`);
const dataset = await langfuse.getDataset(DATASET_NAME);
const items = dataset.items;
for (const item of items) {
const trace = langfuse.trace({
name: `experiment-${config.name}`,
input: item.input,
metadata: {
experimentName: config.name,
model: config.model,
},
});
try {
const { output, toolCallHistory } = await runAgent(
(item.input as { question: string }).question,
{
systemPrompt: config.systemPrompt,
model: config.model,
}
);
trace.update({
output,
metadata: {
toolCallHistory,
experimentName: config.name,
model: config.model,
},
});
// Link trace to dataset item for evaluation
await item.link(trace, config.name, {
description: config.description,
});
console.log(` Completed: ${(item.input as { question: string }).question}`);
} catch (error) {
console.error(` Error: ${error}`);
      trace.update({
        output: `Error: ${error}`,
        tags: ['error'], // level is an observation-level attribute, so mark the trace with a tag instead
      });
}
}
console.log(`Experiment ${config.name} completed`);
}
async function main() {
try {
for (const config of configurations) {
await runExperiment(config);
}
// Flush traces to Langfuse
await langfuse.flushAsync();
console.log('\nAll experiments completed! Check Langfuse for results.');
} finally {
await sdk.shutdown();
}
}
main().catch(console.error);
For more control over the agent loop, use this pattern:
// src/manual-agent.ts
import { createMCPClient } from '@ai-sdk/mcp';
import { streamText, ModelMessage } from 'ai';
import { openai } from '@ai-sdk/openai';
import { Langfuse } from 'langfuse';
const langfuse = new Langfuse();
export async function runManualAgentLoop(question: string) {
const trace = langfuse.trace({
name: 'manual-agent-loop',
input: { question },
});
const messages: ModelMessage[] = [
{
role: 'system',
content: 'You are an expert on Langfuse. Use the available tools to answer questions.',
},
{
role: 'user',
content: question,
},
];
const toolCallHistory: Array<{ toolName: string; args: unknown }> = [];
const mcpClient = await createMCPClient({
transport: {
type: 'sse',
url: 'https://langfuse.com/api/mcp',
},
});
try {
const tools = await mcpClient.tools();
while (true) {
const generation = trace.generation({
name: 'llm-call',
input: messages,
model: 'gpt-4o-mini',
});
const result = streamText({
model: openai('gpt-4o-mini'),
messages,
tools,
experimental_telemetry: { isEnabled: true },
});
// Stream the response
let responseText = '';
for await (const chunk of result.fullStream) {
if (chunk.type === 'text-delta') {
responseText += chunk.text;
process.stdout.write(chunk.text);
}
if (chunk.type === 'tool-call') {
console.log(`\nCalling tool: ${chunk.toolName}`);
toolCallHistory.push({
toolName: chunk.toolName,
args: chunk.args,
});
}
}
const responseMessages = (await result.response).messages;
messages.push(...responseMessages);
generation.end({
output: responseText,
metadata: { toolCallHistory },
});
const finishReason = await result.finishReason;
if (finishReason !== 'tool-calls') {
trace.update({
output: responseText,
metadata: { toolCallHistory },
});
return { output: responseText, toolCallHistory };
}
}
} finally {
await mcpClient.close();
await langfuse.flushAsync();
}
}
Finally, tie the pieces together in an entry point:
// src/index.ts
import { createDataset } from './dataset';
import './telemetry'; // Initialize telemetry
async function main() {
// Step 1: Create dataset (run once)
await createDataset();
// Step 2: Run experiments
// Import and run from experiment.ts
console.log('Pipeline complete! View results in Langfuse dashboard.');
}
main().catch(console.error);
After running experiments, navigate to your Langfuse project to:
- View Traces: See individual agent runs with full tool call sequences
- Compare Experiments: Use the Datasets tab to compare configurations side-by-side
- Analyze Evaluations: Review LLM-as-a-judge scores across test cases
- Debug Issues: Click into specific traces to see what went wrong
| Aspect | Pydantic AI | AI SDK |
|---|---|---|
| Language | Python | TypeScript/JavaScript |
| Tracing | Agent.instrument_all() | OpenTelemetry + LangfuseExporter |
| MCP Integration | MCPServerStreamableHTTP | createMCPClient |
| Agent Loop | agent.run() | generateText with maxSteps |
| Streaming | Async generators | streamText with fullStream |
- Start with Final Response Evaluation: Validate outputs before debugging trajectories
- Use Representative Test Cases: Include edge cases and common queries
- Compare Configurations Systematically: Change one variable at a time
- Monitor Costs: Track token usage across experiments in Langfuse
- Iterate on Prompts: Use evaluation scores to guide prompt improvements
- Add more test cases to your dataset
- Create custom evaluators for domain-specific metrics
- Set up scheduled evaluation runs for regression testing
- Integrate with CI/CD for automated quality gates (a sketch follows below)
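For the CI/CD item, a quality gate can be as simple as re-running the agent over the evaluation dataset and failing the build when an average heuristic score drops below a threshold. A hedged sketch reusing runAgent and the dataset from this guide (the file path and threshold are assumptions):
// scripts/quality-gate.ts — sketch of a CI quality gate (file path and
// threshold are assumptions). Re-runs the agent over the evaluation dataset
// and fails the build if the average heuristic score drops below a threshold.
import { Langfuse } from 'langfuse';
import { runAgent } from '../src/agent';
import 'dotenv/config';

const langfuse = new Langfuse();
const THRESHOLD = 0.7; // tune to your own baseline

async function main() {
  const dataset = await langfuse.getDataset('ai-sdk-mcp-agent-evaluation');
  let total = 0;

  for (const item of dataset.items) {
    const { question } = item.input as { question: string };
    const expected = item.expectedOutput as { responseFacts: string[] };
    const { output } = await runAgent(question);

    // Keyword heuristic: fraction of expected facts present in the answer.
    const hits = expected.responseFacts.filter((fact) =>
      output.toLowerCase().includes(fact.toLowerCase())
    ).length;
    total += hits / expected.responseFacts.length;
  }

  const average = total / dataset.items.length;
  console.log(`Average fact-accuracy score: ${average.toFixed(2)}`);
  if (average < THRESHOLD) {
    console.error(`Quality gate failed (threshold: ${THRESHOLD})`);
    process.exit(1);
  }
}

main().catch((error) => {
  console.error(error);
  process.exit(1);
});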
Built with Vercel AI SDK and Langfuse