This guide demonstrates how to evaluate LLM agents built with the Vercel AI SDK using Langfuse's evaluation framework. We'll walk through a 3-phase evaluation approach, moving from manual inspection to automated testing at scale.
Agents are systems operating in continuous loops where the LLM:
- Receives input
- Decides on an action (like calling external tools)
- Receives feedback from the environment
- Repeats until generating a final answer
This sequence is called a trace or trajectory.
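As a rough, framework-agnostic sketch, the loop can be written like this; callModel and executeTool are hypothetical stand-ins (passed in as parameters) for an LLM call and a tool invocation, so the snippet stays self-contained:
// Minimal agent-loop sketch. callModel and executeTool are hypothetical
// dependencies injected as parameters; they stand in for an LLM call and a
// tool invocation.
type AgentStep =
  | { type: 'final-answer'; text: string }
  | { type: 'tool-call'; toolName: string; args: unknown };

export async function agentLoop(
  input: string,
  callModel: (history: string[]) => Promise<AgentStep>,
  executeTool: (toolName: string, args: unknown) => Promise<unknown>
): Promise<{ output: string; trajectory: Array<{ toolName: string; args: unknown }> }> {
  const history = [input];
  const trajectory: Array<{ toolName: string; args: unknown }> = [];

  while (true) {
    const step = await callModel(history); // the LLM decides: answer or call a tool
    if (step.type === 'final-answer') {
      return { output: step.text, trajectory }; // loop ends with the final answer
    }
    trajectory.push({ toolName: step.toolName, args: step.args });
    const observation = await executeTool(step.toolName, step.args); // environment feedback
    history.push(JSON.stringify({ tool: step.toolName, observation })); // feed result back in
  }
}
The trajectory array collected here is exactly what the later evaluation phases inspect.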
Three persistent challenges emerge when building agents:
- Understanding behavior: What do agents actually do on real traffic?
- Specification: How do we properly specify correct behavior through prompts?
- Generalization: Do agents work beyond handpicked examples?
| Strategy | Description | Use Case |
|---|---|---|
| Final Response (Black-Box) | Compares agent output against expected facts | Correctness validation |
| Trajectory (Glass-Box) | Validates the sequence of tool calls | Process verification |
| Search Quality (White-Box) | Tests individual decision-making steps | Component testing |
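As an illustration, the three strategies can be read as three scoring functions over the same agent run. The helpers below are simple keyword heuristics and not part of the Langfuse API; the searchLangfuseDocs tool name matches the dataset used later in this guide, while its query argument name is an assumption:
// Illustrative scoring helpers for the three strategies; the data shapes
// mirror the AgentResult and expectedOutput structures used later on.
export interface RunUnderTest {
  output: string;
  toolCallHistory: Array<{ toolName: string; args: Record<string, unknown> }>;
}

// Final Response (black-box): fraction of expected facts mentioned in the answer.
export function scoreFinalResponse(run: RunUnderTest, expectedFacts: string[]): number {
  const hits = expectedFacts.filter((fact) =>
    run.output.toLowerCase().includes(fact.toLowerCase())
  );
  return hits.length / expectedFacts.length;
}

// Trajectory (glass-box): how much of the expected tool sequence appears in order.
export function scoreTrajectory(run: RunUnderTest, expectedTools: string[]): number {
  const actual = run.toolCallHistory.map((call) => call.toolName);
  let matched = 0;
  for (const tool of actual) {
    if (tool === expectedTools[matched]) matched++;
  }
  return matched / expectedTools.length;
}

// Search Quality (white-box): inspect a single decision, e.g. the query sent to
// the docs search. The `query` argument name is an assumption about the tool's schema.
export function scoreSearchQuery(run: RunUnderTest, expectedTerm: string): number {
  const search = run.toolCallHistory.find((call) => call.toolName === 'searchLangfuseDocs');
  if (!search) return 0;
  const query = String(search.args.query ?? '');
  return query.toLowerCase().includes(expectedTerm.toLowerCase()) ? 1 : 0;
}
With the evaluation strategies in mind, install the required packages: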
npm install ai @ai-sdk/openai @ai-sdk/mcp langfuse langfuse-vercel zod dotenv
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
Create a .env file with your credentials:
# Langfuse Configuration
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com # EU region
# LANGFUSE_HOST=https://us.cloud.langfuse.com # US region
# OpenAI Configuration
OPENAI_API_KEY=sk-proj-...
The AI SDK supports tracing via OpenTelemetry. With the LangfuseExporter, you can collect these traces in Langfuse.
// src/telemetry.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { LangfuseExporter } from 'langfuse-vercel';
export const sdk = new NodeSDK({
traceExporter: new LangfuseExporter(),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
console.log('Langfuse OpenTelemetry tracing enabled');
Here's how to create an agent that connects to the Langfuse Docs MCP server:
// src/agent.ts
import { createMCPClient } from '@ai-sdk/mcp';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import 'dotenv/config';
const LANGFUSE_MCP_URL = 'https://langfuse.com/api/mcp';
interface AgentConfig {
systemPrompt?: string;
model?: string;
}
interface AgentResult {
output: string;
toolCallHistory: Array<{ toolName: string; args: Record<string, unknown> }>;
}
export async function runAgent(
question: string,
config: AgentConfig = {}
): Promise<AgentResult> {
const {
systemPrompt = 'You are an expert on Langfuse. Answer questions accurately using the available tools.',
model = 'gpt-4o-mini',
} = config;
const toolCallHistory: Array<{ toolName: string; args: Record<string, unknown> }> = [];
// Connect to Langfuse Docs MCP server
const mcpClient = await createMCPClient({
transport: {
type: 'sse',
url: LANGFUSE_MCP_URL,
},
});
try {
// Get tools from MCP server
const mcpTools = await mcpClient.tools();
// Run the agent loop
const result = await generateText({
model: openai(model),
system: systemPrompt,
prompt: question,
tools: mcpTools,
maxSteps: 10, // Allow up to 10 tool call iterations
experimental_telemetry: {
isEnabled: true,
functionId: 'langfuse-docs-agent',
metadata: {
question,
systemPrompt,
model,
},
},
onStepFinish: async ({ toolCalls }) => {
// Track tool calls for trajectory evaluation
if (toolCalls) {
for (const toolCall of toolCalls) {
toolCallHistory.push({
toolName: toolCall.toolName,
args: toolCall.args as Record<string, unknown>,
});
}
}
},
});
return {
output: result.text,
toolCallHistory,
};
} finally {
await mcpClient.close();
}
}
Define test cases with expected outputs for evaluation:
// src/dataset.ts
import { Langfuse } from 'langfuse';
import 'dotenv/config';
const langfuse = new Langfuse();
interface TestCase {
input: { question: string };
expectedOutput: {
responseFacts: string[];
trajectory: string[];
searchTerm?: string;
};
}
const testCases: TestCase[] = [
{
input: { question: 'What is Langfuse?' },
expectedOutput: {
responseFacts: [
'Open Source LLM Engineering Platform',
'Product modules: Tracing, Evaluation and Prompt Management',
],
trajectory: ['getLangfuseOverview'],
},
},
{
input: { question: 'How to trace a TypeScript application with Langfuse?' },
expectedOutput: {
responseFacts: [
'AI SDK integration via OpenTelemetry',
'Use LangfuseExporter with experimental_telemetry',
],
trajectory: ['getLangfuseOverview', 'searchLangfuseDocs'],
searchTerm: 'TypeScript Tracing',
},
},
{
input: { question: 'How to connect to the Langfuse Docs MCP server?' },
expectedOutput: {
responseFacts: [
'Connect via the MCP server endpoint: https://langfuse.com/api/mcp',
'Transport protocol: streamableHttp or SSE',
],
trajectory: ['getLangfuseOverview'],
},
},
{
input: { question: 'How long are traces retained in Langfuse?' },
expectedOutput: {
responseFacts: [
'By default, traces are retained indefinitely',
'You can set custom data retention policy in the project settings',
],
trajectory: ['getLangfuseOverview', 'searchLangfuseDocs'],
searchTerm: 'Data retention',
},
},
];
export async function createDataset() {
const DATASET_NAME = 'ai-sdk-mcp-agent-evaluation';
// Create or get the dataset
const dataset = await langfuse.createDataset({
name: DATASET_NAME,
description: 'Evaluation dataset for AI SDK agent with Langfuse MCP tools',
});
// Add test cases
for (const testCase of testCases) {
await langfuse.createDatasetItem({
datasetName: DATASET_NAME,
input: testCase.input,
expectedOutput: testCase.expectedOutput,
});
}
console.log(`Dataset "${DATASET_NAME}" created with ${testCases.length} items`);
return dataset;
}
In the Langfuse UI, create evaluators for automated scoring:
The first evaluator judges the factual accuracy of the final response:
You are evaluating an AI agent's response for factual accuracy.
Expected facts the response should contain:
{{expected_output.responseFacts}}
Agent's response:
{{output}}
Score from 0-1 based on how many expected facts are present and accurate.
Return only a number between 0 and 1.
The second evaluator judges the tool-call trajectory:
You are evaluating an AI agent's tool usage trajectory.
Expected tool sequence:
{{expected_output.trajectory}}
Actual tool calls made:
{{metadata.toolCallHistory}}
Score from 0-1 based on:
- Did the agent use the expected tools?
- Was the order reasonable?
- Were unnecessary tools avoided?
Return only a number between 0 and 1.
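If you prefer to attach scores from code instead of (or alongside) UI-managed evaluators, the Langfuse SDK can also record scores against a trace by its id. A minimal sketch using a keyword heuristic rather than an LLM judge (the file name and score names are arbitrary):
// src/score.ts — sketch of programmatic scoring as a complement to
// UI-managed LLM-as-a-judge evaluators. Uses a keyword heuristic, not an LLM.
import { Langfuse } from 'langfuse';
import 'dotenv/config';

const langfuse = new Langfuse();

export async function scoreRun(
  traceId: string,
  output: string,
  toolCallHistory: Array<{ toolName: string }>,
  expected: { responseFacts: string[]; trajectory: string[] }
) {
  // Fraction of expected facts mentioned verbatim in the agent's answer.
  const factHits = expected.responseFacts.filter((fact) =>
    output.toLowerCase().includes(fact.toLowerCase())
  ).length;

  langfuse.score({
    traceId,
    name: 'fact-accuracy-heuristic',
    value: factHits / expected.responseFacts.length,
    comment: 'Keyword heuristic; use alongside the LLM-as-a-judge evaluator.',
  });

  // Fraction of expected tools that were called at least once.
  const calledTools = new Set(toolCallHistory.map((call) => call.toolName));
  const toolHits = expected.trajectory.filter((tool) => calledTools.has(tool)).length;

  langfuse.score({
    traceId,
    name: 'trajectory-heuristic',
    value: toolHits / expected.trajectory.length,
  });

  await langfuse.flushAsync();
}
In the experiment script below, the id of the created trace (trace.id) can be passed as traceId.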
Create a script to run experiments across different configurations:
// src/experiment.ts
import { Langfuse } from 'langfuse';
import { runAgent } from './agent';
import { sdk } from './telemetry';
import 'dotenv/config';
const langfuse = new Langfuse();
const DATASET_NAME = 'ai-sdk-mcp-agent-evaluation';
interface ExperimentConfig {
name: string;
description: string;
systemPrompt: string;
model: string;
}
const configurations: ExperimentConfig[] = [
{
name: 'baseline-gpt4o-mini',
description: 'Baseline with GPT-4o-mini',
systemPrompt:
'You are an expert on Langfuse. Answer user questions accurately and concisely using the available MCP tools. Cite sources when appropriate.',
model: 'gpt-4o-mini',
},
{
name: 'nudge-search-gpt4o-mini',
description: 'Nudge to search with GPT-4o-mini',
systemPrompt:
'You are an expert on Langfuse. Answer user questions accurately and concisely using the available MCP tools. Always cite sources when appropriate. When unsure, use getLangfuseOverview then search the docs. You can use these tools multiple times.',
model: 'gpt-4o-mini',
},
{
name: 'baseline-gpt4o',
description: 'Baseline with GPT-4o',
systemPrompt:
'You are an expert on Langfuse. Answer user questions accurately and concisely using the available MCP tools. Cite sources when appropriate.',
model: 'gpt-4o',
},
];
async function runExperiment(config: ExperimentConfig) {
console.log(`\nRunning experiment: ${config.name}`);
const dataset = await langfuse.getDataset(DATASET_NAME);
const items = dataset.items;
for (const item of items) {
const trace = langfuse.trace({
name: `experiment-${config.name}`,
input: item.input,
metadata: {
experimentName: config.name,
model: config.model,
},
});
try {
const { output, toolCallHistory } = await runAgent(
(item.input as { question: string }).question,
{
systemPrompt: config.systemPrompt,
model: config.model,
}
);
trace.update({
output,
metadata: {
toolCallHistory,
experimentName: config.name,
model: config.model,
},
});
// Link trace to dataset item for evaluation
await item.link(trace, config.name, {
description: config.description,
});
console.log(` Completed: ${(item.input as { question: string }).question}`);
} catch (error) {
console.error(` Error: ${error}`);
      trace.update({
        output: `Error: ${error}`,
        tags: ['error'], // level is an observation-level attribute, so mark the trace with a tag instead
      });
}
}
console.log(`Experiment ${config.name} completed`);
}
async function main() {
try {
for (const config of configurations) {
await runExperiment(config);
}
// Flush traces to Langfuse
await langfuse.flushAsync();
console.log('\nAll experiments completed! Check Langfuse for results.');
} finally {
await sdk.shutdown();
}
}
main().catch(console.error);
For more control over the agent loop, use this pattern:
// src/manual-agent.ts
import { createMCPClient } from '@ai-sdk/mcp';
import { streamText, ModelMessage } from 'ai';
import { openai } from '@ai-sdk/openai';
import { Langfuse } from 'langfuse';
const langfuse = new Langfuse();
export async function runManualAgentLoop(question: string) {
const trace = langfuse.trace({
name: 'manual-agent-loop',
input: { question },
});
const messages: ModelMessage[] = [
{
role: 'system',
content: 'You are an expert on Langfuse. Use the available tools to answer questions.',
},
{
role: 'user',
content: question,
},
];
const toolCallHistory: Array<{ toolName: string; args: unknown }> = [];
const mcpClient = await createMCPClient({
transport: {
type: 'sse',
url: 'https://langfuse.com/api/mcp',
},
});
try {
const tools = await mcpClient.tools();
while (true) {
const generation = trace.generation({
name: 'llm-call',
input: messages,
model: 'gpt-4o-mini',
});
const result = streamText({
model: openai('gpt-4o-mini'),
messages,
tools,
experimental_telemetry: { isEnabled: true },
});
// Stream the response
let responseText = '';
for await (const chunk of result.fullStream) {
if (chunk.type === 'text-delta') {
responseText += chunk.text;
process.stdout.write(chunk.text);
}
if (chunk.type === 'tool-call') {
console.log(`\nCalling tool: ${chunk.toolName}`);
toolCallHistory.push({
toolName: chunk.toolName,
args: chunk.args,
});
}
}
const responseMessages = (await result.response).messages;
messages.push(...responseMessages);
generation.end({
output: responseText,
metadata: { toolCallHistory },
});
const finishReason = await result.finishReason;
if (finishReason !== 'tool-calls') {
trace.update({
output: responseText,
metadata: { toolCallHistory },
});
return { output: responseText, toolCallHistory };
}
}
} finally {
await mcpClient.close();
await langfuse.flushAsync();
}
}
Finally, tie the pieces together in an entry point:
// src/index.ts
import { createDataset } from './dataset';
import './telemetry'; // Initialize telemetry
async function main() {
// Step 1: Create dataset (run once)
await createDataset();
// Step 2: Run experiments
// Import and run from experiment.ts
console.log('Pipeline complete! View results in Langfuse dashboard.');
}
main().catch(console.error);
After running experiments, navigate to your Langfuse project to:
- View Traces: See individual agent runs with full tool call sequences
- Compare Experiments: Use the Datasets tab to compare configurations side-by-side
- Analyze Evaluations: Review LLM-as-a-judge scores across test cases
- Debug Issues: Click into specific traces to see what went wrong
| Aspect | Pydantic AI | AI SDK |
|---|---|---|
| Language | Python | TypeScript/JavaScript |
| Tracing | Agent.instrument_all() | OpenTelemetry + LangfuseExporter |
| MCP Integration | MCPServerStreamableHTTP | createMCPClient |
| Agent Loop | agent.run() | generateText with maxSteps |
| Streaming | Async generators | streamText with fullStream |
- Start with Final Response Evaluation: Validate outputs before debugging trajectories
- Use Representative Test Cases: Include edge cases and common queries
- Compare Configurations Systematically: Change one variable at a time
- Monitor Costs: Track token usage across experiments in Langfuse
- Iterate on Prompts: Use evaluation scores to guide prompt improvements
- Add more test cases to your dataset
- Create custom evaluators for domain-specific metrics
- Set up scheduled evaluation runs for regression testing
- Integrate with CI/CD for automated quality gates (a sketch follows below)
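For the CI/CD item, a quality gate can be as simple as re-running the agent over the evaluation dataset and failing the build when an average heuristic score drops below a threshold. A hedged sketch reusing runAgent and the dataset from this guide (the file path and threshold are assumptions):
// scripts/quality-gate.ts — sketch of a CI quality gate (file path and
// threshold are assumptions). Re-runs the agent over the evaluation dataset
// and fails the build if the average heuristic score drops below a threshold.
import { Langfuse } from 'langfuse';
import { runAgent } from '../src/agent';
import 'dotenv/config';

const langfuse = new Langfuse();
const THRESHOLD = 0.7; // tune to your own baseline

async function main() {
  const dataset = await langfuse.getDataset('ai-sdk-mcp-agent-evaluation');
  let total = 0;

  for (const item of dataset.items) {
    const { question } = item.input as { question: string };
    const expected = item.expectedOutput as { responseFacts: string[] };
    const { output } = await runAgent(question);

    // Keyword heuristic: fraction of expected facts present in the answer.
    const hits = expected.responseFacts.filter((fact) =>
      output.toLowerCase().includes(fact.toLowerCase())
    ).length;
    total += hits / expected.responseFacts.length;
  }

  const average = total / dataset.items.length;
  console.log(`Average fact-accuracy score: ${average.toFixed(2)}`);
  if (average < THRESHOLD) {
    console.error(`Quality gate failed (threshold: ${THRESHOLD})`);
    process.exit(1);
  }
}

main().catch((error) => {
  console.error(error);
  process.exit(1);
});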
Built with Vercel AI SDK and Langfuse