Agent Evaluation Guide for Vercel AI SDK with Langfuse

This guide demonstrates how to evaluate LLM agents built with the Vercel AI SDK using Langfuse's evaluation framework. We'll walk through three evaluation strategies and an end-to-end setup, moving from manual trace inspection to automated testing at scale.

What is an LLM Agent?

Agents are systems operating in continuous loops where the LLM:

  1. Receives input
  2. Decides on an action (like calling external tools)
  3. Receives feedback from the environment
  4. Repeats until generating a final answer

This sequence is called a trace or trajectory.

Why Evaluate Agents?

Three persistent challenges emerge when building agents:

  • Understanding behavior: What do agents actually do on real traffic?
  • Specification: How do we properly specify correct behavior through prompts?
  • Generalization: Do agents work beyond handpicked examples?

Three Evaluation Strategies

  • Final Response (Black-Box): compares agent output against expected facts. Use case: correctness validation.
  • Trajectory (Glass-Box): validates the sequence of tool calls. Use case: process verification.
  • Search Quality (White-Box): tests individual decision-making steps. Use case: component testing.
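
To make the first two strategies concrete before wiring up LLM-as-a-judge evaluators, here is a minimal sketch of deterministic versions of them. The file and function names are illustrative rather than part of any SDK, and the inputs mirror the dataset schema defined in Step 5.

// src/checks.ts (illustrative helpers, not part of any SDK)

// Final Response (black-box): fraction of expected facts that appear verbatim in the output.
export function checkResponseFacts(output: string, expectedFacts: string[]): number {
  const normalized = output.toLowerCase();
  const hits = expectedFacts.filter((fact) => normalized.includes(fact.toLowerCase()));
  return expectedFacts.length === 0 ? 1 : hits.length / expectedFacts.length;
}

// Trajectory (glass-box): 1 if the expected tools appear in order among the actual calls, else 0.
export function checkTrajectory(actualTools: string[], expectedTools: string[]): number {
  let cursor = 0;
  for (const tool of actualTools) {
    if (cursor < expectedTools.length && tool === expectedTools[cursor]) cursor++;
  }
  return cursor === expectedTools.length ? 1 : 0;
}

Substring matching is deliberately crude; the LLM-as-a-judge evaluators in Step 6 handle paraphrased answers, while a deterministic check like checkTrajectory is a cheap complement for process verification.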

Step 1: Install Dependencies

npm install ai @ai-sdk/openai @ai-sdk/mcp langfuse langfuse-vercel zod dotenv
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node

Step 2: Configure Environment Variables

Create a .env file with your credentials:

# Langfuse Configuration
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com  # EU region
# LANGFUSE_HOST=https://us.cloud.langfuse.com  # US region

# OpenAI Configuration
OPENAI_API_KEY=sk-proj-...
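
If you want the process to fail fast on missing credentials, a small zod schema (zod is already in the Step 1 dependency list) can validate these variables at startup. This is an optional sketch, not something Langfuse or the AI SDK requires:

// src/env.ts (optional startup check for the variables above)
import { z } from 'zod';
import 'dotenv/config';

const EnvSchema = z.object({
  LANGFUSE_PUBLIC_KEY: z.string().startsWith('pk-lf-'),
  LANGFUSE_SECRET_KEY: z.string().startsWith('sk-lf-'),
  LANGFUSE_HOST: z.string().url().default('https://cloud.langfuse.com'),
  OPENAI_API_KEY: z.string().min(1),
});

// Throws a readable error if anything is missing or malformed.
export const env = EnvSchema.parse(process.env);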

Step 3: Set Up Langfuse Tracing with OpenTelemetry

The AI SDK supports tracing via OpenTelemetry. With LangfuseExporter, you can collect traces in Langfuse.

// src/telemetry.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { LangfuseExporter } from 'langfuse-vercel';

export const sdk = new NodeSDK({
  traceExporter: new LangfuseExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
console.log('Langfuse OpenTelemetry tracing enabled');

Step 4: Build the Agent with MCP Integration

Here's how to create an agent that connects to the Langfuse Docs MCP server:

// src/agent.ts
import { createMCPClient } from '@ai-sdk/mcp';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import 'dotenv/config';

const LANGFUSE_MCP_URL = 'https://langfuse.com/api/mcp';

interface AgentConfig {
  systemPrompt?: string;
  model?: string;
}

interface AgentResult {
  output: string;
  toolCallHistory: Array<{ toolName: string; args: Record<string, unknown> }>;
}

export async function runAgent(
  question: string,
  config: AgentConfig = {}
): Promise<AgentResult> {
  const {
    systemPrompt = 'You are an expert on Langfuse. Answer questions accurately using the available tools.',
    model = 'gpt-4o-mini',
  } = config;

  const toolCallHistory: Array<{ toolName: string; args: Record<string, unknown> }> = [];

  // Connect to Langfuse Docs MCP server
  const mcpClient = await createMCPClient({
    transport: {
      type: 'sse',
      url: LANGFUSE_MCP_URL,
    },
  });

  try {
    // Get tools from MCP server
    const mcpTools = await mcpClient.tools();

    // Run the agent loop
    const result = await generateText({
      model: openai(model),
      system: systemPrompt,
      prompt: question,
      tools: mcpTools,
      maxSteps: 10, // Allow up to 10 tool call iterations
      experimental_telemetry: {
        isEnabled: true,
        functionId: 'langfuse-docs-agent',
        metadata: {
          question,
          systemPrompt,
          model,
        },
      },
      onStepFinish: async ({ toolCalls }) => {
        // Track tool calls for trajectory evaluation
        if (toolCalls) {
          for (const toolCall of toolCalls) {
            toolCallHistory.push({
              toolName: toolCall.toolName,
              args: toolCall.args as Record<string, unknown>,
            });
          }
        }
      },
    });

    return {
      output: result.text,
      toolCallHistory,
    };
  } finally {
    await mcpClient.close();
  }
}
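
A minimal way to exercise the agent is a standalone smoke test that imports the telemetry module first (so spans get exported) and shuts it down at the end. The file name and question here are just examples:

// src/run-once.ts (ad-hoc smoke test for the agent)
import { sdk } from './telemetry';
import { runAgent } from './agent';

async function main() {
  try {
    const { output, toolCallHistory } = await runAgent('What is Langfuse?');
    console.log('Answer:\n' + output);
    console.log('Tools used:', toolCallHistory.map((c) => c.toolName).join(' -> '));
  } finally {
    await sdk.shutdown(); // flush buffered spans to Langfuse before exiting
  }
}

main().catch(console.error);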

Step 5: Create a Benchmark Dataset

Define test cases with expected outputs for evaluation:

// src/dataset.ts
import { Langfuse } from 'langfuse';
import 'dotenv/config';

const langfuse = new Langfuse();

interface TestCase {
  input: { question: string };
  expectedOutput: {
    responseFacts: string[];
    trajectory: string[];
    searchTerm?: string;
  };
}

const testCases: TestCase[] = [
  {
    input: { question: 'What is Langfuse?' },
    expectedOutput: {
      responseFacts: [
        'Open Source LLM Engineering Platform',
        'Product modules: Tracing, Evaluation and Prompt Management',
      ],
      trajectory: ['getLangfuseOverview'],
    },
  },
  {
    input: { question: 'How to trace a TypeScript application with Langfuse?' },
    expectedOutput: {
      responseFacts: [
        'AI SDK integration via OpenTelemetry',
        'Use LangfuseExporter with experimental_telemetry',
      ],
      trajectory: ['getLangfuseOverview', 'searchLangfuseDocs'],
      searchTerm: 'TypeScript Tracing',
    },
  },
  {
    input: { question: 'How to connect to the Langfuse Docs MCP server?' },
    expectedOutput: {
      responseFacts: [
        'Connect via the MCP server endpoint: https://langfuse.com/api/mcp',
        'Transport protocol: streamableHttp or SSE',
      ],
      trajectory: ['getLangfuseOverview'],
    },
  },
  {
    input: { question: 'How long are traces retained in Langfuse?' },
    expectedOutput: {
      responseFacts: [
        'By default, traces are retained indefinitely',
        'You can set custom data retention policy in the project settings',
      ],
      trajectory: ['getLangfuseOverview', 'searchLangfuseDocs'],
      searchTerm: 'Data retention',
    },
  },
];

export async function createDataset() {
  const DATASET_NAME = 'ai-sdk-mcp-agent-evaluation';

  // Create or get the dataset
  const dataset = await langfuse.createDataset({
    name: DATASET_NAME,
    description: 'Evaluation dataset for AI SDK agent with Langfuse MCP tools',
  });

  // Add test cases
  for (const testCase of testCases) {
    await langfuse.createDatasetItem({
      datasetName: DATASET_NAME,
      input: testCase.input,
      expectedOutput: testCase.expectedOutput,
    });
  }

  console.log(`Dataset "${DATASET_NAME}" created with ${testCases.length} items`);
  return dataset;
}
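
The dataset only needs to be created once; a tiny runner script (or the pipeline entry point shown later) can invoke it. Re-running it may create duplicate items, so treat it as a one-time setup step. The file name below is just an example:

// src/seed.ts (one-off script to seed the benchmark dataset)
import { createDataset } from './dataset';

createDataset()
  .then(() => console.log('Done seeding dataset'))
  .catch(console.error);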

Step 6: Set Up LLM-as-a-Judge Evaluators

In the Langfuse UI, create evaluators for automated scoring:

Factual Accuracy Evaluator

You are evaluating an AI agent's response for factual accuracy.

Expected facts the response should contain:
{{expected_output.responseFacts}}

Agent's response:
{{output}}

Score from 0-1 based on how many expected facts are present and accurate.
Return only a number between 0 and 1.

Trajectory Evaluator

You are evaluating an AI agent's tool usage trajectory.

Expected tool sequence:
{{expected_output.trajectory}}

Actual tool calls made:
{{metadata.toolCallHistory}}

Score from 0-1 based on:
- Did the agent use the expected tools?
- Was the order reasonable?
- Were unnecessary tools avoided?

Return only a number between 0 and 1.
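
These evaluators run inside Langfuse on the traces produced by your experiments. If you also want deterministic scores, for example the trajectory check sketched under "Three Evaluation Strategies", you can attach them from code with langfuse.score(). The helper file and score name below are illustrative:

// src/score-trajectory.ts (optional: push a deterministic score from code)
import { Langfuse } from 'langfuse';
import { checkTrajectory } from './checks'; // helper sketched earlier in this guide

const langfuse = new Langfuse();

export function scoreTrajectory(
  traceId: string,
  actualTools: string[],
  expectedTools: string[]
) {
  // The score is attached to the trace and shows up next to LLM-as-a-judge scores.
  langfuse.score({
    traceId,
    name: 'trajectory-exact-order', // arbitrary score name
    value: checkTrajectory(actualTools, expectedTools),
    comment: 'Deterministic in-order match of the expected tool sequence',
  });
}

In the experiment script from the next step, this could be called right after trace.update, using trace.id, the recorded toolCallHistory, and the item's expectedOutput.trajectory.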

Step 7: Run Experiments

Create a script to run experiments across different configurations:

// src/experiment.ts
import { Langfuse } from 'langfuse';
import { runAgent } from './agent';
import { sdk } from './telemetry';
import 'dotenv/config';

const langfuse = new Langfuse();

const DATASET_NAME = 'ai-sdk-mcp-agent-evaluation';

interface ExperimentConfig {
  name: string;
  description: string;
  systemPrompt: string;
  model: string;
}

const configurations: ExperimentConfig[] = [
  {
    name: 'baseline-gpt4o-mini',
    description: 'Baseline with GPT-4o-mini',
    systemPrompt:
      'You are an expert on Langfuse. Answer user questions accurately and concisely using the available MCP tools. Cite sources when appropriate.',
    model: 'gpt-4o-mini',
  },
  {
    name: 'nudge-search-gpt4o-mini',
    description: 'Nudge to search with GPT-4o-mini',
    systemPrompt:
      'You are an expert on Langfuse. Answer user questions accurately and concisely using the available MCP tools. Always cite sources when appropriate. When unsure, use getLangfuseOverview then search the docs. You can use these tools multiple times.',
    model: 'gpt-4o-mini',
  },
  {
    name: 'baseline-gpt4o',
    description: 'Baseline with GPT-4o',
    systemPrompt:
      'You are an expert on Langfuse. Answer user questions accurately and concisely using the available MCP tools. Cite sources when appropriate.',
    model: 'gpt-4o',
  },
];

async function runExperiment(config: ExperimentConfig) {
  console.log(`\nRunning experiment: ${config.name}`);

  const dataset = await langfuse.getDataset(DATASET_NAME);
  const items = dataset.items;

  for (const item of items) {
    const trace = langfuse.trace({
      name: `experiment-${config.name}`,
      input: item.input,
      metadata: {
        experimentName: config.name,
        model: config.model,
      },
    });

    try {
      const { output, toolCallHistory } = await runAgent(
        (item.input as { question: string }).question,
        {
          systemPrompt: config.systemPrompt,
          model: config.model,
        }
      );

      trace.update({
        output,
        metadata: {
          toolCallHistory,
          experimentName: config.name,
          model: config.model,
        },
      });

      // Link trace to dataset item for evaluation
      await item.link(trace, config.name, {
        description: config.description,
      });

      console.log(`  Completed: ${(item.input as { question: string }).question}`);
    } catch (error) {
      console.error(`  Error: ${error}`);
      trace.update({
        output: `Error: ${error}`,
        level: 'ERROR',
      });
    }
  }

  console.log(`Experiment ${config.name} completed`);
}

async function main() {
  try {
    for (const config of configurations) {
      await runExperiment(config);
    }

    // Flush traces to Langfuse
    await langfuse.flushAsync();
    console.log('\nAll experiments completed! Check Langfuse for results.');
  } finally {
    await sdk.shutdown();
  }
}

main().catch(console.error);

Step 8: Manual Agent Loop (Alternative Pattern)

For more control over the agent loop, use this pattern:

// src/manual-agent.ts
import { createMCPClient } from '@ai-sdk/mcp';
import { streamText, ModelMessage } from 'ai';
import { openai } from '@ai-sdk/openai';
import { Langfuse } from 'langfuse';

const langfuse = new Langfuse();

export async function runManualAgentLoop(question: string) {
  const trace = langfuse.trace({
    name: 'manual-agent-loop',
    input: { question },
  });

  const messages: ModelMessage[] = [
    {
      role: 'system',
      content: 'You are an expert on Langfuse. Use the available tools to answer questions.',
    },
    {
      role: 'user',
      content: question,
    },
  ];

  const toolCallHistory: Array<{ toolName: string; args: unknown }> = [];

  const mcpClient = await createMCPClient({
    transport: {
      type: 'sse',
      url: 'https://langfuse.com/api/mcp',
    },
  });

  try {
    const tools = await mcpClient.tools();

    while (true) {
      const generation = trace.generation({
        name: 'llm-call',
        input: messages,
        model: 'gpt-4o-mini',
      });

      const result = streamText({
        model: openai('gpt-4o-mini'),
        messages,
        tools,
        experimental_telemetry: { isEnabled: true },
      });

      // Stream the response
      let responseText = '';
      for await (const chunk of result.fullStream) {
        if (chunk.type === 'text-delta') {
          responseText += chunk.text;
          process.stdout.write(chunk.text);
        }
        if (chunk.type === 'tool-call') {
          console.log(`\nCalling tool: ${chunk.toolName}`);
          toolCallHistory.push({
            toolName: chunk.toolName,
            args: chunk.args,
          });
        }
      }

      const responseMessages = (await result.response).messages;
      messages.push(...responseMessages);

      generation.end({
        output: responseText,
        metadata: { toolCallHistory },
      });

      const finishReason = await result.finishReason;

      if (finishReason !== 'tool-calls') {
        trace.update({
          output: responseText,
          metadata: { toolCallHistory },
        });
        return { output: responseText, toolCallHistory };
      }
    }
  } finally {
    await mcpClient.close();
    await langfuse.flushAsync();
  }
}
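
Calling the manual loop looks the same as the generateText version; it streams text to stdout as it runs. The file name and question are examples:

// src/run-manual.ts (smoke test for the manual loop)
import { sdk } from './telemetry';
import { runManualAgentLoop } from './manual-agent';

async function main() {
  try {
    const { toolCallHistory } = await runManualAgentLoop(
      'How to trace a TypeScript application with Langfuse?'
    );
    console.log('\nTools used:', toolCallHistory.map((c) => c.toolName).join(' -> '));
  } finally {
    await sdk.shutdown();
  }
}

main().catch(console.error);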

Running the Full Evaluation Pipeline

// src/index.ts
import { createDataset } from './dataset';
import './telemetry'; // Initialize telemetry

async function main() {
  // Step 1: Create dataset (run once)
  await createDataset();

  // Step 2: Run experiments
  // Import and run from experiment.ts

  console.log('Pipeline complete! View results in Langfuse dashboard.');
}

main().catch(console.error);
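
Assuming a TypeScript runner such as tsx (not included in the Step 1 install commands), the pipeline can be run in two passes:

# Seed the dataset once (runs createDataset via the entry point above)
npx tsx src/index.ts

# Run all experiment configurations against the dataset
npx tsx src/experiment.ts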

Viewing Results in Langfuse

After running experiments, navigate to your Langfuse project to:

  1. View Traces: See individual agent runs with full tool call sequences
  2. Compare Experiments: Use the Datasets tab to compare configurations side-by-side
  3. Analyze Evaluations: Review LLM-as-a-judge scores across test cases
  4. Debug Issues: Click into specific traces to see what went wrong

Key Differences from Pydantic AI

For each aspect, the Pydantic AI construct is listed first, followed by the AI SDK equivalent:

  • Language: Python vs. TypeScript/JavaScript
  • Tracing: Agent.instrument_all() vs. OpenTelemetry + LangfuseExporter
  • MCP Integration: MCPServerStreamableHTTP vs. createMCPClient
  • Agent Loop: agent.run() vs. generateText with maxSteps
  • Streaming: async generators vs. streamText with fullStream

Best Practices

  1. Start with Final Response Evaluation: Validate outputs before debugging trajectories
  2. Use Representative Test Cases: Include edge cases and common queries
  3. Compare Configurations Systematically: Change one variable at a time
  4. Monitor Costs: Track token usage across experiments in Langfuse (see the usage sketch after this list)
  5. Iterate on Prompts: Use evaluation scores to guide prompt improvements
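
Token usage is typically captured by Langfuse through the AI SDK telemetry, but it can also be inspected directly on generateText's result for quick local comparisons. The field names inside usage differ between AI SDK versions, so this hypothetical probe just logs the whole object:

// src/usage-probe.ts (hypothetical helper to inspect token usage locally)
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import 'dotenv/config';

async function main() {
  const result = await generateText({
    model: openai('gpt-4o-mini'),
    prompt: 'What is Langfuse?',
  });
  // Carries prompt/completion/total token counts; exact field names
  // depend on the installed AI SDK version.
  console.log('Token usage:', result.usage);
}

main().catch(console.error);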

Next Steps

  • Add more test cases to your dataset
  • Create custom evaluators for domain-specific metrics
  • Set up scheduled evaluation runs for regression testing
  • Integrate with CI/CD for automated quality gates

Built with Vercel AI SDK and Langfuse
