Multi-Model Evaluation with Similarity Evaluator

AI Agent Evaluators Demo

Building Trust in AI Systems Through Systematic Evaluation

This repository demonstrates how to use evaluators to build trust in AI systems by systematically measuring the quality, safety, and reliability of AI responses. When building Agentic AI solutions, we need to observe both what agents did (their actions) and why they did it; this is where evaluation frameworks come in.

What This Demo Shows

Multi-Model Evaluation with Weighted Ranking

The multi_model_evaluation.py script demonstrates:

  1. Testing Multiple LLM Models - Compare responses from different GitHub Models (GPT-4o, Phi-4, DeepSeek, Mistral, etc.)
  2. Semantic Similarity Evaluation - Use Azure AI Foundry's SimilarityEvaluator with LLM-as-Judge
  3. Weighted Ranking System - Rank models using a weighted score (see the sketch after this list):
    • 60% Accuracy (similarity score)
    • 20% Token Efficiency (response length as a proxy)
    • 20% Speed (response time)
  4. Comprehensive Metrics Tracking - Monitor all evaluator attributes including tokens, finish reasons, and costs
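
The weighted ranking in item 3 is just min-max normalization plus a weighted sum. A minimal sketch of that formula (the numbers below are purely illustrative, not real benchmark results):

    def weighted_score(similarity, response_time, response_length,
                       max_similarity, min_time, max_time, min_len, max_len):
        """Combine accuracy, speed, and token efficiency into a single 0-1 score."""
        norm_accuracy = similarity / max_similarity if max_similarity > 0 else 0
        norm_speed = 1 - (response_time - min_time) / (max_time - min_time) if max_time > min_time else 1
        norm_efficiency = 1 - (response_length - min_len) / (max_len - min_len) if max_len > min_len else 1
        # Weights: 60% accuracy, 20% token efficiency, 20% speed
        return 0.6 * norm_accuracy + 0.2 * norm_efficiency + 0.2 * norm_speed

    # Illustrative only: a 5/5 answer in 1.2s at 120 chars, against a field where
    # responses ranged from 1.2-4.0s and 120-600 chars -> best possible score of 1.0
    print(weighted_score(5, 1.2, 120, 5, 1.2, 4.0, 120, 600))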

Why This Matters

Different models answer the same question differently, even when semantically correct. The Similarity Evaluator uses semantic understanding (not just string matching) to assess if responses are meaningfully correct, helping you identify which models align best with your expected outputs.
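
To make that concrete: an exact string match rejects a correct paraphrase that the LLM judge accepts. A small illustration using the demo's own test case (the judge's score mentioned in the comment is illustrative):

    ground_truth = "4 times (M-e-rc-e-d-e-s-B-e-nz)"
    response = "The letter 'e' appears four times in 'Mercedes-Benz'."

    # Naive string checks fail even though the answer is semantically correct
    print(response == ground_truth)                  # False
    print(ground_truth.lower() in response.lower())  # False

    # The SimilarityEvaluator instead asks an LLM judge to rate semantic alignment
    # on a 1-5 scale, so a correct paraphrase like this can still score 4-5 and
    # pass the default threshold of 3.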

Quick Start

Prerequisites

  • Python 3.9+ required
  • GitHub Token for GitHub Models access
  • Azure AI Foundry project with deployed model (for evaluator/judge)

Setup & Run

  1. Install dependencies:

    pip install agent-framework-core azure-ai-evaluation python-dotenv
  2. Configure credentials: Copy .env.example to .env and add:

    # GitHub Models
    GITHUB_TOKEN=your_github_token
    GITHUB_ENDPOINT=https://models.github.ai/inference
    
    # Azure AI Foundry (for evaluator)
    AZURE_OPENAI_ENDPOINT=your_azure_endpoint
    AZURE_OPENAI_KEY=your_azure_key
    AZURE_OPENAI_DEPLOYMENT=your_deployment_name
    AZURE_OPENAI_VERSION=2024-12-01-preview
  3. Run the evaluation:

    python multi_model_evaluation.py

Sample Output

The script will:

  • Test each model with the same question
  • Display response time and content
  • Show detailed evaluation metrics (similarity score, tokens, finish reason)
  • Provide weighted ranking of all models
  • Identify the best performing model based on your criteria

What You'll Learn

Introduction to Observability

AI observability refers to the ability to monitor, understand, and troubleshoot AI systems throughout their lifecycle. It involves collecting and analyzing signals such as evaluation metrics, logs, traces, and model and agent outputs to gain visibility into performance, quality, safety, and operational health.

Understanding Evaluators

Evaluators are specialized tools that measure the quality, safety, and reliability of AI responses. By implementing systematic evaluations throughout the AI development lifecycle, teams can identify and address potential issues before they impact users.

Key Evaluator Categories (a small contrast example follows this list):

  • Performance & Quality - Coherence, Fluency, Similarity, Relevance
  • Textual Similarity - F1 Score, BLEU, ROUGE, METEOR
  • RAG Evaluators - Groundedness, Retrieval effectiveness, Completeness
  • Risk & Safety - Content safety, Bias detection, Protected materials
  • Agent Evaluators - Task adherence, Tool usage, Intent resolution
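
For contrast with the LLM-judged evaluators this demo focuses on, here is a hedged sketch of a deterministic textual-similarity check. It assumes the F1ScoreEvaluator class from the same azure-ai-evaluation package; that evaluator is not part of this demo.

    # Sketch only: F1ScoreEvaluator compares token overlap and needs no LLM judge
    # (assumed to be available in azure-ai-evaluation; not used in this demo).
    from azure.ai.evaluation import F1ScoreEvaluator

    f1 = F1ScoreEvaluator()
    result = f1(
        response="The letter 'e' appears four times in 'Mercedes-Benz'.",
        ground_truth="4 times (M-e-rc-e-d-e-s-B-e-nz)",
    )
    print(result)  # expected shape: {'f1_score': ...}; low here, since word overlap is small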

Similarity Evaluator Attributes

The demo uses the Similarity Evaluator, which exposes these metrics:

Property                          What it means
similarity                        1–5 semantic alignment score
similarity_result                 "pass"/"fail" versus the threshold
similarity_threshold              Decision boundary (default 3)
similarity_prompt_tokens          Tokens in the evaluator input
similarity_completion_tokens      Tokens in the evaluator output
similarity_total_tokens           Total evaluator tokens (for cost tracking)
similarity_finish_reason          LLM stop reason
similarity_model                  Evaluator model ID
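
A minimal sketch of calling the evaluator and reading these attributes back (the endpoint, key, and deployment come from the same environment variables the demo uses; the commented values are illustrative):

    import os
    from azure.ai.evaluation import AzureOpenAIModelConfiguration, SimilarityEvaluator

    # LLM-as-Judge configuration, read from the demo's environment variables
    model_config = AzureOpenAIModelConfiguration(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_KEY"],
        azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
        api_version=os.environ["AZURE_OPENAI_VERSION"],
    )

    similarity = SimilarityEvaluator(model_config=model_config, threshold=3)
    result = similarity(
        query="How many times does the letter 'e' appear in 'Mercedes-Benz'?",
        response="The letter 'e' appears 4 times.",
        ground_truth="4 times (M-e-rc-e-d-e-s-B-e-nz)",
    )

    print(result.get("similarity"))               # e.g. 5.0
    print(result.get("similarity_result"))        # "pass" or "fail"
    print(result.get("similarity_total_tokens"))  # evaluator tokens, for cost tracking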

References

Blog Post: What are AI Agent Evaluators and Why They Matter - Deep dive with workflow diagram

Key Takeaway

By implementing systematic evaluation, you move from feeling that your AI agent is working to knowing it is working based on quantifiable metrics. This is the foundation for building robust, production-ready Agentic AI systems.

.env.example

# ======== AGENT CONFIGURATION (GitHub Models) ========
# The agent uses GitHub Models to generate answers
# Get your token from: https://github.com/settings/tokens
GITHUB_TOKEN=your_github_token_here
GITHUB_ENDPOINT=https://models.github.ai/inference

# ======== EVALUATOR CONFIGURATION (Azure AI Foundry) ========
# The evaluator uses Azure OpenAI as a judge to validate responses
# Get these from: https://portal.azure.com
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your_azure_api_key_here
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
AZURE_OPENAI_VERSION=2024-12-01-preview

"""
Multi-Model Evaluation with Similarity Evaluator
This example demonstrates how different LLM models answer the same question
and uses Azure AI Foundry's SimilarityEvaluator to measure consistency.
Demonstrates:
1. Testing multiple GitHub Models (GPT-4o, Phi-4, Claude, etc.)
2. Comparing responses across models
3. Building trust through systematic evaluation
"""
import asyncio
import os
import time

from dotenv import load_dotenv

# Agent Framework - Using GitHub Models
from agent_framework import ChatAgent
from agent_framework.openai import OpenAIChatClient

# Azure AI Evaluation - Similarity Evaluator
from azure.ai.evaluation import AzureOpenAIModelConfiguration, SimilarityEvaluator

load_dotenv(override=True)


async def test_model(model_id: str, query: str, instructions: str) -> str:
    """Test a specific model and return its response."""
    # Initialize the GitHub Models client for this specific model
    openai_chat_client = OpenAIChatClient(
        model_id=model_id,
        api_key=os.environ.get("GITHUB_TOKEN"),
        base_url=os.environ.get("GITHUB_ENDPOINT"),
    )

    # Create the AI agent
    agent = ChatAgent(
        chat_client=openai_chat_client,
        instructions=instructions,
        stream=False,
    )

    # Get the agent's response and return its text content
    response = await agent.run(query)
    return response.messages[-1].contents[0].text
async def main():
    """Compare multiple models using similarity evaluation."""
    print("Multi-Model Evaluation with Similarity Evaluator")
    print("=" * 70)
    print()

    # Define the test case
    query = "How many times does the letter 'e' appear in 'Mercedes-Benz'?"
    ground_truth = "4 times (M-e-rc-e-d-e-s-B-e-nz)"
    print(f"Query: {query}")
    print(f"Ground Truth: {ground_truth}")
    print()

    # Configure Azure OpenAI for the evaluator (LLM-as-Judge)
    model_config = AzureOpenAIModelConfiguration(
        azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
        api_key=os.environ.get("AZURE_OPENAI_KEY"),
        azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
        api_version=os.environ.get("AZURE_OPENAI_VERSION"),
    )

    # Create the SimilarityEvaluator with a threshold of 3
    similarity = SimilarityEvaluator(model_config=model_config, threshold=3)

    # Define models to test (available on GitHub Models)
    models_to_test = [
        "gpt-4o-mini",
        "gpt-4o",
        "microsoft/Phi-4",
        # "deepseek/DeepSeek-V3-0324",
        "openai/gpt-5",
        "mistral-ai/mistral-small-2503",
    ]

    instructions = "You are being evaluated on your ability to answer questions accurately and follow instructions precisely."
    results = []

    # Test each model
    for model_id in models_to_test:
        print(f"Testing model: {model_id}")
        print("-" * 70)
        try:
            # Get the response from the model and track elapsed time
            start_time = time.time()
            agent_answer = await test_model(model_id, query, instructions)
            response_time = time.time() - start_time
            print(f"Response: {agent_answer}")
            print(f"Response Time: {response_time:.2f}s")
            print()

            # Evaluate the response against the ground truth
            eval_result = similarity(
                query=query,
                response=agent_answer,
                ground_truth=ground_truth,
            )

            # Extract similarity evaluator metrics
            score = eval_result.get('similarity', eval_result.get('gpt_similarity'))
            result = eval_result.get('similarity_result', 'N/A')
            threshold = eval_result.get('similarity_threshold', 3)
            # Token usage metrics
            prompt_tokens = eval_result.get('similarity_prompt_tokens', 0)
            completion_tokens = eval_result.get('similarity_completion_tokens', 0)
            total_tokens = eval_result.get('similarity_total_tokens', 0)
            # Evaluator model info
            evaluator_model = eval_result.get('similarity_model', 'N/A')
            finish_reason = eval_result.get('similarity_finish_reason', 'N/A')

            print("Evaluation Metrics:")
            print(f"  Similarity Score: {score}/5")
            print(f"  Result: {result.upper()}")
            print(f"  Threshold: {threshold}")
            print(f"  Evaluator Model: {evaluator_model}")
            print()
            print("Token Usage (Evaluator):")
            print(f"  Prompt Tokens: {prompt_tokens}")
            print(f"  Completion Tokens: {completion_tokens}")
            print(f"  Total Tokens: {total_tokens}")
            print(f"  Finish Reason: {finish_reason}")
            print()

            # Store results (tokens here are evaluator tokens for cost tracking)
            results.append({
                'model': model_id,
                'response': agent_answer,
                'score': score,
                'result': result,
                'threshold': threshold,
                'evaluator_tokens': total_tokens,
                'evaluator_model': evaluator_model,
                'response_time': response_time,
                'response_length': len(agent_answer),  # Proxy for model tokens
            })
        except Exception as e:
            print(f"Error testing {model_id}: {str(e)}")
            print()
            results.append({
                'model': model_id,
                'response': f"Error: {str(e)}",
                'score': 0,
                'result': 'ERROR',
                'response_time': 999999,
                'response_length': 0,
                'evaluator_tokens': 0,
            })
        print()

    # Summary comparison
    print("=" * 70)
    print("SUMMARY: Model Comparison")
    print("=" * 70)
    print()

    # Calculate weighted scores for ranking
    # Filter out errors for normalization
    valid_results = [r for r in results if r['result'] != 'ERROR']
    if valid_results:
        # Normalization bounds (0-1 scale)
        max_score = max(r['score'] for r in valid_results)
        min_time = min(r['response_time'] for r in valid_results)
        max_time = max(r['response_time'] for r in valid_results)
        min_length = min(r['response_length'] for r in valid_results)
        max_length = max(r['response_length'] for r in valid_results)

    # Calculate the weighted score for each result
    for r in results:
        if r['result'] != 'ERROR':
            # Normalize metrics (0-1 scale)
            norm_score = r['score'] / max_score if max_score > 0 else 0
            norm_speed = 1 - ((r['response_time'] - min_time) / (max_time - min_time)) if max_time > min_time else 1
            norm_efficiency = 1 - ((r['response_length'] - min_length) / (max_length - min_length)) if max_length > min_length else 1
            # Weighted ranking: 60% accuracy, 20% efficiency, 20% speed
            r['weighted_score'] = (0.6 * norm_score) + (0.2 * norm_efficiency) + (0.2 * norm_speed)
        else:
            r['weighted_score'] = 0

    # Sort by weighted score (highest first)
    results_sorted = sorted(results, key=lambda x: x.get('weighted_score', 0), reverse=True)

    for i, result in enumerate(results_sorted, 1):
        print(f"{i}. {result['model']}")
        print(f"   Similarity: {result['score']}/5 | Result: {result['result'].upper()}")
        print(f"   Response Time: {result.get('response_time', 0):.2f}s | Length: {result.get('response_length', 0)} chars")
        print(f"   Weighted Score: {result.get('weighted_score', 0):.3f} (60% accuracy, 20% efficiency, 20% speed)")
        print()

    # Analysis
    print("=" * 70)
    print("INSIGHTS")
    print("=" * 70)
    print()

    passed = [r for r in results if r['result'] == 'pass']
    failed = [r for r in results if r['result'] == 'fail']
    errors = [r for r in results if r['result'] == 'ERROR']

    print(f"Models Passed (>= threshold): {len(passed)}/{len(results)}")
    print(f"Models Failed (< threshold): {len(failed)}/{len(results)}")
    print(f"Models with Errors: {len(errors)}/{len(results)}")
    print()

    if results_sorted and results_sorted[0]['result'] != 'ERROR':
        best_model = results_sorted[0]
        print(f"Best Performing Model (Weighted Ranking): {best_model['model']}")
        print(f"  Similarity Score: {best_model['score']}/5")
        print(f"  Response Time: {best_model.get('response_time', 0):.2f}s")
        print(f"  Response Length: {best_model.get('response_length', 0)} chars")
        print(f"  Weighted Score: {best_model.get('weighted_score', 0):.3f}")
        print("  (Weights: 60% accuracy, 20% token efficiency, 20% speed)")
        print()


if __name__ == "__main__":
    asyncio.run(main())