Multi-Model Evaluation with Similarity Evaluator

AI Agent Evaluators Demo

Building Trust in AI Systems Through Systematic Evaluation

This repository demonstrates how to use evaluators to build trust in AI systems by systematically measuring the quality, safety, and reliability of AI responses. When building Agentic AI solutions, we need to observe both what agents did (their actions) and why they did it; this is where evaluation frameworks come in.

What This Demo Shows

Multi-Model Evaluation with Weighted Ranking

The multi_model_evaluation.py script demonstrates:

  1. Testing Multiple LLM Models - Compare responses from different GitHub Models (GPT-4o, Phi-4, DeepSeek, Mistral, etc.)
  2. Semantic Similarity Evaluation - Use Azure AI Foundry's SimilarityEvaluator with LLM-as-Judge
  3. Weighted Ranking System - Rank models using a weighted score (see the sketch after this list):
    • 60% Accuracy (similarity score)
    • 20% Token Efficiency (response length as a proxy)
    • 20% Speed (response time)
  4. Comprehensive Metrics Tracking - Monitor all evaluator attributes including tokens, finish reasons, and costs
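
The weighted ranking in item 3 is just min-max normalization plus a weighted sum. A minimal sketch of that formula (the numbers below are purely illustrative, not real benchmark results):

    def weighted_score(similarity, response_time, response_length,
                       max_similarity, min_time, max_time, min_len, max_len):
        """Combine accuracy, speed, and token efficiency into a single 0-1 score."""
        norm_accuracy = similarity / max_similarity if max_similarity > 0 else 0
        norm_speed = 1 - (response_time - min_time) / (max_time - min_time) if max_time > min_time else 1
        norm_efficiency = 1 - (response_length - min_len) / (max_len - min_len) if max_len > min_len else 1
        # Weights: 60% accuracy, 20% token efficiency, 20% speed
        return 0.6 * norm_accuracy + 0.2 * norm_efficiency + 0.2 * norm_speed

    # Illustrative only: a 5/5 answer in 1.2s at 120 chars, against a field where
    # responses ranged from 1.2-4.0s and 120-600 chars -> best possible score of 1.0
    print(weighted_score(5, 1.2, 120, 5, 1.2, 4.0, 120, 600))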

Why This Matters

Different models answer the same question differently, even when semantically correct. The Similarity Evaluator uses semantic understanding (not just string matching) to assess if responses are meaningfully correct, helping you identify which models align best with your expected outputs.
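
To make that concrete: an exact string match rejects a correct paraphrase that the LLM judge accepts. A small illustration using the demo's own test case (the judge's score mentioned in the comment is illustrative):

    ground_truth = "4 times (M-e-rc-e-d-e-s-B-e-nz)"
    response = "The letter 'e' appears four times in 'Mercedes-Benz'."

    # Naive string checks fail even though the answer is semantically correct
    print(response == ground_truth)                  # False
    print(ground_truth.lower() in response.lower())  # False

    # The SimilarityEvaluator instead asks an LLM judge to rate semantic alignment
    # on a 1-5 scale, so a correct paraphrase like this can still score 4-5 and
    # pass the default threshold of 3.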

Quick Start

Prerequisites

  • Python 3.9+ required
  • GitHub Token for GitHub Models access
  • Azure AI Foundry project with deployed model (for evaluator/judge)

Setup & Run

  1. Install dependencies:

    pip install agent-framework-core azure-ai-evaluation python-dotenv
  2. Configure credentials: Copy .env.example to .env and add:

    # GitHub Models
    GITHUB_TOKEN=your_github_token
    GITHUB_ENDPOINT=https://models.github.ai/inference
    
    # Azure AI Foundry (for evaluator)
    AZURE_OPENAI_ENDPOINT=your_azure_endpoint
    AZURE_OPENAI_KEY=your_azure_key
    AZURE_OPENAI_DEPLOYMENT=your_deployment_name
    AZURE_OPENAI_VERSION=2024-12-01-preview
  3. Run the evaluation:

    python multi_model_evaluation.py

Sample Output

The script will:

  • Test each model with the same question
  • Display response time and content
  • Show detailed evaluation metrics (similarity score, tokens, finish reason)
  • Provide weighted ranking of all models
  • Identify the best performing model based on your criteria

What You'll Learn

Introduction to Observability

AI observability refers to the ability to monitor, understand, and troubleshoot AI systems throughout their lifecycle. It involves collecting and analyzing signals such as evaluation metrics, logs, traces, and model and agent outputs to gain visibility into performance, quality, safety, and operational health.

Understanding Evaluators

Evaluators are specialized tools that measure the quality, safety, and reliability of AI responses. By implementing systematic evaluations throughout the AI development lifecycle, teams can identify and address potential issues before they impact users.

Key Evaluator Categories (a small contrast example follows this list):

  • Performance & Quality - Coherence, Fluency, Similarity, Relevance
  • Textual Similarity - F1 Score, BLEU, ROUGE, METEOR
  • RAG Evaluators - Groundedness, Retrieval effectiveness, Completeness
  • Risk & Safety - Content safety, Bias detection, Protected materials
  • Agent Evaluators - Task adherence, Tool usage, Intent resolution
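
For contrast with the LLM-judged evaluators this demo focuses on, here is a hedged sketch of a deterministic textual-similarity check. It assumes the F1ScoreEvaluator class from the same azure-ai-evaluation package; that evaluator is not part of this demo.

    # Sketch only: F1ScoreEvaluator compares token overlap and needs no LLM judge
    # (assumed to be available in azure-ai-evaluation; not used in this demo).
    from azure.ai.evaluation import F1ScoreEvaluator

    f1 = F1ScoreEvaluator()
    result = f1(
        response="The letter 'e' appears four times in 'Mercedes-Benz'.",
        ground_truth="4 times (M-e-rc-e-d-e-s-B-e-nz)",
    )
    print(result)  # expected shape: {'f1_score': ...}; low here, since word overlap is small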

Similarity Evaluator Attributes

The demo uses the Similarity Evaluator, which exposes these metrics:

Property                          What it means
similarity                        1–5 semantic alignment score
similarity_result                 "pass"/"fail" versus the threshold
similarity_threshold              Decision boundary (default 3)
similarity_prompt_tokens          Tokens in the evaluator input
similarity_completion_tokens      Tokens in the evaluator output
similarity_total_tokens           Total evaluator tokens (for cost tracking)
similarity_finish_reason          LLM stop reason
similarity_model                  Evaluator model ID
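
A minimal sketch of calling the evaluator and reading these attributes back (the endpoint, key, and deployment come from the same environment variables the demo uses; the commented values are illustrative):

    import os
    from azure.ai.evaluation import AzureOpenAIModelConfiguration, SimilarityEvaluator

    # LLM-as-Judge configuration, read from the demo's environment variables
    model_config = AzureOpenAIModelConfiguration(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_KEY"],
        azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
        api_version=os.environ["AZURE_OPENAI_VERSION"],
    )

    similarity = SimilarityEvaluator(model_config=model_config, threshold=3)
    result = similarity(
        query="How many times does the letter 'e' appear in 'Mercedes-Benz'?",
        response="The letter 'e' appears 4 times.",
        ground_truth="4 times (M-e-rc-e-d-e-s-B-e-nz)",
    )

    print(result.get("similarity"))               # e.g. 5.0
    print(result.get("similarity_result"))        # "pass" or "fail"
    print(result.get("similarity_total_tokens"))  # evaluator tokens, for cost tracking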

References

Blog Post: What are AI Agent Evaluators and Why They Matter - Deep dive with workflow diagram

Key Takeaway

By implementing systematic evaluation, you move from feeling that your AI agent is working to knowing it is working based on quantifiable metrics. This is the foundation for building robust, production-ready Agentic AI systems.

.env.example

# ======== AGENT CONFIGURATION (GitHub Models) ========
# The agent uses GitHub Models to generate answers
# Get your token from: https://github.com/settings/tokens
GITHUB_TOKEN=your_github_token_here
GITHUB_ENDPOINT=https://models.github.ai/inference

# ======== EVALUATOR CONFIGURATION (Azure AI Foundry) ========
# The evaluator uses Azure OpenAI as a judge to validate responses
# Get these from: https://portal.azure.com
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your_azure_api_key_here
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
AZURE_OPENAI_VERSION=2024-12-01-preview

"""
Multi-Model Evaluation with Similarity Evaluator
This example demonstrates how different LLM models answer the same question
and uses Azure AI Foundry's SimilarityEvaluator to measure consistency.
Demonstrates:
1. Testing multiple GitHub Models (GPT-4o, Phi-4, Claude, etc.)
2. Comparing responses across models
3. Building trust through systematic evaluation
"""
import asyncio
import os
import time

from dotenv import load_dotenv

# Agent Framework - Using GitHub Models
from agent_framework import ChatAgent
from agent_framework.openai import OpenAIChatClient

# Azure AI Evaluation - Similarity Evaluator
from azure.ai.evaluation import AzureOpenAIModelConfiguration, SimilarityEvaluator

load_dotenv(override=True)


async def test_model(model_id: str, query: str, instructions: str) -> str:
    """Test a specific model and return its response."""
    # Initialize the GitHub Models client for this specific model
    openai_chat_client = OpenAIChatClient(
        model_id=model_id,
        api_key=os.environ.get("GITHUB_TOKEN"),
        base_url=os.environ.get("GITHUB_ENDPOINT"),
    )

    # Create the AI agent
    agent = ChatAgent(
        chat_client=openai_chat_client,
        instructions=instructions,
        stream=False,
    )

    # Get the agent's response and return its text content
    response = await agent.run(query)
    return response.messages[-1].contents[0].text
async def main():
    """Compare multiple models using similarity evaluation."""
    print("Multi-Model Evaluation with Similarity Evaluator")
    print("=" * 70)
    print()

    # Define the test case
    query = "How many times does the letter 'e' appear in 'Mercedes-Benz'?"
    ground_truth = "4 times (M-e-rc-e-d-e-s-B-e-nz)"
    print(f"Query: {query}")
    print(f"Ground Truth: {ground_truth}")
    print()

    # Configure Azure OpenAI for the evaluator (LLM-as-Judge)
    model_config = AzureOpenAIModelConfiguration(
        azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
        api_key=os.environ.get("AZURE_OPENAI_KEY"),
        azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
        api_version=os.environ.get("AZURE_OPENAI_VERSION"),
    )

    # Create the SimilarityEvaluator with a threshold of 3
    similarity = SimilarityEvaluator(model_config=model_config, threshold=3)

    # Define models to test (available on GitHub Models)
    models_to_test = [
        "gpt-4o-mini",
        "gpt-4o",
        "microsoft/Phi-4",
        # "deepseek/DeepSeek-V3-0324",
        "openai/gpt-5",
        "mistral-ai/mistral-small-2503",
    ]

    instructions = "You are being evaluated on your ability to answer questions accurately and follow instructions precisely."
    results = []

    # Test each model
    for model_id in models_to_test:
        print(f"Testing model: {model_id}")
        print("-" * 70)
        try:
            # Get the response from the model and track elapsed time
            start_time = time.time()
            agent_answer = await test_model(model_id, query, instructions)
            response_time = time.time() - start_time
            print(f"Response: {agent_answer}")
            print(f"Response Time: {response_time:.2f}s")
            print()

            # Evaluate the response against the ground truth
            eval_result = similarity(
                query=query,
                response=agent_answer,
                ground_truth=ground_truth,
            )

            # Extract similarity evaluator metrics
            score = eval_result.get('similarity', eval_result.get('gpt_similarity'))
            result = eval_result.get('similarity_result', 'N/A')
            threshold = eval_result.get('similarity_threshold', 3)
            # Token usage metrics
            prompt_tokens = eval_result.get('similarity_prompt_tokens', 0)
            completion_tokens = eval_result.get('similarity_completion_tokens', 0)
            total_tokens = eval_result.get('similarity_total_tokens', 0)
            # Evaluator model info
            evaluator_model = eval_result.get('similarity_model', 'N/A')
            finish_reason = eval_result.get('similarity_finish_reason', 'N/A')

            print("Evaluation Metrics:")
            print(f"  Similarity Score: {score}/5")
            print(f"  Result: {result.upper()}")
            print(f"  Threshold: {threshold}")
            print(f"  Evaluator Model: {evaluator_model}")
            print()
            print("Token Usage (Evaluator):")
            print(f"  Prompt Tokens: {prompt_tokens}")
            print(f"  Completion Tokens: {completion_tokens}")
            print(f"  Total Tokens: {total_tokens}")
            print(f"  Finish Reason: {finish_reason}")
            print()

            # Store results (tokens here are evaluator tokens for cost tracking)
            results.append({
                'model': model_id,
                'response': agent_answer,
                'score': score,
                'result': result,
                'threshold': threshold,
                'evaluator_tokens': total_tokens,
                'evaluator_model': evaluator_model,
                'response_time': response_time,
                'response_length': len(agent_answer),  # Proxy for model tokens
            })
        except Exception as e:
            print(f"Error testing {model_id}: {str(e)}")
            print()
            results.append({
                'model': model_id,
                'response': f"Error: {str(e)}",
                'score': 0,
                'result': 'ERROR',
                'response_time': 999999,
                'response_length': 0,
                'evaluator_tokens': 0,
            })
        print()

    # Summary comparison
    print("=" * 70)
    print("SUMMARY: Model Comparison")
    print("=" * 70)
    print()

    # Calculate weighted scores for ranking
    # Filter out errors for normalization
    valid_results = [r for r in results if r['result'] != 'ERROR']
    if valid_results:
        # Normalization bounds (0-1 scale)
        max_score = max(r['score'] for r in valid_results)
        min_time = min(r['response_time'] for r in valid_results)
        max_time = max(r['response_time'] for r in valid_results)
        min_length = min(r['response_length'] for r in valid_results)
        max_length = max(r['response_length'] for r in valid_results)

    # Calculate the weighted score for each result
    for r in results:
        if r['result'] != 'ERROR':
            # Normalize metrics (0-1 scale)
            norm_score = r['score'] / max_score if max_score > 0 else 0
            norm_speed = 1 - ((r['response_time'] - min_time) / (max_time - min_time)) if max_time > min_time else 1
            norm_efficiency = 1 - ((r['response_length'] - min_length) / (max_length - min_length)) if max_length > min_length else 1
            # Weighted ranking: 60% accuracy, 20% efficiency, 20% speed
            r['weighted_score'] = (0.6 * norm_score) + (0.2 * norm_efficiency) + (0.2 * norm_speed)
        else:
            r['weighted_score'] = 0

    # Sort by weighted score (highest first)
    results_sorted = sorted(results, key=lambda x: x.get('weighted_score', 0), reverse=True)

    for i, result in enumerate(results_sorted, 1):
        print(f"{i}. {result['model']}")
        print(f"   Similarity: {result['score']}/5 | Result: {result['result'].upper()}")
        print(f"   Response Time: {result.get('response_time', 0):.2f}s | Length: {result.get('response_length', 0)} chars")
        print(f"   Weighted Score: {result.get('weighted_score', 0):.3f} (60% accuracy, 20% efficiency, 20% speed)")
        print()

    # Analysis
    print("=" * 70)
    print("INSIGHTS")
    print("=" * 70)
    print()

    passed = [r for r in results if r['result'] == 'pass']
    failed = [r for r in results if r['result'] == 'fail']
    errors = [r for r in results if r['result'] == 'ERROR']

    print(f"Models Passed (>= threshold): {len(passed)}/{len(results)}")
    print(f"Models Failed (< threshold): {len(failed)}/{len(results)}")
    print(f"Models with Errors: {len(errors)}/{len(results)}")
    print()

    if results_sorted and results_sorted[0]['result'] != 'ERROR':
        best_model = results_sorted[0]
        print(f"Best Performing Model (Weighted Ranking): {best_model['model']}")
        print(f"  Similarity Score: {best_model['score']}/5")
        print(f"  Response Time: {best_model.get('response_time', 0):.2f}s")
        print(f"  Response Length: {best_model.get('response_length', 0)} chars")
        print(f"  Weighted Score: {best_model.get('weighted_score', 0):.3f}")
        print("  (Weights: 60% accuracy, 20% token efficiency, 20% speed)")
        print()


if __name__ == "__main__":
    asyncio.run(main())