Building Trust in AI Systems Through Systematic Evaluation
This repository demonstrates how to use evaluators to build trust in AI systems by systematically measuring quality, safety, and reliability of AI responses. When building Agentic AI solutions, we need to observe what agents did (actions) and why—this is where evaluation frameworks come in.
The multi_model_evaluation.py script demonstrates:
- Testing Multiple LLM Models - Compare responses from different GitHub Models (GPT-4o, Phi-4, DeepSeek, Mistral, etc.)
- Semantic Similarity Evaluation - Use Azure AI Foundry's SimilarityEvaluator with LLM-as-Judge
- Weighted Ranking System - Rank models with a composite score (see the sketch after this list):
  - 60% Accuracy (similarity score)
  - 20% Token Efficiency (response length)
  - 20% Speed (response time)
- Comprehensive Metrics Tracking - Monitor all evaluator attributes including tokens, finish reasons, and costs
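The 60/20/20 weighting can be reproduced in a few lines of Python. The sketch below is illustrative only: the best-value-ratio normalizations and the sample measurements are assumptions for the example, not the exact formulas or data used by multi_model_evaluation.py.

```python
# Illustrative sketch of the 60/20/20 composite score described above.
# Normalizations and sample numbers are assumptions, not the script's exact logic.

def weighted_score(similarity: float, tokens: int, seconds: float,
                   best_tokens: int, best_seconds: float) -> float:
    """Blend accuracy, token efficiency, and speed into a single 0-1 score."""
    accuracy = similarity / 5.0              # similarity is judged on a 1-5 scale
    token_efficiency = best_tokens / tokens  # fewest tokens scores 1.0
    speed = best_seconds / seconds           # fastest response scores 1.0
    return 0.6 * accuracy + 0.2 * token_efficiency + 0.2 * speed

# Hypothetical measurements for two models.
results = [
    {"model": "gpt-4o", "similarity": 5.0, "tokens": 180, "seconds": 2.4},
    {"model": "Phi-4",  "similarity": 4.0, "tokens": 95,  "seconds": 1.1},
]
best_tokens = min(r["tokens"] for r in results)
best_seconds = min(r["seconds"] for r in results)
ranked = sorted(
    results,
    key=lambda r: weighted_score(r["similarity"], r["tokens"], r["seconds"],
                                 best_tokens, best_seconds),
    reverse=True,
)
print([r["model"] for r in ranked])
```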
Different models answer the same question differently, even when semantically correct. The Similarity Evaluator uses semantic understanding (not just string matching) to assess if responses are meaningfully correct, helping you identify which models align best with your expected outputs.
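As a concrete illustration, the sketch below shows how a SimilarityEvaluator from the azure-ai-evaluation package can score a reworded but equivalent answer. The judge configuration is built from the Azure AI Foundry values in the `.env` file described in the setup steps; the question and answers are placeholders for the example.

```python
# Minimal sketch: LLM-as-judge semantic similarity with azure-ai-evaluation.
# Judge configuration comes from the .env values described in the setup below.
import os
from dotenv import load_dotenv
from azure.ai.evaluation import SimilarityEvaluator

load_dotenv()

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
    "api_version": os.environ["AZURE_OPENAI_VERSION"],
}

evaluator = SimilarityEvaluator(model_config=model_config)

# The wording differs from the ground truth, so plain string matching would
# flag a mismatch; the judge scores semantic alignment on a 1-5 scale instead.
result = evaluator(
    query="What is the capital of France?",
    response="France's capital city is Paris.",
    ground_truth="The capital of France is Paris.",
)
print(result["similarity"], result.get("similarity_result"))
```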
Prerequisites:
- Python 3.9+
- GitHub Token for GitHub Models access
- Azure AI Foundry project with a deployed model (for the evaluator/judge)
- Install dependencies:

  ```bash
  pip install agent-framework-core azure-ai-evaluation python-dotenv
  ```

- Configure credentials: Copy `.env.example` to `.env` and add:

  ```
  # GitHub Models
  GITHUB_TOKEN=your_github_token
  GITHUB_ENDPOINT=https://models.inference.ai.azure.com

  # Azure AI Foundry (for evaluator)
  AZURE_OPENAI_ENDPOINT=your_azure_endpoint
  AZURE_OPENAI_KEY=your_azure_key
  AZURE_OPENAI_DEPLOYMENT=your_deployment_name
  AZURE_OPENAI_VERSION=2024-08-01-preview
  ```

- Run the evaluation:

  ```bash
  python multi_model_evaluation.py
  ```
The script will (a rough sketch of the per-model loop follows this list):
- Test each model with the same question
- Display response time and content
- Show detailed evaluation metrics (similarity score, tokens, finish reason)
- Provide weighted ranking of all models
- Identify the best performing model based on your criteria
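For orientation, the per-model loop can be approximated as below. This is a sketch only: the repository's script uses agent-framework-core, while this example calls the OpenAI-compatible GitHub Models endpoint through the `openai` client (not one of the listed dependencies), and the model IDs are illustrative.

```python
# Rough sketch of testing several GitHub Models with the same question,
# while capturing response time and token usage. Illustration only.
import os
import time
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("GITHUB_ENDPOINT", "https://models.inference.ai.azure.com"),
    api_key=os.environ["GITHUB_TOKEN"],
)

question = "Explain what an AI evaluator does in one sentence."
for model in ["gpt-4o", "Phi-4"]:  # model IDs are illustrative
    start = time.perf_counter()
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    elapsed = time.perf_counter() - start
    answer = completion.choices[0].message.content
    tokens = completion.usage.total_tokens if completion.usage else None
    print(f"{model}: {elapsed:.2f}s, {tokens} tokens\n{answer}\n")
```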
AI observability refers to the ability to monitor, understand, and troubleshoot AI systems throughout their lifecycle. It involves collecting and analyzing signals such as evaluation metrics, logs, traces, and model and agent outputs to gain visibility into performance, quality, safety, and operational health.
Evaluators are specialized tools that measure the quality, safety, and reliability of AI responses. By implementing systematic evaluations throughout the AI development lifecycle, teams can identify and address potential issues before they impact users.
Key Evaluator Categories (a short SDK sketch follows this list):
- Performance & Quality - Coherence, Fluency, Similarity, Relevance
- Textual Similarity - F1 Score, BLEU, ROUGE, METEOR
- RAG Evaluators - Groundedness, Retrieval effectiveness, Completeness
- Risk & Safety - Content safety, Bias detection, Protected materials
- Agent Evaluators - Task adherence, Tool usage, Intent resolution
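Several of these categories map directly to classes in the azure-ai-evaluation SDK. The sketch below pairs an LLM-judged quality metric with a deterministic textual-similarity metric; it reuses the same Azure AI Foundry judge configuration as the similarity demo, and exact result keys may vary between SDK versions.

```python
# Minimal sketch: one LLM-judged evaluator and one deterministic evaluator
# from azure-ai-evaluation. Example strings are placeholders.
import os
from azure.ai.evaluation import CoherenceEvaluator, F1ScoreEvaluator

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
    "api_version": os.environ["AZURE_OPENAI_VERSION"],
}

# Performance & Quality: judged by the configured LLM.
coherence = CoherenceEvaluator(model_config=model_config)(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)

# Textual Similarity: computed deterministically, no judge model needed.
f1 = F1ScoreEvaluator()(
    response="Paris is the capital of France.",
    ground_truth="The capital of France is Paris.",
)
print(coherence, f1)
```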
The demo uses the Similarity Evaluator which provides these metrics:
| Property | What it means |
|---|---|
| similarity | 1–5 semantic alignment score |
| similarity_result | "pass"/"fail" vs threshold |
| similarity_threshold | Decision boundary (default 3) |
| similarity_prompt_tokens | Tokens in evaluator input |
| similarity_completion_tokens | Tokens in evaluator output |
| similarity_total_tokens | Total cost tracking |
| similarity_finish_reason | LLM stop code |
| similarity_model | Evaluator model ID |
Blog Post: What are AI Agent Evaluators and Why They Matter - Deep dive with workflow diagram
References:
- Designing Multi-Agent Systems by Victor Dibia - Recommended book
- Microsoft Foundry - Observability Concepts
- Azure AI Evaluation SDK
- Evaluate Generative AI Apps
- GitHub Models Marketplace
By implementing systematic evaluation, you move from feeling that your AI agent is working to knowing it is working based on quantifiable metrics. This is the foundation for building robust, production-ready Agentic AI systems.