@caleb-kaiser
Created October 24, 2024 00:42
05-unit-test.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyPMn4CyZrlxLY17wV+kLsaR",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/caleb-kaiser/bed71c1b4af9aa48d7b825f5b9a17081/05-unit-test.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"<img src=\"https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/static/img/opik-logo.svg\" width=\"250\"/>"
],
"metadata": {
"id": "PJ_lkrnb3g10"
}
},
{
"cell_type": "markdown",
"source": [
"# LLM Unit Tests with MedQuAD\n",
"\n"
],
"metadata": {
"id": "Fxn-MCa5Awyo"
}
},
{
"cell_type": "markdown",
"source": [
"In this exercise, you'll be implementing an LLM unit test similar to the one you just saw in the lesson. To make the exercise a little more interesting, you'll be using the popular MedQuAD dataset, which is a question-answer dataset with context for each question-answer pair.\n",
"\n",
"You'll first download the dataset from HuggingFace and convert it into an Opik Dataset. Then, you'll reuse your Factuality metric from a previous lesson to build an LLM unit test. In the real world, you might use this unit test as part of your CI/CD pipeline, to ensure that any changes you make to your underlying model, prompt, or parameters doesn't lead to a regression.\n",
"\n",
"For this exercise, you can use OpenAI or open source models via LiteLLM."
],
"metadata": {
"id": "JfJNkYp_A4H7"
}
},
{
"cell_type": "markdown",
"source": [
"# Imports & Configuration"
],
"metadata": {
"id": "4EbXCzQ9DIzv"
}
},
{
"cell_type": "code",
"source": [
"! pip install opik openai litellm --quiet"
],
"metadata": {
"id": "v5hMcfoT6Ipv"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xlzz8No30MZx"
},
"outputs": [],
"source": [
"import pytest\n",
"import opik\n",
"from opik import track, llm_unit\n",
"from opik import Opik\n",
"from opik.integrations.openai import track_openai\n",
"import openai\n",
"import json\n",
"import os\n",
"from getpass import getpass\n",
"import pandas as pd\n",
"import litellm\n",
"\n",
"# Define project name to enable tracing\n",
"os.environ[\"OPIK_PROJECT_NAME\"] = \"unit-test-MedQuAD-bench\""
]
},
{
"cell_type": "code",
"source": [
"# Opik configuration\n",
"if \"OPIK_API_KEY\" not in os.environ:\n",
" os.environ[\"OPIK_API_KEY\"] = getpass(\"Enter your Opik API key: \")\n",
"\n",
"opik.configure()"
],
"metadata": {
"id": "D32onuCcArD8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"client = opik.Opik()\n"
],
"metadata": {
"id": "yYTR5IA3DDuj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Dataset"
],
"metadata": {
"id": "0vDQ3EvSDVyM"
}
},
{
"cell_type": "code",
"source": [
"# Create dataset\n",
"dataset = client.get_or_create_dataset(\n",
" name=\"MedQuAD\", description=\"MedQuAD dataset\"\n",
")"
],
"metadata": {
"id": "fBiS0RX0DVTU"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Insert items into dataset\n",
"df = pd.read_parquet(\n",
" \"hf://datasets/AnonymousSub/MedQuAD_Context_Question_Answer_Triples_TWO/data/train-00000-of-00001-c38b6c63d6178c71.parquet\"\n",
")\n",
"df = df.sample(n=50, random_state=42)\n"
],
"metadata": {
"id": "YUE7L051DS-M"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"df.head()"
],
"metadata": {
"id": "Lpvigw37NKBP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"dataset.insert(df.to_dict('records'))"
],
"metadata": {
"id": "wXjfaqfyEvtO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Prompts & Templates"
],
"metadata": {
"id": "cD1dRZR5VYGU"
}
},
{
"cell_type": "code",
"source": [
"# prompt template for the Factuality metric\n",
"factuality_template = \"\"\"\n",
"###INSTRUCTIONS###\n",
"\n",
"You are a helpful assistant who should evaluate if a medical assistant's response is factual given the provided medical context. Output 1 if the chatbot response is factually answering the user message and 0 if it doesn't.\n",
"\n",
"###EXAMPLE OUTPUT FORMAT###\n",
"{{\n",
" \"value\": 0,\n",
" \"reason\": \"The response is not factually answering the user question.\"\n",
"}}\n",
"\n",
"\n",
"###CONTEXT:###\n",
"{context}\n",
"\n",
"###INPUTS:###\n",
"{question}\n",
"\n",
"###RESPONSE:###\n",
"{response}\n",
"\"\"\"\n"
],
"metadata": {
"id": "cBVYbNLWValh"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"prompt_template = \"\"\"\n",
"### CONTEXT\n",
"{context}\n",
"\n",
"### QUESTION\n",
"{question}\n",
"\"\"\""
],
"metadata": {
"id": "IY4y7BPcVb4K"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"system = \"You are a helpful medical assistant who answers questions using provided medical context\""
],
"metadata": {
"id": "gWQxAJS1YfBr"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# LLM Application"
],
"metadata": {
"id": "tGCcMJ9rVNHd"
}
},
{
"cell_type": "code",
"source": [
"# Simple little client class for using different LLM APIs (OpenAI or LiteLLM)\n",
"class LLMClient:\n",
" def __init__(self, client_type: str =\"openai\", model: str =\"gpt-4\"):\n",
" self.client_type = client_type\n",
" self.model = model\n",
"\n",
" if self.client_type == \"openai\":\n",
" self.client = track_openai(openai.OpenAI())\n",
"\n",
" else:\n",
" self.client = None\n",
"\n",
" # LiteLLM query function\n",
" def _get_litellm_response(self, query: str, system: str = \"You are a helpful assistant.\"):\n",
" messages = [\n",
" {\"role\": \"system\", \"content\": system },\n",
" { \"role\": \"user\", \"content\": query }\n",
" ]\n",
"\n",
" response = litellm.completion(\n",
" model=self.model,\n",
" messages=messages\n",
" )\n",
"\n",
" return response.choices[0].message.content\n",
"\n",
" # OpenAI query function - use **kwargs to pass arguments like temperature\n",
" def _get_openai_response(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
" messages = [\n",
" {\"role\": \"system\", \"content\": system },\n",
" { \"role\": \"user\", \"content\": query }\n",
" ]\n",
"\n",
" response = self.client.chat.completions.create(\n",
" model=self.model,\n",
" messages=messages,\n",
" **kwargs\n",
" )\n",
"\n",
" return response.choices[0].message.content\n",
"\n",
"\n",
" def query(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
" if self.client_type == 'openai':\n",
" return self._get_openai_response(query, system, **kwargs)\n",
"\n",
" else:\n",
" return self._get_litellm_response(query, system)\n",
"\n",
"\n",
"\n"
],
"metadata": {
"id": "EPwBZ-Co7HMB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Set your model and initialize your LLM client\n",
"MODEL = \"gpt-4o-mini\"\n",
"llm_client = LLMClient(model=MODEL)"
],
"metadata": {
"id": "xzk-8vTI5S4Y"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"@track\n",
"def generate_factuality_score(question: str, context: str, response: str):\n",
" factuality_score = llm_client.query(factuality_template.format(context=context, question=question, response=response))\n",
" return eval(factuality_score)"
],
"metadata": {
"id": "D-kDhI6t5WY5"
},
"execution_count": null,
"outputs": []
},
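{
"cell_type": "markdown",
"source": [
"Before wiring the scoring function into a unit test, it can help to sanity-check it on a single hand-written example. The question, context, and response below are illustrative placeholders (not items from MedQuAD), and running the cell makes one LLM call."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Quick sanity check of the factuality scorer on a hand-written example.\n",
"# These strings are illustrative placeholders, not MedQuAD items.\n",
"sample_question = \"What are common symptoms of the flu?\"\n",
"sample_context = \"Influenza commonly causes fever, cough, sore throat, muscle aches, and fatigue.\"\n",
"sample_response = \"Common flu symptoms include fever, cough, sore throat, muscle aches, and fatigue.\"\n",
"\n",
"generate_factuality_score(sample_question, sample_context, sample_response)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},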
{
"cell_type": "code",
"source": [
"@track\n",
"def llm_application(question: str, context: str) -> str:\n",
" # LLM application code here\n",
" chatbot_response = llm_client.query(prompt_template.format(question=question, context=context))\n",
" return chatbot_response"
],
"metadata": {
"id": "56ZG_-iZ5Xbs"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# LLM Unit Testing"
],
"metadata": {
"id": "upwXqV7EZ0un"
}
},
{
"cell_type": "code",
"source": [
"eval_dataset = json.loads(dataset.to_json())\n",
"\n",
"# convert the list of dictionaries into a list of tuples\n",
"final_dataset = [(item[\"input\"][\"question\"], item[\"expected_output\"][\"response\"]) for item in eval_dataset]"
],
"metadata": {
"id": "IrhHns175fMz"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"@llm_unit(expected_output_key=\"expected_output\")\n",
"@pytest.mark.parametrize(\"user_question, expected_output\", final_dataset)\n",
"def test_factuality_test(user_question, expected_output):\n",
" response = llm_application(user_question)\n",
" factuality_score = generate_factuality_score(user_question, response)\n",
"\n",
" assert factuality_score[\"value\"] > 0.5\n"
],
"metadata": {
"id": "f2s-Fw0K5fLF"
},
"execution_count": null,
"outputs": []
},
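{
"cell_type": "markdown",
"source": [
"In a real project, a test like the one above would typically live in a standalone `test_*.py` file in your repository so that pytest can collect it during CI. The cell below is a minimal sketch of how that invocation might look; the file path `tests/test_factuality.py` is a hypothetical example, not part of this exercise."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Minimal sketch (assumption): in a real repository the test above would live in a\n",
"# standalone file such as tests/test_factuality.py (hypothetical path), and your CI\n",
"# pipeline would run `pytest` against it. pytest.main() is the programmatic\n",
"# equivalent of running pytest from the shell.\n",
"import os\n",
"import pytest\n",
"\n",
"TEST_FILE = \"tests/test_factuality.py\"  # hypothetical path, adjust to your repo layout\n",
"\n",
"if os.path.exists(TEST_FILE):\n",
"    pytest.main([\"-v\", TEST_FILE])\n",
"else:\n",
"    print(f\"{TEST_FILE} not found - copy the test above into that file to run it with pytest.\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},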
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "TUOb01k95fJ3"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "RWUHedCr5fGX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "1jhGs5tJ5fCR"
},
"execution_count": null,
"outputs": []
}
]
}