05-unit-test.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyPMn4CyZrlxLY17wV+kLsaR",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/caleb-kaiser/bed71c1b4af9aa48d7b825f5b9a17081/05-unit-test.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "<img src=\"https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/static/img/opik-logo.svg\" width=\"250\"/>"
      ],
      "metadata": {
        "id": "PJ_lkrnb3g10"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# LLM Unit Tests with MedQuAD\n",
        "\n"
      ],
      "metadata": {
        "id": "Fxn-MCa5Awyo"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
| "In this exercise, you'll be implementing an LLM unit test similar to the one you just saw in the lesson. To make the exercise a little more interesting, you'll be using the popular MedQuAD dataset, which is a question-answer dataset with context for each question-answer pair.\n", | |
| "\n", | |
| "You'll first download the dataset from HuggingFace and convert it into an Opik Dataset. Then, you'll reuse your Factuality metric from a previous lesson to build an LLM unit test. In the real world, you might use this unit test as part of your CI/CD pipeline, to ensure that any changes you make to your underlying model, prompt, or parameters doesn't lead to a regression.\n", | |
| "\n", | |
| "For this exercise, you can use OpenAI or open source models via LiteLLM." | |
      ],
      "metadata": {
        "id": "JfJNkYp_A4H7"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Imports & Configuration"
      ],
      "metadata": {
        "id": "4EbXCzQ9DIzv"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "! pip install opik openai litellm --quiet"
      ],
      "metadata": {
        "id": "v5hMcfoT6Ipv"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xlzz8No30MZx"
      },
      "outputs": [],
      "source": [
        "import pytest\n",
        "import opik\n",
        "from opik import track, llm_unit\n",
        "from opik import Opik\n",
        "from opik.integrations.openai import track_openai\n",
        "import openai\n",
        "import json\n",
        "import os\n",
        "from getpass import getpass\n",
        "import pandas as pd\n",
        "import litellm\n",
        "\n",
        "# Define project name to enable tracing\n",
        "os.environ[\"OPIK_PROJECT_NAME\"] = \"unit-test-MedQuAD-bench\""
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Opik configuration\n",
        "if \"OPIK_API_KEY\" not in os.environ:\n",
        "    os.environ[\"OPIK_API_KEY\"] = getpass(\"Enter your Opik API key: \")\n",
        "\n",
        "opik.configure()"
      ],
      "metadata": {
        "id": "D32onuCcArD8"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "client = opik.Opik()\n"
      ],
      "metadata": {
        "id": "yYTR5IA3DDuj"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Dataset"
      ],
      "metadata": {
        "id": "0vDQ3EvSDVyM"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create dataset\n",
        "dataset = client.get_or_create_dataset(\n",
        "    name=\"MedQuAD\", description=\"MedQuAD dataset\"\n",
        ")"
      ],
      "metadata": {
        "id": "fBiS0RX0DVTU"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
| "# Insert items into dataset\n", | |
| "df = pd.read_parquet(\n", | |
| " \"hf://datasets/AnonymousSub/MedQuAD_Context_Question_Answer_Triples_TWO/data/train-00000-of-00001-c38b6c63d6178c71.parquet\"\n", | |
| ")\n", | |
| "df = df.sample(n=50, random_state=42)\n" | |
| ], | |
| "metadata": { | |
| "id": "YUE7L051DS-M" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "df.head()" | |
| ], | |
| "metadata": { | |
| "id": "Lpvigw37NKBP" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "dataset.insert(df.to_dict('records'))" | |
| ], | |
| "metadata": { | |
| "id": "wXjfaqfyEvtO" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "# Prompts & Templates" | |
| ], | |
| "metadata": { | |
| "id": "cD1dRZR5VYGU" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# prompt template for the Factuality metric\n", | |
| "factuality_template = \"\"\"\n", | |
| "###INSTRUCTIONS###\n", | |
| "\n", | |
| "You are a helpful assistant who should evaluate if a medical assistant's response is factual given the provided medical context. Output 1 if the chatbot response is factually answering the user message and 0 if it doesn't.\n", | |
| "\n", | |
| "###EXAMPLE OUTPUT FORMAT###\n", | |
| "{{\n", | |
| " \"value\": 0,\n", | |
| " \"reason\": \"The response is not factually answering the user question.\"\n", | |
| "}}\n", | |
| "\n", | |
| "\n", | |
| "###CONTEXT:###\n", | |
| "{context}\n", | |
| "\n", | |
| "###INPUTS:###\n", | |
| "{question}\n", | |
| "\n", | |
| "###RESPONSE:###\n", | |
| "{response}\n", | |
| "\"\"\"\n" | |
| ], | |
| "metadata": { | |
| "id": "cBVYbNLWValh" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "prompt_template = \"\"\"\n", | |
| "### CONTEXT\n", | |
| "{context}\n", | |
| "\n", | |
| "### QUESTION\n", | |
| "{question}\n", | |
| "\"\"\"" | |
| ], | |
| "metadata": { | |
| "id": "IY4y7BPcVb4K" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "system = \"You are a helpful medical assistant who answers questions using provided medical context\"" | |
| ], | |
| "metadata": { | |
| "id": "gWQxAJS1YfBr" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "# LLM Application" | |
| ], | |
| "metadata": { | |
| "id": "tGCcMJ9rVNHd" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# Simple little client class for using different LLM APIs (OpenAI or LiteLLM)\n", | |
| "class LLMClient:\n", | |
| " def __init__(self, client_type: str =\"openai\", model: str =\"gpt-4\"):\n", | |
| " self.client_type = client_type\n", | |
| " self.model = model\n", | |
| "\n", | |
| " if self.client_type == \"openai\":\n", | |
| " self.client = track_openai(openai.OpenAI())\n", | |
| "\n", | |
| " else:\n", | |
| " self.client = None\n", | |
| "\n", | |
| " # LiteLLM query function\n", | |
| " def _get_litellm_response(self, query: str, system: str = \"You are a helpful assistant.\"):\n", | |
| " messages = [\n", | |
| " {\"role\": \"system\", \"content\": system },\n", | |
| " { \"role\": \"user\", \"content\": query }\n", | |
| " ]\n", | |
| "\n", | |
| " response = litellm.completion(\n", | |
| " model=self.model,\n", | |
| " messages=messages\n", | |
| " )\n", | |
| "\n", | |
| " return response.choices[0].message.content\n", | |
| "\n", | |
| " # OpenAI query function - use **kwargs to pass arguments like temperature\n", | |
| " def _get_openai_response(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n", | |
| " messages = [\n", | |
| " {\"role\": \"system\", \"content\": system },\n", | |
| " { \"role\": \"user\", \"content\": query }\n", | |
| " ]\n", | |
| "\n", | |
| " response = self.client.chat.completions.create(\n", | |
| " model=self.model,\n", | |
| " messages=messages,\n", | |
| " **kwargs\n", | |
| " )\n", | |
| "\n", | |
| " return response.choices[0].message.content\n", | |
| "\n", | |
| "\n", | |
| " def query(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n", | |
| " if self.client_type == 'openai':\n", | |
| " return self._get_openai_response(query, system, **kwargs)\n", | |
| "\n", | |
| " else:\n", | |
| " return self._get_litellm_response(query, system)\n", | |
| "\n", | |
| "\n", | |
| "\n" | |
| ], | |
| "metadata": { | |
| "id": "EPwBZ-Co7HMB" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# Set your model and initialize your LLM client\n", | |
| "MODEL = \"gpt-4o-mini\"\n", | |
| "llm_client = LLMClient(model=MODEL)" | |
| ], | |
| "metadata": { | |
| "id": "xzk-8vTI5S4Y" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
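    {
      "cell_type": "markdown",
      "source": [
        "Before building on the client, it's worth a quick smoke test. The next cell is a minimal sketch: the question is just an illustrative placeholder, and `temperature` is passed through `**kwargs` as noted in the `LLMClient` comments."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Smoke test: one throwaway question, with temperature passed via **kwargs\n",
        "print(llm_client.query(\"What are the common symptoms of anemia?\", system=system, temperature=0.0))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },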
    {
      "cell_type": "code",
      "source": [
        "@track\n",
        "def generate_factuality_score(question: str, context: str, response: str):\n",
        "    factuality_score = llm_client.query(factuality_template.format(context=context, question=question, response=response))\n",
| " return eval(factuality_score)" | |
      ],
      "metadata": {
        "id": "D-kDhI6t5WY5"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "@track\n",
        "def llm_application(question: str, context: str) -> str:\n",
        "    # Answer the question using the medical system prompt defined above\n",
        "    chatbot_response = llm_client.query(prompt_template.format(question=question, context=context), system=system)\n",
        "    return chatbot_response"
      ],
      "metadata": {
        "id": "56ZG_-iZ5Xbs"
      },
      "execution_count": null,
      "outputs": []
    },
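    {
      "cell_type": "markdown",
      "source": [
        "Before wiring these functions into a unit test, it helps to sanity-check the full pipeline on a single row. The next cell is a minimal sketch; it assumes the parquet's columns are named `context`, `question`, and `answer`, so check `df.columns` and adjust if yours differ."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Sanity check on one dataset row.\n",
        "# NOTE: the column names (\"context\", \"question\", \"answer\") are assumptions --\n",
        "# adjust them to whatever df.columns shows for this parquet file.\n",
        "sample = df.iloc[0]\n",
        "sample_response = llm_application(sample[\"question\"], sample[\"context\"])\n",
        "print(sample_response)\n",
        "print(generate_factuality_score(sample[\"question\"], sample[\"context\"], sample_response))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },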
    {
      "cell_type": "markdown",
      "source": [
        "# LLM Unit Testing"
      ],
      "metadata": {
        "id": "upwXqV7EZ0un"
      }
    },
    {
      "cell_type": "code",
      "source": [
| "eval_dataset = json.loads(dataset.to_json())\n", | |
| "\n", | |
| "# convert the list of dictionaries into a list of tuples\n", | |
| "final_dataset = [(item[\"input\"][\"question\"], item[\"expected_output\"][\"response\"]) for item in eval_dataset]" | |
      ],
      "metadata": {
        "id": "IrhHns175fMz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
| "@llm_unit(expected_output_key=\"expected_output\")\n", | |
| "@pytest.mark.parametrize(\"user_question, expected_output\", final_dataset)\n", | |
| "def test_factuality_test(user_question, expected_output):\n", | |
| " response = llm_application(user_question)\n", | |
| " factuality_score = generate_factuality_score(user_question, response)\n", | |
| "\n", | |
| " assert factuality_score[\"value\"] > 0.5\n" | |
      ],
      "metadata": {
        "id": "f2s-Fw0K5fLF"
      },
      "execution_count": null,
      "outputs": []
    },
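    {
      "cell_type": "markdown",
      "source": [
        "Pytest doesn't collect tests defined in notebook cells on its own. In a CI/CD pipeline you would typically move the test above into a file (e.g. a hypothetical `test_factuality.py`) and run `pytest` directly; inside the notebook, the `ipytest` package can collect and run it. The next cell is a minimal sketch -- note it makes two LLM calls per dataset row, so roughly 100 calls for the 50 sampled rows."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "! pip install ipytest --quiet\n",
        "import ipytest\n",
        "\n",
        "ipytest.autoconfig()\n",
        "\n",
        "# Collect and run the parametrized test defined in this notebook\n",
        "ipytest.run(\"-q\")"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    }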
  ]
}