@caleb-kaiser
Created October 24, 2024 00:42
05-unit-test.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyPMn4CyZrlxLY17wV+kLsaR",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/caleb-kaiser/bed71c1b4af9aa48d7b825f5b9a17081/05-unit-test.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"<img src=\"https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/static/img/opik-logo.svg\" width=\"250\"/>"
],
"metadata": {
"id": "PJ_lkrnb3g10"
}
},
{
"cell_type": "markdown",
"source": [
"# LLM Unit Tests with MedQuAD\n",
"\n"
],
"metadata": {
"id": "Fxn-MCa5Awyo"
}
},
{
"cell_type": "markdown",
"source": [
"In this exercise, you'll be implementing an LLM unit test similar to the one you just saw in the lesson. To make the exercise a little more interesting, you'll be using the popular MedQuAD dataset, which is a question-answer dataset with context for each question-answer pair.\n",
"\n",
"You'll first download the dataset from HuggingFace and convert it into an Opik Dataset. Then, you'll reuse your Factuality metric from a previous lesson to build an LLM unit test. In the real world, you might use this unit test as part of your CI/CD pipeline, to ensure that any changes you make to your underlying model, prompt, or parameters doesn't lead to a regression.\n",
"\n",
"For this exercise, you can use OpenAI or open source models via LiteLLM."
],
"metadata": {
"id": "JfJNkYp_A4H7"
}
},
{
"cell_type": "markdown",
"source": [
"# Imports & Configuration"
],
"metadata": {
"id": "4EbXCzQ9DIzv"
}
},
{
"cell_type": "code",
"source": [
"! pip install opik openai litellm --quiet"
],
"metadata": {
"id": "v5hMcfoT6Ipv"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xlzz8No30MZx"
},
"outputs": [],
"source": [
"import pytest\n",
"import opik\n",
"from opik import track, llm_unit\n",
"from opik import Opik\n",
"from opik.integrations.openai import track_openai\n",
"import openai\n",
"import json\n",
"import os\n",
"from getpass import getpass\n",
"import pandas as pd\n",
"import litellm\n",
"\n",
"# Define project name to enable tracing\n",
"os.environ[\"OPIK_PROJECT_NAME\"] = \"unit-test-MedQuAD-bench\""
]
},
{
"cell_type": "code",
"source": [
"# Opik configuration\n",
"if \"OPIK_API_KEY\" not in os.environ:\n",
" os.environ[\"OPIK_API_KEY\"] = getpass(\"Enter your Opik API key: \")\n",
"\n",
"opik.configure()"
],
"metadata": {
"id": "D32onuCcArD8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"client = opik.Opik()\n"
],
"metadata": {
"id": "yYTR5IA3DDuj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Dataset"
],
"metadata": {
"id": "0vDQ3EvSDVyM"
}
},
{
"cell_type": "code",
"source": [
"# Create dataset\n",
"dataset = client.get_or_create_dataset(\n",
" name=\"MedQuAD\", description=\"MedQuAD dataset\"\n",
")"
],
"metadata": {
"id": "fBiS0RX0DVTU"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Insert items into dataset\n",
"df = pd.read_parquet(\n",
" \"hf://datasets/AnonymousSub/MedQuAD_Context_Question_Answer_Triples_TWO/data/train-00000-of-00001-c38b6c63d6178c71.parquet\"\n",
")\n",
"df = df.sample(n=50, random_state=42)\n"
],
"metadata": {
"id": "YUE7L051DS-M"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"df.head()"
],
"metadata": {
"id": "Lpvigw37NKBP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"dataset.insert(df.to_dict('records'))"
],
"metadata": {
"id": "wXjfaqfyEvtO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Prompts & Templates"
],
"metadata": {
"id": "cD1dRZR5VYGU"
}
},
{
"cell_type": "code",
"source": [
"# prompt template for the Factuality metric\n",
"factuality_template = \"\"\"\n",
"###INSTRUCTIONS###\n",
"\n",
"You are a helpful assistant who should evaluate if a medical assistant's response is factual given the provided medical context. Output 1 if the chatbot response is factually answering the user message and 0 if it doesn't.\n",
"\n",
"###EXAMPLE OUTPUT FORMAT###\n",
"{{\n",
" \"value\": 0,\n",
" \"reason\": \"The response is not factually answering the user question.\"\n",
"}}\n",
"\n",
"\n",
"###CONTEXT:###\n",
"{context}\n",
"\n",
"###INPUTS:###\n",
"{question}\n",
"\n",
"###RESPONSE:###\n",
"{response}\n",
"\"\"\"\n"
],
"metadata": {
"id": "cBVYbNLWValh"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"prompt_template = \"\"\"\n",
"### CONTEXT\n",
"{context}\n",
"\n",
"### QUESTION\n",
"{question}\n",
"\"\"\""
],
"metadata": {
"id": "IY4y7BPcVb4K"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"system = \"You are a helpful medical assistant who answers questions using provided medical context\""
],
"metadata": {
"id": "gWQxAJS1YfBr"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# LLM Application"
],
"metadata": {
"id": "tGCcMJ9rVNHd"
}
},
{
"cell_type": "code",
"source": [
"# Simple little client class for using different LLM APIs (OpenAI or LiteLLM)\n",
"class LLMClient:\n",
" def __init__(self, client_type: str =\"openai\", model: str =\"gpt-4\"):\n",
" self.client_type = client_type\n",
" self.model = model\n",
"\n",
" if self.client_type == \"openai\":\n",
" self.client = track_openai(openai.OpenAI())\n",
"\n",
" else:\n",
" self.client = None\n",
"\n",
" # LiteLLM query function\n",
" def _get_litellm_response(self, query: str, system: str = \"You are a helpful assistant.\"):\n",
" messages = [\n",
" {\"role\": \"system\", \"content\": system },\n",
" { \"role\": \"user\", \"content\": query }\n",
" ]\n",
"\n",
" response = litellm.completion(\n",
" model=self.model,\n",
" messages=messages\n",
" )\n",
"\n",
" return response.choices[0].message.content\n",
"\n",
" # OpenAI query function - use **kwargs to pass arguments like temperature\n",
" def _get_openai_response(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
" messages = [\n",
" {\"role\": \"system\", \"content\": system },\n",
" { \"role\": \"user\", \"content\": query }\n",
" ]\n",
"\n",
" response = self.client.chat.completions.create(\n",
" model=self.model,\n",
" messages=messages,\n",
" **kwargs\n",
" )\n",
"\n",
" return response.choices[0].message.content\n",
"\n",
"\n",
" def query(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
" if self.client_type == 'openai':\n",
" return self._get_openai_response(query, system, **kwargs)\n",
"\n",
" else:\n",
" return self._get_litellm_response(query, system)\n",
"\n",
"\n",
"\n"
],
"metadata": {
"id": "EPwBZ-Co7HMB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Set your model and initialize your LLM client\n",
"MODEL = \"gpt-4o-mini\"\n",
"llm_client = LLMClient(model=MODEL)"
],
"metadata": {
"id": "xzk-8vTI5S4Y"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"@track\n",
"def generate_factuality_score(question: str, context: str, response: str):\n",
" factuality_score = llm_client.query(factuality_template.format(context=context, question=question, response=response))\n",
" return eval(factuality_score)"
],
"metadata": {
"id": "D-kDhI6t5WY5"
},
"execution_count": null,
"outputs": []
},
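{
"cell_type": "markdown",
"source": [
"Before wiring the scoring function into a unit test, it can help to sanity-check it on a single hand-written example. The question, context, and response below are illustrative placeholders (not items from MedQuAD), and running the cell makes one LLM call."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Quick sanity check of the factuality scorer on a hand-written example.\n",
"# These strings are illustrative placeholders, not MedQuAD items.\n",
"sample_question = \"What are common symptoms of the flu?\"\n",
"sample_context = \"Influenza commonly causes fever, cough, sore throat, muscle aches, and fatigue.\"\n",
"sample_response = \"Common flu symptoms include fever, cough, sore throat, muscle aches, and fatigue.\"\n",
"\n",
"generate_factuality_score(sample_question, sample_context, sample_response)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},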
{
"cell_type": "code",
"source": [
"@track\n",
"def llm_application(question: str, context: str) -> str:\n",
" # LLM application code here\n",
" chatbot_response = llm_client.query(prompt_template.format(question=question, context=context))\n",
" return chatbot_response"
],
"metadata": {
"id": "56ZG_-iZ5Xbs"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# LLM Unit Testing"
],
"metadata": {
"id": "upwXqV7EZ0un"
}
},
{
"cell_type": "code",
"source": [
"eval_dataset = json.loads(dataset.to_json())\n",
"\n",
"# convert the list of dictionaries into a list of tuples\n",
"final_dataset = [(item[\"input\"][\"question\"], item[\"expected_output\"][\"response\"]) for item in eval_dataset]"
],
"metadata": {
"id": "IrhHns175fMz"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"@llm_unit(expected_output_key=\"expected_output\")\n",
"@pytest.mark.parametrize(\"user_question, expected_output\", final_dataset)\n",
"def test_factuality_test(user_question, expected_output):\n",
" response = llm_application(user_question)\n",
" factuality_score = generate_factuality_score(user_question, response)\n",
"\n",
" assert factuality_score[\"value\"] > 0.5\n"
],
"metadata": {
"id": "f2s-Fw0K5fLF"
},
"execution_count": null,
"outputs": []
},
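{
"cell_type": "markdown",
"source": [
"In a real project, a test like the one above would typically live in a standalone `test_*.py` file in your repository so that pytest can collect it during CI. The cell below is a minimal sketch of how that invocation might look; the file path `tests/test_factuality.py` is a hypothetical example, not part of this exercise."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Minimal sketch (assumption): in a real repository the test above would live in a\n",
"# standalone file such as tests/test_factuality.py (hypothetical path), and your CI\n",
"# pipeline would run `pytest` against it. pytest.main() is the programmatic\n",
"# equivalent of running pytest from the shell.\n",
"import os\n",
"import pytest\n",
"\n",
"TEST_FILE = \"tests/test_factuality.py\"  # hypothetical path, adjust to your repo layout\n",
"\n",
"if os.path.exists(TEST_FILE):\n",
"    pytest.main([\"-v\", TEST_FILE])\n",
"else:\n",
"    print(f\"{TEST_FILE} not found - copy the test above into that file to run it with pytest.\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},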
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "TUOb01k95fJ3"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "RWUHedCr5fGX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "1jhGs5tJ5fCR"
},
"execution_count": null,
"outputs": []
}
]
}