diff --git a/docs/index.md b/docs/index.md index c3140cbbd4..a7e801c629 100644 --- a/docs/index.md +++ b/docs/index.md @@ -14,6 +14,7 @@ getting_started/quickstart/index open_source/installation_library/index open_source/scan/index +open_source/testset_generation/index open_source/customize_tests/index open_source/integrate_tests/index ``` @@ -74,6 +75,7 @@ cli/index reference/models/index reference/datasets/index reference/scan/index +reference/rag-toolset/index reference/tests/index reference/slicing-functions/index reference/transformation-functions/index diff --git a/docs/open_source/index.md b/docs/open_source/index.md index ee9302d019..5fb96bad77 100644 --- a/docs/open_source/index.md +++ b/docs/open_source/index.md @@ -6,6 +6,7 @@ installation_library/index scan/index +testset_generation/index customize_tests/index integrate_tests/index ``` @@ -23,6 +24,11 @@ integrate_tests/index :link: scan/index.html :::: +::::{grid-item-card}

🧰 RAG Toolset

+:text-align: center +:link: testset_generation/index.html +:::: + ::::{grid-item-card}

🧪 Customize your tests

:text-align: center
:link: customize_tests/index.html
diff --git a/docs/open_source/scan/scan_llm/index.md b/docs/open_source/scan/scan_llm/index.md
index 823368b21f..f6d0c40667 100644
--- a/docs/open_source/scan/scan_llm/index.md
+++ b/docs/open_source/scan/scan_llm/index.md
@@ -72,7 +72,7 @@ set_llm_model('my-gpt-4-model')
 We are now ready to start.
-
+(model-wrapping)=
 ## Step 1: Wrap your model
 Start by **wrapping your model**. This step is necessary to ensure a common format for your model and its metadata.
diff --git a/docs/open_source/testset_generation/index.md b/docs/open_source/testset_generation/index.md
new file mode 100644
index 0000000000..f52c34a0ed
--- /dev/null
+++ b/docs/open_source/testset_generation/index.md
@@ -0,0 +1,194 @@
+# 🧰 RAG Testset Generation
+
+> ⚠️ **The RAG toolset is currently in an early version and is subject to change**. Feel free to reach out on our [Discord server](https://discord.gg/fkv7CAr3FE) if you have any trouble with test set generation or to provide feedback.
+
+
+The Giskard Python library provides a toolset dedicated to Retrieval Augmented Generation (RAG) models that generates question & answer pairs from your model's knowledge base. The generated test set is then used to evaluate your model.
+
+(difficulty_levels)=
+## Generate questions with difficulty levels
+
+You can currently generate questions with three difficulty levels:
+
+```{list-table}
+:header-rows: 1
+:widths: 35, 65
+* - Difficulty Level
+  - Description
+* - **1: Easy questions**
+  - Simple questions generated from an excerpt of the knowledge base
+* - **2: Complex questions**
+  - Questions made more complex by paraphrasing
+* - **3: Distracting questions**
+  - Questions made even more difficult by adding a distracting element which is related to the knowledge base but irrelevant to the question
+```
+
+These three difficulty levels allow you to evaluate different components of your model. Easy questions are directly generated from your knowledge base. They assess the quality of the answer generation from the context, i.e. the quality of the LLM answer. Complex and distracting questions are more challenging, as they can perturb the retrieval component of the RAG. These questions are more representative of users seeking precise information from your model.
+
+## Before starting
+
+Before starting, make sure you have installed the LLM flavor of Giskard:
+
+```bash
+pip install "giskard[llm]"
+```
+
+To use the RAG test set generation and evaluation tools, you need an OpenAI API key. You can set it in your notebook like this:
+
+:::::::{tab-set}
+::::::{tab-item} OpenAI
+
+```python
+import os
+
+os.environ["OPENAI_API_KEY"] = "sk-…"
+```
+
+::::::
+::::::{tab-item} Azure OpenAI
+
+Requires `openai>=1.0.0`.
+Make sure that both the LLM and embedding models are deployed on the Azure endpoint. The default embedding model used by the Giskard client is `text-embedding-ada-002`.
+
+```python
+import os
+from giskard.llm import set_llm_model
+
+os.environ['AZURE_OPENAI_API_KEY'] = '...'
+os.environ['AZURE_OPENAI_ENDPOINT'] = 'https://xxx.openai.azure.com'
+os.environ['OPENAI_API_VERSION'] = '2023-07-01-preview'
+
+
+# You'll need to provide the name of the model that you've deployed
+# Beware, the model provided must be capable of using function calls
+set_llm_model('my-gpt-4-model')
+```
+
+::::::
+:::::::
+
+We are now ready to start.
+
+
+## Step 1: Automatically generate a Q&A test set
+
+To start, you only need your data or knowledge base in a pandas `DataFrame`.
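For instance, a minimal in-memory knowledge base could look like the following sketch (the two FAQ entries are made up for illustration; in practice you would load your own documents):

```python
import pandas as pd

# A toy knowledge base: one row per document, one or more text columns.
knowledge_base_df = pd.DataFrame(
    {
        "title": ["Shipping policy", "Order tracking"],
        "text": [
            "We offer free shipping on all orders over $50 and ship to the US, Canada and Mexico.",
            "Once your order has shipped, you will receive an email with a tracking number.",
        ],
    }
)
```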
+Then, you can initialize the testset generator ({class}`giskard.rag.TestsetGenerator`) by passing your dataframe.
+
+If some columns in your dataframe are not relevant for the generation of questions (e.g. they contain metadata), make sure you pass the names of the relevant columns to the `knowledge_base_columns` argument (see {class}`giskard.rag.TestsetGenerator`).
+
+To make the question generation more accurate, you can also provide a model name and a model description to the generator. This helps the generator produce questions that are more relevant to your model's task. You can also specify the language of the generated questions.
+
+
+```python
+import pandas as pd
+
+from giskard.rag import TestsetGenerator
+
+# Load your data
+knowledge_base_df = pd.read_csv("path/to/your/knowledge_base.csv")
+
+# Initialize the testset generator
+generator = TestsetGenerator(
+    knowledge_base_df,
+    knowledge_base_columns=["column_1", "column_2"],
+    language="en",  # Optional, if you want to generate questions in a specific language
+
+    # Optionally, you can provide a model name and description to improve the question quality
+    model_name="Shop Assistant",
+    model_description="A model that answers common questions about our products",
+)
+```
+
+We are ready to generate the test set. Let's start with a small test set of 10 questions and answers per difficulty level; in the example below we use levels 1 and 2.
+
+Currently, you can choose the difficulty levels from 1 to 3 (see {ref}`difficulty_levels`).
+
+```python
+# Generate a testset with 10 questions & answers for each difficulty level (this will take a while)
+testset = generator.generate_testset(num_questions=10, difficulty=[1, 2])
+
+# Save the generated testset
+testset.save("my_testset.jsonl")
+
+# Load it back
+from giskard.rag import QATestset
+
+loaded_testset = QATestset.load("my_testset.jsonl")
+```
+
+The test set will be an instance of {class}`~giskard.rag.QATestset`. You can save it and load it later with `QATestset.load("path/to/testset.jsonl")`.
+
+You can also convert it to a pandas DataFrame with `testset.to_pandas()`:
+
+```python
+# Convert it to a pandas dataframe
+df = loaded_testset.to_pandas()
+```
+
+Let's have a look at the generated questions:
+
+| question | reference_context | reference_answer | difficulty_level |
+|----------|-------------------|------------------|------------------|
+| For which countries can I track my shipping? | What is your shipping policy? We offer free shipping on all orders over \$50. For orders below \$50, we charge a flat rate of \$5.99. We offer shipping services to customers residing in all 50 states of the US, in addition to providing delivery options to Canada and Mexico. ------ How can I track my order? Once your purchase has been successfully confirmed and shipped, you will receive a confirmation email containing your tracking number. You can simply click on the link provided in the email or visit our website's order tracking page. | We ship to all 50 states in the US, as well as to Canada and Mexico. We offer tracking for all our shippings. | 1 |
+
+As you can see, the data contains 4 columns:
+- `question`: the generated question
+- `reference_context`: the context that can be used to answer the question
+- `reference_answer`: the answer to the question (generated with GPT-4)
+- `difficulty_level`: the difficulty level of the question (1, 2 or 3)
+
+## Step 2: Evaluate your model on the generated test set
+
+Before evaluating your model, you must wrap it as a `giskard.Model`.
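For orientation, here is a minimal sketch of what such a wrapper can look like; `answer_question` is a hypothetical stand-in for your own RAG pipeline, and the requirements behind each argument are detailed just below:

```python
import giskard
import pandas as pd

def rag_predict(df: pd.DataFrame) -> list:
    # `answer_question` is a hypothetical placeholder for a call to your RAG pipeline.
    return [answer_question(question) for question in df["question"]]

giskard_model = giskard.Model(
    model=rag_predict,
    model_type="text_generation",
    name="Shop Assistant",
    description="A model that answers common questions about our products",
    feature_names=["question"],  # the question column of the test set must be a model feature
)
```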
+This step is necessary to ensure a common format for your model and its metadata. You can wrap anything as long as you can represent it in a Python function (for example an API call to Azure or OpenAI). We also have pre-built wrappers for LangChain objects, or you can create your own wrapper by extending the `giskard.Model` class if you need to wrap a complex object such as a custom-made RAG communicating with a vector store.
+
+To do so, you can follow the instructions from the [LLM Scan feature](../scan/scan_llm/index.md#step-1-wrap-your-model). Make sure that you pass `feature_names = "question"` when wrapping your model, so that it matches the question column of the test set.
+
+Detailed examples can also be found on our {doc}`LLM tutorials section `.
+
+Once you have wrapped your model, we can proceed with evaluation.
+
+Let's convert our test set into an actionable test suite ({class}`giskard.Suite`) that we can save and reuse in further iterations.
+
+```python
+test_suite = testset.to_test_suite("My first test suite")
+
+test_suite.run(model=giskard_model)
+```
+
+![](./test_suite_widget.png)
+
+Jump to the [test customization](https://docs.giskard.ai/en/latest/open_source/customize_tests/index.html) and [test integration](https://docs.giskard.ai/en/latest/open_source/integrate_tests/index.html) sections to find out everything you can do with test suites. A short sketch of slicing the test set by difficulty level before building a suite is also given at the end of this page.
+
+
+## Step 3: Upload your test suite to the Giskard Hub
+
+Uploading a test suite to the hub allows you to:
+* Compare the quality of different models and prompts to decide which one to promote
+* Create more tests relevant to your use case, combining input prompts that make your model fail and custom evaluation criteria
+* Share results and collaborate with your team to integrate business feedback
+
+To upload your test suite, you must have created a project on Giskard Hub and instantiated a Giskard Python client. If you haven't done this yet, follow the first steps of the [upload your object](https://docs.giskard.ai/en/latest/giskard_hub/upload/index.html#upload-your-object) guide.
+
+Then, upload your test suite like this:
+```python
+test_suite.upload(giskard_client, project_id) # project_id should be the id of the Giskard project in which you want to upload the suite
+```
+
+[Here's a demo](https://huggingface.co/spaces/giskardai/giskard) of the Giskard Hub in action.
+
+
+## What data is sent to OpenAI/Azure OpenAI
+
+To generate the questions, we send the following information to OpenAI/Azure OpenAI:
+
+- Data provided in your knowledge base
+- Text generated by your model
+- Model name and description
+
+## Will the test set generation work in any language?
+Yes, you can specify the language of the generated questions when you initialize the {class}`giskard.rag.TestsetGenerator`.
+
+## Troubleshooting
+If you encounter any issues, join our [Discord community](https://discord.gg/fkv7CAr3FE) and ask questions in our #support channel.
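As mentioned above, here is a possible way to build a suite from only a subset of the generated questions. `QATestset` wraps a plain pandas DataFrame, so you can filter its pandas representation and rebuild a test set from it. This is a sketch based on the classes introduced in this PR, not an officially documented workflow:

```python
from giskard.rag import QATestset

loaded_testset = QATestset.load("my_testset.jsonl")

# Keep only complex and distracting questions (difficulty levels 2 and 3).
df = loaded_testset.to_pandas()
hard_testset = QATestset(df[df["difficulty_level"] >= 2])

hard_suite = hard_testset.to_test_suite("Hard questions only")
```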
diff --git a/docs/open_source/testset_generation/test_suite_widget.png b/docs/open_source/testset_generation/test_suite_widget.png new file mode 100644 index 0000000000..1ec3eb2326 Binary files /dev/null and b/docs/open_source/testset_generation/test_suite_widget.png differ diff --git a/docs/reference/index.rst b/docs/reference/index.rst index 94afac716d..dfa15be8b3 100644 --- a/docs/reference/index.rst +++ b/docs/reference/index.rst @@ -13,3 +13,4 @@ API Reference transformation-functions/index push/index suite/index + rag-toolset/index diff --git a/docs/reference/rag-toolset/correctness_evaluator.rst b/docs/reference/rag-toolset/correctness_evaluator.rst new file mode 100644 index 0000000000..c47cb39ab9 --- /dev/null +++ b/docs/reference/rag-toolset/correctness_evaluator.rst @@ -0,0 +1,5 @@ +Correctness Evaluator +====== + +.. autoclass:: giskard.llm.evaluators.CorrectnessEvaluator + :members: diff --git a/docs/reference/rag-toolset/index.rst b/docs/reference/rag-toolset/index.rst new file mode 100644 index 0000000000..efc35eccff --- /dev/null +++ b/docs/reference/rag-toolset/index.rst @@ -0,0 +1,10 @@ +RAG Toolset +============= + +.. toctree:: + :maxdepth: 2 + + testset_generation + vector_store + correctness_evaluator + diff --git a/docs/reference/rag-toolset/testset_generation.rst b/docs/reference/rag-toolset/testset_generation.rst new file mode 100644 index 0000000000..82fe2ac2d3 --- /dev/null +++ b/docs/reference/rag-toolset/testset_generation.rst @@ -0,0 +1,8 @@ +Testset Generation +====== + +.. autoclass:: giskard.rag.TestsetGenerator + :members: + +.. autoclass:: giskard.rag.QATestset + :members: \ No newline at end of file diff --git a/docs/reference/rag-toolset/vector_store.rst b/docs/reference/rag-toolset/vector_store.rst new file mode 100644 index 0000000000..f2fa9d0780 --- /dev/null +++ b/docs/reference/rag-toolset/vector_store.rst @@ -0,0 +1,8 @@ +Vector Store +====== + +.. autoclass:: giskard.rag.vector_store.VectorStore + :members: + +.. autoclass:: giskard.rag.vector_store.Document + :members: diff --git a/docs/reference/tests/llm.rst b/docs/reference/tests/llm.rst index 436a104830..1e3592d64c 100644 --- a/docs/reference/tests/llm.rst +++ b/docs/reference/tests/llm.rst @@ -14,6 +14,7 @@ LLM-as-a-judge .. autofunction:: giskard.testing.tests.llm.test_llm_output_against_requirement_per_row .. autofunction:: giskard.testing.tests.llm.test_llm_single_output_against_requirement .. autofunction:: giskard.testing.tests.llm.test_llm_output_against_requirement +.. autofunction:: giskard.testing.tests.llm.test_llm_correctness Ground Truth -------------- diff --git a/giskard/llm/client/base.py b/giskard/llm/client/base.py index f79fb4712b..904258025a 100644 --- a/giskard/llm/client/base.py +++ b/giskard/llm/client/base.py @@ -3,6 +3,8 @@ from abc import ABC, abstractmethod from dataclasses import dataclass +import numpy as np + from .logger import LLMLogger @@ -22,9 +24,9 @@ class LLMToolCall: @dataclass class LLMMessage: role: str - content: Optional[str] - function_call: Optional[LLMFunctionCall] - tool_calls: Optional[List[LLMToolCall]] + content: Optional[str] = None + function_call: Optional[LLMFunctionCall] = None + tool_calls: Optional[List[LLMToolCall]] = None @staticmethod def create_message(role: str, content: str): @@ -50,3 +52,7 @@ def complete( tool_choice=None, ) -> LLMMessage: ... + + @abstractmethod + def embeddings(self, text) -> np.ndarray: + ... 
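The change above adds an abstract `embeddings` method to `LLMClient`, which the RAG toolset uses to index the knowledge base. As a rough illustration of the new API (a sketch assuming an OpenAI-backed default client configured with a valid API key, as in the docs above), embedding a few documents could look like this:

```python
from giskard.llm.client import get_default_client

client = get_default_client()  # OpenAI-based client when an OpenAI key is configured

# Returns a numpy array with one embedding vector per input text.
vectors = client.embeddings(["What is your refund policy?", "How can I track my order?"])
print(vectors.shape)  # e.g. (2, 1536) with text-embedding-ada-002
```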
diff --git a/giskard/llm/client/openai.py b/giskard/llm/client/openai.py index 2a9a8b975e..2f891d847c 100644 --- a/giskard/llm/client/openai.py +++ b/giskard/llm/client/openai.py @@ -3,6 +3,7 @@ import json from abc import ABC, abstractmethod +import numpy as np from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential from ..config import LLMConfigurationError @@ -22,6 +23,8 @@ class BaseOpenAIClient(LLMClient, ABC): + _max_embedding_chunk_size = 2048 + def __init__(self, model: str): self._logger = LLMLogger() self.model = model @@ -65,9 +68,9 @@ def _serialize_message(response: LLMMessage) -> Dict: result = { "role": response.role, "content": response.content, - "function_call": BaseOpenAIClient._serialize_function_call(response.function_call) - if response.function_call - else None, + "function_call": ( + BaseOpenAIClient._serialize_function_call(response.function_call) if response.function_call else None + ), "tool_calls": BaseOpenAIClient._serialize_tool_calls(response.tool_calls) if response.tool_calls else None, } @@ -100,12 +103,16 @@ def _parse_message(response) -> LLMMessage: return LLMMessage( role=response["role"], content=response["content"], - function_call=BaseOpenAIClient._parse_function_call(response["function_call"]) - if "function_call" in response and response["function_call"] is not None - else None, - tool_calls=BaseOpenAIClient._parse_tool_calls(response["tool_calls"]) - if "tool_calls" in response and response["tool_calls"] is not None - else None, + function_call=( + BaseOpenAIClient._parse_function_call(response["function_call"]) + if "function_call" in response and response["function_call"] is not None + else None + ), + tool_calls=( + BaseOpenAIClient._parse_tool_calls(response["tool_calls"]) + if "tool_calls" in response and response["tool_calls"] is not None + else None + ), ) def complete( @@ -135,6 +142,18 @@ def complete( return BaseOpenAIClient._parse_message(llm_message) + @abstractmethod + def _embeddings_generation(self, texts: Sequence[str], model: str): + ... + + def embeddings( + self, texts: Sequence[str], model: str = "text-embedding-ada-002", chunk_size: int = 2048 + ) -> np.ndarray: + chunks_indices = range(chunk_size, len(texts), chunk_size) + chunks = np.split(texts, chunks_indices) + embedded_chunks = [self._embeddings_generation(chunk, model) for chunk in chunks] + return np.stack([emb for embeddings in embedded_chunks for emb in embeddings]) + class LegacyOpenAIClient(BaseOpenAIClient): """OpenAI client for versions <= 0.28.1""" @@ -192,6 +211,20 @@ def _completion( return completion["choices"][0]["message"] + def _embeddings_generation(self, texts: Sequence[str], model: str): + try: + out = openai.Embedding.create(input=list(texts), engine=model) + embeddings = [element["embedding"] for element in out["data"]] + except openai.error.InvalidRequestError as err: + raise ValueError( + f"The embedding model: '{model}' was not found," + "make sure the model is correctly deployed on your endpoint." 
+ ) from err + except Exception as err: + raise RuntimeError("Embedding creation failed.") from err + + return embeddings + class OpenAIClient(BaseOpenAIClient): def __init__(self, model: str, client=None): @@ -240,3 +273,18 @@ def _completion( ) return completion.choices[0].message.model_dump() + + def _embeddings_generation(self, texts: Sequence[str], model: str): + try: + out = self._client.embeddings.create(input=list(texts), model=model) + embeddings = [element.embedding for element in out.data] + except openai.NotFoundError as err: + raise ValueError( + f"The embedding model: '{model}' was not found," + "make sure the model is correctly deployed on " + f"the specified endpoint: {self._client._base_url}." + ) from err + except Exception as err: + raise RuntimeError("Embedding creation failed.") from err + + return embeddings diff --git a/giskard/llm/evaluators/__init__.py b/giskard/llm/evaluators/__init__.py index 3606c0161b..c12d57e56b 100644 --- a/giskard/llm/evaluators/__init__.py +++ b/giskard/llm/evaluators/__init__.py @@ -1,5 +1,12 @@ from .coherency import CoherencyEvaluator +from .correctness import CorrectnessEvaluator from .plausibility import PlausibilityEvaluator from .requirements import PerRowRequirementEvaluator, RequirementEvaluator -__all__ = ["CoherencyEvaluator", "RequirementEvaluator", "PerRowRequirementEvaluator", "PlausibilityEvaluator"] +__all__ = [ + "CoherencyEvaluator", + "RequirementEvaluator", + "PerRowRequirementEvaluator", + "PlausibilityEvaluator", + "CorrectnessEvaluator", +] diff --git a/giskard/llm/evaluators/correctness.py b/giskard/llm/evaluators/correctness.py new file mode 100644 index 0000000000..c88dad8f9b --- /dev/null +++ b/giskard/llm/evaluators/correctness.py @@ -0,0 +1,135 @@ +from ...core.test_result import TestResultStatus, create_test_result_details +from ...datasets import Dataset +from ...models.base.model import BaseModel +from ..client.base import LLMMessage +from ..errors import LLMGenerationError +from .base import EVALUATE_MODEL_FUNCTIONS, EvaluationResult, LLMBasedEvaluator + +CORRECTNESS_EVALUATION_PROMPT = """Your role is to test AI models. Your task consists in assessing whether a model output correctly answers a question. +You are provided with the ground truth answer to the question. Your task is then to evaluate if the model answer is close to the ground thruth answer. + +You are auditing the following model: + +Model name: {model_name} +Model description: {model_description} + +Here is the question that was asked to the model and its output, followed by the expected ground truth answer: + +QUESTION: +### +{question} +### + +MODEL OUTPUT: +### +{model_output} +### + +GROUND TRUTH: +### +{ground_truth} +### + +Think step by step and consider the model output in its entirety. Remember: you need to have a strong and sound reason to support your evaluation. +Call the `evaluate_model` function with the result of your evaluation. 
+""" + + +class CorrectnessEvaluator(LLMBasedEvaluator): + """Assess the correctness of a model answers given questions and associated reference answers.""" + + _default_eval_prompt = CORRECTNESS_EVALUATION_PROMPT + + def _make_evaluate_functions(self): + return EVALUATE_MODEL_FUNCTIONS + + def _make_evaluate_prompt(self, model_name, model_description, question, model_output, ground_truth): + return self.eval_prompt.format( + model_name=model_name, + model_description=model_description, + question=question, + model_output=model_output, + ground_truth=ground_truth, + ) + + def evaluate( + self, + model: BaseModel, + dataset: Dataset, + question_col: str = "question", + reference_answer_col: str = "reference_answer", + ): + if not (question_col in dataset.df and reference_answer_col in dataset.df): + raise ValueError( + f"Missing required columns in the evaluation dataset. Make sure the dataset has columns {question_col} and {reference_answer_col}." + ) + + if question_col not in model.feature_names: + raise ValueError( + f"Model has no feature '{question_col}'. Make sure your Model wrapper accepts '{question_col}'." + ) + + model_outputs = model.predict(dataset).prediction + + succeeded = [] + failed = [] + errored = [] + status = [] + reasons = [] + for evaluation_question, model_output in zip(dataset.df.to_dict("records"), model_outputs): + try: + passed, reason = self._evaluate_single( + model, + evaluation_question[question_col], + evaluation_question[reference_answer_col], + model_output, + ) + reasons.append(reason) + sample = { + **evaluation_question, + "reason": reason, + "model_output": model_output, + "model_evaluation": passed, + } + if passed: + succeeded.append(sample) + status.append(TestResultStatus.PASSED) + else: + failed.append(sample) + status.append(TestResultStatus.FAILED) + except LLMGenerationError as err: + errored.append({"message": str(err), "sample": {**evaluation_question, "model_output": model_output}}) + reasons.append(str(err)) + status.append(TestResultStatus.ERROR) + + return EvaluationResult( + failure_examples=failed, + success_examples=succeeded, + errors=errored, + details=create_test_result_details(dataset, model, model_outputs, status, {"reason": reasons}), + ) + + def _evaluate_single(self, model: BaseModel, question, reference_answer, model_output): + prompt = self._make_evaluate_prompt( + model.meta.name, + model.meta.description, + question, + model_output, + reference_answer, + ) + + out = self.llm_client.complete( + [LLMMessage(role="system", content=prompt)], + tools=self._make_evaluate_functions(), + tool_choice={"type": "function", "function": {"name": "evaluate_model"}}, + temperature=self.llm_temperature, + caller_id=self.__class__.__name__, + ) + + try: + passed_test = out.tool_calls[0].function.arguments["passed_test"] + reason = out.tool_calls[0].function.arguments.get("reason") + except (AttributeError, KeyError, IndexError): + raise LLMGenerationError("Invalid function call arguments received") + + return passed_test, reason diff --git a/giskard/rag/__init__.py b/giskard/rag/__init__.py new file mode 100644 index 0000000000..fb2ecc8cfb --- /dev/null +++ b/giskard/rag/__init__.py @@ -0,0 +1,4 @@ +from .testset import QATestset +from .testset_generator import DifficultyLevel, TestsetGenerator + +__all__ = ["TestsetGenerator", "QATestset", "DifficultyLevel"] diff --git a/giskard/rag/prompts.py b/giskard/rag/prompts.py new file mode 100644 index 0000000000..66533542b4 --- /dev/null +++ b/giskard/rag/prompts.py @@ -0,0 +1,248 @@ 
+QA_GENERATION_SYSTEM_PROMPT_WITH_DESCRIPTION = """You are a powerful auditor, your role is to generate question & answer pair from a given list of context paragraphs. + +The model you are auditing is the following: +- Model name: {model_name} +- Model description: {model_description} + +Your question must be related to a provided context. +Please respect the following rules to generate the question: +- The answer to the question should be found inside the provided context +- The question must be self-contained +- The question and answer must be in this language: {language} + +The user will provide the context, consisting in multiple paragraphs delimited by dashes "------". +You will return the question and the precise answer to the question based exclusively on the provided context. +You must output a single JSON object with keys 'question' and 'answer'. Make sure you return a valid JSON object.""" + +QA_GENERATION_SYSTEM_PROMPT = """You are a powerful auditor, your role is to generate a question & answer pair from a given list of context paragraphs. + +Your question must be related to a provided context. +Please respect the following rules to generate the question: +- The answer to the question should be found inside the provided context +- The question must be self-contained +- The question and answer must be in this language: {language} + +You will be provided the context, consisting in multiple paragraphs delimited by dashes "------". +You will return the question and the precise answer to the question based exclusively on the provided context. +Your output should be a single JSON object, with keys 'question' and 'answer'. Make sure you return a valid JSON object.""" + +QA_GENERATION_ASSISTANT_EXAMPLE = """{ + "question": "For which countries can I track my shipping?", + "answer": "We ship to all 50 states in the US, as well as to Canada and Mexico. We offer tracking for all our shippings." +}""" + +QA_GENERATION_CONTEXT_EXAMPLE = """What payment methods do you accept? + +We accept a variety of payment methods to provide our customers with a convenient and secure shopping experience. You can make a purchase using major credit and debit cards, including Visa, Mastercard, American Express, and Discover. We also offer the option to pay with popular digital wallets such as PayPal and Google Pay. For added flexibility, you can choose to complete your order using bank transfers or wire transfers. Rest assured that we prioritize the security of your personal information and go the extra mile to ensure your transactions are processed safely. +------ +\tWhat is your shipping policy? + +We offer free shipping on all orders over $50. For orders below $50, we charge a flat rate of $5.99. We offer shipping services to customers residing in all 50\n states of the US, in addition to providing delivery options to Canada and Mexico. +------ +\tHow can I track my order? + +Tracking your order is a breeze! Once your purchase has been successfully confirmed and shipped, you will receive a confirmation email containing your tracking number. You can simply click on the link provided in the email or visit our website's order tracking page. Enter your tracking number, and you will be able to monitor the progress of your shipment in real-time. This way, you can stay updated on the estimated delivery date and ensure you're available to receive your package. +""" + +FIX_JSON_FORMAT_PROMPT = """Fix the following json string so it contains a single valid json. 
Make sure to start and end with curly brackets.""" + + +class QAGenerationPrompt: + system_prompt_with_description = QA_GENERATION_SYSTEM_PROMPT_WITH_DESCRIPTION + system_prompt_raw = QA_GENERATION_SYSTEM_PROMPT + example_prompt = QA_GENERATION_CONTEXT_EXAMPLE + example_answer = QA_GENERATION_ASSISTANT_EXAMPLE + + @classmethod + def _format_system_prompt(cls, model_name, model_description, language): + language = language or "en" + if model_name is not None or model_description is not None: + system_prompt = cls.system_prompt_with_description.format( + model_name=model_name, + model_description=model_description, + language=language, + ) + else: + system_prompt = cls.system_prompt_raw.format( + language=language, + ) + system_message = { + "role": "system", + "content": system_prompt, + } + return system_message + + @classmethod + def _format_example_prompt(cls, examples): + if examples is not None: + return examples + elif cls.example_prompt is not None: + examples = [] + if cls.example_prompt is not None: + examples.append({"role": "user", "content": cls.example_prompt}) + if cls.example_prompt is not None: + examples.append({"role": "assistant", "content": cls.example_answer}) + return examples + return [] + + @classmethod + def format_context(cls, contexts): + return "\n------\n".join(["", *[doc.content for doc in contexts], ""]) + + @classmethod + def create_messages( + cls, + model_name=None, + model_description=None, + language=None, + add_examples=False, + examples=None, + user_content=None, + ): + messages = [cls._format_system_prompt(model_name, model_description, language)] + if add_examples: + messages.extend(cls._format_example_prompt(examples)) + + if user_content is not None: + messages.append({"role": "user", "content": user_content}) + + return messages + + +COMPLEXIFICATION_SYSTEM_PROMPT_WITH_DESCRIPTION = """You are an expert at writing questions. +Your task is to re-write questions that will be used to evaluate the following model: +- Model name: {model_name} +- Model description: {model_description} + +Respect the following rules to reformulate the question: +- The re-written question should not be longer than the original question by up to 10 to 15 words. +- The re-written question should be more elaborated than the original, use elements from the context to enrich the questions. +- The re-written question should be more difficult to handle for AI models but it must be understood and answerable by humans. +- Add one or more constraints / conditions to the question. +- The re-written question must be in this language: {language} + +You will be provided the question delimited by tags. +You will also be provided a relevant context which contain the answer to the question, delimited by tags. It consists in multiple paragraphs delimited by dashes "------". +You will return the reformulated question as a single JSON object, with the key 'question'. Make sure you return a valid JSON object. +""" + +COMPLEXIFICATION_SYSTEM_PROMPT = """You are an expert at writing questions. +Your task is to re-write questions that will be used to evaluate a language model. + +Respect the following rules to reformulate the question: +- The re-written question should not be longer than the original question by up to 10 to 15 words. +- The re-written question should be more elaborated than the original, use elements from the context to enrich the questions. +- The re-written question should be more difficult to handle for AI models but it must be understood and answerable by humans. 
+- Add one or more constraints / conditions to the question. +- The re-written question must be in this language: {language} + +You will be provided the question delimited with tags. +You will also be provided a relevant context which contain the answer to the question, delimited with tags. It consists in multiple paragraphs delimited by dashes "------". +You will return the reformulated question as a single JSON object, with the key 'question'. Make sure you return a valid JSON object. +""" + +COMPLEXIFICATION_PROMPT_EXAMPLE = """ +For which countries can I track my shipping? + + + +What payment methods do you accept? + +\tWe accept a variety of payment methods to provide our customers with a convenient and secure shopping experience. You can make a purchase using major credit and debit cards, including Visa, Mastercard, American Express, and Discover. We also offer the option to pay with popular digital wallets such as PayPal and Google Pay. For added flexibility, you can choose to complete your order using bank transfers or wire transfers. Rest assured that we prioritize the security of your personal information and go the extra mile to ensure your transactions are processed safely. +------ +\tWhat is your shipping policy? + +We offer free shipping on all orders over $50. For orders below $50, we charge a flat rate of $5.99. We offer shipping services to customers residing in all 50\n states of the US, in addition to providing delivery options to Canada and Mexico. +------ +\tHow can I track my order? + +Tracking your order is a breeze! Once your purchase has been successfully confirmed and shipped, you will receive a confirmation email containing your tracking number. You can simply click on the link provided in the email or visit our website's order tracking page. Enter your tracking number, and you will be able to monitor the progress of your shipment in real-time. This way, you can stay updated on the estimated delivery date and ensure you're available to receive your package. + +""" + +COMPLEXIFICATION_ANSWER_EXAMPLE = """{ + "question": "Can you provide my a list of the countries from which I can follow the advancement of the delivery of my shipping?" +}""" + + +class QuestionComplexificationPrompt(QAGenerationPrompt): + system_prompt_with_description = COMPLEXIFICATION_SYSTEM_PROMPT_WITH_DESCRIPTION + system_prompt_raw = COMPLEXIFICATION_SYSTEM_PROMPT + example_prompt = COMPLEXIFICATION_PROMPT_EXAMPLE + example_answer = COMPLEXIFICATION_ANSWER_EXAMPLE + + @classmethod + def format_user_content(cls, question, context): + context_string = f"\n{question}\n\n\n{context}\n" + return context_string + + @classmethod + def create_messages(cls, **kwargs): + kwargs["user_content"] = cls.format_user_content(*kwargs["user_content"]) + return super().create_messages(**kwargs) + + +DISTRACTING_QUESTION_SYSTEM_PROMPT = """You are an expert at rewriting question. +Your task is to re-write questions that will be used to evaluate a language model. + +Your task is to complexify questions given a provided context. +Please respect the following rules to generate the question: +- The new question must include a condition or constraint based on the provided context. +- The new question must have the same answer as the original question. +- The question must be plausible according to the context and the model description. +- The question must be self-contained and understandable by humans. 
+- The question must be in this language: {language} + +You will be provided the question and its answer delimited with and tags. +You will also be provided a context paragraph delimited with tags. +You will return the reformulated question as a single JSON object, with the key 'question'. Make sure you return a valid JSON object. +""" + +DISTRACTING_QUESTION_SYSTEM_PROMPT_WITH_DESCRIPTION = """You are an expert at rewriting questions. +Your task is to re-write questions that will be used to evaluate the following model: +- Model name: {model_name} +- Model description: {model_description} + +Your task is to complexify questions given a provided context. +Please respect the following rules to generate the question: +- The new question must include a condition or constraint based on the provided context. +- The original question direction should be preserved. +- The question must be plausible according to the context and the model description. +- The question must be self-contained and understandable by humans. +- The question must be in this language: {language} + +You will be provided the question delimited with tags. +You will also be provided a context paragraph delimited with tags. +You will return the reformulated question as a single JSON object, with the key 'question'. Make sure you return a valid JSON object. +""" + +DISTRACTING_QUESTION_USER_INPUT = """ +{question} + + +{answer} + + +{context} +""" + +DISCTRACTING_QUESTION_PROMPT_EXAMPLE = DISTRACTING_QUESTION_USER_INPUT.format( + question="What job offer do you have for engineering student?", + answer="We have plenty of different jobs for engineering student depending on your speciality: mechanical engineer, data scientist, electronic designer and many more.", + context="Sometimes employers assume being accessible and inclusive only means providing physical access like ramps, accessible bathrooms and automatic opening doors. However, there are many other important ways to demonstrate that you welcome and want to attract a diverse workforce including people with disability.", +) + +DISCTRACTING_QUESTION_ANSWER_EXAMPLE = """{ + "question": "Do you have any job opening suitable for engineering students with a disability? " +}""" + + +class DistractingQuestionPrompt(QuestionComplexificationPrompt): + system_prompt_with_description = DISTRACTING_QUESTION_SYSTEM_PROMPT_WITH_DESCRIPTION + system_prompt_raw = DISTRACTING_QUESTION_SYSTEM_PROMPT + example_prompt = DISCTRACTING_QUESTION_PROMPT_EXAMPLE + example_answer = DISCTRACTING_QUESTION_ANSWER_EXAMPLE + + @classmethod + def format_user_content(cls, question, answer, context): + return DISTRACTING_QUESTION_USER_INPUT.format(question=question, answer=answer, context=context) diff --git a/giskard/rag/testset.py b/giskard/rag/testset.py new file mode 100644 index 0000000000..c73afee46e --- /dev/null +++ b/giskard/rag/testset.py @@ -0,0 +1,68 @@ +import pandas as pd + +from ..core.suite import Suite +from ..datasets.base import Dataset +from ..testing.tests.llm import test_llm_correctness + + +class QATestset: + """A class to represent a testset for QA models.""" + + def __init__(self, dataframe: pd.DataFrame): + self._dataframe = dataframe + + def __len__(self): + return len(self._dataframe) + + def to_pandas(self): + """Return the testset as a pandas DataFrame.""" + return self._dataframe + + def to_dataset(self): + return Dataset(self._dataframe, name="QA Testset", target=False, validation=False) + + def save(self, path): + """Save the testset as a JSONL file. 
+ + Parameters + ---------- + path : str + The path to the output JSONL file. + """ + self._dataframe.to_json(path, orient="records", lines=True) + + @classmethod + def load(cls, path): + """Load a testset from a JSONL file. + + Parameters + ---------- + path : str + The path to the input JSONL file. + """ + dataframe = pd.read_json(path, orient="records", lines=True) + return cls(dataframe) + + def to_test_suite(self, name=None): + """ + Convert the testset to a Giskard test suite. + + Parameters + ---------- + name : str, optional + The name of the test suite. If not provided, the name will be "Test suite generated from testset". + + Returns + ------- + giskard.Suite + The test suite. + """ + suite_default_params = {"dataset": self.to_dataset()} + name = name or "Test suite generated from testset" + suite = Suite(name=name, default_params=suite_default_params) + suite.add_test(test_llm_correctness, "TestsetCorrectnessTest", "TestsetCorrectnessTest") + return suite + + def copy(self): + """Return a copy of the testset.""" + return QATestset(self._dataframe.copy()) diff --git a/giskard/rag/testset_generator.py b/giskard/rag/testset_generator.py new file mode 100644 index 0000000000..37ca69008b --- /dev/null +++ b/giskard/rag/testset_generator.py @@ -0,0 +1,252 @@ +from typing import Optional, Sequence, Union + +import json +import logging +from enum import Enum + +import numpy as np +import pandas as pd + +from ..llm.client import get_default_client +from ..llm.client.base import LLMClient +from .prompts import ( + FIX_JSON_FORMAT_PROMPT, + DistractingQuestionPrompt, + QAGenerationPrompt, + QuestionComplexificationPrompt, +) +from .testset import QATestset +from .vector_store import VectorStore + +logger = logging.getLogger(__name__) + + +class DifficultyLevel(int, Enum): + EASY = 1 + COMPLEX = 2 + DISTRACTING_ELEMENT = 3 + + +class TestsetGenerator: + """Testset generator for testing RAG models. + + Explore a given knowledge base and generate question/answer pairs to test the model. + + Each generated item contains the following field + - question: a question about a part of the knowledge base + - reference_answer: the expected answer according to the knowledge base + - reference_context: relevant elements directly extracted from the knowledge base + - difficulty_level: an indicator of how difficult the question is + + Parameters + ---------- + knowledge_base: pd.DataFrame + A dataframe containing the whole knowledge base. + knowledge_base_columns: Sequence[str], optional + The list of columns from the `knowledge_base` to consider. If not specified, all columns of the knowledge base + dataframe will be concatenated to produce a single document. + Example: if your knowledge base consists in FAQ data with columns "Q" and "A", we will format each row into a + single document "Q: [question]\\nA: [answer]" to generate questions. + language: str = "en" + The language used to generate questions (e.g. "fr", "de", ...) + model_name: str, optional + Name of the model to be tested, to get more fitting questions. + model_description: str, optional + Description of the model to be tested. 
+ context_neighbors: int + The maximum number of extracted element from the knowledge base to get a relevant context for question generation + context_similarity_threshold: float = 0.2 + A similarity threshold to filter irrelevant element from the knowledge base during context creation + context_window_length: int = 8192 + Context window length of the llm used in the `llm_client` of the generator + embedding_fn: Callable = None + Embedding function to build the knowledge base index. + seed: int = None + """ + + def __init__( + self, + knowledge_base: pd.DataFrame, + knowledge_base_columns: Sequence[str] = None, + language: str = "en", + model_name: str = None, + model_description: str = None, + context_neighbors: int = 4, + context_similarity_threshold: float = 0.2, + context_window_length: int = 8192, + seed: int = None, + include_examples: bool = True, + embedding_model: str = "text-embedding-ada-002", + llm_client: Optional[LLMClient] = None, + llm_temperature: float = 0.5, + ): + self._knowledge_base = knowledge_base + self._knowledge_base_columns = knowledge_base_columns + self._language = language + self._model_name = model_name + self._model_description = model_description + self._context_neighbors = context_neighbors + self._context_similarity_threshold = context_similarity_threshold + self._embedding_model = embedding_model + self._context_window_length = context_window_length + self._rng = np.random.default_rng(seed=seed) + self._include_examples = include_examples + self._vector_store_inst = None + self._llm_client = llm_client or get_default_client() + self._llm_temperature = llm_temperature + + @property + def _vector_store(self): + if self._vector_store_inst is None: + logger.debug("Initializing vector store from knowledge base.") + self._vector_store_inst = VectorStore.from_df( + self._knowledge_base, + lambda query: self._llm_client.embeddings(query, model=self._embedding_model), + features=self._knowledge_base_columns, + ) + return self._vector_store_inst + + def _get_generator_method(self, level: DifficultyLevel): + mapping = { + DifficultyLevel.EASY: self._generate_question_easy, + DifficultyLevel.COMPLEX: self._generate_question_complex, + DifficultyLevel.DISTRACTING_ELEMENT: self._generate_question_distracting_element, + } + + try: + return mapping[level] + except KeyError: + raise ValueError(f"Invalid difficulty level: {level}.") + + def _generate_question_easy(self, context: str) -> dict: + messages = QAGenerationPrompt.create_messages( + model_name=self._model_name, + model_description=self._model_description, + language=self._language, + user_content=context, + ) + + generated_qa = self._llm_complete(messages=messages) + generated_qa["difficulty"] = DifficultyLevel.EASY + return generated_qa + + def _generate_question_complex(self, context: str) -> dict: + generated_qa = self._generate_question_easy(context) + + messages = QuestionComplexificationPrompt.create_messages( + model_name=self._model_name, + model_description=self._model_description, + language=self._language, + user_content=(generated_qa["question"], context), + ) + generated_qa["difficulty"] = DifficultyLevel.COMPLEX + out = self._llm_complete(messages=messages) + generated_qa["question"] = out["question"] + return generated_qa + + def _generate_question_distracting_element(self, context: str) -> dict: + generated_qa = self._generate_question_easy(context) + + distracting_context = self._rng.choice(self._vector_store.documents).content + messages = DistractingQuestionPrompt.create_messages( + 
model_name=self._model_name, + model_description=self._model_description, + language=self._language, + user_content=(generated_qa["question"], generated_qa["answer"], distracting_context), + ) + generated_qa["difficulty"] = DifficultyLevel.DISTRACTING_ELEMENT + out = self._llm_complete(messages=messages) + generated_qa["question"] = out["question"] + return generated_qa + + def _get_random_document_group(self): + seed_embedding = self._rng.choice(self._vector_store.embeddings) + relevant_contexts = [ + context + for (context, score) in self._vector_store.vector_similarity_search_with_score( + seed_embedding, k=self._context_neighbors + ) + if score < self._context_similarity_threshold + ] + + return relevant_contexts + + def _prevent_context_window_overflow(self, prompt: str): + # Prevent context overflow + # general rule of thumbs to count tokens: 1 token ~ 4 characters + # https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them + return prompt[: self._context_window_length * 4] + + def _llm_complete(self, messages: Sequence[dict]) -> dict: + try: + out = self._llm_client.complete( + messages=messages, + temperature=self._llm_temperature, + caller_id=self.__class__.__name__, + ) + + return json.loads(out.content, strict=False) + except json.decoder.JSONDecodeError: + logger.warning("JSON decoding error, trying to fix the JSON string.") + return self._try_fix_json_message(out.content) + + def _try_fix_json_message(self, incorrect_json: str): + out = self._llm_client.complete( + messages=[ + {"role": "system", "content": FIX_JSON_FORMAT_PROMPT}, + {"role": "user", "content": incorrect_json}, + ], + temperature=0, + caller_id=self.__class__.__name__, + ) + return json.loads(out.content) + + def generate_testset( + self, + num_questions: int = 10, + difficulty: Union[DifficultyLevel, Sequence[DifficultyLevel]] = DifficultyLevel.EASY, + ) -> QATestset: + """Generates a testset from the knowledge base. + + Parameters + ---------- + num_questions : int + The number of question to generate for each difficulty level. By default 10. + difficulty : Union[DifficultyLevel, Sequence[DifficultyLevel]] + The difficulty level of the questions to generate. Can be 1 (:attr:`DifficultyLevel.EASY`), 2 (:attr:`DifficultyLevel.COMPLEX`), + 3 (:attr:`DifficultyLevel.DISTRACTING_ELEMENT`) or a list of these values. By default will use the easy level. + + Returns + ------- + QATestset + The generated test set. + + """ + if not isinstance(difficulty, Sequence): + difficulty = [difficulty] + + generated_questions = [] + for level in difficulty: + for idx in range(num_questions): + logger.info(f"Generating question {idx + 1}/{num_questions} for difficulty level {str(level)}.") + context_docs = self._get_random_document_group() + context = QAGenerationPrompt.format_context(context_docs) + + generation_fn = self._get_generator_method(level) + + try: + generated_qa = generation_fn(context) + except Exception as e: + logger.error(f"Encountered error in question generation: {e}. 
Skipping.") + continue + + generated_questions.append( + { + "question": generated_qa["question"], + "reference_answer": generated_qa["answer"], + "reference_context": context, + "difficulty_level": generated_qa["difficulty"], + } + ) + + return QATestset(pd.DataFrame(generated_questions)) diff --git a/giskard/rag/vector_store.py b/giskard/rag/vector_store.py new file mode 100644 index 0000000000..f291414637 --- /dev/null +++ b/giskard/rag/vector_store.py @@ -0,0 +1,64 @@ +from typing import Callable, Optional, Sequence + +import numpy as np +import pandas as pd + +from ..core.errors import GiskardInstallationError + + +class Document: + """A class to wrap the elements of the knowledge base into a unified format.""" + + def __init__(self, document: dict, features: Optional[Sequence] = None): + features = features if features is not None else list(document.keys()) + + if len(features) == 1: + self.content = document[features[0]] + else: + self.content = "\n".join(f"{feat}: {document[feat]}" for feat in features) + + self.metadata = document + + +class VectorStore: + """Stores all embedded Document of the knowledge base. + Relies on `FlatIndexL2` class from FAISS. + """ + + def __init__(self, documents: Sequence[Document], embeddings: np.array, embedding_fn: Callable): + if len(embeddings) == 0 or len(documents) == 0: + raise ValueError("Documents and embeddings must contains at least one element.") + if len(embeddings) != len(documents): + raise ValueError("Documents and embeddings must have the same length.") + + try: + from faiss import IndexFlatL2 + except ImportError as err: + raise GiskardInstallationError(flavor="llm") from err + + self.embeddings = embeddings + self.documents = documents + self.embedding_fn = embedding_fn + + self.dimension = self.embeddings[0].shape[0] + self.index = IndexFlatL2(self.dimension) + self.index.add(self.embeddings) + + @classmethod + def from_df(cls, df: pd.DataFrame, embedding_fn: Callable, features: Sequence[str] = None): + if len(df) > 0: + documents = [Document(knowledge_chunk, features=features) for knowledge_chunk in df.to_dict("records")] + raw_texts = [d.content for d in documents] + embeddings = embedding_fn(raw_texts).astype("float32") + return cls(documents, embeddings, embedding_fn) + else: + raise ValueError("Cannot generate a vector store from empty DataFrame.") + + def similarity_search_with_score(self, query: Sequence[str], k: int) -> Sequence: + query_emb = self.embedding_fn(query).astype("float32") + return self.vector_similarity_search_with_score(query_emb, k) + + def vector_similarity_search_with_score(self, query_emb: np.ndarray, k: int) -> Sequence: + query_emb = np.atleast_2d(query_emb) + distances, indices = self.index.search(query_emb, k) + return [(self.documents[i], d) for d, i in zip(distances[0], indices[0])] diff --git a/giskard/testing/tests/llm/__init__.py b/giskard/testing/tests/llm/__init__.py index 232de77312..ada1c714ce 100644 --- a/giskard/testing/tests/llm/__init__.py +++ b/giskard/testing/tests/llm/__init__.py @@ -1,3 +1,4 @@ +from .correctness import test_llm_correctness from .ground_truth import ( test_llm_as_a_judge_ground_truth_similarity, test_llm_ground_truth, @@ -28,5 +29,6 @@ "test_llm_output_against_strings", "test_llm_ground_truth_similarity", "test_llm_ground_truth", + "test_llm_correctness", "test_llm_as_a_judge_ground_truth_similarity", ] diff --git a/giskard/testing/tests/llm/correctness.py b/giskard/testing/tests/llm/correctness.py new file mode 100644 index 0000000000..6b9b558136 --- /dev/null +++ 
b/giskard/testing/tests/llm/correctness.py @@ -0,0 +1,55 @@ +from ....core.test_result import TestResult, TestResultStatus +from ....datasets.base import Dataset +from ....llm.evaluators import CorrectnessEvaluator +from ....models.base import BaseModel +from ....registry.decorators import test +from .. import debug_description_prefix + + +@test( + name="LLM Correctness from knowledge base", + tags=["llm", "llm-as-a-judge"], + debug_description=debug_description_prefix + "that are failing the evaluation criteria.", +) +def test_llm_correctness(model: BaseModel, dataset: Dataset, threshold: float = 0.5): + """Tests if LLM answers are correct with respect to a known reference answers. + + The test is passed when the ratio of correct answers is higher than the + threshold. + + Parameters + ---------- + model : BaseModel + Model used to compute the test + dataset : Dataset + Dataset used to compute the test + threshold : float + The threshold value for the ratio of invariant rows. + + Returns + ------- + TestResult + A TestResult object containing the test result. + """ + correctness_evaluator = CorrectnessEvaluator() + eval_result = correctness_evaluator.evaluate(model, dataset) + output_ds = list() + if not eval_result.passed: + failed_indices = [ + idx + for idx, status in zip(dataset.df.index, eval_result.details.results) + if status == TestResultStatus.FAILED + ] + + output_ds.append(dataset.slice(lambda df: df.loc[failed_indices], row_level=False)) + + passed = bool(eval_result.passed_ratio > threshold) + + return TestResult( + passed=passed, + metric=eval_result.passed_ratio, + metric_name="Failing examples ratio", + is_error=eval_result.has_errors, + details=eval_result.details, + output_ds=output_ds, + ) diff --git a/pdm.lock b/pdm.lock index 78ac87534b..c0c541e640 100644 --- a/pdm.lock +++ b/pdm.lock @@ -5,7 +5,7 @@ groups = ["default", "dev", "doc", "hub", "llm", "ml_runtime", "test"] strategy = ["cross_platform", "inherit_metadata"] lock_version = "4.4.1" -content_hash = "sha256:1028b19bca6bf198cf4ed8e9a99dcdb80ae4f23fe8dd67402cda270d57acb5d2" +content_hash = "sha256:8f100fd058c86295057d05712917e343673fe0b8bd8db5afa76e4fdecbcdd988" [[package]] name = "absl-py" @@ -1207,6 +1207,30 @@ files = [ {file = "executing-2.0.1.tar.gz", hash = "sha256:35afe2ce3affba8ee97f2d69927fa823b08b472b7b994e36a52a964b93d16147"}, ] +[[package]] +name = "faiss-cpu" +version = "1.7.4" +summary = "A library for efficient similarity search and clustering of dense vectors." 
+groups = ["llm"] +files = [ + {file = "faiss-cpu-1.7.4.tar.gz", hash = "sha256:265dc31b0c079bf4433303bf6010f73922490adff9188b915e2d3f5e9c82dd0a"}, + {file = "faiss_cpu-1.7.4-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:50d4ebe7f1869483751c558558504f818980292a9b55be36f9a1ee1009d9a686"}, + {file = "faiss_cpu-1.7.4-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:7b1db7fae7bd8312aeedd0c41536bcd19a6e297229e1dce526bde3a73ab8c0b5"}, + {file = "faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:17b7fa7194a228a84929d9e6619d0e7dbf00cc0f717e3462253766f5e3d07de8"}, + {file = "faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:dca531952a2e3eac56f479ff22951af4715ee44788a3fe991d208d766d3f95f3"}, + {file = "faiss_cpu-1.7.4-cp310-cp310-win_amd64.whl", hash = "sha256:7173081d605e74766f950f2e3d6568a6f00c53f32fd9318063e96728c6c62821"}, + {file = "faiss_cpu-1.7.4-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:d0bbd6f55d7940cc0692f79e32a58c66106c3c950cee2341b05722de9da23ea3"}, + {file = "faiss_cpu-1.7.4-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:e13c14280376100f143767d0efe47dcb32618f69e62bbd3ea5cd38c2e1755926"}, + {file = "faiss_cpu-1.7.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c521cb8462f3b00c0c7dfb11caff492bb67816528b947be28a3b76373952c41d"}, + {file = "faiss_cpu-1.7.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:afdd9fe1141117fed85961fd36ee627c83fc3b9fd47bafb52d3c849cc2f088b7"}, + {file = "faiss_cpu-1.7.4-cp311-cp311-win_amd64.whl", hash = "sha256:2ff7f57889ea31d945e3b87275be3cad5d55b6261a4e3f51c7aba304d76b81fb"}, + {file = "faiss_cpu-1.7.4-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:559a0133f5ed44422acb09ee1ac0acffd90c6666d1bc0d671c18f6e93ad603e2"}, + {file = "faiss_cpu-1.7.4-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:ea1d71539fe3dc0f1bed41ef954ca701678776f231046bf0ca22ccea5cf5bef6"}, + {file = "faiss_cpu-1.7.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:12d45e0157024eb3249842163162983a1ac8b458f1a8b17bbf86f01be4585a99"}, + {file = "faiss_cpu-1.7.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2f0eab359e066d32c874f51a7d4bf6440edeec068b7fe47e6d803c73605a8b4c"}, + {file = "faiss_cpu-1.7.4-cp39-cp39-win_amd64.whl", hash = "sha256:98459ceeeb735b9df1a5b94572106ffe0a6ce740eb7e4626715dd218657bb4dc"}, +] + [[package]] name = "faker" version = "23.1.0" @@ -5686,7 +5710,7 @@ name = "tenacity" version = "8.2.3" requires_python = ">=3.7" summary = "Retry code until it succeeds" -groups = ["hub", "ml_runtime"] +groups = ["hub", "llm", "ml_runtime"] files = [ {file = "tenacity-8.2.3-py3-none-any.whl", hash = "sha256:ce510e327a630c9e1beaf17d42e6ffacc88185044ad85cf74c0a8887c6a0f88c"}, {file = "tenacity-8.2.3.tar.gz", hash = "sha256:5398ef0d78e63f40007c1fb4c0bff96e1911394d2fa8d194f77619c05ff6cc8a"}, diff --git a/pyproject.toml b/pyproject.toml index f137365663..8117da2130 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -178,7 +178,13 @@ dependencies = [ ] [project.optional-dependencies] -llm = ["openai", "evaluate>=0.4.1", "bert-score>=0.3.13"] +llm = [ + "openai", + "evaluate>=0.4.1", + "bert-score>=0.3.13", + "tenacity>=4.11.0", + "faiss-cpu>=1.7.4", +] hub = [ # Worker deps diff --git a/tests/llm/evaluators/test_correctness_evaluator.py b/tests/llm/evaluators/test_correctness_evaluator.py new file mode 100644 index 0000000000..1269211146 --- /dev/null +++ 
b/tests/llm/evaluators/test_correctness_evaluator.py @@ -0,0 +1,167 @@ +from unittest.mock import Mock + +import pandas as pd +import pytest + +from giskard.datasets.base import Dataset +from giskard.llm.client import LLMFunctionCall, LLMMessage +from giskard.llm.client.base import LLMToolCall +from giskard.llm.evaluators.correctness import CorrectnessEvaluator +from giskard.models.base.model_prediction import ModelPredictionResults + + +def _make_eval_dataset(): + ds = Dataset( + pd.DataFrame( + { + "question": ["What is the capital of France?", "What is the capital of Italy?"], + "reference_answer": ["Paris is the capital of France", "Rome is the capital of Italy"], + "reference_context": [ + "France is a unitary semi-presidential republic with its capital in Paris, the country's largest city and main cultural and commercial centre.", + "Italy covers an area of 301,340 km2 is the third-most populous member state of the European Union. Its capital and largest city is Rome.", + ], + "difficulty": [0, 1], + "answerable": [True, True], + } + ) + ) + return ds + + +def _make_mock_model(feature_names=None): + model = Mock() + model.predict.return_value = ModelPredictionResults( + prediction=["The capital of France is Paris", "The capital of Italy is Paris"] + ) + model.feature_names = feature_names if feature_names else ["question", "reference_answer", "reference_context"] + model.name = "Mock model for test" + model.description = "This is a model for testing purposes" + return model + + +def test_correctness_evaluator_correctly_flags_examples(): + dataset = _make_eval_dataset() + model = _make_mock_model() + + client = Mock() + client.complete.side_effect = [ + LLMMessage( + role="assistant", + tool_calls=[ + LLMToolCall( + id="1", + type="function", + function=LLMFunctionCall( + name="evaluate_model", + arguments={"passed_test": True, "reason": ""}, + ), + ) + ], + ), + LLMMessage( + role="assistant", + tool_calls=[ + LLMToolCall( + id="2", + type="function", + function=LLMFunctionCall( + name="evaluate_model", + arguments={ + "passed_test": False, + "reason": "The model output does not agree with the ground truth: Rome is the capital of Italy", + }, + ), + ) + ], + ), + ] + + evaluator = CorrectnessEvaluator(llm_client=client) + + result = evaluator.evaluate(model, dataset) + + assert len(result.success_examples) == 1 + assert len(result.failure_examples) == 1 + + assert ( + result.failure_examples[0]["reason"] + == "The model output does not agree with the ground truth: Rome is the capital of Italy" + ) + assert result.failure_examples[0]["question"] == "What is the capital of Italy?" + assert result.failure_examples[0]["reference_answer"] == "Rome is the capital of Italy" + assert ( + result.failure_examples[0]["reference_context"] + == "Italy covers an area of 301,340 km2 is the third-most populous member state of the European Union. Its capital and largest city is Rome." 
+ ) + assert result.failure_examples[0]["model_output"] == "The capital of Italy is Paris" + assert not result.failure_examples[0]["model_evaluation"] + + # Check LLM client calls arguments + args = client.complete.call_args_list[0] + assert "Your role is to test AI models" in args[0][0][0].content + assert args[1]["tools"][0]["function"]["name"] == "evaluate_model" + + +def test_correctness_evaluator_handles_generation_errors(): + dataset = _make_eval_dataset() + model = _make_mock_model() + + client = Mock() + client.complete.side_effect = [ + LLMMessage( + role="assistant", + tool_calls=[ + LLMToolCall( + id="1", + type="function", + function=LLMFunctionCall(name="evaluate_model", arguments={"passed_test": True, "reason": ""}), + ) + ], + ), + LLMMessage( + role="assistant", + tool_calls=[ + LLMToolCall( + id="2", + type="function", + function=LLMFunctionCall( + name="evaluate_model", + arguments={ + "pass": False, + "reason": "The model output does not agree with the ground truth: Rome is the capital of Italy", + }, + ), + ) + ], + ), + ] + + evaluator = CorrectnessEvaluator(llm_client=client) + + result = evaluator.evaluate(model, dataset) + + assert len(result.success_examples) == 1 + assert len(result.errors) == 1 + + assert result.errors[0]["message"] == "Invalid function call arguments received" + + +def test_raises_error_if_missing_column_in_dataset(): + dataset = _make_eval_dataset() + dataset.df = dataset.df.drop("question", axis=1) + + model = _make_mock_model() + + evaluator = CorrectnessEvaluator(llm_client=Mock()) + with pytest.raises(ValueError, match="Missing required columns in the evaluation dataset."): + evaluator.evaluate(model, dataset) + + +def test_raises_error_if_missing_feature_in_model(): + dataset = _make_eval_dataset() + + model = _make_mock_model(feature_names=["reference_answer"]) + + evaluator = CorrectnessEvaluator(llm_client=Mock()) + with pytest.raises(ValueError, match="Model has no feature 'question'"): + evaluator.evaluate(model, dataset) diff --git a/tests/rag/test_document_creation.py b/tests/rag/test_document_creation.py new file mode 100644 index 0000000000..b45d3b832b --- /dev/null +++ b/tests/rag/test_document_creation.py @@ -0,0 +1,59 @@ +import pytest + +from giskard.rag.vector_store import Document + + +def test_single_feature_document_creation(): + doc = Document({"feature": "This a test value for a feature"}) + + assert doc.content == "This a test value for a feature" + assert doc.metadata == {"feature": "This a test value for a feature"} + + +def test_multiple_features_document_creation(): + doc = Document( + { + "feat1": "This a test value for a feature 1", + "feat2": "This a test value for a feature 2", + "feat3": "This a test value for a feature 3", + } + ) + assert ( + doc.content + == "feat1: This a test value for a feature 1\nfeat2: This a test value for a feature 2\nfeat3: This a test value for a feature 3" + ) + assert doc.metadata == { + "feat1": "This a test value for a feature 1", + "feat2": "This a test value for a feature 2", + "feat3": "This a test value for a feature 3", + } + + doc = Document( + { + "feat1": "This a test value for a feature 1", + "feat2": "This a test value for a feature 2", + "feat3": "This a test value for a feature 3", + }, + features=["feat1"], + ) + assert doc.content == "This a test value for a feature 1" + + doc = Document( + { + "feat1": "This a test value for a feature 1", + "feat2": "This a test value for a feature 2", + "feat3": "This a test value for a feature 3", + }, + features=["feat1", 
"feat2"], + ) + assert doc.content == "feat1: This a test value for a feature 1\nfeat2: This a test value for a feature 2" + + with pytest.raises(KeyError): + doc = Document( + { + "feat1": "This a test value for a feature 1", + "feat2": "This a test value for a feature 2", + "feat3": "This a test value for a feature 3", + }, + features=["feat4"], + ) diff --git a/tests/rag/test_testset_generator.py b/tests/rag/test_testset_generator.py new file mode 100644 index 0000000000..d2fc7cd828 --- /dev/null +++ b/tests/rag/test_testset_generator.py @@ -0,0 +1,91 @@ +from unittest.mock import Mock + +import numpy as np +import pandas as pd + +from giskard.llm.client import LLMMessage +from giskard.rag import TestsetGenerator + + +def make_knowledge_base_df(): + knowledge_base_df = pd.DataFrame( + [ + {"context": "Camembert is a moist, soft, creamy, surface-ripened cow's milk cheese."}, + { + "context": "Bleu d'Auvergne is a French blue cheese, named for its place of origin in the Auvergne region." + }, + {"context": "Scamorza is a Southern Italian cow's milk cheese."}, + { + "context": "Freeriding is a style of snowboarding or skiing performed on natural, un-groomed terrain, without a set course, goals or rules." + }, + ] + ) + return knowledge_base_df + + +CONTEXT_STRING = """ +------ +Scamorza is a Southern Italian cow's milk cheese. +------ +Bleu d'Auvergne is a French blue cheese, named for its place of origin in the Auvergne region. +------ +Freeriding is a style of snowboarding or skiing performed on natural, un-groomed terrain, without a set course, goals or rules. +------ +""" + + +def test_testset_generation(): + llm_client = Mock() + llm_client.complete.side_effect = ( + [ + LLMMessage( + role="assistant", + content="""{"question": "Where is Camembert from?", +"answer": "Camembert was created in Normandy, in the northwest of France."}""", + ) + ] + * 2 + ) + + embedding_dimension = 8 + + llm_client.embeddings = Mock() + # evenly spaced embeddings for the knowledge base elements and specifically chosen embeddings for + # each mock embedding calls. + kb_embeddings = np.ones((4, embedding_dimension)) * np.arange(4)[:, None] / 100 + query_embeddings = np.ones((2, embedding_dimension)) * np.array([0.02, 10])[:, None] + + llm_client.embeddings.side_effect = [kb_embeddings] + + knowledge_base_df = make_knowledge_base_df() + testset_generator = TestsetGenerator( + knowledge_base_df, + model_name="Test model", + model_description="This is a model for testing purpose.", + llm_client=llm_client, + context_neighbors=3, + ) + testset_generator._rng = Mock() + testset_generator._rng.choice = Mock() + testset_generator._rng.choice.side_effect = list(query_embeddings) + + assert testset_generator._vector_store.index.d == 8 + assert testset_generator._vector_store.embeddings.shape == (4, 8) + assert len(testset_generator._vector_store.documents) == 4 + assert testset_generator._vector_store.documents[2].content.startswith( + "Scamorza is a Southern Italian cow's milk cheese." + ) + + test_set = testset_generator.generate_testset(num_questions=2) + + assert len(test_set) == 2 + + df = test_set.to_pandas() + + assert df.loc[0, "question"] == "Where is Camembert from?" + assert df.loc[0, "reference_answer"] == "Camembert was created in Normandy, in the northwest of France." + assert df.loc[0, "reference_context"] == CONTEXT_STRING + assert df.loc[0, "difficulty_level"] == 1 + + assert df.loc[1, "question"] == "Where is Camembert from?" 
+ assert df.loc[1, "reference_context"] == "\n------\n" diff --git a/tests/rag/test_testset_suite_conversion.py b/tests/rag/test_testset_suite_conversion.py new file mode 100644 index 0000000000..1fddafa80e --- /dev/null +++ b/tests/rag/test_testset_suite_conversion.py @@ -0,0 +1,35 @@ +import pandas as pd + +from giskard.rag import QATestset + + +def make_testset_df(): + return pd.DataFrame( + [ + { + "question": "Which milk is used to make Camembert?", + "reference_answer": "Cow's milk is used to make Camembert.", + "reference_context": "Camembert is a moist, soft, creamy, surface-ripened cow's milk cheese.", + }, + { + "question": "Where is Scarmorza from?", + "reference_answer": "Scarmorza is from Southern Italy.", + "reference_context": "Scamorza is a Southern Italian cow's milk cheese.", + }, + ] + ) + + +def test_testset_suite_conversion(): + testset = QATestset(make_testset_df()) + suite = testset.to_test_suite() + + assert "dataset" in suite.default_params + assert suite.default_params["dataset"].df.loc[0, "question"] == "Which milk is used to make Camembert?" + assert ( + suite.default_params["dataset"].df.loc[1, "reference_context"] + == "Scamorza is a Southern Italian cow's milk cheese." + ) + + assert len(suite.tests) == 1 + assert suite.tests[0].display_name == "TestsetCorrectnessTest" diff --git a/tests/rag/test_vector_store.py b/tests/rag/test_vector_store.py new file mode 100644 index 0000000000..46ca86fa91 --- /dev/null +++ b/tests/rag/test_vector_store.py @@ -0,0 +1,68 @@ +from unittest.mock import Mock + +import numpy as np +import pandas as pd +import pytest + +from giskard.rag.vector_store import Document, VectorStore + + +def test_vector_store_creation(): + dimension = 8 + embeddings = np.repeat(np.arange(5)[:, None], 8, axis=1) + documents = [Document({"feature": "This is a test string"})] * 5 + + embedding_fn = Mock() + + store = VectorStore(documents, embeddings, embedding_fn) + assert store.embeddings.shape == (5, 8) + assert len(store.documents) == 5 + assert store.index.d == dimension + assert store.index.ntotal == 5 + + with pytest.raises(ValueError, match="Documents and embeddings must have the same length."): + store = VectorStore(documents, np.repeat(np.arange(4)[:, None], 8, axis=1), embedding_fn) + + with pytest.raises(ValueError, match="Documents and embeddings must contains at least one element."): + store = VectorStore(documents, [], embedding_fn) + + with pytest.raises(ValueError, match="Documents and embeddings must contains at least one element."): + store = VectorStore([], [], embedding_fn) + + +def test_vector_store_creation_from_df(): + dimension = 8 + df = pd.DataFrame(["This is a test string"] * 5) + + embedding_fn = Mock() + random_embedding = np.random.rand(5, dimension) + embedding_fn.side_effect = [random_embedding] + + store = VectorStore.from_df(df, embedding_fn) + assert store.index.d == dimension + assert store.embeddings.shape == (5, 8) + assert len(store.documents) == 5 + assert store.index.ntotal == 5 + + assert np.allclose(store.embeddings, random_embedding) + + +def test_vector_store_similarity_search_with_score(): + dimension = 8 + embeddings = np.repeat(np.arange(100)[:, None], 8, axis=1) + documents = [Document({"feature": f"This is test string {idx + 1}"}) for idx in range(100)] + + embedding_fn = Mock() + embedding_fn.side_effect = [np.ones((1, dimension)) * 49] + + store = VectorStore(documents, embeddings, embedding_fn) + + query = ["This is test string 50"] + retrieved_elements = store.similarity_search_with_score(query, 
k=3) + assert len(retrieved_elements) == 3 + assert retrieved_elements[0][0].content == "This is test string 50" + assert retrieved_elements[0][1] == 0.0 + assert retrieved_elements[1][0].content == "This is test string 49" + assert retrieved_elements[1][1] == 8.0 + assert retrieved_elements[2][0].content == "This is test string 51" + assert retrieved_elements[2][1] == 8.0
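
To close, a minimal usage sketch of how the pieces added above fit together, based only on the APIs exercised by the new tests (`TestsetGenerator(...)`, `generate_testset`, `to_pandas`, and `QATestset.to_test_suite`). It assumes the `giskard[llm]` setup with an OpenAI key already configured, that `generate_testset` returns a `QATestset` exposing `to_test_suite()` (as in `tests/rag/test_testset_suite_conversion.py`), and that a wrapped `giskard_model` exists; the knowledge base rows and model metadata are illustrative placeholders, not part of the changeset.

```python
# Sketch only: knowledge base contents, model metadata and `giskard_model` are
# placeholders; the generator is assumed to fall back to the globally configured
# LLM client and embedding model when `llm_client` is not passed explicitly.
import pandas as pd

from giskard.rag import TestsetGenerator

# One row per chunk of reference text the RAG model retrieves from.
knowledge_base_df = pd.DataFrame(
    {
        "context": [
            "Camembert is a moist, soft, creamy, surface-ripened cow's milk cheese.",
            "Scamorza is a Southern Italian cow's milk cheese.",
        ]
    }
)

generator = TestsetGenerator(
    knowledge_base_df,
    model_name="Cheese Assistant",  # metadata injected into the generation prompts
    model_description="Answers questions about cheeses from a small knowledge base.",
)

# Generate question / reference_answer / reference_context triples.
testset = generator.generate_testset(num_questions=10)
print(testset.to_pandas()[["question", "reference_answer", "difficulty_level"]])

# Convert the generated Q&A pairs into a runnable suite built on the new
# correctness test, then evaluate a wrapped model against it.
suite = testset.to_test_suite()
# suite.run(model=giskard_model)  # `giskard_model` would be a model wrapped with giskard.Model
```

This is the path `tests/rag/test_testset_suite_conversion.py` exercises: the generated dataset becomes the suite's default `dataset` parameter, so only the model has to be supplied at run time.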