[GSK-2540, GSK-2558, GSK-2663] RAG toolset #1735

Merged · 107 commits · Feb 13, 2024

Commits
958b9d2
Remove unused arguments in _make_evaluate_function
pierlj Jan 15, 2024
b7cae6a
Add correctness evaluator
pierlj Jan 15, 2024
43cea8d
Add tests for CorrectnessEvaluator
pierlj Jan 15, 2024
5291c67
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Jan 16, 2024
bb8902b
Minor fix following Matteo's comments
pierlj Jan 16, 2024
b140cca
Add basic vector store and embedding model
pierlj Jan 17, 2024
caaef72
Add testset generator
pierlj Jan 17, 2024
c453b32
Add unit tests for the rag module
pierlj Jan 17, 2024
37eaf8a
Add faiss-cpu as a dependency
pierlj Jan 17, 2024
9b91a9a
Fix imports of rag module in tests
pierlj Jan 17, 2024
fb8a314
Add import in __init__ file
pierlj Jan 17, 2024
81ba5ba
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Jan 17, 2024
3c3a14e
Minor cleaning
pierlj Jan 17, 2024
6a1da20
Add batch embedding to get much faster KB creation
pierlj Jan 17, 2024
baa17ce
Update question prompt to enfore generation of only one question
pierlj Jan 17, 2024
1991fa2
Update tests
pierlj Jan 17, 2024
ed6e608
Add a flag to handle control characters inside LLM response decoding
pierlj Jan 11, 2024
facadd5
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Jan 17, 2024
061bcef
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Jan 18, 2024
41631a1
Fix string in __all__
pierlj Jan 18, 2024
61eaa61
Add LLM correctness test
pierlj Jan 18, 2024
948a2bd
Add Testset dataset wrapper to build test suite
pierlj Jan 18, 2024
56758cb
Fix validation of feature name between testset, model and evaluator
pierlj Jan 18, 2024
b811dbd
Add threshold and failure examples in correctness test output
pierlj Jan 19, 2024
b579e90
Add testset to test suite convertion test + minor fix to generator test
pierlj Jan 19, 2024
43395a4
Add failed indices inside evaluator's outputs
pierlj Jan 19, 2024
071307a
Add some documentation
pierlj Jan 19, 2024
e2c0c65
Move faiss dependency inside llm module
pierlj Jan 22, 2024
2d00ca6
Fix circular import and minor typing issue
pierlj Jan 22, 2024
68c7a68
Add safe import of faiss and openai modules
pierlj Jan 22, 2024
39c7ea5
Fix broken test
pierlj Jan 22, 2024
b8496c7
Merge question and answer generation prompt and separate system instr…
pierlj Jan 23, 2024
ce70205
Update tests for testset generator
pierlj Jan 23, 2024
c196c04
Change the prompt to fix the number of output of the model
pierlj Jan 23, 2024
543d6fd
Improve handling of JSONDecoderErrors
pierlj Jan 23, 2024
6583b44
Minor refactor
pierlj Jan 23, 2024
42e7c46
Remove unnecessary uvloop dependency
pierlj Jan 23, 2024
62c73db
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
pierlj Jan 23, 2024
b91bee8
Enforce JSON format via prompt
mattbit Jan 25, 2024
2c41675
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Jan 25, 2024
acade50
Remove unwanted code
mattbit Jan 25, 2024
7eebc73
Merge branch 'gsk-2540-add-testset-generation-evaluator' of github.co…
mattbit Jan 25, 2024
da384f6
Make the language selection work
mattbit Jan 25, 2024
e3071ca
Add documentation for RAG toolset
pierlj Jan 22, 2024
a98c749
Update documentation and add API reference for RAG toolset
pierlj Jan 24, 2024
8c0f560
Add embeddings function inside base LLM client
pierlj Jan 26, 2024
a7132bc
Remove embedding model and replace it with llm client embeddings
pierlj Jan 26, 2024
a4f8509
Update tests for rag module
pierlj Jan 26, 2024
91d4a29
Make model name and description optional
pierlj Jan 26, 2024
3951f50
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Jan 26, 2024
08669e1
Update documentation
pierlj Jan 26, 2024
4e5957e
Improve handling embedding generation in llm client
pierlj Jan 26, 2024
6313ced
Reorder RAG in the doc
pierlj Jan 29, 2024
2a6cef9
Update docs with Rabah's comments
pierlj Jan 30, 2024
4893601
Remove model name and description from prompt if both are not specified
pierlj Jan 30, 2024
f255c56
Minor change + add some logging
pierlj Jan 30, 2024
6d0f111
Compute embedding in chunks to respect OpenAI API limits
pierlj Jan 30, 2024
0ec5ab3
Remove failed_indices from Correctness evaluator and add TestResultDe…
pierlj Jan 30, 2024
f7f557a
Add prompt template for easier prompt formatting
pierlj Jan 31, 2024
70f0ed4
Add difficulty level in question generation
pierlj Jan 31, 2024
e64b0f2
Add missing llm dependency
pierlj Jan 31, 2024
9acee2a
Update RAG docs
pierlj Jan 31, 2024
ca05d21
Add minor fixes from last test session with Rabah
pierlj Jan 31, 2024
f14ce82
Add missing llm dependency
pierlj Jan 31, 2024
0090416
Update RAG docs
pierlj Jan 31, 2024
a756cea
Add minor fixes from last test session with Rabah
pierlj Jan 31, 2024
d366e4b
Add difficulty level 3 questions
pierlj Jan 31, 2024
605d692
Update docs with difficulty levels
pierlj Feb 1, 2024
ea29236
Merge branch 'gsk-2663-add-difficulty-levels' into gsk-2540-add-tests…
pierlj Feb 1, 2024
a2ef22b
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Feb 2, 2024
95e84c6
Regenerating pdm.lock
Feb 2, 2024
b0dafcf
Remove case matching
pierlj Feb 2, 2024
ce66db8
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Feb 2, 2024
c456bfe
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
kevinmessiaen Feb 5, 2024
b5973ed
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Feb 8, 2024
c16d8b4
Fix minor issues after merge
mattbit Feb 9, 2024
938dd09
Prompt fixes
mattbit Feb 9, 2024
8dabe83
Docs update
mattbit Feb 9, 2024
4127321
Fixing LLM client
mattbit Feb 9, 2024
a3f2811
Make content optional in LLMMessage
mattbit Feb 9, 2024
fc647f8
Fix evaluator and tests
mattbit Feb 9, 2024
3c32a56
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Feb 12, 2024
853f31f
Fix docs
mattbit Feb 12, 2024
3b34173
Start refactoring of QATestset
mattbit Feb 12, 2024
50a51c1
Update docs
pierlj Feb 12, 2024
e0631fe
Fixing correctness evaluator
mattbit Feb 12, 2024
4bd0b75
Merge branch 'gsk-2540-add-testset-generation-evaluator' of github.co…
mattbit Feb 12, 2024
0c13ebe
Add conversion to dataset to testset
mattbit Feb 12, 2024
46fbea2
More fixes to correctness evaluator
mattbit Feb 12, 2024
6a7487a
Fix typo
mattbit Feb 12, 2024
add75fb
Fix reason
mattbit Feb 12, 2024
d6b88ec
Fixing tests
mattbit Feb 12, 2024
47f8994
Fix typo in testset copy method
pierlj Feb 13, 2024
af08e5d
Fixing and reformatting LLMClient
mattbit Feb 13, 2024
2c14cd0
Small refactoring
mattbit Feb 13, 2024
0980190
Small refactoring
mattbit Feb 13, 2024
79d0ee3
Fix test
mattbit Feb 13, 2024
40a80ca
Fix level 3 generator
mattbit Feb 13, 2024
63306d4
Update RAG toolset docs
mattbit Feb 13, 2024
fa453f7
Nice table in docs
mattbit Feb 13, 2024
7bc9f6c
Add docs for QATestset
mattbit Feb 13, 2024
46170be
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Feb 13, 2024
a9cd688
Add warning message in the RAG toolset doc
pierlj Feb 13, 2024
85b38a7
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
pierlj Feb 13, 2024
20da7c1
Update docs
pierlj Feb 13, 2024
e5d0fef
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
mattbit Feb 13, 2024
93fc038
Merge branch 'main' into gsk-2540-add-testset-generation-evaluator
Hartorn Feb 13, 2024
2 changes: 2 additions & 0 deletions docs/index.md
@@ -14,6 +14,7 @@ getting_started/quickstart/index

open_source/installation_library/index
open_source/scan/index
open_source/testset_generation/index
open_source/customize_tests/index
open_source/integrate_tests/index
```
@@ -74,6 +75,7 @@ cli/index
reference/models/index
reference/datasets/index
reference/scan/index
reference/rag-toolset/index
reference/tests/index
reference/slicing-functions/index
reference/transformation-functions/index
6 changes: 6 additions & 0 deletions docs/open_source/index.md
@@ -6,6 +6,7 @@

installation_library/index
scan/index
testset_generation/index
customize_tests/index
integrate_tests/index
```
@@ -23,6 +24,11 @@ integrate_tests/index
:link: scan/index.html
::::

::::{grid-item-card} <br/><h3>🧰 RAG Toolset</h3>
:text-align: center
:link: testset_generation/index.html
::::

::::{grid-item-card} <br/><h3>🧪 Customize your tests</h3>
:text-align: center
:link: customize_tests/index.html
2 changes: 1 addition & 1 deletion docs/open_source/scan/scan_llm/index.md
@@ -72,7 +72,7 @@ set_llm_model('my-gpt-4-model')

We are now ready to start.


(model-wrapping)=
## Step 1: Wrap your model

Start by **wrapping your model**. This step is necessary to ensure a common format for your model and its metadata.
194 changes: 194 additions & 0 deletions docs/open_source/testset_generation/index.md
@@ -0,0 +1,194 @@
# 🧰 RAG Testset Generation

> ⚠️ **The RAG toolset is in an early stage and is subject to change.** Feel free to reach out on our [Discord server](https://discord.gg/fkv7CAr3FE) if you have any trouble with test set generation or to provide feedback.


The Giskard Python library provides a toolset dedicated to Retrieval Augmented Generation (RAG) models. It generates question & answer pairs from your model's knowledge base; the generated test set can then be used to evaluate your model.

(difficulty_levels)=
## Generate questions with difficulty levels

You can currently generate questions with three difficulty levels:

```{list-table}
:header-rows: 1
:widths: 35, 65
* - Difficulty Level
- Description
* - **1: Easy questions**
- Simple questions generated from an excerpt of the knowledge base
* - **2: Complex questions**
- Questions made more complex by paraphrasing
* - **3: Distracting questions**
- Questions made even more difficult by adding a distracting element which is related to the knowledge base but irrelevant to the question
```

These three difficulty levels let you evaluate different components of your model. Easy questions are generated directly from your knowledge base: they assess the quality of the answer generation from the context, i.e. the quality of the LLM answer. Complex and distracting questions are more challenging, as they can perturb the retrieval component of the RAG. They are also more representative of users seeking precise information from your model.

## Before starting

Before starting, make sure you have installed the LLM flavor of Giskard:

```bash
pip install "giskard[llm]"
```

To use the RAG test set generation and evaluation tools, you need an OpenAI API key. You can set it in your notebook like this:

:::::::{tab-set}
::::::{tab-item} OpenAI

```python
import os

os.environ["OPENAI_API_KEY"] = "sk-…"
```

::::::
::::::{tab-item} Azure OpenAI

Requires `openai>=1.0.0`.
Make sure that both the LLM and embedding models are deployed on the Azure endpoint. The default embedding model used by the Giskard client is `text-embedding-ada-002`.

```python
import os
from giskard.llm import set_llm_model

os.environ['AZURE_OPENAI_API_KEY'] = '...'
os.environ['AZURE_OPENAI_ENDPOINT'] = 'https://xxx.openai.azure.com'
os.environ['OPENAI_API_VERSION'] = '2023-07-01-preview'


# You'll need to provide the name of the model that you've deployed
# Beware, the model provided must be capable of using function calls
set_llm_model('my-gpt-4-model')
```

::::::
:::::::

We are now ready to start.


## Step 1: Automatically generate a Q&A test set

To start, you only need your data or knowledge base in a pandas `DataFrame`. Then, you can initialize the testset
generator ({class}`giskard.rag.TestsetGenerator`) by passing your dataframe.

If some columns in your dataframe are not relevant for question generation (e.g. they contain metadata), specify the relevant column names via the `knowledge_base_columns` argument (see {class}`giskard.rag.TestsetGenerator`).

To make question generation more accurate, you can also provide a model name and description to the generator. This helps the generator produce questions that are more relevant to your model's task. You can also specify the language of the generated questions.


```python
import pandas as pd

from giskard.rag import TestsetGenerator

# Load your data
knowledge_base_df = pd.read_csv("path/to/your/knowledge_base.csv")

# Initialize the testset generator
generator = TestsetGenerator(
    knowledge_base_df,
    knowledge_base_columns=["column_1", "column_2"],
    language="en",  # Optional, if you want to generate questions in a specific language

    # Optionally, you can provide a model name and description to improve the question quality
    model_name="Shop Assistant",
    model_description="A model that answers common questions about our products",
)
```

We are ready to generate the test set. Let's start with a small test set of 10 questions and answers per difficulty level.

Currently, you can choose difficulty levels from 1 to 3 (see {ref}`difficulty_levels`).

```python
# Generate a testset with 10 questions & answers per difficulty level (here levels 1 and 2; this will take a while)
testset = generator.generate_testset(num_questions=10, difficulty=[1, 2])

# Save the generated testset
testset.save("my_testset.jsonl")

# Load it back
from giskard.rag import QATestset

loaded_testset = QATestset.load("my_testset.jsonl")
```

The test set will be an instance of {class}`~giskard.rag.QATestset`. You can save it and load it later with `QATestset.load("path/to/testset.jsonl")`.

You can also convert it to a pandas DataFrame with `testset.to_pandas()`:

```python
# Convert it to a pandas dataframe
df = loaded_testset.to_pandas()
```

Let's have a look at the generated questions:

| question | reference_context | reference_answer | difficulty_level |
|----------|-------------------|------------------|------------------|
| For which countries can I track my shipping? | What is your shipping policy? We offer free shipping on all orders over \$50. For orders below \$50, we charge a flat rate of \$5.99. We offer shipping services to customers residing in all 50 states of the US, in addition to providing delivery options to Canada and Mexico. ------ How can I track my order? Once your purchase has been successfully confirmed and shipped, you will receive a confirmation email containing your tracking number. You can simply click on the link provided in the email or visit our website's order tracking page. | We ship to all 50 states in the US, as well as to Canada and Mexico. We offer tracking for all our shippings. | 1 |

As you can see, the data contains 4 columns:
- `question`: the generated question
- `reference_context`: the context that can be used to answer the question
- `reference_answer`: the answer to the question (generated with GPT-4)
- `difficulty_level`: the difficulty level of the question (1, 2 or 3)
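
For example, once converted to pandas you can slice the test set by difficulty. Here is a small usage sketch, reusing the `loaded_testset` from above:

```python
# Keep only the more challenging questions (difficulty levels 2 and 3)
df = loaded_testset.to_pandas()
hard_questions = df[df["difficulty_level"] >= 2]
print(hard_questions[["question", "difficulty_level"]])
```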

## Step 2: Evaluate your model on the generated test set

Before evaluating your model, you must wrap it as a `giskard.Model`. This step is necessary to ensure a common format for your model and its metadata. You can wrap anything that can be represented as a Python function (for example, an API call to Azure or OpenAI). We also have pre-built wrappers for LangChain objects, and you can create your own wrapper by extending the `giskard.Model` class if you need to wrap a complex object such as a custom-made RAG communicating with a vector store.

To do so, you can follow the instructions from the [LLM Scan feature](../scan/scan_llm/index.md#step-1-wrap-your-model). Make sure that you pass `feature_names=["question"]` when wrapping your model, so that it matches the question column of the test set.
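
As an illustration, here is a minimal sketch of such a wrapper built around a plain prediction function; `answer_question` is a hypothetical stand-in for your own RAG pipeline:

```python
import giskard
import pandas as pd

def answer_question(question: str) -> str:
    # Hypothetical placeholder: call your own RAG pipeline here
    # (e.g. retrieve context from your vector store, then query your LLM).
    return "placeholder answer"

def predict(df: pd.DataFrame) -> list:
    # Giskard calls this with a DataFrame containing one row per test question.
    return [answer_question(q) for q in df["question"]]

giskard_model = giskard.Model(
    model=predict,
    model_type="text_generation",
    name="Shop Assistant",
    description="A model that answers common questions about our products",
    feature_names=["question"],
)
```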

Detailed examples can also be found on our {doc}`LLM tutorials section </tutorials/llm_tutorials/index>`.

Once you have wrapped your model, we can proceed with evaluation.

Let's convert our test set into an actionable test suite ({class}`giskard.Suite`) that we can save and reuse in further iterations.

```python
test_suite = testset.to_test_suite("My first test suite")

test_suite.run(model=giskard_model)
```

![](./test_suite_widget.png)
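
The run also returns a result object, so you can check the outcome programmatically. A minimal sketch, assuming the suite result exposes a `passed` flag (check the API reference for the exact interface):

```python
results = test_suite.run(model=giskard_model)
print("Suite passed:", results.passed)
```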

Jump to the [test customization](https://docs.giskard.ai/en/latest/open_source/customize_tests/index.html) and [test integration](https://docs.giskard.ai/en/latest/open_source/integrate_tests/index.html) sections to find out everything you can do with test suites.


## Step 3: Upload your test suite to the Giskard Hub

Uploading a test suite to the hub allows you to:
* Compare the quality of different models and prompts to decide which one to promote
* Create more tests relevant to your use case, combining input prompts that make your model fail and custom evaluation criteria
* Share results, and collaborate with your team to integrate business feedback

To upload your test suite, you must have created a project on Giskard Hub and instantiated a Giskard Python client. If you haven't done this yet, follow the first steps of the [upload your object](https://docs.giskard.ai/en/latest/giskard_hub/upload/index.html#upload-your-object) guide.

Then, upload your test suite like this:
```python
# project_id should be the ID of the Giskard project to which you want to upload the suite
test_suite.upload(giskard_client, project_id)
```

[Here's a demo](https://huggingface.co/spaces/giskardai/giskard) of the Giskard Hub in action.


## What data is sent to OpenAI/Azure OpenAI?

To generate the questions, the toolset sends the following information to OpenAI/Azure OpenAI:

- Data provided in your knowledge base
- Text generated by your model
- Model name and description

## Will the test set generation work in any language?

Yes, you can specify the language of the generated questions when you initialize the {class}`giskard.rag.TestsetGenerator`.
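
For instance, a hypothetical generator producing French questions, reusing the dataframe and column names from Step 1:

```python
generator = TestsetGenerator(
    knowledge_base_df,
    knowledge_base_columns=["column_1", "column_2"],
    language="fr",  # generate questions in French
)
```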

## Troubleshooting

If you encounter any issues, join our [Discord community](https://discord.gg/fkv7CAr3FE) and ask questions in our #support channel.
1 change: 1 addition & 0 deletions docs/reference/index.rst
@@ -13,3 +13,4 @@ API Reference
transformation-functions/index
push/index
suite/index
rag-toolset/index
5 changes: 5 additions & 0 deletions docs/reference/rag-toolset/correctness_evaluator.rst
@@ -0,0 +1,5 @@
Correctness Evaluator
=====================

.. autoclass:: giskard.llm.evaluators.CorrectnessEvaluator
:members:
10 changes: 10 additions & 0 deletions docs/reference/rag-toolset/index.rst
@@ -0,0 +1,10 @@
RAG Toolset
=============

.. toctree::
:maxdepth: 2

testset_generation
vector_store
correctness_evaluator

8 changes: 8 additions & 0 deletions docs/reference/rag-toolset/testset_generation.rst
@@ -0,0 +1,8 @@
Testset Generation
==================

.. autoclass:: giskard.rag.TestsetGenerator
:members:

.. autoclass:: giskard.rag.QATestset
:members:
8 changes: 8 additions & 0 deletions docs/reference/rag-toolset/vector_store.rst
@@ -0,0 +1,8 @@
Vector Store
============

.. autoclass:: giskard.rag.vector_store.VectorStore
:members:

.. autoclass:: giskard.rag.vector_store.Document
:members:
1 change: 1 addition & 0 deletions docs/reference/tests/llm.rst
@@ -14,6 +14,7 @@ LLM-as-a-judge
.. autofunction:: giskard.testing.tests.llm.test_llm_output_against_requirement_per_row
.. autofunction:: giskard.testing.tests.llm.test_llm_single_output_against_requirement
.. autofunction:: giskard.testing.tests.llm.test_llm_output_against_requirement
.. autofunction:: giskard.testing.tests.llm.test_llm_correctness

Ground Truth
--------------
12 changes: 9 additions & 3 deletions giskard/llm/client/base.py
@@ -3,6 +3,8 @@
from abc import ABC, abstractmethod
from dataclasses import dataclass

import numpy as np

from .logger import LLMLogger


@@ -22,9 +24,9 @@ class LLMToolCall:
@dataclass
class LLMMessage:
role: str
content: Optional[str]
function_call: Optional[LLMFunctionCall]
tool_calls: Optional[List[LLMToolCall]]
content: Optional[str] = None
function_call: Optional[LLMFunctionCall] = None
tool_calls: Optional[List[LLMToolCall]] = None

@staticmethod
def create_message(role: str, content: str):
@@ -50,3 +52,7 @@ def complete(
tool_choice=None,
) -> LLMMessage:
...

@abstractmethod
def embeddings(self, text) -> np.ndarray:
...
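
For reference, here is a minimal sketch of how a concrete client could satisfy the new abstract `embeddings` method, assuming the `openai` v1 SDK and the `text-embedding-ada-002` model mentioned in the docs (the class name and batching details are illustrative, not part of this PR):

```python
import numpy as np
from openai import OpenAI

class OpenAIEmbeddingsSketch:
    """Illustrative only: one possible implementation of the abstract method."""

    def __init__(self, model: str = "text-embedding-ada-002"):
        self._client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self._model = model

    def embeddings(self, text) -> np.ndarray:
        # Accept a single string or a sequence of strings.
        inputs = [text] if isinstance(text, str) else list(text)
        response = self._client.embeddings.create(model=self._model, input=inputs)
        return np.array([item.embedding for item in response.data])
```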