Evaluating the RAG answer quality

📺 Watch: (RAG Deep Dive series) Evaluating RAG answer quality

Follow these steps to evaluate the quality of the answers generated by the RAG flow.

Deploy an evaluation model

  1. Run this command to tell azd to deploy a GPT-4 level model for evaluation:

    azd env set USE_EVAL true
  2. Set the capacity to the highest possible value to ensure that the evaluation runs relatively quickly. Even with a high capacity, it can take a long time to generate ground truth data and run bulk evaluations.

    azd env set AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY 100

    By default, that will provision a gpt-4o model, version 2024-08-06. To change those settings, set the azd environment variables AZURE_OPENAI_EVAL_MODEL and AZURE_OPENAI_EVAL_MODEL_VERSION to the desired values (see the example after this list).

  3. Then, run the following command to provision the model:

    azd provision
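
For example, to set the evaluation model and version explicitly (the values shown here are just the documented defaults; substitute the model and version you want):

azd env set AZURE_OPENAI_EVAL_MODEL gpt-4o
azd env set AZURE_OPENAI_EVAL_MODEL_VERSION 2024-08-06

Re-run azd provision after changing these values so that the new model deployment is created.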

Set up the evaluation environment

Make a new Python virtual environment and activate it. This is currently required due to incompatibilities between the dependencies of the evaluation script and the main project.

python -m venv .evalenv
source .evalenv/bin/activate
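
On Windows, activate the virtual environment with this command instead:

.evalenv\Scripts\activate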

Install all the dependencies for the evaluation script by running the following command:

pip install -r evals/requirements.txt

Generate ground truth data

Modify the search terms and tasks in evals/generate_config.json to match your domain.

Generate ground truth data by running the following command:

python evals/generate_ground_truth.py --numquestions=200 --numsearchdocs=1000

The options are:

  • numquestions: The number of questions to generate. We suggest at least 200.
  • numsearchdocs: The number of documents (chunks) to retrieve from your search index. You can leave off the option to fetch all documents, but that will significantly increase the time it takes to generate ground truth data. You may want to start with a subset first.
  • kgfile: An existing RAGAS knowledge base JSON file, which is usually ground_truth_kg.json. You may want to specify this if you already created a knowledge base and just want to tweak the question generation steps (see the example after this list).
  • groundtruthfile: The file to write the generated ground truth answers to. By default, this is evals/ground_truth.jsonl.
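
For example, to reuse an existing knowledge base and just regenerate the questions (assuming the same --option=value syntax as the command above):

python evals/generate_ground_truth.py --numquestions=200 --kgfile=ground_truth_kg.json --groundtruthfile=evals/ground_truth.jsonl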

🕰️ This may take a long time, possibly several hours, depending on the size of the search index.

Review the generated data in evals/ground_truth.jsonl after running that script, removing any question/answer pairs that don't seem like realistic user input.
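
For a quick sanity check before editing, you can count the generated pairs and skim the first few entries:

wc -l evals/ground_truth.jsonl
head -n 3 evals/ground_truth.jsonl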

Run bulk evaluation

Review the configuration in evals/eval_config.json to ensure that everything is correctly set up. You may want to adjust the metrics used. See the ai-rag-chat-evaluator README for more information on the available metrics.

By default, the evaluation script evaluates every question in the ground truth data. Run it with the following command:

python evals/evaluate.py

The options are:

  • numquestions: The number of questions to evaluate. By default, this is all questions in the ground truth data.
  • resultsdir: The directory to write the evaluation results. By default, this is a timestamped folder in evals/results. This option can also be specified in eval_config.json.
  • targeturl: The URL of the running application to evaluate. By default, this is http://localhost:50505. This option can also be specified in eval_config.json (see the example after this list).
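
For example, to evaluate a 50-question sample against the locally running app (assuming the same --option=value syntax as the ground truth script):

python evals/evaluate.py --numquestions=50 --targeturl=http://localhost:50505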

🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions, the TPM capacity of the evaluation model, and the number of GPT metrics requested.

Review the evaluation results

The evaluation script writes a summary of the evaluation results to the evals/results directory.

You can see a summary of results across all evaluation runs by running the following command:

python -m evaltools summary evals/results

Compare answers to the ground truth by running the following command:

python -m evaltools diff evals/results/baseline/

Compare answers across two runs by running the following command:

python -m evaltools diff evals/results/baseline/ evals/results/SECONDRUNHERE

Run bulk evaluation on a PR

This repository includes a GitHub Actions workflow, evaluate.yaml, that can be used to run the evaluation on the changes in a PR.

In order for the workflow to run successfully, you must first set up continuous integration for the repository.

To run the evaluation on the changes in a PR, a repository member can post a /evaluate comment on the PR. This triggers the evaluation workflow, which runs the evaluation on the PR changes and posts the results back to the PR.