
Feat: add multimodal eval support #1559

Merged: 7 commits, Oct 25, 2024

Conversation

Yunnglin (Contributor)

I am a developer from ModelScope. This framework is great and I would like to add some new features. Multi-modal RAG evaluation is important, as mentioned in #1030.

This PR adds support for RAG evaluation over mixed image-text contexts. It currently provides two metrics, MultiModalFaithfulness and MultiModalRelevance, adapted from LlamaIndex's faithfulness and relevancy evaluators. These metrics are still preliminary and can be improved further in the future.

The usage is as follows:

from datasets import Dataset

from ragas import evaluate
from ragas.metrics import MultiModalFaithfulness, MultiModalRelevance

# load dataset
dataset = Dataset.from_json("outputs/testset_multi_modal.json")

# load metrics
metrics = [MultiModalFaithfulness(), MultiModalRelevance()]

# evaluate; `llm` is a judge model that accepts interleaved image-text input,
# such as gpt-4o, constructed beforehand (one possible setup, for example:
#   from ragas.llms import LangchainLLMWrapper
#   from langchain_openai import ChatOpenAI
#   llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
# )
score = evaluate(
    dataset,
    metrics=metrics,
    llm=llm,
)
score_df = score.to_pandas()
score_df

Input example:

[
    {
        "user_input": "What brand is the car in the picture?",
        "retrieved_contexts": [
            "custom_eval/multimodal/images/tesla.jpg",
            "The picture is related to an electric vehicle brand."
        ],
        "response": "Tesla is a car brand.",
        "reference": "The car brand in the picture is Tesla."
    },
    {
        "user_input": "What about the Tesla Model X?",
        "retrieved_contexts": [
            "custom_eval/multimodal/images/tesla.jpg"
        ],
        "response": "Cats are cute.",
        "reference": "The Tesla Model X is an electric SUV manufactured by Tesla."
    }
]
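
Note that each entry in retrieved_contexts may be either an image path or plain text. As a rough illustration of how such mixed contexts can be told apart, here is a minimal sketch using a standard-library MIME check (the helper name is hypothetical; the PR's actual detection logic may differ, e.g. it may also handle URLs):

import mimetypes

def classify_context(ctx: str) -> str:
    # Hypothetical helper: label a retrieved context as "image" or "text"
    # based on its file extension; anything unrecognized is treated as text.
    mime, _ = mimetypes.guess_type(ctx)
    return "image" if mime and mime.startswith("image/") else "text"

contexts = [
    "custom_eval/multimodal/images/tesla.jpg",
    "The picture is related to an electric vehicle brand.",
]
print([classify_context(c) for c in contexts])  # ['image', 'text']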

Output example:

[
    {
        "user_input": "What brand is the car in the picture?",
        "retrieved_contexts": [
            "custom_eval/multimodal/images/tesla.jpg",
            "The picture is related to an electric vehicle brand."
        ],
        "response": "Tesla is a car brand.",
        "reference": "The car brand in the picture is Tesla.",
        "faithful_rate": true,
        "relevance_rate": true
    },
    {
        "user_input": "What about the Tesla Model X?",
        "retrieved_contexts": [
            "custom_eval/multimodal/images/tesla.jpg"
        ],
        "response": "Cats are cute.",
        "reference": "The Tesla Model X is an electric SUV manufactured by Tesla.",
        "faithful_rate": false,
        "relevance_rate": false
    }
]
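
Because to_pandas() returns a regular DataFrame, aggregate pass rates can be computed directly. A small usage sketch based on the column names in the output above (the boolean verdicts average to the fraction of passing samples):

# mean of the boolean verdict columns = fraction of samples judged
# faithful / relevant across the dataset
pass_rates = score_df[["faithful_rate", "relevance_rate"]].mean()
print(pass_rates)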

@Yunnglin changed the title from "Add multimodal eval support" to "Feat: add multimodal eval support" on Oct 23, 2024
@jjmachan requested a review from shahules786 on Oct 23, 2024
@jjmachan (Member)

hey @Yunnglin I had a quick look at this and it's great - thanks a lot for contributing it ❤️

testing it on my end too and will merge it in. I also see a couple of type-check errors; will you be tackling them, or should I help (happy to 🙂)?

@Yunnglin (Contributor, Author)

Hello, I have corrected these errors. Could you please recheck them?

@shahules786 (Member)

Hey @Yunnglin this seems great. We could improve the method for calculating faithfulness later on if required. It would be great if you could add these two to the docs as well; they would go under the RAG section - https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/
A small description of both would be perfect. Let me know if you need help with this.

@Yunnglin (Contributor, Author)

I have added the relevant documentation. Could you please take a look and see if any modifications are needed?

@shahules786 requested a review from jjmachan on Oct 25, 2024
@shahules786 (Member) left a review comment

LGTM

@jjmachan (Member)


just made some small fixes for callbacks support

@jjmachan (Member)

thanks a lot @Yunnglin for the PR - made a couple of small tweaks to merge it in but looks great ❤️

btw we have a form for goodies do check it out 🙂 https://docs.google.com/forms/d/e/1FAIpQLSdM9FrrZrnpByG4XxuTbcAB-zn-Z7i_a7CsMkgBVOWQjRJckg/viewform

@jjmachan merged commit 0f412de into explodinggradients:main on Oct 25, 2024 (15 checks passed)
@ethanelasky commented on Dec 6, 2024

Hi, nice work! Is it possible to add base64 image support as well (to mirror how Anthropic/OpenAI-compatible models accept images)?
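
For reference, "base64 support" here would mean accepting an image as a data URL rather than a file path. A minimal standard-library sketch of that encoding, outside the scope of this PR (the function name is illustrative):

import base64
import mimetypes

def to_data_url(image_path: str) -> str:
    # Encode a local image file as a base64 data URL of the form
    # "data:image/jpeg;base64,/9j/4AAQ..." accepted by OpenAI/Anthropic-style APIs.
    mime, _ = mimetypes.guess_type(image_path)
    with open(image_path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"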

@simjak commented on Jan 21, 2025

base64 would be very useful. @jjmachan, any examples of how I can evaluate multimodal retrieval?
https://mragbench.github.io/

@jjmachan (Member)

hey @simjak I will take a look at this and let you know :)

btw, are you on Discord?
