
Eval full pipeline #29

Open
2 tasks done
rti opened this issue Feb 12, 2024 · 1 comment
Comments

@rti
Owner

rti commented Feb 12, 2024

Issue

I think it would be interesting to evaluate the performance of the pipeline at different stages.

  • How good is the retrieval?
    • How do different embedding models compare?
  • What is the optimal number of retrieved contexts to pass to the model?
  • Which model answers questions best?
    • Picks up the actual facts from the context
    • Fewest hallucinations
    • Best phrasing
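The retrieval side of the list above can be measured without any LLM in the loop. A minimal sketch (all data and names hypothetical): given the ranked passage ids a retriever returns per question and the id of the gold passage, compute hit rate@k and MRR.

```python
def hit_rate_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold passage appears in the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, gold_id):
    """1/rank of the gold passage, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0

def evaluate_retrieval(results, k=3):
    """results: list of (ranked_ids, gold_id) pairs, one per question."""
    n = len(results)
    return {
        f"hit_rate@{k}": sum(hit_rate_at_k(r, g, k) for r, g in results) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in results) / n,
    }

# Toy example: two questions, retriever returns ranked passage ids.
results = [
    (["p7", "p2", "p9"], "p2"),   # gold passage at rank 2
    (["p1", "p4", "p8"], "p5"),   # gold passage not retrieved
]
print(evaluate_retrieval(results, k=3))
# → {'hit_rate@3': 0.5, 'mrr': 0.25}
```

Running the same question set against different embedding models and comparing these numbers would answer the first two bullets directly; sweeping k would inform the context-count question.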

For the last GB&C, Silvan and I implemented something very simple but conceptually similar for the askwikidata prototype:
https://github.com/rti/askwikidata/blob/main/eval.py
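In a similarly simple spirit, fact uptake and hallucination could be approximated by token overlap between the generated answer and the retrieved context. This is only a rough heuristic (names and data hypothetical), not a substitute for a proper faithfulness metric:

```python
import re

# Tiny illustrative stopword list; a real one would be larger.
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "is", "and"})

def content_tokens(text):
    """Lowercased word tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}

def context_support(answer, context):
    """Fraction of the answer's content tokens that also appear in the
    retrieved context; low values hint at hallucinated content."""
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & content_tokens(context)) / len(answer_tokens)

context = "Douglas Adams was born in Cambridge in 1952."
print(context_support("Adams was born in Cambridge.", context))  # → 1.0
print(context_support("Adams was born in Paris.", context))      # → 0.75
```

Averaged over a question set, this would give a crude per-model score for "takes up the actual facts from the context".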

There are also frameworks such as Ragas that might help: https://docs.ragas.io/en/latest/getstarted/evaluation.html#metrics

@exowanderer
Collaborator

exowanderer commented Feb 12, 2024

To throw a link at the wall: last week I read this Medium article -- Top Evaluation Metrics for RAG Failures.
It could be relevant to this conversation; at the very least you'll know where I got some fancy new (and hard-to-code) solution from:


Furthermore, I met an open-source company in San Jose called TruEra. They focus on evaluating LLMs and RAG algorithms. I proposed that our future (yet unconfirmed) Wikidata-VectorDB could serve as a use case for working together.

If we ask them to help us evaluate our RAG, we could start the collaboration earlier, which would make a long-term collaboration more viable.

@rti rti mentioned this issue Feb 12, 2024
1 task