
Eval full pipeline #29

Open
2 tasks done
rti opened this issue Feb 12, 2024 · 1 comment
Comments

@rti
Owner

rti commented Feb 12, 2024

Issue

I think it would be interesting to evaluate the performance of the pipeline at different stages.

  • How good is the retrieval?
    • How do different embedding models compare?
  • What is the optimal number of retrieved contexts to pass to the model?
  • Which model answers questions best?
    • Picks up the actual facts from the context
    • Fewest hallucinations
    • Best phrasing
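The retrieval side of the list above can be measured without any LLM in the loop. A minimal sketch (all data and names hypothetical): given the ranked passage ids a retriever returns per question and the id of the gold passage, compute hit rate@k and MRR.

```python
def hit_rate_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold passage appears in the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, gold_id):
    """1/rank of the gold passage, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0

def evaluate_retrieval(results, k=3):
    """results: list of (ranked_ids, gold_id) pairs, one per question."""
    n = len(results)
    return {
        f"hit_rate@{k}": sum(hit_rate_at_k(r, g, k) for r, g in results) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in results) / n,
    }

# Toy example: two questions, retriever returns ranked passage ids.
results = [
    (["p7", "p2", "p9"], "p2"),   # gold passage at rank 2
    (["p1", "p4", "p8"], "p5"),   # gold passage not retrieved
]
print(evaluate_retrieval(results, k=3))
# → {'hit_rate@3': 0.5, 'mrr': 0.25}
```

Running the same question set against different embedding models and comparing these numbers would answer the first two bullets directly; sweeping k would inform the context-count question.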

For the last GB&C, Silvan and I implemented something very simple but conceptually similar for the askwikidata prototype:
https://github.com/rti/askwikidata/blob/main/eval.py
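In a similarly simple spirit, fact uptake and hallucination could be approximated by token overlap between the generated answer and the retrieved context. This is only a rough heuristic (names and data hypothetical), not a substitute for a proper faithfulness metric:

```python
import re

# Tiny illustrative stopword list; a real one would be larger.
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "is", "and"})

def content_tokens(text):
    """Lowercased word tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}

def context_support(answer, context):
    """Fraction of the answer's content tokens that also appear in the
    retrieved context; low values hint at hallucinated content."""
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & content_tokens(context)) / len(answer_tokens)

context = "Douglas Adams was born in Cambridge in 1952."
print(context_support("Adams was born in Cambridge.", context))  # → 1.0
print(context_support("Adams was born in Paris.", context))      # → 0.75
```

Averaged over a question set, this would give a crude per-model score for "takes up the actual facts from the context".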

There are also frameworks such as Ragas that might help: https://docs.ragas.io/en/latest/getstarted/evaluation.html#metrics

@exowanderer
Collaborator

exowanderer commented Feb 12, 2024

To throw a link at the wall: last week I read this Medium article -- Top Evaluation Metrics for RAG Failures.
It could be relevant to this conversation; at the very least you'll know where I got some fancy new (and hard-to-code) solution from:


Furthermore, I met an open-source company in San Jose called TruEra. They focus on evaluating LLMs and RAG algorithms. I proposed that our future (yet unconfirmed) Wikidata-VectorDB could serve as a use case for working together.

If we ask them to help us evaluate our RAG, we could start the collaboration earlier, which would make a long-term collaboration more viable.

@rti rti mentioned this issue Feb 12, 2024
1 task