
Refine the text cleaning before embedding the documents in the RAG pipeline #3089

Closed
yonitoo opened this issue Feb 7, 2024 · 0 comments
Labels
enhancement (New feature or request), initiative: VDK for Private AI (Initiative including the effort to support Private AI usecases of VMWare with VDK)

yonitoo commented Feb 7, 2024

Overview

Our current text cleaning method converts the text to lower case, removes punctuation, lemmatizes, and removes stop words. As discussed HERE, transformer models (in our case SentenceTransformer) don't require such extensive preprocessing. It is even recommended not to apply it, because some context might be lost that way.
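To illustrate the kind of context loss meant here, below is a minimal sketch of aggressive cleaning. The function name `aggressive_clean` and the tiny stop word set are hypothetical and are not the project's actual code; lemmatization is left out to keep the example dependency-free. Some common stop word lists do include negations such as "not", which is what the sketch assumes.

```python
import string

# Hypothetical, deliberately tiny stop word set for illustration only.
STOP_WORDS = {"the", "is", "not", "a", "to", "for"}


def aggressive_clean(text: str) -> str:
    """Lowercase, strip punctuation, and drop stop words (no lemmatization)."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in no_punct.split() if w not in STOP_WORDS)


negated = aggressive_clean("The job is not safe to rerun.")
affirmed = aggressive_clean("The job is safe to rerun!")
print(negated)
print(affirmed)
```

Under these assumptions both sentences collapse to the same string, so an embedding model would no longer see the negation that distinguishes them.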

Suggested solution
Drop the lemmatization and stop words removal from the cleaning.
Check whether lower case conversion is already done by default by the transformer model we are using.
The cleaning step is something you would expect to have in a pipeline, so we need to figure out how to handle it properly.
Decide on what text cleaning logic might be relevant and add it.
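One possible shape for the lighter cleaning step is sketched below. It is only an assumption about what "relevant" cleaning could mean: Unicode normalization, removal of invisible control characters, and whitespace collapsing, while case, punctuation, and stop words are left intact. The function name `light_clean` is hypothetical.

```python
import re
import unicodedata


def light_clean(text: str) -> str:
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    # NFKC folds compatibility characters (e.g. non-breaking spaces, ligatures).
    normalized = unicodedata.normalize("NFKC", text)
    # Remove control/format characters, but keep newlines and tabs so they
    # can be collapsed as ordinary whitespace in the next step.
    visible = "".join(
        ch for ch in normalized
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", visible).strip()


cleaned = light_clean("  Data\u00a0Jobs\tare NOT\n\nre-run\u200b automatically.  ")
print(cleaned)
```

The design choice here is to fix only artifacts of extraction and encoding, and to leave the wording itself to the model's own tokenizer.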

Acceptance criteria

Remove the extensive NLP preprocessing (lemmatization and stop words removal).
Add relevant text cleaning logic.

yonitoo added the enhancement and initiative: VDK for Private AI labels Feb 7, 2024
yonitoo closed this as completed Feb 19, 2024