Introduce document chunking mechanism prior to embedding within the RAG pipeline #3084
Labels
enhancement
New feature or request
initiative: VDK for Private AI
Initiative including the effort to support Private AI use cases of VMware with VDK
What is the feature request? What problem does it solve?
This feature request addresses the challenge of processing large Confluence documents within the RAG pipeline. Currently, whole documents are embedded and ingested into the database, but their size sometimes exceeds the token limit (4096 tokens) imposed by the LLM used for inference. This story introduces a proper document chunking mechanism to overcome this problem.
Suggested solution
Develop chunking logic that splits large Confluence documents into smaller, manageable chunks. Each chunk should have a reasonable token size that complies with the LLM token limit. To preserve context across chunk boundaries, consecutive chunks should overlap.
Both the chunk size and the overlap size should be configurable.
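A minimal sketch of what such chunking logic could look like. This is an illustration, not the final design: whitespace splitting stands in for the real tokenizer of the embedding/inference model, and the `chunk_document` name and its default values are hypothetical.

```python
# Hypothetical sketch of the proposed chunking logic, not the final design.
# Whitespace splitting approximates token counting; a real implementation
# would likely use the tokenizer of the embedding/inference model.
def chunk_document(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already covers the tail of the document
    return chunks
```

For example, with chunk_size=512 and chunk_overlap=64, a 1000-token document yields three chunks, each sharing 64 tokens with its neighbor, so no chunk exceeds the limit and boundary context is retained.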
Acceptance criteria
Have a configurable chunking mechanism that splits the text prior to embedding.
Modify the schema and ingest the chunks properly into the vector DB.
Extend the existing data job to perform the chunking and ingestion (see the sketch below).
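A hedged sketch of how the existing data job might be extended. The `fetch_confluence_documents` helper, the property keys, and the destination table and column names are all assumptions for illustration; `get_property` and `send_object_for_ingestion` are standard VDK job-input calls.

```python
# Hedged sketch of extending the existing data job; helper names, property
# keys, and the destination table/columns below are illustrative assumptions.
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # Chunking parameters are configurable via job properties (assumed keys).
    chunk_size = int(job_input.get_property("chunk_size", 512))
    chunk_overlap = int(job_input.get_property("chunk_overlap", 64))

    # fetch_confluence_documents() is a stand-in for the job's existing
    # Confluence retrieval step; chunk_document() is the sketch above.
    for doc in fetch_confluence_documents():
        for index, chunk in enumerate(
            chunk_document(doc["text"], chunk_size, chunk_overlap)
        ):
            job_input.send_object_for_ingestion(
                payload={
                    "document_id": doc["id"],  # link each chunk to its source
                    "chunk_index": index,      # preserve ordering for reassembly
                    "chunk_text": chunk,       # text to be embedded downstream
                },
                destination_table="confluence_document_chunks",  # assumed name
            )
```

Storing document_id and chunk_index alongside the chunk text keeps the schema change small while still allowing retrieved chunks to be traced back to their source page.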