
Introduce document chunking mechanism prior to embedding within the RAG pipeline #3084

Closed
yonitoo opened this issue Feb 6, 2024 · 0 comments · Fixed by #3093
Labels: enhancement (New feature or request), initiative: VDK for Private AI (Initiative including the effort to support Private AI use cases of VMware with VDK)

Comments

yonitoo (Contributor) commented Feb 6, 2024

What is the feature request? What problem does it solve?
This feature request addresses the challenge of processing large Confluence documents within the RAG pipeline. Currently, whole documents are embedded and ingested into the database, but their size sometimes exceeds the token limit (4096 tokens) imposed by the LLM used for inference. This story introduces a proper document chunking mechanism to overcome this problem.

Suggested solution
Develop chunking logic that splits large Confluence documents into smaller, manageable chunks. Each chunk should have a token size that complies with the LLM token limit. To preserve context across boundaries, consecutive chunks should overlap. Both the chunk size and the overlap size should be configurable; a sketch of such a splitter is shown below.
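A minimal sketch of such a splitter, as a starting point only: the names `chunk_document`, `chunk_size`, and `chunk_overlap` are assumptions rather than the actual implementation, and it counts characters for simplicity, whereas a production version would measure chunks in tokens with the inference model's tokenizer.

```python
def chunk_document(text: str, chunk_size: int = 1024, chunk_overlap: int = 128) -> list[str]:
    """Split text into chunks of at most chunk_size characters, where
    consecutive chunks share chunk_overlap characters of context."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The overlap means that text near a chunk boundary appears in both neighboring chunks, which reduces the chance that retrieval loses the context around a split point.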

Acceptance criteria

Have a configurable chunking mechanism that splits the texts prior to embedding.
Modify the schema and ingest the chunks properly into the vector database (see the sketch after this list).
Extend the existing data job.
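A rough sketch of what the chunk-aware schema and ingestion could look like in a VDK data job step. The table name, column names, and embedding dimension are all hypothetical, `embed` is a stand-in for the job's real embedding model, and the `vector` column assumes the pgvector extension is installed; it reuses the `chunk_document` splitter sketched above.

```python
from vdk.api.job_input import IJobInput


def embed(chunk: str) -> list[float]:
    # Placeholder: the real job computes this with its embedding model.
    return [0.0] * 384


def run(job_input: IJobInput) -> None:
    # One row per chunk instead of one per document; (doc_id, chunk_index)
    # ties every chunk back to its source document and position.
    job_input.execute_query(
        """
        CREATE TABLE IF NOT EXISTS document_chunks (
            doc_id      TEXT,
            chunk_index INTEGER,
            chunk_text  TEXT,
            embedding   VECTOR(384),  -- dimension must match the embedding model
            PRIMARY KEY (doc_id, chunk_index)
        )
        """
    )
    documents = {"confluence-page-1": "...page text..."}  # stand-in for crawler output
    for doc_id, text in documents.items():
        for i, chunk in enumerate(chunk_document(text)):  # splitter sketched above
            job_input.send_object_for_ingestion(
                payload={
                    "doc_id": doc_id,
                    "chunk_index": i,
                    "chunk_text": chunk,
                    "embedding": embed(chunk),
                },
                destination_table="document_chunks",
            )
```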

yonitoo added the enhancement and initiative: VDK for Private AI labels on Feb 6, 2024
yonitoo added a commit that referenced this issue Feb 19, 2024
What: Extend the pgvector-embedder by adding a configurable chunking
mechanism.

Why: Until now, whole documents were embedded and ingested into the
database, but their size sometimes exceeded the token limit imposed by the
LLM used for inference. This change introduces a configurable document
chunking mechanism to overcome this problem.

Testing Done: Ran the pipeline jobs locally.

Closes #3084 

Signed-off-by: Yoan Salambashev <[email protected]>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Yoan Salambashev <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
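As a loose illustration of the configurable part, the two parameters could be passed through VDK job arguments. The argument names below are illustrative and not necessarily what the merged change uses.

```python
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # Job arguments can be supplied per run; fall back to defaults otherwise.
    args = job_input.get_arguments()
    chunk_size = int(args.get("chunk_size", 1024))
    chunk_overlap = int(args.get("chunk_overlap", 128))
    # ...feed the values to the splitter, e.g. chunk_document(text, chunk_size, chunk_overlap)
```

A run could then override the defaults with something like `vdk run pgvector-embedder --arguments '{"chunk_size": 512, "chunk_overlap": 64}'`.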