
Introduce document chunking mechanism prior to embedding within the RAG pipeline #3084

Closed
yonitoo opened this issue Feb 6, 2024 · 0 comments · Fixed by #3093
Labels: enhancement (New feature or request), initiative: VDK for Private AI (Initiative including the effort to support Private AI use cases of VMware with VDK)

Comments

yonitoo (Contributor) commented Feb 6, 2024

What is the feature request? What problem does it solve?
This feature request addresses the challenge of processing large Confluence documents within the RAG pipeline. Currently, whole documents are embedded and ingested into the database, but their size sometimes exceeds the token limit (4096 tokens) imposed by the LLM used for inference. This story introduces a proper document chunking mechanism to overcome this problem.

Suggested solution
Develop chunking logic that splits large Confluence documents into smaller, manageable chunks. Each chunk should have a token size that complies with the LLM token limit. To preserve context across boundaries, consecutive chunks should overlap. Both the chunk size and the overlap size should be configurable; a sketch of such a splitter is shown below.
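A minimal sketch of such a splitter, as a starting point only: the names `chunk_document`, `chunk_size`, and `chunk_overlap` are assumptions rather than the actual implementation, and it counts characters for simplicity, whereas a production version would measure chunks in tokens with the inference model's tokenizer.

```python
def chunk_document(text: str, chunk_size: int = 1024, chunk_overlap: int = 128) -> list[str]:
    """Split text into chunks of at most chunk_size characters, where
    consecutive chunks share chunk_overlap characters of context."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The overlap means that text near a chunk boundary appears in both neighboring chunks, which reduces the chance that retrieval loses the context around a split point.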

Acceptance criteria

Have a configurable chunking mechanism that splits the texts prior to embedding.
Modify the schema and ingest the chunks properly into the vector database (see the sketch after this list).
Extend the existing data job.
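A rough sketch of what the chunk-aware schema and ingestion could look like in a VDK data job step. The table name, column names, and embedding dimension are all hypothetical, `embed` is a stand-in for the job's real embedding model, and the `vector` column assumes the pgvector extension is installed; it reuses the `chunk_document` splitter sketched above.

```python
from vdk.api.job_input import IJobInput


def embed(chunk: str) -> list[float]:
    # Placeholder: the real job computes this with its embedding model.
    return [0.0] * 384


def run(job_input: IJobInput) -> None:
    # One row per chunk instead of one per document; (doc_id, chunk_index)
    # ties every chunk back to its source document and position.
    job_input.execute_query(
        """
        CREATE TABLE IF NOT EXISTS document_chunks (
            doc_id      TEXT,
            chunk_index INTEGER,
            chunk_text  TEXT,
            embedding   VECTOR(384),  -- dimension must match the embedding model
            PRIMARY KEY (doc_id, chunk_index)
        )
        """
    )
    documents = {"confluence-page-1": "...page text..."}  # stand-in for crawler output
    for doc_id, text in documents.items():
        for i, chunk in enumerate(chunk_document(text)):  # splitter sketched above
            job_input.send_object_for_ingestion(
                payload={
                    "doc_id": doc_id,
                    "chunk_index": i,
                    "chunk_text": chunk,
                    "embedding": embed(chunk),
                },
                destination_table="document_chunks",
            )
```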

yonitoo added the enhancement and initiative: VDK for Private AI labels on Feb 6, 2024
yonitoo added a commit that referenced this issue Feb 19, 2024
What: Extend the pgvector-embedder by adding a configurable chunking
mechanism.

Why: Until now, whole documents were embedded and ingested into the
database, but their size sometimes exceeded the token limit imposed by the
LLM used for inference. This change introduces a configurable document
chunking mechanism to overcome this problem.

Testing Done: Ran the pipeline jobs locally.

Closes #3084 

Signed-off-by: Yoan Salambashev <[email protected]>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Yoan Salambashev <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
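As a loose illustration of the configurable part, the two parameters could be passed through VDK job arguments. The argument names below are illustrative and not necessarily what the merged change uses.

```python
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # Job arguments can be supplied per run; fall back to defaults otherwise.
    args = job_input.get_arguments()
    chunk_size = int(args.get("chunk_size", 1024))
    chunk_overlap = int(args.get("chunk_overlap", 128))
    # ...feed the values to the splitter, e.g. chunk_document(text, chunk_size, chunk_overlap)
```

A run could then override the defaults with something like `vdk run pgvector-embedder --arguments '{"chunk_size": 512, "chunk_overlap": 64}'`.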