[Feature] Extend document chunker transform to support fixed-size token window chunker with overlap #641

Closed
2 tasks done
juancappi opened this issue Sep 29, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@juancappi
Contributor

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

The current document chunker supports only Markdown and Docling JSON chunking. Other chunking criteria are equally valid; in particular, there is a need to support a fixed-size token window with token overlap. This functionality is important for classification-based transforms that operate on smaller chunks of data.

While there are multiple approaches to token counting, I propose leveraging the existing capabilities of the llama-index-core library (which is already a dependency) to implement this feature.
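To illustrate the requested behavior, here is a minimal sketch of a fixed-size token window with overlap. It is not the merged implementation and does not use llama-index-core: the function name `chunk_by_token_window`, its parameters, and the whitespace tokenizer are all assumptions made so the example is self-contained. A real transform would plug in a proper tokenizer.

```python
from typing import Callable, List


def chunk_by_token_window(
    text: str,
    chunk_size: int = 8,
    overlap: int = 2,
    tokenize: Callable[[str], List[str]] = str.split,  # assumption: whitespace tokens
) -> List[str]:
    """Split text into windows of at most `chunk_size` tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper for
    illustration only; names and defaults are not from the project.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = tokenize(text)
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        # Stop once the window has reached the end of the token list,
        # so no trailing chunk made up only of overlap tokens is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_by_token_window("a b c d e f g h i j", chunk_size=4, overlap=1):
        print(chunk)
```

With `chunk_size=4` and `overlap=1`, the window advances three tokens at a time, so each chunk begins with the last token of the previous one.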

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@juancappi juancappi added the enhancement New feature or request label Sep 29, 2024
juancappi added a commit to juancappi/data-prep-kit that referenced this issue Sep 29, 2024
juancappi added a commit to juancappi/data-prep-kit that referenced this issue Sep 29, 2024
juancappi added a commit to juancappi/data-prep-kit that referenced this issue Oct 3, 2024
to better reflect the new chunker is also leveraging a Llama Index chunker

Signed-off-by: Juan Cappi <[email protected]>
IBM#641
juancappi added a commit to juancappi/data-prep-kit that referenced this issue Oct 3, 2024
touma-I added a commit that referenced this issue Oct 7, 2024
…king

feat: #641 - Extend document chunker transform to support fixed-size token window chunker with overlap - Python Only
@dolfim-ibm
Member

The PR of @juancappi was merged. Closing the issue.
