[Feature] Extend document chunker transform to support fixed-size token window chunker with overlap #641

juancappi · 2024-09-29T12:44:02Z

Search before asking

I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

The current document chunker supports only Markdown and Docling JSON chunking. Other chunking criteria are equally valid; in particular, there is a need for a fixed-size token window with token overlap. This functionality is important for classification-based transforms that operate on smaller chunks of data.

While there are multiple approaches to token counting, I propose leveraging the existing capabilities of the llama-index-core library (which is already a dependency) to implement this feature.
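To illustrate the requested behavior, here is a minimal sketch of a fixed-size token window with overlap. It is not the llama-index-core implementation and not the code from the eventual PR: the function name `chunk_by_token_window` and its parameters are hypothetical, and whitespace splitting stands in for a real tokenizer purely for illustration.

```python
from typing import List


def chunk_by_token_window(
    tokens: List[str], chunk_size: int, overlap: int
) -> List[List[str]]:
    """Split tokens into windows of at most `chunk_size` tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper,
    written only to illustrate the idea.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Each new window starts `step` tokens after the previous one.
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: whitespace splitting replaces a real tokenizer here.
tokens = "a b c d e f g h i j".split()
windows = chunk_by_token_window(tokens, chunk_size=4, overlap=1)
for window in windows:
    print(" ".join(window))
```

With `chunk_size=4` and `overlap=1`, each window repeats the last token of the previous one, so text that straddles a chunk boundary appears in both chunks. A real implementation would count tokens with the tokenizer of the downstream model rather than by whitespace.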

Are you willing to submit a PR?

Yes I am willing to submit a PR!

dolfim-ibm · 2024-10-29T13:13:30Z

The PR of @juancappi was merged. Closing the issue.

juancappi added the enhancement label Sep 29, 2024

juancappi added a commit to juancappi/data-prep-kit that referenced this issue Sep 29, 2024

feat: IBM#641 - first draft implementation, python only

43d36f1

juancappi mentioned this issue Sep 29, 2024

feat: #641 - Extend document chunker transform to support fixed-size token window chunker with overlap #642

Merged

juancappi added a commit to juancappi/data-prep-kit that referenced this issue Sep 29, 2024

feat: IBM#641 - first draft implementation, python only

242fad1

Signed-off-by: Juan Cappi <[email protected]>

juancappi added a commit to juancappi/data-prep-kit that referenced this issue Oct 3, 2024

fix: change naming

c481c5c

to better reflect the new chunker is also leveraging a Llama Index chunker Signed-off-by: Juan Cappi <[email protected]> IBM#641

juancappi added a commit to juancappi/data-prep-kit that referenced this issue Oct 3, 2024

fix: adjust documentation - IBM#641

6d21ef3

Signed-off-by: Juan Cappi <[email protected]>

daw3rd assigned dolfim-ibm Oct 4, 2024

touma-I added a commit that referenced this issue Oct 7, 2024

Merge pull request #642 from juancappi/feat/641-fixed-size-token-chun…

d04454e

…king feat: #641 - Extend document chunker transform to support fixed-size token window chunker with overlap - Python Only

dolfim-ibm closed this as completed Oct 29, 2024
