[Feature] Extend document chunker transform to support fixed-size token window chunker with overlap #641
Closed
2 tasks done
Labels
enhancement
Search before asking
Component
Transforms/Other
Feature
The current document chunker supports only Markdown and Docling JSON chunking. Other chunking criteria are equally valid; in particular, there is a need for a fixed-size token window with token overlap. This functionality is important for classification-based transforms that operate on smaller chunks of data.
While there are multiple approaches to token counting, I propose leveraging the existing capabilities of the llama-index-core library (which is already a dependency) to implement this feature.
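To make the requested behavior concrete, here is a minimal sketch of a fixed-size token window with overlap. It is not the proposed implementation: the function name `chunk_by_token_window` and its parameters are hypothetical, and a whitespace split stands in for a real tokenizer. llama-index-core ships a `TokenTextSplitter` that takes `chunk_size` and `chunk_overlap` arguments, which is probably the more suitable building block for the actual transform.

```python
from typing import Callable, List


def chunk_by_token_window(
    text: str,
    chunk_size: int,
    chunk_overlap: int,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Split text into windows of at most `chunk_size` tokens.

    Consecutive windows share `chunk_overlap` tokens. The default
    tokenizer is a plain whitespace split (an assumption made only
    for illustration; a model tokenizer would count differently).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = tokenize(text)
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []

    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        # Stop once a window reaches the end of the token list, so no
        # trailing chunk is emitted that contains only overlap tokens.
        if start + chunk_size >= len(tokens):
            break

    return chunks


example_chunks = chunk_by_token_window(
    "a b c d e f g h i j", chunk_size=4, chunk_overlap=1
)
```

With a window of 4 tokens and an overlap of 1, each chunk starts 3 tokens after the previous one, so the last token of one chunk is repeated as the first token of the next. Rejecting `chunk_overlap >= chunk_size` keeps the step positive and guarantees the loop terminates.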
Are you willing to submit a PR?