Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add new textspliter option: lenfunc #609

Merged
merged 5 commits into from
Feb 21, 2024
Merged

feat: add new textspliter option: lenfunc #609

merged 5 commits into from
Feb 21, 2024

Conversation

whyiug
Copy link
Contributor

@whyiug whyiug commented Feb 9, 2024

I want to add a new parameter to the textspliter struct,lenfunc, to represent a custom length function. In this way, sentences can be splited according to lenfuc, for example, utf8.RuneCountInString can be passed to split Chinese characters.
Also aligned with the python code https://github.com/langchain-ai/langchain/blob/00a09e1b7117f3bde14a44748510fcccc95f9de5/libs/langchain/langchain/text_splitter.py.
For me, it's a must-have feature.

@whyiug whyiug changed the title add new textspliter option: lenfunc feat: add new textspliter option: lenfunc Feb 9, 2024
@whyiug
Copy link
Contributor Author

whyiug commented Feb 9, 2024

Also satisfy issue #231

@whyiug
Copy link
Contributor Author

whyiug commented Feb 18, 2024

ping @tmc

@tmc
Copy link
Owner

tmc commented Feb 19, 2024

Hmm, it might be best to just default to RuneCount/RuneCountInString to treat content more naturally, thoughts?

I think I prefer this over erroneously using len() on strings.

@whyiug
Copy link
Contributor Author

whyiug commented Feb 19, 2024

Your suggestion is feasible, just like what is done in Python projects.
// In Python, the len function indeed represents the number of characters rather than the number of bytes.
I have already updated the latest code. Please review it.

@tmc
Copy link
Owner

tmc commented Feb 20, 2024

This is better, but I can't help but think we shouldn't even make this configurable and just use utf8.RuneCountInString -- thoughts?

@@ -20,6 +23,7 @@ func DefaultOptions() Options {
ChunkSize: _defaultTokenChunkSize,
ChunkOverlap: _defaultTokenChunkOverlap,
Separators: []string{"\n\n", "\n", " ", ""},
LenFunc: defaultLenFunc,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could just be utf8.RuneCountInString instead of being wrapped with defaultLenFunc, I believe.

@whyiug
Copy link
Contributor Author

whyiug commented Feb 20, 2024

Hmm, I don't think so.
Some scenarios require custom length functions.
Let's say I want to have the same number of words, not characters in each chunk.
I can simply set the LenFunc:

func countWords(s string) int {
    words := strings.Fields(s)
    return len(words)
}

Or the number of tokens divided by other word dividers.

Copy link
Owner

@tmc tmc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tmc tmc enabled auto-merge February 21, 2024 00:00
@tmc tmc merged commit 853fc04 into tmc:main Feb 21, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants