feat: add new textsplitter option: lenfunc #609
This also addresses issue #231.
ping @tmc
Hmm, it might be best to just default to RuneCount/RuneCountInString to treat content more naturally, thoughts? I think I prefer this over erroneously using len() on strings.
Your suggestion is feasible, just like what is done in Python projects.
This is better, but I can't help but think we shouldn't even make this configurable and just use utf8.RuneCountInString -- thoughts?
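To make the len() concern above concrete, here is a minimal sketch comparing the two measures on a short Chinese string. The helper names byteLen and runeLen are made up for illustration and are not part of this PR.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLen is a hypothetical helper: len() on a string counts bytes,
// not characters.
func byteLen(s string) int { return len(s) }

// runeLen is a hypothetical helper: it counts Unicode code points,
// which is closer to what a reader would call "characters".
func runeLen(s string) int { return utf8.RuneCountInString(s) }

func main() {
	s := "你好，世界"
	// Each of these five characters takes 3 bytes in UTF-8.
	fmt.Println(byteLen(s), runeLen(s)) // prints "15 5"
}
```

With a byte-based length, a chunk size tuned for ASCII text would cut CJK text roughly three times too early.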
textsplitter/options.go
Outdated
@@ -20,6 +23,7 @@ func DefaultOptions() Options {
 	ChunkSize: _defaultTokenChunkSize,
 	ChunkOverlap: _defaultTokenChunkOverlap,
 	Separators: []string{"\n\n", "\n", " ", ""},
+	LenFunc: defaultLenFunc,
This could just be utf8.RuneCountInString instead of being wrapped with defaultLenFunc, I believe.
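Hmm, I don't think so. A caller may want a different measure, for example a word count:
func countWords(s string) int {
	// strings.Fields splits on runs of whitespace and drops empty fields.
	words := strings.Fields(s)
	return len(words)
}
Or the number of tokens obtained by splitting on other word delimiters.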
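As a runnable version of the point above, this sketch puts byte count, rune count, and word count behind one function type. The type name lenFunc is an assumption for illustration; only countWords comes from the comment.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// lenFunc is a hypothetical name for the shape of a pluggable
// length function: it maps a string to its "length".
type lenFunc func(string) int

// countWords measures length in whitespace-separated words.
func countWords(s string) int {
	return len(strings.Fields(s))
}

func main() {
	text := "split  text\tby words\n"
	measures := []struct {
		name string
		fn   lenFunc
	}{
		{"bytes", func(s string) int { return len(s) }},
		{"runes", utf8.RuneCountInString},
		{"words", countWords},
	}
	for _, m := range measures {
		fmt.Printf("%s: %d\n", m.name, m.fn(text))
	}
}
```

All three have the signature func(string) int, so a configurable option accepts any of them, while a hard-coded utf8.RuneCountInString would rule out the word-based measure.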
LGTM
I want to add a new parameter, lenfunc, to the textsplitter struct to represent a custom length function. This way, text can be split according to lenfunc; for example, utf8.RuneCountInString can be passed in to split Chinese text by characters rather than bytes. This also aligns with the Python code: https://github.com/langchain-ai/langchain/blob/00a09e1b7117f3bde14a44748510fcccc95f9de5/libs/langchain/langchain/text_splitter.py.
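A rough sketch of how such an option could influence chunking follows. Everything here is a simplified stand-in, not the code in this PR: the Options struct is trimmed to two fields, and WithLenFunc, WithChunkSize, and splitByLen are assumed names. The splitter is a naive greedy loop over runes rather than the recursive separator-based logic in the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Options is a simplified, hypothetical stand-in for the splitter options.
type Options struct {
	ChunkSize int
	LenFunc   func(string) int
}

// Option is a functional option that mutates Options.
type Option func(*Options)

// WithChunkSize is an assumed option name for this sketch.
func WithChunkSize(n int) Option {
	return func(o *Options) { o.ChunkSize = n }
}

// WithLenFunc is an assumed option name for this sketch.
func WithLenFunc(f func(string) int) Option {
	return func(o *Options) { o.LenFunc = f }
}

// splitByLen greedily appends runes to the current chunk and starts a
// new chunk once LenFunc reports that ChunkSize would be exceeded.
func splitByLen(text string, opts ...Option) []string {
	o := Options{ChunkSize: 4, LenFunc: utf8.RuneCountInString}
	for _, opt := range opts {
		opt(&o)
	}
	var chunks []string
	cur := ""
	for _, r := range text {
		next := cur + string(r)
		if cur != "" && o.LenFunc(next) > o.ChunkSize {
			chunks = append(chunks, cur)
			cur = string(r)
			continue
		}
		cur = next
	}
	if cur != "" {
		chunks = append(chunks, cur)
	}
	return chunks
}

func main() {
	text := "你好世界"
	byBytes := func(s string) int { return len(s) }
	// Measured in bytes, only two 3-byte characters fit in a chunk of 6.
	fmt.Println(splitByLen(text, WithChunkSize(6), WithLenFunc(byBytes)))
	// Measured in runes, all four characters fit in a chunk of 6.
	fmt.Println(splitByLen(text, WithChunkSize(6)))
}
```

The same chunk size yields two chunks when measured in bytes and one when measured in runes, which is the behavior difference the option is meant to expose.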
For me, it's a must-have feature.