feat: add new textspliter option: lenfunc #609

whyiug · 2024-02-09T07:59:43Z

I want to add a new parameter, lenfunc, to the text splitter struct to represent a custom length function. This way, text can be split according to lenfunc; for example, utf8.RuneCountInString can be passed so that Chinese text is measured in characters rather than bytes.
This also aligns with the Python implementation: https://github.com/langchain-ai/langchain/blob/00a09e1b7117f3bde14a44748510fcccc95f9de5/libs/langchain/langchain/text_splitter.py.
For me, it's a must-have feature.
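
To illustrate the motivation, here is a minimal sketch comparing byte length with rune count for a short Chinese string. The helper names byteLen and runeLen are hypothetical and exist only for this example; they are not part of the library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLen reports the length of s in bytes, which is what len() does on a Go string.
func byteLen(s string) int {
	return len(s)
}

// runeLen reports the number of Unicode code points in s.
func runeLen(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	s := "你好，世界" // five characters, each encoded as three bytes in UTF-8

	fmt.Println(byteLen(s)) // 15
	fmt.Println(runeLen(s)) // 5
}
```

With a chunk size expressed in characters, a byte-based length would make chunks of Chinese text roughly three times smaller than intended.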

whyiug · 2024-02-09T08:06:08Z

This also addresses issue #231.

whyiug · 2024-02-18T06:51:22Z

ping @tmc

tmc · 2024-02-19T08:21:18Z

Hmm, it might be best to just default to RuneCount/RuneCountInString to treat content more naturally, thoughts?

I think I prefer this over erroneously using len() on strings.

whyiug · 2024-02-19T08:46:51Z

Your suggestion is feasible, and it matches what the Python project does: in Python, len() on a string returns the number of characters rather than the number of bytes.
I have pushed the updated code. Please review it.

tmc · 2024-02-20T03:09:08Z

This is better, but I can't help but think we shouldn't even make this configurable and just use utf8.RuneCountInString -- thoughts?

tmc · 2024-02-20T03:12:34Z

textsplitter/options.go

@@ -20,6 +23,7 @@ func DefaultOptions() Options {
 		ChunkSize:    _defaultTokenChunkSize,
 		ChunkOverlap: _defaultTokenChunkOverlap,
 		Separators:   []string{"\n\n", "\n", " ", ""},
+		LenFunc:      defaultLenFunc,


this could just be utf8.RuneCountInString instead of being wrapped with defaultLenFunc, I believe.
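
A minimal sketch of what this suggestion might look like, assuming a trimmed-down Options struct. The struct here is a stand-in written for illustration, and the chunk size and overlap values are placeholders, not the library's actual defaults.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Options is a simplified, hypothetical stand-in for the splitter options
// discussed in this review; the real struct may have more fields.
type Options struct {
	ChunkSize    int
	ChunkOverlap int
	Separators   []string
	LenFunc      func(string) int
}

// DefaultOptions assigns utf8.RuneCountInString directly. Its signature is
// already func(string) int, so no wrapper function is needed.
func DefaultOptions() Options {
	return Options{
		ChunkSize:    512, // placeholder value for this sketch
		ChunkOverlap: 100, // placeholder value for this sketch
		Separators:   []string{"\n\n", "\n", " ", ""},
		LenFunc:      utf8.RuneCountInString,
	}
}

func main() {
	opts := DefaultOptions()
	fmt.Println(opts.LenFunc("日本語")) // 3 runes (9 bytes)
}
```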

whyiug · 2024-02-20T05:48:02Z

Hmm, I don't think so.
Some scenarios require custom length functions.
Let's say I want each chunk to contain the same number of words rather than the same number of characters.
I can simply set the LenFunc:

func countWords(s string) int {
    words := strings.Fields(s)
    return len(words)
}

Or it could count tokens separated by some other delimiter.
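
As a rough illustration of why a configurable length function is useful, the sketch below packs words into chunks whose size is measured by whatever function is passed in. packWords is a hypothetical greedy helper written only for this example; it is not the library's splitting algorithm.

```go
package main

import (
	"fmt"
	"strings"
)

// countWords measures length in whitespace-separated words.
func countWords(s string) int {
	return len(strings.Fields(s))
}

// packWords greedily appends words to the current chunk for as long as
// lenFunc(chunk) stays at or below chunkSize. A single word that exceeds
// chunkSize on its own is still emitted as its own chunk.
func packWords(text string, chunkSize int, lenFunc func(string) int) []string {
	var chunks []string
	current := ""
	for _, w := range strings.Fields(text) {
		candidate := w
		if current != "" {
			candidate = current + " " + w
		}
		if current != "" && lenFunc(candidate) > chunkSize {
			// Adding this word would overflow the chunk, so start a new one.
			chunks = append(chunks, current)
			current = w
			continue
		}
		current = candidate
	}
	if current != "" {
		chunks = append(chunks, current)
	}
	return chunks
}

func main() {
	text := "the quick brown fox jumps over the lazy dog"
	for _, c := range packWords(text, 4, countWords) {
		fmt.Printf("%q\n", c)
	}
}
```

Passing a different function for lenFunc, such as utf8.RuneCountInString, would switch the same helper to character-based chunk sizes without changing its logic.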

tmc

LGTM

add new textspliter option: lenfunc

8734b60

whyiug changed the title ~~add new textspliter option: lenfunc~~ feat: add new textspliter option: lenfunc Feb 9, 2024

add new textspliter option: lenfunc

a63d3a9

Refactor lenFunc to use utf8.RuneCountInString

ba73dea

tmc reviewed Feb 20, 2024

textsplitter: remove wrapper

3f4ee00

tmc approved these changes Feb 21, 2024

Merge branch 'main' into main

22159ce

tmc enabled auto-merge February 21, 2024 00:00

tmc merged commit 853fc04 into tmc:main Feb 21, 2024
3 checks passed

tmc temporarily deployed to github-pages February 27, 2024 02:36 — with GitHub Actions Inactive
