Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buffer added to avoid splitted chatacter #122

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Ovler-Young
Copy link

inspired by ggerganov/whisper.cpp#399 (comment)

The issue stems from the possibility that the token text may not adhere to the valid utf-8 string format. When using OpenAI's tiktoken tokenizer, a Chinese character in utf-8 encoding could be split into multiple tokens, which leading to the problem. In such a scenario printf("%s", text) outputs a scrambled or unintelligible string.
To resolve the issue I use icu library to check whether the token text is a valid utf-8 string or not. If yes, print out as usual; if not, the token text is pushed back to a temporary char buffer instead. This char buffer will not be printed out until bytes in the buffer form a valid utf-8 string.

In this repo, the problem is very similar. Instead of use the icu library which might only on linux, I found a way to check it by pure c++, so no need to modify the makefile.

Some related issues: #109 #37

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant