examples: Fix the encoding issues on Windows #1313
Conversation
I'll provide a description of how this works tomorrow and update the test results. I was so busy last week because I was traveling.
examples/main/main.cpp (outdated)

```cpp
std::ofstream open(const std::string & path) {
#if WIN32
    std::ofstream file_out(ConvertUTF8toUTF16(path));
#else
    std::ofstream file_out(path);
#endif
    return file_out;
}
```
Can we avoid this change and simply call `std::ofstream fout(ConvertUTF8toUTF16(fname));` regardless of `WIN32`?
AFAIK, Linux might have some issues with UTF-16LE support: either it doesn't play well with it, or if it does, it's kinda glitchy. So I was thinking we could have two versions of this `ConvertUTF8toUTF16` function: one that genuinely converts, and a basic one that doesn't really do anything. With `#if`/`#endif` guards we can use the do-nothing version on non-Windows platforms like Linux. But that's kinda misleading, right? I mean, the function's name literally says it's converting. We've got to rethink this. I guess we'll have to throw in some platform-specific code using `#if` and `#endif`. It's a bit of a pain, but if we want this to work across different platforms, we might not have much choice. A minimal sketch of that two-variant idea follows below.
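Here's a minimal sketch of the two-variant idea, reusing the `ConvertUTF8toUTF16` name from the diff above; this is illustrative, not the PR's actual implementation:

```cpp
#include <string>

#if defined(_WIN32)
#include <windows.h>

// Real conversion on Windows: UTF-8 bytes -> UTF-16LE wide string.
static std::wstring ConvertUTF8toUTF16(const std::string & utf8) {
    const int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), (int) utf8.size(), NULL, 0);
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), (int) utf8.size(), &utf16[0], len);
    return utf16;
}
#else
// Do-nothing variant for non-Windows platforms -- the misleading part
// discussed above: the name says "convert", but nothing is converted.
static std::string ConvertUTF8toUTF16(const std::string & utf8) {
    return utf8;
}
#endif
```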
@bobqianic Thanks for the hard work. When I tested with ml=1, some of the Chinese text could not be output.
This issue has now been fixed. I forgot to flush the buffer at the end of the segment.
This is due to the limitations of Whisper models; there's hardly anything we can do about it.
I need more time to look into this problem.
@bobqianic Still not normal output for 破产 ("bankruptcy") and 小组 ("group").
Strange. Could you send over the audio?
`./main -m models/ggml-medium.bin -l auto -f`
```cpp
@@ -272,6 +273,51 @@ void whisper_print_progress_callback(struct whisper_context * /*ctx*/, struct wh
    }
}

whisper_merged_tokens whisper_merge_tokens(struct whisper_context * ctx, const whisper_params & params, int s0, int n_segments) {
```
Hm, AFAIK such "post-processing" should not be necessary. If it makes a difference now, it most likely means that we have a bug in the tokenizer, which is actually very likely.
Will need to review this later in more detail, after #1422.
I just discovered some interesting code in OpenAI Whisper.
https://github.com/openai/whisper/blob/0efe52add20d980c850b0bac972adefd54d4eb8e/whisper/tokenizer.py#L277-L327
tokenizer.py
```python
def split_to_word_tokens(self, tokens: List[int]):
    if self.language in {"zh", "ja", "th", "lo", "my", "yue"}:
        # These languages don't typically use spaces, so it is difficult to split words
        # without morpheme analysis. Here, we instead split words at any
        # position where the tokens are decoded as valid unicode points
        return self.split_tokens_on_unicode(tokens)

    return self.split_tokens_on_spaces(tokens)

def split_tokens_on_unicode(self, tokens: List[int]):
    decoded_full = self.decode_with_timestamps(tokens)
    replacement_char = "\ufffd"

    words = []
    word_tokens = []
    current_tokens = []
    unicode_offset = 0

    for token in tokens:
        current_tokens.append(token)
        decoded = self.decode_with_timestamps(current_tokens)

        if (
            replacement_char not in decoded
            or decoded_full[unicode_offset + decoded.index(replacement_char)]
            == replacement_char
        ):
            words.append(decoded)
            word_tokens.append(current_tokens)
            current_tokens = []
            unicode_offset += len(decoded)

    return words, word_tokens

def split_tokens_on_spaces(self, tokens: List[int]):
    subwords, subword_tokens_list = self.split_tokens_on_unicode(tokens)
    words = []
    word_tokens = []

    for subword, subword_tokens in zip(subwords, subword_tokens_list):
        special = subword_tokens[0] >= self.eot
        with_space = subword.startswith(" ")
        punctuation = subword.strip() in string.punctuation
        if special or with_space or punctuation or len(words) == 0:
            words.append(subword)
            word_tokens.append(subword_tokens)
        else:
            words[-1] = words[-1] + subword
            word_tokens[-1].extend(subword_tokens)

    return words, word_tokens
```
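For comparison, a rough C++ analogue of `split_tokens_on_unicode` could look like the sketch below. This is hypothetical code, not part of whisper.cpp: it assumes each token's text has already been fetched (e.g. via `whisper_token_to_str`), and instead of OpenAI's replacement-character check it validates the accumulated bytes directly, flushing a "word" only at a complete UTF-8 boundary.

```cpp
#include <string>
#include <vector>

// Structural UTF-8 check; does not reject overlong encodings or
// surrogates -- good enough for a sketch.
static bool is_valid_utf8(const std::string & s) {
    size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = (unsigned char) s[i];
        const int n = (c < 0x80)         ? 0  // 1-byte (ASCII)
                    : ((c >> 5) == 0x6)  ? 1  // 2-byte lead 110xxxxx
                    : ((c >> 4) == 0xE)  ? 2  // 3-byte lead 1110xxxx
                    : ((c >> 3) == 0x1E) ? 3  // 4-byte lead 11110xxx
                    : -1;                     // stray continuation or invalid lead
        if (n < 0 || i + n >= s.size()) {
            return false; // invalid lead byte or truncated sequence
        }
        for (int k = 1; k <= n; ++k) {
            if (((unsigned char) s[i + k] & 0xC0) != 0x80) {
                return false; // expected a 10xxxxxx continuation byte
            }
        }
        i += n + 1;
    }
    return true;
}

// token_texts: the raw text of each token, e.g. from whisper_token_to_str().
static std::vector<std::string> merge_tokens_on_unicode(const std::vector<std::string> & token_texts) {
    std::vector<std::string> words;
    std::string buf;
    for (const auto & t : token_texts) {
        buf += t;
        if (is_valid_utf8(buf)) { // flush only at a complete UTF-8 boundary
            words.push_back(buf);
            buf.clear();
        }
    }
    if (!buf.empty()) {
        words.push_back(buf); // trailing bytes that never became valid UTF-8
    }
    return words;
}
```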
Regarding the tokenization, I just started today to get familiar with whisper.cpp and Japanese audio files, especially with per-token timestamps. While I don't have a fix, I can provide a test case. Audio file: test_1.wav.zip. By debug-printing tokens in the code, I get output like:

[00:00:00.000 --> 00:00:00.260] 5

So UTF-8 sequences are cut in the middle of characters. This can be solved by the post-processing step mentioned above. This test was done with the latest large model and code.
Use `_WIN32` instead of `WIN32` when possible.
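For context, a minimal illustration of the difference (not from the PR): `_WIN32` is predefined by the compiler itself on both 32- and 64-bit Windows targets, while `WIN32` is only defined by some headers and build systems, so `#if WIN32` can silently evaluate to 0.

```cpp
#if defined(_WIN32)
    // Windows-specific path
#else
    // everything else
#endif
```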
I recently did some research and found out why whisper.cpp behaves oddly. It skips the final step, where OpenAI's Whisper decodes the tokens and strips the result:

```python
texts: List[str] = [tokenizer.decode(t).strip() for t in tokens]
```
```python
def decode(self, token_ids: List[int], **kwargs) -> str:
    token_ids = [t for t in token_ids if t < self.timestamp_begin]
    return self.encoding.decode(token_ids, **kwargs)
```
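A possible C++ analogue of that final `.strip()`, as a hypothetical helper (name and placement are illustrative, not whisper.cpp's API):

```cpp
#include <string>

// Trim leading/trailing whitespace after decoding, mirroring Python's str.strip().
static std::string strip(const std::string & s) {
    const char * ws = " \t\n\r";
    const size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) {
        return ""; // all whitespace
    }
    const size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}
```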
```python
@lru_cache(maxsize=None)
def get_encoding(name: str = "gpt2", num_languages: int = 99):
    vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
    ranks = {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in open(vocab_path) if line)
    }
    n_vocab = len(ranks)
    special_tokens = {}

    specials = [
        "<|endoftext|>",
        "<|startoftranscript|>",
        *[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:num_languages]],
        "<|translate|>",
        "<|transcribe|>",
        "<|startoflm|>",
        "<|startofprev|>",
        "<|nospeech|>",
        "<|notimestamps|>",
        *[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
    ]

    for token in specials:
        special_tokens[token] = n_vocab
        n_vocab += 1

    return tiktoken.Encoding(
        name=os.path.basename(vocab_path),
        explicit_n_vocab=n_vocab,
        pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        mergeable_ranks=ranks,
        special_tokens=special_tokens,
    )
```
```python
class Encoding:
    def __init__(
        self,
        name: str,
        *,
        pat_str: str,
        mergeable_ranks: dict[bytes, int],
        special_tokens: dict[str, int],
        explicit_n_vocab: Optional[int] = None,
    ):
        """Creates an Encoding object.

        See openai_public.py for examples of how to construct an Encoding object.

        Args:
            name: The name of the encoding. It should be clear from the name of the encoding
                what behaviour to expect, in particular, encodings with different special tokens
                should have different names.
            pat_str: A regex pattern string that is used to split the input text.
            mergeable_ranks: A dictionary mapping mergeable token bytes to their ranks. The ranks
                must correspond to merge priority.
            special_tokens: A dictionary mapping special token strings to their token values.
            explicit_n_vocab: The number of tokens in the vocabulary. If provided, it is checked
                that the number of mergeable tokens and special tokens is equal to this number.
        """
        self.name = name

        self._pat_str = pat_str
        self._mergeable_ranks = mergeable_ranks
        self._special_tokens = special_tokens

        self.max_token_value = max(
            max(mergeable_ranks.values()), max(special_tokens.values(), default=0)
        )
        if explicit_n_vocab:
            assert len(mergeable_ranks) + len(special_tokens) == explicit_n_vocab
            assert self.max_token_value == explicit_n_vocab - 1

        self._core_bpe = _tiktoken.CoreBPE(mergeable_ranks, special_tokens, pat_str)
```
Probably we can utilize the BPE tokenizer implementation from …
@bobqianic Thank you for the research! I had a look at the tiktoken documentation and it answered my original question. The tokenizer works byte by byte, so it can lead to a representation where tokens are incomplete UTF-8 sequences. See here, section 4: "Warning: although .decode() can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries." So as long as text is encoded and decoded with the same tokenizer, it should work correctly.

In my current understanding: if we decode the tokens as a byte sequence (for each token, join the byte representation of each) and then interpret it as a UTF-8 string, it works OK. But since the model is trained on abstract byte sequences, the …

Please correct me if I'm wrong; that's just my analysis after re-thinking about the problem with the info you added!
Yes, this is correct.
I'm not entirely certain, but it seems there's an ongoing issue with whisper.cpp. When you look at the decoding outputs at the token level, you'll notice that whisper.cpp often requires two tokens to represent the same text, whereas OpenAI's Whisper only needs one token for the same text (for example, …).
I'm going to close this PR and move everything to #1768.
The problem with Windows is that it carries a heavy historical burden. Many narrow-string (`char`) APIs do not support Unicode; only the wide-character (`wchar_t`) APIs do. Using narrow-string (`char`) APIs might result in garbled text for some languages. This PR addresses that issue.

In this PR, we've enabled the Windows terminal to accept `wchar_t` arrays encoded in UTF-16LE. Before processing, we convert them to `char` arrays encoded in UTF-8. By using `SetConsoleOutputCP`, we've set the Windows terminal's output encoding to UTF-8, ensuring that multiple languages can be displayed correctly. Additionally, based on the documentation provided by Microsoft, we've enabled Virtual Terminal Processing in the Windows terminal, allowing text colors to be displayed correctly. A sketch of this console setup is shown below. If you have a better solution, please feel free to make suggestions.

The current issue: `print color` causes garbled text in non-alphabetic languages. #399 #554 #1151
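For reference, a sketch of that console setup based on the Microsoft documentation for `SetConsoleOutputCP` and console modes (not necessarily the PR's exact code):

```cpp
#include <windows.h>

static bool init_windows_console() {
    // Make the console interpret our char output as UTF-8.
    if (!SetConsoleOutputCP(CP_UTF8)) {
        return false;
    }
    // Enable Virtual Terminal Processing so ANSI color escapes render.
    HANDLE h_out = GetStdHandle(STD_OUTPUT_HANDLE);
    if (h_out == INVALID_HANDLE_VALUE) {
        return false;
    }
    DWORD mode = 0;
    if (!GetConsoleMode(h_out, &mode)) {
        return false;
    }
    return SetConsoleMode(h_out, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING) != 0;
}
```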
Test Results (screenshots omitted): English, Chinese, Japanese, Russian, French, Vietnamese.