Skip to content

Commit

Permalink
vocab : ignore invalid UTF-8 input in the BPE tokenizer (ggerganov#11729
Browse files Browse the repository at this point in the history
)

Silently insert U+FFFD(s) (Unicode replacement character) instead until the
next valid codepoint can be found.

This fixes `llama_tokenize` throwing an exception across the C API boundary
or libllama's module boundary (the caller's runtime might be incompatible!)

Returing a proper error code might be desirable, however the signature
of `llama_tokenize` doesn't allow it as all return values already have
existing meaning.
  • Loading branch information
cfillion authored Feb 7, 2025
1 parent 333820d commit 2d219b3
Showing 1 changed file with 8 additions and 1 deletion.
9 changes: 8 additions & 1 deletion src/unicode.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -618,7 +618,14 @@ std::vector<uint32_t> unicode_cpts_from_utf8(const std::string & utf8) {
result.reserve(utf8.size());
size_t offset = 0;
while (offset < utf8.size()) {
result.push_back(unicode_cpt_from_utf8(utf8, offset));
try {
result.push_back(unicode_cpt_from_utf8(utf8, offset));
}
catch (const std::invalid_argument & /*ex*/) {
// Silently ignore invalid UTF-8 input to avoid leaking the exception beyond llama_tokenize
++offset;
result.emplace_back(0xFFFD); // replacement character
}
}
return result;
}
Expand Down

0 comments on commit 2d219b3

Please sign in to comment.