Switch to BPE tokenization impl from openai/tiktoken #1118
base: master
Conversation
What is the advantage of this?
Processing should be more accurate when input context is provided. With the current implementation, the tokenization of the prompt (if provided) is not 100% correct.
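As a concrete illustration of why this matters, here is a minimal sketch with a toy vocabulary (hypothetical ranks, not whisper.cpp code): a greedy longest-match tokenizer and a true BPE merge can produce different tokens for the same input, because BPE always merges the adjacent pair with the lowest rank first, regardless of token length.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

using Ranks = std::map<std::string, int>;

// greedy longest-match: at each position take the longest substring present in the vocab
static std::vector<std::string> greedy_tokenize(const std::string & text, const Ranks & ranks) {
    std::vector<std::string> out;
    size_t i = 0;
    while (i < text.size()) {
        size_t best_len = 1;
        for (size_t len = text.size() - i; len >= 1; --len) {
            if (ranks.count(text.substr(i, len))) { best_len = len; break; }
        }
        out.push_back(text.substr(i, best_len));
        i += best_len;
    }
    return out;
}

// byte-pair encoding: start from single bytes and repeatedly merge the adjacent
// pair whose concatenation has the lowest rank, until no mergeable pair is left
static std::vector<std::string> bpe_tokenize(const std::string & text, const Ranks & ranks) {
    std::vector<std::string> parts;
    for (char c : text) parts.push_back(std::string(1, c));
    while (parts.size() > 1) {
        int best_rank = -1;
        size_t best_i = 0;
        for (size_t i = 0; i + 1 < parts.size(); ++i) {
            auto it = ranks.find(parts[i] + parts[i + 1]);
            if (it != ranks.end() && (best_rank < 0 || it->second < best_rank)) {
                best_rank = it->second;
                best_i = i;
            }
        }
        if (best_rank < 0) break;
        parts[best_i] += parts[best_i + 1];
        parts.erase(parts.begin() + best_i + 1);
    }
    return parts;
}

int main() {
    // toy ranks: "bc" was merged earlier in training (lower rank) than "ab"
    Ranks ranks = {{"a", 0}, {"b", 1}, {"c", 2}, {"bc", 3}, {"ab", 4}};
    for (const auto & t : greedy_tokenize("abc", ranks)) printf("[%s]", t.c_str()); // [ab][c]
    printf("\n");
    for (const auto & t : bpe_tokenize("abc", ranks))    printf("[%s]", t.c_str()); // [a][bc]
    printf("\n");
}

If the prompt is tokenized with a greedy scheme but the model was trained on BPE output, the resulting token ids can differ even when every piece is in the vocabulary.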
Is there any progress on this?
I've just run a test on this implementation, and there's a significant issue with its efficiency: it's running very slowly, barely a few KiB/s...
The algorithm is also giving incorrect results. I used an LLM training corpus as a test set.
Test sample:
OpenAI:
PR:
I can share my test script with you:
from tiktoken import _educational
from tiktoken import core
import base64
LANGUAGES = {
"en": "english",
"zh": "chinese",
"de": "german",
"es": "spanish",
"ru": "russian",
"ko": "korean",
"fr": "french",
"ja": "japanese",
"pt": "portuguese",
"tr": "turkish",
"pl": "polish",
"ca": "catalan",
"nl": "dutch",
"ar": "arabic",
"sv": "swedish",
"it": "italian",
"id": "indonesian",
"hi": "hindi",
"fi": "finnish",
"vi": "vietnamese",
"he": "hebrew",
"uk": "ukrainian",
"el": "greek",
"ms": "malay",
"cs": "czech",
"ro": "romanian",
"da": "danish",
"hu": "hungarian",
"ta": "tamil",
"no": "norwegian",
"th": "thai",
"ur": "urdu",
"hr": "croatian",
"bg": "bulgarian",
"lt": "lithuanian",
"la": "latin",
"mi": "maori",
"ml": "malayalam",
"cy": "welsh",
"sk": "slovak",
"te": "telugu",
"fa": "persian",
"lv": "latvian",
"bn": "bengali",
"sr": "serbian",
"az": "azerbaijani",
"sl": "slovenian",
"kn": "kannada",
"et": "estonian",
"mk": "macedonian",
"br": "breton",
"eu": "basque",
"is": "icelandic",
"hy": "armenian",
"ne": "nepali",
"mn": "mongolian",
"bs": "bosnian",
"kk": "kazakh",
"sq": "albanian",
"sw": "swahili",
"gl": "galician",
"mr": "marathi",
"pa": "punjabi",
"si": "sinhala",
"km": "khmer",
"sn": "shona",
"yo": "yoruba",
"so": "somali",
"af": "afrikaans",
"oc": "occitan",
"ka": "georgian",
"be": "belarusian",
"tg": "tajik",
"sd": "sindhi",
"gu": "gujarati",
"am": "amharic",
"yi": "yiddish",
"lo": "lao",
"uz": "uzbek",
"fo": "faroese",
"ht": "haitian creole",
"ps": "pashto",
"tk": "turkmen",
"nn": "nynorsk",
"mt": "maltese",
"sa": "sanskrit",
"lb": "luxembourgish",
"my": "myanmar",
"bo": "tibetan",
"tl": "tagalog",
"mg": "malagasy",
"as": "assamese",
"tt": "tatar",
"haw": "hawaiian",
"ln": "lingala",
"ha": "hausa",
"ba": "bashkir",
"jw": "javanese",
"su": "sundanese",
"yue": "cantonese",
}
if __name__ == '__main__':
    # sample text: the leading space matters for the BPE pre-tokenization regex
    text = " " + '"os chefes de defesa da estónia, letónia, lituânia, alemanha, itália, espanha e eslováquia assinarão o acordo para fornecer pessoal e financiamento para o centro."'
    # GPT-2 style pre-tokenization pattern used by the Whisper multilingual tokenizer
    pat = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
    special_tokens = {}
    specials = [
        "<|endoftext|>",
        "<|startoftranscript|>",
        *[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:99]],
        "<|translate|>",
        "<|transcribe|>",
        "<|startoflm|>",
        "<|startofprev|>",
        "<|nospeech|>",
        "<|notimestamps|>",
        *[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
    ]
    # load the BPE merge ranks: each line is "<base64 token bytes> <rank>"
    ranks = dict()
    with open("multilingual.tiktoken", "rb") as file:
        content = file.read()
    content = content.split(b"\n")[:-1]
    for x in content:
        token_bytes, token = x.split(b" ")
        ranks[base64.b64decode(token_bytes)] = int(token)
    # special tokens get ids after the regular vocabulary
    n_vocab = len(ranks)
    for token in specials:
        special_tokens[token] = n_vocab
        n_vocab += 1
    # reference implementation from tiktoken's educational module
    tokenizer_edu = _educational.SimpleBytePairEncoding(pat_str=pat, mergeable_ranks=ranks)
    print(tokenizer_edu.encode(text, "colour"))
    # production tiktoken encoder; the first argument is the encoding name, not the text
    tokenizer_core = core.Encoding("multilingual", pat_str=pat, mergeable_ranks=ranks, special_tokens=special_tokens)
    print(tokenizer_core.encode(text))
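For reference, the same rank file can be read on the C++ side. This is a hedged sketch, not the PR's code: load_tiktoken_ranks and base64_decode are hypothetical names, and the format assumed is the one the Python script above relies on (one "<base64 token> <rank>" pair per line).

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

// minimal standard base64 decoder (ignores unknown characters, stops at padding)
static std::string base64_decode(const std::string & in) {
    static const std::string tbl =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    int val = 0, bits = 0;
    for (char c : in) {
        if (c == '=') break;
        size_t pos = tbl.find(c);
        if (pos == std::string::npos) continue;
        val = (val << 6) | (int) pos;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back((char) ((val >> bits) & 0xFF));
        }
    }
    return out;
}

// read a *.tiktoken rank file: base64-encoded token bytes, a space, then the rank
static std::map<std::string, int32_t> load_tiktoken_ranks(const std::string & path) {
    std::map<std::string, int32_t> ranks;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        std::istringstream ls(line);
        std::string token_b64;
        int32_t rank;
        if (ls >> token_b64 >> rank) {
            ranks[base64_decode(token_b64)] = rank;
        }
    }
    return ranks;
}

int main() {
    const auto ranks = load_tiktoken_ranks("multilingual.tiktoken");
    std::printf("loaded %zu ranks\n", ranks.size());
}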
Alright, the issue stemmed from ... Edit: I will create a PR and fix this issue.
    // convert std::map<std::string, int> token_to_id to ranks
    std::map<std::string, size_t> ranks;
    for (const auto& it : vocab.token_to_id) {
        ranks[it.first] = it.second;
    }
Why?
Suggested change: delete the conversion above and use vocab.token_to_id directly.
        }
        return -1;
    };
    return _byte_pair_merge_string(piece, ranks, f);
Suggested change:
- return _byte_pair_merge_string(piece, ranks, f);
+ return _byte_pair_merge_string(piece, vocab.token_to_id, f);
//
std::vector<int> _byte_pair_merge_string(
    const std::string& piece,
    const std::map<std::string, size_t>& ranks,
Suggested change:
- const std::map<std::string, size_t>& ranks,
+ const std::map<std::string, int32_t>& ranks,
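To illustrate the suggestions above: if the merge routine takes the vocabulary's own map type (string to int32_t), the caller can pass vocab.token_to_id directly and the intermediate std::map<std::string, size_t> copy becomes unnecessary. The following is a minimal sketch of a boundary-based merge in the style of openai/tiktoken, not the PR's actual _byte_pair_merge_string; the ranks in main are a toy vocabulary, and the -1 fallback stands in for a proper byte-level fallback.

#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

static std::vector<int32_t> byte_pair_merge_ids(
        const std::string & piece,
        const std::map<std::string, int32_t> & ranks) {
    // boundaries[i]..boundaries[i+1] delimits the i-th current part of `piece`
    std::vector<size_t> boundaries;
    for (size_t i = 0; i <= piece.size(); ++i) {
        boundaries.push_back(i);
    }
    while (boundaries.size() > 2) {
        int32_t best_rank = -1;
        size_t  best_i    = 0;
        for (size_t i = 0; i + 2 < boundaries.size(); ++i) {
            // candidate = concatenation of parts i and i+1
            const std::string cand = piece.substr(boundaries[i], boundaries[i + 2] - boundaries[i]);
            auto it = ranks.find(cand);
            if (it != ranks.end() && (best_rank < 0 || it->second < best_rank)) {
                best_rank = it->second;
                best_i    = i;
            }
        }
        if (best_rank < 0) {
            break; // nothing left to merge
        }
        // merge parts best_i and best_i+1 by dropping the boundary between them
        boundaries.erase(boundaries.begin() + best_i + 1);
    }
    // map the final pieces to token ids
    std::vector<int32_t> ids;
    for (size_t i = 0; i + 1 < boundaries.size(); ++i) {
        const std::string part = piece.substr(boundaries[i], boundaries[i + 1] - boundaries[i]);
        auto it = ranks.find(part);
        ids.push_back(it != ranks.end() ? it->second : -1); // -1: not in vocab (toy fallback)
    }
    return ids;
}

int main() {
    // toy ranks; a real run would use the ranks loaded from multilingual.tiktoken
    std::map<std::string, int32_t> ranks = {
        {"h", 0}, {"e", 1}, {"l", 2}, {"o", 3}, {"he", 4}, {"ll", 5}, {"hell", 6}, {"hello", 7},
    };
    for (int32_t id : byte_pair_merge_ids("hello", ranks)) {
        printf("%d ", id);
    }
    printf("\n"); // expected: 7 (the single token "hello")
}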
@bobqianic Thank you for looking into this! I was using whisper to transcribe German audio, encountered some issues with the tokenization, and just ported some code from openai/tiktoken. With that said, I think a proper BPE implementation that works correctly for at least 95% of use cases and languages would be very nice to have. Unfortunately I don't have enough capacity to deep dive right now, so feel free to go ahead and open a new PR for this 👍🏾
Fixes #1098
(see also #25: Token decoding issue - some characters are missing)