Switch to BPE tokenization impl from openai/tiktoken #1118
base: master
Conversation
What is the advantage of this?
Processing should be more accurate when input context is provided. With the current implementation, the tokenization of the prompt (if provided) is not 100% correct.
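As a concrete illustration of why this matters, here is a minimal sketch with a toy vocabulary (hypothetical ranks, not whisper.cpp code): a greedy longest-match tokenizer and a true BPE merge can produce different tokens for the same input, because BPE always merges the adjacent pair with the lowest rank first, regardless of token length.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

using Ranks = std::map<std::string, int>;

// greedy longest-match: at each position take the longest substring present in the vocab
static std::vector<std::string> greedy_tokenize(const std::string & text, const Ranks & ranks) {
    std::vector<std::string> out;
    size_t i = 0;
    while (i < text.size()) {
        size_t best_len = 1;
        for (size_t len = text.size() - i; len >= 1; --len) {
            if (ranks.count(text.substr(i, len))) { best_len = len; break; }
        }
        out.push_back(text.substr(i, best_len));
        i += best_len;
    }
    return out;
}

// byte-pair encoding: start from single bytes and repeatedly merge the adjacent
// pair whose concatenation has the lowest rank, until no mergeable pair is left
static std::vector<std::string> bpe_tokenize(const std::string & text, const Ranks & ranks) {
    std::vector<std::string> parts;
    for (char c : text) parts.push_back(std::string(1, c));
    while (parts.size() > 1) {
        int best_rank = -1;
        size_t best_i = 0;
        for (size_t i = 0; i + 1 < parts.size(); ++i) {
            auto it = ranks.find(parts[i] + parts[i + 1]);
            if (it != ranks.end() && (best_rank < 0 || it->second < best_rank)) {
                best_rank = it->second;
                best_i = i;
            }
        }
        if (best_rank < 0) break;
        parts[best_i] += parts[best_i + 1];
        parts.erase(parts.begin() + best_i + 1);
    }
    return parts;
}

int main() {
    // toy ranks: "bc" was merged earlier in training (lower rank) than "ab"
    Ranks ranks = {{"a", 0}, {"b", 1}, {"c", 2}, {"bc", 3}, {"ab", 4}};
    for (const auto & t : greedy_tokenize("abc", ranks)) printf("[%s]", t.c_str()); // [ab][c]
    printf("\n");
    for (const auto & t : bpe_tokenize("abc", ranks))    printf("[%s]", t.c_str()); // [a][bc]
    printf("\n");
}

If the prompt is tokenized with a greedy scheme but the model was trained on BPE output, the resulting token ids can differ even when every piece is in the vocabulary.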
Is there any progress on this?
I've just run a test on this implementation, and there's a significant issue with its efficiency: it's running very slowly, barely a few KiB/s...
The algorithm is also giving incorrect results. I used an LLM training corpus as a test set.
Test sample:
OpenAI:
PR:
I can share my test script with you:
from tiktoken import _educational
from tiktoken import core
import base64
LANGUAGES = {
"en": "english",
"zh": "chinese",
"de": "german",
"es": "spanish",
"ru": "russian",
"ko": "korean",
"fr": "french",
"ja": "japanese",
"pt": "portuguese",
"tr": "turkish",
"pl": "polish",
"ca": "catalan",
"nl": "dutch",
"ar": "arabic",
"sv": "swedish",
"it": "italian",
"id": "indonesian",
"hi": "hindi",
"fi": "finnish",
"vi": "vietnamese",
"he": "hebrew",
"uk": "ukrainian",
"el": "greek",
"ms": "malay",
"cs": "czech",
"ro": "romanian",
"da": "danish",
"hu": "hungarian",
"ta": "tamil",
"no": "norwegian",
"th": "thai",
"ur": "urdu",
"hr": "croatian",
"bg": "bulgarian",
"lt": "lithuanian",
"la": "latin",
"mi": "maori",
"ml": "malayalam",
"cy": "welsh",
"sk": "slovak",
"te": "telugu",
"fa": "persian",
"lv": "latvian",
"bn": "bengali",
"sr": "serbian",
"az": "azerbaijani",
"sl": "slovenian",
"kn": "kannada",
"et": "estonian",
"mk": "macedonian",
"br": "breton",
"eu": "basque",
"is": "icelandic",
"hy": "armenian",
"ne": "nepali",
"mn": "mongolian",
"bs": "bosnian",
"kk": "kazakh",
"sq": "albanian",
"sw": "swahili",
"gl": "galician",
"mr": "marathi",
"pa": "punjabi",
"si": "sinhala",
"km": "khmer",
"sn": "shona",
"yo": "yoruba",
"so": "somali",
"af": "afrikaans",
"oc": "occitan",
"ka": "georgian",
"be": "belarusian",
"tg": "tajik",
"sd": "sindhi",
"gu": "gujarati",
"am": "amharic",
"yi": "yiddish",
"lo": "lao",
"uz": "uzbek",
"fo": "faroese",
"ht": "haitian creole",
"ps": "pashto",
"tk": "turkmen",
"nn": "nynorsk",
"mt": "maltese",
"sa": "sanskrit",
"lb": "luxembourgish",
"my": "myanmar",
"bo": "tibetan",
"tl": "tagalog",
"mg": "malagasy",
"as": "assamese",
"tt": "tatar",
"haw": "hawaiian",
"ln": "lingala",
"ha": "hausa",
"ba": "bashkir",
"jw": "javanese",
"su": "sundanese",
"yue": "cantonese",
}
if __name__ == '__main__':
    # sample text: the leading space matters for the BPE pre-tokenization regex
    text = " " + '"os chefes de defesa da estónia, letónia, lituânia, alemanha, itália, espanha e eslováquia assinarão o acordo para fornecer pessoal e financiamento para o centro."'
    # GPT-2 style pre-tokenization pattern used by the Whisper multilingual tokenizer
    pat = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
    special_tokens = {}
    specials = [
        "<|endoftext|>",
        "<|startoftranscript|>",
        *[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:99]],
        "<|translate|>",
        "<|transcribe|>",
        "<|startoflm|>",
        "<|startofprev|>",
        "<|nospeech|>",
        "<|notimestamps|>",
        *[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
    ]
    # load the BPE merge ranks: each line is "<base64 token bytes> <rank>"
    ranks = dict()
    with open("multilingual.tiktoken", "rb") as file:
        content = file.read()
    content = content.split(b"\n")[:-1]
    for x in content:
        token_bytes, token = x.split(b" ")
        ranks[base64.b64decode(token_bytes)] = int(token)
    # special tokens get ids after the regular vocabulary
    n_vocab = len(ranks)
    for token in specials:
        special_tokens[token] = n_vocab
        n_vocab += 1
    # reference implementation from tiktoken's educational module
    tokenizer_edu = _educational.SimpleBytePairEncoding(pat_str=pat, mergeable_ranks=ranks)
    print(tokenizer_edu.encode(text, "colour"))
    # production tiktoken encoder; the first argument is the encoding name, not the text
    tokenizer_core = core.Encoding("multilingual", pat_str=pat, mergeable_ranks=ranks, special_tokens=special_tokens)
    print(tokenizer_core.encode(text))
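For reference, the same rank file can be read on the C++ side. This is a hedged sketch, not the PR's code: load_tiktoken_ranks and base64_decode are hypothetical names, and the format assumed is the one the Python script above relies on (one "<base64 token> <rank>" pair per line).

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

// minimal standard base64 decoder (ignores unknown characters, stops at padding)
static std::string base64_decode(const std::string & in) {
    static const std::string tbl =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    int val = 0, bits = 0;
    for (char c : in) {
        if (c == '=') break;
        size_t pos = tbl.find(c);
        if (pos == std::string::npos) continue;
        val = (val << 6) | (int) pos;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back((char) ((val >> bits) & 0xFF));
        }
    }
    return out;
}

// read a *.tiktoken rank file: base64-encoded token bytes, a space, then the rank
static std::map<std::string, int32_t> load_tiktoken_ranks(const std::string & path) {
    std::map<std::string, int32_t> ranks;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        std::istringstream ls(line);
        std::string token_b64;
        int32_t rank;
        if (ls >> token_b64 >> rank) {
            ranks[base64_decode(token_b64)] = rank;
        }
    }
    return ranks;
}

int main() {
    const auto ranks = load_tiktoken_ranks("multilingual.tiktoken");
    std::printf("loaded %zu ranks\n", ranks.size());
}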
Alright, the issue stemmed from ... Edit: I will create a PR and fix this issue.
    // convert std::map<std::string, int> token_to_id to ranks
    std::map<std::string, size_t> ranks;
    for (const auto& it : vocab.token_to_id) {
        ranks[it.first] = it.second;
    }
Why?
Suggested change: delete the conversion above and use vocab.token_to_id directly.
        }
        return -1;
    };
    return _byte_pair_merge_string(piece, ranks, f);
Suggested change:
- return _byte_pair_merge_string(piece, ranks, f);
+ return _byte_pair_merge_string(piece, vocab.token_to_id, f);
//
std::vector<int> _byte_pair_merge_string(
    const std::string& piece,
    const std::map<std::string, size_t>& ranks,
Suggested change:
- const std::map<std::string, size_t>& ranks,
+ const std::map<std::string, int32_t>& ranks,
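To illustrate the suggestions above: if the merge routine takes the vocabulary's own map type (string to int32_t), the caller can pass vocab.token_to_id directly and the intermediate std::map<std::string, size_t> copy becomes unnecessary. The following is a minimal sketch of a boundary-based merge in the style of openai/tiktoken, not the PR's actual _byte_pair_merge_string; the ranks in main are a toy vocabulary, and the -1 fallback stands in for a proper byte-level fallback.

#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

static std::vector<int32_t> byte_pair_merge_ids(
        const std::string & piece,
        const std::map<std::string, int32_t> & ranks) {
    // boundaries[i]..boundaries[i+1] delimits the i-th current part of `piece`
    std::vector<size_t> boundaries;
    for (size_t i = 0; i <= piece.size(); ++i) {
        boundaries.push_back(i);
    }
    while (boundaries.size() > 2) {
        int32_t best_rank = -1;
        size_t  best_i    = 0;
        for (size_t i = 0; i + 2 < boundaries.size(); ++i) {
            // candidate = concatenation of parts i and i+1
            const std::string cand = piece.substr(boundaries[i], boundaries[i + 2] - boundaries[i]);
            auto it = ranks.find(cand);
            if (it != ranks.end() && (best_rank < 0 || it->second < best_rank)) {
                best_rank = it->second;
                best_i    = i;
            }
        }
        if (best_rank < 0) {
            break; // nothing left to merge
        }
        // merge parts best_i and best_i+1 by dropping the boundary between them
        boundaries.erase(boundaries.begin() + best_i + 1);
    }
    // map the final pieces to token ids
    std::vector<int32_t> ids;
    for (size_t i = 0; i + 1 < boundaries.size(); ++i) {
        const std::string part = piece.substr(boundaries[i], boundaries[i + 1] - boundaries[i]);
        auto it = ranks.find(part);
        ids.push_back(it != ranks.end() ? it->second : -1); // -1: not in vocab (toy fallback)
    }
    return ids;
}

int main() {
    // toy ranks; a real run would use the ranks loaded from multilingual.tiktoken
    std::map<std::string, int32_t> ranks = {
        {"h", 0}, {"e", 1}, {"l", 2}, {"o", 3}, {"he", 4}, {"ll", 5}, {"hell", 6}, {"hello", 7},
    };
    for (int32_t id : byte_pair_merge_ids("hello", ranks)) {
        printf("%d ", id);
    }
    printf("\n"); // expected: 7 (the single token "hello")
}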
@bobqianic Thank you for looking into this! I was using whisper to transcribe German audio, encountered some issues with the tokenization, and just ported some code from openai/tiktoken. With that said, I think a proper BPE implementation that works correctly for at least 95% of use cases and languages would be very nice to have. Unfortunately I don't have enough capacity to deep dive right now, so feel free to go ahead and open a new PR for this 👍🏾
Fixes #1098
(see also #25: Token decoding issue - some characters are missing)