
Switch to BPE tokenization impl from openai/tiktoken #1118

Open

iceychris wants to merge 1 commit into master

Conversation

@iceychris (Contributor) commented Jul 18, 2023

Fixes #1098

@iceychris iceychris changed the title Switch to BPE tokenization impl from openai/tiktoken (fixes #1098) Switch to BPE tokenization impl from openai/tiktoken Jul 18, 2023
@fire commented Jul 26, 2023

What is the advantage of this?

@ggerganov (Owner), replying to the question above:

Processing should be more accurate when input context is provided. With the current implementation, the tokenization of the prompt (if provided) is not 100% correct.
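
To make the failure mode concrete, here is a toy sketch (made-up five-token vocabulary, not the PR code): if a tokenizer does a greedy longest-match lookup instead of a rank-driven BPE merge, it can produce different token ids even though both tokenizations decode back to the same text, so the prompt no longer matches what openai/whisper would feed the model.

# Toy vocabulary: token bytes -> rank (lower rank = merged earlier). Invented for this example.
ranks = {b"a": 0, b"b": 1, b"c": 2, b"bc": 3, b"ab": 4}

def greedy_longest_match(data: bytes) -> list:
    """Always take the longest vocabulary entry at the current position."""
    out, i = [], 0
    while i < len(data):
        for j in range(len(data), i, -1):
            if data[i:j] in ranks:
                out.append(ranks[data[i:j]])
                i = j
                break
    return out

def bpe_encode(data: bytes) -> list:
    """Start from single bytes and repeatedly merge the adjacent pair with the lowest rank."""
    parts = [data[i:i + 1] for i in range(len(data))]
    while True:
        best = None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best is None or rank < best[1]):
                best = (i, rank)
        if best is None:
            break
        i = best[0]
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
    return [ranks[p] for p in parts]

print(greedy_longest_match(b"abc"))  # [4, 2]  ("ab" + "c")
print(bpe_encode(b"abc"))            # [0, 3]  ("a" + "bc": the "bc" merge has the lower rank)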

@bobqianic (Collaborator)

Is there any progress on this?

@bobqianic (Collaborator)

I've just run a test on this implementation, and there's a significant issue with its efficiency: it's running very slowly, barely a few KiB/s...

@bobqianic (Collaborator) commented Jan 20, 2024

The algorithm is also giving incorrect results.

I used an LLM training corpus as a test set, 11.3 MiB in size, covering most languages.

Test sample: "os chefes de defesa da estónia, letónia, lituânia, alemanha, itália, espanha e eslováquia assinarão o acordo para fornecer pessoal e financiamento para o centro."

OpenAI:

[503, 329, 10530, 279, 368, 1060, 13708, 1120, 871, 1801, 654, 11, 718, 1801, 654, 11, 287, 6380, 9667, 654, 11, 6775, 1601, 1641, 11, 309, 11447, 654, 11, 785, 6040, 1641, 308, 785, 28257, 842, 358, 654, 1256, 6470, 1076, 277, 49392, 1690, 337, 716, 1776, 24811, 308, 24323, 8824, 1690, 277, 24607, 889]

PR:

[503, 329, 10530, 279, 368, 1060, 13708, 1120, 871, 812, 12679, 11, 718, 812, 12679, 11, 287, 6380, 3201, 12679, 11, 6775, 1601, 1641, 11, 309, 842, 14218, 11, 785, 6040, 1641, 308, 785, 28257, 842, 358, 654, 1256, 6470, 1046, 78, 277, 49392, 1690, 337, 716, 1776, 24811, 308, 24323, 8824, 1690, 277, 24607, 889]
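
To see where the two sequences diverge, a quick element-wise comparison is enough (only the first dozen ids of each list are reproduced here):

# First twelve ids from each of the two lists above; the divergence is already visible.
openai_ids = [503, 329, 10530, 279, 368, 1060, 13708, 1120, 871, 1801, 654, 11]
pr_ids     = [503, 329, 10530, 279, 368, 1060, 13708, 1120, 871, 812, 12679, 11]

first_diff = next((i for i, (a, b) in enumerate(zip(openai_ids, pr_ids)) if a != b), None)
print(first_diff)  # -> 9 (first index where the two outputs disagree)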

@bobqianic (Collaborator)

I can share my test script with you:

script
from tiktoken import _educational
from tiktoken import core
import base64

LANGUAGES = {
    "en": "english",
    "zh": "chinese",
    "de": "german",
    "es": "spanish",
    "ru": "russian",
    "ko": "korean",
    "fr": "french",
    "ja": "japanese",
    "pt": "portuguese",
    "tr": "turkish",
    "pl": "polish",
    "ca": "catalan",
    "nl": "dutch",
    "ar": "arabic",
    "sv": "swedish",
    "it": "italian",
    "id": "indonesian",
    "hi": "hindi",
    "fi": "finnish",
    "vi": "vietnamese",
    "he": "hebrew",
    "uk": "ukrainian",
    "el": "greek",
    "ms": "malay",
    "cs": "czech",
    "ro": "romanian",
    "da": "danish",
    "hu": "hungarian",
    "ta": "tamil",
    "no": "norwegian",
    "th": "thai",
    "ur": "urdu",
    "hr": "croatian",
    "bg": "bulgarian",
    "lt": "lithuanian",
    "la": "latin",
    "mi": "maori",
    "ml": "malayalam",
    "cy": "welsh",
    "sk": "slovak",
    "te": "telugu",
    "fa": "persian",
    "lv": "latvian",
    "bn": "bengali",
    "sr": "serbian",
    "az": "azerbaijani",
    "sl": "slovenian",
    "kn": "kannada",
    "et": "estonian",
    "mk": "macedonian",
    "br": "breton",
    "eu": "basque",
    "is": "icelandic",
    "hy": "armenian",
    "ne": "nepali",
    "mn": "mongolian",
    "bs": "bosnian",
    "kk": "kazakh",
    "sq": "albanian",
    "sw": "swahili",
    "gl": "galician",
    "mr": "marathi",
    "pa": "punjabi",
    "si": "sinhala",
    "km": "khmer",
    "sn": "shona",
    "yo": "yoruba",
    "so": "somali",
    "af": "afrikaans",
    "oc": "occitan",
    "ka": "georgian",
    "be": "belarusian",
    "tg": "tajik",
    "sd": "sindhi",
    "gu": "gujarati",
    "am": "amharic",
    "yi": "yiddish",
    "lo": "lao",
    "uz": "uzbek",
    "fo": "faroese",
    "ht": "haitian creole",
    "ps": "pashto",
    "tk": "turkmen",
    "nn": "nynorsk",
    "mt": "maltese",
    "sa": "sanskrit",
    "lb": "luxembourgish",
    "my": "myanmar",
    "bo": "tibetan",
    "tl": "tagalog",
    "mg": "malagasy",
    "as": "assamese",
    "tt": "tatar",
    "haw": "hawaiian",
    "ln": "lingala",
    "ha": "hausa",
    "ba": "bashkir",
    "jw": "javanese",
    "su": "sundanese",
    "yue": "cantonese",
}


if __name__ == '__main__':
    text = " " + '"os chefes de defesa da estónia, letónia, lituânia, alemanha, itália, espanha e eslováquia assinarão o acordo para fornecer pessoal e financiamento para o centro."'


    pat = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
    special_tokens = {}

    specials = [
        "<|endoftext|>",
        "<|startoftranscript|>",
        *[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:99]],
        "<|translate|>",
        "<|transcribe|>",
        "<|startoflm|>",
        "<|startofprev|>",
        "<|nospeech|>",
        "<|notimestamps|>",
        *[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
    ]

    ranks = dict()
    with open("multilingual.tiktoken", "rb") as file:
        content = file.read()
    content = content.split(b"\n")[:-1]
    for x in content:
        token_bytes, token = x.split(b" ")
        ranks[base64.b64decode(token_bytes)] = int(token)

    n_vocab = len(ranks)
    for token in specials:
        special_tokens[token] = n_vocab
        n_vocab += 1

    # The "educational" pure-Python BPE and the production Rust-backed encoder
    # implement the same algorithm, so their outputs should match.
    tokenizer_edu = _educational.SimpleBytePairEncoding(pat_str=pat, mergeable_ranks=ranks)
    print(tokenizer_edu.encode(text, "colour"))
    # core.Encoding's first positional argument is the encoding name, not the text to encode.
    tokenizer_core = core.Encoding("whisper_multilingual", pat_str=pat, mergeable_ranks=ranks, special_tokens=special_tokens)
    print(tokenizer_core.encode(text))
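
For anyone reproducing this: multilingual.tiktoken is the base64-encoded rank file that ships with openai/whisper (under whisper/assets/), and the only extra dependency is the tiktoken package itself.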

@bobqianic (Collaborator) commented Jan 20, 2024

Alright, the issue stemmed from std::regex, as it lacks support for Unicode properties, leading to incorrect string splitting. To resolve this, I adapted the bpe_gpt2_preprocess function from llama.cpp, which mostly fixed the problem. However, bpe_gpt2_preprocess still has a few bugs and occasionally produces results that differ from those obtained with Python's regex.

Edit:
The current implementation in llama.cpp tends to split "'a" into " " and "'a", while OpenAI's implementation splits it into "'" and "a".

I will create a PR and fix this issue.

ggml-org/llama.cpp#3502
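
As a standalone illustration of the splitting problem (a sketch, not code from this PR): Python's built-in re module has the same limitation as std::regex, i.e. no \p{...} Unicode property classes, while the third-party regex module understands them and reproduces OpenAI's pre-tokenization:

import re
import regex  # third-party package with \p{L} / \p{N} Unicode property support

PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# An engine without Unicode property support cannot even compile the pattern.
try:
    re.compile(PAT)
except re.error as err:
    print("re:", err)  # bad escape \p

# A property-aware engine matches the accented letters under \p{L},
# so " estónia" comes out as a single piece, as OpenAI's tokenizer expects.
print(regex.findall(PAT, " estónia, letónia"))
# [' estónia', ',', ' letónia']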

Comment on lines +2632 to +2636
// convert std::map<std::string, int> token_to_id to ranks
std::map<std::string, size_t> ranks;
for (const auto& it : vocab.token_to_id) {
    ranks[it.first] = it.second;
}

Collaborator:

Why?

Suggested change
// convert std::map<std::string, int> token_to_id to ranks
std::map<std::string, size_t> ranks;
for (const auto& it : vocab.token_to_id) {
    ranks[it.first] = it.second;
}

}
return -1;
};
return _byte_pair_merge_string(piece, ranks, f);

Collaborator:

Suggested change
return _byte_pair_merge_string(piece, ranks, f);
return _byte_pair_merge_string(piece, vocab.token_to_id, f);

//
std::vector<int> _byte_pair_merge_string(
const std::string& piece,
const std::map<std::string, size_t>& ranks,

Collaborator:

Suggested change
const std::map<std::string, size_t>& ranks,
const std::map<std::string, int32_t>& ranks,
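
Taken together, these suggestions have _byte_pair_merge_string take vocab.token_to_id directly, instead of copying the whole vocabulary into a separate std::map of size_t ranks on every call.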

@iceychris (Contributor, Author)

@bobqianic Thank you for looking into this!

I was using whisper to transcribe German audio, ran into some tokenization issues, and ported some code from tiktoken to whisper.cpp, which was enough to fix the problem for my use case.

With that said, I think a proper BPE implementation that works correctly for at least 95% of use cases and languages would be very nice to have. Unfortunately I don't have enough capacity to do a deep dive right now, so feel free to go ahead and open a new PR for this 👍🏾

@bobqianic mentioned this pull request on Feb 5, 2024

Successfully merging this pull request may close the following issue:

Prompt tokenization does not match openai/whisper (#1098)