[Feat]: Add Tokenizer Metadata in tokenizer.json to gguf Format for Enhanced llama.cpp Capabilities #4868
Comments
What about items in
Thanks, I forgot about the special configuration file for the tokenizer.
Using a 3rd-party dependency is not desired. We should try to implement these within
The content of the OBJ type is actually a list of all key names of the object.

* GGUFWriter:
  * add `def add_kv(self, key: str, val: Any) -> None`: the value is added based on the type of `val`
  * add `def add_dict(self, key: str, val: dict) -> None`: add an object (dict) value
* constants:
  * `GGUFValueType.get_type`: added support for NumPy's integer and floating-point types, and selects the appropriate width based on the size of the integer
* gguf_reader:
  * add `ReaderField.get`: returns the value of the field
* Unit test added.

Related Issues: ggml-org#4868, ggml-org#2872
The content of the OBJ type is actually a list of all key names of the object. This change includes several improvements and additions to the codebase:

* GGUFWriter:
  * Added `def add_kv(self, key: str, val: Any) -> None` method: automatically determines the appropriate value type based on `val`.
  * Added `def add_dict(self, key: str, val: dict) -> None` method: adds an object (dict) key-value pair.
* constants:
  * Revised `GGUFValueType.get_type(val)`: added support for NumPy's integer and floating-point types, and selects the appropriate width according to the size of the integer.
* gguf_reader:
  * Added `ReaderField.get()` method: gets the value of this `ReaderField`.
* Unit tests have been added to cover these changes.

Related Issues: ggml-org#4868, ggml-org#2872
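The writer API described in the comments above can be sketched as a small standalone mock. This is illustrative only, not the actual `gguf-py` implementation: the `GGUFValueType.OBJ` member and its numeric value are hypothetical (the proposal stores the object's key names as the OBJ payload, with each member written under a namespaced sub-key), while the other enum values follow the published GGUF type numbering.

```python
# Illustrative sketch (NOT the real gguf-py code) of the proposed
# add_kv / add_dict dispatch. GGUFValueType.OBJ is a hypothetical
# addition; its payload is the list of the object's key names.
from enum import IntEnum
from typing import Any

class GGUFValueType(IntEnum):
    UINT32 = 4
    FLOAT32 = 6
    BOOL = 7
    STRING = 8
    ARRAY = 9
    OBJ = 12  # hypothetical: not part of the published GGUF spec

class GGUFWriter:
    def __init__(self) -> None:
        # key -> (value type, payload); a real writer would serialize these
        self.kv: dict[str, tuple[GGUFValueType, Any]] = {}

    def add_kv(self, key: str, val: Any) -> None:
        # Pick the GGUF value type from the Python type of `val`.
        # bool must be checked before int (bool is a subclass of int).
        if isinstance(val, bool):
            self.kv[key] = (GGUFValueType.BOOL, val)
        elif isinstance(val, int):
            self.kv[key] = (GGUFValueType.UINT32, val)
        elif isinstance(val, float):
            self.kv[key] = (GGUFValueType.FLOAT32, val)
        elif isinstance(val, str):
            self.kv[key] = (GGUFValueType.STRING, val)
        elif isinstance(val, dict):
            self.add_dict(key, val)
        elif isinstance(val, (list, tuple)):
            self.kv[key] = (GGUFValueType.ARRAY, list(val))
        else:
            raise TypeError(f"unsupported value type: {type(val)!r}")

    def add_dict(self, key: str, val: dict) -> None:
        # The OBJ entry stores the list of key names; each member is
        # then written recursively under "<key>.<subkey>".
        self.kv[key] = (GGUFValueType.OBJ, list(val.keys()))
        for subkey, subval in val.items():
            self.add_kv(f"{key}.{subkey}", subval)
```

Flattening nested objects into namespaced keys keeps the on-disk format a flat key-value store, which matches how GGUF metadata is laid out today.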
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
Currently, llama.cpp lacks support for HuggingFace's tokenization pipeline. It is crucial to address its current limitations regarding integrated tokenization pipeline configurations from HuggingFace's Tokenizers library, which are stored in a separate JSON file named "tokenizer.json". These configuration files contain essential information for implementing advanced features like subword regularization and customizable pre-processing techniques that improve language model performance.
By incorporating this metadata into the gguf format, llama.cpp can offer users a more seamless experience by providing access to HuggingFace's comprehensive tokenization pipeline within its single-file implementation of language models.
Motivation
Related Issues: #2872 #3502
Possible Implementation
We only need to add the contents of the relevant subkeys (normalizer, pretokenizer, model, postprocessor, decoder) in the `tokenizer.json` file to the metadata of gguf. Also don't forget the `tokenizer_config.json` file for special tokenizer configuration. Subsequently, the next step can proceed to implement the tokenizer within HF (Hugging Face). The most effortless approach is utilizing the pre-existing tokenizers-cpp, which encapsulates and binds both the HuggingFace tokenizers library and sentencepiece while offering a minimal common interface in C++ for seamless integration with HF applications.
Alternatively, it is possible to implement all tokenizer functionalities solely using pure C++ code without any external libraries or dependencies if so desired. For a clearer example of how these pre-tokenization steps might be implemented using JavaScript code in one file, you can refer to the source for transformers.js's tokenizer implementation: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js
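The extraction step proposed above can be sketched as follows. This is a minimal sketch, not an agreed-upon design: the `tokenizer.huggingface.*` metadata key names are hypothetical, and the sections are stored as JSON strings for simplicity. Note that in `tokenizer.json` the actual section names are `pre_tokenizer` and `post_processor` (with underscores).

```python
# Sketch: copy the relevant tokenizer.json sections into a dict of
# gguf metadata entries. The "tokenizer.huggingface.*" key names are
# hypothetical, not an established gguf convention.
import json

# Section names as they appear in HuggingFace's tokenizer.json
SUBKEYS = ("normalizer", "pre_tokenizer", "model", "post_processor", "decoder")

def tokenizer_metadata(path: str) -> dict[str, str]:
    with open(path, encoding="utf-8") as f:
        tok = json.load(f)
    meta: dict[str, str] = {}
    for name in SUBKEYS:
        section = tok.get(name)
        if section is not None:
            # Store each pipeline section as a JSON string under a
            # namespaced metadata key.
            meta[f"tokenizer.huggingface.{name}"] = json.dumps(section)
    return meta
```

A writer could then emit each entry with something like the `add_kv` API discussed earlier, leaving it to llama.cpp's loader to parse the JSON strings back into pipeline configuration.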
Related JSON Example in tokenizer.json