[Feat]: Add Tokenizer Metadata in tokenizer.json to gguf Format for Enhanced llama.cpp Capabilities #4868
Comments
What about items in
Thanks, I forgot about the special configuration file for the tokenizer.
Using a 3rd-party dependency is not desired. We should try to implement these within
The content of the OBJ type is actually a list of all key names of the object.

* GGUFWriter:
  * add `def add_kv(self, key: str, val: Any) -> None`: the value is added based on the type of `val`
  * add `def add_dict(self, key: str, val: dict) -> None`: add an object (dict) value
* constants:
  * `GGUFValueType.get_type`: added support for NumPy's integer and floating-point types, and selects the appropriate width based on the size of the integer
* gguf_reader:
  * add `ReaderField.get`: returns the value of the field
* Unit test added.

Related Issues: ggml-org#4868, ggml-org#2872
The content of the OBJ type is actually a list of all key names of the object. This change includes several improvements and additions to the codebase:

* GGUFWriter:
  * Added `def add_kv(self, key: str, val: Any) -> None` method: automatically determines the appropriate value type based on `val`.
  * Added `def add_dict(self, key: str, val: dict) -> None` method: adds an object (dict) key-value pair.
* constants:
  * Revised `GGUFValueType.get_type(val)`: added support for NumPy's integer and floating-point types, and selects the appropriate width according to the size of the integer.
* gguf_reader:
  * Added `ReaderField.get()` method: gets the value of this `ReaderField`.
* Unit tests have been added to cover these changes.

Related Issues: ggml-org#4868, ggml-org#2872
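The writer API described in the comments above can be sketched as a small standalone mock. This is illustrative only, not the actual `gguf-py` implementation: the `GGUFValueType.OBJ` member and its numeric value are hypothetical (the proposal stores the object's key names as the OBJ payload, with each member written under a namespaced sub-key), while the other enum values follow the published GGUF type numbering.

```python
# Illustrative sketch (NOT the real gguf-py code) of the proposed
# add_kv / add_dict dispatch. GGUFValueType.OBJ is a hypothetical
# addition; its payload is the list of the object's key names.
from enum import IntEnum
from typing import Any

class GGUFValueType(IntEnum):
    UINT32 = 4
    FLOAT32 = 6
    BOOL = 7
    STRING = 8
    ARRAY = 9
    OBJ = 12  # hypothetical: not part of the published GGUF spec

class GGUFWriter:
    def __init__(self) -> None:
        # key -> (value type, payload); a real writer would serialize these
        self.kv: dict[str, tuple[GGUFValueType, Any]] = {}

    def add_kv(self, key: str, val: Any) -> None:
        # Pick the GGUF value type from the Python type of `val`.
        # bool must be checked before int (bool is a subclass of int).
        if isinstance(val, bool):
            self.kv[key] = (GGUFValueType.BOOL, val)
        elif isinstance(val, int):
            self.kv[key] = (GGUFValueType.UINT32, val)
        elif isinstance(val, float):
            self.kv[key] = (GGUFValueType.FLOAT32, val)
        elif isinstance(val, str):
            self.kv[key] = (GGUFValueType.STRING, val)
        elif isinstance(val, dict):
            self.add_dict(key, val)
        elif isinstance(val, (list, tuple)):
            self.kv[key] = (GGUFValueType.ARRAY, list(val))
        else:
            raise TypeError(f"unsupported value type: {type(val)!r}")

    def add_dict(self, key: str, val: dict) -> None:
        # The OBJ entry stores the list of key names; each member is
        # then written recursively under "<key>.<subkey>".
        self.kv[key] = (GGUFValueType.OBJ, list(val.keys()))
        for subkey, subval in val.items():
            self.add_kv(f"{key}.{subkey}", subval)
```

Flattening nested objects into namespaced keys keeps the on-disk format a flat key-value store, which matches how GGUF metadata is laid out today.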
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
Currently, llama.cpp lacks support for HuggingFace's tokenization pipeline. It is crucial to address its current limitations regarding integrated tokenization pipeline configurations from HuggingFace's Tokenizers library, which are stored in a separate JSON file named "tokenizer.json". These configuration files contain essential information for implementing advanced features like subword regularization and customizable pre-processing techniques that improve language model performance.
By incorporating this metadata into the gguf format, llama.cpp can offer users a more seamless experience by providing access to HuggingFace's comprehensive tokenization pipeline within its single-file implementation of language models.
Motivation
Related Issues: #2872 #3502
Possible Implementation
We only need to add the contents of the relevant subkeys (normalizer, pretokenizer, model, postprocessor, decoder) in the `tokenizer.json` file to the metadata of gguf. Also don't forget the `tokenizer_config.json` file for special tokenizer configuration. Subsequently, the next step can proceed to implement the tokenizer within HF (Hugging Face). The most effortless approach is utilizing the pre-existing tokenizers-cpp, which encapsulates and binds both the HuggingFace tokenizers library and sentencepiece while offering a minimal common interface in C++ for seamless integration with HF applications.
Alternatively, it is possible to implement all tokenizer functionalities solely using pure C++ code without any external libraries or dependencies if so desired. For a clearer example of how these pre-tokenization steps might be implemented using JavaScript code in one file, you can refer to the source for transformers.js's tokenizer implementation: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js
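The extraction step proposed above can be sketched as follows. This is a minimal sketch, not an agreed-upon design: the `tokenizer.huggingface.*` metadata key names are hypothetical, and the sections are stored as JSON strings for simplicity. Note that in `tokenizer.json` the actual section names are `pre_tokenizer` and `post_processor` (with underscores).

```python
# Sketch: copy the relevant tokenizer.json sections into a dict of
# gguf metadata entries. The "tokenizer.huggingface.*" key names are
# hypothetical, not an established gguf convention.
import json

# Section names as they appear in HuggingFace's tokenizer.json
SUBKEYS = ("normalizer", "pre_tokenizer", "model", "post_processor", "decoder")

def tokenizer_metadata(path: str) -> dict[str, str]:
    with open(path, encoding="utf-8") as f:
        tok = json.load(f)
    meta: dict[str, str] = {}
    for name in SUBKEYS:
        section = tok.get(name)
        if section is not None:
            # Store each pipeline section as a JSON string under a
            # namespaced metadata key.
            meta[f"tokenizer.huggingface.{name}"] = json.dumps(section)
    return meta
```

A writer could then emit each entry with something like the `add_kv` API discussed earlier, leaving it to llama.cpp's loader to parse the JSON strings back into pipeline configuration.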
Related JSON Example in tokenizer.json