Fix SentencePiece tokenizers conversion #616
Merged
There is a bug in offset mapping that actually affects all the fast tokenizers converted from SentencePiece. During the pre-tokenization step, we first split everything on whitespaces (`WhitespaceSplit` pre-tokenizer), and in a second step, we add the ▁ character in front of each word (`Metaspace` pre-tokenizer). This process is accurate in terms of tokenization, but it makes the offset tracking very difficult: the ▁ characters we add in front of each word do not exist in the original input, so these tokens actually point back to the beginning of each word: the first character. The sketch below illustrates the issue.
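Here is a minimal sketch of such a two-step pipeline using the `tokenizers` Python bindings. The example sentence and the use of default `Metaspace` parameters are assumptions for illustration; `pre_tokenize_str` simply exposes the produced pieces together with their offsets:

```python
from tokenizers.pre_tokenizers import Sequence, WhitespaceSplit, Metaspace

# Two-step pipeline described above: split on whitespaces first, then let
# Metaspace prepend the ▁ replacement character to every resulting word.
# (Default Metaspace parameters assumed: replacement="▁", prefix space added.)
old_pre_tokenizer = Sequence([WhitespaceSplit(), Metaspace()])

for piece, offsets in old_pre_tokenizer.pre_tokenize_str("Hello there friend"):
    print(piece, offsets)

# Each piece starts with a ▁ that has no counterpart in the original string,
# so when the model later splits that ▁ off into its own token, its offsets
# can only point back to the first character of the word that follows it.
```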
### How we fix it

The initial idea of using the `WhitespaceSplit` pre-tokenizer as a first step was simply to deduplicate the whitespaces, but since it leads to a loss of information, we replace it with a process that relies on the `Metaspace` pre-tokenizer alone, as sketched below.
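A minimal sketch of the fixed configuration, under the same assumptions as above (default `Metaspace` parameters, `pre_tokenize_str` used only to inspect pieces and offsets):

```python
from tokenizers.pre_tokenizers import Metaspace

# Fixed pipeline: no WhitespaceSplit step anymore. The Metaspace pre-tokenizer
# handles the splitting on its own, so no information about the original
# whitespaces is lost before the offsets are computed.
new_pre_tokenizer = Metaspace()

for piece, offsets in new_pre_tokenizer.pre_tokenize_str("Hello there friend"):
    print(piece, offsets)

# For an already-converted fast tokenizer loaded through transformers
# (a hypothetical `fast_tokenizer`), the same pre-tokenizer can be set on the
# underlying Rust tokenizer:
# fast_tokenizer.backend_tokenizer.pre_tokenizer = Metaspace()
```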
Related to huggingface/transformers#9633 and huggingface/transformers#9637