Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix SentencePiece tokenizers conversion #616

Merged
merged 3 commits into from
Feb 3, 2021
Merged

Fix SentencePiece tokenizers conversion #616

merged 3 commits into from
Feb 3, 2021

Conversation

n1t0
Copy link
Member

@n1t0 n1t0 commented Feb 3, 2021

There is a bug in offset mapping that actually affects all the fast tokenizers converted from SentencePiece. During the pre-tokenization step, we first split everything on whitespaces (WhitespaceSplit pre-tokenizer), and in a second step, we add the character in front of each word (Metaspace pre-tokenizer). This process is accurate in terms of tokenization, but it makes the offset tracking very difficult:

  • All the whitespaces get removed, so we won't have any token pointing back to them.
  • We add a "new" in front of each word, so these tokens actually point back to the beginning of each word: the first character.

How we fix it

The initial idea of using the WhitespaceSplit in a first step was simply to deduplicate the whitespaces but since it leads to loss of information we replace it with the following process:

  • Normalization step that replaces groups of whitespaces with a single one, effectively mapping the single whitespace to the group in the original input.
  • Pretokenization step: we just keep the Metaspace pre-tokenizer.

Related to huggingface/transformers#9633 and huggingface/transformers#9637

@n1t0 n1t0 force-pushed the fix-spm-conversion branch from 2dff861 to 4c6f90b Compare February 3, 2021 15:40
@n1t0 n1t0 force-pushed the fix-spm-conversion branch from da9a03e to 4a75b1c Compare February 3, 2021 16:00
@n1t0 n1t0 merged commit 2c711d4 into master Feb 3, 2021
@n1t0 n1t0 deleted the fix-spm-conversion branch February 3, 2021 17:48
@n1t0 n1t0 mentioned this pull request Feb 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant