Fix SentencePiece tokenizers conversion #616

n1t0 · 2021-02-03T15:01:37Z

There is a bug in offset mapping that actually affects all the fast tokenizers converted from SentencePiece. During the pre-tokenization step, we first split everything on whitespaces (WhitespaceSplit pre-tokenizer), and in a second step, we add the ▁ character in front of each word (Metaspace pre-tokenizer). This process is accurate in terms of tokenization, but it makes the offset tracking very difficult:

All the whitespaces get removed, so we won't have any token pointing back to them.
We add a "new" ▁ in front of each word, so these tokens actually point back to the beginning of each word: the first character.

How we fix it

The initial idea of using the WhitespaceSplit in a first step was simply to deduplicate the whitespaces but since it leads to loss of information we replace it with the following process:

Normalization step that replaces groups of whitespaces with a single one, effectively mapping the single whitespace to the group in the original input.
Pretokenization step: we just keep the Metaspace pre-tokenizer.

Related to huggingface/transformers#9633 and huggingface/transformers#9637

n1t0 added 2 commits February 3, 2021 10:36

Fix SentencePiece tokenizers conversion

e761e96

Rust - Fix a bug in the Metaspace PreTokenizer

4c6f90b

n1t0 force-pushed the fix-spm-conversion branch from 2dff861 to 4c6f90b Compare February 3, 2021 15:40

CI - Force pyarrow<3.0.0 for now

4a75b1c

n1t0 force-pushed the fix-spm-conversion branch from da9a03e to 4a75b1c Compare February 3, 2021 16:00

n1t0 merged commit 2c711d4 into master Feb 3, 2021

n1t0 deleted the fix-spm-conversion branch February 3, 2021 17:48

n1t0 mentioned this pull request Feb 8, 2021

Prepare for python v0.10.1 #625

Merged

basma-b mentioned this pull request Mar 22, 2021

SentencePiece fast tokenizers still have issues with the offsets_mapping #663

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix SentencePiece tokenizers conversion #616

Fix SentencePiece tokenizers conversion #616

n1t0 commented Feb 3, 2021 •

edited

Loading

Fix SentencePiece tokenizers conversion #616

Fix SentencePiece tokenizers conversion #616

Conversation

n1t0 commented Feb 3, 2021 • edited Loading

How we fix it

n1t0 commented Feb 3, 2021 •

edited

Loading