
[ERROR] Tokenizer and TokenizerFast ??? #5490

Closed · 1512262 opened this issue Jul 3, 2020 · 2 comments · Fixed by #5558

Labels: Core: Tokenization (Internals of the library; Tokenization.)

Comments

1512262 commented Jul 3, 2020

🐛 Bug

Information

Model I am using (Bert, XLNet ...): BERT

Language I am using the model on (English, Chinese ...): 'bert-base-multilingual-cased'

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior (a combined, runnable sketch follows the two lists below):

  1. from transformers import *
  2. tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-cased')
  3. tokenizer.decode(tokenizer.encode('mở bài lạc trôi')) --> wrong: the decoded text comes back with its accents stripped

but:

  1. from transformers import *
  2. tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
  3. tokenizer.decode(tokenizer.encode('mở bài lạc trôi')) --> correct: the decoded text matches the original input
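
A minimal, runnable sketch of the reproduction above, combining both lists so the two round trips can be compared side by side. The variable names and the skip_special_tokens flag are additions for readability, not part of the original report:

```python
from transformers import BertTokenizer, BertTokenizerFast

text = "mở bài lạc trôi"

slow = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
fast = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

# Round-trip the sentence through each tokenizer.
slow_roundtrip = slow.decode(slow.encode(text), skip_special_tokens=True)
fast_roundtrip = fast.decode(fast.encode(text), skip_special_tokens=True)

print(slow_roundtrip)  # accents preserved: "mở bài lạc trôi"
print(fast_roundtrip)  # observed at the time of this issue: accents stripped
```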

Expected behavior

The sentence decoded after encoding with BertTokenizerFast should match the original input, as it does with BertTokenizer.

Environment info

  • transformers version:
  • Platform: Pytorch and TF
  • Python version: 3.6
  • PyTorch version (GPU?): GPU
  • Tensorflow version (GPU?): 2.2
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No
@1512262 1512262 closed this as completed Jul 3, 2020
@1512262 1512262 reopened this Jul 3, 2020
@1512262 1512262 changed the title Tokenizer and TokenizerFast ??? [ERROR] Tokenizer and TokenizerFast ??? Jul 3, 2020
@thomwolf thomwolf added the Core: Tokenization label Jul 3, 2020
n1t0 (Member) commented Jul 6, 2020

This is related to #2917. In the slow tokenizers, when do_lower_case=False we don't strip accents, while we do it when do_lower_case=True. In the fast tokenizers, this is controlled by the strip_accents option, which is True here.
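
A minimal sketch of the workaround this implies, assuming the installed BertTokenizerFast accepts strip_accents as an initialization argument (as the comment above suggests); this is an illustration, not the fix that was eventually merged:

```python
from transformers import BertTokenizerFast

# Assumption: strip_accents can be passed through from_pretrained and
# overrides the fast tokenizer's accent-stripping default.
fast = BertTokenizerFast.from_pretrained(
    "bert-base-multilingual-cased",
    strip_accents=False,  # match the slow tokenizer, which keeps accents when do_lower_case=False
)

text = "mở bài lạc trôi"
print(fast.decode(fast.encode(text), skip_special_tokens=True))  # accents should now survive the round trip
```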

@thomwolf How do you think we should fix this?

thomwolf (Member) commented Jul 6, 2020

Yes, let's do it @n1t0 and stick to the official BERT tokenizer behavior in the fast tokenizers as well.
