
[ERROR] Tokenizer and TokenizerFast ??? #5490

Closed · 1512262 opened this issue Jul 3, 2020 · 2 comments · Fixed by #5558

Labels: Core: Tokenization (Internals of the library; Tokenization.)

Comments

1512262 commented Jul 3, 2020

🐛 Bug

Information

Model I am using (Bert, XLNet ...): BERT

Language I am using the model on (English, Chinese ...): 'bert-base-multilingual-cased'

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior (a combined, runnable sketch follows the two lists below):

  1. from transformers import *
  2. tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-cased')
  3. tokenizer.decode(tokenizer.encode('mở bài lạc trôi')) --> wrong: the decoded text comes back with its accents stripped

but:

  1. from transformers import *
  2. tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
  3. tokenizer.decode(tokenizer.encode('mở bài lạc trôi')) --> correct: the decoded text matches the original input
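
A minimal, runnable sketch of the reproduction above, combining both lists so the two round trips can be compared side by side. The variable names and the skip_special_tokens flag are additions for readability, not part of the original report:

```python
from transformers import BertTokenizer, BertTokenizerFast

text = "mở bài lạc trôi"

slow = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
fast = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

# Round-trip the sentence through each tokenizer.
slow_roundtrip = slow.decode(slow.encode(text), skip_special_tokens=True)
fast_roundtrip = fast.decode(fast.encode(text), skip_special_tokens=True)

print(slow_roundtrip)  # accents preserved: "mở bài lạc trôi"
print(fast_roundtrip)  # observed at the time of this issue: accents stripped
```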

Expected behavior

The sentence decoded after encoding with BertTokenizerFast should match the original input, as it does with BertTokenizer.

Environment info

  • transformers version:
  • Platform: Pytorch and TF
  • Python version: 3.6
  • PyTorch version (GPU?): GPU
  • Tensorflow version (GPU?): 2.2
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No
@1512262 1512262 closed this as completed Jul 3, 2020
@1512262 1512262 reopened this Jul 3, 2020
@1512262 1512262 changed the title Tokenizer and TokenizerFast ??? [ERROR] Tokenizer and TokenizerFast ??? Jul 3, 2020
@thomwolf thomwolf added the Core: Tokenization label Jul 3, 2020
n1t0 (Member) commented Jul 6, 2020

This is related to #2917. In the slow tokenizers, when do_lower_case=False we don't strip accents, while we do it when do_lower_case=True. In the fast tokenizers, this is controlled by the strip_accents option, which is True here.
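
A minimal sketch of the workaround this implies, assuming the installed BertTokenizerFast accepts strip_accents as an initialization argument (as the comment above suggests); this is an illustration, not the fix that was eventually merged:

```python
from transformers import BertTokenizerFast

# Assumption: strip_accents can be passed through from_pretrained and
# overrides the fast tokenizer's accent-stripping default.
fast = BertTokenizerFast.from_pretrained(
    "bert-base-multilingual-cased",
    strip_accents=False,  # match the slow tokenizer, which keeps accents when do_lower_case=False
)

text = "mở bài lạc trôi"
print(fast.decode(fast.encode(text), skip_special_tokens=True))  # accents should now survive the round trip
```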

@thomwolf How do you think we should fix this?

thomwolf (Member) commented Jul 6, 2020

Yes, let's do it @n1t0 and stick to the official BERT tokenizer behavior in the fast tokenizers as well.
