Breaking-change behavior in BERT tokenizer when stripping accents #2917
Comments
Yeah, I found the same problem in my code. `encode` won't add padding even when `pad_to_max_length=True`.
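For illustration, a minimal sketch of the call this comment describes (the model name and inputs are assumptions, using the transformers 2.5.0-era API):

```python
# Minimal sketch of the reported behavior; model name and inputs are
# assumptions for illustration (transformers 2.5.0-era API).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("Hello world", max_length=10, pad_to_max_length=True)
# Per the report, len(ids) comes back shorter than max_length:
# no padding is added despite pad_to_max_length=True.
print(len(ids))
```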
Hi @bryant1410, thanks for reporting the issue. I've a PR exposing the missing parameters (#2921); it will land soon on master and will be included in the first maintenance release of 2.5.
I see, thanks! There's still an incompatibility, though: you can choose whether to strip accents in the fast tokenizers, but you can't control that in the previous tokenizers. I believe this should be fixed as well. And be aware that, IIRC, this is still a breaking change, because the previous tokenizers stripped accents by default in one way, while the new ones seem to behave differently by default. I don't know if this is also the case for the other params added in #2921, or for other models apart from BERT.
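A hedged illustration of the default mismatch described above (the exact ids are not asserted here, only the reported divergence):

```python
# Compares slow vs. fast tokenizer defaults on an accented input
# (transformers 2.5.0-era API; outputs described per the report).
from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained("bert-base-uncased")
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Per the report: the slow tokenizer strips the accent (coupled to
# lowercasing), the fast one keeps it, so the ids diverge.
print(slow.encode("héllo"))
print(fast.encode("héllo"))
```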
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Please don't close it, as this is an important issue.
Same one reported by @stefan-it, @n1t0?
Yes, same one. Stripping accents happens only when lowercasing is enabled. We can probably add an explicit option for this on slow tokenizers, and specify the default values in the configs.
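A sketch of what that explicit option could look like (names and the None-means-follow-lowercase convention are assumptions, not the actual patch):

```python
# Resolve an explicit strip_accents option against the old behavior:
# None keeps the historical coupling to do_lower_case; True/False
# override it explicitly. All names here are assumptions.
def resolve_strip_accents(strip_accents, do_lower_case):
    return do_lower_case if strip_accents is None else strip_accents

assert resolve_strip_accents(None, True) is True     # old default preserved
assert resolve_strip_accents(False, True) is False   # now user-controllable
```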
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Don't close it!! I want to have control over stripping accents when tokenizing.
🐛 Bug
Information
Model I am using (Bert, XLNet ...): Bert (could happen with other ones, don't know)
Language I am using the model on (English, Chinese ...): English
To reproduce
With the slow tokenizer, it only strips accents if lowercasing is enabled (maybe a bug?):
transformers/src/transformers/tokenization_bert.py, line 346 at e676764
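A paraphrased, self-contained sketch of that logic (the coupling the report describes: stripping is only reached inside the lowercasing branch):

```python
import unicodedata

def run_strip_accents(text):
    # NFD-decompose, then drop combining marks (Unicode category 'Mn').
    text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def basic_normalize(token, do_lower_case):
    # Paraphrase of the referenced line: accent stripping only happens
    # inside the do_lower_case branch.
    if do_lower_case:
        token = token.lower()
        token = run_strip_accents(token)
    return token

print(basic_normalize("Héllo", do_lower_case=True))   # hello
print(basic_normalize("Héllo", do_lower_case=False))  # Héllo (accent kept)
```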
With the fast one, it never strips accents:
https://github.com/huggingface/tokenizers/blob/python-v0.5.0/bindings/python/tokenizers/implementations/bert_wordpiece.py#L23
transformers/src/transformers/tokenization_bert.py, lines 557 to 565 at e676764
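For comparison, the underlying fast implementation linked above does take the flag directly; the transformers wrapper at the referenced lines just doesn't expose it (the vocab path below is a placeholder assumption):

```python
# tokenizers 0.5.0: the flag exists on the underlying implementation.
from tokenizers import BertWordPieceTokenizer

fast = BertWordPieceTokenizer(
    "bert-base-uncased-vocab.txt",  # placeholder vocab file (assumption)
    lowercase=True,
    strip_accents=False,  # controllable here, but not through transformers
)
```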
It'd be cool to have that flag in both tokenizers as well.
Finally, this warning seems odd for the simple code from above:
Maybe the `if pad_to_max_length` here should nest the rest of the if?
transformers/src/transformers/tokenization_utils.py, lines 80 to 95 at e676764
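A sketch of the suggested restructuring (not the actual code at those lines; all names are approximations): nest the length and padding logic, including its warning, under the `pad_to_max_length` check so the warning cannot fire when no padding was requested.

```python
import logging

logger = logging.getLogger(__name__)

# Approximate names; illustrates nesting the padding logic (and any
# related warning) under pad_to_max_length, as suggested above.
def maybe_pad(ids, max_length=None, pad_to_max_length=False, pad_token_id=0):
    if pad_to_max_length:
        if max_length is None:
            logger.warning("pad_to_max_length=True but no max_length given")
        elif len(ids) < max_length:
            ids = ids + [pad_token_id] * (max_length - len(ids))
    return ids
```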
I didn't check the other transformer models.
Expected behavior
Environment info
`transformers` version: 2.5.0