[BUG] padding tokens are also masked in DataCollatorForLanguageModeling #11155
I have a similar issue: the pad token is not masked when I run bert-base-uncased, but it can be masked when I run albert-base-v2 with examples/language-modeling/run_mlm.py. I tried tracing the call to tokenizer.get_special_tokens_mask.
Interestingly, get_special_tokens_mask is resolved from the PreTrainedTokenizerBase class when I run bert-base-uncased, but from the AlbertTokenizerFast class when I run albert-base-v2.
These two implementations are different. With BERT, all_special_ids (which contains the cls, sep, and pad ids) are the ids that cannot be masked; with ALBERT, only the cls and sep ids cannot be masked, so the pad token can be masked. I don't know why the function is resolved from different classes for bert-base-uncased versus albert-base-v2. And is it correct that the pad token can be masked in the ALBERT model?
[bert command]
[albert command]
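The discrepancy described above can be sketched in plain Python. This is an illustrative model of the two behaviors, not the actual transformers source; the token ids and the function names `base_class_mask` / `albert_style_mask` are hypothetical stand-ins for the two `get_special_tokens_mask` implementations.

```python
# Hypothetical token ids for illustration (BERT-style values).
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0
ALL_SPECIAL_IDS = {CLS_ID, SEP_ID, PAD_ID}

def base_class_mask(token_ids):
    # PreTrainedTokenizerBase-style behavior as described in the comment:
    # any id in all_special_ids (cls, sep, AND pad) is flagged 1,
    # so it can never be selected for MLM masking.
    return [1 if t in ALL_SPECIAL_IDS else 0 for t in token_ids]

def albert_style_mask(token_ids):
    # AlbertTokenizerFast-style behavior as described in the comment:
    # only cls and sep are flagged, so pad ids remain eligible for masking.
    return [1 if t in (CLS_ID, SEP_ID) else 0 for t in token_ids]

ids = [CLS_ID, 7592, 2088, SEP_ID, PAD_ID, PAD_ID]
print(base_class_mask(ids))    # [1, 0, 0, 1, 1, 1] -- pad positions protected
print(albert_style_mask(ids))  # [1, 0, 0, 1, 0, 0] -- pad positions maskable
```

With the base-class behavior the two trailing pad positions are flagged and protected; with the ALBERT-style behavior they are not, which is why pad tokens can end up masked.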
Thanks for reporting! This is actually a bug in the
Environment info
transformers version: 4.3.2

Who can help
@sgugger
Information
Model I am using (Bert, XLNet ...): All models that use DataCollatorForLanguageModeling.
The bug was introduced in this PR: three lines (241-243) were removed by mistake from this line.
Now padding tokens can also be masked in MLM.
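The effect of the removed guard can be sketched in plain Python. This is a conceptual model of `DataCollatorForLanguageModeling`'s masking-probability logic, not the actual torch implementation; `build_probability_matrix`, `PAD_ID`, and `MLM_PROB` are names invented for this sketch.

```python
PAD_ID = 0       # hypothetical pad token id
MLM_PROB = 0.15  # default MLM masking probability in transformers

def build_probability_matrix(labels, special_tokens_mask, guard_padding=True):
    """Conceptual sketch: per-token probability of being chosen for masking."""
    probs = [MLM_PROB] * len(labels)
    # Positions flagged by get_special_tokens_mask are never masked.
    for i, is_special in enumerate(special_tokens_mask):
        if is_special:
            probs[i] = 0.0
    if guard_padding:
        # In essence, this is what the removed lines did: explicitly zero
        # out pad positions even when the tokenizer's special-tokens mask
        # does not flag them.
        for i, t in enumerate(labels):
            if t == PAD_ID:
                probs[i] = 0.0
    return probs

labels = [101, 7592, 102, 0, 0]        # cls, word, sep, pad, pad
stm = [1, 0, 1, 0, 0]                  # ALBERT-style mask: pad not flagged
print(build_probability_matrix(labels, stm, guard_padding=False))
# -> [0.0, 0.15, 0.0, 0.15, 0.15]  (bug: pad positions are maskable)
print(build_probability_matrix(labels, stm, guard_padding=True))
# -> [0.0, 0.15, 0.0, 0.0, 0.0]    (guard restores the expected behavior)
```

Without the guard, any tokenizer whose special-tokens mask does not flag pad ids leaves padding positions eligible for masking, which matches the reported behavior.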
The problem arises when using:
The tasks I am working on are:
To reproduce
Steps to reproduce the behavior:
From the output you can easily see that the padding tokens are masked. Adding back the three removed lines fixes this bug.
Expected behavior
Padding tokens are not supposed to be maskable in MLM.