XLMRobertaTokenizerFast producing wrong tokenized output #9637
There are two different subjects being discussed here:
Cause: This bug in offset mapping actually affects all the fast tokenizers converted from sentencepiece. During the pre-tokenization step, we first split everything on whitespace (…)
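For reference, this whitespace split in sentencepiece-converted fast tokenizers is done by the Metaspace pre-tokenizer from the tokenizers library; a minimal sketch (the example string is assumed):

```python
from tokenizers.pre_tokenizers import Metaspace

pre_tok = Metaspace()  # replaces spaces with the "▁" marker before the model runs
print(pre_tok.pre_tokenize_str("Hello, world."))
# Roughly: [('▁Hello,', (0, 6)), ('▁world.', (6, 13))] -- punctuation is still
# attached to the word here and is only split off later by the SentencePiece
# model, which is where the offsets can go wrong.
```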
How to fix it: The initial idea of using the …
In order to fix this, we need to: (…)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Unstale
Any update on this one?
Bump
Environment info
Who can help
@mfuntowicz
@stefan-it
Information
The model I am using is XLM-RoBERTa.
The problem arises when using the XLMRobertaTokenizerFast tokenizer.
The task I am working on is token classification. To align the labels with the sub-word units, I used the code snippet provided here: https://huggingface.co/transformers/custom_datasets.html [Fine-tuning with custom datasets / Token Classification with W-NUT Emerging Entities].
When trying to align the labels with the encodings, it throws: "ValueError: NumPy boolean array indexing assignment cannot assign X input values to the Y output values where the mask is true."
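For context, the alignment helper from that tutorial looks roughly like the following (a paraphrased sketch; `tag2id` and the `encodings` from a fast tokenizer called with `return_offsets_mapping=True` are assumed). The ValueError comes from the masked assignment at the end: the `(start == 0, end != 0)` mask is supposed to pick exactly one position per word, and the odd punctuation offsets described below violate that.

```python
import numpy as np

def encode_tags(tags, encodings, tag2id):
    # One list of string tags per document -> integer label ids
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # Default every token to -100 so sub-tokens/special tokens are ignored
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # A token counts as a word start iff its offsets are (0, n) with n > 0.
        # Stray (0, 1) offsets on punctuation pieces break this invariant, so
        # the mask can select more slots than len(doc_labels) -- hence the
        # ValueError from the masked assignment below.
        mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
        doc_enc_labels[mask] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels
```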
This behavior is due to how punctuation is tokenized. A comma (',') gets tokenized into '' and ',' (with offset values (0, 1)); the same happens with a dot. However, some other punctuation marks produce only one token (e.g. ':' -> ':').
In addition, the offset_mapping value for ':' differs between sentences, resulting in either a (0, 0) or a (0, 3) tuple. The problem is that padding tokens also have the offset tuple (0, 0) and are excluded from alignment, but in this case I have to preserve the punctuation, since this is a POS-tagging problem.
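A possible workaround (not from this thread, just a sketch): fast-tokenizer encodings expose word_ids(), which is None for special and padding tokens, so it can tell padding apart from a real punctuation token even when both carry a (0, 0) offset. The example input is assumed:

```python
from transformers import XLMRobertaTokenizerFast

tok = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
words = ["This", "is", "POS", "tagging", ":"]  # assumed example input
enc = tok(words, is_split_into_words=True)

# One entry per token: the index of the source word, or None for special
# tokens -- punctuation keeps a real word index regardless of its offsets.
print(enc.word_ids())
```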
To reproduce
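A minimal sketch of the punctuation behavior described above (the example sentence and checkpoint are assumed):

```python
from transformers import XLMRobertaTokenizerFast

tok = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
enc = tok("Hello, world: test.", return_offsets_mapping=True)

for token, offsets in zip(tok.convert_ids_to_tokens(enc["input_ids"]),
                          enc["offset_mapping"]):
    print(token, offsets)
# On the versions affected by this issue, ',' and '.' are reported to come out
# as an extra piece with offsets like (0, 1), while ':' maps to (0, 0) in some
# sentences and (0, 3) in others.
```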
Moreover, although I worked around this by writing my own masks, I found a new issue: the blank space that denotes the start of a word is tokenized as a separate token instead of being attached to the first sub-token.
To reproduce
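A minimal sketch of this second issue (the example input is assumed):

```python
from transformers import XLMRobertaTokenizerFast

tok = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
print(tok.tokenize("Hello world"))
# Expected pieces like ['▁Hello', '▁world']; on the affected versions the
# whitespace marker '▁' is reported to appear as a token of its own,
# detached from the first sub-token of the word.
```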
Expected behavior
Punctuation should be tokenized with consistent offsets across sentences, and the word-start marker should stay attached to the first sub-token of each word, so that labels can be aligned with the encodings.