RoBERTa/GPT2 tokenization #1196
This is a more complex question than it may seem, but in general I think both will be pretty similar in practice. It is related to the fact that the GPT-2 tokenizer (also used by RoBERTa) requires a space before all words (see this wise note in fairseq about it). At the beginning of a string there is no space, which can result in strange behavior. Here is an example of the resulting behavior on RoBERTa: you would expect a string to tokenize the same way with or without a leading space, but it doesn't.
In this example, the first word is split and not the second. In our tokenizer, to avoid this behavior, we decided to always add a space at the beginning of a string (multiple spaces don't have an effect, so it's fine to always add one) so that the tokenization is consistent. A side effect of this (indicated in the doc/docstring) is that encoding and then decoding doesn't preserve the absence of a space at the beginning of a string, but on the other hand the resulting behavior is more consistent.
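A minimal sketch of that leading-space effect, using the byte-level BPE that GPT-2 and RoBERTa share (the model name and the exact subword splits in the comments are assumptions and depend on the vocabulary and library version):

```python
from transformers import GPT2Tokenizer  # named pytorch_transformers at the time of this issue

# RoBERTa reuses GPT-2's byte-level BPE, so the same leading-space effect applies.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# No leading space: the first word carries no "Ġ" (space) marker, so it may be
# broken into several pieces, e.g. roughly ["Ber", "lin", "Ġand", "ĠMunich"].
print(tokenizer.tokenize("Berlin and Munich"))

# Leading space: every word, including the first, carries the "Ġ" marker, so
# "Berlin" is typically kept as a single piece such as "ĠBerlin".
print(tokenizer.tokenize(" Berlin and Munich"))
```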
Here is a short discussion from my point of view, but it would be nice, I think, to have @myleott's input on this as well.
Thanks for your explanation 👍 I just ran an experiment for a downstream task (English NER) and the F1-score decreased by around 0.5% 😟 I'll repeat that experiment with the commit before 0517e7a (which introduced the whitespace rule) to find out where this performance drop comes from.
Update on that: I used 3bcbebd and re-ran my experiment on NER. Now the final F1-score is 92.26 (consistent with a prior result of 92.31), in contrast to 91.81 for the latest 1.2.0 version 🤔 Would it be possible to add a flag that uses the "original" tokenization 🤔
We'll see what we can do (cc @LysandreJik @julien-c). Is this difference significant relative to the run-to-run variability across seeds?
I ran a few more experiments with the same dataset across different runs. On average, the difference is 0.52%.
Thanks a lot for the detailed experiments, Stefan. The comparison is pretty consistently in favor of the original tokenization, so I guess we will switch back to the fairseq tokenization as the default and add an option to use the "consistent" tokenization.
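For reference, a rough sketch of what such an option could look like; the `add_prefix_space` parameter shown here is the name used in later releases of the library and is an assumption, not necessarily the exact flag added by the fix discussed above:

```python
from transformers import RobertaTokenizer

# Default: behaves like fairseq, no space is silently prepended to the input,
# so the first word of a string can be split differently from later words.
fairseq_like = RobertaTokenizer.from_pretrained("roberta-base")
print(fairseq_like.tokenize("Berlin and Munich"))

# Opt in to the "consistent" tokenization by always adding a prefix space.
consistent = RobertaTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
print(consistent.tokenize("Berlin and Munich"))
```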
Hi,

I have one question regarding the tokenization logic. I'm using the RoBERTa tokenizer from fairseq. Interestingly, Berlin is split into two subwords (with ids 26795 and 2614). When I use the pytorch-transformers implementation, Berlin is not split 😅

The roberta.encode method in fairseq returns a single subword for Berlin when I start the sentence with a space - which tokenizer is correct here 🤔
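A rough sketch of the comparison described above (the example sentence and checkpoint names are illustrative; the ids 26795 and 2614 mentioned in the report depend on the fairseq dictionary):

```python
import torch
from pytorch_transformers import RobertaTokenizer  # the library version discussed in this issue

sentence = "Berlin and Munich have a lot to see ."

# fairseq: load RoBERTa via torch.hub and encode the raw string. With no leading
# space, the first word can come back as two subword ids.
roberta = torch.hub.load("pytorch/fairseq", "roberta.large")
print(roberta.encode(sentence))

# pytorch-transformers 1.2.0: the tokenizer prepends a space before encoding,
# so "Berlin" stays a single subword.
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
print(tokenizer.encode(sentence))
```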