RoBERTa/GPT2 tokenization #1196

Closed · stefan-it opened this issue Sep 4, 2019 · 6 comments

stefan-it (Collaborator) commented Sep 4, 2019

Hi,

I have one question regarding the tokenization logic.

I'm using the RoBERTa tokenizer from fairseq:

In [15]: tokens = roberta.encode("Berlin and Munich have a lot of puppeteer to see .")                                                                                                                                                                

In [16]: tokens                                                                                                                                                                                                                                       
Out[16]: 
tensor([    0, 26795,  2614,     8, 10489,    33,    10,   319,     9, 32986,
         9306,   254,     7,   192,   479,     2])

Interestingly, Berlin is split into two subwords (with ids 26795 and 2614).

When I use the pytorch-transformers implementation:

In [21]: tokens = tokenizer.tokenize("<s>Berlin and Munich have a lot of puppeteer to see .</s>")                                                                                                                                                    

In [22]: indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)                                                                                                                                                                                     

In [23]: indexed_tokens                                                                                                                                                                                                                               
Out[23]: [0, 5459, 8, 10489, 33, 10, 319, 9, 32986, 9306, 254, 7, 192, 479, 2]

Berlin is not split 😅

The roberta.encode method returns a single subword for Berlin when I start the sentence with a space. Which tokenizer is correct here? 🤔
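
(A minimal sketch of the comparison above, assuming fairseq and pytorch-transformers 1.2.0 are both installed; it only reproduces the setup, the token ids quoted earlier come from the original runs.)

import torch
from pytorch_transformers import RobertaTokenizer

# fairseq side: load the published RoBERTa base model via torch.hub
roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')

sentence = "Berlin and Munich have a lot of puppeteer to see ."

# Without a leading space the first word is byte-pair-encoded without the
# space-prefixed merges, so it can split differently than it would mid-sentence.
print(roberta.encode(sentence))
# With a leading space the first word is encoded like any other word.
print(roberta.encode(" " + sentence))

# pytorch-transformers side
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokens = tokenizer.tokenize("<s>" + sentence + "</s>")
print(tokenizer.convert_tokens_to_ids(tokens))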

thomwolf (Member) commented Sep 4, 2019

This is a more complex question than it may seem, but in general I think both will behave pretty similarly in practice.

This is related to the fact that the GPT-2 tokenizer (also used by RoBERTa) expects a space before every word (see this wise note in fairseq about it).

At the beginning of a string there is no such space, which can lead to strange behavior.

Here is an example of the resulting behavior on RoBERTa. You would expect the strings "Berlin and Munich" and "Munich and Berlin" to be tokenized the same way, with only the order of the tokens changed, but they are not:

>>> roberta.encode("Berlin and Munich")
tensor([    0, 26795,  2614,     8, 10489,     2])
>>> roberta.encode("Munich and Berlin")
tensor([   0,  448,  879, 1725,    8, 5459,    2])

In both strings, the first word is split while the second is not.

In our tokenizer, to avoid this behavior, we decided to always add a space at the beginning of the string (multiple spaces don't have an effect, so it is safe to always add one) so that the tokenization is consistent.

A side effect of this (noted in the doc/docstring) is that the encoding/decoding process doesn't preserve the absence of a space at the beginning of a string, but on the other hand the resulting behavior is more consistent:

>>> tokenizer.encode("Berlin and Munich", add_special_tokens=True)
[0, 5459, 8, 10489, 2]
>>> tokenizer.encode("Munich and Berlin", add_special_tokens=True)
[0, 10489, 8, 5459, 2]

That is a short summary from my point of view, but it would be nice, I think, to have @myleott's input on this as well.
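
(A minimal sketch of the normalization rule described above; the helper name is hypothetical and only illustrates the idea of prepending a space so the first word is encoded with the same space-prefixed merges as every other word.)

def normalize_for_gpt2_bpe(text):
    """Hypothetical helper: prepend a single space when the text doesn't
    already start with one, so the first word is byte-pair-encoded the
    same way as a mid-sentence word."""
    return text if text.startswith(" ") else " " + text

# Both strings now start with a space-prefixed first word, so their
# tokenizations should only differ in token order.
print(normalize_for_gpt2_bpe("Berlin and Munich"))   # " Berlin and Munich"
print(normalize_for_gpt2_bpe("Munich and Berlin"))   # " Munich and Berlin"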

stefan-it (Collaborator, Author) commented:

Thanks for your explanation 👍

I just ran an experiment on a downstream task (English NER) and the F1-score decreased by around 0.5% 😟

I'll repeat that experiment with the commit just before 0517e7a (which introduced the whitespace rule) to find out where this performance drop comes from.

stefan-it (Collaborator, Author) commented Sep 5, 2019

Update on that: I checked out 3bcbebd and re-ran my NER experiment. The final F1-score is now 92.26 (consistent with a prior result of 92.31), in contrast to 91.81 for the latest 1.2.0 version 🤔

Would it be possible to add a flag that restores the "original" tokenization? 🤔
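
(For readers arriving later: recent transformers releases expose this choice as the add_prefix_space argument of the RoBERTa/GPT-2 tokenizers. A minimal sketch, assuming a current transformers installation rather than the pytorch-transformers 1.2.0 discussed in this thread.)

from transformers import RobertaTokenizer

text = "Berlin and Munich"

# Default: no prefix space is added, matching fairseq's raw encoding,
# so the first word may be split differently than it would mid-sentence.
tok_default = RobertaTokenizer.from_pretrained("roberta-base")
print(tok_default.tokenize(text))

# Opt in to the "consistent" behavior discussed above: a space is
# prepended so the first word is encoded like a mid-sentence word.
tok_prefix = RobertaTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
print(tok_prefix.tokenize(text))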

thomwolf (Member) commented Sep 5, 2019

We'll see what we can do (cc @LysandreJik @julien-c).

Is this difference significant relative to run-to-run (seed) variability?

stefan-it (Collaborator, Author) commented:

I ran a few more experiments with the same dataset across different runs:

Version   Run 1   Run 2   Run 3   Avg.
1.2.0     91.81   91.82   91.78   91.80
3bcbebd   92.31   92.26   92.38   92.32

On average, the difference is 0.52%.

thomwolf (Member) commented Sep 7, 2019

Thanks a lot for the detailed experiments, Stefan.

The comparison is pretty consistently in favor of the original tokenization, so I guess we will switch back to the fairseq tokenization as the default and add an option for the "consistent" tokenization.

cc @LysandreJik @julien-c
