RoBERTa/GPT2 tokenization #1196

Closed · stefan-it opened this issue Sep 4, 2019 · 6 comments

stefan-it (Collaborator) commented Sep 4, 2019

Hi,

I have one question regarding the tokenization logic.

I'm using the RoBERTa tokenizer from fairseq:

In [15]: tokens = roberta.encode("Berlin and Munich have a lot of puppeteer to see .")                                                                                                                                                                

In [16]: tokens                                                                                                                                                                                                                                       
Out[16]: 
tensor([    0, 26795,  2614,     8, 10489,    33,    10,   319,     9, 32986,
         9306,   254,     7,   192,   479,     2])

Interestingly, Berlin is split into two subwords (with ids 26795 and 2614).

When I use the pytorch-transformers implementation:

In [21]: tokens = tokenizer.tokenize("<s>Berlin and Munich have a lot of puppeteer to see .</s>")                                                                                                                                                    

In [22]: indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)                                                                                                                                                                                     

In [23]: indexed_tokens                                                                                                                                                                                                                               
Out[23]: [0, 5459, 8, 10489, 33, 10, 319, 9, 32986, 9306, 254, 7, 192, 479, 2]

Berlin is not split 😅

The roberta.encode method returns a single subword for Berlin when I start the sentence with a space. Which tokenizer is correct here? 🤔
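
(A minimal sketch of the comparison above, assuming fairseq and pytorch-transformers 1.2.0 are both installed; it only reproduces the setup, the token ids quoted earlier come from the original runs.)

import torch
from pytorch_transformers import RobertaTokenizer

# fairseq side: load the published RoBERTa base model via torch.hub
roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')

sentence = "Berlin and Munich have a lot of puppeteer to see ."

# Without a leading space the first word is byte-pair-encoded without the
# space-prefixed merges, so it can split differently than it would mid-sentence.
print(roberta.encode(sentence))
# With a leading space the first word is encoded like any other word.
print(roberta.encode(" " + sentence))

# pytorch-transformers side
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokens = tokenizer.tokenize("<s>" + sentence + "</s>")
print(tokenizer.convert_tokens_to_ids(tokens))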

thomwolf (Member) commented Sep 4, 2019

This is a more complex question than it may seem, but in general I think both will behave pretty similarly in practice.

This is related to the fact that the GPT-2 tokenizer (also used by RoBERTa) expects a space before every word (see this wise note in fairseq about it).

At the beginning of a string there is no such space, which can lead to strange behavior.

Here is an example of the resulting behavior on RoBERTa. You would expect the strings "Berlin and Munich" and "Munich and Berlin" to be tokenized the same way, with only the order of the tokens changed, but they are not:

>>> roberta.encode("Berlin and Munich")
tensor([    0, 26795,  2614,     8, 10489,     2])
>>> roberta.encode("Munich and Berlin")
tensor([   0,  448,  879, 1725,    8, 5459,    2])

In both strings, the first word is split while the second is not.

In our tokenizer, to avoid this behavior, we decided to always add a space at the beginning of the string (multiple spaces don't have an effect, so it is safe to always add one) so that the tokenization is consistent.

A side effect of this (noted in the doc/docstring) is that the encoding/decoding process doesn't preserve the absence of a space at the beginning of a string, but on the other hand the resulting behavior is more consistent:

>>> tokenizer.encode("Berlin and Munich", add_special_tokens=True)
[0, 5459, 8, 10489, 2]
>>> tokenizer.encode("Munich and Berlin", add_special_tokens=True)
[0, 10489, 8, 5459, 2]

That is a short summary from my point of view, but it would be nice, I think, to have @myleott's input on this as well.
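
(A minimal sketch of the normalization rule described above; the helper name is hypothetical and only illustrates the idea of prepending a space so the first word is encoded with the same space-prefixed merges as every other word.)

def normalize_for_gpt2_bpe(text):
    """Hypothetical helper: prepend a single space when the text doesn't
    already start with one, so the first word is byte-pair-encoded the
    same way as a mid-sentence word."""
    return text if text.startswith(" ") else " " + text

# Both strings now start with a space-prefixed first word, so their
# tokenizations should only differ in token order.
print(normalize_for_gpt2_bpe("Berlin and Munich"))   # " Berlin and Munich"
print(normalize_for_gpt2_bpe("Munich and Berlin"))   # " Munich and Berlin"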

stefan-it (Collaborator, Author) commented:

Thanks for your explanation 👍

I just ran an experiment on a downstream task (English NER) and the F1-score decreased by around 0.5% 😟

I'll repeat that experiment with the commit just before 0517e7a (which introduced the whitespace rule) to find out where this performance drop comes from.

stefan-it (Collaborator, Author) commented Sep 5, 2019

Update on that: I checked out 3bcbebd and re-ran my NER experiment. The final F1-score is now 92.26 (consistent with a prior result of 92.31), in contrast to 91.81 for the latest 1.2.0 version 🤔

Would it be possible to add a flag that restores the "original" tokenization? 🤔
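
(For readers arriving later: recent transformers releases expose this choice as the add_prefix_space argument of the RoBERTa/GPT-2 tokenizers. A minimal sketch, assuming a current transformers installation rather than the pytorch-transformers 1.2.0 discussed in this thread.)

from transformers import RobertaTokenizer

text = "Berlin and Munich"

# Default: no prefix space is added, matching fairseq's raw encoding,
# so the first word may be split differently than it would mid-sentence.
tok_default = RobertaTokenizer.from_pretrained("roberta-base")
print(tok_default.tokenize(text))

# Opt in to the "consistent" behavior discussed above: a space is
# prepended so the first word is encoded like a mid-sentence word.
tok_prefix = RobertaTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
print(tok_prefix.tokenize(text))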

thomwolf (Member) commented Sep 5, 2019

We'll see what we can do (cc @LysandreJik @julien-c).

Is this difference significant relative to run-to-run (seed) variability?

stefan-it (Collaborator, Author) commented:

I ran a few more experiments with the same dataset across different runs:

Version   Run 1   Run 2   Run 3   Avg.
1.2.0     91.81   91.82   91.78   91.80
3bcbebd   92.31   92.26   92.38   92.32

On average, the difference is 0.52%.

thomwolf (Member) commented Sep 7, 2019

Thanks a lot for the detailed experiments, Stefan.

The comparison is pretty consistently in favor of the original tokenization, so I guess we will switch back to the fairseq tokenization as the default and add an option for the "consistent" tokenization.

cc @LysandreJik @julien-c
