Potential bug: Tokens with punctuation are re-tokenized although I've set is_split_into_words=True #11333

Closed
kstathou opened this issue Apr 20, 2021 · 4 comments · Fixed by #11449

Comments

@kstathou
Contributor

kstathou commented Apr 20, 2021

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.8.1+cu101 (False)
  • Tensorflow version (GPU?): 2.4.1 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

tokenizers: Hi @LysandreJik!

Information

I am working on a token classification task where my input is in the following format:

texts = [['Foo', 'bar', '.'], ['Hello', 'world', '.']]
tags = [['B-ENT', 'I-ENT', 'O'], ['O', 'O', 'O']]
  • Model: bert-large-cased
  • Tokenizer: BertTokenizerFast. I align tokens with their tags as in this tutorial; a minimal sketch of that alignment is included below.
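
For reference, here is a minimal sketch of the tokenization and alignment, following the tutorial (label2id is a placeholder label map; the notebook's exact code may differ slightly):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")

texts = [['Foo', 'bar', '.'], ['Hello', 'world', '.']]
tags = [['B-ENT', 'I-ENT', 'O'], ['O', 'O', 'O']]
label2id = {'O': 0, 'B-ENT': 1, 'I-ENT': 2}  # placeholder label map

encodings = tokenizer(texts, is_split_into_words=True, truncation=True, padding=True)

# Align each tag with the first subword of its word; mask everything else with -100
aligned_labels = []
for i, sentence_tags in enumerate(tags):
    word_ids = encodings.word_ids(batch_index=i)
    labels, previous_word_id = [], None
    for word_id in word_ids:
        if word_id is None:
            labels.append(-100)  # special tokens ([CLS], [SEP]) and padding
        elif word_id != previous_word_id:
            labels.append(label2id[sentence_tags[word_id]])
        else:
            labels.append(-100)  # later subwords of the same word
        previous_word_id = word_id
    aligned_labels.append(labels)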

Problem

Although I've set is_split_into_words=True in the tokenizer, tokens containing punctuation are still split on the punctuation.

To reproduce

I reproduced the issue in this Google Colab notebook.
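
For convenience, here is a minimal snippet along the same lines as the notebook (the exact subword pieces depend on the vocabulary):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")

words = ['foo(bar', '.']
encoding = tokenizer(words, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))
# The single word 'foo(bar' comes back split at the parenthesis,
# e.g. something like ['[CLS]', 'foo', '(', 'bar', '.', '[SEP]']
# (the exact subword pieces depend on the vocabulary).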

Expected behavior

Since I've set is_split_into_words=True, I would expect the tokenizer to keep the tokens as they are and only split them into WordPiece subwords (prefixed with ##). For example, if a token is 'foo(bar', I would expect it to stay that way instead of being split into ['foo', '(', 'bar'].

Thanks a lot for reading the issue!

@LysandreJik
Member

LysandreJik commented Apr 20, 2021

The is_split_into_words flag skips pre-tokenization (splitting on whitespace), not tokenization. It should be set to True if you have already split your text into individual words and are now looking to have each word split into tokens and converted to IDs.

This seems to be unclear from the documentation; we'll work on improving it.
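
To illustrate (a rough sketch; the exact subword splits depend on the vocabulary):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")

words = ['Transformers', 'are', 'great', '.']
encoding = tokenizer(words, is_split_into_words=True)

# Each input word may still be split into several tokens (punctuation splitting
# plus WordPiece), but word_ids() maps every token back to the word it came from:
print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))
print(encoding.word_ids())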

@kstathou
Contributor Author

Makes sense, thank you for the clarification! I'd be happy to work on this if needed.

@LysandreJik
Member

Yes, we would welcome such a contribution! I guess we would need to find all occurrences of the is_split_into_words parameter and clarify that it only skips pre-tokenization, which is not the full tokenization one might expect.

We would gladly welcome a PR!

@kstathou
Contributor Author

Great, I will work on this next week! Thanks again for the help!
