Potential bug: Tokens with punctuation are re-tokenized although I've set is_split_into_words=True #11333

Closed
kstathou opened this issue Apr 20, 2021 · 4 comments · Fixed by #11449

Comments

@kstathou
Contributor

kstathou commented Apr 20, 2021

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.8.1+cu101 (False)
  • Tensorflow version (GPU?): 2.4.1 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

tokenizers: Hi @LysandreJik!

Information

I am working on a token classification task where my input is in the following format:

texts = [['Foo', 'bar', '.'], ['Hello', 'world', '.']]
tags = [['B-ENT', 'I-ENT', 'O'], ['O', 'O', 'O']]
  • Model: bert-large-cased
  • Tokenizer: BertTokenizerFast. I align tokens with their tags as in this tutorial; a minimal sketch of that alignment is included below.
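
For reference, here is a minimal sketch of the tokenization and alignment, following the tutorial (label2id is a placeholder label map; the notebook's exact code may differ slightly):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")

texts = [['Foo', 'bar', '.'], ['Hello', 'world', '.']]
tags = [['B-ENT', 'I-ENT', 'O'], ['O', 'O', 'O']]
label2id = {'O': 0, 'B-ENT': 1, 'I-ENT': 2}  # placeholder label map

encodings = tokenizer(texts, is_split_into_words=True, truncation=True, padding=True)

# Align each tag with the first subword of its word; mask everything else with -100
aligned_labels = []
for i, sentence_tags in enumerate(tags):
    word_ids = encodings.word_ids(batch_index=i)
    labels, previous_word_id = [], None
    for word_id in word_ids:
        if word_id is None:
            labels.append(-100)  # special tokens ([CLS], [SEP]) and padding
        elif word_id != previous_word_id:
            labels.append(label2id[sentence_tags[word_id]])
        else:
            labels.append(-100)  # later subwords of the same word
        previous_word_id = word_id
    aligned_labels.append(labels)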

Problem

Although I've set is_split_into_words=True in the tokenizer, tokens containing punctuation are still split on the punctuation.

To reproduce

I reproduced the issue in this Google Colab notebook.
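
For convenience, here is a minimal snippet along the same lines as the notebook (the exact subword pieces depend on the vocabulary):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")

words = ['foo(bar', '.']
encoding = tokenizer(words, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))
# The single word 'foo(bar' comes back split at the parenthesis,
# e.g. something like ['[CLS]', 'foo', '(', 'bar', '.', '[SEP]']
# (the exact subword pieces depend on the vocabulary).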

Expected behavior

Since I've set is_split_into_words=True, I would expect the tokenizer to keep the tokens as they are and only split them into WordPiece subwords (prefixed with ##). For example, if a token is 'foo(bar', I would expect it to stay that way instead of being split into ['foo', '(', 'bar'].

Thanks a lot for reading the issue!

@LysandreJik
Member

LysandreJik commented Apr 20, 2021

The is_split_into_words flag skips pre-tokenization (splitting on whitespace), not tokenization. It should be set to True if you have already split your text into individual words and are now looking to have each word split into tokens and converted to IDs.

This seems to be unclear from the documentation; we'll work on improving it.
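
To illustrate (a rough sketch; the exact subword splits depend on the vocabulary):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")

words = ['Transformers', 'are', 'great', '.']
encoding = tokenizer(words, is_split_into_words=True)

# Each input word may still be split into several tokens (punctuation splitting
# plus WordPiece), but word_ids() maps every token back to the word it came from:
print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))
print(encoding.word_ids())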

@kstathou
Contributor Author

Makes sense, thank you for the clarification! I'd be happy to work on this if needed.

@LysandreJik
Member

Yes, we would welcome such a contribution! I guess we would need to find all occurrences of the is_split_into_words parameter and clarify that it only skips pre-tokenization, which is not the full tokenization one might expect.

We would gladly welcome a PR!

@kstathou
Contributor Author

Great, I will work on this next week! Thanks again for the help!
