is_split_into_words skips pre-tokenization (splitting on whitespace), not tokenization. This flag should be set to True if you have already split your text into individual words and now want each word to be split into tokens and converted to IDs.
This seems to be unclear from the documentation; we'll work on improving it.
Yes, we would welcome such a contribution! I guess we would need to find all occurrences of the is_split_into_words parameter and clarify that pre-tokenization is not the same thing as tokenization, which is what one might otherwise expect.
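For illustration, here is a minimal sketch of that distinction, using the bert-large-cased checkpoint named in the issue below (the two-word comparison is only an example; the exact subword pieces depend on the vocabulary):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")

# Passing a plain string: the tokenizer pre-tokenizes (splits the string
# into words) and then tokenizes each word into subwords and IDs.
tokens_from_string = tokenizer("Hello world").tokens()

# Passing pre-split words with is_split_into_words=True: only the
# splitting-into-words step is skipped; each word still goes through
# tokenization, so the result is the same as above.
tokens_from_words = tokenizer(["Hello", "world"], is_split_into_words=True).tokens()

assert tokens_from_string == tokens_from_words
print(tokens_from_words)
```

In other words, pre-splitting the input only replaces the whitespace-splitting step; it does not turn each supplied word into a single token.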
Environment info
transformers version: 4.5.1

Who can help
tokenizers: @LysandreJik

Information
Hi @LysandreJik! I am working on a token classification task where the input is already split into words. I use the bert-large-cased model with BertTokenizerFast, and I align tokens with their tags as in this tutorial.
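As a rough sketch of the kind of alignment meant here (the words, tags, and label scheme below are made up for illustration; the mapping relies on word_ids(), which fast tokenizers expose):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")

# Hypothetical pre-split input with one tag per word (illustrative only).
words = ["Alice", "works", "at", "Example", "Corp"]
tags = ["B-PER", "O", "O", "B-ORG", "I-ORG"]

encoding = tokenizer(words, is_split_into_words=True)

# word_ids() maps every produced token back to the index of the word it
# came from, or None for special tokens such as [CLS] and [SEP].
aligned_tags = [
    tags[word_id] if word_id is not None else None
    for word_id in encoding.word_ids()
]
print(list(zip(encoding.tokens(), aligned_tags)))
```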
Problem
Although I've set is_split_into_words=True in the tokenizer, tokens that contain punctuation are still split at the punctuation.

To reproduce
I reproduced the issue in this Google Colab notebook.
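For reference, a minimal inline sketch of the same reproduction (the Colab notebook is the full version; this only assumes the checkpoint named above):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")

# The word 'foo(bar' is passed as a single, already-split word.
encoding = tokenizer(["foo(bar"], is_split_into_words=True)

# The punctuation is still split off, e.g. something like
# ['[CLS]', 'foo', '(', 'bar', '[SEP]'] rather than a single 'foo(bar'
# piece (the exact subwords depend on the vocabulary).
print(encoding.tokens())
```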
Expected behavior
Since I've set is_split_into_words=True, I would expect the tokenizer to keep the tokens as they are and split them into subwords with ##. For example, if a token is 'foo(bar', I would expect it to stay that way, instead of being split into ['foo', '(', 'bar'].

Thanks a lot for reading the issue!