RoBERTa uses a byte-level BPE tokenizer. If you look at the vocabulary (https://huggingface.co/roberta-base/blob/main/vocab.json), most tokens begin with the character "Ġ". This is because the preceding whitespace is considered part of the token. Setting add_prefix_space to True adds a space at the beginning of the input sequence, so that the first token is not mapped to UNK.
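A minimal sketch of why this matters, using a toy vocabulary (not the real roberta-base vocab): most word tokens are stored with a leading "Ġ", so a sequence-initial word with no preceding space fails the lookup.

```python
# Toy vocabulary in the style of byte-level BPE: "Ġ" encodes the
# preceding space. These entries are assumptions for illustration only.
toy_vocab = {"Ġhello": 0, "Ġworld": 1, "<unk>": 2}

def lookup(token, vocab, unk="<unk>"):
    """Map a token string to an id, falling back to the UNK id."""
    return vocab.get(token, vocab[unk])

# "hello" at the start of a sequence has no preceding space, so it does
# not match the "Ġhello" vocabulary entry and falls back to UNK:
print(lookup("hello", toy_vocab))      # → 2 (UNK)

# Prepending a space (which is what add_prefix_space=True effectively
# does) makes the token match the vocabulary entry:
print(lookup("Ġhello", toy_vocab))     # → 0
```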
I think we can add this add_prefix_space argument to the base-level BytePairTokenizer class. This will allow us to use it from all the models that need it (e.g. gpt2 and roberta).
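A rough sketch of how the flag could look on the base class; the class structure and the simplified pre-tokenization here are assumptions, not the actual KerasNLP implementation:

```python
import re

class BytePairTokenizer:
    """Hypothetical sketch of a base BPE tokenizer with add_prefix_space.

    A real implementation would run byte-level BPE merges against a
    vocabulary; this sketch only shows where the flag would plug in.
    """

    def __init__(self, vocabulary=None, add_prefix_space=False):
        self.vocabulary = vocabulary or {}
        self.add_prefix_space = add_prefix_space

    def tokenize(self, text):
        # The flag's whole job: ensure the sequence starts with a space
        # so the first word also gets the "Ġ" marker.
        if self.add_prefix_space and not text.startswith(" "):
            text = " " + text
        # Simplified pre-tokenization: split into space-prefixed pieces
        # and rewrite the leading space as "Ġ".
        pieces = re.findall(r" ?\S+", text)
        return [("Ġ" + p[1:]) if p.startswith(" ") else p for p in pieces]

print(BytePairTokenizer().tokenize("hello world"))
# → ['hello', 'Ġworld']
print(BytePairTokenizer(add_prefix_space=True).tokenize("hello world"))
# → ['Ġhello', 'Ġworld']
```

Putting the flag on the base class means gpt2- and roberta-style presets only differ in the default they pass for add_prefix_space.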
Have a look here for more details: https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer.