Add an add_prefix_space Arg in RobertaPreprocessor #436

Closed
abheesht17 opened this issue Oct 29, 2022 · 3 comments · Fixed by #715
Comments

@abheesht17
Collaborator

RoBERTa uses the BPE tokenizer. If you look at the vocabulary (https://huggingface.co/roberta-base/blob/main/vocab.json), most tokens begin with this character: "Ġ". This is because the leading whitespace is considered part of the token. Setting add_prefix_space to True will add a space at the beginning of the input sequence, so that the first token is not mapped to UNK.

Have a look here for more details: https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer.
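
As a quick illustration, here is roughly how the flag behaves with the Hugging Face tokenizer (a minimal sketch; the printed tokens are the expected output for roberta-base, not something verified against keras-nlp):

```python
from transformers import RobertaTokenizer

# Without a leading space, the first word uses the non-space-prefixed vocab entry.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenizer.tokenize("Hello world"))  # ['Hello', 'Ġworld']

# With add_prefix_space=True, the first word also gets the "Ġ" variant,
# matching how it would be tokenized mid-sentence.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
print(tokenizer.tokenize("Hello world"))  # ['ĠHello', 'Ġworld']
```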

@chenmoneygithub
Contributor

@abheesht17 Thanks for reporting the issue!

I think the current BytePairTokenizer implementation respects whitespace. Does it fail in your test case? The corresponding code is here: https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/tokenizers/byte_pair_tokenizer.py#L46
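
To sketch the behavior in question (a simplified stand-in for the real split pattern linked above, not the actual keras-nlp regex):

```python
import re

# Simplified pre-tokenization: keep an optional leading space attached to each word.
words = re.findall(r" ?\S+", "the quick brown fox")
print(words)  # ['the', ' quick', ' brown', ' fox']

# The first word has no leading space, so it maps to the vocab entry without
# the "Ġ" prefix, while every later word maps to a "Ġ"-prefixed entry.
```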

@shivance
Collaborator

shivance commented Jan 10, 2023

Is this issue still open? @chenmoneygithub @mattdangerw @jbischof

@mattdangerw
Member

Sure is! Would you like to take it on?

I think we can add this add_prefix_space argument to the base level BytePairTokenizer class. This will allow us to use it from all the models that need it (e.g. gpt2 and roberta).
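
A minimal sketch of how such an option could hook into the tokenizer's preprocessing step (hypothetical helper name; not the actual keras-nlp implementation):

```python
import tensorflow as tf

def maybe_add_prefix_space(inputs, add_prefix_space=False):
    # Prepend a single space so the first word picks up the space-prefixed
    # ("Ġ") BPE variant, matching how it is tokenized mid-sentence.
    if add_prefix_space:
        return tf.strings.join([" ", inputs])
    return inputs

print(maybe_add_prefix_space(tf.constant(["hello world"]), add_prefix_space=True))
# tf.Tensor([b' hello world'], shape=(1,), dtype=string)
```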
