Add an add_prefix_space Arg in RobertaPreprocessor #436

Closed
abheesht17 opened this issue Oct 29, 2022 · 3 comments · Fixed by #715
Comments

@abheesht17
Collaborator

RoBERTa uses the BPE tokenizer. If you look at the vocabulary (https://huggingface.co/roberta-base/blob/main/vocab.json), most tokens begin with this character: "Ġ". This is because the leading whitespace is considered part of the token. Setting add_prefix_space to True will add a space at the beginning of the input sequence, so that the first token is not mapped to UNK.

Have a look here for more details: https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer.
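
As a quick illustration, here is roughly how the flag behaves with the Hugging Face tokenizer (a minimal sketch; the printed tokens are the expected output for roberta-base, not something verified against keras-nlp):

```python
from transformers import RobertaTokenizer

# Without a leading space, the first word uses the non-space-prefixed vocab entry.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenizer.tokenize("Hello world"))  # ['Hello', 'Ġworld']

# With add_prefix_space=True, the first word also gets the "Ġ" variant,
# matching how it would be tokenized mid-sentence.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
print(tokenizer.tokenize("Hello world"))  # ['ĠHello', 'Ġworld']
```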

@chenmoneygithub
Contributor

@abheesht17 Thanks for reporting the issue!

I think the current BytePairTokenizer implementation respects whitespace. Does it fail in your test case? The corresponding code is here: https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/tokenizers/byte_pair_tokenizer.py#L46
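
To sketch the behavior in question (a simplified stand-in for the real split pattern linked above, not the actual keras-nlp regex):

```python
import re

# Simplified pre-tokenization: keep an optional leading space attached to each word.
words = re.findall(r" ?\S+", "the quick brown fox")
print(words)  # ['the', ' quick', ' brown', ' fox']

# The first word has no leading space, so it maps to the vocab entry without
# the "Ġ" prefix, while every later word maps to a "Ġ"-prefixed entry.
```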

@shivance
Collaborator

shivance commented Jan 10, 2023

Is this issue still open? @chenmoneygithub @mattdangerw @jbischof

@mattdangerw
Member

Sure is! Would you like to take it on?

I think we can add this add_prefix_space argument to the base level BytePairTokenizer class. This will allow us to use it from all the models that need it (e.g. gpt2 and roberta).
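
A minimal sketch of how such an option could hook into the tokenizer's preprocessing step (hypothetical helper name; not the actual keras-nlp implementation):

```python
import tensorflow as tf

def maybe_add_prefix_space(inputs, add_prefix_space=False):
    # Prepend a single space so the first word picks up the space-prefixed
    # ("Ġ") BPE variant, matching how it is tokenized mid-sentence.
    if add_prefix_space:
        return tf.strings.join([" ", inputs])
    return inputs

print(maybe_add_prefix_space(tf.constant(["hello world"]), add_prefix_space=True))
# tf.Tensor([b' hello world'], shape=(1,), dtype=string)
```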
