Add `from_slow` in fast tokenizers build and fixes some bugs #9987

sgugger · 2021-02-03T21:01:47Z

What does this PR do?

This PR adds a `from_slow` argument to the initialization of `PreTrainedTokenizerFast` to force the conversion from a slow tokenizer. This will help users rebuild the `tokenizer.json` file for some models whose faulty files we can't update right now without breaking backward compatibility (see #9637).
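As a rough illustration of the behaviour described above, here is a minimal, standard-library-only sketch of how a `from_slow` flag could change where a fast tokenizer is built from. The helper `pick_tokenizer_source` and its return values are hypothetical names made up for this example; this is not the library's actual code.

```python
from typing import Optional


def pick_tokenizer_source(
    tokenizer_file: Optional[str],
    has_slow_class: bool,
    from_slow: bool = False,
) -> str:
    """Return which source a fast tokenizer would be built from.

    Hypothetical helper for illustration only: it mirrors the behaviour
    described in the PR text, not the real implementation.
    """
    if from_slow and not has_slow_class:
        # Conversion was requested but there is no slow tokenizer to convert.
        raise ValueError("from_slow=True requires a slow tokenizer to convert from")
    if tokenizer_file is not None and not from_slow:
        # Default path: trust the serialized tokenizer.json file.
        return "load_file"
    if has_slow_class:
        # Forced (or fallback) path: rebuild from the slow tokenizer.
        return "convert_slow"
    raise ValueError("no tokenizer file and no slow tokenizer available")


# A possibly faulty tokenizer.json exists, but the caller forces a rebuild.
print(pick_tokenizer_source("tokenizer.json", has_slow_class=True, from_slow=True))
```

In the real library the flag would presumably be passed as `from_slow=True` when loading the tokenizer, followed by a save to write the rebuilt `tokenizer.json`; treat that as an assumption and check the library documentation for the exact call.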

In passing, it fixes a few bugs:

- wrong formatting in the documentation
- the fast sentencepiece tokenizers don't have an `sp_model` attribute, so the documentation for it is removed
- `BarthezTokenizerFast` was not registered properly in the auto tokenizers, so `AutoTokenizer` was not finding it

n1t0

I say yes!!!

LysandreJik

Thank you for applying these changes. They look good to me!

Add from_slow in fast tokenizers build and fixes some bugs

4da6b42

sgugger requested review from n1t0 and LysandreJik February 3, 2021 21:01

n1t0 approved these changes Feb 3, 2021

LysandreJik approved these changes Feb 4, 2021

LysandreJik merged commit 7898fc0 into master Feb 4, 2021

LysandreJik deleted the tokenizer_from_slow branch February 4, 2021 08:34
