Enable option for subword regularization in more tokenizers. #11417
Conversation
I found this somewhat obscure function argument called `sample`. It seems to enable subword regularization, but with fixed parameters. I would remove that. What do you think? @sgugger @LysandreJik @stefan-it
PS: Same here: transformers/src/transformers/models/bert_generation/tokenization_bert_generation.py, line 113 in 52166f6
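For context, subword regularization (Kudo, 2018) means sampling one of several plausible segmentations of a word instead of always emitting the single best one. The sketch below illustrates the idea in plain Python; the vocabulary and the weighting scheme are made up for illustration. Real sentencepiece-backed tokenizers delegate this to `SentencePieceProcessor` with sampling options such as `enable_sampling`, `alpha`, and `nbest_size`.

```python
import math
import random

# Toy vocabulary: single characters plus a few multi-character pieces,
# so that a word has many valid segmentations.
VOCAB = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}

def segmentations(word, vocab):
    """Enumerate every way to split `word` into in-vocab pieces."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results

def sample_segmentation(word, vocab, alpha=0.1, rng=random):
    """Sample one segmentation; the exp(-alpha * length) weight is a
    stand-in for real unigram LM scores and mildly prefers fewer pieces."""
    candidates = segmentations(word, vocab)
    weights = [math.exp(-alpha * len(c)) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With sampling enabled, repeated calls can return different splits of the same word, which acts as data augmentation during training; with sampling disabled, a tokenizer should always return the single deterministic segmentation.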
Force-pushed from 095407e to 2508eea.
This argument is not called from anywhere, so it's only accessible if users somehow rewrote the tokenize method to pass it along to the private method.
Force-pushed from 98d8e20 to 728981a.
Yes, removing the `sample` argument sounds good.
Agree!
Force-pushed from 728981a to d44d72f.
Rebase onto upstream/master done.
Force-pushed from 087540f to 708a2f5.
Yes, LGTM! Thanks a lot.
Hey @LysandreJik - this is not done yet. Please do not merge now. ;-)
Oh, I was misled! There are indeed a few tokenizers remaining. Thank you for letting me know!
This is ready to be merged from my point of view.
Can you take care of the merge conflicts? Will review tomorrow :-)
Force-pushed from 42e9375 to 70edf9b.
@sgugger All conflicts resolved and CI is green.
Thanks a lot for working on this!
For the tests, I see a lot of repetition, so I wonder if it would be possible to make the two tests common tests (with a class attribute `test_subword_regularization` in the Tester, False by default, that the classes where we want to test it would set to True). I think it would also be cleaner to have the kwargs passed to the `get_tokenizer` method, so you can use:
tokenizer = self.get_tokenizer(keep_accents=True, sp_model_kwargs={"enable_sampling": True, "alpha": 0.1, "nbest_size": -1})
in your common test.
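The opt-in mixin pattern suggested here can be sketched as follows. The class and method names below mirror the discussion but are illustrative stand-ins, not the actual transformers test code, and `DummySentencePieceTokenizer` is a hypothetical placeholder for a real tokenizer.

```python
import unittest

class TokenizerTesterMixin:
    # Opt-in flag: False by default, flipped to True only by tokenizer
    # testers whose tokenizer supports subword regularization.
    test_subword_regularization = False

    def get_tokenizer(self, **kwargs):
        raise NotImplementedError

    def test_subword_regularization_tokenizer(self):
        if not self.test_subword_regularization:
            self.skipTest("tokenizer does not support subword regularization")
        tokenizer = self.get_tokenizer(
            sp_model_kwargs={"enable_sampling": True, "alpha": 0.1, "nbest_size": -1}
        )
        # The common test can now exercise sampling behavior on `tokenizer`.
        self.assertTrue(tokenizer.sp_model_kwargs["enable_sampling"])

class DummySentencePieceTokenizer:
    """Placeholder tokenizer that just records its sp_model_kwargs."""
    def __init__(self, sp_model_kwargs=None, **kwargs):
        self.sp_model_kwargs = sp_model_kwargs or {}

class DummyTokenizerTest(TokenizerTesterMixin, unittest.TestCase):
    test_subword_regularization = True  # this tester opts in

    def get_tokenizer(self, **kwargs):
        return DummySentencePieceTokenizer(**kwargs)
```

The benefit of routing kwargs through `get_tokenizer` is that the common test stays identical across all testers; each concrete tester only decides whether the test runs and how the tokenizer is constructed.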
Cool, thanks a lot for going through all of those!
Great work on the tests, this is great. The tests could indeed be refactored in a common test if you feel like it.
I will refactor the tests in the next days. Shame on me that I criticized the lack of DRY in the tokenizers but did not follow the DRY principle in the tests.
This is strange:
Will trigger CI again...
@LysandreJik @sgugger Tests are refactored and DRY now. CI is green again. Maybe you want to investigate the flaky test (see my comment above).
The refactor looks good to me, thanks a lot!
Fantastic, thanks a lot @PhilipMay! Very clean PR.
…face#11417)
* improve slow class tok usage at xlm rob
* add subword regularization for barthez
* improve barthez tok. test
* fix tokenizer tests
* add subword regularization for camembert
* add subword regularization for deberta v2 tokenizer
* add more doc to deberta v2 tokenizer
* add subword regularization for speech to text tok.
* fix sp_model_kwargs type in speech 2 text tok.
* add subword regularization for M2M100 tok.
* add more concrete type hints
* fix tests for m2m100 and s2t tok.
* add missing Any import
* fix syntax error in m2m100 tok.
* fix unpickle of m2m100 and s2t tok.
* fix test of m2m100 and s2t tok.
* improve unpickle of deberta v2 tok.
* add test for pickle of barthez & camembert
* fix pickle of barthez & camembert
* add test for deberta v2 tok. pickle
* fix m2m100 tok. pickle
* fix s2t tok. pickle
* add subword regularization to albert tok.
* refactor subword reg. test into TokenizerTesterMixin; improve albert tok. test; remove sample argument from albert tok.; check subword reg. using TokenizerTesterMixin; improve tok. tests; improve xlm roberta tok. tests
* add subword regularization for big bird t.
* improve xlm roberta tok. test
* add subword regularization for mbart50 tok.
* add subword regularization for pegasus tok.
* add subword regularization for reformer tok.
* add subword regularization for T5 tok.
* fix t5 tok. test formatting
* add subword regularization for xlm_proph. tok.
* add subword regularization for xlnet tok.
* add subword regularization for gert_gen tok.
* add typing to tokenizers
* add typing to xlm rob. tok
* add subword regularization for marian tok.
* add reverse tok. test
* fix marian tok test
* fix casing in tok. tests
* fix style of tok. common test
* fix deberta v2 tok test
* add type annotations to tok. tests
* add type annotations to tok. __init__
* add typing to tokenizer
* add type annotations to tok. __init__
* don't specify the default when it's None
* fix barthez tok. doc
* move sentencepiece tok. tests to TokenizerTesterMixin
* fix unused imports
* fix albert tok. test
* add comment to sentencepiece test options
* fix Any import at big bird tok.
* fix Any import at xlm prophetnet tok.
* empty commit to trigger CI
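Several commits above fix pickling of sentencepiece tokenizers. The usual pattern, sketched below, is to drop the unpicklable C-backed processor in `__getstate__` and rebuild it from the stored `sp_model_kwargs` in `__setstate__`. `FakeProcessor` and `SketchTokenizer` are hypothetical stand-ins for `sentencepiece.SentencePieceProcessor` and a real tokenizer class, not the actual transformers code.

```python
import pickle

class FakeProcessor:
    """Stand-in for the C-backed SentencePieceProcessor."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs
    def __reduce__(self):
        # Simulate an object that refuses to be pickled.
        raise TypeError("cannot pickle FakeProcessor")

class SketchTokenizer:
    def __init__(self, sp_model_kwargs=None):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        self.sp_model = FakeProcessor(**self.sp_model_kwargs)

    def __getstate__(self):
        state = self.__dict__.copy()
        state["sp_model"] = None  # drop the unpicklable processor
        return state

    def __setstate__(self, state):
        self.__dict__ = state
        # Rebuild the processor with the same sampling options on unpickle.
        self.sp_model = FakeProcessor(**self.sp_model_kwargs)
```

The key point the pickle fixes address: `sp_model_kwargs` must survive the round trip so a restored tokenizer keeps its subword-regularization settings.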
see #11149 (review)
To-do

- AlbertTokenizer: add `sp_model_kwargs` param with test; `sample`
- BarthezTokenizer: add `sp_model_kwargs` param with test; remove obscure function argument called `sample`
- BertGenerationTokenizer: add `sp_model_kwargs` param with test; `sample`
- BigBirdTokenizer: add `sp_model_kwargs` param with test; `sample`
- CamembertTokenizer: add `sp_model_kwargs` param with test; remove obscure function argument called `sample`
- DebertaV2Tokenizer: add `sp_model_kwargs` param with test; remove obscure function argument called `sample`
- M2M100Tokenizer: add `sp_model_kwargs` param with test; remove obscure function argument called `sample`
- MarianTokenizer (has src and target tokenizer): add `sp_model_kwargs` param with test; remove obscure function argument called `sample`
- MBart50Tokenizer: add `sp_model_kwargs` param with test; remove obscure function argument called `sample`
- PegasusTokenizer: add `sp_model_kwargs` param with test; `sample`
- ReformerTokenizer: add `sp_model_kwargs` param with test; `sample`
- Speech2TextTokenizer: add `sp_model_kwargs` param with test; remove obscure function argument called `sample`
- T5Tokenizer: add `sp_model_kwargs` param with test; `sample`
- XLMProphetNetTokenizer: add `sp_model_kwargs` param with test; remove obscure function argument called `sample`
- XLNetTokenizer: add `sp_model_kwargs` param with test; `sample`
- XLM RoBERTa

General

After review

- None
- test_sentencepiece_skip_back_convert_check