
Enable option for subword regularization in more tokenizers. #11417

Conversation

PhilipMay
Contributor

@PhilipMay PhilipMay commented Apr 24, 2021

see #11149 (review)

To-do

AlbertTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

BarthezTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

BertGenerationTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

BigBirdTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

CamembertTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

DebertaV2Tokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

M2M100Tokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

MarianTokenizer - has separate source and target tokenizers

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

MBart50Tokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

PegasusTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

ReformerTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

Speech2TextTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

T5Tokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

XLMProphetNetTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

XLNetTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

XLMRobertaTokenizer

  • refactor test to follow DRY

General

  • check if we changed all tokenizers
  • add typing
  • check if tok. is used in other functions
  • also add changes to XLM RoBERTa tokenizer

After review

  • fix type comments with default None
  • possibly remove test_sentencepiece_skip_back_convert_check

@PhilipMay
Contributor Author

PhilipMay commented Apr 24, 2021

I found a somewhat obscure function argument called sample in AlbertTokenizer:

def _tokenize(self, text, sample=False):

It seems to enable subword regularization but with fixed parameters for nbest_size and alpha.

https://github.com/google/sentencepiece/blob/351600c2971401f4e849147579aa1b5d42f614e1/python/src/sentencepiece/__init__.py#L110-L111

I would remove the sample parameter and replace it with my solution, which is more flexible, but that would be a breaking change. Alternatively, I could add my solution while keeping the sample argument, though that would add more complexity to the code.

What do you think? @sgugger @LysandreJik @stefan-it

PS: The same `def _tokenize(self, text, sample=False):` signature appears in several other tokenizers as well.

@PhilipMay PhilipMay force-pushed the subword_reg_in_more_sentencep_tok branch 3 times, most recently from 095407e to 2508eea Compare April 24, 2021 21:51
@sgugger
Collaborator

sgugger commented Apr 26, 2021

This argument is not called from anywhere, so it's only accessible if users somehow rewrote the tokenize method to pass it along to the private method _tokenize. Therefore I think it's fine to make the breaking change and clean up the code that uses sample=True, but let's see what @patrickvonplaten and @LysandreJik think before going forward (note that Lysandre is on vacation until this Wednesday, so he'll reply at the end of the week :-) ).

@PhilipMay PhilipMay force-pushed the subword_reg_in_more_sentencep_tok branch 3 times, most recently from 98d8e20 to 728981a Compare April 28, 2021 19:08
@LysandreJik
Member

Yes, removing the sample and cleaning up the _tokenize() method sounds good to me. As @sgugger said, it is private and nowhere is a sample or a **kwargs passed to that method.

@patrickvonplaten
Contributor

> Yes, removing the sample and cleaning up the _tokenize() method sounds good to me. As @sgugger said, it is private and nowhere is a sample or a **kwargs passed to that method.

Agree!

@PhilipMay PhilipMay force-pushed the subword_reg_in_more_sentencep_tok branch from 728981a to d44d72f Compare May 1, 2021 04:49
@PhilipMay
Contributor Author

Rebase on upstream/master done.

@PhilipMay PhilipMay force-pushed the subword_reg_in_more_sentencep_tok branch from 087540f to 708a2f5 Compare May 1, 2021 05:33
Member

@LysandreJik LysandreJik left a comment


Yes, LGTM! Thanks a lot.

@PhilipMay
Contributor Author

> Yes, LGTM! Thanks a lot.

Hey @LysandreJik - this is not done yet. Please do not merge now. ;-)

@LysandreJik
Member

Oh, I was misled! There are indeed a few tokenizers remaining. Thank you for letting me know!

@PhilipMay PhilipMay changed the title [WIP] Enable option for subword regularization in more tokenizers. Enable option for subword regularization in more tokenizers. May 9, 2021
@PhilipMay
Contributor Author

This is ready to be merged from my point of view.

@sgugger
Collaborator

sgugger commented May 9, 2021

Can you take care of the merge conflicts? Will review tomorrow :-)

@PhilipMay PhilipMay force-pushed the subword_reg_in_more_sentencep_tok branch from 42e9375 to 70edf9b Compare May 10, 2021 08:57
@PhilipMay
Contributor Author

> Can you take care of the merge conflicts? Will review tomorrow :-)

@sgugger All conflicts resolved and CI is green.

Collaborator

@sgugger sgugger left a comment


Thanks a lot for working on this!

For the tests, I see a lot of repetition, so I wonder if it would be possible to make the two tests common tests (with a class attribute test_subword_regularization in the Tester, False by default, that the classes where we want to test would set to True). I think it would also be cleaner to have the kwargs passed to the get_tokenizer method so you can use:

tokenizer = self.get_tokenizer(keep_accents=True, sp_model_kwargs={"enable_sampling": True, "alpha": 0.1, "nbest_size": -1})

in your common test.
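A minimal sketch of this suggestion, assuming a mixin named like the existing `TokenizerTesterMixin` and a dummy `get_tokenizer` (the real test would assert that sampling produces varying segmentations; the stub only checks that the kwargs are forwarded):

```python
import unittest


class TokenizerTesterMixin:
    # Subclasses opt in to the common subword-regularization test.
    test_subword_regularization = False

    def get_tokenizer(self, **kwargs):
        raise NotImplementedError

    def test_subword_regularization_tokenizer(self):
        if not self.test_subword_regularization:
            self.skipTest("tokenizer has no sentencepiece backend")
        tokenizer = self.get_tokenizer(
            sp_model_kwargs={"enable_sampling": True, "alpha": 0.1, "nbest_size": -1}
        )
        # The real common test would tokenize repeatedly and assert that at
        # least two distinct segmentations occur; the stub checks forwarding.
        self.assertEqual(tokenizer["sp_model_kwargs"]["nbest_size"], -1)


class AlbertLikeTokenizerTest(TokenizerTesterMixin, unittest.TestCase):
    test_subword_regularization = True

    def get_tokenizer(self, **kwargs):
        return kwargs  # stand-in for constructing a real tokenizer


suite = unittest.defaultTestLoader.loadTestsFromTestCase(AlbertLikeTokenizerTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```

Testers that leave the flag at False simply skip the common test, so non-sentencepiece tokenizers need no changes.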

Member

@LysandreJik LysandreJik left a comment


Cool, thanks a lot for going through all of those!

Great work on the tests! They could indeed be refactored into a common test if you feel like it.

@PhilipMay
Contributor Author

> Great work on the tests! They could indeed be refactored into a common test if you feel like it.

I will refactor the tests in the next few days. Shame on me that I criticized the lack of DRY in the tokenizers but did not follow the DRY principle in the tests.

@PhilipMay PhilipMay changed the title Enable option for subword regularization in more tokenizers. [WIP] Enable option for subword regularization in more tokenizers. May 11, 2021
@PhilipMay
Contributor Author

PhilipMay commented May 12, 2021

This is strange:

FAILED tests/test_hf_api.py::HfApiEndpointsTest::test_list_repos_objs - reque...

See here: https://app.circleci.com/pipelines/github/huggingface/transformers/23276/workflows/bf1ad505-efdc-4394-8852-a07702b9f5be/jobs/209965/parallel-runs/0/steps/0-108

Will trigger CI again...

@PhilipMay
Contributor Author

@LysandreJik @sgugger Tests are refactored and DRY now. CI is green again.
IMO ready for merge.

Maybe you want to investigate the flaky test (see my comment above).

@PhilipMay PhilipMay changed the title [WIP] Enable option for subword regularization in more tokenizers. Enable option for subword regularization in more tokenizers. May 12, 2021
Collaborator

@sgugger sgugger left a comment


The refactor looks good to me, thanks a lot!

Member

@LysandreJik LysandreJik left a comment


Fantastic, thanks a lot @PhilipMay! Very clean PR.

@LysandreJik LysandreJik merged commit 37ed3ab into huggingface:master May 13, 2021
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
Enable option for subword regularization in more tokenizers. (huggingface#11417)

* improve slow class tok usage at xlm rob

* add subword regularization for barthez

* improve barthez tok. test

* fix tokenizer tests

* add subword regularization for camembert

* add subword regularization for deberta v2 tokenizer

* add more doc to deberta v2 tokenizer

* add subword regularization for speech to text tok.

* fix sp_model_kwargs type in speech 2 text tok.

* add subword regularization for M2M100 tok.

* add more concrete type hints

* fix tests for m2m100 and s2t tok.

* add missing Any import

* fix syntax error in m2m100 tok.

* fix unpickle of m2m100 and s2t tok.

* fix test of m2m100 and s2t tok.

* improve unpickle of deberta v2 tok.

* add test for pickle of barthez & camembert

* fix pickle of barthez & camembert

* add test for deberta v2 tok. pickle

* fix m2m100 tok. pickle

* fix s2t tok. pickle

* add subword regularization to albert tok.

* refactor subword reg. test into TokenizerTesterMixin

improve albert tok. test

remove sample argument from albert tok.

check subword reg. using TokenizerTesterMixin

improve tok. tests

improve xlm roberta tok. tests

improve xlm roberta tok. tests

* add subword regularization for big bird t.

* improve xlm roberta tok. test

* add subword regularization for mbart50 tok.

* add subword regularization for pegasus tok.

* add subword regularization for reformer tok.

* add subword regularization for T5 tok.

* fix t5 tok. test formatting

* add subword regularization for xlm_proph. tok.

* add subword regularization for xlnet tok.

* add subword regularization for bert_gen tok.

* add typing to tokenizers

* add typing to xlm rob. tok

* add subword regularization for marian tok.

* add reverse tok. test

* fix marian tok test

* fix marian tok test

* fix casing in tok. tests

* fix style of tok. common test

* fix deberta v2 tok test

* add type annotations to tok. tests

* add type annotations to tok. __init__

* add typing to tokenizer

* add type annotations to tok. __init__

* don't specify the default when it's None

* fix barthez tok. doc

* move sentencepiece tok. tests to TokenizerTesterMixin

* fix unused imports

* fix albert tok. test

* add comment to sentencepiece test options

* fix Any import at big bird tok.

* fix Any import at xlm prophetnet tok.

* empty commit to trigger CI