
Enable option for subword regularization in more tokenizers. #11417

Conversation

PhilipMay
Contributor

@PhilipMay PhilipMay commented Apr 24, 2021

see #11149 (review)

To-do

AlbertTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

BarthezTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

BertGenerationTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

BigBirdTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

CamembertTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

DebertaV2Tokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

M2M100Tokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

MarianTokenizer - has separate source and target tokenizers

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

MBart50Tokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

PegasusTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

ReformerTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

Speech2TextTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

T5Tokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

XLMProphetNetTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • check
  • refactor test to follow DRY
  • remove obscure function argument called sample

XLNetTokenizer

  • add sp_model_kwargs param with test
  • add pickle support with test
  • remove obscure function argument called sample
  • check
  • refactor test to follow DRY

XLMRobertaTokenizer

  • refactor test to follow DRY

General

  • check if we changed all tokenizers
  • add typing
  • check if tok. is used in other functions
  • also add changes to XLM RoBERTa tokenizer

After review

  • fix type comments with default None
  • possibly remove test_sentencepiece_skip_back_convert_check

@PhilipMay
Contributor Author

PhilipMay commented Apr 24, 2021

I found a somewhat obscure function argument called sample in AlbertTokenizer:

def _tokenize(self, text, sample=False):

It seems to enable subword regularization but with fixed parameters for nbest_size and alpha.

https://github.com/google/sentencepiece/blob/351600c2971401f4e849147579aa1b5d42f614e1/python/src/sentencepiece/__init__.py#L110-L111

I would remove the sample parameter and replace it with my solution, which is more flexible, but that would be a breaking change. Alternatively, I could add my solution while keeping the sample argument, though that would add more complexity to the code.

What do you think? @sgugger @LysandreJik @stefan-it

PS: The same `def _tokenize(self, text, sample=False):` signature appears in several other tokenizers as well.

@PhilipMay PhilipMay force-pushed the subword_reg_in_more_sentencep_tok branch 3 times, most recently from 095407e to 2508eea Compare April 24, 2021 21:51
@sgugger
Collaborator

sgugger commented Apr 26, 2021

This argument is not called from anywhere, so it's only accessible if users somehow rewrote the tokenize method to pass it along to the private method _tokenize. Therefore I think it's fine to make the breaking change and clean up the code that uses sample=True, but let's see what @patrickvonplaten and @LysandreJik think before going forward (note that Lysandre is on vacation until this Wednesday, so he'll reply at the end of the week :-) ).

@PhilipMay PhilipMay force-pushed the subword_reg_in_more_sentencep_tok branch 3 times, most recently from 98d8e20 to 728981a Compare April 28, 2021 19:08
@LysandreJik
Member

Yes, removing the sample and cleaning up the _tokenize() method sounds good to me. As @sgugger said, it is private and nowhere is a sample or a **kwargs passed to that method.

@patrickvonplaten
Contributor

> Yes, removing the sample and cleaning up the _tokenize() method sounds good to me. As @sgugger said, it is private and nowhere is a sample or a **kwargs passed to that method.

Agree!

@PhilipMay PhilipMay force-pushed the subword_reg_in_more_sentencep_tok branch from 728981a to d44d72f Compare May 1, 2021 04:49
@PhilipMay
Contributor Author

Rebase on upstream/master done.

@PhilipMay PhilipMay force-pushed the subword_reg_in_more_sentencep_tok branch from 087540f to 708a2f5 Compare May 1, 2021 05:33
Member

@LysandreJik LysandreJik left a comment


Yes, LGTM! Thanks a lot.

@PhilipMay
Contributor Author

> Yes, LGTM! Thanks a lot.

Hey @LysandreJik - this is not done yet. Please do not merge now. ;-)

@LysandreJik
Member

Oh, I was misled! There are indeed a few tokenizers remaining. Thank you for letting me know!

@PhilipMay PhilipMay changed the title [WIP] Enable option for subword regularization in more tokenizers. Enable option for subword regularization in more tokenizers. May 9, 2021
@PhilipMay
Contributor Author

This is ready to be merged from my point of view.

@sgugger
Collaborator

sgugger commented May 9, 2021

Can you take care of the merge conflicts? Will review tomorrow :-)

@PhilipMay PhilipMay force-pushed the subword_reg_in_more_sentencep_tok branch from 42e9375 to 70edf9b Compare May 10, 2021 08:57
@PhilipMay
Contributor Author

> Can you take care of the merge conflicts? Will review tomorrow :-)

@sgugger All conflicts resolved and CI is green.

Collaborator

@sgugger sgugger left a comment


Thanks a lot for working on this!

For the tests, I see a lot of repetition, so I wonder if it would be possible to make the two tests common tests (with a class attribute test_subword_regularization in the Tester, False by default, that the classes where we want to test would set to True). I think it would also be cleaner to have the kwargs passed to the get_tokenizer method so you can use:

tokenizer = self.get_tokenizer(keep_accents=True, sp_model_kwargs={"enable_sampling": True, "alpha": 0.1, "nbest_size": -1})

in your common test.
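A minimal sketch of this suggestion, assuming a mixin named like the existing `TokenizerTesterMixin` and a dummy `get_tokenizer` (the real test would assert that sampling produces varying segmentations; the stub only checks that the kwargs are forwarded):

```python
import unittest


class TokenizerTesterMixin:
    # Subclasses opt in to the common subword-regularization test.
    test_subword_regularization = False

    def get_tokenizer(self, **kwargs):
        raise NotImplementedError

    def test_subword_regularization_tokenizer(self):
        if not self.test_subword_regularization:
            self.skipTest("tokenizer has no sentencepiece backend")
        tokenizer = self.get_tokenizer(
            sp_model_kwargs={"enable_sampling": True, "alpha": 0.1, "nbest_size": -1}
        )
        # The real common test would tokenize repeatedly and assert that at
        # least two distinct segmentations occur; the stub checks forwarding.
        self.assertEqual(tokenizer["sp_model_kwargs"]["nbest_size"], -1)


class AlbertLikeTokenizerTest(TokenizerTesterMixin, unittest.TestCase):
    test_subword_regularization = True

    def get_tokenizer(self, **kwargs):
        return kwargs  # stand-in for constructing a real tokenizer


suite = unittest.defaultTestLoader.loadTestsFromTestCase(AlbertLikeTokenizerTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```

Testers that leave the flag at False simply skip the common test, so non-sentencepiece tokenizers need no changes.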

Member

@LysandreJik LysandreJik left a comment


Cool, thanks a lot for going through all of those!

Great work on the tests! They could indeed be refactored into a common test if you feel like it.

@PhilipMay
Contributor Author

> Great work on the tests! They could indeed be refactored into a common test if you feel like it.

I will refactor the tests in the next few days. Shame on me that I criticized the lack of DRY in the tokenizers but did not follow the DRY principle in the tests.

@PhilipMay PhilipMay changed the title Enable option for subword regularization in more tokenizers. [WIP] Enable option for subword regularization in more tokenizers. May 11, 2021
@PhilipMay
Contributor Author

PhilipMay commented May 12, 2021

This is strange:

FAILED tests/test_hf_api.py::HfApiEndpointsTest::test_list_repos_objs - reque...

See here: https://app.circleci.com/pipelines/github/huggingface/transformers/23276/workflows/bf1ad505-efdc-4394-8852-a07702b9f5be/jobs/209965/parallel-runs/0/steps/0-108

Will trigger CI again...

@PhilipMay
Contributor Author

@LysandreJik @sgugger Tests are refactored and DRY now. CI is green again.
IMO ready for merge.

Maybe you want to investigate the flaky test (see my comment above).

@PhilipMay PhilipMay changed the title [WIP] Enable option for subword regularization in more tokenizers. Enable option for subword regularization in more tokenizers. May 12, 2021
Collaborator

@sgugger sgugger left a comment


The refactor looks good to me, thanks a lot!

Member

@LysandreJik LysandreJik left a comment


Fantastic, thanks a lot @PhilipMay! Very clean PR.

@LysandreJik LysandreJik merged commit 37ed3ab into huggingface:master May 13, 2021
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
Enable option for subword regularization in more tokenizers. (huggingface#11417)

* improve slow class tok usage at xlm rob

* add subword regularization for barthez

* improve barthez tok. test

* fix tokenizer tests

* add subword regularization for camembert

* add subword regularization for deberta v2 tokenizer

* add more doc to deberta v2 tokenizer

* add subword regularization for speech to text tok.

* fix sp_model_kwargs type in speech 2 text tok.

* add subword regularization for M2M100 tok.

* add more concrete type hints

* fix tests for m2m100 and s2t tok.

* add missing Any import

* fix syntax error in m2m100 tok.

* fix unpickle of m2m100 and s2t tok.

* fix test of m2m100 and s2t tok.

* improve unpickle of deberta v2 tok.

* add test for pickle of barthez & camembert

* fix pickle of barthez & camembert

* add test for deberta v2 tok. pickle

* fix m2m100 tok. pickle

* fix s2t tok. pickle

* add subword regularization to albert tok.

* refactor subword reg. test into TokenizerTesterMixin

improve albert tok. test

remove sample argument from albert tok.

check subword reg. using TokenizerTesterMixin

improve tok. tests

improve xlm roberta tok. tests

improve xlm roberta tok. tests

* add subword regularization for big bird t.

* improve xlm roberta tok. test

* add subword regularization for mbart50 tok.

* add subword regularization for pegasus tok.

* add subword regularization for reformer tok.

* add subword regularization for T5 tok.

* fix t5 tok. test formatting

* add subword regularization for xlm_proph. tok.

* add subword regularization for xlnet tok.

* add subword regularization for bert_gen tok.

* add typing to tokenizers

* add typing to xlm rob. tok

* add subword regularization for marian tok.

* add reverse tok. test

* fix marian tok test

* fix marian tok test

* fix casing in tok. tests

* fix style of tok. common test

* fix deberta v2 tok test

* add type annotations to tok. tests

* add type annotations to tok. __init__

* add typing to tokenizer

* add type annotations to tok. __init__

* don't specify the default when it's None

* fix barthez tok. doc

* move sentencepiece tok. tests to TokenizerTesterMixin

* fix unused imports

* fix albert tok. test

* add comment to sentencepiece test options

* fix Any import at big bird tok.

* fix Any import at xlm prophetnet tok.

* empty commit to trigger CI