Move `from_preset` to base tokenizer classes #673

shivance · 2023-01-17T16:25:05Z

Closes #648

shivance · 2023-01-17T16:25:36Z

@jbischof Please review.

mattdangerw

Thank you!!

The actual runnable code changes look good, just one minor comment.

We will likely need to do something fancier for docstrings though. Will think through this a bit and post more here.

mattdangerw · 2023-01-17T20:10:39Z

keras_nlp/tokenizers/byte_pair_tokenizer.py

@@ -536,3 +537,57 @@ def _bpe_merge_and_update_cache(self, tokens):
            tokenized_words, axis=1, separator=" "
        )
        self.cache.insert(tokens, tokenized_words)
+


Each of these base classes should probably define an empty `presets` property now. E.g.

https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/backbone.py#L33-L35
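For illustration, a minimal sketch of what an empty presets property on a base class could look like. The `classproperty` helper, `BaseTokenizer`, and `ToyTokenizer` below are hypothetical stand-ins written for this example, not the library's actual code.

```python
class classproperty(property):
    """Hypothetical stand-in for a class-level property decorator."""

    def __get__(self, instance, owner):
        # Call the wrapped function with the class rather than an instance.
        return self.fget(owner)


class BaseTokenizer:
    """Stand-in for a base tokenizer class such as BytePairTokenizer."""

    @classproperty
    def presets(cls):
        # The base class ships no presets of its own.
        return {}


class ToyTokenizer(BaseTokenizer):
    """Stand-in for a model-specific tokenizer that overrides presets."""

    @classproperty
    def presets(cls):
        # Hypothetical preset name and config, for illustration only.
        return {"toy_preset": {"config": {}}}
```

With this shape, `BaseTokenizer.presets` is an empty dict, and only model-specific subclasses report any preset names.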

jbischof

Thanks! Mostly documentation changes.

jbischof · 2023-01-17T23:20:05Z

keras_nlp/tokenizers/byte_pair_tokenizer.py

+        tokenizer.detokenize([5, 6, 7, 8, 9])
+        ```
+        """
+


After adding the `presets` property, check whether it is empty, as we do for backbone (link). Same for the other two tokenizers.
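A rough sketch of the empty-presets check, assuming a base-class `from_preset` that looks up a config by preset name. The class names, the plain `presets` class attribute, the config layout, and the error messages are all assumptions made for this example.

```python
class BaseTokenizer:
    """Stand-in for a base tokenizer class."""

    # Plain class attribute as a simplification of a class-level property.
    presets = {}

    def __init__(self, vocabulary=None):
        self.vocabulary = vocabulary

    @classmethod
    def from_preset(cls, preset, **kwargs):
        # Base classes define no presets, so fail early with a clear error.
        if not cls.presets:
            raise NotImplementedError(
                "No presets have been created for this class."
            )
        if preset not in cls.presets:
            raise ValueError(
                f"`preset` must be one of {list(cls.presets)}. "
                f"Received: {preset!r}."
            )
        # Caller-supplied kwargs override values from the preset config.
        config = cls.presets[preset]["config"]
        return cls(**{**config, **kwargs})


class ToyTokenizer(BaseTokenizer):
    # Hypothetical preset, for illustration only.
    presets = {"toy_preset": {"config": {"vocabulary": ["a", "b"]}}}
```

Calling `BaseTokenizer.from_preset(...)` directly then raises `NotImplementedError`, while `ToyTokenizer.from_preset("toy_preset")` builds a tokenizer from the preset config.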

jbischof · 2023-01-17T23:39:34Z

keras_nlp/tokenizers/byte_pair_tokenizer.py

+        preset,
+        **kwargs,
+    ):
+        """Instantiate a GPT-2 tokenizer from preset vocabulary and merge rules.


Please remove the GPT-2 language. This should be a generic docstring for BPE. Same for the other two.

mattdangerw

Left some comments re how we can handle the docstrings here.

Also, we are a bit of a moving target here (lots of changes this week!), but you can also mirror these changes for the albert and f_net models. Thank you!

mattdangerw · 2023-01-19T00:10:31Z

keras_nlp/tokenizers/byte_pair_tokenizer.py

+        preset,
+        **kwargs,
+    ):
+        """Instantiate a GPT-2 tokenizer from preset vocabulary and merge rules.


We can actually switch this to a templatized version like this: https://github.com/keras-team/keras-nlp/blob/c9e5040bf7646da471bf9cec2177be2398162568/keras_nlp/models/backbone.py#L45-L65
Make sure not to copy that verbatim. We should keep the language from this docstring, but update it to use the format variables `{{model_name}}`, `{{preset_names}}` and `{{example_preset_name}}`.

To get that working, you will also need to copy the `__init_subclass__` method we use for our Backbone and Task classes, but you should be able to copy that almost exactly (just update Backbone -> BytePairTokenizer). https://github.com/keras-team/keras-nlp/blob/c9e5040bf7646da471bf9cec2177be2398162568/keras_nlp/models/backbone.py#L94-L114

We should make similar changes to the other tokenizer base classes.
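As a rough illustration of the approach described above, here is a self-contained sketch of a templated `from_preset` docstring that `__init_subclass__` fills in per subclass. `BaseTokenizer`, `ToyTokenizer`, the preset names, and the plain string replacement are assumptions for this example; the linked Backbone code is the authoritative version.

```python
class BaseTokenizer:
    """Stand-in for a base tokenizer class such as BytePairTokenizer."""

    presets = {}

    def __init__(self, vocabulary=None):
        self.vocabulary = vocabulary

    @classmethod
    def from_preset(cls, preset, **kwargs):
        """Instantiate {{model_name}} tokenizer from preset vocabulary.

        Args:
            preset: string. Must be one of "{{preset_names}}".

        Examples:
            tokenizer = {{model_name}}.from_preset("{{example_preset_name}}")
        """
        config = cls.presets[preset]["config"]
        return cls(**{**config, **kwargs})

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)

        # Give each subclass its own wrapper so docstrings can differ.
        if "from_preset" not in cls.__dict__:

            def from_preset(calling_cls, *args, **kwargs):
                return super(cls, calling_cls).from_preset(*args, **kwargs)

            cls.from_preset = classmethod(from_preset)

        # Fill in the template unless the subclass wrote its own docstring.
        if cls.from_preset.__doc__ is None:
            doc = BaseTokenizer.from_preset.__doc__
            values = {
                "model_name": cls.__name__,
                "preset_names": '", "'.join(cls.presets),
                "example_preset_name": next(iter(cls.presets), ""),
            }
            for key, value in values.items():
                doc = doc.replace("{{" + key + "}}", value)
            cls.from_preset.__func__.__doc__ = doc


class ToyTokenizer(BaseTokenizer):
    # Hypothetical presets; note that no from_preset is defined here.
    presets = {
        "toy_a": {"config": {"vocabulary": ["a"]}},
        "toy_b": {"config": {"vocabulary": ["b"]}},
    }
```

In this sketch `ToyTokenizer.from_preset.__doc__` names `ToyTokenizer` and lists `"toy_a", "toy_b"`, while the base class keeps the raw template, so a subclass only has to declare its presets.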

mattdangerw · 2023-01-19T00:11:43Z

keras_nlp/models/bert/bert_tokenizer.py

@@ -113,51 +111,9 @@ def __init__(
    def presets(cls):
        return copy.deepcopy({**backbone_presets, **classifier_presets})

-    @classmethod
-    @format_docstring(names=PRESET_NAMES)
    def from_preset(


After following the changes below re docstrings and `__init_subclass__`, you should be able to remove the `from_preset` method here and elsewhere entirely!

mattdangerw · 2023-01-19T00:17:49Z

Looks like you also have some formatting issues on this PR. Check out the `Tests / Check the code format (pull_request)` check above!

jbischof

Good progress! Let us know if you get stuck. You need to run the format.sh script before every commit, so let us know if you're having trouble there.

jbischof · 2023-01-22T15:28:47Z

keras_nlp/models/bert/bert_tokenizer.py

@@ -113,51 +111,9 @@ def __init__(
    def presets(cls):
        return copy.deepcopy({**backbone_presets, **classifier_presets})

-    @classmethod
-    @format_docstring(names=PRESET_NAMES)
    def from_preset(


@shivance, we don't need from_preset in subclasses anymore! See, for example, BertPreprocessor

jbischof · 2023-01-22T15:29:55Z

keras_nlp/models/xlm_roberta/xlm_roberta_tokenizer.py

-        )
-
-        return cls.from_config({**config, **kwargs})
+        return super().from_preset(cls, preset, **kwargs)


You need a newline at the end of each file. Are you still having issues with the format.sh script?

Hi @jbischof, I'm still addressing the comments, so work is pending.
I tend to run the formatting scripts after finishing the changes for each round of review.

Should I continue with this, or run it before every commit?

Thanks.

Sorry, didn't understand this was still WIP! Up to you on how you organize your commits 😄

shivance · 2023-01-22T18:10:15Z

@jbischof It's ready for review now 😄

mattdangerw

LGTM! Thank you!

Just found a few small nits that need fixing.

mattdangerw · 2023-01-23T20:55:23Z

keras_nlp/models/albert/albert_tokenizer.py

@@ -89,52 +87,3 @@ def __init__(self, proto, **kwargs):
    @classproperty


We will need some changes to the class-level docstrings for our model-specific tokenizers: we should document the `from_preset` usage front and center in our code examples above. But I think that would best be done as a follow-up anyway, so I just opened #688.

mattdangerw · 2023-01-23T21:02:25Z

keras_nlp/tokenizers/byte_pair_tokenizer.py

+        """Instantiate {{model_name}} tokenizer from preset vocabulary.
+
+        Args:
+            preset: string. Must be one of {{preset_names}}.


We actually need {{preset_names}} surrounded by quotes for the docstring to render correctly. See https://github.com/keras-team/keras-nlp/blob/3cfdeb6bb1eeacd755a880f1674bf8b9d765aa43/keras_nlp/models/backbone.py#L58
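A small sketch of why the quotes matter, assuming the preset names are joined with `'", "'` before being substituted into the template (as in the linked Backbone code). The preset names here are hypothetical.

```python
# Hypothetical preset names, for illustration only.
preset_names = ["toy_a", "toy_b"]

# Joining with '", "' supplies only the inner quotes between names.
joined = '", "'.join(preset_names)

without_quotes = "Must be one of {{preset_names}}."
with_quotes = 'Must be one of "{{preset_names}}".'

rendered_without = without_quotes.replace("{{preset_names}}", joined)
rendered_with = with_quotes.replace("{{preset_names}}", joined)

print(rendered_without)  # Must be one of toy_a", "toy_b.
print(rendered_with)  # Must be one of "toy_a", "toy_b".
```

The template has to supply the outermost pair of quotes itself, otherwise the first and last names render unbalanced.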

mattdangerw · 2023-01-23T21:02:40Z

keras_nlp/tokenizers/sentence_piece_tokenizer.py

+        """Instantiate {{model_name}} tokenizer from preset vocabulary.
+
+        Args:
+            preset: string. Must be one of {{preset_names}}.


Surround with quotes.

mattdangerw · 2023-01-23T21:03:10Z

keras_nlp/tokenizers/word_piece_tokenizer.py

+        """Instantiate {{model_name}} tokenizer from preset vocabulary.
+
+        Args:
+            preset: string. Must be one of {{preset_names}}.


Surround with quotes.

jbischof

Looks good in general, but please follow @mattdangerw's suggestions and fix the formatting.

mattdangerw · 2023-01-24T21:50:18Z

Actually, since my comments add up to just a couple of line changes, I can just make these as I merge this. Thanks very much for the contribution!

shivance added 2 commits January 17, 2023 21:44

moving from_preset to base tokenizer classes

c06c006

formatting

fc38f4b

mattdangerw self-requested a review January 17, 2023 19:59

mattdangerw requested changes Jan 17, 2023

jbischof suggested changes Jan 17, 2023

mattdangerw requested changes Jan 19, 2023

Merge branch 'master' into issue648

c149950

jbischof reviewed Jan 22, 2023

shivance added 4 commits January 22, 2023 23:24

incorporating suggested changes

025da95

incoming + updated

2b2d907

minor edit

774da5a

formatting

1277c6f

shivance requested a review from jbischof January 22, 2023 18:10

mattdangerw approved these changes Jan 23, 2023

Format and docstring fixes

7a518a1

jbischof approved these changes Jan 24, 2023

mattdangerw merged commit 9d19bc5 into keras-team:master Jan 24, 2023

mattdangerw mentioned this pull request Jan 25, 2023

Add BartTokenizer and BART Presets #685

Merged

jbischof mentioned this pull request Jan 31, 2023

Base classes for architecture workhorses #530

Closed

shivance deleted the issue648 branch February 13, 2023 13:56
