
Rework model docstrings for progressive disclosure of complexity for f_net #879

Merged: 30 commits into keras-team:master on Apr 12, 2023

Conversation

@ADITYADAS1999 (Contributor) commented Mar 19, 2023

Rework model docstrings for progressive disclosure of complexity #867

  • Make sure to update any "custom vocabulary" examples to match the model's actual vocabulary type and special token requirements (these vary per model).
  • Test out all docstring snippets!
  • Make sure to follow our code style guidelines regarding indentation, etc.

cc: @mattdangerw @chenmoneygithub

@mattdangerw (Member) left a comment

This needs another careful pass; I may not have found all the issues. Make sure you have rewritten all the docstrings for the model in question, and tested them out.

@@ -55,7 +55,7 @@ class FNetClassifier(Task):
`None`, this model will not apply preprocessing, and inputs should
be preprocessed before calling the model.

Example usage:

Member:

This looks like it is missing most of the content from the BERT classifier; it may be worth another look.

second = tf.constant(["The fox tripped.", "Oh look, a whale."])
preprocessor((first, second))
```
Mapping with `tf.data.Dataset`.

Member:

newline before this heading

preprocessor = keras_nlp.models.BertPreprocessor(tokenizer)
preprocessor("The quick brown fox jumped.")
```
Mapping with `tf.data.Dataset`.

Member:

newline before

# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
vocab += ["The", "quick", "brown", "fox", "jumped", "."]
tokenizer = keras_nlp.models.BertTokenizer(vocabulary=vocab)

Member:

This still says BERT in a lot of places and uses the wrong vocabulary type for FNet. We will need to show SentencePiece here; you can look around at other PRs for examples of showing that type of vocabulary.

Contributor (Author):

> This still says BERT in a lot of places and uses the wrong vocabulary type for FNet. We will need to show SentencePiece here; you can look around at other PRs for examples of showing that type of vocabulary.

Can you suggest an example like this? I cannot find a proper one!
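
(For reference, other keras-nlp docstrings and tests build a small SentencePiece proto inline with the `sentencepiece` package. The sketch below follows that pattern; the toy corpus, vocab size, and special-token ids are illustrative, not the exact values a final docstring has to use.)

```python
import io

import sentencepiece
import tensorflow as tf

import keras_nlp

# Train a toy SentencePiece model on a tiny corpus, writing the proto to memory.
bytes_io = io.BytesIO()
vocab_data = tf.data.Dataset.from_tensor_slices(
    ["the quick brown fox", "the earth is round"]
)
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=vocab_data.as_numpy_iterator(),
    model_writer=bytes_io,
    vocab_size=12,
    model_type="WORD",
    pad_id=3,
    unk_id=0,
    bos_id=4,
    eos_id=5,
    pad_piece="<pad>",
    unk_piece="<unk>",
    bos_piece="[CLS]",
    eos_piece="[SEP]",
    user_defined_symbols="[MASK]",
)
# Pass the serialized proto (or a path to a saved .spm file) to the tokenizer.
tokenizer = keras_nlp.models.FNetTokenizer(proto=bytes_io.getvalue())
tokenizer("the quick brown fox")
```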

@chenmoneygithub (Contributor) left a comment

Thanks for the PR! In general, there are many examples erroring out; please test the modified examples. Thanks!

Mapping with `tf.data.Dataset`.
```python
preprocessor = keras_nlp.models.FNetMaskedLMPreprocessor.from_preset(
"bert_base_en_uncased"

Contributor:

Fix the preset: f_net_base_en

```python
# Load the preprocessor from a preset.
preprocessor = keras_nlp.models.FNetMaskedLMPreprocessor.from_preset(
"f_net_base_en"
"f_net_base_en_uncased"

Contributor:

Fix the preset: f_net_base_en
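
(With the suggested preset applied, the loading snippet would presumably read along these lines; a sketch, not the exact final docstring.)

```python
# Load the preprocessor from the FNet preset.
preprocessor = keras_nlp.models.FNetMaskedLMPreprocessor.from_preset(
    "f_net_base_en"
)
preprocessor("The quick brown fox jumped.")
```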

user_defined_symbols="[MASK]",
# Map sentence pairs.
ds = tf.data.Dataset.from_tensor_slices((first, second))
# Watch out for tf.data's default unpacking of tuples here!

Contributor:

Not something introduced by this PR, but I think it is worth calling out that `first` and `second` will be concatenated when calling the preprocessor this way. Right now the comment just says "watch out" without showing the output. Maybe we can add "sentence pairs are automatically packed before tokenization"? @mattdangerw, thoughts on this?

Member:

Ah, that is not quite the issue here.

The fact that the outputs are concatenated is not that surprising. The fact that tf.data handles tuples specially is! Basically, if you just called `ds = ds.map(preprocessor)` here, you would see your second input being passed as a label and not a feature. It's an annoying gotcha, but not ours to solve, I think.

It stems from the fact that these two calls are handled differently...

```python
tf.data.Dataset.from_tensor_slices([[1, 2, 3], [1, 2, 3]]).map(lambda x: x)  # OK
tf.data.Dataset.from_tensor_slices(([1, 2, 3], [1, 2, 3])).map(lambda x: x)  # ERROR
```

We can update this comment if we want, but I would not do it in this PR. I would do it in a separate PR, for all the models at once (so we don't forget to update this elsewhere).
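
(A minimal sketch of the gotcha in context, assuming the f_net_base_en preset referenced elsewhere in this review; the sentence-pair data is illustrative.)

```python
import tensorflow as tf

import keras_nlp

preprocessor = keras_nlp.models.FNetPreprocessor.from_preset("f_net_base_en")

first = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
second = tf.constant(["The fox tripped.", "Oh look, a whale."])
ds = tf.data.Dataset.from_tensor_slices((first, second))

# A plain `ds.map(preprocessor)` would unpack the tuple and treat `second` as
# a label rather than a second feature; wrapping the call in a lambda keeps
# both tensors as features of one sentence pair.
ds = ds.map(
    lambda first, second: preprocessor((first, second)),
    num_parallel_calls=tf.data.AUTOTUNE,
)
```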

Contributor (Author):

I believe this should be fine if we open a separate PR covering all the models at once.

Member:

Yeah, the comment above was meant as an explainer. Let's stick to the language we have been using in other PRs verbatim for this PR.

Examples:

Directly calling the layer on data.
```python
tokenizer = keras_nlp.models.FNetTokenizer(proto="model.spm")

Contributor:

Change this to use from_preset()
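
(A minimal sketch of what that might look like, assuming the f_net_base_en preset used elsewhere in this review.)

```python
# Load a pretrained SentencePiece proto via a preset instead of a local file.
tokenizer = keras_nlp.models.FNetTokenizer.from_preset("f_net_base_en")
tokenizer("The quick brown fox jumped.")
```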

# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
vocab += ["The", "quick", "brown", "fox", "jumped", "."]
tokenizer = keras_nlp.models.FNetTokenizer(vocabulary=vocab)

Contributor:

This won't work; FNetTokenizer needs a SentencePiece proto.

@ADITYADAS1999 (Contributor, Author) commented Apr 5, 2023

> This won't work; FNetTokenizer needs a SentencePiece proto.

So, can we remove this custom vocabulary example here?

Mapping with `tf.data.Dataset`.
```python
preprocessor = keras_nlp.models.FNetPreprocessor.from_preset(
"bert_base_en_uncased"

Contributor:

wrong preset here

tokenizer(["the quick brown fox", "the earth is round"])
# Unbatched input.
tokenizer = keras_nlp.models.FNetTokenizer.from_preset(
"bert_base_en_uncased",

Contributor:

wrong preset

# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
vocab += ["The", "quick", "brown", "fox", "jumped", "."]
tokenizer = keras_nlp.models.FNetTokenizer(vocabulary=vocab)

Contributor:

This won't work either. We can just delete the custom vocab example.

@mattdangerw (Member)

/gcbrun

@chenmoneygithub (Contributor)

@ADITYADAS1999 There are still some pending comments unaddressed, could you give me a ping after fixing them? Thanks!

@ADITYADAS1999 (Contributor, Author)

> @ADITYADAS1999 There are still some pending comments unaddressed, could you give me a ping after fixing them? Thanks!

I will try to fix them ASAP and let you know 👍🏻

@ADITYADAS1999 (Contributor, Author) commented Apr 6, 2023

Hey @chenmoneygithub, can you check the fixes now?

@mattdangerw (Member)

/gcbrun

@mattdangerw merged commit 1c9ab0b into keras-team:master on Apr 12, 2023

@ADITYADAS1999 (Contributor, Author)

Thanks for reviewing, but the accelerator testing is still failing for some reason.
