Rework model docstrings for progressive disclosure of complexity for f_net #879
Conversation
This needs another careful pass; I may not have found all the issues. Make sure you have rewritten all the docstrings for the model in question and tested them out.
@@ -55,7 +55,7 @@ class FNetClassifier(Task):
    `None`, this model will not apply preprocessing, and inputs should
    be preprocessed before calling the model.

Example usage:
This looks like it is missing most of the content from the BERT classifier docstring; it may be worth another look.
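For reference, a rough sketch of the fuller pattern the BERT classifier docstring follows, adapted to FNet. The preset name follows the later review comments; the features, labels, and `num_classes` here are purely illustrative, not taken from the PR.

```python
import keras_nlp

# Illustrative two-class data.
features = ["The quick brown fox jumped.", "I forgot my homework."]
labels = [0, 1]

# Pretrained backbone with a randomly initialized classification head.
classifier = keras_nlp.models.FNetClassifier.from_preset(
    "f_net_base_en",
    num_classes=2,
)
classifier.fit(x=features, y=labels, batch_size=2)
classifier.predict(x=features, batch_size=2)
```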
second = tf.constant(["The fox tripped.", "Oh look, a whale."])
preprocessor((first, second))
```
Mapping with `tf.data.Dataset`.
Add a newline before this heading.
preprocessor = keras_nlp.models.BertPreprocessor(tokenizer)
preprocessor("The quick brown fox jumped.")
```
Mapping with `tf.data.Dataset`.
Add a newline before this heading as well.
# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
vocab += ["The", "quick", "brown", "fox", "jumped", "."]
tokenizer = keras_nlp.models.BertTokenizer(vocabulary=vocab)
This still says BERT in a lot of places, and uses the wrong vocabulary type for FNet. We will need to show SentencePiece here; you can look around at other PRs for examples of showing that type of vocabulary.
> This still says BERT in a lot of places, and uses the wrong vocabulary type for FNet. We will need to show SentencePiece here; you can look around at other PRs for examples of showing that type of vocabulary.

Can you suggest an example like this? I cannot find a proper one!
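For what it's worth, other KerasNLP docstrings demonstrate this by training a tiny SentencePiece model in memory and passing the serialized proto to the tokenizer. A minimal sketch of that pattern is below; the vocab size, token ids, and piece names are illustrative and should be checked against `FNetTokenizer`'s expected special tokens.

```python
import io

import sentencepiece
import tensorflow as tf

import keras_nlp

# Train a tiny SentencePiece model on in-memory text.
# vocab_size and the special-token ids/pieces below are illustrative
# and may need adjusting for FNet's actual special tokens.
bytes_io = io.BytesIO()
ds = tf.data.Dataset.from_tensor_slices(["The quick brown fox jumped."])
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=ds.as_numpy_iterator(),
    model_writer=bytes_io,
    vocab_size=10,
    model_type="WORD",
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_piece="<pad>",
    unk_piece="<unk>",
    bos_piece="[CLS]",
    eos_piece="[SEP]",
    user_defined_symbols="[MASK]",
)

# Build the tokenizer from the serialized proto instead of a word list.
tokenizer = keras_nlp.models.FNetTokenizer(proto=bytes_io.getvalue())
tokenizer("The quick brown fox jumped.")
```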
Thanks for the PR! In general, there are many examples erroring out; please test the modified examples. Thanks!
Mapping with `tf.data.Dataset`.
```python
preprocessor = keras_nlp.models.FNetMaskedLMPreprocessor.from_preset(
    "bert_base_en_uncased"
Fix the preset: f_net_base_en
```python
# Load the preprocessor from a preset.
preprocessor = keras_nlp.models.FNetMaskedLMPreprocessor.from_preset(
    "f_net_base_en"
    "f_net_base_en_uncased"
Fix the preset: f_net_base_en
user_defined_symbols="[MASK]",
# Map sentence pairs.
ds = tf.data.Dataset.from_tensor_slices((first, second))
# Watch out for tf.data's default unpacking of tuples here!
Not for this PR, but I think it is worth calling out that `first` and `second` will be concatenated when calling the preprocessor this way. Right now the comment just says "watch out" without showing the output. Maybe we can add "sentence pairs are automatically packed before tokenization"? @mattdangerw, thoughts on this?
Ah, that is not quite the issue here.

The fact that the outputs are concatenated is not that surprising. The fact that `tf.data` handles tuples specially is! Basically, if you just called `ds = ds.map(preprocessor)` here, you would see your second input being passed as a label and not a feature. It's an annoying gotcha, but not ours to solve, I think.

It stems from the fact that these two calls are handled differently:

```python
tf.data.Dataset.from_tensor_slices([[1, 2, 3], [1, 2, 3]]).map(lambda x: x)  # OK
tf.data.Dataset.from_tensor_slices(([1, 2, 3], [1, 2, 3])).map(lambda x: x)  # ERROR
```

We can update this comment if we want, but I would not do it on this PR. I would do that in a separate PR, for all the models at once (so we don't forget to update this elsewhere).
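For readers following along, a minimal sketch of the usual way to sidestep this gotcha is to unpack and repack the tuple inside the mapped function. The preset name follows the earlier review comments; the sentences are illustrative.

```python
import tensorflow as tf

import keras_nlp

preprocessor = keras_nlp.models.FNetPreprocessor.from_preset("f_net_base_en")
first = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
second = tf.constant(["The fox tripped.", "Oh look, a whale."])

ds = tf.data.Dataset.from_tensor_slices((first, second))
# Repack the tuple inside the mapped function so both sentences are
# treated as features, rather than being split into (feature, label).
ds = ds.map(
    lambda first, second: preprocessor((first, second)),
    num_parallel_calls=tf.data.AUTOTUNE,
)
```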
I believe this should be fine if we open a separate PR covering the different models at once.
Yeah, the comment above was meant as an explainer. For this PR, let's stick verbatim to the language we have been using in other PRs.
Examples:

Directly calling the layer on data.
```python
tokenizer = keras_nlp.models.FNetTokenizer(proto="model.spm")
Change this to use `from_preset()`.
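Presumably something along these lines (the preset name follows the other review comments on this PR):

```python
import keras_nlp

# Load the tokenizer from a preset instead of a local proto file.
tokenizer = keras_nlp.models.FNetTokenizer.from_preset("f_net_base_en")
tokenizer("The quick brown fox jumped.")
```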
# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
vocab += ["The", "quick", "brown", "fox", "jumped", "."]
tokenizer = keras_nlp.models.FNetTokenizer(vocabulary=vocab)
This won't work; `FNetTokenizer` needs a SentencePiece proto.
> This won't work; `FNetTokenizer` needs a SentencePiece proto.

So, can we remove this custom vocabulary here?
Mapping with `tf.data.Dataset`.
```python
preprocessor = keras_nlp.models.FNetPreprocessor.from_preset(
    "bert_base_en_uncased"
wrong preset here
tokenizer(["the quick brown fox", "the earth is round"]) | ||
# Unbatched input. | ||
tokenizer = keras_nlp.models.FNetTokenizer.from_preset( | ||
"bert_base_en_uncased", |
wrong preset
# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
vocab += ["The", "quick", "brown", "fox", "jumped", "."]
tokenizer = keras_nlp.models.FNetTokenizer(vocabulary=vocab)
This won't work either. We can just delete the custom vocab example.
/gcbrun
@ADITYADAS1999 There are still some pending comments unaddressed; could you give me a ping after fixing them? Thanks!
I will try to fix them ASAP and let you know 👍🏻
Hey @chenmoneygithub, can you check the fixes now?
/gcbrun
Thanks for reviewing, but the accelerator testing is still failing for some reason.
Rework model docstrings for progressive disclosure of complexity #867
cc: @mattdangerw @chenmoneygithub