Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mlm mask generator docstring adding example #916

Conversation

abuelnasr0
Copy link
Contributor

@abuelnasr0 abuelnasr0 commented Mar 24, 2023

for the contribution welcome issue #120 I added a simple example to show the best practice of using MLMMaskGenerator to train a language model

  • Prepare data, we can generate random strings or load some small datasets from TFDS (tensorflow datasets).
  • Instantiate a tokenizer, and call tokenize on the data.
  • Apply MLMMaskGenerator on the tokenized data.
  • Feed the masked data to a dummy MLM model by calling model.fit().

link to google colab with the example:
https://colab.research.google.com/drive/19GmGy4nCe2-AgRsHAqdVIMmjxR_WxSkB#scrollTo=YdPZIknsopS_

waiting for feedback.
thank you.

Copy link
Contributor

@chenmoneygithub chenmoneygithub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

keras_nlp/layers/masked_lm_mask_generator.py Show resolved Hide resolved
keras_nlp/layers/masked_lm_mask_generator.py Outdated Show resolved Hide resolved
keras_nlp/layers/masked_lm_mask_generator.py Outdated Show resolved Hide resolved
keras_nlp/layers/masked_lm_mask_generator.py Outdated Show resolved Hide resolved
keras_nlp/layers/masked_lm_mask_generator.py Outdated Show resolved Hide resolved
keras_nlp/layers/masked_lm_mask_generator.py Outdated Show resolved Hide resolved
An end-to-end masked language model training using masked language mask
generator.
```python
train_data = tf.constant([
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some high level comments...

I think showing a more fleshed out usage is OK to do here, but I would scale it down significantly from what we are showing here.

We do not want to show keras_nlp.models usage here in our lower level layers. And we probably do not want to show fit() either. We should generally try to keep our usages fairly local to the symbols in question, so we avoid a ton of cross cutting dependencies when we update a symbol usage.

I might instead just show something that looks closer to a real world example, but is still just focused on the layer itself.

pad_id, cls_id, sep_id, mask_id = 0, 1, 2, 3
batch = [
    [cls_id,     4,     5,      6, sep_id,    7,    8, sep_id, pad_id, pad_id],
    [cls_id,     4,     5, sep_id,      6,    7,    8,      9, sep_id, pad_id],
]
# the rest of the MaskedLMMaskGenerator and invocation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like last push?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes perhaps, sorry about that!

This may be me needing to sync up with @chenmoneygithub more. I do not think we want the examples here to depend on a totally separate part of our API, I will talk with @chenmoneygithub about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just talked with @chenmoneygithub, apologies for all the confusion here. We are scaling up our contributors, so our wires may get crossed a few times :)

But this latest version looks good to me! Thanks very much!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. Writing docstrings and changes help me to get familiar with the library more. Thanks for the feedback.
One thing: should I run that /gcbrun or a mentor should run it ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abuelnasr0 good question. We are playing with different options here, but /gcbrun would need to be added to project by a maintainer. Unclear if we are going to require it long term, but no action should be needed from you!

Copy link
Member

@mattdangerw mattdangerw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM!

Copy link
Contributor

@chenmoneygithub chenmoneygithub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@chenmoneygithub chenmoneygithub merged commit 1795d10 into keras-team:master Mar 31, 2023
@abuelnasr0 abuelnasr0 deleted the MLMMaskGenerator-docstring-adding-example branch March 31, 2023 02:02
@abuelnasr0 abuelnasr0 restored the MLMMaskGenerator-docstring-adding-example branch March 31, 2023 02:02
@abuelnasr0 abuelnasr0 deleted the MLMMaskGenerator-docstring-adding-example branch December 28, 2023 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants