RobertaMaskedLM task and preprocessor #653

Merged · 6 commits · Feb 1, 2023

Conversation

@mattdangerw (Member):

This exposes a task model for a mask token prediction task. This is the task used to pre-train RoBERTa, and it could be used for further fine-tuning.

@mattdangerw requested review from @chenmoneygithub and @jbischof, then removed the requests (January 13, 2023, 20:25)
@mattdangerw (Member, Author):

Removing reviewers while I fix tests!

@mattdangerw (Member, Author):

Ok! This is ready for review again.

There is still a test failure related to random seeds on tf.nightly that I will have to look into. If we don't think we can fully control randomness deterministically across tf versions, I can just rewrite the asserts to be a little more general for preprocessing.

@jbischof (Contributor) left a comment:

This is great, left some initial comments.

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""RoBERTa classification model."""
Contributor:

Update docstring

Contributor:

It doesn't seem updated?

Contributor:

Am I missing something?

@mattdangerw (Member, Author), Jan 28, 2023:

Oops, sorry! Dunno how I spaced on this!

Resolved review threads:
keras_nlp/models/roberta/roberta_masked_lm.py
keras_nlp/models/roberta/roberta_masked_lm.py (outdated)
keras_nlp/models/roberta/roberta_masked_lm.py (outdated)
keras_nlp/models/roberta/roberta_masked_lm_preprocessor.py (outdated)
keras_nlp/models/roberta/roberta_tokenizer.py (outdated)
@chenmoneygithub (Contributor) left a comment:

Overall looks good!

Resolved review threads:
keras_nlp/models/roberta/roberta_masked_lm.py (outdated)
keras_nlp/models/roberta/roberta_masked_lm_preprocessor.py (outdated)
keras_nlp/models/roberta/roberta_masked_lm_preprocessor.py (outdated)
tokenizer=tokenizer,
sequence_length=20,
)
preprocessor(" quick fox quick fox")
Contributor:

Shall we remove the leading space to keep the inputs consistent?

Resolved review threads:
keras_nlp/models/roberta/roberta_masked_lm_preprocessor.py (outdated)
keras_nlp/models/roberta/roberta_preprocessor.py (outdated)
keras_nlp/models/roberta/roberta_preprocessor.py (outdated)
@jbischof (Contributor) left a comment:

Adding some more comments. Haven't looked at the tests yet.

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""RoBERTa classification model."""
Contributor:

It doesn't seem updated?

Resolved review threads:
keras_nlp/models/roberta/roberta_masked_lm.py (outdated)
keras_nlp/models/roberta/roberta_masked_lm.py (outdated)
keras_nlp/models/roberta/roberta_masked_lm_preprocessor.py (outdated)
@mattdangerw force-pushed the roberta-masked-lm branch 2 times, most recently from 8044017 to 88b5200 (January 25, 2023, 21:31)
@jbischof (Contributor) left a comment:

It seems basically ready to me. Have you played around with this in colab? How is the UX?

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""RoBERTa classification model."""
Contributor:

Am I missing something?

Resolved review threads:
keras_nlp/models/roberta/roberta_masked_lm.py (outdated)
keras_nlp/models/roberta/roberta_masked_lm.py
keras_nlp/models/roberta/roberta_masked_lm_preprocessor.py (outdated)
keras_nlp/models/roberta/roberta_masked_lm.py (outdated)
@mattdangerw (Member, Author), Jan 28, 2023:

In reply to @jbischof:
> It seems basically ready to me. Have you played around with this in colab? How is the UX?

Heh, UX is in the eye of the beholder, but in my biased view this is right for where we are at. Here's a quick colab:
https://colab.research.google.com/gist/mattdangerw/fead93c20342c70e438a253f392bd425/roberta-mlm.ipynb

The major usability gap I see is one that applies to causal language modeling as well: there is no way to do either with densely packed, fixed-sequence-length windowing, which is how pretraining is done for both GPT-2 and RoBERTa. I can open an issue next week with some musings and possible approaches, but let's not solve that on this PR. A rough sketch of what I mean is below.

@jbischof (Contributor):

Yes @mattdangerw, please open an issue for data streaming! The code is only a toy until we can replicate the behavior and performance of the original pretraining.

@mattdangerw (Member, Author):

> Yes @mattdangerw, please open an issue for data streaming!

Not quite data streaming: data windowing. And I don't think the workflow shown here is a toy; it is just suited to certain datasets and problems.

Put briefly, an offering that only did densely packed sequence windowing would be incomplete, and an offering that only did padded windowing is also incomplete. We only have the latter right now.

But all of this is "streaming": it boils down to options for how the data preprocessing stream should function. Definitely look at the colab if you haven't yet!

At the meta level: agreed that replicating pre-training is a good goal, but let's work toward it incrementally!

@mattdangerw (Member, Author):

I've opened #701 to cover the discussion above. This is an important question we need to solve at the library level, I think, but let's not solve it on this PR.

There are a lot of open questions to tackle there. And it may be that what we have here is sufficient, and what we need is simply a recipe for pretraining, where most of the preprocessing happens in a separate job entirely.

@jbischof (Contributor) left a comment:

Thank you, glad to see our task models expanding!!!

The test asserting random output seems a bit wonky; I would recommend mocking instead.

mask_selection_length=4,
sequence_length=12,
)
keras.utils.set_random_seed(42)
Contributor:

Is it good practice to assert random output as a function of a random seed? My understanding is that the standard approach is to mock the function that produces the random output and then just have an overall integration test.

@mattdangerw (Member, Author):

Let me poke around. The thing that generates the random output is kinda complex, down in MaskedLMMaskGenerator, and when mapping with tf.data the whole call graph will be compiled. Mocking might be more trouble than it's worth, and would only apply in the non-compiled case.

In the tests there we just do a lot of shape assertions and don't assert the exact structure. Maybe that is the move.

Regardless, I am down to remove the random seed setting.

@mattdangerw (Member, Author):

OK, got some more deterministic, seedless testing in. I did not go with the mock, as I think that would have ended up being pretty fragile.

I also exposed a few more of the underlying "mask generator" options on the preprocessing layer. I think they will be useful to have (they certainly are for these tests).

@jbischof (Contributor) left a comment:

My apologies, I didn't mean to pollute the API just for testing! I'd rather just have an integration test in the short term than change the API.

@mattdangerw (Member, Author):

> My apologies, I didn't mean to pollute the API just for testing! I'd rather just have an integration test in the short term than change the API.

I do think it probably makes sense to add these options. It is very reasonable to want to replace selected tokens only with the mask token (rather than the 80% mask token / 10% random token / 10% unchanged token scheme used in the original BERT and RoBERTa, I believe).

I had been on the fence about adding them and was going to wait for a contributor to ask for it. But after noticing how much they simplified testing, I am down to add them in.


@jbischof (Contributor), Feb 1, 2023:

Apologies for misunderstanding, merge away!

Commit message:
This adds the underlying options for how masks are generated from the mask generator layer. This in turn allows us to write some tests for the preprocessor that are fully deterministic, while still testing the logic in the preprocessor layer itself.