RobertaMaskedLM task and preprocessor #653
Conversation
Removing reviewers while I fix tests!
Ok! This is ready for review again. There is still a test failure related to random seeds on tf.nightly that I will have to look into. If we don't think we can fully control randomness deterministically across tf versions, I can just re-write the asserts to be a little more general for preprocessing.
This is great, left some initial comments.
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""RoBERTa classification model."""
Update docstring
It doesn't seem updated?
Am I missing something?
Oops sorry! Dunno how I spaced on this!
Overall looks good!
    tokenizer=tokenizer,
    sequence_length=20,
)
preprocessor(" quick fox quick fox")
Shall we remove the leading space to keep the inputs consistent?
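For reference, a minimal sketch of the call being discussed, without the leading space. The class name, preset, and output structure below are assumptions based on the snippets quoted in this thread, not verbatim code from the PR:

```python
import keras_nlp

# Assumed API: a RoBERTa tokenizer preset plus the masked LM preprocessor
# under review; treat every name here as illustrative.
tokenizer = keras_nlp.models.RobertaTokenizer.from_preset("roberta_base_en")
preprocessor = keras_nlp.models.RobertaMaskedLMPreprocessor(
    tokenizer=tokenizer,
    sequence_length=20,
)
# The preprocessor is assumed to return (features, labels, sample_weights),
# where the features hold the masked token ids and the mask positions.
features, labels, sample_weights = preprocessor("quick fox quick fox")
```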
Adding some more comments. Haven't looked at the tests yet.
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""RoBERTa classification model."""
It doesn't seem updated?
It seems basically ready to me. Have you played around with this in colab? How is the UX?
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""RoBERTa classification model."""
Am I missing something?
Heh, UX is in the eye of the beholder, but in my biased view, this is correct for where we are at. Here's a quick colab -> The major usability gap I see is one that stands for causal language modeling as well: there is no way to do either of these with densely packed, fixed sequence length windowing, which is how this is done for both GPT2 and RoBERTa pretraining. I can open an issue next week with some musings and possible approaches, but let's not solve that on this PR.
Yes @mattdangerw please open an issue for data streaming! The code is only a toy until we can replicate the behavior and performance of the original pretraining.
Not quite data streaming, but data windowing. And I don't think the workflow shown here is a toy; it is just suited to certain datasets and problems. Put shortly, an offering that only did densely packed sequence windowing would be incomplete, and an offering that only did padded windowing is also incomplete. We only have the latter right now. But all of this is "streaming": it boils down to options in how the data preprocessing stream should function. Definitely look at the colab if you haven't yet! At the meta level, agreed that replicating pre-training is a good goal, but let's work toward that incrementally!
I've opened #701 to cover the discussion above. This is an important question we need to solve at the library level, I think, but let's not solve it on this PR. There are a lot of open questions to tackle there. And it may be that what we have here is sufficient, and what we need is simply a recipe for pretraining, where most of the preprocessing happens in a separate job entirely.
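As a rough illustration of the two strategies being contrasted above (a sketch only, not code from this PR): padded windowing emits one fixed-length, partially padded example per document, while dense packing concatenates documents into a single token stream and slices fully dense fixed-length windows.

```python
import tensorflow as tf

SEQ_LEN = 8  # illustrative window length

def gen():
    # Toy stand-in for a tokenized corpus of variable-length documents.
    yield [1, 2, 3]
    yield [4, 5, 6, 7, 8]
    yield [9, 10]

docs = tf.data.Dataset.from_generator(
    gen, output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32)
)

# Padded windowing: one example per document, truncated/padded to SEQ_LEN.
def padded_window(token_ids):
    token_ids = token_ids[:SEQ_LEN]
    return tf.pad(token_ids, [[0, SEQ_LEN - tf.shape(token_ids)[0]]])

padded = docs.map(padded_window)

# Densely packed windowing (GPT-2/RoBERTa pretraining style): flatten all
# documents into one token stream, then slice dense windows with no padding.
packed = docs.flat_map(tf.data.Dataset.from_tensor_slices).batch(
    SEQ_LEN, drop_remainder=True
)
```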
Thank you, glad to see our task models expanding!!!
The test asserting random output seems a bit wonky; I would recommend mocking instead.
    mask_selection_length=4,
    sequence_length=12,
)
keras.utils.set_random_seed(42)
Is it good practice to assert random output as a function of a random seed? My understanding is that the standard approach is to mock the function with random output and then just have an overall integration test.
Let me poke around. The thing that generates the random output is kinda complex down in MaskedLMMaskGenerator, and when mapping with tf.data the whole call graph will be compiled. Mocking might be more trouble than it's worth, and would only apply in the non-compiled case.
In our tests there we just do a lot of shape assertions, and don't assert the exact structure. Maybe that is the move.
Regardless, I am down to remove the random seed setting.
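A sketch of what shape-only, seed-free assertions could look like; the key names, lengths, and preprocessor construction below are assumptions mirroring the snippet quoted above:

```python
import tensorflow as tf
import keras_nlp


class PreprocessorOutputShapeTest(tf.test.TestCase):
    def test_output_shapes(self):
        # Class name, preset, and arguments assumed from this thread.
        preprocessor = keras_nlp.models.RobertaMaskedLMPreprocessor(
            tokenizer=keras_nlp.models.RobertaTokenizer.from_preset(
                "roberta_base_en"
            ),
            sequence_length=12,
            mask_selection_length=4,
        )
        x, y, sw = preprocessor(["the quick brown fox jumped"])
        # Assert only shapes, so the test does not depend on which tokens
        # the (random) mask generator happens to select.
        self.assertEqual(x["token_ids"].shape, (1, 12))
        self.assertEqual(x["mask_positions"].shape, (1, 4))
        self.assertEqual(y.shape, (1, 4))
        self.assertEqual(sw.shape, (1, 4))
```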
OK, got some more deterministic seedless testing in. Did not go with the mock as I think that would have ended up being pretty fragile.
I also exposed a few more of our underlying "mask generator" options on the preprocessing layer. I think they will be useful to have (they certainly are for these tests).
My apologies, I didn't mean to pollute the API just for testing! I'd rather just have an integration test in the short term than change the API.
I do think it probably makes sense to add these options. It is very reasonable to want to only replace masked positions with the mask token (rather than the 80% mask token, 10% random token, 10% unchanged split used in the original BERT and RoBERTa, I believe). I had been on the fence about adding them and was going to wait for a contributor to ask for it. But after noticing how it simplified testing, I am down to add them in.
Apologies for misunderstanding, merge away!
This adds the underlying options for how masks are generated from the mask generator layer. This in turn allows us to write some tests for the preprocessor that are fully deterministic, while still testing the logic in the preprocessor layer itself.
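A sketch of how the exposed options could make a preprocessor test deterministic without a seed; the argument names below are assumptions mirroring the commit description, not verbatim API from this PR:

```python
import keras_nlp

# Names below are assumed, not verbatim from the PR.
tokenizer = keras_nlp.models.RobertaTokenizer.from_preset("roberta_base_en")
preprocessor = keras_nlp.models.RobertaMaskedLMPreprocessor(
    tokenizer=tokenizer,
    sequence_length=12,
    mask_selection_rate=1.0,    # consider every non-special token for masking
    mask_selection_length=4,
    mask_token_rate=1.0,        # always substitute the mask token...
    random_token_rate=0.0,      # ...never a random token
)
# For inputs short enough that all selected positions fit within
# mask_selection_length, the output is now fully deterministic.
```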
This exposes a task model for a mask token prediction task. This is the task used to pre-train RoBERTa, and it could be used for further fine-tuning.
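As a rough sketch of how the task model might be used for pre-training or fine-tuning on raw strings (class names, preset, and compile settings are assumptions, not verbatim API from this PR):

```python
import tensorflow as tf
import keras_nlp

# The attached preprocessor is assumed to turn raw strings into masked
# inputs, labels, and sample weights on the fly.
masked_lm = keras_nlp.models.RobertaMaskedLM.from_preset("roberta_base_en")

# Toy corpus of raw text; a real run would use a large dataset.
features = tf.constant(["the quick brown fox", "the slow brown fox"])
dataset = tf.data.Dataset.from_tensor_slices(features).batch(2)

masked_lm.compile(
    # from_logits=True assumes the masked LM head outputs logits.
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer="adam",
    weighted_metrics=["sparse_categorical_accuracy"],
)
masked_lm.fit(dataset, epochs=1)
```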