Clean up model input names for consistency #327

Merged
merged 1 commit on Aug 30, 2022

Conversation

mattdangerw
Member

This proposes a few changes to the naming of our model and layer inputs
and outputs. A sketch of how the renamed inputs would look follows the list below.

1) Rename `input_ids` -> `token_ids` for bert/roberta.
   Everything is an "input", including the segment id input, so I don't
   think input is a helpful naming prefix in this case.
2) Rename `input_mask`  -> `padding_mask` for bert/roberta.
   This matches the name of the variable for the transformer
   encoder/decoder argument.
3) Rename `tokens` -> `token_ids` for MLMMaskGenerator.
   This layer only operates in id space, so I think token_ids is more
   descriptive and consistent with above.
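
For illustration, here is a rough sketch of what a feature dict with the proposed names could look like. The shapes, dtypes, and the commented-out `bert_model` call are placeholders for this discussion, not code from this change:

```python
import numpy as np

batch_size, seq_length = 2, 128

# Placeholder features; a real pipeline would build these with a tokenizer and packer.
features = {
    # Integer ids of the tokens in the vocabulary (previously `input_ids`).
    "token_ids": np.zeros((batch_size, seq_length), dtype="int32"),
    # Segment ids separating sentence A from sentence B.
    "segment_ids": np.zeros((batch_size, seq_length), dtype="int32"),
    # 1 for real tokens, 0 for [PAD] positions (previously `input_mask`).
    "padding_mask": np.ones((batch_size, seq_length), dtype="int32"),
}

# outputs = bert_model(features)  # `bert_model` stands in for a BERT-style model using these names.
```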
@mattdangerw mattdangerw changed the title Clean up model input/outputs for consistency Clean up model input for consistency Aug 29, 2022
@mattdangerw mattdangerw changed the title Clean up model input for consistency Clean up model input names for consistency Aug 29, 2022
@jbischof
Contributor

jbischof commented Aug 29, 2022

Looks good in general, but I'm confused by the `padding_mask` label. Aren't we also masking the holdout words in pretraining, and not just the padding?

Does the examples/bert/ code still work? I see it still uses `input_ids` here.

Finally, there's one other naming inconsistency you could mop up in `sine_position_encoding`.

Contributor

@jbischof left a comment

LGTM! Added some comments for consideration but trust your judgment.

@mattdangerw
Member Author

mattdangerw commented Aug 30, 2022

@jbischof Thanks! And good call, I will rename in examples/bert to match.

Re `padding_mask` and the MLM masked words, those are actually different masks I think. There is both a [PAD] token and a [MASK] token in the bert vocab. During pretraining, [PAD] will be used only to pad sequences, and [MASK] will be used for the MLM labels.

So I do think it is accurate to use `padding_mask` as a name here. Other options are `mask` or `token_mask`; I am fine with any of those. I was just choosing `padding_mask` because that's the name we chose for our TransformerEncoder/TransformerDecoder: https://keras.io/api/keras_nlp/layers/transformer_encoder/#call-method.
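
For context, here is a minimal sketch of the encoder call that already uses this argument name; the layer sizes and shapes below are arbitrary and only for illustration:

```python
import numpy as np
import keras_nlp

# TransformerEncoder takes an optional `padding_mask` in its call method.
encoder = keras_nlp.layers.TransformerEncoder(intermediate_dim=64, num_heads=2)

x = np.random.uniform(size=(2, 10, 16)).astype("float32")  # (batch, sequence, feature)
padding_mask = np.ones((2, 10), dtype="bool")
padding_mask[:, 8:] = False  # treat the last two positions as [PAD]

outputs = encoder(x, padding_mask=padding_mask)  # same shape as `x`
```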

Collaborator

@fchollet left a comment

LGTM
