
Add from_preset constructor to BertPreprocessor #390

Merged (6 commits) on Oct 18, 2022

Conversation

jbischof (Contributor)

Closes #388

@fchollet (Collaborator) left a comment

Design Q: should preprocessors solely be available via

Preprocessor.from_preset(name)

or should they also be available from the corresponding model, e.g.

Bert.from_preset(name).tokenizer # or something

@jbischof (Contributor, Author)

@fchollet, @mattdangerw is working on a pipeline class that composes preprocessing and the model. It would have its own from_preset method. Looking forward to discussing when we have a design doc.

@fchollet (Collaborator) commented Oct 17, 2022

> @fchollet, @mattdangerw is working on a pipeline class that composes preprocessing and the model

I am not sure that a separate "pipeline" class is warranted. I'd favor including the preprocessing in the user-facing model (e.g. Bert or BertClassifier), similar to what we do for vision models.

  • Models are instantiated with or without preprocessing included (defaults to True, i.e. with).
  • With fit(), predict(), and evaluate(): the preprocessing is automatically mapped into the dataset (if it is included in the model).
  • With __call__(): preprocessing is applied by default (if it is included in the model) but you have an argument to turn it off, which can be useful in various situations when writing low-level code.

Ultimately we just want models that can process raw strings. To have to deal with an additional layer of abstraction to get the ability to process strings seems clumsy.
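
A minimal toy sketch of this proposed design (illustrative names only, not the KerasNLP API; the automatic dataset mapping for fit(), predict(), and evaluate() is omitted):

import tensorflow as tf

class ToyClassifier(tf.keras.Model):
    def __init__(self, preprocessor=None):
        super().__init__()
        # None means the model was instantiated "without preprocessing".
        self.preprocessor = preprocessor
        self.embedding = tf.keras.layers.Embedding(1000, 8)
        self.pool = tf.keras.layers.GlobalAveragePooling1D()
        self.head = tf.keras.layers.Dense(2)

    def call(self, inputs, preprocess=True):
        # Preprocessing is applied by default, with an opt-out for
        # low-level code that already holds tokenized inputs.
        if self.preprocessor is not None and preprocess:
            inputs = self.preprocessor(inputs)
        return self.head(self.pool(self.embedding(inputs)))

preprocessor = tf.keras.layers.TextVectorization(
    max_tokens=1000, output_sequence_length=16)
preprocessor.adapt(["a tiny corpus", "to fit the vocabulary"])

model = ToyClassifier(preprocessor)
print(model(tf.constant(["raw strings work end to end"])))  # shape (1, 2)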

@jbischof (Contributor, Author) commented Oct 17, 2022

@fchollet I was assuming that, regardless of how we implement preprocessing as part of the model, we still need standalone preprocessing to be a good experience, which is what this PR attempts. Are you saying you want to block this PR until we figure out joint preprocessing, that the current PR is a bad experience for separate preprocessing, or that we shouldn't have separate preprocessing at all?

@mattdangerw (Member) left a comment

Overall this looks good to me! And definitely a positive delta from what we have.

I think one bigger question is whether we want to have a BertPreprocessor.presets property. Soon, Bert.from_preset, BertClassifier.from_preset, and BertPreprocessor.from_preset will all accept different sets of keys (the last being the union of the first two).

Should we try to keep a consistent UX, where each class has a presets static property and the valid keys can be accessed with MySymbol.presets.keys()? I don't see a reason we need to figure that out before submitting this, and there is probably plenty to discuss in the weeds there. But it's something we should think about going forward.
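
A sketch of what that consistent UX could look like (the presets property is hypothetical here, and the preset name is borrowed from later in this thread):

import keras_nlp

# Hypothetical: each symbol exposes a `presets` mapping, so valid keys
# can be discovered before calling from_preset().
print(keras_nlp.models.BertPreprocessor.presets.keys())

preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_small_uncased_en"
)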

@mattdangerw (Member)

@fchollet and @jbischof, re: how to access the preprocessing paired with a model, I think we can do that as a follow-up. (Basically I'm just echoing Jonathan's point above.)

IMO, we should have a way to access preprocessing standalone; that's a totally valid user journey. This PR addresses that in a way that is very consistent with the UX we just came up with for keras_nlp.models.Bert.

As Francois said in his original comment, the open question is whether to also expose the preprocessing on the model. That is a fairly dense topic, one that can roughly be summarized as "what do we want to do about pipelines?" Let's tackle that in a future PR!

@fchollet (Collaborator) commented Oct 17, 2022

> Are you saying you want to block this PR until we figure out joint preprocessing or that the current PR is a bad experience for separate preprocessing?

No, it's clear that being able to retrieve the tokenizer independently is useful. I was saying:

> should they also be available from the corresponding model

@chenmoneygithub (Contributor)

I tried this implementation and spotted a bug:

import keras_nlp

bert_preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_small_uncased_en"
)
inputs = ["This is a testing string!", "Another test string!"]
outputs = bert_preprocessor(inputs)
print(outputs)

This gives shape (512,) for token_ids, segment_ids, and padding_mask, which means the batch information is lost.

@jbischof (Contributor, Author)

@chenmoneygithub the preprocessing layer does not handle batch inputs. It is expected to be mapped over each example in the batch. If there are multiple sentences in the input, these are interpreted as multiple segments for the same example (e.g., for NSP or sentence similarity).

@mattdangerw (Member)

@jbischof @chenmoneygithub I think this batching convo is unrelated to the PR. But seems worth clearing up.

  • All our preprocessing does handle batched and unbatched input. Loading a batched dataset from, say, tfds and piping it through our preprocessing is actually a pretty core use case! (And we do support it universally.)
  • All our preprocessing supports converting to tensors, just as a convenience. (Most tf ops do as well)
  • The BertPreprocessor supports multiple (batched or unbatched) inputs which are then collated with sep tokens. This is also the behavior of the MultiSegmentPacker layer.

The sum total of these means you need to be careful when passing un-tensorized inputs to the layer, because you hit an ambiguity of sorts. We treat the first tuple or list we see as indicating multiple separate string features to collate. So in the code snippet you posted, Chen, we are doing a very valid thing, and concatenating the two sentences together into a single sequence. This colab shows what I mean:
https://gist.github.com/mattdangerw/61b74938a11b61f1ce858b770c601bd6

We could consider flipping the default behavior there, so if you pass a tuple of strings (rather than a tuple of numpy or tensors) we treat this as a batch. That is definitely something for a separate issue though!
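
A small sketch of the ambiguity described above (reusing the preset name from Chen's snippet; the 512-token shapes assume the packing reported there):

import keras_nlp
import tensorflow as tf

preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_small_uncased_en"
)

# A plain Python list of strings is read as multiple *segments* of one
# example, collated with separator tokens: token_ids has shape (512,).
one_example = preprocessor(["first segment", "second segment"])

# A rank-1 string tensor is read as a *batch* of single-segment
# examples: token_ids has shape (2, 512).
a_batch = preprocessor(tf.constant(["example one", "example two"]))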

@mattdangerw (Member) left a comment

LGTM! Thanks!
