Add from_preset constructor to BertPreprocessor #390
Conversation
Design Q: should preprocessors solely be available via Preprocessor.from_preset(name), or should they also be available from the corresponding model, e.g. Bert.from_preset(name).tokenizer (or something similar)?
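For concreteness, the two access patterns being weighed look roughly like this. Both are hypothetical sketches of the API shape; the preset name, the Bert symbol, and the tokenizer attribute are placeholders, not committed names:

```python
# Option A: a standalone constructor on the preprocessor class (what this PR adds).
preprocessor = BertPreprocessor.from_preset("bert_base_uncased_en")

# Option B: reach the preprocessing through the model object.
# (Hypothetical; neither the attribute name nor the Bert symbol is settled.)
model = Bert.from_preset("bert_base_uncased_en")
preprocessor = model.tokenizer  # or model.preprocessor
```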
@fchollet, @mattdangerw is working on a
I am not sure that a separate "pipeline" class is warranted. I'd favor including the preprocessing in the user-facing model (e.g. Bert or BertClassifier), similar to what we do for vision models.
Ultimately we just want models that can process raw strings. To have to deal with an additional layer of abstraction to get the ability to process strings seems clumsy.
@fchollet I was assuming that, regardless of how we implement preprocessing as part of the model, we still need standalone preprocessing to be a good experience, which is what I have attempted in this PR. Are you saying you want to block this PR until we figure out joint preprocessing, that the current PR is a bad experience for standalone preprocessing, or that we shouldn't have standalone preprocessing at all?
Overall this looks good to me! And definitely a positive delta from what we have.

I think one bigger question is whether we want to have a BertPreprocessor.presets. Soon, Bert.from_preset, BertClassifier.from_preset and BertPreprocessor.from_preset will all accept different sets of keys (the last being the combination of the first two). Should we try to keep a consistent UX, where each class has a presets static property, and the valid keys can be accessed with MySymbol.presets.keys()? I don't see a reason we need to figure that out before submitting this, and there is probably plenty to discuss in the weeds there. But it is something we should think about going forward.
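As a rough sketch of that consistent-UX idea (illustrative only, not this PR's implementation; the preset names and config fields below are made up):

```python
import keras

class BertPreprocessorSketch(keras.layers.Layer):
    # Hypothetical preset registry; real entries would also carry vocab assets.
    presets = {
        "bert_base_uncased_en": {"sequence_length": 512},
        "bert_large_uncased_en": {"sequence_length": 512},
    }

    def __init__(self, sequence_length=512, **kwargs):
        super().__init__(**kwargs)
        self.sequence_length = sequence_length

    @classmethod
    def from_preset(cls, preset, **kwargs):
        # Valid keys are discoverable via `cls.presets.keys()`.
        if preset not in cls.presets:
            raise ValueError(
                f"`preset` must be one of {list(cls.presets)}. Received: {preset!r}."
            )
        return cls(**cls.presets[preset], **kwargs)
```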
@fchollet and @jbischof, re how to access the preprocessing paired with a model, I think we can do that as a follow-up (basically I'm just echoing Jonathan's point above). IMO, we should have a way to access preprocessing standalone; that's a totally valid user journey, and this PR addresses it in a way that is consistent with the UX we just came up with. As Francois was saying in the original question, the open question is whether to also expose the preprocessing on the model. I think that is a fairly dense topic, one that can roughly be summarized as "what do we want to do about pipelines?" Let's tackle that in a future PR!
No, it's clear that being able to retrieve the tokenizer independently is useful. I was saying:
I tried this implementation and spotted a bug: this gives shape (512,) for …
@chenmoneygithub the preprocessing layer does not handle batch inputs. It is expected to be mapped over each example in the batch. If there are multiple sentences in the input, these are interpreted as multiple segments for the same example (e.g., for NSP or sentence similarity).
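A minimal sketch of that single-example behavior, assuming the from_preset constructor from this PR (the import path, preset name, and output key names here are assumptions):

```python
import keras_nlp

# Hypothetical preset name; use whatever presets this PR registers.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset("bert_base_uncased_en")

# Two strings in one call are treated as two segments of ONE example
# (as for NSP or sentence-pair tasks), not as a batch of two examples.
features = preprocessor(("The quick brown fox.", "It jumped over the lazy dog."))
print(features["token_ids"].shape)    # (512,): one packed sequence
print(features["segment_ids"].shape)  # (512,): 0s for segment one, 1s for segment two
```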
@jbischof @chenmoneygithub I think this batching convo is unrelated to the PR, but it seems worth clearing up.

The sum total of this means you need to be careful when passing un-tensorized inputs to the layer, because you hit an ambiguity of sorts. We treat the first tuple or list we see as indicating multiple separate string features to collate. So in the code snippet you posted, Chen, we are doing a very valid thing, and concatenating the two sentences together into a single sequence. This colab shows what I mean. We could consider flipping the default behavior there, so that if you pass a tuple of strings (rather than a tuple of numpy arrays or tensors) we treat it as a batch. That is definitely something for a separate issue though!
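And a sketch of the per-example mapping the layer does expect for batches, continuing with the preprocessor object from the snippet above (assuming it can be used inside tf.data.Dataset.map, which is the intent but not something this PR demonstrates):

```python
import tensorflow as tf

# Each dataset element is treated as its own example, so a list of
# independent sentences becomes a batch of separately packed sequences.
sentences = ["The quick brown fox.", "It jumped over the lazy dog."]
ds = tf.data.Dataset.from_tensor_slices(sentences)
ds = ds.map(preprocessor)  # each element -> dict of (512,)-shaped tensors
ds = ds.batch(2)           # -> dict of (2, 512)-shaped tensors per batch
```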
LGTM! Thanks!
Closes #388