Add from_preset constructor to BertPreprocessor #390
Conversation
Design Q: should preprocessors solely be available via Preprocessor.from_preset(name), or should they also be available from the corresponding model, e.g. Bert.from_preset(name).tokenizer (or something similar)?
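For concreteness, the two access patterns being weighed look roughly like this. Both are hypothetical sketches of the API shape; the preset name, the Bert symbol, and the tokenizer attribute are placeholders, not committed names:

```python
# Option A: a standalone constructor on the preprocessor class (what this PR adds).
preprocessor = BertPreprocessor.from_preset("bert_base_uncased_en")

# Option B: reach the preprocessing through the model object.
# (Hypothetical; neither the attribute name nor the Bert symbol is settled.)
model = Bert.from_preset("bert_base_uncased_en")
preprocessor = model.tokenizer  # or model.preprocessor
```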
@fchollet, @mattdangerw is working on a
I am not sure that a separate "pipeline" class is warranted. I'd favor including the preprocessing in the user-facing model (e.g. Bert or BertClassifier), similar to what we do for vision models.
Ultimately we just want models that can process raw strings. To have to deal with an additional layer of abstraction to get the ability to process strings seems clumsy.
@fchollet I was assuming that, regardless of how we implement preprocessing as part of the model, we still need standalone preprocessing to be a good experience, which is what I have attempted in this PR. Are you saying you want to block this PR until we figure out joint preprocessing, that the current PR is a bad experience for standalone preprocessing, or that we shouldn't have standalone preprocessing at all?
Overall this looks good to me! And definitely a positive delta from what we have.

I think one bigger question is whether we want to have a BertPreprocessor.presets. Soon, Bert.from_preset, BertClassifier.from_preset and BertPreprocessor.from_preset will all accept different sets of keys (the last being the combination of the first two). Should we try to keep a consistent UX, where each class has a presets static property, and the valid keys can be accessed with MySymbol.presets.keys()? I don't see a reason we need to figure that out before submitting this, and there is probably plenty to discuss in the weeds there. But it is something we should think about going forward.
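As a rough sketch of that consistent-UX idea (illustrative only, not this PR's implementation; the preset names and config fields below are made up):

```python
import keras

class BertPreprocessorSketch(keras.layers.Layer):
    # Hypothetical preset registry; real entries would also carry vocab assets.
    presets = {
        "bert_base_uncased_en": {"sequence_length": 512},
        "bert_large_uncased_en": {"sequence_length": 512},
    }

    def __init__(self, sequence_length=512, **kwargs):
        super().__init__(**kwargs)
        self.sequence_length = sequence_length

    @classmethod
    def from_preset(cls, preset, **kwargs):
        # Valid keys are discoverable via `cls.presets.keys()`.
        if preset not in cls.presets:
            raise ValueError(
                f"`preset` must be one of {list(cls.presets)}. Received: {preset!r}."
            )
        return cls(**cls.presets[preset], **kwargs)
```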
@fchollet and @jbischof, re how to access the preprocessing paired with a model, I think we can do that as a follow-up (basically I'm just echoing Jonathan's point above). IMO, we should have a way to access preprocessing standalone; that's a totally valid user journey, and this PR addresses it in a way that is consistent with the UX we just came up with. As Francois was saying in the original question, the open question is whether to also expose the preprocessing on the model. I think that is a fairly dense topic, one that can roughly be summarized as "what do we want to do about pipelines?" Let's tackle that in a future PR!
No, it's clear that being able to retrieve the tokenizer independently is useful. I was saying:
I tried this implementation and spotted a bug: this gives shape (512,) for …
@chenmoneygithub the preprocessing layer does not handle batch inputs. It is expected to be mapped over each example in the batch. If there are multiple sentences in the input, these are interpreted as multiple segments for the same example (e.g., for NSP or sentence similarity).
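A minimal sketch of that single-example behavior, assuming the from_preset constructor from this PR (the import path, preset name, and output key names here are assumptions):

```python
import keras_nlp

# Hypothetical preset name; use whatever presets this PR registers.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset("bert_base_uncased_en")

# Two strings in one call are treated as two segments of ONE example
# (as for NSP or sentence-pair tasks), not as a batch of two examples.
features = preprocessor(("The quick brown fox.", "It jumped over the lazy dog."))
print(features["token_ids"].shape)    # (512,): one packed sequence
print(features["segment_ids"].shape)  # (512,): 0s for segment one, 1s for segment two
```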
@jbischof @chenmoneygithub I think this batching convo is unrelated to the PR, but it seems worth clearing up.

The sum total of this means you need to be careful when passing un-tensorized inputs to the layer, because you hit an ambiguity of sorts. We treat the first tuple or list we see as indicating multiple separate string features to collate. So in the code snippet you posted, Chen, we are doing a very valid thing, and concatenating the two sentences together into a single sequence. This colab shows what I mean. We could consider flipping the default behavior there, so that if you pass a tuple of strings (rather than a tuple of numpy arrays or tensors) we treat it as a batch. That is definitely something for a separate issue though!
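And a sketch of the per-example mapping the layer does expect for batches, continuing with the preprocessor object from the snippet above (assuming it can be used inside tf.data.Dataset.map, which is the intent but not something this PR demonstrates):

```python
import tensorflow as tf

# Each dataset element is treated as its own example, so a list of
# independent sentences becomes a batch of separately packed sequences.
sentences = ["The quick brown fox.", "It jumped over the lazy dog."]
ds = tf.data.Dataset.from_tensor_slices(sentences)
ds = ds.map(preprocessor)  # each element -> dict of (512,)-shaped tensors
ds = ds.batch(2)           # -> dict of (2, 512)-shaped tensors per batch
```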
LGTM! Thanks!
Closes #388