Fixed no oov token error in vocab for WordPieceTokenizer #136
bfdfb98 to b934655
Thanks! A couple of comments.
    no_pretokenization=True,
    support_detokenization=True,
)
try:
One problem with this try/except is that whenever a RuntimeError is raised, the user is always told that oov_token is the problem (either unset or missing from the vocab). However, that is not the only case that throws RuntimeError, so users may end up seeing the wrong message.
I think we can check explicitly here:
- If oov_token is unset, we raise an error.
- If oov_token cannot be found in the vocab, we prompt users to add the token to their vocab.
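A minimal sketch of the two explicit checks suggested above, assuming the vocabulary is already loaded as an in-memory list of strings. The helper name `validate_oov_token` and the error wording are hypothetical, for illustration only, and are not taken from this PR or the library.

```python
def validate_oov_token(vocabulary, oov_token):
    """Hypothetical helper: validate oov_token up front instead of
    catching RuntimeError from the underlying tokenizer."""
    # Check 1: oov_token must be set.
    if oov_token is None:
        raise ValueError("`oov_token` cannot be None.")
    # Check 2: oov_token must be present in the vocabulary.
    if oov_token not in vocabulary:
        raise ValueError(
            f'Cannot find `oov_token="{oov_token}"` in the vocabulary. '
            "Either add it to the vocabulary or pass a different "
            "`oov_token` that is already in the vocabulary."
        )


# Passes silently when the token is present.
validate_oov_token(["[UNK]", "the", "quick"], "[UNK]")
```

Because each failure mode has its own check, an unrelated RuntimeError from the tokenizer would no longer be reported as an oov_token problem.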
See my message above. We can just remove the try/except and do our own check, but preserve the error message in this PR, which is helpful and just needs copy edits.
We could add a separate check
    if oov_token is None:
        raise ValueError("oov_token cannot be None")
in case users try to manually set oov_token to None, I guess. Might come up?
This could be the case if users don't really care about oov_token
fuss we sought to fix here...
@mattdangerw One funny thing here is that we posted our reviews at exactly the same time (during the triage meeting), so we got a race condition.
Yeah, we just need to make this check more explicit.
b934655 to ed76964
* Fixed no oov token error in vocab for WordPieceTokenizer
* Raise no oov_token error during explicit checking for WordPieceTokenizer
* Edits
* Fix
Co-authored-by: Matt Watson <[email protected]>
Resolves #135.
@mattdangerw This is what I could figure out to fix the issue. Please review🙌!