
Fixed no oov token error in vocab for WordPieceTokenizer #136

Merged: 4 commits, Apr 25, 2022

Conversation

adhadse (Contributor) commented Apr 21, 2022

Resolves #135.
@mattdangerw This is what I could figure out to fix the issue. Please review🙌!

@adhadse adhadse force-pushed the WordPieceTokenizer_fix branch from bfdfb98 to b934655 Compare April 21, 2022 04:50
mattdangerw (Member) left a comment

Thanks! A couple comments..

keras_nlp/tokenizers/word_piece_tokenizer.py — 3 outdated review comments (resolved)
Reviewed code context (word_piece_tokenizer.py):

        no_pretokenization=True,
        support_detokenization=True,
    )
    try:
Contributor commented:

One problem with this try-catch is that whenever a RuntimeError is encountered, it will always tell the user that oov_token is the problem (either unset or missing from the vocab). However, that is not the only case that raises RuntimeError, so users may end up seeing the wrong message.

I think we can check explicitly here instead:

  • If oov_token is unset, raise an error.
  • If oov_token cannot be found in the vocab, prompt the user to add the token to their vocab.
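The two explicit checks proposed above could look roughly like this (a hypothetical sketch for illustration; `validate_oov_token` is an invented helper name, not the actual code merged in this PR, and the error messages are placeholders):

```python
def validate_oov_token(vocabulary, oov_token):
    """Sketch of the explicit checks proposed above (hypothetical helper)."""
    # Check 1: oov_token must be set.
    if oov_token is None:
        raise ValueError(
            "`oov_token` cannot be None. Please provide an out-of-vocabulary "
            "token to which unknown words will be mapped."
        )
    # Check 2: oov_token must actually appear in the vocabulary.
    if oov_token not in vocabulary:
        raise RuntimeError(
            f"Cannot find `oov_token` {oov_token!r} in the provided "
            "vocabulary. Please add it to your vocabulary."
        )
```

Doing these checks up front (rather than catching RuntimeError from the underlying tokenizer) means each failure mode gets its own accurate message.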

Member commented:

See my message above. We can just remove the try-catch and do our own check, but preserve the error message in this PR, which is helpful and just needs copy edits.

We could also add a separate check:

    if oov_token is None:
        raise ValueError("oov_token cannot be None")

in case users try to manually set oov_token to None. Might come up?

Contributor (Author) commented:

This could be the case if the user doesn't really care about the oov_token fuss we sought to fix here...

Contributor commented:

@mattdangerw One funny thing here is that we posted our reviews at exactly the same time (during the triage meeting), so we got a race condition.

Yeah, we just need to make this check more explicit.

@adhadse adhadse force-pushed the WordPieceTokenizer_fix branch from b934655 to ed76964 Compare April 23, 2022 01:08
@mattdangerw mattdangerw merged commit fca13e8 into keras-team:master Apr 25, 2022
adhadse added a commit to adhadse/keras-nlp that referenced this pull request Sep 17, 2022
* Fixed no oov token error in vocab for WordPieceTokenizer

* Raise no oov_token error during explicit checking for WordPieceTokenizer

* Edits

* Fix

Co-authored-by: Matt Watson <[email protected]>
Labels: none. Projects: none.
Development

Successfully merging this pull request may close these issues.

Raise RuntimeError if oov_token is not found in vocabulary of WordPieceTokenizer
3 participants