
Fixed no oov token error in vocab for WordPieceTokenizer #136

Merged: 4 commits, Apr 25, 2022

Conversation

adhadse (Contributor) commented Apr 21, 2022

Resolves #135.
@mattdangerw This is what I could figure out to fix the issue. Please review🙌!

@adhadse adhadse force-pushed the WordPieceTokenizer_fix branch from bfdfb98 to b934655 Compare April 21, 2022 04:50
mattdangerw (Member) left a comment

Thanks! A couple comments..

keras_nlp/tokenizers/word_piece_tokenizer.py — 3 outdated review comments (resolved)
Reviewed code context (word_piece_tokenizer.py):

        no_pretokenization=True,
        support_detokenization=True,
    )
    try:
Contributor commented:

One problem with this try-catch is that whenever a RuntimeError is encountered, it will always tell the user that oov_token is the problem (either unset or missing from the vocab). However, that is not the only case that raises RuntimeError, so users may end up seeing the wrong message.

I think we can check explicitly here instead:

  • If oov_token is unset, raise an error.
  • If oov_token cannot be found in the vocab, prompt the user to add the token to their vocab.
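The two explicit checks proposed above could look roughly like this (a hypothetical sketch for illustration; `validate_oov_token` is an invented helper name, not the actual code merged in this PR, and the error messages are placeholders):

```python
def validate_oov_token(vocabulary, oov_token):
    """Sketch of the explicit checks proposed above (hypothetical helper)."""
    # Check 1: oov_token must be set.
    if oov_token is None:
        raise ValueError(
            "`oov_token` cannot be None. Please provide an out-of-vocabulary "
            "token to which unknown words will be mapped."
        )
    # Check 2: oov_token must actually appear in the vocabulary.
    if oov_token not in vocabulary:
        raise RuntimeError(
            f"Cannot find `oov_token` {oov_token!r} in the provided "
            "vocabulary. Please add it to your vocabulary."
        )
```

Doing these checks up front (rather than catching RuntimeError from the underlying tokenizer) means each failure mode gets its own accurate message.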

Member commented:

See my message above. We can just remove the try-catch and do our own check, but preserve the error message in this PR, which is helpful and just needs copy edits.

We could also add a separate check:

    if oov_token is None:
        raise ValueError("oov_token cannot be None")

in case users try to manually set oov_token to None. Might come up?

Contributor (Author) commented:

This could be the case if the user doesn't really care about the oov_token fuss we sought to fix here...

Contributor commented:

@mattdangerw One funny thing here is that we posted our reviews at exactly the same time (during the triage meeting), so we got a race condition.

Yeah, we just need to make this check more explicit.

@adhadse adhadse force-pushed the WordPieceTokenizer_fix branch from b934655 to ed76964 Compare April 23, 2022 01:08
@mattdangerw mattdangerw merged commit fca13e8 into keras-team:master Apr 25, 2022
adhadse added a commit to adhadse/keras-nlp that referenced this pull request Sep 17, 2022
* Fixed no oov token error in vocab for WordPieceTokenizer

* Raise no oov_token error during explicit checking for WordPieceTokenizer

* Edits

* Fix

Co-authored-by: Matt Watson <[email protected]>
Labels: none. Projects: none.
Development

Successfully merging this pull request may close these issues.

Raise RuntimeError if oov_token is not found in vocabulary of WordPieceTokenizer
3 participants