
Handle [MASK] token in DebertaV3Tokenizer #759

Merged: 15 commits into keras-team:master on Feb 24, 2023

Conversation

@abheesht17 (Collaborator) commented Feb 18, 2023

Fix for #732 (comment).

@mattdangerw (Member) left a comment

Thanks for tracking this down! Just have some high level comments for now.


# Maintain a private copy of the original vocabulary; the parent class's
# `get_vocabulary()` function calls `self.vocabulary_size()`, which
# throws up a segmentation fault.
@mattdangerw (Member):

What is the segmentation fault here? I'm not sure I totally follow. Ideally we don't have to store a copy of the vocabulary. This would be a not-totally-insignificant waste of memory!

@abheesht17 (Collaborator, Author):

Made edits!

Calling super().get_vocabulary() in __init__ causes a seg fault because SentencePieceTokenizer calls self.vocabulary_size() here: https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/tokenizers/sentence_piece_tokenizer.py#L161-L165. Since DebertaV3Tokenizer overrides vocabulary_size() to return a value greater than the SPM vocabulary size, the parent method ends up asking SentencePiece about token IDs that don't exist in the proto, which segfaults.
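
For illustration, here is a minimal Python sketch of that interaction (not the actual keras-nlp code): the parent's get_vocabulary() asks the subclass for the size, then converts every id in that range back to a string, so any id past the real SentencePiece vocabulary is invalid.

# Minimal sketch of the class interaction, not the real keras-nlp code.
class SentencePieceTokenizerSketch:
    def __init__(self, spm_vocab_size):
        self._spm_vocab_size = spm_vocab_size

    def vocabulary_size(self):
        return self._spm_vocab_size

    def _id_to_string(self, token_id):
        # Stand-in for `self._sentence_piece.id_to_string(...)`.
        if token_id >= self._spm_vocab_size:
            raise IndexError(f"id {token_id} is outside the SPM vocabulary")
        return f"token_{token_id}"

    def get_vocabulary(self):
        # Iterates over `self.vocabulary_size()` ids; with the subclass
        # override below this range exceeds the real proto, and converting
        # those extra ids back to strings is what blows up.
        return [self._id_to_string(i) for i in range(self.vocabulary_size())]


class DebertaV3TokenizerSketch(SentencePieceTokenizerSketch):
    def vocabulary_size(self):
        # Reports one extra slot for the appended `[MASK]` token, which is
        # why calling the parent's get_vocabulary() from __init__ fails.
        return super().vocabulary_size() + 1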

@@ -48,6 +48,11 @@ class DebertaV3Tokenizer(SentencePieceTokenizer):
`bytes` object with a serialized SentencePiece proto. See the
[SentencePiece repository](https://github.com/google/sentencepiece)
for more details on the format.
mask_token_id: The token ID (int) of the mask token (`[MASK]`). If
@mattdangerw (Member):

I think, given that most users will not need an MLM task, we should actually make this optional when "bringing your own data." Something like...

  • Use one of our presets. self.mask_token_id is set and works as expected.
  • Pass your own local copy of a deberta spm file and don't set anything. self.mask_token_id is None; everything works except DebertaMaskedLM, which throws a friendly error message. We can cover the error in #732 ("Solve #721: Deberta masklm model").
  • Optional (but already working here). Use your own custom spm file with a "[MASK]" token. self.mask_token_id is set and works as expected.

Does that make sense to you?
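
For what it's worth, a rough sketch of how those three cases could branch (resolve_mask_token_id and spm_vocabulary are hypothetical names, not the merged implementation):

# Hypothetical sketch of the three cases above; `spm_vocabulary` stands in
# for however the pieces in the proto end up being enumerated.
def resolve_mask_token_id(spm_vocabulary, is_deberta_preset):
    if "[MASK]" in spm_vocabulary:
        # Case 3: a custom spm file that already contains "[MASK]".
        return spm_vocabulary.index("[MASK]")
    if is_deberta_preset:
        # Case 1: DeBERTa presets treat "[MASK]" as an extra id appended
        # right after the SentencePiece vocabulary.
        return len(spm_vocabulary)
    # Case 2: bring-your-own spm file without "[MASK]". Everything works
    # except DebertaMaskedLM, which should raise a friendly error when
    # `mask_token_id` is None.
    return None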

@abheesht17 (Collaborator, Author) commented Feb 23, 2023

(1) and (3) were already taken care of. I've pushed changes which solve all three cases and resolve the other comment.

@mattdangerw (Member):

Thanks! Looks good. Left some thoughts below on how we could maybe make the subclass changes a bit easier by modifying the super class.


return (
original_vocabulary
+ [None] * (self._mask_token_id - super().vocabulary_size())
@mattdangerw (Member):

Should we do something like "[PLACEHOLDER]" here? Or whatever deberta does?

Sneaking None into a list of strings seems like a bug waiting to happen.
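
For example, the padding could look something like this sketch (the "[PLACEHOLDER]" string and the helper below are illustrative, including whether "[MASK]" itself gets appended at the end):

# Sketch of the suggestion: pad the gap between the SPM vocabulary and the
# mask id with a visible placeholder string instead of `None`, so the
# returned vocabulary stays a flat list of strings.
def padded_vocabulary(original_vocabulary, spm_size, mask_token_id):
    padding = ["[PLACEHOLDER]"] * (mask_token_id - spm_size)
    return original_vocabulary + padding + ["[MASK]"]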

@@ -159,7 +159,7 @@ def vocabulary_size(self) -> int:
return int(self._sentence_piece.vocab_size().numpy())

    def get_vocabulary(self) -> List[str]:
-       """Get the size of the tokenizer vocabulary."""
+       """Get the tokenizer vocabulary."""
@mattdangerw (Member):

Is there any downside to making the super class impl use self._sentence_piece.vocab_size() instead of self.vocabulary_size() here? Then we don't need all this indirection on the subclass.
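
A sketch of what that superclass change might look like, assuming the existing body converts ids back to strings via the proto's id_to_string (helper names are approximate, not a verified copy of the file):

# (Inside SentencePieceTokenizer; assumes the module's existing
# tf / List / tensor_to_string_list imports.)
def get_vocabulary(self) -> List[str]:
    """Get the tokenizer vocabulary."""
    # Read the size straight from the SentencePiece proto so a subclass
    # override of `vocabulary_size()` cannot change how many ids we try
    # to convert back to strings.
    size = int(self._sentence_piece.vocab_size().numpy())
    return tensor_to_string_list(
        self._sentence_piece.id_to_string(tf.range(size))
    )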

@mattdangerw (Member) left a comment

Thanks! This looks good. Though I think we need to update the presets to fix tests.

@@ -25,7 +25,9 @@
"max_sequence_length": 512,
"bucket_size": 256,
},
-    "preprocessor_config": {},
+    "preprocessor_config": {
+        "mask_token_id": 128000,
@mattdangerw (Member):

I think we need to remove this from all the presets now, right? It is breaking tests.

self.mask_token_id = super().vocabulary_size()

def vocabulary_size(self):
return max(super().vocabulary_size(), self.mask_token_id + 1)
@mattdangerw (Member):

I would write this a little longer just for clarity...

# Account for appended mask token if necessary.
sentencepiece_size = super().vocabulary_size()
if sentencepiece_size == self.mask_token_id:
    return sentencepiece_size + 1
return sentencepiece_size

@mattdangerw (Member):

Thanks!

@mattdangerw merged commit 1eeec3b into keras-team:master on Feb 24, 2023