Add an add_prefix_space Arg in BytePairTokenizer #715

shivance · 2023-02-02T12:08:48Z

Closes #436

abheesht17 · 2023-02-02T12:47:38Z

keras_nlp/tokenizers/byte_pair_tokenizer.py

@@ -87,7 +87,7 @@ def remove_strings_from_inputs(tensor, string_to_remove):
    return result


-def split_strings_for_bpe(inputs):
+def split_strings_for_bpe(inputs, add_prefix_space):


Will this work 🤔? We are passing this argument in this function, but not actually doing anything with it inside the function.

I was thinking of doing something along these lines in the tokenize() function:

def tokenize(self, inputs):
    if not isinstance(inputs, (tf.Tensor, tf.RaggedTensor)):
        inputs = tf.convert_to_tensor(inputs)
    if self.add_prefix_space:
        inputs = tf.strings.join([" ", inputs])
    ...

Trial:

>>> import tensorflow as tf
>>> inputs = tf.constant(["add space at the beginning of every string", "we can use tf.strings.join(...)"])
>>> inputs = tf.strings.join([" ", inputs])
>>> inputs
<tf.Tensor: shape=(2,), dtype=string, numpy=
array([b' add space at the beginning of every string',
       b' we can use tf.strings.join(...)'], dtype=object)>
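For context, here is a minimal pure-Python sketch of why prepending a space can change how a byte-pair tokenizer splits its input. It is not the keras_nlp implementation: `SPLIT_PATTERN` and `pre_split` are hypothetical names, and the regex is a simplified stand-in for a GPT-2-style pre-tokenization pattern, written with the standard-library `re` module.

```python
import re

# Simplified, hypothetical stand-in for a GPT-2-style split pattern:
# an optional leading space stays attached to the word that follows it.
SPLIT_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pre_split(text, add_prefix_space=False):
    """Split text into word pieces, optionally prepending a space first."""
    if add_prefix_space:
        # Same idea as tf.strings.join([" ", inputs]) in the comment above.
        text = " " + text
    return SPLIT_PATTERN.findall(text)


print(pre_split("hello world"))
print(pre_split("hello world", add_prefix_space=True))
```

Without the prefix space, the first word is the only piece that lacks a leading space, so it may map to a different token than the same word appearing mid-sentence. With the flag set, every word is split in the same form.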

CC: @mattdangerw

keras_nlp/tokenizers/byte_pair_tokenizer.py

shivance · 2023-02-02T13:12:45Z

Not sure why the code format check fails.

I ran the format script locally.

abheesht17 · 2023-02-02T13:18:38Z

> Not sure why code format check fails
>
> Ran format script locally

Upgrade your black version:

pip install --upgrade black

abheesht17 · 2023-02-02T13:19:58Z

@shivance, please add a unit test as well

shivance · 2023-02-02T13:21:23Z

Thanks @abheesht17, just found #708

shivance · 2023-02-02T16:38:29Z

cc: @jbischof
Ready for review

chenmoneygithub

Thanks for the PR! Also thanks Abi for the careful review!

shivance · 2023-02-02T19:19:59Z

Yes, definitely. Thanks @abheesht17!!

mattdangerw

This looks good to me! One minor comment on the docstring.

keras_nlp/tokenizers/byte_pair_tokenizer.py

mattdangerw · 2023-02-03T00:25:31Z

Thank you!!

shivance added 2 commits February 2, 2023 17:37

init commit

e4507da

updated

4475b94

abheesht17 reviewed Feb 2, 2023

abheesht17 requested a review from mattdangerw February 2, 2023 12:50

abheesht17 reviewed Feb 2, 2023

formatting + docstring change

4cd2960

bumping black version

eb9e7fb

adding unit test

398f979

chenmoneygithub assigned chenmoneygithub and unassigned chenmoneygithub Feb 2, 2023

chenmoneygithub self-requested a review February 2, 2023 18:37

chenmoneygithub approved these changes Feb 2, 2023

mattdangerw requested changes Feb 2, 2023

minor docstring change

a3c3514

mattdangerw merged commit 6cec401 into keras-team:master Feb 3, 2023

shivance deleted the add_prefix_space branch February 13, 2023 13:55
