Adding Utility to Detokenize as list of Strings to Tokenizer Base Class #124

aflah02 · 2022-04-16T07:10:55Z

Continuation of #119, which was closed due to git issues.
Fixes #113
Currently in progress

mattdangerw

A few comments on the code. I will need to check with people tomorrow to make sure this is the naming we want.

keras_nlp/tokenizers/tokenizer.py

aflah02 · 2022-04-18T23:43:57Z

Thanks for the review! I'll get to work on this

keras_nlp/tokenizers/tokenizer.py

aflah02 · 2022-04-21T06:10:53Z

@mattdangerw
I have some questions about testing the function in the way you described.
Essentially, we need to test whether it outputs regular strings rather than byte strings when used with something like the Unicode Character Tokenizer, right? I don't see how a simple subclass can help us test that.
Wouldn't it be better to test it directly with the Unicode Character Tokenizer class and other classes?

mattdangerw · 2022-04-21T20:12:27Z

The test cases should live on the base class, as we are writing unit tests for functionality on the base class. But the tokenizer base class is not usable standalone; it needs to be subclassed. So in tokenizer_test.py you could define a quick helper. For example:

class PassThroughTokenizer(Tokenizer):
    __test__ = False  # for pytest

    def tokenize(self, inputs):
        return inputs

    def detokenize(self, inputs):
        return inputs

You could use that to fully test the logic of your helper function (by calling detokenize_to_string with raggeds, dense, scalar, etc.).
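A minimal sketch of how such a pass-through subclass could exercise the helper. To keep it runnable without TensorFlow, the `Tokenizer` base class below is a hypothetical stand-in that works on plain Python values (bytes and nested lists) rather than tensors, and the body of `detokenize_to_string` is an assumption, not the code under review. Only the testing pattern is the point here.

```python
class Tokenizer:
    """Hypothetical stand-in for the real base class (no TensorFlow)."""

    def tokenize(self, inputs):
        raise NotImplementedError

    def detokenize(self, inputs):
        raise NotImplementedError

    def detokenize_to_string(self, inputs):
        # Assumed helper logic: detokenize, then decode bytes recursively.
        return _decode(self.detokenize(inputs))


def _decode(value):
    """Decode bytes to str, recursing into (possibly uneven) nested lists."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return [_decode(v) for v in value]


class PassThroughTokenizer(Tokenizer):
    __test__ = False  # for pytest

    def tokenize(self, inputs):
        return inputs

    def detokenize(self, inputs):
        return inputs


tokenizer = PassThroughTokenizer()
# Scalar, dense (even rows) and ragged (uneven rows) stand-in inputs.
scalar_out = tokenizer.detokenize_to_string(b"hello")
dense_out = tokenizer.detokenize_to_string([b"a", b"b"])
ragged_out = tokenizer.detokenize_to_string([[b"a", b"b"], [b"c"]])
```

Because the subclass returns its inputs unchanged, any difference between input and output comes from the helper itself, which is what the base-class tests need to check.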

aflah02 · 2022-04-24T10:21:15Z

@mattdangerw Sorry if I missed the message but I can't seem to find any confirmation for the naming of the method. Has it been decided?

mattdangerw

Left some comments! Will check re function signature.

keras_nlp/tokenizers/tokenizer.py

keras_nlp/tokenizers/tokenizer_test.py

aflah02 · 2022-04-25T18:54:41Z

@mattdangerw I've made fixes according to the suggestions. I'm still a bit doubtful about the function names, but apart from that I think it's all covered.

mattdangerw · 2022-04-25T19:29:58Z

Oops, sorry. I tried to put some copy edits in (mainly I wanted the docstring summary to fit on a single line, which always needs to be the case), and I broke the tests. Let me fix. 😬

aflah02 · 2022-04-25T19:33:13Z

@mattdangerw I think the tests fail because of the .numpy() call before getting the rank. After that call the value is no longer a tensor, so we can't make those calls on it.

mattdangerw · 2022-04-25T19:48:27Z

Yeah, I need to get in the habit of suggesting edits from a local checkout, not just straight from the GitHub browser. Anyway, lgtm. I will check with others re desired naming for public signatures before merging this, though.

mattdangerw · 2022-04-29T19:23:09Z

Well, I have been talking with @fchollet re the function signature here, and I don't think we fully know what we want yet.

I think right now we have a preference for a flag on detokenize (e.g. return_strings=True), but the implementation would be a little awkward here. If we wanted this to show up in the docs for all tokenizers, we might want to add it to the signature of each detokenize() method and support it in each detokenize, which would add overhead for subclassers.
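To illustrate the overhead described above, here is a hypothetical sketch in plain Python (no TensorFlow). `ToySpaceTokenizer` is a made-up class for illustration, and `return_strings` is only the flag name floated in this comment, not a settled API. With a flag on `detokenize()`, each subclass has to accept and honor it itself.

```python
class ToySpaceTokenizer:
    """Made-up tokenizer operating on bytes, for illustration only."""

    def tokenize(self, inputs):
        return inputs.split(b" ")

    def detokenize(self, inputs, return_strings=False):
        outputs = b" ".join(inputs)
        # Every tokenizer subclass would need to repeat this branch.
        if return_strings:
            return outputs.decode("utf-8")
        return outputs


tokenizer = ToySpaceTokenizer()
tokens = tokenizer.tokenize(b"the quick fox")
as_bytes = tokenizer.detokenize(tokens)
as_text = tokenizer.detokenize(tokens, return_strings=True)
```

A standalone utility avoids repeating that branch in every subclass, at the cost of one extra call for the user.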

In the interest of unblocking this PR, let's just add this as a utility for now.

Add a keras_nlp/utils/ directory
Add tensor_utils.py and tensor_utils_test.py
Add this symbol as standalone function tensor_to_string_list
Port your util tests to the test file.

Then we can keep figuring out how to expose this in the tokenizer. I also think there will be benefit to having this as a utility anyway, so we could use it from non-tokenizer contexts.
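A rough sketch of what such a standalone utility might look like; this is an assumption for illustration, not the code that was merged. It is duck-typed so it runs without TensorFlow: it assumes ragged inputs expose `to_list()` (as `tf.RaggedTensor` does), dense eager tensors expose `numpy()`, and anything else is already plain Python bytes or nested lists of bytes. The `_decode` helper is a made-up name.

```python
def tensor_to_string_list(inputs):
    """Convert a string tensor-like value to (nested) Python strings.

    Sketch only; see the assumptions about `inputs` in the text above.
    """
    if hasattr(inputs, "to_list"):
        # Ragged case: rows may have different lengths.
        value = inputs.to_list()
    elif hasattr(inputs, "numpy"):
        value = inputs.numpy()
        # Arrays expose tolist(); a scalar string comes back as bytes.
        if hasattr(value, "tolist"):
            value = value.tolist()
    else:
        value = inputs
    return _decode(value)


def _decode(value):
    """Decode UTF-8 bytes to str, recursing into nested lists."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return [_decode(v) for v in value]


# Plain-Python stand-ins for scalar, dense and ragged string tensors.
scalar = tensor_to_string_list("héllo".encode("utf-8"))
dense = tensor_to_string_list([[b"a", b"b"], [b"c", b"d"]])
ragged = tensor_to_string_list([[b"a"], [b"b", b"c"]])
```

Since the function takes no tokenizer, it can be called on the output of any `detokenize()` or on string tensors from other sources.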

Thanks! Sorry about the slow discussion here, API design is definitely the hard part on a lot of these issues.

aflah02 · 2022-04-29T20:02:45Z

No worries! This seems reasonable, I'll port these files as recommended

aflah02 · 2022-05-03T13:29:03Z

@mattdangerw The PR is ready for review. I've removed the *args and **kwargs from the function signature, as I think they are no longer needed: they were only being passed through to the tokenizer's detokenize function.

mattdangerw · 2022-05-03T17:20:45Z

Thank you! This looks good to me.

Added Functions to Base Class

395a296

mattdangerw requested changes Apr 18, 2022

chenmoneygithub suggested changes Apr 19, 2022

Tightened Logic started Work on Tests

e4fead5

aflah02 added 4 commits April 24, 2022 16:22

Added tests

a7271ff

Merge branch 'keras-team:master' into DetokenizingToString

4bf1744

Updated Docstring

d6dbe9d

Merge branch 'DetokenizingToString' of https://github.com/aflah02/ker…

aa3f10d

…as-nlp into DetokenizingToString

aflah02 requested review from mattdangerw and chenmoneygithub April 24, 2022 10:55

mattdangerw requested changes Apr 25, 2022

aflah02 added 3 commits April 26, 2022 00:05

Fixing Tokenizer

5ce8091

Fixed Broken Tests

e1b6df8

Ran format and lint

d615cb5

mattdangerw added 2 commits April 25, 2022 12:16

Fix docstring summary to fit on single line

8d98121

Adds a little more description as well

Remove trailing whitespace

f84b49d

fix

511ccd6

mattdangerw approved these changes Apr 25, 2022

chenmoneygithub approved these changes Apr 26, 2022

mattdangerw mentioned this pull request Apr 26, 2022

Add ROUGE Metric #122

Merged

Ported tensor_to_string_list to tensor_utils

9d6188f

mattdangerw merged commit 375082e into keras-team:master May 3, 2022
