Adding Utility to Detokenize as list of Strings to Tokenizer Base Class #124

Merged: 13 commits merged into keras-team:master on May 3, 2022

Conversation

aflah02
Collaborator

@aflah02 aflah02 commented Apr 16, 2022

Continuation of #119, which was closed due to git issues.
Fixes #113.
Currently in progress.

Member

@mattdangerw mattdangerw left a comment

A few comments on the code. I will need to check with people tomorrow to make sure this is the naming we want.

keras_nlp/tokenizers/tokenizer.py (5 review threads, outdated and resolved)

@aflah02
Collaborator Author

aflah02 commented Apr 18, 2022

Thanks for the review! I'll get to work on this

@aflah02
Collaborator Author

aflah02 commented Apr 21, 2022

@mattdangerw
I have some questions about testing the function as you described here.
Essentially, we need to test that it outputs proper strings rather than byte strings when used with something like the Unicode Character Tokenizer, right? I don't see how a simple subclass can help us test that.
Wouldn't it be better to test it directly with the Unicode Character Tokenizer class and the other tokenizer classes?

@mattdangerw
Member

The test cases should live on the base class, as we are writing unit tests for functionality on the base class. But the tokenizer base class is not usable standalone; it needs to be subclassed. So in tokenizer_test.py you could define a quick helper. For example:

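# Import path assumed from the file under review (keras_nlp/tokenizers/tokenizer.py).
from keras_nlp.tokenizers.tokenizer import Tokenizer


class PassThroughTokenizer(Tokenizer):
    """Minimal subclass that returns its inputs unchanged."""

    __test__ = False  # for pytest

    def tokenize(self, inputs):
        return inputs

    def detokenize(self, inputs):
        return inputs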

You could use that to fully test the logic of your helper function (by calling detokenize_to_string with ragged, dense, and scalar inputs, etc.).

@aflah02
Collaborator Author

aflah02 commented Apr 24, 2022

@mattdangerw Sorry if I missed the message but I can't seem to find any confirmation for the naming of the method. Has it been decided?

Member

@mattdangerw mattdangerw left a comment

Left some comments! Will check re function signature.

keras_nlp/tokenizers/tokenizer.py (5 review threads, outdated and resolved)
keras_nlp/tokenizers/tokenizer_test.py (4 review threads, outdated and resolved)

@aflah02
Collaborator Author

aflah02 commented Apr 25, 2022

@mattdangerw I've made fixes according to the suggestions. I'm still a bit unsure about the function names, but apart from that I think everything is covered.

@mattdangerw
Member

Oops, sorry. I tried to put some copy edits in (mainly so the docstring summary fits on a single line, which always needs to be the case) and I broke the tests. Let me fix. 😬

@aflah02
Collaborator Author

aflah02 commented Apr 25, 2022

@mattdangerw I think the tests fail because .numpy() is called before the rank is read. After that call the value is no longer a tensor, so the rank lookup can't be done on it.
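
The failing diff is not shown in this thread, so the following is only a minimal sketch of the ordering problem described above. `FakeTensor` and `FakeShape` are hypothetical stand-ins (not TensorFlow classes) that mimic just enough of a tensor for the snippet to run without TensorFlow installed: a `shape.rank` attribute and a `numpy()` method that returns a plain Python value.

```python
class FakeShape:
    """Mimics the `rank` attribute of a tensor's static shape."""

    def __init__(self, rank):
        self.rank = rank


class FakeTensor:
    """Hypothetical stand-in for a tensor; not a real TensorFlow class."""

    def __init__(self, value, rank):
        self._value = value
        self.shape = FakeShape(rank)

    def numpy(self):
        # Returns the plain Python value, which has no `.shape.rank`.
        return self._value


def rank_after_numpy(tensor):
    # Problematic order: convert first, then ask the converted value for a rank.
    value = tensor.numpy()
    return value.shape.rank  # raises AttributeError on plain values


def rank_before_numpy(tensor):
    # Working order: read the rank while the object is still a tensor.
    rank = tensor.shape.rank
    value = tensor.numpy()
    return rank, value


scalar = FakeTensor(b"hello", rank=0)
print(rank_before_numpy(scalar))
try:
    rank_after_numpy(scalar)
except AttributeError as err:
    print("rank lookup failed after conversion:", err)
```

Reading the rank first and converting second keeps both pieces of information available, which is the fix the comment above points toward.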

@mattdangerw
Member

Yeah, I need to get in the habit of suggesting edits from a local checkout, not just straight from the GitHub browser. Anyway, lgtm. I will check with others re the desired naming for public signatures before merging this, though.

@mattdangerw mattdangerw mentioned this pull request Apr 26, 2022
@mattdangerw
Member

Well, I have been talking with @fchollet re the function signature here, and I don't think we fully know what we want yet.

I think right now we have a preference for a flag on detokenize (e.g. return_strings=True), but the implementation would be a little awkward here. If we wanted this to show up in the docs for all tokenizers, we might want to add it to the signature of each detokenize() method and support it in each detokenize, which would add overhead for subclassers.

In the interest of unblocking this PR, let's just add this as a utility for now.

  • Add a keras_nlp/utils/ directory
  • Add tensor_utils.py and tensor_utils_test.py
  • Add this symbol as a standalone function, tensor_to_string_list
  • Port your util tests to the test file.

Then we can keep figuring out how to expose this in the tokenizer. I also think there will be a benefit to having this as a utility anyway, so we could use it from non-tokenizer contexts.

Thanks! Sorry about the slow discussion here, API design is definitely the hard part on a lot of these issues.
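
To make the proposed standalone utility concrete, here is a minimal sketch of the core decoding step. It is not the merged implementation of tensor_to_string_list. The helper name `decode_nested_bytes` is hypothetical, and the sketch assumes the tensor has already been converted to plain Python values (a single `bytes` object, or nested lists of `bytes`), so it runs without TensorFlow.

```python
def decode_nested_bytes(values):
    """Recursively decode UTF-8 byte strings, preserving list nesting.

    Hypothetical helper for illustration only. Assumes `values` is either a
    single `bytes` object or arbitrarily nested lists of `bytes`.
    """
    if isinstance(values, bytes):
        # Scalar case: a single byte string becomes a single Python string.
        return values.decode("utf-8")
    # List case: recurse, so dense and ragged nesting are both preserved.
    return [decode_nested_bytes(v) for v in values]


scalar = b"hello"
dense = [[b"a", b"b"], [b"c", b"d"]]
ragged = [[b"a", b"b", b"c"], [b"d"]]

print(decode_nested_bytes(scalar))
print(decode_nested_bytes(dense))
print(decode_nested_bytes(ragged))
```

Because the helper only looks at the nesting of the values it receives, it does not depend on any tokenizer, which matches the point above about using the utility from non-tokenizer contexts.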

@aflah02
Collaborator Author

aflah02 commented Apr 29, 2022

No worries! This seems reasonable, I'll port these files as recommended

@aflah02
Collaborator Author

aflah02 commented May 3, 2022

@mattdangerw The PR is ready for review. I've removed *args and **kwargs from the function signature, since I think they are no longer needed: they were only being passed through to the tokenizer's detokenize function.

@mattdangerw
Member

Thank you! This looks good to me.

@mattdangerw mattdangerw merged commit 375082e into keras-team:master May 3, 2022