-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Utility to Detokenize as list of Strings to Tokenizer Base Class #124
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments on the code. I will need to check with people tomorrow to make sure this is the naming we want.
Thanks for the review! I'll get to work on this |
@mattdangerw |
The test cases should live on the base class, as we are writing unit tests for functionality on the base class. But the tokenizer base class is not a usable thing standalone, it needs to be subclassed. So in tokenizer_test.py you could define a quick helper. For example
You could use that to fully test the logic of your helper function (by calling detokenize_to_string with raggeds, dense, scalar, etc.). |
@mattdangerw Sorry if I missed the message but I can't seem to find any confirmation for the naming of the method. Has it been decided? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments! Will check re function signature.
@mattdangerw I've made fixes according to the suggestions I'm a bit doubtful about the function names still apart from that I think it's all covered |
Adds a little more description as well
Oops sorry, tried to put some copy edits in (mainly wanted the docstring summary to fit on a single line), this needs to always be the case, and I broke tests. Let me fix. 😬 |
@mattdangerw I think the tests fail due to the .numpy() call before getting the rank as after that we can't do those calls as it's no longer a tensor |
Yeah, need to get in habit of suggesting edits for a local checkout, not just straight from github browser. Anyway, lgtm. Will check with others re desired naming for public signatures before merging this though. |
Well I have been talking @fchollet re function signature here, and I don't think we fully know what we want yet. I think right now we have preference for a flag to detokenize (e.g. In the interest of unblocking this PR, let's just add this as a utility for now.
Then we can keep figure out how to expose this in tokenizer. I also this there will be benefit to having this as a utility anyway, so we could use this from non-tokenizer contexts. Thanks! Sorry about the slow discussion here, API design is definitely the hard part on a lot of these issues. |
No worries! This seems reasonable, I'll port these files as recommended |
@mattdangerw The PR is ready for review, I've removed the |
Thank you! This looks good to me. |
In continuation to #119 which was closed due to git issues
Fixes #113
Currently in progress