Adding a UnicodeCharacterTokenizer #100
Conversation
@aflah02 thanks! Will take a look soon. Re decoding, yeah, the way to go from a tensor scalar string to a Python unicode string is
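The snippet referenced in that comment didn't survive here, so the following is a guess at the usual pattern: a `tf.Tensor` string scalar stores UTF-8 bytes, `tensor.numpy()` yields a Python `bytes` object, and `.decode("utf-8")` turns that into a `str`. Sketched with a literal byte string standing in for the `.numpy()` call:

```python
# A tf.Tensor string scalar stores UTF-8 bytes; tensor.numpy() returns a
# Python bytes object. A literal bytes value stands in for that call here.
raw = b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83"

# Decoding the UTF-8 bytes recovers the original Python string.
text = raw.decode("utf-8")
print(text)  # ▀▁▂▃
```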
Thanks! Looks great generally! Left a few comments.
A few more comments. Looks pretty close!
@mattdangerw I've made most of the changes; however, I'm facing two issues.
@aflah02 thanks! Re the first issue, I think someone else was hitting that too; I will take a look later today. Re the second issue, do you have a code snippet that's not working? In terms of a utility to detokenize to Python strings (for all tokenizers), let's do that in a separate issue. I'll open one today. Are you interested in taking that on?
@mattdangerw
LGTM pending this formatting issue, which I will take a look at later.
@mattdangerw Thanks!
So essentially the only remaining issue is
@aflah02 I think your return types are actually correct. `b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83"` is the byte-string representation of `"▀▁▂▃"` in a UTF-8 encoding. By default, when you work with `tf.Tensor` strings, you get byte strings with a UTF-8 encoding (see https://www.tensorflow.org/text/guide/unicode). So the issue is not that this representation of the return type is wrong; it's just that most users will eventually want to go from `tf.Tensor`s to regular Python strings, and it is confusing how to reverse this process. That's where I was proposing we add a helper to all tokenizers, e.g.
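The proposed helper itself isn't shown here, so this is only a rough sketch of the idea. The name `detokenize_to_strings` and the list-of-bytes input are assumptions for illustration, not the actual KerasNLP API:

```python
def detokenize_to_strings(byte_strings):
    """Hypothetical helper: decode the UTF-8 byte strings produced by
    detokenization (e.g. via tensor.numpy()) into regular Python strings."""
    return [b.decode("utf-8") for b in byte_strings]

decoded = detokenize_to_strings([b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83"])
print(decoded)  # ['▀▁▂▃']
```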
@mattdangerw Thanks! That makes total sense now |
The formatting issue should be fixed if you rebase to the latest master.
Found a few more style nits, but the main one is I think some of your usage examples now have incorrect output.
Thanks! Merging. These build failures are unrelated and fixed up in #121
* Debugging
* Debugging
* Fixed Sequence Length Issue
* Sequence Length Changes
* Removed _ From Class Attributes
* Fixed Null Bytes in Detokenization
* Testing regex_replace
* Testing
* Helper Function and Debug Statements
* Testing Regex Replace New Ordering
* Added Checks for Errors and Normalization Form
* Doc String Completed
* Ran lint/format
* New Tests and Decoding Changes
* Changes
* Minor Tweak
* Tweaking Detokenizer
* Added Tests and Updated Docstrings
* Ran format.sh and lint.sh
* Refactoring and Removing Unused Lines
* Fixed Some Broken Tests
* Fixed All Tests
* Testing Decode
* Testing
* Debug
* Fixes + Replaced Regex with BooleanMask
* Added Debug Lines
* Added Debug Line for .numpy()
* Testing Byte Tokenizer Approach
* Testing With Unicode_transcode
* Listing Methods of Object
* Testing _numpy
* Added Decode Call
* Checking Methods post _numpy()
* Removed Debug Statements and Improved Docstring
* Fixed Failing Test
* Ran format/lint
* Fixed Docstring and Improved Examples
* Ran format and lint
* Copy edits
* Copy edits

Co-authored-by: Matt Watson <[email protected]>
Fixes #78
Sorry for the late PR
I've added the Tokenizer as well as added some tests.
Based on the discussion with @mattdangerw, I've skipped custom padding tokens; but since I had already implemented custom input/output encoding types, I've retained that for now.
There's one issue I'm facing right now w.r.t. detokenization. When the input is something like `"▀▁▂▃"`, it gets tokenized to `[9600, 9601, 9602, 9603]`, but during detokenization it's not clear how I can bring back the original format: currently my detokenization outputs `b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83"`. One way to then decode this is to chain `.numpy()` and `.decode()` calls, but I'm not sure if this is the intended way.
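For reference, the round trip can be checked in plain Python without TensorFlow: the token IDs are Unicode codepoints, and the byte string is just the UTF-8 encoding of the same text. Only standard-library functions are used here; nothing below is tokenizer API:

```python
codepoints = [9600, 9601, 9602, 9603]

# Codepoints back to a Python string: 9600 is U+2580 (▀), and so on.
text = "".join(chr(c) for c in codepoints)
print(text)  # ▀▁▂▃

# The same string as UTF-8 bytes: the detokenizer's byte-string output.
print(text.encode("utf-8"))  # b'\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83'
```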