
Adding a UnicodeCharacterTokenizer #100
Merged · 42 commits · Apr 16, 2022

Conversation

aflah02 (Collaborator) commented Apr 8, 2022

Fixes #78
Sorry for the late PR.
I've added the tokenizer along with some tests.
Based on the discussion with @mattdangerw, I've skipped custom padding tokens, but since I had already implemented a custom input/output encoding type, I've retained that for now.
There's one issue I'm facing right now with detokenization:
When the input is something like "▀▁▂▃", it gets tokenized to [9600, 9601, 9602, 9603], but during detokenization it's not clear how to bring back the original format, as currently I detokenize like this and it outputs b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83". One way to then decode this is to do a .numpy() and .decode() call, but I'm not sure if this is the intended way.
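
A minimal sketch of the round trip described above, using raw tf.strings ops rather than the tokenizer API (the tokenizer is assumed here to wrap these ops):

```python
import tensorflow as tf

# Tokenize: map each character of the string to its Unicode code point.
codepoints = tf.strings.unicode_decode("▀▁▂▃", input_encoding="UTF-8")
print(codepoints)  # [9600 9601 9602 9603]

# Detokenize: re-encode the code points into a scalar string tensor.
text = tf.strings.unicode_encode(codepoints, output_encoding="UTF-8")
print(text)  # tf.Tensor(b'\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83', shape=(), dtype=string)
```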

mattdangerw (Member) commented Apr 11, 2022

@aflah02 thanks! Will take a look soon.

Re decoding, yeah the way to go from a tensor scalar string to a python unicode string is .numpy().decode('utf-8'). That's totally correct and you can use that for code examples/tests. We might want to provide some extra helper for this on the tokenizer base class (e.g. tokenizer.detokenize_as_string()) as I suspect this will be a common point of confusion, but I don't think we need to worry about that on this PR.
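
For illustration, a minimal sketch of that conversion (the hard-coded byte string below is just a stand-in for a detokenized output):

```python
import tensorflow as tf

# A scalar string tensor holding UTF-8 bytes, as detokenize would return.
detokenized = tf.constant(b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83")

# .numpy() pulls the bytes out of the tensor; .decode() turns them into a str.
print(detokenized.numpy().decode("utf-8"))  # ▀▁▂▃
```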

mattdangerw (Member) left a comment

Thanks! Looks great generally! Left a few comments.

Review threads on keras_nlp/tokenizers/unicode_character_tokenizer.py and keras_nlp/tokenizers/unicode_character_tokenizer_test.py (outdated, resolved).
mattdangerw (Member) left a comment

Few more comments. Looks pretty close!

Review threads on keras_nlp/tokenizers/unicode_character_tokenizer.py (outdated, resolved).
aflah02 (Collaborator, Author) commented Apr 13, 2022

@mattdangerw I've made most of the changes; however, I'm facing two issues:

  • lint keeps reporting that the imports are sorted incorrectly even after I run format.sh; the exact error is:
ERROR: /mnt/c/Users/ASUS/Desktop/keras-nlp/keras_nlp/tokenizers/unicode_character_tokenizer_test.py Imports are incorrectly sorted and/or formatted.
ERROR: /mnt/c/Users/ASUS/Desktop/keras-nlp/keras_nlp/tokenizers/__init__.py Imports are incorrectly sorted and/or formatted.
Skipped 1 files
  • I'm unable to figure out how to decode the inputs to UTF-8. I tried calling .numpy() and other variants, but it's not working: if I try something like a .numpy().decode() call after unicode_encode, there's an error that the object has no .numpy() attribute. Any clues as to what might be going wrong? I've tried several variations, but none seem to work.
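
One plausible cause (an assumption, not something confirmed in this thread) is that the failing call runs in graph mode, where tensors are symbolic and have no .numpy() method. A minimal sketch of the difference:

```python
import tensorflow as tf

codepoints = tf.constant([9600, 9601, 9602, 9603])

# Eager execution: the result is a concrete tensor, so .numpy() is available.
eager = tf.strings.unicode_encode(codepoints, output_encoding="UTF-8")
print(eager.numpy().decode("utf-8"))  # ▀▁▂▃

# Inside tf.function (graph mode) the same op returns a symbolic tensor,
# which does not support .numpy(); uncommenting the line below would fail.
@tf.function
def detokenize(inputs):
    encoded = tf.strings.unicode_encode(inputs, output_encoding="UTF-8")
    # encoded.numpy()  # symbolic tensors have no .numpy()
    return encoded

print(detokenize(codepoints).numpy().decode("utf-8"))  # ▀▁▂▃
```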

mattdangerw (Member) commented

@aflah02 thanks!

Re first issue, I think someone else was hitting that too. I will take a look later today.

Re second issue, do you have a code snippet that's not working? In terms of a utility to detokenize to python strings (for all tokenizers) let's do that on a separate issue. I'll open one today. Are you interested in taking that on?

aflah02 (Collaborator, Author) commented Apr 14, 2022

@mattdangerw
Cool. Yeah, sure, I can take that up. I'll share the code snippets I tried in a bit; my commits have them, but I'll just list out the main ones.

mattdangerw (Member) left a comment

LGTM pending this formatting issue, which I will take a look at later.

aflah02 (Collaborator, Author) commented Apr 14, 2022

@mattdangerw Thanks!
These are the snippets I've tried:

  • Doing a .numpy() followed by .decode('utf-8') before returning here
  • Since .numpy() raised errors, I tried listing all available methods on the object and spotted ._numpy, but that also raised issues
  • Using exactly what the byte tokenizer does; however, it uses a character list, which would be very large to always generate for Unicode characters during detokenization

So essentially the only remaining issue is that "▀▁▂▃" gets tokenized to [9600, 9601, 9602, 9603], but during detokenization it becomes b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83" and not "▀▁▂▃" again.

mattdangerw (Member) commented

@aflah02 I think your return types are actually correct. b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83" is a byte string representation for "▀▁▂▃" in a utf-8 encoding. And by default when you are working with tf.Tensor strings, you get byte strings with a utf-8 encoding. https://www.tensorflow.org/text/guide/unicode

So the issue is not that this representation of the return type is wrong; it's just that most users will eventually want to go from tf.Tensors to regular python strings, and it is confusing how to reverse this process.
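
A quick plain-Python illustration of that equivalence:

```python
# The detokenized byte string is just "▀▁▂▃" in its UTF-8 encoding.
assert "▀▁▂▃".encode("utf-8") == b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83"
assert b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83".decode("utf-8") == "▀▁▂▃"
```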

So that's where I was proposing we add a helper to all tokenizers, e.g. tokenizer.detokenize_to_strings() that will convert all tensors to lists of lists, and then convert all list elements to python strings. #113
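
A rough sketch of what such a helper might look like (the name detokenize_to_strings and the behavior are taken from this comment; the actual design is left to #113):

```python
import numpy as np
import tensorflow as tf

def detokenize_to_strings(tokenizer, inputs):
    """Hypothetical helper: detokenize, then convert everything to python strings."""
    outputs = tokenizer.detokenize(inputs)
    if isinstance(outputs, tf.RaggedTensor):
        outputs = outputs.to_list()      # nested lists of UTF-8 byte strings
    elif isinstance(outputs, tf.Tensor):
        outputs = outputs.numpy()        # bytes, or an ndarray of bytes
        if isinstance(outputs, np.ndarray):
            outputs = outputs.tolist()

    # Recursively decode byte strings into python strings.
    def decode(x):
        if isinstance(x, bytes):
            return x.decode("utf-8")
        return [decode(item) for item in x]

    return decode(outputs)
```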

aflah02 (Collaborator, Author) commented Apr 14, 2022

@mattdangerw Thanks! That makes total sense now

mattdangerw (Member) left a comment

The formatting issue should be fixed if you rebase onto the latest master.

Found a few more style nits, but the main one is that I think some of your usage examples now have incorrect output.

Review threads on keras_nlp/tokenizers/unicode_character_tokenizer.py and keras_nlp/tokenizers/unicode_character_tokenizer_test.py (outdated, resolved).
mattdangerw (Member) commented

Thanks! Merging. These build failures are unrelated and fixed up on #121

@mattdangerw mattdangerw merged commit a006232 into keras-team:master Apr 16, 2022
adhadse pushed a commit to adhadse/keras-nlp that referenced this pull request Sep 17, 2022
* Debugging
* Debugging
* Fixed Sequence Length Issue
* Sequence Length Changes
* Removed _ From Class Attributes
* Fixed Null Bytes in Detokenization
* Testing regex_replace
* Testing
* Helper Function and Debug Statements
* Testing Regex Replace New Ordering
* Added Checks for Errors and Normalization Form
* Doc String Completed
* Ran lint/format
* New Tests and Decoding Changes
* Changes
* Minor Tweak
* Tweaking Detokenizer
* Added Tests and Updated Docstrings
* Ran format.sh and lint.sh
* Refactoring and Removing Unused Lines
* Fixed Some Broken Tests
* Fixed All Tests
* Testing Decode
* Testing
* Debug
* Fixes + Replaced Regex with BooleanMask
* Added Debug Lines
* Added Debug Line for .numpy()
* Testing Byte Tokenizer Approach
* Testing With Unicode_transcode
* Listing Methods of Object
* Testing _numpy
* Added Decode Call
* Checking Methods post _numpy()
* Removed Debug Statements and Improved Docstring
* Fixed Failing Test
* Ran format/lint
* Fixed Docstring and Improved Examples
* Ran format and lint
* Copy edits
* Copy edits

Co-authored-by: Matt Watson <[email protected]>