
Adding a UnicodeCharacterTokenizer #100
Merged · 42 commits · Apr 16, 2022

Conversation

aflah02 (Collaborator) commented Apr 8, 2022

Fixes #78
Sorry for the late PR.
I've added the tokenizer along with some tests.
Based on the discussion with @mattdangerw, I've skipped custom padding tokens, but since I had already implemented a custom input/output encoding type, I've retained that for now.
There's one issue I'm facing right now with detokenization:
When the input is something like "▀▁▂▃", it gets tokenized to [9600, 9601, 9602, 9603], but during detokenization it's not clear how to bring back the original format, as currently I detokenize like this and it outputs b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83". One way to then decode this is to do a .numpy() and .decode() call, but I'm not sure if this is the intended way.
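
A minimal sketch of the round trip described above, using raw tf.strings ops rather than the tokenizer API (the tokenizer is assumed here to wrap these ops):

```python
import tensorflow as tf

# Tokenize: map each character of the string to its Unicode code point.
codepoints = tf.strings.unicode_decode("▀▁▂▃", input_encoding="UTF-8")
print(codepoints)  # [9600 9601 9602 9603]

# Detokenize: re-encode the code points into a scalar string tensor.
text = tf.strings.unicode_encode(codepoints, output_encoding="UTF-8")
print(text)  # tf.Tensor(b'\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83', shape=(), dtype=string)
```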

mattdangerw (Member) commented Apr 11, 2022

@aflah02 thanks! Will take a look soon.

Re decoding, yeah the way to go from a tensor scalar string to a python unicode string is .numpy().decode('utf-8'). That's totally correct and you can use that for code examples/tests. We might want to provide some extra helper for this on the tokenizer base class (e.g. tokenizer.detokenize_as_string()) as I suspect this will be a common point of confusion, but I don't think we need to worry about that on this PR.
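
For illustration, a minimal sketch of that conversion (the hard-coded byte string below is just a stand-in for a detokenized output):

```python
import tensorflow as tf

# A scalar string tensor holding UTF-8 bytes, as detokenize would return.
detokenized = tf.constant(b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83")

# .numpy() pulls the bytes out of the tensor; .decode() turns them into a str.
print(detokenized.numpy().decode("utf-8"))  # ▀▁▂▃
```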

mattdangerw (Member) left a comment

Thanks! Looks great generally! Left a few comments.

Review threads on keras_nlp/tokenizers/unicode_character_tokenizer.py and keras_nlp/tokenizers/unicode_character_tokenizer_test.py (outdated, resolved).
mattdangerw (Member) left a comment

Few more comments. Looks pretty close!

Review threads on keras_nlp/tokenizers/unicode_character_tokenizer.py (outdated, resolved).
aflah02 (Collaborator, Author) commented Apr 13, 2022

@mattdangerw I've made most of the changes; however, I'm facing two issues:

  • lint keeps reporting that the imports are sorted incorrectly even after I run format.sh; the exact error is:
ERROR: /mnt/c/Users/ASUS/Desktop/keras-nlp/keras_nlp/tokenizers/unicode_character_tokenizer_test.py Imports are incorrectly sorted and/or formatted.
ERROR: /mnt/c/Users/ASUS/Desktop/keras-nlp/keras_nlp/tokenizers/__init__.py Imports are incorrectly sorted and/or formatted.
Skipped 1 files
  • I'm unable to figure out how to decode the inputs to UTF-8. I tried calling .numpy() and other variants, but it's not working: if I try something like a .numpy().decode() call after unicode_encode, there's an error that the object has no .numpy() attribute. Any clues as to what might be going wrong? I've tried several variations, but none seem to work.
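
One plausible cause (an assumption, not something confirmed in this thread) is that the failing call runs in graph mode, where tensors are symbolic and have no .numpy() method. A minimal sketch of the difference:

```python
import tensorflow as tf

codepoints = tf.constant([9600, 9601, 9602, 9603])

# Eager execution: the result is a concrete tensor, so .numpy() is available.
eager = tf.strings.unicode_encode(codepoints, output_encoding="UTF-8")
print(eager.numpy().decode("utf-8"))  # ▀▁▂▃

# Inside tf.function (graph mode) the same op returns a symbolic tensor,
# which does not support .numpy(); uncommenting the line below would fail.
@tf.function
def detokenize(inputs):
    encoded = tf.strings.unicode_encode(inputs, output_encoding="UTF-8")
    # encoded.numpy()  # symbolic tensors have no .numpy()
    return encoded

print(detokenize(codepoints).numpy().decode("utf-8"))  # ▀▁▂▃
```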

mattdangerw (Member) commented

@aflah02 thanks!

Re first issue, I think someone else was hitting that too. I will take a look later today.

Re second issue, do you have a code snippet that's not working? In terms of a utility to detokenize to python strings (for all tokenizers) let's do that on a separate issue. I'll open one today. Are you interested in taking that on?

aflah02 (Collaborator, Author) commented Apr 14, 2022

@mattdangerw
Cool. Yeah, sure, I can take that up. I'll share the code snippets I tried in a bit; my commits have them, but I'll just list out the main ones.

mattdangerw (Member) left a comment

LGTM pending this formatting issue, which I will take a look at later.

aflah02 (Collaborator, Author) commented Apr 14, 2022

@mattdangerw Thanks!
These are the snippets I've tried:

  • Doing a .numpy() followed by .decode('utf-8') before returning here
  • Since .numpy() raised errors, I tried listing all available methods on the object and spotted ._numpy, but that also raised issues
  • Using exactly what the byte tokenizer does; however, it uses a character list, which would be very large to always generate for Unicode characters during detokenization

So essentially the only remaining issue is that "▀▁▂▃" gets tokenized to [9600, 9601, 9602, 9603], but during detokenization it becomes b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83" and not "▀▁▂▃" again.

mattdangerw (Member) commented

@aflah02 I think your return types are actually correct. b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83" is a byte string representation for "▀▁▂▃" in a utf-8 encoding. And by default when you are working with tf.Tensor strings, you get byte strings with a utf-8 encoding. https://www.tensorflow.org/text/guide/unicode

So the issue is not that this representation of the return type is wrong; it's just that most users will eventually want to go from tf.Tensors to regular python strings, and it is confusing how to reverse this process.
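
A quick plain-Python illustration of that equivalence:

```python
# The detokenized byte string is just "▀▁▂▃" in its UTF-8 encoding.
assert "▀▁▂▃".encode("utf-8") == b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83"
assert b"\xe2\x96\x80\xe2\x96\x81\xe2\x96\x82\xe2\x96\x83".decode("utf-8") == "▀▁▂▃"
```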

So that's where I was proposing we add a helper to all tokenizers, e.g. tokenizer.detokenize_to_strings() that will convert all tensors to lists of lists, and then convert all list elements to python strings. #113
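
A rough sketch of what such a helper might look like (the name detokenize_to_strings and the behavior are taken from this comment; the actual design is left to #113):

```python
import numpy as np
import tensorflow as tf

def detokenize_to_strings(tokenizer, inputs):
    """Hypothetical helper: detokenize, then convert everything to python strings."""
    outputs = tokenizer.detokenize(inputs)
    if isinstance(outputs, tf.RaggedTensor):
        outputs = outputs.to_list()      # nested lists of UTF-8 byte strings
    elif isinstance(outputs, tf.Tensor):
        outputs = outputs.numpy()        # bytes, or an ndarray of bytes
        if isinstance(outputs, np.ndarray):
            outputs = outputs.tolist()

    # Recursively decode byte strings into python strings.
    def decode(x):
        if isinstance(x, bytes):
            return x.decode("utf-8")
        return [decode(item) for item in x]

    return decode(outputs)
```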

aflah02 (Collaborator, Author) commented Apr 14, 2022

@mattdangerw Thanks! That makes total sense now

mattdangerw (Member) left a comment

The formatting issue should be fixed if you rebase onto the latest master.

Found a few more style nits, but the main one is that I think some of your usage examples now have incorrect output.

Review threads on keras_nlp/tokenizers/unicode_character_tokenizer.py and keras_nlp/tokenizers/unicode_character_tokenizer_test.py (outdated, resolved).
mattdangerw (Member) commented

Thanks! Merging. These build failures are unrelated and fixed up on #121

@mattdangerw mattdangerw merged commit a006232 into keras-team:master Apr 16, 2022
adhadse pushed a commit to adhadse/keras-nlp that referenced this pull request Sep 17, 2022
* Debugging
* Debugging
* Fixed Sequence Length Issue
* Sequence Length Changes
* Removed _ From Class Attributes
* Fixed Null Bytes in Detokenization
* Testing regex_replace
* Testing
* Helper Function and Debug Statements
* Testing Regex Replace New Ordering
* Added Checks for Errors and Normalization Form
* Doc String Completed
* Ran lint/format
* New Tests and Decoding Changes
* Changes
* Minor Tweak
* Tweaking Detokenizer
* Added Tests and Updated Docstrings
* Ran format.sh and lint.sh
* Refactoring and Removing Unused Lines
* Fixed Some Broken Tests
* Fixed All Tests
* Testing Decode
* Testing
* Debug
* Fixes + Replaced Regex with BooleanMask
* Added Debug Lines
* Added Debug Line for .numpy()
* Testing Byte Tokenizer Approach
* Testing With Unicode_transcode
* Listing Methods of Object
* Testing _numpy
* Added Decode Call
* Checking Methods post _numpy()
* Removed Debug Statements and Improved Docstring
* Fixed Failing Test
* Ran format/lint
* Fixed Docstring and Improved Examples
* Ran format and lint
* Copy edits
* Copy edits

Co-authored-by: Matt Watson <[email protected]>