BytePair Tokenizer Implementation #303
Conversation
Just some quick initial comments.
Have we validated this on a number of different languages (and weirder Unicode like control characters, emojis, etc.)? We should make sure this is not just equivalent in ASCII land.
class BytePairTokenizerCache:
    def __init__(self):
        self.key2id = tf.lookup.experimental.DenseHashTable(
Why DenseHashTable for one and MutableHashTable for the other? Does this make a performance difference?
This is mainly due to a limitation of these experimental hashtables: DenseHashTable is more efficient, but it can only map string to int and not vice versa. Similarly, the reason we needed two hashtables is that we cannot have a string-to-string mapping.
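For reference, a minimal sketch of the two-table workaround being described (the table parameters are illustrative, not the PR's exact code):

```python
import tensorflow as tf

# string -> int: DenseHashTable supports string keys, but values must be
# non-string, so the reverse direction needs a second table.
key2id = tf.lookup.experimental.DenseHashTable(
    key_dtype=tf.string,
    value_dtype=tf.int64,
    default_value=-1,
    empty_key="",
    deleted_key="$",
)

# int -> string: MutableHashTable covers the reverse lookup, since a direct
# string -> string table is not available.
id2value = tf.lookup.experimental.MutableHashTable(
    key_dtype=tf.int64,
    value_dtype=tf.string,
    default_value="",
)

key2id.insert(tf.constant(["fox"]), tf.constant([7], dtype=tf.int64))
id2value.insert(tf.constant([7], dtype=tf.int64), tf.constant(["f ox"]))
```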
Would it significantly degrade performance to use MutableHashTable in all cases? That would reduce our dependence on experimental features, and IIRC there may be plans to remove DenseHashTable from tf in the future.
    return bs, cs  # int to string mapping


class BytePairTokenizerCache:
It feels weird to have a cache with no size limit; that's not actually a cache.
Would this become a complete memory hog on a sufficiently large vocabulary? Does that come up in practice?
I think the lru_cache gpt2 uses does have a max size by default.
The lru_cache, at least in this OpenAI implementation, is only for the byte2unicode mapping, which is a fixed size. The Python dictionary used for the cache is unbounded. We could add a limit to the cache if that is better, but a true lru_cache would require reimplementing something like a MutableHashTable or DenseHashTable.
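For reference, a hedged sketch of the OpenAI-style layout being described (names simplified, not keras-nlp code): bytes_to_unicode is memoized with a bounded lru_cache, while per-word BPE results accumulate in a plain, unbounded dict:

```python
from functools import lru_cache


@lru_cache()  # bounded: maxsize defaults to 128, and the result is fixed-size anyway
def bytes_to_unicode():
    # 256-entry byte -> unicode-character mapping (details omitted)
    return {i: chr(i) for i in range(256)}


class SimpleBPE:
    def __init__(self):
        self.cache = {}  # word -> merged tokens; grows without bound

    def bpe(self, word):
        if word in self.cache:
            return self.cache[word]
        merged = word  # placeholder for the real pairwise merge loop
        self.cache[word] = merged
        return merged
```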
Ohh got it, thanks! Yeah if the original implementation also has an unbounded cache, that makes me more comfortable.
from keras_nlp.tokenizers.byte_pair_tokenizer import BytePairTokenizer


class BytePairTokenizerTest(tf.test.TestCase):
Let's add a little more coverage for some key use cases (see the sketch after this list):
- A batched tf dataset
- An unbatched tf dataset
- A function annotated with @tf.function
- Some more complex unicode character cases (maybe we can validate these first with the original tokenizer impl to make sure we have it right)
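A rough sketch of the first three cases, using tf.strings.split as a stand-in tokenizer so the snippet runs standalone (the real tests would use a configured BytePairTokenizer):

```python
import tensorflow as tf


class TokenizerCoverageTest(tf.test.TestCase):
    def setUp(self):
        super().setUp()
        # Stand-in for a configured BytePairTokenizer (an assumption,
        # made so this sketch is self-contained).
        self.tokenizer = tf.strings.split

    def test_batched_dataset(self):
        ds = tf.data.Dataset.from_tensor_slices(["brown fox", "lazy dog"])
        ds = ds.batch(2).map(self.tokenizer)  # tokenize inside the pipeline
        next(iter(ds))

    def test_unbatched_dataset(self):
        ds = tf.data.Dataset.from_tensor_slices(["brown fox", "lazy dog"])
        ds = ds.map(self.tokenizer)
        next(iter(ds))

    def test_inside_tf_function(self):
        @tf.function
        def tokenize(x):
            return self.tokenizer(x)  # must trace cleanly in graph mode

        tokenize(tf.constant(["brown fox"]))


if __name__ == "__main__":
    tf.test.main()
```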
class BytePairTokenizerCache:
    def __init__(self):
        self.key2id = tf.lookup.experimental.DenseHashTable(
Forgot to leave a comment here, but from conversations with @jessechancy: for the cache, we want a way to go from a string word input to a token list output. We are currently doing that with two different hashtables (string -> int and int -> string) to work around the fact that tf does not offer a string -> string lookup. I think we could have a slightly simpler workaround, where we hash the input string and then do an int -> string lookup to get the tokenized form. That should save us one of these hashtables, which should make things simpler, faster, and lower memory usage.
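A minimal sketch of that idea, assuming tf.strings.to_hash_bucket_fast as the hash function (a design detail not specified in the comment, and collisions would need thought):

```python
import tensorflow as tf

NUM_BUCKETS = 2**40  # large keyspace to keep collisions unlikely

# A single int -> string table replaces the string->int / int->string pair.
cache = tf.lookup.experimental.MutableHashTable(
    key_dtype=tf.int64, value_dtype=tf.string, default_value=""
)


def cache_insert(words, tokenized):
    # Hash each word string straight to an int64 key...
    keys = tf.strings.to_hash_bucket_fast(words, NUM_BUCKETS)
    cache.insert(keys, tokenized)


def cache_lookup(words):
    # ...so lookups only ever touch the one int -> string table.
    keys = tf.strings.to_hash_bucket_fast(words, NUM_BUCKETS)
    return cache.lookup(keys)


cache_insert(tf.constant(["fox"]), tf.constant(["f ox"]))
print(cache_lookup(tf.constant(["fox"])))  # b'f ox'
```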
Another issue we need to work through is that python regex and tf regex appear to handle certain whitespace characters (non-breaking spaces) differently. We need to fix this, probably with some regex hacking.
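A quick illustration of the mismatch (my example, not from the PR): Python's re treats a non-breaking space as \s, while TF's RE2-based string ops treat \s as ASCII-only, so Unicode space separators have to be matched explicitly:

```python
import re

import tensorflow as tf

text = "a\u00a0b"  # non-breaking space (U+00A0) between the letters

# Python's re: \s matches the NBSP.
print(re.sub(r"\s", "_", text))  # a_b

# TF regex (RE2): \s is ASCII-only, so the NBSP passes through untouched.
print(tf.strings.regex_replace(text, r"\s", "_").numpy())  # b'a\xc2\xa0b'

# One possible fix: also match Unicode space separators with \p{Zs}.
print(tf.strings.regex_replace(text, r"[\s\p{Zs}]", "_").numpy())  # b'a_b'
```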
Very impressive! Some minor stylistic comments.
Right now this code is a bit dense and lightly documented. Since we're doing a lot of heavy lifting here, it would be nice to lay out the steps or organize the code in a way that someone could understand the gist without reading every line.
def create_static_hashtable(keys, values, default):
    hashtable = tf.lookup.StaticHashTable(
Return on this line directly, rather than assigning to an intermediate variable (sketch below).
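i.e., something like this (assuming a KeyValueTensorInitializer, which isn't shown in the quoted diff):

```python
import tensorflow as tf


def create_static_hashtable(keys, values, default):
    # Build and return the table in one expression; no temporary needed.
    return tf.lookup.StaticHashTable(
        tf.lookup.KeyValueTensorInitializer(keys, values),
        default_value=default,
    )
```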
        merges,
        sequence_length: int = None,
        **kwargs,
    ) -> None:
Type hints seem to be unpopular in Keras, so drop them for consistency.
    return hashtable


class BytePairTokenizer(tokenizer.Tokenizer):
Need a docstring
    @tf.function
    def _byte_pair_merge_loop_body(self, words, mask):
        """Iterative merging process for byte pair encoding algorithm."""
Empty line after docstring
@mattdangerw Do we still have unresolved issues with the functionality of this implementation? I played around with it a bit more and the functionality looks correct to me (compared with RoBERTa's tokenizer). I remember you mentioned some special tokens are handled differently? One concern I have is that the implementation is quite complex, but after going through the code I don't know where we can really simplify it. Compared to the fairseq implementation, the additional complexity mainly comes from the tensor manipulation when we do merge/check/etc., and from how hashing and the while-loop are handled. We have two approaches here:
@chenmoneygithub left a few brief comments above in that regard. The issue with different output is apparently with non-breaking space characters. And there are some things I would like to try re simplifying, in particular removing one of the hashtables by using a hash function (#303 (comment)). Overall I think we should probably go with this and not the python implementation. IIUC the py_function escape hatch to …
@mattdangerw Simply wrapping by … Do you remember what output diff you saw earlier? Jesse found one minor diff before his presentation, but he said he had fixed it.
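For context, the py_function escape hatch under discussion would look roughly like this (a sketch, not code from the PR); the catch is that the Python body runs eagerly as an opaque op, so it won't survive SavedModel export to a Python-free runtime:

```python
import tensorflow as tf


def _bpe_merge_py(word):
    # Stand-in for a pure-Python BPE merge on a single word.
    return word


@tf.function
def tokenize(word):
    # tf.py_function embeds eager Python into the graph.
    return tf.py_function(_bpe_merge_py, [word], Tout=tf.string)


print(tokenize(tf.constant("brown fox")))
```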
@mattdangerw, @chenmoneygithub - a minor comment here. So, we should probably ignore the first line of the text file after https://github.com/keras-team/keras-nlp/pull/303/files#diff-9f7f9b8a01fa1e5d050c27b1dcfdb801ec24d46f724a2eef011c8f86c9bed53aR142, right?
@abheesht17 Thanks for raising it! It's actually fine, because …
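For illustration, the kind of first-line guard being discussed (a hypothetical sketch; whether the files used here actually carry the header is exactly the question above):

```python
# GPT-2 style merges files often begin with a version comment,
# e.g. "#version: 0.2", before the actual merge rules.
with open("merges.txt", encoding="utf-8") as f:
    merges = f.read().splitlines()

if merges and merges[0].startswith("#"):
    merges = merges[1:]  # skip the header line
```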
Closing; this was landed with some edits in #389 by @chenmoneygithub. But huge props on writing this, this is a big deal for the library!!
An implementation of OpenAI's BytePair encoder in TF-compatible graph mode, which would allow for the e2e development of certain pretrained models that use this tokenizer (RoBERTa, GPT, etc.).
Currently a rough version; suggestions are welcome!
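A hedged sketch of the intended usage, with the constructor arguments inferred from the diff above (merges, optional sequence_length) plus an assumed vocabulary argument, so treat the exact signature as illustrative:

```python
import tensorflow as tf

from keras_nlp.tokenizers.byte_pair_tokenizer import BytePairTokenizer

tokenizer = BytePairTokenizer(
    vocabulary="vocab.json",  # assumed: token -> id mapping (path or dict)
    merges="merges.txt",      # merge rules, one pair per line
    sequence_length=64,       # pad/truncate output to a fixed length
)
token_ids = tokenizer(tf.constant(["The quick brown fox."]))
```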