
Add FalconTokenizer #1485

Merged · 6 commits merged into keras-team:master on Mar 8, 2024

Conversation

SamanehSaadat (Member)

This PR is part of addressing #1372. It adds a tokenizer and a preset for the Falcon model and updates the conversion scripts to save a full preset directory.

Falcon Tokenizer and Weight Conversion Verification Colab

@mattdangerw (Member) left a comment

Looks good! Just a few comments.

Resolved review threads on:
keras_nlp/models/falcon/falcon_presets.py
keras_nlp/models/falcon/falcon_tokenizer.py

(The following thread is anchored at the end of the class docstring, just above `def __init__`.)
mattdangerw (Member)

General question, but what is up with the 7b tokenizer? It looks like it is still basically just BPE, but with extra special tokens? https://huggingface.co/tiiuae/falcon-7b-instruct/raw/main/tokenizer.json

Maybe we can pull the vocab and merges out of this json, so we can handle them normally, and tackle the rest of the weirdness in code?

SamanehSaadat (Member, Author)

I think it's not just special token differences. They have different vocab sizes: 1b has 50256 tokens in total while 7b has 65023 (there are only 10 extra special tokens in 7b).

mattdangerw (Member) · Mar 6, 2024

Yeah, new tokenizer vocab for sure, but I was hoping we could avoid a whole new file format.

It seems like they are still fundamentally BPE, with a different vocab and more special tokens right? If we can still save this as a tokenizer.json + assets/tokenizer/merges.txt + assets/tokenizer/vocab.json that seems ideal to me. But we could also write a custom loader for Falcon's bespoke tokenizer.json format if we think that's better.

SamanehSaadat (Member, Author)

I misunderstood what you said.

Do you mean that since they are basically using BPE, we can skip creating FalconTokenizer and use BPE directly?

> I was hoping we could avoid a whole new file format.

Could you explain what you mean by "a new file format"? What's the new file format I'm creating? :D

mattdangerw (Member)

Sorry! I am being unclear. Everything you have looks good for the 1b model. I am asking about/trying to think through an upcoming problem with the 7b falcon models.

Take a look at:

The 7b tokenizer assets are different: there is no merges.txt or vocab.json, just one weird tokenizer.json that combines the two. We don't have any code that can read that bespoke tokenizer.json today, so we could either extract the merges and vocab from it, support loading it directly with new parsing code, or something else.
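If it helps, a rough sketch of the extraction route might look like the following. This is only a sketch: it assumes the usual Hugging Face tokenizer.json layout, with the BPE vocab and merges nested under the top-level "model" key.

```python
# Sketch: split the combined Falcon-7B tokenizer.json into the vocab.json +
# merges.txt assets that the existing BPE tokenizer code already understands.
# Assumes the standard Hugging Face tokenizers layout: "model" -> "vocab"/"merges".
import json
import os

with open("tokenizer.json", "r") as f:
    hf_tokenizer = json.load(f)

vocab = hf_tokenizer["model"]["vocab"]  # dict: token string -> token id
merges = [
    # Merges are usually "left right" strings; newer dumps may store [left, right] pairs.
    m if isinstance(m, str) else " ".join(m)
    for m in hf_tokenizer["model"]["merges"]
]

os.makedirs("assets/tokenizer", exist_ok=True)
with open("assets/tokenizer/vocab.json", "w") as f:
    json.dump(vocab, f)
with open("assets/tokenizer/merges.txt", "w") as f:
    f.write("\n".join(merges))
```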

Does that clarify or not really?

mattdangerw (Member)

And to be clear, we should have a FalconTokenizer for sure. The question I have is just whether it can be a "simple subclass" of the BytePairEncoding tokenizer, or whether we need custom json parsing code after we also convert the 7b models.
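For concreteness, a "simple subclass" could look roughly like the sketch below. This is only an illustration modeled on the other KerasNLP BPE-based tokenizers, not the merged implementation; the argument names (vocabulary, merges, unsplittable_tokens) are assumed from BytePairTokenizer's interface.

```python
# Illustrative sketch of FalconTokenizer as a thin subclass of the existing
# byte-pair tokenizer; not the merged implementation.
from keras_nlp.tokenizers import BytePairTokenizer


class FalconTokenizer(BytePairTokenizer):
    """Falcon tokenizer: plain BPE plus the `<|endoftext|>` special token."""

    def __init__(self, vocabulary=None, merges=None, **kwargs):
        self.end_token = "<|endoftext|>"
        super().__init__(
            vocabulary=vocabulary,
            merges=merges,
            # Keep the end token from being split apart during tokenization.
            unsplittable_tokens=[self.end_token],
            **kwargs,
        )
```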

SamanehSaadat (Member, Author)

Oh, I see! Thanks for the clarification!

I agree that it would be better to just extract the vocab and merges from their format and load them like other models, since no other model uses this format.

Contributor

I'm not sure I understand the discussion correctly, but I believe it's about the special tokens that the 7B model has and how we can save them properly.
If that's the case, we can add a special_tokens argument to FalconTokenizer, something like WhisperTokenizer, except that it can be just a list because the tokens are already included in the vocabulary; during initialization, we pass them along with <|endoftext|> in the unsplittable_tokens argument of the superclass.
As for the conversion script, while converting the tokenizer we can check the hf_tokenizer["added_tokens"] list for any added tokens other than <|endoftext|> and pass them to FalconTokenizer as special_tokens. Then, of course, we need to update the config to contain the special tokens. That way we would still have only tokenizer.json + assets/tokenizer/merges.txt + assets/tokenizer/vocab.json, but with the config in tokenizer.json carrying the list of special_tokens.
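A rough sketch of that conversion-side check is shown below. The special_tokens argument is the proposal above, not an existing FalconTokenizer parameter, and the "added_tokens" entries are assumed to carry their string under "content", per the usual Hugging Face tokenizer.json layout.

```python
# Sketch: collect extra special tokens from a Hugging Face tokenizer.json so the
# conversion script can hand them to the (proposed) special_tokens argument.
import json

END_TOKEN = "<|endoftext|>"


def extract_special_tokens(hf_tokenizer_path):
    """Return added tokens other than the end token from a HF tokenizer.json."""
    with open(hf_tokenizer_path, "r") as f:
        hf_tokenizer = json.load(f)
    return [
        added["content"]
        for added in hf_tokenizer.get("added_tokens", [])
        if added["content"] != END_TOKEN
    ]


# In the conversion script, this could be used roughly as:
#   tokenizer = FalconTokenizer(
#       vocabulary=vocab,
#       merges=merges,
#       special_tokens=extract_special_tokens("tokenizer.json"),
#   )
# with FalconTokenizer forwarding special_tokens + [END_TOKEN] to the
# superclass's unsplittable_tokens argument, as described above.
```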

@sampathweb added the kokoro:force-run (Runs Tests on GPU) label on Mar 6, 2024
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on Mar 6, 2024
@mattdangerw (Member) left a comment

Looks great!

@SamanehSaadat (Member, Author)

@mattdangerw Thanks for the review!

@SamanehSaadat merged commit 7ef18a1 into keras-team:master on Mar 8, 2024
10 checks passed
@SamanehSaadat deleted the falcon-tokenizer branch on March 8, 2024 at 00:19
abuelnasr0 pushed a commit to abuelnasr0/keras-nlp that referenced this pull request Apr 2, 2024
* Add FalconTokenizer.

* Update checkpoint conversion script.

* Address reviews.