
Add FalconTokenizer #1485

Merged · 6 commits merged into keras-team:master on Mar 8, 2024

Conversation

SamanehSaadat (Member)

This PR is part of addressing #1372. It adds a tokenizer and a preset for the Falcon model and updates the conversion scripts to save a full preset directory.

Falcon Tokenizer and Weight Conversion Verification Colab

@mattdangerw (Member) left a comment

Looks good! Just a few comments.

Resolved review threads on:
keras_nlp/models/falcon/falcon_presets.py
keras_nlp/models/falcon/falcon_tokenizer.py

(The following thread is anchored at the end of the class docstring, just above `def __init__`.)
mattdangerw (Member)

General question, but what is up with the 7b tokenizer? It looks like it is still basically just BPE, but with extra special tokens? https://huggingface.co/tiiuae/falcon-7b-instruct/raw/main/tokenizer.json

Maybe we can pull the vocab and merges out of this json, so we can handle them normally, and tackle the rest of the weirdness in code?

SamanehSaadat (Member, Author)

I think it's not just special token differences. They have different vocab sizes: 1b has 50256 tokens in total while 7b has 65023 (there are only 10 extra special tokens in 7b).

mattdangerw (Member) · Mar 6, 2024

Yeah, new tokenizer vocab for sure, but I was hoping we could avoid a whole new file format.

It seems like they are still fundamentally BPE, with a different vocab and more special tokens right? If we can still save this as a tokenizer.json + assets/tokenizer/merges.txt + assets/tokenizer/vocab.json that seems ideal to me. But we could also write a custom loader for Falcon's bespoke tokenizer.json format if we think that's better.

SamanehSaadat (Member, Author)

I misunderstood what you said.

Do you mean that since they are basically using BPE, we can skip creating FalconTokenizer and use BPE directly?

> I was hoping we could avoid a whole new file format.

Could you explain what you mean by "a new file format"? What's the new file format I'm creating? :D

mattdangerw (Member)

Sorry! I am being unclear. Everything you have looks good for the 1b model. I am asking about/trying to think through an upcoming problem with the 7b falcon models.

Take a look at:

The 7b tokenizer assets are different: there is no merges.txt or vocab.json, just one weird tokenizer.json that combines the two. We don't have any code that can read that bespoke tokenizer.json today, so we could either extract the merges and vocab from it, support loading it directly with new parsing code, or something else.
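If it helps, a rough sketch of the extraction route might look like the following. This is only a sketch: it assumes the usual Hugging Face tokenizer.json layout, with the BPE vocab and merges nested under the top-level "model" key.

```python
# Sketch: split the combined Falcon-7B tokenizer.json into the vocab.json +
# merges.txt assets that the existing BPE tokenizer code already understands.
# Assumes the standard Hugging Face tokenizers layout: "model" -> "vocab"/"merges".
import json
import os

with open("tokenizer.json", "r") as f:
    hf_tokenizer = json.load(f)

vocab = hf_tokenizer["model"]["vocab"]  # dict: token string -> token id
merges = [
    # Merges are usually "left right" strings; newer dumps may store [left, right] pairs.
    m if isinstance(m, str) else " ".join(m)
    for m in hf_tokenizer["model"]["merges"]
]

os.makedirs("assets/tokenizer", exist_ok=True)
with open("assets/tokenizer/vocab.json", "w") as f:
    json.dump(vocab, f)
with open("assets/tokenizer/merges.txt", "w") as f:
    f.write("\n".join(merges))
```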

Does that clarify or not really?

mattdangerw (Member)

And to be clear, we should have a FalconTokenizer for sure. The question I have is just whether it can be a "simple subclass" of the BytePairEncoding tokenizer, or whether we need custom json parsing code after we also convert the 7b models.
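For concreteness, a "simple subclass" could look roughly like the sketch below. This is only an illustration modeled on the other KerasNLP BPE-based tokenizers, not the merged implementation; the argument names (vocabulary, merges, unsplittable_tokens) are assumed from BytePairTokenizer's interface.

```python
# Illustrative sketch of FalconTokenizer as a thin subclass of the existing
# byte-pair tokenizer; not the merged implementation.
from keras_nlp.tokenizers import BytePairTokenizer


class FalconTokenizer(BytePairTokenizer):
    """Falcon tokenizer: plain BPE plus the `<|endoftext|>` special token."""

    def __init__(self, vocabulary=None, merges=None, **kwargs):
        self.end_token = "<|endoftext|>"
        super().__init__(
            vocabulary=vocabulary,
            merges=merges,
            # Keep the end token from being split apart during tokenization.
            unsplittable_tokens=[self.end_token],
            **kwargs,
        )
```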

SamanehSaadat (Member, Author)

Oh, I see! Thanks for the clarification!

I agree that it would be better to just extract the vocab and merges from their format and load them like other models, since no other model uses this format.

Contributor

I'm not sure I understand the discussion correctly, but I believe it's about the special tokens that the 7B model has and how we can save them properly.
If that's the case, we can add a special_tokens argument to FalconTokenizer, something like WhisperTokenizer, except that it can be just a list because the tokens are already included in the vocabulary; during initialization, we pass them along with <|endoftext|> in the unsplittable_tokens argument of the superclass.
As for the conversion script, while converting the tokenizer we can check the hf_tokenizer["added_tokens"] list for any added tokens other than <|endoftext|> and pass them to FalconTokenizer as special_tokens. Then, of course, we need to update the config to contain the special tokens. That way we would still have only tokenizer.json + assets/tokenizer/merges.txt + assets/tokenizer/vocab.json, but with the config in tokenizer.json carrying the list of special_tokens.
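A rough sketch of that conversion-side check is shown below. The special_tokens argument is the proposal above, not an existing FalconTokenizer parameter, and the "added_tokens" entries are assumed to carry their string under "content", per the usual Hugging Face tokenizer.json layout.

```python
# Sketch: collect extra special tokens from a Hugging Face tokenizer.json so the
# conversion script can hand them to the (proposed) special_tokens argument.
import json

END_TOKEN = "<|endoftext|>"


def extract_special_tokens(hf_tokenizer_path):
    """Return added tokens other than the end token from a HF tokenizer.json."""
    with open(hf_tokenizer_path, "r") as f:
        hf_tokenizer = json.load(f)
    return [
        added["content"]
        for added in hf_tokenizer.get("added_tokens", [])
        if added["content"] != END_TOKEN
    ]


# In the conversion script, this could be used roughly as:
#   tokenizer = FalconTokenizer(
#       vocabulary=vocab,
#       merges=merges,
#       special_tokens=extract_special_tokens("tokenizer.json"),
#   )
# with FalconTokenizer forwarding special_tokens + [END_TOKEN] to the
# superclass's unsplittable_tokens argument, as described above.
```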

@sampathweb added the kokoro:force-run (Runs Tests on GPU) label on Mar 6, 2024
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on Mar 6, 2024
@mattdangerw (Member) left a comment

Looks great!

@SamanehSaadat (Member, Author)

@mattdangerw Thanks for the review!

@SamanehSaadat merged commit 7ef18a1 into keras-team:master on Mar 8, 2024
10 checks passed
@SamanehSaadat deleted the falcon-tokenizer branch on March 8, 2024 at 00:19
abuelnasr0 pushed a commit to abuelnasr0/keras-nlp that referenced this pull request Apr 2, 2024
* Add FalconTokenizer.

* Update checkpoint conversion script.

* Address reviews.