Add FalconTokenizer #1485
Conversation
Looks good! Just a few comments.
General question, but what is up with the 7b tokenizer? It looks like it is still basically just BPE, but with extra special tokens? https://huggingface.co/tiiuae/falcon-7b-instruct/raw/main/tokenizer.json
Maybe we can pull the vocab and merges out of this json, so we can handle them normally, and tackle the rest of the weirdness in code?
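A rough sketch of what "pulling the vocab and merges out" could look like, assuming the standard Hugging Face BPE `tokenizer.json` layout (`model.vocab` as a token-to-id dict and `model.merges` as a list of merge strings); the file paths here are placeholders, not the project's actual conversion code:

```python
import json

# Sketch only: extract vocab/merges from a Hugging Face BPE tokenizer.json.
# Assumes the standard layout with "model" -> "vocab" and "model" -> "merges",
# and that merges are stored as "tokenA tokenB" strings.
with open("tokenizer.json") as f:
    hf_tokenizer = json.load(f)

vocab = hf_tokenizer["model"]["vocab"]    # dict: token -> id
merges = hf_tokenizer["model"]["merges"]  # list of "tokenA tokenB" strings

with open("assets/tokenizer/vocab.json", "w") as f:
    json.dump(vocab, f)

with open("assets/tokenizer/merges.txt", "w") as f:
    f.write("\n".join(merges))
```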
I think it's not just a special token difference. They have different vocab sizes: 1b has 50256 tokens in total, while 7b has 65023 (there are only 10 extra special tokens in 7b).
Yeah, new tokenizer vocab for sure, but I was hoping we could avoid a whole new file format.
It seems like they are still fundamentally BPE, just with a different vocab and more special tokens, right? If we can still save this as a `tokenizer.json` + `assets/tokenizer/merges.txt` + `assets/tokenizer/vocab.json`, that seems ideal to me. But we could also write a custom loader for Falcon's bespoke `tokenizer.json` format if we think that's better.
I misunderstood what you said. Do you mean that since they are basically using BPE, we can skip creating `FalconTokenizer` and use BPE directly?

> I was hoping we could avoid a whole new file format.

Could you explain what you mean by "a new file format"? What's the new file format I'm creating? :D
Sorry! I am being unclear. Everything you have looks good for the 1b model. I am asking about/trying to think through an upcoming problem with the 7b falcon models.
Take a look at:
- https://huggingface.co/tiiuae/falcon-rw-1b/tree/main
- https://huggingface.co/tiiuae/falcon-7b/tree/main
The 7b tokenizer assets are different: there is no merges.txt or vocab.json, just one weird `tokenizer.json` that combines the two. We don't have any code that can read that bespoke `tokenizer.json` today, so we could either extract a merges and vocab from it, support loading it directly with new parsing code, or something else.
Does that clarify or not really?
And to be clear, we should have a `FalconTokenizer` for sure. The question I have is just whether it can be a "simple subclass" of the BytePairEncoding tokenizer, or whether we need custom json parsing code after we also convert the 7b models.
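For concreteness, here is a rough sketch of what such a "simple subclass" could look like, assuming KerasNLP's `BytePairTokenizer` and its `unsplittable_tokens` argument do the heavy lifting; this is an illustration of the idea, not the final implementation:

```python
import keras_nlp

# Sketch only: a FalconTokenizer that simply subclasses KerasNLP's
# BytePairTokenizer and marks "<|endoftext|>" as unsplittable, so it
# reuses the normal vocab.json + merges.txt assets.
class FalconTokenizer(keras_nlp.tokenizers.BytePairTokenizer):
    def __init__(self, vocabulary=None, merges=None, **kwargs):
        self.end_token = "<|endoftext|>"
        super().__init__(
            vocabulary=vocabulary,
            merges=merges,
            unsplittable_tokens=[self.end_token],
            **kwargs,
        )
```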
Oh, I see! Thanks for the clarification! I agree that it would be better to just extract `vocab` and `merges` from their format and load it like other models, as there isn't any other model that has this format.
I am not sure I understand your point exactly, but I believe it's about the special tokens that the 7b model has and how we can save them properly.

If that is the case, we can add a `special_tokens` arg to `FalconTokenizer`, something like `WhisperTokenizer`, but here it can be just a list because the tokens are already included in the vocabulary. During initialization, we pass them along with `<|endoftext|>` to the `unsplittable_tokens` arg of the super class.

For the conversion script, while converting the tokenizer, we can check the `hf_tokenizer["added_tokens"]` list for any added tokens other than `<|endoftext|>` and pass them to `FalconTokenizer` as `special_tokens`. Then, of course, we need to update the config to contain the special tokens. So we will still have only `tokenizer.json` + `assets/tokenizer/merges.txt` + `assets/tokenizer/vocab.json`, but with the config in `tokenizer.json` having a list of `special_tokens`, as sketched below.
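A rough sketch of the conversion-script side of this idea; note that the `special_tokens` argument on `FalconTokenizer` is the proposed (not yet existing) API, and `hf_tokenizer`, `vocab`, and `merges` are placeholders for values the script would already have:

```python
# Sketch of the proposed conversion logic, for illustration only.
# `hf_tokenizer` is the parsed Hugging Face tokenizer.json dict.
added_tokens = hf_tokenizer.get("added_tokens", [])
special_tokens = [
    token["content"]
    for token in added_tokens
    if token["content"] != "<|endoftext|>"
]

# The proposed `special_tokens` arg: the tokenizer would forward these,
# together with "<|endoftext|>", to `unsplittable_tokens` in the super
# class and record them in its config so they end up in the saved
# tokenizer.json.
keras_tokenizer = FalconTokenizer(
    vocabulary=vocab,
    merges=merges,
    special_tokens=special_tokens,
)
```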
Looks great!
@mattdangerw Thanks for the review!
- Add FalconTokenizer.
- Update checkpoint conversion script.
- Address reviews.
This PR is part of addressing #1372. It adds a tokenizer and a preset for the Falcon model and updates the conversion scripts to save a full preset directory.
Falcon Tokenizer and Weight Conversion Verification Colab