Add LLaMA 3 tokenizer and preset #1584
Conversation
@mattdangerw @SamanehSaadat I initially wanted to reuse all the components we had for LLaMA, but:
- The Hugging Face repo contains a broken sentence piece model. The only other available option is using their `tokenizer.json` file, which contains the vocab and merges for BPE (a rough sketch of that workaround follows this comment).
- The LLaMA 3 tokenizer doesn't prepend a start token to the inputs anymore, so the preprocessor has fundamentally changed.

So, I created new `Llama3*` components which live under their own directory. I am not sure this is the best way to add the model, though, since we didn't name the LLaMA 2 components `Llama2*` and this breaks consistency. Ideas are welcome here; please let me know if there's a better way to do this.
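For context, a rough sketch of the workaround being described: pulling the vocab and merges out of `tokenizer.json` and feeding them to a byte-pair tokenizer. The `tokenizer.json` field names and the exact keras_nlp call are assumptions here, not the code in this PR.

```python
import json

from keras_nlp.tokenizers import BytePairTokenizer

# Assumed tokenizer.json layout (Hugging Face "tokenizers" format):
# {"model": {"vocab": {token: id, ...}, "merges": ["a b", ...], ...}, ...}
with open("tokenizer.json") as f:
    tokenizer_json = json.load(f)

vocab = tokenizer_json["model"]["vocab"]
merges = tokenizer_json["model"]["merges"]

# Build a BPE tokenizer directly from the extracted vocab and merges.
bpe_tokenizer = BytePairTokenizer(vocabulary=vocab, merges=merges)
token_ids = bpe_tokenizer("Hello, LLaMA 3!")
```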
Thanks, Tirth! Looks good! Left some comments.
    **kwargs
):
    super().__init__(
        tokenizer, sequence_length, add_start_token, add_end_token, **kwargs
Just want to make sure I understand correctly: is the main difference between the llama3 preprocessor and the llama preprocessor that `add_start_token=False` in llama3 and `add_start_token=True` in llama? Or are there other differences?
Yes, as far as the preprocessor is concerned, this is the only change. But there's one more change in the tokenizer: the authors don't provide a sentence piece tokenizer anymore. Instead, they provide a BPE vocab and merges file, so I had to create a completely new preprocessing pipeline for LLaMA 3.
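To make that concrete, here is a minimal sketch of the kind of default flip being discussed. The class and argument names mirror the usual KerasNLP preprocessor API, but the exact defaults and base class in this PR may differ.

```python
from keras_nlp.models import LlamaCausalLMPreprocessor


# Hypothetical sketch, not the code in this PR: the LLaMA 3 preprocessor simply
# flips the default of `add_start_token`; everything else is inherited.
class Llama3CausalLMPreprocessor(LlamaCausalLMPreprocessor):
    def __init__(
        self,
        tokenizer,
        sequence_length=1024,
        add_start_token=False,  # LLaMA 2's preprocessor defaults this to True
        add_end_token=False,
        **kwargs,
    ):
        super().__init__(
            tokenizer, sequence_length, add_start_token, add_end_token, **kwargs
        )
```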
I see!
Re: new directories for llama3, what do you think about adding the new llama3 classes to the existing llama files?
- llama
  - llama_tokenizer.py
    - LlamaTokenizer
    - Llama3Tokenizer
  - llama_preprocessor.py
    - LlamaPreprocessor
    - Llama3Preprocessor
  - etc.

This way, we won't have a new llama3 directory and files.
Good idea. I went with a different directory to maintain the precedent in KerasNLP. I will check if it's possible to do this without making the existing files too cumbersome.

What do you think about #1584 (review)? I am curious if there's a better way to utilize the current infra to avoid creating separate classes. One way I can think of is to accept both `proto` and `vocab`/`merges` in the tokenizer class and branch to BPE or sentence piece based on the inputs. I avoided that since it's an unclean approach and also pollutes the API a bit.
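For reference, a rough sketch of the branching approach being rejected here (illustrative names only, not the actual `LlamaTokenizer` implementation):

```python
from keras_nlp.tokenizers import BytePairTokenizer, SentencePieceTokenizer


class UnifiedLlamaTokenizer:
    """Hypothetical tokenizer that picks a backend from its inputs."""

    def __init__(self, proto=None, vocabulary=None, merges=None, **kwargs):
        if proto is not None:
            # LLaMA 1/2 path: a sentence piece model file or serialized proto.
            self._backend = SentencePieceTokenizer(proto=proto, **kwargs)
        elif vocabulary is not None and merges is not None:
            # LLaMA 3 path: BPE vocab and merges pulled out of tokenizer.json.
            self._backend = BytePairTokenizer(
                vocabulary=vocabulary, merges=merges, **kwargs
            )
        else:
            raise ValueError(
                "Pass either `proto` or both `vocabulary` and `merges`."
            )

    def __call__(self, inputs):
        return self._backend(inputs)
```

Since the real tokenizers subclass `SentencePieceTokenizer`/`BytePairTokenizer` directly rather than wrapping them, branching like this would force a composition-style wrapper (or heavy branching inside one subclass), which is part of what makes it feel unclean.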
@tirthasheshpatel I think you need to run
lgtm! just a few comments
from keras_nlp.src.models.llama.llama_causal_lm import LlamaCausalLM

class Llama3CausalLM(LlamaCausalLM):
This seems inconsistent with the rest of the classes. Should we copy a docstring over here as well?
Done.
from transformers import LlamaForCausalLM

from keras_nlp import upload_preset
from keras_nlp.api.models import Llama3Backbone
We shouldn't need this anywhere. Can't we just do `keras_nlp.models`?
Done.
from keras_nlp.models import LlamaTokenizer
from keras_nlp.utils.preset_utils import save_to_preset
from keras_nlp import upload_preset
from keras_nlp.api.models import LlamaBackbone
Same here. `keras_nlp.models`?
Done.
# LLaMA 3 shares the same architecture as its predecessors,
# so we simply create an alias for API consistency.
@keras_nlp_export("keras_nlp.models.Llama3Backbone")
class Llama3Backbone(LlamaBackbone):
also for consistency, can we copy some tests for llama3 backbone too? a preset numerics test we should probably annotate with extra_large, given how big our presets are. or just leave it off if it will be too much hassle.
> also for consistency, can we copy some tests for llama3 backbone too?
We should. I don't have compute to locally run the preset tests so feel free to add them in a follow-up if you have time.
Will do as a follow-up!
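For reference, a hedged sketch of what such an `extra_large` preset numerics test might look like; the preset handle, token ids, and expectations below are placeholders rather than values from this PR, and the attribute names follow the usual KerasNLP backbone config.

```python
import numpy as np
import pytest

from keras_nlp.models import Llama3Backbone


@pytest.mark.extra_large
def test_llama3_backbone_preset_numerics():
    # Placeholder preset handle; use whichever LLaMA 3 preset gets uploaded.
    backbone = Llama3Backbone.from_preset("llama3_8b_en")
    input_data = {
        "token_ids": np.array([[1, 2, 3, 4]]),  # placeholder token ids
        "padding_mask": np.array([[1, 1, 1, 1]]),
    }
    outputs = backbone(input_data)
    # Shape check only; real numerics would compare a slice of `outputs`
    # against values captured from the reference implementation.
    assert tuple(outputs.shape) == (1, 4, backbone.hidden_dim)
```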
Thanks! Looks good to me!
Looks like we need to accept the llama terms of service for our test account. Just did; it should hopefully propagate soon and turn CI green.
Closes #1583
LLaMA 3 uses a BPE tokenizer with different special start and end tokens. Also, unlike LLaMA 2, it doesn't add the start token anymore when generating. To reflect these changes, this PR adds a new `Llama3Tokenizer`, `Llama3Preprocessor`, `Llama3CausalLMPreprocessor`, and `Llama3CausalLM`. The backbone hasn't changed and uses the plain old `LlamaBackbone`. A checkpoint conversion script has also been added.
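As a quick illustration of how the new pieces are meant to be used together (the preset handle below is a placeholder; it depends on which presets get uploaded):

```python
import keras_nlp

# Placeholder preset handle.
causal_lm = keras_nlp.models.Llama3CausalLM.from_preset("llama3_8b_en")

# The attached Llama3CausalLMPreprocessor tokenizes with BPE and, unlike
# LLaMA 2, does not prepend a start token.
print(causal_lm.generate("The capital of France is", max_length=32))
```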
TODOs: