convert : fix gemma v1 tokenizer convert #8248

Merged
merged 2 commits into from
Jul 4, 2024

ggerganov
Owner

@ggerganov ggerganov commented Jul 2, 2024

Follow up on #8244

It seems that Gemma v1 tokenization has always been broken because the converted model was missing the add_space_prefix == false flag. This PR sets the flag during conversion and also adds tokenizer tests for both Gemma and Gemma-2.
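
For readers unfamiliar with the flag, here is a rough illustration of why it matters. This is only a sketch of SentencePiece-style whitespace handling, not the llama.cpp or SentencePiece code; the helper name `spm_normalize` is made up for this example. With the prefix enabled, a leading space is added before spaces are mapped to the "▁" marker, so the first piece of the text differs from what a tokenizer without the prefix would see.

```python
# Illustrative sketch only: NOT the llama.cpp or SentencePiece implementation.
# `spm_normalize` is a hypothetical helper introduced for this example.

SPM_SPACE = "\u2581"  # "▁", the SentencePiece whitespace marker


def spm_normalize(text: str, add_space_prefix: bool) -> str:
    """Mimic the whitespace handling applied before SPM-style tokenization."""
    if add_space_prefix:
        # A leading space is prepended, so the first word gets a "▁" prefix.
        text = " " + text
    # Every space is replaced by the visible whitespace marker.
    return text.replace(" ", SPM_SPACE)


if __name__ == "__main__":
    print(spm_normalize("Hello world", add_space_prefix=True))   # ▁Hello▁world
    print(spm_normalize("Hello world", add_space_prefix=False))  # Hello▁world
```

If the flag is missing and the runtime defaults to adding the prefix, the first token of every prompt can differ from the reference tokenizer's output, which is the kind of mismatch the tokenizer tests below are meant to catch.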

# get tokenizers
python3 convert-hf-to-gguf-update.py <hf_token>

# generate ggml vocabs and tests
python3 convert-hf-to-gguf.py models/tokenizers/gemma/   --outfile models/ggml-vocab-gemma.gguf   --vocab-only
python3 convert-hf-to-gguf.py models/tokenizers/gemma-2/ --outfile models/ggml-vocab-gemma-2.gguf --vocab-only

# run the tests
make -j tests
./tests/test-tokenizer-0 models/ggml-vocab-gemma.gguf
./tests/test-tokenizer-0 models/ggml-vocab-gemma-2.gguf

@github-actions github-actions bot added the python python script changes label Jul 2, 2024
@mofosyne mofosyne added the "medium severity" and "Review Complexity : Medium" labels and removed the "medium severity" label Jul 3, 2024
@ggerganov ggerganov merged commit 20fc380 into master Jul 4, 2024
10 checks passed
@ggerganov ggerganov deleted the gg/fix-gemma branch July 4, 2024 07:41
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 7, 2024