The BytePairTokenizer class is extremely, extremely slow at tokenizing #2056

chenying99 opened this issue Jan 23, 2025 · 2 comments

@chenying99

chenying99 commented Jan 23, 2025

Vocabulary size: 6400.

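import time

# `tokenizer` is assumed to be an already constructed BytePairTokenizer
# (its construction is not shown in this report).
text = "Are you OK? "
start = time.time()
for i in range(10):
    tokenizer.tokenize(text + str(i))

end = time.time()
print(end - start)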

Output: 3.8366940021514893 seconds

@mehtamansi29
Collaborator

Hi @chenying99, thanks for reporting the issue. Could you tell me which tokenizer class you are using, and from which library (for example nltk or transformers)? Please also provide more sample code so we can reproduce the issue.

I tried the transformers tokenizer (bert-base-uncased) and it works fine: the loop takes 0.0023925304412841797 s (2.39 ms), which is a reasonable time.

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') 

text = "Are you OK? "
start = time.time()
for i in range(10):
    tokenizer.tokenize(text + str(i))

end = time.time()
print(end - start)

@chenying99
Author

chenying99 commented Jan 31, 2025

I am testing the BytePairTokenizer in this project (keras_hub), not the tokenizer from the transformers library.

I am training a large language model with this project and using the BytePairTokenizer it provides; tokenization takes up most of the time.

import time
import keras_hub

tokenizer = keras_hub.models.GPT2Tokenizer.from_preset("gpt2_base_en")

# or tokenizer = keras_hub.models.WhisperTokenizer.from_preset("whisper_tiny_multi")

text = "Are you OK? "
start = time.time()
for i in range(10):
    tokenizer.tokenize(text + str(i))

end = time.time()
print(end - start)
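
A total over ten calls does not show whether the time goes into a one-time setup cost on the first call or whether every call is slow. The sketch below is one possible way to separate the two. It is a generic harness and not part of keras_hub: `time_calls` and its `warmup` parameter are hypothetical names introduced here, and `str.split` is only a stand-in for `tokenizer.tokenize`.

```python
import time
from typing import Callable, Sequence


def time_calls(tokenize: Callable[[str], object],
               texts: Sequence[str],
               warmup: int = 1) -> dict:
    """Time one call per text and report warm-up and steady-state separately.

    The first `warmup` calls are reported on their own so that any
    one-time setup cost does not get mixed into the per-call figure.
    """
    if not 0 <= warmup < len(texts):
        raise ValueError("warmup must be >= 0 and smaller than len(texts)")

    per_call = []
    for text in texts:
        start = time.perf_counter()  # monotonic, high-resolution timer
        tokenize(text)
        per_call.append(time.perf_counter() - start)

    steady = per_call[warmup:]
    return {
        "warmup_s": sum(per_call[:warmup]),
        "steady_total_s": sum(steady),
        "steady_mean_s": sum(steady) / len(steady),
    }


if __name__ == "__main__":
    texts = ["Are you OK? " + str(i) for i in range(10)]
    # Stand-in tokenizer; replace str.split with tokenizer.tokenize
    # to measure the keras_hub tokenizer from the snippet above.
    print(time_calls(str.split, texts))
```

If `warmup_s` accounts for most of the total, the cost is mostly one-time setup; if `steady_mean_s` is itself large, each individual call is slow.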
