Hi @chenying99 - thanks for reporting the issue. Could you tell me which tokenizer class you are comparing against, from a library such as nltk or transformers? Please also provide more sample code here so we can reproduce the issue.
I tried the transformers tokenizer (bert-base-uncased) and it works fine: I get 0.0023925304412841797 s (2.39 ms), which looks like a reasonable time.
import time
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = "Are you OK? "
start = time.time()
for i in range(10):
    tokenizer.tokenize(text + str(i))
end = time.time()
print(end - start)
I am testing the BytePairTokenizer from this project (keras_hub), not the tokenizer from the transformers library.
I am training a large language model with this project, using the BytePairTokenizer it provides, and tokenization takes up most of the time.
import time
import keras_hub
tokenizer = keras_hub.models.GPT2Tokenizer.from_preset("gpt2_base_en")
# or tokenizer = keras_hub.models.WhisperTokenizer.from_preset("whisper_tiny_multi")
text = "Are you OK? "
start = time.time()
for i in range(10):
    tokenizer.tokenize(text + str(i))
end = time.time()
print(end - start)
Vocabulary size: 6400
Elapsed time: 3.8366940021514893 seconds
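As a follow-up check, here is a minimal sketch of a warmed-up, batched measurement. It assumes the keras_hub tokenizer accepts a batch of strings (a rank-1 tf.constant) and that the first call pays one-time setup cost; the point is to see whether per-call overhead, rather than the tokenization itself, dominates the 3.8 s above.

import time

import tensorflow as tf
import keras_hub

tokenizer = keras_hub.models.GPT2Tokenizer.from_preset("gpt2_base_en")

# Batch all ten strings into a single tensor (assumed supported input).
texts = tf.constant(["Are you OK? " + str(i) for i in range(10)])

# Warm-up call: absorbs any one-time setup (lookup tables, tracing).
tokenizer.tokenize(texts)

start = time.time()
tokenizer.tokenize(texts)  # one batched call instead of ten scalar calls
end = time.time()
print(end - start)

If the batched, warmed-up call is fast, the reported slowdown is mostly per-call overhead; if it is still in the seconds range, the BytePairTokenizer itself is the bottleneck.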