The BytePairTokenizer class is extremely, extremely slow at tokenizing #2056

chenying99 · 2025-01-23T21:19:50Z

Vocabulary size: 6400

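import time

# `tokenizer` is a keras_hub BytePairTokenizer with a vocabulary of size 6400 (setup not shown).
text = "Are you OK? "
start = time.time()
for i in range(10):
    tokenizer.tokenize(text + str(i))

end = time.time()
print(end - start)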

3.8366940021514893 seconds

mehtamansi29 · 2025-01-28T14:38:56Z

Hi @chenying99 - Thanks for reporting the issue. Could you tell me which library the tokenizer class comes from, for example nltk or transformers? Please also provide more sample code so we can reproduce the issue.

I tried the transformers tokenizer (bert-base-uncased) and it works fine. I get 0.0023925304412841797 s (2.39 ms), which seems like a reasonable time.

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') 

text = "Are you OK? "
start = time.time()
for i in range(10):
    tokenizer.tokenize(text + str(i))

end = time.time()
print(end - start)

chenying99 · 2025-01-31T13:46:00Z

I am testing the BytePairTokenizer in this project (keras_hub), not the tokenizer from the transformers library.

I am training a large language model with this project, using the BytePairTokenizer it provides, and tokenization takes up most of the time.

import time
import keras_hub

tokenizer = keras_hub.models.GPT2Tokenizer.from_preset("gpt2_base_en")

# or: tokenizer = keras_hub.models.WhisperTokenizer.from_preset("whisper_tiny_multi")

text = "Are you OK? "
start = time.time()
for i in range(10):
    tokenizer.tokenize(text + str(i))

end = time.time()
print(end - start)
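
To narrow down where the time goes, the loop above can be split into three measurements: a warm-up call, one call per string, and a single call on the whole list. This is only a rough sketch, not a confirmed diagnosis. `time_tokenizer` and `whitespace_tokenize` are hypothetical helpers introduced here, the stand-in tokenizer exists only so the code runs without keras_hub, and passing a list to `tokenizer.tokenize` is assumed to be supported.

```python
import time
from typing import Callable, Dict, Sequence


def time_tokenizer(tokenize_fn: Callable, texts: Sequence[str], warmup: int = 1) -> Dict[str, float]:
    """Time warm-up calls, per-string calls, and one batched call separately."""
    # Warm-up: the first call may include one-off setup cost (an assumption,
    # which is exactly what this measurement is meant to check).
    start = time.perf_counter()
    for _ in range(warmup):
        tokenize_fn(texts[0])
    warmup_s = time.perf_counter() - start

    # One call per string, as in the loop from the report.
    start = time.perf_counter()
    for t in texts:
        tokenize_fn(t)
    per_string_s = time.perf_counter() - start

    # A single call on the whole list of strings.
    start = time.perf_counter()
    tokenize_fn(list(texts))
    batched_s = time.perf_counter() - start

    return {"warmup_s": warmup_s, "per_string_s": per_string_s, "batched_s": batched_s}


# Hypothetical stand-in tokenizer so the sketch runs on its own.
# Replace it with `tokenizer.tokenize` to measure the real tokenizer.
def whitespace_tokenize(x):
    if isinstance(x, str):
        return x.split()
    return [s.split() for s in x]


texts = ["Are you OK? " + str(i) for i in range(10)]
result = time_tokenizer(whitespace_tokenize, texts)
print(result)
```

If `per_string_s` is far larger than `batched_s` for the real tokenizer, the cost is mostly per-call overhead rather than the tokenization work itself. If `warmup_s` dominates, the cost is mostly one-off setup.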

mehtamansi29 self-assigned this Jan 28, 2025

mehtamansi29 added the type:Bug and stat:awaiting response from contributor labels Jan 28, 2025
