
[Bug]? how does the tokenizer encode the special tokens? #1263

Closed
vpegasus opened this issue May 29, 2023 · 3 comments

Comments


vpegasus commented May 29, 2023

Hi all, I used the tokenizer to process data for a LLaMA model (already converted to HF format) and set:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(llama_model_id, model_max_length=1024, padding_side='right',
                                          trust_remote_code=True)
tokenizer.add_special_tokens(
    {
        "eos_token": "</s>",
        "bos_token": "</s>",
        "unk_token": "</s>",
    })
tokenizer.pad_token = tokenizer.eos_token
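
As a quick sanity check on this setup (a minimal sketch, assuming the tokenizer configured above), the ids assigned to the special tokens can be inspected directly:

# Confirm how the tokenizer maps the special tokens configured above.
print(tokenizer.special_tokens_map)  # eos/bos/unk/pad as set above
print(tokenizer.eos_token_id)        # 2 for the LLaMA tokenizer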

When tokenizing a piece of text ending with the eos_token:

tokenizer(['ASSISTANT: Hello!</s>']) # there is no space between ! and </s>.
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 829, 29879, 29958]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

The eos_token </s> is encoded to 829, 29879, 29958, which means </s> is treated as the three pieces </, s, and > rather than a single special token.
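
A quick way to confirm how the string was split is to look at the token strings behind the ids (a minimal check, assuming the same tokenizer as above):

ids = tokenizer('ASSISTANT: Hello!</s>')['input_ids']
print(tokenizer.convert_ids_to_tokens(ids))
# the tail should show the literal pieces of "</s>" (</, s, >) rather than the single eos token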

tokenizer(['ASSISTANT: Hello! </s>'])  # there is a space between ! and </s>.
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 2]],
  'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0]],
  'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1]]}

This time, </s> is encoded correctly (token id 2).

Given the above, does this mean we should add a space between the text and the eos_token? However, I find that many popular projects like Alpaca concatenate the text with the eos_token without a space.

I previously thought the tokenizer encoded text greedily, so the eos_token would be encoded correctly with or without a space, but the experiments above seem to contradict that.

Could anyone help me see whether I'm misunderstanding something? Thanks.
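
One workaround that avoids relying on how the string "</s>" gets split is to append the eos id after tokenizing instead of concatenating the string (a minimal sketch, assuming the tokenizer configured above; not necessarily what Alpaca does):

text = 'ASSISTANT: Hello!'
enc = tokenizer(text)
input_ids = enc['input_ids'] + [tokenizer.eos_token_id]      # ends with 2 regardless of whitespace
attention_mask = enc['attention_mask'] + [1]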

@vpegasus (Author)

After some further experiments, I found something weird:

tokenizer('我是谁')
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132] 

1 is the bos_token_id, and 29871 is the id of the SentencePiece whitespace token '▁'.

tokenizer('我是谁</s>')
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132, 829, 29879, 29958]

tokenizer('who are you</s>')
output:
'input_ids': [1, 1058, 526, 366, 829, 29879, 29958] # there is no 29871.

When a space is added between the text and </s>:

tokenizer('我是谁 </s>') 
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132, 2] # the `</s>` is encoded correctly

When decoding [1, 29871, 30672, 30392, 235, 179, 132, 2]:

tokenizer.decode([1, 29871, 30672, 30392, 235, 179, 132, 2])
output:
'<s> 我是谁</s>' 

the space is ignored!

When manually adding token id 29871:

tokenizer.decode([1, 29871, 30672, 30392, 235, 179, 132, 29871, 2])
output:
'<s> 我是谁 </s>' 

This time, there is a space between the text and </s>.

Do the experiments above mean that the encode and decode methods are not exact inverses of each other?
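
A small round-trip check makes the asymmetry explicit (a sketch, using the same tokenizer and the example above):

s = '我是谁 </s>'
ids = tokenizer(s)['input_ids']
back = tokenizer.decode(ids)
print(ids)        # [1, 29871, 30672, 30392, 235, 179, 132, 2]: the space before </s> leaves no token of its own
print(back)       # '<s> 我是谁</s>': the space before </s> is gone
print(back == s)  # False: encode followed by decode is not an exact inverse here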

@vpegasus changed the title from "how does the tokenizer encode the special tokens?" to "[bug]? how does the tokenizer encode the special tokens?" on May 30, 2023
@vpegasus changed the title from "[bug]? how does the tokenizer encode the special tokens?" to "[Bug]? how does the tokenizer encode the special tokens?" on May 30, 2023

vpegasus commented Jun 2, 2023

huggingface/transformers#23909

@vpegasus vpegasus closed this as completed Jun 2, 2023
@ArthurZucker (Collaborator)

Thanks for linking to the transformers PR! This issue slipped through the cracks 👍🏻
I'm working on it!
