
ByT5Tokenizer ignores spaces around added tokens #19873

Closed

djstrong opened this issue Oct 25, 2022 · 7 comments · Fixed by #23909

@djstrong

System Info

transformers 4.23.1

Who can help?

@patrickvonplaten @SaulLu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('google/byt5-base')
tokenizer.add_tokens('<x>', special_tokens=True)
print(tokenizer('<x> <x> <x><x>'))
{'input_ids': [384, 384, 384, 384, 1], 'attention_mask': [1, 1, 1, 1, 1]}

in comparison to:

print(tokenizer('a a aa'))
{'input_ids': [100, 35, 100, 35, 100, 100, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Expected behavior

In my task, the presence of spaces around added tokens is important. In any case, I think the ByT5 tokenizer should not ignore any characters (bytes).
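For reference, ByT5 has no learned vocabulary: each UTF-8 byte b maps to id b + 3, with ids 0, 1, and 2 reserved for pad, eos, and unk. That is why 'a' (byte 97) encodes to 100 and a space (byte 32) to 35, and why the expected output above should keep id 35 between the 384s. A minimal sketch of this byte mapping (plain text only, no added tokens, no eos):

```python
# ByT5 maps each UTF-8 byte b to id b + 3; ids 0-2 are pad/eos/unk.
OFFSET = 3  # number of reserved special-token ids

def byt5_ids(text: str) -> list[int]:
    """Byte-level ids for plain text (no added tokens, no trailing eos)."""
    return [b + OFFSET for b in text.encode("utf-8")]

print(byt5_ids("a a aa"))  # [100, 35, 100, 35, 100, 100], matching the output above
```

Under this mapping, the expected encoding of '<x> <x> <x><x>' would be [384, 35, 384, 35, 384, 384, 1], with the space bytes preserved.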

@sgugger
Collaborator

sgugger commented Oct 25, 2022

May also be of interest to @ArthurZucker

@ArthurZucker ArthurZucker self-assigned this Nov 29, 2022
@patrickvonplaten
Contributor

Also cc @Narsil - any ideas here?

@Narsil
Contributor

Narsil commented Nov 30, 2022

> Also cc @Narsil - any ideas here?

Yes, by default added tokens use lstrip=True/rstrip=True, which swallows prefix/suffix spaces (a convenience so you don't have to worry about how the token sits within surrounding text).
Since ByT5 works on pure bytes, it has no tokenizers (fast) backend — that wouldn't make sense speed-wise — so it uses the "slow" class (which isn't actually slow).

from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
# instead of tokenizer.add_tokens("<x>", special_tokens=True):
new_token = AddedToken("<x>", lstrip=False, rstrip=False)
tokenizer.add_tokens(new_token, special_tokens=True)
tokenizer._additional_special_tokens.append(new_token)

This change fixes it, but it requires touching internals, which is not great. Definitely looks like a bug.

Pinging @ydshieh, who has been looking at this recently and trying to figure out some tokenizer internals.

I "think" this qualifies as a bug. The originally shared code is not OK as-is, since the defaults are to strip left and right, but if you call add_tokens(AddedToken(..., lstrip=False, rstrip=False)) then those flags should be honored. For the workaround, I had to set a few different internal variables so that the Trie class could do its job correctly; otherwise it simply couldn't see the AddedToken values.
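The stripping behaviour described above can be sketched in a few lines of self-contained Python. Note that split_on_added_token is a hypothetical simplification for illustration, not the actual transformers/Trie implementation: when an added token is matched with lstrip/rstrip enabled, the whitespace on either side is consumed as part of the match, so the spaces never reach the byte-level encoder.

```python
import re

def split_on_added_token(text: str, token: str, lstrip: bool, rstrip: bool):
    """Split text around an added token, optionally swallowing adjacent spaces.

    Hypothetical simplification of the matching a tokenizer performs on
    added tokens; pieces between matches would go on to the byte encoder.
    """
    pattern = re.escape(token)
    if lstrip:
        pattern = r" *" + pattern
    if rstrip:
        pattern = pattern + r" *"
    # Capture group keeps the token matches in the result of re.split.
    return [piece for piece in re.split(f"({pattern})", text) if piece]

# With stripping (the default): the spaces are absorbed into the token matches,
# so no plain-text pieces survive to be byte-encoded.
print(split_on_added_token("<x> <x> <x><x>", "<x>", True, True))
# ['<x> ', '<x> ', '<x>', '<x>']

# Without stripping: the inter-token spaces survive as plain text.
print(split_on_added_token("<x> <x> <x><x>", "<x>", False, False))
# ['<x>', ' ', '<x>', ' ', '<x>', '<x>']
```

This matches the issue's symptom: with lstrip/rstrip enabled, nothing but the added tokens is left to encode, so the space byte (id 35) never appears in input_ids.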

@ydshieh
Collaborator

ydshieh commented Feb 13, 2023

Sorry for being late here. So as @Narsil pointed out,

from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
new_token = AddedToken("<x>", lstrip=False, rstrip=False)
tokenizer.add_tokens(new_token, special_tokens=True)

should work (which is not the case right now) without needing tokenizer._additional_special_tokens.append(new_token).
The goal is to make the above code snippet do its job correctly. Is that right?

@github-actions github-actions bot closed this as completed May 7, 2023
@ArthurZucker
Collaborator

Hey! I'll take this one on as part of #23909, since it is an issue with rstrip and lstrip being ignored (the default behaviour for a non-special token is to always strip).

@ArthurZucker ArthurZucker reopened this Jun 1, 2023
@ArthurZucker
Collaborator

As mentioned, this will take a bit more time, a big refactoring is coming! 🔥

@ArthurZucker
Collaborator

Should be merged this week!
