
ByT5Tokenizer ignores spaces around added tokens #19873

Closed

djstrong opened this issue Oct 25, 2022 · 7 comments · Fixed by #23909

@djstrong

System Info

transformers 4.23.1

Who can help?

@patrickvonplaten @SaulLu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('google/byt5-base')
tokenizer.add_tokens('<x>', special_tokens=True)
print(tokenizer('<x> <x> <x><x>'))
{'input_ids': [384, 384, 384, 384, 1], 'attention_mask': [1, 1, 1, 1, 1]}

in comparison to:

print(tokenizer('a a aa'))
{'input_ids': [100, 35, 100, 35, 100, 100, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Expected behavior

In my task, the presence of spaces around added tokens is important. In any case, I think the ByT5 tokenizer should not ignore any characters (bytes).
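For reference, ByT5 has no learned vocabulary: each UTF-8 byte b maps to id b + 3, with ids 0, 1, and 2 reserved for pad, eos, and unk. That is why 'a' (byte 97) encodes to 100 and a space (byte 32) to 35, and why the expected output above should keep id 35 between the 384s. A minimal sketch of this byte mapping (plain text only, no added tokens, no eos):

```python
# ByT5 maps each UTF-8 byte b to id b + 3; ids 0-2 are pad/eos/unk.
OFFSET = 3  # number of reserved special-token ids

def byt5_ids(text: str) -> list[int]:
    """Byte-level ids for plain text (no added tokens, no trailing eos)."""
    return [b + OFFSET for b in text.encode("utf-8")]

print(byt5_ids("a a aa"))  # [100, 35, 100, 35, 100, 100], matching the output above
```

Under this mapping, the expected encoding of '<x> <x> <x><x>' would be [384, 35, 384, 35, 384, 384, 1], with the space bytes preserved.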

@sgugger
Collaborator

sgugger commented Oct 25, 2022

May also be of interest to @ArthurZucker

@ArthurZucker ArthurZucker self-assigned this Nov 29, 2022
@patrickvonplaten
Contributor

Also cc @Narsil - any ideas here?

@Narsil
Contributor

Narsil commented Nov 30, 2022

> Also cc @Narsil - any ideas here?

Yes, by default added tokens use lstrip=True/rstrip=True, which swallows prefix/suffix spaces (a convenience so you don't have to worry about how the token sits within surrounding text).
Since ByT5 works on pure bytes, it has no tokenizers (fast) backend — that wouldn't make sense speed-wise — so it uses the "slow" class (which isn't actually slow).

from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
# instead of tokenizer.add_tokens("<x>", special_tokens=True):
new_token = AddedToken("<x>", lstrip=False, rstrip=False)
tokenizer.add_tokens(new_token, special_tokens=True)
tokenizer._additional_special_tokens.append(new_token)

This change fixes it, but it requires touching internals, which is not great. Definitely looks like a bug.

Pinging @ydshieh, who has been looking at this recently and trying to figure out some tokenizer internals.

I "think" this qualifies as a bug. The originally shared code is not OK as-is, since the defaults are to strip left and right, but if you call add_tokens(AddedToken(..., lstrip=False, rstrip=False)) then those flags should be honored. For the workaround, I had to set a few different internal variables so that the Trie class could do its job correctly; otherwise it simply couldn't see the AddedToken values.
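The stripping behaviour described above can be sketched in a few lines of self-contained Python. Note that split_on_added_token is a hypothetical simplification for illustration, not the actual transformers/Trie implementation: when an added token is matched with lstrip/rstrip enabled, the whitespace on either side is consumed as part of the match, so the spaces never reach the byte-level encoder.

```python
import re

def split_on_added_token(text: str, token: str, lstrip: bool, rstrip: bool):
    """Split text around an added token, optionally swallowing adjacent spaces.

    Hypothetical simplification of the matching a tokenizer performs on
    added tokens; pieces between matches would go on to the byte encoder.
    """
    pattern = re.escape(token)
    if lstrip:
        pattern = r" *" + pattern
    if rstrip:
        pattern = pattern + r" *"
    # Capture group keeps the token matches in the result of re.split.
    return [piece for piece in re.split(f"({pattern})", text) if piece]

# With stripping (the default): the spaces are absorbed into the token matches,
# so no plain-text pieces survive to be byte-encoded.
print(split_on_added_token("<x> <x> <x><x>", "<x>", True, True))
# ['<x> ', '<x> ', '<x>', '<x>']

# Without stripping: the inter-token spaces survive as plain text.
print(split_on_added_token("<x> <x> <x><x>", "<x>", False, False))
# ['<x>', ' ', '<x>', ' ', '<x>', '<x>']
```

This matches the issue's symptom: with lstrip/rstrip enabled, nothing but the added tokens is left to encode, so the space byte (id 35) never appears in input_ids.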

@ydshieh
Collaborator

ydshieh commented Feb 13, 2023

Sorry for being late here. So as @Narsil pointed out,

from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
new_token = AddedToken("<x>", lstrip=False, rstrip=False)
tokenizer.add_tokens(new_token, special_tokens=True)

should work (which is not the case right now) without needing tokenizer._additional_special_tokens.append(new_token).
The goal is to make the above code snippet do its job correctly. Is that right?

@github-actions github-actions bot closed this as completed May 7, 2023
@ArthurZucker
Collaborator

Hey! I'll take this one on as part of #23909, since it is an issue with rstrip and lstrip being ignored (the default behaviour for a non-special token is to always strip).

@ArthurZucker ArthurZucker reopened this Jun 1, 2023
@ArthurZucker
Collaborator

As mentioned, this will take a bit more time, a big refactoring is coming! 🔥

@ArthurZucker
Collaborator

Should be merged this week!
