ByT5Tokenizer ignores spaces around added tokens #19873
Comments
May also be of interest to @ArthurZucker

Also cc @Narsil - any ideas here?
Yes, by default added tokens always strip spaces on both sides (`lstrip=True`, `rstrip=True`). You need:

```python
from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
# tokenizer.add_tokens("<x>", special_tokens=True)
new_token = AddedToken("<x>", lstrip=False, rstrip=False)
tokenizer.add_tokens(new_token, special_tokens=True)
tokenizer._additional_special_tokens.append(new_token)
```

This change will fix it; however, it requires changing internals, which is not great. Definitely looks like a bug.

Pinging @ydshieh, who was looking at this recently and trying to figure out some tokenizer stuff. I "think" this qualifies as a bug. (Well, the original shared code is not OK, since the defaults are to strip left and right, but the snippet above works.)
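The stripping behavior under discussion can be sketched in plain Python. This is illustrative only, not the actual `tokenizers` implementation; `split_on_added_token` is a hypothetical helper that mimics how `lstrip`/`rstrip` on an added token swallow adjacent whitespace before the rest of the text is tokenized:

```python
import re

def split_on_added_token(text, token, lstrip=True, rstrip=True):
    # Illustrative sketch (not the real tokenizers code): when lstrip/rstrip
    # are True, whitespace next to the added token is consumed together with
    # the token, which is the space loss reported in this issue.
    pattern = re.escape(token)
    if lstrip:
        pattern = r"\s*" + pattern
    if rstrip:
        pattern = pattern + r"\s*"
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        if m.start() > last:
            pieces.append(text[last:m.start()])
        pieces.append(token)
        last = m.end()
    if last < len(text):
        pieces.append(text[last:])
    return pieces

# Default-like behavior eats the surrounding spaces:
print(split_on_added_token("a <x> b", "<x>"))  # ['a', '<x>', 'b']
# With lstrip=False / rstrip=False the spaces survive:
print(split_on_added_token("a <x> b", "<x>", lstrip=False, rstrip=False))  # ['a ', '<x>', ' b']
```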
Sorry for being late here. As @Narsil pointed out,

```python
from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
new_token = AddedToken("<x>", lstrip=False, rstrip=False)
tokenizer.add_tokens(new_token, special_tokens=True)
```

should work (which is not the case for now) without the need of appending to `tokenizer._additional_special_tokens`.
Hey! I'll take this one on as part of #23909, since it is an issue with added tokens.
As mentioned, this will take a bit more time; a big refactoring is coming! 🔥

Should be merged this week!
System Info
transformers 4.23.1
Who can help?
@patrickvonplaten @SaulLu
Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
in comparison to:
Expected behavior
In my task, the presence of spaces around added tokens is important. Regardless of that, I think the ByT5 tokenizer should not ignore any characters (bytes).
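The expectation that no byte should be dropped follows from how ByT5 tokenizes. A minimal sketch of the mapping, assuming the usual ByT5 convention that ids 0-2 are reserved for pad/eos/unk so each UTF-8 byte `b` maps to id `b + 3` (`byt5_style_ids` is an illustrative helper, not the library API):

```python
def byt5_style_ids(text: str) -> list[int]:
    # Assumed ByT5 convention: ids 0, 1, 2 are pad/eos/unk,
    # so UTF-8 byte value b becomes token id b + 3.
    # Every byte, including the space (byte 32 -> id 35), gets its own id,
    # which is why silently dropping spaces around added tokens loses data.
    return [b + 3 for b in text.encode("utf-8")]

print(byt5_style_ids(" a "))  # [35, 100, 35] - the spaces are real tokens
```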