[Core tokenization] `add_dummy_prefix_space` option to help with latest issues #28010
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Just wanted to say this would be hugely helpful for us over at https://github.com/probcomp/hfppl !
Likewise, the ability to not include an extra SPIECE_UNDERLINE / Llama token 29871 when encoding a word with a space in front would be helpful.
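To make the request above concrete, here is a minimal sketch of the behavior being described, assuming a Llama 2 checkpoint (the model id is illustrative, and the exact ids depend on the vocab):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any Llama-family sentencepiece tokenizer
# behaves similarly.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Encoding a word with a leading space yields a lone SPIECE_UNDERLINE
# token ('▁', id 29871 in the Llama vocab) before the word token.
print(tok.encode(" world", add_special_tokens=False))
```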
I'll let @Lysandre decide, but instead of following what we do with bloom, I'd rather we convert from slow. A bit slower, but at least we are sure we use the correct logic.
This is done with a warning.
Failing test is unrelated 😉
Ok, this looks good to me
```python
if add_prefix_space is not None:
    logger.warning_once(
        "You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers"
    )
    kwargs["from_slow"] = True
```
How long does it take to convert the tokenizer from slow? If it's quick, we can move it to `info`.
Around 10 seconds I believe!
@ArthurZucker I know this is old at this point, but this PR seems to introduce an unintended side-effect for Mistral v0.1 tokenizers. I figured out that I can fix it by always initializing with […]. Not sure if this is on your radar or if it is fixed in newer versions, but I would totally appreciate it if this could be fixed for the Mistral model series.

Tokens before (4.37.2): […]
Tokens after (this PR, …): […]
Tokens after (this PR, …): […]
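A minimal comparison sketch for reproducing the report above (the Mistral checkpoint id is an assumption, and the kwarg elided from the comment is unknown; `add_prefix_space`, the option this PR adds, is used here purely for illustration):

```python
from transformers import AutoTokenizer

text = "Hello world"
for kwargs in ({}, {"add_prefix_space": True}, {"add_prefix_space": False}):
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", **kwargs)
    # Compare the ids produced under each setting against the 4.37.2 output.
    print(kwargs, tok.encode(text, add_special_tokens=False))
```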
What does this PR do?
Allows users to control the addition of a prefix space when using `tokenizer.tokenize`. Let's also update the fast tokenizer!

Fixes #28622
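A hedged sketch of the resulting API (model id illustrative; exact token strings depend on the vocab):

```python
from transformers import AutoTokenizer

default_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
no_prefix_tok = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf", add_prefix_space=False
)

print(default_tok.tokenize("Hey"))    # e.g. ['▁Hey']  (prefix space added)
print(no_prefix_tok.tokenize("Hey"))  # e.g. ['Hey']   (no prefix space)
```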