Wrong offsets_mapping in T5TokenizerFast #9633
Comments
@patrickvonplaten @n1t0 do you have any advice on this? The T5 tokenizer tokenizes the sentence as follows:
Unfortunately, the offset mapping points to both '▁' and 'a' being at the same character offsets.
How should one map this encoding back to the initial sequence?
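The tokenization output quoted above did not survive the export. As a minimal sketch of how the encoding can be mapped back to the initial sequence via the offsets (the checkpoint and sample text here are assumptions, not from the original report):

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")

text = "a simple example"
encoding = tokenizer(text, return_offsets_mapping=True)

tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
for token, (start, end) in zip(tokens, encoding["offset_mapping"]):
    # Each token is shown next to the substring its offsets claim to cover;
    # overlapping spans for '▁' and the following piece are the reported bug.
    print(repr(token), (start, end), repr(text[start:end]))
```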
@patrickvonplaten @n1t0 - did you have a chance to look at this?
Thanks @n1t0, I wondered if there has been any progress on this? Any expectation for when the fix will be available? Thanks!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@zorikg Using the last few versions, you can do `tokenizer = T5TokenizerFast.from_pretrained('google/t5-v1_1-base', from_slow=True)`. This will force the conversion from the slow tokenizer, thus using the fixed version.
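As a sketch of the suggested workaround (assuming a recent enough tokenizers/transformers install; the sample text is illustrative):

```python
from transformers import T5TokenizerFast

# from_slow=True rebuilds the fast tokenizer from the slow SentencePiece one,
# so the corrected offset behaviour is picked up.
tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-base", from_slow=True)

encoding = tokenizer("a simple example", return_offsets_mapping=True)
print(encoding["offset_mapping"])
```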
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I am getting some differences between these two tokenizers. Is this solved?
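A quick check for whether the slow and fast tokenizers still disagree might look like this (a sketch; the checkpoint and sample text are assumptions):

```python
from transformers import T5Tokenizer, T5TokenizerFast

slow = T5Tokenizer.from_pretrained("t5-small")
fast = T5TokenizerFast.from_pretrained("t5-small", from_slow=True)

text = "a simple example"
# Matching token sequences suggest the conversion fix is in effect;
# a mismatch would mean the discrepancy is still there.
print(slow.tokenize(text))
print(fast.tokenize(text))
```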
Environment info
transformers version: 4.2.1
Who can help: @patrickvonplaten, @mfuntowicz
Information
Model I am using: T5
To reproduce
See comments in the code snippet.
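The original code snippet (and the comments it contained) did not survive the export. A minimal reproduction sketch, assuming the problem is the '▁' piece and the following piece reporting overlapping offsets (checkpoint and input are assumptions):

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")

text = "a"
encoding = tokenizer(text, return_offsets_mapping=True)

# Expected: each offset pair points at a distinct span of `text`.
# Reported: the leading '▁' piece and the piece for 'a' share the same span.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding["offset_mapping"])
```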
Expected behavior