BertTokenizer.from_pretrained fails for local_files_only=True when added_tokens.json is missing #9147
Comments
Actually, all of the files 404 here except
If these files are missing, even `BertTokenizer.from_pretrained('google/bert_uncased_L-2_H-128_A-2')` should give an error; however, it passes due to the code below. Is there a particular reason this logic was added?
@hlahkar Are you sure? The code you linked seems to just check for […]. In fact, my hacky workaround was to replace this line with […]
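I don't have the exact replacement line handy, but a minimal sketch of that style of workaround (all names here, `get_cached_file` and `OPTIONAL_FILES`, are illustrative stand-ins, not the real transformers internals) might look like:

```python
# Sketch: treat a cache miss for a known-optional file as "file absent"
# instead of raising, when running with local_files_only=True.
# These names are hypothetical, not the actual transformers API.
import os
import tempfile

OPTIONAL_FILES = {"added_tokens.json", "special_tokens_map.json"}

def get_cached_file(cache_dir, filename, local_files_only):
    path = os.path.join(cache_dir, filename)
    if os.path.exists(path):
        return path
    if local_files_only and filename in OPTIONAL_FILES:
        # Swallow the miss: assume the file simply doesn't exist
        # remotely. Risk: this silently skips a file that was never
        # downloaded in the first place.
        return None
    raise ValueError(
        f"{filename} not found in cache and local_files_only=True"
    )

cache_dir = tempfile.mkdtemp()  # empty cache for demonstration
print(get_cached_file(cache_dir, "added_tokens.json", local_files_only=True))
```

The obvious downside, as discussed below, is that an optional file that exists remotely but was never downloaded is indistinguishable from one that doesn't exist at all.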
My concern is: shouldn't we also go into the error flow whenever we get a 404? Otherwise it might give the user a false sense that things are working.
In my previous comment, I mentioned the wrong line number. My question is: why is the 404 error ignored in the code segment below?
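For reference, here is a hedged sketch of the behaviour being questioned (the helper names are invented for illustration, not the real transformers code): when online, a 404 on an optional file is silently skipped, while other HTTP errors propagate.

```python
# Sketch of the "ignore 404s for optional files" pattern under discussion.
# http_get and resolve_files are hypothetical stand-ins.
class HTTPError(Exception):
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def http_get(url, responses):
    # Simulate a server: `responses` maps URL -> status code (default 200).
    status = responses.get(url, 200)
    if status != 200:
        raise HTTPError(status)
    return f"<contents of {url}>"

def resolve_files(urls, responses):
    resolved = {}
    for url in urls:
        try:
            resolved[url] = http_get(url, responses)
        except HTTPError as e:
            if e.status == 404:
                # A 404 is taken to mean the repo simply lacks this
                # optional file, so it is ignored -- the behaviour
                # questioned above.
                resolved[url] = None
            else:
                raise
    return resolved

out = resolve_files(["vocab.txt", "added_tokens.json"],
                    {"added_tokens.json": 404})
print(out)
```

With this logic, a missing optional file never surfaces as an error, which is exactly why a genuinely broken setup can look like it "works".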
So, is this problem solved in any way?
@akutuzov on which version of transformers are you? I agree that this is a bug that we should solve, cc @LysandreJik @sgugger
Taking a look.
@julien-c I use Transformers 4.1.1
Aimed to fix that in #9807, feedback appreciated @julianmichael
The PR looks good as a stopgap — I guess the subsequent check at L1766 will catch the case where the tokenizer hasn't been downloaded yet since no files should be present. But is this problem necessarily only for tokenizers? It seems like a general issue which is going to hold for any cached resources that have optional files. It might be cleaner to handle it in the file cache itself. But that's a much bigger issue I guess.
I believe this is only the case for tokenizers. The two others that could possibly be affected by this are:
Let me know if you think I'm missing something and I'll see what we can do. |
Ok, sounds good. No need for unnecessary/premature refactoring then :) |
Environment info
`transformers` version: 4.0.1
Who can help
@mfuntowicz
Information
Model I am using (Bert, XLNet ...):
google/bert_uncased_L-2_H-128_A-2
The problem arises when using:
The tasks I am working on are:
To reproduce
Run the following:
In the Python interpreter, this produces the following error:
Looking more closely, I have isolated the issue to the logic here. In this case, the error is because the cached path for the URL
https://huggingface.co/google/bert_uncased_L-2_H-128_A-2/resolve/main/added_tokens.json
cannot be found in the cache when `local_files_only=True`. This is because the URL 404s; i.e., the file does not exist. When `local_files_only=False`, the GET returns a 404 and the tokenizer init code just ignores the missing file. However, when `local_files_only=True` and the file is not found, it throws a `ValueError` instead, which is not caught.
What makes this non-trivial is that, without making HTTP requests, there is no way to tell the difference between a file that doesn't exist remotely and a file that exists but hasn't been downloaded. It seems to me that there are several potential ways of fixing the issue.
`local_files_only` without downloading the model first.
Option 3 seems the cleanest to me, while option 4 is what I'm shunting into my transformers egg for now so I can keep working.
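To illustrate why the offline case is ambiguous, here is a minimal self-contained sketch. It assumes, purely for illustration, that cache filenames are derived by hashing the URL; `cached_path` and `url_to_filename` here are stand-ins, not the real transformers functions.

```python
# Sketch: with only the local cache to consult, a cache miss is ambiguous.
# The remote file may be a 404 (optional file, fine to skip) or may exist
# but simply never have been downloaded.
import hashlib
import os
import tempfile

def url_to_filename(url: str) -> str:
    # Stand-in: derive a cache filename by hashing the URL.
    return hashlib.sha256(url.encode()).hexdigest()

def cached_path(url, cache_dir, local_files_only):
    path = os.path.join(cache_dir, url_to_filename(url))
    if os.path.exists(path):
        return path
    if local_files_only:
        # Offline, we cannot distinguish "404s remotely" from
        # "not yet downloaded", so the code raises unconditionally.
        raise ValueError(
            f"Cannot find {url} in cache and local_files_only=True"
        )
    # Online we would fall through to an HTTP GET here and could
    # observe the 404 directly; return None as a placeholder.
    return None

cache_dir = tempfile.mkdtemp()  # empty cache for demonstration
url = ("https://huggingface.co/google/bert_uncased_L-2_H-128_A-2"
       "/resolve/main/added_tokens.json")
try:
    cached_path(url, cache_dir, local_files_only=True)
except ValueError as e:
    print("raised:", e)
```

This is the shape of the failure described above: the optional `added_tokens.json` 404s remotely, so it is never cached, and the offline lookup then raises instead of treating it as absent.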
Expected behavior
After downloading, I would expect any artifact to be loadable from cache and equivalent to the downloaded one.