
BertTokenizer.from_pretrained fails for local_files_only=True when added_tokens.json is missing #9147

Closed
julianmichael opened this issue Dec 16, 2020 · 13 comments · Fixed by #9807

@julianmichael commented Dec 16, 2020

Environment info

  • transformers version: 4.0.1
  • Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-centos-7.6.1810-Core
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.7.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@mfuntowicz

Information

Model I am using (Bert, XLNet ...): google/bert_uncased_L-2_H-128_A-2

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Run the following:

from transformers import BertTokenizer

# First call downloads the tokenizer files and populates the local cache.
BertTokenizer.from_pretrained('google/bert_uncased_L-2_H-128_A-2')
# Second call should load purely from that cache, but raises instead.
BertTokenizer.from_pretrained('google/bert_uncased_L-2_H-128_A-2', local_files_only=True)

In the Python interpreter, this produces the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/gscratch/cse/julianjm/anaconda3/lib/python3.7/site-packages/transformers-4.0.1-py3.8.egg/transformers/tokenization_utils_base.py", line 1747, in from_pretrained
  File "/gscratch/cse/julianjm/anaconda3/lib/python3.7/site-packages/transformers-4.0.1-py3.8.egg/transformers/file_utils.py", line 1007, in cached_path
  File "/gscratch/cse/julianjm/anaconda3/lib/python3.7/site-packages/transformers-4.0.1-py3.8.egg/transformers/file_utils.py", line 1171, in get_from_cache
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.

Looking more closely, I have isolated the issue to the logic here. In this case, the error is because the cached path for the url https://huggingface.co/google/bert_uncased_L-2_H-128_A-2/resolve/main/added_tokens.json cannot be found in the cache when local_files_only=True. This is because the URL 404s; i.e., the file does not exist.

When local_files_only=False, the GET returns a 404 and the tokenizer init code just ignores the missing file. However, when local_files_only=True and the file is not found, it throws a ValueError instead, which is not caught.
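For illustration, here is roughly how the two paths diverge (a simplified paraphrase of the file_utils.py logic, not the actual source; the cache read/write here is a stand-in for the real cache handling):

import os
import requests

def get_from_cache_sketch(url, cache_path, local_files_only=False):
    # Simplified paraphrase of get_from_cache, for illustration only.
    if local_files_only:
        if not os.path.exists(cache_path):
            # A file that 404s on the Hub and a file that simply hasn't been
            # downloaded yet are indistinguishable here, so this always raises.
            raise ValueError(
                "Cannot find the requested files in the cached path and "
                "outgoing traffic has been disabled."
            )
        return cache_path
    response = requests.get(url)
    # Online, a 404 raises HTTPError here, which the tokenizer-loading code
    # catches and treats as "this optional file does not exist".
    response.raise_for_status()
    with open(cache_path, "wb") as f:  # stand-in for the real cache write
        f.write(response.content)
    return cache_path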

What makes this non-trivial is that without making HTTP requests, there is no way to tell the difference between a file that doesn't exist on the server and a file that exists but hasn't been downloaded yet. It seems to me that there are several potential ways of fixing the issue.

  1. Ensure that all files exist. Don't let people upload incomplete sets of files (and fix the ones which are currently incomplete).
  2. Recover from 404s by caching an "empty" file here. But this only works where there is a meaningful notion of "empty" file, like lists of tokens. I think this would not work for json files or serialized models.
  3. Put a special kind of file in the cache which says "hey, this file isn't supposed to exist", and handle appropriately everywhere files are loaded. Potentially could throw a special error saying the file isn't supposed to exist; HTTP 404s could then be caught and re-thrown as this special error, so, the case could be handled uniformly.
  4. Just log a warning for files that aren't in the cache, and treat them like 404s. Wild west, but at least if the code unexpectedly fails later the user will be able to guess the problem. Easy to implement, but will worsen the UX every time someone tries to use local_files_only without downloading the model first.

Option 3 seems the cleanest to me, while option 4 is what I'm shunting into my transformers egg for now so I can keep working.
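For concreteness, option 3 might look something like the following sketch. Every name here is hypothetical; none of this is an existing transformers API:

import os

SENTINEL_SUFFIX = ".does-not-exist"  # hypothetical marker extension

class EntryDoesNotExistError(Exception):
    # Raised when the cache records that the remote file is known to be absent.
    pass

def record_404(cache_path):
    # Called after a download attempt returns HTTP 404: leave a sentinel so
    # later offline lookups know the file is absent, not just undownloaded.
    open(cache_path + SENTINEL_SUFFIX, "w").close()

def resolve_cached(cache_path, local_files_only):
    if os.path.exists(cache_path + SENTINEL_SUFFIX):
        # Behaves like a 404 even offline, so optional files such as
        # added_tokens.json can be skipped uniformly by callers.
        raise EntryDoesNotExistError(cache_path)
    if os.path.exists(cache_path):
        return cache_path
    if local_files_only:
        raise ValueError("File is not cached and downloads are disabled.")
    return None  # caller falls through to an HTTP request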

Expected behavior

After downloading, I would expect any artifact to be loadable from cache and equivalent to the downloaded one.

@julianmichael (Author)

Actually, all of the files 404 here except vocab.txt: added_tokens.json, special_tokens_map.json, tokenizer_config.json, and tokenizer.json are all missing for this model.

@hlahkar commented Dec 17, 2020

> Actually, all of the files 404 here except vocab.txt: added_tokens.json, special_tokens_map.json, tokenizer_config.json, and tokenizer.json are all missing for this model.

If these files are missing, even BertTokenizer.from_pretrained('google/bert_uncased_L-2_H-128_A-2') should give an error; however, it passes because of the code below. Is there any particular reason this logic was added at the line linked below?

https://github.com/huggingface/transformers/blob/master/src/transformers/file_utils.py#L1232

@julianmichael (Author) commented Dec 17, 2020

@hlahkar Are you sure? The code you linked seems to just check for requests.exceptions.ConnectionError and requests.exceptions.Timeout. I think a 404 will raise a requests.exceptions.HTTPError, which bubbles up from get_from_cache through cached_path to here, where it is caught and ignored.

In fact, my hacky workaround was to replace this line with raise requests.exceptions.HTTPError("404 Client Error"), so the same thing happens when local_files_only=True; now I can load the tokenizer in that case.
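In code, the change amounts to something like this (a sketch of the idea, not the exact diff; the function and argument names are illustrative):

import requests

def resolve_offline(cached_file):
    # Sketch of the patched branch in get_from_cache: when
    # local_files_only=True and the cache lookup came back empty, raise the
    # same exception a real 404 would produce, instead of the uncaught
    # ValueError. The existing handler in tokenization_utils_base already
    # treats HTTPError as "optional file, skip it".
    if cached_file is None:
        raise requests.exceptions.HTTPError("404 Client Error")
    return cached_file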

@hlahkar commented Dec 18, 2020

> @hlahkar Are you sure? The code you linked seems to just check for requests.exceptions.ConnectionError and requests.exceptions.Timeout. I think a 404 will raise a requests.exceptions.HTTPError, which bubbles up from get_from_cache through cached_path to here, where it is caught and ignored.
>
> In fact, my hacky workaround was to replace this line with raise requests.exceptions.HTTPError("404 Client Error"), so the same thing happens when local_files_only=True; now I can load the tokenizer in that case.

My concern is: shouldn't we also go into the error flow whenever we get a 404? Otherwise it might give the user a false sense that everything is working.

@hlahkar commented Dec 18, 2020

In my previous comment, I mentioned the wrong line number. My question is: why is the 404 error ignored in the code segment below?
https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_utils_base.py#L1784

@akutuzov

So, has this problem been solved in any way?
It seems it is now impossible to use most BERT-like models without an Internet connection, even though all the model files are cached.
Transformers tries to get the added_tokens.json file, can't find it, and fails with "ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on."
This is really bothersome on HPC systems, where compute nodes are often offline by design.

@julien-c (Member)

@akutuzov which version of transformers are you on?

I agree that this is a bug that we should solve, cc @LysandreJik @sgugger

@LysandreJik (Member)

Taking a look.

@akutuzov

@julien-c I use Transformers 4.1.1

@LysandreJik (Member)

I aimed to fix that in #9807; feedback appreciated, @julianmichael.

@julianmichael (Author)

The PR looks good as a stopgap; I guess the subsequent check at L1766 will catch the case where the tokenizer hasn't been downloaded yet, since no files should be present. But is this problem necessarily limited to tokenizers? It seems like a general issue that will hold for any cached resource with optional files. It might be cleaner to handle it in the file cache itself, but that's a much bigger issue, I guess.
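If I'm reading the PR right, that check amounts to roughly this (my paraphrase under that assumption; resolved_files is an illustrative name, not the actual variable):

def check_resolved(resolved_files):
    # After trying to resolve every tokenizer file, if *all* of them are
    # missing, the model was never downloaded at all and we should fail
    # loudly; if only optional files (added_tokens.json, ...) are missing,
    # loading proceeds with whatever is available.
    if all(path is None for path in resolved_files.values()):
        raise EnvironmentError(
            "Can't load the tokenizer: none of its files were found in the cache."
        )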

@LysandreJik (Member)

I believe this is only the case for tokenizers. The other two download paths that could possibly be affected are:

  • Configuration downloads -> downloads a single file
  • Model downloads -> downloads the configuration file and the model state dict, both of which are necessary and need to raise an error if missing.

Let me know if you think I'm missing something and I'll see what we can do.

@julianmichael (Author)

Ok, sounds good. No need for unnecessary/premature refactoring then :)
