
BertTokenizer.from_pretrained fails for local_files_only=True when added_tokens.json is missing #9147

Closed
julianmichael opened this issue Dec 16, 2020 · 13 comments · Fixed by #9807

@julianmichael commented Dec 16, 2020

Environment info

  • transformers version: 4.0.1
  • Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-centos-7.6.1810-Core
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.7.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@mfuntowicz

Information

Model I am using (Bert, XLNet ...): google/bert_uncased_L-2_H-128_A-2

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Run the following:

from transformers import BertTokenizer

# First call downloads the tokenizer files and populates the local cache.
BertTokenizer.from_pretrained('google/bert_uncased_L-2_H-128_A-2')
# Second call should load purely from that cache, but raises instead.
BertTokenizer.from_pretrained('google/bert_uncased_L-2_H-128_A-2', local_files_only=True)

In the Python interpreter, this produces the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/gscratch/cse/julianjm/anaconda3/lib/python3.7/site-packages/transformers-4.0.1-py3.8.egg/transformers/tokenization_utils_base.py", line 1747, in from_pretrained
  File "/gscratch/cse/julianjm/anaconda3/lib/python3.7/site-packages/transformers-4.0.1-py3.8.egg/transformers/file_utils.py", line 1007, in cached_path
  File "/gscratch/cse/julianjm/anaconda3/lib/python3.7/site-packages/transformers-4.0.1-py3.8.egg/transformers/file_utils.py", line 1171, in get_from_cache
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.

Looking more closely, I have isolated the issue to the logic here. In this case, the error is because the cached path for the url https://huggingface.co/google/bert_uncased_L-2_H-128_A-2/resolve/main/added_tokens.json cannot be found in the cache when local_files_only=True. This is because the URL 404s; i.e., the file does not exist.

When local_files_only=False, the GET returns a 404 and the tokenizer init code just ignores the missing file. However, when local_files_only=True and the file is not found, it throws a ValueError instead, which is not caught.
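For illustration, here is roughly how the two paths diverge (a simplified paraphrase of the file_utils.py logic, not the actual source; the cache read/write here is a stand-in for the real cache handling):

import os
import requests

def get_from_cache_sketch(url, cache_path, local_files_only=False):
    # Simplified paraphrase of get_from_cache, for illustration only.
    if local_files_only:
        if not os.path.exists(cache_path):
            # A file that 404s on the Hub and a file that simply hasn't been
            # downloaded yet are indistinguishable here, so this always raises.
            raise ValueError(
                "Cannot find the requested files in the cached path and "
                "outgoing traffic has been disabled."
            )
        return cache_path
    response = requests.get(url)
    # Online, a 404 raises HTTPError here, which the tokenizer-loading code
    # catches and treats as "this optional file does not exist".
    response.raise_for_status()
    with open(cache_path, "wb") as f:  # stand-in for the real cache write
        f.write(response.content)
    return cache_path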

What makes this non-trivial is that without making HTTP requests, there is no way to tell the difference between a file that doesn't exist on the server and a file that exists but hasn't been downloaded yet. It seems to me that there are several potential ways of fixing the issue.

  1. Ensure that all files exist. Don't let people upload incomplete sets of files (and fix the ones which are currently incomplete).
  2. Recover from 404s by caching an "empty" file here. But this only works where there is a meaningful notion of "empty" file, like lists of tokens. I think this would not work for json files or serialized models.
  3. Put a special kind of file in the cache which says "hey, this file isn't supposed to exist", and handle appropriately everywhere files are loaded. Potentially could throw a special error saying the file isn't supposed to exist; HTTP 404s could then be caught and re-thrown as this special error, so, the case could be handled uniformly.
  4. Just log a warning for files that aren't in the cache, and treat them like 404s. Wild west, but at least if the code unexpectedly fails later the user will be able to guess the problem. Easy to implement, but will worsen the UX every time someone tries to use local_files_only without downloading the model first.

Option 3 seems the cleanest to me, while option 4 is what I'm shunting into my transformers egg for now so I can keep working.
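For concreteness, option 3 might look something like the following sketch. Every name here is hypothetical; none of this is an existing transformers API:

import os

SENTINEL_SUFFIX = ".does-not-exist"  # hypothetical marker extension

class EntryDoesNotExistError(Exception):
    # Raised when the cache records that the remote file is known to be absent.
    pass

def record_404(cache_path):
    # Called after a download attempt returns HTTP 404: leave a sentinel so
    # later offline lookups know the file is absent, not just undownloaded.
    open(cache_path + SENTINEL_SUFFIX, "w").close()

def resolve_cached(cache_path, local_files_only):
    if os.path.exists(cache_path + SENTINEL_SUFFIX):
        # Behaves like a 404 even offline, so optional files such as
        # added_tokens.json can be skipped uniformly by callers.
        raise EntryDoesNotExistError(cache_path)
    if os.path.exists(cache_path):
        return cache_path
    if local_files_only:
        raise ValueError("File is not cached and downloads are disabled.")
    return None  # caller falls through to an HTTP request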

Expected behavior

After downloading, I would expect any artifact to be loadable from cache and equivalent to the downloaded one.

@julianmichael (Author)

Actually, all of the files 404 here except vocab.txt: added_tokens.json, special_tokens_map.json, tokenizer_config.json, and tokenizer.json are all missing for this model.

@hlahkar commented Dec 17, 2020

> Actually, all of the files 404 here except vocab.txt: added_tokens.json, special_tokens_map.json, tokenizer_config.json, and tokenizer.json are all missing for this model.

If these files are missing, even BertTokenizer.from_pretrained('google/bert_uncased_L-2_H-128_A-2') should give an error; however, it passes because of the code below. Is there any particular reason this logic was added at the line linked below?

https://github.com/huggingface/transformers/blob/master/src/transformers/file_utils.py#L1232

@julianmichael (Author) commented Dec 17, 2020

@hlahkar Are you sure? The code you linked seems to just check for requests.exceptions.ConnectionError and requests.exceptions.Timeout. I think a 404 will raise a requests.exceptions.HTTPError, which bubbles up from get_from_cache through cached_path to here, where it is caught and ignored.

In fact, my hacky workaround was to replace this line with raise requests.exceptions.HTTPError("404 Client Error"), so the same thing happens when local_files_only=True; now I can load the tokenizer in that case.
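In code, the change amounts to something like this (a sketch of the idea, not the exact diff; the function and argument names are illustrative):

import requests

def resolve_offline(cached_file):
    # Sketch of the patched branch in get_from_cache: when
    # local_files_only=True and the cache lookup came back empty, raise the
    # same exception a real 404 would produce, instead of the uncaught
    # ValueError. The existing handler in tokenization_utils_base already
    # treats HTTPError as "optional file, skip it".
    if cached_file is None:
        raise requests.exceptions.HTTPError("404 Client Error")
    return cached_file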

@hlahkar commented Dec 18, 2020

> @hlahkar Are you sure? The code you linked seems to just check for requests.exceptions.ConnectionError and requests.exceptions.Timeout. I think a 404 will raise a requests.exceptions.HTTPError, which bubbles up from get_from_cache through cached_path to here, where it is caught and ignored.
>
> In fact, my hacky workaround was to replace this line with raise requests.exceptions.HTTPError("404 Client Error"), so the same thing happens when local_files_only=True; now I can load the tokenizer in that case.

My concern is: shouldn't we also go into the error flow whenever we get a 404? Otherwise it might give the user a false sense that everything is working.

@hlahkar commented Dec 18, 2020

In my previous comment, I mentioned the wrong line number. My question is: why is the 404 error ignored in the code segment below?
https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_utils_base.py#L1784

@akutuzov

So, has this problem been solved in any way?
It seems it is now impossible to use most BERT-like models without an Internet connection, even though all the model files are cached.
Transformers tries to get the added_tokens.json file, can't find it, and fails with "ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on."
This is really bothersome on HPC systems, where compute nodes are often offline by design.

@julien-c (Member)

@akutuzov which version of transformers are you on?

I agree that this is a bug that we should solve, cc @LysandreJik @sgugger

@LysandreJik (Member)

Taking a look.

@akutuzov

@julien-c I use Transformers 4.1.1

@LysandreJik (Member)

I aimed to fix that in #9807; feedback appreciated, @julianmichael.

@julianmichael (Author)

The PR looks good as a stopgap; I guess the subsequent check at L1766 will catch the case where the tokenizer hasn't been downloaded yet, since no files should be present. But is this problem necessarily limited to tokenizers? It seems like a general issue that will hold for any cached resource with optional files. It might be cleaner to handle it in the file cache itself, but that's a much bigger issue, I guess.
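If I'm reading the PR right, that check amounts to roughly this (my paraphrase under that assumption; resolved_files is an illustrative name, not the actual variable):

def check_resolved(resolved_files):
    # After trying to resolve every tokenizer file, if *all* of them are
    # missing, the model was never downloaded at all and we should fail
    # loudly; if only optional files (added_tokens.json, ...) are missing,
    # loading proceeds with whatever is available.
    if all(path is None for path in resolved_files.values()):
        raise EnvironmentError(
            "Can't load the tokenizer: none of its files were found in the cache."
        )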

@LysandreJik (Member)

I believe this is only the case for tokenizers. The other two download paths that could possibly be affected are:

  • Configuration downloads -> downloads a single file
  • Model downloads -> downloads the configuration file and the model state dict, both of which are necessary and need to raise an error if missing.

Let me know if you think I'm missing something and I'll see what we can do.

@julianmichael (Author)

Ok, sounds good. No need for unnecessary/premature refactoring then :)
