Error with detecting cached files when running without Internet connection (related to #10067) #10901

Closed
aosokin opened this issue Mar 25, 2021 · 3 comments

aosokin commented Mar 25, 2021

Environment info

  • transformers version: 4.5.0.dev0
  • Platform: Linux-3.10.0-957.5.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.8.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@LysandreJik (related to #10235 and #10067)

Information

I'm trying to run

from transformers import BertTokenizer
BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")

from an environment without Internet access. It crashes even though I have all files downloaded and cached. The uncaught exception:

raise ValueError(
    "Connection error, and we cannot find the requested files in the cached path."
    " Please try again or make sure your Internet connection is on."
)

When file_id == 'added_tokens_file', file_path equals https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/added_tokens.json, which does not exist (see the loop below).

for file_id, file_path in vocab_files.items():

As a result, the line
r = requests.head(url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout)
throws ConnectTimeout, which is caught in
except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):

and then ignored until another exception is raised, which is not caught anywhere.

When fetching the same file with Internet access, the code behaves differently: the line

r.raise_for_status()
throws requests.exceptions.HTTPError, which is caught and processed here:
except requests.exceptions.HTTPError as err:
    if "404 Client Error" in str(err):
        logger.debug(err)
        resolved_vocab_files[file_id] = None

The rest of the code works just fine after resolved_vocab_files[file_id] = None
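
To make the difference between the two paths concrete, here is a minimal toy model of the behavior described above. It is only a sketch: HTTPError and ConnectTimeout are local stand-ins for the requests exceptions, fake_head and resolve are hypothetical names, and the real logic is spread over several functions in transformers rather than one.

```python
# Local stand-ins for the requests exceptions named in this issue
# (hypothetical classes, defined here so the sketch runs without requests).
class HTTPError(Exception):
    pass


class ConnectTimeout(Exception):
    pass


def fake_head(online):
    """Simulate the HEAD request for a file that does not exist on the hub."""
    if online:
        raise HTTPError("404 Client Error: Not Found")
    raise ConnectTimeout("connection timed out")


def resolve(online, cached):
    """Return a path, or None when the server confirms the file is absent."""
    try:
        fake_head(online)
    except HTTPError as err:
        if "404 Client Error" in str(err):
            return None  # online: the optional file is reported missing
        raise
    except ConnectTimeout:
        pass  # offline: the timeout is swallowed, so nothing is known yet
    if cached:
        return "/cache/added_tokens.json"  # hypothetical cache location
    # offline and not cached: this is the error that escapes to the caller
    raise ValueError(
        "Connection error, and we cannot find the requested files in the cached path."
    )
```

In this model, the online case returns None for the missing optional file, while the offline case cannot tell "missing on the server" apart from "not cached" and ends in the ValueError.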

Using BertTokenizer.from_pretrained(bert_version, local_files_only=True) works just fine because of this condition:

except FileNotFoundError as error:
    if local_files_only:
        unresolved_files.append(file_id)
    else:
        raise error

The current workaround is to use BertTokenizer.from_pretrained(bert_version, local_files_only=True), but this prevents using the same code with and without Internet access.
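
One possible way to keep a single code path is to retry with local_files_only=True only after the normal call fails. The helper below is a sketch, not part of transformers: load_with_offline_fallback is a hypothetical name, and it assumes the offline failure surfaces as the ValueError shown earlier in this issue.

```python
def load_with_offline_fallback(loader, name, **kwargs):
    """Call loader(name, **kwargs); on ValueError, retry from the local cache.

    `loader` is any callable that accepts a `local_files_only` keyword,
    for example BertTokenizer.from_pretrained.
    """
    try:
        return loader(name, **kwargs)
    except ValueError:
        # Assumption: the ValueError is the "Connection error" reported in
        # this issue. Other ValueErrors would also trigger the retry.
        retry_kwargs = {**kwargs, "local_files_only": True}
        return loader(name, **retry_kwargs)
```

Under those assumptions it would be called as load_with_offline_fallback(BertTokenizer.from_pretrained, "bert-large-uncased-whole-word-masking"). Catching ValueError is broad, so checking the exception message before retrying may be safer.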

To reproduce

Steps to reproduce the behavior:

Run

from transformers import BertTokenizer
BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")

from an environment without Internet access but with all the required cache files pre-downloaded.

Expected behavior

Works exactly as

from transformers import BertTokenizer
BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking", local_files_only=True)

LysandreJik (Member) commented

Related issue: #9147, with proposed fix in #9807

aosokin commented Mar 25, 2021

Why do we need this condition?

if local_files_only:
    unresolved_files.append(file_id)
else:
    raise error

It was introduced in 863e553. Is it needed in any other scenario?
Would it be better to call unresolved_files.append(file_id) unconditionally?
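
To illustrate the question, here is a toy comparison of the two variants. It models only the except FileNotFoundError branch; resolve_all, fetch_from_cache and the always_collect flag are hypothetical names. It does not show that the change would fix this issue, because the error reported above without local_files_only is a ValueError, not a FileNotFoundError.

```python
def fetch_from_cache(path):
    """Hypothetical fetcher: every file is cached except added_tokens.json."""
    if path.endswith("added_tokens.json"):
        raise FileNotFoundError(path)
    return "/cache/" + path.rsplit("/", 1)[-1]


def resolve_all(vocab_files, fetch, local_files_only=False, always_collect=False):
    """Resolve each file; collect or re-raise on FileNotFoundError."""
    resolved, unresolved = {}, []
    for file_id, file_path in vocab_files.items():
        try:
            resolved[file_id] = fetch(file_path)
        except FileNotFoundError:
            # always_collect=True models the proposed unconditional append
            if local_files_only or always_collect:
                unresolved.append(file_id)
            else:
                raise
    return resolved, unresolved


files = {
    "vocab_file": "https://example.invalid/vocab.txt",
    "added_tokens_file": "https://example.invalid/added_tokens.json",
}
```

With always_collect=True, the toy loop records added_tokens_file as unresolved instead of raising, which matches what local_files_only=True does in the current condition.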

github-actions commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
