Mitigate HuggingFace connection errors #616
Conversation
    return load_method(hf_tokenizer_name, local_files_only=True)
except OSError:
    hlog(f"Local files do not exist for HuggingFace tokenizer: {hf_tokenizer_name}. Downloading...")
    return load_method(hf_tokenizer_name, local_files_only=False)
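A self-contained sketch of the cached-first fallback pattern the diff implements. The wrapper name below is hypothetical, and `load_method` stands in for something like `AutoTokenizer.from_pretrained`; the real code also logs via `hlog`:

```python
def load_with_local_fallback(load_method, name: str):
    """Try the local HuggingFace cache first; only hit the network if that fails.

    Hypothetical helper illustrating the diff above; `load_method` is any
    from_pretrained-style callable accepting `local_files_only`.
    """
    try:
        # local_files_only=True never opens a network connection,
        # so this succeeds offline whenever the files are already cached.
        return load_method(name, local_files_only=True)
    except OSError:
        # Not in the cache yet: allow the (network-dependent) download path.
        print(f"Local files do not exist for HuggingFace tokenizer: {name}. Downloading...")
        return load_method(name, local_files_only=False)
```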
Doesn't this already check whether files are cached and would load them if they exist?
It's the same problem as huggingface/transformers#10901. It still tries to download and requires an internet connection even if the files are cached.
I will try to see if upgrading the library version fixes the issue.
I upgraded to the latest version of transformers and can still reproduce the issue:
With my laptop's wifi connection turned off and the tokenizer cached on my local disk:
>>> GPT2TokenizerFast.from_pretrained("gpt2")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1675, in from_pretrained
resolved_config_file = get_file_from_repo(
File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/utils/hub.py", line 687, in get_file_from_repo
resolved_file = cached_path(
File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/utils/hub.py", line 284, in cached_path
output_path = get_from_cache(
File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/utils/hub.py", line 554, in get_from_cache
raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
local_files_only=True works, though:
>>> GPT2TokenizerFast.from_pretrained("gpt2", local_files_only=True)
PreTrainedTokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_len=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'})
I don't understand - if HF always tries to download, then when does it ever use the cache?
I think it's a limitation of the library. I've seen a handful of GitHub issues about this. I don't think the
Resolves #606
Problem
We intermittently get the following error when downloading from HuggingFace:
Solution
_NUM_RETRIES = 6
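A minimal sketch of the retry mitigation, assuming the PR retries the HuggingFace request on connection errors with backoff; `_NUM_RETRIES = 6` comes from the source, while the helper name, the backoff schedule, and catching `OSError` are assumptions for illustration:

```python
import time

_NUM_RETRIES = 6  # number of attempts before giving up (from the PR)


def with_retries(operation, num_retries: int = _NUM_RETRIES, base_delay: float = 1.0):
    """Retry `operation` on connection errors with exponential backoff.

    Hypothetical helper sketching the mitigation; the real implementation
    in the PR may differ in structure and in which exceptions it catches.
    """
    for attempt in range(num_retries):
        try:
            return operation()
        except OSError as e:
            if attempt == num_retries - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay}s")
            time.sleep(delay)
```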