Mitigate HuggingFace connection errors #616

Merged
merged 4 commits into main on Jul 16, 2022
Conversation

@teetone (Member) commented Jul 14, 2022

Resolves #606

Problem

We infrequently get the following error when downloading from HuggingFace:

ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

Solution

  1. Try to use local files for the tokenizers unless we need to download.
  2. When we do need to download, retry more times (_NUM_RETRIES = 6). A rough sketch of this flow is shown below.
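
Illustrative sketch only, not the exact code in this PR: AutoTokenizer and the plain retry loop below stand in for the codebase's actual load_method and retry helper.

from transformers import AutoTokenizer

_NUM_RETRIES = 6  # number of download attempts before giving up

def load_tokenizer(hf_tokenizer_name: str):
    try:
        # Step 1: load from the local cache only; no network access is attempted.
        return AutoTokenizer.from_pretrained(hf_tokenizer_name, local_files_only=True)
    except OSError:
        # Step 2: nothing usable in the cache, so download, retrying on transient
        # failures (transformers raises ValueError("Connection error, ...") on them).
        for attempt in range(_NUM_RETRIES):
            try:
                return AutoTokenizer.from_pretrained(hf_tokenizer_name, local_files_only=False)
            except (OSError, ValueError):
                if attempt == _NUM_RETRIES - 1:
                    raise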

@teetone requested a review from percyliang on July 14, 2022 10:48
try:
    return load_method(hf_tokenizer_name, local_files_only=True)
except OSError:
    hlog(f"Local files do not exist for HuggingFace tokenizer: {hf_tokenizer_name}. Downloading...")
    return load_method(hf_tokenizer_name, local_files_only=False)
Contributor:

Doesn't this already check whether files are cached and would load them if they exist?

Member Author:

It's the same problem as huggingface/transformers#10901. It still tries to download and requires an internet connection even if the files are cached.

Member Author:

I will check whether upgrading the library version fixes the issue.

Member Author (@teetone, Jul 15, 2022):

I upgraded to the latest version of transformers and can still reproduce the issue:

With my laptop's wifi connection turned off and the tokenizer cached on my local disk:

>>> GPT2TokenizerFast.from_pretrained("gpt2")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1675, in from_pretrained
    resolved_config_file = get_file_from_repo(
  File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/utils/hub.py", line 687, in get_file_from_repo
    resolved_file = cached_path(
  File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/utils/hub.py", line 284, in cached_path
    output_path = get_from_cache(
  File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/utils/hub.py", line 554, in get_from_cache
    raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

local_files_only=True works though:

>>> GPT2TokenizerFast.from_pretrained("gpt2", local_files_only=True)
PreTrainedTokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_len=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'})

@teetone requested a review from percyliang on July 15, 2022 22:29
@percyliang (Contributor) left a comment:

I don't understand - if HF always tries to download, then when does it ever use the cache?

@teetone (Member Author) commented Jul 16, 2022

> I don't understand - if HF always tries to download, then when does it ever use the cache?

I think it's a limitation of the library; I've seen a handful of GitHub issues about this behavior. I don't think the local_files_only flag should even need to exist. I think my fix is the right thing to do, and we should merge it to fix #606, but I can take another look.

@teetone merged commit 749d619 into main on Jul 16, 2022
@teetone deleted the moreretry branch on July 16, 2022 17:37
Successfully merging this pull request may close these issues.

Update retry logic for fetching HuggingFace tokenizer