Mitigate HuggingFace connection errors #616

Merged
merged 4 commits into main on Jul 16, 2022
Conversation

@teetone (Member) commented Jul 14, 2022

Resolves #606

Problem

We infrequently get the following error when downloading from HuggingFace:

ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

Solution

  1. Try to use local files for the tokenizers unless we need to download.
  2. When we do need to download, retry more times (_NUM_RETRIES = 6). A rough sketch of this flow is shown below.
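
Illustrative sketch only, not the exact code in this PR: AutoTokenizer and the plain retry loop below stand in for the codebase's actual load_method and retry helper.

from transformers import AutoTokenizer

_NUM_RETRIES = 6  # number of download attempts before giving up

def load_tokenizer(hf_tokenizer_name: str):
    try:
        # Step 1: load from the local cache only; no network access is attempted.
        return AutoTokenizer.from_pretrained(hf_tokenizer_name, local_files_only=True)
    except OSError:
        # Step 2: nothing usable in the cache, so download, retrying on transient
        # failures (transformers raises ValueError("Connection error, ...") on them).
        for attempt in range(_NUM_RETRIES):
            try:
                return AutoTokenizer.from_pretrained(hf_tokenizer_name, local_files_only=False)
            except (OSError, ValueError):
                if attempt == _NUM_RETRIES - 1:
                    raise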

@teetone requested a review from percyliang on July 14, 2022 10:48
try:
    return load_method(hf_tokenizer_name, local_files_only=True)
except OSError:
    hlog(f"Local files do not exist for HuggingFace tokenizer: {hf_tokenizer_name}. Downloading...")
    return load_method(hf_tokenizer_name, local_files_only=False)
Contributor:

Doesn't this already check whether files are cached and would load them if they exist?

Member Author:

It's the same problem as huggingface/transformers#10901. It still tries to download and requires an internet connection even if the files are cached.

Member Author:

I will check whether upgrading the library version fixes the issue.

Member Author (@teetone, Jul 15, 2022):

I upgraded to the latest version of transformers and can still reproduce the issue:

With my laptop's wifi connection turned off and the tokenizer cached on my local disk:

>>> GPT2TokenizerFast.from_pretrained("gpt2")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1675, in from_pretrained
    resolved_config_file = get_file_from_repo(
  File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/utils/hub.py", line 687, in get_file_from_repo
    resolved_file = cached_path(
  File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/utils/hub.py", line 284, in cached_path
    output_path = get_from_cache(
  File "/Users/tonyhlee/research/mercury/benchmarking/venv/lib/python3.9/site-packages/transformers/utils/hub.py", line 554, in get_from_cache
    raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

local_files_only=True works though:

>>> GPT2TokenizerFast.from_pretrained("gpt2", local_files_only=True)
PreTrainedTokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_len=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'})

@teetone requested a review from percyliang on July 15, 2022 22:29
@percyliang (Contributor) left a comment:

I don't understand - if HF always tries to download, then when does it ever use the cache?

@teetone (Member Author) commented Jul 16, 2022

> I don't understand - if HF always tries to download, then when does it ever use the cache?

I think it's a limitation of the library; I've seen a handful of GitHub issues about this behavior. I don't think the local_files_only flag should even need to exist. I think my fix is the right thing to do, and we should merge it to fix #606, but I can take another look.

@teetone merged commit 749d619 into main on Jul 16, 2022
@teetone deleted the moreretry branch on July 16, 2022 17:37
Successfully merging this pull request may close these issues.

Update retry logic for fetching HuggingFace tokenizer