Partial local tokenizer load #9807

Merged 4 commits on Jan 28, 2021
2 changes: 1 addition & 1 deletion src/transformers/file_utils.py
@@ -1239,7 +1239,7 @@ def get_from_cache(
# the models might've been found if local_files_only=False
# Notify the user about that
if local_files_only:
raise ValueError(
raise FileNotFoundError(
"Cannot find the requested files in the cached path and outgoing traffic has been"
" disabled. To enable model look-ups and downloads online, set 'local_files_only'"
" to False."
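
Not part of the diff: a minimal sketch of why the more specific exception helps, using the cached_path helper that the tokenizer code below calls; the URL and the fallback are illustrative only.

from transformers.file_utils import cached_path

# get_from_cache (reached via cached_path) now raises FileNotFoundError instead of
# ValueError when a file is not in the cache and outgoing traffic is disabled, so an
# offline caller can handle the cache miss specifically rather than catching a broad
# ValueError that could also hide unrelated problems.
try:
    resolved = cached_path(
        "https://huggingface.co/bert-base-uncased/resolve/main/tokenizer_config.json",  # illustrative URL
        local_files_only=True,
    )
except FileNotFoundError:
    resolved = None  # file not cached; the caller decides how to degrade
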
35 changes: 26 additions & 9 deletions src/transformers/tokenization_utils_base.py
@@ -1730,27 +1730,41 @@ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike],

# Get files from url, cache, or disk depending on the case
resolved_vocab_files = {}
unresolved_files = []
for file_id, file_path in vocab_files.items():
if file_path is None:
resolved_vocab_files[file_id] = None
else:
try:
resolved_vocab_files[file_id] = cached_path(
file_path,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
resume_download=resume_download,
local_files_only=local_files_only,
use_auth_token=use_auth_token,
)
try:
resolved_vocab_files[file_id] = cached_path(
file_path,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
resume_download=resume_download,
local_files_only=local_files_only,
use_auth_token=use_auth_token,
)
except FileNotFoundError as error:
if local_files_only:
unresolved_files.append(file_id)
Contributor:

Probably not necessary, but would a:

else:
    raise FileNotFoundError(...)

be helpful here?

Member Author:

While it doesn't change anything right now, as the only way a FileNotFoundError would be caught here is if local_files_only is True:

if local_files_only:
raise FileNotFoundError(
"Cannot find the requested files in the cached path and outgoing traffic has been"
" disabled. To enable model look-ups and downloads online, set 'local_files_only'"
" to False."
)

It would indeed be more future-proof if we add FileNotFoundErrors down the road :) Updating.

Member Author:

See 863e553

else:
raise error

except requests.exceptions.HTTPError as err:
if "404 Client Error" in str(err):
logger.debug(err)
resolved_vocab_files[file_id] = None
else:
raise err

if len(unresolved_files) > 0:
logger.info(
f"Can't load following files from cache: {unresolved_files} and cannot check if these "
"files are necessary for the tokenizer to operate."
)

if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):
msg = (
f"Can't load tokenizer for '{pretrained_model_name_or_path}'. Make sure that:\n\n"
@@ -1760,6 +1774,9 @@ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike],
raise EnvironmentError(msg)

for file_id, file_path in vocab_files.items():
if file_id not in resolved_vocab_files:
continue

if file_path == resolved_vocab_files[file_id]:
logger.info("loading file {}".format(file_path))
else:
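
Not part of the diff: a minimal usage sketch of the behaviour this PR enables, assuming a checkpoint whose tokenizer files are only partially present in the local cache (the checkpoint name is illustrative).

from transformers import AutoTokenizer

# Before this change, a single missing cached file with local_files_only=True
# aborted loading. Now the missing file ids are collected in unresolved_files,
# logged, and the tokenizer is still built from the files that resolved locally.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", local_files_only=True)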