-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI] [GHA] HuggingFace cache #28481
[CI] [GHA] HuggingFace cache #28481
Conversation
…nto ci/gha/hf-cache-test
os.environ['HF_HUB_CACHE'] = hf_cache_dir | ||
|
||
no_clean_cache_dir = False | ||
hf_hub_cache_dir = tempfile.gettempdir() | ||
hf_hub_cache_dir = hf_cache_dir | ||
if os.environ.get('USE_SYSTEM_CACHE', 'True') == 'False': | ||
no_clean_cache_dir = True | ||
os.environ['HUGGINGFACE_HUB_CACHE'] = hf_hub_cache_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not sure how it was supposed to work in the first place:
- There are two env variables for HF cache:
HF_HUB_CACHE
&HUGGINGFACE_HUB_CACHE
, the latter is deprecated but maybe needed for backwards compatibility HF_HUB_CACHE
is taken from the environment and if not present -> a temp directory is used insteadHUGGINGFACE_HUB_CACHE
was always set to a created temporary directory, w/o even looking for it in the env. what if we want to use a remote cache like in CI?- The cleanup is controlled by another env variable
USE_SYSTEM_CACHE
but only for a deprecatedHUGGINGFACE_HUB_CACHE
Via the changes in this PR, I set HF_HUB_CACHE
as a single source of truth but I think it could and should be simplified further. Is HUGGINGFACE_HUB_CACHE
even needed? I think it could be done like:
- Get only
HF_HUB_CACHE
from the env:- if present, just use the value
- If not present -> set it to the temp directory
- Drop
HUGGINGFACE_HUB_CACHE
/ set it toHF_HUB_CACHE
- Rename
USE_SYSTEM_CACHE
into something likeCLEAN_HF_CACHE
/KEEP_HF_CACHE
/...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have two variables by mistake. The idea in HUGGINGFACE_HUB_CACHE
was to allow the system default if we run tests locally, that is why it checks USE_SYSTEM_CACHE
variable before setting it. And this doesn't work right now anyway, so you can drop HUGGINGFACE_HUB_CACHE
|
||
- name: Setup HuggingFace Cache Directory (Windows) | ||
if: runner.os == 'Windows' | ||
run: Add-Content -Path $env:GITHUB_ENV -Value "HF_HUB_CACHE=C:\\mount\\caches\\huggingface" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we define it in the common job env and get the value from job arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not do that for any other such variable, e.g., we define the pip cache directory in the reusable jobs so I think it could be done like it is now, when setting the variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerning about the following case: this job can be run on different runners and in theory the mount point could be different, inside the job we don't know the runners environment, that is why I think it is better to set it in the workflows. In addition to that we will be able to set it once inside the job
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, we need to extend automatic cleanup for this new share - https://github.com/mryzhov/openvino/blob/gha/coverity_docker/.github/workflows/cleanup_caches.yml#L1
os.environ['TFHUB_CACHE_DIR'] = tf_hub_cache_dir | ||
os.environ['HF_HUB_CACHE'] = hf_cache_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these two environment variables re-initialized from outside (from GHA)?
I think there should be some connection from CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they are set on a per-job basis like in: https://github.com/openvinotoolkit/openvino/pull/28481/files#diff-059da6ae752137afeded3262287ede715ddfb1659b5d31489f723c9f971b008bR71 and https://github.com/openvinotoolkit/openvino/pull/28481/files#diff-58b495d99e8e2631e772171d5dfb9d6c331d35b8f4b4394a78cf053767bda273R63.
These two lines are only for the case if the env is empty and they are populated with the tmp directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that TFHUB_CACHE_DIR is not initialized from GHA job
It is:
but this PR is concerned only with the HF hub cache for which the share was created. If |
The new HuggingFace share was added to the self-hosted runners with the path being
mount/caches/huggingface
. This PR adds this share to workflows that use HF data.Tickets: