[CI] [GHA] HuggingFace cache #28481

Merged on Feb 17, 2025 (32 commits)

@akashchi akashchi commented Jan 16, 2025

A new HuggingFace share has been added to the self-hosted runners at the path mount/caches/huggingface. This PR makes the workflows that use HF data use that share.

Tickets:

  • 159012
  • 159010

@akashchi akashchi added the WIP work in progress label Jan 16, 2025
@akashchi akashchi added this to the 2025.1 milestone Jan 16, 2025
@github-actions github-actions bot added category: CI OpenVINO public CI github_actions Pull requests that update GitHub Actions code labels Jan 16, 2025
@github-actions github-actions bot added category: TF FE OpenVINO TensorFlow FrontEnd category: PyTorch FE OpenVINO PyTorch Frontend category: JAX FE OpenVINO JAX FrontEnd labels Jan 27, 2025
@rkazants rkazants self-requested a review February 5, 2025 13:01
Comment on lines 27 to 33
  os.environ['HF_HUB_CACHE'] = hf_cache_dir

  no_clean_cache_dir = False
- hf_hub_cache_dir = tempfile.gettempdir()
+ hf_hub_cache_dir = hf_cache_dir
  if os.environ.get('USE_SYSTEM_CACHE', 'True') == 'False':
      no_clean_cache_dir = True
      os.environ['HUGGINGFACE_HUB_CACHE'] = hf_hub_cache_dir

@akashchi akashchi Feb 5, 2025

I was not sure how this was supposed to work in the first place:

  • There are two env variables for the HF cache: HF_HUB_CACHE and HUGGINGFACE_HUB_CACHE. The latter is deprecated but may be needed for backwards compatibility.
  • HF_HUB_CACHE is taken from the environment; if it is not present, a temp directory is used instead.
  • HUGGINGFACE_HUB_CACHE was always set to a created temporary directory, without even looking for it in the env. What if we want to use a remote cache, as in CI?
  • Cleanup is controlled by another env variable, USE_SYSTEM_CACHE, but only for the deprecated HUGGINGFACE_HUB_CACHE.

With the changes in this PR, I made HF_HUB_CACHE the single source of truth, but I think it could and should be simplified further. Is HUGGINGFACE_HUB_CACHE even needed? I think it could be done like this:

  • Get only HF_HUB_CACHE from the env:
    • If present, just use the value.
    • If not present, set it to the temp directory.
  • Drop HUGGINGFACE_HUB_CACHE, or set it to HF_HUB_CACHE.
  • Rename USE_SYSTEM_CACHE to something like CLEAN_HF_CACHE/KEEP_HF_CACHE/...
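
The simplification proposed in the list above could look roughly like the following. This is a minimal sketch, not the code merged in this PR: the function name `resolve_hf_cache` is hypothetical, `KEEP_HF_CACHE` is only one of the candidate names floated above, and the environment is passed in as a mapping so the logic can be exercised without touching the real process environment.

```python
import tempfile


def resolve_hf_cache(env):
    """Return (cache_dir, keep_cache), treating HF_HUB_CACHE as the single source of truth.

    `env` is any mutable mapping shaped like os.environ.
    """
    # Use HF_HUB_CACHE when it is set and non-empty; otherwise fall back to the temp directory.
    cache_dir = env.get('HF_HUB_CACHE') or tempfile.gettempdir()
    env['HF_HUB_CACHE'] = cache_dir

    # Mirror the value into the deprecated variable instead of computing it separately.
    env['HUGGINGFACE_HUB_CACHE'] = cache_dir

    # KEEP_HF_CACHE is a hypothetical replacement for USE_SYSTEM_CACHE:
    # cleanup is skipped only when it is explicitly set to 'True'.
    keep_cache = env.get('KEEP_HF_CACHE', 'False') == 'True'
    return cache_dir, keep_cache
```

A caller would pass `os.environ`; in CI the workflow would already have exported `HF_HUB_CACHE`, so the share path is used as-is, while a local run falls back to the temp directory.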

Contributor

We have two variables by mistake. The idea behind HUGGINGFACE_HUB_CACHE was to allow the system default when tests are run locally, which is why the code checks the USE_SYSTEM_CACHE variable before setting it. This doesn't work right now anyway, so you can drop HUGGINGFACE_HUB_CACHE.

@akashchi akashchi requested a review from mvafin February 11, 2025 14:18
@akashchi akashchi requested a review from mvafin February 12, 2025 09:36

- name: Setup HuggingFace Cache Directory (Windows)
  if: runner.os == 'Windows'
  run: Add-Content -Path $env:GITHUB_ENV -Value "HF_HUB_CACHE=C:\\mount\\caches\\huggingface"

Contributor

Can we define it in the common job env and get the value from job arguments?

Contributor Author

We do not do that for any other such variable. For example, we define the pip cache directory in the reusable jobs, so I think it can stay as it is now, defined where the other variables are set.

Contributor

I'm concerned about the following case: this job can run on different runners, and in theory the mount point could differ between them. Inside the job we don't know the runner's environment, which is why I think it is better to set it in the workflows. In addition, we would be able to set it once inside the job.
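
One way to read the suggestion in this thread: the workflow, which knows its runner, supplies the mount root, and a single step derives `HF_HUB_CACHE` from it and appends it to the file that GitHub Actions exposes as `GITHUB_ENV`. The Python sketch below illustrates that idea only. The helper names and the idea of passing the mount root as an argument are assumptions for illustration; the PR itself sets the variable with per-OS steps such as the `Add-Content` one quoted earlier.

```python
import ntpath
import posixpath


def hf_cache_path(mount_root, windows=False):
    """Build the HuggingFace cache path under a runner-specific mount root."""
    # Pick the path flavour explicitly so the result does not depend on the host OS.
    join = ntpath.join if windows else posixpath.join
    return join(mount_root, 'caches', 'huggingface')


def export_hf_cache(github_env_file, mount_root, windows=False):
    """Append HF_HUB_CACHE=<path> to the file named by GITHUB_ENV and return the path."""
    path = hf_cache_path(mount_root, windows)
    with open(github_env_file, 'a', encoding='utf-8') as f:
        f.write(f'HF_HUB_CACHE={path}\n')
    return path
```

With this shape, a runner that mounts the share elsewhere only needs a different `mount_root` value from the calling workflow; the reusable job stays unchanged.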

@mryzhov mryzhov left a comment

Comment on lines 26 to 27
os.environ['TFHUB_CACHE_DIR'] = tf_hub_cache_dir
os.environ['HF_HUB_CACHE'] = hf_cache_dir

@rkazants rkazants Feb 12, 2025

Are these two environment variables re-initialized from outside (from GHA)?
I think there should be some connection from CI

@akashchi akashchi requested a review from rkazants February 12, 2025 17:25
@rkazants rkazants left a comment

I see that TFHUB_CACHE_DIR is not initialized from GHA job

@akashchi

> I see that TFHUB_CACHE_DIR is not initialized from GHA job

It is:

echo "TFHUB_CACHE_DIR=/mount/testdata$((GITHUB_RUN_NUMBER % NUMBER_OF_REPLICAS))/tfhub_models" >> "$GITHUB_ENV"

but this PR is concerned only with the HF hub cache for which the share was created.

If TFHUB_CACHE_DIR is needed in other workflows/jobs, I suggest addressing it in separate PRs/tickets.
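
For readers unfamiliar with the shell arithmetic in the `echo` line above: it selects one of several numbered test-data mounts by taking the run number modulo the replica count. Below is a small Python sketch of the same computation. The function name and the example values are made up for illustration, and the real `NUMBER_OF_REPLICAS` value is not stated in this conversation.

```python
def tfhub_cache_dir(run_number, number_of_replicas):
    """Mirror /mount/testdata$((GITHUB_RUN_NUMBER % NUMBER_OF_REPLICAS))/tfhub_models."""
    # Consecutive runs rotate through replicas 0 .. number_of_replicas - 1.
    replica = run_number % number_of_replicas
    return f'/mount/testdata{replica}/tfhub_models'


# Hypothetical values: run 1234 with 2 replicas maps to replica 0.
print(tfhub_cache_dir(1234, 2))  # /mount/testdata0/tfhub_models
```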

@akashchi akashchi requested a review from rkazants February 13, 2025 09:29
@akashchi akashchi enabled auto-merge February 17, 2025 09:15
@akashchi akashchi added this pull request to the merge queue Feb 17, 2025
Merged via the queue into openvinotoolkit:master with commit d92782c Feb 17, 2025
189 checks passed
@akashchi akashchi deleted the ci/gha/hf-cache-test branch February 17, 2025 14:24