Add cache for 🤗 Hub models in the CI #312
Conversation
Hello! This is looking great. I'm looking at the logs now. On CI run 1 (no caches in place), I observe:
The total job duration was 11m 34s, running for a combined 1h 44m 57s. I'm rerunning the jobs now. On CI run 2 (caches presumably in place), let's observe:
The vast majority of jobs took less time than before. In that link, the bottom of each pair of equivalently named jobs is the newer one, with the cache. The only exceptions are some Windows jobs that inexplicably take a really long time to restore a relatively small cache. Others online have reported similar behaviour too. I'm quite satisfied with these results, but I do see two points of improvement:
Alternatively, it may be prudent to recognize the slowness of all Windows jobs with caching and completely disable all Windows caches. I'd like to hear your thoughts on this, too.
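For reference, disabling the cache on Windows could be as simple as gating the cache step on the runner OS. A minimal sketch only; the step name, paths, and key are illustrative and not necessarily the exact workflow in this PR:

```yaml
# Sketch: skip the model cache entirely on Windows runners,
# where restoring even a small cache has been observed to be slow.
- name: Cache 🤗 Hub models (non-Windows only)
  if: runner.os != 'Windows'
  uses: actions/cache@v3
  with:
    path: |
      ~/.cache/huggingface
      ~/.cache/torch
    key: hf-models-${{ matrix.os }}
```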
Hi Tom, thank you very much for your reply and the very detailed analysis! 🙂 I didn't notice that caches on Windows were slower. I split the cache restore and save steps because I thought that the cache might change during the unit tests and therefore has to be refreshed: e.g., when a new model is added to the tests or a model used in the tests is updated on the Hub. What if we change it to something like the below?

```yaml
- name: Run unit tests
  shell: bash
  run: |
    echo "OLD_HF_CACHE_HASH=$(find ~/.cache/huggingface ~/.cache/torch -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1)" >> $GITHUB_ENV
    pytest -sv tests/
    echo "NEW_HF_CACHE_HASH=$(find ~/.cache/huggingface ~/.cache/torch -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1)" >> $GITHUB_ENV

- name: Save new HF models to cache
  uses: actions/cache/save@v3
  with:
    path: |
      ~/.cache/huggingface
      ~/.cache/torch
    key: hf-models-${{ matrix.os }}-${{ env.NEW_HF_CACHE_HASH }}
  # Only save the cache if the hash has changed
  if: ${{ env.NEW_HF_CACHE_HASH != env.OLD_HF_CACHE_HASH }}
```

I agree that, if this doesn't solve the slowness on Windows, we can enable the cache just for Ubuntu jobs.
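For completeness, the restore side of that split would look roughly like the sketch below (not necessarily the exact step in the workflow); `restore-keys` lets a job reuse the most recent cache for its OS even though the exact hash key is only known after the tests have run:

```yaml
- name: Restore HF models from cache
  uses: actions/cache/restore@v3
  with:
    path: |
      ~/.cache/huggingface
      ~/.cache/torch
    # The exact key usually misses (the new hash is computed later),
    # so fall back to the newest cache whose key starts with this prefix.
    key: hf-models-${{ matrix.os }}-
    restore-keys: |
      hf-models-${{ matrix.os }}-
```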
You're very right; I had not thought about that thoroughly enough. Your proposal looks like a very solid solution.
For now I just pushed a version that also includes the datasets in the cache.
I am expecting the cache to refresh across all the workflows, because the datasets we added were not present before. I haven't disabled the Windows cache just yet.
Thanks for the changes! I've run and re-run the CI now, and I'm noticing that each of the jobs is creating its own cache, rather than just one cache for Ubuntu and one for Windows. The issue seems to be that the hash differs for each of the jobs. I'm not sure exactly why that is, as I believe this issue did not happen before 6ee3961. Beyond that, the total cache size is growing fairly large. Do you think it would be a good idea to disable caching of dependencies? The dependency caches cannot be shared between the different jobs, and the Ubuntu dependency caches are 1.7 GB. For reference, GitHub will start removing caches once the total cache size exceeds 10 GB.
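If it helps while deciding, the current usage can be compared against that 10 GB limit from the command line. A rough sketch, assuming a recent GitHub CLI with the cache commands and the Actions cache API (OWNER/REPO is a placeholder):

```bash
# Total size of all Actions caches in the repository, in GB.
gh api repos/OWNER/REPO/actions/cache/usage \
  --jq '.active_caches_size_in_bytes / 1024 / 1024 / 1024'

# List the individual caches to see which ones are worth dropping.
gh cache list --repo OWNER/REPO
```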
Mmm yes, I noticed that too. Maybe there is something in the cache that changes at each run. I will now do some debugging on my private fork. I will let you know what I find 👀
Hey @tomaarsen, I ran some tests on my fork and found some interesting stuff.
This happens if I calculate the hash on the whole ~/.cache/huggingface folder (datasets included): the cached dataset files seem to differ between jobs, so every job ends up with its own hash. I then tried to only cache the models, and the hash is stable again. I think we should go back to only caching models (as pushed in my test commit), and have 2 caches (1 per platform). Please let me know what you think and if you have any other ideas :)
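Concretely, going back to models only would mean both hashing and caching just the model directories, so the dataset files can no longer perturb the key. A sketch of what that could look like (variable and step names are illustrative, not necessarily the ones in the test commit):

```yaml
- name: Compute model cache hash
  shell: bash
  run: |
    # Hash only the model caches; ~/.cache/huggingface/datasets is deliberately excluded.
    echo "HF_MODEL_HASH=$(find ~/.cache/huggingface/hub ~/.cache/torch -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1)" >> $GITHUB_ENV

- name: Save HF model cache
  uses: actions/cache/save@v3
  with:
    path: |
      ~/.cache/huggingface/hub
      ~/.cache/torch
    key: hf-models-${{ matrix.os }}-${{ env.HF_MODEL_HASH }}
```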
Sounds good. It is unreasonable to expect to be able to cache datasets if they rely on some platform-dependent hash. Instead, if I have the time and desire, I'll create and push much smaller test datasets to the Hub. The tests can then use those instead of larger "real world" datasets, which should improve the test suite execution time somewhat. I'll have a more thorough look at this tomorrow, but I think we're probably almost ready to merge this 🎉 Thanks for looking into this! I'm well aware that debugging CI jobs is a pain sometimes.
Works like a charm! I'll merge this. Thanks for the help, I appreciate it!
Awesome! Thank you very much for the help and the very useful feedback 🙂
Closes #307
I have enabled caching of the 🤗 models in the CI.
Currently, two model caches are created: one for each OS. The datasets are not included and are still downloaded during testing. Do you want to cache these too?
I struggled a bit to create a unique hash that defines the 🤗 Hub cache. I settled on the command:
find ~/.cache/huggingface/hub ~/.cache/torch -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1
(a checksum of all the sorted files in the huggingface/hub and torch caches). Please let me know if you have any suggestions on how to improve this. :) I scrolled through some HF repos, but I couldn't find any CI pipelines with a cache for models that I could take inspiration from.
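Breaking that command down step by step (the same command, just annotated):

```bash
# 1. Checksum every file under the model caches.
# 2. Sort the checksum lines byte-wise (LC_ALL=C) so the order is
#    deterministic regardless of locale or filesystem traversal order.
# 3. Hash the sorted list to obtain a single digest for the whole cache.
# 4. cut keeps only the hex digest, dropping sha256sum's trailing " -".
find ~/.cache/huggingface/hub ~/.cache/torch -type f -exec sha256sum {} + \
  | LC_ALL=C sort \
  | sha256sum \
  | cut -d ' ' -f 1
```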