
Add cache for 🤗 Hub models in the CI #312

Merged: 9 commits into huggingface:main on Feb 24, 2023

Conversation

EdAbati
Contributor

@EdAbati EdAbati commented Feb 13, 2023

Closes #307

I have enabled caching of the 🤗 Hub models in the CI.
Currently, two model caches are created: one for each OS. The datasets are not included and are still downloaded during testing. Do you want to cache these too?

I struggled a bit to create a unique hash that defines the 🤗 Hub cache. I settled on the command: find ~/.cache/huggingface/hub ~/.cache/torch -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1 (a checksum of all the sorted files in the huggingface/hub and torch caches).
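For readability, the same command split over multiple lines, with a comment on what each stage does:

    # Hash every file in the HF Hub and torch caches, sort the per-file hashes
    # so the order is deterministic, then hash that list into a single key.
    find ~/.cache/huggingface/hub ~/.cache/torch -type f -exec sha256sum {} + \
      | LC_ALL=C sort \
      | sha256sum \
      | cut -d ' ' -f 1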

Please let me know if you have any suggestions on how to improve this. :) I scrolled through some HF repos, but I couldn't find any CI pipelines with a cache for models that I could take inspiration from.

@tomaarsen
Member

tomaarsen commented Feb 14, 2023

Hello!

This is looking great. I'm looking at the logs now. On CI run 1 (no caches in place), I observe:

  1. Only one job per OS (ubuntu-latest 3.8 & windows-latest 3.7) actually creates the cache, which ends up being about 480MB.
  2. All of the other jobs throw warnings that the cache could not be created due to not being able to reserve the cache key, as well as a warning that git fails: The process '/usr/bin/git' failed with exit code 128 and The process 'C:\Program Files\Git\bin\git.exe' failed with exit code 128. These warnings are probably related.
  3. The Ubuntu jobs that do not end up making the cache take about 1 to 2 seconds before stopping with a warning, while the Windows jobs take exactly a minute before stopping... The latter is a shame.
  4. Saving the cache took about 1:10 on Windows, whereas it only took about 10 seconds on Ubuntu.

The total job duration was 11m 34s, running for a combined 1h 44m 57s.

I'm rerunning the jobs now. On CI run 2 (caches presumably in place), I observe:

  1. All of the jobs were able to load the cache.
  2. The Ubuntu jobs could load the caches in ~5-6 seconds, whereas the Windows jobs took around 10 seconds each.
  3. The same warnings as before are thrown.
  4. Trying to save the cache again takes about 1-2 seconds for Ubuntu and a minute for Windows.

The vast majority of jobs took less time than before. In that link, the bottom one of the two equivalently named jobs is the newer one with the cache. The only exceptions are some Windows jobs that inexplicably take a really long time to restore a relatively small cache. Others online have reported similar behaviour too.

I'm quite satisfied with these results, but I do see two points of improvement:

  1. Wrap the Save new HF models to cache step with:
    if: steps.restore-hf-models.outputs.cache-hit != 'true'
    
    while id: restore-hf-models is added to the Restore HF models from cache step. That way, the CI will only ever attempt to save a cache if the cache could not be loaded, which should consistently save about a minute for the Windows jobs (see the sketch after this list).
    See the actions/cache docs on the cache-hit output for details.
  2. Include datasets in the cache. I think there are quite a few that are loaded by the test suite.
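A rough sketch of what that restore/save pair could look like (the step names follow the ones used in this PR; the path and key shown here are illustrative and may not match the actual workflow):

      - name: Restore HF models from cache
        id: restore-hf-models
        uses: actions/cache/restore@v3
        with:
          path: ~/.cache/huggingface/hub
          key: hf-models-${{ matrix.os }}

      - name: Save new HF models to cache
        # Only attempt to save when the restore step did not report a cache hit.
        if: steps.restore-hf-models.outputs.cache-hit != 'true'
        uses: actions/cache/save@v3
        with:
          path: ~/.cache/huggingface/hub
          key: hf-models-${{ matrix.os }}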

Alternatively, it may be prudent to recognize the slowness of all Windows jobs with caching and completely disable all Windows caches. I'd like to hear your thoughts on this, too.

  • Tom Aarsen

@tomaarsen added the enhancement, CI, and admin labels on Feb 14, 2023
@EdAbati
Contributor Author

EdAbati commented Feb 16, 2023

Hi Tom,

Thank you very much for your reply and the very detailed analysis! 🙂 I didn't notice that caches on Windows were slower.

I split the cache restore and save steps because I thought that the cache might change during the unit tests and therefore has to be refreshed: e.g., when a new model is added to the tests or a model used in the tests is updated on the Hub.

If we add steps.restore-hf-models.outputs.cache-hit != 'true', the save step will always be skipped when a cache exists, right?

What if we change it to something like the below?

      - name: Run unit tests
        shell: bash
        run: |
          echo "OLD_HF_CACHE_HASH=$(find ~/.cache/huggingface ~/.cache/torch -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1)" >> $GITHUB_ENV
          pytest -sv tests/
          echo "NEW_HF_CACHE_HASH=$(find ~/.cache/huggingface ~/.cache/torch -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1)" >> $GITHUB_ENV

      - name: Save new HF models to cache
        uses: actions/cache/save@v3
        with:
          path: |
            ~/.cache/huggingface
            ~/.cache/torch
          key: hf-models-${{ matrix.os }}-${{ env.NEW_HF_CACHE_HASH }}
        # Only save cache if the hash has changed
        if: ${{ env.NEW_HF_CACHE_HASH != env.OLD_HF_CACHE_HASH }}

I agree that, if this doesn't solve the slowness on Windows, we can enable the cache just for Ubuntu jobs.

@tomaarsen
Member

If we add steps.restore-hf-models.outputs.cache-hit != 'true', the save step will always be skipped when a cache exists, right?

You're very right, I had not thought about that thoroughly enough.

Your proposal looks like a very solid solution.

@EdAbati
Contributor Author

EdAbati commented Feb 16, 2023

For now I just pushed the version that:

  • caches everything in the HF cache location
  • checks the hashes before and after the tests

I am expecting that the cache will refresh across all the workflows, because we added the datasets, which were not present before.

I haven't disabled the Windows cache just yet.

@tomaarsen
Member

Thanks for the changes!

I've run and re-run the CI now, and I'm noticing that each of the jobs is making its own cache, rather than just one cache for Ubuntu and one for Windows. The issue seems to be that the hash differs for each of the jobs. I'm not sure exactly why that is, as I believe this issue did not happen before 6ee3961.

Beyond that, the total cache size is growing fairly large. Do you think it would be a good idea to disable caching of dependencies? The dependency caches cannot be shared between the different jobs, and the Ubuntu dependency caches are 1.7 GB. For reference, GitHub will start removing caches after a total cache size of 10 GB.
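As an aside, a hypothetical way to keep an eye on that total, assuming the GitHub CLI is available (this uses the standard Actions cache-usage REST endpoint):

    # Query the total Actions cache usage for the repository.
    gh api repos/huggingface/setfit/actions/cache/usage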

@EdAbati
Contributor Author

EdAbati commented Feb 19, 2023

Mmm yes, I noticed that too. Maybe there is something in the cache that changes at each run.

I will now do some debugging on my private fork. I will let you know what I find 👀

@EdAbati
Contributor Author

EdAbati commented Feb 23, 2023

Hey @tomaarsen ,

I ran some tests on my fork and found some interesting stuff.

The issue seems to be that the hash differs for each of the jobs. I'm not sure exactly why that is

This happens if I calculate the hash on the whole ~/.cache/huggingface folder. This folder also contains subfolders like modules, evaluate and metrics, which contain Python code and __pycache__ folders. My guess is that this is what is inconsistent in each job.
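As an illustration, a variant of the hashing command that skips those compiled-Python artifacts could be used to check this (hypothetical, not part of the PR):

    # Hash the whole HF cache while ignoring __pycache__ folders, to test
    # whether they are what makes the key differ between jobs.
    find ~/.cache/huggingface -type f -not -path '*/__pycache__/*' -exec sha256sum {} + \
      | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1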

I then tried to only cache ~/.cache/huggingface/hub, ~/.cache/torch (for models) and ~/.cache/huggingface/datasets (for datasets): https://github.com/EdAbati/setfit/actions/runs/4247055316
In this case, the hash stays the same for certain jobs, while it varies for others.
This is because there is a hash in the file name of each dataset, and this varies across platforms/Python versions.

I think we should go back to caching only the models (as pushed in my test commit) and have 2 caches (1 per platform).

Please let me know what you think and if you have any other ideas :)

@tomaarsen
Member

Sounds good. It is unreasonable to expect to be able to cache datasets if they rely on some platform-dependent hash. Instead, if I have the time and desire, I'll create and push much smaller test datasets to the Hub. Then the tests can use those, instead of larger "real world" datasets, which should improve the test suite execution time somewhat.

I'll have a more thorough look at this tomorrow, but I think we're probably almost ready to merge this 🎉

Thanks for looking into this! I'm well aware that debugging CI jobs is a pain sometimes.

  • Tom Aarsen

@tomaarsen
Member

Works like a charm! I'll merge this. Thanks for the help, I appreciate it!

@tomaarsen merged commit bcff822 into huggingface:main on Feb 24, 2023
@EdAbati
Contributor Author

EdAbati commented Feb 24, 2023

Awesome! Thank you very much for the help and the very useful feedback 🙂

@EdAbati deleted the add-model-cache-ci branch on February 24, 2023 at 10:10