
Add cache for 🤗 Hub models in the CI #312

Merged: 9 commits into huggingface:main on Feb 24, 2023

Conversation

EdAbati
Contributor

@EdAbati EdAbati commented Feb 13, 2023

Closes #307

I have enabled caching of the 🤗 Hub models in the CI.
Currently, two model caches are created: one for each OS. The datasets are not included and are still downloaded during testing. Do you want to cache these too?

I struggled a bit to create a unique hash that defines the 🤗 Hub cache. I settled on the command: find ~/.cache/huggingface/hub ~/.cache/torch -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1 (a checksum of all the sorted files in the huggingface/hub and torch caches).
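For readability, the same command split over multiple lines, with a comment on what each stage does:

    # Hash every file in the HF Hub and torch caches, sort the per-file hashes
    # so the order is deterministic, then hash that list into a single key.
    find ~/.cache/huggingface/hub ~/.cache/torch -type f -exec sha256sum {} + \
      | LC_ALL=C sort \
      | sha256sum \
      | cut -d ' ' -f 1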

Please let me know if you have any suggestions on how to improve this. :) I scrolled through some HF repos, but I couldn't find any CI pipelines with a cache for models that I could take inspiration from.

@tomaarsen
Member

tomaarsen commented Feb 14, 2023

Hello!

This is looking great. I'm looking at the logs now. On CI run 1 (no caches in place), I observe:

  1. Only one job per OS (ubuntu-latest 3.8 & windows-latest 3.7) actually creates the cache, which ends up being about 480MB.
  2. All of the other jobs throw warnings that the cache could not be created due to not being able to reserve the cache key, as well as a warning that git fails: The process '/usr/bin/git' failed with exit code 128 and The process 'C:\Program Files\Git\bin\git.exe' failed with exit code 128. These warnings are probably related.
  3. The Ubuntu jobs that do not end up making the cache take about 1 to 2 seconds before stopping with a warning, while the Windows jobs take exactly a minute before stopping... The latter is a shame.
  4. Saving the cache took about 1:10 on Windows, whereas it only took about 10 seconds on Ubuntu.

The total job duration was 11m 34s, running for a combined 1h 44m 57s.

I'm rerunning the jobs now. On CI run 2 (caches presumably in place), I observe:

  1. All of the jobs were able to load the cache.
  2. The Ubuntu jobs could load the caches in ~5-6 seconds, whereas the Windows jobs took around 10 seconds each.
  3. The same warnings as before are thrown.
  4. Trying to save the cache again takes about 1-2 seconds for Ubuntu and a minute for Windows.

The vast majority of jobs took less time than before. In that link, the bottom one of the two equivalently named jobs is the newer one with the cache. The only exceptions are some Windows jobs that inexplicably take a really long time to restore a relatively small cache. Others online have reported similar behaviour too.

I'm quite satisfied with these results, but I do see two points of improvement:

  1. Wrap the Save new HF models to cache step with:
    if: steps.restore-hf-models.outputs.cache-hit != 'true'
    
    while id: restore-hf-models is added to the Restore HF models from cache step. That way, the CI will only ever attempt to save a cache if the cache could not be loaded, which should consistently save about a minute for the Windows jobs (see the sketch after this list).
    See the actions/cache docs on the cache-hit output for details.
  2. Include datasets in the cache. I think there are quite a few that are loaded by the test suite.
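A rough sketch of what that restore/save pair could look like (the step names follow the ones used in this PR; the path and key shown here are illustrative and may not match the actual workflow):

      - name: Restore HF models from cache
        id: restore-hf-models
        uses: actions/cache/restore@v3
        with:
          path: ~/.cache/huggingface/hub
          key: hf-models-${{ matrix.os }}

      - name: Save new HF models to cache
        # Only attempt to save when the restore step did not report a cache hit.
        if: steps.restore-hf-models.outputs.cache-hit != 'true'
        uses: actions/cache/save@v3
        with:
          path: ~/.cache/huggingface/hub
          key: hf-models-${{ matrix.os }}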

Alternatively, it may be prudent to recognize the slowness of all Windows jobs with caching and completely disable all Windows caches. I'd like to hear your thoughts on this, too.

  • Tom Aarsen

@tomaarsen added the enhancement, CI, and admin labels on Feb 14, 2023
@EdAbati
Contributor Author

EdAbati commented Feb 16, 2023

Hi Tom,

Thank you very much for your reply and the very detailed analysis! 🙂 I didn't notice that caches on Windows were slower.

I split the cache restore and save steps because I thought that the cache might change during the unit tests and therefore has to be refreshed: e.g., when a new model is added to the tests or a model used in the tests is updated on the Hub.

If we add steps.restore-hf-models.outputs.cache-hit != 'true', the save step will always be skipped when a cache exists, right?

What if we change it to something like the below?

      - name: Run unit tests
        shell: bash
        run: |
          echo "OLD_HF_CACHE_HASH=$(find ~/.cache/huggingface ~/.cache/torch -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1)" >> $GITHUB_ENV
          pytest -sv tests/
          echo "NEW_HF_CACHE_HASH=$(find ~/.cache/huggingface ~/.cache/torch -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1)" >> $GITHUB_ENV

      - name: Save new HF models to cache
        uses: actions/cache/save@v3
        with:
          path: |
            ~/.cache/huggingface
            ~/.cache/torch
          key: hf-models-${{ matrix.os }}-${{ env.NEW_HF_CACHE_HASH }}
        # Only save cache if the hash has changed
        if: ${{ env.NEW_HF_CACHE_HASH != env.OLD_HF_CACHE_HASH }}

I agree that, if this doesn't solve the slowness on Windows, we can enable the cache just for Ubuntu jobs.

@tomaarsen
Member

If we add steps.restore-hf-models.outputs.cache-hit != 'true', the save step will always be skipped when a cache exists, right?

You're very right, I had not thought about that thoroughly enough.

Your proposal looks like a very solid solution.

@EdAbati
Contributor Author

EdAbati commented Feb 16, 2023

For now I just pushed the version that:

  • caches everything in the HF cache location
  • checks the hashes before and after the tests

I am expecting that the cache will refresh across all the workflows, because we added the datasets, which were not present before.

I haven't disabled the Windows cache just yet.

@tomaarsen
Member

Thanks for the changes!

I've run and re-run the CI now, and I'm noticing that each of the jobs is making its own cache, rather than just one cache for Ubuntu and one for Windows. The issue seems to be that the hash differs for each of the jobs. I'm not sure exactly why that is, as I believe this issue did not happen before 6ee3961.

Beyond that, the total cache size is growing fairly large. Do you think it would be a good idea to disable caching of dependencies? The dependency caches cannot be shared between the different jobs, and the Ubuntu dependency caches are 1.7 GB. For reference, GitHub will start removing caches after a total cache size of 10 GB.
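As an aside, a hypothetical way to keep an eye on that total, assuming the GitHub CLI is available (this uses the standard Actions cache-usage REST endpoint):

    # Query the total Actions cache usage for the repository.
    gh api repos/huggingface/setfit/actions/cache/usage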

@EdAbati
Contributor Author

EdAbati commented Feb 19, 2023

Mmm yes, I noticed that too. Maybe there is something in the cache that changes at each run.

I will now do some debugging on my private fork. I will let you know what I find 👀

@EdAbati
Contributor Author

EdAbati commented Feb 23, 2023

Hey @tomaarsen ,

I ran some tests on my fork and found some interesting stuff.

The issue seems to be that the hash differs for each of the jobs. I'm not sure exactly why that is

This happens if I calculate the hash on the whole ~/.cache/huggingface folder. This folder also contains subfolders like modules, evaluate and metrics, which contain Python code and __pycache__ folders. My guess is that this is what is inconsistent in each job.
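As an illustration, a variant of the hashing command that skips those compiled-Python artifacts could be used to check this (hypothetical, not part of the PR):

    # Hash the whole HF cache while ignoring __pycache__ folders, to test
    # whether they are what makes the key differ between jobs.
    find ~/.cache/huggingface -type f -not -path '*/__pycache__/*' -exec sha256sum {} + \
      | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1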

I then tried to only cache ~/.cache/huggingface/hub, ~/.cache/torch (for models) and ~/.cache/huggingface/datasets (for datasets): https://github.com/EdAbati/setfit/actions/runs/4247055316
In this case, the hash stays the same for certain jobs, while it varies for others.
This is because there is a hash in the file name of each dataset, and this varies across platforms/Python versions.

I think we should go back to caching only the models (as pushed in my test commit) and have 2 caches (1 per platform).

Please let me know what you think and if you have any other ideas :)

@tomaarsen
Member

Sounds good. It is unreasonable to expect to be able to cache datasets if they rely on some platform-dependent hash. Instead, if I have the time and desire, I'll create and push much smaller test datasets to the Hub. Then the tests can use those, instead of larger "real world" datasets, which should improve the test suite execution time somewhat.

I'll have a more thorough look at this tomorrow, but I think we're probably almost ready to merge this 🎉

Thanks for looking into this! I'm well aware that debugging CI jobs is a pain sometimes.

  • Tom Aarsen

@tomaarsen
Member

Works like a charm! I'll merge this. Thanks for the help, I appreciate it!

@tomaarsen merged commit bcff822 into huggingface:main on Feb 24, 2023
@EdAbati
Contributor Author

EdAbati commented Feb 24, 2023

Awesome! Thank you very much for the help and the very useful feedback 🙂

@EdAbati deleted the add-model-cache-ci branch on February 24, 2023 at 10:10