PIP cache should cache the installed packages as well #330

Open
crabhi opened this issue Feb 3, 2022 · 34 comments
Labels
feature request New feature or request to improve the current logic needs eyes

Comments

@crabhi

crabhi commented Feb 3, 2022

Description:
Currently, setup-python caches only the ~/.cache/pip directory to avoid redownloads. However, it doesn't cache the installed packages. As some packages have lengthy installation steps, this leads to delays in builds.

You can see the current behaviour for example in https://github.com/crabhi/setup-python-cache-test/actions/runs/1789016634 (or in attached build.txt) - the pip install output shows "Collecting" and "Installing" instead of "Requirement already satisfied" for all packages.

Justification:
For example installing the ansible package takes well over a minute even if it's already downloaded.

Are you willing to submit a PR?
Yes, I can try.

@crabhi crabhi added feature request New feature or request to improve the current logic needs triage labels Feb 3, 2022
@nikita-bykov
Contributor

Hello @crabhi, thanks for your request!
We will look at it.

@barbieri

It would also be nice to follow https://github.com/actions/cache#outputs and provide a cache-hit output, so we can use if: steps.[id].outputs.cache-hit != 'true' to avoid calling pip altogether.

@jbergstroem

This pattern affects more languages (actions/setup-node works the same - only caches downloads, not installs) - would love to see a general consensus towards caching installs, not tarballs (perhaps behind a flag/attribute for future compat cache: pip-install).

@barbieri

for pip I think it makes even more sense as there are no postinstall actions... with setup-node I'm also caching the node_modules instead of the packages, but it "broke" some flows where there was a postinstall script to configure other things (like pre-build some typescript scripts). The solution is simple, just run that script manually (or in my case, cache the built scripts)... but not "one size fits all".

For pip AFAIR there are no postinstall scripts, then this would not be an issue.

@jbergstroem

jbergstroem commented Mar 29, 2022

> For pip AFAIR there are no postinstall scripts, then this would not be an issue.

I'm experimenting with this at the moment and caching site-packages (read: pip output) isn't straightforward either; for instance binary wrappers (black, ...) won't work (python -m black works fine tho). Might be one of those YMMV cases that makes it hard to standardize for everyone.

@dhvcc
Contributor

dhvcc commented Apr 5, 2022

> would be also nice to follow https://github.com/actions/cache#outputs and provide an output cache-hit so we can if: steps.[id].outputs.cache-hit != 'true' to avoid calling pip altogether.

Hey, this feature was merged today and should be a part of the near-future release

@barbieri

barbieri commented Apr 5, 2022

but the cache-hit is just for the packages, not the installation, right? IOW: do I still need to call pip install?

@dhvcc
Contributor

dhvcc commented Apr 5, 2022

> but the cache-hit is just for the packages, not the installation, right? IOW: do I still need to call pip install?

Oh, you're talking about pip. Well yeah, then you'll have to wait for this action to support caching venvs out of the box. It's a case for pipenv and poetry though. The best thing for now is to cache manually.

@belm0

belm0 commented Apr 16, 2022

I have a case where building packages for pypy (grpcio, grpcio-tools) takes about 6 minutes-- it's way too slow to introduce a matrix.

If anyone has a manual example using actions/cache, please share it.

@rashidnhm

I was creating a python venv and then caching that directory, however I hit an issue where that was broken once restored (behaviour was inconsistent).

I currently have a job that takes ~6 min to complete, 4 min of which is installation of pip packages. An effective caching of installed packages would be a great boost.

@dhvcc
Contributor

dhvcc commented May 10, 2022

> I was creating a python venv and then caching that directory, however I hit an issue where that was broken once restored (behaviour was inconsistent).
>
> I currently have a job that takes ~6 min to complete, 4 min of which is installation of pip packages. An effective caching of installed packages would be a great boost.

Could you share the workflow so the people can take a look at it? I think it's possible to hack around while this feature is not here

@rashidnhm

rashidnhm commented May 10, 2022

> > I was creating a python venv and then caching that directory, however I hit an issue where that was broken once restored (behaviour was inconsistent).
> > I currently have a job that takes ~6 min to complete, 4 min of which is installation of pip packages. An effective caching of installed packages would be a great boost.
>
> Could you share the workflow so the people can take a look at it? I think it's possible to hack around while this feature is not here

- uses: actions/checkout@v3

- id: setup_python
  uses: actions/setup-python@v3
  with:
    python-version: 3.7

- id: python_cache
  uses: actions/cache@v3
  with:
    path: venv
    key: pip-${{ steps.setup_python.outputs.python-version }}-${{ hashFiles('requirements.txt') }}

- if: steps.python_cache.outputs.cache-hit != 'true'
  run: |
    python3 -m venv venv

- run: |
    venv/bin/python3 -m pip install -r requirements.txt

This worked quite well for me for the most part, just that after a while I started getting errors as such:

Error: [Errno 2] No such file or directory: '/home/runner/work/myrepo/myrepo/venv/bin/python3': '/home/runner/work/myrepo/myrepo/venv/bin/python3'

@dhvcc
Contributor

dhvcc commented May 13, 2022

@rashidnhm have you tried debugging this issue? It seems like the problem may not be in this action.

@rashidnhm

> @rashidnhm have you tried debugging this issue? It seems like the problem may not be in this action.

So weirdly enough, I have not been able to reproduce the issue. To fix it, I simply removed the venv code, then recreated the venv and re-cached it. I'm not even sure what caused it in the first place.

My only thought was that maybe the cache somehow got corrupted and it kept restoring that. Really can't say.

For now I've kept the code I sent above. It's been working well since, and I haven't hit any other issues.

@dhvcc
Contributor

dhvcc commented May 13, 2022

OK, nice. The code seemed fine, so that was strange. My only advice would be to skip pip install when the cache is hit: if you don't modify a restored cache in any way, you avoid corrupting it.

@rashidnhm

rashidnhm commented May 14, 2022

> OK, nice. The code seemed fine, so that was strange. My only advice would be to skip pip install when the cache is hit: if you don't modify a restored cache in any way, you avoid corrupting it.

So I have done quite a deep dive into the venv corruption issue, and I believe I know what happened, and how to avoid it as well.

The version of Python between when my cache was created and when it was restored changed. And I had a generic restore key which matched the old cache key. See detailed explanation below.

This is how my YAML file looked when I hit this error:

# BAD CONFIG DO NOT USE (Illustrative purposes only)

- uses: actions/checkout@v3

- id: setup_python
  uses: actions/setup-python@v3
  with:
    python-version: 3.7

- id: python_cache
  uses: actions/cache@v3
  with:
    path: venv
    key: pip-${{ steps.setup_python.outputs.python-version }}-${{ hashFiles('requirements.txt') }}
    # The bare "pip-" fallback below was the specific cause of the issue.
    # (A "#" comment inside the block scalar would become part of the key.)
    restore-keys: |
      pip-${{ steps.setup_python.outputs.python-version }}-
      pip-

- if: steps.python_cache.outputs.cache-hit != 'true'
  run: |
    python3 -m venv venv

- run: |
    venv/bin/python3 -m pip install -r requirements.txt

When this workflow initially ran and saved the venv to cache, the latest release of Python 3.7 was 3.7.12, meaning the venv created had symlinks to 3.7.12.

However, a few days later when the workflow ran again, the latest release of Python 3.7 was 3.7.13.

Notice in my workflow I don't pin my Python patch version, so actions/setup-python downloaded the latest available patch release of Python 3.7 (as expected).

However, my restore-key pip- matched the old cache, which restored the old venv created for Python 3.7.12 ... meaning all the symlinks inside were now broken! I have setup Python 3.7.13 but am trying to use a venv with symlinks to 3.7.12! Hence why when I tried to call the python executable from the venv, it could not find the file!

The resolution is to ensure that the python-version output of setup-python is part of every key that can restore the cache. That way any change in Python version (even a patch version bump) produces a new cache key.
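
The same keying idea can be sketched outside a workflow. This is a minimal illustration in Python, not what actions/cache or hashFiles does internally: the function name `venv_cache_key` and the `pip-<version>-<digest>` format are assumptions made for the example. It only shows that including the full patch version in the key makes a 3.7.12 venv and a 3.7.13 venv map to different cache entries.

```python
import hashlib
import platform
from pathlib import Path


def venv_cache_key(requirement_files, python_version=None):
    """Build a key like 'pip-3.7.13-<sha256>' (illustrative format only)."""
    # platform.python_version() returns the full 'major.minor.patch' string,
    # so a patch release bump changes the key.
    version = python_version or platform.python_version()
    digest = hashlib.sha256()
    # Sort the file names so the key does not depend on argument order.
    for name in sorted(requirement_files):
        digest.update(Path(name).read_bytes())
    return f"pip-{version}-{digest.hexdigest()}"
```

With this shape there is no shared prefix that could match a venv built for another patch release, which is the property the bare `pip-` restore key broke.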

This is the code I have now, it has been working well without any issues. I have updated the workflow with the advice @dhvcc gave in the above comment. The venv is not touched if there is a cache hit.

- uses: actions/checkout@v3

- id: setup_python
  uses: actions/setup-python@v3
  with:
    python-version: 3.7

- id: python_cache
  uses: actions/cache@v3
  with:
    path: venv
    key: pip-${{ steps.setup_python.outputs.python-version }}-${{ hashFiles('requirements.txt') }}

- if: steps.python_cache.outputs.cache-hit != 'true'
  run: |
    # Check if venv exists (restored from secondary keys if any, and delete)
    # You might not need this line if you only have one primary key for the venv caching
    # I kept it in my code as a fail-safe
    if [ -d "venv" ]; then rm -rf venv; fi
    
    # Re-create the venv
    python3 -m venv venv

    # Install dependencies
    venv/bin/python3 -m pip install -r requirements.txt
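
As an extra fail-safe, a restored venv can be checked for the dangling-symlink failure described above before it is used. The following is a minimal sketch, assuming the standard venv layout (`bin/python3` on POSIX, `Scripts/python.exe` on Windows); the helper name `venv_is_usable` is hypothetical and is not part of setup-python or actions/cache.

```python
from pathlib import Path


def venv_is_usable(venv_dir: str) -> bool:
    """Return True if the venv's interpreter resolves to an existing file."""
    venv = Path(venv_dir)
    # POSIX venvs expose bin/python3; Windows venvs use Scripts/python.exe.
    candidates = [venv / "bin" / "python3", venv / "Scripts" / "python.exe"]
    # Path.exists() follows symlinks, so a link that points at a removed
    # Python installation (e.g. 3.7.12 after 3.7.13 is set up) counts as missing.
    return any(candidate.exists() for candidate in candidates)
```

A workflow step could call this and rebuild the venv when it returns False, instead of failing later with "No such file or directory".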

@IvanZosimov
Contributor

Hi, @rashidnhm 👋 Thanks a lot for such a detailed explanation, it should help others who encountered such issues.

@Axeln78

Axeln78 commented Jul 12, 2022

Any news on how to flag to cache the installed packages, and not only the downloaded ones, with actions/setup-python@v4? I am not seeing any flags for that in the documentation

@dhvcc
Contributor

dhvcc commented Jul 12, 2022

> Any news on how to flag to cache the installed packages, and not only the downloaded ones, with actions/setup-python@v4? I am not seeing any flags for that in the documentation

What exactly do you mean by that? A bit more context would be helpful to avoid misunderstandings.

@Axeln78

Axeln78 commented Jul 13, 2022

Sorry, @dhvcc if I didn't manage to make myself clear. actions/setup-python@v4 uses actions/cache@v3 under the hood and users do not need to call on the actions/cache@v3 module in an example such as:

    - uses: actions/checkout@v3
    - name: Set up Python 3.10 and caches
      id: setup_and_cache
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
        cache: 'pip'

It would be great if the installed packages could be cached as well (the purpose of this issue #330) through actions/setup-python@v4

strayge added a commit to strayge/flake8-hangover that referenced this issue Nov 19, 2022
setup-python action caches only downloads, so install always needed
open issue for installed packages cache:
actions/setup-python#330
@Avasam

Avasam commented Nov 24, 2022

I wonder if it's actually worth it.
Here I cached the content of ${{ env.pythonLocation }}/lib/site-packages and ${{ env.pythonLocation }}/Scripts using actions/cache:

No caching: [screenshot]

Caching, no cache hit (+1m): [screenshot]

Caching, cache hit (+18s): [screenshot]

@dhvcc
Contributor

dhvcc commented Dec 2, 2022

@Avasam possibly at least less strain on PyPI. Also, we should test with both small and large numbers of dependencies.

@adamjstewart
Contributor

adamjstewart commented Feb 5, 2023

Just wanted to add an anecdote of my own experience. TorchGeo has a long list of dependencies:

Install times without caching vary quite a bit by OS and Python version:

| Python | Linux | macOS | Windows |
|--------|-------|-------|---------|
| 3.10 | 2m 30s | 2m 23s | 5m 4s |
| 3.9 | 2m 50s | 4m 50s | 5m 49s |
| 3.8 | 2m 29s | 2m 12s | 3m 19s |

We first tried using the cache feature of setup-python:

    - name: Set up python
      uses: actions/[email protected]
      with:
        python-version: ${{ matrix.python-version }}
        cache: 'pip'
        cache-dependency-path: |
          requirements/required.txt
          requirements/datasets.txt
          requirements/tests.txt
    - name: Install pip dependencies
      run: pip install -r requirements/required.txt -r requirements/datasets.txt -r requirements/tests.txt

Not only do install times not significantly improve, in many cases it's actually worse!

| Python | Linux | macOS | Windows |
|--------|-------|-------|---------|
| 3.10 | 2m 42s | 1m 53s | 5m 50s |
| 3.9 | 2m 50s | 2m 11s | 5m 46s |
| 3.8 | 2m 39s | 3m 21s | 2m 35s |

Finally, we tried the setup proposed in this blog that manually caches the entire Python installation:

    - name: Set up python
      uses: actions/[email protected]
      with:
        python-version: ${{ matrix.python-version }}
    - name: Cache dependencies
      uses: actions/[email protected]
      id: cache
      with:
        path: ${{ env.pythonLocation }}
        key: ${{ env.pythonLocation }}-${{ hashFiles('requirements/required.txt') }}-${{ hashFiles('requirements/datasets.txt') }}-${{ hashFiles('requirements/tests.txt') }}
    - name: Install pip dependencies
      if: steps.cache.outputs.cache-hit != 'true'
      run: pip install -r requirements/required.txt -r requirements/datasets.txt -r requirements/tests.txt

This resulted in significantly faster installation times, which could likely be further improved by only caching the site-packages directory:

| Python | Linux | macOS | Windows |
|--------|-------|-------|---------|
| 3.10 | 38s | 39s | 4m 11s |
| 3.9 | 53s | 45s | 4m 20s |
| 3.8 | 1m 2s | 1m 15s | 1m 21s |

Apparently slower Windows caching is a known issue: actions/cache#752.

So yes, if setup-python also cached installed packages, that would be awesome!

@adamjstewart
Contributor

> which could likely be further improved by only caching the site-packages directory

In hindsight, this is a bad idea: many tools like black or flake8 also install files into bin, so we'll at least need to cache bin too.

@jbergstroem

> In hindsight, this is a bad idea, many tools like black or flake8 also install files into bin so we'll at least need to cache bin too.

I addressed this point a while ago (above) - recap here:

> I'm experimenting with this at the moment and caching site-packages (read: pip output) isn't straightforward either; for instance binary wrappers (black, ...) won't work (python -m black works fine tho). Might be one of those YMMV cases that makes it hard to standardize for everyone.

So, instead of invoking black, do python -m black.
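
To see which of the two invocation styles will work after a restore, a small check can help. This is a rough sketch using only the standard library; the helper name `tool_status` is made up for illustration. It reports whether a tool's module is importable (what `python -m black` relies on) and whether a launcher script is on PATH (what a bare `black` relies on).

```python
import importlib.util
import shutil


def tool_status(name: str) -> dict:
    """Report whether `name` is importable and whether a launcher is on PATH."""
    return {
        # True when `python -m <name>` can locate the module in site-packages.
        "module": importlib.util.find_spec(name) is not None,
        # True when an executable called <name> is found on PATH (e.g. in bin).
        "executable": shutil.which(name) is not None,
    }
```

When only site-packages is restored, one would expect `module` to be True and `executable` to be False for tools like black, which matches the behaviour described above.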

@adamjstewart
Contributor

That's a decent workaround, but I don't think it's realistic to expect all users to change how they invoke other steps later in their workflow. I think we would have to cache bin too. Possibly everything. Bonus of caching everything is that we have to install Python from a cache anyway.

@jbergstroem

jbergstroem commented Feb 7, 2023

> That's a decent workaround, but I don't think it's realistic to expect all users to change how they invoke other steps later in their workflow. I think we would have to cache bin too. Possibly everything. Bonus of caching everything is that we have to install Python from a cache anyway.

Most definitely not a catch-all! To be honest I'm not confident there's a straightforward solution..

@Seluj78

Seluj78 commented Feb 22, 2023

The workaround from @adamjstewart seems to work wonders indeed! But I think a standard implementation from this repository would be a great addition. Any updates on it from the dev team?

@CoreyGaunt

> Just wanted to add an anecdote of my own experience. [...] Finally, we tried the setup proposed in this blog that manually caches the entire Python installation [...]

This right here has been a life saver for me - I toiled over this caching for so long, but this got me there!! Thank you so so so much!!

@nicklausbrown

nicklausbrown commented Aug 28, 2023

> In hindsight, this is a bad idea, many tools like black or flake8 also install files into bin so we'll at least need to cache bin too.

@adamjstewart, thanks for your awesome post, it really reduced my build times.

It's definitely possible to achieve your idea of caching only the necessary items instead of the entire Python build by caching site-packages and bin (the latter in order to get executable packages like black, ruff, etc.). In my case, my two dependency files are the classic pip-tools dev-requirements.txt and requirements.txt.

  - name: Cache dependency libraries
    uses: actions/[email protected]
    id: cache-libraries
    with:
      path: ${{ env.pythonLocation }}/lib/python3.11/site-packages
      key: ${{ env.pythonLocation }}-${{ hashFiles('dev-requirements.txt') }}-${{ hashFiles('requirements.txt') }}

  - name: Cache dependency binaries
    uses: actions/[email protected]
    id: cache-binaries
    with:
      path: ${{ env.pythonLocation }}/bin
      key: ${{ env.pythonLocation }}-${{ hashFiles('dev-requirements.txt') }}-${{ hashFiles('requirements.txt') }}

Finally I run the pip install step conditionally based on successful cache hits with the following:

  - name: Install pip-tools and dependencies
    if: steps.cache-libraries.outputs.cache-hit != 'true' || steps.cache-binaries.outputs.cache-hit != 'true'
    run: |
      python -m pip install --upgrade pip
      pip install pip-tools
      pip-sync dev-requirements.txt requirements.txt

Mileage can vary for small dependency lists based on the cache restore network speed.
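
The `lib/python3.11/site-packages` path above is tied to one Python version and one OS layout. As a hedged sketch, the running interpreter can report these directories itself through the standard `sysconfig` module; the function name `install_locations` and the dictionary keys are chosen for this example only.

```python
import sysconfig


def install_locations() -> dict:
    """Return where the running interpreter installs packages and scripts."""
    paths = sysconfig.get_paths()
    return {
        "site_packages": paths["purelib"],  # pure-Python packages
        "platlib": paths["platlib"],        # packages with compiled extensions
        "scripts": paths["scripts"],        # launchers such as black or ruff
    }
```

A step could print these values and pass them to the cache steps, so the same workflow works across a version matrix without editing the path.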

@harpener

harpener commented Sep 7, 2023

@adamjstewart Thanks for the cache example, it really helps. :-)

I have one more suggestion that I decided to use, because I want to avoid downloading all pip dependencies every time any of them changes: introduce a second cache just for the pip cache.

            -   name: "Python: Setup Python"
                uses: actions/[email protected]
                with:
                    python-version: 3.10.13

            -   name: "Cache: Cache Python"
                id: python-cache
                uses: actions/[email protected]
                with:
                    path: ${{env.pythonLocation}}
                    key: ${{env.pythonLocation}}-${{hashFiles('requirements-dev.txt')}}-${{hashFiles('requirements.txt')}}

            -   name: "Shell: Get pip cache dir"
                id: pip-cache-dir
                if: steps.python-cache.outputs.cache-hit != 'true'
                run: |
                    python -m pip install -U pip
                    pip install -U wheel
                    echo "pip-cache-dir=$(pip cache dir)" >> "$GITHUB_OUTPUT"

            -   name: "Cache: Cache pip"
                if: steps.python-cache.outputs.cache-hit != 'true'
                uses: actions/[email protected]
                with:
                    path: ${{steps.pip-cache-dir.outputs.pip-cache-dir}}
                    key: 3.10-${{hashFiles('requirements-dev.txt')}}-${{hashFiles('requirements.txt')}}
                    restore-keys: |
                        3.10-${{hashFiles('requirements-dev.txt')}}-
                        3.10-

            -   name: "Shell: Install pip dependencies"
                if: steps.python-cache.outputs.cache-hit != 'true'
                run: pip install -r requirements-dev.txt

@itsme2980

itsme2980 commented Oct 11, 2023

Unfortunately, the built-in cache functionality only supports dependencies downloaded by the package manager, not the whole Python installation as expected.

I made an example that caches both the downloaded libraries and the binaries (Python plus a package manager like pip) via actions/cache:

    - name: Cache dependencies
      uses: actions/[email protected]
      id: cache
      with:
        path: ${{ runner.tool_cache }}/Python/${{ inputs.python-version }} # e.g /opt/hostedtoolcache/Python/3.11.6
        key: ${{ runner.tool_cache }}/Python/${{ inputs.python-version }}/${{ runner.arch }}-${{ hashFiles('requirements.txt') }}

    - name: Set up Python env
      uses: actions/setup-python@v4
      with:
        token: ${{ secrets.xyz }} # if needed
        python-version: ${{ inputs.python-version }}

    - name: Install pip dependencies
      shell: bash
      run: |
        pip install -r requirements.txt

The key point is that the cache step comes before actions/setup-python. It restores the cache, including binaries and libraries, into that path. actions/setup-python then sees that the 0-byte x64.complete file exists under the path, so it does not download the Python binary again.

The console output with a cache hit should be:

Cache restored successfully
Cache restored from key: /opt/hostedtoolcache/Python/3.11.6/x64-95421aa4e4a3b4024058ea1f527f5a724a65d3408f2a8c64daad58bfa48fa33d
...
Run actions/setup-python@v4
  Successfully set up CPython (3.11.6)
...
Run pip install -r requirements.txt
Requirement already satisfied: requests==2.25.1 in /opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages (from -r requirements.txt (line 2)) (2.25.1)
Requirement already satisfied: pyyaml==6.0.1 in /opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages (from -r requirements.txt (line 3)) (6.0.1)

Note: the above was tested on a self-hosted runner.
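
The marker-file check described above can be sketched as follows. The directory layout (`<tool_cache>/Python/<version>/<arch>.complete`) is inferred from the paths in this comment and should be treated as an assumption, not a documented contract of actions/setup-python; the function name `toolcache_has_python` is hypothetical.

```python
from pathlib import Path


def toolcache_has_python(tool_cache: str, version: str, arch: str = "x64") -> bool:
    """True if <tool_cache>/Python/<version>/<arch>.complete exists."""
    # The 0-byte marker sits next to the <arch> directory that holds the
    # interpreter, e.g. /opt/hostedtoolcache/Python/3.11.6/x64.complete.
    marker = Path(tool_cache) / "Python" / version / f"{arch}.complete"
    return marker.is_file()
```

If this returns False after a restore, the cached path probably did not include the marker, and the Python download will likely happen again.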

@wyardley

> Finally, we tried the setup proposed in this blog that manually caches the entire Python installation:

This works pretty well after the cache is populated. However, populating the cache seems to take 13 minutes for me with the deps we have, not just on initial creation, but also whenever the cache is busted.

This is running

/usr/bin/tar --posix -cf cache.tgz --exclude cache.tgz -P -C /home/runner/work/xxx/yyy --files-from manifest.txt -z

Has anyone had any luck with a similar approach that will handle partial restores more cleanly (without using pipenv, which is what I've typically done for caching in the past)?
