Reducing the size of the cuda_compiler zip #6967

Open
h-vetinari opened this issue Jan 26, 2025 · 7 comments

h-vetinari (Member) commented Jan 26, 2025

The size of the zip involving the cuda compilers is pretty unwieldy:

zip_keys:
  -                             # [unix]
    - c_compiler_version        # [unix]
    - cxx_compiler_version      # [unix]
    - fortran_compiler_version  # [unix]
    - cuda_compiler             # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
    - cuda_compiler_version     # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
    - github_actions_labels     # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
    - docker_image              # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True" and os.environ.get("BUILD_PLATFORM", "").startswith("linux-")]

which makes it hard to override locally. There's a general desire to keep the zip complexity down and remove entries here, rather than adding them.
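
To make the pain concrete: conda-build requires every key in a zip group to have a matching number of entries, so a feedstock that only wants to change one of these keys in its recipe/conda_build_config.yaml ends up restating the whole group. A hypothetical local override (the version numbers, label and image below are purely illustrative) looks roughly like

cuda_compiler_version:      # [linux]
  - 12.6                    # [linux]
# ...and because the key above is zipped, all the other members of the
# zip have to be repeated with matching lengths as well:
c_compiler_version:         # [linux]
  - 13                      # [linux]
cxx_compiler_version:       # [linux]
  - 13                      # [linux]
fortran_compiler_version:   # [linux]
  - 13                      # [linux]
cuda_compiler:              # [linux]
  - cuda-nvcc               # [linux]
github_actions_labels:      # [linux]
  - cirun-openstack-gpu-2xlarge  # [linux]
docker_image:               # [linux]
  - quay.io/condaforge/linux-anvil-x86_64:alma9  # [linux]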

And indeed we recently removed cdt_name (48621d1) as well as c_stdlib_version (c7f5973), and we made it easier to override the image version without having to locally override the whole zip (#6626).

#6910 now added another component (github_actions_labels) to the zip, because there are feedstocks that require GPU agents (a very rare resource), and without linking this to the zip that determines non-CUDA vs. CUDA builds, we're stuck with only bad options:

  • duplicated builds in CI matrix
  • very long variant config names, which cause failures on Windows, see conda-forge/pytorch-cpu-feedstock#332 & conda-forge/conda-smithy#2233

The desire for simplicity in the zip is well-founded, but it doesn't trump reality. Keeping these data apart when they are closely connected causes far more problems than a single extra zip entry (aside from the fact that only 5 feedstocks across all of conda-forge are affected and the required changes are trivial).

The CUDA zip overall is IMO a reflection of irreducible complexity that exists in our packaging constraints. However, there are some nice opportunities to further simplify this zip:

  • docker_image can be removed as soon as we drop CUDA 11.8 (because then everything can just run on the alma9 images)
  • {c,cxx,fortran}_compiler_version could be dropped if we accept being limited by the maximum GCC version supported by our lowest-supported CUDA
    • this can obviously be accelerated if we keep building only for the newest CUDA 12.x, as we recently started doing (and assuming that newer nvcc versions keep extending their GCC compatibility).
    • the problem case there is CUDA on PPC, which was dropped in CUDA 12.5; CUDA 12.4 is limited to gcc 12
  • we might even consider turning cuda_compiler into a constant once we drop 11.8 (a sketch follows below)
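
To illustrate that last point: assuming the CUDA 12 compiler package keeps its current name, cuda_compiler would no longer need to live in the zip at all and could become a plain entry in the global pinning:

cuda_compiler:
  - cuda-nvcc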

So in the most realistic scenario (after we drop 11.8, but assuming we keep CUDA on PPC for the foreseeable future), things would look like

zip_keys:
  -                             # [unix]
    - c_compiler_version        # [unix]
    - cxx_compiler_version      # [unix]
    - fortran_compiler_version  # [unix]
    - cuda_compiler             # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
    - cuda_compiler_version     # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
    - github_actions_labels     # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]

which is still 2 entries shorter than it was until last November.

If we did drop CUDA support on PPC and relied only on the latest CUDA (as well as recipes guarding their {{ compiler("cuda") }} correctly with cuda_compiler_version != "None" selectors; a sketch of such a guard follows below), the zip could even be reduced to

zip_keys:
  -                             # [unix]
    - c_compiler_version        # [unix]
    - cxx_compiler_version      # [unix]
    - fortran_compiler_version  # [unix]
  # uncoupled!
  -                             # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
    - cuda_compiler_version     # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
    - github_actions_labels     # [linux and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]

but I'm not sure whether we have the appetite for that. IMO the longer zip is less of an issue than dropping support for a platform.
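
As an aside, the guarding mentioned above is the usual selector pattern on the build requirements, roughly (a sketch, not tied to any particular feedstock):

requirements:
  build:
    - {{ compiler("c") }}
    - {{ compiler("cuda") }}  # [cuda_compiler_version != "None"]

With that guard in place, CPU variants never request nvcc to begin with, which is what would make it safe to decouple cuda_compiler_version from the compiler-version zip.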

isuruf (Member) commented Jan 27, 2025

> duplicated builds in CI matrix

This IMO is a bug.

> very long variant config names, which cause failures on Windows, see
> conda-forge/pytorch-cpu-feedstock#332 & conda-forge/conda-smithy#2233

This is also a bug in conda-smithy. We have

C:\Users\runnerx\_work\pytorch-cpu-feedstock\pytorch-cpu-feedstock\.ci_support\linux_64_blas_implgenericc_compiler_version13channel_targetsconda-forge_maincuda_compilerNonecuda_compiler_versionNonecxx_compiler_version13github_actions_labelscirun-openstack-cpu-2xlargeis_h2ab6b548.yaml

which has 284 characters, while Windows only allows 260.

isuruf (Member) commented Jan 27, 2025

> duplicated builds in CI matrix

I can help you with debugging this if you have a commit I can have a look at.

h-vetinari (Member, Author)

> I can help you with debugging this if you have a commit I can have a look at.

It's just regular smithy behaviour. It could probably be avoided by adding

  skip: true  # [cuda_compiler_version == "None" and github_actions_labels == "cirun-openstack-gpu-2xlarge"]
  skip: true  # [cuda_compiler_version != "None" and github_actions_labels == "cirun-openstack-cpu-2xlarge"]

to every single output, but that becomes another pointless thing to keep in sync.
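
In a multi-output recipe, that duplication would look something like the following sketch (the output names are hypothetical), with the same two lines copy-pasted into every output:

outputs:
  - name: libfoo
    build:
      skip: true  # [cuda_compiler_version == "None" and github_actions_labels == "cirun-openstack-gpu-2xlarge"]
      skip: true  # [cuda_compiler_version != "None" and github_actions_labels == "cirun-openstack-cpu-2xlarge"]
  - name: foo
    build:
      skip: true  # [cuda_compiler_version == "None" and github_actions_labels == "cirun-openstack-gpu-2xlarge"]
      skip: true  # [cuda_compiler_version != "None" and github_actions_labels == "cirun-openstack-cpu-2xlarge"]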

> This is also a bug in conda-smithy. We have

> C:\Users\runnerx\_work\pytorch-cpu-feedstock\pytorch-cpu-feedstock\.ci_support\linux_64_blas_implgenericc_compiler_version13channel_targetsconda-forge_maincuda_compilerNonecuda_compiler_versionNonecxx_compiler_version13github_actions_labelscirun-openstack-cpu-2xlargeis_h2ab6b548.yaml

> which has 284 characters, while Windows only allows 260.

Ah, I was wondering how we got beyond 260 (conda-forge/conda-smithy#2233 uses 200), as that was not at all obvious from the error message. However, if we cut another 24 characters (to get that path <= 260), we'll lose visibility of the GHA label in the path.

To be clear, I'm sure there's some halfway workable alternative to putting github_actions_labels in the zip - but all those hacks are IMO worse than just dealing with the fact that we should be able to determine CPU vs. GPU agents together with the quantity (cuda_compiler_version) that determines whether or not we have a CUDA build.

All in all, the zip got shorter since last November, and it'll get shorter still once we drop CUDA 11.8. I don't see how a cosmetic concern outweighs all the workarounds that become necessary otherwise.

isuruf (Member) commented Jan 27, 2025

> It's just regular smithy behaviour. It could probably be avoided by adding

Here's the recipe I tried, and it skipped Windows just like I intended:

package:
  name: foo-split
  version: 1.0.0

build:
  number: 1
  skip: true   # [win]

outputs:
  - name: foo

  - name: bar
    build:
      skip: true  # [linux]

about:
  home: asd

h-vetinari (Member, Author) commented Jan 27, 2025

> Here's the recipe I tried, and it skipped Windows just like I intended

You could also make a fork of conda-forge/tensorflow-feedstock#414 and figure out what's necessary to get smithy to produce only one set of jobs.

h-vetinari (Member, Author)

@isuruf, did you have some time to look at how to achieve this (e.g. on the tensorflow feedstock)?
