
Switch from manual artifact handling to automated JLLs #1629

Merged: 9 commits into master on Oct 21, 2022

Conversation

@maleadt (Member) commented Oct 17, 2022

This PR switches our manual handling of artifacts (for loading the CUDA toolkit and some of its libraries) to autogenerated JLLs that come directly from Yggdrasil:

  • CUDA_Driver_jll: for loading the driver (libcuda.so), either from the system or from a forward-compatible package
  • CUDA_Runtime_jll: for loading the CUDA toolkit, both libraries (libcublas, libcufft, etc) and some essential binaries (ptxas, nvdisasm) and files (libcudadevrt, libdevice)
  • CUDNN_jll, CUTENSOR_jll, cuQuantum_jll (blame @kshyatt for the change in capitalization): for loading libraries that depend on the CUDA toolkit

The above is driven by a cuda tag that's put in the host platform, indicating which version of the toolkit has been loaded. This then informs downstream packages like CUDNN_jll which artifact to load. I'm not sure we have this exactly figured out yet: CUDNN looks to release 11.x builds that are compatible with, well, every CUDA 11.x toolkit, while NCCL has separate 11.0 and 11.8 builds, and I'm not sure how compatible those are with other versions of the toolkit.
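
To make the tag mechanism concrete, here is a small illustration using Base.BinaryPlatforms (illustrative only: the real selection happens through the JLLs' platform-augmentation hooks, and the tag value shown is just an example):

using Base.BinaryPlatforms

# A platform object carrying a custom "cuda" tag; artifact selection for
# downstream JLLs can key off this tag to pick compatible builds.
p = Platform("x86_64", "linux"; cuda = "11.8")
@show p["cuda"]   # => "11.8"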

This also means that the env vars we used to steer this, JULIA_CUDA_VERSION and JULIA_CUDA_USE_BINARYBUILDER, have been removed. Instead, the same can be accomplished through Preferences on CUDA_Runtime_jll: the version can be conveniently set using CUDA.set_runtime_version!, while using a local toolkit requires overrides for each product path (for which there's a helper script in deps/local.jl).

Finally, we're doing this not only to simplify things, but also because it makes it possible to build arbitrary binaries that have CUDA dependencies and simply load them by depending on and importing CUDA_Runtime_jll at run time.
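
In terms of the user-facing workflow, that boils down to something like the following (a sketch: the "local" form is confirmed later in this thread, while the VersionNumber form is assumed for illustration):

using CUDA

# Pin the artifact-provided toolkit to a specific version; this writes a
# preference for CUDA_Runtime_jll and takes effect after restarting Julia:
CUDA.set_runtime_version!(v"11.8")

# Or opt out of artifacts entirely and use a local/system toolkit:
CUDA.set_runtime_version!("local")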

@maleadt added the installation (CUDA is easy to install, right?) label on Oct 17, 2022
@codecov bot commented Oct 17, 2022

Codecov Report

Base: 61.02% // Head: 61.46% // Increases project coverage by +0.44% 🎉

Coverage data is based on head (96a14e9) compared to base (1c2245d).
Patch coverage: 39.09% of modified lines in pull request are covered.

❗ Current head 96a14e9 differs from pull request most recent head 4eede5c. Consider uploading reports for the commit 4eede5c to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1629      +/-   ##
==========================================
+ Coverage   61.02%   61.46%   +0.44%     
==========================================
  Files         152      151       -1     
  Lines       11240    11288      +48     
==========================================
+ Hits         6859     6938      +79     
+ Misses       4381     4350      -31     
| Impacted Files | Coverage Δ |
|---|---|
| lib/cublas/CUBLAS.jl | 47.82% <ø> (-28.29%) ⬇️ |
| lib/cudnn/src/CUDNN.jl | 0.00% <0.00%> (ø) |
| lib/cudnn/src/libcudnn.jl | 0.00% <ø> (ø) |
| lib/cudnn/src/libcudnn_deprecated.jl | 0.00% <0.00%> (ø) |
| lib/cudnn/test/runtests.jl | 0.00% <ø> (ø) |
| lib/cufft/CUFFT.jl | 100.00% <ø> (ø) |
| lib/cupti/CUPTI.jl | 100.00% <ø> (ø) |
| lib/curand/CURAND.jl | 100.00% <ø> (ø) |
| lib/cusparse/CUSPARSE.jl | 78.26% <ø> (+0.98%) ⬆️ |
| lib/custatevec/src/CUSTATEVEC.jl | 0.00% <0.00%> (ø) |

... and 55 more


@maleadt changed the title from "JLL conversion." to "WIP: JLL conversion." on Oct 17, 2022
@maleadt marked this pull request as ready for review on October 17, 2022 17:47
@maleadt force-pushed the tb/jll branch 20 times, most recently from 2e80b87 to 0562bce on October 20, 2022 08:35
@maleadt changed the title from "WIP: JLL conversion." to "Switch from manual artifact handling to automated JLLs" on Oct 20, 2022
@maleadt force-pushed the tb/jll branch 2 times, most recently from c99d0c9 to bc04c17 on October 20, 2022 12:52
@sloede commented Oct 20, 2022

@maleadt It would be great if you could give a heads-up before this is merged and/or tagged in a new release, to give supercomputer centers a chance to prepare their module files in time (e.g., @omlins, @carstenbauer). It would also be great if there were a transition guide for both users and cluster operators, so that the latter know how to prepare their module files appropriately (perhaps similar to https://juliaparallel.org/MPI.jl/stable/configuration/#Notes-to-HPC-cluster-administrators).

In any case, it's great to see that choosing the appropriate backend libraries becomes easier 👍

@carstenbauer (Member) commented Oct 21, 2022

Thanks a lot @maleadt, I will test things later today (as soon as I can find time for it).

@carstenbauer (Member) commented Oct 21, 2022

Tried it and seems to work just fine. Thanks again!

> Do note that the result of this is that setting the toolkit to "local" explicitly opts out of using any CUDA-related JLL, because those need to be resolved at compile time.

Besides version="local", is it still possible to override the artifact product paths via preferences explicitly (like previously in this PR)? I take it that if one does this (accepting the unnecessary download of artifacts) your comment doesn't apply and other CUDA-related JLLs would work, correct? (TBC, I wouldn't mind if this doesn't work anymore and one has to make a clean decision: either local CUDA or JLL CUDA.)

@maleadt (Member, Author) commented Oct 21, 2022

> Besides version="local", is it still possible to override the artifact product paths via preferences explicitly (like previously in this PR)?

Yes, that's a feature of the JLLWrappers-generated code: you just set preferences like libcublasrt_path on CUDA_Runtime_jll (i.e., what deps/local.jl used to generate) without setting the version to "local". I wouldn't recommend it though, because (1) it is then possible to only partially override an artifact (or to do so unintentionally when you forget to override a product), and (2) it may result in the CUDA platform tag (which contains the toolkit version and is determined based on the CUDA driver) differing from the actual version being used, so that incompatible artifacts get loaded by other CUDA-related JLLs. You can avoid the latter by also setting the version preference, which overrides the platform tag.
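
For illustration, such per-product overrides in a LocalPreferences.toml might look like the following (a sketch only: the preference names mirror the JLL's product names, and these particular names and paths are hypothetical):

[CUDA_Runtime_jll]
# also pin the version, so the platform tag matches the local toolkit:
version = "11.8"
# per-product path overrides, one for each product the JLL defines:
libcudart_path = "/opt/cuda-11.8/lib64/libcudart.so"
ptxas_path = "/opt/cuda-11.8/bin/ptxas"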

So it's possible, but finicky. The alternative is that code that uses JLLs does something like:

# Fall back to looking up libcudnn on the system library search path
# when the CUDA runtime JLL (and thus CUDNN_jll) is unavailable:
if CUDA_Runtime_jll.is_available()
    using CUDNN_jll
else
    const libcudnn = "libcudnn"
end

@maleadt merged commit 014fcef into master on Oct 21, 2022
@maleadt deleted the tb/jll branch on October 21, 2022 18:41
@JBlaschke commented Oct 24, 2022

> I'm not sure. Very few users are going to need this, only sysadmins who will need to run this script once and save the preference somewhere globally, so this isn't intended to be a user-facing API.

I disagree! Strongly! Our admin time at NERSC is very expensive, so introducing toil such as this and shrugging it off as "just let the admins do it" makes me mad (maybe because I'm an admin).

It's worth noting that we like to enable as much flexibility in our setup as possible, so I very much like the idea of "here is an intuitive API, have at it" in addition to sensibly defined defaults in a global location.

@JBlaschke commented Oct 24, 2022

> Yeah, but it was a pain to maintain, and made it impossible to integrate with the rest of the Julia binary wrappers ecosystem. You win some you lose some *shrug*

@maleadt I'll be honest -- I am miffed. Not because of this change, but the fact that it was introduced very quickly -- practically without warning considering the software release cycle on big supercomputers.

Have you heard about deprecation warnings? I guess sometimes you win some, and sometimes you lose support from the DOE 🤷

@JBlaschke commented Oct 24, 2022

> I feel you, I really do. And I would say that it's fine to lose some of the comfort if that makes the maintenance much easier. But IMO allowing users to opt out of (parts of) the artifact story is really important and shouldn't be made unnecessarily hard or neglected entirely.

@carstenbauer You put it much nicer than me (you probably got a nice cup of mint tea before typing your reply) -- much appreciated. Thanks to the others in this thread (@sloede and @ViralBShah) who were on top of this and triaged this PR before I could get to checking my messages.

@maleadt I also appreciate that you're listening to the HPC community. In case some of my more flamboyant communication style is perceived as aggressive, let me take the opportunity to thank you for attempting to balance competing interests.

RE my remarks about losing DOE support over this -- I want to share a scenario with @maleadt: I'm not exaggerating when I say that I am currently the only one on the NERSC staff devoting a considerable amount of time to supporting -- and advocating for -- Julia. If we don't settle on a sensible low-effort solution to use the system binaries, some of this support will look like: "Here's a script. Run it. And if it doesn't work, heck, then I don't know either... try Tim". Worse still, management may ask: "how much staff effort would we need to invest in supporting Julia?". If I said "2 to 3 FTEs", then there is a real risk that they'll decide that we can't afford to support Julia.

@JBlaschke commented

> I feel disk space wasting is less than ideal - but in the grand scheme of things not a real issue. All of this is a tiny fraction of the actual usage on an HPC cluster anyways.

@ViralBShah I disagree -- users readily bump into the limits of their quota (we have 8000 users, so even a modest quota would scale up to obscene amounts when considering the full file system) thanks to Anaconda. This can be pretty painful for users, as they are forced to triage space on the global fast file systems.

@JBlaschke commented

> Maybe open new issues to track the inconveniences for HPC clusters?

@ViralBShah Please do!

@JBlaschke commented Oct 24, 2022

https://cuda.juliagpu.org/stable/installation/overview/#Local-installation doesn't seem to have the "new" way of doing this and needs updating. I guess it has to be regenerated to bring it into line with https://github.com/JuliaGPU/CUDA.jl/blob/master/docs/src/installation/overview.md#using-a-local-cuda? (I actually don't know how the pages are rendered, so I wanted to bring this up here.)

There have been some frayed nerves on my end -- so I will ask @maleadt here: if an HPC center wanted to use the local CUDA install by default, am I correct in understanding that that's handled using a global Preferences.toml containing:

[CUDA_Runtime_jll]
version = "local"

@maleadt (Member, Author) commented Oct 24, 2022

@JBlaschke Let's stay level-headed and try and improve CUDA.jl in a way that suits all users. I'm only going to respond to the strictly technical portions of your comments, since I, as you correctly presumed, do not appreciate the general tone.

> I am miffed. Not because of this change, but the fact that it was introduced very quickly -- practically without warning considering the software release cycle on big supercomputers.

This PR was merged relatively quickly for technical reasons (other PRs easily causing conflicts). As noted above, a release of these changes is still quite some time off, and will be accompanied by a breaking version release so it should not impact user code that relies on current behavior. So there is still ample time to improve.

> Have you heard about deprecation warnings?

I have. It's not possible to deprecate the entire code loading mechanism, hence the breaking release.

> If we don't settle on a sensible low-effort solution to use the system binaries [...]

What do you not like about the new API, then? In case you missed the changes I made after Carsten's comments: it's now just a matter of calling CUDA.set_runtime_version!("local") and re-starting Julia, which seems almost identical to setting an environment variable and re-starting Julia.

> > I feel disk space wasting is less than ideal - but in the grand scheme of things not a real issue. All of this is a tiny fraction of the actual usage on an HPC cluster anyways.
>
> @ViralBShah I disagree -- users readily bump into the limits of their quota (we have 8000 users, so even a modest quota would scale up to obscene amounts when considering the full file system) thanks to Anaconda. This can be pretty painful for users, as they are forced to triage space on the global fast file systems.

Again, maybe you missed my latest changes, but CUDA.jl does not needlessly download any artifacts anymore if the version is set to "local". Since that's done by a preference, this could be globally installed (e.g. provided by the cluster's module system). If that doesn't work, or needs improvements on the Preferences.jl side, please open specific issues.

> https://cuda.juliagpu.org/stable/installation/overview/#Local-installation doesn't seem to have the "new" way of doing this and needs updating.

That's expected; note the stable in the URL. If you click the dev docs button in the README instead, you end up on the development version of the documentation (again, because this is part of an unreleased version), which contains the updated content: https://cuda.juliagpu.org/dev/installation/overview/#CUDA-toolkit

> If an HPC center wanted to use the local CUDA install by default, am I correct in understanding that that's handled using a global Preferences.toml containing:
>
> [CUDA_Runtime_jll]
> version = "local"

Correct, that's how it's intended to work. So please test this out, and if there are issues, I'd be happy to help you resolve them or further improve CUDA.jl so that it works well for HPC users.
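
For example, a site could ship that preference in an environment that its module system puts on every user's JULIA_LOAD_PATH (a sketch with a hypothetical path; note that Preferences.jl generally requires the environment's Project.toml to list CUDA_Runtime_jll, e.g. under [extras], for the preference to be picked up):

# /opt/site/julia/env/LocalPreferences.toml, with /opt/site/julia/env
# added to every user's JULIA_LOAD_PATH by the cluster's module system:
[CUDA_Runtime_jll]
version = "local"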

@ViralBShah (Contributor) commented

@JBlaschke Thanks for chiming in, and thanks to @maleadt for incorporating all the feedback.

@JBlaschke commented

@maleadt I think I was being unfair to you, and my criticism was too harsh, so I want to make a public apology. Now that I understand the context a little better -- and that there are safeguards in the package system that would have prevented a silent failure -- it's clear that I overreacted.

@JBlaschke commented

I think this is in good shape. To summarize various out-of-thread discussions:

  1. The HPC centers might need to implement a test in their CI that checks whether artifact overriding works.
  2. It is absolutely crucial to include an "escape hatch" for compiled artifacts (at least until we can force system integrators to work with unpatched libraries). I think it's crucial to make sure that this works with as little toil as possible. Otherwise we run the risk of our own two-language problem (Julia for HPC, and the rest).
  3. Run-time library resolution runs into problems as APIs change -- and therefore requires recompilation. I think it's important to have this nevertheless as a safety net. @vchuravy pointed out that we could also test the idea of using Lmod to generate preference files (which would be generated whenever the cudatoolkit etc. modules are generated). cc @sloede @carstenbauer
  4. version = "local" is a good fallback solution. @maleadt, we might want to track HPC compatibility issues and where to go from here in a dedicated issue, like @ViralBShah said.

@JBlaschke commented

I think there is one last issue @maleadt -- looking at

# warn about old, deprecated environment variables
haskey(ENV, "JULIA_CUDA_USE_BINARYBUILDER") &&
    @error """JULIA_CUDA_USE_BINARYBUILDER is deprecated, and CUDA.jl always uses artifacts now.
              To use a local installation, use overrides or preferences to customize the artifact.
              Please check the CUDA.jl or Pkg.jl documentation for more details."""
haskey(ENV, "JULIA_CUDA_VERSION") &&
    @error """JULIA_CUDA_VERSION is deprecated. Call `CUDA.set_runtime_version!` to use a different version instead."""
an error is emitted when the env var is set. We use the approach of setting configurations globally, so that users have "good" defaults and are free to install their own variants if needed. What do you think the solution should be when different users have different CUDA.jl versions, or even the same user has multiple versions of CUDA.jl?

One option could be to make a module that switches from the env var to Preferences.toml. But I would prefer to require as few "special steps" from users as possible. Any thoughts?

@vchuravy (Member) commented

Yeah, maybe don't warn if the preference is set to "local".
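
A minimal sketch of that gating (an assumption of how it could look, using Preferences.jl's load_preference; not the actual implementation):

using Preferences, CUDA_Runtime_jll

# Only warn about the deprecated env var when the preference doesn't
# already pin the runtime to the local toolkit:
if haskey(ENV, "JULIA_CUDA_VERSION") &&
   load_preference(CUDA_Runtime_jll, "version") != "local"
    @error "JULIA_CUDA_VERSION is deprecated. Call `CUDA.set_runtime_version!` instead."
end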

Labels: breaking, installation (CUDA is easy to install, right?)
6 participants