
Switch from manual artifact handling to automated JLLs #1629

Merged: 9 commits into master on Oct 21, 2022

Conversation

@maleadt (Member) commented Oct 17, 2022

This PR switches our manual handling of artifacts (for loading the CUDA toolkit and some of its libraries) to autogenerated JLLs that come directly from Yggdrasil:

  • CUDA_Driver_jll: for loading the driver (libcuda.so), either from the system or from a forward-compatible package
  • CUDA_Runtime_jll: for loading the CUDA toolkit, both libraries (libcublas, libcufft, etc) and some essential binaries (ptxas, nvdisasm) and files (libcudadevrt, libdevice)
  • CUDNN_jll, CUTENSOR_jll, cuQuantum_jll (blame @kshyatt for the change in capitalization): for loading libraries that depend on the CUDA toolkit

The above is driven by a cuda tag that's put in the host platform, indicating which version of the toolkit has been loaded. This then informs downstream packages like CUDNN_jll which artifact to load. I'm not sure we have this exactly figured out yet: CUDNN looks to release 11.x builds that are compatible with, well, every CUDA 11.x toolkit, while NCCL has separate 11.0 and 11.8 builds, and I'm not sure how compatible those are with other versions of the toolkit.
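
To make the tag mechanism concrete, here is a small illustration using Base.BinaryPlatforms (illustrative only: the real selection happens through the JLLs' platform-augmentation hooks, and the tag value shown is just an example):

using Base.BinaryPlatforms

# A platform object carrying a custom "cuda" tag; artifact selection for
# downstream JLLs can key off this tag to pick compatible builds.
p = Platform("x86_64", "linux"; cuda = "11.8")
@show p["cuda"]   # => "11.8"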

This also means that the env vars we used to steer this, JULIA_CUDA_VERSION and JULIA_CUDA_USE_BINARYBUILDER, have been removed. Instead, the same can be accomplished through Preferences on CUDA_Runtime_jll: the version can be conveniently set using CUDA.set_runtime_version!, while using a local toolkit requires overrides for each product path (for which there's a helper script in deps/local.jl).

Finally, we're doing this not only to simplify things, but also because it makes it possible to build arbitrary binaries that have CUDA dependencies and simply load them by depending on and importing CUDA_Runtime_jll at run time.
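
In terms of the user-facing workflow, that boils down to something like the following (a sketch: the "local" form is confirmed later in this thread, while the VersionNumber form is assumed for illustration):

using CUDA

# Pin the artifact-provided toolkit to a specific version; this writes a
# preference for CUDA_Runtime_jll and takes effect after restarting Julia:
CUDA.set_runtime_version!(v"11.8")

# Or opt out of artifacts entirely and use a local/system toolkit:
CUDA.set_runtime_version!("local")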

@maleadt added the installation (CUDA is easy to install, right?) label on Oct 17, 2022
@codecov bot commented Oct 17, 2022

Codecov Report

Base: 61.02% // Head: 61.46% // Increases project coverage by +0.44% 🎉

Coverage data is based on head (96a14e9) compared to base (1c2245d).
Patch coverage: 39.09% of modified lines in pull request are covered.

❗ Current head 96a14e9 differs from pull request most recent head 4eede5c. Consider uploading reports for the commit 4eede5c to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1629      +/-   ##
==========================================
+ Coverage   61.02%   61.46%   +0.44%     
==========================================
  Files         152      151       -1     
  Lines       11240    11288      +48     
==========================================
+ Hits         6859     6938      +79     
+ Misses       4381     4350      -31     
| Impacted Files | Coverage Δ |
|---|---|
| lib/cublas/CUBLAS.jl | 47.82% <ø> (-28.29%) ⬇️ |
| lib/cudnn/src/CUDNN.jl | 0.00% <0.00%> (ø) |
| lib/cudnn/src/libcudnn.jl | 0.00% <ø> (ø) |
| lib/cudnn/src/libcudnn_deprecated.jl | 0.00% <0.00%> (ø) |
| lib/cudnn/test/runtests.jl | 0.00% <ø> (ø) |
| lib/cufft/CUFFT.jl | 100.00% <ø> (ø) |
| lib/cupti/CUPTI.jl | 100.00% <ø> (ø) |
| lib/curand/CURAND.jl | 100.00% <ø> (ø) |
| lib/cusparse/CUSPARSE.jl | 78.26% <ø> (+0.98%) ⬆️ |
| lib/custatevec/src/CUSTATEVEC.jl | 0.00% <0.00%> (ø) |

... and 55 more


@maleadt changed the title from "JLL conversion." to "WIP: JLL conversion." on Oct 17, 2022
@maleadt marked this pull request as ready for review on October 17, 2022 17:47
@maleadt force-pushed the tb/jll branch 20 times, most recently from 2e80b87 to 0562bce on October 20, 2022 08:35
@maleadt changed the title from "WIP: JLL conversion." to "Switch from manual artifact handling to automated JLLs" on Oct 20, 2022
@maleadt force-pushed the tb/jll branch 2 times, most recently from c99d0c9 to bc04c17 on October 20, 2022 12:52
@sloede commented Oct 20, 2022

@maleadt It would be great if you could give a heads-up before this is merged and/or tagged in a new release, to give supercomputer centers a chance to prepare their module files in time (e.g., @omlins, @carstenbauer). It would also be great if there were a transition guide for both users and cluster operators, so that the latter know how to prepare their module files appropriately (perhaps similar to https://juliaparallel.org/MPI.jl/stable/configuration/#Notes-to-HPC-cluster-administrators).

In any case, it's great to see that choosing the appropriate backend libraries becomes easier 👍

@carstenbauer (Member) commented Oct 21, 2022

Thanks a lot @maleadt, I will test things later today (as soon as I can find time for it).

@carstenbauer (Member) commented Oct 21, 2022

Tried it and seems to work just fine. Thanks again!

> Do note that the result of this is that setting the toolkit to "local" explicitly opts out of using any CUDA-related JLL, because those need to be resolved at compile time.

Besides version="local", is it still possible to override the artifact product paths via preferences explicitly (like previously in this PR)? I take it that if one does this (accepting the unnecessary download of artifacts) your comment doesn't apply and other CUDA-related JLLs would work, correct? (TBC, I wouldn't mind if this doesn't work anymore and one has to make a clean decision: either local CUDA or JLL CUDA.)

@maleadt (Member, Author) commented Oct 21, 2022

> Besides version="local", is it still possible to override the artifact product paths via preferences explicitly (like previously in this PR)?

Yes, that's a feature of the JLLWrappers-generated code: you just set preferences like libcublasrt_path on CUDA_Runtime_jll (i.e., what deps/local.jl used to generate) without setting the version to "local". I wouldn't recommend it though, because (1) it is then possible to only partially override an artifact (or to do so unintentionally when you forget to override a product), and (2) it may result in the CUDA platform tag (which contains the toolkit version and is determined based on the CUDA driver) differing from the actual version being used, so that incompatible artifacts get loaded by other CUDA-related JLLs. You can avoid the latter by also setting the version preference, which overrides the platform tag.
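
For illustration, such per-product overrides in a LocalPreferences.toml might look like the following (a sketch only: the preference names mirror the JLL's product names, and these particular names and paths are hypothetical):

[CUDA_Runtime_jll]
# also pin the version, so the platform tag matches the local toolkit:
version = "11.8"
# per-product path overrides, one for each product the JLL defines:
libcudart_path = "/opt/cuda-11.8/lib64/libcudart.so"
ptxas_path = "/opt/cuda-11.8/bin/ptxas"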

So it's possible, but finicky. The alternative is that code that uses JLLs does something like:

# Fall back to looking up libcudnn on the system library search path
# when the CUDA runtime JLL (and thus CUDNN_jll) is unavailable:
if CUDA_Runtime_jll.is_available()
    using CUDNN_jll
else
    const libcudnn = "libcudnn"
end

@maleadt merged commit 014fcef into master on Oct 21, 2022
@maleadt deleted the tb/jll branch on October 21, 2022 18:41
@JBlaschke commented Oct 24, 2022

> I'm not sure. Very few users are going to need this, only sysadmins who will need to run this script once and save the preference somewhere globally, so this isn't intended to be a user-facing API.

I disagree! Strongly! Our admin time at NERSC is very expensive, so introducing toil such as this and shrugging it off as "just let the admins do it" makes me mad (maybe because I'm an admin).

It's worth noting that we like to enable as much flexibility in our setup as possible, so I very much like the idea of "here is an intuitive API, have at it" in addition to sensibly defined defaults in a global location.

@JBlaschke commented Oct 24, 2022

> Yeah, but it was a pain to maintain, and made it impossible to integrate with the rest of the Julia binary wrappers ecosystem. You win some you lose some *shrug*

@maleadt I'll be honest -- I am miffed. Not because of this change, but the fact that it was introduced very quickly -- practically without warning considering the software release cycle on big supercomputers.

Have you heard about deprecation warnings? I guess sometimes you win some, and sometimes you lose support from the DOE 🤷

@JBlaschke commented Oct 24, 2022

> I feel you, I really do. And I would say that it's fine to lose some of the comfort if that makes the maintenance much easier. But IMO allowing users to opt out of (parts of) the artifact story is really important and shouldn't be made unnecessarily hard or neglected entirely.

@carstenbauer You put it much nicer than me (you probably got a nice cup of mint tea before typing your reply) -- much appreciated. Thanks to the others in this thread (@sloede and @ViralBShah) who were on top of this and triaged this PR before I could get to checking my messages.

@maleadt I also appreciate that you're listening to the HPC community. In case some of my more flamboyant communication style is perceived as aggressive, let me take the opportunity to thank you for attempting to balance competing interests.

RE my remarks about losing DOE support over this -- I want to share a scenario with @maleadt: I'm not exaggerating when I say that I am currently the only one on the NERSC staff devoting a considerable amount of time to supporting -- and advocating for -- Julia. If we don't settle on a sensible low-effort solution to use the system binaries, some of this support will look like: "Here's a script. Run it. And if it doesn't work, heck, then I don't know either... try Tim". Worse still, management may ask: "how much staff effort would we need to invest in supporting Julia?". If I said "2 to 3 FTEs", then there is a real risk that they'll decide that we can't afford to support Julia.

@JBlaschke commented

> I feel disk space wasting is less than ideal - but in the grand scheme of things not a real issue. All of this is a tiny fraction of the actual usage on an HPC cluster anyways.

@ViralBShah I disagree -- users readily bump into the limits of their quota (we have 8000 users, so even a modest quota would scale up to obscene amounts when considering the full file system) thanks to Anaconda. This can be pretty painful for users, as they are forced to triage space on the global fast file systems.

@JBlaschke commented

> Maybe open new issues to track the inconveniences for HPC clusters?

@ViralBShah Please do!

@JBlaschke commented Oct 24, 2022

https://cuda.juliagpu.org/stable/installation/overview/#Local-installation doesn't seem to have the "new" way of doing this and needs updating. I guess it has to be regenerated to bring it into line with https://github.com/JuliaGPU/CUDA.jl/blob/master/docs/src/installation/overview.md#using-a-local-cuda? (I actually don't know how the pages are rendered, so I wanted to bring this up here.)

There have been some frayed nerves on my end -- so I will ask @maleadt here: if an HPC center wanted to use the local CUDA install by default, am I correct in understanding that that's handled using a global Preferences.toml containing:

[CUDA_Runtime_jll]
version = "local"

@maleadt (Member, Author) commented Oct 24, 2022

@JBlaschke Let's stay level-headed and try and improve CUDA.jl in a way that suits all users. I'm only going to respond to the strictly technical portions of your comments, since I, as you correctly presumed, do not appreciate the general tone.

> I am miffed. Not because of this change, but the fact that it was introduced very quickly -- practically without warning considering the software release cycle on big supercomputers.

This PR was merged relatively quickly for technical reasons (other PRs easily causing conflicts). As noted above, a release of these changes is still quite some time off, and will be accompanied by a breaking version release so it should not impact user code that relies on current behavior. So there is still ample time to improve.

> Have you heard about deprecation warnings?

I have. It's not possible to deprecate the entire code loading mechanism, hence the breaking release.

> If we don't settle on a sensible low-effort solution to use the system binaries [...]

What do you not like about the new API, then? In case you missed the changes I made after Carsten's comments: it's now just a matter of calling CUDA.set_runtime_version!("local") and re-starting Julia, which seems almost identical to setting an environment variable and re-starting Julia.

> > I feel disk space wasting is less than ideal - but in the grand scheme of things not a real issue. All of this is a tiny fraction of the actual usage on an HPC cluster anyways.
>
> @ViralBShah I disagree -- users readily bump into the limits of their quota (we have 8000 users, so even a modest quota would scale up to obscene amounts when considering the full file system) thanks to Anaconda. This can be pretty painful for users, as they are forced to triage space on the global fast file systems.

Again, maybe you missed my latest changes, but CUDA.jl does not needlessly download any artifacts anymore if the version is set to "local". Since that's done by a preference, this could be globally installed (e.g. provided by the cluster's module system). If that doesn't work, or needs improvements on the Preferences.jl side, please open specific issues.

> https://cuda.juliagpu.org/stable/installation/overview/#Local-installation doesn't seem to have the "new" way of doing this and needs updating.

That's expected; note the stable in the URL. If you click the dev docs button in the README instead, you end up on the development version of the documentation (again, because this is part of an unreleased version), which contains the updated content: https://cuda.juliagpu.org/dev/installation/overview/#CUDA-toolkit

> If an HPC center wanted to use the local CUDA install by default, am I correct in understanding that that's handled using a global Preferences.toml containing:
>
> [CUDA_Runtime_jll]
> version = "local"

Correct, that's how it's intended to work. So please test this out, and if there are issues, I'd be happy to help you resolve them or further improve CUDA.jl so that it works well for HPC users.
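
For example, a site could ship that preference in an environment that its module system puts on every user's JULIA_LOAD_PATH (a sketch with a hypothetical path; note that Preferences.jl generally requires the environment's Project.toml to list CUDA_Runtime_jll, e.g. under [extras], for the preference to be picked up):

# /opt/site/julia/env/LocalPreferences.toml, with /opt/site/julia/env
# added to every user's JULIA_LOAD_PATH by the cluster's module system:
[CUDA_Runtime_jll]
version = "local"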

@ViralBShah (Contributor) commented

@JBlaschke Thanks for chiming in, and thanks to @maleadt for incorporating all the feedback.

@JBlaschke commented

@maleadt I think I was being unfair to you, and my criticism was too harsh, so I want to make a public apology. Now that I understand the context a little better -- and that there are safeguards in the package system that would have prevented a silent failure -- it's clear that I overreacted.

@JBlaschke commented

I think this is in good shape. To summarize various out-of-thread discussions:

  1. The HPC centers might need to implement a test in their CI that checks whether artifact overriding works.
  2. It is absolutely crucial to include an "escape hatch" for compiled artifacts (at least until we can force system integrators to work with unpatched libraries). I think it's crucial to make sure that this works with as little toil as possible. Otherwise we run the risk of our own two-language problem (Julia for HPC, and the rest).
  3. Run-time library resolution runs into problems as APIs change -- and therefore requires recompilation. I think it's important to have this nevertheless as a safety net. @vchuravy pointed out that we could also test the idea of using Lmod to generate preference files (which would be generated whenever the cudatoolkit etc. modules are generated). cc @sloede @carstenbauer
  4. version = "local" is a good fallback solution. @maleadt, we might want to track HPC compatibility issues and where to go from here in a dedicated issue, like @ViralBShah said.

@JBlaschke commented

I think there is one last issue @maleadt -- looking at

# warn about old, deprecated environment variables
haskey(ENV, "JULIA_CUDA_USE_BINARYBUILDER") &&
    @error """JULIA_CUDA_USE_BINARYBUILDER is deprecated, and CUDA.jl always uses artifacts now.
              To use a local installation, use overrides or preferences to customize the artifact.
              Please check the CUDA.jl or Pkg.jl documentation for more details."""
haskey(ENV, "JULIA_CUDA_VERSION") &&
    @error """JULIA_CUDA_VERSION is deprecated. Call `CUDA.set_runtime_version!` to use a different version instead."""
an error is emitted when the env var is set. We use the approach of setting configurations globally, so that users have "good" defaults and are free to install their own variants if needed. What do you think the solution should be when different users have different CUDA.jl versions, or even the same user has multiple versions of CUDA.jl?

One option could be to make a module that switches from the env var to Preferences.toml. But I would prefer to require as few "special steps" from users as possible. Any thoughts?

@vchuravy (Member) commented

Yeah, maybe don't warn if the preference is set to "local".
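
A minimal sketch of that gating (an assumption of how it could look, using Preferences.jl's load_preference; not the actual implementation):

using Preferences, CUDA_Runtime_jll

# Only warn about the deprecated env var when the preference doesn't
# already pin the runtime to the local toolkit:
if haskey(ENV, "JULIA_CUDA_VERSION") &&
   load_preference(CUDA_Runtime_jll, "version") != "local"
    @error "JULIA_CUDA_VERSION is deprecated. Call `CUDA.set_runtime_version!` instead."
end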

Labels: breaking, installation (CUDA is easy to install, right?)
6 participants