Switch from manual artifact handling to automated JLLs #1629
Conversation
Codecov Report
Base: 61.02% // Head: 61.46% // Increases project coverage by +0.44%.

@@            Coverage Diff             @@
##           master    #1629      +/-   ##
==========================================
+ Coverage   61.02%   61.46%   +0.44%
==========================================
  Files         152      151       -1
  Lines       11240    11288      +48
==========================================
+ Hits         6859     6938      +79
+ Misses       4381     4350      -31

☔ View full report at Codecov.
@maleadt It would be great if you could give a heads-up before this is merged and/or tagged in a new release, to give supercomputer centers a chance to prepare their module files in time, e.g., @omlins @carstenbauer. Also, it would be great if there were a transition guide for both users and cluster operators, so that the latter know how to prepare their module files appropriately (maybe similar to https://juliaparallel.org/MPI.jl/stable/configuration/#Notes-to-HPC-cluster-administrators). In any case, it's great to see that choosing the appropriate backend libraries is becoming easier 👍
Thanks a lot @maleadt, I will test things later today (as soon as I can find time for it).
Tried it and seems to work just fine. Thanks again!
Besides …
Yes, that's a feature of the JLLWrappers-generated code; you just need to set the preferences accordingly. So it's possible, but finicky. The alternative is that code that uses JLLs does something like:
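The actual snippet was elided from this transcript; as a rough sketch of the pattern being described (the `is_available()` check is generated by JLLWrappers for every JLL package; the product name `ptxas_path` is used here only for illustration), downstream code could guard its use of a JLL like this:

```julia
# Sketch only: guard use of a JLL behind its availability check.
# JLLWrappers-generated packages expose `is_available()`, which returns
# false when no artifact matches the host platform (e.g. no `cuda` tag).
using CUDA_Runtime_jll

if CUDA_Runtime_jll.is_available()
    # Safe to use the products, e.g. the path to the ptxas binary.
    @info "CUDA runtime loaded" CUDA_Runtime_jll.ptxas_path
else
    @warn "CUDA runtime not available; GPU functionality disabled"
end
```

This keeps packages loadable on systems without a matching artifact, at the cost of every consumer having to remember the guard.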
I disagree! Strongly! Our admin time at NERSC is very expensive, so introducing toil such as this and shrugging it off as "just let the admins do it" makes me mad (maybe because I'm an admin). It's worth noting that we like to enable as much flexibility in our setup as possible, so I very much like the idea of "here is an intuitive API, have at it" in addition to sensibly defined defaults in a global location.
@maleadt I'll be honest -- I am miffed. Not because of this change, but because it was introduced very quickly -- practically without warning, considering the software release cycle on big supercomputers. Have you heard about deprecation warnings? I guess sometimes you win some, and sometimes you lose support from the DOE 🤷
@carstenbauer You put it much nicer than me (probably got a nice cup of mint tea before typing your reply) -- much appreciated. Thanks to the others in this thread (@sloede and @ViralBShah) who were on top of this and triaged this PR before I could get to checking my messages. @maleadt I also appreciate that you're listening to the HPC community. In case some of my more flamboyant communication style is perceived as aggressive, let me take the opportunity to thank you for attempting to balance competing interests.

RE my remarks about losing DOE support over this -- I want to share with @maleadt a scenario: I'm not exaggerating when I say that I am currently the only one on the NERSC staff devoting a considerable amount of time to supporting -- and advocating for -- Julia. If we don't settle on a sensible low-effort solution for using the system binaries, some of this support will look like: "Here's a script. Run it. And if it doesn't work, heck, then I don't know either ... try Tim". Worse still, management may ask: "How much staff effort would we need to invest in supporting Julia?" If I said "2 to 3 FTEs", then there is a real risk that they'll decide that we can't afford to support Julia.
@ViralBShah I disagree -- users readily bump into the limits of their quota (we have 8000 users, so even a modest quota would scale up to obscene amounts when considering the full file system) thanks to anaconda. This can be pretty painful for users, as they are forced to triage space on the global fast file systems.
@ViralBShah Please do!
https://cuda.juliagpu.org/stable/installation/overview/#Local-installation doesn't seem to have the "new" way of doing this and needs updating. I guess it has to be regenerated to bring it into line with https://github.com/JuliaGPU/CUDA.jl/blob/master/docs/src/installation/overview.md#using-a-local-cuda? (I actually don't know how the pages are rendered, so I wanted to bring this up here.) There have been some frayed nerves on my end, so I will ask @maleadt here: if an HPC center wanted to use the local CUDA install by default, am I correct in understanding that that's handled using a global

[CUDA_Runtime_jll]
version = "local"

preference?
@JBlaschke Let's stay level-headed and try and improve CUDA.jl in a way that suits all users. I'm only going to respond to the strictly technical portions of your comments, since I, as you correctly presumed, do not appreciate the general tone.
This PR was merged relatively quickly for technical reasons (other PRs easily causing conflicts). As noted above, a release of these changes is still quite some time off, and will be accompanied by a breaking version release so it should not impact user code that relies on current behavior. So there is still ample time to improve.
I have. It's not possible to deprecate the entire code loading mechanism, hence the breaking release.
What do you not like about the new API then? In case you missed the changes I did after Carsten's comments, it's now just a matter of calling CUDA.set_runtime_version!
Again, maybe you missed my latest changes, but CUDA.jl does not needlessly download any artifacts anymore if the version is set to "local".
That's expected; note the …
Correct, that's how it's intended to work. So please test this out, and if there are issues, I'd be happy to help you resolve them or further improve CUDA.jl so that it works well for HPC users.
@JBlaschke Thanks for chiming in, and thanks to @maleadt for incorporating all the feedback.
@maleadt I think I was being unfair to you, and my criticism was too harsh. So I want to make a public apology. Now that I understand the context a little better -- and that there are safeguards in the package system that would have prevented a silent failure -- it's clear that I overreacted.
I think this is in good shape. To summarize various out-of-thread discussions: …
I think there is one last issue, @maleadt -- looking at lines 119 to 125 in 52a16d6: what happens when different users run different CUDA.jl versions, or even the same user has multiple versions of CUDA.jl? One option could be to make a module that switches from the env var to the preference.
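For cluster operators, one low-effort pattern is to ship a shared environment on JULIA_LOAD_PATH whose LocalPreferences.toml pins CUDA_Runtime_jll to the local toolkit. This is a sketch under assumptions: the paths are placeholders, and the UUID shown should be verified against the General registry before deploying.

```shell
#!/bin/sh
# Sketch of what a cluster module file's setup could do: create a shared
# environment whose preferences select the local CUDA toolkit.
SITE="${SITE:-./site-env}"   # e.g. /opt/julia/site on a real system
mkdir -p "$SITE"

# Setting a preference for a non-dependency requires listing the package
# in [extras]; verify this UUID against the General registry.
cat > "$SITE/Project.toml" <<'EOF'
[extras]
CUDA_Runtime_jll = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"
EOF

cat > "$SITE/LocalPreferences.toml" <<'EOF'
[CUDA_Runtime_jll]
version = "local"
EOF

# The module file would then prepend this to every user's load path:
export JULIA_LOAD_PATH="$SITE:$JULIA_LOAD_PATH"
echo "preferences written to $SITE"
```

With this in place, every user loading the module picks up the "local" runtime preference without touching their own environments.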
Yeah, maybe we shouldn't warn if the preference is set to …
This PR switches our manual handling of artifacts (for loading the CUDA toolkit and some of its libraries) to autogenerated JLLs that come directly from Yggdrasil:

- CUDA_Driver_jll: provides the driver library (libcuda.so), either from the system or from a forward-compatible package
- CUDA_Runtime_jll: provides the toolkit libraries (libcublas, libcufft, etc.) and some essential binaries (ptxas, nvdisasm) and files (libcudadevrt, libdevice)

The above is driven by a cuda tag that's put in the host platform, indicating which version of the toolkit has been loaded. This then informs downstream packages like CUDNN_jll which artifact to load. I'm not sure we have this exactly figured out yet: CUDNN releases 11.x builds that are compatible with, well, every CUDA 11.x toolkit, while NCCL has separate 11.0 and 11.8 builds of which I'm not sure how compatible they are with other versions of the toolkit.

This also implies that the env vars we used to steer this, JULIA_CUDA_VERSION and JULIA_CUDA_USE_BINARYBUILDER, have been removed. Instead, the same can be accomplished through Preferences for CUDA_Runtime_jll. The version can be conveniently set using CUDA.set_runtime_version!, while using a local toolkit needs overrides for each product path (for which there's a helper script in deps/local.jl).

Finally, we're doing this not only to simplify things, but because it'll make it possible to build arbitrary binaries that have CUDA dependencies and simply load them by depending on and importing CUDA_Runtime_jll at run time.
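As a usage sketch of the new mechanism (only the function name comes from the description above; the exact argument form and version shown are assumptions), pinning the runtime from the REPL could look like:

```julia
using CUDA

# Pin the CUDA runtime to a specific toolkit version; this writes a
# preference for CUDA_Runtime_jll into the active environment's
# LocalPreferences.toml.
CUDA.set_runtime_version!(v"11.8")

# Preferences are resolved at precompilation time, so restart Julia
# afterwards for the change to take effect.
```

Because the choice lives in LocalPreferences.toml rather than an env var, it is versioned with the environment and survives across sessions.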