KernelAbstractions support #147
Conversation
Ah, I guess while I'm here, I'll briefly explain the differences with CUDA syntactically:
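Roughly, the same toy kernel (a made-up axpy, not one of the Molly kernels) looks like this in the two styles:

```julia
using CUDA
using KernelAbstractions

# CUDA.jl style: the index is built from blockIdx/threadIdx and the kernel is
# launched with @cuda, so it only runs on NVIDIA GPUs.
function axpy_cuda!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

# KernelAbstractions.jl style: @index gives the global index and the backend
# (CPU(), CUDABackend(), ROCBackend(), ...) is only chosen at launch time.
@kernel function axpy_ka!(y, @Const(x), a)
    i = @index(Global, Linear)
    @inbounds y[i] += a * x[i]
end

x, y, a = rand(Float32, 1024), zeros(Float32, 1024), 2.0f0

# KA launch, here on the CPU backend; a CuArray would give CUDABackend() instead.
backend = get_backend(y)
axpy_ka!(backend, 256)(y, x, a; ndrange=length(y))
KernelAbstractions.synchronize(backend)

# Equivalent CUDA.jl launch for CuArrays:
# @cuda threads=256 blocks=cld(length(y), 256) axpy_cuda!(y, x, a)
```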
The tricky thing about this PR was removing the CUDA dependency outside of the kernels. There is still one call in zygote.jl I gotta figure out: https://github.com/leios/Molly.jl/blob/KA_support/src/zygote.jl#L698
Great work so far. Making the code compatible with generic array types is a nice change, and having the kernels work on different devices would be a selling point of the package. I would be interested to see the performance of the kernels compared to the CUDA versions. Also whether it plays nicely with Enzyme. Good luck with the runtime errors.
I think I can finish this up today or else early next week (emphasis on think), but to quickly answer the questions:
Great. Not urgent, but how well does KernelAbstractions.jl deal with warp-level code?
That's a good question. We can probably expose the APIs available from CUDA, but I am not sure how AMDGPU deals with these. We would also just need to figure out what that corresponds to on parallel CPU. I think these are the tools we need: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html

Ah, as an important note (that I somehow failed to mention before), an advantage of KA is that it also provides a parallel CPU implementation, so the kernels can be written once and executed everywhere. I didn't do that in this PR because that brings up design questions related to Molly internals.
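For instance, the workgroup-level tools KA does have (@localmem, @synchronize) already run on both CPU and GPU backends. Here is a toy group-sum sketch (not Molly code, workgroup size hard-coded to 256) of that kind of portable alternative to a warp reduction:

```julia
using KernelAbstractions

# Each workgroup sums its 256 elements into one output slot using local
# (shared) memory; a real kernel would use a tree reduction instead of the
# serial loop on the first work-item.
@kernel function group_sum!(out, @Const(x))
    li = @index(Local, Linear)
    gi = @index(Group, Linear)
    i  = @index(Global, Linear)

    tmp = @localmem eltype(x) (256,)
    tmp[li] = i <= length(x) ? x[i] : zero(eltype(x))
    @synchronize  # make all workgroup writes to tmp visible

    if li == 1
        acc = zero(eltype(x))
        for j in 1:256
            acc += tmp[j]
        end
        out[gi] = acc
    end
end

x = rand(Float32, 1024)
partial = zeros(Float32, cld(length(x), 256))
group_sum!(CPU(), 256)(partial, x; ndrange=length(x))
KernelAbstractions.synchronize(CPU())
sum(partial) ≈ sum(x)  # true
```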
Yeah we can discuss that after this PR. I would be okay with switching if there was no performance hit. Judging from discussion on the linked PR there is not currently warp support in KA. It may be necessary to leave that CUDA kernel in and have a separate KA kernel for other backends until warp support comes to KA.
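Something like the following dispatch layer could do that (a sketch only; double!, double_ka! and double_cuda! are hypothetical names, and the kernels are toy stand-ins for the force kernels):

```julia
using CUDA
using KernelAbstractions

# Generic KA kernel usable on any backend.
@kernel function double_ka!(out, @Const(x))
    i = @index(Global, Linear)
    @inbounds out[i] = 2 * x[i]
end

# Hand-written CUDA kernel where warp intrinsics (shfl_sync, vote, ...) could live.
function double_cuda!(out, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = 2 * x[i]
    end
    return nothing
end

# CuArrays keep the specialised CUDA kernel; every other array type (CPU,
# ROCArray, MtlArray, ...) goes through the KernelAbstractions version.
function double!(out::CuArray, x::CuArray)
    @cuda threads=256 blocks=cld(length(out), 256) double_cuda!(out, x)
    return out
end

function double!(out::AbstractArray, x::AbstractArray)
    backend = get_backend(out)
    double_ka!(backend, 64)(out, x; ndrange=length(out))
    KernelAbstractions.synchronize(backend)
    return out
end

out, x = zeros(Float32, 1024), rand(Float32, 1024)
double!(out, x)  # plain Arrays dispatch to the KA path on the CPU backend
```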
Ok, so a couple of quick notes here:
That is okay.
I wouldn't worry about this. Currently I only merge stuff that I am able to maintain, or where I think I can skill up to the point of maintaining it. The changes here seem reasonable and worth merging once any errors and performance regressions are fixed. There is a wider question about whether KernelAbstractions.jl will continue to be maintained compared to CUDA.jl, but it seems to have decent traction now.
Yeah, the plan is for KA to be used even within GPUArrays, so it's not going anywhere anytime soon. Speaking of which, the "correct" course of action for KA in Molly would be to get KA into GPUArrays first and then use that to implement any missing features at the GPUArrays level. Would it be better for me to split this PR then? Maybe one PR doing the generic array changes and another adding the KA support?
I would try and get this PR working as is. Only if that becomes difficult would it be worth splitting out and merging the generic array support. If KA is here for the long haul then there is a benefit to switching the kernels even if only CUDA works currently. Because then when changes happen elsewhere, AMDGPU will work without any changes required in Molly.
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##           master     #147      +/-   ##
==========================================
+ Coverage   67.01%   67.28%   +0.26%
==========================================
  Files          35       37       +2
  Lines        5526     5489      -37
==========================================
- Hits         3703     3693      -10
+ Misses       1823     1796      -27

☔ View full report in Codecov by Sentry.
Getting around to this and noticed a bunch of segfaults in the CPU tests. I then found that there's a strange conflict between
But only if
st:
Note that using a single thread "fixes" the issue. It seems to be a UCX / MPI issue, but I am not loading them and neither is in the Manifest.
This looks exactly like https://juliaparallel.org/MPI.jl/stable/knownissues/#Multi-threading-and-signal-handling
What is
The fix mentioned there seems to work:
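For the record, the workaround on that page is to stop UCX intercepting the signal Julia's GC uses to synchronise threads; presumably what was applied here was something along these lines (the exact variable value is taken from the MPI.jl docs, not from this thread):

```julia
# Exclude SIGSEGV from the signals UCX installs error handlers for, so it does
# not clash with Julia's multi-threaded GC. Must be set before UCX initialises,
# e.g. at the top of the session or as a shell `export`.
ENV["UCX_ERROR_SIGNALS"] = "SIGILL,SIGBUS,SIGFPE"
```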
Note
Wild... What is
Is it a Linux thing like libpthread?
# This triggers an error but it isn't printed
# See https://discourse.julialang.org/t/error-handling-in-cuda-kernels/79692
# for how to throw a more meaningful error
error("wrong force unit returned, was expecting $F but got $(unit(f[1]))") |
The interpolation here is particularly tricky. I would avoid that if at all possible.
To be clear: are you referring to the error(...) call in an inlined function within an @kernel? Or carrying the units through to this stage in the first place?
error("Oops")
is fine, error("Oops, $F")
is sadly not since it string interpolation is really tough on the GPU compiler.
We can remove that, no problem.
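For example, a minimal sketch (toy kernel, not the actual Molly check) of keeping the unit check while dropping the interpolated values, so the message stays a compile-time constant:

```julia
using KernelAbstractions
using Unitful

# The error message is a constant string, so nothing has to be interpolated on
# the device; the commented-out line is the GPU-unfriendly version.
@kernel function check_force_units!(fs, expected)
    i = @index(Global, Linear)
    @inbounds fi = fs[i]
    if unit(fi) != expected
        error("wrong force unit returned")
        # error("was expecting $expected but got $(unit(fi))")  # interpolation: avoid in kernels
    end
end

fs = fill(1.0u"kJ * mol^-1 * nm^-1", 8)
check_force_units!(CPU(), 8)(fs, u"kJ * mol^-1 * nm^-1"; ndrange=length(fs))
KernelAbstractions.synchronize(CPU())
```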
Right, having an issue right now with the Zygote tests. Not really sure how to go about debugging it, so I'll paste it here and then think about it.
As a note, it looks like #182 is close-ish to being done? If so, it might be best to rework this PR once those changes are in. As it stands, this PR can work as a branch for any non-CUDA GPU (as long as the user does not need differentiable sims).
Thanks for keeping going on this. Yes, #182 will get merged soon; I was waiting on full Enzyme GPU broadcasting support to avoid GPU differentiable simulations breaking, but I might just merge it and wait for that to arrive later. Consequently I wouldn't worry about the Zygote error. I'm happy to help update this PR after #182, I realise that the ground has been moving under you.
Yeah, no worries. I should go ahead and close the #99 PR since this one supersedes it. On my end, I am happy to wait a little longer and rebase when you are happy with #182. I have other projects to work on in the meantime. Could you link the Enzyme issue? Is it this one? EnzymeAD/Enzyme.jl#1672
Okay great. The example I posted on JuliaGPU/CUDA.jl#2471 doesn't work as it stands; that and similar issues around reduction and broadcasting are the problem.
@jgreener64 if there isn't an open issue on Enzyme.jl associated with it, it is very likely that Billy and I will lose track of it.
Billy has favoured moving CUDA-specific issues to CUDA.jl, e.g. EnzymeAD/Enzyme.jl#1454 (comment). In either case I can find some MWEs in the next few days of what is erroring and make sure they are tracked in issues.
#182 is now merged. |
The potential energy no-NL kernel is the same as the NL kernel. I made some changes and fixes to the branch, hope you don't mind. I also reviewed it as I went and I think it is looking strong. Next week I'll work on fixing the tests, since I broke something during my changes, and we can merge.
Ah, I see what happened. Thanks for looking at this. Please take your time with the review. It's a lot of changed lines over a long period of time. I'm happy to let you do your magic for a bit. Let me know when / if you want me to look at it again for final touches.
Okay, I think this is ready from my end. I did notice wrong results with Metal, though. Would you be able to run the tests on an AMD GPU? I don't have access to one.
Great work! Seems like there was a lot of cleanup. AMD tests are broken somehow, but I have seen the error before, so I think it's manageable to fix tomorrow. It is breaking here (neighbors.jl L175):
So there is something funky about the Error:
Well, I am still not sure about the AMD errors, but I am getting the following errors on CUDA:
So this error is not triggered when I throw each line individually into the REPL; it only happens with the test suite. We are using the same version of AMDGPU (1.1.7) in both cases, and I cannot think of other packages that might be influencing the result. More than that, the error is actually an error of an error: there was a problem, but the message shown is telling me that it couldn't display the actual error. It runs on Metal and CUDA for you?
CUDA and Metal seem to work okay for me (bar the separate issue with Metal forces I mentioned). Could it be a GPUArrays issue, if it worked for you before when the version was set to 10? Only throwing in the test suite could be due to bounds checking, since Pkg.test runs with --check-bounds=yes.
Also, I just reverted a change that I made for Metal compatibility because I noticed it was giving wrong indices due to precision errors. I don't see how that could affect the AMD errors, though.
The updates actually fixed my errors as well. I think it must have been an odd interaction with bounds checking. I think I am also getting some correctness issues in the force kernels, because I am getting some failed simulation tests:
Interesting. It seems like there are NaN values somewhere. I'm going to see if my institute has any AMD GPUs lying around so I can test too. I'll also try and debug the Metal forces issue, since that might be related. |
Yeah, I am sure I did something screwy with the force kernels then. I guess a small precision issue could eventually lead to NaNs if left unchecked for a bit, so the Metal and AMD issues might still be related. |
What a catch. Tests seem to be passing on my end until:
Same for me. I'm also getting an occasional precision error in the protein tests. Given that they are small errors I am going to disable those tests on non-CUDA backends and merge this. Thanks for all your work on this James, I'll look to get the next release out soon. |
Great! Glad we finally got this through. Thanks for the patience and taking it over at the end |
The KernelAbstractions branch now compiles, so I thought I would put forward a quick draft PR while I figure out all the runtime bugs.
Notes:
CUDA has been removed and replaced with KernelAbstractions and GPUArrays. As an important note here, GPUArrays is not strictly necessary except to replicate the behavior of the boolean GPU flag (i.e. isa(a, AbstractGPUArray)).
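For illustration, a sketch of how that flag behaviour can be reproduced generically (on_gpu is a hypothetical helper, not the name used in the PR):

```julia
using GPUArrays: AbstractGPUArray
using KernelAbstractions: get_backend, CPU

# Any AbstractGPUArray (CuArray, ROCArray, MtlArray, ...) counts as "on the GPU";
# plain Arrays do not, mirroring the old boolean flag.
on_gpu(a::AbstractArray) = a isa AbstractGPUArray

coords = rand(Float32, 3, 100)
on_gpu(coords)        # false for a plain Array
get_backend(coords)   # CPU(); the backend KA kernels would be launched on
```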