Use shared memory in DivergenceF2C stencil operators #2184
base: main
Conversation
(force-pushed from c2f0a84 to ddc3e58)
Note to self: need to look at threading pattern.
(force-pushed from 82ff728 to f4b175f)
I think the expectation of a 40% improvement relies on the assumption of complete L1 cache misses. If different threads within a block share faces and not much more data is needed, the effective number of reads to global memory will average fewer than 5, because some values will already be cached in the L1 cache of the streaming multiprocessor (which has the same latency as shared memory).
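To make this concrete, here is a back-of-envelope model (my own sketch, not from the PR): assume runtime scales with the number of reads that actually reach global memory, and that roughly 2 of the 5 per-thread reads are face values a neighboring thread may already have pulled into L1. The expected improvement then shrinks quickly as the L1 hit rate grows.

```julia
# Rough model: baseline does 3 compulsory reads plus 2 face reads that may hit L1
# (hit rate h); the shared-memory version always does 3 global reads.
for h in (0.0, 0.5, 1.0)
    baseline = 3 + 2 * (1 - h)
    shmem = 3
    println("L1 hit rate $h => expected improvement ≈ ",
            round(100 * (1 - shmem / baseline); digits = 1), "%")
end
```

With a hit rate of 0 this reproduces the 40% expectation; with a hit rate of 0.5 it drops to 25%, which is much closer to the ~15% observed.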
(force-pushed from ba4f5dc to 3f85c1b)
Even slightly complicated cases are showing a nice improvement (~2x), so it may just depend on additional factors (e.g., register pressure, or whether errors/traps are emitted by LLVM).
In this PR, shared memory (shmem) is supported through a single layer of operator composition; composed operators would each need a dedicated combined operator. Unfortunately, there are probably a lot of combinations of combined operators. However, we may be able to automatically transform the composed operators into the combined operators on the back-end, so that we don't need to introduce a slew of new terminology to users.
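As a purely hypothetical sketch of that back-end transformation idea (none of these types or the `combine` function are ClimaCore API; the operator structs are stand-ins), a dispatch-based rewrite could map known compositions to dedicated combined operators and leave everything else untouched:

```julia
# Stand-in operator types for illustration only.
struct DivergenceF2C end
struct GradientC2F end
struct DivergenceOfGradientC2F end  # a "combined" operator with its own shmem kernel

# Fallback: unrecognized compositions are left as-is (no shmem, but still correct).
combine(outer, inner) = (outer, inner)

# Known pattern: rewrite the composition into the combined operator on the back-end,
# so users never have to spell the combined name themselves.
combine(::DivergenceF2C, ::GradientC2F) = DivergenceOfGradientC2F()

combine(DivergenceF2C(), GradientC2F())    # -> DivergenceOfGradientC2F()
combine(DivergenceF2C(), DivergenceF2C())  # -> (DivergenceF2C(), DivergenceF2C())
```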
I think that this is in near-merge-worthy shape; the only remaining issues I see are:
(force-pushed from eea802c to 3ba4ad9)
(force-pushed from 8d6a613 to eb6db43)
I've fixed the main bug I was running into, and I think this is ready to go. Here are the preliminary results for some of the relevant benchmarks (in ...). It's notable that some of these kernels can be further optimized; for example, ...

This PR only adds shared memory for an F2C operator; we should probably also add a C2F operator, so that we exercise the cell-center-filling shared-memory branch (which is currently not exercised, and is therefore likely not correct). I've fixed two more out-of-bounds issues, and there is still one inference issue in a case where the BCs are specified as fields. Hoping that is the last one. I've added all of the problematic cases I found to the tests.
(force-pushed from 4a4b7c7 to bfa8df9)
- Try dont_limit on recursive resolve_shmem methods
- Fixes + more dont limit
- Matrix field fixes
- Matrix field fixes
- DivergenceF2C fix
- MatrixField fixes
- Qualify DivergenceF2C
- wip
- Refactor + fixed space bug. All seems good. More tests..
- Fixes
- Test updates
- Fixes
(force-pushed from bfa8df9 to 274f646)
@@ -541,6 +541,16 @@ Required for statically infering the result type of the divergence operation for
) where {FT, A1, A2 <: LocalAxis, S <: StaticMatrix{S1, S2}} where {S1, S2} =
    AxisVector{FT, A2, SVector{S2, FT}}

# TODO: can we better generalize this?
@dennisYatunin, can we better generalize this?
The build that you ran in ClimaAtmos shows a significant regression in the "Benchmark: GPU prog edmf" test compared to ... . I also leave my first impression here:
I think that this should be addressed. We will not need more than 256 levels for a global simulation, but we might want them in other settings; for example, I used more than 256 levels to study self-convergence in ClimaTimeSteppers. Restricting a finite difference method to 256 points is very limiting for most applications that are not global simulations. Moreover, this would further differentiate our CPU and GPU capabilities. (I think it'd be perfectly acceptable to have a slower path for when the number of points exceeds 256, but users should still be able to run with such a configuration.)
I'll of course make sure we address the performance before merging. Allowing a larger number of vertical levels would be nice too. I'm going to think about this; if we could preserve both code paths somehow, it might fix both issues. Do you have any other comments?
Yes, but it will be much more efficient to talk in person, so I'd suggest you first look at the problem with that job, and afterwards we can schedule a call to chat about this.
I just pushed a fix that should address all of these issues. We now check ahead of time and transform the broadcasted style to disable the shmem broadcasted object if no shmem is supported (which should fix the regression) or if the resolution is too high (to maintain support for very high resolutions).
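For readers following along, here is a self-contained sketch of that kind of guard (all names and the 256-level cutoff placement are my assumptions, not the actual ClimaCore code): the decision is made once, up front, and the ordinary broadcast path is kept whenever shmem does not apply.

```julia
const MAX_SHMEM_LEVELS = 256   # assumed limit imposed by the shared-memory allocation

struct PlainStyle end   # stand-in for the default broadcasted style
struct ShmemStyle end   # stand-in for the shmem-aware broadcasted style

# Stand-in check for "does this broadcast expression contain a shmem-capable operator?"
supports_shmem(bc) = bc !== nothing

# Pick the style ahead of time; too many levels or no shmem support falls back
# to the plain (slower but fully general) path.
select_style(bc, nlevels) =
    (supports_shmem(bc) && nlevels <= MAX_SHMEM_LEVELS) ? ShmemStyle() : PlainStyle()

select_style(:some_divergence_bc, 63)    # -> ShmemStyle()
select_style(:some_divergence_bc, 1024)  # -> PlainStyle()
```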
Which build are you comparing against?
I sketched out applying shared memory to DivergenceF2C stencil operators a while ago, and I wanted to open a PR with this branch before I accidentally clean it up and delete it locally. However, the performance improvement is not nearly what we should expect. Currently, per thread, we read:

- J on centers (1 read)
- Jinv on faces (2 reads, one per adjacent face)
- arg on faces (2 reads, one per adjacent face)

for a total of 5 global reads per thread per point. Using shared memory, this should become:

- J on centers (1 read)
- Jinv on faces (1 read; the neighboring face comes from shared memory)
- arg on faces (1 read; the neighboring face comes from shared memory)

for a total of 3 global reads per thread per point. So we should see roughly a 40% performance improvement, but I'm only seeing ~15%.
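To make the intended access pattern concrete, here is a minimal, standalone CUDA.jl sketch (my illustration, not the actual ClimaCore kernel; the flux formula, sizes, and names are simplified assumptions): each thread stages one face value in block-shared memory, and the neighboring face value is then read from shared memory rather than global memory.

```julia
using CUDA

# One thread per cell center; for simplicity this sketch uses a single block.
function divergence_f2c_shmem!(out, arg, J, Jinv)
    i = threadIdx().x
    nfaces = length(arg)                            # faces = centers + 1
    face_flux = CuStaticSharedArray(Float64, 257)   # one slot per face in the block
    # Cooperative load: each thread fetches one face value of Jinv * arg from
    # global memory; thread 1 also fetches the one extra face.
    @inbounds face_flux[i] = Jinv[i] * arg[i]
    if i == 1
        @inbounds face_flux[nfaces] = Jinv[nfaces] * arg[nfaces]
    end
    sync_threads()
    # F2C divergence-like stencil: the neighboring face now comes from shared
    # memory, so the only global reads per thread are J[i], Jinv[i], and arg[i].
    @inbounds out[i] = (face_flux[i + 1] - face_flux[i]) / J[i]
    return nothing
end

ncenters = 256
arg  = CUDA.rand(Float64, ncenters + 1)
Jinv = CUDA.ones(Float64, ncenters + 1)
J    = CUDA.ones(Float64, ncenters)
out  = CUDA.zeros(Float64, ncenters)
@cuda threads = ncenters divergence_f2c_shmem!(out, arg, J, Jinv)
```

The fixed 257-slot shared array is also where a hard per-block level limit (like the 256-level restriction discussed above) would come from in a scheme of this shape.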
There are still some edge cases that need to be fixed.