Hommexx/SL: Fix a threading issue. #7012
Add some team_barriers and rearrange a section of code to permit more team_barriers.
This PR affects only EAMxx, so I'm running e3sm_eamxx_v1_medres on Chrysalis, PM-GPU, and Frontier to check the PR. I'll run e3sm_developer on Chrysalis before integrating.
I'm puzzled. On GPU, don't we always have more than one thread per element? Why wasn't this exposed on GPU, then?
No, this is one of the cases that shows up only on CPU. The lock-step nature of threading on the GPU hides these. There are a couple of others. This fix is from a branch I've had going since August that collects some of these in preparation for a CPU threading test. I'll get them all in soon.
Luca, the other thing I should mention is that, because I brought this fix in whole from the branch, it also addresses an anti-pattern we have in Hxx: a team_barrier inside a team range. Those work for us on GPU because of how we allocate threads, but for certain threading configurations (plausible though unlikely in practice on CPU, and pointless on GPU), they will cause a deadlock. There are a few other such fixes I need to bring in, also soon. For the specific issue in the failing nightly test, which uses two threads per team, none of these potential deadlocks is actually a problem because 2 divides np^2.
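To make the divisibility point concrete, here is a minimal sketch of why a barrier inside a team range can deadlock. It is a plain C++ model, not Hommexx or Kokkos code: the function names are hypothetical, and the round-robin assignment of iterations to threads (iteration i goes to thread i % team_size) is an assumption about how the range is split, which in practice depends on the backend. It also assumes np = 4, so np^2 = 16.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a team range: iteration i of a range of length
// range_len is executed by thread (i % team_size). This distribution is an
// assumption for illustration only.
std::vector<int> iterations_per_thread(int range_len, int team_size) {
  assert(range_len >= 0 && team_size > 0);
  std::vector<int> counts(team_size, 0);
  for (int i = 0; i < range_len; ++i) ++counts[i % team_size];
  return counts;
}

// A barrier placed inside the loop body is reached once per iteration by the
// thread that runs that iteration. Every thread in the team must reach the
// barrier the same number of times; otherwise the threads with extra
// iterations wait for teammates that have already left the loop.
bool barrier_in_range_is_safe(int range_len, int team_size) {
  const std::vector<int> counts = iterations_per_thread(range_len, team_size);
  for (int c : counts) {
    if (c != counts[0]) return false;
  }
  return true;
}
```

Under these assumptions, `barrier_in_range_is_safe(16, 2)` holds, which matches the two-threads-per-team nightly case, while a team size such as 3 leaves one thread with an extra iteration and therefore an unmatched barrier. Moving the barrier outside the range, so that each thread reaches it exactly once, removes the dependence on divisibility.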
Tests pass on Chrysalis and PM-GPU (against baselines) and on Frontier (without baselines). I'll very likely merge this next week, once the dashboard clears up from the RRTMGP merge. (I was testing the wrong clone on PM-GPU; hence the previous comment about diffs.)
I'm going to wait until some things clear on master, like the Frontier update and test renaming.
…ject#7012) Hommexx/SL: Fix a threading issue. Fix a threading error in EAMxx simulations on CPU in the case that each element has more than one thread. Add some team_barriers and rearrange a section of code to permit more team_barriers. Clean up some compiler warnings. [BFB]
Fix a threading error in EAMxx simulations on CPU in the case that each element has more than one thread. Add some team_barriers and rearrange a section of code to permit more team_barriers. Clean up some compiler warnings.
[BFB]