
Hommexx/SL: Fix a threading issue. #7012

Merged
merged 1 commit into from
Feb 27, 2025

Conversation


@ambrad ambrad commented Feb 14, 2025

Fix a threading error that occurs in EAMxx simulations on CPU when more than one thread is assigned to each element. Add some team_barriers and rearrange a section of code to permit more team_barriers. Clean up some compiler warnings.

[BFB]

Add some team_barriers and rearrange a section of code to permit more
team_barriers.
@ambrad ambrad self-assigned this Feb 14, 2025
@ambrad ambrad requested a review from bartgol February 14, 2025 02:53
@ambrad ambrad added the BFB (PR leaves answers BFB), HOMME, and EAMxx (PRs focused on capabilities for EAMxx) labels Feb 14, 2025

ambrad commented Feb 14, 2025

This PR affects only EAMxx, so I'm running e3sm_eamxx_v1_medres on Chrysalis, PM-GPU, and Frontier to check the PR. I'll run e3sm_developer on Chrysalis before integrating.

@bartgol bartgol left a comment

I'm puzzled. On GPU, don't we always have more than one thread per element? Why wasn't this exposed on GPU, then?


ambrad commented Feb 14, 2025

No, this is one of the cases that shows up only on CPU. The lock-step nature of threading on the GPU hides these. There are a couple of others. This fix is from a branch I've had going since August collecting some of these in preparation for a CPU threading test. I'll get them all in soon.


ambrad commented Feb 14, 2025

Luca, the other thing I should mention: because I brought this fix in whole from the branch, it also addresses an anti-pattern we have in Hxx: a team_barrier inside a team range. Those work for us on GPU because of how we allocate threads, but for certain threading configurations (plausible though unlikely in practice on CPU, and pointless on GPU) they will cause a deadlock. There are a few other such fixes I need to bring in, also soon. For the specific issue in the failing nightly test of two threads/team, none of these potential deadlocks is actually an issue because 2 divides np^2.

@ambrad ambrad marked this pull request as draft February 14, 2025 19:23

ambrad commented Feb 14, 2025

Tests pass on Chrys and PM-GPU (against baselines) and Frontier (w/o baselines). I'll merge this very likely next week, once the dashboard clears up from the RRTMGP merge.

(I was testing the wrong clone on PM-GPU; hence the previous comment about diffs.)

@ambrad ambrad marked this pull request as ready for review February 14, 2025 22:58

ambrad commented Feb 17, 2025

I'm going to wait until some things clear on master, like the Frontier update and test renaming.

@ambrad ambrad marked this pull request as draft February 20, 2025 20:28
@ambrad ambrad marked this pull request as ready for review February 20, 2025 20:28
ambrad added a commit to ambrad/E3SM that referenced this pull request Feb 26, 2025
Hommexx/SL: Fix a threading issue. (…ject#7012)

Fix a threading error in EAMxx simulations on CPU in the case that each element
has more than one thread. Add some team_barriers and rearrange a section of code
to permit more team_barriers. Clean up some compiler warnings.

[BFB]
@ambrad ambrad merged commit 3ed0f5a into E3SM-Project:master Feb 27, 2025
26 of 39 checks passed
Labels
BFB (PR leaves answers BFB), EAMxx (PRs focused on capabilities for EAMxx), HOMME