Fix MPI synchronization in real.exe #1600

Merged: 1 commit, Dec 16, 2021

@honnorat honnorat commented Dec 15, 2021

TYPE: [bug fix]

KEYWORDS: real.exe, MPI, bug fix

SOURCE: Marc Honnorat (EXWEXs)

DESCRIPTION OF CHANGES:
Problem:
The communicator mpi_comm_allcompute, created by subroutine split_communicator called by init_modules(1),
is not explicitly activated for the call to wrf_dm_bcast_bytes( configbuf, nbytes ) in real.exe. On some platforms,
this may prevent broadcast of namelist configuration (put in configbuf after the call to get_config_as_buffer())
across the MPI processes before the call to setup_physics_suite().

An example of a problematic platform: a cluster of Intel Xeon E5-2650 v4 running on CentOS Linux release 7.6.1810,
with Intel Parallel Studio XE (various versions, including 2018u3 and 2020u4) and Intel MPI Library (same version).

Solution:
The initialization step used in the WRF executable never triggers the failure described in issue #1267. This PR reuses
the temporary MPI context switch from that WRF code in real.exe, so that the broadcast runs with the intended communicator active.
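
For readers unfamiliar with the pattern, the "temporary MPI context switch" can be pictured with a small toy model: save the currently active communicator, activate the one the broadcast needs, perform the broadcast, then restore the saved one. The sketch below is plain Python with no MPI and is not the code changed in main/real_em.F. Only the name `mpi_comm_allcompute` comes from the description above; `get_active_comm`, `set_active_comm`, `temporary_comm`, `bcast_bytes`, and the `"local_comm"` placeholder are hypothetical and exist only for illustration.

```python
from contextlib import contextmanager

# Toy stand-in for a library-wide "currently active communicator" setting.
# "local_comm" is a hypothetical placeholder, not a WRF name.
_active_comm = "local_comm"


def get_active_comm():
    """Return the communicator that is currently active (hypothetical helper)."""
    return _active_comm


def set_active_comm(comm):
    """Make `comm` the active communicator (hypothetical helper)."""
    global _active_comm
    _active_comm = comm


@contextmanager
def temporary_comm(comm):
    """Activate `comm` inside the block, then restore the previous communicator."""
    saved = get_active_comm()
    set_active_comm(comm)
    try:
        yield
    finally:
        # Restore even if the body raises, so later code sees the old state.
        set_active_comm(saved)


def bcast_bytes(buf, log):
    """Pretend broadcast: records which communicator was active when called."""
    log.append(get_active_comm())
    return bytes(buf)


log = []
with temporary_comm("mpi_comm_allcompute"):
    # The broadcast happens while the intended communicator is active.
    config = bcast_bytes(b"namelist bytes", log)
```

After the `with` block, `log` shows that the pretend broadcast ran under `mpi_comm_allcompute`, and `get_active_comm()` is back to the previously active value. The context manager is a design choice of this sketch; the same effect can be had with an explicit save, set, and restore sequence around the call.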

ISSUE:
Fixes #1267

LIST OF MODIFIED FILES:
M main/real_em.F

TESTS CONDUCTED:

  1. The modification systematically solves the problem on the noted cluster.
  2. Jenkins tests are all passing.

RELEASE NOTE: A fix for an MPI synchronization bug related to (unused) split communicators in the real program resolves issue #1267. Users who have had no trouble running the real program with MPI will see no impact.

@honnorat honnorat requested a review from a team as a code owner December 15, 2021 08:26
@honnorat honnorat changed the base branch from master to release-v4.3.3 December 15, 2021 08:27
@davegill davegill changed the base branch from release-v4.3.3 to develop December 15, 2021 15:38
@davegill davegill changed the title from "Fix MPI synchronization in real.exe #1267" to "Fix MPI synchronization in real.exe" Dec 15, 2021
@davegill

@kkeene44
Kelly,
This is an infrastructure review. Normally I alone would approve - but I need someone else. This one is all on me for responsibility.

@davegill davegill merged commit a39a94b into wrf-model:develop Dec 16, 2021
@davegill

@honnorat
Marc,
Thanks for staying with this issue. This modification to real is now in the develop branch, and will be publicly available in the WRF v4.4 release and all subsequent releases.

@honnorat honnorat deleted the fix-1267 branch December 17, 2021 06:11
vlakshmanan-scala pushed a commit to scala-computing/WRF that referenced this pull request Apr 4, 2024
…l#1600)
Successfully merging this pull request may close these issues.

Irregular MPI synchronization bug in real.exe