
Fix MPI synchronization in real.exe #1268

Closed
honnorat wants to merge 3 commits into wrf-model:release-v4.2.2 from EXWEXs:fix-real-master

Conversation

honnorat
Contributor

Fixes MPI synchronization bug in real.exe #1267

TYPE: [bug fix]

KEYWORDS: mpi, real, bug

SOURCE: "Marc Honnorat (EXWEXs)"

DESCRIPTION OF CHANGES:
Problem:
When running real.exe on multiple processes with MPI, one or more processes occasionally crash in setup_physics_suite (in share/module_check_a_mundo.F). This may be due to the fact that wrf_dm_initialize is apparently non-blocking from an MPI point of view. See #1267 for an example.

Solution:
An MPI barrier is added just after the call to wrf_dm_initialize to force all processes to be synchronized before the namelist consistency checks.
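
For illustration, here is a minimal standalone sketch of the pattern (not the actual WRF code; the buffer name and its content are made up):

```fortran
! Minimal MPI sketch of the idea (not the actual WRF source): broadcast a
! packed configuration buffer, then hold every task at a barrier so that no
! task starts its consistency checks before all tasks have the data.
program barrier_after_broadcast
  use mpi
  implicit none
  integer :: ierr, rank, nbytes
  character(len=64) :: configbuf              ! stand-in for the packed namelist buffer

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  configbuf = ' '
  if (rank == 0) configbuf = 'physics_suite = conus'   ! hypothetical namelist entry
  nbytes = len(configbuf)

  ! analogous to the broadcast of model_config_rec performed in main/real_em.F
  call MPI_Bcast(configbuf, nbytes, MPI_CHARACTER, 0, MPI_COMM_WORLD, ierr)

  ! the added synchronization point: nobody proceeds until every task is here
  call MPI_Barrier(MPI_COMM_WORLD, ierr)

  print *, 'rank', rank, 'sees: ', trim(configbuf)
  call MPI_Finalize(ierr)
end program barrier_after_broadcast
```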

ISSUE:
Fixes #1267

LIST OF MODIFIED FILES:

  • M main/real_em.F

TESTS CONDUCTED:
On the only machine where I have seen the bug occur, this change fixes the problem. No other test was conducted since I couldn't reproduce the bug on another setup.

@honnorat honnorat requested review from a team as code owners August 12, 2020 15:01
@davegill
Contributor

@honnorat
Marc,
I am getting this email from our jenkins test:

Please find result of the WRF regression test cases in the attachment. This build is for Commit ID: 93852a2daa136063d12913183af585067d6ae202, requested by: honnorat for PR: https://github.com/wrf-model/WRF/pull/1268. For any query please send e-mail to David Gill.

    Test Type              | Expected | Received | Failed
    -----------------------|----------|----------|-------
    Number of Tests        |    19    |    18    |
    Number of Builds       |    48    |    46    |
    Number of Simulations  |   166    |   164    |     80
    Number of Comparisons  |   105    |    24    |      0

    Failed Simulations are: 
output_1:2 = STATUS test_001s em em_real 32 3dtke
output_1:2 = STATUS test_001s em em_real 32 conus
output_1:2 = STATUS test_001s em em_real 32 rap
output_1:2 = STATUS test_001s em em_real 32 tropical
output_1:2 = STATUS test_001o em em_real 33 3dtke
output_1:2 = STATUS test_001o em em_real 33 conus
output_1:2 = STATUS test_001o em em_real 33 rap
output_1:2 = STATUS test_001o em em_real 33 tropical
output_11:2 = STATUS test_011s em em_real 32 03
output_11:2 = STATUS test_011s em em_real 32 03DF
output_11:2 = STATUS test_011s em em_real 32 03FD
output_11:2 = STATUS test_011s em em_real 32 06
output_11:2 = STATUS test_011s em em_real 32 07NE
output_11:2 = STATUS test_011o em em_real 33 03
output_11:2 = STATUS test_011o em em_real 33 03DF
output_11:2 = STATUS test_011o em em_real 33 03FD
output_11:2 = STATUS test_011o em em_real 33 06
output_11:2 = STATUS test_011o em em_real 33 07NE
output_12:2 = STATUS test_012s em em_real 32 10
output_12:2 = STATUS test_012s em em_real 32 11
output_12:2 = STATUS test_012s em em_real 32 14
output_12:2 = STATUS test_012s em em_real 32 16
output_12:2 = STATUS test_012o em em_real 33 10
output_12:2 = STATUS test_012o em em_real 33 11
output_12:2 = STATUS test_012o em em_real 33 14
output_12:2 = STATUS test_012o em em_real 33 16
output_13:2 = STATUS test_013s em em_real 32 17
output_13:2 = STATUS test_013s em em_real 32 17AD
output_13:2 = STATUS test_013s em em_real 32 18
output_13:2 = STATUS test_013s em em_real 32 20
output_13:2 = STATUS test_013s em em_real 32 20NE
output_13:2 = STATUS test_013o em em_real 33 17
output_13:2 = STATUS test_013o em em_real 33 17AD
output_13:2 = STATUS test_013o em em_real 33 18
output_13:2 = STATUS test_013o em em_real 33 20
output_13:2 = STATUS test_013o em em_real 33 20NE
output_14:2 = STATUS test_014s em em_real 32 38
output_14:2 = STATUS test_014s em em_real 32 48
output_14:2 = STATUS test_014s em em_real 32 49
output_14:2 = STATUS test_014s em em_real 32 50
output_14:2 = STATUS test_014s em em_real 32 51
output_14:2 = STATUS test_014o em em_real 33 38
output_14:2 = STATUS test_014o em em_real 33 48
output_14:2 = STATUS test_014o em em_real 33 49
output_14:2 = STATUS test_014o em em_real 33 50
output_14:2 = STATUS test_014o em em_real 33 51
output_15:2 = STATUS test_015s em em_real 32 52
output_15:2 = STATUS test_015s em em_real 32 52DF
output_15:2 = STATUS test_015s em em_real 32 52FD
output_15:2 = STATUS test_015s em em_real 32 60
output_15:2 = STATUS test_015s em em_real 32 60NE
output_15:2 = STATUS test_015o em em_real 33 52
output_15:2 = STATUS test_015o em em_real 33 52DF
output_15:2 = STATUS test_015o em em_real 33 52FD
output_15:2 = STATUS test_015o em em_real 33 60
output_15:2 = STATUS test_015o em em_real 33 60NE
output_16:2 = STATUS test_016s em em_real 32 65DF
output_16:2 = STATUS test_016s em em_real 32 66FD
output_16:2 = STATUS test_016s em em_real 32 71
output_16:2 = STATUS test_016s em em_real 32 78
output_16:2 = STATUS test_016s em em_real 32 79
output_16:2 = STATUS test_016o em em_real 33 65DF
output_16:2 = STATUS test_016o em em_real 33 66FD
output_16:2 = STATUS test_016o em em_real 33 71
output_16:2 = STATUS test_016o em em_real 33 78
output_16:2 = STATUS test_016o em em_real 33 79
output_17:2 = STATUS test_017s em em_real 32 kiaps1NE
output_17:2 = STATUS test_017s em em_real 32 kiaps2
output_17:2 = STATUS test_017o em em_real 33 kiaps1NE
output_17:2 = STATUS test_017o em em_real 33 kiaps2
output_18:2 = STATUS test_018s em em_real 32 cmt
output_18:2 = STATUS test_018s em em_real 32 solaraNE
output_18:2 = STATUS test_018s em em_real 32 urb3bNE
output_3:2 = STATUS test_003s chem em_real 32 1
output_3:2 = STATUS test_003s chem em_real 32 2
output_3:2 = STATUS test_003s chem em_real 32 5
output_6:2 = STATUS test_006s real8 em_real 32 14
output_6:2 = STATUS test_006s real8 em_real 32 17AD
output_6:2 = STATUS test_006o real8 em_real 33 14
output_6:2 = STATUS test_006o real8 em_real 33 17AD
    Which comparisons are not bit-for-bit: 
    None

It looks like every test that uses the ARW real program with serial or OpenMP is failing. Perhaps an ifdef around an MPI call is missing?

@davegill
Contributor

@honnorat
Marc,

It looks like every test that uses the ARW real program with serial or OpenMP is failing. Perhaps an ifdef around an MPI call is missing?

Nope, I looked at your mods - that is not the trouble.

@honnorat
Contributor Author

Maybe DM_PARALLEL is not a sufficient guard; we should replace the current fence with:

#if ( defined( DM_PARALLEL ) && ( ! defined( STUBMPI ) ) )

I have no time today to fix the fix. Maybe tomorrow 🤞
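
For reference, a sketch of how such a combined fence around the barrier could look (illustrative only; the helper name below is hypothetical and this is not the exact committed change):

```fortran
! Illustrative sketch only (not the exact committed change): perform the
! barrier solely in DM_PARALLEL builds linked against a real MPI library;
! serial and OpenMP builds (which define STUBMPI or omit DM_PARALLEL) compile
! the body away. Assumes a .F file so the C preprocessor runs.
SUBROUTINE sync_all_tasks      ! hypothetical helper name
#if ( defined( DM_PARALLEL ) && ( ! defined( STUBMPI ) ) )
   IMPLICIT NONE
   INCLUDE 'mpif.h'
   INTEGER :: ierr
   CALL MPI_Barrier( MPI_COMM_WORLD, ierr )
#endif
END SUBROUTINE sync_all_tasks
```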

@davegill
Contributor

@honnorat
Marc,
No rush at all.

@davegill davegill changed the base branch from master to release-v4.2.2 August 12, 2020 17:38
@davegill
Contributor

@honnorat
Marc,
Also, I changed the base branch to be release-v4.2.2

@honnorat
Contributor Author

Yes, of course, the MPI_Barrier call also had to be fenced against STUBMPI. Jenkins seems happy now.

@davegill
Contributor

@honnorat
Marc,
Here is a snapshot of the changes, excluding the whitespace modifications. Notice that only module_dm.F has any substantive changes.

[Screenshot: diff of the changes, excluding whitespace modifications (2020-08-28)]

When I include the whitespace modifications, there are about 150 changed locations in two files, one of which contains whitespace changes only.

Given how small the change you implemented actually is, would you describe what that single MPI command does and why it is needed?

@honnorat
Contributor Author

As I explained in #1267, real.exe occasionally crashes in setup_physics_suite (in share/module_check_a_mundo.F#L2640) because the latter uses model_config_rec % physics_suite, which is improperly initialized, as if the broadcast of model_config_rec performed just before in main/real_em.F#L124 had not been received by all processes.

I had never seen this bug (#1267) before; it has only happened on one machine (an Intel-based cluster using Intel MPI and ifort).

The current fix makes sure that all processes are well synced before proceeding with setup_physics_suite. It solves the issue on my machine.

As you mentioned, the first version of the fix was poorly designed, and we finally ended up with a one-liner. My code editor automatically removed all trailing spaces in the modified files.

@davegill
Contributor

davegill commented Oct 9, 2020

@honnorat
Marc,
Here is the only non-white-space modification:

[Screenshot: the only non-whitespace modification (2020-10-09)]

Perhaps the easiest thing to do is to close this PR and open a new one. The only mod to the base branch release-v4.2.2 would be the new MPI barrier.

You can grab the commit message from this PR for the new PR.

honnorat pushed a commit to EXWEXs/WRF that referenced this pull request Oct 12, 2020
@honnorat honnorat closed this Oct 12, 2020
davegill pushed a commit that referenced this pull request Oct 23, 2020
TYPE: bug fix

KEYWORDS: mpi, real, bug

SOURCE: Marc Honnorat (EXWEXs)

DESCRIPTION OF CHANGES:
Problem:
When running real.exe on multiple processes with MPI, one or more processes occasionally crash in setup_physics_suite 
(in share/module_check_a_mundo.F). This has been linked to the fact that wrf_dm_initialize is non-blocking from an MPI point of view. 

real.exe occasionally crashes in setup_physics_suite (in share/module_check_a_mundo.F#L2640) because the latter 
uses model_config_rec % physics_suite, which on some machines is not initialized. The behavior is as if the broadcast of 
model_config_rec performed just before in main/real_em.F#L124 had not been received by all processes.

I had never seen this bug before, it has only happened on one machine (an Intel-based cluster using Intel-MPI and ifort).

The current fix makes sure that all processes are well synced before proceeding with setup_physics_suite. It solves the 
issue on my machine. Since this is immediately after reading in the namelist, no performance issues are expected as this 
read and broadcast of the namelist occurs only once per real / WRF / ndown run.

Solution:
An MPI barrier is added at the end of wrf_dm_initialize to force all of the processes to be synchronized before checking 
the namelist consistency.

This is a simplification of PR #1268, which had extra white space.

ISSUE:
Fixes #1267

LIST OF MODIFIED FILES:
M external/RSL_LITE/module_dm.F

TESTS CONDUCTED:
1. On the only machine where I have seen the bug occur, this change fixes the problem. No other test was conducted since I couldn't reproduce the bug on another setup.
2. Jenkins testing is all PASS.

RELEASE NOTES: When running real.exe on multiple processes with MPI, one or more processes occasionally crash in setup_physics_suite (in share/module_check_a_mundo.F). This has been traced to the fact that wrf_dm_initialize is non-blocking from an MPI point of view.  The problem is intermittent and has only happened on one machine (an Intel-based cluster using Intel-MPI and ifort). An MPI barrier has been added at the end of wrf_dm_initialize to force all processes to be synchronized before checking namelist consistency.
honnorat pushed a commit to EXWEXs/WRF that referenced this pull request Jun 17, 2021
honnorat pushed a commit to EXWEXs/WRF that referenced this pull request Dec 15, 2021
@honnorat honnorat deleted the fix-real-master branch December 15, 2021 08:48
davegill pushed a commit that referenced this pull request Dec 16, 2021
TYPE: [bug fix]

KEYWORDS: real.exe, MPI, bug fix

SOURCE: Marc Honnorat (EXWEXs)

DESCRIPTION OF CHANGES:
Problem:
The communicator `mpi_comm_allcompute`, created by subroutine `split_communicator` called by `init_modules(1)`, 
is not explicitly activated for the call to `wrf_dm_bcast_bytes( configbuf, nbytes )` in real.exe. On some platforms, 
this may prevent the namelist configuration (packed into `configbuf` by the call to `get_config_as_buffer()`) from being 
broadcast to all MPI processes _before_ the call to `setup_physics_suite()`.
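
For context, a paraphrased sketch of that sequence (not verbatim WRF code; argument lists are approximate and the final step is summarized as a comment):

```fortran
! Paraphrased sketch of the sequence described above (not verbatim WRF code;
! argument lists are approximate). The namelist is packed into configbuf,
! broadcast to all tasks, then unpacked into model_config_rec. If the
! broadcast uses a communicator that some tasks never activated, those tasks
! later read an uninitialized model_config_rec % physics_suite.
CALL get_config_as_buffer( configbuf, configbuflen, nbytes )
CALL wrf_dm_bcast_bytes  ( configbuf, nbytes )
CALL set_config_as_buffer( configbuf, configbuflen )
! ... namelist consistency checks (setup_physics_suite) follow
```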

An example of a problematic platform: a cluster of Intel Xeon E5-2650 v4 running on CentOS Linux release 7.6.1810, 
with Intel Parallel Studio XE (various versions, including 2018u3 and 2020u4) and Intel MPI Library (same version).

Solution:
The initialization sequence used in the WRF executable never triggers the failure described in issue #1267. This PR reuses 
the same temporary MPI context switch from the WRF code path in real.exe.
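
A hedged sketch of what such a temporary communicator switch could look like (`wrf_get_dm_communicator`/`wrf_set_dm_communicator` are existing WRF routines, but this exact usage and the `save_comm` variable are assumptions, not the verbatim diff):

```fortran
! Hedged sketch of the assumed shape of the fix (not the verbatim diff):
! temporarily make mpi_comm_allcompute the active DM communicator so the
! namelist broadcast reaches every compute task, then restore the previous
! communicator. save_comm is a hypothetical local INTEGER.
#if ( defined( DM_PARALLEL ) && ( ! defined( STUBMPI ) ) )
   CALL wrf_get_dm_communicator( save_comm )            ! remember the active communicator
   CALL wrf_set_dm_communicator( mpi_comm_allcompute )
   CALL wrf_dm_bcast_bytes     ( configbuf, nbytes )
   CALL wrf_set_dm_communicator( save_comm )            ! restore it afterwards
#endif
```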

ISSUE: 
Fixes #1267

LIST OF MODIFIED FILES:
M       main/real_em.F

TESTS CONDUCTED: 
1. The modification systematically solves the problem on the noted cluster.
2. Jenkins tests are all passing.

RELEASE NOTE: A fix for an MPI synchronization bug related to (unused) split communicators in the real program provides a solution to issue #1267. For users who have had no trouble running the real program with MPI, this change will have no impact.
vlakshmanan-scala pushed a commit to scala-computing/WRF that referenced this pull request Apr 4, 2024
vlakshmanan-scala pushed a commit to scala-computing/WRF that referenced this pull request Apr 4, 2024