Fix MPI synchronization in real.exe #1268
Conversation
@honnorat
It looks like every test that uses the ARW real program with serial or OpenMP is failing. Perhaps an ifdef around an MPI call is missing?
@honnorat
Nope, I looked at your mods - that is not the trouble.
I have no time today to fix the fix. Maybe tomorrow 🤞
@honnorat
Yes, of course.
@honnorat When I include the whitespace modifications, there are about 150 changed locations across two files, one of which has whitespace changes only. Given the small change that you actually implemented, would you describe what that single MPI command does and why it is needed?
As I explained in #1267, I had never seen this bug before; it has only happened on one machine (an Intel-based cluster using Intel-MPI and ifort). The current fix makes sure that all processes are well synced before proceeding with setup_physics_suite. As you mentioned, the first version of the fix was badly designed, and we finally got a one-liner. My code editor automatically removed all trailing spaces in the modified files.
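For readers following the thread: the one-liner under discussion is a plain MPI barrier, a collective call that blocks each rank until every rank has reached the same point, which guarantees the earlier namelist broadcast has completed everywhere. Below is a minimal standalone sketch of that semantics; it is an illustration only, not the WRF diff (in WRF the barrier sits inside wrf_dm_initialize and uses the model's own communicator, not MPI_COMM_WORLD):

```fortran
! Standalone sketch: the synchronization guarantee of MPI_Barrier.
! No rank passes the barrier until all ranks have reached it, so any
! state established before it (e.g. a broadcast namelist) is in place
! on every rank afterwards.
PROGRAM barrier_sketch
   USE mpi
   IMPLICIT NONE
   INTEGER :: ierr, rank

   CALL MPI_Init( ierr )
   CALL MPI_Comm_rank( MPI_COMM_WORLD, rank, ierr )

   ! ... per-rank setup, e.g. receiving the broadcast configuration ...

   ! The one-liner: synchronize all ranks before continuing.
   CALL MPI_Barrier( MPI_COMM_WORLD, ierr )

   ! ... now safe to run consistency checks (setup_physics_suite in WRF) ...

   CALL MPI_Finalize( ierr )
END PROGRAM barrier_sketch
```

Since the barrier sits just after the one-time namelist read and broadcast, it adds no measurable cost over a full run.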
@honnorat Perhaps the easiest thing to do is to close this PR and re-open a new one. The only mod to the base branch release-v4.2.2 would be the new MPI barrier. You can grab the commit message from this PR for the new PR.
Simplification of PR wrf-model#1268 Fixes wrf-model#1267
TYPE: bug fix
KEYWORDS: mpi, real, bug
SOURCE: Marc Honnorat (EXWEXs)
DESCRIPTION OF CHANGES:
Problem:
When running real.exe on multiple processes with MPI, one or more processes occasionally crash in setup_physics_suite (in share/module_check_a_mundo.F#L2640), because that routine uses model_config_rec % physics_suite, which on some machines is not initialized. The behavior is as if the broadcast of model_config_rec performed just before, in main/real_em.F#L124, had not been received by all processes. This has been linked to wrf_dm_initialize being non-blocking from an MPI point of view. I had never seen this bug before; it has only happened on one machine (an Intel-based cluster using Intel-MPI and ifort).
Solution:
An MPI barrier is added at the end of wrf_dm_initialize to force all of the processes to be synchronized before checking the namelist consistency. It solves the issue on my machine. Since this happens immediately after reading in the namelist, no performance issues are expected, as the read and broadcast of the namelist occur only once per real / WRF / ndown run. This is a simplification of PR #1268, which had extra white space.
ISSUE: Fixes #1267
LIST OF MODIFIED FILES:
M external/RSL_LITE/module_dm.F
TESTS CONDUCTED:
1. On the only machine where I have seen the bug occur, this change fixes the problem. No other test was conducted since I couldn't reproduce the bug on another setup.
2. Jenkins testing is all PASS.
RELEASE NOTES:
When running real.exe on multiple processes with MPI, one or more processes occasionally crash in setup_physics_suite (in share/module_check_a_mundo.F). This has been traced to the fact that wrf_dm_initialize is non-blocking from an MPI point of view. The problem is intermittent and has only happened on one machine (an Intel-based cluster using Intel-MPI and ifort). An MPI barrier has been added at the end of wrf_dm_initialize to force all processes to be synchronized before checking namelist consistency.
(cherry picked from commit 183600b)
TYPE: [bug fix]
KEYWORDS: real.exe, MPI, bug fix
SOURCE: Marc Honnorat (EXWEXs)
DESCRIPTION OF CHANGES:
Problem:
The communicator `mpi_comm_allcompute`, created by subroutine `split_communicator` called from `init_modules(1)`, is not explicitly activated for the call to `wrf_dm_bcast_bytes( configbuf, nbytes )` in real.exe. On some platforms, this may prevent the namelist configuration (placed in `configbuf` by `get_config_as_buffer()`) from being broadcast across the MPI processes _before_ the call to `setup_physics_suite()`. An example of a problematic platform: a cluster of Intel Xeon E5-2650 v4 nodes running CentOS Linux release 7.6.1810, with Intel Parallel Studio XE (various versions, including 2018u3 and 2020u4) and the matching Intel MPI Library.
Solution:
The initialization step used in the WRF executable never triggers a failure as described in issue #1267. This PR reuses the temporary MPI context switch from the WRF code.
ISSUE: Fixes #1267
LIST OF MODIFIED FILES:
M main/real_em.F
TESTS CONDUCTED:
1. The modification systematically solves the problem on the noted cluster.
2. Jenkins tests are all passing.
RELEASE NOTE:
A fix for an MPI synchronization bug related to (unused) split communicators in the real program provides a solution to issue #1267. For users who have had no trouble running the real program with MPI, this will have no impact.
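To make the "temporary MPI context switch" concrete, here is a hedged, self-contained sketch of the pattern the commit describes: save the currently active communicator, activate the all-compute one for the configuration broadcast, then restore the original. The `sketch_*` module below is a simplified stand-in modeled on the behavior described above (WRF's own routines are `wrf_get_dm_communicator` / `wrf_set_dm_communicator` / `wrf_dm_bcast_bytes`); it is not WRF source, and the exact placement in main/real_em.F is an assumption:

```fortran
! Simplified stand-in for WRF's distributed-memory helpers, to show
! the save / switch / restore pattern around the namelist broadcast.
MODULE dm_comm_sketch
   USE mpi
   IMPLICIT NONE
   INTEGER :: active_comm = MPI_COMM_NULL   ! communicator the bcast helper uses
CONTAINS
   SUBROUTINE sketch_set_dm_communicator( comm )
      INTEGER, INTENT(IN) :: comm
      active_comm = comm
   END SUBROUTINE sketch_set_dm_communicator

   SUBROUTINE sketch_get_dm_communicator( comm )
      INTEGER, INTENT(OUT) :: comm
      comm = active_comm
   END SUBROUTINE sketch_get_dm_communicator

   SUBROUTINE sketch_bcast_bytes( buf, nbytes )
      INTEGER, INTENT(INOUT) :: buf(*)
      INTEGER, INTENT(IN)    :: nbytes
      INTEGER :: ierr
      CALL MPI_Bcast( buf, nbytes, MPI_BYTE, 0, active_comm, ierr )
   END SUBROUTINE sketch_bcast_bytes
END MODULE dm_comm_sketch

PROGRAM context_switch_sketch
   USE mpi
   USE dm_comm_sketch
   IMPLICIT NONE
   INTEGER :: ierr, save_comm, all_compute_comm
   INTEGER :: configbuf(1024)   ! stand-in for the packed namelist buffer

   CALL MPI_Init( ierr )
   ! Stand-in for split_communicator: every rank is a compute rank here.
   CALL MPI_Comm_dup( MPI_COMM_WORLD, all_compute_comm, ierr )
   CALL sketch_set_dm_communicator( MPI_COMM_WORLD )

   ! The fix: save the active communicator, activate the all-compute
   ! communicator for the broadcast, then restore the saved one.
   CALL sketch_get_dm_communicator( save_comm )
   CALL sketch_set_dm_communicator( all_compute_comm )
   CALL sketch_bcast_bytes( configbuf, SIZE(configbuf)*4 )  ! 4 bytes/int assumed
   CALL sketch_set_dm_communicator( save_comm )

   CALL MPI_Finalize( ierr )
END PROGRAM context_switch_sketch
```

Switching communicators this way guarantees the broadcast runs on a communicator that includes every compute rank, regardless of which communicator was active when real.exe began its initialization.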
Fixes MPI synchronization bug in real.exe #1267
TYPE: [bug fix]
KEYWORDS: mpi, real, bug
SOURCE: Marc Honnorat (EXWEXs)
DESCRIPTION OF CHANGES:
Problem:
When running real.exe on multiple processes with MPI, one or more processes occasionally crash in setup_physics_suite (in share/module_check_a_mundo.F). This may be due to the fact that wrf_dm_initialize is apparently non-blocking from an MPI point of view. See #1267 for an example.
Solution:
An MPI barrier is added just after the call to wrf_dm_initialize to force all processes to be synced before checking namelist consistency.
ISSUE:
Fixes #1267
LIST OF MODIFIED FILES:
TESTS CONDUCTED:
On the only machine where I have seen the bug occur, this change fixes the problem. No other test was conducted since I couldn't reproduce the bug on another setup.