FMS2+MOM6 model crashes during restart with mpich errors #761
Comments
@nikizadehgfdl The root pe goes inside Which will then call Inside Lines 1257 to 1258 in caf4466
MOM can either remove the I tried removing the if and that ran successfully. @marshallward I think this is the same crash that Will Cooke was getting with SPEAR.
Thanks @uramirez8707. You would also have to set
Thanks @uramirez8707, your suggestion to remove if (is_root_pe()) in MOM_restart.F90 worked for me as well, and the tests that were crashing before now pass. @marshallward and @adcroft, do you think this is a bug in MOM_restart.F90? Regarding passing broadcast=.false. to the functions: I once tried it, and the test crashed on the function return below, because without the broadcast nvars is not defined on all PEs when the function returns.
If It sounds like the intention is that all ranks would call the function, and the I am not very familiar with this code block, and there could be some unexpected behavior, but if it fixes your case and does not produce any regressions then it seems correct.
Can I ask if this is over all ranks? Or just IO-domain ranks?
The IO root PE does the reading and broadcasts the result to the other ranks in its io_pelist.
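The scoping described in this reply can be pictured with a small toy model. It is not FMS code: the consecutive rank layout, the choice of the first rank in each pelist as its IO root, and the helper names `io_pelists` and `read_and_share` are all assumptions made for illustration.

```python
def io_pelists(n_ranks, ranks_per_io_domain):
    """Split ranks 0..n_ranks-1 into consecutive IO-domain pelists (toy layout)."""
    return [list(range(start, min(start + ranks_per_io_domain, n_ranks)))
            for start in range(0, n_ranks, ranks_per_io_domain)]


def read_and_share(pelists, read_fn):
    """Each pelist's first rank (assumed IO root) reads; the value goes to that pelist only."""
    seen = {}
    for pelist in pelists:
        io_root = pelist[0]
        value = read_fn(io_root)   # only the IO root touches the file
        for rank in pelist:        # stands in for a broadcast over this pelist
            seen[rank] = value
    return seen


pelists = io_pelists(8, 4)
seen = read_and_share(pelists, lambda root: f"restart data read by rank {root}")
print(pelists)
print(seen[5])
```

In this toy layout, ranks 4 to 7 depend on rank 4, not on the global root (rank 0), so a guard that lets only the global root through would leave the second IO domain's root out of the call as well.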
I think that explains why we missed this problem. We still need to improve our IO layout testing.
- This addresses the FMS issue #761 (NOAA-GFDL/FMS#761).
- There is an mpp_broadcast in the FMS2 subroutine get_unlimited_dimension_name(), and this subroutine has to be called by all PEs, so it cannot be inside an if (is_root_pe()) block.
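Why a broadcast cannot sit behind a root-only guard can be sketched with a toy model in which each "rank" is a Python thread and the collective completes only once every rank has entered it. This is not MPI or FMS code: `run_ranks`, the `guarded` flag, and the timeout are hypothetical, and a real mismatched collective may hang or corrupt a later message rather than time out cleanly.

```python
import threading


def run_ranks(n_ranks, guarded, timeout=0.5):
    """Run n_ranks toy 'ranks'; if guarded, only rank 0 enters the broadcast."""
    barrier = threading.Barrier(n_ranks)  # the collective needs every rank
    shared = {}
    results = [None] * n_ranks

    def broadcast(rank, value, root=0):
        if rank == root:
            shared["value"] = value
        # Raises BrokenBarrierError if some rank never arrives.
        barrier.wait(timeout=timeout)
        return shared["value"]

    def worker(rank):
        if guarded and rank != 0:
            results[rank] = "skipped"  # the guard kept this rank out
            return
        try:
            results[rank] = broadcast(rank, "time" if rank == 0 else None)
        except threading.BrokenBarrierError:
            results[rank] = "stuck"    # root waited for ranks that never came

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_ranks(4, guarded=False))
print(run_ranks(4, guarded=True))
```

With the guard removed, every rank receives the root's value; with the guard in place, the root is left waiting on ranks that never enter the call.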
Describe the bug
OMIP models (and probably coupled models) crash during restart with strange MPICH errors.
I dug in with ddt and saw that it is crashing on an mpp_broadcast call in netcdf_io.F90.
As a result of the bad broadcast, the variable "nvar" gets junk, and then an allocation(array(nvar)) is attempted, hence the "insufficient virtual memory" red herring in stdout:
stdout:
/lustre/f2/scratch/Niki.Zadeh/FMS2021.02_mom6_20210603_FMS2/OM4p5_IAF_BLING_CFC_abio_csf_mle200/ncrc4.intel18-prod/stdout/run/OM4p5_IAF_BLING_CFC_abio_csf_mle200_2x0m1d_256x1o.o268828545
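How a junk count can surface as an out-of-memory message can be illustrated with a short sketch. The specifics are assumptions: the junk value in the actual crash is not known, and the stray payload (four bytes of a string), the little-endian 32-bit integer, and the 8 bytes per array element are chosen only to make the arithmetic concrete.

```python
import struct

# Hypothetical stray payload: four bytes of text landing where an integer count was expected.
stray_bytes = b"Time"

# Reinterpret the bytes as a little-endian 32-bit signed integer.
nvars = struct.unpack("<i", stray_bytes)[0]

# Memory an array of nvars elements would need, assuming 8 bytes per element.
bytes_needed = nvars * 8

print(nvars)
print(f"{bytes_needed / 2**30:.1f} GiB requested")
```

An allocation sized from such a value fails for lack of memory even though memory was never the underlying problem, which is why the message is a red herring.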
To Reproduce
module load fre/bronx-18
fremake -x /ncrc/home2/Niki.Zadeh/frerts/FMS2021.02_mom6_20210603/_FMS2/OMIP4_CORE2.xml.20210610165957 -p ncrc4.intel18 -t debug MOM6_SIS2_compile_FMS2
frerun -x /ncrc/home2/Niki.Zadeh/frerts/FMS2021.02_mom6_20210603/_FMS2/OMIP4_CORE2.xml.20210610165957 -p ncrc4.intel18 -t debug -r debug OM4p5_IAF_BLING_CFC_abio_csf_mle200
Expected behavior
Restart. No crash
System Environment
Describe the system environment, include:
Additional context
Smaller test cases with io-tiled restarts run fine with the same executable.
I think this crash has to do with the restart being io-tiled and split across more than one file.
Here's the call stack that causes the crash:
MOM_io_infra.F90: if (present(nvar)) nvar = get_num_variables(IO_handle%fileobj)
netcdf_io.F90: call mpp_broadcast(nvars, fileobj%io_root, pelist=fileobj%pelist)