FMS2+MOM6 model crashes during restart with mpich errors #761

Closed
nikizadehgfdl opened this issue Jun 11, 2021 · 7 comments
Comments

@nikizadehgfdl
Contributor

Describe the bug
OMIP models (and probably coupled models) crash during restart with strange MPICH errors.
I dug in with ddt and saw it is crashing on an mpp_broadcast call in netcdf_io.F90.
As a result of the bad broadcast, the variable "nvar" gets junk, and then an allocation(array(nvar)) is attempted, hence the "insufficient virtual memory" red herring in stdout:

Rank 32 [Thu Jun 10 18:31:13 2021] [c5-0c2s9n3] Fatal error in PMPI_Bcast: Message truncated, error stack:
PMPI_Bcast(1614)....................: MPI_Bcast(buf=0x7ffffffcfc2c, count=1, dtype=0x4c000430, root=0, comm=0xc4000001) failed
MPIR_Bcast_impl(1454)...............:
MPIR_CRAY_Bcast(1101)...............:
MPIR_CRAY_Bcast_Tree(382)...........:
MPIC_Recv(418)......................:
MPIDI_CH3U_Request_unpack_uebuf(595): Message truncated; 256 bytes received but buffer size is 4
forrtl: error (76): Abort trap signal
forrtl: severe (41): insufficient virtual memory

Image              PC                Routine            Line        Source
fms_MOM6_SIS2_com  000000000175E4A9  mpp_mod_mp_mpp_br         194  mpp_transmit_mpi.h
fms_MOM6_SIS2_com  0000000001C86EC5  netcdf_io_mod_mp_        1318  netcdf_io.F90
fms_MOM6_SIS2_com  0000000000776C97  mom_io_infra_mp_g         436  MOM_io_infra.F90
fms_MOM6_SIS2_com  000000000093625E  mom_restart_mp_re        1104  MOM_restart.F90
fms_MOM6_SIS2_com  0000000000D19AB4  mom_state_initial         487  MOM_state_initialization.F90
fms_MOM6_SIS2_com  000000000098D344  mom_mp_initialize        2468  MOM.F90
fms_MOM6_SIS2_com  00000000008C0D58  ocean_model_mod_m         277  ocean_model_MOM.F90
fms_MOM6_SIS2_com  000000000040A840  coupler_main_IP_c        1728  coupler_main.F90
fms_MOM6_SIS2_com  0000000000401B42  MAIN__                    568  coupler_main.F90

stdout:
/lustre/f2/scratch/Niki.Zadeh/FMS2021.02_mom6_20210603_FMS2/OM4p5_IAF_BLING_CFC_abio_csf_mle200/ncrc4.intel18-prod/stdout/run/OM4p5_IAF_BLING_CFC_abio_csf_mle200_2x0m1d_256x1o.o268828545

To Reproduce
module load fre/bronx-18
fremake -x /ncrc/home2/Niki.Zadeh/frerts/FMS2021.02_mom6_20210603/_FMS2/OMIP4_CORE2.xml.20210610165957 -p ncrc4.intel18 -t debug MOM6_SIS2_compile_FMS2

frerun -x /ncrc/home2/Niki.Zadeh/frerts/FMS2021.02_mom6_20210603/_FMS2/OMIP4_CORE2.xml.20210610165957 -p ncrc4.intel18 -t debug -r debug OM4p5_IAF_BLING_CFC_abio_csf_mle200

Expected behavior
Restart. No crash

System Environment
Describe the system environment, include:

  • bronx-18 , intel18, ncrc4

Additional context
Smaller test cases with io-tiled restarts run fine with the same executable.
I think this crash has to do with the restart being io-tiled and split across more than one file.

ls /lustre/f2/scratch/Niki.Zadeh/work/FMS2021.02_mom6_20210603/OM4p5_IAF_BLING_CFC_abio_csf_mle200_2x0m1d_256x1o.o268828545/INPUT/MOM.res*
INPUT/MOM.res_1.nc.0000  INPUT/MOM.res_1.nc.0002  INPUT/MOM.res.nc.0000  INPUT/MOM.res.nc.0002
INPUT/MOM.res_1.nc.0001  INPUT/MOM.res_1.nc.0003  INPUT/MOM.res.nc.0001  INPUT/MOM.res.nc.0003

Here's the call stack that causes the crash:
MOM_io_infra.F90: if (present(nvar)) nvar = get_num_variables(IO_handle%fileobj)
netcdf_io.F90: call mpp_broadcast(nvars, fileobj%io_root, pelist=fileobj%pelist)

@uramirez8707
Contributor

uramirez8707 commented Jun 14, 2021

@nikizadehgfdl
The problem is actually here:
https://github.com/NOAA-GFDL/MOM6/blob/da287e1a799e5d9250245ec3b8e9af00116896d3/src/framework/MOM_restart.F90#L1085-L1100

The root pe goes inside get_file_times, which will then call get_file_info
https://github.com/NOAA-GFDL/MOM6/blob/da287e1a799e5d9250245ec3b8e9af00116896d3/config_src/infra/FMS2/MOM_io_infra.F90#L466

Which will then call get_unlimited_dimension_name
https://github.com/NOAA-GFDL/MOM6/blob/da287e1a799e5d9250245ec3b8e9af00116896d3/config_src/infra/FMS2/MOM_io_infra.F90#L439

Inside get_unlimited_dimension_name, the root PE gets the unlimited dimension name and tries to broadcast it to the other ranks, but because the call sits inside the if (is_root_pe()) then block, the other ranks never reach the matching broadcast, so it cannot complete.

FMS/fms2_io/netcdf_io.F90

Lines 1257 to 1258 in caf4466

call mpp_broadcast(buffer, nf90_max_name, fileobj%io_root, &
pelist=fileobj%pelist)
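
One plausible reading of the "256 bytes received but buffer size is 4" message in the original report: collective calls on a communicator are matched in call order, so the name broadcast posted by the root pairs up with the next broadcast the other ranks reach, which is the integer nvars. The toy Python model below illustrates that pairing. It is not FMS or MPI code; the function names are hypothetical, and it assumes one byte per character (nf90_max_name is 256) and a 4-byte default integer.

```python
NF90_MAX_NAME = 256  # netCDF limit on name length, assumed one byte per character
INT_BYTES = 4        # assumed size of a default Fortran integer


def posted_broadcasts(rank, root, root_only_block):
    """Return, in call order, the (label, bytes) broadcasts one rank posts."""
    posted = []
    # get_unlimited_dimension_name: skipped by non-root ranks when guarded.
    if rank == root or not root_only_block:
        posted.append(("unlimited dimension name", NF90_MAX_NAME))
    # get_num_variables: reached by every rank.
    posted.append(("nvars", INT_BYTES))
    return posted


def find_truncations(n_ranks, root_only_block, root=0):
    """Pair each rank's k-th broadcast with the root's k-th broadcast."""
    sent_by_root = posted_broadcasts(root, root, root_only_block)
    errors = []
    for rank in range(n_ranks):
        if rank == root:
            continue
        received = posted_broadcasts(rank, root, root_only_block)
        for (_, sent), (_, buffer_size) in zip(sent_by_root, received):
            if sent > buffer_size:
                errors.append(
                    f"rank {rank}: Message truncated; {sent} bytes received "
                    f"but buffer size is {buffer_size}"
                )
                break
    return errors


if __name__ == "__main__":
    # Guarded call: every non-root rank sees a truncated message.
    for line in find_truncations(4, root_only_block=True):
        print(line)
    # Unguarded call: all ranks post the same sequence, so no errors.
    print(find_truncations(4, root_only_block=False))
```

In this model the guarded case produces one truncation per non-root rank, and the unguarded case produces none.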

MOM can either remove the if (is_root_pe()) then in their restore_state subroutine OR pass in broadcast=.false. to get_unlimited_dimension_name, so that the code does not attempt to broadcast anything.
https://github.com/NOAA-GFDL/FMS/blob/main/fms2_io/netcdf_io.F90#L1252-L1255

I tried removing the if and that ran successfully.

@marshallward I think this is the same crash that Will Cooke was getting with SPEAR.

@menzel-gfdl
Collaborator

Thanks @uramirez8707. You would also have to set broadcast=.false. in get_dimension_size.

@nikizadehgfdl
Contributor Author

Thanks @uramirez8707, your suggestion to remove if (is_root_pe()) in MOM_restart.F90 worked for me as well, and the tests that were crashing before now pass. @marshallward and @adcroft do you think this is a bug in MOM_restart.F90?

Regarding passing broadcast=.false. to these functions: I tried it once, and the test crashed at the function return below, because without the broadcast nvars is not defined on all PEs when the function returns.
https://github.com/NOAA-GFDL/FMS/blob/main/fms2_io/netcdf_io.F90#L1296
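
The toy Python sketch below illustrates the point about broadcast=.false.: if only the I/O root reads the value and the broadcast is skipped, every other rank returns whatever happened to be in the result variable. It is written from the description in this comment, not copied from netcdf_io.F90; the UNDEFINED marker and the helper nvars_on_all_ranks are hypothetical stand-ins.

```python
UNDEFINED = "<undefined>"  # stands in for an uninitialized Fortran integer


def get_num_variables(rank, io_root, nvars_in_file, broadcast=True):
    """Toy version: return nvars as seen by a single rank."""
    nvars = UNDEFINED
    if rank == io_root:
        nvars = nvars_in_file  # only the I/O root inquires the file
    if not broadcast:
        return nvars  # early return: non-root ranks keep UNDEFINED
    # Stand-in for mpp_broadcast: every rank ends up with the root's value.
    return nvars_in_file


def nvars_on_all_ranks(n_ranks, nvars_in_file, broadcast, io_root=0):
    """Collect the function result from every rank in one list."""
    return [
        get_num_variables(rank, io_root, nvars_in_file, broadcast)
        for rank in range(n_ranks)
    ]


if __name__ == "__main__":
    print(nvars_on_all_ranks(3, 12, broadcast=True))
    print(nvars_on_all_ranks(3, 12, broadcast=False))
```

Any caller that goes on to size an array from the result on a non-root rank would then be working with an undefined value.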

@marshallward
Member

If get_unlimited_dimension_name contains a barrier then I agree it's a bug and we cannot wrap it in an is_root_pe() if-block.

It sounds like the intention is that all ranks would call the function, and the broadcast flag should be the exceptional case, so removal of the if-block seems correct.

I am not very familiar with this code block, and there could be some unexpected behavior, but if it fixes your case and does not produce any regressions then it seems correct.

@marshallward
Member

Can I ask if this is over all ranks? Or just IO-domain ranks?

@uramirez8707
Contributor

The io root pe will do the reading and broadcast it to the other ranks in its io_pelist.
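
As a rough picture of that pattern, the toy Python sketch below splits ranks into groups and lets the first rank of each group read and hand the value to the rest of its group. It is a simplification with hypothetical names: real I/O domains come from the model's io_layout rather than from contiguous rank ranges, and which rank acts as I/O root is assumed here.

```python
def split_into_io_pelists(n_ranks, n_io_domains):
    """Split ranks 0..n_ranks-1 into equal contiguous groups (a simplification)."""
    if n_ranks % n_io_domains != 0:
        raise ValueError("n_ranks must be a multiple of n_io_domains")
    size = n_ranks // n_io_domains
    return [list(range(first, first + size)) for first in range(0, n_ranks, size)]


def read_and_broadcast(pelist, read_from_file):
    """Only the group's I/O root reads; the result is shared within the group."""
    io_root = pelist[0]  # assumption: the first rank in the list is the I/O root
    value = read_from_file(io_root)
    return {rank: value for rank in pelist}


if __name__ == "__main__":
    for pelist in split_into_io_pelists(8, 4):
        # The lambda is a placeholder for reading one restart file.
        print(read_and_broadcast(pelist, lambda root: 100 + root))
```

The broadcast in this picture only involves the ranks of one group, so every rank in that group has to reach it.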

@marshallward
Member

I think that explains why we missed this problem. We still need to improve our IO layout testing.

nikizadehgfdl added a commit to nikizadehgfdl/MOM6 that referenced this issue Jul 12, 2021
- This addresses the FMS issue #761
NOAA-GFDL/FMS#761

- There is an mpp_broadcast in the FMS2 subroutine
get_unlimited_dimension_name() and this subroutine has to be called by
all pes, so it cannot be inside an if(is_root_pe()) block
@rem1776 rem1776 closed this as completed Mar 3, 2022