Compiler toolchain compatibility for transition to new HPC #157

Closed
aidanheerdegen opened this issue Aug 14, 2019 · 23 comments
@aidanheerdegen
Contributor

NCI is installing a new peak HPC, called gadi

The new machine will not support the 1.x series of OpenMPI, and current builds use 1.10.2. We will need to migrate to a new version of OpenMPI, which will also require a new version of the Intel Fortran compiler.

This issue is a collection point for information about tests that have been performed, so that effort is not duplicated.

@aidanheerdegen
Contributor Author

Rui forwarded this info:

I built the access-om2 model with OpenMPI v1, 2, 3 and 4 + the Intel 2019 compiler, and they all work fine for the 1deg and 0.25deg examples. However, the two 0.1deg examples, i.e. 01deg_jra55_iaf and 01deg_jra55_ryf, crashed for all builds, including the original one with OpenMPI 1.10.2 + the Intel 2017 compiler.

@rxy900

rxy900 commented Aug 15, 2019

Following Andrew's advice, the above issue has been fixed by changing ice_ocean_timestep from 600 to 300.
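
For reference, a minimal sketch of making that change from the control directory (assuming the coupling timestep is set in accessom2.nml, as in typical ACCESS-OM2 control directories; adjust the path if it lives elsewhere):

# Halve the ice/ocean coupling timestep and confirm the new value
sed -i 's/ice_ocean_timestep *= *600/ice_ocean_timestep = 300/' accessom2.nml
grep ice_ocean_timestep accessom2.nml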

@rxy900

rxy900 commented Aug 15, 2019

Another issue on 01deg_jra55_iaf:
It simply stopped after reading zbgc_nml, and the warning message in access-om2.err is "ice: Input nprocs not same as system request".

bash-4.1$ more access-om2.out
YATM_COMMIT_HASH=b6caeab4bdc1dcab88847d421c6e5250c7e70a2c
matmxx: LIBACCESSOM2_COMMIT_HASH=b6caeab4bdc1dcab88847d421c6e5250c7e70a2c
NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 32768.
&MPP_IO_NML
HEADER_BUFFER_VAL = 16384,
GLOBAL_FIELD_ON_ROOT_PE = T,
IO_CLOCKS_ON = F,
SHUFFLE = 1,
DEFLATE_LEVEL = 5
/
NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 131072.
NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 115200.

                                           ======== MODEL BEING DRIVEN BY OCEAN_SOLO_MOD ========

&OCEAN_SOLO_NML
N_MASK = 0,
LAYOUT_MASK = 20,
MASK_LIST = 4096
0,
RESTART_INTERVAL = 6*0,
DEBUG_THIS_MODULE = F,
ACCESSOM2_CONFIG_DIR = ../

/
mom5xx: LIBACCESSOM2_COMMIT_HASH=b6caeab4bdc1dcab88847d421c6e5250c7e70a2c
Reading setup_nml
Reading grid_nml
Reading tracer_nml
Reading thermo_nml
Reading dynamics_nml
Reading shortwave_nml
Reading ponds_nml
Reading forcing_nml
NOTE from PE 0: diag_manager_mod::diag_manager_init: prepend_date only supported when diag_manager_init is called with time_init present.
Diagnostic output will be in file
ice_diag.d

Reading zbgc_nml

MPI_ABORT was invoked on rank 5744 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

Any advice on fixing this issue? Thanks.

@aekiss
Contributor

aekiss commented Aug 16, 2019

You'll need to change nprocs in ice/cice_in.nml to match ncpus under name: ice in config.yaml.
This is set up correctly in https://github.com/COSIMA/01deg_jra55_iaf - are you using something different?
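
A quick way to double-check that the two settings agree (a sketch only; the grep patterns assume the standard control-directory layout, with config.yaml at the top level and cice_in.nml under ice/):

# Print the ice PE count from both files; the two numbers must match
grep -A 5 'name: ice' config.yaml | grep ncpus
grep -i nprocs ice/cice_in.nml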

@rxy900

rxy900 commented Aug 16, 2019

I executed the example from the cosima-om2 repository:
https://github.com/OceansAus/access-om2.git

Comparing it with the one from https://github.com/COSIMA/01deg_jra55_iaf, I can see some differences:

config.yaml:
2,3c2,3
< queue: normalbw
---
> queue: normal
6c6
< ncpus: 5180
---
> ncpus: 5968
30c31
< ncpus: 799
---
>   ncpus: 1392

cice_in.nml:
8c8
< , ndtd = 2
---
> , ndtd = 3
50c50
< nprocs = 799
---
> nprocs = 1600
52c52
< , distribution_type = 'sectrobin'
---
> , distribution_type = 'roundrobin'
177d176
< , highfreq = .true.

So the example in the cosima-om2 package uses sandybridge nodes and specifies 1392 cores for ice in config.yaml but 1600 in cice_in.nml.

The case from https://github.com/COSIMA/01deg_jra55_iaf uses broadwell nodes, with a consistent 799 cores for cice in both config.yaml and cice_in.nml.

I will try the example from https://github.com/COSIMA/01deg_jra55_iaf; it seems the example link within https://github.com/OceansAus/access-om2.git needs to be updated.

@aekiss
Contributor

aekiss commented Aug 16, 2019

Yes, I'm in the process of updating everything in https://github.com/OceansAus/access-om2, including these control dirs. If you want to try the very latest (bleeding edge) config, use branch ak-dev on https://github.com/COSIMA/01deg_jra55_iaf

@rxy900

rxy900 commented Aug 19, 2019

The example from https://github.com/COSIMA/01deg_jra55_iaf works for all builds using OpenMPI v1, 2, 3 and 4, as the number of CPU cores used for cice is consistent between config.yaml and cice_in.nml. Again, I also needed to change the original value of ice_ocean_timestep from 450 to 300 to avoid the errors seen with 01deg_jra55_ryf.

@aekiss
Contributor

aekiss commented Aug 26, 2019

Comment from #127 in December:
@marshallward has found that the model runs reliably and efficiently when built with OpenMPI 3.0.3 using the Intel 19 compiler (since the .mod files are not compatible with Intel 18).

@aekiss
Contributor

aekiss commented Aug 27, 2019

I gather we may also need to migrate to a newer netcdf library on gadi. I suppose something in the latest 4.6.x series would be most future-proof. There's some discussion here: COSIMA/libaccessom2#24

@benmenadue

@aekiss NetCDF 4.7.0 has been out for a while now (since the start of May this year), so I'd suggest looking into using that one...

@aekiss
Contributor

aekiss commented Aug 27, 2019

Thanks, but 4.6.1 seems to be the newest module on raijin.

@benmenadue

If you want the new version, just send an e-mail to the helpdesk and someone will install it for you :-) .

@aekiss
Contributor

aekiss commented Sep 11, 2019

Note that modules are loaded in numerous places, which would all need updating:

${ACCESS_OM_DIR}/src/mom/bin/environs.nci
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.360x300 
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.1440x1080 
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.3600x2700 
${ACCESS_OM_DIR}/src/libaccessom2/build_on_raijin.sh 
${ACCESS_OM_DIR}/src/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/util/make_dir/config.nci

Have I missed anything?

libcheck.sh is intended to give us an overview of all this.
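
Until libcheck.sh covers this, a rough way to spot any stragglers (an illustrative one-liner only, not the contents of libcheck.sh; assumes ACCESS_OM_DIR is set and the source trees are checked out):

# List every 'module load' across the source tree so no build file is missed
# (this will also pick up files under build/ directories; filter as needed)
grep -rn 'module load' "${ACCESS_OM_DIR}/src"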

@aekiss
Contributor

aekiss commented Dec 3, 2019

#178 builds with
intel-compiler/2019.5.281
netcdf/4.7.1
openmpi/4.0.1

I guess we can close this issue now?
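
For anyone reproducing that environment on gadi, the equivalent module loads would be something like the following (module names taken from the list above; load order assumed not to matter):

module load intel-compiler/2019.5.281
module load netcdf/4.7.1
module load openmpi/4.0.1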

aekiss closed this as completed Dec 10, 2019
@aekiss
Contributor

aekiss commented Dec 13, 2019

gadi now has OpenMPI 4.0.2 installed, which is the latest release: https://www.open-mpi.org/software/ompi/v4.0/
Any objections to using that instead of 4.0.1?
4.0.2 fixes a lot of bugs: https://raw.githubusercontent.com/open-mpi/ompi/v4.0.x/NEWS

@aekiss
Contributor

aekiss commented Jan 13, 2020

@aidanheerdegen commented on slack that openmpi/4.0.1 throws segfaults, and Peter D has moved to openmpi/4.0.2 for this reason. So I think we should also use openmpi/4.0.2. Any objections?
ping @penguian, @nichannah, @russfiedler

aekiss added a commit to COSIMA/oasis3-mct that referenced this issue Jan 16, 2020
aekiss added a commit to COSIMA/libaccessom2 that referenced this issue Jan 16, 2020
aekiss added a commit to mom-ocean/MOM5 that referenced this issue Jan 16, 2020
aekiss added a commit to COSIMA/cice5 that referenced this issue Jan 16, 2020
aekiss added a commit that referenced this issue Jan 16, 2020
@aekiss
Contributor

aekiss commented Jan 17, 2020

I've changed these:

${ACCESS_OM_DIR}/src/mom/bin/environs.nci
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.360x300
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.1440x1080
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.3600x2700
${ACCESS_OM_DIR}/src/libaccessom2/build_on_gadi.sh
${ACCESS_OM_DIR}/src/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/util/make_dir/config.gadi

so that the gadi-transition branch now uses OpenMPI 4.0.2 in all executables.

The new gadi builds with OpenMPI 4.0.2 are here:

/g/data4/ik11/inputs/access-om2/bin/yatm_575fb04.exe
/g/data4/ik11/inputs/access-om2/bin/fms_ACCESS-OM_4a2f211_libaccessom2_575fb04.x
/g/data4/ik11/inputs/access-om2/bin/cice_auscom_360x300_24p_365bdc1_libaccessom2_575fb04.exe
/g/data4/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_365bdc1_libaccessom2_575fb04.exe
/g/data4/ik11/inputs/access-om2/bin/cice_auscom_1440x1080_480p_365bdc1_libaccessom2_575fb04.exe

I haven't tested whether they run.
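
One quick sanity check short of a full run (just a suggestion, not something that has been done here): confirm which Open MPI library the new executables are linked against, e.g.

# Should report libmpi from the openmpi/4.0.2 tree if the rebuild picked it up
ldd /g/data4/ik11/inputs/access-om2/bin/yatm_575fb04.exe | grep -i libmpi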

@aidanheerdegen
Contributor Author

I do think it is worthwhile upgrading to OpenMPI 4.0.2 before pushing this to master. But so that it is documented and doesn't disappear into the memory hole: Peter D said in yesterday's MOM meeting that the segfaults were with an older version of OpenMPI (3.x?) and were solved by moving to OpenMPI 4.

So this change does not, a priori, mean more stable performance with the tenth-degree configuration. The 1 and 0.25 degree configurations seem to be fine.

aekiss added a commit to COSIMA/01deg_jra55_iaf that referenced this issue Jan 17, 2020
aekiss added a commit to COSIMA/01deg_jra55_ryf that referenced this issue Jan 17, 2020
aekiss added a commit to COSIMA/025deg_jra55_iaf that referenced this issue Jan 17, 2020
aekiss added a commit to COSIMA/025deg_jra55_ryf that referenced this issue Jan 17, 2020
aekiss added a commit to COSIMA/1deg_jra55_iaf that referenced this issue Jan 17, 2020
aekiss added a commit to COSIMA/1deg_jra55_ryf that referenced this issue Jan 17, 2020
@benmenadue

Is there a compelling reason to hard-code the OpenMPI version? I'd suggest keeping it up to date with the latest version so that you don't get surprised as new versions are released and existing ones are deprecated or removed.

@aidanheerdegen
Contributor Author

It is a very complex collection of code, model configuration and build environment. Keeping the build environment as stable as possible takes out one possible culprit when things stop working.

@aekiss
Contributor

aekiss commented Apr 16, 2020

Just noting that intel-compiler/2020.0.166 is now installed on Gadi, whereas we are using intel-compiler/2019.5.281. Presumably there's no reason to switch to the newer compiler?

@aidanheerdegen
Contributor Author

I generally wouldn't change things unless necessary. It just adds another possible thing to go wrong.

It is a pretty trivial change, so I'd suggest bedding down the new code/forcing versions and then upgrading that stuff later, when comparisons can easily be made.

There is a confounding factor: you may be reluctant to change anything once OMIP-style runs have started, so it's up to you I guess.

@aekiss
Contributor

aekiss commented Jun 10, 2020

Are we ready to close this issue now? AFAIK the gadi transition is now complete.
