Add Gaea C5 support #1783

Closed
ulmononian opened this issue Jun 5, 2023 · 7 comments · Fixed by #1784
Assignees
Labels
enhancement New feature or request

Comments

@ulmononian
Collaborator

Description

The capability to run the WM on Gaea c5 should be added. Since c5 has different architecture, cores, login nodes, module management software, and additional compilers, the settings in the WM currently in place for c4 do not apply to c5.

spack-stack/1.4.0 is already installed on c5 (/lustre/f2/dev/wpo/role.epic/contrib/spack-stack/spack-stack-1.4.0-c5/envs/unified-env-v2/install/modulefiles/Core) and testing is underway.

Solution

A Gaea c5 spack-stack-based module file should be added, the Gaea fv3_conf files need to be modified, and some additional logic in the RT scripts is required.

NOTE: If support for BOTH c3/c4 and c5 is expected, then there will need to be two Gaea modulefiles, two fv3_conf files, and additional logic added to the RT scripts. This may prove to be a bit cumbersome, so perhaps discussion is needed if support for c3/c4 should be dropped in favor of c5 once everything is working properly on the latter.
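As a rough sketch of what the "additional logic" for supporting both partitions might look like, the shell functions below pick a modulefile and fv3_conf file based on the host name. The host name patterns, the function names, and the file names are all assumptions made for illustration; they are not taken from the WM or from the Gaea documentation.

```shell
# Hypothetical helper: map a host name to a Gaea partition label.
# The patterns are illustrative assumptions, not confirmed Gaea node names.
gaea_partition() {
    case "$1" in
        gaea5[0-9]*) echo "c5" ;;
        gaea*)       echo "c3c4" ;;
        *)           echo "unknown" ;;
    esac
}

# Hypothetical helper: choose per-partition files.
# The file names are placeholders, not the real WM file names.
select_gaea_files() {
    case "$(gaea_partition "$1")" in
        c5)   echo "modulefile=gaea-c5.example fv3_conf=fv3_conf.gaea-c5.example" ;;
        c3c4) echo "modulefile=gaea.example fv3_conf=fv3_conf.gaea.example" ;;
        *)    echo "unsupported host: $1" >&2; return 1 ;;
    esac
}

# Example calls with made-up host names.
select_gaea_files "gaea51"
select_gaea_files "gaea10"
```

Keeping the partition detection in one function means the RT scripts would only branch in a single place if c3/c4 support is dropped later.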

Related to

#1724

@ulmononian
Collaborator Author

currently testing control_c48. compilation is fine. model run does not complete successfully. it fails near the very end. test is here: /lustre/f2/scratch/role.epic/FV3_RT/rt_254132.

truncated err output:

+ srun --label -n 8 ./fv3.exe
0:
0: fv3.exe:9993 terminated with signal 11 at PC=7fc923719c47 SP=7ffcc2bb4c38.  Backtrace:
1:
1: fv3.exe:9994 terminated with signal 11 at PC=7f51dcabbc47 SP=7ffe85427338.  Backtrace:
...
7: fv3.exe:10000 terminated with signal 11 at PC=7fd5e8c5ec47 SP=7ffdce5377b8.  Backtrace:
0: /opt/cray/pe/lib64/libmpi_intel.so.12(+0x2418c47)[0x7fc923719c47]
0: /opt/cray/pe/lib64/libmpi_intel.so.12(+0x2028d90)[0x7fc923329d90]
0: /opt/cray/pe/lib64/libmpi_intel.so.12(+0x1cd974f)[0x7fc922fda74f]
5: /opt/cray/pe/lib64/libmpi_intel.so.12(+0x1cd974f)[0x7ff3000f374f]
...
0: /opt/cray/pe/lib64/libmpi_intel.so.12(PMPI_Barrier+0x16f)[0x7fc92159273f]
0: /sw/gaea-c5/spack-envs/base/opt/cray-sles15-zen2/intel-2022.0.2/darshan-runtime-3.4.0-w3prs3w3d4quwddoz4tglwep6ognnrks/lib/libdarshan.so.0(darshan_core_shutdown+0xd7)[0x7fc9241395a7]
0: /sw/gaea-c5/spack-envs/base/opt/cray-sles15-zen2/intel-2022.0.2/darshan-runtime-3.4.0-w3prs3w3d4quwddoz4tglwep6ognnrks/lib/libdarshan.so.0(MPI_Finalize+0x4e)[0x7fc92413823e]

and the final lines from the out file all read:

"[CRAYBLAS_WARNING] Application linked against multiple cray-libsci libraries".

@ulmononian
Collaborator Author

ulmononian commented Jun 6, 2023

the wm cmake settings (the top-level cmake/Intel.cmake config, the gaea cmake config configure_gaea.intel.cmake, and FV3/atmos_cubed_sphere/cmake/compiler_flags_Intel_Fortran.cmake) may need to be configured to use -march=core-avx-i rather than the current default, -march=core-avx2, based on the gaea c5 onboarding document.

to accommodate this for FV3/atmos_cubed_sphere, the following can be added directly after https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/blob/4285e3f3a0bf6c054f8a08fc03469dee6b65e428/cmake/compiler_flags_Intel_Fortran.cmake#L22 in the compiler_flags_Intel_Fortran.cmake file:

 elseif(AVX)
    set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -march=core-avx-i")

for all other WM components, it seems right now that -march=core-avx-i can be forced by using the -DAVX=ON (and perhaps simultaneously -DAVX2=OFF) cmake flags that are passed at model compile time.
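A minimal sketch of how those two switches could be appended to an existing cmake flag string for c5 only. The function name, the "gaea-c5" machine label, and the -DCMAKE_BUILD_TYPE=Release starting value are assumptions for illustration; only -DAVX=ON and -DAVX2=OFF come from the comment above.

```shell
# Sketch: append the AVX switches to a cmake flag string for one machine only.
# "gaea-c5" and the function name are illustrative assumptions.
add_avx_flags() {
    machine="$1"
    flags="$2"
    if [ "$machine" = "gaea-c5" ]; then
        flags="$flags -DAVX=ON -DAVX2=OFF"
    fi
    echo "$flags"
}

# Example: build the flag string that would be handed to cmake.
CMAKE_FLAGS=$(add_avx_flags "gaea-c5" "-DCMAKE_BUILD_TYPE=Release")
echo "cmake flags: $CMAKE_FLAGS"
```

Any other machine label leaves the flag string unchanged, so the existing avx2 default would still apply elsewhere.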

@ulmononian
Collaborator Author

when cray-libsci and darshan-runtime are unloaded (currently the unload commands are added to the gaea modulefile), control_c48 succeeds (see /lustre/f2/scratch/role.epic/FV3_RT/rt_136465). for now, the avx2/avx-i flag change may not be needed.
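For trying the same workaround interactively before changing the modulefile, a shell sketch along these lines may be enough. It assumes a `module` command (environment modules or Lmod) is available in the session; the function name is hypothetical, and the two module names are the ones mentioned above.

```shell
# Sketch: unload the two modules named above, if a `module` command exists.
# Falls back to a message so the function is safe to run on other systems.
unload_conflicting_modules() {
    for mod in cray-libsci darshan-runtime; do
        if command -v module >/dev/null 2>&1; then
            module unload "$mod"
        else
            echo "module command not found; would unload $mod"
        fi
    done
}

unload_conflicting_modules
```

This only affects the current shell session, so it is useful for a quick test rather than as a replacement for the modulefile change.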

gaea system admins described a known incompatibility between cray-libsci and the intel-classic/2022.0.2 & intel-oneapi/2022.0.2 compilers. however, for this testing & PR w/ spack-stack/1.4.0, intel-classic/2022.1.2 is used, so it appears that there is also an incompatibility between this compiler version and cray-libsci.
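One way to see which libsci shared libraries an executable resolves at load time is to filter `ldd` output, which may help confirm whether more than one cray-libsci library is being picked up. The sketch below is an assumption about how to check this, not a procedure from the admins; the function name is hypothetical and the sample input is made up.

```shell
# Sketch: print the distinct libsci shared libraries found in
# ldd-style output read from stdin.
list_libsci_libs() {
    awk '$1 ~ /^libsci/ { print $1 }' | sort -u
}

# Example with made-up ldd output (library names and paths are illustrative).
printf '\t%s\n' \
    'libsci_intel.so.5 => /opt/example/libsci_intel.so.5 (0x0000)' \
    'libsci_intel_mpi.so.5 => /opt/example/libsci_intel_mpi.so.5 (0x0000)' \
    'libc.so.6 => /lib64/libc.so.6 (0x0000)' \
  | list_libsci_libs
```

On a real build this would be run as `ldd ./fv3.exe | list_libsci_libs`; more than one line of output suggests multiple libsci libraries are linked.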

@ulmononian
Collaborator Author

with the cray-libsci module issue resolved, moved to testing the S2SWA configuration using cpld_control_p8. the model fails at runtime with what seems to be an issue with the aerosol cap (128: fv3.exe 0000000001D0F109 aerosol_cap_mp_mo 348 Aerosol_Cap.F90). the out file shows:

180:        Type 8 : Restart files second request
180:       -----------------------------------------
180:             From     : 2021/03/22 06:00:00 UTC
180:             To       : 2021/03/23 06:00:00 UTC
180:             Interval :            03:00:00
180:
180:        Wave model ...
180:  WW3 log written to /lustre/f2/scratch/wpo/role.epic/FV3_RT/rt_35470/cpld_contro
180:  l_p8_intel/./log.ww3
  0:  Starting pFIO input server on Clients
  0:  Starting pFIO output server on Clients
  0:  Character Resource Parameter: ROOT_CF:AERO.rc
  0:  Character Resource Parameter: ROOT_NAME:AERO
  0:  Character Resource Parameter: HIST_CF:AERO_HISTORY.rc
  0:  Character Resource Parameter: EXTDATA_CF:AERO_ExtData.rc
  0:  DU::SetServices: Dust emission scheme is fengsha
  0:  WARNING: falling back on MAPL NUM_BANDS
  0:  GOCART2G::Initialize: Starting...
  0:
  0:  Integer*4 Resource Parameter: RUN_DT:720
  0:  ===================>
  0:  MAPL_StateCreateFromSpecNew: var SU_NO3 already exists. Skipping ...
  0:  ===================>
  0:  MAPL_StateCreateFromSpecNew: var SU_OH already exists. Skipping ...
  0:  ===================>
  0:  MAPL_StateCreateFromSpecNew: var SU_H2O2 already exists. Skipping ...
  0:   oserver is not split
  0:
  0:  EXPSRC:GEOSgcm-v10.16.0
  0:  EXPID: gocart
  0:  Descr: GOCART2g_diagnostics_at_c360
  0:  DisableSubVmChecks: F
  0:
  0:  Reading HISTORY RC Files:
  0:  -------------------------
  0:  NOT using buffer I/O for file: AERO_HISTORY.rc
  0:  NOT using buffer I/O for file: inst_aod.rcx
  0:
  0:  Freq: 00060000  Dur: 00010000  TM:   -1  Collection: inst_aod

the run dir. is /lustre/f2/scratch/role.epic/FV3_RT/rt_35470.

@zach1221 self-assigned this Jun 13, 2023
@ulmononian
Collaborator Author

waiting on information regarding mapl functionality with intel compilers newer than 2021.7.x. please see #1791 for more information.

@ulmononian
Collaborator Author

opened an issue on the mapl repo: GEOS-ESM/MAPL#2213

@ulmononian
Collaborator Author

resolved with JCSDA/spack-stack#675: i.e. upgrade intel compiler to 2023.1.0 (and ifort/2021.9.0). see the description for the same fix on hercules: JCSDA/spack-stack#673 & #1733 (comment), as the fix worked for both machines.

Projects
Status: Done
2 participants