
Improve RRFS model startup time. #1322

Closed
dkokron opened this issue Aug 8, 2023 · 10 comments

dkokron (Contributor) commented Aug 8, 2023

The RRFS production team requested help understanding why their ensemble runs bring the Lustre file systems on Cactus and Dogwood to their knees (Ticket#2023032810000014). Pete Johnsen reports file system utilization of ~400 GB/s on a disk-based Lustre file system. Furthermore, job startup times increase with the number of ensemble members. I've measured 310 s startup times using 24 members running on Cactus:/lfs/h2.

An investigation revealed an enormous number of unexpectedly small read operations (<2 KB) during model startup. The following files were implicated:

phy_data.nc
sfc_data.nc
fv_tracer.res.tile1.nc
fv_core.res.tile1.nc
C3463_grid.tile7.nc

Altering file striping alone didn't resolve the problem, though striping does play a role in the final solution; the same is true of altering file chunking. Forcing the reads onto rank zero and then broadcasting to the appropriate ranks did improve startup time, but did not increase the size of the reads.

Steps to reproduce the behavior
I have a small code on the Cactus system that reads a single variable from the fv_core.res.tile1.nc file. I've used this unit test to evaluate potential solutions. Let me know if you want this code.

Environment: the NOAA production systems.
Currently Loaded Modules:

  1) craype-x86-rome    (H)    7) intel/19.1.3.304   13) zlib/1.2.11
  2) libfabric/1.11.0.0 (H)    8) cray-mpich/8.1.12  14) libpng/1.6.37
  3) craype-network-ofi (H)    9) cray-pals/1.2.2    15) libjpeg/9c
  4) envvar/1.0               10) netcdf/4.7.4       16) udunits/2.2.28
  5) craype/2.7.17            11) hdf5/1.10.6
  6) PrgEnv-intel/8.3.3       12) jasper/2.0.25

The proposed solution is the combination of the following changes

  1. enable MPI-IO collective buffering within the FMS source code (the subject of the current discussion)
  2. use appropriate NetCDF variable chunking (one chunk per z-level)
  3. use alternate Lustre file striping (one stripe per available disk OST each of size 2MB)
  4. set MPICH_MPIIO_HINTS in the PBS script
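
A rough sketch of how steps 2-4 might be applied outside of FMS. The stripe settings, chunk sizes, dimension names, and hint values below are illustrative assumptions for a C3463 case, not the tested production values:

      # (3) Restripe the input file: one 2 MB stripe per available OST
      #     (-c -1 means use all available OSTs).
      lfs setstripe -S 2M -c -1 INPUT/fv_core.res.tile1.nc

      # (2) Rechunk variables so each chunk covers one full z-level;
      #     dimension names and sizes here are hypothetical.
      nccopy -c zaxis_1/1,yaxis_1/3463,xaxis_1/3463 \
          fv_core.res.tile1.nc fv_core.res.tile1.chunked.nc

      # (4) In the PBS job script, pass MPI-IO hints for the NetCDF files
      #     (Cray MPICH colon-separated hint syntax; values are examples).
      export MPICH_MPIIO_HINTS="*.nc:cb_nodes=24:cb_buffer_size=16777216"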

The proposed solution has been tested on the same RRFS case and results in file system utilization of ~80 GB/s for only ~77 seconds running 24 members. The proposed FMS code modifications include specifying NF90_MPIIO in the mode argument to nf90_open in fms2_io/netcdf_io.F90. I'm doing this only for specified files.

use mpi, only: MPI_COMM_WORLD, MPI_INFO_NULL
...
       if (string_compare(trim(fileobj%path), "INPUT/phy_data.nc",               .true.) .or. &
           string_compare(trim(fileobj%path), "INPUT/fv_tracer.res.tile1.nc",    .true.) .or. &
           string_compare(trim(fileobj%path), "INPUT/sfc_data.nc",               .true.) .or. &
           string_compare(trim(fileobj%path), "INPUT/C3463_grid.tile7.nc",       .true.) .or. &
           string_compare(trim(fileobj%path), "INPUT/C3463_grid.tile7.halo3.nc", .true.) .or. &
           string_compare(trim(fileobj%path), "INPUT/fv_core.res.tile1.nc",      .true.)) then
        err = nf90_open(trim(fileobj%path), ior(NF90_NOWRITE, NF90_MPIIO), fileobj%ncid, &
                        comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
      else
        err = nf90_open(trim(fileobj%path), nf90_nowrite, fileobj%ncid, chunksize=fms2_ncchksz)
      endif

The other code change is to call nf90_var_par_access() with the nf90_collective option for all variables in specified files before calling nf90_get_var(). This is done in fms2_io/include/netcdf_read_data.inc for r4 and r8 2D and 3D variables.

e.g.

        if (string_compare(trim(fileobj%path), "INPUT/phy_data.nc", .true.) .or. &
            string_compare(trim(fileobj%path), "INPUT/sfc_data.nc", .true.)) then
          err = nf90_var_par_access(fileobj%ncid, varid, nf90_collective)
        endif

The proposed code changes do result in a "double free or corruption" at the end of the run. I have a suspicion that this is coming from my direct use of MPI_COMM_WORLD and/or MPI_INFO_NULL. I need some help here.

I'm also not sure we want to hard-code the file names in the FMS code. I need some suggestions for a solution here.
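
One way to avoid hard-coding the names might be to read the list of parallel-read files from a namelist. A rough sketch only; the module, namelist group, and routine names below are hypothetical, not existing FMS API:

    ! Sketch: read the list of files that should use collective MPI-IO
    ! from a namelist instead of hard-coding them in the source.
    module parallel_read_files_mod
      implicit none
      private
      public :: parallel_read_files_init, file_wants_parallel_read

      integer, parameter :: max_files = 32
      character(len=256) :: parallel_read_files(max_files) = ""
      namelist /fms2_io_parallel_nml/ parallel_read_files

    contains

      subroutine parallel_read_files_init(unit)
        integer, intent(in) :: unit   ! open unit on input.nml
        integer :: io_status
        read(unit, nml=fms2_io_parallel_nml, iostat=io_status)
      end subroutine parallel_read_files_init

      logical function file_wants_parallel_read(path)
        character(len=*), intent(in) :: path
        integer :: i
        file_wants_parallel_read = .false.
        do i = 1, max_files
          if (len_trim(parallel_read_files(i)) > 0 .and. &
              trim(parallel_read_files(i)) == trim(path)) then
            file_wants_parallel_read = .true.
            return
          end if
        end do
      end function file_wants_parallel_read
    end module parallel_read_files_mod

The nf90_open branch could then test file_wants_parallel_read(trim(fileobj%path)) instead of a chain of string_compare calls against literal paths.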

I have started a branch with the above changes.
https://github.com/dkokron/FMS/tree/fms.ParallelStartup

@dkokron changed the title from "Poor RRFS model startup." to "Improve RRFS model startup time." on Aug 8, 2023
pj-gdit commented Sep 26, 2023

Posting IO performance profiles for three different methods of reading RRFS restart files on the WCOSS2 Lustre file system. These profiles are from a quiet disk-based Lustre file system (HPE ClusterStor E1000) when running 30 ensemble members, 14 nodes per member.

  1. default FMS method where all MPI ranks read the input files
  2. FMS 2023.01.01 update where single MPI rank reads input files and uses MPP_Scatter to distribute data
  3. Parallel NetCDF4 prototype from Dan Kokron (fms.ParallelStartup branch)

The goal is to reduce pressure on the file system while achieving the shortest model initialization times when running many ensemble members concurrently.

RRFS_read1.pptx

thomas-robinson (Member) commented

@dkokron make sure you pull the updates to main. There are a lot of changes coming with the next release and I wouldn't want your code to get left behind and have large merge conflicts.

bensonr (Contributor) commented Sep 27, 2023

@pj-gdit - I believe you had some standalone tests that were used to generate this data; is it possible you could share that test code with us? I am interested in prototyping some solutions that could allow us to properly augment the existing IO layer.

dkokron (Contributor, Author) commented Oct 2, 2023 via email

dkokron (Contributor, Author) commented Oct 2, 2023 via email

dkokron (Contributor, Author) commented Oct 2, 2023

> @pj-gdit - I believe you had some standalone tests that were used to generate this data; is it possible you could share that test code with us? I am interested in prototyping some solutions that could allow us to properly augment the existing IO layer.

See attached
ForRustyBenson.tgz

You'll only need to re-stripe the fv_core.res.tile1.nc file for this unit tester. I mistakenly included instructions in the README for re-striping the other files too.

JacobCarley-NOAA commented
@dkokron @pj-gdit thanks for opening this issue.

@bensonr and @thomas-robinson Is it possible for this item to get into the repository in some accelerated/expedited fashion? This is critical for the eventual operational implementations of RRFS and 3DRTMA. Thanks!

JacobCarley-NOAA commented
I'm tagging @junwang-noaa, as she has expressed an interest in this topic as well.

bensonr (Contributor) commented Oct 18, 2023

@JacobCarley-NOAA - As there have been conversations going on concurrently in email and now here, I'm copying the reply I sent to others earlier this week.

We are starting to prototype an IO offload system, something we've been talking about for years now. Adding the proposed NetCDF4 updates is something we could look to incorporate as part of that work, but I don't expect anything to be available within the next six months.

I know you most likely need this sooner, and since you have knowledge of parallel NetCDF4 and have delved into FMS, I'd encourage you to add this as an option to the fms2_io subsystem and submit a PR. The FMS infrastructure library is an important part of our modeling system, and any contributions would need to adhere to our guidelines and pass our tests. If this is something you are willing to do, I'd suggest putting together a short project plan; we (GFDL) can look it over and, if needed, have a meeting to discuss.

uramirez8707 (Contributor) commented
Fixed in #1477
