
Improve RRFS model startup time. #1322

Closed
dkokron opened this issue Aug 8, 2023 · 10 comments

dkokron (Contributor) commented Aug 8, 2023

The RRFS production team requested help understanding why their ensemble runs bring the Lustre file systems on Cactus and Dogwood to their knees (Ticket#2023032810000014). Pete Johnsen reports file system utilization of ~400 GB/s on a disk-based Lustre file system. Furthermore, job startup times increase with the number of ensemble members. I've measured 310 s startup times using 24 members running on Cactus:/lfs/h2.

An investigation revealed an enormous number of unexpectedly small read operations (<2 KB) during model startup. The following files were implicated:

phy_data.nc
sfc_data.nc
fv_tracer.res.tile1.nc
fv_core.res.tile1.nc
C3463_grid.tile7.nc

Altering file striping alone didn't resolve the problem, though striping does play a role in the final solution; the same is true of altering file chunking. Forcing the reads onto rank zero and then broadcasting to the appropriate ranks did improve startup time, but did not increase the size of the reads.

Steps to reproduce the behavior
I have a small code on the Cactus system that reads a single variable from the fv_core.res.tile1.nc file. I've used this unit test to evaluate potential solutions. Let me know if you want this code.

Environment: the NOAA production systems.
Currently Loaded Modules:

  1) craype-x86-rome    (H)    7) intel/19.1.3.304   13) zlib/1.2.11
  2) libfabric/1.11.0.0 (H)    8) cray-mpich/8.1.12  14) libpng/1.6.37
  3) craype-network-ofi (H)    9) cray-pals/1.2.2    15) libjpeg/9c
  4) envvar/1.0               10) netcdf/4.7.4       16) udunits/2.2.28
  5) craype/2.7.17            11) hdf5/1.10.6
  6) PrgEnv-intel/8.3.3       12) jasper/2.0.25

The proposed solution is the combination of the following changes

  1. enable MPI-IO collective buffering within the FMS source code (the subject of the current discussion)
  2. use appropriate NetCDF variable chunking (one chunk per z-level)
  3. use alternate Lustre file striping (one stripe per available disk OST each of size 2MB)
  4. set MPICH_MPIIO_HINTS in the PBS script
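
A rough sketch of how steps 2-4 might be applied outside of FMS. The stripe settings, chunk sizes, dimension names, and hint values below are illustrative assumptions for a C3463 case, not the tested production values:

      # (3) Restripe the input file: one 2 MB stripe per available OST
      #     (-c -1 means use all available OSTs).
      lfs setstripe -S 2M -c -1 INPUT/fv_core.res.tile1.nc

      # (2) Rechunk variables so each chunk covers one full z-level;
      #     dimension names and sizes here are hypothetical.
      nccopy -c zaxis_1/1,yaxis_1/3463,xaxis_1/3463 \
          fv_core.res.tile1.nc fv_core.res.tile1.chunked.nc

      # (4) In the PBS job script, pass MPI-IO hints for the NetCDF files
      #     (Cray MPICH colon-separated hint syntax; values are examples).
      export MPICH_MPIIO_HINTS="*.nc:cb_nodes=24:cb_buffer_size=16777216"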

The proposed solution has been tested on the same RRFS case and results in file system utilization of ~80 GB/s for only ~77 seconds running 24 members. The proposed FMS code modifications include specifying NF90_MPIIO in the mode argument to nf90_open in fms2_io/netcdf_io.F90. I'm doing this only for specified files.

use mpi, only: MPI_COMM_WORLD, MPI_INFO_NULL
...
       if (string_compare(trim(fileobj%path), "INPUT/phy_data.nc",               .true.) .or. &
           string_compare(trim(fileobj%path), "INPUT/fv_tracer.res.tile1.nc",    .true.) .or. &
           string_compare(trim(fileobj%path), "INPUT/sfc_data.nc",               .true.) .or. &
           string_compare(trim(fileobj%path), "INPUT/C3463_grid.tile7.nc",       .true.) .or. &
           string_compare(trim(fileobj%path), "INPUT/C3463_grid.tile7.halo3.nc", .true.) .or. &
           string_compare(trim(fileobj%path), "INPUT/fv_core.res.tile1.nc",      .true.)) then
        err = nf90_open(trim(fileobj%path), ior(NF90_NOWRITE, NF90_MPIIO), fileobj%ncid, &
                        comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
      else
        err = nf90_open(trim(fileobj%path), nf90_nowrite, fileobj%ncid, chunksize=fms2_ncchksz)
      endif

The other code change is to call nf90_var_par_access() with the nf90_collective option for all variables in specified files before calling nf90_get_var(). This is done in fms2_io/include/netcdf_read_data.inc for r4 and r8 2D and 3D variables.

e.g.

        if (string_compare(trim(fileobj%path), "INPUT/phy_data.nc", .true.) .or. &
            string_compare(trim(fileobj%path), "INPUT/sfc_data.nc", .true.)) then
          err = nf90_var_par_access(fileobj%ncid, varid, nf90_collective)
        endif

The proposed code changes do result in a "double free or corruption" at the end of the run. I have a suspicion that this is coming from my direct use of MPI_COMM_WORLD and/or MPI_INFO_NULL. I need some help here.

I'm also not sure we want to hard-code the file names in the FMS code. I need some suggestions for a solution here.
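
One way to avoid hard-coding the names might be to read the list of parallel-read files from a namelist. A rough sketch only; the module, namelist group, and routine names below are hypothetical, not existing FMS API:

    ! Sketch: read the list of files that should use collective MPI-IO
    ! from a namelist instead of hard-coding them in the source.
    module parallel_read_files_mod
      implicit none
      private
      public :: parallel_read_files_init, file_wants_parallel_read

      integer, parameter :: max_files = 32
      character(len=256) :: parallel_read_files(max_files) = ""
      namelist /fms2_io_parallel_nml/ parallel_read_files

    contains

      subroutine parallel_read_files_init(unit)
        integer, intent(in) :: unit   ! open unit on input.nml
        integer :: io_status
        read(unit, nml=fms2_io_parallel_nml, iostat=io_status)
      end subroutine parallel_read_files_init

      logical function file_wants_parallel_read(path)
        character(len=*), intent(in) :: path
        integer :: i
        file_wants_parallel_read = .false.
        do i = 1, max_files
          if (len_trim(parallel_read_files(i)) > 0 .and. &
              trim(parallel_read_files(i)) == trim(path)) then
            file_wants_parallel_read = .true.
            return
          end if
        end do
      end function file_wants_parallel_read
    end module parallel_read_files_mod

The nf90_open branch could then test file_wants_parallel_read(trim(fileobj%path)) instead of a chain of string_compare calls against literal paths.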

I have started a branch with the above changes.
https://github.com/dkokron/FMS/tree/fms.ParallelStartup

@dkokron changed the title from "Poor RRFS model startup." to "Improve RRFS model startup time." on Aug 8, 2023
pj-gdit commented Sep 26, 2023

Posting IO performance profiles for three different methods of reading RRFS restart files on the WCOSS2 Lustre file system. These profiles are from a quiet disk-based Lustre file system (HPE ClusterStor E1000) when running 30 ensemble members, 14 nodes per member.

  1. default FMS method where all MPI ranks read the input files
  2. FMS 2023.01.01 update where single MPI rank reads input files and uses MPP_Scatter to distribute data
  3. Parallel NetCDF4 prototype from Dan Kokron (fms.ParallelStartup branch)

The goal is to reduce pressure on the file system while achieving the shortest model initialization times when running many ensemble members concurrently.

RRFS_read1.pptx

thomas-robinson (Member) commented

@dkokron make sure you pull the updates to main. There are a lot of changes coming with the next release and I wouldn't want your code to get left behind and have large merge conflicts.

bensonr (Contributor) commented Sep 27, 2023

@pj-gdit - I believe you had some standalone tests that were used to generate this data; is it possible you could share that test code with us? I am interested in prototyping some solutions that could allow us to properly augment the existing IO layer.

dkokron (Contributor, Author) commented Oct 2, 2023 via email

dkokron (Contributor, Author) commented Oct 2, 2023 via email

dkokron (Contributor, Author) commented Oct 2, 2023

> @pj-gdit - I believe you had some standalone tests that were used to generate this data; is it possible you could share that test code with us? I am interested in prototyping some solutions that could allow us to properly augment the existing IO layer.

See attached
ForRustyBenson.tgz

You'll only need to re-stripe the fv_core.res.tile1.nc file for this unit tester. I mistakenly included instructions in the README for re-striping the other files too.

JacobCarley-NOAA commented
@dkokron @pj-gdit thanks for opening this issue.

@bensonr and @thomas-robinson Is it possible for this item to get into the repository in some accelerated/expedited fashion? This is critical for the eventual operational implementations of RRFS and 3DRTMA. Thanks!

JacobCarley-NOAA commented
I'm tagging @junwang-noaa, as she has expressed an interest in this topic as well.

bensonr (Contributor) commented Oct 18, 2023

@JacobCarley-NOAA - As there have been conversations going on concurrently in email and now here, I'm copying the reply I sent to others earlier this week.

We are starting to prototype an IO offload system, something we've been talking about for years now. Adding the proposed NetCDF4 updates is something we could look to incorporate as part of that work, but I don't expect anything to be available within the next six months.

I know you most likely need this sooner, and since you have knowledge of parallel NetCDF4 and have delved into FMS, I'd encourage you to add this as an option to the fms2_io subsystem and submit a PR. The FMS infrastructure library is an important part of our modeling system, and any contributions would need to adhere to our guidelines and pass our tests. If this is something you are willing to do, I'd suggest putting together a short project plan; we (GFDL) can look it over and, if needed, have a meeting to discuss.

uramirez8707 (Contributor) commented
Fixed in #1477
