Improve RRFS model startup time. #1322
Comments
Posting I/O performance profiles for three different methods of reading RRFS restart files on the WCOSS2 Lustre file system. These profiles are from a quiet, disk-based Lustre file system (HPE ClusterStor E1000) when running 30 ensemble members with 14 nodes per member.
The goal is to reduce pressure on the file system while achieving the shortest model initialization times when running many ensemble members concurrently.
@dkokron make sure you pull the updates to main. There are a lot of changes coming with the next release and I wouldn't want your code to get left behind and have large merge conflicts.
@pj-gdit - I believe you had some standalone tests that were used to generate this data; is it possible you could share that test code with us? I am interested in prototyping some solutions that could allow us to properly augment the existing IO layer.
@thomas-robinson Done.
Pete used something I put together. I'll package it up and send it to you.
See attached. You'll only need to re-stripe the fv_core.res.tile1.nc file for this unit tester; I mistakenly included instructions in the README for re-striping the other files too.
@dkokron @pj-gdit thanks for opening this issue. @bensonr and @thomas-robinson, is it possible for this item to get into the repository in some accelerated/expedited fashion? This is critical for the eventual operational implementations of RRFS and 3DRTMA. Thanks!
I'm tagging @junwang-noaa, as she has expressed an interest in this topic as well.
@JacobCarley-NOAA - Since conversations have been going on concurrently in email and now here, I'm copying the reply I sent to others earlier this week.
Fixed in #1477.
The RRFS production team requested help understanding why their ensemble runs bring the Lustre file systems on Cactus and Dogwood to their knees (Ticket#2023032810000014). Pete Johnsen reports file system utilization of ~400 GB/s on a disk-based Lustre file system. Furthermore, job startup times increase with the number of ensemble members; I've measured 310-second startup times using 24 members running on Cactus:/lfs/h2.
An investigation revealed an enormous number of unexpectedly small read operations (< 2 KB) during model startup. The following files were implicated:
phy_data.nc
sfc_data.nc
fv_tracer.res.tile1.nc
fv_core.res.tile1.nc
C3463_grid.tile7.nc
Altering file striping didn't resolve the problem though it does play a role in the final solution.
Altering file chunking didn't resolve the problem though it does play a role in the final solution.
Forcing the reads onto rank zero and then broadcasting to the appropriate ranks did improve startup time, but did not increase the size of the reads (a sketch of this variant is shown below).
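For reference, a minimal standalone sketch of that read-on-rank-zero-and-broadcast variant. This is an illustration only, not the code used for the measurements above: the file name, variable name, placeholder dimensions, and the use of a full MPI_Bcast are assumptions.

```fortran
! Sketch of the rank-zero read-and-broadcast variant. Placeholder sizes
! and names only -- not the code used for the measurements above.
program read_on_root_then_bcast
  use mpi
  use netcdf
  implicit none

  integer, parameter :: nx = 96, ny = 96, nz = 64   ! placeholder dimensions
  integer :: ierr, rank, ncid, varid
  real :: buf(nx, ny, nz)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Only rank 0 touches the file system, which takes the per-rank
  ! metadata and small-read traffic off of Lustre...
  if (rank == 0) then
    call check( nf90_open("fv_core.res.tile1.nc", NF90_NOWRITE, ncid) )
    call check( nf90_inq_varid(ncid, "T", varid) )
    call check( nf90_get_var(ncid, varid, buf) )
    call check( nf90_close(ncid) )
  end if

  ! ...then the field is sent to the other ranks. Rank 0's own reads are
  ! unchanged, which is consistent with the observation above: startup
  ! time improves, but the individual reads are no larger.
  call MPI_Bcast(buf, size(buf), MPI_REAL, 0, MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)

contains

  subroutine check(status)
    integer, intent(in) :: status
    integer :: ierr_abort
    if (status /= NF90_NOERR) then
      print *, trim(nf90_strerror(status))
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr_abort)
    end if
  end subroutine check

end program read_on_root_then_bcast
```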
Steps to reproduce the behavior
I have a small code on the Cactus system that reads a single variable from the fv_core.res.tile1.nc file. I've used this unit test to evaluate potential solutions. Let me know if you want this code.
The NOAA production systems.
Currently Loaded Modules:
The proposed solution is the combination of the following changes:
The proposed solution has been tested on the same RRFS case and results in file system utilization of ~80 GB/s for only ~77 seconds when running 24 members. The FMS code modifications include specifying NF90_MPIIO in the mode argument to nf90_open() in fms2_io/netcdf_io.F90; I'm doing this only for specified files.
The other code change is to call nf90_var_par_access() with the nf90_collective option for all variables in the specified files before calling nf90_get_var(). This is done in fms2_io/include/netcdf_read_data.inc for the r4 and r8 2D and 3D variables.
e.g.
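The snippet below is a minimal standalone sketch of these two calls, not the actual FMS diff: the file name, the variable name "T", and the whole-field read are placeholders, and it mirrors the direct use of MPI_COMM_WORLD/MPI_INFO_NULL discussed below.

```fortran
! Standalone sketch of the proposed read path: open with NF90_MPIIO,
! switch the variable to collective access, then read. Not the actual
! FMS diff -- file name, variable name, and whole-field read are
! placeholders for illustration only.
program parallel_restart_read
  use mpi
  use netcdf
  implicit none

  integer :: ncid, varid, dimids(3), nx, ny, nz, ierr
  real, allocatable :: buf(:,:,:)

  call MPI_Init(ierr)

  ! Open the restart file through MPI-IO so all ranks share one
  ! parallel file handle (the NF90_MPIIO mode change).
  call check( nf90_open("fv_core.res.tile1.nc",                &
                        ior(NF90_NOWRITE, NF90_MPIIO), ncid,   &
                        comm=MPI_COMM_WORLD, info=MPI_INFO_NULL) )

  call check( nf90_inq_varid(ncid, "T", varid) )
  call check( nf90_inquire_variable(ncid, varid, dimids=dimids) )
  call check( nf90_inquire_dimension(ncid, dimids(1), len=nx) )
  call check( nf90_inquire_dimension(ncid, dimids(2), len=ny) )
  call check( nf90_inquire_dimension(ncid, dimids(3), len=nz) )
  allocate(buf(nx, ny, nz))

  ! Collective access lets the MPI-IO layer aggregate the many small
  ! per-rank requests into a few large, well-aligned reads.
  call check( nf90_var_par_access(ncid, varid, NF90_COLLECTIVE) )

  ! Every rank reads the full field here for simplicity; the FMS code
  ! would pass per-domain start/count arguments instead.
  call check( nf90_get_var(ncid, varid, buf) )

  call check( nf90_close(ncid) )
  call MPI_Finalize(ierr)

contains

  subroutine check(status)
    integer, intent(in) :: status
    integer :: ierr_abort
    if (status /= NF90_NOERR) then
      print *, trim(nf90_strerror(status))
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr_abort)
    end if
  end subroutine check

end program parallel_restart_read
```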
The proposed code changes currently result in a "double free or corruption" error at the end of the run. I suspect this is coming from my direct use of MPI_COMM_WORLD and/or MPI_INFO_NULL; I need some help here.
I'm also not sure we want to hard-code the file names in the FMS code. I need some suggestions for a solution here.
I have started a branch with the above changes.
https://github.com/dkokron/FMS/tree/fms.ParallelStartup