
Placeholder issue seeking guidance for optimizing model run time on WCOSS2 for RRFS application #1792

Open
MatthewPyle-NOAA opened this issue Jun 8, 2023 · 2 comments
Labels
question Further information is requested

Comments

MatthewPyle-NOAA (Collaborator) commented Jun 8, 2023

We are looking for guidance on optimizing run-time speed on WCOSS2 for a couple of different node count configurations for a large regional domain. One is for a small node count (~13 nodes) used in making 1 hour forecasts within the EnKF system (and writing a restart file at the 1 h time). The other is for a large node count (~110 nodes) used in making 60 hour forecasts with hourly history output (and eventually 15 minute history output for the first ~18 hours), and (eventually) 6 h restart writes.

Input files and a run script have been collected under /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_update/ on Cactus and Dogwood.

The job_card.sh_6hfore_fullnode script is currently set up to run a 6 h forecast using a 61 node configuration, but can easily switch to 76, 97, or 110 node configurations. It would be a great help if the initialization time (currently several minutes at large node counts) could be reduced and the integration speed improved.
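For reference, the resource request for the 61-node configuration might look roughly like the following PBS job card. This is a hedged sketch, not the contents of the actual script: the queue name, walltime, cores per node (128 is assumed for WCOSS2), and executable name are all placeholders.

```shell
#PBS -N rrfs_6hfore
#PBS -q dev                                # queue name is an assumption
#PBS -l select=61:ncpus=128:mpiprocs=128   # 61 nodes; 128 cores/node assumed
#PBS -l place=vscatter:excl                # spread chunks, exclusive nodes
#PBS -l walltime=01:00:00                  # placeholder walltime

cd "$PBS_O_WORKDIR"
mpiexec -n $(( 61 * 128 )) ./ufs_model     # executable name is a placeholder
```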

The job_card.sh_1hfore_fullnode script runs a 1 h forecast, writing out a restart file at the end. This general configuration is for the EnKF system, which will run either 30 members concurrently (each member using 12-15 nodes) or two batches of 15 members (each member running on 25-30 nodes). In either scenario, we would ideally have all of these forecasts complete within ~10 minutes.

GeorgeVandenberghe-NOAA (Collaborator) commented:

How much I/O does each member do, particularly on input? What are the file names, how large are they, and how fast do they need to be read? There is evidence the WCOSS2 filesystem itself is being crushed by this input when ensembles are started; if so, the I/O needs a redesign. I am late to this investigation, but am aware of FMS work to mitigate it.
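To answer the size-and-speed question above, one could probe each member's input files directly. The function below is a hedged sketch (not from the scripts in this issue): it reports the size and sequential read time of each file passed to it, which gives a rough per-member baseline for the aggregate load an ensemble startup places on the filesystem. Note that repeat runs may hit the page cache and read much faster than the first.

```shell
# Sketch: report size and sequential read time for candidate input files.
probe_read() {
  for f in "$@"; do
    size_mb=$(( $(stat -c %s "$f") / 1000000 ))   # file size in MB (GNU stat)
    start=$(date +%s.%N)
    dd if="$f" of=/dev/null bs=4M 2>/dev/null     # full sequential read
    end=$(date +%s.%N)
    secs=$(awk -v a="$start" -v b="$end" 'BEGIN{printf "%.2f", b - a}')
    echo "$f: ${size_mb} MB read in ${secs} s"
  done
}

# Example usage (the path is a placeholder for a real RRFS input file):
# probe_read /lfs/h2/emc/lam/noscrub/.../gfs_data.tile7.halo0.nc
```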

GeorgeVandenberghe-NOAA (Collaborator) commented:

FMS work is described in NOAA-GFDL/FMS#1322
