More than 9999 MPI tasks with WRF #697

Closed
ghost opened this issue Nov 14, 2018 · 4 comments

ghost commented Nov 14, 2018

Currently, WRF permits fewer than 10000 MPI tasks. As the grid size increases, more than 10000 cores could be necessary. Therefore, the value of RSL_MAXPROC in external/RSL_LITE/rsl_lite.h needs to be increased, e.g. to 100000.
Also, the sprintf statements in external/RSL_LITE/c_code.c need to be changed to permit "06d" integers.

@davegill
Contributor

@thomasedds
Thomas,
This is a good idea.

How would you like to generate the pull request for this, and walk the whole procedure through to completion?

  1. The mods should be set up so that the current rsl.{out,error}.xxxx format is maintained for < 10000 MPI ranks, so that we do not cause trouble for the large number of users who may have scripts that assume a certain format.

  2. Do a test on a large enough domain to effectively try just over 10k MPI ranks vs just under 10k MPI ranks.

On our local machine 280 nodes at 36 MPI ranks/node yields 10080 cores. This gives a default WRF decomposition of cores into 105x96. A domain that has the y-dimension >= 1100 and the x-dimension >= 1000 would work.

For the test of less than 10k cores, 275 nodes at 36 MPI ranks/node yields 9900 cores. This gives a default WRF decomposition of cores into 100x99.

  3. We would want to see bit-for-bit identical results on these two decompositions, though the forecast could be only 20 (or so) time steps.

  4. We would want a build that had all of the debugging options activated to do bounds checking.

  5. There are options in WRF to print output only from the master node, and options to suppress all printed output. These would need to be tested to verify that those capabilities were not impacted.
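
The core counts and decompositions quoted for the two test cases can be checked with a short sketch. It assumes the default decomposition is the factor pair closest to a square; the helper name `closest_factor_pair` is hypothetical and this is not WRF's own routine. It also does not say which factor is assigned to x and which to y.

```python
import math


def closest_factor_pair(ntasks):
    """Return (larger, smaller) factors of ntasks that are closest to square.

    Hypothetical helper: an approximation of a "most square" decomposition,
    not the routine WRF itself uses.
    """
    # Walk down from floor(sqrt(ntasks)) until a divisor is found.
    for small in range(math.isqrt(ntasks), 0, -1):
        if ntasks % small == 0:
            return ntasks // small, small


RANKS_PER_NODE = 36  # value quoted in the comment above

for nodes in (280, 275):
    ntasks = nodes * RANKS_PER_NODE
    print(nodes, "nodes ->", ntasks, "ranks ->", closest_factor_pair(ntasks))
```

Under that assumption the sketch reproduces the numbers given above: 10080 ranks as 105x96 and 9900 ranks as 100x99.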

Take a look at our pull requests to get a "feel" for what we would like to see. There is a template to follow that GitHub will suggest; the same template is in the WRF/tools/commit_form.txt file.

Take as your starting point (the base of your pull request) the top of the develop branch.

The pull request would be from your fork / your branch to the wrf-model fork / develop branch.

cponder commented Apr 19, 2021

If I just try running WRF 3.8.1 with large process counts, I see the output-files being named

rsl.out.0000
.....
rsl.out.9999
rsl.error.10000
....

So for <10,000 it pads out to 4 digits, and it just uses as many digits as it needs after that, which I think satisfies requirement #1.
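
The behaviour described here is what a printf-style minimum field width produces. The sketch below assumes the file names are built with a "%04d"-style conversion (an assumption, not confirmed from the WRF source) and uses Python's `%` operator, which follows the same width rule as C's sprintf. The function name `rsl_name` is hypothetical.

```python
def rsl_name(stream, rank):
    """Build an rsl file name with a minimum of four zero-padded digits.

    Hypothetical helper; the "%04d" conversion is an assumption about
    how the names are formatted.
    """
    # "%04d" sets a minimum width of 4; larger values are not truncated.
    return "rsl.%s.%04d" % (stream, rank)


for rank in (0, 9999, 10000):
    print(rsl_name("out", rank), rsl_name("error", rank))
```

A minimum width only pads and never truncates, so names for ranks below 10000 keep their four-digit form and higher ranks simply gain a digit.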

@davegill
Contributor

@cponder
Carl,

We added PR #1055, based on these suggestions. To see the files that were changed:

git diff --name-only 115c94684081c69..bcf8223aeebb57

There were a number of magic "10000" values floating around (and the associated number of allowable digits) that needed to be changed. Some compilers permitted exceeding these bounds, but that was not robust.

@davegill
Contributor

We added PR #1055, based on these suggestions.
