-
Notifications
You must be signed in to change notification settings - Fork 259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fully coupled test case (cpld_bmarkfrac_wave_v16) not thread-safe #367
Comments
I have so far narrowed this down to wav->atm coupling as the fully coupled test with the feedback from the wave to atmosphere turned off reproduced itself with 1 and 2 threads. A simpler test, with just wave-atm coupling: fv3_ccpp_gfdlmprad_atmwav is being used for debugging now. I'll post more updates when I have them. |
@JamesAbeles-NOAA do you have some time to help Jessica narrow this down? The wave model is not thread safe and you helped the waves team track this down for GFS v16 when it was one way coupled. This has re surfaced for two way coupled and our hunch is this is more sections of the code that have some bugs in openmp similar to what you found earlier |
I can help if @JessicaMeixner-NOAA points me to a tarfile that has the code and run directory. Hopefully, I can do the testing on Hera as I don't get much through WCOSS queues |
Is a pointer for what to git clone acceptable instead of a tar file? |
@JessicaMeixner-NOAA Sure, instructions are fine |
Clone the repo and branch:
This is the develop branch as of today (6a4c307) but has the updates to just run the test needed and added a identical test but with 2 threads. To run the test:
This runs two tests for wav-atm 2-way coupling, 1 with 2 threads another with 1 thread. You'll get two output directories in: /scratch1/NCEPDEV/stmp2/$user/FV3_RT/rt_*/ and comparing the netcdf output of the atm model they are different. The original test used nemsio output but I thought netcdf would be easier to debug with netcdf output so I switched it. For example I have output in: and you will see that for example the phyf002.tile6.nc tiles have differences. @JamesAbeles-NOAA Let me know if you'd prefer the test set up differently. I suspect the issue is in WW3/model/ftn/wmesmfmd.ftn in the call from SetExport to CalcRoughl subroutine because the passing of z0 roughness length from wave model to the atm model is what is different from the one-way coupled wav->atm case which is thread safe. |
So removing the FLD2 switch, we regain thread-safe for the 2-way coupling test, which likely means it's something going on in https://github.com/NOAA-EMC/WW3/blob/develop/model/ftn/w3fld2md.ftn (likewise a similar issue exists when using FLD1 instead of FLD2, so there's likely the same bug in w3fld1md.ftn). |
We should also look at the main programs where these modules are called. Thanks Jessica, this should help us narrow it down
…Sent from my iPhone
On Feb 1, 2021, at 10:05 AM, Jessica Meixner ***@***.***> wrote:
So removing the FLD2 switch, we regain thread-safe for the 2-way coupling test, which likely means it's something going on in https://github.com/NOAA-EMC/WW3/blob/develop/model/ftn/w3fld2md.ftn (likewise a similar issue exists when using FLD1 instead of FLD2, so there's likely the same bug in w3fld1md.ftn).
—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
I think we should comment out the openMP for the loop that has "CALL W3SRCE" to verify it is this loop that is causing the problem. Once verified we can decide how to best find the problem. Subroutines called in a threaded region often are the hardest to debug. |
OK, I did what I suggested (removed openMP on the loop) and for the two rt tests Jessica had me run, now the phy002.tile6.nc files are identical |
Thanks @JamesAbeles-NOAA that's good information to have. In terms of debugging having threadding that high up is obviously makes it difficult. In terms of performance of threading, is it best to have the threading as high up in loops/subroutines as possible, or would it be better to have them in the inner-most loops, subroutines, etc? I have not been able to get a WW3 only small scale test to consistently give differences for a smaller test case to isolate this issue, but I will keep trying. I am also trying to turn on debug flags one-by-one to turn on the debugging options and looking for uninitialized variables in w3fld(1/2)md.ftn but haven't found anything yet. |
The higher up the better since threading has to start/stop regions, so one large region is good. SO on the wW3 only small scale test, the answers replicate when the loop is threaded and the FLD2 switch is on? |
I wanted to see what the OpenMP standard said about saved variables in threaded regions. I found this, so I need to see if this will affect us |
The test I was trying would replicate itself with threads every other time you would run, so it was not consistently failing. We have saved variables in the fld1/2... that could be the issue? |
Well, there are some in w3srce, so I wonder about them. I am surprised this is not an issue with the code already, so maybe this is a red herring. I see these definitions in w3srce I'm grasping at at straws here, but the variable declarations in the OMP loop look correct |
I added this to w3srcemd.ftn to see if it helps |
It worked! |
Yay!!! Thank you @JamesAbeles-NOAA !!! |
I confirmed that @JamesAbeles-NOAA fix fixed the issue in the wave-atm coupling case. I'm now running this on orion to make sure it's working on multiple machines and going to run the fully-coupled test which is what originally was not working. I'll update the issue here when those tests are done and start the process for pushing the updates back to WW3 and to ufs-weather-model after that. |
The fully coupled test case is now thread-safe as well. If someone needs the WW3 update before it makes it way back into develop here's the commit: JessicaMeixner-NOAA/WW3@2c0d3f6 Also, since this is thread safe, assuming it's okay I'll add waves to the cpld threading test to make sure we keep it that way when I push the update for WW3 back. |
Oh the wave-atm case also worked with the fix on orion, so it's been tested on at least 2 platforms. |
Why do these two variables need to be saved?
Why only these two variables need to be declared "threadprivate"?
Moorthi
…On Wed, Feb 3, 2021 at 5:28 PM Jessica Meixner ***@***.***> wrote:
The fully coupled test case is now thread-safe as well. If someone needs
the WW3 update before it makes it way back into develop here's the commit:
***@***.***
<JessicaMeixner-NOAA/WW3@2c0d3f6>
Also, since this is thread safe, assuming it's okay I'll add waves to the
cpld threading test to make sure we keep it that way when I push the update
for WW3 back.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#367 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ALLVRYVX7HX2UGAIQBAWS2DS5HEZLANCNFSM4WBBPMJA>
.
--
Dr. Shrinivas Moorthi
Research Meteorologist
Modeling and Data Assimilation Branch
Environmental Modeling Center / National Centers for Environmental
Prediction
5830 University Research Court - (W/NP23), College Park MD 20740 USA
Tel: (301)683-3718
e-mail: [email protected]
Phone: (301) 683-3718 Fax: (301) 683-3718
|
@SMoorthi-emc I can't answer the first part to the question, but any variables declared with the save attribute that are in routines inside a threaded region are shared by default. |
Jim, I have a feeling that that "save" is not needed. In that case the
omp declaration is also not needed right?
Thanks
Moorthi
…On Wed, Feb 3, 2021 at 7:47 PM JamesAbeles-NOAA ***@***.***> wrote:
@SMoorthi-emc <https://github.com/SMoorthi-emc> I can't answer the first
part to the question, but any variables declared with the save attribute
that are in routines inside a threaded region are shared by default.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#367 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ALLVRYRNHZTDS25R2EAN743S5HVCFANCNFSM4WBBPMJA>
.
--
Dr. Shrinivas Moorthi
Research Meteorologist
Modeling and Data Assimilation Branch
Environmental Modeling Center / National Centers for Environmental
Prediction
5830 University Research Court - (W/NP23), College Park MD 20740 USA
Tel: (301)683-3718
e-mail: [email protected]
Phone: (301) 683-3718 Fax: (301) 683-3718
|
If they don't need to be saved, then they would be private and no need for the omp directive |
@SMoorthi-emc the variables are supposed to be saved to be a first guess of an iterative solve, however in looking at the code I agree this might not be needed. I made an issue on WW3 with your point: NOAA-EMC/WW3#308 and regardless of if this needs to be saved or not, we will also look in all of WW3 for other variables that should be saved as thread private (NOAA-EMC/WW3#307) to make sure other parts of the code are okay. |
Jessica,
It is a good practice to avoid "save" in a threaded area unless it is
really needed.
Moorthi
…On Thu, Feb 4, 2021 at 10:03 AM Jessica Meixner ***@***.***> wrote:
@SMoorthi-emc <https://github.com/SMoorthi-emc> the variables are
supposed to be saved to be a first guess of an iterative solve, however in
looking at the code I agree this might not be needed. I made an issue on
WW3 with your point: NOAA-EMC/WW3#308
<NOAA-EMC/WW3#308> and regardless of if this
needs to be saved or not, we will also look in all of WW3 for other
variables that should be saved as thread private (NOAA-EMC/WW3#307
<NOAA-EMC/WW3#307>) to make sure other parts of
the code are okay.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#367 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ALLVRYV4O4FKSFLV7273OG3S5KZKBANCNFSM4WBBPMJA>
.
--
Dr. Shrinivas Moorthi
Research Meteorologist
Modeling and Data Assimilation Branch
Environmental Modeling Center / National Centers for Environmental
Prediction
5830 University Research Court - (W/NP23), College Park MD 20740 USA
Tel: (301)683-3718
e-mail: [email protected]
Phone: (301) 683-3718 Fax: (301) 683-3718
|
The code updates were committed, close this issue. |
orog_files_mdl="${CRES}${DOT_OR_USCORE}oro_data.tile${TILE_RGNL}.halo${NH4}.nc" to run workflow with the UFS_UTILS GNU build.
Description
It has been reported by @SMoorthi-emc and @shansun6 that when running cpld_bmarkfrac_wave_v16 test with and without threads do not reproduce each other.
To Reproduce:
Run cpld_bmarkfrac_wave_v16 and then run cpld_bmarkfrac_wave_v16 with multiple threads
Additional context
Going to try to run with different aspects of wave coupling turned on or off (ie, no wave feedback to ocean) to narrow down area to look for what part of the code is not thread safe. Will update when have more information.
@SMoorthi-emc to turn off OMP parts of WW3, remove OMPG OMPH from WW3/model/esmf/switch
The text was updated successfully, but these errors were encountered: