Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate computational instability in the datm_cdeps_mx025_cfsr test on gaea, orion and wcoss_dell_p3 #642

Merged
merged 39 commits into from
Jul 7, 2021

Conversation

binli2337
Copy link
Contributor

@binli2337 binli2337 commented Jun 15, 2021

PR Checklist

  • Ths PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes in contained in the PR. The Issue and the dependent sub-component PR
    are specified below.

  • If new or updated input data is required by this PR, it is clearly stated in the text of the PR.

Instructions: All subsequent sections of text should be filled in as appropriate.

The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsiblity to keep the PR up-to-date with the develop branch of ufs-weather-model.

Description

Eliminate computational instability in the datm_cdeps_mx025_cfsr test on Gaea, Orion and WCOSS_dell_p3. This PR also uses the latest develop branch of CDEPS and reduces the forecast time to 12 hours for the "datm_cdeps_mx025_cfsr" and "datm_cdeps_mx025_gefs" tests, so that the two tests can be done within 30 minutes.

Issue(s) addressed

Testing

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss_cray
  • wcoss_dell_p3
  • CI: b91a740

junwang-noaa and others added 20 commits October 21, 2019 14:32
Change-Id: Ie2103eab6d91105f9cd0e9bb069c2d95bef2b6b8
Change-Id: I7d848f4cd91bea74b04d8b43ab1f45a1ace8cda2
Change-Id: I92025fbae25785f84abf7fa1a08bfbca901bffa6
Change-Id: I85510b236d2ef25e09e6f4578ac6196b3a7e2de1
Change-Id: Ie09f7f4cf645bc6661bc49d73fec21e51edfb728
Change-Id: I137a0e798eb2273be61046d081fb379dc0aa0a7a
Change-Id: I9619f724e12e36607ea84a009134e9997d9bddf0
Change-Id: Ide8e962ffc1f1304c042bb79907789442180338f
* revise rt.conf
* revise datm_cdeps_mx025_cfsr
* revise datm_cdeps_mx025_gefs
@binli2337 binli2337 changed the title Eiminate computational instabiluty in the datm_cdeps_mx025_cfsr test on gaea, orion and wcoss_dell_p3 Eliminate computational instability in the datm_cdeps_mx025_cfsr test on gaea, orion and wcoss_dell_p3 Jun 15, 2021
@binli2337 binli2337 added the Baseline Updates Current baselines will be updated. label Jun 15, 2021
@junwang-noaa
Copy link
Collaborator

@binli2337 @uturuncoglu @aerorahul I am wondering if we can have PR#611 and PR#615 in Bin's PR so that he can test the CDEPS in UFS, Also Ufuk, can you confirm the HAFS is working with PR#611 and PR#615 and the latest CDEPS code?

@aerorahul
Copy link
Contributor

@junwang-noaa
I have no issues with merging #611 and #615 into #642.
Please let me know what I need to do.

@junwang-noaa
Copy link
Collaborator

@uturuncoglu Can you confirm if HAFS is working with PR #611 and #615? Once you confirm, we will combine the #611/#615 with this PR and plan to get it committed in the next two weeks. If we did not hear from you, we have to assume you need more time to test PR #611 and #615 within HAFS, we will not put the two PRs in the commit queue planned for the next two weeks. Please let us know. Thanks

@uturuncoglu
Copy link
Collaborator

@junwang-noaa let me check. I am still trying to update HAFS and RTs for recent changes. I'll update you.

Copy link
Collaborator

@uturuncoglu uturuncoglu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@binli2337 @junwang-noaa I also run RT for HAFS standalone regional FV3 configuration and that also did not pass. So, it seems that the issue related with the FV3 not coupling and since the FV3 version is fixed (i am using same version with feature/hafs_couplehycom_cdeps branch under HAFS) and it was passing without any issue. I am suspecting a build time issue at this point. So, we need to investigate it further. Please let me know if you know anything that could lead to answer change such as change in compiler version etc.

@github-actions github-actions bot removed the run-ci label Jul 7, 2021
@BinLiu-NOAA
Copy link
Contributor

@binli2337 @junwang-noaa I also run RT for HAFS standalone regional FV3 configuration and that also did not pass. So, it seems that the issue related with the FV3 not coupling and since the FV3 version is fixed (i am using same version with feature/hafs_couplehycom_cdeps branch under HAFS) and it was passing without any issue. I am suspecting a build time issue at this point. So, we need to investigate it further. Please let me know if you know anything that could lead to answer change such as change in compiler version etc.

@uturuncoglu When you say FV3 standalone RT did not pass here, did the build fail, the run fail to run, or just the forecast produce different results? If it is the last situation (producing different results), then that might not be a big concern here, since the fetaure/hafs_couplehycom_cdeps is somewhat (~2 months) behind the develop branch.

@MinsukJi-NOAA
Copy link
Contributor

MinsukJi-NOAA commented Jul 7, 2021

@binli2337 , please replace 32BIT=Y with -D32BIT=ON in utest.bld

+regional_control | -DAPP=ATM -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn,FV3_GFS_v15_thompson_mynn_RRTMGP -D32BIT=ON

@binli2337
Copy link
Contributor Author

New baseline directories need to be created for the following 4 data atmosphere tests due to shorter forecast time and reduced ocean dynamics time step:
datm_mx025_cfsr
datm_mx025_gefs
datm_cdeps_mx025_cfsr
datm_cdeps_mx025_gefs

New baseline directories also need to be created for the following tests on wcoss_phase3:
control_wrtGauss_netcdf_parallel_debug
control_wrtGauss_netcdf_parallel
regional_quilt_netcdf_parallel
regional_control
regional_restart
regional_control_debug
regional_quilt_hafs
regional_quilt
regional_quilt_RRTMGP

After new baseline results are created on wcoss_phase3, the following 3 RT tests cannot reproduce the baseline results:
control_wrtGauss_netcdf_parallel_debug
control_wrtGauss_netcdf_parallel
regional_quilt_netcdf_parallel

@binli2337
Copy link
Contributor Author

@MinsukJi-NOAA It is done. Thanks!

@binli2337
Copy link
Contributor Author

@uturuncoglu @junwang-noaa The following tests failed only on wcoss_phase3. The issue might be related to how netcdf library is installed on different platforms:
control_wrtGauss_netcdf_parallel_debug
control_wrtGauss_netcdf_parallel
regional_quilt_netcdf_parallel

@uturuncoglu
Copy link
Collaborator

@binli2337 HAFS stand-alone regional fv3 test runs but answer changes.

@uturuncoglu
Copy link
Collaborator

@binli2337 I am not sure netcdf change could lead to answer change. I could compare the module files and try to run the case again with old modules to see what happens.

@uturuncoglu
Copy link
Collaborator

@binli2337 @junwang-noaa i think having HAFS related RTS in UFS level needs to be priority to catch these issues quickly. Since model is enclosing too fast it is hard to test it with HAFS with this way.

@binli2337
Copy link
Contributor Author

@uturuncoglu The plan is to merge the updates from this PR to the develop branch today. Please let @junwang-noaa and @DusanJovic-NOAA know if you have any questions or concerns.

@uturuncoglu
Copy link
Collaborator

@binli2337 @junwang-noaa @DusanJovic-NOAA @BinLiu-NOAA My main concern is to have answer change in HAFS side without knowing the source of it. At this point, the decision is yours. If you still plain to merge it, that is fine but we need to create baselines in HAFS side again. @BinLiu-NOAA let me know what do you think? We were plaining to sync support/HAFS before Friday after all PRs merged. Are you fine with the possible answer change that is caused with this PR?

Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor cosmetic changes.
I don't see a need to rerun the regression tests.

export RESTART_N=${FHMAX}
export DT_DYNAM_MOM6='900'
export DT_THERM_MOM6='1800'
export CPL_SLOW=${DT_THERM_MOM6}
Copy link
Collaborator

@DeniseWorthen DeniseWorthen Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you've reset the variable CPL_SLOW, you also need to reset the variable export coupling_interval_slow_sec=${CPL_SLOW} in the test to have it take effect in nems.configure.datm.IN, otherwise it the original (default=3600s) value will be used. This will impact all the mx025 tests where you changed the MOM timesteps.

I just tested this on Cheyenne in the datm_mx025_cfsr test; the nems.configure for it shows:

runSeq::
@3600
   MED med_phases_prep_ocn_avg
   MED -> OCN :remapMethod=redist
   OCN
   @900
     MED med_phases_prep_ice
     MED -> ICE :remapMethod=redist

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DeniseWorthen The line "export CPL_SLOW=${DT_THERM_MOM6}" will be removed from the following 4 tests:
datm_mx025_cfsr
datm_mx025_gefs
datm_cdeps_mx025_cfsr
datm_cdeps_mx025_gefs

In above 4 tests,
coupling_interval_slow_sec=3600 seconds
coupling_interval_fast_sec=900 seconds
DT_DYNAM_MOM6=900 seconds (ocean dynamics time step)
DT_THERM_MOM6=1800 seconds (ocean thermodynamic time step)

After the changes, the 4 tests passed RT tests on wcoss_phase3. Thanks for discovering the issues.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe that is the correct solution. If the slow coupling only occurs every hour, then MOM6 will take two internal thermodynamic timesteps before it gets "updated" forcing from the atm/ice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better solution will be used in a future PR. An issue will be submitted. Thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks. I think I mis-understood your original comment then.

add RegressionTests_wcoss_dell_p3.log
update 4 datm tests
@DusanJovic-NOAA DusanJovic-NOAA requested a review from aerorahul July 7, 2021 20:16
@DusanJovic-NOAA DusanJovic-NOAA dismissed aerorahul’s stale review July 7, 2021 20:48

Requested code changes have been made. See commit b58d781

@DusanJovic-NOAA DusanJovic-NOAA merged commit b64e648 into ufs-community:develop Jul 7, 2021
@binli2337 binli2337 deleted the gaea_test branch July 19, 2021 12:22
epic-cicd-jenkins pushed a commit that referenced this pull request Apr 17, 2023
* Update templates for new ufs weather model

* Add option for pressure tendency diagnostic

* Remove unnecessary namelist parameters

* Modify model_configure

* change nfhout to output_fh
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Baseline Updates Current baselines will be updated. Waiting for Reviews The PR is waiting for reviews from associated component PR's.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update CDEPS to the latest develop branch Test "datm_cdeps_mx025_cfsr" cannot be run on 3 platforms
9 participants