Release/P7b: add fix for omp reproducibility issue for P7b #689

Merged
merged 3 commits into from
Jul 13, 2021

junwang-noaa
Collaborator

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes contained in the PR. The Issue and the dependent sub-component PR
    are specified below.

  • If new or updated input data is required by this PR, it is clearly stated in the text of the PR.

Instructions: All subsequent sections of text should be filled in as appropriate.

The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsibility to keep the PR up-to-date with the develop branch of ufs-weather-model.

Description

This PR fixes the run-to-run reproducibility issue in the release/P7b branch.

Issue(s) addressed

Testing

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss_cray
  • wcoss_dell_p3
  • CI

Dependencies

If testing this branch requires non-default branches in other repositories, list them. Those branches should have matching names (ideally).

Do PRs in upstream repositories need to be merged first?

@junwang-noaa
Collaborator Author

@DeniseWorthen @jiandewang would you please run a short P7b test to confirm this fixes the reproducibility issue? Thanks

@junwang-noaa junwang-noaa changed the title add fix for omp reproducibility issue for P7b Release/P7b: add fix for omp reproducibility issue for P7b Jul 13, 2021
@DeniseWorthen
Collaborator

I tested 24hr for 20131001. I compiled and ran and then re-compiled and ran a second time. The coupler history file at the end of the 24hrs is identical for the two runs.
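
A bit-for-bit check of this kind can be sketched in shell as follows. This is only an illustration: the function name `outputs_identical` and the file paths in the usage note are hypothetical and do not come from this PR. It relies on the standard `cmp -s` byte-wise comparison.

```shell
#!/usr/bin/env bash
# Sketch: compare the same output file from two independent runs byte for byte.
# The function name is hypothetical, chosen for this illustration only.

outputs_identical() {
  local first="$1" second="$2"
  # cmp -s prints nothing; exit status 0 means the files are byte-identical.
  if cmp -s "$first" "$second"; then
    echo "identical"
  else
    echo "different"
    return 1
  fi
}
```

For example, `outputs_identical run1/coupler_history.nc run2/coupler_history.nc` (hypothetical paths) would report whether the two runs reproduced each other exactly.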

@yangfanglin
Collaborator

yangfanglin commented Jul 13, 2021 via email

@junwang-noaa
Collaborator Author

Denise, thanks for doing the testing.

@junwang-noaa junwang-noaa merged commit 88e89a3 into ufs-community:release/P7b Jul 13, 2021
@jiandewang
Collaborator

> I tested 24hr for 20131001. I compiled and ran and then re-compiled and ran a second time. The coupler history file at the end of the 24hrs is identical for the two runs.

My two runs just finished; they generated identical results.

@junwang-noaa
Collaborator Author

junwang-noaa commented Jul 13, 2021 via email

@SMoorthi-emc
Contributor

SMoorthi-emc commented Jul 13, 2021 via email

@junwang-noaa junwang-noaa deleted the fix_repro branch November 24, 2021 14:52
epic-cicd-jenkins pushed a commit that referenced this pull request Apr 17, 2023
## DESCRIPTION OF CHANGES: 
Cleaning up bugs in the machine files.  The first bug prompted this PR, and the rest were found subsequently.  The bugs (and their fixes) are as follows:

1) A space is missing after the `print_info_msg` and `print_err_msg_exit` function calls in the `file_location` functions.  Inserting a space gets past this bug, but subsequent issues were found as described below.

**For machine files that call the `print_info_msg` function in `file_location` (`cheyenne.sh`, `hera.sh`, `jet.sh`, and `orion.sh`):**
Fixing this bug leads to other failures because when the "*" stanza is encountered in the `file_location` function, the `EXTRN_MDL_SYSBASEDIR_ICS|LBCS` variable gets set to the message that `file_location` returns.  Since that message contains spaces, it leads to other failures in downstream scripts (the ex-scripts).  Removing the message output (so that `EXTRN_MDL_SYSBASEDIR_ICS|LBCS` is set to a null string) fixes the failures, so this was the fix implemented.  If desired, a message for an empty value of `EXTRN_MDL_SYSBASEDIR_ICS|LBCS` can be placed in another script (where those variables are used).

**For machine files that use `print_err_msg_exit` in `file_location` (`stampede.sh` and `wcoss_dell_p3.sh`):**
These should not exit if the file location is not available since the experiment can still complete successfully.  So just removing the `print_err_msg_exit` call should work (and make the behavior of these machine files consistent with the set above).
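
The failure mode described in item 1 follows from how shell command substitution works: everything a function writes to standard output becomes the captured value. The sketch below reproduces the effect with simplified stand-ins. The names `file_location_buggy` and `file_location_fixed`, the message text, and the example path are all hypothetical and are not the actual machine-file code.

```shell
#!/usr/bin/env bash
# Hypothetical, simplified stand-ins for a machine file's file_location function.

# Buggy variant: the fallback stanza prints an informational message on stdout,
# so the caller's command substitution captures the message as the "path".
file_location_buggy() {
  case "$1" in
    "FV3GFS") echo "/hypothetical/base/dir/fv3gfs" ;;
    *)        echo "Locations for $1 are not defined on this machine" ;;
  esac
}

# Fixed variant: the fallback stanza prints nothing, so the caller gets "".
file_location_fixed() {
  case "$1" in
    "FV3GFS") echo "/hypothetical/base/dir/fv3gfs" ;;
    *)        : ;;
  esac
}

buggy_value=$(file_location_buggy "UNKNOWN_MODEL")
fixed_value=$(file_location_fixed "UNKNOWN_MODEL")

echo "buggy: '${buggy_value}'"   # a message containing spaces, not a path
echo "fixed: '${fixed_value}'"   # empty string
```

Another option, not the one taken in this commit, would be to redirect the message to standard error (`>&2`) so that it is still displayed but is not captured by the command substitution.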

2) In all the machine files, the variable `FV3GFS_FILE_FMT_ICS` should be changed to `FV3GFS_FILE_FMT_LBCS` in the definition of `EXTRN_MDL_SYSBASEDIR_LBCS`.  This was fixed in all the files.

3) In `stampede.sh`, a variable named `SYSBASEDIR_ICS` is defined.  This is a typo.  Modify to `EXTRN_MDL_SYSBASEDIR_ICS`.

## TESTS CONDUCTED: 
Ran the WE2E test `grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_GSD_SAR` on:
* Hera -- successful
* Jet -- successful except for UPP tasks
* Cheyenne -- successful except for UPP tasks

The UPP task failures are new and being experienced by other PRs as well (e.g. #689).  The original issue with machine files seems resolved.

## CONTRIBUTORS (optional): 
@JeffBeck-NOAA encountered and reported the original error.
epic-cicd-jenkins pushed a commit that referenced this pull request Apr 17, 2023
* Tweaks for running with containers on azure

* added config.sh for GST on azure

* added AWS to load_modules_run_task.sh

* working on bare metal now

* Changing to azure, aws, and singularity

* updates for singularity

* tweaks for running using singularity exec

* tweaks for running using singularity exec

* Converting to a single noaacloud type

* slight changes to config.sh for aws

* update machine file

* added missing slash to namelist

* changes for intel

* more cleanup

* cleaned up commented lines
5 participants