Re-implement updated auto-rts for NCAR fork #101

Merged on Oct 30, 2023 (56 commits)

@mkavulich (Collaborator) commented Jul 5, 2023

This PR implements updated automated regression testing for the NCAR fork of ufs-weather-model. It is still a work in progress, but when ready should fully re-implement the regression testing for the NCAR fork on Hera and Cheyenne for Intel and GNU compilers, with some additional features that add flexibility in how and where tests are run on each machine (i.e. replacing hard-coded paths with command-line arguments).

Note: some of these improvements come courtesy of @dustinswales's initial efforts to adapt this system to the NCAR fork.

Change details

In rt.sh

  1. "Machine name" is now a required variable, removing the need for messy and impractical "detect_machine.sh" logic
  2. Machine-specific logic stanzas (variables, module loads, paths, etc.) are moved out of rt.sh and into a subdirectory, tests/machine/. The default name of the sourced file is the same as the machine name, but it can be overridden by a command-line argument.
  3. Because these settings can be controlled by any arbitrary input file now, new "cheyenne.ncar" and "hera.ncar" machine files are included for running RTs on this fork
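
The machine-file lookup described in the list above could be sketched as follows. This is a minimal illustration in Python rather than the actual shell logic in rt.sh: the helper name `resolve_machine_file` and its parameters are hypothetical, and it assumes that an override is given relative to the tests directory, as in the `envfile: machine/hera.ncar` setting of the example configuration.

```python
from pathlib import Path
from typing import Optional


def resolve_machine_file(machine: str, envfile: Optional[str] = None,
                         tests_dir: str = "tests") -> Path:
    """Return the machine file that would be sourced (hypothetical helper)."""
    if not machine:
        # The machine name is a required input; there is no auto-detection.
        raise ValueError("machine name is required")
    # Assumption: an explicit envfile is relative to the tests directory;
    # otherwise fall back to tests/machine/<machine name>.
    relative = Path(envfile) if envfile else Path("machine") / machine
    return Path(tests_dir) / relative


print(resolve_machine_file("hera").as_posix())
print(resolve_machine_file("hera", "machine/hera.ncar").as_posix())
```

With these assumptions, the first call resolves to the default file named after the machine and the second to the fork-specific `hera.ncar` file.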

In rt_auto.py and other python logic for auto-tests

  1. Command-line arguments are added for machine name, HPC account, working directory, and other helpful options
  2. Renamed some variables and clarified some logic
  3. A new, multi-function yaml-format configuration file:

New yaml config file

  1. A yaml config file (rt_auto.yaml) must be provided to give account-specific git information, so people's personal git account information is no longer hard-coded in the repository
  2. This yaml config file optionally can specify different repository and branch names to watch
  3. Additionally, this file can be used to provide command-line argument values, to avoid having to type very long arguments on the command line
  4. Specifics about this file's format and options can be found in the new README file (README.rt_auto.yaml)
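
One way the config file could supply command-line argument values is sketched below. This is not the code in rt_auto.py: the function names are hypothetical, the option names simply mirror the keys of the `args` section, and the precedence (a value typed on the command line overrides the file) is an assumption. The config is passed in as an already-parsed dictionary so the sketch does not depend on a particular YAML library.

```python
import argparse
from typing import Any, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Option names are illustrative; they mirror the keys of the "args" section.
    parser = argparse.ArgumentParser(description="auto-RT driver (sketch)")
    for name in ("machine", "account", "workdir", "new_baseline",
                 "envfile", "additional_args"):
        parser.add_argument(f"--{name}", default=None)
    return parser


def merge_args(config: Dict[str, Any],
               argv: Optional[List[str]] = None) -> Dict[str, Any]:
    """Combine the config file's "args" section with command-line values."""
    merged = dict(config.get("args") or {})
    cli = vars(build_parser().parse_args(argv))
    # Assumed precedence: command-line values override values from the file.
    merged.update({key: value for key, value in cli.items() if value is not None})
    if not merged.get("machine"):
        raise SystemExit("machine name is required (command line or config file)")
    return merged


config = {"args": {"machine": "hera", "account": "gmtb"}}
print(merge_args(config, ["--account", "other_account"]))
```

Keeping every option's default at `None` lets the merge distinguish "not given on the command line" from an explicit value.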

Example rt_auto.yaml:

args:
  machine: hera
  account: gmtb
  workdir: /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model/run/     # Location to run tests
  new_baseline: /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model/new/     # Location to store the new baseline (if necessary)
  envfile: machine/hera.ncar               # This file controls the general setting of paths and variables in rt.sh
  additional_args: -n control_p8 intel     # This flag tells rt.sh to only run the test "control_p8_intel"
git:
  config:            # This is the only mandatory section: need to set your Github email and username
    user.email: [email protected]
    user.name: mkavulich
  github:
    org: NCAR
    repo: ufs-weather-model
    base: main

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: hera
Compiler: intel
Job: RT
[RT] Repo location: /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model//run//1421862753/20230721024050/ufs-weather-model
Please make changes and add the following label back: hera-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: hera
Compiler: intel
Job: RT
[RT] Repo location: /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model//run//1421862753/20230721025333/ufs-weather-model
Please make changes and add the following label back: hera-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: hera
Compiler: intel
Job: RT
[RT] Repo location: /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model//run//1421862753/20230721044901/ufs-weather-model
Please make changes and add the following label back: hera-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: hera
Compiler: intel
Job: RT
[RT] Repo location: /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model//run//1421862753/20230721054326/ufs-weather-model
Please make changes and add the following label back: hera-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: hera
Compiler: intel
Job: RT
[RT] Repo location: /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model//run//1421862753/20230721061725/ufs-weather-model
Please make changes and add the following label back: hera-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: hera
Compiler: intel
Job: RT
[RT] Repo location: /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model//run//1421862753/20230721065158/ufs-weather-model
Please make changes and add the following label back: hera-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: hera
Compiler: gnu
Job: RT
[RT] Repo location: /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model//run//1421862753/20230721144634/ufs-weather-model
[RT] Error: Test control_flake 005 failed in run_test failed
[RT] Error: Test hrrr_control 013 failed in run_test failed
[RT] Error: Test hrrr_control_2threads 014 failed in run_test failed
[RT] Error: Test hrrr_control_decomp 015 failed in run_test failed
[RT] Error: Test rrfs_smoke_conus13km_hrrr_warm 018 failed in run_test failed
[RT] Error: Test rrfs_smoke_conus13km_hrrr_warm_2threads 019 failed in run_test failed
[RT] Error: Test rrfs_conus13km_hrrr_warm 020 failed in run_test failed
[RT] Error: Test rrfs_smoke_conus13km_radar_tten_warm 021 failed in run_test failed
[RT] Error: Test hrrr_control_debug 026 failed in run_test failed
[RT] Error: Test rrfs_smoke_conus13km_hrrr_warm_debug 034 failed in run_test failed
[RT] Error: Test rrfs_smoke_conus13km_hrrr_warm_debug_2threads 035 failed in run_test failed
[RT] Error: Test rrfs_conus13km_hrrr_warm_debug 036 failed in run_test failed
[RT] Error: Test rap_flake_debug 037 failed in run_test failed
[RT] Error: Test rap_clm_lake_debug 038 failed in run_test failed
[RT] Error: Test hrrr_control_dyn32_phy32 041 failed in run_test failed
[RT] Error: Test hrrr_control_2threads_dyn32_phy32 043 failed in run_test failed
[RT] Error: Test hrrr_control_decomp_dyn32_phy32 044 failed in run_test failed
[RT] Error: Test hrrr_control_debug_dyn32_phy32 049 failed in run_test failed
Please make changes and add the following label back: hera-gnu-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: hera
Compiler: intel
Job: RT
[RT] Repo location: /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model//run//1421862753/20230721154437/ufs-weather-model
[RT] Error: Test cpld_bmark_p8 013 failed in run_test failed
[RT] Error: Test control_flake 024 failed in run_test failed
[RT] Error: Test hrrr_control 066 failed in run_test failed
[RT] Error: Test hrrr_control_decomp 067 failed in run_test failed
[RT] Error: Test hrrr_control_2threads 068 failed in run_test failed
[RT] Error: Test rrfs_smoke_conus13km_hrrr_warm 073 failed in run_test failed
[RT] Error: Test rrfs_smoke_conus13km_hrrr_warm_2threads 074 failed in run_test failed
[RT] Error: Test rrfs_conus13km_hrrr_warm 075 failed in run_test failed
[RT] Error: Test rrfs_smoke_conus13km_radar_tten_warm 076 failed in run_test failed
[RT] Error: Test rrfs_smoke_conus13km_hrrr_warm_debug 084 failed in run_test failed
[RT] Error: Test rrfs_smoke_conus13km_hrrr_warm_debug_2threads 085 failed in run_test failed
[RT] Error: Test rrfs_conus13km_hrrr_warm_debug 086 failed in run_test failed
[RT] Error: Test hrrr_control_debug 098 failed in run_test failed
[RT] Error: Test rap_clm_lake_debug 109 failed in run_test failed
[RT] Error: Test rap_flake_debug 110 failed in run_test failed
[RT] Error: Test hrrr_control_dyn32_phy32 114 failed in run_test failed
[RT] Error: Test hrrr_control_2threads_dyn32_phy32 116 failed in run_test failed
[RT] Error: Test hrrr_control_decomp_dyn32_phy32 117 failed in run_test failed
[RT] Error: Test hrrr_control_debug_dyn32_phy32 122 failed in run_test failed
Please make changes and add the following label back: hera-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: cheyenne
Compiler: intel
Job: RT
[RT] Repo location: /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run/1421862753/20230724120028/ufs-weather-model
Please make changes and add the following label back: cheyenne-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: cheyenne
Compiler: intel
Job: RT
[RT] Repo location: /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run/1421862753/20230724180520/ufs-weather-model
Please make changes and add the following label back: cheyenne-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: cheyenne
Compiler: intel
Job: RT
[RT] Repo location: /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run/1421862753/20230724181434/ufs-weather-model
Please make changes and add the following label back: cheyenne-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: cheyenne
Compiler: intel
Job: RT
[RT] Repo location: /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run/1421862753/20230724182236/ufs-weather-model
Please make changes and add the following label back: cheyenne-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: cheyenne
Compiler: intel
Job: RT
[RT] Repo location: /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run//1421862753/20230823105608/ufs-weather-model
Please make changes and add the following label back: cheyenne-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: cheyenne
Compiler: intel
Job: RT
[RT] Repo location: /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run//1421862753/20230823115806/ufs-weather-model
Please make changes and add the following label back: cheyenne-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: cheyenne
Compiler: intel
Job: RT
[RT] Repo location: /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run//1421862753/20231010093416/ufs-weather-model
[RT] Error: Test 001 cpld_control_p8_intel FAIL
[RT] Error: Test 001 cpld_control_p8_intel FAIL
Please make changes and add the following label back: cheyenne-intel-RT

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: cheyenne
Compiler: intel
Job: RT
[RT] Repo location: /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run//1421862753/20231010101039/ufs-weather-model
[RT] Error: Test 001 cpld_control_p8_intel FAIL
[RT] Error: Test 001 cpld_control_p8_intel FAIL
[RT] Log file shows failures.
[RT] Please obtain logs from /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run//1421862753/20231010101039/ufs-weather-model

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: cheyenne
Compiler: intel
Job: RT
[RT] Repo location: /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run//1421862753/20231010212803/ufs-weather-model
[RT] Error: Test 001 cpld_control_p8_intel FAIL
[RT] Error: Test 001 cpld_control_p8_intel FAIL
[RT] Log file shows failures.
[RT] Please obtain logs from /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run//1421862753/20231010212803/ufs-weather-model

@mkavulich
Collaborator Author

Automated RT Failure Notification
Machine: cheyenne
Compiler: intel
Job: RT
[RT] Repo location: /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run//1421862753/20231010222833/ufs-weather-model
[RT] Error: Test 001 cpld_control_p8_mixedmode_intel FAIL
[RT] Error: Test 001 cpld_control_p8_mixedmode_intel FAIL
[RT] Error: Test 002 cpld_control_gfsv17_intel FAIL
[RT] Error: Test 002 cpld_control_gfsv17_intel FAIL
[RT] Error: Test 003 cpld_control_p8_intel FAIL
[RT] Error: Test 003 cpld_control_p8_intel FAIL
[RT] Error: Test 005 cpld_control_qr_p8_intel FAIL
[RT] Error: Test 005 cpld_control_qr_p8_intel FAIL
[RT] Error: Test 007 cpld_2threads_p8_intel FAIL
[RT] Error: Test 007 cpld_2threads_p8_intel FAIL
[RT] Error: Test 008 cpld_decomp_p8_intel FAIL
[RT] Error: Test 008 cpld_decomp_p8_intel FAIL
[RT] Error: Test 009 cpld_mpi_p8_intel FAIL
[RT] Error: Test 009 cpld_mpi_p8_intel FAIL
[RT] Error: Test 010 cpld_control_ciceC_p8_intel FAIL
[RT] Error: Test 010 cpld_control_ciceC_p8_intel FAIL
[RT] Error: Test 011 cpld_control_c192_p8_intel FAIL
[RT] Error: Test 011 cpld_control_c192_p8_intel FAIL
[RT] Error: Test 013 cpld_control_noaero_p8_intel FAIL
[RT] Error: Test 013 cpld_control_noaero_p8_intel FAIL
[RT] Error: Test 014 cpld_control_nowave_noaero_p8_intel FAIL
[RT] Error: Test 014 cpld_control_nowave_noaero_p8_intel FAIL
[RT] Error: Test 015 cpld_debug_p8_intel FAIL
[RT] Error: Test 015 cpld_debug_p8_intel FAIL
[RT] Error: Test 016 cpld_debug_noaero_p8_intel FAIL
[RT] Error: Test 016 cpld_debug_noaero_p8_intel FAIL
[RT] Error: Test 017 cpld_control_noaero_p8_agrid_intel FAIL
[RT] Error: Test 017 cpld_control_noaero_p8_agrid_intel FAIL
[RT] Error: Test 018 cpld_control_c48_intel FAIL
[RT] Error: Test 018 cpld_control_c48_intel FAIL
[RT] Error: Test 028 control_c48_intel FAIL Tries: 2
[RT] Error: Test 031 control_c384gdas_intel FAIL Tries: 2
[RT] Error: Test 037 control_p8_intel FAIL
[RT] Error: Test 037 control_p8_intel FAIL
[RT] Error: Test 039 control_qr_p8_intel FAIL
[RT] Error: Test 039 control_qr_p8_intel FAIL
[RT] Error: Test 041 control_decomp_p8_intel FAIL
[RT] Error: Test 041 control_decomp_p8_intel FAIL
[RT] Error: Test 042 control_2threads_p8_intel FAIL
[RT] Error: Test 042 control_2threads_p8_intel FAIL
[RT] Error: Test 043 control_p8_lndp_intel FAIL
[RT] Error: Test 043 control_p8_lndp_intel FAIL
[RT] Error: Test 044 control_p8_rrtmgp_intel FAIL
[RT] Error: Test 044 control_p8_rrtmgp_intel FAIL
[RT] Error: Test 045 control_p8_mynn_intel FAIL
[RT] Error: Test 045 control_p8_mynn_intel FAIL
[RT] Error: Test 046 merra2_thompson_intel FAIL
[RT] Error: Test 046 merra2_thompson_intel FAIL
[RT] Error: Test 057 rap_control_intel FAIL Tries: 2
[RT] Error: Test 059 rap_decomp_intel FAIL Tries: 2
[RT] Error: Test 060 rap_2threads_intel FAIL Tries: 2
[RT] Error: Test 062 rap_sfcdiff_intel FAIL Tries: 2
[RT] Error: Test 063 rap_sfcdiff_decomp_intel FAIL Tries: 2
[RT] Error: Test 065 hrrr_control_intel FAIL Tries: 2
[RT] Error: Test 066 hrrr_control_qr_intel FAIL Tries: 2
[RT] Error: Test 067 hrrr_control_decomp_intel FAIL Tries: 2
[RT] Error: Test 068 hrrr_control_2threads_intel FAIL Tries: 2
[RT] Error: Test 071 rrfs_v1beta_intel FAIL Tries: 2
[RT] Error: Test 072 rrfs_v1nssl_intel FAIL Tries: 2
[RT] Error: Test 073 rrfs_v1nssl_nohailnoccn_intel FAIL Tries: 2
[RT] Error: Test 074 rrfs_smoke_conus13km_hrrr_warm_intel FAIL Tries: 2
[RT] Error: Test 075 rrfs_smoke_conus13km_hrrr_warm_qr_intel FAIL
[RT] Error: Test 075 rrfs_smoke_conus13km_hrrr_warm_qr_intel FAIL
[RT] Error: Test 077 rrfs_conus13km_hrrr_warm_intel FAIL Tries: 2
[RT] Error: Test 085 control_p8_faster_intel FAIL
[RT] Error: Test 085 control_p8_faster_intel FAIL
[RT] Error: Test 098 control_debug_p8_intel FAIL
[RT] Error: Test 098 control_debug_p8_intel FAIL
[RT] Error: Test 111 rrfs_v1beta_debug_intel FAIL Tries: 2
[RT] Error: Test 116 rap_control_dyn32_phy32_intel FAIL Tries: 2
[RT] Error: Test 117 hrrr_control_dyn32_phy32_intel FAIL Tries: 2
[RT] Error: Test 118 hrrr_control_qr_dyn32_phy32_intel FAIL Tries: 2
[RT] Error: Test 119 rap_2threads_dyn32_phy32_intel FAIL Tries: 2
[RT] Error: Test 120 hrrr_control_2threads_dyn32_phy32_intel FAIL Tries: 2
[RT] Error: Test 121 hrrr_control_decomp_dyn32_phy32_intel FAIL Tries: 2
[RT] Error: Test 125 rrfs_smoke_conus13km_fast_phy32_intel FAIL Tries: 2
[RT] Error: Test 126 rrfs_smoke_conus13km_fast_phy32_qr_intel FAIL
[RT] Error: Test 126 rrfs_smoke_conus13km_fast_phy32_qr_intel FAIL
[RT] Error: Test 129 rap_control_dyn64_phy32_intel FAIL Tries: 2
[RT] Error: Test 138 hafs_regional_1nest_atm_intel FAIL Tries: 2
[RT] Error: Test 140 hafs_global_1nest_atm_intel FAIL Tries: 2
[RT] Error: Test 141 hafs_global_multiple_4nests_atm_intel FAIL Tries: 2
[RT] Error: Test 143 hafs_regional_storm_following_1nest_atm_intel FAIL Tries: 2
[RT] Error: Test 168 control_p8_atmlnd_sbs_intel FAIL
[RT] Error: Test 168 control_p8_atmlnd_sbs_intel FAIL
[RT] Error: Test 169 atmwav_control_noaero_p8_intel FAIL
[RT] Error: Test 169 atmwav_control_noaero_p8_intel FAIL
[RT] Error: Test 170 control_atmwav_intel FAIL Tries: 2
[RT] Error: Test 171 atmaero_control_p8_intel FAIL
[RT] Error: Test 171 atmaero_control_p8_intel FAIL
[RT] Error: Test 172 atmaero_control_p8_rad_intel FAIL
[RT] Error: Test 172 atmaero_control_p8_rad_intel FAIL
[RT] Error: Test 173 atmaero_control_p8_rad_micro_intel FAIL
[RT] Error: Test 173 atmaero_control_p8_rad_micro_intel FAIL
[RT] Error: Test 174 regional_atmaq_intel FAIL Tries: 2
[RT] Error: Test 175 regional_atmaq_faster_intel FAIL Tries: 2
[RT] Log file shows failures.
[RT] Please obtain logs from /glade/p/ral/jntp/CCPP_regression_testing/NCAR_ufs-weather-model/run//1421862753/20231010222833/ufs-weather-model
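
Since the notifications in this thread can list the same failure twice and use two different line layouts, a short script can condense them. The following is only a sketch based on the line formats visible in the comments above; the function name is hypothetical and the regular expressions are not taken from the auto-RT code.

```python
import re
from typing import Dict

# Earlier layout: "[RT] Error: Test <name> <number> failed in run_test failed"
OLD_STYLE = re.compile(r"^\[RT\] Error: Test (?P<name>\S+) (?P<num>\d{3}) failed")
# Later layout:   "[RT] Error: Test <number> <name> FAIL [Tries: N]"
NEW_STYLE = re.compile(r"^\[RT\] Error: Test (?P<num>\d{3}) (?P<name>\S+) FAIL")


def failed_tests(notification: str) -> Dict[str, str]:
    """Map test number to test name, collapsing repeated lines."""
    failures: Dict[str, str] = {}
    for line in notification.splitlines():
        line = line.strip()
        match = OLD_STYLE.match(line) or NEW_STYLE.match(line)
        if match:
            failures[match.group("num")] = match.group("name")
    return failures


sample = """\
[RT] Error: Test 001 cpld_control_p8_mixedmode_intel FAIL
[RT] Error: Test 001 cpld_control_p8_mixedmode_intel FAIL
[RT] Error: Test 028 control_c48_intel FAIL Tries: 2
[RT] Error: Test control_flake 005 failed in run_test failed
"""
for number, name in failed_tests(sample).items():
    print(number, name)
```

Keying the result on the test number means a line repeated for the same test is counted once.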

@mkavulich
Collaborator Author

@grantfirl I thought I was ready to mark this PR ready for review, but after incorporating the latest changes I'm seeing a lot of failures, even after switching to the latest baselines on both Cheyenne and Hera. Are these all potentially related to the HDF5 problems you mentioned in #104? Or was that only on Cheyenne that it should have been a problem?

@grantfirl
Collaborator

> @grantfirl I thought I was ready to mark this PR ready for review, but after incorporating the latest changes I'm seeing a lot of failures, even after switching to the latest baselines on both Cheyenne and Hera. Are these all potentially related to the HDF5 problems you mentioned in #104? Or was that only on Cheyenne that it should have been a problem?

Cheyenne should be the only one with the problem. The latest baselines aren't even there due to the problem. Hera should be fine. If you run the RTs manually with the latest NCAR main branches on Hera, do you still see problems?

@mkavulich
Collaborator Author

@grantfirl Turns out it was a false alarm: the Hera tests did all pass! I am opening this PR for review. Let me know if you have any comments.

@mkavulich mkavulich marked this pull request as ready for review October 12, 2023 15:28

DISKNM=$dprefix/
STMP=$dprefix/stmp4
PTMP=$dprefix/stmp2
RTPWD=${RTPWD:-/scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model/baselines/main-${BL_DATE}/}
Collaborator

Who has permissions in here? I'd like to be able to clear this out when necessary.

Collaborator

Same comment for all directories that we'll be writing to. Everyone tasked with running RTs should have write permissions in the directories in order to clear space.

Collaborator

@grantfirl grantfirl left a comment

This looks OK to me. I guess we can merge this and start using it to work out any bugs and then try a PR to the ufs-community fork? Also, please write some internal documentation for us (and anyone who comes after us) to use this (e.g., how to start the auto RT scripts on the machine, how to stop them, and the default paths). Also, please make sure that at least you and I have permissions everywhere the NCAR machine files point. One of the continual problems has been making sure that we have enough account storage space, and RTs can take a lot!
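
A quick check of the two concerns raised here (write permission and available space) might look like the sketch below. The function name is hypothetical, and the paths in the usage lines are the `workdir` and `new_baseline` values from the example config. Note that `shutil.disk_usage` reports free space on the whole filesystem, which is not necessarily the same as a per-account quota on an HPC system.

```python
import os
import shutil
from typing import Dict, Iterable


def check_directories(paths: Iterable[str]) -> Dict[str, dict]:
    """Report whether each directory is writable and how much space is free."""
    report = {}
    for path in paths:
        exists = os.path.isdir(path)
        report[path] = {
            "exists": exists,
            # os.access tests the permissions of the user running this script.
            "writable": exists and os.access(path, os.W_OK | os.X_OK),
            # Free space on the filesystem holding the directory, in GiB.
            "free_gib": shutil.disk_usage(path).free / 2**30 if exists else None,
        }
    return report


dirs = [
    "/scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model/run/",
    "/scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_ufs-weather-model/new/",
]
for path, status in check_directories(dirs).items():
    print(path, status)
```

Each person tasked with running RTs would need to run such a check under their own account, since the result depends on the calling user.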

@grantfirl
Collaborator

@mkavulich Do you think that we should get some other approvals before merging?

@grantfirl grantfirl merged commit 18631eb into main Oct 30, 2023

4 participants