Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Update hashes of GSI and rrfs_utils #800

Merged

Conversation

BenjaminBlake-NOAA
Copy link
Collaborator

DESCRIPTION OF CHANGES:

This PR updates the hashes of the GSI and rrfs_utils in Externals.cfg to the versions currently used by the real-time RRFS runs. The GSI version is #4afe6ed (committed on March 23) and the rrfs_utils version is #4ec1a33 (committed on April 14). To compile the newer version of GSI, ncdiag and ncio must be loaded to build the code. I have made the changes to get it working on WCOSS2 and Hera, but will need assistance on the other supported platforms.

Type of change

  • [X ] Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • [X ] hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • [X ] wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

None.

DOCUMENTATION:

No updates needed for this PR.

ISSUE:

Fixes issue #548

CHECKLIST

  • [X ] My code follows the style guidelines in the Contributor's Guide
  • [X ] I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • [X ] My changes do not require updates to the documentation (explain).
  • [X ] My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • [X ] Work In Progress
  • bug
  • [X ] enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • [X ] help wanted

CONTRIBUTORS (optional):

@christinaholtNOAA has volunteered to test on Jet

@MichaelLueken
Copy link
Collaborator

@BenjaminBlake-NOAA @christinaholtNOAA I'm willing to assist with testing to make sure that the updated hashes of the GSI and RRFS_UTILS build on the various machines, but should we move forward with updating the GSI at this point? With the Intel 2022 run issues still not addressed (will be addressed with PR #571) and issues with HDF5 versions other than 1.10.6, should we wait until these issues have been resolved, or move forward. I'm fine with updating the hashes now, but they will likely need to be updated again before the SRW v3.0.0 release.

@christinaholtNOAA
Copy link
Collaborator

The updated hash earlier rather than later will allow us to identify any other potential issues that may arise in the WE2E testing process as we add those workflow capabilities.

@BenjaminBlake-NOAA
Copy link
Collaborator Author

@MichaelLueken I agree with @christinaholtNOAA that it would be beneficial to update the GSI sooner if possible - thanks.

@christinaholtNOAA
Copy link
Collaborator

I am working on building the branch on Jet.

@MichaelLueken
Copy link
Collaborator

@BenjaminBlake-NOAA

Attempted to build the updated GSI and rrfs_utils on Cheyenne. It ultimately failed because ncdiag/1.0.0 isn't available in the hpc-stack on the machine for either Intel or GNU compilers. I have notified @natalie-perlin about the missing ncdiag/1.0.0 on the machine and to please add this at her earliest convenience.

The GSI and rrfs_utils built without issue on Orion and Gaea (using the updated modulefile from PR #799).

@BenjaminBlake-NOAA
Copy link
Collaborator Author

@MichaelLueken Sounds good! Once ncdiag is added to Cheyenne, it should work.

@MichaelLueken
Copy link
Collaborator

@BenjaminBlake-NOAA I fear I spoke too soon on Gaea. It looks like it is failing during the linking step to build the GSI executable due to undefined references to mkl.

I'll also reach out to @natalie-perlin to see if this is something that she can correct as well.

@MichaelLueken
Copy link
Collaborator

@mark-a-potts - We can't move to spack-stack until the ufs-weather-model has done so. In this case, we would need add openblas to hpc-stack. Would this need to be for both Hera GNU and Cheyenne GNU, or just Hera GNU?

@mark-a-potts
Copy link
Collaborator

Anyone know if there are regression tests for rrfs_utils or will we just need to test with some SRW E2E tests?

@christinaholtNOAA
Copy link
Collaborator

There is no testing framework for rrfs_utils. I think it's typically tested in the RRFS_dev1 retros as many of the other executable changes are.

@mark-a-potts
Copy link
Collaborator

mark-a-potts commented May 24, 2023

I think that openblas needs to be added to the modulefile for build_hera_gnu.lua. With that addition, the devbuild.sh script works with gsi on Hera.

@mark-a-potts
Copy link
Collaborator

The rrfs_utils PR is ready for review, but I am unsure how to go about testing it.

@christinaholtNOAA
Copy link
Collaborator

The RRFS Utils PR was just merged. The hash is now 5681d1c. @BenjaminBlake-NOAA would you be willing to bump to that version in this PR?

@BenjaminBlake-NOAA
Copy link
Collaborator Author

@christinaholtNOAA Yes I can do that. I will test the build on WCOSS2 and Hera then push the change to my feature branch.

@MichaelLueken
Copy link
Collaborator

@mark-a-potts @BenjaminBlake-NOAA On Cheyenne GNU, rrfs_utils is still failing with the following:

[ 21%] Building Fortran object adjust_soiltq/CMakeFiles/adjust_soiltq.exe.dir/module_bkio_fv3lam_parall.f90.o
/glade/scratch/mlueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:400:16:

  400 |            stop(333)
      |                1
Error: Blank required in STOP statement near (1)
/glade/scratch/mlueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:453:16:

  453 |            stop(333)
      |                1
Error: Blank required in STOP statement near (1)
/glade/scratch/mlueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:493:16:

  493 |            stop(333)
      |                1
Error: Blank required in STOP statement near (1)
/glade/scratch/mlueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:527:16:

  527 |            stop(333)
      |                1
Error: Blank required in STOP statement near (1)

The same issue is encountered on Hera GNU:

[ 73%] Building Fortran object adjust_soiltq/CMakeFiles/adjust_soiltq.exe.dir/module_bkio_fv3lam_parall.f90.o
/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:400:15:

  400 |            stop(333)
      |               1
Error: Blank required in STOP statement near (1)
/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:453:15:

  453 |            stop(333)
      |               1
Error: Blank required in STOP statement near (1)
/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:493:15:

  493 |            stop(333)
      |               1
Error: Blank required in STOP statement near (1)
/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:527:15:

  527 |            stop(333)
      |               1
Error: Blank required in STOP statement near (1)

@mark-a-potts
Copy link
Collaborator

@mark-a-potts @BenjaminBlake-NOAA On Cheyenne GNU, rrfs_utils is still failing with the following:

[ 21%] Building Fortran object adjust_soiltq/CMakeFiles/adjust_soiltq.exe.dir/module_bkio_fv3lam_parall.f90.o
/glade/scratch/mlueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:400:16:

  400 |            stop(333)
      |                1
Error: Blank required in STOP statement near (1)
/glade/scratch/mlueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:453:16:

  453 |            stop(333)
      |                1
Error: Blank required in STOP statement near (1)
/glade/scratch/mlueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:493:16:

  493 |            stop(333)
      |                1
Error: Blank required in STOP statement near (1)
/glade/scratch/mlueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:527:16:

  527 |            stop(333)
      |                1
Error: Blank required in STOP statement near (1)

The same issue is encountered on Hera GNU:

[ 73%] Building Fortran object adjust_soiltq/CMakeFiles/adjust_soiltq.exe.dir/module_bkio_fv3lam_parall.f90.o
/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:400:15:

  400 |            stop(333)
      |               1
Error: Blank required in STOP statement near (1)
/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:453:15:

  453 |            stop(333)
      |               1
Error: Blank required in STOP statement near (1)
/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:493:15:

  493 |            stop(333)
      |               1
Error: Blank required in STOP statement near (1)
/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/sorc/rrfs_utl/adjust_soiltq/module_bkio_fv3lam_parall.f90:527:15:

  527 |            stop(333)
      |               1
Error: Blank required in STOP statement near (1)

So, it looks like those commands (actually that whole file) got added to rrfs_utils after I put in my PR, so they didn't get caught as an issue with gnu. The solution is to change them from "stop(333)" to "stop 333" in adjust_soiltq/module_bkio_fv3lam_parall.f90. I have verified that this compiles with both Intel and gnu compilers on Hera. I can put in another PR to rrfs_utils to address this issue.

@MichaelLueken
Copy link
Collaborator

So, it looks like those commands (actually that whole file) got added to rrfs_utils after I put in my PR, so they didn't get caught as an issue with gnu. The solution is to change them from "stop(333)" to "stop 333" in adjust_soiltq/module_bkio_fv3lam_parall.f90. I have verified that this compiles with both Intel and gnu compilers on Hera. I can put in another PR to rrfs_utils to address this issue.

@mark-a-potts If you have verified that it builds successfully on GNU compilers, please go ahead and create the PR to update the new file. Thanks!

@mark-a-potts
Copy link
Collaborator

The new PR is here-- NOAA-GSL/rrfs_utl#31

@mark-a-potts
Copy link
Collaborator

The rrfs_utils PR has been merged, so that should now compile with GNU. I believe that an openblas module will need to be added to the GNU modulefiles (on Hera, at least) in order to get the GSI to compile, however.

@MichaelLueken
Copy link
Collaborator

@mark-a-potts @BenjaminBlake-NOAA The new rrfs_utls hash - 6cacff5, was successfully built on both Hera and Cheyenne. The GSI also built without issue on both systems.

@mark-a-potts Would we see issues related with openblas at compile or runtime? I'm not seeing any issues with the compile of the GSI on either Hera or Cheyenne. Now, I'm noting failures of the process_obs WE2E test on Hera, with the following messages:

/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/exec/process_larccld.exe: error while loading shared libraries: libmkl_gf_lp64.so: cannot open shared object file: No such file or directory
/scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/exec/process_NSSL_mosaic.exe: error while loading shared libraries: libmkl_gf_lp64.so: cannot open shared object file: No such file or directory

Please note that these messages are seen with and without openblas loaded during code compilation.

The test can be ran using:

./run_WE2E_tests.py -t=process_obs -m=hera -c=gnu -a=epic

in the tests/WE2E directory.

@christopherwharrop-noaa
Copy link
Collaborator

@MichaelLueken - So, the Intel MKL libraries are linked when building process_larccld.exe and process_NSSL_mosaic.exe with GNU? It appears that MKL is used instead of OpenBLAS and that's why loading it doesn't matter.

@mark-a-potts
Copy link
Collaborator

@MichaelLueken, when I had built the SRW with "./devbuild.sh --platform=hera --compiler=gnu all" earlier on Hera, I had compile errors in GSI due to missing Lapack/Blas libraries. I'll test that again, though.

@MichaelLueken
Copy link
Collaborator

@mark-a-potts I compiled without issue using:

./devbuild.sh -p=hera -c=gnu gsi rrfs_utils

I'll test with ./devbuild.sh -p=hera -c=gnu all

@MichaelLueken
Copy link
Collaborator

@christopherwharrop-noaa I don't know about the rrfs_utl repository, but I know that the GSI is set up to always load the MKL libraries. If the MKL libraries aren't found, then the GSI will load LAPACK in it's place. It looks like the GSI build will need to change to allow OpenBLAS for GNU and MKL for Intel.

@natalie-perlin
Copy link
Collaborator

natalie-perlin commented Jun 5, 2023

@MichaelLueken @mark-a-potts @christopherwharrop-noaa -
Please note that GSI modulefiles for GNU need to have the new/EPIC stacks loaded and the line setting MKLROOT variable commented out. Openblas is to be loaded, and is successfully found for building LAPACK, instead of the MKL libraries for LAPACK.


-- pushenv("MKLROOT", "/apps/oneapi/mkl/2022.0.2")
local openblas_ver=os.getenv("openblas_ver") or "0.3.23"
load(pathJoin("openblas", openblas_ver))

Working on regression tests on Hera/GNU (some pass) at the moment, and on other modulefiles and test builds for the PR:
NOAA-EMC/GSI#571 , and also GNU modulefiles to resolve NOAA-EMC/GSI#582

@MichaelLueken
Copy link
Collaborator

@natalie-perlin The SRW App doesn't use the component modulefiles to build the separate components. All components are built using modulefiles/build_hera_gnu.lua in the SRW App itself. The suggested change to the GSI modulefile doesn't affect the SRW App's build.

@natalie-perlin
Copy link
Collaborator

@MichaelLueken -
I was just mentioning that the openblas may needs to be loaded for the GSI, or MKL library set via "MKLROOT" variable for intel compilers.

@@ -27,5 +27,6 @@ load("sigio/2.3.2")
load("w3nco/2.4.1")
load("wrf_io/1.2.0")

load("ncdiag/1.0.0")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recent updates to GSI require ncdiag/1.1.0 (added to all hpc-stacks on Hera gnu/intel, Jet, Orion, Cheyenne gnu/intel, Gaea
on June 2, 2023)

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenjaminBlake-NOAA Following the latest modifications to this PR, the GSI and rrfs_utils now compile using both Intel and GNU compilers on Hera and Cheyenne and the process_obs WE2E test runs using both Intel and GNU executables on Hera.

Thank you very much @mark-a-potts and @natalie-perlin creating the PRs in the rrfs_utl repo to correct the GNU build issues and adding OpenBLAS to the HPC-Stack!

Approving this work now.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Jun 7, 2023
@MichaelLueken
Copy link
Collaborator

@BenjaminBlake-NOAA The Jenkins tests failed on:

Cheyenne GNU - the nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km test failed, but this is a known issue (#806) and will not hold thePR back.

The Jenkins tests can't run on Gaea because the SRW App can't be built on the machine at this time.

Hera Intel - the grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta, grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2, and grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 tests failed. The grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta failure is a known issue (#805) and won't hold the PR back. The grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 and grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 tests were rerun on Hera Intel and successfully passed.

Orion - Manually ran the Jenkins tests and all tests successfully passed.

Merging this PR now.

@MichaelLueken MichaelLueken merged commit cc92e50 into ufs-community:develop Jun 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests Work in Progress
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Update GSI hash
6 participants