run_post fails for NA_3km domain #705

Closed
mkavulich opened this issue Mar 30, 2023 · 0 comments · Fixed by #732
Labels: bug (Something isn't working)

Expected behavior

The WE2E test grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta is the most expensive test in the test suite, so it is not currently run as part of the comprehensive suite. When it was last tested (prior to merging #686), this test succeeded, and it was expected to succeed on the develop branch.

Current behavior

As part of PR #676, we ran all tests, including this one, and noticed it was not succeeding. After some debugging we realized the failures were not due to those changes: the test also fails at the top of develop. The run_post tasks fail with a number of error messages, eventually ending with a segmentation fault.

 ient,iget =         1010         225
 ient,iget =         1010         225
 increased MXBIT for APCP                                    22
 increased MXBIT for APCP                                    22
 increased MXBIT for NCPCP                                   22
 increased MXBIT for NCPCP                                   22
srun: error: h16c14: task 44: Killed
srun: launch/slurm: _step_signal: Terminating StepId=43370940.0
slurmstepd: error: *** STEP 43370940.0 ON h16c06 CANCELLED AT 2023-03-30T00:54:56 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
upp.x              000000000137EAFB  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B9FBB37E630  Unknown               Unknown  Unknown
upp.x              000000000144C380  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
upp.x              000000000137EAFB  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B4169FBB630  Unknown               Unknown  Unknown
upp.x              0000000000E63BF2  Unknown               Unknown  Unknown
upp.x              0000000000E5E21A  Unknown               Unknown  Unknown
upp.x              00000000005FA189  grib2_module_mp_g        1010  grib2_module.f
upp.x              00000000005F47DB  grib2_module_mp_g         433  grib2_module.f
upp.x              000000000055B9A5  MAIN__                    736  WRFPOST.f
upp.x              000000000040C062  Unknown               Unknown  Unknown
libc-2.17.so       00002B416AEBA555  __libc_start_main     Unknown  Unknown
upp.x              000000000040BF69  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)

Machines affected

This has only been tested on Hera. Unclear if this will occur on other platforms.

Steps To Reproduce

  1. Run WE2E test grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta (this is a large test that uses ~1500 core hours on Hera)
  2. Observe failure
  3. To save core hours, refer to the failure logs already on disk on Hera: /scratch2/BMC/fv3lam/kavulich/UFS/workdir/test_develop/expt_dirs/grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta/

Detailed Description of Fix

Giving the task more nodes (12 instead of 8) results in success, so this seems likely to be an out-of-memory issue. I will address this in my next test reorg PR.
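
For anyone triaging similar run_post failures, here is a minimal sketch of how the kill signatures in the log excerpt above could be counted automatically. The function name `summarize_kill_signatures` and the pattern names are hypothetical, and the regular expressions are derived only from the three message shapes that appear in this issue's log. A "Killed" task followed by SIGTERM on the remaining ranks is consistent with the scheduler killing a process for exceeding memory, but it does not prove it.

```python
import re

# Hypothetical patterns, based only on the log lines quoted in this issue.
KILL_PATTERNS = {
    "srun_killed": re.compile(r"^srun: error: (\S+): task (\d+): Killed"),
    "step_cancelled": re.compile(r"^slurmstepd: error: \*\*\* STEP \S+ ON \S+ CANCELLED"),
    "sigterm": re.compile(r"^forrtl: error \(78\): process killed \(SIGTERM\)"),
}


def summarize_kill_signatures(log_text):
    """Count kill-related messages and report the first task that was killed.

    Returns (counts, first_killed), where counts maps each pattern name to
    the number of matching lines and first_killed is a (host, task_id) tuple,
    or None if no "Killed" line was found.
    """
    counts = {name: 0 for name in KILL_PATTERNS}
    first_killed = None
    for line in log_text.splitlines():
        for name, pattern in KILL_PATTERNS.items():
            match = pattern.match(line)
            if match is None:
                continue
            counts[name] += 1
            if name == "srun_killed" and first_killed is None:
                first_killed = (match.group(1), int(match.group(2)))
    return counts, first_killed


if __name__ == "__main__":
    # A few lines copied from the log excerpt in this issue.
    sample = "\n".join([
        " increased MXBIT for NCPCP                                   22",
        "srun: error: h16c14: task 44: Killed",
        "slurmstepd: error: *** STEP 43370940.0 ON h16c06 CANCELLED AT 2023-03-30T00:54:56 ***",
        "forrtl: error (78): process killed (SIGTERM)",
    ])
    print(summarize_kill_signatures(sample))
```

If a single task is reported as "Killed" before the step is cancelled, that task's node is a reasonable first place to check memory usage.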

mkavulich added the bug label Mar 30, 2023
mkavulich self-assigned this Mar 30, 2023
mkavulich added a commit to mkavulich/ufs-srweather-app that referenced this issue Apr 17, 2023