run_post fails for NA_3km domain #705

mkavulich · 2023-03-30T04:39:12Z

Expected behavior

The WE2E test grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta is the most expensive test in the test suite, so it is not currently run as part of the comprehensive suite. When it was last tested (prior to merging #686), it succeeded, and it was expected to succeed on the develop branch.

Current behavior

As part of PR #676, we ran all tests, including this one, and noticed that it was not succeeding. After some debugging, we realized the failures were not due to those changes: the test was also failing at the top of develop. The run_post tasks fail with a number of error messages, eventually ending with a segmentation fault.

 ient,iget =         1010         225
 ient,iget =         1010         225
 increased MXBIT for APCP                                    22
 increased MXBIT for APCP                                    22
 increased MXBIT for NCPCP                                   22
 increased MXBIT for NCPCP                                   22
srun: error: h16c14: task 44: Killed
srun: launch/slurm: _step_signal: Terminating StepId=43370940.0
slurmstepd: error: *** STEP 43370940.0 ON h16c06 CANCELLED AT 2023-03-30T00:54:56 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
upp.x              000000000137EAFB  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B9FBB37E630  Unknown               Unknown  Unknown
upp.x              000000000144C380  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
upp.x              000000000137EAFB  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B4169FBB630  Unknown               Unknown  Unknown
upp.x              0000000000E63BF2  Unknown               Unknown  Unknown
upp.x              0000000000E5E21A  Unknown               Unknown  Unknown
upp.x              00000000005FA189  grib2_module_mp_g        1010  grib2_module.f
upp.x              00000000005F47DB  grib2_module_mp_g         433  grib2_module.f
upp.x              000000000055B9A5  MAIN__                    736  WRFPOST.f
upp.x              000000000040C062  Unknown               Unknown  Unknown
libc-2.17.so       00002B416AEBA555  __libc_start_main     Unknown  Unknown
upp.x              000000000040BF69  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
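As a rough aid for triaging logs like the one above, the following sketch scans log text for the kill signatures that appear in this excerpt. The patterns are copied from the excerpt itself; the function name and the sample text are illustrative only, and a match is consistent with an out-of-memory kill but does not prove one.

```python
import re

# Patterns taken from the log excerpt above. They indicate a killed process,
# which is consistent with (but does not prove) an out-of-memory kill.
KILL_PATTERNS = [
    re.compile(r"srun: error: (?P<host>\S+): task (?P<task>\d+): Killed"),
    re.compile(r"forrtl: error \(78\): process killed \(SIGTERM\)"),
    re.compile(r"slurmstepd: error: \*\*\* STEP \S+ ON \S+ CANCELLED"),
]


def find_kill_signatures(log_text):
    """Return (line_number, line) pairs for lines matching a kill pattern."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in KILL_PATTERNS):
            hits.append((number, line.strip()))
    return hits


# A few lines copied from the excerpt above, used only as sample input.
SAMPLE = """\
 increased MXBIT for NCPCP                                   22
srun: error: h16c14: task 44: Killed
srun: launch/slurm: _step_signal: Terminating StepId=43370940.0
slurmstepd: error: *** STEP 43370940.0 ON h16c06 CANCELLED AT 2023-03-30T00:54:56 ***
forrtl: error (78): process killed (SIGTERM)
"""

if __name__ == "__main__":
    for number, line in find_kill_signatures(SAMPLE):
        print(number, line)
```

The first "Killed" line is usually the most useful one, since the SIGTERM messages that follow come from the remaining tasks being torn down when the job step is cancelled.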

Machines affected

This has only been tested on Hera. Unclear if this will occur on other platforms.

Steps To Reproduce

1. Run WE2E test grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta (this is a large test that uses ~1500 core hours on Hera).
2. Observe the failure.
3. To save core hours, refer instead to the failures in the log files on disk on Hera: /scratch2/BMC/fv3lam/kavulich/UFS/workdir/test_develop/expt_dirs/grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta/

Detailed Description of Fix

Giving the task more processors (12 nodes vs 8 nodes) results in success, so this seems likely to be an out-of-memory issue. I will address this in my next test reorg PR.
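The reasoning behind the out-of-memory guess can be sketched with simple arithmetic. This assumes the post-processing memory footprint is fixed and spread evenly across nodes, which may not hold if some arrays are gathered on a single task. The footprint figure below is made up for illustration and is not a measurement from this test.

```python
def per_node_share_gb(total_footprint_gb, nodes):
    """Memory each node must hold if the footprint is spread evenly."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return total_footprint_gb / nodes


# Hypothetical total footprint, chosen only to make the arithmetic easy.
HYPOTHETICAL_FOOTPRINT_GB = 960.0

share_8 = per_node_share_gb(HYPOTHETICAL_FOOTPRINT_GB, 8)    # 120.0 GB per node
share_12 = per_node_share_gb(HYPOTHETICAL_FOOTPRINT_GB, 12)  # 80.0 GB per node

# Fractional reduction in per-node memory when going from 8 to 12 nodes.
reduction = 1 - share_12 / share_8

if __name__ == "__main__":
    print(share_8, share_12, round(reduction, 3))
```

Under this assumption, moving from 8 to 12 nodes cuts the per-node requirement by one third regardless of the actual footprint, which would be enough to bring a job that slightly exceeds node memory back under the limit.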

mkavulich added the bug label Mar 30, 2023

mkavulich self-assigned this Mar 30, 2023

mkavulich added a commit to mkavulich/ufs-srweather-app that referenced this issue Apr 17, 2023

Increase grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta run_post tasks to use 12 nodes to resolve ufs-community#705

2f32728

mkavulich mentioned this issue Apr 17, 2023

[develop] Round 2 of overhaul to WE2E test suites (and other test improvements!) #732

Merged

MichaelLueken closed this as completed in #732 Apr 26, 2023
