RRTMGP radiation physics fails when using more than 1 OpenMP thread #59

Closed
pj-gdit opened this issue Apr 3, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@pj-gdit

pj-gdit commented Apr 3, 2023

Description

We are running UFS on WCOSS2 HPE/Cray systems to evaluate the RRTMGP radiation scheme and find that it fails in the mo_gas_concentrat routine in the longwave code if more than 1 OMP thread is used. The shortwave code (and the other RRTMGP routines executed before the longwave code) runs OK.

Steps to Reproduce

  1. Build UFS with OpenMP enabled
  2. Run with OMP_NUM_THREADS=1: the run completes successfully
  3. Run with OMP_NUM_THREADS=2: the code segfaults in the mo_gas_concentrat routine
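
The steps above can be scripted so that the same command is launched once per thread count. The following is a minimal sketch, not the workflow used in this report: the helper name `run_with_threads` is hypothetical, and the actual launch command (MPI launcher, job script, path to fv3.exe) is not given here and has to be supplied by the caller.

```python
import os
import subprocess
import sys


def run_with_threads(cmd, num_threads):
    """Run cmd with OMP_NUM_THREADS set to num_threads and return its exit code.

    cmd is a list such as [launcher, ..., executable]; its contents are an
    assumption of this sketch and are not taken from the issue.
    """
    env = dict(os.environ)                     # copy, so the parent environment is untouched
    env["OMP_NUM_THREADS"] = str(num_threads)  # OpenMP runtimes read this at startup
    completed = subprocess.run(cmd, env=env)
    return completed.returncode


if __name__ == "__main__":
    # Usage: python run_threads.py <launch command ...>
    command = sys.argv[1:]
    if command:
        for n in (1, 2):
            print(f"OMP_NUM_THREADS={n}: exit code {run_with_threads(command, n)}")
```

A nonzero exit code for the 2-thread run and a zero exit code for the 1-thread run would match the behaviour described in steps 2 and 3.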

Additional Context

  • WCOSS2 HPE/Cray EX
  • Intel 19.1.3.304
  • RRTMGP (RRTM radiation scheme is OK)

Output

Traceback and errors from UFS

68: fv3.exe 00000000043CEE02 Unknown Unknown Unknown
68: fv3.exe 000000000393039E mo_gas_concentrat 288 mo_gas_concentrations.F90
68: fv3.exe 00000000038066A7 rrtmgp_lw_main_mp 300 rrtmgp_lw_main.F90
68: fv3.exe 0000000003429255 ccpp_fv3_gfs_v17_ 518 ccpp_FV3_GFS_v17_p8_rrtmgp_radiation_cap.F90
68: fv3.exe 00000000030E13B6 ccpp_static_api_m 943 ccpp_static_api.F90
68: fv3.exe 00000000030DDDA2 ccpp_driver_mp_cc 188 CCPP_driver.F90

78: [h24c21:2818 :0:2884] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
56: [h24c12:18836:0:18888] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
41: [h24c12:18821:0:18894] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
68: forrtl: severe (153): allocatable array or pointer is not allocated

pj-gdit added the bug label Apr 3, 2023
@dustinswales
Collaborator

@pj-gdit Thanks for bringing this to our attention!
I have a fix in place and will open a PR soon; I'm just waiting to combine it with some other RRTMGP-related cleanup.

@pj-gdit
Author

pj-gdit commented May 16, 2023

Thanks Dustin.

I downloaded the modified source, recompiled, and ran some 1- and 2-thread OMP tests with our c96 UFS case. Threaded runs now complete successfully, but I noticed that the answers change on every run when the number of threads is greater than 1. The old RRTMG code does not exhibit this: its answers are bit-reproducible from run to run and across different numbers of threads. There must be some race condition within RRTMGP. I will explore this some more. Thanks again.
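
One simple way to check run-to-run bit reproducibility is to compare the output files of two runs byte for byte. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the files to compare are whatever the model writes (none are named in this thread). If the output format embeds timestamps or other per-run metadata, a byte-level comparison will report a difference even when the fields themselves match, in which case the data arrays need to be compared instead.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def bit_identical(path_a, path_b):
    """True if the two files have identical contents (compared via SHA-256)."""
    return file_digest(path_a) == file_digest(path_b)
```

Calling `bit_identical` on the same output file from two 2-thread runs, and again on a 1-thread versus a 2-thread run, separates run-to-run variation from differences across thread counts.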
