Running C768/C1152 on GaeaC6 #2607

Open
JessicaMeixner-NOAA opened this issue Feb 20, 2025 · 5 comments

Comments

@JessicaMeixner-NOAA
Collaborator

Has anyone successfully run an S2SW or S2SWA C768 or C1152 case on GaeaC6? If so, do you have recommendations for resources and environment variables?

I'm trying to run from the global-workflow (see issue: NOAA-EMC/global-workflow#3324) and have not yet succeeded. I'm hoping it's just a matter of setting resources and/or environment variables properly.

@JessicaMeixner-NOAA
Collaborator Author

I have confirmed that I'm not using more than 21k PETs; going over that might require a newer ESMF version, based on: #2486 (comment)

If anyone has environment variables, module files, or layouts that they've used for successful C768 or C1152 runs on Gaea C6 and can share them, I will try them to see if I can replicate others' success.

@JessicaMeixner-NOAA
Collaborator Author

Possibly related to: #2540

@JessicaMeixner-NOAA
Collaborator Author

Error from a C768 S2S case without ESMF-managed threading.
The full log file is here: /gpfs/f6/ira-sti/scratch/Jessica.Meixner/testwcda02/c768s2st01/COMROOT/c768s2st01/logs/2019120300/gfs_fcst_seg0.log

1632:  MOM_inc domain decomposition
1632: whalo =    2, ehalo =    2, shalo =    2, nhalo =    2
1632:   X-AXIS =  144 144 144 144 144
1632:   Y-AXIS =   50  49  49  49  49  48  49  49  49  49  50
srun: error: c6n0680: task 1968: Exited with exit code 255
srun: Terminating StepId=207868435.0
   0: slurmstepd: error: *** STEP 207868435.0 ON c6n0530 CANCELLED AT 2025-02-24T10:58:19 ***
srun: error: c6n0677: tasks 1852,1856,1860,1864,1868: Exited with exit code 255
1874: forrtl: error (78): process killed (SIGTERM)
1874: Image              PC                Routine            Line        Source
1874: libpthread-2.31.s  000014CE626A9910  Unknown               Unknown  Unknown
1874: libpthread-2.31.s  000014CE626A470A  pthread_cond_wait     Unknown  Unknown
1874: gfs_model.x        0000000000C99A04  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000C9AD39  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000F96C30  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000A05EEE  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000727311  Unknown               Unknown  Unknown
1874: gfs_model.x        00000000004642A0  Unknown               Unknown  Unknown
1874: gfs_model.x        000000000048C5D6  Unknown               Unknown  Unknown
1874: gfs_model.x        00000000004962FB  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000F95295  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000F9901F  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000C8227A  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000CA32DF  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000F967AA  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000A05D60  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000727311  Unknown               Unknown  Unknown
1874: gfs_model.x        0000000000434B8E  MAIN__                    392  UFS.F90
1874: gfs_model.x        00000000004339BD  Unknown               Unknown  Unknown
1874: libc-2.31.so       000014CE5F13224D  __libc_start_main     Unknown  Unknown
1874: gfs_model.x        00000000004338EA  Unknown               Unknown  Unknown

I'm starting to look at the info in #2540 to see if I can modify my setup to replicate the HR4 runs at C768 that succeeded.
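
When triaging a log like the one above, it can help to list which MPI tasks exited on which node. The following is a minimal Python sketch, assuming the `srun: error: <node>: task(s) <list>: Exited with exit code <N>` format seen in the log lines above. The function names are hypothetical, and the range handling (`0-3`) is an assumption about how Slurm may abbreviate task lists, not something shown in this log.

```python
import re
from collections import defaultdict

# Matches lines such as (taken from the log above):
#   srun: error: c6n0680: task 1968: Exited with exit code 255
#   srun: error: c6n0677: tasks 1852,1856,1860,1864,1868: Exited with exit code 255
SRUN_ERROR = re.compile(
    r"srun: error: (?P<node>[^\s:]+): tasks? (?P<tasks>[\d,\-]+): "
    r"Exited with exit code (?P<code>\d+)"
)


def expand_tasks(spec):
    """Expand a task list like '1852,1856' or '0-3,7' into a list of ints.

    Range support ('0-3') is an assumption; the log above only shows
    comma-separated task numbers.
    """
    tasks = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            tasks.extend(range(int(lo), int(hi) + 1))
        else:
            tasks.append(int(part))
    return tasks


def failed_tasks_by_node(log_text):
    """Return {node: [(task, exit_code), ...]} for every srun exit-error line."""
    failures = defaultdict(list)
    for line in log_text.splitlines():
        match = SRUN_ERROR.search(line)
        if match:
            code = int(match.group("code"))
            for task in expand_tasks(match.group("tasks")):
                failures[match.group("node")].append((task, code))
    return dict(failures)


# Two lines copied from the log above, used as sample input.
SAMPLE_LOG = """\
srun: error: c6n0680: task 1968: Exited with exit code 255
srun: error: c6n0677: tasks 1852,1856,1860,1864,1868: Exited with exit code 255
"""

if __name__ == "__main__":
    for node, entries in failed_tasks_by_node(SAMPLE_LOG).items():
        print(node, entries)
```

This only summarizes which tasks srun reported as exited; it does not identify the root cause, which still requires reading the output of the first failing task.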

@DusanJovic-NOAA
Collaborator

@JessicaMeixner-NOAA Please take a look at this run directory: /gpfs/f6/drsa-hurr1/world-shared/noscrub/Dusan.Jovic/ufs/devel/s2sw_c768. This is an S2SW C768 configuration using 144 tasks per node (out of 192). See job_card for the task configuration and environment variables. So far the job has completed ~60 h of forecast. I'll let it run until it reaches the job time limit.
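
To illustrate the arithmetic behind running 144 tasks per node out of 192, here is a minimal Python sketch. It assumes that 192 refers to cores per node and that each task uses one core; the total task count in the example is hypothetical and is not taken from the job_card.

```python
import math

CORES_PER_NODE = 192   # "out of 192" in the comment above (assumed to mean cores per node)
TASKS_PER_NODE = 144   # tasks per node used in the comment above


def nodes_needed(total_tasks, tasks_per_node=TASKS_PER_NODE):
    """Number of nodes required to place total_tasks at tasks_per_node each."""
    if total_tasks <= 0 or tasks_per_node <= 0:
        raise ValueError("total_tasks and tasks_per_node must be positive")
    return math.ceil(total_tasks / tasks_per_node)


def idle_core_fraction(tasks_per_node=TASKS_PER_NODE, cores_per_node=CORES_PER_NODE):
    """Fraction of cores per node left without a task (one core per task assumed)."""
    return 1 - tasks_per_node / cores_per_node


if __name__ == "__main__":
    total_tasks = 2000  # hypothetical task count, for illustration only
    print("nodes at 144 tasks/node:", nodes_needed(total_tasks))
    print("nodes at 192 tasks/node:", nodes_needed(total_tasks, CORES_PER_NODE))
    print("idle core fraction:", idle_core_fraction())
```

Under these assumptions, underpopulating the nodes leaves a quarter of the cores idle and raises the node count, in exchange for more memory and memory bandwidth per task.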

@JessicaMeixner-NOAA
Collaborator Author

Thanks @DusanJovic-NOAA !!!
