Enable cycling support for Gaea C6 #3323

Open
wants to merge 66 commits into
base: develop

DavidHuber-NOAA
Contributor

@DavidHuber-NOAA DavidHuber-NOAA commented Feb 13, 2025

Description

This adds support for the Gaea clusters by enabling cycled experiments on C6 and addressing a number of issues on both clusters.

This PR includes support for HPSS on C6, but it should NOT be used at this time. The ES cluster, where HPSS connections are made, has a significant issue with the F6 (C6's filesystem) mount that causes the system to run extremely slowly during filesystem-intensive operations (such as htar). There is a plan to enable this feature more broadly in the near future.

  • Fixed memory variable unsetting for Gaea C5/6 in config.resources.GAEAC{5,6}.
  • Refactored the system-level parameter detection used when determining task resources in the setup scripts, making it easier to define multiple partitions, queues, and clusters.
  • Added a DTN partition, queue, and cluster definition.
  • Added missing tasks and renamed misnamed tasks in tasks.py, and added a check that the input task is valid.
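As a rough illustration of the last three items, here is a minimal sketch of a per-task lookup of system-level parameters with a validity check on the task name. Every name and value in it (`SystemParams`, `SYSTEM_PARAMS`, `VALID_TASKS`, `DTN_TASKS`, `get_system_params`, and the partition/queue strings) is hypothetical and is not taken from the actual changes to the setup scripts or tasks.py.

```python
from dataclasses import dataclass


# Hypothetical container; the real setup scripts may organize this differently.
@dataclass(frozen=True)
class SystemParams:
    partition: str
    queue: str
    cluster: str


# Illustrative values only; actual partition/queue/cluster names may differ.
SYSTEM_PARAMS = {
    "compute": SystemParams(partition="batch", queue="normal", cluster="c6"),
    "dtn": SystemParams(partition="dtn", queue="dtn", cluster="es"),
}

VALID_TASKS = {"prep", "anal", "fcst", "arch"}  # illustrative subset of tasks
DTN_TASKS = {"arch"}  # tasks assumed to run on the data-transfer nodes


def get_system_params(task_name: str) -> SystemParams:
    """Return system-level parameters for a task, rejecting unknown names."""
    if task_name not in VALID_TASKS:
        raise KeyError(
            f"'{task_name}' is not a valid task; "
            f"expected one of {sorted(VALID_TASKS)}"
        )
    kind = "dtn" if task_name in DTN_TASKS else "compute"
    return SYSTEM_PARAMS[kind]
```

Keeping the partition, queue, and cluster together in one record per node type means adding another node type is a single new dictionary entry, and an unknown task name fails early with a message listing the accepted names.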

NOTE: Archiving files located on the F6 filesystem from the DTNs is excruciatingly slow and can bog down both C5 and C6. Using HPSS on Gaea C6 is therefore not recommended at this time, and the option is disabled by default. According to the system admins, new DTNs should be installed soon that will help alleviate this issue.

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? NO

How has this been tested?

  • C48_ATM on C6
  • C48_S2SW on C6
  • Cycle testing on C6
  • CI suite on Hera
  • CI suite on WCOSS2

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

@DavidHuber-NOAA marked this pull request as ready for review on February 19, 2025, 19:07
@DavidHuber-NOAA changed the title from "Add HPSS support and fix memory unsetting for Gaea C5/6" to "Enable GSI cycling support for Gaea C6" on Feb 28, 2025
@DavidHuber-NOAA changed the title from "Enable GSI cycling support for Gaea C6" to "Enable cycling support for Gaea C6" on Feb 28, 2025
@DavidHuber-NOAA
Contributor Author

Updated the title and description of this PR. Launching CI on Hercules.

@DavidHuber-NOAA added the CI-Hercules-Ready label (**CM use only**: PR is ready for CI testing on Hercules) and removed the CI-Hera-Failed label (**Bot use only**: CI testing on Hera for this PR has failed) on Feb 28, 2025
@emcbot added the CI-Hercules-Building label (**Bot use only**: CI testing is cloning/building on Hercules) and removed the CI-Hercules-Ready label (**CM use only**: PR is ready for CI testing on Hercules) on Feb 28, 2025
@emcbot added the CI-Hercules-Failed label (**Bot use only**: CI testing on Hercules for this PR has failed) and removed the CI-Hercules-Building label (**Bot use only**: CI testing is cloning/building on Hercules) on Feb 28, 2025
@DavidHuber-NOAA added the CI-Hercules-Ready label (**CM use only**: PR is ready for CI testing on Hercules) and removed the CI-Hercules-Failed label (**Bot use only**: CI testing on Hercules for this PR has failed) on Feb 28, 2025
@emcbot added the CI-Hercules-Building label (**Bot use only**: CI testing is cloning/building on Hercules) and removed the CI-Hercules-Ready label (**CM use only**: PR is ready for CI testing on Hercules) on Feb 28, 2025
@emcbot

emcbot commented Feb 28, 2025

Build FAILED on Hercules in Build# 3 with error logs:

/work2/noaa/global/CI/HERCULES/3323/global-workflow/sorc/logs/gefs_ww3_prepost.log
/work2/noaa/global/CI/HERCULES/3323/global-workflow/sorc/logs/gsi_monitor.log

Follow link here to view the contents of the above file(s): (link)

@emcbot added the CI-Hercules-Failed label (**Bot use only**: CI testing on Hercules for this PR has failed) and removed the CI-Hercules-Building label (**Bot use only**: CI testing is cloning/building on Hercules) on Feb 28, 2025
@DavidHuber-NOAA added the CI-Wcoss2-Ready label (**CM use only**: PR is ready for CI testing on WCOSS) and removed the CI-Hercules-Failed label (**Bot use only**: CI testing on Hercules for this PR has failed) on Feb 28, 2025
@emcbot added the CI-Wcoss2-Building label (**Bot use only**: CI testing is cloning/building on WCOSS) on Feb 28, 2025
@KateFriedman-NOAA removed the CI-Wcoss2-Ready label (**CM use only**: PR is ready for CI testing on WCOSS) on Feb 28, 2025
Member

@KateFriedman-NOAA KateFriedman-NOAA left a comment

Thanks for putting all of these C6 updates together @DavidHuber-NOAA !

Contributor

@aerorahul aerorahul left a comment


@emcbot added the CI-Wcoss2-Running (**Bot use only**: CI testing on WCOSS for this PR is in-progress) and CI-Wcoss2-Failed (**Bot use only**: CI testing on WCOSS for this PR has failed) labels and removed the CI-Wcoss2-Building (**Bot use only**: CI testing is cloning/building on WCOSS) and CI-Wcoss2-Running (**Bot use only**: CI testing on WCOSS for this PR is in-progress) labels on Feb 28, 2025
Labels
CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed
7 participants