Add optional 'time_unlimited' logical flag to model_configure #1535

Merged
jkbk2004 merged 20 commits into ufs-community:develop from the time_unlimited branch
Dec 28, 2022

Conversation

DusanJovic-NOAA
Collaborator

@DusanJovic-NOAA DusanJovic-NOAA commented Dec 13, 2022

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes contained in the PR. The Issue and the dependent sub-component PR are specified below.

  • Results for one or more of the regression tests change and the reasons for the changes are understood and explained below.

  • New or updated input data is required by this PR. If checked, please work with the code managers to update input data sets on all platforms.

Instructions: All subsequent sections of text should be filled in as appropriate.

The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsibility to keep the PR up-to-date with the develop branch of ufs-weather-model.

Description

Add a new model_configure parameter (logical flag), 'time_unlimited'. This flag is set to .false. by default. When the user explicitly sets it to .true. in the model_configure file, the time dimension in fv3atm history files will be a record dimension (i.e., unlimited).
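As a rough illustration of the default behavior described above, the following Python sketch reads a model_configure-style text and reports the effective value of 'time_unlimited'. It is not the fv3atm implementation: the "key: value" line layout, the accepted spellings of the logical value, the helper names, and the `other_setting` key are all assumptions made for this example.

```python
def parse_fortran_logical(text):
    """Convert a Fortran-style logical literal to a Python bool.

    The accepted spellings are an assumption for this sketch.
    """
    value = text.strip().lower()
    if value in (".true.", ".t."):
        return True
    if value in (".false.", ".f."):
        return False
    raise ValueError(f"not a logical value: {text!r}")


def read_time_unlimited(config_text):
    """Return the effective 'time_unlimited' setting.

    Assumes one 'key: value' pair per line (hypothetical layout).
    Falls back to False when the key is absent, mirroring the
    .false. default stated in the PR description.
    """
    for raw_line in config_text.splitlines():
        key, sep, value = raw_line.partition(":")
        if sep and key.strip() == "time_unlimited":
            return parse_fortran_logical(value)
    return False


# 'other_setting' is a made-up key used only to pad the example.
example = """
other_setting:    1
time_unlimited:   .true.
"""

print(read_time_unlimited(example))             # flag set explicitly
print(read_time_unlimited("other_setting: 1"))  # flag absent, default applies
```

The point of the sketch is only that omitting the flag is equivalent to setting it to .false., so existing model_configure files keep their current output layout.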

Issue(s) addressed

Link the issues to be closed with this PR, whether in this repository, or in another repository.
(Remember, issues must always be created before starting work on a PR branch!)

Testing

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • acorn.intel
  • opnReqTest for newly added/changed feature
  • CI

Dependencies

If testing this branch requires non-default branches in other repositories, list them. Those branches should have matching names (ideally).

Do PRs in upstream repositories need to be merged first? Yes.
If so add the "waiting for other repos" label and list the upstream PRs

@DusanJovic-NOAA DusanJovic-NOAA added the No Baseline Change label Dec 13, 2022
@DeniseWorthen
Collaborator

@DusanJovic-NOAA Currently when we write tiled output, the netcdf time dimension is unlimited. By setting time_unlimited to false by default, will this mean tiled output does not have an unlimited time dimension?

@DusanJovic-NOAA
Collaborator Author

@DusanJovic-NOAA Currently when we write tiled output, the netcdf time dimension is unlimited. By setting time_unlimited to false by default, will this mean tiled output does not have an unlimited time dimension?

Tiled history files (each tile in a separate file) are created using esmf io, and are not affected by the option.

@jkbk2004
Collaborator

@DusanJovic-NOAA can I check if you are working today? If so, we may combine this pr with #1538.

@DusanJovic-NOAA
Collaborator Author

@DusanJovic-NOAA can I check if you are working today? If so, we may combine this pr with #1538.

Yes, I'm working today. Let me first sync my branches with develop

@DusanJovic-NOAA
Collaborator Author

@DusanJovic-NOAA can I check if you are working today? If so, we may combine this pr with #1538.

Yes, I'm working today. Let me first sync my branches with develop

Should I merge #1538 into this PR?

@jkbk2004
Collaborator

Yes, go ahead and merge in #1538. It's a PR template update with no baseline change, so it's good to combine.

@DusanJovic-NOAA
Collaborator Author

Merged #1538

@BrianCurtis-NOAA
Collaborator

Automated RT Failure Notification
Machine: hera
Compiler: gnu
Job: RT
[RT] Repo location: /scratch1/NCEPDEV/nems/emc.nemspara/autort/pr/1163629247/20221227181518/ufs-weather-model
[RT] Error: Test cpld_control_p8 046 failed in run_test failed
Please make changes and add the following label back: hera-gnu-RT

@BrianCurtis-NOAA
Collaborator

Automated RT Failure Notification
Machine: gaea
Compiler: intel
Job: RT
[RT] Repo location: /lustre/f2/pdata/ncep/emc.nemspara/autort/pr/1163629247/20221227203006/ufs-weather-model
[RT] Error: Test control_c384 028 failed in run_test failed
[RT] Error: Test control_c384gdas 029 failed in run_test failed
Please make changes and add the following label back: gaea-intel-RT

@jkbk2004
Collaborator

@DusanJovic-NOAA On gaea, can you take a look at /lustre/f2/scratch/Jong.Kim/FV3_RT/rt_16491_troubleshoot? I think this is similar issue as cheyenne for control_c384 and control_c384gdas.

@jkbk2004
Collaborator

@DusanJovic-NOAA On gaea, can you take a look at /lustre/f2/scratch/Jong.Kim/FV3_RT/rt_16491_troubleshoot? I think this is similar issue as cheyenne for control_c384 and control_c384gdas.

I am not sure if we need @uturuncoglu's opinion. We thought it was a memory issue, but just in case, we may need to think about it from esmf's point of view. @uturuncoglu We created #1549. We started seeing this issue on cheyenne, but it seems to be happening on gaea as well.

@DusanJovic-NOAA
Collaborator Author

@DusanJovic-NOAA On gaea, can you take a look at /lustre/f2/scratch/Jong.Kim/FV3_RT/rt_16491_troubleshoot? I think this is similar issue as cheyenne for control_c384 and control_c384gdas.

Let me try to run the tests with fewer tasks per node on gaea for these two tests.

@DusanJovic-NOAA
Collaborator Author

Regression test passed on Gaea using TPN=18 for control_c384 and control_c384gdas.

@jkbk2004
Collaborator

Regression test passed on Gaea using TPN=18 for control_c384 and control_c384gdas.

Cool! Let me finish up the RT log on gaea. I will try on cheyenne as well.

@DeniseWorthen
Collaborator

I'm confused why the changes in this PR have any impact on the resource needs. The unlimited option is off by default, correct? So why are we seeing a resource impact?

@DusanJovic-NOAA
Collaborator Author

I'm confused why the changes in this PR have any impact on the resource needs. The unlimited option is off by default, correct? So why are we seeing a resource impact?

The changes in this PR are not the reason for the failures; the more likely cause is the switch to esmf managed threading a few commits earlier. For some reason these two tests occasionally fail on gaea and/or cheyenne, most probably due to memory limits.

@DeniseWorthen
Collaborator

@DusanJovic-NOAA Thanks, that is also what I suspect. But I didn't think the esmf-managed threading changed any of the resource allocations, so that jobs running w/ 1 thread before were still running w/ one thread after etc.

@jkbk2004
Collaborator

I am turning on control_c384 on cheyenne. It runs ok with TPN=18.

@DusanJovic-NOAA
Collaborator Author

I see that in PR #1523, in this commit (a941652), I changed the number of threads for c384 tests to 1 after @jkbk2004 reported a failure on gaea (#1523 (comment)). That seemed to solve the issue then. But in this PR these two tests failed again.

@DeniseWorthen
Collaborator

This feels like playing whack-a-mole.

@jkbk2004
Collaborator

I will watch. The odds of failure seem to be reduced. BTW, I think the git yaml script to pick up git.run.id might be a bit outdated. I will test on my side.

@jkbk2004
Collaborator

So, all tests are set. We can start the merging process.

@jkbk2004
Collaborator

@DusanJovic-NOAA fv3 pr was merged.

@jkbk2004 jkbk2004 merged commit 92d2d43 into ufs-community:develop Dec 28, 2022
@DusanJovic-NOAA DusanJovic-NOAA deleted the time_unlimited branch January 12, 2023 15:25
Labels
No Baseline Change
Development

Successfully merging this pull request may close these issues.

Add unlimited to time dimension in gaussian_grid netcdf output
4 participants