EAMxx: simplify branch runs and adding new output streams #7063

Merged: bartgol merged 5 commits into master from bartgol/eamxx/io-branch-run-fixes on Mar 10, 2025

bartgol
Contributor

@bartgol bartgol commented Feb 26, 2025

Simplify workflow for branch runs, and, in general, runs where new output streams are added upon restart.

[BFB]


Right now, for branch runs, the user needs to add the entry Perform Restart: false to the Restart sublist of their output yaml file. This allows EAMxx to skip looking for the rhist file. However, the user must later remove this entry, so that upon subsequent restarts the stream IS restarted. This is confusing, and the removal step is easy to overlook.

This PR handles these details automatically. In particular:

  • for a branch run, we automatically ignore the rhist file, with no need to add yaml entries explicitly
  • for streams added after a restart (not branch runs!), replace Perform Restart: false with skip_restart_if_rhist_not_found: true. This is simpler, since this param does not need to be removed after the first "post-restart" run
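
The per-stream behavior described in the two bullets above can be sketched roughly as follows. This is only an illustration in POSIX shell, not the EAMxx implementation: the function name, its arguments, and the returned labels are all hypothetical, and the outcome when the rhist file is missing and the flag is not set is an assumption (the description does not say what happens in that case).

```shell
#!/bin/sh
# Hypothetical sketch of the per-stream restart decision; not EAMxx code.
stream_restart_action() {
    run_type=$1          # initial | restart
    branch_run=$2        # true | false (mirrors the m_branch_run flag)
    rhist_found=$3       # true | false
    skip_if_missing=$4   # true | false (skip_restart_if_rhist_not_found)

    if [ "$run_type" = "initial" ] || [ "$branch_run" = "true" ]; then
        # Initial and branch runs ignore any rhist file.
        echo "start_fresh"
    elif [ "$rhist_found" = "true" ]; then
        # Regular restart: resume the stream from its rhist file.
        echo "restart_stream"
    elif [ "$skip_if_missing" = "true" ]; then
        # Stream added upon restart: there is nothing to resume from.
        echo "start_fresh"
    else
        # Assumption: a missing rhist file is otherwise an error.
        echo "error_missing_rhist"
    fi
}

stream_restart_action restart true  false false   # branch run
stream_restart_action restart false true  false   # regular restart
stream_restart_action restart false false true    # new stream upon restart
```

Note that the third call keeps returning the same answer on later restarts only until the stream has written its own rhist file; after that the second branch applies, which is why the flag would not need to be removed.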

@bartgol bartgol added the BFB (PR leaves answers BFB), EAMxx (PRs focused on capabilities for EAMxx), and code usability labels Feb 26, 2025
@bartgol bartgol requested a review from AaronDonahue February 26, 2025 22:10
@bartgol bartgol self-assigned this Feb 26, 2025
@@ -217,6 +217,7 @@ class AtmosphereDriver
   util::TimeStamp m_run_t0;
   util::TimeStamp m_case_t0;
   RunType m_run_type;
+  bool m_branch_run = false;
Contributor Author

I thought about adding a new value to RunType, but it required a few more changes across the library. Maybe that's more appropriate though?

Contributor

I've gone back and forth on this in my mind. I think this is probably okay as it is. I don't foresee a reason to use RunType=branch as an option outside of this application. Just to confirm, for the rest of the code RunType will be Restart for branch runs, right?

Contributor Author

Correct.

@bartgol bartgol requested a review from tcclevenger February 26, 2025 22:12
github-actions bot commented Feb 26, 2025

PR Preview Action v1.6.0

🚀 View preview at
https://E3SM-Project.github.io/E3SM/pr-preview/pr-7063/

Built to branch gh-pages at 2025-03-06 20:00 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@mt5555
Contributor

mt5555 commented Mar 2, 2025

Confirming this works. Cherry-picking these commits into a branch from December allowed a CIME-style branch run to work without having to add "Perform Restart: false" to all the I/O yaml files.

Contributor

@AaronDonahue AaronDonahue left a comment

Just one question, otherwise looks good.

@mahf708
Contributor

mahf708 commented Mar 6, 2025

So in essence, you're treating it as a sub-category of a Restart run. That's fine, since it makes sense conceptually, but I would unify the lingo with the terms people are used to, as documented here https://github.com/E3SM-Project/E3SM/blob/2b545d3f6a458dc539ce6e4120906705387ae8a8/run_e3sm.template.sh#L36C1-L38C33

relevant snippets from the run script

# Run options
readonly MODEL_START_TYPE="initial"  # 'initial', 'continue', 'branch', 'hybrid'
readonly START_DATE="0001-01-01"

# Additional options for 'branch' and 'hybrid'
readonly GET_REFCASE=TRUE
#readonly RUN_REFDIR=""
#readonly RUN_REFCASE=""
#readonly RUN_REFDATE=""   # same as MODEL_START_DATE for 'branch', can be different for 'hybrid'
    # Run type
    # Start from default of user-specified initial conditions
    if [ "${MODEL_START_TYPE,,}" == "initial" ]; then
        ./xmlchange RUN_TYPE="startup"
        ./xmlchange CONTINUE_RUN="FALSE"

    # Continue existing run
    elif [ "${MODEL_START_TYPE,,}" == "continue" ]; then
        ./xmlchange CONTINUE_RUN="TRUE"

    elif [ "${MODEL_START_TYPE,,}" == "branch" ] || [ "${MODEL_START_TYPE,,}" == "hybrid" ]; then
        ./xmlchange RUN_TYPE=${MODEL_START_TYPE,,}
        ./xmlchange GET_REFCASE=${GET_REFCASE}
        ./xmlchange RUN_REFDIR=${RUN_REFDIR}
        ./xmlchange RUN_REFCASE=${RUN_REFCASE}
        ./xmlchange RUN_REFDATE=${RUN_REFDATE}
        echo 'Warning: $MODEL_START_TYPE = '${MODEL_START_TYPE}
        echo '$RUN_REFDIR = '${RUN_REFDIR}
        echo '$RUN_REFCASE = '${RUN_REFCASE}
        echo '$RUN_REFDATE = '${START_DATE}

    else
        echo 'ERROR: $MODEL_START_TYPE = '${MODEL_START_TYPE}' is unrecognized. Exiting.'
        exit 380
    fi

graph TD;
    EAMxx-->Initial;
    EAMxx-->Restart;
    Initial-->startup
    Restart-->continue;
    Restart-->branch;
    Restart-->hybrid;

In other words, I would turn m_branch_run into an option that specifies continue, branch, or hybrid. For now, we only support continue and branch, so it makes sense it is true/false, but might as well make it an enum now? Or maybe we can wait.

Also note the outermost leaves are what we get from the CIME RUN_TYPE option (except for "continue")...
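
As a rough illustration of the diagram above, the mapping from the CIME variables to the two-level categorization could look like the sketch below. The function name and the "Category/leaf" labels are hypothetical, chosen only to match the diagram; nothing here reflects how EAMxx actually reads these settings. The one grounded rule is that RUN_TYPE only matters when CONTINUE_RUN is FALSE.

```shell
#!/bin/sh
# Hypothetical helper: classify a run from CIME's CONTINUE_RUN and RUN_TYPE.
eamxx_run_category() {
    continue_run=$1   # TRUE | FALSE
    run_type=$2       # startup | branch | hybrid

    # RUN_TYPE is only consulted when CONTINUE_RUN is FALSE.
    if [ "$continue_run" = "TRUE" ]; then
        echo "Restart/continue"
        return 0
    fi

    case "$run_type" in
        startup) echo "Initial/startup" ;;
        branch)  echo "Restart/branch" ;;
        hybrid)  echo "Restart/hybrid" ;;
        *)
            echo "unrecognized RUN_TYPE: $run_type" >&2
            return 1
            ;;
    esac
}

eamxx_run_category FALSE branch
eamxx_run_category TRUE startup
```

With an enum-like leaf value, the current true/false flag corresponds to distinguishing only the continue and branch leaves.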

Additionally, for hybrid and branch, do we explicitly support REFDIR, REFCASE, REFDATE? These options facilitate the "staging" of the necessary files by CIME to make sure the run can continue in a dir, etc. --- my guess is that this should be automatically supported as CIME will take care of it if the user sets these options, but we should test it to be super sure.

Copying @rljacob, does the above conceptual diagram sound right to you or do you think it's misleading?

from the reference page on run type, esp on diff between hybrid and branch...

https://www2.cesm.ucar.edu/models/cesm1.2/cesm/doc/modelnl/env_run.html#run_start

Run initialization type.
Determines the model run initialization type.  
This setting is only important for the initial run of a production run when the 
CONTINUE_RUN variable is set to FALSE.  After the initial run, the CONTINUE_RUN
variable is set to TRUE, and the model restarts exactly using input
files in a case, date, and bit-for-bit continuous fashion.
Default: startup.
-- In a startup run (the default), all components are initialized
using baseline states.  These baseline states are set independently by
each component and can include the use of restart files, initial
files, external observed data files, or internal initialization (i.e.,
a cold start). In a startup run, the coupler sends the start date to
the components at initialization. In addition, the coupler does not
need an input data file.  In a startup initialization, the ocean model
does not start until the second ocean coupling (normally the second
day).
-- In a branch run, all components are initialized using a consistent
set of restart files from a previous run (determined by the
RUN_REFCASE and RUN_REFDATE variables in env_run.xml).  The case name
is generally changed for a branch run, although it does not have to
be. In a branch run, setting RUN_STARTDATE is ignored because the
model components obtain the start date from their restart datasets.
Therefore, the start date cannot be changed for a branch run. This is
the same mechanism that is used for performing a restart run (where
CONTINUE_RUN is set to TRUE in the env_run.xml). Branch runs are
typically used when sensitivity or parameter studies are required, or
when settings for history file output streams need to be modified
while still maintaining bit-for-bit reproducibility. Under this
scenario, the new case is able to produce an exact bit-for-bit restart
in the same manner as a continuation run IF no source code or
component namelist inputs are modified. All models use restart files
to perform this type of run.  RUN_REFCASE and RUN_REFDATE are required
for branch runs.
To set up a branch run, locate the restart tar file or restart
directory for RUN_REFCASE and RUN_REFDATE from a previous run, then
place those files in the RUNDIR directory.
-- A hybrid run indicates that the model is initialized more like a
startup, but uses initialization datasets FROM A PREVIOUS case.  This
is somewhat analogous to a branch run with relaxed restart
constraints.  A hybrid run allows users to bring together combinations
of initial/restart files from a previous case (specified by
RUN_REFCASE) at a given model output date (specified by
RUN_REFDATE). Unlike a branch run, the starting date of a hybrid run
(specified by RUN_STARTDATE) can be modified relative to the reference
case. In a hybrid run, the model does not continue in a bit-for-bit
fashion with respect to the reference case. The resulting climate,
however, should be continuous provided that no model source code or
namelists are changed in the hybrid run.  In a hybrid initialization,
the ocean model does not start until the second ocean coupling
(normally the second day), and the coupler does a cold start without
a restart file.
Valid Values: startup,hybrid,branch 

@mahf708
Contributor

mahf708 commented Mar 6, 2025

On that note, I wouldn't give people an option in IO on what to do, tbh. If a user selects a branch/hybrid run, they are responsible for understanding what that entails for IO. So, I would implement all of this as follows:

  1. Get the run type from the CIME options; this is where the user decides
  2. Set the IO options accordingly; here the user cannot decide

The user only gets to choose if they want initial, continue, branch, hybrid (maybe we can error out early on hybrid for now), and the IO decisions are bound by that choice, not more optionality ...
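
The two-step design suggested above, where IO behavior follows from the start type alone with no extra user option, might be sketched like this. The function name and the policy labels are hypothetical, and failing early on hybrid follows the "maybe we can error out early on hybrid for now" remark rather than any existing behavior.

```shell
#!/bin/sh
# Hypothetical sketch: derive the IO restart policy from the start type only.
io_policy_for_start_type() {
    case "$1" in
        initial)  echo "no_restart" ;;            # nothing to restart from
        continue) echo "restart_all_streams" ;;   # every stream resumes
        branch)   echo "ignore_rhist" ;;          # streams start fresh
        hybrid)
            echo "hybrid runs are not supported yet" >&2
            return 1
            ;;
        *)
            echo "unrecognized start type: $1" >&2
            return 1
            ;;
    esac
}

io_policy_for_start_type branch
io_policy_for_start_type continue
```

Under this design there is no per-stream flag at all, so adding a stream to an existing simulation would always go through a branch run.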

@bartgol
Contributor Author

bartgol commented Mar 6, 2025

> The user only gets to choose if they want initial, continue, branch, hybrid (maybe we can error out early on hybrid for now), and the IO decisions are bound by that choice, not more optionality ...

That is the idea, no? After we merge this, the user picks the run type via CIME vars, and IO will decide whether to look for restarts or not based on that. Unless I'm missing something with our restart behavior?

@bartgol bartgol force-pushed the bartgol/eamxx/io-branch-run-fixes branch from 2460f47 to 6a70f62 Compare March 6, 2025 19:57
@mahf708
Contributor

mahf708 commented Mar 6, 2025

> > The user only gets to choose if they want initial, continue, branch, hybrid (maybe we can error out early on hybrid for now), and the IO decisions are bound by that choice, not more optionality ...
>
> That is the idea, no? After we merge this, the user picks the run type via CIME vars, and IO will decide whether to look for restarts or not based on that. Unless I'm missing something with our restart behavior?

You're introducing this new runtime param: skip_restart_if_rhist_not_found

@bartgol
Contributor Author

bartgol commented Mar 7, 2025

> > > The user only gets to choose if they want initial, continue, branch, hybrid (maybe we can error out early on hybrid for now), and the IO decisions are bound by that choice, not more optionality ...
> >
> > That is the idea, no? After we merge this, the user picks the run type via CIME vars, and IO will decide whether to look for restarts or not based on that. Unless I'm missing something with our restart behavior?
>
> You're introducing this new runtime param: skip_restart_if_rhist_not_found

That's a power user param. It is for users that want to add a new stream upon restart, without wiping the other streams (i.e., without doing a branch run). It is not needed (in fact, not used) for branch runs.

@mahf708
Contributor

mahf708 commented Mar 7, 2025

That's kind of my point... this option shouldn't exist. If someone wants to modify a simulation this way, then they should branch out. That's one of the reasons branch runs exist!

@bartgol
Contributor Author

bartgol commented Mar 7, 2025

That works with me. In general, I'm fine keeping some backdoors for power users. But I'm also fine not having to maintain backdoors that nobody uses...

@mahf708
Contributor

mahf708 commented Mar 7, 2025

> That works with me. In general, I'm fine keeping some backdoors for power users. But I'm also fine not having to maintain backdoors that nobody uses...

No need to adjust it now, just giving my 2c on how I would envision these things from a user perspective. This PR should be integrated whenever you see fit. My most important comment is this #7063 (comment) (see if you disagree with it, or if you can make eamxx behave as closely as possible to what users expect, as they're used to stuff in CIME)

bartgol added a commit that referenced this pull request Mar 10, 2025
Simplify workflow for branch runs, and, in general,
runs where new output streams are added upon restart.

[BFB]
@bartgol bartgol merged commit c9cba71 into master Mar 10, 2025
26 of 28 checks passed
@bartgol bartgol deleted the bartgol/eamxx/io-branch-run-fixes branch March 10, 2025 15:20