Running SRW using wrapper launch scripts is not working #473

Closed
natalie-perlin opened this issue Nov 14, 2022 · 12 comments · Fixed by #557
Labels
bug Something isn't working

Comments

@natalie-perlin
Collaborator

natalie-perlin commented Nov 14, 2022

Running the SRW using the individual launch scripts in ./ush/wrappers/*.sh does not work when no rocoto manager is specified, i.e. WORKFLOW_MANAGER: none in ./ush/machine/<platform>.yaml. On some systems it fails at the very first task, run_make_grid.sh; on other systems it fails on run_get_lbcs.sh or run_get_ics.sh. Even when an individual task goes through, it still reports bash errors.
One of the errors to be corrected is in ./ush/job_preamble.sh, line 10 (or line 22 after generating the workflow):

if [ $subcyc -eq 0 ]; then (wrong)
to
if [[ $subcyc -eq 0 ]]; then
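
A minimal sketch of why the bracket style matters, assuming the failure appears when subcyc is empty or unset at that point (the issue does not say which value triggers it). The function names are hypothetical and exist only for this illustration:

```shell
#!/usr/bin/env bash
# Sketch: compare [ ] and [[ ]] when subcyc may be empty.
# check_single and check_double are hypothetical helper names.

check_single() {
  local subcyc="$1"
  # Unquoted and empty, $subcyc vanishes, so "[" only sees "-eq 0"
  # and fails with "unary operator expected" (hidden here via 2>/dev/null).
  if [ $subcyc -eq 0 ] 2>/dev/null; then
    echo "zero"
  else
    echo "nonzero-or-error"
  fi
}

check_double() {
  local subcyc="$1"
  # [[ ]] does not word-split, and an empty operand evaluates to 0.
  if [[ $subcyc -eq 0 ]]; then
    echo "zero"
  else
    echo "nonzero"
  fi
}

for value in "" "00" "30"; do
  echo "subcyc='${value}': single=$(check_single "$value") double=$(check_double "$value")"
done
```

With an empty value the single-bracket form errors out, while the double-bracket form treats the operand as 0. One caveat: [[ ]] evaluates its operands arithmetically, so a value such as 08 would be rejected as invalid octal.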

Other errors are yet to be determined and corrected.
Systems tested that indicated failure:
cheyenne (intel)
hera (intel)
gaea
orion
macos

@natalie-perlin natalie-perlin added the bug Something isn't working label Nov 14, 2022
@danielabdi-noaa
Collaborator

danielabdi-noaa commented Nov 14, 2022

@natalie-perlin Are the wrapper scripts worth fixing? They look so outdated: loading individual modules, missing variable definitions, etc. Also, most of the wrapper scripts simply try to replicate what rocoto does better, so I am all for deleting the whole wrappers/ folder. For example, this is the script that rocoto auto-generates for run_fcst:

rocotorun -v 10 -w FV3LAM_wflow.xml -d FV3LAM_wflow.db
#! /bin/sh
#SBATCH --account=zrtrr
#SBATCH --qos=batch
#SBATCH --partition=hera
#SBATCH --ntasks=12
#SBATCH -t 01:00:00
#SBATCH --job-name=run_fcst_mem1
#SBATCH -o /scratch2/BMC/gsd-hpcs/Daniel.Abdi/nco_dirs/output/20190615/run_fcst_mem1_2019061500.id_1668444250.log
#SBATCH --cpus-per-task 2 --exclusive
#SBATCH --export=NONE
#SBATCH --comment=c2e24558005fca0a835d1a03d0254e62
export GLOBAL_VAR_DEFNS_FP='/scratch2/BMC/gsd-hpcs/Daniel.Abdi/expt_dirs/MET_ensemble_verification/var_defns.sh'
export USHdir='/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush'
export PDY='20190615'
export cyc='00'
export subcyc='00'
export LOGDIR='/scratch2/BMC/gsd-hpcs/Daniel.Abdi/nco_dirs/output/20190615'
export SLASH_ENSMEM_SUBDIR='/mem1'
export ENSMEM_INDX='1'
/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush/load_modules_run_task.sh "run_fcst" "/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/jobs/JREGIONAL_RUN_FCST"

Compare that to the wrapper script:

#!/bin/sh
export GLOBAL_VAR_DEFNS_FP="${EXPTDIR}/var_defns.sh"
set -x
source ${GLOBAL_VAR_DEFNS_FP}
export CDATE=${DATE_FIRST_CYCL}
export CYCLE_DIR=${EXPTDIR}/${CDATE}
export SLASH_ENSMEM_SUBDIR=""
export ENSMEM_INDX=""

${JOBSdir}/JREGIONAL_RUN_FCST

Rocoto is readily available on all Tier-1 platforms, and I think it is much better to add the xml entry for a task than to write a standalone script. Tagging @christinaholtNOAA @MichaelLueken @christopherwharrop-noaa
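
To illustrate the gap between the two listings above: the rocoto-generated script exports PDY, cyc and subcyc, while the wrapper only sets CDATE. A minimal sketch of a helper that derives those variables from a CDATE string follows. The function name set_cycle_vars is hypothetical, and it assumes CDATE has the form YYYYMMDDHH and that the cycle starts at the top of the hour:

```shell
#!/usr/bin/env bash
# Sketch: derive the cycle variables seen in the rocoto-generated script
# from a CDATE value. set_cycle_vars is a hypothetical name, not part of SRW.

set_cycle_vars() {
  local cdate="$1"
  # Assumption: CDATE is exactly ten digits, YYYYMMDDHH.
  if [[ ! $cdate =~ ^[0-9]{10}$ ]]; then
    echo "expected YYYYMMDDHH, got '${cdate}'" >&2
    return 1
  fi
  export PDY="${cdate:0:8}"   # date part, e.g. 20190615
  export cyc="${cdate:8:2}"   # hour part, e.g. 00
  export subcyc="00"          # assumption: top-of-hour cycle
}

# Example with the cycle from the listing above.
set_cycle_vars 2019061500 && echo "PDY=${PDY} cyc=${cyc} subcyc=${subcyc}"
```

A wrapper could call such a helper after setting CDATE so that scripts expecting the rocoto variables find them defined. Whether this covers every variable the job scripts need is not established in this thread.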

@natalie-perlin
Collaborator Author

@danielabdi-noaa
As long as the rocoto workflow manager is not a prerequisite for running the SRW, we need to support a way to run the SRW step by step. The workflow manager is a great tool when everything works smoothly, which could be OK on Tier 1 systems with pre-configured grids, partitions and standard tests.
In most other cases, some greater or lesser debugging process is needed. Having an option to run it task by task is essential if we want to make the SRW accessible to the community beyond current Tier-1 system users.

@danielabdi-noaa
Collaborator

@natalie-perlin I don't think these scripts were meant to replace rocoto on non-tier 1 systems; I believe they are meant for development purposes only. I would say rocoto is pretty much a requirement. In the docs, even the singularity container uses rocoto for running the workflow. For these wrappers to replace rocoto, it would take a significant amount of work. For example, the necessary script for the forecast will look different depending on which ensemble member you are running, so one script for the forecast probably will not work unless we use jinja2 to template it. The qsub_job.sh and sq_job.sh scripts look like they were written specifically for cheyenne and hera, respectively. I don't think the wrappers have worked for a long time, given the number of variables they miss, the lack of NCO mode, etc., so they are not generic enough to replace rocoto.

@natalie-perlin
Collaborator Author

@danielabdi-noaa - the tests for generic Linux and Mac for the previous release were done successfully using the wrapper scripts. The same wrapper scripts worked well on Cheyenne, too (speaking from running them myself; I am not sure whether anybody tried them on other platforms).

The rocoto workflow manager is indeed useful for routine or operational purposes, when everything has been tested and is running smoothly. At present, this is not the case for the general public, researchers, graduate students and research faculty in the area of atmospheric and weather science. The idea behind EPIC is to make weather apps accessible and platform-independent.
Research and experimental work involving specific case studies may not need to re-run the entire workflow, such as grid generation, getting and preparing lateral and boundary conditions, and processing climatology and observations. Such work may want to do the pre-processing stages once and then focus on certain model options or code enhancements. At the very least, the workflow needs to be separated into a pre-processing stage (maybe with several sub-tasks), which could be run serially, the model run, and any post-processing.
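
A minimal sketch of the serial, stage-by-stage execution described above, assuming each stage is a standalone script that returns a non-zero status on failure. The function name run_stages is hypothetical; the stage names in the usage comment come from the wrapper scripts mentioned in this issue, but their order and completeness are assumptions:

```shell
#!/usr/bin/env bash
# Sketch: run stage scripts one after another and stop at the first failure.
# run_stages is a hypothetical helper, not part of the SRW App.

run_stages() {
  local dir="$1"
  shift
  local stage
  for stage in "$@"; do
    echo "running ${stage}"
    if ! bash "${dir}/${stage}"; then
      echo "stage ${stage} failed; stopping" >&2
      return 1
    fi
  done
}

# Hypothetical usage (order assumed, list not exhaustive):
#   run_stages ./ush/wrappers run_make_grid.sh run_get_ics.sh run_get_lbcs.sh
```

Stopping at the first failure keeps later stages from running on incomplete input, and a finished pre-processing stage can simply be left out of the list on the next run.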

@danielabdi-noaa
Collaborator

danielabdi-noaa commented Nov 14, 2022

@natalie-perlin If the issue is to be able to run a step directly, you can use rocotoboot. It will not check dependencies and will run the step immediately. Rocoto can be installed and run on any platform just as easily. However, it needs a job scheduler (slurm/pbs) that is often not available on a linux/mac system. I have tried in the past to circumvent the problem by providing "fake slurm commands" that rocoto can execute.
https://github.com/danielabdi-noaa/gfs-docker/tree/master/scripts/slurm
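
A hedged sketch of what such a rocotoboot call might look like. It assumes rocotoboot takes the same -w and -d options shown for rocotorun earlier in this thread, plus -c for the cycle (as YYYYMMDDHHMM) and -t for the task name; check the rocotoboot help on your installation before relying on this. The helper name boot_task_cmd is hypothetical, and it only prints the command so it can be inspected first:

```shell
#!/usr/bin/env bash
# Sketch: build (but do not run) a rocotoboot command for one task and cycle.
# boot_task_cmd is a hypothetical helper; the option names are assumptions
# to verify against your rocoto installation.

boot_task_cmd() {
  local task="$1"    # task name as it appears in the workflow XML
  local cdate="$2"   # cycle as YYYYMMDDHH
  # Rocoto cycles carry minutes, so append "00" (assumed top of the hour).
  printf 'rocotoboot -w %s -d %s -c %s -t %s\n' \
    FV3LAM_wflow.xml FV3LAM_wflow.db "${cdate}00" "$task"
}

boot_task_cmd run_fcst_mem1 2019061500
```

The task name run_fcst_mem1 is taken from the job name in the generated script above; the actual task name in a given workflow XML may differ.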

You may have run one simple test case with the scripts on linux, but clearly they won't be able to run all of the WE2E test cases on any platform. They are hardly used by anyone, the proof being that no one noticed they were broken until now. In my opinion, they should not be part of the repo in their current state, because they are twice the effort, break easily when someone adds a variable in the xml file, and are not generic enough to run any WE2E test case on any platform.

I should note that GFS has wrapper scripts for each task, but those are used by rocoto too.

https://github.com/NOAA-EMC/global-workflow/tree/develop/jobs/rocoto

If they are designed that way, as something that allows direct execution of a job, then it may be acceptable, since there will not be duplication of effort.

@natalie-perlin
Collaborator Author

natalie-perlin commented Nov 15, 2022

The attempt was to run a single test (the default in config.community.yaml), not the WE2E test cases, and it was not successful.

@natalie-perlin
Collaborator Author

natalie-perlin commented Dec 2, 2022

@danielabdi-noaa @MichaelLueken

As of now, no tests can be run on MacOS (testing on x86_64, Monterey 12.1.6). Homebrew installs bash 5.2.12 on MacOS and does not offer a choice of versions. The default bash bundled with Darwin is v3, which is too old for SRW.
The SRW scripts with #!/bin/bash that use python do not seem to work on Mac. The Level 1 systems do work with wrapper scripts but usually have bash v.4.x.x.

A solution for running the shell scripts manually (without the rocoto manager, which apparently requires slurm) is absolutely needed to make the SRW community-friendly and to align with the goals of the EPIC project.
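
A minimal sketch of a version guard that a launch script could run first, so that an old bash fails with a clear message rather than obscure errors. The threshold of 4 is an assumption drawn from the versions mentioned above, not a documented SRW requirement, and bash_is_new_enough is a hypothetical name:

```shell
#!/usr/bin/env bash
# Sketch: warn when the running bash is too old.
# MIN_BASH_MAJOR=4 is an assumption based on this thread.
MIN_BASH_MAJOR=4

bash_is_new_enough() {
  # An optional argument overrides the detected major version (useful for testing).
  local major="${1:-${BASH_VERSINFO[0]}}"
  (( major >= MIN_BASH_MAJOR ))
}

if ! bash_is_new_enough; then
  echo "bash ${BASH_VERSION} is older than ${MIN_BASH_MAJOR}.x; use a newer bash" >&2
fi
```

The check itself uses only features available in bash 3, so it can run under the old interpreter it is meant to detect.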

@danielabdi-noaa
Collaborator

@natalie-perlin I have not tried building the SRW app on linux/mac so far. I will try and do that following the docs, and hopefully come up with a solution that makes running WE2E tests possible on linux/mac.

@natalie-perlin
Collaborator Author

@danielabdi-noaa - that will be very helpful! We do need it greatly.

@natalie-perlin
Collaborator Author

@danielabdi-noaa -
Please note that we are talking not only about running pre-defined end-to-end tests, but about running any SRW tests in general. The wrapper scripts are not working so there is no way to run any task of the workflow, i.e. complete loss of functionality for the Darwin systems, as far as I see it.

Running the ./generate_FV3LAM_wflow.py script is the last step that works for the MacOS at this point.

NB: @christinaholtNOAA !!! Your help may be needed!

Christina worked on the job launching scripts and helped to "pythonize" them. Maybe the solution would be to move to python completely rather than use a mix of bash and python?

@natalie-perlin
Collaborator Author

Placing my comment from PR-508:

There is a solution to the bash problem on Macs. Any script that starts with #!/bin/bash runs under the older Bash v.3.x.x when executed on a Mac. When bash is upgraded using Homebrew, its installation location is architecture-dependent: either /usr/local/bin/bash (Intel) or /opt/homebrew/bin/bash (M1). /bin/bash remains intact and still points to the old version.

The workaround is to use gsed to change the headers of all bash scripts used for job launching (currently in three different directories) to #!/usr/bin/env bash, as follows:

./ush/wrappers/*.sh (before or after they are copied to an experiment directory)
gsed -i -e "s|/bin/sh|/usr/bin/env bash|" *sh

./jobs/JREGIONAL_*
gsed -i -e "s|/bin/bash|/usr/bin/env bash|" JREGIONAL_*

./scripts/exregional*.sh
gsed -i -e "s|/bin/bash|/usr/bin/env bash|" *sh

This approach does not solve the problem of outdated launch scripts that could benefit from cleaning up, but it restores the ability to run the SRW App on MacOS. The only changes required are in the documentation, but they could also be added as informative messages in the modules or as user notes at the end of the workflow generation script...

This approach is being tested, so far successfully, and the forecast stage is currently running on an Intel Mac. :)
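
The gsed commands above substitute the first match on every line of each file, so a later line that mentions /bin/sh or /bin/bash would also be changed. A more conservative sketch that touches only the first line and avoids the GNU-only -i flag follows. The function name fix_shebang is hypothetical, and it assumes the shebang carries no interpreter options:

```shell
#!/usr/bin/env bash
# Sketch: rewrite only the shebang line of the given files.
# fix_shebang is a hypothetical helper, not part of the SRW App.

fix_shebang() {
  local f tmp
  for f in "$@"; do
    tmp="$(mktemp)" || return 1
    # Match "#!/bin/sh" or "#!/bin/bash" (optional space after "#!") on line 1 only.
    sed -E '1s@^#![[:space:]]*/bin/(ba)?sh[[:space:]]*$@#!/usr/bin/env bash@' "$f" > "$tmp" \
      && cat "$tmp" > "$f"   # overwrite in place, keeping the file's permissions
    rm -f "$tmp"
  done
}

# Hypothetical usage from the repository root:
#   fix_shebang ./ush/wrappers/*.sh ./jobs/JREGIONAL_* ./scripts/exregional*.sh
```

Writing through a temporary file works with both GNU sed and the BSD sed shipped with MacOS, so installing gsed would not be needed for this step.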

@natalie-perlin
Collaborator Author

Issue resolved in PR-557
