Running SRW using wrapper launch scripts is not working #473
Comments
@natalie-perlin Are the wrapper scripts worth fixing? They look so outdated -- loading individual modules, missing variable definitions, etc. Also, most of the wrapper scripts simply try to replicate what rocoto does better. So I am all for deleting the whole
rocotorun -v 10 -w FV3LAM_wflow.xml -d FV3LAM_wflow.db
#!/bin/sh
#SBATCH --account=zrtrr
#SBATCH --qos=batch
#SBATCH --partition=hera
#SBATCH --ntasks=12
#SBATCH -t 01:00:00
#SBATCH --job-name=run_fcst_mem1
#SBATCH -o /scratch2/BMC/gsd-hpcs/Daniel.Abdi/nco_dirs/output/20190615/run_fcst_mem1_2019061500.id_1668444250.log
#SBATCH --cpus-per-task 2 --exclusive
#SBATCH --export=NONE
#SBATCH --comment=c2e24558005fca0a835d1a03d0254e62
export GLOBAL_VAR_DEFNS_FP='/scratch2/BMC/gsd-hpcs/Daniel.Abdi/expt_dirs/MET_ensemble_verification/var_defns.sh'
export USHdir='/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush'
export PDY='20190615'
export cyc='00'
export subcyc='00'
export LOGDIR='/scratch2/BMC/gsd-hpcs/Daniel.Abdi/nco_dirs/output/20190615'
export SLASH_ENSMEM_SUBDIR='/mem1'
export ENSMEM_INDX='1'
/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush/load_modules_run_task.sh "run_fcst" "/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/jobs/JREGIONAL_RUN_FCST"
Compared to the wrapper script:
#!/bin/sh
export GLOBAL_VAR_DEFNS_FP="${EXPTDIR}/var_defns.sh"
set -x
source ${GLOBAL_VAR_DEFNS_FP}
export CDATE=${DATE_FIRST_CYCL}
export CYCLE_DIR=${EXPTDIR}/${CDATE}
export SLASH_ENSMEM_SUBDIR=""
export ENSMEM_INDX=""
${JOBSdir}/JREGIONAL_RUN_FCST
Rocoto is readily available on all Tier-1 platforms, and I think it is much better to add an xml entry for a task than to write a standalone script. Tagging @christinaholtNOAA @MichaelLueken @christopherwharrop-noaa |
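One visible difference between the two scripts above is that the rocoto-generated job exports PDY, cyc and subcyc, while the wrapper only sets CDATE. A minimal sketch of how a wrapper could derive them follows. The helper name `set_cycle_vars` is hypothetical, and it assumes CDATE has the form YYYYMMDDHH and that the run starts on the hour; neither assumption is confirmed by the thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the SRW App): split a YYYYMMDDHH cycle
# date into the variables that the rocoto-generated job script exports.
set_cycle_vars() {
  local cdate="$1"
  # Assumption: the cycle date is exactly ten digits, YYYYMMDDHH.
  if [[ ! $cdate =~ ^[0-9]{10}$ ]]; then
    echo "expected YYYYMMDDHH, got '$cdate'" >&2
    return 1
  fi
  export PDY="${cdate:0:8}"   # date part, e.g. 20190615
  export cyc="${cdate:8:2}"   # hour part, e.g. 00
  export subcyc="00"          # assumption: cycles start on the hour
}
```

In a wrapper this could be called as `set_cycle_vars "${CDATE}"` before invoking the J-job, so that later tests on `$subcyc` see a defined value.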
@danielabdi-noaa |
@natalie-perlin I don't think these scripts were meant to replace rocoto on non-Tier-1 systems -- I believe they are meant for development purposes only. I would say rocoto is pretty much a requirement. In the docs, even the singularity container uses rocoto to run the workflow. For these wrappers to replace rocoto, it would take a significant amount of work. For example, the necessary script for the forecast will look different depending on which ensemble member you are running, so one script for the forecast probably will not work unless we use jinja2 to template it. The |
@danielabdi-noaa - the tests for generic Linux and Mac for the previous release were successfully done using wrapper scripts. The same wrapper scripts worked well on Cheyenne, too (speaking from running them myself; I am not sure whether anybody tried them on different platforms). The Rocoto workflow manager is indeed useful for routine or operational purposes, when everything has been tested and is running smoothly. At present, this is not the case for the general public, researchers, graduate students and research faculty in the area of atmospheric and weather science. The idea behind EPIC is to make weather apps accessible and platform-independent. |
@natalie-perlin If the issue is to be able to run a step directly, you can use You may have run one simple test case with the scripts on linux, but clearly they won't be able to run all of the WE2E test cases on any platform. They are hardly used by anyone -- the proof being that no one noticed they were broken until now. In my opinion, they should not be part of the repo in their current state, because they are twice the effort, are easily broken when someone adds a variable in the xml file, and are not generic enough to run any WE2E test case on any platform. I should note that GFS has wrapper scripts for each task, but those are used by rocoto too. https://github.com/NOAA-EMC/global-workflow/tree/develop/jobs/rocoto If they are designed that way, as something that allows direct execution of a job, then it may be acceptable since there will not be duplication of effort. |
The attempt was to run a single test (the default in config.community.yaml), not the WE2E test cases, and it was not successful. |
@danielabdi-noaa @MichaelLueken As of now, no tests could be run on MacOS (testing on x86_64, Monterey 12.1.6). Homebrew installs bash 5.2.12 on MacOS and does not offer a choice of versions. The default version bundled with the OS is bash v3, which is too old for SRW. A solution for running the shell scripts manually (without the rocoto manager, which apparently requires slurm) is absolutely needed to make the app community-friendly, in line with the goals of the EPIC project. |
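A script can detect the bundled bash v3 early and stop with a clear message rather than failing later on unsupported syntax. The guard below is only a sketch: the function name `require_bash_major` is hypothetical, and the minimum major version of 4 used in the usage line is an assumption, since the thread only says that v3 is too old.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not part of the SRW App): succeed only if the running
# bash has at least the requested major version.
require_bash_major() {
  local min_major="$1"
  # Second argument lets a caller pass a version explicitly; by default the
  # major version of the running shell is read from BASH_VERSINFO.
  local have_major="${2:-${BASH_VERSINFO[0]}}"
  [ "$have_major" -ge "$min_major" ]
}
```

A launch script could then begin with `require_bash_major 4 || { echo "bash is too old" >&2; exit 1; }`.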
@natalie-perlin I have not tried building the SRW app on linux/mac so far. I will try and do that following the docs, and hopefully come up with a solution that makes running WE2E tests possible on linux/mac. |
@danielabdi-noaa - that will be very helpful! We do need it greatly. |
@danielabdi-noaa - Running the ./generate_FV3LAM_wflow.py script is the last step that works on MacOS at this point. NB: @christinaholtNOAA !!! Your help may be needed! Christina worked on the job launching scripts and helped to "pythonize" them - maybe the solution would be to move to python completely rather than a mix of bash and python? |
Placing my comment from PR-508: There is a solution for the bash problem on Macs. Any script that starts with #!/bin/bash, when executed on a Mac, uses the older Bash v.3.x.x. When bash is upgraded using Homebrew, its installation location is architecture-dependent: either /usr/local/bin/bash (Intel) or /opt/homebrew/bin/bash (M1). /bin/bash remains intact and still points to the old version. The workaround is to use gsed to change the headers of all bash scripts used for job launching (currently in three different directories) to #!/usr/bin/env bash, as follows:
./ush/wrappers/*.sh (before or after they are copied to an experiment directory)
./jobs/JREGIONAL_*
./scripts/exregional*.sh
This approach does not solve the problem of outdated launch scripts that could benefit from cleaning up, but it restores functionality for running the SRW App on MacOS. The only changes required are in the documentation, but they could also be placed as informative messages in the modules or as notes at the end of a workflow generation script... This approach is being tested successfully, and the forecast stage is currently running on an Intel Mac. :) |
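The header change described above could be scripted roughly as follows. This is a sketch, not the change made in the PR: the function name `fix_shebang` is hypothetical, it only rewrites a first line that begins with `#!/bin/bash`, and it falls back to plain `sed` when `gsed` is absent, which assumes that `sed` is GNU sed (the in-place `-i` form used here does not work with BSD sed).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the SRW App): rewrite a hard-coded
# "#!/bin/bash" on line 1 of each given file to "#!/usr/bin/env bash".
fix_shebang() {
  local sed_cmd=sed
  # Prefer Homebrew's GNU sed (gsed) when present, as on macOS.
  if command -v gsed >/dev/null 2>&1; then
    sed_cmd=gsed
  fi
  local f
  for f in "$@"; do
    # Skip unmatched globs and directories.
    [ -f "$f" ] || continue
    # Only line 1 is touched; the rest of the file is left unchanged.
    "$sed_cmd" -i '1s|^#! */bin/bash|#!/usr/bin/env bash|' "$f"
  done
}
```

From the top of the repository it might be called as `fix_shebang ./ush/wrappers/*.sh ./jobs/JREGIONAL_* ./scripts/exregional*.sh`. With `env`, the first bash on PATH is used, so the Homebrew bash is picked up on both Intel and M1 Macs provided its directory precedes /bin in PATH.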
Issue resolved in PR-557 |
Running the SRW using individual launch scripts from ./ush/wrappers/*.sh does not work when no rocoto manager is specified, i.e. WORKFLOW_MANAGER: none in ./ush/machine/<platform>.yaml
On some systems, it fails at the very first task, run_make_grid.sh; on other systems it fails on run_get_lbcs.sh or run_get_ics.sh. Even when an individual task goes through, it still reports bash errors. One of the errors to be corrected is in ./ush/job_preamble.sh, line 10 (or line 22 after generating the workflow):
if [ $subcyc -eq 0 ]; then (wrong)
to
if [[ $subcyc -eq 0 ]]; then
Other errors are yet to be determined and corrected.
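The difference between the two forms shows up when subcyc is unset or empty, which is a plausible situation in the wrapper scripts because they do not export it (an assumption about the cause, not something confirmed here). With single brackets the unquoted empty value vanishes and the test becomes `[ -eq 0 ]`, which bash reports as an error; inside double brackets an empty operand is evaluated arithmetically as 0. A small sketch with hypothetical function names:

```shell
#!/usr/bin/env bash
# Sketch: compare the single- and double-bracket tests from job_preamble.sh
# for a given value of subcyc (function names are illustrative only).

check_single() {
  local subcyc="$1"
  # Unquoted on purpose, as in the original line: an empty value disappears
  # and `[` sees only "-eq 0", which is an error (exit status 2).
  if [ $subcyc -eq 0 ] 2>/dev/null; then
    echo "zero"
  else
    echo "nonzero-or-error"
  fi
}

check_double() {
  local subcyc="$1"
  # No word splitting inside [[ ]]; an empty operand evaluates to 0.
  if [[ $subcyc -eq 0 ]]; then
    echo "zero"
  else
    echo "nonzero"
  fi
}

check_single ""    # prints: nonzero-or-error
check_double ""    # prints: zero
check_double "00"  # prints: zero
```

Note that `[[ ... -eq ... ]]` treats operands with a leading zero as octal, so a value such as 08 would be rejected; whether subcyc can take such values is not stated in this issue.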
Systems tested that indicated failure:
cheyenne (intel)
hera (intel)
gaea
orion
macos