How To Start a Run From Existing Data Files ("Hot starting")
We often run into situations where we want to run a simulation longer than originally planned, or for longer than we were allowed to run it. For example, you may have submitted a time-limited job to an HPC resource and the job scheduler killed the run after it reached the job time limit. Limited support for this type of operation is implemented in the '-H' or '--hotStart' option to parun (note the capital H; lowercase '-h' prints the help message):
$ parun -h
Usage: parun [options] soModule.py [[soFile.sso] [pModule.py nModule.py]]
Options:
-h, --help show this help message and exit
...output truncated...
-H, --hotStart Use the last step in the archive as the initial
condition and continue appending to the
archive
The preconditions for running with this option are that 1) you have previously run the simulation, so the .xmf/.h5 and mesh.* files for it already exist, and 2) you wish to restart from the last time step in the XDMF archive using the same number of processors. We should be able to relax condition 2 eventually, but the current functionality is pretty useful anyway.
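Condition 1 is easy to check before resubmitting. The function below is a minimal sketch, not part of Proteus: the name can_hot_start is hypothetical, and it only tests that at least one file matching each pattern exists in the given directory. It does not verify that the archive contents are valid.

```shell
# Hypothetical helper (not a Proteus command): succeeds only if the
# directory holds at least one *.xmf, one *.h5, and one mesh.* file.
can_hot_start() {
    hs_dir=$1
    for hs_pattern in '*.xmf' '*.h5' 'mesh.*'; do
        # Expand the glob; if nothing matches, the pattern stays literal
        # and the existence test below fails.
        set -- "$hs_dir"/$hs_pattern
        if [ ! -e "$1" ]; then
            echo "missing $hs_pattern in $hs_dir" >&2
            return 1
        fi
    done
    return 0
}
```

For example, `can_hot_start "$WORKDIR/plunging5.12345" && qsub garnet.pbs` would submit the job only when all three kinds of files are present.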
Here's how I usually use hot starts. First, take a look at a typical job script for the initial run:
#!/bin/bash
#PBS -A ERDCV00898R40
#PBS -l walltime=008:00:00
#PBS -l select=32:ncpus=32:mpiprocs=16
#PBS -q standard
#PBS -N plunging5
#PBS -j oe
#PBS -l application=proteus
#PBS -V
#PBS -m eba
#PBS -M [email protected]
source /opt/modules/default/etc/modules.sh
source /lustre/shared/projects/proteus/garnet.gnu.sh
cd $PBS_O_WORKDIR
mkdir $WORKDIR/plunging5.$PBS_JOBID
aprun -n 512 parun tank_so.py -l 3 -O ../petscOptions/petsc.options.asm -D $WORKDIR/plunging5.$PBS_JOBID
As you can see, this stores the outputs in $WORKDIR/plunging5.$PBS_JOBID. The walltime is set to 8 hours. Assuming my job was killed by the scheduler and PBS_JOBID=12345, I would then do the following to continue the run:
$ cp *.py *.pbs mesh.* ../petscOptions/petsc.options.asm $WORKDIR/plunging5.12345
$ cd $WORKDIR/plunging5.12345
$ emacs garnet.pbs
$ qsub garnet.pbs
where I edit the PBS script to look like this:
#!/bin/bash
#PBS -A ERDCV00898R40
#PBS -l walltime=008:00:00
#PBS -l select=32:ncpus=32:mpiprocs=16
#PBS -q standard
#PBS -N plunging5
#PBS -j oe
#PBS -l application=proteus
#PBS -V
#PBS -m eba
#PBS -M [email protected]
source /opt/modules/default/etc/modules.sh
source /lustre/shared/projects/proteus/garnet.gnu.sh
cd $PBS_O_WORKDIR
aprun -n 512 parun tank_so.py -l 3 -O petsc.options.asm -H
Notice that I've deleted the line that creates a new data directory, replaced the -D option with -H, and pointed -O at the copy of petsc.options.asm in the run directory.
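If you do this often, the same edits can be scripted instead of made by hand in an editor. The function below is a sketch under assumptions: make_hot_start_script is a hypothetical name, and it expects the job script to follow the layout shown above (a line beginning with mkdir, and -D followed by a single space-free path on the parun line). Check the output before submitting it.

```shell
# Hypothetical helper: print a hot-start version of a job script on stdout.
# $1 is the original job script; the file itself is not modified.
make_hot_start_script() {
    # 1) drop the line that creates a new data directory
    # 2) on the parun line, strip the directory part of the -O path,
    #    since the options file was copied into the run directory
    # 3) on the parun line, replace "-D <dir>" with "-H"
    sed -e '/^mkdir /d' \
        -e '/parun/s| -O [^ ]*/| -O |' \
        -e '/parun/s| -D [^ ]*| -H|' "$1"
}
```

One way to use it: `make_hot_start_script garnet.pbs > garnet_hot.pbs`, then compare the two files with diff and submit the new one with qsub.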
Assuming that the XDMF and mesh files are valid, this job will now use the last time step in the XDMF archive as initial data and continue time stepping until it finishes the time interval specified in the so-file.