How To Start a Run From Existing Data Files ("Hot starting")

We often want to run a simulation for longer than originally planned, or for longer than a single job allows. For example, you may have submitted a time-limited job to an HPC resource, and the job scheduler killed the run when it reached the job time limit. Limited support for continuing such a run is implemented in the '-H' or '--hotStart' option to parun:

$ parun -h
Usage: parun [options] soModule.py [[soFile.sso] [pModule.py nModule.py]]

Options:
  -h, --help            show this help message and exit
...output truncated...
  -H, --hotStart        Use the last step in the archive as the initial
                        condition and continue appending to the
                        archive

The preconditions for running with this option are that 1) you have previously run the simulation, so you have the .xmf/.h5 and mesh.* files for it, and 2) you wish to continue from the last time step in the XDMF archive using the same number of processors. We should be able to relax condition 2 eventually, but the current functionality is already useful.
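Condition 1 can be checked before submitting the job. The following is a minimal sketch and not part of Proteus: `check_hotstart_dir` is a hypothetical helper, and the glob patterns are assumptions based only on the file names mentioned above. It does not check condition 2, so the processor count still has to be confirmed by hand.

```shell
#!/bin/bash
# Hypothetical helper (not a Proteus command): report whether a directory
# contains the kinds of files a hot start needs.
# Assumption: archive files match *.xmf and *.h5, mesh files match mesh.*
check_hotstart_dir() {
    local dir="${1:-.}"
    local missing=0
    local pattern
    for pattern in "*.xmf" "*.h5" "mesh.*"; do
        # compgen -G succeeds only if the glob matches at least one path
        if ! compgen -G "$dir/$pattern" > /dev/null; then
            echo "missing: $dir/$pattern" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_hotstart_dir "$WORKDIR/plunging5.12345" && qsub garnet.pbs` would submit the job only if all three kinds of files are present.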

Here's how I usually use hot starts. First, take a look at a typical job script for the initial run:

#!/bin/bash
#PBS -A ERDCV00898R40
#PBS -l walltime=008:00:00
#PBS -l select=32:ncpus=32:mpiprocs=16
#PBS -q standard
#PBS -N plunging5
#PBS -j oe
#PBS -l application=proteus
#PBS -V
#PBS -m eba
#PBS -M [email protected]
source /opt/modules/default/etc/modules.sh
source /lustre/shared/projects/proteus/garnet.gnu.sh
cd $PBS_O_WORKDIR
mkdir $WORKDIR/plunging5.$PBS_JOBID
aprun -n 512  parun tank_so.py -l 3 -O ../petscOptions/petsc.options.asm -D $WORKDIR/plunging5.$PBS_JOBID

As you can see, this stores the outputs in $WORKDIR/plunging5.$PBS_JOBID. The walltime is set to 8 hours. Assuming my job was killed by the scheduler and PBS_JOBID=12345, I would then do the following to continue the run:

$ cp *.py *.pbs mesh.* ../petscOptions/petsc.options.asm $WORKDIR/plunging5.12345
$ cd $WORKDIR/plunging5.12345
$ emacs garnet.pbs
$ qsub garnet.pbs

where I edit the PBS script to look like this:

#!/bin/bash
#PBS -A ERDCV00898R40
#PBS -l walltime=008:00:00
#PBS -l select=32:ncpus=32:mpiprocs=16
#PBS -q standard
#PBS -N plunging5
#PBS -j oe
#PBS -l application=proteus
#PBS -V
#PBS -m eba
#PBS -M [email protected]
source /opt/modules/default/etc/modules.sh
source /lustre/shared/projects/proteus/garnet.gnu.sh
cd $PBS_O_WORKDIR
aprun -n 512  parun tank_so.py -l 3 -O petsc.options.asm -H

Notice that I've deleted the line that creates a new data directory and replaced the -D option with -H. The -O path is also shorter, because petsc.options.asm was copied into the run directory.
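The same edits can be scripted instead of made by hand in an editor. This is only a sketch under assumptions: `make_hotstart_script` is a hypothetical helper, and the sed expressions assume the job script has the same shape as the one above (a line starting with `mkdir`, and `-D` and `-O` each followed by a single path with no spaces).

```shell
#!/bin/bash
# Hypothetical helper: derive a hot-start job script from the original one.
# $1 = original job script, $2 = new job script to write
make_hotstart_script() {
    sed -e '/^mkdir /d' \
        -e 's| -D [^ ]*| -H|' \
        -e 's|\(-O \)[^ ]*/\([^ /]*\)|\1\2|' \
        "$1" > "$2"
    # 1st expression: drop the line that creates a new data directory
    # 2nd expression: replace "-D <dir>" with "-H"
    # 3rd expression: strip the directory part of the -O argument
}
```

Review the generated script before submitting it, since any other line that happens to match these patterns would be changed as well.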

Assuming that the XDMF and mesh files are valid, this job will now use the last time step in the XDMF archive as initial data and continue time stepping until it finishes the time interval specified in the so-file.
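Before resubmitting, it can help to see which time the run will continue from. The sketch below is an assumption-laden convenience, not a Proteus tool: `last_xdmf_time` is a hypothetical helper, and it assumes the .xmf file records each step with a standard XDMF `<Time Value="..."/>` element.

```shell
#!/bin/bash
# Hypothetical helper: print the Value of the last <Time .../> element in an
# XDMF file. Assumes each time step is written as <Time Value="..." ... />.
last_xdmf_time() {
    grep -o '<Time[^>]*Value="[^"]*"' "$1" \
        | tail -n 1 \
        | sed 's/.*Value="\([^"]*\)".*/\1/'
}
```

Calling `last_xdmf_time` on the archive's .xmf file prints the last recorded time, which can be compared with the end time in the so-file to see how much of the interval remains.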
