Preemptible compute and fault-tolerant jobs #24
-
This topic was debated during our first roundtable on 2022-09-07.
Summary
Restarts are useful, albeit limited
There are some failure modes (e.g. node failure, stochastic data, network dependency, preemption) in which a restart would be useful. However, the most common failure modes (e.g. runtime exception, OOMKilled) are deterministic, so there would be no point in restarting the job.
Should it be possible to describe to
Output directory must be cleaned by
-
Abstract
pman should support working around scheduler preemption and retrying jobs on runtime failure: see FNNDSC/pman#208
We propose a new property to be added to the ChRIS plugin spec: "fault-tolerant" is a boolean which indicates whether a plugin instance can be safely restarted if it terminates unexpectedly, without cleaning its output directory.
Motivation
In shared-resource cloud computing, providers may offer "preemptible" environments. See here: https://cloud.google.com/compute/docs/instances/preemptible
In summary, preemptible environments are usually cheaper. BCH E2 offers the preemptible partitions bch-computer-pe and bch-gpu-pe, which are scheduled faster.
ChRIS plugin instances can sometimes run for a long time (12h). These can take a while to get scheduled on SLURM.
Proposal
Support for preemptible environments
ChRIS can leverage the preemptible E2 partitions to improve average scheduling time and make better use of available resources on E2. Running analysis on E2 via ChRIS has a unique advantage over calling sbatch manually, because ChRIS could be developed to manage restarting of preempted jobs.
Moreover, the same logic to handle preemption on SLURM could be used to handle pod restarts and reschedules on Kubernetes.
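The restart logic described above could look roughly like the following sketch. It is not pman's implementation: the Outcome enum, the RETRYABLE set, and run_with_restarts are hypothetical names, and the split into transient versus deterministic failures follows the roundtable summary (preemption and node failure are worth retrying; runtime exceptions and OOM kills are not). The submit callback stands in for whatever actually launches the job, which is why the same loop could sit in front of either SLURM or Kubernetes.

```python
from enum import Enum
from typing import Callable, Tuple


class Outcome(Enum):
    """Hypothetical job outcomes; a real scheduler reports these differently."""
    SUCCEEDED = "succeeded"
    PREEMPTED = "preempted"          # scheduler reclaimed the resources
    NODE_FAILURE = "node_failure"    # the machine died under the job
    RUNTIME_ERROR = "runtime_error"  # e.g. an exception inside the plugin
    OOM_KILLED = "oom_killed"        # job exceeded its memory limit


# Assumption: only these outcomes are transient, so only these are retried.
RETRYABLE = {Outcome.PREEMPTED, Outcome.NODE_FAILURE}


def run_with_restarts(
    submit: Callable[[], Outcome], max_restarts: int
) -> Tuple[Outcome, int]:
    """Submit a job, resubmitting after transient failures.

    Returns the final outcome and the total number of submissions.
    The job is submitted at most max_restarts + 1 times.
    """
    submissions = 0
    while True:
        outcome = submit()
        submissions += 1
        # Stop on success, on a deterministic failure, or when restarts run out.
        if outcome not in RETRYABLE or submissions > max_restarts:
            return outcome, submissions
```

A deterministic failure such as OOM_KILLED returns after a single submission, so the loop never wastes scheduler time on a job that would fail the same way again.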
Indicating plugin fault tolerance
Most programs expect a "clean starting state": for instance, most ChRIS plugins expect their output directory to start off empty. However, some complicated pipeline-like programs support "resume-on-failure" modes of operation, meaning they are able to restore an intermediate state by looking at the output files from an interrupted execution. These kinds of programs are "fault-tolerant."
It'd be easiest to handle restarting of jobs for plugin instances by clearing the output directory, but doing so is wasteful in cases where the plugin is fault-tolerant.
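As a minimal sketch of that trade-off, the step that runs before a restart could branch on the plugin's fault-tolerance flag. The function name prepare_output_dir and its fault_tolerant parameter are hypothetical; how the flag would actually reach the code that manages the job is not specified in this proposal.

```python
import shutil
from pathlib import Path


def prepare_output_dir(output_dir: Path, fault_tolerant: bool) -> None:
    """Get the output directory ready before (re)starting a plugin instance.

    fault_tolerant is assumed to come from the plugin's description.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    if fault_tolerant:
        # The plugin can resume from its own partial outputs: keep them.
        return
    # Otherwise give the plugin the clean starting state it expects.
    for child in output_dir.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
```

The directory itself is kept and only its contents are removed, so any mount or permission setup on the directory survives the restart.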
The plugin JSON description spec should be updated to have a "fault_tolerance" field with the possible values being true or false.
Market Research
Terra supports using preemptible VMs on GCP. It is powered by Cromwell behind the scenes, which has these features:
preemptible: number of times to retry on a preemptible VM before retrying with a non-preemptible VM.
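The semantics of that setting, as described here, can be illustrated with a short sketch. This is not Cromwell's code: run_with_fallback and the run callback are hypothetical, and the sketch assumes that a job on a non-preemptible VM is never preempted and that other kinds of failure are out of scope.

```python
from typing import Callable, List


def run_with_fallback(run: Callable[[bool], bool], preemptible: int) -> List[str]:
    """Try a job on preemptible VMs first, then fall back.

    run(use_preemptible) is a hypothetical callback that returns True if
    the job finished and False if it was preempted.
    Returns the kinds of VM that were tried, in order.
    """
    tried: List[str] = []
    for _ in range(preemptible):
        tried.append("preemptible")
        if run(True):
            return tried
    # Preemptible attempts are exhausted (or preemptible == 0): use a regular VM.
    tried.append("non-preemptible")
    run(False)
    return tried
```

With preemptible set to 0 the job goes straight to a non-preemptible VM, which matches reading the value as a budget of cheap attempts rather than an on/off switch.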