Preemptible compute and fault-tolerant jobs #24

jennydaman · 2022-08-15T01:55:25Z

jennydaman
Aug 15, 2022
Maintainer

Abstract

pman should support working around scheduler preemption and retrying jobs on runtime failure: see FNNDSC/pman#208

We propose a new property to be added to the ChRIS plugin spec: "fault-tolerant" is a boolean which indicates whether a plugin instance can be safely restarted if it terminates unexpectedly, without cleaning its output directory.

Motivation

In shared-resource cloud computing, providers may offer "preemptible" environments. See here: https://cloud.google.com/compute/docs/instances/preemptible

In summary, preemptible environments are usually cheaper. BCH E2 offers the preemptible partitions bch-computer-pe and bch-gpu-pe, where jobs are scheduled faster.

ChRIS plugin instances can sometimes run for a long time (e.g. 12 hours). Such long jobs can take a while to get scheduled on SLURM.

Proposal

Support for preemptible environments

ChRIS can leverage the preemptible E2 partitions to improve average scheduling time and make better use of available resources on E2. Running analysis on E2 via ChRIS has a distinct advantage over calling sbatch manually, because ChRIS could be developed to manage the restarting of preempted jobs.

Moreover, the same logic to handle preemption on SLURM could be used to handle pod restarts and reschedules on Kubernetes.

Indicating plugin fault tolerance

Most programs expect a "clean starting state." For instance, most ChRIS plugins expect their output directory to start off empty. However, some complicated pipeline-like programs support "resume-on-failure" modes of operation, meaning they are able to restore an intermediate state by looking at the output files from an interrupted execution. These kinds of programs are "fault-tolerant."

It'd be easiest to handle restarting of jobs for plugin instances by clearing the output directory, but doing so is wasteful in cases where the plugin is fault-tolerant.

The plugin JSON description spec should be updated to have a "fault_tolerance" field with the possible values being true or false.
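As a rough illustration of how such a field might be consumed, here is a minimal Python sketch. It assumes the field is named "fault_tolerance" as proposed above and that a missing field means false. The function names are hypothetical and do not come from pman or any existing ChRIS code.

```python
import json


def is_fault_tolerant(plugin_json: str) -> bool:
    """Return True only if the description explicitly sets "fault_tolerance" to true.

    Assumption: a missing field or a non-boolean value is treated as false,
    which is the safe default (the output directory gets cleared).
    """
    description = json.loads(plugin_json)
    return description.get("fault_tolerance", False) is True


def must_clear_output_before_restart(plugin_json: str) -> bool:
    """Hypothetical helper: plugins that are not fault-tolerant need a clean output directory."""
    return not is_fault_tolerant(plugin_json)
```

With this default, existing plugin descriptions that never mention the field would keep the "clear before restart" behavior.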

Market Research

Terra supports using preemptible VMs on GCP. It is powered by Cromwell behind-the-scenes, which has these features:

jennydaman · 2022-09-08T08:04:20Z

jennydaman
Sep 8, 2022
Maintainer Author

This topic was debated during our first roundtable on 2022-09-07.

Summary

pman should support restarting of jobs 2–3 times.
outputdir/ should be cleared of files before each restart.
Do not extend the plugin spec to describe fault tolerance: a bigger spec is a more confusing spec.
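The first two points can be sketched in Python as follows. This is only a sketch under stated assumptions, not pman's implementation: `run_job` is a hypothetical callable that returns an exit code, a non-zero code counts as a failure, and the limit of 3 is taken from the upper end of "2–3 times."

```python
import shutil
from pathlib import Path
from typing import Callable

MAX_RESTARTS = 3  # assumption: upper end of the agreed "2-3 times"


def clear_directory(output_dir: Path) -> None:
    """Delete everything inside output_dir but keep the directory itself."""
    for child in output_dir.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()


def run_with_restarts(
    run_job: Callable[[], int],
    output_dir: Path,
    max_restarts: int = MAX_RESTARTS,
) -> int:
    """Run the job; after each failure, clear output_dir and restart.

    Gives up after max_restarts restarts and returns the last exit code.
    """
    exit_code = run_job()
    restarts = 0
    while exit_code != 0 and restarts < max_restarts:
        clear_directory(output_dir)  # every restart begins from a clean state
        restarts += 1
        exit_code = run_job()
    return exit_code
```

Note that the directory is cleared only before a restart, so the outputs of the final attempt (successful or not) are left in place for inspection.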

Restarts are useful, albeit limited

In some failure modes (e.g. node failure, stochastic data, network dependency, preemption) a restart would be useful. However, the most common failure modes (e.g. runtime exception, OOMKilled) are deterministic, so there would be no point in restarting the job.

Should it be possible to describe to pman under which failure modes a specific plugin should be restarted, and under which it should give up without a restart?
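One way to picture what such a description could look like is the Python sketch below. Everything here is hypothetical: the `FailureMode` names simply mirror the examples listed above, and neither the enum nor `should_restart` exists in pman. How pman would actually detect each failure mode is left open.

```python
from enum import Enum
from typing import FrozenSet


class FailureMode(Enum):
    # Names mirror the examples in the discussion; they are illustrative only.
    NODE_FAILURE = "node_failure"
    STOCHASTIC_DATA = "stochastic_data"
    NETWORK_DEPENDENCY = "network_dependency"
    PREEMPTION = "preemption"
    RUNTIME_EXCEPTION = "runtime_exception"
    OOM_KILLED = "oom_killed"


# Assumption: these modes may succeed on a second attempt.
TRANSIENT_MODES: FrozenSet[FailureMode] = frozenset({
    FailureMode.NODE_FAILURE,
    FailureMode.STOCHASTIC_DATA,
    FailureMode.NETWORK_DEPENDENCY,
    FailureMode.PREEMPTION,
})


def should_restart(
    mode: FailureMode,
    restarts_so_far: int,
    max_restarts: int = 3,
    restart_on: FrozenSet[FailureMode] = TRANSIENT_MODES,
) -> bool:
    """Restart only for the listed modes, and only while under the restart limit."""
    return mode in restart_on and restarts_so_far < max_restarts
```

A per-plugin policy would then amount to passing a different `restart_on` set, for example one that omits `STOCHASTIC_DATA` for a plugin whose inputs never vary.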

Output directory must be cleaned by `pman` before restarting a job.

Most programs fail if unexpected files from a previous failed attempt are present in the output directory.

Restart-friendly programs are rare. A simple spec is preferable.

Few developers implement mechanisms for their programs to recover from a dirty state and resume execution. Even when they do, the recovery mechanism might not be correct (meaning a recovered job could produce different outputs than if the job had never crashed in the first place).

We have agreed that programs which are able to "resume-on-failure" correctly are so exceptional that it is not worth supporting the feature in ChRIS. If we were to extend the spec so that it could describe "resume-on-failure," it would cause more confusion and code complexity than the problems it would solve. That is to say, simpler solutions can be better than efficient ones.
