Implement basic checkpointing #21

Open · 8 of 11 tasks
pbayer opened this issue Feb 13, 2021 · 8 comments

pbayer commented Feb 13, 2021

With basic error handling now in place (see issue #16 and the description in the manual), there remains the issue of maintaining, saving, and restoring actor state at termination and restart.

For actor and task restart (by supervisors), checkpoint and restore is an important option: it allows actor state to be restored on restart.

Actor initialization and termination with user-defined callback functions

  • develop init! functionality,
  • develop term! functionality,
  • implement the restart strategy described below.

User-defined checkpointing:

  • basic checkpointing actor (a minimal sketch follows this list),
  • checkpoint call,
  • restore call,
  • checkpointing hierarchy,
  • checkpointing interval for 2nd level.
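
A minimal sketch of what the checkpoint and restore calls could look like, using a plain Julia Task and Channel to stand in for an actor; the names `checkpointing`, `checkpoint!`, and `restore` are assumptions for illustration, not the library's API:

```julia
# Hypothetical sketch: a basic checkpointing "actor" as a plain Julia task.
# checkpoint! stores a value under a key, restore fetches it back.
struct Checkpoint
    key::Symbol
    value::Any
end

function checkpointing()
    ch = Channel{Any}(32)
    @async begin
        store = Dict{Symbol,Any}()                     # 1st-level, in-memory storage
        for msg in ch
            if msg isa Checkpoint
                store[msg.key] = msg.value             # take a checkpoint
            elseif msg isa Tuple{Symbol,Channel}
                key, reply = msg
                put!(reply, get(store, key, nothing))  # answer a restore request
            end
        end
    end
    return ch
end

checkpoint!(cp, key, value) = put!(cp, Checkpoint(key, value))

function restore(cp, key)
    reply = Channel{Any}(1)
    put!(cp, (key, reply))
    return take!(reply)
end

# usage: cp = checkpointing(); checkpoint!(cp, :count, 42); restore(cp, :count) == 42
```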

Integration

  • tests,
  • documentation,
  • examples

pbayer commented Feb 15, 2021

There are multiple ways to realize checkpointing with Actors:

  1. explicit, user-defined checkpointing based on the functionality mentioned above,
  2. use checkpoints in init! and term! callback functions,
  3. saving intermediate results to a guarded variable, and the :guard actor does the checkpointing,
  4. checkpoints could be served by a :genserver actor,
  5. function memoization saved to a guarded/checkpointed Dict (see the sketch below),
  6. ...
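
For item 5, a minimal sketch of memoizing results into a Dict that a :guard or checkpointing actor could then persist; `memoized` and `results` are hypothetical names, not existing API:

```julia
# Hypothetical sketch of item 5: memoize expensive results into a Dict
# that a guarding/checkpointing actor would save and restore.
const results = Dict{Tuple,Any}()        # the dict to be guarded/checkpointed

function memoized(f, args...)
    get!(results, (f, args...)) do       # compute only on a cache miss
        f(args...)
    end
end

# memoized(sum, 1:10^6) computes once; later identical calls hit the cache,
# and `results` is the state handed to the checkpointing actor.
```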

This suggests that only the basic checkpointing functionality (1, 2) should go into Actors, while the more sophisticated approaches should be realized in a framework library like Checkpoints.jl.

Maybe people working with HPC like @oschulz could comment on this.

pbayer added a commit that referenced this issue Feb 16, 2021

pbayer commented Feb 17, 2021

Init/terminate and restart strategy

Checkpointing must be integrated with the init, term, and restart callbacks:

  1. A user can provide a termination callback taking a checkpoint of an actor's acquaintance variables upon abnormal actor exit.
  2. A user can provide an init and/or restart callback to look for a checkpoint and, if one exists, restore it before actor start.
  3. Then a monitor can be specified to call an init or restart action on actor termination.
  4. Supervisors should do the following for actor restart:
    1. if specified, call a restart function (must return a link to a started actor),
    2. else if specified, call an init function (must return a link to a started actor),
    3. else spawn an actor with the last behavior.

In any case, to enable point 4, a failed actor's variables must be delivered to the supervisor on abnormal exit (already done).
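
A sketch of the restart sequence from point 4 above, in plain Julia; the `child` fields and the `spawn` call are assumptions about what a supervisor stores for each child, not the implemented API:

```julia
# Hypothetical sketch of the supervisor's restart decision (point 4 above).
# `child` is assumed to carry optional restart/init callbacks and the last behavior.
function restart_child(child)
    if child.restart !== nothing
        return child.restart()        # 4.1: restart callback returns a link to a started actor
    elseif child.init !== nothing
        return child.init()           # 4.2: init callback returns a link to a started actor
    else
        return spawn(child.behavior)  # 4.3: respawn an actor with the last behavior
    end
end
```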

pbayer added a commit that referenced this issue Feb 19, 2021

pbayer commented Feb 22, 2021

Enable multilevel checkpointing

In-memory checkpointing has been demonstrated to be fast and scalable. In addition, multilevel checkpointing technologies ... are increasingly popular. Multilevel checkpointing involves combining several storage technologies (multiple levels) to store the checkpoint, each level offering specific overhead and reliability trade-offs. The checkpoint is first stored in the highest performance storage technology, which generally is the local memory or a local SSD. This level supports process failure but not node failure. The second level stores the checkpoint in remote memory or a remote SSD. This second level supports a single-node failure. The third level corresponds to encoding the checkpoints in blocks and distributing the blocks across multiple nodes. This approach supports multinode failures. Generally, the last level of checkpointing is still the parallel file system. This last level supports catastrophic failures such as a full system outage. (see: D6.1 Preliminary Report on State of the Art Software-Based Fault Tolerance Techniques)

The checkpointing actor should support multilevel as a combination of user-level and application-level checkpointing. The concept is as follows:

  1. The user actor takes application-specific checkpoints to a checkpointing actor running on the same node (under the same supervisor). This usually is an in-memory checkpoint.
  2. Then a 2nd level checkpoint is taken at regular intervals by an aggregating checkpointing actor residing on another node. The aggregation is done from 1st-level or other 2nd-level actors.
  3. The third level is the organization of the checkpointing and restart/restore hierarchy.

The first two levels should be realized in Actors. The third level can then be realized in a separate framework library Checkpointing.jl.
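
A rough sketch of the first two levels in plain Julia (Dicts and a task standing in for the actors involved); all names, including `take_checkpoint!` and `aggregate_every`, are assumptions for illustration:

```julia
# Hypothetical sketch: level 1 is a cheap local in-memory store written by the
# user actor; level 2 periodically collects a copy (in reality on another node).
const level1 = Dict{Symbol,Any}()        # 1st level: local, in-memory
const level2 = Dict{Symbol,Any}()        # 2nd level: aggregated/remote copy
const lk = ReentrantLock()

function take_checkpoint!(key, value)
    lock(lk) do
        level1[key] = value              # frequent, inexpensive 1st-level checkpoint
    end
end

function aggregate_every(interval)
    @async while true
        sleep(interval)                  # 2nd-level checkpoint at a regular interval
        lock(lk) do
            merge!(level2, deepcopy(level1))  # collect the current 1st-level checkpoints
        end
    end
end
```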


pbayer commented Feb 24, 2021

Multilevel checkpointing mechanism

  1. 1st level checkpoints must be inexpensive (in-memory). Thus they can be taken frequently. ¹
  2. 2nd level checkpoints can consist of arbitrary hierarchy levels.
    1. They are usually triggered and/or saved at regular intervals (can be parametrized for each actor).
    2. When they are triggered, they collect the current checkpoints from all inferior checkpointing actors.
    3. 2nd level checkpoints and save routines can also be triggered from inferior actors. In that case such requests are propagated upwards in the checkpointing hierarchy until the required level is reached.
    4. Therefore the checkpoint function must have a level argument.
  3. The restore function must also have a level argument. Restoring is then done from the chosen level.

—————-
¹ 1st level checkpoints are triggered by a user actor. Automatic checkpointing at the 1st level must wait until Julia Atomics is ready.
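
A sketch of what a level-aware checkpoint/restore interface could look like; the function names, the message tuples, and the use of a Channel are assumptions for illustration only:

```julia
# Hypothetical level-aware interface; requests above level 1 are meant to be
# propagated upwards in the checkpointing hierarchy until `level` is reached.
function checkpoint(cp::Channel, key, value; level::Int = 1)
    put!(cp, (:checkpoint, key, value, level))
end

function restore(cp::Channel, key; level::Int = 1)
    reply = Channel{Any}(1)
    put!(cp, (:restore, key, level, reply))   # restore from the chosen level
    return take!(reply)
end
```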


pbayer commented Feb 25, 2021

Checkpointing intervals

Update and save intervals must be configurable, and it must also be possible to turn automatic checkpointing on and off.
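
A minimal sketch of such a configuration; the struct and field names are hypothetical:

```julia
# Hypothetical configuration for automatic checkpointing.
Base.@kwdef mutable struct CheckpointConfig
    update_interval::Float64 = 10.0   # seconds between checkpoint updates (collecting)
    save_interval::Float64   = 60.0   # seconds between saves to slower storage
    auto::Bool               = true   # turn automatic checkpointing on or off
end

cfg = CheckpointConfig(update_interval = 5.0, auto = false)
```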

pbayer added a commit that referenced this issue Feb 28, 2021

pbayer commented Mar 1, 2021

Detect and handle node failures

Supervisors looking after actors on other workers must detect node failures and handle them appropriately.

  1. They start a task that
    1. polls isready on the RemoteChannel of the supervised actor and
    2. sends a ProcessExitedException as an Exit message to the supervisor.
  2. They treat those exceptions differently from actor Exits since they don't contain actor state.

Tasks on other nodes must have a supervisor actor on their node. That one can be supervised by a remote supervisor as described above.
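
A sketch of the polling task described above, in plain Julia Distributed code; `notify_supervisor` stands in for sending the Exit message and is an assumption, not existing API:

```julia
using Distributed

# Hypothetical sketch of the failure-detection task: poll isready on the
# supervised actor's RemoteChannel and report a ProcessExitedException.
function watch_remote(rch::RemoteChannel, notify_supervisor; interval = 1.0)
    @async while true
        try
            isready(rch)                 # throws if the hosting worker has exited
        catch err
            if err isa ProcessExitedException
                notify_supervisor(err)   # delivered as an Exit message (carries no actor state)
                break
            end
        end
        sleep(interval)
    end
end
```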


pbayer commented Mar 4, 2021

Quoting from the previous comment: "Tasks on other nodes must have a supervisor actor on their node. That one can be supervised by a remote supervisor as described above."

This is not a good strategy to work with! The consequence of that requirement would be to have two supervisor levels where we need only one. Therefore:

Introduce an actor for node supervision

If a supervisor gets a child on another node, it starts an additional helper child responsible for regularly scanning the connections to foreign actors.

If a connection throws a ProcessExitedException, the helper informs the supervisor about the failed PID so that it can act accordingly and restart the foreign actors on another node. The helper must send only one Exit message to the supervisor per failed PID, even if that PID hosts multiple child actors.
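
A sketch of the helper's scan loop with the one-Exit-per-PID rule; `children` (a collection of (pid, RemoteChannel) pairs) and `notify_supervisor` are hypothetical stand-ins for the supervisor's bookkeeping and message protocol:

```julia
using Distributed

# Hypothetical sketch: regularly scan connections to foreign actors and report
# each failed PID to the supervisor only once, even if it hosts several children.
function scan_connections(children, notify_supervisor; interval = 1.0)
    reported = Set{Int}()                    # PIDs already reported as failed
    @async while true
        for (pid, rch) in children           # one entry per supervised remote actor
            pid in reported && continue
            try
                isready(rch)
            catch err
                if err isa ProcessExitedException
                    push!(reported, pid)     # ensures only one Exit per failed PID
                    notify_supervisor(pid, err)
                end
            end
        end
        sleep(interval)
    end
end
```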

pbayer added a commit that referenced this issue Mar 25, 2021
pbayer added a commit that referenced this issue Apr 1, 2021
test for remote_failure.jl, see issue #21
@caleb-allen commented

Possibly useful for this feature: the forthcoming Julia 1.7 includes the ability to migrate tasks between threads (JuliaLang/julia#40715).

I don't understand the code of the implementation there, but it seems like it enables thread-local storage to be tracked with each Task? Regardless, this change may be of use for checkpointing or for actions taken by supervisors.
