Implement basic checkpointing #21

Open · 8 of 11 tasks
pbayer opened this issue Feb 13, 2021 · 8 comments

pbayer commented Feb 13, 2021

With basic error handling now in place (see issue #16 and the description in the manual), there remains the issue of maintaining, saving, and restoring actor state at termination and restart.

For actor and task restart (by supervisors), checkpoint and restore is an important option: it allows actor state to be restored on restart.

Actor initialization and termination with user-defined callback functions

  • develop init! functionality,
  • develop term! functionality,
  • implement the restart strategy described below.

User-defined checkpointing:

  • basic checkpointing actor (a minimal sketch follows this list),
  • checkpoint call,
  • restore call,
  • checkpointing hierarchy,
  • checkpointing interval for 2nd level.
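
A minimal sketch of what the checkpoint and restore calls could look like, using a plain Julia Task and Channel to stand in for an actor; the names `checkpointing`, `checkpoint!`, and `restore` are assumptions for illustration, not the library's API:

```julia
# Hypothetical sketch: a basic checkpointing "actor" as a plain Julia task.
# checkpoint! stores a value under a key, restore fetches it back.
struct Checkpoint
    key::Symbol
    value::Any
end

function checkpointing()
    ch = Channel{Any}(32)
    @async begin
        store = Dict{Symbol,Any}()                     # 1st-level, in-memory storage
        for msg in ch
            if msg isa Checkpoint
                store[msg.key] = msg.value             # take a checkpoint
            elseif msg isa Tuple{Symbol,Channel}
                key, reply = msg
                put!(reply, get(store, key, nothing))  # answer a restore request
            end
        end
    end
    return ch
end

checkpoint!(cp, key, value) = put!(cp, Checkpoint(key, value))

function restore(cp, key)
    reply = Channel{Any}(1)
    put!(cp, (key, reply))
    return take!(reply)
end

# usage: cp = checkpointing(); checkpoint!(cp, :count, 42); restore(cp, :count) == 42
```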

Integration

  • tests,
  • documentation,
  • examples

pbayer commented Feb 15, 2021

There are multiple ways to realize checkpointing with Actors:

  1. explicit, user-defined checkpointing based on the functionality mentioned above,
  2. use checkpoints in init! and term! callback functions,
  3. saving intermediate results to a guarded variable, and the :guard actor does the checkpointing,
  4. checkpoints could be served by a :genserver actor,
  5. function memoization saved to a guarded/checkpointed Dict (see the sketch below),
  6. ...
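
For item 5, a minimal sketch of memoizing results into a Dict that a :guard or checkpointing actor could then persist; `memoized` and `results` are hypothetical names, not existing API:

```julia
# Hypothetical sketch of item 5: memoize expensive results into a Dict
# that a guarding/checkpointing actor would save and restore.
const results = Dict{Tuple,Any}()        # the dict to be guarded/checkpointed

function memoized(f, args...)
    get!(results, (f, args...)) do       # compute only on a cache miss
        f(args...)
    end
end

# memoized(sum, 1:10^6) computes once; later identical calls hit the cache,
# and `results` is the state handed to the checkpointing actor.
```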

This suggests that only the basic checkpointing functionality (1, 2) should go into Actors, while the more sophisticated approaches should be realized in a framework library like Checkpoints.jl.

Maybe people working with HPC like @oschulz could comment on this.

pbayer added a commit that referenced this issue Feb 16, 2021

pbayer commented Feb 17, 2021

Init/terminate and restart strategy

Checkpointing must be integrated with the init, term, and restart callbacks:

  1. A user can provide a termination callback taking a checkpoint of an actor's acquaintance variables upon abnormal actor exit.
  2. A user can provide an init and/or restart callback to look for a checkpoint and, if one exists, restore it before actor start.
  3. Then a monitor can be specified to call an init or restart action on actor termination.
  4. Supervisors should do the following for actor restart:
    1. if specified, call a restart function (must return a link to a started actor),
    2. else if specified, call an init function (must return a link to a started actor),
    3. else spawn an actor with the last behavior.

In any case, to enable point 4, a failed actor's variables must be delivered to the supervisor on abnormal exit (already done).
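
A sketch of the restart sequence from point 4 above, in plain Julia; the `child` fields and the `spawn` call are assumptions about what a supervisor stores for each child, not the implemented API:

```julia
# Hypothetical sketch of the supervisor's restart decision (point 4 above).
# `child` is assumed to carry optional restart/init callbacks and the last behavior.
function restart_child(child)
    if child.restart !== nothing
        return child.restart()        # 4.1: restart callback returns a link to a started actor
    elseif child.init !== nothing
        return child.init()           # 4.2: init callback returns a link to a started actor
    else
        return spawn(child.behavior)  # 4.3: respawn an actor with the last behavior
    end
end
```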

pbayer added a commit that referenced this issue Feb 19, 2021

pbayer commented Feb 22, 2021

Enable multilevel checkpointing

In-memory checkpointing has been demonstrated to be fast and scalable. In addition, multilevel checkpointing technologies ... are increasingly popular. Multilevel checkpointing involves combining several storage technologies (multiple levels) to store the checkpoint, each level offering specific overhead and reliability trade-offs. The checkpoint is first stored in the highest performance storage technology, which generally is the local memory or a local SSD. This level supports process failure but not node failure. The second level stores the checkpoint in remote memory or a remote SSD. This second level supports a single-node failure. The third level corresponds to encoding the checkpoints in blocks and distributing the blocks across multiple nodes. This approach supports multinode failures. Generally, the last level of checkpointing is still the parallel file system. This last level supports catastrophic failures such as a full system outage. (see: D6.1 Preliminary Report on State of the Art Software-Based Fault Tolerance Techniques)

The checkpointing actor should support multilevel as a combination of user-level and application-level checkpointing. The concept is as follows:

  1. The user actor takes application-specific checkpoints to a checkpointing actor running on the same node (under the same supervisor). This usually is an in-memory checkpoint.
  2. Then a 2nd level checkpoint is taken at regular intervals by an aggregating checkpointing actor residing on another node. The aggregation is done from 1st-level or other 2nd-level actors.
  3. The third level is the organization of the checkpointing and restart/restore hierarchy.

The first two levels should be realized in Actors. The third level can then be realized in a separate framework library Checkpointing.jl.
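
A rough sketch of the first two levels in plain Julia (Dicts and a task standing in for the actors involved); all names, including `take_checkpoint!` and `aggregate_every`, are assumptions for illustration:

```julia
# Hypothetical sketch: level 1 is a cheap local in-memory store written by the
# user actor; level 2 periodically collects a copy (in reality on another node).
const level1 = Dict{Symbol,Any}()        # 1st level: local, in-memory
const level2 = Dict{Symbol,Any}()        # 2nd level: aggregated/remote copy
const lk = ReentrantLock()

function take_checkpoint!(key, value)
    lock(lk) do
        level1[key] = value              # frequent, inexpensive 1st-level checkpoint
    end
end

function aggregate_every(interval)
    @async while true
        sleep(interval)                  # 2nd-level checkpoint at a regular interval
        lock(lk) do
            merge!(level2, deepcopy(level1))  # collect the current 1st-level checkpoints
        end
    end
end
```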


pbayer commented Feb 24, 2021

Multilevel checkpointing mechanism

  1. 1st level checkpoints must be inexpensive (in-memory). Thus they can be taken frequently. ¹
  2. 2nd level checkpoints can consist of arbitrary hierarchy levels.
    1. They are usually triggered and/or saved at regular intervals (can be parametrized for each actor).
    2. When they are triggered, they collect the current checkpoints from all inferior checkpointing actors.
    3. 2nd level checkpoints and save routines can also be triggered from inferior actors. In that case such requests are propagated upwards in the checkpointing hierarchy until the required level is reached.
    4. Therefore the checkpoint function must have a level argument.
  3. The restore function must also have a level argument. Restoring is then done from the chosen level.

—————-
¹ 1st level checkpoints are triggered by a user actor. Automatic checkpointing at the 1st level must wait until Julia Atomics is ready.
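
A sketch of what a level-aware checkpoint/restore interface could look like; the function names, the message tuples, and the use of a Channel are assumptions for illustration only:

```julia
# Hypothetical level-aware interface; requests above level 1 are meant to be
# propagated upwards in the checkpointing hierarchy until `level` is reached.
function checkpoint(cp::Channel, key, value; level::Int = 1)
    put!(cp, (:checkpoint, key, value, level))
end

function restore(cp::Channel, key; level::Int = 1)
    reply = Channel{Any}(1)
    put!(cp, (:restore, key, level, reply))   # restore from the chosen level
    return take!(reply)
end
```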


pbayer commented Feb 25, 2021

Checkpointing intervals

Update and save intervals must be configurable, and it must also be possible to turn automatic checkpointing on and off.
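
A minimal sketch of such a configuration; the struct and field names are hypothetical:

```julia
# Hypothetical configuration for automatic checkpointing.
Base.@kwdef mutable struct CheckpointConfig
    update_interval::Float64 = 10.0   # seconds between checkpoint updates (collecting)
    save_interval::Float64   = 60.0   # seconds between saves to slower storage
    auto::Bool               = true   # turn automatic checkpointing on or off
end

cfg = CheckpointConfig(update_interval = 5.0, auto = false)
```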

pbayer added a commit that referenced this issue Feb 28, 2021

pbayer commented Mar 1, 2021

Detect and handle node failures

Supervisors looking after actors on other workers must detect node failures and handle them appropriately.

  1. They start a task that
    1. polls isready on the RemoteChannel of the supervised actor and
    2. sends a ProcessExitedException as an Exit message to the supervisor.
  2. They treat those exceptions differently from actor Exits since they don't contain actor state.

Tasks on other nodes must have a supervisor actor on their node. That one can be supervised by a remote supervisor as described above.
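
A sketch of the polling task described above, in plain Julia Distributed code; `notify_supervisor` stands in for sending the Exit message and is an assumption, not existing API:

```julia
using Distributed

# Hypothetical sketch of the failure-detection task: poll isready on the
# supervised actor's RemoteChannel and report a ProcessExitedException.
function watch_remote(rch::RemoteChannel, notify_supervisor; interval = 1.0)
    @async while true
        try
            isready(rch)                 # throws if the hosting worker has exited
        catch err
            if err isa ProcessExitedException
                notify_supervisor(err)   # delivered as an Exit message (carries no actor state)
                break
            end
        end
        sleep(interval)
    end
end
```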


pbayer commented Mar 4, 2021

Quoting from the previous comment: "Tasks on other nodes must have a supervisor actor on their node. That one can be supervised by a remote supervisor as described above."

This is not a good strategy to work with! The consequence of that requirement would be to have two supervisor levels where we need only one. Therefore:

Introduce an actor for node supervision

If a supervisor gets a child on another node, it starts an additional helper child responsible for regularly scanning the connections to foreign actors.

If a connection throws a ProcessExitedException, the helper informs the supervisor about the failed PID so that it can act accordingly and restart the foreign actors on another node. The helper must send only one Exit message to the supervisor per failed PID, even if that PID hosts multiple child actors.
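
A sketch of the helper's scan loop with the one-Exit-per-PID rule; `children` (a collection of (pid, RemoteChannel) pairs) and `notify_supervisor` are hypothetical stand-ins for the supervisor's bookkeeping and message protocol:

```julia
using Distributed

# Hypothetical sketch: regularly scan connections to foreign actors and report
# each failed PID to the supervisor only once, even if it hosts several children.
function scan_connections(children, notify_supervisor; interval = 1.0)
    reported = Set{Int}()                    # PIDs already reported as failed
    @async while true
        for (pid, rch) in children           # one entry per supervised remote actor
            pid in reported && continue
            try
                isready(rch)
            catch err
                if err isa ProcessExitedException
                    push!(reported, pid)     # ensures only one Exit per failed PID
                    notify_supervisor(pid, err)
                end
            end
        end
        sleep(interval)
    end
end
```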

pbayer added a commit that referenced this issue Mar 25, 2021
pbayer added a commit that referenced this issue Apr 1, 2021
test for remote_failure.jl, see issue #21
@caleb-allen commented

Possibly useful for this feature: the forthcoming Julia 1.7 includes the ability to migrate tasks between threads (JuliaLang/julia#40715).

I don't understand the code of the implementation there, but it seems like it enables thread-local storage to be tracked with each Task? Regardless, this change may be of use for checkpointing or for actions taken by supervisors.
