Implement basic checkpointing #21
Comments
There are multiple ways to realize checkpointing with
This leads to the impression that only the basic checkpointing functionality (1, 2) should go into
Maybe people working with HPC, like @oschulz, could comment on this.
Init/terminate and restart strategy
Checkpointing must be integrated within
Enable Multilevel-Checkpointing
The
The first two levels should be realized in
Multilevel checkpointing mechanism
Checkpointing intervals
Detect and handle node failures
Supervisors looking after actors on other workers must detect node failures and handle them appropriately.
Tasks on other nodes must have a supervisor actor on their node. That one can be supervised by a remote supervisor as described above.
This is not a good strategy to work with! The consequence of that requirement would be to have two supervisor levels where we need only one. Therefore:
Introduce an actor for node supervision
If a supervisor gets a child on another node, it starts another helper child, responsible for regularly scanning the connections to foreign actors. If a connection gives an
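The helper-child idea can be sketched in plain Julia without committing to any library API. Everything named here (`failed_connections`, `scan_connections`, the `probes` dictionary, the `on_failure` callback) is hypothetical and only illustrates the scanning loop. A probe is assumed to be a zero-argument function that throws when the connection to a foreign actor is broken. In a `Distributed` setting that might be a `remotecall_fetch` which fails once the worker is gone, but that is an assumption, not something this thread specifies.

```julia
# Hypothetical helper: `probes` maps a child name to a zero-argument function
# that throws if the connection to that child is broken.
function failed_connections(probes::AbstractDict)
    failed = Symbol[]
    for (name, probe) in probes
        try
            probe()
        catch
            push!(failed, name)   # any exception counts as a failed connection
        end
    end
    return sort!(failed)
end

# Scan `rounds` times, pausing `interval` seconds between scans, and report
# each failed child once to `on_failure` (a stand-in for notifying the supervisor).
function scan_connections(probes, on_failure; interval=0.01, rounds=3)
    reported = Set{Symbol}()
    for _ in 1:rounds
        for name in failed_connections(probes)
            if !(name in reported)
                push!(reported, name)
                on_failure(name)
            end
        end
        sleep(interval)
    end
    return reported
end

# Usage: one healthy child and one whose probe always fails.
probes = Dict(:local_child  => () -> true,
              :remote_child => () -> error("connection lost"))
notified = Symbol[]
scan_connections(probes, name -> push!(notified, name))
```

Reporting each failure only once keeps the supervisor from being flooded while the helper keeps scanning. A real helper would presumably run until stopped rather than for a fixed number of rounds.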
test for remote_failure.jl, see issue #21
Possibly useful to this feature: the forthcoming Julia 1.7 includes the ability to migrate tasks between threads (JuliaLang/julia#40715). I don't understand the code of the implementation here, but it seems like it enables the thread-local storage to be tracked with each Task? Regardless, this change may be of use for checkpointing or actions taken by supervisors.
Now with basic error handling (see issue #16 and description in the manual) there is still an issue of maintaining/saving and restoring actor state at termination and restart.
For actor and task restart (by supervisors), `checkpoint` and `restore` is an important option. Thus actor state can be restored at restart.
- Actor initialization and termination with user defined callback functions
  - `init!` functionality,
  - `term!` functionality,
- User-defined checkpointing:
  - `checkpointing` actor,
  - `checkpoint` call,
  - `restore` call,
- Integration
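To make the intended lifecycle concrete, here is a minimal sketch in plain Julia of how init and termination callbacks could combine with saving and restoring state. It uses only the `Serialization` standard library. The names `CounterState`, `save_checkpoint`, `load_checkpoint`, `init_state` and `term_state` are hypothetical stand-ins chosen for illustration. They are not the package's `checkpoint`, `restore`, `init!` or `term!`, whose signatures are not given in this issue.

```julia
using Serialization

# Hypothetical actor state; not part of any library API.
mutable struct CounterState
    count::Int
end

# Save state to disk. Write to a temporary file first and then rename it,
# so a crash mid-write does not leave a truncated checkpoint behind.
function save_checkpoint(path::AbstractString, state)
    tmp = path * ".tmp"
    open(tmp, "w") do io
        serialize(io, state)
    end
    mv(tmp, path; force=true)
    return path
end

# Load the most recently saved state.
load_checkpoint(path::AbstractString) = open(deserialize, path)

# Hypothetical init callback: restore from the checkpoint if one exists,
# otherwise start with fresh state.
init_state(path::AbstractString) =
    isfile(path) ? load_checkpoint(path) : CounterState(0)

# Hypothetical termination callback: take a final checkpoint.
term_state(path::AbstractString, state) = save_checkpoint(path, state)

# Usage: run, terminate, then simulate a restart by a supervisor.
path = joinpath(mktempdir(), "counter.ckpt")
s = init_state(path)     # no checkpoint yet, so fresh state
s.count += 3
term_state(path, s)      # final checkpoint at termination
s2 = init_state(path)    # restart picks up the saved state
```

Routing both the first start and every restart through the same init callback means the restart path needs no special case: the presence of a checkpoint file decides whether state is restored.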