Framework for restoring state after failover for stateful applications #5006

Open
Dyex719 opened this issue May 30, 2024 · 0 comments
Labels
kind/question Indicates an issue that is a support question.

Dyex719 commented May 30, 2024

Please provide an in-depth description of the question you have:
We are using Karmada's failover and scheduling features to migrate our Flink jobs from one cluster to another during failover. Since Flink, like other stateful applications, needs to resume from the last processed state, we need to call a persistent storage system such as S3 or HDFS to restore the latest state.

Since restoring state from a persistent store will be a common requirement for all stateful applications, we want to know if the community is open to adding this as a feature in Karmada itself. This would save users from having to implement their own custom solutions with third-party tools.

To describe our use-case a little:
Flink stores snapshots of its state (checkpoints) in persistent storage. This state includes the source offsets to resume from as well as any intermediate state of the Flink job graph. To restore from a particular checkpoint, its path must be added to the job manifest. The Flink operator then reads this checkpoint path, calls the persistent store, and fetches the latest state metadata. The job then resumes from that state.
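
To make the restore step above concrete, here is a minimal Python sketch of what would have to happen before the job is recreated on the target cluster. It assumes the store has already been listed into plain object keys, that completed checkpoints follow Flink's `chk-<n>/_metadata` layout, and that the job is a `FlinkDeployment` whose restore path goes in `spec.job.initialSavepointPath`. The bucket name, the sample keys, and the helper names `latest_checkpoint` and `with_restore_path` are hypothetical and only for illustration.

```python
import copy
import re

# Assumption: completed Flink checkpoints live in chk-<n>/ and contain a _metadata file.
_CHK = re.compile(r"^(?P<dir>.*/chk-(?P<n>\d+))/_metadata$")


def latest_checkpoint(keys):
    """Return the directory of the highest-numbered completed checkpoint, or None."""
    best = None
    for key in keys:
        m = _CHK.match(key)
        if m and (best is None or int(m["n"]) > best[0]):
            best = (int(m["n"]), m["dir"])
    return best[1] if best else None


def with_restore_path(manifest, checkpoint_dir):
    """Return a copy of the job manifest that restores from checkpoint_dir."""
    patched = copy.deepcopy(manifest)
    # Assumption: the manifest is a FlinkDeployment and the restore path
    # belongs in spec.job.initialSavepointPath.
    job = patched.setdefault("spec", {}).setdefault("job", {})
    job["initialSavepointPath"] = checkpoint_dir
    return patched


# Hypothetical listing of the checkpoint directory in S3.
keys = [
    "s3://example-bucket/checkpoints/job-a/chk-41/_metadata",
    "s3://example-bucket/checkpoints/job-a/chk-42/_metadata",
    "s3://example-bucket/checkpoints/job-a/chk-43/part-0",  # no _metadata: incomplete
]
manifest = {"kind": "FlinkDeployment", "spec": {"job": {"parallelism": 2}}}
patched = with_restore_path(manifest, latest_checkpoint(keys))
print(patched["spec"]["job"]["initialSavepointPath"])
# prints: s3://example-bucket/checkpoints/job-a/chk-42
```

Listing the store itself (S3, HDFS) is left out on purpose, since that part would differ per storage backend.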

To implement this generically, the scheduler could offer an option to restore state before the work item is created when the resource being scheduled is a stateful application.
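
One possible shape for such an option, sketched in Python purely for illustration (Karmada itself is written in Go, and nothing here is an existing Karmada API): a registry of per-kind restore hooks that runs just before the work item would be created. The names `RESTORE_HOOKS`, `restore_hook`, and `prepare_for_scheduling` are hypothetical, and the Flink hook uses a placeholder instead of a real lookup in the store.

```python
# Hypothetical registry: resource kind -> function that patches the manifest.
RESTORE_HOOKS = {}


def restore_hook(kind):
    """Register a state-restore function for one resource kind."""
    def register(fn):
        RESTORE_HOOKS[kind] = fn
        return fn
    return register


@restore_hook("FlinkDeployment")
def restore_flink(manifest):
    # Placeholder value: a real hook would query S3/HDFS for the latest checkpoint.
    spec = manifest.get("spec", {})
    job = {**spec.get("job", {}), "initialSavepointPath": "<latest checkpoint path>"}
    return {**manifest, "spec": {**spec, "job": job}}


def prepare_for_scheduling(manifest):
    """Run the restore hook, if one is registered, before the work item is created."""
    hook = RESTORE_HOOKS.get(manifest.get("kind"))
    return hook(manifest) if hook else manifest
```

Resources without a registered hook pass through unchanged, so stateless workloads would not be affected by the option.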

Environment:

  • Karmada version:
    v1.9.0
  • Kubernetes version:
    v1.29
Dyex719 changed the title from "Advice on how to restore state after failover for stateful applications" to "Framework for restoring state after failover for stateful applications" on May 30, 2024