Framework for restoring state after failover for stateful applications #5006

Dyex719 · 2024-05-30T19:59:31Z

Please provide an in-depth description of the question you have:
We are using Karmada's failover and scheduling features to migrate our Flink jobs from one cluster to another during failover. Since Flink, like other stateful applications, needs to resume from the last processed state, we need to make a call to a persistent storage system such as S3 or HDFS to restore the latest state.

Since restoring state from a persistent store will be a common need for all stateful applications, we want to know if the community is open to adding this as a feature in Karmada itself. This would save users from having to implement their own custom solutions using third-party tools.

To describe our use-case a little:
Flink stores snapshots of its state (checkpoints) in persistent storage. This state includes information such as the source offsets it needs to resume from, as well as any intermediate state of the Flink job graph. To restore from a particular state, Flink needs the checkpoint path added to the job manifest. The Flink operator then reads this checkpoint path, calls the persistent store, and fetches the metadata for the last state. The job is then resumed from that state.
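A minimal sketch of the restore step described above, to make the flow concrete. It assumes checkpoints are laid out as `<base>/<job-id>/chk-<n>/_metadata` and that the job is a `FlinkDeployment` whose restore path is set through `spec.job.initialSavepointPath`. The helper names `latest_checkpoint` and `with_restore_path` are hypothetical, and the list of object keys stands in for a real S3/HDFS listing call.

```python
import copy
import re

# Assumed layout: a completed checkpoint lives in <base>/<job-id>/chk-<n>/
# and contains a _metadata file.
_CHK_RE = re.compile(r"^(?P<dir>.*/chk-(?P<n>\d+))/_metadata$")


def latest_checkpoint(keys):
    """Return the chk-<n> directory with the highest n that has a
    _metadata file, or None if no completed checkpoint is listed.

    `keys` is any iterable of object paths (hypothetical stand-in for
    the result of listing the checkpoint bucket or directory).
    """
    best = None
    for key in keys:
        match = _CHK_RE.match(key)
        if match is None:
            continue  # not a checkpoint metadata file
        n = int(match.group("n"))
        if best is None or n > best[0]:
            best = (n, match.group("dir"))
    return None if best is None else best[1]


def with_restore_path(manifest, checkpoint_path):
    """Return a copy of the job manifest with the restore path set.

    Assumes a FlinkDeployment-style manifest; the input is not mutated.
    """
    patched = copy.deepcopy(manifest)
    job = patched.setdefault("spec", {}).setdefault("job", {})
    job["initialSavepointPath"] = checkpoint_path
    return patched
```

For example, given keys for `chk-9` and `chk-12` that both contain `_metadata`, `latest_checkpoint` picks the `chk-12` directory, and `with_restore_path` returns a manifest the operator could use to resume from it.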

To implement this in a generic way, we could include an option in the scheduler to restore state before the work item is created, if the resource being scheduled is a stateful application.
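One way to picture the generic option is a per-kind restore hook that runs before the work item is created. This is only an illustration of the idea, not Karmada code: `RESTORE_HOOKS`, `register_restore_hook`, and `prepare_for_failover` are hypothetical names, and the manifest is modeled as a plain dict.

```python
# Hypothetical registry: resource kind -> function that returns a
# manifest patched with whatever the application needs to restore state.
RESTORE_HOOKS = {}


def register_restore_hook(kind, hook):
    """Register a restore hook for one stateful resource kind."""
    RESTORE_HOOKS[kind] = hook


def prepare_for_failover(manifest, is_failover):
    """Run the restore hook, if any, before the work item is created.

    Resources without a registered hook (stateless workloads) and
    ordinary, non-failover scheduling pass through unchanged.
    """
    hook = RESTORE_HOOKS.get(manifest.get("kind"))
    if not is_failover or hook is None:
        return manifest
    return hook(manifest)
```

With this shape, a Flink-specific hook would look up the latest checkpoint and patch the manifest, while other stateful applications could register their own logic without changes to the scheduling path.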

Environment:

Karmada version:
v1.9.0
Kubernetes version:
v1.29

Dyex719 added the kind/question label May 30, 2024

Dyex719 changed the title ~~Advice on how to restore state after failover for stateful applications~~ Framework for restoring state after failover for stateful applications May 30, 2024

Dyex719 mentioned this issue Jun 5, 2024

Add a label/annotation to the resource being rescheduled during failover #4969

Closed

This was referenced Jul 1, 2024

Stateful Proposal doc Dyex719/karmada#1

Closed

Stateful Failover Proposal #5116

Closed

RainbowMango mentioned this issue Jan 6, 2025

[Feature] Stateful Application Failover Support #5788

Closed

14 tasks
