Please provide an in-depth description of the question you have:
We are using Karmada's failover and scheduling features to migrate our Flink jobs from one cluster to another during failover. Since Flink, like other stateful applications, needs to resume from the last processed state, we need to call a persistent storage system such as S3 or HDFS to restore the latest state from there.
Since restoring state from a persistent store is a common requirement across stateful applications, we want to know whether the community is open to adding this as a feature in Karmada itself. This would save users from having to implement their own custom solutions with third-party tools.
To describe our use-case a little:
Flink stores snapshots of its state (checkpoints) in persistent storage. This state includes the source offsets to resume from as well as any intermediate state of the Flink job graph. To restore from a particular state, the path of the checkpoint to restore from must be added to the job manifest. The Flink operator reads this checkpoint path, calls the persistent store, and fetches the last state metadata; the job is then resumed from that state.
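For concreteness, this is roughly what the restore path looks like in a job manifest. The sketch below assumes the Flink Kubernetes Operator's `FlinkDeployment` CRD; the bucket path and job name are illustrative:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: wordcount
spec:
  job:
    jarURI: local:///opt/flink/examples/wordcount.jar
    upgradeMode: savepoint
    # Illustrative path: the operator reads this field and resumes the job
    # from the state stored at this location.
    initialSavepointPath: s3://my-bucket/checkpoints/wordcount/chk-42
```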
To implement this generically, the scheduler could offer an option that restores state before the Work item is created whenever the resource being scheduled is a stateful application.
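A minimal sketch of what such a hook might do: before the Work item is created, look up the latest checkpoint in persistent storage and inject its path into the workload manifest. All names below (`latestCheckpoint`, `injectRestorePath`, the `initialSavepointPath` key, the S3 path) are illustrative assumptions, not an existing Karmada API:

```go
package main

import "fmt"

// latestCheckpoint stands in for a call to a persistent store (S3/HDFS)
// that returns the path of the most recent completed checkpoint for a job.
// Hypothetical helper; a real implementation would query the store.
func latestCheckpoint(app string) string {
	return "s3://checkpoints/" + app + "/chk-42"
}

// injectRestorePath returns a copy of the manifest with the checkpoint
// path recorded where the workload's operator (e.g. the Flink operator)
// would look for it, so the resumed job restores from that state.
func injectRestorePath(manifest map[string]any, app string) map[string]any {
	out := make(map[string]any, len(manifest)+1)
	for k, v := range manifest {
		out[k] = v
	}
	out["initialSavepointPath"] = latestCheckpoint(app)
	return out
}

func main() {
	manifest := map[string]any{"job": "wordcount"}
	fmt.Println(injectRestorePath(manifest, "wordcount")["initialSavepointPath"])
}
```

In this sketch the original manifest is left untouched and the restore step runs once per rescheduling decision, which is the point in the flow the proposal above targets.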
Environment:
Karmada version:
v1.9.0
Kubernetes version:
v1.29
Dyex719 changed the title from "Advice on how to restore state after failover for stateful applications" to "Framework for restoring state after failover for stateful applications" on May 30, 2024.