Handling of overlapping of first delta snapshot with full snapshot isn't done correctly. #844

Open
ishan16696 opened this issue Feb 17, 2025 · 1 comment · May be fixed by #845

ishan16696 commented Feb 17, 2025

How to categorize this issue?

/area disaster-recovery
/kind bug

What happened:
@unmarshall and I observed that the Druid end-to-end tests were failing because the restoration process encountered an error. The backup-restore applied an incorrect delta event, resulting in the following error:

time="2025-02-17T09:45:57Z" level=error msg="Failed initialization: error while restoring corrupt data: failed to restore snapshot: mismatched event revision while applying delta snapshot, expected 11 but applied 15 " actor=backup-restore-server

Upon further investigation, we discovered that a separate handling mechanism for applying the first delta snapshot was introduced in this PR: #29.
This mechanism was introduced to address the overlap of events between the full snapshot and the first delta snapshot. However, it fails to account for the case where there is a complete overlap of events between the delta snapshot and the full snapshot:

var newRevisionIndex int
for index, event := range events {
    if event.EtcdEvent.Kv.ModRevision > lastRevision {
        newRevisionIndex = index
        break
    }
}
r.logger.Infof("Applying first delta snapshot %s", path.Join(snap.SnapDir, snap.SnapName))
return applyEventsToEtcd(clientKV, events[newRevisionIndex:])

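When every event in the first delta snapshot has a ModRevision less than or equal to lastRevision, the loop never breaks and newRevisionIndex keeps its zero value, so events[newRevisionIndex:] is the entire delta snapshot. As a result, backup-restore re-applies etcd events that are already contained in the full snapshot, which causes the restoration verification check to fail and ultimately fails the restoration: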

func verifySnapshotRevision(clientKV client.KVCloser, snap *brtypes.Snapshot) error {
    ctx := context.TODO()
    getResponse, err := clientKV.Get(ctx, "foo")
    if err != nil {
        return fmt.Errorf("failed to connect to etcd KV client: %v", err)
    }
    etcdRevision := getResponse.Header.GetRevision()
    if snap.LastRevision != etcdRevision {
        return fmt.Errorf("mismatched event revision while applying delta snapshot, expected %d but applied %d ", snap.LastRevision, etcdRevision)
    }
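
To illustrate the missing case, here is a minimal sketch of how the selection of events from the first delta snapshot could treat a complete overlap. It is not the change proposed in #845. The `event` type and the `eventsToApply` helper are hypothetical stand-ins for the real types, and the sketch assumes the events are sorted by ascending ModRevision. If each re-applied event bumps the etcd revision by one, re-applying the four events at revisions 8 to 11 on top of revision 11 would be consistent with the reported "expected 11 but applied 15".

```go
package main

import "fmt"

// event is a hypothetical stand-in for the real snapshot event type;
// only the field needed for this illustration is kept.
type event struct {
	ModRevision int64
}

// eventsToApply returns the suffix of events whose ModRevision is greater
// than lastRevision. It assumes events are sorted by ascending ModRevision.
// When every event is already covered by the full snapshot, it returns nil
// instead of falling back to index 0 and returning the whole slice.
func eventsToApply(events []event, lastRevision int64) []event {
	for index, ev := range events {
		if ev.ModRevision > lastRevision {
			return events[index:]
		}
	}
	return nil
}

func main() {
	// Delta snapshot covering revisions 8 to 11.
	delta := []event{{8}, {9}, {10}, {11}}

	// Full snapshot ends at revision 11: complete overlap, nothing to apply.
	fmt.Println(len(eventsToApply(delta, 11))) // prints 0

	// Full snapshot ends at revision 9: only revisions 10 and 11 remain.
	fmt.Println(len(eventsToApply(delta, 9))) // prints 2
}
```

Returning an empty result for the complete-overlap case means the caller applies nothing, so the etcd revision stays at the full snapshot's last revision.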

How to reproduce it (as minimally and precisely as possible):

  1. Start an etcd server.
  2. Put some dummy data.
  3. Start backup-restore and make sure it takes a full snapshot and a delta snapshot that completely overlap each other. The timestamp of the overlapping delta snapshot should be later than that of the latest full snapshot.
     For example, the full snapshot covering revisions 0 to 11 completely overlaps the delta snapshot covering revisions 8 to 11, with the same timestamp:
     Full-00000000-00000011-1739782988.gz
     Incr-00000008-00000011-1739782988.gz
  4. Trigger the restoration.
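
When reproducing this, a quick way to confirm that a pair of snapshots is in the complete-overlap state is to compare the revisions encoded in their file names. The sketch below is only an illustration: the name layout (kind, start revision, last revision, timestamp) is inferred from the two example names above, and `snapshotRange`, `parseSnapshotName` and `completelyOverlapped` are hypothetical helpers, not part of backup-restore.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// snapshotRange holds the revision range encoded in a snapshot file name.
type snapshotRange struct {
	Kind          string // "Full" or "Incr", as seen in the example names
	StartRevision int64
	LastRevision  int64
}

// parseSnapshotName parses names shaped like
// "Full-00000000-00000011-1739782988.gz". The layout is an assumption
// based on the two example names, not the tool's own parser.
func parseSnapshotName(name string) (snapshotRange, error) {
	parts := strings.Split(strings.TrimSuffix(name, ".gz"), "-")
	if len(parts) != 4 {
		return snapshotRange{}, fmt.Errorf("unexpected snapshot name %q", name)
	}
	start, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return snapshotRange{}, fmt.Errorf("invalid start revision in %q: %w", name, err)
	}
	last, err := strconv.ParseInt(parts[2], 10, 64)
	if err != nil {
		return snapshotRange{}, fmt.Errorf("invalid last revision in %q: %w", name, err)
	}
	return snapshotRange{Kind: parts[0], StartRevision: start, LastRevision: last}, nil
}

// completelyOverlapped reports whether the delta snapshot contains no
// revision beyond the last revision of the full snapshot.
func completelyOverlapped(full, delta snapshotRange) bool {
	return delta.LastRevision <= full.LastRevision
}

func main() {
	full, err := parseSnapshotName("Full-00000000-00000011-1739782988.gz")
	if err != nil {
		panic(err)
	}
	delta, err := parseSnapshotName("Incr-00000008-00000011-1739782988.gz")
	if err != nil {
		panic(err)
	}
	fmt.Println(completelyOverlapped(full, delta)) // prints true
}
```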

Anything else we need to know?:
We have seen several occurrences of this issue in the past, for example #763.

@gardener-robot gardener-robot added area/disaster-recovery Disaster recovery related kind/bug Bug labels Feb 17, 2025
@ishan16696 ishan16696 changed the title Apply of overlapping of delta snapshot is wrong Handling of overlapping of first delta snapshot with full snapshot isn't done correctly. Feb 17, 2025
@ishan16696
Member Author

/assign
