Enhance condition to take full snapshot during startup. #570

Closed
ishan16696 opened this issue Jan 4, 2023 · 3 comments · Fixed by #574
Assignees: ishan16696
Labels: area/backup, area/monitoring, kind/enhancement, priority/2, status/closed
Milestone: v0.22.0

Comments

@ishan16696 (Member)

How to categorize this issue?
/area monitoring
/area backup
/kind enhancement

What would you like to be added:
During startup, etcd-backup-restore should also consider the configured FullSnapshotSchedule along with the timestamp of the last full snapshot, so that it doesn't miss any full snapshot.
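
A minimal sketch of the proposed condition, assuming Go and a standard 5-field cron parser such as github.com/robfig/cron/v3; the package and function names below are illustrative, not the actual etcd-backup-restore code:

```go
package startup

import (
	"time"

	cron "github.com/robfig/cron/v3"
)

// requireFullSnapshotAtStartup is a hypothetical helper: it reports whether a
// full snapshot should be taken immediately on startup, considering both the
// age of the last full snapshot and the configured FullSnapshotSchedule.
func requireFullSnapshotAtStartup(fullSnapshotSchedule string, lastFullSnapshot, now time.Time) (bool, error) {
	// No full snapshot exists yet: take one right away.
	if lastFullSnapshot.IsZero() {
		return true, nil
	}
	// Existing behaviour: take one if ~23.5h have passed since the last full snapshot.
	if now.Sub(lastFullSnapshot) >= 23*time.Hour+30*time.Minute {
		return true, nil
	}
	// Proposed addition: also consult the cron schedule. If the next scheduled
	// run after the last full snapshot already lies in the past, a scheduled
	// full snapshot was missed (e.g. while the cluster was hibernated).
	schedule, err := cron.ParseStandard(fullSnapshotSchedule)
	if err != nil {
		return false, err
	}
	if schedule.Next(lastFullSnapshot).Before(now) {
		return true, nil
	}
	return false, nil
}
```

With such a check, a wake-up after a missed maintenance-window run would trigger a full snapshot immediately instead of waiting for the next occurrence of the --schedule.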

Why is this needed:
Currently, the alert KubeEtcdFullBackupFailed is calculated based on the timestamp of the last full snapshot, i.e. it checks whether backup-restore has taken a full snapshot within the last 24h. However, we have observed false KubeEtcdFullBackupFailed alerts for some shoots when the shoot is hibernated and woken up within 24h, while determineBackupSchedule calculates the FullSnapshotSchedule on the basis of the maintenance window.

Take this scenario (a concrete timeline with illustrative timestamps follows the list):

  1. Suppose a shoot has a maintenance window m1.
  2. etcd-backup-restore took a full snapshot at timestamp t1.
  3. The cluster was then hibernated at t2 (before the maintenance window, i.e. t2 < m1) and woken up again at t3 (t3 > t2 > t1 && t3 > m1).
  4. When the cluster was woken up, t3 - t1 < 24h (24h had not passed yet), so backup-restore took no new full snapshot, since it only does so 23.5h after the timestamp of the last full snapshot.
  5. Once 24h had passed, at t4 the alert KubeEtcdFullBackupFailed checked the timestamp of the last full snapshot, found t4 - t1 > 24h, and was raised, while backup-restore was still waiting to take a full snapshot according to the --schedule passed to it, which is calculated on the basis of the shoot's maintenance window (m1 of the next day) and had not yet been reached.
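
To make the timeline concrete with illustrative timestamps (not taken from an actual cluster): say t1 = Monday 10:00 (last full snapshot), m1 = Monday 22:00 (maintenance window), t2 = Monday 12:00 (hibernation) and t3 = Tuesday 08:00 (wake-up). At t3, t3 - t1 = 22h < 23.5h, so backup-restore takes no startup snapshot and the next full snapshot per --schedule is only due at Tuesday 22:00 (m1 of the next day). At t4 = Tuesday 10:00, t4 - t1 > 24h and KubeEtcdFullBackupFailed fires, although the next scheduled full snapshot is still roughly 12h away. With the proposed schedule-aware check, the missed Monday 22:00 run would already be detected at t3 and a full snapshot taken right at startup.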

/cc @timuthy @shreyas-s-rao

@gardener-robot added the area/backup, area/monitoring and kind/enhancement labels Jan 4, 2023
@ishan16696 self-assigned this Jan 4, 2023
@abdasgupta added the priority/2 label Jan 6, 2023
@ishan16696 added this to the v0.22.0 milestone Jan 30, 2023
@gardener-robot added the status/closed label Feb 13, 2023
@shreyas-s-rao (Collaborator)

@ishan16696 thanks for resolving this issue. Can you please also raise another issue to calculate the previous full snapshot scheduled time based purely on the cron schedule, as discussed in the #574 (review) thread?
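
One illustrative way to derive the previous scheduled time purely from the cron schedule (a sketch building on the cron.Schedule interface from github.com/robfig/cron/v3, not the implementation chosen in the follow-up): step the schedule forward from a point sufficiently far in the past and keep the last occurrence that is still before "now".

```go
// previousScheduledTime is a hypothetical helper: it returns the most recent
// scheduled occurrence before "now", searching back at most "lookback".
func previousScheduledTime(schedule cron.Schedule, now time.Time, lookback time.Duration) (time.Time, bool) {
	t := schedule.Next(now.Add(-lookback))
	if !t.Before(now) {
		return time.Time{}, false // no occurrence within the lookback window
	}
	for {
		next := schedule.Next(t)
		if !next.Before(now) {
			return t, true
		}
		t = next
	}
}
```

For a daily FullSnapshotSchedule a lookback of 24–25h is enough; the last full snapshot timestamp can then simply be compared against the returned time.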

@ishan16696 (Member, Author)

Hi @shreyas-s-rao,
I have created a follow-up issue: #587

@shreyas-s-rao (Collaborator)

Thanks! 🚀
