
Improve the FullSnapshotSchedule calculation, which is passed to backup-restore via the --schedule flag. #7280

Closed
ishan16696 opened this issue Jan 4, 2023 · 2 comments
Labels: area/backup (Backup related), area/monitoring (Monitoring, including availability monitoring and alerting, related), kind/enhancement (Enhancement, improvement, extension)

Comments

@ishan16696 (Member) commented Jan 4, 2023

How to categorize this issue?
/area monitoring
/area backup
/kind enhancement

What would you like to be added:
Currently, the alert KubeEtcdFullBackupFailed is calculated from the last full snapshot timestamp, i.e. it checks whether backup-restore has taken a full snapshot within the last 24h. However, we have observed some false KubeEtcdFullBackupFailed alerts for shoots that were hibernated and woken up again within 24h.
It has been observed that determineBackupSchedule calculates the FullSnapshotSchedule on the basis of the maintenance window. IMO, it should calculate the FullSnapshotSchedule on the basis of the last full snapshot timestamp, not on the basis of the maintenance window.
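
For illustration, a minimal Go sketch (with hypothetical names and timestamps; determineBackupSchedule's actual signature and types in gardener differ) contrasting the current derivation of the daily cron expression from the maintenance window with the proposed derivation from the last full snapshot timestamp:

```go
// Sketch only: hypothetical helpers, not code from gardener.
package main

import (
	"fmt"
	"time"
)

// scheduleFromMaintenanceWindow derives a daily cron expression from the
// maintenance window start (the current behaviour described in this issue).
func scheduleFromMaintenanceWindow(windowStart time.Time) string {
	return fmt.Sprintf("%d %d * * *", windowStart.Minute(), windowStart.Hour())
}

// scheduleFromLastSnapshot derives the daily cron expression from the
// timestamp of the last full snapshot (the behaviour proposed here).
func scheduleFromLastSnapshot(lastFullSnapshot time.Time) string {
	return fmt.Sprintf("%d %d * * *", lastFullSnapshot.Minute(), lastFullSnapshot.Hour())
}

func main() {
	windowStart := time.Date(2023, time.January, 4, 22, 0, 0, 0, time.UTC)
	lastSnapshot := time.Date(2023, time.January, 4, 3, 15, 0, 0, time.UTC)
	fmt.Println("schedule from maintenance window:", scheduleFromMaintenanceWindow(windowStart)) // "0 22 * * *"
	fmt.Println("schedule from last full snapshot:", scheduleFromLastSnapshot(lastSnapshot))     // "15 3 * * *"
}
```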

Why is this needed:
We have observed some false KubeEtcdFullBackupFailed alerts due to the behaviour described above when a cluster was hibernated and woken up again.

Take this scenario:

  1. Suppose the shoot has maintenance window m1.
  2. etcd-backup-restore took a full snapshot at timestamp t1.
  3. The cluster was then hibernated at t2 (before the maintenance window, i.e. t2 < m1) and woken up again at t3 (t3 > t2 > t1 && t3 > m1).
  4. When the cluster was woken up, t3 - t1 < 24h (24h had not passed yet), so no new full snapshot was taken by backup-restore, since it requires 23.5h to have elapsed since the timestamp of the last full snapshot.
  5. Once 24h had passed, at t4 the KubeEtcdFullBackupFailed alert checked the timestamp of the last full snapshot, found t4 - t1 > 24h and fired, while backup-restore was still waiting to take the next full snapshot according to the --schedule it was given, which was calculated from the shoot's maintenance window (m1 of the next day) and had not been reached yet (see the sketch after this list).
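
A small, self-contained Go sketch of this timeline with purely illustrative timestamps for t1, m1, t3 and t4 (chosen only to satisfy the ordering above, not taken from a real cluster), showing how the alert condition and the maintenance-window-based schedule diverge:

```go
// Illustrative timeline arithmetic only; not code from gardener or etcd-backup-restore.
package main

import (
	"fmt"
	"time"
)

func main() {
	t1 := time.Date(2023, time.January, 3, 10, 0, 0, 0, time.UTC) // last full snapshot
	m1 := time.Date(2023, time.January, 3, 22, 0, 0, 0, time.UTC) // maintenance window, missed while hibernated
	t3 := time.Date(2023, time.January, 4, 8, 0, 0, 0, time.UTC)  // wake-up: t3-t1 = 22h < 24h, so no snapshot on start
	t4 := time.Date(2023, time.January, 4, 11, 0, 0, 0, time.UTC) // alert evaluation: t4-t1 = 25h

	nextScheduled := m1.Add(24 * time.Hour) // next run of the maintenance-window-based --schedule

	fmt.Println("full snapshot skipped on wake-up:", t3.Sub(t1) < 24*time.Hour)        // true
	fmt.Println("KubeEtcdFullBackupFailed fires at t4:", t4.Sub(t1) > 24*time.Hour)    // true
	fmt.Println("next scheduled snapshot still pending:", nextScheduled.After(t4))     // true -> false alert
}
```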

/cc @timuthy @shreyas-s-rao

@gardener-prow bot added the area/monitoring, area/backup and kind/enhancement labels on Jan 4, 2023
@ishan16696 (Member, Author) commented:

We decided not to make determineBackupSchedule more complicated; we will handle this edge case in backup-restore itself.
Follow-up issue on backup-restore: gardener/etcd-backup-restore#570.
Hence closing this issue.
/close
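
For illustration only, one possible shape of such edge-case handling on the backup-restore side is a startup check that takes an immediate full snapshot when the last one is older than the full-snapshot period, instead of waiting for the next cron run derived from --schedule. This is a hypothetical sketch of the idea, not the actual change tracked in gardener/etcd-backup-restore#570, and all names below are made up:

```go
// Hypothetical sketch; not code from etcd-backup-restore.
package main

import (
	"fmt"
	"time"
)

// needImmediateFullSnapshot decides at startup whether to take a full snapshot
// right away instead of waiting for the next scheduled run.
func needImmediateFullSnapshot(lastFullSnapshot time.Time, period time.Duration, now time.Time) bool {
	return now.Sub(lastFullSnapshot) >= period
}

func main() {
	lastFull := time.Now().Add(-25 * time.Hour) // e.g. the snapshot taken before hibernation
	if needImmediateFullSnapshot(lastFull, 24*time.Hour, time.Now()) {
		fmt.Println("last full snapshot older than 24h: trigger a full snapshot now")
	} else {
		fmt.Println("recent full snapshot exists: wait for the next scheduled run")
	}
}
```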

@gardener-prow bot (Contributor) commented Jan 4, 2023

@ishan16696: Closing this issue.

In response to this:

We decided not to make determineBackupSchedule more complicated; we will handle this edge case in backup-restore itself.
Follow-up issue on backup-restore: gardener/etcd-backup-restore#570.
Hence closing this issue.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gardener-prow gardener-prow bot closed this as completed Jan 4, 2023