
Improve the FullSnapshotSchedule calculation, which is passed to backup-restore via the --schedule flag. #7280

Closed
ishan16696 opened this issue Jan 4, 2023 · 2 comments
Labels: area/backup (Backup related), area/monitoring (Monitoring, including availability monitoring and alerting, related), kind/enhancement (Enhancement, improvement, extension)

Comments

@ishan16696 (Member) commented Jan 4, 2023

How to categorize this issue?
/area monitoring
/area backup
/kind enhancement

What would you like to be added:
Currently, the alert KubeEtcdFullBackupFailed is calculated from the last full snapshot timestamp, i.e. it checks whether backup-restore has taken a full snapshot within the last 24h. However, we have observed some false KubeEtcdFullBackupFailed alerts for shoots that were hibernated and woken up again within 24h.
It has been observed that determineBackupSchedule calculates the FullSnapshotSchedule on the basis of the maintenance window. IMO, it should calculate the FullSnapshotSchedule on the basis of the last full snapshot timestamp, not on the basis of the maintenance window.
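
For illustration, a minimal Go sketch (with hypothetical names and timestamps; determineBackupSchedule's actual signature and types in gardener differ) contrasting the current derivation of the daily cron expression from the maintenance window with the proposed derivation from the last full snapshot timestamp:

```go
// Sketch only: hypothetical helpers, not code from gardener.
package main

import (
	"fmt"
	"time"
)

// scheduleFromMaintenanceWindow derives a daily cron expression from the
// maintenance window start (the current behaviour described in this issue).
func scheduleFromMaintenanceWindow(windowStart time.Time) string {
	return fmt.Sprintf("%d %d * * *", windowStart.Minute(), windowStart.Hour())
}

// scheduleFromLastSnapshot derives the daily cron expression from the
// timestamp of the last full snapshot (the behaviour proposed here).
func scheduleFromLastSnapshot(lastFullSnapshot time.Time) string {
	return fmt.Sprintf("%d %d * * *", lastFullSnapshot.Minute(), lastFullSnapshot.Hour())
}

func main() {
	windowStart := time.Date(2023, time.January, 4, 22, 0, 0, 0, time.UTC)
	lastSnapshot := time.Date(2023, time.January, 4, 3, 15, 0, 0, time.UTC)
	fmt.Println("schedule from maintenance window:", scheduleFromMaintenanceWindow(windowStart)) // "0 22 * * *"
	fmt.Println("schedule from last full snapshot:", scheduleFromLastSnapshot(lastSnapshot))     // "15 3 * * *"
}
```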

Why is this needed:
We have observed some false KubeEtcdFullBackupFailed alerts due to the behaviour described above when a cluster was hibernated and woken up again.

Take this scenario:

  1. Suppose the shoot has maintenance window m1.
  2. etcd-backup-restore took a full snapshot at timestamp t1.
  3. The cluster was then hibernated at t2 (before the maintenance window, i.e. t2 < m1) and woken up again at t3 (t3 > t2 > t1 && t3 > m1).
  4. When the cluster was woken up, t3 - t1 < 24h (24h had not passed yet), so no new full snapshot was taken by backup-restore, since it requires 23.5h to have elapsed since the timestamp of the last full snapshot.
  5. Once 24h had passed, at t4 the KubeEtcdFullBackupFailed alert checked the timestamp of the last full snapshot, found t4 - t1 > 24h and fired, while backup-restore was still waiting to take the next full snapshot according to the --schedule it was given, which was calculated from the shoot's maintenance window (m1 of the next day) and had not been reached yet (see the sketch after this list).
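
A small, self-contained Go sketch of this timeline with purely illustrative timestamps for t1, m1, t3 and t4 (chosen only to satisfy the ordering above, not taken from a real cluster), showing how the alert condition and the maintenance-window-based schedule diverge:

```go
// Illustrative timeline arithmetic only; not code from gardener or etcd-backup-restore.
package main

import (
	"fmt"
	"time"
)

func main() {
	t1 := time.Date(2023, time.January, 3, 10, 0, 0, 0, time.UTC) // last full snapshot
	m1 := time.Date(2023, time.January, 3, 22, 0, 0, 0, time.UTC) // maintenance window, missed while hibernated
	t3 := time.Date(2023, time.January, 4, 8, 0, 0, 0, time.UTC)  // wake-up: t3-t1 = 22h < 24h, so no snapshot on start
	t4 := time.Date(2023, time.January, 4, 11, 0, 0, 0, time.UTC) // alert evaluation: t4-t1 = 25h

	nextScheduled := m1.Add(24 * time.Hour) // next run of the maintenance-window-based --schedule

	fmt.Println("full snapshot skipped on wake-up:", t3.Sub(t1) < 24*time.Hour)        // true
	fmt.Println("KubeEtcdFullBackupFailed fires at t4:", t4.Sub(t1) > 24*time.Hour)    // true
	fmt.Println("next scheduled snapshot still pending:", nextScheduled.After(t4))     // true -> false alert
}
```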

/cc @timuthy @shreyas-s-rao

@gardener-prow bot added the area/monitoring, area/backup and kind/enhancement labels on Jan 4, 2023
@ishan16696 (Member, Author) commented:

We decided not to make determineBackupSchedule more complicated; we will handle this edge case in backup-restore itself.
Follow-up issue on backup-restore: gardener/etcd-backup-restore#570.
Hence closing this issue.
/close
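
For illustration only, one possible shape of such edge-case handling on the backup-restore side is a startup check that takes an immediate full snapshot when the last one is older than the full-snapshot period, instead of waiting for the next cron run derived from --schedule. This is a hypothetical sketch of the idea, not the actual change tracked in gardener/etcd-backup-restore#570, and all names below are made up:

```go
// Hypothetical sketch; not code from etcd-backup-restore.
package main

import (
	"fmt"
	"time"
)

// needImmediateFullSnapshot decides at startup whether to take a full snapshot
// right away instead of waiting for the next scheduled run.
func needImmediateFullSnapshot(lastFullSnapshot time.Time, period time.Duration, now time.Time) bool {
	return now.Sub(lastFullSnapshot) >= period
}

func main() {
	lastFull := time.Now().Add(-25 * time.Hour) // e.g. the snapshot taken before hibernation
	if needImmediateFullSnapshot(lastFull, 24*time.Hour, time.Now()) {
		fmt.Println("last full snapshot older than 24h: trigger a full snapshot now")
	} else {
		fmt.Println("recent full snapshot exists: wait for the next scheduled run")
	}
}
```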

@gardener-prow bot (Contributor) commented Jan 4, 2023

@ishan16696: Closing this issue.

In response to this:

We decided not to make determineBackupSchedule more complicated; we will handle this edge case in backup-restore itself.
Follow-up issue on backup-restore: gardener/etcd-backup-restore#570.
Hence closing this issue.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gardener-prow gardener-prow bot closed this as completed Jan 4, 2023