Improve the FullSnapshotSchedule calculation that is passed as the --schedule flag to backup-restore
#7280
How to categorize this issue?
/area monitoring
/area backup
/kind enhancement
What would you like to be added:
Currently, the alert KubeEtcdFullBackupFailed is calculated from the timestamp of the last full snapshot: it checks whether backup-restore has taken a full snapshot within the last 24h. However, we have observed false KubeEtcdFullBackupFailed alerts for some shoots when the shoot is hibernated and woken up again within 24h.
It has been observed that determineBackupSchedule calculates FullSnapshotSchedule on the basis of the maintenance window. IMO, it should calculate FullSnapshotSchedule on the basis of the last full snapshot timestamp, not on the basis of the maintenance window.
Why is this needed:
We have observed false KubeEtcdFullBackupFailed alerts due to the behaviour described above when a cluster was hibernated and woken up again.
Take this scenario:
1. The shoot's maintenance window is at m1.
2. etcd-backup-restore took a full snapshot at timestamp t1.
3. The cluster was hibernated at t2 (before the maintenance window, i.e. t2 < m1) and woken up again at t3 (t3 > t2 > t1 && t3 > m1).
4. Since t3 - t1 < 24h (24h have not passed yet), backup-restore takes no new full snapshot, as it calculates 23.5h from the timestamp of the last full snapshot.
5. After 24h, at t4, the alert KubeEtcdFullBackupFailed checks the timestamp of the last full snapshot and finds t4 - t1 > 24h. The alert is now raised, while backup-restore is still waiting to take a full snapshot according to the --schedule passed to it, which was calculated on the basis of the shoot's maintenance window (m1 of the next day) and has not yet been reached.
/cc @timuthy @shreyas-s-rao