Fix false critical on OMD backup job when agent runs at the time the backup is about to start #641

Closed

dnlldl
Contributor

@dnlldl dnlldl commented Oct 26, 2023

Prevent this false critical alert:

Host
Service OMD backup
Event OK → CRITICAL
Time Mon Oct 23 01:30:05 EDT 2023
Summary Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 CRIT
Details Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 CRIT
Host Metrics rta=0.010ms;200.000;500.000;0; pl=0%;80;100;; rtmax=0.038ms;;;; rtmin=0.002ms;;;;
Service Metrics backup_duration=123.582501;;;; backup_avgspeed=865828.190744;;;; backup_size=446827456;;;;

This happens when the backup is about to start (here at 01:30:00) but has not yet started when the agent runs its check (also around 01:30:00 in this case, although the alert was generated at 01:30:05). According to the logs, the backup actually started at 01:30:03. It is normal for a cron job to start with a small delay; add the gap between the check and the alert time reported by Checkmk, and the result is a false critical. The 30-second buffer will prevent this corner case from ever happening again.
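For illustration, the idea behind the buffer can be sketched as follows. This is a minimal standalone sketch, not the actual Checkmk check code: the function name `next_run_state`, the constant `GRACE_PERIOD`, and the numeric states (0 for OK, 2 for CRIT) are assumptions made for this example.

```python
from datetime import datetime, timedelta

# Hypothetical constant: the 30-second buffer described above.
GRACE_PERIOD = timedelta(seconds=30)


def next_run_state(
    next_run: datetime,
    now: datetime,
    grace: timedelta = GRACE_PERIOD,
) -> int:
    """Return 0 (OK) or 2 (CRIT) for a backup job that has not started yet.

    The job only counts as overdue once the scheduled time plus the
    grace period has passed, so a cron job that starts a few seconds
    late does not trigger a critical state.
    """
    if now > next_run + grace:
        return 2  # CRIT: the scheduled run is overdue beyond the buffer
    return 0  # OK: still within the tolerated start delay


# Values taken from the alert above: scheduled 01:30:00, alert at 01:30:05.
scheduled = datetime(2023, 10, 23, 1, 30, 0)
checked = datetime(2023, 10, 23, 1, 30, 5)
state_with_buffer = next_run_state(scheduled, checked)
state_without_buffer = next_run_state(scheduled, checked, grace=timedelta(0))
```

With the timestamps from the alert, the check five seconds after the scheduled time stays OK with the buffer and turns critical without it.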

I have read the CLA Document and I hereby sign the CLA or my organization already has a signed CLA.

@github-actions

github-actions bot commented Oct 26, 2023

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@dnlldl
Contributor Author

dnlldl commented Oct 26, 2023

I have read the CLA Document and I hereby sign the CLA or my organization already has a signed CLA.

@dnlldl
Contributor Author

dnlldl commented Mar 23, 2024

Just curious, is there anything else to do? I'm aware that a 30-second buffer isn't exactly elegant; an alternative would be to require 2 consecutive failing checks (or similar) before the state turns critical.
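The alternative mentioned here could look roughly like the sketch below. It is only an illustration under assumptions: the helper names `update_overdue_count` and `state_from_count` and the threshold of 2 are hypothetical, and how the counter would be persisted between check runs is left open.

```python
def update_overdue_count(previous_count: int, is_overdue: bool) -> int:
    """Count consecutive checks that saw the backup as overdue.

    Any check that finds the backup on schedule resets the counter.
    """
    return previous_count + 1 if is_overdue else 0


def state_from_count(count: int, threshold: int = 2) -> int:
    """Return 2 (CRIT) only after `threshold` consecutive overdue checks."""
    return 2 if count >= threshold else 0


# A single overdue observation (the corner case) followed by a normal one.
count = 0
states = []
for overdue in (True, False, True, True):
    count = update_overdue_count(count, overdue)
    states.append(state_from_count(count))
```

In this sketch a single late start never reaches the threshold, while two overdue checks in a row do.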

@TimotheusBachinger
Contributor

Dear Checkmk Contributor! Unfortunately, we had to rewrite the history of our Git repository, which caused your PR to be auto-closed. We will therefore rebase your PR onto the current master and reopen it. Sorry for the inconvenience.

@TimotheusBachinger
Contributor

Dear Contributor. Unfortunately, we learned that reopening a PR that was force-rebased is not possible (see isaacs/github#361). We therefore kindly ask you to create a new PR for your change. We apologize for the circumstances and will implement technical measures to prevent such incidents in the future.
