Harvest should create resolution metrics for health alerts #2804

Closed
rahulguptajss opened this issue Apr 3, 2024 · 6 comments · Fixed by #2977

@rahulguptajss
Contributor

Thanks, @faguayot, for raising this here.

@rahulguptajss
Contributor Author

rahulguptajss commented Jun 24, 2024

@faguayot This feature is available through nightly builds. Please do provide your feedback once you've had a chance to try these changes.

We publish a value of 1 for health metrics when an alert is detected and 0 once it is resolved.

@faguayot

Hello @rahulguptajss,
I was trying the nightly build, but I don't know the name of the new metric you created for health. I checked health_ha_alerts, which is an existing metric, but I couldn't find any information there.

The other health-related alerts that I have in my Prometheus database are the following:

image

Thanks.

@rahulguptajss
Contributor Author

@faguayot The names of the alerts will remain the same. The only difference is in the value: a value of 1 indicates that the alert is raised or active, while a value of 0 indicates that the alert is resolved.

For example, health_ha_alerts == 1 means there is an HA issue. Once the issue is resolved, health_ha_alerts == 0 is published to mark the earlier alert as resolved. Note that health_ha_alerts == 0 is not published continuously: it is published only when an issue previously reported as health_ha_alerts == 1 is resolved, and only once per resolved issue instance.
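The publishing rule described above can be sketched roughly as follows. This is only an illustration of the semantics, not Harvest's actual code: the function `health_samples`, the instance key `"node-a"`, and the idea of comparing two polls are all assumptions made for the example.

```python
def health_samples(previous_active, current_active):
    """Return (samples, new_active) for one poll.

    previous_active: set of instance keys that were alerting on the last poll
    current_active:  set of instance keys that are alerting on this poll
    Both names are hypothetical; they only model the rule from the thread.
    """
    samples = {}
    # An instance that is still alerting is published with value 1 on every poll.
    for key in current_active:
        samples[key] = 1
    # An instance that was alerting and no longer is gets a single 0.
    for key in previous_active - current_active:
        samples[key] = 0
    return samples, set(current_active)


# Four consecutive polls: the issue is present twice, then resolved.
polls = [{"node-a"}, {"node-a"}, set(), set()]
active = set()
history = []
for current in polls:
    samples, active = health_samples(active, current)
    history.append(samples)
```

In this sketch the third poll publishes the single 0 for `"node-a"`, and the fourth poll publishes nothing, which matches the "once per resolved issue instance" behavior described above.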

@faguayot

OK, in that case I can't check the information until something happens with HA. I thought you would write the good state (a 0 in this case) every time, and write a 1 (failed state) when something happens. Could you test it and show me the information available for the metric?

When something happens in our environment, I will review these values. Thanks for the implementation.

@rahulguptajss
Contributor Author

@faguayot We only record a failed state (1) when an issue occurs. Once the issue is resolved, we write a good state (0) to signal the resolution. Below is an example with health_lif_alerts.

When the LIF is not home, we continuously publish the following for as long as the failure state is detected:

image

Once the LIF is back home, we publish a good state (0) once to indicate the issue is resolved:

image
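For anyone consuming these metrics, the LIF example suggests a simple way to read a series: only the most recent sample matters, and the absence of any sample means the alert was never raised. Here is a minimal sketch of that interpretation; the helper `alert_state` and its return strings are hypothetical and not part of Harvest or Prometheus.

```python
def alert_state(values):
    """Classify one alert instance from its samples, ordered oldest to newest.

    values: list of 0/1 samples for a single instance (hypothetical input shape).
    """
    if not values:
        # No 0 is published while everything is healthy, so an empty
        # series means no alert was ever raised for this instance.
        return "never raised"
    # A trailing 1 means the alert is still active; a trailing 0 is the
    # one-time resolution marker.
    return "active" if values[-1] == 1 else "resolved"


lif_not_home = alert_state([1, 1, 1])
lif_back_home = alert_state([1, 1, 1, 0])
no_samples = alert_state([])
```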

@Hardikl
Contributor

Hardikl commented Aug 9, 2024

Verified in 24.08 with commit 4e3945c

When the home port is changed via the UI:
image

When the port is reverted to the home port via the UI:
image
