Alert for HA Pair Going Down #2315
Comments
Paqui on Discord has a similar request.
@johnwarlick I have done some analysis around this. Could you please confirm whether the following addresses your use case? This proposal aims to address the issue described in #2315, where we want to monitor the status of a High Availability (HA) pair using Harvest. There are several EMS events that can indicate issues with an HA pair, including:
Case 1: Interconnect Down. These EMS events primarily capture cases where the interconnect in an HA pair has gone down. This can be simulated with a command in diagnostic mode.
Case 2: Node Unhealthy or Down. There are also cases where a node in an HA pair is unhealthy or has been taken down. To address this, we propose to monitor the ZAPI cf-get-iter field takeover-by-partner-possible.
We have been using the Health dashboard from 23.05.0 to check for nodes going down. When combined with the approach you outlined, I believe it solves this. The only edge case I am slightly unsure about is an HA pair going down suddenly and simultaneously, e.g. a datacenter mishap. I believe ha.takeover_check.takeover_possible should still return false, and the health dashboard should still show the nodes as down, correct?
@johnwarlick Your observation regarding the edge case is correct. In scenarios where both nodes of an HA pair are down, the health dashboard should still report the nodes as down. We plan to add a new metric that is published when takeover is not possible.
@johnwarlick @faguayot Changes are available via the nightly build.
@rahulguptajss I have deployed the new version, but I am not seeing the new metric.
@faguayot This metric is only available when there is a problem; it is not published while the HA pair is healthy.
@rahulguptajss So it only has a failed status? I mean, it isn't a constant check that reports the node(s) as OK, right? A state is only published when a problem happens.
Verified. The new metric is reported as follows:
health_ha_alerts{cluster="umeng-aff300-05-06", datacenter="nane", instance="localhost:12991", job="prometheus", node="umeng-aff300-05", partner="umeng-aff300-06", partner_state="Up", severity="error", state_description="Connected to umeng-aff300-06, Takeover is not possible: Storage failover is disabled", takeover_possible="false"}
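For anyone who wants to alert on this from Prometheus, below is a minimal alerting-rule sketch based on the sample series above. The metric name and labels come from that sample; the group name, rule name, `for` duration, severity, and annotations are illustrative placeholders, not anything shipped by Harvest.

```yaml
groups:
  - name: harvest-ha  # placeholder group name
    rules:
      - alert: HATakeoverNotPossible  # placeholder rule name
        # Fires while Harvest publishes a health_ha_alerts series for a node
        # whose partner cannot take over; the bare selector matches any
        # series that is currently present, regardless of its value.
        expr: health_ha_alerts{takeover_possible="false"}
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HA takeover not possible on {{ $labels.node }}"
          description: "{{ $labels.state_description }}"
```

Because the metric only exists while the problem is present, an alert like this resolves on its own once Harvest stops publishing the series.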
Hello. Below are the metrics we receive. In summary, if no value is published after we have received the metric at least once, we have no way of comparing against what we were receiving in order to recover (clear) the alert. Thanks.
@faguayot Yes, that is correct. Currently, we do not publish any resolution metrics for these health alerts; we only publish when there is an error. If an alert has been active for the last N minutes, then it indicates a problem; otherwise, it is considered auto-resolved. I understand that handling auto-resolution on the InfluxDB alerting side will be tricky. Could you open a feature request to add support for resolution alerts for these health metrics?
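If the windowed behaviour described above ("active in the last N minutes, otherwise auto-resolved") needs to be made explicit on the query side, one way to approximate it is with a range-vector expression. This is a sketch under the same assumptions as the earlier rule, placed under the same `groups`/`rules` section; the 15m window and the rule name are arbitrary.

```yaml
      - alert: HATakeoverNotPossibleRecent  # placeholder rule name
        # Active only if at least one health_ha_alerts sample was scraped in
        # the last 15 minutes; once Harvest stops publishing the series, the
        # expression returns nothing and the alert resolves.
        expr: count_over_time(health_ha_alerts{takeover_possible="false"}[15m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "HA health alert seen for {{ $labels.node }} in the last 15 minutes"
```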
On InfluxDB, at least in the way our monitoring team has deployed the stack in our organization, we need a value in order to recover an alert. If we don't receive any further information, the alert remains active, so there is no automatic recovery after some time. @rahulguptajss Yes, I can open a request for this.
Sure, thanks.
@rahulguptajss Apologies for the delay; I didn't have time to check this and open the ticket for the new feature. Thanks so much for creating it.
A customer has faced a couple of instances where an HA pair has gone down abruptly, and wants to know the best metric(s)/alert(s) to use to quickly get on top of any such future scenarios. They would prefer not to rely on AIQUM.
Based on feedback in Discord, it seems the EMS Collector is the best way to go. I have found the following EMS alerts:
A combination of one or more of these, and perhaps some other EMS event(s), could be used to create an HA pair down alert. I will pick back up on working on this in a Lab on Demand.
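Once the relevant EMS events are identified, a combined alert could simply OR them together. The sketch below assumes the Harvest EMS collector exposes matched events as an `ems_events` series with a `message` label; that metric name, the label name, and the event names in the selectors are assumptions/placeholders to be replaced with whatever the EMS collector actually emits for the events listed above.

```yaml
      - alert: HAPairDown  # placeholder rule name
        # Placeholder metric/label/event names: substitute the EMS events
        # identified above and the actual series emitted by the EMS collector.
        expr: |
          ems_events{message="callhome.hainterconnect.down"}
            or ems_events{message="cf.takeover.disabled"}
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Possible HA pair outage involving {{ $labels.node }}"
```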