
Alert for HA Pair Going Down #2315

Closed
johnwarlick opened this issue Aug 21, 2023 · 16 comments · Fixed by #2519
@johnwarlick

A customer has faced a couple of instances where an HA pair has gone down abruptly, and wanted to know the best metric(s)/alert(s) to use to quickly be on top of any such future scenarios. They would prefer to not rely on AIQUM.

Based on feedback in Discord, it seems the EMS Collector is the best way to go. I have found the following EMS alerts:

  • callhome.hainterconnect.down
  • ic.HAInterconnectDown
  • ic.HAInterconnectLinkDown
  • ic.linkStatusChange
  • cf.ic.heartBeatFailed

A combination of one or more of these, and perhaps some other EMS event(s), could be used to create an HA pair down alert. I will pick this back up in a Lab on Demand.
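If the EMS collector route is taken, these events could feed a Prometheus alerting rule. A minimal sketch, assuming Harvest exposes matched EMS events as an `ems_events` metric with the event name in the `message` label (check your Harvest version's EMS documentation for the exact metric and label names):

```yaml
groups:
  - name: ontap-ha-interconnect
    rules:
      - alert: HAInterconnectDown
        # Fires if any of the interconnect-related EMS events is active.
        expr: |
          ems_events{message=~"callhome.hainterconnect.down|ic.HAInterconnectDown|ic.HAInterconnectLinkDown|ic.linkStatusChange|cf.ic.heartBeatFailed"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HA interconnect issue on {{ $labels.node }}: {{ $labels.message }}"
```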

@johnwarlick johnwarlick added the feature New feature or request label Aug 21, 2023
@cgrinds cgrinds removed the 23.11 label Oct 10, 2023
@cgrinds cgrinds added the 24.02 label Nov 15, 2023
@cgrinds
Collaborator

cgrinds commented Nov 15, 2023

Paqui on Discord has a similar request.

@rahulguptajss
Contributor

rahulguptajss commented Nov 28, 2023

@johnwarlick I have done some analysis on this. Could you please confirm whether the following addresses your use case?

This proposal aims to address the issue described in #2315, where we want to monitor the status of a High Availability (HA) pair using Harvest.

There are several EMS events that can indicate issues with an HA pair, including:

  • callhome.hainterconnect.down
  • ic.HAInterconnectDown
  • ic.HAInterconnectLinkDown
  • ic.linkStatusChange
  • cf.ic.heartBeatFailed

Case 1: Interconnect Down

These EMS events primarily capture cases where the interconnect in an HA pair has gone down. This can be simulated using the following command in diagnostic mode:

system ha interconnect link off -node sti92-vsim-ucs531m -link 0

The storage failover show command will then show that a takeover is not possible due to an interconnect error and unsynchronized NVRAM log.

C1_sti92-vsim-ucs531m_1700727969::*> storage failover show                                           
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
sti92-vsim-ucs531m
               sti92-vsim-    false    Waiting for sti92-vsim-ucs531n,
               ucs531n                 Takeover is not possible: Storage
                                       failover interconnect error, NVRAM
                                       log not synchronized
sti92-vsim-ucs531n
               sti92-vsim-    false    Waiting for sti92-vsim-ucs531m,
               ucs531m                 Takeover is not possible: NVRAM log
                                       not synchronized

Case 2: Node Unhealthy or Down

There are also cases where a node in an HA pair has been taken down via system halt or other reasons. In these cases, the takeover possible status in the storage failover show command shows as false, but none of the above EMS events are raised.

To address this, we propose to monitor the takeover possible status for each node. If this status is false, an alert will be sent indicating that the HA pair is down. This approach will cover both Case 1 and Case 2.

ZAPI: cf-get-iter, field takeover-by-partner-possible
REST private CLI: /api/private/cli/storage/failover?fields=possible
REST public API: /api/cluster/nodes?fields=ha.takeover_check.takeover_possible (introduced in ONTAP 9.14)
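As a sketch of how the public REST endpoint could be consumed: the snippet below parses a hypothetical response from GET /api/cluster/nodes?fields=ha.takeover_check.takeover_possible (the field path is taken from the proposal above; the exact record layout shown is an assumption for illustration) and flags nodes whose partner cannot take over:

```python
import json

# Hypothetical sample of what /api/cluster/nodes?fields=ha.takeover_check.takeover_possible
# might return on ONTAP 9.14+; the record shape here is illustrative, not authoritative.
sample = json.loads("""
{
  "records": [
    {"name": "node-01", "ha": {"takeover_check": {"takeover_possible": false}}},
    {"name": "node-02", "ha": {"takeover_check": {"takeover_possible": true}}}
  ]
}
""")

def nodes_without_takeover(payload):
    """Return names of nodes whose HA partner cannot take over."""
    return [
        rec["name"]
        for rec in payload.get("records", [])
        if not rec.get("ha", {}).get("takeover_check", {}).get("takeover_possible", True)
    ]

print(nodes_without_takeover(sample))  # ['node-01']
```

Any node appearing in this list would trigger the proposed "HA pair down" alert, covering both Case 1 and Case 2.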

@johnwarlick
Author

johnwarlick commented Nov 28, 2023

We have been using the Health dashboard from 23.05.0 to check for nodes going down. When combined with the approach you outlined, I believe it solves it.

The only edge case I am slightly unsure about is an HA pair going down suddenly and simultaneously, e.g., a datacenter mishap. I believe ha.takeover_check.takeover_possible should still return false, and the health dashboard should still show the nodes as down, correct?

@rahulguptajss
Contributor

rahulguptajss commented Nov 29, 2023

@johnwarlick Your observation regarding the edge case is correct. In scenarios where both nodes of an HA pair are down, the Takeover Possible value is set to -. The screenshot below shows that the node health status is set to false in such cases. The same will be reflected in the health dashboard.

We plan to create a new metric when Takeover Possible != true. This metric will be integrated into the Health Dashboard.

[screenshot: node health status shown as false]

@rahulguptajss
Contributor

rahulguptajss commented Dec 6, 2023

@johnwarlick @faguayot The changes are available via the nightly build. HA down is tracked via the health dashboard. The metric name is health_ha_alerts.

@faguayot

@rahulguptajss I have deployed tonight's build, harvest version 23.12.11-nightly (commit 006cc7f0) (build date 2023-12-11T00:23:26-0500) linux/amd64, but I don't see the metric health_ha_alerts.

[screenshot]

@rahulguptajss
Contributor

@faguayot This metric is only published when an HA-down condition occurs.

@faguayot

@rahulguptajss So it only reports a failed state? I mean, it isn't a constant check that says the node(s) are OK, right? A value only appears when a problem happens.

@rahulguptajss
Contributor

Yes, that is correct. This metric is consumed in the Harvest Health dashboard as well.

[screenshot]

@rahulguptajss rahulguptajss removed their assignment Feb 12, 2024
@cgrinds cgrinds self-assigned this Feb 12, 2024
@cgrinds
Collaborator

cgrinds commented Feb 12, 2024

Verified in 24.02 84e2f7c

health_ha_alerts metric is published

health_ha_alerts{cluster="umeng-aff300-05-06", datacenter="nane", instance="localhost:12991", job="prometheus", node="umeng-aff300-05", partner="umeng-aff300-06", partner_state="Up", severity="error", state_description="Connected to umeng-aff300-06, Takeover is not possible: Storage failover is disabled", takeover_possible="false"} 1
health_ha_alerts{cluster="umeng-aff300-05-06", datacenter="nane", instance="localhost:12991", job="prometheus", node="umeng-aff300-06", partner="umeng-aff300-05", partner_state="Up", severity="error", state_description="Connected to umeng-aff300-05, Takeover is not possible: Storage failover is disabled", takeover_possible="false"}

[screenshot]
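Once the metric is published, a Prometheus alerting rule could consume it directly. A minimal sketch (the alert name, `for` duration, and annotation wording are illustrative, not part of Harvest):

```yaml
groups:
  - name: harvest-ha
    rules:
      - alert: HATakeoverNotPossible
        # health_ha_alerts is emitted with value 1 while takeover is not possible.
        expr: health_ha_alerts == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HA takeover not possible on {{ $labels.node }} ({{ $labels.cluster }})"
          description: "{{ $labels.state_description }}"
```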

@faguayot

Hello,
We finally could test this new information, but we can't use it for alerting: the alert metrics don't report anything when a node recovers its health. You can take a look at the data gathered in both databases we use, InfluxDB and Prometheus, but we must use InfluxDB for alerting.

Below are the metrics health_ha_alerts and health_node_alerts in Prometheus, and health_ha and health_node in InfluxDB:

  • InfluxDB
    [screenshot]
    [screenshot]

  • Prometheus
    [screenshot]
    [screenshot]

In summary, if no value is written once the problem clears, we have no way to compare against what we were receiving and detect the recovery.

Thanks.

@rahulguptajss
Contributor


@faguayot Yes, that is correct. Currently, we do not publish any resolution metrics for these health alerts; we only publish when there is an error. If an alert has been active for the last N minutes, then it indicates a problem; otherwise, it is considered auto-resolved. I understand that handling auto-resolution on the InfluxDB alerting side will be tricky.
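On the Prometheus side, the "active for the last N minutes" pattern can be expressed so the alert auto-resolves once Harvest stops publishing the sample. A sketch with an illustrative 15-minute window:

```yaml
groups:
  - name: harvest-ha-autoresolve
    rules:
      - alert: HAPairDown
        # Fires while any health_ha_alerts sample was seen in the last 15 minutes;
        # resolves automatically once no sample has been published for that long.
        expr: max_over_time(health_ha_alerts[15m]) >= 1
        labels:
          severity: critical
```

InfluxDB-based alerting, as noted above, has no equivalent without an explicit resolution data point, which is what the feature request would address.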

Could you open a feature request to add support for resolution alerts for these health metrics?

@faguayot

On InfluxDB, at least the way our monitoring team has deployed the stack in our organization, we need a value to resolve an alert. If we don't receive any other data point, the alert stays active. So we don't get automatic recovery after some time.

@rahulguptajss Yes, I can open a request for this.

@rahulguptajss
Contributor

Sure, thanks.

@rahulguptajss
Contributor

@faguayot I have created a feature request #2804 for this.

@faguayot

@rahulguptajss Apologies for the delay; I didn't have time to check this and open the ticket for the new feature. Thanks so much for creating it.
