
Alert for HA Pair Going Down #2315

Closed
johnwarlick opened this issue Aug 21, 2023 · 16 comments · Fixed by #2519
@johnwarlick

A customer has faced a couple of instances where an HA pair has gone down abruptly, and wanted to know the best metric(s)/alert(s) to use to quickly be on top of any such future scenarios. They would prefer to not rely on AIQUM.

Based on feedback in Discord, it seems the EMS Collector is the best way to go. I have found the following EMS alerts:

  • callhome.hainterconnect.down
  • ic.HAInterconnectDown
  • ic.HAInterconnectLinkDown
  • ic.linkStatusChange
  • cf.ic.heartBeatFailed

A combination of one or more of these, and perhaps some other EMS event(s), could be used to create an HA pair down alert. I will pick this back up in a Lab on Demand.
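If the EMS collector route is taken, these events could feed a Prometheus alerting rule. A minimal sketch, assuming Harvest exposes matched EMS events as an `ems_events` metric with the event name in the `message` label (check your Harvest version's EMS documentation for the exact metric and label names):

```yaml
groups:
  - name: ontap-ha-interconnect
    rules:
      - alert: HAInterconnectDown
        # Fires if any of the interconnect-related EMS events is active.
        expr: |
          ems_events{message=~"callhome.hainterconnect.down|ic.HAInterconnectDown|ic.HAInterconnectLinkDown|ic.linkStatusChange|cf.ic.heartBeatFailed"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HA interconnect issue on {{ $labels.node }}: {{ $labels.message }}"
```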

@johnwarlick johnwarlick added the feature New feature or request label Aug 21, 2023
@cgrinds cgrinds removed the 23.11 label Oct 10, 2023
@cgrinds cgrinds added the 24.02 label Nov 15, 2023
@cgrinds
Collaborator

cgrinds commented Nov 15, 2023

Paqui on Discord has a similar request.

@rahulguptajss
Contributor

rahulguptajss commented Nov 28, 2023

@johnwarlick I have done some analysis on this. Could you please confirm whether the following addresses your use case?

This proposal aims to address the issue described in #2315, where we want to monitor the status of a High Availability (HA) pair using Harvest.

There are several EMS events that can indicate issues with an HA pair, including:

  • callhome.hainterconnect.down
  • ic.HAInterconnectDown
  • ic.HAInterconnectLinkDown
  • ic.linkStatusChange
  • cf.ic.heartBeatFailed

Case 1: Interconnect Down

These EMS events primarily capture cases where the interconnect in an HA pair has gone down. This can be simulated using the following command in diagnostic mode:

system ha interconnect link off -node sti92-vsim-ucs531m -link 0

The storage failover show command will then show that a takeover is not possible due to an interconnect error and unsynchronized NVRAM log.

C1_sti92-vsim-ucs531m_1700727969::*> storage failover show                                           
                              Takeover          
Node           Partner        Possible State Description  
-------------- -------------- -------- -------------------------------------
sti92-vsim-ucs531m
               sti92-vsim-    false    Waiting for sti92-vsim-ucs531n,
               ucs531n                 Takeover is not possible: Storage
                                       failover interconnect error, NVRAM
                                       log not synchronized
sti92-vsim-ucs531n
               sti92-vsim-    false    Waiting for sti92-vsim-ucs531m,
               ucs531m                 Takeover is not possible: NVRAM log
                                       not synchronized

Case 2: Node Unhealthy or Down

There are also cases where a node in an HA pair has been taken down via system halt or other reasons. In these cases, the takeover possible status in the storage failover show command shows as false, but none of the above EMS events are raised.

To address this, we propose to monitor the takeover possible status for each node. If this status is false, an alert will be sent indicating that the HA pair is down. This approach will cover both Case 1 and Case 2.

ZAPI: cf-get-iter, field takeover-by-partner-possible
REST private CLI: /api/private/cli/storage/failover?fields=possible
REST public API: /api/cluster/nodes?fields=ha.takeover_check.takeover_possible (introduced in ONTAP 9.14)
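As a sketch of how the public REST endpoint could be consumed: the snippet below parses a hypothetical response from GET /api/cluster/nodes?fields=ha.takeover_check.takeover_possible (the field path is taken from the proposal above; the exact record layout shown is an assumption for illustration) and flags nodes whose partner cannot take over:

```python
import json

# Hypothetical sample of what /api/cluster/nodes?fields=ha.takeover_check.takeover_possible
# might return on ONTAP 9.14+; the record shape here is illustrative, not authoritative.
sample = json.loads("""
{
  "records": [
    {"name": "node-01", "ha": {"takeover_check": {"takeover_possible": false}}},
    {"name": "node-02", "ha": {"takeover_check": {"takeover_possible": true}}}
  ]
}
""")

def nodes_without_takeover(payload):
    """Return names of nodes whose HA partner cannot take over."""
    return [
        rec["name"]
        for rec in payload.get("records", [])
        if not rec.get("ha", {}).get("takeover_check", {}).get("takeover_possible", True)
    ]

print(nodes_without_takeover(sample))  # ['node-01']
```

Any node appearing in this list would trigger the proposed "HA pair down" alert, covering both Case 1 and Case 2.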

@johnwarlick
Author

johnwarlick commented Nov 28, 2023

We have been using the Health dashboard from 23.05.0 to check for nodes going down. When combined with the approach you outlined, I believe it solves it.

The only edge case I am slightly unsure about is an HA pair going down suddenly and simultaneously, e.g., a datacenter mishap. I believe ha.takeover_check.takeover_possible should still return false, and the health dashboard should still show the nodes as down, correct?

@rahulguptajss
Contributor

rahulguptajss commented Nov 29, 2023

@johnwarlick Your observation regarding the edge case is correct. In scenarios where both nodes of an HA pair are down, the Takeover Possible value is set to -. The screenshot below shows that the node health status is set to false in such cases. The same will be reflected in the health dashboard.

We plan to create a new metric when Takeover Possible != true. This metric will be integrated into the Health Dashboard.

[screenshot: node health status shown as false]

@rahulguptajss
Contributor

rahulguptajss commented Dec 6, 2023

@johnwarlick @faguayot The changes are available via the nightly build. HA down is tracked via the health dashboard. The metric name is health_ha_alerts.

@faguayot

@rahulguptajss I have deployed tonight's build, harvest version 23.12.11-nightly (commit 006cc7f0) (build date 2023-12-11T00:23:26-0500) linux/amd64, but I don't see the metric health_ha_alerts.

[screenshot]

@rahulguptajss
Contributor

@faguayot This metric is only published when an HA-down condition occurs.

@faguayot

@rahulguptajss So it only reports a failed state? I mean, it isn't a constant check that says the node(s) are OK, right? A value only appears when a problem happens.

@rahulguptajss
Contributor

Yes, that is correct. This metric is consumed in the Harvest Health dashboard as well.

[screenshot]

@rahulguptajss rahulguptajss removed their assignment Feb 12, 2024
@cgrinds cgrinds self-assigned this Feb 12, 2024
@cgrinds
Collaborator

cgrinds commented Feb 12, 2024

Verified in 24.02 84e2f7c

health_ha_alerts metric is published

health_ha_alerts{cluster="umeng-aff300-05-06", datacenter="nane", instance="localhost:12991", job="prometheus", node="umeng-aff300-05", partner="umeng-aff300-06", partner_state="Up", severity="error", state_description="Connected to umeng-aff300-06, Takeover is not possible: Storage failover is disabled", takeover_possible="false"} 1
health_ha_alerts{cluster="umeng-aff300-05-06", datacenter="nane", instance="localhost:12991", job="prometheus", node="umeng-aff300-06", partner="umeng-aff300-05", partner_state="Up", severity="error", state_description="Connected to umeng-aff300-05, Takeover is not possible: Storage failover is disabled", takeover_possible="false"}

[screenshot]
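Once the metric is published, a Prometheus alerting rule could consume it directly. A minimal sketch (the alert name, `for` duration, and annotation wording are illustrative, not part of Harvest):

```yaml
groups:
  - name: harvest-ha
    rules:
      - alert: HATakeoverNotPossible
        # health_ha_alerts is emitted with value 1 while takeover is not possible.
        expr: health_ha_alerts == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HA takeover not possible on {{ $labels.node }} ({{ $labels.cluster }})"
          description: "{{ $labels.state_description }}"
```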

@faguayot

Hello,
We finally could test this new information, but we can't use it for alerting: the alert metrics don't report anything when a node recovers its health. You can take a look at the data gathered in both databases we use, InfluxDB and Prometheus, but we must use InfluxDB for alerting.

Below are the metrics health_ha_alerts and health_node_alerts in Prometheus, and health_ha and health_node in InfluxDB:

  • InfluxDB
    [screenshot]
    [screenshot]

  • Prometheus
    [screenshot]
    [screenshot]

In summary, if no value is written once the problem clears, we have no way to compare against what we were receiving and detect the recovery.

Thanks.

@rahulguptajss
Contributor


@faguayot Yes, that is correct. Currently, we do not publish any resolution metrics for these health alerts; we only publish when there is an error. If an alert has been active for the last N minutes, then it indicates a problem; otherwise, it is considered auto-resolved. I understand that handling auto-resolution on the InfluxDB alerting side will be tricky.
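On the Prometheus side, the "active for the last N minutes" pattern can be expressed so the alert auto-resolves once Harvest stops publishing the sample. A sketch with an illustrative 15-minute window:

```yaml
groups:
  - name: harvest-ha-autoresolve
    rules:
      - alert: HAPairDown
        # Fires while any health_ha_alerts sample was seen in the last 15 minutes;
        # resolves automatically once no sample has been published for that long.
        expr: max_over_time(health_ha_alerts[15m]) >= 1
        labels:
          severity: critical
```

InfluxDB-based alerting, as noted above, has no equivalent without an explicit resolution data point, which is what the feature request would address.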

Could you open a feature request to add support for resolution alerts for these health metrics?

@faguayot

On InfluxDB, at least the way our monitoring team has deployed the stack in our organization, we need a value to resolve an alert. If we don't receive any other data point, the alert stays active. So we don't get automatic recovery after some time.

@rahulguptajss Yes, I can open a request for this.

@rahulguptajss
Contributor

Sure, thanks.

@rahulguptajss
Contributor

@faguayot I have created a feature request #2804 for this.

@faguayot

@rahulguptajss Apologies for the delay; I didn't have time to check this and open the ticket for the new feature. Thanks so much for creating it.
