Shelf metrics should report on op-status for components #262

hashi825 · 2021-06-27T23:32:02Z

Is your feature request related to a problem? Please describe.
Shelf metrics currently have only two value mappings: state, which creates the metric shelf_status, and psu-is-enabled for shelf_psu_labels.
For example, to detect a failure today you have to watch shelf_sensor_reading and check whether the value dropped to 0, which is ambiguous and doesn't help pinpoint the issue.

Describe the solution you'd like
Have the default template harvest/conf/zapi/cdot/9.8.0/shelf.yaml use op-status from storage-shelf-info-get-iter for shelf components (bays, PSUs, fans, etc.).

Describe alternatives you've considered
None at the moment

Additional context
In a real environment, shelf state is unlikely to be a very important metric, since shelves rarely change from online to offline, but failed power supplies and other failed components are extremely common. Each component in a shelf has its own op-status, which allows failures to be detected. These failures are also rolled up into the shelf-errors field, which reports storage-shelf-error-info. Those values are text data such as error-type and error-text, so they would be difficult to convert to time series, but their mere presence could serve as a shelf health metric, i.e. 0 if empty, otherwise 1 (the same concept as the disk_status outage-info field)?
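The "0 if empty, otherwise 1" idea for shelf-errors can be sketched as follows. This is only an illustration of the proposal, not Harvest code: the function name `shelf_error_flag` and the example record are hypothetical, and only the field names error-type and error-text come from the request above.

```python
from typing import Mapping, Sequence


def shelf_error_flag(errors: Sequence[Mapping[str, str]]) -> int:
    """Return 0 when the shelf reports no errors, 1 when at least one is present."""
    return 0 if len(errors) == 0 else 1


# Hypothetical records; the values are placeholders, not real ONTAP output.
no_errors: list = []
one_error = [{"error-type": "example-type", "error-text": "example text"}]

print(shelf_error_flag(no_errors))
print(shelf_error_flag(one_error))
```

The text fields themselves are discarded here; only their existence is turned into a number, which is what makes the value usable as a time series.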


cgrinds · 2021-07-13T12:06:39Z

@hashi825 does #295 address your ask?

hashi825 · 2021-07-14T01:36:04Z

Yeah, that looks like it could work. You could use the label metrics instead of, say, shelf_status in an alert, e.g.:

expr: shelf_sensor_labels{status!="normal"}

Would it also be better to use op-status at the shelf level for the value mapping? That way shelf_status would report a 1/0 value for op-status rather than state, which is more indicative of the shelf's status.

Also, I noticed that the 1/0 value mappings for status (not just in the shelf template but in others too) are back to front compared with typical up/status metrics in Prometheus. In Prometheus, 1 generally indicates healthy and 0 otherwise; this is common across Prometheus exporters.

Hardikl · 2021-07-14T09:52:46Z

@hashi825,
I need to check the feasibility of this: expr: shelf_sensor_labels{status!="normal"}. Let me check and update here.

Regarding the second ask, using op-status instead of state in shelf_status: I have made that change in the current PR.

And the last one,
[value mapping for status 1/0 (not just for the shelf template but for others) are back to front from what typical up/status metrics are in Prometheus.],
--> Yes, this is the same across all YAMLs: 0 indicates up/normal and 1 indicates the default (i.e. not the expected outcome). I suggest we track this separately and handle it for all templates if required.

cgrinds · 2021-07-14T12:29:31Z

@hashi825 I opened #306 about the status values; we'll update everything to use 1 for healthy. In cases where there are multiple kinds of failures, would distinct values be preferable, or would you rather have all of them mapped to zero?
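To make the two options in that question concrete, here is a rough sketch of both mappings. It assumes "normal" is the only healthy status, as in the alert expression earlier in the thread; the failure names "failed" and "missing", their codes, and both function names are hypothetical placeholders, not values taken from ONTAP or Harvest.

```python
HEALTHY_STATUS = "normal"

# Hypothetical failure kinds and codes, used only to illustrate the second option.
FAILURE_CODES = {"failed": 2, "missing": 3}


def binary_status(op_status: str) -> int:
    """Option 1: 1 for healthy, every kind of failure collapses to 0."""
    return 1 if op_status == HEALTHY_STATUS else 0


def distinct_status(op_status: str) -> int:
    """Option 2: 1 for healthy, a distinct code per known failure, 0 for anything else."""
    if op_status == HEALTHY_STATUS:
        return 1
    return FAILURE_CODES.get(op_status, 0)


for status in ("normal", "failed", "missing", "something-else"):
    print(status, binary_status(status), distinct_status(status))
```

With the binary mapping, an alert only needs to compare the value against 1. Distinct codes keep more detail in the value but require every consumer to know the code table, which is the kind of detail a status label can carry instead.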

Hardikl · 2021-07-14T13:47:27Z

@hashi825
Regarding the ask: expr: shelf_sensor_labels{status!="normal"}

With the current changes, the shelf_sensor_labels metric would show these records:

shelf_sensor_labels{cluster="F8080-32-25", datacenter="DC-03", instance="localhost:12991", job="prometheus1", location="rear of the shelf on the lower left power supply", sensor_id="1", status="normal", shelf="1.0"}	1
shelf_sensor_labels{cluster="F8080-32-25", datacenter="DC-03", instance="localhost:12991", job="prometheus1", location="rear of the shelf on the lower left power supply", sensor_id="1", status="normal", shelf="1.10"}	1

Do you mean that the shelf_sensor_labels metric would return results only when status is non-normal?
If all sensors are normal (as in the example above), would the metric list no records?

hashi825 · 2021-07-14T13:56:53Z

Since status has now been added as a label, you can use the expression in Prometheus alerting to send alerts on non-normal statuses, e.g.:

shelf_sensor_labels{status!="normal"} == 1

That should return all non-normal series, since we are querying on the label value rather than the metric value. It works well for alerting because the label metrics return more descriptive information, such as the location of sensors, than the status-related metrics do.
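The selection described here can be imitated in a few lines to show why the all-normal case returns nothing. This is a simplified model of a not-equal label matcher followed by a value comparison, not how Prometheus is implemented. The first two samples abbreviate the records quoted earlier in the thread; the third, with status "error", is a hypothetical failing sensor added for illustration.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]


def match_not_equal(samples: List[Sample], label: str, value: str) -> List[Sample]:
    """Keep samples whose label differs from value; a missing label counts as ""."""
    return [(labels, v) for labels, v in samples if labels.get(label, "") != value]


def value_equals(samples: List[Sample], target: float) -> List[Sample]:
    """Keep samples whose value equals target."""
    return [(labels, v) for labels, v in samples if v == target]


samples: List[Sample] = [
    ({"shelf": "1.0", "sensor_id": "1", "status": "normal"}, 1.0),
    ({"shelf": "1.10", "sensor_id": "1", "status": "normal"}, 1.0),
    # Hypothetical failing sensor, not taken from the thread.
    ({"shelf": "1.10", "sensor_id": "2", "status": "error"}, 1.0),
]

firing = value_equals(match_not_equal(samples, "status", "normal"), 1.0)
for labels, value in firing:
    print(labels["shelf"], labels["sensor_id"], labels["status"], value)
```

When every sensor is normal, the matcher removes all series and the result is empty, so an alert built on it stays silent. The series that do survive keep all their labels, which is where the descriptive detail comes from.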

Hardikl · 2021-07-21T15:13:41Z

This is harvest 21.05.3-2 RPM: 10.140.133.43

This is harvest 21.05.4-1 RPM: 10.140.132.253

Child Objects:
In 21.05.4, each child object has a new status metric, and the status field has been added as a label.
The status metric gives 1 for the normal value and 0 for non-normal values.
[Screenshots: Fan, PSU, Sensor, Voltage, Temperature]

The op_status column is visible in the shelf dashboard, and all child objects have a status metric with the required value mappings.
With that, I'm moving this to the status/done state.

hashi825 added the feature New feature or request label Jun 27, 2021

cgrinds assigned vgratian Jun 29, 2021

cgrinds added the status/open label Jun 29, 2021

cgrinds assigned Hardikl and unassigned vgratian Jul 12, 2021

cgrinds mentioned this issue Jul 13, 2021

feat: shelf metrics should report on op-status for components #295

Merged

cgrinds added bug Something isn't working and removed feature New feature or request labels Jul 13, 2021

Hardikl linked a pull request Jul 14, 2021 that will close this issue

feat: shelf metrics should report on op-status for components #295

Merged

cgrinds mentioned this issue Jul 14, 2021

Change Harvest metrics to use a one for healthy #306

Closed

Hardikl closed this as completed in #295 Jul 15, 2021

Hardikl linked a pull request Jul 19, 2021 that will close this issue

feat: shelf child metrics correction to add shelf_fan_status, etc #313

Merged

Hardikl added status/testme and removed status/open labels Jul 19, 2021

Hardikl added status/done and removed status/testme labels Jul 21, 2021

cgrinds unassigned Hardikl May 6, 2022
