Shelf metrics should report on op-status for components #262

Closed
hashi825 opened this issue Jun 27, 2021 · 7 comments · Fixed by #295 or #313
Labels
bug (Something isn't working), status/done

Comments

@hashi825

Is your feature request related to a problem? Please describe.
Shelf metrics currently only have value mappings for state (which produces the shelf_status metric) and for psu-is-enabled (for shelf_psu_labels).
To detect a failure today you have to, for example, watch shelf_sensor_reading and see whether the value dropped to 0, which is pretty ambiguous and doesn't help pinpoint the issue.

Describe the solution you'd like
Have the default template harvest/conf/zapi/cdot/9.8.0/shelf.yaml use op-status from storage-shelf-info-get-iter for shelf components (bays, PSUs, fans, etc.).

Describe alternatives you've considered
None at the moment

Additional context
In a real environment shelf state is unlikely to be a very important metric, since shelves rarely change from online to offline, but failed power supplies or other components are extremely common. Each component in a shelf has its own op-status, which allows failures to be detected. These failures are also rolled up into the shelf-errors field, which reports storage-shelf-error-info. Those values are text (error-type, error-text), so they would be difficult to convert to time series, but their presence could still serve as a shelf health metric, i.e. 0 if empty, otherwise 1 (same concept as the disk_status outage-info field)?

@cgrinds
Collaborator

cgrinds commented Jul 13, 2021

@hashi825 does #295 address your ask?

cgrinds added the bug label and removed the feature label on Jul 13, 2021
@hashi825
Author

Yeah, that looks like it could work. You could use the label metrics instead of, say, shelf_status in an alert, i.e.:

expr: shelf_sensor_labels{status!="normal"}

Would it also be better to use op-status at the shelf level for the value mapping? That way shelf_status would report a 1/0 value based on op-status, which is more indicative of the actual status than state.

Also, I noticed that the 1/0 value mapping for status (not just in the shelf template but in others too) is back to front compared with typical up/status metrics in Prometheus. In Prometheus, 1 generally indicates healthy and 0 otherwise; this is pretty common across Prometheus exporters.

@Hardikl
Contributor

Hardikl commented Jul 14, 2021

@hashi825,
I need to check the feasibility of this: expr: shelf_sensor_labels{status!="normal"}. Let me check and update here.

Regarding the second ask, using op-status instead of state in shelf_status: I have made that change in the current PR.

And the last one,
[value mapping for status 1/0 (not just for the shelf template but for others) are back to front from what typical up/status metrics are in Prometheus.],
--> Yes, this is the same across all yamls: 0 indicates up/normal and 1 indicates the default (the unexpected outcome). I suggest we track this separately and handle it for all templates if required.

@cgrinds
Collaborator

cgrinds commented Jul 14, 2021

@hashi825 opened #306 about the status values - we'll update everything to use 1 for healthy. In cases where there are multiple kinds of failures would distinct values be preferable or would you rather have all of those mapped to zero?

@Hardikl
Contributor

Hardikl commented Jul 14, 2021

@hashi825
regarding the ask: expr: shelf_sensor_labels{status!="normal"}

With current changes:
shelf_sensor_labels metric would show these records:

shelf_sensor_labels{cluster="F8080-32-25", datacenter="DC-03", instance="localhost:12991", job="prometheus1", location="rear of the shelf on the lower left power supply", sensor_id="1", status="normal", shelf="1.0"} 1
shelf_sensor_labels{cluster="F8080-32-25", datacenter="DC-03", instance="localhost:12991", job="prometheus1", location="rear of the shelf on the lower left power supply", sensor_id="1", status="normal", shelf="1.10"} 1

Do you mean that the shelf_sensor_labels metric would only return results when the status is non-normal?
And when all sensors are normal (as in the example above), the metric wouldn't list any records?

@hashi825
Author

hashi825 commented Jul 14, 2021

Since status has now been added as a label, you can use this expression in Prometheus alerting to send alerts on non-normal statuses, e.g.

shelf_sensor_labels{status!="normal"} == 1

That should return all non-normal series, since we are selecting on the label value rather than the metric value. This works well for alerting because the label metrics carry more descriptive information, such as the location of sensors, than the status-related metrics.
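
For reference, here is a minimal sketch of how that expression might sit in a Prometheus rule file. The group name, alert name, `for` duration and severity are all assumptions chosen for illustration; the labels used in the annotations (`cluster`, `shelf`, `sensor_id`, `status`, `location`) are the ones visible in the sample series earlier in this thread. Note that the PromQL equality comparison is `==`.

```yaml
groups:
  - name: shelf-sensors              # hypothetical group name
    rules:
      - alert: ShelfSensorNotNormal  # hypothetical alert name
        # Select on the status label; the label metric's value is always 1.
        expr: shelf_sensor_labels{status!="normal"} == 1
        for: 5m                      # assumed hold time before firing
        labels:
          severity: warning          # assumed severity
        annotations:
          summary: "Sensor {{ $labels.sensor_id }} on shelf {{ $labels.shelf }} is {{ $labels.status }}"
          description: "Cluster {{ $labels.cluster }}, location: {{ $labels.location }}"
```

Because the rule matches on the label, the alert carries the sensor's location and status text with it, which is the advantage described above.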

@Hardikl
Contributor

Hardikl commented Jul 21, 2021

This is harvest 21.05.3-2 RPM: 10.140.133.43
image

This is harvest 21.05.4-1 RPM: 10.140.132.253
image

Child Objects:
Each child object's new status metric is available, and the status field has been added as a label in 21.05.4.
The status metric is 1 for the normal value and 0 for non-normal values.
Fan:
image

image

PSU:
image

image

Sensor:
image

image

Voltage:
image

image

Temperature:
image

image

The op_status column is visible in the shelf dashboard, and all child objects have a status metric with the required value mapping.
With that, moving this to the status/done state.
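
With the 1/0 mapping described above (1 for normal, 0 otherwise), an alert on a child object's status metric could look roughly like the sketch below. The metric name `shelf_psu_status` is an assumption based on the naming pattern of the other shelf metrics in this thread, and the `shelf` label is assumed to be present as it is on `shelf_sensor_labels`; check the exporter output for the actual names. The group name, alert name, duration and severity are hypothetical too.

```yaml
groups:
  - name: shelf-components        # hypothetical group name
    rules:
      - alert: ShelfPsuNotNormal  # hypothetical alert name
        # shelf_psu_status is an assumed metric name.
        # 1 = normal, 0 = any non-normal op-status.
        expr: shelf_psu_status == 0
        for: 5m                   # assumed hold time before firing
        labels:
          severity: critical      # assumed severity
        annotations:
          summary: "PSU on shelf {{ $labels.shelf }} is not normal"
```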

4 participants