Shelf metrics should report on op-status for components #262
Comments
Yea that looks like it could work, you could use the label metrics instead of say
Would it also be better to use
Also, I noticed that the value mapping for status 1/0 (not just for the shelf template but for others) is back to front compared with typical up/status metrics in Prometheus. In Prometheus, 1 generally indicates healthy and 0 otherwise; this is pretty common across all Prometheus exporters.
@hashi825, regarding the second ask for op-status instead of state in shelf_status, I have made the changes in the current PR. And last one,
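To illustrate the convention described above, here is a minimal sketch of a status-to-value mapping where 1 means healthy and 0 means anything else. It is not Harvest's implementation; the function name and the "error" status string are assumptions for illustration, and only "normal" appears in this thread.

```python
# Sketch of the Prometheus "up"-style convention: 1 = healthy, 0 = otherwise.
# HEALTHY_STATUSES and status_to_gauge are hypothetical names, not Harvest APIs.
HEALTHY_STATUSES = {"normal"}


def status_to_gauge(status: str) -> int:
    """Return 1 for a healthy status string and 0 for anything else."""
    return 1 if status.strip().lower() in HEALTHY_STATUSES else 0


if __name__ == "__main__":
    for status in ("normal", "error"):  # "error" is an assumed example value
        print(status, status_to_gauge(status))
```

With this orientation, an alert on a value of 0 reads the same way as alerts written against other exporters' up/status metrics.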
@hashi825 With current changes:
Do you mean that the shelf_sensor_labels metric would only give results when the status is non-normal?
Since status has now been added as a label, you can use the expression for Prometheus alerting to send alerts on non-normal statuses, e.g. `shelf_sensor_labels{status!="normal"} == 1`. That should return all non-normal series, since we are querying on the label value rather than the metric value. It works well for alerting because the label metrics return more descriptive information, such as the location of sensors, compared to the status-related metrics.
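The selection described in this comment can be sketched in plain Python to show why matching on the label keeps the descriptive labels in the result. This is only a model of the query's behaviour, not how Prometheus evaluates it; the `Series` type, the `shelf`, `sensor` and `location` label names, and the sample data are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Series:
    """A single time series sample: metric name, label set, current value."""
    name: str
    labels: dict = field(default_factory=dict)
    value: float = 0.0


def select_non_normal(series_list, metric="shelf_sensor_labels"):
    """Mimic the selector status != "normal" combined with a value of 1.

    A missing status label is treated as an empty string, so it also
    counts as "not normal" in this sketch.
    """
    return [
        s for s in series_list
        if s.name == metric
        and s.labels.get("status", "") != "normal"
        and s.value == 1
    ]


if __name__ == "__main__":
    # Hypothetical sample data; label names are illustrative only.
    samples = [
        Series("shelf_sensor_labels",
               {"shelf": "1.1", "sensor": "temp-1", "location": "front", "status": "normal"}, 1),
        Series("shelf_sensor_labels",
               {"shelf": "1.1", "sensor": "temp-2", "location": "rear", "status": "failed"}, 1),
    ]
    for s in select_non_normal(samples):
        print(s.labels)
```

Each returned series carries its full label set, so an alert built on it can report which sensor and location are affected.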
Is your feature request related to a problem? Please describe.
Shelf metrics currently only have a value mapping for `state` to create the metric `shelf_status`, and `psu-is-enabled` for `shelf_psu_labels`. Currently, for example, to detect a failure you have to use `shelf_sensor_reading` to see if that value dropped to 0, which is pretty ambiguous and doesn't help pinpoint the issue.
Describe the solution you'd like
Have the default template for `harvest/conf/zapi/cdot/9.8.0/shelf.yaml` use `op-status` from `storage-shelf-info-get-iter` for shelf components: bays, PSUs, fans, etc.
Describe alternatives you've considered
None at the moment
Additional context
In a real environment it's unlikely that shelf state is a very important metric, since shelves rarely change from online to offline, but it's extremely common to have failed power supplies or other components. Each of these components in a shelf has its own `op-status` that allows detecting failures. These failures are also rolled up into the `shelf-errors` field, which reports `storage-shelf-error-info`. These values report text data though, such as `error-type` and `error-text`, so they would potentially be difficult to convert to time series. They could still be used to verify their existence as a shelf health metric, i.e. 0 if empty, otherwise 1 (same concept as the disk_status outage-info field)?
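The "0 if empty, otherwise 1" idea above could look roughly like the following sketch. It assumes the `shelf-errors` entries have already been parsed into a list of dictionaries; the function name and that input shape are hypothetical, not part of Harvest or ZAPI.

```python
def shelf_errors_metric(shelf_errors) -> int:
    """Return 0 when there are no shelf error entries, 1 otherwise.

    shelf_errors is assumed to be a list of dicts with text fields such as
    "error-type" and "error-text"; None is treated the same as an empty list.
    """
    return 1 if shelf_errors else 0


if __name__ == "__main__":
    print(shelf_errors_metric([]))
    # The error values below are made-up placeholders.
    print(shelf_errors_metric([{"error-type": "example-type", "error-text": "example text"}]))
```

Note that this value means "errors present", so 1 is the unhealthy case here, unlike an up-style status metric.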