
Feedback: 'Monitoring NServiceBus endpoints with Prometheus and Grafana' - Calculating a Failure Rate SLI #6572

Closed
bbrandt opened this issue Apr 11, 2024 · 7 comments

Comments

bbrandt (Contributor) commented Apr 11, 2024

@andreasohlund @lailabougria

Using this article as a guide, I am attempting to calculate a message-processing failure-rate SLI per microservice and per message type. When I do this, I have noticed that for some measured intervals the failures count can occasionally exceed the fetches count for some message types. As a result, the failure ratio can exceed 100% at those moments, which is a little odd.

Is this caused by immediate retries, where a single fetch could result in zero to many failures as well as zero or one success? If so, what would you recommend to more accurately represent the "number of attempted handler executions", which would be more appropriate for this calculation than "fetches"?

Example visualization where the SLI exceeds 1 (screenshot).

Example visualization of the two separate time series, fetches (yellow) and failures (green), overlaid (screenshot). In a few places failures leap above fetches.

To do this calculation across all my services and message types, I would use this PromQL:

sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_core_nservicebus_messaging_failures[$__rate_interval])) 
/ 
sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_core_nservicebus_messaging_fetches[$__rate_interval]))

To filter to a specific namespace and service, the PromQL would be:

sum (rate(nservicebus_core_nservicebus_messaging_failures{kubernetes_namespace="mynamespace",app_kubernetes_io_name="my-service-name", nservicebus_message_type=~"some.file.type.name.."}[$__rate_interval])) 
/ 
sum (rate(nservicebus_core_nservicebus_messaging_fetches{kubernetes_namespace="mynamespace",app_kubernetes_io_name="my-service-name", nservicebus_message_type=~"some.file.type.name.."}[$__rate_interval]))

Note: In production, we are still using prometheus-net rather than the OpenTelemetry Prometheus exporter, so I am not sure if metric names here will be exactly as you see them.

Calculation from the article (slightly different because I am interested in "unavailability"/failure rate):

SLO Calculation
Another common use case for the rate() function is calculating SLIs and checking whether you are violating your SLO/SLA. Google has released a popular book for site-reliability engineers. Here is how they calculate the availability of their services:
(SLI formula image from the article)
As the image shows, they calculate the rate of change of all requests that were not 5xx and then divide it by the rate of change of the total number of requests. If there are any 5xx responses, the resulting value will be less than one. You can, again, use this formula in your alerting rules with some specified threshold, so you get an alert when it is violated, or you can predict the near future with predict_linear and avoid SLA/SLO problems.
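
For reference, the availability calculation described there is usually written in PromQL along these lines (a sketch only; http_requests_total and its code label are generic HTTP examples, not metrics emitted by the NServiceBus sample):

sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

The failure-ratio queries above are just the complementary form of this: failed work divided by total work, rather than successful requests divided by all requests.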

Feedback for 'Monitoring NServiceBus endpoints with Prometheus and Grafana' https://docs.particular.net/samples/open-telemetry/prometheus-grafana/

Location in GitHub: https://github.com/Particular/docs.particular.net/blob/master/samples/open-telemetry/prometheus-grafana/sample.md

bbrandt changed the title from "Feedback: 'Monitoring NServiceBus endpoints with Prometheus and Grafana'" to "Feedback: 'Monitoring NServiceBus endpoints with Prometheus and Grafana' - Calculating a Failure Rate SLI" on Apr 15, 2024
bbrandt (Contributor, Author) commented Apr 15, 2024

After reviewing the code, it looks like retries could not be the cause. I do not know how failures could exceed fetches unless something strange is happening downstream of NServiceBus, such as the time series having a time offset in Prometheus, or unless my PromQL is broken in some way.

andreasohlund (Member) commented Apr 16, 2024

@bbrandt what transport are you using? (Some transports do in-memory retries, which might explain the numbers being off.)

bbrandt (Contributor, Author) commented Apr 16, 2024 via email

andreasohlund (Member) commented

@bbrandt Looking at the code (https://github.com/Particular/NServiceBus/blob/master/src/NServiceBus.Core/OpenTelemetry/Metrics/ReceiveDiagnosticsBehavior.cs#L18), it seems there should be no way to record more failures than fetches. Can you see if you can spot something?

Are you starting and stopping the endpoint during the tests?

> Azure Service Bus transport.

ASB doesn't do in-memory retries, https://github.com/Particular/NServiceBus.Transport.AzureServiceBus/blob/master/src/Transport/Receiving/MessagePump.cs#L296

Would you be able to share the code you are running? (Send an email to [email protected] if there are sensitive details and @lailabougria or myself will pick it up)

bbrandt closed this as completed Apr 19, 2024
bbrandt (Contributor, Author) commented Apr 19, 2024

Closing the issue. It's all on the Prometheus/Grafana side of things, with a dash of user error. I've been on other tasks for a bit, but when I get a chance to circle back I'll write something up about it here.

bbrandt (Contributor, Author) commented Apr 26, 2024

I never really tracked down exactly what this was, because I am still a Prometheus and Grafana novice, but I am certain it is either user error on my part or something between how Prometheus scrapes the metrics and how Grafana displays them.

I discovered that if I use a larger interval of 5m or higher, the strange spikes go away. This may be why all the PromQL examples in the NServiceBus docs use 5m as the interval. When I used $__rate_interval, it was resolving to 2m or lower at the times I saw the strange visual artifacts.
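
For example, the original failure-ratio query with the range pinned to 5m instead of $__rate_interval would look like this (same metric and label names as above; just a sketch of the interval change):

sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_core_nservicebus_messaging_failures[5m]))
/
sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_core_nservicebus_messaging_fetches[5m]))

A longer range window averages over more scrapes, which smooths out short-lived discrepancies between the two series.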

bbrandt (Contributor, Author) commented May 15, 2024

The reason I was seeing strange artifacts in my calculated failure-rate time series is that nservicebus_core_nservicebus_messaging_failures is not time-aligned with the nservicebus_core_nservicebus_messaging_fetches time series. The time offset between when a message appears in each of these series is variable, depending on how long the message takes to process.

For example, if I have a handler that takes 5 minutes to process and I send a single message at 9:00am, then nservicebus_core_nservicebus_messaging_fetches will have a value of 1 at 9:00am, while nservicebus_core_nservicebus_messaging_failures or nservicebus_core_nservicebus_messaging_successes will have a value of 1 at 9:05am.
With metrics scraped every minute, the time series may look like this:

| Time   | fetches | successes | calculation | bad result |
|--------|---------|-----------|-------------|------------|
| 9:00am | 1       | 0         | 0/1         | 0          |
| 9:01am | 0       | 0         | 0/0         | undefined  |
| 9:02am | 0       | 0         | 0/0         | undefined  |
| 9:03am | 0       | 0         | 0/0         | undefined  |
| 9:04am | 0       | 0         | 0/0         | undefined  |
| 9:05am | 0       | 1         | 1/0         | undefined  |

The way I am now calculating the failure rate is:
$failures / (successes + failures)$

In PromQL, in aggregate across all services and message types, this is:

sum (rate(nservicebus_messaging_failures_total[$__rate_interval])) 
/ 
(sum (rate(nservicebus_messaging_failures_total[$__rate_interval]))
+
sum (rate(nservicebus_messaging_successes_total[$__rate_interval])))

Or per namespace, service, and message type:

sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_failures_total[$__rate_interval])) 
/ 
(sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_failures_total[$__rate_interval]))
+
sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_successes_total[$__rate_interval])))

With the filtering we use on our Grafana dashboard, we end up with something like this:

sum by(app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_failures_total{kubernetes_namespace="$namespace", app_kubernetes_io_name=~"$service", nservicebus_message_type=~"$messagetype", nservicebus_failure_type=~"$failuretype"}[$__rate_interval])) 
/ 
(sum by(app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_failures_total{kubernetes_namespace="$namespace", app_kubernetes_io_name=~"$service", nservicebus_message_type=~"$messagetype", nservicebus_failure_type=~"$failuretype"}[$__rate_interval]))
+
sum by(app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_successes_total{kubernetes_namespace="$namespace", app_kubernetes_io_name=~"$service", nservicebus_message_type=~"$messagetype"}[$__rate_interval])))

Here service, messagetype, and failuretype are dashboard variables that allow multi-select.
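
For completeness, one common way to populate such multi-select variables is with label_values() queries against the same metrics (a sketch, assuming the Grafana Prometheus data source; the variable and label names mirror the ones used above):

namespace:   label_values(nservicebus_messaging_failures_total, kubernetes_namespace)
service:     label_values(nservicebus_messaging_failures_total{kubernetes_namespace="$namespace"}, app_kubernetes_io_name)
messagetype: label_values(nservicebus_messaging_failures_total{kubernetes_namespace="$namespace", app_kubernetes_io_name=~"$service"}, nservicebus_message_type)
failuretype: label_values(nservicebus_messaging_failures_total{kubernetes_namespace="$namespace", app_kubernetes_io_name=~"$service"}, nservicebus_failure_type)

Chaining the selectors like this keeps each dropdown limited to values that actually exist for the namespace and service already selected.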
