
Feedback: 'Monitoring NServiceBus endpoints with Prometheus and Grafana' - Calculating a Failure Rate SLI #6572

Closed
bbrandt opened this issue Apr 11, 2024 · 7 comments

Comments

bbrandt (Contributor) commented Apr 11, 2024

@andreasohlund @lailabougria

Using this article as a guide, I am attempting to calculate a message-processing failure-rate SLI per microservice and per message type. When I do this, I have noticed that for some measured intervals the failures count can occasionally exceed the fetches count for some message types. As a result, the failure ratio can exceed 100% at those moments, which is a little odd.

Is this caused by immediate retries, where a single fetch could result in zero to many failures as well as zero or one success? If so, what would you recommend to more accurately represent the "number of attempted handler executions", which would be more appropriate for this calculation than "fetches"?

Example visualization where the SLI exceeds 1 (screenshot).

Example visualization of the two separate time series, fetches (yellow) and failures (green), overlaid (screenshot). In a few places failures leap above fetches.

To do this calculation across all my services and message types, I would use this PromQL:

sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_core_nservicebus_messaging_failures[$__rate_interval])) 
/ 
sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_core_nservicebus_messaging_fetches[$__rate_interval]))

To filter to a specific namespace and service, the PromQL would be:

sum (rate(nservicebus_core_nservicebus_messaging_failures{kubernetes_namespace="mynamespace",app_kubernetes_io_name="my-service-name", nservicebus_message_type=~"some.file.type.name.."}[$__rate_interval])) 
/ 
sum (rate(nservicebus_core_nservicebus_messaging_fetches{kubernetes_namespace="mynamespace",app_kubernetes_io_name="my-service-name", nservicebus_message_type=~"some.file.type.name.."}[$__rate_interval]))

Note: In production, we are still using prometheus-net rather than the OpenTelemetry Prometheus exporter, so I am not sure if metric names here will be exactly as you see them.

Calculation from the article (slightly different because I am interested in "unavailability"/failure rate):

SLO Calculation
Another common use case for the rate() function is calculating SLIs and checking whether you are violating your SLO/SLA. Google has released a popular book for site-reliability engineers. Here is how they calculate the availability of their services:
(SLI formula image from the article)
As the image shows, they calculate the rate of change of all requests that were not 5xx and then divide it by the rate of change of the total number of requests. If there are any 5xx responses, the resulting value will be less than one. You can, again, use this formula in your alerting rules with some specified threshold, so you get an alert when it is violated, or you can predict the near future with predict_linear and avoid SLA/SLO problems.
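
For reference, the availability calculation described there is usually written in PromQL along these lines (a sketch only; http_requests_total and its code label are generic HTTP examples, not metrics emitted by the NServiceBus sample):

sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

The failure-ratio queries above are just the complementary form of this: failed work divided by total work, rather than successful requests divided by all requests.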

Feedback for 'Monitoring NServiceBus endpoints with Prometheus and Grafana' https://docs.particular.net/samples/open-telemetry/prometheus-grafana/

Location in GitHub: https://github.com/Particular/docs.particular.net/blob/master/samples/open-telemetry/prometheus-grafana/sample.md

bbrandt changed the title from "Feedback: 'Monitoring NServiceBus endpoints with Prometheus and Grafana'" to "Feedback: 'Monitoring NServiceBus endpoints with Prometheus and Grafana' - Calculating a Failure Rate SLI" on Apr 15, 2024
bbrandt (Contributor, Author) commented Apr 15, 2024

After reviewing the code, it looks like retries could not be the cause. I do not know how failures could exceed fetches unless something strange is happening downstream of NServiceBus, such as the time series having a time offset in Prometheus, or unless my PromQL is broken in some way.

andreasohlund (Member) commented Apr 16, 2024

@bbrandt what transport are you using? (Some transports do in-memory retries, which might explain the numbers being off.)

bbrandt (Contributor, Author) commented Apr 16, 2024 via email

andreasohlund (Member) commented

@bbrandt Looking at the code (https://github.com/Particular/NServiceBus/blob/master/src/NServiceBus.Core/OpenTelemetry/Metrics/ReceiveDiagnosticsBehavior.cs#L18), it seems there should be no way to record more failures than fetches. Can you see if you can spot something?

Are you starting and stopping the endpoint during the tests?

> Azure Service Bus transport.

ASB doesn't do in-memory retries, https://github.com/Particular/NServiceBus.Transport.AzureServiceBus/blob/master/src/Transport/Receiving/MessagePump.cs#L296

Would you be able to share the code you are running? (Send an email to [email protected] if there are sensitive details and @lailabougria or myself will pick it up)

bbrandt closed this as completed Apr 19, 2024
bbrandt (Contributor, Author) commented Apr 19, 2024

Closing the issue. It's all on the Prometheus/Grafana side of things, with a dash of user error. I've been on other tasks for a bit, but when I get a chance to circle back I'll write something up about it here.

bbrandt (Contributor, Author) commented Apr 26, 2024

I never really tracked down exactly what this was, because I am still a Prometheus and Grafana novice, but I am certain it is either user error on my part or something between how Prometheus scrapes the metrics and how Grafana displays them.

I discovered that if I use a larger interval of 5m or higher, the strange spikes go away. This may be why all the PromQL examples in the NServiceBus docs use 5m as the interval. When I used $__rate_interval, it was resolving to 2m or lower at the times I saw the strange visual artifacts.
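
For example, the original failure-ratio query with the range pinned to 5m instead of $__rate_interval would look like this (same metric and label names as above; just a sketch of the interval change):

sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_core_nservicebus_messaging_failures[5m]))
/
sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_core_nservicebus_messaging_fetches[5m]))

A longer range window averages over more scrapes, which smooths out short-lived discrepancies between the two series.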

bbrandt (Contributor, Author) commented May 15, 2024

The reason I was seeing strange artifacts in my calculated failure-rate time series is that nservicebus_core_nservicebus_messaging_failures is not time-aligned with the nservicebus_core_nservicebus_messaging_fetches time series. The time offset between when a message appears in each of these series is variable, depending on how long the message takes to process.

For example, if I have a handler that takes 5 minutes to process and I send a single message at 9:00am, then nservicebus_core_nservicebus_messaging_fetches will have a value of 1 at 9:00am, while nservicebus_core_nservicebus_messaging_failures or nservicebus_core_nservicebus_messaging_successes will have a value of 1 at 9:05am.
With metrics scraped every minute, the time series may look like this:

| Time   | fetches | successes | calculation | bad result |
|--------|---------|-----------|-------------|------------|
| 9:00am | 1       | 0         | 0/1         | 0          |
| 9:01am | 0       | 0         | 0/0         | undefined  |
| 9:02am | 0       | 0         | 0/0         | undefined  |
| 9:03am | 0       | 0         | 0/0         | undefined  |
| 9:04am | 0       | 0         | 0/0         | undefined  |
| 9:05am | 0       | 1         | 1/0         | undefined  |

The way I am now calculating the failure rate is:
$failures / (successes + failures)$

In PromQL, in aggregate across all services and message types, this is:

sum (rate(nservicebus_messaging_failures_total[$__rate_interval])) 
/ 
(sum (rate(nservicebus_messaging_failures_total[$__rate_interval]))
+
sum (rate(nservicebus_messaging_successes_total[$__rate_interval])))

Or per namespace, service, and message type:

sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_failures_total[$__rate_interval])) 
/ 
(sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_failures_total[$__rate_interval]))
+
sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_successes_total[$__rate_interval])))

With the filtering we use on our Grafana dashboard, we end up with something like this:

sum by(app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_failures_total{kubernetes_namespace="$namespace", app_kubernetes_io_name=~"$service", nservicebus_message_type=~"$messagetype", nservicebus_failure_type=~"$failuretype"}[$__rate_interval])) 
/ 
(sum by(app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_failures_total{kubernetes_namespace="$namespace", app_kubernetes_io_name=~"$service", nservicebus_message_type=~"$messagetype", nservicebus_failure_type=~"$failuretype"}[$__rate_interval]))
+
sum by(app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_messaging_successes_total{kubernetes_namespace="$namespace", app_kubernetes_io_name=~"$service", nservicebus_message_type=~"$messagetype"}[$__rate_interval])))

Here service, messagetype, and failuretype are dashboard variables that allow multi-select.
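
For completeness, one common way to populate such multi-select variables is with label_values() queries against the same metrics (a sketch, assuming the Grafana Prometheus data source; the variable and label names mirror the ones used above):

namespace:   label_values(nservicebus_messaging_failures_total, kubernetes_namespace)
service:     label_values(nservicebus_messaging_failures_total{kubernetes_namespace="$namespace"}, app_kubernetes_io_name)
messagetype: label_values(nservicebus_messaging_failures_total{kubernetes_namespace="$namespace", app_kubernetes_io_name=~"$service"}, nservicebus_message_type)
failuretype: label_values(nservicebus_messaging_failures_total{kubernetes_namespace="$namespace", app_kubernetes_io_name=~"$service"}, nservicebus_failure_type)

Chaining the selectors like this keeps each dropdown limited to values that actually exist for the namespace and service already selected.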
