Feedback: 'Monitoring NServiceBus endpoints with Prometheus and Grafana' - Calculating a Failure Rate SLI #6572
Comments
@bbrandt what transport are you using? (Some transports do in-memory retries, which might explain the numbers being off.)
Azure Service Bus transport.
@bbrandt looking at the code https://github.com/Particular/NServiceBus/blob/master/src/NServiceBus.Core/OpenTelemetry/Metrics/ReceiveDiagnosticsBehavior.cs#L18 it seems like there should be no possibility of having more failures than fetches. Can you see if you can spot something? Are you starting and stopping the endpoint during the tests?
ASB doesn't do in-memory retries (https://github.com/Particular/NServiceBus.Transport.AzureServiceBus/blob/master/src/Transport/Receiving/MessagePump.cs#L296). Would you be able to share the code you are running? (Send an email to [email protected] if there are sensitive details, and @lailabougria or I will pick it up.)
Closing the issue. It's all on the Prometheus/Grafana side of things, with a dash of user error. I've been on other tasks for a bit, but when I get a chance to circle back I'll write something up about it here.
So, I never really tracked down exactly what this was, because I am still a Prometheus and Grafana novice. But I am certain it is either user error on my part or something in between how Prometheus scrapes the metrics and how Grafana displays them on the screen. I discovered that if I use a larger interval of 5m or higher, the strange spikes go away. This may be the reason all the PromQL examples I see in the NSB docs use 5m as the interval. When I used $__rate_interval, I noticed it was resolving to 2m or lower whenever I saw the strange visual artifacts.
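For illustration, this is the kind of difference I mean (a sketch only; the metric name assumes the OpenTelemetry exporter naming, so it won't match our prometheus-net names exactly):

```promql
# Short window: with sparsely updated counters, rate() may only see one or
# two samples per window, which produces the spiky artifacts.
rate(nservicebus_messaging_failures_total[2m])

# Wider window: enough samples land in each window and the spikes smooth out.
rate(nservicebus_messaging_failures_total[5m])
```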
The reason I was seeing strange artifacts in my calculated failure rate time series seems to come down to the rate window being narrower than the gap between related counter increments. For example, if I have a handler that takes 5 minutes to process and I send a message at 9:00am and no others after that, then the fetch counter moves at 9:00 but the failure counter doesn't move until around 9:05. With a 2m window the two increments land in different windows, so the ratio momentarily shows failures with no matching fetches; with a 5m or wider window they land together and the ratio behaves.
The way I am now calculating failure rate is as follows. In PromQL, aggregated across all services and message types, it is:
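Roughly this (metric names here follow the OpenTelemetry exporter convention, so they won't match our prometheus-net names exactly):

```promql
# Overall failure rate: failed processing attempts divided by fetched messages,
# using a 5m window to avoid the short-interval artifacts described above.
sum(rate(nservicebus_messaging_failures_total[5m]))
/
sum(rate(nservicebus_messaging_fetches_total[5m]))
```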
Or per namespace, service, and message type:
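Again a sketch; the label names (namespace, service, message_type) are assumptions, since ours come from the scrape configuration:

```promql
# Failure rate broken out per namespace, service, and message type.
sum by (namespace, service, message_type) (rate(nservicebus_messaging_failures_total[5m]))
/
sum by (namespace, service, message_type) (rate(nservicebus_messaging_fetches_total[5m]))
```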
With the filtering we use on our Grafana dashboard, we end up with something like this:
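(Sketched from memory, so the label names are illustrative rather than exact:)

```promql
# Same ratio, filtered by Grafana multi-select template variables.
sum by (service, message_type) (
  rate(nservicebus_messaging_failures_total{service=~"$service", message_type=~"$messagetype", failure_type=~"$failuretype"}[5m])
)
/
sum by (service, message_type) (
  rate(nservicebus_messaging_fetches_total{service=~"$service", message_type=~"$messagetype"}[5m])
)
```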
Here service, messagetype, and failuretype are Grafana variables that allow multi-select.
@andreasohlund @lailabougria
Using this article as a guide, I am attempting to calculate an SLI per microservice, per message type, for message processing failure rate. When I do this, for some measured intervals, I have noticed that the failure count can occasionally exceed the fetched count for some message types being processed. As a result, the failure ratio can exceed 100% at these moments, which is a little weird.
Is this caused by immediate retries, where a single fetch could result in zero to many failures as well as zero or one successes? If so, what would you recommend to more accurately represent the "number of attempted executions of a handler", which would be more appropriate for this calculation than "fetches"?
Example visualization where the SLI exceeds 1 (screenshot omitted).
Example visualization of the two separate time series, fetches (yellow) and failures (green), overlaid (screenshot omitted).
You can see a few places where failures leap above fetches.
To do this calculation across all my services and message types, I would use this PromQL:
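Something along these lines (metric names assume the OpenTelemetry exporter naming; under prometheus-net they will look different):

```promql
# Failure ratio across all services and message types:
# failed processing attempts divided by fetched messages.
sum(rate(nservicebus_messaging_failures_total[$__rate_interval]))
/
sum(rate(nservicebus_messaging_fetches_total[$__rate_interval]))
```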
To filter to a specific namespace and service, the PromQL would be:
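For example (the namespace and service label names, and their values, are placeholders):

```promql
# Same ratio restricted to one namespace and one service.
sum(rate(nservicebus_messaging_failures_total{namespace="example-namespace", service="example-service"}[$__rate_interval]))
/
sum(rate(nservicebus_messaging_fetches_total{namespace="example-namespace", service="example-service"}[$__rate_interval]))
```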
Note: In production, we are still using prometheus-net rather than the OpenTelemetry Prometheus exporter, so I am not sure if metric names here will be exactly as you see them.
Calculation from the article (slightly different because I am interested in "unavailability"/failure rate):
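I don't have the article's exact expression in front of me, but the general shape is an availability-style SLI (successful attempts over fetched messages); my variant just looks at the failing side instead:

```promql
# Availability-style SLI: successes as a fraction of fetches.
sum(rate(nservicebus_messaging_successes_total[5m]))
/
sum(rate(nservicebus_messaging_fetches_total[5m]))

# The "unavailability"/failure-rate flavour swaps successes for failures.
```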
Feedback for 'Monitoring NServiceBus endpoints with Prometheus and Grafana' https://docs.particular.net/samples/open-telemetry/prometheus-grafana/
Location in GitHub: https://github.com/Particular/docs.particular.net/blob/master/samples/open-telemetry/prometheus-grafana/sample.md