[Azure Monitor] Expected errors are reported in Application Insights for Blob storage operations #9908
Comments
//fyi @jsquire
@kinelski: Can you take a look and see if this is because of the conditional access that we're making with ownership requests? I'm trying to determine if these are expected and whether either Event Hubs or Storage is surfacing something that it shouldn't as an error.
@sebader Thank you for reporting this issue. Could you tell us how many Event Processor Client instances are being used in your scenario? This might help us understand the nature of the problem.
It's auto-scaling between 4 and 32 instances on AKS, depending on the load.
@sebader: Apologies for the difficulties and thank you for bringing this to our attention. While we certainly agree that having these errors appear is confusing and not the experience that we want to offer, it is, unfortunately, by design in the current implementation. For context, the diagnostics emitting these errors come from the Storage client that Event Hubs uses and are based entirely on the response. Because it is a 4xx series response, it is automatically interpreted by the diagnostics framework as an error. The Event Processor makes a conditional request when trying to claim ownership of a partition for processing. Because processor instances compete for ownership, it is expected that many of these requests do not succeed due to another instance having already claimed the partition. Within Event Hubs, this is treated as a normal code path. However, at that point, the Storage client has already registered the failure with its diagnostics. I've opened #9934 as a feature request for exposing the ability to treat service responses that are expected and normal for the consuming application as non-failures.
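To illustrate, here is a minimal sketch (not the Event Processor's actual internal code) of how an ETag-conditional Blob storage call surfaces a 412 when another instance wins the ownership race; the container, blob, and metadata names and the connection string are placeholders:

```csharp
using System;
using System.Collections.Generic;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

// Placeholder container/blob standing in for the processor's ownership blob.
BlobClient ownershipBlob =
    new BlobContainerClient("<storage-connection-string>", "checkpoint-container")
        .GetBlobClient("ownership/partition-0");

// The ETag observed when ownership information was last read.
ETag lastKnownETag = new ETag("\"0x8D7E4...\"");

try
{
    // Only succeeds if the blob has not changed since it was read (If-Match).
    await ownershipBlob.SetMetadataAsync(
        new Dictionary<string, string> { ["ownerid"] = Guid.NewGuid().ToString() },
        new BlobRequestConditions { IfMatch = lastKnownETag });
}
catch (RequestFailedException ex) when (ex.Status == 412)
{
    // Another processor instance already claimed the partition. This is a normal,
    // expected code path, but the Storage client's diagnostics have already
    // recorded the 412 response as a failed dependency call.
}
```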
I've left a comment here #9934 (comment); basically, marking 4xx responses as failures is the Azure Monitor (Application Insights) approach. If we attempt to change this from the Azure SDK side, it will become inconsistent with the rest of the Azure Monitor logic for handling 4xx status codes on incoming and outgoing requests. One approach we provide is to add some custom logic in your code to mark such failures as non-failures. This is a bit involved, but it allows you to customize almost everything. From the Azure Monitor side, I believe we should do a better job of helping you isolate such calls and indicating that they are noise. App Map, for example, has a filter to remove 4xx failures, and I think we should do more. @sebader, can you please help me understand the issue a bit better?
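As a rough sketch of that custom-logic option, assuming the standard Application Insights ITelemetryProcessor extension point, something along these lines can re-mark the expected 412 dependency calls as successful. The processor name is hypothetical, and the ResultCode check is an assumption to verify against the telemetry you actually see:

```csharp
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

// Hypothetical processor that re-marks expected 412 dependency calls as successful.
public class ExpectedBlobConflictProcessor : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;

    public ExpectedBlobConflictProcessor(ITelemetryProcessor next) => _next = next;

    public void Process(ITelemetry item)
    {
        // Assumption: the conditional ownership requests appear as dependency
        // telemetry with a 412 result code. Adjust the checks to match what you
        // observe in Application Insights.
        if (item is DependencyTelemetry dependency && dependency.ResultCode == "412")
        {
            dependency.Success = true;
        }

        _next.Process(item);
    }
}
```

In ASP.NET Core, a processor like this can be registered with services.AddApplicationInsightsTelemetryProcessor&lt;ExpectedBlobConflictProcessor&gt;().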
@jsquire thanks for looking into it and for the thorough explanation! @lmolkova First of all, I didn't have any idea where the error came from; I just saw it popping up in my monitoring. From a user perspective, of course you get concerned if there are unexplained errors. A user does not know that, in this case, those represent "works as expected". If I as a user see errors that seem to be related to checkpointing, I get concerned that my checkpoints might not be properly written and that I will run into issues. So I would say, no, they are not just noise. Without making it clear to the user (and I don't really know what that could look like here), it raises concern. I also would expect to see errors from underlying SDKs in my monitoring, if they represent actual errors that I as the app owner need to take care of. When that's not the case, I would expect the SDK to hide them, or at least clearly mark them as noise in the monitoring, without me as the user having to look through GitHub, docs, etc. to find out what's going on and then manually build a filter. Does this make sense from a user perspective?
@lmolkova: I'm not quite sure what the next steps would be here; there does not seem to be any action that the Event Hubs client library can directly take to influence the behavior, and there are legitimate considerations raised against the proposal in #9934. Should we open an issue somewhere for consideration, or is this something that is considered by design and that we aren't able to influence?
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azmonapplicationinsights.
Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!
What is the expected feedback here from me? Was there any change made?
There shouldn't be anything needed from you at this point, @sebader. Actions are needed from the Azure Monitor team. The bot was reacting to tags, but I don't believe those tags were accurate.
Describe the bug
I'm using Azure.Messaging.EventHubs.Processor (5.0.1) with an Event Hub that has 32 partitions. Every partition gets checkpointed every 10 seconds (if new data arrived). Now I've started to notice in Application Insights that some of the checkpointing calls to Blob storage fail with a 412 error code.
I can also see the errors as "Client Error" in my Blob storage metrics. Most of the calls seem to work fine, but some produce the error. This looks like something inside the SDK to me, not directly related to my code.
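For reference, the setup is roughly the standard EventProcessorClient checkpointing pattern; a sketch with placeholder connection strings and names (not the actual code from this report):

```csharp
using System;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Consumer;
using Azure.Storage.Blobs;

// Placeholder connection strings and names.
var checkpointContainer = new BlobContainerClient("<storage-connection-string>", "checkpoints");
var processor = new EventProcessorClient(
    checkpointContainer,
    EventHubConsumerClient.DefaultConsumerGroupName,
    "<event-hubs-connection-string>",
    "<event-hub-name>");

processor.ProcessEventAsync += async args =>
{
    // ... handle args.Data ...
    // The checkpoint is written to Blob storage; partition-ownership claims are the
    // conditional requests that can surface the 412 responses discussed in this issue.
    await args.UpdateCheckpointAsync(args.CancellationToken);
};

processor.ProcessErrorAsync += args =>
{
    Console.Error.WriteLine($"Error in partition '{args.PartitionId}': {args.Exception}");
    return Task.CompletedTask;
};

await processor.StartProcessingAsync();
```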
Expected behavior
Should run without errors.
To Reproduce
Hard to tell. The errors also appear when there is almost zero load on the Event Hub (only a handful of messages).
Happy to jump on a screen share if that helps.
Environment: