
[Azure Monitor] Expected errors are reported in Application Insights for Blob storage operations #9908

Closed
sebader opened this issue Feb 11, 2020 · 14 comments
Labels
customer-reported Issues that are reported by GitHub users external to the Azure organization. Monitor Monitor, Monitor Ingestion, Monitor Query needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Attention Workflow: This issue is responsible by Azure service team. Service This issue points to a problem in the service.

Comments

@sebader
Member

sebader commented Feb 11, 2020

Describe the bug
I'm using Azure.Messaging.EventHubs.Processor (5.0.1) with an Event Hub (EH) that has 32 partitions. Every partition is checkpointed every 10 seconds (if new data has arrived). I have now started to notice in Application Insights that some of the checkpointing calls to Blob storage fail with a 412 error code.


Azure.RequestFailedException: The condition specified using HTTP conditional header(s) is not met.
RequestId:2130b973-701e-0130-5c14-e1c499000000
Time:2020-02-11T19:47:37.1509796Z
Status: 412 (The condition specified using HTTP conditional header(s) is not met.)

ErrorCode: ConditionNotMet

Headers:
Server: Windows-Azure-Blob/1.0,Microsoft-HTTPAPI/2.0
x-ms-request-id: 2130b973-701e-0130-5c14-e1c499000000
x-ms-client-request-id: f01bd21c-fda0-429a-b88f-2eb41a153efb
x-ms-version: 2019-02-02
x-ms-error-code: ConditionNotMet
Date: Tue, 11 Feb 2020 19:47:36 GMT
Content-Length: 252
Content-Type: application/xml

   at Azure.Storage.Blobs.BlobRestClient.Blob.SetMetadataAsync_CreateResponse(Response response)
   at Azure.Storage.Blobs.BlobRestClient.Blob.SetMetadataAsync(ClientDiagnostics clientDiagnostics, HttpPipeline pipeline, Uri resourceUri, Nullable`1 timeout, IDictionary`2 metadata, String leaseId, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, Nullable`1 ifMatch, Nullable`1 ifNoneMatch, String requestId, Boolean async, String operationName, CancellationToken cancellationToken)


I can also see the errors as "Client Error" in my Blob storage metrics. Most of the calls seem to work fine, but some produce the error. It looks like something inside the SDK to me, not directly related to my code.

Expected behavior
It should run without errors.

To Reproduce
Hard to tell. The errors also occur when there is almost zero load on the EH (only a handful of messages).
Happy to jump on a screen share if that helps.

Environment:

  • Name and version of the Library package used: Azure.Messaging.EventHubs.Processor (5.0.1)
  • Hosting platform or OS and .NET runtime version: .NET Core 3.1 in a Linux container running on AKS.
@triage-new-issues triage-new-issues bot added the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Feb 11, 2020
@tg-msft tg-msft added Client This issue points to a problem in the data-plane of the library. Event Hubs and removed needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. labels Feb 11, 2020
@tg-msft
Member

tg-msft commented Feb 11, 2020

//fyi @jsquire

@jsquire
Member

jsquire commented Feb 11, 2020

@kinelski: Can you take a look and see whether this is caused by the conditional access that we're making with ownership requests? I'm trying to determine whether these are expected, and whether Event Hubs or Storage is surfacing something as an error that it shouldn't.

@kinelski
Member

@sebader Thank you for reporting this issue.

Could you tell us how many Event Processor Client instances are being used in your scenario? This might help us understand the nature of the problem.

@sebader
Member Author

sebader commented Feb 12, 2020 via email

@jsquire
Member

jsquire commented Feb 12, 2020

@sebader: Apologies for the difficulties, and thank you for bringing this to our attention. While we certainly agree that having these errors appear is confusing and not the experience that we want to offer, it is, unfortunately, by design in the current implementation.

For context, the diagnostics emitting these errors come from the Storage client that Event Hubs uses, and they classify each call based entirely on the response. Because a 412 is a 4xx-series status, the diagnostics framework automatically interprets it as an error.

The Event Processor makes a conditional request when trying to claim ownership of a partition for processing. Because processor instances compete for ownership, many of these requests are expected to fail because another instance has already claimed the partition. Within Event Hubs, this is treated as a normal code path. However, by that point, the Storage client has already registered the failure with its diagnostics.

I've opened #9934 as a feature request for exposing the ability to treat service responses that are expected and normal for the consuming application as non-failures.
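To make the ownership race concrete, here is a minimal, language-neutral sketch in Python; it is not the .NET Event Hubs or Storage SDK. It models the ownership record as blob metadata guarded by an ETag, loosely mirroring the `ifMatch` parameter visible on `SetMetadataAsync` in the stack trace above. `InMemoryBlobStore`, `ConditionNotMet`, and `try_claim` are hypothetical names, and the sketch assumes the ownership record already exists. The point is only that when two processors read the same ETag and both try to claim, exactly one write succeeds and the other receives a 412, which the caller treats as a normal outcome.

```python
import uuid
from dataclasses import dataclass, field


class ConditionNotMet(Exception):
    """Stands in for an HTTP 412 (ConditionNotMet) response from storage."""


@dataclass
class Blob:
    metadata: dict
    # Opaque version tag; regenerated on every successful write.
    etag: str = field(default_factory=lambda: uuid.uuid4().hex)


class InMemoryBlobStore:
    """Hypothetical in-memory stand-in for Blob storage with If-Match semantics."""

    def __init__(self):
        self._blobs = {}

    def create(self, name, metadata):
        self._blobs[name] = Blob(dict(metadata))

    def get(self, name):
        return self._blobs.get(name)

    def set_metadata(self, name, metadata, if_match):
        blob = self._blobs.get(name)
        if blob is None or blob.etag != if_match:
            # The ETag the caller read is stale: another writer got there first.
            raise ConditionNotMet(f"412 ConditionNotMet for {name}")
        blob.metadata = dict(metadata)
        blob.etag = uuid.uuid4().hex
        return blob.etag


def try_claim(store, partition_id, owner_id, observed_etag):
    """Try to take ownership of a partition using the ETag read earlier."""
    try:
        store.set_metadata(f"ownership/{partition_id}", {"owner": owner_id},
                           if_match=observed_etag)
        return True
    except ConditionNotMet:
        # Losing the race is expected: report "not claimed" instead of an error.
        return False


store = InMemoryBlobStore()
store.create("ownership/0", {"owner": ""})
etag = store.get("ownership/0").etag  # both processors observe the same ETag

results = {
    "processor-a": try_claim(store, "0", "processor-a", etag),
    "processor-b": try_claim(store, "0", "processor-b", etag),
}
print(results)                            # {'processor-a': True, 'processor-b': False}
print(store.get("ownership/0").metadata)  # {'owner': 'processor-a'}
```

In this model the losing processor never sees an exception, yet a storage-level observer would still have recorded a 412 for its request, which matches the behavior described in the comment above.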

@lmolkova
Member

I've left a comment in #9934 (comment); in short, marking 4xx responses as failures is the Azure Monitor (Application Insights) approach.

If we attempt to change this on the Azure SDK side, it will become inconsistent with the rest of the Azure Monitor logic for handling 4xx status codes on incoming and outgoing requests.

One approach we provide is to add custom logic in your code to mark such failures as non-failures:
https://stackoverflow.com/questions/37533431/how-to-tell-application-insights-to-ignore-404-responses
https://docs.microsoft.com/en-us/azure/azure-monitor/app/api-filtering-sampling

This is a bit involved, but it allows you to customize almost everything.
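As a rough illustration of that kind of custom logic, here is a Python sketch using hypothetical types; it is not the Application Insights SDK, and the linked documentation covers the real .NET telemetry processor and initializer APIs. It re-labels dependency records whose outcome matches a known-benign combination (here HTTP 412 with the `ConditionNotMet` error code from the original report) as successful rather than dropping them. `DependencyRecord`, `EXPECTED_OUTCOMES`, `mark_expected_failures`, and the `target` label are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class DependencyRecord:
    """Hypothetical, simplified telemetry item for an outgoing call."""
    target: str       # hypothetical label for the called service, e.g. "blob"
    result_code: str  # HTTP status as a string, e.g. "412"
    error_code: str   # service error code, e.g. "ConditionNotMet"
    success: bool


# (target, HTTP status, service error code) combinations treated as benign here.
EXPECTED_OUTCOMES = {("blob", "412", "ConditionNotMet")}


def mark_expected_failures(record: DependencyRecord) -> DependencyRecord:
    """Re-label known-benign failures as successes instead of dropping them,
    so the call is still recorded but no longer counts as a failure."""
    key = (record.target, record.result_code, record.error_code)
    if not record.success and key in EXPECTED_OUTCOMES:
        record.success = True
    return record


records = [
    DependencyRecord("blob", "412", "ConditionNotMet", success=False),
    DependencyRecord("blob", "403", "AuthorizationFailure", success=False),
    DependencyRecord("blob", "200", "", success=True),
]
processed = [mark_expected_failures(r) for r in records]
print([r.success for r in processed])  # [True, False, True]
```

Re-labeling instead of filtering keeps the calls available for troubleshooting. The trade-off is that a rule like this would also mask a 412 that does signal a real problem, so it is worth scoping it as narrowly as possible.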

On the Azure Monitor side, I believe we should do a better job of helping you isolate such calls and identify them as noise. App Map, for example, has a filter to remove 4xx failures, and I think we should do more.

@sebader, can you please help me understand the issue a bit better?

  • Are you expecting to see only calls made directly by your code? That is, is what happens under the checkpointer a concern for you, and would you want to know about the underlying storage operations?
  • What problems do these failures introduce from a service monitoring perspective? Do they hide anything? Are you able to separate them from real issues? Are they just noise/confusion?

@sebader
Member Author

sebader commented Feb 13, 2020

@jsquire thanks for looking into it and for the thorough explanation!

@lmolkova First of all, I had no idea where the error came from. I just saw it popping up in my monitoring. From a user perspective, of course you get concerned if there are unexplained errors. A user does not know that, in this case, they represent "works as expected". If I, as a user, see errors that seem to be related to checkpointing, I get concerned that my checkpoints might not be written properly and that I will therefore run into issues. So I would say no, they are not just noise. Unless this is made clear to the user (and I don't really know what that could look like here), it raises concern.

I would also expect to see errors from underlying SDKs in my monitoring if they represent actual errors that I, as the app owner, need to take care of. When that's not the case, I would expect the SDK to hide them, or at least clearly mark them as noise in the monitoring, without me as the user having to look through GitHub, docs, etc. to find out what's going on and then manually build a filter.
And yes, from the issue that @jsquire created, I understand that this might not be so easy to do.

Does this make sense from a user perspective?

@AlexGhiondea AlexGhiondea changed the title [BUG] Event Hub Processor Host - creating errors when checkpointing to blob Event Hub Processor Host - creating errors when checkpointing to blob Mar 3, 2020
@AlexGhiondea AlexGhiondea added the customer-reported Issues that are reported by GitHub users external to the Azure organization. label Mar 3, 2020
@jsquire jsquire changed the title Event Hub Processor Host - creating errors when checkpointing to blob Event Hub Processor - creating errors when checkpointing to blob Mar 17, 2020
@jsquire
Member

jsquire commented Mar 17, 2020

@lmolkova: I'm not quite sure what the next steps would be here; there does not seem to be any action that the Event Hubs client library can directly take to influence the behavior, and legitimate concerns have been raised about the proposal in #9934.

Should we open an issue somewhere for consideration, or is this considered by design and something that we aren't able to influence?

@jsquire jsquire changed the title Event Hub Processor - creating errors when checkpointing to blob [Event Hub Processor] Expected errors are logged for Blob storage operations Apr 2, 2020
@jsquire jsquire changed the title [Event Hub Processor] Expected errors are logged for Blob storage operations [Event Hub Processor] Expected errors are reported in Application Insightsfor Blob storage operations Apr 2, 2020
@jsquire jsquire changed the title [Event Hub Processor] Expected errors are reported in Application Insightsfor Blob storage operations [Event Hub Processor] Expected errors are reported in Application Insights for Blob storage operations Apr 2, 2020
@jsquire jsquire added customer-response-expected Service This issue points to a problem in the service. Service Attention Workflow: This issue is responsible by Azure service team. and removed Event Hubs Client This issue points to a problem in the data-plane of the library. labels Apr 2, 2020
@ghost

ghost commented Apr 2, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azmonapplicationinsights.

@AlexGhiondea AlexGhiondea added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed customer-response-expected labels Apr 20, 2020
@AlexGhiondea AlexGhiondea added the question The issue doesn't require a change to the product in order to be resolved. Most issues start as that label Apr 22, 2020
@ghost ghost added the no-recent-activity There has been no recent activity on this issue. label Apr 30, 2020
@ghost

ghost commented Apr 30, 2020

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!


@sebader
Copy link
Member Author

sebader commented May 3, 2020

What feedback is expected from me here? Has anything changed?

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. no-recent-activity There has been no recent activity on this issue. labels May 3, 2020
@jsquire
Copy link
Member

jsquire commented May 4, 2020

There shouldn't be anything needed from you at this point, @sebader. Actions are needed from the Azure Monitor team. The bot was reacting to tags, but I don't believe those tags were accurate.

@jsquire jsquire changed the title [Event Hub Processor] Expected errors are reported in Application Insights for Blob storage operations [Azure Monitor] Expected errors are reported in Application Insights for Blob storage operations May 4, 2020
@pakrym
Contributor

pakrym commented Oct 4, 2021

Covered by #9934 & #19982

Projects
None yet
Development

No branches or pull requests

8 participants