Environment
sentry 0.29.1 with features contexts, panic, reqwest, and rustls
sentry-tower in version 0.29.1 with the feature http
sentry-tracing in version 0.29.1 with default features
Steps to Reproduce
This is most likely to be reproduced with traces_sample_rate set to 1.0, but it will eventually happen with lower sample rates as well.
1. Configure the DSN to something that will cause requests to run into a timeout.
2. Instrument your code with tracing and register the layer.
3. Run your code in such a way that it produces 32 or more traces within a few seconds.
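Putting the steps above together, here is a minimal reproduction sketch of the kind of setup described. The DSN, span names, and loop bound are placeholders, and the default sentry-tracing span filtering may differ between versions:

```rust
use tracing_subscriber::prelude::*;

fn main() {
    // Placeholder DSN: a non-routable address so HTTP requests hang until the
    // transport's network timeout instead of failing immediately.
    let _sentry = sentry::init((
        "https://public@10.255.255.1/1",
        sentry::ClientOptions {
            traces_sample_rate: 1.0,
            ..Default::default()
        },
    ));

    // Register the sentry layer on top of the tracing subscriber.
    tracing_subscriber::registry()
        .with(sentry_tracing::layer())
        .init();

    // Produce well over 32 sampled transactions in quick succession.
    for i in 0..64 {
        let span = tracing::info_span!("unit_of_work", iteration = i);
        let _entered = span.enter();
        // At the end of each iteration the finished span's transaction is handed
        // to the transport; with the transport stuck, roughly the 32nd iteration
        // blocks here.
    }
}
```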
Expected Result
The overhead from the sentry tracing layer should be minimal.
Actual Result
The instrumented code is blocking, waiting for the traces to be sent to sentry.
Detailed explanation
I've spent almost a week tracking down hanging integration tests after moving a test machine from one location to another.
I ultimately found out that we were configuring a DSN of https://[email protected]/1 in integration tests, which, of course, is not a valid DSN, but it had always worked fine because it caused connection errors immediately. Only on the particular network the machine was moved to did it not fail immediately; instead it ran into a timeout of somewhere between ~30s and ~90s. This led to a channel in sentry filling up and the integration tests essentially hanging for hours in CI.
The channel in question is used in TransportThread here: sentry-rust/sentry/src/transports/tokio_thread.rs, line 29 (at commit 616587b).
It is a completely synchronous, bounded channel with a capacity of 30. That's why things start slowing down on the 32nd request: one envelope is already in flight, 30 are sitting in the channel, and the 32nd is the one that hangs, waiting to be written into the channel.
The envelopes are enqueued here: sentry-rust/sentry/src/transports/tokio_thread.rs, lines 87 to 89 (at commit 616587b).
And this is the code that was blocked waiting for the network timeout: sentry-rust/sentry/src/transports/reqwest.rs, line 71 (at commit 616587b).
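To make that arithmetic concrete, here is a standalone illustration of the failure mode using std::sync::mpsc::sync_channel with the same capacity of 30 and a deliberately stalled consumer. This is not the actual sentry-rust code, just a model of its queueing behaviour:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

fn main() {
    // Bounded, synchronous channel with capacity 30, like the transport queue.
    let (tx, rx) = sync_channel::<u32>(30);

    // A consumer that gets stuck on its first item, standing in for a transport
    // thread waiting on a network timeout.
    thread::spawn(move || {
        for envelope in rx {
            thread::sleep(Duration::from_secs(60));
            println!("sent envelope {envelope}");
        }
    });

    for i in 1..=32 {
        println!("queueing envelope {i}");
        // Envelope 1 is picked up by the consumer, envelopes 2..=31 fill the
        // 30-slot buffer, and envelope 32 blocks right here.
        tx.send(i).unwrap();
    }
    println!("only reached once the consumer frees a slot (after ~60s here)");
}
```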
Although this issue didn't happen in production for us, it very well could if that same call blocks for some reason (e.g. a network issue or an issue in your sentry.io backend infrastructure). Imagine, for example, that sentry.io has an issue where incoming requests are accepted but no data flows: every customer using sentry-rust would suddenly find that every sampled operation blocks for multiple tens of seconds.
How to fix
The easiest fix I can come up with is to start dropping transactions if the TransportThread can't keep up, because I think a running application is more important than not losing any traces.
This could, for example, be achieved by changing sentry-rust/sentry/src/transports/tokio_thread.rs, lines 87 to 89 (at commit 616587b):

    pub fn send(&self, envelope: Envelope) {
        let _ = self.sender.send(Task::SendEnvelope(envelope));
    }

To:

    pub fn send(&self, envelope: Envelope) {
        if self.sender.try_send(Task::SendEnvelope(envelope)).is_err() {
            // Some error message that an envelope was dropped
        }
    }
There might be other ways or places to achieve that same goal. But ultimately, the tracing instrumentation should not, under any circumstance, be allowed to block the instrumented code to the point of waiting for an HTTP request to complete.
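One sketch of such an alternative, purely illustrative and assuming the queue were something like crossbeam-channel (which is not necessarily what sentry-rust uses): bound how long a caller may wait before the envelope is dropped.

```rust
use std::time::Duration;
use crossbeam_channel::{bounded, SendTimeoutError};

// Stand-in for the real transport task type; purely illustrative.
enum Task {
    SendEnvelope(u32),
}

fn main() {
    let (sender, _receiver) = bounded::<Task>(30);

    // Fill the queue to simulate a transport thread that is stuck on a timeout.
    for i in 0..30 {
        sender.send(Task::SendEnvelope(i)).unwrap();
    }

    // Callers wait at most 10ms for a free slot; after that the envelope is
    // dropped, so a stalled transport can never block the instrumented code
    // for longer than the grace period.
    match sender.send_timeout(Task::SendEnvelope(31), Duration::from_millis(10)) {
        Ok(()) => {}
        Err(SendTimeoutError::Timeout(_)) | Err(SendTimeoutError::Disconnected(_)) => {
            eprintln!("transport queue is full or gone; dropping envelope");
        }
    }
}
```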
Dropping envelopes is a good quick fix. I don’t think we want to move to an unbounded channel either, as that would rather lead to runaway memory usage.
Actually sending more envelopes concurrently with the async-capable transports might be a middle-ground solution: send as much as possible while not blocking.
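A rough sketch of that middle ground, assuming a tokio-based transport loop; the names Envelope and send_envelope, the concurrency limit of 8, and the queue size are illustrative, not the actual sentry-rust internals:

```rust
use futures_util::StreamExt;
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;

struct Envelope;

// Stand-in for the real HTTP send; in sentry-rust this would be the
// reqwest-based transport call.
async fn send_envelope(_envelope: Envelope) {
    // network I/O happens here
}

async fn transport_loop(rx: mpsc::Receiver<Envelope>) {
    // Keep up to 8 requests in flight at once, so a single slow or hanging
    // request does not stall the whole queue behind it.
    ReceiverStream::new(rx)
        .for_each_concurrent(8, |envelope| send_envelope(envelope))
        .await;
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<Envelope>(30);
    let worker = tokio::spawn(transport_loop(rx));

    for _ in 0..64 {
        // try_send drops on a full queue instead of blocking the caller.
        let _ = tx.try_send(Envelope);
    }
    drop(tx);
    worker.await.unwrap();
}
```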