Incompatibility between default retry settings and timeout settings #30305
Labels: bug (Something isn't working), closed as inactive, exporter/prometheusremotewrite, priority:p2 (Medium), Stale
Component(s)
exporter/prometheusremotewrite
What happened?
Description
The Prometheus remote write exporter implements its own retry logic (implemented in this PR) and does not use the queued_retry from the exporter helper. This is necessary to avoid out-of-order samples: data is split into smaller chunks, which are submitted to workers that send them to the backend and retry on failure. Each time series is guaranteed to appear in only one chunk, which guarantees that there won't be out-of-order samples.
This component uses the default timeout setting of 5s. However, the retry settings are not consistent with this value: each request performed by a worker can be retried for up to 1 minute.
Therefore, there is a high chance of timeout errors in case of consecutive retries, because the 5s timeout can expire long before the 1-minute retry window is used up.
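To make the mismatch concrete, here is a minimal sketch, not the exporter's actual code. It assumes a single timeout covers the whole retry loop and uses a fixed backoff; `retryUntil`, `outerTimeoutWins`, and `errUnavailable` are hypothetical names, and the durations are scaled down 100x from the 5s and 1 minute values above so the program finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errUnavailable stands in for a retryable backend failure (hypothetical).
var errUnavailable = errors.New("backend unavailable")

// retryUntil keeps calling send until it succeeds, the retry budget
// (maxElapsed) is spent, or ctx is done, whichever comes first.
// It returns the number of attempts made and the final error.
func retryUntil(ctx context.Context, maxElapsed, backoff time.Duration, send func(context.Context) error) (int, error) {
	deadline := time.Now().Add(maxElapsed)
	attempts := 0
	for {
		attempts++
		err := send(ctx)
		if err == nil {
			return attempts, nil
		}
		// Stop if waiting for the next attempt would exceed the retry budget.
		if time.Now().Add(backoff).After(deadline) {
			return attempts, err
		}
		select {
		case <-ctx.Done():
			// The outer timeout fired while retries were still allowed.
			return attempts, ctx.Err()
		case <-time.After(backoff):
		}
	}
}

// outerTimeoutWins reports whether a run against an always-failing backend
// ends because of the outer timeout rather than the retry budget.
// All arguments are in milliseconds.
func outerTimeoutWins(timeoutMs, maxElapsedMs, backoffMs int) bool {
	ms := time.Millisecond
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*ms)
	defer cancel()
	_, err := retryUntil(ctx, time.Duration(maxElapsedMs)*ms, time.Duration(backoffMs)*ms,
		func(context.Context) error { return errUnavailable })
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	// 50ms timeout vs. 600ms retry budget (5s vs. 1m, scaled down).
	fmt.Println(outerTimeoutWins(50, 600, 10))
}
```

With a timeout shorter than the retry budget, the loop ends with `context.DeadlineExceeded` and most of the budget is never used. Swapping the two values makes the retry budget the limiting factor instead.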
Expected Result
We expect the timeout setting to be consistent with the retry logic implemented inside the component.
Actual Result
Consecutive retries can generate timeout errors.
Proposal
We propose removing the timeout from the exporter helper and instead setting a timeout on the context just before each request is sent to the backend.
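A rough sketch of what a per-request timeout could look like, under stated assumptions: `sendWithPerAttemptTimeout` and `attemptsUntilSuccess` are hypothetical names, the retry loop is capped by a maximum number of attempts instead of the exporter's real backoff settings, and the backend is simulated by a function that hangs for the first few calls.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sendWithPerAttemptTimeout retries send up to maxAttempts times. Each
// attempt gets a fresh context with its own timeout derived from parent,
// so a slow attempt cannot use up the time allowed for later attempts.
func sendWithPerAttemptTimeout(parent context.Context, timeout time.Duration, maxAttempts int, send func(context.Context) error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ctx, cancel := context.WithTimeout(parent, timeout)
		err = send(ctx)
		cancel() // release the timer as soon as the attempt finishes
		if err == nil {
			return attempt, nil
		}
		// Give up early if the caller itself cancelled or shut down.
		if parent.Err() != nil {
			return attempt, parent.Err()
		}
	}
	return maxAttempts, err
}

// attemptsUntilSuccess simulates a backend whose first slowAttempts requests
// hang until their timeout, then succeed. It returns the attempt number that
// succeeded, or -1 if every attempt failed.
func attemptsUntilSuccess(timeoutMs, slowAttempts, maxAttempts int) int {
	calls := 0
	send := func(ctx context.Context) error {
		calls++
		if calls <= slowAttempts {
			<-ctx.Done() // hung request: only ends when the attempt times out
			return ctx.Err()
		}
		return nil
	}
	n, err := sendWithPerAttemptTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond, maxAttempts, send)
	if err != nil {
		return -1
	}
	return n
}

func main() {
	// Two hung requests, each cut off at 20ms, followed by a successful one.
	fmt.Println(attemptsUntilSuccess(20, 2, 5))
}
```

The point of the design is that the timeout bounds a single request while the retry settings bound the total time, so the two no longer compete with each other.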
Collector version
v0.90.1
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response