"splunk_hec/platform_logs" and "splunk_hec/platform_metrics" "context deadline exceeded (Client.Timeout exceeded while awaiting headers)" #1454
Comments
@jvoravong can you please take a look and see if it makes sense to update the timeout value?
Here are the timeout and retry configurations:
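As a sketch, these are the splunk_hec exporter settings in question, with what I believe are the current defaults (taken from the exporter and the collector's exporterhelper; worth double-checking against the rendered collector config):

```yaml
exporters:
  splunk_hec/platform_logs:
    timeout: 10s                # HTTP client timeout for a single request
    retry_on_failure:
      enabled: true
      initial_interval: 5s      # first backoff interval
      max_interval: 30s         # backoff cap
      max_elapsed_time: 300s    # give up (and drop the data) after retrying this long
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000          # batches buffered in memory before new data is dropped
```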
When we start dropping metrics:
It's reasonable to increase the connection timeout to a value between 20-60s with no harmful effects. Users facing this issue could really use better default values, or at least the ability to configure retry_on_failure.max_elapsed_time and sending_queue.queue_size. The size of the sending queue directly affects memory usage, and retry_on_failure.max_elapsed_time affects how quickly the sending queue fills when the backend is under pressure. Without some sort of performance metrics or testing, I'm unsure what default values would be good for the retry_on_failure and sending_queue configs. Action items I'd recommend:
Thanks Josh!
An HTTP timeout of 10s is already pretty high. If an HTTP connection cannot be established within 10s, something must be wrong with the backend or somewhere in the middle, and I don't think tuning the timeout setting will help in those cases. I don't think we should expose this setting. Better to get open-telemetry/opentelemetry-collector-contrib#6803 merged and expose the full set of HTTP settings for those who really need them.
I believe it is either the timeout or the default max connections setting that causes this. Either way, the collector should be configured so it does not overwhelm the backend, so if we need testing to help validate, I'd suggest the team get with cloud engineering and have a proper look. I have multiple customers who hit this...
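For context, the HTTP client settings referenced above are the collector's generic confighttp options. A hedged sketch of what exposing them on this exporter could look like, assuming the splunk_hec exporter gains them once that contrib PR lands (which options it ultimately exposes, and their names, depend on the merged change; the numbers are illustrative):

```yaml
exporters:
  splunk_hec/platform_metrics:
    timeout: 10s
    # Generic collector HTTP client options (confighttp); availability on the
    # splunk_hec exporter depends on open-telemetry/opentelemetry-collector-contrib#6803.
    max_idle_conns: 200
    max_idle_conns_per_host: 200
    idle_conn_timeout: 10s
```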
Noticed the Splunk HEC exporter defaults for max_content_length_logs and max_content_length_metrics are set to 2MiB. The Splunk Cloud docs ("Splunk Cloud Platform service limits and constraints") say we should avoid max content lengths above 1MB. For the Splunk Cloud HEC exporters, I propose we set max_content_length_logs and max_content_length_metrics to 1MB.
Yeah, that's a "soft" limit. The Enterprise default max is 5MB. Keeping it under 1MB is probably wise.
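A sketch of what that cap would look like on the exporter. The field names exist on the splunk_hec exporter and take a size in bytes; the 1MB figure follows the Splunk Cloud guidance above:

```yaml
exporters:
  splunk_hec/platform_logs:
    # Keep HEC payloads under the ~1MB Splunk Cloud soft limit
    # (exporter default is 2MiB; Splunk Enterprise allows up to 5MB).
    max_content_length_logs: 1000000      # bytes
  splunk_hec/platform_metrics:
    max_content_length_metrics: 1000000   # bytes
```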
I have created a PR (signalfx/splunk-otel-collector-chart#460) to expose the batch, retry_on_failure, and sending_queue configs in the values.yaml file. It will not fix this issue, but users can update these configs according to their needs.
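For anyone tuning this from the chart, a hedged sketch of overriding these settings in values.yaml. The exact keys introduced by that PR may differ from what is shown; the agent.config merge used here is the chart's generic mechanism for layering custom collector config, and the numbers are illustrative rather than recommended defaults:

```yaml
# values.yaml (illustrative values, not recommendations)
agent:
  config:
    exporters:
      splunk_hec/platform_logs:
        timeout: 30s
        retry_on_failure:
          max_elapsed_time: 600s   # retry longer before dropping data
        sending_queue:
          queue_size: 10000        # buffers more batches, at the cost of memory
```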
@harshit-splunk thanks. I am going to mark this issue as resolved, now that those settings are exposed.
Hi team,
Have seen frequent reports from users of exports failing due to "context deadline exceeded (Client.Timeout exceeded while awaiting headers)". Wondering if this is an opportunity to tune the chart timeout values to be more resilient when sending to Splunk Cloud or customer-managed Splunk, especially in the event of any back pressure? https://github.com/signalfx/splunk-otel-collector-chart/blob/97dc6dca6f04a764a1fe48658453de72ed747a35/helm-charts/splunk-otel-collector/values.yaml#L47-L48
Here's an example of my collector sending from k8s to Splunk Cloud, hitting timeouts, and eventually dropping metrics/logs:
Eventually it continues to retry until: