
"splunk_hec/platform_logs" and "splunk_hec/platform_metrics" "context deadline exceeded (Client.Timeout exceeded while awaiting headers)" #1454

Closed
matthewmodestino opened this issue Apr 11, 2022 · 9 comments

@matthewmodestino

matthewmodestino commented Apr 11, 2022

Hi team,

Have seen frequent reports from users of exports failing with context deadline exceeded (Client.Timeout exceeded while awaiting headers). Wondering if this is an opportunity to tune the chart timeout values to be more resilient when sending to Splunk Cloud or customer-managed Splunk, especially in the event of any back pressure?

https://github.com/signalfx/splunk-otel-collector-chart/blob/97dc6dca6f04a764a1fe48658453de72ed747a35/helm-charts/splunk-otel-collector/values.yaml#L47-L48
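
In the meantime, here is a rough sketch of how I believe the exporter timeout could be raised per-install via values.yaml while we discuss better defaults (the agent.config overlay key is an assumption on my part and may differ between chart versions):

agent:
  config:
    exporters:
      splunk_hec/platform_logs:
        timeout: 30s
      splunk_hec/platform_metrics:
        timeout: 30s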

Here's an example of my collector sending from k8s to Splunk Cloud, hitting timeouts, and eventually dropping metrics/logs:

2022-04-06T22:09:21.013Z	info	exporterhelper/queued_retry.go:215	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "name": "splunk_hec/platform_logs", "error": "Post \"https://http-inputs-foo.splunkcloud.com/services/collector\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "9.43516099s"}
2022-04-06T22:09:27.279Z	info	exporterhelper/queued_retry.go:215	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "name": "splunk_hec/platform_metrics", "error": "Post \"https://http-inputs-foo.splunkcloud.com/services/collector\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "3.892538109s"}

Eventually it continues to retry until:

2022-04-06T22:16:07.084Z	error	exporterhelper/queued_retry_inmemory.go:106	Exporting failed. No more retries left. Dropping data.	{"kind": "exporter", "name": "splunk_hec/platform_logs", "error": "max elapsed time expired Post \"https://http-inputs-foo.splunkcloud.com/services/collector\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "dropped_items": 4615}
go.opentelemetry.io/collector/exporter/exporterhelper.onTemporaryFailure
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:106
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:199
go.opentelemetry.io/collector/exporter/exporterhelper.(*logsExporterWithObservability).send
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/logs.go:132
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:118
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:99
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:78
	
	
2022-04-06T22:16:10.201Z	error	exporterhelper/queued_retry_inmemory.go:106	Exporting failed. No more retries left. Dropping data.	{"kind": "exporter", "name": "splunk_hec/platform_metrics", "error": "max elapsed time expired Post \"https://http-inputs-foo.splunkcloud.com/services/collector\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "dropped_items": 478}
go.opentelemetry.io/collector/exporter/exporterhelper.onTemporaryFailure
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:106
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:199
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/metrics.go:133
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:118
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:99
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:78
	
@emaderer
Contributor

@jvoravong can you please take a look and see if it makes sense to update the timeout value?

@jvoravong
Contributor

jvoravong commented Apr 22, 2022

Here are the timeout and retry configurations:

  • timeout (default = 5s): Time to wait per individual attempt to send data to a backend.
  • retry_on_failure
    -- enabled (default = true)
    -- initial_interval (default = 5s): Time to wait after the first failure before retrying
    -- max_interval (default = 30s): The upper bound on backoff
    -- max_elapsed_time (default = 300s): The maximum amount of time spent trying to send a batch
  • sending_queue
    -- enabled (default = true)
    -- num_consumers (default = 10): Number of consumers that dequeue batches; ignored if enabled is false
    -- queue_size (default = 5000): Maximum number of batches kept in memory before dropping. Users should calculate this as num_seconds * requests_per_second, where num_seconds is the number of seconds to buffer in case of a backend outage and requests_per_second is the average number of requests per second (see the config sketch below).
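
As a rough sketch, here is how those settings map onto the exporter config, using the defaults listed above and the exporter name from the logs in this issue:

exporters:
  splunk_hec/platform_logs:
    timeout: 5s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      # queue_size = num_seconds * requests_per_second,
      # e.g. buffering a 500s outage at 10 requests/second => 5000 batches.
      queue_size: 5000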

When we start dropping metrics:

  • If we have failed to send data to the backend after 300s of retrying, drop the data.
  • If we have more than 5000 batches of data waiting in the sending queue, start dropping batches from the queue.

It's reasonable to increase the connection timeout to a value between 20s and 60s with no harmful effects.

Our users facing this issue could really use better default values, or at least the ability to configure the retry_on_failure.max_elapsed_time and sending_queue.queue_size configs. The size of the sending queue directly affects memory usage, and retry_on_failure.max_elapsed_time affects how fast the sending queue fills when the backend is under pressure. Without some sort of performance metrics or testing, I'm unsure what default values would be good for the retry_on_failure and sending_queue configs.

Action Items I'd recommend:

  • Increase the connection timeout to 20s.
  • Expose the retry_on_failure and sending_queue configurations so they are configurable within values.yaml. Defaults would stay the same, but this way users have full control; users who are willing to use more memory can configure higher values for the retry_on_failure.max_elapsed_time and sending_queue.queue_size configs. A rough sketch of what this could look like follows.
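
A minimal, purely illustrative sketch of that exposure (the key names and their placement under splunkPlatform are assumptions, not the final chart API):

splunkPlatform:
  retryOnFailure:
    enabled: true
    maxElapsedTime: 300s   # users under back pressure could raise this
  sendingQueue:
    enabled: true
    numConsumers: 10
    queueSize: 5000        # larger values buffer longer outages at the cost of memory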

@emaderer
Contributor

Thanks Josh!

@dmitryax
Contributor

dmitryax commented Apr 26, 2022

An HTTP timeout of 10s is pretty high already. If an HTTP connection cannot be established in 10s, something must be wrong with the backend or somewhere in the middle. I don't think tuning the timeout setting will help in these cases, and I don't think we should expose this setting. Better to get open-telemetry/opentelemetry-collector-contrib#6803 merged and expose the full set of HTTP settings for those who really need them.

@matthewmodestino
Author

It is either the timeout or the default maxconnections setting that causes this, I believe. Either way, the collector should be configured not to overwhelm the backend, so if we need testing to help validate, I'd suggest the team get with cloud eng and have a proper look. I have multiple customers who hit this...

@jvoravong
Contributor

Noticed the Splunk HEC exporter defaults for max_content_length_logs and max_content_length_metrics are set to 2MiB.

Some of the Splunk Cloud docs say we should avoid having max content lengths above 1MB.

Splunk Cloud Platform service limits and constraints
HEC maximum content length size limit = 1 MB
There is a recommended limit to the HEC payload size in Splunk Cloud Platform to ensure data balance and ingestion fidelity. A HEC request can have one or more Splunk events batched into it but the payload size should be no larger than this limit. If you exceed this limit, you may experience performance issues related to data balance and ingestion fidelity.

For the Splunk Cloud HEC exporters, I propose we set max_content_length_logs and max_content_length_metrics to 1MB.
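
If we go that route, a minimal sketch of the exporter settings, assuming 1MiB (1048576 bytes) as the concrete value; the exact byte count is open for discussion:

exporters:
  splunk_hec/platform_logs:
    max_content_length_logs: 1048576      # ~1MB, down from the 2MiB default
  splunk_hec/platform_metrics:
    max_content_length_metrics: 1048576   # ~1MB, down from the 2MiB default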

@matthewmodestino
Author

Yeah, that's a "soft" limit. The Enterprise default max is 5MB. Keeping under 1MB is probably wise.

@hvaghani221

I have created a PR (signalfx/splunk-otel-collector-chart#460) to expose the batch, retry_on_failure, and sending_queue configs in the values.yaml file. It will not fix this issue, but users can update these configs according to their needs.

@atoulme
Contributor

atoulme commented Aug 10, 2022

@harshit-splunk thanks. I am going to mark this issue as resolved, now that those settings are exposed.
Matthew, please make sure to open a new issue if you see more timeout errors.
