
"splunk_hec/platform_logs" and "splunk_hec/platform_metrics" "context deadline exceeded (Client.Timeout exceeded while awaiting headers)" #1454

Closed
matthewmodestino opened this issue Apr 11, 2022 · 9 comments

@matthewmodestino

matthewmodestino commented Apr 11, 2022

Hi team,

Have seen frequent reports from users of exports failing with context deadline exceeded (Client.Timeout exceeded while awaiting headers). Wondering if this is an opportunity to tune the chart timeout values to be more resilient when sending to Splunk Cloud or customer-managed Splunk, especially in the event of any back pressure?

https://github.com/signalfx/splunk-otel-collector-chart/blob/97dc6dca6f04a764a1fe48658453de72ed747a35/helm-charts/splunk-otel-collector/values.yaml#L47-L48
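
In the meantime, here is a rough sketch of how I believe the exporter timeout could be raised per-install via values.yaml while we discuss better defaults (the agent.config overlay key is an assumption on my part and may differ between chart versions):

agent:
  config:
    exporters:
      splunk_hec/platform_logs:
        timeout: 30s
      splunk_hec/platform_metrics:
        timeout: 30s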

Here's an example of my collector sending from k8s to Splunk Cloud, hitting timeouts, and eventually dropping metrics/logs:

2022-04-06T22:09:21.013Z	info	exporterhelper/queued_retry.go:215	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "name": "splunk_hec/platform_logs", "error": "Post \"https://http-inputs-foo.splunkcloud.com/services/collector\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "9.43516099s"}
2022-04-06T22:09:27.279Z	info	exporterhelper/queued_retry.go:215	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "name": "splunk_hec/platform_metrics", "error": "Post \"https://http-inputs-foo.splunkcloud.com/services/collector\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "3.892538109s"}

Eventually it continues to retry until:

2022-04-06T22:16:07.084Z	error	exporterhelper/queued_retry_inmemory.go:106	Exporting failed. No more retries left. Dropping data.	{"kind": "exporter", "name": "splunk_hec/platform_logs", "error": "max elapsed time expired Post \"https://http-inputs-foo.splunkcloud.com/services/collector\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "dropped_items": 4615}
go.opentelemetry.io/collector/exporter/exporterhelper.onTemporaryFailure
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:106
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:199
go.opentelemetry.io/collector/exporter/exporterhelper.(*logsExporterWithObservability).send
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/logs.go:132
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:118
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:99
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:78
	
	
2022-04-06T22:16:10.201Z	error	exporterhelper/queued_retry_inmemory.go:106	Exporting failed. No more retries left. Dropping data.	{"kind": "exporter", "name": "splunk_hec/platform_metrics", "error": "max elapsed time expired Post \"https://http-inputs-foo.splunkcloud.com/services/collector\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "dropped_items": 478}
go.opentelemetry.io/collector/exporter/exporterhelper.onTemporaryFailure
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:106
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:199
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/metrics.go:133
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:118
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:99
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
	/builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:78
	
@emaderer
Contributor

@jvoravong can you please take a look and see if it makes sense to update the timeout value?

@jvoravong
Contributor

jvoravong commented Apr 22, 2022

Here are the timeout and retry configurations:

  • timeout (default = 5s): Time to wait per individual attempt to send data to a backend.
  • retry_on_failure
    -- enabled (default = true)
    -- initial_interval (default = 5s): Time to wait after the first failure before retrying
    -- max_interval (default = 30s): The upper bound on backoff
    -- max_elapsed_time (default = 300s): The maximum amount of time spent trying to send a batch
  • sending_queue
    -- enabled (default = true)
    -- num_consumers (default = 10): Number of consumers that dequeue batches; ignored if enabled is false
    -- queue_size (default = 5000): Maximum number of batches kept in memory before dropping. Users should calculate this as num_seconds * requests_per_second, where num_seconds is the number of seconds to buffer in case of a backend outage and requests_per_second is the average number of requests per second (see the config sketch below).
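
As a rough sketch, here is how those settings map onto the exporter config, using the defaults listed above and the exporter name from the logs in this issue:

exporters:
  splunk_hec/platform_logs:
    timeout: 5s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      # queue_size = num_seconds * requests_per_second,
      # e.g. buffering a 500s outage at 10 requests/second => 5000 batches.
      queue_size: 5000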

When we start dropping metrics:

  • If we have failed to send data to the backend after 300s of retrying, drop the data.
  • If we have more than 5000 batches of data waiting in the sending queue, start dropping batches from the queue.

It's reasonable to increase the connection timeout to a value between 20s and 60s with no harmful effects.

Our users facing this issue could really use better default values, or at least the ability to configure the retry_on_failure.max_elapsed_time and sending_queue.queue_size configs. The size of the sending queue directly affects memory usage, and retry_on_failure.max_elapsed_time affects how fast the sending queue fills when the backend is under pressure. Without some sort of performance metrics or testing, I'm unsure what default values would be good for the retry_on_failure and sending_queue configs.

Action Items I'd recommend:

  • Increase the connection timeout to 20s.
  • Expose the retry_on_failure and sending_queue configurations so they are configurable within values.yaml. Defaults would stay the same, but this way users have full control; users who are willing to use more memory can configure higher values for the retry_on_failure.max_elapsed_time and sending_queue.queue_size configs. A rough sketch of what this could look like follows.
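
A minimal, purely illustrative sketch of that exposure (the key names and their placement under splunkPlatform are assumptions, not the final chart API):

splunkPlatform:
  retryOnFailure:
    enabled: true
    maxElapsedTime: 300s   # users under back pressure could raise this
  sendingQueue:
    enabled: true
    numConsumers: 10
    queueSize: 5000        # larger values buffer longer outages at the cost of memory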

@emaderer
Contributor

Thanks Josh!

@dmitryax
Contributor

dmitryax commented Apr 26, 2022

An HTTP timeout of 10s is pretty high already. If an HTTP connection cannot be established in 10s, something must be wrong with the backend or somewhere in the middle. I don't think tuning the timeout setting will help in these cases, and I don't think we should expose this setting. Better to get open-telemetry/opentelemetry-collector-contrib#6803 merged and expose the full set of HTTP settings for those who really need them.

@matthewmodestino
Author

It is either the timeout or the default maxconnections setting that causes this, I believe. Either way, the collector should be configured not to overwhelm the backend, so if we need testing to help validate, I'd suggest the team get with cloud eng and have a proper look. I have multiple customers who hit this...

@jvoravong
Contributor

Noticed the Splunk HEC exporter defaults for max_content_length_logs and max_content_length_metrics are set to 2MiB.

Some of the Splunk Cloud docs say we should avoid having max content lengths above 1MB.

Splunk Cloud Platform service limits and constraints
HEC maximum content length size limit = 1 MB
There is a recommended limit to the HEC payload size in Splunk Cloud Platform to ensure data balance and ingestion fidelity. A HEC request can have one or more Splunk events batched into it but the payload size should be no larger than this limit. If you exceed this limit, you may experience performance issues related to data balance and ingestion fidelity.

For the Splunk Cloud HEC exporters, I propose we set max_content_length_logs and max_content_length_metrics to 1MB.
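
If we go that route, a minimal sketch of the exporter settings, assuming 1MiB (1048576 bytes) as the concrete value; the exact byte count is open for discussion:

exporters:
  splunk_hec/platform_logs:
    max_content_length_logs: 1048576      # ~1MB, down from the 2MiB default
  splunk_hec/platform_metrics:
    max_content_length_metrics: 1048576   # ~1MB, down from the 2MiB default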

@matthewmodestino
Author

Yeah, that's a "soft" limit. The Enterprise default max is 5MB. Keeping under 1MB is probably wise.

@hvaghani221

I have created a PR (signalfx/splunk-otel-collector-chart#460) to expose the batch, retry_on_failure, and sending_queue configs in the values.yaml file. It will not fix this issue, but users can update these configs according to their needs.

@atoulme
Contributor

atoulme commented Aug 10, 2022

@harshit-splunk thanks. I am going to mark this issue as resolved, now that those settings are exposed.
Matthew, please make sure to open a new issue if you see more timeout errors.
