
[exporter/prometheusremotewrite] remote_write_queue num_consumers hard-coded at 1 #30765

Closed · tredman opened this issue Jan 24, 2024 · 13 comments


tredman commented Jan 24, 2024

Component(s)

exporter/prometheusremotewrite

Describe the issue you're reporting

I wasn't sure whether to file this as a bug or a feature request. It seems that, due to issue 2949, the ability to configure the number of consumers was deliberately removed and num_consumers was hard-coded to 1. I believe this is the relevant code.

However, the README still describes this as configurable:

remote_write_queue: fine tuning for queueing and sending of the outgoing remote writes
  enabled: enable the sending queue (default: true)
  queue_size: number of OTLP metrics that can be queued. Ignored if enabled is false (default: 10000)
  num_consumers: minimum number of workers to use to fan out the outgoing requests. (default: 5)
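
For reference, those README options map onto the exporter configuration roughly like this (a minimal sketch; the endpoint is a placeholder and the values are just the documented defaults):

exporters:
  prometheusremotewrite:
    endpoint: "https://example.com/api/v1/remote_write"  # placeholder endpoint
    remote_write_queue:
      enabled: true        # default
      queue_size: 10000    # default
      num_consumers: 5     # default; the setting this issue is about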

I discovered this today while trying to understand why my collector was filling its queue and then dropping metrics, while a prometheus agent I had running in parallel was processing all metrics without issue (at a rate of approximately 3300/second). Without the ability to increase the number of consumers and send more batches in parallel, the throughput using this exporter is significantly limited. It seems like what's needed here is a way to distribute samples across consumers without violating the requirement that they be in chronological order, which is probably more feature development than bug fix but I'll let folks here decide. :)

tredman added the needs triage (New item requiring triage) label on Jan 24, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1 (Member)

Hello @tredman, it looks like there were some follow-up issues and changes to the one you referenced that added back a subset of the queue retry capabilities. Relevant PRs: open-telemetry/opentelemetry-collector#2974, open-telemetry/opentelemetry-collector#3046.

num_consumers is still being used, just in a different way than it was originally. Here we can see it's used to set concurrency, which is then used to fan out the export operations concurrently. The code you're referencing with NumConsumers then comes into play in the single-consumer exporter helper's queue settings.

I'm not familiar enough with this exporter to say whether this is an overall design flaw or a strict requirement; I'll have to defer to someone else.


bryan-aguilar commented Jan 25, 2024

This is the relevant comment explaining the hard-coded consumer size, which was linked in the original issue. You can still use num_consumers to increase the number of workers used to export data. The only difference is that the PRWE shards time series to a specific worker to avoid out-of-order samples.

@bryan-aguilar (Contributor)

Speaking from experience, there are a few places where the PRWE could be improved to increase performance. There can be a bottleneck in the PRWE when handling large batch sizes, because the PRWE translates batches from OTLP -> Prometheus format in a single-threaded manner. Signs that this is happening include the queue backing up and context deadline exceeded errors; analyzing your batch processor send size can also help confirm it. If you are running into this, you can use the batch processor's send_batch_max_size to cap the size of the batches and prevent outlier batches from backing up your PRWE.
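
For example, a batch processor with an explicit cap might look like the following (the numbers are purely illustrative and need tuning per workload):

processors:
  batch:
    timeout: 1s
    send_batch_size: 8192        # default trigger size
    send_batch_max_size: 10000   # hard cap so an outlier batch cannot grow unbounded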


tredman commented Jan 25, 2024

Thanks for the quick response! I see the code you pointed out where num_consumers gets passed in to set concurrency, so I'll read up on that code path. It's odd that, for me, adjusting num_consumers has had no effect on throughput to the remote write endpoint; I've tried ranges from 5 to 1000. I see here that the code actually takes the min value of (concurrency, len(requests)) when fanning out, so it makes me think I'm running into a batch size issue or something of that nature.

In practice, the only knob I've been able to turn to increase throughput here and prevent the retry queue from filling is upstream, via a batch processor. With a large enough batch size (5000) I'm able to keep up with the metrics load (regardless of what I set for num_consumers). In general I understand why batching can improve throughput, but not in this particular context. The PRWE isn't taking a batch of 5000 metrics and immediately writing them upstream, is it? It has to take what the processing pipeline feeds it, convert it into Prom format, and then shard it, right?

> There can be a bottleneck in the PRWE when handling large batch sizes. The PRWE translates batches from OTLP -> Prometheus format in a single-threaded manner.

Thanks, this is good info. In your experience, what is a large batch size? In my case I'm starting to wonder if the downstream workers are actually being starved because the batches fed to the PRWE are too small. That might explain why I've seen an increase in throughput by increasing the batch sizes in the batch processor.


bryan-aguilar commented Jan 25, 2024

> In your experience, what is a large batch size?

It's really use-case dependent, but something like 40k+ can be quite large. The default batch processor configuration does not cap batch size, so unless you explicitly configure send_batch_max_size, batches will be uncapped. Another thing to check is whether the Collector running the PRWE is CPU-constrained; translating metrics, especially in large batches, can be quite CPU intensive.

> With a large enough batch size (5000) I'm able to keep up with the metrics load (regardless of what I set for num_consumers)

Can you explain a bit more about what you mean by "large enough batch size" here? Can you share a batch processor configuration as an example? In general, though, it is always recommended to run a batch processor in your pipeline. There are probably some use cases where it does not make sense, but for the majority you should. One of the main benefits of batching metrics is to reduce the number of outbound requests: an exporter can make 1 request for a "batch of 50" rather than 50 requests.

What kind of load is going through your collector? Do you have any metrics to help us understand the scale you are operating at?


tredman commented Jan 26, 2024

Sure, this is what I have set currently:

batch:
        timeout: 1000ms
        send_batch_size: 5000
        send_batch_max_size: 5000

At 1000 I was seeing the queue get filled. Specifically, otelcol_exporter_queue_size would grow to the max size of the remote write queue (10000), and then the logs would start complaining about dropped metrics. Once I set the batch size to 5000 or higher, I stopped seeing the retry queue fill.

> What kind of load is going through your collector? Do you have any metrics to help us understand the scale you are operating at?

Yeah, to give some more context here: I'm trying to determine if we can replace an existing prometheus agent with an OTEL collector. We're planning to use collectors for other use cases, and it would be ideal if they could handle some existing workloads currently handled by prometheus agents and reduce the number of tools we're using. I have them running in parallel and was working through the process of configuring the OTEL collector to scrape the same data as the prometheus agent when I started seeing the OTEL collector drop data.

These graphs are from Amazon Managed Prometheus metrics in CloudWatch. The green line is the data we're ingesting from the existing prometheus agent. The blue line is the OTEL collector. Rates here are per minute. I'm actually looking to hit 3x the current load to reach "parity" here. Note these are writing to separate AMP workspaces so I don't think we're hitting throughput limits on the remote endpoint itself.

[Screenshot: CloudWatch graphs of the two AMP ingestion rates, 2024-01-25]

@bryan-aguilar (Contributor)

Yeah, I believe setting the batch size too low can have the opposite effect as well! So there can be a sweet spot between too-large and too-small batches. From what I have seen, the default send_batch_size is fine, and then you can tune send_batch_max_size based on your workload. It does not need to be 1:1; you can try 1:2 and go from there.
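
As a rough starting point, that would be something like this (illustrative numbers only):

processors:
  batch:
    send_batch_size: 8192        # keep the default trigger size
    send_batch_max_size: 16384   # roughly a 1:2 cap; adjust from there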

@bryan-aguilar (Contributor)

Are both the prometheus agent and the collector doing the same work at the moment? Or is the collector only receiving a subset of the metrics? Scrape jobs, I presume?


tredman commented Jan 26, 2024

The collector is doing a subset of the work, that's correct. There are a few somewhat complicated scrape jobs I need to replicate but I needed to resolve these dropped metrics first.

> From what I have seen, the default send_batch_size is fine, and then you can tune send_batch_max_size based on your workload. It does not need to be 1:1; you can try 1:2 and go from there.

Thanks, I'll give that a try. I learned today that the default batch size is pretty high (8192), so I've been pretty conservative with my configuration so far. I'll see if I can just be more aggressive here.

@bryan-aguilar (Contributor)

I've found that the batch processor emits some good metrics to help with this. Observing those over time, along with the queue size and CPU/memory usage, is very helpful.
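
For anyone who lands here later: those come from the collector's internal telemetry. A minimal sketch of enabling it (exact field names can vary between collector versions, so verify against your version's docs):

service:
  telemetry:
    metrics:
      level: detailed          # emit the more detailed processor/exporter metrics
      address: 0.0.0.0:8888    # endpoint where the collector exposes its own Prometheus metrics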

crobert-1 removed the needs triage (New item requiring triage) label on Jan 26, 2024

tredman commented Jan 27, 2024

Going to close this since my initial report was not accurate. Thanks for the assist, folks! I was able to reach parity with my prometheus implementation by just adjusting the batch size. :)

Also throwing this out there in case anyone happens to stumble on this issue: if you're using the OTEL operator like we are, you can actually scale your prometheus receivers horizontally, and the TargetAllocator will handle distributing jobs to the collectors. I didn't need to do this to handle the throughput here, but it's a potential option if you exhaust the total throughput of a single collector.
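
With the operator that looks roughly like this (a minimal sketch; the name, replica count, and endpoint are placeholders, and the target allocator is typically paired with statefulset mode):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: metrics-scraper              # placeholder name
spec:
  mode: statefulset
  replicas: 3                        # scale the scraping collectors horizontally
  targetAllocator:
    enabled: true
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs: []         # scrape jobs defined here get distributed across replicas
    exporters:
      prometheusremotewrite:
        endpoint: https://example.com/api/v1/remote_write   # placeholder endpoint
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheusremotewrite]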

tredman closed this as completed on Jan 27, 2024
@bryan-aguilar (Contributor)

Awesome to hear! I didn't want to dig into your setup quite yet, but I would have totally suggested the target allocator if I'd known you were on k8s!
