
[exporter/prometheusremotewrite] remote_write_queue num_consumers hard-coded at 1 #30765

Closed · tredman opened this issue Jan 24, 2024 · 13 comments


tredman commented Jan 24, 2024

Component(s)

exporter/prometheusremotewrite

Describe the issue you're reporting

I wasn't sure whether to file this as a bug or a feature request. It seems that, due to issue 2949, the ability to configure the number of consumers was deliberately removed and num_consumers was hard-coded to 1. I believe this is the relevant code.

However, the README still describes this as configurable:

remote_write_queue: fine tuning for queueing and sending of the outgoing remote writes
  enabled: enable the sending queue (default: true)
  queue_size: number of OTLP metrics that can be queued. Ignored if enabled is false (default: 10000)
  num_consumers: minimum number of workers to use to fan out the outgoing requests. (default: 5)
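
For reference, those README options map onto the exporter configuration roughly like this (a minimal sketch; the endpoint is a placeholder and the values are just the documented defaults):

exporters:
  prometheusremotewrite:
    endpoint: "https://example.com/api/v1/remote_write"  # placeholder endpoint
    remote_write_queue:
      enabled: true        # default
      queue_size: 10000    # default
      num_consumers: 5     # default; the setting this issue is about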

I discovered this today while trying to understand why my collector was filling its queue and then dropping metrics, while a prometheus agent I had running in parallel was processing all metrics without issue (at a rate of approximately 3300/second). Without the ability to increase the number of consumers and send more batches in parallel, the throughput using this exporter is significantly limited. It seems like what's needed here is a way to distribute samples across consumers without violating the requirement that they be in chronological order, which is probably more feature development than bug fix but I'll let folks here decide. :)

tredman added the needs triage (New item requiring triage) label on Jan 24, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1 (Member)

Hello @tredman, it looks like there were some follow-up issues and changes to the one you referenced that added back a subset of the queue retry capabilities. Relevant PRs: open-telemetry/opentelemetry-collector#2974, open-telemetry/opentelemetry-collector#3046.

num_consumers is still being used, just in a different way than it was originally. Here we can see it's used to set concurrency, which is then used to fan out the export operations concurrently. The code you're referencing with NumConsumers then comes into play in the single-consumer exporter helper's queue settings.

I'm not familiar enough with this exporter to say whether this is an overall design flaw or a strict requirement; I'll have to defer to someone else.


bryan-aguilar commented Jan 25, 2024

This is the relevant comment explaining the hard-coded consumer size, which was linked in the original issue. You can still use num_consumers to increase the number of workers used to export data. The only difference is that the PRWE shards time series to a specific worker to avoid out-of-order samples.

@bryan-aguilar (Contributor)

Speaking from experience, there are a few places where the PRWE could be improved to increase performance. There can be a bottleneck in the PRWE when handling large batch sizes, because the PRWE translates batches from OTLP -> Prometheus format in a single-threaded manner. Signs that this is happening include the queue backing up and context deadline exceeded errors; analyzing your batch processor send size can also help confirm it. If you are running into this, you can use the batch processor's send_batch_max_size to cap the size of the batches and prevent outlier batches from backing up your PRWE.
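
For example, a batch processor with an explicit cap might look like the following (the numbers are purely illustrative and need tuning per workload):

processors:
  batch:
    timeout: 1s
    send_batch_size: 8192        # default trigger size
    send_batch_max_size: 10000   # hard cap so an outlier batch cannot grow unbounded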


tredman commented Jan 25, 2024

Thanks for the quick response! I see the code you pointed out where num_consumers gets passed in to set concurrency, so I'll read up on that code path. It's odd that, for me, adjusting num_consumers has had no effect on throughput to the remote write endpoint; I've tried ranges from 5 to 1000. I see here that the code actually takes the min value of (concurrency, len(requests)) when fanning out, so it makes me think I'm running into a batch size issue or something of that nature.

In practice, the only knob I've been able to turn to increase throughput here and prevent the retry queue from filling is upstream, via a batch processor. With a large enough batch size (5000) I'm able to keep up with the metrics load (regardless of what I set for num_consumers). In general I understand why batching can improve throughput, but not in this particular context. The PRWE isn't taking a batch of 5000 metrics and immediately writing them upstream, is it? It has to take what the processing pipeline feeds it, convert it into Prom format, and then shard it, right?

> There can be a bottleneck in the PRWE when handling large batch sizes. The PRWE translates batches from OTLP -> Prometheus format in a single-threaded manner.

Thanks, this is good info. In your experience, what is a large batch size? In my case I'm starting to wonder if the downstream workers are actually being starved because the batches fed to the PRWE are too small. That might explain why I've seen an increase in throughput by increasing the batch sizes in the batch processor.


bryan-aguilar commented Jan 25, 2024

> In your experience, what is a large batch size?

It's really use-case dependent, but something like 40k+ can be quite large. The default batch processor configuration does not cap batch size, so unless you explicitly configure send_batch_max_size, batches will be uncapped. Another thing to check is whether the Collector running the PRWE is CPU-constrained; translating metrics, especially in large batches, can be quite CPU intensive.

> With a large enough batch size (5000) I'm able to keep up with the metrics load (regardless of what I set for num_consumers)

Can you explain a bit more about what you mean by "large enough batch size" here? Can you share a batch processor configuration as an example? In general, though, it is always recommended to run a batch processor in your pipeline. There are probably some use cases where it does not make sense, but for the majority you should. One of the main benefits of batching metrics is to reduce the number of outbound requests: an exporter can make 1 request for a "batch of 50" rather than 50 requests.

What kind of load is going through your collector? Do you have any metrics to help us understand the scale you are operating at?


tredman commented Jan 26, 2024

Sure, this is what I have set currently:

batch:
        timeout: 1000ms
        send_batch_size: 5000
        send_batch_max_size: 5000

At 1000 I was seeing the queue get filled. Specifically, otelcol_exporter_queue_size would grow to the max size of the remote write queue (10000), and then the logs would start complaining about dropped metrics. Once I set the batch size to 5000 or higher, I stopped seeing the retry queue fill.

> What kind of load is going through your collector? Do you have any metrics to help us understand the scale you are operating at?

Yeah, to give some more context here: I'm trying to determine if we can replace an existing prometheus agent with an OTEL collector. We're planning to use collectors for other use cases, and it would be ideal if they could handle some existing workloads currently handled by prometheus agents and reduce the number of tools we're using. I have them running in parallel and was working through the process of configuring the OTEL collector to scrape the same data as the prometheus agent when I started seeing the OTEL collector drop data.

These graphs are from Amazon Managed Prometheus metrics in CloudWatch. The green line is the data we're ingesting from the existing prometheus agent. The blue line is the OTEL collector. Rates here are per minute. I'm actually looking to hit 3x the current load to reach "parity" here. Note these are writing to separate AMP workspaces so I don't think we're hitting throughput limits on the remote endpoint itself.

[Screenshot: CloudWatch graphs of the two AMP ingestion rates, 2024-01-25]

@bryan-aguilar (Contributor)

Yeah, I believe setting the batch size too low can have the opposite effect as well! So there can be a sweet spot between too-large and too-small batches. From what I have seen, the default send_batch_size is fine, and then you can tune send_batch_max_size based on your workload. It does not need to be 1:1; you can try 1:2 and go from there.
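
As a rough starting point, that would be something like this (illustrative numbers only):

processors:
  batch:
    send_batch_size: 8192        # keep the default trigger size
    send_batch_max_size: 16384   # roughly a 1:2 cap; adjust from there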

@bryan-aguilar (Contributor)

Are both the prometheus agent and the collector doing the same work at the moment? Or is the collector only receiving a subset of the metrics? Scrape jobs, I presume?


tredman commented Jan 26, 2024

The collector is doing a subset of the work, that's correct. There are a few somewhat complicated scrape jobs I need to replicate but I needed to resolve these dropped metrics first.

> From what I have seen, the default send_batch_size is fine, and then you can tune send_batch_max_size based on your workload. It does not need to be 1:1; you can try 1:2 and go from there.

Thanks, I'll give that a try. I learned today that the default batch size is pretty high (8192), so I've been pretty conservative with my configuration so far. I'll see if I can just be more aggressive here.

@bryan-aguilar (Contributor)

I've found that the batch processor emits some good metrics to help with this. Observing those over time, along with the queue size and CPU/memory usage, is very helpful.
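
For anyone who lands here later: those come from the collector's internal telemetry. A minimal sketch of enabling it (exact field names can vary between collector versions, so verify against your version's docs):

service:
  telemetry:
    metrics:
      level: detailed          # emit the more detailed processor/exporter metrics
      address: 0.0.0.0:8888    # endpoint where the collector exposes its own Prometheus metrics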

crobert-1 removed the needs triage (New item requiring triage) label on Jan 26, 2024

tredman commented Jan 27, 2024

Going to close this since my initial report was not accurate. Thanks for the assist, folks! I was able to reach parity with my prometheus implementation by just adjusting the batch size. :)

Also throwing this out there in case anyone happens to stumble on this issue: if you're using the OTEL operator like we are, you can actually scale your prometheus receivers horizontally, and the TargetAllocator will handle distributing jobs to the collectors. I didn't need to do this to handle the throughput here, but it's a potential option if you exhaust the total throughput of a single collector.
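
With the operator that looks roughly like this (a minimal sketch; the name, replica count, and endpoint are placeholders, and the target allocator is typically paired with statefulset mode):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: metrics-scraper              # placeholder name
spec:
  mode: statefulset
  replicas: 3                        # scale the scraping collectors horizontally
  targetAllocator:
    enabled: true
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs: []         # scrape jobs defined here get distributed across replicas
    exporters:
      prometheusremotewrite:
        endpoint: https://example.com/api/v1/remote_write   # placeholder endpoint
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheusremotewrite]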

tredman closed this as completed on Jan 27, 2024
@bryan-aguilar (Contributor)

Awesome to hear! I didn't want to dig into your setup quite yet, but I would have totally suggested the target allocator if I'd known you were on k8s!
