remote_storage: revisit throttling/ratelimiting #3698
Earlier I sketched a leaky-bucket (or similar) variant that required an additional task. After auditing https://github.com/udoprog/leaky-bucket it seems like a good implementation, and it requires neither a separate task nor a channel for sending the permits. I'll PR this in. As a follow-up, after the discussions about the elusive S3 prefix, I'm thinking we might need this rate limiter per tenant_id, assuming we don't already max out the bandwidth at rps=3500.
We currently have a semaphore-based rate limiter which we hope will keep us under the S3 limits. However, the semaphore does not consider time, so I've been hesitant to raise the concurrency limit of 100. See #3698. The PR introduces a leaky-bucket based rate limiter instead of the `tokio::sync::Semaphore`, which will allow us to raise the limit later on. The configuration changes are not contained here.
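For reference, a minimal sketch of how such a leaky-bucket limiter can gate request starts. The rates and interval are illustrative, not the values from the PR, and it assumes the `leaky-bucket` and `tokio` crates:

```rust
use std::time::Duration;

use leaky_bucket::RateLimiter;

#[tokio::main]
async fn main() {
    // Illustrative numbers: refill 100 tokens per second, hold at most 100,
    // i.e. roughly 100 request starts per second with a one-second burst.
    let limiter = RateLimiter::builder()
        .initial(100)
        .refill(100)
        .max(100)
        .interval(Duration::from_secs(1))
        .build();

    for i in 0..1_000 {
        // One token per S3 request start; this waits when the bucket is
        // empty, unlike a plain semaphore, which has no notion of time.
        limiter.acquire_one().await;
        tokio::spawn(async move {
            // ... issue the GET/PUT here ...
            let _ = i;
        });
    }
}
```

Unlike a semaphore permit, the token is not returned when the request finishes; the bucket simply refills over time, which is what makes this a rate limit rather than a concurrency limit.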
Let's step back a bit. Why do we need to rate limit the S3 requests in the first place? From the slack thread:
Yes, ok, that makes some sense. You don't want to launch any more parallel downloads if the server's network interface is already saturated. I remember that we also had problems with a large number of IAM requests; IAM has fairly low rate limits. But that was solved: we don't issue a separate IAM request for every GET request anymore, so that should not be a problem anymore.

So let's be very clear about what we are trying to accomplish: we are trying to avoid fully saturating the server's network interface. Right? For that, limiting the # of requests started per second doesn't make much sense. A small number of very large GET requests is much more likely to saturate the network bandwidth than a large number of small GET requests. We should measure network bandwidth more directly, not requests. We care about the total # of bytes/s.

The semaphore approach was not great. It basically assumed that each operation can do X MB/s, and then used the # of concurrent operations as a proxy for the network bandwidth used. Nevertheless, it seems more sensible than limiting the # of requests started per second. By rate limiting the # of requests started, if each operation takes e.g. 5 seconds and the rate limit is 100/s, you can have 500 concurrent operations in progress, which is more than we want. On the other hand, if each operation only takes 0.1 s, you are seriously throttling how many of those requests you allow, for no good reason AFAICS.

(There is also a limit in AWS S3 of 5500 GET requests/s per prefix, see https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html. We use a different prefix for each timeline, so if I understand correctly, we would need to do more than 5500 GET requests for a particular timeline to hit that limit. I don't think we need to worry about reaching that limit. And even if we do hit it, I think we can just let AWS do the throttling for us. We don't gain anything by throttling ourselves earlier.)
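To make the bytes/s idea concrete, here is a rough sketch of bandwidth-based throttling using the same `leaky-bucket` crate, with one token standing for 1 KiB of transfer. The budget, the up-front charging, and the function names are all made up for illustration; a real implementation might charge per downloaded chunk instead:

```rust
use std::time::Duration;

use leaky_bucket::RateLimiter;

// Illustrative budget: ~100 MiB/s of S3 traffic, accounted in KiB tokens.
const KIB_PER_SEC: usize = 100 * 1024;

fn bandwidth_limiter() -> RateLimiter {
    RateLimiter::builder()
        .initial(KIB_PER_SEC)
        .refill(KIB_PER_SEC)
        .max(KIB_PER_SEC) // allow at most one second's worth of burst
        .interval(Duration::from_secs(1))
        .build()
}

async fn throttled_download(limiter: &RateLimiter, object_size_bytes: usize) {
    // Charge the expected transfer size up front, in chunks no larger than
    // the bucket capacity so that a single huge layer file never asks for
    // more tokens than can ever be available at once.
    let mut remaining_kib = object_size_bytes.div_ceil(1024);
    while remaining_kib > 0 {
        let chunk = remaining_kib.min(KIB_PER_SEC);
        limiter.acquire(chunk).await;
        remaining_kib -= chunk;
    }
    // ... issue the GET here ...
}
```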
I was hoping this would allow the initial index_part.json requests to complete faster for faster activation, but it seems the error rate has increased. Example log search:

I think there has to be some feedback loop to open connections from these errors; they appeared on servers using concurrency_limit=100, which I re-used as the requests-per-second limit. Of course, any of our async worker thread blocking may have stalled the downloads and thus caused more errors. I'd say it was still worth testing, if it would have helped us avoid the more complex solution you speculated about above, which might be a hybrid of X in-flight requests with bandwidth consumption. I'll revert the PR #4292 next. I still think we are not in a position to raise the semaphore limit, assuming there is a requests-per-second limit on each prefix. Running into issues while testing #4292 makes me think that the rate limiting does not happen on just the prefixes, because otherwise we should not have hit it at all with 10k single-timeline tenants, which should be 10k prefixes.
On startup, shouldn't we first download all
Well, that's how it should work right now, but every tenant does its own:
I don't think we do any downloads during the initial load right now, because since #4399 initial logical size calculation is idle until all initial loads complete. We certainly could do uploads. We should probably add metrics for the initial load time and then think about how to measure the wait times. I think we are now at around 1 ms/tenant, however, which might be quite good for an eager approach.
In this Slack thread we saw this semaphore put an artificial cap on bulk ingestion throughput: increasing it from 100 to 195 more than doubled ingestion throughput, and we are currently running experiments at 500. The resulting Slack thread retraced much of the discussion in this issue: a semaphore is a poor proxy for request rate and interface bandwidth limiting, and it only applies per pageserver, while the S3 rate limits apply across the fleet. We should mostly leave the throttling to S3 rather than do it ourselves, but maybe keep some safety limit to prevent spawning too many tasks.
This is the maximum rate of a given prefix (i.e. tenant shard), but S3 will dynamically change how prefixes are mapped onto partitions depending on prefix load. So it may throttle below this until it repartitions and scales out. From the link you provided:
I'll pick this up and revisit some of the history here -- I'm currently working on optimizing bulk ingestion throughput and this is a severe bottleneck.
Given what we saw in #10038 (comment), I'm putting this on the back burner for now, but I plan to revisit it in the medium term after we've addressed higher-value ingestion performance bottlenecks. When we do address this, we should still have a couple of concurrency limits: one for large transfers (there is no point starting 500 layer file downloads, since they'll all be slow), and one for small transfers (we don't want to spawn 10,000 Tokio tasks). This is tracked in #6193, but it's somewhat independent of rate limiting, which should be moved over to S3.
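A rough sketch of what such a pair of concurrency limits could look like; the threshold, permit counts, and type names here are hypothetical, not anything that exists in the codebase, and the snippet only assumes the `tokio` crate:

```rust
use std::sync::Arc;

use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Hypothetical split of concurrency limits: a few slow, bandwidth-heavy
/// transfers (layer files) vs. many cheap metadata requests (index_part.json).
struct TransferLimits {
    large: Arc<Semaphore>,
    small: Arc<Semaphore>,
}

const LARGE_TRANSFER_THRESHOLD: usize = 8 * 1024 * 1024; // 8 MiB, illustrative

impl TransferLimits {
    fn new() -> Self {
        Self {
            large: Arc::new(Semaphore::new(16)),  // limit slow layer downloads
            small: Arc::new(Semaphore::new(256)), // cap Tokio task fan-out
        }
    }

    async fn acquire(&self, size_hint_bytes: usize) -> OwnedSemaphorePermit {
        let sem = if size_hint_bytes >= LARGE_TRANSFER_THRESHOLD {
            &self.large
        } else {
            &self.small
        };
        // The permit is held for the duration of the transfer; rate limiting
        // proper is left to S3, as discussed above.
        sem.clone()
            .acquire_owned()
            .await
            .expect("semaphore is never closed")
    }
}
```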
#3663 made the semaphore be held until the download completed; this sparked further discussion on whether the single semaphore (100 permits) is a good limit and how we should be limiting at all.
Original thread: https://neondb.slack.com/archives/C04C55G1RHB/p1677153938527409
Alternatives proposed: