remote_storage: revisit throttling/ratelimiting #3698
Earlier I sketched a leaky-bucket (or similar) variant that required an additional task. After auditing https://github.com/udoprog/leaky-bucket it seems like a good implementation, and it requires neither a separate task nor a channel for sending the permits. I'll PR this in. As a follow-up, after the discussions about the elusive S3 prefix, I'm thinking we might need this rate limiter per tenant_id, assuming we don't already max out the bandwidth at rps=3500.
We currently have a semaphore-based rate limiter which we hope will keep us under the S3 limits. However, the semaphore does not consider time, so I've been hesitant to raise the concurrency limit of 100. See #3698. The PR introduces a leaky-bucket based rate limiter instead of the `tokio::sync::Semaphore`, which will allow us to raise the limit later on. The configuration changes are not contained here.
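For reference, a minimal sketch of how such a leaky-bucket limiter can gate request starts. The rates and interval are illustrative, not the values from the PR, and it assumes the `leaky-bucket` and `tokio` crates:

```rust
use std::time::Duration;

use leaky_bucket::RateLimiter;

#[tokio::main]
async fn main() {
    // Illustrative numbers: refill 100 tokens per second, hold at most 100,
    // i.e. roughly 100 request starts per second with a one-second burst.
    let limiter = RateLimiter::builder()
        .initial(100)
        .refill(100)
        .max(100)
        .interval(Duration::from_secs(1))
        .build();

    for i in 0..1_000 {
        // One token per S3 request start; this waits when the bucket is
        // empty, unlike a plain semaphore, which has no notion of time.
        limiter.acquire_one().await;
        tokio::spawn(async move {
            // ... issue the GET/PUT here ...
            let _ = i;
        });
    }
}
```

Unlike a semaphore permit, the token is not returned when the request finishes; the bucket simply refills over time, which is what makes this a rate limit rather than a concurrency limit.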
Let's step back a bit. Why do we need to rate limit the S3 requests in the first place? From the slack thread:
Yes, ok, that makes some sense. You don't want to launch any more parallel downloads if the server's network interface is already saturated. I remember that we also had problems with a large number of IAM requests; IAM has fairly low rate limits. But that was solved: we don't issue a separate IAM request for every GET request anymore, so that should not be a problem anymore.

So let's be very clear about what we are trying to accomplish: we are trying to avoid fully saturating the server's network interface. Right? For that, limiting the # of requests started per second doesn't make much sense. A small number of very large GET requests is much more likely to saturate the network bandwidth than a large number of small GET requests. We should measure network bandwidth more directly, not requests. We care about the total # of bytes/s.

The semaphore approach was not great. It basically assumed that each operation can do X MB/s, and then used the # of concurrent operations as a proxy for the network bandwidth used. Nevertheless, it seems more sensible than limiting the # of requests started per second. By rate limiting the # of requests started, if each operation takes e.g. 5 seconds and the rate limit is 100/s, you can have 500 concurrent operations in progress, which is more than we want. On the other hand, if each operation only takes 0.1 s, you are seriously throttling how many of those requests you allow, for no good reason AFAICS.

(There is also a limit in AWS S3 of 5500 GET requests/s per prefix, see https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html. We use a different prefix for each timeline, so if I understand correctly, we would need to do more than 5500 GET requests for a particular timeline to hit that limit. I don't think we need to worry about reaching that limit. And even if we do hit it, I think we can just let AWS do the throttling for us. We don't gain anything by throttling ourselves earlier.)
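To make the bytes/s idea concrete, here is a rough sketch of bandwidth-based throttling using the same `leaky-bucket` crate, with one token standing for 1 KiB of transfer. The budget, the up-front charging, and the function names are all made up for illustration; a real implementation might charge per downloaded chunk instead:

```rust
use std::time::Duration;

use leaky_bucket::RateLimiter;

// Illustrative budget: ~100 MiB/s of S3 traffic, accounted in KiB tokens.
const KIB_PER_SEC: usize = 100 * 1024;

fn bandwidth_limiter() -> RateLimiter {
    RateLimiter::builder()
        .initial(KIB_PER_SEC)
        .refill(KIB_PER_SEC)
        .max(KIB_PER_SEC) // allow at most one second's worth of burst
        .interval(Duration::from_secs(1))
        .build()
}

async fn throttled_download(limiter: &RateLimiter, object_size_bytes: usize) {
    // Charge the expected transfer size up front, in chunks no larger than
    // the bucket capacity so that a single huge layer file never asks for
    // more tokens than can ever be available at once.
    let mut remaining_kib = object_size_bytes.div_ceil(1024);
    while remaining_kib > 0 {
        let chunk = remaining_kib.min(KIB_PER_SEC);
        limiter.acquire(chunk).await;
        remaining_kib -= chunk;
    }
    // ... issue the GET here ...
}
```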
I was hoping this would allow the initial index_part.json requests to complete faster for faster activation, but it seems the error rate has increased. Example log search:

I think there has to be some feedback loop to open connections from these errors; they appeared on servers using concurrency_limit=100, which I re-used as the requests-per-second limit. Of course, any of our async worker thread blocking may have stalled the downloads and thus caused more errors. I'd say it was still worth testing, if it would have helped us avoid the more complex solution you speculated about above, which might be a hybrid of X in-flight requests with bandwidth consumption. I'll revert the PR #4292 next. I still think we are not in a position to raise the semaphore limit, assuming there is a requests-per-second limit on each prefix. Running into issues while testing #4292 makes me think that the rate limiting does not happen on just the prefixes, because otherwise we should not have hit it at all with 10k single-timeline tenants, which should be 10k prefixes.
On startup, shouldn't we first download all
Well, that's how it should work right now, but every tenant does its own:
I don't think we do any downloads during the initial load right now, because since #4399 initial logical size calculation is idle until all initial loads complete. We certainly could do uploads. We should probably add metrics for the initial load time and then think about how to measure the wait times. I think we are now at around 1 ms/tenant, however, which might be quite good for an eager approach.
In this Slack thread we saw this semaphore put an artificial cap on bulk ingestion throughput: increasing it from 100 to 195 more than doubled ingestion throughput, and we are currently running experiments at 500. The resulting Slack thread retraced much of the discussion in this issue: a semaphore is a poor proxy for request rate and interface bandwidth limiting, and it only applies per pageserver, while the S3 rate limits apply across the fleet. We should mostly leave the throttling to S3 rather than do it ourselves, but maybe keep some safety limit to prevent spawning too many tasks.
This is the maximum rate of a given prefix (i.e. tenant shard), but S3 will dynamically change how prefixes are mapped onto partitions depending on prefix load. So it may throttle below this until it repartitions and scales out. From the link you provided:
I'll pick this up and revisit some of the history here -- I'm currently working on optimizing bulk ingestion throughput and this is a severe bottleneck.
Given what we saw in #10038 (comment), I'm putting this on the back burner for now, but I plan to revisit it in the medium term after we've addressed higher-value ingestion performance bottlenecks. When we do address this, we should still have a couple of concurrency limits: one for large transfers (there is no point starting 500 layer file downloads, since they'll all be slow), and one for small transfers (we don't want to spawn 10,000 Tokio tasks). This is tracked in #6193, but it's somewhat independent of rate limiting, which should be moved over to S3.
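A rough sketch of what such a pair of concurrency limits could look like; the threshold, permit counts, and type names here are hypothetical, not anything that exists in the codebase, and the snippet only assumes the `tokio` crate:

```rust
use std::sync::Arc;

use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Hypothetical split of concurrency limits: a few slow, bandwidth-heavy
/// transfers (layer files) vs. many cheap metadata requests (index_part.json).
struct TransferLimits {
    large: Arc<Semaphore>,
    small: Arc<Semaphore>,
}

const LARGE_TRANSFER_THRESHOLD: usize = 8 * 1024 * 1024; // 8 MiB, illustrative

impl TransferLimits {
    fn new() -> Self {
        Self {
            large: Arc::new(Semaphore::new(16)),  // limit slow layer downloads
            small: Arc::new(Semaphore::new(256)), // cap Tokio task fan-out
        }
    }

    async fn acquire(&self, size_hint_bytes: usize) -> OwnedSemaphorePermit {
        let sem = if size_hint_bytes >= LARGE_TRANSFER_THRESHOLD {
            &self.large
        } else {
            &self.small
        };
        // The permit is held for the duration of the transfer; rate limiting
        // proper is left to S3, as discussed above.
        sem.clone()
            .acquire_owned()
            .await
            .expect("semaphore is never closed")
    }
}
```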
#3663 made the semaphore be held until the download completed; this sparked further discussion on whether the single semaphore (100 permits) is a good limit and how we should be limiting at all.
Original thread: https://neondb.slack.com/archives/C04C55G1RHB/p1677153938527409
Alternatives proposed: