remote_storage: revisit throttling/ratelimiting #3698

Open · Tracked by #9624
koivunej opened this issue Feb 23, 2023 · 7 comments
Labels
c/storage/pageserver Component: storage: pageserver

Comments

koivunej (Member) commented Feb 23, 2023

#3663 made the semaphore permit be held until the download completed. This sparked further discussion about whether a single semaphore (100 permits) is a good limit and how we should limit these operations.

Original thread: https://neondb.slack.com/archives/C04C55G1RHB/p1677153938527409

Alternatives proposed:

  • we should have separate concurrency limits for index_part.json and layer downloads (@problame)
    • refinement: or perhaps "other ops" (timeline list, indexpart up/down, deletes), and "layer up/down"? (@koivunej)
  • or, generalizing all of these: a policy of (op, concurrency value) pairs, where index_part.json could have a much higher concurrency value than layer downloads (@shanyp); see the sketch below
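
To make the proposed policy concrete, here is a minimal sketch of what an (op, concurrency value) mapping could look like, assuming plain tokio::sync::Semaphores whose permits are held for the whole request (as in #3663). The RequestKind split and the permit counts are hypothetical, not something the pageserver currently implements.

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Hypothetical classification of remote storage operations.
enum RequestKind {
    /// index_part.json up/downloads, listings, deletes: small and latency-sensitive.
    Small,
    /// Layer file up/downloads: large and bandwidth-bound.
    LayerTransfer,
}

/// A "(op, concurrency value)" style policy built from two semaphores.
struct ConcurrencyPolicy {
    small_ops: Arc<Semaphore>,
    layer_transfers: Arc<Semaphore>,
}

impl ConcurrencyPolicy {
    fn new(small_permits: usize, layer_permits: usize) -> Self {
        Self {
            small_ops: Arc::new(Semaphore::new(small_permits)),
            layer_transfers: Arc::new(Semaphore::new(layer_permits)),
        }
    }

    /// Acquire a permit for the given kind of operation; the permit is held
    /// for the whole duration of the request, as in #3663.
    async fn acquire(&self, kind: RequestKind) -> OwnedSemaphorePermit {
        let sem = match kind {
            RequestKind::Small => &self.small_ops,
            RequestKind::LayerTransfer => &self.layer_transfers,
        };
        sem.clone()
            .acquire_owned()
            .await
            .expect("semaphore is never closed")
    }
}
```

A caller would wrap each remote operation in policy.acquire(...), which keeps the shape of the single-semaphore approach while letting the per-op limits differ.
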
koivunej (Member, Author) commented:

Earlier I wrote a leaky-bucket variant (or something close to it) that required an additional task. After auditing https://github.com/udoprog/leaky-bucket it looks like a good implementation, and it requires neither a separate task nor a channel for sending the permits. I'll PR this in.

As a follow-up to the discussions about the elusive S3 prefix, I'm thinking we might need this rate limiter per tenant_id, assuming we don't already max out the bandwidth at rps=3500.
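
For context, a rough sketch of how that crate could be used as a requests-per-second limiter, assuming the leaky-bucket 1.x builder API; the helper names and refill numbers are only illustrative:

```rust
use std::time::Duration;

use leaky_bucket::RateLimiter;

/// Build a limiter that allows roughly `rps` request starts per second.
/// No background task or channel is needed: acquiring waits in place.
fn build_limiter(rps: usize) -> RateLimiter {
    RateLimiter::builder()
        .initial(rps)
        .max(rps)
        .refill(rps)
        .interval(Duration::from_secs(1))
        .build()
}

async fn rate_limited_request(limiter: &RateLimiter) {
    // Each request start consumes one token; this limits request starts per
    // second, not bandwidth.
    limiter.acquire_one().await;
    // ... issue the S3 GET/PUT here ...
}
```
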

koivunej added a commit that referenced this issue May 26, 2023
We currently have a semaphore-based rate limiter which we hope will keep
us under S3 limits. However, the semaphore does not consider time, so
I've been hesitant to raise the concurrency limit from 100.

See #3698.

This PR introduces a leaky-bucket based rate limiter instead of the
`tokio::sync::Semaphore`, which will allow us to raise the limit later
on. The configuration changes are not included here.
hlinnaka (Contributor) commented:

Let's step back a bit. Why do we need to rate limit the S3 requests in the first place? From the Slack thread:

assuming we have finite bandwidth and AWS infrastructure is able to give us full-bandwidth s3 downloads at 100 requests, then I don't see any sense in doing more than 100 concurrent downloads, as the whole will not become faster, there would just be much more scheduling overhead. but this would also require s3 to have negligible latency of course. the underlying http client should be reusing the tcp connections anyway as soon as the semaphore lets them.

Yes, ok, that makes some sense. You don't want to launch more parallel downloads if the server's network interface is already saturated.

I remember that we also had problems with a large number of IAM requests; IAM has fairly low rate limits. But that was solved: we no longer issue a separate IAM request for every GET request, so it should not be a problem anymore.

So let's be very clear about what we are trying to accomplish: We are trying to avoid fully saturating the server's network interface. Right?

For that, limiting the # of requests started per second doesn't make much sense. A small number of very large GET requests is much more likely to saturate the network bandwidth than a large number of small GET requests. We should measure network bandwidth more directly, not requests. We care about the total # of bytes / s.

The semaphore approach was not great. It basically assumed that each operation can do X MB/s, and then used the # of concurrent operations as a proxy for the network bandwidth used. Nevertheless, it seems more sensible than limiting the # of requests started per second. With a limit on the # of requests started, if each operation takes e.g. 5 seconds and the rate limit is 100/s, you can have 500 concurrent operations in progress, which is more than we want. On the other hand, if each operation only takes 0.1 s, you are seriously throttling how many of those requests you allow, for no good reason AFAICS.
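
If we wanted to budget bytes per second rather than request starts, a hedged sketch (reusing a generic token bucket with tokens counted in bytes; the helper names and numbers are illustrative, not an existing pageserver API) could look like this:

```rust
use std::time::Duration;

use bytes::Bytes;
use futures::{Stream, StreamExt};
use leaky_bucket::RateLimiter;

/// Budget network bandwidth directly: tokens are bytes, not request starts.
fn bandwidth_limiter(bytes_per_sec: usize) -> RateLimiter {
    RateLimiter::builder()
        .initial(bytes_per_sec)
        .max(bytes_per_sec)
        .refill(bytes_per_sec)
        .interval(Duration::from_secs(1))
        .build()
}

/// Charge the limiter as the response body is consumed, chunk by chunk, so a
/// few large GETs and many small GETs draw from the same byte budget.
/// Assumes individual chunks are much smaller than the per-second budget.
async fn copy_with_byte_budget(
    limiter: &RateLimiter,
    mut body: impl Stream<Item = Bytes> + Unpin,
    out: &mut Vec<u8>,
) {
    while let Some(chunk) = body.next().await {
        limiter.acquire(chunk.len()).await;
        out.extend_from_slice(&chunk);
    }
}
```
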

(There is also a limit in AWS S3 of 5500 GET requests / s per prefix (see https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html). We use a different prefix for each timeline, so if I understand correctly, we would need to do more than 5500 GET requests for a particular timeline to hit that limit. I don't think we need to worry about reaching that limit. And even if we do hit that limit, I think we can just let AWS do the throttling for us. We don't gain anything by throttling ourselves earlier.)

koivunej (Member, Author) commented:

I was hoping this would allow the initial index_part.json requests to complete faster, for faster activation, but it seems the error rate has increased. An example log search `{neon_service="pageserver"} |~ "Failed to (download|upload)" !~ "No space left on device"` finds at least:

  • dispatch failure: io error: error writing a body to connection: Broken pipe (os error 32): Broken pipe (os error 32)
  • service error: unhandled error: unhandled error: Error { code: "SlowDown" ...
  • dispatch failure: other: connection closed before message completed
  • dispatch failure

I think there has to be some feedback loop from these errors to the number of connections we open; they appeared on servers using concurrency_limit=100, which I reused as the requests-per-second limit. Of course, any blocking of our async worker threads may have stalled the downloads and thus caused more errors.
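
To illustrate what such a feedback loop could look like, purely as a sketch and not something the pageserver implements: an AIMD-style adjustment that halves the target concurrency on SlowDown/connection errors and grows it by one on success. All names and constants below are made up.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Additive-increase / multiplicative-decrease target for in-flight requests.
struct AdaptiveLimit {
    current: AtomicUsize,
    min: usize,
    max: usize,
}

impl AdaptiveLimit {
    fn new(initial: usize, min: usize, max: usize) -> Self {
        Self { current: AtomicUsize::new(initial), min, max }
    }

    /// Called when a request completed without throttling errors.
    fn on_success(&self) {
        let cur = self.current.load(Ordering::Relaxed);
        self.current.store((cur + 1).min(self.max), Ordering::Relaxed);
    }

    /// Called on "SlowDown", broken pipe, or other backpressure signals.
    fn on_throttled(&self) {
        let cur = self.current.load(Ordering::Relaxed);
        self.current.store((cur / 2).max(self.min), Ordering::Relaxed);
    }

    /// Current target for allowed in-flight requests.
    fn target(&self) -> usize {
        self.current.load(Ordering::Relaxed)
    }
}
```
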

I'd say it was still worth testing, in case it would have helped us avoid the more complex solution I understand you speculated about above, which might be a hybrid of X in-flight requests combined with a bandwidth limit?

I'll revert PR #4292 next. I still think we are not in a position to raise the semaphore limit, assuming there is a requests-per-second limit on each prefix. Running into issues while testing #4292 makes me think that the rate limiting does not happen on just the prefixes, because otherwise we should not have hit it at all with 10k single-timeline tenants, which should be 10k prefixes.

koivunej added a commit that referenced this issue May 26, 2023
This reverts commit 024109f because it failed to speed anything up
and instead ran into more errors.

See: #3698.
shanyp (Contributor) commented Jul 19, 2023

On startup, shouldn't we first download all index_part.json files and only then continue with the rest of the downloads, or, as I suggested, just give these a higher priority?

koivunej (Member, Author) commented:

Well, that's how it should work right now, but every tenant does its own:

for each timeline
    download index_part.json
    reconcile
    schedule uploads

(in background per timeline: process upload queue)

I don't think we do any downloads right now during the initial load, because initial logical size calculations stay idle until all initial loads complete (since #4399). We certainly could do uploads. We should probably add metrics for the initial load time and then think about how to measure the wait times.

However, I think we are now at around 1 ms/tenant, which might be quite good for an eager approach.

jcsp removed the c/storage/s3 label on Jun 17, 2024
erikgrinaker (Contributor) commented Nov 20, 2024

In this Slack thread we saw this semaphore put an artificial cap on bulk ingestion throughput. Increasing it from 100 to 195 more than doubled ingestion throughput. We are currently running experiments at 500.

The resulting Slack thread retraced much of the discussion in this issue: a semaphore is a poor proxy for request-rate and interface-bandwidth limiting. It also only applies per pageserver, while S3 rate limits apply across the fleet. We should mostly leave the throttling to S3 rather than do it ourselves, but perhaps keep some safety limit to prevent spawning too many tasks.

if I understand correctly, we would need to do more than 5500 GET requests for a particular timeline to hit that limit

This is the maximum rate for a given prefix (i.e. tenant shard), but S3 dynamically changes how prefixes are mapped onto partitions depending on prefix load, so it may throttle below this until it repartitions and scales out. From the link you provided:

For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes. The scaling, in the case of both read and write operations, happens gradually and is not instantaneous. While Amazon S3 is scaling to your new higher request rate, you may see some 503 (Slow Down) errors.

I'll pick this up and revisit some of the history here -- I'm currently working on optimizing bulk ingestion throughput and this is a severe bottleneck.

erikgrinaker (Contributor) commented:

In #10038 (comment), we saw that concurrency_limit wasn't actually an ingestion bottleneck after all -- the benchmark was disrupted by Pageserver restarts.

I'm therefore putting this on the back burner for now, but plan to revisit this in the medium term after we've addressed higher-value ingestion performance bottlenecks.

When we do address this, we should still have a couple of concurrency limits: one for large transfers (there is no point starting 500 layer file downloads, since they'll all be slow), and one for small transfers (we don't want to spawn 10,000 Tokio tasks). This is tracked in #6193, but it's somewhat independent of rate limiting, which should be left to S3; see the sketch below.
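
As a sketch of the "don't spawn 10,000 Tokio tasks" half of that, the permit can be acquired before spawning, so the number of live tasks stays bounded; the limit of 32 and the helper names here are hypothetical.

```rust
use std::sync::Arc;

use tokio::{sync::Semaphore, task::JoinSet};

/// Acquire the permit *before* spawning, so the number of live Tokio tasks
/// stays bounded instead of queueing thousands of tasks behind a semaphore.
async fn download_layers(layer_names: Vec<String>) {
    // Hypothetical limit for large transfers; small ops would get their own,
    // larger limit (see #6193).
    let large_transfers = Arc::new(Semaphore::new(32));
    let mut tasks = JoinSet::new();

    for name in layer_names {
        let permit = large_transfers
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore is never closed");
        tasks.spawn(async move {
            let _permit = permit; // held for the duration of the download
            download_layer(&name).await;
        });
    }
    while tasks.join_next().await.is_some() {}
}

/// Placeholder for the actual remote_storage download.
async fn download_layer(_name: &str) {}
```
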

erikgrinaker removed their assignment on Jan 13, 2025