[Serve] Endpoints should return an error when the cluster is overloaded #22670
Labels: enhancement (request for new feature and/or capability), P1 (issue that should be fixed within a few weeks), serve (Ray Serve related issue)
Description
The Ray Serve Router should detect that it does not have the resources to handle additional requests in a timely fashion and return an appropriate HTTP error instead of attempting to take on more work.
Here's a suggested policy for how this could work:
Returning an error in this case would not only help to prevent the service from thrashing, but would also provide feedback to upstream proxies and queues that the service is overloaded.
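As a rough illustration of the kind of behavior being requested, here is a minimal sketch of an admission gate that rejects work with a 503 once a cap on pending requests is reached. It is plain `asyncio` and does not use the Ray Serve API; `OverloadGate`, `max_pending`, and the `Retry-After` value are hypothetical names and numbers chosen for the example, not an existing or proposed Serve interface.

```python
import asyncio


class OverloadGate:
    """Hypothetical admission gate; not part of Ray Serve."""

    def __init__(self, max_pending: int):
        # Assumed cap on requests that are admitted but not yet finished.
        self.max_pending = max_pending
        self.pending = 0

    async def handle(self, work):
        # Shed load up front instead of queuing without bound.
        if self.pending >= self.max_pending:
            return 503, {"Retry-After": "1"}, "overloaded"
        self.pending += 1
        try:
            result = await work()
            return 200, {}, result
        finally:
            # Release the slot whether the work succeeded or raised.
            self.pending -= 1
```

The rejected caller gets an immediate, explicit signal, which is what an upstream proxy or queue would need in order to back off or route elsewhere.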
Use case
While benchmarking some use cases involving deploying expensive models on Ray Serve, I've observed that, under high load, Serve can get into the following state:
max_concurrent_queries
quotas
In this situation, it would be better if Serve could be configured to return an error instead of enqueuing additional requests.
Related issues
#21438 dealt with a particularly nasty version of this issue where the client cancels and reissues requests, but it did not address the underlying problem of Serve queuing up an unbounded number of requests.
#21161 proposes handling overloads by paging the excess requests to an on-disk queue. This approach makes sense in some applications, but in many cases it is better to return an error instead of queuing a request for an unbounded amount of time. For example, the application may be latency-sensitive.
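For the latency-sensitive case, one possible shape is to bound how long a request may wait for capacity and fail fast once that bound is exceeded. The sketch below is an assumption-laden illustration using only `asyncio`; `run_with_deadline`, `slots`, and `max_wait_s` are hypothetical names, and none of this reflects how Serve is implemented.

```python
import asyncio


async def run_with_deadline(slots: asyncio.Semaphore, work, max_wait_s: float):
    """Hypothetical helper: wait at most max_wait_s for a free slot."""
    try:
        # Bound the queuing time rather than waiting indefinitely.
        await asyncio.wait_for(slots.acquire(), timeout=max_wait_s)
    except asyncio.TimeoutError:
        return 503, "timed out waiting for capacity"
    try:
        return 200, await work()
    finally:
        # Free the slot for the next waiter.
        slots.release()
```

With a short `max_wait_s`, a request that cannot start promptly is rejected; with a long one, the same code degrades to ordinary queuing, so the bound is the tunable that separates the two behaviors.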