[Serve] Endpoints should return an error when the cluster is overloaded #22670

frreiss · 2022-02-25T23:44:50Z

Search before asking

I searched the issues and found no similar feature request.

Description

The Ray Serve Router should detect that it does not have the resources to handle additional requests in a timely fashion and return an appropriate HTTP error instead of attempting to take on more work.

Here's a suggested policy for how this could work:

IF a request arrives and the following conditions are met:
  * All replicas have reached their `max_concurrent_queries` quotas
  * There are insufficient resources to allocate additional replicas
  * The backlog of requests (i.e., requests not assigned to a replica)
    exceeds a user-configurable size, OR the age of the oldest request
    in the backlog exceeds a user-configurable timeout
THEN return an HTTP error (such as 503, service unavailable) instead of enqueuing the request
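
The policy above could be sketched roughly as follows. This is only an illustration of the decision logic, not Ray Serve code: `RouterState`, `should_reject`, `status_for_new_request`, and the `max_backlog` / `max_age_s` parameters are hypothetical names standing in for the router's state and the user-configurable limits.

```python
from dataclasses import dataclass


@dataclass
class RouterState:
    """Hypothetical snapshot of router state (not a Ray Serve API)."""
    replicas_saturated: bool      # every replica is at its max_concurrent_queries quota
    can_add_replicas: bool        # the cluster has resources for another replica
    backlog_size: int             # requests not yet assigned to a replica
    oldest_request_age_s: float   # age in seconds of the oldest backlog entry


def should_reject(state: RouterState, max_backlog: int, max_age_s: float) -> bool:
    """Return True when all three conditions of the proposed policy hold."""
    backlog_exceeded = (
        state.backlog_size > max_backlog
        or state.oldest_request_age_s > max_age_s
    )
    return (
        state.replicas_saturated
        and not state.can_add_replicas
        and backlog_exceeded
    )


def status_for_new_request(state: RouterState,
                           max_backlog: int = 100,
                           max_age_s: float = 5.0) -> int:
    """Return 503 if the request should be rejected, else 202 (accepted for queuing)."""
    return 503 if should_reject(state, max_backlog, max_age_s) else 202
```

For example, with every replica saturated, no room for new replicas, and a backlog of 150 against a limit of 100, `status_for_new_request` returns 503. If the cluster could still add a replica, the same request would be queued.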

Returning an error in this case would not only help to prevent the service from thrashing, but would also provide feedback to upstream proxies and queues that the service is overloaded.

Use case

While benchmarking some use cases involving deploying expensive models on Ray Serve, I've observed that, under high load, Serve can get into the following state:

* All replicas have reached their `max_concurrent_queries` quotas
* There are insufficient resources to allocate additional replicas
* The Serve Router continues to accept incoming HTTP requests but cannot assign them to a replica
* An unbounded backlog of pending requests accumulates inside the Router
* Client response times grow indefinitely
* Server memory consumption grows indefinitely

In this situation, it would be better if Serve could be configured to return an error instead of enqueuing additional requests.

Related issues

#21438 dealt with a particularly nasty version of this issue where the client cancels and reissues requests, but it did not address the underlying problem of Serve queuing up an unbounded number of requests.

#21161 proposes handling overloads by paging the excess requests to an on-disk queue. This approach makes sense in some applications, but in many cases it is better to return an error instead of queuing a request for an unbounded amount of time. For example, the application may be latency-sensitive.

Are you willing to submit a PR?

Yes I am willing to submit a PR!

akshay-anyscale · 2024-06-12T23:14:11Z

There are a few new configurations around this now

frreiss added the enhancement Request for new feature and/or capability label Feb 25, 2022

jiaodong added the serve Ray Serve Related Issue label Feb 25, 2022

shrekris-anyscale added this to the Serve backlog milestone Feb 28, 2022

AmeerHajAli added the platform label Mar 26, 2022

edoakes removed the platform label Apr 25, 2022

sihanwang41 added the P1 Issue that should be fixed within a few weeks label Mar 23, 2023

akshay-anyscale closed this as completed Jun 12, 2024
