[RFC] Async request support in Ray Serve #32292
Comments
Greetings @edoakes, I strongly endorse Proposal 2. To elaborate: our team is developing a quant algo that trains and executes trades in real time, and Proposal 2 aligns perfectly with our objectives.
Thanks @WBORSA. Could you elaborate on the advantages and disadvantages of (2) versus (1) for your use case?
@danielbichuetti saw that you reacted to the above post -- did you have any feedback/thoughts to share?
Hello, we use the Ray job submission queue quite often to spin up quick inference jobs and hack in new features. While that is quick, being able to persist jobs across Ray restarts (or rerun failed jobs) seems very important, which Serve seems to offer. If it were possible to run separate Python scripts as a Serve job, that would be great.
@richardliaw Hello! When @edoakes mentioned Celery being used because Ray lacked its functionality, that was a valid conclusion for our company. It is difficult to compare the code quality between Ray and Celery, and while Celery is not a bad solution, Ray is far superior. I am personally testing and using Ray Workflows in some scenarios, yet the queue concept is sometimes needed, and it can be quite laborious to establish an effective queue system that integrates well with Ray. Thus, the addition of this feature would be immensely beneficial to us, as we have occasional tasks that take anywhere between 3 and 8 hours. IMHO, Proposal 1 would be more advantageous to both current and new users, and would reduce allocation issues. The DAG concerns, at least for us, don't apply.
As mentioned by @Joshuaalbert, it seems like #26642 is highly related as well.
Hi! I really support this! A colleague and I were at Summit 2022 and had some meetings with the team to see if they could support our use case, which turns out to be the same use case as the now-popular generative models like ChatGPT etc. Our team at Touch has implemented our own solution on top of Serve to solve our use case, and we have had 3 iterations of improvement of the system. Happy to share what we've tried and learned, and I strongly support any dev effort to make this a main-line feature.

Our use case

Our use case is similar to the "chatbot" use case, where requests are done on behalf of a user and must be stateful. The required latency needs to be low enough that passing around state objects is intractable overhead. Regarding async vs. sync handling, most of our things can be done synchronously, but not all. Having the option for both types of request handling while also allowing stateful routing of requests would be ideal.

Breaking down each use case

I think Serve should be able to do any combination of these two options.
The answers above create a 2x2 matrix:
Examples of use cases:
We need A and B at Touch. I think most teams that need stateful Serve will need both A and B. Typically you first develop a concept with stateful synchronous response handling (B), and only after demonstrating correctness do you add the complexity of stateful asynchronous handling (A). We also use C, where we have multiple asynchronous processes that simply perform things on a scheduled basis (similar to how ServeController works) which need access to Ray. We have at least 4 such scheduled operations.

What we've learned
Miha's comment on Slack:
A few thoughts:
Finally, @edoakes could you please elaborate on why Workflows are too heavyweight? Do you mean the API is too cumbersome, or that the durability guarantees add too much processing overhead?
If we go with Proposal 2, wouldn't it be more scalable for the long term? I mean, going with that would even allow you to init multiple Ray Serve instances? Correct me if I am wrong.
I'd be interested in a solution that would allow for on-demand fine-tuning of models, or even on-demand training of small models. For example:
This would be useful when creating web applications where the user can use the UI to make small changes to the ML pipeline (selecting features, filtering training data, choosing thresholds, etc.), where pre-computing each combination of metaparameters would be infeasible.
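As a hypothetical illustration of this kind of on-demand training API (every name below is invented for the sketch, not from this thread), a Serve deployment could expose a train method that returns a model id and a predict method that uses it:

```python
import uuid

from ray import serve


@serve.deployment
class OnDemandTrainer:
    def __init__(self):
        # model_id -> trained artifact; a persistent store in production
        self.models = {}

    def train(self, features: list, threshold: float) -> str:
        # Stand-in for fitting a small model with the user-selected
        # metaparameters (feature selection, threshold, ...).
        model_id = uuid.uuid4().hex
        self.models[model_id] = {"features": features, "threshold": threshold}
        return model_id

    def predict(self, model_id: str, row: dict) -> bool:
        # Trivial stand-in inference: threshold the sum of selected features.
        m = self.models[model_id]
        return sum(row[f] for f in m["features"]) > m["threshold"]


app = OnDemandTrainer.bind()
```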
We're working on a similar system to fine-tune some small models for users. Our current solution:
In actuality, the Task is tied to an S3 object associated with the fine-tuning run, which we poll for persistent status. Roughly:
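A rough sketch of the pattern described, assuming boto3 and an illustrative bucket name (the helper and deployment names here are invented, not the poster's actual code):

```python
import json
import uuid

import boto3
import ray
from ray import serve

BUCKET = "my-finetune-runs"  # illustrative bucket name


@ray.remote
def train_task(run_id: str, config: dict) -> None:
    # Stand-in for the actual Ray Train fine-tuning run; on completion,
    # persist the terminal status to S3 so it outlives any Ray process.
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=f"{run_id}/status.json",
        Body=json.dumps({"state": "FINISHED"}),
    )


@serve.deployment
class Dispatcher:
    def __init__(self):
        self.s3 = boto3.client("s3")
        self.refs = {}  # keep ObjectRefs alive; see the caveat below

    async def submit(self, config: dict) -> str:
        run_id = uuid.uuid4().hex
        self.s3.put_object(
            Bucket=BUCKET,
            Key=f"{run_id}/status.json",
            Body=json.dumps({"state": "RUNNING"}),
        )
        # Fire-and-forget: the HTTP request returns immediately with run_id.
        self.refs[run_id] = train_task.remote(run_id, config)
        return run_id

    async def status(self, run_id: str) -> dict:
        # Poll S3 rather than the replica, so status survives restarts.
        obj = self.s3.get_object(Bucket=BUCKET, Key=f"{run_id}/status.json")
        return json.loads(obj["Body"].read())
```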
The major issue here is that persistence relies on something like S3, and importantly, Serve will downscale Dispatcher replicas with active training runs (since their requests have already returned), resulting in Ray ending the Train call before completion. So the Dispatcher cannot be autoscaled safely. Both suggestions above work, but I prefer the second option as we have a deployment graph for pre-processing and option 2 seems easier to work around. It's also more similar to the workflow of Ray Core.
@kyle-v6x this is a very interesting scenario. I wonder if you have 30 mins for a Zoom call to discuss more. If so, please email zhz at anyscale.com. Thanks!
We implemented async Serve calls using the existing Workflows API. The Workflow provides a reference value that can be shared with the caller and used to look up the state of the request. It seems to be working well. The only missing piece is rate limiting. I'm curious why the previous proposal to build on top of Workflows was abandoned?
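For context, a minimal sketch of this kind of Workflows-backed gateway, assuming Ray 2.x's `ray.workflow` API (the deployment and task names are illustrative, not the commenter's code):

```python
import uuid

import ray
from ray import serve, workflow


@ray.remote
def fine_tune(dataset_uri: str) -> str:
    # Long-running work; Workflows durably records progress and results.
    return f"s3://bucket/models/{dataset_uri.rsplit('/', 1)[-1]}"


@serve.deployment
class AsyncGateway:
    async def submit(self, dataset_uri: str) -> str:
        workflow_id = f"fine-tune-{uuid.uuid4().hex}"
        # run_async returns immediately; the workflow id is the reference
        # value shared with the caller.
        workflow.run_async(fine_tune.bind(dataset_uri), workflow_id=workflow_id)
        return workflow_id

    async def status(self, workflow_id: str) -> str:
        # e.g. RUNNING, SUCCESSFUL, FAILED
        return str(workflow.get_status(workflow_id))
```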
How do you handle multiple training requests?
A good idea is to rely on a data lake or disk for persistence of tasks. I think an all-in-one solution is very difficult, since requirements are very different.
Sorry for the late reply. Since we're using the Ray Cluster Launcher, new training nodes are automatically added to handle however many requests come in. If none are available, the requests are queued in the Ray task queue. Note that you still have to handle potential Ray failures yourself, as the Ray queue will not.
Note that since
TL;DR - Proposal for an API to support launching expensive computations in Serve (e.g., model fine-tuning, long-running inference) using an asynchronous request API
Problem statement
With the rise of generative models, the Serve team has seen growing interest in supporting "expensive computations" in Serve. For example, users have asked to launch Stable Diffusion fine-tuning jobs and long-running inference tasks that run not for seconds, but for several minutes to an hour. These tasks are often too long to run as a stateless inference request, but too short to justify launching an entirely new Ray job / cluster.
As workarounds, users are often connecting other queueing systems to Ray, such as Celery. The purpose of this RFC is to gather feedback on APIs for handling such workloads natively in Serve without needing a full-blown queueing system.
Previous proposals
A previous RFC proposed using Ray Workflows as a wholesale replacement for queueing systems. This solution works, but is heavyweight and relies on Workflows, a relatively new library: #21161
Below are two alternate proposals that aim to provide a simpler API.
Proposal 1 -- add async requests API to Serve
Add an "async_request" decorator to Serve deployments. For async-decorated methods, Serve will generate queueing preamble/postamble logic and APIs to enable listing, resuming, and checking the status of async requests. The API would look something like the following:
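A hedged sketch of how the proposed decorator might look in user code (the API is only proposed, so the exact decorator and method names below are illustrative):

```python
from ray import serve


@serve.deployment
class FineTuner:
    # Proposed (not yet existing) decorator: Serve would queue calls to this
    # method and return a request id instead of blocking the HTTP request.
    @serve.async_request
    async def fine_tune(self, model_uri: str, dataset_uri: str) -> str:
        # Long-running work (minutes to hours); may checkpoint periodically
        # so it can be resumed after a cluster failure.
        ...
        return "s3://bucket/checkpoints/final"


app = FineTuner.bind()
```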
Here are examples of generated API methods for managing async requests:
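Hypothetical calls against such generated methods might look as follows (all method names here are assumptions about the proposed API, not an existing one):

```python
from ray import serve


async def manage_fine_tune_requests():
    # Hypothetical generated management API (method names are assumptions).
    handle = serve.get_app_handle("finetuner")

    request_id = await handle.fine_tune.remote("s3://base", "s3://data")
    status = await handle.get_async_request_status.remote(request_id)  # e.g. PENDING / RUNNING / DONE
    pending = await handle.list_async_requests.remote()                # queued and in-flight requests
    result = await handle.get_async_request_result.remote(request_id)  # available once DONE
    return status, pending, result
```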
Fault tolerance: Serve would persist the queue of requests in its coordinator actor / persistent storage. When resuming from a cluster failure, Serve can load and resume previously running async requests, which can resume from any checkpoints they have taken.
Pros:
Cons:
Proposal 2 -- create a simplified TaskQueue API backed by Ray Workflows
Instead of extending Serve's API, create a separate TaskQueue API that users can use to manually create a Serve handler implementing the management methods. For example, the above example could instead be implemented as:
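A sketch under the RFC's assumptions: the `TaskQueue` class, its import path, and its method names are the proposed (hypothetical) API, invented here for illustration:

```python
import ray
from ray import serve
from ray.util.task_queue import TaskQueue  # hypothetical module path for the proposed API


@ray.remote
def fine_tune(dataset_uri: str) -> str:
    ...  # long-running fine-tuning job
    return "s3://bucket/checkpoints/final"


@serve.deployment
class FineTuneService:
    def __init__(self):
        # Durable queue backed by Ray Workflows under the hood (proposed API).
        self.queue = TaskQueue(name="fine_tune_queue")

    async def submit(self, dataset_uri: str) -> str:
        # Enqueue the expensive task and return a request id immediately.
        return await self.queue.enqueue(fine_tune, dataset_uri)

    async def status(self, request_id: str) -> str:
        return await self.queue.get_status(request_id)

    async def result(self, request_id: str):
        return await self.queue.get_result(request_id)
```

Because the queue is an ordinary object the user wires up themselves, the Serve handler keeps full control over the HTTP surface (routes, auth, rate limiting) while delegating durability to Workflows.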
Fault tolerance: can be implemented in a similar way as proposal (1).
Pros:
Cons: