Hello all,

I posted the comment below on Ray's Slack workspace and was asked to share it on GitHub as an RFC (thanks @ericl!), which I am happy to do. My background is building APIs with Django + Celery/Jobtastic + Dask. Of course, opinions are my own, but I hope this is close enough to others' experiences to be helpful for the Ray team:
I tried building an API with Django + Celery + Dask, and there are a few things Ray solved for me that those technologies did not:
Ray can quickly scale to many nodes and lets me control the resources that Actors and Tasks need. For example, some parts of my workflow need exclusive access to a GPU, and Ray's decorators make this relatively easy. Ray also essentially solved the problem of serving these services through FastAPI, which I had previously implemented with Django + Celery; Django is great but overly complicated for this.
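To make the resource-control point concrete, here is roughly what this looks like for me. This is only a sketch: the function names and bodies are placeholders for my actual pipeline steps.

```python
import ray

ray.init()  # or ray.init(address="auto") to join an existing cluster

@ray.remote(num_cpus=2)           # plain CPU task with its own requirement
def parse_pdf(path: str) -> list:
    # placeholder for CPU-only PDF parsing
    return [b"page-1", b"page-2"]

@ray.remote(num_gpus=1)           # scheduled only where a full GPU is free
def extract_keypoints(image: bytes) -> int:
    # placeholder for the GPU-backed feature extraction step
    return len(image)

pages = ray.get(parse_pdf.remote("report.pdf"))
features = ray.get([extract_keypoints.remote(p) for p in pages])
```

Scaling out is then just a matter of adding nodes; the decorators stay the same.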
I love the Celery project, but I am hesitant to build on top of it, given that the main component I use is Jobtastic (https://policystat.github.io/jobtastic/) and I have not seen much new development in that project, so I'm not sure it will still be around in the future. On the other hand, Ray seems much more active, and judging from my previous experience with Spark, I expect it to continue being well maintained.
This next point could be both a pro and a con. I feel like Ray gives me much more control over executing my tasks. Everything is in Python, and I know exactly what's going on. With Django + Celery, I am never so sure about this.
I love Ray's ability to spawn/launch new jobs/tasks/actors/workflows inside other jobs/tasks/actors/workflows. For example, I have a pipeline that processes the images inside a PDF, and I wanted a task that extracts the images and then processes each image in parallel. Doing this in Celery proved complicated because it is hard to create dynamic execution graphs there. Celery does have beautiful concepts such as chords and chains (https://docs.celeryproject.org/en/stable/userguide/canvas.html), but they seemed somewhat limited. With Ray, building dynamic graphs is trivial.
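A rough sketch of the nested-task pattern I mean (here `extract_images` and `process_image` are stand-ins for the real extraction and feature code):

```python
import ray

ray.init()

def extract_images(pdf_path: str) -> list:
    # stand-in for real extraction (e.g., with PyMuPDF or pdf2image)
    return [b"image-1", b"image-2", b"image-3"]

@ray.remote
def process_image(image: bytes) -> int:
    # placeholder per-image work (e.g., keypoint extraction)
    return len(image)

@ray.remote
def process_pdf(pdf_path: str) -> list:
    images = extract_images(pdf_path)
    # Tasks launched from inside another task: the execution graph is
    # built dynamically, with one branch per image found in this PDF.
    return ray.get([process_image.remote(img) for img in images])

results = ray.get(process_pdf.remote("report.pdf"))
```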
About Ray vs. Spark: I see them as different things, and I do not think direct comparisons are fair. But I like what the Ray team has done with Ray Datasets and the ability to store tensors inside columns and write them out as Parquet files (very efficient). I am sure there is a way of doing this in Spark, but not out of the box. Spark has Vectors for dealing with features, but sometimes you need to store tensors and do operations on them (e.g., take the average tensor across rows). This comes up in the pipeline above, which extracts key points (e.g., SIFT features) from images: I like keeping image metadata in some columns and the thousands of key points (a tensor) in another column.
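A minimal sketch of what I mean (paths, column names, and shapes are illustrative, and the exact behavior may depend on the Ray Datasets version):

```python
import numpy as np
import ray

ray.init()

# One row per image: plain metadata columns plus a tensor column of keypoints.
rows = [
    {
        "doc": "report.pdf",
        "page": i,
        "keypoints": np.random.rand(500, 128).astype(np.float32),
    }
    for i in range(4)
]
ds = ray.data.from_items(rows)

# Tensor columns are stored with Ray's tensor extension type and can be
# written to and read back from Parquet.
ds.write_parquet("/tmp/image_features")
back = ray.data.read_parquet("/tmp/image_features")

# e.g., the average keypoint tensor across rows
mean_keypoints = np.mean([row["keypoints"] for row in back.take_all()], axis=0)
```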
Comments welcome!