[WIP] Fine grained serialization #4897
Conversation
Some things that feel good and not good about this approach:

Good

We're reducing the number of approaches / specifications of tasks in the client/scheduler/worker. This is a big step in the right direction.

Bad

We're exposing more structural information to Dask infrastructure than we need to. What was previously a black box that got passed around is now very transparent. As mentioned, this introduces some pain around tuple/list distinctions, and possibly other interactions between tasks and our messaging / serialization. In general this feels like a step in the wrong direction.

Thoughts

To me it seems like there is a tension between the following two desires:

- Tasks should be opaque to the scheduler, so that they can be passed along cheaply without the scheduler having to understand their contents.
- The scheduler should be able to construct and manipulate tasks itself, for example when expanding high level graph layers.
So we want something that is both a black box to the scheduler and also not a black box to the scheduler. That seems unfortunate.

This PR takes the approach that "well, given that the scheduler is involved, let's just accept that and expose the Dask task spec to the scheduler so that it can make tasks like the client does (except, of course, that things like callables will be pre-serialized)". This makes sense. It feels better than what we have today. However, if we're going to make a big change like this, then it makes me wonder what other big changes are available to us. This is a step in a good direction, but is there a step in a better direction?

Alternative

As an alternative let me propose the following. Dask has a task specification in which a task is a tuple of a callable followed by its arguments. I don't know what would be better necessarily, but to start a conversation I'll propose something different: Dask.distributed represents a task as a bytestring payload and a msgpack-serializable dictionary of kwargs. Both are shown in Python in the sketch below.
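As a sketch (the tuple form is the documented Dask graph spec; the dict layout and field names for the proposal are illustrative assumptions, not taken from this PR):

```python
# Current Dask task spec: a task is a tuple of a callable followed by
# its (possibly nested) arguments, keyed by name in the graph.
def inc(x):
    return x + 1

graph = {
    "x": 1,
    "y": (inc, "x"),  # call inc on the value stored at key "x"
}

# Proposed alternative: a task is an opaque bytestring plus a
# msgpack-serializable dict of keyword arguments.
task = {
    "payload": b"...",  # e.g. a pickled callable or high level graph layer
    "kwargs": {},       # plain msgpack data; often empty without HLGs
}
```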
In the common case without high level graphs, kwargs would probably be empty most of the time. In the case of high level graph layers we would have a payload from the client, which would deserialize into some callable, and then the worker would call that callable with the kwargs (see the worker-side sketch after this paragraph).

I don't think that this is necessarily a good plan though; it's just a different one. I think it would be good for us to think through the different options out in the open before we go down one path or another.
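A minimal sketch of that worker-side flow, assuming the illustrative task layout above and plain pickle for the payload (`run_task` is a hypothetical helper, not code from this PR):

```python
import pickle

def run_task(task):
    # Deserialize the opaque payload into a callable, then invoke it
    # with the msgpack-decoded keyword arguments.
    func = pickle.loads(task["payload"])
    return func(**task["kwargs"])

def add(x, y):
    return x + y

task = {"payload": pickle.dumps(add), "kwargs": {"x": 1, "y": 2}}
assert run_task(task) == 3
```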
Warning, this is very much work-in-progress
This PR implements fine grained serialization by only serializing non-msgpack-serializable objects.
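For example (an illustrative sketch, not this PR's actual code path: `to_serialize` is the existing wrapper from `distributed.protocol`, and the wrapping below is done by hand for clarity):

```python
import numpy as np
from distributed.protocol import to_serialize

def func(i, s, arr):
    return arr + i

# In a mixed task, the int and str are msgpack-serializable and can
# travel as plain msgpack values; only the callable and the numpy
# array actually need Dask's serialization machinery.
task = (
    to_serialize(func),           # not msgpack-serializable -> wrapped
    1,                            # plain msgpack int, left as-is
    "x",                          # plain msgpack str, left as-is
    to_serialize(np.arange(10)),  # not msgpack-serializable -> wrapped
)
```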
Motivation
In `main` we serialize tasks using either `to_serialize()` on both the function and all its arguments, or `dumps_function()` on the function and `pickle.dumps()` on its arguments. This means that once serialized, we cannot access or modify the function arguments, which can be a problem: #4673. Also, this means that we have two separate code paths for nested and non-nested tasks from the Scheduler to the Worker.
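A minimal sketch of the opacity problem, using plain `pickle` to stand in for `dumps_function()`/`to_serialize()`:

```python
import pickle

def add(x, y):
    return x + y

# Once the function and its arguments are reduced to opaque bytes,
# nothing downstream can inspect or rewrite an individual argument...
func_bytes = pickle.dumps(add)
args_bytes = pickle.dumps((1, 2))

# ...without round-tripping the whole blob (the pain point of #4673):
args = list(pickle.loads(args_bytes))
args[1] = 42
args_bytes = pickle.dumps(tuple(args))
```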
The Protocol
Notice:

- msgpack does not preserve the distinction between `tuple`s and `list`s, which we have to work around (see `msgpack_persist_lists()`).
- A task like `{"x": None}` will result in a task just containing `None`. This is a potential problem for the Scheduler, which we have to handle.
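The tuple/list issue with plain msgpack, for reference:

```python
import msgpack

# msgpack has a single array type, so the tuple/list distinction is
# lost on the round trip: tuples come back as lists by default...
packed = msgpack.packb((1, 2, 3))
assert msgpack.unpackb(packed) == [1, 2, 3]

# ...and unpacking with use_list=False flips the problem: now every
# array, including genuine lists, comes back as a tuple.
assert msgpack.unpackb(packed, use_list=False) == (1, 2, 3)
assert msgpack.unpackb(msgpack.packb([1, 2]), use_list=False) == (1, 2)
```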
Related issues:

- Serialize objects within tasks #4673
- `tuple`s & `list`s in MsgPack serialization #4575
- `dask.dataframe.read_csv('./filepath/*.csv')` returning tuple dask#7777