[WIP] Fine grained serialization #4897

Closed · wants to merge 5 commits

Conversation

madsbk (Contributor) commented Jun 9, 2021

Warning: this is very much a work in progress.

This PR implements fine-grained serialization by serializing only non-msgpack-serializable objects, e.g.:

task = (add, (add, 1, 2), 3)  # Nested task
ser = (SerializedCallable(add), (SerializedCallable(add), 1, 2), 3) # Serialized 
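
As a sketch only (SerializedCallable, MSGPACK_SCALARS and serialize_task below are illustrative stand-ins, not the code in this PR), the walk looks roughly like this: leave anything msgpack can pack untouched, recurse into containers, and wrap everything else.

import pickle

class SerializedCallable:
    """Illustrative stand-in: holds a pickled object until it is needed."""
    def __init__(self, obj):
        self._data = pickle.dumps(obj)

    def deserialize(self):
        return pickle.loads(self._data)

# Types msgpack can encode without any help.
MSGPACK_SCALARS = (type(None), bool, int, float, str, bytes)

def serialize_task(obj):
    """Recurse through a task, wrapping only what msgpack cannot handle."""
    if isinstance(obj, MSGPACK_SCALARS):
        return obj                                  # msgpack handles these natively
    if isinstance(obj, (list, tuple)):
        return type(obj)(serialize_task(o) for o in obj)
    return SerializedCallable(obj)                  # callables and other opaque objects

def add(x, y):
    return x + y

serialize_task((add, (add, 1, 2), 3))
# -> (SerializedCallable(add), (SerializedCallable(add), 1, 2), 3)
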

Motivation

In main we serialize:

  • Nested tasks using to_serialize() on both the function and all its arguments
  • Non-nested tasks using dumps_function() on the function and pickle.dumps() on its arguments.

This means that once serialized, we cannot access or modify the function arguments, which can be a problem: #4673.
It also means that we have two separate code paths for nested and non-nested tasks all the way from the Scheduler to the Worker.
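
As a rough illustration of the first problem (plain pickle standing in here for dumps_function() and the real code paths), once the arguments are an opaque byte string, nothing upstream can inspect or rewrite them without a full round trip:

import pickle

def add(x, y):
    return x + y

# Non-nested path today, roughly: function and arguments become opaque blobs.
serialized_function = pickle.dumps(add)
serialized_args = pickle.dumps((1, 2))

# To change an argument (e.g. substitute a key with its computed value) the
# whole blob must be deserialized, modified and re-serialized:
args = pickle.loads(serialized_args)
args = (args[0], 42)
serialized_args = pickle.dumps(args)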

The Protocol

  • Fine-grained serialization -- only non-msgpack-serializable objects are serialized
  • Never de-serialize on the Scheduler (other than the implicit msgpack de-serialization)
  • Always de-serialize on the Client
  • Delay de-serialization on the Worker until task execution (a minimal sketch follows this list):
    • except when receiving scattered data from the Scheduler or Client,
    • or when receiving data from other Workers, e.g. when gathering data dependencies.
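
To make the Worker rule concrete, here is a minimal sketch of delaying deserialization until execution; WorkerTaskStore and its methods are made up for illustration and are not the Worker's real attributes:

import pickle

class WorkerTaskStore:
    def __init__(self):
        self.tasks = {}                 # key -> still-serialized task

    def add_task(self, key, serialized_task):
        # Received from the scheduler: keep it serialized, do not touch it.
        self.tasks[key] = serialized_task

    def execute(self, key):
        # Only now do we pay the deserialization cost.
        func, args = pickle.loads(self.tasks[key])
        return func(*args)

store = WorkerTaskStore()
store.add_task("x", pickle.dumps((sum, ([1, 2, 3],))))
assert store.execute("x") == 6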

Notice

  • Since we do not serialize task arguments, we have to handle the implicit conversion of lists to tuples by msgpack (see msgpack_persist_lists() and the sketch after this list).
  • We do not necessarily serialize all computations; a task graph such as {"x": None} will result in a task containing just None. This is a potential problem for the Scheduler that we have to handle.
  • The goal here is not performance; the goal is to simplify serialization and make it clear what we need from the HLG pack and unpack functions.
  • This PR doesn't require any changes to Dask, but in a follow-up PR I will use this new, clean serialization protocol to simplify the HLG pack and unpack functions.
  • With this PR, it is my hope that we can improve performance by implementing single pass serialization.
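
On the list/tuple point: distributed unpacks msgpack with use_list=False, so packed lists come back as tuples. The helper below is only a guess at the shape of msgpack_persist_lists() (tag lists before packing, untag after unpacking), not the PR's actual implementation:

import msgpack

# With use_list=False, msgpack returns arrays as tuples:
data = msgpack.packb([1, 2, 3])
assert msgpack.unpackb(data, use_list=False) == (1, 2, 3)

def tag_lists(obj):
    # Replace lists with a tagged dict so the distinction survives msgpack.
    if isinstance(obj, list):
        return {"__list__": [tag_lists(o) for o in obj]}
    if isinstance(obj, tuple):
        return tuple(tag_lists(o) for o in obj)
    return obj

def untag_lists(obj):
    # Restore tagged dicts back to lists after unpacking.
    if isinstance(obj, dict) and "__list__" in obj:
        return [untag_lists(o) for o in obj["__list__"]]
    if isinstance(obj, tuple):
        return tuple(untag_lists(o) for o in obj)
    return obj

packed = msgpack.packb(tag_lists([1, (2, 3)]))
assert untag_lists(msgpack.unpackb(packed, use_list=False)) == [1, (2, 3)]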

@mrocklin (Member) commented:

Some things that feel good and not good about this approach:

Good

We're reducing the number of approaches / specifications of tasks in the client/scheduler/worker. This is a big step in the right direction.

Bad

We're exposing more structural information to Dask infrastructure than we need to. What was previously a black box that got passed around is now very transparent. As mentioned this introduces some pain around tuple/list distinctions, and possibly other interactions between tasks and our messaging / serialization. In general this feels like a step in the wrong direction.

Thoughts

To me it seems like there is a tension between the following two desires:

  1. We want the client and worker to pass each other tasks as a black box, an opaque blob of bytes. This makes message passing simple
  2. The scheduler is now getting involved in task creation due to high level graphs

So we want something that is both a black box to the scheduler and also not a black box to the scheduler. That seems unfortunate. This PR takes the approach of "well, given that the scheduler is involved, let's just accept that and expose the Dask task spec to the scheduler so that it can make tasks like the client (except, of course, that things like callables will be pre-serialized)". This makes sense. It feels better than what we have today.

However, if we're going to make a big change like this, then it makes me wonder what other big changes are available to us. This is a step in a good direction, but is there a step in a better direction?

Alternative

As an alternative let me propose the following. While Dask has a task specification that looks like this: (add, (sum, [1, 2, 3]), "x"), we don't need to use that task specification in the Dask.distributed machinery. We can do something totally different.

I don't know what would be better necessarily, but to start a conversation I'll propose something different. Dask.distributed represents a task as a bytestring payload and a msgpack serializable dictionary of kwargs. Using Python syntax:

class Task:
    payload: bytes
    kwargs: dict

In the common case without high level graphs, kwargs would probably be empty most of the time. In the case of high level graph layers we would have a payload from the client, which would deserialize into some callable, and then the worker would **splat the kwargs into that callable. The scheduler would be confined to parametrizing within this dictionary, rather than within the full task.
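
A minimal sketch of that shape, with pickle standing in for whatever the client would actually use for the payload and partial_sum as a made-up example:

import pickle
from dataclasses import dataclass

@dataclass
class Task:
    payload: bytes   # opaque to the scheduler
    kwargs: dict     # msgpack-serializable; the scheduler may fill these in

# Client side: serialize the callable part once.
def partial_sum(extra, data=None):
    return sum(data) + extra

task = Task(payload=pickle.dumps(partial_sum), kwargs={"extra": 10})

# Scheduler side: only touches the kwargs dict, never the payload.
task.kwargs["data"] = [1, 2, 3]

# Worker side: deserialize the payload and **splat the kwargs into it.
func = pickle.loads(task.payload)
assert func(**task.kwargs) == 16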

I don't think that this is necessarily a good plan though, it's just a different one. I think it would be good for us to think through the different options out in the open before we go down one path or another.

Successfully merging this pull request may close these issues.

  • [DISCUSSION] Can the scheduler use pickle.dumps?
  • [Discussion] Serialize objects within tasks