[Dask.order] Memory usage regression for flox xarray reductions #10618

Closed
fjetter opened this issue Nov 8, 2023 · 7 comments

fjetter (Member) commented Nov 8, 2023

Looks like I caught our first severe regression introduced by #10535

import xarray as xr
import dask.array as da
from dask.base import visualize
from dask.order import diagnostics
from dask.base import collections_to_dsk
import flox.xarray

dummy = xr.Dataset(
    {"foo": (("x", "y", "z"), da.ones((3, 10, 10), chunks=(1, 5, 5)))},
    {"x": ["a", "b", "c"], "y": range(10)},
)
result = flox.xarray.xarray_reduce(
    dummy,
    dummy.x,
    func="sum",
)
print(max(diagnostics(collections_to_dsk([result]))[1]))  # max memory pressure ("max_pressure") reported by dask.order
visualize(
    result,
    filename='flox-reduction-order.png',
    color='order',
    optimize_graph=True,
)

Pre #10535

max_pressure: 6

[ordering visualization: pre-#10535, tasks colored by execution order]

Tasks are loaded as required and reducers are scheduled promptly. This is nice.

Main (w/ regression)

max_pressure: 8

[ordering visualization: main branch with regression, tasks colored by execution order]

We can see that dependents of root tasks 1 and 2 are scheduled greedily. This can become quite bad for very large graphs.

Raw graph for testing
# Placeholder definitions so the snippet stands alone: single-letter key
# prefixes and a no-op callable standing in for the real task functions.
abcde = "abcde"


def f(*args):
    pass


a, b, c, d, e = abcde
dsk = {
    (a, 0): 0,
    (a, 1): 0,
    (a, 2): 0,
    (b, 0, 0, 0): (f, (a, 0), (1, 5, 5)),
    (b, 0, 0, 1): (f, (a, 1), (1, 5, 5)),
    (b, 0, 0, 2): (f, (a, 2), (1, 5, 5)),
    (b, 0, 1, 0): (f, (a, 0), (1, 5, 5)),
    (b, 0, 1, 1): (f, (a, 1), (1, 5, 5)),
    (b, 0, 1, 2): (f, (a, 2), (1, 5, 5)),
    (b, 1, 0, 0): (f, (a, 0), (1, 5, 5)),
    (b, 1, 0, 1): (f, (a, 1), (1, 5, 5)),
    (b, 1, 0, 2): (f, (a, 2), (1, 5, 5)),
    (b, 1, 1, 0): (f, (a, 0), (1, 5, 5)),
    (b, 1, 1, 1): (f, (a, 1), (1, 5, 5)),
    (b, 1, 1, 2): (f, (a, 2), (1, 5, 5)),
    (c, 0, 0, 1): (f, [(b, 0, 1, 0), (b, 0, 1, 1), (b, 0, 1, 2)]),
    (c, 0, 1, 0): (f, [(b, 1, 0, 0), (b, 1, 0, 1), (b, 1, 0, 2)]),
    (c, 0, 1, 1): (f, [(b, 1, 1, 0), (b, 1, 1, 1), (b, 1, 1, 2)]),
    (c, 0, 0, 0): (f, [(b, 0, 0, 0), (b, 0, 0, 1), (b, 0, 0, 2)]),
    (d, 0, 0, 0): (c, 0, 0, 0),
    (d, 0, 0, 1): (c, 0, 0, 1),
    (d, 0, 1, 0): (c, 0, 1, 0),
    (d, 0, 1, 1): (c, 0, 1, 1),
}
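
For quick experiments, the same ordering diagnostics used in the reproducer above can be run directly on this raw graph (a sketch, assuming the placeholder abcde/f definitions added above):

from dask.order import diagnostics, order

# Priorities that dask.order assigns to each key
priorities = order(dsk)

# As in the reproducer, the second element returned by diagnostics()
# is the memory-pressure trace; its maximum is the max_pressure figure above.
pressure = diagnostics(dsk)[1]
print(max(pressure))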
github-actions bot added the "needs triage" label on Nov 8, 2023
fjetter added the "array", "core", and "regression" labels and removed "needs triage" on Nov 8, 2023
fjetter (Member, Author) commented Nov 8, 2023

@dcherian most of the xarray graphs I've been looking at have this kind of pattern where there are data tasks that basically store an array. If I look at the raw graph of the example above, I get something like

[raw graph visualization: array-backed data tasks feeding the chunk tasks]

which are the root tasks in the above graphs

[the same tasks shown as the root tasks in the ordered graphs above]

Those tasks are often the culprits that throw off dask.order, and I wonder what they are, where they come from, and whether they can be avoided. I assume the array of size one here is just an artifact of my small toy example, or is this actually a common thing? If that data stays small in large-scale graphs as well, we should just inline it and be rid of the more complex task graphs.

fjetter (Member, Author) commented Nov 8, 2023

we should just inline this data and be rid of the more complex task graphs

maybe I should just inline this kind of thing for the ordering part... I'll have to play with this idea a bit.

I would still like to learn more about the pattern that causes these kinds of graphs.

dcherian (Contributor) commented Nov 8, 2023

cc @TomNicholas

Xarray calls dask.array.from_array on an array-like object. We could pass inline_array=True there; see pydata/xarray#6566. No one wants to make a decision on whether inline_array should be True or False by default. It would help if dask had a strong opinion here. @mrocklin has the most context here, I believe.

EDIT: more context here: #6773 (comment)
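
For reference, a minimal sketch of what inline_array changes at the dask level, using a plain NumPy array as a stand-in for the array-like that xarray actually wraps (the key counts in the comments are illustrative):

import numpy as np
import dask.array as da

x = np.ones((3, 10, 10))  # stand-in for the on-disk array-like

# Default: the source array gets its own key in the graph and every
# chunk task depends on that single root "data task".
not_inlined = da.from_array(x, chunks=(1, 5, 5), inline_array=False)

# Inlined: the array-like is embedded directly into each chunk task,
# so the extra root key disappears.
inlined = da.from_array(x, chunks=(1, 5, 5), inline_array=True)

print(len(dict(not_inlined.__dask_graph__())))  # chunk tasks plus one source key
print(len(dict(inlined.__dask_graph__())))      # chunk tasks only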

dcherian (Contributor) commented Nov 8, 2023

I caught our first severe regression introduced by #10535

This is also the major difference between examples that use random data and examples that read from files, like a real workload does.

fjetter (Member, Author) commented Nov 8, 2023

No one wants to make a decision on whether inline_array should be True or False by default. It would help if dask had a strong opinion here.

Well, that depends on what that data actually is. I'm not familiar enough with xarray or zarr to make this decision for you.

Generally speaking, if you stick to the best practice of storing any kind of sizable data remotely (i.e. anything beyond a couple of MB), you should be better off with inlining. I strongly hope that a zarr array typically just points to a remote storage location, so that the actual payload data is not literally embedded in the graph.

The one point that is a little too simplified in this discussion is that, while pickle is smart, the scheduler still has to send this data N times to the workers. I'm not entirely sure whether pickle on the workers can actually deduplicate this data 🤔 (I'll run some tests)
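
A quick, simplified check of that point (plain pickle on a made-up toy graph, not how distributed actually serializes and ships tasks):

import operator
import pickle

import numpy as np

# Hypothetical toy graph: the same in-memory payload embedded in many tasks.
payload = np.ones(100_000)
tasks = {f"chunk-{i}": (operator.getitem, payload, i) for i in range(10)}

# Pickling the whole graph at once: pickle memoizes the shared payload,
# so its bytes are stored only once.
whole_graph = len(pickle.dumps(tasks))

# Pickling each task separately (roughly what shipping tasks to workers
# one by one amounts to): the payload is re-serialized every time.
per_task = sum(len(pickle.dumps(task)) for task in tasks.values())

print(whole_graph, per_task)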

dcherian (Contributor) commented Nov 8, 2023

a zarr array is typically just pointing to a remote storage location

It could be a local file too, but it is not an array in memory.

Here's an example:

import xarray as xr

# Write a small tutorial dataset to a local zarr store, then re-open it lazily.
xr.tutorial.open_dataset("air_temperature").to_zarr("test.zarr", mode="w")

ds = xr.open_zarr("test.zarr", chunks={})
ds
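
Peeking at the low-level graph of the zarr-backed variable shows the kind of root "data task" discussed above (a sketch; the exact key names and wrapper types depend on the xarray and dask versions):

# One key in the graph holds the zarr-backed array wrapper, i.e. a reference
# to storage rather than data in memory; the chunk tasks all depend on it.
graph = dict(ds.air.data.__dask_graph__())
for key, value in graph.items():
    print(key, type(value))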

fjetter (Member, Author) commented Dec 13, 2023

I think this was closed by #10660

fjetter closed this as completed on Dec 13, 2023