[Dask.order] Memory usage regression for flox xarray reductions #10618

Closed
fjetter opened this issue Nov 8, 2023 · 7 comments

fjetter (Member) commented Nov 8, 2023

Looks like I caught our first severe regression introduced by #10535

import xarray as xr
import dask.array as da
from dask.base import visualize
from dask.order import diagnostics
from dask.base import collections_to_dsk
import flox.xarray

dummy = xr.Dataset(
    {"foo": (("x", "y", "z"), da.ones((3, 10, 10), chunks=(1, 5, 5)))},
    {"x": ["a", "b", "c"], "y": range(10)},
)
result = flox.xarray.xarray_reduce(
    dummy,
    dummy.x,
    func="sum",
)
print(max(diagnostics(collections_to_dsk([result]))[1]))  # max memory pressure ("max_pressure") reported by dask.order
visualize(
    result,
    filename='flox-reduction-order.png',
    color='order',
    optimize_graph=True,
)

Pre #10535

max_pressure: 6

[ordering visualization: pre-#10535, tasks colored by execution order]

Tasks are loaded as required and reducers are scheduled promptly. This is nice.

Main (w/ regression)

max_pressure: 8

[ordering visualization: main branch with regression, tasks colored by execution order]

We can see that dependents of root tasks 1 and 2 are scheduled greedily. This can become quite bad for very large graphs.

Raw graph for testing
# Placeholder definitions so the snippet stands alone: single-letter key
# prefixes and a no-op callable standing in for the real task functions.
abcde = "abcde"


def f(*args):
    pass


a, b, c, d, e = abcde
dsk = {
    (a, 0): 0,
    (a, 1): 0,
    (a, 2): 0,
    (b, 0, 0, 0): (f, (a, 0), (1, 5, 5)),
    (b, 0, 0, 1): (f, (a, 1), (1, 5, 5)),
    (b, 0, 0, 2): (f, (a, 2), (1, 5, 5)),
    (b, 0, 1, 0): (f, (a, 0), (1, 5, 5)),
    (b, 0, 1, 1): (f, (a, 1), (1, 5, 5)),
    (b, 0, 1, 2): (f, (a, 2), (1, 5, 5)),
    (b, 1, 0, 0): (f, (a, 0), (1, 5, 5)),
    (b, 1, 0, 1): (f, (a, 1), (1, 5, 5)),
    (b, 1, 0, 2): (f, (a, 2), (1, 5, 5)),
    (b, 1, 1, 0): (f, (a, 0), (1, 5, 5)),
    (b, 1, 1, 1): (f, (a, 1), (1, 5, 5)),
    (b, 1, 1, 2): (f, (a, 2), (1, 5, 5)),
    (c, 0, 0, 1): (f, [(b, 0, 1, 0), (b, 0, 1, 1), (b, 0, 1, 2)]),
    (c, 0, 1, 0): (f, [(b, 1, 0, 0), (b, 1, 0, 1), (b, 1, 0, 2)]),
    (c, 0, 1, 1): (f, [(b, 1, 1, 0), (b, 1, 1, 1), (b, 1, 1, 2)]),
    (c, 0, 0, 0): (f, [(b, 0, 0, 0), (b, 0, 0, 1), (b, 0, 0, 2)]),
    (d, 0, 0, 0): (c, 0, 0, 0),
    (d, 0, 0, 1): (c, 0, 0, 1),
    (d, 0, 1, 0): (c, 0, 1, 0),
    (d, 0, 1, 1): (c, 0, 1, 1),
}
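
For quick experiments, the same ordering diagnostics used in the reproducer above can be run directly on this raw graph (a sketch, assuming the placeholder abcde/f definitions added above):

from dask.order import diagnostics, order

# Priorities that dask.order assigns to each key
priorities = order(dsk)

# As in the reproducer, the second element returned by diagnostics()
# is the memory-pressure trace; its maximum is the max_pressure figure above.
pressure = diagnostics(dsk)[1]
print(max(pressure))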
github-actions bot added the "needs triage" label on Nov 8, 2023
fjetter added the "array", "core", and "regression" labels and removed "needs triage" on Nov 8, 2023
fjetter (Member, Author) commented Nov 8, 2023

@dcherian most of the xarray graphs I've been looking at have this kind of pattern where there are data tasks that basically store an array. If I look at the raw graph of the example above, I get something like

[raw graph visualization: array-backed data tasks feeding the chunk tasks]

which are the root tasks in the above graphs

[the same tasks shown as the root tasks in the ordered graphs above]

Those tasks are often the culprits that throw off dask.order, and I wonder what they are, where they come from, and whether they can be avoided. I assume the array of size one here is just an artifact of my small toy example, or is this actually a common thing? If that data stays small in large-scale graphs as well, we should just inline it and be rid of the more complex task graphs.

fjetter (Member, Author) commented Nov 8, 2023

we should just inline this data and be rid of the more complex task graphs

maybe I should just inline this kind of thing for the ordering part... I'll have to play with this idea a bit.

I would still like to learn more about the pattern that causes these kinds of graphs.

dcherian (Contributor) commented Nov 8, 2023

cc @TomNicholas

Xarray calls dask.array.from_array on an array-like object. We could pass inline_array=True there; see pydata/xarray#6566. No one wants to make a decision on whether inline_array should be True or False by default. It would help if dask had a strong opinion here. @mrocklin has the most context here, I believe.

EDIT: more context here: #6773 (comment)
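
For reference, a minimal sketch of what inline_array changes at the dask level, using a plain NumPy array as a stand-in for the array-like that xarray actually wraps (the key counts in the comments are illustrative):

import numpy as np
import dask.array as da

x = np.ones((3, 10, 10))  # stand-in for the on-disk array-like

# Default: the source array gets its own key in the graph and every
# chunk task depends on that single root "data task".
not_inlined = da.from_array(x, chunks=(1, 5, 5), inline_array=False)

# Inlined: the array-like is embedded directly into each chunk task,
# so the extra root key disappears.
inlined = da.from_array(x, chunks=(1, 5, 5), inline_array=True)

print(len(dict(not_inlined.__dask_graph__())))  # chunk tasks plus one source key
print(len(dict(inlined.__dask_graph__())))      # chunk tasks only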

dcherian (Contributor) commented Nov 8, 2023

I caught our first severe regression introduced by #10535

This is also the major difference between examples that use random data and examples that read from files, like a real workload does.

fjetter (Member, Author) commented Nov 8, 2023

No one wants to make a decision on whether inline_array should be True or False by default. It would help if dask had a strong opinion here.

Well, that depends on what that data actually is. I'm not familiar enough with xarray or zarr to make this decision for you.

Generally speaking, if you stick to the best practice of storing any kind of sizable data remotely (i.e. anything beyond a couple of MB), you should be better off with inlining. I strongly hope that a zarr array typically just points to a remote storage location, so that the actual payload data is not literally embedded in the graph.

The one point that is a little too simplified in this discussion is that, while pickle is smart, the scheduler still has to send this data N times to the workers. I'm not entirely sure whether pickle on the workers can actually deduplicate this data 🤔 (I'll run some tests)
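
A quick, simplified check of that point (plain pickle on a made-up toy graph, not how distributed actually serializes and ships tasks):

import operator
import pickle

import numpy as np

# Hypothetical toy graph: the same in-memory payload embedded in many tasks.
payload = np.ones(100_000)
tasks = {f"chunk-{i}": (operator.getitem, payload, i) for i in range(10)}

# Pickling the whole graph at once: pickle memoizes the shared payload,
# so its bytes are stored only once.
whole_graph = len(pickle.dumps(tasks))

# Pickling each task separately (roughly what shipping tasks to workers
# one by one amounts to): the payload is re-serialized every time.
per_task = sum(len(pickle.dumps(task)) for task in tasks.values())

print(whole_graph, per_task)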

dcherian (Contributor) commented Nov 8, 2023

a zarr array is typically just pointing to a remote storage location

It could be a local file too, but it is not an array in memory.

Here's an example:

import xarray as xr

# Write a small tutorial dataset to a local zarr store, then re-open it lazily.
xr.tutorial.open_dataset("air_temperature").to_zarr("test.zarr", mode="w")

ds = xr.open_zarr("test.zarr", chunks={})
ds
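
Peeking at the low-level graph of the zarr-backed variable shows the kind of root "data task" discussed above (a sketch; the exact key names and wrapper types depend on the xarray and dask versions):

# One key in the graph holds the zarr-backed array wrapper, i.e. a reference
# to storage rather than data in memory; the chunk tasks all depend on it.
graph = dict(ds.air.data.__dask_graph__())
for key, value in graph.items():
    print(key, type(value))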

fjetter (Member, Author) commented Dec 13, 2023

I think this was closed by #10660

fjetter closed this as completed on Dec 13, 2023