[Dask.order] Memory usage regression for flox xarray reductions #10618
@dcherian most of the xarray graphs I've been looking at have this kind of pattern where there are data tasks that basically store an array. If I look at the raw graph of the above example, I can see these data tasks showing up as the root tasks in the above graphs. Those tasks are often the culprits that throw off dask.order, and I wonder what they are, where they come from, and whether they can be avoided. Is the fact that we're dealing with an array of size one just because I'm running a small toy example, or is this a common thing? If it also holds for large-scale graphs, we should just inline this data and be rid of the more complex task graphs.
Maybe I should just inline this kind of thing for the ordering part... I'll have to play with this idea a bit. I would still like to learn more about the pattern that causes these kinds of graphs.
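For reference, this is roughly how I'm spotting those data tasks; a minimal sketch (the toy `from_array` call just stands in for whatever xarray embeds, and it assumes the classic tuple-based graph representation):

```python
# Sketch: find graph keys whose value is a literal payload ("data task")
# rather than a runnable task tuple. Assumes classic tuple-based graphs.
import numpy as np
import dask.array as da
from dask.core import istask

x = da.from_array(np.ones((4, 4)), chunks=2)  # stand-in for a zarr-backed array
dsk = dict(x.__dask_graph__())

data_tasks = {k: v for k, v in dsk.items() if not istask(v)}
print(f"{len(data_tasks)} data task(s) out of {len(dsk)} keys")
```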
cc @TomNicholas. Xarray calls ...

EDIT: more context here: #6773 (comment)
This is also the major difference between examples that use random data and examples that read files like a real workload.
Well, that depends on what that data actually is. I'm not familiar enough with xarray or zarr to make this decision for you. Generally speaking, if one sticks to the best practice of storing any kind of sizable data remotely (i.e. anything beyond a couple of MB), you should be better off with inlining. I strongly hope that a zarr array typically just points to a remote storage location, such that the actual payload data is not literally embedded in the graph.

The one point that is a little too simplified in this discussion: while pickle is smart, the scheduler still has to send this data N times to the workers. I'm not entirely sure whether pickle on the workers can actually deduplicate this data 🤔 (I'll run some tests)
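To make the pickle point a bit more concrete, here is a rough, self-contained check (just a sketch, not the scheduler's actual serialization path):

```python
import pickle
import numpy as np

payload = np.zeros(1_000_000)  # ~8 MB array standing in for embedded chunk data

# The same object referenced from ten graph keys pickles roughly once, because
# pickle memoizes repeated objects within a single dumps() call ...
graph = {f"load-{i}": payload for i in range(10)}
print("one message, shared object:", len(pickle.dumps(graph)) / 1e6, "MB")

# ... but sending those tasks out in separate messages serializes and transfers
# the payload once per message, with no deduplication across them.
per_message = sum(len(pickle.dumps({f"load-{i}": payload})) for i in range(10))
print("ten separate messages:     ", per_message / 1e6, "MB")
```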
It could be a local file too, but it is not an array in memory. Here's an example:

```python
import xarray as xr

xr.tutorial.open_dataset("air_temperature").to_zarr("test.zarr", mode="w")
ds = xr.open_zarr("test.zarr", chunks={})
ds
```
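If inlining does turn out to be the better default: recent xarray releases (and dask's `from_array`/`from_zarr`) expose an `inline_array` option, assuming a reasonably recent version. A quick way to compare the two graph shapes might look like this (the chunk size below is arbitrary):

```python
# Sketch: compare graph sizes with and without inline_array
# (assumes the inline_array keyword available in recent xarray releases).
import xarray as xr

ds_pointer = xr.open_zarr("test.zarr", chunks={"time": 100})
ds_inlined = xr.open_zarr("test.zarr", chunks={"time": 100}, inline_array=True)

for label, ds in [("pointer", ds_pointer), ("inlined", ds_inlined)]:
    graph = dict(ds.air.mean().__dask_graph__())
    print(label, len(graph), "tasks")
```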
I think this was closed by #10660
Looks like I caught our first severe regression introduced by #10535.
Pre #10535
max_pressure: 6
Tasks are loaded as required and reducers are scheduled promptly. This is nice.
Main (w/ regression)
max_pressure: 8

We can see that dependents of root tasks 1 and 2 are scheduled greedily. This can become quite bad for very large graphs.

Raw graph for testing