Make creating a MultiIndex in stack optional #5202
Comments
Do we have any ideas on how expensive the MultiIndex creation is as a share of |
It depends, but it can easily be 50% to nearly 100% of the runtime. If we use Fortran order arrays, we can get a rough lower bound on the time for MultiIndex creation, e.g., consider:
import xarray
import numpy as np
a = xarray.DataArray(np.ones((5000, 5000), order='F'), dims=['x', 'y'])
%prun a.stack(z=['x', 'y'])
Not surprisingly, making the multi-index takes about half the runtime here. Pandas does delay creating the actual hash-table behind a MultiIndex until it's needed, so I guess the main expense here is just allocating the new coordinate arrays.
It's a large problem when working with Dask/Zarr: I had cases where stacking the dimensions took ~15 minutes while computing and saving the dataset was done in < 1 min.
Great, this seems like a good idea — at the very least an |
Besides the CPU requirements, IMHO, the memory consumption is even worse. Imagine you want to hold a 1000x1000x1000 int64 array. That would be ~7.5 GiB (8 GB) and still fits into RAM on most machines. Now if you stack that, you end up with three additional 7.5 GiB arrays, one for each stacked coordinate. With higher dimensions the situation gets even worse. That said, while it generally should be possible to create the coordinates of the stacked array on the fly, I don't have a solution for it. Side note: |
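To make the memory estimate in the comment above concrete, here is a minimal sketch in plain Python. It assumes, as the comment does, that each stacked coordinate is fully materialized with 8 bytes per element; the real footprint of a pandas MultiIndex may differ. The helper names `stacked_memory_bytes` and `coords_at` are hypothetical and not part of xarray. The second helper illustrates the "on the fly" idea: given a position along the stacked dimension, the per-dimension positions can be recovered arithmetically, assuming row-major order where the last stacked dimension varies fastest.

```python
from math import prod


def stacked_memory_bytes(shape, itemsize=8):
    """Return (data_bytes, coord_bytes) for an array of `shape`.

    Assumption: stacking materializes one coordinate array per
    dimension, each as long as the flattened data.
    """
    n = prod(shape)
    data_bytes = n * itemsize
    coord_bytes = len(shape) * n * itemsize
    return data_bytes, coord_bytes


def coords_at(position, shape):
    """Map a flat (row-major) position to per-dimension positions
    without storing any coordinate arrays."""
    out = []
    for size in reversed(shape):
        position, i = divmod(position, size)
        out.append(i)
    return tuple(reversed(out))


GIB = 2**30
shape = (1000, 1000, 1000)
data_bytes, coord_bytes = stacked_memory_bytes(shape)
print(f"data:           {data_bytes / GIB:.2f} GiB")
print(f"stacked coords: {coord_bytes / GIB:.2f} GiB")
print(coords_at(1_002_003, shape))
```

Under these assumptions the coordinates alone take three times the memory of the data, while the arithmetic lookup needs no extra storage. It only yields integer positions, so real coordinate labels would still need a lookup into the original one-dimensional coordinates.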
From #5692 (comment):
cc @pydata/xarray as this is an improvement regarding this issue but also a sensible change. To ensure a smoother transition we could maybe add a
We can default to |
As @Hoeze notes in #5179, calling stack() can be "incredibly slow and memory-demanding, since it creates a MultiIndex of every possible coordinate in the array."
This is true with how stack() works currently, but I'm not sure this is necessary. I suspect it's a vestigial design choice from copying pandas, back from before Xarray had optional indexes. One benefit is that it's convenient for making unstack() the inverse of stack(), but this isn't always required.
Regardless of how we define the semantics for boolean indexing (#1887), it seems like it could be a good idea to allow stack to skip creating a MultiIndex for the new dimension, via a new keyword argument such as ds.stack(index=False). This would be equivalent to calling reset_index() after stack() but would be cheaper because the MultiIndex is never created in the first place.
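For reference, the existing two-step route mentioned above (stack() followed by reset_index()) can be sketched as follows. This is only an illustration on a tiny, made-up dataset; the variable names are assumptions, and the proposed `index=False` keyword is not used because it is only a suggestion in this issue.

```python
import numpy as np
import xarray as xr

# A small illustrative dataset with two dimensions.
ds = xr.Dataset(
    {"v": (("x", "y"), np.arange(6).reshape(2, 3))},
    coords={"x": [10, 20], "y": [1, 2, 3]},
)

# Step 1: stack builds a MultiIndex for the new dimension "z".
stacked = ds.stack(z=["x", "y"])

# Step 2: reset_index drops that MultiIndex again, keeping "x" and "y"
# as plain coordinates along "z".
flat = stacked.reset_index("z")

print(flat)
```

The result has the stacked layout without an index on z, but the MultiIndex is still built and then discarded, which is the cost the proposal aims to avoid.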