
Make creating a MultiIndex in stack optional #5202

Closed
shoyer opened this issue Apr 21, 2021 · 7 comments · Fixed by #5692

Comments

@shoyer
Member

shoyer commented Apr 21, 2021

As @Hoeze notes in #5179, calling stack() can be "incredibly slow and memory-demanding, since it creates a MultiIndex of every possible coordinate in the array."

This is true with how stack() works currently, but I'm not sure it's necessary. I suspect it's a vestigial design choice copied from pandas, dating back to before Xarray had optional indexes. One benefit is that it makes it convenient for unstack() to invert stack(), but that isn't always required.

Regardless of how we define the semantics for boolean indexing (#1887), it seems like it could be a good idea to allow stack to skip creating a MultiIndex for the new dimension, via a new keyword argument such as ds.stack(index=False). This would be equivalent to calling reset_index() after stack() but would be cheaper because the MultiIndex is never created in the first place.
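
For reference, a minimal sketch of the proposed shortcut next to the current workaround (the index=False keyword is only a proposal at this point; reset_index is existing API):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"t": (("x", "y"), np.zeros((3, 4)))},
    coords={"x": [10, 20, 30], "y": list("abcd")},
)

# Current workaround: build the MultiIndex during stack, then drop it right away.
flat = ds.stack(z=("x", "y")).reset_index("z")

# Proposed shortcut (hypothetical keyword): never build the MultiIndex at all.
# flat = ds.stack(z=("x", "y"), index=False)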

@max-sixty
Collaborator

Do we have any ideas on how expensive the MultiIndex creation is as a share of stack?

@shoyer
Member Author

shoyer commented Apr 22, 2021

Do we have any ideas on how expensive the MultiIndex creation is as a share of stack?

It depends, but it can easily be 50% to nearly 100% of the runtime. stack() uses reshape() on data variables, which is either free (for arrays that are still contiguous and can use views) or can be delayed until compute-time (with dask). In contrast, the MultiIndex is always created eagerly.

If we use Fortran order arrays, we can get a rough lower bound on the time for MultiIndex creation, e.g., consider:

import xarray
import numpy as np

# 5000x5000 Fortran-ordered array; stacking it flattens to 25 million rows.
a = xarray.DataArray(np.ones((5000, 5000), order='F'), dims=['x', 'y'])
# Profile the stack call (IPython magic).
%prun a.stack(z=['x', 'y'])

Not surprisingly, making the multi-index takes about half the runtime here.

Pandas does delay creating the actual hash-table behind a MultiIndex until it's needed, so I guess the main expense here is just allocating the new coordinate arrays.
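
A rough proxy for that cost, assuming the 5000x5000 example above (pd.MultiIndex.from_product is not exactly what stack calls internally, but it allocates comparable coordinate arrays):

import numpy as np
import pandas as pd

# Building the 25-million-row MultiIndex alone, with no data reshape involved.
%time idx = pd.MultiIndex.from_product([np.arange(5000), np.arange(5000)])

# The hash table behind the index is only built lazily, on first lookup.
%time idx.get_loc((0, 0))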

@Hoeze

Hoeze commented Apr 23, 2021

It's a large problem when working with Dask/Zarr:

  • First, it loads all indices into memory
  • Then, it computes the MultiIndex in a single thread

I had cases where stacking the dimensions took ~15 minutes while computing+saving the dataset was done in < 1min.
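
A minimal sketch of that behaviour (assuming dask is installed): the reshape of the data stays lazy, but the MultiIndex coordinate is created eagerly, in memory, on the main thread:

import numpy as np
import xarray as xr

da = xr.DataArray(
    np.ones((2000, 2000)),
    dims=["x", "y"],
    coords={"x": np.arange(2000), "y": np.arange(2000)},
).chunk({"x": 500, "y": 500})

stacked = da.stack(z=("x", "y"))
print(type(stacked.data))        # still a lazy dask array
print(stacked.indexes["z"][:3])  # the 4-million-row MultiIndex already exists in memory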

@max-sixty
Collaborator

Great, this seems like a good idea — at the very least an index=False option

@martinitus

Besides the CPU cost, IMHO the memory consumption is even worse.

Imagine you want to hold a 1000x1000x1000 int64 array. That's ~7.5 GiB and still fits into RAM on most machines.
Let's assume float coordinates for all three axes. Their memory consumption of 3000*8 bytes is negligible.

Now if you stack that, you end up with three additional ~7.5 GiB coordinate arrays. With higher dimensions the situation gets even worse.

That said, while it generally should be possible to create the coordinates of the stacked array on the fly, I don't have a solution for it.

Side note:
I stumbled over that when combining xarray with pytorch, where I want to evaluate a model on a large cartesian grid. For that I stacked the array and batched the stacked coordinates to feed them to pytorch, which makes the iteration over the cartesian space really nice and smooth in code.
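
A sketch of what "coordinates on the fly" could look like for that batching use case, using np.unravel_index so the three full-length stacked coordinate arrays are never materialized (all names here are illustrative):

import numpy as np

# Illustrative axes of a large cartesian grid (float coordinates per axis).
xs = np.linspace(0.0, 1.0, 1000)
ys = np.linspace(0.0, 1.0, 1000)
zs = np.linspace(0.0, 1.0, 1000)
shape = (len(xs), len(ys), len(zs))
n_points = len(xs) * len(ys) * len(zs)

def batched_coords(batch_size=1_000_000):
    """Yield (x, y, z) coordinate batches without allocating three
    full length-10**9 coordinate arrays up front."""
    for start in range(0, n_points, batch_size):
        flat = np.arange(start, min(start + batch_size, n_points))
        i, j, k = np.unravel_index(flat, shape)
        yield xs[i], ys[j], zs[k]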

@benbovy
Member

benbovy commented Jun 7, 2021

it seems like it could be a good idea to allow stack to skip creating a MultiIndex for the new dimension, via a new keyword argument such as ds.stack(index=False)

Dataset.stack might eventually accept any custom index (that supports it) if that makes sense. Would index=None be slightly better than index=False in that case? (considering that the default value would be index=PandasMultiIndex or something like that).

@benbovy
Member

benbovy commented Oct 6, 2021

From #5692 (comment):

One change is that a multi-index is not always created with stack. It is created only if each of the dimensions to stack together has one and only one coordinate with a pandas index (this could be a non-dimension coordinate).

This could maybe address #5202, since we could simply drop the indexes before stacking the dimensions in order to avoid creating a multi-index. I don't think it's a big breaking change either, unless there are users who rely on default multi-indexes with range (0, 1, 2...) levels. Looking at #5202, however, those default multi-indexes seem more of a problem than genuinely useful, but I might be wrong here. Also, range-based indexes can still be created explicitly before stacking the dimensions if needed.

Another consequence is that stack is not always reversible, since unstack still requires a pandas multi-index (one and only one multi-index per dimension to unstack).

cc @pydata/xarray as this is an improvement regarding this issue, but also a sensitive change. To ensure a smoother transition we could maybe add a create_index option to stack which accepts these values:

  • True: always create a multi-index
  • False: never create a multi-index
  • None: create a multi-index only if we can unambiguously pick one index for each of the dimensions to stack

We can default to True now to avoid breaking changes and maybe later default to None. If we eventually add support for custom (non-pandas backed) indexes, we could also allow passing an xarray.Index class.
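
For illustration, usage under that proposal could look like this (a sketch only; create_index is a proposed keyword at this point, and ds is any dataset with indexed x and y dimensions):

ds.stack(z=("x", "y"), create_index=True)   # always create a multi-index
ds.stack(z=("x", "y"), create_index=False)  # never create a multi-index
ds.stack(z=("x", "y"), create_index=None)   # create one only if unambiguous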
