Conversion to pandas for zero-dimensional Data(Set|Array) #5598

Open
martinitus opened this issue Jul 13, 2021 · 2 comments

martinitus commented Jul 13, 2021

What happened:
Converting a zero-dimensional DataArray or Dataset to a pandas DataFrame fails.

What you expected to happen:
I would expect it to return a trivial DataFrame with one row and one column per coordinate and data variable.
However, I am not sure whether that conflicts with other round trips between xarray and pandas, e.g. for one-dimensional DataArrays of size 1.

Minimal Complete Verifiable Example:

from xarray import DataArray, Dataset

da = DataArray([1, 2, 3], dims=("x",), coords=dict(x=[1, 2, 3]))
# I don't know of a way to construct such a DataArray without the isel.
# The line below also works for higher-dimensional DataArrays and
# results in a zero-dimensional DataArray carrying all the coordinates of
# the minimum that was found.
da = da.isel(**da.argmin(dim=("x",)))
ds = Dataset({'a': da})

# Fails with ValueError: cannot convert a scalar to a DataFrame
# (xarray/core/dataarray.py, line 2664, in to_dataframe)
da.to_dataframe(name="foo")
# Expected: a DataFrame with two columns (x and foo) and one row

# Fails with ValueError: no valid index for a 0-dimensional object
# (xarray/core/coordinates.py, line 106, in to_index)
ds.to_dataframe()
# Expected: a DataFrame with two columns (x and a) and one row
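
For anyone hitting the same error, one possible workaround (a sketch, not a fix in xarray itself) is to promote the scalar coordinate back to a length-1 dimension with expand_dims before converting. It assumes the zero-dimensional object still carries the scalar coordinate x, as in the example above; the variable names are only illustrative.

```python
import xarray as xr

da = xr.DataArray([1, 2, 3], dims=("x",), coords={"x": [1, 2, 3]})

# Select the minimum; the result is zero-dimensional but keeps x as a scalar coordinate.
scalar_da = da.isel(x=int(da.argmin("x").item()))
scalar_ds = xr.Dataset({"a": scalar_da})

# Promote the scalar coordinate x to a length-1 dimension, convert, and
# turn the resulting index into an ordinary column.
df_da = scalar_da.expand_dims("x").to_dataframe(name="foo").reset_index()
df_ds = scalar_ds.expand_dims("x").to_dataframe().reset_index()

print(df_da)  # one row, columns x and foo
print(df_ds)  # one row, columns x and a
```

This gives the one-row frames described above, but only because there is a scalar coordinate to promote; an object without any coordinates would need a different approach.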

Anything else we need to know?:
After some testing, I got the behavior I want by simply removing the

def to_dataframe(...):
    ...
    if self.ndim == 0:
        raise ValueError("cannot convert a scalar to a DataFrame")

block from dataarray.py and changing

def to_index(self, ordered_dims: Sequence[Hashable] = None) -> pd.Index:
    ...
    if len(ordered_dims) == 0:
        # previously: raise ValueError("no valid index for a 0-dimensional object")
        return pd.Index([0])

to not raise and instead return a trivial index in coordinates.py.

If that would be considered reasonable behavior, I am happy to contribute the respective unit tests and changes!

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.5 (default, May 19 2021, 11:32:47) [GCC 10.2.0]
python-bits: 64
OS: Linux
OS-release: 5.8.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.17.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.4.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 56.0.0
pip: 21.0.1
conda: None
pytest: 6.2.3
IPython: 7.24.0
sphinx: None
Member

shoyer commented Jul 13, 2021

to_dataframe() always returns a DataFrame with an index based on coordinate values. I guess we could return a trivial integer index, but this feels a little weird/inconsistent to me, particularly because it breaks round-tripping. On the other hand, it is basically exactly what pandas does.

One compromise might be adding an index=False option to to_dataframe() (e.g., ds.to_dataframe(index=False)), which would always create a trivial index (rather than a MultiIndex) and thus obviously break round-tripping.
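
To make the index=False idea concrete, here is a rough sketch of what such an option might do, written as a standalone helper. The function name to_dataframe_without_index is hypothetical and not part of xarray; to_dataframe() has no index argument at the time of this discussion. The helper always returns a trivial integer index, so it deliberately does not round-trip.

```python
import pandas as pd
import xarray as xr


def to_dataframe_without_index(ds: xr.Dataset) -> pd.DataFrame:
    """Hypothetical stand-in for the proposed ds.to_dataframe(index=False)."""
    if not ds.sizes:
        # Zero-dimensional: every variable is a scalar, so build a single row
        # with one column per coordinate and data variable.
        names = [*ds.coords, *ds.data_vars]
        return pd.DataFrame({str(n): [ds[n].values.item()] for n in names})
    # Otherwise flatten the coordinate-based (Multi)Index into ordinary columns.
    return ds.to_dataframe().reset_index()


ds = xr.Dataset({"a": ("x", [3, 1, 2])}, coords={"x": [1, 2, 3]})
print(to_dataframe_without_index(ds))            # three rows, columns x and a
print(to_dataframe_without_index(ds.isel(x=1)))  # one row, columns x and a
```

In both cases the coordinates end up as plain columns, which matches what pandas does when a DataFrame is built from scalars wrapped in lists.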

@martinitus
Author

I'm not a pandas expert, but maybe one can create a dummy index that enforces the size=1 constraint. E.g. an index which only supports one value (e.g. None or zero). That could potentially be used to fix the round-trip.
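
As a point of comparison, a round trip is already possible today without a special dummy index, provided the zero-dimensional object has a scalar coordinate that can act as a length-1 index. The sketch below uses that different technique (expand_dims before the conversion and squeeze after it) rather than the dummy-index idea itself; the example data is made up.

```python
import xarray as xr

ds = xr.Dataset({"a": ("x", [3, 1, 2])}, coords={"x": [1, 2, 3]})
scalar = ds.isel(x=1)  # zero-dimensional Dataset, x survives as a scalar coordinate

# Forward: promote x to a length-1 dimension so it can serve as the index.
df = scalar.expand_dims("x").to_dataframe()

# Backward: rebuild the Dataset and drop the length-1 dimension again;
# squeeze keeps x as a scalar coordinate.
restored = df.to_xarray().squeeze("x")

print(restored.equals(scalar))
```

The size-1 constraint is enforced here only by convention (the caller has to know to squeeze), which is the gap a dedicated single-value index would close.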

Also potentially related: #5202 (also contains discussions about the multiindex/dataset handling)
