
Automatic chunking of arrays ? #4055

Closed
AndrewILWilliams opened this issue May 13, 2020 · 11 comments · Fixed by #4064


@AndrewILWilliams
Contributor

Hi there,

Hopefully this turns out to be a basic issue, but I was wondering why the chunks='auto' option that dask provides (https://docs.dask.org/en/latest/array-chunks.html#automatic-chunking) isn't available in xarray? I'm not 100% sure how dask decides how to automatically chunk its arrays, so maybe there's a compatibility issue?

I get the impression that the dask method automatically tries to prevent the issues of "too many chunks" or "too few chunks" which can sometimes happen when choosing chunk sizes automatically. If so, it would maybe be a useful thing to include in future versions?

Happy to be corrected if I've misunderstood something here, though; I'm still getting my head around how the dask/xarray compatibility really works...

Cheers!
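For background, the core idea behind dask's chunks="auto" is to pick chunk sizes that target a fixed number of bytes per chunk, avoiding both "too many chunks" and "too few chunks". The sketch below is only a toy illustration of that idea, not dask's actual algorithm (which also respects existing chunk structure and a configurable target size):

```python
def auto_chunk(shape, itemsize, target_bytes=128 * 2**20):
    """Toy auto-chunking heuristic: pick a per-axis chunk length so that
    each chunk holds roughly target_bytes of data (default 128 MiB).

    This is an illustration of the principle only; dask's real
    chunks="auto" logic is more sophisticated.
    """
    ndim = len(shape)
    # Side length of an ndim-cube containing target_bytes worth of elements.
    side = int((target_bytes / itemsize) ** (1 / ndim))
    # Never exceed the axis length, never go below one element.
    return tuple(min(dim, max(1, side)) for dim in shape)

# A 100k x 100k float64 array: aim for ~128 MiB chunks.
print(auto_chunk((100_000, 100_000), itemsize=8))  # (4096, 4096)
```

Each (4096, 4096) float64 chunk here is exactly 4096 * 4096 * 8 bytes = 128 MiB, which is the kind of balance the dask docs describe.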

@dcherian
Contributor

dcherian commented May 13, 2020

So da.chunk({dim_name: "auto"}) works but da.chunk("auto") does not. The latter is a relatively easy fix. We just need to update the condition here:

xarray/xarray/core/dataset.py

Lines 1736 to 1737 in bd84186

if isinstance(chunks, Number):
chunks = dict.fromkeys(self.dims, chunks)

A PR would be very welcome if you have the time, @AndrewWilliams3142
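For reference, dict.fromkeys is what expands a scalar chunk spec to every dimension in that branch, so accepting the string "auto" there would broadcast it the same way. A minimal illustration (plain Python, not xarray's code):

```python
dims = ("time", "lat", "lon")

# A numeric scalar chunk size is broadcast to every dimension.
print(dict.fromkeys(dims, 1000))
# {'time': 1000, 'lat': 1000, 'lon': 1000}

# The same mechanism works unchanged for the string "auto",
# producing the per-dimension form that dask already accepts.
print(dict.fromkeys(dims, "auto"))
# {'time': 'auto', 'lat': 'auto', 'lon': 'auto'}
```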

@AndrewILWilliams
Contributor Author

Oh ok, I didn't know about this. I'll take a look and read the contribution docs tomorrow! It'll be my first PR, so I may need a bit of hand-holding when it comes to tests. Willing to try though!

@dcherian
Contributor

Awesome! Please see https://xarray.pydata.org/en/stable/contributing.html for docs on contributing.

@shoyer
Member

shoyer commented May 14, 2020

Agreed, this would be very welcome!

chunks='auto' isn't supported only because xarray support for dask predates it :)

@AndrewILWilliams
Contributor Author

Cheers! Just had a look. Is it as simple as changing this line to the following, @dcherian?

if isinstance(chunks, Number) or chunks == "auto":
    chunks = dict.fromkeys(self.dims, chunks)

This seems to work fine in a lot of cases, except automatic chunking isn't implemented for object dtypes at the moment, so it fails if you pass a cftime coordinate, for example.

One option is to automatically apply self = xr.decode_cf(self) if the input dataset contains cftime coordinates? Or we could just throw an error.

@AndrewILWilliams
Contributor Author

Also, the contributing docs have been super clear so far! Thanks! :)

@dcherian
Contributor

is_scalar(chunks) might be the appropriate condition. is_scalar is already imported from .utils in dataset.py

This seems to work fine in a lot of cases, except automatic chunking isn't implemented for object dtypes at the moment, so it fails if you pass a cftime coordinate, for example.

Can we catch this error and re-raise specifying "automatic chunking fails for object arrays. These include cftime DataArrays" or something like that?

@AndrewILWilliams
Contributor Author

AndrewILWilliams commented May 14, 2020

Nice, that's neater! Would this work, in the maybe_chunk() call? Sorry about the basic questions!

def maybe_chunk(name, var, chunks):
    chunks = selkeys(chunks, var.dims)
    if not chunks:
        chunks = None
    if var.ndim > 0:
        # when rechunking by different amounts, make sure dask names change
        # by providing chunks as an input to tokenize.
        # subtle bugs result otherwise. see GH3350
        token2 = tokenize(name, token if token else var._data, chunks)
        name2 = f"{name_prefix}{name}-{token2}"
        try:
            return var.chunk(chunks, name=name2, lock=lock)
        except NotImplementedError as err:
            raise Exception(
                "Automatic chunking fails for object arrays. "
                "These include cftime DataArrays."
            )
    else:
        return var

@shoyer
Member

shoyer commented May 14, 2020

The error message from dask is already pretty descriptive:
NotImplementedError: Can not use auto rechunking with object dtype. We are unable to estimate the size in bytes of object data

I don't think we have much to add on top of that?

@AndrewILWilliams
Contributor Author

I thought that too; after the dask error message it's pretty easy to look at the dataset and check which dimension is the problem.

In general though, is that the type of layout you'd suggest for catching and re-raising errors? Using raise Exception()?

@shoyer
Member

shoyer commented May 14, 2020

If we think we can improve an error message by adding additional context, the right solution is to use raise Exception(...) from original_error:
https://stackoverflow.com/a/16414892/809705

On the other hand, if xarray doesn't have anything more to add on top of the original error message, it is best not to add any wrapper at all. Users will just see the original error from dask.
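A minimal, self-contained illustration of that raise ... from pattern (the function name maybe_chunk_auto is hypothetical, standing in for the dask call that rejects object dtypes; this is not xarray code):

```python
def maybe_chunk_auto():
    """Hypothetical wrapper showing exception chaining with `raise ... from`."""
    try:
        # Stand-in for the dask call that rejects object-dtype arrays.
        raise NotImplementedError(
            "Can not use auto rechunking with object dtype."
        )
    except NotImplementedError as err:
        # Chaining with `from err` preserves the original traceback,
        # so the dask error still appears under "The above exception
        # was the direct cause of ...".
        raise ValueError(
            "automatic chunking failed for an object-dtype dimension"
        ) from err

try:
    maybe_chunk_auto()
except ValueError as e:
    # __cause__ holds the original chained exception.
    print(type(e.__cause__).__name__)  # NotImplementedError
```

The trade-off shoyer describes is whether the outer message adds context beyond what the chained original already says; if not, omitting the wrapper keeps the traceback shorter.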
