
Automatic chunking of arrays ? #4055

Closed
AndrewILWilliams opened this issue May 13, 2020 · 11 comments · Fixed by #4064


@AndrewILWilliams
Contributor

Hi there,

Hopefully this turns out to be a basic issue, but I was wondering why the chunks='auto' option that dask provides (https://docs.dask.org/en/latest/array-chunks.html#automatic-chunking) isn't available in xarray? I'm not 100% sure how dask decides how to automatically chunk its arrays, so maybe there's a compatibility issue?

I get the impression that the dask method automatically tries to prevent the issues of "too many chunks" or "too few chunks" which can sometimes happen when choosing chunk sizes automatically. If so, it would maybe be a useful thing to include in future versions?

Happy to be corrected if I've misunderstood something here, though; I'm still getting my head around how the dask/xarray compatibility really works...

Cheers!
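For background, the core idea behind dask's chunks="auto" is to pick chunk sizes that target a fixed number of bytes per chunk, avoiding both "too many chunks" and "too few chunks". The sketch below is only a toy illustration of that idea, not dask's actual algorithm (which also respects existing chunk structure and a configurable target size):

```python
def auto_chunk(shape, itemsize, target_bytes=128 * 2**20):
    """Toy auto-chunking heuristic: pick a per-axis chunk length so that
    each chunk holds roughly target_bytes of data (default 128 MiB).

    This is an illustration of the principle only; dask's real
    chunks="auto" logic is more sophisticated.
    """
    ndim = len(shape)
    # Side length of an ndim-cube containing target_bytes worth of elements.
    side = int((target_bytes / itemsize) ** (1 / ndim))
    # Never exceed the axis length, never go below one element.
    return tuple(min(dim, max(1, side)) for dim in shape)

# A 100k x 100k float64 array: aim for ~128 MiB chunks.
print(auto_chunk((100_000, 100_000), itemsize=8))  # (4096, 4096)
```

Each (4096, 4096) float64 chunk here is exactly 4096 * 4096 * 8 bytes = 128 MiB, which is the kind of balance the dask docs describe.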

@dcherian
Contributor

dcherian commented May 13, 2020

So da.chunk({dim_name: "auto"}) works but da.chunk("auto") does not. The latter is a relatively easy fix. We just need to update the condition here:

xarray/xarray/core/dataset.py

Lines 1736 to 1737 in bd84186

if isinstance(chunks, Number):
chunks = dict.fromkeys(self.dims, chunks)

A PR would be very welcome if you have the time, @AndrewWilliams3142
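For reference, dict.fromkeys is what expands a scalar chunk spec to every dimension in that branch, so accepting the string "auto" there would broadcast it the same way. A minimal illustration (plain Python, not xarray's code):

```python
dims = ("time", "lat", "lon")

# A numeric scalar chunk size is broadcast to every dimension.
print(dict.fromkeys(dims, 1000))
# {'time': 1000, 'lat': 1000, 'lon': 1000}

# The same mechanism works unchanged for the string "auto",
# producing the per-dimension form that dask already accepts.
print(dict.fromkeys(dims, "auto"))
# {'time': 'auto', 'lat': 'auto', 'lon': 'auto'}
```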

@AndrewILWilliams
Contributor Author

Oh ok, I didn't know about this. I'll take a look and read the contribution docs tomorrow! It'll be my first PR, so I may need a bit of hand-holding when it comes to tests. Willing to try though!

@dcherian
Contributor

Awesome! Please see https://xarray.pydata.org/en/stable/contributing.html for docs on contributing.

@shoyer
Member

shoyer commented May 14, 2020

Agreed, this would be very welcome!

chunks='auto' isn't supported only because xarray support for dask predates it :)

@AndrewILWilliams
Contributor Author

Cheers! Just had a look. Is it as simple as changing this line to the following, @dcherian?

if isinstance(chunks, Number) or chunks == "auto":
    chunks = dict.fromkeys(self.dims, chunks)

This seems to work fine in a lot of cases, except automatic chunking isn't implemented for object dtypes at the moment, so it fails if you pass a cftime coordinate, for example.

One option is to automatically apply self = xr.decode_cf(self) if the input dataset contains cftime coordinates? Or we could just throw an error.

@AndrewILWilliams
Contributor Author

Also, the contributing docs have been super clear so far! Thanks! :)

@dcherian
Contributor

is_scalar(chunks) might be the appropriate condition. is_scalar is already imported from .utils in dataset.py

This seems to work fine in a lot of cases, except automatic chunking isn't implemented for object dtypes at the moment, so it fails if you pass a cftime coordinate, for example.

Can we catch this error and re-raise specifying "automatic chunking fails for object arrays. These include cftime DataArrays" or something like that?

@AndrewILWilliams
Contributor Author

AndrewILWilliams commented May 14, 2020

Nice, that's neater! Would this work, in the maybe_chunk() call? Sorry about the basic questions!

def maybe_chunk(name, var, chunks):
    chunks = selkeys(chunks, var.dims)
    if not chunks:
        chunks = None
    if var.ndim > 0:
        # when rechunking by different amounts, make sure dask names change
        # by providing chunks as an input to tokenize.
        # subtle bugs result otherwise. see GH3350
        token2 = tokenize(name, token if token else var._data, chunks)
        name2 = f"{name_prefix}{name}-{token2}"
        try:
            return var.chunk(chunks, name=name2, lock=lock)
        except NotImplementedError as err:
            raise Exception(
                "Automatic chunking fails for object arrays. "
                "These include cftime DataArrays."
            )
    else:
        return var

@shoyer
Member

shoyer commented May 14, 2020

The error message from dask is already pretty descriptive:
NotImplementedError: Can not use auto rechunking with object dtype. We are unable to estimate the size in bytes of object data

I don't think we have much to add on top of that?

@AndrewILWilliams
Contributor Author

I thought that too; after the dask error message it's pretty easy to look at the dataset and check which dimension is the problem.

In general though, is that the type of layout you'd suggest for catching and re-raising errors? Using raise Exception()?

@shoyer
Member

shoyer commented May 14, 2020

If we think we can improve an error message by adding additional context, the right solution is to use raise Exception(...) from original_error:
https://stackoverflow.com/a/16414892/809705

On the other hand, if xarray doesn't have anything more to add on top of the original error message, it is best not to add any wrapper at all. Users will just see the original error from dask.
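A minimal, self-contained illustration of that raise ... from pattern (the function name maybe_chunk_auto is hypothetical, standing in for the dask call that rejects object dtypes; this is not xarray code):

```python
def maybe_chunk_auto():
    """Hypothetical wrapper showing exception chaining with `raise ... from`."""
    try:
        # Stand-in for the dask call that rejects object-dtype arrays.
        raise NotImplementedError(
            "Can not use auto rechunking with object dtype."
        )
    except NotImplementedError as err:
        # Chaining with `from err` preserves the original traceback,
        # so the dask error still appears under "The above exception
        # was the direct cause of ...".
        raise ValueError(
            "automatic chunking failed for an object-dtype dimension"
        ) from err

try:
    maybe_chunk_auto()
except ValueError as e:
    # __cause__ holds the original chained exception.
    print(type(e.__cause__).__name__)  # NotImplementedError
```

The trade-off shoyer describes is whether the outer message adds context beyond what the chained original already says; if not, omitting the wrapper keeps the traceback shorter.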
