
Manually specify chunks in open_zarr #2530

Merged
rabernat merged 31 commits into pydata:master on Apr 18, 2019

Conversation

lilyminium
Contributor

This adds a chunks parameter that is analogous to Dataset.chunk. auto_chunk is kept for backwards compatibility and is equivalent to chunks='auto'. It seems reasonable that anyone manually specifying chunks may want to rewrite the dataset in those chunks, and the error that arises when the encoded Zarr chunks do not match the variable's Dask chunks can quickly get annoying. overwrite_encoded_chunks=True sets the encoded chunks to None so there is no clash.
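A rough sketch of the intended usage (the store path and chunk sizes here are placeholders, not taken from this PR):

import xarray as xr

# Open with explicit chunks, analogous to Dataset.chunk
ds = xr.open_zarr(
    'example_store.zarr',            # hypothetical store path
    chunks={'time': 100},            # manual chunking along 'time'
    overwrite_encoded_chunks=True,   # drop the encoded chunks so they cannot clash on a later write
)

# Previous behaviour, still available:
ds_auto = xr.open_zarr('example_store.zarr', chunks='auto')  # same as auto_chunk=True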

@pep8speaks

pep8speaks commented Oct 31, 2018

Hello @lilyminium! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-12 01:38:10 UTC

@rabernat
Contributor

This PR seems neglected. Sorry for that! Hope to find time to review it in the next few days.

@dcherian dcherian requested a review from rabernat January 8, 2019 21:55
Contributor

@rabernat rabernat left a comment

This looks really good to me. I'm just concerned about possible performance issues with poorly specified chunks.

Also, branch needs to be rebased.

if chunks == 'auto':
    chunks = var.encoding.get('chunks')
else:
    chunks = selkeys(chunks, var.dims)
Contributor

I would consider issuing a warning if dask chunks overlap with encoding chunks in a suboptimal way? For example, if the zarr data is chunked along axis 0 and the user specifies chunks along axis 1, this will lead to highly degraded performance.

Pinging @mrocklin or @jcrist for suggestions on how to detect this sort of case.
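A possible shape for such a check (not what this PR implements; warn_on_misaligned_chunks and its arguments are hypothetical):

import warnings

def warn_on_misaligned_chunks(name, zarr_chunks, dask_chunks):
    # zarr_chunks: chunk shape from the zarr metadata, e.g. (100, 10)
    # dask_chunks: dask's per-axis chunk tuples, e.g. ((100, 100), (5, 5))
    for axis, (zc, dcs) in enumerate(zip(zarr_chunks, dask_chunks)):
        # If a requested chunk is not a multiple of the stored chunk size,
        # each dask task has to read parts of several zarr chunks.
        if any(c % zc for c in dcs[:-1]):
            warnings.warn(
                f"Chunks for {name!r} along axis {axis} do not align with "
                f"the encoded zarr chunk size {zc}; reads may be slow."
            )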

Contributor

No particular thoughts from me. You would have to look at the .chunks attribute of the dask array and compare it to the chunking of the zarr data. You might also consider erroring or automatically rechunking if they don't align well.

If you go this route then you might want to look at dask.array.core.normalize_chunks, setting previous_chunks to the zarr array chunks and limit to some nice byte size.

normalize_chunks('auto', shape=..., limit='100MiB', previous_chunks=zarr_dataset.chunks)

I haven't actually tried that, and it's been a while since I dealt with the auto-rechunking code, so no guarantees on the above. I encourage others to investigate it as an option.
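An untested sketch of that suggestion (the shape, dtype, and chunk sizes below are made up):

from dask.array.core import normalize_chunks

zarr_chunks = (100, 100)         # chunk shape stored in the zarr metadata
shape = (10000, 2000)            # full shape of the on-disk array

chunks = normalize_chunks(
    'auto',
    shape=shape,
    limit='100MiB',               # rough upper bound per dask chunk
    dtype='float64',              # needed so 'auto' can translate the byte limit
    previous_chunks=zarr_chunks,  # keeps the result aligned with the zarr chunks
)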

Contributor

@jcrist might have different thoughts though. You could also ask @jakirkham, who probably knows more here.

Contributor Author

That does sound interesting. A warning sounds better than automatic re-chunking, which could be frustrating, especially if the data is small enough to re-chunk with relatively little issue. I'm at a summer school with no signal and little wifi, so I'll look at this when I get back!

Contributor Author

Any thoughts on the warning?

Member

@shoyer shoyer left a comment

I missed this earlier, but I think this is basically ready to merge?

lilyminium and others added 6 commits April 4, 2019 12:44
* Various fixes for explicit Dataset.indexes

Fixes GH2856

I've added internal consistency checks to the uses of ``assert_equal`` in our
test suite, so this shouldn't happen again.

* Fix indexes in Dataset.interp
@rabernat
Contributor

rabernat commented Apr 9, 2019

Hi @lilyminium - I appreciate your patience with this. I think we are almost there!

We have a few failing tests. Zarr is raising a ValueError: missing object_codec for object array. This is pretty old, so I don't think it's due to an upstream zarr change.

Do you have any idea what might be going on here? I haven't had time to dig deep into it.
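For context, zarr raises that error whenever an object-dtype array is created without an object_codec; a minimal, untested reproduction (unrelated to this branch, assuming zarr 2.x) would be:

import zarr

# Without an object_codec (e.g. numcodecs.VLenUTF8), zarr cannot encode
# object-dtype data and raises "ValueError: missing object_codec for object array".
arr = zarr.empty(3, dtype=object)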

@lilyminium
Contributor Author

> Hi @lilyminium - I appreciate your patience with this. I think we are almost there!
>
> We have a few failing tests. Zarr is raising a ValueError: missing object_codec for object array. This is pretty old, so I don't think it's due to an upstream zarr change.
>
> Do you have any idea what might be going on here? I haven't had time to dig deep into it.

Whoops, I think I was looking at the unicode dtype and forgot which branch I was in, then ran tests in the wrong environment... hopefully fixed now.

@rabernat
Contributor

LGTM! Thanks for all your work @lilyminium!

Now that all tests are green, I'll leave this open for another day in case anyone else has comments and then merge.

@rabernat rabernat merged commit baf81b4 into pydata:master Apr 18, 2019
@dcherian
Contributor

Thanks @lilyminium!

dcherian added a commit to yohai/xarray that referenced this pull request Apr 19, 2019
* master: (29 commits)
  Handle the character array dim name  (pydata#2896)
  Partial fix for pydata#2841 to improve formatting. (pydata#2906)
  docs: Move quick overview one level up (pydata#2890)
  Manually specify chunks in open_zarr (pydata#2530)
  Minor improvement of docstring for Dataset (pydata#2904)
  Fix minor typos in docstrings (pydata#2903)
  Added docs example for `xarray.Dataset.get()` (pydata#2894)
  Bugfix for docs build instructions (pydata#2897)
  Return correct count for scalar datetime64 arrays (pydata#2892)
  Indexing with an empty array (pydata#2883)
  BUG: Fix pydata#2864 by adding the missing vrt parameters (pydata#2865)
  Reduce length of cftime resample tests (pydata#2879)
  WIP: type annotations (pydata#2877)
  decreased pytest verbosity (pydata#2881)
  Fix mypy typing error in cftime_offsets.py (pydata#2878)
  update links to https (pydata#2872)
  revert to 0.12.2 dev
  0.12.1 release
  Various fixes for explicit Dataset.indexes (pydata#2858)
  Fix minor typo in docstring (pydata#2860)
  ...