Manually specify chunks in open_zarr #2530
Conversation
Hello @lilyminium! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-04-12 01:38:10 UTC
This PR seems neglected. Sorry for that! Hope to find time to review it in the next few days.
This looks really good to me. I'm just concerned about possible performance issues with poorly specified chunks.
Also, branch needs to be rebased.
In `xarray/backends/zarr.py` (outdated diff):

```python
if chunks == 'auto':
    chunks = var.encoding.get('chunks')
else:
    chunks = selkeys(chunks, var.dims)
```
I would consider issuing a warning if dask chunks overlap with encoding chunks in a suboptimal way? For example, if the zarr data is chunked along axis 0 and the user specifies chunks along axis 1, this will lead to highly degraded performance.
Pinging @mrocklin or @jcrist for suggestions on how to detect this sort of case.
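A minimal sketch of what such a check might look like (the helper name and signature here are hypothetical, not code from this PR):

```python
import warnings

def warn_on_misaligned_chunks(dims, dask_chunks, zarr_chunks):
    """Hypothetical helper: warn when the requested dask chunks do not
    line up with the chunks the zarr store uses on disk."""
    for dim, dchunks, zchunk in zip(dims, dask_chunks, zarr_chunks):
        # Every dask chunk except possibly the last along a dimension
        # should be a multiple of the on-disk chunk size; otherwise a
        # single zarr chunk gets read by more than one dask task.
        if any(c % zchunk for c in dchunks[:-1]):
            warnings.warn(
                f"Specified chunks along dimension {dim!r} do not align "
                f"with the stored zarr chunk size {zchunk}; reads may be "
                "significantly slower."
            )
```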
No particular thoughts from me. You would have to look at the `.chunks` attribute of the dask array and compare it to the chunking of the zarr data. You might also consider erroring or automatically rechunking if they don't align well.
If you go this route then you might want to look at `dask.array.core.normalize_chunks`, setting `previous_chunks` to the zarr array chunks and `limit` to some nice byte size:
`normalize_chunks('auto', shape=..., limit='100MiB', previous_chunks=zarr_dataset.chunks)`
I haven't actually tried that, and it's been a while since I dealt with the auto-rechunking code, so no guarantees on the above. I encourage others to investigate it as an option.
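For reference, a small runnable sketch of that suggestion (the shape and chunk sizes are made up; `normalize_chunks` needs a `dtype` to resolve `'auto'` against a byte `limit`):

```python
from dask.array.core import normalize_chunks

# Made-up example: a 10000x10000 float64 array stored in 100x100 zarr chunks.
shape = (10_000, 10_000)
zarr_chunks = (100, 100)

# Let dask pick chunks close to 100 MiB that align with the stored chunks.
chunks = normalize_chunks(
    'auto',
    shape=shape,
    limit='100MiB',
    dtype='float64',
    previous_chunks=zarr_chunks,
)
print(chunks)  # per-axis chunk sizes, each a multiple of the stored chunk size
```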
@jcrist might have different thoughts though. You could also ask @jakirkham, who probably knows more here.
That does sound interesting. A warning sounds better than automatic re-chunking, which could be frustrating, especially if the data is small enough to re-chunk with relatively little issue. I'm at a summer school with no signal and little wifi, so I'll look at this when I get back!
Any thoughts on the warning?
I missed this earlier, but I think this is basically ready to merge?
Hi @lilyminium - I appreciate your patience with this. I think we are almost there! We have a few failing tests: Zarr is raising an error. Do you have any idea what might be going on here? I haven't had time to dig deep into it.
Whoops, I think I was looking at the unicode dtype and forgot which branch I was in, then ran tests in the wrong environment... hopefully fixed now.
LGTM! Thanks for all your work @lilyminium! Now that all tests are green, I'll leave this open for another day in case anyone else has comments and then merge.
Thanks @lilyminium!
This adds a `chunks` parameter that is analogous to `Dataset.chunk`. `auto_chunk` is kept for backwards compatibility and is equivalent to `chunks='auto'`. It seems reasonable that anyone manually specifying chunks may want to rewrite the dataset in those chunks, and the error that arises when the encoded Zarr chunks mismatch the variable's Dask chunks may quickly get annoying. `overwrite_encoded_chunks=True` sets the encoded chunks to None so there is no clash.
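A usage sketch of the resulting API (the store paths are placeholders):

```python
import xarray as xr

# Open with manually specified chunks, analogous to Dataset.chunk.
ds = xr.open_zarr(
    "input.zarr",                   # placeholder path
    chunks={"time": 100},           # per-dimension dask chunks
    overwrite_encoded_chunks=True,  # set the encoded chunks to None
)

# Because the encoded chunks were cleared, writing back out uses the
# dask chunks instead of raising a chunk-mismatch error.
ds.to_zarr("output.zarr", mode="w")
```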