
Manually specify chunks in open_zarr #2530

Merged
rabernat merged 31 commits into pydata:master on Apr 18, 2019

Conversation

lilyminium
Contributor

This adds a chunks parameter that is analogous to Dataset.chunk. auto_chunk is kept for backwards compatibility and is equivalent to chunks='auto'. It seems reasonable that anyone manually specifying chunks may want to rewrite the dataset in those chunks, and the error that arises when the encoded Zarr chunks do not match the variable's Dask chunks can quickly get annoying. overwrite_encoded_chunks=True sets the encoded chunks to None so there is no clash.
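A rough sketch of the intended usage (the store path and chunk sizes here are placeholders, not taken from this PR):

import xarray as xr

# Open with explicit chunks, analogous to Dataset.chunk
ds = xr.open_zarr(
    'example_store.zarr',            # hypothetical store path
    chunks={'time': 100},            # manual chunking along 'time'
    overwrite_encoded_chunks=True,   # drop the encoded chunks so they cannot clash on a later write
)

# Previous behaviour, still available:
ds_auto = xr.open_zarr('example_store.zarr', chunks='auto')  # same as auto_chunk=True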

@pep8speaks

pep8speaks commented Oct 31, 2018

Hello @lilyminium! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-12 01:38:10 UTC

@rabernat
Contributor

This PR seems neglected. Sorry for that! Hope to find time to review it in the next few days.

@dcherian dcherian requested a review from rabernat January 8, 2019 21:55
Contributor

@rabernat rabernat left a comment

This looks really good to me. I'm just concerned about possible performance issues with poorly specified chunks.

Also, branch needs to be rebased.

if chunks == 'auto':
    chunks = var.encoding.get('chunks')
else:
    chunks = selkeys(chunks, var.dims)
Contributor

I would consider issuing a warning if dask chunks overlap with encoding chunks in a suboptimal way? For example, if the zarr data is chunked along axis 0 and the user specifies chunks along axis 1, this will lead to highly degraded performance.

Pinging @mrocklin or @jcrist for suggestions on how to detect this sort of case.
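A possible shape for such a check (not what this PR implements; warn_on_misaligned_chunks and its arguments are hypothetical):

import warnings

def warn_on_misaligned_chunks(name, zarr_chunks, dask_chunks):
    # zarr_chunks: chunk shape from the zarr metadata, e.g. (100, 10)
    # dask_chunks: dask's per-axis chunk tuples, e.g. ((100, 100), (5, 5))
    for axis, (zc, dcs) in enumerate(zip(zarr_chunks, dask_chunks)):
        # If a requested chunk is not a multiple of the stored chunk size,
        # each dask task has to read parts of several zarr chunks.
        if any(c % zc for c in dcs[:-1]):
            warnings.warn(
                f"Chunks for {name!r} along axis {axis} do not align with "
                f"the encoded zarr chunk size {zc}; reads may be slow."
            )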

Contributor

No particular thoughts from me. You would have to look at the .chunks attribute of the dask array and compare it to the chunking of the zarr data. You might also consider erroring or automatically rechunking if they don't align well.

If you go this route then you might want to look at dask.array.core.normalize_chunks, setting previous_chunks to the zarr array chunks and limit to some nice byte size.

normalize_chunks('auto', shape=..., limit='100MiB', previous_chunks=zarr_dataset.chunks)

I haven't actually tried that, and it's been a while since I dealt with the auto-rechunking code, so no guarantees on the above. I encourage others to investigate it as an option.
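An untested sketch of that suggestion (the shape, dtype, and chunk sizes below are made up):

from dask.array.core import normalize_chunks

zarr_chunks = (100, 100)         # chunk shape stored in the zarr metadata
shape = (10000, 2000)            # full shape of the on-disk array

chunks = normalize_chunks(
    'auto',
    shape=shape,
    limit='100MiB',               # rough upper bound per dask chunk
    dtype='float64',              # needed so 'auto' can translate the byte limit
    previous_chunks=zarr_chunks,  # keeps the result aligned with the zarr chunks
)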

Contributor

@jcrist might have different thoughts though. You could also ask @jakirkham, who probably knows more here.

Contributor Author

That does sound interesting. A warning sounds better than automatic re-chunking, which could be frustrating, especially if the data is small enough to re-chunk with relatively little issue. I'm at a summer school with no signal and little wifi, so I'll look at this when I get back!

Contributor Author

Any thoughts on the warning?

Member

@shoyer shoyer left a comment

I missed this earlier, but I think this is basically ready to merge?

lilyminium and others added 6 commits April 4, 2019 12:44
* Various fixes for explicit Dataset.indexes

Fixes GH2856

I've added internal consistency checks to the uses of ``assert_equal`` in our
test suite, so this shouldn't happen again.

* Fix indexes in Dataset.interp
@rabernat
Contributor

rabernat commented Apr 9, 2019

Hi @lilyminium - I appreciate your patience with this. I think we are almost there!

We have a few failing tests. Zarr is raising a ValueError: missing object_codec for object array. This is pretty old, so I don't think it's due to an upstream zarr change.

Do you have any idea what might be going on here? I haven't had time to dig deep into it.
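For context, zarr raises that error whenever an object-dtype array is created without an object_codec; a minimal, untested reproduction (unrelated to this branch, assuming zarr 2.x) would be:

import zarr

# Without an object_codec (e.g. numcodecs.VLenUTF8), zarr cannot encode
# object-dtype data and raises "ValueError: missing object_codec for object array".
arr = zarr.empty(3, dtype=object)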

@lilyminium
Contributor Author

> Hi @lilyminium - I appreciate your patience with this. I think we are almost there!
>
> We have a few failing tests. Zarr is raising a ValueError: missing object_codec for object array. This is pretty old, so I don't think it's due to an upstream zarr change.
>
> Do you have any idea what might be going on here? I haven't had time to dig deep into it.

Whoops, I think I was looking at the unicode dtype and forgot which branch I was in, then ran tests in the wrong environment... hopefully fixed now.

@rabernat
Contributor

LGTM! Thanks for all your work @lilyminium!

Now that all tests are green, I'll leave this open for another day in case anyone else has comments and then merge.

@rabernat rabernat merged commit baf81b4 into pydata:master Apr 18, 2019
@dcherian
Contributor

Thanks @lilyminium!

dcherian added a commit to yohai/xarray that referenced this pull request Apr 19, 2019
* master: (29 commits)
  Handle the character array dim name  (pydata#2896)
  Partial fix for pydata#2841 to improve formatting. (pydata#2906)
  docs: Move quick overview one level up (pydata#2890)
  Manually specify chunks in open_zarr (pydata#2530)
  Minor improvement of docstring for Dataset (pydata#2904)
  Fix minor typos in docstrings (pydata#2903)
  Added docs example for `xarray.Dataset.get()` (pydata#2894)
  Bugfix for docs build instructions (pydata#2897)
  Return correct count for scalar datetime64 arrays (pydata#2892)
  Indexing with an empty array (pydata#2883)
  BUG: Fix pydata#2864 by adding the missing vrt parameters (pydata#2865)
  Reduce length of cftime resample tests (pydata#2879)
  WIP: type annotations (pydata#2877)
  decreased pytest verbosity (pydata#2881)
  Fix mypy typing error in cftime_offsets.py (pydata#2878)
  update links to https (pydata#2872)
  revert to 0.12.2 dev
  0.12.1 release
  Various fixes for explicit Dataset.indexes (pydata#2858)
  Fix minor typo in docstring (pydata#2860)
  ...