Xarray open_mfdataset with engine Zarr #2

Closed
wants to merge 72 commits

Conversation

@weiji14 commented Jun 29, 2020

Continuation of work on pydata#4003.

  • Closes #xxxx
  • Tests added
  • Passes isort -rc . && black . && mypy . && flake8
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

raybellwaves and others added 30 commits April 23, 2020 00:58
* Avoid multiplication DeprecationWarning in rasterio backend

* full_like: error on non-scalar fill_value

Fixes pydata#3977

* Added test

* Updated what's new

* core.utils.is_scalar instead of numpy.is_scalar

* More informative error message

* raises_regex for error test
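For illustration (not part of the commits themselves), a minimal sketch of the new `full_like` behavior:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((2, 2)), dims=("x", "y"))

xr.full_like(da, fill_value=1)  # fine: scalar fill_value

# After this fix, a non-scalar fill_value raises a ValueError
# instead of silently producing a broken result:
xr.full_like(da, fill_value=[1, 2])
```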
* Fix handling of abbreviated units like msec

By default, xarray tries to decode times with pandas and falls back to
cftime. This fixes the exception handler to fall back properly when an
unhandled abbreviated unit is passed in.

* Add what's new entry
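A minimal sketch of the case fixed above (assuming cftime is installed; the dataset is made up for illustration):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    coords={"time": ("time", np.arange(3), {"units": "msec since 2000-01-01"})}
)

# pandas cannot parse the abbreviated "msec" unit; with this fix the
# exception handler falls back to cftime instead of raising.
decoded = xr.decode_cf(ds)
```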
* ensure Variable._repr_html_ works

* added PR 3972 to Bug fixes

* better attribute access

* moved Variable._repr_html_ test to a better location

Co-authored-by: Stephan Hoyer <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
* replace tabs with spaces

* fix some invalid code

* add missing prompts

* apply blackdoc

* reformat the plotting docs code

* whats-new.rst entry
* remove xfail marks from median and cumprod

* remove all xfails not related to indexes or external packages

* switch away from using assert_equal_with_units

* use assert_allclose in a few cases instead

* don't use a kwarg for searchsorted

normally this should work, but the documentation doesn't match the
implementation of searchsorted and names the argument `keys` instead of `v`

* move the tests for item into their own test function

* move the searchsorted tests into their own test function

* remove a wrapping pytest.param

* treat objects implementing __array_function__ the same as ndarray

* mark numpy.median as xfailing

* remove the xfail marks for the all and any tests

* use assert_units_equal to check the resulting units

* don't attempt to use interpolate_na with int dtype arrays

* update the xfail reason for DataArray.interpolate_na

* xfail the compatible units bivariate_ufunc test and don't use 0

* combine and expand the reindex and interp tests

* combine and expand the reindex_like and interp_like tests

* xfail the quantile tests if pint is not recent enough

* xfail the rolling tests

* don't xfail combine_first

it currently does not test indexing, so it will probably need a new test
for that.

* use numpy's assert_allclose

* don't add dimension coordinates if they're not necessary

* add the PR to the list of related PRs

* move the whats-new.rst entry to 0.16.0

* check for __array_ufunc__ to decide if the type is supported

* xfail the bivariate ufunc tests

* remove the check for __array_ufunc__

* skip the DataArray.identical tests

* use pytest.param
* chore: Remove unnecessary comprehension

* Update whats-new.rst
…ydata#4029)

* Support overriding existing variables in to_zarr() without appending

This should be useful for cases where users want to update values in existing
Zarr datasets.

* Update docstring for to_zarr
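A hedged sketch of the intended usage (the store path is hypothetical):

```python
import xarray as xr

ds = xr.Dataset({"temperature": ("x", [10.0, 11.0, 12.0])})
ds.to_zarr("example.zarr", mode="w")

# Overwrite the existing variable in place, without specifying
# append_dim, to update values in an existing Zarr store:
ds2 = xr.Dataset({"temperature": ("x", [20.0, 21.0, 22.0])})
ds2.to_zarr("example.zarr", mode="a")
```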
It looks like this is triggered by the new cartopy version now being installed
on RTD (version 0.17.0 -> 0.18.0).

Long term we should fix this, but for now it's better just to disable the
warning.

Here's the message from RTD:
```
Exception occurred:
  File "/home/docs/checkouts/readthedocs.org/user_builds/xray/conda/latest/lib/python3.8/site-packages/IPython/sphinxext/ipython_directive.py", line 586, in process_input
    raise RuntimeError('Non Expected warning in `{}` line {}'.format(filename, lineno))
RuntimeError: Non Expected warning in `/home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/latest/doc/plotting.rst` line 732
The full traceback has been saved in /tmp/sphinx-err-qav6jjmm.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!

>>>-------------------------------------------------------------------------
Warning in /home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/latest/doc/plotting.rst at block ending on line 732
Specify :okwarning: as an option in the ipython:: block to suppress this message
----------------------------------------------------------------------------
/home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/latest/xarray/plot/facetgrid.py:373: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.
  self.fig.tight_layout()
<<<-------------------------------------------------------------------------
```
https://readthedocs.org/projects/xray/builds/10969146/
* Remove broken test for Panel with to_pandas()

We don't support creating a Panel with to_pandas() with *any* version of
pandas at present, so this test was previously broken if pandas < 0.25 was
installed.

* remove unused import

* Fixup LooseVersion import
* transpose coords by default

* whatsnew

* Update doc/whats-new.rst

Co-authored-by: crusaderky <[email protected]>

* Update whats-new.rst

Co-authored-by: crusaderky <[email protected]>
* Allow providing template dataset to map_blocks.

* Update dimension shape check.

This accounts for dimension sizes being changed by the applied function.

* Allow user function to add new unindexed dimension.

* Add docstring for template.

* renaming

* Raise nice error if adding a new chunked dimension

* Raise nice error message when expected dimension is missing on returned object

* Revert "Allow user function to add new unindexed dimension."

This reverts commit 045ae2b.

* Add test + fix output_chunks for dataarray template

* typing

* fix test

* Add nice error messages when result doesn't match template.

* blacken

* Add template kwarg to DataArray.map_blocks & Dataset.map_blocks

* minor error message fixes.

* docstring updates.

* bugfix for expected shapes when template is not specified

* Add map_blocks docs.

* Update doc/dask.rst

Co-Authored-By: Joe Hamman <[email protected]>

* refactor out slicer for chunks

* Check expected index values.

* Raise nice error when template object does not have required number of chunks

* doc updates.

* more review comments.

* Mention that attrs are taken from template.

* Add test and explicit point out that attrs is copied from template

Co-authored-by: Joe Hamman <[email protected]>
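For illustration, a minimal sketch of the template keyword described above (assuming dask is installed; the function is a stand-in):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(10.0), dims="x").chunk({"x": 5})

# The template tells map_blocks the dims, shape, and chunks of the
# result, so xarray doesn't have to infer them; attrs are copied
# from the template as noted above.
result = xr.map_blocks(lambda block: block * 2, da, template=da)
result.compute()
```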
…ture (pydata#4038)

* Use literal syntax instead of function calls to create the data structure

* Update whats-new.rst

* Update whats-new.rst
* support darkmode but in vscode only

* remove unused space

* support colab (maybe) and whatsnew
* FIX: correct dask array handling in _calc_idxminmax

* FIX: remove unneeded import, reformat via black

* fix idxmax, idxmin with dask arrays

* FIX: use array[dim].data in `_calc_idxminmax` as per @keewis suggestion, attach dim name to result

* ADD: add dask tests to `idxmin`/`idxmax` dataarray tests

* FIX: add back fixture line removed by accident

* ADD: complete dask handling in `idxmin`/`idxmax` tests in test_dataarray, xfail dask tests for dtype datetime64 (M)

* ADD: add "support dask handling for idxmin/idxmax" in whats-new.rst

* MIN: reintroduce changes added by pydata#3953

* MIN: change if-clause to use `and` instead of `&` as per review-comment

* MIN: change if-clause to use `and` instead of `&` as per review-comment

* WIP: remove dask handling entirely for debugging purposes

* Test for dask computes

* WIP: re-add dask handling (map_blocks-approach), add `with raise_if_dask_computes()` context to idxmin-tests

* Use dask indexing instead of map_blocks.

* Better chunk choice.

* Return -1 for _nan_argminmax_object if all NaNs along dim

* Revert "Return -1 for _nan_argminmax_object if all NaNs along dim"

This reverts commit 58901b9.

* Raise error for object arrays

* No error for object arrays. Instead expect 1 compute in tests.

Co-authored-by: dcherian <[email protected]>
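A short sketch of the behaviour these commits enable (data made up; requires dask):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.random.rand(4, 6),
    dims=("y", "x"),
    coords={"x": np.linspace(0.0, 1.0, 6)},
).chunk({"y": 2})

# idxmax now works lazily on dask-backed arrays instead of erroring:
idx = da.idxmax(dim="x")
idx.compute()
```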
* rename d and l to dim and length
* add decode_timedelta kwarg in decode_cf and open_* functions and test.

* Fix style issue

* Add chang author reference

* removed check decode_timedelta in open_dataset

* fix docstring indentation

* fix: force dtype in test decode_timedelta
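A hedged sketch of the new keyword (the file name is hypothetical):

```python
import xarray as xr

# Keep timedelta-like variables (e.g. units of "days") as raw numbers
# instead of decoding them to np.timedelta64:
ds = xr.open_dataset("file.nc", decode_timedelta=False)

# The same keyword is available on decode_cf:
decoded = xr.decode_cf(ds, decode_timedelta=False)
```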
…ydata#4070)

* remove numpydoc

which is the reason for the backslash-escaped stars

* don't install numpydoc
* document zarr encoding

* link to zarr spec

* fix typo [ci skip]
* add html pre element with text repr as fallback

The PRE element is not displayed when CSS is injected.

When CSS is not injected (untrusted notebook), the PRE element
is shown but not the DIV container used for the HTML repr.

* remove title elements in svg icons

Prevents these from showing when falling back to the plain-text repr.

A title tag is already present in the HTML label elements.

* add basic test

* update what's new
* In netcdf3 backend, also coerce unsigned integer dtypes

* Adjust test for netcdf3 roundtrip to include coercion

This might be a bit too general for what is required at this point,
though ... 🤔

* Add test for failing dtype coercion

* Add What's New entry for issue pydata#4014 and PR pydata#4018

* Move netcdf3-specific test to NetCDF3Only class

Also uses a class variable for definition of netcdf3 formats now.

Co-authored-by: Deepak Cherian <[email protected]>
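As an illustration of the coercion (file name hypothetical; assumes a netCDF3-capable backend such as scipy):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"counts": ("x", np.array([1, 2, 3], dtype="uint16"))})

# netCDF3 has no unsigned integer types; the backend now coerces the
# uint16 data to a wider signed dtype on write instead of erroring:
ds.to_netcdf("counts.nc", format="NETCDF3_64BIT")
```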
* Update dataset.py

* attempt at improving the doc formulation

* update to_zarr docstring

* minor style update

* seems to fix doc compilation locally

* delete saved_on_disk.nc

Co-authored-by: Aurélien Ponte <[email protected]>
* add tests

* weights: bool -> int

* whats new

* Apply suggestions from code review

* avoid unnecessary copy

Co-authored-by: Maximilian Roos <[email protected]>
* allow multiindex levels in plots

* query label for test

* 2D plts adapt err msg

* 1D plts adapt err msg

* add errmsg x==y

* WIP _assert_xy_valid

* _assert_valid_xy

* add 1D example

* update docs

* simplify error msg

* remove '

* Apply suggestions from code review
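A minimal 1D sketch of what these commits allow (assuming matplotlib is installed; data made up):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(4.0),
    dims="x",
    coords={"a": ("x", ["p", "p", "q", "q"]), "b": ("x", [0, 1, 0, 1])},
).set_index(x=["a", "b"])

# "b" is a level of the "x" MultiIndex rather than a dimension
# coordinate; it can now be passed directly as the plot's x axis:
da.plot(x="b")
```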
keewis and others added 27 commits June 12, 2020 15:03
* replace the object array with generator expressions and zip/enumerate

* remove a leftover grouping pair of parentheses

* reuse is_array instead of comparing again
* copy the parameter documentation of Dataset.sel to DataArray.sel

* reflow the return value documentation

* update whats-new.rst
* add a property-like descriptor that works both on objects and classes

* generate documentation for the plotting accessor methods

* add a docstring to the custom property-like descriptor

* use the accessor syntax in the main plotting section

* explain why we need a custom property class

* rename the custom property to UncachedAccessor

to match the behavior of _CachedAccessor, it also accepts the
accessor class (not the object). We lose the ability to set custom
docstrings, though.

* declare that __call__ wraps plot

* add accessor tests

* add the autosummary templates from pandas

* update the plotting section to use the accessor templates

* remove the separate callable section

* fix the import order

* add the DataArray.str accessor as a new subsection

* add the datetime accessor to the main api page

* move the plotting functions into the DataArray / Dataset sections

* remove the documentation of the accessor class itself

* manually copy the docstring since functools.wraps does more than that

* also copy the annotations and mark __call__ as wrapping plot

* re-enable __slots__

* update whats-new.rst

Co-authored-by: Deepak Cherian <[email protected]>
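The descriptor idea, sketched from the commit messages above (the name matches the PR; the docstring and comments are mine):

```python
class UncachedAccessor:
    """Acts like a property, but also works when accessed on the class.

    A plain property returns the property object itself on class access,
    which hides the accessor class from autosummary; returning the
    accessor class instead lets the docs build document its methods.
    """

    def __init__(self, accessor):
        self._accessor = accessor

    def __get__(self, obj, cls):
        if obj is None:
            # class attribute access, e.g. DataArray.plot
            return self._accessor
        # instance attribute access: a fresh (uncached) accessor object
        return self._accessor(obj)
```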
* allow passing a callable as compat to diff_{dataset,array}_repr

* rewrite assert_allclose to provide a failure summary

* make sure we're comparing variables

* remove spurious comments

* override test_aggregate_complex with a test compatible with pint

* expect the asserts to raise

* xfail the tests failing due to isclose not accepting non-quantity tolerances

* mark top-level function tests as xfailing if they use assert_allclose

* mark test_1d_math as runnable but xfail it

* bump dask and distributed

* entry to whats-new.rst

* attempt to fix the failing py36-min-all-deps and py36-min-nep18 CI

* conditionally xfail tests using assert_allclose with pint < 0.12

* xfail more tests depending on which pint version is used

* try using numpy.testing.assert_allclose instead

* try computing if the dask version is too old and dask.array[bool]

* fix the dask version checking

* convert all dask arrays to numpy when using an insufficient dask version
* Improve typehints of xr.Dataset.__getitem__

Resolves pydata#4125

* Add overload for Mapping behavior

Sadly this is not working with my version of mypy. See python/mypy#7328

* Overload only Hashable inputs

Given mypy's use of overloads, I think this is all we can do. If the argument is not Hashable, then return the Union type as before.

* Lint

* Quote the DataArray to avoid error in py3.6

* Code review

Co-authored-by: crusaderky <[email protected]>
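Roughly the shape of the overloads, reconstructed from the commit messages (a sketch, not the merged code):

```python
from typing import Any, Hashable, Union, overload

class Dataset:
    @overload
    def __getitem__(self, key: Hashable) -> "DataArray":
        ...

    @overload
    def __getitem__(self, key: Any) -> Union["DataArray", "Dataset"]:
        ...

    def __getitem__(self, key):
        # runtime dispatch on the key type happens here
        ...
```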
We'll use RTD's new doc builder instead. For an example, click on
"docs/readthedocs.org:xray" below or look at GH4159
* Update issue templates based on dask

* add config.yml for issue template
* remove the xfail marks from all aggregations except prod and np.median

* rewrite the aggregation tests

* rewrite the repr tests

it still does not check the content of the repr, though

* rewrite some more tests

* simplify the numpy-method-with-args tests

* always use the same data units unless the compatibility is tested

* partially rewrite more tests

* rewrite combine_first

This also adds tests for units in indexes, which are by default stripped.

* simplify the comparisons test a bit

* skip the tests for identical

* remove the map_values function

* only call convert_units if necessary

* use assert_units_equal and assert_equal in broadcast_like and skip it

* remove the conditional skip since pint now supports __array_function__

* only skip the broadcast_like tests if we attempt to put units in indexes

* remove the xfail mark from the where tests

* reimplement the broadcast_equals tests

* reimplement the tests on stacked arrays

* refactor the to_stacked_array tests

this test is marked as skipped because the unit registry always
returns numpy.array objects which are not hashable, so the initial
dataset with units cannot be constructed (the result of
to_stacked_array wouldn't be correct either because IndexVariable
doesn't support units)

* fix the stacking and reordering tests

* don't create a coordinate for the isel tests

* separate the tests for units in dims from the tests for units in data

* refactor the dataset constructor tests

* fix the repr tests

* raise on all warnings

* rename merge_mappings to zip_mappings

* rename merge_dicts to merge_mappings

* make the missing value filling tests raise on warnings

* remove a leftover assert_equal_with_units

* refactor the sel tests

* make the loc tests a slightly modified copy of the sel tests

* make the drop_sel tests a slightly modified version of the sel tests

* refactor the head / tail / thin tests

* refactor the squeeze tests to not have multiple tests per case

* skip the head / tail / thin tests with units in dimensions

* combine the interp and reindex tests

* combine the interp_like and reindex_like tests

* refactor the computation tests

* rewrite the computation objects tests

* rewrite the resample tests

* rewrite the grouped operations tests

* rewrite the content manipulation tests

* refactor the merge tests

* remove the old assert_equal_with_units function

* xfail the groupby_bins tests for now

* fix and use allclose

* filterwarnings for the whole TestDataset class

* modify the squeeze tests to not use units in indexes

* replace skip with xfail

* update whats-new.rst

* update the xfail reason for the rolling_exp tests

* temporarily use pip to install pint

since the feedstock seems to take a while

* don't use pip to install pint

* update the xfail to require at least 0.12.1

* xfail the prod tests

* filter only UnitStrippedWarning

* remove unnecessary commas
* Revise pull request template

See below for the new language, to clarify that documentation is only necessary
for "user visible changes."

I added "including notable bug fixes" to indicate that minor bug fixes may not
be worth noting (I was thinking of test-suite only fixes in this category) but
perhaps that is too confusing.

* remove line break

* Update releasing notes
* replace np.bool with the python type

* replace np.int with the python type

* replace np.complex with the builtin python type

* replace np.float with the builtin python type
* Improve error message: automatic alignment during in-place operation.

* Sorted imports.

* Fix tests.

* Add suggestions from S. Hoyer.
Using `<pre>` messes up the display of nested HTML reprs, e.g., from dask. Now
we only use the `<pre>` tag when displaying text.
* limit length of dataarray reprs

* repr depends on numpy versions

* whatsnew

* correct comment based on @keewis comment

* Update whats-new.rst

Co-authored-by: Deepak Cherian <[email protected]>
* Test attrs handling in open_mfdataset

* Fix attrs handling in open_mfdataset()

Need to pass combine_attrs="drop" to allow attrs_file to set the attrs.

* Update whats-new.rst

* Update doc/whats-new.rst

Co-authored-by: Deepak Cherian <[email protected]>
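A hedged sketch of the fixed behaviour (file names hypothetical):

```python
import xarray as xr

# attrs_file picks which file's global attributes end up on the
# combined dataset; internally the combine step now uses
# combine_attrs="drop" so it doesn't clobber them first.
ds = xr.open_mfdataset(
    ["part1.nc", "part2.nc"],
    combine="by_coords",
    attrs_file="part2.nc",
)
```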
* Removed auto_combine function and argument to open_mfdataset

* Removed corresponding tests

* Code formatting

* updated what's new

* PEP8 fixes

* Update doc/whats-new.rst

`:py:func:` links fixed

Co-Authored-By: keewis <[email protected]>

* removed auto_combine from API docs

* clarify that auto_combine is completely removed

* concat_dim=None by default for combine='nested'

* fix black formatting

Co-authored-by: keewis <[email protected]>
Co-authored-by: dcherian <[email protected]>
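For illustration, the explicit replacements for auto_combine (paths hypothetical):

```python
import xarray as xr

# combine="by_coords" infers the layout from coordinate values:
ds = xr.open_mfdataset("parts/*.nc", combine="by_coords")

# combine="nested" relies on the order of the input files; note that
# concat_dim now defaults to None and must be given explicitly:
ds = xr.open_mfdataset(["t0.nc", "t1.nc"], combine="nested", concat_dim="time")
```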
* use assert_allclose in the aggregation tests

* install pint using pip
* Correct dask handling for 1D idxmax/min on ND data

* Passing black and others

* Edit Whats New
* add blackdoc to the pre-commit configuration

* use the stable version of blackdoc

* run blackdoc on all files

* add blackdoc to the linter / formatting tools section

* use language names to enable syntax highlighting

* update whats-new.rst
* Show data by default in HTML repr for DataArray

Fixes pydata#4176

* add whats new for html repr

* fix test
…x() methods (pydata#3936)

* DataArray.indices_min() and DataArray.indices_max() methods

These return dicts of the indices of the minimum or maximum of a
DataArray over several dimensions.

* Update whats-new.rst and api.rst with indices_min(), indices_max()

* Fix type checking in DataArray._unravel_argminmax()

* Fix expected results for TestReduce3D.test_indices_max()

* Respect global default for keep_attrs

* Merge behaviour of indices_min/indices_max into argmin/argmax

When argmin or argmax are called with a sequence for 'dim', they now
return a dict with the indices for each dimension in dim.

* Basic overload of argmin() and argmax() for Dataset

If single dim is passed to Dataset.argmin() or Dataset.argmax(), then
pass through to _argmin_base or _argmax_base. If a sequence is passed
for dim, raise an exception, because the result for each DataArray would
be a dict, which cannot be stored in a Dataset.

* Update Variable and dask tests with _argmin_base, _argmax_base

The basic numpy-style argmin() and argmax() methods were renamed when
adding support for handling multiple dimensions in DataArray.argmin()
and DataArray.argmax(). Variable.argmin() and Variable.argmax() are
therefore renamed as Variable._argmin_base() and
Variable._argmax_base().

* Update api-hidden.rst with _argmin_base and _argmax_base

* Explicitly defined class methods override injected methods

If a method (such as 'argmin') has been explicitly defined on a class
(so that hasattr(cls, "argmin")==True), then do not inject that method,
as it would override the explicitly defined one. Instead inject a
private method, prefixed by "_injected_" (such as '_injected_argmin'), so
that the injected method is available to the explicitly defined one.

Do not perform the hasattr check on binary ops, because this breaks
some operations (e.g. addition between DataArray and int in
test_dask.py).

* Move StringAccessor back to bottom of DataArray class definition

* Revert use of _argmin_base and _argmax_base

Now not needed because of change to injection in ops.py.

* Move implementation of argmin, argmax from DataArray to Variable

Makes use of argmin and argmax more general (they are available for
Variable) and is straightforward for DataArray to wrap the Variable
version.

* Update tests for change to coordinates on result of argmin, argmax

* Add 'out' keyword to argmin/argmax methods - allow numpy call signature

When np.argmin(da) is called, numpy passes an 'out' keyword argument to
argmin/argmax. Need to allow this argument to avoid errors (but an
exception is thrown if out is not None).

* Update and correct docstrings for argmin and argmax

* Correct suggested replacement for da.argmin() and da.argmax()

* Remove use of _injected_ methods in argmin/argmax

* Fix typo in name of argminmax_func

Co-Authored-By: keewis <[email protected]>

* Mark argminmax argument to _unravel_argminmax as a string

Co-Authored-By: keewis <[email protected]>

* Hidden internal methods don't need to appear in docs

* Basic docstrings for Dataset.argmin() and Dataset.argmax()

* Set stacklevel for DeprecationWarning in argmin/argmax methods

* Revert "Explicitly defined class methods override injected methods"

This reverts commit 8caf2b8.

* Revert "Add 'out' keyword to argmin/argmax methods - allow numpy call signature"

This reverts commit ab480b5.

* Remove argmin and argmax from ops.py

* Use self.reduce() in Dataset.argmin() and Dataset.argmax()

Replaces need for "_injected_argmin" and "_injected_argmax".

* Whitespace after 'title' lines in docstrings

* Remove tests of np.argmax() and np.argmin() functions from test_units.py

Applying numpy functions to xarray objects is not necessarily expected
to work, and the wrapping of argmin() and argmax() is broken by
xarray-specific interface of argmin() and argmax() methods of Variable,
DataArray and Dataset.

* Clearer deprecation warnings in Dataset.argmin() and Dataset.argmax()

Also, the previously suggested workaround was not correct. Remove the
suggestion, as there is no workaround (but the removed behaviour is
unlikely to be useful).

* Add unravel_index to duck_array_ops, use in Variable._unravel_argminmax

* Filter argmin/argmax DeprecationWarnings in tests

* Correct test for exception for nan in test_argmax

* Remove injected argmin and argmax methods from api-hidden.rst

* flake8 fixes

* Tidy up argmin/argmax following code review

Co-authored-by: Deepak Cherian <[email protected]>

* Remove filters for warnings from argmin/argmax from tests

Pass an explicit axis or dim argument instead to avoid the warning.

* Swap order of reduce_dims checks in Dataset.reduce()

Prefer to pass reduce_dims=None when possible, including for variables
with only one dimension. Avoids an error if an 'axis' keyword was
passed.

* revert the changes to Dataset.reduce

* use dim instead of axis

* use dimension instead of Ellipsis

* Make passing 'dim=...' to Dataset.argmin() or Dataset.argmax() an error

* Better docstrings for Dataset.argmin() and Dataset.argmax()

* Update doc/whats-new.rst

Co-authored-by: keewis <[email protected]>

Co-authored-by: Stephan Hoyer <[email protected]>
Co-authored-by: keewis <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Keewis <[email protected]>
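A short sketch of the merged behaviour (data made up):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(3, 4), dims=("y", "x"))

# A single dim behaves as before and returns a DataArray of indices:
da.argmin(dim="x")

# A sequence of dims returns a dict with the indices for each
# dimension, which can be fed straight into isel:
indices = da.argmin(dim=["x", "y"])
da.isel(indices)
```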
Was missing some 'self' and other kwarg variables. Also linted using black.
Don't pop the backend_kwargs dict (as per pydata#4003 (comment)); make a shallow copy of the backend_kwargs dictionary first. Also removed `overwrite_encoded_chunks` as a top-level kwarg of `open_dataset`; instead, pass it to `backend_kwargs` when using engine="zarr".
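A hedged sketch of the interface this PR proposes (it was superseded by pydata#4187; the store path is hypothetical):

```python
import xarray as xr

# zarr-specific options go through backend_kwargs rather than being
# top-level open_dataset keywords:
ds = xr.open_dataset(
    "store.zarr",
    engine="zarr",
    backend_kwargs={"overwrite_encoded_chunks": True},
)
```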
weiji14 marked this pull request as ready for review June 30, 2020 00:00
@weiji14 (Author) commented Jun 30, 2020

Moved to pydata#4187.

weiji14 closed this Jun 30, 2020