-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RasterIO backend #1260
Add RasterIO backend #1260
Conversation
Before I'll get more into details with what needs to be done with rasterio itself, I'd like to get some xarray internals ready first. No need to do a full review yet, but I'd appreciate help with the following points:
How is this possible?
Thanks a lot for your help, I'm afraid this is going to need a few iterations ;-) |
Most likely you're using a dict instead of an OrderedDict somewhere.
See here for an example of lazily computing arrays: Line 314 in 6233320
There are other examples in that file, for decoding CF conventions.
Can you clarify what you mean by an optional coordinate? |
Yes sorry, I meant the have them listed as coordinates without |
thanks @shoyer , I think I can work on this a bit further now and I'll get back to you if I have more questions. |
Getting optional coordinates is hard is because the current "backends" system in xarray was really only designed for handling netCDF-like libraries. For example, every backend gets passed through One alternative would be to simply write a function for reading files with rasterio straight into xarray Dataset objects. There are probably hacks that can get this working with data stores but this really calls out for the bigger refactor like #1087. |
I made some progress with the lazy indexing, I'd be glad to have a first rough feedback on whether this is going in the right direction or not. We have a decision to make regarding the API: I think that creating the optional lon and lat coords automatically is not a good idea:
The current implementation delegates this task to a utility function ( Thoughts? |
This could be done sanely with dask.array (example of multiple outputs from an element-wise function).
We could make this work, it just looks pretty hacky. |
@shoyer thanks for the tips, I think that getting the lons and lats from dask is probably the most elegant method. After trying various things I am still struggling with dask, and in particular on how to apply import xarray as xr
import numpy as np
ds = xr.DataArray(np.zeros((2, 3)),
coords={'x': np.arange(3), 'y': np.arange(2)},
dims=['y', 'x']).to_dataset(name='data')
# non-dask version
lon, lat = np.meshgrid(ds.x, ds.y)
ds['lon'] = (('y', 'x'), lon)
ds['lat'] = (('y', 'x'), lat)
ds.set_coords(['lon', 'lat'], inplace=True)
print(ds) Thanks a lot! |
cc80626
to
d877826
Compare
xarray/backends/rasterio_.py
Outdated
If True, cache data loaded from the underlying datastore in memory as | ||
NumPy arrays when accessed to avoid reading from the underlying data- | ||
store multiple times. Defaults to True unless you specify the `chunks` | ||
argument to use dask, in which case it defaults to False. Does not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This last sentence can be safely removed.
xarray/backends/rasterio_.py
Outdated
chunks : int, tuple or dict, optional | ||
Chunk sizes along each dimension, e.g., ``5``, ``(5, 5)`` or | ||
``{'x': 5, 'y': 5}``. If chunks is provided, it used to load the new | ||
DataArray into a dask array. This is an experimental feature; see the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably remove the "experimental feature" bit. Dask is not going away.
xarray/backends/rasterio_.py
Outdated
lock : False, True or threading.Lock, optional | ||
If chunks is provided, this argument is passed on to | ||
:py:func:`dask.array.from_array`. By default, a per-variable lock is | ||
used when reading data from netCDF files with the netcdf4 and h5netcdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does rasterio/GDAL need a thread lock? Let's remove netcdf4/h5netcdf specific references here and add sane defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot safely share a Rasterio or GDAL dataset object among threads, so a lock may be needed. See https://trac.osgeo.org/gdal/wiki/FAQMiscellaneous#IstheGDALlibrarythread-safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I am on it.
@shoyer , I have trouble to understand what's going on in the open_dataset
workflow. The documentation says: "By default, a per-variable lock is used when reading data from netCDF files with the netcdf4 and h5netcdf engines", but then the lock is decided once here. This will default to GLOBAL_LOCK
and then passed to maybe_decode_store
. Where is the "per variable" logic happening? Or is the GLOBAL_LOCK
already doing this? (it is defined here)
(excuse my poor knowledge about these issues)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation is out of sync here. We use a global lock for HDF5 (and should probably use one for Rasterio, too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I understand much more what's going on, and also why salem was having troubles when adding diagnostic variables to netcdf objects. Can I update the docs all at once here?
xarray/backends/rasterio_.py
Outdated
import dask | ||
if LooseVersion(dask.__version__) < LooseVersion('0.6'): | ||
raise ImportError( | ||
'xarray requires dask version 0.6 or newer') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dask v0.6 is pretty old now, maybe better to replace this whole conditional logic with:
import dask
from dask.base import tokenize
Yes, please!
…On Wed, May 31, 2017 at 8:48 AM, Fabien Maussion ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In xarray/backends/rasterio_.py
<#1260 (comment)>:
> + chunks : int, tuple or dict, optional
+ Chunk sizes along each dimension, e.g., ``5``, ``(5, 5)`` or
+ ``{'x': 5, 'y': 5}``. If chunks is provided, it used to load the new
+ DataArray into a dask array. This is an experimental feature; see the
+ documentation for more details.
+ cache : bool, optional
+ If True, cache data loaded from the underlying datastore in memory as
+ NumPy arrays when accessed to avoid reading from the underlying data-
+ store multiple times. Defaults to True unless you specify the `chunks`
+ argument to use dask, in which case it defaults to False. Does not
+ change the behavior of coordinates corresponding to dimensions, which
+ always load their data from disk into a ``pandas.Index``.
+ lock : False, True or threading.Lock, optional
+ If chunks is provided, this argument is passed on to
+ :py:func:`dask.array.from_array`. By default, a per-variable lock is
+ used when reading data from netCDF files with the netcdf4 and h5netcdf
Now I understand much more what's going on, and also why salem was having
troubles when adding diagnostic variables to netcdf objects. Can I update
the docs all at once here?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1260 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1p7_JyTe8jdB1omZm_4V005U_ex_ks5r_YuygaJpZM4L92Pl>
.
|
OK, all green. Currently the rasterio tests are running on py36 only. Should I add rasterio to the other test suites as well? |
Yes, except for the one that tests bare-minimal dependencies. |
Hey @fmaussion, is this ready for me to try out again? I wanted to let you and @shoyer iterate first. |
@gidden yes absolutely please give it a try, I think it's ready |
@gidden I just updated the documentation recipe to use an accessor to compute the lons and lats, let me know what you think |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me, just a couple more very minor points
doc/gallery/plot_rasterio.py
Outdated
|
||
# Define the accessor | ||
@xr.register_dataarray_accessor('rasterio') | ||
class RasterioAccessor(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would lean towards keeping this as just a function -- it keeps the example simpler and an accessor is probably overkill for this problem.
xarray/backends/rasterio_.py
Outdated
return out | ||
|
||
|
||
def open_rasterio(filename, chunks=None, cache=None, lock=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default should be lock=None
to use the global lock.
xarray/backends/rasterio_.py
Outdated
token = tokenize(filename, mtime, chunks) | ||
name_prefix = 'open_rasterio-%s' % token | ||
if lock is None: | ||
lock = GLOBAL_LOCK |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, can you make a separate global lock for rasterio? We can safely use GDAL and HDF5 at the same time.
xarray/tests/test_backends.py
Outdated
@@ -1438,6 +1438,265 @@ class TestPyNioAutocloseTrue(TestPyNio): | |||
autoclose = True | |||
|
|||
|
|||
@requires_rasterio | |||
class TestRasterio(CFEncodedDataTest, Only32BitTypes, TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's any advantage to inheriting from the other test methods, given that this isn't an actual datastore. I think this can just inherit from TestCase
(then you can delete the first two methods).
xarray/tests/test_backends.py
Outdated
with xr.open_dataarray(tmp_nc_file) as ncds: | ||
assert_identical(rioda, ncds) | ||
|
||
# Create a geotiff file in latlong proj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's self-contained, make this a separate test method?
OK, let's get this one in and see what people will report about it. Thanks @shoyer for your patience, @NicWayand and @jhamman for the original PR, @gidden for the testing/reviews and @sgillies for rasterio! |
Nice work @fmaussion. Thanks for bringing this home. |
+1 @fmaussion - Thanks! |
Follow-up to #1070
This is my first backend so this is going to be a bit more work than I expected, but with your help we should be able to get through this.
A long todo list:
git diff upstream/master | flake8 --diff
make coordinates variables lazy too