
Add RasterIO backend #1260

Merged: 42 commits, Jun 6, 2017

Commits
1643ce5
rasterio checkin
Apr 5, 2016
067dedb
temp fixes
NicWayand Oct 27, 2016
7906cfd
update rasterio reader, no lazy loading, no decoding of coords
Oct 28, 2016
2e1b528
keep band dim even for single band. Fix longitude typo
NicWayand Oct 30, 2016
532c5d3
Fix lat/lon decoding. Remove requirment comment
Oct 31, 2016
8bc3da3
Attr error suppression. DataArray to Variable objects. CI requirment …
Oct 31, 2016
77cc0ca
remove >= requirment
Oct 31, 2016
776cbd9
added conda-forge channel to CI check
Oct 31, 2016
eb739de
add scipy requirement
Oct 31, 2016
3a394ae
roll back ci requirements. Rename vars
Nov 2, 2016
061b8fd
roll back
Nov 2, 2016
c0962fa
fixed coord spacing bug where x and y were +1 dim than raster. Uses n…
NicWayand Nov 8, 2016
7275ffa
change test env to py36
fmaussion Feb 10, 2017
51a60af
first tests
fmaussion Feb 10, 2017
5155634
other tests
fmaussion Feb 10, 2017
09196ee
fix test
fmaussion Feb 10, 2017
e2b6786
get the order right
fmaussion Feb 10, 2017
228a5a3
some progress with indexing
fmaussion Feb 14, 2017
4b57d1c
cosmetic changes
fmaussion Mar 5, 2017
2d21b4b
Conflicts
fmaussion May 3, 2017
c2cb927
More rebase
fmaussion May 3, 2017
f86507e
looking good now. Testing
fmaussion May 17, 2017
e1a5b31
docs
fmaussion May 17, 2017
7cb8baf
whats new
fmaussion May 17, 2017
48c7268
fix test
fmaussion May 17, 2017
70bd03a
reviews
fmaussion May 24, 2017
4d49195
Merge branch 'master' into feature-rasterio
fmaussion May 24, 2017
3f18144
docs
fmaussion May 24, 2017
3e3e6fb
Merge remote-tracking branch 'upstream/master' into feature-rasterio
fmaussion May 24, 2017
1ae2d9b
more reviews
fmaussion May 25, 2017
955f6b9
chunking and caching
fmaussion May 30, 2017
7d8fe4d
Merge remote-tracking branch 'upstream/master' into feature-rasterio
fmaussion May 30, 2017
223ce0c
Final tweaks
fmaussion May 30, 2017
6cf2ce9
Lock-doc tweaks
fmaussion May 31, 2017
2cd0386
Merge branch 'master' into feature-rasterio
fmaussion May 31, 2017
9193a2b
Add rasterio to other test suites
fmaussion Jun 1, 2017
4bf4b6a
Merge remote-tracking branch 'origin/feature-rasterio' into feature-r…
fmaussion Jun 1, 2017
c778948
use context managers in tests for windows
fmaussion Jun 1, 2017
4299957
Change example to use an accessor
fmaussion Jun 1, 2017
fcdd894
Reviews
fmaussion Jun 5, 2017
1ca6e38
Merge branch 'master' into feature-rasterio
fmaussion Jun 5, 2017
d5c964e
typo
fmaussion Jun 5, 2017
1 change: 1 addition & 0 deletions ci/requirements-py27-cdat+pynio.yml
@@ -18,6 +18,7 @@ dependencies:
- scipy
- seaborn
- toolz
- rasterio
- pip:
- coveralls
- pytest-cov
1 change: 1 addition & 0 deletions ci/requirements-py27-windows.yml
@@ -15,3 +15,4 @@ dependencies:
- scipy
- seaborn
- toolz
- rasterio
1 change: 1 addition & 0 deletions ci/requirements-py35.yml
@@ -15,6 +15,7 @@ dependencies:
- scipy
- seaborn
- toolz
- rasterio
- pip:
- coveralls
- pytest-cov
1 change: 1 addition & 0 deletions ci/requirements-py36-windows.yml
@@ -15,3 +15,4 @@ dependencies:
- scipy
- seaborn
- toolz
- rasterio
1 change: 1 addition & 0 deletions ci/requirements-py36.yml
@@ -15,6 +15,7 @@ dependencies:
- scipy
- seaborn
- toolz
- rasterio
- pip:
- coveralls
- pytest-cov
1 change: 1 addition & 0 deletions doc/api.rst
@@ -417,6 +417,7 @@ Dataset methods

open_dataset
open_mfdataset
open_rasterio
Dataset.to_netcdf
save_mfdataset
Dataset.to_array
58 changes: 58 additions & 0 deletions doc/gallery/plot_rasterio.py
@@ -0,0 +1,58 @@
# -*- coding: utf-8 -*-
"""
.. _recipes.rasterio:

=================================
Parsing rasterio's geocoordinates
=================================


Converting a projection's cartesian coordinates into 2D longitudes and
latitudes.

These new coordinates might be handy for plotting and indexing, but it should
be kept in mind that a grid which is regular in projection coordinates will
likely be irregular in lon/lat. It is often recommended to work in the data's
original map projection.
"""

import os
import urllib.request
import numpy as np
import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
from rasterio.warp import transform


# Download the file from rasterio's repository
url = 'https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif'
urllib.request.urlretrieve(url, 'RGB.byte.tif')

# Read the data
da = xr.open_rasterio('RGB.byte.tif')

# Compute the lon/lat coordinates with rasterio.warp.transform
ny, nx = len(da['y']), len(da['x'])
x, y = np.meshgrid(da['x'], da['y'])

# Rasterio works with 1D arrays
lon, lat = transform(da.crs, {'init': 'EPSG:4326'},
                     x.flatten(), y.flatten())
lon = np.asarray(lon).reshape((ny, nx))
lat = np.asarray(lat).reshape((ny, nx))
da.coords['lon'] = (('y', 'x'), lon)
da.coords['lat'] = (('y', 'x'), lat)

# Compute a greyscale out of the rgb image
greyscale = da.mean(dim='band')

# Plot on a map
ax = plt.subplot(projection=ccrs.PlateCarree())
greyscale.plot(ax=ax, x='lon', y='lat', transform=ccrs.PlateCarree(),
               cmap='Greys_r', add_colorbar=False)
ax.coastlines('10m', color='r')
plt.show()

# Delete the file
os.remove('RGB.byte.tif')
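The meshgrid/flatten/reshape round-trip the recipe above relies on can be sanity-checked on a toy grid (a standalone sketch, not part of the gallery script):

```python
import numpy as np

x = np.array([10., 20., 30.])   # 3 columns
y = np.array([1., 2.])          # 2 rows
xx, yy = np.meshgrid(x, y)      # shapes are (ny, nx) == (2, 3)

# any element-wise transform of the flattened arrays keeps the (x, y)
# pairing intact, so reshaping back to (ny, nx) restores the grid
back = xx.flatten().reshape((2, 3))
print(np.array_equal(back, xx))  # prints True
```

This is why the recipe can hand 1D arrays to ``rasterio.warp.transform`` and reshape the results back into 2D coordinate arrays.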
42 changes: 42 additions & 0 deletions doc/io.rst
@@ -384,6 +384,48 @@ over the network until we look at particular values:

.. image:: _static/opendap-prism-tmax.png

.. _io.rasterio:

Rasterio
--------

GeoTIFFs and other gridded raster datasets can be opened with `rasterio`_,
provided rasterio is installed. Here is an example of how to use
:py:func:`~xarray.open_rasterio` to read one of rasterio's `test files`_:

.. ipython::
    :verbatim:

    In [7]: rio = xr.open_rasterio('RGB.byte.tif')

    In [8]: rio
    Out[8]:
    <xarray.DataArray (band: 3, y: 718, x: 791)>
    [1703814 values with dtype=uint8]
    Coordinates:
      * band     (band) int64 1 2 3
      * y        (y) float64 2.827e+06 2.827e+06 2.826e+06 2.826e+06 2.826e+06 ...
      * x        (x) float64 1.02e+05 1.023e+05 1.026e+05 1.029e+05 1.032e+05 ...
    Attributes:
        crs:  +init=epsg:32618

The ``x`` and ``y`` coordinates are generated from the file's metadata
(``bounds``, ``width``, ``height``), and they can be understood as cartesian
coordinates defined in the file's projection provided by the ``crs`` attribute.
``crs`` is a PROJ4 string that can be parsed by, e.g., `pyproj`_ or rasterio.
See :ref:`recipes.rasterio` for an example of how to convert these to
longitudes and latitudes.
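As an aside (a sketch, not part of the PR docs; the helper name is hypothetical), a PROJ4 string like the ``crs`` attribute above can be split into key/value pairs with plain Python before handing it to a library such as pyproj:

```python
def parse_proj4(proj4):
    """Split a PROJ4 string into a dict; bare flags map to True."""
    out = {}
    for token in proj4.split():
        key, _, value = token.lstrip('+').partition('=')
        out[key] = value if value else True
    return out

print(parse_proj4('+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs'))
```

For actual coordinate transformations you would still use pyproj or rasterio; this only illustrates the structure of the string.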

.. warning::

    This feature was added in xarray v0.9.6 and should still be considered
    experimental. Please report any bugs you may find on xarray's GitHub
    repository.

.. _rasterio: https://mapbox.github.io/rasterio/
.. _test files: https://github.com/mapbox/rasterio/blob/master/tests/data/RGB.byte.tif
.. _pyproj: https://github.com/jswhit/pyproj

.. _io.pynio:

Formats supported by PyNIO
6 changes: 6 additions & 0 deletions doc/whats-new.rst
@@ -46,6 +46,12 @@ Enhancements
an error instead of aligning when indexes to be aligned are not equal.
By `Stephan Hoyer <https://github.com/shoyer>`_.

- New backend to open raster files with the
  `rasterio <https://mapbox.github.io/rasterio/>`_ library.
  By `Joe Hamman <https://github.com/jhamman>`_,
  `Nic Wayand <https://github.com/NicWayand>`_ and
  `Fabien Maussion <https://github.com/fmaussion>`_.

Bug fixes
~~~~~~~~~

2 changes: 2 additions & 0 deletions xarray/__init__.py
@@ -16,6 +16,8 @@

from .backends.api import (open_dataset, open_dataarray, open_mfdataset,
                           save_mfdataset)
from .backends.rasterio_ import open_rasterio

from .conventions import decode_cf

try:
22 changes: 5 additions & 17 deletions xarray/backends/api.py
@@ -181,11 +181,10 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
     chunks : int or dict, optional
         If chunks is provided, it used to load the new dataset into dask
         arrays. ``chunks={}`` loads the dataset with dask using a single
-        chunk for all arrays. This is an experimental feature; see the
-        documentation for more details.
+        chunk for all arrays.
     lock : False, True or threading.Lock, optional
         If chunks is provided, this argument is passed on to
-        :py:func:`dask.array.from_array`. By default, a per-variable lock is
+        :py:func:`dask.array.from_array`. By default, a global lock is
         used when reading data from netCDF files with the netcdf4 and h5netcdf
         engines to avoid issues with concurrent access when using dask's
         multithreaded backend.
@@ -228,17 +227,7 @@ def maybe_decode_store(store, lock=False):
     _protect_dataset_variables_inplace(ds, cache)

     if chunks is not None:
-        try:
-            from dask.base import tokenize
-        except ImportError:
-            # raise the usual error if dask is entirely missing
-            import dask
-            if LooseVersion(dask.__version__) < LooseVersion('0.6'):
-                raise ImportError(
-                    'xarray requires dask version 0.6 or newer')
-            else:
-                raise
-
+        from dask.base import tokenize
         # if passed an actual file path, augment the token with
         # the file modification time
         if (isinstance(filename_or_obj, basestring) and
@@ -369,11 +358,10 @@ def open_dataarray(*args, **kwargs):
         'netcdf4'.
     chunks : int or dict, optional
         If chunks is provided, it used to load the new dataset into dask
-        arrays. This is an experimental feature; see the documentation for more
-        details.
+        arrays.
     lock : False, True or threading.Lock, optional
         If chunks is provided, this argument is passed on to
-        :py:func:`dask.array.from_array`. By default, a per-variable lock is
+        :py:func:`dask.array.from_array`. By default, a global lock is
         used when reading data from netCDF files with the netcdf4 and h5netcdf
         engines to avoid issues with concurrent access when using dask's
         multithreaded backend.
170 changes: 170 additions & 0 deletions xarray/backends/rasterio_.py
@@ -0,0 +1,170 @@
import os
from collections import OrderedDict
from distutils.version import LooseVersion
import numpy as np

from .. import DataArray
from ..core.utils import DunderArrayMixin, NdimSizeLenMixin, is_scalar
from ..core import indexing
try:
    from dask.utils import SerializableLock as Lock
except ImportError:
    from threading import Lock

RASTERIO_LOCK = Lock()
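The import-fallback pattern above can be exercised standalone (a sketch, not part of the PR; it works whether or not dask is installed, since both lock types support the context-manager protocol):

```python
try:
    # dask's SerializableLock survives pickling, which matters for
    # distributed schedulers
    from dask.utils import SerializableLock as Lock
except ImportError:
    # plain threading lock as a fallback when dask is absent
    from threading import Lock

lock = Lock()
with lock:
    total = 1 + 1
print(total)  # prints 2
```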

_ERROR_MSG = ('The kind of indexing operation you are trying to do is not '
              'valid on rasterio files. Try to load your data with ds.load() '
              'first.')


class RasterioArrayWrapper(NdimSizeLenMixin, DunderArrayMixin):
    """A wrapper around rasterio dataset objects"""

    def __init__(self, rasterio_ds):
        self.rasterio_ds = rasterio_ds
        self._shape = (rasterio_ds.count, rasterio_ds.height,
                       rasterio_ds.width)
        self._ndims = len(self.shape)

    @property
    def dtype(self):
        dtypes = self.rasterio_ds.dtypes
        if not np.all(np.asarray(dtypes) == dtypes[0]):
            raise ValueError('All bands should have the same dtype')
        return np.dtype(dtypes[0])

    @property
    def shape(self):
        return self._shape

    def __getitem__(self, key):
        # make our job a bit easier
        key = indexing.canonicalize_indexer(key, self._ndims)

        # bands cannot be windowed but they can be listed
        band_key = key[0]
        n_bands = self.shape[0]
        if isinstance(band_key, slice):
            start, stop, step = band_key.indices(n_bands)
            if step is not None and step != 1:
                raise IndexError(_ERROR_MSG)
            band_key = np.arange(start, stop)
        # be sure we give out a list
        band_key = (np.asarray(band_key) + 1).tolist()

        # but other dims can only be windowed
        window = []
        squeeze_axis = []
        for i, (k, n) in enumerate(zip(key[1:], self.shape[1:])):
            if isinstance(k, slice):
                start, stop, step = k.indices(n)
                if step is not None and step != 1:
                    raise IndexError(_ERROR_MSG)
            elif is_scalar(k):
                # windowed operations will always return an array
                # we will have to squeeze it later
                squeeze_axis.append(i + 1)
                start = k
                stop = k + 1
            else:
                k = np.asarray(k)
                start = k[0]
                stop = k[-1] + 1
                ids = np.arange(start, stop)
                if not ((k.shape == ids.shape) and np.all(k == ids)):
                    raise IndexError(_ERROR_MSG)
            window.append((start, stop))

        out = self.rasterio_ds.read(band_key, window=window)
        if squeeze_axis:
            out = np.squeeze(out, axis=squeeze_axis)
        return out
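To make the band/window translation above concrete, here is a standalone sketch (the helper name is hypothetical and only slice keys are handled) of how an indexer maps onto the arguments of rasterio's ``read``:

```python
import numpy as np

def key_to_read_args(key, shape):
    """Mimic the translation in __getitem__ for slice-only keys (sketch)."""
    band_key, n_bands = key[0], shape[0]
    if isinstance(band_key, slice):
        start, stop, _ = band_key.indices(n_bands)
        band_key = np.arange(start, stop)
    # rasterio band indexes are 1-based, and read() wants a list
    bands = (np.asarray(band_key) + 1).tolist()
    window = []
    for k, n in zip(key[1:], shape[1:]):
        start, stop, _ = k.indices(n)
        window.append((start, stop))
    return bands, window

bands, window = key_to_read_args((slice(None), slice(0, 2), slice(3, 7)),
                                 (3, 718, 791))
print(bands)   # [1, 2, 3]
print(window)  # [(0, 2), (3, 7)]
```

This is why arbitrary (non-contiguous) fancy indexing raises ``_ERROR_MSG``: rasterio can only read contiguous windows.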


def open_rasterio(filename, chunks=None, cache=None, lock=None):
    """Open a file with rasterio (experimental).

    This should work with any file that rasterio can open (most often:
    geoTIFF). The x and y coordinates are generated automatically from the
    file's geoinformation.

    Parameters
    ----------
    filename : str
        Path to the file to open.
    chunks : int, tuple or dict, optional
        Chunk sizes along each dimension, e.g., ``5``, ``(5, 5)`` or
        ``{'x': 5, 'y': 5}``. If chunks is provided, it is used to load the
        new DataArray into a dask array.
    cache : bool, optional
        If True, cache data loaded from the underlying datastore in memory as
        NumPy arrays when accessed to avoid reading from the underlying data-
        store multiple times. Defaults to True unless you specify the `chunks`
        argument to use dask, in which case it defaults to False.
    lock : False, True or threading.Lock, optional
        If chunks is provided, this argument is passed on to
        :py:func:`dask.array.from_array`. By default, a global lock is used
        to avoid issues with concurrent access to the same file when using
        dask's multithreaded backend.

    Returns
    -------
    data : DataArray
        The newly created DataArray.
    """

    import rasterio
    riods = rasterio.open(filename, mode='r')

    if cache is None:
        cache = chunks is None

    coords = OrderedDict()

    # Get bands
    if riods.count < 1:
        raise ValueError('Unknown dims')
    coords['band'] = np.asarray(riods.indexes)

    # Get geo coords
    nx, ny = riods.width, riods.height
    dx, dy = riods.res[0], -riods.res[1]
    x0 = riods.bounds.right if dx < 0 else riods.bounds.left
    y0 = riods.bounds.top if dy < 0 else riods.bounds.bottom
    coords['y'] = np.linspace(start=y0, num=ny, stop=(y0 + (ny - 1) * dy))
    coords['x'] = np.linspace(start=x0, num=nx, stop=(x0 + (nx - 1) * dx))
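The coordinate construction above can be mirrored with plain NumPy on a toy raster (a sketch with made-up numbers, not part of the PR):

```python
import numpy as np

# toy north-up raster: 4 x 3 pixels, 10 m resolution, origin at (0, 40)
nx, ny = 4, 3
dx, dy = 10.0, -10.0   # dy < 0 because rows run from north to south
x0, y0 = 0.0, 40.0     # left edge and top edge

x = np.linspace(x0, x0 + (nx - 1) * dx, num=nx)
y = np.linspace(y0, y0 + (ny - 1) * dy, num=ny)
print(x.tolist())  # [0.0, 10.0, 20.0, 30.0]
print(y.tolist())  # [40.0, 30.0, 20.0]
```

Note that ``y`` decreases, matching how raster rows are stored top-down.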

    # Attributes
    attrs = {}
    if hasattr(riods, 'crs'):
        # CRS is a dict-like object specific to rasterio
        # We convert it back to a PROJ4 string using rasterio itself
        attrs['crs'] = riods.crs.to_string()
    # Maybe we'd like to parse other attributes here (for later)

    data = indexing.LazilyIndexedArray(RasterioArrayWrapper(riods))
shoyer (Member) commented:
As I mentioned in #1260, consider adding caching here. At the least, I think we want to use copy on write.

fmaussion (Member, Author) replied:

I don't understand what you mean here, sorry :(

shoyer (Member) replied on May 25, 2017:

GitHub seems to lose the specific comment link -- if you click edit on my comment here you'll find it.

To be more concrete, after this line I would suggest adding:

# this lets you write arrays loaded with rasterio
data = indexing.CopyOnWriteArray(data)
if cache:  # optional
    data = indexing.MemoryCachedArray(data)

cache would be an optional boolean argument, like it is on open_dataset.

This also argues for adding a chunks argument, which is set uses dask to chunk the data and disables the cache. This should probably create a token for dask similar to how we do it for open_dataset:

if (isinstance(filename_or_obj, basestring) and
        not is_remote_uri(filename_or_obj)):
    mtime = os.path.getmtime(filename_or_obj)
else:
    mtime = None
token = tokenize(filename_or_obj, mtime, group, decode_cf,
                 mask_and_scale, decode_times, concat_characters,
                 decode_coords, engine, chunks, drop_variables)
name_prefix = 'open_dataset-%s' % token
ds2 = ds.chunk(chunks, name_prefix=name_prefix, token=token,
               lock=lock)

fmaussion (Member, Author) replied on May 27, 2017:

Implementing cache went well (apart from this bug).

For chunking I have a question: DataArray.chunk() doesn't have the name_prefix, token and lock keywords. Any reason for this?


    # this lets you write arrays loaded with rasterio
    data = indexing.CopyOnWriteArray(data)
    if cache and (chunks is None):
        data = indexing.MemoryCachedArray(data)

    result = DataArray(data=data, dims=('band', 'y', 'x'),
                       coords=coords, attrs=attrs)

    if chunks is not None:
        from dask.base import tokenize
        # augment the token with the file modification time
        mtime = os.path.getmtime(filename)
        token = tokenize(filename, mtime, chunks)
        name_prefix = 'open_rasterio-%s' % token
        if lock is None:
            lock = RASTERIO_LOCK
        result = result.chunk(chunks, name_prefix=name_prefix, token=token,
                              lock=lock)

    # Make the file closeable
    result._file_obj = riods

    return result