Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal to add indexing subsystem #143

Open
amotl opened this issue Jan 8, 2023 · 1 comment
Open

Proposal to add indexing subsystem #143

amotl opened this issue Jan 8, 2023 · 1 comment

Comments

@amotl
Copy link
Contributor

amotl commented Jan 8, 2023

About

After a few other attempts elsewhere, we started working on an indexer for NWP data once again at 1. We think it will help for different slicing and querying tasks requested by the community, some of them referenced below. The indexer is based on the excellent Caterva and ironArray libraries, which enable efficient slicing on all dimensions, along with several other features that are ideal in this scenario. More details are described in the next section.

graph TD

    H[Herbie loader]
    Z[Zarr loader]
    H -->IDX
    Z -->IDX

    IDX[NWP indexer]
    Q(Query by all coordinates/dimensions: time, lat, lon)
    R[Xarray result]

    IDX -->Q
    Q -->R
Loading

Technologies

There is a newish file format called Zarr that solves the problem of selecting smaller regions of interest without downloading the full file, by chunking the data.

Caterva is a container for multidimensional data that is specially designed to read datasets slices very efficiently. Like other libraries like Zarr, HDF5, or TileDB, Caterva stores the data into multidimensional chunks. Sitting on top of Caterva, ironArray is a C library oriented to compute and handle large multidimensional arrays efficiently.

Synopsis

Setup

git switch indexing
pip install --editable=.[indexing]

Alternative.

pip install --editable='git+https://github.com/earthobservations/Herbie@indexing[indexing]'

Usage

Create index

TIMERANGE = np.arange(
    start=np.datetime64("1987-10-01 08:00"),
    stop=np.datetime64("1987-10-01 10:59"),
    step=datetime.timedelta(hours=1),
)

era5_temp2m = NwpIndex(name="air_temperature_at_2_metres")
era5_temp2m.save(dataset=open_era5_zarr(TEMP2M, 1987, 10, TIMERANGE[0], TIMERANGE[-1]))

Query index

era5_temp2m = NwpIndex(name="air_temperature_at_2_metres")
era5_temp2m.load()

# Single forecasted temperature value at given time and geopoint, in Monterey, in Fahrenheit.
record = (
    nwp.query(time="1987-10-01 08:00", lat=36.6083, lon=-121.8674)
    .kelvin_to_fahrenheit()
    .data
)
# Verify value.
assert record.values == np.array(73.805008, dtype=np.float32)

# Temperatures in Berlin area, within a given time range and bounding box, in Celsius.
ds = (
    era5_temp2m.query(
        time=pd.date_range(
            start="1987-10-01 08:00", end="1987-10-01 09:00", freq="H"
        ),
        location=BBox(lon1=13.000, lat1=52.700, lon2=13.600, lat2=52.300),
    )
    .kelvin_to_celsius()
    .ds
)
# Verify shape.
assert ds["air_temperature_at_2_metres"].shape == (2, 3, 3)

Thoughts

We would like to gather early feedback from you and the community if this feature would be welcome. In the spirit of Herbie's history and disclaimer, we would be honored to contribute this as an indexing subsystem. Please let us know what you think about the patch, and if you also believe it would fit well into the code base, to enable a whole stack of new possibilities for Herbie.

References

Implementation

The query() method of herbie.index.core fame, filtering on all dimensions (time, lat, lon).

Herbie/herbie/index/core.py

Lines 101 to 125 in 4fd07a5

def query(self, time=None, lat=None, lon=None) -> "Result":
"""
Query ironArray by multiple dimensions.
"""
# Compute slices for time or time range, and geolocation point or range (bbox).
time_slice = self.time_slice(coordinate="time", value=time)
lat_slice = self.geo_slice(coordinate="lat", value=lat)
lon_slice = self.geo_slice(coordinate="lon", value=lon)
# Slice data.
data = self.data[time_slice, lat_slice, lon_slice]
# Rebuild DataArray from result.
outdata = xr.DataArray(
data,
dims=("time", "lat", "lon"),
coords={
"time": self.coordinate.time[time_slice.start : time_slice.stop],
"lat": self.coordinate.lat[lat_slice.start : lat_slice.stop],
"lon": self.coordinate.lon[lon_slice.start : lon_slice.stop],
},
)
return Result(da=outdata)

Footnotes

  1. https://github.com/earthobservations/Herbie/compare/main...earthobservations:Herbie:indexing

@amotl amotl changed the title Add efficient indexing subsystem Add indexing subsystem Jan 8, 2023
@blaylockbk
Copy link
Owner

I appreciate the contribution. Zarr access is something I've wanted to add to Herbie (especially abstracting access to the HRRR Zarr dataset12). This is a bit to digest at the moment. When the docs are migrated to RTD and restored, I'll start to look at this in more depth. I certainly think Herbie would be a good home to add access to more types of data that are available.

Footnotes

  1. https://github.com/blaylockbk/Herbie/blob/main/docs/user_guide/zarr.md

  2. https://hrrrzarr.s3.amazonaws.com/index.html

@amotl amotl changed the title Add indexing subsystem Proposal to add indexing subsystem Jan 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants