Proposal to add indexing subsystem #143

amotl · 2023-01-08T21:47:24Z

About

After a few other attempts elsewhere, we started working on an indexer for NWP data once again at ¹. We think it will help for different slicing and querying tasks requested by the community, some of them referenced below. The indexer is based on the excellent Caterva and ironArray libraries, which enable efficient slicing on all dimensions, along with several other features that are ideal in this scenario. More details are described in the next section.

graph TD

    H[Herbie loader]
    Z[Zarr loader]
    H -->IDX
    Z -->IDX

    IDX[NWP indexer]
    Q(Query by all coordinates/dimensions: time, lat, lon)
    R[Xarray result]

    IDX -->Q
    Q -->R

Technologies

There is a newish file format called Zarr that solves the problem of selecting smaller regions of interest without downloading the full file, by chunking the data.

Caterva is a container for multidimensional data that is specially designed to read datasets slices very efficiently. Like other libraries like Zarr, HDF5, or TileDB, Caterva stores the data into multidimensional chunks. Sitting on top of Caterva, ironArray is a C library oriented to compute and handle large multidimensional arrays efficiently.

Synopsis

Setup

git switch indexing
pip install --editable=.[indexing]

Alternative.

pip install --editable='git+https://github.com/earthobservations/Herbie@indexing[indexing]'

Usage

Create index

TIMERANGE = np.arange(
    start=np.datetime64("1987-10-01 08:00"),
    stop=np.datetime64("1987-10-01 10:59"),
    step=datetime.timedelta(hours=1),
)

era5_temp2m = NwpIndex(name="air_temperature_at_2_metres")
era5_temp2m.save(dataset=open_era5_zarr(TEMP2M, 1987, 10, TIMERANGE[0], TIMERANGE[-1]))

Query index

era5_temp2m = NwpIndex(name="air_temperature_at_2_metres")
era5_temp2m.load()

# Single forecasted temperature value at given time and geopoint, in Monterey, in Fahrenheit.
record = (
    nwp.query(time="1987-10-01 08:00", lat=36.6083, lon=-121.8674)
    .kelvin_to_fahrenheit()
    .data
)
# Verify value.
assert record.values == np.array(73.805008, dtype=np.float32)

# Temperatures in Berlin area, within a given time range and bounding box, in Celsius.
ds = (
    era5_temp2m.query(
        time=pd.date_range(
            start="1987-10-01 08:00", end="1987-10-01 09:00", freq="H"
        ),
        location=BBox(lon1=13.000, lat1=52.700, lon2=13.600, lat2=52.300),
    )
    .kelvin_to_celsius()
    .ds
)
# Verify shape.
assert ds["air_temperature_at_2_metres"].shape == (2, 3, 3)

Thoughts

We would like to gather early feedback from you and the community if this feature would be welcome. In the spirit of Herbie's history and disclaimer, we would be honored to contribute this as an indexing subsystem. Please let us know what you think about the patch, and if you also believe it would fit well into the code base, to enable a whole stack of new possibilities for Herbie.

References

Implementation

The query() method of herbie.index.core fame, filtering on all dimensions (time, lat, lon).

Herbie/herbie/index/core.py

Lines 101 to 125 in 4fd07a5

    
               def query(self, time=None, lat=None, lon=None) -> "Result": 
        
                   """ 
        
                   Query ironArray by multiple dimensions. 
        
                   """ 
        
                   # Compute slices for time or time range, and geolocation point or range (bbox). 
        
                   time_slice = self.time_slice(coordinate="time", value=time) 
        
                   lat_slice = self.geo_slice(coordinate="lat", value=lat) 
        
                   lon_slice = self.geo_slice(coordinate="lon", value=lon) 
        
                   # Slice data. 
        
                   data = self.data[time_slice, lat_slice, lon_slice] 
        
                   # Rebuild DataArray from result. 
        
                   outdata = xr.DataArray( 
        
                       data, 
        
                       dims=("time", "lat", "lon"), 
        
                       coords={ 
        
                           "time": self.coordinate.time[time_slice.start : time_slice.stop], 
        
                           "lat": self.coordinate.lat[lat_slice.start : lat_slice.stop], 
        
                           "lon": self.coordinate.lon[lon_slice.start : lon_slice.stop], 
        
                       }, 
        
                   ) 
        
                   return Result(da=outdata)

https://github.com/earthobservations/Herbie/compare/main...earthobservations:Herbie:indexing ↩

The text was updated successfully, but these errors were encountered:

blaylockbk · 2023-01-10T04:33:15Z

I appreciate the contribution. Zarr access is something I've wanted to add to Herbie (especially abstracting access to the HRRR Zarr dataset¹²). This is a bit to digest at the moment. When the docs are migrated to RTD and restored, I'll start to look at this in more depth. I certainly think Herbie would be a good home to add access to more types of data that are available.

amotl changed the title ~~Add efficient indexing subsystem~~ Add indexing subsystem Jan 8, 2023

amotl changed the title ~~Add indexing subsystem~~ Proposal to add indexing subsystem Jan 14, 2023

amotl mentioned this issue Jan 14, 2023

[DRAFT] Add indexing subsystem #153

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal to add indexing subsystem #143

Proposal to add indexing subsystem #143

amotl commented Jan 8, 2023 •

edited

Loading

blaylockbk commented Jan 10, 2023

Proposal to add indexing subsystem #143

Proposal to add indexing subsystem #143

Comments

amotl commented Jan 8, 2023 • edited Loading

About

Technologies

Synopsis

Setup

Usage

Create index

Query index

Thoughts

References

Implementation

Footnotes

blaylockbk commented Jan 10, 2023

Footnotes

amotl commented Jan 8, 2023 •

edited

Loading