Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototype of bounded dataarray functionality #737

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pochedls
Copy link
Collaborator

Description

This PR is meant to prototype and explore avenues for enabling xcdat functionality to operate on dataarray objects. The initial PR takes the approach of a) embedding bounds in a "bounded dataarray" (following this xarray exchange), b) converting the bounded dataarray to a dataset to call xcdat functionality (spatial averaging was prototyped).

This approach seems to work fine and is fast (converting between datasets/dataarrays is extremely fast), but it does require some manual re-implementation of functions and documentation. It also is only currently prototyped for rectilinear grids (but maybe a similar approach could be followed for other grid types). Last, it overwrites xr.Dataset.__call__ with a function that converts the dataset to a bounded dataarray, which might be inadvisable (we could avoid this if we needed to, by calling the underlying function directly, e.g., tas = xc.to_bounded_dataarray(ds, "tas")).

Alternate approached could be explored (in this PR or separate PRs):

  • The "bounded datarray" is actually just a dataarray, but maybe we should actually create a new class
  • Re-factor xcdat logic to be able to work with dataset bounds or dataarray bounds (the challenge here is that accessors are specific to dataarrays and datasets)
  • Re-implement all existing logic on dataarrays (rather than flipping back and forth between dataarray/dataset objects) [nonstarter?]
  • Some other creative solution?

Regardless, it might be useful to attend an xarray meeting to discuss what they think makes sense.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@github-actions github-actions bot added the type: enhancement New enhancement request label Feb 12, 2025
@@ -173,7 +173,7 @@ def average(
Using custom weights for averaging:

>>> # The shape of the weights must align with the data var.
>>> self.weights = xr.DataArray(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation update is more generic than this PR and should probably be implemented separately...

Comment on lines +859 to +870
da = self._dataarray.copy()
ds = boundedDataArray_to_dataset(da)
# get data_var key
data_var = da.name
# pass on call to spatial averager
ds_sa = ds.spatial.average(
data_var,
axis=axis,
weights=weights,
keep_weights=keep_weights,
lat_bounds=lat_bounds,
lon_bounds=lon_bounds,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the template: take your "bounded dataarray", convert it back to a dataset, and then call the existing xcdat API (passing along all the underlying arguments).

A shortfall, beyond replicating most (but not all) documentation, is that if we change the dataset API, we need to make sure to update it here, too. Maybe there is a fancy way to pass in **kwargs that can do this automatically (probably not)?

except: # noqa: E722
continue
# return dataarray with bounds
da = xr.DataArray(ds[key], coords=coords, dims=ds[key].dims, attrs=ds[key].attrs)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pasting the lat_bnds coordinate / dataarray that is embedded in the "bounded dataarray" – I'm wondering if this could take on a nicer form than this random dtype: ('lower', '<f8'), ('upper', '<f8'). I'm guessing if we had other bound types, we could make this more generic, e.g., ('lower_left', '<f8'), ('lower_right', '<f8'), ('upper_right', '<f8', ('upper_left', '<f8')).

da.lat_bnds

<xarray.DataArray 'lat_bnds' (lat: 64)> Size: 1kB
array([(-90. , -86.57774751), (-86.57774751, -83.75702878),
(-83.75702878, -80.95502019), (-80.95502019, -78.15834785),
...
( 78.15834785, 80.95502019), ( 80.95502019, 83.75702878),
( 83.75702878, 86.57774751), ( 86.57774751, 90. )],
dtype=[('lower', '<f8'), ('upper', '<f8')])
Coordinates:

  • lat (lat) float64 512B -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
    lat_bnds (lat) [('lower', '<f8'), ('upper', '<f8')] 1kB (-90.0, -86.5777...

@tomvothecoder
Copy link
Collaborator

Another thing to note is that Xarray does not have a native API for generating bounds. Pitching our bounds generation functionalities might also be a good idea, considering I think they're generic enough (except for time frequency based on CF calendar).

@pochedls
Copy link
Collaborator Author

@tomvothecoder – I asked ChatGPT if there was a better way to approach this problem and it yielded a slightly different approach for including the bounds in the dataarray:

import xarray as xr
import xcdat as xc

@xr.register_dataarray_accessor("cf_bounds")
class CFBoundsAccessor:
    def __init__(self, xarray_obj):
        self._obj = xarray_obj
        # Initialize or load bounds as a dictionary from attributes.
        self._bounds = self._obj.attrs.get("bounds", {})

    @property
    def bounds(self):
        """Return the entire dictionary of bounds."""
        if not self._bounds:
            raise AttributeError("No bounds are attached to this DataArray.")
        return self._bounds

    def set_bounds(self, bounds_dict):
        """
        Set multiple bounds at once.
        
        Parameters
        ----------
        bounds_dict : dict
            A dictionary mapping names (e.g., 'lat_bnds', 'lon_bnds', 'time_bnds')
            to their respective bounds arrays.
        """
        if not isinstance(bounds_dict, dict):
            raise TypeError("bounds_dict must be a dictionary.")
        self._bounds = bounds_dict
        self._obj.attrs["bounds"] = bounds_dict

    def get_bounds(self, key):
        """
        Retrieve the bounds for a given coordinate or bounds name.
        
        Parameters
        ----------
        key : str
            The key corresponding to the bounds you want to retrieve.
        
        Returns
        -------
        The bounds array associated with the given key.
        """
        try:
            return self._bounds[key]
        except KeyError:
            raise KeyError(f"No bounds available for key '{key}'.")

    def add_bound(self, key, bound_array):
        """
        Add or update a single bounds entry.
        
        Parameters
        ----------
        key : str
            The key for the bounds (e.g., 'lat_bnds').
        bound_array : array-like
            The bounds array corresponding to the key.
        """
        self._bounds[key] = bound_array
        self._obj.attrs["bounds"] = self._bounds


ds = xc.open_dataset('tas_Amon_MIROC-ES2L_historical_r2i1p1f2_gn_185001-201412.nc')
tas = ds['tas']
tas.cf_bounds.set_bounds({
    "lat_bnds": ds.lat_bnds,
    "lon_bnds": ds.lon_bnds
})

print(tas.cf_bounds.get_bounds("lat_bnds"))
print(tas.cf_bounds.get_bounds("lon_bnds"))

<xarray.DataArray 'lat_bnds' (lat: 64, bnds: 2)> Size: 1kB
[128 values with dtype=float64]
Coordinates:

  • lat (lat) float64 512B -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
    height float64 8B 2.0
    Dimensions without coordinates: bnds
    <xarray.DataArray 'lon_bnds' (lon: 128, bnds: 2)> Size: 2kB
    [256 values with dtype=float64]
    Coordinates:
  • lon (lon) float64 1kB 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
    height float64 8B 2.0
    Dimensions without coordinates: bnds

I like this approach (generally; I'd change some thing) because it seems to store the bounds in their native dataarray form, but thinking back (5 years ago?) – did you try something like this?

I also asked ChatGPT: once we have a dataarray with bounds included, how can we best leverage our existing functions to perform operations (e.g., spatial averaging). It basically came back with:

  • write functions that can handle either dataset or dataarrays
  • the approach in this PR (convert the dataarray back to a dataset, call dataset accessor functions, return dataarray result)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement New enhancement request
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

[Exploration]: Including dataarrays with our current dataset API model (#671 discussion)
2 participants