Add weight threshold option for spatial averaging #672

Open · pochedls wants to merge 11 commits into main from feature/531_min_weight_for_average
Conversation


@pochedls pochedls commented Jun 28, 2024

Description

This PR adds an optional argument that requires a minimum fraction of data to be available in order to perform a spatial average. This initial PR covers spatial averaging only (it would need to be expanded to handle temporal averaging).
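A minimal usage sketch (the dataset path and variable name are placeholders; the argument is introduced as required_weight in this PR and renamed to min_weight later in the thread):

import xcdat as xc

ds = xc.open_dataset("tas.nc")

# Only compute a spatial average for time steps where at least 50% of
# the total area weight has non-missing data; otherwise return NaN.
ts = ds.spatial.average("tas", axis=["X", "Y"], min_weight=0.5)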

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@github-actions github-actions bot added the type: enhancement New enhancement request label Jun 28, 2024

codecov bot commented Jun 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (b260fcb) to head (7033584).

@@            Coverage Diff            @@
##              main      #672   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           15        15           
  Lines         1621      1645   +24     
=========================================
+ Hits          1621      1645   +24     


xcdat/spatial.py Outdated
# zero out cells with missing values in data_var
weights = xr.where(~np.isnan(data_var), weights, 0)
# sum all weights (including zero for missing values)
weight_sum_masked = weights.sum(dim=dim) # type: ignore
Collaborator Author

Needed to add # type: ignore for unclear reasons...

Collaborator
@tomvothecoder tomvothecoder Jul 30, 2024

I think dim is expected to be a string but we pass a Hashable (dict key).

# type: ignore is fine here or set dim=str(dim) to remove that comment

Collaborator Author

I don't think this should be a string. The loop above creates a list of strings (e.g., ["lat", "lon"]), which is supposedly okay according to xarray docs.

From mypy:

error: Argument "dim" to "sum" of "DataArrayAggregations" has incompatible type "list[str | list[str]]"; expected "str | Collection[Hashable] | ellipsis | None" [arg-type]

I tried tuple(dims), but that didn't work either. I'll leave this alone for now (but if you know how to fix the mypy complaint, please do!).

Collaborator
@tomvothecoder tomvothecoder Aug 27, 2024

We can just keep # type: ignore. The docstring for dim says:

Name of dimension[s] along which to apply sum. For e.g. dim="x" or dim=["x", "y"]. If “…” or None, will reduce over all dimensions.

Although the type annotation for dim is Dims = Union[str, Collection[Hashable], "ellipsis", None], the example includes a list (dim=["x", "y"]).
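A standalone toy snippet illustrating the runtime behavior (not xcdat's internals): summing over a list of dimension names works fine, which matches the docstring; the mypy complaint comes from the list built in spatial.py being inferred as list[str | list[str]].

import numpy as np
import xarray as xr

weights = xr.DataArray(np.ones((2, 3)), dims=["lat", "lon"])

# A list of dimension names is accepted at runtime.
total = weights.sum(dim=["lat", "lon"])
print(float(total))  # 6.0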

xcdat/spatial.py Outdated
@@ -716,6 +725,9 @@ def _averager(self, data_var: xr.DataArray, axis: List[SpatialAxis]):
Data variable inside a Dataset.
axis : List[SpatialAxis]
List of axis dimensions to average over.
required_weight : optional, float
Collaborator Author

Is required_weight a good parameter name? What should the default be? (It is currently zero to keep the default behavior backwards compatible.)

Collaborator

Maybe required_weight_pct? required_weight sounds like a boolean parameter.

I think 0 is the appropriate default value.

Collaborator

Actually I like None as a default value to indicate the arg is not set by default.

Collaborator Author

None is okay with me (given my comment above). I prefer required_weight (since this isn't a percentage, shorter is better, and this doesn't come across as a boolean to me – I think require_weight sounds more Boolean-like than required_weight).

How about the alternative: minimum_weight?

Collaborator

I can see how required_weight doesn't sound like a boolean. I like minimum_weight, it sounds clearer. How about min_weight?

Collaborator Author

min_weight is good with me!

Comment on lines 291 to 309
def test_spatial_average_with_required_weight_as_None(self):
    ds = self.ds.copy()

    result = ds.spatial.average(
        "ts",
        axis=["X", "Y"],
        lat_bounds=(-5.0, 5),
        lon_bounds=(-170, -120.1),
        required_weight=None,
    )

    expected = self.ds.copy()
    expected["ts"] = xr.DataArray(
        data=np.array([2.25, 1.0, 1.0]),
        coords={"time": expected.time},
        dims="time",
    )

    xr.testing.assert_allclose(result, expected)
Collaborator Author

This test is related to the code chunk in _averager that deals with the case where required_weight=None. I don't know why it would ever be None (it is typed as Optional[float] = 0., so I don't see how it could be None). Without those conditionals, you get an error (from mypy, I believe) about comparing None to int | float on the lines that check required_weight > 0.

So if we could ensure that required_weight is always of type float, we could remove this test and the other conditionals in _averager. That would be ideal...

Collaborator

Using Optional indicates that the argument can also be None. You can just specify float as the type annotation if None is not expected.
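For reference, the two annotation styles under discussion (hypothetical signatures, not xcdat's actual API):

# None is an allowed value (caller may leave the threshold unset).
def average_a(min_weight: float | None = None) -> None: ...

# None is not allowed (threshold is always a float, defaulting to 0.0).
def average_b(min_weight: float = 0.0) -> None: ...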

Collaborator

I like None as the default value

Collaborator Author

I see. From the docs:

Note that this is not the same concept as an optional argument, which is one that has a default.

None is converted to 0. later. I guess None might feel different to the user than 0., so I'll leave this alone.
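A sketch of what that None-to-0.0 coercion might look like (the actual _validate_min_weight() that lands in utils.py later in this thread may differ):

def _validate_min_weight(min_weight: float | None) -> float:
    """Coerce None to 0.0 and validate that the threshold is in [0, 1]."""
    if min_weight is None:
        return 0.0

    if not 0.0 <= min_weight <= 1.0:
        raise ValueError(
            f"min_weight must be between 0 and 1, got {min_weight}."
        )

    return min_weight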

@tomvothecoder (Collaborator)

I'm going to open a separate PR to address this enhancement for the temporal APIs, because those require more work; that way we can merge this one earlier, once it is ready.

@tomvothecoder tomvothecoder changed the title Set (optional) weight threshold for averaging operations Add weight threshold option for spatial averaging Jul 30, 2024
@tomvothecoder (Collaborator)

I started PR #683 for temporal operations. I used some of your code and split it up into reusable functions. We can think about making these functions generalizable across the spatial and temporal classes.

Check them out here: https://github.com/xCDAT/xcdat/pull/683/files

Collaborator Author
@pochedls pochedls left a comment

Review and apply suggestions.


@tomvothecoder (Collaborator)

I've created general functions for weight thresholds here. You can copy them over to utils.py in your branch and then use them in the SpatialAccessor class as needed.
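Based on the masking logic shown earlier in this thread, _get_masked_weights() is presumably along these lines (the exact version in #683 may differ):

import numpy as np
import xarray as xr

def _get_masked_weights(dv: xr.DataArray, weights: xr.DataArray) -> xr.DataArray:
    """Zero out weights wherever the data variable has missing values."""
    return xr.where(~np.isnan(dv), weights, 0)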

xcdat/spatial.py Outdated
Comment on lines 760 to 761
# need weights to match data_var dimensionality
if minimum_weight > 0.0:
    weights, data_var = xr.broadcast(weights, data_var)
Collaborator
@tomvothecoder tomvothecoder Aug 28, 2024

I think we can remove this line.

Per docs for xr.broadcast:

xarray objects automatically broadcast against each other in arithmetic operations, so this function should not be necessary for normal use.

Collaborator Author

I think the weights are probably of dimensions [lat, lon] and this casts them to [time, lat, lon]. That way, when you sum them below, you get [time] (instead of a dimensionless scalar), which is then used to figure out which values do not meet the weight threshold and need to be masked.
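A toy illustration of that dimensionality point (shapes made up for the example):

import numpy as np
import xarray as xr

data = xr.DataArray(np.ones((3, 2, 2)), dims=["time", "lat", "lon"])
weights = xr.DataArray(np.ones((2, 2)), dims=["lat", "lon"])

# Summing the 2D weights over ["lat", "lon"] collapses to a 0-d scalar.
print(weights.sum(dim=["lat", "lon"]).shape)  # ()

# After broadcasting against the 3D data, the same sum keeps "time",
# giving one weight total per time step to compare to the threshold.
wb, _ = xr.broadcast(weights, data)
print(wb.sum(dim=["lat", "lon"]).shape)  # (3,)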

Collaborator

Xarray should be handling the automatic broadcasting when using the xr.where() API to mask the data using weights. Otherwise, it would probably break with an error that the dimensions don't align.

According to the docs for xr.where():

Performs xarray-like broadcasting across input arguments.

Collaborator

The unit tests pass with this line commented out. I also didn't implement xr.broadcast() explicitly in PR #683.

@tomvothecoder tomvothecoder force-pushed the feature/531_min_weight_for_average branch from b56befe to 34b570d Compare September 5, 2024 17:47

tomvothecoder commented Sep 5, 2024

Rebased this branch on latest main and pushed commit 34b570d (#672) with code review updates (refer to commit message for list of changes).

Collaborator
@tomvothecoder tomvothecoder left a comment

@pochedls Your PR is ready for review again.

I actually think it is okay to keep this PR and #683 as separate PRs. I've ported the utilities over from #683 and updated spatial.py to use them.

xcdat/spatial.py Outdated
Comment on lines 757 to 761
# TODO: This conditional might not be needed because Xarray will
# automatically broadcast the weights to the data variable for
# operations such as .mean() and .where().
if min_weight > 0.0:
    weights, data_var = xr.broadcast(weights, data_var)
Collaborator

Suggested change
# TODO: This conditional might not be needed because Xarray will
# automatically broadcast the weights to the data variable for
# operations such as .mean() and .where().
if min_weight > 0.0:
    weights, data_var = xr.broadcast(weights, data_var)

Xarray will automatically broadcast to align the shapes of weights and data_var.

@tomvothecoder (Collaborator)

FYI: I just did a review of utils.py and it is the same as in PR #683. We can proceed with testing this PR with real datasets. I'll do another review of the code.

@pochedls (Collaborator Author)

@tomvothecoder – this feature stops working correctly at this commit.

I think the issue arises because this chunk of code is split into separate functions (_mask_var_with_weight_threshold and _get_masked_weights) in the subsequent commit.

The problem is that the original code operates on the full [time, lat, lon] dataarray when it computes the amount of weight available after excluding missing values: weights = xr.where(~np.isnan(data_var), weights, 0). In the subsequent commit, the logic is updated to operate on the already-averaged array (dimensions of [time]). It therefore can't accurately determine the amount of weight in the original 3D dataarray (it gives an answer, rather than throwing an error, because of broadcasting).

Should I roll this back to the commit that was working? Or try to refactor from the latest commit? I think part of the issue is that you moved some of the logic to utils (which may work for the temporal version of this code).

For testing, this should yield some NaN values:

import xcdat as xc
fn = '/p/user_pub/climate_work/pochedley1/surface/HadCRUT.4.6.0.0.median.nc'
ds = xc.open_dataset(fn)
ts = ds.spatial.average('temperature_anomaly', min_weight=1.0)

xcdat/spatial.py Outdated
Comment on lines 776 to 821
def _mask_var_with_weight_threshold(
    self, dv: xr.DataArray, dim: List[str], weights: xr.DataArray, min_weight: float
) -> xr.DataArray:
    """Mask values that do not meet the minimum weight threshold with np.nan.

    This function is useful for cases where the weighting of data might be
    skewed based on the availability of data. For example, if a portion of
    cells in a region has significantly more missing data than other
    regions, it can result in inaccurate spatial averaging calculations.
    Masking values that do not meet the minimum weight threshold ensures
    more accurate calculations.

    Parameters
    ----------
    dv : xr.DataArray
        The weighted variable.
    dim : List[str]
        List of axis dimensions to average over.
    weights : xr.DataArray
        A DataArray containing the regional weights used for weighted
        averaging. ``weights`` must include the same axis dimensions and
        dimensional sizes as the data variable.
    min_weight : float
        Fraction of data coverage (i.e., weight) needed to return a
        spatial average value. Value must range from 0 to 1.

    Returns
    -------
    xr.DataArray
        The variable with the minimum weight threshold applied.
    """
    # Sum all weights, including zero for missing values.
    weight_sum_all = weights.sum(dim=dim)

    masked_weights = _get_masked_weights(dv, weights)
    weight_sum_masked = masked_weights.sum(dim=dim)

    # Get fraction of the available weight.
    frac = weight_sum_masked / weight_sum_all

    # Nan out values that don't meet specified weight threshold.
    dv_new = xr.where(frac >= min_weight, dv, np.nan, keep_attrs=True)
    dv_new.name = dv.name

    return dv_new
Collaborator
@tomvothecoder tomvothecoder Feb 24, 2025

(Quoting @pochedls's comment above about the feature breaking at this commit, including the HadCRUT test snippet.)

I see what you're saying. Let's keep the current commit.

I think we just need to update _mask_var_with_weight_threshold() with parameters for dv (original var) and dv_mean (averaged var). All current references to dv should be dv_mean, while line 810 (masked_weights = _get_masked_weights(dv, weights)) should be kept the same to properly calculate weights using dv.

I can make this update soon unless you want to give it a shot.
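Roughly, the fix described above, as a sketch (parameter order and details may differ from the actual commit):

def _mask_var_with_weight_threshold(
    self,
    dv: xr.DataArray,       # original, unaveraged variable, e.g. [time, lat, lon]
    dv_mean: xr.DataArray,  # averaged variable, e.g. [time]
    dim: List[str],
    weights: xr.DataArray,
    min_weight: float,
) -> xr.DataArray:
    # Sum all weights, including zero for missing values.
    weight_sum_all = weights.sum(dim=dim)

    # Mask the weights using the *original* variable so the available
    # weight fraction reflects missing values in the full 3D array.
    masked_weights = _get_masked_weights(dv, weights)
    weight_sum_masked = masked_weights.sum(dim=dim)

    # Fraction of the available weight at each remaining coordinate.
    frac = weight_sum_masked / weight_sum_all

    # NaN out averaged values that don't meet the weight threshold.
    dv_new = xr.where(frac >= min_weight, dv_mean, np.nan, keep_attrs=True)
    dv_new.name = dv_mean.name

    return dv_new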

- Rename arg `minimum_weight` to `min_weight`
- Add `_get_masked_weights()` and `_validate_min_weight()` to `utils.py`
- Update `SpatialAccessor` to use `_get_masked_weights()` and `_validate_min_weight()`
- Replace type annotation `Optional` with `|`
- Extract `_mask_var_with_weight_threshold()` from `_averager()` for readability
- Add `dv_mean` parameter to `_mask_var_with_weight_threshold()` and fix parameter references
- Remove unnecessary broadcasting of variable and weights in `_averager()`
@tomvothecoder tomvothecoder force-pushed the feature/531_min_weight_for_average branch from 5b3e662 to ce058ce Compare February 26, 2025 17:40
Collaborator
@tomvothecoder tomvothecoder left a comment

@pochedls This commit should fix the issue mentioned here. I did not test yet.


pochedls commented Feb 26, 2025

I wrote a script to sanity check this (it seemed to do what I thought it should):

# %% imports
import sys
sys.path.append('/home/pochedley1/code/xcdat/')
import xcdat as xc
import numpy as np
import xarray as xr

# %% parameters
fn = '/p/user_pub/climate_work/pochedley1/surface/HadCRUT.4.6.0.0.median.nc'
weight_levels = np.arange(0., 1.01, 0.05)
vid = 'temperature_anomaly'
lat_bounds = (-90, 90)

# open dataset
ds = xc.open_dataset(fn)
# get weight matrix
weights = ds.spatial.get_weights(axis=["X", "Y"], lat_bounds=lat_bounds)
# loop over weight thresholds
for wl in weight_levels:
    # get matching dimensionality
    wb, da = xr.broadcast(weights, ds[vid])
    # put time first
    wb = wb.transpose("time", "latitude", "longitude")
    da = da.transpose("time", "latitude", "longitude")
    # zero weights where the underlying dataarray is NaN
    wb = wb.where(~np.isnan(da), 0)
    # compute the fraction of weight available at each time step
    f = wb.sum(dim=['latitude', 'longitude']) / weights.sum(dim=['latitude', 'longitude'])
    # take the spatial average with the new weight thresholding functionality
    ts = ds.spatial.average(vid, min_weight=wl, lat_bounds=lat_bounds)[vid]
    # get indices where there should be data (i.e., more data is available than the fractional threshold)
    I1 = np.where(f >= wl)[0]
    # get indices where the spatial averager returns a non-nan solution
    I2 = np.where(~np.isnan(ts))[0]
    # ensure arrays are equal
    # else throw a valueerror to alert us to a problem
    if np.array_equal(I1, I2):
        print(str(wl) + ' checks out...')
        n = len(ds.time)
        print(str(n - len(I1)) + ' timesteps left out')
        print()
        continue
    else:
        raise ValueError('Mismatch for weight threshold ' + str(wl))

I think it would be helpful if @lee1043 or someone else could come up with a different way to test this functionality (and review whether they like the API design). One other plausible test would be to run the spatial averager across a bunch of models and ensure this PR and the main branch yield the same answer when min_weight=0.0.
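One in-environment proxy for that consistency check (file paths and the variable name are placeholders; a true branch-vs-main comparison would require two environments):

import numpy as np
import xcdat as xc

files = ["model1.nc", "model2.nc"]  # placeholder model output paths

for fn in files:
    ds = xc.open_dataset(fn)
    # min_weight=0.0 should reproduce the default (unthresholded) behavior.
    ts_thresh = ds.spatial.average("tas", min_weight=0.0)["tas"]
    ts_default = ds.spatial.average("tas")["tas"]
    np.testing.assert_allclose(ts_thresh.values, ts_default.values)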

Successfully merging this pull request may close these issues:

[Enhancement]: Add weight threshold option for averaging operations