Add weight threshold option for spatial averaging #672
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@ Coverage Diff @@
##              main     #672   +/- ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           15       15
  Lines         1621     1645   +24
=========================================
+ Hits         1621     1645   +24
```

☔ View full report in Codecov by Sentry.
xcdat/spatial.py
Outdated
```python
# zero out cells with missing values in data_var
weights = xr.where(~np.isnan(data_var), weights, 0)
# sum all weights (including zero for missing values)
weight_sum_masked = weights.sum(dim=dim)  # type: ignore
```
Needed to add `# type: ignore` for unclear reasons...
I think `dim` is expected to be a string, but we pass a Hashable (dict key). `# type: ignore` is fine here, or set `dim=str(dim)` to remove that comment.
I don't think this should be a string. The loop above creates a list of strings (e.g., `["lat", "lon"]`), which is supposedly okay according to the xarray docs.

From mypy:

```
error: Argument "dim" to "sum" of "DataArrayAggregations" has incompatible type "list[str | list[str]]"; expected "str | Collection[Hashable] | ellipsis | None"  [arg-type]
```

I tried `tuple(dims)`, but that didn't work either. I'll leave this alone for now (but if you know how to fix the mypy complaint, please do!).
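For what it's worth, the error message suggests mypy is inferring a heterogeneous element type (`str | list[str]`) for the accumulator, rather than objecting to lists per se. A minimal sketch of one possible fix (the dict and loop below are hypothetical stand-ins, not the actual `_averager()` code): pinning the accumulator's annotation to `List[str]` yields a valid `Collection[Hashable]`.

```python
from typing import List

import numpy as np
import xarray as xr

# Hypothetical stand-ins for the real weights and axis-to-dimension mapping.
weights = xr.DataArray(np.ones((2, 3)), dims=("lat", "lon"))
axis_to_dim = {"Y": "lat", "X": "lon"}

# Annotating the accumulator pins the element type to str, so mypy infers
# list[str] (a valid Collection[Hashable]) instead of list[str | list[str]].
dims: List[str] = [str(d) for d in axis_to_dim.values()]

weight_sum = weights.sum(dim=dims)  # no "# type: ignore" needed
```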
We can just keep `# type: ignore`. The docstring for `dim` says:

> Name of dimension[s] along which to apply sum. For e.g. dim="x" or dim=["x", "y"]. If “…” or None, will reduce over all dimensions.

Although the type annotation for dim is `Dims = Union[str, Collection[Hashable], "ellipsis", None]`, the example includes a list (`dim=["x", "y"]`).
xcdat/spatial.py
Outdated
```
@@ -716,6 +725,9 @@ def _averager(self, data_var: xr.DataArray, axis: List[SpatialAxis]):
        Data variable inside a Dataset.
    axis : List[SpatialAxis]
        List of axis dimensions to average over.
    required_weight : optional, float
```
Is `required_weight` a good parameter name? And what should the default be? (It is currently zero to keep the default behavior backwards compatible.)
Maybe `required_weight_pct`? `required_weight` sounds like a boolean parameter.

I think 0 is the appropriate default value.
Actually, I like `None` as a default value to indicate the arg is not set by default.
`None` is okay with me (given my comment above). I prefer `required_weight` (since this isn't a percentage, shorter is better, and this doesn't come across as a boolean to me – I think `require_weight` sounds more Boolean-like than `required_weight`).

How about the alternative: `minimum_weight`?
I can see how `required_weight` doesn't sound like a boolean. I like `minimum_weight`; it sounds clearer. How about `min_weight`?
`min_weight` is good with me!
tests/test_spatial.py
Outdated
```python
def test_spatial_average_with_required_weight_as_None(self):
    ds = self.ds.copy()

    result = ds.spatial.average(
        "ts",
        axis=["X", "Y"],
        lat_bounds=(-5.0, 5),
        lon_bounds=(-170, -120.1),
        required_weight=None,
    )

    expected = self.ds.copy()
    expected["ts"] = xr.DataArray(
        data=np.array([2.25, 1.0, 1.0]),
        coords={"time": expected.time},
        dims="time",
    )

    xr.testing.assert_allclose(result, expected)
```
This test is related to the code chunk in `_averager` that deals with the case where `required_weight=None`. I don't know why it would ever be `None` (it is supposed to be `Optional[float] = 0.`, so I don't know how it could be `None`). Without those conditionals, you get an issue (maybe from mypy) about comparing `None` to `int | float` on the lines that check `if required_weight > 0.`.

So if we could ensure that `required_weight` is always of type `float`, we could remove this test and the other conditionals in `_averager`. That would be ideal...
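One way to guarantee that is a small coercion/validation helper at the API boundary. This is only a sketch (the PR later adds a `_validate_min_weight()` to `utils.py`, but its exact body isn't shown in this thread), assuming the arg ends up renamed to `min_weight`:

```python
from typing import Optional


def _validate_min_weight(min_weight: Optional[float]) -> float:
    """Coerce None to 0.0 and validate the range so that downstream
    comparisons like ``min_weight > 0.0`` are always float vs. float."""
    if min_weight is None:
        return 0.0

    if not 0.0 <= min_weight <= 1.0:
        raise ValueError(f"min_weight must be between 0 and 1, got {min_weight}.")

    return min_weight
```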
Using `Optional` indicates that the argument can also be `None`. You can just specify `float` as the type annotation if `None` is not expected.
I like `None` as the default value.
I see. From the docs:

> Note that this is not the same concept as an optional argument, which is one that has a default.

`None` is converted to `0.` later. I guess `None` might feel different to the user than `0.`, so I'll leave this alone.
I'm going to open a separate PR to address this enhancement for the temporal APIs, because those require more work, and we can merge this one earlier when it is ready.
I started PR #683 for temporal operations. I used some of your code and split it up into reusable functions. We can think about making these functions generalizable across the spatial and temporal classes. Check them out here: https://github.com/xCDAT/xcdat/pull/683/files
Review and apply suggestions.
I've created general functions for weight thresholds here. You can copy them over to
xcdat/spatial.py
Outdated
```python
# need weights to match data_var dimensionality
if minimum_weight > 0.0:
    weights, data_var = xr.broadcast(weights, data_var)
```
I think we can remove this line. Per the docs for `xr.broadcast`:

> xarray objects automatically broadcast against each other in arithmetic operations, so this function should not be necessary for normal use.
I think the weights are probably of dimensions `[lat, lon]`, and this casts them to `[time, lat, lon]`. That way, when you sum them below, you get `[time]` (instead of a dimensionless scalar), which is then used to figure out which values do not meet the weight threshold and need to be masked.
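A minimal sketch of that mechanism, with made-up shapes and dimension names (2D weights, 3D data; not the actual xcdat objects), illustrating why the per-time weight fraction needs broadcast weights rather than the bare 2D sum:

```python
import numpy as np
import xarray as xr

data_var = xr.DataArray(np.random.rand(4, 2, 3), dims=("time", "lat", "lon"))
data_var[0, :, :] = np.nan  # a fully missing time step
weights = xr.DataArray(np.ones((2, 3)), dims=("lat", "lon"))

# Broadcast the 2D weights against the 3D data so the masked sum below
# varies along time instead of collapsing to a scalar.
wb, da = xr.broadcast(weights, data_var)
wb = wb.where(~np.isnan(da), 0)  # zero out weights at missing cells

frac = wb.sum(dim=["lat", "lon"]) / weights.sum(dim=["lat", "lon"])
print(frac.values)  # [0. 1. 1. 1.] -> time step 0 fails min_weight=1.0
```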
Xarray should handle the automatic broadcasting when using the `xr.where()` API to mask the data using weights. Otherwise, it would probably break with an error that the dimensions don't align.

According to the docs for `xr.where()`:

> Performs xarray-like broadcasting across input arguments.
The unit tests pass with this line commented out. I also didn't implement `xr.broadcast()` explicitly in PR #683.
Force-pushed from b56befe to 34b570d.
Rebased this branch on latest.
Force-pushed from 7c27e6f to 82cbfe2.
xcdat/spatial.py
Outdated
```python
# TODO: This conditional might not be needed because Xarray will
# automatically broadcast the weights to the data variable for
# operations such as .mean() and .where().
if min_weight > 0.0:
    weights, data_var = xr.broadcast(weights, data_var)
```
Suggested change: delete the TODO comment and the broadcasting conditional above.
Xarray will automatically broadcast to align the shapes of `weights` and `data_var`.
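A quick sketch of that behavior, with placeholder shapes and dimension names (not the actual xcdat objects): both arithmetic and `xr.where()` align and broadcast dims without an explicit `xr.broadcast()` call.

```python
import numpy as np
import xarray as xr

data_var = xr.DataArray(np.random.rand(4, 2, 3), dims=("time", "lat", "lon"))
weights = xr.DataArray(np.ones((2, 3)), dims=("lat", "lon"))

# Both operations broadcast the 2D weights against the 3D data automatically.
weighted = data_var * weights
masked_weights = xr.where(~np.isnan(data_var), weights, 0)

print(weighted.sizes)        # {'time': 4, 'lat': 2, 'lon': 3}
print(masked_weights.sizes)  # {'time': 4, 'lat': 2, 'lon': 3}
```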
FYI: just did a review of the
@tomvothecoder – this feature stops working correctly at this commit. I think the issue arises because this chunk of code is put into separate functions (`_mask_var_with_weight_threshold` and `_get_masked_weights`) in the subsequent commit.

The problem is that the original code operates on the full `[time, lat, lon]` dataarray when it computes the amount of weight if you exclude missing values: `weights = xr.where(~np.isnan(data_var), weights, 0)`. In the subsequent commit, it seems like the logic is updated to operate on the already averaged array (dimensions of `[time]`). It therefore can't accurately determine the amount of weight in the original 3D dataarray (it gives an answer, rather than throwing an error, because of broadcasting).

Should I roll this back to the commit that was working? Or try to refactor from the latest commit? I think part of the issue is that you moved some of the logic to utils (which may work for the temporal version of this code).

For testing, this should yield some `NaN` values:

```python
import xcdat as xc

fn = '/p/user_pub/climate_work/pochedley1/surface/HadCRUT.4.6.0.0.median.nc'
ds = xc.open_dataset(fn)
ts = ds.spatial.average('temperature_anomaly', min_weight=1.0)
```
xcdat/spatial.py
Outdated
```python
def _mask_var_with_weight_threshold(
    self, dv: xr.DataArray, dim: List[str], weights: xr.DataArray, min_weight: float
) -> xr.DataArray:
    """Mask values that do not meet the minimum weight threshold with np.nan.

    This function is useful for cases where the weighting of data might be
    skewed based on the availability of data. For example, if a portion of
    cells in a region has significantly more missing data than other
    regions, it can result in inaccurate calculations of spatial averages.
    Masking values that do not meet the minimum weight threshold ensures
    more accurate calculations.

    Parameters
    ----------
    dv : xr.DataArray
        The weighted variable.
    dim : List[str]
        List of axis dimensions to average over.
    weights : xr.DataArray
        A DataArray containing the regional weights used for weighted
        averaging. ``weights`` must include the same axis dimensions and
        dimensional sizes as the data variable.
    min_weight : float
        Fraction of data coverage (i.e., weight) needed to return a
        spatial average value. Value must range from 0 to 1.

    Returns
    -------
    xr.DataArray
        The variable with the minimum weight threshold applied.
    """
    # Sum all weights, including zero for missing values.
    weight_sum_all = weights.sum(dim=dim)

    masked_weights = _get_masked_weights(dv, weights)
    weight_sum_masked = masked_weights.sum(dim=dim)

    # Get fraction of the available weight.
    frac = weight_sum_masked / weight_sum_all

    # NaN out values that don't meet specified weight threshold.
    dv_new = xr.where(frac >= min_weight, dv, np.nan, keep_attrs=True)
    dv_new.name = dv.name

    return dv_new
```
I see what you're saying. Let's keep the current commit.

I think we just need to update `_mask_var_with_weight_threshold()` with parameters for `dv` (original var) and `dv_mean` (averaged var). All current references to `dv` should be `dv_mean`, while line 810 (`masked_weights = _get_masked_weights(dv, weights)`) should be kept the same to properly calculate weights using `dv`.

I can make this update soon unless you want to give it a shot.
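For concreteness, a sketch of what that update might look like (parameter order and comments are my assumptions from this thread, not the final diff):

```python
def _mask_var_with_weight_threshold(
    self,
    dv: xr.DataArray,       # original (unaveraged) variable, e.g., [time, lat, lon]
    dv_mean: xr.DataArray,  # spatially averaged variable, e.g., [time]
    dim: List[str],
    weights: xr.DataArray,
    min_weight: float,
) -> xr.DataArray:
    # Compute the available-weight fraction from the ORIGINAL variable so
    # missing cells are counted per time step.
    weight_sum_all = weights.sum(dim=dim)
    masked_weights = _get_masked_weights(dv, weights)
    weight_sum_masked = masked_weights.sum(dim=dim)
    frac = weight_sum_masked / weight_sum_all

    # Mask the AVERAGED variable wherever coverage is below the threshold.
    dv_new = xr.where(frac >= min_weight, dv_mean, np.nan, keep_attrs=True)
    dv_new.name = dv_mean.name

    return dv_new
```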
Co-authored-by: Tom Vo <[email protected]>
- Rename arg `minimum_weight` to `min_weight`
- Add `_get_masked_weights()` and `_validate_min_weight()` to `utils.py`
- Update `SpatialAccessor` to use `_get_masked_weights()` and `_validate_min_weight()`
- Replace type annotation `Optional` with `|`
- Extract `_mask_var_with_weight_threshold()` from `_averager()` for readability
- Add `dv_mean` parameter to `_mask_var_with_weight_threshold()` and fix parameter references
- Remove unnecessary broadcasting of variable and weights in `_averager()`
Force-pushed from 5b3e662 to ce058ce.
I wrote a script to sanity check this (it seemed to do what I thought it should):

```python
# %% imports
import sys
sys.path.append('/home/pochedley1/code/xcdat/')
import xcdat as xc
import numpy as np
import xarray as xr

# %% parameters
fn = '/p/user_pub/climate_work/pochedley1/surface/HadCRUT.4.6.0.0.median.nc'
weight_levels = np.arange(0., 1.01, 0.05)
vid = 'temperature_anomaly'
lat_bounds = (-90, 90)

# open dataset
ds = xc.open_dataset(fn)

# get weight matrix
weights = ds.spatial.get_weights(axis=["X", "Y"], lat_bounds=lat_bounds)

# loop over weight thresholds
for wl in weight_levels:
    # get matching dimensionality
    wb, da = xr.broadcast(weights, ds[vid])

    # put time first
    wb = wb.transpose("time", "latitude", "longitude")
    da = da.transpose("time", "latitude", "longitude")

    # zero weights where the underlying dataarray is NaN
    wb = wb.where(~np.isnan(da), 0)

    # compute the fraction of weight available at each time step
    f = wb.sum(dim=['latitude', 'longitude']) / weights.sum(dim=['latitude', 'longitude'])

    # take the spatial average with the new weight thresholding functionality
    ts = ds.spatial.average(vid, min_weight=wl, lat_bounds=lat_bounds)[vid]

    # get indices where there should be data (i.e., more data is available
    # than the fractional threshold)
    I1 = np.where(f >= wl)[0]

    # get indices where the spatial averager returns a non-NaN solution
    I2 = np.where(~np.isnan(ts))[0]

    # ensure the arrays are equal; else raise a ValueError to alert us to a problem
    if np.array_equal(I1, I2):
        print(str(wl) + ' checks out...')
        n = len(ds.time)
        print(str(n - len(I1)) + ' timesteps left out')
        print()
        continue
    else:
        raise ValueError('Mismatch for weight threshold ' + str(wl))
```

I think it would be helpful if @lee1043 or someone else could come up with a different way to test this functionality (and review whether they like the API design). One other plausible test would be to run the spatial averager across a bunch of models and ensure this PR (and the main branch) yield the same answer if
Description
This PR adds an optional argument that requires a minimum fraction of data to be available in order to compute a spatial average. The initial PR is for spatial averaging only (it would need to be expanded to handle temporal averaging).
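For example (a hedged sketch — the filename and variable name below are placeholders, not from this PR):

```python
import xcdat as xc

ds = xc.open_dataset("tas_dataset.nc")  # hypothetical input file

# Return NaN for any time step where less than 50% of the spatial weight
# has valid (non-NaN) data.
ts = ds.spatial.average("tas", axis=["X", "Y"], min_weight=0.5)["tas"]
```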