Which statistics in iris.analysis are lazy? #4039

Closed
stefsmeets opened this issue Feb 26, 2021 · 8 comments · Fixed by #5128

Comments

@stefsmeets

stefsmeets commented Feb 26, 2021

📰 Custom Issue

Hi everyone, we are currently working on a feature to make our multimodel calculations lazy in ESMValTool by depending on iris.analysis to perform the calculations (ESMValGroup/ESMValCore#968). The documentation states that MEAN, STD_DEV and VARIANCE already have lazy implementations via dask.

Looking through the code, I have noticed that iris.analysis.MIN and iris.analysis.MAX also have lazy functions associated with them, but these are not mentioned in the documentation as being lazy. I'm wondering if I'm missing something or if this information is not yet available in the documentation.

We would be very interested to also make some of the other statistics lazy on our side, i.e. MEDIAN and PERCENTILE, which also have implementations available via dask.

@rcomer
Member

rcomer commented Feb 27, 2021

Hi @stefsmeets, thanks for this. I agree we ought to have something in the docstrings that tells us which aggregators are lazy. We have an open issue about doing this for functions generally (#3292), but nobody has got to it yet.

For percentiles, I have an open PR at #3901. I’d welcome any feedback on that as I’m pretty new to dask.

@rcomer
Member

rcomer commented Feb 27, 2021

It looks like dask.array.median uses numpy.median under the hood, so it doesn't respect masks:

import numpy.ma as ma
import dask.array as da

arr = ma.array(range(4), mask=[0,0,0,1])
print(ma.median(arr))

larr = da.from_array(arr)
print(da.median(larr, axis=0).compute())

Output:

1.0
1.5

So I think we would need something extra to make lazy median consistent with our existing median aggregator.
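The mismatch can also be reproduced with NumPy alone, which may help separate the masking question from anything dask-specific. This is only a minimal sketch: the variable names are illustrative, and it takes the median of `arr.data` explicitly rather than relying on how `numpy.median` treats a masked input.

```python
import numpy as np
import numpy.ma as ma

arr = ma.array(range(4), mask=[0, 0, 0, 1])

# Median over the unmasked points only: 0, 1, 2.
masked_median = float(ma.median(arr))

# Median over the raw data, ignoring the mask entirely: 0, 1, 2, 3.
unmasked_median = float(np.median(arr.data))

print(masked_median, unmasked_median)  # 1.0 1.5
```

The second value matches what the lazy call returned above, which is consistent with the mask being dropped somewhere along the way.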

@stefsmeets
Author

Hi @rcomer, I just noticed that a nanmedian function exists in dask. Would this be a way to make the median operation lazy in iris?

@rcomer
Member

rcomer commented Mar 11, 2021

Hi @stefsmeets, yes it looks like that should work in principle. Something like:

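import numpy as np
import numpy.ma as ma
import dask.array as da

def lazy_median(array, axis):
    # Cast to float so masked points can be represented as NaN.
    # (np.float_ was removed in NumPy 2.0; np.float64 is the same type.)
    array = array.astype(np.float64)
    # Replace masked points with NaN, then take the NaN-ignoring median.
    nan_array = da.ma.filled(array, np.nan)
    median = da.nanmedian(nan_array, axis=axis)
    # Re-mask any NaN results, e.g. where every input point was masked.
    return da.ma.fix_invalid(median)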

arr = ma.array(range(4), mask=[0,0,0,1])
print(ma.median(arr))

larr = da.from_array(arr)
print(lazy_median(larr, axis=0).compute())

Output:

1.0
1.0

Though I am very much not an expert. Maybe @pp-mo has thoughts on this.
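One case worth checking with this approach is a slice where every point is masked, since the NaN-ignoring median then yields NaN and the final `fix_invalid` step is what turns it back into a masked point. The following is an eager NumPy analogue of the idea, not the dask version itself; the helper name `eager_nan_median` and the test array are hypothetical, chosen only for illustration.

```python
import warnings

import numpy as np
import numpy.ma as ma


def eager_nan_median(array, axis):
    """Eager NumPy analogue of the fill-with-NaN median (illustrative only)."""
    # Masked points become NaN so that nanmedian skips them.
    filled = ma.filled(array.astype(np.float64), np.nan)
    with warnings.catch_warnings():
        # An all-NaN slice triggers a RuntimeWarning; silence it here.
        warnings.simplefilter("ignore", RuntimeWarning)
        median = np.nanmedian(filled, axis=axis)
    # NaN results (fully masked slices) are masked again.
    return ma.fix_invalid(median)


# Column 0 has two valid points (0 and 2); column 1 is fully masked.
arr = ma.array(
    [[0, 1], [2, 3], [4, 5]],
    mask=[[0, 1], [0, 1], [1, 1]],
)

result = eager_nan_median(arr, axis=0)
expected = ma.median(arr, axis=0)
print(result)
print(expected)
```

If the two printed results agree, including where the mask falls, that suggests the same pattern is a reasonable starting point for the lazy version; behaviour on other dtypes and fill values would still need testing.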

@trexfeathers
Contributor

Related: #3292

@bouweandela
Member

@fnattino: This issue could be interesting for you

@rcomer
Member

rcomer commented Nov 23, 2022

#5066 updated all the aggregator docstrings to indicate which are lazy. PERCENTILE has been lazy since v3.3. Is there still appetite to work on MEDIAN? Obviously you can just use the 50th percentile but perhaps there is an advantage to the separate median function.
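As a quick illustration of the 50th-percentile point, the sketch below checks the equivalence in plain NumPy with its default linear interpolation. It does not exercise iris.analysis.PERCENTILE itself, which may use different interpolation settings or mask handling; the random test data is an assumption made purely for the example.

```python
import numpy as np

# Arbitrary unmasked test data (illustrative only).
rng = np.random.default_rng(0)
data = rng.normal(size=(5, 4))

median = np.median(data, axis=0)
p50 = np.percentile(data, 50, axis=0)

print(np.allclose(median, p50))  # True
```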

@pp-mo
Member

pp-mo commented Nov 23, 2022

@SciTools/peloton let's wait until 2023-01 to see if anyone really wants MEDIAN to be lazy.
If not, add a doc note to say "it's not lazy, but percentile is".


5 participants