
Add option to standardize to anomalies preprocessor function #300

Merged
merged 36 commits into from
Feb 21, 2020

Conversation

bascrezee
Contributor

@bascrezee bascrezee commented Oct 8, 2019

I added a new preprocessor function, making use of two other existing preprocessor functions. This is the description I provided in the docstring:

This function standardizes the input data. It calculates the anomalies
(x - mean(x)) and divides them by the standard deviation, where both the
anomalies and the standard deviation are calculated over the specified period.

As far as I could see, it is not possible to achieve this behaviour by combining existing preprocessor functions in a recipe, so I added it as a new preprocessor function.
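As a rough illustration of the computation (a NumPy-only sketch; the actual preprocessor operates on Iris cubes and can also compute the statistics per period, e.g. per month):

```python
import numpy as np

def standardized_anomalies(data, axis=0):
    """Minimal sketch of (x - mean(x)) / std(x) along one axis.

    The real preprocessor works on Iris cubes and supports per-period
    grouping; this NumPy version only illustrates the formula.
    """
    mean = data.mean(axis=axis, keepdims=True)
    std = data.std(axis=axis, keepdims=True)
    return (data - mean) / std

# A tiny example: 4 time steps at 2 grid points.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0],
                 [4.0, 40.0]])
z = standardized_anomalies(data)
# The result has zero mean and unit standard deviation along the time axis.
```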

Tasks

  • Add option to standardize to anomalies preprocessor function #299
  • Add unit tests
  • Public functions should have a numpy-style docstring so they appear properly in the API documentation
  • If writing a new/modified preprocessor function, please update the documentation
  • CircleCI tests pass. Status can be seen below your pull request. If the tests are failing, click the link to find out why.
  • Codacy code quality checks pass. Status can be seen below your pull request. If there is an error, click the link to find out why. If you suspect Codacy may be wrong, please ask by commenting.
  • If you make backward incompatible changes to the recipe format, make a new pull request in the ESMValTool repository and add the link below --> there are no backward incompatible changes, as far as I can see

Closes #299

@bascrezee
Contributor Author

Can I have a quick review (by @jvegasbsc or @valeriupredoi or ... ?) before proceeding with writing a unit test and adding it to the documentation?

@jvegreg
Contributor

jvegreg commented Oct 8, 2019

Looks nice; I only think it would be better to have it as an option in the anomalies preprocessor.

Contributor

@valeriupredoi valeriupredoi left a comment


Nice one! A couple of minor comments. Also, it is a bit confusing to me (as a non-scientist) why monthly, month, and mon are all accepted -> maybe detail this in the function docstring; it is definitely needed in the documentation.

@bascrezee
Contributor Author

Nice one! A couple of minor comments. Also, it is a bit confusing to me (as a non-scientist) why monthly, month, and mon are all accepted -> maybe detail this in the function docstring; it is definitely needed in the documentation.

This is equally confusing to me, my function inherits this from the other functions. Maybe @jvegasbsc can explain this?

@bascrezee
Contributor Author

Looks nice; I only think it would be better to have it as an option in the anomalies preprocessor.

Yes, I agree. That would be an option as well. What do others think?

@mattiarighi
Contributor

I also prefer extending existing preprocessors rather than creating new ones.

@jvegreg
Contributor

jvegreg commented Oct 8, 2019

This is equally confusing to me, my function inherits this from the other functions. Maybe @jvegasbsc can explain this?

To support all the common abbreviations used for monthly data.

@mattiarighi mattiarighi added the preprocessor Related to the preprocessor label Oct 11, 2019
@bascrezee bascrezee changed the title Added new preprocessor standardize Add option to standardize to anomalies preprocessor function Oct 22, 2019
@bascrezee
Contributor Author

@jvegasbsc I just included it as an option in the existing anomaly function. Could you test it?
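Once the option is merged, a recipe could enable it roughly like this (a hypothetical snippet; the preprocessor name `standardized_anomalies` is illustrative, and the option name follows the function argument discussed in this thread):

```yaml
preprocessors:
  standardized_anomalies:
    anomalies:
      period: month
      standardize: true
```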

@bascrezee bascrezee marked this pull request as ready for review October 24, 2019 07:45
@bascrezee
Contributor Author

I added a unit test for only one case for the moment, as a proof of concept. I look forward to some feedback, maybe from @bouweandela or @jvegasbsc? The strategy I took is to calculate the expected outcome using numpy functions. Unfortunately this will get rather complicated for the other cases. I also noted that there actually IS quite a difference in the outcome (see the tolerances that I set). I have not yet found out why these differences occur; it is not related to the weighting along the time dimension, I checked that.

I do think that putting more functionality into one function (as suggested by @mattiarighi and @jvegasbsc above) generally makes unit testing much more complex, since the number of cases to cover equals the product of the possible input arguments. Probably something to keep in mind for the future.

@jvegreg
Contributor

jvegreg commented Dec 2, 2019

I added a unit test for only one case, at the moment, as a proof of concept. I look forward to some feedback, maybe from @bouweandela or @jvegasbsc ?

I modified it to simplify it a bit and corrected a couple of flake8 issues. Anyway, my suggestion would be to create a new method for testing the standardized case. I think it would also be a good idea to generate a new data cube that makes it easier to see at a glance what the result should be.

@bouweandela
Member

since the options that need to be covered equal the product between input arguments

Indeed, this is only feasible if the number of possible input values is small. Parametrizing with pytest helps a bit, but not when the list of options/possible values grows very large. Usually people try to test at the very least every code path: have a look at the coverage report in test-reports/coverage_html/index.html and make sure there is no code that is not executed during a test (marked red).
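To make the combinatorics concrete: with `pytest.mark.parametrize` one typically enumerates the cross product of option values, so the case count multiplies with every new argument. A stdlib-only sketch (the option values below are illustrative, not the exact set accepted by the preprocessor):

```python
import itertools

# Illustrative option values; the real preprocessor accepts more aliases.
periods = ["full", "season", "month", "day"]
standardize = [True, False]

# Every combination would need a test case.
cases = list(itertools.product(periods, standardize))
print(len(cases))  # 4 periods x 2 flags -> 8 test cases
```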

@bouweandela bouweandela changed the base branch from development to master January 3, 2020 12:18
@bascrezee
Contributor Author

bascrezee commented Feb 12, 2020

Just a small update (also as a note to self). I got time again to work on this :) I did a first implementation solely for the case period='full'. Interestingly, the differences between the numpy result and the esmvalcore preprocessor result are pretty large (~0.0012 absolute difference). I checked, and the difference cannot be explained by the weighting of the time axis (since all weights are 1 for this test case). I am now trying to track this down further. The difference could be related to the ddof argument, which can be passed to the Iris aggregator function.

@valeriupredoi
Contributor

There is no weighting, since the standard deviation aggregator does not support weights; it seems to me that the difference comes from np.std() vs cube.collapsed('time', iris.analysis.STD_DEV), no? 🍺

@bascrezee
Contributor Author

bascrezee commented Feb 12, 2020

There is no weighting, since the standard deviation aggregator does not support weights; it seems to me that the difference comes from np.std() vs cube.collapsed('time', iris.analysis.STD_DEV), no?

By default, Iris uses ddof=1 in the STD_DEV aggregator, whereas numpy uses ddof=0. After passing this as an argument in the tests, assert_allclose can hopefully be replaced by assert_array_equal. 😄
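The mismatch is easy to reproduce with the standard library alone: `statistics.pstdev` is the population standard deviation (ddof=0, numpy's default), while `statistics.stdev` is the sample standard deviation (ddof=1, the default of Iris's STD_DEV aggregator):

```python
import statistics

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop = statistics.pstdev(sample)   # divides by n     (ddof=0), like np.std
samp = statistics.stdev(sample)   # divides by n - 1 (ddof=1), like STD_DEV

print(pop)   # 2.0
print(samp)  # ~2.138; always >= pop, and the gap shrinks as n grows
```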

@valeriupredoi
Contributor

valeriupredoi commented Feb 12, 2020

here try this with a say, temperature cube:

import iris
import iris.analysis
import numpy as np

c = iris.load_cube("t1.nc")
# shape (1800, 96, 192)
c1 = c.collapsed('time', iris.analysis.STD_DEV)  # ddof=1 by default
c2 = np.std(c.data, axis=0)                      # ddof=0 by default
print(np.mean(c1.data - c2))

I get a mean delta of ~10^-3, which is consistent with what you noticed.

Member

@bouweandela bouweandela left a comment


Looks good to me! Just a few minor suggestions on formatting

Member

@bouweandela bouweandela left a comment


Looks good to me. @mattiarighi Could you please test?

@bascrezee
Contributor Author

Can this please be merged? @mattiarighi ?

@mattiarighi mattiarighi merged commit e2404b7 into master Feb 21, 2020
@mattiarighi mattiarighi deleted the add_standardize_preproc branch February 21, 2020 10:52