Handle fill value on netCDF save #2747
Conversation
lib/iris/fileformats/netcdf.py
Outdated
@@ -902,11 +919,16 @@ def write(self, cube, local_keys=None, unlimited_dimensions=None,
    on `cube.data` and possible masking. For masked data, `fill_value`
    is taken from netCDF4.default_fillvals. For more control, pass a
I think this needs a tweak as it seems to say that if 'packing' is used then 'fill_value' would be ignored.
Whereas, the new note for 'fill_value' itself implies that it will be used for packing, if given, which I think is correct.
lib/iris/fileformats/netcdf.py
Outdated
@@ -1858,54 +1881,61 @@ def _create_cf_data_variable(self, cube, dimension_names, local_keys=None,
    An iterable of cube attribute keys. Any cube attributes
    with matching keys will become attributes on the data variable.

* packing (type or string or dict or list): A numpy integer datatype
It seems a bit fragile to repeat all this between `__init__` and `_create_cf_data_variable`.
Especially as this is not a public docstring for Sphinx, I think it would make more sense to cross-refer to the other here. How about just "'packing' and 'fill_value' controls are applied to the data, as described in '__init__'."
I've pushed changes so that the _FillValue attribute is removed from the netCDF variable if the data wasn't masked. Some things to note:
@@ -158,12 +158,13 @@ def test_big_endian(self):
def test_zlib(self):
    cube = self._simple_cube('>f4')
    with mock.patch('iris.fileformats.netcdf.netCDF4') as api:
        api.default_fillvals = {'f4': 12345.}
I've had to modify this test because the `createVariable` call now passes the default fill value for the data type. Since the `netCDF4` module is mocked out, it turns out that the value passed for the `fill_value` argument is a numpy array. I'm guessing that this doesn't play well with `mock.ANY` (it probably tries to do an equality test somewhere along the line, which causes numpy to raise an exception), because the test fails unless I make sure there is an actual value passed by assigning this dict.
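A minimal sketch of the likely failure mode (an assumption, not verified against mock's internals): elementwise comparison of numpy arrays yields a boolean array, and truth-testing that array raises, which is plausibly what breaks mock's equality-based call matching.

```python
import numpy as np

# Elementwise comparison returns an array of booleans, not a scalar:
result = np.array([1.0, 2.0]) == np.array([1.0, 3.0])

# Truth-testing a multi-element boolean array raises ValueError --
# any code path that does `if call == expected:` can hit this.
raised = False
try:
    bool(result)
except ValueError:
    raised = True
```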
with Saver('/dummy/path', 'NETCDF4') as saver:
    saver.write(cube, zlib=True)
dataset = api.Dataset.return_value
create_var_calls = mock.call.createVariable(
    'air_pressure_anomaly', np.dtype('float32'), ['dim0', 'dim1'],
-   fill_value=None, shuffle=True, least_significant_digit=None,
+   fill_value=mock.ANY, shuffle=True, least_significant_digit=None,
Unfortunately I've had to make this test less strict. I could check that there is a `delncattr` call to remove the `_FillValue` ...
Some initial thoughts on the behaviour, following your comments :
I think both those points differ from the conclusions of yesterday's discussion with @pelson So, for what it's worth, I think that outcome says ...
My own personal inclination would be to add a warning when saving masked int/uint32 as well...
The actual 'missing' values, from
So actually there are two values missing from the signed type ranges.
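For context, netCDF's default fill values for signed integer types sit one above the type minimum (NC_FILL_BYTE is -127, NC_FILL_SHORT is -32767, NC_FILL_INT is -2147483647), which is one reading of why two values drop out of each signed range: the fill value itself, plus the minimum just below it. A quick sketch:

```python
import numpy as np

# netCDF default fill values for signed ints are the dtype minimum + 1,
# so both the minimum and minimum+1 are effectively reserved:
defaults = {code: np.iinfo(dtype).min + 1
            for code, dtype in (('i1', np.int8),
                                ('i2', np.int16),
                                ('i4', np.int32))}
# defaults -> {'i1': -127, 'i2': -32767, 'i4': -2147483647}
```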
This would have the same effect as not assigning a _FillValue at all because of how the netCDF4 library works. Personally, I think we should avoid writing a fill value if we don't have to; it's what we tried to do before (<=1.13), and changing it means changing a lot of test results. Though I don't much mind either way.
OK. I'm assuming byte data would be an exception here. I guess the reason I avoided this was because I didn't want to have to add branching based on the dtype.
OK, but what if the user passes their own fill value? Perhaps that should override the warning?
Assuming you mean a blanket warning like in the byte case, I disagree with this. I think it's enough to warn if there are clashes: round trips still work and maintain the mask, I don't see it causing problems if there are no clashes.
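The clash condition being discussed can be sketched as follows (`fill_value_clash` is a hypothetical helper, not Iris code): only an *unmasked* point equal to the fill value breaks the round trip, since masked points are supposed to hold the fill value.

```python
import numpy as np
import numpy.ma as ma

def fill_value_clash(data, fill_value):
    """Return True if any unmasked point equals fill_value."""
    if ma.isMaskedArray(data):
        data = data.compressed()  # keep only the unmasked points
    return bool(np.any(np.asarray(data) == fill_value))

# A masked-out point equal to the fill value is not a clash:
arr = ma.masked_array([1.0, -999.0, 3.0], mask=[False, True, False])
no_clash = fill_value_clash(arr, -999.0)            # False
clash = fill_value_clash(np.array([1.0, -999.0]), -999.0)  # True
```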
lib/iris/fileformats/netcdf.py
Outdated
warnings.warn("Cube '{}' contains data points equal to the fill "
              "value {}. The points will be interpreted as being "
              "masked. Please provide a fill_value argument not "
              "equal to any data point.".format(cube.name(),
@pp-mo Please feel free to suggest a rewording of these warnings
I will consider this later, as I'm thinking to integrate this with the same requirements in the PP saver
( #2744, to be updated when this is done )
Looking much more sensible and consistent now, and the code looks neater too 👍 !
(Force-pushed d9dc750 to fa2420c.)
lib/iris/fileformats/netcdf.py
Outdated
def __setitem__(self, keys, arr):
    if self.fill_value is not None:
        self.contains_value |= self.fill_value in arr
    self.is_masked |= ma.isMaskedArray(arr)
Would make sense to short-circuit this check (and the `fill_value in` check). Once we have found them to be true once, we can save ourselves the effort of inspecting the arrays each time.
I had assumed `|=` would short circuit but I guess not.
you're going to need to double check that for me now. 😄
Yeah, I had already checked. It's because it's bitwise, and in fact it changes the data type of the left hand side.
Ok. Well, at least you can get away with the boolean operators...
>>> def foo(val):
... print('called')
... return val
...
>>> False or foo(False)
called
False
>>> True or foo(False)
True
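The contrast above can be made concrete (a small sketch; the hypothetical `check` stands in for the expensive array inspection):

```python
calls = []

def check():
    # Stand-in for inspecting an array chunk on every __setitem__.
    calls.append(1)
    return True

# Bitwise |= always evaluates its right-hand side, even when the
# flag is already True:
flag = True
flag |= check()
n_after_bitwise = len(calls)   # check() ran

# Boolean `or` short-circuits once the flag is True:
flag = True
flag = flag or check()
n_after_or = len(calls)        # check() was skipped this time
```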
on `cube.data` and possible masking. For masked data, `fill_value`
is taken from netCDF4.default_fillvals. For more control, pass a
dict with one or more of the following keys: `dtype` (required),
`scale_factor`, `add_offset`, and `fill_value`. Note that automatic
Need to ensure this public behaviour change is well documented in the what's new.
Agree with @pelson: this needs a whatsnew entry.
raise ValueError(msg)
dtype = np.dtype(packing['dtype'])
scale_factor = packing.get('scale_factor', None)
add_offset = packing.get('add_offset', None)
Not your code, but would you mind checking that there aren't other keys in there other than the ones handled?
lib/iris/fileformats/netcdf.py
Outdated
scale_factor = packing.get('scale_factor', None)
add_offset = packing.get('add_offset', None)
else:
    masked = ma.isMaskedArray(cube.data)
This would trigger the data to be loaded. Is that what you intended?
Sorry not your code... Perhaps you wouldn't mind adding a comment though:
# We compute the scale_factor based on the min/max of the data. This requires the data to be loaded.
This would trigger the data to be loaded. Is that what you intended?
I hadn't noticed this, it looks like it comes from the old code. It can be fixed, but it would require getting the min and max of the data while storing it. I can do that if you like, but it might be better to do it as a separate PR
No, just adding the comment is good.
Doesn't the docstring at this comment cover it ?
Not really. It is about making the code readable. Doesn't need to be an essay.
@@ -1924,36 +1937,55 @@ def set_packing_ncattrs(cfvar):
    self._dataset.file_format in ('NETCDF3_CLASSIC',
Do we still need the `has_lazy_data` here? It'd be nice to remove.
It determines which `store` method gets defined, although I could move the `if` into that method if you like.
lib/iris/fileformats/netcdf.py
Outdated
# If packing attributes are specified, don't bother checking whether
# the fill value is in the data.
fill_value_to_check = None if packing else \
Let's unpack this line. Neat as it is, it really isn't very readable 😉
OK
lib/iris/fileformats/netcdf.py
Outdated
@@ -2244,13 +2284,28 @@ def is_valid_packspec(p):
    raise ValueError(msg)
packspecs = packing

if isinstance(fill_value, six.string_types):
This whole block is surprising in its data-validation. Is it consistent with other parts of this codebase? If not, what is the motivation for adding such complexity?
I need to handle a single value or an iterable of values, which may be strings. If it's a single value it needs to be converted to an iterable that can be `zip`ped with `cube` if `cube` is a list. I couldn't find a more succinct way of achieving that, though maybe there is.
Can you not use np.array(..., ndmin=1) for this ?
E.G.
>>> for thing in ('a', ('a', 'bc'), ['ab', 'c'], 1.0, [1, 2], np.array([1., 2])):
... print repr(thing), ' --> ', np.array(thing, ndmin=1)
...
'a' --> ['a']
('a', 'bc') --> ['a' 'bc']
['ab', 'c'] --> ['ab' 'c']
1.0 --> [ 1.]
[1, 2] --> [1 2]
array([ 1., 2.]) --> [ 1. 2.]
>>>
It does the special-casing of strings for you, and handles all types of iterable uniformly.
I'm not sure about the repeat operation though.
Nice suggestion.
Yeah I was playing around with it but it doesn't work for generators.
I've finally accepted that this can't be improved on, though it is painful !
However, could you add a comment to say exactly what all this is about ?
? Maybe something like ...
# Make fill-value(s) into an iterable over cubes.
if isinstance(fill_value, six.string_types):
# Strings are awkward -- handle separately.
fill_values = repeat(fill_value)
else:
...
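A self-contained, Python 3 sketch of the normalisation being discussed (`normalise_fill_values` is a hypothetical helper name; the real code used `six.string_types` and `itertools.repeat`):

```python
import numpy as np

def normalise_fill_values(fill_value, n_cubes):
    """Turn a scalar, string, None, or iterable fill_value into a
    per-cube list of length n_cubes."""
    if isinstance(fill_value, str):
        # Strings are iterable, so special-case them as one value.
        return [fill_value] * n_cubes
    if fill_value is None or np.isscalar(fill_value):
        return [fill_value] * n_cubes
    # list() also consumes generators, which np.array(..., ndmin=1)
    # does not handle.
    return list(fill_value)

per_cube = normalise_fill_values(-999.0, 3)            # [-999.0] * 3
from_gen = normalise_fill_values((v for v in (1, 2)), 2)  # [1, 2]
```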
@@ -0,0 +1 @@
* When saving a cube or list of cubes, a fill value or list of fill values can be specified via a new `fill_value` argument. If a `fill_value` argument is not specified, the default fill value for the file format and the cube's data type will be used. Fill values are no longer taken from the cube's `data` attribute when it is a masked array.
Can we add a note explaining that multiple fill values are applied to output cubes in sequence.
OK
Let's test for masked points, never for masked array type.
lib/iris/fileformats/netcdf.py
Outdated
def __setitem__(self, keys, arr):
    if self.fill_value is not None:
        self.contains_value = self.contains_value or self.fill_value in arr
    self.is_masked = self.is_masked or ma.isMaskedArray(arr)
I think this should be using "ma.is_masked" instead of "ma.isMaskedArray".
As stated here "make no difference between masked + non-masked arrays"
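The distinction the review is drawing can be shown directly: `ma.isMaskedArray` asks about the array's *type*, while `ma.is_masked` asks whether any point is *actually* masked.

```python
import numpy.ma as ma

# A MaskedArray with no masked points:
no_masked_points = ma.masked_array([1.0, 2.0], mask=[False, False])

type_check = ma.isMaskedArray(no_masked_points)  # True: it is a MaskedArray
value_check = ma.is_masked(no_masked_points)     # False: nothing is masked
```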
lib/iris/fileformats/netcdf.py
Outdated
def store(data, cf_var, fill_value):
    cf_var[:] = data
    is_masked = ma.isMaskedArray(data)
Here too should be using "ma.is_masked" instead of "ma.isMaskedArray".
As stated here "make no difference between masked + non-masked arrays"
…s to argument validation
Very minor points, but I haven't completed checking, so there may be other stuff to add.
@@ -24,10 +24,16 @@
# importing anything else.
import iris.tests as tests

from contextlib import contextmanager
import re
This import is now obsolete
import netCDF4 as nc
import numpy as np
from numpy import ma

import iris
This import is now obsolete
It's used in a few places
Sorry my mistake, must have been looking at the wrong version.
import iris
import iris._lazy_data
You could just import `as_lazy_data` here, as that is all that is used.
NOTE: I think the newer errors are what dask/dask#2654 addresses. Meanwhile, we can't access the newly-merged latest dask code with conda until we get a distinct version string to identify it -- i.e. the latest tag is still "0.15.2", which does not include the masking support.
... now we can :
@djkirkham once #2762 is in, can you please re-base this?
@pp-mo Tests are passing now
sman.write(cube, local_keys, unlimited_dimensions, zlib, complevel,
           shuffle, fletcher32, contiguous, chunksizes, endian,
-          least_significant_digit, packing=packspec)
+          least_significant_digit, packing=packspec,
+          fill_value=fill_value)

if iris.config.netcdf.conventions_override:
My Eclipse says this needs an extra import to be truly righteous.
@@ -0,0 +1 @@
* When saving a cube or list of cubes, a fill value or list of fill values can be specified via a new `fill_value` argument. If a list is supplied, each fill value will be applied to each cube in turn. If a `fill_value` argument is not specified, the default fill value for the file format and the cube's data type will be used. Fill values are no longer taken from the cube's `data` attribute when it is a masked array.
This isn't a general facility -- it only applies to certain save formats (netcdf only here, PP to be added shortly..)
ok, I'll add mentions for both file formats here.
Actually I'll just mention netCDF
self.assertTrue(var[index].mask)

def test_mask_lazy_default_fill_value(self):
    # Test that masked lazy data saves correctly when given a fill value.
Should say "when not given a fill value" here.
with self._netCDF_var(cube, fill_value=fill_value):
    pass

def test_masked_byte_default_fill_value(self):
Possibly a missing testcase here: saving masked byte data with a nonstandard fill-value. It seems that the existing code will not generate a warning on save for this. I'm not too clear what you would get when you read it back, and I'm not totally clear what it "ought" to do.
(Force-pushed 2e6d116 to c98585a.)
@pp-mo I think I've addressed all your comments and the tests are now passing.
Thanks for all your patience @djkirkham !
Sadly though, it still says that when the thing was fixed :-(
This is an excellent change. Thanks @djkirkham - great work! 🎉
No longer try to get a fill value from the cube's data array for saving to netCDF. Instead, allow the user to pass one in as a keyword argument or use the netCDF default.
Unresolved issues: the `packing` argument to save can be an iterable, to allow different packing attributes for the different cubes. Perhaps we could do the same for `fill_value`.