
Handle fill value on netCDF save #2747

Merged · 18 commits · Oct 11, 2017

Conversation

@djkirkham (Contributor) commented Aug 22, 2017

No longer try to get a fill value from the cube's data array for saving to netCDF. Instead, allow the user to pass one in as a keyword argument or use the netCDF default.

Unresolved issues:

  • The behaviour of netCDF is to use a default fill value if none is specified, unless the dtype is a single byte in length, in which case the data is interpreted as being non-masked (on writing to a data set, the data is still filled). So the warning raised by my code doesn't apply in this case, and in fact perhaps a warning should be raised if the user tries to store masked byte data.
  • I noticed that the packing argument to save can be an iterable, to allow different packing attributes for the different cubes. Perhaps we could do the same for fill_value.
  • Could probably reword some of the docstrings
  • Unit tests

@djkirkham djkirkham requested review from pelson and pp-mo August 22, 2017 09:59
@pp-mo pp-mo self-assigned this Aug 22, 2017
@@ -902,11 +919,16 @@ def write(self, cube, local_keys=None, unlimited_dimensions=None,
on `cube.data` and possible masking. For masked data, `fill_value`
is taken from netCDF4.default_fillvals. For more control, pass a
Review comment (Member):

I think this needs a tweak as it seems to say that if 'packing' is used then 'fill_value' would be ignored.
Whereas, the new note for 'fill_value' itself implies that it will be used for packing, if given, which I think is correct.

@@ -1858,54 +1881,61 @@ def _create_cf_data_variable(self, cube, dimension_names, local_keys=None,
An interable of cube attribute keys. Any cube attributes
with matching keys will become attributes on the data variable.

* packing (type or string or dict or list): A numpy integer datatype
@pp-mo (Member) commented Aug 22, 2017:

It seems a bit fragile to repeat all this between __init__ and _create_cf_data_variable.
Especially as this is not a public docstring for Sphinx, I think it would make more sense to cross-refer to the other here. How about just "'packing' and 'fill_value' controls are applied to the data, as described in '__init__'."

@djkirkham (Contributor, Author) commented Aug 23, 2017

I've pushed changes so that the _FillValue attribute is removed from the netCDF variable if the data wasn't masked. Some things to note:

  • The _FillValue attribute is always removed in the non-masked case, even if it was explicitly passed by the user
  • The warning about clashes with the fill value in the data only occurs if the data is masked. This prevents warnings when writing byte data, which is never treated as masked by netCDF anyway, but we might want to have the warning for other data types.

@@ -158,12 +158,13 @@ def test_big_endian(self):
def test_zlib(self):
cube = self._simple_cube('>f4')
with mock.patch('iris.fileformats.netcdf.netCDF4') as api:
api.default_fillvals = {'f4': 12345.}
@djkirkham (Contributor, Author):

I've had to modify this test because the createVariable call now passes the default fill value for the data type. Since the netCDF4 module is mocked out, the value passed for the fill_value argument turns out to be a numpy array. I'm guessing that doesn't play well with mock.ANY (it probably does an equality test somewhere along the line, which causes numpy to raise an exception), because the test fails unless I make sure there is an actual value to pass, hence assigning this dict.

with Saver('/dummy/path', 'NETCDF4') as saver:
saver.write(cube, zlib=True)
dataset = api.Dataset.return_value
create_var_calls = mock.call.createVariable(
'air_pressure_anomaly', np.dtype('float32'), ['dim0', 'dim1'],
fill_value=None, shuffle=True, least_significant_digit=None,
fill_value=mock.ANY, shuffle=True, least_significant_digit=None,
@djkirkham (Contributor, Author):

Unfortunately I've had to make this test less strict. I could check that there is a delncattr call to remove the _FillValue...

@pp-mo (Member) commented Aug 23, 2017

Some initial thoughts on the behaviour, following your comments :

I've pushed changes so that the _FillValue attribute is removed from the netCDF variable if the data wasn't masked. Some things to note:

  • The _FillValue attribute is always removed in the non-masked case, even if it was explicitly passed by the user
  • The warning about clashes with the fill value in the data only occurs if the data is masked. This prevents warnings when writing byte data, which is never treated as masked by netCDF anyway, but we might want to have the warning for other data types.

I think both those points differ from the conclusions of yesterday's discussion with @pelson, which I've tried to summarise in #2748.
It's a shame we don't have @pelson to verify that right now, as the proposals changed several times in discussion and I might have mis-represented him.

So, for what it's worth, I think that outcome says ...

  • every nc variable is assigned a _FillValue attribute
    • except byte ones, where it is removed if the data has no masked points
  • the check on clashes with the fill value, and possible resulting warning, is always done
  • there is also a blanket warning when saving masked bytes

@pp-mo (Member) commented Aug 23, 2017

My own personal inclination would be to add a warning when saving masked int/uint32 as well...
I think this is potentially more awkward than the byte cases

  • present usage in netcdf/netCDF4-python means byte types essentially don't allow masking
  • but short integer types (2 or 4 bytes) can't avoid it, for some larger values.

The actual 'missing' values, from netCDF4.default_fillvals, are

  • u2 --> 65535 = 2^16 - 1 = 0xffff
    valid range is 0 .. 65534 : missing is 65535
  • i2 --> -32767 = 1 - 2^15
    valid range is -32766 .. +32767 : missing is -32767 : status of -32768 is unclear?
  • u4 --> 4,294,967,295 = 2^32 - 1
    valid range is 0 .. 4,294,967,294 : missing is 4,294,967,295
  • i4 --> -2,147,483,647 = 1 - 2^31
    valid range is -2,147,483,646 .. +2,147,483,647 : missing is -2,147,483,647 : status of -2,147,483,648 is unclear

So actually there are two values missing from the signed type ranges.
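The relationship between these default fill values and the dtype limits can be checked directly with numpy. A quick sketch (the fill values are hard-coded from netCDF4.default_fillvals rather than imported, to keep the snippet self-contained):

```python
import numpy as np

# Default netCDF fill values for the integer types discussed above
# (hard-coded from netCDF4.default_fillvals to avoid the dependency).
default_fills = {'u2': 65535, 'i2': -32767, 'u4': 4294967295, 'i4': -2147483647}

for code, fill in default_fills.items():
    info = np.iinfo(np.dtype(code))
    at_edge = fill in (info.min, info.max)
    print('{}: range [{}, {}], fill {} (at edge of range: {})'.format(
        code, info.min, info.max, fill, at_edge))
```

For the unsigned types the fill sits at the top of the range, but for the signed types it is one above the dtype minimum, which is why the status of -32768 and -2,147,483,648 is ambiguous.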

@djkirkham (Contributor, Author) replied to @pp-mo:

  • every nc variable is assigned a _FillValue attribute
    • except byte ones, where it removed if the data has no masked points

This would have the same effect as not assigning a _FillValue at all because of how the netCDF4 library works. Personally, I think we should avoid writing a fill value if we don't have to; it's what we tried to do before (<=1.13), and changing it means changing a lot of test results. Though I don't much mind either way.

  • the check on clashes with the fill value, and possible resulting warning, is always done

OK. I'm assuming byte data would be an exception here. I guess the reason I avoided this was because I didn't want to have to add branching based on the dtype.

  • there is also a blanket warning when saving masked bytes

OK, but what if the user passes their own fill value? Perhaps that should override the warning?

My own personal inclination would be to add a warning when saving masked int/uint32 as well...

Assuming you mean a blanket warning like in the byte case, I disagree with this. I think it's enough to warn if there are clashes: round trips still work and maintain the mask, I don't see it causing problems if there are no clashes.

warnings.warn("Cube '{}' contains data points equal to the fill "
              "value {}. The points will be interpreted as being "
              "masked. Please provide a fill_value argument not "
              "equal to any data point.".format(cube.name(),
@djkirkham (Contributor, Author):

@pp-mo Please feel free to suggest a rewording of these warnings

Review comment (Member):

I will consider this later, as I'm planning to integrate this with the same requirements in the PP saver
( #2744, to be updated when this is done )

@pp-mo (Member) commented Aug 24, 2017

Looking much more sensible and consistent now, and the code looks neater too 👍 !
A few remaining points...

  • +1 for making 'fill_value' an iterable, like 'packing', for the save() call
    • and documenting it
  • some more comprehensive tests for the new behaviours
  • remove the redundant import of iris._lazy_data.get_fill_value
    • .. and then that routine itself as nothing else refers to it

@djkirkham djkirkham force-pushed the netcdf_save branch 2 times, most recently from d9dc750 to fa2420c Compare August 25, 2017 12:37
def __setitem__(self, keys, arr):
if self.fill_value is not None:
self.contains_value |= self.fill_value in arr
self.is_masked |= ma.isMaskedArray(arr)
Review comment (Member):

Would make sense to short-circuit this check (and the fill_value in check). Once we have found them to be true once, we can save ourselves the effort of inspecting the arrays each time.

@djkirkham (Contributor, Author):

I had assumed |= would short circuit but I guess not.

Review comment (Member):

you're going to need to double check that for me now. 😄

@djkirkham (Contributor, Author):

Yeah, I had already checked. It's because it's bitwise, and in fact it changes the data type of the left hand side.

Review comment (Member):

Ok. Well, at least you can get away with the boolean operators...

>>> def foo(val):
...    print('called')
...    return val
... 

>>> False or foo(False)
called
False
>>> True or foo(False)
True
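To make the difference concrete: `|=` on booleans is a bitwise operation that always evaluates its right-hand side, while `or` short-circuits. A small illustrative sketch:

```python
calls = []

def check():
    # Stands in for an expensive array inspection.
    calls.append(1)
    return True

flag = True
flag |= check()         # bitwise: the right-hand side is still evaluated
flag = flag or check()  # short-circuits: check() is skipped, flag is already True
print(len(calls))       # 1
```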

on `cube.data` and possible masking. For masked data, `fill_value`
is taken from netCDF4.default_fillvals. For more control, pass a
dict with one or more of the following keys: `dtype` (required),
`scale_factor`, `add_offset`, and `fill_value`. Note that automatic
@pelson (Member):

Need to ensure this public behaviour change is well documented in the what's new.

Review comment (Member):

Agree with @pelson: this needs a whatsnew entry.

raise ValueError(msg)
dtype = np.dtype(packing['dtype'])
scale_factor = packing.get('scale_factor', None)
add_offset = packing.get('add_offset', None)
Review comment (Member):

Not your code, but would you mind checking that there aren't other keys in there other than the ones handled?

scale_factor = packing.get('scale_factor', None)
add_offset = packing.get('add_offset', None)
else:
masked = ma.isMaskedArray(cube.data)
Review comment (Member):

This would trigger the data to be loaded. Is that what you intended?

Review comment (Member):

Sorry not your code... Perhaps you wouldn't mind adding a comment though:

# We compute the scale_factor based on the min/max of the data. This requires the data to be loaded.

@djkirkham (Contributor, Author) commented Aug 25, 2017:

This would trigger the data to be loaded. Is that what you intended?

I hadn't noticed this; it looks like it comes from the old code. It can be fixed, but it would require getting the min and max of the data while storing it. I can do that if you like, but it might be better to do it as a separate PR.

Review comment (Member):

No, just adding the comment is good.

Review comment (Member):

Doesn't the docstring at this comment cover it ?

Review comment (Member):

Not really. It is about making the code readable. Doesn't need to be an essay.
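For context, the min/max-based packing computation under discussion might look roughly like this. A sketch only, not Iris's actual implementation; the helper name `packing_params` is hypothetical:

```python
import numpy as np

def packing_params(data, dtype=np.int16):
    # Hypothetical sketch: choose scale_factor/add_offset so that the
    # data's [min, max] maps onto the integer dtype's full range.
    # Computing min/max here is what forces the data to be loaded.
    info = np.iinfo(dtype)
    dmin, dmax = float(np.min(data)), float(np.max(data))
    scale_factor = (dmax - dmin) / (info.max - info.min)
    add_offset = dmin - info.min * scale_factor
    return scale_factor, add_offset

sf, ao = packing_params(np.array([0.0, 10.0]))
# Unpacking (packed * scale_factor + add_offset) recovers the endpoints.
print(np.iinfo(np.int16).min * sf + ao, np.iinfo(np.int16).max * sf + ao)
```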

@@ -1924,36 +1937,55 @@ def set_packing_ncattrs(cfvar):
self._dataset.file_format in ('NETCDF3_CLASSIC',
Review comment (Member):

Do we still need the has_lazy_data here? It'd be nice to remove.

@djkirkham (Contributor, Author):

It determines which store method gets defined, although I could move the if into that method if you like.


# If packing attributes are specified, don't bother checking whether
# the fill value is in the data.
fill_value_to_check = None if packing else \
Review comment (Member):

Let's unpack this line. Neat as it is, it really isn't very readable 😉

@djkirkham (Contributor, Author):

OK

@@ -2244,13 +2284,28 @@ def is_valid_packspec(p):
raise ValueError(msg)
packspecs = packing

if isinstance(fill_value, six.string_types):
Review comment (Member):

This whole block is surprising in its data-validation. Is it consistent with other parts of this codebase? If not, what is the motivation for adding such complexity?

@djkirkham (Contributor, Author):

I need to handle a single value or an iterable of values, which may be strings. If it's a single value it needs to be converted to an iterable that can be zipped with cube if cube is a list. I couldn't find a more succinct way of achieving that, though maybe there is.

@pp-mo (Member) commented Aug 25, 2017:

Can you not use np.array(..., ndmin=1) for this ?
E.G.

>>> for thing in ('a', ('a', 'bc'), ['ab', 'c'], 1.0, [1, 2], np.array([1., 2])):
...   print repr(thing), ' --> ', np.array(thing, ndmin=1)
... 
'a'  -->  ['a']
('a', 'bc')  -->  ['a' 'bc']
['ab', 'c']  -->  ['ab' 'c']
1.0  -->  [ 1.]
[1, 2]  -->  [1 2]
array([ 1.,  2.])  -->  [ 1.  2.]
>>> 

It does the special-casing of strings for you, and handles all types of iterable uniformly.
I'm not sure about the repeat operation though.

Review comment (Member):

Nice suggestion.

@djkirkham (Contributor, Author):

Yeah I was playing around with it but it doesn't work for generators.

Review comment (Member):

I've finally accepted that this can't be improved on, though it is painful!
However, could you add a comment to say exactly what all this is about?
Maybe something like ...

    # Make fill-value(s) into an iterable over cubes.
    if isinstance(fill_value, six.string_types):
        # Strings are awkward -- handle separately.
        fill_values = repeat(fill_value)
    else:
        ...
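The normalisation being suggested could be sketched like this. This is a hypothetical helper, not the PR's actual code: the name `normalise_fill_values` is invented, and plain `str` stands in for `six.string_types`:

```python
from itertools import repeat

def normalise_fill_values(fill_value):
    # Hypothetical sketch: turn a scalar or iterable fill_value into
    # something that can be zipped against a list of cubes. Strings and
    # scalars repeat indefinitely; other iterables are used as-is.
    if isinstance(fill_value, str) or not hasattr(fill_value, '__iter__'):
        return repeat(fill_value)
    return iter(fill_value)

cubes = ['cube_a', 'cube_b']
print(list(zip(cubes, normalise_fill_values(-999))))    # same value for each cube
print(list(zip(cubes, normalise_fill_values([1, 2]))))  # one value per cube
```

Strings need the special case because they are themselves iterable, which is exactly the awkwardness the comment above calls out.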

@@ -0,0 +1 @@
* When saving a cube or list of cubes, a fill value or list of fill values can be specified via a new `fill_value` argument. If a `fill_value` argument is not specified, the default fill value for the file format and the cube's data type will be used. Fill values are no longer taken from the cube's `data` attribute when it is a masked array.
Review comment (Member):

Can we add a note explaining that multiple fill values are applied to output cubes in sequence?

@djkirkham (Contributor, Author):

OK

@pp-mo (Member) left a comment:

Let's test for masked points, never for masked array type.

def __setitem__(self, keys, arr):
if self.fill_value is not None:
self.contains_value = self.contains_value or self.fill_value in arr
self.is_masked = self.is_masked or ma.isMaskedArray(arr)
Review comment (Member):

I think this should be using "ma.is_masked" instead of "ma.isMaskedArray".
As stated here "make no difference between masked + non-masked arrays"


def store(data, cf_var, fill_value):
cf_var[:] = data
is_masked = ma.isMaskedArray(data)
Review comment (Member):

Here too should be using "ma.is_masked" instead of "ma.isMaskedArray".
As stated here "make no difference between masked + non-masked arrays"
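The distinction behind both of these comments is that `ma.isMaskedArray` tests the array's type, while `ma.is_masked` tests whether any points are actually masked:

```python
import numpy as np
import numpy.ma as ma

plain = np.array([1, 2, 3])
unmasked = ma.masked_array([1, 2, 3])                # masked type, no masked points
masked = ma.masked_array([1, 2, 3], mask=[0, 1, 0])  # one masked point

print(ma.isMaskedArray(plain), ma.is_masked(plain))        # False False
print(ma.isMaskedArray(unmasked), ma.is_masked(unmasked))  # True False
print(ma.isMaskedArray(masked), ma.is_masked(masked))      # True True
```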

@pp-mo (Member) left a comment:

Very minor points, but I haven't finished checking, so there may be more to add.

@@ -24,10 +24,16 @@
# importing anything else.
import iris.tests as tests

from contextlib import contextmanager
import re
Review comment (Member):

This import is now obsolete

import netCDF4 as nc
import numpy as np
from numpy import ma

import iris
Review comment (Member):

This import is now obsolete

@djkirkham (Contributor, Author):

It's used in a few places

Review comment (Member):

Sorry my mistake, must have been looking at the wrong version.


import iris
import iris._lazy_data
Review comment (Member):

You could just import as_lazy_data here, as that is all that is used.

@pp-mo (Member) commented Sep 4, 2017

NOTE: I think the newer errors
- like "ValueError: cannot reshape array of size 24 into shape (192,)" -
are due to the special branch "jcrist/masked-array" no longer existing,
since the relevant PR dask/dask#2301 was merged to dask/master.

Meanwhile, we can't access the newly-merged latest dask code with conda until we get a distinct version string to identify it -- i.e. latest tag is still "0.15.2", which does not include the masking support.

dask/dask#2654 addresses this.
We should wait for that and then remove the special dask-branch fetch in .travis.yml.
-- see #2762 for that.

@pp-mo (Member) commented Oct 10, 2017

can't access the newly-merged latest dask code

... now we can :

@djkirkham once #2762 is in, can you please re-base this ?
Note: I have been testing a prototype version in : #2780
(with other bits force-merged)

@djkirkham djkirkham closed this Oct 11, 2017
@djkirkham djkirkham reopened this Oct 11, 2017
@djkirkham (Contributor, Author):

@pp-mo Tests are passing now

sman.write(cube, local_keys, unlimited_dimensions, zlib, complevel,
shuffle, fletcher32, contiguous, chunksizes, endian,
least_significant_digit, packing=packspec)
least_significant_digit, packing=packspec,
fill_value=fill_value)

if iris.config.netcdf.conventions_override:
Review comment (Member):

My Eclipse says this needs an extra import to be truly righteous.

@@ -0,0 +1 @@
* When saving a cube or list of cubes, a fill value or list of fill values can be specified via a new `fill_value` argument. If a list is supplied, each fill value will be applied to each cube in turn. If a `fill_value` argument is not specified, the default fill value for the file format and the cube's data type will be used. Fill values are no longer taken from the cube's `data` attribute when it is a masked array.
Review comment (Member):

This isn't a general facility -- it only applies to certain save formats (netcdf only here, PP to be added shortly..)

@djkirkham (Contributor, Author) commented Oct 11, 2017:

ok, I'll add mentions for both file formats here.

Actually I'll just mention netCDF

self.assertTrue(var[index].mask)

def test_mask_lazy_default_fill_value(self):
# Test that masked lazy data saves correctly when given a fill value.
Review comment (Member):

Should say "when not given a fill value" here.

with self._netCDF_var(cube, fill_value=fill_value):
pass

def test_masked_byte_default_fill_value(self):
@pp-mo (Member) commented Oct 11, 2017:

Possibly a missing testcase here?
= saving masked byte data with a nonstandard fill-value

It seems that the existing code will not generate a warning on save for this.
I'm not too clear what you would get when you read it back.
I'm not totally clear what it "ought" to do.

@djkirkham (Contributor, Author):

@pp-mo I think I've addressed all your comments and the tests are now passing.

@djkirkham djkirkham mentioned this pull request Oct 11, 2017
8 tasks
@pp-mo pp-mo merged commit fc0cf49 into SciTools:dask_mask_array Oct 11, 2017
@pp-mo (Member) commented Oct 11, 2017

Thanks for all your patience @djkirkham !

@pp-mo (Member) commented Oct 11, 2017

My Eclipse says this needs an extra import to be truly righteous.

Sadly though, it still says that when the thing was fixed :-(

@QuLogic QuLogic added this to the dask-mask milestone Oct 12, 2017
@pelson (Member) commented Oct 13, 2017

This is an excellent change. Thanks @djkirkham - great work! 🎉

@djkirkham djkirkham deleted the netcdf_save branch October 26, 2017 13:00