
Handle fill value on netCDF save #2747

Merged · 18 commits · Oct 11, 2017

Conversation

@djkirkham (Contributor) commented Aug 22, 2017

No longer try to get a fill value from the cube's data array for saving to netCDF. Instead, allow the user to pass one in as a keyword argument or use the netCDF default.

Unresolved issues:

  • The behaviour of netCDF is to use a default fill value if none is specified, unless the dtype is a single byte in length, in which case the data is interpreted as being non-masked (on writing to a data set, the data is still filled). So the warning raised by my code doesn't apply in this case, and in fact perhaps a warning should be raised if the user tries to store masked byte data.
  • I noticed that the packing argument to save can be an iterable, to allow different packing attributes for the different cubes. Perhaps we could do the same for fill_value.
  • Could probably reword some of the docstrings
  • Unit tests

@djkirkham djkirkham requested review from pelson and pp-mo August 22, 2017 09:59
@pp-mo pp-mo self-assigned this Aug 22, 2017
@@ -902,11 +919,16 @@ def write(self, cube, local_keys=None, unlimited_dimensions=None,
on `cube.data` and possible masking. For masked data, `fill_value`
is taken from netCDF4.default_fillvals. For more control, pass a
Review comment (Member):

I think this needs a tweak as it seems to say that if 'packing' is used then 'fill_value' would be ignored.
Whereas, the new note for 'fill_value' itself implies that it will be used for packing, if given, which I think is correct.

@@ -1858,54 +1881,61 @@ def _create_cf_data_variable(self, cube, dimension_names, local_keys=None,
An interable of cube attribute keys. Any cube attributes
with matching keys will become attributes on the data variable.

* packing (type or string or dict or list): A numpy integer datatype
@pp-mo (Member) commented Aug 22, 2017:

It seems a bit fragile to repeat all this between __init__ and _create_cf_data_variable.
Especially as this is not a public docstring for Sphinx, I think it would make more sense to cross-refer to the other here. How about just "'packing' and 'fill_value' controls are applied to the data, as described in '__init__'."

@djkirkham (Contributor, Author) commented Aug 23, 2017

I've pushed changes so that the _FillValue attribute is removed from the netCDF variable if the data wasn't masked. Some things to note:

  • The _FillValue attribute is always removed in the non-masked case, even if it was explicitly passed by the user
  • The warning about clashes with the fill value in the data only occurs if the data is masked. This prevents warnings when writing byte data, which is never treated as masked by netCDF anyway, but we might want to have the warning for other data types.

@@ -158,12 +158,13 @@ def test_big_endian(self):
def test_zlib(self):
cube = self._simple_cube('>f4')
with mock.patch('iris.fileformats.netcdf.netCDF4') as api:
api.default_fillvals = {'f4': 12345.}
@djkirkham (Contributor, Author):

I've had to modify this test because the createVariable call now passes the default fill value for the data type. Since the netCDF4 module is mocked out, the value passed for the fill_value argument turns out to be a numpy array. I'm guessing that doesn't play well with mock.ANY (it probably does an equality test somewhere along the line, which causes numpy to raise an exception), because the test fails unless I make sure there is an actual value to pass, hence assigning this dict.

with Saver('/dummy/path', 'NETCDF4') as saver:
saver.write(cube, zlib=True)
dataset = api.Dataset.return_value
create_var_calls = mock.call.createVariable(
'air_pressure_anomaly', np.dtype('float32'), ['dim0', 'dim1'],
fill_value=None, shuffle=True, least_significant_digit=None,
fill_value=mock.ANY, shuffle=True, least_significant_digit=None,
@djkirkham (Contributor, Author):

Unfortunately I've had to make this test less strict. I could check that there is a delncattr call to remove the _FillValue...

@pp-mo (Member) commented Aug 23, 2017

Some initial thoughts on the behaviour, following your comments :

I've pushed changes so that the _FillValue attribute is removed from the netCDF variable if the data wasn't masked. Some things to note:

  • The _FillValue attribute is always removed in the non-masked case, even if it was explicitly passed by the user
  • The warning about clashes with the fill value in the data only occurs if the data is masked. This prevents warnings when writing byte data, which is never treated as masked by netCDF anyway, but we might want to have the warning for other data types.

I think both those points differ from the conclusions of yesterday's discussion with @pelson, which I've tried to summarise in #2748.
It's a shame we don't have @pelson to verify that right now, as the proposals changed several times in discussion and I might have mis-represented him.

So, for what it's worth, I think that outcome says ...

  • every nc variable is assigned a _FillValue attribute
    • except byte ones, where it is removed if the data has no masked points
  • the check on clashes with the fill value, and possible resulting warning, is always done
  • there is also a blanket warning when saving masked bytes

@pp-mo (Member) commented Aug 23, 2017

My own personal inclination would be to add a warning when saving masked int/uint32 as well...
I think this is potentially more awkward than the byte cases

  • present usage in netcdf/netCDF4-python means byte types essentially don't allow masking
  • but short integer types (2 or 4 bytes) can't avoid it, for some larger values.

The actual 'missing' values, from netCDF4.default_fillvals, are

  • u2 --> 65535 = 2^16 - 1 = 0xffff
    valid range is 0 .. 65534 : missing is 65535
  • i2 --> -32767 = 1 - 2^15
    valid range is -32766 .. +32767 : missing is -32767 : status of -32768 is unclear?
  • u4 --> 4,294,967,295 = 2^32 - 1
    valid range is 0 .. 4,294,967,294 : missing is 4,294,967,295
  • i4 --> -2,147,483,647 = 1 - 2^31
    valid range is -2,147,483,646 .. +2,147,483,647 : missing is -2,147,483,647 : status of -2,147,483,648 is unclear

So actually there are two values missing from the signed type ranges.
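The relationship between these default fill values and the dtype limits can be checked directly with numpy. A quick sketch (the fill values are hard-coded from netCDF4.default_fillvals rather than imported, to keep the snippet self-contained):

```python
import numpy as np

# Default netCDF fill values for the integer types discussed above
# (hard-coded from netCDF4.default_fillvals to avoid the dependency).
default_fills = {'u2': 65535, 'i2': -32767, 'u4': 4294967295, 'i4': -2147483647}

for code, fill in default_fills.items():
    info = np.iinfo(np.dtype(code))
    at_edge = fill in (info.min, info.max)
    print('{}: range [{}, {}], fill {} (at edge of range: {})'.format(
        code, info.min, info.max, fill, at_edge))
```

For the unsigned types the fill sits at the top of the range, but for the signed types it is one above the dtype minimum, which is why the status of -32768 and -2,147,483,648 is ambiguous.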

@djkirkham (Contributor, Author) replied to @pp-mo:

  • every nc variable is assigned a _FillValue attribute
    • except byte ones, where it removed if the data has no masked points

This would have the same effect as not assigning a _FillValue at all because of how the netCDF4 library works. Personally, I think we should avoid writing a fill value if we don't have to; it's what we tried to do before (<=1.13), and changing it means changing a lot of test results. Though I don't much mind either way.

  • the check on clashes with the fill value, and possible resulting warning, is always done

OK. I'm assuming byte data would be an exception here. I guess the reason I avoided this was because I didn't want to have to add branching based on the dtype.

  • there is also a blanket warning when saving masked bytes

OK, but what if the user passes their own fill value? Perhaps that should override the warning?

My own personal inclination would be to add a warning when saving masked int/uint32 as well...

Assuming you mean a blanket warning like in the byte case, I disagree with this. I think it's enough to warn if there are clashes: round trips still work and maintain the mask, I don't see it causing problems if there are no clashes.

warnings.warn("Cube '{}' contains data points equal to the fill "
              "value {}. The points will be interpreted as being "
              "masked. Please provide a fill_value argument not "
              "equal to any data point.".format(cube.name(),
@djkirkham (Contributor, Author):

@pp-mo Please feel free to suggest a rewording of these warnings

Review comment (Member):

I will consider this later, as I'm planning to integrate this with the same requirements in the PP saver
( #2744, to be updated when this is done )

@pp-mo (Member) commented Aug 24, 2017

Looking much more sensible and consistent now, and the code looks neater too 👍 !
A few remaining points...

  • +1 for making 'fill_value' an iterable, like 'packing', for the save() call
    • and documenting it
  • some more comprehensive tests for the new behaviours
  • remove the redundant import of iris._lazy_data.get_fill_value
    • .. and then that routine itself as nothing else refers to it

@djkirkham djkirkham force-pushed the netcdf_save branch 2 times, most recently from d9dc750 to fa2420c Compare August 25, 2017 12:37
def __setitem__(self, keys, arr):
if self.fill_value is not None:
self.contains_value |= self.fill_value in arr
self.is_masked |= ma.isMaskedArray(arr)
Review comment (Member):

Would make sense to short-circuit this check (and the fill_value in check). Once we have found them to be true once, we can save ourselves the effort of inspecting the arrays each time.

@djkirkham (Contributor, Author):

I had assumed |= would short circuit but I guess not.

Review comment (Member):

you're going to need to double check that for me now. 😄

@djkirkham (Contributor, Author):

Yeah, I had already checked. It's because it's bitwise, and in fact it changes the data type of the left hand side.

Review comment (Member):

Ok. Well, at least you can get away with the boolean operators...

>>> def foo(val):
...    print('called')
...    return val
... 

>>> False or foo(False)
called
False
>>> True or foo(False)
True
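To make the difference concrete: `|=` on booleans is a bitwise operation that always evaluates its right-hand side, while `or` short-circuits. A small illustrative sketch:

```python
calls = []

def check():
    # Stands in for an expensive array inspection.
    calls.append(1)
    return True

flag = True
flag |= check()         # bitwise: the right-hand side is still evaluated
flag = flag or check()  # short-circuits: check() is skipped, flag is already True
print(len(calls))       # 1
```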

on `cube.data` and possible masking. For masked data, `fill_value`
is taken from netCDF4.default_fillvals. For more control, pass a
dict with one or more of the following keys: `dtype` (required),
`scale_factor`, `add_offset`, and `fill_value`. Note that automatic
@pelson (Member):

Need to ensure this public behaviour change is well documented in the what's new.

Review comment (Member):

Agree with @pelson: this needs a whatsnew entry.

raise ValueError(msg)
dtype = np.dtype(packing['dtype'])
scale_factor = packing.get('scale_factor', None)
add_offset = packing.get('add_offset', None)
Review comment (Member):

Not your code, but would you mind checking that there aren't other keys in there other than the ones handled?

scale_factor = packing.get('scale_factor', None)
add_offset = packing.get('add_offset', None)
else:
masked = ma.isMaskedArray(cube.data)
Review comment (Member):

This would trigger the data to be loaded. Is that what you intended?

Review comment (Member):

Sorry not your code... Perhaps you wouldn't mind adding a comment though:

# We compute the scale_factor based on the min/max of the data. This requires the data to be loaded.

@djkirkham (Contributor, Author) commented Aug 25, 2017:

This would trigger the data to be loaded. Is that what you intended?

I hadn't noticed this; it looks like it comes from the old code. It can be fixed, but it would require getting the min and max of the data while storing it. I can do that if you like, but it might be better to do it as a separate PR.

Review comment (Member):

No, just adding the comment is good.

Review comment (Member):

Doesn't the docstring at this comment cover it ?

Review comment (Member):

Not really. It is about making the code readable. Doesn't need to be an essay.
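For context, the min/max-based packing computation under discussion might look roughly like this. A sketch only, not Iris's actual implementation; the helper name `packing_params` is hypothetical:

```python
import numpy as np

def packing_params(data, dtype=np.int16):
    # Hypothetical sketch: choose scale_factor/add_offset so that the
    # data's [min, max] maps onto the integer dtype's full range.
    # Computing min/max here is what forces the data to be loaded.
    info = np.iinfo(dtype)
    dmin, dmax = float(np.min(data)), float(np.max(data))
    scale_factor = (dmax - dmin) / (info.max - info.min)
    add_offset = dmin - info.min * scale_factor
    return scale_factor, add_offset

sf, ao = packing_params(np.array([0.0, 10.0]))
# Unpacking (packed * scale_factor + add_offset) recovers the endpoints.
print(np.iinfo(np.int16).min * sf + ao, np.iinfo(np.int16).max * sf + ao)
```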

@@ -1924,36 +1937,55 @@ def set_packing_ncattrs(cfvar):
self._dataset.file_format in ('NETCDF3_CLASSIC',
Review comment (Member):

Do we still need the has_lazy_data here? It'd be nice to remove.

@djkirkham (Contributor, Author):

It determines which store method gets defined, although I could move the if into that method if you like.


# If packing attributes are specified, don't bother checking whether
# the fill value is in the data.
fill_value_to_check = None if packing else \
Review comment (Member):

Let's unpack this line. Neat as it is, it really isn't very readable 😉

@djkirkham (Contributor, Author):

OK

@@ -2244,13 +2284,28 @@ def is_valid_packspec(p):
raise ValueError(msg)
packspecs = packing

if isinstance(fill_value, six.string_types):
Review comment (Member):

This whole block is surprising in its data-validation. Is it consistent with other parts of this codebase? If not, what is the motivation for adding such complexity?

@djkirkham (Contributor, Author):

I need to handle a single value or an iterable of values, which may be strings. If it's a single value it needs to be converted to an iterable that can be zipped with cube if cube is a list. I couldn't find a more succinct way of achieving that, though maybe there is.

@pp-mo (Member) commented Aug 25, 2017:

Can you not use np.array(..., ndmin=1) for this ?
E.G.

>>> for thing in ('a', ('a', 'bc'), ['ab', 'c'], 1.0, [1, 2], np.array([1., 2])):
...   print repr(thing), ' --> ', np.array(thing, ndmin=1)
... 
'a'  -->  ['a']
('a', 'bc')  -->  ['a' 'bc']
['ab', 'c']  -->  ['ab' 'c']
1.0  -->  [ 1.]
[1, 2]  -->  [1 2]
array([ 1.,  2.])  -->  [ 1.  2.]
>>> 

It does the special-casing of strings for you, and handles all types of iterable uniformly.
I'm not sure about the repeat operation though.

Review comment (Member):

Nice suggestion.

@djkirkham (Contributor, Author):

Yeah I was playing around with it but it doesn't work for generators.

Review comment (Member):

I've finally accepted that this can't be improved on, though it is painful!
However, could you add a comment to say exactly what all this is about?
Maybe something like ...

    # Make fill-value(s) into an iterable over cubes.
    if isinstance(fill_value, six.string_types):
        # Strings are awkward -- handle separately.
        fill_values = repeat(fill_value)
    else:
        ...
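The normalisation being suggested could be sketched like this. This is a hypothetical helper, not the PR's actual code: the name `normalise_fill_values` is invented, and plain `str` stands in for `six.string_types`:

```python
from itertools import repeat

def normalise_fill_values(fill_value):
    # Hypothetical sketch: turn a scalar or iterable fill_value into
    # something that can be zipped against a list of cubes. Strings and
    # scalars repeat indefinitely; other iterables are used as-is.
    if isinstance(fill_value, str) or not hasattr(fill_value, '__iter__'):
        return repeat(fill_value)
    return iter(fill_value)

cubes = ['cube_a', 'cube_b']
print(list(zip(cubes, normalise_fill_values(-999))))    # same value for each cube
print(list(zip(cubes, normalise_fill_values([1, 2]))))  # one value per cube
```

Strings need the special case because they are themselves iterable, which is exactly the awkwardness the comment above calls out.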

@@ -0,0 +1 @@
* When saving a cube or list of cubes, a fill value or list of fill values can be specified via a new `fill_value` argument. If a `fill_value` argument is not specified, the default fill value for the file format and the cube's data type will be used. Fill values are no longer taken from the cube's `data` attribute when it is a masked array.
Review comment (Member):

Can we add a note explaining that multiple fill values are applied to output cubes in sequence?

@djkirkham (Contributor, Author):

OK

@pp-mo (Member) left a comment:

Let's test for masked points, never for masked array type.

def __setitem__(self, keys, arr):
if self.fill_value is not None:
self.contains_value = self.contains_value or self.fill_value in arr
self.is_masked = self.is_masked or ma.isMaskedArray(arr)
Review comment (Member):

I think this should be using "ma.is_masked" instead of "ma.isMaskedArray".
As stated here "make no difference between masked + non-masked arrays"


def store(data, cf_var, fill_value):
cf_var[:] = data
is_masked = ma.isMaskedArray(data)
Review comment (Member):

Here too should be using "ma.is_masked" instead of "ma.isMaskedArray".
As stated here "make no difference between masked + non-masked arrays"
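The distinction behind both of these comments is that `ma.isMaskedArray` tests the array's type, while `ma.is_masked` tests whether any points are actually masked:

```python
import numpy as np
import numpy.ma as ma

plain = np.array([1, 2, 3])
unmasked = ma.masked_array([1, 2, 3])                # masked type, no masked points
masked = ma.masked_array([1, 2, 3], mask=[0, 1, 0])  # one masked point

print(ma.isMaskedArray(plain), ma.is_masked(plain))        # False False
print(ma.isMaskedArray(unmasked), ma.is_masked(unmasked))  # True False
print(ma.isMaskedArray(masked), ma.is_masked(masked))      # True True
```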

@pp-mo (Member) left a comment:

Very minor points, but I haven't finished checking, so there may be more to add.

@@ -24,10 +24,16 @@
# importing anything else.
import iris.tests as tests

from contextlib import contextmanager
import re
Review comment (Member):

This import is now obsolete

import netCDF4 as nc
import numpy as np
from numpy import ma

import iris
Review comment (Member):

This import is now obsolete

@djkirkham (Contributor, Author):

It's used in a few places

Review comment (Member):

Sorry my mistake, must have been looking at the wrong version.


import iris
import iris._lazy_data
Review comment (Member):

You could just import as_lazy_data here, as that is all that is used.

@pp-mo (Member) commented Sep 4, 2017

NOTE: I think the newer errors
- like "ValueError: cannot reshape array of size 24 into shape (192,)" -
are due to the special branch "jcrist/masked-array" no longer existing,
since the relevant PR dask/dask#2301 was merged to dask/master.

Meanwhile, we can't access the newly-merged latest dask code with conda until we get a distinct version string to identify it -- i.e. latest tag is still "0.15.2", which does not include the masking support.

dask/dask#2654 addresses this.
We should wait for that and then remove the special dask-branch fetch in .travis.yml.
-- see #2762 for that.

@pp-mo (Member) commented Oct 10, 2017

can't access the newly-merged latest dask code

... now we can :

@djkirkham once #2762 is in, can you please re-base this ?
Note: I have been testing a prototype version in : #2780
(with other bits force-merged)

@djkirkham djkirkham closed this Oct 11, 2017
@djkirkham djkirkham reopened this Oct 11, 2017
@djkirkham (Contributor, Author):

@pp-mo Tests are passing now

sman.write(cube, local_keys, unlimited_dimensions, zlib, complevel,
shuffle, fletcher32, contiguous, chunksizes, endian,
least_significant_digit, packing=packspec)
least_significant_digit, packing=packspec,
fill_value=fill_value)

if iris.config.netcdf.conventions_override:
Review comment (Member):

My Eclipse says this needs an extra import to be truly righteous.

@@ -0,0 +1 @@
* When saving a cube or list of cubes, a fill value or list of fill values can be specified via a new `fill_value` argument. If a list is supplied, each fill value will be applied to each cube in turn. If a `fill_value` argument is not specified, the default fill value for the file format and the cube's data type will be used. Fill values are no longer taken from the cube's `data` attribute when it is a masked array.
Review comment (Member):

This isn't a general facility -- it only applies to certain save formats (netcdf only here, PP to be added shortly..)

@djkirkham (Contributor, Author) commented Oct 11, 2017:

ok, I'll add mentions for both file formats here.

Actually I'll just mention netCDF

self.assertTrue(var[index].mask)

def test_mask_lazy_default_fill_value(self):
# Test that masked lazy data saves correctly when given a fill value.
Review comment (Member):

Should say "when not given a fill value" here.

with self._netCDF_var(cube, fill_value=fill_value):
pass

def test_masked_byte_default_fill_value(self):
@pp-mo (Member) commented Oct 11, 2017:

Possibly a missing testcase here?
= saving masked byte data with a nonstandard fill-value

It seems that the existing code will not generate a warning on save for this.
I'm not too clear what you would get when you read it back.
I'm not totally clear what it "ought" to do.

@djkirkham (Contributor, Author):

@pp-mo I think I've addressed all your comments and the tests are now passing.

@djkirkham djkirkham mentioned this pull request Oct 11, 2017
8 tasks
@pp-mo pp-mo merged commit fc0cf49 into SciTools:dask_mask_array Oct 11, 2017
@pp-mo (Member) commented Oct 11, 2017

Thanks for all your patience @djkirkham !

@pp-mo (Member) commented Oct 11, 2017

My Eclipse says this needs an extra import to be truly righteous.

Sadly though, it still says that when the thing was fixed :-(

@QuLogic QuLogic added this to the dask-mask milestone Oct 12, 2017
@pelson (Member) commented Oct 13, 2017

This is an excellent change. Thanks @djkirkham - great work! 🎉

@djkirkham djkirkham deleted the netcdf_save branch October 26, 2017 13:00