Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MDI and preserve lazy data in PP saves. #2744

Merged
merged 3 commits into from
Oct 13, 2017

Conversation

pp-mo
Copy link
Member

@pp-mo pp-mo commented Aug 18, 2017

This fixes one of the test failures listed in #2717 :

expects lazy data but doesn't get it
FAIL: test_lazy_data (iris.tests.unit.fileformats.pp.test_as_pairs.TestAsFields)

Following discussion in #2738, this now abandons any use of masked-array fill_value for setting a PPField BMDI value.

Content:

  • the save rules no longer fetch PPField.data to get a fill-value, nor set BMDI at all.
    • (before, data fetch was realising + so causing the test-fail)
  • the cube-to-field conversion code in save_pairs_from_cube applies the default BMDI to everything
  • the PPField.save() code now also gives a warning when filling masked data, if any unmasked points are equal to the fill-value.

data = data.filled(fill_value=self.bmdi)
if isinstance(data, ma.MaskedArray):
mdi = self.bmdi
if np.any(data == mdi):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we don't need to do precisely the same check for non-masked data? Given we are setting a fill value, there is always the risk of collisions, no?

Copy link
Member Author

@pp-mo pp-mo Aug 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my head it is the "fill" operation itself which is dangerous, not setting the fill-value.
I'm arguing elsewhere that we should take mdi values appearing in un-masked data as actual missing datapoints, rather than real points with a coincidental value.

For PP, at least, that seems workable as the standard BMDI values are quite extreme.
Likewise, on PP load we automatically interpret any points that match MDI as masked : here
I think we have heard rumours that a separate missing-data value may be used in some integer type fields, which might imply a greater risk of "collision", but no standard usage has been agreed + we don't code for it.

It's rather out of scope here, but I think the netCDF4 requirements make a good case for treating masked + unmasked data differently : Otherwise, we will have to complain about any byte value of 255.

@@ -2414,6 +2422,11 @@ def save_pairs_from_cube(cube, field_coords=None, target=None):
pp_field.lbcode = 1 # Grid code.
pp_field.bmks = 1.0 # Some scaley thing.
pp_field.lbproc = 0
# Set default MDI value.
# Note: *we no longer respect any fill_value of masked data arrays*.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "documentation" is going to age really poorly. Is it necessary? Does it not apply to all of these attributes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove the historical reference !

I am concerned though that the effect of the new approach still needs be made a bit clearer :
The user should be able to see what fill-value is used + how you might control it
The developer should also be able to follow the implementation.

mock_warn.assert_not_called()

def test_plainarray_mdi_value_nowarning(self):
# Check that MDI in *unmasked* data does not raise a warning.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like questionable behaviour. Why should we not raise a warning for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a point==mdi in un-masked data, I'd assume that means the data was loaded by non-standard means, and the magic value means it really is a missing point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it really is missing, then let's use masks and/or NaN, not run the risk of genuine collisions.

Copy link
Member Author

@pp-mo pp-mo Aug 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"let's use"

The point is, what interpretation we should put on this case if it does arise.

As I said, if we apply a "no mdi values allowed" rule, then in the case of netcdf data we will generally have no way to save byte data without warnings. Unless we make that a special case, which I really don't much like.
Does this not sway you ?

@@ -1199,7 +1199,8 @@ def test_save_and_merge(self):
self.assertEqual(len(merged_cubes), 1, "expected a single merged cube")
merged_cube = merged_cubes[0]
self.assertEqual(merged_cube.dtype, dtype)
self.assertEqual(merged_cube.data.fill_value, fill_value)
# Note: original data fill-value is *ignored*.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, comment seems superfluous.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it should probably say "check that original data fill-value is ignored".
I do think that is worth pointing out, though,

field.lbuser = 0
field.brsvd = 0
field.bdatum = 0
field = TestPPField()._ready_for_save()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's nit-picky, but why make _ready_for_save a private method if you are calling it externally?

data = data.filled(fill_value=self.bmdi)
if isinstance(data, ma.MaskedArray):
mdi = self.bmdi
if np.any(data == mdi):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still inconsistent with the proposal at #2748 (comment). Specifically:

warn if any fill-values are present (unmasked points only, in case of masked array)
explain in terms of "this won't read back the same"

@pp-mo
Copy link
Member Author

pp-mo commented Aug 25, 2017

@pelson This is still inconsistent with the proposal at #2748

It is.
I decided it will now be easier to fix this when that has gone in.

@pp-mo
Copy link
Member Author

pp-mo commented Oct 10, 2017

Current-status reminder:

I decided it will now be easier to fix this when that has gone in.

Refers #2748, but that is not a PR.

I think #2747 is our testcase for agreeing the new ideas.
I will review + re-submit this when #2747 is in the bag,

@pp-mo
Copy link
Member Author

pp-mo commented Oct 11, 2017

Fully rebased to get tests re-checking.

I will review all the comments above + see what may be outstanding ...

"data points.")
warnings.warn(msg.format(mdi))
if isinstance(data, ma.MaskedArray):
if ma.is_masked(data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ma.is_masked can be used on ndarrays too, so the instance check isn't necessary.

Copy link
Member Author

@pp-mo pp-mo Oct 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except that it also qualifies the need for "data = data.data".
We could have..

            if ma.is_masked(data):
                data = data.filled(fill_value=mdi)
            elif isinstance(data, ma.MaskedArray):
                data = data.data

But frankly I think the existing is clearer ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I missed the data = data.data. I would question the need for that actually, but I won't quibble about it.

elif warning_texts:
msg = "Unexpected warnings raised : {!r}."
msg = msg.format(warning_texts)
self.assertEqual(warning_texts, [], msg)
Copy link
Contributor

@djkirkham djkirkham Oct 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've changed the way this works (perhaps intentionally); previously in the expect_warning=False case, it would only fail if a warning matching the text was raised. The reason I wrote it that way was so that if the code happened to raise some other warning it wouldn't cause the test to fail incorrectly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've changed the way this works (perhaps intentionally)

No, it's a mistake really.
I'm going to investigate (a) putting this as an assert routine in iris.tests.IrisTest, and (b) replacing the 'contains' with a Regexp

@pp-mo pp-mo force-pushed the dask_pp_savepairs branch from b0580c1 to 949db81 Compare October 12, 2017 14:28
msg = ('PPField data contains unmasked points equal to the fill '
"value, {}. As saved, these points will read back as "
"missing data. To save these as normal values, please "
"set the field BMDI not equal to any valid data points.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you not allowing the BMDI to be specified via a keyword to iris.save?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On consideration, I didn't think it was necessary.
It's not the same as the netdcf case, because we don't have to worry about smaller integer types. So, especially if we don't provide an option to easily change BMDI, then chance collisions are extremely unlikely. Likewise, more recent version of the UM spec ("F3 document") don't actually admit the use of non-standard BMDI values.

Copy link
Contributor

@djkirkham djkirkham Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not completely sold - I think there's an argument for providing a uniform interface for both PP and NetCDF. But I think it could easily be changed later (post Iris 2 release) without breaking anything so I won't worry about it now.

@@ -1199,7 +1199,8 @@ def test_save_and_merge(self):
self.assertEqual(len(merged_cubes), 1, "expected a single merged cube")
merged_cube = merged_cubes[0]
self.assertEqual(merged_cube.dtype, dtype)
self.assertEqual(merged_cube.data.fill_value, fill_value)
# Check that the original masked-array fill-value is *ignored*.
self.assertArrayAllClose(merged_cube.data.fill_value, -1e30)
Copy link
Contributor

@djkirkham djkirkham Oct 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not self.assertAlmostEqual?

Copy link
Member Author

@pp-mo pp-mo Oct 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "assert_array_almost_equal" applies an absolute precision, 6 decimal places by default, i.e. value differences up to +/-0.000001.
That doesn't work here, you get :

Arrays are not almost equal to 6 decimals

(mismatch 100.0%)
 x: array(-1.0000000150474662e+30, dtype=float32)
 y: array(-1e+30)

The problem is a mismatch between a float64 and float32 versions of "-1e30".
To get it to work, you have to apply the frankly rather ridiculous "decimal=-23", because of the enormous size of the numbers.

That's why I prefer the "assert_allclose" approach -- the use of relative as well as absolute tolerance is more flexible, and also makes for a better default behaviour.

@@ -527,6 +528,28 @@ def assertRaisesRegexp(self, *args, **kwargs):
return six.assertRaisesRegex(super(IrisTest_nometa, self),
*args, **kwargs)

@contextlib.contextmanager
def assertGivesWarning(self, expected_regexp='', expect_warning=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name could be confusing since this function can be used to check that code doesn't give a warning.

@djkirkham djkirkham merged commit f6d2da7 into SciTools:dask_mask_array Oct 13, 2017
@pelson
Copy link
Member

pelson commented Oct 13, 2017

Thanks for this implementation @pp-mo.

Just as a reminder: mdi round-trip ability loss will need to be documented in the release notes.

@QuLogic QuLogic added this to the dask-mask milestone Oct 13, 2017
@pp-mo
Copy link
Member Author

pp-mo commented Oct 16, 2017

mdi round-trip ability loss will need to be documented in the release notes

Whoops. Made a new project issue for that #2798

@pp-mo pp-mo deleted the dask_pp_savepairs branch October 17, 2019 10:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants