-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix MDI and preserve lazy data in PP saves. #2744
Conversation
lib/iris/fileformats/pp.py
Outdated
data = data.filled(fill_value=self.bmdi) | ||
if isinstance(data, ma.MaskedArray): | ||
mdi = self.bmdi | ||
if np.any(data == mdi): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come we don't need to do precisely the same check for non-masked data? Given we are setting a fill value, there is always the risk of collisions, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my head it is the "fill" operation itself which is dangerous, not setting the fill-value.
I'm arguing elsewhere that we should take mdi values appearing in un-masked data as actual missing datapoints, rather than real points with a coincidental value.
For PP, at least, that seems workable as the standard BMDI values are quite extreme.
Likewise, on PP load we automatically interpret any points that match MDI as masked : here
I think we have heard rumours that a separate missing-data value may be used in some integer type fields, which might imply a greater risk of "collision", but no standard usage has been agreed + we don't code for it.
It's rather out of scope here, but I think the netCDF4 requirements make a good case for treating masked + unmasked data differently : Otherwise, we will have to complain about any byte value of 255.
lib/iris/fileformats/pp.py
Outdated
@@ -2414,6 +2422,11 @@ def save_pairs_from_cube(cube, field_coords=None, target=None): | |||
pp_field.lbcode = 1 # Grid code. | |||
pp_field.bmks = 1.0 # Some scaley thing. | |||
pp_field.lbproc = 0 | |||
# Set default MDI value. | |||
# Note: *we no longer respect any fill_value of masked data arrays*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This "documentation" is going to age really poorly. Is it necessary? Does it not apply to all of these attributes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove the historical reference !
I am concerned though that the effect of the new approach still needs be made a bit clearer :
The user should be able to see what fill-value is used + how you might control it
The developer should also be able to follow the implementation.
mock_warn.assert_not_called() | ||
|
||
def test_plainarray_mdi_value_nowarning(self): | ||
# Check that MDI in *unmasked* data does not raise a warning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like questionable behaviour. Why should we not raise a warning for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have a point==mdi in un-masked data, I'd assume that means the data was loaded by non-standard means, and the magic value means it really is a missing point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it really is missing, then let's use masks and/or NaN, not run the risk of genuine collisions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"let's use"
The point is, what interpretation we should put on this case if it does arise.
As I said, if we apply a "no mdi values allowed" rule, then in the case of netcdf data we will generally have no way to save byte data without warnings. Unless we make that a special case, which I really don't much like.
Does this not sway you ?
lib/iris/tests/test_cdm.py
Outdated
@@ -1199,7 +1199,8 @@ def test_save_and_merge(self): | |||
self.assertEqual(len(merged_cubes), 1, "expected a single merged cube") | |||
merged_cube = merged_cubes[0] | |||
self.assertEqual(merged_cube.dtype, dtype) | |||
self.assertEqual(merged_cube.data.fill_value, fill_value) | |||
# Note: original data fill-value is *ignored*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, comment seems superfluous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it should probably say "check that original data fill-value is ignored".
I do think that is worth pointing out, though,
field.lbuser = 0 | ||
field.brsvd = 0 | ||
field.bdatum = 0 | ||
field = TestPPField()._ready_for_save() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's nit-picky, but why make _ready_for_save
a private method if you are calling it externally?
lib/iris/fileformats/pp.py
Outdated
data = data.filled(fill_value=self.bmdi) | ||
if isinstance(data, ma.MaskedArray): | ||
mdi = self.bmdi | ||
if np.any(data == mdi): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still inconsistent with the proposal at #2748 (comment). Specifically:
warn if any fill-values are present (unmasked points only, in case of masked array)
explain in terms of "this won't read back the same"
b8de183
to
3efa90a
Compare
Fully rebased to get tests re-checking. I will review all the comments above + see what may be outstanding ... |
lib/iris/fileformats/pp.py
Outdated
"data points.") | ||
warnings.warn(msg.format(mdi)) | ||
if isinstance(data, ma.MaskedArray): | ||
if ma.is_masked(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ma.is_masked
can be used on ndarray
s too, so the instance check isn't necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except that it also qualifies the need for "data = data.data".
We could have..
if ma.is_masked(data):
data = data.filled(fill_value=mdi)
elif isinstance(data, ma.MaskedArray):
data = data.data
But frankly I think the existing is clearer ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry, I missed the data = data.data
. I would question the need for that actually, but I won't quibble about it.
elif warning_texts: | ||
msg = "Unexpected warnings raised : {!r}." | ||
msg = msg.format(warning_texts) | ||
self.assertEqual(warning_texts, [], msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've changed the way this works (perhaps intentionally); previously in the expect_warning=False
case, it would only fail if a warning matching the text was raised. The reason I wrote it that way was so that if the code happened to raise some other warning it wouldn't cause the test to fail incorrectly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've changed the way this works (perhaps intentionally)
No, it's a mistake really.
I'm going to investigate (a) putting this as an assert routine in iris.tests.IrisTest, and (b) replacing the 'contains' with a Regexp
b0580c1
to
949db81
Compare
msg = ('PPField data contains unmasked points equal to the fill ' | ||
"value, {}. As saved, these points will read back as " | ||
"missing data. To save these as normal values, please " | ||
"set the field BMDI not equal to any valid data points.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you not allowing the BMDI to be specified via a keyword to iris.save
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On consideration, I didn't think it was necessary.
It's not the same as the netdcf case, because we don't have to worry about smaller integer types. So, especially if we don't provide an option to easily change BMDI, then chance collisions are extremely unlikely. Likewise, more recent version of the UM spec ("F3 document") don't actually admit the use of non-standard BMDI values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not completely sold - I think there's an argument for providing a uniform interface for both PP and NetCDF. But I think it could easily be changed later (post Iris 2 release) without breaking anything so I won't worry about it now.
@@ -1199,7 +1199,8 @@ def test_save_and_merge(self): | |||
self.assertEqual(len(merged_cubes), 1, "expected a single merged cube") | |||
merged_cube = merged_cubes[0] | |||
self.assertEqual(merged_cube.dtype, dtype) | |||
self.assertEqual(merged_cube.data.fill_value, fill_value) | |||
# Check that the original masked-array fill-value is *ignored*. | |||
self.assertArrayAllClose(merged_cube.data.fill_value, -1e30) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not self.assertAlmostEqual
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "assert_array_almost_equal" applies an absolute precision, 6 decimal places by default, i.e. value differences up to +/-0.000001.
That doesn't work here, you get :
Arrays are not almost equal to 6 decimals
(mismatch 100.0%)
x: array(-1.0000000150474662e+30, dtype=float32)
y: array(-1e+30)
The problem is a mismatch between a float64 and float32 versions of "-1e30".
To get it to work, you have to apply the frankly rather ridiculous "decimal=-23", because of the enormous size of the numbers.
That's why I prefer the "assert_allclose" approach -- the use of relative as well as absolute tolerance is more flexible, and also makes for a better default behaviour.
@@ -527,6 +528,28 @@ def assertRaisesRegexp(self, *args, **kwargs): | |||
return six.assertRaisesRegex(super(IrisTest_nometa, self), | |||
*args, **kwargs) | |||
|
|||
@contextlib.contextmanager | |||
def assertGivesWarning(self, expected_regexp='', expect_warning=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name could be confusing since this function can be used to check that code doesn't give a warning.
Thanks for this implementation @pp-mo. Just as a reminder: |
Whoops. Made a new project issue for that #2798 |
This fixes one of the test failures listed in #2717 :
Following discussion in #2738, this now abandons any use of masked-array fill_value for setting a PPField BMDI value.
Content:
save_pairs_from_cube
applies the default BMDI to everything