Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle comparison to undefined nodata values #269

Merged
merged 32 commits into from
Sep 2, 2020

Conversation

emlys
Copy link
Member

@emlys emlys commented Aug 17, 2020

#228

  • Add a new function utils.is_valid(array, nodata) that handles the condition where the nodata value is undefined. This replaces dozens of instances of ~numpy.isclose(array, nodata).
  • Add unit tests for the new function in test_utils.py.
  • Add more complex test to test_seasonal_water_yield_regression.py, testing actual usage of the function

@emlys emlys self-assigned this Aug 17, 2020
@emlys emlys changed the base branch from master to release/3.9 August 17, 2020 22:54
@emlys emlys marked this pull request as ready for review August 18, 2020 23:03
@emlys emlys requested a review from dcdenu4 August 18, 2020 23:50
@emlys emlys linked an issue Aug 18, 2020 that may be closed by this pull request
3 tasks
@emlys
Copy link
Member Author

emlys commented Aug 19, 2020

see convo here #228 (comment)

Copy link
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @emlys , I thought I'd take a look at this since we had been talking about it a bit yesterday. I left a few comments in a couple critical places, and I'm sorry that we hadn't had a chance to talk about our design process before this ... but it's a great time to talk about it now! Would you have some time to talk this afternoon? I'll ping you on slack as well.

And as a side note, would it be possible to set more descriptive PR titles? The indirection of having to look up the nature of #228 makes it a bit harder to grok the PR from the PR page listing. Thanks!

Comment on lines 32 to 41
cdef int is_close(double x, double y):
cdef int double_is_close(double x, double y):
return abs(x-y) <= (1e-8+1e-05*abs(y))

def is_close(a, b):
"""Compare values which may be numeric or None"""
if a and b:
return double_is_close(a, b)
else:
return 1 if a == b else 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Emily, we need to be really careful within cython modules for all sorts of reasons. In this case, I don't think we should adapt is_close to handle a case where the nodata value could be None because of how and how often it's called. This is a function that's called for every pixel ... we therefore need the function to be as blazingly fast as possible. Note the cdef preceding the old is_close here. Such a function compiles down to a C function. When we prefix a function with def rather than cdef, the function is a python function and incurse significant overhead.

We therefore have two options for handling the case where the nodata value might not be defined:

  1. We can pick a nodata value that makes sense within the context of the problem if one isn't already defined.
  2. We can require a nodata value to be defined.

Either way, we really really don't want is_close to be a python function.

Comment on lines 769 to 772
if nodata:
return ~numpy.isclose(array, nodata)
else:
return array != nodata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of things that will probably turn out to be problematic here:

  1. Python's truthiness rules allow for values of 0, False, empty iterables and a few other things to all be interpreted as False. How will this function behave if we have a nodata value of 0, which is a valid input?
  2. similar to @richpsharp 's comment at Undefined nodata values cause nodata checking to fail in numpy.isclose #228 (comment), the function name is_valid makes it sound like this function is for checking whether a pixel is valid ... but all it's actually checking for is whether the value is close to nodata. Function names seem like a fairly trivial thing, but it's so so important for us to be clear about the intent of a function and what it's really trying to do. So if the intent is to determine whether a value is close to nodata, then we should really be expressing that in the title. What, exactly, is a valid pixel, is highly dependent on the use case.
  3. If the intent here is actually to create a single function that checks to see if a pixel value is close to nodata, then we need to accommodate the differences in floating-point and integer matrices. It turns out that if we know in advance that we have an integer matrix, then we can take advantage of order-of-magnitude speedups in per-element comparisons: integer_array != nodata is an order of magnitude faster than numpy.isclose(integer_array, nodata). On very large datasets, this can make a noticeable difference. Here's the timing details:
>>> timeit.timeit('numpy.isclose(ones, 0)', setup='import numpy; ones=numpy.ones((50, 50), dtype=numpy.float32)', number=100000)
5.014724600000022
>>> timeit.timeit('numpy.isclose(ones, 0)', setup='import numpy; ones=numpy.ones((50, 50), dtype=numpy.int16)', number=100000)
4.701722299999915
>>> timeit.timeit('ones != 0', setup='import numpy; ones=numpy.ones((50, 50), dtype=numpy.int16)', number=100000)
0.315173499999986
  1. By making it necessary to have an extra function callthrough, we're not only obfuscating the intent of the function, there's also a (relatively minor) computational overhead. The main point here is that the value gained from having a function should outweigh the cognitive and computational overhead of needing to look elsewhere at what the function is doing. This function is so simple that I'm not sure this threshold is met.

@emlys emlys changed the title Bug/228 Handle comparison to undefined nodata values Aug 19, 2020
@emlys
Copy link
Member Author

emlys commented Aug 21, 2020

After conversation with @phargogh about the issues above, I removed utils.is_valid from everywhere, and just check if nodata_value is None where needed. It turns out that in a lot of places where it's operating on an intermediate output raster, not a user input, that the nodata value is guaranteed to be defined, because it's set either in the model or in pygeoprocessing.
One note about the order of operations here: typically the nodata issue arises when making a valid_mask boolean array. A common pattern is to:

  1. Initialize the output array filled with the nodata value
  2. Create the valid_mask boolean array
  3. Apply whatever calculation to input_array[valid_mask] and store the result to output_array[valid_mask]
    (for example: https://github.com/natcap/invest/pull/269/files#diff-9c24b665c47f29dc86d5c6394d33d1dbR578)
    I'm guessing it's done this way to save unnecessary calculation: the calculation isn't done on any pixel that will end up being nodata anyway.

However, with the need to check if nodata is None, this pattern makes the code a little uglier, because you have to set valid_mask to an array where every element is True (e.g. https://github.com/natcap/invest/pull/269/files#diff-9c24b665c47f29dc86d5c6394d33d1dbR581). A neater pattern for this is:

  1. Initialize the empty output array
  2. Apply whatever calculation to every pixel
  3. If nodata is not None, set every nodata pixel to the output nodata value.
    This looks a little nicer and avoids doing any masking when nodata is undefined. It has the downside of performing extra calculations that get overwritten in step 3. I tried out this pattern in a couple of places where the calculation being done is very simple, so the loss of efficiency is probably negligible: e.g.https://github.com/natcap/invest/pull/269/files#diff-9c24b665c47f29dc86d5c6394d33d1dbR462

Any thoughts on the tradeoff of efficiency vs. nicer-looking code in this situation, the likelihood of nodata being undefined, and the amount of pixels that end up having nodata?

@emlys emlys closed this Aug 21, 2020
@emlys emlys reopened this Aug 21, 2020
@emlys emlys requested a review from phargogh August 21, 2020 19:20
@phargogh
Copy link
Member

phargogh commented Aug 25, 2020

@emlys You're very right that the common pattern you mention (initialize the output array, create a mask of valid pixels, limit the calculations to valid pixels) is indeed to limit computation to only those places where it is needed. In general, the internals of InVEST should do what's best and most efficient for the user in the general case, even if it's a little trickier for us to program and to maintain (without gutting maintainability, though!), and also even if it underperforms in certain specific edge cases. All that within the context of the problem, of course!

In the simplest of cases, where a valid_mask is either present or not present when operating on a single raster (like in the globio example you linked to above), might I suggest the following slight tweak to the conditional you had proposed?

if globio_nodata is None:
    valid_mask = slice(None)  # the ``slice`` equivalent of ``:``
else:
    valid_mask = ~numpy.isclose(msa_f, globio_nodata)
result[valid_mask] = msa_f[valid_mask] * msa_lu[valid_mask] * msa_i[valid_mask]

Or, more abbreviated:

valid_mask = slice(None)  # the ``slice`` equivalent of ``:``
if globio_nodata is not None:
    valid_mask = ~numpy.isclose(msa_f, globio_nodata)
result[valid_mask] = msa_f[valid_mask] * msa_lu[valid_mask] * msa_i[valid_mask]

With this slight tweak, the slice(None) object creation happens in constant time regardless of the size of the input data, the indexing is an order of magnitude less than boolean array indexing, and we can easily (and cheaply, thanks to python's memory management) overwrite the slice if the nodata value is defined. Interestingly, this doesn't completely avoid indexing, and so in some edge cases this will underperform ... but I think this might be the fastest general-case, DRY solution given what we know about numpy's array indexing.

I found the timings for this to be rather surprising, but slice(None) appears to be the equivalent of using : in array indexing notation (and the python docs support this). Timings are included below if you'd like to try it out!

Timings
>>> import timeit

# Baseline: index into all elements
>>> timeit.timeit('array[:]', setup='import numpy; array=numpy.zeros((50,50))')
0.22549130000000162

# Indexing using a slice
>>> timeit.timeit('array[sl]', setup='import numpy; array=numpy.zeros((50,50)); sl=slice(None)')
0.22285999999985506

# Index using `True`, which should select all elements
>>> timeit.timeit('array[True]', setup='import numpy; array=numpy.zeros((50,50))')
3.0759548999999993

# Index using a boolean array of 1/True
>>> timeit.timeit('array[index]', setup='import numpy; array=numpy.zeros((50,50)); index=numpy.ones((50,50), dtype=numpy.bool)')
3.8698592000000076

Anyways, what do you think about this RE: tradeoffs between efficiency, readability, maintainability and general-case solutions?

@emlys
Copy link
Member Author

emlys commented Aug 25, 2020

@emlys You're very right that the common pattern you mention (initialize the output array, create a mask of valid pixels, limit the calculations to valid pixels) is indeed to limit computation to only those places where it is needed. In general, the internals of InVEST should do what's best and most efficient for the user in the general case, even if it's a little trickier for us to program and to maintain (without gutting maintainability, though!), and also even if it underperforms in certain specific edge cases. All that within the context of the problem, of course!

In the simplest of cases, where a valid_mask is either present or not present when operating on a single raster (like in the globio example you linked to above), might I suggest the following slight tweak to the conditional you had proposed?

if globio_nodata is None:
    valid_mask = slice(None)  # the ``slice`` equivalent of ``:``
else:
    valid_mask = ~numpy.isclose(msa_f, globio_nodata)
result[valid_mask] = msa_f[valid_mask] * msa_lu[valid_mask] * msa_i[valid_mask]

Or, more abbreviated:

valid_mask = slice(None)  # the ``slice`` equivalent of ``:``
if globio_nodata is not None:
    valid_mask = ~numpy.isclose(msa_f, globio_nodata)
result[valid_mask] = msa_f[valid_mask] * msa_lu[valid_mask] * msa_i[valid_mask]

With this slight tweak, the slice(None) object creation happens in constant time regardless of the size of the input data, the indexing is an order of magnitude less than boolean array indexing, and we can easily (and cheaply, thanks to python's memory management) overwrite the slice if the nodata value is defined. Interestingly, this doesn't completely avoid indexing, and so in some edge cases this will underperform ... but I think this might be the fastest general-case, DRY solution given what we know about numpy's array indexing.

I found the timings for this to be rather surprising, but slice(None) appears to be the equivalent of using : in array indexing notation (and the python docs support this). Timings are included below if you'd like to try it out!
Timings

Anyways, what do you think about this RE: tradeoffs between efficiency, readability, maintainability and general-case solutions?

I like the slice(None)! Don't think I've ever seen slice() used before. I just replaced all my uses of numpy.full(True) with it.
Agree that with that notation the decrease in readability is minimal, and worth it to avoid slowing down the computation.

Copy link
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one or two small comments, but overall this looks like a really nice handling of nodata values!

I did have one other question, though, which was that it looks like this PR handles the use of numpy.isclose for comparing nodata values. Were you also able to take a look at the places where nodata checking is done by !=?

@@ -1282,7 +1282,8 @@ def extract_bathymetry_along_ray(
'bathymetry input fully cover the fetch ray area?'
% (value, {'xoff': ix, 'yoff': iy,
'win_xsize': win_size, 'win_ysize': win_size}))
if ~numpy.isclose(value, bathy_nodata):

if bathy_nodata is None or ~numpy.isclose(value, bathy_nodata):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is kind of our of the scope of this PR, but since we're only working with a single value here, would you be OK changing the ~numpy.isclose here to not math.isclose? math.isclose looks to be about 200 times faster than numpy.isclose for a single value.

>>> timeit.timeit('~numpy.isclose(1.1, 1.0)', setup='import numpy')
47.977831700000024
>>>timeit.timeit('not math.isclose(1.1, 1.0)', setup='import math')
0.18105119999999886

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool! Is there some resource you have for finding faster alternatives like this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: value is actually an array, not a single value, so math.isclose doesn't work :( I changed the variable name to values to be less confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, sorry to disappoint! I unfortunately don't know of such a resource. In this case, I just saw that we were using numpy but then only checking an array with a single value in it. Numpy's great for operating on arrays that might be very large, but for a single value, it's often faster to just directly compare the values. Plus, numpy does a whole lot of extra work to ensure that it can actually compare the values, compared with the python implementation of isclose, which is about as short as you'd expect for this.

Also I was a bit wrong about the timings I had included above. Since this is a 1-element numpy array, here are the corrected timings:

>>> timeit.timeit('~numpy.isclose(array, 1)', setup='import numpy; array=numpy.array([[1.1]], dtype=numpy.float32)')
42.59300510000003
>>> timeit.timeit('not math.isclose(array[0][0], 1)', setup='import numpy; import math; array=numpy.array([[1.1]], dtype=numpy.float32)')
0.5673721999999941

Not quite as fast thanks to the need to do that indexing, but still substantially faster.

Comment on lines +746 to +750
if observed_yield_nodata is not None:
valid_mask = ~numpy.isclose(yield_block, observed_yield_nodata)
production_pixel_count += numpy.count_nonzero(
~numpy.isclose(yield_block, observed_yield_nodata) &
(yield_block > 0.0))
yield_sum += numpy.sum(
yield_block[
~numpy.isclose(observed_yield_nodata, yield_block)])
valid_mask & (yield_block > 0.0))
yield_sum += numpy.sum(yield_block[valid_mask])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! this looks like it'll be a bit more efficient, as well as supporting the possibility of an undefined nodata value.

Comment on lines 460 to 461
result = numpy.empty_like(lulc_array, dtype=numpy.int16)
result[:] = primary_veg_mask_nodata
result = lulc_array == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be reading this wrong, but wouldn't this second result = line set result to be a new array, and this time of type numpy.bool?

And on a separate note, it looks like this change would also remove the primary_veg_mask_nodata entirely from this function, which sounds like an unintended change!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, you're totally right!

@emlys
Copy link
Member Author

emlys commented Aug 27, 2020

I did have one other question, though, which was that it looks like this PR handles the use of numpy.isclose for comparing nodata values. Were you also able to take a look at the places where nodata checking is done by !=?

I did not, but it seems that where != is used, it shouldn't matter if the nodata value is defined or not.

>>> a = numpy.array([1,2,3])
>>> a != None
array([ True,  True,  True])

Is this what you're referring to?

@phargogh
Copy link
Member

Ah, good point about the != comparison! So yes, we should be functionally ok on that front. I suppose it could be a bit faster to avoid the comparison entirely, but that's outside the scope of this issue and it won't break anything as-is.

dcdenu4 added a commit to dcdenu4/invest that referenced this pull request Aug 28, 2020
Using the latest 2.1+ version of `pgp.convolve_2d`, the nodata handling
is different. For the UCM test, the convolved raster was left with an
undefined nodata value which caused the `wbgt_op` to crash. Since this
is being worked on in PR natcap#269 I copied the current solution from there
to hopefully avoid merge conflicts in the future.
@emlys emlys requested a review from phargogh August 28, 2020 18:16
Copy link
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @emlys thanks for taking a look at my comments! I mostly wanted to follow up about the possible isclose change. If I'm reading the function correctly, then it looks like we're only ever reading in a single pixel right there, which means that we mght be able to safely extract that one pixel value. Anyways, wanted to check with you about this in case you thought that was worth updating ... let me know! I think this PR otherwise looks good to go.

Comment on lines 1276 to 1287
value = bathy_band.ReadAsArray(
values = bathy_band.ReadAsArray(
xoff=ix, yoff=iy,
win_xsize=win_size, win_ysize=win_size)
if value is None:
if values is None:
raise ValueError(
'got a %s when trying to read bathymetry at %s. Does the '
'bathymetry input fully cover the fetch ray area?'
% (value, {'xoff': ix, 'yoff': iy,
% (values, {'xoff': ix, 'yoff': iy,
'win_xsize': win_size, 'win_ysize': win_size}))
if ~numpy.isclose(value, bathy_nodata):
bathy_values.append(value)

if bathy_nodata is None or ~numpy.isclose(values, bathy_nodata):
bathy_values.append(values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey I'm sorry about the somewhat confusing chain of timings and comments over in the other thread. You're right that we can't use math.isclose directly on the array, but since we know ahead of time that this will be a single value, we can just index into that one item, values[0][0] or the like: not math.isclose(values[0][0])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay, that makes sense. The nodata value was also (incorrectly) a list so I fixed that in calculate_wind_exposure

nodata_mask = numpy.full(infrastructure_array_list[0].shape, True)
infrastructure_mask = numpy.full(infrastructure_array_list[0].shape, False)

for index in range(0, len(infrastructure_array_list)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty trivial change, but since we're starting from index 0 (the default), this can be abbreviated to range(len(infrastructure_array_list)) if you like to achieve the same effect.

Copy link
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @emlys , I'm sorry to have not included this in my previous review! I was thinking about this part of your PR over the weekend, and I think there's one small edit that would be a nice optimization to the SWY cython function. What do you think?

Comment on lines 624 to 631
# make sure that user input nodata values are defined
# precipitation and evapotranspiration data should
# always be non-negative
if precip_nodata is None:
precip_nodata = -1
if et0_nodata is None:
et0_nodata = -1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this over the weekend. Checking for and assigning values for precip_nodata and et0_nodata here will make these extra lines run many, many times (once for each month, but also for each pixel in the work_queue, which ends up being for each valid pixel). Of course it's quite possible that the compiler will optimize this condition away so that it doesn't need to be run so many times, but since we know in advance that this can just be defined elsewhere, would you be ok with moving it outside of the main processing loop? Offhand, it looks like these could be checked for once, up near the top of this function (around line 448 for et0, 456 for precip).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! Just moved it outside of the loop

@emlys emlys requested a review from phargogh August 31, 2020 21:02
Copy link
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work on this @emlys !

Comment on lines 452 to +453
et0_m_nodata_list.append(
pygeoprocessing.get_raster_info(et0_path)['nodata'][0])
pygeoprocessing.get_raster_info(et0_path)['nodata'][0] or -1)
Copy link
Member

@phargogh phargogh Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat! this is a bit of python syntax that I was not expecting to work as a sort of abbreviated ternary operator, but it absolutely does.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh heck! This is going to have the same problem with truthiness of 0... so 0 or -1 will be -1. I can fix this in my next PR...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, so it does! I'm sorry I didn't catch that. Yes, let's do this in the next PR.

@phargogh phargogh merged commit e3f9515 into natcap:release/3.9 Sep 2, 2020
@emlys emlys mentioned this pull request Sep 3, 2020
@emlys emlys deleted the bug/228 branch March 8, 2023 00:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Undefined nodata values cause nodata checking to fail in numpy.isclose
3 participants