Handle comparison to undefined nodata values #269

emlys · 2020-08-17T22:53:32Z

Add a new function utils.is_valid(array, nodata) that handles the condition where the nodata value is undefined. This replaces dozens of instances of ~numpy.isclose(array, nodata).
Add unit tests for the new function in test_utils.py.
Add more complex test to test_seasonal_water_yield_regression.py, testing actual usage of the function

emlys · 2020-08-19T17:36:59Z

phargogh

Hey @emlys , I thought I'd take a look at this since we had been talking about it a bit yesterday. I left a few comments in a couple critical places, and I'm sorry that we hadn't had a chance to talk about our design process before this ... but it's a great time to talk about it now! Would you have some time to talk this afternoon? I'll ping you on slack as well.

And as a side note, would it be possible to set more descriptive PR titles? The indirection of having to look up the nature of #228 makes it a bit harder to grok the PR from the PR page listing. Thanks!

phargogh · 2020-08-19T18:39:42Z

src/natcap/invest/seasonal_water_yield/seasonal_water_yield_core.pyx

-cdef int is_close(double x, double y):
+cdef int double_is_close(double x, double y):
    return abs(x-y) <= (1e-8+1e-05*abs(y))

+def is_close(a, b):
+    """Compare values which may be numeric or None"""
+    if a and b:
+        return double_is_close(a, b)
+    else:
+        return 1 if a == b else 0
+


Hey Emily, we need to be really careful within cython modules for all sorts of reasons. In this case, I don't think we should adapt is_close to handle a case where the nodata value could be None because of how and how often it's called. This is a function that's called for every pixel ... we therefore need the function to be as blazingly fast as possible. Note the cdef preceding the old is_close here. Such a function compiles down to a C function. When we prefix a function with def rather than cdef, the function is a python function and incurse significant overhead.

We therefore have two options for handling the case where the nodata value might not be defined:

We can pick a nodata value that makes sense within the context of the problem if one isn't already defined.

We can require a nodata value to be defined.

Either way, we really really don't want is_close to be a python function.

phargogh · 2020-08-19T19:00:20Z

src/natcap/invest/utils.py

+    if nodata:
+        return ~numpy.isclose(array, nodata)
+    else:
+        return array != nodata


There are a couple of things that will probably turn out to be problematic here:

Python's truthiness rules allow for values of 0, False, empty iterables and a few other things to all be interpreted as False. How will this function behave if we have a nodata value of 0, which is a valid input?

similar to @richpsharp 's comment at Undefined nodata values cause nodata checking to fail in numpy.isclose #228 (comment), the function name is_valid makes it sound like this function is for checking whether a pixel is valid ... but all it's actually checking for is whether the value is close to nodata. Function names seem like a fairly trivial thing, but it's so so important for us to be clear about the intent of a function and what it's really trying to do. So if the intent is to determine whether a value is close to nodata, then we should really be expressing that in the title. What, exactly, is a valid pixel, is highly dependent on the use case.

If the intent here is actually to create a single function that checks to see if a pixel value is close to nodata, then we need to accommodate the differences in floating-point and integer matrices. It turns out that if we know in advance that we have an integer matrix, then we can take advantage of order-of-magnitude speedups in per-element comparisons: integer_array != nodata is an order of magnitude faster than numpy.isclose(integer_array, nodata). On very large datasets, this can make a noticeable difference. Here's the timing details:

>>> timeit.timeit('numpy.isclose(ones, 0)', setup='import numpy; ones=numpy.ones((50, 50), dtype=numpy.float32)', number=100000) 5.014724600000022 >>> timeit.timeit('numpy.isclose(ones, 0)', setup='import numpy; ones=numpy.ones((50, 50), dtype=numpy.int16)', number=100000) 4.701722299999915 >>> timeit.timeit('ones != 0', setup='import numpy; ones=numpy.ones((50, 50), dtype=numpy.int16)', number=100000) 0.315173499999986

By making it necessary to have an extra function callthrough, we're not only obfuscating the intent of the function, there's also a (relatively minor) computational overhead. The main point here is that the value gained from having a function should outweigh the cognitive and computational overhead of needing to look elsewhere at what the function is doing. This function is so simple that I'm not sure this threshold is met.

emlys · 2020-08-21T17:37:14Z

After conversation with @phargogh about the issues above, I removed utils.is_valid from everywhere, and just check if nodata_value is None where needed. It turns out that in a lot of places where it's operating on an intermediate output raster, not a user input, that the nodata value is guaranteed to be defined, because it's set either in the model or in pygeoprocessing.
One note about the order of operations here: typically the nodata issue arises when making a valid_mask boolean array. A common pattern is to:

Initialize the output array filled with the nodata value
Create the valid_mask boolean array
Apply whatever calculation to input_array[valid_mask] and store the result to output_array[valid_mask]
(for example: https://github.com/natcap/invest/pull/269/files#diff-9c24b665c47f29dc86d5c6394d33d1dbR578)
I'm guessing it's done this way to save unnecessary calculation: the calculation isn't done on any pixel that will end up being nodata anyway.

However, with the need to check if nodata is None, this pattern makes the code a little uglier, because you have to set valid_mask to an array where every element is True (e.g. https://github.com/natcap/invest/pull/269/files#diff-9c24b665c47f29dc86d5c6394d33d1dbR581). A neater pattern for this is:

Initialize the empty output array
Apply whatever calculation to every pixel
If nodata is not None, set every nodata pixel to the output nodata value.
This looks a little nicer and avoids doing any masking when nodata is undefined. It has the downside of performing extra calculations that get overwritten in step 3. I tried out this pattern in a couple of places where the calculation being done is very simple, so the loss of efficiency is probably negligible: e.g.https://github.com/natcap/invest/pull/269/files#diff-9c24b665c47f29dc86d5c6394d33d1dbR462

Any thoughts on the tradeoff of efficiency vs. nicer-looking code in this situation, the likelihood of nodata being undefined, and the amount of pixels that end up having nodata?

phargogh · 2020-08-25T02:51:42Z

@emlys You're very right that the common pattern you mention (initialize the output array, create a mask of valid pixels, limit the calculations to valid pixels) is indeed to limit computation to only those places where it is needed. In general, the internals of InVEST should do what's best and most efficient for the user in the general case, even if it's a little trickier for us to program and to maintain (without gutting maintainability, though!), and also even if it underperforms in certain specific edge cases. All that within the context of the problem, of course!

In the simplest of cases, where a valid_mask is either present or not present when operating on a single raster (like in the globio example you linked to above), might I suggest the following slight tweak to the conditional you had proposed?

if globio_nodata is None:
    valid_mask = slice(None)  # the ``slice`` equivalent of ``:``
else:
    valid_mask = ~numpy.isclose(msa_f, globio_nodata)
result[valid_mask] = msa_f[valid_mask] * msa_lu[valid_mask] * msa_i[valid_mask]

Or, more abbreviated:

valid_mask = slice(None)  # the ``slice`` equivalent of ``:``
if globio_nodata is not None:
    valid_mask = ~numpy.isclose(msa_f, globio_nodata)
result[valid_mask] = msa_f[valid_mask] * msa_lu[valid_mask] * msa_i[valid_mask]

With this slight tweak, the slice(None) object creation happens in constant time regardless of the size of the input data, the indexing is an order of magnitude less than boolean array indexing, and we can easily (and cheaply, thanks to python's memory management) overwrite the slice if the nodata value is defined. Interestingly, this doesn't completely avoid indexing, and so in some edge cases this will underperform ... but I think this might be the fastest general-case, DRY solution given what we know about numpy's array indexing.

I found the timings for this to be rather surprising, but slice(None) appears to be the equivalent of using : in array indexing notation (and the python docs support this). Timings are included below if you'd like to try it out!

Timings

>>> import timeit

# Baseline: index into all elements
>>> timeit.timeit('array[:]', setup='import numpy; array=numpy.zeros((50,50))')
0.22549130000000162

# Indexing using a slice
>>> timeit.timeit('array[sl]', setup='import numpy; array=numpy.zeros((50,50)); sl=slice(None)')
0.22285999999985506

# Index using `True`, which should select all elements
>>> timeit.timeit('array[True]', setup='import numpy; array=numpy.zeros((50,50))')
3.0759548999999993

# Index using a boolean array of 1/True
>>> timeit.timeit('array[index]', setup='import numpy; array=numpy.zeros((50,50)); index=numpy.ones((50,50), dtype=numpy.bool)')
3.8698592000000076

Anyways, what do you think about this RE: tradeoffs between efficiency, readability, maintainability and general-case solutions?

emlys · 2020-08-25T17:47:13Z

@emlys You're very right that the common pattern you mention (initialize the output array, create a mask of valid pixels, limit the calculations to valid pixels) is indeed to limit computation to only those places where it is needed. In general, the internals of InVEST should do what's best and most efficient for the user in the general case, even if it's a little trickier for us to program and to maintain (without gutting maintainability, though!), and also even if it underperforms in certain specific edge cases. All that within the context of the problem, of course!

In the simplest of cases, where a valid_mask is either present or not present when operating on a single raster (like in the globio example you linked to above), might I suggest the following slight tweak to the conditional you had proposed?
if globio_nodata is None:
    valid_mask = slice(None)  # the ``slice`` equivalent of ``:``
else:
    valid_mask = ~numpy.isclose(msa_f, globio_nodata)
result[valid_mask] = msa_f[valid_mask] * msa_lu[valid_mask] * msa_i[valid_mask]
Or, more abbreviated:
valid_mask = slice(None)  # the ``slice`` equivalent of ``:``
if globio_nodata is not None:
    valid_mask = ~numpy.isclose(msa_f, globio_nodata)
result[valid_mask] = msa_f[valid_mask] * msa_lu[valid_mask] * msa_i[valid_mask]
With this slight tweak, the slice(None) object creation happens in constant time regardless of the size of the input data, the indexing is an order of magnitude less than boolean array indexing, and we can easily (and cheaply, thanks to python's memory management) overwrite the slice if the nodata value is defined. Interestingly, this doesn't completely avoid indexing, and so in some edge cases this will underperform ... but I think this might be the fastest general-case, DRY solution given what we know about numpy's array indexing.

I found the timings for this to be rather surprising, but slice(None) appears to be the equivalent of using : in array indexing notation (and the python docs support this). Timings are included below if you'd like to try it out!
Timings

Anyways, what do you think about this RE: tradeoffs between efficiency, readability, maintainability and general-case solutions?

I like the slice(None)! Don't think I've ever seen slice() used before. I just replaced all my uses of numpy.full(True) with it.
Agree that with that notation the decrease in readability is minimal, and worth it to avoid slowing down the computation.

phargogh

I had one or two small comments, but overall this looks like a really nice handling of nodata values!

I did have one other question, though, which was that it looks like this PR handles the use of numpy.isclose for comparing nodata values. Were you also able to take a look at the places where nodata checking is done by !=?

phargogh · 2020-08-27T18:44:54Z

src/natcap/invest/coastal_vulnerability.py

@@ -1282,7 +1282,8 @@ def extract_bathymetry_along_ray(
                'bathymetry input fully cover the fetch ray area?'
                % (value, {'xoff': ix, 'yoff': iy,
                           'win_xsize': win_size, 'win_ysize': win_size}))
-        if ~numpy.isclose(value, bathy_nodata):
+
+        if bathy_nodata is None or ~numpy.isclose(value, bathy_nodata):


I realize this is kind of our of the scope of this PR, but since we're only working with a single value here, would you be OK changing the ~numpy.isclose here to not math.isclose? math.isclose looks to be about 200 times faster than numpy.isclose for a single value.

>>> timeit.timeit('~numpy.isclose(1.1, 1.0)', setup='import numpy') 47.977831700000024 >>>timeit.timeit('not math.isclose(1.1, 1.0)', setup='import math') 0.18105119999999886

cool! Is there some resource you have for finding faster alternatives like this?

update: value is actually an array, not a single value, so math.isclose doesn't work :( I changed the variable name to values to be less confusing.

Ha, sorry to disappoint! I unfortunately don't know of such a resource. In this case, I just saw that we were using numpy but then only checking an array with a single value in it. Numpy's great for operating on arrays that might be very large, but for a single value, it's often faster to just directly compare the values. Plus, numpy does a whole lot of extra work to ensure that it can actually compare the values, compared with the python implementation of isclose, which is about as short as you'd expect for this.

Also I was a bit wrong about the timings I had included above. Since this is a 1-element numpy array, here are the corrected timings:

>>> timeit.timeit('~numpy.isclose(array, 1)', setup='import numpy; array=numpy.array([[1.1]], dtype=numpy.float32)') 42.59300510000003 >>> timeit.timeit('not math.isclose(array[0][0], 1)', setup='import numpy; import math; array=numpy.array([[1.1]], dtype=numpy.float32)') 0.5673721999999941

Not quite as fast thanks to the need to do that indexing, but still substantially faster.

phargogh · 2020-08-27T18:50:11Z

src/natcap/invest/crop_production_regression.py

+                if observed_yield_nodata is not None:
+                    valid_mask = ~numpy.isclose(yield_block, observed_yield_nodata)
                production_pixel_count += numpy.count_nonzero(
-                    ~numpy.isclose(yield_block, observed_yield_nodata) &
-                    (yield_block > 0.0))
-                yield_sum += numpy.sum(
-                    yield_block[
-                        ~numpy.isclose(observed_yield_nodata, yield_block)])
+                                          valid_mask & (yield_block > 0.0))
+                yield_sum += numpy.sum(yield_block[valid_mask])


nice! this looks like it'll be a bit more efficient, as well as supporting the possibility of an undefined nodata value.

phargogh · 2020-08-27T18:56:35Z

src/natcap/invest/globio.py

    result = numpy.empty_like(lulc_array, dtype=numpy.int16)
-    result[:] = primary_veg_mask_nodata
+    result = lulc_array == 1


I could be reading this wrong, but wouldn't this second result = line set result to be a new array, and this time of type numpy.bool?

And on a separate note, it looks like this change would also remove the primary_veg_mask_nodata entirely from this function, which sounds like an unintended change!

oops, you're totally right!

emlys · 2020-08-27T20:59:49Z

I did have one other question, though, which was that it looks like this PR handles the use of numpy.isclose for comparing nodata values. Were you also able to take a look at the places where nodata checking is done by !=?

I did not, but it seems that where != is used, it shouldn't matter if the nodata value is defined or not.

>>> a = numpy.array([1,2,3])
>>> a != None
array([ True,  True,  True])

Is this what you're referring to?

phargogh · 2020-08-27T21:32:40Z

Ah, good point about the != comparison! So yes, we should be functionally ok on that front. I suppose it could be a bit faster to avoid the comparison entirely, but that's outside the scope of this issue and it won't break anything as-is.

Using the latest 2.1+ version of `pgp.convolve_2d`, the nodata handling is different. For the UCM test, the convolved raster was left with an undefined nodata value which caused the `wbgt_op` to crash. Since this is being worked on in PR natcap#269 I copied the current solution from there to hopefully avoid merge conflicts in the future.

phargogh

Hey @emlys thanks for taking a look at my comments! I mostly wanted to follow up about the possible isclose change. If I'm reading the function correctly, then it looks like we're only ever reading in a single pixel right there, which means that we mght be able to safely extract that one pixel value. Anyways, wanted to check with you about this in case you thought that was worth updating ... let me know! I think this PR otherwise looks good to go.

phargogh · 2020-08-28T21:07:59Z

src/natcap/invest/coastal_vulnerability.py

-        value = bathy_band.ReadAsArray(
+        values = bathy_band.ReadAsArray(
            xoff=ix, yoff=iy,
            win_xsize=win_size, win_ysize=win_size)
-        if value is None:
+        if values is None:
            raise ValueError(
                'got a %s when trying to read bathymetry at %s. Does the '
                'bathymetry input fully cover the fetch ray area?'
-                % (value, {'xoff': ix, 'yoff': iy,
+                % (values, {'xoff': ix, 'yoff': iy,
                           'win_xsize': win_size, 'win_ysize': win_size}))
-        if ~numpy.isclose(value, bathy_nodata):
-            bathy_values.append(value)
+
+        if bathy_nodata is None or ~numpy.isclose(values, bathy_nodata):
+            bathy_values.append(values)


Hey I'm sorry about the somewhat confusing chain of timings and comments over in the other thread. You're right that we can't use math.isclose directly on the array, but since we know ahead of time that this will be a single value, we can just index into that one item, values[0][0] or the like: not math.isclose(values[0][0])

Oh okay, that makes sense. The nodata value was also (incorrectly) a list so I fixed that in calculate_wind_exposure

phargogh · 2020-08-28T21:11:52Z

src/natcap/invest/globio.py

+        nodata_mask = numpy.full(infrastructure_array_list[0].shape, True)
+        infrastructure_mask = numpy.full(infrastructure_array_list[0].shape, False)
+
+        for index in range(0, len(infrastructure_array_list)):


This is a pretty trivial change, but since we're starting from index 0 (the default), this can be abbreviated to range(len(infrastructure_array_list)) if you like to achieve the same effect.

phargogh

Hey @emlys , I'm sorry to have not included this in my previous review! I was thinking about this part of your PR over the weekend, and I think there's one small edit that would be a nice optimization to the SWY cython function. What do you think?

phargogh · 2020-08-31T17:17:02Z

src/natcap/invest/seasonal_water_yield/seasonal_water_yield_core.pyx

+                        # make sure that user input nodata values are defined
+                        # precipitation and evapotranspiration data should 
+                        # always be non-negative
+                        if precip_nodata is None:
+                            precip_nodata = -1
+                        if et0_nodata is None:
+                            et0_nodata = -1
+


I was thinking about this over the weekend. Checking for and assigning values for precip_nodata and et0_nodata here will make these extra lines run many, many times (once for each month, but also for each pixel in the work_queue, which ends up being for each valid pixel). Of course it's quite possible that the compiler will optimize this condition away so that it doesn't need to be run so many times, but since we know in advance that this can just be defined elsewhere, would you be ok with moving it outside of the main processing loop? Offhand, it looks like these could be checked for once, up near the top of this function (around line 448 for et0, 456 for precip).

That's a good point! Just moved it outside of the loop

phargogh

Thanks for all your work on this @emlys !

phargogh · 2020-09-02T19:23:19Z

src/natcap/invest/seasonal_water_yield/seasonal_water_yield_core.pyx

        et0_m_nodata_list.append(
-            pygeoprocessing.get_raster_info(et0_path)['nodata'][0])
+            pygeoprocessing.get_raster_info(et0_path)['nodata'][0] or -1)


Neat! this is a bit of python syntax that I was not expecting to work as a sort of abbreviated ternary operator, but it absolutely does.

Oh heck! This is going to have the same problem with truthiness of 0... so 0 or -1 will be -1. I can fix this in my next PR...

Whoops, so it does! I'm sorry I didn't catch that. Yes, let's do this in the next PR.

emlys added 3 commits August 14, 2020 14:54

add function to make raster with nodata value undefined

3adb758

allow undefined nodata value in seasonal water yield

8a6df92

replaced all usages of isclose to nodata with utils.is_valid

ec32ebd

emlys self-assigned this Aug 17, 2020

emlys changed the base branch from master to release/3.9 August 17, 2020 22:54

emlys added 5 commits August 17, 2020 16:48

fixed split line syntax error

d06db6b

added tests for utils.is_valid

fc82e7b

cleanup

22d7dfa

checkout SWY test from release/3.9

abef4d0

added detailed test for SWY with undefined nodata value

4400b4a

emlys marked this pull request as ready for review August 18, 2020 23:03

emlys added 2 commits August 18, 2020 16:15

added HISTORY note

9ef2fc9

cleanup

3011a63

emlys requested a review from dcdenu4 August 18, 2020 23:50

emlys assigned dcdenu4 Aug 18, 2020

emlys linked an issue Aug 18, 2020 that may be closed by this pull request

Undefined nodata values cause nodata checking to fail in numpy.isclose #228

Closed

3 tasks

phargogh requested changes Aug 19, 2020

View reviewed changes

emlys changed the title ~~Bug/228~~ Handle comparison to undefined nodata values Aug 19, 2020

emlys added 11 commits August 20, 2020 11:17

removing utils.is_valid from globio

7e947fe

globio tests passing

7d553e4

removed utils.is_valid from coastal blue carbon

819cd46

removed utils.is_valid from coastal_vulnerability

f6f0edc

removed utils.is_valid from crop production percentile

476f4e7

removed utils.is_valid from crop production

c25484a

removed utils.is_valid froom delineateit, habitat quality

f64e695

fix seasonal_water_yield_core

4d258ec

remove utils.is_valid from hydropower, ndr

c9cc727

removed utils.is_valid from routedem, scenic quality, sdr

dc81973

added SWY test, removed utils.is_valid

07c96a9

fixed bug in coastal blue carbon

be52fe9

emlys closed this Aug 21, 2020

emlys reopened this Aug 21, 2020

emlys added 3 commits August 21, 2020 10:40

cleanup

22903e6

more small cleanup

744a478

delete cython html file

c029365

emlys requested a review from phargogh August 21, 2020 19:20

use slice(None) rather than True boolean array

dee60d2

fixed typos

36b920f

phargogh requested changes Aug 27, 2020

View reviewed changes

requested changes for PR

b783c5c

Merge branch 'release/3.9' into bug/228

d26b15a

emlys requested a review from phargogh August 28, 2020 18:16

phargogh requested changes Aug 28, 2020

View reviewed changes

requested changes for PR

547e90f

phargogh requested changes Aug 31, 2020

View reviewed changes

move nodata check outside of loop

4026e6e

emlys requested a review from phargogh August 31, 2020 21:02

phargogh approved these changes Sep 2, 2020

View reviewed changes

phargogh merged commit e3f9515 into natcap:release/3.9 Sep 2, 2020

phargogh mentioned this pull request Sep 3, 2020

Undefined nodata values cause nodata checking to fail in numpy.isclose #228

Closed

3 tasks

emlys mentioned this pull request Sep 3, 2020

Pr resolution #297

Merged

phargogh mentioned this pull request Feb 2, 2021

Feature/raster stitch 150 natcap/pygeoprocessing#156

Merged

emlys deleted the bug/228 branch March 8, 2023 00:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle comparison to undefined nodata values #269

Handle comparison to undefined nodata values #269

emlys commented Aug 17, 2020 •

edited

Loading

emlys commented Aug 19, 2020

phargogh left a comment

phargogh Aug 19, 2020

phargogh Aug 19, 2020

emlys commented Aug 21, 2020

phargogh commented Aug 25, 2020 •

edited

Loading

emlys commented Aug 25, 2020

phargogh left a comment

phargogh Aug 27, 2020

emlys Aug 27, 2020

emlys Aug 27, 2020

phargogh Aug 27, 2020

phargogh Aug 27, 2020

phargogh Aug 27, 2020

emlys Aug 27, 2020

emlys commented Aug 27, 2020

phargogh commented Aug 27, 2020

phargogh left a comment

phargogh Aug 28, 2020

emlys Aug 31, 2020

phargogh Aug 28, 2020

phargogh left a comment

phargogh Aug 31, 2020

emlys Aug 31, 2020

phargogh left a comment

phargogh Sep 2, 2020 •

edited

Loading

emlys Sep 2, 2020

phargogh Sep 2, 2020

Handle comparison to undefined nodata values #269

Handle comparison to undefined nodata values #269

Conversation

emlys commented Aug 17, 2020 • edited Loading

emlys commented Aug 19, 2020

phargogh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

emlys commented Aug 21, 2020

phargogh commented Aug 25, 2020 • edited Loading

emlys commented Aug 25, 2020

phargogh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

emlys commented Aug 27, 2020

phargogh commented Aug 27, 2020

phargogh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

phargogh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

phargogh left a comment

Choose a reason for hiding this comment

phargogh Sep 2, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

emlys commented Aug 17, 2020 •

edited

Loading

phargogh commented Aug 25, 2020 •

edited

Loading

phargogh Sep 2, 2020 •

edited

Loading