Handle comparison to undefined nodata values #269
Conversation
see convo here #228 (comment)
Hey @emlys, I thought I'd take a look at this since we had been talking about it a bit yesterday. I left a few comments in a couple of critical places, and I'm sorry that we hadn't had a chance to talk about our design process before this ... but it's a great time to talk about it now! Would you have some time to talk this afternoon? I'll ping you on slack as well.
And as a side note, would it be possible to set more descriptive PR titles? The indirection of having to look up the nature of #228 makes it a bit harder to grok the PR from the PR page listing. Thanks!
```diff
-cdef int is_close(double x, double y):
+cdef int double_is_close(double x, double y):
     return abs(x-y) <= (1e-8+1e-05*abs(y))
+
+def is_close(a, b):
+    """Compare values which may be numeric or None"""
+    if a and b:
+        return double_is_close(a, b)
+    else:
+        return 1 if a == b else 0
```
Hey Emily, we need to be really careful within cython modules for all sorts of reasons. In this case, I don't think we should adapt `is_close` to handle a case where the nodata value could be `None`, because of how and how often it's called. This is a function that's called for every pixel ... we therefore need the function to be as blazingly fast as possible. Note the `cdef` preceding the old `is_close` here. Such a function compiles down to a C function. When we prefix a function with `def` rather than `cdef`, the function is a python function and incurs significant overhead.
We therefore have two options for handling the case where the nodata value might not be defined:
- We can pick a nodata value that makes sense within the context of the problem if one isn't already defined.
- We can require a nodata value to be defined.
Either way, we really, really don't want `is_close` to be a python function.
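As an illustration of the first option (a hypothetical sketch, not code from this PR or from InVEST), the comparison can stay a `cdef` C-level function if an undefined nodata value is resolved to a sentinel once, outside the per-pixel loop. The function name `count_valid_pixels` and the `-1.0` sentinel are assumptions:

```cython
cdef int is_close(double x, double y):
    # stays cdef: compiles down to a C function, no Python-call overhead
    return abs(x - y) <= (1e-8 + 1e-05 * abs(y))


def count_valid_pixels(double[:, :] block, nodata):
    """Hypothetical caller: count pixels that are not nodata."""
    # resolve an undefined (None) nodata value to a sentinel once, up front,
    # rather than testing for None on every pixel inside the hot loop
    cdef double nodata_value = -1.0 if nodata is None else nodata
    cdef int row, col, count = 0
    for row in range(block.shape[0]):
        for col in range(block.shape[1]):
            if not is_close(block[row, col], nodata_value):
                count += 1
    return count
```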
src/natcap/invest/utils.py
Outdated
```python
if nodata:
    return ~numpy.isclose(array, nodata)
else:
    return array != nodata
```
There are a couple of things that will probably turn out to be problematic here:
- Python's truthiness rules allow for values of `0`, `False`, empty iterables and a few other things to all be interpreted as `False`. How will this function behave if we have a nodata value of `0`, which is a valid input?
- Similar to @richpsharp's comment at Undefined nodata values cause nodata checking to fail in numpy.isclose #228 (comment), the function name `is_valid` makes it sound like this function is for checking whether a pixel is valid ... but all it's actually checking for is whether the value is close to `nodata`. Function names seem like a fairly trivial thing, but it's so, so important for us to be clear about the intent of a function and what it's really trying to do. So if the intent is to determine whether a value is close to `nodata`, then we should really be expressing that in the name. What, exactly, is a valid pixel is highly dependent on the use case.
- If the intent here is actually to create a single function that checks to see if a pixel value is close to `nodata`, then we need to accommodate the differences in floating-point and integer matrices. It turns out that if we know in advance that we have an integer matrix, then we can take advantage of order-of-magnitude speedups in per-element comparisons: `integer_array != nodata` is an order of magnitude faster than `numpy.isclose(integer_array, nodata)`. On very large datasets, this can make a noticeable difference. Here are the timing details:

```python
>>> timeit.timeit('numpy.isclose(ones, 0)', setup='import numpy; ones=numpy.ones((50, 50), dtype=numpy.float32)', number=100000)
5.014724600000022
>>> timeit.timeit('numpy.isclose(ones, 0)', setup='import numpy; ones=numpy.ones((50, 50), dtype=numpy.int16)', number=100000)
4.701722299999915
>>> timeit.timeit('ones != 0', setup='import numpy; ones=numpy.ones((50, 50), dtype=numpy.int16)', number=100000)
0.315173499999986
```

- By making it necessary to have an extra function call-through, we're not only obfuscating the intent of the function, there's also a (relatively minor) computational overhead. The main point here is that the value gained from having a function should outweigh the cognitive and computational overhead of needing to look elsewhere at what the function is doing. This function is so simple that I'm not sure this threshold is met.
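To make the first and third points concrete, here is a hypothetical sketch (not the helper in this PR) of a nodata-mask function that avoids the truthiness pitfall by testing `nodata is not None` explicitly and uses the faster exact comparison for integer arrays; the name `nodata_mask` is an assumption:

```python
import numpy


def nodata_mask(array, nodata):
    """Hypothetical sketch: boolean mask of pixels equal or close to nodata."""
    if nodata is None:
        # explicit None check: a nodata value of 0 is valid and must not be
        # confused with "no nodata value defined"
        return numpy.zeros(array.shape, dtype=bool)
    if numpy.issubdtype(array.dtype, numpy.integer):
        # exact comparison is roughly an order of magnitude faster than
        # numpy.isclose and is correct for integer rasters
        return array == nodata
    return numpy.isclose(array, nodata)
```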
After conversation with @phargogh about the issues above, I removed ...
However, with the need to check ...
Any thoughts on the tradeoff of efficiency vs. nicer-looking code in this situation, the likelihood of ...?
@emlys You're very right that the common pattern you mention (initialize the output array, create a mask of valid pixels, limit the calculations to valid pixels) is indeed to limit computation to only those places where it is needed. In general, the internals of InVEST should do what's best and most efficient for the user in the general case, even if it's a little trickier for us to program and to maintain (without gutting maintainability, though!), and also even if it underperforms in certain specific edge cases. All that within the context of the problem, of course! In the simplest of cases, where a nodata value might not be defined, we could do:

```python
if globio_nodata is None:
    valid_mask = slice(None)  # the ``slice`` equivalent of ``:``
else:
    valid_mask = ~numpy.isclose(msa_f, globio_nodata)
result[valid_mask] = msa_f[valid_mask] * msa_lu[valid_mask] * msa_i[valid_mask]
```

Or, more abbreviated:

```python
valid_mask = slice(None)  # the ``slice`` equivalent of ``:``
if globio_nodata is not None:
    valid_mask = ~numpy.isclose(msa_f, globio_nodata)
result[valid_mask] = msa_f[valid_mask] * msa_lu[valid_mask] * msa_i[valid_mask]
```

With this slight tweak, the same masked assignment handles an undefined nodata value. I found the timings for this to be rather surprising, but indexing with a `slice` object is just as fast as indexing with `:`.

Timings:

```python
>>> import timeit
# Baseline: index into all elements
>>> timeit.timeit('array[:]', setup='import numpy; array=numpy.zeros((50,50))')
0.22549130000000162
# Indexing using a slice
>>> timeit.timeit('array[sl]', setup='import numpy; array=numpy.zeros((50,50)); sl=slice(None)')
0.22285999999985506
# Index using `True`, which should select all elements
>>> timeit.timeit('array[True]', setup='import numpy; array=numpy.zeros((50,50))')
3.0759548999999993
# Index using a boolean array of 1/True
>>> timeit.timeit('array[index]', setup='import numpy; array=numpy.zeros((50,50)); index=numpy.ones((50,50), dtype=numpy.bool)')
3.8698592000000076
```

Anyways, what do you think about this RE: tradeoffs between efficiency, readability, maintainability and general-case solutions?
I like the `slice(None)` approach ...
I had one or two small comments, but overall this looks like a really nice handling of nodata values!
I did have one other question, though, which was that it looks like this PR handles the use of `numpy.isclose` for comparing nodata values. Were you also able to take a look at the places where nodata checking is done by `!=`?
```diff
@@ -1282,7 +1282,8 @@ def extract_bathymetry_along_ray(
                 'bathymetry input fully cover the fetch ray area?'
                 % (value, {'xoff': ix, 'yoff': iy,
                            'win_xsize': win_size, 'win_ysize': win_size}))
-        if ~numpy.isclose(value, bathy_nodata):
+        if bathy_nodata is None or ~numpy.isclose(value, bathy_nodata):
```
I realize this is kind of out of the scope of this PR, but since we're only working with a single value here, would you be OK changing the `~numpy.isclose` here to `not math.isclose`? `math.isclose` looks to be about 200 times faster than `numpy.isclose` for a single value.

```python
>>> timeit.timeit('~numpy.isclose(1.1, 1.0)', setup='import numpy')
47.977831700000024
>>> timeit.timeit('not math.isclose(1.1, 1.0)', setup='import math')
0.18105119999999886
```
Cool! Is there some resource you have for finding faster alternatives like this?
Update: `value` is actually an array, not a single value, so `math.isclose` doesn't work :( I changed the variable name to `values` to be less confusing.
Ha, sorry to disappoint! I unfortunately don't know of such a resource. In this case, I just saw that we were using numpy but then only checking an array with a single value in it. Numpy's great for operating on arrays that might be very large, but for a single value, it's often faster to just directly compare the values. Plus, numpy does a whole lot of extra work to ensure that it can actually compare the values, compared with the python implementation of `isclose`, which is about as short as you'd expect for this.
Also, I was a bit wrong about the timings I had included above. Since this is a 1-element numpy array, here are the corrected timings:

```python
>>> timeit.timeit('~numpy.isclose(array, 1)', setup='import numpy; array=numpy.array([[1.1]], dtype=numpy.float32)')
42.59300510000003
>>> timeit.timeit('not math.isclose(array[0][0], 1)', setup='import numpy; import math; array=numpy.array([[1.1]], dtype=numpy.float32)')
0.5673721999999941
```

Not quite as fast thanks to the need to do that indexing, but still substantially faster.
```diff
+if observed_yield_nodata is not None:
+    valid_mask = ~numpy.isclose(yield_block, observed_yield_nodata)
 production_pixel_count += numpy.count_nonzero(
-    ~numpy.isclose(yield_block, observed_yield_nodata) &
-    (yield_block > 0.0))
-yield_sum += numpy.sum(
-    yield_block[
-        ~numpy.isclose(observed_yield_nodata, yield_block)])
+    valid_mask & (yield_block > 0.0))
+yield_sum += numpy.sum(yield_block[valid_mask])
```
Nice! This looks like it'll be a bit more efficient, as well as supporting the possibility of an undefined nodata value.
src/natcap/invest/globio.py
Outdated
```diff
 result = numpy.empty_like(lulc_array, dtype=numpy.int16)
-result[:] = primary_veg_mask_nodata
+result = lulc_array == 1
```
I could be reading this wrong, but wouldn't this second `result =` line set `result` to be a new array, and this time of type `numpy.bool`?
And on a separate note, it looks like this change would also remove `primary_veg_mask_nodata` entirely from this function, which sounds like an unintended change!
Oops, you're totally right!
I did not, but it seems that `where` ...
Is this what you're referring to?
Ah, good point about the ...
Using the latest 2.1+ version of `pgp.convolve_2d`, the nodata handling is different. For the UCM test, the convolved raster was left with an undefined nodata value, which caused the `wbgt_op` to crash. Since this is being worked on in PR natcap#269, I copied the current solution from there to hopefully avoid merge conflicts in the future.
Hey @emlys, thanks for taking a look at my comments! I mostly wanted to follow up about the possible `isclose` change. If I'm reading the function correctly, then it looks like we're only ever reading in a single pixel right there, which means that we might be able to safely extract that one pixel value. Anyways, wanted to check with you about this in case you thought that was worth updating ... let me know! I think this PR otherwise looks good to go.
```diff
-        value = bathy_band.ReadAsArray(
+        values = bathy_band.ReadAsArray(
             xoff=ix, yoff=iy,
             win_xsize=win_size, win_ysize=win_size)
-        if value is None:
+        if values is None:
             raise ValueError(
                 'got a %s when trying to read bathymetry at %s. Does the '
                 'bathymetry input fully cover the fetch ray area?'
-                % (value, {'xoff': ix, 'yoff': iy,
+                % (values, {'xoff': ix, 'yoff': iy,
                            'win_xsize': win_size, 'win_ysize': win_size}))
-        if ~numpy.isclose(value, bathy_nodata):
-            bathy_values.append(value)
+        if bathy_nodata is None or ~numpy.isclose(values, bathy_nodata):
+            bathy_values.append(values)
```
Hey, I'm sorry about the somewhat confusing chain of timings and comments over in the other thread. You're right that we can't use `math.isclose` directly on the array, but since we know ahead of time that this will be a single value, we can just index into that one item, `values[0][0]` or the like: `not math.isclose(values[0][0], bathy_nodata)`.
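Illustratively (this exact form isn't in the diff above, it's a sketch of the suggestion), the single-pixel check might look like:

```python
import math

# hypothetical form of the suggested check on one extracted pixel value
if bathy_nodata is None or not math.isclose(values[0][0], bathy_nodata):
    bathy_values.append(values)
```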
Oh okay, that makes sense. The nodata value was also (incorrectly) a list, so I fixed that in `calculate_wind_exposure`.
src/natcap/invest/globio.py
Outdated
```python
nodata_mask = numpy.full(infrastructure_array_list[0].shape, True)
infrastructure_mask = numpy.full(infrastructure_array_list[0].shape, False)

for index in range(0, len(infrastructure_array_list)):
```
This is a pretty trivial change, but since we're starting from index 0 (the default), this can be abbreviated to `range(len(infrastructure_array_list))` if you like, to achieve the same effect.
Hey @emlys, I'm sorry to have not included this in my previous review! I was thinking about this part of your PR over the weekend, and I think there's one small edit that would be a nice optimization to the SWY cython function. What do you think?
```python
# make sure that user input nodata values are defined
# precipitation and evapotranspiration data should
# always be non-negative
if precip_nodata is None:
    precip_nodata = -1
if et0_nodata is None:
    et0_nodata = -1
```
I was thinking about this over the weekend. Checking for and assigning values for `precip_nodata` and `et0_nodata` here will make these extra lines run many, many times (once for each month, but also for each pixel in the `work_queue`, which ends up being for each valid pixel). Of course it's quite possible that the compiler will optimize this condition away so that it doesn't need to be run so many times, but since we know in advance that this can just be defined elsewhere, would you be OK with moving it outside of the main processing loop? Offhand, it looks like these could be checked for once, up near the top of this function (around line 448 for `et0`, 456 for `precip`).
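A schematic sketch of what's being suggested (hypothetical names, not the actual SWY routine): the `None` checks are hoisted above the loop so the per-pixel work never re-tests them.

```python
def sum_valid_recharge(precip_values, et0_values, precip_nodata, et0_nodata):
    """Hypothetical sketch: resolve undefined nodata once, before the loop."""
    # hoisted out of the hot loop: runs once per call, not once per pixel
    if precip_nodata is None:
        precip_nodata = -1
    if et0_nodata is None:
        et0_nodata = -1

    total = 0.0
    for precip, et0 in zip(precip_values, et0_values):
        # the per-pixel comparisons no longer need to consider None
        if precip == precip_nodata or et0 == et0_nodata:
            continue
        total += precip - et0
    return total
```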
That's a good point! Just moved it outside of the loop.
Thanks for all your work on this, @emlys!
```diff
 et0_m_nodata_list.append(
-    pygeoprocessing.get_raster_info(et0_path)['nodata'][0])
+    pygeoprocessing.get_raster_info(et0_path)['nodata'][0] or -1)
```
Neat! This is a bit of python syntax that I was not expecting to work as a sort of abbreviated ternary operator, but it absolutely does.
Oh heck! This is going to have the same problem with truthiness of 0... so `0 or -1` will be `-1`. I can fix this in my next PR...
Whoops, so it does! I'm sorry I didn't catch that. Yes, let's do this in the next PR.
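For reference, a quick REPL demonstration of the pitfall discussed above and the explicit `is None` form that avoids it (illustrative only):

```python
>>> nodata = 0  # 0 is a perfectly valid nodata value
>>> nodata or -1  # truthiness: 0 is falsy, so the fallback wins
-1
>>> -1 if nodata is None else nodata  # explicit None check keeps the 0
0
```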
#228
Adds `utils.is_valid(array, nodata)`, which handles the condition where the nodata value is undefined. This replaces dozens of instances of `~numpy.isclose(array, nodata)`.