Fix lazy fill_value retrieval #2738

Closed
djkirkham opened this issue Aug 16, 2017 · 8 comments


@djkirkham (Contributor)

The "smallest slice" method of retrieving the fill value from a dask array doesn't always work. In particular it won't work if that array is formed from the concatenation/stacking of several arrays or proxies which have differing fill values, or combinations of ndarrays and masked arrays:

>>> import numpy as np
>>> import dask.array as da
>>> from iris._lazy_data import get_fill_value
>>> a = da.from_array(np.array([1, 2, 3]), 1)
>>> m = np.ma.masked_array([1, 2, 3], [0, 1, 0], fill_value=10)
>>> m = da.from_array(m, 1, asarray=False)
>>> am = da.concatenate([a, m])
>>> print(am.compute().fill_value)
10
>>> print(get_fill_value(am))
None
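
As a sketch of why this happens, here is a guess at what the "smallest slice" approach amounts to (an assumption for illustration - not the actual iris._lazy_data implementation): compute a single-element slice and read its fill_value. Only one source chunk is ever evaluated, so for a concatenation like am above the result is whatever that one chunk carries - here None, because the first chunk is a plain ndarray:

def smallest_slice_fill_value(dask_array):
    # Evaluate a single element, which touches only one source chunk.
    index = tuple(slice(0, 1) for _ in dask_array.shape)
    piece = dask_array[index].compute()
    # Plain ndarray chunks have no fill_value attribute.
    return getattr(piece, 'fill_value', None)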
@djkirkham (Contributor, Author)

@pp-mo @bjlittle @pelson

@pelson (Member) commented Aug 16, 2017

Specifically, this comes up at save time when you want to stream data. Currently, that just means NetCDF. The workaround for this issue is to write to disk and check that the fill_value is consistent as you go, rather than trying to determine the fill_value up front. This can be done with the da.store function. My proof of concept:

In [1]: import numpy as np
   ...: import dask.array as da
   ...: from iris._lazy_data import get_fill_value
   ...: 
   ...: 
   ...: a = da.from_array(np.array([1, 2, 3]), 1)
   ...: n = np.ma.masked_array([1, 2, 3], [0, 1, 0], fill_value=10)
   ...: m = da.from_array(n, 1, asarray=False)
   ...: am = da.concatenate([a, m])
   ...: print(am.compute().fill_value)
   ...: 
   ...: print(get_fill_value(am))
   ...: 
   ...: 
   ...: class F(object):
   ...:     def __init__(self):
   ...:         self.fill_values = set()
   ...: 
   ...:     def __setitem__(self, keys, arr):
   ...:         self.fill_values.add(getattr(arr, 'fill_value', None))
   ...:         print('-->', arr, getattr(arr, 'fill_value', None))
   ...: 
   ...: foo = F()
   ...: 
   ...: da.store([am], [foo])
   ...: 
   ...: print(foo.fill_values)
   ...: 

10
None
('-->', array([1]), None)
('-->', array([2]), None)
('-->', array([3]), None)
('-->', masked_array(data = [1],
             mask = [False],
       fill_value = 10)
, 10)
('-->', masked_array(data = [--],
             mask = [ True],
       fill_value = 10)
, 10)
('-->', masked_array(data = [3],
             mask = [False],
       fill_value = 10)
, 10)
set([None, 10])
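
A natural write-through extension of this proof of concept (a sketch only - FillValueRecordingTarget and nc_var are illustrative names, not Iris API): record every chunk's fill_value while still delegating the writes to a real target, so consistency can be checked once the store completes:

class FillValueRecordingTarget(object):
    """A da.store target that writes through to a real array-like
    (e.g. a netCDF4 variable) while recording chunk fill_values."""
    def __init__(self, nc_var):
        self.nc_var = nc_var
        self.fill_values = set()

    def __setitem__(self, keys, arr):
        # Plain ndarray chunks contribute None; masked chunks their fill_value.
        self.fill_values.add(getattr(arr, 'fill_value', None))
        self.nc_var[keys] = arr

After da.store([am], [target]), a target.fill_values set containing more than one distinct value (as in the set([None, 10]) above) signals an inconsistent fill value.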

@pelson (Member) commented Aug 17, 2017

Incidentally, I discovered some questionable behaviour in Iris's saving of NetCDF files. It comes down to the assumption that the fill value faithfully represents the mask of a masked_array:

In [16]: from netCDF4 import Dataset
    ...: import numpy as np
    ...: import numpy.ma
    ...: 
    ...: arr = np.ma.masked_array([1, 2, 3], mask=[1, 0, 0], fill_value=2)
    ...: 
    ...: arr = arr - 1
    ...: 

In [17]: print(arr)
[-- 1 2]

In [18]: nc = Dataset('test.nc', 'w')
    ...: d = nc.createDimension('x', 3)
    ...: 
    ...: v = nc.createVariable('v',
    ...:                       arr.dtype, 'x',
    ...:                       fill_value=arr.fill_value)
    ...: 
    ...: v[:] = arr
    ...: nc.close()
    ...: 
    ...: 
    ...: nc = Dataset('test.nc')
    ...: 
    ...: print(nc['v'][:])
    ...: 
[-- 1 --]

There are a couple of options here. We could:

  • call array.filled(a_fill_value) - but that requires choosing a fill value that doesn't collide with existing data values (a sketch combining both options appears after this list)
  • raise a warning if the fill value occurs as an actual data point, telling the user how to re-define their fill value
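
A rough sketch combining both options (fill_with_check is an illustrative name, not proposed API); it relies only on the standard numpy.ma methods compressed() and filled():

import warnings
import numpy as np

def fill_with_check(masked, fill_value):
    # Warn if the chosen fill value already occurs as an unmasked data
    # point - on re-read such points would wrongly appear masked.
    if np.any(masked.compressed() == fill_value):
        warnings.warn('fill value {!r} collides with unmasked data; '
                      'choose a different fill value'.format(fill_value))
    return masked.filled(fill_value)

Applied to the example above, fill_with_check(arr, arr.fill_value) would warn, because the data point 2 equals the fill value 2.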

What is clear: there is redundancy between the masked_array's mask and its fill_value, and numpy makes no attempt to keep them consistent. I therefore consider the existing masked-array fill_value to be essentially useless, and certainly not worthy of use as an element of valuable metadata.

@pelson (Member) commented Aug 18, 2017

Summary of key details:

• np.ma -> slice -> np.ma in all situations
• da.ma -> slice -> da.ma in all situations
• np.ma -> save nc with fill value -> loaded array is ma only if there are masked points (fill value lost if no masked points)
• There is negligible performance difference between reading a variable with and without a fill_value when there are no actual masked points (this was by no means guaranteed in advance)

My conclusion: stream out our data with a default (overridable) fill value, warning about genuine data-point collisions as we go (a rough sketch of this follows). If we want fill_values to round-trip, we need to add that to the data model - i.e. not use the fill_value on a masked array, for the reasons in the previous comment. Adding fill_value to the data model is possible, but there are lots of details to get consistent, and even then we would still need to validate the arrays as they are written to ensure there are no data-point collisions. Implementing this has nothing to do with da.ma, and I don't currently have a strong driver to implement round-trip fill_values (we don't round-trip other variable attributes, such as valid_range, either).
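
A rough sketch of that conclusion (CollisionWarningTarget and its parameters are assumptions for illustration, not the eventual Iris implementation): stream chunks to the target through a wrapper that warns whenever an unmasked data point collides with the chosen default fill value:

import warnings
import numpy as np

class CollisionWarningTarget(object):
    """A da.store target that warns about data colliding with the fill value."""
    def __init__(self, nc_var, fill_value):
        self.nc_var = nc_var
        self.fill_value = fill_value

    def __setitem__(self, keys, arr):
        # Compare only the unmasked points against the fill value;
        # getmaskarray returns all-False for plain ndarray chunks.
        unmasked = np.ma.getdata(arr)[~np.ma.getmaskarray(arr)]
        if np.any(unmasked == self.fill_value):
            warnings.warn('data contains the fill value {!r}; such points '
                          'would appear masked on load'.format(self.fill_value))
        self.nc_var[keys] = arr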

Appendix

Performance for reading 100,000,000 points from a NetCDF variable (no masked points):

In [7]: %timeit Dataset('test_w_fill_value.nc')['v'][:]
1 loop, best of 3: 425 ms per loop

In [8]: %timeit Dataset('test_w_fill_value.nc')['v'][:]
1 loop, best of 3: 425 ms per loop

In [9]: %timeit Dataset('test_no_fill_value.nc')['v'][:]
1 loop, best of 3: 422 ms per loop

In [10]: %timeit Dataset('test_no_fill_value.nc')['v'][:]
1 loop, best of 3: 422 ms per loop 

@pelson (Member) commented Aug 18, 2017

A question was raised about the ability to optimise calculations by sometimes returning a plain array when there are no masked values. My assertion is that we can still do this:

In [1]: import numpy as np
   ...: import numpy.ma
   ...: import dask.array as da
   ...: 

In [2]: def my_super_duper_algorithm(array):
   ...:     if isinstance(array, np.ma.MaskedArray) and not np.ma.is_masked(array):
   ...:         print('Dropping the mask.')
   ...:         array = array.data
   ...: 
   ...:     # Here it comes...
   ...:     array = array + 1
   ...: 
   ...:     return array
   ...: 

In [3]: a = da.from_array(np.array([1, 2, 3]), 1)
   ...: n = np.ma.masked_array([1, 2, 3], [0, 1, 0], fill_value=10)
   ...: m = da.from_array(n, 1, asarray=False)
   ...: 

In [4]: has_mask = da.core.elemwise(my_super_duper_algorithm, m)
   ...: r = has_mask.compute()
   ...: print(repr(r))
   ...: 
Dropping the mask.
Dropping the mask.
masked_array(data = [2 -- 4],
             mask = [False  True False],
       fill_value = 10)


In [5]: no_mask = da.core.elemwise(my_super_duper_algorithm, m[:1])
   ...: r = no_mask.compute()
   ...: print(repr(r))
   ...: 
Dropping the mask.
array([2])

@pp-mo (Member) commented Aug 18, 2017

I have posted a query on the dask "masked arrays" PR: dask/dask#2301 (comment).
Initial responses are positive about providing pre-compute fill_value access.
But I think the trend of the discussion here makes that irrelevant?

@pp-mo (Member) commented Aug 18, 2017

> I therefore consider the existing masked array fill_value to be entirely useless, and certainly not worthy of use as an element of valuable metadata.

I think this is consistent with the previous Iris-dask code - that is, when we were using NaN arrays and had a cube.fill_value.

@pp-mo (Member) commented Aug 21, 2017

Discussed in an offline meeting with @pelson @djkirkham @bjlittle @pp-mo.
We adopted @pelson's maximum-simplicity approach, which in summary is:

  • do not make a fuss about 'round-trip' of format encodings for now
    • in particular fill_value encoding
      • but this could also cover PP packing and CF valid_range
    • current NetCDF support does not do this anyway
    • PP currently attempts to capture + replicate BMDI, but this has known shortcomings
  • do not respect a fill value attached to masked data at all
  • (instead) use format-appropriate default fill values whenever we save
    • NetCDF has per-dtype fill values (N.B. not the numpy ones - see the example after this list)
    • PP/FF has a default BMDI concept
  • regarding save options to control this:
    • do not implement now
    • leave space for it in future
    • await user requests
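
For reference, the per-dtype defaults do differ between the two libraries; both lookups below are real public APIs (netCDF4.default_fillvals and np.ma.default_fill_value):

import netCDF4
import numpy as np

# NetCDF's default fill value for a 64-bit float...
print(netCDF4.default_fillvals['f8'])            # 9.969209968386869e+36
# ...is not the same as numpy's masked-array default.
print(np.ma.default_fill_value(np.dtype('f8')))  # 1e+20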
