Fix lazy fill_value retrieval #2738

Closed
djkirkham opened this issue Aug 16, 2017 · 8 comments


@djkirkham (Contributor)

The "smallest slice" method of retrieving the fill value from a dask array doesn't always work. In particular it won't work if that array is formed from the concatenation/stacking of several arrays or proxies which have differing fill values, or combinations of ndarrays and masked arrays:

>>> import numpy as np
>>> import dask.array as da
>>> from iris._lazy_data import get_fill_value
>>> a = da.from_array(np.array([1, 2, 3]), 1)
>>> m = np.ma.masked_array([1, 2, 3], [0, 1, 0], fill_value=10)
>>> m = da.from_array(m, 1, asarray=False)
>>> am = da.concatenate([a, m])
>>> print(am.compute().fill_value)
10
>>> print(get_fill_value(am))
None
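
As a sketch of why this happens, here is a guess at what the "smallest slice" approach amounts to (an assumption for illustration - not the actual iris._lazy_data implementation): compute a single-element slice and read its fill_value. Only one source chunk is ever evaluated, so for a concatenation like am above the result is whatever that one chunk carries - here None, because the first chunk is a plain ndarray:

def smallest_slice_fill_value(dask_array):
    # Evaluate a single element, which touches only one source chunk.
    index = tuple(slice(0, 1) for _ in dask_array.shape)
    piece = dask_array[index].compute()
    # Plain ndarray chunks have no fill_value attribute.
    return getattr(piece, 'fill_value', None)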
@djkirkham (Contributor, Author)

@pp-mo @bjlittle @pelson

@pelson (Member) commented Aug 16, 2017

Specifically, this comes up at save time when you want to stream data. Currently, that just means NetCDF. The workaround for this issue is to write to disk and check that the fill_value is consistent as you go, rather than trying to determine the fill_value up front. This can be done with the da.store function. My proof of concept:

In [1]: import numpy as np
   ...: import dask.array as da
   ...: from iris._lazy_data import get_fill_value
   ...: 
   ...: 
   ...: a = da.from_array(np.array([1, 2, 3]), 1)
   ...: n = np.ma.masked_array([1, 2, 3], [0, 1, 0], fill_value=10)
   ...: m = da.from_array(n, 1, asarray=False)
   ...: am = da.concatenate([a, m])
   ...: print(am.compute().fill_value)
   ...: 
   ...: print(get_fill_value(am))
   ...: 
   ...: 
   ...: class F(object):
   ...:     def __init__(self):
   ...:         self.fill_values = set()
   ...: 
   ...:     def __setitem__(self, keys, arr):
   ...:         self.fill_values.add(getattr(arr, 'fill_value', None))
   ...:         print('-->', arr, getattr(arr, 'fill_value', None))
   ...: 
   ...: foo = F()
   ...: 
   ...: da.store([am], [foo])
   ...: 
   ...: print(foo.fill_values)
   ...: 

10
None
('-->', array([1]), None)
('-->', array([2]), None)
('-->', array([3]), None)
('-->', masked_array(data = [1],
             mask = [False],
       fill_value = 10)
, 10)
('-->', masked_array(data = [--],
             mask = [ True],
       fill_value = 10)
, 10)
('-->', masked_array(data = [3],
             mask = [False],
       fill_value = 10)
, 10)
set([None, 10])
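
A natural write-through extension of this proof of concept (a sketch only - FillValueRecordingTarget and nc_var are illustrative names, not Iris API): record every chunk's fill_value while still delegating the writes to a real target, so consistency can be checked once the store completes:

class FillValueRecordingTarget(object):
    """A da.store target that writes through to a real array-like
    (e.g. a netCDF4 variable) while recording chunk fill_values."""
    def __init__(self, nc_var):
        self.nc_var = nc_var
        self.fill_values = set()

    def __setitem__(self, keys, arr):
        # Plain ndarray chunks contribute None; masked chunks their fill_value.
        self.fill_values.add(getattr(arr, 'fill_value', None))
        self.nc_var[keys] = arr

After da.store([am], [target]), a target.fill_values set containing more than one distinct value (as in the set([None, 10]) above) signals an inconsistent fill value.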

@pelson (Member) commented Aug 17, 2017

Incidentally, I discovered some questionable behaviour in Iris's saving of NetCDF files. It comes down to the assumption that the fill value faithfully represents the mask of a masked_array:

In [16]: from netCDF4 import Dataset
    ...: import numpy as np
    ...: import numpy.ma
    ...: 
    ...: arr = np.ma.masked_array([1, 2, 3], mask=[1, 0, 0], fill_value=2)
    ...: 
    ...: arr = arr - 1
    ...: 

In [17]: print(arr)
[-- 1 2]

In [18]: nc = Dataset('test.nc', 'w')
    ...: d = nc.createDimension('x', 3)
    ...: 
    ...: v = nc.createVariable('v',
    ...:                       arr.dtype, 'x',
    ...:                       fill_value=arr.fill_value)
    ...: 
    ...: v[:] = arr
    ...: nc.close()
    ...: 
    ...: 
    ...: nc = Dataset('test.nc')
    ...: 
    ...: print(nc['v'][:])
    ...: 
[-- 1 --]

There are a couple of options here. We could:

  • call array.filled(a_fill_value) - but that requires choosing a fill value that doesn't collide with existing data values (a sketch combining both options appears after this list)
  • raise a warning if the fill value occurs as an actual data point, telling the user how to re-define their fill value
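
A rough sketch combining both options (fill_with_check is an illustrative name, not proposed API); it relies only on the standard numpy.ma methods compressed() and filled():

import warnings
import numpy as np

def fill_with_check(masked, fill_value):
    # Warn if the chosen fill value already occurs as an unmasked data
    # point - on re-read such points would wrongly appear masked.
    if np.any(masked.compressed() == fill_value):
        warnings.warn('fill value {!r} collides with unmasked data; '
                      'choose a different fill value'.format(fill_value))
    return masked.filled(fill_value)

Applied to the example above, fill_with_check(arr, arr.fill_value) would warn, because the data point 2 equals the fill value 2.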

What is clear: there is redundancy between the masked_array's mask and its fill_value, and numpy makes no attempt to keep them consistent. I therefore consider the existing masked-array fill_value to be essentially useless, and certainly not worthy of use as an element of valuable metadata.

@pelson (Member) commented Aug 18, 2017

Summary of key details:

• np.ma -> slice -> np.ma in all situations
• da.ma -> slice -> da.ma in all situations
• np.ma -> save nc with fill value -> loaded array is ma only if there are masked points (fill value lost if no masked points)
• There is negligible performance difference between reading a variable with and without a fill_value when there are no actual masked points (this was by no means guaranteed in advance)

My conclusion: stream out our data with a default (overridable) fill value, warning about genuine data-point collisions as we go (a rough sketch of this follows). If we want fill_values to round-trip, we need to add that to the data model - i.e. not use the fill_value on a masked array, for the reasons in the previous comment. Adding fill_value to the data model is possible, but there are lots of details to get consistent, and even then we would still need to validate the arrays as they are written to ensure there are no data-point collisions. Implementing this has nothing to do with da.ma, and I don't currently have a strong driver to implement round-trip fill_values (we don't round-trip other variable attributes, such as valid_range, either).
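
A rough sketch of that conclusion (CollisionWarningTarget and its parameters are assumptions for illustration, not the eventual Iris implementation): stream chunks to the target through a wrapper that warns whenever an unmasked data point collides with the chosen default fill value:

import warnings
import numpy as np

class CollisionWarningTarget(object):
    """A da.store target that warns about data colliding with the fill value."""
    def __init__(self, nc_var, fill_value):
        self.nc_var = nc_var
        self.fill_value = fill_value

    def __setitem__(self, keys, arr):
        # Compare only the unmasked points against the fill value;
        # getmaskarray returns all-False for plain ndarray chunks.
        unmasked = np.ma.getdata(arr)[~np.ma.getmaskarray(arr)]
        if np.any(unmasked == self.fill_value):
            warnings.warn('data contains the fill value {!r}; such points '
                          'would appear masked on load'.format(self.fill_value))
        self.nc_var[keys] = arr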

Appendix

Performance for reading 100,000,000 points from a NetCDF variable (no masked points):

In [7]: %timeit Dataset('test_w_fill_value.nc')['v'][:]
1 loop, best of 3: 425 ms per loop

In [8]: %timeit Dataset('test_w_fill_value.nc')['v'][:]
1 loop, best of 3: 425 ms per loop

In [9]: %timeit Dataset('test_no_fill_value.nc')['v'][:]
1 loop, best of 3: 422 ms per loop

In [10]: %timeit Dataset('test_no_fill_value.nc')['v'][:]
1 loop, best of 3: 422 ms per loop 

@pelson (Member) commented Aug 18, 2017

A question was raised about the ability to optimise calculations by sometimes returning a plain array when there are no masked values. My assertion is that we can still do this:

In [1]: import numpy as np
   ...: import numpy.ma
   ...: import dask.array as da
   ...: 

In [2]: def my_super_duper_algorithm(array):
   ...:     if isinstance(array, np.ma.MaskedArray) and not np.ma.is_masked(array):
   ...:         print('Dropping the mask.')
   ...:         array = array.data
   ...: 
   ...:     # Here it comes...
   ...:     array = array + 1
   ...: 
   ...:     return array
   ...: 

In [3]: a = da.from_array(np.array([1, 2, 3]), 1)
   ...: n = np.ma.masked_array([1, 2, 3], [0, 1, 0], fill_value=10)
   ...: m = da.from_array(n, 1, asarray=False)
   ...: 

In [4]: has_mask = da.core.elemwise(my_super_duper_algorithm, m)
   ...: r = has_mask.compute()
   ...: print(repr(r))
   ...: 
Dropping the mask.
Dropping the mask.
masked_array(data = [2 -- 4],
             mask = [False  True False],
       fill_value = 10)


In [5]: no_mask = da.core.elemwise(my_super_duper_algorithm, m[:1])
   ...: r = no_mask.compute()
   ...: print(repr(r))
   ...: 
Dropping the mask.
array([2])

@pp-mo (Member) commented Aug 18, 2017

I have posted a query on the dask "masked arrays" PR: dask/dask#2301 (comment).
Initial responses are positive about providing pre-compute fill_value access.
But I think the trend of the discussion here makes that irrelevant?

@pp-mo (Member) commented Aug 18, 2017

> I therefore consider the existing masked array fill_value to be entirely useless, and certainly not worthy of use as an element of valuable metadata.

I think this is consistent with the previous Iris-dask code - that is, when we were using NaN arrays and had a cube.fill_value.

@pp-mo (Member) commented Aug 21, 2017

Discussed in an offline meeting with @pelson @djkirkham @bjlittle @pp-mo.
We adopted @pelson's maximum-simplicity approach, which in summary is:

  • do not make a fuss about 'round-trip' of format encodings for now
    • in particular fill_value encoding
      • but this could also cover PP packing and CF valid_range
    • current NetCDF support does not do this anyway
    • PP currently attempts to capture + replicate BMDI, but this has known shortcomings
  • do not respect a fill value attached to masked data at all
  • (instead) use format-appropriate default fill values whenever we save
    • NetCDF has per-dtype fill values (N.B. not the numpy ones - see the example after this list)
    • PP/FF has a default BMDI concept
  • regarding save options to control this:
    • do not implement now
    • leave space for it in future
    • await user requests
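
For reference, the per-dtype defaults do differ between the two libraries; both lookups below are real public APIs (netCDF4.default_fillvals and np.ma.default_fill_value):

import netCDF4
import numpy as np

# NetCDF's default fill value for a 64-bit float...
print(netCDF4.default_fillvals['f8'])            # 9.969209968386869e+36
# ...is not the same as numpy's masked-array default.
print(np.ma.default_fill_value(np.dtype('f8')))  # 1e+20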
