dask: Data.hardmask, Data._set_dask, Data.to_dask_array #399
Conversation
It will also reduce clutter in the dask graph by removing most of the "cf_harden_mask" and "cf_soften_mask" layers, which are nearly always null-ops.
First of all, I support the new approach underlying this PR, i.e.:
A solution is to "somewhat lazify" the setting of the mask hardness. Instead of the status quo of adding a dask graph operation every time d.hardmask = X is run, the hardmask attribute now just records the desired state of the mask hardness. It is then up to a future method to apply a mask hardening/softening operation if and only if it is necessary. At present, only two methods need this (setitem and where).
and agree with your motivating conclusions:
Not only does this pave the way for active storage reductions to be implemented, it also (in my opinion!) makes the code cleaner, too.
The code here is all good (no comments to make at all, in fact 💯). However, your recent and final commit to synchronise with lama-to-dask seems to have introduced (from your other recently-merged PRs) a few cases of reset_mask_hardness which need appropriate conversion. Namely, I am currently seeing:
======================================================================
ERROR: test_Data_masked_all (__main__.DataTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_Data.py", line 3912, in test_Data_masked_all
d = cf.Data.masked_all(shape)
File "/home/sadie/cf-python/cf/data/data.py", line 9348, in masked_all
d._set_dask(dx, reset_mask_hardness=False)
TypeError: _set_dask() got an unexpected keyword argument 'reset_mask_hardness'
======================================================================
ERROR: test_Data_outerproduct (__main__.DataTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_Data.py", line 1431, in test_Data_outerproduct
f = d.outerproduct(b)
File "/home/sadie/cf-python/cf/data/data.py", line 96, in wrapper
return method(*args, **kwargs)
File "/home/sadie/cfdm/cfdm/decorators.py", line 44, in inplace_wrapper
processed_copy = operation_method(self, *args, **kwargs)
File "/home/sadie/cf-python/cf/decorators.py", line 62, in precede_with_kwarg_deprecation_check
operation_method_result = operation_method(self, *args, **kwargs)
File "/home/sadie/cf-python/cf/data/data.py", line 8793, in outerproduct
d._set_dask(dx, reset_mask_hardness=False)
TypeError: _set_dask() got an unexpected keyword argument 'reset_mask_hardness'
----------------------------------------------------------------------
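Both failures have the same cause: the callers still pass a reset_mask_hardness keyword that _set_dask no longer accepts under this PR. A minimal sketch of the conversion, using a hypothetical stand-in class (not cf-python's actual Data), would be to drop the keyword at each call site:

```python
# Hypothetical stand-in class illustrating the conversion; the real
# cf.Data._set_dask does more than store the graph.
class Data:
    def _set_dask(self, dx):
        # New signature: no reset_mask_hardness keyword, because mask
        # hardness is now recorded lazily rather than applied eagerly.
        self._dask = dx


d = Data()
# Before this PR: d._set_dask(dx, reset_mask_hardness=False) -> now a TypeError
# After this PR: just pass the dask graph (here a plain list as a placeholder)
d._set_dask([[1, 2], [3, 4]])
```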
Once those are rectified (and merge conflicts managed) this is ready to merge as far as I am concerned. Thanks.
Thanks Sadie - conflict fixed, and merging ...
Work going on in the active storage world has made it undesirable to stack up dask operations during Data.__init__ unless absolutely necessary. Currently, after the data definition, __init__ always adds a "harden/soften mask" operation, and might additionally add an "astype" and/or a "where" operation.

The current thinking is that, for active storage reductions to be implemented, the dask graph can contain only the data definition. If the "astype" and "where" operations are needed then we don't want to use active storage anyway; but the hardness of the mask does not interfere with a storage-side reduction, so adding that operation by default needlessly makes it very hard to tell whether an active storage operation is possible or not.
A solution is to "somewhat lazify" the setting of the mask hardness. Instead of the status quo of adding a dask graph operation every time d.hardmask = X is run, the hardmask attribute now just records the desired state of the mask hardness. It is then up to a future method to apply a mask hardening/softening operation if and only if it is necessary. At present, only two methods need this (__setitem__ and where).

Not only does this pave the way for active storage reductions to be implemented, it also (in my opinion!) makes the code cleaner, too.