Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dask: Data.hardmask, Data._set_dask, Data.to_dask_array #399

Merged
merged 12 commits into from
May 13, 2022

Conversation

davidhassell
Copy link
Collaborator

@davidhassell davidhassell commented May 4, 2022

Work going on in the active storage world has made it undesirable to stack up dask operations during Data.__init__ unless absolutely necessary.

Currently, after the data definition, __init__ always adds on a "harden/soften mask" operation, and in addition might add on an "astype" and/or a "where" operation.

The current thinking is that to implement active storage reductions, the dask graph can only contain the data definition. If the "astype" and "where" operations are needed then we don't want to use active storage anyway, but the hardness of the mask will not interfere with a storage-side reduction, so we don't want to add this operation by default, thereby making it very hard to tell if an active storage operation is possible, or not.

A solution is to "somewhat lazify" the setting of the mask hardness. Instead of the status quo of adding a dask graph operation every time d.hardmask = X is run, the hardmask attribute now just records the desired state of the mask hardness. It is then up to a future method to apply a mask hardening/softening operation if and only if it is necessary. At present, only two methods need this (__setitem__ and where).

Not only does this pave the way for active storage reductions to be implemented, it also (in my opinion!) makes the code cleaner, too.

@davidhassell davidhassell added the dask Relating to the use of Dask label May 4, 2022
@davidhassell
Copy link
Collaborator Author

It also will reduce clutter in the dask graph by removing most of the "cf_harden_mask" and "cf_soften_mask" layers, which are nearly always null-ops.

Copy link
Member

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all I support the new approach underlying this PR, i.e:

A solution is to "somewhat lazify" the setting of the mask hardness. Instead of the status quo of adding a dask graph operation every time d.hardmask = X is run, the hardmask attribute now just records the desired state of the mask hardness. It is then up to a future method to apply a mask hardening/softening operation if and only if it is necessary. At present, only two methods need this (setitem and where).

and agree with your motivating conclusions:

Not only does this pave the way for active storage reductions to be implemented, it also (in my opinion!) makes the code cleaner, too.

The code here is all good (no comments to make at all, in fact 💯) though your recent and final commit to synchronise with lama-to-dask here seems to have introduced (from your other recently-merged PRs) a few cases of reset_mask_hardness which need appropriate conversion, namely I am currently seeing:

======================================================================
ERROR: test_Data_masked_all (__main__.DataTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Data.py", line 3912, in test_Data_masked_all
    d = cf.Data.masked_all(shape)
  File "/home/sadie/cf-python/cf/data/data.py", line 9348, in masked_all
    d._set_dask(dx, reset_mask_hardness=False)
TypeError: _set_dask() got an unexpected keyword argument 'reset_mask_hardness'

======================================================================
ERROR: test_Data_outerproduct (__main__.DataTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Data.py", line 1431, in test_Data_outerproduct
    f = d.outerproduct(b)
  File "/home/sadie/cf-python/cf/data/data.py", line 96, in wrapper
    return method(*args, **kwargs)
  File "/home/sadie/cfdm/cfdm/decorators.py", line 44, in inplace_wrapper
    processed_copy = operation_method(self, *args, **kwargs)
  File "/home/sadie/cf-python/cf/decorators.py", line 62, in precede_with_kwarg_deprecation_check
    operation_method_result = operation_method(self, *args, **kwargs)
  File "/home/sadie/cf-python/cf/data/data.py", line 8793, in outerproduct
    d._set_dask(dx, reset_mask_hardness=False)
TypeError: _set_dask() got an unexpected keyword argument 'reset_mask_hardness'

----------------------------------------------------------------------

Once those are rectified (and merge conflicts managed) this is ready to merge as far as I am concerned. Thanks.

@davidhassell
Copy link
Collaborator Author

Thanks Sadie - conflict fixed, and merging ....

@davidhassell davidhassell merged commit 46b09a0 into NCAS-CMS:lama-to-dask May 13, 2022
@davidhassell davidhassell deleted the dask-hardmask branch November 15, 2022 09:26
@davidhassell davidhassell added this to the 3.14.0 milestone Nov 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask Relating to the use of Dask
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants