Lazy netcdf saves (#5191)
* Basic functional lazy saving.

* Simplify function signature which upsets Sphinx.

* Non-lazy saves return nothing.

* Now fixed to enable use with process/distributed scheduling.

* Remove dask.utils.SerializableLock, which I think was a mistake.

* Make DeferredSaveWrapper use _thread_safe_nc.

* Fixes for non-lazy save.

* Avoid saver error when no deferred writes.

* Reorganise locking code, ready for shareable locks.

* Remove optional usage of 'filelock' for lazy saves.

* Document dask-specific locking; implement differently for threads or distributed schedulers.

* Minor fix for unit-tests.

* Pin libnetcdf to avoid problems -- see #5187.

* Minor test fix.

* Move DeferredSaveWrapper into _thread_safe_nc; replicate the NetCDFDataProxy fix; use one lock per Saver; add extra up-scaled test

* Update lib/iris/fileformats/netcdf/saver.py

Co-authored-by: Bouwe Andela <[email protected]>

* Update lib/iris/fileformats/netcdf/_dask_locks.py

Co-authored-by: Bouwe Andela <[email protected]>

* Update lib/iris/fileformats/netcdf/saver.py

Co-authored-by: Bouwe Andela <[email protected]>

* Small rename + reformat.

* Remove Saver lazy option; all lazy saves are delayed; factor out fillvalue checks and make them delayable.

* Repurposed 'test__FillValueMaskCheckAndStoreTarget' to 'test__data_fillvalue_check', since old class is gone.

* Disable (temporary) saver debug printouts.

* Fix test problems; Saver automatically completes to preserve existing direct usage (which is public API).

* Fix docstring error.

* Fix spurious error in old saver test.

* Fix Saver docstring.

* More robust exit for NetCDFWriteProxy operation.

* Fix doctests by making the Saver example functional.

* Improve docstrings; unify terminology; simplify non-lazy save call.

* Moved netcdf cell-method handling into nc_load_rules.helpers, and various tests into more specific test folders.

* Fix lockfiles and Makefile process.

* Add unit tests for routine _fillvalue_report().

* Remove debug-only code.

* Added tests for what the save function does with the 'compute' keyword.

* Fix mock-specific problems, small tidy.

* Restructure hierarchy of tests.unit.fileformats.netcdf

* Tidy test docstrings.

* Correct test import.

* Avoid incorrect checking of byte data, and a numpy deprecation warning.

* Alter parameter names to make test reports clearer.

* Test basic behaviour of _lazy_stream_data; make 'Saver._delayed_writes' private.

* Add integration tests, and distributed dependency.

* Docstring fixes.

* Documentation section and whatsnew entry.

* Various fixes to whatsnew, docstrings and docs.

* Minor review changes, fix doctest.

* Arrange tests + results to organise by package-name alone.

* Review changes.

* Review changes.

* Enhance tests + debug.

* Support scheduler type 'single-threaded'; allow retries on delayed-save test.

* Improve test.

* Adding a whatsnew entry for 5224 (#5234)

* Adding a whatsnew entry explaining 5224

* Fixing link and format error

* Replacing numpy legacy printing with array2string and remaking results for dependent tests

* adding a whatsnew entry

* configure codecov

* remove results creation commit from blame

* fixing whatsnew entry

* Bump scitools/workflows from 2023.04.1 to 2023.04.2 (#5236)

Bumps [scitools/workflows](https://github.com/scitools/workflows) from 2023.04.1 to 2023.04.2.
- [Release notes](https://github.com/scitools/workflows/releases)
- [Commits](SciTools/workflows@2023.04.1...2023.04.2)

---
updated-dependencies:
- dependency-name: scitools/workflows
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Use real array for data of small netCDF variables. (#5229)

* Small netCDF variable data is real.

* Various test fixes.

* More test fixing.

* Fix printout in Mesh documentation.

* Whatsnew + doctests fix.

* Tweak whatsnew.

* Handle derived coordinates correctly in `concatenate` (#5096)

* First working prototype of concatenate that handles derived coordinates correctly

* Added checks for derived coord metadata during concatenation

* Added tests

* Fixed defaults

* Added what's new entry

* Optimized test coverage

* clarity on whatsnew entry contributors (#5240)

* Modernize and simplify iris.analysis._Groupby (#5015)

* Modernize and simplify _Groupby

* Rename variable to improve readability

Co-authored-by: Martin Yeo <[email protected]>

* Add a whatsnew entry

* Add a type hint to _add_shared_coord

* Add a test for iris.analysis._Groupby.__repr__

---------

Co-authored-by: Martin Yeo <[email protected]>

* Finalises Lazy Data documentation (#5137)

* cube and io lazy data notes added

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added comments within analysis, as well as palette and iterate, and what's new

* fixed docstrings as requested in @trexfeathers review

* reverted cube.py for time being

* fixed flake8 issue

* Lazy data second batch

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updated latest what's new

* I almost hope this wasn't the fix, I'm such a moron

* addressed review changes

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bill Little <[email protected]>

* Fixes to _discontiguity_in_bounds (attempt 2) (#4975)

* update ci locks location (#5228)

* Updated environment lockfiles (#5211)

Co-authored-by: Lockfile bot <[email protected]>

* Increase retries.

* Change debug to show which elements failed.

* update cf standard units (#5244)

* update cf standard units

* added whatsnew entry

* Correct pull number

Co-authored-by: Martin Yeo <[email protected]>

---------

Co-authored-by: Martin Yeo <[email protected]>

* libnetcdf <4.9 pin (#5242)

* Pin libnetcdf<4.9 and update lock files.

* What's New entry.

* libnetcdf not available on PyPI.

* Fix for Pandas v2.0.

* Fix for Pandas v2.0.

* Avoid possible same-file crossover between tests.

* Ensure all-different testfiles; load all vars lazy.

* Revert changes to testing framework.

* Remove repeated line from requirements/py*.yml (?merge error), and re-fix lockfiles.

* Revert some more debug changes.

* Reorganise test for better code clarity.

* Use public 'Dataset.isopen()' instead of '._isopen'.

* Create output files in unique temporary directories.

* Tests for fileformats.netcdf._dask_locks.

* Fix attribution names.

* Fixed new py311 lockfile.

* Fix typos spotted by codespell.

* Add distributed test dep for python 3.11

* Fix lockfile for python 3.11

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Bouwe Andela <[email protected]>
Co-authored-by: Henry Wright <[email protected]>
Co-authored-by: Henry Wright <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Manuel Schlund <[email protected]>
Co-authored-by: Bill Little <[email protected]>
Co-authored-by: Bouwe Andela <[email protected]>
Co-authored-by: Martin Yeo <[email protected]>
Co-authored-by: Elias <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: stephenworsley <[email protected]>
Co-authored-by: scitools-ci[bot] <107775138+scitools-ci[bot]@users.noreply.github.com>
Co-authored-by: Lockfile bot <[email protected]>
14 people authored Apr 21, 2023
1 parent 949b296 commit 94e44ef
Showing 39 changed files with 1,700 additions and 331 deletions.
42 changes: 40 additions & 2 deletions docs/src/userguide/real_and_lazy_data.rst
@@ -6,6 +6,7 @@

import dask.array as da
import iris
from iris.cube import CubeList
import numpy as np


@@ -227,10 +228,47 @@ coordinates' lazy points and bounds:
Dask Processing Options
-----------------------

Iris uses dask to provide lazy data arrays for both Iris cubes and coordinates,
and for computing deferred operations on lazy arrays.
Iris uses `Dask <https://docs.dask.org/en/stable/>`_ to provide lazy data arrays for
both Iris cubes and coordinates, and for computing deferred operations on lazy arrays.

Dask provides processing options to control how deferred operations on lazy arrays
are computed. This is provided via the ``dask.set_options`` interface. See the
`dask documentation <http://dask.pydata.org/en/latest/scheduler-overview.html>`_
for more information on setting dask processing options.
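
As an aside, in recent Dask versions the ``dask.set_options`` interface referred to
above has been superseded by ``dask.config.set``. A sketch of scheduler selection,
assuming a current Dask installation (check your Dask version's documentation):

```python
import dask

# Choose how deferred operations on lazy arrays are computed.
# (Assumption: modern Dask API; older releases used dask.set_options.)
dask.config.set(scheduler="threads")  # multi-threaded (default for arrays)
dask.config.set(scheduler="processes")  # multi-process scheduling
dask.config.set(scheduler="synchronous")  # single-threaded, handy for debugging
```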


.. _delayed_netcdf_save:

Delayed NetCDF Saving
---------------------

When saving data to NetCDF files, it is possible to *delay* writing lazy content to the
output file, to be performed by `Dask <https://docs.dask.org/en/stable/>`_ later,
thus enabling parallel save operations.

This works in the following way:

1. An :func:`iris.save` call is made, with a NetCDF file output and the additional
   keyword ``compute=False``.
   This is currently *only* available when saving to NetCDF, so it is documented in
   the Iris NetCDF file format API. See: :func:`iris.fileformats.netcdf.save`.

2. The call creates the output file, but does not fill in variables' data, where
   the data is a lazy array in the Iris object. Instead, these variables are
   initially created "empty".

3. The :func:`~iris.save` call returns a ``result`` which is a
   :class:`~dask.delayed.Delayed` object.

4. The save can be completed later by calling ``result.compute()``, or by passing it
   to the :func:`dask.compute` call.

The benefit of this is that costly data transfer operations can be performed in
parallel with writes to other data files. Also, where array contents are calculated
from shared lazy input data, these can be computed efficiently in parallel by Dask
(i.e. without re-fetching), similar to what :meth:`iris.cube.CubeList.realise_data`
can do.

.. note::
    This feature does **not** enable parallel writes to the *same* NetCDF output file.
    That can only be done on certain operating systems, with a specially configured
    build of the NetCDF C library, and is not supported by Iris at present.
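
The workflow described above can be sketched purely schematically. The class and
method names below are hypothetical stand-ins, not the Iris implementation (which
returns a genuine :class:`dask.delayed.Delayed`):

```python
# Illustrative sketch of the "delayed save" idea: work is recorded up front
# and only performed when compute() is called, mirroring Delayed.compute().

class FakeDelayed:
    """Stands in for dask.delayed.Delayed: holds pending writes until computed."""

    def __init__(self):
        self._pending = []

    def add_write(self, variable_name, lazy_data):
        # Record the write without performing it -- at this point the file
        # variable would exist but be "empty".
        self._pending.append((variable_name, lazy_data))

    def compute(self):
        # Perform all recorded writes; sum() stands in for streaming real data.
        results = {name: sum(data) for name, data in self._pending}
        self._pending.clear()
        return results


result = FakeDelayed()
result.add_write("air_temperature", [1, 2, 3])
result.add_write("pressure", [4, 5])
print(result.compute())  # {'air_temperature': 6, 'pressure': 9}
```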
31 changes: 29 additions & 2 deletions docs/src/whatsnew/latest.rst
@@ -30,7 +30,33 @@ This document explains the changes made to Iris for this release
✨ Features
===========

#. N/A
#. `@bsherratt`_ added support for plugins - see the corresponding
   :ref:`documentation page<community_plugins>` for further information.
   (:pull:`5144`)

#. `@rcomer`_ enabled lazy evaluation of :obj:`~iris.analysis.RMS` calculations
   with weights. (:pull:`5017`)

#. `@schlunma`_ allowed the usage of cubes, coordinates, cell measures, or
   ancillary variables as weights for cube aggregations
   (:meth:`iris.cube.Cube.collapsed`, :meth:`iris.cube.Cube.aggregated_by`, and
   :meth:`iris.cube.Cube.rolling_window`). This automatically adapts cube units
   if necessary. (:pull:`5084`)

#. `@lbdreyer`_ and `@trexfeathers`_ (reviewer) added :func:`iris.plot.hist`
   and :func:`iris.quickplot.hist`. (:pull:`5189`)

#. `@tinyendian`_ edited :func:`~iris.analysis.cartography.rotate_winds` to
   enable lazy computation of rotated wind vector components (:issue:`4934`,
   :pull:`4972`)

#. `@ESadek-MO`_ updated to the latest CF Standard Names Table v80
   (07 February 2023). (:pull:`5244`)

#. `@pp-mo`_ and `@lbdreyer`_ supported delayed saving of lazy data, when writing to
   the netCDF file format. See: :ref:`delayed netCDF saves <delayed_netcdf_save>`.
   Also with significant input from `@fnattino`_.
   (:pull:`5191`)


🐛 Bugs Fixed
@@ -97,7 +123,8 @@ This document explains the changes made to Iris for this release
Whatsnew author names (@github name) in alphabetical order. Note that
core dev names are automatically included by the common_links.inc:
.. _@fnattino: https://github.com/fnattino
.. _@tinyendian: https://github.com/tinyendian


.. comment
210 changes: 206 additions & 4 deletions lib/iris/fileformats/_nc_load_rules/helpers.py
@@ -13,6 +13,8 @@
build routines, and which it does not use.
"""
import re
from typing import List
import warnings

import cf_units
@@ -28,10 +30,6 @@
import iris.exceptions
import iris.fileformats.cf as cf
import iris.fileformats.netcdf
from iris.fileformats.netcdf import (
    UnknownCellMethodWarning,
    parse_cell_methods,
)
from iris.fileformats.netcdf.loader import _get_cf_var_data
import iris.std_names
import iris.util
@@ -184,6 +182,210 @@
CF_VALUE_STD_NAME_PROJ_Y = "projection_y_coordinate"


################################################################################
# Handling of cell-methods.

_CM_COMMENT = "comment"
_CM_EXTRA = "extra"
_CM_INTERVAL = "interval"
_CM_METHOD = "method"
_CM_NAME = "name"
_CM_PARSE_NAME = re.compile(r"([\w_]+\s*?:\s+)+")
_CM_PARSE = re.compile(
    r"""
    (?P<name>([\w_]+\s*?:\s+)+)
    (?P<method>[\w_\s]+(?![\w_]*\s*?:))\s*
    (?:
        \(\s*
        (?P<extra>.+)
        \)\s*
    )?
    """,
    re.VERBOSE,
)

# Cell methods.
_CM_KNOWN_METHODS = [
    "point",
    "sum",
    "mean",
    "maximum",
    "minimum",
    "mid_range",
    "standard_deviation",
    "variance",
    "mode",
    "median",
]


def _split_cell_methods(nc_cell_methods: str) -> List[re.Match]:
    """
    Split a CF cell_methods attribute string into a list of zero or more cell
    methods, each of which is then parsed with a regex to return a list of match
    objects.

    Args:

    * nc_cell_methods: The value of the cell methods attribute to be split.

    Returns:

    * nc_cell_methods_matches: A list of the re.Match objects associated with
      each parsed cell method.

    Splitting is done based on words followed by colons outside of any brackets.
    Validation of anything other than being laid out in the expected format is
    left to the calling function.

    """
    # Find name candidates
    name_start_inds = []
    for m in _CM_PARSE_NAME.finditer(nc_cell_methods):
        name_start_inds.append(m.start())

    # Remove those that fall inside brackets
    bracket_depth = 0
    for ind, cha in enumerate(nc_cell_methods):
        if cha == "(":
            bracket_depth += 1
        elif cha == ")":
            bracket_depth -= 1
            if bracket_depth < 0:
                msg = (
                    "Cell methods may be incorrectly parsed due to mismatched "
                    "brackets"
                )
                warnings.warn(msg, UserWarning, stacklevel=2)
        if bracket_depth > 0 and ind in name_start_inds:
            name_start_inds.remove(ind)

    # List tuples of indices of starts and ends of the cell methods in the string
    method_indices = []
    for ii in range(len(name_start_inds) - 1):
        method_indices.append((name_start_inds[ii], name_start_inds[ii + 1]))
    method_indices.append((name_start_inds[-1], len(nc_cell_methods)))

    # Index the string and match against each substring
    nc_cell_methods_matches = []
    for start_ind, end_ind in method_indices:
        nc_cell_method_str = nc_cell_methods[start_ind:end_ind]
        nc_cell_method_match = _CM_PARSE.match(nc_cell_method_str.strip())
        if not nc_cell_method_match:
            msg = (
                f"Failed to fully parse cell method string: {nc_cell_methods}"
            )
            warnings.warn(msg, UserWarning, stacklevel=2)
            continue
        nc_cell_methods_matches.append(nc_cell_method_match)

    return nc_cell_methods_matches


class UnknownCellMethodWarning(Warning):
    pass


def parse_cell_methods(nc_cell_methods):
    """
    Parse a CF cell_methods attribute string into a tuple of zero or
    more CellMethod instances.

    Args:

    * nc_cell_methods (str):
        The value of the cell methods attribute to be parsed.

    Returns:

    * cell_methods
        An iterable of :class:`iris.coords.CellMethod`.

    Multiple coordinates, intervals and comments are supported.
    If a method has a non-standard name a warning will be issued, but the
    results are not affected.

    """
    cell_methods = []
    if nc_cell_methods is not None:
        for m in _split_cell_methods(nc_cell_methods):
            d = m.groupdict()
            method = d[_CM_METHOD]
            method = method.strip()
            # Check validity of method, allowing for multi-part methods
            # e.g. mean over years.
            method_words = method.split()
            if method_words[0].lower() not in _CM_KNOWN_METHODS:
                msg = "NetCDF variable contains unknown cell method {!r}"
                warnings.warn(
                    msg.format("{}".format(method_words[0])),
                    UnknownCellMethodWarning,
                )
            d[_CM_METHOD] = method
            name = d[_CM_NAME]
            name = name.replace(" ", "")
            name = name.rstrip(":")
            d[_CM_NAME] = tuple([n for n in name.split(":")])
            interval = []
            comment = []
            if d[_CM_EXTRA] is not None:
                #
                # tokenise the key words and field colon marker
                #
                d[_CM_EXTRA] = d[_CM_EXTRA].replace(
                    "comment:", "<<comment>><<:>>"
                )
                d[_CM_EXTRA] = d[_CM_EXTRA].replace(
                    "interval:", "<<interval>><<:>>"
                )
                d[_CM_EXTRA] = d[_CM_EXTRA].split("<<:>>")
                if len(d[_CM_EXTRA]) == 1:
                    comment.extend(d[_CM_EXTRA])
                else:
                    next_field_type = comment
                    for field in d[_CM_EXTRA]:
                        field_type = next_field_type
                        index = field.rfind("<<interval>>")
                        if index == 0:
                            next_field_type = interval
                            continue
                        elif index > 0:
                            next_field_type = interval
                        else:
                            index = field.rfind("<<comment>>")
                            if index == 0:
                                next_field_type = comment
                                continue
                            elif index > 0:
                                next_field_type = comment
                        if index != -1:
                            field = field[:index]
                        field_type.append(field.strip())
            #
            # cater for a shared interval over multiple axes
            #
            if len(interval):
                if len(d[_CM_NAME]) != len(interval) and len(interval) == 1:
                    interval = interval * len(d[_CM_NAME])
            #
            # cater for a shared comment over multiple axes
            #
            if len(comment):
                if len(d[_CM_NAME]) != len(comment) and len(comment) == 1:
                    comment = comment * len(d[_CM_NAME])
            d[_CM_INTERVAL] = tuple(interval)
            d[_CM_COMMENT] = tuple(comment)
            cell_method = iris.coords.CellMethod(
                d[_CM_METHOD],
                coords=d[_CM_NAME],
                intervals=d[_CM_INTERVAL],
                comments=d[_CM_COMMENT],
            )
            cell_methods.append(cell_method)
    return tuple(cell_methods)
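
Assuming the ``_CM_PARSE`` pattern defined above, a standalone sketch of how one
cell-method string is matched (illustrative only; the sample string is hypothetical):

```python
import re

# Same pattern as _CM_PARSE above: a "name:" prefix, a method word, and an
# optional parenthesised "extra" section for intervals and comments.
_CM_PARSE = re.compile(
    r"""
    (?P<name>([\w_]+\s*?:\s+)+)
    (?P<method>[\w_\s]+(?![\w_]*\s*?:))\s*
    (?:
        \(\s*
        (?P<extra>.+)
        \)\s*
    )?
    """,
    re.VERBOSE,
)

cell_method_str = "time: mean (interval: 1 hr comment: sampled)"
match = _CM_PARSE.match(cell_method_str)
print(match.group("name").strip())    # "time:"
print(match.group("method").strip())  # "mean"
print(match.group("extra"))           # "interval: 1 hr comment: sampled"
```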


################################################################################
def build_cube_metadata(engine):
    """Add the standard meta data to the cube."""
7 changes: 5 additions & 2 deletions lib/iris/fileformats/netcdf/__init__.py
@@ -18,15 +18,18 @@
# Note: *must* be done before importing from submodules, as they also use this !
logger = iris.config.get_logger(__name__)

# Note: these probably shouldn't be public, but for now they are.
from .._nc_load_rules.helpers import (
    UnknownCellMethodWarning,
    parse_cell_methods,
)
from .loader import DEBUG, NetCDFDataProxy, load_cubes
from .saver import (
    CF_CONVENTIONS_VERSION,
    MESH_ELEMENTS,
    SPATIO_TEMPORAL_AXES,
    CFNameCoordMap,
    Saver,
    UnknownCellMethodWarning,
    parse_cell_methods,
    save,
)
