
Rearrangement of raster.delayed to prepare for tiled reprojection implementation with multiprocessing #655

Open · wants to merge 5 commits into base: main from 647-multiprocessing1

Conversation

vschaffn (Contributor)

Resolves #647.

Context

The aim of this ticket is to separate the functions in raster.delayed that use dask from those that do not. The functions that do not use dask will form a common basis for tiled reprojection with both dask and multiprocessing, which will be introduced in a future ticket (#648).

Changes

  • Creation of a new directory raster.distributed_computing to organize the new delayed structure of raster.
  • Creation of a new file raster.distributed_computing.delayed_multiproc to prepare the next implementations.
  • Functions from raster.delayed that do not use dask have been moved to raster.distributed_computing.delayed_utils.
  • Functions from raster.delayed that use dask have been moved to raster.distributed_computing.delayed_dask.
  • In _get_block_ids_per_chunk, cached_cumsum has been replaced by np.cumsum to avoid depending on dask, keeping the function common to both the dask and multiprocessing backends.
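For illustration, here is a minimal sketch of the np.cumsum idea behind that last change. The helper name and chunk values are hypothetical, not the actual _get_block_ids_per_chunk code; the point is only that block start offsets can be derived from chunk sizes with plain NumPy, with no dask dependency (dask's cached_cumsum performs the same cumulative sum, just with caching):

```python
import numpy as np

def get_chunk_starts(chunks):
    """Return the start index of each chunk along one axis.

    ``chunks`` is a tuple of chunk sizes, e.g. (256, 256, 128) for an
    axis of length 640. Plain np.cumsum keeps this helper dask-free.
    """
    # Prepend 0 and drop the final total length to get start offsets.
    return np.insert(np.cumsum(chunks), 0, 0)[:-1]

starts = get_chunk_starts((256, 256, 128))
# starts -> array([  0, 256, 512])
```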

Tests

  • The test structure of test_raster.delayed has been updated to match the new raster.distributed_computing directory.

@rhugonnet (Member)

rhugonnet commented Feb 20, 2025

No comments on this, super! 😁 🚀

Looking ahead at the next step, where we'll have to mirror the main functions of delayed_dask in delayed_multiprocessing (such as delayed_reproject()):
It would be ideal to re-use the same body for these functions, since they share the same over-arching logic of applying the needed steps sequentially for both implementations, rather than copy/pasting them and only adjusting the sub-function calls.
It would simplify maintenance for sure! But I'm unsure how doable that'd be in practice (we could have a single main function for both, called with backend="dask" or backend="multiprocessing" or something similar; but inputs/outputs and internal I/O operations may vary...).

The best is probably to start implementing those main functions in delayed_multiprocessing, and once they are operational, if we realize they really mirror closely the ones in delayed_dask and could be an easy merge, we can still do that easily later 😉.
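As a rough sketch of the backend idea floated above: all names here are hypothetical, and the two per-backend functions are placeholder stand-ins for the real implementations that would live in delayed_dask and delayed_multiproc. The shared entry point only shows the dispatch shape, not any actual reprojection logic:

```python
# Placeholder per-backend implementations (stand-ins for the real
# functions in delayed_dask and delayed_multiproc).
def _reproject_dask(raster, **kwargs):
    return f"dask-reprojected {raster}"

def _reproject_multiproc(raster, **kwargs):
    return f"multiproc-reprojected {raster}"

def reproject(raster, backend="dask", **kwargs):
    """Run the shared tiled-reprojection logic with the chosen backend."""
    dispatch = {"dask": _reproject_dask, "multiprocessing": _reproject_multiproc}
    if backend not in dispatch:
        raise ValueError(f"Unknown backend: {backend!r}")
    return dispatch[backend](raster, **kwargs)

result = reproject("dem.tif", backend="multiprocessing")
# result == "multiproc-reprojected dem.tif"
```

The open question noted in the comment above (diverging inputs/outputs and I/O between backends) is exactly what such a dispatch table would have to absorb, which is why starting with separate implementations and merging later seems reasonable.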

@adebardo left a comment


You will need an __init__.py file in tests/test_raster/test_distributing_computing ---> okay, the __init__.py file is missing in every folder, don't look at my comment (I can't delete it)

@adebardo

@rhugonnet

Moving these functions could affect memory usage by altering caching behaviour or by failing to properly free memory, but I'm not exactly sure. Do you have any idea?

vschaffn force-pushed the 647-multiprocessing1 branch from 01d8c52 to efc51b5 on February 24, 2025 at 16:19
@rhugonnet (Member)

@adebardo Not sure I understand: you mean the changes in this PR could negatively affect cache/memory in this way?

We have memory testing implemented in test_delayed. Those tests seem to be passing in this PR's CI now.

Sometimes there is an error raised because the cluster does not close properly when using dem-memusage. It's an annoying issue we haven't fully solved (for now it is silenced by executing those tests at the end), see here: #545.

@rhugonnet (Member)

rhugonnet commented Feb 24, 2025

Haaaaa, and that's why the tests are failing in this CI: we need to update the module name here to the new test module test_delayed_dask, to maintain the temporary fix of #561:

module_items_reordered = [it for k, it in enumerate(module_items) if module_names[k] != "test_delayed"] + [

Then the tests should never fail (but there are still some cluster teardown messages printed that can pollute the pytest output a bit).
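For illustration, the reordering trick could look like the following sketch. The helper name and surrounding shape are assumed from the one-line snippet above, not the actual conftest code: items from the memory-usage module are simply moved to the end of the collection so cluster-teardown noise does not disturb other tests.

```python
# Assumed shape of the conftest reordering fix (hypothetical helper,
# inferred from the snippet above): run items belonging to the
# memory-usage test module after everything else.
def reorder_items(module_items, module_names, last_module="test_delayed_dask"):
    others = [it for k, it in enumerate(module_items) if module_names[k] != last_module]
    last = [it for k, it in enumerate(module_items) if module_names[k] == last_module]
    return others + last

items = ["a1", "d1", "a2", "d2"]
names = ["test_a", "test_delayed_dask", "test_a", "test_delayed_dask"]
reordered = reorder_items(items, names)
# reordered == ["a1", "a2", "d1", "d2"]
```

This is also why renaming the module from test_delayed to test_delayed_dask silently broke the fix: the filter string no longer matched, so the memory-usage items were no longer deferred.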

vschaffn force-pushed the 647-multiprocessing1 branch from efc51b5 to 6091983 on February 25, 2025 at 09:58
@rhugonnet (Member)

@vschaffn @adebardo I went through the code changes again... Hard to grasp why the memory usage would increase. Maybe splitting the code into subfunctions simply creates more links to store in the Dask graph, and this builds up a few more tens of MB over all chunks and loops.

If memory usage before this PR's changes was already very close to ~100 MB (the limit of the test; it is now slightly above, at 120 MB), then we could simply increase the test threshold, e.g. multiply it by 1.5 or so.
If it was much below 100 MB, then we should look into it more...
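If the threshold route is taken, a sketch of what the relaxed check could look like (the numbers and helper name are illustrative, not the actual test code):

```python
# Illustrative memory-limit check with a safety margin, as discussed:
# rather than chasing a few tens of MB of extra Dask-graph overhead,
# multiply the previous ~100 MB limit by a factor such as 1.5.
BASELINE_LIMIT_MB = 100
SAFETY_MARGIN = 1.5

def check_memory_usage(measured_mb: float) -> bool:
    """Assert that measured memory stays under the relaxed limit."""
    limit_mb = BASELINE_LIMIT_MB * SAFETY_MARGIN
    assert measured_mb < limit_mb, f"{measured_mb} MB exceeds {limit_mb} MB"
    return True

check_memory_usage(120)  # 120 MB passes against the relaxed 150 MB limit
```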


Successfully merging this pull request may close these issues.

[POC] first try of multiprocessing for reprojection 1/2