Setting rechunk_max_mem to worker memory doesn't work #54

Open
rsignell-usgs opened this issue Sep 28, 2020 · 6 comments

@rsignell-usgs
Member

My first time using rechunker!
I'm running on HPC with a SlurmCluster, reading and writing zarr to fast local disk.

This runs for about a minute, then fails with KeyError: startstops:
https://nbviewer.jupyter.org/gist/rsignell-usgs/fe0a1ec0cec562b14ddcc9a222ddc34f

@TomAugspurger
Member

That's a strange one. It's likely an issue in distributed rather than rechunker.

What version of dask are you using? I'm not familiar with how SLURMCluster handles environments, but is the version "on the cluster" the same as your client's (/home/rsignell/miniconda3/envs/pangeo3/bin/python)?

@rsignell-usgs
Member Author

@TomAugspurger:

  1. I installed pangeo-dask:2020.09.19 from conda-forge, so I'm using:
(pangeo3) rsignell@denali-login2:~> conda list dask
# packages in environment at /home/rsignell/miniconda3/envs/pangeo3:
#
# Name                    Version                   Build  Channel
dask                      2.27.0                     py_0    conda-forge
dask-core                 2.27.0                     py_0    conda-forge
dask-gateway              0.8.0            py37hc8dfbb8_0    conda-forge
dask-jobqueue             0.7.1                      py_0    conda-forge
dask-kubernetes           0.10.1                     py_0    conda-forge
dask-labextension         3.0.0                      py_0    conda-forge
pangeo-dask               2020.09.19                    0    conda-forge
  2. My understanding is that Dask-jobqueue (which SLURMCluster uses) submits jobs with the same environment as the notebook, so all versions should match, right?

@rsignell-usgs
Member Author

rsignell-usgs commented Sep 29, 2020

@TomAugspurger, well, you were right: there is something funky with my use of SLURMCluster. When I tried a LocalCluster, it worked fine.

@rsignell-usgs
Member Author

rsignell-usgs commented Sep 29, 2020

@TomAugspurger, and now I've discovered the real problem. It wasn't a problem with SLURMCluster either. My problem was that I was specifying the rechunker max_mem to be the same as the memory on my workers, which apparently doesn't work. When I set my worker memory to 6GB and the rechunker max_mem to 4GB, both LocalCluster and SLURMCluster work fine:

[screenshot: 2020-09-29_10-04-28]

So specifying max_mem to be 2/3 of the worker memory worked for me. Is it known what fraction should be used?
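For reference, here is a minimal sketch of the setup that worked for me, assuming dask-jobqueue's SLURMCluster and rechunker's rechunk API; the store paths, target chunks, and variable name are illustrative placeholders:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster
import rechunker
import zarr

# Workers get 6GB each; rechunker is told it may use at most 4GB (~2/3).
cluster = SLURMCluster(cores=1, memory="6GB")
cluster.scale(jobs=4)
client = Client(cluster)

source = zarr.open("/fast/local/source.zarr")["var"]   # illustrative path/variable
plan = rechunker.rechunk(
    source,
    target_chunks=(1, 720, 1440),                      # illustrative target chunks
    max_mem="4GB",                                      # ~2/3 of worker memory
    target_store="/fast/local/target.zarr",
    temp_store="/fast/local/temp.zarr",
)
plan.execute()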

@TomAugspurger
Member

TomAugspurger commented Sep 29, 2020

I think that rechunker will try to allocate as close to max_mem as possible. That should probably not exceed the various thresholds specified in https://docs.dask.org/en/latest/configuration-reference.html#distributed.worker.memory.target.

For this workload (where the memory usage can be very tightly controlled), the best option might be setting distributed.worker.memory.{target,spill,pause} very close to terminate (so something like 0.9) and then setting max_mem close to whatever that value is (0.9 * worker memory).

You'll want at least some buffer for other stuff in the Python process (like module objects).

(If that's correct, then we should document it 😄)
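If that advice holds, a sketch of what the configuration might look like, assuming the standard distributed.worker.memory.* config keys from the reference linked above; the specific fractions and the 6GB worker size are just the numbers discussed in this thread:

import dask

# Hypothetical illustration: move target/spill/pause close to terminate,
# then size rechunker's max_mem just below them. Depending on the cluster,
# workers may need this config at startup (e.g. via the dask config file)
# rather than via the client process.
dask.config.set({
    "distributed.worker.memory.target": 0.90,
    "distributed.worker.memory.spill": 0.92,
    "distributed.worker.memory.pause": 0.94,
    "distributed.worker.memory.terminate": 0.98,
})

worker_memory_gb = 6                                   # per-worker memory limit
max_mem = "{:.1f}GB".format(0.9 * worker_memory_gb)    # "5.4GB", passed to rechunk(..., max_mem=max_mem)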

rsignell-usgs changed the title from "keyerror: startstops" to "Setting rechunk_max_mem to worker memory doesn't work" on Oct 28, 2020
@rsignell-usgs
Member Author

For a worker memory of 6GB, I tried systematically reducing rechunk_max_mem from 6GB (didn't work), to 5GB (didn't work), to 4GB (did work). So for this workflow, setting rechunk_max_mem to 2/3 of the worker memory works.
