Setting rechunk_max_mem to worker memory doesn't work #54

Open
rsignell-usgs opened this issue Sep 28, 2020 · 6 comments

@rsignell-usgs
Member

My first time using rechunker!
I'm running on HPC with a SlurmCluster, reading and writing zarr to fast local disk.

This runs for about a minute, then fails with KeyError: startstops:
https://nbviewer.jupyter.org/gist/rsignell-usgs/fe0a1ec0cec562b14ddcc9a222ddc34f

@TomAugspurger
Member

That's a strange one. It's likely an issue in distributed rather than rechunker.

What version of dask are you using? I'm not familiar with how SLURMCluster handles environments, but is the version "on the cluster" the same as your client's (/home/rsignell/miniconda3/envs/pangeo3/bin/python)?

@rsignell-usgs
Member Author

@TomAugspurger:

  1. I installed pangeo-dask:2020.09.19 from conda-forge, so I'm using:
(pangeo3) rsignell@denali-login2:~> conda list dask
# packages in environment at /home/rsignell/miniconda3/envs/pangeo3:
#
# Name                    Version                   Build  Channel
dask                      2.27.0                     py_0    conda-forge
dask-core                 2.27.0                     py_0    conda-forge
dask-gateway              0.8.0            py37hc8dfbb8_0    conda-forge
dask-jobqueue             0.7.1                      py_0    conda-forge
dask-kubernetes           0.10.1                     py_0    conda-forge
dask-labextension         3.0.0                      py_0    conda-forge
pangeo-dask               2020.09.19                    0    conda-forge
  2. My understanding is that Dask-jobqueue (which SLURMCluster uses) submits jobs with the same environment as the notebook, so all versions should match, right?

@rsignell-usgs
Member Author

rsignell-usgs commented Sep 29, 2020

@TomAugspurger, well, you were right: there is something funky with my use of SLURMCluster. When I tried a LocalCluster, it worked fine.

@rsignell-usgs
Member Author

rsignell-usgs commented Sep 29, 2020

@TomAugspurger, and now I've discovered the real problem. It wasn't a problem with SLURMCluster either. My problem was that I was specifying the rechunker max_mem to be the same as the memory on my workers, which apparently doesn't work. When I set my worker memory to 6GB and the rechunker max_mem to 4GB, both LocalCluster and SLURMCluster work fine:

[screenshot: 2020-09-29_10-04-28]

So specifying max_mem to be 2/3 of the worker memory worked for me. Is it known what fraction should be used?
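For reference, here is a minimal sketch of the setup that worked for me, assuming dask-jobqueue's SLURMCluster and rechunker's rechunk API; the store paths, target chunks, and variable name are illustrative placeholders:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster
import rechunker
import zarr

# Workers get 6GB each; rechunker is told it may use at most 4GB (~2/3).
cluster = SLURMCluster(cores=1, memory="6GB")
cluster.scale(jobs=4)
client = Client(cluster)

source = zarr.open("/fast/local/source.zarr")["var"]   # illustrative path/variable
plan = rechunker.rechunk(
    source,
    target_chunks=(1, 720, 1440),                      # illustrative target chunks
    max_mem="4GB",                                      # ~2/3 of worker memory
    target_store="/fast/local/target.zarr",
    temp_store="/fast/local/temp.zarr",
)
plan.execute()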

@TomAugspurger
Member

TomAugspurger commented Sep 29, 2020

I think that rechunker will try to allocate as close to max_mem as possible. That should probably not exceed the various thresholds specified in https://docs.dask.org/en/latest/configuration-reference.html#distributed.worker.memory.target.

For this workload (where the memory usage can be very tightly controlled), the best option might be setting distributed.worker.memory.{target,spill,pause} very close to terminate (so something like 0.9) and then setting max_mem close to whatever that value is (0.9 * worker memory).

You'll want at least some buffer for other stuff in the Python process (like module objects).

(If that's correct, then we should document it 😄)
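If that advice holds, a sketch of what the configuration might look like, assuming the standard distributed.worker.memory.* config keys from the reference linked above; the specific fractions and the 6GB worker size are just the numbers discussed in this thread:

import dask

# Hypothetical illustration: move target/spill/pause close to terminate,
# then size rechunker's max_mem just below them. Depending on the cluster,
# workers may need this config at startup (e.g. via the dask config file)
# rather than via the client process.
dask.config.set({
    "distributed.worker.memory.target": 0.90,
    "distributed.worker.memory.spill": 0.92,
    "distributed.worker.memory.pause": 0.94,
    "distributed.worker.memory.terminate": 0.98,
})

worker_memory_gb = 6                                   # per-worker memory limit
max_mem = "{:.1f}GB".format(0.9 * worker_memory_gb)    # "5.4GB", passed to rechunk(..., max_mem=max_mem)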

rsignell-usgs changed the title from "keyerror: startstops" to "Setting rechunk_max_mem to worker memory doesn't work" on Oct 28, 2020
@rsignell-usgs
Member Author

For a worker memory of 6GB, I tried systematically reducing rechunk_max_mem from 6GB (didn't work), to 5GB (didn't work), to 4GB (did work). So for this workflow, setting rechunk_max_mem to 2/3 of the worker memory works.
