Improve serializability of Recipe and related classes #116
I tried to mitigate this issue via #117. I also tried defining the recipe as follows, completely skipping any storage assignments:

```python
import pandas as pd

from pangeo_forge.patterns import ConcatDim, FilePattern
from pangeo_forge.recipes import XarrayZarrRecipe  # import path assumed


def make_url(time):
    input_url_pattern = (
        "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
        "/v2.1/access/avhrr/{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
    )
    return input_url_pattern.format(
        yyyymm=time.strftime("%Y%m"), yyyymmdd=time.strftime("%Y%m%d")
    )


time_dim = ConcatDim(
    "time",
    pd.date_range("1981-09-01", "2021-01-05", freq="D"),
    nitems_per_file=1,
)
pattern = FilePattern(make_url, time_dim)
recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=5)
```

However, the execution step

```python
import dask

from pangeo_forge.executors import DaskPipelineExecutor

executor_dask = DaskPipelineExecutor()
pl = recipe.to_pipelines()
plan = executor_dask.pipelines_to_plan(pl)
dask.persist(plan, retries=5)
```

...still blows up my 16 GB of memory. What tools can I use to debug the serialization of this graph to understand why it is so memory hungry?
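One quick check (a sketch, assuming `plan` is the delayed object from the snippet above) is to measure how large the whole task graph is once it is pickled the way a scheduler would ship it:

```python
import cloudpickle
from dask.utils import format_bytes

# Materialize the graph behind the delayed object and measure its pickled size.
graph = dict(plan.__dask_graph__())
print(len(graph), "tasks")
print(format_bytes(len(cloudpickle.dumps(graph))))
```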
Looking into this:

```python
import cloudpickle
import dask.utils
import pandas as pd
from time import perf_counter

stats = []
for k, v in plan.dask.items():
    t0 = perf_counter()
    out = cloudpickle.dumps(v)
    t1 = perf_counter()
    stats.append((k, t1 - t0, len(out)))

df = pd.DataFrame(stats, columns=['key', 'time', 'size'])
df['hl_key'] = [dask.utils.key_split(x) for x in df['key'].array]
df.groupby("hl_key")[['time', 'size']].sum()
```

This is on a subset of the data with ~1500 tasks. We need to improve the serialization of the various objects in the recipe. I'll take a look at that next.
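As a follow-up on the same `df` (a sketch, not from the original comment), sorting by serialized size shows which task groups and which individual tasks dominate:

```python
# Largest task groups by total pickled size.
print(df.groupby('hl_key')[['time', 'size']].sum().sort_values('size', ascending=False))

# The single heaviest tasks usually reveal which object (e.g. the recipe or
# the FilePattern instance) is being baked into every task.
print(df.sort_values('size', ascending=False).head(10))
```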
Thanks Tom! Can you try with #117? That's my attempt to improve serializability of the FilePattern objects. But it didn't seem to help.
(copied from #117 (comment)) We can pursue this if serialization is still an issue.

I think the best path forward here is to construct a custom HighLevelGraph (or perhaps just a custom graph `Layer`):

```python
import dask
from dask.highlevelgraph import Layer


class CacheInputLayer(Layer):
    """A lazy graph layer whose keys are ``(token, i)`` tuples.

    Tasks are generated on demand rather than materialized up front.
    """

    def __init__(self, annotations=None, func=None, n=None):
        self.token = dask.base.tokenize(func)
        self.func = func
        self.n = n
        super().__init__(annotations=annotations)

    def __getitem__(self, index):
        token, n = index
        if token == self.token and isinstance(n, int) and (0 <= n < self.n):
            return (self.func, n)
        raise KeyError(index)

    def __len__(self):
        return self.n

    def __iter__(self):
        for i in range(self.n):
            yield (self.token, i)

    def is_materialized(self):
        return False


cache_input = pl[0][0]
layer = CacheInputLayer(func=cache_input.func, n=len(cache_input.map_args))
```

The idea is to replace all of the individual `cache_input` tasks in the graph: a single instance of that class stands in for the many per-input tasks, generating each one only when a scheduler asks for it.

I don't know when I'll have time to get to this, so if anyone wants to dig into it feel free. Otherwise I'll try to get to it over the next couple weeks.
We are still struggling with memory issues for large recipes. The issue is coming up not just with the Dask executor but also with Prefect (see #151). So some deep debugging / creative thinking is still needed.
After the refactor in #101, our dask graphs have gotten a lot bigger.
I did the following, based on the NOAA OISST recipe from the docs.
Then I converted it to a dask delayed object to execute on a Dask Gateway cluster.
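A sketch of that conversion step, assuming the same executor API used in the comment above (`DaskPipelineExecutor`, `to_pipelines`, `pipelines_to_plan`):

```python
from pangeo_forge.executors import DaskPipelineExecutor

executor = DaskPipelineExecutor()
pl = recipe.to_pipelines()
plan = executor.pipelines_to_plan(pl)  # a dask Delayed object wrapping the whole graph
```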
The graph is about 10 MB, not huge, right?
But when I call `persist`, I quickly run out of memory in my notebook pod (16 GB) before the call finishes. I also tried with `optimize_graph=False`.

One culprit could be the new FilePattern class, which stores a big xarray dataset internally:
https://github.com/pangeo-forge/pangeo-forge/blob/1c5f3520d6fdca718cb372d60161d4fd8617dcbb/pangeo_forge/patterns.py#L81
I feel like we could optimize this using `__getstate__`/`__setstate__`. @martindurant it would be great if you could look at this.
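A minimal sketch of that `__getstate__`/`__setstate__` idea (the `_dataset` attribute and `_build_dataset` helper are illustrative names, not the real FilePattern internals): leave the heavy dataset out of the pickle and rebuild it after unpickling.

```python
class FilePattern:
    def __init__(self, format_function, *combine_dims):
        self.format_function = format_function
        self.combine_dims = combine_dims
        # Hypothetical stand-in for the big internal xarray dataset.
        self._dataset = self._build_dataset()

    def _build_dataset(self):
        # Placeholder: build the internal dataset from format_function/combine_dims.
        ...

    def __getstate__(self):
        # Drop the heavy dataset from the pickled state.
        state = self.__dict__.copy()
        state.pop("_dataset", None)
        return state

    def __setstate__(self, state):
        # Restore everything else, then rebuild the dataset locally.
        self.__dict__.update(state)
        self._dataset = self._build_dataset()
```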