Refactor XarrayZarrRecipe to be serialization-friendly #160
Conversation
Thanks for working on this Tom!
Does it directly build on #153? Or is it a new implementation of similar ideas? This is a pretty big refactoring, so I'd like to understand the motivation more clearly, specifically why this is needed on top of #153. Is the basic issue that we cannot use any methods (that contain `self`)?
xarray_open_kwargs: dict,
delete_input_encoding: bool,
process_input: Optional[Callable[[xr.Dataset, str], xr.Dataset]],
metadata_cache: Optional[MetadataTarget],
I feel like these long argument blocks may hurt maintainability. There is so much room for programmer error when passing long lists of arguments through the stack. If we go this route, perhaps we want to enclose some of the arguments in a `Config` object?
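For illustration, here is a minimal sketch of what the suggested Config-object approach could look like; the class name, the fields chosen, and the `open_input` helper are hypothetical, not taken from this PR:

```python
from dataclasses import dataclass
from typing import Callable, Optional

import xarray as xr


@dataclass(frozen=True)
class InputProcessingConfig:
    """Hypothetical grouping of the arguments threaded through the stack."""
    xarray_open_kwargs: dict
    delete_input_encoding: bool = True
    process_input: Optional[Callable[[xr.Dataset, str], xr.Dataset]] = None


def open_input(fname: str, config: InputProcessingConfig) -> xr.Dataset:
    # One config argument instead of several loose ones at every call site.
    ds = xr.open_dataset(fname, **config.xarray_open_kwargs)
    if config.delete_input_encoding:
        for var in ds.variables:
            ds[var].encoding = {}
    if config.process_input is not None:
        ds = config.process_input(ds, fname)
    return ds
```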
Agreed! Though I think the biggest risk of programmer error is right now / in future major refactors. So it comes down to simplicity of the code (not having to look up what this "FooConfig" thing is) vs. typing all these out. I'm 51% in favor of writing things out explicitly like this, but am happy to go with a Config / Options style object.
I will say that mypy has been helpful here. It caught a couple of issues before running the tests.
New implementation of similar ideas.
Essentially, yes. That's why I fear #153 alone won't fix it, since https://github.com/pangeo-forge/pangeo-forge-recipes/pull/153/files#diff-a78cae0f25369a56a98f5a65392472c337c3df4dd99398759babf5071dc2032eR109-R112 captures the recipe object in the scope of the delayed function.
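To make the concern concrete, here is a small, self-contained illustration (not the actual #153 or #160 code) of why a function that closes over the recipe object serializes much larger than a module-level function taking explicit arguments:

```python
from dataclasses import dataclass, field

import cloudpickle  # used by dask and prefect to ship functions to workers


@dataclass
class Recipe:
    target_path: str
    # stand-in for large attributes such as a file pattern or cached metadata
    big_state: list = field(default_factory=lambda: list(range(100_000)))

    def make_closure(self):
        def store_chunk(chunk_key):
            # `self` is captured, so the whole recipe travels with every task
            return (self.target_path, chunk_key)
        return store_chunk


def store_chunk(chunk_key, target_path):
    # module-level: only the explicitly passed arguments get serialized
    return (target_path, chunk_key)


recipe = Recipe("memory://store.zarr")
print(len(cloudpickle.dumps(recipe.make_closure())))  # large: drags big_state along
print(len(cloudpickle.dumps(store_chunk)))            # small: just the function
```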
cache_input_task = task(self._cache_input, name="cache_input")
prepare_target_task = task(self._prepare_target, name="prepare_target")
store_chunk_task = task(self._store_chunk, name="store_chunk")
finalize_target_task = task(self._finalize_target, name="finalize_target")
Here you are now using `_finalize_target` rather than `finalize_target`, etc.
Is there any point in maintaining the `finalize_target` methods if we are going to remove the existing `to_pipelines` method in `BaseRecipe`?
I think having the top-level property `.finalize_target` is still really nice for users who are developing / debugging recipes. We want to keep that, while still allowing Dask and Prefect to access the (partially applied) function itself, so they can wrap it in a Task.
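A rough sketch of the pattern being described here, with illustrative names (not the actual recipe class): a module-level stage function, exposed to users through a property that returns a `functools.partial`, which executors can wrap in a task without pulling in `self`:

```python
import functools
from dataclasses import dataclass


def finalize_target(target_path: str, consolidate_zarr: bool) -> None:
    """Module-level stage function; no reference to the recipe object."""
    print(f"finalizing {target_path} (consolidate={consolidate_zarr})")


@dataclass
class Recipe:
    target_path: str
    consolidate_zarr: bool = True

    @property
    def finalize_target(self):
        # Nice for interactive debugging: `recipe.finalize_target()`.
        # Executors can take the returned partial and wrap it in a Task.
        return functools.partial(
            finalize_target,
            target_path=self.target_path,
            consolidate_zarr=self.consolidate_zarr,
        )


recipe = Recipe("memory://store.zarr")
recipe.finalize_target()        # debugging use
func = recipe.finalize_target   # a picklable, self-free callable for executors
```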
Ok, so I think I am on board with this as hopefully the right solution to the serializability problems. Here is a partial checklist of some things that IMO would need to be done before we merge this:
Happy to help with some of these via pushing directly to this branch. Thanks again Tom for taking the time to sort out the core issues.
cache_input,
cache_inputs=self.cache_inputs,
input_cache=self.input_cache,
file_pattern=self.file_pattern,
I am having a hard time understanding why these partial functions are "small" in a serialization sense, since each function in the graph essentially contains a `functools.partial`-wrapped version of all of the same attributes that are part of the recipe class? In particular, `file_pattern` and the `target`s are all fairly complex / large objects themselves, which are now being curried into a single-argument function. How is that any better than having a method attached to a class?
Just noting that my initial attempts to answer this question have failed. My hunch is that it's just easier to serialize these functions than it is to serialize the dataclass, but I'd like to have a stronger justification than that. Pointing to the workflow in pangeo-forge/pangeo-forge-azure-bakery#10 is some evidence, but it's pretty indirect.
I'll keep trying to come up with a clear answer.
To be clear, having a forensic understanding of this is interesting academically but far less important than actually shipping code that works. 😁 So given limited time, I would focus on the checklist above (rather than digging deeper).
Let me know if you want help on any aspects here. I have stopped working on #153 in the meantime.
I'll take care of the first few items around `to_pipelines()` and will look into the outstanding metadata caching.
I updated the docs and notebooks. I had some trouble running the notebook, but I think it was an issue with my local internet connection.
(cherry picked from commit 9c39bf7)
Is it intentional to override the `to_dask` and `to_prefect` methods from `BaseRecipe` in `XarrayZarrRecipe`?
    yield k

def to_prefect(self):
    """Compile the recipe to a Prefect.Flow object."""
    from prefect import Flow, task, unmapped
Isn't this now implemented twice? You also have this in BaseRecipe, no?
return xr.open_zarr(target_mapper)

for i, input_key in enumerate(self.iter_inputs()):
    dsk[(f"cache_input-{token}", i)] = (self._cache_input, input_key)
dsk[f"checkpoint_0-{token}"] = (lambda *args: None, list(dsk))
Same, this is also implemented in BaseRecipe, no?
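For readers unfamiliar with the low-level graph construction in the quoted lines, here is a tiny self-contained sketch of the same dict-graph pattern (illustrative only, not the PR's implementation): each stage becomes entries in a plain dask task graph, with a no-op "checkpoint" node that depends on everything created before it:

```python
import dask
from dask.base import tokenize


def cache_input(input_key):
    return f"cached-{input_key}"


def finalize(checkpoint):
    return "done"


inputs = ["a", "b", "c"]
token = tokenize(inputs)

dsk = {}
for i, input_key in enumerate(inputs):
    dsk[(f"cache_input-{token}", i)] = (cache_input, input_key)
# barrier: a no-op task that depends on all cache_input tasks created so far
dsk[f"checkpoint_0-{token}"] = (lambda *args: None, list(dsk))
dsk[f"finalize-{token}"] = (finalize, f"checkpoint_0-{token}")

print(dask.get(dsk, f"finalize-{token}"))  # -> "done"
```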
Fixed in 6d190ff, which removes the duplicated implementations. I'm going to test this commit out on the full dataset in https://github.com/pangeo-forge/pangeo-forge-azure-bakery again.
Well, I feel a bit silly. The difficulty in answering https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160/files/e70d52662875f8835cc180cb289fc4e6d4445e4a#diff-e12c886cc124886c5cfa5313d760a36c39649af9da845077c663e6feab8487b5 spurred some more investigation into what was actually taking a lot of size to serialize. It turns out that most of the size was in the FilePattern class. Maybe this isn't too surprising, but it did surprise me that the size was actually in the function defined at pangeo_forge_recipes/patterns.py lines 151 to 160 (as of 562e76c).
The alternative is to create the FilePattern manually. For some reason in my debugging, I had turned …
👍 This is definitely expected from my POV. Perhaps we should deprecate that function.
I think that's worthwhile. It is convenient, but I worry that this will bite us again as we scale users' recipes to large jobs. And I hope that manually constructing the FilePattern isn't too much more difficult.
FWIW, when I ran the flow using the fixed FilePattern construction on pangeo-forge-recipes master, my scheduler was OOM killed. It does seem like this branch is doing something.
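For reference, a minimal sketch of the "manual" FilePattern construction being discussed, assuming the `FilePattern` / `ConcatDim` API in `pangeo_forge_recipes.patterns` at the time; the URL template and date range are made up. The point is that the format function computes each URL from its key, so no long list of URLs needs to be captured in a closure or stored on the pattern:

```python
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

dates = pd.date_range("1981-09-01", "1982-09-01", freq="D")


def make_url(time):
    # illustrative URL template; the argument name must match the ConcatDim name
    return f"https://example.com/data/{time:%Y/%m/%d}/oisst_{time:%Y%m%d}.nc"


pattern = FilePattern(
    make_url,
    ConcatDim("time", keys=list(dates), nitems_per_file=1),
)
```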
The big question I have is whether this PR is still significantly better than #153 in terms of serialization. (Assuming the more efficient FilePattern approach is used in both cases.)
It seems like it. I ran a flow that just sleeps; pangeo-forge/pangeo-forge-azure-bakery#10 (comment) has the results.
I went through this again and noticed that all of the global metadata references have been commented out. Can you explain why global metadata is more difficult with this new functional syntax?
Is there something I can do to help here?
# if cache_metadata:
#     # if nitems_per_input is not constant, we need to cache this info
#     recipe_meta = {"input_sequence_lens": input_sequence_lens}
#     return recipe_meta
What's going on here? I don't follow your comment.
# TODO(METADATA): set
metadata_cache[_input_metadata_fname(input_key)] = input_metadata
Is there some reason metadata requires different treatment here?
# TODO(Tom): Handle metadata caching here
# else:
#     global_metadata = metadata_cache[_GLOBAL_METADATA_KEY]
#     input_sequence_lens = global_metadata["input_sequence_lens"]
Just flagging that metadata stuff has been commented out.
Fixing the metadata stuff is my (hopefully) last TODO. I'll dig into it today, but maybe you can answer this easily: where do the writes and reads to the global metadata store happen, the client process or the worker processes?
Worker processes. The client knows nothing about the execution state.
Sounds good. So it's assumed that the metadata cache is globally readable / writeable, like a blob storage file system. In that case, I think my last commit fixes things, but the tests were passing without it. It's probably worth adding some kind of check to the tests to verify that.
I'm running through the notebooks now, and at least some are failing; I'll update the notebooks to use MetadataTarget.
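For concreteness, a sketch of pointing a recipe at a globally readable / writeable metadata cache; this assumes the `MetadataTarget` class in `pangeo_forge_recipes.storage` and uses a made-up S3 bucket (any fsspec filesystem reachable by all workers would do):

```python
import s3fs
from pangeo_forge_recipes.storage import MetadataTarget

fs = s3fs.S3FileSystem()  # credentials must also be available on the workers
metadata_cache = MetadataTarget(fs=fs, root_path="s3://my-bucket/pangeo-forge/metadata")

# then, on a recipe constructed elsewhere:
# recipe.metadata_cache = metadata_cache
```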
The notebooks caught an issue with something I had removed. The notebooks are now updated to use a MetadataTarget.
This is fantastic. Thanks so much Tom for all your hard work on this!
Could you just update the release notes? I think at this point our next release is 0.4, so we probably need to add a new section.
Done in f57e36a.
@TomAugspurger, can you think of a test we could add to guard against regressions related to serialization? Relevant for big PRs like #166 which touch a lot of different pieces...
Nothing really comes to mind :/ The typical way to check is to serialize the object and then check the size of the bytestring. But IIRC we found that these objects weren't all that large.
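One possible shape for such a test, sketched with illustrative names (the stage attributes and size threshold are assumptions, and `recipe` would be an existing fixture): serialize each per-stage callable and assert the bytestring stays small, so that accidentally capturing a large object (a whole recipe, a long file list) fails CI:

```python
import cloudpickle


def test_stage_functions_serialize_small(recipe):
    max_bytes = 100_000  # generous bound; tune to whatever is typical today
    for func in (
        recipe.cache_input,
        recipe.prepare_target,
        recipe.store_chunk,
        recipe.finalize_target,
    ):
        assert len(cloudpickle.dumps(func)) < max_bytes
```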
This is (another) refactor of the XarrayZarrRecipe to resolve some of the memory issues we've seen when executing large jobs (pangeo-forge/pangeo-forge-azure-bakery#10, #151).

It builds on #153, which adds `to_dask` and `to_prefect` methods (we should probably do that at the base recipe level). It looks like a large diff, but it's primarily just moving code from methods on `XarrayZarrRecipe` to top-level functions, and forwarding arguments appropriately. This eliminates `self` from the functions sent to workers. More explanation at https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160/files#diff-e12c886cc124886c5cfa5313d760a36c39649af9da845077c663e6feab8487b5R685-R693.

The main outstanding task right now is ensuring that the metadata_cache is being handled properly. I need to better understand where that object is supposed to live (on the client or the workers?) and who is supposed to be able to write to it (do workers write to it? Is it expected to be global so that writes from a worker process are seen by the client?)
Remaining tasks:
- Move the `to_dask` / `to_prefect` methods to `BaseRecipe`. This would also help clarify the API questions (`finalize_target` vs `_finalize_target`, which are required to be implemented where, etc.)
- Remove the `to_pipelines` method
- Add a `to_function` method, which is equivalent to the existing PythonExecutor
- Run through `docs/tutorials` to verify no end-user API changes are needed.

Closes #116.