Write with Zarr #86
Conversation
Note that this requires pydata/xarray#5065.
Review comment on `pangeo_forge/recipe.py` (outdated diff):

```python
    var.data
)  # TODO: can we buffer large data rather than loading it all?
zarr_array = zgroup[vname]
with lock_for_conflicts((vname, conflicts)):
```
Just confirming: it's safe to lock here, rather than up on L324 where you read `zarr_array = zgroup[vname]`? Some other worker thread writing to `zgroup[vname]` won't change anything we care about?
What I want to avoid is having "surprise" tasks get sent to the scheduler by accidentally operating on chunked arrays. In the recipe class, we use dask arrays for laziness, but we don't actually want any parallel computation happening at the dask.array level, because parallelism is managed at a higher level in Pangeo Forge.

The `scheduler="single-threaded"` thing above is there to make sure we don't accidentally read from the array using multiple threads and thus oversubscribe the workers. It's not related to locking, and it may not strictly be necessary. At some point we should do a careful audit of this throughout the code base.
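As an illustration (a minimal sketch, not the recipe code itself), this is the effect the `scheduler="single-threaded"` setting is after when loading a lazy array:

```python
import dask
import dask.array as da

# Stand-in for a lazily opened variable's data (shape and chunks are assumed).
lazy_var_data = da.zeros((1000, 1000), chunks=(100, 100))

# Load without spawning parallel dask tasks: every chunk is read in the calling
# thread, so the worker running this task is not oversubscribed.
with dask.config.set(scheduler="single-threaded"):
    data = lazy_var_data.compute()
```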
The tests are failing with the error:

This argument is added in pydata/xarray#5065, which is why I am installing xarray from my dev branch. It shows up as installed:

(Don't know where the weird version number is coming from, but I get the same thing locally, where tests pass.)
This test failed
This case has overlapping chunk writes and uses dask.distributed locking to avoid write conflicts. I've never seen it fail locally, but this suggests there could be an intermittent problem here. (The same test passed on Python 3.8.) I'm going to retrigger the tests and see if it comes up again.
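For context, a minimal sketch of what that kind of distributed locking can look like (a hypothetical helper in the spirit of `lock_for_conflicts`, not the actual implementation):

```python
from contextlib import contextmanager

from dask.distributed import Lock


@contextmanager
def lock_for_keys(keys):
    # One named distributed lock per potentially conflicting region; overlapping
    # writes from different workers are serialized on these names.
    locks = [Lock(f"pangeo-forge-{key}") for key in keys]
    for lock in locks:
        lock.acquire()
    try:
        yield
    finally:
        for lock in locks:
            lock.release()


# Usage inside a chunk-writing task might look like:
# with lock_for_keys([f"{vname}-{c}" for c in conflicts]):
#     zarr_array[zarr_region] = data
```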
Wow, tests are green!
f"Storing variable {vname} chunk {chunk_key} " | ||
f"to Zarr region {zarr_region}" | ||
) | ||
zarr_array[zarr_region] = data |
this seems OK to me?
Is there a good reason why you need multiple chunks for the
```python
# get encoding for variable from zarr attributes
# could this backfire some way?
var_coded.encoding.update(zarr_array.attrs)
```
This looks fine to me, as long as it's consistent with how xarray loads encoding for Zarr.
Yes I think it is.
f"Storing variable {vname} chunk {chunk_key} " | ||
f"to Zarr region {zarr_region}" | ||
) | ||
zarr_array[zarr_region] = data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems OK to me?
Thanks a lot for your comments, Stephan!
The way Pangeo Forge works right now is that the dataset is built up incrementally (in a distributed fashion) from each "chunk", spread over the "sequence dimension" (usually time). In contrast to the xarray `open_mfdataset` -> `to_zarr` approach, we don't know the time coordinate a priori before we start writing chunks. To avoid conflicts between chunks, each variable with potentially overlapping writes is written under a lock. There are two possible workarounds:
Both of these would be performance optimizations, since things already "work" now, so I would leave them to a future PR.
If you look at the approach I used in #82, I don't write out coordinate variables in a chunked fashion, but rather do it ahead of time as part of initializing the Zarr store. This works well as long as you can figure out the coordinates ahead of time:
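A hypothetical sketch of that pattern (not the actual code from #82), assuming the full time axis can be constructed before any chunk is written:

```python
import dask.array as da
import pandas as pd
import xarray as xr

# Assumed example dimensions; the full time axis is known before any data is written.
time = pd.date_range("2000-01-01", periods=365, freq="D")

template = xr.Dataset(
    {"sst": (("time", "lat", "lon"), da.zeros((365, 180, 360), chunks=(30, 180, 360)))},
    coords={"time": time},
)

# With compute=False, xarray writes the store metadata and the numpy-backed
# coordinates up front, while the dask-backed data variables are left to be
# filled in later by per-chunk writes.
template.to_zarr("target.zarr", mode="w", compute=False)
```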
I guess this is basically your first workaround, "write it all as one chunk in
That only works for certain types of recipes, e.g. where there is a fixed number of timesteps per file. For others, you have to open every file and peek inside. We already do that, in fact. So I agree this should be possible in theory. I'll open a new issue for it.
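A rough illustration of that "peek inside" step (hypothetical code; assumes netCDF inputs that xarray can open):

```python
import xarray as xr

input_paths = ["input_001.nc", "input_002.nc"]  # assumed example inputs


def peek_n_timesteps(path):
    # Open lazily and read only metadata to learn how many timesteps the file holds.
    with xr.open_dataset(path, decode_times=False) as ds:
        return ds.sizes["time"]


n_timesteps = {path: peek_n_timesteps(path) for path in input_paths}
```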
Again we got corrupted data on a dask write. This strongly suggests that my implementation of locking is broken. I need to dig into this deeper.
```python
def pytest_addoption(parser):
    parser.addoption(
        "--redirect-dask-worker-logs-to-stdout", action="store", default="NOTSET",
    )
```
This was a fun solution to #84 (comment).
```python
del mapper[key]
assert list(mapper) == []
```
```python
dask.compute([do_stuff(n) for n in range(n_tasks)])
```
This PR revealed that the locking I thought I had implemented was actually broken. Here is a test that really confirms it works.
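For flavor, a self-contained sketch of that kind of test (hypothetical, not the actual pangeo-forge test): many tasks do a read-modify-write on the same Zarr array under a named distributed lock, and the final value shows that no update was lost.

```python
import dask
import zarr
from dask.distributed import Client, Lock


def do_stuff(n, store="counter.zarr"):
    # Each task increments the shared counter; the named Lock serializes the
    # read-modify-write so concurrent tasks cannot clobber each other.
    with Lock("zarr-write-lock"):
        arr = zarr.open(store, mode="r+")
        arr[0] = arr[0] + 1
    return n


if __name__ == "__main__":
    client = Client(n_workers=4)
    n_tasks = 20
    zarr.open("counter.zarr", mode="w", shape=(1,), chunks=(1,), dtype="i8")
    tasks = [dask.delayed(do_stuff)(n) for n in range(n_tasks)]
    dask.compute(tasks)
    assert zarr.open("counter.zarr", mode="r")[0] == n_tasks
```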
This makes sense. For what it's worth, if I were writing this from scratch, I would probably solve this by aggregating "side outputs" after looking at each file (maybe in a pre-processing pass), rather than "re-writing" chunks in the Zarr file. I guess the downside of this is that it would require support for aggregating results in executors, rather than simply support for mapping over all inputs. This would be easy in Beam and Dask, but maybe not in more ad hoc task schedulers.
I really appreciate the design discussion. Thanks for taking the time! 😊
In the end, aren't these kind of the same? If we want to aggregate the time coordinate from each chunk, we need to either:

In either case, we have to deal with encoding. (Imagine that …) If we consolidate chunks in a finalize step, we are basically just using the zarr array as a temporary cache for the data from each chunk, and we don't have to deal with a new, distinct serialization mechanism.
I'm going to merge this later tonight if there are no further comments.
This is true, except you also have to write everything to disk. In practice, this is typically fine, but it's definitely slower than keeping things entirely in memory. With large enough datasets, it can make a difference (this is one reason why Spark was preferred to Hadoop).
I have really been liking the simplicity of zero explicit communication between tasks, because you can just manually run each task in the pipeline as a function. It makes things so much easier to develop and debug. One way to mitigate the disk-speed problem would be to use a Redis db for the
Based on my prototype in #82, manual execution for collecting metadata might look something like:

```python
metadata = []
for example in recipe.iter_metadata():
    metadata.append(recipe.extract_metadata(example))
recipe.prepare_target(metadata)
```

From an API perspective, the main difference is that an executor would need the ability to pass around Python objects. I guess this would be straightforward for most executors, but some might need an additional explicit cache.
Hilariously, I just stumbled on this PR via git-blame, only to discover that I had already been here for a long discussion, without absorbing at all why you did this! 🤣 I just ran into these issues myself, as the bottleneck in my pipeline! pydata/xarray#5252 should fix the situation in upstream Xarray as well, by using consolidated metadata for incremental writes with
In debugging some real-world use cases (e.g. pangeo-forge/staged-recipes#23), I realized that our current way of writing suffers from some performance bottlenecks. Specifically, when calling `xr.to_zarr` from each `store_chunk` task, the entire dataset has to be read AND the dimension coordinates have to be loaded. This translates into thousands of gcsfs calls. This is particularly bad for the `time` dimension, which is a dimension coordinate but is chunked in time.

The solution, proposed by @TomAugspurger in https://discourse.pangeo.io/t/netcdf-to-zarr-best-practices/1119, is to bypass xarray when writing individual chunks (I still use it to set up the dataset). This gives us the best of both worlds.
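A simplified sketch of that direct-write pattern (hypothetical names, not the actual `store_chunk` implementation): the chunk's data goes straight into a region of the target Zarr array, without re-opening the whole dataset through xarray.

```python
import numpy as np
import xarray as xr
import zarr


def write_chunk_region(ds_chunk: xr.Dataset, target: str, vname: str, zarr_region: tuple):
    # Open only the target group/array with zarr; no xarray dataset open, so no
    # coordinate reads and far fewer storage calls per chunk.
    zgroup = zarr.open_group(target, mode="a")
    zarr_array = zgroup[vname]
    data = np.asarray(ds_chunk[vname].data)  # load just this chunk into memory
    zarr_array[zarr_region] = data


# e.g. write timesteps 10-19 of "sst" into an already-initialized store:
# write_chunk_region(chunk_ds, "target.zarr", "sst", (slice(10, 20),))
```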