Zero-copy array shuffle #8282
Conversation
```
@@ -157,11 +160,21 @@ async def _process(self, id: str, shards: list[bytes]) -> None:
with self._directory_lock.read():
    if self._closed:
        raise RuntimeError("Already closed")
    with open(
        self.directory / str(id), mode="ab", buffering=100_000_000
```
This huge buffering setting was just wasting memory.
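A minimal sketch (hypothetical helper, not the PR's actual change) of the plain default-buffered append: sequential appends are already coalesced by the OS page cache, so a large userspace buffer mostly just holds a second copy of data that is on its way to disk anyway.

```python
from pathlib import Path

def append_frames(path: Path, frames: list[bytes]) -> None:
    # Hypothetical helper: append frames with Python's default buffer size
    # instead of a 100 MB one; the OS page cache absorbs sequential appends.
    with open(path, mode="ab") as fh:
        for frame in frames:
            fh.write(frame)
```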
Eradicating seven deep copies? Bold claim! Very exciting! Haven't reviewed the changes yet.
distributed/shuffle/_disk.py
Outdated
```python
else:
    # Unserialized numpy arrays
    frames = concat(
        serialize_bytelist(shard, compression=False)
```
`compression=False` is crucial for memory-mapped reads later.
We could, however, consider enabling it at a later point without changing the design (though it would require offloading).
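For context, a minimal sketch (using a hypothetical `shard.bin` file, not distributed's actual disk code) of why uncompressed frames matter: an uncompressed frame on disk can be memory-mapped and wrapped as an ndarray view with no copy, whereas a compressed frame would first have to be decompressed into a freshly allocated buffer.

```python
import mmap

import numpy as np

a = np.arange(1_000_000, dtype="f8")
with open("shard.bin", "wb") as fh:
    fh.write(a.tobytes())  # raw, uncompressed frame

with open("shard.bin", "rb") as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)

# Zero-copy: the array is a view over the mapping; pages are faulted in lazily.
b = np.frombuffer(mm, dtype="f8")
assert (a == b).all()
```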
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0   27 suites ±0   14h 26m 18s ⏱️ +1h 9m 48s

For more details on these failures and errors, see this check.

Results for commit 1b26ec6. ± Comparison against base commit 010f896.

♻️ This comment has been updated with latest results.
force-pushed from 9c648b3 to f657c54
Functional bug fixed; moving to perf tests
force-pushed from f657c54 to 98efdc4
I've got some early LocalCluster benchmarks and they look very exciting.

```python
import pickle

import dask.array as da
import distributed
from distributed.diagnostics.memory_sampler import MemorySampler

client = distributed.Client(n_workers=4, threads_per_worker=2)

N = 250  # 29 GiB
old = ((50, 40, 35, 25, 45, 55),) * 4  # 3 to 180 MiB chunks
new = ((40, 35, 45, 70, 60),) * 4

a = da.random.random((N, N, N, N), chunks=old)
b = a.rechunk(chunks=new, method="p2p")
c = b.sum()

try:
    with open("ms.pickle", "rb") as fh:
        ms = pickle.load(fh)
except FileNotFoundError:
    ms = MemorySampler()

with ms.sample("main", interval=0.1):
    c.compute()

with open("ms.pickle", "wb") as fh:
    pickle.dump(ms, fh)
```

Legend

- main: main branch
- zero-copy mmap: What's described in the opening post. Data is never deep-copied, except when slicing by column and possibly when the kernel performs mmap caching.
- zero-copy no-mmap: Variant that doesn't use memory-mapped I/O and removes a wealth of seek() and short read() disk accesses at kernel level, replacing them with a single monolithic read(), like in the SpillBuffer, followed by a deep-copy in concatenate3().
- copy-on-shard mmap: Variant that reintroduces a single, explicit deep-copy as soon as the shards are cut out of the original chunk, to allow releasing it faster.

I don't quite understand why the copy-on-shard variant is faster than the zero-copy one, unless pickle.dumps is much slower than ndarray.copy() at generating a non-view buffer for some reason...
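For reference (not part of the original comment), the runs accumulated in ms.pickle can then be compared on a single chart using MemorySampler's pandas/matplotlib helpers:

```python
import pickle

from distributed.diagnostics.memory_sampler import MemorySampler

with open("ms.pickle", "rb") as fh:
    ms: MemorySampler = pickle.load(fh)

df = ms.to_pandas(align=True)   # one column per labelled run, aligned to t=0
ms.plot(align=True, grid=True)  # cluster-wide memory over time, one line per run
```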
The initial A/B test results leave me a bit miffed. They deliver the expected performance boost for trivial dataset sizes, but the benefit seems to disappear as the dataset grows. I'll need to investigate why. It doesn't make sense to me; p2p runtime should scale linearly with data size and p2p memory usage should be constant.
Looking at the metrics for
FYI, this may or may not be related to a leak I've identified (but not yet root-caused) for many chunks with
The "leak" was just a warmup artifact. Not sure why the warmup is larger than in the main branch, but the delta is negligible IMHO.
The slowdown with 8 MiB chunks (which in turn generate 8 kiB shards) was caused by the fact that

A/B tests, including the new diskless p2p:

Detail on the two regressions shows that these use cases are, in fact, 20-25% slower, but the increased memory consumption is quite negligible:
```python
# Don't wait until all shards have been transferred over the network
# before data can be released
if shard.base is not None:
    shard = shard.copy()
```
explicit deep-copy mentioned in step 3 of the opening post
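For illustration (a toy sketch, not the PR's code): a NumPy slice is a view whose `.base` pins the entire original chunk, so an explicit `copy()` is what allows the chunk's memory to be released before all of its shards have left the worker.

```python
import numpy as np

chunk = np.random.random((1_000, 1_000))  # original input chunk, ~8 MB

shard = chunk[:, 100:200]    # slicing creates a view ...
assert shard.base is chunk   # ... that keeps the whole 8 MB chunk alive

shard = shard.copy()         # explicit deep-copy detaches the shard
assert shard.base is None    # now `chunk` can be garbage-collected independently
del chunk
```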
First pass over this looks good. I'm curious about the regression, though. IIUC we're now much faster for reasonably sized arrays but slower for arrays with very small chunks. In a perfect world, we would be agnostic to chunk sizes using P2P (at least to first order; using smart buffering and such things). Do you see a way to reduce that penalty in a follow-up PR?
Initial pass looks good; I'll look at a few workloads up close before signing off.
I have not run a profile to understand where the slowdown is. I suspect it may be in the general serialization and networking stack, in which case it would be hard to fix (but very worthwhile, considering it would impact all network comms, not just shuffle).
Yes, that's exactly the intended behaviour. And that should happen immediately after concatenate3().
So that we're on the same page, here's the exact code I execute:

```python
# `client`, `size`, `input_chunks`, `output_chunks`, `rechunk_method` and `disk`
# are provided by the benchmark fixtures/parameters
import time
import dask
import dask.array as da

def _(x):
    time.sleep(1)
    return x

arr = da.random.random(size, chunks=input_chunks)
with dask.config.set({"array.rechunk.method": rechunk_method, "distributed.p2p.disk": disk}):
    client.compute(arr.rechunk(output_chunks).map_blocks(_).sum(), sync=True)
```

The important bit is the

Ideally, I'd expect the output chunks that get aggregated into the partial results of the sum to vanish and instantly free their memory. From what I see, that does not happen. Instead, memory is only freed once the last (few) output chunks have been processed.

I realize this is a trade-off, but this new behavior may hold the outputs in memory much longer if there is a more involved processing/reduction chain in place than

This may be more desirable than the previous sluggish rechunking, but it's a significant change in the memory footprint.
To reiterate: if the buffer contains all shards of a single input chunk that belong to the worker, it will contain a shard for every output chunk on that worker. To free this buffer, we need to
I understand now. This is definitely not OK. I'll figure out why some mmap'ed buffers are surviving concatenation.
IIUC, this should have nothing to do with
It should not be like that.

The shards in the send phase are deep-copied immediately after sharding, so they don't hold a reference to the buffer of the original chunk. FWIW, the initial version without the explicit deep copy was only mildly more memory-intensive and slower than the one with the deep copy.

The shards in the receive phase share a single buffer per RPC call. This is a known issue (can't find the ticket). So the memory for all the shards in an RPC call will be released, all at once, when the last of them is written to disk and dereferenced.

The shards in the read-from-disk phase that sit in the same file share the same buffer, so the buffer will be released when all the shards are released. However, if I understand correctly, there's one file per output chunk, so all shards in a file should contribute to the same concatenate3 and be released all at once afterwards?
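To illustrate the receive-phase behaviour described above (a toy sketch, not the actual comm code): shards cut out of one network buffer are views, so any single surviving shard pins the entire buffer.

```python
# One RPC call delivers a single large buffer ...
buf = bytearray(8 * 1024 * 1024)

# ... which is sliced into per-output-chunk shards without copying.
shards = [memoryview(buf)[i * 1024 : (i + 1) * 1024] for i in range(4)]

del buf              # the 8 MiB bytearray stays alive: every view references it
shards = shards[:1]  # even one remaining shard keeps the whole buffer pinned
```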
Ah, this makes sense. You're retaining the per-RPC-call aggregation of the buffers until all concatenate calls have completed. I need to introduce an explicit deep-copy specifically for memory.
I'm not an expert here, but related to the above "memory leak": maybe it would be a good and easy fix to just copy the output array before we return it in
```python
[run] = a.extensions["shuffle"].shuffle_runs._runs
shards = [
    s3 for s1 in run._disk_buffer._shards.values() for s2 in s1 for _, s3 in s2
]
```
@hendrikmakait this is ugly; could you think of a nicer way to get the shards in the MemoryBuffer?
There isn't really an API to get all the shards from the buffer as it wasn't designed for that.
force-pushed from bc11931 to 4dccfd6
force-pushed from 4dccfd6 to 62ea31f
Workload from #8282 (comment)
force-pushed from b5d5d59 to c2bb18e
force-pushed from c2bb18e to 5e57171
force-pushed from 5e57171 to e322258
This is fixed in #8321.
The improvements look great! Thanks, @crusaderky, for diving deeper into this and giving rechunking a thorough treatment. There are some bits and pieces in the code that feel less clean, but I'm also refactoring the buffers, so I don't expect those to stick around; they should become part of the refactored version.
distributed/shuffle/_disk.py
Outdated
```python
frames: Iterable[bytes | bytearray | memoryview]

if not shards or isinstance(shards[0], bytes):
```
Do we need to handle `not shards` here? IIUC, we should not attempt to write an empty list.
Happy to change it to `assert shards` and see if anything breaks.
Let's give that a shot; if this breaks any tests, that would mean something doesn't work as (I) expected.
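A minimal sketch (hypothetical helper and names, not the PR's actual code) of what the assert-based variant of that dispatch could look like:

```python
from __future__ import annotations

from collections.abc import Iterable


def _frames_from_shards(shards: list) -> Iterable[bytes | bytearray | memoryview]:
    # Hypothetical helper sketching the discussed change: refuse an empty list
    # outright instead of silently writing nothing.
    assert shards, "never flush an empty shard list"
    if isinstance(shards[0], bytes):
        return shards  # already-serialized shards: write them verbatim
    return [memoryview(s) for s in shards]  # raw buffers: wrap without copying
```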
There isn't really an API to get all the shards from the buffer as it wasn't designed for that.
Fully agree. I particularly dislike that there's half of the (de)serialization logic in
Perform the P2P rechunk of dask.array objects without ever deep-copying the buffers.
This is currently not working due to what looks like a trivial bug somewhere; however, I'm eager to receive early feedback on the design.
CC @fjetter @hendrikmakait
`_receive`, which in turn unpickles the individual bytes objects (fourth deep-copy), reorganizes them, and re-pickles them (fifth deep-copy).
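For background, an illustrative sketch (not distributed's internal serialization code) of the kind of copy being counted here: a plain pickle round-trip of an ndarray deep-copies its data, while pickle protocol 5 with out-of-band buffers hands the consumer zero-copy views instead.

```python
import pickle

import numpy as np

a = np.random.random(1_000_000)

# Plain pickling materializes a new bytes object: one full deep-copy of the data.
blob = pickle.dumps(a)

# Protocol 5 with out-of-band buffers returns zero-copy views over a's memory;
# only small metadata is copied into `meta`.
buffers: list[pickle.PickleBuffer] = []
meta = pickle.dumps(a, protocol=5, buffer_callback=buffers.append)
view = buffers[0].raw()  # memoryview over the existing array memory, no copy

# The receiver can rebuild the array around those very same buffers.
b = pickle.loads(meta, buffers=buffers)
```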