-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rust Integration: Perform Max Flow in Rust instead of Python #1
Conversation
…de option to run rust or python versions
Could you also quickly document how to build and run this? |
Co-authored-by: Jae-Won Chung <[email protected]>
… pimpl version is actually slower
Edit: needed to also add an empty Fixing some other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is absolutely great work, thank you! I've left some suggestions and comments.
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
One thing that would be nice to have is an automated e2e test for result consistency.
The overarching goal is to improve performance by using Rust instead of Python to perform intense computations, while maintaining the Python API.
We implement Python-Rust interop using
pyo3
. We use thepathfinding
crate for the Rust-side implementation of the Edmonds-Karp max flow algorithm.Through initial profiling, we have identified the call to
nx.maximum_flow
to be the overwhelming bottleneck (51-74% of total time spent). This PR replaces the calls tonx.maximum_flow
with Rust-side bindings, which initial profiling shows a total speedup of 1.6-2.6x.Links
Results
Based on profiling runs of 5 models on A40 and A100 GPUs, we found:
max_flow
max_flow
Build and Run
Assume you have some project that is using lowtime, and you have a local clone of this lowtime branch.
Then run the following to set up a virtual environment with this version of lowtime.
If you modify
lib.rs
and want those changes reflected:Design Decisions
SparseCapacity
vsDenseCapacity
The Rust-side code uses the
pathfinding::durected::edmonds_karp::edmonds_karp
implementation for max flow. The function takes in anEK
type parameter, that determines whether to represent the graph using aSparseCapacity
(BTree
ofBtree
s) orDenseCapacity
(adjaceny matrix). Due to the nature of computation graphs of parallelized training workloads, aSparseCapacity
was more efficient. This was verified through profiling, which showed thatSparseCapacity
was up to 10x faster thanDenseCapacity
.OrderedFloat<f64>
vsf32
vsi64
The
edmonds_karp
implementation requires theZero,
Bounded,
Signed,
Ord,
Copy` traits to be implemented by capacity values.Ord
, so they cannot be used;OrderedFloat
does.OrderedFloat<f32>
resulted in incorrect output due to lack of precision.We considered an implementation where we use integers, and convert to/from floats by multiplying by 1e9.
i64
resulted in incorrect output. Furthermore, it resulted in a similar runtime (no significant gain).u64
cannot be used as it does not implementSigned
, which is needed to represent negative residual flows inedmonds_karp
.Therefore
OrderedFloat<f64>
was chosen to represent capacity.pyo3
Data Transfer Reduction with "pimpl" idiomThe
_lowtime_rs::PhillipsDessouky
constructor returns the object itself, which is very large as it contains all the nodes, edges, and associated capacities. Due to a misconception thatpyo3
would have to create a "Python copy" of the Rust object, we experimented with using the pimpl idiom to implement a wrapperPhillipsDessouky
class that contained a pointer to an internal_PhillipsDessouky
class, which contained the actual data and implementations. However,pyo3
seems to have a clever way of associating Rust objects with Python objects without creating a "Python copy". In fact, profiling the "pimpl version" resulted in slightly slower runtimes, probably due to the added indirection and use of the heap.Handling Rust-side Errors
Before this PR, the lowtime codebase performs exception handling on the call to
nx.maximum_flow
, since thenx
library is designed to raise well-defined exceptions (eg.nx.NetworkXUnbounded
). However, the Rust-sidepathfinding
library is does not contain well-defined errors; instead, theedmonds_karp
function considers the following situations unrecoverable:panic!
s ifsource
andsink
nodes are not found inverticies
.unreachable!
if there is no flow to cancel in its internal representation of node capacities. One way this is caused is by providing a negative capacity as input (this caused a very hard-to-debug error when some flows were very small negative values due to floating point imprecision).assert!
s many invariants of the internal graph representation (assert!
willpanic!
if the condition is false).If the Rust-side library use well-defined errors,
pyo3
has integration for propagating those errors as Python exceptions. The "recommended way" when using a third-party crate that does not contain well-defined errors is to, if possible, modify the third-party crate itself to do this.pathfinding
chooses to use unrecoverable macros over well-defined errors. Assuming we do not intend on changing thepathfinding
codebase itself, we have 2 options:catch_unwind
to catchpanic!
s in Rust. Once caught, we convert the error to a custom Error type that is propagated to Python as an exception, which can be caught and handled in Python.panic!
s, letting the program crash gracefully. Thepanic!
andunreachable!
macros are intended as graceful exit points for unrecoverable states, so we should not try to treat it like a Python exception.There is significant debate on this topic; for example, this discussion based on the polars library. In the
polars
discussion, the debate was closed with:We make the same decision here: we do not attempt to catch or handle
panic!
s andunreachable!
crashes. If needed in the future, we can add invariant checks before the call toedmonds_karp
.Profiling Set Up
Through the development process, we use profiling data to determine whether an idea was worth pursuing. Since there were many ideas that depended on each other, there was a need to build a profiling infrastructure fast (even if it is scrappy) before we could start experimenting with lowtime. As a result, the current profiling inftrastructure was built:
find_min_cut
).job.log
for profiling-related logs and groups relevant intervals together.While fast, this approach results in many profiling related logs in the lowtime codebase, and requires a separate script to parse the logs post-lowtime. An ideal solution would be integrated profiling infrastructure integrated into lowtime itself, that can be turned on/off through a command-line argument when running lowtime. However, this PR chooses the "scrappy" option because:
We will need to remove the profiling logs at the end of this chain of commits (i.e. when we eventually merge
lowtime-rust
withmain
), but we keep them for now as we will need to use the existing profiling infrastructure for future commits in this chain.