Scheduling should be topology/NUMA aware #319
Comments
I agree we should have something, but it's a hard problem in general. To really be NUMA-optimized, programs ought to use a NUMA-aware allocator and/or migrate pages between nodes as needed, both of which are really outside the scope of rayon. We don't have any idea of what data might be accessed by a job, let alone where it lives, except perhaps for the immediate collection being iterated. As a first step, I think we could do better just by offering a node-restricted threadpool. You can already accomplish this manually, by setting thread affinities on the pool's workers yourself. For deeper integration, I think the job stealing would need to be aware of the NUMA hierarchy, which can span multiple levels. Start by only trying to steal jobs from your own node, then look one level out, etc. But this is still a fairly crude way to try to maintain memory locality, as again we don't really know how the data is allocated. We should also research what other threading libraries in other languages do for NUMA. |
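As an illustration of that manual workaround, here is a minimal sketch of a node-restricted pool, assuming the `core_affinity` crate and a hypothetical `NODE0_CORES` list of logical core IDs belonging to one NUMA node (in real code these would be queried from hwloc or the OS); rayon's `ThreadPoolBuilder::start_handler` pins each worker as it starts:

```rust
use core_affinity::CoreId;
use rayon::prelude::*;

// Hypothetical: the logical core IDs that belong to NUMA node 0 on this machine.
const NODE0_CORES: &[usize] = &[0, 1, 2, 3];

fn node_restricted_pool() -> rayon::ThreadPool {
    rayon::ThreadPoolBuilder::new()
        .num_threads(NODE0_CORES.len())
        // Pin the i-th worker thread to the i-th core of the node as it starts.
        .start_handler(|worker_idx| {
            let _ = core_affinity::set_for_current(CoreId { id: NODE0_CORES[worker_idx] });
        })
        .build()
        .expect("failed to build node-restricted pool")
}

fn main() {
    let pool = node_restricted_pool();
    // All parallel work submitted through `install` now runs only on node 0's cores.
    let sum: u64 = pool.install(|| (0..1_000_000u64).into_par_iter().sum());
    println!("{sum}");
}
```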
FWIW, the workers were also shared before rayon-core, except when your project resolved to multiple distinct rayon versions. The point of rayon-core is to make sure we have that common pool even when there are multiple rayon versions. There's |
I basically agree with @cuviper =) -- solid goal, difficult to achieve well. Giving people the ability to configure things would be an obvious starting point. |
Another thing to consider would be a split-LLC topology. I think all of this can be detected using hwloc, and there seem to be Rust bindings available. This issue affects me greatly, so I would like to help with the implementation. I have a NUMA and split-LLC (two L3 caches per NUMA node) CPU, so I can test it myself. I am no expert on the matter and have not researched work-stealing queues and algorithms, but my naive idea would be as follows.
As far as memory locality is concerned, I think that if work-stealing queues are allocated in unique pages of memory, the OS will migrate them automatically. We could always try to do things like that explicitly. Would this solution introduce expensive locking to maintain these lists of queues? Is there a clever way to reduce contention? My intuition would be that in a loaded system, worker threads would eventually settle down on specific CPUs and this migration would be very rare, unless external pressure forces the OS to move rayon threads around. It also seems to me that this initial cost is similar to the one exerted by work-stealing operations that hammer the one queue where the work originated. Atomics and coherency can't be cheap at that level of contention. Any feedback and guidance is appreciated. |
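To make the "unique pages" idea concrete, here is a minimal sketch (a generic illustration, assuming a 4 KiB page size; real code would query the page size from the OS) of allocating a queue's backing buffer so that it occupies its own pages, letting the OS's first-touch and NUMA-balancing policies place or migrate it independently of neighbouring data:

```rust
use std::alloc::{alloc, dealloc, Layout};

const PAGE_SIZE: usize = 4096; // assumed; query sysconf(_SC_PAGESIZE) in real code

/// Allocate `bytes`, rounded up to whole pages and aligned to a page boundary,
/// so the buffer shares no page with unrelated data.
fn alloc_page_isolated(bytes: usize) -> (*mut u8, Layout) {
    let size = bytes.div_ceil(PAGE_SIZE) * PAGE_SIZE;
    let layout = Layout::from_size_align(size, PAGE_SIZE).expect("invalid layout");
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    (ptr, layout)
}

/// Release a buffer obtained from `alloc_page_isolated`.
fn free_page_isolated(ptr: *mut u8, layout: Layout) {
    unsafe { dealloc(ptr, layout) };
}
```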
Here's a user's experience where OpenMP fared much better than Rayon: https://www.reddit.com/r/rust/comments/bto10h/update_a_scaling_comparison_between_rustrayon_and/ In my own comments and experiments on a 128-thread POWER9, I found that while we fared worse when using all threads, Rayon and OpenMP actually perform pretty similarly when pinned to a single NUMA node. In fact, Rayon showed higher throughput on 4 of 5 tests, with an overall elapsed time of 6.17s vs. 6.62s. |
Intel's OpenMP library is implemented on top of their open-source, Apache-2.0-licensed Threading Building Blocks (TBB) library; perhaps there are some ideas there that can be ported into Rayon to improve performance (on NUMA architectures or otherwise)? https://github.com/intel/tbb @cuviper to your comment above (from a couple of years back): TBB includes a custom allocator, tbbmalloc, which is also part of their open-source repo (under the src/tbbmalloc folder). Maybe that could be used with Rust's custom allocator functionality? (If so, I'd guess linking+invoking tbbmalloc would be the easiest way at first; then maybe it could be ported later if it proves fruitful.) |
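As a rough sketch of that last idea, Rust's global allocator hook could be pointed at tbbmalloc via its C entry points (`scalable_aligned_malloc`/`scalable_aligned_free` from TBB's scalable_allocator.h), assuming the binary is linked against libtbbmalloc; this is an untested illustration, not a vetted integration:

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::os::raw::c_void;

// FFI declarations for tbbmalloc's C interface (link with -ltbbmalloc).
extern "C" {
    fn scalable_aligned_malloc(size: usize, alignment: usize) -> *mut c_void;
    fn scalable_aligned_free(ptr: *mut c_void);
}

struct TbbMalloc;

unsafe impl GlobalAlloc for TbbMalloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // `Layout` guarantees a power-of-two alignment, as tbbmalloc requires.
        scalable_aligned_malloc(layout.size(), layout.align()) as *mut u8
    }

    unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
        scalable_aligned_free(ptr as *mut c_void)
    }
}

// Route every Rust heap allocation through tbbmalloc.
#[global_allocator]
static GLOBAL: TbbMalloc = TbbMalloc;

fn main() {
    let v: Vec<u64> = (0..1_000).collect(); // allocated via tbbmalloc
    println!("{}", v.iter().sum::<u64>());
}
```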
@jack-pappas My impression is that Also, I don't see any mention of NUMA in the TBB source. From this forum post, it sounds like they also punt on this issue:
It might be an interesting data point to port that Babel Stream code to TBB though. |
My wild guess is that OpenMP just has more predictable memory access patterns than Rayon's work-stealing pool, so memory that's first touched from a thread on one particular NUMA node will probably continue to be used only on that thread/node. |
Are we sure the vectors are not reallocated in the motivating example? I got bitten by that several times. Also, I'll be doing a bit of NUMA-aware scheduling in rayon-adaptive, but not right now. |
Re-titling the issue to be more general than just "NUMA", as nowadays even within a single NUMA node certain logical cores are better choices than others when deciding where to send work. |
Consider using the CPU Set API and/or the NUMA APIs on Windows. |
Has there been work going on regarding this issue? |
Since all idiomatic Rust bindings to hwloc are pretty incomplete, wrong in places (e.g. they assume every object has a CpuSet, which is not true of I/O objects), and abandoned, I've been forking hwloc2-rs into an hwlocality successor. It is now close-ish to being released; I just want to add a lot more test coverage, do a final general API review, and add more examples to the docs. It should be done in a couple of months. Once that is done, and assuming a sufficient spare time budget, I would like to use it to write a locality-aware work-stealing scheduler demonstrator that uses a pinned thread pool with a hierarchical work-stealing architecture: steal from your hyperthread, then from your L3 cache shard, then from your Die, then from your Package, then from anywhere in the Machine. The actual hierarchy would not be hardcoded like this, but automatically generated from the hwloc topology tree by translating each node with multiple children into a crossbeam deque (or something like it). I'd implement a join() entry point and a scheduling-overhead microbenchmark (most likely good old stupid recursive Fibonacci; see the sketch after this comment), then compare it to a simple flat thread pool (threads are pinned, but there is a single top-level crossbeam deque spanning all of them), and to rayon. This would allow me to answer the following questions:
Of course, the answer is likely to be hardware dependent, but thankfully I have access to a nice range of x86 hardware:
Then I could run another test which actually manipulates some data, to demonstrate the cache locality benefits. I need to think a bit about a good guinea pig task, but the general idea is that thread A should spawn a task that is best processed by thread A itself or another thread close to it, because the data is hot in thread A's cache and in the higher cache levels above thread A. Most likely some variation of map-reduce would do the trick? Assuming I manage to do all of this, it would provide good input to this discussion: how much benefit can we hope to get from adding NUMA-aware scheduling to rayon, and how much code complexity would it cost in return? |
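For reference, here is a minimal sketch of what the rayon baseline for such a scheduling-overhead microbenchmark could look like: recursive Fibonacci in which essentially all of the runtime goes through join() rather than into useful arithmetic (the input value 30 is just an arbitrary example):

```rust
use std::time::Instant;

// Scheduling-overhead microbenchmark: every recursion level goes through
// rayon::join(), so the scheduler, not the arithmetic, dominates the runtime.
fn fib(n: u64) -> u64 {
    if n < 2 {
        return n;
    }
    let (a, b) = rayon::join(|| fib(n - 1), || fib(n - 2));
    a + b
}

fn main() {
    let start = Instant::now();
    let result = fib(30);
    println!("fib(30) = {result}, computed in {:?}", start.elapsed());
}
```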
@HadrienG2, I have been following your work at https://github.com/HadrienG2/hwlocality. Any updates on progress? Thanks! |
Hi @Alnaimi-, Like all plans, the above plan needed a number of adjustments when it met the real world 😉 First, producing an hwlocality v1 that is up to the quality standards I want is taking longer than expected, for a number of reasons: other stuff competing for my work hours; people starting to use it (which is super cool) and reporting tons of issues with unusual build configurations (which is time-consuming to fix); and some features, like hwloc group object creation, being just unexpectedly hard to test because I actually need to reverse-engineer some undocumented parts of the hwloc API contract. For those who can't wait and don't mind running untested code that may have a few bugs and/or still get a few API adjustments in the future, my recommendation remains to use a cargo git dependency if you can, but a In parallel, I had some spare time at home at the end of last year and decided to use it to start exploring the thread pool problem without waiting for hwlocality to be 100% finished. The current product of this ongoing exploration is https://github.com/hadrienG2/viscose. Here are the rules of the game that I set for myself:
And here are some preliminary observations (I am short of time to do proper benchmark plots/stats).

First, at small CPU counts, the current rayon join implementation is very fast (a few µs per join from memory; I don't have the bench numbers at hand), and my desire not to regress with respect to this sets a strong constraint on how complex task scheduling algorithms can be. For example, I initially wanted to rigorously follow the tree-like organization of hardware in my scheduling algorithm (first try to distribute tasks within the current L3 cache shard, then if all workers there are busy try within the current NUMA node, then across the whole machine...), but in spite of hours of optimization work the locality benefits of this never recouped the cost of traversing a tree, so now I'm sticking with nearest-neighbor search on a carefully ordered 1D worker list, with a couple of tricks.

So far, my locality optimizations have mostly seemed to benefit smaller tasks, i.e. I've not seen a big difference in the performance of cache-bound jobs whose working set covers the full L3 cache, but the performance of jobs whose working set only covers a bit more than one L2 cache improves significantly. I suspect that's because for large enough jobs, each worker eventually ends up having a stable, sizeable work queue and doesn't need to exchange work with others anymore, so the only scheduler performance criterion that matters for larger jobs once they reach the steady state is work-queue push/pop performance, which is something that crossbeam does reasonably well.

Overall, I'm doing as well as or better than rayon for most problem sizes, but I still take quite a hit from locality issues in my current implementation (big throughput drop when going from one L3 cache shard to two on Zen 2). I've reached the conclusion that this happens because in a binary fork-join model, there is no single locality-aware work-stealing algorithm that is good both for (1) distributing work across workers at the start of a job and (2) rebalancing the workload once the job is running over all CPU cores.

To see why, recall that recursive join() generates big chunks of work in the first iterations and small chunks of work in the last iterations. At the beginning, the remote part of the first join() should contain 1/2 of the work, then the next join() recursion level spawns 1/4 of the work, the next one spawns 1/8 of the work, etc. Because low-locality interconnects (between different NUMA nodes, Zen 2 L3 cache shards...) are not as fast as higher-locality interconnects (between cores in an L3 cache shard), it is better to send a few big chunks of work across the lower-locality interconnects than many small chunks of work. So when you're initially spawning a job, you want the first join() iterations to send lots of work to very remote CPU cores, then the next join() iterations to send increasingly less work to increasingly closer CPU cores, until all CPU cores are busy.

Once all CPU cores are busy, your goal changes: now you have hopefully spread the work evenly across the CPU, and the goal is only to feed CPUs that have run out of work. The best way to feed them is for them to steal work from their nearest (by locality criterion) neighbor whenever needed, as this will be the most efficient form of communication, and if the workload is balanced well it shouldn't be a frequent event.
The way I hope to resolve this tension between sending work far away at the start of a job and exchanging work with nearest neighbors in the steady state is to switch to full-blown double-ended scheduling queues that, unlike crossbeam::deque, allow workers to give each other work, not just steal from each other. This will allow workers that handle new jobs to consciously give jobs to remote workers, while keeping the work-stealing logic local. I have written one such deque that seems to perform reasonably well, and the next step, when I find the time, is to implement the actual work-giving scheme, which will require some extra plumbing that I don't have yet. |
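To make the halving argument above concrete, here is a generic recursive fork-join sum (an illustrative sketch, not viscose's actual code; the threshold value is an arbitrary tuning knob). Walking down the recursion, the spawned halves shrink geometrically: 1/2 of the input at the first level, 1/4 at the second, 1/8 at the third. This is why the early, large splits are the ones worth shipping across slow interconnects, while the late, small ones should stay close to the thread that spawned them:

```rust
// Recursive binary fork-join sum: each join() spawns the right half of the
// current slice, so the spawned chunks halve in size at every recursion level.
fn sum(data: &[u64]) -> u64 {
    const SEQUENTIAL_THRESHOLD: usize = 4096; // arbitrary tuning knob
    if data.len() <= SEQUENTIAL_THRESHOLD {
        return data.iter().sum();
    }
    let (left, right) = data.split_at(data.len() / 2);
    // At the top level `right` is 1/2 of the input, one level down it is 1/4,
    // then 1/8, etc.: big chunks first, small chunks last.
    let (l, r) = rayon::join(|| sum(left), || sum(right));
    l + r
}

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    println!("{}", sum(&data));
}
```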
You are a hero. Thanks for the detailed response. Really excited for this. |
Thanks to @HadrienG2, I'm able to boost rayon performance by ~2x on an NVIDIA DGX-2, which uses 8 NUMA nodes, by binding each rayon thread to a physical CPU in the same NUMA node. Its implementation is very simple, like: https://github.com/ulagbulag/kiss-icp-rs/blob/cf6471e0091a41a2481f1400f8ea5aba1b0ce6ec/core/src/lib.rs#L101 |
I created a sample library that enables NUMA-aware thread placement for existing rayon projects. It boosts a rayon sum by ~20x on my NVIDIA DGX-2, even while using fewer CPU cores!
Project repository: https://github.com/ulagbulag/sas

Usage:

```rust
fn main() {
    // Just place this function call at the top of the main function.
    sas::init();

    // ... your heavy work
}
```
|
@kerryeon Do I understand the library correctly insofar as it relies on threads.into_par_iter().for_each(|idx| ...); running each closure on exactly one of the worker threads? If so, I am not sure this is properly guaranteed, and you may want to look into using rayon's broadcast functionality instead. |
Thanks @adamreichold! I found the bug and resolved it by using a scoped broadcast:

```rust
::rayon::scope(|s| {
    s.spawn_broadcast({
        move |_, ctx| {
            println!("{} => {:?}", ctx.index(), ::std::thread::current().id());
        }
    });
});
```
|
Hi all, just checking if there has been any progress on this. While the direction is unclear, I am exploring alternatives like Glommio. Unfortunately, Glommio tasks are executed on a single thread per core by design, which tends to favor I/O-bound work over CPU-bound work. It also means a significant amount of chunking/join work is required to parallelise the code. Given my need for a join-based, work-stealing scheduler that is locality-aware, Glommio seems less favorable. I understand that there was an effort to develop a NUMA-aware extension for Rayon by @HadrienG2. However, it appears that progress has stalled since January. Thanks! |
Yeah, I've been swamped with other things lately and it will be a while before I can get back to this. Meanwhile, you should perhaps check out @HoKim98's sas crate and see if it improves things for you already? |
Thanks for the update. I understand things get busy. I've been doing some tests with @HoKim98's SAS work over the past two days. My tests so far indicate that all tasks are currently scheduled exclusively onto a single, randomly selected node, meaning that on a 4-node, 192-core system only 48 cores are ever used. Interestingly, I've observed that the execution times with and without SAS appear to be similar. This is likely due to the overhead incurred from random access when using all 192 cores, whereas with only 48 cores the workload is confined to a single node, especially as the test includes allocating some large vectors. Other than that, I am not sure how to explain the observed behaviour. |
Status update: Sorry, I probably won't ever finish this. If anyone else wants to take over and explore this approach further, the prototype is at https://github.com/hadrienG2/viscose and I'm available to answer any question about how it works (just create an issue in the repo, it will ping my mailbox). |
Thinking back about this, I'm actually not sure if viscose even had the right design goal. As I made it, it aimed to minimize the cost of work and data migrations in the worst-case scenario where work does need to be stolen, by biasing the scheduler to prioritize "local" work/data migrations over "remote" ones. But instead I should probably have first and foremost aimed to get the number of migrations to zero in the best-case scenario where work does not need to be stolen. In the common HPC case of a perfectly balanced and "large enough" workload, I now think the first goal should really be to get to this zero-migration best-case target, which can be done as follows:
Then, to accommodate less ideal non-HPC workloads, we can start thinking about the complications of the real world:
TL;DR: Maybe extending rayon is not the optimal approach for NUMA, and a more data-oriented, framework-style approach that imposes more structure on the underlying application and requests more metadata from it would be better? |
It would certainly be better in the sense that providing a framework with more structure/constraints allows for better results. However, I do not think this means that improving the locality-awareness of a generic approach like Rayon cannot improve its results in a generic context. Don't let the perfect be the enemy of the good, especially of the better-than-the-status-quo. |
Currently rayon spawns as many threads as there are logical PUs on the machine and does nothing more, as far as I can tell. This is subpar. Instead, it should look at the hardware topology and make an informed decision about how many threads to spawn and where to schedule them. Even more so now that rayon-core will share the same pool of workers across all the different users of rayon.
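As a small illustration of one such topology-informed decision (a sketch assuming the `num_cpus` crate; the policy itself is just an example, not something prescribed by this issue), the pool could be sized to the number of physical cores instead of logical PUs:

```rust
// Build a pool with one worker per physical core, so SMT siblings are not
// oversubscribed by default.
fn build_pool() -> rayon::ThreadPool {
    let physical_cores = num_cpus::get_physical();
    rayon::ThreadPoolBuilder::new()
        .num_threads(physical_cores)
        .build()
        .expect("failed to build thread pool")
}

fn main() {
    let pool = build_pool();
    pool.install(|| {
        println!("running on {} worker threads", rayon::current_num_threads());
    });
}
```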