Significantly reduce the startup time #260
Conversation
… and pack/unpack operations. Performance analysis showed these two operations consumed over 60% of the time spent starting a local process. The changes reduced the launch time from 120s to 40s at 8k nodes. Thanks to @hjelmn for the contribution! Signed-off-by: Ralph Castain <[email protected]>
Actually, just the locking alone did that reduction. I will have the relative improvement for the removal of the pack/unpack shortly. Will run up to 8192 nodes and post the speedup later today.
@hjelmn impressive results! What was your testing configuration? Direct modex or full modex? And if direct modex, were there any actual exchanges?
Since it imposes such high overhead, I think we can try to do the locking in a different way, maybe using a meta shmem segment.
@artpol84 I'm not sure we need locking even under direct modex. The existing data isn't altered, and nobody can look at the new data until the daemon notifies them of its presence. So I'm not sure I see a race issue here. FWIW: the test conditions didn't involve any data exchange as @hjelmn was running a /bin/true application.
@rhc54 We need to analyze the layout carefully. Here is one of the concerning cases:
I'm pretty sure this case will appear. |
We need to rework the locking mechanism instead of removing it completely.
I see that we can create a mutex in shared memory; might that be faster?
@artpol84 That would be the correct way to do it. File locking is terrible in a lot of cases, but using a pthread_mutex_t or implementing your own reader/writer lock with atomics should be very fast.
@artpol84 For the record: the nodes in use are Intel Knights Landing. For the traced runs I used vtune to trace all 4096 orteds in a 4096-node allocation. I discovered that each orted had about 2 minutes of busy time at that scale. Of that, 60s was the locks and about 40-50s was the extra pack/unpack. I was a little surprised by the latter. I will hopefully post the launch times with both fixes a little after 3pm MST. If you can get me a reworked lock before 10pm MST, I can run again with the new locking mechanism.
Huh, pthread_mutex_t isn't what we want. We want pthread_rwlock_t.
That's what I meant. All pthread locks, mutexes and rwlocks alike, should behave similarly, I guess. UPD: similar in terms of shared memory support.
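For illustration, a minimal sketch of the idea being discussed, a process-shared rwlock living in a shm segment, might look like the following (the segment name and the single-lock layout are invented for this example; this is not the actual PMIx dstore code):

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_NAME "/pmix-lock-sketch"   /* hypothetical segment name */

int main(void)
{
    /* The server would create and initialize the segment once;
     * clients would only shm_open() and mmap() it. */
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, sizeof(pthread_rwlock_t)) < 0) {
        perror("shm");
        return 1;
    }
    pthread_rwlock_t *lock = mmap(NULL, sizeof(*lock),
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (lock == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* PTHREAD_PROCESS_SHARED is what makes the lock usable across
     * processes rather than just threads; done once by the creator. */
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_rwlock_init(lock, &attr);

    /* Readers (clients fetching keys) take the read side; the server
     * would take pthread_rwlock_wrlock() while publishing new data. */
    pthread_rwlock_rdlock(lock);
    /* ... read from the data segment here ... */
    pthread_rwlock_unlock(lock);

    munmap(lock, sizeof(*lock));
    close(fd);
    shm_unlink(SEG_NAME);
    return 0;
}
```

(Compile with `-lpthread -lrt` on Linux.)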
This was a launch of /bin/true using mpirun. So the only data exchange was among the orteds themselves.
I will try to provide the new mechanism by the deadline you gave.
@artpol84 Thanks a lot!
Regarding overpacking, yeah, I was surprised as well.
@hjelmn out of curiosity, how were you using vtune to track the orteds? using
Hmmm, didn't know about that parameter. I just modified plm/alps to add my own script before the orted command. That parameter would probably work too. This is my script:

```bash
#!/bin/bash
# Wrap the real command (passed as the arguments) in a vtune hotspots
# collection; one result directory and log file per host.
{
    time amplxe-cl -collect hotspots -r daemon-vtune/$(hostname).result "$@"
} &> daemon-vtune/daemon-vtune.$(hostname).out
```
Ok, that's what I would do too. But when you mentioned it, I decided to check whether there is a legacy mechanism for that.
As usual in OMPI :)
Why not use the existing pipe for synchronization and the shared memory for the data itself? And I think what you describe here is indeed either an rwlock or a condition variable.
What do you mean by "existing pipe"? If you mean the usock/tcp communication channel between server and client, then we would like to avoid communicating with the server as much as possible, to reduce its involvement in the key-fetch procedure.
Again, I meant an rwlock when I mentioned a mutex. I just brought up the point that we can use in-memory locking instead of file locking.
I'm working on the proof of concept now.
I made some proof-of-concept / performance-evaluation code available here: The POC code has 3 "modules":
@hjelmn If you have access to one node, could you check how it works for you? Then we can discuss/update it in case it's not accurate. On my laptop I'm getting the following results:
So pthread is indeed faster and works correctly. However, I can't reproduce the significant overhead you observed. We have some KNL nodes, so I'll probably go there and try in an hour or two. Will report the results back.
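For anyone wanting to reproduce this kind of comparison without the full POC, a simplified timing loop in the same spirit might look like this (a sketch only, not the actual POC code):

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/file.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    enum { ITERS = 100000 };
    int fd = open("/tmp/lock-bench", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }
    pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;

    /* Uncontended shared file lock: the mechanism being replaced. */
    double t = now_sec();
    for (int i = 0; i < ITERS; i++) {
        flock(fd, LOCK_SH);
        flock(fd, LOCK_UN);
    }
    printf("flock : %.3f us/iter\n", (now_sec() - t) / ITERS * 1e6);

    /* Uncontended pthread read lock: the proposed replacement. */
    t = now_sec();
    for (int i = 0; i < ITERS; i++) {
        pthread_rwlock_rdlock(&rw);
        pthread_rwlock_unlock(&rw);
    }
    printf("rwlock: %.3f us/iter\n", (now_sec() - t) / ITERS * 1e6);

    close(fd);
    unlink("/tmp/lock-bench");
    return 0;
}
```

On a typical Linux box, the rwlock path should come out noticeably cheaper per iteration than the flock path, which matches the direction of the numbers discussed in this thread.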
So, with the entirety of this commit the launch time fell to 27s. Still some overhead, but now it looks like it is more on the orted/mpirun interaction side.
@artpol84 I will test with your lock fixes in a little bit. If the pack/unpack change isn't already part of it, I will apply it as well to see how it compares to no locking.
I'm closing this PR as better solutions are underway. I'd like to get the locking and pack/unpack issues resolved soon so we can update the reference server and OMPI master. |
Restored part of PR openpmix#260. Thanks to @hjelmn for the contribution! Signed-off-by: Boris Karasev <[email protected]>
@artpol84 @karasevb @rhc54 There seem to be some action items here for improvements. Will one of you open an Issue or a cherry-picked PR for those items so we can track them? I think the items are:
I'd like to see those items addressed before
@ggouaillardet could you port #263 to v1.2?
Update on where we are at:
@jjhursey I'm working on
Added alternative locking via `pthread`, using a shared memory region for the lock. The code path of the old `flock`-based locking has been kept so its performance can be evaluated against the `pthread` locking, and for other purposes. To change the locking type, enable the corresponding code path via the macros `#define PTHREAD_LOCK_ENABLE 1` and `#define FCNTL_LOCK_ENABLE 1`; when both are enabled, the `pthread` locking takes priority. Thanks to @artpol84 for this proof of concept.
- The results of the `pthread` locking are here: openpmix#260 (comment)
- Code base of the concept: https://github.com/artpol84/poc/tree/master/benchmarks/shmem_locking
Signed-off-by: Boris Karasev <[email protected]>
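A hypothetical sketch of how such a compile-time switch could be wired up (the two macro names come from the commit message; the wrapper functions and the `dstore_lock_t` struct are invented for the example, not taken from the real dstore code):

```c
#include <pthread.h>
#include <sys/file.h>

#define PTHREAD_LOCK_ENABLE 1
#define FCNTL_LOCK_ENABLE   1

typedef struct {
    pthread_rwlock_t rwlock;  /* would live in the shared memory segment */
    int lockfd;               /* backing file descriptor for the flock path */
} dstore_lock_t;              /* illustrative name, not the real struct */

#if PTHREAD_LOCK_ENABLE       /* pthread path takes priority when both are on */
static void dstore_rdlock(dstore_lock_t *l) { pthread_rwlock_rdlock(&l->rwlock); }
static void dstore_unlock(dstore_lock_t *l) { pthread_rwlock_unlock(&l->rwlock); }
#elif FCNTL_LOCK_ENABLE       /* fall back to the old file-lock path */
static void dstore_rdlock(dstore_lock_t *l) { flock(l->lockfd, LOCK_SH); }
static void dstore_unlock(dstore_lock_t *l) { flock(l->lockfd, LOCK_UN); }
#endif

int main(void)
{
    dstore_lock_t l = { .rwlock = PTHREAD_RWLOCK_INITIALIZER, .lockfd = -1 };
    dstore_rdlock(&l);   /* reader-side critical section */
    dstore_unlock(&l);
    return 0;
}
```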
Significantly reduce the startup time by removing unnecessary locking and pack/unpack operations. Performance analysis showed these two operations consumed over 60% of the time spent starting a local process. The changes reduced the launch time from 120s to 40s at 8k nodes.
Thanks to @hjelmn for the contribution!
Signed-off-by: Ralph Castain <[email protected]>