Significantly reduce the startup time #260
Conversation
… and pack/unpack operations. Performance analysis showed these two operations consumed over 60% of the time spent starting a local process. The changes reduced the launch time from 120s to 40s at 8k nodes. Thanks to @hjelmn for the contribution! Signed-off-by: Ralph Castain <[email protected]>
Actually, just the locking alone did that reduction. I will have the relative improvement for the removal of the pack/unpack shortly. Will run up to 8192 nodes and post the speedup later today.
@hjelmn impressive results! What was your testing configuration? Direct modex or full modex? And if direct modex, were there any actual exchanges?
Since it imposes such high overhead, I think we can try to do the locking in a different way, maybe using a meta shmem segment.
@artpol84 I'm not sure we need locking even under direct modex. The existing data isn't altered, and nobody can look at the new data until the daemon notifies them of its presence. So I'm not sure I see a race issue here. FWIW: the test conditions didn't involve any data exchange as @hjelmn was running a /bin/true application.
@rhc54 We need to analyze the layout carefully. Here is one of the concerning cases:
I'm pretty sure this case will appear. |
We need to rework the locking mechanism instead of removing it completely.
I see that we can create a mutex in shared memory; might that be faster?
@artpol84 That would be the correct way to do it. File locking is terrible in a lot of cases, but using a pthread_mutex_t or implementing your own reader/writer lock with atomics should be very fast.
@artpol84 For the record: the nodes in use are Intel Knights Landing. For the traced runs I used vtune to trace all 4096 orteds in a 4096-node allocation. I discovered that each orted had about 2 minutes of busy time at that scale. Of that, 60s was the locks and about 40-50s was the extra pack/unpack. I was a little surprised by the latter. I will hopefully post the launch times with both fixes a little after 3pm MST. If you can get me a reworked lock before 10pm MST, I can run again with the new locking mechanism.
Huh, pthread_mutex_t isn't what we want. We want pthread_rwlock_t.
That's what I meant. All pthread locks, mutexes and rwlocks alike, should behave similarly, I guess. UPD: similar in terms of shared memory support.
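For illustration, a minimal sketch of the idea being discussed, a process-shared rwlock living in a shm segment, might look like the following (the segment name and the single-lock layout are invented for this example; this is not the actual PMIx dstore code):

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_NAME "/pmix-lock-sketch"   /* hypothetical segment name */

int main(void)
{
    /* The server would create and initialize the segment once;
     * clients would only shm_open() and mmap() it. */
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, sizeof(pthread_rwlock_t)) < 0) {
        perror("shm");
        return 1;
    }
    pthread_rwlock_t *lock = mmap(NULL, sizeof(*lock),
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (lock == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* PTHREAD_PROCESS_SHARED is what makes the lock usable across
     * processes rather than just threads; done once by the creator. */
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_rwlock_init(lock, &attr);

    /* Readers (clients fetching keys) take the read side; the server
     * would take pthread_rwlock_wrlock() while publishing new data. */
    pthread_rwlock_rdlock(lock);
    /* ... read from the data segment here ... */
    pthread_rwlock_unlock(lock);

    munmap(lock, sizeof(*lock));
    close(fd);
    shm_unlink(SEG_NAME);
    return 0;
}
```

(Compile with `-lpthread -lrt` on Linux.)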
This was a launch of /bin/true using mpirun. So the only data exchange was among the orteds themselves.
I will try to provide the new mechanism by the deadline you gave.
@artpol84 Thanks a lot!
Regarding overpacking, yeah, I was surprised as well.
@hjelmn out of curiosity, how were you using vtune to track the orteds? using
Hmmm, didn't know about that parameter. I just modified plm/alps to add my own script before the orted command. That parameter would probably work too. This is my script:

```bash
#!/bin/bash
# Wrap the real command (passed as the arguments) in a vtune hotspots
# collection; one result directory and log file per host.
{
    time amplxe-cl -collect hotspots -r daemon-vtune/$(hostname).result "$@"
} &> daemon-vtune/daemon-vtune.$(hostname).out
```
Ok, that's what I would do too. But when you mentioned it, I decided to check whether there is a legacy mechanism for that.
As usual in OMPI :)
Why not use the existing pipe for synchronization and the shared memory for the data itself? And I think what you describe here is indeed either an rwlock or a condition variable.
What do you mean by "existing pipe"? If you mean the usock/tcp communication channel between server and client, then we would like to avoid communicating with the server as much as possible, to reduce its involvement in the key-fetch procedure.
Again, I meant an rwlock when I mentioned a mutex. I just brought up the point that we can use in-memory locking instead of file locking.
I'm working on the proof of concept now.
I made some proof-of-concept / performance-evaluation code available here: The POC code has 3 "modules":
@hjelmn If you have access to one node, could you check how it works for you? Then we can discuss/update it in case it's not accurate. On my laptop I'm getting the following results:
So pthread is indeed faster and works correctly. However, I can't reproduce the significant overhead you observed. We have some KNL nodes, so I'll probably go there and try in an hour or two. Will report the results back.
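For anyone wanting to reproduce this kind of comparison without the full POC, a simplified timing loop in the same spirit might look like this (a sketch only, not the actual POC code):

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/file.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    enum { ITERS = 100000 };
    int fd = open("/tmp/lock-bench", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }
    pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;

    /* Uncontended shared file lock: the mechanism being replaced. */
    double t = now_sec();
    for (int i = 0; i < ITERS; i++) {
        flock(fd, LOCK_SH);
        flock(fd, LOCK_UN);
    }
    printf("flock : %.3f us/iter\n", (now_sec() - t) / ITERS * 1e6);

    /* Uncontended pthread read lock: the proposed replacement. */
    t = now_sec();
    for (int i = 0; i < ITERS; i++) {
        pthread_rwlock_rdlock(&rw);
        pthread_rwlock_unlock(&rw);
    }
    printf("rwlock: %.3f us/iter\n", (now_sec() - t) / ITERS * 1e6);

    close(fd);
    unlink("/tmp/lock-bench");
    return 0;
}
```

On a typical Linux box, the rwlock path should come out noticeably cheaper per iteration than the flock path, which matches the direction of the numbers discussed in this thread.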
So, with the entirety of this commit the launch time fell to 27s. Still some overhead, but now it looks like it is more on the orted/mpirun interaction side.
@artpol84 I will test with your lock fixes in a little bit. If the pack/unpack change isn't already part of it, I will apply it as well to see how it compares to no locking.
I'm closing this PR as better solutions are underway. I'd like to get the locking and pack/unpack issues resolved soon so we can update the reference server and OMPI master. |
Restored part of PR openpmix#260. Thanks to @hjelmn for the contribution! Signed-off-by: Boris Karasev <[email protected]>
@artpol84 @karasevb @rhc54 There seem to be some action items here for improvements. Will one of you open an Issue or a cherry-picked PR for those items so we can track them? I think the items are:
I'd like to see those items addressed before
@ggouaillardet could you port #263 to v1.2?
Update on where we are at:
@jjhursey I'm working on
Added alternative locking via `pthread`, using a shared memory region for the lock. The code path of the old `flock`-based locking has been kept so its performance can be evaluated against the `pthread` locking, and for other purposes. To change the locking type, enable the corresponding code path via the macros `#define PTHREAD_LOCK_ENABLE 1` and `#define FCNTL_LOCK_ENABLE 1`; when both are enabled, the `pthread` locking takes priority. Thanks to @artpol84 for this proof of concept.
- The results of the `pthread` locking are here: openpmix#260 (comment)
- Code base of the concept: https://github.com/artpol84/poc/tree/master/benchmarks/shmem_locking
Signed-off-by: Boris Karasev <[email protected]>
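A hypothetical sketch of how such a compile-time switch could be wired up (the two macro names come from the commit message; the wrapper functions and the `dstore_lock_t` struct are invented for the example, not taken from the real dstore code):

```c
#include <pthread.h>
#include <sys/file.h>

#define PTHREAD_LOCK_ENABLE 1
#define FCNTL_LOCK_ENABLE   1

typedef struct {
    pthread_rwlock_t rwlock;  /* would live in the shared memory segment */
    int lockfd;               /* backing file descriptor for the flock path */
} dstore_lock_t;              /* illustrative name, not the real struct */

#if PTHREAD_LOCK_ENABLE       /* pthread path takes priority when both are on */
static void dstore_rdlock(dstore_lock_t *l) { pthread_rwlock_rdlock(&l->rwlock); }
static void dstore_unlock(dstore_lock_t *l) { pthread_rwlock_unlock(&l->rwlock); }
#elif FCNTL_LOCK_ENABLE       /* fall back to the old file-lock path */
static void dstore_rdlock(dstore_lock_t *l) { flock(l->lockfd, LOCK_SH); }
static void dstore_unlock(dstore_lock_t *l) { flock(l->lockfd, LOCK_UN); }
#endif

int main(void)
{
    dstore_lock_t l = { .rwlock = PTHREAD_RWLOCK_INITIALIZER, .lockfd = -1 };
    dstore_rdlock(&l);   /* reader-side critical section */
    dstore_unlock(&l);
    return 0;
}
```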
Significantly reduce the startup time by removing unnecessary locking and pack/unpack operations. Performance analysis showed these two operations consumed over 60% of the time spent starting a local process. The changes reduced the launch time from 120s to 40s at 8k nodes.
Thanks to @hjelmn for the contribution!
Signed-off-by: Ralph Castain <[email protected]>