
Significantly reduce the startup time #260

Closed

Conversation

rhc54 (Contributor) commented Jan 19, 2017

Significantly reduce the startup time by removing unnecessary locking and pack/unpack operations. Performance analysis showed these two operations consumed over 60% of the time spent starting a local process. The changes reduced the launch time from 120s to 40s at 8k nodes.

Thanks to @hjelmn for the contribution!

Signed-off-by: Ralph Castain <[email protected]>
hjelmn (Contributor) commented Jan 19, 2017

Actually, just the locking alone did that reduction. I will have the relative improvement for the removal of the pack/unpack shortly. Will run up to 8192 nodes and post the speedup later today.

artpol84 (Contributor):

@hjelmn impressive results!
As I understand it, we need locking in the case of direct modex. There, keys will appear from time to time, and the database will sometimes be in an inconsistent state while the PMIx daemon is making changes.

What was your testing configuration? Direct modex or full modex? And if direct modex, were there any actual exchanges?

artpol84 (Contributor) commented Jan 19, 2017

Since it imposes such high overhead, I think we can try to do the locking in a different way, maybe using a meta shmem segment.

rhc54 (Contributor, Author) commented Jan 19, 2017

@artpol84 I'm not sure we need locking even under direct modex. The existing data isn't altered, and nobody can look at the new data until the daemon notifies them of its presence. So I'm not sure I see a race issue here.

FWIW: the test conditions didn't involve any data exchange as @hjelmn was running a /bin/true application.

artpol84 (Contributor):

@rhc54 We need to analyze the layout carefully. Here is one of the concerning cases:

  • one proc requests a key from rank X that is not in the DB yet;
  • the server fetches rank X's blob and starts changing the DB;
  • meanwhile, another process starts accessing the data for rank X and hits an incomplete DB state.

I'm pretty sure this case will appear.
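
For illustration, here is the kind of torn read being described, assuming a hypothetical DB layout (not the actual dstore structures):

#include <string.h>

/* Hypothetical shared DB layout, for illustration only. The server
 * writes a blob and then bumps the published count; a concurrent
 * reader that polls `nblobs` can index into a blob that is still
 * being written, i.e. an incomplete DB state. */
typedef struct {
    volatile size_t nblobs;      /* number of published entries */
    char blobs[256][64];         /* per-rank key/value blobs    */
} shared_db_t;

/* server side (writer) */
void publish_blob(shared_db_t *db, size_t idx, const char *blob, size_t len)
{
    memcpy(db->blobs[idx], blob, len);   /* step 1: write the data   */
    db->nblobs = idx + 1;                /* step 2: make it visible  */
    /* Without a barrier between steps 1 and 2 (and a matching read
     * barrier on the client), a reader may observe step 2 before
     * step 1 has completed. A lock makes the pair atomic. */
}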

artpol84 (Contributor) left a review comment:

We need to rework the locking mechanism instead of removing it completely.

artpol84 (Contributor):

I see that we can create a mutex in shared memory; might that be faster?

hjelmn (Contributor) commented Jan 19, 2017

@artpol84 That would be the correct way to do it. File locking is terrible in a lot of cases but doing a pthread_mutex_t or implementing your own reader/writer lock using atomics should be very fast.

hjelmn (Contributor) commented Jan 19, 2017

@artpol84 For the record. The nodes in use are Intel Knights Landing. For the traced runs I used vtune to trace all 4096 orteds in a 4096 node allocation. I discovered that each orted had about 2 minutes of busy time at that scale. Of that, 60s was the locks and about 40-50s was the extra pack/unpack. I was a little surprised about the latter one. I will hopefully post the launch times with both fixes a little after 3pm MST.

If you can get me a reworked lock before 10pm MST I can run again with the new locking mechanism.

hjelmn (Contributor) commented Jan 19, 2017

Huh, pthread_mutex_t isn't what we want. We want pthread_rwlock_t.
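
A minimal sketch of what a process-shared pthread_rwlock_t in a shmem segment could look like (illustrative only, assuming a POSIX shm segment; the segment name and helper are not from the PMIx code base):

#include <pthread.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* Illustrative sketch: place a pthread_rwlock_t in a POSIX shm segment
 * so the server (writer) and clients (readers) can share it.
 * Error handling trimmed for brevity. */
static pthread_rwlock_t *dstore_lock_init(const char *seg_name, int is_server)
{
    int fd = shm_open(seg_name, is_server ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (fd < 0) return NULL;
    if (is_server && ftruncate(fd, sizeof(pthread_rwlock_t)) != 0) return NULL;

    pthread_rwlock_t *lock = mmap(NULL, sizeof(*lock),
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (lock == MAP_FAILED) return NULL;

    if (is_server) {
        /* PTHREAD_PROCESS_SHARED is what makes the lock usable across
         * distinct processes, not just threads of one process. */
        pthread_rwlockattr_t attr;
        pthread_rwlockattr_init(&attr);
        pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_rwlock_init(lock, &attr);
        pthread_rwlockattr_destroy(&attr);
    }
    return lock;
}

/* Server: pthread_rwlock_wrlock(lock) around DB updates.
 * Clients: pthread_rwlock_rdlock(lock) around key fetches. */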

artpol84 (Contributor) commented Jan 19, 2017

That's what I meant: all pthread locks, mutexes and rwlocks alike, should work similarly here, I'd guess.

UPD: similar in terms of shared-memory support.

hjelmn (Contributor) commented Jan 19, 2017

This was a launch of /bin/true using mpirun, so the only data exchange was among the orteds themselves.

artpol84 (Contributor):

I will try to provide a new mechanism by the deadline you gave.

hjelmn (Contributor) commented Jan 19, 2017

@artpol84 Thanks a lot!

artpol84 (Contributor):

Regarding the overpacking: yeah, I was surprised as well.

artpol84 (Contributor) commented Jan 19, 2017

@hjelmn out of curiosity, how were you using vtune to trace the orteds? Via the --launch-agent mpirun parameter?

hjelmn (Contributor) commented Jan 19, 2017

Hmmm, didn't know about that parameter. I just modified plm/alps to add my own script before the orted command. That parameter would probably work too.

This is my script:

#!/bin/bash
# Wrap the orted command line ("$@") in a vtune hotspot collection,
# writing one result set and one log file per node.
{
    time amplxe-cl -collect hotspots -r daemon-vtune/$(hostname).result "$@"
} &> daemon-vtune/daemon-vtune.$(hostname).out

artpol84 (Contributor):

OK, that's what I would do too. But when you mentioned it, I decided to check whether a mechanism for that already existed.

artpol84 (Contributor):

As usual in OMPI :)

bosilca (Contributor) commented Jan 19, 2017

Why not use the existing pipe for synchronization and the shared memory for the data itself? And I think that what you describe here is indeed either a rwlock or a condition.

artpol84 (Contributor):

What do you mean by "existing pipe"? If you mean the usock/tcp communication channel between server and client, then we would like to avoid communicating with the server as much as possible, to reduce its involvement in the key-fetch procedure.

artpol84 (Contributor):

Again, I did mean rwlock when I mentioned the mutex. I just brought up the point that we can use memory locking instead of file locking.

artpol84 (Contributor):

I'm working on the proof of concept now.

artpol84 (Contributor):

I made some proof-of-concept/perf-eval code available here:
https://github.com/artpol84/poc/tree/master/benchmarks/shmem_locking
You need to specify the path to your mpicc in the Makefile and run make.

The POC code has 3 "modules":

  • dummy, which has no locking and demonstrates that the verification code catches locking issues;
  • flock, the current dstore locking mechanism (see the sketch after this list);
  • pthread, a pthread rwlock-based mechanism.
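
For context, the flock module boils down to something like the following (a simplified sketch of file-lock based exclusion, not the actual dstore code; the names are illustrative):

#include <sys/file.h>
#include <fcntl.h>

/* Simplified sketch of the file-lock approach the flock module
 * exercises. Every lock/unlock is a system call against the lock
 * file, which is where the overhead comes from. */
static int lockfd = -1;

static int lock_open(const char *path)
{
    lockfd = open(path, O_CREAT | O_RDWR, 0600);
    return (lockfd < 0) ? -1 : 0;
}

static void wr_lock(void) { flock(lockfd, LOCK_EX); }  /* writer: exclusive */
static void rd_lock(void) { flock(lockfd, LOCK_SH); }  /* readers: shared   */
static void unlock(void)  { flock(lockfd, LOCK_UN); }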

@hjelmn if you have access to a node, could you check how it works for you? Then we can discuss/update it in case it's not accurate.

On my laptop I'm getting the following results:

mpirun -np 4 ./test_flock
Data correctness verification
Step #10%
Step #20%
Step #30%
....
0: Writer-only: Time to do 10000 lock/unlocks = 0.017983
0: Writer/reader: Time to do 10000 lock/unlocks = 0.070491
1: Writer/reader: Time to do 10000 lock/unlocks = 0.034115
ll/bin/mpirun -np 4 ./test_pthread
Data correctness verification
Step #10%
Step #20%
...
0: Writer-only: Time to do 10000 lock/unlocks = 0.006891
0: Writer/reader: Time to do 10000 lock/unlocks = 0.016011
1: Writer/reader: Time to do 10000 lock/unlocks = 0.004020

So pthread is indeed faster and works correctly. However, I can't reproduce the significant overhead you observed. We have some KNL nodes, so I'll probably go there and try in an hour or two; I'll report the results back.
Feel free to open a PR against my POC repo if you have any suggestions on this benchmark.

hjelmn (Contributor) commented Jan 20, 2017

So, with the entirety of this commit the launch time fell to 27s. Still some overhead but now it looks like it is more on the orted/mpirun interaction side.

hjelmn (Contributor) commented Jan 20, 2017

@artpol84 I will test with your lock fixes in a little bit. If the pack/unpack change isn't already part of it I will apply it as well to see how it compares to no locking.

rhc54 (Contributor, Author) commented Jan 23, 2017

I'm closing this PR as better solutions are underway. I'd like to get the locking and pack/unpack issues resolved soon so we can update the reference server and OMPI master.

rhc54 closed this Jan 23, 2017
rhc54 deleted the topic/dst branch January 23, 2017 15:00
karasevb removed their request for review January 24, 2017 08:21
karasevb added a commit to karasevb/pmix that referenced this pull request Jan 24, 2017
Restored the part of PR openpmix#260

Thanks to @hjelmn for the contribution!

Signed-off-by: Boris Karasev <[email protected]>
jjhursey (Member):

@artpol84 @karasevb @rhc54 There seem to be some action items here for improvements. Will one of you open an Issue or a cherry-picked PR for those items so we can track them?

I think the items are the locking rework and the removal of the extra pack/unpack operations.

I'd like to see those items addressed before v1.2.1 if possible since they have scalability implications for Open MPI's v2.1 release.

artpol84 (Contributor) commented Jan 24, 2017

@ggouaillardet could you port #263 to v1.2?
@jjhursey, @karasevb will provide the PR for pthread locking ASAP (my expectation is by the end of this week).

jjhursey (Member):

Update on where we are at:

karasevb (Contributor):

@jjhursey I'm working on the pthread lock. It's in the debugging stage now.

karasevb added a commit to karasevb/pmix that referenced this pull request Jan 27, 2017

Added alternative locking by `pthread`, using a shared memory region for the lock. The code path of the old `flock`-based locking has been kept so its performance can be evaluated against the `pthread` locking, among other purposes.

To change the locking type, enable the corresponding code path via macros:
  #define PTHREAD_LOCK_ENABLE  1
  #define FCNTL_LOCK_ENABLE    1
The `pthread` locking takes priority.

Thanks to @artpol84 for this proof of concept.
- The results of locking by `pthread` here: openpmix#260 (comment)
- Code base of the concept: https://github.com/artpol84/poc/tree/master/benchmarks/shmem_locking

Signed-off-by: Boris Karasev <[email protected]>
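
As a rough illustration of the compile-time selection this commit message describes (a hypothetical sketch: only the macro names come from the message; the struct and function names are invented, not the actual patch):

#include <pthread.h>
#include <sys/file.h>

/* Hypothetical sketch of compile-time lock selection. When both
 * macros are enabled, the pthread path takes priority. */
#define PTHREAD_LOCK_ENABLE 1
#define FCNTL_LOCK_ENABLE   1

typedef struct {
#if PTHREAD_LOCK_ENABLE
    pthread_rwlock_t *rwlock;   /* lives in the shared-memory segment */
#else
    int lockfd;                 /* fd of the lock file */
#endif
} dstore_seg_t;

static void seg_wr_lock(dstore_seg_t *s)
{
#if PTHREAD_LOCK_ENABLE
    pthread_rwlock_wrlock(s->rwlock);   /* exclusive writer lock     */
#elif FCNTL_LOCK_ENABLE
    flock(s->lockfd, LOCK_EX);          /* file-based exclusive lock */
#endif
}

static void seg_unlock(dstore_seg_t *s)
{
#if PTHREAD_LOCK_ENABLE
    pthread_rwlock_unlock(s->rwlock);
#elif FCNTL_LOCK_ENABLE
    flock(s->lockfd, LOCK_UN);
#endif
}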