Performance regression: 2.0.x branch degradation over 1.10 for MTLs #2644
Comments
Some of the tests done:
|
@matcabral I noted you configure with |
I would assume this is dependent on #2448 |
@rhc54 @matcabral (cross reference) #2448 has been merged; it would be good to see how it affected the performance of the PSM / OFI MTLs. |
I'm not seeing any noticeable difference in performance with the commits from #2448 applied to 2.0.x when using PSM2. |
This is extremely puzzling as this commit removes a global lock from the
critical path for all multithreaded builds.
|
(Catching up after vacations) |
May be fixed by #2840. |
Yes indeed!! The first round of testing shows performance on par with 1.10.x. Will run more tests tomorrow. 😄 |
So, confirmed: with #2840 there is still a degradation of around 3% to 4% in latency for smaller messages. Adding the patch for #2448 as well brings the degradation down to 1%. Given that I have not run enough repetitions to get a reasonable average, this 1% may just be noise. In summary, #2840 and #2448 together fix the problem. |
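One rough way to judge whether a residual difference of about 1% is noise is to repeat each run several times and compare the gap between the means with the run-to-run spread. The sketch below is only an illustration: the function names and the sample latencies are hypothetical, and the check is a crude heuristic, not a proper statistical test.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) for a list of measurements."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, stdev


def difference_is_noise(baseline_runs, candidate_runs):
    """Crude check: True when the gap between the two means is no larger
    than the combined run-to-run spread of both sets of runs."""
    base_mean, base_sd = summarize(baseline_runs)
    cand_mean, cand_sd = summarize(candidate_runs)
    return abs(cand_mean - base_mean) <= base_sd + cand_sd


if __name__ == "__main__":
    # Hypothetical small-message latencies in microseconds (not measured data).
    runs_1_10 = [1.00, 1.02, 0.98]
    runs_2_0_x = [1.01, 1.03, 0.99]
    print(difference_is_noise(runs_1_10, runs_2_0_x))
```

With more repetitions per message size, the spread estimate becomes more trustworthy and a real 1% shift is easier to tell apart from jitter.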
Question (sorry if I'm missing something): I see this issue closed for milestone v2.0.3, but PR #2448 was not merged into the 2.0.x branch. Comments? |
I'm reopening until #2448 gets into 2.0.x |
@matcabral Feel free to make PRs to merge this into v2.0.x and v2.x. (frankly, I thought this was already there...?) |
@matcabral Not sure what you mean:
What I'm hearing you say is that there has been no v2.0.x or v2.x PR for the code that was in #2448 (i.e., the PR to master). Is that correct? |
OK, I think I now get your suggestion. I'll open PRs for #2448 against v2.x and v2.0.x |
@matcabral Cool, thanks. |
This is a placeholder for fixing the performance regressions seen on the 2.0.x branch relative to 1.10 that impact MTLs (tested with OFI and PSM2). The degradation mostly affects latency for small message sizes, with some impact on bandwidth.
Building with:
The tests below assume the same system setup, changing only OMPI 1.10 for 2.0.x.
Two ranks on different nodes running osu_latency over PSM2.
Size (bytes)  Difference vs. 1.10
1 -10%
2 -10%
4 -12%
8 -12%
16 -10%
32 -11%
64 -8%
128 -10%
256 -12%
512 -10%
1024 -12%
2048 -5%
4096 -18%
8192 0%
16384 -6%
32768 -4%
65536 -5%
131072 -3%
262144 -2%
524288 8%
1048576 0%
2097152 0%
4194304 18%
Two ranks on the same node running osu_latency over PSM2.
Size (bytes)  Difference vs. 1.10
1 -19%
2 -16%
4 -16%
8 -16%
16 -19%
32 -4%
64 -6%
128 -6%
256 -4%
512 -6%
1024 -14%
2048 -31%
4096 -1%
8192 -8%
16384 -24%
32768 -18%
65536 -12%
131072 -6%
262144 -8%
524288 -5%
Two ranks on different nodes running osu_bw over PSM2.
Size (bytes)  Difference vs. 1.10
1 -7%
2 -9%
4 -7%
8 -7%
16 -5%
32 -4%
64 -6%
128 -2%
256 -5%
512 -7%
1024 -7%
2048 -5%
4096 -2%
8192 -2%
16384 -1%
32768 2%
65536 0%
131072 1%
262144 0%
524288 0%
1048576 0%
2097152 0%
4194304 0%
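Per-size percentages like the ones above could be derived from two saved OSU benchmark outputs along the following lines. This is a minimal sketch, not the script actually used for this issue: the function names are hypothetical, and the sign convention (negative means 2.0.x is worse) is an assumption chosen to match how the tables read.

```python
def parse_osu(text):
    """Parse OSU benchmark output into {size_bytes: value}.
    Lines starting with '#' are treated as headers and skipped."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        results[int(fields[0])] = float(fields[1])
    return results


def percent_change(baseline, candidate, higher_is_better):
    """Signed percent difference per message size.
    Assumed convention: a negative value means the candidate is worse."""
    changes = {}
    for size in sorted(baseline.keys() & candidate.keys()):
        old, new = baseline[size], candidate[size]
        if higher_is_better:
            # Bandwidth-style metric: a lower new value is a regression.
            changes[size] = (new - old) / old * 100.0
        else:
            # Latency-style metric: a higher new value is a regression.
            changes[size] = (old - new) / old * 100.0
    return changes


if __name__ == "__main__":
    # Hypothetical latency outputs in microseconds (not measured data).
    old_out = "# Size  Latency (us)\n1 1.00\n2 2.00\n"
    new_out = "# Size  Latency (us)\n1 1.10\n2 2.00\n"
    diffs = percent_change(parse_osu(old_out), parse_osu(new_out),
                           higher_is_better=False)
    for size, pct in diffs.items():
        print(f"{size} {pct:.0f}%")
```

For osu_bw output, the same helper would be called with `higher_is_better=True`, so that a drop in bandwidth also shows up as a negative number.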