Fix some corner cases with ADAPT #8039
bot:aws:retest
All told, this is an improvement over what is on master -- we should probably merge it. But I still see some regressions with ADAPT compared to running without ADAPT. I ran all the IBM collective tests on 2 nodes, each with 8 procs. Here are the failures I see with ADAPT on master compared to just running without ADAPT on master:
For completeness, I also list the tests that pass with ADAPT but fail without it:
Which version of this PR did you try?
As of about an hour ago -- i.e., 506cda1.
@bosilca and I talked in Slack, and I discovered a bug in my test script: I was accidentally running with a much shorter timeout for ADAPT than for non-ADAPT. Fixing that bug (i.e., having a healthy/long timeout for both ADAPT and non-ADAPT) and re-running the test (including @devreal's latest ibcast commit), I get down to just the following errors with ADAPT compared to non-ADAPT:
Here's the stack trace from the
I neglected to mention:
bot:aws:retest
With the latest commit, the only failure left is ialltoallv_somezeros. In discussion with George, it looks like libnbc is the root cause of the failure here, not ADAPT. Hence, I think we should merge this PR.
bot:ibm:retest
Via testing on Cisco clusters, it looks good to me.
- Add support for fallback to previous coll module on non-commutative operations (#30)
- Replace mutexes by atomic operations.
- Use the correct nbc request type (for both ibcast and ireduce)
  * coll/base: document type casts in ompi_coll_base_retain_*
- add module-wide topology cache
- use standard instead of synchronous send and add mca parameter to control mode of initial send in ireduce/ibcast
- reduce number of memory allocations
- call the default request completion.
- Remove the requests from the Fortran lookup conversion tables before completing and free it.

Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Joseph Schuchart <[email protected]>
Co-authored-by: Joseph Schuchart <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
Add support for fallback to previous coll module on non-commutative operations
Replace mutexes by atomic operations.
Use the correct nbc request type
Other minor fixes.
After merging, this should be added to #7944 for the 4.1 branch.
Signed-off-by: George Bosilca [email protected]
Signed-off-by: Joseph Schuchart [email protected]
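For readers unfamiliar with the "Replace mutexes by atomic operations" item, here is a minimal sketch of the general technique, not the code from this PR. It assumes a per-operation counter of segments still in flight; `op_state_t`, `op_state_init`, and `op_segment_done` are hypothetical names invented for illustration, and plain C11 `<stdatomic.h>` is used rather than Open MPI's own atomic wrappers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-operation state; NOT the actual ADAPT data structure. */
typedef struct {
    atomic_int pending;   /* number of segments still in flight */
} op_state_t;

/* Set the number of segments that must complete before the operation is done. */
static void op_state_init(op_state_t *op, int num_segments)
{
    atomic_init(&op->pending, num_segments);
}

/*
 * Called once from each segment-completion callback.
 * Returns true for exactly one caller: the one that retires the last segment.
 * atomic_fetch_sub returns the value held *before* the subtraction, so no
 * lock is needed to decide which caller finalizes the operation.
 */
static bool op_segment_done(op_state_t *op)
{
    return atomic_fetch_sub(&op->pending, 1) == 1;
}
```

With a mutex, every callback would lock, decrement, test, and unlock. With the atomic counter, the decrement and the "was I last?" test are a single operation, so only the caller that sees `true` goes on to complete the request.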