Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some corner cases with ADAPT #8039

Merged
merged 2 commits into from
Sep 18, 2020
Merged

Fix some corner cases with ADAPT #8039

merged 2 commits into from
Sep 18, 2020

Conversation

bosilca
Copy link
Member

@bosilca bosilca commented Sep 9, 2020

Add support for fallback to previous coll module on non-commutative operations
Replace mutexes by atomic operations.
Use the correct nbc request type

  • coll/base: document type casts in ompi_coll_base_retain_*
    Other minor fixes.

After merging should be added to #7944 for the 4.1 branch

Signed-off-by: George Bosilca [email protected]
Signed-off-by: Joseph Schuchart [email protected]

@jsquyres
Copy link
Member

bot:aws:retest

@jsquyres
Copy link
Member

jsquyres commented Sep 14, 2020

All told, this is an improvement over what is on master -- we should probably merge it. But I still see some regressions with ADAPT compared to running without ADAPT. I ran all the IBM collective tests in 2 nodes, each with 8 procs. Here's the fails I see with ADAPT on master compared just running without ADAPT on master:

  • reduce_nocommute_inter
  • reduce_scatter_nocommute_stride
  • reduce_scatter_nocommute_stride_in_place
  • iallreduce_nocommute_stride_in_place
  • ialltoallv_somezeros
  • ibcast_struct
  • ibcast_f
  • ibcast_f90
  • ibcast_f08
  • bcast_init

For completeness, I also mention the tests where running with ADAPT caused them to pass (whereas running without ADAPT caused them to fail):

  • iallgatherv_inter
  • scatterv

@bosilca
Copy link
Member Author

bosilca commented Sep 14, 2020

Which version of this PR did you try ?

@jsquyres
Copy link
Member

As of about an hour ago -- i.e., 506cda1.

@jsquyres
Copy link
Member

jsquyres commented Sep 14, 2020

@bosilca and I talked in Slack, and I discovered a bug in my test script: I was accidentally running with a much shorter timeout for ADAPT than for non-ADAPT. Fixing that bug (i.e., having a healthy/long timeout for both ADAPT and non-ADAPT) and re-running the test (including @devreal's latest ibcast commit), I get down to just the following errors with ADAPT compared to non-ADAPT:

  • ialltoallv_somezeros
  • ibcast_f
  • ibcast_f90
  • ibcast_f08

Here's the stack trace from the ibcast_f failure (all the ibcast failures are similar):

#0  0x3e9be3250f in ???
#1  0x3e9be32495 in ???
#2  0x3e9be33c74 in ???
#3  0x3e9be2b60d in ???
#4  0x3e9be2b6cf in ???
#5  0x2aaaab1cc763 in ompi_request_destruct
        at request/request.c:70
#6  0x2aaab5a1d25c in opal_obj_run_destructors
        at ../../../../opal/class/opal_object.h:462
#7  0x2aaab5a1ddc0 in ompi_coll_adapt_request_free
        at /home/jsquyres/git/ompi/ompi/mca/coll/adapt/coll_adapt_module.c:196
#8  0x2aaaab1cdb3a in ompi_request_free
        at ../ompi/request/request.h:383
#9  0x2aaaab1cdde0 in ompi_request_default_wait
        at request/req_wait.c:80
#10  0x2aaaab2502b0 in PMPI_Wait
        at /home/jsquyres/git/ompi/ompi/mpi/c/profile/pwait.c:74
#11  0x2aaaaaf4b62b in ompi_wait_f
        at /home/jsquyres/git/ompi/ompi/mpi/fortran/mpif-h/profile/pwait_f.c:76
#12  0x40144d in ibcast
        at /home/jsquyres/git/ompi-tests/ibm/collective/ibcast_f.f90:60
#13  0x401515 in main
        at /home/jsquyres/git/ompi-tests/ibm/collective/ibcast_f.f90:72

@jsquyres
Copy link
Member

I neglected to mention: ialltoallv_somezeros-adapt just times out with ADAPT (vs. completing with no errors for non-ADAPT). There's no corefile / no stack trace.

@bosilca bosilca changed the title Fix ADAPT for few corner cases Fix some corner cases with ADAPT Sep 14, 2020
@jsquyres
Copy link
Member

bot:aws:retest

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the latest commit, the only failure left is ialltoallv_somezeros. In discussion with George, it looks like libnbc is the root cause of the failure here, not ADAPT. Hence, I think we should merge this PR.

@jjhursey
Copy link
Member

bot:ibm:retest

@open-mpi open-mpi deleted a comment from ibm-ompi Sep 17, 2020
@open-mpi open-mpi deleted a comment from ibm-ompi Sep 17, 2020
@open-mpi open-mpi deleted a comment from ibm-ompi Sep 17, 2020
@open-mpi open-mpi deleted a comment from ibm-ompi Sep 17, 2020
@open-mpi open-mpi deleted a comment from ibm-ompi Sep 17, 2020
@open-mpi open-mpi deleted a comment from ibm-ompi Sep 17, 2020
Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Via testing on Cisco clusters, it looks good to me.

bosilca and others added 2 commits September 18, 2020 12:50
- Add support for fallback to previous coll module on non-commutative operations (#30)
- Replace mutexes by atomic operations.
- Use the correct nbc request type (for both ibcast and ireduce)
  * coll/base: document type casts in ompi_coll_base_retain_*
- add module-wide topology cache
- use standard instead of synchronous send and add mca parameter to control mode of initial send in ireduce/ibcast
- reduce number of memory allocations
- call the default request completion.
  - Remove the requests from the Fortran lookup conversion tables before completing
    and free it.

Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Joseph Schuchart <[email protected]>

Co-authored-by: Joseph Schuchart <[email protected]>
@bosilca bosilca merged commit 21c9c66 into open-mpi:master Sep 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants