
coll: Add multileader allreduce composition #5921

Merged: tarudoodi merged 4 commits into pmodels:main from the allreduce_multileader branch on Apr 9, 2022

Conversation


@tarudoodi tarudoodi commented Mar 31, 2022

Pull Request Description

Multi-leader based composition: each node has num_leaders leaders, and each leader first reduces the data within its sub-node communicator (sub-node_comm). This is followed by an intra-node reduce and an inter-node allreduce on the piece of data each leader is responsible for. A shared-memory buffer is allocated per leader; if the message size exceeds this buffer, the message is chunked.
Constraints: for a given comm, all nodes must have the same number of ranks per node, and the op must be commutative.
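
To make the composition concrete, here is a minimal sketch of the same sequence of steps written against the public MPI API rather than the MPICH-internal code this PR adds. The function name multileader_allreduce, the leader assignment (node_rank % num_leaders), the fixed MPI_DOUBLE/MPI_SUM datatype and op, and the temporary heap buffer standing in for the per-leader shared-memory buffer (so no chunking) are all assumptions made for illustration. It further assumes count is divisible by num_leaders, num_leaders does not exceed the number of ranks per node, and, as above, all nodes have the same number of ranks and the op is commutative.

```c
#include <mpi.h>
#include <stdlib.h>

/* Illustrative sketch only; names and layout are assumptions, not the PR's code. */
static int multileader_allreduce(const double *sendbuf, double *recvbuf,
                                 int count, int num_leaders, MPI_Comm comm)
{
    MPI_Comm node_comm, sub_node_comm;
    MPI_Comm node_leader_comm = MPI_COMM_NULL, inter_node_comm = MPI_COMM_NULL;
    int node_rank;
    int slice = count / num_leaders;    /* assumes count % num_leaders == 0 */

    /* Ranks that share a node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Assign each rank to one of num_leaders sub-node groups; the rank with
     * node_rank == group becomes that group's leader (rank 0 of sub_node_comm). */
    int group = node_rank % num_leaders;
    int is_leader = (node_rank < num_leaders);
    MPI_Comm_split(node_comm, group, node_rank, &sub_node_comm);

    /* Step 1: reduce within each sub-node group onto its leader. */
    MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, sub_node_comm);

    if (is_leader) {
        double *my_slice = malloc((size_t) slice * sizeof(double));

        /* Step 2: intra-node reduce among the node's leaders; leader g ends up
         * owning the node-wide reduction of slice g. */
        MPI_Comm_split(node_comm, 0, node_rank, &node_leader_comm);
        MPI_Reduce_scatter_block(recvbuf, my_slice, slice, MPI_DOUBLE, MPI_SUM,
                                 node_leader_comm);

        /* Step 3: inter-node allreduce of that slice with the leaders that own
         * the same slice on every other node. */
        MPI_Comm_split(comm, group, 0, &inter_node_comm);
        MPI_Allreduce(MPI_IN_PLACE, my_slice, slice, MPI_DOUBLE, MPI_SUM,
                      inter_node_comm);

        /* Reassemble the full result on every leader of this node. */
        MPI_Allgather(my_slice, slice, MPI_DOUBLE, recvbuf, slice, MPI_DOUBLE,
                      node_leader_comm);
        free(my_slice);
    } else {
        /* Non-leaders still participate in the (collective) communicator splits. */
        MPI_Comm_split(node_comm, MPI_UNDEFINED, node_rank, &node_leader_comm);
        MPI_Comm_split(comm, MPI_UNDEFINED, 0, &inter_node_comm);
    }

    /* Distribute the assembled result from each leader to its sub-node group. */
    MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, sub_node_comm);

    if (node_leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&node_leader_comm);
    if (inter_node_comm != MPI_COMM_NULL)
        MPI_Comm_free(&inter_node_comm);
    MPI_Comm_free(&sub_node_comm);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}
```

Each leader ends up responsible for a 1/num_leaders slice of the buffer, so the inter-node exchange is split into num_leaders independent allreduces, one per leader.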

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@tarudoodi
Author

test:mpich/ch4/most

@tarudoodi tarudoodi force-pushed the allreduce_multileader branch from 9e18a2d to d8acaac on March 31, 2022 22:47
@tarudoodi
Author

test:mpich/ch4/most

@tarudoodi tarudoodi requested a review from yfguo April 1, 2022 03:03
@tarudoodi
Author

@yfguo Testing passed on this PR, and it is ready for review.

@tarudoodi tarudoodi force-pushed the allreduce_multileader branch from d8acaac to 8432cc1 on April 5, 2022 18:18
@tarudoodi
Author

test:mpich/ch4/most

@tarudoodi
Author

The failed test is a timeout in ./threads/pt2pt/mt_improbe_sendrecv_huge 2 -iter=64 -count=4194304 with MPIR_CVAR_CH4_OFI_EAGER_MAX_MSG_SIZE=16384. It did not show up in the first test run, so I think it is not related to this PR. I will retest.

@tarudoodi
Author

test:mpich/ch4/most

Taru Doodi and others added 4 commits April 8, 2022 17:11
Multi-leader based composition: each node has `num_leaders` leaders,
which reduce the data within their sub-node_comm. This is followed by
an intra-node reduce and an inter-node allreduce on the piece of data
each leader is responsible for. A shared-memory buffer is allocated
per leader; if the message size exceeds this shm buffer, the message
is chunked.
Constraints: for a comm, all nodes must have the same number of
ranks per node, and the op must be commutative.

Co-authored-by: Surabhi Jain <[email protected]>
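
As a minimal illustration of the chunking constraint mentioned in the commit message, the loop below walks a message that is larger than the per-leader shared-memory buffer in buffer-sized pieces; run_in_chunks, chunk_fn, and shm_buf_size are hypothetical names used only for this sketch, not MPICH internals.

```c
#include <stddef.h>

/* Hypothetical callback that performs one pass of the multi-leader
 * composition on a piece of the message that fits in the shm buffer. */
typedef void (*chunk_fn)(const void *send, void *recv, size_t nbytes);

/* Walk the message in shm-buffer-sized chunks and hand each one to the
 * callback; the last chunk may be shorter than shm_buf_size. */
static void run_in_chunks(const char *sendbuf, char *recvbuf,
                          size_t total_bytes, size_t shm_buf_size,
                          chunk_fn one_chunk)
{
    for (size_t offset = 0; offset < total_bytes; offset += shm_buf_size) {
        size_t nbytes = total_bytes - offset;
        if (nbytes > shm_buf_size)
            nbytes = shm_buf_size;
        one_chunk(sendbuf + offset, recvbuf + offset, nbytes);
    }
}
```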
@tarudoodi tarudoodi force-pushed the allreduce_multileader branch from 8432cc1 to 8e9504a on April 8, 2022 22:11
@tarudoodi
Author

test:mpich/ch4/most

@tarudoodi
Author

test:mpich/ch4/ofi

@tarudoodi tarudoodi merged commit f4cc89b into pmodels:main Apr 9, 2022