ch4/posix: decrease shm_limit_counter when freeing comm obj #4864
Conversation
The fix looks correct to me. Just left one minor comment.
@KaimingOuyang Please describe in the description the manifestation of the bug -- that after the application created too many comms, we reach

Separately, the same error will happen if the application simply creates too many comms simultaneously. I think the root cause is that we are too conservative in allocating one separate shm region for each new comm. Maybe we could pre-allocate, say, 10 regions (or make it dynamic), and pick a region at the time of

EDIT: Another solution may be to preallocate a chunk of shared memory at init time and then implement a simple memory-allocation interface. Each time we run a shm collective, we simply "allocate" the amount of shared memory and "free" it at the end of the collective. We may optimize for the case where there is only one collective going on at a time, to keep the code simple and efficient. The only overhead is taking a lock, which should not be contended except in the rare case of multiple simultaneous collectives (over different comms).
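The preallocate-and-suballocate idea above could look roughly like the following minimal C sketch. All names here are illustrative (not the actual MPICH code), and a static array stands in for the mmap'd shm region reserved at init time:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Sketch: one shm pool reserved at init time, with a trivial
 * allocate/free interface guarded by a lock. Optimized for the common
 * case of a single in-flight collective: the pool behaves like a stack,
 * so freeing the most recent allocation just rewinds the cursor. */
#define POOL_SIZE (1 << 20)

static char pool[POOL_SIZE];   /* stands in for the mmap'd shm region */
static size_t pool_used = 0;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns NULL if the request does not fit
 * (the caller would fall back to the send/recv path). */
static void *coll_shm_alloc(size_t len)
{
    void *p = NULL;
    pthread_mutex_lock(&pool_lock);
    if (pool_used + len <= POOL_SIZE) {
        p = pool + pool_used;
        pool_used += len;
    }
    pthread_mutex_unlock(&pool_lock);
    return p;
}

/* Called at the end of the collective; with one collective in flight at a
 * time this always releases the top of the stack. */
static void coll_shm_free(size_t len)
{
    pthread_mutex_lock(&pool_lock);
    pool_used -= len;
    pthread_mutex_unlock(&pool_lock);
}
```

The lock is uncontended except when multiple collectives over different comms run at once, matching the trade-off described above.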
Looks good to me.
Pending on passing tests.
test:mpich/ch4/most
@KaimingOuyang Could you check what's wrong with the ofi tests?
Ignore the
test:mpich/ch4/most
test:mpich/ch4/most
shm_limit_counter is increased when allocating shm memory for coll calls, but it is not properly decreased when freeing the comm object. This commit fixes that bug.
When shm memory exceeds the predefined threshold, coll calls need to fall back to a send/recv-based implementation. The current main branch sets the fallback action to error, which aborts the program. This causes an improper exit, so this commit changes the fallback action to silent.
… finalize MPIDI_POSIX_global.shm_ptr is freed but will be referenced during builtin comm free; here we set MPIDI_POSIX_shm_limit_counter as a dummy counter to avoid a segmentation fault
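The fallback-action change in the second commit amounts to the following behavior, sketched here in minimal C with illustrative names (this is not the MPICH control flow, just the decision it encodes):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the fallback-action change: when the shm budget is exhausted,
 * the collective should silently take the send/recv-based path instead of
 * treating the condition as a fatal error. */
enum fallback_action { FALLBACK_ERROR, FALLBACK_SILENT };

/* Returns 0 on success, -1 on a fatal error.
 * used_shm_path records which implementation was chosen. */
static int run_reduce(size_t shm_needed, size_t shm_available,
                      enum fallback_action action, int *used_shm_path)
{
    if (shm_needed <= shm_available) {
        *used_shm_path = 1;
        return 0;            /* shm-based reduce */
    }
    *used_shm_path = 0;
    if (action == FALLBACK_ERROR)
        return -1;           /* old behavior: "no shm mem" error, abort */
    return 0;                /* new behavior: silent send/recv-based reduce */
}
```

With FALLBACK_SILENT, exceeding the threshold is no longer a program-aborting condition, only a performance downgrade for that call.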
Good as long as tests are clean.
test:mpich/ch4/most
Update MPIR tuning file to 2k nodes or 32 racks
Bug Description
For MPI_Reduce/Allreduce, an internal reduce shm buffer is allocated for each comm. MPIDI_POSIX_shm_limit_counter is used to record the amount of shm memory allocated, which ensures the amount does not exceed the predefined threshold.

The bug is that MPIDI_POSIX_shm_limit_counter is increased when allocating shm memory for coll calls, but it is not properly decreased when freeing the comm object. When users allocate many comm objects, the total shm memory allocated is not properly reported, which causes the "no shm mem" error when MPICH detects that the allocated shm memory exceeds the predefined threshold.

This commit fixes this bug.
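The accounting fix pairs each counter increment at allocation time with a matching decrement in the comm-free path. A minimal C sketch of that invariant, with illustrative names and sizes rather than the actual MPICH code:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the fix: the shm usage counter must be decremented in the
 * comm-free path by exactly the amount recorded at allocation time.
 * Without the decrement the limit check eventually fails ("no shm mem")
 * even though the memory was actually released. */
static size_t shm_limit_counter = 0;     /* bytes of coll shm in use */
#define SHM_THRESHOLD (64 * 1024)        /* illustrative threshold */

typedef struct { size_t shm_bytes; } comm_t;

/* Returns 0 on success, -1 when the coll path must fall back. */
static int comm_alloc_coll_shm(comm_t *comm, size_t len)
{
    if (shm_limit_counter + len > SHM_THRESHOLD)
        return -1;                       /* budget exhausted */
    shm_limit_counter += len;
    comm->shm_bytes = len;
    return 0;
}

static void comm_free(comm_t *comm)
{
    /* The fix: return the bytes to the counter when the comm is freed. */
    shm_limit_counter -= comm->shm_bytes;
    comm->shm_bytes = 0;
}
```

Without the decrement in comm_free, a loop that repeatedly creates and destroys comms exhausts the budget even though only one buffer is live at a time.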
Expected Impact
Author Checklist
module: short description and follows good practice