
ch4/posix: decrease shm_limit_counter when freeing comm obj #4864

Merged — 3 commits merged into pmodels:main on Nov 8, 2020

Conversation

@KaimingOuyang (Author) commented Nov 2, 2020

Bug Description

For MPI_Reduce/MPI_Allreduce, an internal reduce shm buffer is allocated for each comm.
MPIDI_POSIX_shm_limit_counter records the total amount of shm memory allocated and ensures
it does not exceed the predefined threshold.

The bug: MPIDI_POSIX_shm_limit_counter is increased when allocating shm memory for collective
calls, but it is not decreased when the comm object is freed. When users create and free many
comm objects, the total shm memory in use is over-reported, so MPICH eventually concludes the
predefined threshold has been exceeded and fails with a no-shm-memory error.

This commit fixes this bug.
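The accounting pattern at issue can be illustrated with a minimal sketch. This is not the actual MPICH code: the counter, limit, and helper names below are simplified stand-ins for MPIDI_POSIX_shm_limit_counter and its threshold. The point is that every increment on the allocation path needs a matching decrement on the comm-free path, otherwise the counter only grows and later allocations spuriously hit the limit.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-node shm threshold. */
#define SHM_LIMIT_BYTES (64 * 1024)

/* Stand-in for MPIDI_POSIX_shm_limit_counter. */
_Atomic size_t shm_limit_counter = 0;

/* Account for a new shm allocation; fail if it would exceed the limit. */
bool shm_account_alloc(size_t nbytes)
{
    size_t prev = atomic_fetch_add(&shm_limit_counter, nbytes);
    if (prev + nbytes > SHM_LIMIT_BYTES) {
        /* Over the limit: roll back and report failure (the "no shm mem" error). */
        atomic_fetch_sub(&shm_limit_counter, nbytes);
        return false;
    }
    return true;
}

/* The fix: decrement the counter when the comm's shm buffer is freed. */
void shm_account_free(size_t nbytes)
{
    atomic_fetch_sub(&shm_limit_counter, nbytes);
}
```

Without the `shm_account_free` call on the comm-free path, a program that repeatedly creates and destroys communicators exhausts the budget even though no shm memory is actually in use.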

Expected Impact

Author Checklist

  • Reference appropriate issues (with "Fixes" or "See" as appropriate)
  • Remove xfail from the test suite when fixing a test
  • Commits are self-contained and do not do two things at once
  • Commit message is of the form: module: short description and follows good practice
  • Passes whitespace checkers
  • Passes warning tests
  • Passes all tests
  • Add comments such that someone without knowledge of the code could understand

@minsii (Contributor) left a comment:

The fix looks correct to me. Just left one minor comment.

@hzhou (Contributor) commented Nov 2, 2020

@KaimingOuyang Please describe in the PR description the manifestation of the bug -- that after the application creates too many comms, we reach MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE and fail with MPI_ERR_NO_MEM.

Separately, the same error will happen if the application simply creates too many comms simultaneously. I think the root cause is that we are too conservative in allocating a separate shm region for each new comm. Maybe we could pre-allocate, say, 10 regions (or make it dynamic), and pick a region at the time of the release-gather collective? What do you think? @minsii @zhenggb72 @tarudoodi

EDIT: Another solution may be to preallocate a chunk of shared memory at init time and then implement a simple memory allocation interface. Each time we run a shm collective, we simply "allocate" the needed amount of shared memory and "free" it at the end of the collective. We may optimize for the case where only one collective is going on at a time, to keep the code simple and efficient. The only overhead is taking a lock, which should not be contended except in the rare case of simultaneous collectives over different comms.
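The proposal above could be sketched roughly as follows. This is a hypothetical illustration, not MPICH code: the pool, lock, and function names are invented, a static array stands in for the init-time shm region, and a simple bump pointer models the "one collective at a time" fast path (so frees are assumed to come in LIFO order).

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical size of the chunk preallocated at init time. */
#define POOL_SIZE (1 << 20)

char shm_pool[POOL_SIZE];      /* stands in for the init-time shm region */
size_t pool_used = 0;          /* bump pointer: one outstanding collective is the common case */
pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* "Allocate" shm at the start of a collective; NULL if the pool is exhausted. */
void *coll_shm_alloc(size_t nbytes)
{
    void *ptr = NULL;
    pthread_mutex_lock(&pool_lock);
    if (pool_used + nbytes <= POOL_SIZE) {
        ptr = shm_pool + pool_used;
        pool_used += nbytes;
    }
    pthread_mutex_unlock(&pool_lock);
    return ptr;
}

/* "Free" at the end of the collective (LIFO assumption keeps this a bump pointer). */
void coll_shm_free(size_t nbytes)
{
    pthread_mutex_lock(&pool_lock);
    pool_used -= nbytes;
    pthread_mutex_unlock(&pool_lock);
}
```

As the comment suggests, the lock is uncontended in the common single-collective case; concurrent collectives over different comms would simply serialize briefly on the pool bookkeeping.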

@KaimingOuyang force-pushed the fix-shm-leak branch 2 times, most recently from 98265a0 to cf35350 (November 3, 2020 20:44)
@KaimingOuyang force-pushed the fix-shm-leak branch 2 times, most recently from 2b40c09 to 13c00a3 (November 3, 2020 22:07)
@hzhou (Contributor) previously approved these changes Nov 3, 2020:

Looks good to me.

Pending on passing tests.

@KaimingOuyang (Author) commented:

test:mpich/ch4/most

@hzhou (Contributor) left a comment:

@KaimingOuyang Could you check what's wrong with the ofi tests?

@hzhou dismissed their stale review on November 5, 2020 19:25 (reason: code change)

@hzhou (Contributor) commented Nov 5, 2020

Ignore the ubsan failures; they are fixed in #4870.

@KaimingOuyang (Author) commented:

test:mpich/ch4/most

@KaimingOuyang (Author) commented:

test:mpich/ch4/most

Kaiming Ouyang added 3 commits November 6, 2020 16:00:

  • shm_limit_counter is increased when allocating shm memory for coll calls, but it is not properly decreased when freeing the comm obj. This commit fixes this bug.

  • When shm memory exceeds the predefined threshold, coll calls need to fall back to the send/recv-based implementation. The current main branch sets the fallback action to error, which aborts the program. This causes an improper exit, so this commit changes the fallback action to silent.

  • … finalize: MPIDI_POSIX_global.shm_ptr is freed but will be referenced during builtin comm free; here we set MPIDI_POSIX_shm_limit_counter as a dummy counter to avoid a segmentation fault.
@hzhou (Contributor) left a comment:

Good as long as tests are clean.

@KaimingOuyang (Author) commented:

test:mpich/ch4/most

@hzhou hzhou merged commit 4b78f4b into pmodels:main Nov 8, 2020
rithwiktom added a commit to rithwiktom/mpich that referenced this pull request Jul 3, 2024
Update MPIR tuning file to 2k nodes or 32 racks