-
Notifications
You must be signed in to change notification settings - Fork 885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v3.0.x: hook madvaise to fix threading memory allocation issues #4041
Conversation
The current VMA cache implementation backing rcache/grdma can run into a deadlock situation in multi-threaded code when madvise is hooked and the c library uses locks. In this case we may run into the following situation: Thread 1: ... free () <- Holding libc lock madvice_hook () vma_iteration () <- Blocked waiting for vma lock Thread 2: ... vma_insert () <- Holding vma lock vma_item_new () malloc () <- Blocked waiting for libc lock To fix this problem we chose to remove the madvise () hook but that fix is causing issue open-mpi#3685. This commit aims to greatly reduce the chance that the deadlock will be hit by putting vma items into a free list. This moves the allocation outside the vma lock. In general there are a relatively small number of vma items so the default is to allocate 2048 vma items. This default is configurable but it is likely the number is too large not too small. Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from commit d0c4538) Signed-off-by: Nathan Hjelm <[email protected]>
It is not possible to use the patcher based memory hooks without hooking madvise (MADV_DONTNEED). This commit updates the patcher memory hooks to always hook madvise. This should be safe with recent rcache updates. References open-mpi#3685. Close when merged into v2.0.x, v2.x, and v3.0.x. Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from commit 0176000) Signed-off-by: Nathan Hjelm <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have commented on the master PR. The same comments apply here: lack of documentation in a critical area, and either a confusion comment or a mishap in the code.
Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from commit b870d15) Signed-off-by: Nathan Hjelm <[email protected]>
@bosilca Removed the erroneous comment. Please review again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proxy approval for @bosilca
Original master issue: #3685
Master PR: #4026