IMB-RMA test crash when using master branch and btl/ofi #9354
Comments
I got the following back trace from the core dump:
So the btl/ofi component was trying to access a NULL pointer.
Because this error does not occur with ompi 4.1.x, it must be caused by a commit unique to the master branch. I bisected and was able to locate the commit that caused the error:
With some further debugging, I was able to narrow the problem down to the following 3 lines of code in ompi/ompi/mca/osc/rdma/osc_rdma_component.c (line 608 in 5e42613):
If I understand correctly, the intention of these 3 lines is that when all MPI ranks are on the same machine, btl/sm will be used, so there is no need to do memory registration. (thus set However, this action is correct when alternate btls are used, but is not right when one original btl (such as btl/ofi) is used. When btl/ofi is selected, it will need memory registration even on the same instance. Meanwhile, if two alternate btls are being used,

In all, I believe these 3 lines are unnecessary and should be removed.

After removing these 3 lines, the 2-process test passes. However, the 16-node test failed. The error is caused by a NULL
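To make the argument above concrete, here is a minimal sketch of the two decision rules being contrasted. It is not Open MPI code: the `sketch_btl_t` type, its `register_mem` field, and both helper functions are hypothetical names invented for illustration. The only assumption taken from the comment is that an sm-like transport needs no memory registration while an ofi-like transport needs it even when all ranks share a node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a BTL module; NOT the real Open MPI struct. */
typedef struct sketch_btl {
    const char *name;
    /* Non-NULL when the transport requires buffers to be registered. */
    void *(*register_mem)(void *base, size_t size);
} sketch_btl_t;

/* Dummy registration routine: returns the base pointer as the "handle". */
static void *sketch_register(void *base, size_t size)
{
    (void) size;
    return base;
}

/* Locality-based rule: skip registration whenever all ranks are local.
 * This models the behavior the comment argues is wrong for btl/ofi. */
static bool needs_registration_by_locality(const sketch_btl_t *btl,
                                           bool all_ranks_local)
{
    assert(btl != NULL);
    return !all_ranks_local && btl->register_mem != NULL;
}

/* Capability-based rule: ask the selected transport, ignore locality. */
static bool needs_registration_by_btl(const sketch_btl_t *btl,
                                      bool all_ranks_local)
{
    assert(btl != NULL);
    (void) all_ranks_local;
    return btl->register_mem != NULL;
}

/* Two illustrative transports: one without and one with registration. */
static const sketch_btl_t sketch_sm  = { "sm-like",  NULL };
static const sketch_btl_t sketch_ofi = { "ofi-like", sketch_register };
```

With all ranks on one node, the locality-based rule reports that the ofi-like transport needs no registration, which is the mismatch described above. The capability-based rule still reports that registration is required.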
@wzamazon your last comment -
sounds similar to what I see with osc/rdma + btl/tcp. Can you post a stack trace?
Sure. The stack trace is:
I believe this error is caused by the following lines of code in the same function:
Here, So, IMO, the correct code should be:
This sets the 1st peer's
Opened #9358
@wzamazon thanks. Unfortunately it does not seem to be related to the rdma/tcp issues that I see.
PR has been merged
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
master branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
from git clone, then configured with
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
.e85b814 3rd-party/openpmix (v1.1.3-3095-ge85b814d)
fe0cc05e9cf7ff4b49565afcc334937d7e0b995b 3rd-party/prrte (psrvr-v2.0.0rc1-3983-gfe0cc05e9c)
Please describe the system on which you are running
Details of the problem
When running IMB-RMA with ompi master branch, the application crashed: