Can't open more than 234 files over the course of a program run #4336
Comments
At first, check the limit on open file descriptors with ulimit -n; I suspect it might be too low.
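For context, the default per-process soft limit on open files under macOS is 256, which lines up with failures after roughly 234 opens once the descriptors already held by the process and the MPI runtime are accounted for. The limit can also be inspected programmatically; a minimal C sketch (illustrative only, not from the thread) using getrlimit:

/* Illustrative sketch: print this process's soft and hard limits on open
 * file descriptors (RLIMIT_NOFILE). */
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft limit: %llu, hard limit: %llu\n",
           (unsigned long long) rl.rlim_cur,
           (unsigned long long) rl.rlim_max);
    return 0;
}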
@ggouaillardet: thanks for the quick response. Yes, that fixed the issue. Sorry for the trouble.
on second thought, there could be a file descriptor leak involved here. For example, you can (with the previous limit) watch how many descriptors your process has open as it runs; if the number keeps growing even though every file is closed, that would strongly point to a file descriptor leak (i.e., a bug)!
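One way to do that without external tools is to count the calling process's open descriptors before and after each open/close pair; a steadily growing count confirms the leak. A minimal sketch (count_open_fds is a hypothetical helper, not from this thread; it lists /dev/fd, which exists on both macOS and Linux):

/* Hypothetical helper: count the open file descriptors of the calling
 * process by listing /dev/fd. The count includes the descriptor that
 * opendir() itself uses. */
#include <dirent.h>
#include <stdio.h>

static int count_open_fds(void) {
    int count = 0;
    DIR *d = opendir("/dev/fd");
    if (d == NULL) {
        return -1;
    }
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL) {
        if (entry->d_name[0] != '.') {   /* skip "." and ".." */
            count++;
        }
    }
    closedir(d);
    return count;
}

int main(void) {
    printf("open descriptors: %d\n", count_open_fds());
    return 0;
}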
Thank you again. The fix you suggested (raising the limit with ulimit) works.
@edgargabriel can you please have a look at it?
will do, will take me a couple of days however
An update on this. I cannot reproduce the problem on my Linux box. I tried 2.1.2, 3.0, and master, and opening/closing a file 2048 times was no problem on any of these versions, despite having reduced the maximum number of open files to 256 with ulimit. I also walked through the code in the debugger and double-checked that all references to a file are correctly closed. I will try to borrow my daughter's MacBook tonight to see whether I can reproduce it on macOS. I could imagine a scenario where an error is raised before the second file handle (which we store for shared file pointer operations) could be closed, but I was not able to trigger this problem on Linux.
@edgargabriel if you cannot access your daughter's laptop, you can use the following program:

#define _GNU_SOURCE   /* for asprintf on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int i;
    MPI_File file;
    char *str;

    MPI_Init(&argc, &argv);
    for (i = 0; i < 250; i++) {
        printf("%d\n", i);
        /* open and immediately close the same file */
        MPI_File_open(MPI_COMM_WORLD, "test.txt",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &file);
        MPI_File_close(&file);
        /* list the descriptors this process holds after each iteration */
        asprintf(&str, "lsof -p %d", getpid());
        system(str);
        free(str);
    }
    MPI_Finalize();
    return 0;
}

You will note that at each iteration there is one more open descriptor in the lsof output. Even if there is no crash under Linux, the leak is there.
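To try it, compile the program with the Open MPI C wrapper and run it with a single rank, for example mpicc leak.c -o leak followed by mpirun -np 1 ./leak (the file name leak.c is just a placeholder). Because lsof is invoked on every iteration, the growing list of descriptors is visible directly in the output.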
in the sharedfp/sm component, the named semaphore does not appear to be closed when the file is closed.
also, can/should sem_close be called there?
hm, ok, that might be the reason. I was tracking only the file handles, not the semaphores. I will prepare a fix along those lines. Thanks @ggouaillardet!
in case a named semaphore is used, it is necessary to close the semaphore to remove all sm segments. sem_unlink just removes the name reference once all processes have closed the sem. Fixes issue: open-mpi#4336. Signed-off-by: Edgar Gabriel <[email protected]>

sharedfp/sm: unlink only needs to be called by one process. Signed-off-by: Edgar Gabriel <[email protected]>
This is a cherry-pick of commit 4d995bd. Fixes issue: open-mpi#4336. Signed-off-by: Edgar Gabriel <[email protected]>
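For illustration, the cleanup pattern described by these commits looks roughly like the sketch below. This is not the actual Open MPI sharedfp/sm code; the semaphore name and the rank check are made-up stand-ins. Every process closes its handle to the named semaphore, and a single process additionally unlinks the name.

/* Sketch of the named-semaphore lifecycle described above; not the real
 * Open MPI implementation. */
#include <fcntl.h>      /* O_CREAT */
#include <semaphore.h>
#include <stdio.h>

int main(void) {
    const char *name = "/my_sharedfp_sem";  /* hypothetical name */
    int rank = 0;                           /* stand-in for this process's MPI rank */

    sem_t *sem = sem_open(name, O_CREAT, 0600, 1);
    if (sem == SEM_FAILED) {
        perror("sem_open");
        return 1;
    }

    /* ... use the semaphore to serialize shared file pointer updates ... */

    /* Every process must close its handle; otherwise a descriptor (and the
     * backing segment) is leaked on every file open/close cycle. */
    sem_close(sem);

    /* Only one process needs to unlink the name; the semaphore itself is
     * removed once all processes have closed it. */
    if (rank == 0) {
        sem_unlink(name);
    }
    return 0;
}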
Background information
I'm sure this bug is rarely encountered, but I ran into it while testing my code, which involves running it with a large number of parameters.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
2.1.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed using homebrew.
Please describe the system on which you are running
Mac OS 10.12.6
MacBook Pro (Retina, 13-inch, Early 2015)
Details of the problem
You can reproduce the problem with a short Fortran program (test.f90) that opens and closes files more than a couple of hundred times over the course of a run.
Compile with:
mpif90 test.f90 -o test
Run with:
mpirun -np 1 ./test
I also tried putting sleep statements after the close calls, but the same error occurs.