
Can't open more than 234 files over the course of a program run #4336

Closed
william-dawson opened this issue Oct 13, 2017 · 11 comments

@william-dawson

Background information

I'm sure this bug is rarely encountered, but I ran into it while testing my code, which involves running it with a large number of parameter combinations.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

2.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed using homebrew.

Please describe the system on which you are running

Mac OS 10.12.6
MacBook Pro (Retina, 13-inch, Early 2015)


Details of the problem

You can reproduce the issue with the following Fortran program:

PROGRAM TestFileIO
  USE MPI
  IMPLICIT NONE
  INTEGER :: counter
  INTEGER :: mpi_err
  INTEGER :: mpi_file_handler
  INTEGER(KIND=MPI_OFFSET_KIND), PARAMETER :: zero_size = 0

  CALL MPI_Init(mpi_err)

  DO counter = 1, 250
    CALL MPI_File_open(MPI_COMM_WORLD, "test.txt", &
        & IOR(MPI_MODE_CREATE,MPI_MODE_WRONLY), MPI_INFO_NULL, &
        & mpi_file_handler, mpi_err)
    CALL MPI_File_close(mpi_file_handler, mpi_err)
  END DO

  CALL MPI_Finalize(mpi_err)
END PROGRAM

Compile with: mpif90 test.f90 -o test
Run with: mpirun -np 1 ./test

[redacted] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/OMPIO_test.txt_-1648689151_.sm
[redacted] mca_sharedfp_sm_file_open: Error during file open
(the last line repeats for each remaining iteration)

I also tried putting sleep statements after the close calls, but the same error occurs.

@ggouaillardet
Contributor

First, check the limit on open file descriptors with ulimit -n and bump it if appropriate.

I suspect it is 256 on your system, so you might want to raise it to 512 and see how things go.
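For illustration, checking and raising the soft limit looks like this in any POSIX shell (the value 512 is just the number suggested above; nothing here is Open MPI-specific):

```shell
# Show the current soft limit on open file descriptors
ulimit -n

# Show the hard limit, which caps how far an unprivileged user may raise the soft limit
ulimit -Hn

# Raise the soft limit for this shell session (child processes such as
# mpirun inherit it); this fails if 512 exceeds the hard limit
ulimit -n 512 || echo "could not raise the limit to 512"
```

Note that the change only lasts for the current shell session; making it permanent is OS-specific.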

@william-dawson
Author

@ggouaillardet: thanks for the quick response. Yes, that fixed the issue. Sorry for the trouble.

@ggouaillardet
Contributor

On second thought, there could be a file descriptor leak involved here.
A workaround is to use romio instead of ompio.

For example, you can run (with the previous limit):

mpirun --mca io romio314 -np 1 ./test

If that fixes the problem, it would strongly point to a file descriptor leak (e.g. a bug)!

ggouaillardet reopened this Oct 13, 2017
@william-dawson
Author

Thank you again. The romio workaround you suggested also works.

@ggouaillardet
Contributor

@edgargabriel can you please have a look at this?
It strongly suggests there is a file descriptor leak in ompio.

@edgargabriel
Member

Will do; it will take me a couple of days, however.

@edgargabriel
Member

edgargabriel commented Oct 16, 2017

An update on this. I cannot reproduce the problem on my Linux box. I tried 2.1.2, 3.0, and master, and opening/closing a file 2048 times was no problem on any of these versions, despite having reduced the maximum number of open files to 256 with ulimit. I also stepped through the code in a debugger and double-checked that all references to a file are correctly closed.

I will try to steal my daughter's MacBook tonight to see whether I can reproduce it on macOS. I could imagine a scenario where an error is raised before the second file handle (which we store for shared file pointer operations) can be closed, but I was not able to trigger this problem on Linux.

@ggouaillardet
Contributor

@edgargabriel if you cannot get access to your daughter's laptop, you can use the following program:

#define _GNU_SOURCE /* for asprintf with glibc */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#include <mpi.h>

int main(int argc, char *argv[]) {
    int i;
    MPI_File file;
    char * str;

    MPI_Init(&argc, &argv);
    for (i=0; i<250; i++) {
        printf("%d\n", i);
        MPI_File_open(MPI_COMM_WORLD, "test.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &file);
        MPI_File_close(&file);
        asprintf(&str, "lsof -p %d", getpid());
        system(str);
        free(str);
    }
    MPI_Finalize();

    return 0;
}

You will note that at each iteration there is one more mmap'ed segment
(grep for DEL on Linux, and PSXSEM on OS X).

Even though there is no crash under Linux, the leak is there.

@ggouaillardet
Contributor

In mca_sharedfp_sm_file_close(), should we call sem_close(file_data->mutex); before sem_unlink(file_data->sem_name); ?

@ggouaillardet
Contributor

Also, can/should sem_unlink() be moved into mca_sharedfp_sm_file_open(), after all tasks have called sem_open()? In the event of a crash, there would then be no leftover name.

@edgargabriel
Member

Hm, ok, that might be the reason. I was only tracking the file handles, not the semaphores. I will prepare a fix along those lines. Thanks @ggouaillardet!

edgargabriel added a commit to edgargabriel/ompi that referenced this issue Oct 17, 2017

In case a named semaphore is used, it is necessary to close the semaphore to remove
all sm segments; sem_unlink just removes the name reference once all processes have closed
the sem.

Fixes issue: open-mpi#4336

Signed-off-by: Edgar Gabriel <[email protected]>

edgargabriel added a commit to edgargabriel/ompi that referenced this issue Oct 18, 2017

sharedfp/sm: unlink only needs to be called by one process

Signed-off-by: Edgar Gabriel <[email protected]>
edgargabriel added further commits to edgargabriel/ompi that referenced this issue on Oct 19, 2017, Jan 9, 2018, and Jan 10, 2018 (the last a cherry-pick of commit 4d995bd), all with the same commit messages as above.

bosilca pushed a commit to bosilca/ompi that referenced this issue Dec 4, 2017.