I've had a hang reported on BG/Q after about 2K MPI_Comm_create calls.
It hangs on the two latest releases (mpich2 v1.5.x and v1.4.x) on BG/Q.
It also hangs on Linux with the 64-bit (MPI over PAMI) MPICH2 library.
On older mpich 1.? (BG/P) it failed with 'too many communicators' and
didn't hang, which is what they expected.
It seems to be stuck in the while (*context_id == 0) loop in
commutil.c, repeatedly calling allreduce and never settling on a
context id. I didn't do much debugging, but it seems to be in
vanilla mpich code, not something we modified.
ftmain.f90 fails if you run it on >2k ranks (creates one comm per
rank). This was the original customer testcase.
ftmain2.f90 fails by looping so you can run on fewer ranks.
I just noticed that with --np 1, I get the 'too many communicators' error from ftmain2. But with --np 2 and up, it hangs.
stdout[0]: check_newcomm do-start 0 , repeat 2045 , total 2046
stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in PMPI_Comm_create: Other MPI error, error stack:
stderr[0]: PMPI_Comm_create(609).........: MPI_Comm_create(MPI_COMM_WORLD, group=0xc80700f6, new_comm=0x1dbfffb520) failed
stderr[0]: PMPI_Comm_create(590).........:
stderr[0]: MPIR_Comm_create_intra(250)...:
stderr[0]: MPIR_Get_contextid(521).......:
stderr[0]: MPIR_Get_contextid_sparse(752): Too many communicators
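To illustrate the behavior described above, here is a minimal single-process sketch in C. It is not MPICH's code. It assumes the allocator agrees on a context id by combining per-rank "free id" bitmasks with a bitwise AND (the role the allreduce would play), and that id 0 means "nothing found". The names `rank_state`, `and_all`, `alloc_context_id`, and `count_until_exhausted`, and the pool size of 2048 ids, are all hypothetical. The point is only that once the combined mask is empty on every rank, the allocator has to report exhaustion, because repeating the reduction while the id is still 0 can never make progress.

```c
#include <assert.h>
#include <stdint.h>

#define MASK_WORDS 64   /* 64 words * 32 bits = 2048 ids: an assumed pool size */

/* One simulated rank: a set bit means "this context id is free here". */
typedef struct { uint32_t mask[MASK_WORDS]; } rank_state;

static void rank_init(rank_state *r) {
    for (int w = 0; w < MASK_WORDS; w++) r->mask[w] = 0xFFFFFFFFu;
    r->mask[0] &= ~1u;  /* reserve id 0 so that 0 can mean "no id found" */
}

/* Stand-in for an allreduce with bitwise AND:
 * an id is usable only if it is free on every rank. */
static void and_all(const rank_state *ranks, int nranks, uint32_t *out) {
    for (int w = 0; w < MASK_WORDS; w++) {
        out[w] = 0xFFFFFFFFu;
        for (int r = 0; r < nranks; r++) out[w] &= ranks[r].mask[w];
    }
}

/* Returns a nonzero id on success, or 0 when no id is free on all ranks. */
static int alloc_context_id(rank_state *ranks, int nranks) {
    uint32_t common[MASK_WORDS];
    assert(nranks > 0);
    and_all(ranks, nranks, common);
    for (int w = 0; w < MASK_WORDS; w++) {
        if (common[w] == 0) continue;
        for (int b = 0; b < 32; b++) {
            if (common[w] & (1u << b)) {
                /* Mark the chosen id as used on every rank. */
                for (int r = 0; r < nranks; r++) ranks[r].mask[w] &= ~(1u << b);
                return w * 32 + b;
            }
        }
    }
    /* Pool exhausted everywhere: report it instead of reducing again. */
    return 0;
}

/* Allocate ids until failure and return how many were handed out. */
static int count_until_exhausted(int nranks) {
    rank_state ranks[8];
    assert(nranks > 0 && nranks <= 8);
    for (int r = 0; r < nranks; r++) rank_init(&ranks[r]);
    int n = 0;
    while (alloc_context_id(ranks, nranks) != 0) n++;
    return n;
}
```

In this sketch, `count_until_exhausted(1)` and `count_until_exhausted(2)` both stop after 2047 allocations (the assumed 2048 ids minus the reserved id 0), regardless of the number of simulated ranks. A caller would turn the 0 return into an error such as 'too many communicators' rather than looping.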
Originally by dinan on 2012-12-14 11:24:17 -0600
Reported by Bob Cernhous @ IBM: