Context ID exhaustion bug #1768

Closed
mpichbot opened this issue Oct 14, 2016 · 3 comments

mpichbot commented Oct 14, 2016

Originally by dinan on 2012-12-14 11:24:17 -0600


Reported by Bob Cernhous @ IBM:

I've had a hang reported on BG/Q after about 2K MPI_Comm_create calls.

It hangs on the latest 2 releases (mpich2 v1.5.x and v1.4.x) on BG/Q.

It also hangs on 64-bit Linux with the MPICH2 library (MPI over PAMI).

On older mpich 1.? (BG/P) it failed with 'too many communicators' and
didn't hang, which is what they expected.

It seems to be stuck in the while (*context_id == 0) loop in
commutil.c, repeatedly calling allreduce and never settling on a
context ID. I didn't do much debugging, but this looks like vanilla
MPICH code, not something we modified.

ftmain.f90 fails when run on more than 2K ranks (it creates one
communicator per rank). This was the original customer testcase.

ftmain2.f90 triggers the failure by looping, so it can be run on fewer ranks.

I just noticed that with --np 1, I get the 'too many communicators' error from ftmain2, but --np 2 and up hangs.

stdout[0]:  check_newcomm do-start           0 , repeat         2045 , total        2046
stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in PMPI_Comm_create: Other MPI error, error stack:
stderr[0]: PMPI_Comm_create(609).........: MPI_Comm_create(MPI_COMM_WORLD, group=0xc80700f6, new_comm=0x1dbfffb520) failed
stderr[0]: PMPI_Comm_create(590).........:
stderr[0]: MPIR_Comm_create_intra(250)...:
stderr[0]: MPIR_Get_contextid(521).......:
stderr[0]: MPIR_Get_contextid_sparse(752): Too many communicators 
@mpichbot mpichbot self-assigned this Oct 14, 2016
@mpichbot mpichbot added this to the mpich-3.0.1 milestone Oct 14, 2016

Originally by dinan on 2012-12-14 11:24:30 -0600


Attachment added: ftmain.f90 (3.7 KiB)
Test case #1


Originally by dinan on 2012-12-14 11:24:41 -0600


Attachment added: ftmain2.f90 (3.8 KiB)
Test case #2


Originally by dinan on 2012-12-17 14:03:19 -0600


Resolved in [3c720d0].
