
running SNAP on 1k ranks with OpenMPI causes seg fault #2162

Closed

tenbrugg opened this issue Jun 27, 2016 · 1 comment

@tenbrugg
This issue tracks the OpenMPI seg fault problem discussed last week. When running SNAP with OpenMPI on KNL using 1024 ranks, the application seg faults during initialization. The problem does not occur when running with MPICH instead.

srun -n 1024 -N 16 --cpu_bind=none --hint=nomultithread --exclusive ../../../SNAP/src/gsnap 1024tasksSTlibfab.input

Core was generated by `/cray/css/u19/c17581/snap/nersc/SNAPJune13/small/../../../SNAP/src/gsnap 1024ta'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007ffff67112cf in ompi_mtl_ofi_irecv (mtl=0x7ffff6a50d20 <ompi_mtl_ofi>,
    comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, src=119, tag=-12, convertor=0x80f710,
    mtl_request=0x80f820) at mtl_ofi.h:537
537         remote_addr = endpoint->peer_fiaddr;
(gdb) where
#0  0x00007ffff67112cf in ompi_mtl_ofi_irecv (mtl=0x7ffff6a50d20 <ompi_mtl_ofi>,
    comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, src=119, tag=-12, convertor=0x80f710,
    mtl_request=0x80f820) at mtl_ofi.h:537
#1  0x00007ffff6774cb1 in mca_pml_cm_irecv (addr=0x930530, count=1,
    datatype=0x7ffff6a45040 <ompi_mpi_int>, src=119, tag=-12,
    comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, request=0x7fffffff6b50) at pml_cm.h:119
#2  0x00007ffff66635a2 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x7fffffff6d24,
    rbuf=0x7fffffff6d28, count=1, dtype=0x7ffff6a45040 <ompi_mpi_int>,
    op=0x7ffff6a64720 <ompi_mpi_op_max>, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    module=0x818740) at base/coll_base_allreduce.c:221
#3  0x00007ffff666d03a in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7fffffff6d24,
    rbuf=0x7fffffff6d28, count=1, dtype=0x7ffff6a45040 <ompi_mpi_int>,
    op=0x7ffff6a64720 <ompi_mpi_op_max>, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    module=0x818740) at coll_tuned_decision_fixed.c:66
#4  0x00007ffff65a6f62 in ompi_comm_allreduce_intra (inbuf=0x7fffffff6d24, outbuf=0x7fffffff6d28,
    count=1, op=0x7ffff6a64720 <ompi_mpi_op_max>, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    bridgecomm=0x0, local_leader=0x0, remote_leader=0x0, send_first=-1,
    tag=0x7ffff6798b5e "nextcid", iter=0) at communicator/comm_cid.c:878
#5  0x00007ffff65a5963 in ompi_comm_nextcid (newcomm=0x932490,
    comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, bridgecomm=0x0, local_leader=0x0,
    remote_leader=0x0, mode=32, send_first=-1) at communicator/comm_cid.c:221
#6  0x00007ffff65a2875 in ompi_comm_dup_with_info (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    info=0x0, newcomm=0x7fffffff6e48) at communicator/comm.c:1037
#7  0x00007ffff65a2760 in ompi_comm_dup (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    newcomm=0x7fffffff6e48) at communicator/comm.c:998
#8  0x00007ffff65f031c in PMPI_Comm_dup (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>,
    newcomm=0x7fffffff6e48) at pcomm_dup.c:63
#9  0x00007ffff6ab86e0 in ompi_comm_dup_f (comm=0x43a858,
    newcomm=0x63f46c <__plib_module_MOD_comm_snap>, ierr=0x7fffffff6e78) at pcomm_dup_f.c:76
#10 0x0000000000404410 in __plib_module_MOD_pinit ()
#11 0x00000000004023bc in MAIN__ ()
#12 0x00000000004021fd in main ()

Stock nightly libfabric and OpenMPI libraries from Sung's install directory are used. More details can be supplied if desired.
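For anyone without access to SNAP: frames #6–#9 show the fault is triggered by the MPI_Comm_dup that SNAP's pinit performs, and the crash occurs in the allreduce that negotiates the new communicator's context ID. A minimal C sketch along these lines (hypothetical; the file name and build/run lines are assumptions, not from the original report) should exercise the same code path at 1024 ranks:

/* dup_repro.c -- minimal sketch (not part of the original report).
 * SNAP's pinit duplicates MPI_COMM_WORLD (frame #9, ompi_comm_dup_f),
 * so a bare MPI_Comm_dup should walk the same CID-allreduce path.
 *
 * Build: mpicc -o dup_repro dup_repro.c
 * Run:   srun -n 1024 -N 16 --cpu_bind=none --hint=nomultithread ./dup_repro
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm comm_snap;  /* mirrors SNAP's duplicated communicator */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The reported crash happens inside this call, on the OFI MTL
     * receive path (ompi_mtl_ofi_irecv, mtl_ofi.h:537). */
    MPI_Comm_dup(MPI_COMM_WORLD, &comm_snap);

    if (rank == 0)
        printf("MPI_Comm_dup completed on %d ranks\n", size);

    MPI_Comm_free(&comm_snap);
    MPI_Finalize();
    return 0;
}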

@tenbrugg (Author)

Wrong repository; closing again.
