
client/server mechanism broken? #9396

Open
frizwi opened this issue Sep 20, 2021 · 1 comment


frizwi commented Sep 20, 2021

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v3.1.4, but I have also tried v4.1.0 with the same result.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source via tarball release

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 18.04 LTS
  • Computer hardware: single node
  • Network type:

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.


I am trying a very simple client/server setup, following the MPI documentation exactly.

The server calls MPI_Open_port(), prints the port name to stdout, and then waits in MPI_Comm_accept().

First, launch ompi-server. Then launch the server as:
"mpirun -np 1 --ompi-server file:ompi.txt ./server beer"

In a separate shell, launch the client, passing the port name printed by the server as its argument:
"mpirun -np 1 --ompi-server file:ompi.txt ./client "
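Copying the port name from the server's output into the client's command line by hand is easy to get wrong (a missed character or stray whitespace). A minimal sketch of an alternative, assuming both jobs run on the same node and see the same filesystem (true for the single-node setup above): the server writes the port name to a file after MPI_Open_port(), and the client reads it back before MPI_Comm_connect(). The helper names write_port_file() and read_port_file() are hypothetical, not part of Open MPI; only the C standard library is used.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the port name as a single line to path.
 * Returns 0 on success, -1 on failure. */
int write_port_file(const char *path, const char *port)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%s\n", port) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: read the port name back into buf (size bytes),
 * stripping the trailing newline. Returns -1 if the file cannot be read,
 * is empty, or holds a name too long for buf. */
int read_port_file(const char *path, char *buf, size_t size)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int rc = 0;
    if (fgets(buf, (int)size, f) == NULL) {
        rc = -1;                          /* empty file or read error */
    } else if (strchr(buf, '\n') == NULL) {
        /* No newline was read: fine only if the line ends right here */
        int c = fgetc(f);
        if (c != EOF && c != '\n' && c != '\r')
            rc = -1;                      /* port name longer than buf */
    }
    fclose(f);
    if (rc == 0)
        buf[strcspn(buf, "\r\n")] = '\0'; /* drop the line terminator */
    return rc;
}
```

On the server this would be something like write_port_file("port.txt", myport) right after MPI_Open_port(); on the client, read_port_file("port.txt", myport, sizeof(myport)) in place of the strcpy(). The file name is arbitrary, and the client should only be started once the server has written the file.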

The problem is that MPI_Comm_connect() in the client hangs.

Here is the backtrace from gdb:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f4f02574ad3 in futex_wait_cancelable (private=, expected=0, futex_word=0x55ceec04e110) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0 0x00007f4f02574ad3 in futex_wait_cancelable (private=, expected=0, futex_word=0x55ceec04e110) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x55ceec04e0a8, cond=0x55ceec04e0e8) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0x55ceec04e0e8, mutex=0x55ceec04e0a8) at pthread_cond_wait.c:655
#3 0x00007f4f0066eb72 in OPAL_MCA_PMIX2X_PMIx_Connect (procs=0x55ceec04e7b0, nprocs=2, info=0x0, ninfo=0) at client/pmix_client_connect.c:102
#4 0x00007f4f006231c2 in pmix2x_connect (procs=0x7fff61344d40) at pmix2x_client.c:1346
#5 0x00007f4f037c5793 in ompi_dpm_connect_accept (comm=0x55ceeb4b2520 <ompi_mpi_comm_self>, root=0, port_string=0x7fff61346157 "619642881.0:622259260",
send_first=true, newcomm=0x7fff613450b0) at dpm/dpm.c:398
#6 0x00007f4f0380cca3 in PMPI_Comm_connect (port_name=0x7fff61346157 "619642881.0:622259260", info=0x55ceeb4b2220 <ompi_mpi_info_null>, root=0,
comm=0x55ceeb4b2520 <ompi_mpi_comm_self>, newcomm=0x7fff613450f0) at pcomm_connect.c:109
#7 0x000055ceeb2b0b6d in main (argc=2, argv=0x7fff613459f8) at client.c:25
(gdb) up
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x55ceec04e0a8, cond=0x55ceec04e0e8) at pthread_cond_wait.c:502
502 pthread_cond_wait.c: No such file or directory.
(gdb)
#2 __pthread_cond_wait (cond=0x55ceec04e0e8, mutex=0x55ceec04e0a8) at pthread_cond_wait.c:655
655 in pthread_cond_wait.c
(gdb)
#3 0x00007f4f0066eb72 in OPAL_MCA_PMIX2X_PMIx_Connect (procs=0x55ceec04e7b0, nprocs=2, info=0x0, ninfo=0) at client/pmix_client_connect.c:102
102 PMIX_WAIT_THREAD(&cb->lock);
(gdb)
#4 0x00007f4f006231c2 in pmix2x_connect (procs=0x7fff61344d40) at pmix2x_client.c:1346
1346 ret = PMIx_Connect(p, nprocs, NULL, 0);

The exact same code works fine in Open MPI v1.8.8 but hangs in v2.x, v3.x, and v4.x. Has something changed? One thing I have noticed is that, on the same machine, the port name string looks very different: 1.8.8 produces a much longer string containing an IP address, a port number, and "tcp", whereas 3.x produces something like "0000.0:00000". Is there some configure flag I am missing?
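Since the 2.x+ port name is just a short numeric string, a quick shape check on the client side can rule out a mangled or truncated argument before looking at MPI_Comm_connect() itself. The sketch below assumes the "<digits>.<digits>:<digits>" shape seen in this report ("619642881.0:622259260" in the backtrace, "0000.0:00000" above). That shape is inferred from these two examples only, not from any documented Open MPI format, and the longer 1.8.8-style names would not match it. The function name is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

/* Advance *p past one or more decimal digits; false if there were none. */
static bool skip_digits(const char **p)
{
    const char *start = *p;
    while (isdigit((unsigned char)**p))
        (*p)++;
    return *p != start;
}

/* Heuristic: does s (non-NULL) look like "<digits>.<digits>:<digits>",
 * e.g. "619642881.0:622259260"? The shape is inferred from the examples
 * in this report, not from an Open MPI specification. */
bool looks_like_short_port_name(const char *s)
{
    const char *p = s;
    if (!skip_digits(&p) || *p++ != '.')
        return false;
    if (!skip_digits(&p) || *p++ != ':')
        return false;
    return skip_digits(&p) && *p == '\0';  /* nothing may trail the last number */
}
```

It is probably best used as a warning rather than a hard failure (e.g. printing to stderr when looks_like_short_port_name(myport) is false), since other Open MPI versions produce differently shaped names.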

Here is the exact code:

server.c:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
  char myport[MPI_MAX_PORT_NAME];
  char *msg = NULL;
  MPI_Comm intercomm;
  int len = 0;

  /* Get message if given */
  if (argc > 1) {
    msg = argv[1];
    len = strlen(msg);
  }

  MPI_Init(&argc, &argv);

  MPI_Open_port(MPI_INFO_NULL, myport);

  printf("port name is: %s\n", myport);
  fflush(stdout);  /* in case stdout is block-buffered (e.g. piped), show the port name now */

  //printf("Publishing ...");
  //MPI_Publish_name("test", MPI_INFO_NULL, myport);
  //  printf("done!\n");
  MPI_Comm_accept(myport, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

  /* do something with intercomm */
  printf("Sending msg ...");
  MPI_Send(&len, 1,   MPI_INT, 0, 0, intercomm);
  MPI_Send(msg,  len, MPI_CHAR, 0, 0, intercomm);
  printf(" ... done\n");

  /* The port is no longer needed once the connection is established */
  MPI_Close_port(myport);

  MPI_Finalize();

  return 0;
}

And client.c:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Global message
int ARR[3];

int main(int argc, char *argv[])
{
  MPI_Comm intercomm;
  char *name = argv[1];
  char msg[1024];
  int msglen = 0;
  char myport[MPI_MAX_PORT_NAME];

  msg[0] = '\0';

  MPI_Init(&argc, &argv);

  snprintf(myport, sizeof(myport), "%s", name);  /* bounded copy into myport */
  // MPI_Lookup_name("test", MPI_INFO_NULL, myport);
  printf("Got port name as %s\n", myport);

  MPI_Comm_connect(name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
  printf("Waiting to recv ... \n");
  
  MPI_Recv(&msglen, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, intercomm, MPI_STATUS_IGNORE);
  printf("Got length = %d\n", msglen);

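  /* Guard against a length that would overflow msg (leave room for the '\0') */
  if (msglen < 0 || msglen >= (int)sizeof(msg)) {
    fprintf(stderr, "Message length %d does not fit in buffer\n", msglen);
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  MPI_Recv(msg, msglen, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, intercomm, MPI_STATUS_IGNORE);
  msg[msglen] = '\0';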
  
  printf("\nMessage:\n");
  printf("  %s\n", msg);
  printf("\n");
  
  MPI_Finalize();

  return 0;
}
frizwi (author) commented Sep 21, 2021

I just tried this again, and it seems to be fixed in v4.1.x, but it still errors out or hangs in v4.0.x and v3.1.4. I'll go ahead and update my application to the latest version, so I'm happy for this to be closed now.

Sorry, I should have done a better job of testing across versions before opening the issue!
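Since the hang went away in v4.1.x for the author, an application could warn at startup when it is running against an older Open MPI. A minimal sketch, assuming the string returned by MPI_Get_library_version() begins with "Open MPI vX.Y" (an assumption about Open MPI's output, not something the MPI standard guarantees). The helper names parse_ompi_version() and ompi_at_least() are hypothetical, and 4.1 as the cutoff simply reflects the versions tested in this issue.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parse the leading "Open MPI vX.Y" from a library version string.
 * Assumption: Open MPI's version string starts this way. */
bool parse_ompi_version(const char *s, int *major, int *minor)
{
    const char *prefix = "Open MPI v";
    size_t n = strlen(prefix);
    if (strncmp(s, prefix, n) != 0)
        return false;                     /* not an Open MPI string */
    return sscanf(s + n, "%d.%d", major, minor) == 2;
}

/* True if s names Open MPI at least req_major.req_minor. */
bool ompi_at_least(const char *s, int req_major, int req_minor)
{
    int major, minor;
    if (!parse_ompi_version(s, &major, &minor))
        return false;
    return major > req_major || (major == req_major && minor >= req_minor);
}
```

In the application, after MPI_Init() one would fill a buffer of size MPI_MAX_LIBRARY_VERSION_STRING with MPI_Get_library_version(buf, &len) and print a warning if ompi_at_least(buf, 4, 1) is false. Strings from other MPI implementations also return false, so the warning errs on the side of firing.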
