treematch hang #1183

Closed
ggouaillardet opened this issue Dec 7, 2015 · 7 comments

@ggouaillardet
Contributor

@bosilca the distgraph_test_4 test from the ibm test suite might hang, depending on the machine topology.
With 4 MPI tasks it works fine on a server with 2 sockets / 12 cores / 24 threads, but it hangs on my VM with 1 socket / 4 cores / 4 threads.

The inlined topology below can be used to reproduce the issue:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001">
    <object type="Package" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001">
      <object type="PU" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
      <object type="PU" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
      <object type="PU" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
      <object type="PU" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000001" complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
    </object>
  </object>
</topology>

Run as is:

$ mpirun -np 4 ./distgraph_test_4
pass!!

Run with the simple topology file:

$ mpirun --mca hwloc_base_topo_file smp.xml -np 4 ./distgraph_test_4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 
nb_constraints = 4, N= 4; nb_processing units = 4
========== Centralized Reordering ========= 

HANG

But this works fine without the treematch module:

$ mpirun --mca hwloc_base_topo_file smp.xml --mca topo ^treematch -np 4 ./distgraph_test_4
pass!!

Note that I had to push a5440ad so the topology file is also used by the treematch module (otherwise hwloc uses the topology file, but treematch uses the topology of the node).
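
For context on that distinction: hwloc can either discover the machine it is running on or load a topology from an XML file such as the one above. Here is a minimal standalone sketch of the XML path (illustrative only; it assumes the hwloc development headers are installed and the file is saved as smp.xml):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;

    hwloc_topology_init(&topo);
    /* read the topology from the XML file instead of discovering the local machine */
    if (hwloc_topology_set_xml(topo, "smp.xml") < 0) {
        fprintf(stderr, "failed to read smp.xml\n");
        return 1;
    }
    hwloc_topology_load(topo);

    printf("PUs described by smp.xml: %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

    hwloc_topology_destroy(topo);
    return 0;
}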

Can you please have a look at this?

@bosilca
Member

bosilca commented Dec 9, 2015

@ggouaillardet a quick look seems to indicate that the test itself is broken. On my machine the test does not deadlock or segfault; instead it triggers a nice MPI abort due to a wrong argument to MPI_Dist_graph_create.

If I look at how this function is called, for cnt = 9 I have:

n = 4
sources[] = {0, 1, 2, 3}
degrees[] = {2, 2, 2, 3}
destinations[] = {1, 2, 0, 2, 0, 1, 0, 1, 10}

The last destination is clearly wrong (larger than the number of nodes), but it should be valid, as the total number of degrees is 11.
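
For reference, every entry of destinations must be a valid rank in comm_old and the array must hold exactly sum(degrees) entries. A call of the same shape with only legal ranks would look like the sketch below (illustrative only, not taken from the test; run it with at least 4 MPI tasks so that ranks 0-3 exist):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int n = 4;
    int sources[]      = {0, 1, 2, 3};
    int degrees[]      = {2, 2, 2, 3};
    /* 2+2+2+3 = 9 entries, each one a legal rank (the bogus 10 is replaced by 2) */
    int destinations[] = {1, 2, 0, 2, 0, 1, 0, 1, 2};
    int rank;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Dist_graph_create(MPI_COMM_WORLD, n, sources, degrees, destinations,
                          MPI_UNWEIGHTED, MPI_INFO_NULL, 1 /* reorder */,
                          &newcomm);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    if (0 == rank) printf("dist graph created and freed\n");
    return 0;
}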

@ggouaillardet
Contributor Author

@bosilca good catch!
When cnt is 9, destinations does not contain enough elements (!)
I will fix that and review the rest of the test.
(That is also why you get a nice error while I just get a hang.)

@ggouaillardet
Contributor Author

@bosilca I pushed open-mpi/ompi-tests@0c77d1ba7b65896a84afba0405e2477dcbaa2875 in order to fix the issue you identified.

Here is a main routine that can replace the one from distgraph_test_4.c in order to go straight to the hang:

/* the helper routines (const_graph, create_graph, print_result, p2p_test,
 * coll_test, dest_graph) and the Reorder / MAX_TEST symbols are defined in
 * distgraph_test_4.c */
int main(int argc, char *argv[]){

    int rank, num_procs, rc, root;
    MPI_Comm newcomm;
    int cnt;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    Reorder = 1;
    root = 0;

    const_graph(root);
    for(cnt=1; cnt <= MAX_TEST; cnt++){
        create_graph(cnt, root, &newcomm);
        if (MPI_SUCCESS != print_result(newcomm, cnt, root)){
            MPI_Abort(MPI_COMM_WORLD, -1);
        }
        if (MPI_SUCCESS != p2p_test(newcomm, cnt, root)){
            MPI_Abort(MPI_COMM_WORLD, -1);
        }
        if (MPI_SUCCESS != coll_test(newcomm, cnt, root)){
            MPI_Abort(MPI_COMM_WORLD, -1);
        }
        MPI_Comm_free(&newcomm);
    }
    dest_graph();

    MPI_Finalize();
    if(!rank){
        printf("pass!!\n");
        fflush(stdout);
    }
    return 0;
}

The hang occurs with 4 MPI tasks on a 4-core box, or on a 16-core box if --bind-to core is used.
(I guess you have more than 4 cores on your machine, so you did not see the hang.)

If you run with --bind-to core, then at topo_treematch_dist_graph_create.c:308 oversubscribing_objs is false,
so execution jumps to line 387.
topo->weighted is true on rank 0, so rank 0 calls coll_gather,
but topo->weighted is false on the other ranks, so they do not call coll_gather, hence the hang.

A possible fix is to call coll_gather no matter what (since topo->weighted might not have the same value on all ranks).
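
To see the failure mode in isolation, here is a minimal sketch (hypothetical, not the treematch code) of a collective guarded by a condition that differs across ranks; the only rank that enters the gather blocks forever, which is exactly the hang described above:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, value;
    int *gathered = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int weighted = (rank == 0);    /* plays the role of topo->weighted */
    value = rank;

    if (0 == rank)
        gathered = malloc(size * sizeof(int));

    if (weighted) {
        /* only rank 0 reaches the collective, so it waits for the others forever */
        MPI_Gather(&value, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    free(gathered);
    MPI_Finalize();
    return 0;
}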

Here is a patch for that (it includes some code factorization):

diff --git a/ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c b/ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c
index 6c31d1f..0b2b249 100644
--- a/ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c
+++ b/ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c
@@ -389,30 +389,28 @@ int mca_topo_treematch_dist_graph_create(mca_topo_base_module_t* topo_module,
             fprintf(stderr,"========== Centralized Reordering ========= \n");

             local_pattern = (double *)calloc(size*size,sizeof(double));
-            if( true == topo->weighted ) {
-                for(i = 0; i < topo->indegree ; i++)
-                    local_pattern[topo->in[i]] += topo->inw[i];
-                for(i = 0; i < topo->outdegree ; i++)
-                    local_pattern[topo->out[i]] += topo->outw[i];
-                if (OMPI_SUCCESS != (err = comm_old->c_coll.coll_gather(MPI_IN_PLACE, size, MPI_DOUBLE,
-                                                                        local_pattern, size, MPI_DOUBLE,
-                                                                        0, comm_old,
-                                                                        comm_old->c_coll.coll_gather_module)))
-                    return err;
-            }
         } else {
             local_pattern = (double *)calloc(size,sizeof(double));
-            if( true == topo->weighted ) {
-                for(i = 0; i < topo->indegree ; i++)
-                    local_pattern[topo->in[i]] += topo->inw[i];
-                for(i = 0; i < topo->outdegree ; i++)
-                    local_pattern[topo->out[i]] += topo->outw[i];
-                if (OMPI_SUCCESS != (err = comm_old->c_coll.coll_gather(local_pattern, size, MPI_DOUBLE,
-                                                                        NULL,0,0,
-                                                                        0, comm_old,
-                                                                        comm_old->c_coll.coll_gather_module)))
-                    return err;
-            }
+        }
+        if( true == topo->weighted ) {
+            for(i = 0; i < topo->indegree ; i++)
+                local_pattern[topo->in[i]] += topo->inw[i];
+            for(i = 0; i < topo->outdegree ; i++)
+                local_pattern[topo->out[i]] += topo->outw[i];
+        }
+        if(0 == rank) {
+            err = comm_old->c_coll.coll_gather(MPI_IN_PLACE, size, MPI_DOUBLE,
+                                               local_pattern, size, MPI_DOUBLE,
+                                               0, comm_old,
+                                               comm_old->c_coll.coll_gather_module);
+        } else {
+            err = comm_old->c_coll.coll_gather(local_pattern, size, MPI_DOUBLE,
+                                               NULL,0,0,
+                                               0, comm_old,
+                                               comm_old->c_coll.coll_gather_module);
+        }
+        if (OMPI_SUCCESS != err) {
+            return err;
         }

         if( rank == local_procs[0]) {

If the fix looks good to you, I guess a similar one should be applied around line 731 (Partially Distributed Reordering).

Note that when running with --bind-to core, the treematch module issues a few printfs
(there is no such output without --bind-to core).

@ggouaillardet
Contributor Author

@bosilca here is a trivial patch to silence the treematch output:

diff --git a/ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c b/ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c
index 6c31d1f..d97b017 100644
--- a/ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c
+++ b/ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c
@@ -386,7 +386,9 @@ int mca_topo_treematch_dist_graph_create(mca_topo_base_module_t* topo_module,
          */
         if(0 == rank) {

+#ifdef __DEBUG__
             fprintf(stderr,"========== Centralized Reordering ========= \n");
+#endif

             local_pattern = (double *)calloc(size*size,sizeof(double));
             if( true == topo->weighted ) {
diff --git a/ompi/mca/topo/treematch/treematch/tm_kpartitioning.c b/ompi/mca/topo/treematch/treematch/tm_kpartitioning.c
index 3aaed6a..017f3ed 100644
--- a/ompi/mca/topo/treematch/treematch/tm_kpartitioning.c
+++ b/ompi/mca/topo/treematch/treematch/tm_kpartitioning.c
@@ -426,8 +426,7 @@ tree_t *kpartition_build_tree_from_topology(tm_topology_t *topology,double **com
   verbose_level = get_verbose_level();

   if(verbose_level>=INFO)
-    printf("Number of constraints: %d\n", nb_constraints);
-  printf("Number of constraints: %d, N=%d\n", nb_constraints, N);
+    printf("Number of constraints: %d, N=%d\n", nb_constraints, N);

   nb_cores=nb_processing_units(topology);

diff --git a/ompi/mca/topo/treematch/treematch/tm_tree.c b/ompi/mca/topo/treematch/treematch/tm_tree.c
index 0f41958..3305e2d 100644
--- a/ompi/mca/topo/treematch/treematch/tm_tree.c
+++ b/ompi/mca/topo/treematch/treematch/tm_tree.c
@@ -1611,7 +1611,8 @@ tree_t * build_tree_from_topology(tm_topology_t *topology, double **com_mat, int

   nb_constraints = check_constraints (topology, &constraints);

-  printf("nb_constraints = %d, N= %d; nb_processing units = %d\n",nb_constraints, N, nb_processing_units(topology));
+  if(verbose_level>=INFO)
+    printf("nb_constraints = %d, N= %d; nb_processing units = %d\n",nb_constraints, N, nb_processing_units(topology));

   if(N>nb_constraints){
     if(verbose_level >= CRITICAL){

BTW, is there any reason why you use printf instead of the OMPI logging facility?
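
For illustration, most OMPI components route their diagnostics through the OPAL output facility so that messages honor the component verbosity settings; within the OMPI source tree the reordering banner could look roughly like the sketch below (the output stream id shown here is a hypothetical component-provided variable, not the actual treematch one):

#include "opal/util/output.h"

/* hypothetical output stream id registered by the component */
extern int mca_topo_treematch_output;

static void report_reordering(int nb_constraints, int N, int nb_units)
{
    opal_output_verbose(10, mca_topo_treematch_output,
                        "nb_constraints = %d, N = %d, nb_processing units = %d",
                        nb_constraints, N, nb_units);
}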

bosilca added a commit to bosilca/ompi that referenced this issue Dec 22, 2015
simply based on some local state. This is the second
part of the patch proposed for open-mpi#1183.
@bosilca
Member

bosilca commented Dec 22, 2015

The efforts to fix this issue unfold in https://github.com/bosilca/ompi/tree/topic/treematch

rhc54 modified the milestones: v2.1.0, v2.0.0 (Feb 24, 2016)
jsquyres added a commit to jsquyres/ompi that referenced this issue Aug 23, 2016
mtl/ofi: Change default provider selection behavior.
jsquyres modified the milestones: v3.0.0, v2.1.0 (Feb 22, 2017)
@hppritcha
Member

@bosilca will you be opening a PR for master to fix this in the near future (like this week)?

@bosilca
Member

bosilca commented Jul 25, 2017

Let me check the status of the branch and I'll get back to you.
