
issues with a very large derived data type #4829

Closed · rabauke opened this issue Feb 17, 2018 · 4 comments

@rabauke commented Feb 17, 2018

Using Open MPI 2.1.1 as distributed with Ubuntu 17.10 x86_64, I get the following error message when dealing with a very large derived data type:

Read 2147479552, expected 2147483748, errno = 38

Below is a small program that reproduces the issue. It builds a derived vector-like data type with possibly more than INT_MAX elements using MPI_Type_create_struct.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <limits.h>
#include "mpi.h"

// build a vector-like data type with more than INT_MAX elements
MPI_Datatype build(size_t count,  MPI_Datatype old_type) {
  MPI_Datatype new_type;
  if (count<=INT_MAX) {
    MPI_Type_contiguous((int)count, old_type, &new_type);
  } else {
    // split count = count1*INT_MAX + count0 so that each block length fits into an int
    const size_t modulus=INT_MAX;
    const size_t count1=count/modulus;
    const size_t count0=count-count1*modulus;
    MPI_Count lb, extent;
    MPI_Type_get_extent_x(old_type, &lb, &extent);
    MPI_Datatype type_modulus;
    MPI_Type_contiguous((int)modulus, old_type, &type_modulus);
    MPI_Type_commit(&type_modulus);
    int block_lengths[2]={ (int)count0, (int)count1 };
    MPI_Aint displacements[2]={ 0, (MPI_Aint)(count0*extent) };
    MPI_Datatype types[2]={ old_type, type_modulus };
    MPI_Type_create_struct(2, block_lengths, displacements, types, &new_type);
    MPI_Type_commit(&new_type);
    MPI_Type_free(&type_modulus);
  }
  return new_type;
}

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int C_size, C_rank;
  MPI_Comm_size(MPI_COMM_WORLD, &C_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &C_rank);
  if (C_size!=2)
    MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
  const size_t size=(1ull<<31) + 100;
  char *p=malloc(size);
  MPI_Datatype vector_type=build(size, MPI_CHAR);
  if (C_rank==0) {
    for (size_t i=0; i<size; ++i)
      p[i]=i;
    MPI_Send(p, 1, vector_type, 1, 0, MPI_COMM_WORLD);
  } else {
    MPI_Recv(p, 1, vector_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (size_t i=size-10; i<size; ++i)
      printf("%i\n", (int)p[i]);
  }
  MPI_Type_free(&vector_type);
  free(p);
  MPI_Finalize();
  return EXIT_SUCCESS;
}
@bosilca (Member) commented Feb 17, 2018

Despite the error, the data is correctly transferred (i.e., the printed result looks correct). The warning is generated in the vader BTL, in the put and get functions, where for very large datatypes the return of process_vm_writev (or process_vm_readv) is truncated down to a ssize_t.

@ggouaillardet (Contributor) commented
@bosilca I double-checked that, and:

  • process_vm_readv returns 0x7ffff000 instead of 0x80000064, which is acceptable behavior (just as for read()) but is not handled correctly by btl/vader (a standalone probe of this is sketched below)
  • after that, the message is sent/received in full length by the non-CMA vader functions (in this case, receiving only the last 0x1064 bytes would have been enough)

So even though this is only a warning and the data is correct, it is still suboptimal.

I will make a PR shortly to address that.
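
For reference, here is a minimal standalone probe of the process_vm_readv behavior described above (a sketch assuming Linux with glibc >= 2.15; it reads the calling process's own address space, which process_vm_readv permits):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/uio.h>

int main(void) {
  // same size as in the reproducer: 0x80000064 bytes, just over INT_MAX
  const size_t size = (1ull<<31) + 100;
  char *src = malloc(size);
  char *dst = malloc(size);
  if (!src || !dst)
    return EXIT_FAILURE;
  struct iovec local  = { .iov_base = dst, .iov_len = size };
  struct iovec remote = { .iov_base = src, .iov_len = size };
  // a single iovec element; per the man page this should transfer 0 or size bytes
  ssize_t rc = process_vm_readv(getpid(), &local, 1, &remote, 1, 0);
  if (rc < 0)
    perror("process_vm_readv");
  else
    printf("requested 0x%zx, got 0x%zx\n", size, (size_t)rc);  // observed: got 0x7ffff000
  free(src);
  free(dst);
  return EXIT_SUCCESS;
}
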

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Feb 19, 2018
Thanks Heiko Bauke for the bug report.

Refs. open-mpi#4829

Signed-off-by: Gilles Gouaillardet <[email protected]>
@bosilca (Member) commented Feb 19, 2018

@ggouaillardet I agree that these functions can return less than the total amount requested in the case of a partial transfer. However, according to the corresponding Linux man page, partial transfers apply strictly at the granularity of iovec elements; in other words, it is forbidden to perform a partial transfer that splits a single iovec element.

Your proposed fix is not correct. The real fix would acknowledge whatever was transmitted over CMA and continue with the remaining data by other means.
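
For illustration, a minimal sketch of that approach (hypothetical helper, not the actual btl/vader code): acknowledge whatever process_vm_readv managed to copy, advance past it, and leave any remainder to the non-CMA fallback path.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

// returns 0 if everything was copied via CMA, 1 if a remainder is left
// for the fallback path, -1 on a hard error
static int cma_get(pid_t pid, char *local, char *remote, size_t size) {
  while (size > 0) {
    struct iovec liov = { .iov_base = local,  .iov_len = size };
    struct iovec riov = { .iov_base = remote, .iov_len = size };
    ssize_t got = process_vm_readv(pid, &liov, 1, &riov, 1, 0);
    if (got < 0)
      return -1;            // hard error: caller falls back entirely
    if (got == 0)
      break;                // no progress: leave the rest to the fallback
    local  += got;          // acknowledge the partial transfer ...
    remote += got;
    size   -= got;          // ... and retry with the remaining bytes
  }
  return size == 0 ? 0 : 1;
}
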

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Mar 2, 2018
Important note:

According to the man page
"On success, process_vm_readv() returns the number of bytes read and
process_vm_writev() returns the number of bytes written.  This return
value may be less than the total number of requested bytes, if a
partial read/write occurred.  (Partial transfers apply at the
granularity of iovec elements.  These system calls won't perform a
partial transfer that splits a single iovec element.)"

So, since we use a single iovec element, the returned size should either
be 0 or size, and the do loop should not be needed here.
We tried on various Linux kernels with size > 2 GB and, surprisingly,
the returned value is always 0x7ffff000 (fwiw, it happens to be the size
of the largest number of pages that fits in a signed 32-bit integer).
We do not know whether this is a bug in the kernel, the libc or even
the man page, but for the time being, we act as if process_vm_readv() could
return any value.

Thanks Heiko Bauke for the bug report.

Refs. open-mpi#4829

Signed-off-by: Gilles Gouaillardet <[email protected]>
ggouaillardet added further commits to ggouaillardet/ompi that referenced this issue Mar 2, 2018, all carrying the same commit message as above (cherry picked from commit open-mpi/ompi@9fedf28).
@awlauria (Contributor) commented
The test now passes for me with no issues on master:

$ ./exports/bin/mpirun --mca btl self,sm --np 2 ./big
90
91
92
93
94
95
96
97
98
99

It looks like all PRs have been merged as well. Closing.
