Teuchos: Teuchos::send/receive not handling a large message #5082

Closed
seheracer opened this issue May 3, 2019 · 7 comments
Labels
pkg: Teuchos (Issues primarily dealing with the Teuchos Package), type: bug (The primary issue is a bug in Trilinos code or tests)

Comments


seheracer commented May 3, 2019

Bug Report

@trilinos/teuchos

Description

An error occurs when communicating a very long message (400 million long long values) using Teuchos::send and Teuchos::receive. When Teuchos::send and Teuchos::receive are replaced with MPI_Send and MPI_Recv, the same message is communicated successfully, albeit with a warning.

Steps to Reproduce

The code to reproduce the bug (run with 2 MPI ranks):

#include <Teuchos_DefaultMpiComm.hpp>
#include <Teuchos_CommHelpers.hpp>

int main (int argc, char *argv[])
{
  typedef long long count_type;
  typedef long long packet_type;

  MPI_Init(&argc, &argv);
  Teuchos::MpiComm<count_type> comm (MPI_COMM_WORLD);

  count_type length = 400000000;
  if(comm.getRank() == 0) {

    packet_type val = -1;
    Teuchos::ArrayRCP<packet_type> array_to_send(length, val);
    Teuchos::send<count_type, packet_type>(comm, length, array_to_send.getRawPtr(), 1);

    //MPI_Send(array_to_send.getRawPtr(), length, MPI_LONG_LONG, 1, 0, MPI_COMM_WORLD);

  }
  else {

    Teuchos::ArrayRCP<packet_type> array_to_recv(length);
    Teuchos::receive<count_type, packet_type>(comm, 0, length, array_to_recv.getRawPtr());

    // MPI_Status status;
    // int result = MPI_Recv(array_to_recv.getRawPtr(), length, MPI_LONG_LONG, 0, 0, MPI_COMM_WORLD, &status);
    // if(result == MPI_SUCCESS)
    //   std::cout << "Successfully received!" << std::endl
    //             << "MPI_SOURCE: " << status.MPI_SOURCE << std::endl
    //             << "MPI_TAG: " << status.MPI_TAG << std::endl
    //             << "MPI_ERROR: " << status.MPI_ERROR << std::endl
    //             << "_cancelled: " << status._cancelled << std::endl
    //             << "_ucount: " << status._ucount << std::endl;
  }

  MPI_Finalize();
  return 0;
}

The output:

[blake:192730] *** An error occurred in MPI_Send
[blake:192730] *** reported by process [1952841729,0]
[blake:192730] *** on communicator MPI_COMM_WORLD
[blake:192730] *** MPI_ERR_COUNT: invalid count argument
[blake:192730] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[blake:192730] ***    and potentially your MPI job)

When the Teuchos::send/receive calls are replaced by MPI_Send/MPI_Recv (see the lines commented out in the code above), the output is:

[blake:192800] Read 2147479552, expected 3200000000, errno = 2
Successfully received!
MPI_SOURCE: 0
MPI_TAG: 0
MPI_ERROR: 0
_cancelled: 0
_ucount: 3200000000

Notes

mpicc: icc (ICC) 18.0.1 20171018
mpirun: mpirun (Open MPI) 2.1.2

An issue on the warning when MPI_Send/Recv is used: open-mpi/ompi#4829.

seheracer added the type: bug and pkg: Teuchos labels on May 3, 2019

mhoemmen commented May 3, 2019

  1. What version of Trilinos is this?
  2. How was the CMake option Teuchos_ENABLE_LONG_LONG_INT set?

(1) is relevant because #1183 was closed a few weeks ago. Before that, it was possible to disable Teuchos' support for long long. If that CMake option was OFF, then Teuchos::send etc. would treat long long as just a sequence of bytes (using MPI_CHAR as the MPI_Datatype). This would limit the maximum buffer length for Teuchos::send to about 2^{28} long long values (2^{31} bytes divided by 8 bytes per value), so a buffer length of 400 million would overflow.

@seheracer , if you are able to reconfigure and rebuild Trilinos, could you try setting Teuchos_ENABLE_LONG_LONG_INT=ON? If that fixes the problem, then I would consider this issue resolved, especially since #1183 has been fixed.

Note that MPI's interface itself takes message sizes as (32-bit) integers, so you still will not be able to use MPI_Send for messages longer than 2^{31}-1 items. This is an MPI issue, not a Teuchos issue. You can see this in MPICH's documentation for MPI_Send, for example. The classic work-around is to create a contiguous custom MPI_Datatype sufficiently long to reduce the count below the 32-bit limit.
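
For reference, here is a minimal sketch of that work-around in raw MPI (illustration only, not existing Teuchos or Trilinos code; it assumes the total element count is an exact multiple of the chunk size):

#include <mpi.h>

// Sketch: pack CHUNK long long values into one contiguous derived datatype,
// so the count handed to MPI_Send stays far below 2^{31}-1.
void sendLargeBuffer (const long long* buf, long long numElements,
                      int destRank, MPI_Comm comm)
{
  const long long CHUNK = 1 << 20;  // 2^{20} long long values per derived-type element
  MPI_Datatype chunkType;
  MPI_Type_contiguous (static_cast<int> (CHUNK), MPI_LONG_LONG, &chunkType);
  MPI_Type_commit (&chunkType);

  // The count is now numElements / CHUNK, which fits comfortably in an int
  // (assuming numElements is a multiple of CHUNK).
  MPI_Send (buf, static_cast<int> (numElements / CHUNK), chunkType, destRank, 0, comm);
  MPI_Type_free (&chunkType);
}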

@trilinos/tpetra might want to be reminded of this.


seheracer commented May 4, 2019

  1. It is Trilinos Release 12.12.
  2. I already have Teuchos_ENABLE_LONG_LONG_INT:BOOL=ON set. The full configure and build script:
cmake \
-D CMAKE_BUILD_TYPE:STRING=RELEASE \
-D CMAKE_INSTALL_PREFIX:FILEPATH="/ascldap/users/sacer/bug/Trilinos/BUILD" \
-D TPL_ENABLE_MPI:BOOL=ON \
-D CMAKE_C_FLAGS:STRING="-m64 -g" \
-D CMAKE_CXX_FLAGS:STRING="-m64 -g" \
-D CMAKE_VERBOSE_MAKEFILE:BOOL=OFF \
\
-D Trilinos_ENABLE_ALL_PACKAGES:BOOL=OFF \
-D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
-D Trilinos_ENABLE_EXAMPLES:BOOL=OFF \
-D Trilinos_ENABLE_TESTS:BOOL=OFF \
-D Trilinos_VERBOSE_CONFIGURE:BOOL=OFF \
\
-D Trilinos_ENABLE_TeuchosCore:BOOL=ON \
-D Trilinos_ENABLE_TeuchosComm:BOOL=ON \
-D Teuchos_ENABLE_LONG_LONG_INT:BOOL=ON \
-D Teuchos_ENABLE_COMPLEX:BOOL=OFF \
\
-D BLAS_LIBRARY_DIRS:PATH=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64 \
-D BLAS_LIBRARY_NAMES:STRING="mkl_intel_lp64; mkl_core; mkl_sequential" \
-D LAPACK_LIBRARY_DIRS:PATH=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64 \
-D LAPACK_LIBRARY_NAMES:STRING="mkl_intel_lp64; mkl_core; mkl_sequential" \
\
.. |& tee OUTPUT.CMAKE

make -j 8 |& tee OUTPUT.MAKE
make install |& tee OUTPUT.INSTALL

I am dealing with really large data, so I may try splitting messages with more than 2^{31}-1 items into smaller messages. Thanks for the warning @mhoemmen.

I think Teuchos has this issue only when the total size of the message is more than 2^{31}-1 bytes. Maybe I should split any message longer than 2^{31}-1 bytes into smaller ones to bypass this issue? Or I could just use MPI's functions directly and deal with changing the types myself if I need more data types. Any recommendation is welcome.

@bartlettroscoe
Member

@seheracer,

I think that using long long in Teuchos::MpiComm<long long> is currently completely useless, since the underlying MPI always just uses int. But I don't understand how raw MPI can work with messages longer than 2^{31}-1 bytes and give you warnings. How can it detect the int overflow and produce a warning?

To address this in Teuchos, we could take advantage of the templated ordinal type in Teuchos::MpiComm<Ordinal>, such as Teuchos::MpiComm<long long>, and have it automatically break up sends/receives into chunks that fit into buffers that MPI with 'int' can handle. If the templated Ordinal type were long long, we could detect when the user passed in a size larger than int can hold and then do something sensible: break the send/receive into chunks, or, for operations where we did not want to implement chunking, issue an error saying that the buffer is too large for MPI's int counts. That would be better than undefined behavior, and it would finally give some justification for the templated Ordinal type in Teuchos::Comm<Ordinal>.
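
A rough sketch of what the detection part might look like (hypothetical code, nothing that exists in Teuchos today; the helper name is made up):

#include <limits>
#include <stdexcept>

// Hypothetical helper: if Ordinal is wider than int (e.g. long long), reject
// counts that would silently overflow MPI's int-based interface instead of
// letting them produce undefined behavior.
template<typename Ordinal>
void checkCountFitsInMpiInt (const Ordinal count)
{
  if (count > static_cast<Ordinal> (std::numeric_limits<int>::max ())) {
    // Here we could instead break the message into int-sized chunks.
    throw std::runtime_error ("Message count is too large for MPI's int-based interface.");
  }
}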

@mhoemmen, what do you think about that idea?


mhoemmen commented May 4, 2019

@bartlettroscoe I think I get what's going on.

  1. @seheracer assumes that Comm's Ordinal template parameter affects MPI's behavior, but unfortunately that is not the case. MPI always uses int for its counts.

  2. I wrote a lot of specializations of MPI send, etc. for specific data types, including long long. However, those only work if Comm's template argument is int. If Comm's template argument is long long, those specializations don't kick in, and Teuchos just converts everything to bytes and uses MPI_CHAR.

The correct work-around, for now, is to use Comm<int> and its subclasses, always.

@bartlettroscoe wrote:

we could take advantage of the templated ordinal type in Teuchos::MpiComm<Ordinal> such as Teuchos::MpiComm<long long> and have it automatically break up sends/receives into chunks that fit into buffers that MPI with 'int' can handle.

We could do that perfectly well with Ordinal=int too. The problem is that the receiving process wouldn't know how many messages to expect. We could handle this by having the receiver check the resulting MPI_Status for the message size, and post another receive if it's not enough. I don't have time to work on that right now, but it shouldn't be too hard to do.
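
To make the idea concrete, here is a minimal sketch of that chunking with raw MPI (illustration only, not existing Teuchos code); it assumes, as in the reproducer above, that both ranks already know the total element count:

#include <mpi.h>
#include <algorithm>

const long long MAX_CHUNK = 1LL << 30;  // comfortably below 2^{31}-1

// Sender side: loop over int-sized chunks.
void sendInChunks (const long long* buf, long long total, int dest, MPI_Comm comm)
{
  for (long long sent = 0; sent < total; ) {
    const int n = static_cast<int> (std::min (MAX_CHUNK, total - sent));
    MPI_Send (buf + sent, n, MPI_LONG_LONG, dest, 0, comm);
    sent += n;
  }
}

// Receiver side: post matching receives and use MPI_Status / MPI_Get_count
// to see how much actually arrived.
void recvInChunks (long long* buf, long long total, int source, MPI_Comm comm)
{
  for (long long recvd = 0; recvd < total; ) {
    const int n = static_cast<int> (std::min (MAX_CHUNK, total - recvd));
    MPI_Status status;
    MPI_Recv (buf + recvd, n, MPI_LONG_LONG, source, 0, comm, &status);
    int got = 0;
    MPI_Get_count (&status, MPI_LONG_LONG, &got);
    recvd += got;
  }
}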


mhoemmen commented May 4, 2019

@seheracer wrote:

I think Teuchos has this issue only when the total size of the message is more than 2^{31}-1 bytes.

That's correct. That's really a problem with MPI's interface. The Ordinal template parameter of Comm, MpiComm, etc. doesn't help with that.

Maybe I should split any message longer than 2^{31}-1 bytes into smaller ones to bypass this issue? Or I could just use MPI's functions directly and deal with changing the types myself if I need more data types. Any recommendation is welcome.

Either approach should work. More importantly, use Comm<int>, MpiComm<int>, etc., so that the specializations kick in and let you use larger messages for standard data types like long long (but only if the count is less than 2^{31}).
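
For concreteness, a sketch of the reproducer above rewritten that way (a sketch only, not verified here; 400 million long long elements keeps the count below 2^{31}-1):

#include <Teuchos_DefaultMpiComm.hpp>
#include <Teuchos_CommHelpers.hpp>

int main (int argc, char *argv[])
{
  MPI_Init (&argc, &argv);
  {
    // Template the communicator on int so the long long specializations kick in.
    Teuchos::MpiComm<int> comm (MPI_COMM_WORLD);
    const int length = 400000000;  // must stay below 2^{31}-1
    Teuchos::ArrayRCP<long long> buf (length, -1);
    if (comm.getRank () == 0) {
      Teuchos::send<int, long long> (comm, length, buf.getRawPtr (), 1);
    }
    else {
      Teuchos::receive<int, long long> (comm, 0, length, buf.getRawPtr ());
    }
  }  // release the buffer and communicator before MPI_Finalize
  MPI_Finalize ();
  return 0;
}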

@seheracer
Contributor Author

@mhoemmen and @bartlettroscoe,

Thanks for the comments and the warnings. From now on, I will always use int-templated Teuchos::MpiComm.

For the very large messages, I guess I will have to split them into multiple sub-messages regardless of whether I use MPI or Teuchos.


mhoemmen commented May 5, 2019

@seheracer wrote:

For the very large messages, I guess I will have to split them into multiple sub-messages regardless of whether I use MPI or Teuchos.

That is correct. I think newer versions of the MPI standard (> 3) may fix this. You may also try libraries like Jeff Hammond's BigMPI: https://github.com/jeffhammond/BigMPI
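
For reference, MPI 4.0 later standardized large-count variants such as MPI_Send_c, which takes an MPI_Count; a sketch assuming an MPI implementation that provides them (the Open MPI 2.1.2 used above does not):

#include <mpi.h>

// Sketch assuming an MPI 4.0 implementation: the "_c" variants take MPI_Count,
// so the whole 400-million-element message can be sent in one call.
void sendWithLargeCount (const long long* buf, MPI_Count numElements,
                         int destRank, MPI_Comm comm)
{
  MPI_Send_c (buf, numElements, MPI_LONG_LONG, destRank, 0, comm);
}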
