Skip to content
This repository was archived by the owner on Mar 21, 2024. It is now read-only.

thrust::count() total malfunction for arrays of 2^31 and more elements #989

Closed
alexey-medvedev-fastdata-io opened this issue May 26, 2019 · 5 comments
Labels

Comments

@alexey-medvedev-fastdata-io
Copy link

alexey-medvedev-fastdata-io commented May 26, 2019

For big arrays it appears that OffsetT typedef'ed type which is widely used in thrust algorithms, is not capable of addressing large arrays. It is hardcoded as "typedef int OffsetT;", but int is normally a 32-bit type.

The following code snippet illustrates the problem the way it appears on thrust::count(), but the same can probably be seen for some other algorithms.

`//
// This file illustrates failure use case for thrust::count() if the size of array is close to 2^31
//
// nvcc -g -lineinfo -o thrust_big thrust_big.cu
//
// sudo /usr/local/cuda/bin/cuda-memcheck $PWD/thrust_big 2137502016
//
// Result:
//
// Error messages about illegal memory reads in LoadDirectStriped() from file block_load.cuh
// called from thrust::cuda_cub::cub::DeviceReduceKernel<>
//
// Changing the OffsetT to long all around thrust library fixes this.
//

#include
#include <assert.h>
#include <unistd.h>
#include <stdlib.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/count.h>

#define CUDA_CALL(X) { cudaError_t err = X; if (err != cudaSuccess) { throw err; } }

size_t fill(char *src, size_t size)
{
assert(size > 2);
size_t cnt = 0;
for (size_t i = 0; i < size - 2; i++) {
// unsigned x = (rand() % 16);
unsigned x = 0;
src[i] = (x == 0 ? '\n' : 'A' + (char)x - (char)1);
if (x == 0) cnt++;
}
src[size-2] = '\n';
src[size-1] = '\0';
return cnt + 1;
}

int main(int argc, char **argv)
{
try {
assert(argc > 1);
char *dev_src = NULL;
size_t size = (size_t)std::stoull(argv[1]);
char *src = (char *)malloc(size);
size_t N = fill(src, size);
CUDA_CALL(cudaMalloc(&dev_src, size));
CUDA_CALL(cudaMemcpy(dev_src, src, size, cudaMemcpyHostToDevice));
size_t num_rows = (size_t)thrust::count(thrust::device, dev_src, dev_src + size, '\n');
assert(num_rows == N);
}
catch(cudaError_t &err) {
std::cerr << "CUDA ERROR: " << cudaGetErrorString(err) << std::endl;
return 1;
}
catch(std::exception &ex) {
std::cerr << "std::exception: " << ex.what() << std::endl;
return 1;
}
catch (...) {
std::cout << "UNKNOWN EXCEPTION" << std::endl;
return 1;
}
return 0;
}
`

Are there any plans to tune or fix this?

@griwes
Copy link
Collaborator

griwes commented May 26, 2019

Yeah. This is one of the longstanding known bugs; I'm in the process of addressing and verifying the fixes for as many of those as I can find. Since it's sometimes hard to write synthetic test cases that detect this without ridiculous amounts of memory necessary, we'll be grateful for a list of algorithms in which you've encountered a problem with input sizes like this.

@alexey-medvedev-fastdata-io
Copy link
Author

alexey-medvedev-fastdata-io commented May 27, 2019

Michał,

As far as can guess (or suspect), the problem resides (at least) at any procedure which defines typedef int OffsetT;. Searching through the code it gives 40 occurrences in thrust-1.9.5 inside these functions:

cub::DeviceRunLengthEncode::Encode()
cub::DeviceRunLengthEncode::NonTrivialRuns()
cub::DeviceReduce::Reduce()
cub::DeviceReduce::Sum()
cub::DeviceReduce::Min()
cub::DeviceReduce::ArgMin()
cub::DeviceReduce::Max()
cub::DeviceReduce::ArgMax()
cub::DeviceReduce::ReduceByKey()
cub::DeviceSegmentedReduce::Reduce()
cub::DeviceSegmentedReduce::Sum()
cub::DeviceSegmentedReduce::Min()
cub::DeviceSegmentedReduce::ArgMin()
cub::DeviceSegmentedReduce::Max()
cub::DeviceSegmentedReduce::ArgMax()
cub::DeviceScan::ExclusiveSum()
cub::DeviceScan::ExclusiveScan()
cub::DeviceScan::InclusiveSum()
cub::DeviceScan::InclusiveScan()
cub::DevicePartition::Flagged()
cub::DeviceSelect::If()
cub::DeviceSegmentedRadixSort::SortPairs()
cub::DeviceSegmentedRadixSort::SortPairsDescending()
cub::DeviceSegmentedRadixSort::SortKeys()
cub::DeviceSegmentedRadixSort::SortKeysDescending()
cub::DeviceSelect::Flagged()
cub::DeviceSelect::If()
cub::DeviceSelect::Unique()

The OffsetT usage seems similar (at first sight) in all these cases so I suspect they all are crashing the same way.

I am not sure there was no other problems of the kind in code, but typedef int OffsetT; is probably the first candidate for a critical review.

Moreover, there are also typedefs:

system/cuda/detail/copy_if.h:708: typedef int size_type;
system/cuda/detail/reduce_by_key.h:982: typedef int size_type;
system/cuda/detail/util.h:703: typedef int difference_type;
system/cuda/detail/unique_by_key.h:731: typedef int size_type;
system/cuda/detail/unique.h:642: typedef int size_type;
system/cuda/detail/reduce.h:67: typedef int GridSizeType;

which seem also suspicious, but I'm not sure yet that they cause problems.

I can drop in an idea to create a special edition of all unit tests with all the arrays extended up to 2^32, and try to run a "smoke test" on this modification. Not sure it makes sense here, but it is just an idea.

Please keep us informed on the progress with this issue.

Regards,
FASTDATA.IO team

@alexey-medvedev-fastdata-io
Copy link
Author

Are there any updates on this issue?

Regards,
Alexey Medvedev
FASTDATA.IO team

@alliepiper
Copy link
Collaborator

This is related to NVIDIA/cccl#744.

@miscco
Copy link
Collaborator

miscco commented Feb 24, 2023

I will close this as a duplicate of NVIDIA/cccl#744. We are aware and have recently started working on expanding algorithms to larger ranges

@miscco miscco closed this as completed Feb 24, 2023
@github-project-automation github-project-automation bot moved this to Done in CCCL Feb 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
Archived in project
Development

No branches or pull requests

5 participants