thrust parallel for kernel failed on num_items > uint32_max #967
This is a partial fix, yes, but there's slightly more nuance here; what is essentially the same bug is tracked internally as 2448170, and is, surprisingly, a separate bug from the CUB-based algorithms failing for sizes bigger than 2^32. I'm working on getting the test case from that internal report working; that should also close this.
Actually. This seems to indeed fix just plain
Actually actually, there was another bug in
This should now be fixed on master. Unlike the other issues I've just pushed a fix for, I'd prefer if you verified that this is indeed fixed on top of the GitHub version before we close it, due to the nature of both the bug and the fix ;)
Yep, I can do some quick checks on our end. @griwes BTW, could you point me to the PR where you fixed this bug?
No PRs, just the commit you can see above pushed directly. |
I think your test cases will fail at
I got an error like
Basically, I think this solution can only handle up to uint32_t_max * items_per_tile items; otherwise, an overflow will still happen during the static cast. We probably need to rewrite the kernel like
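To make the cast-ordering point concrete, here is a minimal standalone sketch (the helper names are hypothetical; this is not Thrust's actual agent_launcher code) showing how a 32-bit cast applied before the size computation truncates large counts, while computing in 64 bits first and narrowing last stays correct:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Buggy order: casting num_items to 32 bits first silently drops high bits,
// so (2^32 + 10) items become 10 items.
inline unsigned int cast_then_size(std::size_t num_items) {
  return static_cast<unsigned int>(num_items);
}

// Correct order: compute the tile count in 64 bits, then cast only the
// (now small enough) grid dimension.
inline unsigned int size_then_cast(std::size_t num_items,
                                   std::size_t items_per_tile) {
  std::size_t tiles = (num_items + items_per_tile - 1) / items_per_tile;
  assert(tiles <= UINT32_MAX);  // only after this check is narrowing safe
  return static_cast<unsigned int>(tiles);
}
```

With 512 items per tile (256 threads, 2 items each), 2^32 + 10 items correctly yield 8388609 tiles in the second version, while the first collapses the count to 10.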
You're right, we're hitting nonsense there, but that is not... entirely... Thrust's fault. Here are a few lines of logs you can make Thrust emit, where it logs the parameters of the kernel launches it does:
These values are almost sensible. Let me explain.

If we went to mag 41, the grid dimension x would indeed turn into 0, and that is indeed the fault of how the Thrust kernels are written. However: 2147483648 is 2^31, which is a correctly computed dimension. It's smaller than 2^32-1, the maximum value of a 32-bit unsigned integer, so it doesn't overflow the first kernel launch parameter. However, according to the CUDA programming guide, "Maximum x-dimension of a grid of thread blocks: 2^31-1", so a situation similar to this will also happen for problems of mag slightly less than 40. Oh well. There isn't much we can do about that. We could probably split the kernel launch into two when we detect a situation like this; we will probably do that anyway due to a feature that we are planning to add to Thrust over the next year or two, so trying to figure out how to do that right now doesn't seem fully productive.

However (and this is the last one, I promise!): as you can see in the log, the tuning policy for for_each currently, for some ancient reason, says that we launch blocks of 256 threads and execute the function object for two elements on each thread. This seems wasteful; I'll try increasing that value, though no promises on when that lands. (Not significantly, though: when I bumped it to 32 at first,

Hopefully you don't often have to run kernels of those sizes? ;)
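The launch-splitting idea mentioned above could be sketched on the host roughly like this (all names are hypothetical; this is a sketch of the approach, not Thrust's actual logic): compute the tile count in 64 bits and, whenever it would exceed the documented 2^31-1 grid.x limit, issue several launches with per-launch item offsets:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One chunk of an oversized parallel-for, kept within CUDA's grid.x limit.
struct Launch {
  std::size_t first_item;  // offset the kernel would add to its indices
  unsigned int grid_x;     // number of blocks in this launch
};

std::vector<Launch> split_launches(std::size_t num_items,
                                   std::size_t items_per_tile) {
  const std::size_t kMaxGridX = (1ull << 31) - 1;  // CUDA grid.x limit
  std::vector<Launch> launches;
  std::size_t done = 0;
  while (done < num_items) {
    std::size_t remaining = num_items - done;
    std::size_t tiles = (remaining + items_per_tile - 1) / items_per_tile;
    std::size_t grid = tiles < kMaxGridX ? tiles : kMaxGridX;
    launches.push_back({done, static_cast<unsigned int>(grid)});
    // The last launch may overshoot; the kernel bounds-checks num_items.
    done += grid * items_per_tile;
  }
  return launches;
}
```

A mag-41 problem with 512 items per tile needs 2^32 tiles, which this would split into three launches (two of 2^31-1 blocks and one small remainder).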
Unfortunately, we often run kernels of those sizes. BTW, I think that originally, in CUDA 8.0, the for_each kernel looked like
and this could handle arbitrarily large inputs (at least uint64),
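That older kernel's robustness came from the grid-stride-loop pattern: each thread walks the range in strides of the total thread count, with all indexing done in 64 bits, so no single launch dimension has to encode the problem size. Below is a host-side C++ simulation of the pattern (a sketch under assumed names, not the actual device kernel):

```cpp
#include <cstddef>

// Simulates a grid-stride loop: "thread" tid out of total_threads touches
// items tid, tid + stride, tid + 2*stride, ... Because the indices are
// std::size_t, num_items may exceed UINT32_MAX with a modest thread count.
template <typename F>
void grid_stride_for_each(std::size_t num_items, std::size_t total_threads,
                          F f) {
  for (std::size_t tid = 0; tid < total_threads; ++tid) {  // "each thread"
    for (std::size_t i = tid; i < num_items; i += total_threads) {
      f(i);  // every item in [0, num_items) is visited exactly once
    }
  }
}
```

On a GPU the outer loop is the grid itself (`tid = blockIdx.x * blockDim.x + threadIdx.x`, stride `gridDim.x * blockDim.x`), which is why the grid size can stay fixed regardless of the input size.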
I think the issue here is our occupancy logic; we probably don't account for the upper limits on the number of threads/thread blocks when we decide how many threads to use and how many elements to process per thread. We may be able to solve this without splitting the kernel launch; instead, we just handle more items per thread.
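This suggestion (handle more items per thread rather than split the launch) could be sketched as: grow items_per_thread until the resulting tile count fits under the 2^31-1 grid.x limit. This is a hypothetical sizing helper, not Thrust's actual occupancy logic:

```cpp
#include <cstddef>
#include <cstdint>

// Pick the smallest items_per_thread that keeps the grid within CUDA's
// 2^31-1 grid.x limit for the given problem size and block size.
unsigned int pick_items_per_thread(std::size_t num_items,
                                   unsigned int block_threads) {
  const std::size_t kMaxGridX = (1ull << 31) - 1;
  unsigned int items_per_thread = 1;
  for (;;) {
    std::size_t items_per_tile =
        static_cast<std::size_t>(block_threads) * items_per_thread;
    std::size_t tiles = (num_items + items_per_tile - 1) / items_per_tile;
    if (tiles <= kMaxGridX) return items_per_thread;
    ++items_per_thread;  // grid too big: do more work per thread instead
  }
}
```

For example, a mag-41 problem with 256-thread blocks needs 5 items per thread before the grid fits, whereas small problems stay at 1.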
totally agree on this |
@griwes Any updates on this? Will it be fixed in the next major CUDA version?
These are exactly the kind of transformations an executor could introduce.
I have no updates on this since the last comment I left here, sorry. We've had other things override our priorities on doing the much-needed rework of some parts of Thrust. (Btw: the "just" in Bryce's last comment would be a rather large change, effectively moving from statically sizing the threads to dynamically sizing them, and I doubt that "just" doing that just for
Thanks for the heads up. I understand there might be a lot of other high-priority items to fix, but for me such a bug in the for_each algorithm should probably be a top priority as well, since almost every parallel algorithm in Thrust uses for_each: transform, filter, scatter, etc. If it has a major bug, it will affect a lot of algorithms.
Hi all, I recently encountered a similar error with `size_t size = 2150602529;`:

```cpp
auto key_iter = dh::MakeTransformIterator<size_t>(  // same as thrust::make_transform_iterator
    thrust::make_counting_iterator<size_t>(0ul),
    [=] __device__(size_t idx) {
      assert(idx < size);
      return idx;
    });
auto value_iter = dh::MakeTransformIterator<size_t>(
    thrust::make_counting_iterator<size_t>(0ul),
    [=] __device__(size_t idx) -> size_t {
      return idx;
    });
auto key_value_index_iter = thrust::make_zip_iterator(
    thrust::make_tuple(thrust::make_counting_iterator<size_t>(0ul), key_iter, value_iter));
auto end_it = key_value_index_iter + size;
thrust::inclusive_scan(thrust::device, key_value_index_iter,
                       end_it, thrust::make_discard_iterator(),
                       [] __device__(auto a, auto b) { return b; });
```

Any update on this?
@griwes Any updates on this? Is it already fixed in the latest CUDA major release (CUDA 12.x)?
@elstehle can you look into this?
I will have a look and report back shortly.
@lucafuji, just to confirm: you are asking whether we support problem sizes larger than
Closing as duplicate of NVIDIA/cccl#744 |
All Thrust for-each-family algorithms that use the parallel-for agent cannot handle num_items >= uint32_max.
The problem is the static cast done in AgentLauncher:
https://github.com/thrust/thrust/blob/master/thrust/system/cuda/detail/core/agent_launcher.h#L411
Thrust does the static cast in an incorrect place
Instead, it should be