This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

thrust parallel for kernel failed on num_items > uint32_max #967

Closed
lucafuji opened this issue Apr 1, 2019 · 21 comments
Labels: nvbug (Has an associated internal NVIDIA NVBug.), thrust

lucafuji commented Apr 1, 2019

None of the Thrust for-each family of algorithms that use the parallel_for agent can handle num_items >= uint32_max.

The problem is the static cast done in AgentLauncher:
https://github.com/thrust/thrust/blob/master/thrust/system/cuda/detail/core/agent_launcher.h#L411

Thrust does the static cast in the wrong place:

static_cast<unsigned int>(count + plan.items_per_tile - 1)/plan.items_per_tile

Instead, it should be:

static_cast<unsigned int>((count + plan.items_per_tile - 1)/plan.items_per_tile)
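
For illustration, here is a minimal standalone sketch (not Thrust's actual AgentLauncher code; the count and tile size are made up) showing why the placement of the cast matters once count exceeds uint32_max:

#include <cstdint>
#include <cstdio>

int main()
{
  std::uint64_t count          = (std::uint64_t(1) << 32) + 1000; // more items than uint32_max
  std::uint64_t items_per_tile = 512;                             // hypothetical tile size

  // Cast applied to the dividend: count is truncated to 32 bits before the
  // division, so the computed tile count is wildly wrong.
  unsigned int wrong = static_cast<unsigned int>(count + items_per_tile - 1) / items_per_tile;

  // Cast applied to the quotient: the division happens in 64 bits and only the
  // much smaller tile count is narrowed, which still fits in 32 bits here.
  unsigned int right = static_cast<unsigned int>((count + items_per_tile - 1) / items_per_tile);

  std::printf("wrong = %u, right = %u\n", wrong, right); // prints: wrong = 2, right = 8388611
  return 0;
}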
griwes (Collaborator) commented Apr 10, 2019

This is a partial fix, yes, but there's slightly more nuance to it; what is essentially the same bug is tracked internally as 2448170, and is, surprisingly, a separate bug from the CUB-based algorithms failing for sizes bigger than 2^32. I'm working on getting the test case from that internal report working; that should also close this.

griwes self-assigned this Apr 10, 2019
griwes added the nvbug label Apr 10, 2019
griwes (Collaborator) commented Apr 10, 2019

Actually, this seems to indeed fix just plain for_each; that bug invokes more algorithms. I'll still fix them together ;)

griwes (Collaborator) commented Apr 10, 2019

Actually actually, there was another bug in parallel_for itself. Expect this to be fixed on master soon-ish.

griwes added a commit that referenced this issue Apr 15, 2019
griwes (Collaborator) commented Apr 15, 2019

This should now be fixed on master. Unlike the other issues I've just pushed a fix for, I'd prefer if you verified that this is indeed fixed on top of the GitHub version before we close it, due to the nature of both the bug and the fix ;)

lucafuji (Author) commented

Yep, I can do some quick checks on our end. BTW @griwes, could you point me to the PR where you fixed this bug?

griwes (Collaborator) commented Apr 24, 2019

No PRs, just the commit you can see above pushed directly.

lucafuji (Author) commented

I think your test cases will fail at

 TestForEachWithBigIndexesHelper(40)

I got an error like

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorInvalidConfiguration: invalid configuration argument
Aborted (core dumped)

Basically, I think this solution can only handle up to uint32_max * items_per_tile items; otherwise, an overflow will still happen during the static cast.

We probably need to rewrite the kernel to something like:

Size grid_size = grid.size() * grid.this_exec.size();

Size i = grid.this_exec.index() * grid.this_exec.size() + grid.this_exec.this_exec.index();

first += i;

while(i < n)
{
  f(*first);
  i += grid_size;
  first += grid_size;
}
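
For comparison, here is a minimal standalone grid-stride kernel in plain CUDA (the snippet above uses Thrust-internal grid/this_exec abstractions; the name big_for_each and the launch shape are made up for illustration). With a 64-bit index and stride, the grid size can be capped well below the hardware limit and the loop covers the remaining items:

#include <cstdint>

template <typename T, typename F>
__global__ void big_for_each(T* first, std::uint64_t n, F f)
{
  // 64-bit global index and stride, so neither wraps for n >= 2^32.
  std::uint64_t i      = std::uint64_t(blockIdx.x) * blockDim.x + threadIdx.x;
  std::uint64_t stride = std::uint64_t(gridDim.x) * blockDim.x;

  for (; i < n; i += stride)
  {
    f(first[i]);
  }
}

// Example launch: cap the grid at, say, 2^16 blocks; the loop covers the rest.
// big_for_each<<<65536, 256>>>(ptr, n, functor);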

griwes (Collaborator) commented May 8, 2019

You're right that we're hitting nonsense there, but that is not... entirely... Thrust's fault. Here are a few lines of logs you can make Thrust emit, where it logs the parameters of the kernel launches it does:

Invoking transform::agent<<<2097152, 256, 0, 1>>>(), 1073741824 items total, 2 items per thread, 8 SM occupancy, 0 vshmem size, 700 ptx_version
Invoking transform::agent<<<4194304, 256, 0, 1>>>(), 2147483648 items total, 2 items per thread, 8 SM occupancy, 0 vshmem size, 700 ptx_version
Invoking transform::agent<<<8388608, 256, 0, 1>>>(), 4294967296 items total, 2 items per thread, 8 SM occupancy, 0 vshmem size, 700 ptx_version
Invoking transform::agent<<<16777216, 256, 0, 1>>>(), 8589934592 items total, 2 items per thread, 8 SM occupancy, 0 vshmem size, 700 ptx_version
Invoking transform::agent<<<2147483648, 256, 0, 1>>>(), 1099511627776 items total, 2 items per thread, 8 SM occupancy, 0 vshmem size, 700 ptx_version

These values are almost sensible. Let me explain.

If we went to mag 41, the grid dimension x would indeed turn into 0, and that is indeed the fault of how the Thrust kernels are written. However.

2147483648 is 2^31 (2^40 items divided by 512 items per block), which is a correctly computed dimension, and it even fits into the launch parameter: the first kernel launch parameter is a dim3, which is essentially a uint3, which means its members (and this value initializes the first of them) are unsigned integers, and since they are 32-bit mostly everywhere, 2^31 is still representable there. So for mag 40 we are not actually overflowing, not even where you think we are.

However.

According to the CUDA programming guide, "Maximum x-dimension of a grid of thread blocks: 2^31-1", so a grid of 2^31 blocks is rejected with cudaErrorInvalidConfiguration, and a situation similar to this will also happen for problems of mag only slightly less than 40. Oh well.

There isn't much we can do about that. We could probably split the kernel launch into two when we detect a situation like this; we will probably do that anyway due to a feature that we are planning to add to Thrust over the next year or two, so trying to figure out how to do that right now doesn't seem fully productive.

However (and this is the last one, I promise!): as you can see in the log, the tuning policy for for_each currently, for some ancient reason, says that we launch blocks of 256 threads and execute the function object for two elements on each thread. This seems wasteful; I'll try increasing that value, though no promises on when that lands. (Not significantly, though: when I bumped it to 32 at first, ptxas took almost 10 minutes to finish on our for_each test file, vs. something like 15 seconds currently. I'll experiment with our benchmarks and try to find a value that makes more sense, but keep in mind that bumping to 8, which bumps the ptxas runtime to ~45-ish seconds, only allows mag 41 kernels, not bigger ones, so given how the compilation times rise due to the unrolling we do in the kernel, I don't think we can bump it significantly past that.)

Hopefully you don't often have to run kernels of those sizes? ;)
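
For illustration, a host-side sketch of the launch-splitting idea mentioned above might look like the following; launch_agent, block_threads, and the parameter names are hypothetical stand-ins, not Thrust's actual code path:

#include <algorithm>
#include <cstdint>

void launch_in_chunks(std::uint64_t num_items, std::uint64_t items_per_tile)
{
  const std::uint64_t max_grid_x  = (std::uint64_t(1) << 31) - 1; // CUDA max grid x-dimension
  const std::uint64_t max_per_run = max_grid_x * items_per_tile;  // items coverable per launch

  for (std::uint64_t offset = 0; offset < num_items; offset += max_per_run)
  {
    const std::uint64_t count = std::min(max_per_run, num_items - offset);
    const unsigned int  grid  =
        static_cast<unsigned int>((count + items_per_tile - 1) / items_per_tile);
    // launch_agent<<<grid, block_threads>>>(..., offset, count); // hypothetical launch
    (void)grid;
  }
}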

lucafuji (Author) commented

Unfortunately, we often run kernels of those sizes.
Of course, we could make our own kernel handle more items per thread on our side, but we originally expected the Thrust for_each kernel to take care of this for us...

BTW, I think that originally, in CUDA 8.0, the for_each kernel looked like this:

Size grid_size = grid.size() * grid.this_exec.size();

Size i = grid.this_exec.index() * grid.this_exec.size() + grid.this_exec.this_exec.index();

first += i;

while(i < n)
{
  f(*first);
  i += grid_size;
  first += grid_size;
}

and this could handle arbitrarily large inputs (at least up to uint64).
Starting with CUDA 9.0, the for_each kernel was rewritten in its current form, and now it cannot handle large inputs.

brycelelbach (Collaborator) commented May 11, 2019

I think the issue here is our occupancy logic; we probably don't account for the upper limits on # of threads/thread-blocks when we decide how many threads to use/how many elements per thread.

We may be able to solve this without splitting the kernel launch; instead we just handle more items per thread.
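
As a rough illustration of that idea (purely a sketch under stated assumptions, not Thrust's tuning-policy machinery; choose_shape and launch_shape are made-up names, and this only accounts for the grid-dimension limit), the items-per-thread count could be picked at runtime so the resulting grid stays under 2^31-1:

#include <cstdint>

struct launch_shape
{
  unsigned int  grid_size;
  std::uint64_t items_per_thread;
};

launch_shape choose_shape(std::uint64_t num_items, unsigned int block_threads)
{
  const std::uint64_t max_grid_x = (std::uint64_t(1) << 31) - 1; // CUDA max grid x-dimension

  // Smallest items-per-thread such that ceil(num_items / (block_threads * ipt)) <= max_grid_x.
  std::uint64_t ipt = (num_items + max_grid_x * block_threads - 1) / (max_grid_x * block_threads);
  if (ipt == 0) { ipt = 1; }

  const std::uint64_t tile = std::uint64_t(block_threads) * ipt;
  const std::uint64_t grid = (num_items + tile - 1) / tile;
  return { static_cast<unsigned int>(grid), ipt };
}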

lucafuji (Author) commented

> I think the issue here is our occupancy logic; we probably don't account for the upper limits on # of threads/thread-blocks when we decide how many threads to use/how many elements per thread.
>
> We may be able to solve this without splitting the kernel launch; instead we just handle more items per thread.

Totally agree on this.

lucafuji (Author) commented

@griwes Any updates on this? Will it be fixed in the next major CUDA version?

jaredhoberock (Contributor) commented Jan 10, 2020

> We may be able to solve this without splitting the kernel launch; instead we just handle more items per thread.

These are exactly the kind of transformations an executor could introduce.

griwes (Collaborator) commented Jan 11, 2020

I have no updates since the last comment I left here, sorry. Other things have overridden our priorities on doing the much-needed rework of some parts of Thrust.

(Btw.: the "just" in Bryce's last comment would be a rather large change, effectively moving from statically sizing the threads to dynamically sizing them, and I doubt that "just" doing that just for for_each is a reasonable way forward. We need a grand(er) plan; I hope we can achieve this Soon ™️, even taking the recent shift in priorities into account.)

lucafuji (Author) commented

Thanks for the heads up. I understand there might be a lot of other high-priority items to fix. But to me, such a bug in the for_each algorithm should probably be a top priority as well, since almost every parallel algorithm in Thrust uses for_each: transform, filter, scatter, etc. If it has a major bug, it will affect a lot of algorithms.

brycelelbach pushed a commit that referenced this issue May 16, 2020
trivialfis commented Sep 26, 2020

Hi all, I recently encountered a similar error with thrust::inclusive_scan on CUDA 11.0:

  size_t size = 2150602529;
  auto key_iter = dh::MakeTransformIterator<size_t>(  // same with thrust make_transform_iterator
      thrust::make_counting_iterator<size_t>(0ul),
      [=] __device__(size_t idx) {
        assert(idx < size);
        return idx;
      });
  auto value_iter = dh::MakeTransformIterator<size_t>(
      thrust::make_counting_iterator<size_t>(0ul),
      [=] __device__(size_t idx) -> size_t {
        return idx;
      });
  auto key_value_index_iter = thrust::make_zip_iterator(
      thrust::make_tuple(thrust::make_counting_iterator<size_t>(0ul), key_iter, value_iter));
  auto end_it = key_value_index_iter + size;
  thrust::inclusive_scan(thrust::device, key_value_index_iter,
                         end_it, thrust::make_discard_iterator(),
                         [] __device__(auto a, auto b){ return b; });

Any update on this?

lucafuji (Author) commented

@griwes Any updates on this? Is it already fixed in the latest CUDA major release (CUDA 12.x)?

jrhemstad assigned elstehle and unassigned griwes Feb 12, 2023
jrhemstad (Collaborator) commented

@elstehle can you look into this?

elstehle (Collaborator) commented

I will have a look and report back shortly.

elstehle (Collaborator) commented Feb 15, 2023

> @griwes Any updates on this? Is it already fixed in the latest CUDA major release (CUDA 12.x)?

@lucafuji, just to confirm, you are asking whether we are supporting problem sizes larger than (2^31-1) * 512 or ~10^12 items (aka MAX_GRID_DIM_X * ITEMS_PER_BLOCK)? Can you also help me understand the use case a bit better, if you are able to share?

jrhemstad (Collaborator) commented

Closing as duplicate of NVIDIA/cccl#744

jrhemstad closed this as not planned (duplicate) Feb 24, 2023
github-project-automation bot moved this to Done in CCCL Feb 24, 2023