This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

thrust parallel for kernel failed on num_items > uint32_max #967

Closed
lucafuji opened this issue Apr 1, 2019 · 21 comments
Labels: nvbug (Has an associated internal NVIDIA NVBug.), thrust

lucafuji commented Apr 1, 2019

None of the Thrust for-each family of algorithms that use the parallel_for agent can handle num_items >= uint32_max.

The problem is the static cast done in AgentLauncher:
https://github.com/thrust/thrust/blob/master/thrust/system/cuda/detail/core/agent_launcher.h#L411

Thrust does the static cast in the wrong place:

static_cast<unsigned int>(count + plan.items_per_tile - 1)/plan.items_per_tile

Instead, it should be:

static_cast<unsigned int>((count + plan.items_per_tile - 1)/plan.items_per_tile)
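
For illustration, here is a minimal standalone sketch (not Thrust's actual AgentLauncher code; the count and tile size are made up) showing why the placement of the cast matters once count exceeds uint32_max:

#include <cstdint>
#include <cstdio>

int main()
{
  std::uint64_t count          = (std::uint64_t(1) << 32) + 1000; // more items than uint32_max
  std::uint64_t items_per_tile = 512;                             // hypothetical tile size

  // Cast applied to the dividend: count is truncated to 32 bits before the
  // division, so the computed tile count is wildly wrong.
  unsigned int wrong = static_cast<unsigned int>(count + items_per_tile - 1) / items_per_tile;

  // Cast applied to the quotient: the division happens in 64 bits and only the
  // much smaller tile count is narrowed, which still fits in 32 bits here.
  unsigned int right = static_cast<unsigned int>((count + items_per_tile - 1) / items_per_tile);

  std::printf("wrong = %u, right = %u\n", wrong, right); // prints: wrong = 2, right = 8388611
  return 0;
}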
griwes (Collaborator) commented Apr 10, 2019

This is a partial fix, yes, but there's slightly more nuance to it; what is essentially the same bug is tracked internally as 2448170, and is, surprisingly, a separate bug from the CUB-based algorithms failing for sizes bigger than 2^32. I'm working on getting the test case from that internal report working; that should also close this.

griwes self-assigned this Apr 10, 2019
griwes added the nvbug label Apr 10, 2019
griwes (Collaborator) commented Apr 10, 2019

Actually, this seems to indeed fix just plain for_each; that bug invokes more algorithms. I'll still fix them together ;)

griwes (Collaborator) commented Apr 10, 2019

Actually actually, there was another bug in parallel_for itself. Expect this to be fixed on master soon-ish.

griwes added a commit that referenced this issue Apr 15, 2019
griwes (Collaborator) commented Apr 15, 2019

This should now be fixed on master. Unlike the other issues I've just pushed a fix for, I'd prefer if you verified that this is indeed fixed on top of the GitHub version before we close it, due to the nature of both the bug and the fix ;)

lucafuji (Author) commented

Yep, I can do some quick checks on our end. BTW @griwes, could you point me to the PR where you fixed this bug?

griwes (Collaborator) commented Apr 24, 2019

No PRs, just the commit you can see above pushed directly.

lucafuji (Author) commented

I think your test cases will fail at

 TestForEachWithBigIndexesHelper(40)

I got an error like

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorInvalidConfiguration: invalid configuration argument
Aborted (core dumped)

Basically, I think this solution can only handle up to uint32_max * items_per_tile items; otherwise, an overflow will still happen during the static cast.

We probably need to rewrite the kernel to something like:

Size grid_size = grid.size() * grid.this_exec.size();

Size i = grid.this_exec.index() * grid.this_exec.size() + grid.this_exec.this_exec.index();

first += i;

while(i < n)
{
  f(*first);
  i += grid_size;
  first += grid_size;
}
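
For comparison, here is a minimal standalone grid-stride kernel in plain CUDA (the snippet above uses Thrust-internal grid/this_exec abstractions; the name big_for_each and the launch shape are made up for illustration). With a 64-bit index and stride, the grid size can be capped well below the hardware limit and the loop covers the remaining items:

#include <cstdint>

template <typename T, typename F>
__global__ void big_for_each(T* first, std::uint64_t n, F f)
{
  // 64-bit global index and stride, so neither wraps for n >= 2^32.
  std::uint64_t i      = std::uint64_t(blockIdx.x) * blockDim.x + threadIdx.x;
  std::uint64_t stride = std::uint64_t(gridDim.x) * blockDim.x;

  for (; i < n; i += stride)
  {
    f(first[i]);
  }
}

// Example launch: cap the grid at, say, 2^16 blocks; the loop covers the rest.
// big_for_each<<<65536, 256>>>(ptr, n, functor);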

griwes (Collaborator) commented May 8, 2019

You're right that we're hitting nonsense there, but that is not... entirely... Thrust's fault. Here are a few lines of logs you can make Thrust emit, where it logs the parameters of the kernel launches it does:

Invoking transform::agent<<<2097152, 256, 0, 1>>>(), 1073741824 items total, 2 items per thread, 8 SM occupancy, 0 vshmem size, 700 ptx_version
Invoking transform::agent<<<4194304, 256, 0, 1>>>(), 2147483648 items total, 2 items per thread, 8 SM occupancy, 0 vshmem size, 700 ptx_version
Invoking transform::agent<<<8388608, 256, 0, 1>>>(), 4294967296 items total, 2 items per thread, 8 SM occupancy, 0 vshmem size, 700 ptx_version
Invoking transform::agent<<<16777216, 256, 0, 1>>>(), 8589934592 items total, 2 items per thread, 8 SM occupancy, 0 vshmem size, 700 ptx_version
Invoking transform::agent<<<2147483648, 256, 0, 1>>>(), 1099511627776 items total, 2 items per thread, 8 SM occupancy, 0 vshmem size, 700 ptx_version

These values are almost sensible. Let me explain.

If we went to mag 41, the grid dimension x would indeed turn into 0, and that is indeed the fault of how the Thrust kernels are written. However.

2147483648 is 2^31 (2^40 items divided by 512 items per block), which is a correctly computed dimension, and it even fits into the launch parameter: the first kernel launch parameter is a dim3, which is essentially a uint3, which means its members (and this value initializes the first of them) are unsigned integers, and since they are 32-bit mostly everywhere, 2^31 is still representable there. So for mag 40 we are not actually overflowing, not even where you think we are.

However.

According to the CUDA programming guide, "Maximum x-dimension of a grid of thread blocks: 2^31-1", so a grid of 2^31 blocks is rejected with cudaErrorInvalidConfiguration, and a situation similar to this will also happen for problems of mag only slightly less than 40. Oh well.

There isn't much we can do about that. We could probably split the kernel launch into two when we detect a situation like this; we will probably do that anyway due to a feature that we are planning to add to Thrust over the next year or two, so trying to figure out how to do that right now doesn't seem fully productive.

However (and this is the last one, I promise!): as you can see in the log, the tuning policy for for_each currently, for some ancient reason, says that we launch blocks of 256 threads and execute the function object for two elements on each thread. This seems wasteful; I'll try increasing that value, though no promises on when that lands. (Not significantly, though: when I bumped it to 32 at first, ptxas took almost 10 minutes to finish on our for_each test file, vs. something like 15 seconds currently. I'll experiment with our benchmarks and try to find a value that makes more sense, but keep in mind that bumping to 8, which bumps the ptxas runtime to ~45-ish seconds, only allows mag 41 kernels, not bigger ones, so given how the compilation times rise due to the unrolling we do in the kernel, I don't think we can bump it significantly past that.)

Hopefully you don't often have to run kernels of those sizes? ;)
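
For illustration, a host-side sketch of the launch-splitting idea mentioned above might look like the following; launch_agent, block_threads, and the parameter names are hypothetical stand-ins, not Thrust's actual code path:

#include <algorithm>
#include <cstdint>

void launch_in_chunks(std::uint64_t num_items, std::uint64_t items_per_tile)
{
  const std::uint64_t max_grid_x  = (std::uint64_t(1) << 31) - 1; // CUDA max grid x-dimension
  const std::uint64_t max_per_run = max_grid_x * items_per_tile;  // items coverable per launch

  for (std::uint64_t offset = 0; offset < num_items; offset += max_per_run)
  {
    const std::uint64_t count = std::min(max_per_run, num_items - offset);
    const unsigned int  grid  =
        static_cast<unsigned int>((count + items_per_tile - 1) / items_per_tile);
    // launch_agent<<<grid, block_threads>>>(..., offset, count); // hypothetical launch
    (void)grid;
  }
}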

lucafuji (Author) commented

Unfortunately, we often run kernels of those sizes.
Of course, we could make our own kernel handle more items per thread on our side, but we originally expected the Thrust for_each kernel to take care of this for us...

BTW, I think that originally, in CUDA 8.0, the for_each kernel looked like this:

Size grid_size = grid.size() * grid.this_exec.size();

Size i = grid.this_exec.index() * grid.this_exec.size() + grid.this_exec.this_exec.index();

first += i;

while(i < n)
{
  f(*first);
  i += grid_size;
  first += grid_size;
}

and this could handle arbitrarily large inputs (at least up to uint64).
Starting with CUDA 9.0, the for_each kernel was rewritten in its current form, and now it cannot handle large inputs.

brycelelbach (Collaborator) commented May 11, 2019

I think the issue here is our occupancy logic; we probably don't account for the upper limits on # of threads/thread-blocks when we decide how many threads to use/how many elements per thread.

We may be able to solve this without splitting the kernel launch; instead we just handle more items per thread.
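
As a rough illustration of that idea (purely a sketch under stated assumptions, not Thrust's tuning-policy machinery; choose_shape and launch_shape are made-up names, and this only accounts for the grid-dimension limit), the items-per-thread count could be picked at runtime so the resulting grid stays under 2^31-1:

#include <cstdint>

struct launch_shape
{
  unsigned int  grid_size;
  std::uint64_t items_per_thread;
};

launch_shape choose_shape(std::uint64_t num_items, unsigned int block_threads)
{
  const std::uint64_t max_grid_x = (std::uint64_t(1) << 31) - 1; // CUDA max grid x-dimension

  // Smallest items-per-thread such that ceil(num_items / (block_threads * ipt)) <= max_grid_x.
  std::uint64_t ipt = (num_items + max_grid_x * block_threads - 1) / (max_grid_x * block_threads);
  if (ipt == 0) { ipt = 1; }

  const std::uint64_t tile = std::uint64_t(block_threads) * ipt;
  const std::uint64_t grid = (num_items + tile - 1) / tile;
  return { static_cast<unsigned int>(grid), ipt };
}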

lucafuji (Author) commented

> I think the issue here is our occupancy logic; we probably don't account for the upper limits on # of threads/thread-blocks when we decide how many threads to use/how many elements per thread.
>
> We may be able to solve this without splitting the kernel launch; instead we just handle more items per thread.

Totally agree on this.

lucafuji (Author) commented

@griwes Any updates on this? Will it be fixed in the next major CUDA version?

jaredhoberock (Contributor) commented Jan 10, 2020

> We may be able to solve this without splitting the kernel launch; instead we just handle more items per thread.

These are exactly the kind of transformations an executor could introduce.

griwes (Collaborator) commented Jan 11, 2020

I have no updates since the last comment I left here, sorry. Other things have overridden our priorities on doing the much-needed rework of some parts of Thrust.

(Btw.: the "just" in Bryce's last comment would be a rather large change, effectively moving from statically sizing the threads to dynamically sizing them, and I doubt that "just" doing that just for for_each is a reasonable way forward. We need a grand(er) plan; I hope we can achieve this Soon ™️, even taking the recent shift in priorities into account.)

lucafuji (Author) commented

Thanks for the heads up. I understand there might be a lot of other high-priority items to fix. But to me, such a bug in the for_each algorithm should probably be a top priority as well, since almost every parallel algorithm in Thrust uses for_each: transform, filter, scatter, etc. If it has a major bug, it will affect a lot of algorithms.

brycelelbach pushed a commit that referenced this issue May 16, 2020
trivialfis commented Sep 26, 2020

Hi all, I recently encountered a similar error with thrust::inclusive_scan on CUDA 11.0:

  size_t size = 2150602529;
  auto key_iter = dh::MakeTransformIterator<size_t>(  // same with thrust make_transform_iterator
      thrust::make_counting_iterator<size_t>(0ul),
      [=] __device__(size_t idx) {
        assert(idx < size);
        return idx;
      });
  auto value_iter = dh::MakeTransformIterator<size_t>(
      thrust::make_counting_iterator<size_t>(0ul),
      [=] __device__(size_t idx) -> size_t {
        return idx;
      });
  auto key_value_index_iter = thrust::make_zip_iterator(
      thrust::make_tuple(thrust::make_counting_iterator<size_t>(0ul), key_iter, value_iter));
  auto end_it = key_value_index_iter + size;
  thrust::inclusive_scan(thrust::device, key_value_index_iter,
                         end_it, thrust::make_discard_iterator(),
                         [] __device__(auto a, auto b){ return b; });

Any update on this?

lucafuji (Author) commented

@griwes Any updates on this? Is it already fixed in the latest CUDA major release (CUDA 12.x)?

jrhemstad assigned elstehle and unassigned griwes Feb 12, 2023
jrhemstad (Collaborator) commented

@elstehle can you look into this?

elstehle (Collaborator) commented

I will have a look and report back shortly.

elstehle (Collaborator) commented Feb 15, 2023

> @griwes Any updates on this? Is it already fixed in the latest CUDA major release (CUDA 12.x)?

@lucafuji, just to confirm, you are asking whether we are supporting problem sizes larger than (2^31-1) * 512 or ~10^12 items (aka MAX_GRID_DIM_X * ITEMS_PER_BLOCK)? Can you also help me understand the use case a bit better, if you are able to share?

jrhemstad (Collaborator) commented

Closing as duplicate of NVIDIA/cccl#744

jrhemstad closed this as not planned (duplicate) Feb 24, 2023
github-project-automation bot moved this to Done in CCCL Feb 24, 2023