
(cudaErrorInvalidDevice) when trying to perform a thrust::reduce #1371

Closed
ArEsKay3 opened this issue Jan 5, 2021 · 2 comments · Fixed by NVIDIA/cub#259
Labels
type: bug: functional Does not work as intended.

Comments


ArEsKay3 commented Jan 5, 2021

I'm seeing the error below (cudaErrorInvalidDevice) when trying to perform a thrust::reduce. I'm not sure where I should be looking, since I'm not generating any device ordinals myself; I'm just making the call in the snippet below.

This is within a Python C extension, and I do use PyTorch to set the device, but in this case I never set it to anything other than 0. I've checked that cudaGetDevice matches the pointer's device (and that all input pointers are on device 0).
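For reference, the check I mean is roughly the following (a sketch; the helper name is mine, not code from the extension):

#include <cuda_runtime.h>
#include <cstdio>

// Compare the current device against the device that owns a given
// device pointer (sketch of the sanity check described above).
void check_pointer_device(const void* ptr) {
  int current = -1;
  cudaGetDevice(&current);

  cudaPointerAttributes attr{};
  cudaPointerGetAttributes(&attr, ptr);

  // Both printed values are 0 in my setup.
  std::printf("current device = %d, pointer device = %d\n",
              current, attr.device);
}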

This worked with previous versions of cub/CUDA 10.

https://github.com/limbo018/DREAMPlace/blob/0035d8a8a40729d414c84d52464b459d46680db9/dreamplace/ops/global_swap/src/global_swap_cuda_kernel.cu#L1275

int compute_max_num_nodes_per_bin(const DetailedPlaceDB<T>& db) {
  int num_bins = db.num_bins_x * db.num_bins_y;
  // Per-bin node counts, allocated and zeroed on the device.
  int* node_count_map = nullptr;
  allocateCUDA(node_count_map, num_bins, int);
  checkCUDA(cudaMemset(node_count_map, 0, sizeof(int) * num_bins));
  // Count how many movable nodes fall into each bin.
  compute_num_nodes_in_bins<<<ceilDiv(db.num_movable_nodes, 256), 256>>>(
      db, node_count_map);
  checkCUDA(cudaDeviceSynchronize());
  // This is the call that fails with cudaErrorInvalidDevice.
  int max_num_nodes_per_bin =
      thrust::reduce(thrust::device, node_count_map, node_count_map + num_bins,
                     0, thrust::maximum<int>());
  checkCUDA(cudaDeviceSynchronize());
  destroyCUDA(node_count_map);
  return max_num_nodes_per_bin;
}

RuntimeError: after reduction step 1: cudaErrorInvalidDevice: invalid device ordinal

docker run -it gitlab-master.nvidia.com:5005/rkirby/dreamgym:thrustcubdebug /placement/debug_repro.sh

That should be enough to reproduce. I'm on driver 450.51.06 with CUDA 11.0 and a single V100. If you launch into a shell, the script has pointers to the source and instructions for rebuilding. Let me know if you need permission to pull from that registry.
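For context, the reduction in isolation looks like the following standalone program (a sketch, independent of the DREAMPlace build; it may well not reproduce the failure on its own):

#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  const int n = 1 << 20;

  // Device buffer standing in for node_count_map above.
  int* data = nullptr;
  cudaMalloc(&data, n * sizeof(int));
  cudaMemset(data, 0, n * sizeof(int));

  // Same call shape as in compute_max_num_nodes_per_bin.
  int max_val = thrust::reduce(thrust::device, data, data + n,
                               0, thrust::maximum<int>());
  cudaDeviceSynchronize();

  std::printf("max = %d\n", max_val);
  cudaFree(data);
  return 0;
}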


ArEsKay3 commented Jan 14, 2021

Traced the failure to this line. The device variable looks OK (it's zero), but the DeviceCount function was inlined, so I couldn't tell why it's returning what I assume is -1.

https://github.com/NVIDIA/cub/blob/c3be9a94273b5049520aacc7db00c738668aaa3f/cub/util_device.cuh#L299

However, when I remove the __forceinline__ from both DeviceCount and DeviceCountUncached, the failure goes away.

//CUB_RUNTIME_FUNCTION __forceinline__ int DeviceCountUncached()
CUB_RUNTIME_FUNCTION int DeviceCountUncached()
...
// CUB_RUNTIME_FUNCTION __forceinline__ int DeviceCount()
CUB_RUNTIME_FUNCTION int DeviceCount()
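Something like the following host-side comparison shows what I mean (a sketch; cub::DeviceCount comes from cub/util_device.cuh, and a standalone build like this may not reproduce the bad value):

#include <cub/util_device.cuh>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // What the runtime reports directly.
  int runtime_count = -1;
  cudaGetDeviceCount(&runtime_count);

  // What CUB's helper reports; this is the value I traced the failure to above.
  int cub_count = cub::DeviceCount();

  std::printf("cudaGetDeviceCount = %d, cub::DeviceCount = %d\n",
              runtime_count, cub_count);
  return 0;
}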

alliepiper (Collaborator) commented

Heh, wow. Thanks for tracking that down.

Would you be able to make a pull request to https://github.com/NVIDIA/cub that removes those annotations? These functions don't really need to be inlined.

Somewhat related to NVIDIA/cccl#754.

alliepiper added the "type: bug: functional (Does not work as intended.)" label Jan 21, 2021
alliepiper added a commit to alliepiper/cub that referenced this issue Jan 28, 2021
These functions started producing invalid results in CUDA 11 under
certain circumstances (see issue NVIDIA/thrust#1371), and removing
these hints fixes the issue.
alliepiper added a commit to alliepiper/cub that referenced this issue Feb 8, 2021
These functions started producing invalid results in CUDA 11 under
certain circumstances (see issue NVIDIA/thrust#1371), and removing
these hints fixes the issue.

NVIDIA#260 reported that other functions in this file were also
causing the same issue.

These methods are not perf critical -- they don't need to be inlined.