This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

DeviceSegmentedSort synchronizes default stream and produces wrong results when launched from a kernel #409

Closed
fkallen opened this issue Dec 2, 2021 · 3 comments · Fixed by #410

@fkallen
Contributor

fkallen commented Dec 2, 2021

  1. In host code, DeviceSegmentedSort copies the group sizes with a blocking cudaMemcpy, which runs on the default stream instead of the supplied stream.
  2. In device code, memcpy may execute before group_sizes and num_selected_groups have been calculated, so it can read stale values. This leads to wrong results.
    if (CUB_IS_HOST_CODE)
    {
      #if CUB_INCLUDE_HOST_CODE
      if (CubDebug(error = cudaMemcpy(h_group_sizes,
                                      group_sizes.get(),
                                      num_selected_groups * sizeof(unsigned int),
                                      cudaMemcpyDeviceToHost)))
      {
        return error;
      }
      #endif
    }
    else
    {
      #if CUB_INCLUDE_DEVICE_CODE
      memcpy(h_group_sizes,
             group_sizes.get(),
             num_selected_groups * sizeof(unsigned int));
      #endif
    }
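
To illustrate point 1, here is a minimal sketch of how the host-side copy could stay on the caller's stream. It is not the code from the fix: `copy_group_sizes_to_host` and its parameters are hypothetical names made up for this example, and only the CUDA runtime calls `cudaMemcpyAsync` and `cudaStreamSynchronize` are real API.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (not part of CUB): copies `count` group sizes from
// device memory to host memory, issuing work only on the caller's stream.
inline cudaError_t copy_group_sizes_to_host(unsigned int *h_group_sizes,
                                            const unsigned int *d_group_sizes,
                                            std::size_t count,
                                            cudaStream_t stream)
{
  // Enqueue the copy on `stream`, so it is ordered after any work that was
  // previously submitted to that same stream and wrote d_group_sizes.
  cudaError_t error = cudaMemcpyAsync(h_group_sizes,
                                      d_group_sizes,
                                      count * sizeof(unsigned int),
                                      cudaMemcpyDeviceToHost,
                                      stream);
  if (error != cudaSuccess)
  {
    return error;
  }

  // Wait for this stream to finish so the host can safely read
  // h_group_sizes after the function returns.
  return cudaStreamSynchronize(stream);
}
```

The same ordering concern applies to point 2: the device-side memcpy is only safe once the work that produces group_sizes and num_selected_groups has completed.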
@gevtushenko
Collaborator

Thank you for reporting this! I'll create a PR soon.

@gevtushenko
Collaborator

@fkallen could you check if the fix works for you?

@fkallen
Contributor Author

fkallen commented Dec 3, 2021

Your fix does work for me.
