This repository was archived by the owner on Mar 21, 2024. It is now read-only.
[NV 200643040] launch_bounds settings for kernel_agent kernels needs to be tuned #1287
Labels
area: performance
Does not perform as intended.
nvbug
Has an associated internal NVIDIA NVBug.
P0: must have
Absolutely necessary. Critical issue, major blocker, etc.
The _kernel_agent __global__ functions have launch bounds set in thrust/system/cuda/detail/core/agent_launcher.h. The launch bounds specify only a maximum of 128 threads per block, which gives the compiler only partial information. During some of the Thrust tests (such as inclusive_scan_inplace_int8_t_8bit_64mib), this kernel is launched with a grid of "(87382 1 1)" blocks. To get good occupancy for this launch, the number of registers used per thread needs to be constrained by also providing the minBlocksPerMultiprocessor setting in the launch bounds. I found that a value of 12 gets ptxas to allocate 40 registers for the relevant kernel and gives good perf.
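As a rough illustration of why a minBlocksPerMultiprocessor of 12 would lead to 40 registers, the per-thread register budget can be worked out at compile time. This is only a sketch: the 65536-register file per SM and the allocation granularity of 8 registers per thread are assumptions that hold on many but not all GPUs, and `reg_budget` is a hypothetical helper, not part of Thrust or CUDA.

```cpp
// Assumption: the SM has 65536 32-bit registers (architecture dependent).
constexpr int kRegsPerSM = 65536;
// Assumption: per-thread register counts are rounded to multiples of 8.
constexpr int kRegGranularity = 8;

// Hypothetical helper: upper bound on registers per thread if `min_blocks`
// blocks of `max_threads` threads each must be resident on one SM at once.
// This mirrors the constraint expressed by
//   __launch_bounds__(max_threads, min_blocks)
// on a __global__ function.
constexpr int reg_budget(int max_threads, int min_blocks) {
    return kRegsPerSM / (max_threads * min_blocks)
           / kRegGranularity * kRegGranularity;
}

// 65536 / (128 * 12) = 42 (integer division), rounded down to 40.
static_assert(reg_budget(128, 12) == 40,
              "128 threads x 12 blocks leaves 40 registers per thread");
```

Under these assumptions the result matches the 40 registers reported above. The actual register count is still decided by ptxas and may differ on other architectures.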
Kernel name: ZN6thrust8cuda_cub4core13_kernel_agentINS0_6__scan9ScanAgentINS_6detail15normal_iteratorINS_10device_ptrIaEEEES9_NS_4plusIaEEiaNS5_17integral_constantIbLb1EEEEES9_S9_SB_iN3cub13ScanTileStateIaLb1EEENS3_9DoNothingIaEEEEvT0_T1_T2_T3_T4_T5
Note that this templated function is called in other contexts with 256 maxThreadsPerBlock, where a different minBlocksPerMultiprocessor would be needed.
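One possible way to handle the different block sizes is to derive the minimum block count from the block size instead of hard-coding it. The sketch below is an assumption-laden illustration, not the Thrust implementation: `min_blocks_for` and `launch_bounds_hint` are hypothetical names, the 65536-register file per SM is architecture dependent, and the 40-register target is simply the value observed for the 128-thread case.

```cpp
// Assumption: the SM has 65536 32-bit registers (architecture dependent).
constexpr int kRegsPerSM = 65536;

// Hypothetical helper: largest number of resident blocks per SM for which
// each thread can still be given `target_regs` registers.
constexpr int min_blocks_for(int block_threads, int target_regs) {
    return kRegsPerSM / (block_threads * target_regs);
}

// Hypothetical trait keyed on the block size; the value could be passed as
// the second argument of __launch_bounds__(BlockThreads, min_blocks).
template <int BlockThreads>
struct launch_bounds_hint {
    // Assumption: keep the same 40-register target for every block size.
    static constexpr int min_blocks = min_blocks_for(BlockThreads, 40);
};

static_assert(launch_bounds_hint<128>::min_blocks == 12,
              "128-thread blocks: 12 blocks per SM");
static_assert(launch_bounds_hint<256>::min_blocks == 6,
              "256-thread blocks: 6 blocks per SM");
```

The value 6 for 256-thread blocks is only what this arithmetic suggests. It has not been tuned or measured, and the per-SM limit on resident threads may impose a lower bound on some GPUs.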
There is a ~15% perf delta due to this setting.