This repository was archived by the owner on Mar 21, 2024. It is now read-only.

[NV 200643040] launch_bounds settings for kernel_agent kernels needs to be tuned #1287

Closed
harinvidia opened this issue Sep 23, 2020 · 2 comments
Labels
area: performance Does not perform as intended. nvbug Has an associated internal NVIDIA NVBug. P0: must have Absolutely necessary. Critical issue, major blocker, etc.

Comments

@harinvidia

The _kernel_agent __global__ functions have launch bounds set in thrust/system/cuda/detail/core/agent_launcher.h. The launch bounds specify only a maximum of 128 threads per block, which gives the compiler only partial information. During some of the thrust tests (like inclusive_scan_inplace_int8_t_8bit_64mib), this kernel is launched with a grid of "(87382 1 1)" blocks. To get good occupancy for this launch, the number of registers used per thread needs to be constrained by providing the minBlocksPerMultiprocessor setting in the launch bounds. I found that a value of 12 gets ptxas to allocate 40 registers for the relevant kernel and gives good perf.

Kernel name: _ZN6thrust8cuda_cub4core13_kernel_agentINS0_6__scan9ScanAgentINS_6detail15normal_iteratorINS_10device_ptrIaEEEES9_NS_4plusIaEEiaNS5_17integral_constantIbLb1EEEEES9_S9_SB_iN3cub13ScanTileStateIaLb1EEENS3_9DoNothingIaEEEEvT0_T1_T2_T3_T4_T5_
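For readers who want to sanity-check the 40-register figure, here is a rough back-of-the-envelope sketch of how a `__launch_bounds__(128, 12)` setting could translate into a per-thread register cap. It is not taken from Thrust or from ptxas. The helper name `register_cap_per_thread`, the 65536-register SM register file, and the rounding down to a multiple of 8 are all assumptions made for illustration, and the model ignores other hardware limits such as the per-thread register maximum.

```cpp
// Rough model of the register cap implied by
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor).
// ASSUMPTIONS (not from the issue): 65536 32-bit registers per SM, and
// per-thread register counts rounded down to a multiple of 8.
constexpr int kRegistersPerSM = 65536;   // assumed SM register file size
constexpr int kRegisterGranularity = 8;  // assumed allocation granularity

// Hypothetical helper: registers each thread may use so that
// min_blocks_per_sm blocks of max_threads_per_block threads fit on one SM.
constexpr int register_cap_per_thread(int max_threads_per_block,
                                      int min_blocks_per_sm) {
  return (kRegistersPerSM / (max_threads_per_block * min_blocks_per_sm))
         / kRegisterGranularity * kRegisterGranularity;
}

// 128 threads x 12 resident blocks = 1536 threads sharing the register file:
// 65536 / 1536 = 42 (integer division), rounded down to 40.
static_assert(register_cap_per_thread(128, 12) == 40,
              "model reproduces the 40 registers reported in the issue");
```

Under these assumptions the model lands on the same 40 registers that ptxas reportedly chose, which is consistent with the setting acting as a register cap rather than a thread-count change.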

Note that this templated function is also used in other contexts with a maxThreadsPerBlock of 256, where a different minBlocksPerMultiprocessor value would be needed.
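To illustrate why the value would have to depend on the block size, the sketch below inverts the same kind of estimate: given a block size and a target register count per thread, it computes how many blocks would fit on one SM. The helper name `min_blocks_for_register_target` and the 65536-register SM register file are assumptions for illustration only, and any real tuning value would need to be confirmed by measurement.

```cpp
// Inverse estimate: how many blocks per SM correspond to a target
// per-thread register budget at a given block size.
// ASSUMPTION (not from the issue): 65536 32-bit registers per SM.
constexpr int kAssumedRegistersPerSM = 65536;

// Hypothetical helper: largest minBlocksPerMultiprocessor for which every
// thread can still receive target_registers_per_thread registers.
constexpr int min_blocks_for_register_target(int max_threads_per_block,
                                             int target_registers_per_thread) {
  return kAssumedRegistersPerSM /
         (max_threads_per_block * target_registers_per_thread);
}

// 128-thread blocks at 40 registers per thread: 65536 / 5120 = 12.
static_assert(min_blocks_for_register_target(128, 40) == 12,
              "matches the value reported for the 128-thread case");

// 256-thread blocks at the same budget: 65536 / 10240 = 6.
static_assert(min_blocks_for_register_target(256, 40) == 6,
              "a 256-thread instantiation would need a smaller value");
```

If the same 40-register budget were the goal, this estimate suggests a value around 6 for the 256-thread instantiations, so a single hard-coded 12 would not carry over.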

There is a ~15% perf delta due to this setting.

@brycelelbach brycelelbach self-assigned this Sep 23, 2020
@brycelelbach brycelelbach added this to the 1.11.0 milestone Sep 23, 2020
@brycelelbach
Collaborator

@allisonvacanti we should look at this with priority. Good chance to try out the perf comparison script. Let me know if you need help.

@brycelelbach brycelelbach changed the title from "launch_bounds settings for kernel_agent kernels needs to be tuned." to "launch_bounds settings for kernel_agent kernels needs to be tuned" Sep 23, 2020
@alliepiper alliepiper self-assigned this Sep 23, 2020
@alliepiper alliepiper added the area: performance Does not perform as intended. label Sep 23, 2020
@alliepiper alliepiper added the P0: must have Absolutely necessary. Critical issue, major blocker, etc. label Oct 7, 2020
@alliepiper alliepiper changed the title from "launch_bounds settings for kernel_agent kernels needs to be tuned" to "[NV 200643040] launch_bounds settings for kernel_agent kernels needs to be tuned" Oct 7, 2020
@alliepiper alliepiper added the nvbug Has an associated internal NVIDIA NVBug. label Oct 7, 2020
@alliepiper
Collaborator

The kernel in question will be removed by #1304. Since we'll be moving to a scan implementation with much better perf (see #1301), this old kernel is not worth fixing.

@alliepiper alliepiper removed this from the 1.11.0 milestone Oct 13, 2020