This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

cuda::pipeline does not use L1 bypass #135

Closed
gonzalobg opened this issue Mar 10, 2021 · 0 comments · Fixed by #220
Labels
helps: quda Helps or needed by QUDA.

gonzalobg commented Mar 10, 2021

I encountered this regression when porting some of the RTM kernels to use libcu++:

Example (https://cuda.godbolt.org/z/bfMq9Y):

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>
#include <cuda/barrier>
#include <cuda_pipeline.h>
#include <cuda_awbarrier.h>

using T = float4;

__global__ void test_pipe_intr(T* input, T volatile* out) {
    __shared__ T smem[32];
    __pipeline_memcpy_async(smem + threadIdx.x, input + threadIdx.x, sizeof(T));
    __pipeline_commit();
    __pipeline_wait_prior(0);
}

__global__ void test_pipe_cxx(T* input, T volatile* out) {
    __shared__ T smem[32];
    cuda::pipeline<cuda::thread_scope_thread> p = cuda::make_pipeline();
    auto g = cooperative_groups::this_thread();
    p.producer_acquire();
    cuda::memcpy_async(g, smem + threadIdx.x, input + threadIdx.x,
                       cuda::aligned_size_t<16>(sizeof(T)), p);
    p.producer_commit();
    cuda::pipeline_consumer_wait_prior<0>(p);
}
```

The intrinsics implementation generates LDGSTS.E.BYPASS as expected:

```
test_pipe_intr(float4*, float4 volatile*):
 MOV R1, c[0x0][0x28]
 S2R R2, SR_TID.X
 MOV R3, 0x10
 ULDC.64 UR4, c[0x0][0x118]
 SHF.L.U32 R5, R2, 0x4, RZ
 IMAD.WIDE.U32 R2, R2, R3, c[0x0][0x160]
 LDGSTS.E.BYPASS.128 [R5], [R2.64]
 LDGDEPBAR
 DEPBAR.LE SB0, 0x0
 EXIT
.L_6:
 BRA `(.L_6)
.L_46:
```

but the libcu++ implementation generates an LDGSTS.E (aka LDGSTS.E.ACCESS) instead, along with an extra branch and a BSYNC B0:

```
test_pipe_cxx(float4*, float4 volatile*):
 IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28]
 S2R R0, SR_EQMASK
 BSSY B0, `(.L_3)
 S2R R3, SR_LTMASK
 LOP3.LUT R0, R3, R0, RZ, 0xc0, !PT
 POPC R0, R0
 IMAD.WIDE.U32 R2, R0, 0x10, RZ
 ISETP.GT.U32.AND P0, PT, R2, 0xf, PT
 ISETP.GT.U32.AND.EX P0, PT, R3, RZ, PT, P0
 @P0 BRA `(.L_4)
 S2R R5, SR_TID.X
 IADD3 R4, P0, -R2, -0x10, RZ
 ULDC.64 UR4, c[0x0][0x118]
 IADD3.X R2, ~R3, -0x1, RZ, P0, !PT
 ISETP.NE.U32.AND P0, PT, R4, RZ, PT
 ISETP.NE.AND.EX P0, PT, R2, RZ, PT, P0
 IADD3 R3, P1, R0.reuse, R5, RZ
 IMAD.IADD R0, R0, 0x1, R5
 IMAD.X R4, RZ, RZ, RZ, P1
 LEA R2, P1, R3, c[0x0][0x160], 0x4
 IMAD.SHL.U32 R5, R0, 0x10, RZ
 LEA.HI.X R3, R3, c[0x0][0x164], R4, 0x4, P1
 @!P0 LDGSTS.E.128 [R5], [R2.64]
 @!P0 LDGSTS.E.128 [R5+0x10], [R2.64+0x10]
 @P0 LDGSTS.E.128 [R5], [R2.64]
.L_4:
 BSYNC B0
.L_3:
 LDGDEPBAR
 DEPBAR.LE SB0, 0x0
 EXIT
.L_5:
 BRA `(.L_5)
```

This SASS was generated with CUDA 11.1; I can also reproduce the issue with CUDA 11.2 (which godbolt does not support yet).
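Until this is fixed, one possible workaround (a sketch of my own, not part of the report; the helper name `memcpy_async_bypass` is hypothetical) is to drop down to the `__pipeline_memcpy_async` intrinsic for the 16-byte copies, since that path reliably emits the BYPASS variant:

```cuda
#include <cuda_pipeline.h>

// Hypothetical helper (assumption, not an existing API): falls back to the
// raw intrinsic so the compiler can emit LDGSTS.E.BYPASS.128 for 16-byte,
// 16-byte-aligned copies.
template <typename T>
__device__ void memcpy_async_bypass(T* dst_smem, const T* src_gmem) {
    static_assert(sizeof(T) == 16 && alignof(T) == 16,
                  "the 128-bit BYPASS variant needs 16-byte size and alignment");
    __pipeline_memcpy_async(dst_smem, src_gmem, sizeof(T));
}
```

Committing and waiting would still go through `__pipeline_commit()` / `__pipeline_wait_prior()`, exactly as in `test_pipe_intr` above.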
