This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

cuda::pipeline does not use L1 bypass #135

Closed
gonzalobg opened this issue Mar 10, 2021 · 0 comments · Fixed by #220
Labels
helps: quda Helps or needed by QUDA.

gonzalobg commented Mar 10, 2021

I encountered this regression when porting some of the RTM kernels to use libcu++:

Example (https://cuda.godbolt.org/z/bfMq9Y):

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>
#include <cuda/barrier>
#include <cuda_pipeline.h>
#include <cuda_awbarrier.h>

using T = float4;

__global__ void test_pipe_intr(T* input, T volatile* out) {
    __shared__ T smem[32];
    __pipeline_memcpy_async(smem + threadIdx.x, input + threadIdx.x, sizeof(T));
    __pipeline_commit();
    __pipeline_wait_prior(0);
}

__global__ void test_pipe_cxx(T* input, T volatile* out) {
    __shared__ T smem[32];
    cuda::pipeline<cuda::thread_scope_thread> p = cuda::make_pipeline();
    auto g = cooperative_groups::this_thread();
    p.producer_acquire();
    cuda::memcpy_async(g, smem + threadIdx.x, input + threadIdx.x,
                       cuda::aligned_size_t<16>(sizeof(T)), p);
    p.producer_commit();
    cuda::pipeline_consumer_wait_prior<0>(p);
}
```

The intrinsics implementation generates LDGSTS.E.BYPASS as expected:

```
test_pipe_intr(float4*, float4 volatile*):
 MOV R1, c[0x0][0x28]
 S2R R2, SR_TID.X
 MOV R3, 0x10
 ULDC.64 UR4, c[0x0][0x118]
 SHF.L.U32 R5, R2, 0x4, RZ
 IMAD.WIDE.U32 R2, R2, R3, c[0x0][0x160]
 LDGSTS.E.BYPASS.128 [R5], [R2.64]
 LDGDEPBAR
 DEPBAR.LE SB0, 0x0
 EXIT
.L_6:
 BRA `(.L_6)
.L_46:
```

but the libcu++ implementation generates an LDGSTS.E (aka LDGSTS.E.ACCESS) instead, along with an extra branch and a BSYNC B0:

```
test_pipe_cxx(float4*, float4 volatile*):
 IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28]
 S2R R0, SR_EQMASK
 BSSY B0, `(.L_3)
 S2R R3, SR_LTMASK
 LOP3.LUT R0, R3, R0, RZ, 0xc0, !PT
 POPC R0, R0
 IMAD.WIDE.U32 R2, R0, 0x10, RZ
 ISETP.GT.U32.AND P0, PT, R2, 0xf, PT
 ISETP.GT.U32.AND.EX P0, PT, R3, RZ, PT, P0
 @P0 BRA `(.L_4)
 S2R R5, SR_TID.X
 IADD3 R4, P0, -R2, -0x10, RZ
 ULDC.64 UR4, c[0x0][0x118]
 IADD3.X R2, ~R3, -0x1, RZ, P0, !PT
 ISETP.NE.U32.AND P0, PT, R4, RZ, PT
 ISETP.NE.AND.EX P0, PT, R2, RZ, PT, P0
 IADD3 R3, P1, R0.reuse, R5, RZ
 IMAD.IADD R0, R0, 0x1, R5
 IMAD.X R4, RZ, RZ, RZ, P1
 LEA R2, P1, R3, c[0x0][0x160], 0x4
 IMAD.SHL.U32 R5, R0, 0x10, RZ
 LEA.HI.X R3, R3, c[0x0][0x164], R4, 0x4, P1
 @!P0 LDGSTS.E.128 [R5], [R2.64]
 @!P0 LDGSTS.E.128 [R5+0x10], [R2.64+0x10]
 @P0 LDGSTS.E.128 [R5], [R2.64]
.L_4:
 BSYNC B0
.L_3:
 LDGDEPBAR
 DEPBAR.LE SB0, 0x0
 EXIT
.L_5:
 BRA `(.L_5)
```

This SASS was generated with CUDA 11.1; I can also reproduce the issue with CUDA 11.2 (which godbolt does not support yet).
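Until this is fixed, one possible workaround (a sketch of my own, not part of the report; the helper name `memcpy_async_bypass` is hypothetical) is to drop down to the `__pipeline_memcpy_async` intrinsic for the 16-byte copies, since that path reliably emits the BYPASS variant:

```cuda
#include <cuda_pipeline.h>

// Hypothetical helper (assumption, not an existing API): falls back to the
// raw intrinsic so the compiler can emit LDGSTS.E.BYPASS.128 for 16-byte,
// 16-byte-aligned copies.
template <typename T>
__device__ void memcpy_async_bypass(T* dst_smem, const T* src_gmem) {
    static_assert(sizeof(T) == 16 && alignof(T) == 16,
                  "the 128-bit BYPASS variant needs 16-byte size and alignment");
    __pipeline_memcpy_async(dst_smem, src_gmem, sizeof(T));
}
```

Committing and waiting would still go through `__pipeline_commit()` / `__pipeline_wait_prior()`, exactly as in `test_pipe_intr` above.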
