# Tensor-Core-FP8-Kernel

NVIDIA introduced Tensor Core MMA instructions for FP8 (e4m3 and e5m2) in PTX ISA 8.4.

However, apart from TransformerEngine and cuBLAS, few openly available kernels support FP8.

In this project, I will gradually implement GEMM, GEMV, SpMM, SpMV, and other CUDA kernels based on the FP8 MMA instructions.
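
For reference, below is a minimal sketch of how one of these FP8 MMA instructions can be issued from CUDA C++ via inline PTX. It is an illustrative example, not code from this repository: the `mma_m16n8k32_e4m3` wrapper name is made up, and the fragment registers are assumed to be already loaded from shared memory (e.g. via `ldmatrix`). The shape and register counts follow the PTX ISA description of `mma.sync.aligned.m16n8k32` with 8-bit operand types.

```cuda
#include <cstdint>

// Illustrative wrapper (not from this repo): one warp-wide m16n8k32 MMA,
// D(f32, 16x8) = A(e4m3, 16x32) * B(e4m3, 32x8) + C(f32, 16x8).
// Each uint32_t in a[]/b[] packs four FP8 values; the register-to-element
// mapping follows the PTX ISA fragment layout for m16n8k32 with 8-bit types.
// Requires an FP8-capable GPU (sm_89 / sm_90) and a toolkit exposing PTX ISA 8.4.
__device__ __forceinline__
void mma_m16n8k32_e4m3(float d[4], const uint32_t a[4],
                       const uint32_t b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```

Swapping either `.e4m3` in the instruction string for `.e5m2` gives the other precision combinations listed in the table below.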

|                    | 4090          | H100 PCIe    |
| ------------------ | ------------- | ------------ |
| Memory bandwidth   | 1008 GB/s     | 2039 GB/s    |
| FP8 peak perf.     | 660.6 TFLOPS  | 1513 TFLOPS  |
| Shared memory / SM | 128 KB        | 228 KB       |
| SMs                | 128           | 114          |

| Kernel | Block size | Layout  | Precision          | Pipeline |
| ------ | ---------- | ------- | ------------------ | -------- |
| GEMM   | 128x128x64 | row-col | `e4m3*e4m3=fp32`   | 2-stage  |
| GEMM   | 128x128x64 | row-col | `e5m2*e5m2=fp32`   | 2-stage  |
| GEMM   | 128x128x64 | row-col | `e4m3*e5m2=fp32`   | 2-stage  |
| GEMM   | 128x128x64 | row-col | `e5m2*e4m3=fp32`   | 2-stage  |
| GEMM   | 128x128x64 | row-col | `e4m3*e4m3=fp32`   | 4-stage  |
| GEMM   | 128x128x64 | row-col | `e5m2*e5m2=fp32`   | 4-stage  |
| GEMM   | 128x128x64 | row-col | `e4m3*e5m2=fp32`   | 4-stage  |
| GEMM   | 128x128x64 | row-col | `e5m2*e4m3=fp32`   | 4-stage  |
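
The 2-stage and 4-stage variants refer to the depth of the software pipeline that overlaps global-to-shared-memory copies (`cp.async` on sm_80+) with Tensor Core math. The following is a simplified, self-contained sketch of that structure using the CUDA pipeline primitives; the tile sizes, the single A-side buffer, and the placeholder compute loop are assumptions for illustration, not the repository's actual kernel.

```cuda
#include <cstdint>
#include <cuda_pipeline_primitives.h>

constexpr int STAGES     = 2;          // pipeline depth: 2-stage here, 4-stage is analogous
constexpr int BK         = 64;         // K-tile depth in FP8 elements (1 byte each)
constexpr int TILE_BYTES = 128 * BK;   // one 128 x BK FP8 operand tile

// Sketch only: stages a 128 x BK operand tile through shared memory with async
// copies while "computing" on the previously fetched tile. A is assumed to be
// 16-byte aligned (cudaMalloc guarantees this) and K a multiple of BK.
__global__ void pipelined_loop_sketch(const uint8_t* __restrict__ A,
                                      float* __restrict__ out, int num_k_tiles) {
    __shared__ __align__(16) uint8_t a_smem[STAGES][TILE_BYTES];
    float acc = 0.0f;

    // Each thread issues 16-byte async copies covering its share of the tile.
    auto copy_tile_async = [&](int stage, int k_tile) {
        const uint8_t* src = A + (size_t)k_tile * TILE_BYTES;
        for (int off = threadIdx.x * 16; off < TILE_BYTES; off += blockDim.x * 16)
            __pipeline_memcpy_async(&a_smem[stage][off], src + off, 16);
    };

    // Prologue: prefetch the first STAGES-1 tiles.
    for (int s = 0; s < STAGES - 1; ++s) {
        if (s < num_k_tiles) copy_tile_async(s, s);
        __pipeline_commit();
    }

    for (int k = 0; k < num_k_tiles; ++k) {
        // Kick off the copy for the tile STAGES-1 iterations ahead.
        int next = k + STAGES - 1;
        if (next < num_k_tiles) copy_tile_async(next % STAGES, next);
        __pipeline_commit();                // commit even if empty, to keep counts aligned

        __pipeline_wait_prior(STAGES - 1);  // tile k's copies are now complete
        __syncthreads();

        // Placeholder compute: the real kernels issue FP8 MMA instructions here.
        for (int i = threadIdx.x; i < TILE_BYTES; i += blockDim.x)
            acc += (float)a_smem[k % STAGES][i];

        __syncthreads();                    // buffer may be overwritten next iteration
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;  // dummy epilogue
}
```

A 4-stage version simply keeps more copies in flight (`STAGES = 4`) to hide more global-memory latency, at the cost of additional shared memory, which is why the 128 KB vs 228 KB per-SM budgets in the table above matter.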
