NVIDIA added MMA instructions for FP8 (e4m3 and e5m2) in PTX ISA 8.4.
However, apart from TransformerEngine and cuBLAS, few open-source kernels support FP8.
In this project I will gradually implement GEMM, GEMV, SpMM, SpMV, and other CUDA kernels based on the FP8 MMA instructions.
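
As a point of reference, the sketch below shows one way to wrap the FP8 MMA instruction from PTX ISA 8.4 in inline assembly. It assumes an sm_89/sm_90 target and CUDA 12.4+, and the wrapper name `mma_m16n8k32_e4m3_f32` is a placeholder of mine, not code from this repo. In the `m16n8k32` shape each thread holds four 32-bit A registers, two B registers, and four fp32 accumulators.

```cuda
#include <cstdint>

// Sketch: one m16n8k32 FP8 MMA with e4m3 inputs and fp32 accumulation.
// Per-thread fragments: 4 x .b32 of A, 2 x .b32 of B, 4 x .f32 of C/D.
__device__ __forceinline__ void mma_m16n8k32_e4m3_f32(
    float d[4], const uint32_t a[4], const uint32_t b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 "
        "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```

The e5m2 variants only differ in the type qualifiers of the instruction string.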
| | 4090 | H100 PCIe |
|---|---|---|
| Memory Bandwidth | 1008 GB/s | 2039 GB/s |
| FP8 Peak Perf. | 660.6 TFLOPS | 1513 TFLOPS |
| Shared Memory/SM | 128 KB | 228 KB |
| SMs | 128 | 114 |
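
The table entries can be cross-checked on your own machine; the small host-side sketch below (not part of the kernels) queries SM count, shared memory per SM, and an estimated peak memory bandwidth through the CUDA runtime attribute API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0, sms = 0, smem = 0, clk_khz = 0, bus_bits = 0;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
    cudaDeviceGetAttribute(&smem, cudaDevAttrMaxSharedMemoryPerMultiprocessor, dev);
    cudaDeviceGetAttribute(&clk_khz, cudaDevAttrMemoryClockRate, dev);       // kHz
    cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, dev); // bits
    // Rough peak bandwidth estimate: double data rate * bus width in bytes.
    double bw_gbs = 2.0 * clk_khz * 1e3 * (bus_bits / 8.0) / 1e9;
    printf("SMs: %d, SMEM/SM: %d KB, est. bandwidth: %.0f GB/s\n",
           sms, smem / 1024, bw_gbs);
    return 0;
}
```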
| Kernel | Block Size | Layout | Precision | Pipeline Stages |
|---|---|---|---|---|
| GEMM | 128x128x64 | row-col | e4m3*e4m3=fp32 | 2-stage |
| GEMM | 128x128x64 | row-col | e5m2*e5m2=fp32 | 2-stage |
| GEMM | 128x128x64 | row-col | e4m3*e5m2=fp32 | 2-stage |
| GEMM | 128x128x64 | row-col | e5m2*e4m3=fp32 | 2-stage |
| GEMM | 128x128x64 | row-col | e4m3*e4m3=fp32 | 4-stage |
| GEMM | 128x128x64 | row-col | e5m2*e5m2=fp32 | 4-stage |
| GEMM | 128x128x64 | row-col | e4m3*e5m2=fp32 | 4-stage |
| GEMM | 128x128x64 | row-col | e5m2*e4m3=fp32 | 4-stage |
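
To make the Block Size, Layout, and Pipeline Stages columns concrete, here is a heavily simplified structural sketch of a 2-stage (double-buffered) main loop over 128x128x64 tiles using `cp.async` prefetch. The kernel name, the 256-thread CTA, and the empty `compute_tile` placeholder are my assumptions for illustration, not the project's actual kernels; the sketch also assumes K is a multiple of 64 and 16-byte-aligned pointers.

```cuda
#include <cuda_fp8.h>
#include <cuda_pipeline.h>

constexpr int BM = 128, BN = 128, BK = 64;   // block tile from the table above
constexpr int THREADS = 256;                 // 8 warps per CTA (assumption)

// Placeholder for the FP8 MMA inner loop (e.g. repeated m16n8k32 e4m3 MMAs).
__device__ void compute_tile(const __nv_fp8_e4m3* sa, const __nv_fp8_e4m3* sb, float* acc) {}

// A: M x K row-major, B: N x K with K contiguous ("row-col"), C: M x N row-major.
__global__ void __launch_bounds__(THREADS)
fp8_gemm_2stage_sketch(const __nv_fp8_e4m3* A, const __nv_fp8_e4m3* B,
                       float* C, int M, int N, int K) {
    __shared__ __align__(16) __nv_fp8_e4m3 sa[2][BM * BK];  // double-buffered A tile (8 KB each)
    __shared__ __align__(16) __nv_fp8_e4m3 sb[2][BN * BK];  // double-buffered B tile (8 KB each)
    float acc[64] = {0.f};  // 128*128 outputs / 256 threads = 64 accumulators per thread

    const __nv_fp8_e4m3* a_blk = A + (size_t)blockIdx.y * BM * K;
    const __nv_fp8_e4m3* b_blk = B + (size_t)blockIdx.x * BN * K;

    // Stage one K-slice into shared-memory buffer `buf` with 16-byte cp.async copies.
    // BM == BN, so the same chunk index covers both the A and B tiles.
    auto load_tile = [&](int buf, int kt) {
        for (int ch = threadIdx.x; ch < BM * BK / 16; ch += THREADS) {
            int row = ch / (BK / 16), col = (ch % (BK / 16)) * 16;
            __pipeline_memcpy_async(&sa[buf][row * BK + col],
                                    &a_blk[(size_t)row * K + kt * BK + col], 16);
            __pipeline_memcpy_async(&sb[buf][row * BK + col],
                                    &b_blk[(size_t)row * K + kt * BK + col], 16);
        }
        __pipeline_commit();
    };

    int ktiles = K / BK;      // assume K % BK == 0 for this sketch
    load_tile(0, 0);          // prologue: prefetch the first K-slice into buffer 0

    for (int kt = 0; kt < ktiles; ++kt) {
        int cur = kt & 1;
        if (kt + 1 < ktiles) {
            load_tile(cur ^ 1, kt + 1);  // prefetch the next slice into the other buffer
            __pipeline_wait_prior(1);    // wait until buffer `cur` is resident
        } else {
            __pipeline_wait_prior(0);    // last slice: wait for all copies
        }
        __syncthreads();
        compute_tile(sa[cur], sb[cur], acc);  // consume `cur` while the next copy is in flight
        __syncthreads();
    }
    // Epilogue (omitted): scatter `acc` to C.
    (void)C; (void)M; (void)N;
}
```

The 4-stage variants in the table extend the same idea to a ring of four shared-memory buffers, so more asynchronous copies can be in flight while the MMA pipeline stays busy.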