
[Issue]: GPU Core dump when running CK-W8A8GEMM Kernel on GPU ID 1,2,3,4,5,6,7 #89

Closed
tjtanaa opened this issue Feb 5, 2025 · 3 comments
@tjtanaa

tjtanaa commented Feb 5, 2025

Problem Description

When running the kernel on inputs placed on a non-zero GPU ID (e.g. 1, 2, 3, 4, 5, 6, or 7), it throws the following error:

Memory access fault by GPU node-2 (Agent handle: 0x9b15d70) on address 0x7ee42d200000. Reason: Unknown.
tensor(False, device='cuda:1')
GPU core dump created: gpucore.10171
Aborted
root@tw024:/app# python ex.py 
Memory access fault by GPU node-2 (Agent handle: 0xa5f71a0) on address 0x7f532b800000. Reason: Unknown.
GPU core dump created: gpucore.10255
Aborted

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD EPYC 9654 96-Core Processor

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.3.1

ROCm Component

composable_kernel

Steps to Reproduce

  1. Install aiter from the main branch.
  2. Run the following script:
from aiter.ops.gemm_op_a8w8 import gemm_a8w8_CK

import torch

# (M, N, K) problem sizes to exercise
SIZE_LIST = [
    (3840, 16384, 16384),
    (56, 8192, 7392),
]


def main():
    for M, N, K in SIZE_LIST:
        # Place every input on a non-zero GPU ID to trigger the fault
        A = torch.rand(size=(M, K), device="cuda:1").to(torch.int8)
        B = torch.rand(size=(K, N), device="cuda:1").to(torch.int8)
        scale_a = torch.ones((M, 1), device="cuda:1").to(torch.int32)
        scale_b = torch.ones((N, 1), device="cuda:1").to(torch.int32)
        result = gemm_a8w8_CK(A, B.t(), scale_a, scale_b, dtype=torch.bfloat16)


if __name__ == "__main__":
    main()

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@mawong-amd
Contributor

A workaround for now is to call torch.cuda.set_device("cuda:1") before calling gemm_a8w8_CK.

It seems a proper fix would be to add device guards, as done here.
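
For illustration, a minimal Python sketch of that workaround applied to the reproduction script above (torch.cuda.set_device and the torch.cuda.device context manager are standard PyTorch APIs; the wrapper name is hypothetical):

import torch
from aiter.ops.gemm_op_a8w8 import gemm_a8w8_CK

def gemm_a8w8_on_input_device(A, B, scale_a, scale_b):
    # Workaround sketch: make the inputs' GPU the current CUDA device before
    # calling into the extension, so the kernel launch targets the same GPU
    # that holds the tensors instead of the default device 0.
    torch.cuda.set_device(A.device)  # e.g. "cuda:1"
    return gemm_a8w8_CK(A, B.t(), scale_a, scale_b, dtype=torch.bfloat16)

# Alternatively, scope the change so the global current device is restored
# afterwards:
#     with torch.cuda.device(A.device):
#         result = gemm_a8w8_CK(A, B.t(), scale_a, scale_b, dtype=torch.bfloat16)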

@valarLip
Collaborator

valarLip commented Feb 7, 2025

> A workaround for now is to call torch.cuda.set_device("cuda:1") before calling gemm_a8w8_CK.
>
> It seems a proper fix would be to add device guards, as done here.

Yes, this is the way I planned to fix it... thank you for doing it.

@tjtanaa
Author

tjtanaa commented Feb 7, 2025

Thank you. Let us test these fixes on our end as well.

It works seamlessly now.

tjtanaa closed this as completed Feb 8, 2025