System Info
Linux
Reproduction
I am trying to implement BitsAndBytes support in vLLM (https://github.com/vllm-project/vllm). My eager-mode implementation works correctly and has been merged.
However, I found that the weights returned by dequantize_4bit() under CUDA graph mode differ from those returned in eager mode, which makes the model produce nonsense output.
Does anybody have insights into this issue?
I tried to reduce it to a simple script, but that turned out to be hard because capturing the CUDA graph in a standalone script is non-trivial. The issue reproduces consistently, though, and I would be more than happy to work with community members and share the data I have collected.
Expected behavior
CUDA graph mode is expected to produce the same dequantized tensors as eager mode.
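For reference, a minimal sketch of the kind of eager-vs-graph comparison described above. It assumes the `bitsandbytes.functional` quantize_4bit/dequantize_4bit API and a recent PyTorch with `torch.cuda.graph`; as noted, a standalone script like this may not actually reproduce the problem (capture may even fail, depending on the bitsandbytes version), so treat it only as an outline of the check.

```python
import torch
import bitsandbytes.functional as bnb_F

# Quantize a random weight to 4-bit (NF4) with bitsandbytes.
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
qweight, quant_state = bnb_F.quantize_4bit(w, quant_type="nf4")

# Eager-mode reference dequantization.
eager_out = bnb_F.dequantize_4bit(qweight, quant_state, quant_type="nf4")

# Warm up on a side stream before capture (standard CUDA graph practice).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        bnb_F.dequantize_4bit(qweight, quant_state, quant_type="nf4")
torch.cuda.current_stream().wait_stream(s)

# Capture the dequantization into a CUDA graph, then replay it.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    graph_out = bnb_F.dequantize_4bit(qweight, quant_state, quant_type="nf4")
graph.replay()
torch.cuda.synchronize()

# The two tensors are expected to match; the report above is that they do not.
print(torch.allclose(eager_out, graph_out))
```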
Thank you for bringing this to our attention @chenqianfzh! I'm not personally aware of a known issue here and do believe it's worth investigating further. If you could provide some more details on the repro steps, that would be appreciated!
I also ran into the same problem. After a quick investigation, the likely cause seems to be that the kDequantizeBlockwise kernel is launched without the current CUDA stream being passed in (this pattern is common in BNB). If you want to investigate further, you can refer to a CUDA graph test for verification.
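To illustrate the suspected failure mode, here is a toy sketch (not bitsandbytes code) of what happens when a kernel is enqueued on a stream other than the one being captured: the work runs once at capture time but is not recorded into the graph, so on replay the output buffer is simply left stale. The `other` stream is purely illustrative, and `capture_error_mode` is assumed to be available in recent PyTorch.

```python
import torch

out_captured = torch.zeros(4, device="cuda")
out_escaped = torch.zeros(4, device="cuda")

# Stands in for a stream a library hard-codes internally (e.g. the default
# stream) instead of the stream that is current during capture.
other = torch.cuda.Stream()

graph = torch.cuda.CUDAGraph()
# Relaxed capture mode is used defensively so the escaped work does not
# invalidate the capture outright.
with torch.cuda.graph(graph, capture_error_mode="relaxed"):
    out_captured.fill_(1.0)          # on the capture stream: recorded
    with torch.cuda.stream(other):
        out_escaped.fill_(2.0)       # on `other` at capture time: NOT recorded

torch.cuda.synchronize()
out_captured.zero_()
out_escaped.zero_()

graph.replay()
torch.cuda.synchronize()
print(out_captured)  # tensor([1., 1., 1., 1.]) -- replayed by the graph
print(out_escaped)   # tensor([0., 0., 0., 0.]) -- the escaped kernel never replays
```

If this is indeed what is happening, the fix direction would be to pass the stream returned by torch.cuda.current_stream() into the kernel launch instead of relying on the default stream.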