System Info
Linux
Reproduction
I am trying to implement BitsAndBytes support in vLLM (https://github.com/vllm-project/vllm). My eager-mode implementation works correctly and has been merged.
However, I found that the weights returned by dequantize_4bit() under CUDA graph mode differ from those returned in eager mode, which makes the model produce nonsense output.
Does anybody have insights into this issue?
I tried to reduce it to a simple script, but that turned out to be hard because capturing the CUDA graph in a standalone script is non-trivial. The issue reproduces consistently, though, and I would be more than happy to work with community members and share the data I have collected.
Expected behavior
CUDA graph mode is expected to produce the same dequantized tensors as eager mode.
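For reference, a minimal sketch of the kind of eager-vs-graph comparison described above. It assumes the `bitsandbytes.functional` quantize_4bit/dequantize_4bit API and a recent PyTorch with `torch.cuda.graph`; as noted, a standalone script like this may not actually reproduce the problem (capture may even fail, depending on the bitsandbytes version), so treat it only as an outline of the check.

```python
import torch
import bitsandbytes.functional as bnb_F

# Quantize a random weight to 4-bit (NF4) with bitsandbytes.
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
qweight, quant_state = bnb_F.quantize_4bit(w, quant_type="nf4")

# Eager-mode reference dequantization.
eager_out = bnb_F.dequantize_4bit(qweight, quant_state, quant_type="nf4")

# Warm up on a side stream before capture (standard CUDA graph practice).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        bnb_F.dequantize_4bit(qweight, quant_state, quant_type="nf4")
torch.cuda.current_stream().wait_stream(s)

# Capture the dequantization into a CUDA graph, then replay it.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    graph_out = bnb_F.dequantize_4bit(qweight, quant_state, quant_type="nf4")
graph.replay()
torch.cuda.synchronize()

# The two tensors are expected to match; the report above is that they do not.
print(torch.allclose(eager_out, graph_out))
```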
Thank you for bringing this to our attention @chenqianfzh! I'm not personally aware of a known issue here and do believe it's worth investigating further. If you could provide some more details on the repro steps, that would be appreciated!
I also ran into the same problem. After a quick investigation, the likely cause seems to be that the kDequantizeBlockwise kernel is launched without the current CUDA stream being passed in (this pattern is common in BNB). If you want to investigate further, you can refer to a CUDA graph test for verification.
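To illustrate the suspected failure mode, here is a toy sketch (not bitsandbytes code) of what happens when a kernel is enqueued on a stream other than the one being captured: the work runs once at capture time but is not recorded into the graph, so on replay the output buffer is simply left stale. The `other` stream is purely illustrative, and `capture_error_mode` is assumed to be available in recent PyTorch.

```python
import torch

out_captured = torch.zeros(4, device="cuda")
out_escaped = torch.zeros(4, device="cuda")

# Stands in for a stream a library hard-codes internally (e.g. the default
# stream) instead of the stream that is current during capture.
other = torch.cuda.Stream()

graph = torch.cuda.CUDAGraph()
# Relaxed capture mode is used defensively so the escaped work does not
# invalidate the capture outright.
with torch.cuda.graph(graph, capture_error_mode="relaxed"):
    out_captured.fill_(1.0)          # on the capture stream: recorded
    with torch.cuda.stream(other):
        out_escaped.fill_(2.0)       # on `other` at capture time: NOT recorded

torch.cuda.synchronize()
out_captured.zero_()
out_escaped.zero_()

graph.replay()
torch.cuda.synchronize()
print(out_captured)  # tensor([1., 1., 1., 1.]) -- replayed by the graph
print(out_escaped)   # tensor([0., 0., 0., 0.]) -- the escaped kernel never replays
```

If this is indeed what is happening, the fix direction would be to pass the stream returned by torch.cuda.current_stream() into the kernel launch instead of relying on the default stream.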