CUDA: an illegal memory access was encountered with Mistral FP8 Marlin kernels on NVIDIA driver 535.216.01 (AWS Sagemaker Real-time Inference) #2915
System Info
Tested with the text-generation-inference 2.4.0 and 3.0.0 Docker containers, running the CLI from within the container, on Sagemaker Real-time Inference (NVIDIA driver 535.216.01).
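As a quick sanity check, the driver version and GPU compute capability visible to the FP8 Marlin path can be printed from inside the container. This is a minimal sketch assuming pynvml (nvidia-ml-py) and torch are importable there; neither tool is specific to this report.

import pynvml
import torch

# Report the host NVIDIA driver version through NVML.
pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()
print("driver:", driver.decode() if isinstance(driver, bytes) else driver)

# FP8 Marlin targets Ampere-or-newer GPUs (compute capability 8.0+),
# which covers both the A10G (8.6) and the L40S (8.9).
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("torch CUDA runtime:", torch.version.cuda)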
Reproduction
Launching a Mistral-based model with FP8 Marlin kernels raises a CUDA illegal memory access error on server startup when using NVIDIA driver 535.216.01. The error occurs during model warmup. We have reproduced it with multiple text-generation-inference versions (3.0.0 and 2.4.0) and multiple GPU models (A10G and L40S).
text-generation-inference 3.0.0
text-generation-launcher --model-id=prometheus-eval/prometheus-7b-v2.0 --quantize=fp8
text-generation-inference 2.4.0
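To confirm that the launcher never reaches a healthy state (it aborts during warmup with the illegal memory access), a small polling sketch against TGI's /health endpoint can be run alongside the launcher. The port is an assumption: 80 is the default in the official container, so adjust it if --port was set.

import time
import requests

BASE_URL = "http://localhost:80"  # adjust if the launcher was started with --port

for _ in range(60):
    try:
        # /health returns 200 once the model is loaded and warmup has finished.
        status = requests.get(f"{BASE_URL}/health", timeout=2).status_code
        print("health:", status)
        break
    except requests.exceptions.RequestException as exc:
        # With --quantize=fp8 on driver 535.216.01 the launcher crashes during
        # warmup, so the connection keeps failing.
        print("not ready:", exc.__class__.__name__)
        time.sleep(5)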
Expected behavior
We would like to leverage FP8 Marlin quantization for Mistral-based models on Sagemaker Real-time Inference, which is currently limited to NVIDIA driver 535.216.01.
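For context, this is roughly how we deploy the container on Sagemaker Real-time Inference. The sketch below uses the sagemaker Python SDK; the image version string, instance type, and environment variables (HF_MODEL_ID, QUANTIZE, SM_NUM_GPUS) are illustrative rather than confirmed values.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# TGI image published for Sagemaker; the version string may need adjusting.
image_uri = get_huggingface_llm_image_uri("huggingface", version="3.0.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "prometheus-eval/prometheus-7b-v2.0",
        "QUANTIZE": "fp8",  # same as --quantize=fp8 on the launcher
        "SM_NUM_GPUS": "1",
    },
)

# With fp8 quantization on driver 535.216.01 the endpoint fails its health
# checks, because the server crashes during warmup.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # A10G
    container_startup_health_check_timeout=600,
)

print(predictor.predict({"inputs": "Hello", "parameters": {"max_new_tokens": 16}}))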