"Deepseek2 does not support K-shift" Denial-of-Service vulnerability #10380
You can also disable K-shift by disabling context shifting via this argument:
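For example, with a recent llama-server build the flag is presumably `--no-context-shift` (an assumption; check `llama-server --help` on your build to confirm the exact name):

```sh
# Sketch, assuming the --no-context-shift flag and a placeholder model file:
# start llama-server with context shifting disabled so that generations which
# exceed the context stop instead of attempting an unsupported K-shift.
./llama-server -m model.gguf --no-context-shift
```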
@ggerganov Hi! I also ran into this problem in Ollama: when I send a long prompt to DeepSeek-V2, it hits the K-shift error. How can I set that parameter in Ollama? In any case, I think the model server should not crash.
I had this problem too. Have you solved it yet?
Long prompts/responses crash llama-server because "Deepseek2 does not support K-shift". For long prompts/responses, llama-server should return an error message or truncate the response, but instead GGML_ABORT is called, which crashes the server. I believe that this is a Denial-of-Service vulnerability. A client should never be able to trigger GGML_ABORT.
The relevant line in the code is here:
https://github.com/ggerganov/llama.cpp/blob/9b75f03cd2ec9cc482084049d87a0f08f9f01517/src/llama.cpp#L18032
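To make the suggestion concrete, here is a minimal, self-contained sketch of the proposed behavior. It is not the actual llama.cpp code, and every identifier in it is hypothetical; it only illustrates reporting the unsupported K-shift to the caller so the server can return an error or stop generation instead of aborting the process:

```cpp
// Hypothetical sketch: propagate an error instead of aborting the process.
// None of these identifiers are taken from llama.cpp; they only illustrate
// the "return an error, don't crash" pattern suggested in this issue.
#include <cstdio>

enum kshift_result {
    KSHIFT_OK = 0,
    KSHIFT_ERR_UNSUPPORTED = 1,
};

struct model_caps {
    bool supports_k_shift; // false for DeepSeek-V2-style models in this sketch
};

// Instead of calling GGML_ABORT when the model cannot shift its K cache,
// report the condition to the caller so the server can return an error
// (or stop generation) without taking the whole process down.
static kshift_result apply_k_shift(const model_caps & caps) {
    if (!caps.supports_k_shift) {
        std::fprintf(stderr, "error: this model does not support K-shift; "
                             "stopping generation instead of aborting\n");
        return KSHIFT_ERR_UNSUPPORTED;
    }
    // ... perform the actual K-cache shift here ...
    return KSHIFT_OK;
}

int main() {
    const model_caps deepseek_like = { /*supports_k_shift =*/ false };
    if (apply_k_shift(deepseek_like) != KSHIFT_OK) {
        // A server loop would translate this into an error response or a
        // truncated completion, rather than calling GGML_ABORT.
        return 0;
    }
    return 0;
}
```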
I reported this security vulnerability almost three months ago here (link only visible to maintainers), but have received no response, and it is public knowledge now anyway, so I also opened this issue to increase visibility.
Discussed in #9092
Originally posted by 99991 August 19, 2024
It is my understanding that llama.cpp shifts the key-value cache when generating more tokens than fit into the context window, which is not supported for DeepSeek Coder V2. To reproduce, start a server with this model
and then request a prompt completion:
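A minimal reproduction might look like the following sketch; the model file name, context size, port, and request body are assumptions, not the original commands:

```sh
# Start the server with a DeepSeek Coder V2 GGUF and a small context window
# (file name, context size, and port are placeholders).
./llama-server -m DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf -c 512 --port 8080

# Request more tokens than fit into the context window, which forces a
# key-value cache shift that this architecture does not support.
curl http://localhost:8080/completion -d '{
  "prompt": "Write a very long story.",
  "n_predict": 4096
}'
```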
This should trigger the "Deepseek2 does not support K-shift" error with llama.cpp release b3600.
The corresponding code in llama.cpp is here:
https://github.com/ggerganov/llama.cpp/blob/cfac111e2b3953cdb6b0126e67a2487687646971/src/llama.cpp#L15643C31-L15648C1
I believe a saner approach would be to simply stop generating tokens instead of crashing the server. Is there an option that can be set to prevent clients from crashing the server?