Server: add test for num slots, fails on master #6950
Conversation
I should mention that you run into this exact same problem if you have a fixed number of server slots but a varying number of parallel requests, which I would argue is even more problematic.
The sequences diverge for different batch sizes only if the temperature is high enough. I've added tests with temperature 0 and 1 and commented out those that currently fail on master. To assert that this is not a sampler issue, I've expanded the tests around seeds: they now assert that the results are consistent with the same seed but different with different seeds. I've changed the data type of
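For context, a minimal sketch (not the actual test code in this PR) of the kind of seed assertion described above, assuming a llama.cpp server is already running on localhost:8080 and that the /completion endpoint accepts `prompt`, `seed`, `n_predict`, and `temperature` as in the current server API:

```python
# Hypothetical sketch of the seed-consistency checks described above.
# Assumes a llama.cpp server is running on localhost:8080.
import requests

URL = "http://localhost:8080/completion"
PROMPT = "Write a short story about llamas."

def complete(seed: int, n_predict: int = 64, temperature: float = 1.0) -> str:
    # Request a completion with a fixed seed and return the generated text.
    response = requests.post(URL, json={
        "prompt": PROMPT,
        "seed": seed,
        "n_predict": n_predict,
        "temperature": temperature,
    })
    response.raise_for_status()
    return response.json()["content"]

if __name__ == "__main__":
    a = complete(seed=42)
    b = complete(seed=42)
    c = complete(seed=43)
    assert a == b, "same seed should give the same completion"
    assert a != c, "different seeds should (almost surely) give different completions"
    print("seed consistency checks passed")
```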
This is related to using a unified KV cache. See ggerganov/whisper.cpp#1941 (comment) (I ran into this before in #6122 (comment)).
And 128 max tokens to predict
And continuous batching
Minor: continuous batching is enabled by default (and cannot be disabled BTW :) )
Thanks
While working on #6828, I've been writing more tests to ensure that the results remain the same. However, while doing so I noticed that for a given seed and a varying number of slots, the results produced by the server are not deterministic. What I think is happening is that llama.cpp does not produce bit-for-bit identical results as the batch size changes. Therefore, after some number of tokens, two otherwise identical sequences randomly sample different tokens, at which point they completely diverge.
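To make the observed behaviour concrete, here is a hypothetical reproduction sketch (not part of this PR): start two server instances that differ only in the number of slots, e.g. `--parallel 1` on port 8080 and `--parallel 2` on port 8081, and compare the completions for an identical prompt, seed, and sampling parameters. The endpoint and field names are assumed to match the current /completion API.

```python
# Hypothetical reproduction sketch: compare completions from two llama.cpp
# server instances that differ only in --parallel (number of slots).
# Assumes the two servers are already running on ports 8080 and 8081.
import requests

PAYLOAD = {
    "prompt": "Write a short story about llamas.",
    "seed": 42,
    "n_predict": 128,
    "temperature": 1.0,
}

def complete(port: int) -> str:
    r = requests.post(f"http://localhost:{port}/completion", json=PAYLOAD)
    r.raise_for_status()
    return r.json()["content"]

one_slot = complete(8080)   # server started with --parallel 1
two_slots = complete(8081)  # server started with --parallel 2

# On master the two completions can diverge after some number of tokens,
# even though the prompt, seed, and sampling parameters are identical.
print("identical" if one_slot == two_slots else "diverged")
```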
I don't know if this can be fixed at all, since the only way to get bit-for-bit identical results with floating point numbers is to perform the exact same operations in the exact same order, which would likely not yield good performance. Unless the CPU backend (which I used for testing) is supposed to produce bit-for-bit identical results, in which case this would be indicative of a bug. In any case, feedback would be appreciated.
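As a small illustration of why operation order matters: floating point addition is not associative, so any change in how a reduction is grouped (which is exactly what a different batch size can cause) can change the last bits of the logits, and once a single sampled token flips, the sequences diverge.

```python
# Floating point addition is not associative: regrouping the same three
# numbers changes the result in the last bits.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False
```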
Sample outputs