llama : move the sampling API from common into llama lib #5214
Is this meant as a short-term stopgap measure? If we are going to add a new sampling API to llama.cpp, it would be good to do this from the ground up with the possibility of GPU sampling in mind. The implementation is
This change is more relevant for CPU-based sampling. There are many use cases that require managing sampling state (e.g. previously sampled tokens, grammar state, etc.), so it makes sense to add support for that directly in the core library. I haven't thought deeply about GPU sampling support. Wouldn't it make more sense to have a limited number of GPU sampling options (such as greedy and top-k) as part of
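For context, a minimal sketch of the kind of per-sequence state being referred to here; the struct and its field names are illustrative only and not part of the llama.cpp API (in llama.cpp, `llama_token` comes from `llama.h`):

```cpp
#include <cstdint>
#include <vector>

typedef int32_t llama_token; // stand-in for the type defined in llama.h

// Illustrative only: the kind of state a core sampling API would need to own
// per sequence if it is moved out of common and into the library.
struct sampling_state_sketch {
    std::vector<llama_token> prev;    // previously sampled tokens (repetition penalties)
    void *                   grammar; // opaque grammar state, if a grammar is attached

    // sampling parameters that naturally travel with the state
    float   temp;
    int32_t top_k;
    float   top_p;
};
```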
It's clear that some samplers cannot have GPU implementations, but that doesn't mean we need two different APIs for GPU and CPU sampling. We could define a sampler as an abstract object that may or may not contain state, and that may provide a ggml implementation, a CPU implementation, or both. Then we would assemble a pipeline of sampler objects that runs at the end of the model evaluation. If all the samplers in the pipeline have ggml implementations, the whole pipeline can run on the GPU; otherwise at least some parts would still run on the CPU. I think it is mostly a matter of designing a flexible enough interface.
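Purely as an illustration of that idea (none of these types or functions exist in llama.cpp; `ggml_context` and `ggml_tensor` are only forward-declared here):

```cpp
#include <cstdint>
#include <vector>

struct ggml_context; // from ggml.h
struct ggml_tensor;  // from ggml.h

// Hypothetical sampler interface: each sampler may carry private state and may
// provide a ggml (device) implementation, a CPU implementation, or both.
struct sampler_iface {
    // optional: append this sampler's ops to a ggml graph operating on the logits tensor
    ggml_tensor * (*apply_ggml)(void * state, ggml_context * gctx, ggml_tensor * logits);

    // optional: CPU fallback operating on a plain array of logits
    void (*apply_cpu)(void * state, float * logits, int32_t n_vocab);

    void (*free_state)(void * state);
};

struct sampler {
    sampler_iface iface;
    void *        state;
};

// The pipeline runs entirely on the GPU only if every stage has a ggml implementation;
// otherwise the remaining stages fall back to the CPU path.
static bool pipeline_runs_on_gpu(const std::vector<sampler> & pipeline) {
    for (const auto & s : pipeline) {
        if (s.iface.apply_ggml == nullptr) {
            return false;
        }
    }
    return true;
}
```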
Ok, I will give it further thought. One way that comes to mind is something like this:

```c
int32_t llama_decode_with_sampling(
        struct llama_context * ctx,
        struct llama_sampling_context * ctx_s,
        struct llama_batch batch,
        llama_token * result);
```
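A rough sketch of how such a call could sit in a generation loop. `llama_decode_with_sampling` is only the proposal above; `ctx`, `ctx_s`, `first_token`, `n_prompt`, and `n_predict` are assumed to be set up as in the usual llama.cpp examples, and `llama_batch_get_one` is shown with the signature it had at the time of this discussion:

```cpp
llama_token tok    = first_token; // last token of the evaluated prompt
llama_pos   n_past = n_prompt;    // position after the prompt

for (int i = 0; i < n_predict; ++i) {
    // single-token batch for the last sampled token
    llama_batch batch = llama_batch_get_one(&tok, 1, n_past, 0);

    llama_token result;
    if (llama_decode_with_sampling(ctx, ctx_s, batch, &result) != 0) {
        break; // decode/sampling failed
    }

    tok     = result; // feed the sampled token back in on the next iteration
    n_past += 1;
}
```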
It's also important to allow multiple evaluations to be queued together; that's one of the biggest advantages of GPU sampling. That can be done by making
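As an illustration of the queuing point only (none of these functions exist, and this is not a proposed API): if sampling is part of the same graph as the evaluation, the sampled token can stay on the device and feed the next evaluation, so several steps can be queued before a single synchronization with the host:

```cpp
#include <vector>

// Hypothetical calls, for illustration only.
const int n_queued = 8;

for (int i = 0; i < n_queued; ++i) {
    // queue eval + sampling; the sampled token stays on the device and is
    // consumed by the next queued step without a round trip to the host
    queue_decode_with_sampling(ctx, ctx_s);
}

// one host<->device synchronization for the whole chunk of tokens
std::vector<llama_token> results(n_queued);
fetch_queued_samples(ctx_s, results.data(), n_queued);
```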
Yes, this might get tricky when considering multiple sequences in the batch, but it seems doable. Let me know if you have other concerns about merging. If we do that, then the
This issue is stale because it has been open for 30 days with no activity.
Not stale.
This issue was closed because it has been inactive for 14 days since being marked as stale.
So that I better understand this approach: would this involve updating
Resolved via #9294
There is functionality around `llama_sampling_context` that is currently part of `common`. We should move it into `llama`. Pretty much the entire API from `common/sampling.h`, except `llama_sampling_params` and `llama_sampling_sample`, can be integrated into the library.

This would probably also require merging the grammar parser into the `llama` lib implementation.

`llama_sampling_params` and `llama_sampling_sample` will stay in `common`, since they are very example-specific and not general-purpose enough to be merged.
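For reference, this is roughly how the `common/sampling.h` helpers are used by the examples (signatures are approximate for the API as it existed at the time, and `sparams`, `ctx`, and `has_next_token` are assumed from the surrounding example code). The stateful init/accept/free part is what would move into the `llama` library, while `llama_sampling_params` and the `llama_sampling_sample` wrapper would remain in `common`:

```cpp
// approximate usage of the common sampling helpers at the time of this issue
struct llama_sampling_context * ctx_sampling = llama_sampling_init(sparams);

while (has_next_token) {
    // example-level wrapper that applies the full sampler chain: stays in common
    const llama_token id = llama_sampling_sample(ctx_sampling, ctx, /*ctx_cfg=*/nullptr);

    // state update (previous tokens, grammar): candidate for the core library
    llama_sampling_accept(ctx_sampling, ctx, id, /*apply_grammar=*/true);

    // ... decode the next batch with `id`, update has_next_token ...
}

llama_sampling_free(ctx_sampling);
```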